In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis.

For example, if one is using a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:

  • p is a parameter of the underlying system (Bernoulli distribution), and
  • α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.

One may take a single value for a given hyperparameter, or one can iterate and take a probability distribution on the hyperparameter itself, called a hyperprior.

Purpose

edit

One often uses a prior which comes from a parametric family of probability distributions – this is done partly for explicitness (so one can write down a distribution, and choose the form by varying the hyperparameter, rather than trying to produce an arbitrary function), and partly so that one can vary the hyperparameter, particularly in the method of conjugate priors, or for sensitivity analysis.

Conjugate priors

edit

When using a conjugate prior, the posterior distribution will be from the same family, but will have different hyperparameters, which reflect the added information from the data: in subjective terms, one's beliefs have been updated. For a general prior distribution, this is computationally very involved, and the posterior may have an unusual or hard to describe form, but with a conjugate prior, there is generally a simple formula relating the values of the hyperparameters of the posterior to those of the prior, and thus the computation of the posterior distribution is very easy.

Sensitivity analysis

edit

A key concern of users of Bayesian statistics, and criticism by critics, is the dependence of the posterior distribution on one's prior. Hyperparameters address this by allowing one to easily vary them and see how the posterior distribution (and various statistics of it, such as credible intervals) vary: one can see how sensitive one's conclusions are to one's prior assumptions, and the process is called sensitivity analysis.

Similarly, one may use a prior distribution with a range for a hyperparameter, thus defining a hyperprior, perhaps reflecting uncertainty in the correct prior to take, and reflect this in a range for final uncertainty.[1]

Hyperpriors

edit

Instead of using a single value for a given hyperparameter, one can instead consider a probability distribution of the hyperparameter itself; this is called a "hyperprior." In principle, one may iterate this, calling parameters of a hyperprior "hyperhyperparameters," and so forth.

See also

edit

References

edit

Further reading

edit
  • Bernardo, J. M.; Smith, A. F. M. (2000). Bayesian Theory. New York: Wiley. ISBN 0-471-49464-X.
  • Gelman, A.; Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press. pp. 251–278. ISBN 978-0-521-68689-1.
  • Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Academic Press. pp. 241–264. ISBN 978-0-12-381485-2.

📚 Artikel Terkait di Wikipedia

Hyperparameter optimization

learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a

Hyperparameter

Hyperparameter may refer to: Hyperparameter (machine learning) Hyperparameter (Bayesian statistics) This disambiguation page lists articles associated

Bayesian optimization

the 21st century, Bayesian optimization algorithms have found prominent use in machine learning problems for optimizing hyperparameter values. The term

Bayesian inference

predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while

Bayesian hierarchical modeling

common population, with distribution governed by a hyperparameter ϕ {\displaystyle \phi } . The Bayesian hierarchical model contains the following stages:

Conjugate prior

In Bayesian probability theory, if, given a likelihood function p ( x ∣ θ ) {\displaystyle p(x\mid \theta )} , the posterior distribution p ( θ ∣ x ) {\displaystyle

Optuna

the hyperparameter space increases, it may become computationally expensive. Hence, there are methods (e.g., grid search, random search, or bayesian optimization)

Gaussian process

coordinate of estimation x* and all other observed coordinates x for a given hyperparameter vector θ, ⁠ K ( θ , x , x ′ ) {\displaystyle K(\theta ,x,x')} ⁠ and