In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil[citation needed] and is based on the concept of information entropy.

Definition

Suppose we have samples of two discrete random variables, X and Y. From the samples we can construct the joint distribution, $P_{X,Y}(x, y)$, and from it the conditional distributions, $P_{X|Y}(x|y) = P_{X,Y}(x, y)/P_Y(y)$ and $P_{Y|X}(y|x) = P_{X,Y}(x, y)/P_X(x)$. Calculating the various entropies of these distributions then lets us determine the degree of association between the two variables.

The entropy of a single distribution is given as:[1]

$$H(X) = -\sum_{x} P_X(x) \log P_X(x)$$

while the conditional entropy is given as:[1]

$$H(X|Y) = -\sum_{x,y} P_{X,Y}(x, y) \log P_{X|Y}(x|y)$$

The uncertainty coefficient[2] or proficiency[3] is defined as:

$$U(X|Y) = \frac{H(X) - H(X|Y)}{H(X)} = \frac{I(X;Y)}{H(X)}$$

and tells us: given Y, what fraction of the bits of X can we predict? In this case we can think of X as containing the total information, and of Y as allowing one to predict part of such information.
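The definition can be estimated directly from paired samples. The following is a minimal sketch, assuming probabilities are estimated by simple relative frequencies; the function names `entropy`, `conditional_entropy` and `uncertainty_coefficient` and the toy data are illustrative and do not come from any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(X) in bits, estimated from a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X|Y) in bits, estimated from paired samples."""
    n = len(x)
    joint = Counter(zip(x, y))
    y_counts = Counter(y)
    # P(x, y) is estimated as c / n and P(x | y) as c / count(y).
    return -sum((c / n) * log2(c / y_counts[yv]) for (_, yv), c in joint.items())


def uncertainty_coefficient(x, y):
    """U(X|Y) = (H(X) - H(X|Y)) / H(X); undefined when H(X) is zero."""
    h_x = entropy(x)
    if h_x == 0:
        raise ValueError("H(X) is zero, so U(X|Y) is undefined")
    return (h_x - conditional_entropy(x, y)) / h_x


# Toy data (hypothetical): one Y determines X completely, the other says nothing.
x = ["a", "a", "b", "b"]
y_informative = [0, 0, 1, 1]
y_uninformative = [0, 1, 0, 1]

u_informative = uncertainty_coefficient(x, y_informative)
u_uninformative = uncertainty_coefficient(x, y_uninformative)
print(u_informative, u_uninformative)
```

In this toy case the first Y predicts every bit of X and the second predicts none, so the two values sit at the extremes of the range.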

The above expression makes clear that the uncertainty coefficient is the mutual information I(X;Y) normalised by H(X). In particular, the uncertainty coefficient lies in [0, 1], since 0 ≤ I(X;Y) ≤ H(X) (assuming H(X) > 0, so that the ratio is defined).
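The bounds follow from two standard properties of entropy for discrete variables, written out here as a short worked step. Conditioning cannot increase entropy, and conditional entropy cannot be negative:

$$0 \le H(X|Y) \le H(X) \quad\Longrightarrow\quad 0 \le I(X;Y) = H(X) - H(X|Y) \le H(X) \quad\Longrightarrow\quad 0 \le U(X|Y) \le 1$$

The lower bound is attained when X and Y are independent, so that H(X|Y) = H(X), and the upper bound when X is completely determined by Y, so that H(X|Y) = 0.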

Note that the value of U (but not H!) is independent of the base of the logarithm: changing the base multiplies every entropy by the same constant, which cancels in the ratio.
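The base independence can be checked numerically. This is a small sketch under assumed inputs: the joint probability table `joint` is made up for illustration, and the helper `entropy_and_u` is a hypothetical name.

```python
from math import e, log

# Hypothetical joint probability table P(x, y); rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]


def entropy_and_u(joint, base):
    """Return (H(X), U(X|Y)) computed with logarithms of the given base."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = -sum(p * log(p, base) for p in p_x if p > 0)
    h_x_given_y = -sum(
        p * log(p / p_y[j], base)
        for row in joint
        for j, p in enumerate(row)
        if p > 0
    )
    return h_x, (h_x - h_x_given_y) / h_x


h_bits, u_bits = entropy_and_u(joint, 2)  # entropies in bits
h_nats, u_nats = entropy_and_u(joint, e)  # entropies in nats
print(h_bits, h_nats)  # the entropies differ by a factor of ln 2
print(u_bits, u_nats)  # the coefficients agree up to rounding
```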

The uncertainty coefficient is useful for measuring the validity of a statistical classification algorithm and has the advantage over simpler measures such as accuracy, precision and recall in that it is not affected by the relative fractions of the different classes, i.e., P(x).[4] It also has the property that it does not penalize an algorithm for predicting the wrong classes, so long as it does so consistently (i.e., it simply rearranges the classes). This is useful in evaluating clustering algorithms, since cluster labels typically have no particular ordering.[3]
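The relabelling property can be illustrated with a prediction that gets every class wrong, but consistently. The sketch below uses made-up labels and a hypothetical helper `uncertainty_coefficient` based on relative frequencies; it compares the result with plain accuracy.

```python
from collections import Counter
from math import log2


def uncertainty_coefficient(x, y):
    """U(X|Y) estimated from paired label sequences (assumes H(X) > 0)."""
    n = len(x)
    x_counts, y_counts, joint = Counter(x), Counter(y), Counter(zip(x, y))
    h_x = -sum((c / n) * log2(c / n) for c in x_counts.values())
    h_x_given_y = -sum(
        (c / n) * log2(c / y_counts[yv]) for (_, yv), c in joint.items()
    )
    return (h_x - h_x_given_y) / h_x


truth = [0, 0, 1, 1, 2, 2]
predicted = [1, 1, 2, 2, 0, 0]  # every class is mislabelled, but consistently

accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
u = uncertainty_coefficient(truth, predicted)
print(accuracy, u)
```

Here accuracy is zero because no label matches, while the uncertainty coefficient is at its maximum because the predicted label still identifies the true class without ambiguity.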

Variations

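The uncertainty coefficient is not symmetric with respect to the roles of X and Y. The roles can be reversed and a symmetrical measure thus defined as a weighted average between the two:[2]

$$U(X, Y) = \frac{H(X)\,U(X|Y) + H(Y)\,U(Y|X)}{H(X) + H(Y)} = 2\left[\frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)}\right]$$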
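The asymmetry and the symmetric variant can be sketched as follows, assuming relative-frequency estimates and using the identity I(X;Y) = H(X) + H(Y) - H(X,Y). The function name `uncertainty_measures` and the sample data are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def uncertainty_measures(x, y):
    """Return (U(X|Y), U(Y|X), symmetric U(X, Y)); assumes H(X), H(Y) > 0."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return mutual_info / h_x, mutual_info / h_y, 2 * mutual_info / (h_x + h_y)


x = ["a", "a", "b", "b"]
y = [0, 0, 0, 1]

u_x_given_y, u_y_given_x, u_sym = uncertainty_measures(x, y)
print(u_x_given_y, u_y_given_x, u_sym)
```

The two directed coefficients differ for this data, and the symmetric value lies between them, as expected of a weighted average with weights H(X) and H(Y).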

Although normally applied to discrete variables, the uncertainty coefficient can be extended to continuous variables[1] using density estimation.[citation needed]

References

  1. Claude E. Shannon; Warren Weaver (1963). The Mathematical Theory of Communication. University of Illinois Press.
  2. William H. Press; Brian P. Flannery; Saul A. Teukolsky; William T. Vetterling (1992). "14.7.4". Numerical Recipes: the Art of Scientific Computing (3rd ed.). Cambridge University Press. p. 761.
  3. Jim White; Sam Steingold; Connie Fournelle. "Performance Metrics for Group-Detection Algorithms" (PDF). Interface 2004. Archived from the original on April 13, 2012.
  4. Peter Mills (2011). "Efficient statistical classification of satellite measurements" (PDF). International Journal of Remote Sensing. 32 (21): 6109–6132. arXiv:1202.2194. Bibcode:2011IJRS...32.6109M. doi:10.1080/01431161.2010.507795. S2CID 88518570. Archived from the original (PDF) on 2012-04-26.

External links

  • libagf Includes software for calculating uncertainty coefficients.
