Model definitions

At present, pyAnno implements three probabilistic models of data annotation:

1. Model A, a three-step generative model from the paper [Rzhetsky2009].

2. Model B-with-theta, a multinomial generative model from the paper [Rzhetsky2009].

3. Model B, a Bayesian generalization of the model proposed in [Dawid1979].

Glossary

annotations
The values emitted by the annotators on the available data items. In the documentation, x_i^j indicates the i-th annotation for annotator j.
labels
The possible annotations. They may be numbers, or strings, or any discrete set of objects.
label class, or just class
Every set of labels is ordered and numbered from 0 to K. The number associated with each label is the label class. The ground truth label class for each data item, i, is indicated in the documentation as y_i.
prevalence
The prior probability of each label class.
accuracy
The probability that an annotator reports the correct label class.

Model A

Model A defines a probability distribution over data annotations with a loop design in which each item is annotated by three annotators. The distribution over annotations is defined by a three-step generative model:

  1. First, the model independently generates correctness values for the triplet of annotators (e.g., CCI, where C = correct and I = incorrect).
  2. Second, the model generates an agreement pattern compatible with the correctness values (e.g., CII is compatible with the agreement patterns ‘abb’ and ‘abc’, where different letters correspond to different annotations).
  3. Finally, the model generates actual observations compatible with the agreement pattern.
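
The notion of compatibility in step 2 can be illustrated with a short Python sketch. This is not pyAnno code: the function names are hypothetical, and the compatibility rule (two correct annotators must agree, a correct and an incorrect annotator must differ, two incorrect annotators may or may not agree) is an assumption inferred from the example above, not taken from the tables in [Rzhetsky2009].

```python
# Hypothetical helpers, not part of pyAnno.
ALL_PATTERNS = ["aaa", "aab", "aba", "abb", "abc"]


def agreement_pattern(triplet):
    """Map three annotations to a pattern such as 'aab'.

    Letters are assigned in order of first appearance, so equal
    annotations share a letter.
    """
    letters = {}
    pattern = []
    for value in triplet:
        if value not in letters:
            letters[value] = "abc"[len(letters)]
        pattern.append(letters[value])
    return "".join(pattern)


def is_compatible(correctness, pattern):
    """Check an agreement pattern against correctness values like 'CII'.

    Assumed rule: two correct annotators must agree, and a correct
    annotator must differ from an incorrect one.
    """
    for p in range(3):
        for q in range(p + 1, 3):
            same = pattern[p] == pattern[q]
            p_ok = correctness[p] == "C"
            q_ok = correctness[q] == "C"
            if p_ok and q_ok and not same:
                return False
            if p_ok != q_ok and same:
                return False
    return True


def compatible_patterns(correctness):
    """List every agreement pattern compatible with the correctness values."""
    return [p for p in ALL_PATTERNS if is_compatible(correctness, p)]
```

Under this rule, `compatible_patterns("CII")` yields the two patterns named in step 2, and `agreement_pattern((2, 2, 5))` gives `'aab'`.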

In more detail, the model is described as follows:

  • Parameters \theta_j control the prior probability that annotator j is correct. Thus, for the correctness values \mathrm{X}_m, \mathrm{X}_n, \mathrm{X}_l of each triplet of annotators m, n, and l, we have

    P( \mathrm{X}_m \mathrm{X}_n \mathrm{X}_l | \theta ) = P( \mathrm{X}_m | \theta ) P( \mathrm{X}_n | \theta ) P( \mathrm{X}_l | \theta )

    where

    P( \mathrm{X}_j | \theta ) =\left\{\begin{array}{l l} \theta_j & \quad \text{if } \mathrm{X}_j = \mathrm{C} \\ 1-\theta_j & \quad \text{if } \mathrm{X}_j = \mathrm{I}\\ \end{array} \right.

  • Parameters \omega_k control the probability of observing an annotation of class k over all items and annotators. From these one can derive the parameters \alpha, which correspond to the probability of each agreement pattern according to the tables published in [Rzhetsky2009].
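
As a minimal illustration of the factorized correctness probability, the following Python sketch evaluates P(X_m X_n X_l | \theta) for a given correctness pattern. The function name and the values of \theta are made up for the example and are not part of pyAnno.

```python
from itertools import product


def correctness_probability(pattern, theta):
    """Return P(X_m X_n X_l | theta) for a pattern such as 'CCI'.

    `theta` holds the accuracies (theta_m, theta_n, theta_l). Each
    annotator contributes theta_j if correct ('C') and 1 - theta_j
    if incorrect ('I').
    """
    prob = 1.0
    for value, theta_j in zip(pattern, theta):
        prob *= theta_j if value == "C" else 1.0 - theta_j
    return prob


# Illustrative accuracies for the three annotators (assumed values).
theta = (0.9, 0.8, 0.7)

# P(CCI | theta) = 0.9 * 0.8 * (1 - 0.7)
p_cci = correctness_probability("CCI", theta)

# The eight possible patterns form a proper distribution.
total = sum(
    correctness_probability("".join(p), theta)
    for p in product("CI", repeat=3)
)
```

Because the annotators are treated as independent, the probabilities of the eight correctness patterns sum to one for any choice of \theta.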

See [Rzhetsky2009] for a more complete presentation of the model.

Model B-with-theta

Model B-with-theta is a multinomial generative model of the annotation process. The process begins with the generation of “true” label classes, drawn from a fixed categorical distribution. Each annotator reports a label class with some additional noise.

There are two sets of parameters. \gamma_k controls the prior probability of generating a label of class k. The accuracy parameter \theta_j controls the probability that annotator j reports the true label class. An important feature of the model is that the error probability is controlled by just one parameter per annotator, which makes estimation more robust and efficient.

Formally, for annotations x_i^j and true label classes y_i:

  • The probability of the true label classes is

    P(\mathbf{y} | \gamma) = \prod_i P(y_i | \gamma),

    P(y_i | \gamma) = \mathrm{Categorical}(y_i; \gamma) = \gamma_{y_i}.

  • The prior over the accuracy parameters is

    P(\theta_j) = \mathrm{Beta}(\theta_j; 1, 2).

  • And finally the distribution over the annotations is

    P(\mathbf{x} | \mathbf{y}, \theta) = \prod_i \prod_j P(x^j_i | y_i, \theta_j),

    P(x^j_i | y_i, \theta_j) = \left\{\begin{array}{l l} \theta_j & \quad \text{if } x_i^j = y_i\\ \frac{1-\theta_j}{\sum_n \theta_n} & \quad \text{otherwise}\\ \end{array} \right..
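
The first steps of this generative process can be sketched in plain Python. The function names are hypothetical and do not correspond to the pyAnno API. The sketch draws true label classes from \gamma, draws accuracies from the Beta(1, 2) prior, and decides whether each annotation is correct. It deliberately stops before choosing which wrong label an incorrect annotator reports.

```python
import math
import random


def sample_true_labels(gamma, n_items, rng):
    """Draw y_i ~ Categorical(gamma) for each item."""
    classes = range(len(gamma))
    return rng.choices(classes, weights=gamma, k=n_items)


def log_prob_true_labels(y, gamma):
    """Return log P(y | gamma) = sum_i log gamma_{y_i}."""
    return sum(math.log(gamma[label]) for label in y)


def sample_accuracies(n_annotators, rng):
    """Draw theta_j ~ Beta(1, 2) for each annotator."""
    return [rng.betavariate(1.0, 2.0) for _ in range(n_annotators)]


def sample_correctness(theta, n_items, rng):
    """Return correct[i][j], True with probability theta_j.

    True means that annotator j reports the true label class y_i.
    """
    return [[rng.random() < t for t in theta] for _ in range(n_items)]


rng = random.Random(0)             # fixed seed, for reproducibility only
gamma = [0.2, 0.5, 0.3]            # assumed prevalence of three classes
y = sample_true_labels(gamma, 10, rng)
theta = sample_accuracies(4, rng)
correct = sample_correctness(theta, len(y), rng)
```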

See [Rzhetsky2009] for more details.

Model B

Model B is a more general form of B-with-theta, and is also a Bayesian generalization of the earlier model proposed in [Dawid1979]. The generative process is identical to the one in model B-with-theta, except that a) the accuracy parameters are represented by a full tensor \theta_{j,k,k'} = P(x^j = k' | y = k), and b) it defines prior probabilities over the model parameters \theta and \pi, where \pi is the prevalence of the label classes.

The complete model description is as follows:

  • The probability of the true label classes is

    P(\pi | \beta) = \mathrm{Dirichlet} (\pi ; \beta)

    P(\mathbf{y} | \pi) = \prod_i P(y_i | \pi),

    P(y_i | \pi) = \mathrm{Categorical}(y_i; \pi) = \pi_{y_i}

  • The distribution over accuracy parameters is

    P(\theta_{j,k,:} | \alpha_k) = \mathrm{Dirichlet} ( \theta_{j,k,:} ; \alpha_k)

    The hyper-parameters \alpha_k define what kind of error distributions are more likely for an annotator. For example, they can be defined such that \alpha_{k,k'} peaks at k = k' and decays for k' becoming increasingly dissimilar to k. Such a prior is adequate for ordinal data, where the label classes have a meaningful order.

  • The distribution over annotations is defined as

    P(\mathbf{x} | \mathbf{y}, \theta) = \prod_i \prod_j P(x^j_i | y_i, \theta_{j,:,:}),

    P(x^j_i = k' | y_i = k, \theta_{j,:,:}) = \theta_{j,k,k'}.
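
The generative process above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not pyAnno's implementation: all function names are hypothetical, Dirichlet samples are obtained by normalizing Gamma draws, and `ordinal_alpha` is just one possible way to build hyper-parameters that peak at k = k' and decay with the distance |k - k'|.

```python
import random


def ordinal_alpha(n_classes, peak, decay):
    """Build alpha[k][k'] = peak * decay ** |k - k'| (assumed form)."""
    return [
        [peak * decay ** abs(k - k2) for k2 in range(n_classes)]
        for k in range(n_classes)
    ]


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_model_b(beta, alpha, n_annotators, n_items, rng):
    """Sample (pi, theta, y, x) following the model description.

    theta[j][k][k'] is P(x^j = k' | y = k); x[i][j] is the annotation
    of item i by annotator j.
    """
    n_classes = len(beta)
    classes = range(n_classes)
    # pi ~ Dirichlet(beta)
    pi = sample_dirichlet(beta, rng)
    # theta_{j,k,:} ~ Dirichlet(alpha_k)
    theta = [
        [sample_dirichlet(alpha[k], rng) for k in classes]
        for _ in range(n_annotators)
    ]
    # y_i ~ Categorical(pi)
    y = rng.choices(classes, weights=pi, k=n_items)
    # x_i^j ~ Categorical(theta_{j, y_i, :})
    x = [
        [rng.choices(classes, weights=theta[j][y_i])[0]
         for j in range(n_annotators)]
        for y_i in y
    ]
    return pi, theta, y, x


rng = random.Random(1)                        # fixed seed for reproducibility
alpha = ordinal_alpha(3, peak=8.0, decay=0.5)  # assumed hyper-parameters
pi, theta, y, x = sample_model_b([1.0, 1.0, 1.0], alpha, 4, 6, rng)
```

With `peak=8.0` and `decay=0.5`, the row of hyper-parameters for the middle class is `[4.0, 8.0, 4.0]`, so annotators are a priori more likely to confuse neighboring classes than distant ones.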

References

[Rzhetsky2009] Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). “How to get the most from your curation effort”. PLoS Computational Biology, 5(5).
[Dawid1979] Dawid, A. P. and Skene, A. M. (1979). “Maximum likelihood estimation of observer error-rates using the EM algorithm”. Applied Statistics, 28(1):20–28.
