At present, pyAnno implements three probabilistic models of data annotation:
1. Model A , a three-step generative model from the paper [Rzhetsky2009].
2. Model B-with-theta , a multinomial generative model from the paper [Rzhetsky2009].
3. Model B, a Bayesian generalization of the model proposed in [Dawid1979].
Model A defines a probability distribution over data annotations with a loop design in which each item is annotated by three users. The distributions over annotations is defined by a three-steps generative model:
- First, the model independently generates correctness values for the triplet of annotators (e.g., CCI where C=correct, I=incorrect)
- Second, the model generates an agreement pattern compatible with the correctness values (e.g., CII is compatible with the agreement patterns ‘abb’ and ‘abc’, where different letters correspond to different annotations
- Finally, the model generates actual observations compatible with the agreement patterns
More in detail, the model is described as follows:
Parameters control the prior probability that annotator is correct. Thus, for each triplet of annotations for annotators , , and , we have
where
Parameters control the probability of observing an annotation of class over all items and annotators. From these one can derive the parameters , which correspond to the probability of each agreement pattern according to the tables published in [Rzhetsky2009].
See [Rzhetsky2009] for a more complete presentation of the model.
Model B-with-theta is a multinomial generative model of the annotation process. The process begins with the generation of “true” label classes, drawn from a fixed categorical distribution. Each annotator reports a label class with some additional noise.
There are two sets of parameters: controls the prior probability of generating a label of class . The accuracy parameter controls the probability of annotator reporting class given that the true label is . An important part of the model is that the error probability is controlled by just one parameter per annotator, making estimation more robust and efficient.
Formally, for annotations and true label classes :
The probability of the true label classes is
,
.
The prior over the accuracy parameters is
.
And finally the distribution over the annotations is
,
.
See [Rzhetsky2009] for more details.
Model B is a more general form of B-with-theta, and is also a Bayesian generalization of the earlier model proposed in [Dawid1979]. The generative process is identical to the one in model B-with-theta, except that a) the accuracy parameters are represented by a full tensor , and b) it defines prior probabilities over the model parameters, , and .
The complete model description is as follows:
The probability of the true label classes is
,
The distribution over accuracy parameters is
The hyper-parameters define what kind of error distributions are more likely for an annotator. For example, they can be defined such that peaks at and decays for becoming increasingly dissimilar to . Such a prior is adequate for ordinal data, where the label classes have a meaningful order.
The distribution over annotation is defined as
,
.
[Rzhetsky2009] | (1, 2, 3, 4, 5) Rzhetsky A., Shatkay, H., and Wilbur, W.J. (2009). “How to get the most from your curation effort”, PLoS Computational Biology, 5(5). |
[Dawid1979] | (1, 2) Dawid, A. P. and A. M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28. |