These classes are implementations of statistical models of annotations. They are available through the pyanno.models namespace, e.g.:
    import pyanno.models

    # create a new instance of Model B, for 4 label classes and 6 annotators
    model = pyanno.models.ModelB.create_initial_state(4, 6)
This module defines the class ModelB, a Bayesian generalization of the model proposed in (Dawid et al., 1979).
Reference:
    Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1), 20-28.
Bases: pyanno.abstract_model.AbstractModel
Bayesian generalization of the model proposed in (Dawid et al., 1979).
Model B is a hierarchical generative model over annotations. The model assumes the existence of “true” underlying labels for each item, which are drawn from a categorical distribution, pi. Annotators report these labels with some noise, which depends on their accuracy, theta.
The model parameters are:
- pi[k] is the probability of label k
- theta[j,k,k'] is the probability that annotator j reports label k' for an item whose real label is k, i.e., P( annotator j chooses k' | real label = k )
The parameters themselves are random variables with hyperparameters:
- beta are the parameters of a Dirichlet distribution over pi
- alpha[k,:] are the parameters of Dirichlet distributions over theta[j,k,:]
See the documentation for a more detailed description of the model.
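As a quick orientation, the sketch below creates a model with the factory method from the example at the top of this page and inspects the shapes of the parameters just described; it assumes the parameters are exposed as the attributes pi and theta (names taken from this description, to be verified against the actual class).

    import pyanno.models

    # 4 label classes, 6 annotators; parameters are drawn from the default priors
    model = pyanno.models.ModelB.create_initial_state(4, 6)

    # assumed attribute names, matching the parameter names used above
    print(model.pi.shape)     # expected (4,): pi[k] = P(label k)
    print(model.theta.shape)  # expected (6, 4, 4): theta[j,k,k'] = P(annotator j reports k' | true label k)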
References:
    - Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1), 20-28.
    - Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). How to get the most out of your curation effort. PLoS Computational Biology, 5(5), e1000391.
Return the accuracy of each annotator.
Compute a summary of the a-priori accuracy of each annotator, i.e., P( annotator j is correct ). This can be computed from the parameters theta and pi, as
P( annotator j is correct ) = sum_k P( annotator j reports k | label is k ) P( label is k ) = sum_k theta[j,k,k] * pi[k]
Returns: accuracy (ndarray, shape = (n_annotators,)) - accuracy[j] = P( annotator j is correct )
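The formula above can be evaluated directly with NumPy; the sketch below computes it from plain pi and theta arrays with the documented shapes, without relying on any particular pyanno method.

    import numpy as np

    nclasses, nannotators = 4, 6
    rng = np.random.default_rng(0)

    # toy parameters with the documented shapes
    pi = rng.dirichlet(np.ones(nclasses))                                    # (nclasses,)
    theta = rng.dirichlet(np.ones(nclasses), size=(nannotators, nclasses))   # (nannotators, nclasses, nclasses)

    # accuracy[j] = sum_k theta[j,k,k] * pi[k]
    accuracy = np.einsum('jkk,k->j', theta, pi)
    print(accuracy)   # one a-priori accuracy value per annotator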
Return samples from the accuracy of each annotator.
Given samples from the posterior of the accuracy parameters theta (see sample_posterior_over_accuracy), compute samples from the posterior distribution of the annotator accuracy, i.e.,
P( annotator j is correct | annotations ).
See also sample_posterior_over_accuracy, annotator_accuracy.
Returns: accuracy (ndarray, shape = (n_annotators,)) - accuracy[j] = P( annotator j is correct )
Factory method returning a model with random initial parameters.
It is often more convenient to use this factory method over the constructor, as one does not need to specify the initial model parameters.
The parameters theta and pi, controlling accuracy and prevalence, are initialized at random from the priors defined by alpha and beta.
If not given, the prior parameters alpha and beta are set to default values.
Returns: model (ModelB) - Instance of ModelB
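If you need control over the priors, the factory method can presumably be given explicit hyperparameters. The sketch below builds alpha and beta arrays with the shapes implied by the description above (beta over pi, alpha[k,:] over theta[j,k,:]) and passes them under the keyword names used in this documentation; the exact signature is an assumption to check against the actual method.

    import numpy as np
    import pyanno.models

    nclasses, nannotators = 4, 6

    # beta: Dirichlet parameters over pi, one entry per label class
    beta = 2.0 * np.ones(nclasses)                                   # shape (nclasses,)
    # alpha[k,:]: Dirichlet parameters over theta[j,k,:], favoring the diagonal
    alpha = np.ones((nclasses, nclasses)) + 4.0 * np.eye(nclasses)   # shape (nclasses, nclasses)

    # keyword names follow the hyperparameter names used above (assumed signature)
    model = pyanno.models.ModelB.create_initial_state(nclasses, nannotators,
                                                      alpha=alpha, beta=beta)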
Generate a random annotation set from the model.
Sample a random set of annotations from the probability distribution defined by the current model parameters:
- Label classes are generated from the prior distribution, pi
- Annotations are generated from the conditional distribution of annotations given classes, theta
Parameters: nitems (int) - Number of items to sample
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Generate random annotations from the model, given labels.
The method samples random annotations from the conditional probability distribution of annotations given labels, P( annotations | labels ).
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Infer posterior distribution over label classes.
Compute the posterior distribution over label classes given the observed annotations, P( label classes | annotations ).
Returns: posterior (ndarray, shape = (n_items, n_classes)) - posterior[i,k] is the posterior probability of class k given the annotation observed in item i.
Compute the log likelihood of a set of annotations given the model.
Returns log P(annotations | current model parameters).
Returns: log_lhood (float) - log likelihood of annotations
Computes maximum a posteriori (MAP) estimate of parameters.
Estimate the parameters theta and pi from a set of observed annotations using maximum a posteriori estimation.
Computes maximum likelihood estimate (MLE) of parameters.
Estimate the parameters theta and pi from a set of observed annotations using maximum likelihood estimation.
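Putting the generation and estimation methods together, a typical round trip over simulated data might look like the sketch below; the method names (generate_annotations, mle, infer_labels, log_likelihood) follow the descriptions in this section and should be checked against the actual class.

    import pyanno.models

    # simulate annotations from a known model
    true_model = pyanno.models.ModelB.create_initial_state(4, 6)
    annotations = true_model.generate_annotations(500)     # shape (500, 6)

    # fit a fresh model to the simulated data by maximum likelihood
    model = pyanno.models.ModelB.create_initial_state(4, 6)
    model.mle(annotations)

    posterior = model.infer_labels(annotations)             # shape (500, 4)
    estimated_labels = posterior.argmax(axis=1)             # most probable class per item
    print(model.log_likelihood(annotations))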
Return samples from posterior distribution over theta given data.
Samples are drawn using Gibbs sampling, i.e., alternating between sampling from the conditional distribution of theta given the annotations and the label classes, and sampling from the conditional distribution of the classes given theta and the annotations.
This results in a fast-mixing sampler, and so the parameters controlling burn-in and thinning can be set to a small number of samples.
Returns: samples (ndarray, shape = (n_samples, n_annotators, n_classes, n_classes)) - samples[i,...] is one sample from the posterior distribution over the parameters theta
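Given an array of samples with the shape documented above, the posterior can be summarized with plain NumPy. The snippet below uses a synthetic array as a stand-in for the output of the sampling method (the call shown in the comment is an assumed signature) and computes the posterior mean confusion matrix per annotator.

    import numpy as np

    # stand-in for: samples = model.sample_posterior_over_accuracy(annotations, 200)   # assumed call
    # shape (n_samples, n_annotators, n_classes, n_classes)
    samples = np.random.default_rng(0).dirichlet(np.ones(4), size=(200, 6, 4))

    theta_mean = samples.mean(axis=0)                # posterior mean of theta per annotator
    diag = theta_mean.diagonal(axis1=1, axis2=2)     # P(report k | true label k), per annotator and class
    print(diag.shape)                                # (6, 4)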
This module defines model B-with-theta.
pyAnno includes another implementation of B-with-theta, pyanno.modelBt_loopdesign, which is optimized for a loop design where each item is annotated by 3 out of 8 annotators.
Bases: pyanno.abstract_model.AbstractModel
Implementation of Model B-with-theta from (Rzhetsky et al., 2009).
The model assumes the existence of “true” underlying labels for each item, which are drawn from a categorical distribution, gamma. Annotators report these labels with some noise, according to their accuracy, theta.
This model is closely related to ModelB, but, crucially, the noise distribution is described by a small number of parameters (one per annotator), which makes their estimation efficient and less sensitive to local optima.
The model parameters are:
- gamma[k] is the probability of label k
- theta[j] parametrizes the probability that annotator j reports label k' given ground truth k. More specifically, P( annotator j chooses k' | real label = k ) is theta[j] for k' = k, or (1 - theta[j]) / sum(theta) if k' != k.
See the documentation for a more detailed description of the model.
For a version of this model optimized for the loop design described in (Rzhetsky et al., 2009), see ModelBtLoopDesign.
Reference:
    Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). How to get the most out of your curation effort. PLoS Computational Biology, 5(5), e1000391.
Factory method returning a model with random initial parameters.
It is often more convenient to use this factory method over the constructor, as one does not need to specify the initial model parameters.
The parameters theta and gamma, controlling accuracy and prevalence, are initialized at random.
Returns: model (ModelBt) - Instance of ModelBt
Generate a random annotation set from the model.
Sample a random set of annotations from the probability distribution defined by the current model parameters:
- Label classes are generated from the prior distribution over label classes, gamma
- Annotations are generated from the conditional distribution of annotations given classes, parametrized by theta
Parameters: nitems (int) - Number of items to sample
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Generate random annotations from the model, given labels.
The method samples random annotations from the conditional probability distribution of annotations given labels, P( annotations | labels ).
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Infer posterior distribution over label classes.
Compute the posterior distribution over label classes given the observed annotations, P( label classes | annotations ).
Returns: posterior (ndarray, shape = (n_items, n_classes)) - posterior[i,k] is the posterior probability of class k given the annotation observed in item i.
Compute the log likelihood of a set of annotations given the model.
Returns log P(annotations | current model parameters), where annotations is the array of annotations.
Returns: log_lhood (float) - log likelihood of annotations
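Because ModelB and ModelBt expose the same interface, their log likelihoods can be compared on the same annotations. The sketch below is an illustration under the method names described in this documentation (create_initial_state, generate_annotations, mle, log_likelihood), not a rigorous model-selection procedure.

    import pyanno.models

    nclasses, nannotators = 4, 6

    # simulate annotations from a ModelBt instance
    true_model = pyanno.models.ModelBt.create_initial_state(nclasses, nannotators)
    annotations = true_model.generate_annotations(500)

    # fit both models to the same annotations and compare log likelihoods
    model_b = pyanno.models.ModelB.create_initial_state(nclasses, nannotators)
    model_bt = pyanno.models.ModelBt.create_initial_state(nclasses, nannotators)
    model_b.mle(annotations)
    model_bt.mle(annotations)

    print('ModelB :', model_b.log_likelihood(annotations))
    print('ModelBt:', model_bt.log_likelihood(annotations))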
Computes maximum a posteriori (MAP) estimate of parameters.
Estimate the parameters theta and gamma from a set of observed annotations using maximum a posteriori estimation.
Computes maximum likelihood estimate (MLE) of parameters.
Estimate the parameters theta and gamma from a set of observed annotations using maximum likelihood estimation.
Return samples from posterior distribution over theta given data.
Samples are drawn using a variant of a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm. Sampling proceeds in two phases:
- step size estimation phase: first, the step size in the MCMC algorithm is adjusted to achieve a given rejection rate.
- sampling phase: second, samples are collected using the step size from phase 1.
Returns: samples (ndarray, shape = (n_samples, n_annotators)) - samples[i,:] is one sample from the posterior distribution over the parameters theta
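Given samples with the shape documented above, per-annotator posterior summaries take a few lines of NumPy; the array below stands in for the output of the sampling method.

    import numpy as np

    # stand-in for posterior samples, shape (n_samples, n_annotators)
    samples = np.random.default_rng(0).uniform(0.6, 0.95, size=(500, 8))

    theta_mean = samples.mean(axis=0)
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)    # 95% credible interval per annotator
    for j, (m, l, h) in enumerate(zip(theta_mean, lo, hi)):
        print(f"annotator {j}: theta = {m:.2f} [{l:.2f}, {h:.2f}]")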
This module defines model B-with-theta, optimized for a loop design.
The implementation assumes that there are a total of 8 annotators. Each item is annotated by a triplet of annotators, according to the loop design described in (Rzhetsky et al., 2009).
E.g., for 16 items the loop design looks like this (A indicates a label, * indicates a missing value):
A A A * * * * *
A A A * * * * *
* A A A * * * *
* A A A * * * *
* * A A A * * *
* * A A A * * *
* * * A A A * *
* * * A A A * *
* * * * A A A *
* * * * A A A *
* * * * * A A A
* * * * * A A A
A * * * * * A A
A * * * * * A A
A A * * * * * A
A A * * * * * A
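The pattern above can be reproduced programmatically. The sketch below builds the same 8-annotator loop-design mask for an arbitrary number of items, purely as an illustration of the design (it is not part of the pyanno API).

    import numpy as np

    def loop_design_mask(nitems, nannotators=8):
        """True where an annotation is observed, following the loop design shown above."""
        mask = np.zeros((nitems, nannotators), dtype=bool)
        for i in range(nitems):
            start = (i // 2) % nannotators                        # items come in pairs, shifted by one annotator
            cols = [(start + d) % nannotators for d in range(3)]  # each item gets a triplet of annotators
            mask[i, cols] = True
        return mask

    print(loop_design_mask(16).astype(int))   # reproduces the 16-item pattern shown above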
Bases: pyanno.abstract_model.AbstractModel
Implementation of Model B-with-theta from (Rzhetsky et al., 2009).
The model assumes the existence of “true” underlying labels for each item, which are drawn from a categorical distribution, gamma. Annotators report these labels with some noise.
This model is closely related to ModelB, but, crucially, the noise distribution is described by a small number of parameters (one per annotator), which makes their estimation efficient and less sensitive to local optima.
The model parameters are:
- gamma[k] is the probability of label k
- theta[j] parametrizes annotator j's noise distribution, i.e., the probability that annotator j reports label k' given ground truth k (as in ModelBt)
This implementation is optimized for the loop design introduced in (Rzhetsky et al., 2009), which assumes that each item is annotated by 3 out of 8 annotators. For a more general implementation, see ModelBt.
See the documentation for a more detailed description of the model.
Reference:
    Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). How to get the most out of your curation effort. PLoS Computational Biology, 5(5), e1000391.
Check if the annotations are compatible with the model's parameters.
Factory method returning a model with random initial parameters.
It is often more convenient to use this factory method over the constructor, as one does not need to specify the initial model parameters.
The parameters theta and gamma, controlling accuracy and prevalence, are initialized at random.
Returns: model (ModelBtLoopDesign) - Instance of ModelBtLoopDesign
Generate a random annotation set from the model.
Sample a random set of annotations from the probability distribution defined by the current model parameters:
- Label classes are generated from the prior distribution over label classes, gamma
- Annotations are generated from the conditional distribution of annotations given classes, parametrized by theta
Parameters: nitems (int) - Number of items to sample
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Generate random annotations from the model, given labels.
The method samples random annotations from the conditional probability distribution of annotations given labels, P( annotations | labels ).
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Infer posterior distribution over label classes.
Compute the posterior distribution over label classes given the observed annotations, P( label classes | annotations ).
Returns: posterior (ndarray, shape = (n_items, n_classes)) - posterior[i,k] is the posterior probability of class k given the annotation observed in item i.
Compute the log likelihood of a set of annotations given the model.
Returns log P(annotations | current model parameters), where annotations is the array of annotations.
Returns: log_lhood (float) - log likelihood of annotations
Computes maximum a posteriori (MAP) estimate of parameters.
Estimate the parameters theta and gamma from a set of observed annotations using maximum a posteriori estimation.
Computes maximum likelihood estimate (MLE) of parameters.
Estimate the parameters theta and gamma from a set of observed annotations using maximum likelihood estimation.
Return samples from posterior distribution over theta given data.
Samples are drawn using a variant of a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm. Sampling proceeds in two phases:
- step size estimation phase: first, the step size in the MCMC algorithm is adjusted to achieve a given rejection rate.
- sampling phase: second, samples are collected using the step size from phase 1.
Returns: samples (ndarray, shape = (n_samples, n_annotators)) - samples[i,:] is one sample from the posterior distribution over the parameters theta
This module defines the class ModelA, an implementation of model A from Rzhetsky et al., 2009.
The implementation assumes that there are a total of 8 annotators. Each item is annotated by a triplet of annotators, according to the loop design described in (Rzhetsky et al., 2009).
E.g., for 16 items the loop design looks like this (A indicates a label, * indicates a missing value):
A A A * * * * *
A A A * * * * *
* A A A * * * *
* A A A * * * *
* * A A A * * *
* * A A A * * *
* * * A A A * *
* * * A A A * *
* * * * A A A *
* * * * A A A *
* * * * * A A A
* * * * * A A A
A * * * * * A A
A * * * * * A A
A A * * * * * A
A A * * * * * A
Reference:
    Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). How to get the most out of your curation effort. PLoS Computational Biology, 5(5), e1000391.
Bases: pyanno.abstract_model.AbstractModel
Implementation of Model A from (Rzhetsky et al., 2009).
The model defines a probability distribution over data annotations in which each item is annotated by three users. The distribution is described by a three-step generative model:
1. First, the model independently generates correctness values for the triplet of annotators (e.g., CCI where C=correct, I=incorrect)
2. Second, the model generates an agreement pattern compatible with the correctness values (e.g., CII is compatible with the agreement patterns ‘abb’ and ‘abc’, where different letters correspond to different annotations)
3. Finally, the model generates actual observations compatible with the agreement patterns
The model has two main sets of parameters:
- theta[j] is the probability that annotator j is correct
- omega[k] is the probability of observing an annotation of class k over all items and annotators
At the moment the implementation of the model assumes 1) a total of 8 annotators, and 2) each item is annotated by exactly 3 annotators.
See the documentation for a more detailed description of the model.
Reference:
    Rzhetsky, A., Shatkay, H., and Wilbur, W. J. (2009). How to get the most out of your curation effort. PLoS Computational Biology, 5(5), e1000391.
Check if the annotations are compatible with the model's parameters.
Factory method to create a new model.
It is often more convenient to use this factory method over the constructor, as one does not need to specify the initial model parameters.
If not specified, the parameters theta are drawn from a uniform distribution between 0.6 and 0.95. The parameters omega are drawn from a Dirichlet distribution with parameters 2.0.
Generate random annotations from the model.
The method samples random annotations from the probability distribution defined by the model parameters:
- generate correct/incorrect labels for the three annotators, according to the parameters theta
- generate agreement patterns (which annotator agrees with whom) given the correctness information and the parameters alpha
- generate the annotations given the agreement patterns and the parameters omega
Note that, according to the model’s definition, only three annotators per item return an annotation. Non-observed annotations have the standard value of MISSING_VALUE.
Parameters: nitems (int) - number of annotations to draw from the model
Returns: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
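The sketch below illustrates the missing-value structure: every sampled item should carry exactly three observed annotations. It assumes ModelA is created with a create_initial_state factory like the other models, and that missing entries are marked with pyanno.util.MISSING_VALUE; both are assumptions to verify against the actual API.

    import pyanno.models
    from pyanno.util import MISSING_VALUE   # assumed location of the missing-value sentinel

    model = pyanno.models.ModelA.create_initial_state(4)    # assumed factory signature, 4 label classes
    annotations = model.generate_annotations(16)            # shape (16, 8)

    observed_per_item = (annotations != MISSING_VALUE).sum(axis=1)
    print(observed_per_item)   # expected: 3 for every item (3 of 8 annotators per item)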
Infer posterior distribution over label classes.
Compute the posterior distribution over label classes given the observed annotations, P( label classes | annotations ).
Parameters: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Returns: posterior (ndarray, shape = (n_items, n_classes)) - posterior[i,k] is the posterior probability of class k given the annotation observed in item i.
Compute the log likelihood of a set of annotations given the model.
Returns log P(annotations | current model parameters), where annotations is the array of annotations.
Parameters: annotations (ndarray, shape = (n_items, n_annotators)) - annotations[i,j] is the annotation of annotator j for item i
Returns: log_lhood (float) - log likelihood of annotations
Computes maximum a posteriori (MAP) estimate of parameters.
Estimate the parameters theta and omega from a set of observed annotations using maximum a posteriori estimation.
Computes maximum likelihood estimate (MLE) of parameters.
Estimate the parameters theta and omega from a set of observed annotations using maximum likelihood estimation.
Return samples from posterior distribution over theta given data.
Samples are drawn using a variant of a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm. Sampling proceeds in two phases:
- step size estimation phase: first, the step size in the MCMC algorithm is adjusted to achieve a given rejection rate.
- sampling phase: second, samples are collected using the step size from phase 1.
Returns: samples (ndarray, shape = (n_samples, n_annotators)) - samples[i,:] is one sample from the posterior distribution over the parameters theta