Allocating and Reallocating Samples to Annotators
One of the main features of EffiARA is the ability to allocate samples to a set of annotators according to user-specified criteria. Here we discuss in more detail how this is done.
Sample Allocation
Allocating samples to annotators requires the user to provide the following values:
annotation_rate (\(\rho\)): The estimated number of samples an annotator will complete in an hour.
time_available (\(t\)): The total number of hours available for each annotator.
double_proportion (\(d\)): The proportion of samples that will be allocated to two annotators, for computing inter-annotator agreement.
re_proportion (\(r\)): The proportion of samples that will be allocated to the same annotator twice, for computing intra-annotator agreement.
These variables are related to each other according to the equation below.
\(k = (2d + (1 + r)(1 - d))^{-1} \cdot \rho \cdot t \cdot n\)
where \(k\) is the total number of samples to annotate and \(n\) is the number of annotators.
EffiARA solves this equation and then allocates samples to annotators in such a way that there is a maximal set of points to compute inter-annotator agreement between each pair of annotators. While it is often the case that we want to determine the number of samples we can annotate, EffiARA can solve for any variable in this equation given the other four. For example, we could determine the time required to annotate a given number of samples provided an estimate of the annotation rate.
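To make the relationship concrete, the equation can be evaluated by hand. The following is a minimal sketch in plain Python that does not use EffiARA; the function names `samples_for_budget` and `hours_for_samples` are hypothetical and exist only for illustration, and \(n\) is assumed to be the number of annotators.

```python
def annotations_per_sample(d, r):
    """Average number of annotations one sample costs.

    A double-annotated sample (proportion d) costs 2 annotations;
    a single-annotated sample (proportion 1 - d) costs 1 + r on average.
    """
    return 2 * d + (1 + r) * (1 - d)


def samples_for_budget(d, r, rho, t, n):
    """Solve for k: the number of samples the annotation budget covers."""
    return rho * t * n / annotations_per_sample(d, r)


def hours_for_samples(k, d, r, rho, n):
    """Solve the same equation for t: hours needed per annotator."""
    return k * annotations_per_sample(d, r) / (rho * n)


# Three annotators, 20 samples/hour, 4 hours each, everything double-annotated.
k = samples_for_budget(d=1.0, r=0.5, rho=20, t=4, n=3)  # 120.0

# Hours each of three annotators needs for 1000 samples at d=0.2, r=0.1.
t = hours_for_samples(k=1000, d=0.2, r=0.1, rho=20, n=3)
```

Note that when \(d = 1\) the factor \((1 - d)\) is zero, so \(r\) has no effect on \(k\) in this equation.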
Using SampleDistributor
Allocation is done using the SampleDistributor class. It is initialized by providing
four of the five variables above, with the missing variable being the one we wish to solve for.
It then solves the aforementioned equation for the fifth variable and assigns samples accordingly.
import numpy as np
import pandas as pd

from effiara.preparation import SampleDistributor

# Generate some dummy data to allocate.
# Each column must be one-dimensional, so draw one random value per sample.
df = pd.DataFrame({"sample_id": range(1000), "value": np.random.randint(5, size=1000)})
annotators = ["Larry", "Curly", "Moe"]
distrib = SampleDistributor(
    annotators=annotators,
    num_samples=None,  # We want to solve for this
    annotation_rate=20,
    time_available=4,
    double_proportion=1.0,  # double-annotate all samples
    re_proportion=0.5,  # annotators re-annotate half of their samples
)
distrib.set_project_distribution()  # solve the equation above
allocations = distrib.distribute_samples(df.copy())
allocations is a Python dict of annotator names to Pandas DataFrames indicating the samples
allocated to each annotator.
Sample Reallocation
Occasionally, we may want to assign already annotated samples to annotators that have not seen them before.
For example, perhaps we want a third annotation for samples where the first two annotators disagreed.
This can be done using the SampleRedistributor class. It is initialized in the same way as
SampleDistributor, except that double_proportion and re_proportion are always set to
0.0.
Because one of the primary uses of the SampleRedistributor is to assign an additional annotator
to our samples, we can also initialize it from our existing SampleDistributor instance.
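from effiara.preparation import SampleRedistributor

# distrib is the SampleDistributor instance created earlier.
redistrib = SampleRedistributor.from_sample_distributor(distrib)
redistrib.set_project_distribution()
# annotated_df is a DataFrame that already contains {username}_label columns.
reallocations = redistrib.distribute_samples(annotated_df)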
Reallocation uses a different algorithm for allocating samples which ensures that no sample is assigned to
an annotator who has already annotated it. As such, its distribute_samples method requires
a DataFrame with annotation columns (i.e., columns in the {username}_label format).
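The constraint described above can be illustrated with a small, self-contained sketch. This is not EffiARA's algorithm: it is a simple greedy stand-in, written in plain Python, that only demonstrates the "never assign a sample to an annotator who has already labelled it" rule. The `reallocate` function and the toy `rows` data are hypothetical; a missing label is assumed to be represented by `None`.

```python
annotators = ["Larry", "Curly", "Moe"]

# Toy annotated data: one dict per sample, using the {username}_label format.
rows = [
    {"sample_id": 0, "Larry_label": 1, "Curly_label": 1, "Moe_label": None},
    {"sample_id": 1, "Larry_label": None, "Curly_label": 0, "Moe_label": 2},
    {"sample_id": 2, "Larry_label": 3, "Curly_label": None, "Moe_label": 3},
    {"sample_id": 3, "Larry_label": 0, "Curly_label": 4, "Moe_label": None},
]


def reallocate(rows, annotators):
    """Greedily assign each sample to one annotator who has not labelled it."""
    allocations = {name: [] for name in annotators}
    for row in rows:
        # Only annotators with no existing label for this sample are eligible.
        eligible = [name for name in annotators if row.get(f"{name}_label") is None]
        if not eligible:
            continue  # every annotator has already seen this sample
        # Balance the workload: pick the eligible annotator with the fewest samples.
        chosen = min(eligible, key=lambda name: len(allocations[name]))
        allocations[chosen].append(row["sample_id"])
    return allocations


reallocations = reallocate(rows, annotators)
# {'Larry': [1], 'Curly': [2], 'Moe': [0, 3]}
```

Choosing the least-loaded eligible annotator keeps workloads roughly even, but it makes no attempt to respect annotation_rate or time_available as SampleRedistributor does.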