Named Entity Recognition (NER) is an essential component of many Natural Language Processing pipelines, but creating training corpora for NER in new domains is expensive in both time and money (Marrero 2013). At DefinedCrowd, we circumvent this issue by collecting and enriching data at scale through crowdsourcing.
Nonetheless, managing these annotations at scale raises other issues, such as predicting completion timelines and estimating fair pricing for workers in advance. To that end, in this blog post we present how we at DefinedCrowd estimate the effort a human needs to complete a Named Entity Tagging (NET) task before the actual annotation takes place. For a complete version of the investigation, refer to the published paper (Gomes, et al. 2020).
Named Entity Tagging in Crowdsourcing
Crowdsourcing can be used to annotate large quantities of data by assigning Human Intelligence Tasks (HITs) to an existing pool of non-expert contributors, who complete them within a short time in exchange for a monetary reward (Hassan 2013). In NET instances, the contributor’s goal is to locate and annotate pre-defined named entities in unstructured text. Figure 1 shows an example where the contributor iteratively highlights a segment of text containing a named entity and selects the corresponding category.
Figure 1: Example of a Named Entity Tagging HIT in Neevo
Human Effort Definition
Human effort is the time a human needs to complete a given HIT, that is, the interval between the moment the HIT is shown and the moment it is submitted. To make the values easier to interpret and compare, we normalize time-on-task by the number of tokens present in the input. This transformation yields speed-on-task, which is measured in words per minute (wpm) and given by:
speed-on-task (wpm) = number of tokens in the input / time-on-task (in minutes)
Because NET data collection pipelines at DefinedCrowd have a given redundancy, that is, each HIT is executed by more than one contributor, the average speed-on-task is the variable we ultimately want to predict.
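As a minimal sketch of the definitions above, the snippet below computes speed-on-task for each execution of a HIT and averages it across contributors. The function names, and the assumption that time-on-task is logged in seconds, are illustrative and not taken from the actual pipeline.

```python
from statistics import mean


def speed_on_task(num_tokens, seconds_on_task):
    """Return tokens processed per minute for a single HIT execution."""
    if seconds_on_task <= 0:
        raise ValueError("time-on-task must be positive")
    return num_tokens / (seconds_on_task / 60.0)


def average_speed_on_task(num_tokens, seconds_per_execution):
    """Average the speed over all contributors who executed the same HIT."""
    speeds = [speed_on_task(num_tokens, s) for s in seconds_per_execution]
    return mean(speeds)


# Hypothetical HIT: 50 tokens, completed by two contributors in 30 s and 60 s.
example_average = average_speed_on_task(50, [30.0, 60.0])
print(example_average)
```

In this toy case the two executions run at 100 wpm and 50 wpm, so the target value for the HIT is 75 wpm.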
When a HIT is executed, external factors may affect the completion time, including contributor carelessness (such as leaving the HIT open for an unreasonable amount of time) or contributor expertise (where an experienced contributor completes tasks at a very fast pace). The resulting undesirable data points are HITs whose behavior stands apart from the average speed-on-task (i.e., the target variable), so we must curate our dataset by discarding them.
For this, we investigate the standard deviation (SD) of speed-on-task for each HIT: a value of zero represents perfect alignment, i.e., all contributors performed the task at exactly the same speed (the optimal condition); larger values imply discrepant speeds, and the corresponding HITs are therefore not good candidates for training or testing the prediction of human effort.
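The per-HIT spread described above can be sketched as follows. The post does not say whether the sample or the population standard deviation is used, so the sample version (`statistics.stdev`) is an assumption here, and the HIT identifiers and speeds are made up for illustration.

```python
from statistics import stdev


def speed_sd(speeds):
    """Sample standard deviation of the speeds recorded for one HIT.

    Returns None when fewer than two executions exist, because the
    spread cannot be estimated from a single value.
    """
    return stdev(speeds) if len(speeds) >= 2 else None


# Hypothetical speeds (wpm) per HIT, one value per contributor.
speeds_by_hit = {
    "hit-a": [100.0, 100.0, 100.0],  # perfect alignment
    "hit-b": [40.0, 100.0, 160.0],   # discrepant speeds
}
sd_by_hit = {hit: speed_sd(speeds) for hit, speeds in speeds_by_hit.items()}
print(sd_by_hit)
```

The first HIT has an SD of zero, while the second has a large SD and would be a candidate for removal.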
To make the data curation process systematic, we apply the concept of Inter-Annotator Agreement (IAA) to the speed-on-task values, measuring Krippendorff’s alpha (𝛼) coefficient of the dataset (Krippendorff 1980). Our goal was to remove HITs with a large speed-on-task standard deviation until the remaining dataset reached an agreement of 𝛼 ≥ 0.65.
Figure 2 plots how IAA trades off against both the number of HITs remaining in the dataset (on the left) and the allowed standard deviation of speed-on-task (on the right). The first setting where the target condition is met (𝛼 ≥ 0.65) corresponds to accepting only HITs whose speed-on-task standard deviation is below 20 wpm, resulting in a dataset of 37,061 HITs.
Figure 2: Krippendorff’s alpha and dataset size evolution given the allowed speed-on-task standard deviation threshold.
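The curation procedure can be illustrated with a small sketch: compute Krippendorff’s alpha on the speeds of the HITs that survive an SD threshold, and pick the loosest threshold that reaches the target agreement. Several things here are assumptions rather than details from the post: the interval (squared-difference) distance metric, the sample standard deviation, the candidate thresholds, and all function names and toy data.

```python
from statistics import stdev


def _sum_sq_diffs(values):
    """Sum of (a - b)^2 over all ordered pairs: 2n*sum(v^2) - 2*(sum v)^2."""
    n = len(values)
    return 2 * n * sum(v * v for v in values) - 2 * sum(values) ** 2


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    `units` holds one list of speed-on-task values per HIT.
    Returns None when alpha is undefined (too few values or no variation).
    """
    units = [u for u in units if len(u) >= 2]  # single values are not pairable
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        return None
    expected = _sum_sq_diffs(values) / (n * (n - 1))
    if expected == 0:
        return None
    observed = sum(_sum_sq_diffs(u) / (len(u) - 1) for u in units) / n
    return 1.0 - observed / expected


def loosest_threshold(units, thresholds, target=0.65):
    """Return (threshold, hits_kept, alpha) for the largest SD threshold
    whose surviving HITs reach the target agreement, or None."""
    for threshold in sorted(thresholds, reverse=True):
        kept = [u for u in units if len(u) >= 2 and stdev(u) < threshold]
        alpha = krippendorff_alpha_interval(kept)
        if alpha is not None and alpha >= target:
            return threshold, len(kept), alpha
    return None


# Toy data: three consistent HITs and one with wildly different speeds.
toy_hits = [[100.0, 104.0], [60.0, 62.0], [30.0, 33.0], [20.0, 120.0]]
result = loosest_threshold(toy_hits, thresholds=[100, 50, 20, 10])
print(result)
```

With the toy data, keeping all four HITs gives low agreement, and the first threshold that excludes the noisy HIT brings alpha above the target.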
Predicting Human Effort
As a first approach to predicting human effort, we establish a linear model baseline with two features. These features represent the dimensions described in the literature on the topic (Feyisetan 2015) (Sweller 1994): input length in number of tokens, as a basic representation of the amount of information to be processed, and number of named entity categories, as a representation of cognitive load. Although we believe that the effort needed to complete a task may grow linearly with the input length in tokens, we do not expect the taxonomy size to affect annotation time in the same way. Thus, we also explore nonlinear models. Then, we expand the set of explanatory variables by adding text preprocessing features that can model the cognitive load associated with the task: number of sentences, number of punctuation tokens, number of stopwords, and average word length (with and without stopwords).
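A rough sketch of how the explanatory variables listed above might be extracted from a HIT’s input text is shown below. The regex tokenizer, the sentence splitter, the tiny stopword list, and the feature names are all illustrative assumptions; the post does not describe the preprocessing tools actually used. The number of categories is passed in because it comes from the task taxonomy, not from the text.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "is", "was", "to", "at"}


def _mean_length(words):
    """Average number of characters per word, or 0.0 for an empty list."""
    return sum(len(w) for w in words) / len(words) if words else 0.0


def extract_features(text, num_categories):
    """Compute the seven explanatory variables for one HIT input."""
    tokens = re.findall(r"\w+|[^\w\s]", text)           # words and punctuation marks
    words = [t for t in tokens if re.fullmatch(r"\w+", t)]
    content_words = [w for w in words if w.lower() not in STOPWORDS]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_tokens": len(tokens),                       # assumed to include punctuation
        "num_categories": num_categories,
        "num_sentences": len(sentences),
        "num_punctuation": len(tokens) - len(words),
        "num_stopwords": len(words) - len(content_words),
        "avg_word_len": _mean_length(words),
        "avg_word_len_no_stop": _mean_length(content_words),
    }


features = extract_features(
    "Ana visited Lisbon in May. She met Rui at the airport!", num_categories=3
)
print(features)
```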
Table 1 shows the experimental results. We conclude that the Nearest Neighbors Regressor outperforms the remaining models, attaining an RMSE (Root Mean Squared Error) of 25.68 wpm, a statistically significant 6% improvement over the linear model baseline. The final set of features used by the Nearest Neighbors model consists of input length in number of tokens, number of named entity categories, number of stopwords, number of punctuation tokens, and average word length.
Table 1: Performance of predicting human effort using Nonlinear Models with the complete set of features (7), the reduced set of features for the Nearest Neighbors and Random Forest (5), compared with the Random Forest and Linear Model with a set of two features.
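To make the evaluation concrete, here is a minimal pure-Python sketch of nearest-neighbors regression, RMSE, and relative improvement over a baseline. It is not the implementation behind Table 1: the value of k, the Euclidean distance, the absence of feature scaling, and the toy data are all assumptions chosen for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def knn_predict(train_x, train_y, query, k=3):
    """Predict the average target of the k training rows closest to `query`."""
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


def relative_improvement(baseline_rmse, model_rmse):
    """Fraction by which the model's RMSE is lower than the baseline's."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Toy training set: (number of tokens, number of categories) -> average speed (wpm).
train_x = [(10, 2), (20, 2), (30, 4), (40, 4)]
train_y = [120.0, 110.0, 80.0, 70.0]

prediction = knn_predict(train_x, train_y, (25, 3), k=2)
print(prediction, rmse([100.0], [prediction]))
```

In practice the features would usually be scaled before computing distances, since the token count otherwise dominates the smaller-valued features.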
Interpreting the Human Effort Explanatory Variables
To understand the importance of each feature in the winning model, we carried out an ablation study. Table 2 shows the features ranked by their impact on model performance, from highest to lowest.
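One common way to run such a study is leave-one-feature-out ablation: retrain the model without a feature and measure how much the error grows. The sketch below assumes that design, a 1-nearest-neighbor model scored by leave-one-out RMSE, and invented toy data; the post does not specify how the ablation was actually performed.

```python
import math


def knn_predict(train_x, train_y, query, k=1):
    """Average target of the k training rows closest to `query`."""
    by_distance = sorted(
        (math.dist(row, query), target) for row, target in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


def leave_one_out_rmse(rows, targets, k=1):
    """RMSE when each row is predicted from all the other rows."""
    squared_errors = []
    for i, (row, target) in enumerate(zip(rows, targets)):
        other_rows = rows[:i] + rows[i + 1:]
        other_targets = targets[:i] + targets[i + 1:]
        prediction = knn_predict(other_rows, other_targets, row, k)
        squared_errors.append((prediction - target) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def ablation_ranking(rows, targets, feature_names, k=1):
    """Rank features by the RMSE increase observed when each one is removed."""
    full_rmse = leave_one_out_rmse(rows, targets, k)
    impact = {}
    for j, name in enumerate(feature_names):
        reduced = [row[:j] + row[j + 1:] for row in rows]
        impact[name] = leave_one_out_rmse(reduced, targets, k) - full_rmse
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)


# Toy data: speed falls as token count grows; category count is unrelated noise.
rows = [(10, 3), (22, 1), (37, 2), (55, 3), (76, 1), (100, 2)]
targets = [120.0, 110.0, 95.0, 80.0, 70.0, 60.0]
ranking = ablation_ranking(rows, targets, ["num_tokens", "num_categories"])
print(ranking)
```

In this toy setup, removing the token count degrades the error sharply, while removing the category count leaves it unchanged, so the token count ranks first.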