6 Highlights From the AAAI Conference on Human Computation and Crowdsourcing

The AAAI Conference on Human Computation and Crowdsourcing (HCOMP) is an annual conference, known for its interdisciplinary approach, bringing together researchers and practitioners from several fields, including Social Sciences, Machine Learning, Human-Computer Interaction, and Policy, to discuss issues in crowdsourcing and present current research in the field.

We attended the eighth edition of HCOMP, held virtually from the 25th to the 29th of October 2020, and this edition was no exception. Crowdsourcing is a key area of interest for us at DefinedCrowd, so we will take this opportunity to highlight some of the main insights we took away from the workshops and research presentations.

Data Excellence Workshop: A Summary

Before diving into the actual publications, we want to highlight the first edition of the Data Excellence Workshop (DEW). The focus of this workshop, which we expect to continue in the future, was the numerous cases of unethical and unfair outcomes of AI systems we all read about in the news: from biased hiring algorithms to flawed facial recognition solutions. The way data is treated (with a focus on quantity rather than quality) and the glamorization of math over data work in the scientific community are two factors that perpetuate traceability issues and contribute to the so-called replication crisis in science. To start this discussion, the DEW presented us with an exercise: look at data the way we look at software. In other words, instead of focusing on one all-explaining (and usually misguided) metric, we should systematize the actual process of data generation and annotation.

On this note, the talks by Ben Hutchinson (Google Research) and Andrea Olgiati (Amazon AI) propose complementary strategies. The first focuses on documentation – it is urgent to record and share how datasets are created. Just as a piece of software passes through several documented steps, so should data. This includes having a Dataset Requirements Specification, a Dataset Design Document (explaining how the specification will be met), a Dataset Implementation Diary (justifying implementation decisions and keeping track of any changes), a Dataset Testing Report (describing the circumstances under which the data can be used), and a corrective, preventive, and adaptive Dataset Maintenance Plan.

The second talk, on the other hand, proposes adapting further practices from Software Engineering. These can include processes such as Data Unit Testing (both guaranteeing syntax, like columns having the correct data type, and semantics, like looking at the range of values in a field), Data Reviews (all data units need to be validated, for example through crowdsourcing), or Data Rollback/forward (assessing if the data is still representative of the real world).
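As a rough illustration of what Data Unit Testing could look like, the sketch below checks a handful of records for syntax (field types) and semantics (value ranges and allowed labels). The record layout, field names, label set, and the 60-second limit are all hypothetical and chosen only for this example; they do not come from the talk.

```python
# Hypothetical records: one row per annotated audio clip.
ROWS = [
    {"clip_id": "a1", "duration_s": 3.2, "label": "speech"},
    {"clip_id": "a2", "duration_s": -1.0, "label": "speech"},  # negative duration
    {"clip_id": "a3", "duration_s": "4.1", "label": "music"},  # string, not float
]

# Assumed schema and constraints for this example only.
SCHEMA = {"clip_id": str, "duration_s": float, "label": str}
ALLOWED_LABELS = {"speech", "music", "noise"}
MAX_DURATION_S = 60.0


def syntax_errors(row):
    """Syntax check: every field must be present and have the expected type."""
    return [
        f"{field}: wrong type or missing"
        for field, expected in SCHEMA.items()
        if not isinstance(row.get(field), expected)
    ]


def semantic_errors(row):
    """Semantic check: values must fall inside the expected range or set."""
    errors = []
    duration = row.get("duration_s")
    # Only range-check values that already passed the type check.
    if isinstance(duration, float) and not (0.0 < duration <= MAX_DURATION_S):
        errors.append("duration_s: out of range")
    if row.get("label") not in ALLOWED_LABELS:
        errors.append("label: not allowed")
    return errors


def run_data_unit_tests(rows):
    """Return a report mapping each failing clip_id to its list of problems."""
    report = {}
    for row in rows:
        problems = syntax_errors(row) + semantic_errors(row)
        if problems:
            report[row.get("clip_id")] = problems
    return report


report = run_data_unit_tests(ROWS)
for clip_id, problems in report.items():
    print(clip_id, problems)
```

Running such checks every time a dataset changes plays the same role a unit test suite plays for code: a failing record is caught before it reaches a model.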

As crowdsourcing experts, we recognize the importance of data excellence and the approaches suggested to ensure it. In fact, the artifacts and strategies mentioned during DEW are already byproducts of our structured data collection/annotation process. At DefinedCrowd, we collect several opinions on the same data units, and we know who contributed what, how much, and with what level of confidence. Data is no longer composed of obscure and isolated units of information. It is intrinsically linked to the process that was used to generate it. We take it as a challenge to continue to improve on packaging this metadata appropriately, and we challenge the community to become more and more sensitive to the relevance of this information.

Having given the well-deserved mention to DEW, let us now leap into the main conference content. We divided our review into four categories: Ethics & Fairness, Usability, Crowd Training, and Machine Learning.

Ethics and Fairness

Crowdsourcing has moved long past its early reputation as a “digital sweatshop” and has increasingly become an important and complex marketplace. The provocative question by Kittur et al. (2013) – “Can we foresee a future crowd workplace in which we would want our children to participate?” – situates ethics and fairness in crowd work as one of the central topics in the research community. On this front, we highlight two contributions at HCOMP.

Title: Qualification Labour: a fair wage isn’t enough if workers need to do 5,000 low paid tasks to qualify for your task

Figure 1 – Contributor Performance based on pre-requisites (source: Kummerfeld (2020))

Summary: The most frequently used strategy to select individuals to participate in crowd work is to look into their history: number of tasks submitted and acceptance rate. Fearing substandard work, requesters have grown more and more conservative, imposing demanding thresholds for these milestones. It has become common to request that a contributor has completed five thousand tasks with acceptance rates above 99%. In this study, a Named Entity Tagging task was set up on Amazon’s Mechanical Turk and different groups of contributors were allowed in (according to these two criteria), measuring their performance afterward. Figure 1 summarizes the results from which two main conclusions can be drawn: the percentage of HITs accepted is much more relevant than the number of tasks completed, and the threshold for the number of tasks can be set to much lower than what is commonly practiced.

TLDR: Crowd work that pays a fair wage but requires contributors to reach unreasonable milestones before they can access it (like having completed five thousand tasks in the past) cannot be considered fair, yet this is a common phenomenon in unregulated crowdsourcing ecosystems. This strategy institutes a pernicious cycle where people must grind through potentially unfairly paid tasks to get into the Ivy League of contributors.
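To make the two qualification criteria concrete, the sketch below filters a small, made-up pool of contributors by number of tasks completed and acceptance rate. The worker records are invented for illustration, and the relaxed threshold of 100 tasks is an arbitrary value, not a figure from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    worker_id: str
    tasks_completed: int
    acceptance_rate: float  # fraction of submitted tasks accepted, 0.0 to 1.0


def eligible(workers, min_tasks, min_acceptance):
    """Keep the workers that meet both qualification thresholds."""
    return [
        w
        for w in workers
        if w.tasks_completed >= min_tasks and w.acceptance_rate >= min_acceptance
    ]


# Invented pool, for illustration only.
WORKERS = [
    Worker("w1", 12000, 0.995),
    Worker("w2", 800, 0.997),
    Worker("w3", 150, 0.992),
    Worker("w4", 6000, 0.960),
]

# The conservative setup described above: 5,000 tasks and 99% acceptance.
strict = eligible(WORKERS, min_tasks=5000, min_acceptance=0.99)

# Same acceptance requirement, but a much lower (arbitrary) task threshold.
relaxed = eligible(WORKERS, min_tasks=100, min_acceptance=0.99)

print([w.worker_id for w in strict])
print([w.worker_id for w in relaxed])
```

In this toy pool, keeping the acceptance-rate requirement while lowering the task-count requirement admits two more contributors who have a strong track record but a shorter history.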

Title: The Challenges of Crowd Workers in Rural and Urban America

Summary: In a series of interviews, 450 individuals distributed over the categories Super Rural, Rural, and Urban (according to their living situation) were questioned on four crowdsourcing-related dimensions: income, flexibility, onboarding, and infrastructure.

Figure 2 – Advantages and Challenges of Crowdsourcing (source: Flores-Saviaga et al. (2020))

Figure 2 summarizes the results, presenting the percentage of individuals from each group who mentioned each dimension as a challenge or as an advantage. Contributors in super rural areas feel the strongest about all four dimensions, except for seeing income as an advantage, which is attributed to the fact that work is neither constant nor steady. One of the most interesting results in this study is seeing the positive influence of crowdsourcing as it spreads through rural areas, promoting digital literacy and fostering the development of digital infrastructures.

TLDR: With its advantages and challenges, crowdsourcing has been growing as a true marketplace for work. For individuals living in rural areas specifically, crowdsourcing makes it possible to work without depending on (often scarce) transportation and at the same time promotes digital development.

Usability

Another area where significant developments are being made for crowdsourcing is the way contributors can interact with the work. The evolution of devices and their widespread adoption allows for more seamless integration of crowd work into the daily routine of contributors. On this front, there are two works that we would like to highlight.

Title: Analyzing Workers Performance in Online Mapping Tasks Across Web, Mobile, and Virtual Reality Platforms

Figure 3 – VR Interface for the FIND Task (source: Alphen et al. (2020))

Summary: In this work, the authors explore Virtual Reality (VR) devices (Figure 3) to perform image tagging, comparing them with Web and Mobile interfaces. In the spirit of micro-tasking, image tagging was divided into three tasks: FIND the general location of the object, FIX the bounding box of the object, and VERIFY the correctness of the annotation. In terms of accuracy, results show that Web is best for the FIND task, while VR performs better in the FIX and VERIFY tasks. In terms of efficiency, on the other hand, Mobile seems to outperform the other two means of interaction, with VR being the least efficient. For this specific pipeline of image tagging, the authors conclude that overall Web is best suited for the FIND task, VR for FIX, and Mobile for VERIFY.

TLDR: New ways of participating in crowdsourcing work are arriving, such as Virtual Reality devices for image tagging with bounding boxes. These new means of interaction are more suitable for some tasks but are by no means a one-size-fits-all solution. Simple tasks, such as validations, are still more efficient in a smartphone setup. The crowdsourcing ecosystem is becoming richer and, for efficiency reasons, the tasks we are allowed to do may become more and more dependent on the device we are using.

Title: How Context Influences Cross-Device Task Acceptance in Crowd Work

Figure 4 – Comparison of likelihood to engage in crowdwork by device and time of day (source: Hettiachchi et al. (2020))

Summary: Nowadays, people are naturally in touch with different devices throughout their day, including laptops, smartphones, and smart speakers, to name a few. In this paper, the authors survey individuals about their likelihood to participate in crowd work in different situations categorized by two dimensions: device and time of day. Figure 4 presents one of the paper’s conclusions, showing differences in predisposition to engage in crowdsourcing tasks with regard to the two dimensions. It is especially interesting to see high engagement probability (>68%) for smart speakers: a device that has only recently started gaining a user base but for which no crowdsourcing solutions are available in the market. Another highlight of the paper is how the preferred device changes throughout the day: desktops/laptops and smart speakers peak in preferred usage in the morning, and smartphones in the evening.

TLDR: Diminishing the effort needed to engage in crowd work is an important piece in the evolution of crowdsourcing. By placing diverse crowd work opportunities at various points in a contributor’s daily routine, we can increase participation, and make the most out of an individual’s spare time.

Crowd Training

Training is a crucial component of crowdsourcing, particularly in the face of more complex tasks. It is especially valuable as an alternative to relying on past history on the platform to allow contributors to participate (which, as we have already seen, can have severe fairness implications). Training sessions are often composed of several steps, including sound instructions (with positive and negative examples), training tasks, qualification tests, and other evaluations. Training can be seen as a mechanism to provide a clean slate for any potential contributor wanting to participate in a task. On this dimension, we would like to highlight one work, which was also awarded Best Paper at HCOMP 2020.

Title: Motivating Novice Crowd Workers Through Goal Setting: an investigation into the effects on complex crowdsourcing task training

Figure 5 – Perceived Helpfulness of Lessons (source: Rechkemmer and Yin (2020))

Summary: There are roughly three strategies to address complex tasks in crowdsourcing: dividing them into simpler subtasks, having several contributors annotate the same data unit until reaching an agreement, and/or training the crowd. In this paper, the authors focus on the training component, investigating how introducing the Goal Setting framework can help in learning how to solve complex tasks. In this case, contributors were asked to identify nutritional components (fat, fiber, protein, etc.) in images of meals. The training flow had six steps: a pre-test of 12 tasks, a goal-setting phase, a set of nutrition lessons, a set of practice tasks, a survey on the perceived learning, and a post-test of 12 tasks. In the goal-setting phase, three types of goals were available for setup: Learning (“Understand how to identify carbohydrates”), Practice (“Complete X practice tasks”), and Performance (“Answer correctly to Y post-test tasks”). In this experiment, the authors also varied who would set the goals: the contributor or the requester. Figure 5 shows one of the conclusions of this study, plotting the self-reported helpfulness of the training lessons with respect to the type of goal that was set up, and who set up that goal. Results show that setting learning goals (regardless of who sets them) contributed to a higher sense of helpfulness of the training lessons, compared to the control group results in red (where there was no goal setting). Another relevant conclusion was that practice goals set by the requester hinder the outcomes of the overall training.

TLDR: Structured and explicit approaches to training are beneficial for crowd work, contrasting for instance with inflexible instruction pages. Highlighting the main points that should be acquired before exposing individuals to the information contributes to a greater sense of effectiveness of the training.

Machine Learning

The last dimension we would like to touch upon is the way data gathered or labeled through crowdsourcing is used for building applications. As we have already mentioned in this post, while data passes through a structured pipeline of crowdsourcing, a lot of metadata is generated as a byproduct (demographics of annotators, time performing the task, etc.). Such metadata usually ends up unused. The paper we highlight on this front tells another story.

Title: A Case for Soft Loss Functions

Figure 6 – Results of soft- VS hard-label based models for the image classification task (source: Peterson et al. (2019))

Summary: When several answers over the same datapoint exist, the common practice in crowdsourcing is to aggregate them somehow (for instance, through majority voting), producing one final label for each data unit. However, this strategy has been challenged recently. Namely, in Peterson et al. (2019), the authors provided evidence that training models directly from the distribution of judgments (instead of their aggregation) is beneficial for image classification. Figure 6 shows an example of the results of two different models: one trained with hard labels (shown in blue), meaning the most frequent human answer is used as the training target, and another with soft labels (in red), which is trained with the complete distribution of human answers. In the HCOMP paper, the authors revisit this approach with a different task: part-of-speech tagging. For this task, the soft-label model outperformed traditional approaches, particularly in scenarios with a substantial number of high-quality annotations per unit. The so-called soft-loss training consistently outperforms gold training when the objective is to produce distributions of labels, either for relative ranking or for representing uncertainty.

TLDR: There is a case for not aggregating multiple answers of the same data unit to train a model. By feeding the model with different judgments, we are propagating natural human uncertainty, allowing models to infer similarity structure beyond top-1 performance. This is an obvious advantage of crowdsourcing for labeling and annotation of data and can decrease the need to rely only on experts, especially in the face of information that can be dubious or on which there is no consensus.
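The difference between hard and soft labels can be sketched in a few lines. The example below takes five made-up judgments for one token, builds a majority-vote (hard) target and a distribution (soft) target, and scores the same model prediction against each using cross-entropy. Cross-entropy against the annotator distribution is only one possible soft loss and is an assumption here; the judgments, the three-tag label set, and the predicted probabilities are invented for illustration.

```python
import math
from collections import Counter

# Reduced, hypothetical tag set for a part-of-speech example.
CLASSES = ["NOUN", "VERB", "ADJ"]


def hard_label(judgments):
    """Majority vote: the most frequent answer becomes the single label."""
    return Counter(judgments).most_common(1)[0][0]


def soft_label(judgments):
    """Keep the full distribution of answers over the classes."""
    counts = Counter(judgments)
    return [counts[c] / len(judgments) for c in CLASSES]


def one_hot(label):
    """Turn a single label into a target distribution with all mass on it."""
    return [1.0 if c == label else 0.0 for c in CLASSES]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Five invented judgments for one ambiguous token.
judgments = ["NOUN", "NOUN", "VERB", "NOUN", "VERB"]

# Invented model output: probabilities for NOUN, VERB, ADJ.
predicted = [0.6, 0.3, 0.1]

hard_loss = cross_entropy(one_hot(hard_label(judgments)), predicted)
soft_loss = cross_entropy(soft_label(judgments), predicted)

print("hard target:", one_hot(hard_label(judgments)), "loss:", round(hard_loss, 3))
print("soft target:", soft_label(judgments), "loss:", round(soft_loss, 3))
```

With the hard target, the two VERB judgments vanish and the model is only rewarded for putting mass on NOUN. With the soft target, the model is also penalized for underestimating VERB, so the disagreement among annotators reaches the training signal.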

Continuous Research is Imperative

State-of-the-art research in crowdsourcing reinforces the human-in-the-loop paradigm as a de facto resource for addressing the industry’s intense need for high-quality labeled data. While crowdsourcing as a research topic in itself is expanding (into different interaction formats and sounder quality strategies), we are also seeing increased research into the way crowdsourcing work fits into the bigger picture of our society.

We are now looking at crowdsourcing from different perspectives across the spectrum. On one side, with crowdsourcing as a valid job marketplace and source of income, we must consider ethics and fairness, and approach these issues systematically and with accountability. On the other side, with crowdsourcing as a data provider for ML practitioners, we see developments in guaranteeing data excellence. This means ensuring the data is traceable, representative, unbiased, and fit for the purpose for which it was designed. We are also seeing a paradigm shift in the way data is seen – going beyond one truth for each data unit, and addressing and exploiting uncertainty and differing opinions to facilitate more human-like responses.