Crowdsourcing in computing education research: Case Amazon MTurk

Published: 11 Apr 2021
Written by: Chun Fei Lung

Crowdsourcing survey responses using Amazon Mechanical Turk is easy. But how useful are those responses?

You never know when you might need Surveys as a Service

Crowdsourcing is a way to outsource a large number of small, hard-to-automate tasks to an outside workforce. Amazon Mechanical Turk (also known as MTurk) is a crowdsourcing marketplace where individuals and businesses can outsource such tasks to workers around the world.

About the article

Title	Crowdsourcing in computing education research: Case Amazon MTurk
Year	2020
Author(s)	Arto Hellas (Aalto University), Albina Zavgorodniaia (Aalto University), Juha Sorva (Aalto University)
Venue	Koli Calling ’20: Proceedings of the 20th Koli Calling International Conference on Computing Education Research

Why it matters

Crowdsourcing is especially useful for tasks that are hard to automate, like completing surveys, participating in small experiments, and annotating data for machine learning. A crowdsourcing marketplace like MTurk could therefore help researchers recruit participants for their studies.

While there is evidence that students (side note: who are normally recruited as study participants) and paid workers perform similarly, some researchers report that crowdsourced data can be inaccurate or even completely invalid: up to a third of the data from a crowdsourcing platform may have serious quality issues!

This paper therefore aims to provide an overview of how paid crowdsourcing for computing education research (side note: In this field you’ll also find studies on best practices for computer programming.) works in practice.

How the study was conducted

The paper consists of two parts.

In the first part, the authors discuss the results of a literature review on crowdsourced research in computing education research.

The second part describes the authors’ experiences with one of their own crowdsourcing-based projects. Their study was aimed at non-programmers and included a 45-minute experiment that consisted of three parts:

a demographic survey;
a randomly selected 24-minute instructional video on programming;
a post-test and a few additional survey items.

What discoveries were made

The paper includes some lessons learnt that are primarily interesting if you intend to conduct your own study using the MTurk marketplace in the near future.

Review on crowdsourcing studies

The review shows that crowdsourcing studies can generally be grouped into two categories:

surveys, many of which support program comprehension studies (about the understandability or quality of code) or studies of developers’ preferences;
system evaluations, in which participants are asked to work with an external system (side note: Some examples include algorithmic visualisations, teaching approaches for programming, and IDE tooling innovations) for a short period and then answer a few questions in a survey.

The reviewed papers are not about crowdsourcing, but merely use crowdsourcing. This means that the papers are likely somewhat biased in favour of crowdsourcing, as their authors had already chosen to use the method.

While the papers do mention some “negative results”, quality issues aren’t really discussed explicitly. This may be because researchers take these into account in their study design, by adding requirements for workers (side note: e.g. workers must have a high number of successfully completed tasks, with a high acceptance ratio) and including validity-indicator questions that are used to verify that responses were made by someone who was paying attention.

Some studies use MTurk as the sole source of data. Others use it to explore the validity and generalisability of some initial findings, or treat MTurk workers as an additional population.

In most cases MTurk workers are paid a fixed amount, although in some studies the size of the reward is performance-based.

A case study

Most unreliable respondents and bots can be easily pruned out by requiring that workers have at least 500 completed tasks with an acceptance rate of at least 99%.
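
As a rough illustration of this kind of screening, the sketch below filters a list of worker records by the two thresholds mentioned above. The `Worker` class and its field names are hypothetical and are not part of the MTurk API; on the actual platform, such thresholds would be set as qualification requirements on the task rather than applied afterwards.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record; field names are illustrative, not MTurk API names.
    worker_id: str
    completed_tasks: int
    acceptance_rate: float  # percentage between 0 and 100


def is_eligible(worker, min_tasks=500, min_acceptance=99.0):
    """Return True if the worker meets both thresholds from the case study."""
    return (worker.completed_tasks >= min_tasks
            and worker.acceptance_rate >= min_acceptance)


workers = [
    Worker("A", 1200, 99.5),
    Worker("B", 80, 100.0),   # too few completed tasks
    Worker("C", 5000, 97.0),  # acceptance rate too low
]
eligible = [w.worker_id for w in workers if is_eligible(w)]
print(eligible)
```

With the made-up records above, only worker "A" passes both checks.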

However, it is not inconceivable that in longer studies participants skip (reading) parts and answer questions randomly. To identify such unwanted behaviour, one should monitor participants’ behaviour within surveys.
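
The summary here does not say how exactly the authors monitored behaviour, so the following is only one assumed heuristic: flag participants who spent less time on the video page than the video lasts. The timing log and its participant ids are hypothetical.

```python
VIDEO_MINUTES = 24  # length of the instructional video in the case study


def skipped_video(minutes_on_video_page, video_minutes=VIDEO_MINUTES):
    """Flag a participant who left the video page before the video could end."""
    return minutes_on_video_page < video_minutes


# Hypothetical timing log: participant id -> minutes spent on the video page.
timing_log = {"p1": 25.3, "p2": 3.1, "p3": 24.0}
flagged = sorted(pid for pid, minutes in timing_log.items()
                 if skipped_video(minutes))
print(flagged)
```

A check like this would wrongly flag participants who watch at a higher playback speed, so flagged responses are probably better reviewed by hand than discarded automatically.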

Regarding the recruitment of workers, the authors note that:

to avoid self-selection bias one should leave out details about the experiment in the task description;
they paid $7.50 per response, which is about $10 per hour;
the MTurk platform has more workers with at least some experience in computer programming than you might think.

To gather data from a sub-population, you can either use a pre-test within the task or an additional screening survey as a separate MTurk task. When the authors used the latter approach, they received responses from 1,000 workers at $0.50 each. In either case, you will end up paying workers whose data is of limited interest.
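
The arithmetic behind these figures can be sketched as follows. The function names are made up for illustration, and the totals cover worker rewards only: any platform fees are left out because the summary does not mention them.

```python
def hourly_rate(reward_usd, duration_minutes):
    """Convert a fixed per-task reward into an equivalent hourly rate."""
    return reward_usd * 60 / duration_minutes


def total_cost(responses, reward_usd):
    """Total rewards paid to workers, excluding any platform fees."""
    return responses * reward_usd


print(hourly_rate(7.50, 45))   # main experiment: 45 minutes at $7.50 -> 10.0
print(total_cost(1000, 0.50))  # screening survey: 1,000 responses at $0.50 -> 500.0
```

In other words, the screening survey alone costs $500 in rewards before a single participant has started the actual experiment.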

There were some data quality issues, but on a relatively small scale. Including at least one validity-indicator question (side note: This is some sort of poor man’s CAPTCHA: a question that is very easy to answer if you take your participation in the survey seriously.) helps you sort out nonsense and bot-generated data. This question should not be easily answerable using a search engine, because people will try to google answers to your questions.
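
A minimal sketch of how responses might be filtered on such a question is shown below. The response records, the `check` field, and the expected answer are all hypothetical; the actual question used by the authors is not given here.

```python
# Hypothetical responses; "check" holds the answer to a validity-indicator
# question such as "Select 'blue' to show that you are reading this."
EXPECTED_ANSWER = "blue"

responses = [
    {"id": "r1", "check": "blue", "score": 7},
    {"id": "r2", "check": "red", "score": 9},
    {"id": "r3", "check": " Blue ", "score": 4},
]


def passed_check(response, expected=EXPECTED_ANSWER):
    """Compare the answer to the expected one, ignoring case and whitespace."""
    return response["check"].strip().lower() == expected


valid = [r["id"] for r in responses if passed_check(r)]
print(valid)
```

Normalising case and whitespace avoids discarding attentive participants over trivial formatting differences in a free-text answer.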

Summary

Amazon Mechanical Turk (MTurk) can be a useful tool to recruit a large number of participants for a study.
Take measures to ensure that you get high-quality responses if you make use of a crowdsourcing platform.