Machine learning data can be mined from existing sources, but just as often it is created from scratch with the help of human annotators, who create examples that the model can learn from. This week’s article discusses seven don’ts (and some dos) for human annotation.
Why it matters
For a very long time, computer science and AI were mostly about theoretical problems with solutions that were either right or wrong, but with the advent of big data both disciplines have started to look more like empirical sciences.
Unfortunately, their methods are often still based on the mathematical ideal of truth, even though in reality “truth” is entirely relative and mostly related to agreement and consensus.
Consider for example the idea of a gold standard or ground truth, which is obtained by having humans annotate a small amount of example data. The quality of the resulting set of annotations can be established by measuring the inter-annotator agreement, i.e. the average pairwise probability that two people agree about an annotation.
How much sense does this make when humans are asked to provide annotations for things that are highly subjective, like interpretations of music or poems? Very little!
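The pairwise agreement measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (a dict of example ids mapping to annotator labels), the function name `pairwise_agreement`, and the toy labels are all hypothetical and do not come from the article.

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Share of annotator pairs, pooled over all examples, that chose the same label.

    `annotations` maps example id -> {annotator id -> label} (assumed layout).
    Returns None when no example has at least two annotators.
    """
    agreeing_pairs = 0
    total_pairs = 0
    for labels_by_annotator in annotations.values():
        labels = list(labels_by_annotator.values())
        for first, second in combinations(labels, 2):
            total_pairs += 1
            if first == second:
                agreeing_pairs += 1
    if total_pairs == 0:
        return None
    return agreeing_pairs / total_pairs

# Hypothetical toy data: two sentences, three annotators each.
toy_annotations = {
    "s1": {"ann_a": "pos", "ann_b": "pos", "ann_c": "neg"},
    "s2": {"ann_a": "neg", "ann_b": "neg", "ann_c": "neg"},
}

# 4 of the 6 annotator pairs agree in this toy data.
agreement = pairwise_agreement(toy_annotations)
print(agreement)
```

A single number like this hides which examples caused the disagreement, which is exactly the information the article argues should not be thrown away.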
How the study was conducted
The observations and conclusions in the article are based on experiences with various human annotation projects.
What discoveries were made
The article discusses seven myths about human annotations, and also talks a bit about some practical implications.
The seven myths
Make sure you avoid these misconceptions about human annotation.
One truth
Most data collection efforts assume that there is one correct interpretation for every input example.
This might be true for simple input examples, but for more complex examples (e.g. images or sentences, even seemingly simple ones) it is possible that different people interpret them differently. Even experts don’t always agree with each other.
Disagreement is bad
To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.
If two people are given the same annotation task, they might not generate the same ground truth. This is often interpreted as a sign that the annotation task is poorly defined or that the annotators lack sufficient training.
However, it is actually a sign that the input example is ambiguous or vague, or that the annotator is not doing a good job. In other words: disagreement gives us information!
Detailed guidelines help
When specific cases continuously cause disagreement, more instructions are added to limit interpretations.
To avoid disagreement, researchers often provide detailed guidelines that are designed to obtain annotations that are performed more consistently.
While this does reduce disagreement, it does not improve the quality of annotations. Rather, it forces annotators to make choices that they believe are actually “wrong” and may also lead to annotations that are not representative of how people would normally interpret input examples.
Finally, when crowdsourcing is used one is forced to keep the annotation instructions simple, as microtask workers won’t read guidelines that are long or complex. For researchers, simple instructions have the additional benefit that annotation tasks can be designed in a lot less time.
One is enough
Most annotated examples are evaluated by one person.
Human annotations are costly to generate. Most examples are therefore annotated by just one person. Only a few are annotated by multiple people so that the inter-annotator agreement can be measured.
There are many cases where one perspective isn’t enough. The authors see that in some cases there might be as many as five or six popular interpretations, which can’t all be captured by one person.
How many annotations do you need for each example then? Experimental results for a study with sentence annotations suggest that .
Experts are better
Human annotators with domain knowledge provide better annotated data.
Medical texts are usually annotated by medical experts. However, it turns out that experts do not produce annotations of significantly better quality than non-experts.
It seems that experts may even be hampered by their expertise, as they may “see” things that can’t actually be inferred from the input example, but are purely based on their knowledge about a domain. Non-experts do not suffer from this problem (but do make other mistakes).
Experts may also have perspectives that differ from those of laypeople. The authors mention two examples of projects where there is little overlap between the tags that experts assign to museum artifacts and videos, and the keywords that are actually used by people when they tag or search for things.
All examples are created equal
The mathematics of using ground truth treats every example the same; either you match the correct result or not.
Each annotated example is normally considered to be equally important and thus given the same weight. But the examples are clearly not the same: some are clear, while others may be ambiguous. It makes sense to give a higher weight to examples with little or no disagreement.
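One way to picture agreement-based weighting is the sketch below. It is an assumption for illustration, not the authors' formula: each example's weight is simply the share of annotators who chose its most common label, and the names `example_weight` and `weighted_accuracy` as well as the toy data are hypothetical.

```python
from collections import Counter

def example_weight(labels):
    """Share of annotators who chose the most common label (1.0 means full agreement)."""
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)

def weighted_accuracy(predictions, crowd_labels):
    """Accuracy against the majority label, with clear examples counting more.

    `predictions` maps example id -> predicted label.
    `crowd_labels` maps example id -> list of labels from different annotators.
    """
    total_weight = 0.0
    correct_weight = 0.0
    for example_id, labels in crowd_labels.items():
        majority_label, _ = Counter(labels).most_common(1)[0]
        weight = example_weight(labels)
        total_weight += weight
        if predictions[example_id] == majority_label:
            correct_weight += weight
    return correct_weight / total_weight

# Hypothetical toy data: one clear example and one ambiguous example.
toy_crowd_labels = {
    "e1": ["cat", "cat", "cat", "cat"],   # weight 1.0
    "e2": ["cat", "dog", "dog", "bird"],  # weight 0.5, majority label "dog"
}
toy_predictions = {"e1": "cat", "e2": "cat"}

score = weighted_accuracy(toy_predictions, toy_crowd_labels)
print(score)
```

In this toy case the model is right on the clear example and wrong on the ambiguous one, so the weighted score is higher than the unweighted accuracy of 0.5: a mistake on an example that the annotators themselves could not agree on is penalized less.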
Once done, forever valid
Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
The interpretation of some types of input example may change over time, for instance when they mention news or historical events, music, or trends. Existing training data should therefore be updated, and new data collected, continuously.
Working with crowd truth
The key element of crowd truth is that multiple people annotate the same objects, which makes it possible to gather multiple perspectives and interpretations.
To facilitate this, it might be useful to give annotators the ability to provide multiple interpretations themselves and .
Another thing that’s important to discuss is the issue of annotation quality. Crowd truth encourages disagreement, but that doesn’t mean that disagreement is always a good sign. There are two metrics that can be computed for annotators:
an annotator-example disagreement score, which shows how much an annotator disagrees with the crowd for each example (that is reasonably clear), and
an annotator-annotator disagreement score, which is calculated by constructing a pairwise confusion matrix between annotators. This shows whether there are consistently like-minded annotators.
Annotators who tend to disagree with the crowd consistently and do not generally agree with any other annotators are low-quality annotators.
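The two annotator metrics above can be approximated with a simplified sketch. Several things here are assumptions rather than the article's definitions: the pairwise confusion matrix is reduced to a plain disagreement rate per annotator pair, all examples are scored (without first filtering for reasonably clear ones), and the function names, thresholds, and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def crowd_disagreement(annotations):
    """Per annotator: average share of co-annotators who chose a different label."""
    scores = defaultdict(list)
    for labels in annotations.values():
        for annotator, label in labels.items():
            others = [l for a, l in labels.items() if a != annotator]
            if others:
                scores[annotator].append(sum(l != label for l in others) / len(others))
    return {annotator: sum(s) / len(s) for annotator, s in scores.items()}

def pairwise_disagreement(annotations):
    """Per annotator pair: share of shared examples on which the two labels differ."""
    shared = defaultdict(int)
    differing = defaultdict(int)
    for labels in annotations.values():
        for a, b in combinations(sorted(labels), 2):
            shared[(a, b)] += 1
            if labels[a] != labels[b]:
                differing[(a, b)] += 1
    return {pair: differing[pair] / shared[pair] for pair in shared}

def low_quality_annotators(annotations, crowd_threshold=0.5, peer_threshold=0.5):
    """Flag annotators who disagree with the crowd and have no like-minded peer.

    Both thresholds are arbitrary illustrative values.
    """
    crowd = crowd_disagreement(annotations)
    pairs = pairwise_disagreement(annotations)
    flagged = []
    for annotator, score in crowd.items():
        peer_scores = [d for pair, d in pairs.items() if annotator in pair]
        has_like_minded_peer = any(d < peer_threshold for d in peer_scores)
        if score > crowd_threshold and not has_like_minded_peer:
            flagged.append(annotator)
    return sorted(flagged)

# Hypothetical toy data: annotator "d" never agrees with anyone.
toy_annotations = {
    "e1": {"a": "x", "b": "x", "c": "x", "d": "y"},
    "e2": {"a": "y", "b": "y", "c": "y", "d": "x"},
    "e3": {"a": "x", "b": "x", "c": "y", "d": "z"},
}

print(low_quality_annotators(toy_annotations))
```

In the toy data, annotator "c" disagrees with the crowd slightly more than half the time but still agrees with "a" and "b" on most examples, so only "d" is flagged. Checking for like-minded peers is what keeps a legitimate minority interpretation from being treated as noise.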
Different people may interpret examples differently, so examples can be annotated in more than one way
Disagreement in annotations is a sign that an example is vague or ambiguous
Annotation guidelines should be kept as simple as possible
Examples should be annotated by multiple people
Experts do not make better annotations than non-experts
Examples that are clear should be given a higher weight than examples that are vague or ambiguous
New human-annotated data should be collected continuously to capture changed perspectives and interpretations