Fake news vs satire: A dataset and analysis (2018)

An orange president waves an American flag, while also holding a Russian one

Social media platforms have a moral obligation to prevent the spread of fake news, while still allowing users to freely share satire. This can only be achieved using advanced content filters. Golbeck et (many, many) al. compiled a dataset of fake news and satire, and built a simple classifier that can tell the two types of stories apart.

Why it matters

While fake news isn’t a completely new phenomenon, its impact on society at large has increased significantly in the past few years. This is partially due to social media like Facebook, Twitter, and YouTube, which facilitate the spread of fake news.

To combat fake news at web-scale, we need a way to automatically distinguish fake news from other types of information.

This is not an easy task. Let’s first look at the definition of fake news:

Fake news is information, presented as a news story, that is factually incorrect and designed to deceive the consumer into believing it is true.

Now let’s look at something that is not fake news: satire. Satirical articles mimic the form of a news story, but they are written to entertain rather than to deceive.

Because we only want to prevent the spread of real fake news, an automated classifier must be able to tell fake news and satirical articles apart.

How the study was conducted

The authors first created a dataset that consists of recent fake news and satirical articles about American politics. (You can find the dataset on GitHub.) Articles were selected from many different sources to reduce the chance that the topics discussed in the articles, or a particular publication’s writing style, would affect the classifier. Articles that are not clearly fake news (i.e., easily rebutted and clearly deceptive) or clearly satire were excluded from the dataset for a similar reason.

To find differences in language (the vocabulary that’s used), the authors represented each article as a simple word vector that’s labelled either “Fake” or “Satire”. A model was trained using a multinomial naive Bayes classifier and tested using 10-fold cross validation.
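This setup can be sketched with scikit-learn. The corpus below is invented for illustration; the real dataset contains far more articles.

```python
# Bag-of-words features fed to a multinomial naive Bayes classifier,
# evaluated with 10-fold cross validation, mirroring the paper's setup.
# The headlines below are made-up stand-ins for the real dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

fake = [
    "shocking proof the election was rigged by a secret cabal",
    "doctors hide this miracle cure from the public",
    "leaked memo reveals government mind control program",
    "celebrity arrested in cover-up of massive fraud scheme",
    "secret documents prove the moon landing was staged",
    "new evidence the president forged his birth certificate",
    "insider exposes hidden plot to poison the water supply",
    "banned report shows vaccines cause widespread harm",
    "whistleblower reveals aliens held in government facility",
    "hidden camera catches senator selling state secrets",
]
satire = [
    "area man heroically finishes entire bag of chips",
    "nation agrees to pretend monday never happened",
    "local cat elected mayor in landslide victory",
    "study finds meetings could have been emails all along",
    "man proudly announces he has read one book this year",
    "congress passes bill requiring naps after lunch",
    "office printer files formal complaint about workload",
    "weather forecast apologizes for being right for once",
    "dog declares backyard an independent republic",
    "city installs fourth traffic light nobody asked for",
]

texts = fake + satire
labels = [0] * len(fake) + [1] * len(satire)  # 0 = fake news, 1 = satire

X = CountVectorizer().fit_transform(texts)           # word-count vectors
scores = cross_val_score(MultinomialNB(), X, labels, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.2f}")
```

On a toy corpus like this the accuracy is not meaningful; the point is the pipeline: count vectors in, naive Bayes model, stratified 10-fold evaluation.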

Then, the authors created a list of themes that appeared throughout the dataset, including hyperbolic criticism of a person, conspiracy theories, sensationalist crime, and the paranormal.

After manually labelling each article in the dataset using these themes, the authors performed an analysis of the correlations between them.

What discoveries were made

The multinomial naive Bayes classifier achieves an accuracy of 79.1%, which suggests that there are clear differences between the language used in fake news and in satire.

More than two-thirds of all articles take hyperbolic positions against a person, while conspiracy theories appear in almost 30% of all articles, most of which are fake news stories. Sensationalist crime is another theme that appears more often in fake news than in satire. Paranormal themes, on the other hand, are more common in satire.

Fake news stories tend to have more themes than satirical articles. The most common pairing of themes overall was hyperbolic criticism combined with conspiracy theories, e.g. in articles about President Obama’s birth certificate.

When the authors subsequently added themes to the word vectors, they discovered that the word vectors can also be used to determine the presence of certain themes in an article.
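The paper’s summary here doesn’t pin down the exact model, but one plausible sketch is to treat each theme as a binary label and train a one-vs-rest naive Bayes classifier on the same word vectors. Everything below (the articles and their theme labels) is invented for illustration.

```python
# Predicting themes from bag-of-words vectors, sketched as a multi-label
# one-vs-rest classifier. Articles and theme annotations are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

articles = [
    "secret cabal rigged the election says anonymous insider",
    "president forged birth certificate claims shocking report",
    "grisly crime spree shocks quiet suburban town",
    "ghost sighting confirmed by local paranormal society",
    "insider exposes hidden plot behind gruesome murders",
    "haunted house conspiracy covered up by town officials",
]
themes = [
    {"conspiracy"},
    {"conspiracy", "hyperbolic criticism"},
    {"sensationalist crime"},
    {"paranormal"},
    {"conspiracy", "sensationalist crime"},
    {"conspiracy", "paranormal"},
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(articles)

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(themes)  # one binary column per theme

# One binary naive Bayes model per theme, trained on the word vectors.
model = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

new = vectorizer.transform(["insider claims cabal rigged the vote"])
predicted = binarizer.inverse_transform(model.predict(new))
print(predicted)  # tuple of themes the model assigns to the new article
```

The design choice worth noting: because the features are the same word counts used for fake-vs-satire classification, any theme signal the model picks up is, by construction, visible in the vocabulary alone.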

The important bits

  1. A simple bag of words approach is enough to tell fake news and satirical articles apart
  2. Hyperbolic criticism is a popular theme in both fake news and satire
  3. Fake news typically has different combinations of themes than satirical articles