Comparing algorithms for extracting content from web pages
The World Wide Web contains a wealth of information in the form of HTML web pages. Extracting information from web pages using scraping tools is not an easy task. While HTML web pages are technically machine-readable, in practice they often might as well be considered unstructured.
This is because web pages don’t just include the information that you need, but also a lot of secondary information in the form of irrelevant boilerplate content, like headers, footers, navigational links, and advertisements. The distinction between main content and boilerplate content isn’t always clear: depending on your use case, page elements like comments or the “About the article” box below might be considered part of either category.
Many main content extraction systems have been written over the past decades, but the algorithms used generally fall into one of two categories:
- Heuristic approaches use heuristic rules (often in the form of trees) to identify one or more blocks of main content. While these rules are efficient to execute, they rely heavily on human expertise for their design.

  Many heuristics are based on the assumption that the markup for main content contains fewer HTML tags than that of boilerplate content, or on similar assumptions about the ratio between words and child nodes.

  Systems in this category can often be used on all web pages (albeit with mixed results), like Mozilla's Readability extractor. Others, like Fundus, are designed to extract main content from specific news websites.

- Machine learning approaches train classifiers to label regions of a web page as main or boilerplate content. Boilerpipe, the first system to use this approach, used structure, text, and text density features. Newer systems are often based on sequence labeling methods and deep neural networks. Some approaches even render web pages in order to extract visual features!
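To make the tag-density assumption concrete, here is a minimal toy sketch (not any of the systems discussed) that scores an HTML fragment by its word-to-tag ratio; the function name and example fragments are illustrative:

```python
import re

def words_per_tag(html: str) -> float:
    """Toy density heuristic: main content tends to pack many words into
    few tags, while boilerplate (menus, footers) does the opposite."""
    tags = re.findall(r"<[^>]+>", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split()) / max(len(tags), 1)

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = "<p>Long paragraphs of article text contain many words but very few tags.</p>"
print(words_per_tag(nav), words_per_tag(article))  # the article scores far higher
```

Real heuristic extractors apply this kind of scoring per DOM subtree rather than per raw string, but the underlying signal is the same.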
Although a lot of work has gone into developing better main content extraction systems, relatively limited effort has been spent on developing resources for reproducible experiments. Most systems are only evaluated using small datasets. Systems produced outside of academia are often not evaluated at all and are sometimes ignored entirely in evaluation studies.
The authors of this paper have combined eight common evaluation datasets into one large dataset, which they then used to evaluate 14 main content extraction systems:
| Extractor | Language | Approach |
| --- | --- | --- |
| BTE | Python | Heuristic: HTML tag distribution |
| Goose3 | Python | Heuristic: rule-based |
| jusText | Python | Heuristic: rule-based |
| Newspaper3k | Python | Heuristic: rule-based (for news) |
| Readability | JavaScript | Heuristic: rule-based |
| Resiliparse | Python | Heuristic: rule-based |
| Trafilatura | Python | Heuristic: rule-based |
| news-please | Python | Meta-heuristic: rule-based (for news) |
| Boilerpipe | Java | AI: text node classification |
| Dragnet | Python | AI: text node classification |
| ExtractNet | Python | AI: text node classification |
| Go DOM Distiller | Go | AI: text node classification |
| BoilerNet | Python (+JS) | AI: sequence labeling (LSTM) |
| Web2Text | Python | AI: sequence labeling (HMM+CNN) |
These 14 extractors are also compared with five HTML-to-text conversion tools that simply extract all text from a web page, as a baseline.
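To see why such a naive baseline achieves near-perfect recall, consider a bare-bones converter built on Python's standard `html.parser` (a sketch, not one of the five tools that were evaluated): it keeps every text node, main content and boilerplate alike.

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    """Baseline 'converter': keep every text node, boilerplate included."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = ("<nav>Home | About</nav>"
        "<article><p>The actual story.</p></article>"
        "<footer>Copyright 2024</footer>")
dumper = TextDumper()
dumper.feed(page)
print(" ".join(dumper.chunks))
# The main content is all there (perfect recall), but so are the
# navigation and footer (poor precision).
```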
The results show that almost all extractors perform reasonably well on simple web pages that largely consist of main content. This is even true for the basic HTML conversion tools due to their near-perfect recall.
The differences between extractors and simple converters become larger on more complex pages. Baseline performance is still fairly high, with an F₁ score of 0.738, which suggests that most pages in the dataset primarily consist of main content.
No single extractor performs best at every page complexity level, but a few do slightly better than the rest: Trafilatura has the best overall mean F₁ score (0.883), while Readability has the highest median F₁ score (0.970) and the most predictable behavior. Most models have their own strengths and weaknesses; Readability is a notable exception, as it appears to work well with all types of web pages.
Another interesting observation is that heuristic extractors perform best and are the most robust across the board, whereas large neural models perform surprisingly poorly, especially on the most complex pages for which they were primarily designed!
To see whether extraction performance could be further improved, the researchers defined three ensembles on top of the individual extraction systems:
- Majority vote: For each token in the HTML, check whether the five tokens to its left and right appear in an extractor's output. If so, that extractor "votes" for the token. If at least two thirds of the systems (including the baseline HTML-to-text converters) vote for a token, it is considered part of the main content.

- Majority vote best: This ensemble is based on the same principle as the majority vote, except that only the nine best-performing main content extractors are included.

- Majority vote best (weighted): The same nine content extractors get to vote for tokens, but now votes from the three best extractors (Readability, Trafilatura, and Goose3) count double.
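Under one reading of the voting rule (a token plus its surrounding window must appear as a contiguous run in an extractor's output), the unweighted ensemble can be sketched as follows; the token lists, outputs, and helper names are illustrative, not the paper's implementation:

```python
def contains_run(needle, haystack):
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def ensemble_vote(html_tokens, outputs, threshold=2 / 3, window=5):
    """Keep a token when at least `threshold` of the extractor outputs
    contain the token together with its surrounding context window."""
    kept = []
    for i, tok in enumerate(html_tokens):
        context = html_tokens[max(0, i - window): i + window + 1]
        votes = sum(contains_run(context, out) for out in outputs)
        if votes >= threshold * len(outputs):
            kept.append(tok)
    return kept

page = "home about the quick brown fox jumps over the lazy dog footer links".split()
main = page[2:11]              # two extractors return only the article text...
outputs = [main, main, page]   # ...one "baseline" returns everything
print(ensemble_vote(page, outputs, window=2))
# Tokens deep inside the article survive; in this simplified sketch, tokens
# near the content boundary lose votes because their window straddles
# boilerplate.
```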
All ensembles outperform the individual extractors, with the weighted vote ensemble achieving the best results. The complete results are shown in the table below:
| Model | Mean Prec. | Mean Recall | Mean F₁ | Median Prec. | Median Recall | Median F₁ |
| --- | --- | --- | --- | --- | --- | --- |
| (Best weighted) | 0.922 | 0.912 | 0.899 | 0.986 | 0.981 | 0.970 |
| (Best only) | 0.926 | 0.892 | 0.889 | 0.992 | 0.976 | 0.973 |
| (Majority all) | 0.930 | 0.879 | 0.885 | 0.996 | 0.971 | 0.974 |
| Trafilatura | 0.913 | 0.895 | 0.883 | 0.989 | 0.965 | 0.957 |
| Readability | 0.921 | 0.856 | 0.861 | 0.991 | 0.972 | 0.970 |
| Resiliparse | 0.863 | 0.901 | 0.859 | 0.940 | 0.993 | 0.942 |
| DOM Distiller | 0.894 | 0.864 | 0.858 | 0.983 | 0.970 | 0.959 |
| Web2Text | 0.797 | 0.944 | 0.841 | 0.885 | 0.984 | 0.917 |
| Boilerpipe | 0.908 | 0.825 | 0.834 | 0.973 | 0.966 | 0.946 |
| Dragnet | 0.901 | 0.810 | 0.823 | 0.980 | 0.950 | 0.943 |
| BTE | 0.796 | 0.897 | 0.817 | 0.927 | 0.965 | 0.936 |
| Newspaper3k | 0.896 | 0.803 | 0.816 | 0.994 | 0.961 | 0.958 |
| news-please | 0.895 | 0.802 | 0.815 | 0.994 | 0.961 | 0.958 |
| Goose3 | 0.899 | 0.779 | 0.810 | 0.999 | 0.919 | 0.940 |
| BoilerNet | 0.840 | 0.816 | 0.798 | 0.944 | 0.938 | 0.895 |
| ExtractNet | 0.858 | 0.773 | 0.791 | 0.963 | 0.915 | 0.911 |
| jusText | 0.794 | 0.769 | 0.759 | 0.949 | 0.921 | 0.904 |
- No single main content extractor clearly outperforms the others, although Readability appears to do well most of the time
- Heuristic models generally outperform neural models, especially on more complex web pages