Comparison of Common Crawl News & GDELT

Abstract

The corpus of worldwide news is important for natural language processing, knowledge graphs, large language models, and other technical efforts. It is also important for understanding the people, places, organizations, and events that interact in real time every day. This paper compares two news datasets used for these tasks today: the Global Database of Events, Language, and Tone (GDELT) and Common Crawl News. Our research highlights the strengths and limitations of each dataset by analyzing their content and coverage. Notably, while GDELT relies on broadcast, print, and web news from across the globe, Common Crawl News focuses on news sites from around the world gathered through web crawling. Our analysis revealed considerable differences in where the two datasets source their news.

Publication type: Proceedings Paper
Year: 2024
Language: English
Featured Keywords

Open Source Data
News Data
NLP
LLMs