Web applications
E-books
Project Gutenberg Corpus Builder
Search the Project Gutenberg database and download ebooks in various formats.
News sources
The Times Corpus Builder
Search The Times and download articles in various formats.
The Guardian Corpus Builder
Search The Guardian and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.
The New York Times Corpus Builder
Search The New York Times and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.
NPR Corpus Builder
Search NPR and download articles in various formats.
La Presse Corpus Builder
Search La Presse and download articles in various formats.
Corriere della Sera Corpus Builder
Search Corriere della Sera and download articles in various formats.
Frankfurter Allgemeine Corpus Builder
Search the Frankfurter Allgemeine Zeitung and download articles in various formats.
News meta-discourse
The Guardian Comments Corpus Builder
Collect a corpus of Guardian article comments based on a keyword search or URL input.
Le Figaro Comments Corpus Builder
Collect a corpus of Le Figaro article comments based on a keyword search or URL input.
Social media
BlueskyScraper
Scrape and download posts from the Bluesky social network. Also available as a browser extension.
BlueskyStreamer
Stream Bluesky posts in real time and download in various formats. Also available as part of the BlueskyScraper browser extension.
MastoScraper
Scrape and download posts from the Mastodon social network. Also available as a browser extension.
RedditScraper
Scrape and download posts from Reddit. Also available as a browser extension.
Corpus statistics
Type/token ratio
Calculate and compare the type/token ratio of different corpora as an estimate of their lexical diversity.
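As a rough illustration of the measure itself, the type/token ratio is the number of distinct word forms (types) divided by the total number of words (tokens). The sketch below is not the web application's implementation: the function names `tokenize` and `type_token_ratio` are hypothetical, and the tokenization (lowercasing, then splitting on runs of word characters) is a simplifying assumption.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text: str) -> float:
    """Return distinct tokens (types) divided by total tokens; 0.0 for empty input."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Two toy "corpora" for comparison.
corpus_a = "the cat sat on the mat"        # 6 tokens, 5 types
corpus_b = "a rose is a rose is a rose"    # 8 tokens, 3 types

print(round(type_token_ratio(corpus_a), 3))  # 0.833
print(round(type_token_ratio(corpus_b), 3))  # 0.375
```

The ratio tends to fall as a text gets longer, because common words keep recurring, so comparisons are most meaningful between corpora of similar size or between equal-sized samples.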
Miscellaneous
Browser extensions
Some of the tools above, along with a few others, are available as browser add-ons:
APP Extractor
A browser extension to scrape and download documents from The American Presidency Project.
BlueskyScraper (browser add-on)
A browser extension to scrape or stream and download Bluesky posts. ⚠️ No longer actively maintained: use the web app instead.
DiscordScraper
A browser extension to scrape and download Discord messages.
MastoScraper (browser add-on)
A browser extension to scrape and download toots (posts on Mastodon). ⚠️ No longer actively maintained: use the web app instead.
Press Corpus Scraper
A browser extension to extract and download press articles from a variety of sources.
RedditScraper (browser add-on)
A browser extension to scrape and download Reddit posts. ⚠️ No longer actively maintained: use the web app instead.
TruthScraper
A browser extension to scrape and download posts from Truth Social.