Florent Moncomble’s corpus tools

I am a senior lecturer in English linguistics at the Université d’Artois in Arras, France. This page lists a number of apps I have been developing over the last few years for corpus collection and analysis. (Contact)

Notice

Web scraping and text mining are limited to research purposes in most jurisdictions (e.g. EU, UK, US).

Use of these apps is subject to prior verification of applicable copyright and privacy laws as well as service providers’ terms of use.

Also make sure to follow basic ethical principles and best practices in your discipline.

Please remember to cite the tools you use in your publications and presentations.

Web applications

E-books

Project Gutenberg Corpus Builder

Search the Project Gutenberg database and download ebooks in various formats.

News sources

The Times Corpus Builder

Search The Times and download articles in various formats.

The Guardian Corpus Builder

Search The Guardian and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.

The New York Times Corpus Builder

Search The New York Times and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.

NPR Corpus Builder

Search NPR and download articles in various formats.

La Presse Corpus Builder

Search La Presse and download articles in various formats.

Corriere della Sera Corpus Builder

Search the Corriere della Sera and download articles in various formats.

Frankfurter Allgemeine Corpus Builder

Search the Frankfurter Allgemeine Zeitung and download articles in various formats.

News meta-discourse

The Guardian Comments Corpus Builder

Collect a corpus of Guardian article comments based on a keyword search or URL input.

Le Figaro Comments Corpus Builder

Collect a corpus of Le Figaro article comments based on a keyword search or URL input.

Social media

BlueskyScraper

Scrape and download posts from the Bluesky social network. Also available as a browser extension.

BlueskyStreamer

Stream Bluesky posts in real time and download them in various formats. Also available as part of the BlueskyScraper browser extension.

MastoScraper

Scrape and download posts from the Mastodon social network. Also available as a browser extension.

RedditScraper

Scrape and download posts from Reddit. Also available as a browser extension.

Corpus statistics

Type/token ratio

Calculate and compare the type/token ratio of different corpora as an estimate of their lexical diversity.
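
As a rough illustration of the measure itself, the type/token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). The sketch below is not the app's implementation: the function name `type_token_ratio`, the naive regex tokenisation and the two sample texts are all assumptions made for the example.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return types (distinct word forms) divided by tokens (running words)."""
    # Assumed tokenisation: lowercase, then keep runs of word characters,
    # allowing internal apostrophes (e.g. "don't"). Real tools may differ.
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical mini-corpora, used only to show a comparison.
corpora = {
    "sample_a": "the cat sat on the mat and the dog sat on the rug",
    "sample_b": "a quick brown fox jumps over one lazy dog",
}

for name, text in corpora.items():
    print(f"{name}: {type_token_ratio(text):.3f}")
```

Because the ratio tends to fall as a text gets longer, comparisons are most meaningful between samples of similar size.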

Miscellaneous

WikiXl8

Look up Wikipedia articles across languages.

Browser extensions

Some of the tools above, along with a few others, are available as browser add-ons:

APP Extractor

A browser extension to scrape and download documents from The American Presidency Project.

BlueskyScraper (browser add-on)

A browser extension to scrape or stream and download Bluesky posts. ⚠️ No longer actively maintained: use the web app instead.

DiscordScraper

A browser extension to scrape and download Discord messages.

MastoScraper (browser add-on)

A browser extension to scrape and download toots (posts on Mastodon). ⚠️ No longer actively maintained: use the web app instead.

Press Corpus Scraper

A browser extension to extract and download press articles from a variety of sources.

RedditScraper (browser add-on)

A browser extension to scrape and download Reddit posts. ⚠️ No longer actively maintained: use the web app instead.

TruthScraper

A browser extension to scrape and download posts from Truth Social.

𝕏-Scraper

A browser extension to scrape and download tweets.
