Florent Moncomble’s corpus tools

I am a senior lecturer in English linguistics at the Université d’Artois in Arras, France. This page lists a number of apps I have been developing over the last few years for corpus collection and analysis.

Notice

In most jurisdictions, web scraping and text mining are permitted for research purposes only.

Before using these apps, check the applicable copyright and privacy laws as well as the service providers’ terms of use.

Also make sure to follow best practices and basic ethical principles.

Please remember to cite the tools you use in your publications and presentations.

Web applications

Project Gutenberg Corpus Builder

Search the Project Gutenberg database and download ebooks in various formats.

How to cite:

Moncomble, Florent. 2025. ‘Project Gutenberg Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/pgcorpusbuilder/. Download BibTeX file

The Times Corpus Builder

Search The Times and download articles in various formats.

How to cite:

Moncomble, Florent. 2025. ‘The Times Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/timescorpusbuilder/. Download BibTeX file

The Guardian Corpus Builder

Search The Guardian and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.

How to cite:

Moncomble, Florent. 2025. ‘The Guardian Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/guardiancorpusbuilder/. Download BibTeX file

The New York Times Corpus Builder

Search The New York Times and download articles in various formats. Also available as part of the Press Corpus Scraper browser extension.

How to cite:

Moncomble, Florent. 2025. ‘The NYT Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/nytcorpusbuilder/. Download BibTeX file

The NPR Corpus Builder

Search NPR and download articles in various formats.

How to cite:

Moncomble, Florent. 2025. ‘The NPR Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/nytcorpusbuilder/. Download BibTeX file

Corriere della Sera Corpus Builder

Search the Corriere della Sera and download articles in various formats.

How to cite:

Moncomble, Florent. 2025. ‘The Corriere Della Sera Corpus Builder’. Web application. 2025. https://corpustools.prendrelangue.fr/cdscorpusbuilder/. Download BibTeX file

BlueskyScraper

Scrape and download posts from the Bluesky social network. Also available as a browser extension.

How to cite:

Moncomble, Florent. 2025. ‘BlueskyScraper’. Web application. 2025. https://corpustools.prendrelangue.fr/blueskyscraper/. Download BibTeX file

BlueskyStreamer

Stream Bluesky posts in real time and download them in various formats. Also available as part of the BlueskyScraper browser extension.

How to cite:

Moncomble, Florent. 2025. ‘BlueskyStreamer’. Web application. 2025. https://corpustools.prendrelangue.fr/blueskystreamer/. Download BibTeX file

MastoScraper

Scrape and download posts from the Mastodon social network. Also available as a browser extension.

How to cite:

Moncomble, Florent. 2025. ‘MastoScraper’. Web application. 2025. https://corpustools.prendrelangue.fr/mastoscraper/. Download BibTeX file

Type/token ratio

Calculate and compare the type/token ratio of different corpora as an estimate of their lexical diversity.

How to cite:

Moncomble, Florent. 2025. ‘Type/Token Ratio’. Web application. 2025. https://corpustools.prendrelangue.fr/ttr/. Download BibTeX file
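
As a rough illustration of the measure this tool computes, the type/token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). The sketch below is not the app's own code: the function name `type_token_ratio`, the naive regex tokenisation, and the two sample texts are all assumptions made for the example.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by running words (tokens)."""
    # Assumed tokenisation: lowercase the text, then keep runs of letters,
    # allowing internal apostrophes (e.g. "don't"). Real tools may differ.
    tokens = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


# Two made-up sample texts standing in for corpora.
corpus_a = "the cat sat on the mat and the dog sat on the rug"
corpus_b = "a quick brown fox jumps over one lazy sleeping dog"

for name, text in (("A", corpus_a), ("B", corpus_b)):
    print(f"Corpus {name}: TTR = {type_token_ratio(text):.2f}")
```

Because the ratio tends to fall as a text gets longer, comparisons are more meaningful between samples of similar size.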

Browser extensions

Some of the tools above, along with a few others, are available as browser add-ons:

APP Extractor

A browser extension to scrape and download documents from The American Presidency Project.

How to cite:

Moncomble, Florent. (2023) 2024. ‘APP_Extractor’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/APP_extractor. Download BibTeX file

BlueskyScraper (browser add-on)

A browser extension to scrape or stream and download Bluesky posts.

How to cite:

Moncomble, Florent. (2024) 2025. ‘BlueskyScraper’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/blueskyscraper. Download BibTeX file

DiscordScraper

A browser extension to scrape and download Discord messages.

How to cite:

Moncomble, Florent. (2025) 2025. ‘DiscordScraper’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/DiscordScraper. Download BibTeX file

MastoScraper (browser add-on)

A browser extension to scrape and download toots (posts on Mastodon).

How to cite:

Moncomble, Florent. (2024) 2024. ‘MastoScraper’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/mastoscraper. Download BibTeX file

Press Corpus Scraper

A browser extension to extract and download press articles from a variety of sources.

How to cite:

Moncomble, Florent. 2024. ‘Press Corpus Scraper’. JavaScript. Arras, France: Université d’Artois. https://fmoncomble.github.io/press-corpus-scraper/. Download BibTeX file

RedditScraper

A browser extension to scrape and download Reddit posts.

How to cite:

Moncomble, Florent. (2024) 2024. ‘RedditScraper’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/redditscraper. Download BibTeX file

Social Corpus Scraper

A 4-in-1 bundle of BlueskyScraper, MastoScraper, RedditScraper and 𝕏-Scraper.

How to cite:

Moncomble, Florent. (2024) 2025. ‘Social Corpus Scraper’. JavaScript. https://github.com/fmoncomble/SocialCorpusScraper. Download BibTeX file

𝕏-Scraper

A browser extension to scrape and download tweets.

How to cite:

Moncomble, Florent. (2024) 2025. ‘X-Scraper’. JavaScript. Arras, France: Université d’Artois. https://github.com/fmoncomble/X-scraper. Download BibTeX file