About the Project
R Mailing List Archives
A searchable, browsable interface to the R Project mailing lists, preserving decades of community knowledge from language design debates to statistical methodology discussions.
Created and maintained by HJJB, LLC
Background
The R mailing lists have been the primary communication channel for the R community since 1997. The original archives are hosted by ETH Zurich and R-Forge as raw pipermail HTML. While functional, these archives are difficult to search, browse, or analyze at scale.
This project transforms those archives into structured, searchable data. Every message is parsed, threaded, and indexed. Email addresses are hashed for privacy. The result is available both through this web viewer and as downloadable datasets for research.
How It Works
Scrape
Raw mbox archives fetched from ETH Zurich and R-Forge pipermail servers.
Process
Messages parsed, threads reconstructed, email addresses replaced with SHA-256 hashes.
Publish
Per-list SQLite databases for this viewer. Apache Parquet files for analysis in R or Python.
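As a rough illustration of the thread reconstruction step, the sketch below groups messages by following reply headers back to a root message. It is a minimal base R example, not the project's actual pipeline: the column names message_id and in_reply_to, the toy data, and the helper find_root are all assumptions made for illustration.

```r
# Toy messages. Column names are assumptions, not the project's real schema.
msgs <- data.frame(
  message_id  = c("<a@x>", "<b@x>", "<c@x>", "<d@x>"),
  in_reply_to = c(NA, "<a@x>", "<b@x>", NA),
  stringsAsFactors = FALSE
)

# Lookup table: message id -> id of the message it replies to (NA if none).
parent_of <- setNames(msgs$in_reply_to, msgs$message_id)

# Hypothetical helper: walk up the reply chain until a message has no known
# parent. The 'seen' vector guards against malformed, cyclic reply headers.
find_root <- function(id, parent_of) {
  seen <- character(0)
  while (!is.na(parent_of[id]) &&
         parent_of[id] %in% names(parent_of) &&
         !(id %in% seen)) {
    seen <- c(seen, id)
    id <- parent_of[[id]]
  }
  id
}

# Every message sharing a root belongs to the same thread.
msgs$thread_root <- vapply(
  msgs$message_id, find_root, character(1),
  parent_of = parent_of, USE.NAMES = FALSE
)
```

Real archives often contain replies whose parent is missing from the archive, which is why the loop stops at the first unknown parent instead of failing.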
Available Datasets
All data is freely available as Parquet files from r-mailing-lists/data on GitHub.
Threads
data/threads.parquet
Thread-level summaries across all lists. Find the longest discussions or most active conversations without loading full message bodies.
Contributors
data/contributors.parquet
Aggregated statistics per author, including total messages, number of lists, and which lists they posted to.
Accessing the Data
Helper scripts handle downloading and caching automatically. No cloning required.
# One-line setup -- source directly from GitHub
source("https://raw.githubusercontent.com/r-mailing-lists/data/main/scripts/rml.R")
rml_available() # List all mailing lists
r_devel <- rml_read("r-devel") # Download & read a list
threads <- rml_read_threads() # Thread summaries
contribs <- rml_read_contributors() # Contributor stats
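The download-and-cache behaviour described above could look roughly like the sketch below. This is not the code in rml.R: the function name cached_download, the cache location, and the file naming are all assumptions, shown only to illustrate the idea of fetching a file once and reusing it afterwards.

```r
# Hypothetical helper, NOT part of rml.R: download a file once and reuse it.
# The cache directory name "rml-sketch" is an assumption for illustration.
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("rml-sketch", which = "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  # Only hit the network when no cached copy exists yet.
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Usage (commented out so the sketch has no side effects when sourced):
# path <- cached_download("<URL of a Parquet file>")
```

Keying the cache on the file name keeps the sketch short. A fuller version would also need a way to refresh stale copies when the published data is updated.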
Privacy
Email addresses are never included in any dataset. Author identity uses display names and SHA-256 hashes of email addresses, enabling author grouping without exposing contact information. The original messages are publicly archived on the R Project’s mailing list servers.
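To show how hashing allows author grouping without exposing addresses, here is a minimal sketch. It assumes the digest package is installed, and the normalization (trimming whitespace and lowercasing before hashing) is an assumption for illustration. The project's exact preprocessing is not documented here, so hashes produced by this sketch may not match those in the published datasets.

```r
library(digest)  # assumption: the 'digest' package is installed

# Hypothetical helper: hash each address with SHA-256 after light normalization.
hash_email <- function(email) {
  normalized <- tolower(trimws(email))
  vapply(normalized, digest, character(1),
         algo = "sha256", serialize = FALSE, USE.NAMES = FALSE)
}

# Toy data with made-up addresses.
posts <- data.frame(
  display_name = c("Ada", "Ada L.", "Bob"),
  email        = c("ada@example.org", "Ada@Example.org ", "bob@example.org"),
  stringsAsFactors = FALSE
)

posts$author_hash <- hash_email(posts$email)
posts$email <- NULL  # drop the raw address; only the hash is kept

# Posts by the same address share a hash even when display names differ.
posts_per_author <- table(posts$author_hash)
```

Because the same address always yields the same 64-character hash, posts can be grouped per author even when the display name varies between messages.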
Hosting & Advertising
This site is hosted on personal infrastructure. Due to the size of the archives (600,000+ messages across 30+ lists), a static site isn't feasible. The data requires server-side search and dynamic page rendering, so ads help offset the hosting costs. All underlying data is freely available as open Parquet files for anyone to download, analyze, or build their own tools with.
Related Projects
r-mailing-list-archive by Michael Chirico provides an R script that automatically downloads mbox files from the R mailing list servers, runnable locally or via GitHub Actions. It’s a great option if you want the raw mbox archives for command-line searching with grep/awk or to build your own datasets (e.g. for NLP).