
About the Project

R Mailing List Archives

A searchable, browsable interface to the R Project mailing lists, preserving decades of community knowledge from language design debates to statistical methodology discussions.

Created and maintained by HJJB, LLC

Background

The R mailing lists have been the primary communication channel for the R community since 1997. The original archives are hosted by ETH Zurich and R-Forge as raw pipermail HTML. While functional, these archives are difficult to search, browse, or analyze at scale.

This project transforms those archives into structured, searchable data. Every message is parsed, threaded, and indexed. Email addresses are hashed for privacy. The result is available both through this web viewer and as downloadable datasets for research.

How It Works

1. Scrape

Raw mbox archives fetched from ETH Zurich and R-Forge pipermail servers.

2. Process

Messages parsed, threads reconstructed, email addresses replaced with SHA-256 hashes.

3. Publish

Per-list SQLite databases for this viewer. Apache Parquet files for analysis in R or Python.
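To make the Process step more concrete, here is a minimal Python sketch of header parsing and thread reconstruction. It is an illustration under stated assumptions, not the project's actual pipeline: the sample messages, the function names `parse_message` and `assign_thread_ids`, and the choice to use the root message's ID as the `thread_id` are all hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample messages; a real run would read them from an mbox archive.
RAW_MESSAGES = [
    "Message-ID: <a@example.org>\nFrom: Ann <ann@example.org>\n"
    "Subject: lm() question\n\nHow do I ...",
    "Message-ID: <b@example.org>\nIn-Reply-To: <a@example.org>\n"
    "From: Bob <bob@example.org>\nSubject: Re: lm() question\n\nTry ...",
    "Message-ID: <c@example.org>\nIn-Reply-To: <b@example.org>\n"
    "From: Ann <ann@example.org>\nSubject: Re: lm() question\n\nThanks",
    "Message-ID: <d@example.org>\nFrom: Cy <cy@example.org>\n"
    "Subject: New package\n\nAnnouncing ...",
]


def parse_message(raw):
    """Extract the fields needed for threading from one raw message."""
    msg = message_from_string(raw)
    name, _address = parseaddr(msg["From"])  # the address itself is not kept
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],  # None for a thread starter
        "from_name": name,
        "subject": msg["Subject"],
        "body": msg.get_payload(),
    }


def assign_thread_ids(messages):
    """Label each message with the ID of the root message of its thread."""
    parent = {m["message_id"]: m["in_reply_to"] for m in messages}

    def root_of(message_id):
        seen = set()  # guards against reply cycles in malformed archives
        # Walk up while the parent is a message we actually have.
        while parent.get(message_id) in parent and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    for m in messages:
        m["thread_id"] = root_of(m["message_id"])
    return messages


messages = assign_thread_ids([parse_message(raw) for raw in RAW_MESSAGES])
for m in messages:
    print(m["thread_id"], m["from_name"], m["subject"])
```

A reply whose parent is missing from the archive becomes the root of its own thread in this sketch. A production pipeline would probably also consult the References header or subject lines.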

Available Datasets

All data is freely available as Parquet files from r-mailing-lists/data on GitHub.

One file per mailing list. Full message content with author, date, subject, body, and threading metadata.

Columns: from_name, date, subject, body, thread_id, in_reply_to, message_id

Thread-level summaries across all lists. Find the longest discussions or most active conversations without loading full message bodies.

Columns: list, subject, message_count, started, last_reply
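The thread summary columns can be read as a group-by over individual messages. The Python sketch below shows one plausible way to derive them. It assumes each message row carries a `list` name alongside its `thread_id` (the per-list files hold one list each, so the name would be attached when combining them) and that a thread's subject is taken from its earliest message. The function name `summarize_threads` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def summarize_threads(messages):
    """Collapse message rows into one summary row per (list, thread)."""
    groups = defaultdict(list)
    for m in messages:
        groups[(m["list"], m["thread_id"])].append(m)

    summaries = []
    for (list_name, _thread_id), msgs in groups.items():
        ordered = sorted(msgs, key=lambda m: m["date"])
        summaries.append({
            "list": list_name,
            "subject": ordered[0]["subject"],  # assumption: first message names the thread
            "message_count": len(ordered),
            "started": ordered[0]["date"],
            "last_reply": ordered[-1]["date"],
        })
    # Longest discussions first.
    return sorted(summaries, key=lambda s: s["message_count"], reverse=True)


# Hypothetical sample rows, deliberately out of date order.
MESSAGES = [
    {"list": "r-devel", "thread_id": "t1", "subject": "lm() question",
     "date": datetime(2020, 1, 1)},
    {"list": "r-devel", "thread_id": "t1", "subject": "Re: lm() question",
     "date": datetime(2020, 1, 3)},
    {"list": "r-devel", "thread_id": "t1", "subject": "Re: lm() question",
     "date": datetime(2020, 1, 2)},
    {"list": "r-help", "thread_id": "t2", "subject": "New package",
     "date": datetime(2021, 5, 5)},
]

for row in summarize_threads(MESSAGES):
    print(row["list"], row["subject"], row["message_count"])
```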

Aggregated statistics per author, including total messages, number of lists, and which lists they posted to.

Columns: name, message_count, list_count, lists
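As a rough Python illustration of how these per-author figures relate to the message data, the sketch below counts messages per display name and collects the lists each name appears on. Grouping by `from_name` alone, the function name `summarize_contributors`, and representing `lists` as a sorted Python list are assumptions. The published file may key authors differently or encode `lists` in another form.

```python
from collections import defaultdict


def summarize_contributors(messages):
    """Aggregate message rows into one row per author display name."""
    counts = defaultdict(int)
    lists = defaultdict(set)
    for m in messages:
        counts[m["from_name"]] += 1
        lists[m["from_name"]].add(m["list"])

    # Most active authors first.
    return [
        {
            "name": name,
            "message_count": counts[name],
            "list_count": len(lists[name]),
            "lists": sorted(lists[name]),
        }
        for name in sorted(counts, key=counts.get, reverse=True)
    ]


# Hypothetical sample rows.
ROWS = [
    {"from_name": "Bob", "list": "r-devel"},
    {"from_name": "Ann", "list": "r-devel"},
    {"from_name": "Ann", "list": "r-help"},
    {"from_name": "Ann", "list": "r-devel"},
]

for row in summarize_contributors(ROWS):
    print(row)
```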

Accessing the Data

Helper scripts handle downloading and caching automatically. No cloning required.

# One-line setup -- source directly from GitHub
source("https://raw.githubusercontent.com/r-mailing-lists/data/main/scripts/rml.R")

rml_available()                           # List all mailing lists
r_devel <- rml_read("r-devel")            # Download & read a list
threads <- rml_read_threads()             # Thread summaries
contribs <- rml_read_contributors()       # Contributor stats
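The download-and-cache behaviour described above can be sketched generically in Python. This is not a port of `rml.R`: the function `cached_fetch`, the `.parquet` file naming, and the injected `fetch` callable are all hypothetical, and the fake fetcher stands in for a real HTTP download so the example runs offline.

```python
import tempfile
from pathlib import Path


def cached_fetch(name, cache_dir, fetch):
    """Return the local path for `name`, calling `fetch(name)` only on a cache miss."""
    path = Path(cache_dir) / f"{name}.parquet"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(name))  # fetch must return the file contents as bytes
    return path


# Stand-in for a real download; records each call so cache hits are visible.
calls = []


def fake_fetch(name):
    calls.append(name)
    return b"placeholder bytes, not a real Parquet file"


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_fetch("r-devel", cache_dir, fake_fetch)
    second = cached_fetch("r-devel", cache_dir, fake_fetch)  # served from cache
    print(first.name, len(calls))
```

Passing the fetcher in as an argument keeps the caching logic independent of any particular URL layout.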

Privacy

Email addresses are never included in any dataset. Author identity uses display names and SHA-256 hashes of email addresses, enabling author grouping without exposing contact information. The original messages are publicly archived on the R Project’s mailing list servers.
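The idea of grouping authors by hash rather than by address can be shown in a few lines of Python. This is a sketch only: the normalization (trimming whitespace and lowercasing before hashing) and the function name `hash_email` are assumptions, and the project's actual preprocessing may differ.

```python
import hashlib


def hash_email(address):
    """Return the SHA-256 hex digest of a normalized email address."""
    normalized = address.strip().lower()  # assumption: case and padding are ignored
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two spellings of the same (hypothetical) address map to the same author key.
key_a = hash_email("Ann@Example.org ")
key_b = hash_email("ann@example.org")
print(key_a == key_b)
print(len(key_a))
```

The same address always produces the same 64-character digest, so messages can be grouped per author without the address appearing in the data.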

Hosting & Advertising

This site is hosted on personal infrastructure. At 600,000+ messages across 30+ lists, the archives are too large to serve as a static site: they need server-side search and dynamic page rendering. Ads help offset the resulting hosting costs. All underlying data remains freely available as open Parquet files for anyone to download, analyze, or build their own tools with.

Related Projects

r-mailing-list-archive by Michael Chirico provides an R script that automatically downloads mbox files from the R mailing list servers, runnable locally or via GitHub Actions. It’s a great option if you want the raw mbox archives for command-line searching with grep/awk or to build your own datasets (e.g. for NLP).
