How is the Robots Exclusion Protocol (robots.txt, RFC 9309) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.
Three Tranco top-1M lists have been combined into a single ranked list, see top-k-sites. The resulting list of 2 million websites is used to draw samples at multiple strata (1k, 5k, 10k, 100k, 1M, 2M).
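The actual combination procedure is documented in top-k-sites; purely for illustration, a minimal sketch (assuming three Tranco-format CSV files `rank,domain` with made-up file names, and a simple mean-rank aggregation with a penalty rank for missing domains) could look like this:

```python
import pandas as pd

files = ["tranco_1.csv", "tranco_2.csv", "tranco_3.csv"]  # hypothetical file names
MAX_RANK = 1_000_001  # penalty rank for domains missing from one of the lists

# read each list as a rank series indexed by domain
ranks = [
    pd.read_csv(f, names=["rank", "domain"]).set_index("domain")["rank"]
    for f in files
]

# union of all domains, missing ranks filled with the penalty value
merged = pd.concat(ranks, axis=1).fillna(MAX_RANK)
combined = merged.mean(axis=1).sort_values().rename("avg_rank").reset_index()
combined["rank"] = range(1, len(combined) + 1)

# cumulative strata used for sampling (top 1k ... top 2M)
strata = [1_000, 5_000, 10_000, 100_000, 1_000_000, 2_000_000]
samples = {k: combined.head(k) for k in strata}
```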
Since 2016, Common Crawl's web archives include a robots.txt data set from which the robots.txt captures are extracted. The extraction uses the columnar URL index; the necessary steps are described in the data preparation notebook.
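For illustration only (the actual steps are in the data preparation notebook), a minimal sketch of looking up and fetching a single robots.txt capture via the columnar URL index, assuming DuckDB with the httpfs extension and AWS credentials configured where S3 access requires them; the crawl label `CC-MAIN-2024-33` and the domain `example.com` are placeholders:

```python
import duckdb
import gzip
import requests

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")  # the commoncrawl bucket is in us-east-1

# the columnar index is partitioned by crawl and subset; 'robotstxt' holds the captures
index_glob = ("s3://commoncrawl/cc-index/table/cc-main/warc/"
              "crawl=CC-MAIN-2024-33/subset=robotstxt/*.parquet")

row = con.execute(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('{index_glob}')
    WHERE url_host_registered_domain = 'example.com'   -- placeholder domain
    LIMIT 1
""").fetchone()

url, filename, offset, length = row

# fetch the single WARC record (one gzip member) via an HTTP range request
resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
record = gzip.decompress(resp.content).decode("utf-8", errors="replace")
print(record[:500])  # WARC header, HTTP response header, robots.txt payload
```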
- top-k metrics notebook: first aggregations and a few plots
- user-agent metrics notebook: more plots about the user-agents addressed in robots.txt files (a minimal parsing sketch follows below)
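As a loose illustration of that aggregation (not the notebook's actual code), counting which user-agents are addressed across a set of robots.txt payloads could look like this; the two sample payloads are made up:

```python
from collections import Counter

def addressed_agents(robots_txt: str):
    """Yield the lower-cased product tokens of all 'User-agent:' lines."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()           # strip comments
        if line.lower().startswith("user-agent:"):
            yield line.split(":", 1)[1].strip().lower()

# two made-up robots.txt payloads standing in for the extracted captures
samples = [
    "User-agent: *\nDisallow: /private/\n",
    "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
]

counts = Counter()
for payload in samples:
    counts.update(set(addressed_agents(payload)))      # count each agent once per file
print(counts.most_common())
# e.g. [('*', 2), ('gptbot', 1)]
```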
Condensed results of this project were presented as a poster at the IIPC Web Archiving Conference 2025. A copy of the poster is available here.
This project is an extension of work done for a presentation at #ossym2022: "The robots.txt standard – Implementations and Usage". The corresponding code is found at ossym2022-robotstxt-experiments.
The idea of looking at multiple strata (top-k) is inspired by the work of Longpre et al., "Consent in crisis" (https://arxiv.org/abs/2407.14933), and Liu et al., "Somesite I used to crawl" (https://arxiv.org/abs/2411.15091).