How is the Robots Exclusion Protocol (robots.txt, RFC 9309) used on the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.
Three Tranco top-1M lists have been combined into a single ranked list, see top-k-sites. The resulting list of 2 million websites is used to draw samples at multiple strata (1k, 5k, 10k, 100k, 1M, 2M).
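The actual combination procedure is documented in top-k-sites; purely for illustration, a minimal sketch (assuming three Tranco-format CSV files `rank,domain` with made-up file names, and a simple mean-rank aggregation with a penalty rank for missing domains) could look like this:

```python
import pandas as pd

files = ["tranco_1.csv", "tranco_2.csv", "tranco_3.csv"]  # hypothetical file names
MAX_RANK = 1_000_001  # penalty rank for domains missing from one of the lists

# read each list as a rank series indexed by domain
ranks = [
    pd.read_csv(f, names=["rank", "domain"]).set_index("domain")["rank"]
    for f in files
]

# union of all domains, missing ranks filled with the penalty value
merged = pd.concat(ranks, axis=1).fillna(MAX_RANK)
combined = merged.mean(axis=1).sort_values().rename("avg_rank").reset_index()
combined["rank"] = range(1, len(combined) + 1)

# cumulative strata used for sampling (top 1k ... top 2M)
strata = [1_000, 5_000, 10_000, 100_000, 1_000_000, 2_000_000]
samples = {k: combined.head(k) for k in strata}
```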
Since 2016, Common Crawl's web archives include a robots.txt data set from which the robots.txt captures are extracted. The extraction uses the columnar URL index; the necessary steps are described in the data preparation notebook.
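For illustration only (the actual steps are in the data preparation notebook), a minimal sketch of looking up and fetching a single robots.txt capture via the columnar URL index, assuming DuckDB with the httpfs extension and AWS credentials configured where S3 access requires them; the crawl label `CC-MAIN-2024-33` and the domain `example.com` are placeholders:

```python
import duckdb
import gzip
import requests

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")  # the commoncrawl bucket is in us-east-1

# the columnar index is partitioned by crawl and subset; 'robotstxt' holds the captures
index_glob = ("s3://commoncrawl/cc-index/table/cc-main/warc/"
              "crawl=CC-MAIN-2024-33/subset=robotstxt/*.parquet")

row = con.execute(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('{index_glob}')
    WHERE url_host_registered_domain = 'example.com'   -- placeholder domain
    LIMIT 1
""").fetchone()

url, filename, offset, length = row

# fetch the single WARC record (one gzip member) via an HTTP range request
resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
record = gzip.decompress(resp.content).decode("utf-8", errors="replace")
print(record[:500])  # WARC header, HTTP response header, robots.txt payload
```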
- top-k metrics notebook: first aggregations and a few plots
- user-agent metrics notebook: more plots about the user-agents addressed in robots.txt files (a minimal parsing sketch follows below)
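As a loose illustration of that aggregation (not the notebook's actual code), counting which user-agents are addressed across a set of robots.txt payloads could look like this; the two sample payloads are made up:

```python
from collections import Counter

def addressed_agents(robots_txt: str):
    """Yield the lower-cased product tokens of all 'User-agent:' lines."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()           # strip comments
        if line.lower().startswith("user-agent:"):
            yield line.split(":", 1)[1].strip().lower()

# two made-up robots.txt payloads standing in for the extracted captures
samples = [
    "User-agent: *\nDisallow: /private/\n",
    "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n",
]

counts = Counter()
for payload in samples:
    counts.update(set(addressed_agents(payload)))      # count each agent once per file
print(counts.most_common())
# e.g. [('*', 2), ('gptbot', 1)]
```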
Condensed results of this project were presented as a poster at the IIPC Web Archiving Conference 2025. A copy of the poster is available here.
This project is an extension of work done for a presentation at #ossym2022: "The robots.txt standard – Implementations and Usage". The corresponding code is found at ossym2022-robotstxt-experiments.
The idea of looking at multiple strata (top-k) is inspired by the work of Longpre et al., "Consent in crisis" (https://arxiv.org/abs/2407.14933), and Liu et al., "Somesite I used to crawl" (https://arxiv.org/abs/2411.15091).