commoncrawl/robotstxt-experiments

Robots.txt Experiments and Metrics

How is the Robots Exclusion Protocol (robots.txt or RFC 9309) used in the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.

Top-K Sampling of Web Sites

Three Tranco top-1M lists have been combined into a single ranked list (see top-k-sites). The resulting list of 2 million web sites is used to draw samples at multiple strata: the top 1k, 5k, 10k, 100k, 1M and 2M sites.
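To illustrate the stratification, here is a minimal Python sketch. It assumes the combined list is already available as a sequence of host names ordered by rank, and that each stratum is the cumulative top-k prefix of that list; both are assumptions, as is the optional random sampling within a stratum. The function name `top_k_strata` and its parameters are hypothetical and not taken from the project code.

```python
import random
from typing import Dict, List, Optional, Sequence

# Strata cutoffs as named in the text above.
STRATA = (1_000, 5_000, 10_000, 100_000, 1_000_000, 2_000_000)


def top_k_strata(
    ranked_sites: Sequence[str],
    cutoffs: Sequence[int] = STRATA,
    sample_size: Optional[int] = None,
    seed: int = 42,
) -> Dict[int, List[str]]:
    """Map each cutoff k to the top-k prefix of the ranked list.

    Assumption: strata are cumulative prefixes (top 1k is contained in
    top 5k, and so on), not disjoint rank bands. If sample_size is set,
    a reproducible random sample is drawn from each prefix that is
    larger than sample_size.
    """
    rng = random.Random(seed)
    strata: Dict[int, List[str]] = {}
    for k in cutoffs:
        prefix = list(ranked_sites[:k])
        if sample_size is not None and len(prefix) > sample_size:
            prefix = rng.sample(prefix, sample_size)
        strata[k] = prefix
    return strata
```

With cumulative prefixes, metrics computed per stratum can be compared directly: a change between the top 1k and the top 1M then reflects the contribution of lower-ranked sites.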

Locating and Downloading Robots.txt Captures in Common Crawl's Web Archives

Since 2016, Common Crawl's web archives have included a robots.txt data set, from which the robots.txt captures are extracted. The captures are located using the columnar URL index. The necessary steps are described in the data preparation notebook.
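The lookup can be sketched roughly as follows. This is not the code from the data preparation notebook, only an illustration of the general approach: select rows of the columnar index whose `subset` is `robotstxt`, then fetch each WARC record with an HTTP range request. The table name `ccindex`, the example crawl ID and the helper names `robotstxt_index_query` and `warc_range_request` are assumptions; the column names follow the published schema of the columnar index.

```python
from typing import Dict, List, Tuple

# Public HTTP endpoint serving Common Crawl data files.
DATA_URL_PREFIX = "https://data.commoncrawl.org/"


def robotstxt_index_query(crawl: str, hosts: List[str], table: str = "ccindex") -> str:
    """Build a SQL query locating robots.txt captures for the given hosts.

    Assumption: the columnar index is registered as a table named
    `ccindex` in a SQL engine such as Athena or Spark SQL.
    """
    # Escape single quotes so host names cannot break out of the literal.
    quoted = ", ".join("'" + h.replace("'", "''") + "'" for h in hosts)
    return (
        "SELECT url_host_name, url, fetch_time, fetch_status,\n"
        "       warc_filename, warc_record_offset, warc_record_length\n"
        f"FROM {table}\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'robotstxt'\n"
        f"  AND url_host_name IN ({quoted})"
    )


def warc_range_request(
    warc_filename: str, offset: int, length: int
) -> Tuple[str, Dict[str, str]]:
    """Return URL and headers for fetching a single WARC record.

    HTTP byte ranges are inclusive, hence the `- 1` on the end position.
    """
    url = DATA_URL_PREFIX + warc_filename
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```

For example, `robotstxt_index_query("CC-MAIN-2024-10", ["example.com"])` yields a query for one crawl, and each result row can be passed to `warc_range_request` to obtain the request parameters for any HTTP client. The response body is a gzip-compressed WARC record containing the robots.txt capture.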

Metrics and Findings

Poster at IIPC Web Archiving Conference 2025

Condensed results of this project were presented as a poster at the IIPC Web Archiving Conference 2025. A copy of the poster is available here.

Notes and Credits

This project is an extension of work done for a presentation at #ossym2022: "The robots.txt standard – Implementations and Usage". The corresponding code is found at ossym2022-robotstxt-experiments.

The idea of looking at multiple strata (top-k) is inspired by the work of Longpre et al., "Consent in crisis" (https://arxiv.org/abs/2407.14933), and Liu et al., "Somesite I used to crawl" (https://arxiv.org/pdf/2411.15091).
