Description
Dear Common Crawl Team,
I am analyzing the Common Crawl data for the October, November, and December 2024 collections. Specifically, I have processed the WET files from the October 2024 archive to extract domain names and sample textual content.
During this process, I identified approximately 5 million domains that are present in the WET files of the October 2024 crawl but are missing from the Oct–Nov–Dec 2024 Web Graph domain rankings.
Could you kindly clarify whether this discrepancy is expected? Are there specific filtering criteria applied to the Web Graph generation that would account for the exclusion of such domains?
Here are a few example domains missing from the Web Graph domain ranks file:
geneticlight.blogspot.com
sire.neighborhoodguides.com
cs.gzhbautoparts.com
booking.cheapflyme.com
cherry.22006.net
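A minimal sketch of this kind of comparison, in case it helps pin down where the mismatch comes from. It rests on two assumptions that are not confirmed here: that hostnames are taken from the `WARC-Target-URI` headers of the WET records, and that the ranks file stores names in reversed notation (for example `com.blogspot.geneticlight`). The function names are hypothetical. If the domain-level ranks aggregate subdomains under their registered domain, each hostname would have to be reduced to that domain first; that step is not shown.

```python
from urllib.parse import urlsplit


def hosts_from_wet_lines(lines):
    """Collect hostnames from the WARC-Target-URI headers of WET records."""
    hosts = set()
    for line in lines:
        if line.startswith("WARC-Target-URI:"):
            # Split only on the first colon so the URL scheme stays intact.
            url = line.split(":", 1)[1].strip()
            host = urlsplit(url).hostname  # lowercased by urlsplit, or None
            if host:
                hosts.add(host)
    return hosts


def reverse_host(host):
    """Turn 'a.b.com' into 'com.b.a' (assumed notation of the ranks file)."""
    return ".".join(reversed(host.split(".")))


def missing_from_ranks(hosts, ranked_reversed):
    """Return hostnames whose reversed form is absent from the ranks set."""
    return sorted(h for h in hosts if reverse_host(h) not in ranked_reversed)
```

Comparing names without reversing them first, or comparing full hostnames against registered domains, would each report a large number of names as missing even when they are covered.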
Any insight you could provide regarding this issue would be greatly appreciated.