Discrepancy Between WET Data and Domain in Web Graph (Oct–Dec 2024) #52

@hussien

Dear Common Crawl Team,
I am analyzing the Common Crawl data for the October, November, and December 2024 collections. Specifically, I have processed the WET files from the October 2024 archive to extract domain names and sample textual content.
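
As an illustration only (not necessarily the exact procedure used here), host extraction from WET records can be sketched as below. It assumes each record carries a `WARC-Target-URI:` header line; the function name `hosts_from_wet_lines` and the sample record, including its URL path, are made up for the example.

```python
from urllib.parse import urlsplit

def hosts_from_wet_lines(lines):
    """Yield the lowercase host of every WARC-Target-URI header line.

    Simplification: any line starting with the header name is treated as a
    header, so a body line that happens to start with it would also match.
    """
    prefix = "WARC-Target-URI:"
    for line in lines:
        if not line.startswith(prefix):
            continue
        url = line[len(prefix):].strip()
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal): skip the record.
            continue
        if host:
            yield host.lower()

# Hypothetical record; real input would be the decompressed lines of a WET file.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://geneticlight.blogspot.com/2010/01/post.html",
    "Content-Type: text/plain",
    "",
    "some page text",
]
hosts = set(hosts_from_wet_lines(sample))
```

Note that this yields full host names. Reducing them to registered domains would need an additional step that is not shown.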

During this process, I identified approximately 5 million domains that are present in the WET files of the October 2024 crawl but are missing from the Oct–Nov–Dec 2024 Web Graph domain rankings.
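
A comparison of this kind can be sketched as a set difference. The sketch assumes the ranks file may store names in reversed notation (for example `com.example` rather than `example.com`), which should be verified against the actual file. The names `reverse_host` and `missing_from_ranks` and the sample data are hypothetical.

```python
def reverse_host(host):
    """Turn 'www.example.com' into 'com.example.www'."""
    return ".".join(reversed(host.split(".")))

def missing_from_ranks(wet_hosts, rank_names, reversed_names=True):
    """Return the WET hosts that have no matching entry in rank_names.

    reversed_names is an assumption about the ranks file format: when True,
    each host is reversed before the lookup.
    """
    known = set(rank_names)
    missing = set()
    for host in wet_hosts:
        key = reverse_host(host) if reversed_names else host
        if key not in known:
            missing.add(host)
    return missing

# Made-up sample data, not taken from any crawl.
wet_hosts = {"www.example.com", "blog.example.org"}
rank_names = {"com.example.www"}

missing = missing_from_ranks(wet_hosts, rank_names)
naive = missing_from_ranks(wet_hosts, rank_names, reversed_names=False)
```

With this sample data, `missing` contains only `blog.example.org`, while the `naive` lookup without reversal reports both hosts as missing. A mismatch in name notation, or comparing full host names against a file of registered domains, could therefore inflate the count of missing entries.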

Could you kindly clarify whether this discrepancy is expected? Are there specific filtering criteria applied to the Web Graph generation that would account for the exclusion of such domains?

Here are a few example domains missing from the Web Graph domain ranks file:
geneticlight.blogspot.com
sire.neighborhoodguides.com
cs.gzhbautoparts.com
booking.cheapflyme.com
cherry.22006.net

Any insight you could provide regarding this issue would be greatly appreciated.
