https://github.com/commoncrawl/cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: commoncrawl
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.84 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme

README.md

generate-warcinfo-index

For each crawl, generate a Parquet file with the following fields:

  • warcinfo_id
  • warc_filename

The make all-warcinfo step runs one extractor per crawl. On the first run, the first crawl extraction finished in 1h 35m and the last in 6h 56m.

A copy of the actual index can be found on rf:/home/cc-pds/warcinfo-id.parquet

How to query

See the test code in testpandas.py and testduck.py.
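
For orientation, a lookup against the index might look like the sketch below. It uses pandas, which the repository lists as a dependency. The function name warc_for_warcinfo, the sample rows, and the ID format are illustrative assumptions and are not taken from testpandas.py or testduck.py.

```python
from typing import Optional

import pandas as pd


def warc_for_warcinfo(index: pd.DataFrame, warcinfo_id: str) -> Optional[str]:
    """Return the WARC filename recorded for a warcinfo ID, or None if absent."""
    hits = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return None if hits.empty else str(hits.iloc[0])


# Made-up rows using the two documented columns. A real index would be loaded
# with pd.read_parquet(...) from wherever the Parquet file was installed.
index = pd.DataFrame(
    {
        "warcinfo_id": ["id-0001", "id-0002"],
        "warc_filename": ["example-00000.warc.gz", "example-00001.warc.gz"],
    }
)

print(warc_for_warcinfo(index, "id-0002"))
print(warc_for_warcinfo(index, "id-9999"))
```

The same filter can be expressed as a SQL WHERE clause over the Parquet file if DuckDB is preferred.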

Updating the index

The code uses smart_open() to read the initial part of every WARC file and extract the first record, which should be the warcinfo record.
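
The idea of reading only the first record can be sketched with the standard library alone. This is not the repository's implementation (which uses smart_open()). It assumes each record is stored as its own gzip member, so decompressing from the start of the file yields the first record. The helper names first_record_headers and make_sample_warc, and the sample record, are hypothetical.

```python
import gzip
import zlib


def first_record_headers(prefix: bytes) -> dict:
    """Parse the WARC headers of the first record from the leading bytes of a .warc.gz file."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip wrapper. A decompressobj
    # stops at the end of the first gzip member and tolerates a truncated prefix.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    data = decomp.decompress(prefix)
    head, sep, _ = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("prefix too short: header block is incomplete")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def make_sample_warc() -> bytes:
    """Build a tiny gzip-compressed warcinfo record for demonstration."""
    body = b"software: example\r\n"
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: warcinfo\r\n"
        b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        b"WARC-Filename: example.warc.gz\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"\r\n" + body + b"\r\n\r\n"
    )
    return gzip.compress(record)


headers = first_record_headers(make_sample_warc())
if headers.get("WARC-Type") != "warcinfo":
    raise ValueError("first record is not a warcinfo record")
print(headers["WARC-Record-ID"], headers["WARC-Filename"])
```

Checking WARC-Type guards against the case where the first record is not the warcinfo record, which the text above says should not happen but does not guarantee.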

The code avoids re-downloading anything it already has, and it runs one extractor per crawl in parallel. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. If you run many crawls in parallel, the slowest one can take much longer than the fastest.

make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
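
The skip-what-exists and one-worker-per-crawl behaviour described above can be sketched in Python as follows. The repository drives these steps through make, so this is only an illustration of the pattern. The names extract_crawl and extract_all, the fetch callback, the output layout, and the crawl names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def extract_crawl(crawl: str, out_dir: Path, fetch: Callable[[str], str]) -> str:
    """Extract one crawl unless its output file already exists."""
    out = out_dir / f"{crawl}.txt"
    if out.exists():
        return "skipped"  # finished on a previous run, so nothing is re-downloaded
    tmp = out_dir / f"{crawl}.tmp"
    tmp.write_text(fetch(crawl))
    tmp.replace(out)  # publish the result only once it is complete
    return "extracted"


def extract_all(
    crawls: List[str],
    out_dir: Path,
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run one extractor per crawl in a thread pool and report what happened."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda c: extract_crawl(c, out_dir, fetch), crawls))
    return dict(zip(crawls, results))


if __name__ == "__main__":
    import tempfile

    def fake_fetch(crawl: str) -> str:
        # Stand-in for the slow, network-bound extraction step.
        return f"warcinfo lines for {crawl}\n"

    with tempfile.TemporaryDirectory() as d:
        print(extract_all(["crawl-a", "crawl-b"], Path(d), fake_fetch))
        print(extract_all(["crawl-a", "crawl-b", "crawl-c"], Path(d), fake_fetch))
```

Threads suit this workload because each extractor spends most of its time waiting on the network rather than using the CPU. Writing to a temporary file and renaming it means an interrupted run never leaves a partial output that a later run would mistake for a finished one.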

To add a single new crawl, edit the Makefile to change the CRAWL variable, then run:

make one-paths
make one-warcinfo
make parquet
make test

Install

If you are happy with the result, copy the index into place:

make install

Owner

  • Name: Common Crawl Foundation
  • Login: commoncrawl
  • Kind: organization
  • Email: info@commoncrawl.org

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total
  • Push event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Create event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 10
  • Total Committers: 1
  • Avg Commits per committer: 10.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 10
  • Committers: 1
  • Avg Commits per committer: 10.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name | Email | Commits
Greg Lindahl | g****g@c****g | 10

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • boto3 *
  • duckdb *
  • numpy *
  • pandas *
  • pyarrow *
  • pytest *
  • warcio *