https://github.com/commoncrawl/cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (5.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: commoncrawl
Language: Python
Default Branch: main
Homepage:
Size: 6.84 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

generate-warcinfo-index

For each crawl, generate a Parquet file with the following fields:

warcinfo_id
warc_filename

The make all-warcinfo step runs one extractor per crawl. On the first run, the first crawl extraction to finish took 1h 35m and the last took 6h 56m.

A copy of the actual index can be found at rf:/home/cc-pds/warcinfo-id.parquet

How to query

Look at the test code in testpandas.py and testduck.py.
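As a rough illustration of the kind of lookup the index supports, here is a minimal pandas sketch. It assumes only the two columns listed above (warcinfo_id and warc_filename). The sample IDs, the file paths, and the lookup_warc helper are hypothetical and are not taken from the test code.

```python
import pandas as pd

# In real use the index would be loaded from the parquet file, for example:
#   index = pd.read_parquet("warcinfo-id.parquet")
# Here a tiny in-memory stand-in with made-up values is used instead.
index = pd.DataFrame(
    {
        "warcinfo_id": [
            "<urn:uuid:00000000-0000-0000-0000-000000000001>",
            "<urn:uuid:00000000-0000-0000-0000-000000000002>",
        ],
        "warc_filename": [
            "crawl-data/EXAMPLE-CRAWL/example-00000.warc.gz",
            "crawl-data/EXAMPLE-CRAWL/example-00001.warc.gz",
        ],
    }
)


def lookup_warc(index, warcinfo_id):
    """Return the WARC filename for a warcinfo id, or None if it is not indexed."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return matches.iloc[0] if not matches.empty else None


found = lookup_warc(index, "<urn:uuid:00000000-0000-0000-0000-000000000002>")
missing = lookup_warc(index, "<urn:uuid:not-in-the-index>")
```

For a full-size index, a join in DuckDB or a DataFrame merge would likely be a better fit than one lookup per ID.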

Updating the index

The code uses smart_open() to read the initial part of every WARC file, extracting the first record, which should be the warcinfo record.
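The idea of reading only the first record can be sketched with the standard library alone. This is not the repository's extractor: the first_record_headers and make_sample_warc helpers are hypothetical, the sample record is synthetic, and it is an assumption here that the indexed warcinfo id corresponds to the WARC-Record-ID header of the warcinfo record. The function accepts any binary file-like object, so a stream opened over the network could be passed in place of the in-memory buffer.

```python
import gzip
import io


def first_record_headers(stream):
    """Return the WARC headers of the first record in a gzipped WARC stream."""
    with gzip.GzipFile(fileobj=stream) as gz:
        version = gz.readline().decode("utf-8").strip()
        if not version.startswith("WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers = {}
        while True:
            line = gz.readline().decode("utf-8").rstrip("\r\n")
            if not line:  # a blank line ends the header block
                break
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers


def make_sample_warc():
    """Build a tiny synthetic gzipped warcinfo record for demonstration."""
    body = b"software: example\r\n"
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: warcinfo\r\n"
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        "WARC-Filename: example-00000.warc.gz\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return gzip.compress(head + body + b"\r\n\r\n")


headers = first_record_headers(io.BytesIO(make_sample_warc()))
is_warcinfo = headers["WARC-Type"] == "warcinfo"
```

Because the parser stops at the blank line that ends the header block, it never needs the rest of the file. Checking WARC-Type guards against a file whose first record is not a warcinfo record.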

The code avoids re-downloading anything it already has, and runs in parallel, one extractor per crawl. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. When many crawls run in parallel, the slowest can take much longer than the fastest.

make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
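The skip-what-exists and one-extractor-per-crawl behavior described above could look roughly like the following sketch. It is not how the Makefile is implemented: the extract_crawl, extract_all, and fake_fetch names, the per-crawl output file layout, and the use of a thread pool are all assumptions made for illustration. Threads are a plausible fit because the work is network-bound and uses little CPU.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_crawl(crawl, out_dir, fetch):
    """Extract one crawl unless its output file already exists."""
    out_path = Path(out_dir) / f"{crawl}.txt"
    if out_path.exists():
        return crawl, "skipped"  # nothing is fetched again
    out_path.write_text(fetch(crawl))
    return crawl, "extracted"


def extract_all(crawls, out_dir, fetch, max_workers=8):
    """Run one extractor per crawl in parallel and report what happened."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda c: extract_crawl(c, out_dir, fetch), crawls)
        return dict(results)


def fake_fetch(crawl):
    """Stand-in for the slow network step; returns placeholder content."""
    return f"warcinfo ids for {crawl}\n"


with tempfile.TemporaryDirectory() as tmp:
    crawls = ["CRAWL-A", "CRAWL-B", "CRAWL-C"]
    first_run = extract_all(crawls, tmp, fake_fetch)
    second_run = extract_all(crawls, tmp, fake_fetch)
```

On the second call every crawl is reported as skipped, which is the property that makes re-running the update cheap.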

To add a single new crawl, edit the Makefile to change the CRAWL variable, then run:

make one-paths
make one-warcinfo
make parquet
make test

Install

Once you are happy with the result, copy the index into place:

make install

Owner

Name: Common Crawl Foundation
Login: commoncrawl
Kind: organization
Email: info@commoncrawl.org

Website: https://commoncrawl.org
Twitter: commoncrawl
Repositories: 50
Profile: https://github.com/commoncrawl

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total

Push event: 1
Create event: 1

Last Year

Push event: 1
Create event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 10
Total Committers: 1
Avg Commits per committer: 10.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 10
Committers: 1
Avg Commits per committer: 10.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Greg Lindahl	g**g@c**g	10

Committer Domains (Top 20 + Academic)

commoncrawl.org: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

requirements.txt pypi

boto3 *
duckdb *
numpy *
pandas *
pyarrow *
pytest *
warcio *
