https://github.com/commoncrawl/cc-warcinfo-index-builder
Code to build an index that maps warcinfo-id to crawl / warc
Repository
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
generate-warcinfo-index
For each crawl, generate a Parquet file with the following fields:
- warcinfo_id
- warc_filename
The make all-warcinfo step runs one extractor per crawl. On the
first run, the first crawl extraction to finish took 1h 35m and the
last took 6h 56m.
A copy of the actual index can be found on rf:/home/cc-pds/warcinfo-id.parquet
How to query
See the test code in testpandas.py and testduck.py.
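As a rough illustration of what a lookup might look like with pandas, here is a minimal sketch. It assumes only the two documented columns, `warcinfo_id` and `warc_filename`. The function names `load_index` and `lookup_warc_filename` and every value in the example frame are made up for illustration and are not taken from the repository's test code.

```python
import pandas as pd


def load_index(path: str) -> pd.DataFrame:
    """Read the Parquet index, requesting only the two documented columns."""
    return pd.read_parquet(path, columns=["warcinfo_id", "warc_filename"])


def lookup_warc_filename(index: pd.DataFrame, warcinfo_id: str) -> list:
    """Return every warc_filename whose warcinfo_id matches exactly."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return matches.tolist()


# Tiny in-memory stand-in for the real index. All values are invented
# placeholders; the real ID and filename formats may differ.
example_index = pd.DataFrame(
    {
        "warcinfo_id": ["example-id-1", "example-id-2"],
        "warc_filename": ["example/path/a.warc.gz", "example/path/b.warc.gz"],
    }
)

found = lookup_warc_filename(example_index, "example-id-2")
missing = lookup_warc_filename(example_index, "no-such-id")
```

With the real file, `load_index("warcinfo-id.parquet")` would replace the in-memory frame, assuming a local copy of the index is available.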
Updating the index
The code uses smart_open() to read the initial part of every WARC file and extracts the first record, which should be the warcinfo record.
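To show why reading only a prefix is enough, the sketch below parses the first record out of the leading bytes of a gzipped WARC. It is not the repository's code: it uses only the standard library instead of smart_open, it builds a fake two-record file in memory, and the helper names and record contents are invented. It relies on the common convention that each WARC record is its own gzip member, so decompression can stop at the end of the first one.

```python
import gzip
import zlib


def first_record_headers(prefix: bytes) -> dict:
    """Decompress the first gzip member in `prefix` and parse its WARC headers."""
    # MAX_WBITS | 16 tells zlib to expect a gzip wrapper; the decompressor
    # stops at the end of the first member and ignores what follows.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    record = decompressor.decompress(prefix)
    header_block = record.split(b"\r\n\r\n", 1)[0].decode("utf-8")
    lines = header_block.split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.0"
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def make_record(warc_type: str, record_id: str, body: bytes) -> bytes:
    """Build one gzipped WARC record (hypothetical test data only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Record-ID: {record_id}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return gzip.compress(head + body + b"\r\n\r\n")


# Fake file: a warcinfo record followed by one more record.
info_id = "<urn:uuid:00000000-0000-0000-0000-000000000001>"
first = make_record("warcinfo", info_id, b"software: example\r\n")
second = make_record("response", "<urn:uuid:00000000-0000-0000-0000-000000000002>", b"hello")
fake_warc = first + second

# Simulate a partial download: the first member plus a few stray bytes.
prefix = fake_warc[: len(first) + 5]
headers = first_record_headers(prefix)
```

In practice the prefix length would have to be chosen large enough to cover the whole warcinfo record; the right size is not stated in this README.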
The code does not re-download anything it already has, and it runs in parallel across crawls. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. When many crawls run in parallel, the slowest can take much longer than the fastest.
make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
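The skip-what-is-done and one-extractor-per-crawl behavior described above can be pictured with the following sketch. How the repository actually implements this is not shown in this README, so everything here is hypothetical: the `.done` marker files, the function names, the use of threads, and the crawl names are all assumptions chosen to keep the example small.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_crawl(crawl: str, out_dir: Path, fetch) -> str:
    """Run `fetch` for one crawl unless its output already exists."""
    out_file = out_dir / f"{crawl}.done"
    if out_file.exists():
        return "skipped"  # finished on an earlier run, nothing to download
    result = fetch(crawl)
    # Write to a temporary name first so an interrupted run never leaves
    # a half-written file that looks complete.
    tmp_file = out_dir / f"{crawl}.tmp"
    tmp_file.write_text(result)
    tmp_file.replace(out_file)
    return "extracted"


def extract_all(crawls: list, out_dir: Path, fetch) -> dict:
    """Run one extractor per crawl concurrently and report each status."""
    # Threads are assumed to be enough because the work is network-bound.
    with ThreadPoolExecutor(max_workers=max(len(crawls), 1)) as pool:
        statuses = list(pool.map(lambda c: extract_crawl(c, out_dir, fetch), crawls))
    return dict(zip(crawls, statuses))


calls = []


def fake_fetch(crawl: str) -> str:
    """Stand-in for the slow network step; records that it was called."""
    calls.append(crawl)
    return f"index for {crawl}\n"


crawls = ["EXAMPLE-CRAWL-A", "EXAMPLE-CRAWL-B"]  # invented crawl names
with tempfile.TemporaryDirectory() as tmp_dir:
    first_run = extract_all(crawls, Path(tmp_dir), fake_fetch)
    second_run = extract_all(crawls, Path(tmp_dir), fake_fetch)
```

Because every crawl runs at once, total wall-clock time is set by the slowest crawl, which matches the spread in run times noted earlier.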
To add a single new crawl, set the CRAWL variable in the Makefile to that crawl, then run:
make one-paths
make one-warcinfo
make parquet
make test
Install
If you are happy with the result, copy the index into place:
make install
Owner
- Name: Common Crawl Foundation
- Login: commoncrawl
- Kind: organization
- Email: info@commoncrawl.org
- Website: https://commoncrawl.org
- Twitter: commoncrawl
- Repositories: 50
- Profile: https://github.com/commoncrawl
Common Crawl provides an archive of webpages going back to 2007.
GitHub Events
Total
- Push event: 1
- Create event: 1
Last Year
- Push event: 1
- Create event: 1
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Greg Lindahl | g****g@c****g | 10 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- boto3 *
- duckdb *
- numpy *
- pandas *
- pyarrow *
- pytest *
- warcio *