https://github.com/commoncrawl/cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.5%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
Statistics of Common Crawl monthly archives mined from URL index files
Basic Info
- Host: GitHub
- Owner: commoncrawl
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: https://commoncrawl.github.io/cc-crawl-statistics/
- Size: 387 MB
Statistics
- Stars: 188
- Watchers: 17
- Forks: 12
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
Basic Statistics of Common Crawl Monthly Archives
Analyze the Common Crawl data to get metrics about the monthly crawl archives:
- size of the monthly crawls: number of
  - fetched pages
  - unique URLs
  - unique documents (by content digest)
  - different hosts, domains and top-level domains
- distribution of pages/URLs over hosts, domains and top-level domains
- as well as
  - MIME types
  - protocols/schemes (http vs. https)
  - content languages (since summer 2018)
This is a description of how to generate the statistics from the Common Crawl URL index files.
The results are presented on https://commoncrawl.github.io/cc-crawl-statistics/.
Step 1: Count Items
The items (URLs, hosts, domains, etc.) are counted using the Common Crawl index files
on AWS S3 s3://commoncrawl/cc-index/collections/*/indexes/cdx-*.gz.
First, define a pattern of cdx files to process, usually from one monthly crawl (here: CC-MAIN-2016-26):

```sh
# either a smaller set of local files for testing
INPUT="test/cdx/cdx-0000[0-3].gz"
# or one monthly crawl to be accessed via Hadoop on AWS S3
INPUT="s3a://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-*.gz"
```

Then run crawlstats.py --job=count to process the cdx files and count the items:

```sh
python3 crawlstats.py --job=count --no-exact-counts \
    --no-output --output-dir .../count/ $INPUT
```
Help on command-line parameters (including mrjob options) is shown by
python3 crawlstats.py --help.
The option --no-exact-counts is recommended (and is the default) to save storage space and computation time
when counting URLs and content digests.
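To make the approximate counting more concrete: the dependency list includes the hyperloglog package, and a HyperLogLog sketch allows unique URLs and content digests to be estimated without storing every value. The snippet below is a minimal sketch under that assumption; it is not the code path used by crawlstats.py, and the CDXJ line layout (SURT key, timestamp, JSON fields) is assumed.

```python
# Minimal sketch of approximate unique counting with HyperLogLog.
# Illustration only: crawlstats.py's actual implementation is not shown
# here, and the CDXJ line layout below is an assumption.
import gzip
import json

import hyperloglog  # listed in the dependencies

urls = hyperloglog.HyperLogLog(0.01)     # ~1% relative error
digests = hyperloglog.HyperLogLog(0.01)

with gzip.open("test/cdx/cdx-00000.gz", "rt", encoding="utf-8") as cdx:
    for line in cdx:
        # assumed layout: "<SURT key> <timestamp> <JSON fields>"
        try:
            _key, _ts, fields = line.split(" ", 2)
            record = json.loads(fields)
        except ValueError:
            continue
        urls.add(record.get("url", ""))
        digests.add(record.get("digest", ""))

print("unique URLs (approx.):", len(urls))
print("unique digests (approx.):", len(digests))
```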
Step 2: Aggregate Counts
Run crawlstats.py --job=stats on the output of step 1:
```sh
python3 crawlstats.py --job=stats --max-top-hosts-domains=500 \
    --no-output --output-dir .../stats/ .../count/
```
The maximum number of most frequent hosts and domains included in the output is set by the option
--max-top-hosts-domains=N.
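To illustrate the idea behind the aggregation step, here is a rough sketch: sum the per-part counts and keep only the N most frequent hosts. It is not the project's mrjob implementation, and the tab-separated "JSON key, count" line layout and key structure are assumptions.

```python
# Sketch of the aggregation idea behind --job=stats: sum per-part counts
# and keep only the N most frequent hosts. Not the project's mrjob code;
# the "JSON key <TAB> count" layout and key structure are assumptions.
import bz2
import glob
import json
from collections import Counter

MAX_TOP_HOSTS_DOMAINS = 500  # mirrors --max-top-hosts-domains=500

host_counts = Counter()
for path in glob.glob("stats/count/CC-MAIN-2022-05/part-*.bz2"):
    with bz2.open(path, "rt", encoding="utf-8") as part:
        for line in part:
            if "\t" not in line:
                continue
            key_json, count = line.rstrip("\n").split("\t", 1)
            key = json.loads(key_json)  # hypothetical key layout
            if isinstance(key, list) and key and key[0] == "host":
                host_counts[key[-1]] += int(count)

# keep only the most frequent hosts, as the stats job does for its output
print(host_counts.most_common(MAX_TOP_HOSTS_DOMAINS)[:10])
```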
Step 3: Download the Data
In order to prepare the plots, the output of step 2 must be downloaded to local disk.
The simplest way is to fetch the data from the Common Crawl Public Data Set bucket on AWS S3:
```sh
while read crawl; do
    aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz
done <<EOF
CC-MAIN-2008-2009
...
EOF
```
One aggregated, gzip-compressed statistics file is about 1 MiB in size, so you can simply run get_stats.sh to download the data files for all released monthly crawls.
The output of step 1 is also provided on s3://commoncrawl/. The counts for every crawl are held
in 10 bzip2-compressed files, about 1 GiB per crawl on average. To download the counts for one crawl:
- if you're on AWS and the AWS CLI is installed and configured:

```sh
CRAWL=CC-MAIN-2022-05
aws s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count stats/count/$CRAWL
```
- otherwise:

```sh
CRAWL=CC-MAIN-2022-05
mkdir -p stats/count/$CRAWL
for i in $(seq 0 9); do
    curl https://data.commoncrawl.org/crawl-analysis/$CRAWL/count/part-0000$i.bz2 \
        >stats/count/$CRAWL/part-0000$i.bz2
done
```
Step 4: Plot the Data
To prepare the plots using the downloaded aggregated data:
```sh
gzip -dc stats/CC-MAIN-*.gz | python3 plot/crawl_size.py
```
The full list of commands to prepare all plots is found in plot.sh. Don't forget to install the Python
modules required for plotting.
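As an illustration of what such a plot script might look like (the real plot/crawl_size.py is not reproduced here), the following sketch reads the aggregated stats from stdin, as in the gzip pipeline above, and plots one metric per crawl with pandas. The "JSON key, value" line layout and the ["size", "page"] key are assumptions, and the script name crawl_size_sketch.py is hypothetical.

```python
# crawl_size_sketch.py: hypothetical stand-in for plot/crawl_size.py.
# Reads aggregated stats from stdin; the line layout and the key names
# ("size", "page") are assumptions, not the real output format.
import json
import sys

import pandas as pd  # pandas is listed in the dependencies

rows = []
for line in sys.stdin:
    if "\t" not in line:
        continue
    key_json, value = line.rstrip("\n").split("\t", 1)
    key = json.loads(key_json)
    if isinstance(key, list) and key[:2] == ["size", "page"]:  # hypothetical key
        rows.append({"crawl": key[-1], "pages": int(value)})

df = pd.DataFrame(rows).sort_values("crawl")
ax = df.plot(x="crawl", y="pages", kind="bar", legend=False)
ax.figure.savefig("crawl_size_sketch.png")  # plotting backend: matplotlib
```

It would be run the same way as the real script, e.g. gzip -dc stats/CC-MAIN-*.gz | python3 crawl_size_sketch.py.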
Step 5: Local Site Preview
The crawl statistics site is hosted on GitHub Pages. The site is updated as soon as plots or description texts are updated, committed and pushed to the GitHub repository.
To preview local changes, it's possible to serve the site locally:
1. build the Docker image with Ruby, Jekyll and the content to be served:

   ```sh
   docker build -f site.Dockerfile -t cc-crawl-statistics-site:latest .
   ```

2. run a Docker container to serve the site preview:

   ```sh
   docker run --network=host --rm -ti cc-crawl-statistics-site:latest
   ```
The site should be served on localhost, port 4000 (http://127.0.0.1:4000).
If not, the correct location is shown in the output of the docker run command.
If running this on a Mac, you may find that the loopback interface (127.0.0.1) within the container is not accessible. In that case, change the CMD line in the Dockerfile to:

```
CMD bundle exec jekyll serve --host 0.0.0.0
```

The site will then be served on http://0.0.0.0:4000 instead. (You will of course need to rebuild the Docker image after updating the Dockerfile.)
Related Projects
The columnar index simplifies counting and analytics a lot: it is easier to maintain, more transparent, reproducible and extensible than running two MapReduce jobs. See the list of example SQL queries and Jupyter notebooks.
Owner
- Name: Common Crawl Foundation
- Login: commoncrawl
- Kind: organization
- Email: info@commoncrawl.org
- Website: https://commoncrawl.org
- Twitter: commoncrawl
- Repositories: 50
- Profile: https://github.com/commoncrawl
Common Crawl provides an archive of webpages going back to 2007.
GitHub Events
Total
- Issues event: 2
- Watch event: 40
- Delete event: 3
- Issue comment event: 4
- Push event: 29
- Pull request review event: 3
- Pull request review comment event: 1
- Pull request event: 7
- Fork event: 1
- Create event: 2
Last Year
- Issues event: 2
- Watch event: 40
- Delete event: 3
- Issue comment event: 4
- Push event: 29
- Pull request review event: 3
- Pull request review comment event: 1
- Pull request event: 7
- Fork event: 1
- Create event: 2
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Sebastian Nagel | s****n@c****g | 204 |
| Thom Vaughan | t****m@c****g | 24 |
| Julien Nioche | j****n@d****m | 4 |
| Pedro Ortiz Suarez | p****o@c****g | 3 |
| Julien Nioche | j****n@c****g | 2 |
| Greg Lindahl | g****g@c****g | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 8
- Total pull requests: 5
- Average time to close issues: 24 days
- Average time to close pull requests: 11 days
- Total issue authors: 8
- Total pull request authors: 1
- Average comments per issue: 1.63
- Average comments per pull request: 0.4
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 3
- Average time to close issues: 1 day
- Average time to close pull requests: 5 days
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- tfmorris (1)
- drunkpig (1)
- pjox (1)
- RichardKCollins (1)
- johanovic (1)
- SeekPoint (1)
- sebastian-nagel (1)
- edsu (1)
- handecelikkanat (1)
- happywhitelake (1)
Pull Request Authors
- sebastian-nagel (10)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- ggplot *
- idna *
- pandas *
- pygraphviz *
- rpy2 *
- hyperloglog *
- isoweek *
- mrjob *
- tldextract *
- ujson *