https://github.com/commoncrawl/commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 10 committers (10.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Keywords

archived, inactive

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: commoncrawl
Language: C++
Default Branch: master
Homepage:
Size: 12.9 MB

Statistics

Stars: 502
Watchers: 63
Forks: 91
Open Issues: 7
Releases: 0

Archived

Topics

archived, inactive

Created over 14 years ago · Last pushed over 8 years ago

https://github.com/commoncrawl/commoncrawl/blob/master/

# Common Crawl Support Library

## Overview

This library provides support code for the consumption of the Common Crawl Corpus
RAW crawl data (ARC Files) stored on S3. More information about how to access the corpus
can be found at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set .

You can take two primary routes to consuming the ARC File content:

(1) You can run a Hadoop cluster on EC2 or use EMR to run a Hadoop job. In this case,
you can use the ARCFileInputFormat to drive data to your mappers/reducers. There are two
versions of the InputFormat: One written to conform to the deprecated mapred package,
located at org.commoncrawl.hadoop.io.mapred and one written for the mapreduce package,
correspondingly located at org.commoncrawl.hadoop.io.mapreduce.

(2) You can decode data directly by feeding an InputStream to the ARCFileReader class
located in the org.commoncrawl.util.shared package.

Both routes (the InputFormat or the direct ARCFileReader route) produce a tuple consisting of a UTF-8
encoded URL (Text) and the raw content (BytesWritable), including the HTTP headers, that was
downloaded by the crawler. The HTTP headers are UTF-8 encoded, and the headers and content
are delimited by a consecutive set of CRLF tokens. The content itself, when it is of a text
MIME type, is encoded using the source text encoding.
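
As a rough illustration of the record layout described above, the following Java sketch splits one raw record into its header block and body, then decodes a text body with the charset named in the Content-Type header. It uses only the JDK. The class `ArcRecordSplitter` and its methods are hypothetical names and are not part of this library. The sketch also assumes that the delimiter is a blank line (CRLF CRLF), as in standard HTTP messages, and that ISO-8859-1 is an acceptable fallback when no charset is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the commoncrawl library.
final class ArcRecordSplitter {

    // Matches "charset=..." on a Content-Type header line, case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "^content-type:[^\\r\\n]*?charset=\"?([^\";\\s]+)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    /** Returns the index of the first CRLF CRLF sequence, or -1 if there is none. */
    public static int findHeaderEnd(byte[] raw, int length) {
        for (int i = 0; i + 3 < length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n'
                    && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                return i;
            }
        }
        return -1;
    }

    /** Decodes the header block as UTF-8; the whole record if no delimiter is found. */
    public static String headers(byte[] raw, int length) {
        int end = findHeaderEnd(raw, length);
        return new String(raw, 0, end < 0 ? length : end, StandardCharsets.UTF_8);
    }

    /** Returns the bytes after the delimiter, or an empty array if there is none. */
    public static byte[] body(byte[] raw, int length) {
        int end = findHeaderEnd(raw, length);
        return end < 0 ? new byte[0] : Arrays.copyOfRange(raw, end + 4, length);
    }

    /** Returns the charset declared in Content-Type, or the fallback if absent or unknown. */
    public static Charset charsetOf(String headers, Charset fallback) {
        Matcher m = CHARSET.matcher(headers);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                return fallback; // malformed or unsupported charset name
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Build a small fake record: UTF-8 headers followed by an ISO-8859-1 body.
        byte[] head = ("HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html; charset=ISO-8859-1\r\n\r\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] content = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] raw = new byte[head.length + content.length];
        System.arraycopy(head, 0, raw, 0, head.length);
        System.arraycopy(content, 0, raw, head.length, content.length);

        String h = headers(raw, raw.length);
        Charset cs = charsetOf(h, StandardCharsets.ISO_8859_1);
        System.out.println(h);
        System.out.println(new String(body(raw, raw.length), cs));
    }
}
```

In a Hadoop mapper, the two arguments would come from the `BytesWritable` value via `getBytes()` and `getLength()`. The explicit length matters because the backing array can be larger than the valid data.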

## Build Notes:

1. You need to define JAVA_HOME, and make sure you have Ant & Maven installed.
2. Set hadoop.path (in build.properties) to point to your Hadoop distribution.

## Sample Usage:

Once commoncrawl.jar has been built, you can validate that the ARCFileReader works for you by
executing the following sample command line from the root of the commoncrawl source directory:

./bin/launcher.sh org.commoncrawl.util.shared.ARCFileReader --awsAccessKey <ACCESS KEY> --awsSecret <SECRET> --file s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690164240/1341819847375_4319.arc.gz

Owner

Name: Common Crawl Foundation
Login: commoncrawl
Kind: organization
Email: info@commoncrawl.org

Website: https://commoncrawl.org
Twitter: commoncrawl
Repositories: 50
Profile: https://github.com/commoncrawl

Common Crawl provides an archive of webpages going back to 2007.

GitHub Events

Total

Issues event: 7
Watch event: 13
Fork event: 2

Last Year

Issues event: 7
Watch event: 13
Fork event: 2

Committers

Last synced: about 1 year ago

All Time

Total Commits: 64
Total Committers: 10
Avg Commits per committer: 6.4
Development Distribution Score (DDS): 0.531

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ahad Rana	r**a@A**l	30
Ahad Rana	a**a@g**m	22
Ahad Rana	r**a@A**l	3
Ahad Rana	r**a@A**l	3
mat kelcey	m**y@g**m	1
Sebastian Nagel	s**n@c**g	1
Santiago Castro	s**0@h**m	1
Nada Amin	n**n@a**u	1
Shh	s**h@u**)	1
Chris Stephens	c**s@c**g	1

Committer Domains (Top 20 + Academic)

commoncrawl.org: 2, ubuntu.(none): 1, alum.mit.edu: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 13
Total pull requests: 8
Average time to close issues: 7 months
Average time to close pull requests: about 1 year
Total issue authors: 10
Total pull request authors: 8
Average comments per issue: 1.46
Average comments per pull request: 0.25
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 0
Average time to close issues: 2 months
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 0.2
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

MikeA55 (4)
banshee (1)
adamchainz (1)
knowthetech (1)
gsingers (1)
spydaz (1)
andy-m (1)
azzurolilc (1)
Eugene56 (1)
wiseman (1)

Pull Request Authors

sha0h0ng (1)
namin (1)
andrelaszlo (1)
jseppanen (1)
noiano (1)
bryant1410 (1)
sameerpany (1)
matpalm (1)
