https://github.com/commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (4.0%) to scientific vocabulary
Keywords
aws-athena
common-crawl
commoncrawl
jupyter-notebook
webarchiving
webgraph-framework
Last synced: 5 months ago
Repository
Statistics
- Stars: 54
- Watchers: 18
- Forks: 11
- Open Issues: 0
- Releases: 0
Created over 6 years ago · Last pushed 11 months ago
Metadata Files
Readme
License
README.md
Jupyter Notebooks to Analyze Common Crawl Data
- analyzing data using the columnar index
  - blocking of internet connections from and to the Islamic Republic of Iran during the November 2019 crawl: net-blocking-iran-cc-main-2019-47.ipynb
  - total number of captures 2013 – 2019, domain coverage and approximation of unique URLs for the .edu top-level domain: cc-main-2013-2019-metrics.ipynb
  - correlations between character sets and languages: correlation-language-charset.ipynb
- analyze the Common Crawl webgraph data sets and interactively explore the graphs: cc-webgraph-statistics
- how to explore WARC files running a notebook on AWS EMR
- truncated record payloads in WARC files:
  - verify that all truncated payloads are annotated by the WARC-Truncated header
  - which MIME types are mostly affected by truncation? Aggregations using the columnar index.
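
A minimal sketch of the kind of aggregation the truncation item describes, using a few made-up rows in plain Python rather than the columnar index itself. The field names `mime` and `truncated`, the sample values, and the helper `truncation_by_mime` are all hypothetical and chosen only for illustration; the notebooks may proceed differently.

```python
from collections import Counter

# Hypothetical sample rows. A real analysis would read these from the
# columnar index; the field names here are assumptions for this sketch.
# "truncated" holds a truncation reason, or None if the payload is complete.
rows = [
    {"mime": "text/html", "truncated": None},
    {"mime": "application/pdf", "truncated": "length"},
    {"mime": "application/pdf", "truncated": "length"},
    {"mime": "text/html", "truncated": "disconnect"},
    {"mime": "video/mp4", "truncated": "length"},
    {"mime": "text/html", "truncated": None},
]


def truncation_by_mime(rows):
    """Return {mime: (truncated_count, total_count, truncated_share)}."""
    total = Counter()
    truncated = Counter()
    for row in rows:
        total[row["mime"]] += 1
        if row["truncated"]:
            truncated[row["mime"]] += 1
    return {m: (truncated[m], total[m], truncated[m] / total[m]) for m in total}


stats = truncation_by_mime(rows)

# List MIME types with the highest share of truncated captures first.
for mime, (t, n, share) in sorted(stats.items(), key=lambda kv: -kv[1][2]):
    print(f"{mime}: {t}/{n} truncated ({share:.0%})")
```

Reporting the share next to the raw counts keeps rare MIME types with a single truncated capture from being mistaken for the most affected ones.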
Owner
- Name: Common Crawl Foundation
- Login: commoncrawl
- Kind: organization
- Email: info@commoncrawl.org
- Website: https://commoncrawl.org
- Twitter: commoncrawl
- Repositories: 50
- Profile: https://github.com/commoncrawl
Common Crawl provides an archive of webpages going back to 2007.
GitHub Events
Total
- Issues event: 2
- Watch event: 10
- Issue comment event: 2
- Push event: 3
- Pull request event: 1
- Fork event: 2
Last Year
- Issues event: 2
- Watch event: 10
- Issue comment event: 2
- Push event: 3
- Pull request event: 1
- Fork event: 2
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Sebastian Nagel | s****l@a****g | 23 |
| Alex Xue | a****1@g****m | 2 |
Committer Domains (Top 20 + Academic)
apache.org: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 4
- Average time to close issues: about 3 years
- Average time to close pull requests: about 1 month
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.75
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 2 months
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- sebastian-nagel (2)
Pull Request Authors
- Xue-Alex (2)
- sebastian-nagel (2)
Top Labels
Issue Labels
enhancement (1)