cc-citations
Scientific articles using or citing Common Crawl data
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary
Statistics
- Stars: 26
- Watchers: 12
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Common Crawl Citations – BibTeX Database
BibTeX files are in bib/.
Note: this is a work in progress and still contains only a fraction of recent articles.
Fields Specific for Common Crawl
The following non-standard fields record how a publication relates to Common Crawl:
- cc-author-affiliation
  - affiliation of the authors
- cc-class
  - classification of the publication: domain of research, topics, keywords
- cc-snippet
  - snippet citing Common Crawl
- cc-dataset-used
  - subset of Common Crawl used, e.g., CC-MAIN-2016-07
- cc-derived-dataset-about
  - the publication describes a dataset derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-used
  - the publication uses a dataset derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-cited
  - a derived dataset is cited but not used
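As an illustration of how these fields might be read, here is a minimal sketch using only the Python standard library. The sample entry, its key, and all of its field values are hypothetical and not taken from bib/. The regular expression handles only simple `name = {value}` lines without nested braces, so a real workflow would more likely use a proper BibTeX parser such as pybtex, which is listed as a dependency.

```python
import re

# Hypothetical entry for illustration only; not taken from bib/.
SAMPLE_ENTRY = """
@inproceedings{example2016,
  title = {An Example Paper},
  year = {2016},
  cc-class = {nlp/word-embeddings},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches simple `name = {value}` pairs whose value has no nested braces.
FIELD_RE = re.compile(r"^\s*([\w-]+)\s*=\s*\{([^{}]*)\}\s*,?\s*$", re.MULTILINE)


def extract_cc_fields(entry: str) -> dict[str, str]:
    """Return only the Common Crawl specific (cc-*) fields of one entry."""
    return {
        name: value
        for name, value in FIELD_RE.findall(entry)
        if name.startswith("cc-")
    }


if __name__ == "__main__":
    for name, value in extract_cc_fields(SAMPLE_ENTRY).items():
        print(f"{name}: {value}")
```

Standard fields such as title and year are ignored, so the result contains only the annotations that describe the relation to Common Crawl.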
Formatting and Export of Citations
The Makefile contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.
(Do not be confused by the PyPI package bibclean, which is entirely different. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
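Before running the Makefile targets, it can help to confirm that the three tools are on the PATH. The following is a small sketch, not part of the repository; the function name `missing_tools` is an assumption made for illustration, and the check only tests that an executable with the given name exists, not that it is the right bibclean.

```python
import shutil

# Tool names as listed above; they are expected to be on the PATH.
REQUIRED_TOOLS = ("bibtex2html", "bibclean", "bibtool")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing BibTeX tools:", ", ".join(missing))
    else:
        print("All required BibTeX tools were found.")
```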
Citations from Google Scholar Alerts
As an initial step, and to get higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.
Updating the awesome graph that everyone loves
Uploading the raw data to Hugging Face
Google Scholar
This data is split by year to make it easier to explore.
- pull the updated repo
- make gscholar-bib
- look in tmp for 2024.jsonl etc.
- upload at https://huggingface.co/datasets/commoncrawl/citations/tree/main
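Before uploading, a quick sanity check of the per-year files may be useful. The sketch below assumes that each file in tmp is named `<year>.jsonl` and holds one JSON object per line. The record schema is not described here, so the code only counts records and does not inspect their fields. The function name `count_records_by_year` is hypothetical.

```python
import json
from pathlib import Path


def count_records_by_year(directory):
    """Map each <year>.jsonl file in `directory` to its number of records."""
    counts = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                json.loads(line)  # raises ValueError if the line is not valid JSON
                records += 1
        counts[path.stem] = records
    return counts


if __name__ == "__main__":
    # "tmp" is the output directory mentioned in the steps above.
    for year, count in count_records_by_year("tmp").items():
        print(year, count)
```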
Annotated Citations
This much smaller dataset has the extra fields mentioned above.
- pull the updated repo
- make tmp/commoncrawl_annotated.csv
- TODO
Owner
- Name: Common Crawl Foundation
- Login: commoncrawl
- Kind: organization
- Email: info@commoncrawl.org
- Website: https://commoncrawl.org
- Twitter: commoncrawl
- Repositories: 50
- Profile: https://github.com/commoncrawl
Common Crawl provides an archive of webpages going back to 2007.
Citation (citations_2025.csv)
,year,count,cumulative_count
0,2012,30,30
1,2013,80,110
2,2014,173,283
3,2015,213,496
4,2016,196,692
5,2017,316,1008
6,2018,533,1541
7,2019,710,2251
8,2020,977,3228
9,2021,1105,4333
10,2022,1266,5599
11,2023,1777,7376
12,2024,2300,9676
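The cumulative_count column should equal the running sum of count. A minimal sketch of that check, with the CSV content above embedded as a string, might look as follows; the function name `check_cumulative` is an assumption made for illustration.

```python
import csv
import io

# The contents of citations_2025.csv as shown above.
CSV_TEXT = """\
,year,count,cumulative_count
0,2012,30,30
1,2013,80,110
2,2014,173,283
3,2015,213,496
4,2016,196,692
5,2017,316,1008
6,2018,533,1541
7,2019,710,2251
8,2020,977,3228
9,2021,1105,4333
10,2022,1266,5599
11,2023,1777,7376
12,2024,2300,9676
"""


def check_cumulative(csv_text):
    """Return True if cumulative_count matches the running sum of count."""
    running = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        running += int(row["count"])
        if running != int(row["cumulative_count"]):
            return False
    return True


if __name__ == "__main__":
    print("cumulative counts consistent:", check_cumulative(CSV_TEXT))
```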
GitHub Events
Total
- Watch event: 11
- Delete event: 1
- Issue comment event: 2
- Push event: 27
- Pull request event: 1
- Pull request review event: 10
- Pull request review comment event: 16
- Fork event: 1
Last Year
- Watch event: 11
- Delete event: 1
- Issue comment event: 2
- Push event: 27
- Pull request event: 1
- Pull request review event: 10
- Pull request review comment event: 16
- Fork event: 1
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Sebastian Nagel | s****n@c****g | 62 |
| Greg Lindahl | g****g@c****g | 3 |
| Thom Vaughan | t@l****k | 2 |
| Jodi Schneider | j****r@g****m | 1 |
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 3 days
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 3 days
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- jodischneider (2)
Dependencies
- pybtex *