cc-citations

Scientific articles using or citing Common Crawl data

https://github.com/commoncrawl/cc-citations

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.4%) to scientific vocabulary

Keywords

bibliography bibtex opendata

Repository

Basic Info
  • Host: GitHub
  • Owner: commoncrawl
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 24 MB
Statistics
  • Stars: 26
  • Watchers: 12
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Created over 7 years ago · Last pushed 8 months ago

README.md

Common Crawl Citations – BibTeX Database

BibTeX files are in bib/.

Note: this is a work in progress and still contains only a fraction of recent articles.

Fields Specific for Common Crawl

The following non-standard fields record how each publication relates to Common Crawl:

  • cc-author-affiliation: affiliation of the authors
  • cc-class: classification of the publication: domain of research, topics, keywords
  • cc-snippet: snippet citing Common Crawl
  • cc-dataset-used: subset of Common Crawl used, e.g., CC-MAIN-2016-07
  • cc-derived-dataset-about: the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
  • cc-derived-dataset-used: a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
  • cc-derived-dataset-cited: a derived dataset is cited but not used
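
To make the field list concrete, the sketch below pulls the cc-* fields out of a single BibTeX entry using only Python's standard library. The entry, its citation key, its field values other than the two examples quoted above, and the helper name extract_cc_fields are all invented for illustration. The regular expression only handles simple `name = {value}` pairs without nested braces; it is not the parser used in this repository (requirements.txt lists pybtex).

```python
import re

# Hypothetical entry, invented for illustration; not taken from bib/.
SAMPLE_ENTRY = """
@inproceedings{example2016,
  title = {An Example Paper},
  year = {2016},
  cc-author-affiliation = {Example University},
  cc-class = {nlp/word-embeddings},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches simple `name = {value}` pairs whose value contains no nested braces.
FIELD_RE = re.compile(r"^\s*([A-Za-z][\w-]*)\s*=\s*\{([^{}]*)\}", re.MULTILINE)


def extract_cc_fields(entry_text):
    """Return a dict of the Common Crawl specific (cc-*) fields in one entry."""
    return {
        name.lower(): value.strip()
        for name, value in FIELD_RE.findall(entry_text)
        if name.lower().startswith("cc-")
    }


cc_fields = extract_cc_fields(SAMPLE_ENTRY)
for name, value in cc_fields.items():
    print(f"{name}: {value}")
```

Standard fields such as title and year are skipped, so the result contains only the Common Crawl annotations.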

Formatting and Export of Citations

The Makefile contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: bibtex2html, bibclean, bibtool.

(Do not confuse bibclean with the PyPI package of the same name, which is entirely different. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)

Citations from Google Scholar Alerts

As an initial step, and to get higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See gscholar_alerts.

Updating the awesome graph that everyone loves

Uploading the raw data to Hugging Face

Google Scholar

This data is split by year to make it easier to explore.

  • pull the updated repo
  • run make gscholar-bib
  • look in tmp/ for the per-year files (2024.jsonl etc.)
  • upload them to https://huggingface.co/datasets/commoncrawl/citations/tree/main
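
Before uploading, it may help to sanity-check the per-year files. The sketch below counts the records in every .jsonl file of a directory, assuming one JSON document per line (the usual JSON Lines layout). The function name count_records_by_year is hypothetical, and nothing is assumed about the fields inside each record.

```python
import json
from pathlib import Path


def count_records_by_year(directory):
    """Map each <name>.jsonl file stem (e.g. "2024") to its record count."""
    counts = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        records = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # ignore blank lines
                json.loads(line)  # raises ValueError on a malformed line
                records += 1
        counts[path.stem] = records
    return counts
```

Calling count_records_by_year("tmp") after the make step would then give a quick per-year overview of what is about to be uploaded.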

Annotated Citations

This much smaller dataset has the extra fields mentioned above.

  • pull the updated repo
  • make tmp/commoncrawl_annotated.csv
  • TODO

Owner

  • Name: Common Crawl Foundation
  • Login: commoncrawl
  • Kind: organization
  • Email: info@commoncrawl.org

Common Crawl provides an archive of webpages going back to 2007.

Citation (citations_2025.csv)

,year,count,cumulative_count
0,2012,30,30
1,2013,80,110
2,2014,173,283
3,2015,213,496
4,2016,196,692
5,2017,316,1008
6,2018,533,1541
7,2019,710,2251
8,2020,977,3228
9,2021,1105,4333
10,2022,1266,5599
11,2023,1777,7376
12,2024,2300,9676
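
The cumulative_count column can be checked against the per-year counts with a few lines of standard-library Python. This is only a sketch: the data is copied from the listing above rather than read from citations_2025.csv, and the names check_cumulative and growth are hypothetical.

```python
import csv
import io

# Copied verbatim from the listing above.
CITATIONS_CSV = """\
,year,count,cumulative_count
0,2012,30,30
1,2013,80,110
2,2014,173,283
3,2015,213,496
4,2016,196,692
5,2017,316,1008
6,2018,533,1541
7,2019,710,2251
8,2020,977,3228
9,2021,1105,4333
10,2022,1266,5599
11,2023,1777,7376
12,2024,2300,9676
"""


def check_cumulative(rows):
    """Return True if cumulative_count equals the running sum of count."""
    running = 0
    for row in rows:
        running += int(row["count"])
        if running != int(row["cumulative_count"]):
            return False
    return True


rows = list(csv.DictReader(io.StringIO(CITATIONS_CSV)))
print(check_cumulative(rows))  # True

# Year-over-year ratio of citation counts (2013 relative to 2012, and so on).
growth = {
    int(cur["year"]): int(cur["count"]) / int(prev["count"])
    for prev, cur in zip(rows, rows[1:])
}
for year, ratio in growth.items():
    print(year, f"{ratio:.2f}")
```

The ratio drops below 1 only for 2016, the one year in which the count fell (196 after 213).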

GitHub Events

Total
  • Watch event: 11
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 27
  • Pull request event: 1
  • Pull request review event: 10
  • Pull request review comment event: 16
  • Fork event: 1
Last Year
  • Watch event: 11
  • Delete event: 1
  • Issue comment event: 2
  • Push event: 27
  • Pull request event: 1
  • Pull request review event: 10
  • Pull request review comment event: 16
  • Fork event: 1

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 68
  • Total Committers: 4
  • Avg Commits per committer: 17.0
  • Development Distribution Score (DDS): 0.088
Past Year
  • Commits: 26
  • Committers: 4
  • Avg Commits per committer: 6.5
  • Development Distribution Score (DDS): 0.231
Top Committers
Name              Email            Commits
Sebastian Nagel   s****n@c****g    62
Greg Lindahl      g****g@c****g    3
Thom Vaughan      t@l****k         2
Jodi Schneider    j****r@g****m    1
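
This page does not define the Development Distribution Score. Assuming the common definition, 1 minus the top committer's share of all commits, the all-time value of 0.088 can be reproduced from the table above. The function name below is hypothetical.

```python
def development_distribution_score(commit_counts):
    """DDS under the assumed definition: 1 - (top committer commits / total)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# All-time commits per committer, taken from the table above.
all_time = [62, 3, 2, 1]
print(round(development_distribution_score(all_time), 3))  # 0.088
```

A score near 0 means one person made almost all commits; a score closer to 1 means the work is spread more evenly.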

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • jodischneider (2)

Dependencies

requirements.txt pypi
  • pybtex *