doi_scraper

Digital Object Identifier scraper written in Python

https://github.com/albertocuadra/doi_scraper

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

bibtex crossref crossref-api doi latex python research scraper
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: AlbertoCuadra
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 34.2 KB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 5
Topics
bibtex crossref crossref-api doi latex python research scraper
Created almost 3 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

DOI Scraper

The DOI Scraper is a Python script that reads a .bib file, searches for entries missing required fields (such as a DOI), retrieves the missing information using the Crossref API, and reformats the file with consistent indentation. The refactored design supports different entry types (e.g., articles, books, inproceedings, tech reports), with each type defining its own required fields.
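The per-type required fields described above could be modeled as a simple lookup table. The following is a minimal sketch, not the script's actual code: the names `REQUIRED_FIELDS` and `missing_fields`, and the specific fields listed for each entry type, are assumptions made for illustration.

```python
# Hypothetical mapping from entry type to the fields it must contain.
# The field lists are illustrative assumptions, not taken from doi_scraper.py.
REQUIRED_FIELDS = {
    "article": ["title", "author", "journal", "year", "volume", "pages", "doi"],
    "book": ["title", "author", "publisher", "year"],
    "inproceedings": ["title", "author", "booktitle", "year"],
    "techreport": ["title", "author", "institution", "year"],
}


def missing_fields(entry_type, fields):
    """Return the required fields that are absent or empty in an entry."""
    required = REQUIRED_FIELDS.get(entry_type.lower(), [])
    return [name for name in required if not fields.get(name, "").strip()]


# Fields of the "Before" entry shown in the Example section.
entry = {
    "title": "Effect of equivalence ratio fluctuations on planar detonation discontinuities",
    "author": "Cuadra, Alberto and Huete, C{\\'e}sar and Vera, Marcos",
    "pages": "A30 1--39",
}
print(missing_fields("article", entry))  # ['journal', 'year', 'volume', 'doi']
```

Keeping the requirements in one table means a new entry type only needs a new key, and unknown types fall back to an empty list, so they are left untouched.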

Prerequisites

  • Python 3.x
  • requests library
  • tqdm library

Installation

  1. Clone the repository or download the doi_scraper.py file.

  2. Install the required dependencies by running the following command:

```shell
pip install -r requirements.txt
```

Usage

Place your input .bib file in the same directory as the doi_scraper.py script.

Open the doi_scraper.py file and modify the following variables according to your needs:

```python
input_file = 'input.bib'    # Name of the input .bib file
output_file = 'output.bib'  # Name of the output .bib file
INDENT_PRE = 4              # Number of spaces before the field name
INDENT_POST = 16            # Number of spaces after the field name
```
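One plausible reading of the two indentation settings is that `INDENT_PRE` spaces precede each field name, and the name is then padded to a width of `INDENT_POST` characters so that the `=` signs line up. The sketch below assumes that interpretation; `format_field` is a hypothetical helper, not a function from the script.

```python
INDENT_PRE = 4    # Spaces before the field name
INDENT_POST = 16  # Assumed here: width the field name is padded to


def format_field(name, value):
    """Format one BibTeX field line with aligned '=' signs (assumed layout)."""
    return f"{' ' * INDENT_PRE}{name:<{INDENT_POST}} = {{{value}}},"


print(format_field("year", "2020"))
print(format_field("doi", "10.1017/jfm.2020.651"))
```

Under this assumption, field names longer than `INDENT_POST` characters would not be truncated; they would simply push their `=` sign further right.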

Run the script using the following command:

```shell
python doi_scraper.py
```

The script searches for entries that are missing required fields (such as a DOI), retrieves the missing values from the Crossref API, and writes the completed entries to the output .bib file.

Once the script completes, you will find the updated .bib file in the same directory.
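The Crossref lookup described above could look roughly like the sketch below. It relies on the public Crossref REST API `/works` endpoint with its `query.bibliographic` and `rows` parameters; the helper names (`build_query_url`, `fields_from_item`, `fill_missing`) are hypothetical, and `sample_item` is a hand-written record that mirrors the values in this README's example rather than a captured API response.

```python
from urllib.parse import urlencode

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_query_url(title, rows=1):
    """Build a Crossref /works search URL for a bibliographic query."""
    query = urlencode({"query.bibliographic": title, "rows": rows})
    return f"{CROSSREF_WORKS_URL}?{query}"


def fields_from_item(item):
    """Map a Crossref work record (a dict) to BibTeX-style fields."""
    fields = {}
    if item.get("DOI"):
        fields["doi"] = item["DOI"]
    if item.get("container-title"):
        fields["journal"] = item["container-title"][0]
    if item.get("volume"):
        fields["volume"] = item["volume"]
    date_parts = item.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0] and date_parts[0][0]:
        fields["year"] = str(date_parts[0][0])
    return fields


def fill_missing(entry, item):
    """Add only the fields the entry lacks; existing values are kept."""
    merged = dict(entry)
    for name, value in fields_from_item(item).items():
        merged.setdefault(name, value)
    return merged


# Hand-written record for illustration (not a real API response).
sample_item = {
    "DOI": "10.1017/jfm.2020.651",
    "container-title": ["Journal of Fluid Mechanics"],
    "volume": "903",
    "issued": {"date-parts": [[2020]]},
}
entry = {"title": "Effect of equivalence ratio fluctuations on planar detonation discontinuities"}
print(build_query_url(entry["title"]))
print(fill_missing(entry, sample_item))
```

In a real run, the URL would be fetched with the `requests` library and the first element of the response's `message.items` list would play the role of `sample_item`. Since a title search can return the wrong work, comparing the returned title with the entry's title before merging would be a sensible safeguard.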

Optional Arguments

  • --format-only: Reformat the file without performing any Crossref lookups.
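A boolean switch such as `--format-only` is commonly declared with `argparse`. The following is a minimal sketch of that pattern, assuming the flag takes no value; the `parse_args` wrapper, the description, and the help text are illustrative and not taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Complete and reformat a .bib file.")
    parser.add_argument(
        "--format-only",
        action="store_true",
        help="reformat the file without performing any Crossref lookups",
    )
    return parser.parse_args(argv)


args = parse_args(["--format-only"])
print(args.format_only)  # True
```

With this pattern the flag would be passed on the command line as `python doi_scraper.py --format-only`, and omitting it leaves `args.format_only` set to `False`.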

Example

Before

```bibtex
@article{Cuadra2020,
    title = {Effect of equivalence ratio fluctuations on planar detonation discontinuities},
    author = {Cuadra, Alberto and Huete, C{\'e}sar and Vera, Marcos},
    pages= {A30 1--39}
}
```

After

```bibtex
@article{Cuadra2020,
    title = {Effect of equivalence ratio fluctuations on planar detonation discontinuities},
    author = {Cuadra, Alberto and Huete, C{\'e}sar and Vera, Marcos},
    pages = {A30 1--39},
    year = {2020},
    journal = {Journal of Fluid Mechanics},
    volume = {903},
    doi = {10.1017/jfm.2020.651},
}
```

License

This project is licensed under the MIT License.

Owner

  • Name: Alberto Cuadra-Lara
  • Login: AlbertoCuadra
  • Kind: user
  • Location: Madrid, Spain
  • Company: Universidad Carlos III de Madrid

Pre-doctoral researcher in Fluid Mechanics

Citation (CITATION.cff)

# YAML 1.2
---
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
type: misc
license: "MIT"
title: "DOI Scraper"
version: 1.2.0
doi: 10.5281/zenodo.7932535
date-released: 2025-03-20
url: "https://github.com/AlbertoCuadra/doi_scraper"
abstract:
    "The DOI Scraper is a Python script that reads a `.bib` file, searches for entries missing required fields (such as a DOI), retrieves the missing information using the Crossref API, and reformats the file with consistent indentation. The refactored design supports different entry types (e.g., articles, books, inproceedings, tech reports), with each type defining its own required fields."
authors: 
  -
    family-names: "Cuadra"
    given-names: A
    orcid: "https://orcid.org/0000-0001-8280-2426"
keywords: 
  - scraper
  - latex
  - bibtex
  - doi
  - crossref
  - "crossref-api"
  - python
  - "open-source"

GitHub Events

Total
  • Release event: 1
  • Watch event: 3
  • Push event: 3
  • Pull request event: 4
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 3
  • Push event: 3
  • Pull request event: 4
  • Create event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 13
  • Total Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Alberto Cuadra Lara a****a@i****s 13
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 10
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • AlbertoCuadra (8)
Top Labels
Issue Labels
Pull Request Labels
documentation (1) bug (1) enhancement (1)