scrape_papers_arxiv
This script scrapes the latest papers from specified categories on arXiv, extracts the text from their PDFs, and searches for citations of a specified author within those papers. Then it prints the number of citations found in each paper, the total number of citations, and the percentage of papers that cite the author.
Science Score: 23.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ianpaga
- Language: Jupyter Notebook
- Default Branch: main
- Size: 274 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Scraping latest papers from arXiv.org: Searching for citations
This script scrapes the latest papers from specified categories on arXiv, extracts the text from their PDFs, and searches for citations of a specified author within those papers. Then it prints the number of citations found in each paper, the total number of citations, and the percentage of papers that cite the author.
Running the Jupyter notebook scrape_papers_arXiv.ipynb on June 5th, 2024 with author = 'Calzetti, D.' showed that 6 of the 122 papers in that day's astro-ph listing (4.9 %) cited the author at least once. The paper with identifier 2406.01831 had the most citations (19), and the author was cited 35 times in total across all papers. See the figure below for more insights.
Output example:
```
Getting paper identifiers XXXX.YYYYY from listing: https://arxiv.org/list/astro-ph/new
Scraping paper: https://arxiv.org/pdf/2406.01666.pdf Number of citations: 10 (++++++++++)
Scraping paper: https://arxiv.org/pdf/2406.01673.pdf Number of citations: 2 (++)
Scraping paper: https://arxiv.org/pdf/2406.01683.pdf Number of citations: 1 (+)
Scraping paper: https://arxiv.org/pdf/2406.01831.pdf Number of citations: 19 (+++++++++++++++++++)
Scraping paper: https://arxiv.org/pdf/2406.02072.pdf Number of citations: 1 (+)
Scraping paper: https://arxiv.org/pdf/2402.18515.pdf Number of citations: 2 (++)
Scraping paper: https://arxiv.org/pdf/2405.19195.pdf
Total number of citations: 35
6/122 (4.9 %) of papers cite the author Calzetti, D.
```
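The report lines in the output example can be derived from a table of per-paper counts. The following is a minimal sketch and not the notebook's actual code: the function name `format_report` and the `counts` mapping (PDF URL to citation count) are hypothetical names chosen for illustration.

```python
def format_report(counts, author):
    """Build report lines from a mapping of PDF URL -> citation count.

    Hypothetical helper: the notebook's own names and structure may differ.
    """
    lines = []
    for url, n in counts.items():
        line = f"Scraping paper: {url}"
        if n > 0:
            # One plus sign per citation, as in the sample output.
            line += f" Number of citations: {n} ({'+' * n})"
        lines.append(line)

    total = sum(counts.values())
    citing = sum(1 for n in counts.values() if n > 0)
    # Percentage of papers with at least one citation.
    share = 100.0 * citing / len(counts) if counts else 0.0
    lines.append(f"Total number of citations: {total}")
    lines.append(
        f"{citing}/{len(counts)} ({share:.1f} %) of papers cite the author {author}"
    )
    return lines


counts = {
    "https://arxiv.org/pdf/2406.01673.pdf": 2,
    "https://arxiv.org/pdf/2406.01683.pdf": 1,
    "https://arxiv.org/pdf/2405.19195.pdf": 0,
}
for line in format_report(counts, "Calzetti, D."):
    print(line)
```

With the June 5th numbers (six citing papers out of 122, with counts 10, 2, 1, 19, 1 and 2), the same arithmetic gives a total of 35 and 100 × 6 / 122 ≈ 4.9 %.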
Workflow:
- Fetches identifiers of the latest papers from arXiv in specified categories.
- Constructs the URLs to access the PDFs of these papers.
- Downloads each PDF and extracts its text.
- Searches the extracted text for citations of the specified author.
- Prints the number of citations found in each paper along with a visual representation using plus signs.
- Displays the total number of citations and the percentage of papers that cite the author.
- Removes the temporary PDF file after processing.
- Plots the number of citations for each paper that has at least one citation.
- Plots a pie chart showing the proportion of papers with at least one citation and papers with no citations.
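The first steps of the workflow (collecting identifiers, building PDF URLs, and counting citations in the extracted text) could look roughly like the sketch below. It rests on assumptions that are not confirmed by the notebook: that identifiers appear in the listing HTML as `arXiv:XXXX.YYYYY`, and that a citation is counted as a whole-word occurrence of the author's surname. All function and constant names are hypothetical.

```python
import re
import urllib.request

LISTING_URL = "https://arxiv.org/list/{category}/new"
PDF_URL = "https://arxiv.org/pdf/{identifier}.pdf"

# Assumption: identifiers show up in the page as "arXiv:2406.01831".
ID_PATTERN = re.compile(r"arXiv:(\d{4}\.\d{4,5})")


def fetch_listing(category):
    """Download the HTML of the 'new' listing for one category (needs network)."""
    with urllib.request.urlopen(LISTING_URL.format(category=category)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_identifiers(html):
    """Return the unique identifiers found in the listing, in page order."""
    return list(dict.fromkeys(ID_PATTERN.findall(html)))


def pdf_url(identifier):
    """Build the PDF URL for one identifier."""
    return PDF_URL.format(identifier=identifier)


def count_citations(text, author):
    """Count whole-word occurrences of the author's surname.

    Assumption: 'Calzetti, D.' is matched through the surname 'Calzetti'.
    """
    surname = author.split(",")[0].strip()
    return len(re.findall(rf"\b{re.escape(surname)}\b", text))


sample = "dust attenuation (Calzetti et al. 2000; Calzetti 2001)"
print(count_citations(sample, "Calzetti, D."))  # prints 2
```

Matching on the surname alone is a simple heuristic: it also counts other authors who share the surname and misses citations given only as reference numbers.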
Usage:
- Set the 'author' variable to the name of the author you want to search for.
- Run the script.
Requirements:
- PyPDF2 can be installed by running pip install PyPDF2 in the terminal.
- Matplotlib can be installed by running pip install matplotlib in the terminal.
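As an illustration of how PyPDF2 fits into the download, extract, and clean-up steps of the workflow, here is a hedged sketch. It assumes a recent PyPDF2 release that provides `PdfReader`; the helper names `download_pdf`, `extract_text`, and `process_temp_pdf` are hypothetical and not taken from the notebook.

```python
import os
import tempfile
import urllib.request


def download_pdf(url):
    """Return the raw bytes of a PDF (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def extract_text(pdf_path):
    """Concatenate the text of every page using PyPDF2."""
    from PyPDF2 import PdfReader  # lazy import; requires `pip install PyPDF2`

    reader = PdfReader(pdf_path)
    # Pages without a text layer may yield an empty result.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def process_temp_pdf(data, process=extract_text):
    """Write PDF bytes to a temporary file, process it, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return process(path)
    finally:
        os.remove(path)  # clean up even if processing fails
```

A call such as `process_temp_pdf(download_pdf("https://arxiv.org/pdf/2406.01831.pdf"))` would then return the paper's text and leave no temporary file behind.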
Owner
- Name: Ian Padilla-Gay
- Login: ianpaga
- Kind: user
- Location: Palo Alto
- Company: SLAC, Stanford
- Repositories: 1
- Profile: https://github.com/ianpaga