https://github.com/axect/inspire_tag_extraction
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Axect
- License: mit
- Language: Python
- Default Branch: main
- Size: 10.7 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
INSPIRE Tag Extraction
This repository contains two Python scripts demonstrating how to extract research tags (e.g., hep-th) from INSPIRE HEP pages using Selenium. Because INSPIRE pages are dynamically rendered (JavaScript-based), traditional HTTP requests and parsing (e.g., requests + BeautifulSoup) will not reliably retrieve the DOM elements. Instead, we use Selenium to render pages and parse the final DOM.
Table of Contents
- Google Sheets
- Overview of Scripts
- Requirements
- Installation and Setup
- Usage
- Additional Notes
- License
Google Sheets
The Google Sheet linked below presents the 2025 HEP Theory Postdoc Rumor Mill data after being processed through our bulk CSV parsing script. In particular, each entry in the rumor mill (e.g., a candidate, their institution, and acceptance status) has been enhanced with a new column, "Area," which reflects the research tag (such as hep-th, hep-ph, etc.) retrieved from the candidate’s INSPIRE profile.
- 2025 PostDoc Rumor Mill with tags (Updated: 2025-02-19)
Overview of Scripts
1. Bulk CSV Parsing (bulk_csv_parser.py)
   - Reads a CSV file containing one or more INSPIRE links (in a column called "Inspire link").
   - Uses Selenium to visit each link, extracts the Area (i.e., the tag like hep-th) and adds it as a new column (Area).
   - Outputs a new CSV file, appending "_with_area" to the original file's name.
2. Interactive Tag Parser (interactive_parser.py)
   - Runs in a loop, asking the user to input an INSPIRE link.
   - Attempts to extract and print the research tag.
   - Exits on certain commands (quit, q, or exit).
Requirements
- Python 3.7+ (or a version that supports Selenium well)
- Google Chrome installed
- ChromeDriver (matching your installed Chrome version) in your system PATH or placed in a known location
- Python packages: pandas, selenium, tqdm
Installation and Setup
1. Clone or download this repository:
   ```bash
   git clone https://github.com/Axect/INSPIRE-Tag-Extraction
   cd INSPIRE-Tag-Extraction
   ```
2. Create and activate a virtual environment via uv (recommended):
   ```bash
   uv venv
   source .venv/bin/activate
   ```
3. Install required packages:
   ```bash
   uv pip sync requirements.txt
   ```
   Or manually:
   ```bash
   uv pip install pandas selenium tqdm
   ```
4. Install ChromeDriver:
   - Download the ChromeDriver build that matches your installed Google Chrome version.
   - Make sure it is placed in your system PATH or in the same directory as the scripts.
Usage
Script 1: Bulk CSV Parsing
File: bulk_csv_parser.py
Description: This script reads an input CSV, visits each "Inspire link" row, extracts the tag, and writes a new CSV with an added Area column.
- Run the script:
  ```bash
  python bulk_csv_parser.py --csv path/to/input.csv
  ```
- What it does:
  - It will open the CSV specified by --csv.
  - For each row, it extracts the tag from the provided Inspire link.
  - A new file (with _with_area.csv appended to the base name) will be created.

Example:
```bash
python bulk_csv_parser.py --csv "inspire_authors.csv"
# Creates "inspire_authors_with_area.csv"
```
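The bulk workflow described above (read the "Inspire link" column, look up a tag per row, write a copy with an Area column) can be sketched as follows. This is a minimal illustration, not the repository's actual code: it uses the standard csv module in place of pandas and tqdm so that it runs without extra packages, the function names `output_path_for` and `add_area_column` are hypothetical, and the tag lookup is passed in as a callable rather than tied to Selenium. Leaving Area empty for rows without a link is also an assumption.

```python
import csv
from pathlib import Path
from typing import Callable, Optional


def output_path_for(csv_path: str) -> Path:
    """Return the input path with "_with_area" appended to its base name."""
    path = Path(csv_path)
    return path.with_name(f"{path.stem}_with_area{path.suffix}")


def add_area_column(
    csv_path: str,
    extract_tag: Callable[[str], Optional[str]],
    link_column: str = "Inspire link",
) -> Path:
    """Copy csv_path to a new file that has an extra "Area" column.

    extract_tag is a hypothetical callable (for example a Selenium-backed
    function) that maps an INSPIRE URL to a tag such as "hep-th", or None.
    """
    out_path = output_path_for(csv_path)
    with open(csv_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if "Area" not in fieldnames:
            fieldnames.append("Area")
        rows = []
        for row in reader:
            link = (row.get(link_column) or "").strip()
            # Assumption: rows without a link get an empty Area instead of failing.
            row["Area"] = (extract_tag(link) or "") if link else ""
            rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Passing the lookup as a parameter keeps the CSV handling testable without a browser; calling `add_area_column("inspire_authors.csv", my_lookup)` with any function that returns a tag string would write the new file next to the input.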
Script 2: Interactive Tag Parser
File: interactive_parser.py
Description: This script opens a headless Chrome browser and interactively prompts the user to input INSPIRE links one at a time.
- Run the script:
  ```bash
  python interactive_parser.py
  ```
- What it does:
  - A loop prompts you to enter an INSPIRE link.
  - Once you enter the link, the script attempts to load the page, extract the <span class="ant-tag __UnclickableTag__"> text, and print it on the console.
  - Type quit, q, or exit to stop.
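The prompt loop described above might look roughly like this. It is a sketch under assumptions rather than the script itself: `run_prompt_loop`, `read_line`, and `write_line` are hypothetical names, the tag lookup is injected as a callable so the loop can be exercised without a browser, and treating the quit commands as case-insensitive is a guess.

```python
from typing import Callable, Optional

# Commands that end the session, as listed in this README.
QUIT_COMMANDS = {"quit", "q", "exit"}


def run_prompt_loop(
    extract_tag: Callable[[str], Optional[str]],
    read_line: Callable[[str], str] = input,
    write_line: Callable[[str], None] = print,
) -> int:
    """Prompt for INSPIRE links until a quit command; return how many were processed."""
    processed = 0
    while True:
        try:
            entry = read_line("INSPIRE link (quit, q, or exit to stop): ").strip()
        except EOFError:
            # End of input (for example Ctrl-D) also ends the loop.
            break
        if entry.lower() in QUIT_COMMANDS:
            break
        if not entry:
            # Ignore blank lines and prompt again.
            continue
        tag = extract_tag(entry)
        write_line(tag if tag else "No tag found")
        processed += 1
    return processed
```

With the defaults, `run_prompt_loop(my_lookup)` reads from the terminal and prints each result; the injectable reader and writer exist only to make the loop easy to test.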
Additional Notes
- If ChromeDriver cannot be found, you may need to specify the path explicitly. For example:
  ```python
  driver = webdriver.Chrome(executable_path="/path/to/chromedriver", options=chrome_options)
  ```
  Note that the executable_path argument was removed in Selenium 4.10; on newer versions, pass the path through a Service object (service=Service("/path/to/chromedriver")) instead.
- If you encounter any issues with dynamic rendering delays, consider increasing the timeout in WebDriverWait(driver, timeout=XX).
- The tqdm library is used in the CSV script to show progress bars. If you prefer, you can remove it or adjust the code to use standard print statements.
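The two Selenium-related notes above (an explicit ChromeDriver path and a configurable wait) can be combined in a short sketch. It assumes Selenium 4, where the driver path goes through a Service object. The names `make_driver`, `extract_tag`, `clean_tag`, and `TAG_SELECTOR` are hypothetical, the CSS selector is derived from the span class quoted in this README, and returning None on timeout is an assumption about error handling, not the scripts' confirmed behaviour.

```python
from typing import Optional

# CSS selector built from the span class quoted in this README.
TAG_SELECTOR = "span.ant-tag.__UnclickableTag__"


def clean_tag(text: Optional[str]) -> Optional[str]:
    """Strip whitespace; return None when nothing usable is left."""
    stripped = (text or "").strip()
    return stripped or None


def make_driver(chromedriver_path: Optional[str] = None):
    """Start headless Chrome, optionally with an explicit ChromeDriver path."""
    # Imported lazily so the helpers above can be loaded without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless=new")
    if chromedriver_path:
        service = Service(executable_path=chromedriver_path)
    else:
        service = Service()  # fall back to the driver found on PATH
    return webdriver.Chrome(service=service, options=options)


def extract_tag(driver, url: str, timeout: float = 10.0) -> Optional[str]:
    """Load url and return the first tag text, or None if it never appears."""
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    try:
        # Wait for the JavaScript-rendered tag; raise timeout for slow pages.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, TAG_SELECTOR))
        )
    except TimeoutException:
        return None
    return clean_tag(element.text)
```

A typical session would create one driver with `make_driver()`, call `extract_tag(driver, url, timeout=20)` for each link, and finish with `driver.quit()` to close the browser.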
License
This project is licensed under the MIT License. Feel free to use and modify the code according to your needs.
Owner
- Name: Tae-Geun Kim
- Login: Axect
- Kind: user
- Location: Seoul, South Korea
- Company: Yonsei Univ.
- Website: https://axect.github.io
- Repositories: 21
- Profile: https://github.com/Axect
Ph.D student of particle physics & Rustacean
GitHub Events
Total
- Push event: 5
- Pull request event: 6
- Create event: 3
Last Year
- Push event: 5
- Pull request event: 6
- Create event: 3
Committers
Last synced: 12 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| axect | a****g@p****e | 4 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- Axect (6)