https://github.com/axect/inspire_tag_extraction

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Axect
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 10.7 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • License

README.md

INSPIRE Tag Extraction

This repository contains two Python scripts demonstrating how to extract research tags (e.g., hep-th) from INSPIRE HEP pages using Selenium. Because INSPIRE pages are dynamically rendered (JavaScript-based), traditional HTTP requests and parsing (e.g., requests + BeautifulSoup) will not reliably retrieve the DOM elements. Instead, we use Selenium to render pages and parse the final DOM.
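To illustrate why the rendered DOM matters, here is a minimal standard-library sketch. The two HTML strings are made-up stand-ins, not real INSPIRE responses: one mimics the near-empty shell that a plain HTTP request would likely receive, the other mimics the DOM after JavaScript has run. The class names come from the `ant-tag __UnclickableTag__` span mentioned later in this README; `TagFinder` and `find_tag` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class TagFinder(HTMLParser):
    """Collect the text of the first <span> carrying both target classes."""

    WANTED = {"ant-tag", "__UnclickableTag__"}

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.tag_text: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "span" or self.tag_text is not None:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.WANTED <= classes:
            self._inside = True

    def handle_data(self, data):
        # Keep only the first non-empty text found inside the target span.
        if self._inside and self.tag_text is None and data.strip():
            self.tag_text = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


def find_tag(html: str) -> Optional[str]:
    """Return the research tag found in `html`, or None if it is absent."""
    finder = TagFinder()
    finder.feed(html)
    finder.close()
    return finder.tag_text


# Made-up stand-in for what a plain HTTP request might return: no tag yet.
STATIC_SHELL = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Made-up stand-in for the DOM after the browser has executed the JavaScript.
RENDERED_DOM = (
    '<html><body><div id="root">'
    '<span class="ant-tag __UnclickableTag__">hep-th</span>'
    "</div></body></html>"
)
```

With these stand-ins, `find_tag(STATIC_SHELL)` finds nothing, while `find_tag(RENDERED_DOM)` yields the tag text. In the real scripts the rendered markup would come from the browser that Selenium drives.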


Google Sheets

The Google Sheet linked below presents the 2025 HEP Theory Postdoc Rumor Mill data after being processed through our bulk CSV parsing script. In particular, each entry in the rumor mill (e.g., a candidate, their institution, and acceptance status) has been enhanced with a new column, "Area," which reflects the research tag (such as hep-th, hep-ph, etc.) retrieved from the candidate’s INSPIRE profile.

Overview of Scripts

  1. Bulk CSV Parsing (bulk_csv_parser.py)

    • Reads a CSV file containing one or more INSPIRE links (in a column called "Inspire link").
    • Uses Selenium to visit each link, extracts the Area (i.e., the tag like hep-th) and adds it as a new column (Area).
    • Outputs a new CSV file appending "_with_area" to the original file's name.
  2. Interactive Tag Parser (interactive_parser.py)

    • Runs in a loop, asking the user to input an INSPIRE link.
    • Attempts to extract and print the research tag.
    • Exits on certain commands (quit, q, or exit).

Requirements

  • Python 3.7+ (or a version that supports Selenium well)
  • Google Chrome installed
  • ChromeDriver (matching your installed Chrome version) in your system PATH or placed in a known location
  • Python packages: pandas, selenium, tqdm

Installation and Setup

  1. Clone or download this repository:

       git clone https://github.com/Axect/INSPIRE-Tag-Extraction
       cd INSPIRE-Tag-Extraction

  2. Create and activate a virtual environment via uv (recommended):

       uv venv
       source .venv/bin/activate

  3. Install required packages:

       uv pip sync requirements.txt

     Or manually:

       uv pip install pandas selenium tqdm

  4. Install ChromeDriver:

    • Download the ChromeDriver build that matches your installed Google Chrome version from the official ChromeDriver download page.
    • Make sure it is placed in your system PATH or in the same directory as the scripts.

Usage

Script 1: Bulk CSV Parsing

File: bulk_csv_parser.py

Description: This script reads an input CSV, visits the link in each row's "Inspire link" column, extracts the tag, and writes a new CSV with an added Area column.

  1. Run the script:

       python bulk_csv_parser.py --csv path/to/input.csv
  2. What it does:
    • It will open the CSV specified by --csv.
    • For each row, it extracts the tag from the provided Inspire link.
    • A new file (with _with_area.csv appended to the base name) will be created.

Example:

```bash
python bulk_csv_parser.py --csv "inspire_authors.csv"
# Creates "inspire_authors_with_area.csv"
```
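The bulk workflow described above can be sketched as follows. This is not the repository's code: per this README the real script uses pandas and tqdm, whereas this sketch swaps in the standard-library `csv` module so it runs without dependencies. The names `output_path_for` and `add_area_column` are hypothetical, and `extract_tag` stands for any callable that maps an INSPIRE link to a tag (for example, a Selenium-based function).

```python
import csv
from pathlib import Path
from typing import Callable, Optional


def output_path_for(csv_path: str) -> Path:
    """input.csv -> input_with_area.csv, in the same directory."""
    p = Path(csv_path)
    return p.with_name(f"{p.stem}_with_area{p.suffix}")


def add_area_column(
    csv_path: str,
    extract_tag: Callable[[str], Optional[str]],
) -> Path:
    """Copy the CSV, adding an "Area" column filled from each "Inspire link"."""
    src = Path(csv_path)
    dst = output_path_for(csv_path)
    with src.open(newline="", encoding="utf-8") as fin, \
            dst.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        fieldnames = list(reader.fieldnames or []) + ["Area"]
        writer = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            link = (row.get("Inspire link") or "").strip()
            # Rows without a link get an empty Area (an assumption of this sketch).
            row["Area"] = (extract_tag(link) or "") if link else ""
            writer.writerow(row)
    return dst
```

Passing the extractor in as a callable keeps the CSV handling testable without launching a browser; a stub such as `lambda url: "hep-th"` is enough to check the output layout.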

Script 2: Interactive Tag Parser

File: interactive_parser.py

Description: This script opens a headless Chrome browser and interactively prompts the user to input INSPIRE links one at a time.

  1. Run the script:

       python interactive_parser.py
  2. What it does:
    • A loop prompts you to enter an INSPIRE link.
    • Once you enter the link, the script attempts to load the page, extract the <span class="ant-tag __UnclickableTag__"> text, and print it on the console.
    • Type quit, q, or exit to stop.
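The prompt loop described above might look roughly like this. It is a sketch under stated assumptions, not the script itself: `is_quit` and `run_prompt_loop` are hypothetical names, the case-insensitive matching of the quit commands is an assumption, and `extract_tag` is a placeholder for whatever function loads the page and returns the tag. The loop takes its input lines as an iterable so it can be exercised without a terminal.

```python
from typing import Callable, Iterable, List, Optional

# Commands that end the session, as listed in this README.
QUIT_COMMANDS = {"quit", "q", "exit"}


def is_quit(text: str) -> bool:
    """True if the input is a quit command (case-insensitivity is assumed)."""
    return text.strip().lower() in QUIT_COMMANDS


def run_prompt_loop(
    lines: Iterable[str],
    extract_tag: Callable[[str], Optional[str]],
) -> List[str]:
    """Process links until a quit command; return one message per link."""
    messages: List[str] = []
    for line in lines:
        if is_quit(line):
            break
        link = line.strip()
        if not link:
            continue  # ignore empty input and prompt again
        tag = extract_tag(link)
        messages.append(f"{link} -> {tag}" if tag else f"{link} -> no tag found")
    return messages
```

In a terminal session, the `lines` argument could be fed from repeated `input()` calls, with each message printed as it is produced.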

Additional Notes

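  • If ChromeDriver cannot be found, you may need to specify the path explicitly. With Selenium 3, for example: driver = webdriver.Chrome(executable_path="/path/to/chromedriver", options=chrome_options). Selenium 4 deprecated and later removed the executable_path argument; there, pass the path through a Service object instead: driver = webdriver.Chrome(service=Service(executable_path="/path/to/chromedriver"), options=chrome_options).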
  • If you encounter any issues with dynamic rendering delays, consider increasing the timeout in WebDriverWait(driver, timeout=XX).
  • The tqdm library is used in the CSV script to show progress bars. If you prefer, you can remove it or adjust the code to use standard print statements.
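The driver-path and timeout notes above can be combined into one hedged sketch, assuming Selenium 4. The function names `build_chrome_driver` and `extract_tag`, the 10-second default, and the CSS selector (derived from the `ant-tag __UnclickableTag__` class named in this README) are assumptions for illustration; the repository's scripts may be organized differently.

```python
from typing import Optional

# CSS selector for <span class="ant-tag __UnclickableTag__"> (assumed form).
TAG_SELECTOR = "span.ant-tag.__UnclickableTag__"


def build_chrome_driver(chromedriver_path: Optional[str] = None, headless: bool = True):
    """Create a Chrome driver, optionally with an explicit ChromeDriver path."""
    # Selenium is imported lazily so this module loads even without it installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        # "--headless=new" targets recent Chrome; older builds use "--headless".
        options.add_argument("--headless=new")
    # Without a path, Selenium looks for ChromeDriver itself (e.g. on PATH).
    service = Service(executable_path=chromedriver_path) if chromedriver_path else Service()
    return webdriver.Chrome(service=service, options=options)


def extract_tag(driver, url: str, timeout: float = 10.0) -> Optional[str]:
    """Load `url` and wait up to `timeout` seconds for the tag to render."""
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    try:
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, TAG_SELECTOR))
        )
    except TimeoutException:
        return None  # the tag never appeared; try a larger timeout
    return element.text.strip() or None
```

A typical session would build one driver, call `extract_tag` for each link, and finish with `driver.quit()`. Raising `timeout` trades speed for robustness on slow-loading pages.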

License

This project is licensed under the MIT License. Feel free to use and modify the code according to your needs.

Owner

  • Name: Tae-Geun Kim
  • Login: Axect
  • Kind: user
  • Location: Seoul, South Korea
  • Company: Yonsei Univ.

Ph.D student of particle physics & Rustacean

GitHub Events

Total
  • Push event: 5
  • Pull request event: 6
  • Create event: 3
Last Year
  • Push event: 5
  • Pull request event: 6
  • Create event: 3

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 4
  • Total Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 4
  • Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
axect a****g@p****e 4

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • Axect (6)