redeye_pipeline

Pipeline for scraping metadata from PubMed IDs

https://github.com/inebriateduck/redeye_pipeline

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Inebriateduck
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.03 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme Citation

README.md

RedEye data scraping pipeline

Please cite this repository if you use the software it contains.

This project is still under active development. It is a volunteer project that I work on when I have time, so updates may be sporadic.

RedEye

RedEye is an R package based on easyPubMed by Damiano Fantini. It is optimized for extracting email addresses for use in cross-sectional surveys.

Installation

RedEye is not available through CRAN, so it must be installed manually.

After downloading RedEye.tar.gz, open RStudio and select Tools > Install Packages > Package Archive File > RedEye.tar.gz.

RedEye Extractor & Hex Breaker

RedEye Extractor is a specialized script for extracting information for cross-sectional surveys from the PubMed database. The script is designed to scale with the user's hardware: the more CPU cores you have, the faster you can mine your target information from a list of PMIDs. Note that this script only reads CSV files; it does not read the XLSX format.
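As an illustration of how work on a PMID list can be spread across CPU cores, here is a minimal Python sketch. It is not the RedEye Extractor R script. The function names (`read_pmids`, `split_into_chunks`, `process_in_parallel`) and the assumption that PMIDs sit in the first column of the CSV file are hypothetical and chosen only for this example.

```python
import csv
import os
from concurrent.futures import ProcessPoolExecutor


def read_pmids(csv_path, column=0):
    """Read PMIDs from one CSV column, skipping non-numeric cells such as a header."""
    pmids = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Assumption: PMIDs are plain integers stored in the given column.
            if len(row) > column and row[column].strip().isdigit():
                pmids.append(row[column].strip())
    return pmids


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous chunks of near-equal size."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_in_parallel(pmids, worker, max_workers=None):
    """Run worker(chunk) in separate processes and flatten the returned lists.

    `worker` must be a top-level function that takes a list of PMIDs and
    returns a list of records.
    """
    max_workers = max_workers or os.cpu_count() or 1
    chunks = split_into_chunks(pmids, max_workers)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(worker, chunks)
        return [record for chunk_result in results for record in chunk_result]
```

With one chunk per core, adding cores shortens the run roughly in proportion, as long as the per-PMID work dominates and no external rate limit applies.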

Hex Breaker is the second step in the pipeline. The R script calls it automatically through the reticulate package once it has completed its portion of the job. Hex Breaker is a Python script that removes duplicate values (e.g. email addresses) and cleans up scrambled output in which special characters are known to be replaced. When removing duplicate emails, Hex Breaker keeps the most recent instance of the address (for example, one found in 2024 will be removed in favour of one from 2025).
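The "keep the most recent instance" rule can be sketched as follows. This is a minimal illustration, not the actual Hex Breaker code. The record layout (dictionaries with `email` and `year` keys) and the case-insensitive comparison of addresses are assumptions made for the example.

```python
def keep_most_recent(records, key="email", year_key="year"):
    """Return one record per email address, keeping the one with the latest year."""
    latest = {}
    for record in records:
        # Assumption: addresses that differ only in case or padding are duplicates.
        address = record[key].strip().lower()
        current = latest.get(address)
        if current is None or int(record[year_key]) > int(current[year_key]):
            latest[address] = record
    return list(latest.values())


records = [
    {"email": "a@example.org", "year": "2024"},
    {"email": "b@example.org", "year": "2023"},
    {"email": "A@example.org", "year": "2025"},
]
deduplicated = keep_most_recent(records)
```

Here the 2024 entry for `a@example.org` is dropped in favour of the 2025 one, and the single `b@example.org` entry is kept unchanged.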

Using the pipeline

  1. Ensure that you have R, RStudio and Python installed on your machine.
  2. Download RedEye Extractor.
  3. Open RStudio and install the RedEye package by selecting Tools > Install Packages > Package Archive File > RedEye.tar.gz.
  4. Once installed, RedEye can be loaded in R. It provides the same functions as easyPubMed.
  5. Download Hex Breaker. Take note of the download path, as it will be required later.
  6. Create a new R script.
  7. Copy the extraction script into your new script.
  8. Replace 'Input pathway' with the path to the folder containing your PMID-bearing CSV files.
  9. Replace 'Output pathway' with your desired output directory. If the specified file does not exist, RedEye will create a new file with that name at the target location.
  10. Replace 'Python input' with the path to the Hex Breaker file (including the file name itself, which should end in .py).
  11. Run the script (Ctrl + Shift + Enter is a useful shortcut)
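Before running the script, it can help to check the three paths from steps 8 to 10. The following Python sketch is a hypothetical helper and is not part of RedEye. The function name `check_pipeline_paths` and the rule that the parent folder of the output path must already exist are assumptions for illustration.

```python
from pathlib import Path


def check_pipeline_paths(input_dir, output_path, hex_breaker_path):
    """Return a list of problems found; an empty list means the paths look usable."""
    problems = []

    # Step 8: the input folder must exist and hold at least one CSV file.
    input_dir = Path(input_dir)
    if not input_dir.is_dir():
        problems.append(f"Input folder not found: {input_dir}")
    elif not any(p.suffix.lower() == ".csv" for p in input_dir.iterdir()):
        problems.append(f"No CSV files in input folder: {input_dir}")

    # Step 9: assumption for this sketch: the parent folder should already exist.
    if not Path(output_path).parent.is_dir():
        problems.append(f"Parent folder of output path not found: {output_path}")

    # Step 10: the path must point at the Hex Breaker file itself, ending in .py.
    script = Path(hex_breaker_path)
    if script.suffix != ".py":
        problems.append(f"Hex Breaker path should end in .py: {script}")
    elif not script.is_file():
        problems.append(f"Hex Breaker file not found: {script}")

    return problems
```

Calling `check_pipeline_paths(...)` with your own three paths and printing the returned list gives a quick view of anything that needs fixing before the R script is run.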

All code is licensed under GPL-2.

C. Daniel Fry, 2025

Owner

  • Name: Daniel Fry
  • Login: Inebriateduck
  • Kind: user

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Push event: 54
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 1
  • Push event: 54
  • Create event: 1