redeye_pipeline

Pipeline for scraping metadata from PubMed IDs

https://github.com/inebriateduck/redeye_pipeline

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Inebriateduck
Language: Python
Default Branch: main
Homepage:
Size: 3.03 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created over 1 year ago · Last pushed 10 months ago

Metadata Files

Readme Citation

RedEye data scraping pipeline

Please cite this repository if you use the software it contains.

RedEye is still under active development. It is a volunteer project that I work on when I have time, so updates may be sporadic.

RedEye

RedEye is an R package based on easyPubMed by Damiano Fantini. It is optimized for extracting email addresses for use in cross-sectional surveys.

Installation

RedEye is not available through CRAN, so it must be installed manually.

After downloading RedEye.tar.gz, open RStudio and go to Tools > Install Packages > Package Archive File > RedEye.tar.gz.

RedEye Extractor & Hex Breaker

RedEye Extractor is a specialized script for extracting information for cross-sectional surveys from the PubMed database. It is designed to scale with the user's hardware: the more CPU cores you have, the faster you can mine your target information from a list of PMIDs. Note that the script reads only CSV files; it does not read XLSX files.
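The idea of scaling with CPU cores can be illustrated with a minimal Python sketch. It is not the RedEye Extractor itself, which is an R script; it only shows one way to read PMIDs from CSV files and divide them evenly among worker processes. The folder name `pmid_csvs`, the helper names, and the assumption that any all-digit cell is a PMID are all hypothetical, and `process_chunk` is a placeholder where a real worker would query PubMed.

```python
import csv
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def read_pmids(csv_path):
    """Return every all-digit cell in a CSV file, treating each as a PMID."""
    pmids = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            for cell in row:
                cell = cell.strip()
                if cell.isdigit():  # assumption: PMIDs are plain integers
                    pmids.append(cell)
    return pmids


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(pmids):
    """Placeholder worker: a real worker would fetch PubMed records here."""
    return len(pmids)


if __name__ == "__main__":
    input_dir = Path("pmid_csvs")  # hypothetical folder of CSV files
    pmids = [p for f in sorted(input_dir.glob("*.csv")) for p in read_pmids(f)]
    chunks = split_into_chunks(pmids, os.cpu_count() or 1)
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(process_chunk, chunks)), "PMIDs processed")
```

With one chunk per core, adding cores shortens the wall-clock time only as long as the per-PMID work dominates; network-bound lookups may need rate limiting instead.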

Hex Breaker is the second step in the pipeline and is called automatically by the R script, via the reticulate package, once the R script has completed its portion of the job. It is a Python script that removes duplicate values (e.g. email addresses) and cleans up the scrambled sequences that are known to replace special characters in the output. When removing duplicate emails, Hex Breaker keeps the most recent instance of the address (for example, one found in 2024 will be removed in favour of one from 2025).
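The two clean-up steps described above can be sketched roughly as follows. This is not the Hex Breaker source, only an illustration under stated assumptions: records are dictionaries with hypothetical `email` and `year` keys, addresses are compared case-insensitively, and the scrambled sequences are assumed to look like `<U+00E9>`, the form R commonly prints for characters it cannot encode. The real script may handle other formats.

```python
import re

# Assumed escape format: "<U+00E9>" standing in for a single character.
_ESCAPE = re.compile(r"<U\+([0-9A-Fa-f]{4,6})>")


def _decode(match):
    """Turn one escape into its character; leave invalid code points as-is."""
    code = int(match.group(1), 16)
    return chr(code) if code <= 0x10FFFF else match.group(0)


def repair_escapes(text):
    """Replace each <U+XXXX> escape with the character it stands for."""
    return _ESCAPE.sub(_decode, text)


def keep_most_recent(records):
    """Keep one record per email address: the one with the largest year.

    Addresses are compared case-insensitively; on a tie the earlier
    record in the input is kept.
    """
    latest = {}
    for record in records:
        key = record["email"].strip().lower()
        if key not in latest or record["year"] > latest[key]["year"]:
            latest[key] = record
    return list(latest.values())


records = [
    {"email": "a@x.org", "year": 2024},
    {"email": "b@x.org", "year": 2023},
    {"email": "A@x.org", "year": 2025},  # same address, more recent
]
print(keep_most_recent(records))
print(repair_escapes("Ren<U+00E9>e"))
```

Here the 2024 entry for `a@x.org` is dropped in favour of the 2025 one, matching the behaviour described above.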

Using the pipeline

Ensure that you have R, RStudio and Python installed on your machine.
Download RedEye Extractor.
Open RStudio and install the RedEye package: Tools > Install Packages > Package Archive File > RedEye.tar.gz.
Once installed, RedEye can be loaded in R and provides the same functions as easyPubMed.
Download Hex Breaker. Take note of the download path; it will be required later.
Create a new R script.
Copy the extraction script into your new script.
Replace 'Input pathway' with the path to the folder containing your PMID-bearing CSV files.
Replace 'Output pathway' with your desired output directory. If the specified path does not exist, RedEye will create it at the target location.
Replace 'Python input' with the full path to the Hex Breaker file, including the file name itself (it should end in .py).
Run the script (Ctrl + Shift + Enter is a useful shortcut).
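Before running the script, it can help to check the paths filled in during the steps above. The following is a minimal sketch of such a check, not part of RedEye; the function name `check_pipeline_paths` and its messages are hypothetical. The output path is deliberately left unchecked because, per the steps above, RedEye creates it when it is missing.

```python
from pathlib import Path


def check_pipeline_paths(input_dir, hex_breaker):
    """Return a list of problems with the configured paths; empty means OK."""
    problems = []
    input_dir, hex_breaker = Path(input_dir), Path(hex_breaker)

    if not input_dir.is_dir():
        problems.append(f"input folder not found: {input_dir}")
    else:
        names = [p.name.lower() for p in input_dir.iterdir() if p.is_file()]
        if not any(n.endswith(".csv") for n in names):
            problems.append(f"no CSV files in {input_dir}")
        if any(n.endswith(".xlsx") for n in names):
            # The extractor reads CSV only, so XLSX files would be ignored.
            problems.append("XLSX files present; convert them to CSV first")

    # The 'Python input' path must include the file itself, ending in .py.
    if hex_breaker.suffix != ".py" or not hex_breaker.is_file():
        problems.append(f"Hex Breaker path is not an existing .py file: {hex_breaker}")

    return problems
```

A call such as `check_pipeline_paths("path/to/csv_folder", "path/to/hex_breaker.py")` (both paths are placeholders) returns an empty list when everything looks right, and a list of readable messages otherwise.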

All code is licensed under GPL-2.

C. Daniel Fry, 2025

Owner

Name: Daniel Fry
Login: Inebriateduck
Kind: user

Repositories: 1
Profile: https://github.com/Inebriateduck

GitHub Events

Total

Release event: 1
Watch event: 1
Push event: 54
Create event: 1

Last Year

Release event: 1
Watch event: 1
Push event: 54
Create event: 1
