extractor
Extractor extracts and analyzes information from scientific articles in PDF format. It runs a script that performs various tasks to facilitate the analysis of multiple articles located in a specific directory.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary
Repository
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 8
Metadata Files
README.md
Repository Overview
https://extractor.readthedocs.io/en/latest/
This repository contains a Python script for extracting and analyzing information from scientific articles in PDF format. The script performs various tasks to facilitate the analysis of multiple articles located in the directory /papers. To extract the information, the script uses the GROBID service (2008-2022): https://github.com/kermitt2/grobid.
Features
- Extraction of PDF Text: Utilizes GROBID to extract text from PDF documents, enabling further analysis of the content.
- Generation of Keyword Cloud: Creates a keyword cloud based on the abstracts of the articles, providing a visual representation of the most common words.
- Counting Figures per Article: Counts the number of figures in each article, aiding in understanding the visual content of the research presented.
- Extraction of Article Links: Attempts to extract links within the PDF documents, particularly references cited in the articles, providing additional resources for research.
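The features above can be sketched against the TEI XML that GROBID returns. This is a minimal illustration, not the code in src/script.py: the sample document, the function names, and the choice to skip figure elements marked type="table" are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# GROBID responses use the TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed TEI document standing in for a GROBID response.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><p>Deep learning improves parsing. Parsing of PDF articles benefits from learning.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <figure xml:id="fig_0"><head>Figure 1</head></figure>
      <figure xml:id="fig_1"><head>Figure 2</head></figure>
      <figure type="table" xml:id="tab_0"><head>Table 1</head></figure>
    </body>
    <back>
      <listBibl>
        <biblStruct><ptr target="https://example.org/paper"/></biblStruct>
      </listBibl>
    </back>
  </text>
</TEI>"""


def abstract_text(root):
    """Return the abstract as plain text, or an empty string if it is missing."""
    node = root.find(".//tei:abstract", TEI_NS)
    return " ".join(node.itertext()).strip() if node is not None else ""


def word_frequencies(text, min_length=4):
    """Count lowercase words of at least min_length letters (word cloud input)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) >= min_length)


def count_figures(root):
    """Count figure elements, skipping those marked as tables (an assumption)."""
    return sum(
        1
        for figure in root.iterfind(".//tei:figure", TEI_NS)
        if figure.get("type") != "table"
    )


def extract_links(root):
    """Collect the target attribute of every ptr element that has one."""
    return [
        ptr.get("target")
        for ptr in root.iterfind(".//tei:ptr", TEI_NS)
        if ptr.get("target")
    ]


root = ET.fromstring(SAMPLE_TEI)
frequencies = word_frequencies(abstract_text(root))
figure_count = count_figures(root)
links = extract_links(root)
```

The resulting Counter could be passed to the wordcloud package's WordCloud.generate_from_frequencies to draw the keyword cloud; that step is left out here so the sketch runs with the standard library alone.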
Install
First, clone the repository:
```bash
git clone https://github.com/adrijmz/extractor.git
```
Using Docker
To install the GROBID image, execute the following command:
```bash
docker pull lfoppiano/grobid:0.7.2
```
To build the extractor image, execute the following commands from the root directory of the repository:
```bash
cd /path/to/root/directory/of/extractor
docker build -t extractor .
```
From Source
To install the GROBID image, execute the following command:
```bash
docker pull lfoppiano/grobid:0.7.2
```
Install Python Environment
This project requires Python >= 3.11
Step 1
Create a virtual environment to isolate the project dependencies:
```bash
conda create -n myenv python=3.11
```
If necessary, initialize conda for your shell (conda init takes a shell name, such as bash, rather than an environment name):
```bash
conda init bash
```
Activate the new environment:
```bash
conda activate myenv
```
Step 2
Install the dependencies:
```bash
cd /path/to/root/directory/of/extractor
pip install -r requirements.txt
```
Usage
Using Docker
Create a Docker network so that both containers can communicate:
```bash
docker network create extractor_network
```
To run the GROBID container, execute the following command:
```bash
docker run --name server --network extractor_network -p 8070:8070 lfoppiano/grobid:0.7.2
```
To run the extractor container, open a new terminal window and execute the following command:
```bash
docker run --name extractor --network extractor_network extractor
```
If you ran extractor with Docker and want to see the generated files, first look up the container ID:
```bash
docker ps -a
```
Then copy all files to a directory of your choice:
```bash
docker cp container_id:/app /path/to/your/directory
```
From Source
To run the GROBID container, execute the following command:
```bash
docker run --name server -p 8070:8070 lfoppiano/grobid:0.7.2
```
In src/script.py, change the url value from
```python
url = 'http://server:8070/api/processFulltextDocument'
```
to
```python
url = 'http://localhost:8070/api/processFulltextDocument'
```
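Editing the source each time you switch between Docker and a local run can be avoided. The following is a minimal sketch, assuming a hypothetical GROBID_HOST environment variable and hypothetical helpers grobid_url and process_pdf, none of which exist in the repository. The multipart field name input follows GROBID's REST API for processFulltextDocument.

```python
import os

GROBID_ENDPOINT = "/api/processFulltextDocument"


def grobid_url(default_host="localhost", port=8070):
    """Build the full-text endpoint URL, honouring an optional host override."""
    # GROBID_HOST is a hypothetical variable name chosen for this sketch.
    host = os.environ.get("GROBID_HOST", default_host)
    return f"http://{host}:{port}{GROBID_ENDPOINT}"


def process_pdf(pdf_path, url=None, timeout=120):
    """Send one PDF to GROBID and return the TEI XML response as text."""
    # Imported lazily so that grobid_url works even if requests is not installed.
    import requests

    with open(pdf_path, "rb") as handle:
        response = requests.post(
            url or grobid_url(), files={"input": handle}, timeout=timeout
        )
    response.raise_for_status()
    return response.text
```

With this in place, setting GROBID_HOST=server inside the extractor container and leaving it unset on the host would select the right address without touching the script.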
To run the Python script from the root directory, execute the following command:
```bash
python3 src/script.py
```
To access the GROBID service, open http://localhost:8070/ in a browser.
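Before running the script, it can help to confirm that the GROBID container is reachable. A minimal sketch using only the standard library, assuming the default port mapping shown above; grobid_is_alive is a hypothetical helper name, while /api/isalive is GROBID's health-check endpoint.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if GROBID's /api/isalive endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses.
        return False


if __name__ == "__main__":
    print("GROBID reachable:", grobid_is_alive())
```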
Owner
- Name: Adrián Jiménez
- Login: adrijmz
- Kind: user
- Location: Madrid, Spain
- Company: Stratebi Business Solutions
- Repositories: 2
- Profile: https://github.com/adrijmz
Computer Engineering
Citation (CITATION.cff)
title: "Extractor: Extract data from a PDF file"
license: "MIT"
authors:
  - family-names: "Jimenez Cano"
    given-names: "Adrian"
cff-version: "1.3.2"
message: "If you use this software, please cite it as below."
preferred-citation:
  authors:
    - family-names: "Jimenez Cano"
      given-names: "Adrian"
  title: "Extractor: Extract data from a PDF file"
  type: "software"
  year: 2024
  doi: "10.5281/zenodo.10651048"
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/MIT",
"codeRepository": "https://github.com/adrijmz/extractor",
"dateCreated": "2024-02-03",
"datePublished": "2024-02-12",
"dateModified": "2024-03-06",
"downloadUrl": "https://github.com/adrijmz/extractor/archive/refs/tags/v1.3.2.tar.gz",
"name": "Extractor",
"version": "1.3.2",
"identifier": "10.5281/zenodo.10651048",
"description": "Extractor extracts and analyzed information from scientific articles in PDF format. It runs an script that performs various tasks to facilitate the analysis of multiple articles located in an specific directory.",
"applicationCategory": "Software",
"releaseNotes": "Final release with documentation updated and Dockerfile fixed.",
"developmentStatus": "active",
"referencePublication": "https://zenodo.org/records/10651048",
"keywords": [
"extract",
"analyze",
"links",
"word cloud",
"diagram"
],
"programmingLanguage": [
"Python 3"
],
"contributor": [
{
"@type": "Person",
"givenName": "Adrian",
"familyName": "Jimenez Cano",
"email": "adrian.jimenez.cano@alumnos.upm.es"
}
]
}
Dependencies
- PyPDF2 ==3.0.1
- Requests ==2.31.0
- matplotlib ==3.8.3
- wordcloud ==1.9.3
- actions/checkout v2 composite
- actions/setup-python v2 composite
- python 3.11 build