extractor

Extractor extracts and analyzes information from scientific articles in PDF format. It runs a script that performs various tasks to facilitate the analysis of multiple articles located in a specific directory.

https://github.com/adrijmz/extractor

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: adrijmz
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 17.7 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 8

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation Codemeta

README.md

Repository Overview

https://extractor.readthedocs.io/en/latest/

This repository contains a Python script for extracting and analyzing information from scientific articles in PDF format. The script performs various tasks to facilitate the analysis of multiple articles located in the directory /papers. To extract all information, the script uses the GROBID service (2008-2022): https://github.com/kermitt2/grobid.

Features

Extraction of PDF Text: Utilizes GROBID to extract text from PDF documents, enabling further analysis of the content.
Generation of Keyword Cloud: Creates a keyword cloud based on the abstracts of the articles, providing a visual representation of the most common words.
Counting Figures per Article: Counts the number of figures in each article, aiding in understanding the visual content of the research presented.
Extraction of Article Links: Attempts to extract links within the PDF documents, particularly references cited in the articles, providing additional resources for research.
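The three analysis features above can be sketched against the TEI XML that GROBID returns. This is a minimal illustration using only the standard library, not the project's actual implementation: the function name `analyze_tei`, the stop-word list, and the sample document are all assumptions made for the example. It also assumes that GROBID marks tables as `<figure type="table">`, so those are excluded from the figure count.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by GROBID's XML output.
TEI_NS = "http://www.tei-c.org/ns/1.0"
URL_RE = re.compile(r"https?://[^\s<>)\]]+")
# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "and", "for", "with", "are", "that", "this"}


def analyze_tei(tei_xml: str) -> dict:
    """Return figure count, links, and abstract word frequencies for one article."""
    root = ET.fromstring(tei_xml)

    # Assumption: tables appear as <figure type="table">, so skip them.
    figures = [f for f in root.iter(f"{{{TEI_NS}}}figure")
               if f.get("type") != "table"]

    # Collect every URL that appears in the text nodes of the document.
    full_text = " ".join(root.itertext())
    links = sorted({m.rstrip(".,;") for m in URL_RE.findall(full_text)})

    # Word frequencies of the abstract: the raw input a keyword cloud needs.
    abstract = root.find(f".//{{{TEI_NS}}}abstract")
    abstract_text = " ".join(abstract.itertext()) if abstract is not None else ""
    words = [w for w in re.findall(r"[a-z]{3,}", abstract_text.lower())
             if w not in STOP_WORDS]

    return {"figures": len(figures), "links": links, "keywords": Counter(words)}


# Hypothetical, hand-written TEI snippet used only to exercise the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <p>Graph models improve graph search.</p>
  </abstract></profileDesc></teiHeader>
  <text><body>
    <p>Code is at https://example.org/code.</p>
    <figure><head>Figure 1</head></figure>
    <figure type="table"><head>Table 1</head></figure>
  </body></text>
</TEI>"""

result = analyze_tei(SAMPLE)
print(result["figures"], result["links"], result["keywords"].most_common(1))
```

The `keywords` counter could then be passed to the `wordcloud` package listed in requirements.txt, for instance through `WordCloud().generate_from_frequencies(...)`, to render the cloud itself.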

Install

First of all, clone the repository:

    git clone https://github.com/adrijmz/extractor.git

Using Docker

To install the GROBID image, execute the following command:

    docker pull lfoppiano/grobid:0.7.2

To build the extractor image, execute the following commands from the root directory of the repository:

    cd /path/to/root/directory/of/extractor
    docker build -t extractor .

From Source

To install the GROBID image, execute the following command:

    docker pull lfoppiano/grobid:0.7.2

Install Python Environment

This project requires Python >= 3.11

Step 1

Create a virtual environment to isolate the project dependencies:

    conda create -n myenv python=3.11

Initialize conda for your shell if this has not been done yet:

    conda init

Activate the new environment:

    conda activate myenv

Step 2

Install dependencies:

    cd /path/to/root/directory/of/extractor
    pip install -r requirements.txt

Usage

Using Docker

Create a Docker network so that both containers can communicate:

    docker network create extractor_network

To run the GROBID container, execute the following command:

    docker run --name server --network extractor_network -p 8070:8070 lfoppiano/grobid:0.7.2

To run the extractor container, open a new terminal window and execute the following command:

    docker run --name extractor --network extractor_network extractor

If you want to see the generated files and you have used Docker to run extractor, execute the following commands.

To check the container ID:

    docker ps -a

To copy all files to a desired directory:

    docker cp container_id:/app /path/to/your/directory

From Source

To run the GROBID container, execute the following command:

    docker run --name server -p 8070:8070 lfoppiano/grobid:0.7.2

In src/script.py, change the url value from

    url = 'http://server:8070/api/processFulltextDocument'

to

    url = 'http://localhost:8070/api/processFulltextDocument'

To run the Python script, execute the following command from the root directory:

    python3 src/script.py

To access the GROBID service, go to http://localhost:8070/
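The two run modes differ only in the GROBID host name (`server` inside the Docker network, `localhost` from source). The sketch below shows one way to send every PDF in a directory to the `processFulltextDocument` endpoint without editing the URL by hand. It is an illustration rather than the code in src/script.py: the `GROBID_URL` environment variable, the function names, and the injectable `post` parameter are assumptions introduced for this example.

```python
import os
from pathlib import Path
from typing import Callable, Dict, Optional

# Default used when running from source; GROBID_URL is a hypothetical
# environment variable, not something the project itself reads.
DEFAULT_URL = "http://localhost:8070/api/processFulltextDocument"


def grobid_url() -> str:
    """Return the GROBID endpoint, overridable through the environment."""
    return os.environ.get("GROBID_URL", DEFAULT_URL)


def process_directory(papers_dir: str,
                      post: Optional[Callable] = None) -> Dict[str, str]:
    """Send every PDF in papers_dir to GROBID and return {file name: TEI XML}."""
    if post is None:
        import requests  # imported lazily so the helper can be tried offline
        post = requests.post

    results: Dict[str, str] = {}
    for pdf_path in sorted(Path(papers_dir).glob("*.pdf")):
        with pdf_path.open("rb") as pdf_file:
            # GROBID expects the PDF as a multipart form field named "input".
            response = post(grobid_url(), files={"input": pdf_file}, timeout=120)
        response.raise_for_status()
        results[pdf_path.name] = response.text
    return results


if __name__ == "__main__":
    # Requires a running GROBID container and a local "papers" directory.
    for name, tei in process_directory("papers").items():
        print(name, len(tei))
```

Under these assumptions, the Docker setup would set `GROBID_URL=http://server:8070/api/processFulltextDocument`, while a run from source would rely on the default.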

Owner

Name: Adrián Jiménez
Login: adrijmz
Kind: user
Location: Madrid, Spain
Company: Stratebi Business Solutions

Repositories: 2
Profile: https://github.com/adrijmz

Computer Engineering

Citation (CITATION.cff)

title: "Extractor: Extract data from a PDF file"
license: "MIT"
authors:
  - family-names: "Jimenez Cano"
    given-names: "Adrian"

cff-version: "1.3.2"
message: "If you use this software, please cite it as below."
preferred-citation:
  authors:
    - family-names: "Jimenez Cano"
      given-names: "Adrian"
  title: "Extractor: Extract data from a PDF file"
  type: "software"
  year: 2024
  doi: "10.5281/zenodo.10651048"

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/MIT",
  "codeRepository": "https://github.com/adrijmz/extractor",
  "dateCreated": "2024-02-03",
  "datePublished": "2024-02-12",
  "dateModified": "2024-03-06",
  "downloadUrl": "https://github.com/adrijmz/extractor/archive/refs/tags/v1.3.2.tar.gz",
  "name": "Extractor",
  "version": "1.3.2",
  "identifier": "10.5281/zenodo.10651048",
  "description": "Extractor extracts and analyzed information from scientific articles in PDF format. It runs an script that performs various tasks to facilitate the analysis of multiple articles located in an specific directory.",
  "applicationCategory": "Software",
  "releaseNotes": "Final release with documentation updated and Dockerfile fixed.",
  "developmentStatus": "active",
  "referencePublication": "https://zenodo.org/records/10651048",
  "keywords": [
    "extract",
    "analyze",
    "links",
    "word cloud",
    "diagram"
  ],
  "programmingLanguage": [
    "Python 3"
  ],
  "contributor": [
    {
      "@type": "Person",
      "givenName": "Adrian",
      "familyName": "Jimenez Cano",
      "email": "adrian.jimenez.cano@alumnos.upm.es"
    }
  ]
}

Dependencies

requirements.txt pypi

PyPDF2 ==3.0.1
Requests ==2.31.0
matplotlib ==3.8.3
wordcloud ==1.9.3

.github/workflows/ci.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

Dockerfile docker

python 3.11 build
