Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.3%) to scientific vocabulary
Repository
Article analyzer using Grobid
Basic Info
- Host: GitHub
- Owner: JorgeMIng
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 34.6 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
PDF_ArticleAnlyzer
Article analyzer using Grobid
Installation
Follow these steps to get started with the project.
Create an environment
Create an environment in which to run the repository:
conda create -n
If you are already on Python 3.11.5, you can use a Python virtual environment instead:
python -m venv
Clone the Repository
bash
git clone --recursive <REPOSITORY_URL>
cd <DIRECTORY_NAME>
Install Dependencies
Install the required dependencies for the project.
bash
pip install -r requirements.txt
Set Up Packages
This repository includes a setup.py based on setuptools. Run the install command to place the packages in site-packages:
bash
python setup.py install
Configuration
This project is configured through YAML files.
The main script relies on the folders {baseconfig} and {datamain} to work correctly; if you move the main script, move those folders with it.
If you use this project as a library, as the examples do, look at the examples folder for custom config folders and their config files.
Configure Grobid Client
In the folder config/api, edit grovid-server-config.yaml to set the protocol (http or https), the domain (for example, example.com), and the port (8070) before starting.
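As a rough illustration of how these three values fit together, the sketch below assembles a server base URL from a dictionary. The key names `protocol`, `domain`, and `port` and the helper `build_base_url` are assumptions made for this example; the actual layout of the YAML file may differ.

```python
from urllib.parse import urlunsplit


def build_base_url(config: dict) -> str:
    """Assemble a base URL from protocol, domain, and optional port.

    The key names are hypothetical; adapt them to the real config file.
    """
    protocol = config.get("protocol", "http")
    domain = config["domain"]
    port = config.get("port")
    # Only append the port when one is given.
    netloc = f"{domain}:{port}" if port else domain
    return urlunsplit((protocol, netloc, "", "", ""))


# Example values for a Grobid server running locally.
server_config = {"protocol": "http", "domain": "localhost", "port": 8070}
print(build_base_url(server_config))
```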
Configure Grobid Functionalities
In the folder config/api, check api-base-config.yaml. Every functionality shares these config values. If you create a new config file, do not remove them; changing the base values can lead to unexpected results.
Configure Grobid Server
This project needs a running Grobid server. Make sure one is online; you can use Docker to run a local Grobid server (see the link on how to set up a Grobid server).
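Before running any service, it can help to confirm that the server is reachable. The sketch below queries Grobid's `/api/isalive` health endpoint using only the standard library. The helper name `grobid_is_alive` and the example URL are assumptions; this is not code from the repository.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the Grobid server answers its health check."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or HTTP error status.
        return False


if __name__ == "__main__":
    # Assumed address of a local server started with Docker.
    print(grobid_is_alive("http://localhost:8070"))
```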
Tutorials
The examples folder contains notebooks with a brief demonstration of the functionalities and of how to use the classes.
You can run the functionalities as a library, as shown in the examples, or through the main script.
Features
0. Main Executable
As the examples show, this project can be used as a library, but it can also be executed as a script through main.py. Its parameters are {service}, which selects the functionality, and the options --protocol (http), --domain (example.com), and --port (8070).
bash
python main.py {service} --protocol http --domain example.com --port 8070
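To make the shape of this command line concrete, here is a minimal argparse sketch that accepts the same positional argument and options. It is not the repository's main.py; the default values and help texts are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line (hypothetical)."""
    parser = argparse.ArgumentParser(description="Run one analysis service.")
    parser.add_argument("service", help="functionality to run, e.g. visualize.word_cloud")
    parser.add_argument("--protocol", choices=["http", "https"], default="http")
    parser.add_argument("--domain", default="localhost")  # assumed default
    parser.add_argument("--port", type=int, default=8070)  # assumed default
    return parser


# Parse an example invocation instead of the real sys.argv.
args = build_parser().parse_args(["visualize.word_cloud", "--port", "8070"])
print(args.service, args.protocol, args.domain, args.port)
```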
1. Generate Wordclouds from Abstracts
The class WordCloud extracts the abstracts of the articles and creates a word cloud PNG from the text.
bash
python main.py visualize.word_cloud
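The idea behind this feature can be sketched with the standard library alone: read the abstract from the TEI XML that Grobid returns, then count word frequencies, which is the data a word cloud renderer sizes words by. The helper names and the sample document are assumptions; the repository's WordCloud class may work differently.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Grobid returns TEI XML, which uses this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_abstract(tei_xml: str) -> str:
    """Return the abstract text of a TEI document, or '' if there is none."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is None:
        return ""
    # Join all nested text and normalize whitespace.
    return " ".join("".join(abstract.itertext()).split())


def word_frequencies(text: str, min_length: int = 4) -> Counter:
    """Count lowercase words with at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) >= min_length)


# Tiny hand-written sample standing in for real Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <p>Parsing papers is hard. Parsing PDF papers is harder.</p>
  </abstract></profileDesc></teiHeader>
</TEI>"""

abstract = extract_abstract(SAMPLE_TEI)
print(word_frequencies(abstract).most_common(3))
```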
2. Bar Chart with the Number of Figures per Article
The class CountAtritubte counts specific elements of the articles and creates bar charts comparing them. In the config folder config/api, the file count-config.yaml sets which attributes to look for; it is currently set to find figures.
bash
python main.py visualize.stadistic
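As a minimal sketch of the counting step, the code below counts a given element in each TEI document and prints a text-only bar per article. The function `count_elements`, the sample documents, and the plain-text rendering are assumptions; the repository produces real bar charts and reads the element name from its config file.

```python
import xml.etree.ElementTree as ET

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"


def count_elements(tei_xml: str, tag: str) -> int:
    """Count how many elements with the given TEI tag a document contains."""
    root = ET.fromstring(tei_xml)
    return sum(1 for _ in root.iter(f"{{{TEI_NAMESPACE}}}{tag}"))


# Hand-written samples standing in for Grobid output of two articles.
documents = {
    "paper_a": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><figure/><figure/></text></TEI>',
    "paper_b": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><figure/></text></TEI>',
}

counts = {name: count_elements(xml, "figure") for name, xml in documents.items()}
for name, total in counts.items():
    # One '#' per figure gives a crude horizontal bar chart.
    print(f"{name:10s} {'#' * total} ({total})")
```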
3. List Article Links
The class SearchLink finds elements and https links in the articles and lists them in a table rendered with the rich library.
bash
python main.py visualize.links_search
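The link search can be approximated with a regular expression over the extracted text. This is a simplified sketch, not the SearchLink implementation: the pattern, the helper `find_links`, and the plain `print` output (instead of a rich table) are assumptions for illustration.

```python
import re

# Match http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"""https?://[^\s<>"')\]]+""")


def find_links(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    links: list[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in links:
            links.append(url)
    return links


sample_text = (
    "See https://github.com/kermitt2/grobid, and (https://zenodo.org). "
    "Again: https://github.com/kermitt2/grobid"
)
for index, link in enumerate(find_links(sample_text), start=1):
    print(f"{index}. {link}")
```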
Docker
If you need to run this project as a container, use the Dockerfile in the folder docker.
Docker Server
First, you need a running Grobid server:
docker pull grobid/grobid:0.8.0
docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0
Grobid Client
docker build --no-cache -t pdf-analyzer docker
docker run -it --network host pdf-analyzer /bin/bash
You can now run the services with
python main.py {service}
Documentation
The code is documented at ReadTheDocs.
License
This project is under the Apache 2.0 License. Refer to the LICENSE file for more details.
Contact
- Name: Jorge Martin Izquierdo
- Email: jorge.martin.izquierdo@alumnos.upm.es
Bibliography
This project builds on the Grobid client API and server. Check out the original author of these two programs:
- GROBID (2008-2022) https://github.com/kermitt2/grobid
- GROBID (2008-2022) https://github.com/kermitt2/grobidclientpython
Owner
- Login: JorgeMIng
- Kind: user
- Repositories: 1
- Profile: https://github.com/JorgeMIng
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/Apache-2.0",
"codeRepository": "https://github.com/JorgeMIng/PDF_ArticleAnlyzer",
"dateCreated": "2024-02-04",
"datePublished": "2024-02-04",
"name": "PDF_ArticleAnlyzer",
"description": "This pdf analyzer uses a Grobid sever for analyzing PDF about articles and extracting different information like, wordclouds, stadistical data about the elements of the PDF and search of links on the pdf",
"developmentStatus": "active",
"programmingLanguage": [
"Python"
],
"softwareRequirements": [
"Python 3.11.5"
],
"author": [
{
"@type": "Person",
"givenName": "Jorge",
"familyName": "Martin",
"email": "jorge.martin.izquierdo@alumnos.upm.es"
}
]
}
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite