pdf_articleanlyzer

Article analyzer using Grobid

https://github.com/jorgeming/pdf_articleanlyzer

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: JorgeMIng
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 34.6 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Codemeta

README.md

PDF_ArticleAnlyzer

Article analyzer using Grobid

Installation

Follow these steps to get started with the project.

Create an environment

Create an environment to run the repository:

conda create -n <ENV_NAME> python=3.11.5

If you already have Python 3.11.5 installed, you can use a Python virtual environment instead:

python -m venv <ENV_NAME>

Clone the Repository

git clone --recursive <REPOSITORY_URL>
cd <DIRECTORY_NAME>

Install Dependencies

pip install -r requirements.txt

Set Up Packages

This repository has a setup.py that uses setuptools.

Run the setup install to place the packages in site-packages:

python setup.py install

Install the required dependencies for the project.

Configuration

This project is configured through YAML files.

The main script depends on the folders {baseconfig} and {datamain}. If you move the main script, move those folders with it.

If you use this project as a library, as the examples do, look in the examples folder for custom config folders and their config files.

Configure Grobid Client

In the folder config/api, edit grovid-server-config.yaml to set the protocol (http, https), domain (example.com), and port (8070) before starting.
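As a rough illustration of how these three values combine into the address of the Grobid server, here is a minimal sketch. The key names `protocol`, `domain`, and `port`, and the helper `build_grobid_url`, are assumptions made for this example; the actual keys in the YAML file may differ.

```python
from urllib.parse import urlunsplit


def build_grobid_url(config: dict) -> str:
    """Combine protocol, domain and port into a base URL.

    The key names are hypothetical; check the YAML file for the real ones.
    """
    protocol = config.get("protocol", "http")
    domain = config["domain"]
    port = config.get("port")
    # Only append the port when one is given.
    netloc = f"{domain}:{port}" if port else domain
    return urlunsplit((protocol, netloc, "", "", ""))


# Values taken from the example settings mentioned above.
example_config = {"protocol": "http", "domain": "example.com", "port": 8070}
print(build_grobid_url(example_config))
```

The dictionary stands in for whatever a YAML loader would return after reading the config file.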

Configure Grobid Functionalities

In the folder config/api, see api-base-config.yaml. Every functionality shares the config values defined there. If you create a new config file, do not remove those values; changing the base values can lead to unexpected results.

Configure Grobid Server

This project needs a Grobid server to work. Make sure a Grobid server is online; you can use Docker to run a local one.

Link on how to set up a Grobid server
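Before running any service, it can help to confirm that the server is reachable. The sketch below queries the `/api/isalive` health endpoint of the Grobid REST API using only the standard library. The function name `grobid_is_alive` and the default timeout are assumptions for this example and are not part of this project.

```python
import urllib.request


def grobid_is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the Grobid server answers its health check.

    Hypothetical helper for illustration; not part of this repository.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, HTTP errors and timeouts.
        return False
```

For a local Docker server, a call such as `grobid_is_alive("http://localhost:8070")` would be the typical usage.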

Tutorials

In the examples folder there are notebooks with a brief demonstration of the functionalities and how to use the classes.

You can run the functionalities as a library, as shown in the examples, or through the main script.

Features

0. Main Executable

As the examples show, this project can be used as a library, but it can also be executed as a script through main.py. Its parameters are {service}, which selects the functionality, plus {--protocol} {http}, {--domain} {example.com}, and {--port} {8070}.

python main.py {service} --protocol http --domain example.com --port 8070
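To make the shape of this command line concrete, here is a minimal sketch of how such arguments could be parsed with `argparse`. This is not the project's actual main.py: the `build_parser` function, the `None` defaults, and the restriction to the three service names listed in this README are assumptions for illustration.

```python
import argparse

# Service names as they appear in the Features section of this README.
SERVICES = (
    "visualize.word_cloud",
    "visualize.stadistic",
    "visualize.links_search",
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("service", choices=SERVICES, help="functionality to run")
    parser.add_argument("--protocol", choices=("http", "https"), default=None)
    parser.add_argument("--domain", default=None)
    parser.add_argument("--port", type=int, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_parser().parse_args(["visualize.word_cloud", "--port", "8070"])
print(args.service, args.protocol, args.domain, args.port)
```

Leaving the defaults as `None` in this sketch makes it easy to fall back to the values in the YAML config when an option is omitted.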

1. Generate Wordclouds from Abstracts

The class WordCloud extracts the abstracts of the articles and creates a word cloud PNG from the text.

python main.py visualize.word_cloud
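The idea behind this feature can be sketched with the standard library alone: read the abstract from the TEI XML that Grobid returns, then count word frequencies, which is the data a word cloud visualizes. This is not the project's WordCloud class; the helper names, the sample document, and the minimum word length are assumptions for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by TEI documents such as those produced by Grobid.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_abstract(tei_xml: str) -> str:
    """Return the abstract text of a TEI document, or "" if there is none."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is None:
        return ""
    # Join all nested text and normalize whitespace.
    return " ".join("".join(abstract.itertext()).split())


def word_frequencies(text: str, min_length: int = 4) -> Counter:
    """Count lowercase words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) >= min_length)


# Invented sample document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <p>Graph models help. Graph search scales.</p>
  </abstract></profileDesc></teiHeader>
</TEI>"""

frequencies = word_frequencies(extract_abstract(SAMPLE_TEI))
print(frequencies.most_common(3))
```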

2. Bar Chart with the Number of Figures per Article

The class CountAtritubte counts specific elements of the articles and creates bar charts comparing them. In the config folder config/api, the file count-config.yaml sets which attributes to find; it is currently set to find figure elements in the XML.

python main.py visualize.stadistic
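Counting elements of this kind can be sketched as follows, assuming the articles are available as TEI XML strings. The `count_elements` helper and the sample documents are hypothetical and only illustrate the counting step, not the project's own class or its chart output.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents such as those produced by Grobid.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_elements(tei_xml: str, tag: str) -> int:
    """Count every element with the given TEI tag name in a document."""
    root = ET.fromstring(tei_xml)
    return sum(1 for _ in root.iter(f"{{{TEI_NS}}}{tag}"))


# Invented sample documents, for illustration only.
SAMPLE_DOCS = {
    "paper_a.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
                   "<text><body><figure/><figure/></body></text></TEI>",
    "paper_b.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
                   "<text><body><figure/></body></text></TEI>",
}

counts = {name: count_elements(xml, "figure") for name, xml in SAMPLE_DOCS.items()}
print(counts)
```

A dictionary like `counts` maps each article to its number of figures, which is the data a bar chart of this kind would plot.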

3. List Article Links

The class SearchLink finds elements and https links in the articles and lists them in a table displayed with the rich library.

python main.py visualize.links_search
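The link search itself can be approximated with a regular expression over the extracted text. The sketch below is an assumption about one simple way to do it, not the SearchLink implementation; the pattern and the `find_links` helper are hypothetical, and printing replaces the rich table.

```python
import re

# Simple pattern for http(s) links; stops at whitespace and common delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def find_links(text: str) -> list[str]:
    """Return the unique links found in text, in order of appearance."""
    links: list[str] = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in links:
            links.append(url)
    return links


sample_text = (
    "See https://github.com/kermitt2/grobid, and https://zenodo.org. "
    "Again https://zenodo.org"
)
for link in find_links(sample_text):
    print(link)
```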

Docker

If you need to run this project as a container, use the Dockerfile in the docker folder.

Docker Server

First you need a running Grobid server:

docker pull grobid/grobid:0.8.0

docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0

Grobid Client

docker build --no-cache -t pdf-analyzer docker

docker run -it --network host pdf-analyzer /bin/bash

Now you can use python main.py {service} to run the services.

Documentation

The code is documented on ReadTheDocs.

License

This project is under the Apache 2.0 License. Refer to the LICENSE file for more details.

Contact

  • Name: Jorge Martin Izquierdo
  • Email: jorge.martin.izquierdo@alumnos.upm.es

Bibliography

This project builds on the Grobid client API and server, so check out the original author of these two programs:

  • GROBID (2008-2022) https://github.com/kermitt2/grobid
  • GROBID (2008-2022) https://github.com/kermitt2/grobidclientpython

Owner

  • Login: JorgeMIng
  • Kind: user

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/Apache-2.0",
  "codeRepository": "https://github.com/JorgeMIng/PDF_ArticleAnlyzer",
  "dateCreated": "2024-02-04",
  "datePublished": "2024-02-04",
  "name": "PDF_ArticleAnlyzer",
  "description": "This pdf analyzer uses a Grobid sever for analyzing PDF about articles and extracting different information like, wordclouds, stadistical data about the elements of the PDF and search of links on the pdf",
  "developmentStatus": "active",
  "programmingLanguage": [
    "Python"
  ],
  "softwareRequirements": [
    "Python 3.11.5"
  ],
  "author": [
    {
      "@type": "Person",
      "givenName": "Jorge",
      "familyName": "Martin",
      "email": "jorge.martin.izquierdo@alumnos.upm.es"
    }
  ]
}

Dependencies

.github/workflows/python-app.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite