text-analysis

Text Analysis using Grobid

https://github.com/yiminzhou7/text-analysis

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 3 DOI reference(s) in README
✓
Academic publication links
Links to: zenodo.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (17.6%) to scientific vocabulary


Repository


Basic Info

Host: GitHub
Owner: yiminzhou7
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 15.1 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 4

Created over 2 years ago · Last pushed over 2 years ago


Text Analysis with Grobid

Description

This project uses Grobid, a machine learning tool, to extract structured metadata and content from PDF documents. From the extracted information it produces a word cloud of the keywords in the abstracts of all documents, a bar chart showing the number of figures in each document, and a list of the URLs that appear in each PDF.

Note: There are already 10 PDFs in the repository. If you want to process your own PDFs, please save them in the "papers" folder.
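As a rough illustration of the kind of extraction described above, the following sketch parses a small hand-written TEI XML fragment shaped like Grobid output and pulls out the abstract text, the number of figures, and the URLs. It uses only the Python standard library. The sample document, the function name summarize_tei, and the decision to skip figure elements marked as tables are assumptions made for illustration, not the repository's actual code.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample standing in for a Grobid response (illustrative only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><p>We study text mining of scientific papers.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Code is at https://example.org/code and <ptr target="https://example.org/data"/>.</p>
      <figure><head>Figure 1</head></figure>
      <figure type="table"><head>Table 1</head></figure>
      <figure><head>Figure 2</head></figure>
    </body>
  </text>
</TEI>"""

URL_PATTERN = re.compile(r"https?://[^\s<>)]+")


def summarize_tei(tei_xml):
    """Return the abstract text, figure count, and sorted URLs of one TEI document."""
    root = ET.fromstring(tei_xml)

    abstract = root.find(".//tei:abstract", TEI)
    abstract_text = ""
    if abstract is not None:
        # Join all nested text and collapse runs of whitespace.
        abstract_text = " ".join("".join(abstract.itertext()).split())

    # Assumption: tables appear as <figure type="table"> and are not counted.
    figures = [f for f in root.iterfind(".//tei:figure", TEI) if f.get("type") != "table"]

    urls = set()
    # URLs written out in running text.
    for text in root.itertext():
        urls.update(u.rstrip(".,;") for u in URL_PATTERN.findall(text))
    # URLs stored in "target" attributes of any element.
    for element in root.iter():
        target = element.get("target", "")
        if target.startswith(("http://", "https://")):
            urls.add(target)

    return {"abstract": abstract_text, "figures": len(figures), "urls": sorted(urls)}


summary = summarize_tei(SAMPLE_TEI)
print(summary)
```

The per-document figure counts returned this way are the kind of data a bar chart could be drawn from, and the URL lists are the kind of data a text file of links could be written from.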

Documentation

You can find the documentation here

Requirements

To run this program you will need:
* Docker, a platform that packages, distributes, and runs applications as containers, ensuring consistency across different environments.
* Grobid, a machine learning-based toolkit for extracting information from PDF documents.

Installation instructions

Step 1: Clone the repository from GitHub to your local machine:

git clone https://github.com/yiminzhou7/Text-Analysis.git

Step 2: Start the Docker server.

Step 3: Pull the Grobid Docker image:

docker pull lfoppiano/grobid:0.7.2

Execution instructions

Conda

Step 1: Start the Docker server.

Step 2: Run Grobid on localhost:8070:

docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.2

You can check whether Grobid is running properly by opening a web browser and visiting http://localhost:8070.
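The same check can be scripted. The sketch below queries the /api/isalive endpoint of the Grobid REST service, which answers "true" when the service is up, using only the Python standard library. The function names and the injectable fetch parameter are illustrative choices, not part of this project.

```python
import urllib.error
import urllib.request


def _http_get(url, timeout):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0, fetch=_http_get):
    """Return True if the service answers 'true' on /api/isalive, else False."""
    try:
        status, body = fetch(base_url.rstrip("/") + "/api/isalive", timeout)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and similar failures
        # are all treated as "not running".
        return False
    return status == 200 and body.strip().lower() == "true"


if __name__ == "__main__":
    print("Grobid is up" if grobid_is_alive() else "Grobid is not reachable")
```

The fetch parameter exists only so the function can be exercised without a running container. In normal use, calling grobid_is_alive() with no arguments is enough.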

Step 3: Open a new terminal and create a new conda environment named text_analysis with Python 3.10:

conda create -n text_analysis python=3.10

Step 4: Activate the environment.

conda activate text_analysis

Step 5: Go to the main directory ("Text-Analysis") and install the dependencies from the requirements.txt file:

pip install -r requirements.txt
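After installation, one way to confirm that the environment matches the pinned versions is to compare them against what is actually installed. The sketch below does this with importlib.metadata from the standard library. The pins are the ones listed for this project's requirements.txt; the function names and the lookup parameter are illustrative.

```python
from importlib import metadata

# Pins as listed for this project's requirements.txt.
PINNED = {
    "beautifulsoup4": "4.12.2",
    "matplotlib": "3.8.1",
    "nltk": "3.8.1",
    "requests": "2.31.0",
    "spacy": "3.7.2",
    "wordcloud": "1.9.3",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned dependencies match.")
```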

Step 6: Before running the main program, it is recommended to run the tests in the testing.py file found in the "tests" folder. From the main directory ("Text-Analysis"), execute:

python tests/testing.py

Step 7: Once the tests pass, stay in the main directory and run the main program:

python main.py

Once executed, the program saves its results in the "results" folder:
- A word cloud saved as "wordcloud.png".
- A histogram saved as "figures.png".
- The URLs of each paper saved as "links.txt".
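A word cloud is drawn from word frequencies, so the step behind "wordcloud.png" can be pictured as counting keywords across all abstracts. The sketch below is a minimal standard-library version of that idea. The tiny stopword list, the tokenization rule, and the sample abstracts are assumptions for illustration. The project lists nltk, spacy, and wordcloud as dependencies, which this sketch deliberately does not use.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "we", "is", "on", "with", "this"}


def keyword_frequencies(abstracts, top_n=10):
    """Count non-stopword tokens across all abstracts; return the top_n (word, count) pairs."""
    counts = Counter()
    for abstract in abstracts:
        # Lowercase and keep alphabetic runs only.
        tokens = re.findall(r"[a-z]+", abstract.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_n)


# Made-up abstracts, not taken from the repository's papers.
abstracts = [
    "We present a model for text mining of scientific text.",
    "This model extracts figures and text from PDF documents.",
]
top = keyword_frequencies(abstracts, top_n=2)
print(top)  # [('text', 3), ('model', 2)]
```

A dictionary built from such pairs could then be handed to the wordcloud library, for example through WordCloud.generate_from_frequencies, to render the image.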

Step 8: Once the results have been obtained, stop the container in which Grobid is running.

To find out the CONTAINER_ID, execute:

docker container ps

Then stop the container:

docker container stop CONTAINER_ID

Docker compose

Step 1: Start the Docker server.

Step 2: Make sure no other program is using port 8070, because that is where Grobid will run. Then, from the main directory ("Text-Analysis"), execute:

docker-compose up

In this case, docker-compose will run the tests ("tests/testing.py") before running the main program ("main.py").

If all tests pass, the main program is executed; otherwise, execution stops.
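The "tests first, main program only on success" behavior can be pictured with a short Python sketch that runs commands in order and stops at the first non-zero exit code. This is only an illustration of the behavior described above, not the contents of the project's docker-compose.yml. The two commands used here are harmless stand-ins for "python tests/testing.py" and "python main.py".

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in turn; stop at the first failure and return its exit code."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # A failing step prevents every later step from running.
            return result.returncode
    return 0


# Stand-ins for the test script and the main program (illustrative only).
failing_tests = [sys.executable, "-c", "import sys; sys.exit(1)"]
main_program = [sys.executable, "-c", "print('main program ran')"]

exit_code = run_in_order([failing_tests, main_program])
print("exit code:", exit_code)  # the main program is skipped because the tests failed
```

Replacing the stand-ins with [sys.executable, "tests/testing.py"] and [sys.executable, "main.py"], and running from the main directory, would give the same ordering for the real scripts.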

Step 3: Once the results have been obtained, execute:

docker-compose down

Running examples

The main program has been run with 10 PDFs (stored in the papers folder).

This run produced a word cloud of the abstracts, a histogram of the number of figures, and the list of URLs found in each paper.

Preferred citation

Yimin Zhou.

Where to get help

For any help you may need, write to yimin.zhou@alumnos.upm.es.

Owner

Login: yiminzhou7
Kind: user

Repositories: 1
Profile: https://github.com/yiminzhou7

Citation (CITATION.cff)

license: Apache-2.0
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Zhou"
  given-names: "Yimin"
title: "Text Analysis with Grobid"
version: 1.0.0
date-released: 2024-02-14
url: "https://github.com/yiminzhou7/Text-Analysis.git"

Dependencies

requirements.txt pypi

beautifulsoup4 ==4.12.2
matplotlib ==3.8.1
nltk ==3.8.1
requests ==2.31.0
spacy ==3.7.2
wordcloud ==1.9.3

.github/workflows/test.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

docs/requirements.txt pypi

mkdocs-material *

Dockerfile docker

python 3.10 build

docker-compose.yml docker

lfoppiano/grobid 0.7.2
