Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (17.6%) to scientific vocabulary
Repository
Text Analysis using Grobid
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Text Analysis with Grobid
Description
This project uses Grobid, a machine-learning tool, to extract structured metadata and content from documents in PDF format. From the extracted information it produces a word cloud of the keywords in the abstracts of all the documents, a bar chart showing the number of figures in each document, and a list of the URLs that appear in each document.
Note: There are already 10 PDFs in the repository. If you want to process your own PDFs, please save them in the "papers" folder.
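The three outputs described above can be sketched against Grobid's TEI XML output. This is a minimal illustration, not the repository's actual code: the function name `summarize_tei` and the sample document are hypothetical, and it assumes that figures appear as `<figure>` elements (with tables marked `type="table"`) and that URLs can be found by scanning the text content. URLs stored only in attributes would be missed by this approach.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def summarize_tei(tei_xml: str) -> dict:
    """Return the abstract text, figure count, and URLs of one TEI document."""
    root = ET.fromstring(tei_xml)

    # Abstract: join every text node under <abstract> and normalize whitespace.
    abstract_el = root.find(".//tei:abstract", TEI_NS)
    abstract = ""
    if abstract_el is not None:
        abstract = " ".join("".join(abstract_el.itertext()).split())

    # Figures: count <figure> elements, skipping those marked as tables
    # (assumption about how the TEI output labels tables).
    figures = [
        fig for fig in root.iterfind(".//tei:figure", TEI_NS)
        if fig.get("type") != "table"
    ]

    # URLs: scan the text content, strip trailing punctuation,
    # and keep unique URLs in order of first appearance.
    full_text = " ".join(root.itertext())
    found = (match.rstrip(".,;") for match in URL_PATTERN.findall(full_text))
    urls = list(dict.fromkeys(found))

    return {"abstract": abstract, "figure_count": len(figures), "urls": urls}


# Hypothetical, hand-written TEI fragment used only to exercise the function.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><p>We study PDF parsing.</p></abstract></profileDesc></teiHeader>
  <text><body>
    <p>Code is at https://example.org/code.</p>
    <figure><head>Figure 1</head></figure>
    <figure type="table"><head>Table 1</head></figure>
  </body></text>
</TEI>"""

summary = summarize_tei(SAMPLE_TEI)
```

The abstract text could then be passed to the `wordcloud` package and the per-document figure counts to `matplotlib`, both of which are listed in the project's dependencies.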
Documentation
You can find the documentation here
Requirements
To run this program you will need:
- Docker, which is software that provides a convenient way to package, distribute and run applications as containers, ensuring consistency across different environments.
- Grobid, which is a machine learning-based toolkit for extracting information from documents in PDF format.
Installation instructions
Step 1: Clone the repository from GitHub to your local machine:
git clone https://github.com/yiminzhou7/Text-Analysis.git
Step 2: Start the Docker server.
Step 3: Install Grobid by pulling its Docker image:
docker pull lfoppiano/grobid:0.7.2
Execution instructions
Conda
Step 1: Start the Docker server.
Step 2: Run Grobid on localhost:8070:
docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.2
You can check that Grobid is running properly by opening a web browser and visiting the following URL: http://localhost:8070.
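The same check can be scripted instead of done in a browser. The sketch below is an illustration rather than part of the repository: the function name `grobid_is_alive` is hypothetical, and it assumes the Grobid service exposes its liveness endpoint at `/api/isalive` on the port mapped above.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the Grobid service answers its liveness endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: treat as not running.
        return False
```

Calling `grobid_is_alive()` before starting the main program gives an early, readable failure if the container has not finished starting.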
Step 3: Open a new command line and create an empty conda virtual environment with Python 3.10 (named "text_analysis" here):
conda create -n text_analysis python=3.10
Step 4: Activate the environment.
conda activate text_analysis
Step 5: Go to the main directory ("Text-Analysis") and install the dependencies from requirements.txt file:
pip install -r requirements.txt
Step 6: Before running the main program, it is recommended to run the testing.py file found in the "tests" folder. To do so, stay in the main directory ("Text-Analysis") and execute:
python tests/testing.py
Step 7: After passing the tests, stay in the main directory and execute the main program
python main.py
Once executed, the program will save the results in the "results" folder:
- A word cloud figure saved as "wordcloud.png".
- A histogram saved as "figures.png".
- The URLs of each paper saved as "links.txt".
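For readers who want to see how PDFs in the "papers" folder might reach Grobid, here is a minimal sketch. It is not taken from main.py: the function names `pdf_to_tei` and `process_folder` and the injectable `post` parameter are hypothetical, and it assumes Grobid's `/api/processFulltextDocument` endpoint, which accepts the PDF as a multipart form field named `input`. The `requests` package is listed in the project's dependencies.

```python
from pathlib import Path


def pdf_to_tei(pdf_path, base_url="http://localhost:8070", post=None, timeout=120):
    """Send one PDF to Grobid and return the TEI XML response as text."""
    if post is None:
        # Imported lazily so a stand-in can be supplied without installing requests.
        import requests
        post = requests.post

    url = f"{base_url}/api/processFulltextDocument"
    with open(pdf_path, "rb") as pdf_file:
        response = post(url, files={"input": pdf_file}, timeout=timeout)
    response.raise_for_status()
    return response.text


def process_folder(folder="papers", **kwargs):
    """Return a {file name: TEI XML} mapping for every PDF in the folder."""
    pdf_paths = sorted(Path(folder).glob("*.pdf"))
    return {path.name: pdf_to_tei(path, **kwargs) for path in pdf_paths}
```

The `post` parameter exists only so the function can be exercised without a running Grobid container; in normal use it is left at its default.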
Step 8: Once the results have been obtained, stop the container in which Grobid is running.
To find out the CONTAINER_ID, execute:
docker container ps
Then, stop the container
docker container stop CONTAINER_ID
Docker compose
Step 1: Start the Docker server.
Step 2: Stay in the main directory ("Text-Analysis") and execute (Note: make sure there are no programs using port 8070 because that's where Grobid will run):
docker-compose up
In this case, docker-compose will run the tests ("tests/testing.py") before running the main program ("main.py").
If all tests are passed, then the main program will be executed, otherwise it stops.
Once executed, the program will save the results in the "results" folder:
- A word cloud figure saved as "wordcloud.png".
- A histogram saved as "figures.png".
- The URLs of each paper saved as "links.txt".
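The "tests first, then the main program" behaviour described for docker-compose can be mirrored outside a container. The sketch below only illustrates that gating logic; it is not how the repository's compose file is written, and the function name `run_tests_then_main` is hypothetical.

```python
import subprocess
import sys


def run_tests_then_main(test_cmd, main_cmd):
    """Run test_cmd; run main_cmd only if the tests exit with status 0.

    Returns the exit status of the last command that was run.
    """
    tests = subprocess.run(test_cmd)
    if tests.returncode != 0:
        # Tests failed: stop here and report their exit status.
        return tests.returncode
    return subprocess.run(main_cmd).returncode


# Example call, to be run from the main directory ("Text-Analysis"):
# run_tests_then_main(
#     [sys.executable, "tests/testing.py"],
#     [sys.executable, "main.py"],
# )
```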
Step 3: Once the results have been obtained, execute
docker-compose down
Running examples
The main program has been run with 10 PDFs (stored in the papers folder).
The word cloud of the abstracts, the histogram of the number of figures, and the URLs found in each paper are shown below.
Preferred citation
Yimin Zhou.
Where to get help
If you need help, write to yimin.zhou@alumnos.upm.es.
Owner
- Login: yiminzhou7
- Kind: user
- Repositories: 1
- Profile: https://github.com/yiminzhou7
Citation (CITATION.cff)
license: Apache-2.0
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Zhou"
    given-names: "Yimin"
title: "Text Analysis with Grobid"
version: 1.0.0
date-released: 2024-02-14
url: "https://github.com/yiminzhou7/Text-Analysis.git"
Dependencies
- beautifulsoup4 ==4.12.2
- matplotlib ==3.8.1
- nltk ==3.8.1
- requests ==2.31.0
- spacy ==3.7.2
- wordcloud ==1.9.3
- actions/checkout v3 composite
- actions/setup-python v4 composite
- mkdocs-material *
- python 3.10 build
- lfoppiano/grobid 0.7.2