scholarvista
ScholarVista analyses research papers and extracts/plots information about them. It uses Grobid to extract all the content of the research papers. Then all this data is plotted and displayed using Python.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (18.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: mciccale
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://scholarvista.readthedocs.io
- Size: 3.24 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
ScholarVista
ScholarVista is a tool that extracts and plots information from a set of Academic Research Papers in PDF / TEI XML format. To process PDFs, it utilizes Grobid to generate the TEI XML files, then ScholarVista extracts the relevant information from the TEI XML files and generates the following data:
- Keyword Cloud for each paper's abstract and for all abstracts combined.
- Links List with the links found in each paper.
- Figures Histogram comparing the number of figures per paper.
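To make the three outputs above more concrete, here is a minimal sketch of how the underlying data could be pulled out of a TEI XML file using only the Python standard library. It is not ScholarVista's own code: the function name `summarize_tei`, the hand-written sample document, and the simple word and URL patterns are all assumptions made for illustration. Real Grobid output is richer (for example, it may also encode tables as `figure` elements, which this sketch does not filter out).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# Simple URL pattern (an assumption; real link extraction may be stricter).
URL_RE = re.compile(r"https?://[^\s<>\"')]+")

# Tiny hand-written stand-in for a TEI file (not real Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <p>Graph neural networks learn graph representations.</p>
  </abstract></profileDesc></teiHeader>
  <text><body>
    <p>Code is available at https://example.org/code.</p>
    <figure><head>Figure 1</head></figure>
    <figure><head>Figure 2</head></figure>
  </body></text>
</TEI>"""


def summarize_tei(xml_text: str) -> dict:
    """Hypothetical helper: keyword counts, links and figure count for one paper."""
    root = ET.fromstring(xml_text)

    # Abstract text, with whitespace normalized.
    abstract = root.find(".//tei:abstract", TEI_NS)
    abstract_text = ""
    if abstract is not None:
        abstract_text = " ".join("".join(abstract.itertext()).split())

    # Word frequencies are the raw material of a keyword cloud.
    keywords = Counter(re.findall(r"[a-z]{4,}", abstract_text.lower()))

    # Unique URLs found anywhere in the document text, minus trailing punctuation.
    full_text = "".join(root.itertext())
    links = sorted({url.rstrip(".,;") for url in URL_RE.findall(full_text)})

    # Number of <figure> elements; this is the value a histogram would compare.
    figures = len(root.findall(".//tei:figure", TEI_NS))

    return {"keywords": keywords, "links": links, "figures": figures}


summary = summarize_tei(SAMPLE_TEI)
print(summary["keywords"].most_common(3))
print(summary["links"])
print(summary["figures"])
```

Running the same function over every file in a directory would give one figure count per paper, which is the input a figures histogram needs.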
Requirements
Python >=3.12 is required to install the ScholarVista package. It is not required if you use the Docker Image.
If you want to generate the results from a set of PDF academic papers, you must ensure that the Grobid Service is installed and running on your machine. See Grobid installation instructions here.
The most straightforward way of starting and running the Grobid Service is by running a Docker image. Make sure you have Docker installed on your system.
```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```
This command will run Grobid and expose a web client on port 8070.
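Before processing PDFs, it can be useful to confirm that the service is actually reachable. The sketch below probes Grobid's `/api/isalive` health endpoint with the standard library. The helper names `alive_url` and `grobid_is_alive` are hypothetical and not part of ScholarVista, and the default base URL assumes the port mapping used in the command above.

```python
import urllib.request


def alive_url(base_url: str) -> str:
    """Build the health-check URL for a Grobid base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 3.0) -> bool:
    """Return True if the service answers the health check with HTTP 200.

    Hypothetical helper for illustration; any connection problem, timeout,
    HTTP error or malformed URL is treated as "not alive".
    """
    try:
        with urllib.request.urlopen(alive_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        return False
```

Calling `grobid_is_alive()` right after starting the container may return `False` for a short while, since the service needs some time to load before it answers requests.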
If you already have the TEI XML files generated from Grobid saved in a folder, you can directly generate the information from them.
Note: The TEI XML files MUST be obtained using Grobid, as this tool is intended to work only with Grobid generated TEI XML files.
Install ScholarVista
From Source
To install ScholarVista from source, you can clone the repository and install the package using pip. When using pip, it is good practice to use virtual environments. Check out the official documentation on virtual environments here.
Conda
```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
conda create -n scholarvista-env-3.12 python=3.12
conda activate scholarvista-env-3.12
pip install .
```
Note: You can also use PyEnv to create a virtual environment. However, since ScholarVista needs Python >=3.12, Conda is more suitable because it lets you select the Python version to use.
Docker Container
If you prefer running ScholarVista from a Docker Container, you can build the Docker Image with the following commands.
```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
docker build -t scholarvista-app .
```
This will create an image called scholarvista-app.
Execution Instructions
From Source
CLI Tool
The most convenient way of using ScholarVista is by using its CLI.
The CLI Tool generates and saves to a directory a keyword cloud of each paper's abstract and a list of URLs for each PDF analyzed, together with a histogram comparing the number of figures in each PDF and a general keyword cloud of all abstracts.
```
Usage: scholarvista [OPTIONS] COMMAND [ARGS]...

  ScholarVista's CLI main entry point.

Options:
  --input-dir PATH   Directory containing PDF files.  [required]
  --output-dir PATH  Directory to save results. Defaults to current directory.
  --help             Show this message and exit.

Commands:
  process-pdfs  Process all PDFs in the given directory.
  process-xmls  Process all TEI XMLs in the given directory.
```
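The usage text implies that `--input-dir` and `--output-dir` belong to the top-level command and are therefore given before the subcommand name. The sketch below reproduces that shape with the standard library's `argparse`. It is only an analogue for illustration: the real CLI is not shown in this document and is presumably built with click (listed among the dependencies), and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stdlib analogue of the CLI layout above (not ScholarVista's actual code)."""
    parser = argparse.ArgumentParser(prog="scholarvista")
    # Top-level options: these must appear before the subcommand.
    parser.add_argument("--input-dir", required=True,
                        help="Directory containing the input files.")
    parser.add_argument("--output-dir", default=".",
                        help="Directory to save results. Defaults to current directory.")
    # Subcommands select what kind of input is processed.
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("process-pdfs", help="Process all PDFs in the given directory.")
    commands.add_parser("process-xmls", help="Process all TEI XMLs in the given directory.")
    return parser


# Same argument order as a typical invocation: options first, command last.
args = build_parser().parse_args(
    ["--input-dir", "./pdfs", "--output-dir", "./output", "process-pdfs"]
)
print(args.command, args.input_dir, args.output_dir)
```

With this layout, placing `--input-dir` after `process-pdfs` is rejected by the parser, which matches the `[OPTIONS] COMMAND` order shown in the usage text.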
Example
- Start Grobid service using the container.
```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```
- Run ScholarVista's CLI to process all the PDFs in a given directory and leave the results in another directory.
```bash
# Process PDF files and save the results to a specified directory
scholarvista --input-dir ./pdfs --output-dir ./output process-pdfs
```
Python Modules
ScholarVista provides a set of classes and modules that let you use all of its functionality from your own Python code. For an example, see example.py
Docker Container
If you prefer running ScholarVista with Docker, you can make use of ScholarVista CLI directly from the Docker Image you created following these instructions.
- Start Grobid service using the container.
```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```
- Run ScholarVista's container with 2 mounted volumes for input and output directories and connected to the host network.
```bash
docker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app
```
Note: By default, ScholarVista's Docker Image processes PDF files. You can override this by providing the process-xmls argument after the image name.
Example
The following example uses the Docker Image to process a set of PDFs contained in the foo directory and leaves the results in bar, assuming the Grobid service is running at localhost:8070.
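```bash
# Use absolute paths for the mounts: a bare name such as foo would be
# treated by Docker as a named volume rather than as the local directory.
docker run -it --rm --network=host -v "$(pwd)/foo":/input -v "$(pwd)/bar":/output scholarvista-app process-pdfs
```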
Docker Compose (Experimental)
You can also try running ScholarVista through Docker Compose. This feature is still in development and may not work as expected: ScholarVista may try to connect to Grobid before Grobid has started, in which case its container is restarted until the Grobid service is up and running. To try it:
sh-like shells
```bash
INPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up
```
PowerShell
```powershell
$env:INPUT_DIR="/path/to/input/dir"; $env:OUTPUT_DIR="/path/to/output/dir"; $env:COMMAND="process-pdfs"; docker-compose up
```
Note: The COMMAND variable can be either process-pdfs or process-xmls. INPUT_DIR and OUTPUT_DIR are the directories on the host machine from which the files are read and to which the results are written, respectively.
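The start-up race described in this section (a client trying to reach Grobid before it is ready) is commonly handled by polling the service with a growing delay instead of relying on container restarts. The following is a minimal, generic sketch of that idea and not something ScholarVista is known to implement. The name `wait_until_ready` and its parameters are hypothetical, and the `probe` callable stands in for whatever readiness check is used.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out.

    The wait between attempts doubles each time, capped at 30 seconds.
    The sleep function is injectable so the loop can be tested without waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, 30.0)
    return False


# Example with a fake probe that succeeds on the third call.
answers = iter([False, False, True])
waits = []
ready = wait_until_ready(lambda: next(answers), attempts=5, sleep=waits.append)
print(ready, waits)
```

In a real setup, `probe` would be a function that checks whether the Grobid service responds, and processing would start only once `wait_until_ready` returns `True`.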
License
Please refer to the LICENSE file.
Where to Get Help
For further assistance or to contribute to the project, please refer to the CONTRIBUTING.md file.
Owner
- Name: Marco Ciccale Baztán
- Login: mciccale
- Kind: user
- Location: Madrid
- Repositories: 1
- Profile: https://github.com/mciccale
UPM student, Computer Science and Engineering.
Citation (CITATION.cff)
cff-version: 1.2.0
message: 'If you use this software, please cite it as below.'
authors:
  - family-names: Ciccalè
    given-names: Marco
    orcid: https://orcid.org/0009-0000-8821-0587
title: 'ScholarVista'
license: Apache-2.0
version: '0.2.0'
doi: '10.5281/zenodo.10654760'
date-released: 2024-03-02
url: 'https://github.com/mciccale/ScholarVista'
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"license": "https://spdx.org/licenses/Apache-2.0",
"codeRepository": "git+https://github.com/mciccale/ScholarVista",
"dateCreated": "2024-02-10",
"datePublished": "2024-02-14",
"dateModified": "2024-03-02",
"name": "ScholarVista",
"version": "0.2.0",
"identifier": "10.5281/zenodo.10654760",
"description": "ScholarVista is a tool that analyzes research papers and extracts and plots information from them. It utilizes Grobid, a library for extracting content from research papers, to extract all the relevant data. The extracted data is then plotted and displayed using Python.",
"applicationCategory": "Research",
"developmentStatus": "active",
"keywords": [
"text",
"research",
"analysis",
"python",
"grobid"
],
"programmingLanguage": [
"Python 3"
],
"operatingSystem": [
"Linux"
],
"softwareRequirements": [
"Python 3",
"Grobid",
"Matplotlib",
"WordCloud"
],
"author": [
{
"@type": "Person",
"@id": "https://orcid.org/0009-0000-8821-0587",
"givenName": "Marco",
"familyName": "Ciccal",
"email": "marcociccalebaztan@gmail.com"
}
]
}
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- mkdocs-material *
- astroid 3.0.3
- certifi 2024.2.2
- charset-normalizer 3.3.2
- click 8.1.7
- colorama 0.4.6
- contourpy 1.2.0
- cycler 0.12.1
- dill 0.3.8
- fonttools 4.48.1
- grobid-client-python 0.0.8
- idna 3.6
- isort 5.13.2
- kiwisolver 1.4.5
- matplotlib 3.8.2
- mccabe 0.7.0
- numpy 1.26.4
- packaging 23.2
- pillow 10.2.0
- platformdirs 4.2.0
- pylint 3.0.3
- pyparsing 3.1.1
- python-dateutil 2.8.2
- requests 2.31.0
- six 1.16.0
- tomlkit 0.12.3
- urllib3 2.2.0
- wordcloud 1.9.3
- pylint ^3.0.3 develop
- click ^8.1.7
- grobid-client-python ^0.0.8
- matplotlib ^3.8.2
- python ^3.12
- wordcloud ^1.9.3
- python 3.12-slim build