scholarvista

ScholarVista analyzes research papers, then extracts and plots information about them. It uses Grobid to extract the content of the papers, and the extracted data is plotted and displayed with Python.

https://github.com/mciccale/scholarvista

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
✓
Academic publication links
Links to: zenodo.org
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (18.2%) to scientific vocabulary

Keywords

keyword-cloud keyword-extraction machine-learning python3 text-extraction

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: mciccale
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://scholarvista.readthedocs.io
Size: 3.24 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 3

Topics

keyword-cloud keyword-extraction machine-learning python3 text-extraction

Created about 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Contributing License Citation Codemeta

README.md

ScholarVista

ScholarVista is a tool that extracts and plots information from a set of academic research papers in PDF or TEI XML format. To process PDFs, it uses Grobid to generate TEI XML files; ScholarVista then extracts the relevant information from those files and generates the following outputs:

Keyword Cloud for each paper's abstract and for all abstracts combined.
Links List with the links found in each paper.
Figures Histogram comparing the number of figures per paper.
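As a rough illustration of the three outputs listed above, the sketch below pulls the abstract text, the figure count, and the URLs out of a TEI XML string using only the standard library. The function name `summarize_tei`, the URL pattern, and the choice to skip `<figure type="table">` elements are assumptions made for this example; it is not ScholarVista's actual parser.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# Deliberately simple URL pattern (an assumption, not ScholarVista's rule).
URL_RE = re.compile(r"https?://[^\s<>]+")


def summarize_tei(xml_text: str) -> dict:
    """Return abstract text, figure count and URLs for one TEI document."""
    root = ET.fromstring(xml_text)

    # Abstract: join all text nodes and normalize whitespace.
    abstract = root.find(".//tei:abstract", TEI_NS)
    abstract_text = ""
    if abstract is not None:
        abstract_text = " ".join("".join(abstract.itertext()).split())

    # Figures: assumption that tables are tagged type="table" and skipped.
    figures = [
        fig for fig in root.findall(".//tei:figure", TEI_NS)
        if fig.get("type") != "table"
    ]

    # Links: scan the document text and drop trailing punctuation.
    full_text = "".join(root.itertext())
    urls = sorted({u.rstrip(".,;") for u in URL_RE.findall(full_text)})

    return {"abstract": abstract_text, "figures": len(figures), "urls": urls}


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><p>Graph search at scale.</p></abstract></profileDesc></teiHeader>
  <text><body>
    <p>Code is at https://example.org/code.</p>
    <figure><head>Figure 1</head></figure>
    <figure><head>Figure 2</head></figure>
    <figure type="table"><head>Table 1</head></figure>
  </body></text>
</TEI>"""

summary = summarize_tei(SAMPLE)
print(summary)
```

The per-paper figure counts would feed the histogram, and the abstract text would feed the keyword cloud.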

Requirements

Python >=3.12 is required to install the ScholarVista package. It is not required when using the Docker image.

If you want to generate results from a set of PDF academic papers, the Grobid service must be installed and running on your machine. See the Grobid installation instructions.

The most straightforward way to start and run the Grobid service is with a Docker image. Make sure you have Docker installed on your system.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

This command runs Grobid and exposes its web client on port 8070.
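Before processing PDFs it can help to confirm that the service is reachable. The following is a minimal sketch that queries Grobid's `/api/isalive` health endpoint with the standard library; the helper names `isalive_url` and `grobid_is_alive` and the default base URL `http://localhost:8070` are assumptions for this example and are not part of ScholarVista.

```python
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the health-check URL for a Grobid server base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str = "http://localhost:8070",
                    timeout: float = 2.0) -> bool:
    """Return True if the Grobid service answers its health check."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures and HTTP errors.
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```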

If you already have Grobid-generated TEI XML files saved in a folder, you can generate the results directly from them.

Note: The TEI XML files MUST be produced by Grobid, as this tool is intended to work only with Grobid-generated TEI XML files.
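One way to sanity-check a file before processing it is to look for the marker Grobid leaves in the TEI header. The sketch below assumes that Grobid output has a `TEI` root element in the TEI namespace and an `application` element whose `ident` attribute is `GROBID`; the function name `looks_like_grobid_tei` is hypothetical, and ScholarVista itself may not perform this check.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def looks_like_grobid_tei(xml_text: str) -> bool:
    """Heuristically decide whether an XML string is Grobid-generated TEI."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # The root element must be <TEI> in the TEI namespace.
    if root.tag != f"{{{TEI_NS}}}TEI":
        return False
    # Assumption: Grobid records itself as <application ident="GROBID" ...>.
    for app in root.iter(f"{{{TEI_NS}}}application"):
        if app.get("ident", "").upper() == "GROBID":
            return True
    return False


GROBID_SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><encodingDesc>'
    '<appInfo><application ident="GROBID" version="0.8.0"/></appInfo>'
    "</encodingDesc></teiHeader></TEI>"
)
OTHER_SAMPLE = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>'

print(looks_like_grobid_tei(GROBID_SAMPLE), looks_like_grobid_tei(OTHER_SAMPLE))
```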

Install ScholarVista

From Source

To install ScholarVista from source, clone the repository and install the package with pip. When using pip, it is good practice to work inside a virtual environment. See the official documentation on virtual environments.

Conda

```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
conda create -n scholarvista-env-3.12 python=3.12
conda activate scholarvista-env-3.12
pip install .
```

Note: You can also use PyEnv to create a virtual environment. However, because ScholarVista needs Python >=3.12, Conda is more convenient, since it lets you choose the Python version for the environment.

Docker Container

If you prefer running ScholarVista from a Docker Container, you can build the Docker Image with the following commands.

```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
docker build -t scholarvista-app .
```

This will create an image called scholarvista-app.

Execution Instructions

From Source

CLI Tool

The most convenient way to use ScholarVista is through its CLI.

For each PDF analyzed, the CLI tool generates a keyword cloud of the paper's abstract and a list of its URLs. It also generates a histogram comparing the number of figures in each PDF and a general keyword cloud of all abstracts. All results are saved to a directory.
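To make the difference between the per-paper clouds and the general cloud concrete, here is a small sketch of the word counting that could sit underneath them. It uses only `collections.Counter`; the stopword list, the tokenization rule, and the sample abstracts are invented for illustration, and the actual rendering in ScholarVista is handled by the WordCloud library listed in its dependencies.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not ScholarVista's).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "we", "is", "with"}


def keyword_counts(text: str) -> Counter:
    """Count lowercase alphabetic words, ignoring stopwords and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS and len(w) > 2)


# Hypothetical abstracts keyed by file name.
abstracts = {
    "paper_a.pdf": "We study graph search and graph indexing.",
    "paper_b.pdf": "Indexing for large-scale search in the cloud.",
}

# One frequency table per paper, plus one for all abstracts combined.
per_paper = {name: keyword_counts(text) for name, text in abstracts.items()}
total = sum(per_paper.values(), Counter())

print(per_paper["paper_a.pdf"].most_common(2))
print(total.most_common(3))
```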

```
Usage: scholarvista [OPTIONS] COMMAND [ARGS]...

  ScholarVista's CLI main entry point.

Options:
  --input-dir PATH   Directory containing PDF files.  [required]
  --output-dir PATH  Directory to save results. Defaults to current
                     directory.
  --help             Show this message and exit.

Commands:
  process-pdfs  Process all PDFs in the given directory.
  process-xmls  Process all TEI XMLs in the given directory.
```

Example

Start Grobid service using the container.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

Run ScholarVista's CLI to process all the PDFs in a given directory and leave the results in another directory.

```bash
# Process PDF files and save the results to a specified directory
scholarvista --input-dir ./pdfs --output-dir ./output process-pdfs
```
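If you want to drive the CLI from a script, the same invocation can be assembled in Python. This is a sketch that relies only on the options and commands shown in the usage text; the helper names `build_argv` and `run_scholarvista` are hypothetical, and `run_scholarvista` assumes the `scholarvista` executable is on your `PATH`.

```python
import subprocess

# The two commands documented in the CLI usage text.
COMMANDS = {"process-pdfs", "process-xmls"}


def build_argv(input_dir: str, output_dir: str,
               command: str = "process-pdfs") -> list[str]:
    """Assemble the argument list for one ScholarVista CLI call."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return [
        "scholarvista",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        command,
    ]


def run_scholarvista(input_dir: str, output_dir: str,
                     command: str = "process-pdfs") -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_argv(input_dir, output_dir, command), check=True)


print(" ".join(build_argv("./pdfs", "./output")))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.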

Python Modules

ScholarVista provides a set of classes and modules that let you use all of its functionality from your own Python code. For an example, see example.py.

Docker Container

If you prefer running ScholarVista with Docker, you can use the ScholarVista CLI directly from the Docker image built in the Docker Container installation section above.

Start Grobid service using the container.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

Run ScholarVista's container connected to the host network, with two mounted volumes: one for the input directory and one for the output directory.

```bash
docker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app
```

Note: By default, ScholarVista's Docker image processes PDF files. You can override this by passing the process-xmls argument after the image name.

Example

Here is an example that uses the Docker image to process a set of PDFs in the foo directory and save the results to bar. It assumes the Grobid service is running at localhost:8070.

```bash
docker run -it --rm --network=host -v foo:/input -v bar:/output scholarvista-app process-pdfs
```

Docker Compose (Experimental)

You can also try running ScholarVista through Docker Compose. This feature is still in development and may not work as expected: ScholarVista tries to connect to Grobid before Grobid has finished starting, so its container is restarted repeatedly until the Grobid service is up and running. To try it:
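The restart loop described above is one way to deal with a dependency that is not ready yet. An alternative pattern is to poll a readiness probe before starting work. The sketch below shows that pattern in isolation; `wait_until_ready` and `flaky_probe` are hypothetical names, the probe is a stand-in for a real Grobid health check, and none of this is part of ScholarVista's Compose setup.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     attempts: int = 10,
                     delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Stand-in probe that only succeeds on its third call.
calls = {"n": 0}


def flaky_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(flaky_probe, attempts=5, delay=0.0)
print(ready, calls["n"])
```

In a real setup the probe would be a function that checks whether the Grobid service responds on port 8070.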

sh-like shells

```bash
INPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up
```

PowerShell

```powershell
$env:INPUT_DIR="/path/to/input/dir"; $env:OUTPUT_DIR="/path/to/output/dir"; $env:COMMAND="process-pdfs"
docker-compose up
```

Note: The COMMAND variable can be either process-pdfs or process-xmls. INPUT_DIR and OUTPUT_DIR are directories on the host machine: the input files are read from the first, and the results are written to the second.

License

Please refer to the LICENSE file.

Where to Get Help

For further assistance or to contribute to the project, please refer to the CONTRIBUTING.md file.

Owner

Name: Marco Ciccale Baztán
Login: mciccale
Kind: user
Location: Madrid

Repositories: 1
Profile: https://github.com/mciccale

UPM student, Computer Science and Engineering.

Citation (CITATION.cff)

cff-version: 1.2.0
message: 'If you use this software, please cite it as below.'
authors:
  - family-names: Ciccalè
    given-names: Marco
    orcid: https://orcid.org/0009-0000-8821-0587
title: 'ScholarVista'
license: Apache-2.0
version: '0.2.0'
doi: '10.5281/zenodo.10654760'
date-released: 2024-03-02
url: 'https://github.com/mciccale/ScholarVista'

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/Apache-2.0",
  "codeRepository": "git+https://github.com/mciccale/ScholarVista",
  "dateCreated": "2024-02-10",
  "datePublished": "2024-02-14",
  "dateModified": "2024-03-02",
  "name": "ScholarVista",
  "version": "0.2.0",
  "identifier": "10.5281/zenodo.10654760",
  "description": "ScholarVista is a tool that analyzes research papers and extracts and plots information from them. It utilizes Grobid, a library for extracting content from research papers, to extract all the relevant data. The extracted data is then plotted and displayed using Python.",
  "applicationCategory": "Research",
  "developmentStatus": "active",
  "keywords": [
    "text",
    "research",
    "analysis",
    "python",
    "grobid"
  ],
  "programmingLanguage": [
    "Python 3"
  ],
  "operatingSystem": [
    "Linux"
  ],
  "softwareRequirements": [
    "Python 3",
    "Grobid",
    "Matplotlib",
    "WordCloud"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://orcid.org/0009-0000-8821-0587",
      "givenName": "Marco",
      "familyName": "Ciccal",
      "email": "marcociccalebaztan@gmail.com"
    }
  ]
}

Committers

Last synced: 11 months ago

All Time

Total Commits: 50
Total Committers: 1
Avg Commits per committer: 50.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
mciccale	m**n@g**m	50
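The Development Distribution Score values above can be reproduced with a few lines of arithmetic, assuming the commonly used definition DDS = 1 - (commits by the top committer / total commits). That definition, the function name, and the convention of returning 0.0 when there are no commits are assumptions for this sketch, not something this page states.

```python
def development_distribution_score(commits_per_author: dict[str, int]) -> float:
    """Return 1 - (top committer's commits / total commits), or 0.0 with no commits."""
    total = sum(commits_per_author.values())
    if total == 0:
        return 0.0
    return 1.0 - max(commits_per_author.values()) / total


# All Time: a single committer with 50 commits.
print(development_distribution_score({"mciccale": 50}))
# Past Year: no commits at all.
print(development_distribution_score({}))
```

A score of 0.0 means all commits come from one person; the score moves toward 1.0 as work is spread across more committers.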

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Dependencies

.github/workflows/lint.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

.github/workflows/test.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

docs/requirements.txt pypi

mkdocs-material *

poetry.lock pypi

astroid 3.0.3
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
contourpy 1.2.0
cycler 0.12.1
dill 0.3.8
fonttools 4.48.1
grobid-client-python 0.0.8
idna 3.6
isort 5.13.2
kiwisolver 1.4.5
matplotlib 3.8.2
mccabe 0.7.0
numpy 1.26.4
packaging 23.2
pillow 10.2.0
platformdirs 4.2.0
pylint 3.0.3
pyparsing 3.1.1
python-dateutil 2.8.2
requests 2.31.0
six 1.16.0
tomlkit 0.12.3
urllib3 2.2.0
wordcloud 1.9.3

pyproject.toml pypi

pylint ^3.0.3 develop
click ^8.1.7
grobid-client-python ^0.0.8
matplotlib ^3.8.2
python ^3.12
wordcloud ^1.9.3

Dockerfile docker

python 3.12-slim build

docker-compose.yml docker
