scholarvista

ScholarVista analyses research papers and extracts/plots information about them. It uses Grobid to extract all the content of the research papers. Then all this data is plotted and displayed using Python.

https://github.com/mciccale/scholarvista

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.2%) to scientific vocabulary

Keywords

keyword-cloud keyword-extraction machine-learning python3 text-extraction
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Topics
keyword-cloud keyword-extraction machine-learning python3 text-extraction
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Citation Codemeta

README.md

ScholarVista

ScholarVista is a tool that extracts and plots information from a set of academic research papers in PDF or TEI XML format. To process PDFs, it uses Grobid to generate TEI XML files. ScholarVista then extracts the relevant information from the TEI XML files and generates the following data:

  1. Keyword Cloud for each paper's abstract and for all abstracts combined.
  2. Links List of the links found in each paper.
  3. Figures Histogram comparing the number of figures per paper.
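
The three outputs above all start from the same raw data pulled out of a TEI XML file. The following is a minimal sketch of that extraction step using only the Python standard library; it is not ScholarVista's own implementation. The function name `summarize_tei` is hypothetical, and the sketch assumes the usual Grobid TEI layout: the abstract under `profileDesc/abstract`, links carried in `target` attributes, and tables encoded as `<figure type="table">`.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def summarize_tei(xml_bytes: bytes) -> dict:
    """Return the abstract text, external links and figure count of one paper.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_bytes)

    # Abstract: join all nested text and normalise whitespace.
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    abstract = ""
    if abstract_el is not None:
        abstract = " ".join("".join(abstract_el.itertext()).split())

    # Links: any target attribute holding an http(s) URL (assumed convention).
    links = sorted(
        {
            el.get("target")
            for el in root.iter()
            if (el.get("target") or "").startswith(("http://", "https://"))
        }
    )

    # Figures: skip <figure type="table">, assumed to encode tables.
    figures = [
        fig
        for fig in root.findall(".//tei:figure", TEI_NS)
        if fig.get("type") != "table"
    ]

    return {"abstract": abstract, "links": links, "figure_count": len(figures)}
```

A file on disk could be summarised with `summarize_tei(Path("paper.tei.xml").read_bytes())`, where the file name is a placeholder. Passing bytes rather than text lets the parser honour the encoding declared in the XML header.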

Requirements

Python >=3.12 is required to install the ScholarVista package. It is not required if you use the Docker Image.

If you want to generate the results from a set of PDF academic papers, the Grobid Service must be installed and running on your machine. See the Grobid installation instructions.

The most straightforward way to start the Grobid Service is to run its Docker image. Make sure you have Docker installed on your system.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

This command runs Grobid and exposes its web client on port 8070.
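
Before processing PDFs, it can be useful to confirm that the service is reachable. The sketch below is an illustration, not part of ScholarVista. It assumes Grobid's `/api/isalive` health endpoint and the port mapping from the command above; the function name `grobid_is_alive` is hypothetical.

```python
import http.client
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 3.0) -> bool:
    """Return True if the Grobid health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error or malformed reply.
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```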

If you already have the TEI XML files generated from Grobid saved in a folder, you can directly generate the information from them.

Note: The TEI XML files MUST be obtained using Grobid, as this tool is intended to work only with Grobid generated TEI XML files.
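
A quick sanity check on input files can catch obviously wrong XML before it is processed. The sketch below only verifies that a document parses and that its root element is `TEI` in the TEI namespace. It cannot prove that Grobid produced the file, and the helper name `looks_like_tei` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fully qualified tag of a TEI root element.
TEI_ROOT_TAG = "{http://www.tei-c.org/ns/1.0}TEI"


def looks_like_tei(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as XML with a TEI root element."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return root.tag == TEI_ROOT_TAG
```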

Install ScholarVista

From Source

To install ScholarVista from source, clone the repository and install the package using pip. When using pip, it is good practice to work inside a virtual environment. See the official documentation on virtual environments.

Conda

```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
conda create -n scholarvista-env-3.12 python=3.12
conda activate scholarvista-env-3.12
pip install .
```

Note: You can also use PyEnv to create a virtual environment. However, since ScholarVista needs Python >=3.12, Conda is more convenient because it lets you select the Python version when creating the environment.

Docker Container

If you prefer running ScholarVista from a Docker Container, you can build the Docker Image with the following commands.

```bash
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
docker build -t scholarvista-app .
```

This will create an image called scholarvista-app.

Execution Instructions

From Source

CLI Tool

The most convenient way of using ScholarVista is by using its CLI.

For each PDF analyzed, the CLI Tool generates a keyword cloud of the paper's abstract and a list of its URLs. It also generates a histogram comparing the number of figures in each PDF and a general keyword cloud of all abstracts. All results are saved to the output directory.

```
Usage: scholarvista [OPTIONS] COMMAND [ARGS]...

  ScholarVista's CLI main entry point.

Options:
  --input-dir PATH   Directory containing PDF files.  [required]
  --output-dir PATH  Directory to save results. Defaults to current directory.
  --help             Show this message and exit.

Commands:
  process-pdfs  Process all PDFs in the given directory.
  process-xmls  Process all TEI XMLs in the given directory.
```

Example
  1. Start Grobid service using the container.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

  2. Run ScholarVista's CLI to process all the PDFs in a given directory and leave the results in another directory.

```bash
# Process PDF files and save the results to a specified directory
scholarvista --input-dir ./pdfs --output-dir ./output process-pdfs
```
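
The keyword clouds produced by the CLI are, at their core, a picture of word frequencies in an abstract. The following standard-library sketch shows one way such frequencies could be computed. It is an illustration under stated assumptions, not ScholarVista's actual algorithm: the stop-word list is a tiny hypothetical sample and `keyword_frequencies` is a made-up name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much larger.
STOPWORDS = frozenset(
    {"a", "an", "and", "are", "for", "in", "is", "of", "on", "that", "the", "this", "to", "we", "with"}
)


def keyword_frequencies(text: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Return the top_n most frequent non-stop-words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)
```

The resulting pairs can be turned into a dictionary and passed to the `wordcloud` package (listed among the project's dependencies), for example through `WordCloud().generate_from_frequencies(dict(pairs))`, to render the actual image.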

Python Modules

ScholarVista provides a set of classes and modules that let you use all of its functionality from your own Python code. For an example, see example.py.

Docker Container

If you prefer running ScholarVista with Docker, you can use the ScholarVista CLI directly from the Docker Image built in the installation section.

  1. Start Grobid service using the container.

```bash
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
```

  2. Run ScholarVista's container with 2 mounted volumes for input and output directories and connected to the host network.

```bash
docker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app
```

Note: By default, ScholarVista's Docker Image processes PDF files. You can override this by passing the process-xmls argument after the image name.

Example

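Here's an example that processes a set of PDFs contained in the foo directory and leaves the results in bar using the Docker Image, assuming the Grobid service is running at localhost:8070. The directories are given as absolute paths because Docker interprets a bare name such as foo as a named volume rather than a host directory.

```bash
docker run -it --rm --network=host -v "$(pwd)/foo":/input -v "$(pwd)/bar":/output scholarvista-app process-pdfs
```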

Docker Compose (Experimental)

You can try to run ScholarVista through Docker Compose. However, this feature is still in development and may not work as expected. ScholarVista may try to connect to Grobid before Grobid has started; in that case its container is restarted until the Grobid service is up and running. You can try it as follows:

sh-like shells

```bash
INPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up
```

PowerShell

```powershell
$env:INPUT_DIR="/path/to/input/dir"
$env:OUTPUT_DIR="/path/to/output/dir"
$env:COMMAND="process-pdfs"
docker-compose up
```

Note: The COMMAND variable can be either process-pdfs or process-xmls. INPUT_DIR and OUTPUT_DIR are the host directories from which the input files are read and to which the results are written, respectively.
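
The startup race described in this section (a client starting before Grobid is ready) can also be handled on the client side by polling until the service answers. The sketch below shows a generic retry loop. It is not part of ScholarVista; the name `wait_until_ready` and its parameters are hypothetical. The `probe` callable is whatever readiness check fits your setup, for instance an HTTP request to Grobid's health endpoint.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries.

    Returns True as soon as probe() succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        # No need to wait after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Injecting `sleep` keeps the loop easy to test without real waiting; in normal use the default `time.sleep` applies.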

License

Please refer to the LICENSE file.

Where to Get Help

For further assistance or to contribute to the project, please refer to the CONTRIBUTING.md file.

Owner

  • Name: Marco Ciccale Baztán
  • Login: mciccale
  • Kind: user
  • Location: Madrid

UPM student, Computer Science and Engineering.

Citation (CITATION.cff)

cff-version: 1.2.0
message: 'If you use this software, please cite it as below.'
authors:
  - family-names: Ciccalè
    given-names: Marco
    orcid: https://orcid.org/0009-0000-8821-0587
title: 'ScholarVista'
license: Apache-2.0
version: '0.2.0'
doi: '10.5281/zenodo.10654760'
date-released: 2024-03-02
url: 'https://github.com/mciccale/ScholarVista'

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/Apache-2.0",
  "codeRepository": "git+https://github.com/mciccale/ScholarVista",
  "dateCreated": "2024-02-10",
  "datePublished": "2024-02-14",
  "dateModified": "2024-03-02",
  "name": "ScholarVista",
  "version": "0.2.0",
  "identifier": "10.5281/zenodo.10654760",
  "description": "ScholarVista is a tool that analyzes research papers and extracts and plots information from them. It utilizes Grobid, a library for extracting content from research papers, to extract all the relevant data. The extracted data is then plotted and displayed using Python.",
  "applicationCategory": "Research",
  "developmentStatus": "active",
  "keywords": [
    "text",
    "research",
    "analysis",
    "python",
    "grobid"
  ],
  "programmingLanguage": [
    "Python 3"
  ],
  "operatingSystem": [
    "Linux"
  ],
  "softwareRequirements": [
    "Python 3",
    "Grobid",
    "Matplotlib",
    "WordCloud"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://orcid.org/0009-0000-8821-0587",
      "givenName": "Marco",
      "familyName": "Ciccalè",
      "email": "marcociccalebaztan@gmail.com"
    }
  ]
}

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 50
  • Total Committers: 1
  • Avg Commits per committer: 50.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
mciccale m****n@g****m 50

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

.github/workflows/lint.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
docs/requirements.txt pypi
  • mkdocs-material *
poetry.lock pypi
  • astroid 3.0.3
  • certifi 2024.2.2
  • charset-normalizer 3.3.2
  • click 8.1.7
  • colorama 0.4.6
  • contourpy 1.2.0
  • cycler 0.12.1
  • dill 0.3.8
  • fonttools 4.48.1
  • grobid-client-python 0.0.8
  • idna 3.6
  • isort 5.13.2
  • kiwisolver 1.4.5
  • matplotlib 3.8.2
  • mccabe 0.7.0
  • numpy 1.26.4
  • packaging 23.2
  • pillow 10.2.0
  • platformdirs 4.2.0
  • pylint 3.0.3
  • pyparsing 3.1.1
  • python-dateutil 2.8.2
  • requests 2.31.0
  • six 1.16.0
  • tomlkit 0.12.3
  • urllib3 2.2.0
  • wordcloud 1.9.3
pyproject.toml pypi
  • pylint ^3.0.3 develop
  • click ^8.1.7
  • grobid-client-python ^0.0.8
  • matplotlib ^3.8.2
  • python ^3.12
  • wordcloud ^1.9.3
Dockerfile docker
  • python 3.12-slim build
docker-compose.yml docker