Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: marcupm
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 2.3 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Data Analysis

Overview

This project analyzes a corpus of research papers to extract topics, compute similarities, link them in a knowledge graph, and identify funding information. It leverages HuggingFace models and follows best practices for research data management.

Table of Contents

  • Prerequisites
  • Installation & Execution
  • Pipeline Steps
  • Component Details
  • Query the Knowledge Graph
  • Research Object & Provenance
  • Directory Structure

Prerequisites

  • Docker and Docker Compose
  • Python 3.9+
  • Research papers in PDF format to be analyzed

Installation & Execution

  1. Clone this repository:

         git clone https://github.com/marcupm/data-analysis.git
         cd data-analysis

  2. Place your research papers in the data/ directory.

  3. Start the service:

         docker-compose up -d --build

Pipeline Steps

The pipeline runs the following steps automatically when main.py is executed:

1. Extracting Metadata from PDFs

The pipeline begins by sending the PDF files to GROBID for processing. This extracts structured metadata such as title, authors, affiliations, DOI, and abstract for each paper.
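
As a rough illustration of this step, the sketch below reads the listed fields out of TEI XML, the format GROBID returns. It assumes the XML has already been fetched from the GROBID service. The sample document, the function names, and the returned dictionary layout are illustrative assumptions, not the code in this repository's metadata_extractor.py.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents such as GROBID's output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, heavily trimmed TEI fragment for illustration only;
# real GROBID output contains many more elements.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Paper</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
          </analytic>
          <idno type="DOI">10.1234/example</idno>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <abstract><p>We study an example.</p></abstract>
    </profileDesc>
  </teiHeader>
</TEI>
"""


def element_text(element):
    """Return the whitespace-normalised text of an element, or None."""
    if element is None:
        return None
    text = " ".join(" ".join(element.itertext()).split())
    return text or None


def parse_tei_header(tei_xml):
    """Extract title, authors, DOI and abstract from a TEI document string."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    authors = []
    for pers in header.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        name = element_text(pers)
        if name:
            authors.append(name)
    return {
        "title": element_text(header.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "authors": authors,
        "doi": element_text(header.find(".//tei:idno[@type='DOI']", TEI_NS)),
        "abstract": element_text(header.find(".//tei:profileDesc/tei:abstract", TEI_NS)),
    }


metadata = parse_tei_header(SAMPLE_TEI)
print(metadata)
```

Fields that GROBID could not find come back as None (or an empty author list), so later steps can skip papers without a DOI.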

2. Enriching Metadata with OpenAlex Topics

Using the extracted DOIs, the pipeline queries the OpenAlex API to retrieve additional topical information. This helps classify each paper within broader research areas.
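
A DOI lookup of this kind might be sketched as follows. The sketch assumes the OpenAlex works endpoint addressed by DOI URL and a `topics` list with `display_name` and `score` fields in the response. The helper names and the sample response are made up for illustration and do not reproduce openalex_query.py.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_work_url(doi):
    """Build the lookup URL for a bare DOI such as '10.1234/example'."""
    return OPENALEX_WORKS + "https://doi.org/" + urllib.parse.quote(doi, safe="/")


def fetch_work(doi, timeout=10):
    """Fetch the work record from OpenAlex (network call; not executed below)."""
    with urllib.request.urlopen(openalex_work_url(doi), timeout=timeout) as resp:
        return json.load(resp)


def extract_topics(work):
    """Return (display_name, score) pairs from a work's topics list."""
    return [(t.get("display_name"), t.get("score")) for t in work.get("topics") or []]


# Trimmed, invented response used instead of a live request.
sample_work = {
    "doi": "https://doi.org/10.1234/example",
    "topics": [
        {"display_name": "Topic Modeling", "score": 0.98},
        {"display_name": "Semantic Web", "score": 0.91},
    ],
}

print(openalex_work_url("10.1234/example"))
print(extract_topics(sample_work))
```

Keeping the network call in its own function makes the parsing logic easy to test against saved responses.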

3. Creating an RDF Knowledge Graph

The enriched metadata is then transformed into an RDF knowledge graph, enabling semantic analysis and integration with other linked data systems.
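
To show what the transformation produces, here is a dependency-free sketch that serialises one paper record as N-Triples. The project's requirements list rdflib, which would normally handle this; the schema.org vocabulary, the use of the DOI URL as subject IRI, and the record layout are assumptions made for the example, not the mapping used in json_to_rdf.py.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "https://schema.org/"  # assumed vocabulary for this sketch


def nt_iri(iri):
    """Wrap an IRI in angle brackets as N-Triples requires."""
    return f"<{iri}>"


def nt_literal(value):
    """Quote and escape a plain string literal for N-Triples."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_to_ntriples(paper):
    """Turn one paper record (dict) into N-Triples text."""
    subject = nt_iri("https://doi.org/" + paper["doi"])
    lines = [
        f"{subject} {nt_iri(RDF_TYPE)} {nt_iri(SCHEMA + 'ScholarlyArticle')} .",
        f"{subject} {nt_iri(SCHEMA + 'name')} {nt_literal(paper['title'])} .",
    ]
    # Authors and topics are kept as literals here; a fuller graph would
    # give them their own IRIs so they can be linked to external resources.
    for author in paper.get("authors", []):
        lines.append(f"{subject} {nt_iri(SCHEMA + 'author')} {nt_literal(author)} .")
    for topic in paper.get("topics", []):
        lines.append(f"{subject} {nt_iri(SCHEMA + 'about')} {nt_literal(topic)} .")
    return "\n".join(lines) + "\n"


paper = {
    "doi": "10.1234/example",
    "title": 'An "Example" Paper',
    "authors": ["Ada Lovelace"],
    "topics": ["Topic Modeling"],
}
ntriples = paper_to_ntriples(paper)
print(ntriples)
```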

4. Enriching RDF with Wikidata Links

Entities within the graph—such as authors, institutions, or topics—are linked to corresponding Wikidata resources. This enhances the semantic richness and interoperability of the graph.
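
One plausible way to find candidate links is Wikidata's `wbsearchentities` API action, sketched below. Taking the first search hit as the match is a simplifying assumption (real linking usually needs disambiguation), and the helper names, User-Agent string, and sample response are illustrative rather than taken from wikidata_enrich.py.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(label, language="en", limit=5):
    """Build a wbsearchentities request URL for a label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def search_wikidata(label, timeout=10):
    """Run the search against Wikidata (network call; not executed below)."""
    request = urllib.request.Request(
        wikidata_search_url(label),
        headers={"User-Agent": "data-analysis-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def best_match(response):
    """Return the entity URI of the first search hit, or None if there is none."""
    hits = response.get("search") or []
    if not hits:
        return None
    hit = hits[0]
    return hit.get("concepturi") or "http://www.wikidata.org/entity/" + hit["id"]


# Trimmed sample response used instead of a live request.
sample_response = {
    "search": [
        {
            "id": "Q42",
            "label": "Douglas Adams",
            "concepturi": "http://www.wikidata.org/entity/Q42",
        }
    ]
}

print(wikidata_search_url("Douglas Adams"))
print(best_match(sample_response))
```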

5. Running Topic Modeling on Abstracts

Topic modeling techniques (LDA and BERTopic) are applied to the abstracts to identify common research themes and emerging trends across the papers.

6. Calculating Paper Similarities

Transformer-based embeddings are used to compute similarity scores between papers, helping uncover related or thematically similar research.

7. Extracting Named Entities from Acknowledgements

The acknowledgements sections are processed using Named Entity Recognition (NER) to identify funding organizations and other mentioned entities.

8. Documenting Data Provenance

A PROV-compliant provenance document is generated to describe the full workflow, ensuring transparency and traceability of all data transformations.
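
The idea can be sketched with a hand-built dictionary in the shape of the PROV-JSON serialization, recording which entities each activity used and generated. The project's requirements list the `prov` package, which is not used here; the `ex:` identifiers and helper names are hypothetical and do not describe what create_prov.py emits.

```python
import json


def new_document(prefix, namespace):
    """Create an empty PROV-JSON-style document with one namespace prefix."""
    return {
        "prefix": {prefix: namespace},
        "entity": {},
        "activity": {},
        "used": {},
        "wasGeneratedBy": {},
    }


def record_step(doc, activity_id, inputs, outputs):
    """Record one pipeline step: the activity, what it used, what it generated."""
    doc["activity"][activity_id] = {}
    for entity_id in list(inputs) + list(outputs):
        doc["entity"].setdefault(entity_id, {})
    for entity_id in inputs:
        key = f"_:u{len(doc['used']) + 1}"
        doc["used"][key] = {"prov:activity": activity_id, "prov:entity": entity_id}
    for entity_id in outputs:
        key = f"_:g{len(doc['wasGeneratedBy']) + 1}"
        doc["wasGeneratedBy"][key] = {
            "prov:entity": entity_id,
            "prov:activity": activity_id,
        }


# Hypothetical identifiers for two of the pipeline steps.
doc = new_document("ex", "http://example.org/")
record_step(doc, "ex:extract_metadata", ["ex:papers_pdf"], ["ex:metadata_json"])
record_step(doc, "ex:build_graph", ["ex:metadata_json"], ["ex:knowledge_graph"])

print(json.dumps(doc, indent=2))
```

Because each output of one step reappears as an input of the next, the chain of `used` and `wasGeneratedBy` records is what makes a result traceable back to the source PDFs.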

9. Packaging as a Research Object

Finally, all outputs are bundled into a standardized Research Object Crate (RO-Crate), making the results portable, interoperable, and reusable within the scientific ecosystem.
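
For orientation, a minimal RO-Crate 1.1 metadata file has a metadata descriptor, a root dataset, and one entry per packaged file. The sketch below builds that structure with the standard library only; the output file paths are hypothetical, and the structure actually written by create_ro_crate.py may contain more detail.

```python
import json


def build_ro_crate(name, description, date_published, license_url, files):
    """Build RO-Crate 1.1 metadata; `files` maps relative path -> display name."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": path} for path in files],
    }
    file_entities = [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


crate = build_ro_crate(
    name="Data Analysis",
    description="Outputs of the research paper analysis pipeline",
    date_published="2025-02-10",
    license_url="https://spdx.org/licenses/MIT",
    files={
        # Hypothetical output paths, for illustration only.
        "output/knowledge_graph.ttl": "RDF knowledge graph",
        "output/provenance.json": "PROV document",
    },
)

# The result would be written to ro-crate-metadata.json in the crate root.
print(json.dumps(crate, indent=2))
```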

Component Details

Metadata Extraction

Uses GROBID to extract structured metadata from PDFs, including titles, authors, abstracts, and references.

Topic Modeling

Two approaches:

  • LDA (Latent Dirichlet Allocation): Traditional statistical approach
  • BERTopic: Transformer-based approach using HuggingFace models

Similarity Analysis

Uses sentence-transformers to create embeddings for each paper abstract, then computes cosine similarity to identify related papers.
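
The cosine similarity part can be illustrated without any model. In the sketch below, short hand-written vectors stand in for the sentence-transformers embeddings, and the paper identifiers and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(embeddings, top_k=1):
    """For each paper, rank the other papers by cosine similarity and keep top_k."""
    ranked = {}
    for paper_id, vector in embeddings.items():
        scored = [
            (other_id, cosine_similarity(vector, other_vector))
            for other_id, other_vector in embeddings.items()
            if other_id != paper_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        ranked[paper_id] = scored[:top_k]
    return ranked


# Toy 3-dimensional vectors standing in for real abstract embeddings.
toy_embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}

for paper_id, neighbours in most_similar(toy_embeddings).items():
    print(paper_id, neighbours)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.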

NER Analysis

Applies Hugging Face's NER models to extract funding organizations and other entities from acknowledgements sections.

Knowledge Graph Creation

Converts extracted information into RDF format, linking papers with:

  • Authors
  • Topics
  • Publication details
  • External resources (Wikidata)

Query the Knowledge Graph

Start the SPARQL endpoint:

    cd api
    python sparql_endpoint.py

Visit http://localhost:5000 in your browser to query the knowledge graph.

Example queries:

  • Find papers by topic
  • Identify collaborating authors
  • Discover funding patterns

You can also use the API:

    cd api
    python api.py

Visit http://localhost:5001/api/papers to access the REST API.

Research Object & Provenance

The pipeline creates:

  1. PROV Documentation: Captures the entire analysis workflow with detailed provenance information
  2. Research Object Crate: Packages all research outputs following RO-Crate 1.1 standards

Directory Structure

    research-paper-analysis/
    ├── api/
    │   ├── api.py                          # REST API for data access
    │   └── sparql_endpoint.py              # SPARQL query interface
    ├── app/
    │   ├── enrich/
    │   │   ├── json_to_rdf.py              # Convert JSON to RDF
    │   │   ├── openalex_query.py           # Query OpenAlex API
    │   │   └── wikidata_enrich.py          # Wikidata entity linking
    │   ├── grobid/
    │   │   ├── grobid_client.py            # Client for GROBID service
    │   │   └── metadata_extractor.py       # Extract metadata from GROBID output
    │   ├── ner/
    │   │   └── extract_acknowledgements.py # Extract entities from acknowledgements
    │   ├── provenance/
    │   │   └── create_prov.py              # Create PROV documentation
    │   ├── ro_create/
    │   │   └── create_ro_crate.py          # Create RO-Crate metadata
    │   ├── similarity/
    │   │   └── paper_similarity.py         # Calculate paper similarities
    │   ├── topic_modeling/
    │   │   └── abstract_topics.py          # Topic modeling on abstracts
    │   ├── main.py                         # main file that runs all the scripts
    │   └── rationale.md
    ├── test/
    │   └── test_sparql.md                  # Example queries to test sparql endpoint
    ├── data/                               # Raw PDF papers
    └── docs/
        ├── index.md                        # Index of the structure of the project
        ├── install.md                      # Instructions to correctly install the project
        ├── requirements.txt                # List of programs that have to be installed
        └── usage.md                        # Explanation of how to use the app

Model Decisions

  • GROBID: Specialized for scientific document processing
  • Transformer Models: State-of-the-art for text understanding
  • BERTopic: Combines transformers with topic modeling
  • RDF: Standard format for knowledge graphs with reasoning capabilities
  • PROV & RO-Crate: Follow FAIR data principles for reproducibility

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Rosado"
  given-names: "Alejandro"
  orcid: "https://orcid.org/0009-0003-5984-2579"
- family-names: "Ramirez"
  given-names: "Marc"
  orcid: "https://orcid.org/0009-0001-7576-1424"
- family-names: "Alonso"
  given-names: "Luís"
  orcid: "https://orcid.org/0009-0001-1648-3647"
title: "Data Analysis"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2025-02-10
url: "https://github.com/marcupm/data-analysis"

GitHub Events

Total
  • Release event: 1
  • Public event: 1
  • Push event: 14
  • Create event: 1
Last Year
  • Release event: 1
  • Public event: 1
  • Push event: 14
  • Create event: 1

Dependencies

Dockerfile docker
  • python 3.9.13 build
docker-compose.yml docker
  • lfoppiano/grobid 0.8.1
docs/requirements.txt pypi
  • Flask ==3.1.1
  • Jinja2 ==3.1.6
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Werkzeug ==3.1.3
  • bertopic ==0.17.0
  • blinker ==1.9.0
  • certifi ==2025.4.26
  • charset-normalizer ==3.4.2
  • click ==8.1.8
  • colorama ==0.4.6
  • contourpy ==1.3.0
  • cycler ==0.12.1
  • filelock ==3.18.0
  • fonttools ==4.58.0
  • fsspec ==2025.3.2
  • hdbscan ==0.8.40
  • huggingface-hub ==0.31.2
  • idna ==3.10
  • importlib_metadata ==8.7.0
  • importlib_resources ==6.5.2
  • isodate ==0.6.1
  • itsdangerous ==2.2.0
  • joblib ==1.5.0
  • kiwisolver ==1.4.7
  • llvmlite ==0.43.0
  • lxml ==5.4.0
  • matplotlib ==3.9.4
  • mpmath ==1.3.0
  • narwhals ==1.39.0
  • networkx ==3.2.1
  • numba ==0.60.0
  • numpy ==2.0.2
  • packaging ==25.0
  • pandas ==2.2.3
  • pillow ==11.2.1
  • plotly ==6.1.0
  • prov ==2.0.1
  • pydot ==4.0.0
  • pynndescent ==0.5.13
  • pyparsing ==3.2.3
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.2
  • rdflib ==6.3.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • safetensors ==0.5.3
  • scikit-learn ==1.6.1
  • scipy ==1.13.1
  • sentence-transformers ==4.1.0
  • six ==1.17.0
  • sympy ==1.14.0
  • threadpoolctl ==3.6.0
  • tokenizers ==0.21.1
  • torch ==2.7.0
  • tqdm ==4.67.1
  • transformers ==4.51.3
  • typing_extensions ==4.13.2
  • tzdata ==2025.2
  • umap-learn ==0.5.7
  • urllib3 ==2.4.0
  • zipp ==3.21.0