data-analysis

https://github.com/marcupm/data-analysis

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: marcupm
License: mit
Language: Python
Default Branch: main
Size: 2.3 MB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files


Data Analysis

Overview

This project analyzes a corpus of research papers to extract topics, compute pairwise similarities, link the papers in a knowledge graph, and identify funding information. It uses Hugging Face models and follows established practices for research data management (PROV provenance and RO-Crate packaging).

Prerequisites
Installation
Pipeline Steps
Component Details
Query the Knowledge Graph
Research Object & Provenance
Directory Structure

Prerequisites

Docker and Docker Compose
Python 3.9+
Research papers in PDF format to be analyzed

Installation & Execution

Clone this repository:

```bash
git clone https://github.com/marcupm/data-analysis.git
cd data-analysis
```

Place your research papers in the data/ directory.

Start the service:

```bash
docker-compose up -d --build
```

Pipeline Steps

When app/main.py is executed, the pipeline runs the following steps automatically:

1. Extracting Metadata from PDFs

The pipeline begins by sending the PDF files to GROBID for processing. This extracts structured metadata such as title, authors, affiliations, DOI, and abstract for each paper.
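As a rough illustration of this step, the sketch below parses the TEI XML that GROBID returns for a paper header, using only the standard library. It assumes the response comes from GROBID's `/api/processHeaderDocument` endpoint and contains the usual TEI header elements. The `parse_tei_header` helper and the sample document are hypothetical and are not taken from this repository.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in GROBID output.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def parse_tei_header(tei_xml: str) -> dict:
    """Pull title, authors, DOI and abstract out of a TEI header (hypothetical helper)."""
    root = ET.fromstring(tei_xml.strip())
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI):
        names = [el.text for el in pers.findall("tei:forename", TEI)]
        names.append(pers.findtext("tei:surname", default="", namespaces=TEI))
        authors.append(" ".join(n.strip() for n in names if n and n.strip()))
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI)
    # Collapse whitespace across all nested paragraphs of the abstract.
    abstract = "" if abstract_el is None else " ".join("".join(abstract_el.itertext()).split())
    return {
        "title": root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI).strip(),
        "authors": authors,
        "doi": root.findtext(".//tei:idno[@type='DOI']", default=None, namespaces=TEI),
        "abstract": abstract,
    }


# Made-up TEI document that mimics the shape of a GROBID header response.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">A Made-Up Paper Title</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
          </analytic>
          <idno type="DOI">10.1234/example</idno>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>A short abstract.</p></abstract></profileDesc>
  </teiHeader>
</TEI>
"""

meta = parse_tei_header(SAMPLE_TEI)
print(meta)
```

In a real run the XML string would come from the GROBID container started by Docker Compose rather than from an inline sample.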

2. Enriching Metadata with OpenAlex Topics

Using the extracted DOIs, the pipeline queries the OpenAlex API to retrieve additional topical information. This helps classify each paper within broader research areas.
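A minimal sketch of such a lookup is shown below. It assumes the OpenAlex convention of addressing a work by appending its full DOI URL to `https://api.openalex.org/works/`, and it assumes the work record carries a `topics` list with `display_name`, `score`, and `field` entries. The helper names and the sample record are hypothetical.

```python
import json
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_url_for_doi(doi: str) -> str:
    """Build the OpenAlex lookup URL for a DOI given in bare, doi: or URL form."""
    bare = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if bare.lower().startswith(prefix):
            bare = bare[len(prefix):]
            break
    return f"{OPENALEX_WORKS}https://doi.org/{bare}"


def topic_names(work: dict) -> list:
    """Return (topic, field) pairs from a work record, highest score first."""
    topics = sorted(work.get("topics", []), key=lambda t: t.get("score", 0), reverse=True)
    return [(t["display_name"], t.get("field", {}).get("display_name")) for t in topics]


def fetch_work(doi: str, timeout: float = 10.0) -> dict:
    """Fetch one work record. Needs network access, so it is not called in this sketch."""
    with urllib.request.urlopen(openalex_url_for_doi(doi), timeout=timeout) as resp:
        return json.load(resp)


# Made-up record that only mimics the assumed response shape.
sample_work = {
    "topics": [
        {"display_name": "Topic B", "score": 0.61, "field": {"display_name": "Computer Science"}},
        {"display_name": "Topic A", "score": 0.97, "field": {"display_name": "Computer Science"}},
    ]
}

lookup_url = openalex_url_for_doi("doi:10.1234/example")
ranked_topics = topic_names(sample_work)
print(lookup_url)
print(ranked_topics)
```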

3. Creating an RDF Knowledge Graph

The enriched metadata is then transformed into an RDF knowledge graph, enabling semantic analysis and integration with other linked data systems.
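To make the shape of the resulting triples concrete, the sketch below turns one metadata record into N-Triples lines with the standard library only. The project lists rdflib in its requirements, which would be the more natural tool; the schema.org vocabulary, the use of the DOI URL as the paper IRI, and the representation of authors and topics as plain literals are all assumptions made for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "https://schema.org/"  # assumed vocabulary, not confirmed by the project


def nt_literal(text: str) -> str:
    """Escape a Python string as an N-Triples string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_to_ntriples(paper: dict) -> list:
    """Describe one paper as a list of N-Triples lines (hypothetical helper)."""
    subject = f"<https://doi.org/{paper['doi']}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{SCHEMA}ScholarlyArticle> .",
        f"{subject} <{SCHEMA}name> {nt_literal(paper['title'])} .",
    ]
    for author in paper.get("authors", []):
        lines.append(f"{subject} <{SCHEMA}author> {nt_literal(author)} .")
    for topic in paper.get("topics", []):
        lines.append(f"{subject} <{SCHEMA}about> {nt_literal(topic)} .")
    return lines


# Made-up record with a placeholder DOI.
paper = {
    "doi": "10.1234/example",
    "title": "A Made-Up Paper Title",
    "authors": ["Ada Lovelace"],
    "topics": ["Knowledge graphs"],
}
triples = paper_to_ntriples(paper)
print("\n".join(triples))
```

A fuller graph would model authors and topics as resources with their own IRIs so that they can be linked to external datasets.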

4. Enriching RDF with Wikidata Links

Entities within the graph—such as authors, institutions, or topics—are linked to corresponding Wikidata resources. This enhances the semantic richness and interoperability of the graph.
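One plausible way to find candidate links is the `wbsearchentities` action of the Wikidata API. The sketch below builds the search URL and picks the first hit whose label matches exactly. The matching rule and the helper names are assumptions for illustration, and the sample response is hand-written rather than fetched.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(label: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for an entity label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def best_match(response: dict, label: str):
    """Return the entity IRI of the first hit with an exact (case-insensitive) label, else None."""
    for hit in response.get("search", []):
        if hit.get("label", "").casefold() == label.casefold():
            return hit.get("concepturi")
    return None


# Hand-written response in the shape returned by wbsearchentities.
sample_response = {
    "search": [
        {"id": "Q42", "label": "Douglas Adams", "concepturi": "http://www.wikidata.org/entity/Q42"}
    ]
}

search_url = wikidata_search_url("Douglas Adams")
entity_iri = best_match(sample_response, "douglas adams")
print(search_url)
print(entity_iri)
```

Exact label matching is deliberately conservative: it misses some valid links but avoids attaching a wrong entity to an author or institution with a common name.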

5. Running Topic Modeling on Abstracts

Topic modeling techniques (LDA and BERTopic) are applied to the abstracts to identify common research themes and emerging trends across the papers.
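To show what LDA does with a set of abstracts, here is a toy collapsed Gibbs sampler written in plain Python. It is not the project's implementation, which presumably relies on the libraries in its requirements. The function name, hyperparameters, and tiny corpus are made up for illustration, and BERTopic is not covered by this sketch.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k
    assign = []                                        # current topic of every token

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # take the token out of the counts
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)  # put it back under the sampled topic

    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return top_words, doc_topic


# Tiny made-up corpus standing in for tokenized abstracts.
DOCS = [
    "graph rdf sparql graph ontology".split(),
    "rdf ontology sparql graph".split(),
    "embedding transformer similarity embedding".split(),
    "transformer similarity embedding model".split(),
]
top_words, doc_topic = lda_gibbs(DOCS)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

Each row of `doc_topic` counts how many tokens of a document were assigned to each topic, which gives a rough per-paper topic mixture.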

6. Calculating Paper Similarities

Transformer-based embeddings are used to compute similarity scores between papers, helping uncover related or thematically similar research.
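The similarity score itself can be sketched independently of the embedding model. The example below computes cosine similarity for every pair of papers and ranks the pairs. The three-dimensional vectors are made-up stand-ins for real sentence-transformers embeddings, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_pairs(embeddings):
    """Score every unordered pair of papers and sort from most to least similar."""
    pairs = [
        (a, b, cosine(embeddings[a], embeddings[b]))
        for a, b in combinations(embeddings, 2)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors; real abstract embeddings have hundreds of dimensions.
toy = {
    "paper_a": [1.0, 0.0, 1.0],
    "paper_b": [1.0, 0.1, 0.9],
    "paper_c": [0.0, 1.0, 0.0],
}
ranked = rank_pairs(toy)
for a, b, score in ranked:
    print(f"{a} ~ {b}: {score:.3f}")
```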

7. Extracting Named Entities from Acknowledgements

The acknowledgements sections are processed using Named Entity Recognition (NER) to identify funding organizations and other mentioned entities.
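The post-processing side of this step might look like the sketch below. It assumes the entities arrive in the format produced by a Hugging Face token-classification pipeline with grouped entities (dictionaries with `entity_group`, `score`, and `word` keys). The score threshold, the helper name, and the sample entities are assumptions; no model is run here.

```python
def funding_organizations(entities, min_score=0.80):
    """Keep confident ORG entities, de-duplicated in order of first appearance."""
    seen = {}
    for ent in entities:
        if ent.get("entity_group") == "ORG" and ent.get("score", 0.0) >= min_score:
            name = " ".join(ent["word"].split())   # normalize internal whitespace
            seen.setdefault(name.casefold(), name)  # first spelling wins
    return list(seen.values())


# Hand-written entities in the assumed pipeline output format.
sample_entities = [
    {"entity_group": "ORG", "score": 0.99, "word": "European Research Council"},
    {"entity_group": "LOC", "score": 0.98, "word": "Madrid"},
    {"entity_group": "ORG", "score": 0.91, "word": "european research council"},
    {"entity_group": "ORG", "score": 0.42, "word": "Acme Foundation"},
]
funders = funding_organizations(sample_entities)
print(funders)
```

Not every organization named in an acknowledgements section is a funder, so a real pipeline would likely need an additional filter, such as checking for nearby phrases like "funded by" or "supported by".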

8. Documenting Data Provenance

A PROV-compliant provenance document is generated to describe the full workflow, ensuring transparency and traceability of all data transformations.
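As an illustration of what such a document records, the sketch below describes a single pipeline step in the PROV-JSON layout using plain dictionaries. The project lists the `prov` package in its requirements and may well build its document differently; the `ex:` prefix, the identifiers, and the helper name are hypothetical.

```python
import json


def prov_for_step(step, inputs, outputs, agent="ex:pipeline"):
    """Describe one pipeline step as a PROV-JSON style dictionary (hypothetical helper)."""
    activity = f"ex:{step}"
    doc = {
        "prefix": {"ex": "https://example.org/"},
        "agent": {agent: {}},
        "activity": {activity: {}},
        "entity": {f"ex:{name}": {} for name in [*inputs, *outputs]},
        "used": {},
        "wasGeneratedBy": {},
        "wasAssociatedWith": {"_:assoc1": {"prov:activity": activity, "prov:agent": agent}},
    }
    # The activity used each input and generated each output.
    for i, name in enumerate(inputs, 1):
        doc["used"][f"_:u{i}"] = {"prov:activity": activity, "prov:entity": f"ex:{name}"}
    for i, name in enumerate(outputs, 1):
        doc["wasGeneratedBy"][f"_:g{i}"] = {"prov:entity": f"ex:{name}", "prov:activity": activity}
    return doc


prov_doc = prov_for_step("extract_metadata", inputs=["papers_pdf"], outputs=["metadata_json"])
print(json.dumps(prov_doc, indent=2))
```

Chaining such records, with the outputs of one step reused as the inputs of the next, yields a trace from the raw PDFs to the final knowledge graph.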

9. Packaging as a Research Object

Finally, all outputs are bundled into a standardized Research Object Crate (RO-Crate), making the results portable, interoperable, and reusable within the scientific ecosystem.
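A minimal `ro-crate-metadata.json` following the RO-Crate 1.1 layout can be sketched as below: a metadata descriptor, a root dataset, and one entry per packaged file. The output file paths and the helper name are hypothetical; the MIT license and the release date are taken from the repository information on this page.

```python
import json


def ro_crate_metadata(name, description, date_published, files,
                      license_id="https://spdx.org/licenses/MIT"):
    """Build a minimal RO-Crate 1.1 metadata document (hypothetical helper)."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_id},
        "hasPart": [{"@id": path} for path in files],
    }
    parts = [{"@id": path, "@type": "File", "name": label} for path, label in files.items()]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *parts],
    }


crate = ro_crate_metadata(
    name="Data Analysis",
    description="Outputs of the research paper analysis pipeline",
    date_published="2025-02-10",
    files={  # hypothetical output paths
        "outputs/knowledge_graph.ttl": "RDF knowledge graph",
        "outputs/provenance.json": "PROV document",
    },
)
print(json.dumps(crate, indent=2))
```

Writing this dictionary to `ro-crate-metadata.json` at the top of the output directory is what turns that directory into a crate.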

Component Details

Metadata Extraction

Uses GROBID to extract structured metadata from PDFs, including titles, authors, abstracts, and references.

Topic Modeling

Two approaches:

- LDA (Latent Dirichlet Allocation): Traditional statistical approach
- BERTopic: Transformer-based approach using HuggingFace models

Similarity Analysis

Uses sentence-transformers to create embeddings for each paper abstract, then computes cosine similarity to identify related papers.

NER Analysis

Applies Hugging Face's NER models to extract funding organizations and other entities from acknowledgements sections.

Knowledge Graph Creation

Converts extracted information into RDF format, linking papers with:

- Authors
- Topics
- Publication details
- External resources (Wikidata)

Query the Knowledge Graph

Start the SPARQL endpoint:

```bash
cd api
python sparql_endpoint.py
```

Visit http://localhost:5000 in your browser to query the knowledge graph.

Example queries:

- Find papers by topic
- Identify collaborating authors
- Discover funding patterns
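The first two kinds of query could be written roughly as follows. The graph's actual vocabulary is not documented here, so the schema.org terms (`schema:ScholarlyArticle`, `schema:name`, `schema:about`, `schema:author`) and the helper names are assumptions; the queries in test/test_sparql.md are the authoritative examples.

```python
PREFIXES = "PREFIX schema: <https://schema.org/>\n"  # assumed vocabulary


def sparql_literal(text: str) -> str:
    """Quote a Python string as a SPARQL string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def papers_by_topic(topic: str) -> str:
    """Query for papers tagged with a given topic label."""
    return PREFIXES + (
        "SELECT ?paper ?title WHERE {\n"
        "  ?paper a schema:ScholarlyArticle ;\n"
        "         schema:name ?title ;\n"
        f"         schema:about {sparql_literal(topic)} .\n"
        "}\n"
    )


def collaborating_authors() -> str:
    """Query for pairs of distinct authors who share at least one paper."""
    return PREFIXES + (
        "SELECT DISTINCT ?a ?b WHERE {\n"
        "  ?paper schema:author ?a, ?b .\n"
        "  FILTER (STR(?a) < STR(?b))\n"
        "}\n"
    )


topic_query = papers_by_topic("knowledge graphs")
coauthor_query = collaborating_authors()
print(topic_query)
print(coauthor_query)
```

The `FILTER` keeps each author pair once by imposing an order on the two names, which also excludes an author paired with themselves.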

You can also use the API:

```bash
cd api
python api.py
```

Visit http://localhost:5001/api/papers to access the REST API.

Research Object & Provenance

The pipeline creates:

PROV Documentation: Captures the entire analysis workflow with detailed provenance information
Research Object Crate: Packages all research outputs following RO-Crate 1.1 standards

Directory Structure

```
research-paper-analysis/
├── api/
│   ├── api.py                          # REST API for data access
│   └── sparql_endpoint.py              # SPARQL query interface
├── app/
│   ├── enrich/
│   │   ├── json_to_rdf.py              # Convert JSON to RDF
│   │   ├── openalex_query.py           # Query OpenAlex API
│   │   └── wikidata_enrich.py          # Wikidata entity linking
│   ├── grobid/
│   │   ├── grobid_client.py            # Client for GROBID service
│   │   └── metadata_extractor.py       # Extract metadata from GROBID output
│   ├── ner/
│   │   └── extract_acknowledgements.py # Extract entities from acknowledgements
│   ├── provenance/
│   │   └── create_prov.py              # Create PROV documentation
│   ├── ro_create/
│   │   └── create_ro_crate.py          # Create RO-Crate metadata
│   ├── similarity/
│   │   └── paper_similarity.py         # Calculate paper similarities
│   ├── topic_modeling/
│   │   └── abstract_topics.py          # Topic modeling on abstracts
│   ├── main.py                         # main file that runs all the scripts
│   └── rationale.md
├── test/
│   └── test_sparql.md                  # Example queries to test sparql endpoint
├── data/                               # Raw PDF papers
└── docs/
    ├── index.md                        # Index of the structure of the project
    ├── install.md                      # Instructions to correctly install the project
    ├── requirements.txt                # List of programs that have to be installed
    └── usage.md                        # Explanation of how to use the app
```

Model Decisions

GROBID: Specialized for scientific document processing
Transformer Models: State-of-the-art for text understanding
BERTopic: Combines transformers with topic modeling
RDF: Standard format for knowledge graphs with reasoning capabilities
PROV & RO-Crate: Follow FAIR data principles for reproducibility

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Rosado"
  given-names: "Alejandro"
  orcid: "https://orcid.org/0009-0003-5984-2579"
- family-names: "Ramirez"
  given-names: "Marc"
  orcid: "https://orcid.org/0009-0001-7576-1424"
- family-names: "Alonso"
  given-names: "Luís"
  orcid: "https://orcid.org/0009-0001-1648-3647"
title: "Data Analysis"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2025-02-10
url: "https://github.com/marcupm/data-analysis"

GitHub Events

Total

Release event: 1
Public event: 1
Push event: 14
Create event: 1

Last Year

Release event: 1
Public event: 1
Push event: 14
Create event: 1

Dependencies

Dockerfile docker

python 3.9.13 build

docker-compose.yml docker

lfoppiano/grobid 0.8.1

docs/requirements.txt pypi

Flask ==3.1.1
Jinja2 ==3.1.6
MarkupSafe ==3.0.2
PyYAML ==6.0.2
Werkzeug ==3.1.3
bertopic ==0.17.0
blinker ==1.9.0
certifi ==2025.4.26
charset-normalizer ==3.4.2
click ==8.1.8
colorama ==0.4.6
contourpy ==1.3.0
cycler ==0.12.1
filelock ==3.18.0
fonttools ==4.58.0
fsspec ==2025.3.2
hdbscan ==0.8.40
huggingface-hub ==0.31.2
idna ==3.10
importlib_metadata ==8.7.0
importlib_resources ==6.5.2
isodate ==0.6.1
itsdangerous ==2.2.0
joblib ==1.5.0
kiwisolver ==1.4.7
llvmlite ==0.43.0
lxml ==5.4.0
matplotlib ==3.9.4
mpmath ==1.3.0
narwhals ==1.39.0
networkx ==3.2.1
numba ==0.60.0
numpy ==2.0.2
packaging ==25.0
pandas ==2.2.3
pillow ==11.2.1
plotly ==6.1.0
prov ==2.0.1
pydot ==4.0.0
pynndescent ==0.5.13
pyparsing ==3.2.3
python-dateutil ==2.9.0.post0
pytz ==2025.2
rdflib ==6.3.2
regex ==2024.11.6
requests ==2.32.3
safetensors ==0.5.3
scikit-learn ==1.6.1
scipy ==1.13.1
sentence-transformers ==4.1.0
six ==1.17.0
sympy ==1.14.0
threadpoolctl ==3.6.0
tokenizers ==0.21.1
torch ==2.7.0
tqdm ==4.67.1
transformers ==4.51.3
typing_extensions ==4.13.2
tzdata ==2025.2
umap-learn ==0.5.7
urllib3 ==2.4.0
zipp ==3.21.0