data-analysis
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: marcupm
- License: mit
- Language: Python
- Default Branch: main
- Size: 2.3 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Data Analysis
Overview
This project analyzes a corpus of research papers to extract topics, compute similarities between papers, link the results in a knowledge graph, and identify funding information. It uses Hugging Face models and follows research data management practices by documenting provenance with PROV and packaging the outputs as an RO-Crate.
Table of Contents
- Prerequisites
- Installation
- Pipeline Steps
- Component Details
- Query the Knowledge Graph
- Research Object & Provenance
- Directory Structure
Prerequisites
- Docker and Docker Compose
- Python 3.9+
- Research papers in PDF format to be analyzed
Installation & Execution
Clone this repository:
bash
git clone https://github.com/marcupm/data-analysis.git
cd data-analysis
Place your research papers in the data/ directory.
Start the service:
bash
docker-compose up -d --build
Pipeline Steps
When main.py is executed, the pipeline runs the following steps automatically:
1. Extracting Metadata from PDFs
The pipeline begins by sending the PDF files to GROBID for processing. This extracts structured metadata such as title, authors, affiliations, DOI, and abstract for each paper.
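As a rough illustration of this step, the sketch below posts a PDF to a GROBID server and reads a few fields out of the TEI XML that GROBID returns. The server address (`http://localhost:8070`, GROBID's default port), the choice of the `processHeaderDocument` endpoint, and the helper names `fetch_header_tei` and `parse_tei_header` are assumptions made for the example; the project's own `grobid_client.py` and `metadata_extractor.py` may work differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def fetch_header_tei(pdf_path, grobid_url="http://localhost:8070"):
    """POST a PDF to GROBID and return the header as a TEI XML string."""
    import requests  # third-party; deferred so the parser below works without it

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            f"{grobid_url}/api/processHeaderDocument",
            files={"input": pdf},
            headers={"Accept": "application/xml"},
            timeout=120,
        )
    response.raise_for_status()
    return response.text


def _text(element):
    """Return the whitespace-normalised text of an element, or None."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split()) or None


def parse_tei_header(tei_xml):
    """Extract title, authors, DOI and abstract from a GROBID TEI document."""
    root = ET.fromstring(tei_xml)
    authors = []
    # Restrict to sourceDesc so authors of cited references are not picked up.
    for pers in root.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        forenames = [_text(f) for f in pers.findall("tei:forename", TEI_NS)]
        surname = _text(pers.find("tei:surname", TEI_NS))
        authors.append(" ".join(p for p in forenames + [surname] if p))
    return {
        "title": _text(root.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "authors": authors,
        "doi": _text(root.find(".//tei:teiHeader//tei:idno[@type='DOI']", TEI_NS)),
        "abstract": _text(root.find(".//tei:profileDesc/tei:abstract", TEI_NS)),
    }
```

Calling `parse_tei_header(fetch_header_tei("data/paper.pdf"))` (with a hypothetical file name) would return a plain dictionary that later steps can consume. Fields GROBID could not find come back as `None` or an empty list.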
2. Enriching Metadata with OpenAlex Topics
Using the extracted DOIs, the pipeline queries the OpenAlex API to retrieve additional topical information. This helps classify each paper within broader research areas.
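The lookup can be sketched with the standard library alone. The example below builds the OpenAlex works URL for a DOI and reduces the returned record to (topic, field, score) tuples. The function names are hypothetical and the `mailto` parameter is optional; how `openalex_query.py` actually issues and post-processes the request is not documented here.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS = "https://api.openalex.org/works/"


def openalex_url_for_doi(doi):
    """Build the OpenAlex works URL for a DOI such as '10.1234/abcd'."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return OPENALEX_WORKS + "https://doi.org/" + urllib.parse.quote(doi, safe="/")


def extract_topics(work):
    """Return (topic, field, score) tuples from an OpenAlex work record."""
    topics = []
    for topic in work.get("topics") or []:
        field = (topic.get("field") or {}).get("display_name")
        topics.append((topic.get("display_name"), field, topic.get("score")))
    return topics


def fetch_topics(doi, mailto=None):
    """Query OpenAlex for one DOI and return its topics."""
    url = openalex_url_for_doi(doi)
    if mailto:  # lets OpenAlex identify the caller
        url += "?mailto=" + urllib.parse.quote(mailto)
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_topics(json.load(response))
```

Papers for which GROBID found no DOI cannot be looked up this way, so a real pipeline would need to skip them or fall back to a title search.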
3. Creating an RDF Knowledge Graph
The enriched metadata is then transformed into an RDF knowledge graph, enabling semantic analysis and integration with other linked data systems.
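To make the transformation concrete, here is a library-free sketch that writes one paper's metadata as N-Triples, a line-based RDF serialization. The `http://example.org/paper/` namespace, the choice of Dublin Core and BIBO terms, and the function names are assumptions for illustration only. The project lists rdflib among its dependencies, so `json_to_rdf.py` most likely builds the graph through that library and may use a different vocabulary.

```python
DC = "http://purl.org/dc/elements/1.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BIBO_ARTICLE = "http://purl.org/ontology/bibo/AcademicArticle"
BASE = "http://example.org/paper/"  # hypothetical namespace for paper IRIs


def literal(value):
    """Serialise a Python string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_to_ntriples(paper_id, metadata):
    """Return N-Triples for one paper; paper_id must be safe to use in an IRI."""
    subject = f"<{BASE}{paper_id}>"
    lines = [f"{subject} <{RDF_TYPE}> <{BIBO_ARTICLE}> ."]
    if metadata.get("title"):
        lines.append(f"{subject} <{DC}title> {literal(metadata['title'])} .")
    for author in metadata.get("authors", []):
        lines.append(f"{subject} <{DC}creator> {literal(author)} .")
    if metadata.get("doi"):
        lines.append(f"{subject} <{DC}identifier> {literal(metadata['doi'])} .")
    for topic in metadata.get("topics", []):
        lines.append(f"{subject} <{DC}subject> {literal(topic)} .")
    return "\n".join(lines) + "\n"


print(paper_to_ntriples("p1", {"title": "A Test Paper", "authors": ["Ada Lovelace"]}))
```

Authors and topics are written as plain literals here to keep the sketch short. Giving them their own IRIs is what makes the Wikidata linking in the next step possible.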
4. Enriching RDF with Wikidata Links
Entities within the graph—such as authors, institutions, or topics—are linked to corresponding Wikidata resources. This enhances the semantic richness and interoperability of the graph.
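One simple way to find candidate Wikidata resources is the `wbsearchentities` action of the Wikidata API, sketched below. The exact-label matching rule, the User-Agent string, and the function names are assumptions for the example; `wikidata_enrich.py` may use a different lookup strategy, such as SPARQL queries or fuzzy matching.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(label, language="en", limit=5):
    """Build a wbsearchentities request URL for a label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def best_match(search_response, label):
    """Return the URI of the first hit whose label equals the query, ignoring case."""
    for hit in search_response.get("search", []):
        if hit.get("label", "").casefold() == label.casefold():
            return hit.get("concepturi")
    return None


def link_entity(label):
    """Look a label up on Wikidata and return the matching entity URI, if any."""
    request = urllib.request.Request(
        wikidata_search_url(label),
        headers={"User-Agent": "data-analysis-example/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return best_match(json.load(response), label)
```

The returned URI could then be attached to the local entity with a property such as `owl:sameAs`. Exact label matching is deliberately conservative: it misses spelling variants but avoids linking an author to an unrelated person with a similar name.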
5. Running Topic Modeling on Abstracts
Topic modeling techniques (LDA and BERTopic) are applied to the abstracts to identify common research themes and emerging trends across the papers.
6. Calculating Paper Similarities
Transformer-based embeddings are used to compute similarity scores between papers, helping uncover related or thematically similar research.
7. Extracting Named Entities from Acknowledgements
The acknowledgements sections are processed using Named Entity Recognition (NER) to identify funding organizations and other mentioned entities.
8. Documenting Data Provenance
A PROV-compliant provenance document is generated to describe the full workflow, ensuring transparency and traceability of all data transformations.
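The shape of such a document can be illustrated with PROV-JSON, a JSON serialization of the PROV data model. The sketch below records which activity used and generated which entity. The step names, the `ex:` prefix, and the `build_prov` helper are hypothetical; the project lists the `prov` package among its dependencies and may build its document through that library instead.

```python
import json


def build_prov(steps, namespace="http://example.org/"):
    """Build a PROV-JSON document from (activity, inputs, outputs) tuples."""
    doc = {
        "prefix": {"ex": namespace},
        "entity": {},
        "activity": {},
        "used": {},
        "wasGeneratedBy": {},
    }
    for activity, inputs, outputs in steps:
        activity_id = f"ex:{activity}"
        doc["activity"][activity_id] = {}
        for name in inputs:
            doc["entity"].setdefault(f"ex:{name}", {})
            # Relations are keyed by blank-node style identifiers.
            doc["used"][f"_:u{len(doc['used']) + 1}"] = {
                "prov:activity": activity_id,
                "prov:entity": f"ex:{name}",
            }
        for name in outputs:
            doc["entity"].setdefault(f"ex:{name}", {})
            doc["wasGeneratedBy"][f"_:g{len(doc['wasGeneratedBy']) + 1}"] = {
                "prov:entity": f"ex:{name}",
                "prov:activity": activity_id,
            }
    return doc


steps = [
    ("grobid_extraction", ["pdfs"], ["metadata"]),
    ("rdf_conversion", ["metadata"], ["knowledge_graph"]),
]
print(json.dumps(build_prov(steps), indent=2))
```

Because `metadata` is generated by one activity and used by the next, the resulting document encodes the chain from raw PDFs to the knowledge graph.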
9. Packaging as a Research Object
Finally, all outputs are bundled into a standardized Research Object Crate (RO-Crate), making the results portable, interoperable, and reusable within the scientific ecosystem.
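At its core, an RO-Crate is a directory containing a `ro-crate-metadata.json` file that describes the dataset and its parts. The sketch below builds a minimal RO-Crate 1.1 descriptor with the standard library. The file names, descriptions, and helper names are invented for the example, and `create_ro_crate.py` may record more properties (license, authors, and so on) than shown here.

```python
import json
from pathlib import Path


def build_ro_crate(name, description, date_published, files):
    """Return RO-Crate 1.1 metadata; `files` maps relative paths to descriptions."""
    graph = [
        {   # metadata descriptor: points at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset: the crate itself
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    graph += [
        {"@id": path, "@type": "File", "name": label}
        for path, label in files.items()
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_ro_crate(crate_dir, metadata):
    """Write the descriptor into the crate directory and return its path."""
    path = Path(crate_dir) / "ro-crate-metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path
```

A call such as `build_ro_crate("Data Analysis outputs", "Pipeline results", "2025-02-10", {"graph.ttl": "Knowledge graph"})` (with made-up values) yields a descriptor with three entries in `@graph`: the descriptor itself, the root dataset, and one file.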
Component Details
Metadata Extraction
Uses GROBID to extract structured metadata from PDFs, including titles, authors, abstracts, and references.
Topic Modeling
Two approaches:
- LDA (Latent Dirichlet Allocation): Traditional statistical approach
- BERTopic: Transformer-based approach using HuggingFace models
Similarity Analysis
Uses sentence-transformers to create embeddings for each paper abstract, then computes cosine similarity to identify related papers.
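The comparison step itself is small. The sketch below computes cosine similarity, cos(a, b) = a·b / (‖a‖‖b‖), over toy vectors and ranks the other papers for each paper. In the real pipeline the vectors would come from a sentence-transformers model; the paper ids, vectors, and function names here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(embeddings, top_k=1):
    """For each paper id, list the top_k other papers by cosine similarity."""
    result = {}
    for paper_id, vector in embeddings.items():
        scores = [
            (other_id, cosine_similarity(vector, other_vector))
            for other_id, other_vector in embeddings.items()
            if other_id != paper_id
        ]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        result[paper_id] = scores[:top_k]
    return result


toy_embeddings = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 3.0]}
print(most_similar(toy_embeddings))
```

Cosine similarity ignores vector length, which is why `a` and `b` above count as maximally similar even though their magnitudes differ. For a large corpus, a vectorized implementation would be preferable to these nested Python loops.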
NER Analysis
Applies Hugging Face's NER models to extract funding organizations and other entities from acknowledgements sections.
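As a minimal sketch of this step, the code below runs a Hugging Face token-classification pipeline and keeps the distinct organization names it finds. The model name `dslim/bert-base-NER`, the 0.8 score threshold, and both function names are assumptions for the example; the model that `extract_acknowledgements.py` actually loads is not stated in this README.

```python
def funding_organizations(entities, min_score=0.8):
    """Keep distinct ORG entities above a confidence threshold, in input order."""
    seen, organizations = set(), []
    for entity in entities:
        if entity.get("entity_group") != "ORG":
            continue
        if entity.get("score", 0.0) < min_score:
            continue
        name = entity["word"].strip()
        if name.casefold() not in seen:
            seen.add(name.casefold())
            organizations.append(name)
    return organizations


def extract_organizations(text, model_name="dslim/bert-base-NER"):
    """Run NER over an acknowledgements text and return organization names."""
    from transformers import pipeline  # heavy import, deferred until needed

    ner = pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )
    return funding_organizations(ner(text))
```

With `aggregation_strategy="simple"` the pipeline returns one dictionary per entity, with `entity_group`, `score`, and `word` keys, which is the shape `funding_organizations` expects. An organization mentioned in the acknowledgements is not necessarily a funder, so the result is best treated as a candidate list.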
Knowledge Graph Creation
Converts extracted information into RDF format, linking papers with:
- Authors
- Topics
- Publication details
- External resources (Wikidata)
Query the Knowledge Graph
Start the SPARQL endpoint:
bash
cd api
python sparql_endpoint.py
Visit http://localhost:5000 in your browser to query the knowledge graph.
Example queries:
- Find papers by topic
- Identify collaborating authors
- Discover funding patterns
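The first of these queries might look like the sketch below, which builds a SPARQL query string and sends it using the standard SPARQL protocol. The `dc:title` and `dc:subject` predicates, the `/sparql` path on the endpoint, and the function names are all assumptions; the vocabulary and routes actually used by `json_to_rdf.py` and `sparql_endpoint.py` may differ, so adjust them to match the generated graph.

```python
import json
import urllib.parse
import urllib.request


def papers_by_topic_query(topic):
    """Build a SPARQL query listing papers whose subject equals `topic`."""
    escaped = (
        topic.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?paper ?title WHERE {\n"
        "  ?paper dc:title ?title ;\n"
        "         dc:subject ?topic .\n"
        f'  FILTER(LCASE(STR(?topic)) = LCASE("{escaped}"))\n'
        "}\n"
        "ORDER BY ?title"
    )


def run_query(query, endpoint="http://localhost:5000/sparql"):
    """POST a query and return (paper, title) pairs from the JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    return [
        (row["paper"]["value"], row["title"]["value"])
        for row in results["results"]["bindings"]
    ]


print(papers_by_topic_query("Machine Learning"))
```

Escaping the topic before inserting it into the query keeps a stray quote in user input from breaking or altering the query.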
You can also use the API:
bash
cd api
python api.py
Visit http://localhost:5001/api/papers to access the REST API.
Research Object & Provenance
The pipeline creates:
- PROV Documentation: Captures the entire analysis workflow with detailed provenance information
- Research Object Crate: Packages all research outputs following RO-Crate 1.1 standards
Directory Structure
research-paper-analysis/
├── api/
│ ├── api.py # REST API for data access
│ └── sparql_endpoint.py # SPARQL query interface
├── app/
│ ├── enrich/
│ │ ├── json_to_rdf.py # Convert JSON to RDF
│ │ ├── openalex_query.py # Query OpenAlex API
│ │ └── wikidata_enrich.py # Wikidata entity linking
│ ├── grobid/
│ │ ├── grobid_client.py # Client for GROBID service
│ │ └── metadata_extractor.py # Extract metadata from GROBID output
│ ├── ner/
│ │ └── extract_acknowledgements.py # Extract entities from acknowledgements
│ ├── provenance/
│ │ └── create_prov.py # Create PROV documentation
│ ├── ro_create/
│ │ └── create_ro_crate.py # Create RO-Crate metadata
│ ├── similarity/
│ │ └── paper_similarity.py # Calculate paper similarities
│ ├── topic_modeling/
│ │ └── abstract_topics.py # Topic modeling on abstracts
│ ├── main.py # main file that runs all the scripts
│ └── rationale.md
├── test/
│ └── test_sparql.md # Example queries to test sparql endpoint
├── data/ # Raw PDF papers
└── docs/
  ├── index.md # Index of the structure of the project
  ├── install.md # Instructions to correctly install the project
  ├── requirements.txt # List of programs that have to be installed
  └── usage.md # Explanation of how to use the app
Model Decisions
- GROBID: Specialized for scientific document processing
- Transformer Models: State-of-the-art for text understanding
- BERTopic: Combines transformers with topic modeling
- RDF: Standard format for knowledge graphs with reasoning capabilities
- PROV & RO-Crate: Follow FAIR data principles for reproducibility
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Rosado"
    given-names: "Alejandro"
    orcid: "https://orcid.org/0009-0003-5984-2579"
  - family-names: "Ramirez"
    given-names: "Marc"
    orcid: "https://orcid.org/0009-0001-7576-1424"
  - family-names: "Alonso"
    given-names: "Luís"
    orcid: "https://orcid.org/0009-0001-1648-3647"
title: "Data Analysis"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2025-02-10
url: "https://github.com/marcupm/data-analysis"
GitHub Events
Total
- Release event: 1
- Public event: 1
- Push event: 14
- Create event: 1
Last Year
- Release event: 1
- Public event: 1
- Push event: 14
- Create event: 1
Dependencies
- python 3.9.13 build
- lfoppiano/grobid 0.8.1
- Flask ==3.1.1
- Jinja2 ==3.1.6
- MarkupSafe ==3.0.2
- PyYAML ==6.0.2
- Werkzeug ==3.1.3
- bertopic ==0.17.0
- blinker ==1.9.0
- certifi ==2025.4.26
- charset-normalizer ==3.4.2
- click ==8.1.8
- colorama ==0.4.6
- contourpy ==1.3.0
- cycler ==0.12.1
- filelock ==3.18.0
- fonttools ==4.58.0
- fsspec ==2025.3.2
- hdbscan ==0.8.40
- huggingface-hub ==0.31.2
- idna ==3.10
- importlib_metadata ==8.7.0
- importlib_resources ==6.5.2
- isodate ==0.6.1
- itsdangerous ==2.2.0
- joblib ==1.5.0
- kiwisolver ==1.4.7
- llvmlite ==0.43.0
- lxml ==5.4.0
- matplotlib ==3.9.4
- mpmath ==1.3.0
- narwhals ==1.39.0
- networkx ==3.2.1
- numba ==0.60.0
- numpy ==2.0.2
- packaging ==25.0
- pandas ==2.2.3
- pillow ==11.2.1
- plotly ==6.1.0
- prov ==2.0.1
- pydot ==4.0.0
- pynndescent ==0.5.13
- pyparsing ==3.2.3
- python-dateutil ==2.9.0.post0
- pytz ==2025.2
- rdflib ==6.3.2
- regex ==2024.11.6
- requests ==2.32.3
- safetensors ==0.5.3
- scikit-learn ==1.6.1
- scipy ==1.13.1
- sentence-transformers ==4.1.0
- six ==1.17.0
- sympy ==1.14.0
- threadpoolctl ==3.6.0
- tokenizers ==0.21.1
- torch ==2.7.0
- tqdm ==4.67.1
- transformers ==4.51.3
- typing_extensions ==4.13.2
- tzdata ==2025.2
- umap-learn ==0.5.7
- urllib3 ==2.4.0
- zipp ==3.21.0