mine-dd
MINE-DD: A Python tool for embedding, searching, and querying scientific papers using local LLMs.
Repository
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
MINE-DD
MINE-DD (Mining the past to protect against Diarrheal Disease in the future) is a collaborative research project between the eScience Center, the University of Amsterdam (UvA) and Amsterdam UMC. The project focuses on addressing the global health challenge of diarrheal disease in the context of climate change.
Description
MINE-DD is a Python package that leverages artificial intelligence to extract and synthesize insights about climate's impact on diarrheal diseases from scientific literature. It enables researchers to efficiently query and analyze large collections of academic papers that would be impractical to read manually. Built on the PaperQA2 framework, the package implements an advanced question-answering system that provides detailed, citation-backed responses, ensuring that every insight is directly traceable to its original source materials. The package:
- Takes a collection of scientific papers (PDFs)
- Processes them to create embeddings (vector representations)
- Allows users to query these papers with natural language questions
- Returns answers with citations and context from the relevant papers
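The retrieval idea behind these steps can be illustrated with a toy sketch. This is not MINE-DD's or PaperQA2's implementation: the chunk ids and the 3-dimensional vectors below are made up for illustration, and the cosine-similarity ranking is only one common way to compare embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; real embedding models produce far longer vectors.
chunks = {
    "paper_a:p3": [0.9, 0.1, 0.0],
    "paper_b:p7": [0.1, 0.8, 0.1],
    "paper_c:p2": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

# The best-matching chunks would be passed to the LLM as cited context.
print(top_k(query, chunks))
```

In a real run the vectors come from the embedding model and the selected chunks are what make each answer traceable to a source paper.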
Notes:
- MINE-DD uses Ollama models locally
- Default LLMs: ollama/llama3.2:1b (for laptop-friendly usage)
- Default embeddings: ollama/mxbai-embed-large:latest
- It uses Python >= 3.11
Installation
Requirements
- Python 3.11 or higher
- Ollama installed locally for running LLMs and embeddings
Installation Steps
- Clone the repository:
```console
git clone git@github.com:MINE-DD/MINE-DD.git
cd MINE-DD
```
- Install the package:
```console
# Standard installation
python -m pip install .
# Development installation
python -m pip install -e ".[dev]"
```
Testing
The project includes two types of tests:
- Standard Tests: These run in CI environments (GitHub Actions) and don't require Ollama or GPU access.
```console
# Run all standard tests
pytest
# Run specific test files
pytest tests/test_utils.py tests/test_cli.py
```
- Integration Tests: These test the full functionality including LLM queries with Ollama, requiring a local environment with Ollama running.
```console
# Enable integration tests by setting SKIP_OLLAMA_TESTS=False in tests/test_query_integration.py
# Then run:
pytest -m integration
```
Integration tests are automatically skipped in CI environments and by default are also skipped locally (to avoid unexpected failures). To run them, you need to:
- Ensure Ollama is running (`ollama serve`)
- Set `SKIP_OLLAMA_TESTS = False` in `tests/test_query_integration.py`
- Run with the integration marker: `pytest -m integration`
Usage
Before Using MINE-DD
Make sure Ollama is running in the background:
```console
ollama serve
```
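A quick way to confirm that the server is reachable before embedding or querying is sketched below. It assumes Ollama listens on its default address, `http://localhost:11434`; the helper name `ollama_is_running` is hypothetical and not part of MINE-DD.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama appears to be running.")
    else:
        print("Ollama is not reachable; start it with 'ollama serve'.")
```

If your setup uses a different host or port, pass it as `base_url`.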
Creating Paper Embeddings
Use the minedd embed command to create embeddings based on your document collection:
```console
minedd embed --paper_directory "/path/to/papers_minedd/" --embeddings_filename my-embeddings.pkl
```
Available Parameters
- `--embeddings_filename`: Name of the embeddings pickle file where the index is saved
- `--output_dir`: Directory to save the embeddings pkl (default: 'out')
- `--embedding_model`: Embedding model (default: 'ollama/mxbai-embed-large:latest')
- `--paper_directory`: Directory with paper files (default: 'data/')
- `--augment_existing`: If True, new documents are added to the existing pkl file provided. Otherwise the pkl file is created from scratch.
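For scripted workflows, the command can also be assembled and launched from Python. The sketch below is only an illustration: `build_embed_command` and `run_embed` are hypothetical helpers, not part of the package, and they pass only the flags listed above (`--augment_existing` is left out because its exact value syntax is not stated here).

```python
import shlex
import subprocess


def build_embed_command(paper_directory, embeddings_filename, output_dir="out",
                        embedding_model="ollama/mxbai-embed-large:latest"):
    """Build the argument list for 'minedd embed' from the documented flags."""
    return [
        "minedd", "embed",
        "--paper_directory", str(paper_directory),
        "--embeddings_filename", embeddings_filename,
        "--output_dir", str(output_dir),
        "--embedding_model", embedding_model,
    ]


def run_embed(*args, **kwargs):
    """Run the command; requires minedd to be installed and Ollama running."""
    return subprocess.run(build_embed_command(*args, **kwargs), check=True)


cmd = build_embed_command("data/", "my-embeddings.pkl")
# Print the shell-quoted command for inspection without executing it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with paths that contain spaces.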
Querying Papers
Use the minedd query command to ask questions about your document collection:
```console
minedd query --embeddings embeddings/papers_embeddings.pkl --questions_file questions.xlsx --output_dir results/
```
or for a single question:
```console
minedd query --embeddings embeddings/papers_embeddings.pkl --question "What is the relationship between climate change and diarrheal disease?"
```
Available Parameters
- `--embeddings`: Path to the embeddings pickle file (required)
- `--questions_file`: Path to Excel file with questions
- `--question`: Single question to ask
- `--llm`: LLM model to use (default: 'ollama/llama3.2:1b')
- `--embedding_model`: Embedding model (default: 'ollama/mxbai-embed-large:latest')
- `--paper_directory`: Directory with paper files (default: 'data/')
- `--output_dir`: Directory to save outputs (default: 'out')
- `--max_retries`: Retries for model loading failures (default: 2)
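The two query modes shown above (a single question or a questions file) can be wrapped in one small helper. This is a sketch under assumptions: `build_query_command` is a hypothetical name, and the rule that exactly one of the two inputs must be given is a choice made for this example, not documented behavior of the CLI.

```python
def build_query_command(embeddings, question=None, questions_file=None,
                        llm="ollama/llama3.2:1b", output_dir="out",
                        max_retries=2):
    """Build the argument list for 'minedd query' from the documented flags."""
    # Assumption of this sketch: supply a question or a file, not both.
    if (question is None) == (questions_file is None):
        raise ValueError("Provide exactly one of question or questions_file")
    cmd = [
        "minedd", "query",
        "--embeddings", str(embeddings),
        "--llm", llm,
        "--output_dir", str(output_dir),
        "--max_retries", str(max_retries),
    ]
    if question is not None:
        cmd += ["--question", question]
    else:
        cmd += ["--questions_file", str(questions_file)]
    return cmd


single = build_query_command(
    "embeddings/papers_embeddings.pkl",
    question="What is the relationship between climate change and diarrheal disease?",
)
batch = build_query_command(
    "embeddings/papers_embeddings.pkl",
    questions_file="questions.xlsx",
    output_dir="results/",
)
print(single)
print(batch)
```

Either list can be handed to `subprocess.run(cmd, check=True)` once Ollama is running.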
Contributing
If you want to contribute to the development of MINE-DD, have a look at the contribution guidelines.
License
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Credits
This package was created with Cookiecutter and the NLeSC/python-template.
Owner
- Name: MINE-DD
- Login: MINE-DD
- Kind: organization
- Repositories: 1
- Profile: https://github.com/MINE-DD
Citation (CITATION.cff)
# YAML 1.2
---
cff-version: "1.2.0"
title: "mine_dd"
authors:
  -
    family-names: Viviani
    given-names: Eva
    orcid: "https://orcid.org/0000-0002-1330-0585"
  -
    family-names: Ootes
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-2800-8309"
#date-released: 2024-00-00
#doi: <insert your DOI here>
version: "0.1.0"
repository-code: "https://github.com/MINE-DD/mine-dd"
keywords:
  - nlp
  - diarrhea
message: "If you use this software, please cite it using these metadata."
license: Apache-2.0
GitHub Events
Total
- Create event: 9
- Issues event: 33
- Watch event: 1
- Delete event: 15
- Issue comment event: 2
- Member event: 1
- Push event: 74
- Pull request review event: 6
- Pull request review comment event: 13
- Pull request event: 17
Last Year
- Create event: 9
- Issues event: 33
- Watch event: 1
- Delete event: 15
- Issue comment event: 2
- Member event: 1
- Push event: 74
- Pull request review event: 6
- Pull request review comment event: 13
- Pull request event: 17