mine-dd

MINE-DD: A Python tool for embedding, searching, and querying scientific papers using local LLMs.

https://github.com/mine-dd/mine-dd

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: MINE-DD
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 103 MB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 7
  • Releases: 0
Created about 2 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation

README.md

MINE-DD

MINE-DD (Mining the past to protect against Diarrheal Disease in the future) is a collaborative research project between the eScience Center, the University of Amsterdam (UvA) and Amsterdam UMC. The project focuses on addressing the global health challenge of diarrheal disease in the context of climate change.

Description

MINE-DD is a Python package that leverages artificial intelligence to extract and synthesize insights about climate's impact on diarrheal diseases from scientific literature. It enables researchers to efficiently query and analyze large collections of academic papers that would be impractical to read manually. Built on the PaperQA2 framework, the package implements an advanced question-answering system that provides detailed, citation-backed responses, ensuring that every insight is directly traceable to its original source materials. The package:

  • Takes a collection of scientific papers (PDFs)
  • Processes them to create embeddings (vector representations)
  • Allows users to query these papers with natural language questions
  • Returns answers with citations and context from the relevant papers
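
To make the embed-then-query flow above more concrete, here is a toy sketch of the retrieval step: each text chunk is a vector, and a question is matched to the closest chunks by cosine similarity. The vectors, chunk names, and the `top_k` helper are made up for illustration. MINE-DD delegates the real work to PaperQA2 and an Ollama embedding model, whose internals are not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical 3-dimensional embeddings; real models produce far larger vectors.
index = {
    "paper_a.pdf#chunk0": [0.9, 0.1, 0.0],
    "paper_b.pdf#chunk3": [0.1, 0.8, 0.1],
    "paper_c.pdf#chunk1": [0.7, 0.3, 0.0],
}
query_vector = [1.0, 0.2, 0.0]
print(top_k(query_vector, index, k=2))
```

In a citation-backed system, the retrieved chunk names are what allow an answer to point back to its source paper.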

Notes:

  • MINE-DD uses Ollama models locally
  • Default LLM: ollama/llama3.2:1b (for laptop-friendly usage)
  • Default embeddings: ollama/mxbai-embed-large:latest
  • It uses Python >= 3.11

Installation

Requirements

  • Python 3.11 or higher
  • Ollama installed locally for running LLMs and embeddings
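
The two requirements above can be checked before installing. The following is a minimal sketch, not part of MINE-DD: the helper name `find_missing_requirements` is hypothetical, and it only checks that an `ollama` executable is on `PATH`, not that the models are downloaded.

```python
import shutil
import sys


def find_missing_requirements(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Compare only (major, minor) against the documented minimum of 3.11.
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11 or higher is required")
    # shutil.which returns None when the executable is not on PATH.
    if which("ollama") is None:
        problems.append("the 'ollama' executable was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in find_missing_requirements():
        print(f"Missing requirement: {problem}")
```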

Installation Steps

  1. Clone the repository:

```console
git clone git@github.com:MINE-DD/MINE-DD.git
cd MINE-DD
```

  2. Install the package:

```console
# Standard installation
python -m pip install .

# Development installation
python -m pip install -e ".[dev]"
```

Testing

The project includes two types of tests:

  1. Standard Tests: These run in CI environments (GitHub Actions) and don't require Ollama or GPU access.

```console
# Run all standard tests
pytest

# Run specific test files
pytest tests/test_utils.py tests/test_cli.py
```

  2. Integration Tests: These test the full functionality including LLM queries with Ollama, requiring a local environment with Ollama running.

```console
# Enable integration tests by setting SKIP_OLLAMA_TESTS=False in tests/test_query_integration.py
# Then run:
pytest -m integration
```

Integration tests are skipped automatically in CI environments and, by default, locally as well (to avoid unexpected failures). To run them, you need to:

  1. Ensure Ollama is running (ollama serve)
  2. Set SKIP_OLLAMA_TESTS = False in tests/test_query_integration.py
  3. Run with the integration marker: pytest -m integration
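
The skip behaviour described above could be expressed roughly as follows. This is a sketch under assumptions: the actual logic in tests/test_query_integration.py is not shown here, the helper `should_skip_integration` is hypothetical, and CI detection relies on the `CI` and `GITHUB_ACTIONS` environment variables that GitHub Actions sets.

```python
import os
from collections.abc import Mapping

# Assumed to mirror the module-level flag described above; True keeps
# integration tests switched off by default.
SKIP_OLLAMA_TESTS = True


def should_skip_integration(skip_flag: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """Skip when the flag is set or when running on a CI service."""
    running_in_ci = environ.get("CI", "").lower() == "true" or "GITHUB_ACTIONS" in environ
    return skip_flag or running_in_ci


# With pytest installed, the result could feed a module-level skip marker:
#   pytestmark = pytest.mark.skipif(
#       should_skip_integration(SKIP_OLLAMA_TESTS),
#       reason="Ollama integration tests are disabled",
#   )
```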

Usage

Before Using MINE-DD

Make sure Ollama is running in the background:

```console
ollama serve
```
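
To confirm that the server is up before running any `minedd` command, a quick TCP check is usually enough. The sketch below assumes Ollama listens on its default address, 127.0.0.1:11434; if you changed it (for example through `OLLAMA_HOST`), pass your own host and port. The helper name `ollama_is_listening` is hypothetical.

```python
import socket


def ollama_is_listening(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_listening())
```

This only shows that something accepts connections on that port; it does not verify that the required models have been pulled.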

Creating Paper Embeddings

Use the minedd embed command to create embeddings based on your document collection:

```console
minedd embed --paper_directory "/path/to/papers_minedd/" --embeddings_filename my-embeddings.pkl
```

Available Parameters

  • --embeddings_filename: Name of the pickle file in which the embeddings index is saved
  • --output_dir: Directory to save the embeddings pkl (default: 'out')
  • --embedding_model: Embedding model (default: 'ollama/mxbai-embed-large:latest')
  • --paper_directory: Directory with paper files (default: 'data/')
  • --augment_existing: If True, new documents are added to the existing pkl file provided; otherwise the pkl file is created from scratch.
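
If you want to drive the embedding step from a script, the command line can be assembled in Python. This is a hedged sketch: `build_embed_command` is a hypothetical helper, not part of the package, and it uses only the flags and defaults listed above. `--augment_existing` is left out because its exact value syntax is not documented here.

```python
from __future__ import annotations

import shlex
import subprocess  # only needed if you uncomment the run call below


def build_embed_command(
    paper_directory: str,
    embeddings_filename: str,
    output_dir: str = "out",
    embedding_model: str = "ollama/mxbai-embed-large:latest",
) -> list[str]:
    """Assemble the argument list for `minedd embed`."""
    return [
        "minedd", "embed",
        "--paper_directory", paper_directory,
        "--embeddings_filename", embeddings_filename,
        "--output_dir", output_dir,
        "--embedding_model", embedding_model,
    ]


command = build_embed_command("/path/to/papers_minedd/", "my-embeddings.pkl")
print(shlex.join(command))

# To actually run it (requires minedd installed and Ollama running):
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.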

Querying Papers

Use the minedd query command to ask questions about your document collection:

```console
minedd query --embeddings embeddings/papers_embeddings.pkl --questions_file questions.xlsx --output_dir results/
```

or for a single question:

```console
minedd query --embeddings embeddings/papers_embeddings.pkl --question "What is the relationship between climate change and diarrheal disease?"
```

Available Parameters

  • --embeddings: Path to the embeddings pickle file (required)
  • --questions_file: Path to Excel file with questions
  • --question: Single question to ask
  • --llm: LLM model to use (default: 'ollama/llama3.2:1b')
  • --embedding_model: Embedding model (default: 'ollama/mxbai-embed-large:latest')
  • --paper_directory: Directory with paper files (default: 'data/')
  • --output_dir: Directory to save outputs (default: 'out')
  • --max_retries: Retries for model loading failures (default: 2)
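
The query parameters above can also be assembled programmatically, for instance to ask several single questions in a row. The helper `build_query_command` below is hypothetical and uses only the documented flags. Requiring exactly one of `question` or `questions_file` is this sketch's own choice; the source does not state how the CLI behaves when both are given.

```python
from __future__ import annotations

import shlex


def build_query_command(
    embeddings: str,
    question: str | None = None,
    questions_file: str | None = None,
    llm: str = "ollama/llama3.2:1b",
    output_dir: str = "out",
    max_retries: int = 2,
) -> list[str]:
    """Assemble the argument list for `minedd query`."""
    # Sketch-level validation: exactly one source of questions.
    if (question is None) == (questions_file is None):
        raise ValueError("pass exactly one of question or questions_file")
    command = [
        "minedd", "query",
        "--embeddings", embeddings,
        "--llm", llm,
        "--output_dir", output_dir,
        "--max_retries", str(max_retries),
    ]
    if question is not None:
        command += ["--question", question]
    else:
        command += ["--questions_file", questions_file]
    return command


questions = [
    "What is the relationship between climate change and diarrheal disease?",
]
for q in questions:
    print(shlex.join(build_query_command("embeddings/papers_embeddings.pkl", question=q)))
```

Each printed line can be pasted into a terminal, or the list can be passed to `subprocess.run(..., check=True)` once Ollama is running.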

Contributing

If you want to contribute to the development of MINE-DD, have a look at the contribution guidelines.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Credits

This package was created with Cookiecutter and the NLeSC/python-template.

Owner

  • Name: MINE-DD
  • Login: MINE-DD
  • Kind: organization

Citation (CITATION.cff)

# YAML 1.2
---
cff-version: "1.2.0"
title: "mine_dd"
authors:
  -
    family-names: Viviani
    given-names: Eva
    orcid: "https://orcid.org/0000-0002-1330-0585"
  -
    family-names: Ootes
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-2800-8309"
#date-released: 2024-00-00
#doi: <insert your DOI here>
version: "0.1.0"
repository-code: "https://github.com/MINE-DD/mine-dd"
keywords:
  - nlp
  - diarrhea
message: "If you use this software, please cite it using these metadata."
license: Apache-2.0

GitHub Events

Total
  • Create event: 9
  • Issues event: 33
  • Watch event: 1
  • Delete event: 15
  • Issue comment event: 2
  • Member event: 1
  • Push event: 74
  • Pull request review event: 6
  • Pull request review comment event: 13
  • Pull request event: 17
Last Year
  • Create event: 9
  • Issues event: 33
  • Watch event: 1
  • Delete event: 15
  • Issue comment event: 2
  • Member event: 1
  • Push event: 74
  • Pull request review event: 6
  • Pull request review comment event: 13
  • Pull request event: 17