doc2vec-doc-relevance

An approach exploring and assessing literature-based doc-2-doc recommendations using doc2vec, applied to the RELISH dataset.

https://github.com/zbmed-semtec/doc2vec-doc-relevance

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

doc2doc-relevance document-embeddings document-similarity ontoclue phase-one python
Last synced: 4 months ago

Repository

An approach exploring and assessing literature-based doc-2-doc recommendations using doc2vec, applied to the RELISH dataset.

Basic Info
  • Host: GitHub
  • Owner: zbmed-semtec
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 9.55 MB
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
doc2doc-relevance document-embeddings document-similarity ontoclue phase-one python
Created over 3 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation Codemeta

README.md

Project Status: Active – The project has reached a stable, usable state and is being actively developed.


Doc2Vec-Doc-relevance

This repository explores and assesses literature-based doc-2-doc recommendations using the Doc2Vec technique, applied to the RELISH dataset.

Table of Contents

  1. About
  2. Input Data
  3. Pipeline
    1. Generate Embeddings
    2. Format embeddings
    3. Calculate Cosine Similarity
    4. Evaluation
  4. Code Implementation
  5. Getting Started
  6. Tutorial

About

Our approach involves employing the doc2vec model, which extends the popular word2vec technique to capture document-level semantics. By encoding documents and their textual content into fixed-length vectors, doc2vec facilitates similarity calculations and enables meaningful comparisons between documents. This approach is harnessed to derive insightful doc-2-doc recommendations within the realm of biomedical research, specifically employing the RELISH dataset. In order to do so, we employ the doc2vec model from the Gensim library.

Input Data

The input data for this method consists of preprocessed tokens derived from the RELISH documents. These tokens are stored in the RELISH.npy file, which contains preprocessed arrays comprising PMIDs, document titles, and abstracts. These arrays are generated through an extensive preprocessing pipeline, as elaborated in the relish-preprocessing repository. Within this preprocessing pipeline, both the title and abstract texts undergo several stages of refinement: structural words are eliminated, text is converted to lowercase, and finally, tokenization is employed, resulting in arrays of individual words.
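
As a quick illustration of what this input looks like, the sketch below (an assumption about usage, not code from the repository) loads the token file with NumPy and unpacks one record:

```python
import numpy as np

# Load the preprocessed RELISH tokens; allow_pickle=True is needed because
# the file stores arrays of Python objects (PMIDs and token lists).
docs = np.load("data/Input/Tokens/relish.npy", allow_pickle=True)

# Each record holds a PMID, the tokenized title, and the tokenized abstract.
pmid, title_tokens, abstract_tokens = docs[0]
print(pmid, title_tokens[:5], abstract_tokens[:5])
```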

Pipeline

This section outlines the progression from generating document embeddings to conducting hyperparameter optimization and ultimately evaluating the effectiveness of the approach.

Generate Embeddings

The following section outlines the process of generating document-level embeddings for each PMID of the RELISH corpus.

Create Tagged Documents

In this initial step, we create TaggedDocuments, which associate each PMID with a corresponding list of words. Here, we combine the abstract and title of each document into a unified paragraph (or document). This unified text serves as the input for our Doc2Vec model, allowing it to capture the semantic meaning of the entire document.
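
A minimal sketch of this step with Gensim, assuming the `docs` array from the input sketch above (field order: PMID, title tokens, abstract tokens):

```python
from gensim.models.doc2vec import TaggedDocument

# One TaggedDocument per PMID: title and abstract tokens are concatenated
# into a single word list, and the PMID serves as the document tag.
tagged_docs = [
    TaggedDocument(words=list(title) + list(abstract), tags=[str(pmid)])
    for pmid, title, abstract in docs
]
```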

Generate and Train Doc2Vec models

In the second phase, we construct and train Doc2Vec models with customizable hyperparameters. These models are designed to understand the relationships between documents and words in a high-dimensional vector space. We employ the parameters shown below in order to generate our models.

Parameters
  • dm: {1,0} Refers to the training algorithm. If dm=1, distributed memory is used; otherwise, distributed bag of words is used.
  • vector_size: Represents the dimensionality of the generated embeddings, with options of 200, 300, and 400 in our case.
  • window: Represents the maximum distance between the current and predicted word, with values of 5, 6, and 7 in our case.
  • epochs: Refers to the number of iterations over the training dataset and is set to 15 in this context.
  • min_count: The minimum number of appearances a word must have to not be ignored by the algorithm, configured at a minimum of 5.
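
A hedged sketch of model construction and training with Gensim for one of the hyperparameter combinations (the `workers` value is an illustrative assumption):

```python
from gensim.models.doc2vec import Doc2Vec

# One model per hyperparameter combination; a single combination is shown here.
model = Doc2Vec(
    dm=1,             # distributed memory (0 would select distributed bag of words)
    vector_size=200,  # embedding dimensionality: 200, 300, or 400
    window=5,         # maximum distance between current and predicted word: 5, 6, or 7
    epochs=15,        # iterations over the training dataset
    min_count=5,      # ignore words with fewer than 5 appearances
    workers=4,        # assumption: number of training threads
)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
```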

Format embeddings

After model training, we can extract document-level embeddings. These embeddings are numerical vectors that represent the content and context of each document in a continuous vector space. These embeddings are stored by the model, associated with each PMID. For further downstream document similarity calculations, we format and save these embeddings for each document, with its PMID, as a dataframe in a pickle file. Each hyperparameter combination results in a separate pickle file.
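
A minimal sketch of this formatting step, continuing from the training sketch above (column names and output path are assumptions for illustration):

```python
import pandas as pd

# Collect the trained vector for every PMID tag and persist the result as a
# pickle file; one such file is produced per hyperparameter combination.
embeddings = pd.DataFrame({
    "pmids": model.dv.index_to_key,
    "embeddings": [model.dv[tag] for tag in model.dv.index_to_key],
})
embeddings = embeddings.sort_values("pmids").reset_index(drop=True)
embeddings.to_pickle("data/embeddings/embeddings_0.pkl")
```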

Calculate Cosine Similarity

To assess the similarity between two documents within the RELISH corpus, we employ the Cosine Similarity metric. This process enables the generation of a 4-column matrix containing cosine similarity scores for existing pairs of PMIDs within our corpus. For a more detailed explanation of the process, please refer to this documentation.
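
The metric itself is straightforward; a self-contained sketch follows, reusing the `embeddings` dataframe from the formatting sketch above (the choice of pair is a placeholder, not data from the corpus):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: dot product of the vectors divided by the product of their norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Look up the embeddings of an existing pair and score it; here the first two
# PMIDs in the dataframe stand in for any pair listed in the relevance matrix.
vectors = dict(zip(embeddings["pmids"], embeddings["embeddings"]))
pmid_1, pmid_2 = embeddings["pmids"].iloc[0], embeddings["pmids"].iloc[1]
score = cosine_similarity(vectors[pmid_1], vectors[pmid_2])
```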

Evaluation

Precision@N

In order to evaluate the effectiveness of this approach, we make use of Precision@N. Precision@N measures the precision of retrieved documents at various cutoff points (N). We generate a Precision@N matrix for existing pairs of documents within the RELISH corpus, based on the original RELISH JSON file. The code determines the number of true positives within the top N pairs and computes Precision@N scores. The result is a Precision@N matrix with values at different cutoff points, including average scores. For detailed insights into the algorithm, please refer to this documentation.
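
The repository's implementation lives in precision.py; the sketch below only illustrates the core idea, assuming a per-reference-document list of (cosine similarity, relevance) pairs:

```python
def precision_at_n(pairs, n):
    """Precision@N for one reference document.

    pairs: (cosine_similarity, is_relevant) tuples for all of its existing pairs.
    Rank by similarity and count relevant documents among the top N.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    return sum(1 for _, is_relevant in ranked[:n] if is_relevant) / n

# Example: two of the top three most similar documents are relevant.
print(precision_at_n([(0.9, True), (0.8, False), (0.7, True), (0.4, False)], n=3))  # ~0.667
```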

nDCG@N

Another metric used is nDCG@N (normalized Discounted Cumulative Gain at N). This ranking metric assesses document retrieval quality by considering both relevance and document ranking. It operates on a TSV file containing relevance and cosine similarity scores and involves the computation of DCG@N and iDCG@N scores. The result is an nDCG@N matrix for various cutoff values (N) and each PMID in the corpus, with detailed information available in the documentation.
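
As a sketch of the underlying computation (not the repository's exact code): DCG@N discounts each relevance score by the log of its rank, and nDCG@N normalizes by the ideal ranking:

```python
import math

def ndcg_at_n(relevances, n):
    """nDCG@N for one document.

    relevances: relevance scores of its candidate documents, already ranked by
    cosine similarity. nDCG@N = DCG@N / iDCG@N, where iDCG@N is the DCG of the
    ideal, relevance-sorted ranking.
    """
    def dcg(scores):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores))
    idcg = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / idcg if idcg > 0 else 0.0

# Example: a ranking that already matches the ideal order yields 1.0.
print(ndcg_at_n([2, 1, 1, 0], n=4))  # 1.0
```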

Code Implementation

The run_embeddings.py script serves as a comprehensive wrapper, supporting the creation of tagged documents, model generation and training, embedding generation, and the subsequent storage of these embeddings as pickle files. Individual functions for each task are provided in the other two code scripts:

  • embeddings.py : Creation of tagged documents from input tokens, creation and training of Doc2Vec models, generation of embeddings.
  • embeddings.dataframe.py : Creates a dataframe of embeddings with their corresponding PMIDs, sorts it, and stores it as a pickle file.

Getting Started

To get started with this project, follow these steps:

Step 1: Clone the Repository

First, clone the repository to your local machine using the following command:

Using HTTP:

git clone https://github.com/zbmed-semtec/doc2vec-doc-relevance.git

Using SSH:

Ensure you have set up SSH keys in your GitHub account.

git clone git@github.com:zbmed-semtec/doc2vec-doc-relevance.git

Step 2: Create a virtual environment and install dependencies

To create a virtual environment within your repository, run the following commands:

```
python3 -m venv .venv
source .venv/bin/activate  # On Windows, use '.venv\Scripts\activate'
```

If you have any difficulties activating the environment on Windows, try the commands below:

```
.venv\Scripts\activate.ps1  # PowerShell
.venv\Scripts\activate.bat  # Command Prompt
```

To confirm that the virtual environment is activated and to check the location of your Python interpreter, run the following command:

which python # On Windows command prompt, use 'where python'; on Windows PowerShell, use 'Get-Command python'

The code is stable with Python 3.6 and higher. The required Python packages are listed in the requirements.txt file. To install the required packages, run the following command:

pip install -r requirements.txt

To deactivate the virtual environment after running the project, run the following command:

deactivate

Step 3: Dataset

Use the Download_Dataset.sh script to download the Split Dataset by running the following commands:

chmod +x Download_Dataset.sh
./Download_Dataset.sh

This script makes sure that the necessary folders are created and the files are downloaded in the corresponding folders as shown below.

📦 /doc2vec-doc-relevance
└─ data
   └─ Input
      ├─ Tokens
      │  └─ relish.npy
      └─ Ground_truth
         └─ relevance_matrix.tsv

The file relish.npy is in the NumPy binary format (.npy), which is specifically used to store NumPy arrays efficiently. These arrays contain the PMID, title, and abstract for each document.

In contrast, relevance_matrix.tsv is a Tab-separated Values file, similar to CSV but using tabs as delimiters. It stores tabular data with four columns: PMID1 | PMID2 | Relevance | WMD Similarity.
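
A small sketch of loading this file with pandas (whether the file ships with a header row is an assumption here; adjust the `header`/`names` arguments if it does not):

```python
import pandas as pd

# Load the ground-truth relevance matrix; expected columns:
# PMID1, PMID2, Relevance, WMD Similarity.
relevance = pd.read_csv("data/Input/Ground_truth/relevance_matrix.tsv", sep="\t")
print(relevance.head())
```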

Reference: Tab-separated values (TSV) file format: FAIRsharing DOI

Step 4: Generate Embeddings

The embeddings.py script uses the RELISH tokenized .npy file as input. You can easily adapt it for different values and parameters by modifying the hyperparameters.yaml file. Make sure to have the RELISH tokenized .npy file within the directory under the data folder.

python3 code/embeddings.py [-i INPUT PATH] [-o OUTPUT PATH] [-p PARAMS]

You must pass the following arguments:

  • -i/ --input : File path to the RELISH tokenized .npy file.
  • -o/ --output : File path to the resulting embeddings in pickle file format.
  • -p/ --params : File path to the hyperparameters YAML file.

To run this script, please execute the following command:

python3 code/embeddings.py --input data/Input/Tokens/relish.npy --output data/embeddings --params code/hyperparameters.yaml

The script will create Doc2Vec models, generate embeddings, and store them in separate directories. You should expect to find a total of 18 files corresponding to the various models, embeddings, and embedding pickle files, one per hyperparameter combination (2 dm values × 3 vector sizes × 3 window sizes = 18).

Step 5: Calculate Cosine Similarity

In order to generate the cosine similarity matrix and execute this script, run the following command:

python3 code/generate_cosine_existing_pairs.py [-i INPUT] [-e EMBEDDINGS] [-o OUTPUT]

You must pass the following three arguments:

  • -i/ --input : File path to the RELISH relevance matrix in the TSV format.
  • -e/ --embeddings : File path to the embeddings in the pickle file format.
  • -o/ --output : File path for the output 4 column cosine similarity matrix.

For example, if you are running the code from the code folder and have the RELISH relevance matrix in the data folder, run the cosine matrix creation for the first hyperparameter as:

python3 code/generate_cosine_existing_pairs.py -i data/Input/Ground_truth/relevance_matrix.tsv -e data/embeddings/embeddings_0.pkl -o data/cosine/cosine_similarity_0.tsv

Note: You would have to run the above command for every hyperparameter configuration by changing the file name of the embeddings pickle file, or use the following shell script to generate all files at once.

for VALUE in {0..17}; do
    python3 code/generate_cosine_existing_pairs.py -i data/Input/Ground_truth/relevance_matrix.tsv -e data/embeddings/embeddings_pickle_${VALUE}.pkl -o data/cosine/cosine_similarity_${VALUE}.tsv
done

Step 6: Precision@N

In order to calculate the Precision@N scores and execute this script, run the following command:

python3 code/precision.py [-i COSINE FILE PATH] [-o OUTPUT PATH] [-c CLASSES]

You must pass the following three arguments:

  • -i/ --cosinefilepath: File path to the 4-column cosine similarity existing pairs RELISH TSV file.
  • -o/ --output_path: File path to save the generated precision matrix (TSV file).
  • -c/ --classes: Number of classes for class distribution (2 or 3).

For example, if you are running the code from the code folder and have the cosine similarity TSV file in the data folder, run the precision matrix creation for the first hyperparameter as:

python3 code/precision.py -i data/cosine/cosine_similarity_0.tsv -o data/precision_three_classes/precision_0.tsv -c 3

Note: You would have to run the above command for every hyperparameter configuration by changing the file name of the cosine similarity file, or use the following shell script to generate all files at once.

for VALUE in {0..17}; do
    python3 code/precision.py -i data/cosine/cosine_similarity_${VALUE}.tsv -o data/precision_three_classes/precision_${VALUE}.tsv -c 3
done

Note: Make sure to re-run the above command by changing the classes for a different class distribution.

Step 7: nDCG@N

In order to calculate nDCG scores and execute this script, run the following command:

python3 code/calculate_gain.py [-i INPUT] [-o OUTPUT]

You must pass the following two arguments:

  • -i / --input: Path to the 4 column cosine similarity existing pairs RELISH TSV file.
  • -o/ --output: Output path along with the name of the file to save the generated nDCG@N TSV file.

For example, if you are running the code from the code folder and have the 4 column RELISH TSV file in the data folder, run the matrix creation for the first hyperparameter as:

python3 code/calculate_gain.py -i data/cosine/cosine_similarity_0.tsv -o data/gain/ndcg_0.tsv

Note: You would have to run the above command for every hyperparameter configuration by changing the file name of the cosine similarity file, or use the following shell script to generate all files at once.

for VALUE in {0..17}; do
    python3 code/calculate_gain.py -i data/cosine/cosine_similarity_${VALUE}.tsv -o data/gain/ndcg_${VALUE}.tsv
done

Step 8: Compile Results

In order to compile the average result values for Precision@N and nDCG@N and generate a single TSV file for each, please use this script.

You must pass the following two arguments:

  • -i / --input: Path to the directory consisting of all the precision matrices/gain matrices.
  • -o/ --output: Output path along with the name of the file to save the generated compiled Precision@N / nDCG@N TSV file.

If you are running the code from the code folder, run the compilation script as:

python3 code/show_avg.py -i data/gain/ -o data/results_gain.tsv

NOTE: Please do not forget to put a '/' at the end of the input file path.

Tutorial

  • A tutorial is accessible in the form of a Jupyter notebook for the generation of embeddings.
  • A tutorial is accessible in the form of a Jupyter notebook for computing cosine similarity values.

Owner

  • Name: zbmed-semtec
  • Login: zbmed-semtec
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Talha"
    given-names: "Muhammad"
  - family-names: "Geist"
    given-names: "Lukas"
    orcid: "http://orcid.org/0000-0002-2910-7982"
  - family-names: "Ravinder"
    given-names: "Rohitha"
    orcid: "https://orcid.org/0009-0004-4484-6283"
  - family-names: "Solanki"
    given-names: "Dhwani"
    orcid: "https://orcid.org/0009-0004-1529-0095"
  - family-names: "Rebholz-Schuhmann"
    given-names: "Dietrich"
    orcid: "https://orcid.org/0000-0002-1018-0370"
  - family-names: "Castro"
    given-names: "Leyla Jael"
    orcid: "https://orcid.org/0000-0003-3986-0510"
title: "doc2vec-doc-relevance"
version: 1.0.0
date-released: 2022-04-19
license: "https://spdx.org/licenses/GPL-3.0"
copyright: ZB MED
repository-code: "https://github.com/zbmed-semtec/doc2vec-doc-relevance"

CodeMeta (codemeta.json)

{
  "@context": [
    "http://schema.org/",
    {
      "codemeta": "https://w3id.org/codemeta/3.0"
    }
  ],
  "@type": "SoftwareSourceCode",
  "@id": "https://archive.softwareheritage.org/swh:1:dir:5939540bdda33571b0969e520360b80d98b2bf2b",
  "applicationCategory": "Machine Learning",
  "codeRepository": "https://github.com/zbmed-semtec/doc2vec-doc-relevance",
  "author": [
    {
      "@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha",
      "@type": "Person",
      "familyName": "Talha",
      "givenName": "Muhammad"
    },
    {
      "@id": "http://orcid.org/0000-0002-2910-7982",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "name": "ZB MED - Information Centre for Life Sciences",
        "@id": "https://ror.org/0259fwx54"
      },
      "familyName": "Geist",
      "givenName": "Lukas"
    },
    {
      "@id": "https://orcid.org/0009-0004-4484-6283",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "name": "ZB MED - Information Centre for Life Sciences",
        "@id": "https://ror.org/0259fwx54"
      },
      "familyName": "Ravinder",
      "givenName": "Rohitha"
    },
    {
      "@id": "hhttps://orcid.org/0009-0004-1529-0095",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "name": "ZB MED - Information Centre for Life Sciences",
        "@id": "https://ror.org/0259fwx54"
      },
      "familyName": "Solanki",
      "givenName": "Dhwani"
    },
    {
      "@id": "https://orcid.org/0000-0002-1018-0370",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "name": "ZB MED - Information Centre for Life Sciences",
        "@id": "https://ror.org/0259fwx54"
      },
      "familyName": "Rebholz-Schuhmann",
      "givenName": "Dietrich"
    },
    {
      "@id": "https://orcid.org/0000-0003-3986-0510",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "name": "ZB MED - Information Centre for Life Sciences",
        "@id": "https://ror.org/0259fwx54"
      },
      "familyName": "Castro",
      "givenName": "Leyla Jael"
    }
  ],
  "dateCreated": "2022-04-19",
  "description": "An approach exploring and assessing literature-based doc-2-doc recommendations using a doc2vec and applying to the RELISH dataset.",
  "funder": {
    "@type": "Organization",
    "@id": "https://ror.org/018mejw64",
    "name": "Deutsche Forschungsgemeinschaft",
    "alternateName": "German Research Foundation",
    "url": "http://www.dfg.de/en/"
  },
  "identifier": "swh:1:dir:5939540bdda33571b0969e520360b80d98b2bf2b",
  "keywords": [
    "document-embeddings",
    "document similarity"
  ],
  "license": {
    "@type": "CreativeWork",
    "@id": "http://spdx.org/licenses/GPL-3.0-only",
    "name": "GPLv3",
    "url": "https://www.gnu.org/licenses/gpl-3.0.en.html"
  },
  "name": "doc2vec-doc-relevance",
  "operatingSystem": [
    "Linux",
    "Windows 10",
    "macOS"
  ],
  "programmingLanguage": "Python 3",
  "softwareRequirements": [
    "Python3.9",
    "https://github.com/zbmed-semtec/doc2vec-doc-relevance/blob/main/requirements.txt"
  ],
  "version": "1.0.0",
  "codemeta:developmentStatus": "active",
  "funding": [
    {
      "@type": "Grant",
      "@id": "https://gepris.dfg.de/gepris/projekt/460234259",
      "funder": {
        "@type": "Organization",
        "@id": "https://ror.org/018mejw64"
      },
      "identifier": "460234259",
      "description": "Project no. 460234259 (corresponding to the NFDI4DataScience consortium)"
    },
    {
      "@type": "Grant",
      "@id": "https://gepris.dfg.de/gepris/projekt/407518790",
      "funder": {
        "@type": "Organization",
        "@id": "https://ror.org/018mejw64"
      },
      "identifier": "407518790",
      "description": "Project no. 407518790 (corresponding to the STELLA project)"
    }
  ],
  "codemeta:issueTracker": "https://github.com/zbmed-semtec/doc2vec-doc-relevance/issues",
  "codemeta:referencePublication": {
    "@type": "ScholarlyArticle",
    "@id": "https://ceur-ws.org/Vol-3466/paper5.pdf",
    "identifier": "CEUR:Vol-3466/paper5",
    "creditText": "Ravinder R, Fellerhof T, Dadi V, Geist L, Talha M, Rebholz-Schuhmann D, et al. A Comparison of Vector-based Approaches for Document Similarity Using the RELISH Corpus. Proceedings of the 6th Workshop on Semantic Web Solutions for Large-Scale Biomedical Data Analytics co-located with ESWC 2023. CEUR; 2023. Available: https://ceur-ws.org/Vol-3466/paper5.pdf",
    "name": "A Comparison of Vector-based Approaches for Document Similarity Using the RELISH Corpus",
    "datePublished": "2023-03-01",
    "license": {
      "@type": "CreativeWork",
      "@id": "http://spdx.org/licenses/CC-BY-4.0",
      "name": "Creative Commons Attribution 4.0 International",
      "alternateName": "CC BY 4.0",
      "url": "https://creativecommons.org/licenses/by/4.0/"
    },
    "url": "https://ceur-ws.org/Vol-3466/paper5.pdf#",
    "author": [
      {
        "@id": "https://orcid.org/0009-0004-4484-6283"
      },
      {
        "@id": "https://orcid.org/0000-0002-8725-1317"
      },
      {
        "@id": "https://orcid.org/0000-0002-3082-7522"
      },
      {
        "@id": "https://orcid.org/0000-0002-2910-7982"
      },
      {
        "@id": "https://orcid.org/0000-0002-4795-3648"
      },
      {
        "@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha"
      },
      {
        "@id": "https://orcid.org/0000-0002-1018-0370"
      },
      {
        "@id": "https://orcid.org/0000-0003-3986-0510"
      }
    ]
  },
  "codemeta:readme": "https://github.com/zbmed-semtec/doc2vec-doc-relevance/blob/main/README.md",
  "maintainer": {
    "@id": "https://orcid.org/0000-0003-3986-0510"
  }
}

GitHub Events

Total
  • Issues event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 1
  • Pull request event: 1
  • Pull request review event: 1
Last Year
  • Issues event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 1
  • Pull request event: 1
  • Pull request review event: 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 2
  • Total pull requests: 2
  • Average time to close issues: 6 months
  • Average time to close pull requests: 4 months
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: 4 months
  • Average time to close pull requests: 4 months
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Soudeh-Jahanshahi (1)
  • rohitharavinder (1)
  • Two-Kay (1)
Pull Request Authors
  • rohitharavinder (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • gensim ==4.3.2
  • matplotlib ==3.8.0
  • matplotlib-inline ==0.1.6
  • numba ==0.58.1
  • numpy ==1.26.1
  • pandas ==2.1.1
  • scipy ==1.11.3
  • tqdm ==4.66.1