mlfp_filter

Europe PMC Machine Learning Filter for Removing False Positives from Dictionary Annotations

https://github.com/ml4lits/mlfp_filter

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
✓ Institutional organization owner: Organization ml4lits has institutional domain (www.ebi.ac.uk)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ML4LitS
License: MIT
Language: Jupyter Notebook
Default Branch: main
Size: 455 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme, License, Citation

MLFP_filter

Europe PMC Machine Learning Filter for Removing False Positives from Dictionary Annotations

This repository contains the Europe PMC Machine Learning Filter (MLFP_filter) designed to reduce false positives in dictionary annotations. The system is structured into two main pipelines: the Abstract pipeline and the Fulltext pipeline.

Initial Setup

Before starting the pipelines, make sure the following directories exist; they store the daily articles and the text-mined results:

<your_path_where_you_like_to_store_data>/$TODAY_DATE/fulltext
<your_path_where_you_like_to_store_data>/$TODAY_DATE/abstract

The script bsub_scilite_pipeline.sh is used to create these directories.
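
As a rough illustration of the layout above, the following Python sketch creates the two per-day directories. It is not the logic of bsub_scilite_pipeline.sh; the helper name make_daily_dirs, the base_dir argument, and the date format used in the demo are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def make_daily_dirs(base_dir, today_date):
    """Create <base_dir>/<today_date>/fulltext and .../abstract (hypothetical helper)."""
    created = []
    for pipeline in ("fulltext", "abstract"):
        target = Path(base_dir) / today_date / pipeline
        # parents=True also creates the date directory; exist_ok=True makes reruns safe.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


if __name__ == "__main__":
    # The date format is an assumption; use whatever $TODAY_DATE holds in your setup.
    with tempfile.TemporaryDirectory() as base:
        for path in make_daily_dirs(base, "2023_06_01"):
            print(path)
```

Calling the helper twice for the same date is harmless, which is convenient when a daily run has to be restarted.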

Full Text Pipeline

The Full Text Pipeline is initiated using the script scilite_pipeline.sh, which also triggers the Abstract pipeline.

Running the Fulltext Pipeline

To run the Fulltext pipeline, use the following parameters:

PIPELINE_PATH_FULLTEXT="/fulltext/pipelineUnified_repo.sh"
NUM_FILE_X_JOB_FULLTEXT=2

The script pipelineUnified_repo.sh is responsible for executing all processes within the Fulltext pipeline. For instance, the command:

sh scilite_fulltext_pipeline.sh $TIMESTAMP $TODAY_DATE $PIPELINE_PATH_FULLTEXT $NUM_FILE_X_JOB_FULLTEXT

calls the scilite_fulltext_pipeline.sh script, which fetches data based on the $TODAY_DATE parameter.

Running the Abstract Pipeline

The Abstract pipeline is similarly initiated with the following parameters:

PIPELINE_PATH_ABSTRACT="abstract/pipelineUnifiedAbstract_expMethods.sh"
NUM_FILE_X_JOB_ABSTRACT=2

The command:

sh /hps/software/users/literature/textmining/abstract/scilite_abstract_pipeline.sh $TIMESTAMP $TODAY_DATE $PIPELINE_PATH_ABSTRACT $NUM_FILE_X_JOB_ABSTRACT

triggers the main script for the Abstract pipeline, which fetches data based on the $TODAY_DATE parameter.
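
Both launch commands take the same four positional arguments in the same order. The sketch below only assembles such a command as an argument list, which can help when scripting or logging a launch. The function name build_pipeline_command and the placeholder values are assumptions; the real timestamp and date formats are not specified in this README.

```python
import shlex


def build_pipeline_command(script, timestamp, today_date, pipeline_path, files_per_job):
    """Return the argv list for one pipeline launch, in the README's argument order."""
    return ["sh", script, timestamp, today_date, pipeline_path, str(files_per_job)]


if __name__ == "__main__":
    # Placeholder values only; substitute the real $TIMESTAMP and $TODAY_DATE.
    cmd = build_pipeline_command(
        "scilite_fulltext_pipeline.sh",
        "<timestamp>",
        "<today_date>",
        "/fulltext/pipelineUnified_repo.sh",
        2,
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```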

Common Pipeline Overview

Both pipelines operate similarly, with the primary difference being the source of articles fetched. The general workflow is as follows:

Fetch Process

For the Full Text pipeline, the ebi.ukpmc.pipeline.fetch.FulltextFetcherOA.java class is invoked. The code is available on Git: https://USERNAME@scm.ebi.ac.uk/git/lit-textmining-annotationPipeline.git. It takes multiple arguments, including database configurations, and uses the --timestamp option to fetch data from the PMC_INFO table where timestamp > $DATE_PROVIDED.
For the Abstract pipeline, the ebi.ukpmc.pipeline.fetch.abstracts.AbstractFetcher.java class is used to fetch data from the CITATIONS table where c.date_update > $DATE_PROVIDED.
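
The "newer than the provided date" selection described above can be pictured with a small, self-contained Python sketch. It uses an in-memory SQLite table as a stand-in for PMC_INFO; the real fetchers are Java classes working against the production database, and the column name article_id, the table layout, and the date format here are assumptions.

```python
import sqlite3


def fetch_newer_than(conn, date_provided):
    """Return ids of rows whose timestamp is strictly later than date_provided."""
    rows = conn.execute(
        "SELECT article_id FROM pmc_info WHERE timestamp > ? ORDER BY article_id",
        (date_provided,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column stand-in for the PMC_INFO table.
    conn.execute("CREATE TABLE pmc_info (article_id TEXT, timestamp TEXT)")
    conn.executemany(
        "INSERT INTO pmc_info VALUES (?, ?)",
        [("A1", "2023-05-30"), ("A2", "2023-06-01"), ("A3", "2023-06-02")],
    )
    print(fetch_newer_than(conn, "2023-06-01"))
```

Because the comparison is strict, a row stamped exactly with the provided date is not fetched again; in the demo only A3 is returned.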

Annotation Process

After fetching, the pipeline creates separate jobs for processing the source files. The annotation process involves several steps, including sentence segmentation, cleaning, and applying ML filters and dictionaries. These processes are executed using scripts and binaries located at:

/hps/software/users/literature/textmining/bin
/hps/software/users/literature/textmining/lib

Each step's output serves as the input for the subsequent step, culminating in the annotated data being written to the job_x/annotation folder.
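
The chaining described above, where each step consumes the previous step's output, can be sketched in Python as follows. The three step functions are toy stand-ins written for this example and do not reproduce the real scripts and binaries; the dictionary term BRCA1 is likewise only illustrative.

```python
from functools import reduce


def run_steps(data, steps):
    """Feed the output of each step into the next and return the final result."""
    return reduce(lambda current, step: step(current), steps, data)


# Toy stand-ins for the real steps (hypothetical).
def segment_sentences(text):
    return [part.strip() for part in text.split(".") if part.strip()]


def clean(sentences):
    # Collapse runs of whitespace inside each sentence.
    return [" ".join(sentence.split()) for sentence in sentences]


def tag_terms(sentences, dictionary=("BRCA1",)):
    # Pair each sentence with the dictionary terms it contains.
    return [(s, [term for term in dictionary if term in s]) for s in sentences]


if __name__ == "__main__":
    steps = [segment_sentences, clean, tag_terms]
    for sentence, terms in run_steps("BRCA1 is  a gene. It was studied.", steps):
        print(sentence, terms)
```

Keeping the steps in a list makes the order explicit and lets a step be added or removed without touching the others.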

JSON Generation Process

Annotated files are compiled, and additional Perl scripts generate JSON from the annotated XML files. These JSON files are then placed in the json_api folder within the daily pipeline directory.
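
To give a feel for this conversion step, here is a minimal Python sketch that turns a small annotated XML document into JSON. The pipeline itself uses Perl scripts, and neither its XML schema nor its JSON layout is documented in this README, so the element name annotation, the attributes id and type, and the output keys are all assumptions.

```python
import json
import xml.etree.ElementTree as ET


def annotations_to_json(xml_text):
    """Convert a toy annotated-XML document into a JSON string (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nested annotations are found as well.
    records = [
        {"type": element.get("type"), "text": element.text}
        for element in root.iter("annotation")
    ]
    return json.dumps({"article": root.get("id"), "annotations": records}, indent=2)


if __name__ == "__main__":
    sample = (
        '<article id="PMC1">'
        '<annotation type="gene">BRCA1</annotation>'
        '<p><annotation type="disease">asthma</annotation></p>'
        "</article>"
    )
    print(annotations_to_json(sample))
```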

Submitting JSON Files to the Annotation Submission System

The final step involves submitting the JSON files for both Full Text and Abstract annotations to the Annotation Submission System (ASS), which integrates the annotated data into MongoDB.

Log Files

Two types of log files are generated:

Script logs, which record the pipeline's progress from start to finish, are located at: /logs/rdf_[today's date].txt
Process logs, which detail the status and errors of each process, can be found at:
- /$TODAY_DATE/fulltext/logs
- /$TODAY_DATE/abstract/logs
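
When a daily run fails, it can help to scan one of the process log directories for error lines. The Python sketch below does that under the assumption that failures are marked with the string ERROR; the log format is not documented here, so adjust the marker to match what the processes actually write.

```python
import tempfile
from pathlib import Path


def find_error_lines(log_dir, marker="ERROR"):
    """Return (file name, line number, line) for every line containing marker."""
    hits = []
    for log_file in sorted(Path(log_dir).iterdir()):
        if not log_file.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Invented log content, only to exercise the function.
        (Path(tmp) / "job_1.log").write_text("start\nERROR: step failed\n", encoding="utf-8")
        (Path(tmp) / "job_2.log").write_text("start\ndone\n", encoding="utf-8")
        for hit in find_error_lines(tmp):
            print(hit)
```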

Prerequisites

Before running any scripts, obtain the necessary credentials:

USERNAME_DB_CDB
PASSWORD_DB_CDB
URL_DB_CDB
SCHEMA_DB_CDB
DOMAIN_API
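
A quick pre-flight check can catch a missing credential before a long run starts. The Python sketch below assumes the five values are supplied as environment variables, which this README does not state; if they live in a configuration file instead, pass a dictionary loaded from that file.

```python
import os

REQUIRED_SETTINGS = (
    "USERNAME_DB_CDB",
    "PASSWORD_DB_CDB",
    "URL_DB_CDB",
    "SCHEMA_DB_CDB",
    "DOMAIN_API",
)


def missing_settings(settings=None):
    """Return the required names that are unset or empty in settings."""
    # Assumption: values come from the environment unless a mapping is passed in.
    source = os.environ if settings is None else settings
    return [name for name in REQUIRED_SETTINGS if not source.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```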

Before running any scripts, also replace the email settings with your own:

LSF_EMAIL in bsub_scilite_pipeline.sh, bsubsinglefulltextjob, bsubsingleabstractjob
MAIL_RECIPIENTS in common_functions.sh

The machine-learning model used here is available at https://github.com/ML4LitS/annotation_models. Place the model in the quantised folder.

Ensure you replace placeholders with actual paths and credentials where necessary.

Cite

Tirunagari, S., Shafique, Z., Venkatesan, A., & Harisson, M. (2023). Europe PMC Machine Learning False Positive Filter for Dictionary Annotations (Version 0.0.1) [Computer software]. Retrieved from https://github.com/ML4LitS/MLFP_filter

Bibtex

@software{tirunagari2023accelerating,
  author = {Tirunagari, Santosh and Shafique, Zunaira and Venkatesan, Aravind and Harisson, Melissa},
  doi = {},
  month = {06},
  title = {Europe PMC Machine Learning False Positive Filter for Dictionary Annotations},
  url = {https://github.com/ML4LitS/MLFP_filter},
  version = {0.0.1},
  year = {2023}
}

Licence

MIT

Owner

Name: ML4LitS
Login: ML4LitS
Kind: organization
Email: stirunag@ebi.ac.uk
Location: United Kingdom

Website: https://www.ebi.ac.uk/about/teams/literature-services/
Twitter: litertwit
Repositories: 1
Profile: https://github.com/ML4LitS

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software in your work, please cite it using these metadata."
title: "Europe PMC Machine Learning False Positive Filter for Dictionary Annotations"
version: 0.0.1
date-released: 2023-01-01
authors:
  - family-names: "Tirunagari"
    given-names: "S."
  - family-names: "Shafique"
    given-names: "Z."
  - family-names: "Venkatesan"
    given-names: "A."
  - family-names: "Harisson"
    given-names: "M."
repository-code: "https://github.com/ML4LitS/MLFP_filter"
