en-tox

https://github.com/ontox-project/en-tox

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: ontox-project
License: cc0-1.0
Language: Jupyter Notebook
Default Branch: main
Size: 24.9 MB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 1

Created over 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

en-tox: a simple NLP model to extract toxicological information from scientific text.

This repository contains the work described in the paper "The application of Natural Language Processing for the extraction of toxicological mechanistic information", together with a container for running the en-tox model, which extracts relationships between toxicological entities. The "article" folder contains the code and data files used for the study. The "container" folder contains all the code needed to run the NLP pipeline in isolation: a version of the NER model updated to spaCy v3.6 and a cleaned-up relationship extraction pipeline, packaged as a Flask application in a Docker container. Instructions for running the latter are given below.

The en-tox model

The Named Entity Recognition model used in the study, referred to here as en-tox, was developed in the scope of the DARTPaths project. It was trained to recognize 7 entity types, which were annotated using the following instructions:

* COMPOUND: the smallest unit of a generic/molecular name. For example, we would label “butafenacil” or “C20H18ClF3N2O6” as a compound, but not categories of compounds, such as “herbicide”.
* PHENOTYPE: We include the "largest possible" phenotype, i.e., the most specific possible. For example, in “increased death of liver cells”, we have phenotype (“death”), attribute/modifier (“increased”) and object (“liver cells”). We then label the complete group of words. However, do not include negations in the phenotype. For example, if the text is “no increase of liver weight is observed”, the phenotype should be “increase of liver weight”.
* ORGANISM: We consider both the biological meaning and the “implied” meaning. For example, we will label “plant” as an organism, even if it is not species-specific. Similarly, we will label “sample” or “participant” as organisms, as they refer to a human organism.
* EXPOSUREROUTE: The way the compound is administered (ex: “inhaled”, “ingested”, etc.).
* DOSE: We consider dose as number+unit (ex: “20mg/L”), or as a dose estimate (for example, “IC50”, “NOEC”, etc.).
* INVITROVIVO: Words that indicate the conditions under which the study was conducted, for example “cells”/“cell lines” or “organoids”.
* PARENTVS_OFFSPRING: In which of the two the effects are observed, for example “embryo”, “F1”, etc.

Around 8,000 sentences were annotated: about 5,000 sentences from 100 PubMed articles and 3,000 sentences from 200 ECHA reports. We report a global F1 score of 0.72 on the training corpus, and the following F1 scores per entity type:

| Entity              | F1 score |
| ------------------- | -------- |
| COMPOUND            | 0.88     |
| PHENOTYPE           | 0.56     |
| ORGANISM            | 0.79     |
| EXPOSUREROUTE       | 0.59     |
| DOSE                | 0.80     |
| INVITROVIVO         | 0.55     |
| PARENTVS_OFFSPRING  | 0.83     |
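As a quick check on the table, the unweighted (macro) mean of the seven per-entity scores can be computed directly. This is only a sketch: the source does not say how the global F1 of 0.72 was averaged, so the comparison is illustrative, and the variable names are our own.

```python
# Per-entity F1 scores copied from the table above.
f1_by_entity = {
    "COMPOUND": 0.88,
    "PHENOTYPE": 0.56,
    "ORGANISM": 0.79,
    "EXPOSUREROUTE": 0.59,
    "DOSE": 0.80,
    "INVITROVIVO": 0.55,
    "PARENTVS_OFFSPRING": 0.83,
}

# Unweighted (macro) mean: every entity type counts equally,
# regardless of how many annotated spans it has.
macro_f1 = sum(f1_by_entity.values()) / len(f1_by_entity)

# Entity types sorted from the weakest to the strongest score.
ranked = sorted(f1_by_entity, key=f1_by_entity.get)

print(f"macro F1 = {macro_f1:.2f}")
print("weakest types:", ranked[:3])
```

The macro mean comes out at about 0.71, close to the reported 0.72. A small gap is expected if the global score weights entity types by how often they occur, which the source does not specify.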

The Relationship Extraction module applies semantic rules implemented with spaCy's DependencyMatcher. Details can be found in the paper or in the dependency_matcher function in article/code/utils.py.

Build & Run the Relationship Extraction pipeline with Docker

Run in your terminal, from the "container" directory:

docker build -t entox .
docker run -p 5000:5000 entox python app.py

In a separate terminal, run:

curl -X POST -H "Content-Type: application/json" -d '{"text": "$YOUR_TEXT", "cause": "$YOUR_CAUSE", "effect": "$YOUR_EFFECT"}' http://localhost:5000/relationships

where "text" is the text you want to extract relationships from, "cause" is the type of cause you are interested in ("COMPOUND" or "PHENOTYPE"), and "effect" is the outcome that the cause triggers ("PHENOTYPE"). The output is a list of triplets representing all extracted relationships (if any), each in the form (cause, causal verb, effect).
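The same request can be sent from Python using only the standard library. The sketch below is not part of the repository: the helper names are hypothetical, the URL assumes the host port published by the docker run command above (5000), the example sentence is made up, and the response is assumed to be a JSON body.

```python
import json
import urllib.request

# Assumption: host port 5000, as published by `docker run -p 5000:5000`.
ENDPOINT = "http://localhost:5000/relationships"

# Cause/effect labels allowed by the description above.
ALLOWED_CAUSES = {"COMPOUND", "PHENOTYPE"}
ALLOWED_EFFECTS = {"PHENOTYPE"}


def build_request(text, cause, effect, url=ENDPOINT):
    """Build the POST request equivalent to the curl command (hypothetical helper)."""
    if cause not in ALLOWED_CAUSES:
        raise ValueError(f"cause must be one of {sorted(ALLOWED_CAUSES)}")
    if effect not in ALLOWED_EFFECTS:
        raise ValueError(f"effect must be one of {sorted(ALLOWED_EFFECTS)}")
    payload = json.dumps({"text": text, "cause": cause, "effect": effect})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_relationships(text, cause="COMPOUND", effect="PHENOTYPE"):
    """Send the request and decode the body; needs the container running.

    Assumption: the service answers with JSON.
    """
    with urllib.request.urlopen(build_request(text, cause, effect)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request, so it runs without the container.
    req = build_request(
        "Butafenacil increased liver weight in rats.", "COMPOUND", "PHENOTYPE"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With the container running, calling extract_relationships("...") should return whatever the service sends back; checking the labels before sending avoids a round trip for requests the description above rules out.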

For ease of use, the pre-built image is available on DockerHub.

You can also directly integrate/adapt the model and code (main.py) in your own pipeline. In this case please reference our work as [xxx ref Corradi et al.]

Owner

Name: ontox-project
Login: ontox-project
Kind: organization

Repositories: 1
Profile: https://github.com/ontox-project

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Corradi"
  given-names: "Marie"
- family-names: "Luechtefeld"
  given-names: "Thomas"
- family-names: "Teunis"
  given-names: "Marc"
title: "en-tox: a simple NLP model to extract toxicological information from scientific text."
version: 1.0
doi: 10.5281/zenodo.10610597
date-released: 2024-02-02
url: "https://github.com/ontox-project/en-tox/"

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

container/Dockerfile docker

python 3.10-slim-buster build

article/code/requirements.txt pypi

Jinja2 ==3.1.2
MarkupSafe ==2.1.1
Pillow ==9.2.0
PyYAML ==6.0
Unidecode ==1.3.4
attrs ==22.1.0
biopython ==1.79
blis ==0.7.7
brotlipy ==0.7.0
catalogue ==2.0.7
charset-normalizer ==2.0.12
click ==7.1.2
coloredlogs ==15.0.1
conllu ==4.5.2
cssselect ==1.1.0
cymem ==2.0.6
distlib ==0.3.5
docopt ==0.6.2
eutils ==0.6.0
filelock ==3.8.0
habanero ==1.2.2
humanfriendly ==10.0
iniconfig ==1.1.1
joblib ==1.1.0
lxml ==4.9.1
metapub ==0.5.5
murmurhash ==1.0.7
neo4j ==4.4.3
nmslib ==2.1.1
packaging ==21.3
pandas ==1.3.4
pathy ==0.6.1
platformdirs ==2.5.2
pluggy ==1.0.0
preshed ==3.0.6
psutil ==5.9.1
py ==1.11.0
pybind11 ==2.6.1
pydantic ==1.8.2
pyparsing ==3.0.9
pysbd ==0.3.4
pytest ==7.1.2
python-Levenshtein ==0.12.2
pytz ==2022.1
regex ==2022.9.13
scikit-learn ==1.1.2
scipy ==1.8.1
scispacy ==0.4.0
smart-open ==5.2.1
spacy ==3.0.8
spacy-legacy ==3.0.9
srsly ==2.4.3
tabulate ==0.8.10
thinc ==8.0.16
threadpoolctl ==3.1.0
tokenizers ==0.13.1
toml ==0.10.2
tomli ==2.0.1
torch ==1.12.1
torchaudio ==0.12.1
torchvision ==0.13.1
tox ==3.25.1
tqdm ==4.64.0
transformers ==4.23.1
typer ==0.3.2
virtualenv ==20.16.3
wasabi ==0.9.1

container/poetry.lock pypi

annotated-types 0.6.0
blinker 1.7.0
blis 0.7.11
catalogue 2.0.10
certifi 2023.7.22
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
confection 0.1.3
cymem 2.0.8
en-tox 2.0.0
flask 3.0.0
idna 3.4
itsdangerous 2.1.2
jinja2 3.1.2
langcodes 3.3.0
markupsafe 2.1.3
murmurhash 1.0.10
numpy 1.26.1
packaging 23.2
pandas 2.1.1
pathy 0.10.3
preshed 3.0.9
pydantic 2.4.2
pydantic-core 2.10.1
python-dateutil 2.8.2
pytz 2023.3.post1
requests 2.31.0
setuptools 68.2.2
six 1.16.0
smart-open 6.4.0
spacy 3.6.1
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
thinc 8.1.12
tqdm 4.66.1
typer 0.9.0
typing-extensions 4.8.0
tzdata 2023.3
urllib3 2.0.7
wasabi 1.1.2
werkzeug 3.0.1

container/pyproject.toml pypi

en-tox *
flask 3.0.0
pandas 2.1.1
python >=3.10,<3.13
spacy 3.6.1
