Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ontox-project
- License: cc0-1.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 24.9 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
en-tox: a simple NLP model to extract toxicological information from scientific text. 
This repository contains the work described in the paper "The application of Natural Language Processing for the extraction of toxicological mechanistic information", together with a container for running the en-tox model, which extracts relationships between toxicological entities. The "article" folder contains the code and data files used for the study. The "container" folder contains all the code needed to run the NLP pipeline in isolation: an updated version of the NER model (ported to spaCy v3.6) and a cleaned-up relationship extraction pipeline, packaged as a Flask application in a Docker container. Instructions for running the latter are detailed below.
The en-tox model
The Named Entity Recognition model used in the study, referred to here as en-tox, was developed in the scope of the DARTPaths project. It was trained to recognize 7 entity types, which were annotated using the following instructions:

* COMPOUND: The smallest unit of a generic/molecular name. For example, we would label “butafenacil” or “C20H18ClF3N2O6” as a compound, but not categories of compounds, such as “herbicide”.
* PHENOTYPE: We include the "largest possible" phenotype, i.e., the most specific possible. For example, in “increased death of liver cells”, we have a phenotype (“death”), an attribute/modifier (“increased”) and an object (“liver cells”). We then label the complete group of words. However, do not include negations in the phenotype. For example, if the text is “no increase of liver weight is observed”, the phenotype should be “increase of liver weight”.
* ORGANISM: We consider both the biological meaning and the “implied” meaning. For example, we will label “plant” as an organism, even if it is not species-specific. Similarly, we will label “sample” or “participant” as organisms, as they refer to a human organism.
* EXPOSUREROUTE: The way the compound is administered (e.g., “inhaled”, “ingested”).
* DOSE: We consider dose as number+unit (e.g., “20mg/L”), or as a dose estimate (for example, “IC50”, “NOEC”).
* INVITROVIVO: Words that indicate the conditions under which the study was conducted, for example “cells”/“cell lines” or “organoids”.
* PARENTVS_OFFSPRING: In which of the two the effects are observed, for example “embryo” or “F1”.
Around 8,000 sentences were annotated: about 5,000 from 100 PubMed articles and 3,000 from 200 ECHA reports. We report a global F1 score on the training corpus of 0.72, and the following F1 scores per entity type:
| Entity | F1 score |
| ------------------- | -------- |
| COMPOUND | 0.88 |
| PHENOTYPE | 0.56 |
| ORGANISM | 0.79 |
| EXPOSUREROUTE | 0.59 |
| DOSE | 0.80 |
| INVITROVIVO | 0.55 |
| PARENTVS_OFFSPRING | 0.83 |
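The README does not say how the global score is aggregated from the per-entity scores. As a quick sanity check, the sketch below computes the unweighted (macro) mean of the values in the table. The dictionary simply copies the table; the variable names are illustrative.

```python
from statistics import mean

# Per-entity F1 scores copied from the table above.
per_entity_f1 = {
    "COMPOUND": 0.88,
    "PHENOTYPE": 0.56,
    "ORGANISM": 0.79,
    "EXPOSUREROUTE": 0.59,
    "DOSE": 0.80,
    "INVITROVIVO": 0.55,
    "PARENTVS_OFFSPRING": 0.83,
}

# Unweighted mean over entity types (macro average).
macro_f1 = mean(per_entity_f1.values())

# Entity type with the lowest score.
weakest = min(per_entity_f1, key=per_entity_f1.get)

print(f"macro-averaged F1: {macro_f1:.2f}")
print(f"lowest-scoring entity type: {weakest}")
```

The macro mean comes out at about 0.71, slightly below the reported global score of 0.72. A small gap of this kind would be expected if the global score weights entity types by how often they occur, but that is an assumption rather than something stated in the README.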
The Relationship Extraction module is rule-based: it applies semantic rules through spaCy's DependencyMatcher. Details can be found in the paper or in the dependency_matcher function in article/code/utils.py.
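To give a feel for what a dependency-based rule looks like, the sketch below matches a causal verb whose subject is the cause and whose direct object is the effect, over a hand-written dependency parse. It is a plain-Python toy stand-in, not the repository's dependency_matcher function and not spaCy's DependencyMatcher. The `Token` type, the `CAUSAL_VERBS` list, the `nsubj`/`dobj` rule and the example sentence are all assumptions made for illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the syntactic head (the root points to itself)
    dep: str   # dependency label of the arc to the head


# Hypothetical list of causal verbs; the actual rules are described in the paper.
CAUSAL_VERBS = {"induces", "causes", "increases"}


def extract_triplets(tokens: list[Token]) -> list[tuple[str, str, str]]:
    """Return (cause, causal verb, effect) triplets found by one simple rule."""
    triplets = []
    for i, tok in enumerate(tokens):
        if tok.text.lower() not in CAUSAL_VERBS:
            continue
        # Direct children of the verb, excluding the verb itself.
        children = [t for j, t in enumerate(tokens) if t.head == i and j != i]
        subjects = [t.text for t in children if t.dep == "nsubj"]
        objects = [t.text for t in children if t.dep == "dobj"]
        triplets.extend((s, tok.text, o) for s in subjects for o in objects)
    return triplets


# Invented example sentence with a hand-written parse: "CompoundX induces steatosis".
sentence = [
    Token("CompoundX", 1, "nsubj"),
    Token("induces", 1, "ROOT"),
    Token("steatosis", 1, "dobj"),
]
triplets = extract_triplets(sentence)
print(triplets)
```

In the real pipeline the parse comes from spaCy and the cause and effect are restricted to entities recognized by the NER model; this toy skips both steps.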
Build & Run the Relationship Extraction pipeline with Docker
Run in your terminal, from the "container" directory:
docker build -t entox .
docker run -p 5000:5000 entox
python app.py
In a separate terminal, run:
curl -X POST -H "Content-Type: application/json" -d '{"text": "$YOUR_TEXT", "cause": "$YOUR_CAUSE", "effect": "$YOUR_EFFECT"}' http://localhost:5068/relationships
where "text" is the text you want to extract relationships from, "cause" is the type of cause you are interested in ("COMPOUND" or "PHENOTYPE") and "effect" is the type of outcome the cause triggers ("PHENOTYPE"). The output is a list of triplets representing all extracted relationships (if any), each in the form (cause, causal verb, effect). Make sure the port in the URL matches the host port published with docker run -p.
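The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The function names are hypothetical, and it assumes the service replies with a JSON body, which the README implies but does not state. The URL is passed in explicitly so that it can match however the container port is published.

```python
import json
import urllib.request

# Values accepted for "cause" and "effect" according to the README.
ALLOWED_CAUSES = {"COMPOUND", "PHENOTYPE"}
ALLOWED_EFFECTS = {"PHENOTYPE"}


def build_request(url: str, text: str, cause: str, effect: str) -> urllib.request.Request:
    """Build the POST request for the /relationships endpoint."""
    if cause not in ALLOWED_CAUSES:
        raise ValueError(f"cause must be one of {sorted(ALLOWED_CAUSES)}")
    if effect not in ALLOWED_EFFECTS:
        raise ValueError(f"effect must be one of {sorted(ALLOWED_EFFECTS)}")
    payload = json.dumps({"text": text, "cause": cause, "effect": effect}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_relationships(url: str, text: str, cause: str, effect: str):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_request(url, text, cause, effect)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, a call such as `extract_relationships("http://localhost:5000/relationships", your_text, "COMPOUND", "PHENOTYPE")` would return the decoded reply; replace the host and port with the ones your container is actually published on.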
For ease of use, the pre-built image is available on DockerHub.
You can also directly integrate/adapt the model and code (main.py) in your own pipeline. In this case please reference our work as [xxx ref Corradi et al.]
Owner
- Name: ontox-project
- Login: ontox-project
- Kind: organization
- Repositories: 1
- Profile: https://github.com/ontox-project
Citation (CITATION.cff)
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Corradi"
    given-names: "Marie"
  - family-names: "Luechtefeld"
    given-names: "Thomas"
  - family-names: "Teunis"
    given-names: "Marc"
title: "en-tox: a simple NLP model to extract toxicological information from scientific text."
version: 1.0
doi: 10.5281/zenodo.10610597
date-released: 2024-02-02
url: "https://github.com/ontox-project/en-tox/"
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- python 3.10-slim-buster build
- Jinja2 ==3.1.2
- MarkupSafe ==2.1.1
- Pillow ==9.2.0
- PyYAML ==6.0
- Unidecode ==1.3.4
- attrs ==22.1.0
- biopython ==1.79
- blis ==0.7.7
- brotlipy ==0.7.0
- catalogue ==2.0.7
- charset-normalizer ==2.0.12
- click ==7.1.2
- coloredlogs ==15.0.1
- conllu ==4.5.2
- cssselect ==1.1.0
- cymem ==2.0.6
- distlib ==0.3.5
- docopt ==0.6.2
- eutils ==0.6.0
- filelock ==3.8.0
- habanero ==1.2.2
- humanfriendly ==10.0
- iniconfig ==1.1.1
- joblib ==1.1.0
- lxml ==4.9.1
- metapub ==0.5.5
- murmurhash ==1.0.7
- neo4j ==4.4.3
- nmslib ==2.1.1
- packaging ==21.3
- pandas ==1.3.4
- pathy ==0.6.1
- platformdirs ==2.5.2
- pluggy ==1.0.0
- preshed ==3.0.6
- psutil ==5.9.1
- py ==1.11.0
- pybind11 ==2.6.1
- pydantic ==1.8.2
- pyparsing ==3.0.9
- pysbd ==0.3.4
- pytest ==7.1.2
- python-Levenshtein ==0.12.2
- pytz ==2022.1
- regex ==2022.9.13
- scikit-learn ==1.1.2
- scipy ==1.8.1
- scispacy ==0.4.0
- smart-open ==5.2.1
- spacy ==3.0.8
- spacy-legacy ==3.0.9
- srsly ==2.4.3
- tabulate ==0.8.10
- thinc ==8.0.16
- threadpoolctl ==3.1.0
- tokenizers ==0.13.1
- toml ==0.10.2
- tomli ==2.0.1
- torch ==1.12.1
- torchaudio ==0.12.1
- torchvision ==0.13.1
- tox ==3.25.1
- tqdm ==4.64.0
- transformers ==4.23.1
- typer ==0.3.2
- virtualenv ==20.16.3
- wasabi ==0.9.1
- annotated-types 0.6.0
- blinker 1.7.0
- blis 0.7.11
- catalogue 2.0.10
- certifi 2023.7.22
- charset-normalizer 3.3.2
- click 8.1.7
- colorama 0.4.6
- confection 0.1.3
- cymem 2.0.8
- en-tox 2.0.0
- flask 3.0.0
- idna 3.4
- itsdangerous 2.1.2
- jinja2 3.1.2
- langcodes 3.3.0
- markupsafe 2.1.3
- murmurhash 1.0.10
- numpy 1.26.1
- packaging 23.2
- pandas 2.1.1
- pathy 0.10.3
- preshed 3.0.9
- pydantic 2.4.2
- pydantic-core 2.10.1
- python-dateutil 2.8.2
- pytz 2023.3.post1
- requests 2.31.0
- setuptools 68.2.2
- six 1.16.0
- smart-open 6.4.0
- spacy 3.6.1
- spacy-legacy 3.0.12
- spacy-loggers 1.0.5
- srsly 2.4.8
- thinc 8.1.12
- tqdm 4.66.1
- typer 0.9.0
- typing-extensions 4.8.0
- tzdata 2023.3
- urllib3 2.0.7
- wasabi 1.1.2
- werkzeug 3.0.1
- en-tox *
- flask 3.0.0
- pandas 2.1.1
- python >=3.10,<3.13
- spacy 3.6.1