https://github.com/biocypher/collectri
Repository
Basic Info
- Host: GitHub
- Owner: biocypher
- License: mit
- Language: Python
- Default Branch: main
- Size: 76.2 KB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
BioCypher adapter for the CollecTRI dataset
This repository contains the code for the BioCypher adapter for the CollecTRI dataset. The adapter is a Python module that converts the CollecTRI dataset into the BioCypher format. It also serves as a tutorial for end-to-end knowledge graph construction using BioCypher.
Process
Download and cache the resource
Run BioCypher to create knowledge graph
Deploy knowledge graph and web frontend
Tutorial
- Create repository: using the template repository is the easiest way to get started. The template repository contains the basic structure of a BioCypher adapter. Clone the template repository and rename it and the adapter to your project's name. Also adjust the pyproject.toml file to reflect your project's name and version.
- Find the data: the CollecTRI dataset is available as a flat file at https://rescued.omnipathdb.org/CollecTRI.csv. With this link, we can set up a BioCypher Resource object to download and cache the data. We implement this and all other steps of the build pipeline in the create_knowledge_graph.py script. Check there for the full code.
```python
from biocypher import BioCypher, Resource

bc = BioCypher()
collectri = Resource(
    name="collectri",
    url_s="https://rescued.omnipathdb.org/CollecTRI.csv",
    lifetime=0,  # CollecTRI is a static resource
)
paths = bc.download(collectri)
```
- Adjust the adapter based on the contents of the dataset. This is the most labour-intensive step, as it involves systematising the dataset and mapping it to a suitable ontology, as well as designing the ETL (extract-transform-load) process in the adapter module. The CollecTRI dataset is comparatively simple, which makes it a good example. You can find a detailed description of the process below (adapter design and ontology mapping).
When building the adapter, it can be helpful to use the Pandas functionality of
BioCypher to preview the KG components. Using the add() and to_df() methods,
we can check whether the adapter is working as expected.
```python
bc.add(adapter.get_nodes())
bc.add(adapter.get_edges())
dfs = bc.to_df()
for name, df in dfs.items():
    print(name)
    print(df.head())
```
- Run BioCypher to create the knowledge graph. This step is straightforward,
using the information provided by the mapping configuration and the process
provided by the adapter created in the previous step. For compatibility with the
Docker compose workflow, we use the write_nodes() and write_edges() methods to
generate CSV files for import into Neo4j, as well as the import call statement
and a summary of the build process.
```python
bc.write_nodes(adapter.get_nodes())
bc.write_edges(adapter.get_edges())

# Write admin import statement
bc.write_import_call()

# Print summary
bc.summary()
```
- Run Docker compose to deploy the knowledge graph. Running the standard
docker-compose.yaml configuration will build the graph, import it into Neo4j, and deploy a Neo4j instance to be accessed on https://localhost:7474. The graph can then be browsed and queried.
```bash
docker compose up -d
```
You can also include the ChatGSE frontend in the deployment by running the
docker-compose-chatgse.yaml configuration. This will also deploy a ChatGSE
instance to be accessed on https://localhost:8501. In the Knowledge Graph tab,
you can use natural language queries to generate Cypher queries and run them on
the graph. For connecting, you need to change the Neo4j host IP from localhost
to deploy, which is the name of the Docker service running the Neo4j instance.
You should be able to answer questions like "Which transcription factors
activate TP53?", "Which genes are regulated by transcription factors starting
with 'ZNF'?", or "Which are DNA-binding transcription factors?"
```bash
docker compose -f docker-compose-chatgse.yaml up -d
```
To stop the deployment, run
```bash
docker compose down --volumes
```
or
```bash
docker compose -f docker-compose-chatgse.yaml down --volumes
```
Removing the volumes is necessary to ensure a clean deployment when running
docker compose up again. Otherwise, the graph will contain duplicate nodes and
edges.
Adapter design
We can look at the downloaded dataset (using the path from the previous step) to get an idea of its contents:
```python
import pandas as pd

df = pd.read_csv(paths[0])
print(df.head())

source target weight ... resources PMID sign.decision
0 MYC TERT 1 ... ExTRI;HTRI;TRRUST;TFactS;NTNU.Curated;Pavlidis... 10022128;10491298;10606235;10637317;10723141;1... PMID
1 SPI1 BGLAP 1 ... ExTRI 10022617 default activation
2 AP1 JUN 1 ... ExTRI;TRRUST;NTNU.Curated 10022869;10037172;10208431;10366004;11281649;1... PMID
3 SMAD3 JUN 1 ... ExTRI;TRRUST;TFactS;NTNU.Curated 10022869;12374795 PMID
4 SMAD4 JUN 1 ... ExTRI;TRRUST;TFactS;NTNU.Curated 10022869;12374795 PMID

print(df.columns)

Index(['source', 'target', 'weight', 'TF.category', 'resources', 'PMID',
'sign.decision'],
dtype='object')
```
We can then use this knowledge to design the adapter, i.e., the ETL process. Briefly, the adapter extracts sources and targets, which are both genes, and establishes relationships that embody the regulons. These relationships are enriched by the curation information contained in the table.
We use Enums to define the types of nodes and edges and their properties. This
helps in organising the process and also allows the use of auto-completion in
downstream tasks. We have two node types, gene and transcription factor, and
one relationship type, transcriptional regulation. We also define the
properties of the nodes and edges, which are none for genes, category for
transcription factors, and weight, resources, references, and
sign_decision for the relationship. (Note that we rename some of the original
attributes to make them more intuitive, e.g., PMID to references, or
machine-compatible, e.g., sign.decision to sign_decision. This conversion is
handled by the adapter and needs to be reflected in our schema configuration.)
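As a minimal sketch of this idea, the node types, the edge type, and the renamed edge properties could be written as Enums. The class names, member names, and the `edge_properties()` helper below are hypothetical and are not taken from the adapter's source; only the type and property names come from the description above.

```python
from enum import Enum


class NodeType(Enum):
    GENE = "gene"
    TRANSCRIPTION_FACTOR = "transcription factor"


class EdgeType(Enum):
    TRANSCRIPTIONAL_REGULATION = "transcriptional regulation"


class TranscriptionFactorField(Enum):
    CATEGORY = "category"


class RegulationField(Enum):
    # Each value is the property name as it should appear in the graph.
    WEIGHT = "weight"
    RESOURCES = "resources"
    REFERENCES = "references"
    SIGN_DECISION = "sign_decision"


# Original CollecTRI column that feeds each edge property (hypothetical mapping table).
SOURCE_COLUMN = {
    RegulationField.WEIGHT: "weight",
    RegulationField.RESOURCES: "resources",
    RegulationField.REFERENCES: "PMID",
    RegulationField.SIGN_DECISION: "sign.decision",
}


def edge_properties(row):
    """Map one raw row (a dict keyed by column name) to the renamed properties."""
    return {field.value: row[column] for field, column in SOURCE_COLUMN.items()}
```

Keeping the renaming in one table means the names used in the adapter and the names listed in the schema configuration can be compared in a single place.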
The adapter then uses the pandas library to read the dataset and extract the
relevant information. We use the _preprocess_data() method to load the
dataframe and extract unique genes and TFs. Since each row of the dataset is one
relationship, we can simply iterate over the rows to create the relationships
directly.
The last component the adapter needs is two public methods, get_nodes() and
get_edges(), which return generators of nodes and edges, respectively. These
methods are used in the build script (create_knowledge_graph.py) to create the
knowledge graph.
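The overall shape of such an adapter can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `SketchAdapter` is a hypothetical class, it reads the table with the standard-library `csv` module instead of pandas so that it runs without extra dependencies, and it yields plain tuples rather than BioCypherNode and BioCypherEdge objects. The two sample rows are taken from the preview above; the TF.category values are elided there, so `placeholder` is a stand-in and not real data.

```python
import csv
import io

SAMPLE_CSV = """source,target,weight,TF.category,resources,PMID,sign.decision
SPI1,BGLAP,1,placeholder,ExTRI,10022617,default activation
SMAD3,JUN,1,placeholder,ExTRI;TRRUST;TFactS;NTNU.Curated,10022869;12374795,PMID
"""

PREFIX = "hgnc.symbol:"


class SketchAdapter:
    """Hypothetical adapter skeleton; not the repository's actual class."""

    def __init__(self, handle):
        self._rows = list(csv.DictReader(handle))
        self._preprocess_data()

    def _preprocess_data(self):
        # Sources are transcription factors; keep the first category seen for each.
        self._tfs = {}
        for row in self._rows:
            self._tfs.setdefault(row["source"], row["TF.category"])
        # Assumption: a target that also occurs as a source is emitted once, as a TF.
        targets = {row["target"] for row in self._rows}
        self._genes = sorted(targets - set(self._tfs))

    def get_nodes(self):
        """Yield (id, label, properties) tuples for TFs, then for other genes."""
        for symbol, category in self._tfs.items():
            yield (
                PREFIX + symbol,
                "transcription factor",
                {"name": symbol, "category": category},
            )
        for symbol in self._genes:
            yield (PREFIX + symbol, "gene", {"name": symbol})

    def get_edges(self):
        """Yield (id, source, target, label, properties) tuples, one per row."""
        for row in self._rows:
            properties = {
                "weight": float(row["weight"]),
                "resources": row["resources"],
                "references": row["PMID"],  # renamed from PMID
                "sign_decision": row["sign.decision"],  # renamed from sign.decision
            }
            yield (
                None,
                PREFIX + row["source"],
                PREFIX + row["target"],
                "transcriptional regulation",
                properties,
            )


adapter = SketchAdapter(io.StringIO(SAMPLE_CSV))
nodes = list(adapter.get_nodes())
edges = list(adapter.get_edges())
```

Because both public methods are generators, the build script can pass them straight to BioCypher without materialising the whole graph in memory first.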
Ontology mapping
In addition, we use the information to create an ontology mapping in the
schema_config.yaml file, which reflects the ontological grounding of the
data. Since CollecTRI deals with transcriptional regulation in a gene-gene
context, we only need to define gene nodes and some regulatory interaction
between them. For this simple case, we resort to the shallow default ontology,
Biolink, which already contains Gene entities and regulatory relationships.
This also means we do not need to specify the ontology in the
biocypher_config.yaml file, as Biolink is the default.
We use the existing entity type gene, and we extend the existing pairwise
gene to gene association relationship to transcriptional regulation using
inheritance.
For clarity, we also introduce a transcription factor entity type, which
inherits from gene; this way, we can query for transcription factors
specifically while retaining the ability to query for all genes.
```yaml
gene:
  represented_as: node
  preferred_id: hgnc.symbol
  properties:
    name: str

transcription factor:
  is_a: gene
  represented_as: node
  preferred_id: hgnc.symbol
  properties:
    name: str
    category: str

transcriptional regulation:
  is_a: pairwise gene to gene interaction
  represented_as: edge
  source: transcription factor
  target: gene
  properties:
    weight: float
    resources: str
    references: str
    sign_decision: str
```
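To illustrate why inheritance lets us query transcription factors specifically while still matching them as genes, the sketch below walks the `is_a` chain of a class. It is only a toy model: BioCypher resolves inheritance through the ontology itself, and the `SCHEMA` dictionary and the `label_chain()` function are hypothetical names introduced here.

```python
# Reduced, hypothetical copy of the two node classes from the schema configuration.
SCHEMA = {
    "gene": {"represented_as": "node"},
    "transcription factor": {"is_a": "gene", "represented_as": "node"},
}


def label_chain(schema, name):
    """Return the class followed by its ancestors, following is_a links."""
    chain = []
    current = name
    while current is not None:
        if current in chain:
            raise ValueError(f"cyclic is_a chain at {current!r}")
        chain.append(current)
        current = schema.get(current, {}).get("is_a")
    return chain


tf_labels = label_chain(SCHEMA, "transcription factor")
gene_labels = label_chain(SCHEMA, "gene")
```

Here a transcription factor carries both its own class and gene, so a query for genes also matches it, while a plain gene only carries gene.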
Note that, since we pass BioCypherNode and BioCypherEdge objects to the
BioCypher instance, which already include the correct labels of the ontology
classes we map to (gene, transcription factor, and transcriptional
regulation), we do not need to specify the input_label fields of each class.
We do, however, add some optional components to the schema configuration, mainly
to make interaction with the LLM framework BioChatter easier. For instance, we
provide explicit properties for each class, which are used to generate the
schema_info.yaml file, an extended schema configuration for BioChatter
integration. We also include the name property as a shortcut to the gene
symbol without added prefix (which usually is good practice to ensure uniqueness
of identifiers, in this case the hgnc.symbol). This way, we (and the LLM) can
use the name property to refer to genes by their symbol, e.g., MYC instead
of hgnc.symbol:MYC.
We also rename weight to activation_or_inhibition, since it is a binary
attribute with only two values, 1 and -1. We convert these values to the
strings activation and inhibition, which makes the attribute more intuitive
to human and LLM users.
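A minimal sketch of this conversion is shown below. The function name `weight_to_category()` is hypothetical, and the sketch assumes that 1 stands for activation and -1 for inhibition, following the order in which the values and categories are listed above.

```python
def weight_to_category(weight):
    """Convert a CollecTRI weight (1 or -1) to a descriptive string.

    Assumption: 1 means activation and -1 means inhibition.
    """
    value = float(weight)  # accepts ints, floats, and numeric strings
    if value == 1:
        return "activation"
    if value == -1:
        return "inhibition"
    raise ValueError(f"unexpected weight: {weight!r}")


edge_property = {"activation_or_inhibition": weight_to_category("-1")}
```

Raising on any other value keeps unexpected entries in the table from being silently mapped to one of the two categories.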
Owner
- Name: biocypher
- Login: biocypher
- Kind: organization
- Website: https://biocypher.org
- Repositories: 1
- Profile: https://github.com/biocypher
GitHub Events
Total
- Push event: 5
Last Year
- Push event: 5
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- Mjvolk3 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- biocypher/base 1.2.0
- neo4j 4.4-enterprise
- appdirs 1.4.4
- biocypher 0.5.33
- certifi 2023.7.22
- charset-normalizer 3.3.0
- colorama 0.4.6
- colorlog 6.7.0
- idna 3.4
- isodate 0.6.1
- more-itertools 10.1.0
- neo4j 4.4.11
- neo4j-utils 0.0.7
- networkx 3.1
- numpy 1.26.1
- numpy 1.26.2
- packaging 23.2
- pandas 2.1.1
- platformdirs 3.11.0
- pooch 1.7.0
- pyparsing 3.1.1
- python-dateutil 2.8.2
- pytz 2023.3.post1
- pyyaml 6.0.1
- rdflib 6.3.2
- requests 2.31.0
- six 1.16.0
- stringcase 1.2.0
- toml 0.10.2
- tqdm 4.66.1
- treelib 1.6.4
- tzdata 2023.3
- urllib3 2.0.7
- biocypher ^0.5.33
- python ^3.10
- biocypher/base 1.2.0
- biocypher/chatgse 0.3.2
- neo4j 4.4-enterprise