https://github.com/kundajelab/genomelake
Simple and efficient access to genomic data for deep learning models.
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary
Repository
Statistics
- Stars: 42
- Watchers: 13
- Forks: 18
- Open Issues: 6
- Releases: 0
Metadata Files
README.md
genomelake
Efficient random access to genomic data for deep learning models.
Supports the following types of input data:
- bigwig
- DNA sequence
genomelake extracts signal from genomic inputs in provided BED intervals.
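To make "extracting signal in BED intervals" concrete, here is a minimal plain-Python sketch of the idea. It does not use genomelake; `signal_by_chrom` and `extract_intervals` are hypothetical names, and the signal values are toy data invented for illustration.

```python
# Hypothetical per-base signal for a toy chromosome (not real data).
signal_by_chrom = {"chr1": [0.0, 0.5, 1.0, 2.0, 4.0, 3.0, 1.0, 0.0]}


def extract_intervals(signal_by_chrom, intervals):
    """Return one list of per-base values for each (chrom, start, end) interval.

    BED coordinates are 0-based and half-open, so an interval covers
    positions start through end - 1.
    """
    return [signal_by_chrom[chrom][start:end] for chrom, start, end in intervals]


intervals = [("chr1", 1, 4), ("chr1", 4, 7)]
batch = extract_intervals(signal_by_chrom, intervals)
print(batch)  # [[0.5, 1.0, 2.0], [4.0, 3.0, 1.0]]
```

When all intervals have the same width, the result stacks into a rectangular array with one row per interval, which is the shape a model expects as a batch.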
Requirements
- python 2.7, 3.5, or 3.6
- tiledb
- bcolz
- cython
- numpy
- pybedtools
- pysam
Installation
Clone the repository and run:
python setup.py install
Getting started: training a protein-DNA binding model
Extract genome-wide sequence data into a genomelake data source:
```python
from genomelake.backend import extract_fasta_to_file

# Path to the source FASTA and the directory that will hold the extracted data
genome_fasta = "/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa"
genome_data_directory = "./hg19_data_directory"
extract_fasta_to_file(genome_fasta, genome_data_directory)
```
Using a BED intervals file with labels, a genome data source, and genomelake's ArrayExtractor, generate input DNA sequences and labels:
```python
import pybedtools
from genomelake.extractors import ArrayExtractor
import numpy as np


def batch_iter(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    try:
        while True:
            values = []
            for n in range(batch_size):
                values += (next(it),)
            yield values
    except StopIteration:
        # Yield the final partial batch, but skip it if it is empty
        if values:
            yield values


def generate_inputs_and_labels(intervals_file, data_source, batch_size=128):
    """Yield (inputs, labels) batches for the intervals in a BED file."""
    bt = pybedtools.BedTool(intervals_file)
    extractor = ArrayExtractor(data_source)
    intervals_generator = batch_iter(bt, batch_size)
    for intervals_batch in intervals_generator:
        inputs = extractor(intervals_batch)
        # The label is stored in the name column of each BED interval
        labels = []
        for interval in intervals_batch:
            labels.append(float(interval.name))
        labels = np.array(labels)
        yield inputs, labels
```
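The example model takes inputs of shape `(101, 4)`, which suggests that each 101 bp sequence is represented with one row per base and one column per nucleotide (one-hot encoding). The sketch below illustrates that representation in plain Python. It is not genomelake's implementation: `one_hot_encode`, the A, C, G, T channel order, and the handling of other bases are assumptions made for illustration.

```python
# Assumed channel order; genomelake's actual order may differ.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row here;
    that is a choice made for this sketch.
    """
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        index = BASES.find(base)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
print(len(encoded), len(encoded[0]))  # 5 4
```

Under this encoding, a batch of 128 intervals of 101 bp would have shape `(128, 101, 4)`.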
Train a keras model of JUND binding to DNA using 101 base pair intervals and labels in ./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz:
```python
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

intervals_file = "./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz"
inputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)

# A single convolutional layer followed by a sigmoid output unit
model = Sequential()
model.add(Conv1D(15, 25, input_shape=(101, 4)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(inputs_labels_generator, steps_per_epoch=100)
```
Here is the expected result:
100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905
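Keras's `fit_generator` expects its generator to loop over the data indefinitely, whereas a generator that walks a BED file once stops at the end of the file. If training needs more batches than a single pass provides, one option is a small wrapper that restarts the generator. This is a sketch: `repeat_forever`, `make_generator`, and `toy_batches` are hypothetical names and not part of genomelake or Keras.

```python
def repeat_forever(make_generator):
    """Yield items from a fresh generator over and over.

    make_generator is a zero-argument callable that returns a new
    generator each time, so every pass starts from the beginning.
    """
    while True:
        produced = False
        for item in make_generator():
            produced = True
            yield item
        if not produced:
            # Stop instead of spinning forever on an empty data source.
            return


# Toy stand-in for a batch generator (three "batches" per pass).
def toy_batches():
    for name in ["batch0", "batch1", "batch2"]:
        yield name


endless = repeat_forever(toy_batches)
first_five = [next(endless) for _ in range(5)]
print(first_five)  # ['batch0', 'batch1', 'batch2', 'batch0', 'batch1']
```

To apply this to the training example, wrap the call that builds the input-and-label generator in a `lambda` and pass the result of `repeat_forever` to `fit_generator`.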
License
genomelake is released under the BSD-3 license. See LICENSE for details.
Owner
- Name: Kundaje Lab
- Login: kundajelab
- Kind: organization
- Location: Stanford University
- Website: http://anshul.kundaje.net
- Repositories: 117
- Profile: https://github.com/kundajelab
Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.
GitHub Events
Total
- Fork event: 2
Last Year
- Fork event: 2
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Johnny Israeli | j****i@s****u | 18 |
| Žiga Avsec | A****z@u****m | 6 |
| Chris Probert | c****t@g****m | 1 |
| Joren Retel | j****l@h****m | 1 |
| daquang | d****g@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 8
- Total pull requests: 15
- Average time to close issues: 4 days
- Average time to close pull requests: 10 days
- Total issue authors: 6
- Total pull request authors: 6
- Average comments per issue: 2.88
- Average comments per pull request: 2.6
- Merged pull requests: 13
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- AvantiShri (2)
- Avsecz (2)
- DebadityaPal (1)
- mmtrebuchet (1)
- berkuva (1)
- mmingay2 (1)
Pull Request Authors
- jisraeli (7)
- Avsecz (4)
- manyu90 (1)
- daquang (1)
- chrisprobert (1)
- jorenretel (1)
Packages
- Total packages: 1
- Total downloads: 95 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 2
- Total versions: 4
- Total maintainers: 1
pypi.org: genomelake
Simple and efficient random access to genomic data for deep learning models.
- Homepage: https://github.com/kundajelab/genomelake
- Documentation: https://genomelake.readthedocs.io/
- License: BSD-3
- Latest release: 0.1.4 (published about 8 years ago)
Dependencies
- bcolz >=1.1
- numpy *
- pyBigWig *
- pybedtools *
- pysam *
- six >=1.9.0
- tiledb >=0.2.0