https://github.com/kundajelab/genomelake

Simple and efficient access to genomic data for deep learning models.


Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: kundajelab
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 3.95 MB
Statistics
  • Stars: 42
  • Watchers: 13
  • Forks: 18
  • Open Issues: 6
  • Releases: 0
Created about 8 years ago · Last pushed over 6 years ago
Metadata Files
Readme License

README.md

genomelake


Efficient random access to genomic data for deep learning models.

Supports the following types of input data:

  • bigwig
  • DNA sequence

genomelake extracts signal from genomic inputs in provided BED intervals.
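To make "extracting signal in BED intervals" concrete, here is a minimal sketch of what an extractor conceptually produces for DNA sequence: one fixed-size numeric array per interval, stacked into a batch. It does not use genomelake itself. The `toy_genome` dictionary, the `one_hot_interval` helper, the A/C/G/T column order, and the all-zero encoding of ambiguous bases are assumptions made for illustration and may differ from genomelake's own encoding.

```python
import numpy as np

# Hypothetical stand-in for a genome: chromosome name -> sequence string.
toy_genome = {"chr1": "ACGTNACGTT"}

# Column order A, C, G, T is an assumption made for this sketch.
BASES = "ACGT"


def one_hot_interval(genome, chrom, start, end):
    """Return an (end - start, 4) array for a half-open BED interval."""
    seq = genome[chrom][start:end].upper()
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for row, base in enumerate(seq):
        col = BASES.find(base)
        if col >= 0:  # ambiguous bases such as N are left as all zeros
            encoded[row, col] = 1.0
    return encoded


# Two equal-width intervals stack into a (batch, length, 4) array.
batch = np.stack([
    one_hot_interval(toy_genome, "chr1", 0, 5),
    one_hot_interval(toy_genome, "chr1", 5, 10),
])
print(batch.shape)
```

The resulting `(batch, length, 4)` layout is the same shape convention that the Keras example in the README expects through `input_shape=(101, 4)`.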

Requirements

  • python 2.7, 3.5, or 3.6
  • tiledb
  • bcolz
  • cython
  • numpy
  • pybedtools
  • pysam

Installation

Clone the repository and run:

python setup.py install

Getting started: training a protein-DNA binding model

Extract genome-wide sequence data into a genomelake data source:

```python
from genomelake.backend import extract_fasta_to_file

# Reference genome FASTA and the output directory for the genomelake data source.
genome_fasta = "/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa"
genome_data_directory = "./hg19_data_directory"
extract_fasta_to_file(genome_fasta, genome_data_directory)
```

Using a BED intervals file with labels, a genome data source, and genomelake's ArrayExtractor, generate input DNA sequences and labels:

```python
import pybedtools
from genomelake.extractors import ArrayExtractor
import numpy as np


def batch_iter(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    try:
        while True:
            values = []
            for n in range(batch_size):
                values.append(next(it))
            yield values
    except StopIteration:
        # Emit the final partial batch, but never an empty one.
        if values:
            yield values


def generate_inputs_and_labels(intervals_file, data_source, batch_size=128):
    bt = pybedtools.BedTool(intervals_file)
    extractor = ArrayExtractor(data_source)
    intervals_generator = batch_iter(bt, batch_size)
    for intervals_batch in intervals_generator:
        inputs = extractor(intervals_batch)
        # The name column of each BED interval holds its label.
        labels = []
        for interval in intervals_batch:
            labels.append(float(interval.name))
        labels = np.array(labels)
        yield inputs, labels
```
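The batching and label-parsing logic can be checked without a genome on disk. The sketch below is a self-contained variant that runs on the standard library alone: the `Interval` namedtuple is a hypothetical stand-in for a pybedtools interval (only the fields used here), and `batch_iter` is redefined locally so the example stands on its own.

```python
from collections import namedtuple

# Hypothetical stand-in for a pybedtools interval; only the fields used here.
Interval = namedtuple("Interval", ["chrom", "start", "stop", "name"])


def batch_iter(iterable, batch_size):
    """Yield lists of up to batch_size items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        values = []
        try:
            for _ in range(batch_size):
                values.append(next(it))
        except StopIteration:
            if values:  # skip an empty trailing batch
                yield values
            return
        yield values


# Five toy 101 bp intervals whose name column carries a 0/1 label.
intervals = [
    Interval("chr22", 100 * i, 100 * i + 101, str(i % 2)) for i in range(5)
]

batches = list(batch_iter(intervals, batch_size=2))
batch_sizes = [len(b) for b in batches]
labels = [[float(iv.name) for iv in b] for b in batches]

print(batch_sizes)  # [2, 2, 1]
print(labels)       # [[0.0, 1.0], [0.0, 1.0], [0.0]]
```

Note that the last batch is smaller than `batch_size` whenever the number of intervals is not a multiple of it, so downstream code should not assume a fixed batch dimension.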

Train a Keras model of JUND binding to DNA using 101 base pair intervals and labels in ./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz:

```python
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

intervals_file = "./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz"
inputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)

# One convolutional layer (15 filters of width 25) over one-hot DNA, then a sigmoid output.
model = Sequential()
model.add(Conv1D(15, 25, input_shape=(101, 4)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(inputs_labels_generator, steps_per_epoch=100)
```

Here is the expected result:

100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905
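As a sanity check on the architecture, the layer shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for `Conv1D` (valid padding, stride 1, bias enabled) and for `Dense` (bias enabled); the variable names are illustrative.

```python
# Input: 101 positions, 4 one-hot channels.
seq_len, channels = 101, 4
# Conv1D(15, 25): 15 filters, each spanning 25 positions.
filters, kernel_size = 15, 25

# Valid padding with stride 1 shortens the sequence by kernel_size - 1.
conv_len = seq_len - kernel_size + 1                       # 77
conv_params = kernel_size * channels * filters + filters   # 1500 weights + 15 biases

# Flatten concatenates all filter outputs across positions.
flat_units = conv_len * filters                            # 1155
dense_params = flat_units * 1 + 1                          # weights + 1 bias

total_params = conv_params + dense_params
print(conv_len, flat_units, total_params)  # 77 1155 2671
```

With only a few thousand parameters, the model is small enough that 100 steps of 128 intervals each is a plausible amount of training for a quick demonstration.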

License

genomelake is released under the BSD-3 license. See LICENSE for details.

Owner

  • Name: Kundaje Lab
  • Login: kundajelab
  • Kind: organization
  • Location: Stanford University

Compbio and machine learning code repositories from the Kundaje Lab at Stanford Genetics and Computer Science Depts.

GitHub Events

Total
  • Fork event: 2
Last Year
  • Fork event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 27
  • Total Committers: 5
  • Avg Commits per committer: 5.4
  • Development Distribution Score (DDS): 0.333
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Johnny Israeli j****i@s****u 18
Žiga Avsec A****z@u****m 6
Chris Probert c****t@g****m 1
Joren Retel j****l@h****m 1
daquang d****g@g****m 1
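
The all-time summary figures can be reproduced from the commit counts in this table. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; the page does not state its formula, so treat that definition as an assumption.

```python
# Commit counts copied from the Top Committers table.
commits = {
    "Johnny Israeli": 18,
    "Žiga Avsec": 6,
    "Chris Probert": 1,
    "Joren Retel": 1,
    "daquang": 1,
}

total_commits = sum(commits.values())
avg_per_committer = total_commits / len(commits)

# Assumed definition: 1 - (commits by the top committer / total commits).
dds = 1 - max(commits.values()) / total_commits

print(total_commits)                # 27
print(round(avg_per_committer, 1))  # 5.4
print(round(dds, 3))                # 0.333
```

Under this definition, a score near 0 means one person wrote almost everything, and higher values mean the work is spread more evenly across committers.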

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 8
  • Total pull requests: 15
  • Average time to close issues: 4 days
  • Average time to close pull requests: 10 days
  • Total issue authors: 6
  • Total pull request authors: 6
  • Average comments per issue: 2.88
  • Average comments per pull request: 2.6
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • AvantiShri (2)
  • Avsecz (2)
  • DebadityaPal (1)
  • mmtrebuchet (1)
  • berkuva (1)
  • mmingay2 (1)
Pull Request Authors
  • jisraeli (7)
  • Avsecz (4)
  • manyu90 (1)
  • daquang (1)
  • chrisprobert (1)
  • jorenretel (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 95 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 4
  • Total maintainers: 1
pypi.org: genomelake

Simple and efficient random access to genomic data for deep learning models.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 95 Last month
Rankings
Forks count: 9.1%
Stargazers count: 10.1%
Dependent packages count: 10.1%
Dependent repos count: 11.5%
Average: 12.8%
Downloads: 23.0%
Maintainers (1)
Last synced: 7 months ago

Dependencies

setup.py pypi
  • bcolz >=1.1
  • numpy *
  • pyBigWig *
  • pybedtools *
  • pysam *
  • six >=1.9.0
  • tiledb >=0.2.0