SeqArray

Data management of large-scale whole-genome sequence variant calls using GDS files (Development version only)

https://github.com/zhengxwen/seqarray

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

bioinformatics gds-format r snp snv wes wgs

Keywords from Contributors

pca simd bioconductor-package genomics
Last synced: 9 months ago

Repository

Basic Info
Statistics
  • Stars: 46
  • Watchers: 5
  • Forks: 12
  • Open Issues: 20
  • Releases: 11
Topics
bioinformatics gds-format r snp snv wes wgs
Created over 11 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog

README.md

SeqArray: Data management of large-scale whole-genome sequence variant calls using GDS files

License: GNU General Public License, GPLv3

Features

Data management of whole-genome sequence variant calls for hundreds of thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations are stored in SeqArray GDS files in an array-oriented, compressed manner, with efficient data access from the R programming language.

The SeqArray package is built on top of the Genomic Data Structure (GDS) data format and defines the data structure required for a SeqArray file. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable, array-oriented data sets. It is suited to large-scale datasets, especially data that are much larger than the available random-access memory. It also offers efficient operations designed specifically for integers of fewer than 8 bits, since a diploid genotype usually occupies less than one byte. Data compression and decompression are available with relatively efficient random access. A high-level R interface to GDS files is provided by the package gdsfmt.
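To illustrate the sub-byte storage described above, here is a minimal sketch using the gdsfmt package directly. The node name `geno`, the toy matrix, and the temporary file are assumptions made for illustration; they are not part of the SeqArray file layout.

```R
library(gdsfmt)

# Toy genotype-like matrix with values in 0..3, which fit in 2 bits
geno <- matrix(rep(0:3, 5), nrow = 4)

# Create a temporary GDS file (hypothetical file, removed at the end)
fn <- tempfile(fileext = ".gds")
f <- createfn.gds(fn)

# Store the matrix as packed 2-bit integers with random-access LZMA compression
node <- add.gdsn(f, "geno", geno, storage = "bit2",
    compress = "LZMA_RA", closezip = TRUE)

# Show the node description (storage type and dimensions)
print(node)

# Read the data back as ordinary R integers
back <- read.gdsn(index.gdsn(f, "geno"))

closefn.gds(f)
unlink(fn)
```

The values come back as a regular integer matrix, so downstream R code does not need to know how the data were packed on disk.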

Bioconductor:

Release Version: v1.48.0

http://www.bioconductor.org/packages/SeqArray

Citation

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.

Installation (requires R ≥ 3.5.0)

  • Bioconductor repository:

```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqArray")
```

  • Development version from GitHub (for developers/testers only):

```R
library("devtools")
install_github("zhengxwen/gdsfmt")
install_github("zhengxwen/SeqArray")
```

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

  • Install the package from the source code (gdsfmt, SeqArray):

```sh
wget --no-check-certificate https://github.com/zhengxwen/gdsfmt/tarball/master -O gdsfmtlatest.tar.gz
wget --no-check-certificate https://github.com/zhengxwen/SeqArray/tarball/master -O SeqArraylatest.tar.gz
R CMD INSTALL gdsfmtlatest.tar.gz
R CMD INSTALL SeqArraylatest.tar.gz
```

Or

```sh
curl -L https://github.com/zhengxwen/gdsfmt/tarball/master/ -o gdsfmtlatest.tar.gz
curl -L https://github.com/zhengxwen/SeqArray/tarball/master/ -o SeqArraylatest.tar.gz
R CMD INSTALL gdsfmtlatest.tar.gz
R CMD INSTALL SeqArraylatest.tar.gz
```

Examples

```R
library(SeqArray)

gds.fn <- seqExampleFileName("gds")

# open a GDS file
f <- seqOpen(gds.fn)

# display the contents of the GDS file
f

# close the file
seqClose(f)
```

```R
## Object of class "SeqVarGDSClass"
## File: SeqArray/extdata/CEU_Exon.gds (298.6K)
## +    [  ] *
## |--+ description   [  ] *
## |--+ sample.id   { Str8 90 LZMA_ra(35.8%), 258B } *
## |--+ variant.id   { Int32 1348 LZMA_ra(16.8%), 906B } *
## |--+ position   { Int32 1348 LZMA_ra(64.6%), 3.4K } *
## |--+ chromosome   { Str8 1348 LZMA_ra(4.63%), 158B } *
## |--+ allele   { Str8 1348 LZMA_ra(16.7%), 902B } *
## |--+ genotype   [  ] *
## |  |--+ data   { Bit2 2x90x1348 LZMA_ra(26.3%), 15.6K } *
## |  |--+ ~data   { Bit2 2x1348x90 LZMA_ra(29.3%), 17.3K }
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
## |  \--+ extra   { Int16 0 LZMA_ra, 19B }
## |--+ phase   [  ]
## |  |--+ data   { Bit1 90x1348 LZMA_ra(0.91%), 138B } *
## |  |--+ ~data   { Bit1 1348x90 LZMA_ra(0.91%), 138B }
## |  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
## |  \--+ extra   { Bit1 0 LZMA_ra, 19B }
## |--+ annotation   [  ]
## |  |--+ id   { Str8 1348 LZMA_ra(38.4%), 5.5K } *
## |  |--+ qual   { Float32 1348 LZMA_ra(2.26%), 122B } *
## |  |--+ filter   { Int32,factor 1348 LZMA_ra(2.26%), 122B } *
## |  |--+ info   [  ]
## |  |  |--+ AA   { Str8 1348 LZMA_ra(25.6%), 690B } *
## |  |  |--+ AC   { Int32 1348 LZMA_ra(24.2%), 1.3K } *
## |  |  |--+ AN   { Int32 1348 LZMA_ra(19.8%), 1.0K } *
## |  |  |--+ DP   { Int32 1348 LZMA_ra(47.9%), 2.5K } *
## |  |  |--+ HM2   { Bit1 1348 LZMA_ra(150.3%), 254B } *
## |  |  |--+ HM3   { Bit1 1348 LZMA_ra(150.3%), 254B } *
## |  |  |--+ OR   { Str8 1348 LZMA_ra(20.1%), 342B } *
## |  |  |--+ GP   { Str8 1348 LZMA_ra(24.4%), 3.8K } *
## |  |  \--+ BN   { Int32 1348 LZMA_ra(20.9%), 1.1K } *
## |  \--+ format   [  ]
## |     \--+ DP   [  ] *
## |        |--+ data   { Int32 90x1348 LZMA_ra(25.1%), 118.8K } *
## |        \--+ ~data   { Int32 1348x90 LZMA_ra(24.1%), 114.2K }
## \--+ sample.annotation   [  ]
##    \--+ family   { Str8 90 LZMA_ra(57.1%), 222B }
```
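The nodes in the listing above can be read by name with `seqGetData()`. The following is a minimal sketch against the bundled example file; the choice of nodes is illustrative only.

```R
library(SeqArray)

# Open the example file shipped with the package
f <- seqOpen(seqExampleFileName("gds"))

# Top-level nodes are addressed by their name
samp <- seqGetData(f, "sample.id")
pos  <- seqGetData(f, "position")
chr  <- seqGetData(f, "chromosome")

# Nested nodes are addressed by their path in the hierarchy
dp <- seqGetData(f, "annotation/info/DP")

# Summarise what was read
cat("samples:", length(samp), " variants:", length(pos), "\n")
print(table(chr))

seqClose(f)
```

Each call returns an ordinary R vector, so the results can be combined with standard data-frame or vector operations.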

Key Functions in the SeqArray Package

| Function | Description |
|:--------------|:-------------------------------------------|
| seqVCF2GDS | Reformat VCF files |
| seqSetFilter | Define a data subset of samples or variants |
| seqGetData | Get data from a SeqArray file with a defined filter |
| seqApply | Apply a user-defined function over array margins |
| seqBlockApply | Apply a user-defined function over array margins via blocking |
| seqParallel | Apply functions in parallel |
| ... | |
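As a rough sketch of how several of these functions fit together, the example below filters the bundled example file and then computes per-variant statistics. The subset sizes (10 samples, 100 variants), the block size, and the two statistics are arbitrary choices made for illustration.

```R
library(SeqArray)

f <- seqOpen(seqExampleFileName("gds"))

# Restrict all subsequent reads to the first 10 samples and 100 variants
samp <- seqGetData(f, "sample.id")
vars <- seqGetData(f, "variant.id")
seqSetFilter(f, sample.id = samp[1:10], variant.id = vars[1:100])

# Genotypes under the filter: ploidy x samples x variants
geno <- seqGetData(f, "genotype")
print(dim(geno))

# Reference allele frequency, one variant at a time
# (each call receives a ploidy x samples matrix; 0 is the reference allele)
ref_af <- seqApply(f, "genotype",
    function(g) mean(g == 0L, na.rm = TRUE),
    margin = "by.variant", as.is = "double")

# Missing-genotype rate, computed on blocks of 50 variants at a time
# (each call receives a ploidy x samples x block array)
miss <- seqBlockApply(f, "genotype",
    function(g) colMeans(is.na(g), dims = 2L),
    margin = "by.variant", as.is = "unlist", bsize = 50L)

# Clear the filter and close the file
seqResetFilter(f)
seqClose(f)
```

Processing in blocks trades a little memory for fewer function calls, which tends to matter more as the number of variants grows.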

File Format Conversion
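
A minimal sketch of a round trip between VCF and SeqArray GDS, assuming the example VCF bundled with the package; the temporary output paths are hypothetical and are deleted at the end.

```R
library(SeqArray)

# Example VCF shipped with the package
vcf.fn <- seqExampleFileName("vcf")

# Temporary output files (hypothetical paths for this sketch)
gds.fn <- tempfile(fileext = ".gds")
out.vcf <- tempfile(fileext = ".vcf")

# VCF -> SeqArray GDS
seqVCF2GDS(vcf.fn, gds.fn, verbose = FALSE)

# Inspect the converted file, then export it back: GDS -> VCF
f <- seqOpen(gds.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqGDS2VCF(f, out.vcf, verbose = FALSE)
seqClose(f)

vcf.written <- file.exists(out.vcf)
unlink(c(gds.fn, out.vcf))
```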

SeqArray GDS File Downloads

See Also

  • JSeqArray.jl: Data manipulation of whole-genome sequencing variant data in Julia
  • PySeqArray: Data manipulation of whole-genome sequencing variant data in Python

Owner

  • Name: Xiuwen Zheng
  • Login: zhengxwen
  • Kind: user
  • Location: Chicago

GitHub Events

Total
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 5
  • Push event: 80
  • Create event: 1
Last Year
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 5
  • Push event: 80
  • Create event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 742
  • Total Committers: 6
  • Avg Commits per committer: 123.667
  • Development Distribution Score (DDS): 0.034
Past Year
  • Commits: 49
  • Committers: 1
  • Avg Commits per committer: 49.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|:-----|:------|--------:|
| Xiuwen Zheng | z****n@g****m | 717 |
| Stephanie M. Gogarten | s****n@g****m | 14 |
| Bioconductor Git-SVN Bridge | b****c@b****g | 6 |
| Xiuwen Zheng | x****g@a****m | 2 |
| smgogarten | s****n@v****a | 2 |
| Martin Morgan | m****n@r****g | 1 |
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 83
  • Total pull requests: 14
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 46
  • Total pull request authors: 3
  • Average comments per issue: 2.2
  • Average comments per pull request: 0.14
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: about 8 hours
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 1.25
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • thierrygosselin (13)
  • smgogarten (11)
  • zhengxwen (9)
  • AAvalos82 (4)
  • gustavahlberg (2)
  • iaia87 (2)
  • jemunro (2)
  • boboppie (2)
  • fizwit (1)
  • BELKHIR (1)
  • chisqr (1)
  • pezhmansafdari (1)
  • terrryliu (1)
  • annaquaglieri16 (1)
  • jjfarrell (1)
Pull Request Authors
  • smgogarten (11)
  • zhengxwen (2)
  • mtmorgan (1)
Top Labels
Issue Labels
  • bug (21)
  • feature required (5)
  • enhancement (2)
  • question (2)
  • help wanted (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • bioconductor 116,950 total
  • Total dependent packages: 11
  • Total dependent repositories: 0
  • Total versions: 16
  • Total maintainers: 1
bioconductor.org: SeqArray

Data management of large-scale whole-genome sequence variant calls using GDS files

  • Versions: 16
  • Dependent Packages: 11
  • Dependent Repositories: 0
  • Downloads: 116,950 Total
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 3.6%
Downloads: 10.8%
Maintainers (1)
Last synced: 9 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5.0 depends
  • gdsfmt >= 1.31.1 depends
  • Biostrings * imports
  • GenomeInfoDb * imports
  • GenomicRanges * imports
  • IRanges * imports
  • S4Vectors * imports
  • methods * imports
  • parallel * imports
  • Biobase * suggests
  • BiocGenerics * suggests
  • BiocParallel * suggests
  • RUnit * suggests
  • Rcpp * suggests
  • Rsamtools * suggests
  • SNPRelate * suggests
  • VariantAnnotation * suggests
  • crayon * suggests
  • digest * suggests
  • knitr * suggests
  • markdown * suggests
  • rmarkdown * suggests
.github/workflows/r.yml actions
  • actions/checkout v3 composite
  • r-lib/actions/setup-r f57f1301a053485946083d7a45022b278929a78a composite