SeqArray
Data management of large-scale whole-genome sequence variant calls using GDS files (Development version only)
Repository
Basic Info
- Host: GitHub
- Owner: zhengxwen
- Language: C++
- Default Branch: master
- Homepage: http://www.bioconductor.org/packages/SeqArray
- Size: 9.86 MB
Statistics
- Stars: 46
- Watchers: 5
- Forks: 12
- Open Issues: 20
- Releases: 11
Metadata Files
README.md
SeqArray: Data management of large-scale whole-genome sequence variant calls using GDS files
GNU General Public License, GPLv3
Features
Data management of whole-genome sequence variant calls with hundreds of thousands of individuals: genotypic data (e.g., SNVs, indels and structural variation calls) and annotations in SeqArray GDS files are stored in an array-oriented and compressed manner, with efficient data access using the R programming language.
The SeqArray package is built on top of the Genomic Data Structure (GDS) data format and defines the required data structure for a SeqArray file. GDS is a flexible and portable data container with a hierarchical structure for storing multiple scalable array-oriented data sets. It is suited to large-scale datasets, especially data much larger than the available random-access memory. It also offers efficient operations specifically designed for integers of less than 8 bits, since a diploid genotype usually occupies fewer bits than a byte. Data compression and decompression are available with relatively efficient random access. A high-level R interface to GDS files is available in the package gdsfmt.
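To make the sub-byte storage concrete, here is a minimal sketch that writes a toy genotype matrix as 2-bit integers with gdsfmt and reads it back. The toy matrix, the node name `geno`, and the temporary file are illustrative assumptions, not part of any SeqArray file layout.

```R
library(gdsfmt)

# Toy genotype matrix (assumption): 4 samples x 6 variants, values 0..3 fit in 2 bits
set.seed(1)
g <- matrix(sample(0:3, 24, replace = TRUE), nrow = 4)

# Scratch file for illustration only
fn <- tempfile(fileext = ".gds")
gfile <- createfn.gds(fn)

# Store the matrix as 2-bit unsigned integers with LZMA random-access compression
add.gdsn(gfile, "geno", val = g, storage = "bit2",
    compress = "LZMA_RA", closezip = TRUE)
closefn.gds(gfile)

# Reopen the file and read the node back as ordinary R integers
gfile <- openfn.gds(fn)
back <- read.gdsn(index.gdsn(gfile, "geno"))
closefn.gds(gfile)
unlink(fn)
```

The values read back should match the matrix that was written; only the on-disk representation uses 2 bits per entry.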
Bioconductor:
Release Version: v1.48.0
http://www.bioconductor.org/packages/SeqArray
- Help Documents
- Tutorials: Data Management, R Integration, Overview Slides
- News
Citation
Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.
Installation (requiring ≥ R_v3.5.0)
Bioconductor repository:
```R
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqArray")
```

Development version from GitHub (for developers/testers only):

```R
library("devtools")
install_github("zhengxwen/gdsfmt")
install_github("zhengxwen/SeqArray")
```

The `install_github()` approach requires that you build from source, i.e. `make` and compilers must be installed on your system; see the R FAQ for your operating system. You may also need to install dependencies manually.

Install the package from the source code: gdsfmt, SeqArray

```sh
wget --no-check-certificate https://github.com/zhengxwen/gdsfmt/tarball/master -O gdsfmtlatest.tar.gz
wget --no-check-certificate https://github.com/zhengxwen/SeqArray/tarball/master -O SeqArraylatest.tar.gz
R CMD INSTALL gdsfmtlatest.tar.gz
R CMD INSTALL SeqArraylatest.tar.gz

# Or

curl -L https://github.com/zhengxwen/gdsfmt/tarball/master/ -o gdsfmtlatest.tar.gz
curl -L https://github.com/zhengxwen/SeqArray/tarball/master/ -o SeqArraylatest.tar.gz
R CMD INSTALL gdsfmtlatest.tar.gz
R CMD INSTALL SeqArraylatest.tar.gz
```
Examples
```R
library(SeqArray)

gds.fn <- seqExampleFileName("gds")

# open a GDS file
f <- seqOpen(gds.fn)

# display the contents of the GDS file
f

# close the file
seqClose(f)
```
```R
Object of class "SeqVarGDSClass"
File: SeqArray/extdata/CEU_Exon.gds (298.6K)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 90 LZMA_ra(35.8%), 258B } *
|--+ variant.id   { Int32 1348 LZMA_ra(16.8%), 906B } *
|--+ position   { Int32 1348 LZMA_ra(64.6%), 3.4K } *
|--+ chromosome   { Str8 1348 LZMA_ra(4.63%), 158B } *
|--+ allele   { Str8 1348 LZMA_ra(16.7%), 902B } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x90x1348 LZMA_ra(26.3%), 15.6K } *
|  |--+ ~data   { Bit2 2x1348x90 LZMA_ra(29.3%), 17.3K }
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 90x1348 LZMA_ra(0.91%), 138B } *
|  |--+ ~data   { Bit1 1348x90 LZMA_ra(0.91%), 138B }
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 1348 LZMA_ra(38.4%), 5.5K } *
|  |--+ qual   { Float32 1348 LZMA_ra(2.26%), 122B } *
|  |--+ filter   { Int32,factor 1348 LZMA_ra(2.26%), 122B } *
|  |--+ info   [  ]
|  |  |--+ AA   { Str8 1348 LZMA_ra(25.6%), 690B } *
|  |  |--+ AC   { Int32 1348 LZMA_ra(24.2%), 1.3K } *
|  |  |--+ AN   { Int32 1348 LZMA_ra(19.8%), 1.0K } *
|  |  |--+ DP   { Int32 1348 LZMA_ra(47.9%), 2.5K } *
|  |  |--+ HM2   { Bit1 1348 LZMA_ra(150.3%), 254B } *
|  |  |--+ HM3   { Bit1 1348 LZMA_ra(150.3%), 254B } *
|  |  |--+ OR   { Str8 1348 LZMA_ra(20.1%), 342B } *
|  |  |--+ GP   { Str8 1348 LZMA_ra(24.4%), 3.8K } *
|  |  \--+ BN   { Int32 1348 LZMA_ra(20.9%), 1.1K } *
|  \--+ format   [  ]
|     \--+ DP   [  ] *
|        |--+ data   { Int32 90x1348 LZMA_ra(25.1%), 118.8K } *
|        \--+ ~data   { Int32 1348x90 LZMA_ra(24.1%), 114.2K }
\--+ sample.annotation   [  ]
   \--+ family   { Str8 90 LZMA_ra(57.1%), 222B }
```
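As a rough check on the `genotype/data` line in the listing, the arithmetic below recomputes its size from the printed dimensions. It is a sketch under two assumptions: that `K` in the listing means 1024 bytes, and that the percentage in parentheses is the compressed size relative to the raw size. The printed ratio is rounded, so the result is approximate.

```R
# Dimensions printed for genotype/data: ploidy x samples x variants
dims <- c(ploidy = 2, samples = 90, variants = 1348)
n_alleles <- prod(dims)                 # number of stored allele calls

# Raw size at 2 bits per allele call (Bit2) versus 32 bits (Int32)
raw_bytes_bit2  <- n_alleles * 2 / 8
raw_bytes_int32 <- n_alleles * 32 / 8

# Apply the printed compression ratio of 26.3% (assumed: compressed / raw)
compressed_kib <- raw_bytes_bit2 * 0.263 / 1024

cat("allele calls:      ", n_alleles, "\n")
cat("raw Bit2 bytes:    ", raw_bytes_bit2, "\n")
cat("raw Int32 bytes:   ", raw_bytes_int32, "\n")
cat("compressed (KiB):  ", round(compressed_kib, 1), "\n")
```

Under these assumptions the estimate comes out close to the 15.6K reported in the listing, and the 2-bit encoding is one sixteenth the raw size of a 32-bit encoding.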
Key Functions in the SeqArray Package
| Function      | Description                                |
|:--------------|:-------------------------------------------|
| seqVCF2GDS    | Reformat VCF files |
| seqSetFilter  | Define a data subset of samples or variants |
| seqGetData    | Get data from a SeqArray file with a defined filter |
| seqApply      | Apply a user-defined function over array margins |
| seqBlockApply | Apply a user-defined function over array margins via blocking |
| seqParallel   | Apply functions in parallel |
| ...           | |
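The functions in the table are typically combined as filter, then read, then apply. The following is a minimal sketch on the example file shipped with the package; the choice of the first 10 samples and 100 variants, and the per-variant statistic, are arbitrary illustrations.

```R
library(SeqArray)

# Open the example GDS file shipped with the package
f <- seqOpen(seqExampleFileName("gds"))

# Define a subset: the first 10 samples and the first 100 variants
sample.id  <- seqGetData(f, "sample.id")
variant.id <- seqGetData(f, "variant.id")
seqSetFilter(f, sample.id = sample.id[1:10], variant.id = variant.id[1:100])

# Read genotypes for the current filter: ploidy x samples x variants
geno <- seqGetData(f, "genotype")
dim(geno)

# Apply a function variant by variant: fraction of reference alleles (coded 0)
ref_af <- seqApply(f, "genotype",
    FUN = function(x) mean(x == 0L, na.rm = TRUE),
    margin = "by.variant", as.is = "double")

# Clear the filter and close the file
seqResetFilter(f)
seqClose(f)
```

`seqBlockApply()` and `seqParallel()` follow the same pattern but process variants in blocks or across workers; see the package help pages for their arguments.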
File Format Conversion
- seqVCF2GDS(): Format conversion from VCF to GDS
- gds2bgen: Format conversion from BGEN to GDS
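A VCF-to-GDS conversion might look like the sketch below, which uses the example VCF file bundled with SeqArray. The temporary output path is an assumption for illustration; in practice you would pass your own input and output file names.

```R
library(SeqArray)

# Example VCF file shipped with the package
vcf.fn <- seqExampleFileName("vcf")

# Scratch output path for illustration only
out.fn <- tempfile(fileext = ".gds")

# Convert VCF to a SeqArray GDS file
seqVCF2GDS(vcf.fn, out.fn, verbose = FALSE)

# Open the result and count the imported variants and samples
f <- seqOpen(out.fn)
n_variants <- length(seqGetData(f, "variant.id"))
n_samples  <- length(seqGetData(f, "sample.id"))
seqClose(f)

cat("variants:", n_variants, " samples:", n_samples, "\n")
unlink(out.fn)
```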
SeqArray GDS File Downloads
See Also
- JSeqArray.jl: Data manipulation of whole-genome sequencing variant data in Julia
- PySeqArray: Data manipulation of whole-genome sequencing variant data in Python
Owner
- Name: Xiuwen Zheng
- Login: zhengxwen
- Kind: user
- Location: Chicago
- Repositories: 13
- Profile: https://github.com/zhengxwen
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Issue comment event: 5
- Push event: 80
- Create event: 1
Last Year
- Issues event: 1
- Watch event: 3
- Issue comment event: 5
- Push event: 80
- Create event: 1
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Xiuwen Zheng | z****n@g****m | 717 |
| Stephanie M. Gogarten | s****n@g****m | 14 |
| Bioconductor Git-SVN Bridge | b****c@b****g | 6 |
| Xiuwen Zheng | x****g@a****m | 2 |
| smgogarten | s****n@v****a | 2 |
| Martin Morgan | m****n@r****g | 1 |
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 83
- Total pull requests: 14
- Average time to close issues: about 1 month
- Average time to close pull requests: about 3 hours
- Total issue authors: 46
- Total pull request authors: 3
- Average comments per issue: 2.2
- Average comments per pull request: 0.14
- Merged pull requests: 13
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: about 8 hours
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 0
- Average comments per issue: 1.25
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- thierrygosselin (13)
- smgogarten (11)
- zhengxwen (9)
- AAvalos82 (4)
- gustavahlberg (2)
- iaia87 (2)
- jemunro (2)
- boboppie (2)
- fizwit (1)
- BELKHIR (1)
- chisqr (1)
- pezhmansafdari (1)
- terrryliu (1)
- annaquaglieri16 (1)
- jjfarrell (1)
Pull Request Authors
- smgogarten (11)
- zhengxwen (2)
- mtmorgan (1)
Packages
- Total packages: 1
- Total downloads:
  - bioconductor: 116,950 total
- Total dependent packages: 11
- Total dependent repositories: 0
- Total versions: 16
- Total maintainers: 1
bioconductor.org: SeqArray
Data management of large-scale whole-genome sequence variant calls using GDS files
- Homepage: https://github.com/zhengxwen/SeqArray
- Documentation: https://bioconductor.org/packages/release/bioc/vignettes/SeqArray/inst/doc/SeqArray.pdf
- License: GPL-3
- Latest release: 1.48.2 (published 9 months ago)
Dependencies
- R >= 3.5.0 depends
- gdsfmt >= 1.31.1 depends
- Biostrings * imports
- GenomeInfoDb * imports
- GenomicRanges * imports
- IRanges * imports
- S4Vectors * imports
- methods * imports
- parallel * imports
- Biobase * suggests
- BiocGenerics * suggests
- BiocParallel * suggests
- RUnit * suggests
- Rcpp * suggests
- Rsamtools * suggests
- SNPRelate * suggests
- VariantAnnotation * suggests
- crayon * suggests
- digest * suggests
- knitr * suggests
- markdown * suggests
- rmarkdown * suggests
- actions/checkout v3 composite
- r-lib/actions/setup-r f57f1301a053485946083d7a45022b278929a78a composite