https://github.com/camilogarciabotero/biovossencoder.jl
A small Julia package to represent BioSequences as a Voss matrix
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: camilogarciabotero
- License: mit
- Language: Julia
- Default Branch: main
- Homepage: https://camilogarciabotero.github.io/BioVossEncoder.jl/dev
- Size: 1.22 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 6
Topics
Metadata Files
README.md
Encoding biological sequences into Voss representation
BioVossEncoder
A Julia package for encoding biological sequences into Voss representations
- Provides the fastest encoding of biological sequences into Voss representations (also known as one-hot vectors) among the approaches compared in the Benchmarks section.
- Can encode all `BioSequence` types and `String`s with unambiguous nucleotides and amino acids.
- Can handle ambiguous nucleotides and amino acids from a `BioSequence`.
- Provides a simple and intuitive API for encoding biological sequences.
- Includes a dedicated type, `VossEncoder`, that matches the `BioSequences` types.
- Can be used for single-nucleotide encoding: `vv = vossvector(dna"ACGT", DNA_A)`.
> [!WARNING]
> This package uses internals of the `BioSequences` and `BitMatrix` types, which might not be stable. Use with caution.
Installation
BioVossEncoder is a Julia Language package. To install BioVossEncoder, open Julia's interactive session (known as the REPL), press the ] key to enter package mode, then type the following command:

```julia
pkg> add BioVossEncoder
```
Encoding BioSequences
This package provides a simple and fast way to encode biological sequences into Voss representations. The main struct provided by this package is `VossEncoder`, a wrapper around a `BitMatrix` that stores the bit-matrix encoding of a biological sequence together with its corresponding alphabet. The following example shows how to encode a DNA sequence into a Voss matrix.
```julia
julia> using BioSequences, BioVossEncoder
```

```julia
julia> seq = dna"ACGT"
```

```julia
julia> VossEncoder(seq)
4×4 Voss Matrix of DNAAlphabet{4}():
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```
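To make the layout of the matrix above concrete, here is a minimal plain-Julia sketch of the same idea: one row per alphabet symbol, one column per sequence position. The function `voss_sketch` and its default alphabet order are hypothetical names chosen for illustration; this is not how BioVossEncoder is implemented, and it works on a `String` rather than a `BioSequence`.

```julia
# Hypothetical helper for illustration; not part of BioVossEncoder's API.
# Rows follow the order of `alphabet`, columns follow the sequence positions.
function voss_sketch(str::AbstractString, alphabet::Vector{Char} = ['A', 'C', 'G', 'T'])
    chars = collect(str)
    M = falses(length(alphabet), length(chars))   # BitMatrix filled with zeros
    for (j, c) in enumerate(chars)
        i = findfirst(==(c), alphabet)
        i === nothing && throw(ArgumentError("symbol '$c' is not in the alphabet"))
        M[i, j] = true                            # mark symbol i at position j
    end
    return M
end

M = voss_sketch("ACGT")   # each column has exactly one `true`
```

For the input `"ACGT"` and the assumed row order A, C, G, T, the result is the 4×4 identity pattern shown above.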
For simplicity the VossEncoder struct provides a property bitmatrix that returns the BitMatrix representation of the sequence.
```julia
julia> VossEncoder(seq).bitmatrix
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```
Similarly, the function `vossmatrix` builds on the `VossEncoder` struct and returns the `BitMatrix` representation of a sequence directly.
```julia
julia> vossmatrix(seq)
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```
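For unambiguous sequences, a Voss matrix holds exactly one set bit per column, so the sequence can be read back from it. The sketch below uses plain Julia only and assumes the row order A, C, G, T suggested by the output above; the names `alphabet` and `decoded` are illustrative and not part of the package.

```julia
# Assumed row order of the matrix (taken from the example output above).
alphabet = ['A', 'C', 'G', 'T']

# The same bit pattern as the 4×4 matrix shown for dna"ACGT".
M = BitMatrix(Bool[1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1])

# For each column, find the row holding the set bit and map it to a symbol.
decoded = String([alphabet[findfirst(col)] for col in eachcol(M)])
```

This only works when every column has a set bit; columns encoding ambiguous symbols may contain several.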
Creating a Voss vector of a sequence
Sometimes it is useful to encode a sequence into a Voss vector (i.e., the bit vector that marks the positions of a single symbol of the corresponding molecule alphabet).
This package provides the function `vossvector`, which returns the Voss vector of a `BioSequence` for a specific symbol (a `BioSymbol`), which can be a DNA or an AA symbol.
```julia
julia> vossvector(seq, DNA_A)
4-element view(::BitMatrix, 1, :) with eltype Bool:
 1
 0
 0
 0
```
Note that, behind the scenes, the output is a view of the `BitMatrix` representation of the sequence. This is done for performance reasons.
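The practical consequence of returning a view can be illustrated with plain Julia arrays. The following sketch does not call BioVossEncoder; it builds a small `BitMatrix` by hand and contrasts `view` (no copy, shares memory with the matrix) with ordinary indexing (allocates a new `BitVector`).

```julia
# Hand-built 4×4 BitMatrix with the same pattern as the Voss matrix of "ACGT".
M = falses(4, 4)
for i in 1:4
    M[i, i] = true
end

row_view = view(M, 1, :)   # no copy: refers to the first row of M
row_copy = M[1, :]         # copy: an independent BitVector

M[1, 2] = true             # mutate the parent matrix

row_view[2]   # true, the view sees the change
row_copy[2]   # false, the copy does not
```

If an independent vector is needed, calling `collect` or `copy` on the view is one way to detach it from the parent matrix.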
Related Ideas
```julia
using BioSymbols, BioSequences

function onehot(s::NucSeq)
    M = falses(4, length(s))
    for (i, nt) in enumerate(s)
        bits = compatbits(nt)
        while !iszero(bits)
            M[trailing_zeros(bits) + 1, i] = true
            bits &= bits - one(bits) # clear lowest set bit
        end
    end
    M
end
```

```julia
julia> onehot(dna"TGNTKCTW-T")
4×10 BitMatrix:
 0  0  1  0  0  0  0  1  0  0
 0  0  1  0  0  1  0  0  0  0
 0  1  1  0  1  0  0  0  0  0
 1  0  1  1  1  0  1  1  0  1
```
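The inner loop of `onehot` walks over the set bits of a small integer: `trailing_zeros` gives the position of the lowest set bit, and `bits &= bits - one(bits)` clears it. The sketch below isolates that trick in plain Julia. The function name `set_bit_positions` is hypothetical, and the example value assumes a 4-bit encoding in which A, C, G and T occupy bits 1 to 4, so that an ambiguity code such as K (G or T) would be `0b1100`.

```julia
# Hypothetical helper: list the 1-based positions of the set bits of `bits`.
function set_bit_positions(bits::UInt8)
    positions = Int[]
    while !iszero(bits)
        push!(positions, trailing_zeros(bits) + 1)  # position of lowest set bit
        bits &= bits - one(bits)                    # clear that bit
    end
    return positions
end

set_bit_positions(0b1100)   # rows 3 and 4 under the assumed encoding
set_bit_positions(0b0000)   # no rows: the column stays all-false
```

This matches the output above, where the column for `K` has the third and fourth rows set and the column for the gap `-` has none.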
```julia
julia> function onehot_reinterpretator(seq::BioSequence)
           seqlen = length(seq)
           modvect = Vector{Int8}(undef, seqlen)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           @inbounds for i in 1:seqlen
               modvect[i] = reinterpreter(seq[i])
           end
           return 1:4 .== permutedims(modvect)
       end
```
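The last line of `onehot_reinterpretator` relies on broadcasting: comparing the column range `1:4` against a 1×N row of row indices yields a 4×N `BitMatrix`. A minimal plain-Julia sketch of that step follows; the vector `codes` is a made-up stand-in for the remapped symbol codes (here assumed to be 1 for A, 2 for C, 3 for G, 4 for T).

```julia
# Hypothetical row indices for the sequence "ACGTA" (A=1, C=2, G=3, T=4).
codes = Int8[1, 2, 3, 4, 1]

# 1:4 acts as a column, permutedims(codes) is a 1×5 row;
# broadcasting == over both gives a 4×5 BitMatrix.
M = 1:4 .== permutedims(codes)

size(M)    # (4, 5)
M[:, 5]    # only the first entry is true, since codes[5] == 1
```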
- SequenceTokenizers.jl: A Julia package for tokenizing biological sequences, providing efficient and flexible tokenization methods for various sequence types. Focused on the `String` type.

```julia
julia> function onehot_tokenizer(str::String)
           alphabet = ['A', 'C', 'G', 'T']
           tokenizer = SequenceTokenizer(alphabet, 'N')
           tokens = tokenizer([str])
           return onehot_batch(tokenizer, tokens)
       end # 5×N×1 Array{Float32, 3}
```

```julia
julia> onehot_tokenizer("ACATCAGCATC")
5×11×1 Array{Float32, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
```
```julia
julia> using OneHotArrays

julia> onehotbatch(str, ('A', 'C', 'G', 'T'))
4×1000000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ 1 ⋅ 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 1 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ 1 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅
```
- Fasta2onehot.jl: A Julia package for converting FASTA sequences into one-hot encoded matrices.
- With StatsBase.jl:

```julia
julia> using StatsBase

julia> function onehot_indicator(str::String)::Vector{BitVector}
           codeunits(str) |> indicatormat
       end # returns 4-element Vector{BitVector}:
```
- With `collect`: The output is a `Vector{BitVector}`, which is somewhat disorganized, but it is a valid one-hot encoding.

```julia
julia> function onehot_collector(str::String)::Vector{BitVector}
           [collect(str) .== x for x ∈ ['A', 'C', 'G', 'T']]
       end # returns 4-element Vector{BitVector}:
```
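If a single matrix is preferred over the `Vector{BitVector}` produced by the `collect` approach, the per-symbol vectors can be stacked as rows. This is a plain-Julia sketch under the assumption that each vector has the same length; the variable names are illustrative.

```julia
# One BitVector per symbol, as produced by the `collect` approach.
rows = [collect("ACGTA") .== x for x in ['A', 'C', 'G', 'T']]

# hcat puts each vector in a column (5×4); permutedims turns them into rows (4×5).
M = BitMatrix(permutedims(reduce(hcat, rows)))

size(M)   # (4, 5)
```

The extra concatenation and transposition allocate new arrays, so this conversion adds cost on top of the original function.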
- With `permutedims` and `reinterpret`:

```julia
julia> function onehot_permutator(seq::BioSequence)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           return 1:4 .== permutedims(reinterpreter.(seq))
       end
```
A more efficient version of the previous function, with `codeunits` and `permutedims`:

```julia
julia> function onehot_codeunits(str::String)
           #             A     C     G     T
           return UInt8[0x41, 0x43, 0x47, 0x54] .== permutedims(codeunits(str))
       end
```
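One property of the byte-comparison approach is worth seeing on a small input: any character other than uppercase `A`, `C`, `G` or `T` matches none of the four bytes, so its column is all `false` rather than raising an error. The sketch below restates the same broadcast under the hypothetical name `onehot_bytes` so that it can be run on its own.

```julia
# Same technique as above, restated under a hypothetical name for a standalone run.
onehot_bytes(str::String) = UInt8[0x41, 0x43, 0x47, 0x54] .== permutedims(codeunits(str))

M = onehot_bytes("ACNT")

M[:, 1]        # A: first row set
any(M[:, 3])   # false, 'N' matches none of the four bytes
```

Lowercase input behaves the same way as `N` here, so callers may want to apply `uppercase` or validate the string first.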
Benchmarks
```julia
julia> using BenchmarkTools

str = rand(codeunits("ACGT"), 10^6) |> String
seq = randdnaseq(10^6)

# VossEncoder.jl
@btime vossmatrix($seq); # 32.056 μs (4 allocations: 488.42 KiB)
@btime vossvector($str); # 11.565 ms (10 allocations: 488.62 KiB)

# Others
@btime onehot($seq); # 4.408 ms (4 allocations: 488.42 KiB)
@btime onehot_codeunits($str); # 8.124 ms (6 allocations: 488.48 KiB)
@btime onehot_reinterpretator($seq); # 10.140 ms (7 allocations: 1.43 MiB)
@btime onehot_permutator($seq); # 9.670 ms (10 allocations: 2.38 MiB)
@btime onehot_indicator($str); # 17.413 ms (14 allocations: 3.82 MiB)
@btime onehot_collector($str); # 12.659 ms (32 allocations: 15.74 MiB)
@btime onehot_tokenizer($str) # 22.816 ms (19 allocations: 26.70 MiB)

# From the special FluxML ecosystem
@btime onehotbatch($str, ('A', 'C', 'G', 'T')); # 11.418 ms (3 allocations: 3.81 MiB)
```
```julia
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORDSIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
```
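The allocation figures in the benchmarks can be sanity-checked with a little arithmetic. A `BitMatrix` stores one bit per entry, so a 4×10^6 matrix needs about 488 KiB of payload, which is close to the 488.42 KiB reported for `vossmatrix`. Assuming the `OneHotMatrix` shown earlier keeps one `UInt32` index per column (as its printed type `OneHotMatrix(::Vector{UInt32})` suggests), it needs about 3.81 MiB. The sketch ignores array headers and temporaries.

```julia
n = 10^6   # sequence length used in the benchmarks

# 4 rows × n columns, 1 bit each, converted to KiB.
bitmatrix_kib = 4 * n / 8 / 1024

# One UInt32 per column (assumed storage layout), converted to MiB.
onehot_mib = n * sizeof(UInt32) / 1024^2

round(bitmatrix_kib; digits=2)   # 488.28
round(onehot_mib; digits=2)      # 3.81
```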
Owner
- Name: Camilo García
- Login: camilogarciabotero
- Kind: user
- Location: Bogotá, Colombia
- Company: Universidad de los Andes
- Website: https://camilogarciabotero.github.io
- Twitter: gaspardelanoche
- Repositories: 8
- Profile: https://github.com/camilogarciabotero
Biologist interested in applying bioinformatics and DS tools to understand evolution in different organisms. Currently working on bacteriophages and epigenomics
GitHub Events
Total
- Create event: 5
- Commit comment event: 5
- Release event: 2
- Delete event: 3
- Issue comment event: 4
- Push event: 49
- Pull request event: 5
Last Year
- Create event: 5
- Commit comment event: 5
- Release event: 2
- Delete event: 3
- Issue comment event: 4
- Push event: 49
- Pull request event: 5
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Camilo García | c****2@u****o | 130 |
| dependabot[bot] | 4****] | 4 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 1
- Total pull requests: 9
- Average time to close issues: less than a minute
- Average time to close pull requests: about 2 months
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 6.0
- Average comments per pull request: 0.22
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 5
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 14 hours
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- JuliaTagBot (1)
Pull Request Authors
- dependabot[bot] (10)
- camilogarciabotero (8)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 6
juliahub.com: BioVossEncoder
A small Julia package to represent BioSequences as a Voss matrix
- Homepage: https://camilogarciabotero.github.io/BioVossEncoder.jl/dev
- Documentation: https://docs.juliahub.com/General/BioVossEncoder/stable/
- License: MIT
- Latest release: 0.6.0 (published over 1 year ago)
Rankings
Dependencies
- actions/checkout v4 composite
- actions/checkout v2 composite
- codecov/codecov-action v3 composite
- julia-actions/cache v1 composite
- julia-actions/julia-buildpkg v1 composite
- julia-actions/julia-docdeploy v1 composite
- julia-actions/julia-processcoverage v1 composite
- julia-actions/julia-runtest v1 composite
- julia-actions/setup-julia v1 composite
- julia-actions/RegisterAction latest composite