https://github.com/camilogarciabotero/biovossencoder.jl

A small Julia package to represent BioSequences as a Voss matrix

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

bioinformatics biojulia encoding julia onehot-encoding onehot-vectors wavelets

Keywords from Contributors

projection interactive generic archival sequences genomics observability autograding hacking shellcodes
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 6
Topics
bioinformatics biojulia encoding julia onehot-encoding onehot-vectors wavelets
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.md


Encoding biological sequences into Voss representation

[![Documentation](https://img.shields.io/badge/documentation-online-blue.svg?logo=Julia&logoColor=white)](https://camilogarciabotero.github.io/BioVossEncoder.jl/dev/) [![Latest Release](https://img.shields.io/github/release/camilogarciabotero/BioVossEncoder.jl.svg)](https://github.com/camilogarciabotero/BioVossEncoder.jl/releases/latest) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10452378.svg)](https://doi.org/10.5281/zenodo.10452378)
[![CI Workflow](https://github.com/camilogarciabotero/BioVossEncoder.jl/actions/workflows/CI.yml/badge.svg)](https://github.com/camilogarciabotero/BioVossEncoder.jl/actions/workflows/CI.yml) [![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/camilogarciabotero/BioVossEncoder.jl/blob/main/LICENSE) [![Work in Progress](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip) [![Downloads](https://img.shields.io/badge/dynamic/json?url=http%3A%2F%2Fjuliapkgstats.com%2Fapi%2Fv1%2Fmonthly_downloads%2FBioVossEncoder&query=total_requests&suffix=%2Fmonth&label=Downloads)](http://juliapkgstats.com/pkg/BioVossEncoder) [![Aqua QA](https://raw.githubusercontent.com/JuliaTesting/Aqua.jl/master/badge.svg)](https://github.com/JuliaTesting/Aqua.jl) [![JET](https://img.shields.io/badge/%E2%9C%88%EF%B8%8F%20tested%20with%20-%20JET.jl%20-%20red)](https://github.com/aviatesk/JET.jl)

BioVossEncoder

A Julia package for encoding biological sequences into Voss representations

  • Provides the fastest encoding of biological sequences into Voss representations (a.k.a. one-hot vectors) among the approaches compared in the Benchmarks section.
  • Can encode all BioSequence types, as well as Strings of unambiguous nucleotides and amino acids.
  • Can handle ambiguous nucleotides and amino acids from a BioSequence.
  • Provides a simple and intuitive API for encoding biological sequences.
  • Includes a dedicated type, `VossEncoder`, that matches the BioSequences types.
  • Can be used for single-nucleotide encoding: `vv = vossvector(dna"ACGT", DNA_A)`.

> [!WARNING]
> This package uses internals from BioSequences and BitMatrix types, which might not be stable. Use with caution.

Installation

BioVossEncoder is a Julia Language package. To install it, open Julia's interactive session (the REPL), press the `]` key to enter package mode, and run the following command:

```julia
pkg> add BioVossEncoder
```

Encoding BioSequences

This package provides a simple and fast way to encode biological sequences into Voss representations. The main struct is `VossEncoder`, a wrapper around a `BitMatrix` that stores the bit-matrix encoding of a biological sequence together with its corresponding alphabet. The following example shows how to encode a DNA sequence into a Voss matrix.

```julia
julia> using BioSequences, BioVossEncoder

julia> seq = dna"ACGT"

julia> VossEncoder(seq)
4×4 Voss Matrix of DNAAlphabet{4}():
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```

For simplicity, the `VossEncoder` struct provides a `bitmatrix` property that returns the `BitMatrix` representation of the sequence.

```julia
julia> VossEncoder(seq).bitmatrix
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```

Similarly, the `vossmatrix` function builds on the `VossEncoder` structure and returns the `BitMatrix` representation of a sequence directly.

```julia
julia> vossmatrix(seq)
4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
```
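To make the layout of the matrix above concrete, the sketch below builds the same kind of matrix in plain Julia for a `String`: one row per alphabet symbol and one column per sequence position. `voss_sketch` is a hypothetical helper written only for illustration. It is not part of BioVossEncoder and makes no claim about how `VossEncoder` works internally.

```julia
# Hypothetical helper (not part of BioVossEncoder): builds a Voss-style
# matrix for a plain String over a user-supplied alphabet.
function voss_sketch(str::AbstractString,
                     alphabet::AbstractVector{Char} = ['A', 'C', 'G', 'T'])
    M = falses(length(alphabet), length(str))  # rows: symbols, columns: positions
    for (j, c) in enumerate(str)
        i = findfirst(==(c), alphabet)
        i === nothing && continue              # symbols outside the alphabet leave an all-zero column
        M[i, j] = true
    end
    return M
end

voss_sketch("ACGT") == [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1]  # true
```

In this sketch a symbol that is not in the alphabet (for example `N`) simply yields an all-zero column; this is a design choice of the illustration, not a statement about the package's behavior.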

Creating a Voss vector of a sequence

Sometimes it is useful to encode a sequence into a Voss vector representation (i.e., a bit vector marking the positions of a single symbol of the corresponding molecule alphabet).

This package provides the `vossvector` function, which returns the Voss vector of a sequence given a BioSequence and a specific symbol (a BioSymbol), which can be a DNA or AA symbol.

```julia
julia> vossvector(seq, DNA_A)
4-element view(::BitMatrix, 1, :) with eltype Bool:
 1
 0
 0
 0
```

Note that the output is a view of the `BitMatrix` representation of the sequence rather than a copy. This is done for performance reasons.
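The practical consequence of returning a view can be seen with Base Julia alone. The following is a minimal sketch that uses a hand-built `BitMatrix` in place of the package's output (the variable names are illustrative): a view shares memory with its parent matrix, while slicing with `M[1, :]` allocates an independent copy.

```julia
M = falses(4, 4)
for i in 1:4
    M[i, i] = true        # same pattern as the Voss matrix of "ACGT"
end

row_view = view(M, 1, :)  # lazy view of the first row, no copy
row_copy = M[1, :]        # slicing allocates a new BitVector

M[1, 2] = true            # mutate the parent matrix

row_view[2]  # true:  the view reflects the change
row_copy[2]  # false: the copy does not
```

If an independent vector is needed, calling `collect` or `copy` on the view materializes it.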

Related Ideas

```julia
using BioSymbols, BioSequences

function onehot(s::NucSeq)
    M = falses(4, length(s))
    for (i, nt) in enumerate(s)
        bits = compatbits(nt)
        while !iszero(bits)
            M[trailing_zeros(bits) + 1, i] = true
            bits &= bits - one(bits) # clear lowest bit
        end
    end
    M
end
```

```julia
julia> onehot(dna"TGNTKCTW-T")
4×10 BitMatrix:
 0  0  1  0  0  0  0  1  0  0
 0  0  1  0  0  1  0  0  0  0
 0  1  1  0  1  0  0  0  0  0
 1  0  1  1  1  0  1  1  0  1
```

```julia
julia> function onehot_reinterpretator(seq::BioSequence)
           seqlen = length(seq)
           modvect = Vector{Int8}(undef, seqlen)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           @inbounds for i in 1:seqlen
               modvect[i] = reinterpreter(seq[i])
           end
           return 1:4 .== permutedims(modvect)
       end
```
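The inner loop of the `onehot` function above relies on a standard bit trick: `trailing_zeros(bits)` gives the index of the lowest set bit, and `bits &= bits - one(bits)` clears that bit. The sketch below isolates the loop in plain Julia. It assumes a 4-bit code in which A, C, G and T occupy bits 0 to 3, which is consistent with the output shown above; `setbit_positions` is a hypothetical helper, not a BioSymbols function.

```julia
# Hypothetical helper: 0-based positions of the set bits in a bit code.
function setbit_positions(bits::UInt8)
    positions = Int[]
    while !iszero(bits)
        push!(positions, trailing_zeros(bits))  # index of the lowest set bit
        bits &= bits - one(bits)                # clear the lowest set bit
    end
    return positions
end

setbit_positions(0b0001)  # [0]           a single symbol in bit 0 (assumed A)
setbit_positions(0b1100)  # [2, 3]        bits 2 and 3 (assumed G or T, i.e. K)
setbit_positions(0b1111)  # [0, 1, 2, 3]  all four bits (assumed N)
setbit_positions(0b0000)  # Int[]         no bits set (assumed gap)
```

Each iteration removes exactly one set bit, so the loop runs once per compatible symbol instead of testing all four rows.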

  • SequenceTokenizers.jl: A Julia package for tokenizing biological sequences, providing efficient and flexible tokenization methods for various sequence types. Focused on String type.

```julia
julia> using SequenceTokenizers

julia> function onehot_tokenizer(str::String)
           alphabet = ['A', 'C', 'G', 'T']
           tokenizer = SequenceTokenizer(alphabet, 'N')
           tokens = tokenizer([str])
           return onehot_batch(tokenizer, tokens)
       end # 5×N×1 Array{Float32, 3}

julia> onehot_tokenizer("ACATCAGCATC")
5×11×1 Array{Float32, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
```

```julia
julia> using OneHotArrays

julia> onehotbatch(str, ('A', 'C', 'G', 'T'))

4×1000000 OneHotMatrix(::Vector{UInt32}) with eltype Bool: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ … ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ 1 1 ⋅ 1 ⋅ 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ 1 1 1 1 ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅ 1 ⋅ ⋅ ⋅ 1 1 ⋅ 1 1 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1 ⋅ ⋅ ⋅ ⋅ 1 ⋅

```

  1. With StatsBase.jl

```julia
julia> using StatsBase

julia> function onehot_indicator(str::String)::Vector{BitVector}
           codeunits(str) |> indicatormat
       end # returns 4-element Vector{BitVector}
```

  1. With collect: the output is a `Vector{BitVector}`, which is somewhat disorganized (one `BitVector` per symbol rather than a single matrix), but it is a valid one-hot encoding.

```julia
julia> function onehot_collector(str::String)::Vector{BitVector}
           [collect(str) .== x for x ∈ ['A', 'C', 'G', 'T']]
       end # returns 4-element Vector{BitVector}
```

  1. With permutedims and reinterpret:

```julia
julia> function onehot_permutator(seq::BioSequence)
           modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
           reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
           return 1:4 .== permutedims(reinterpreter.(seq))
       end
```

A more efficient version of the previous function, using codeunits and permutedims:

```julia
julia> function onehot_codeunits(str::String)
           #             A     C     G     T
           return UInt8[0x41, 0x43, 0x47, 0x54] .== permutedims(codeunits(str))
       end
```
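If the `Vector{BitVector}` produced by the collect-based approach above needs to line up with the matrix layout used elsewhere in this README (symbols as rows, positions as columns), the per-symbol vectors can be stacked. This is a minimal sketch using only Base Julia; the variable names are illustrative.

```julia
# One Bool vector per symbol, as in the collect-based approach
rows = [collect("ACGT") .== x for x in ['A', 'C', 'G', 'T']]

# hcat places each vector as a column; permutedims turns them into rows,
# giving a 4×N Bool matrix with one row per symbol
M = permutedims(reduce(hcat, rows))

M == [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1]  # true
```

This adds an extra allocation on top of the vectors themselves, so it is meant to show the relationship between the two layouts rather than to be a fast path.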

Benchmarks

```julia
julia> using BenchmarkTools

str = rand(codeunits("ACGT"), 10^6) |> String
seq = randdnaseq(10^6)

# VossEncoder.jl

@btime vossmatrix($seq); # 32.056 μs (4 allocations: 488.42 KiB)
@btime vossvector($str); # 11.565 ms (10 allocations: 488.62 KiB)

# Others

@btime onehot($seq); # 4.408 ms (4 allocations: 488.42 KiB)
@btime onehot_codeunits($str); # 8.124 ms (6 allocations: 488.48 KiB)
@btime onehot_reinterpretator($seq); # 10.140 ms (7 allocations: 1.43 MiB)
@btime onehot_permutator($seq); # 9.670 ms (10 allocations: 2.38 MiB)
@btime onehot_indicator($str); # 17.413 ms (14 allocations: 3.82 MiB)
@btime onehot_collector($str); # 12.659 ms (32 allocations: 15.74 MiB)
@btime onehot_tokenizer($str) # 22.816 ms (19 allocations: 26.70 MiB)

# From the special FluxML ecosystem

@btime onehotbatch($str, ('A', 'C', 'G', 'T')); # 11.418 ms (3 allocations: 3.81 MiB)
```

```julia
versioninfo()

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORDSIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
```

Owner

  • Name: Camilo García
  • Login: camilogarciabotero
  • Kind: user
  • Location: Bogotá, Colombia
  • Company: Universidad de los Andes

Biologist interested in applying bioinformatics and DS tools to understand evolution in different organisms. Currently working on bacteriophages and epigenomics

GitHub Events

Total
  • Create event: 5
  • Commit comment event: 5
  • Release event: 2
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 49
  • Pull request event: 5
Last Year
  • Create event: 5
  • Commit comment event: 5
  • Release event: 2
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 49
  • Pull request event: 5

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 134
  • Total Committers: 2
  • Avg Commits per committer: 67.0
  • Development Distribution Score (DDS): 0.03
Past Year
  • Commits: 43
  • Committers: 1
  • Avg Commits per committer: 43.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Camilo García c****2@u****o 130
dependabot[bot] 4****] 4
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 1
  • Total pull requests: 9
  • Average time to close issues: less than a minute
  • Average time to close pull requests: about 2 months
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 6.0
  • Average comments per pull request: 0.22
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 14 hours
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • JuliaTagBot (1)
Pull Request Authors
  • dependabot[bot] (10)
  • camilogarciabotero (8)
Top Labels
Issue Labels
Pull Request Labels
dependencies (10) enhancement (2)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
juliahub.com: BioVossEncoder

A small Julia package to represent BioSequences as a Voss matrix

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 10.0%
Average: 25.0%
Dependent packages count: 40.1%
Last synced: 6 months ago

Dependencies

.github/workflows/CI.yml actions
  • actions/checkout v4 composite
  • actions/checkout v2 composite
  • codecov/codecov-action v3 composite
  • julia-actions/cache v1 composite
  • julia-actions/julia-buildpkg v1 composite
  • julia-actions/julia-docdeploy v1 composite
  • julia-actions/julia-processcoverage v1 composite
  • julia-actions/julia-runtest v1 composite
  • julia-actions/setup-julia v1 composite
.github/workflows/CompatHelper.yml actions
.github/workflows/TagBot.yml actions
.github/workflows/register.yml actions
  • julia-actions/RegisterAction latest composite