https://github.com/averagehat/clj-biosequence
A Clojure library designed to make the manipulation of biological sequence data easier.
Basic Info
- Host: GitHub
- Owner: averagehat
- Language: Clojure
- Default Branch: master
- Size: 4.78 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
clj-biosequence
clj-biosequence is a library designed to make working with
biological sequence data easier. Basic functions include:
- Parsers and accessors for Genbank, Uniprot XML, fasta and fastq formats.
- A wrapper for BLAST.
- A wrapper for signalP.
- A wrapper for TMHMM.
- Indexing of files for random access.
- Mechanisms for lazy processing of sequences from very large files.
- Interfaces for search and retrieval of sequences from online databases.
- Translation functions for DNA and RNA sequences.
- ID mapping functionality using Uniprot's ID mapping tool.
Installation
Available from Clojars. For the current version add the following to your project.clj file:
```clojure
[clj-biosequence "0.4.3"]
```
To use in your namespace:
```clojure
(ns my-app.core
  (:require [clj-biosequence.core :as cbs] ;; for base functionality and fasta
            [clj-biosequence.uniprot :as up] ;; for Uniprot xml
            [clj-biosequence.genbank :as gb] ;; for Genbank gbseq xml
            [clj-biosequence.blast :as bl] ;; for BLAST functionality
            [clj-biosequence.fastq :as fq] ;; for fastq functionality
            [clj-biosequence.index :as ind] ;; for indexing functionality
            [clj-biosequence.signalp :as sp] ;; for a wrapper for signalp
            [clj-biosequence.entrezgene :as ez] ;; for entrezgene xml
            [clj-biosequence.tmhmm :as tm])) ;; for a wrapper for TMHMM
```
The project page is here and there is also a guide to getting started with Clojure here.
API docs are available here.
Basic usage
clj-biosequence provides a reader and sequence mechanism for the
lazy access of biosequences in a variety of formats. For example, if
working with fasta sequences a typical session in the REPL could go
something like:
```clojure
;; import core and fasta functions
user> (use 'clj-biosequence.core)
;; To initialise a file call the relevant initialisation function,
;; a string or java file object can be used.
;; For fasta an alphabet is also required to initialise a file.
user> (def fa-file (init-fasta-file "test-files/nuc-sequence.fasta" :iupacNucleicAcids))
#'user/fa-file
;; then bs-reader can be used with with-open and biosequence-seq
;; to get access to a lazy sequence of fasta sequences in the file.
user> (with-open r (bs-reader fa-file))
false
user> (with-open r (bs-reader fa-file))
6
```
And that's just about it. The same pattern is used for all supported sequence formats (at the moment this includes Geneseq xml, Entrezgene xml, Uniprot xml, fasta and fastq formats).
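The reader pattern can be sketched in plain Clojure without the library. This is only an illustration under stated assumptions: `fasta-records` and `example-fasta` are hypothetical names, and the parsing is deliberately naive. It is not how clj-biosequence is implemented. The point it shows is that a lazy sequence built on an open reader has to be consumed inside `with-open`, before the reader closes.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn fasta-records
  "Lazily groups the lines of an open reader into {:header :sequence} maps.
   Hypothetical helper for illustration, not part of clj-biosequence.
   Assumes every header line is followed by at least one sequence line."
  [rdr]
  (->> (line-seq rdr)
       (remove str/blank?)
       (partition-by #(.startsWith ^String % ">"))
       (partition 2)
       (map (fn [[[header] seq-lines]]
              {:header   (subs header 1)
               :sequence (apply str seq-lines)}))))

;; Made-up two-record input used only for this sketch.
(def example-fasta ">seq1 first\nACGT\nACGT\n>seq2 second\nTTGA\n")

;; count walks the whole lazy sequence while the reader is still open.
(def record-count
  (with-open [r (io/reader (java.io.StringReader. example-fasta))]
    (count (fasta-records r))))

;; first only realises the first record; the rest of the input is never parsed.
(def first-record
  (with-open [r (io/reader (java.io.StringReader. example-fasta))]
    (first (fasta-records r))))
```

Returning the unrealised lazy sequence itself from `with-open` would fail on later access, because the reader is already closed by then.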
Some examples:
```clojure
;; a lazy sequence of translations in six reading frames
user> (with-open r (bs-reader fa-file))
false
user> (with-open r (bs-reader fa-file))
36
;; fasta-string can be used to convert biosequences to fasta strings
user> (use 'clj-biosequence.uniprot)
nil
user> (def uniprot-f (init-uniprotxml-file "test-files/uniprot-s-mansoni-20121217.xml"))
#'user/uniprot-f
user> (with-open r (bs-reader uniprot-f))
sp|C4PYP8|DRE2_SCHMA Anamorsin homolog | Fe-S cluster assembly protein DRE2 homolog [Schistosoma mansoni]
MEQCVADCLNSDDCVMIVWSGEVQEDVMRGLQVAVSTYVKKLQFENLEKFVDSSAVDSQLXHECSVILCGWPNSISVNILK
LGLLSNLLSCLRPGGRFFGRDLITGDWDSLKKNLTLSGYIXNPYQLSCENHLIFSASVPSNYTQGSSVKLPWANSDVEAAW
ENVDNSSDANGNIINTNTLLXQKSDLKTPLSVCGKEAATDSVGKKKRACKNCTCGLAEIEAAEEDKSDVPISSCGNCYLGD
XAFRCSTCPYRGLPPFKPGERILIPDDVLRADL
;; filters can be implemented pretty easily
user> (with-open r (bs-reader fa-file))
"gi|114311762|gb|EE738912.1|EE738912"
;; The function biosequence->file sends biosequences to a file and
;; also accepts a function argument to transform the biosequence
;; before writing (the default is fasta-string).
;; a Uniprot to fasta converter is thus:
user> (with-open r (bs-reader uniprot-f))
"/tmp/fasta.fa"
user> (with-open r (bs-reader (init-fasta-file "/tmp/fasta.fa" :iupacAminoAcids)))
2
user> (with-open r (bs-reader (init-fasta-file "/tmp/fasta.fa" :iupacAminoAcids)))
clj_biosequence.core.fastaSequence
;; sequences can be filtered to file using this function
;; e.g. filter Cytoplasmic proteins to file in fasta format
user> (with-open r (bs-reader uniprot-f) (map :text (subcellular-location %))))) "/tmp/fasta.fa"))
"/tmp/fasta.fa"
user> (with-open r (bs-reader (init-fasta-file "/tmp/fasta.fa" :iupacAminoAcids)))
1
user> (with-open r (bs-reader (init-fasta-file "/tmp/fasta.fa" :iupacAminoAcids)))
sp|C4PYP8|DRE2_SCHMA Anamorsin homolog | Fe-S cluster assembly protein DRE2 homolog [Schistosoma mansoni]
MEQCVADCLNSDDCVMIVWSGEVQEDVMRGLQVAVSTYVKKLQFENLEKFVDSSAVDSQLXHECSVILCGWPNSISVNILK
LGLLSNLLSCLRPGGRFFGRDLITGDWDSLKKNLTLSGYIXNPYQLSCENHLIFSASVPSNYTQGSSVKLPWANSDVEAAW
ENVDNSSDANGNIINTNTLLXQKSDLKTPLSVCGKEAATDSVGKKKRACKNCTCGLAEIEAAEEDKSDVPISSCGNCYLGD
XAFRCSTCPYRGLPPFKPGERILIPDDVLRADL
```
For strings containing fasta, Uniprot XML or Genbank XML formatted sequences the functions `init-fasta-string`, `init-uniprot-string` and `init-genbank-string` allow the use of strings with the `with-open` idiom. For Uniprot and Genbank, connection initialisation functions provide the same capability with remotely stored sequences from the relevant servers (see below).
Indexing
For random access to biosequences each supported file format also has an indexed version.
Typical usage as follows:
```clojure
;; calling index-biosequence-file on any biosequence file returns a
;; biosequence index, which is accessed using with-open just like
;; other readers but with faster retrieval of specific biosequences.
user> (use 'clj-biosequence.index)
nil
user> (def fasta-in (index-biosequence-file fa-file))
#'user/fasta-in
user> (with-open r (bs-reader fasta-in)
6
user> (with-open r (bs-reader fasta-in)
#clj_biosequence.core.fastaSequence{:acc "gi|116025203|gb|EG339215.1|EG339215", :description "KAAN-aaa29f08.b1 ... etc"
user> (with-open r (bs-reader fasta-in))
"gi|114311762|gb|EE738912.1|EE738912"
;; when a file is indexed two additional files are created with the same
;; base-name as the biosequence file but with the extensions .bin and .idx.
;; The .bin file is compressed sequences and the .idx file is a
;; text file containing the index. The .idx file is readable with
;; edn/read-string. To load an index use load-biosequence-index with the
;; path and basename of the index files.
user> (def fa-ind-2 (load-biosequence-index "/Users/jason/Dropbox/clj-biosequence/resources/test-files/nuc-sequence.fasta"))
#'user/fa-ind-2
user> (with-open r (bs-reader fa-ind-2))
"gi|114311762|gb|EE738912.1|EE738912"
;; biosequence collections can be indexed using index-biosequence-list.
user> (def fa-ind-3 (with-open r (bs-reader fa-file)))
#'user/fa-ind-3
user> (with-open r (bs-reader fa-ind-3))
"gi|114311762|gb|EE738912.1|EE738912"
;; this can be handy when filtering biosequences. For example secreted proteins
;; can be filtered into their own index
user> (def secreted (with-open r (bs-reader toxins)))
#'user/secreted
user> (with-open r (bs-reader fa-ind-3)
6
```
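Since the comments above say the `.idx` file is text readable with `edn/read-string`, here is a small sketch of what an EDN round trip looks like. The layout used here (accession mapped to a `[offset length]` pair) and the numbers in it are assumptions made up for illustration; the README does not describe the actual structure of the index.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical index layout: accession -> [byte-offset length].
;; The real .idx structure is not documented in this README.
(def example-index
  {"gi|114311762|gb|EE738912.1|EE738912" [0 412]
   "gi|116025203|gb|EG339215.1|EG339215" [412 388]})

;; pr-str emits EDN-compatible text for plain maps, strings, vectors and numbers.
(def idx-text (pr-str example-index))

;; edn/read-string parses the text back into data without evaluating any code.
(def reloaded (edn/read-string idx-text))

(defn lookup
  "Returns the stored entry for an accession, or nil when it is absent."
  [index acc]
  (get index acc))
```

A text index of this kind can be inspected with ordinary Clojure data functions such as `keys` and `get`.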
BLAST
clj-biosequence supports most forms of BLAST, with the exception of
PSI-BLAST. As with other parts of clj-biosequence the BLAST functions
seek to be as lazy and composable as possible. For them to work, the
BLAST+ programs from the NCBI need to be on your path.
Typical usage as follows:
```clojure
;; initialise a BLAST db by passing the basename of the indexes to init-blast-db
user> (use 'clj-biosequence.blast)
nil
user> (def toxindb (init-blast-db "test-files/toxins.fasta" :iupacAminoAcids))
#'user/toxindb
;; The function blast takes a list of biosequence objects and blasts them against
;; a blast database. It returns a blast search result which is a pointer to the
;; blast result file. This can then be opened using bs-reader and results
;; accessed using biosequence-seq
user> (def tox-bl (with-open r (bs-reader toxins)))
#'user/tox-bl
user> (with-open r (bs-reader tox-bl))
20
;; Additional parameters can be passed to blast using the :params keyword
;; argument. The format is a hash-map whose keys are the command line switches
;; as strings and whose values are the desired values as strings. For example:
user> (def tox-bl (with-open r (bs-reader toxins)))
#'user/tox-bl
;; BLAST results can be accessed using the accessors defined in the package
;; and the functions hit-seq and hsp-seq. For example to filter all
;; proteins in tox-bl that had a hit with a bit-score greater than
;; 50 and report their accession (note the use of second to avoid hits
;; to themselves):
user> (with-open r (bs-reader tox-bl)
("B3EWT5" "Q5UFR8" "C1IC47" "Q53B61" "O76199" "C0JAT6" "P0CE79" "C0JAU1" "C0JAT6" "P0CE78" "C0JAT9" "C0JAT5" "C0JAT9" "C0JAT6" "P0CE81" "P0CE80" "P0CE81" "P0CE82" "P0CE81")
;; Or a hash-map of the query id and hit id of hits with a bit score greater than 50
;; (note that calling accession on a BLAST iteration returns the query accession):
user> (with-open r (bs-reader tox-bl)))
{"sp|P84001|29C0_ANCSP" "B3EWT5",
 "sp|P0CE81|A1HB1_LOXIN" "P0CE80",
 "sp|C0JAT9|A1H1_LOXSP" "C0JAU1",
 "sp|P0CE82|A1HB2_LOXIN" "P0CE81",
 "sp|P0CE80|A1HA_LOXIN" "P0CE81",
 "sp|C0JAT8|A1H4_LOXHI" "C0JAT6",
 "sp|C0JAT5|A1H2_LOXHI" "C0JAT6",
 "sp|C0JAT6|A1H3_LOXHI" "C0JAT5",
 "sp|C0JAT4|A1H1_LOXHI" "C0JAT6",
 "sp|C0JAU1|A1H2_LOXSP" "C0JAT9",
 "sp|C0JAU2|A1H3_LOXSP" "C0JAT9",
 "sp|Q4VDB5|A1H_LOXGA" "P0CE82",
 "sp|C1IC47|3FN3_WALAE" "Q5UFR8",
 "sp|C1IC48|3FN4_WALAE" "C1IC47",
 "sp|C1IC49|3FN5_WALAE" "Q53B61",
 "sp|P84028|45C1_ANCSP" "O76199",
 "sp|Q56JA9|A1H_LOXSM" "P0CE81",
 "sp|P0CE78|A1H1_LOXRE" "P0CE79",
 "sp|P0CE79|A1H2_LOXRE" "P0CE78"}
;; This can be combined with indexes or biosequence files to obtain the original
;; query biosequences.
user> (def toxin-index (index-biosequence-file toxins))
#'user/toxin-index
user> (with-open r (bs-reader tox-bl) i (bs-reader toxin-index))
#clj_biosequence.core.fastaSequence{:acc "sp|P84001|29C0_ANCSP", :description
"U3-ctenitoxin-Asp1a (Fragment) OS=Ancylometes sp. PE=1 SV=1", :alphabet :iupacAminoAcids, :sequence [\A \N \A \C \T \K \Q \A \D \C \A \E \D \E \C \C \L \D \N \L \F \F \K \R \P \Y \C \E \M \R \Y \G \A \G \K \R \C \A \A \A \S \V \Y \K \E \D \K \D \L \Y]}
;; or sent off to a file.
user> (with-open r (bs-reader tox-bl) i (bs-reader toxin-index))
"/tmp/blast.fa"
;; As the entire chain is lazy these methods will work with as big a file as
;; can be thrown at them (hopefully). So one could annotate a large fasta file
;; starting with a fasta index and a blast DB by:
user> (with-open r (bs-reader (blast (biosequence-seq toxin-index) "blastp" toxindb "/tmp/outfile.xml")) i (bs-reader toxin-index) " - " (first (hit-bit-scores h))))))) "/tmp/annotated-sequeunces.fa"))
"/tmp/annotated-sequeunces.fa"
user> (with-open r (bs-reader (init-fasta-file "/tmp/annotated-sequeunces.fa" :iupacAminoAcids)))
U3-ctenitoxin-Asp1a (Fragment) OS=Ancylometes sp. PE=1 SV=1 - Similar to Toxin CSTX-20 \
OS=Cupiennius salei PE=1 SV=1 - 89.737335
;; Although this is getting a bit complicated for the REPL and should probably
;; be a function(s) of itself (and the blast outfile might need to be deleted).
;; BLAST readers also provide access to the parameters used and these
;; can be accessed by calling parameters on the reader. This will return a
;; blast parameters object with accessors defined in the package.
user> (with-open r (bs-reader tox-bl))
"/Users/jason/Dropbox/clj-biosequence/resources/test-files/toxins.fasta"
user> (with-open r (bs-reader tox-bl))
"BLASTP 2.2.24+"
user> (with-open r (bs-reader tox-bl))
"10"
user> (with-open r (bs-reader tox-bl))
"F"
;; BLAST searches can be indexed like any other biosequence file. In which case
;; the index is keyed to the query accession. Although, the parameter information
;; is lost.
user> (def blast-ind (index-biosequence-file tox-bl))
#'user/blast-ind
user> (-> (get-biosequence blast-ind "sp|Q56JA9|A1H_LOXSM") hit-seq first hit-accession)
"P0CE82"
```
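The parameter format described above (a hash-map of command line switches to string values) maps naturally onto a flat argument list. The sketch below shows one way such a map could be flattened. `params->args` is a hypothetical helper written for this illustration, and the two switches are chosen only as examples; how clj-biosequence actually passes the map to BLAST+ is not shown in this README.

```clojure
(defn params->args
  "Flattens a map of switch -> value strings into a flat argument vector.
   Hypothetical helper; the library's own handling may differ.
   Entries are sorted by switch so the result is deterministic."
  [params]
  (vec (mapcat (fn [[switch value]] [switch value])
               (sort-by key params))))

;; Example switches only, not a recommendation of particular settings.
(def example-args
  (params->args {"-max_target_seqs" "5" "-evalue" "1e-5"}))
;; => ["-evalue" "1e-5" "-max_target_seqs" "5"]
```

Keeping both keys and values as strings means the map can be turned into process arguments without any type conversion.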
SignalP
SignalP works in a similar way to BLAST. If you have signalp on your
path it can be applied to collections of biosequences using the
function signalp (which returns a signalp result object), or a
SignalP output file in short form format can be initialised as a
result object.
Basic usage as follows:
```clojure
;; running signalp
user> (use 'clj-biosequence.signalp)
nil
user> (def sr (signalp (take 20 (biosequence-seq toxin-index)) "/tmp/signalp.txt"))
#'user/sr
user> (with-open r (bs-reader sr))
#clj_biosequence.signalp.signalpProtein{:name "sp|P58809|CTX_CONMR", :cmax 0.105,
:cpos 7, :ymax 0.147, :ypos 1, :smax 0.208, :spos 1, :smean 0.0, :D 0.068, :result "N", :Dmaxcut 0.45, :network "SignalP-noTM"}
user> (with-open r (bs-reader sr))
"sp|P58809|CTX_CONMR"
;; signalp? can be used to determine if a result is positive or not
user> (with-open r (bs-reader sr))
false
user> (with-open r (bs-reader sr))
#clj_biosequence.signalp.signalpProtein{:name "sp|Q9BP63|O3611_CONPE", :cmax 0.51,
:cpos 21, :ymax 0.696, :ypos 21, :smax 0.982, :spos 12, :smean 0.952, :D 0.834, :result "Y", :Dmaxcut 0.45, :network "SignalP-noTM"}
;; a convenience function filter-signalp filters a collection of biosequence
;; proteins and returns only proteins containing a signal sequence. If the
;; keyword argument :trim is true the returned biosequences will have the
;; signal sequence trimmed from the sequence
user> (->> (filter-signalp (take 20 (biosequence-seq toxin-index))) first bioseq->string)
"MSRLGIMVLTLLLLVFIVTSHQDAGEKQATQRDAINFRWRRSLIRRTATEECEEYCEDEEKTCCGLEDGEPVCATTCLG"
user> (->> (filter-signalp (take 20 (biosequence-seq toxin-index)) :trim true) first bioseq->string)
"DAGEKQATQRDAINFRWRRSLIRRTATEECEEYCEDEEKTCCGLEDGEPVCATTCLG"
;; SignalP result objects can be indexed in the same manner as BLAST, i.e.
;; the query sequence accession becomes the index key.
user> (def si (index-biosequence-file sr))
#'user/si
user> (accession (get-biosequence si "sp|P58809|CTX_CONMR"))
"sp|P58809|CTX_CONMR"
user> (signalp? (get-biosequence si "sp|P58809|CTX_CONMR"))
false
;; Search parameters can be passed to signalp and filter-signalp as hash-maps
;; using the :params keyword argument.
user> (def sr (signalp (take 20 (biosequence-seq toxin-index)) "/tmp/signalp.txt" :params {"-s" "best" "-t" "gram+"}))
#'user/sr
user> (with-open r (bs-reader sr))
#clj_biosequence.signalp.signalpProtein{:name "sp|P58809|CTX_CONMR", :cmax 0.101,
:cpos 2, :ymax 0.119, :ypos 2, :smax 0.139, :spos 1, :smean 0.139, :D 0.127, :result "N", :Dmaxcut 0.45, :network "SignalP-TM"}
```
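To make the result fields concrete, the following sketch filters plain maps that carry a few of the values printed above (names shortened to the bare accession). It assumes that a positive call corresponds to `:result "Y"`, which matches the printed records but is not confirmed as the rule that `signalp?` itself applies. `signal-positive?` and `above-cutoff?` are hypothetical names.

```clojure
;; Values copied from the two result records shown above; only a few fields kept.
(def results
  [{:name "P58809" :D 0.068 :Dmaxcut 0.45 :result "N"}
   {:name "Q9BP63" :D 0.834 :Dmaxcut 0.45 :result "Y"}])

(defn signal-positive?
  "Assumed rule for this sketch: a result is positive when :result is \"Y\"."
  [r]
  (= "Y" (:result r)))

(defn above-cutoff?
  "True when the D score exceeds the recorded cutoff."
  [r]
  (> (:D r) (:Dmaxcut r)))

;; Names of the results treated as carrying a signal peptide.
(def secreted-names
  (mapv :name (filter signal-positive? results)))
```

In these two records the `:result` flag agrees with comparing `:D` against `:Dmaxcut`, which is why both predicates are shown side by side.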
Accession mapping
clj-biosequence provides a facility for mapping accessions from one
database to another. It is provided in the core package and uses the
Uniprot mapping service so needs an active internet connection.
Basic usage:
```clojure
;; id-convert converts accessions. Its arguments are a list of accessions
;; to be converted, a 'from' database, a 'to' database and an email (required
;; by Uniprot). The 'from' and 'to' arguments are strings corresponding
;; to the database codes used by the Uniprot mapping tool (full list at
;; http://www.uniprot.org/faq/28#id_mapping_examples and a partial list in
;; the doc string of id-convert).
;; id-convert returns a hash-map of query accessions and search results. If
;; an ID returned no result it is not in the result hash-map. There is a
;; 100,000 limit on individual queries imposed by Uniprot.
;; For example, to convert a list of Uniprot accessions to NCBI Genbank ids,
;; using the previously defined toxin protein index which has accessions in
;; the format "sp|xxx|xxxx":
user> (map accession (take 5 (biosequence-seq toxin-index)))
("sp|P58809|CTX_CONMR" "sp|P61792|TXU2_HETVE" "sp|P86259|CT2X_CONTE" "sp|Q9BP63|O3611_CONPE" "sp|A0SE59|CA13_CONMR")
user> (require '[clojure.string :as st])
nil
user> (-> (map #(second (st/split (accession %) #"\|")) (take 5 (biosequence-seq toxin-index)))
          (id-convert "ACC" "P_GI" "jason.mulvenna@gmail.com"))
{"P58809" "20454877", "P61792" "48428590", "P86259" "229485330", "Q9BP63" "74848505", "A0SE59" "83657225"}
```
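The accession-extraction step in that example can be tried on its own with nothing but `clojure.string`. `bare-accession` is a hypothetical helper name used for this sketch. The pipe must be escaped in the regular expression, because an unescaped `|` is the alternation operator and would not split on the literal character.

```clojure
(require '[clojure.string :as st])

(defn bare-accession
  "Returns the middle field of a \"sp|ACCESSION|NAME\" style identifier.
   Hypothetical helper for illustration."
  [acc]
  (second (st/split acc #"\|")))

(def example-accessions
  (map bare-accession ["sp|P58809|CTX_CONMR" "sp|A0SE59|CA13_CONMR"]))
;; => ("P58809" "A0SE59")
```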
Sequence retrieval
Sequences can be retrieved from both Genbank and Uniprot using
init-uniprot-connection and init-genbank-connection. Both
functions take a list of accession numbers and a return type argument.
Uniprot also needs an email argument and Genbank a database argument.
Both functions can be used in conjunction with the search functions,
genbank-search and uniprot-search.
Basic usage:
```clojure
;; To generate a list of accessions search for Uniprot accessions (note this
;; generates a non-lazy list). Search syntax is exactly the same as Uniprot
;; search syntax (described at http://www.uniprot.org/help/text-search and
;; summarised in the doc string of uniprot-search).
;; For example, to get accessions of all proteins in the Schistosoma mansoni
;; reference proteome set:
user> (use 'clj-biosequence.uniprot)
nil
user> (def sm-prot (uniprot-search "organism:6183 AND keyword:1185" "jason.mulvenna@gmail.com"))
#'user/sm-prot
user> (count sm-prot)
11711
user> (first sm-prot)
"C4PYP8"
;; A lazy sequence of biosequences can be retrieved from Uniprot using
;; init-uniprot-connection and bs-reader. Sequences can be retrieved as
;; fasta or full Uniprot entries.
user> (def up-conn (init-uniprot-connection (take 10 sm-prot) :fasta "jason.mulvenna@gmail.com"))
#'user/up-conn
user> (with-open r (bs-reader up-conn))
#clj_biosequence.core.fastaSequence{:acc "sp|C4PYP8|DRE2_SCHMA", :description\
"Anamorsin homolog OS=Schistosoma mansoni GN=Smp_207000 PE=3 SV=2", :alphabet\
:iupacAminoAcids, :sequence [\M \E \Q \C \V \A \D \C \L \N \S \D \D \C \V \M ... etc
;; Uniprot
user> (def up-conn (init-uniprot-connection (take 10 sm-prot) :xml "jason.mulvenna@gmail.com"))
#'user/up-conn
user> (with-open r (bs-reader up-conn))
clj_biosequence.uniprot.uniprotProtein
;; Although sequences are downloaded as a compressed stream large sequence
;; downloads can take a long time ...
;; Genbank works exactly the same way. Search syntax is the same as Genbank query
;; format (see http://www.ncbi.nlm.nih.gov/books/NBK3837/ and a summary in the
;; doc string of genbank-search). A database also needs to be specified and may be one
;; of :protein, :nucest, :nuccore, :nucgss or :popset.
;; So to get all Schistosoma mansoni proteins from Genbank
user> (use 'clj-biosequence.genbank)
nil
user> (def sm-prots (genbank-search "txid6183[Organism:noexp]" :protein))
#'user/sm-prots
user> (first sm-prots)
"566601372"
user> (with-open r (bs-reader (init-genbank-connection (take 10 sm-prots) :protein :fasta)))
#clj_biosequence.core.fastaSequence{:acc "gi|566601352|gb|AHC70335.1|", :description
"nicotinic acetylcholine receptor [Schistosoma mansoni]", :alphabet :iupacAminoAcids, :sequence [\M ... etc
user> (with-open r (bs-reader (init-genbank-connection (take 10 sm-prots) :protein :xml)))
clj_biosequence.genbank.genbankSequence
```
Supported formats
clj-biosequence uses protocols and records to provide a fairly uniform interface to different formats.
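As a reminder of how protocols and records give different formats one interface, here is a generic Clojure sketch. Every name in it (`HasAccession`, `acc-of`, `DemoFasta`, `DemoFastq`) is hypothetical and chosen so as not to be confused with the library's real protocols and records, which are listed per format below.

```clojure
;; A protocol declares the operations shared by all formats.
(defprotocol HasAccession
  (acc-of [this] "Returns the accession string for a record."))

;; Each format is a record that stores its own fields and implements the protocol.
(defrecord DemoFasta [acc description residues]
  HasAccession
  (acc-of [this] acc))

(defrecord DemoFastq [read-id residues quality]
  HasAccession
  (acc-of [this] read-id))

;; Calling code can treat both record types the same way.
(def mixed
  [(->DemoFasta "A1" "demo entry" "ACGT")
   (->DemoFastq "R1" "ACGT" "IIII")])

(def accessions (mapv acc-of mixed))
```

Because records are also maps, fields remain reachable by keyword (for example `(:description record)`) alongside the protocol functions.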
Fasta
```clojure
;; initialise fasta files using init-fasta-file and access
;; sequences using bs-reader and biosequence-seq
user> (def ff (init-fasta-file "test-files/toxins.fasta" :iupacAminoAcids))
user> (with-open r (bs-reader ff))
5135
;; records and protocols implemented by them as follows:
->fastaSequence
Implements: biosequenceID
biosequenceDescription
Biosequence
```
Fastq
```clojure
;; Use init-fastq-file and access sequences as above.
user> (def ff (init-fastq-file "test-files/fastq-test.fastq"))
'user/ff
user> (with-open r (bs-reader ff))
9
;; records and protocols as follows:
->fastqSequence
Implements: biosequenceID
biosequenceDescription
Biosequence
```
Uniprot
```clojure
;; initialise uniprot files using init-uniprotxml-file and access
;; sequences as described above
user> (def up (init-uniprotxml-file "test-files/uniprot-s-mansoni-20121217.xml"))
'user/up
user> (with-open r (bs-reader up)) 2
;; records and protocols implemented by them are as follows:
->uniprotProtein ;; top level record for uniprot sequences
Implements: Biosequence
biosequenceID
biosequenceName
biosequenceDescription
biosequenceCitations ;; returns uniprotCitation records
biosequenceFeatures ;; returns uniprotFeature records
biosequenceTaxonomies ;; returns uniprotTaxref records
biosequenceGenes ;; returns uniprotGene records
biosequenceComments ;; returns uniprotComment records
biosequenceSubcelllocs ;; returns uniprotFeature records containing subcellular location data
biosequenceGoterms ;; returns uniprotFeature records containing GO data
biosequenceEvidence
biosequenceProtein
->uniprotComment
Implements: biosequenceSubcellloc
biosequenceSubcellloc
->uniprotGene
Implements: biosequenceGene
biosequenceID
biosequenceSynonyms
->uniprotTaxref
Implements: biosequenceTaxonomy
biosequenceDbrefs
biosequenceEvidence
->uniprotDbref
Implements: biosequenceDbref
biosequenceGoterm
biosequenceEvidence
->uniprotFeature
Implements: biosequenceNameobject
biosequenceID
biosequenceStatus
biosequenceDescription
biosequenceEvidence
biosequenceCitations
biosequenceIntervals
Biosequence
biosequenceVariant
->uniprotInterval
Implements: biosequenceInterval
biosequenceStatus
biosequenceEvidence
->uniprotCitation
Implements: biosequenceCitation
;; some examples
user> (with-open r (bs-reader up))
"Eukaryota;Metazoa;Platyhelminthes;Trematoda;Digenea;Strigeidida;Schistosomatoidea;Schistosomatidae;Schistosoma"
user> (with-open r (bs-reader up))
"Fe-S cluster assembly protein DRE2 homolog"
user> (with-open r (bs-reader up))
("Berriman M." "Haas B.J." "LoVerde P.T." "Wilson R.A." "Dillon G.P." "Cerqueira G.C." ...)
user> (with-open r (bs-reader up))
"chain"
```
GenBank: Geneseq xml
```clojure
;; initialise a genbank file in the usual way
user> (def gbf (init-genbank-file "test-files/nucleotide-gb.xml"))
#'user/gbf
;; Access sequences as usual
user> (with-open r (bs-reader gbf))
1268274
;; records and protocols as follows:
->genbankSequence ;; the top level record for Geneseq sequences
Implements: biosequenceGene
biosequenceID
biosequenceDescription
Biosequence
biosequenceCitations ;; returns citation records
biosequenceFeatures ;; returns feature records
biosequenceTaxonomies ;; returns tax-ref records
->genbankTaxRef
Implements: biosequenceTaxonomy
biosequenceFeatures
biosequenceDbrefs ;; returns db records
->genbankFeature
Implements: biosequenceFeature
biosequenceGene
biosequenceID
biosequenceProtein
biosequenceEvidence
biosequenceNotes
biosequenceNameobject
biosequenceIntervals ;; returns interval records
biosequenceDbrefs ;; returns db records
->genbankDbRef
Implements: biosequenceDbref
->genbankQualifier
Implements: biosequenceNameObject
->genbankInterval
Implements: biosequenceID
biosequenceInterval
biosequenceTranslation
->genbankCitation
Implements: biosequenceCitation
biosequenceNotes
->genbankReader
Implements: biosequenceReader
->genbankFile
Implements: biosequenceIO
biosequenceFile
;; some examples
user> (with-open r (bs-reader gbf))
"KE373594"
user> (with-open r (bs-reader gbf))
"source"
user> (with-open r (bs-reader gbf))
1
;; The function qualifiers is also provided for genbank features
;; and it returns a lazy list of qualifiers.
user> (with-open r (bs-reader gbf))
"organism"
user> (with-open r (bs-reader gbf))
"Blumeria graminis f. sp. tritici 96224"
```
GenBank: Entrezgene xml
```clojure
;; Access sequences in the usual way:
user> (use 'clj-biosequence.entrezgene)
nil
user> (def ef (init-entrezgene-file "test-files/entrez-gene.xml"))
#'user/ef
user> (with-open r (bs-reader ef))
3875
;; records and protocols as follows:
->entrezGene ;; top level record for an entrez gene
Implements: biosequenceGene
biosequenceSynonyms
biosequenceDescription
biosequenceDbrefs ;; returns db records
biosequenceComments ;; returns comment records
entrezComments
biosequenceTranslation
biosequenceID
biosequenceStatus
biosequenceSummary
biosequenceTaxonomies ;; returns tax-ref records
biosequenceProteins ;; returns protein sub-seq records
Biosequence
->entrezProtein
Implements: biosequenceProtein
biosequenceSynonyms
biosequenceDescription
biosequenceDbrefs ;; returns db records
->entrezBiosource
->entrezPcrPrimers
->entrezSubSource
->entrezOrgRef
Implements: biosequenceTaxonomy
biosequenceDbrefs ;; returns db records
biosequenceSynonyms
->entrezOrgName
->entrezGeneTrack
Implements: biosequenceID
biosequenceStatus
->entrezMap
->entrezGeneSource
->entrezGeneComment
Implements: biosequenceID
entrezComments
biosequenceComments ;; returns comment records
biosequenceNameObject
->entrezExtraterm
Implements: biosequenceNameObject
->entrezOtherSource
Implements: biosequenceUrl
biosequenceDbrefs
->entrezDbtag
Implements: biosequenceDbref
->entrezSeqLocation
Implements: biosequenceIntervals ;; returns interval records
->entrezInterval
Implements: biosequenceInterval
->entrezGeneReader
Implements: biosequenceReader
->entrezgeneFile
Implements: biosequenceIO
biosequenceFile
->entrezGeneConnection
Implements: biosequenceIO
```
Owner
- Name: Mike Panciera
- Login: averagehat
- Kind: user
- Website: averagehat.github.io
- Repositories: 115
- Profile: https://github.com/averagehat
Dependencies
- clj-http 1.0.1
- clj-time 0.9.0
- com.taoensso/nippy 2.7.1
- com.velisco/tagged 0.3.4
- fs 1.3.3
- iota 1.1.2
- org.apache.commons/commons-compress 1.9
- org.clojars.hozumi/clj-commons-exec 1.1.0
- org.clojure/clojure 1.6.0
- org.clojure/data.xml 0.0.8
- org.clojure/data.zip 0.1.1