https://github.com/agormp/sequencelib
Library for analyzing and manipulating DNA and protein sequences
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 2 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
sequencelib
Using the classes and methods in sequencelib.py, you can read and write text files containing DNA or protein sequences (aligned or unaligned), and analyze or manipulate these sequences in various ways.
Note: Much of the functionality in sequencelib is also available through the command-line tool seqconverter.
Availability
The sequencelib.py module is available on GitHub: https://github.com/agormp/sequencelib and on PyPI: https://pypi.org/project/sequencelib/
Installation
```
python3 -m pip install sequencelib
```
Upgrading to the latest version:
```
python3 -m pip install --upgrade sequencelib
```
Quick Start Tutorial for sequencelib
Note: This tutorial is under construction. This version was mostly generated using ChatGPT, with some editing.
This quick start guide introduces some basic functionalities of sequencelib.
Loading Sequences
sequencelib supports various file formats: fasta, nexus, clustal, phylip, raw, tab, how, and genbank. It automatically detects the file format:
Reading Unaligned Sequences
```python
import sequencelib as sq

seqfile = sq.Seqfile("seqfilename.fasta")
seqset = seqfile.read_seqs()
```
Iterate over sequences
```python
for seq in seqset:
    print(seq.name, len(seq))
```
Reading Aligned Sequences
```python
import sequencelib as sq

seqfile = sq.Seqfile("alignment.fasta")
alignment = seqfile.read_alignment()

print("Number of sequences:", len(alignment))
print("Alignment length:", alignment.alignlen())
```
Find Columns with 50% or More Gaps
```python
nseqs = len(alignment)
gapcols = []
for i in range(alignment.alignlen()):
    col = alignment.getcolumn(i)
    gapfrac = col.count("-") / nseqs
    # Keep the column index if at least half of the sequences have a gap here
    if gapfrac >= 0.5:
        gapcols.append(i)
```
Export Alignment to File
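```python
# `subalignment` is assumed to be a Seq_alignment object holding the
# columns of interest (for example, those listed in gapcols)
with open("gapcols.fasta", "w") as f:
    f.write(subalignment.fasta())

with open("gapcols.nexus", "w") as f:
    f.write(subalignment.nexus())

with open("gapcols.clustal", "w") as f:
    f.write(subalignment.clustal())
```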
Analyzing Individual Columns
Directly access columns and analyze their conservation:
```python
column = subalignment.getcolumn(0)
if len(set(column)) > 1:
    print("This column is not conserved")
```
Mapping Sequence and Alignment Positions
Map positions between sequence (without gaps) and alignment:
```python
alignpos_0index = alignment.seqpos2alignpos("seq1", 41)  # Index starts at 0
alignpos_1index = alignment.seqpos2alignpos("seq1", 42, slicesyntax=False)  # Index starts at 1
```
Convert from alignment position to sequence position:
```python
seqpos, gapstatus = alignment.alignpos2seqpos("seq1", 153)
if gapstatus:
    print(f"Alignment position is a gap; closest preceding residue is at sequence position {seqpos}")
```
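The idea behind this mapping can be pictured as counting non-gap characters along one aligned row. The following is a minimal standalone sketch in plain Python, not sequencelib's own implementation. The function names `seqpos_to_alignpos` and `alignpos_to_seqpos` are hypothetical, indexing is 0-based throughout, and `-` is assumed to be the only gap character.

```python
def seqpos_to_alignpos(aligned_row, seqpos, gap="-"):
    """Return the 0-based alignment column holding residue number `seqpos` (0-based)."""
    residues_seen = -1
    for col, char in enumerate(aligned_row):
        if char != gap:
            residues_seen += 1
            if residues_seen == seqpos:
                return col
    raise IndexError("seqpos is beyond the end of the ungapped sequence")


def alignpos_to_seqpos(aligned_row, alignpos, gap="-"):
    """Return (seqpos, is_gap) for a 0-based alignment column.

    If the column is a gap, seqpos is the closest preceding residue
    (-1 when no residue precedes the gap).
    """
    is_gap = aligned_row[alignpos] == gap
    n_residues = sum(1 for char in aligned_row[:alignpos + 1] if char != gap)
    return n_residues - 1, is_gap


row = "AC--GT-A"
print(seqpos_to_alignpos(row, 2))  # column of the third residue ("G")
print(alignpos_to_seqpos(row, 3))  # column 3 is a gap
```

Both directions are linear scans over one row, which is usually fast enough for occasional lookups. For many lookups on the same sequence, precomputing the list of non-gap column indices once would avoid repeated scanning.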
Working with Individual Sequences
Each sequence object has multiple attributes and methods:
```python
seq = seqset[0]
print(seq.name)
print(len(seq))
print(seq.fasta())

shuffledseq = seq.shuffle()
# translate() is defined on DNA_sequence objects only
proteinseq = seq.translate()
```
Window Iteration
Iterate through sequence windows:
```python
for seqwindow in seq.windows(wsize=30):
    print(seqwindow.fasta())
```
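To illustrate what a sliding window over a sequence produces, here is a small standalone sketch on a plain string. It is not the library's `windows()` method: the helper `iter_windows` is hypothetical, it yields only full-length windows, and it ignores the overhang, padding, and rename options.

```python
def iter_windows(seq, wsize, stepsize=1):
    """Yield (start, window) pairs for every full-length window of `seq`."""
    for start in range(0, len(seq) - wsize + 1, stepsize):
        yield start, seq[start:start + wsize]


for start, window in iter_windows("ACGTACGT", wsize=4, stepsize=2):
    print(start, window)
```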
More Features
The sequencelib library contains many additional functionalities such as:
- Calculating pairwise sequence distances
- Removing conserved or ambiguous columns
- Reverse complementing DNA sequences
- Handling complex alignments with partitions
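As an illustration of one of these operations, removing conserved columns, the sketch below works on a plain list of equal-length strings. It is a standalone example under that assumption, not sequencelib code, and the function name `remove_conserved_columns` is hypothetical. A column counts as conserved here when every row has the same character.

```python
def remove_conserved_columns(rows):
    """Return the rows with every fully conserved column removed."""
    ncols = len(rows[0])
    # A column is variable when it contains more than one distinct character
    keep = [i for i in range(ncols) if len({row[i] for row in rows}) > 1]
    return ["".join(row[i] for i in keep) for row in rows]


alignment_rows = ["ACGTA", "ACCTA", "ATGTA"]
print(remove_conserved_columns(alignment_rows))
```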
SequenceLib: Class and Method Reference
Class: Sequence
Base class representing a biological sequence (DNA, protein, or other types).
Constructor
```python
Sequence(name, seq, annotation='', comments='', check_alphabet=False, degap=False)
```
- name: Identifier for the sequence.
- seq: The actual biological sequence string.
- annotation: Annotation information for each residue.
- comments: Additional metadata or notes.
- check_alphabet: Checks sequence against allowed alphabet symbols.
- degap: Removes gap characters (`-`).
Methods
- `__len__()`: Returns the length of the sequence.
- `__getitem__(index)`: Allows indexing and slicing of the sequence.
- `__setitem__(index, residue)`: Modifies residue at a given index.
- `__str__()`: Returns FASTA-formatted string.
- `copy_seqobject()`: Returns a deep copy of the sequence object.
- `rename(newname)`: Changes the sequence name.
- `subseq(start, stop, slicesyntax=True, rename=False)`: Extracts subsequence between start and stop positions.
- `subseqpos(poslist, namesuffix=None)`: Creates subsequence from specified positions.
- `appendseq(other)`: Appends another sequence at the end.
- `prependseq(other)`: Prepends another sequence at the start.
- `windows(wsize, stepsize=1, l_overhang=0, r_overhang=0, padding="X", rename=False)`: Iterates over windows of the sequence.
- `remgaps()`: Removes gaps from the sequence.
- `shuffle()`: Randomly shuffles the sequence residues.
- `indexfilter(keeplist)`: Keeps only residues at specified positions.
- `seqdiff(other, zeroindex=True)`: Lists differences between two sequences.
- `hamming(other)`: Computes Hamming distance (absolute differences).
- `hamming_ignoregaps(other)`: Computes Hamming distance, ignoring gaps.
- `pdist(other)`: Computes proportional differences per site.
- `pdist_ignoregaps(other)`: Computes proportional differences, ignoring gaps.
- `pdist_ignorechars(other, igchars)`: Proportional differences ignoring specified characters.
- `residuecounts()`: Counts residues and returns a dictionary.
- `composition(ignoregaps=True, ignoreambig=False)`: Calculates composition as frequencies.
- `findgaps()`: Identifies gap positions.
- `fasta(width=60, nocomments=False)`: Returns FASTA format representation.
- `how(width=80, nocomments=False)`: Returns HOW format representation.
- `gapencoded()`: Encodes gaps as binary (1/0) string.
- `tab(nocomments=False)`: Returns TAB format representation.
- `raw()`: Returns sequence in raw format.
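To make the distance measures concrete, the following standalone sketch computes a Hamming distance and a gap-ignoring proportional distance on plain strings. It only illustrates the definitions and is not the library's code. How sequencelib itself treats gaps in `hamming()` and `pdist()` is not stated here, so this sketch assumes that a gap paired with a residue counts as a difference unless gaps are explicitly ignored.

```python
def hamming(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def pdist_ignoregaps(seq1, seq2, gap="-"):
    """Proportion of differing sites, skipping sites where either sequence has a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


s1 = "ACGT-A"
s2 = "ACCTTA"
print(hamming(s1, s2))
print(pdist_ignoregaps(s1, s2))
```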
Class: DNA_sequence(Sequence)
Specialized sequence class for DNA sequences. Has access to all the methods in its base class (Sequence) in addition to the ones listed here.
Methods
- `revcomp()`: Returns reverse complement.
- `translate(reading_frame=1)`: Translates DNA to protein sequence.
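The two operations can be sketched on plain strings as follows. This is a standalone illustration rather than sequencelib's implementation. It assumes the standard genetic code, unambiguous bases (A, C, G, T), a 1-based `reading_frame` of 1 to 3 on the forward strand, stop codons written as `*`, and unrecognized codons written as `X`. The library's handling of these cases may differ.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, reading_frame=1):
    """Translate complete codons starting at the given 1-based reading frame."""
    start = reading_frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon.upper(), "X") for codon in codons)


print(revcomp("ATGGCC"))
print(translate("ATGGCCTAA"))
print(translate("CATGGCCTAA", reading_frame=2))
```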
Class: Protein_sequence(Sequence)
Specialized sequence class for protein sequences. Has access to all the methods in its base class (Sequence).
Class: Sequences_base
Abstract base class for sequence collections. Should not be instantiated directly. All methods here can be used in both Seq_alignment and Seq_set objects.
Methods
- `__len__()`: Returns the number of sequences.
- `__getitem__(index)`: Accesses sequences via indexing or slicing.
- `__setitem__(index, value)`: Sets sequences by integer index.
- `__eq__(other)`: Checks equality with another sequence collection.
- `__ne__(other)`: Checks inequality with another sequence collection.
- `__str__()`: Returns FASTA format of the collection.
- `sortnames(reverse=False)`: Alphabetically sorts sequences by name.
- `addseq(seq, silently_discard_dup_name=False)`: Adds a sequence object.
- `addseqset(other, silently_discard_dup_name=False)`: Adds sequences from another collection.
- `remseq(name)`: Removes sequence by name.
- `remseqs(namelist)`: Removes multiple sequences.
- `changeseqname(oldname, newname, fix_dupnames=False)`: Renames a sequence.
- `getseq(name)`: Retrieves sequence by name.
- `subset(namelist)`: Extracts a subset by names.
- `subsample(samplesize)`: Randomly selects a subset.
- `subseq(start, stop, slicesyntax=True, rename=True, aln_name=None, aln_name_number=False)`: Extracts subset by positions.
- `getnames()`: Returns a list of sequence names.
- `range(rangefrom, rangeto)`: In-place subset of sequences.
- `removedupseqs()`: Removes duplicate sequences.
- `group_identical_seqs()`: Groups identical sequences.
- `residuecounts()`: Counts residues across all sequences.
- `composition(ignoregaps=True, ignoreambig=False)`: Computes frequency composition.
- `clean_names(illegal=":;,()[]", rep="_")`: Cleans illegal characters from names.
- `rename_numbered(basename, namefile=None)`: Renames sequences numerically.
- `rename_regexp(old_regex, new_string, namefile=None)`: Renames sequences using regex.
- `transname(namefile)`: Renames sequences using a mapping file.
- `revcomp()`: Reverse complements all sequences.
- `translate(reading_frame=1)`: Translates all sequences (DNA only).
- `fasta(width=60, nocomments=False)`: FASTA format.
- `how(width=60, nocomments=False)`: HOW format.
- `tab(nocomments=False)`: TAB format.
- `raw()`: RAW format.
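Two of these collection-level operations, name cleaning and grouping of identical sequences, can be sketched on plain Python data as shown below. This is a standalone illustration, not the library's code. The helper names `clean_name` and `group_identical` are hypothetical, the default `illegal` and `rep` values are copied from the `clean_names` signature above, and the return format of the library's `group_identical_seqs()` is not stated here and may differ from this sketch.

```python
from collections import defaultdict


def clean_name(name, illegal=":;,()[]", rep="_"):
    """Replace every character listed in `illegal` with `rep`."""
    return name.translate(str.maketrans({char: rep for char in illegal}))


def group_identical(named_seqs):
    """Group names of (name, sequence) pairs that share exactly the same sequence."""
    groups = defaultdict(list)
    for name, seq in named_seqs:
        groups[seq].append(name)
    return list(groups.values())


print(clean_name("Homo sapiens (human);1"))
print(group_identical([("a", "ACGT"), ("b", "ACGA"), ("c", "ACGT")]))
```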
Class: Seq_alignment(Sequences_base)
Represents aligned sequences. This class also has access to all methods defined in base class (Sequences_base).
Methods
- `alignlen()`: Length of the alignment.
- `getcolumn(i)`: Retrieves column by index.
- `columns()`: Iterates over columns.
- `samplecols(samplesize)`: Randomly samples columns.
- `conscols()`: Lists conserved columns.
- `varcols()`: Lists variable columns.
- `gappycols()`: Lists columns with gaps.
- `site_summary()`: Summarizes alignment sites.
- `indexfilter(keeplist)`: Keeps columns by indices.
- `remcols(discardlist)`: Removes columns by indices.
- `remambigcol()`: Removes ambiguous columns.
- `remfracambigcol(frac)`: Removes columns with high ambiguity fraction.
- `remgapcol()`: Removes columns with gaps.
- `remfracgapcol(frac)`: Removes columns with high gap fraction.
- `remendgapcol(frac=0.5)`: Removes end-gap columns.
- `remconscol()`: Removes conserved columns.
- `findgaps()`: Identifies gap positions.
- `gap_encode()`: Binary encodes gaps.
- `seqpos2alignpos(seqname, seqpos, slicesyntax=True)`: Maps sequence to alignment position.
- `alignpos2seqpos(seqname, alignpos, slicesyntax=True)`: Maps alignment to sequence position.
- `shannon(countgaps=True)`: Computes Shannon entropy.
- `consensus()`: Generates consensus sequence.
- `phylip(width=60)`: PHYLIP format.
- `clustal(width=60)`: CLUSTAL format.
- `nexus(width=60, print_partitioned=False)`: NEXUS format.
- `charsetblock()`: Generates MrBayes charset block for partitioned analyses.
- `mbpartblock()`: Generates detailed MrBayes block (charset, partitions, models, MCMC) for partitioned analyses.
- `bestblock()`: Generates MrBayes BEST block for species-tree analyses (taxsets, charsets, BEST parameters).
- `nexuspart()`: Generates Nexus-formatted MrBayes block with partition and model specifications.
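Per-column Shannon entropy and a majority-rule consensus can be sketched in plain Python as follows. This is a standalone illustration of the concepts, not sequencelib's implementation. It assumes entropy is measured in bits (log base 2), that `countgaps=False` means gap characters are dropped before counting, and that the consensus is simply the most frequent character in each column. The library may use different conventions. The helper names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, countgaps=True, gap="-"):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    symbols = list(column) if countgaps else [c for c in column if c != gap]
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def consensus(rows):
    """Most frequent character in each column of a list of equal-length rows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


rows = ["ACGT", "ACGA", "TCGA"]
print(consensus(rows))
print(column_entropy("AACC"))
```

A fully conserved column has entropy 0, and the value grows as the symbols in the column become more evenly mixed.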
Class Seqfile_reader
Base class for reading sequence files. Typically, you do not instantiate this class directly.
Methods:
- `makeseq(name, seq, annotation="", comments="")`: Creates and returns a sequence object based on provided type information.
- `readseq()`: Reads a single sequence from a file and returns it as a sequence object.
- `read_seqs(silently_discard_dup_name=False)`: Reads all sequences and returns a `Seq_set` object.
- `read_alignment(silently_discard_dup_name=False)`: Reads aligned sequences, returning a `Seq_alignment` object.
Class Fastafilehandle
Class for handling FASTA files.
Methods:
- `__init__(filename, seqtype="autodetect", check_alphabet=False, degap=False, nameishandle=False)`: Initializes a FASTA file reader, performs format checks.
- `__next__()`: Parses and returns the next sequence as a sequence object.
Class Howfilehandle
Class for reading HOW-formatted files.
Methods:
- `__init__(...)`
- `__next__()`
Class Genbankfilehandle
Class for reading GenBank files.
Methods:
- `__init__(...)`
- `__next__()`
- `find_LOCUS()`
- `read_metadata()`
- `extract_annotation(metadata)`
- `extract_name(metadata)`
- `read_genbankseq()`
Class Tabfilehandle
Handles tab-delimited sequence files.
Methods:
- `__init__(...)`
- `__next__()`
Class Rawfilehandle
Handles raw-format sequence files.
Methods:
- `__init__(...)`
- `__next__()`
Class Alignfile_reader
Base class for alignment files.
Methods:
- `makeseq(name, seq, annotation="", comments="")`
- `read_seqs(silently_discard_dup_name=False)`
Class Clustalfilehandle
Reads Clustal-formatted alignment files.
Methods:
- `__init__(...)`
- `read_alignment(silently_discard_dup_name=False)`
Class Phylipfilehandle
Handles Phylip-formatted alignment files.
Methods:
- `__init__(...)`
- `read_alignment(silently_discard_dup_name=False)`
Class Nexusfilehandle
Handles Nexus-formatted alignment files.
Methods:
- `__init__(...)`
- `read_alignment(silently_discard_dup_name=False)`
Class Stockholmfilehandle
Reads Stockholm-formatted alignment files.
Methods:
- `__init__(...)`
- `read_alignment(silently_discard_dup_name=False)`
Class Seqfile
Factory class to autodetect file formats and instantiate the correct file handler.
Methods:
- `__new__(klass, filename, filetype="autodetect", ...)`: Automatically selects the appropriate sequence or alignment reader based on file contents or explicitly provided file type.
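The general pattern of a factory class whose `__new__` returns an instance of a different, format-specific class can be sketched as below. This is a simplified standalone illustration, not the actual `Seqfile` code. The classes `Reader`, `FastaReader`, and `NexusReader` are hypothetical, and the detection rule (looking only at the first line for `>` or `#NEXUS`) is an assumption made for the example. The real autodetection logic is not described here.

```python
import os
import tempfile


class FastaReader:
    def __init__(self, filename):
        self.filename = filename


class NexusReader:
    def __init__(self, filename):
        self.filename = filename


class Reader:
    """Factory: Reader(filename) returns an instance of a format-specific class."""

    def __new__(cls, filename, filetype="autodetect"):
        if filetype == "autodetect":
            # Sniff the format from the first line of the file
            with open(filename) as infile:
                first_line = infile.readline().strip()
            if first_line.startswith(">"):
                filetype = "fasta"
            elif first_line.upper().startswith("#NEXUS"):
                filetype = "nexus"
            else:
                raise ValueError("could not detect file format")
        handlers = {"fasta": FastaReader, "nexus": NexusReader}
        # The returned object is not a Reader, so Reader.__init__ is never called
        return handlers[filetype](filename)


# Write a tiny FASTA file and let the factory pick the reader
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(">seq1\nACGT\n")
reader = Reader(tmp.name)
print(type(reader).__name__)
os.remove(tmp.name)
```

The advantage of this design is that calling code needs only one entry point and does not have to know which reader class matches a given file.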
Owner
- Name: Anders Gorm Pedersen
- Login: agormp
- Kind: user
- Location: Denmark
- Company: Technical University of Denmark
- Repositories: 9
- Profile: https://github.com/agormp
Professor of Bioinformatics
GitHub Events
Total
- Push event: 5
Last Year
- Push event: 5
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 92
- Total Committers: 2
- Avg Commits per committer: 46.0
- Development Distribution Score (DDS): 0.011
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anders Gorm Pedersen | a****e@d****k | 91 |
| Anders Gorm Pedersen | a****e@D****l | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: 147 last month (PyPI)
- Total dependent packages: 4
- Total dependent repositories: 1
- Total versions: 43
- Total maintainers: 1
pypi.org: sequencelib
Read, write, and analyze biological sequences
- Homepage: https://github.com/agormp/sequencelib
- Documentation: https://sequencelib.readthedocs.io/
- License: GNU General Public License v3 (GPLv3)
- Latest release: 2.23.2 (published 11 months ago)