https://github.com/agormp/seqconverter

Command-line program for reading, writing, analyzing, and manipulating sequence files

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.0%) to scientific vocabulary

Keywords

bioinformatics python
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: agormp
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 375 KB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed 12 months ago
Metadata Files
Readme License

README.md

seqconverter

The command-line program seqconverter can read and write text files containing aligned or unaligned DNA or protein sequences. The program understands most standard and some non-standard formats (fasta, Nexus, Phylip, Clustal, Stockholm, tab, raw, Genbank, How). The tool can be used to convert between sequence file formats, and is also able to perform various manipulations and analyses of sequences.

Availability

The seqconverter source code is available on GitHub: https://github.com/agormp/seqconverter. The executable can be installed from PyPI: https://pypi.org/project/seqconverter/

Version 3

Version 3 has recently been released, and contains a number of changes to the user-interface compared to version 2.x.x. For a full overview see notes in the latest release.

Installation

python3 -m pip install seqconverter

Upgrading to latest version:

python3 -m pip install --upgrade seqconverter

Citation

To cite seqconverter: use the link in the right sidebar under About --> Cite this repository.

Dependencies

seqconverter relies on the sequencelib library and the NumPy package, which are automatically included when using pip to install.

Highlights

  • Can be used to convert between sequence file formats but also able to perform many other manipulations and analyses of sequences.
  • Read and write aligned sequences in the following formats:
    • fasta
    • Nexus
    • Phylip
    • Clustal
    • Stockholm (so far only read)
    • tab
    • raw
  • Read and write unaligned sequences in the following formats:
    • fasta
    • tab
    • raw
    • Genbank
    • How
  • Writes to stdout, so output can be used in pipes or redirected to file
  • Also accepts input on stdin
  • Options to select or discard sequences based on one of several criteria: name matches regular expression, name in NAMEFILE, sequence contains specific residues on specific positions, duplicate (identical) sequences, duplicate names, sequence has many gaps at ends (i.e., is shorter than other sequences), random sample of given size, ...
  • Options to select or remove columns from alignment based on one of several criteria: some gaps, more than fraction gaps, more than fraction endgaps, conserved, specified indices, random sample of columns, ...
  • Extract all overlapping windows of specified size
  • Options to rename one or more sequences based on various criteria
  • Options to concatenate identically named sequences from multiple sequence files (end-to-end or discarding automatically discovered overlaps)
  • Options to automatically create Nexus charset commands based on merging multiple individual files (e.g., one charset/partition per gene).
  • Can automatically write MrBayes block with template for commands to run partitioned analysis, also based on merging multiple separate sequence alignments.
  • Can translate and find reverse complement for DNA sequences
  • Options to obtain summary information about sequences and alignments: number of seqs, names, lengths, composition (overall or per sequence), nucleotide diversity (pi), site summary (how many columns are variable, contain multiple residues, contain gaps, or contain IUPAC ambiguity symbols, how many unique site patterns)
  • More...
  • Underlying library has been optimized for high speed and low memory consumption
  • Really has too many options, but does useful stuff (and has been created based on what I needed for own projects)
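
To make the reverse-complement operation mentioned in the list above concrete, here is a minimal Python sketch. It is an illustration only and is not taken from seqconverter or sequencelib; the function name `reverse_complement` is hypothetical, and the handling of symbols is deliberately limited (A, C, G, T and N are complemented, anything else such as gaps is left unchanged).

```python
# Illustrative sketch only; not the implementation used by seqconverter.
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string.

    Characters not in the table (for example the gap symbol '-')
    are passed through unchanged.
    """
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ATGC-N"))
```

On the command line the corresponding option is `--revcomp`, which requires the input sequences to be DNA.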

Quick start usage examples

These examples highlight some of the options available. For the full list use option -h to get help.

Get help:

seqconverter -h


Convert aligned sequences in fasta format to nexus, 70 characters per line

seqconverter --informat fasta --outformat nexus \
    --width 70 -i myalignment.fasta > myalignment.nexus

Note 1: output is written to the terminal so you need to use redirection to store in a file.

Note 2: input format will be automatically detected if not specified with --informat (this works well for standard file types)
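
If the conversion needs to be driven from a Python script rather than a shell, the same command can be assembled as an argument list and its stdout redirected to a file. The sketch below is an assumption about how one might wrap the command line; the helper names `build_convert_command` and `convert` are hypothetical, and only flags documented in this README are used.

```python
import subprocess

def build_convert_command(infile, outformat="nexus", informat="auto", width=70):
    """Build the argument list for a format conversion.

    Uses only the documented flags --informat, --outformat, --width and -i.
    """
    return [
        "seqconverter",
        "--informat", informat,
        "--outformat", outformat,
        "--width", str(width),
        "-i", infile,
    ]

def convert(infile, outfile, **kwargs):
    """Run seqconverter and store its stdout in outfile.

    Requires the seqconverter executable to be on PATH.
    """
    with open(outfile, "w") as handle:
        subprocess.run(build_convert_command(infile, **kwargs),
                       stdout=handle, check=True)

# Example call (not executed here):
# convert("myalignment.fasta", "myalignment.nexus")
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if the program exits with an error.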


Select all sequences whose names match the regular expression "seq_1[0-9]+"

seqconverter --informat fasta --outformat fasta \
    --keepreg "seq_1[0-9]+" -i myseqs.fasta > subset.fasta

Note: default output format is fasta, so you do not need to specify --outformat fasta


Discard all sequences whose names match the regular expression "seq_1[0-9]+":

seqconverter --informat fasta --outformat fasta \
    --remreg "seq_1[0-9]+" -i myseqs.fasta > subset.fasta
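
The help text describes `--keepreg` and `--remreg` as matching a substring of the name, which corresponds to a regular-expression search rather than a full match. The following Python sketch illustrates that behaviour with the standard `re` module; it is not seqconverter's code, and the function name `split_by_name` and the example names are made up for illustration.

```python
import re

def split_by_name(names, pattern):
    """Partition names into (matching, non-matching).

    re.search is used, so the pattern may match anywhere in the name.
    """
    regex = re.compile(pattern)
    kept = [name for name in names if regex.search(name)]
    removed = [name for name in names if not regex.search(name)]
    return kept, removed

names = ["seq_1", "seq_12", "seq_105", "seq_21", "myseq_19b"]
kept, removed = split_by_name(names, r"seq_1[0-9]+")
print(kept)     # names that a --keepreg style filter would retain
print(removed)  # names that a --remreg style filter would retain
```

Note that "seq_1" is not selected, because the pattern requires at least one digit after "seq_1", while "myseq_19b" is selected because the pattern matches inside the name. Anchors such as `^` and `$` can be added to the pattern when a whole-name match is wanted.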


Select random subset of 50 sequences from input file

seqconverter --informat fasta --outformat fasta \
    --sampleseq 50 -i myseqs.fasta > subset.fasta


Select all sequence variants containing a Lysine at position 484 and a Tyrosine at position 501

seqconverter --informat clustal --outformat fasta \
    --keepvar 484K 501Y -i myalignment.aln > voc.fasta
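
A variant such as 484K combines a position with a residue. The Python sketch below shows one way such a specification could be parsed and checked against a sequence. It is a hypothetical illustration, not the program's implementation: the function names are invented, and it assumes that positions are 1-based and refer to the sequence as given in the input, which this README does not state explicitly.

```python
import re

def parse_variant(variant):
    """Split a string such as '484K' into (position, residue)."""
    match = re.fullmatch(r"(\d+)([A-Za-z*-])", variant)
    if match is None:
        raise ValueError(f"Cannot parse variant: {variant!r}")
    return int(match.group(1)), match.group(2).upper()

def has_variants(seq, variants):
    """Return True if seq carries every requested residue.

    Assumption: positions are 1-based indices into seq.
    """
    for pos, residue in map(parse_variant, variants):
        if not 1 <= pos <= len(seq):
            return False
        if seq[pos - 1].upper() != residue:
            return False
    return True

print(parse_variant("484K"))
print(has_variants("MKY", ["2K", "3Y"]))
```

All listed variants must be present for a sequence to be selected in this sketch, mirroring the "a Lysine at position 484 and a Tyrosine at position 501" wording of the example.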


Select columns 50-150 from ClustalW formatted alignment file, write output in fasta

seqconverter --informat clustal --outformat fasta \
    --keepcols 50-150 -i myalignment.aln > alignment_50_150.fasta
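
Column selections are given as single positions or ranges (for example 10 15 22-40 57). As a rough illustration, the Python sketch below expands such values into column indices and applies them to a sequence. This is not the tool's own parser; the function names are hypothetical, and the sketch assumes positions are 1-based with inclusive ranges.

```python
def parse_index_or_range(values):
    """Expand values such as ['10', '22-24'] into sorted 0-based indices.

    Assumption: input positions are 1-based and ranges are inclusive.
    """
    columns = set()
    for value in values:
        start, _, stop = value.partition("-")
        first = int(start)
        last = int(stop) if stop else first
        columns.update(range(first - 1, last))
    return sorted(columns)

def keep_columns(seq, values):
    """Return only the selected columns of seq, in alignment order."""
    return "".join(seq[i] for i in parse_index_or_range(values) if i < len(seq))

print(parse_index_or_range(["10", "22-24"]))
print(keep_columns("ABCDEFGH", ["2", "4-6"]))
```

Removing columns (as with `--remcols`) would be the complement of the same index set.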


Remove columns, where one or more residues are gaps, from alignment:

seqconverter --informat fasta --outformat fasta \
    --remgapcols -i myalignment.fasta > gapfree.fasta


Remove columns, where >= 75% are gaps, from alignment:

seqconverter --informat fasta --outformat fasta \
    --remgapcols 0.75 -i myalignment.fasta > fewergaps.fasta
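
The two `--remgapcols` modes (no argument: drop any column with a gap; with FRAC: drop columns where the gap fraction is >= FRAC) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the help text, not seqconverter's implementation; the function name `remove_gappy_columns` and the assumption that "-" is the only gap symbol are mine.

```python
def remove_gappy_columns(seqs, frac=None, gap="-"):
    """Drop alignment columns based on their gap content.

    frac=None : drop columns that contain at least one gap.
    frac=x    : drop columns where (number of gaps / number of seqs) >= x.
    """
    nseqs = len(seqs)
    keep = []
    for col in range(len(seqs[0])):
        ngaps = sum(seq[col] == gap for seq in seqs)
        if frac is None:
            drop = ngaps > 0
        else:
            drop = ngaps / nseqs >= frac
        if not drop:
            keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in seqs]

alignment = ["AC-T", "A--T", "AG-T", "A-GT"]
print(remove_gappy_columns(alignment))        # any gap removes the column
print(remove_gappy_columns(alignment, 0.75))  # only the 75% gap column goes
```

In the example, the third column has gaps in 3 of 4 sequences (0.75) and is removed in both modes, while the second column (0.5) is removed only in the strict mode.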


Remove columns, where 75% or more of sequences have endgaps, from alignment:

This command will remove alignment columns if 75% or more of the sequences have endgaps in that position. An endgap is defined as a contiguous gappy region at either the beginning or end of a sequence. Endgaps are often a result of missing data (the gaps then do not represent insertion or deletion events).

seqconverter --informat fasta --outformat fasta \
    --remendgapcols 0.75 -i myalignment.fasta > fewer_endgaps.fasta
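
The difference between an endgap and an internal gap is easiest to see in code. The Python sketch below marks endgap positions in each sequence and then removes columns where the fraction of sequences with an endgap is >= FRAC, following the wording of the help text for `--remendgapcols`. It is an illustrative sketch, not the program's implementation; the function names and the choice of "-" as the gap symbol are assumptions.

```python
def endgap_mask(seq, gap="-"):
    """Mark positions that lie in a leading or trailing run of gaps.

    Internal gaps (flanked by residues on both sides) are not endgaps.
    """
    n = len(seq)
    lead = n - len(seq.lstrip(gap))
    trail = n - len(seq.rstrip(gap))
    if lead == n:  # the sequence consists only of gaps
        return [True] * n
    return [i < lead or i >= n - trail for i in range(n)]

def remove_endgap_columns(seqs, frac, gap="-"):
    """Drop columns where the fraction of sequences with endgaps is >= frac."""
    masks = [endgap_mask(seq, gap) for seq in seqs]
    nseqs = len(seqs)
    keep = [col for col in range(len(seqs[0]))
            if sum(mask[col] for mask in masks) / nseqs < frac]
    return ["".join(seq[c] for c in keep) for seq in seqs]

print(endgap_mask("--AC-G-"))
print(remove_endgap_columns(["--ACGT", "-AACGT", "---CGT", "AAACG-"], 0.75))
```

In the second call only the first column is removed, since 3 of 4 sequences have an endgap there.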


Concatenate identically named sequences from separate input files:

Sequences are pasted end to end in the same order as the order of the input files. All input files must contain the same number of sequences, and sequences in different files must have same name (for instance each file could contain an alignment of the sequences for a specific gene from a number of different species, and each sequence could then have the name of the species). The order of sequences in different files does not matter.

When used with the --charset (and possibly --mb) option this can be used to set up a partitioned analysis in MrBayes or BEAST (see below).

seqconverter --informat fasta --outformat fasta \
    --paste -i gene1.fasta -i gene2.fasta -i gene3.fasta > concat.fasta
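
The pasting rule described above (same names in every file, order within files irrelevant, concatenation in input-file order) can be illustrated with a short Python sketch. Here each alignment is represented as a plain dict mapping name to sequence, which is a simplification chosen for the example and not the data structure used by seqconverter; the function name `paste_alignments` is hypothetical.

```python
def paste_alignments(alignments):
    """Concatenate identically named sequences end to end.

    alignments: list of dicts (name -> sequence), in input-file order.
    Raises ValueError if the files do not contain the same set of names.
    """
    names = set(alignments[0])
    for aln in alignments[1:]:
        if set(aln) != names:
            raise ValueError("All input alignments must contain the same sequence names")
    # Name order follows the first file; order inside later files is irrelevant.
    return {name: "".join(aln[name] for aln in alignments)
            for name in alignments[0]}

gene1 = {"human": "ACG", "mouse": "ACT"}
gene2 = {"mouse": "GG", "human": "GA"}
print(paste_alignments([gene1, gene2]))
```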


Concatenate sequences from multiple files, create partitioned Nexus file containing charset command

This command concatenates identically named sequences from separate input alignments, creating a partitioned Nexus file with charset specification. Start and stop indices for different charsets are automatically derived from lengths of sub-alignments. Charsets are named based on the names of included files.

This can be used for phylogenetic analyses in BEAST or MrBayes where different genomic regions (e.g., genes) have different substitution models. Note: sequences in each file need to have identical names (e.g. name of species).

seqconverter --outformat nexus --paste \
    --charset -i gene1.fasta -i gene2.fasta -i gene3.fasta > partitioned.nexus
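
The start and stop indices of each charset follow directly from the lengths of the sub-alignments. The Python sketch below shows the arithmetic and emits a Nexus sets block. The exact layout written by seqconverter may differ; the function name `charset_block`, the indentation, and the example lengths are assumptions made for illustration.

```python
def charset_block(partitions):
    """Build a Nexus sets block from (name, alignment_length) pairs.

    Partitions are listed in concatenation order; Nexus positions
    are 1-based and ranges are inclusive.
    """
    lines = ["begin sets;"]
    start = 1
    for name, length in partitions:
        stop = start + length - 1
        lines.append(f"    charset {name} = {start}-{stop};")
        start = stop + 1
    lines.append("end;")
    return "\n".join(lines)

print(charset_block([("gene1", 300), ("gene2", 450), ("gene3", 120)]))
```

With these hypothetical lengths, gene1 covers columns 1-300, gene2 covers 301-750, and gene3 covers 751-870.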


Concatenate sequences from multiple files, create partitioned Nexus file with commands to run MrBayes or BEAST analysis

This command does the same as the example above, and additionally adds a MrBayes block containing commands to run a partitioned analysis. The commands have sensible default values (e.g., setting DNA substitution models to "nst=mixed" and unlinking most parameters across partitions). Ideally, the commands should be tweaked to suit the specific data set. Importing the Nexus file in BEAUti should result in setting most corresponding options for a BEAST run (but check, and remember to set priors etc.)

seqconverter --outformat nexus --paste \
    --charset --mb -i gene1.fasta -i gene2.fasta -i gene3.fasta > partitioned.nexus

Usage

```
usage: seqconverter [-h] [-i SEQFILE] [--informat FORMAT] [--outformat FORMAT]
                    [--width WIDTH] [--sampleseq N] [--keepreg "REGEXP"]
                    [--remreg "REGEXP"] [--keepname NAMEFILE] [--remname NAMEFILE]
                    [--keepvar VARIANT [VARIANT ...]] [--remdupseq] [--remdupname]
                    [--remendgapseqs MIN] [--samplecols N]
                    [--keepcols INDEXORRANGE [INDEXORRANGE ...]]
                    [--remcols INDEXORRANGE [INDEXORRANGE ...]]
                    [--remgapcols [FRAC]] [--remambigcols [FRAC]]
                    [--remendgapcols [FRAC]] [--remconscols] [--windows WSIZE]
                    [--degap] [--rename OLD NEW] [--renamenum BASENAME]
                    [--renamereg "OLDREGEX" "NEWSTRING"] [--saverename NAMEFILE]
                    [--renamefile NAMEFILE] [--gbname FIELD1[,FIELD2,FIELD3,...]]
                    [--paste] [--overlap [MIN]] [--multifile] [--charset] [--mb]
                    [--revcomp] [--translate READING_FRAME] [--nam] [--num]
                    [--len] [--sit] [--com] [--comseq] [--div] [--divseq]
                    [--ignoregaps] [--debug]

options:
  -h, --help            show this help message and exit
  --debug               Print longer error messages

Input/Output:
  -i SEQFILE            One or more sequence files (repeat -i SEQFILE option for
                        each input file). If -i SEQFILE is not given: take input
                        from stdin (typically from a UNIX pipe).
  --informat FORMAT     Input format: auto, fasta, nexus, phylip, clustal,
                        stockholm, genbank, tab, raw, how [default: auto]
  --outformat FORMAT    Output format: fasta, nexus, phylip, clustal, tab, raw,
                        how [default: fasta]
  --width WIDTH         Print sequences with WIDTH characters per line
                        [default: 60] Use the special value -1 (--width -1) to
                        print each sequence in its entirety on a single line,
                        regardless of its length.

Selecting subset of sequences:
  --sampleseq N         Randomly sample N sequences from sequence set
  --keepreg "REGEXP"    Select sequences where substring of name matches regular
                        expression
  --remreg "REGEXP"     Discard sequences where substring of name matches
                        regular expression
  --keepname NAMEFILE   Select sequences listed in NAMEFILE
  --remname NAMEFILE    Discard sequences listed in NAMEFILE
  --keepvar VARIANT [VARIANT ...]
                        Select sequences containing specific variants, i.e.,
                        specific residues on specific positions. Syntax for
                        specifying VARIANT is: (e.g., 484K). Multiple variants
                        can be specified simultaneously separated by blanks.
                        Example: --keepvar 484K 501Y
  --remdupseq           Remove duplicate sequences (keeping one of each,
                        randomly selected).
  --remdupname          Remove sequences with duplicate names (keeping one of
                        each, randomly selected). If this option is not set
                        (default): stop execution on duplicate names.
  --remendgapseqs MIN   Discard sequences with endgaps >= MIN positions. Endgaps
                        are defined as contiguous block of gap symbols at either
                        end of sequence.

Selecting subset of positions in sequences:
  --samplecols N        Randomly sample N columns from alignment
  --keepcols INDEXORRANGE [INDEXORRANGE ...]
                        Keep alignment columns indicated by one or more
                        INDEXORRANGE values. INDEXORRANGE values are either a
                        single position (e.g., 15) or a range (e.g., 20-37).
                        Multiple values should be separated by blanks.
                        Example: --keepcols 10 15 22-40 57
  --remcols INDEXORRANGE [INDEXORRANGE ...]
                        Remove alignment columns indicated by one or more
                        INDEXORRANGE values. INDEXORRANGE values are either a
                        single position (e.g., 15) or a range (e.g., 20-37).
                        Multiple values should be separated by blanks.
                        Example: --remcols 10 15 22-40 57
  --remgapcols [FRAC]   Remove columns that contain any gaps. If FRAC (number
                        between 0-1) given: Remove columns where the fraction of
                        gaps >= FRAC.
  --remambigcols [FRAC]
                        Remove columns where one or more residues are ambiguity
                        symbols (e.g., N for nucleotides). If FRAC (number
                        between 0-1) given: Remove columns where the fraction of
                        ambiguity symbols >= FRAC.
  --remendgapcols [FRAC]
                        Remove columns where one or more sequences have endgaps.
                        If FRAC (number between 0-1) given: Remove columns where
                        the fraction of sequences having endgaps is >= FRAC.
                        Endgaps are defined as contiguous block of gap symbols
                        at either end of sequence
  --remconscols         Remove conserved columns from alignment
  --windows WSIZE       For each sequence in input: extract all overlapping
                        sequence windows of size WSIZE
  --degap               Remove all gap characters from sequences

Renaming sequences:
  --rename OLD NEW      Rename single sequence from OLD to NEW
  --renamenum BASENAME  Rename all sequences to this form: BASENAME001, ...
  --renamereg "OLDREGEX" "NEWSTRING"
                        Rename sequences: Replace occurrences of regular
                        expression OLDREGEX with NEW_STRING
  --saverename NAMEFILE
                        Save renaming information in NAMEFILE for later use
  --renamefile NAMEFILE
                        Replace names in sequence file using OLDNAME NEWNAME
                        pairs in NAMEFILE. Not all names need to be listed.
                        Note: can be used to restore names saved with
                        --saverename during previous renaming.
  --gbname FIELD1[,FIELD2,FIELD3,...]
                        For Genbank input: construct sequence names from the
                        list of named fields, in the specified order

Combining multiple sequence files:
  --paste               Concatenate identically named sequences from separate
                        input files. Sequences are pasted end to end in the same
                        order as the order of the input files. All input files
                        must contain same number of sequences, and sequences in
                        different files must have same name. (Order of sequences
                        in individual file is not important). To see partitions
                        choose nexus output, or output to multiple partition
                        files.
  --overlap [MIN]       Similar to --paste, but for input alignments that
                        overlap partly at their ends. End-overlaps are
                        discovered automatically and partition boundaries are
                        then set such that each partition is covered by a unique
                        set of genes. To see partitions choose nexus output, or
                        output to multiple partition files. MIN: (optional,
                        integer) minimum number of overlapping residues required
                        for merging input alignments (default: set automatically
                        based on seq lengths)
  --multifile           Outputs to multiple files (one per partition) instead of
                        stdout. Partitions are generated automatically based on
                        other options.
  --charset             Appends Nexus form charset block listing partitions in
                        data (forces output in Nexus format). Charsets and
                        partitions are generated automatically based on other
                        options.
  --mb                  Appends MrBayes block with commands for running
                        partitioned analysis (forces output in Nexus format).
                        Charsets and partitions are generated automatically
                        based on other options.

DNA manipulations:
  --revcomp             Return reverse complement of sequence(s). Requires
                        sequences to be DNA.
  --translate READINGFRAME
                        Translate input DNA sequences into amino acid sequences.
                        READINGFRAME: either 1, 2, or 3, where 1 means start
                        translation from first nucleotide in sequences.
                        Translation includes as many full-length codons as
                        possible, given READING_FRAME.

Summaries:
  No sequences are printed when these options are used

  --nam                 Print names of sequences
  --num                 Print number of sequences
  --len                 Print summary of sequence lengths
  --sit                 (For alignments) Print site summary: how many columns
                        are variable, contain multiple residues, contain gaps,
                        or contain IUPAC ambiguity symbols. Also keeps track of
                        overlaps between these categories, and the number of
                        unique site patterns (columns)
  --com                 Print overall sequence composition
  --comseq              Print composition for each individual sequence. Output
                        is one line per residue-type per sequence: seqname,
                        residue-type, freq, count, seqlength
  --div                 (For alignments) Print sequence diversity (=average
                        pairwise sequence difference): mean, std, min, max
  --divseq              (For alignments) Print sequence diversity for each pair
                        of sequences: name1, name2, fractional difference
  --ignoregaps          When computing composition or diversity: do not count
                        gap symbols
```
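
The help text defines sequence diversity as the average pairwise sequence difference. As a rough illustration of that quantity, the Python sketch below computes the fractional difference for each pair of aligned sequences and averages over all pairs. It is not the implementation in seqconverter or sequencelib; the function names are hypothetical, and the way gap-containing positions are skipped when `ignoregaps=True` is an assumption about what `--ignoregaps` might mean for diversity.

```python
from itertools import combinations
from statistics import mean

def pairwise_difference(seq1, seq2, ignoregaps=False, gap="-"):
    """Fraction of aligned positions at which two sequences differ.

    Assumption: with ignoregaps=True, positions where either sequence
    has a gap are excluded from both numerator and denominator.
    """
    pairs = list(zip(seq1, seq2))
    if ignoregaps:
        pairs = [(a, b) for a, b in pairs if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def diversity(seqs, ignoregaps=False):
    """Mean pairwise difference over all unordered pairs of sequences."""
    return mean(pairwise_difference(a, b, ignoregaps)
                for a, b in combinations(seqs, 2))

alignment = ["ACGT", "ACGA", "TCGA"]
print(diversity(alignment))
```

For the three example sequences the pairwise differences are 0.25, 0.5 and 0.25, so the mean is 1/3.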

Owner

  • Name: Anders Gorm Pedersen
  • Login: agormp
  • Kind: user
  • Location: Denmark
  • Company: Technical University of Denmark

Professor of Bioinformatics

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 144
  • Total Committers: 1
  • Avg Commits per committer: 144.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 87
  • Committers: 1
  • Avg Commits per committer: 87.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Anders Gorm Pedersen a****e@d****k 144
Committer Domains (Top 20 + Academic)
dtu.dk: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: 1 day
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 11.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 1 day
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 11.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • curiosusJR (1)
Pull Request Authors
  • agormp (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 166 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 45
  • Total maintainers: 1
pypi.org: seqconverter

Reads and writes sequence files in various formats. Performs manipulations on sequences

  • Versions: 45
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 166 Last month
Rankings
Dependent packages count: 10.1%
Downloads: 17.8%
Average: 21.4%
Dependent repos count: 21.6%
Stargazers count: 27.8%
Forks count: 29.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi