Recent Releases of hybpiper

hybpiper - HybPiper version 2.3.2

  • Bugfix: allow hybpiper stats to be run when gene names in the target file contain a dot (e.g. taxon1-gene001.01). See issue#164.

- Python
Published by chrisjackson-pellicle over 1 year ago

hybpiper - HybPiper version 2.3.1

  • Re-write of the modules for commands hybpiper stats, hybpiper retrieve_sequences, and hybpiper paralog_retriever, to vastly speed up processing of compressed sample (*.tar.gz) folders.
    • Much improved speed using a single thread (the hard-coded default for HybPiper version 2.3.0)
    • Samples can now be processed in parallel when using these commands; new option --cpu added (default is to use all available CPUs minus one).
  • Bugfix: ensure all intron sequences are recovered when running hybpiper retrieve_sequences using the intron option.
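
The "all available CPUs minus one" default can be pictured with the minimal sketch below. It is an illustration rather than HybPiper's own code: the helper name default_cpu_count and its `available` parameter are hypothetical, and the floor of one CPU on single-core machines is an assumption.

```python
import os


def default_cpu_count(available=None):
    """Return all available CPUs minus one, never fewer than one.

    `available` is a hypothetical override for illustration; by default the
    count reported by the operating system is used.
    """
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    # Assumption: always keep at least one CPU for the run itself.
    return max(1, available - 1)


print(default_cpu_count(8))
```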

- Python
Published by chrisjackson-pellicle over 1 year ago

hybpiper - HybPiper version 2.3.0

  • Add option --compress_sample_folder to command hybpiper assemble. Tarballs and compresses the sample folder after assembly has completed, producing <sample_name>.tar.gz.
    • This is useful when running HybPiper on HPC clusters with file number limits.
    • If both an uncompressed and compressed folder exist for a sample, a warning is shown and HybPiper exits.
    • All HybPiper subcommands (stats, recovery_heatmap, retrieve_sequences, paralog_retriever, filter_by_length) work with either compressed or uncompressed sample files/folders, or a combination of both.
    • If a <sample_name>.tar.gz file already exists for a sample, it will be extracted and used for the current run of hybpiper assemble, and the <sample_name>.tar.gz file will be deleted.
  • When using BWA for read mapping, the command samtools flagstat is now run during the hybpiper assemble step, rather than during hybpiper stats, and the results are written to the file(s) <sample_name>_bam_flagstat.tsv and/or <sample_name>_unpaired_bam_flagstat.tsv.
    • If the <sample_name>_bam_flagstat.tsv and/or <sample_name>_unpaired_bam_flagstat.tsv file(s) are not present in a sample directory (i.e. the sample was assembled with HybPiper version <2.3.0), samtools flagstat will be run during hybpiper stats. If the sample is a *.tar.gz file, the *.bam file(s) will first be extracted to disk to a temporary directory called temp_bam_files, within your current working directory. This temporary directory will be deleted after samtools flagstat has been run.
  • Add option --not_protein_coding to hybpiper assemble. When this option is provided, sequences matching your target file references will be extracted from SPAdes contigs using BLASTn, rather than Exonerate. This should improve recovery when using a target file with non-protein-coding sequences. Note that this feature is new and might have bugs - please report any issues.
    • Only nucleotide *.FNA sequences will be produced (i.e. no amino-acid sequences).
    • Intronerate will not be run; intron and supercontig sequences will not be produced.
    • If BLASTx or DIAMOND is selected for read mapping (i.e. protein vs translated-nucleotide searches), a warning will be displayed and read mapping will switch to BWA.
  • Add the following options to control BLASTn searches of SPAdes contigs when option --not_protein_coding is used:

    • --extract_contigs_blast_task. Task to use for blastn searches (blastn, blastn-short, megablast, dc-megablast). Default is blastn.
    • --extract_contigs_blast_evalue. Expectation value (E) threshold for saving hits. Default is 10.
    • --extract_contigs_blast_word_size. Word size for wordfinder algorithm (length of best perfect match).
    • --extract_contigs_blast_gapopen. Cost to open a gap.
    • --extract_contigs_blast_gapextend. Cost to extend a gap.
    • --extract_contigs_blast_penalty. Penalty for a nucleotide mismatch.
    • --extract_contigs_blast_reward. Reward for a nucleotide match.
    • --extract_contigs_blast_perc_identity. Percent identity.
    • --extract_contigs_blast_max_target_seqs. Maximum number of aligned sequences to keep (value of 5 or more is recommended). Default is 500.
  • The final step of the hybpiper assemble pipeline has been renamed from exonerate_contigs to extract_contigs (as either Exonerate or BLASTn can now be used).

  • Reorganised grouping of help options when running hybpiper assemble --help to improve clarity.

  • Changed option --timeout_assemble for hybpiper assemble to --timeout_assemble_reads to match the step name.

  • Changed option --timeout_exonerate_contigs for hybpiper assemble to --timeout_extract_contigs to match the step name.

  • Changed option --exonerate_hit_sliding_window_size for hybpiper assemble to --trim_hit_sliding_window_size. This option now applies to either Exonerate hits (and is measured in amino-acids) or BLASTn (measured in nucleotides). Defaults are 5 amino-acids (Exonerate; changed from previous default of 3) or 15 nucleotides (BLASTn).

  • Changed option --exonerate_hit_sliding_window_thresh for hybpiper assemble to --trim_hit_sliding_window_thresh. This option now applies to either Exonerate hits (and is measured via amino-acid similarity) or BLASTn (measured via nucleotide similarity). Defaults are 75 for amino-acids (Exonerate; changed from previous default of 55) or 65 for nucleotides (BLASTn).

  • Fixed a bug in fix_targetfile.py - MAFFT is now called via subprocess rather than Bio.Align.Applications.MafftCommandline when checking for best match translations (see issue#156).

  • Added a more informative error message if running hybpiper retrieve_sequences or hybpiper paralog_retriever from HybPiper version >=2.2.0 on sample folders from HybPiper version <2.2.0. This error occurs because the sample folders do not contain a <prefix>_chimera_check_performed.txt file (see issue#155).

  • When extracting coding sequences from SPAdes contigs using Exonerate, changed the initial Exonerate run to not use the option --refine full (see Exonerate docs), unless the option --exonerate_refine_full is provided to hybpiper assemble. Although the Exonerate option --refine full should improve output alignments, in some cases it can result in spurious alignment regions (e.g. an intron/non-coding region being included as an "exon" alignment) that can get incorporated into the HybPiper output sequence.
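
The <sample_name>.tar.gz layout produced by --compress_sample_folder can be approximated with Python's standard tarfile module. The sketch below is only an illustration under stated assumptions, not HybPiper's implementation: the function name compress_sample_folder is hypothetical, refusing to overwrite an existing archive is an assumed policy, and removal of the uncompressed folder afterwards is left out.

```python
import tarfile
from pathlib import Path


def compress_sample_folder(sample_dir):
    """Pack `sample_dir` into `<sample_name>.tar.gz` beside it; return the archive path."""
    sample_dir = Path(sample_dir)
    archive = sample_dir.parent / f"{sample_dir.name}.tar.gz"
    if archive.exists():
        # Assumption: never silently overwrite an existing archive.
        raise FileExistsError(f"{archive} already exists")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps member paths relative to the sample name.
        tar.add(sample_dir, arcname=sample_dir.name)
    return archive
```

Because the archive sits next to the folder and carries the same sample name, a downstream tool can look for either form of a sample in one directory.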

- Python
Published by chrisjackson-pellicle almost 2 years ago

hybpiper - HybPiper version 2.2.0

  • Add option --end_with to command hybpiper assemble. Allows the user to end the assembly pipeline at a chosen step (map_reads, distribute_reads, assemble_reads, exonerate_contigs).
  • Add option --exonerate_skip_hits_with_frameshifts to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes contig contains frameshifts when considering hits for assembly of an *.FNA sequence. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
  • Add option --exonerate_skip_hits_with_internal_stop_codons to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes contig contains internal in-frame stop codon(s) when considering hits for assembly of an *.FNA sequence. A single terminal stop codon is allowed. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
  • Add option --exonerate_skip_hits_with_terminal_stop_codons to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes sequence contains a single terminal stop codon. Only applies when option --exonerate_skip_hits_with_internal_stop_codons is also provided. Only use this flag if your target file exclusively contains protein-coding genes with no stop codons included, and you would like to prevent any in-frame stop codons in the output sequences. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
  • Add option --chimeric_stitched_contig_check to command hybpiper assemble. If provided, HybPiper will attempt to determine whether a stitched contig is a potential chimera of contigs from multiple paralogs. Default behaviour in HybPiper v2.2.0 is to skip this check; previous versions performed the check automatically. Skipping this check significantly speeds up the final 'exonerate_contigs' step of the pipeline.
  • Add option --no_pad_stitched_contig_gaps_with_n to command hybpiper assemble. If provided, when constructing stitched contigs, do not pad any gaps between hits (with respect to the "best" protein reference) with a number of Ns corresponding to the reference gap multiplied by 3. Default behaviour in HybPiper v2.2.0 is to pad gaps with Ns; previous versions did this automatically.
  • Add option --skip_targetfile_checks to command hybpiper assemble. Skip the target file checks. Can be used if you are confident that your target file has no issues (e.g. if you have previously run hybpiper check_targetfile).
  • Add option --no_spades_eta to command hybpiper assemble. When SPAdes is run concurrently using GNU parallel, the "--eta" flag can result in many "sh: /dev/tty: Device not configured" errors written to stderr. Using this option removes the "--eta" flag to GNU parallel, silencing both ETA output and the error message.
  • Fixed a bug in exonerate_hits.py that could (rarely) result in a duplicated region in the output *.FNA sequence.
  • Fixed a bug in exonerate_hits.py that occurred when more than two Exonerate hits had identical query ranges and similarity scores; this could result in a sequence not being returned for the given gene.
  • Added tests folder containing initial unit tests. Some tests require python package pyfakefs to run.
  • Refactor of the hybpiper package. New module hybpiper_main.py with entry point (moved from assemble.py), and some assemble.py functions moved to utils.py. Target file checking functionality has been consolidated.
  • HybPiper now logs to stdout rather than stderr.
  • Commands hybpiper check_targetfile and hybpiper assemble now write a report file when checking the target file (check_targetfile_report-<target file name>.txt), rather than logging details to the main sample log. Command hybpiper check_targetfile writes the report to the current working directory, whereas command hybpiper assemble writes it to the sample directory.
  • If the option --cpu is not specified for hybpiper assemble, HybPiper will now use all available CPUs minus one, rather than all available CPUs.
  • Command hybpiper assemble now checks for output from previous runs for the pipeline steps selected via --start_from and --end_with (default is to select all steps). If previous output is found, HybPiper will exit with an error unless the option --force_overwrite is provided.
  • Corrected the reading frame of sequence Artocarpus-gene660 in the test dataset target file.
  • Command hybpiper assemble now writes the file <prefix>_chimera_check_performed.txt to the sample directory. This is a text file containing 'True' or 'False' depending on whether the option --chimeric_stitched_contig_check was provided to command hybpiper assemble. Used by hybpiper retrieve_sequences and hybpiper paralog_retriever.
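
The gap-padding rule described for --no_pad_stitched_contig_gaps_with_n (a gap on the protein reference becomes three Ns per missing residue) can be worked through with a small example. This is a minimal sketch under stated assumptions, not HybPiper's code: the function name and the (ref_start, ref_end, sequence) tuple layout are hypothetical, coordinates are taken to be 0-based and end-exclusive, and overlapping hits are not handled.

```python
def stitch_hits_with_n_padding(hits):
    """Join nucleotide hits, padding gaps with 3 Ns per missing reference residue.

    `hits` is a list of (ref_start, ref_end, nucleotide_seq) tuples whose
    coordinates are positions on the protein reference (hypothetical layout).
    """
    stitched = []
    previous_end = None
    for ref_start, ref_end, seq in sorted(hits):
        if previous_end is not None and ref_start > previous_end:
            gap_in_residues = ref_start - previous_end
            # One amino acid corresponds to three nucleotides.
            stitched.append("N" * (gap_in_residues * 3))
        stitched.append(seq)
        previous_end = ref_end
    return "".join(stitched)


# Two hits separated by a 2-residue gap on the reference give 6 Ns.
hits = [(0, 2, "ATGGCC"), (4, 5, "TTT")]
print(stitch_hits_with_n_padding(hits))  # ATGGCCNNNNNNTTT
```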

- Python
Published by chrisjackson-pellicle almost 2 years ago

hybpiper - HybPiper version 2.1.8

  • Add new subcommand hybpiper filter_by_length, used to filter the sequence output of hybpiper retrieve_sequences by absolute length and/or length relative to mean length in target file representatives. This is done on a per-sample/per-gene basis, rather than the sample-level filtering available in hybpiper retrieve_sequences. See wiki for more information.
  • Update the regex used to check target file fasta header formatting, to capture scenarios where a name contains multiple dashes and also ends with a dash.
  • In the fix_targetfile.py module, remove the import of Bio.Align.Applications.MafftCommandline and call MAFFT via subprocess (see issue#147).
  • In the gene_recovery_heatmap.py module, cast the dataframe from the seq_lengths_file to object dtype to avoid a deprecation warning.
  • Add option --no_heatmap to command hybpiper paralog_retriever (see issue#150).
  • Fix an Exonerate-related debug message in exonerate_hits.py.

- Python
Published by chrisjackson-pellicle about 2 years ago

hybpiper - HybPiper version 2.1.7

  • The flag --run_intronerate was removed from the hybpiper assemble command in the run_hybpiper_test_dataset.sh file.
  • Removed the legacy check and attempted download of the test dataset in the run_hybpiper_test_dataset.sh file.
  • Added a check to hybpiper stats and hybpiper retrieve_sequences to ensure sample names in the namelist.txt file do not contain forward slashes (see issue#143).
  • When checking for putative chimeric gene sequences in hybpiper retrieve_sequences and hybpiper paralog_retriever, generate a warning rather than an error if the file <sample_name>_genes_derived_from_putative_chimeric_stitched_contig.csv can't be found for a given sample. This file will not be written if no gene sequences were produced for this sample (i.e. no reads mapped, no SPAdes contigs, no sequences extracted from SPAdes contigs via Exonerate).
  • Check that target file FASTA headers do not contain quotation marks (" or '); see issue#125.
  • Updated the installation instructions in the README and Wiki to use the Bioconda package, and added installation instruction for Macs with Apple Silicon (M1/M2/M3 chips).
  • Fixed a bug in exonerate_hits.py that meant that hits were not always trimmed to start with the first amino-acid with full alignment identity. This bug could potentially have had an effect on output sequences only if the values for --exonerate_hit_sliding_window_size and/or --exonerate_hit_sliding_window_thresh were changed from default values.
  • Use importlib.metadata rather than pkg_resources for module version checks, due to deprecation of the latter.
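
The two input checks in this release (no forward slashes in sample names, no quotation marks in target file FASTA headers) amount to simple string tests. The sketch below shows one way they might look; the function names are hypothetical and the code is not taken from HybPiper.

```python
def find_invalid_sample_names(names):
    """Return the sample names that contain a forward slash."""
    return [name for name in names if "/" in name]


def header_has_quotes(header):
    """Return True if a FASTA header contains a single or double quotation mark."""
    return any(quote in header for quote in ("'", '"'))


print(find_invalid_sample_names(["sample1", "reads/sample2"]))
print(header_has_quotes(">taxon1-gene001"))
```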

- Python
Published by chrisjackson-pellicle about 2 years ago

hybpiper - HybPiper version 2.1.6

  • Intronerate is now run by default. The flag --run_intronerate for subcommand hybpiper assemble has been changed to --no_intronerate.
  • If Intronerate fails, failed genes and errors will be printed and logged; the exonerate_contigs step of the pipeline will continue.
  • Updated error handling and logging for the exonerate_contigs step of the pipeline.
  • Change default DPI of heatmaps to 100 (previously 150) for hybpiper recovery_heatmap and hybpiper paralog_retriever.
  • Enforce rendering of all loci (x-axis) and sample (y-axis) labels in heatmaps; previously, matplotlib/seaborn would dynamically drop labels if they were too closely spaced.
  • Added flags --no_xlabels and --no_ylabels for hybpiper recovery_heatmap and hybpiper paralog_retriever; turns off rendering of the corresponding labels in the saved figures.
  • If the auto-calculated size of heatmaps for hybpiper recovery_heatmap and hybpiper paralog_retriever is greater than the maximum number of pixels (65536) in length and/or height, resize the figure to 400 inches and 100 DPI. Note that large datasets can fail to render fully in the saved figure even if the pixel dimensions are less than the maximum (see e.g. https://stackoverflow.com/questions/64393779/how-to-render-a-heatmap-for-a-large-array), but reducing the size/DPI further allows the full figure to be rendered.
  • Added module version.py for a single location of HybPiper version number.
  • Print and log HybPiper version when calling all subcommands.
  • Added column 'TotalBasesRecovered' to the hybpiper stats report, listing the total number of nucleotides recovered for each sample (not counting N characters). Added 'TotalBasesRecovered' as a filtering option in hybpiper retrieve_sequences.
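
The heatmap size limit above is a pixel calculation: inches multiplied by DPI must not exceed 65536 in either dimension. The sketch below illustrates that arithmetic. It is not HybPiper's code: the function name fit_figure is hypothetical, and capping only the dimensions that exceed 400 inches is an assumption, since the release note does not say which dimension is resized.

```python
MAX_PIXELS = 65536  # per-dimension limit quoted in the release note


def fit_figure(width_in, height_in, dpi):
    """Return (width_in, height_in, dpi), shrunk if either dimension has too many pixels."""
    if width_in * dpi > MAX_PIXELS or height_in * dpi > MAX_PIXELS:
        # Assumption: cap each dimension at 400 inches and drop to 100 DPI,
        # i.e. at most 40000 pixels per dimension.
        return min(width_in, 400), min(height_in, 400), 100
    return width_in, height_in, dpi


print(fit_figure(700, 10, 100))  # 70000 px wide exceeds the limit
```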

- Python
Published by chrisjackson-pellicle almost 3 years ago

hybpiper - HybPiper version 2.1.5

  • Bugfix: fixed an issue in exonerate_hits.py that could result in initial Exonerate hits being trimmed too aggressively at their 3' ends.
  • Bugfix: fixed an issue in exonerate_hits.py that could introduce minor insertions into the supercontig (concatenated exon and partial intron) sequence used when running Intronerate.

- Python
Published by chrisjackson-pellicle about 3 years ago

hybpiper - HybPiper version 2.1.4

Bugfix: fixed an issue when using --run_intronerate that could cause an error and result in no *.FNA sequence being produced for some genes.

- Python
Published by chrisjackson-pellicle about 3 years ago

hybpiper - HybPiper version 2.1.3

- Python
Published by chrisjackson-pellicle about 3 years ago

hybpiper - HybPiper version 2.1.1

Added hybpiper fix_targetfile command.

- Python
Published by chrisjackson-pellicle over 3 years ago

hybpiper - HybPiper version 2.0.2

First release for HybPiper version 2.

- Python
Published by chrisjackson-pellicle over 3 years ago

hybpiper - Final HybPiper 1.3 Version

Legacy version for HybPiper 1.X, last updated March 2020.

- Python
Published by mossmatters about 4 years ago

hybpiper - Bug Fix Reverse Complement Sequences

Fixes https://github.com/mossmatters/HybPiper/issues/38

All users who recovered sequences with version 1.3 should re-run the exonerate step to recover sequences with the correct strand. Instructions can be found in the issue.

- Python
Published by mossmatters almost 8 years ago

hybpiper - The Herbarium Update

1.3 The Herbarium Update January, 2018

Bug fixes and features related to the use of targeted sequencing from herbarium material. These samples tend to have short contigs that can cause issues when trying to assemble full-length genes.

Features

  • Added --exclude flag to be the inverse of --target: all sequences with the specified string will not be used as targets for exon extraction (they will still be used for read-mapping). Useful if you want to add supercontig sequence to the target file, but not use it for exon extraction.

  • Added --addN to intronerate.py. This feature will add 10 N characters between joined contigs when recovering the supercontig. This is useful for identifying where the intron recovery fails, and for annotation processing (e.g. for GenBank).

  • Added a new version of the heatmap script, gene_recovery_heatmap_ggplot.R. This script is much simpler and produces nice color PNG images, but struggles a bit on PDF output. The original heatmap script is still included. Thanks to Paul Wolf for the ggplot code!

Bug Fixes

  • Fixed misassembly of supercontigs when there are multiple alignments to different parts of the same exon.
  • Fixed poor filtering of GFF results to produce intron/exon annotation.
  • Fixed non-propagation of Exonerate parameters.

- Python
Published by mossmatters over 8 years ago

hybpiper - HybPiper 1.2: Target Finesse, Unpaired Reads, and Python 3

Features

  • Added --unpaired flag. When using paired-end sequencing reads, a third read file may be specified with this flag. Reads will be mapped to targets separately, but will be used along with paired reads in contig assembly.

  • Added --target flag. Adds the ability to choose which of the reference sequences is used for each gene. If --target is a file (tab-delimited file with one gene and one target name per line), HybPiper will use that. Otherwise --target can be the name of one reference. HybPiper will only use targets with the specified name in the Alignment/Exon Extraction phase. All other targets for that locus will only be used in the Mapping/Read Sorting phase.

  • Added --timeout flag, which uses GNU Parallel to kill processes (i.e. SPAdes or Exonerate) if they take X percent longer than average. Use if there are a lot of stuck jobs (--timeout 1000).

  • Python 3 compatibility
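
The tab-delimited file accepted by --target (one gene and one target name per line) can be read with Python's csv module. The sketch below is an assumption-laden illustration, not HybPiper's parser: the function name is hypothetical, the column order (gene first, target second) is inferred from the wording above, and the example names are made up.

```python
import csv
import io


def read_target_choices(handle):
    """Map each gene name to its chosen target name from a two-column TSV."""
    choices = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Assumption: column 1 is the gene, column 2 is the target name.
        gene, target = row[0], row[1]
        choices[gene] = target
    return choices


# Made-up example content standing in for a real file.
example = io.StringIO("gene001\ttaxon1\ngene002\ttaxon2\n")
print(read_target_choices(example))
```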

Bug Fixes

  • Can accommodate Solexa FASTQ paired headers
  • Fixed spades_runner.py not recognizing --cpu on redos
  • Prints more meaningful messages for some common errors
  • Can accommodate prefix not being in current directory
  • Deletes sorted reads on restart to prevent double counting reads.
  • spades_runner.py will now respect --kvals
  • Added initial call to log for reads_first.py

- Python
Published by mossmatters about 9 years ago

hybpiper - HybPiper Published

Release associated with manuscript in Applications in Plant Sciences.

-- Added paralog_investigator.py, which detects and extracts long exons from putative paralogs in all genes in one sample.

-- Added paralog_retriever.py, which retrieves sequences generated by paralog_investigator.py for many samples (or the coding sequence generated by exonerate_hits.py if no paralogs are detected).

-- Added a test_dataset of 13 genes for 9 samples, and a shell script to run the test data through the main script and several post-processing scripts.

-- Fixed bug involving calling HybPiper with a relative path such as: ../reads_first.py

-- reads_first.py --check_depend now checks for SPAdes, BWA, and Samtools

-- Full revision of README, which is now shorter. Full tutorials on installing and running HybPiper are now on the Wiki.

- Python
Published by mossmatters almost 10 years ago

hybpiper - HybPiper Manuscript Submitted

This is the version of HybPiper used to generate the analysis described in our manuscript:

HybPiper: extracting coding sequence and introns for phylogenetics from high-throughput sequencing reads using target enrichment.

A number of improvements have been made since the last release:

Fixed a number of issues regarding selection of multiple full-length contigs.

-- When two or more contigs are more than 85% of the length of the target reference, HybPiper will choose the contig with the highest coverage depth if it exceeds the next highest coverage depth by 10x. If not, percent-identity between the contig and target (at the amino acid level) is used as the criterion.

-- Both the 85% length cutoff (--lengthcutoff) and the 10x threshold (--depthmultiplier) can be changed by the user.
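
The contig selection rule above can be written as a short decision function. This is a sketch of the rule as worded in the release note, not the project's code: the function name, the dictionary keys, and the reading of "exceeds by 10x" as "at least ten times the next highest depth" are assumptions, and the case where no contig passes the length cutoff is left unhandled (it returns None).

```python
def choose_contig(contigs, target_length, length_cutoff=0.85, depth_multiplier=10):
    """Pick one contig from dicts with 'name', 'length', 'depth' and 'identity' keys."""
    full_length = [c for c in contigs if c["length"] > length_cutoff * target_length]
    if not full_length:
        return None  # not covered by the rule described above
    if len(full_length) == 1:
        return full_length[0]
    by_depth = sorted(full_length, key=lambda c: c["depth"], reverse=True)
    # Prefer the deepest contig only when it clearly dominates the runner-up.
    if by_depth[0]["depth"] >= depth_multiplier * by_depth[1]["depth"]:
        return by_depth[0]
    # Otherwise fall back to percent identity against the target.
    return max(full_length, key=lambda c: c["identity"])


contigs = [
    {"name": "a", "length": 900, "depth": 50, "identity": 80},
    {"name": "b", "length": 950, "depth": 4, "identity": 95},
]
print(choose_contig(contigs, target_length=1000)["name"])  # a
```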

Fixed improper use of SPAdes with single-end reads; HybPiper should now use --s for single-end and --12 for paired-end reads.

Added work-around for SPAdes assembly failures.

-- HybPiper will now re-do any failed SPAdes assemblies (where no contigs are generated).

-- The highest-attempted k-mer value is removed, unless there is only one, in which case it is marked as a "dud".

-- SPAdes can still produce contigs.fasta files of zero size; these are now recognized as failed assemblies.
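
The retry logic described above (drop the highest k-mer value, or give up when only one remains, and treat an empty contigs.fasta as a failure) might be sketched as follows. The function names are hypothetical and the code is an illustration under those stated rules, not the contents of spades_runner.py.

```python
import os


def assembly_failed(contigs_path):
    """Treat a missing or zero-size contigs.fasta as a failed assembly."""
    return not os.path.exists(contigs_path) or os.path.getsize(contigs_path) == 0


def next_kvals(kvals):
    """Return the k-mer values for a retry, or None to mark the gene as a dud."""
    if len(kvals) <= 1:
        return None  # nothing left to remove
    return sorted(kvals)[:-1]  # drop the highest attempted k-mer value


print(next_kvals([21, 33, 55]))  # [21, 33]
```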

- Python
Published by mossmatters over 10 years ago

hybpiper - Paralog Warnings and New Name

Now that the SPAdes assembler is used, there is a higher likelihood of assembling long contigs for each gene. If paralogs (or divergent alleles) exist for the gene sequence, SPAdes will generate multiple long contigs.

In this release, the pipeline will generate warnings if there are multiple long-length contigs for each gene in "paralogwarning.txt" within each gene file. It will also save a general "geneswithparalogwarnings.txt" file in the main directory with a list of genes to consider further.

One option for paralogous genes is to extract coding sequences from each paralog and treat them as separate loci, and re-run the pipeline. The reads may then be accurately distributed to each paralog.

We are also happy to report the name change of the pipeline to HybPiper! Thanks to the Wickett Lab for help with the strenuous naming process. For now our logo is the following, by Elliot Gardner:

[Image: HybPiper logo]

- Python
Published by mossmatters over 10 years ago

hybpiper - Spades and Introns

Changed: SPAdes is now used to assemble contigs; it is much more reliable than Velvet/CAP3. Changed: GNU Parallel now reads the gene list from a file, to avoid too-long command lines.

Added support for intron retrieval via intronerate.py, separately and as "supercontigs" containing exons. Added depthcalculator.py, which uses samtools to map reads to exon or supercontig sequences. Added cleanup.py, which removes the bulky assembly files. Added fastamerge.py, which can create a concatenated alignment and RAxML partition file. Added output file exonerate_stats.csv, indicating which contigs were used in final exon recovery.

Fixed bug using bait files with lowercase DNA. Fixed bug in sorting sequences in Biopython (NotImplementedError). Fixed GNU Parallel overwrite collision.

- Python
Published by mossmatters over 10 years ago

hybpiper - BWA Support

The pipeline now supports BWA for aligning reads to nucleotide bait sequences. Although BWA is faster, it may reduce specificity if reads are unable to align to nucleotide baits because of high sequence divergence. The BLASTx pipeline is still the default, and may provide higher specificity.

- Python
Published by mossmatters about 11 years ago

hybpiper - First DOI Release

Many bugfixes have been made, most critically the bug which was adding each read twice for each gene, if both the forward and reverse reads had BLASTX hits. Several usability fixes were added, and more features for adjusting downstream parameters for Velvet and Exonerate.

Finally, a GPL v3 license was added to the repository in preparation for generating a DOI for this release.

- Python
Published by mossmatters over 11 years ago

hybpiper - Reads-First Capability

Adds the ability to start with the raw Illumina reads. Velvet, CAP3, and GNU Parallel are now required. Reads are distributed to separate files, one per target gene, based on BLASTx hit scores. If there are multiple baits per target gene, the best reference is chosen by highest cumulative BLASTx score.
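
Choosing the best reference by highest cumulative BLASTx score comes down to summing hit scores per reference within each gene. The sketch below illustrates that bookkeeping; the function name and the (gene, reference, score) tuple layout are hypothetical, and this is not the code used in the release.

```python
from collections import defaultdict


def best_references(hits):
    """Pick, per gene, the reference with the highest summed hit score.

    `hits` is an iterable of (gene, reference, score) tuples (hypothetical layout).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for gene, reference, score in hits:
        totals[gene][reference] += score
    return {gene: max(refs, key=refs.get) for gene, refs in totals.items()}


hits = [
    ("gene001", "taxon1", 120.0),
    ("gene001", "taxon2", 90.0),
    ("gene001", "taxon2", 50.0),
]
print(best_references(hits))  # taxon2 wins with a cumulative score of 140
```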

- Python
Published by mossmatters almost 12 years ago

hybpiper - Release 0.0.2

Added hybseq_summary.py, a helper script to describe the lengths of every gene as a result of the first two scripts.

- Python
Published by mossmatters over 12 years ago

hybpiper - Initial Release

Two main scripts:

queryfilebuilder.py: Sets up file hierarchy and creates "tailored baits" file for each gene.

exonerate_hits.py: Generates protein and nucleotide sequences (in frame) from exonerate hits to each gene.

- Python
Published by mossmatters over 12 years ago