https://github.com/bramvandijk88/xenoseq

Pipeline to automate the comparison between short-read libraries to detect foreign ("xenotypic") sequences.

https://github.com/bramvandijk88/xenoseq

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Pipeline to automate the comparison between short-read libraries to detect foreign ("xenotypic") sequences.

Basic Info
  • Host: GitHub
  • Owner: bramvandijk88
  • License: gpl-3.0
  • Language: Shell
  • Default Branch: master
  • Homepage:
  • Size: 157 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

Xenoseq

__ __ / | / | AT | GC | ______ _______ ______ _______ ______ ______ GG \/AG/ / \ / \ / \ / | / \ / \ TA TT< /GATACT |TAAATGG |/CCGTAA |/AATAAAS/ /ATTTCT |/ATGTTA | TGAC \ AA GA |TG | GG |AG | AG |TT \ TT CG |GA | AT | AA /AT |GATCCCGT/ TA | AA |CG \__AG | GGAACT |TACGGGTA/ GT \__AT | AA | GC |CC |GA | GT |GC GT/ / GG/ TA |AG AT | GG/ TG/ GTAGGCC/ CA/ TT/ TAAATG/ ATGCGCG/ ATGCAAT/ AGGGTTT | AA | AA | AA/

Xenoseq is a simple bioinformatic pipeline to find sequences that appear to be newly introduced into a community. The input are (sets of) query and reference samples, which the pipeline will use to generate the following output:

  • unique_contigs.fasta; sequences in query not present in reference (e.g. ancestral) samples
  • uniquecontigalllinksL300_P97.tbl; table linking contigs to other communities using BLAST, with options L and P in the filename
  • xenotypic_contigs.txt; text file with names of all xenotypic contigs in them
  • xenotypic_contigs.fasta; the subset of unique contigs that can be linked to another reference using BLAST
  • xenotypic_coverage.txt; text files to estimate the abundance of xenotypic sequences in non-ancestral samples
  • source_contigs; directory with fasta files from reference samples that are themselves / are linked to MGEs

Overview

Xenoseq wraps read trimming (fastp), assembly (megahit), read mapping (BWA), read filtering (samtools), and local alignment (blast), to detect putative evidence of horizontal transfer between communities. This tool was used in a recent publication () to detect the movement of MGEs and nanobacteria in compost communities. (see Figure below for the experimental setup and subroutines in Xenoseq)

Xenoseq was applied to detect the movement of MGEs in compost

Install: conda setup (required)

To use xenoseq, a conda environment file is provided to install the above mentioned dependencies (for information on conda, see docs.conda.io/projects/conda/en/latest/)

To install the xenoseq environment, simply use:

git clone https://github.com/bramvandijk88/xenoseq.git cd xenoseq conda env create -f environment.yml conda activate xenoseq

Install: add to path (optional quality-of-life improvement)

To run the pipeline, you either have to provide the full path to the xenoseq binary (e.g. /home/user/XENOSEQ_DIR/xenoseq, or add the xenoseq directory (where you cloned the repository) to your global PATH variable:

export PATH='$PATH:<XENOSEQ_DIR>'

e.g. if you cloned / downloaded xenoseq into your home dir:

export PATH='$PATH:~/xenoseq'

If you don't want to do this each time you login to a new terminal, add the export PATH code to you ~/.bashrc (or ~/.profile) file:

echo export PATH='$PATH:~/xenoseq' >> ~/.bashrc

Metadata file

Usage of the pipeline requires a file listing all the "queries" and "subjects" in a text file. The pipeline will look for unique reads in the query files by comparing them to the corresponding subjects. These reads will be assembled into contigs. A subset of these contigs will be "xenotypic", i.e. having a foreign origin, by aligning the sequences to the remaining subjects, providing extra evidence for horizontal transfer of viral sequences or genes.

The metadata should look like this:

```

SAMPLES (Query, Reference)

Horizontal1 Ancestral1 Horizontal2 Ancestral2 Horizontal3 Ancestral3 Horizontal4 Ancestral4 Vertical1 Ancestral1 Vertical2 Ancestral2 Vertical3 Ancestral3 Vertical4 Ancestral4 ```

To run the pipeline (e.g. on the example data), use one of the following commands: ```

./xenoseq -m examplemetadata.tsv -o Xenoseqexample -l -t # Full pipeline ./xenoseq -m examplemetadata.tsv -o Xenoseqexample # Only identify unique sequences ./xenoseq -m examplemetadata.tsv -o Xenoseqexample -t # Identify and trace unique sequences ./xenoseq -m examplemetadata.tsv -o Xenoseqexample -l -f -P 99.99 # Force a relinking of unique-xenotypic with stricter percent identity ```

This example will use mock reads found in samples/reads and will search for xenotypic contigs in simulated data. Your unique/xenotypic contigs will then be stored in Xenoseq_example/. By default, xenoseq will assume your reads will be stored in /samples/reads, and will assume the files correspond to the names in the metadata file with the suffix \"_R1.fq\" and \"_R2.fq\" (these options can be changed with -p and -r). In other words; if you have your own reads you need to modify the metadata (and optionally, modify the read path/prefix).

The usage of the trace option (-t) will ensure all samples will be mapped back to the unique contigs and generate coverage statistics.

All options

```bash

Usage: xenoseq -m -o -c -l -t Mandatory: -m/--metadata File containing the metadata (tsv file with query-reference sets) Optional options: -p/--pathtoreads Path to reads for samples in metadata (default = samples/reads) -r/--readsuffix Read suffix corresponding to metadata n ames (e.g. when read filenames are Sample1R1.fq and Sample1R2.fq, use _R.fq) (default = _R.fq) -l/--link After detecting unique contigs, attempt to link them to other reference samples. -t/--trace After detecting xenotypic contigs, trace them across all samples. -c/--cores Number of CPUs to use for smaller tasks (passed on to bwa, samtools, etc.) (default = 4) -C/--assemblycores Number of CPUs to use for assembly (megahit) (default = 4) -j/--jobs Maximum number of parallel jobs (default = 4) -J/--maxassemblyjobs Maximum number of parallel jobs for assembly (default = ) -o/--output Output directory to put all the data -L/--alignmentlength Minimal alignment length to link unique sequences to other reference samples. -S/--singleend Assume single-end reads (e.g. use only Sample1R1.fq and skip read merging) -P/--alignmentpid Minimal percent identity to link unique sequences to other reference samples. -f/--force_relink Link unique sequences to reference samples, even when this step is already performed.

```

If you want to modify any more options (e.g. filtering thresholds, quality trimming), you can modify the relevant subroutines of xenoseq given in xenoseq_bin/functions.sh.

```

xenoseq -h

Xenoseq was tested on MAC Monterey 12.5.1 and Ubuntu 20.04.3 (also available on as Windows subsystem)

```

GitHub Events

Total
Last Year