https://github.com/b-ummi/chewbbaca_tutorial

Step by step tutorial using chewBBACA with all necessary files and results

Last synced: 10 months ago · JSON representation

Repository

Step by step tutorial using chewBBACA with all necessary files and results

Basic Info

Host: GitHub
Owner: B-UMMI
Language: HTML
Default Branch: master
Size: 71.9 MB

Statistics

Stars: 5
Watchers: 9
Forks: 1
Open Issues: 4
Releases: 0

Created almost 9 years ago · Last pushed about 5 years ago

https://github.com/B-UMMI/chewBBACA_tutorial/blob/master/

## Objective
The objective of this tutorial is to illustrate the complete workflow of a chewBBACA pipeline for creating a wgMLST and a cgMLST schema for a colection of 714 _Streptococcus agalactiae_ genomes (32 complete genomes and 682 draft genome assemblies deposited on the NCBI databases) by providing step-by-step instructions and displaying the obtained outputs.

All information about the NCBI genomes used in this example is on the [.tsv file](https://github.com/B-UMMI/chewBBACA_tutorial/tree/master/genomes/NCBI_genomes_proks.Sagalactiae_allGenomes.2016_08_03.tsv) inside the `genomes` folder.

Please start by going through the following steps:
1. Install chewBBACA. Check [Installing chewBBACA](https://github.com/B-UMMI/chewBBACA/wiki/0.-Setting-up-chewBBACA) for instructions on how to install chewBBACA. chewBBACA includes Prodigal training files for several species, including for _Streptococcus agalactiae_. You can check the list of available training files [here](https://github.com/B-UMMI/chewBBACA/raw/master/CHEWBBACA/prodigal_training_files/). We have included the training file for _Streptococcus agalactiae_ in this repository.
2. Clone this repository to the local folder of your choice. To clone, run the following command:
`git clone https://github.com/B-UMMI/chewBBACA_tutorial`
3. Go to the top-level directory of the cloned repository, `.../chewBBACA_tutorial/`, and run `unzip genomes/complete_genomes.zip` to extract all the complete genomes.

The execution times reported in this tutorial were obtained for a DELL XPS13 (10th Generation Intel Core i7-10710U Processor - 12MB Cache, up to 4.7 GHz, using 6 cores). Using a computer with less powerful specifications can greatly increase the duration of the analyses.

The commands used in this tutorial assume that the working directory is the top-level directory of the cloned repository, `.../chewBBACA_tutorial/`. The commands should be modified if they are executed from a different working directory. We have included the expected results for each section in the `expected_results` folder for reference (each subfolder has the name of one of the sections).

## Schema creation

We will start by creating a wgMLST schema based on **32** _Streptococcus agalactiae_ complete genomes (32 genomes with a level of assembly classified as complete genome or chromossome) available at NCBI. The sequences are present in the `complete_genomes/` directory. To create the wgMLST schema, run the following command:

```
chewBBACA.py CreateSchema -i complete_genomes/ -o tutorial_schema --ptf Streptococcus_agalactiae.trn --cpu 6
```

The schema seed will be available at `tutorial_schema/schema_seed`. We passed the value `6` to the `--cpu` parameter to use 6 CPU cores, but you should pass a value based on the specifications of your machine. In our system, the process took 56 seconds to complete resulting on a wgMLST schema with 3128 loci. At this point the schema is defined as a set of loci each with a single representative allele.

## Allele calling

The next step is to perform allele calling with the wgMLST schema created in the previous step for the **32** complete genomes. The allele call step determines the allelic profiles of the analyzed strains, identifying known and novel alleles in the analyzed genomes. Novel alleles are assigned an allele identifier and added to the schema. To perform allele call, run the following command:

```
chewBBACA.py AlleleCall -i complete_genomes/ -g tutorial_schema/schema_seed -o results32_wgMLST --cpu 6
```

The allele call used the default BSR threshold of 0.6 (more information on the threshold [here](https://github.com/B-UMMI/chewBBACA/wiki/2.-Allele-Calling)) and took approximately 17 minutes to complete (an average of 32 seconds per genome). The allele call identified 14,720 novel alleles and added those alleles to the schema, increasing the number of alleles in the schema from 3,128 to 17,848.

## Paralog detection

The next step in the analysis is to determine if some of the loci can be considered paralogs, based on the result of the wgMLST allele calling. The _Allele call_ returns a list of Paralogous genes in the `RepeatedLoci.txt` file that can be found on the `results32_wgMLST/results_` folder.
The `RepeatedLoci.txt` file contains a set of 20 loci that were identified as possible paralogs. These loci should be removed from the schema due to the potential uncertainty in allele assignment (for a more detailed description see the [Alelle Calling](https://github.com/B-UMMI/chewBBACA/wiki/2.-Allele-Calling) entry on the wiki). To remove the set of 20 paralogous loci from the allele calling results, run the following command:

```
chewBBACA.py RemoveGenes -i results32_wgMLST/results_/results_alleles.tsv -g results32_wgMLST/results_/RepeatedLoci.txt -o results32_wgMLST/results_/results_alleles_NoParalogs.tsv
```

This will remove the columns matching the 20 paralogous loci from the allele calling results and save the allelic profiles into the `results_alleles_NoParalogs.tsv` file (the new file contains allelic profiles with 3108 loci).

## cgMLST schema determination

We can now determine the set of loci in the core genome based on the allele calling results. The set of loci in the core genome is determined based on a threshold of loci presence in the analysed genomes. We can run the TestGenomeQuality module to determine the impact of several threshold values on the number of loci in the core genome.

```
chewBBACA.py TestGenomeQuality -i results32_wgMLST/results_/results_alleles_NoParalogs.tsv -n 13 -t 200 -s 5 -o results32_wgMLST/results_/genome_quality_32
```

The process will automatically open a HTML file with the following plot:

![Genome quality testing of complete genomes](https://i.imgur.com/uf3Hygd.png)
[larger image fig 1](https://i.imgur.com/uf3Hygd.png) or [see interactive plot online](http://im.fm.ul.pt/chewBBACA/GenomeQual/GenomeQualityPlot_complete_genomes.html)

A set of **1136 loci** were found to be present in all the analyzed complete genomes, while **1267 loci** were present in at least 95%.
For further analysis only the **1267** loci present in at least 95% of the complete genomes will be used. We selected that threshold value to account for loci that may not be identified due to sequencing coverage and assembly problems.

We can run the ExtraCgMLST module to quickly determine the set of loci in the core genome at 95%.

```
chewBBACA.py ExtractCgMLST -i results32_wgMLST/results_/results_alleles_NoParalogs.tsv -o results32_wgMLST/results_/cgMLST_95 --t 0.95
```

The list with the 1267 loci in the core genome at 95% is in the `results32_wgMLST/results_/cgMLST_95/cgMLSTschema.txt` file. This file can be passed to the `--gl` parameter of the AlleleCall process to perform allele calling only for the set of genes that constitute the core genome.

## Allele call for 682 _Streptococcus agalactiae_ assemblies

**682 assemblies** of _Streptococcus agalactiae_ available on NCBI were downloaded (03-08-2016, downloadable zip file [here](https://drive.google.com/file/d/0Bw6VuoagsdhmaWEtR25fODlJTEk/view?usp=sharing), run `unzip GBS_Aug2016.zip` to extract genome files into a folder named `GBS_Aug2016`) and analyzed with [MLST](https://github.com/tseemann/mlst) in order to exclude possibly mislabeled samples as _Streptococcus agalactiae_. Out of the **682 genomes**, 2 (GCA_000323065.2_ASM32306v2 and GCA_001017915.1_ASM101791v1) were detected as being of a different species/contamination and were removed from the analysis.

Allele call was performed on the bona fide _Streptococcus agalactiae_ **680 genomes** using the **1267 loci** that constitute the core genome at 95%. Paralog detection found no paralog loci.

```
chewBBACA.py AlleleCall -i path/to/GBS_Aug2016/ -g tutorial_schema/schema_seed --gl results32_wgMLST/results_/cgMLST_95/cgMLSTschema.txt -o results680_cgMLST --cpu 6
```

The process took approximately 39 minutes to complete (an average of 3.4 secs per genome).

## Evaluate genome quality

We can now concatenate the cgMLST results for the 32 complete genomes with the cgMLST results for the 680 genomes to have all the results in a single file.
To concatenate the allelic profiles of both analyses run the following command:

```
chewBBACA.py JoinProfiles -p1 results32_wgMLST/results_/cgMLST_95/cgMLST.tsv -p2 results680_cgMLST/results_/results_alleles.tsv -o cgMLST_all.tsv
```

The concatenated file was analyzed in order to assess the cgMLST allele quality attribution for all the genomes.

```
chewBBACA.py TestGenomeQuality -i cgMLST_all.tsv -n 13 -t 300 -s 5
```

![Genome quality testing of all genomes](https://i.imgur.com/m1OSycz.png)
[larger image here fig 2](https://i.imgur.com/m1OSycz.png) or [see interactive plot online](http://im.fm.ul.pt/chewBBACA/GenomeQual/GenomeQualityPlot_all_genomes.html)

While the number of loci present in 95% of genomes remains virtually constant at around **1200** loci, considering all or most of the genomes (90%10 kbp and the N50 of the assemblies, sorted by decreasing values.

![Genome Analysis](http://i.imgur.com/I0fNqtd.png)
[larger image fig 3](http://i.imgur.com/I0fNqtd.png)

The second plot represents the total number of contigs and the number of contigs >10kbp.

![Genome Analysis 2](http://i.imgur.com/fabxi0Z.png)
[larger image fig 4](http://i.imgur.com/fabxi0Z.png)

[See interactive plot online](http://im.fm.ul.pt/chewBBACA/GenomeQual/AssemblyStatsStack.html)

At first sight, most of the removed genomes (56/62) were located on the lower range of N50 and bp in contigs >10 kbp (fig.3) and the higher number of contigs (fig.4).

The 5 genomes that were outside this pattern were individually checked:

1. **GCA_000186445.1** [here](https://www.ncbi.nlm.nih.gov/assembly/GCA_000186445.1) - 21 contigs but only 1 is above 10k (Scaffold with lot of Ns, 134 real contigs)
2. **GCA_000221325.2** [here](https://www.ncbi.nlm.nih.gov/assembly/GCA_000221325.2)- NCBI curated it out of RefSeq because it had a genome length too large
3. **GCA_000427055.1** [here](https://www.ncbi.nlm.nih.gov/assembly/GCA_000427055.1)- NCBI curated it out of RefSeq because it had many frameshifted proteins
4. **GCA_000289455.1** [here](https://www.ncbi.nlm.nih.gov/assembly/GCA_000289455.1)- No ST found. We concluded the assembly has a problem but we have not yet identified it.
5. **GCA_000288835.1** [here](https://www.ncbi.nlm.nih.gov/assembly/GCA_000288835.1)- NCBI curated it out of RefSeq because it had many frameshifted proteins

## Schema Evaluation

Schema Evaluator was run on the cgMLST schema:

`chewBBACA.py SchemaEvaluator -i tutorial_schema/schema_seed/ -o schema_evaluation --cpu 6`

[See the schema evaluator page here](http://im.fm.ul.pt/chewBBACA/SchemaEval/rms/RmS.html)

Owner

Name: Bioinformatics @ Molecular Microbiology and Infection Unit
Login: B-UMMI
Kind: organization
Email: microbiologia@fm.ul.pt

Website: http://im.fm.ul.pt
Repositories: 12
Profile: https://github.com/B-UMMI

Script repository

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/b-ummi/chewbbaca_tutorial

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/B-UMMI/chewBBACA_tutorial/blob/master/

Owner

GitHub Events

Total

Last Year