https://github.com/juke34/datura-stramonium-genome-project

This repository contains the commands used in the assembly of Datura stramonium genome and its comparative genomics analyses with other Solanaceae species

https://github.com/juke34/datura-stramonium-genome-project

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: nature.com, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary
Last synced: 4 months ago · JSON representation

Repository

This repository contains the commands used in the assembly of Datura stramonium genome and its comparative genomics analyses with other Solanaceae species

Basic Info
  • Host: GitHub
  • Owner: Juke34
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 7.89 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of icruz1989/Datura-stramonium-genome-project
Created about 5 years ago · Last pushed about 5 years ago

https://github.com/Juke34/Datura-stramonium-genome-project/blob/master/

# *Datura stramonium* genome project (the devil weed)

* This is the workflow that was followed to assembly and annotate the genome of *Datura stramonium* and for comparative genomics with 13 Solanaceae species. Many programs have been used and links of some of them have been stored here. Also, other pipelines have been followed and a link of each one will be found in this workflow. 

Manuscript is already published

https://www.nature.com/articles/s41598-020-79194-1

Zenodo:             [![DOI](https://zenodo.org/badge/236494037.svg)](https://zenodo.org/badge/latestdoi/236494037)


### Firts step is trimming poor quality sequences

    java -jar trimmomatic-0.33.jar PE -phred30 Tic23_S155_L006_R1_001.fastq Tic23_S155_L006_R2_001.fastq output_Tic23_S155_L006_R1_001_paired.fastq output_Tic23_S155_L006_R1_001_unpaired.fastq output_Tic23_S155_L006_R2_001_paired.fastq output_Tic23_S156_L006_R2_001_unpaired.fastq SLIDINGWINDOW:4:15 MINLEN:36

### Fastqc program was used to visualize the quality of the sequences

    fastqc output_Tic23_S155_L006_R1_001_paired.fastq output_Tic23_S155_L006_R2_001_paired.fastq

### Genome size estimation with Kmergenie

   reads_file contains the name of the files of PE sequences (trimmed) from Illumina (per line)
   
    kmergenie-1.7048/kmergenie reads_file.txt 

   One this step is accomplished, now work on PacBio Sequences
   
### Transform PacBio subreads from .bam format to fastq or fasta, fasta files are much bigger in size

    bamtools convert -format fastq -in output_tic_pacbio_merged.bam -out tic23.subreads.fastq

### Firts, we produced an assembly using only the longreads, Canu program is the option. Canu detects the number of cores 

    canu -d ensamble_tic_pacbio -p datura_tic23 genomeSize=1.5g gridOptionscormhap="--mem=40g" merylMemory=62 batMemory=62         corMhapSensitivity=high correctedErrorRate=0.105 corOutCoverage=100 corMinCoverage=0 gridOptions="--time=168:00:00 --         partition=FAST" gnuplotTested=true -pacbio-raw tic23.subreads.fastq 1>run2.log

At the same time, generate contigs only with the Illumina PE sequences using the SparseAssembler program. This will generate an assembly only with Illumina sequences. Kmergenie program used in the step above gives the best kmer to produce an assembly. Use this one to feed SparseAssembler

    SparseAssembler LD 0 k 73 g 15 NodeCovTh 1 EdgeCovTh 0 GS 15000000 i1 output_Tic23_S155_L006_R1_001_paired.fastq i2 output_Tic23_S155_L006_R2_001_paired.fastq

### Once both assemblies are done (Canu and SparseAssembler), now do an Hybrid assembly but using raw reads from PacBio and contigs.txt file from SparseAssembler. This program uses De bruin graph and overlap layout consensus algorithms

    DBG2OLC LD 1 k 17 AdaptiveTh 0.001 KmerCovTh 3 MinOverlap 10 RemoveChimera 1 Contigs Contigs.txt f tic23.subreads.fastq
  
 A file called scaffolds.fasta is generated, this is the hybrid assembly

### In this step we want to align Canu assembly (Canu) to the Hybrid assembly (DBG2OLC)

    nucmer --mumreference -l 100 self.fasta hybrid.fasta
   
   This script from same Nucmer allow to filter out alignments due to repeats and duplicates
    
    delta-filter -r -q -l 10000 out.delta > out.rq.delta

    
 self.fasta corresponds to the Canu assembly file and hybrid.fasta corresponds to the hybrid assembly from DBG2OLC

### Once alignment from above step is done, now use Quickmerge, Ncmer gives you an out.delta file and use again the hybrid assembly and the Canu assembly

    quickmerge -d out.delta -q hybrid.fasta -r self.fasta -hco 3.0 -c 1.1 -l 70000 -ml 7000 -o genome_tic.fasta
   
   A final merged assembly is obtained. This is the BackBone

## Polishing and scaffolding

### Bowtie2 was used to index the genome and we aligned the raw Illumina reads to the merged genome
   
   Firts, index the genome
   
    bowtie2-build --threads 30 genome_tic.fasta bt2_index_genomeTic
   
   Align the Illumina PED sequences to the BackBone
    
    bowtie2 -x bt2_index_genomeTic -1 output_Tic23_S155_L006_R1_001_paired.fastq -2 output_Tic23_S155_L006_R2_001_paired.fastq     -S tic_gen.sam
   
   Convert the aligned file; sam to bam.
    
    samtools view -Sb tic_gen.sam > tic_gen.bam
   
   Index the aligned file and sort
   
    samtools index tic_gen.bam | sort > tic_gen.sorted.bam
    
### Now, its time to use Pilon for polishing the genome. Three rounds of polishing are recommended. Use the aligned file and Backbone to feed Pilon
    
    java -jar pilon-1.22.jar --genome genome_tic.fasta -b tic_gen.sorted.bam -fix bases > genome_tic_pilon1.fasta
   
   In this step, align the raw PacBio sequences to BackBone already polished with Pilon
     
    pbalign --minAnchorSize 15 --maxMatch 20 --nproc 15 output_tic_pacbio_merged.bam genome_tic_pilon1.fasta pbalign_tic.bam
    
   Now use arrow to obtain a consensus sequence and a polished genome
    
    arrow pbalign_tic.bam -r genome_tic_pilon1.fasta -o consensus_tic_pilon1_arrow.fasta

### All is ready for scaffolding, OPERA-LG uses the raw Illumina PE sequences, the genome and long reads

    perl OPERA-long-read.pl --short-read-maptool bowtie2 --opera /LUSTRE/Genetica/ivan/bin_app/OPERA-LG_v2.0.6/bin num-of-       processors 5 --kmer 17 --contig-file consensus_tic_pilon1_arrow.fasta --illumina-read1 reads_1.fasta --illumina-read2         reads_2.fasta --long-read-file raw_reads_pacbio_tic.fasta --output-prefix opera_lr --output-directory RESULTS

### Two polishing steps with PILON were carried out but now to the last version of the genome (output from OPERA-LG). Seesteps above

### Now its time to check the quality and completness of the final assembly
   
   Quast gives you summary statistics
   
    quast.py genome.polished.draft.fasta
   
   BUSCO looks in a specific database set by the user the number of Single copy Orthologs ans gives you the percentage of genome completness
   
    python run_BUSCO.py -r -i final_genome_teotihuacan.fasta -o busco_finaldraft_genome_teotihuacan -l solanaceae_odb10 -m geno

# Repetitive elements analysis

Some script used in this section are already stored in RepeatModeler or RepeatMasker programs. See manual of these to more information

see http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction--Basic

### The firts step for repetitive elements identification is build a databse

    BuildDatabase -engine ncbi -name datura GenomeTic.fa

### Then use RepeatModeler to obtain specific repetitive elements ofe Datura genome
    
    RepeatModeler -database datura -engine ncbi -pa 32
    
### Repeatmodeler produces a consensi.fa.classified file with knows and unknows repetitive elements. Thus, to separete both of them we used

    perl repeatmodeler_parse.pl --fastafile consensi.fa.classified --unknowns repeatmodeler_unknowns.fasta  \
    --identities repeatmodeler_identities.fasta 

Thus we obtained a file with only unknows sequences. These are searched against a transposase database and sequences matching transposase are considered as transposons belonging to the relevant superfamily and are incorporated into repeatmodeler_identities and excluded from repeatmodeler_unknowns. Once filtering is complete using the script transposon_blast_parse.pl the libraries ModelerUnknown.lib and ModelerID.lib are created. The trans database can be obtained downloaded from http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced#3._Collecting_repetitive_sequences_by_RepeatModeler

    makeblastdb -in Tpases020812  -dbtype prot
    
    blastx -query repeatmodeler_unknowns.fasta -db Tpases020812  -evalue 1e-10 -num_descriptions 10 -out                           modelerunknown_blast_results.txt
    
    transposon_blast_parse.pl --blastx  --modelerunknown 
    
   Outputs:

   identified_elements.txt

   unknown_elements.txt

   Rename/combine files from RepeatModeler using following commands:
   
    mv  unknown_elements.txt  ModelerUnknown.lib
    cat identified_elements.txt  repeatmodeler_identities.fasta  > ModelerID.lib
   
All repeats collected so far are used to search against a plant protein database where proteins from transposons are excluded. Elements with significant hits to genes are removed, along with 50 bp upstream and downstream of the blast hit. Remaining sequence that is less than 50 bp is removed completely. Outputs from this script are elements with no significant blast hits to the protein database and the remaining sequence from elements with blast hits that is greater than 50 bp. Use ProExcluder prograam for this purpose
   
   Database alluniRefprexp070416 can be downloaded here 
   
  http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-           Advanced#3._Collecting_repetitive_sequences_by_RepeatModeler
    
    blastx -query ModelerUnknown.lib -db alluniRefprexp070416  -evalue 1e-10 -num_descriptions 10 \
    -out ModelerUnknown.lib_blast_results.txt
    
    ProtExcluder.pl -option ModelerUnknown.lib_blast_results.txt  ModelerUnknown.lib
    
 ### Now obtain a fasta file from RepeatMasker. This program uses the database Dfam and RepeatDabase. Use all to search in the entire database
 
    queryRepeatDatabase.pl -species all > repeatmasker.all.fa
 
   Concatenate now the RepeatMasker database with the one generated manually above
  
    tail -n+2 repeatmasker.taxon.fa | cat - taxon.consensi.fa.classified > taxon.repeatlib.fa
 
 ### Finally run RepeatMasker with this databse
 
    RepeatMasker -pa 20 -q -lib taxon.repeatlib.fa -gff final_genome_tic23.fasta
   
# Gene annotation 

### Firts we obtained and trained gene models using Augustus inside BUSCO program. See BUSCO documentation for more information. Sugustus produces gene models and this models will be used in MAKER

    python run_BUSCO.py -r -i final_genome_tic23.fasta -o busco_finaldraft_genome_teotihuacan -l sonaceae_odb10 -m geno -long      
### MAKER was run following this pipeline 

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018

### Running MAKER 
   
   MAKER uses several other programs that run inside MAKER and also uses three main configuration files
   
   - maker_exe.ctl - contains the path information for the underlying executables    
   
   - maker_bopt.ctl - contains filtering statistics for BLAST and Exonerate
   
   - maker_opt.ctl - contains all other information for MAKER, including the location of the input genome filem input proteins      and transcripts and gene models

### Once these configuration files are filled, running MAKER is pretty easy. From the folder where you have stored the configuration files just type

    maker -base maker_gff_final -g draffinal_genome_ticuman.fasta -fix_nucleotides

The outputs from MAKER are:

   - The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this run of MAKER
   
   - The mpi_blastdb directory contains FASTA indexes and BLAST database files created from the input EST, protein, and repeat      databases
   
   - The dpp_contig_master_datastore_index.log contains information on both the run status of individual contigs and information    on where individual contig data is stored
   
   - The dpp_contig_datastore directory contains a set of subfolders, each containing the final MAKER output for individual        contigs from the genomic fasta file

MAKER was run four times, each time was changed the gene models predicted from the prior run 
      
   Now, to obtain the proteins, transcript and a gff file with coordinates of these just use this script and the path of the directory created by MARKER
     
    fasta_merge -d dpp_contig_master_datastore_index.log

    gff3_merge -d dpp_contig_master_datastore_index.log
    
 ### Alternative, these scripts are very useful to obtain proteins and CDS from gff3 format files 

   scripts are part of https://github.com/NBISweden/AGAT (Previously in [GAAS](https://github.com/NBISweden/GAAS/tree/v0.1) until v0.1 November 2019)

    gff3_sp_extract_sequences.pl -gff Final_just_genemodels_tic23_AED_0.5 -f draffinal_genome_ticuman.fasta -p -o       mkr_snap3_final.all.maker.proteins_extractions.fasta 

    gff3_sp_extract_sequences.pl -t cds --cfs -g s.pimp.mgm.FINAL.gff -f tomato-scaffolds.abyss.77.fasta -o cds_solpi.fasta &

### AED quality filtering was done using the script from MAKER: quality_filter.pl (a copy is enclosed in this repository)

available here: https://groups.google.com/forum/#!searchin/maker-devel/quality_filter.pl%7Csort:relevance/maker-devel/LC4STWWlwgo/XV4nhGiHsfIJ

### Running meryl for genome assembly quality value 

Generate dbs for merqry program Quality values of the assemblies

Finding best kmer

       sh best_k.sh  [tolerable_collision_rate=0.001]


       for f in *.fq; do meryl k=21 count output ${f%\.*}.meryl ${f%\.*}.fq; done &

For one sample

        meryl count k=21 output_Tic23_S155_L006_R1_001_paired.fastq output tic23_1.meryl  

Merge

        union_sum.sh 21 meryl.list genome.meryl &

Meryl.list correspont to boh .meryl directories from the previous round

The output is a database with all the read together, index. check that the directory ends with .meryl

It is very important to set all the programs in path 

export PATH="/LUSTRE/Genetica/ivan/bin_app/merqury:$PATH"
export PATH=/LUSTRE/Genetica/ivan/bin_app/meryl-1.0/Linux-amd64/bin:$PATH
cd bin_app/merqury/
export MERQURY=$PWD

Running merqury

           merqury.sh read_db.meryl asm1.fasta test &

###### Testing both genomes fom TEO y TIC with the K-mers dbs constructed from the TEO illumina reads ########

    merqury.sh read_db.meryl genomeTic.fasta genomeTeo.fasta test 

Plotting

    Rscript plot_spectra_asm.R -f test.spectra-asm.hist -o test.spectra-asm.hist.png

### Functional annotation was done using the MAKER proteins and transcripts. Names of the genes were edited using the program AHRD, alternative annotation was done using Mercartor with the database MapMan4

A comprenhensive tutorial for funtional annotations is found in below links

Link for MAKER pipeline: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018 

Link for AHRD program: https://github.com/asishallab/AHRD https://github.com/asishallab/AHRD

### InterproScan command to annotate domains of proteins 

    interproscan.sh -appl TIGRFAM,SUPERFAMILY,Coils,ProSiteProfiles,SMART,PRINTS,ProSitePatterns,Pfam,ProDom -dp -f TSV -goterms -iprlookup -pa -t p -i Datura_genomeTic.fasta Datura_genomeTic.iprscan >& run1.out

# Comparative analyses

### OrthoFinder program to construct Orthologs and paralogs, you can have proteins families from this program

    orthofinder -S diamond -T fasttree -M msa -t 8 -a 8 -f cafe_headers >& run.out &
    
   cafe_headers is the folder were the proteins are stored. Each proteome of each species most be in fasta format
   
### Constructing an ultra metric three 

From OrthoFinder2 you obtain a species tree and a file with the gene families and gene counts in each family to CAFE (expansions and contractions)

Make your three ultrametric with the program of your preference, could be with the R package Ape and its Chronos function

An example of R script to construct the ultrametric tree is provided in this rep. please see above

### Protein families expansions and contractions

CAFE needs two files to run, species tree (netwick format) and a file with this format:

An output from OrthoFinder2 is Orthogroups.GeneCount.csv, in this file you have to remove the column TOTAL (at the end) and add a first column in the file with the name "descriptions" (colums of this file should be separated by TAB). For instance, 

Rows are the gene families and the counts of the genes for each species 

Descriptions    ID  sp1     sp2     sp3.....

IMPORTANT: For some weird reason OrthoFinder edits the names of the original files. Just edit this. CAFE needs short or concise names of the species, this same names should be edited in the species tree that OrthoFinder also gives. Same names most be for both files. Also, remove the nodes from the species tree. 

This how the netwick tree should looks

(Pi:35,(((Date:0.1,Dati:0.1):30,((Cag:1.3,Cam:1.3):17.7,(Stu:7.9,(Spe:3.6,(Sly:1.5,Spi:1.5):2.1):4.3):11.1):11):1,(Nto:10,(Nat:7,(Nsy:4.2,Ntab:4.2):2.7):3):21):4)

Numbers are divergence times and node values were remove. Again this names most be the same as genecounts input

Then you can run CAFE. A important step for filtering should be done. Use the file formated as above

    cafetutorial_clade_and_size_filter.py -i cafe_all_sol.tab -o filtered_cafe_input_prueba.tab -s

Then you have to construct this script for CAFE an example is provided in this repository. Just type sh to run the script

    sh cafe cafe_script_fil.sh

Once CAFE has finished you can obtain using this script (also provided here) to obtains summary statistics

    cafetutorial_report_analysis.py -i resultfile_run2.cafe -o summary_run_2_cafe

** This script only run within the folder with all the scripts from tutorial of CAFE

Script above gives you a file called "summary_run_2_cafe_fams.txt"

The file mentioned above can be used to obtain and parse CAFE results; expanded and contracted significant families using this R script

parseCafeResult.R in https://github.com/asishallab/SlydGeneFamsAnalyses/blob/icruz/exec

    Rscript path/2/GeneFamilies/exec/parseCafeResult.R path/2/summary_run_2_cafe_fams.txt path/2/SlydGeneFamsAnalyses/data
    
This script generates a file called "cafe_result.RData" where you cand find the expanded and contracted family just loading the dataframe

### Families with signal of expansion and contraction can be retrieve from CAFE outputs, then this script will carry out enrichment test in protein families with significant expansion and contraction 

For InterPro function annotation (domains) 

https://github.com/asishallab/SlydGeneFamsAnalyses/blob/icruz/exec/enrichedAnnosInExpContrFams.R

For Mapman4 enrichment funtion annotation

https://github.com/asishallab/SlydGeneFamsAnalyses/blob/icruz/exec/enrichedMapManInMappFams.R 

### Running MAPP for physicochemical divergence 

First we have to retrieve all of the multiple sequence alignments and gene therefore each protein family from Orthofinder results. OrthoFinder create two folders with all of the MSA and genes three. This script will do that and create a folder for each protein family storing both files (MSA and the gene three)

https://github.com/asishallab/SlydGeneFamsAnalyses/blob/icruz/exec/readAndParseOrthogroupsTxt.R

### Running MAPP in a loop mode for all gene families

    for d in *; do
    if [ -d "$d" ]; then         # or:  if test -d "$d"; then
    ( cd "$d" && java -jar MAPP.jar -f ${d}_edited_AA_MSA.fa -t ${d}_phyl_tree.newick -o ${d}_mapp.out )
    fi;
    done

### Once MAPP is done use this script to read the results and also obtain a list with enriched proteins and its function 

    readMappResults.R --workDir /working_dir --outDir resulstMAPP --allInterProTable "here load the Interproannotations of the proteins"/"Also load MapMan4 anotation"

### Positive selection analyses with FUBAR, this pipeline require CDS and the multiple sequences aligments from OrthoFinder. The CDS sequences need to have the same ID name of the alignments, protein alignments are already stored in each folder from previous analyses. Each protein family needs its CDS to construct a CDS alignment in each family. 

### Remove stop codons from proteins alignments in each protein family

    for f in *.fa; do sed 's/*//g' $f > ${f%\.*}_cleanedAs.fa; done

# Concatenate all the CDS

    cat cds.sl.fasta cds.ds.fasta. cds.spe.fasta alltheCDS.fasta

### Retrieve all the ID names from each protein aligment in all the folders

    for d in *; do if [ -d "$d" ]; then ( cd "$d" && grep "^>" *_AA_MSA_orig_gene_ids.fa > ${d}_listIDS.txt ); fi; done &> runningheaderextrac.txt &

### Extract the CDS specific for each protein family in all the folders

    for d in *; do if [ -d "$d" ]; then  ( cd "$d" && perl seqextr.pl $f Solanumtuberosum.fasta > ${f%\.*}_CDS.fa; done) fi; done

### Running PAL2NAL to generate a CDS alignment for all folders

    for d in *; do if [ -d "$d" ]; then ( cd "$d" && perl pal2nal.pl *_AA_MSA_orig_gene_ids.fa *_CDS.fa -codontable 1 -output fasta > ${d}_pal2nal_CDS.fa ); fi; done

### Running FUBAR for all folders 

    for d in *; do if [ -d "$d" ]; then ( cd "$d" && hyphy fubar --alignment ${d}_pal2nal_CDS.fa --tree ${d}_phyl_tree_orig_gene_ids.newick ); fi; done &> running_FUBAR.txt

### Parsing FUBAR outputs

    myext = phyphy.Extractor("OG0022001_pal2nal_CDS.fa.MEME.json")
    myext.extract_csv("fest.tsv", delim = "\t")

### Parsing .json format for all directories

    for d in *; do if [ -d "$d" ]; then ( cd "$d" && python parsejason_v2.py -f *_pal2nal_CDS.fa.FUBAR.json ); fi; done &> running_parseJSON.txt

Now you will have a txt file in each directory or folder with the sites under positive selection, the next step is to parse that file to just keep the sites with a byes factor > 100, significant posterior probabilities  0.98. Use this script for the purpose. 

https://github.com/asishallab/GeneFamilies/blob/master/exec/loadFubarResults.R

### Enrichment tests can be done using InterProscan and Mapman4 annotations with this script 

Read also these scripts enrichedMapManInMappFams.R and enrichedAnnosInExpContrFams.R

https://github.com/asishallab/SlydGeneFamsAnalyses/blob/icruz/exec/identifyDomainsAtSelectedSites.R

   contact: imda@ecologia.unam.mx

Owner

  • Name: Jacques Dainat
  • Login: Juke34
  • Kind: user
  • Location: Montpellier, France
  • Company: IRD

Bioinformatician at IRD

GitHub Events

Total
Last Year