glori-tools

bioinformatic pipeline for GLORI

https://github.com/liucongcas/glori-tools

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.0%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

bioinformatic pipeline for GLORI

Basic Info
  • Host: GitHub
  • Owner: liucongcas
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 714 KB
Statistics
  • Stars: 20
  • Watchers: 2
  • Forks: 6
  • Open Issues: 12
  • Releases: 1
Created over 3 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

GLORI-tools v1.0

GLORI-tools currently works, but is still being optimized for a better user experience.

*🔥NEW🔥 Hi, many friends have encountered various issues while building the STAR index. Please make sure you run this step with enough memory. Additionally, you can use chromosomes 1-22 + X, Y, Z in hg38 as your genome. Please ensure that your genome sequence's name begins with "chr" as we have only tested GLORI on human and mouse genomes.

News

  • As of June 15th, 2023, We have updated the format of the .totalCR.txt file to make it more user-friendly and easier to read. Please see Section 5.3 for details.
  • As of June 23th, 2023, We have added an ciatiaon information for GLORI-tools.
  • As of June 23th, 2023, an additional run parameter has been added to obtain a comprehensive list of all annotated A citations. Please refer to Section 4.5 for details

Table of content

  • Background
  • Pre-processing the raw sequencing data
  • Installation and Requirement
  • Example and Usage
  • Maintainers and Contributing
  • License

Background

GLORI

We developed an absolute m6A quantification method (“GLORI”) that is conceptually similar bisulfite sequencing-based quantification of DNA 5-methylcytosine. GLORI relies on glyoxal and nitrite-mediated deamination of unmethylated adenosines while keeping m6A intact, thereby achieving specific and efficient m6A detection.

Pre-processing the raw sequencing data

Annotations

  • GLORI-tools exclusively accepts single-end sequencing reads. Prior to use, it is critical to ensure that the input reads are A-to-G converted reads. Therefore, it is important to carefully verify the orientation of the reads before processing them.
  • Before inputting data into GLORI-tools, data cleaning is necessary. This involves removing sequencing adapters, low-quality bases, PCR duplicates based on unique molecular identifiers (UMIs), and finally, removing the UMIs themselves. ### Installation
  • trim_galore
  • seqkit
  • FASTX-Toolkit (including fastx_trimmer) ### Example code
  • trim_galore -q 20 --stringency 1 -e 0.3 --length {length(5’UMI) +25nt} --path_to_cutadapt {cutadapter} --dont_gzip -o {output_dir1} {inputfile}
  • seqkit rmdup -j {Thread} -s -D {output_dupname} {filtered_file} > {filter_file2}
  • fastx_trimmer -Q 33 -f {length(5′UMI) +1nt} -i {filter_file2} -o {filter_file3}

GLORI-tools

GLORI-tools is a bioinformatics pipeline tailored for the analysis of high-throughput sequencing data generated by GLORI. GLORI pipeline

Installation and Requirement

Installation

GLORI-tools is written in Python3 and is executed from the command line. To install GLORI-tools simply download the code and extract the files into a GLORI-tools installation folder.

GLORI-tools needs the following tools to be installed and ideally available in the PATH environment:

  • STAR ≥ v2.7.5c
  • bowtie (bowtie1) version ≥ v1.3.0
  • samtools ≥ v1.10
  • python ≥ v3.8.3

GLORI-tools needs the following python package to be installed:

pysam,pandas,argparse,time,collections,os,sys,re,subprocess,multiprocessing,copy,numpy,scipy,math,sqlite3,Bio,statsmodels,itertools,heapq,glob,signal

Example and Usage:

1. Generate annotation files (required)

1.1 download files for annotation (required, using hg38 as example):

  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt
  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

1.2 download reference genome and transcriptome

  • wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

  • wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109.20190905/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_rna.fna.gz

1.3 Unify chromosome naming in GTF file and genome file:

  • python ./get_anno/change_UCSCgtf.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf -j GCF_000001405.39_GRCh38.p13_assembly_report.txt -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens

2. get reference for reads alignment (required)

2.1 build genome index using STAR

  • python ./pipelines/build_genome_index.py -f $genome_fastafile -p 20 -pre hg38

you will get: * $ hg38.rvsCom.fa * $ hg38.AG_conversion.fa * the corresponding index from STAR

2.2 build transcriptome index using bowtie

2.2.1 get the longest transcript for genes (required)

Get the required annotation table files:

  • python ./get_anno/gtf2anno.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl

  • awk '$3!~/_/&&$3!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl | sed '/unknown_transcript/d' > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2

Get the longest transcript:

  • python ./get_anno/selected_longest_transcrpts_fa.py -anno GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -fafile GCF_000001405.39_GRCh38.p13_rna.fa --outname_prx GCF_000001405.39_GRCh38.p13_rna2.fa

2.2.2 build reference with bowtie

  • python ./pipelines/build_transcriptome_index.py -f $ GCF_000001405.39_GRCh38.p13_rna2.fa -pre GCF_000001405.39_GRCh38.p13_rna2.fa

you will get: * $ GCF000001405.39GRCh38.p13rna2.fa.AGconversion.fa * the corresponding index from bowtie

3. get_base annotation (optional)

3.1 get annotation at single-base resolution

  • python ./get_anno/anno_to_base.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2 -threads 20 -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno

3.2 get required annotation file for further removal of duplicated loci

  • python ./get_anno/gtf2genelist.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens -f GCF_000001405.39_GRCh38.p13_rna.fa -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > output2

  • awk '$6!~/_/&&$6!="na"' GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist > GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2

3.3 Removal of duplicated loci in the annotation file

  • python ./get_anno/anno_to_base_remove_redundance_v1.0.py -i GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.baseanno -o GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.tbl2.noredundance.base -g GCF_000001405.39_GRCh38.p13_genomic.gtf_change2Ens.genelist2

Finally, you will get annotation files: * GCF000001405.39GRCh38.p13genomic.gtfchange2Ens.tbl2.noredundance.base

4. alignment and call sites (required)

GLORI-tools takes cleaned reads as input and finally reports files for the conversion rate (A-to-G) of GLORI for each gene and m6A sites at single-base resolution with corresponding A rate representative for modification level.

4.1 Example shell scripts

| Used files | | :--- | | Thread=1 | | genomdir=yourdir | | genome=${genomdir}/hg38.AGconversion.fa | | genome2=${genomdir}/hg38.fa | | rvsgenome=${genomdir}/hg38.revCom.fa | | TfGenome=${genomdir}/GCF000001405.39GRCh38.p13rna2.fa.AGconversion.fa | | annodir=yourdir | | baseanno=${annodir}/GCF000001405.39GRCh38.p13genomic.gtfchange2Ens.tbl2.noredundance.base | | anno=${annodir}/GCF000001405.39GRCh38.p13genomic.gtfchange2Ens.tbl2 | | outputdir=yourdir | | tooldir=/tooldir/GLORI -tools | | prx=yourprefix | | file=yourcleanedreads |

4.2 Call m6A sites annotated the with genes

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac

4.3 Call m6A sites without annotated genes.

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac

  • In this situation, the background for each m6A sites is the overall conversion rate.

  • The site list obtained by the above two methods is basically the similar, and there may be a few differential sites in the list.

4.4 mapping with samples without GLORI treatment

  • python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --untreated

4.5 call all the annotated A cites

1) with annotation: * python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -b $baseanno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0

2) or without annotation * python ${tooldir}/run_GLORI.py -i $tooldir -q ${file} -T $Thread -f ${genome} -f2 ${genome2} -rvs ${rvsgenome} -Tf ${TfGenome} -a $anno -pre ${prx} -o $outputdir --combine --rvs_fac -c 1 -C 0 -r 0 -p 1.1 -adp 1.1 -s 0

5 Resultes:

5.1 Output files of GLORI

| Output files | Interpretation | | :---: | :---: | | ${yourprefix}merged.sorted.bam | The overall mapping of reads in .bam format | | ${yourprefix}referbase.mpi | The text pileup output from .bam files | | ${yourprefix}.totalCR.txt | The text file containing the median value of the overall A-to-G conversion rate for each transcriptome and gene | | ${yourprefix}.totalm6A.FDR.csv | The final list of m6A sites obtained, with the A rate serving as the m6A level |

5.2 GLORI sites files

| Columns | Interpretation | | :---: | :---: | | Chr | chromosome | | Sites | genomic loci | | Strand | strand | | Gene | annotated gene | | CR | conversion rate for genes | | AGcov | reads coverage with A and G | | Acov | reads coverage with A | | Genecov | mean coverage for the whole gene | | Ratio | A rate for the sites/ or methylation level for the sites | | Pvalue | test for A rate based on the background | | P_adjust | FDR ajusted P value |

5.3 GLORI conversion rate (.totalCR) files

| Columns | Interpretation | | :---: | :---: | | SA | chromosomes or genes | | A-to-G_ratio | A-to-G conversion rate for each chromosome and gene |

Maintainers and Contributing

  • GLORI-tools is developed and maintained by Cong Liu (liucong-1112@pku.edu.cn).
  • The development of GLORI-tools is inseparable from the open source software RNA-m5C (https://github.com/SYSU-zhanglab/RNA-m5C).

Licences

  • Released under MIT license

Citation

please cite: Liu C., Sun H., Yi Y., Shen W., Li K., Xiao Y., et al. (2022). Absolute quantification of single-base m6A methylation in the mammalian transcriptome using GLORI. Nat. Biotechnol. (https://www.nature.com/articles/s41587-022-01487-9)

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Cong
    given-names: Liu

title: "GLORI-tools:a bioinformatics pipeline tailored for the analysis of high-throughput sequencing data generated by GLORI"
version: 1.0.0
doi: 10.5281/zenodo.7014168
date-released: 2022-08-20

GitHub Events

Total
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 15
  • Fork event: 5
Last Year
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 15
  • Fork event: 5