te_feature_density

A script for computing density or number of TEs in protein coding features.

https://github.com/guilleperis/te_feature_density

Repository

Basic Info

Host: GitHub
Owner: GuillePeris
License: gpl-3.0
Language: R
Default Branch: main
Size: 36.1 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

TE_feature_density

A script for computing density or number of transposable elements (TEs) in protein coding features.

Dependencies

dplyr
stringr
reshape2
bedtoolsr
AnnotationHub
rtracklayer
biomartr
UCSCRepeatMasker
data.table
tools

bedtoolsr is an R wrapper around bedtools, so bedtools itself must be installed before using this script.
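Because bedtoolsr relies on an external bedtools executable, it can help to confirm that one is reachable before running the analysis. The following is a minimal sketch using only base R; the helper name `find_bedtools` is hypothetical and not part of the repository.

```r
# Hypothetical helper: look up the bedtools executable on the PATH.
# Sys.which() returns an empty string when the program cannot be found.
find_bedtools <- function() {
  path <- unname(Sys.which("bedtools"))
  if (!nzchar(path)) {
    message("bedtools not found on PATH; install it before using bedtoolsr.")
    return(NA_character_)
  }
  path
}

bedtools_path <- find_bedtools()
```

If `bedtools_path` is `NA`, install bedtools with your system package manager or add its location to the PATH before sourcing the script.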

TE_feature_density has only been tested on Unix/Linux.

How does it work?

TE_feature_density computes the number of TEs, or the density of overlapping TEs, in different gene regions (gene, exons, introns, 5'UTR, 3'UTR, downstream). You can use your own gene and RepeatMasker annotations or let TE_feature_density download them from Ensembl and UCSC. A subset of the gene annotation, defined by tag ("basic", "Ensembl_canonical", "MANE_select"...), can also be selected.
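The difference between the two analyses can be illustrated with a small base R sketch for a single feature. It is not the script's actual implementation (which works through bedtoolsr); it assumes half-open [start, end) coordinates on one chromosome, and the function names `merge_intervals` and `te_in_feature` are hypothetical. The handling of `min_overlap` follows one reading of the minOverlap description in the Parameters section: per TE when counting, per merged cluster for density.

```r
# Merge overlapping or adjacent half-open intervals [start, end).
merge_intervals <- function(start, end) {
  if (length(start) == 0) {
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # A new cluster begins when an interval starts past the furthest end so far.
  furthest_end <- cummax(end)
  new_cluster <- c(TRUE, start[-1] > furthest_end[-length(furthest_end)])
  cluster <- cumsum(new_cluster)
  data.frame(
    start = as.numeric(tapply(start, cluster, min)),
    end = as.numeric(tapply(end, cluster, max))
  )
}

# Count TEs and compute TE density for one feature.
te_in_feature <- function(feat_start, feat_end, te_start, te_end,
                          min_overlap = 1) {
  # Overlap length with the feature (zero or negative means no overlap).
  overlap_bp <- function(s, e) pmin(e, feat_end) - pmax(s, feat_start)

  # "number": count individual TEs that reach the overlap threshold.
  number <- sum(overlap_bp(te_start, te_end) >= min_overlap)

  # "density": merge TEs into clusters first so shared bases count once,
  # then apply the threshold to each cluster.
  clusters <- merge_intervals(te_start, te_end)
  cluster_bp <- overlap_bp(clusters$start, clusters$end)
  covered <- sum(cluster_bp[cluster_bp >= min_overlap])

  list(number = number, density = covered / (feat_end - feat_start))
}

# Toy example: a 1000 bp feature and four TEs, two of which overlap each other.
res <- te_in_feature(
  feat_start = 1000, feat_end = 2000,
  te_start = c(900, 1050, 1500, 1995),
  te_end = c(1100, 1200, 1600, 2100),
  min_overlap = 10
)
```

In this toy case the last TE overlaps the feature by only 5 bp and is dropped, and the first two TEs are merged into a single cluster before their covered bases are summed.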

Automatic annotation download

Gene annotation is downloaded from the Ensembl database using the getGTF function from the biomartr package. You need to define the species (e.g. Canis lupus familiaris) and the Ensembl release (e.g. 113). Available species and releases are listed on the Ensembl website. You can further filter the annotation by tag.

TE annotation is downloaded from UCSC through the AnnotationHub package, using metadata from the UCSCRepeatMasker package. For this you need the code that corresponds to the UCSC genome version.

To find out which tags are available in the gene annotation for a given species and release, and which code identifies the UCSC RepeatMasker annotation, use checkAnnotations.R (see Usage).

Using your own annotations

Gene annotation can be downloaded from Ensembl.org.

Remember to set these variables in TE_feature_density.R:

fileGene <- TRUE
gene_annot_file <- "data/your_genome_annotation.gtf"
gene_annot_format <- "gff"

TE annotation can be downloaded from the UCSC Table Browser. Choose the Clade, Genome and Assembly of interest, then set:

Group: Variations and Repeats
Track: RepeatMasker
Output format: All fields from selected table
Output field separator: tsv

Click Get output and save the file in the data folder. Remember to set these variables in TE_feature_density.R:

fileTE <- TRUE
TE_annot_file <- "data/your_rmsk.txt"
TE_annot_format <- "rmsk"
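To check that a downloaded RepeatMasker table looks as expected before running the script, it can be read with base R. The sketch below assumes the standard UCSC rmsk column layout and a header line starting with "#"; the function name `read_rmsk` and the toy rows are illustrative only, and the script itself may import the file differently.

```r
# Column names of the UCSC rmsk table when all fields are exported.
rmsk_cols <- c("bin", "swScore", "milliDiv", "milliDel", "milliIns",
               "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
               "repName", "repClass", "repFamily", "repStart", "repEnd",
               "repLeft", "id")

# Hypothetical reader: returns the position and classification columns.
read_rmsk <- function(path) {
  # Skip the header line only if it is present (it starts with "#").
  first_line <- readLines(path, n = 1)
  skip <- if (startsWith(first_line, "#")) 1 else 0
  rmsk <- read.delim(path, header = FALSE, skip = skip,
                     col.names = rmsk_cols, comment.char = "",
                     stringsAsFactors = FALSE)
  rmsk[, c("genoName", "genoStart", "genoEnd", "strand",
           "repName", "repClass", "repFamily")]
}

# Toy file with two made-up records, written to a temporary location.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  paste0("#", paste(rmsk_cols, collapse = "\t")),
  "585\t1000\t100\t10\t5\tchr1\t100\t400\t-9600\t+\tL1M5\tLINE\tL1\t1\t300\t0\t1",
  "585\t2000\t50\t0\t0\tchr1\t350\t650\t-9350\t-\tAluY\tSINE\tAlu\t-10\t300\t1\t2"
), demo_file)

te <- read_rmsk(demo_file)
```

Note that genoStart in UCSC tables is 0-based and genoEnd is exclusive, which matches BED conventions.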

Usage

Check genome and TE annotation

Use the checkAnnotations.R script to list the filter tags available in the gene annotation (use NULL for no tag filtering) and to find the UCSC RepeatMasker code needed for automatic TE annotation download. Change the variables in the Parameters section:

organism <- "Canis lupus familiaris"
release <- "113"
fileTE <- TRUE # TRUE to use your own TE annotation, FALSE to download it automatically
fileGene <- TRUE # TRUE to use your own gene annotation, FALSE to download it automatically
gene_annot_file <- "data/Canis_lupus_familiaris.ROS_Cfam_1.0.113.gtf"
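For orientation, the tags reported by checkAnnotations.R correspond to the `tag "..."` entries in the attribute column of an Ensembl GTF file. The following base R sketch shows one way such tags could be collected; the function name `extract_gtf_tags` and the attribute strings are made up for illustration and do not reproduce the script's own code.

```r
# Hypothetical helper: collect the distinct values of the `tag` attribute
# from a character vector of GTF attribute strings (GTF column 9).
extract_gtf_tags <- function(attributes) {
  hits <- regmatches(
    attributes,
    gregexpr("\\btag \"[^\"]+\"", attributes, perl = TRUE)
  )
  tags <- sub("^tag \"([^\"]+)\"$", "\\1", unlist(hits))
  sort(unique(tags))
}

# Toy attribute strings in Ensembl GTF style (not real records).
attrs <- c(
  'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "Ensembl_canonical";',
  'gene_id "G1"; transcript_id "T2"; tag "basic";',
  'gene_id "G2"; transcript_id "T3";'
)

available_tags <- extract_gtf_tags(attrs)
```

With a real file, the attribute strings would come from the ninth tab-separated column of the GTF, skipping the comment lines that start with "#".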

Change parameters

Before running the TE_feature_density.R script, set the R variables in the Parameters section:

organism: Species name, e.g. "Homo sapiens", "Mus musculus", "Danio rerio".
UCSC_TE_annot: UCSC code for the TE annotation. You can get this code by first running the checkAnnotations.R script.
release: Ensembl gene annotation release.
interest_TEs: TE classes to analyze, e.g. c("LINE", "SINE"). If set to NULL, all TE classes are analyzed.
interest_subF: TE families to analyze, e.g. c("Alu", "L1"). If set to NULL, all TE families are considered. Note that a non-NULL value overrides the interest_TEs variable: you can filter by class or by family, not both.
tag: Filter the gene annotation by tag ("Ensembl_canonical", "basic", "MANE_select"...). Check which tags are available by first running the checkAnnotations.R script.
minOverlap: Only consider TEs that overlap a feature by at least minOverlap bp. In density analysis this applies to clusters of overlapping TEs, not to individual TEs.
downstream: Number of bp that define the downstream region.
OUTPUT_DIR: Results folder.
analysis: Choose whether to analyze the number of TEs (number) or TE density (density). In density analysis, overlapping TEs are merged so that shared nucleotides are not counted more than once.
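The precedence between interest_TEs and interest_subF can be sketched as follows. This is an illustration only: the function name `filter_tes` is hypothetical, and the column names repClass and repFamily are assumed from the UCSC RepeatMasker table rather than taken from the script.

```r
# Hypothetical filter: a family filter, when given, takes precedence over
# the class filter; with both set to NULL every TE is kept.
filter_tes <- function(te, interest_TEs = NULL, interest_subF = NULL) {
  if (!is.null(interest_subF)) {
    return(te[te$repFamily %in% interest_subF, , drop = FALSE])
  }
  if (!is.null(interest_TEs)) {
    return(te[te$repClass %in% interest_TEs, , drop = FALSE])
  }
  te
}

# Toy TE table with class and family columns.
te <- data.frame(
  repName = c("AluY", "L1M5", "MIR", "MER5A"),
  repClass = c("SINE", "LINE", "SINE", "DNA"),
  repFamily = c("Alu", "L1", "MIR", "hAT-Charlie"),
  stringsAsFactors = FALSE
)

# Class filter only: keeps AluY, L1M5 and MIR.
by_class <- filter_tes(te, interest_TEs = c("LINE", "SINE"))

# Both set: the family filter wins, so the "DNA" class filter is ignored.
by_family <- filter_tes(te, interest_TEs = "DNA",
                        interest_subF = c("Alu", "L1"))
```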

You may also want to change some variables in the Advanced parameters section, particularly if you want to use your own downloaded annotations.

fileGene: TRUE for reading file annotation from gene_annot_file. FALSE for automatic downloading.
gene_annot_file: Path to gene annotation file.
gene_annot_format: Format argument passed to the file import function. Don't change it unless you are sure!
fileTE: TRUE for reading file annotation from TE_annot_file. FALSE for automatic downloading.
TE_annot_file: Path to TE annotation file.
TE_annot_format: Format argument passed to the file import function. Don't change it unless you are sure!
feature_types: List of gene features to analyze. Choose from c("gene", "fiveprimeutr", "threeprimeutr", "exon", "intron", "downstream").

Note that using your own annotation files can take longer than expected.

Session info

```
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base

other attached packages:
 [1] data.table_1.16.4 BRGenomics_1.14.1
 [3] rtracklayer_1.62.0 GenomicRanges_1.54.1
 [5] biomartr_1.0.7 UCSCRepeatMasker_3.15.2
 [7] GenomeInfoDb_1.38.8 IRanges_2.36.0
 [9] S4Vectors_0.40.2 AnnotationHub_3.10.1
[11] BiocFileCache_2.10.2 dbplyr_2.5.0
[13] BiocGenerics_0.48.1 dplyr_1.1.4
[15] stringr_1.5.1 reshape2_1.4.4
[17] bedr_1.0.7

loaded via a namespace (and not attached):
  [1] DBI_1.2.3 bitops_1.0-9
  [3] formatR_1.14 testthat_3.2.1.1
  [5] biomaRt_2.58.2 rlang_1.1.4
  [7] magrittr_2.0.3 matrixStats_1.4.1
  [9] compiler_4.3.0 RSQLite_2.3.8
 [11] png_0.1-8 vctrs_0.6.5
 [13] pkgconfig_2.0.3 crayon_1.5.3
 [15] fastmap_1.2.0 XVector_0.42.0
 [17] Rsamtools_2.18.0 promises_1.3.0
 [19] rmarkdown_2.29 tzdb_0.4.0
 [21] purrr_1.0.2 bit_4.5.0.1
 [23] xfun_0.49 zlibbioc_1.48.2
 [25] cachem_1.1.0 jsonlite_1.8.9
 [27] progress_1.2.3 blob_1.2.4
 [29] later_1.3.2 DelayedArray_0.28.0
 [31] BiocParallel_1.36.0 interactiveDisplayBase_1.40.0
 [33] parallel_4.3.0 prettyunits_1.2.0
 [35] R6_2.5.1 stringi_1.8.4
 [37] brio_1.1.5 knitr_1.49
 [39] Rcpp_1.0.13-1 SummarizedExperiment_1.32.0
 [41] downloader_0.4 R.utils_2.12.3
 [43] readr_2.1.5 VennDiagram_1.7.3
 [45] httpuv_1.6.15 Matrix_1.6-5
 [47] tidyselect_1.2.1 rstudioapi_0.17.1
 [49] abind_1.4-8 yaml_2.3.10
 [51] codetools_0.2-20 curl_6.0.1
 [53] lattice_0.22-6 tibble_3.2.1
 [55] plyr_1.8.9 withr_3.0.2
 [57] Biobase_2.62.0 shiny_1.9.1
 [59] KEGGREST_1.42.0 evaluate_1.0.1
 [61] lambda.r_1.2.4 futile.logger_1.4.3
 [63] xml2_1.3.6 Biostrings_2.70.3
 [65] pillar_1.10.0 BiocManager_1.30.25
 [67] filelock_1.0.3 MatrixGenerics_1.14.0
 [69] renv_1.0.11 generics_0.1.3
 [71] vroom_1.6.5 RCurl_1.98-1.16
 [73] BiocVersion_3.18.1 hms_1.1.3
 [75] ggplot2_3.5.1 munsell_0.5.1
 [77] scales_1.3.0 xtable_1.8-4
 [79] glue_1.8.0 tools_4.3.0
 [81] BiocIO_1.12.0 locfit_1.5-9.10
 [83] GenomicAlignments_1.38.2 XML_3.99-0.17
 [85] grid_4.3.0 colorspace_2.1-1
 [87] AnnotationDbi_1.64.1 GenomeInfoDbData_1.2.11
 [89] restfulr_0.0.15 cli_3.6.3
 [91] rappdirs_0.3.3 futile.options_1.0.1
 [93] S4Arrays_1.2.1 gtable_0.3.6
 [95] R.methodsS3_1.8.2 DESeq2_1.42.1
 [97] digest_0.6.37 SparseArray_1.2.4
 [99] rjson_0.2.23 memoise_2.0.1
[101] htmltools_0.5.8.1 R.oo_1.27.0
[103] lifecycle_1.0.4 httr_1.4.7
[105] mime_0.12 bit64_4.5.2
```

Owner

Name: Guillermo Peris Ripollés
Login: GuillePeris
Kind: user
Location: Castellón/Granada
Company: Universitat Jaume I/Genyo

Twitter: waltzing_piglet
Repositories: 1
Profile: https://github.com/GuillePeris

Full professor at Universitat Jaume I (Spain) and bioinformatician at Genyo (Granada). Interested in mobile genetic elements and miRNA.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Peris
    given-names: Guillermo
    orcid: https://orcid.org/0000-0003-2010-7844
title: "TE_feature_density"
version: 1.0.0
date-released: 2025-01-13
