biorxiv-analyzerr

https://github.com/mrunmays/biorxiv-analyzerr

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (7.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: MrunmayS
Language: R
Default Branch: main
Size: 83.4 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 5 years ago · Last pushed almost 5 years ago

Metadata Files

Readme Citation

bioRxiv-analyzerR

This package extracts data from citation files downloaded with the citation manager of bioRxiv, a preprint server for biology research.

Prerequisites

The following packages must be installed beforehand for the package to work:

stringr
dplyr
stringi
tm
SnowballC
wordcloud
RColorBrewer
NLP
topicmodels
tidytext
reshape2
ggplot2
pals
Rcpp
igraph

They can be installed by running:

install.packages(c("stringr", "dplyr", "stringi", "tm", "SnowballC", "wordcloud", "RColorBrewer", "NLP", "topicmodels", "tidytext", "reshape2", "ggplot2", "pals", "Rcpp", "igraph"))
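If some of the prerequisites are already present, it can be convenient to install only the missing ones. The following is a minimal base-R sketch; the helper name `missing_packages` is hypothetical and not part of this package.

```r
# Prerequisites listed in this README.
required <- c("stringr", "dplyr", "stringi", "tm", "SnowballC", "wordcloud",
              "RColorBrewer", "NLP", "topicmodels", "tidytext", "reshape2",
              "ggplot2", "pals", "Rcpp", "igraph")

# Hypothetical helper: return the required packages that are not installed yet.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install them from CRAN:
  # install.packages(to_install)
} else {
  message("All prerequisites are already installed.")
}
```

The install call is left commented out so that running the snippet only reports what is missing.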

To Extract the Data from the text file

Use the extract function, passing the name of the file as a string.

 
df <- extract("citations.txt")

The result is saved in the assigned variable (here, df).
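To give an idea of what such an extraction involves, here is a rough base-R sketch that parses the export layout visible in the citations file: one citation line, followed by one or more abstract lines, with blank lines between records. It is not the package's extract function. The function name `parse_citations_sketch` and the `Title` and `DOI` column names are assumptions, and the sample abstracts are shortened placeholders; only the `Abstract` column is named in this README.

```r
# Hypothetical parser, for illustration only; not the package's extract().
parse_citations_sketch <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # A citation line mentions bioRxiv and ends with "doi: 10.1101/<id>."
  is_citation <- grepl("bioRxiv", lines, fixed = TRUE) &
    grepl("doi: 10\\.1101/\\S+\\.$", lines)
  record_id <- cumsum(is_citation)
  citation <- lines[is_citation]

  # Every non-citation line after citation i belongs to abstract i.
  abstract <- vapply(seq_along(citation), function(i) {
    paste(lines[record_id == i & !is_citation], collapse = " ")
  }, character(1))

  # The title is the quoted part before "bioRxiv"; drop its trailing period.
  title <- sub('^.*?"(.*)"\\s*bioRxiv.*$', "\\1", citation, perl = TRUE)
  title <- sub("\\.$", "", title)
  doi <- sub("^.*doi: (10\\.1101/[0-9.]+)\\.$", "\\1", citation)

  data.frame(Title = title, DOI = doi, Abstract = abstract,
             stringsAsFactors = FALSE)
}

# Two records in the export layout, with shortened placeholder abstracts.
sample_lines <- c(
  'Qiang Zhu et al.. "Learning Your Heart Actions From Pulse: ECG Waveform Reconstruction From PPG." bioRxiv , no. (2021): 815258. Accessed July 19, 2021. doi: 10.1101/815258.',
  "This paper studies the relation between ECG and PPG.",
  "",
  'Arash Bayat et al.. "Fast and Accurate Exhaustive Higher-Order Epistasis Search with BitEpi." bioRxiv , no. (2021): 858282. Accessed July 19, 2021. doi: 10.1101/858282.',
  "Motivation Complex genetic diseases may be modulated by epistatic interactions.",
  "Results We present BitEpi."
)
df_sketch <- parse_citations_sketch(sample_lines)
print(df_sketch[, c("Title", "DOI")])
```

Abstracts that span several lines are joined into a single string, so each record ends up as one row.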

To calculate DTM(Document-Term Matrix)

 
dtm <- calculatedtm(df$Abstract)

Pass in the Abstract column of the data frame created above to calculate the DTM with common words removed.
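For readers new to the term, a document-term matrix has one row per document and one column per word, and each cell holds a word count. The base-R sketch below illustrates the idea on two toy sentences. It is not the package's calculatedtm function (which presumably builds on tm from the prerequisites); the name `build_dtm_sketch` and the short stop-word list are assumptions made for the example.

```r
# Hypothetical base-R illustration of a document-term matrix.
build_dtm_sketch <- function(docs,
                             stop_words = c("the", "a", "an", "of", "and",
                                            "in", "to", "is")) {
  # Lower-case, split on anything that is not a letter, drop stop words.
  tokens <- lapply(docs, function(d) {
    words <- unlist(strsplit(tolower(d), "[^a-z]+"))
    words <- words[nzchar(words)]
    words[!words %in% stop_words]
  })
  vocab <- sort(unique(unlist(tokens)))

  # One row per document, one column per remaining word.
  dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
                dimnames = list(paste0("doc", seq_along(docs)), vocab))
  for (i in seq_along(tokens)) {
    counts <- table(tokens[[i]])
    if (length(counts) > 0) {
      dtm[i, names(counts)] <- as.integer(counts)
    }
  }
  dtm
}

toy_docs <- c("The phage infects the host",
              "Host genes and phage genes")
toy_dtm <- build_dtm_sketch(toy_docs)
print(toy_dtm)
```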

To calculate Frequency Table

 
freqtable <- calculatefreq(dtm)

To make a frequency table, pass the DTM calculated above into the function.
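Conceptually, a frequency table is the column sums of the DTM, sorted from most to least frequent. A minimal sketch follows, using a small hand-made count matrix in place of a real DTM. The function name `freq_table_sketch` and the `word` and `freq` column names are assumptions, not the package's actual output format.

```r
# Hypothetical illustration: total count per term, most frequent first.
freq_table_sketch <- function(dtm) {
  totals <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  data.frame(word = names(totals), freq = as.integer(totals),
             stringsAsFactors = FALSE, row.names = NULL)
}

# Toy matrix: rows are documents, columns are terms.
toy_counts <- matrix(c(1L, 1L,   # host
                       0L, 3L,   # genes
                       1L, 0L),  # phage
                     nrow = 2,
                     dimnames = list(c("doc1", "doc2"),
                                     c("host", "genes", "phage")))
toy_freq <- freq_table_sketch(toy_counts)
print(toy_freq)
```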

To make a word cloud and bar plot

The frequency table made by the previous function is used for this.

 
makewordcloud(freqtable)

makebarplot(freqtable)
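As a rough picture of what a bar plot of term frequencies involves, the sketch below draws the top terms with base graphics. It does not reproduce makebarplot, whose implementation may well use ggplot2; the function name `plot_top_terms_sketch` and the `word` and `freq` columns are assumptions.

```r
# Hypothetical illustration: bar plot of the n most frequent terms.
plot_top_terms_sketch <- function(freqtable, n = 10) {
  top <- head(freqtable[order(-freqtable$freq), ], n)
  barplot(top$freq, names.arg = top$word, las = 2,
          ylab = "Frequency", main = "Most frequent terms")
  invisible(top)  # return the plotted rows for inspection
}

demo_freq <- data.frame(word = c("host", "genes", "phage"),
                        freq = c(2L, 3L, 1L),
                        stringsAsFactors = FALSE)
plot_top_terms_sketch(demo_freq, n = 2)
```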

To make topic model graph

Just pass in the Abstract column; the function builds the graph with the number of topics set to 10.

 
maketopicmodel(df$Abstract)

To create topics

To create K topics, pass in the Abstract column and K:

 
topics <- createtopics(df$Abstract, K)

K is set to 10 by default.
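The README does not say how the topics are computed. Since topicmodels and tm are among the prerequisites, one plausible approach is latent Dirichlet allocation (LDA) on a document-term matrix. The sketch below shows that approach on four toy sentences, under the assumption that both packages are installed. It illustrates the general technique and makes no claim about the actual code of createtopics.

```r
library(tm)
library(topicmodels)

# Toy stand-ins for abstracts (assumption: real input would be df$Abstract).
abstracts <- c(
  "Temperate phages shape bacterial genomes and phage host interactions",
  "Phage genomes mined from bacterial sequencing data",
  "Deep learning classifiers detect sleep stages from EEG signals",
  "Neural network classifiers trained on EEG data need explainability"
)

# Build a document-term matrix with lower-casing and stop-word removal.
corpus <- VCorpus(VectorSource(abstracts))
dtm <- DocumentTermMatrix(corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         stopwords = TRUE))

# Fit an LDA model with K topics; the seed makes the run repeatable.
K <- 2
lda <- LDA(dtm, k = K, control = list(seed = 1234))

print(terms(lda, 3))   # top 3 terms per topic
print(topics(lda))     # most likely topic for each document
```

With a corpus this small the topics are not meaningful; the point is only the sequence of steps from text to DTM to fitted model.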

To make links using topics

Pass the topics into the function to make a network of linked topics.

 
text_link(topics)

Owner

Name: Mrunmay Shelar
Login: MrunmayS
Kind: user

Repositories: 32
Profile: https://github.com/MrunmayS

Zakaria Louadi et al.. "Functional enrichment of alternative splicing events with NEASE reveals insights into tissue identity and diseases." bioRxiv , no. (2021): 2021.07.14.452376. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452376.
Alternative splicing (AS) is an important aspect of gene regulation. Nevertheless, its role in molecular processes and pathobiology is far from understood. A roadblock is that tools for the functional analysis of AS-set events are lacking. To mitigate this, we developed NEASE, a tool integrating pathways with protein-protein and domain-domain interactions to functionally characterize AS events. We show in four application cases how NEASE can identify pathways contributing to tissue identity and cell type development, and how it highlights splicing-related biomarkers. With a unique view on AS, NEASE generates unique and meaningful biological insights complementary to classical pathways analysis.KeywordsCompeting Interest StatementThe authors have declared no competing interest.

Daniel Danis et al.. "SvAnna: efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing." bioRxiv , no. (2021): 2021.07.14.452267. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452267.
Structural variants (SVs) are implicated in the etiology of Mendelian diseases but have been systematically underascertained owing to limitations of existing technology. Recent technological advances such as long-read sequencing (LRS) enable more comprehensive detection of SVs, but approaches for clinical prioritization of candidate SVs are needed. Existing computational approaches do not specifically target LRS data, thereby missing a substantial proportion of candidate SVs, and do not provide a unified computational model for assessing all types of SVs. Structural Variant Annotation and Analysis (SvAnna) assesses all classes of SV and their intersection with transcripts and regulatory sequences in the context of topologically associating domains, relating predicted effects on gene function with clinical phenotype data. We show with a collection of 182 published case reports with pathogenic SVs that SvAnna places over 90% of pathogenic SVs in the top ten ranks. The interpretable prioritizations provided by SvAnna will facilitate the widespread adoption of LRS in diagnostic genomics.Competing Interest StatementThe authors have declared no competing interest.

Qiang Zhu et al.. "Learning Your Heart Actions From Pulse: ECG Waveform Reconstruction From PPG." bioRxiv , no. (2021): 815258. Accessed July 19, 2021. doi: 10.1101/815258.
This paper studies the relation between electrocardiogram (ECG) and photoplethysmogram (PPG) and infers the waveform of ECG via the PPG signals that can be obtained from affordable wearable Internet-of-Things (IoT) devices for mobile health. In order to address this inverse problem, a transform is proposed to map the discrete cosine transform (DCT) coefficients of each PPG cycle to those of the corresponding ECG cycle based on our proposed cardiovascular signal model. The proposed method is evaluated with different morphologies of the PPG and ECG signals on three benchmark datasets with a variety of combinations of age, weight, and health conditions using different training setups. Experimental results show that the proposed method can achieve a high prediction accuracy greater than 0.92 in averaged correlation for each dataset when the model is trained subject-wise. With a signal processing and learning system that is designed synergistically, we are able to reconstruct ECG signals by exploiting the relation of these two types of cardiovascular measurement. The reconstruction capability of the proposed method can enable low-cost ECG screening from affordable wearable IoT devices for continuous and long-term monitoring. This work may open up a new research direction to transfer the understanding of the clinical ECG knowledge base to build a knowledge base for PPG and data from wearable devices.Index TermsCompeting Interest StatementThe authors have declared no competing interest.

Lu, Jennifer, Darren Korbie and Matt Trau. "An analytical pipeline for DNA Methylation Array Biomarker Studies." bioRxiv , no. (2021): 2021.07.14.452293. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452293.
DNA methylation is one of the most commonly studied epigenetic biomarkers, due to its role in disease and development. The Illumina Infinium methylation arrays still remains the most common method to interrogate methylation across the human genome, due to its capabilities of screening over 480,000 loci simultaneously. As such, initiatives such as The Cancer Genome Atlas (TCGA) have utilized this technology to examine the methylation profile of over 20,000 cancer samples. There is a growing body of methods for pre-processing, normalisation and analysis of array-based DNA methylation data. However, the shape and sampling distribution of probe-wise methylation that could influence the way data should be examined was rarely discussed. Therefore, this article introduces a pipeline that predicts the shape and distribution of normalised methylation patterns prior to selection of the most optimal inferential statistics screen for differential methylation. Additionally, we put forward an alternative pipeline, which employed feature selection, and demonstrate its ability to select for biomarkers with outstanding differences in methylation, which does not require the predetermination of the shape or distribution of the data of interest. Availability: The Distribution test and the feature selection pipelines are available for download at: https://github.com/uqjlu8/DistributionTest Competing Interest Statement: The authors have declared no competing interest.

Charles Bernard et al.. "Large-scale identification of viral quorum sensing systems reveal convergent evolution of density-dependent sporulation-hijacking in bacteriophages." bioRxiv , no. (2021): 2021.07.15.452460. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452460.
Quorum sensing systems (QSSs) are genetic systems supporting cell-cell or bacteriophage-bacteriophage communication. By regulating behavioral switches as a function of the encoding population density, QSSs shape the social dynamics of microbial communities. However, their diversity is tremendously overlooked in bacteriophages, which implies that many density-dependent behaviors likely remains to be discovered in these viruses. Here, we developed a signature-based computational method to identify novel peptide-based RRNPP QSSs in gram-positive bacteria (e.g. Firmicutes) and their mobile genetic elements. The large-scale application of this method against available genomes of Firmicutes and bacteriophages revealed 2708 candidate RRNPP-type QSSs, including 382 found in (pro)phages. These 382 viral candidate QSSs are classified into 25 different groups of homologs, of which 22 were never described before in bacteriophages. Remarkably, genomic context analyses suggest that candidate viral QSSs from 6 different families dynamically manipulate the host biology. Specifically, many viral candidate QSSs are predicted to regulate, in a density-dependent manner, adjacent (pro)phage-encoded regulator genes whose bacterial homologs are key regulators of the sporulation initiation pathway (either Rap, Spo0E, or AbrB). Consistently, we found evidence from public data that certain of our candidate (pro)phage-encoded QSSs dynamically manipulate the timing of sporulation of the bacterial host. These findings challenge the current paradigm assuming that bacteria decide to sporulate in adverse situation. 
Indeed, our survey highlights that bacteriophages have evolved, multiple times, genetic systems that dynamically influence this decision to their advantage, making sporulation a survival mechanism of last resort for phage-host collectives. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: HMMs: Hidden Markov Models; MAGs: Metagenomics-Assembled-Genomes; MGEs: Mobile Genetic Elements; NCBI: National Center for Biotechnology Information; Phages: Bacteriophages; QSSs: Quorum Sensing Systems; RRNPP: Rap, Rgg, NprR, PlcR and PrgX families of QSS receptors; TPRs: TetratricoPeptide Repeats

Chenyang Dong et al.. "INFIMA leverages multi-omics model organism data to identify effector genes of human GWAS variants." bioRxiv , no. (2021): 2021.07.15.452422. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452422.
Genome-wide association studies have revealed many non-coding variants associated with complex traits. However, model organism studies have largely remained as an untapped resource for unveiling the effector genes of non-coding variants. We develop INFIMA, Integrative Fine-Mapping, to pinpoint causal SNPs for Diversity Outbred (DO) mice eQTL by integrating founder mice multi-omics data including ATAC-seq, RNA-seq, footprinting, and in silico mutation analysis. We demonstrate INFIMA’s superior performance compared to alternatives with human and mouse chromatin conformation capture datasets. We apply INFIMA to identify novel effector genes for GWAS variants associated with diabetes. The results of the application are available at http://www.statlab.wisc.edu/shiny/INFIMA/ Competing Interest Statement: The authors have declared no competing interest.

Xianglilan Zhang et al.. "Mining bacterial NGS data vastly expands the complete genomes of temperate phages." bioRxiv , no. (2021): 2021.07.15.452192. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452192.
Temperate phages (active prophages induced from bacteria) help control pathogenicity, modulate community structure, and maintain gut homeostasis1. Complete phage genome sequences are indispensable for understanding phage biology. Traditional plaque techniques are inapplicable to temperate phages due to the lysogenicity of these phages, which curb the identification and characterization of temperate phages. Existing in silico tools for prophage prediction usually fail to detect accurate and complete temperate phage genomes2–5. In this study, by a novel computational method mining both the integrated active prophages and their spontaneously induced forms (temperate phages), we obtained 192,326 complete temperate phage genomes from bacterial next-generation sequencing (NGS) data, hence expanded the existing number of complete temperate phage genomes by more than 100-fold. The reliability of our method was validated by wet-lab experiments. The experiments demonstrated that our method can accurately determine the complete genome sequences of the temperate phages, with exact flanking sites (attP and attB sites), outperforming other state-of-the-art prophage prediction methods. Our analysis indicates that temperate phages are likely to function in the evolution of microbes by 1) cross-infecting different bacterial host species; 2) transferring antibiotic resistance and virulence genes; and 3) interacting with hosts through restriction-modification and CRISPR/anti-CRISPR systems. This work provides a comprehensive complete temperate phage genome database and relevant information, which can serve as a valuable resource for phage research.Competing Interest StatementThe authors have declared no competing interest.

Riccardo Fusaroli et al.. "Towards a cumulative science of prosody in autism: a cross-linguistic meta-analysis-based investigation of acoustic markers in American and Danish autistic children." bioRxiv , no. (2021): 2021.07.13.452165. Accessed July 19, 2021. doi: 10.1101/2021.07.13.452165.
Acoustic atypicalities in speech production are widely documented in Autism Spectrum Disorder (ASD) and argued to be both a potential factor in atypical social development and potential markers of clinical features. A recent meta-analysis highlighted shortcomings in the field, in particular small sample sizes and study heterogeneity (Fusaroli, Lambrechts, Bang, Bowler, & Gaigg, 2017). We showcase a cumulative yet self-correcting approach to prosody in ASD to overcome these issues. We analyze a cross-linguistic corpus of multiple speech productions in 77 autistic children and adolescents and 72 TD ones (>1000 recordings in Danish and US English). We replicate findings of a minimal cross-linguistically reliable distinctive acoustic profile for ASD (higher pitch and longer pauses) with moderate effect sizes. We identified novel general reliable differences between the two groups for normalized amplitude quotient, maxima dispersion quotient and creakiness. However, all these relations are small, and there is likely no one general extensive acoustic profile characterizing all autistic individuals. We identified reliable and consistent relations of acoustic features with individual differences (age, gender), and clinical feature: speech rate and ADOS sub-scores (Communication, Social, Stereotyped). Besides cumulatively building our understanding of acoustic atypicalities in ASD, the study concretely shows how to use systematic reviews and meta-analyses to guide follow-up studies, both in their design and their statistical inferences. We indicate future directions: larger and more diverse cross-linguistic datasets, use of previous findings as statistical priors, understanding of covariance between acoustic measures, reliance on machine learning procedures, and open science. Lay Summary: Individuals with Autism Spectrum Disorder (ASD) are reported to speak in distinctive ways. 
Distinctive vocal production can affect social interactions and social development and could represent a noninvasive way to support the assessment of ASD. We systematically check whether acoustic atypicalities found in previous articles can be found in a novel and larger cross-linguistic dataset. Besides a minimal acoustic profile of ASD: higher pitch, longer pauses, increased hoarseness and creakiness, we observe large individual and linguistic variations.KeywordsCompeting Interest StatementRiccardo Fusaroli has provided paid consultancies to F. Hoffmann-La Roche.

Anika Küken et al.. "A structural property for reduction of biochemical networks." bioRxiv , no. (2021): 2021.03.17.435785. Accessed July 19, 2021. doi: 10.1101/2021.03.17.435785.
Large-scale biochemical models are of increasing sizes due to the consideration of interacting organisms and tissues. Model reduction approaches that preserve the flux phenotypes can simplify the analysis and predictions of steady-state metabolic phenotypes. However, existing approaches either restrict functionality of reduced models or do not lead to significant decreases in the number of modelled metabolites. Here, we introduce an approach for model reduction based on the structural property of balancing of complexes that preserves the steady-state fluxes supported by the network and can be efficiently determined at genome scale. Using two large-scale mass-action kinetic models of Escherichia coli, we show that our approach results in a substantial reduction of 99% of metabolites. Applications to genome-scale metabolic models across kingdoms of life result in up to 55% and 85% reduction in the number of metabolites when arbitrary and mass-action kinetics is assumed, respectively. We also show that predictions of the specific growth rate from the reduced models match those based on the original models. Since steady-state flux phenotypes from the original model are preserved in the reduced, the approach paves the way for analysing other metabolic phenotypes in large-scale biochemical networks.KeywordsCompeting Interest StatementThe authors have declared no competing interest.

Yu Huang et al.. "Deep Learning Achieves Neuroradiologist-Level Performance in Detecting Hydrocephalus." bioRxiv , no. (2021): 2021.01.19.427328. Accessed July 19, 2021. doi: 10.1101/2021.01.19.427328.
Background and Purpose: To develop automated detection of hydrocephalus requiring treatment in a heterogeneous patient population referred for MRI brain scans, and compare performance to that of neuroradiologists. Materials and Methods: We leveraged 496 clinical MRI brain scans (259 hydrocephalus) collected retrospectively at a single clinical site from patients aged 2–90 years (mean 54) referred for any reason. Sixteen MRI scans (ten hydrocephalus) were segmented semi-automatically in 3D to delineate ventricles, extraventricular CSF, and brain tissues. A 3D CNN was trained on these segmentations and subsequently used to automatically segment the remaining 480 scans. To detect hydrocephalus, volumetric features such as volumes of ventricles and temporal horns were computed from the segmentation and were used to train a linear classifier. Machine performance was evaluated in a diagnosis dataset where hydrocephalus was confirmed as requiring surgical intervention, and compared to four neuroradiologists on a random subset of 240 scans. The pipeline was tested on a separate screening dataset of 451 scans collected from a routine clinical population aged 1–95 years (mean 55) to predict the majority reading from four neuroradiologists using images alone. Results: When compared to the neuroradiologists at a matched sensitivity, the machine did not show a significant difference in specificity (proportions test, p > 0.05). The machine demonstrated comparable performance in independent diagnosis and screening datasets. 
Overall ROC performance compared favorably with the state-of-the-art (AUC 0.89–0.92). Conclusion: Hydrocephalus can be detected automatically from MRI in a heterogeneous patient population with performance equivalent to that of neuroradiologists. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: MRI: magnetic resonance imaging; 2D/3D: two-dimensional/three-dimensional; CNN: convolutional neural network; TPM: tissue probability map; CSF: cerebrospinal fluid; NPH: normal pressure hydrocephalus; ROC: receiver operating characteristic; AUC: area under the curve; SPM: statistical parametric mapping; FSL: FMRIB software library

Zhang, Xiaoxiao and Maik Kschischo. "MFmap: A semi-supervised generative model matching cell lines to tumours and cancer subtypes." bioRxiv , no. (2021): 2021.07.15.452446. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452446.
Translating in vitro results from experiments with cancer cell lines to clinical applications requires the selection of appropriate cell line models. Here we present MFmap (model fidelity map), a machine learning model to simultaneously predict the cancer subtype of a cell line and its similarity to an individual tumour sample. The MFmap is a semi-supervised generative model, which compresses high dimensional gene expression, copy number variation and mutation data into cancer subtype informed low dimensional latent representations. The accuracy (test set F1 score > 90%) of the MFmap subtype prediction is validated in ten different cancer datasets. We use breast cancer and glioblastoma cohorts as examples to show how subtype specific drug sensitivity can be translated to individual tumour samples. The low dimensional latent representations extracted by MFmap explain known and novel subtype specific features and enable the analysis of cell-state transformations between different subtypes. From a methodological perspective, we report that MFmap is a semi-supervised method which simultaneously achieves good generative and predictive performance and thus opens opportunities in other areas of computational biology. Author summary: Cancer researchers perform experiments with cell lines to better understand the biology of cancer and to develop new anti-cancer treatments. A prerequisite to translate promising results from these in vitro experiments to clinical applications is to use the most appropriate cell line for a given tumour or cancer subtype. We present MFmap (model fidelity map), a deep learning technique to integrate cancer genomic data from patients with cell line data. The MFmap neural network compresses complex genomic features from thousands of genes into a small set of features called latent representations. 
This makes cell line and tumour data comparable and allows cancer researchers to select the best cell line which closely resembles a specific type of tumours or even an individual tumour. By classifying cancer cell lines into subtypes, MFmap offers a new possibility to predict the effect of therapeutic compounds in a particular tumour subtype. For the example of an aggressive brain tumour we demonstrate that MFmap can be used to study cell-state transformations during the disease course. In addition, MFmap is a promising machine learning method with potential applications in many other areas of biology and medicine.Competing Interest StatementThe authors have declared no competing interest.

Vishal Sarsani et al.. "Model-based identification of conditionally-essential genes from transposon-insertion sequencing data." bioRxiv , no. (2021): 2021.07.15.452443. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452443.
The understanding of bacterial gene function has been greatly enhanced by recent advancements in the deep sequencing of microbial genomes. Transposon insertion sequencing methods combines next-generation sequencing techniques with transposon mutagenesis for the exploration of the essentiality of genes under different environmental conditions. We propose a model-based method that uses regularized negative binomial regression to estimate the change in transposon insertions attributable to gene-environment changes without transformations or uniform normalization. An empirical Bayes model for estimating the local false discovery rate combines unique and total count information to test for genes that show a statistically significant change in transposon counts. When applied to RB-TnSeq (randomized barcode transposon sequencing) and Tn-seq (transposon sequencing) libraries made in strains of Caulobacter crescentus using both total and unique count data the model was able to identify a set of conditionally essential genes for each target condition that shed light on their functions and roles during various stress conditions.Author summary Transposon insertion sequencing allows the study of bacterial gene function by combining next-generation sequencing techniques with transposon mutagenesis under different genetic and environmental perturbations. Our proposed regularized negative binomial regression method improves the quality of analysis of this data.Competing Interest StatementThe authors have declared no competing interest.

Ning Wang et al.. "Variant calling tool evaluation for variable size indel calling from next generation whole genome and targeted sequencing data." bioRxiv , no. (2021): 2021.07.15.452444. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452444.
Insertions and deletions (indels) in human genomes are associated with a wide range of phenotypes, including various clinical disorders. High-throughput, next generation sequencing (NGS) technologies enable detection of short genetic variants, such as single nucleotide variants (SNVs) and indels. However, the variant calling accuracy for indels remains considerably lower than for SNVs. Here we present a comparative study of the performance of variant calling tools on indel calling, evaluated with a wide repertoire of NGS datasets. While there is no single optimal tool to suit all circumstances, our results demonstrate that the choice of variant calling tool greatly impacts the precision and recall of indel calling. Furthermore, to reliably detect indels, it is essential to choose NGS technologies that offer a long read length and high coverage, coupled with specific variant calling tools.Author summary The development of next generation sequencing (NGS) technologies and computational algorithms enabled large scale, simultaneous detection of wide range of genetic variants, such as single nucleotide variants as well as insertions and deletions (indels), which may confer potential clinical significance. Recently, many studies have been conducted to evaluate variant calling tools on indel calling. However, the optimal indel size range for different variant calling tools remain unclear. A good benchmarking dataset for indel calling evaluation should contain biologically representative high-confident indels with a wide size range and preferably come from various sequencing settings. In this article, we created a semi-simulated whole genome sequencing dataset where the sequencing data was computationally generated. The indels in the semi-simulated genome were incorporated from a real human sample to represent biologically realistic indels and to avoid inclusion of variants due to potential technical sequencing errors. 
Furthermore, we used three real-world NGS datasets generated by whole genome or targeted sequencing to further evaluate our candidate tools. Our results demonstrated that variant calling tools varies greatly in calling different sizes of indels. Deletion calling and insertion calling also showed differences among the tools. The sequencing settings in coverage and read length also had a great impact on indel calling. Our results suggest that the accurate indel calling was dependent on the combination of a variant calling tool, indel size range and sequencing settings.Competing Interest StatementThe authors have declared no competing interest.

Arash Bayat et al.. "Fast and Accurate Exhaustive Higher-Order Epistasis Search with BitEpi." bioRxiv , no. (2021): 858282. Accessed July 19, 2021. doi: 10.1101/858282.
Motivation Complex genetic diseases may be modulated by a large number of epistatic interactions affecting a polygenic phenotype. Identifying these interactions is difficult due to computational complexity, especially in the case of higher-order interactions where more than two genomic variants are involved.Results In this paper, we present BitEpi, a fast and accurate method to test all possible combinations of up to four bi-allelic variants (i.e. Single Nucleotide Variant or SNV for short). BitEpi introduces a novel bitwise algorithm that is 2.1 and 56 times faster for 3-SNV and 4-SNV search, than established software. The novel entropy statistic used in BitEpi is 44% more accurate to identify interactive SNVs, incorporating a p-value-based significance testing. We demonstrate BitEpi on real world data of 4,900 samples and 87,000 SNPs. We also present EpiExplorer to visualize the potentially large number of individual and interacting SNVs in an interactive Cytoscape graph. EpiExplorer uses various visual elements to facilitate the discovery of true biological events in a complex polygenic environment.Competing Interest StatementThe authors have declared no competing interest.

Charles A. Ellis et al.. "A Gradient-based Spectral Explainability Method for EEG Deep Learning Classifiers." bioRxiv , no. (2021): 2021.07.14.452360. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452360.
The automated feature extraction capabilities of deep learning classifiers have promoted their broader application to EEG analysis. In contrast to earlier machine learning studies that used extracted features and traditional explainability approaches, explainability for classifiers trained on raw data is particularly challenging. As such, studies have begun to present methods that provide insight into the spectral features learned by deep learning classifiers trained on raw EEG. These approaches have two key shortcomings. (1) They involve perturbation, which can create out-of-distribution samples that cause inaccurate explanations. (2) They are global, not local. Local explainability approaches can be used to examine how demographic and clinical variables affected the patterns learned by the classifier. In our study, we present a novel local spectral explainability approach. We apply it to a convolutional neural network trained for automated sleep stage classification. We apply layer-wise relevance propagation to identify the relative importance of the features in the raw EEG and subsequently examine the frequency domain of the explanations to determine the importance of each canonical frequency band locally and globally. We then perform a statistical analysis to determine whether age and sex affected the patterns learned by the classifier for each frequency band and sleep stage. Results showed that δ, β, and γ were the overall most important frequency bands. In addition, age and sex significantly affected the patterns learned by the classifier for most sleep stages and frequency bands. Our study presents a novel spectral explainability approach that could substantially increase the level of insight into classifiers trained on raw EEG.KeywordsCompeting Interest StatementThe authors have declared no competing interest.

Hinnerichs, Tilman and Robert Hoehndorf. "DTI-Voodoo: machine learning over interaction networks and ontology-based background knowledge predicts drug–target interactions." bioRxiv , no. (2021): 2021.04.28.441733. Accessed July 19, 2021. doi: 10.1101/2021.04.28.441733.
Motivation: In silico drug–target interaction (DTI) prediction is important for drug discovery and drug repurposing. Approaches to predict DTIs can proceed indirectly, top-down, using phenotypic effects of drugs to identify potential drug targets, or they can be direct, bottom-up and use molecular information to directly predict binding potentials. Both approaches can be combined with information about interaction networks. Results: We developed DTI-Voodoo as a computational method that combines molecular features and ontology-encoded phenotypic effects of drugs with protein–protein interaction networks, and uses a graph convolutional neural network to predict DTIs. We demonstrate that drug effect features can exploit information in the interaction network whereas molecular features do not. DTI-Voodoo is designed to predict candidate drugs for a given protein; we use this formulation to show that common DTI datasets contain intrinsic biases with major affects on performance evaluation and comparison of DTI prediction methods. Using a modified evaluation scheme, we demonstrate that DTI-Voodoo improves significantly over state of the art DTI prediction methods. Availability: DTI-Voodoo source code and data necessary to reproduce results are freely available at https://github.com/THinnerichs/DTI-VOODOO. Contact: tilman.hinnerichs{at}kaust.edu.sa Supplementary information: Supplementary data are available at https://github.com/THinnerichs/DTI-VOODOO. Competing Interest Statement: The authors have declared no competing interest.

Wenke Liu et al.. "Modeling transcriptional profiles of gene perturbation with deep neural network." bioRxiv , no. (2021): 2021.07.15.452534. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452534.
Background: Cell line perturbation data could be utilized as a reference for inferring underlying molecular processes in new gene expression profiles. It is important to develop accurate and computationally efficient algorithms to exploit biological knowledge in the growing compendium of existing perturbation data and harness these for new predictions. Results: We reframed the problem of inferring possible gene perturbation based on a reference perturbation database into a classification task and evaluated the application of deep neural network models to address this problem. Our results showed that a fully-connected multi-layer neural network was able to achieve up to 74.9% accuracy in a holdout test set, but the model generalizability was limited by consistency between training and testing data. Conclusion: Capacity and flexibility enable neural network models to efficiently represent transcriptomic features associated with single gene knockdown perturbations. With consistent signals between training and testing sets, neural networks may be trained to classify new samples to experimentally confirmed molecular phenotypes. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: CMap, Connectivity Map; DNN, Deep Neural Network; shRNA, short hairpin RNA; CRISPR, Clustered Regularly Interspaced Short Palindromic Repeats; CGS, Consensus Gene Signatures; ES, Enrichment Score; WTCS, Weighted Connectivity Score; ELU, Exponential Linear Unit.

Zand, Maryam and Jianhua Ruan. "A completely parameter-free method for graph-based single cell RNA-seq clustering." bioRxiv , no. (2021): 2021.07.15.452521. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452521.
Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut on a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population. Availability: ScQcut is written in both Matlab and Python and may be accessed through the links below. Matlab version: cs.utsa.edu/ jruan/scQcut Python version: https://github.com/mary77/scQcut Contact: Jianhua.ruan{at}utsa.edu Competing Interest Statement: The authors have declared no competing interest.

Ni, Pengyu and Zhengchang Su. "Accurate prediction of functional states of cis-regulatory modules reveals the universal epigenetic code in mammals." bioRxiv , no. (2021): 2021.07.15.452574. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452574.
Predicting cis-regulatory modules (CRMs) in a genome and predicting their functional states in various cell/tissue types of the organism are two related challenging computational tasks. Most current methods attempt to achieve both simultaneously using epigenetic data. Though conceptually attractive, they suffer from high false discovery rates and limited applications. To fill the gaps, we proposed a two-step strategy to first predict a map of CRMs in the genome, and then predict functional states of the CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for accurately predicting CRMs in a genome by integrating numerous transcription factor ChIP-seq datasets. Here, we showed that only three or four epigenetic marks data in a cell/tissue type were sufficient for a machine-learning model to accurately predict functional states of all CRMs. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on different cell/tissue types in a mammal can accurately predict functional states of CRMs in different cell/tissue types of the mammal as well as in various cell/tissue types of a different mammal. Therefore, the epigenetic code that defines functional states of CRMs in various cell/tissue types is universal at least in mammals. Moreover, we found that from tens to hundreds of thousands of CRMs were active in a human and mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as few as 0.02% of them were unique to a cell/tissue type that might define the cell/tissue type. Competing Interest Statement: The authors have declared no competing interest.

C.R. Harris et al.. "Quantifying and correcting slide-to-slide variation in multiplexed immunofluorescence images." bioRxiv , no. (2021): 2021.07.16.452359. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452359.
Motivation: The multiplexed imaging domain is a nascent single-cell analysis field with a complex data structure susceptible to technical variability that disrupts inference. These in situ methods are valuable in understanding cell-cell interactions, but few standardized processing steps or normalization techniques of multiplexed imaging data are available. Results: We implement and compare data transformations and normalization algorithms in multiplexed imaging data. Our methods adapt the ComBat and functional data registration methods to remove slide effects in this domain, and we present an evaluation framework to compare the proposed approaches. We present clear slide-to-slide variation in the raw, unadjusted data, and show that many of the proposed normalization methods reduce this variation while preserving and improving the biological signal. Further, we find that dividing this data by its slide mean, and the functional data registration methods, perform the best under our proposed evaluation framework. In summary, this approach provides a foundation for better data quality and evaluation criteria in the multiplexed domain. Availability and Implementation: Source code is provided at https://github.com/statimagcoll/MultiplexedNormalization. Contact: coleman.r.harris{at}vanderbilt.edu Supplementary information: Supplementary information is available online. Competing Interest Statement: The authors have declared no competing interest.

Chevez-Guardado, Ruben and Lourdes Peña-Castillo. "Promotech: A general tool for bacterial promoter recognition." bioRxiv , no. (2021): 2021.07.16.452684. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452684.
Promoters are genomic regions where the transcription machinery binds to initiate the transcription of specific genes. Computational tools for identifying bacterial promoters have been around for decades. However, most of these tools were designed to recognize promoters in one or a few bacterial species. Here, we present Promotech, a machine-learning-based method for promoter recognition in a wide range of bacterial species. We compared Promotech’s performance with the performance of five other promoter prediction methods. Promotech outperformed these other programs in terms of area under the precision-recall curve (AUPRC) or precision at the same level of recall. Promotech is available at https://github.com/BioinformaticsLabAtMUN/PromoTech. Competing Interest Statement: The authors have declared no competing interest.

Vivian Viallon et al.. "A new pipeline for the normalization and pooling of metabolomics data." bioRxiv , no. (2021): 2021.07.16.452593. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452593.
Pooling metabolomics data across studies is often desirable to increase the statistical power of the analysis. However, this can raise methodological challenges as several preanalytical and analytical factors could introduce differences in measured concentrations and variability between datasets. Specifically, different studies may use variable sample types (e.g., serum versus plasma) collected, treated and stored according to different protocols, and assayed in different laboratories using different instruments. To address these issues, a new pipeline was developed to normalize and pool metabolomics data through a set of sequential steps: (i) exclusions of the least informative observations and metabolites and removal of outliers; imputation of missing data; (ii) identification of the main sources of variability through PC-PR2 analysis; (iii) application of linear mixed models to remove unwanted variability, including samples’ originating study and batch, and preserve biological variations while accounting for potential differences in the residual variances across studies. This pipeline was applied to targeted metabolomics data acquired using Biocrates AbsoluteIDQ kits in eight case-control studies nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Comprehensive examination of metabolomics measurements indicated that the pipeline improved the comparability of data across the studies. Our pipeline can be adapted to normalize other molecular data, including biomarkers as well as proteomics data, and could be used for pooling molecular datasets, for example in international consortia, to limit biases introduced by inter-study variability. This versatility of the pipeline makes our work of potential interest to molecular epidemiologists. Competing Interest Statement: The authors have declared no competing interest.

Nemzer, Louis R.. "Visualizing Amino Acid Substitutions in a Physicochemical Vector Space." bioRxiv , no. (2021): 2021.07.15.452549. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452549.
A three-dimensional representation of the twenty proteinogenic amino acids in a physicochemical space is presented. Vectors corresponding to amino acid substitutions are classified based on whether they are accessible via a single-nucleotide mutation. It is shown that the standard genetic code establishes a “choice architecture” that permits nearly independent tuning of the properties related to size and those related to hydrophobicity. This work sheds light on the metarules of evolvability that may have shaped the standard genetic code to increase the probability that adaptive point mutations will be generated. An illustration of the usefulness of visualizing amino acid substitutions in a 3D physicochemical space is shown using data collected from the SARS-CoV-2 receptor binding domain. The substitutions most responsible for antibody escape are almost always inaccessible via single nucleotide mutation, and also change multiple properties concurrently. The results of this research can extend our understanding of certain hereditary disorders caused by point mutations, as well as guide the development of rational protein and vaccine design. Competing Interest Statement: The authors have declared no competing interest.

Pockrandt, Christopher, Martin Steinegger and Steven L. Salzberg. "PhyloCSF++: A fast and user-friendly implementation of PhyloCSF with annotation tools." bioRxiv , no. (2021): 2021.03.10.434297. Accessed July 19, 2021. doi: 10.1101/2021.03.10.434297.
Summary: PhyloCSF++ is an efficient and parallelized C++ implementation of the popular PhyloCSF method to distinguish protein-coding and non-coding regions in a genome based on multiple sequence alignments. It can score alignments or produce browser tracks for entire genomes in the wig file format. Additionally, PhyloCSF++ annotates coding sequences in GFF/GTF files using precomputed tracks or computes and scores multiple sequence alignments on the fly with MMseqs2. Availability: PhyloCSF++ is released under the AGPLv3 license. Binaries and source code are available at https://github.com/cpockrandt/PhyloCSFpp. The software can be installed through bioconda. A variety of tracks can be accessed through ftp://ftp.ccb.jhu.edu/pub/software/phylocsfpp/. Competing Interest Statement: The authors have declared no competing interest.

Micha Hersch et al.. "Estimating RNA dynamics using one time point for one sample in a single-pulse metabolic labeling experiment." bioRxiv , no. (2021): 2020.05.01.071779. Accessed July 19, 2021. doi: 10.1101/2020.05.01.071779.
Over the past decade, experimental procedures such as metabolic labeling for determining RNA turnover rates at the transcriptome-wide scale have been widely adopted and are now turning to single cell measurements. Several computational methods to estimate RNA processing and degradation rates from such experiments have been suggested, but they all require several RNA sequencing samples. Here we present a method that can estimate RNA synthesis, processing and degradation rates from a single sample. Our method is computationally efficient and outputs rates that correlate well with previously published data sets. Using it on a single sample, we were able to reproduce the observation that dynamic biological processes tend to involve genes with higher metabolic rates, while stable processes involve genes with lower rates. This supports the hypothesis that cells control not only the mRNA steady-state abundance, but also its responsiveness, i.e., how fast steady-state is reached. In addition to saving experimental work and computational time, having a sample-based rate estimation has several advantages. It does not require an error-prone normalization across samples and enables the use of replicates to estimate uncertainty and perform quality control. Finally, the method and theoretical results described here are general enough to be useful in other contexts such as nucleotide conversion methods and single cell metabolic labeling experiments. Competing Interest Statement: The authors have declared no competing interest.

Tao Yang et al.. "AdRoit: an accurate and robust method to infer complex transcriptome composition." bioRxiv , no. (2021): 2020.12.14.422697. Accessed July 19, 2021. doi: 10.1101/2020.12.14.422697.
Bulk RNA sequencing technology provides the opportunity to understand biology at the whole transcriptome level without the prohibitive cost of single cell profiling. Advances in spatial transcriptomics make it possible to dissect tissue organization and function through genome-wide gene expression. However, the readout of both technologies is the overall gene expression across potentially many cell types without directly providing the information of cell type constitution. Although several in-silico approaches have been proposed to deconvolute RNA-Seq data composed of multiple cell types, many suffer a deterioration of performance in complex tissues. Here we present AdRoit, an accurate and robust method to infer the cell composition from transcriptome data comprised of multiple cell types. AdRoit uses gene expression profiles obtained from single cell RNA sequencing as a reference. It employs an adaptive learning approach to correct the sequencing technique difference between the single cell data and the bulk or spatial transcriptome data, enabling cross-platform readout comparability. Our systematic benchmarking and applications, which include deconvoluting complex mixtures that encompass 30 cell types, demonstrate its superior sensitivity and specificity compared to other existing methods as well as its utilities. In addition, AdRoit is computationally efficient and runs orders of magnitude faster than many existing methods. Competing Interest Statement: T.Y., Y.B., W.F. and G.S.A. have filed a patent application relating to the AdRoit computational framework. M.L.-F. is an employee of Cellular Longevity. All other authors are employees and/or shareholders of Regeneron Pharmaceuticals, although the manuscript's subject matter does not have any relationship to any products or services of this corporation.

Yuzhou Chang et al.. "Define and visualize pathological architectures of human tissues from spatially resolved transcriptomics using deep learning." bioRxiv , no. (2021): 2021.07.08.451210. Accessed July 19, 2021. doi: 10.1101/2021.07.08.451210.
Spatially resolved transcriptomics provides a new way to define spatial contexts and understand biological functions in complex diseases. Although some computational frameworks can characterize spatial context via various clustering methods, the detailed spatial architectures and functional zonation often cannot be revealed and localized due to the limited capacities of associating spatial information. We present RESEPT, a deep-learning framework for characterizing and visualizing tissue architecture from spatially resolved transcriptomics. Given inputs as gene expression or RNA velocity, RESEPT learns a three-dimensional embedding with a spatial retained graph neural network from the spatial transcriptomics. The embedding is then visualized by mapping as color channels in an RGB image and segmented with a supervised convolutional neural network model. Based on a benchmark of sixteen 10x Genomics Visium spatial transcriptomics datasets on the human cortex, RESEPT infers and visualizes the tissue architecture accurately. It is noteworthy that, for the in-house AD samples, RESEPT can localize cortex layers and cell types based on pre-defined region- or cell-type-specific genes and furthermore provide critical insights into the identification of amyloid-beta plaques in Alzheimer’s disease. Interestingly, in a glioblastoma sample analysis, RESEPT distinguishes tumor-enriched, non-tumor, and regions of neuropil with infiltrating tumor cells in support of clinical and prognostic cancer applications. Competing Interest Statement: The authors have declared no competing interest.

Reijnders, Maarten JMF and Robert M Waterhouse. "CrowdGO: machine learning and semantic similarity guided consensus Gene Ontology annotation." bioRxiv , no. (2021): 731596. Accessed July 19, 2021. doi: 10.1101/731596.
Background: Characterising gene function for the ever-increasing number and diversity of species with annotated genomes relies almost entirely on computational prediction methods. These software tools are also numerous and diverse, each with different strengths and weaknesses as revealed through community benchmarking efforts. Meta-predictors that assess consensus and conflict from individual algorithms should deliver enhanced functional annotations. Results: To exploit the benefits of meta-approaches, we developed CrowdGO, an open-source consensus-based Gene Ontology (GO) term meta-predictor that employs machine learning models with GO term semantic similarities and information contents. By re-evaluating each gene-term annotation, a consensus dataset is produced with high-scoring confident annotations and low-scoring rejected annotations. Applying CrowdGO to results from a deep learning-based, a sequence similarity-based, and two protein domain-based methods delivers consensus annotations with improved precision and recall. Furthermore, using standard evaluation measures CrowdGO performance matches that of the community's best performing individual methods. Conclusion: CrowdGO offers a model-informed approach to leverage strengths of individual predictors and produce comprehensive and accurate gene functional annotations. Competing Interest Statement: The authors have declared no competing interest.

FERRARI, IVAN VITO and Paolo PATRIZIO. "Study of Basic Local Alignment Search Tool (BLAST) and Multiple Sequence Alignment (Clustal- X) of Monoclonal mice/human antibodies." bioRxiv , no. (2021): 2021.07.09.451785. Accessed July 19, 2021. doi: 10.1101/2021.07.09.451785.
In this work, we focused on the Basic Local Alignment Search Tool (BLAST) and Multiple Sequence Alignment (Clustal-X) of different monoclonal mouse antibodies to better understand the multiple alignments of their sequences. Our strategy was to compare the light chains of multiple monoclonal antibodies with each other, calculating their percentage identity and the amino acid region in which it occurs (see figure 2 below). Subsequently, the same survey was carried out for the heavy chains with the same methodology (see figure 3 below). Finally, sequence alignment between the light chain of one antibody and the heavy chain of another antibody was studied to understand what happens if chains are exchanged between antibodies (see figure 4 below). From our BLAST alignment results, the light chains (Ls) of the compared monoclonal antibodies have a sequence homology of about 60-80% and share an identical region in the range of amino acid residues 100-210, except PDB ID 4ISV, which turns out to have a 40% lower homology than the other antibodies. The heavy chains (Hs) of the monoclonal antibodies, however, tend to have lower sequence homology than the light chains, equal to 60-70%, and they share an identical region between amino acid residues 150-210, with the exception of the PDB ID 3I9G-3W9D antibodies, which have a homology of 50% (see supporting part). Summing up: about 70-80% identity between the 2 light chains of 2 antibodies, 60-70% identity between the 2 heavy chains of 2 antibodies, 30% identity between the two chains of one antibody, and 30% if the light chain of one antibody is compared with the heavy chain of another antibody. Competing Interest Statement: The authors have declared no competing interest.

Azza E Ahmed et al.. "Design considerations for workflow management systems use in production genomics research and the clinic." bioRxiv , no. (2021): 2021.04.03.437906. Accessed July 19, 2021. doi: 10.1101/2021.04.03.437906.
Background: The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. Results: This work provides an approach and systematic evaluation of key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL and some of their executors, along with Swift/T, a workflow manager commonly used in high-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, where both were run locally, on an HPC cluster, and in the cloud. This allowed for evaluation of those four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, ease of development, along with adoption and usage in research labs and healthcare settings. This article is trying to answer, "which WfMS should be chosen for a given bioinformatics application regardless of analysis type?". Conclusions: The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry and wet lab scientists, the choice is also governed by collaborations and adoption within large consortia and technical support provided by the WfMS team/community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way as the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations of tools and utilities for other purposes, like big data technologies, interoperability, and provenance. Competing Interest Statement: The authors have declared no competing interest.

Andrea Ferrario et al.. "Whole animal modelling reveals neuronal mechanisms of decision-making and reproduces unpredictable swimming in frog tadpoles." bioRxiv , no. (2021): 2021.07.13.452162. Accessed July 19, 2021. doi: 10.1101/2021.07.13.452162.
Animal behaviour is based on interaction between nervous, musculoskeletal and environmental systems. How does an animal process sensory stimuli, use them to decide whether and how to respond, and initiate the locomotor behaviour? We build whole-body computer models of a simple vertebrate with a complete chain of neural circuits and body units for sensory information processing, decision-making, generation of spiking activities, muscle innervation, body flexion, body-water interaction, and movement. Our Central Nervous System (CNS) model generates biologically-realistic spiking and reveals that sensory memory populations on two hindbrain sides compete for swimming initiation and first body flexion. A biomechanical 3-dimensional “Virtual Tadpole” (VT) model is constructed to evaluate whether motor outputs of the CNS model can produce swimming-like movements in a volume of “water”. We find that whole animal modelling generates reliable and realistic swimming. The combination of CNS and VT models opens a new perspective for experiments with immobilised tadpoles. Competing Interest Statement: The authors have declared no competing interest.

Pablo Malmierca-Merlo et al.. "MetaFun: Unveiling sex differences in multiple omics studies through comprehensive functional meta-analysis." bioRxiv , no. (2021): 2021.07.13.451905. Accessed July 19, 2021. doi: 10.1101/2021.07.13.451905.
Summary: Sex and gender differences in different health scenarios have been thoroughly acknowledged in the literature and yet very scarcely analyzed. To fill the gap, here we present MetaFun, which allows users to meta-analyze multiple omics datasets with a sex-based perspective and to combine different datasets to gain greater statistical power and to assist the researcher in understanding these sex differences in the diseases under study. MetaFun is freely available at bioinfo.cipf.es/metafun. Availability and implementation: MetaFun is available under http://bioinfo.cipf.es/metafun. The backend has been implemented in R and Java and the frontend has been developed using Angular. Supplementary information: R code available at https://gitlab.com/ubb-cipf/metafunr. Competing Interest Statement: The authors have declared no competing interest.

Shuai Lu et al.. "A Structure-based B-cell Epitope Prediction Model Through Combing Local and Global Features." bioRxiv , no. (2021): 2021.07.13.452188. Accessed July 19, 2021. doi: 10.1101/2021.07.13.452188.
B-cell epitopes (BCEs) are a set of specific sites on the surface of an antigen that bind to an antibody produced by a B cell. The recognition of epitopes is a major challenge for drug design and vaccine development. Compared with experimental methods, computational approaches have strong potential for epitope prediction at much lower cost. However, most current methods focus on using local information around the target amino acid residue for BCE prediction without taking the global information of the whole antigen sequence into consideration. We propose a novel deep learning method that combines local features and global features for BCE prediction. In our model, two parallel modules are built to extract local and global features from the antigen separately. For local features, we use graph convolutional networks to capture information of spatial neighbors of a target amino acid residue. For global features, Attention based Bidirectional Long Short-Term Memory networks (Att-BLTM) are applied to extract information from the whole antigen sequence. Then the local and global features are combined to predict BCEs. The experiments show that the proposed method achieves superior performance over the state-of-the-art BCE prediction methods on benchmark datasets. Also, we compare the performance differences between data with or without global features. The experimental results show that global features play an important role in BCE prediction. Competing Interest Statement: The authors have declared no competing interest.

Katherine H. Shutta et al.. "SpiderLearner: An ensemble approach to Gaussian graphical model estimation." bioRxiv , no. (2021): 2021.07.13.452248. Accessed July 19, 2021. doi: 10.1101/2021.07.13.452248.
Multivariate biological data are often modeled using networks in which nodes represent a biological variable (e.g., genes) and edges represent associations (e.g., coexpression). A Gaussian graphical model (GGM), or partial correlation network, is an undirected graphical model in which a weighted edge between two nodes represents the magnitude of their partial correlation, and the absence of an edge indicates zero partial correlation. A GGM provides a roadmap of direct dependencies between variables, providing a valuable systems-level perspective. Many methods exist for estimating GGMs; estimated GGMs are typically highly sensitive to choice of method, posing an outstanding statistical challenge. We address this challenge by developing SpiderLearner, a tool that combines a range of candidate GGM estimation methods to construct an ensemble estimate as a weighted average of results from each candidate. In simulation studies, SpiderLearner performs better than or comparably to the best of the candidate methods. We apply SpiderLearner to estimate a GGM for gene expression in a publicly available dataset of 260 ovarian cancer patients. Using the community structure of the GGM, we develop a network-based risk score which we validate in six independent datasets. The risk score requires only seven genes, each of which has important biological function. Our method is flexible, extensible, and has demonstrated potential to identify de novo biomarkers for complex diseases. An open-source implementation of our method is available at https://github.com/katehoffshutta/SpiderLearner. Competing Interest Statement: The authors have declared no competing interest.

Maximiliano Beckel et al.. "Mining conserved and divergent signals in 5’ splicing site sequences." bioRxiv , no. (2021): 2021.07.14.452117. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452117.
Despite the fact that the main steps of the splicing process are similar across eukaryotes, differences in splicing factors, gene architecture and sequence divergences in splicing signals suggest clade-specific features of splicing and its regulation. In this work we study conserved and divergent signatures embedded in the sequence composition of eukaryotic 5’ splicing sites. We considered a regularized maximum entropy modeling framework to mine for non-trivial two-site correlations in donor sequences of 14 different eukaryote organisms. Our approach allowed us to accommodate and extend in a unified framework many of the regularities observed in previous works, like the relationship between the frequency of occurrence of natural sequences and the corresponding site’s strength, or the negative epistatic effects between exonic and intronic consensus sites. In addition, performing a systematic and comparative analysis of 5’ss we showed that lineage information could be traced not only from single-site frequencies but also from joint di-nucleotide probabilities of donor sequences. Notably, we could also identify specific two-site coupling patterns for plants and for animals and argue that these differences, in association with taxon-specific features involving U6 snRNP, could be the basis for differences in splicing regulation previously reported between these groups. Competing Interest Statement: The authors have declared no competing interest.

Karen E. Christianson et al.. "Cloud-based DIA data analysis module for signal refinement improves accuracy and throughput of large datasets." bioRxiv , no. (2021): 2021.07.14.452243. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452243.
Data-independent acquisition (DIA) is a powerful mass spectrometry method that promises higher coverage, reproducibility, and throughput than traditional quantitative proteomics approaches. However, the complexity of DIA data caused by fragmentation of co-isolating peptides presents significant challenges for confident assignment of identity and quantity, information that is essential for deriving meaningful biological insight from the data. To overcome this problem, we previously developed Avant-garde, a tool for automated signal refinement of DIA and other targeted mass spectrometry data. AvG is designed to work alongside existing tools for peptide detection to address the reliability and quantitative suitability of signals extracted for the identified peptides. While its use is straightforward and offers efficient refinement for small datasets, the execution of AvG for large DIA datasets is time-consuming, especially if run with limited computational resources. To overcome these limitations, we present here an improved, cloud-based implementation of the AvG algorithm deployed on Terra, a user-friendly cloud-based platform for large-scale data analysis and sharing, as an accessible and standardized resource to the wider community. Competing Interest Statement: J.D.J. is employed by Inzen Therapeutics and declares that he has no conflict of interest. The remaining authors declare no competing interests.

Charles E. Mordaunt et al.. "Comethyl: A network-based methylome approach to investigate the multivariate nature of health and disease." bioRxiv , no. (2021): 2021.07.14.452385. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452385.
Health outcomes are frequently shaped by difficult to dissect inter-relationships between biological, behavioral, social, and environmental factors. DNA methylation patterns reflect such multi-variate intersections, providing a rich source of novel biomarkers and insight into disease etiologies. Recent advances in whole-genome bisulfite sequencing (WGBS) enable investigation of DNA methylation over all genomic CpGs, but existing bioinformatic approaches lack accessible system-level tools. Here, we develop the R package Comethyl, for weighted gene correlation network analysis (WGCNA) of user-defined genomic regions that generates modules of comethylated regions, which are then tested for correlations with sample traits. First, regions are defined by CpG genomic location or regulatory annotation and filtered based on CpG count, sequencing depth, and variability. Next, correlation networks are used to find modules of interconnected nodes using methylation values within the selected regions. Each module containing multiple comethylated regions is reduced in complexity to a single eigennode value, which is then tested for correlations with experimental metadata. Comethyl has the ability to cover the noncoding regulatory regions of the genome with high relevance to interpretation of genome-wide association studies and integration with other types of epigenomic data. We demonstrate the utility of Comethyl on a dataset of male cord blood samples from newborns later diagnosed with autism spectrum disorder (ASD) versus typical development. Comethyl successfully identified an ASD-associated module containing gene regions with brain glial functions. Comethyl is expected to be useful in uncovering the multi-variate nature of health disparities for a variety of common disorders. Comethyl is available at github.com/cemordaunt/comethyl. Description of the Authors: Charles E. Mordaunt, Ph.D. developed Comethyl while a postdoctoral fellow in the department of Medical Microbiology and Immunology at UC Davis. He is currently a Computational Biologist at GSK. Julia S. Mouat is a doctoral student in the Integrative Genetics and Genomics graduate group at UC Davis with interests in health disparities and intergenerational epigenetic risk factors for autism spectrum disorders. Rebecca J. Schmidt, Ph.D. is an Associate Professor of Public Health Sciences at UC Davis, with expertise in the use of epigenetics in epidemiology and neurodevelopmental disorders. Janine M. LaSalle, Ph.D. is a Professor of Medical Microbiology and Immunology, Co-Director of the Perinatal Origins of Disparities Center, and Deputy Director of the Environmental Health Sciences Center at UC Davis, with expertise in epigenomics and neurodevelopmental disorders. Competing Interest Statement: The authors have declared no competing interest.

Juexiao Zhou et al.. "DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data." bioRxiv , no. (2021): 2021.07.14.452328. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452328.
The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation under different biological contexts. To fulfill this, on one hand, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner. On the other hand, various computational tools have also been developed for in silico prediction of TSSs solely based on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset and thus result in drastic false positive predictions when applied on the genome-scale. To address these issues, we present DeeReCT-TSS, a deep-learning-based method that is capable of TSSs identification across the whole genome based on both DNA sequences and conventional RNA-seq data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous transcription start site (TSS) annotation on 10 cell types, which enables the identification of cell-type-specific TSS. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets from the ENCODE project by correlating our predicted TSSs with experimentally defined TSS chromatin states. Our application, pre-trained models and data are available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release.

Gong, Boying, Yun Zhou and Elizabeth Purdom. "Cobolt: Joint analysis of multimodal single-cell sequencing data." bioRxiv , no. (2021): 2021.04.03.438329. Accessed July 19, 2021. doi: 10.1101/2021.04.03.438329.
A growing number of single-cell sequencing platforms enable joint profiling of multiple omics from the same cells. We present Cobolt, a novel method that not only allows for analyzing the data from joint-modality platforms, but provides a coherent framework for the integration of multiple datasets measured on different modalities. We demonstrate its performance on multi-modality data of gene expression and chromatin accessibility and illustrate the integration abilities of Cobolt by jointly analyzing this multi-modality data with single-cell RNA-seq and ATAC-seq datasets. Competing Interest Statement: The authors have declared no competing interest.

Hunyong Cho et al.. "Distribution-based comprehensive evaluation of methods for differential expression analysis in metatranscriptomics." bioRxiv , no. (2021): 2021.07.14.452374. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452374.
Background: Measuring and understanding the function of the human microbiome is key for several aspects of health; however, the development of statistical methods specifically for the analysis of microbial gene expression (i.e., metatranscriptomics) is in its infancy. Many currently employed differential expression analysis methods have been designed for different data types and have not been evaluated in metatranscriptomics settings. To address this knowledge gap, we undertook a comprehensive evaluation and benchmarking of eight differential analysis methods for metatranscriptomics data. Results: We used a combination of real and simulated metatranscriptomics data to evaluate the performance (i.e., model fit, Type-I error, and statistical power) of eight methods: log-normal (LN), logistic-beta (LB), MAST, Kruskal-Wallis, two-part Kruskal-Wallis, DESeq2, ANCOM-BC, and metagenomeSeq. The simulation was informed by supragingival biofilm microbiome data from 300 preschool-age children enrolled in a study of early childhood caries (ECC), whereas validations were sought in two additional datasets, including an ECC and an inflammatory bowel disease one. The LB test showed the highest power in both small and large sample sizes and reasonably controlled Type-I error. Contrarily, MAST was hampered by inflated Type-I error. Using LN and LB tests, we found that genes C8PHV7 and C8PEV7, harbored by the lactate-producing Campylobacter gracilis, had the strongest association with ECC. Conclusion: This comprehensive model evaluation findings offer practical guidance for the selection of appropriate methods for rigorous analyses of differential expression in metatranscriptomics data. Selection of an optimal method is likely to increase the possibility of detecting true signals while minimizing the chance of claiming false ones. Competing Interest Statement: The authors have declared no competing interest.

Nishant Thakur et al.. "An integrated analysis tool reveals intrinsic biases in gene set enrichment." bioRxiv , no. (2021): 2021.07.12.452009. Accessed July 19, 2021. doi: 10.1101/2021.07.12.452009.
Generating meaningful interpretations of gene lists remains a challenge for all large-scale studies. Many approaches exist, often based on evaluating gene enrichment among pre-determined gene classes. Here, we conceived and implemented yet another analysis tool (YAAT), specifically for data from the widely-used model organism C. elegans. YAAT extends standard enrichment analyses, using a combination of co-expression data and profiles of phylogenetic conservation, to identify groups of functionally-related genes. It additionally allows class clustering, providing inference of functional links between groups of genes. We give examples of the utility of YAAT for uncovering unsuspected links between genes and show how the approach can be used to prioritise genes for in-depth study. Our analyses revealed several limitations to the meaningful interpretation of gene lists, specifically related to data sources and the “universe” of gene lists used. We hope that YAAT will represent a model for integrated analysis that could be useful for large-scale exploration of biological function in other species. Competing Interest Statement: The authors have declared no competing interest.

Breanne Sparta et al.. "Binomial models uncover biological variation during feature selection of droplet-based single-cell RNA sequencing." bioRxiv , no. (2021): 2021.07.11.451989. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451989.
Single-cell RNA sequencing (scRNA-seq) aims to characterize how variation in gene expression is distributed across cells in tissues and organisms. Yet, effective comprehension of these extremely high-dimensional datasets remains a critical barrier to progress in biological research. In standard analyses of scRNA-seq data, feature selection steps aim to reduce the dimensionality of the data by focusing on a subset of genes that are the most biologically variable across a set of cells. Ideally, these features provide the genes that are the most informative for partitioning groups of transcriptionally distinct cells, each representing a different cell type or identity. In this work, we propose a simple feature selection model where a binomial sampling process for each mRNA species produces a null model of technical variation. To compare our model to existing methods, we use scRNA-seq data where cell identities have been established a priori for each cell, and characterize whether different feature sets retain biologically varying genes, distort neighborhood structures, and allow popular clustering algorithms to partition groups of cells into their established classes. We find that our model of biological variation, which we term “Differentially Distributed Genes” or DDGs, outperforms existing methods, and enables dimensionality reduction without loss of critical structure within the data set. Competing Interest Statement: The authors have declared no competing interest.

Sebastian, Softya and Swarup Roy. "Parallel Framework for Inferring Genome Scale Gene Regulatory Networks." bioRxiv , no. (2021): 2021.07.11.451988. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451988.
Genome-scale network inference is essential to understand comprehensive interaction patterns. Current methods are limited to the reconstruction of small to moderate-size networks. The most obvious alternative is to propose a novel method or alter existing methods that may leverage parallel computing paradigms. Very few attempts also have been made to re-engineer existing methods by executing selective iterative steps concurrently. We propose a generic framework in this paper that leverages parallel computing without re-engineering the original methods. The proposed framework uses state-of-the-art methods as a black box to infer sub-networks of the segmented data matrix. A simple merger was designed based on preferential attachment to generate the global network by merging the sub-networks. Fifteen (15) inference methods were considered for experimentation. Qualitative and speedup analysis was carried out using DREAM challenge networks. The proposed framework was implemented on all the 15 inference methods using large expression matrices. The results were auspicious as we could infer large networks in reasonable time without compromising the qualitative aspects of the original (serial) algorithm. CLR, the top performer, was then used to infer the network from the expression profiles of an Alzheimer’s disease (AD) affected mouse model consisting of 45,101 genes. We have also highlighted few hub genes from the network that are functionally related to various diseases. Competing Interest Statement: The authors have declared no competing interest.

de Lima Camillo, Lucas Paulo, Louis R Lapierre and Ritambhara Singh. "AltumAge: A Pan-Tissue DNA-Methylation Epigenetic Clock Based on Deep Learning." bioRxiv , no. (2021): 2021.06.01.446559. Accessed July 19, 2021. doi: 10.1101/2021.06.01.446559.
Several age predictors based on DNA methylation, dubbed epigenetic clocks, have been created in recent years. Their accuracy and potential for generalization vary widely based on the training data. Here, we gathered 143 publicly available data sets from several human tissues to develop AltumAge, a highly accurate and precise age predictor based on deep learning. Compared to Horvath’s 2013 model, AltumAge performs better across both normal and malignant tissues and is more generalizable to new data sets. Interestingly, it can predict gestational week from placental tissue with low error. Lastly, we used deep learning interpretation methods to learn which methylation sites contributed to the final model predictions. We observed that while most important CpG sites are linearly related to age, some highly-interacting CpG sites can influence the relevance of such relationships. We studied the associated genes of these CpG sites and found literary evidence of their involvement in age-related gene regulation. Using chromatin annotations, we observed that the CpG sites with the highest contribution to the model predictions were related to heterochromatin and gene regulatory regions in the genome. We also found age-related KEGG pathways for genes containing these CpG sites. In general, neural networks are better predictors due to their ability to capture complex feature interactions compared to the typically used regularized linear regression. Altogether, our neural network approach provides significant improvement and flexibility to current epigenetic clocks without sacrificing model interpretability. Competing Interest Statement: The authors have declared no competing interest.

Xiaojing Cong et al.. "Molecular insights into the μ-opioid receptor biased signaling." bioRxiv , no. (2021): 2021.03.22.436421. Accessed July 19, 2021. doi: 10.1101/2021.03.22.436421.
GPCR functional selectivity has opened new opportunities for the design of safer drugs. Ligands orchestrate GPCR signaling cascades by modulating the receptor conformational landscape. Our study provides insights into the dynamic mechanism enabling opioid ligands to preferentially activate the G protein over the β-arrestin pathways through the μ-opioid receptor (μOR). We combined functional assays in living cells, solution NMR spectroscopy and enhanced-sampling molecular dynamic simulations to identify the specific μOR conformations induced by G protein-biased agonists. In particular, we describe the dynamic and allosteric communications between the ligand-binding pocket and the receptor intracellular domains, through conserved motifs in class A GPCRs. Most strikingly, the biased agonists triggered μOR conformational changes in the intracellular loop 1 and helix 8 domains, which may impair β-arrestin binding or signaling. The findings may apply to other GPCR families and provide key molecular information that could facilitate the design of biased ligands. Competing Interest Statement: The authors have declared no competing interest.

Morisse, Pierre, Thierry Lecroq and Arnaud Lefebvre. "Long-read error correction: a survey and qualitative comparison." bioRxiv , no. (2021): 2020.03.06.977975. Accessed July 19, 2021. doi: 10.1101/2020.03.06.977975.
Third generation sequencing technologies Pacific Biosciences and Oxford Nanopore Technologies were respectively made available in 2011 and 2014. In contrast with second generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbp. These so called long reads are particularly promising, and are especially expected to solve various problems such as contig and haplotype assembly or scaffolding, for instance. However, these reads are also much more error prone than second generation reads, and display error rates reaching 10 to 30%, according to the sequencing technology and to the version of the chemistry. Moreover, these errors are mainly composed of insertions and deletions, whereas most errors are substitutions in Illumina reads. As a result, long reads require efficient error correction, and a plethora of error correction tools, directly targeted at these reads, were developed in the past ten years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, only making use of the information contained in the long reads sequences. Both these approaches make use of various strategies such as multiple sequence alignment, de Bruijn graphs, Hidden Markov Models, or even combine different strategies. In this paper, we describe a complete survey of long-read error correction, reviewing all the different methodologies and tools existing up to date, for both hybrid and self-correction. Moreover, the long reads characteristics, such as sequencing depth, length, error rate, or even sequencing technology, have huge impacts on how well a given tool or strategy performs, and can thus drastically reduce the correction quality. We thus also present an in-depth benchmark of available long-read error correction tools, on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammal genomes. Competing Interest Statement: The authors have declared no competing interest.

Steffen Albrecht et al.. "Interpretable machine learning models for single-cell ChIP-seq imputation." bioRxiv , no. (2021): 2019.12.20.883983. Accessed July 19, 2021. doi: 10.1101/2019.12.20.883983.
Motivation: Single-cell ChIP-seq (scChIP-seq) analysis is challenging due to data sparsity. High degree of data sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from ENCODE to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Results: Imputations using machine learning models trained for each single cell, each target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real data. Results on simulated data show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways. An imputation method that allows the interpretation of the underlying models facilitates users to gain an even deeper understanding of individual cells and, consequently, of sparse scChIP-seq datasets. Availability and implementation: Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA. Competing Interest Statement: The authors have declared no competing interest.

Darryl Ho et al.. "LISA: Learned Indexes for Sequence Analysis." bioRxiv , no. (2021): 2020.12.22.423964. Accessed July 19, 2021. doi: 10.1101/2020.12.22.423964.
Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search—exact search and super-maximal exact match (SMEM) search. LISA builds on and extends FM-index, which is the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves up to 2.2 and 10.8 speedups over the state-of-the-art FM-index based implementations for exact search and super-maximal exact match (SMEM) search, respectively. Code availability: https://github.com/IntelLabs/Trans-Omics-Acceleration-Library/tree/master/LISA. Competing Interest Statement: The authors have declared no competing interest.

Xinzhou Ge et al.. "Clipper: p-value-free FDR control on high-throughput data from two conditions." bioRxiv , no. (2021): 2020.11.19.390773. Accessed July 19, 2021. doi: 10.1101/2020.11.19.390773.
High-throughput biological data analysis commonly involves identifying “interesting” features (e.g., genes, genomic regions, and proteins), whose values differ between two conditions, from numerous features measured simultaneously. The most widely-used criterion to ensure the analysis reliability is the false discovery rate (FDR), the expected proportion of uninteresting features among the identified ones. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data, differentially expressed gene identification from bulk or single-cell RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data. Notably, our benchmarking results for peptide identification are based on the first mass spectrometry data standard with a realistic dynamic range. Our results demonstrate Clipper’s flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis. Significance Statement: The reproducibility crisis has been increasingly alarming in biomedical research, which often involves high-throughput data analysis to identify targets for downstream experimental validation. False discovery rate (FDR) is the state-of-the-art criterion to guard reproducibility in such biological data analysis. Existing bioinformatics tools control the FDR using p-values, which are usually ill-posed, leading to failed FDR control or poor power. Clipper is a flexible, powerful FDR-control framework that removes the need for high-resolution, well-calibrated p-values. Applicable to various bioinformatics analyses, Clipper outperforms popular bioinformatics tools, including identifying peaks from ChIP-seq data, differentially expressed genes from bulk or single-cell RNA-seq data, and differentially interacting chromatin regions from Hi-C data. Clipper is a significant computational advance to addressing the reproducibility crisis in biomedical research. Competing Interest Statement: The authors have declared no competing interest.

Jakob McBroome et al.. "A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees." bioRxiv , no. (2021): 2021.04.03.438321. Accessed July 19, 2021. doi: 10.1101/2021.04.03.438321.
The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently-proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus’ evolutionary history using public data. We also present matUtils – a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively. Competing Interest Statement: The authors have declared no competing interest.

Singh, Roopali, Feipeng Zhang and Qunhua Li. "Assessing Reproducibility of High-throughput Experiments in the Case of Missing Data." bioRxiv , no. (2021): 2021.07.10.451851. Accessed July 19, 2021. doi: 10.1101/2021.07.10.451851.
High-throughput experiments are an essential part of modern biological and biomedical research. The outcomes of high-throughput biological experiments often have a lot of missing observations due to signals below detection levels. For example, most single-cell RNA-seq (scRNA-seq) protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded in the reproducibility assessment, potentially generating misleading assessments. In this paper, we develop a regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Using a latent variable approach, we extend correspondence curve regression (CCR), a recently proposed method for assessing the effects of operational factors to reproducibility, to incorporate missing values. Using simulations, we show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method using a single-cell RNA-seq dataset collected on HCT116 cells. We compare the reproducibility of different library preparation platforms and study the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility. Competing Interest Statement: The authors have declared no competing interest.

Shuai Lu et al.. "Attention-based convolutional neural networks for protein-protein interaction site prediction." bioRxiv , no. (2021): 2021.07.10.451856. Accessed July 19, 2021. doi: 10.1101/2021.07.10.451856.
Motivation: Protein-protein interactions are of great importance in the life cycles of living cells. Accurate prediction of the protein-protein interaction site (PPIs) from protein sequence improves our understanding of protein-protein interaction, contributes to the protein-protein docking and is crucial for drug design. However, practical experimental methods are costly and time-consuming so that many sequence-based computational methods have been developed. Most of those methods employ a sliding window approach, which utilize local neighbor information within a window size. However, they don’t distinguish and use the effect of each individual neighboring residue at different position. Results: We propose a novel sequence-based deep learning method consisting of convolutional neural networks (CNNs) and attention mechanism to improve the performance of PPIs prediction. Our attention-based CNNs captures the different effect of each neighboring residue within a sliding window, and therefore making a better understanding of the local environment of target residue. We employ experiments on several public benchmark datasets. The experimental results demonstrate that our proposed method significantly outperforms the state-of-the-art techniques. We also analyze the difference using various sliding window sizes and amino acid residue features combination. Availability: The source code can be obtained from https://github.com/biolushuai/attention-based-CNNs-for-PPIs-prediction. Contact: iexfnan{at}zzu.edu.cn or zhangst{at}zzu.edu.cn. Supplementary information: Supplementary data are available online. Competing Interest Statement: The authors have declared no competing interest.

Katrina Sherbina et al.. "Power Calculator for Detecting Allelic Imbalance Using Hierarchical Bayesian Model." bioRxiv , no. (2021): 2021.07.10.451873. Accessed July 19, 2021. doi: 10.1101/2021.07.10.451873.
Allelic imbalance (AI) is the differential expression of the two alleles in a diploid. AI can vary between tissues, treatments, and environments. Statistical methods for testing in this area exist, with impacts of explosive type I error in the presence of bias well understood. However, for study design, the more important and understudied problem is the type II error and power. As the biological questions for this type of study explode, and the costs of the technology plummet, what is more important: reads or replicates? How small of an interaction can be detected while keeping the type I error at bay? Here we present a simulation study that demonstrates that the proper model can control type I error below 5% for most scenarios. We find that a minimum of 2400, 480, and 240 allele specific reads divided equally among 12, 5, and 3 replicates is needed to detect a 10%, 20%, and 30%, respectively, deviation from allelic balance in a condition with power >80%. A minimum of 960 and 240 allele specific reads is needed to detect a 20% or 30% difference in AI between conditions with comparable power but these reads need to be divided amongst 8 replicates. Higher numbers of replicates increase power more than adding coverage without affecting type I error. We provide a Python package that enables simulation of AI scenarios and enables individuals to estimate type I error and power in detecting AI and differences in AI between conditions tailored to their own specific study needs. Competing Interest Statement: The authors have declared no competing interest.

Ashkan Sedigh et al.. "Enhancing Precision in Bioprinting Utilizing Fuzzy Systems." bioRxiv , no. (2021): 2021.07.10.451921. Accessed July 19, 2021. doi: 10.1101/2021.07.10.451921.
Bioprinting facilitates the generation of complex, three-dimensional (3D), cell-based constructs for a variety of applications. Although multiple bioprinting technologies have been developed, extrusion-based systems have become the dominant technology due to the diversity of substrate materials (bioinks) that can be accommodated, either individually or in combination. However, each bioink has unique material properties and extrusion characteristics that limit bioprinting precision, particularly when generating constructs from different bioinks. Here, we aimed to achieve high precision (i.e. repeatability) across samples by generating bioink-specific printing parameters using a systematic approach. We hypothesized that a fuzzy system could be used as a “black box” method to tackle the inherent vagueness and imprecision in 3D bioprinting data and uncover the optimal printing parameters for a specific bioink that would result in high accuracy and precision. Our fuzzy model was used to approximate and quantify the precision and ease of printability for two common bioinks - type I collagen and Pluronic F127, with or without dilution in αMEM culture media. The model consisted of three inputs (pressure, speed, and dilution percent of bioink) and a single output (layer width). Using this system, we introduce the Bioink Precision Index (BPI), a metric that can be used to quantify and compare the precision of any bioink. Here, we show that printing with parameters optimized using BPI increases the precision for collagen (+15%) and Pluronic F127 (+29%) as compared to the manufacturer’s recommended printing parameters. Competing Interest Statement: The authors have declared no competing interest.

Kleinert, Philip and Martin Kircher. "CADD-SV – a framework to score the effects of structural variants in health and disease." bioRxiv , no. (2021): 2021.07.10.451798. Accessed July 19, 2021. doi: 10.1101/2021.07.10.451798.
While technological advances improved the identification of structural variants (SVs) in the human genome, their interpretation remains challenging. Several methods utilize individual mechanistic principles like the deletion of coding sequence or 3D genome architecture disruptions. However, a comprehensive tool using the broad spectrum of available annotations is missing. Here, we describe CADD-SV, a method to retrieve and integrate a wide set of annotations to predict the effects of SVs. Previously, supervised learning approaches were limited due to a small number and biased set of annotated pathogenic or benign SVs. We overcome this problem by using a surrogate training-objective, the Combined Annotation Dependent Depletion (CADD) of functional variants. We use human and chimpanzee derived SVs as proxy-neutral and contrast them with matched simulated variants as proxy-pathogenic, an approach that has proven powerful for SNVs. Our tool computes summary statistics over diverse variant annotations and uses random forest models to prioritize deleterious structural variants. The resulting CADD-SV scores correlate with known pathogenic and rare population variants. We further show that we can prioritize somatic cancer variants as well as non-coding variants known to affect gene expression. We provide a website and offline-scoring tool for easy application of CADD-SV (https://cadd-sv.bihealth.org/). Competing Interest Statement: The authors have declared no competing interest.

Jordan K. Matelsky et al.. "Circuit motifs and graph properties of connectome development in C. elegans." bioRxiv , no. (2021): 2021.07.11.451911. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451911.
Network science is a powerful tool that can be used to better explore the complex structure of brain networks. Leveraging graph and motif analysis tools, we interrogate C. elegans connectomes across multiple developmental time points and compare the resulting graph characteristics and substructures over time. We show the evolution of the networks and highlight stable invariants and patterns as well as those that grow or decay unexpectedly, providing a substrate for additional analysis. Competing Interest Statement: The authors have declared no competing interest.

Fischer, David S., Anna C. Schaar and Fabian J. Theis. "Learning cell communication from spatial graphs of cells." bioRxiv , no. (2021): 2021.07.11.451750. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451750.
Tissue niches are sources of cellular variation and key to understanding both single-cell and tissue phenotypes. The interaction of a cell with its niche can be described through cell communication events. These events cannot be directly observed in molecular profiling assays of single cells and have to be inferred. However, computational models of cell communication and variance attribution defined on data from dissociated tissues suffer from multiple limitations with respect to their ability to define and to identify communication events. We address these limitations using spatial molecular profiling data with node-centric expression modeling (NCEM), a computational method based on graph neural networks which reconciles variance attribution and communication modeling in a single model of tissue niches. We use these models in varying complexity across spatial assays, such as immunohistochemistry and MERFISH, and biological systems to demonstrate that the statistical cell–cell dependencies discovered by NCEM are plausible signatures of known molecular processes underlying cell communication. We identify principles of tissue organisation as cell communication events across multiple datasets using interpretation mechanisms. In the primary motor cortex, we found gene expression variation that is due to niche composition variation across cortical depth. Using the same approach, we also identified niche-dependent cell state variation in CD8 T cells from inflamed colon and colorectal cancer. Finally, we show that NCEMs can be extended to mixed models of explicit cell communication events and latent intrinsic sources of variation in conditional variational autoencoders to yield holistic models of cellular variation in spatial molecular profiling data. Altogether, this graphical model of cellular niches is a step towards understanding emergent tissue phenotypes. Competing Interest Statement: F.J.T. reports receiving consulting fees from Cellarity Inc., and ownership interest in Cellarity, Inc. and Dermagnostix.

Yunwei Zhang et al.. "SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data." bioRxiv , no. (2021): 2021.07.11.451967. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451967.
Survival analysis is a branch of statistics that deals with both the tracking of time and the survival status simultaneously as the dependent response. Current comparisons of the performance of survival models mostly focus on classical clinical data with traditional statistical survival models, with prediction accuracy being often the only measurement of model performance. Moreover, survival analysis approaches for censored omics data have not been fully studied. The typical solution is to truncate survival time, to define a new status variable, and to then perform a binary classification analysis. Here, we develop a benchmarking framework that compares survival models for both clinical datasets and omics datasets, and that not only focuses on classical statistical survival models but also incorporates state-of-the-art machine learning survival models with multiple performance evaluation measurements including model predictability, stability, flexibility and computational issues. Our comprehensive comparison framework shows that optimality is dataset and analysis method dependent. The key result is that there is no one-size-fits-all solution for any of the criteria and any of the methods. Some methods with a high C-index suffer from computational exhaustion and instability. The implications of our framework give researchers an insight on how different survival model implementations vary over real world datasets. We highlight that care is needed when selecting methods and recommend specifically not to consider the C-index as the only performance evaluation metric as alternative metrics measure other performance aspects. Code availability: https://github.com/SydneyBioX/SurvBenchmark Contact: jean.yang{at}sydney.edu.au Competing Interest Statement: The authors have declared no competing interest.

F. Meyer et al.. "Critical Assessment of Metagenome Interpretation - the second round of challenges." bioRxiv , no. (2021): 2021.07.12.451567. Accessed July 19, 2021. doi: 10.1101/2021.07.12.451567.
Evaluating metagenomic software is key for optimizing metagenome interpretation and the focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created from ∼1,700 novel and known microbial genomes, as well as ∼600 novel plasmids and viruses. Altogether 5,002 results by 76 program versions were analyzed, representing a 22x increase in results. Substantial improvements were seen in metagenome assembly, some due to using long-read data. The presence of related strains still was challenging for assembly and genome binning, as was assembly quality for the latter. Taxon profilers demonstrated a marked maturation, with taxon profilers and binners excelling at higher bacterial taxonomic ranks, but underperforming for viruses and archaea. Assessment of clinical pathogen detection techniques revealed a need to improve reproducibility. Analysis of program runtimes and memory usage identified highly efficient programs, including some top performers with other metrics. The CAMI II results identify current challenges, but also guide researchers in selecting methods for specific analyses. Competing Interest Statement: A.E.D. co-founded Longas Technologies Pty Ltd, a company aimed at development of synthetic long-read sequencing technologies.

Langnickel, Lisa and Juliane Fluck. "We are not ready yet: limitations of transfer learning for Disease Named Entity Recognition." bioRxiv , no. (2021): 2021.07.11.451939. Accessed July 19, 2021. doi: 10.1101/2021.07.11.451939.
Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models are used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results – partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied on COVID-19 preprints shows a performance drop compared to the results on available test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods – thereof transfer learning – and show that current state-of-the-art methods work well for a given training and the corresponding test set but experience a significant lack of generalization when applying to new data. We therefore argue that there is a need for larger annotated data sets for training and testing. Competing Interest Statement: The authors have declared no competing interest.

Lin Zhou et al.. "A synthetic lethal screen identifies HDAC4 as a potential target in MELK overexpressing cancers." bioRxiv , no. (2021): 2021.07.16.452653. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452653.
Maternal embryonic leucine zipper kinase (MELK) is frequently overexpressed in cancer, but the role of MELK in cancer is still poorly understood. MELK was shown to have roles in many cancer-associated processes including tumor growth, chemotherapy resistance, and tumor recurrence. To determine whether the frequent overexpression of MELK can be exploited in therapy, we performed a high-throughput screen using a library of Saccharomyces cerevisiae mutants to identify genes whose functions become essential when MELK is overexpressed. We identified two such genes: LAG2 and HDA3. LAG2 encodes an inhibitor of the SCF ubiquitin-ligase complex, while HDA3 encodes a subunit of the HDA1 histone deacetylase complex. We find that one of these synthetic lethal interactions is conserved in mammalian cells, as inhibition of a human homolog of HDA3 (HDAC4) is synthetically toxic in MELK overexpression cells. Altogether, our work might provide a new angle of how to exploit MELK overexpression in cancers and might thus lead to novel intervention strategies.

Evelyn Ralston et al.. "Transcriptomic analysis of mdx mouse muscles reveals a signature of early human Duchenne muscular dystrophy." bioRxiv , no. (2021): 2021.07.16.452553. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452553.
The mdx mouse (C57BL/10ScSn-DMDmdx/J) is the oldest model of Duchenne muscular dystrophy (DMD). Mdx remains popular and has not been replaced by newer mouse models, despite criticisms that mdx has a nearly normal lifespan and mild pathology while DMD remains a severe, fatal disease. At some point we noticed that the absence of mdx RNA-seq data limited our ability to assess the results of physiological work on the mouse model and to compare these results to human genetic data [1]. We carried out RNA-seq analysis of wild-type and mdx mice of 2 and 5 months of age, using three hindlimb muscles per mouse: the flexor digitorum brevis (FDB), the extensor digitorum longus (EDL) and the soleus (SOL), with a total of 55 samples. We then mined the data and found that each of the three muscles is a valid experimental model for DMD-related mouse work, even the FDB, despite a delayed pathology development. We also show that the mdx mouse muscles are enriched in metabolic, developmental, regenerational and structural pathways that have been found to be the “disease signature” of DMD in young and presymptomatic subjects [38, 39]. Additionally, we show that healthy human muscle fiber microtubules present the grid-like organization found in control rodents but perturbed in the mdx mouse. We conclude that the mdx mouse appropriately mimics the early stages of DMD, with its microtubule defects signaling fiber regeneration [35]. We hope that these results may contribute to a better understanding of the failure of regeneration as DMD progresses. Competing Interest Statement: The authors have declared no competing interest.

Glen A. Satten et al.. "Efficient Estimation of Indirect Effects in Case-Control Studies Using a Unified Likelihood Framework." bioRxiv , no. (2021): 2021.07.16.452552. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452552.
Mediation models are a set of statistical techniques that investigate the mechanisms that produce an observed relationship between an exposure variable and an outcome variable in order to deduce the extent to which the relationship is influenced by intermediate mediator variables. For a case-control study, the most common mediation analysis strategy employs a counterfactual framework that permits estimation of indirect and direct effects on the odds ratio scale for dichotomous outcomes, assuming either binary or continuous mediators. While this framework has become an important tool for mediation analysis, we demonstrate that we can embed this approach in a unified likelihood framework for mediation analysis in case-control studies that leverages more features of the data (in particular, the relationship between exposure and mediator) to improve efficiency of indirect effect estimates. One important feature of our likelihood approach is that it naturally incorporates cases within the exposure-mediator model to improve efficiency. Our approach does not require knowledge of disease prevalence and can model confounders and exposure-mediator interactions, and is straightforward to implement in standard statistical software. We illustrate our approach using both simulated data and real data from a case-control genetic study of lung cancer. Competing Interest Statement: The authors have declared no competing interest.

Ryn Cuddleston et al.. "Cellular and genetic drivers of RNA editing variation in the human brain." bioRxiv , no. (2021): 2021.07.16.452690. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452690.
Posttranscriptional adenosine-to-inosine modifications amplify the functionality of RNA molecules in the brain, yet the cellular and genetic regulation of RNA editing is poorly described. We quantified base-specific RNA editing across three major cell populations from the human prefrontal cortex: glutamatergic neurons, medial ganglionic eminence GABAergic neurons, and oligodendrocytes. We found more selective editing and RNA hyper-editing in neurons relative to oligodendrocytes. The pattern of RNA editing was highly cell type-specific, with 189,229 cell type-associated sites. The cellular specificity for thousands of sites was confirmed by single nucleus RNA-sequencing. Importantly, cell type-associated sites were enriched in GTEx RNA-sequencing data, edited ∼twentyfold higher than all other sites, and variation in RNA editing was predominantly explained by neuronal proportions in bulk brain tissue. Finally, we discovered 661,791 cis-editing quantitative trait loci across thirteen brain regions, including hundreds with cell type-associated features. These data reveal an expansive repertoire of highly regulated RNA editing sites across human brain cell types and provide a resolved atlas linking cell types to editing variation and genetic regulatory effects. Competing Interest Statement: The authors have declared no competing interest.

Randy L. Parrish et al.. "TIGAR-V2: Efficient TWAS Tool with Nonparametric Bayesian eQTL Weights of 49 Tissue Types from GTEx V8." bioRxiv , no. (2021): 2021.07.16.452700. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452700.
Standard Transcriptome-Wide Association Study (TWAS) methods first train gene expression prediction models using reference transcriptomic data, and then test the association between the predicted genetically regulated gene expression and phenotype of interest. Most existing TWAS tools require cumbersome preparation of genotype input files and extra coding to enable parallel computation. To improve the efficiency of TWAS tools, we develop TIGAR-V2, which directly reads VCF files, enables parallel computation, and reduces up to 90% computation cost compared to the original version. TIGAR-V2 can train gene expression imputation models using either nonparametric Bayesian Dirichlet Process Regression (DPR) or Elastic-Net (as used by PrediXcan), perform TWAS using either individual-level or summary-level GWAS data, and implements both burden and variance-component test statistics for inference. We trained gene expression prediction models by DPR for 49 tissues using GTEx V8 by TIGAR-V2 and illustrated the usefulness of these nonparametric Bayesian DPR eQTL weights through TWAS of breast and ovarian cancer utilizing public GWAS summary statistics. We identified 88 and 37 risk genes respectively for breast and ovarian cancer, most of which are either known or near previously identified GWAS (∼95%) or TWAS (∼40%) risk genes of the corresponding phenotype and three novel independent TWAS risk genes with known functions in carcinogenesis. These findings suggest that TWAS can provide biological insight into the transcriptional regulation of complex diseases. TIGAR-V2 tool, trained Bayesian cis-eQTL weights, and LD information from GTEx V8 are publicly available, providing a useful resource for mapping risk genes of complex diseases. Competing Interest Statement: The authors have declared no competing interest.

Harrison J. Lamb et al.. "In-situ genomic prediction using low-coverage Nanopore sequencing." bioRxiv , no. (2021): 2021.07.16.452615. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452615.
Most traits in livestock, crops and humans are polygenic, that is, a large number of loci contribute to genetic variation. Effects at these loci lie along a continuum ranging from common low-effect to rare high-effect variants that cumulatively contribute to the overall phenotype. Statistical methods to calculate the effect of these loci have been developed and can be used to predict phenotypes in new individuals. In agriculture, these methods are used to select superior individuals using genomic breeding values; in humans these methods are used to quantitatively measure an individual’s disease risk, termed polygenic risk scores. Both fields typically use SNP array genotypes for the analysis. Recently, genotyping-by-sequencing has become popular, due to lower cost and greater genome coverage (including structural variants). Oxford Nanopore Technologies’ (ONT) portable sequencers have the potential to combine the benefits of genotyping-by-sequencing with portability and decreased turn-around time. This introduces the potential for in-house clinical genetic disease risk screening in humans or calculating genomic breeding values on-farm in agriculture. Here we demonstrate the potential of the latter by calculating genomic breeding values for four traits in cattle using low-coverage ONT sequence data and comparing these breeding values to breeding values calculated from SNP arrays. At sequencing coverages between 2X and 4X the correlation between ONT breeding values and SNP array-based breeding values was > 0.92 when imputation was used and > 0.88 when no imputation was used. With an average sequencing coverage of 0.5x the correlation between the two methods was between 0.85 and 0.92 using imputation, depending on the trait. This demonstrates that ONT sequencing has great potential for in clinic or on-farm genomic prediction. Author Summary: Genomic prediction is a method that uses a large number of genetic markers to predict complex phenotypes in livestock, crops and humans. Currently the techniques we use to determine genotypes require complex equipment which can only be used in laboratories. However, Oxford Nanopore Technologies has released a portable DNA sequencer, which can genotype a range of organisms in the field. As a result of the device’s higher error rate, it has largely only been considered for specific applications, such as characterising large mutations. Here we demonstrated that despite the device’s error rate, accurate genomic prediction is also possible using this portable device. The ability to accurately predict complex phenotypes such as the predisposition to schizophrenia in humans or lifetime fertility in livestock in-situ would decrease the turnaround time and ultimately increase the utility of this method in the human clinical and on-farm settings. Competing Interest Statement: The authors have declared no competing interest.

Jaclyn E. Bubnell et al.. "Diverse wMel variants of Wolbachia pipientis differentially rescue fertility and cytological defects of the bag of marbles partial loss of function mutation in Drosophila melanogaster." bioRxiv , no. (2021): 2021.01.15.426050. Accessed July 19, 2021. doi: 10.1101/2021.01.15.426050.
In Drosophila melanogaster, the maternally inherited endosymbiont Wolbachia pipientis interacts with germline stem cell genes during oogenesis. One such gene, bag of marbles (bam) is the key switch for differentiation and also shows signals of adaptive evolution for protein diversification. These observations have led us to hypothesize that W. pipientis could be driving the adaptive evolution of bam for control of oogenesis. To test this hypothesis, we must understand the specificity of the genetic interaction between bam and W. pipientis. Previously, we documented that the W. pipientis variant, wMel, rescued the fertility of the bamBW hypomorphic mutant as a transheterozygote over a bam null. However, bamBW was generated more than 20 years ago in an uncontrolled genetic background and maintained over a balancer chromosome. Consequently, the chromosome carrying bamBW accumulated mutations that have prevented controlled experiments to further assess the interaction. Here, we used CRISPR/Cas9 to engineer the same single amino acid bam hypomorphic mutation (bamL255F) and a new bam null disruption mutation into the w1118 isogenic background. We assess the fertility of wildtype bam, bamL255F/bamnull hypomorphic, and bamL255F/bamL255F mutant females, each infected individually with ten W. pipientis wMel variants representing three phylogenetic clades. Overall, we find that all of the W. pipientis variants tested here rescue bam hypomorphic fertility defects with wMelCS-like variants exhibiting the strongest rescue effects. Additionally, these variants did not increase wildtype bam female fertility. Therefore, both bam and W. pipientis interact in genotype-specific ways to modulate female fertility, a critical fitness phenotype. Competing Interest Statement: The authors have declared no competing interest.

Jason Fletcher et al.. "Interpreting Polygenic Score Effects in Sibling Analysis." bioRxiv , no. (2021): 2021.07.16.452740. Accessed July 19, 2021. doi: 10.1101/2021.07.16.452740.
Researchers often claim that sibling analysis can be used to separate causal genetic effects from the assortment of biases that contaminate most downstream genetic studies. Indeed, typical results from sibling models show large (>50%) attenuations in the associations between polygenic scores and phenotypes compared to non-sibling models, consistent with researchers' expectations about bias reduction. This paper explores these expectations by using family (quad) data and simulations that include indirect genetic effect processes and evaluates the ability of sibling models to uncover direct genetic effects. We find that sibling models, in general, fail to uncover direct genetic effects; indeed, these models have both upward and downward biases that are difficult to sign in typical data. When genetic nurture effects exist, sibling models create 'measurement error' that attenuates associations between polygenic scores and phenotypes. As the correlation between direct and indirect effects changes, this bias can increase or decrease. Our findings suggest that results from sibling analysis aimed at uncovering direct genetic effects should be interpreted with caution. Competing Interest Statement: The authors have declared no competing interest.

Lei An et al.. "Defining the Sensitivity Landscape of 74,389 EGFR Variants to Tyrosine Kinase Inhibitors." bioRxiv , no. (2021): 2021.07.18.452818. Accessed July 19, 2021. doi: 10.1101/2021.07.18.452818.
Background: Tyrosine kinase inhibitors (TKIs) therapy is a standard treatment for patients with advanced non-small-cell lung carcinoma (NSCLC) when activating epidermal growth factor receptor (EGFR) mutations are detected. However, except for the well-studied EGFR mutations, most EGFR mutations lack treatment regimens. Methods: We constructed two EGFR variant libraries containing substitutions, deletions, or insertions using the saturation mutagenesis method. All the variants were located in the EGFR mutation hotspot (exons 18-21). The sensitivity of these variants to afatinib, erlotinib, gefitinib, icotinib, and osimertinib was systematically studied by determining their enrichment in massively parallel cytotoxicity assays using an endogenous EGFR-depleted cell line, PC9. Results: A total of 3,914 and 70,475 variants were detected in the constructed EGFR Substitution-Deletion (Sub-Del) and exon 20 Insertion (Ins) libraries, accounting for 99.3% and 55.8% of the designed variants, respectively. Of the 3,914 Sub-Del variants, 813 were highly enriched in the reversible TKI (erlotinib, gefitinib, icotinib) cytotoxicity assays and 51 were enriched in the irreversible TKI (afatinib, osimertinib) cytotoxicity assays. For the 70,475 Ins variants, insertions at amino acid positions 770-774 were highly enriched in all the five TKI cytotoxicity assays. Moreover, the top 5% of the enriched insertion variants included a glycine or serine insertion at high frequency. Conclusions: We present a comprehensive reference for the sensitivity of EGFR variants to five commonly used TKIs. The approach used here should be applicable to other genes and targeted drugs. Competing Interest Statement: The authors have declared no competing interest.

Silvia Schwartz et al.. "Ankyrin2 is required for neuronal morphogenesis and long-term memory and interacts genetically with HDAC4." bioRxiv , no. (2021): 2021.07.18.452850. Accessed July 19, 2021. doi: 10.1101/2021.07.18.452850.
Dysregulation of HDAC4 expression and/or subcellular distribution results in impaired neuronal morphogenesis and long-term memory in Drosophila melanogaster. A recent genetic screen for genes that interact in the same molecular pathway as HDAC4 identified the cytoskeletal adapter Ankyrin2 (Ank2). Here we sought to investigate the role of Ank2 in neuronal morphogenesis, learning and memory, and to examine the nature of interaction with HDAC4. We found that Ank2 is expressed widely throughout the Drosophila brain where it localizes predominantly to axon tracts. Pan-neuronal knockdown of Ank2 in the mushroom body, a region critical for memory formation, resulted in defects in axon morphogenesis, and similarly reduction of Ank2 in lobular plate tangential neurons of the optic lobe disrupted dendritic branching and arborization. Conditional knockdown of Ank2 in the mushroom body of adult Drosophila significantly impaired long-term courtship memory, and this requirement for Ank2 was isolated to gamma (γ) neurons of the mushroom body. As overexpression of HDAC4 in γ neurons also impairs the formation of long-term courtship memory, this suggests that any functional relationship between these proteins during LTM likely occurs in γ neurons. We determined that the genetic interaction requires the presence of nuclear HDAC4 and is not dependent on a conserved putative ankyrin-binding motif present in HDAC4. In summary, we provide the first characterization of the expression pattern of Ank2 in the adult Drosophila brain and demonstrate that Ank2 is critical for morphogenesis of the mushroom body and for the molecular processes required in the adult brain for formation of long-term memories. Competing Interest Statement: The authors have declared no competing interest.

Daniel J. Weiner et al.. "Partitioning gene-mediated disease heritability without eQTLs." bioRxiv , no. (2021): 2021.07.14.452393. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452393.
Unknown SNP-to-gene regulatory architecture complicates efforts to link noncoding GWAS associations with genes implicated by sequencing or functional studies. eQTLs are used to link SNPs to genes, but expression in bulk tissue explains a small fraction of disease heritability. A simple but successful approach has been to link SNPs with nearby genes, but the fraction of heritability mediated by these genes is unclear, and gene-proximal (vs. gene-mediated) heritability enrichments are attenuated accordingly. We propose the Abstract Mediation Model (AMM) to estimate (1) the fraction of heritability mediated by the closest or kth-closest gene to each SNP and (2) the mediated heritability enrichment of a gene set (e.g. genes with rare-variant associations). AMM jointly estimates these quantities by matching the decay in SNP enrichment with distance from genes in the gene set. Across 47 complex traits and diseases, we estimate that the closest gene to each SNP mediates 27% (SE: 6%) of heritability, and that a substantial fraction is mediated by genes outside the ten closest. Mendelian disease genes are strongly enriched for common-variant heritability; for example, just 21 dyslipidemia genes mediate 25% of LDL heritability (211x enrichment, P = 0.01). Among brain-related traits, genes involved in neurodevelopmental disorders are only about 4x enriched, but gene expression patterns are highly informative, with detectable differences in per-gene heritability even among weakly brain-expressed genes. Competing Interest Statement: The authors have declared no competing interest.

Meichen Dong et al. "Joint Gene Network Construction by Single-Cell RNA Sequencing Data." bioRxiv , no. (2021): 2021.07.14.452387. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452387.
In contrast to differential gene expression analysis at single gene level, gene regulatory networks (GRN) analysis depicts complex transcriptomic interactions among genes for better understandings of underlying genetic architectures of human diseases and traits. Recently, single-cell RNA sequencing (scRNA-seq) data has started to be used for constructing GRNs at a much finer resolution than bulk RNA-seq data and microarray data. However, scRNA-seq data are inherently sparse which hinders direct application of the popular Gaussian graphical models (GGMs). Furthermore, most existing approaches for constructing GRNs with scRNA-seq data only consider gene networks under one condition. To better understand GRNs under different but related conditions with single-cell resolution, we propose to construct Joint Gene Networks with scRNA-seq data (JGNsc) using the GGMs framework. To facilitate the use of GGMs, JGNsc first proposes a hybrid imputation procedure that combines a Bayesian zero-inflated Poisson (ZIP) model with an iterative low-rank matrix completion step to efficiently impute zero-inflated counts resulting from technical artifacts. JGNsc then transforms the imputed data via a nonparanormal transformation, based on which joint GGMs are constructed. We demonstrate JGNsc and assess its performance using synthetic data. The application of JGNsc on two cancer clinical studies of medulloblastoma and glioblastoma identifies novel findings in addition to confirming well-known biological results. Competing Interest Statement: The authors have declared no competing interest.

Zhou, Xianjin. "Over-Representation of Potential SP4 Target Genes within Schizophrenia-Risk Genes." bioRxiv , no. (2021): 2021.07.14.452377. Accessed July 19, 2021. doi: 10.1101/2021.07.14.452377.
Reduction of Sp4 expression causes age-dependent hippocampal vacuolization and many other intermediate phenotypes of schizophrenia in Sp4 hypomorphic mice. Recent human genetic studies from both the Schizophrenia Exome Sequencing Meta-Analysis (SCHEMA) and the Genome-Wide Association Study (GWAS) validated SP4 as a schizophrenia-risk gene over the exome-wide or the genome-wide significance. Truncation of human SP4 gene has an odds ratio of 9.37 (3.38-29.7) for schizophrenia. Despite successful identification of many schizophrenia-risk genes, it is unknown whether and how these risk genes may interact with each other in the development of schizophrenia. By taking advantage of the specific localization of the GC-boxes bound by SP4 transcription factors, I analyzed the relative abundance of these GC-boxes in the proximal promoter regions of schizophrenia-risk genes. I found that the GC-box containing genes are significantly over-represented within schizophrenia-risk genes, suggesting that SP4 is not only a high-risk gene for schizophrenia, but may also act as a hub of network in regulation of many other schizophrenia-risk genes via these GC-boxes in the pathogenesis of schizophrenia. Competing Interest Statement: The authors have declared no competing interest.

Dedukh, D., A. Marta and K. Janko. "Challenges and costs of asexuality: Variation in premeiotic genome duplication in gynogenetic hybrids from Cobitis taenia complex." bioRxiv , no. (2021): 2021.07.15.452483. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452483.
The transition from sexual reproduction to asexuality is often triggered by hybridization. The gametogenesis of many hybrid asexuals involves a stage of premeiotic genomic endoreduplication leading to the production of clonal gametes and bypassing genomic incompatibilities that would normally cause hybrid sterility. However, it is still not clear at what gametogenic stage the endoreplication occurs, how many gonial cells it affects and whether its rate differs among clonal lineages. Here, we investigated meiotic and premeiotic cells of diploid and triploid hybrids of spined loaches (Cypriniformes: Cobitis) that reproduce by gynogenesis. We found that naturally as well as experimentally produced F1 hybrid strains undergo an obligatory genome duplication event to achieve asexuality, occurring in the gonocytes just before entering meiosis or, rarely, one or few divisions before meiosis. Surprisingly however, the genome endoreplication was observed only in a minor fraction of the hybrid’s gonocytes, while the vast majority were unable to duplicate their genomes and consequently could not proceed beyond pachytene due to defects in pairing and bivalent formation. We also noted that the rate of endoreplication was significantly higher among gonocytes of hybrids from successful natural clones than of experimentally produced F1 hybrids, indicating that interclonal selection may favour lineages which maximize the rate of premeiotic endoreduplication. We conclude that asexuality and hybrid sterility are intimately related phenomena and the transition from sexual reproduction to asexuality must overcome significant problems with genome incompatibilities with possible impact on reproductive potential. Competing Interest Statement: The authors have declared no competing interest.

Pérez-Pereira, Noelia, Armando Caballero and Aurora García-Dorado. "Reviewing the consequences of genetic purging on the success of rescue programs." bioRxiv , no. (2021): 2021.07.15.452459. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452459.
Genetic rescue is increasingly considered a promising and underused conservation strategy to reduce inbreeding depression and restore genetic diversity in endangered populations, but the empirical evidence supporting its application is limited to a few generations. Here we discuss in the light of theory the role of inbreeding depression arising from partially recessive deleterious mutations and of genetic purging as main determinants of the medium to long-term success of rescue programs. This role depends on two main predictions: (1) The inbreeding load hidden in populations with a long stable demography increases with the effective population size; and (2) After a population shrinks, purging tends to remove its (partially) recessive deleterious alleles, a process that is slower but more efficient for large populations than for small ones. We also carry out computer simulations to investigate the impact of genetic purging on the medium to long term success of genetic rescue programs. For some scenarios, it is found that hybrid vigor followed by purging will lead to sustained successful rescue. However, there may be specific situations where the recipient population is so small that it cannot purge the inbreeding load introduced by migrants, which would lead to increased fitness inbreeding depression and extinction risk in the medium to long term. In such cases, the risk is expected to be higher if migrants came from a large non-purged population with high inbreeding load, particularly after the accumulation of the stochastic effects ascribed to repeated occasional migration events. Therefore, under the specific deleterious recessive mutation model considered, we conclude that additional caution should be taken in rescue programs. Unless the endangered population harbors some distinctive genetic singularity whose conservation is a main concern, restoration by continuous stable gene flow should be considered, whenever feasible, as it reduces the extinction risk compared to repeated occasional migration and can also allow recolonization events. Competing Interest Statement: The authors have declared no competing interest.

Abigail O. Smith et al. "c-Jun N-terminal kinase (JNK) signaling contributes to cystic burden in polycystic kidney disease." bioRxiv , no. (2021): 2021.07.15.452451. Accessed July 19, 2021. doi: 10.1101/2021.07.15.452451.
Polycystic kidney disease is an inherited degenerative disease in which the uriniferous tubules are replaced by expanding fluid-filled cysts that ultimately destroy organ function. Autosomal dominant polycystic kidney disease (ADPKD) is the most common form, afflicting approximately 1 in 1,000 people. It primarily is caused by mutations in the transmembrane proteins polycystin-1 (Pkd1) and polycystin-2 (Pkd2). The most proximal effects of Pkd mutations leading to cyst formation are not known, but pro-proliferative signaling must be involved for the tubule epithelial cells to increase in number over time. The c-Jun N-terminal kinase (JNK) pathway promotes proliferation and is activated in acute and chronic kidney diseases. Using a mouse model of cystic kidney disease caused by Pkd2 loss, we observe JNK activation in cystic kidneys and observe increased nuclear phospho c-Jun in cystic epithelium. Genetic removal of Jnk1 and Jnk2 suppresses the nuclear accumulation of phospho c-Jun, reduces proliferation and reduces the severity of cystic disease. While Jnk1 and Jnk2 are thought to have largely overlapping functions, we find that Jnk1 loss is nearly as effective as the double loss of Jnk1 and Jnk2. Jnk pathway inhibitors are in development for neurodegeneration, cancer, and fibrotic diseases. Our work suggests that the JNK pathway should be explored as a therapeutic target for ADPKD. Author Summary: Autosomal dominant polycystic kidney disease is a leading cause of end stage renal disease requiring dialysis or kidney transplant. During disease development, the cells lining the kidney tubules proliferate. This proliferation transforms normally small diameter tubules into fluid-filled cysts that enlarge with time, eventually destroying all kidney function. Despite decades of research, polycystic kidney disease remains incurable. Furthermore, the precise signaling events involved in cyst initiation and growth remain unclear. The c-Jun N-terminal kinase (JNK) is a major pathway regulating cellular proliferation and differentiation but its importance to polycystic kidney disease was not known. We show that JNK activity is elevated in cystic kidneys and that reducing JNK activity decreases cyst growth pointing to JNK inhibition as a therapeutic strategy for treating polycystic kidney disease. Competing Interest Statement: The authors have declared no competing interest. Abbreviations: PKD, polycystic kidney disease; ADPKD, autosomal dominant polycystic kidney disease; JNK, jun N-terminal kinase; cAMP, cyclic adenosine monophosphate; AP-1, activator protein-1; H&E, Hematoxylin and eosin; LTA, Lotus tetragonolobus agglutinin; DBA, Dolichos biflorus agglutinin; MAP, mitogen-activated protein; MAP2K, mitogen-activated protein kinase kinase; MAP3K, mitogen-activated protein kinase kinase kinase; SMA, alpha smooth muscle actin; DAPI, 4′,6-diamidino-2-phenylindole; GPCR, G-protein coupled receptor.

Lindsay Fernández-Rhodes et al. "Ancestral diversity improves discovery and fine-mapping of genetic loci for anthropometric traits - the Hispanic/Latino Anthropometry Consortium." bioRxiv , no. (2021): 2021.05.27.445969. Accessed July 19, 2021. doi: 10.1101/2021.05.27.445969.
Hispanic/Latinos have been underrepresented in genome-wide association studies (GWAS) for anthropometric traits despite notable anthropometric variability with ancestry proportions, and a high burden of growth stunting and overweight/obesity in Hispanic/Latino populations. To address this knowledge gap, we analyzed densely-imputed genetic data in a sample of Hispanic/Latino adults, to identify and fine-map common genetic variants associated with body mass index (BMI), height, and BMI-adjusted waist-to-hip ratio (WHRadjBMI). We conducted a GWAS of 18 studies/consortia as part of the Hispanic/Latino Anthropometry (HISLA) Consortium (Stage 1, n=59,769) and validated our findings in 9 additional studies (HISLA Stage 2, n=9,336). We conducted a trans-ethnic GWAS with summary statistics from HISLA Stage 1 and existing consortia of European and African ancestries. In our HISLA Stage 1+2 analyses, we discovered one novel BMI locus, as well as two novel BMI signals and another novel height signal, each within established anthropometric loci. In our trans-ethnic meta-analysis, we identified three additional novel BMI loci, one novel height locus, and one novel WHRadjBMI locus. We also identified three secondary signals for BMI, 28 for height, and two for WHRadjBMI. We replicated >60 established anthropometric loci in Hispanic/Latino populations at genome-wide significance—representing up to 30% of previously-reported index SNP anthropometric associations. Trans-ethnic meta-analysis of the three ancestries showed a small-to-moderate impact of uncorrected population stratification on the resulting effect size estimates. Our novel findings demonstrate that future studies may also benefit from leveraging differences in linkage disequilibrium patterns to discover novel loci and additional signals with less residual population stratification. Competing Interest Statement: Stephanie M. Gogarten (SMG) and Adrienne M. Stilp (AMS) received funding from Seven Bridges Genomics to develop tools for the NHLBI BioData Catalyst consortium. All other authors declare no competing interests.

Jamie Nourse et al. "Non-invasive imaging of gene expression and protein secretion dynamics in living mice." bioRxiv , no. (2021): 2021.07.08.451623. Accessed July 19, 2021. doi: 10.1101/2021.07.08.451623.
The liver is the largest organ and main source for secretory proteins with functions critical to health and disease. Tools to non-invasively study the fate of secretory proteins in vivo are scarce. Here we present a multimodal reporter mouse to query the expression and secretion dynamics of prothrombin, a prototypical liver-derived secretory protein. Using optical in vivo imaging, we confirm known modifiers of prothrombin expression and secretion. We discover extrahepatic prothrombin expression in multiple sites (including testes, placenta, brain, kidney, heart and lymphatic system) and in emerging tumors, resulting in significant amounts of tumor-derived prothrombin in the blood with procoagulant properties. Syngeneic cell lines from this mouse model enable unravelling regulatory mechanisms in high resolution, and in a scalable format ex vivo. Beyond discovering new functions in the hemostatic system, this model allows identifying rheostats in the cross-talk between gene expression and availability of a secretory protein. It is also a valuable resource for uncovering novel (tissue-specific) therapeutic vulnerabilities. Competing Interest Statement: The authors have declared no competing interest.

Casaletto, James, Melissa Cline and Brian Shirts. "Quantifying the impact of data sharing on variant classification." bioRxiv , no. (2021): 2021.06.21.449318. Accessed July 19, 2021. doi: 10.1101/2021.06.21.449318.
Healthcare is increasingly leveraging genomic data to inform diagnosis, monitoring, and treatment of certain diseases with genetic predisposition. Associating patient data such as family history and de novo status with a genomic variant helps classify that variant as being pathogenic or benign. Indeed, many variants are already classified by experts, but the majority of variants are very rare, have no associated patient data, and are therefore of uncertain significance. This research models the hypothetical sharing of patient data across institutions in order to accelerate the time it takes to classify a variant. Using conservative assumptions described in the paper, we found that the probability of classifying a pathogenic variant which occurs at the rate of 1 in 100,000 people increases from less than 25% to nearly 80% after just one year when sequencing centers share their clinical data. After 5 years, the probability of classifying such a variant is nearly 100%. Competing Interest Statement: The authors have declared no competing interest.

Moses Nyine et al. "The haplotype-based analysis of Aegilops tauschii introgression into hard red winter wheat and its impact on productivity traits." bioRxiv , no. (2021): 2021.05.29.446303. Accessed July 19, 2021. doi: 10.1101/2021.05.29.446303.
Introgression from wild relatives has a great potential to broaden beneficial allelic diversity available for crop improvement in breeding programs. Here, we assessed the impact of introgression from 21 diverse accessions of Aegilops tauschii, the diploid ancestor of the wheat D genome, into six hard red winter wheat cultivars on yield and yield component traits. We used 5.2 million imputed D genome SNPs identified by whole-genome sequencing of parental lines and the sequence-based genotyping of introgression population including 351 BC1F3:5 lines. Phenotyping data collected from the irrigated and non-irrigated field trials revealed that up to 23% of the introgression lines produce more grain than the parents and check cultivars. Based on sixteen yield stability statistics, the yield of twelve introgression lines (3.4%) was stable across treatments, years and locations; five of these lines were also high yielding, producing 9.8% more grain than the average yield of check cultivars. The most significant SNP-trait and haplotype-trait associations were identified on chromosome arms 2DS and 6DL for spikelet number per spike (SNS), on chromosome arms 2DS, 3DS, 5DS and 7DS for grain length and on chromosome arms 1DL, 2DS, 6DL and 7DS for grain width. Introgression of haplotypes from Ae. tauschii parents was associated with increase in SNS, which positively correlated with heading date, whereas haplotypes from hexaploid wheat parents were associated with increased grain width. We show that haplotypes on 2DS associated with increased spikelet number and heading date are linked with multiple introgressed alleles of Ppd-D1 identified by the whole-genome sequencing of the Ae. tauschii parents. While some introgressed haplotypes exhibited significant pleiotropic effects with the direction of effects on the yield component traits being largely consistent with the previously reported trade-offs, there were haplotype combinations associated with the positive trends in yield. The characterized repertoire of the introgressed haplotypes derived from Ae. tauschii accessions with the combined positive effects on yield and yield components traits in elite germplasm provides a valuable source of alleles for improving the productivity of winter wheat by optimizing the contribution of component traits to yield. Competing Interest Statement: The authors have declared no competing interest.

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
