desirna

State of the art software for RNA sequence design. The first to solve the Eterna benchmark!

Keywords

bioinformatics bioinformatics-tool eterna rna scientific-computing secondary-structure sequence-design simulation-modeling

Repository

Basic Info

Host: GitHub
Owner: fryzjergda
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 15.2 MB

Statistics

Stars: 12
Watchers: 3
Forks: 1
Open Issues: 0
Releases: 1

Created over 2 years ago · Last pushed over 1 year ago

DesiRNA

DesiRNA is a state-of-the-art RNA sequence design tool that is fast, lightweight, easy to install, and user-friendly.

Features

Super Fast: Utilizes Replica Exchange Monte Carlo for efficient computations.
Installation: Simple to install and requires only Python3.
Versatile Design Capabilities:
- Single chain RNAs
- Two-strand RNA complexes
- RNAs with alternative structures
- RNAs with pseudoknots
- Checks if the designed RNA tends to oligomerize
- Designs RNAs with natural ACGU content
Positive and Negative Design: Follows both positive and negative design principles.
Exceptional Performance: Solved all 100 design problems of the Eterna100 V1 benchmark in under 24 hours, 90 of them in under one minute.

Installation

The installation of DesiRNA can be accomplished through several methods. Below are the instructions for different scenarios:

Without Conda

Execute the following commands:

```bash
python3 -m pip install -r requirements.txt
python3 -m pip install viennarna
```

With Conda - Creating a Custom Environment

Execute the following commands:

```bash
conda create -n DesiRNA-env python=3 numpy matplotlib pandas pathlib multiprocess
conda activate DesiRNA-env
pip install viennarna
```

With Conda - Utilizing Conda Environment YAML File

Execute the following commands:

```bash
conda env create -f DesiRNA-env.yml
conda activate DesiRNA-env
```

Adding DesiRNA to PATH

To make DesiRNA available from any location on the command line, add the directory that contains DesiRNA.py to PATH in the .bashrc file:

```bash
echo 'export PATH=$PATH:/path/to/DesiRNA' >> ~/.bashrc
source ~/.bashrc
```

Replace /path/to/DesiRNA with the path to the directory that contains the DesiRNA.py file. PATH entries must be directories, not files.

Additional Installation Guidance

Conda Installation: Should Conda be required, the official Conda installation guide provides comprehensive instructions.
ViennaRNA Installation: ViennaRNA may be installed through various methods. Detailed instructions are available in the ViennaRNA installation guide.

Usage

DesiRNA is a command-line tool, and its functionality can be accessed through various options and arguments. Below is a detailed explanation of how to use DesiRNA:

Basic Command

The basic command to run DesiRNA requires specifying the filename that contains secondary structures and constraints:

```bash
DesiRNA.py -f NAME
```

Standard Options

-R, --replicas: Number of replicas (default: 10).
-e, --exchange: Frequency of replica exchange attempts (default: 100).
-t, --timelimit: Time limit for running the program, in seconds (default: 60).
-s, --steps: Number of Replica Exchange steps; overrides the -t option (default: None).
-r, --results_number: Number of best results to be reported (default: 10).
-acgu, --ACGU {off,on}: Keep 'natural' ACGU content in designed sequences (A:15%, C:30%, G:30%, U:15%) (default: off).
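
The -R and -e options configure a Replica Exchange Monte Carlo run. As a generic illustration of that technique (not DesiRNA's actual code), the sketch below builds a temperature ladder and applies the standard Metropolis swap criterion between two replicas. The function names, the linear temperature spacing, and the convention that lower scores are better are assumptions made for this example; the values 10 and 150 match the -tmin and -tmax defaults listed under Advanced Options.

```python
import math
import random


def temperature_ladder(t_min, t_max, n_replicas):
    """Return n_replicas temperatures from t_min to t_max (linear spacing is an assumption)."""
    if n_replicas == 1:
        return [float(t_min)]
    step = (t_max - t_min) / (n_replicas - 1)
    return [t_min + i * step for i in range(n_replicas)]


def swap_probability(score_i, temp_i, score_j, temp_j):
    """Standard replica-exchange probability of swapping the states of two replicas.

    Scores play the role of energies here (lower is better, an assumption).
    """
    delta = (1.0 / temp_i - 1.0 / temp_j) * (score_i - score_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swap(score_i, temp_i, score_j, temp_j, rng=random):
    """Draw a random number and decide whether the swap is accepted."""
    return rng.random() < swap_probability(score_i, temp_i, score_j, temp_j)


ladder = temperature_ladder(10, 150, 10)
# A cold replica holding a worse (higher) score than a hot one always swaps.
always = swap_probability(5.0, ladder[0], 1.0, ladder[-1])
# The reverse situation is accepted only with some probability below one.
sometimes = swap_probability(1.0, ladder[0], 5.0, ladder[-1])
```

With this criterion, good sequences found at high temperature tend to migrate toward the low-temperature replicas, which is the general motivation for running several replicas in parallel.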

Advanced Options

-sws, --stop_when_solved {off,on}: Stop after finding the desired number of solutions (--results_number) (default: off).
-p, --param {2004,1999}: Turner energy parameter for calculating MFE (default: 1999).
-tmin, --tmin: Minimal Replica Temperature (default: 10).
-tmax, --tmax: Maximal Replica Temperature (default: 150).
-ts, --tshelves: Custom temperature shelves for replicas, provide comma-separated values.
-sf, --scoring_function: Scoring functions and weights used to guide the design process. Multiple scoring functions can be combined by listing each with its weight, e.g., -sf Ed-Epf:0.5,1-MCC:0.5 (default: Ed-Epf:1.0). Available options:
- Ed-Epf: Energy of the desired structure (Ed) minus free energy of the thermodynamic ensemble (Epf).
- Ed-MFE: Energy of the desired structure (Ed) minus Minimum Free Energy (MFE).
- 1-MCC: One minus Matthews Correlation Coefficient (MCC).
- sln_Epf: Sequence Length Normalized Epf.
- 1-precision: One minus precision (TP/(TP+FP)).
- 1-recall: One minus recall (TP/(TP+FN)).
- Edef: Ensemble defect - deviation of the RNA secondary structure ensemble from the target structure, normalized by sequence length.
-nd, --negative_design {off,on}: Use negative design approach (default: off).
-acgu_content, --ACGU_content: Provide user-defined ACGU content, comma-separated values e.g., -acgu_content 10,40,40,10.
-o, --avoid_oligomerization {off,on}: Check if the designed sequence tends to oligomerize. User may enforce or avoid oligomerization. Slows down the simulation (default: off).
-d, --dimer {off,on}: Design of a homodimer complex, of two strands. Requires input file complying with RNA-RNA complex format (default: off).
-tm, --target_mutations {off,on}: Targeted mutations (default: on).
-tm_perc_max, --target_mutations_percentage_max: Highest percentage of targeted mutations for the lowest temperature replica (default: 0.7).
-tm_perc_min, --target_mutations_percentage_min: Lowest percentage of targeted mutations for the highest temperature replica (default: 0.0).
-motifs, --motif_sequences: Prevent or enforce specific sequence motifs. Provide sequence motifs along with their bonuses (-) or penalties (+), e.g., -motifs GNRA,-1,CCCC,2.
-seed, --seed_number: User-defined seed number for simulation (default: 0).
-re_seq, --replicas_sequences {different,same}: Choose whether replicas will start from the same or different random sequence (default: same).
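
To make the -sf syntax and the pair-based terms more concrete, here is a minimal sketch. It parses a specification such as Ed-Epf:0.5,1-MCC:0.5 into weights and computes 1-precision and 1-recall from sets of base pairs using the formulas given above. The function names are hypothetical, and combining the terms as a weighted sum is an assumption; the text above does not state how DesiRNA combines them internally.

```python
def parse_scoring_spec(spec):
    """Parse 'name:weight,name:weight' into a {name: weight} dictionary."""
    weights = {}
    for item in spec.split(","):
        # Split on the last colon so that names containing '-' stay intact.
        name, _, weight = item.rpartition(":")
        weights[name] = float(weight)
    return weights


def one_minus_precision(target_pairs, predicted_pairs):
    """1 - TP/(TP+FP), where pairs are (i, j) index tuples."""
    tp = len(target_pairs & predicted_pairs)
    fp = len(predicted_pairs - target_pairs)
    if tp + fp == 0:
        return 1.0  # no predicted pairs at all (convention chosen for this sketch)
    return 1.0 - tp / (tp + fp)


def one_minus_recall(target_pairs, predicted_pairs):
    """1 - TP/(TP+FN), where pairs are (i, j) index tuples."""
    tp = len(target_pairs & predicted_pairs)
    fn = len(target_pairs - predicted_pairs)
    if tp + fn == 0:
        return 1.0  # empty target (convention chosen for this sketch)
    return 1.0 - tp / (tp + fn)


def weighted_score(weights, term_values):
    """Weighted sum of scoring terms (an assumed way of combining them)."""
    return sum(weight * term_values[name] for name, weight in weights.items())


weights = parse_scoring_spec("1-precision:0.5,1-recall:0.5")
target = {(0, 9), (1, 8), (2, 7)}
predicted = {(0, 9), (1, 8), (3, 6)}
terms = {
    "1-precision": one_minus_precision(target, predicted),
    "1-recall": one_minus_recall(target, predicted),
}
score = weighted_score(weights, terms)
```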

Example Usage

To run DesiRNA with a specific file and custom parameters:

```bash
DesiRNA.py -f Standard_design_input.txt -R 20 -e 50 -t 120
```

This command will run DesiRNA with the file Standard_design_input.txt, 20 replicas, an exchange frequency of 50, and a time limit of 120 seconds.

For further assistance with the command-line options, you can use the help command:

```bash
DesiRNA.py -h
```

Input Files

The input file for DesiRNA must contain specific information related to the RNA sequence design. Here's an overview of the expected format and additional options:

Basic Single Chain RNA Design

Name Line: >name followed by a unique identifier.
Sequence Restraints Line: >seq_restr followed by sequence restraints using IUPAC nomenclature.
Secondary Structure Line: >sec_struct followed by the secondary structure.

Example:

```
>name
Design
>seq_restr
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>sec_struct
((((((.((((((((....))))).)).).))))))
```

Command:

DesiRNA.py -f Standard_design_input.txt -t 60
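
Before launching a run it can help to sanity-check an input file. The sketch below is a hypothetical helper, not part of DesiRNA: it assumes each tag (>name, >seq_restr, >sec_struct) sits on its own line with its value on the following line or lines, and it verifies that the restraints and the structure have the same length.

```python
def parse_design_input(text):
    """Collect the lines following each '>tag' line into {tag: [values]}.

    The tag-then-value layout is an assumption of this sketch.
    """
    fields = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields


def check_lengths(fields):
    """Raise ValueError unless restraints and structure have equal length."""
    restraints = fields["seq_restr"][0]
    structure = fields["sec_struct"][0]
    if len(restraints) != len(structure):
        raise ValueError(
            f"seq_restr has {len(restraints)} positions, "
            f"sec_struct has {len(structure)}"
        )
    return len(structure)


example = """>name
Design
>seq_restr
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>sec_struct
((((((.((((((((....))))).)).).))))))
"""
fields = parse_design_input(example)
length = check_lengths(fields)
```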

Design with Pseudoknots

Include different brackets for pseudoknot representation.

Note: Available brackets for pseudoknots include (), [], <>, {}, Aa, Bb, Cc, Dd. Up to three levels of pseudoknots are accepted.

Example:

```
>name
PseudoknotExample
>seq_restr
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>sec_struct
((((((....[[[[..))))))......]]]]....
```

Command:

DesiRNA.py -f Pseudoknot_design_input.txt -t 60
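
To show how a structure line with several bracket types maps to base pairs, here is a small illustrative parser. It keeps one stack per bracket type from the note above, so pairs written with different brackets may cross each other. The function name and the error handling are choices made for this sketch, not DesiRNA's own parser.

```python
# Closing symbol -> opening symbol, for the bracket types listed in the note above.
BRACKETS = {")": "(", "]": "[", ">": "<", "}": "{",
            "a": "A", "b": "B", "c": "C", "d": "D"}


def parse_pairs(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = set()
    for index, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(index)
        elif symbol in BRACKETS:
            opener = BRACKETS[symbol]
            if not stacks[opener]:
                raise ValueError(f"unmatched '{symbol}' at position {index}")
            pairs.add((stacks[opener].pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol '{symbol}' at position {index}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return pairs


def crosses(pair_a, pair_b):
    """True if two pairs interleave, which is what makes a pseudoknot."""
    (i, j), (k, l) = sorted([pair_a, pair_b])
    return i < k < j < l


pairs = parse_pairs("((((((....[[[[..))))))......]]]]....")
```

For the example structure, the pair formed by the outermost round brackets and the pair formed by the outermost square brackets interleave, so `crosses` returns True for them.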

Two-Chain RNA Complex Design

Include an & symbol in both the sequence restraints and structure lines.

Note: A maximum of two-chain RNAs can be designed.

Example:

```
>name
RNA-RNA ComplexExample
>seq_restr
NNNNNNNNNNNNNNNNN&NNNNNNNNNNNNNNNNNN
>sec_struct
(((.(((((....))..&(((....)))..))))))
```

Command:

DesiRNA.py -f RNA_RNA_complex_design_input.txt -t 60

Two-Chain Homodimer Design

Include an & symbol in both the sequence restraints and structure lines. The two sequences must be of the same length.

Note: Turn on the homodimer option -d on.

Example:

```
>name
HomodimerExample
>seq_restr
NNNNNNNNNNNNNNNNN&NNNNNNNNNNNNNNNNN
>sec_struct
((((....((((.....&))))....)))).....
```

Command:

DesiRNA.py -f Homodimer_design_input.txt -t 60 -d on
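
The two-chain rules can be checked with a few lines of code. The following sketch splits both lines at the & symbol, confirms there are exactly two chains, and, for homodimers, that both chains have the same length. It is a hypothetical pre-check; the additional requirement that the & sits at the same position in both lines is an assumption drawn from the examples above.

```python
def check_two_chain_input(seq_restr, sec_struct, homodimer=False):
    """Validate a two-chain input and return the two chain lengths."""
    restr_chains = seq_restr.split("&")
    struct_chains = sec_struct.split("&")
    if len(restr_chains) != 2 or len(struct_chains) != 2:
        raise ValueError("exactly one '&' is expected in each line")
    lengths = [len(chain) for chain in restr_chains]
    # Assumption: '&' must split both lines at the same position.
    if lengths != [len(chain) for chain in struct_chains]:
        raise ValueError("'&' is at different positions in the two lines")
    if homodimer and lengths[0] != lengths[1]:
        raise ValueError("homodimer chains must have the same length")
    return tuple(lengths)


complex_lengths = check_two_chain_input(
    "NNNNNNNNNNNNNNNNN&NNNNNNNNNNNNNNNNNN",
    "(((.(((((....))..&(((....)))..))))))",
)
homodimer_lengths = check_two_chain_input(
    "NNNNNNNNNNNNNNNNN&NNNNNNNNNNNNNNNNN",
    "((((....((((.....&))))....)))).....",
    homodimer=True,
)
```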

Design with Alternative Structures

Include additional lines for alternative structures.

Example:

```
>name
AltStructExample
>seq_restr
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>sec_struct
((((((.((((((((....))))).)).).))))))
>alt_sec_struct
(((((((((((((....)))..)).)).).))))).
(((((((((((((....)))))...)).).))))).
```

Command:

DesiRNA.py -f Alternative_structures_design_input.txt -t 60
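
When choosing alternative structures, it can be useful to see how far each one is from the main target. A common measure is the base-pair distance, the number of pairs present in one structure but not in the other. The sketch below computes it for nested structures written with round brackets only; it is an illustration under that restriction and is not taken from DesiRNA.

```python
def nested_pairs(structure):
    """Return the (i, j) pairs of a dot-bracket string that uses only '(' and ')'."""
    stack = []
    pairs = set()
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            pairs.add((stack.pop(), index))
    return pairs


def base_pair_distance(structure_a, structure_b):
    """Number of base pairs found in exactly one of the two structures."""
    return len(nested_pairs(structure_a) ^ nested_pairs(structure_b))


target = "((((((.((((((((....))))).)).).))))))"
alternatives = [
    "(((((((((((((....)))..)).)).).))))).",
    "(((((((((((((....)))))...)).).))))).",
]
distances = [base_pair_distance(target, alt) for alt in alternatives]
```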

Design with Seed Sequence

Include a seed sequence for the starting point of the design simulation.

Example:

```
>name
SeedSeqExample
>seq_restr
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>sec_struct
((((((.((((((((....))))).)).).))))))
>seed_seq
GCCCCGGCCCCCGGCGAAAGCCGGUGGAGGCGGGGC
```

Command:

DesiRNA.py -f Seed_sequence_design_input.txt -t 60

Sequence Restraints Dictionary:

The sequence restraints line uses IUPAC symbols such as N for any nucleotide, W for A or U, S for C or G, etc.

For more details on the IUPAC nomenclature, please refer to the Wikipedia page on nucleic acid notation.
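
As an illustration of how the IUPAC symbols constrain a sequence, the sketch below lists the standard IUPAC codes for RNA and checks whether a sequence (for instance a seed sequence) satisfies a restraints line. The function name is hypothetical and the check is not DesiRNA's own validation.

```python
# Standard IUPAC nucleotide codes, written for RNA (U instead of T).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def satisfies_restraints(sequence, restraints):
    """True if every nucleotide is allowed by the IUPAC symbol at its position."""
    if len(sequence) != len(restraints):
        return False
    return all(
        base in IUPAC_RNA[symbol]
        for base, symbol in zip(sequence, restraints)
    )


seed = "GCCCCGGCCCCCGGCGAAAGCCGGUGGAGGCGGGGC"
seed_ok = satisfies_restraints(seed, "N" * len(seed))
```

For example, `satisfies_restraints("GCAU", "NNWW")` is True because W allows A or U, while `satisfies_restraints("GCGU", "NNWW")` is False because G is not allowed at the third position.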

Output Files

The results of a simulation are organized into a specific directory, named according to the parameters and options used in the simulation. The directory includes the following files and subdirectory:

Main Directory

*_mid_results.csv: A CSV file containing intermediate results of the simulation.
*_replicas.png: A PNG image showing the design simulation process.
*_results.csv: A CSV file containing the final results of the simulation.
*_stats: A file containing statistical information about the simulation.
*.txt: The original input file used for the simulation.

Trajectory Files Subdirectory

Within the main directory, there is a trajectory_files folder containing additional files related to the simulation trajectory:

*_best_fasta.fas: A FASTA file containing the best sequence(s) from the simulation.
*_best_str: A file containing the best solution.
*_command: A file containing the command used to run the simulation.
*_multifasta.fas: A FASTA file containing all sequences from the simulation.
*_random.csv: A CSV file containing random sequences that comply with the desired structure.
*_replicas.csv: A CSV file containing various statistics for each replica used in the simulation.
*_traj.csv: A CSV file containing the trajectory of the simulation.
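
For post-processing, the files listed above can be located by their name patterns. The sketch below is a hypothetical helper that walks a result directory and groups file names by pattern. It makes no assumption about the contents or column layout of the CSV files, which are not described here.

```python
from pathlib import Path

MAIN_PATTERNS = ("*_mid_results.csv", "*_replicas.png", "*_results.csv",
                 "*_stats", "*.txt")
TRAJECTORY_PATTERNS = ("*_best_fasta.fas", "*_best_str", "*_command",
                       "*_multifasta.fas", "*_random.csv", "*_replicas.csv",
                       "*_traj.csv")


def collect_outputs(result_dir):
    """Map each file name pattern to the sorted names of the matching files."""
    root = Path(result_dir)
    found = {}
    for pattern in MAIN_PATTERNS:
        names = sorted(path.name for path in root.glob(pattern))
        if pattern == "*_results.csv":
            # '*_results.csv' would also match the intermediate results file.
            names = [n for n in names if not n.endswith("_mid_results.csv")]
        found[pattern] = names
    trajectory_dir = root / "trajectory_files"
    for pattern in TRAJECTORY_PATTERNS:
        key = "trajectory_files/" + pattern
        found[key] = sorted(path.name for path in trajectory_dir.glob(pattern))
    return found
```

Calling `collect_outputs("path/to/result_directory")` returns a dictionary whose keys are the patterns above; an empty list means no file with that pattern was found.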

Benchmark

The benchmark files are organized into two main directories: Eterna100V1_benchmark_results and Eterna100V1_inputs.

Eterna100V1_benchmark_results

This directory contains the results of the benchmark tests. The files include:

Eterna100V1_1h_results.txt: Results for the 1-hour benchmark test.
Eterna100V1_1min_results.txt: Results for the 1-minute benchmark test.
Eterna100V1_24h_results.txt: Results for the 24-hour benchmark test.
Eterna100V1_all_results.txt: Consolidated results for all benchmark tests.

Eterna100V1_inputs

This directory contains the input files used for the benchmark tests. There are 100 individual text files (eteV1_01.txt, eteV1_02.txt, ..., eteV1_100.txt) that contain the structures of 100 Eterna benchmark puzzles.

These files are organized in the same format as described in the Input Files section.
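
To run the whole benchmark in a loop, the input file names can be generated rather than typed. The sketch below builds the 100 paths and one command per puzzle, using only the -f and -t options described above. The two-digit zero padding for puzzles 1 to 99 is inferred from the names eteV1_01.txt and eteV1_02.txt, and the helper names are hypothetical.

```python
def benchmark_inputs(directory="Eterna100V1_inputs"):
    """Return the 100 benchmark input paths, eteV1_01.txt to eteV1_100.txt."""
    return [f"{directory}/eteV1_{number:02d}.txt" for number in range(1, 101)]


def benchmark_commands(time_limit=60):
    """Build one argument list per puzzle, suitable for subprocess.run."""
    return [
        ["DesiRNA.py", "-f", path, "-t", str(time_limit)]
        for path in benchmark_inputs()
    ]


inputs = benchmark_inputs()
commands = benchmark_commands(time_limit=60)
```

Each entry of `commands` can be passed to `subprocess.run` once DesiRNA.py is reachable through PATH.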

Example Files

The "example_files" directory contains a set of input and output files related to RNA sequence design. These files are used as examples and reference data for various design scenarios. The contents of this directory are organized into two main subdirectories:

Inputs

This subdirectory contains input files that serve as the starting point for RNA sequence design. Each input file represents a specific design scenario and includes the necessary information for the design process. The input files include:

Alternative_structures_design_input.txt: Input file for designing RNA sequences with alternative structures.
Homodimer_design_input.txt: Input file for designing RNA sequences involved in homodimer interactions.
Pseudoknot_design_input.txt: Input file for designing RNA sequences with pseudoknot structures.
RNA_RNA_complex_design_input.txt: Input file for designing RNA-RNA complex structures.
Seed_sequence_design_input.txt: Input file for designing RNA sequences based on a seed sequence.
Standard_design_input.txt: Input file for standard RNA sequence design.

Outputs

This subdirectory contains output files and data generated as a result of RNA sequence design simulations. Each subdirectory within "outputs" corresponds to a specific design scenario and is named based on the input file used for that scenario. The output files and data include:

Mid_results.csv: Intermediate results data during the design process.
Replicas.png: Replicas of the designed RNA structures.
Results.csv: Final results of the RNA sequence design.
Stats: Additional statistics related to the design process.
trajectory_files: Trajectory files that provide detailed information about the design trajectory, including sequence and structural data.

Tests

The "tests" directory contains a variety of test cases and scripts designed to evaluate the robustness and correctness of the program. These tests are essential for detecting errors and ensuring that the program behaves as expected. The contents of this directory include:

test_DesiRNA.sh: This script runs various test cases to assess the overall functionality and performance of the program.
test_DesiRNA_errors.sh: This script executes tests using the error input files to check if the program correctly identifies and handles errors.
Error_inputs: This subdirectory contains a set of input files that intentionally trigger specific error conditions. These files are used to test the program's error-handling capabilities and include scenarios such as invalid sequences, structural constraints, or formatting errors.

These tests are an integral part of quality assurance, helping to identify and address issues within the program. Users can run these tests to verify the correctness of their program installation and ensure that it can handle different error scenarios gracefully.

Citation

If you use DesiRNA in your research, please cite our paper:

DesiRNA: structure-based design of RNA sequences with a Monte Carlo approach (DOI: 10.1101/2023.06.04.543636)

The full citation details are listed in the CITATION.cff file. Your citation helps support our work and ensures that it reaches a wider audience.

For any inquiries related to the paper or the use of DesiRNA, please feel free to contact the authors.

License

DesiRNA is licensed under the Apache License, Version 2.0. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

For the full license text, please see the LICENSE file in the repository.

Contact

If you have any questions, need support, or would like to collaborate, please feel free to reach out to us:

Dr. Tomasz K. Wirecki: twirecki@iimcb.gov.pl
Prof. Janusz M. Bujnicki: janusz@iimcb.gov.pl

You can also learn more about our research and other projects on our lab's website: GeneSilico Lab

We welcome feedback and contributions to DesiRNA, and we look forward to hearing from you!

Owner

Name: Tom Wir
Login: fryzjergda
Kind: user
Location: Poland
Company: IIMCB in Warsaw

Repositories: 8
Profile: https://github.com/fryzjergda

Bioinformatician, PhD in Chemistry

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  DesiRNA: structure-based design of RNA sequences with 
  a Monte Carlo approach
message: RNA sequence design software
type: software
authors:
  - given-names: Tomasz K. Wirecki
    email: twirecki@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0001-6539-3640"
  - given-names: Grzegorz Lach
    email: gel@fuw.edu.pl
    affiliation: >-
      Institute of Theoretical Physics, Faculty of 
      Physics (FUW), University of Warsaw, Poland
    orcid: "https://orcid.org/0000-0003-4018-5217"
  - given-names: Farhang Jaryani
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0001-8374-6681"
  - given-names: Nagendar Goud Badepally
    email: nbadepally@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0003-4364-8811"
  - given-names: S. Naeim Moafinejad
    email: snmoafinejad@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0003-0397-2596"
  - given-names: Gaja Klaudel
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
  - given-names: Kalina Nec
    email: knec@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
  - given-names: Eugene F. Baulin
    email: ebaulin@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0003-4694-9783"
  - given-names: Janusz M. Bujnicki
    email: janusz@iimcb.gov.pl
    affiliation: >-
      International Institute of Molecular and Cell
      Biology in Warsaw, Warsaw, Poland
    orcid: "https://orcid.org/0000-0001-5758-9416"
identifiers:
  - type: doi
    value: 10.1101/2023.06.04.543636
  - type: url
    value: >-
      https://github.com/fryzjergda/DesiRNA
    description: GitHub repository
abstract: >-
  RNA sequences underpin the formation of complex and diverse structures, 
  subsequently governing their respective functional properties. Despite 
  the pivotal role RNA sequences play in cellular mechanisms, creating 
  optimized sequences that can predictably fold into desired structures 
  remains a significant challenge. We have developed DesiRNA, a versatile 
  Python-based software tool for RNA sequence design. This program considers 
  a comprehensive array of constraints, ranging from secondary structures 
  (including pseudoknots) and GC content, to the distribution of dinucleotides 
  emulating natural RNAs. Additionally, it factors in the presence or 
  absence of specific sequence motifs and prevents or promotes 
  oligomerization, thereby ensuring a robust and flexible design process. 
  DesiRNA utilizes the Monte Carlo algorithm for the selection and acceptance 
  of mutation sites. In tests on the EteRNA benchmark, DesiRNA displayed high 
  accuracy and computational efficiency, outperforming most existing RNA 
  design programs.
  
  The DesiRNA software is freely available at
  https://github.com/fryzjergda/DesiRNA
keywords:
  - rna
  - sequence design
  - Eterna benchmark
  - python
license: Apache-2.0

Science Score: 67.0%