beditor(v2)
A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing, and much more
Usage
🖱️ GUI-mode
beditor gui
Note: GUI is recommended for designing small libraries and prioritization of the guides.
▶️ CLI-mode
beditor cli --editor BE1 -m path/to/mutations.tsv -o path/to/output_directory/ --species human --ensembl-release 110
or
beditor cli -c beditor_config.yml
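When runs are launched from a script, the first command above can be assembled as an argument list. This is a minimal sketch that uses only the flags shown in this README; the helper name `build_cli_command` is hypothetical and not part of beditor.

```python
import shlex


def build_cli_command(editor, mutations_path, output_dir, species, ensembl_release):
    """Assemble the argument list for `beditor cli` (hypothetical helper)."""
    return [
        "beditor", "cli",
        "--editor", editor,
        "-m", mutations_path,
        "-o", output_dir,
        "--species", species,
        "--ensembl-release", str(ensembl_release),
    ]


cmd = build_cli_command(
    "BE1", "path/to/mutations.tsv", "path/to/output_directory/", "human", 110
)
# shlex.join renders the list as a shell-safe command string for logging.
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues in paths.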
Parameters
usage: beditor cli [--editor EDITOR] [-m MUTATIONS_PATH] [-o OUTPUT_DIR_PATH]
                   [--species SPECIES] [--ensembl-release ENSEMBL_RELEASE]
                   [--genome-path GENOME_PATH] [--gtf-path GTF_PATH]
                   [-r RNA_PATH] [-p PRT_PATH] [-c CONFIG_PATH]
                   [--search-window SEARCH_WINDOW] [-n] [-w WD_PATH]
                   [-t THREADS] [-k KERNEL_NAME] [-v VERBOSE]
                   [-i IGV_PATH_PREFIX] [--ext EXT] [-f] [-d] [--skip SKIP]

optional arguments:
  -h, --help
      show this help message and exit
  --editor EDITOR
      base-editing method, available methods can be listed using command: 'beditor resources'
  -m MUTATIONS_PATH, --mutations-path MUTATIONS_PATH
      path to the mutation file, the format of which is available at https://github.com/rraadd88/beditor/README.md#Input-format.
  -o OUTPUT_DIR_PATH, --output-dir-path OUTPUT_DIR_PATH
      path to the directory where the outputs should be saved.
  --species SPECIES
      species name.
  --ensembl-release ENSEMBL_RELEASE
      Ensembl release number.
  --genome-path GENOME_PATH
      path to the genome file, which is not available on Ensembl.
  --gtf-path GTF_PATH
      path to the gene annotations file, which is not available on Ensembl.
  -r RNA_PATH, --rna-path RNA_PATH
      path to the transcript sequences file, which is not available on Ensembl.
  -p PRT_PATH, --prt-path PRT_PATH
      path to the protein sequences file, which is not available on Ensembl.
  --search-window SEARCH_WINDOW
      number of bases to search on either side of a target, if not specified, it is inferred by beditor.
  -n, --not-be  False
      do not process as a base editor.
  -c CONFIG_PATH, --config-path CONFIG_PATH
      path to the configuration file.
  -w WD_PATH, --wd-path WD_PATH
      path to the working directory.
  -t THREADS, --threads THREADS  1
      number of threads for parallel processing.
  -k KERNEL_NAME, --kernel-name KERNEL_NAME  'beditor'
      name of the jupyter kernel.
  -v VERBOSE, --verbose VERBOSE  'WARNING'
      verbose, logging levels: DEBUG > INFO > WARNING > ERROR (default) > CRITICAL.
  -i IGV_PATH_PREFIX, --igv-path-prefix IGV_PATH_PREFIX
      prefix to be added to the IGV URL.
  --ext EXT
      file extensions of the output tables.
  -f, --force  False
  -d, --dbug  False
  --skip SKIP
      skip sections of the workflow

Examples:

Notes:
  Required parameters for assigning a species:
    species ensembl_release
    or
    genome_path gtf_path rna_path prt_path

Installation
Virtual environment and naming the kernel (recommended)
conda create -n beditor python=3.9 # options: conda/mamba, python=3.9/3.8
python -m ipykernel install --user --name beditor
Installation of the package
pip install beditor[all]
Optional dependencies, as required:
pip install beditor # only cli
pip install beditor[gui] # plus gui
For fast processing of large genomes (highly recommended for human genome):
conda install bioconda::ucsc-fatotwobit bioconda::ucsc-twobittofa bioconda::ucsc-twobitinfo # options: conda/mamba
Else, for moderately fast processing,
conda install bioconda::bedtools # options: conda/mamba
Input format
Note: The coordinates are 1-based (i.e. X:1-1 instead of X:0-1), and the IDs correspond to the chosen genome assembly (e.g. from Ensembl).
Point mutations
| chrom | start | end | strand | mutation |
|-------|-------|-----|--------|----------|
| 5 | 1123 | 1123 | + | C |
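A point-mutation input file with the columns above can be written with Python's standard library. This is a minimal sketch that assumes the file is tab-separated (suggested by the `mutations.tsv` name in the usage section); the helper name `write_point_mutations` is hypothetical and not part of beditor.

```python
import csv
import os
import tempfile

COLUMNS = ["chrom", "start", "end", "strand", "mutation"]


def write_point_mutations(rows, path):
    """Write point mutations as a tab-separated table (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for row in rows:
            # Input coordinates are 1-based, so 0 is never a valid start.
            if int(row["start"]) < 1 or int(row["end"]) < int(row["start"]):
                raise ValueError(f"invalid 1-based interval: {row}")
            writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mutations.tsv")
    write_point_mutations(
        [{"chrom": "5", "start": 1123, "end": 1123, "strand": "+", "mutation": "C"}],
        path,
    )
    with open(path, newline="") as handle:
        written = handle.read()
```

Dropping the `mutation` column from `COLUMNS` would give the position-scanning or region-scanning layout.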
Position scanning
| chrom | start | end | strand |
|-------|-------|-----|--------|
| 5 | 1123 | 1123 | + |
Region scanning
| chrom | start | end | strand |
|-------|-------|-----|--------|
| 5 | 1123 | 2123 | + |
Protein point mutations
| protein id | aa pos | mutation |
|------------|--------|----------|
| ENSP1123 | 43 | S |
Protein position scanning
| protein id | aa pos |
|------------|--------|
| ENSP1123 | 43 |
Protein region scanning
| protein id | aa start | aa end |
|------------|----------|--------|
| ENSP1123 | 43 | 143 |
Note: Ensembl protein IDs are used.
Output format
Note: The output uses 0-based coordinates.
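| guide sequence | guide locus | offtargets | score | {columns in the input} |
|----------------|-------------|------------|-------|------------------------|
| AGCGTTTGGCAAATCAAACAAAA | 4:1003215-1003238(+) | 0 | 1 | .. |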
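Because the input is 1-based and the output is 0-based, it can help to convert output loci before comparing them with the input. This is a minimal sketch that assumes the output locus is a 0-based, half-open interval; the example row agrees with that, since 1003238 - 1003215 equals the 23 nt of the guide sequence. The function names are hypothetical and this is not beditor's own `parse_locus`.

```python
import re

# Locus format taken from the example row: chrom:start-end(strand)
LOCUS_PATTERN = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


def parse_guide_locus(locus):
    """Split a locus string into (chrom, start, end, strand)."""
    match = LOCUS_PATTERN.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    return match["chrom"], int(match["start"]), int(match["end"]), match["strand"]


def to_one_based(start, end):
    """Convert a 0-based half-open interval to a 1-based inclusive interval."""
    return start + 1, end


chrom, start, end, strand = parse_guide_locus("4:1003215-1003238(+)")
one_based = to_one_based(start, end)
```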
Supported base editing methods
| method | nucleotide | nucleotide mutation | window start | window end | guide length | PAM | PAM position |
|-------------|------------|---------------------|--------------|------------|--------------|--------|--------------|
| A3A-BE3 | C | T | 4 | 8 | 20 | NGG | down |
| ABE7.10 | A | G | 4 | 7 | 20 | NGG | down |
| ABE7.10* | A | G | 4 | 8 | 20 | NGG | down |
| ABE7.9 | A | G | 5 | 8 | 20 | NGG | down |
| ABESa | A | G | 6 | 12 | 21 | NNGRRT | down |
| BE-PLUS | C | T | 4 | 14 | 20 | NGG | down |
| BE1 | C | T | 4 | 8 | 20 | NGG | down |
| BE2 | C | T | 4 | 8 | 20 | NGG | down |
| BE3 | C | T | 4 | 8 | 20 | NGG | down |
| BE4-Gam | C | T | 4 | 8 | 20 | NGG | down |
| BE4/BE4max | C | T | 4 | 8 | 20 | NGG | down |
| Cas12a-BE | C | T | 10 | 12 | 23 | TTTV | up |
| eA3A-BE3 | C | T | 4 | 8 | 20 | NGG | down |
| EE-BE3 | C | T | 5 | 6 | 20 | NGG | down |
| HF-BE3 | C | T | 4 | 8 | 20 | NGG | down |
| Sa(KKH)-ABE | A | G | 6 | 12 | 21 | NNNRRT | down |
| SA(KKH)-BE3 | C | T | 3 | 12 | 21 | NNNRRT | down |
| SaBE3 | C | T | 3 | 12 | 21 | NNGRRT | down |
| SaBE4 | C | T | 3 | 12 | 21 | NNGRRT | down |
| SaBE4-Gam | C | T | 3 | 12 | 21 | NNGRRT | down |
| Target-AID | C | T | 2 | 4 | 20 | NGG | down |
| Target-AID | C | T | 2 | 4 | 20 | NG | down |
| VQR-ABE | A | G | 4 | 6 | 20 | NGA | down |
| VQR-BE3 | C | T | 4 | 11 | 20 | NGAN | down |
| VRER-ABE | A | G | 4 | 6 | 20 | NGCG | down |
| VRER-BE3 | C | T | 3 | 10 | 20 | NGCG | down |
| xBE3 | C | T | 4 | 8 | 20 | NG | down |
| YE1-BE3 | C | T | 5 | 7 | 20 | NGG | down |
| YE2-BE3 | C | T | 5 | 6 | 20 | NGG | down |
| YEE-BE3 | C | T | 5 | 6 | 20 | NGG | down |
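To illustrate how the window columns relate to a guide, the sketch below lists the editable bases of a guide that fall inside the activity window. It assumes that window positions are 1-based, inclusive, and counted from the 5' end of the guide; that convention is an assumption here (it is common for PAM-"down" editors) and may differ for PAM-"up" editors such as Cas12a-BE. The function name and the example guide are hypothetical.

```python
def editable_positions(guide, nucleotide, window_start, window_end):
    """Return 1-based guide positions of `nucleotide` inside the window.

    Assumes positions are counted from the 5' end of the guide and that
    the window bounds are inclusive.
    """
    return [
        pos
        for pos, base in enumerate(guide.upper(), start=1)
        if window_start <= pos <= window_end and base == nucleotide
    ]


# BE3 row of the table: nucleotide C, window 4 to 8, guide length 20.
example_guide = "GACCATCGTAGCTAGCTAGC"
be3_targets = editable_positions(example_guide, "C", 4, 8)
```

More than one position in the result means that several bases could be edited together in the same window, which is the situation the co-edit functions in the API section deal with.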
Favorite base editor not listed?
Please send the required info using a PR, or an issue.
Change log
v2
New features:
1. Design libraries for base or amino acid mutational scanning, at defined positions and regions.
2. The gui contains library filtering and prioritization options.
3. Non-base editing applications, e.g. CRISPR-tiling, using not_be option.
Key updates:
1. Quicker installation due to reduced number of dependencies (bwa comes in the package, and samtools not needed).
2. Faster run-time, compared to v1, because of the improvements in the dependencies e.g. pandas etc.
3. Faster run-time on large genomes e.g. human genome, because of the use of 2bit tools.
4. Direct command-line options for non-model species, e.g. those not indexed on Ensembl.
5. Configuration made optional.
Technical updates:
1. The gui is powered by mercury, thus overcoming the limitations of v1.
2. Use of one base editor (method) per run, instead of multiple.
3. Due to overall faster run-times, parallelization within a run is disabled. However, multiple runs can be parallelized externally, e.g. using Python's built-in multiprocessing.
4. Only the sgRNAs for which the target lies within the optimal activity window are reported. Therefore, the penalty for a target lying outside the activity window is no longer applied, but the options are retained for backward compatibility.
5. Many refactored functions can now be imported and executed independently for "much more" applications.
6. Reports are generated for each run in the form of a jupyter notebook.
7. Automated testing on GitHub for continuous integration.
8. The cli is compatible with python 3.8 and 3.9 (higher versions are untested); however, the gui is not supported on python 3.7 due to a lack of dependencies.
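Since parallelization within a run is disabled, several independent runs can be launched side by side from a script. This is a minimal sketch using `multiprocessing.pool.ThreadPool` with `subprocess`; threads suffice here because each run is its own external process. The helper names are hypothetical, and the demo uses placeholder commands instead of real `beditor cli` calls.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_command(command):
    """Run one command (a list of arguments) and return its exit code."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode


def run_many(commands, workers=2):
    """Run several commands concurrently, keeping the input order."""
    with ThreadPool(processes=workers) as pool:
        return pool.map(run_command, commands)


# Placeholder commands; in practice each entry would be one beditor run,
# e.g. ["beditor", "cli", "-c", "<config for that run>"].
demo_commands = [[sys.executable, "-c", f"print('run {i}')"] for i in range(3)]
return_codes = run_many(demo_commands)
```

Giving each run its own output directory avoids concurrent writes to the same files.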
Future directions, for which contributions are welcome:
- [ ] Adding option to provide 0-based co-ordinates in the input.
Similar projects:
- http://www.rgenome.net/be-designer/
- http://yang-laboratory.com/BEable-GPS
- https://github.com/maxwshen/bepredictbystander
- https://github.com/maxwshen/bepredictefficiency
- https://fgcz-shiny.uzh.ch/PnBDesigner/
How to cite?
v2
Using BibTeX:
@software{Dandage_beditor,
  title = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
  author = {Dandage, Rohan},
  year = {2024},
  url = {https://doi.org/10.5281/zenodo.10648264},
  version = {v2.0.1},
  note = {The URL is a DOI link to the permanent archive of the software.},
}
Using citation information from CITATION.CFF file.
v1
1. Using BibTeX:
```
@software{Dandage_beditorv1,
  title = {beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing},
  author = {Dandage, Rohan},
  year = {2019},
  url = {https://doi.org/10.1534/genetics.119.302089},
  version = {v1},
}
```
API
module beditor.lib.get_mutations
Mutation co-ordinates using pyensembl
function get_protein_cds_coords
python
get_protein_cds_coords(annots, protein_id: str) → DataFrame
Get protein CDS coordinates
Args:
annots: pyensembl annotations
protein_id(str): protein ID
Returns:
pd.DataFrame: output table
function get_protein_mutation_coords
python
get_protein_mutation_coords(data: DataFrame, aapos: int, test=False) → tuple
Get protein mutation coordinates
Args:
data(pd.DataFrame): input table
aapos(int): amino acid position
test(bool, optional): test-mode. Defaults to False.
Raises:
ValueError: invalid positions
Returns:
tuple: aapos,start,end,seq
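The arithmetic that links an amino acid position to its codon can be sketched as follows. This is an illustration of the idea only, not beditor's implementation: it works on offsets within a contiguous coding sequence and ignores the mapping of those offsets to genomic coordinates across exons. The function names are hypothetical.

```python
def codon_offsets(aapos):
    """Return the 0-based half-open CDS offsets of the codon for a 1-based amino acid position."""
    if aapos < 1:
        raise ValueError("amino acid positions are 1-based")
    start = 3 * (aapos - 1)
    return start, start + 3


def codon_at(cds, aapos):
    """Return the codon of `cds` that encodes the amino acid at `aapos`."""
    start, end = codon_offsets(aapos)
    if end > len(cds):
        raise ValueError("position lies beyond the coding sequence")
    return cds[start:end]


# Amino acid 43 from the input-format example maps to CDS offsets 126 to 129.
offsets_43 = codon_offsets(43)
```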
function map_coords
python
map_coords(df_: DataFrame, df1_: DataFrame, verbose: bool = False) → DataFrame
Map coordinates
Args:
df_(pd.DataFrame): input table
Returns:
pd.DataFrame: output table
function get_mutation_coords_protein
python
get_mutation_coords_protein(
df0: DataFrame,
annots,
search_window: int,
outd: str = None,
force: bool = False,
verbose: bool = False
) → DataFrame
Get mutation coordinates for protein
Args:
df0(pd.DataFrame): input table
annots(type): pyensembl annotations
search_window(int): search window length on either side of the target
outd(str, optional): output directory path. Defaults to None.
force(bool, optional): force. Defaults to False.
verbose(bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
function get_mutation_coords
python
get_mutation_coords(
df0: DataFrame,
annots,
search_window: int,
verbose: bool = False,
**kws_protein
) → DataFrame
Get mutation coordinates
Args:
df0(pd.DataFrame): input table
annots(type): pyensembl annotation
search_window(int): search window length on either side of the target
verbose(bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
module beditor.lib.get_scores
Scores
function get_ppamdist
python
get_ppamdist(
guide_length: int,
pam_len: int,
pam_pos: str,
ppamdist_min: int
) → DataFrame
Get penalties set based on distances of the mismatch/es from PAM
:param guide_length: length of guide sequence
:param pam_len: length of PAM sequence
:param pam_pos: PAM location 3' or 5'
:param ppamdist_min: minimum penalty
:param pmutatpam: penalty for mismatch at PAM
TODOs: Use different scoring function for different methods.
function get_beditorscore_per_alignment
python
get_beditorscore_per_alignment(
NM: int,
alignment: str,
pam_len: int,
pam_pos: str,
pentalty_genic: float = 0.5,
pentalty_intergenic: float = 0.9,
pentalty_dist_from_pam: float = 0.1,
verbose: bool = False
) → float
Calculates beditor score per alignment between guide and genomic DNA.
:param NM: Hamming distance
:param mismatches_max: Maximum mismatches allowed in alignment
:param alignment: Symbol '|' means a match, '.' means mismatch and ' ' means gap. e.g. |||||.||||||||||.||||.|
:param pentalty_genic: penalty for genic alignment
:param pentalty_intergenic: penalty for intergenic alignment
:param pentalty_dist_from_pam: maximum penalty for a mismatch at PAM
:returns: beditor score per alignment.
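The alignment string described above can be turned into mismatch distances from the PAM, which is the quantity a distance-based penalty would act on. This is a minimal sketch, not beditor's scoring code. It assumes that the string spans the protospacer plus the PAM, that `pam_pos` takes the values "up" or "down" as in the table of supported methods, and that mismatches inside the PAM itself are ignored. The function name is hypothetical.

```python
def mismatch_distances_from_pam(alignment, pam_len, pam_pos):
    """Return distances (1 = adjacent to the PAM) of protospacer mismatches.

    '.' marks a mismatch in the alignment string. The PAM is assumed to
    occupy the last `pam_len` symbols ("down") or the first ones ("up").
    """
    if pam_pos == "down":
        protospacer = alignment[: len(alignment) - pam_len]
        return [
            len(protospacer) - index
            for index, symbol in enumerate(protospacer)
            if symbol == "."
        ]
    if pam_pos == "up":
        protospacer = alignment[pam_len:]
        return [index + 1 for index, symbol in enumerate(protospacer) if symbol == "."]
    raise ValueError(f"invalid PAM position: {pam_pos!r}")


example_alignment = "|||||.||||||||||.||||.|"
distances_down = mismatch_distances_from_pam(example_alignment, 3, "down")
```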
function get_beditorscore_per_guide
python
get_beditorscore_per_guide(
guide_seq: str,
strategy: str,
align_seqs_scores: DataFrame,
dBEs: DataFrame,
penalty_activity_window: float = 0.5,
test: bool = False
) → float
Calculates beditor score per guide.
:param guide_seq: guide sequence 23nts
:param strategy: strategy string eg. ABE;+;@-14;ACT:GCT;T:A;
:param align_seqs_scores: list of beditor scores per alignments for all the alignments between guide and genomic DNA
:param penalty_activity_window: if editable base is not in activity window, penalty_activity_window=0.5
:returns: beditor score per guide.
function revcom
python
revcom(s)
function calc_cfd
python
calc_cfd(wt, sg, pam)
function get_cfdscore
python
get_cfdscore(wt, off)
module beditor.lib.get_specificity
Specificities
function run_alignment
python
run_alignment(
src_path: str,
genomep: str,
guidesfap: str,
guidessamp: str,
guidel: int,
mismatches_max: int = 2,
threads: int = 1,
force: bool = False,
verbose: bool = False
) → str
Run alignment
Args:
src_path(str): source path
genomep(str): genome path
guidesfap(str): guide fasta path
guidessamp(str): guide sam path
threads(int, optional): threads. Defaults to 1.
force(bool, optional): force. Defaults to False.
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: alignment file.
function read_sam
python
read_sam(align_path: str) → DataFrame
read alignment file
Args:
align_path(str): path to the alignment file
Returns:
pd.DataFrame: output table
Notes:
| Tag | Meaning |
|-----|---------|
| NM | Edit distance |
| MD | Mismatching positions/bases |
| AS | Alignment score |
| BC | Barcode sequence |
| X0 | Number of best hits |
| X1 | Number of suboptimal hits found by BWA |
| XN | Number of ambiguous bases in the reference |
| XM | Number of mismatches in the alignment |
| XO | Number of gap opens |
| XG | Number of gap extensions |
| XT | Type: Unique/Repeat/N/Mate-sw |
| XA | Alternative hits; format: (chr,pos,CIGAR,NM;)* |
| XS | Suboptimal alignment score |
| XF | Support from forward/reverse alignment |
| XE | Number of supporting seeds |

Reference: https://bio-bwa.sourceforge.net/bwa.shtml
function parse_XA
python
parse_XA(XA: str) → DataFrame
Parse XA tags
Args:
XA(str): XA tag
Notes:
format: (chr,pos,CIGAR,NM;)
Example: XA='4,+908051,23M,0;4,+302823,23M,0;4,-183556,23M,0;4,+1274932,23M,0;4,+207765,23M,0;4,+456906,23M,0;4,-1260135,23M,0;4,+454215,23M,0;4,-1177442,23M,0;4,+955254,23M,1;4,+1167921,23M,1;4,-613257,23M,1;4,+857893,23M,1;4,-932678,23M,2;4,-53825,23M,2;4,+306783,23M,2;'
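The XA format above can be parsed with plain string operations. This is a minimal sketch that returns a list of dictionaries rather than the DataFrame returned by `parse_XA`; the function name `parse_xa_tag` and the output keys are hypothetical.

```python
def parse_xa_tag(xa):
    """Parse an XA tag value, (chr,pos,CIGAR,NM;)*, into a list of hits."""
    hits = []
    for entry in xa.strip().split(";"):
        if not entry:
            continue  # the trailing ';' leaves an empty last entry
        chrom, signed_pos, cigar, nm = entry.split(",")
        hits.append(
            {
                "chrom": chrom,
                "strand": signed_pos[0],  # the sign of pos encodes the strand
                "pos": int(signed_pos[1:]),
                "cigar": cigar,
                "NM": int(nm),
            }
        )
    return hits


example_hits = parse_xa_tag("4,+908051,23M,0;4,-183556,23M,0;4,-932678,23M,2;")
```

Filtering the hits by `NM` gives the alternative alignments at a given edit distance.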
function get_extra_alignments
python
get_extra_alignments(
df1: DataFrame,
genome: str,
bed_path: str,
alignments_max: int = 10,
threads: int = 1
) → DataFrame
Get extra alignments
Args:
df1(pd.DataFrame): input table
alignments_max(int, optional): alignments max. Defaults to 10.
threads(int, optional): threads. Defaults to 1.
Returns:
pd.DataFrame: output table
TODOs: 1. apply parallel processing to get_seq
function to_pam_coord
python
to_pam_coord(
pam_pos: str,
pam_len: int,
align_start: int,
align_end: int,
strand: str
) → tuple
Get PAM coords
Args:
pam_pos(str): PAM position
pam_len(int): PAM length
align_start(int): alignment start
align_end(int): alignment end
strand(str): strand
Returns:
tuple: start,end
function get_alignments
python
get_alignments(
align_path: str,
genome: str,
alignments_max: int,
pam_pos: str,
pam_len: int,
guide_len: int,
pam_pattern: str,
pam_bed_path: str,
extra_bed_path: str,
**kws_xa
) → DataFrame
Get alignments
Args:
align_path(str): alignment path
genome(str): genome path
pam_pos(str): PAM position
pam_len(int): PAM length
guide_len(int): sgRNA length
pam_pattern(str): PAM pattern
pam_bed_path(str): PAM bed path
Returns:
pd.DataFrame: output table
function get_penalties
python
get_penalties(
aligns: DataFrame,
guides: DataFrame,
annots: DataFrame
) → DataFrame
Get penalties
Args:
aligns(pd.DataFrame): alignments
guides(pd.DataFrame): guides
annots(pd.DataFrame): annotations
Returns:
pd.DataFrame: output table
function score_alignments
python
score_alignments(
df4: DataFrame,
pam_len: int,
pam_pos: str,
pentalty_genic: float = 0.5,
pentalty_intergenic: float = 0.9,
pentalty_dist_from_pam: float = 0.1,
verbose: bool = False
) → tuple
Score alignments
Args:
df4(pd.DataFrame): input table
pam_pos(str): PAM position
pentalty_genic(float, optional): penalty for offtarget in genic locus. Defaults to 0.5.
pentalty_intergenic(float, optional): penalty for offtarget in intergenic locus. Defaults to 0.9.
pentalty_dist_from_pam(float, optional): penalty for offtarget wrt distance from PAM. Defaults to 0.1.
verbose(bool, optional): verbose. Defaults to False.
Returns:
tuple: tables
Note:
1. Low value corresponds to high penalty and vice versa, because values are multiplied.
2. High penalty means consequential offtarget alignment and vice versa.
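The multiplicative convention in the note can be illustrated with a short sketch. It only shows how factors between 0 and 1 combine; it is not the scoring function used by beditor, and the function name `combine_penalties` is hypothetical.

```python
import math


def combine_penalties(penalties):
    """Multiply penalty factors in [0, 1]; a lower product is a heavier penalty."""
    for penalty in penalties:
        if not 0 <= penalty <= 1:
            raise ValueError(f"penalty outside [0, 1]: {penalty}")
    return math.prod(penalties)


# With the documented defaults, a genic off-target (0.5) lowers the value
# more than an intergenic one (0.9), i.e. it is treated as more consequential.
genic_only = combine_penalties([0.5])
intergenic_only = combine_penalties([0.9])
both = combine_penalties([0.5, 0.9])
```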
function score_guides
python
score_guides(
guides: DataFrame,
scores: DataFrame,
not_be: bool = False
) → DataFrame
Score guides
Args:
guides(pd.DataFrame): guides
scores(pd.DataFrame): scores
not_be(bool, optional): not a base editor. Defaults to False.
Returns:
pd.DataFrame: output table
Changes: penalty_activity_window is disabled, as only the sgRNAs with the target in the window are reported.
module beditor.lib.io
Input/Output
function download_annots
python
download_annots(species_name: str, release: int) → bool
Download annotations using pyensembl
Args:
species_name(str): species name
release(int): release number
Returns:
bool: whether annotation is downloaded or not
function cache_subdirectory
python
cache_subdirectory(
reference_name: str = None,
annotation_name: str = None,
annotation_version: int = None,
CACHE_BASE_SUBDIR: str = 'beditor'
) → str
Which cache subdirectory to use for a given annotation database over a particular reference. All arguments can be omitted to just get the base subdirectory for all pyensembl cached datasets.
Args:
reference_name(str, optional): reference name. Defaults to None.
annotation_name(str, optional): annotation name. Defaults to None.
annotation_version(int, optional): annotation version. Defaults to None.
CACHE_BASE_SUBDIR(str, optional): cache path. Defaults to 'beditor'.
Returns:
str: output path
function cached_path
python
cached_path(path_or_url: str, cache_directory_path: str)
When downloading remote files, the default behavior is to name local files the same as their remote counterparts.
function to_downloaded_cached_path
python
to_downloaded_cached_path(
url: str,
annots=None,
reference_name: str = None,
annotation_name: str = 'ensembl',
ensembl_release: str = None,
CACHE_BASE_SUBDIR: str = 'pyensembl'
) → str
To downloaded cached path
Args:
url(str): URL
annots(optional): pyensembl annotation. Defaults to None.
reference_name(str, optional): reference name. Defaults to None.
annotation_name(str, optional): annotation name. Defaults to 'ensembl'.
ensembl_release(str, optional): ensembl release. Defaults to None.
CACHE_BASE_SUBDIR(str, optional): cache path. Defaults to 'pyensembl'.
Returns:
str: output path
function download_genome
python
download_genome(
species: str,
ensembl_release: int,
force: bool = False,
verbose: bool = False
) → str
Download genome
Args:
species(str): species name
ensembl_release(int): release
force(bool, optional): force. Defaults to False.
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: output path
function read_genome
python
read_genome(genome_path: str, fast=True)
Read genome
Args:
genome_path(str): genome path
fast(bool, optional): fast mode. Defaults to True.
function to_fasta
python
to_fasta(
sequences: dict,
output_path: str,
molecule_type: str,
force: bool = True,
**kws_SeqRecord
) → str
Save fasta file.
Args:
sequences(dict): dictionary mapping the sequence name to the sequence.
output_path(str): path of the fasta file.
force(bool): overwrite if file exists.
Returns:
output_path(str): path of the fasta file
function to_2bit
python
to_2bit(
genome_path: str,
src_path: str = None,
force: bool = False,
verbose: bool = False
) → str
To 2bit
Args:
genome_path(str): genome path
src_path(str, optional): source path. Defaults to None.
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: output path
function to_fasta_index
python
to_fasta_index(
genome_path: str,
bgzip: bool = False,
bgzip_path: str = None,
threads: int = 1,
verbose: bool = True,
force: bool = False,
indexed: bool = False
) → str
To fasta index
Args:
genome_path(str): genome path
bgzip_path(str, optional): bgzip path. Defaults to None.
threads(int, optional): threads. Defaults to 1.
verbose(bool, optional): verbose. Defaults to True.
force(bool, optional): force. Defaults to False.
indexed(bool, optional): indexed or not. Defaults to False.
Returns:
str: output path
function to_bed
python
to_bed(
df: DataFrame,
outp: str,
cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → str
To bed path
Args:
df(pd.DataFrame): input table
outp(str): output path
cols(list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].
Returns:
str: output path
function read_bed
python
read_bed(
p: str,
cols: list = ['chrom', 'start', 'end', 'locus', 'score', 'strand']
) → DataFrame
Read bed file
Args:
p(str): path
cols(list, optional): columns. Defaults to ['chrom','start','end','locus','score','strand'].
Returns:
pd.DataFrame: output table
function to_viz_inputs
python
to_viz_inputs(
gtf_path: str,
genome_path: str,
output_dir_path: str,
output_ext: str = 'tsv',
threads: int = 1,
force: bool = False
) → dict
To viz inputs for the IGV
Args:
gtf_path(str): GTF path
genome_path(str): genome path
output_dir_path(str): output directory path
output_ext(str, optional): output extension. Defaults to 'tsv'.
threads(int, optional): threads. Defaults to 1.
force(bool, optional): force. Defaults to False.
Returns:
dict: configuration
function to_igv_path_prefix
python
to_igv_path_prefix() → str
Get IGV path prefix
Returns:
str: URL
function to_session_path
python
to_session_path(p: str, path_prefix: str = None, outp: str = None) → str
To session path
Args:
p(str): session configuration path
path_prefix(str, optional): path prefix. Defaults to None.
outp(str, optional): output path. Defaults to None.
Returns:
str: output path
function read_cytobands
python
read_cytobands(
cytobands_path: str,
col_chrom: str = 'chromosome',
remove_prefix: str = 'chr'
) → DataFrame
Read cytobands
Args:
cytobands_path(str): path
col_chrom(str, optional): column with contig. Defaults to 'chromosome'.
Returns:
pd.DataFrame: output table
function to_output
python
to_output(inputs: DataFrame, guides: DataFrame, scores: DataFrame) → DataFrame
To output table
Args:
inputs(pd.DataFrame): inputs
guides(pd.DataFrame): guides
scores(pd.DataFrame): scores
Returns:
pd.DataFrame: output table
module beditor.lib.make_guides
Designing the sgRNAs
function get_guide_pam
python
get_guide_pam(
match: str,
pam_stream: str,
guidel: int,
seq: str,
pos_codon: int = None
)
function get_pam_searches
python
get_pam_searches(dpam: DataFrame, seq: str, pos_codon: int) → DataFrame
Search PAM occurrence
:param dpam: dataframe with PAM sequences
:param seq: target sequence
:param pos_codon: reading frame
:param test: debug mode on
:returns dpam_searches: dataframe with positions of pams
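Searching for PAM occurrences can be sketched with regular expressions. The example below is an illustration of the general technique, not the implementation of `get_pam_searches`: it converts a PAM written with IUPAC codes into a regex and uses a lookahead so that overlapping matches are found. The names `pam_to_regex` and `find_pam_starts` are hypothetical, and only the forward strand is searched.

```python
import re

# IUPAC codes that occur in the PAMs listed in this README.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_to_regex(pam):
    """Translate a PAM such as 'NGG' into a regular expression."""
    return "".join(IUPAC_TO_REGEX[base] for base in pam.upper())


def find_pam_starts(seq, pam):
    """Return 0-based start positions of all, possibly overlapping, PAM matches."""
    # The zero-width lookahead lets the scan advance one base at a time.
    pattern = re.compile("(?=(" + pam_to_regex(pam) + "))")
    return [match.start() for match in pattern.finditer(seq.upper())]


ngg_starts = find_pam_starts("ATGGAGGTT", "NGG")
```

Searching the reverse-complemented PAM pattern in the same way would cover the minus strand.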
function get_guides
python
get_guides(
data: DataFrame,
dpam: DataFrame,
guide_len: int,
base_fraction_max: float = 0.8
) → DataFrame
Get guides
Args:
data(pd.DataFrame): input table
dpam(pd.DataFrame): table with PAM info
guide_len(int): guide length
base_fraction_max(float, optional): base fraction max. Defaults to 0.8.
Returns:
pd.DataFrame: output table
function to_locusby_pam
python
to_locusby_pam(
chrom: str,
pam_start: int,
pam_end: int,
pam_position: str,
strand: str,
length: int,
start_off: int = 0
) → str
To locus by PAM from PAM coords.
Args:
chrom(str): chrom
pam_start(int): PAM start
pam_end(int): PAM end
pam_position(str): PAM position
strand(str): strand
length(int): length
Returns:
str: locus
function to_pam_coord
python
to_pam_coord(
startf: int,
endf: int,
startp: int,
endp: int,
strand: str
) → tuple
To PAM coordinates
Args:
startf(int): start flank start
endf(int): start flank end
startp(int): start PAM start
endp(int): start PAM end
strand(str): strand
Returns:
tuple: start,end
function get_distances
python
get_distances(df2: DataFrame, df3: DataFrame, cfg_method: dict) → DataFrame
Get distances
Args:
df2(pd.DataFrame): input table #1
df3(pd.DataFrame): input table #2
cfg_method(dict): config for the method
Returns:
pd.DataFrame: output table
function get_windows_seq
python
get_windows_seq(s: str, l: str, wl: str, verbose: bool = False) → str
Sequence by guide strand
Args:
s(str): sequence
l(str): locus
wl(str): window locus
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: window sequence
function filter_guides
python
filter_guides(
df1: DataFrame,
cfg_method: dict,
verbose: bool = False
) → DataFrame
Filter sgRNAs
Args:
df1(pd.DataFrame): input table
cfg_method(dict): config of the method
verbose(bool, optional): verbose. Defaults to False.
Returns:
pd.DataFrame: output table
function get_window_target_overlap
python
get_window_target_overlap(
tstart: int,
tend: int,
wl: str,
ws: str,
nt: str,
verbose: bool = False
) → tuple
Get window target overlap
Args:
tstart(int): target start
tend(int): target end
wl(str): window locus
ws(str): window sequence
nt(str): nucleotide
verbose(bool, optional): verbose. Defaults to False.
Returns:
tuple: window_overlaps_the_target, wts, nt_in_overlap, wtl
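The core of a window and target overlap check is an interval intersection. This is a minimal sketch of that step alone, assuming 0-based half-open intervals; it does not reproduce the sequence handling of `get_window_target_overlap`, and the function name `interval_overlap` is hypothetical.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Return the overlap of two half-open intervals, or None if they are disjoint."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start < end else None


# A 5-nt activity window against a single-base target.
window_hit = interval_overlap(103, 108, 105, 106)
window_miss = interval_overlap(103, 108, 108, 109)
```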
function get_mutated_codon
python
get_mutated_codon(
ts: str,
tl: str,
tes: str,
tel: str,
strand: str,
verbose: bool = False
) → str
Get mutated codon
Args:
ts(str): target sequence
tl(str): target locus
tes(str): target edited sequence
tel(str): target edited locus
strand(str): strand
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: mutated codon
function get_coedits_base
python
get_coedits_base(
ws: str,
wl: str,
wts: str,
wtl: str,
nt: str,
verbose: bool = False
) → str
Get co-edited bases
Args:
ws(str): window sequence
wl(str): window locus
wts(str): window target overlap sequence
wtl(str): window target overlap locus
nt(str): nucleotide
verbose(bool, optional): verbose. Defaults to False.
Returns:
str: coedits
module beditor.lib
module beditor.lib.methods
Global Variables
- multint2reg
- multint2regcomplement
function dpam2dpam_strands
python
dpam2dpam_strands(dpam: DataFrame, pams: list) → DataFrame
Duplicates dpam dataframe to be compatible for searching PAMs on - strand
Args:
dpam(pd.DataFrame): dataframe with pam information
pams(list): pams to be used for actual designing of guides.
Returns:
pd.DataFrame: table
function get_be2dpam
python
get_be2dpam(
din: DataFrame,
methods: list = None,
test: bool = False,
cols_dpam: list = ['PAM', 'PAM position', 'guide length']
) → dict
Make BE to dpam mapping i.e. dict
Args:
din(pd.DataFrame): table with BE and PAM info all cols_dpam needed
methods(list, optional): method names. Defaults to None.
test(bool, optional): test-mode. Defaults to False.
cols_dpam(list, optional): columns to be used. Defaults to ['PAM', 'PAM position', 'guide length'].
Returns:
dict: output dictionary.
module beditor.lib.utils
Utilities
Global Variables
- cols_muts
- multint2reg
- multint2regcomplement
function get_src_path
python
get_src_path() → str
Get the beditor source directory path.
Returns:
str: path
function runbashcmd
python
runbashcmd(cmd: str, test: bool = False, logf=None)
Run a bash command
Args:
cmd(str): command
test(bool, optional): test-mode. Defaults to False.
logf(optional): log file instance. Defaults to None.
function log_time_elapsed
python
log_time_elapsed(start)
Log time elapsed.
Args:
start(datetime): start time
Returns:
datetime: difference in time.
function rescale
python
rescale(
a: np.array,
mn: float = None
) → np.array
Rescale a vector.
Args:
a(np.array): vector.
mn(float, optional): minimum value. Defaults to None.
Returns:
np.array: output vector
function get_nt2complement
python
get_nt2complement()
function s2re
python
s2re(s: str, ss2re: dict) → str
String to regex patterns
Args:
s(str): string
ss2re(dict): substrings to regex patterns.
Returns:
str: string with regex patterns.
function parse_locus
python
parse_locus(s: str, zero_based: bool = True) → tuple
Parse locus
Args:
s(str): location string.
zero_based(bool, optional): zero-based coordinates. Defaults to True.
Returns:
tuple: chrom, start, end, strand
Notes:
beditor outputs (including bed files) use 0-based loci.
pyensembl and IGV use 1-based locations.
function get_pos
python
get_pos(s: str, l: str, reverse: bool = True, zero_based: bool = True) → Series
Expand locus to positions mapped to nucleotides.
Args:
s(str): sequence
l(str): locus
reverse(bool, optional): reverse the - strand. Defaults to True.
zero_based(bool, optional): zero based coordinates. Defaults to True.
Returns:
pd.Series: output.
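Expanding a locus into per-base positions can be sketched as below. This is an illustration under stated assumptions rather than the behavior of `get_pos`: the interval is taken as 0-based half-open, the sequence is assumed to be given in the orientation of its strand, and a plain dictionary is returned instead of a pandas Series. The function name `positions_to_bases` is hypothetical.

```python
def positions_to_bases(seq, start, end, strand, reverse=True):
    """Map each genomic position of a 0-based half-open interval to its base.

    For the '-' strand with reverse=True, the first base of `seq` is paired
    with the highest genomic position.
    """
    if len(seq) != end - start:
        raise ValueError("sequence length does not match the interval")
    positions = list(range(start, end))
    if strand == "-" and reverse:
        positions.reverse()
    return dict(zip(positions, seq))


plus_map = positions_to_bases("ACG", 10, 13, "+")
minus_map = positions_to_bases("ACG", 10, 13, "-")
```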
function get_seq
python
get_seq(
genome: str,
contig: str,
start: int,
end: int,
strand: str,
out_type: str = 'str',
verbose: bool = False
) → str
Extract a sequence from a genome file based on start and end positions using streaming.
Args:
genome(str): The path to the genome file in FASTA format.
contig(str): chrom
start(int): start
end(int): end
strand(str): strand
out_type(str, optional): type of the output. Defaults to 'str'.
verbose(bool, optional): verbose. Defaults to False.
Raises:
ValueError: invalid strand.
Returns:
str: The extracted sequence.
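Streaming extraction from a FASTA file can be sketched with the standard library alone. The example reads line by line and keeps only the requested slice, so the contig is never loaded whole. It assumes 0-based half-open coordinates and is not the implementation of `get_seq`; the function name `extract_sequence` is hypothetical.

```python
import io

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_sequence(handle, contig, start, end, strand="+"):
    """Extract [start, end) of `contig` from an open FASTA handle."""
    if strand not in ("+", "-"):
        raise ValueError(f"invalid strand: {strand!r}")
    pieces = []
    offset = 0  # number of bases of the current contig read so far
    in_contig = False
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if in_contig:
                break  # the requested contig has ended
            name = (line[1:].split() or [""])[0]
            in_contig = name == contig
            offset = 0
            continue
        if not in_contig:
            continue
        line_start, line_end = offset, offset + len(line)
        if line_end > start and line_start < end:
            pieces.append(line[max(start - line_start, 0): end - line_start])
        offset = line_end
        if offset >= end:
            break  # everything requested has been collected
    seq = "".join(pieces)
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


FASTA_TEXT = ">1 example contig\nACGTAC\nGTTTAA\n>2\nCCCC\n"
plus_seq = extract_sequence(io.StringIO(FASTA_TEXT), "1", 4, 8)
minus_seq = extract_sequence(io.StringIO(FASTA_TEXT), "1", 6, 10, strand="-")
```

For large genomes, indexed access such as the 2bit tools mentioned in the installation section avoids scanning the file from the start.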
function read_fasta
python
read_fasta(
fap: str,
key_type: str = 'id',
duplicates: bool = False,
out_type='dict'
) → dict
Read fasta
Args:
fap(str): path
key_type(str, optional): key type. Defaults to 'id'.
duplicates(bool, optional): duplicates present. Defaults to False.
Returns:
dict: data.
Notes:
- If duplicates is set, key_type is set to description instead of id.
function format_coords
python
format_coords(df: DataFrame) → DataFrame
Format coordinates
Args:
df(pd.DataFrame): table
Returns:
pd.DataFrame: formatted table
function fetch_sequences_bp
python
fetch_sequences_bp(p: str, genome: str) → DataFrame
Fetch sequences using biopython.
Args:
p(str): path to the bed file.
genome(str): genome path.
Returns:
pd.DataFrame: sequences.
function fetch_sequences
python
fetch_sequences(
p: str,
genome_path: str,
outp: str = None,
src_path: str = None,
revcom: bool = True,
method='2bit',
out_type='df'
) → DataFrame
Fetch sequences
Args:
p(str): path to the bed file
genome_path(str): genome path
outp(str, optional): output path for fasta file. Defaults to None.
src_path(str, optional): source path. Defaults to None.
revcom(bool, optional): reverse-complement. Defaults to True.
method(str, optional): method name. Defaults to '2bit'.
out_type(str, optional): type of the output. Defaults to 'df'.
Returns:
pd.DataFrame: sequences.
function get_sequences
python
get_sequences(
df1: DataFrame,
p: str,
genome_path: str,
outp: str = None,
src_path: str = None,
revcom: bool = True,
out_type: str = 'df',
renames: dict = {},
**kws_fetch_sequences
) → DataFrame
Get sequences for the loci in a table
Args:
df1(pd.DataFrame): input table
p(str): path to the bed file
outp(str, optional): output path. Defaults to None.
src_path(str, optional): source path. Defaults to None.
revcom(bool, optional): reverse complement. Defaults to True.
out_type(str, optional): output type. Defaults to 'df'.
renames(dict, optional): renames. Defaults to {}.
Returns:
pd.DataFrame: output sequences
Notes:
- Input is 1-based.
- Output is 0-based.
- Saves the bed file and gets the sequences.
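The coordinate shift mentioned in these notes can be sketched as follows. This is a minimal illustration that assumes the usual conventions: 1-based coordinates are inclusive on both ends, and BED coordinates are 0-based and half-open. The function names are hypothetical and are not part of beditor.

```python
def one_based_to_bed(chrom, start, end, strand="+"):
    """Convert a 1-based, inclusive interval to a 0-based, half-open BED interval."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
    return (chrom, start - 1, end, strand)


def bed_to_one_based(chrom, start, end, strand="+"):
    """Convert a 0-based, half-open BED interval back to 1-based, inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    return (chrom, start + 1, end, strand)


bed_interval = one_based_to_bed("chr1", 100, 120)
round_trip = bed_to_one_based(*bed_interval)
```

Under these conventions the interval length is `end - start + 1` in 1-based coordinates and `end - start` in BED coordinates, so both forms describe the same 21 bases here.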
function to_locus
python
to_locus(
chrom: str = 'chrom',
start: str = 'start',
end: str = 'end',
strand: str = 'strand',
x: Series = None
) → str
To locus
Args:
- chrom (str, optional): chrom. Defaults to 'chrom'.
- start (str, optional): start. Defaults to 'start'.
- end (str, optional): end. Defaults to 'end'.
- strand (str, optional): strand. Defaults to 'strand'.
- x (pd.Series, optional): row of the dataframe. Defaults to None.
Returns:
str: locus
function get_flanking_seqs
python
get_flanking_seqs(
df1: DataFrame,
targets_path: str,
flanks_path: str,
genome: str = None,
search_window: list = None
) → DataFrame
Get flanking sequences
Args:
- df1 (pd.DataFrame): input table
- targets_path (str): target sequences path
- flanks_path (str): flank sequences path
- genome (str, optional): genome path. Defaults to None.
- search_window (list, optional): search window around the target. Defaults to None.
Returns:
pd.DataFrame: output table with sequences
function get_strand
python
get_strand(
genome,
df1: DataFrame,
col_start: str,
col_end: str,
col_chrom: str,
col_strand: str,
col_seq: str
) → DataFrame
Get strand by comparing the aligned and fetched sequence
Args:
- genome: genome instance
- df1 (pd.DataFrame): input table.
- col_start (str): start
- col_end (str): end
- col_chrom (str): chrom
- col_strand (str): strand
- col_seq (str): sequences
Returns:
pd.DataFrame: output table
Notes:
used for tests.
function reverse_complement_multintseq
python
reverse_complement_multintseq(seq: str, nt2complement: dict) → str
Reverse complement multi-nucleotide sequence
Args:
- seq (str): sequence
- nt2complement (dict): nucleotide to complement
Returns:
str: sequence
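A dictionary-driven reverse complement can be sketched as below. This is an illustration rather than beditor's code: the function name and the example mapping are assumptions, and the mapping covers only a subset of the IUPAC ambiguity codes.

```python
def reverse_complement_with_map(seq, nt2complement):
    """Reverse the sequence and replace each nucleotide using the given mapping."""
    # A KeyError is raised for any symbol missing from the mapping.
    return "".join(nt2complement[nt] for nt in reversed(seq))


# Hypothetical mapping: standard bases plus a few IUPAC ambiguity codes.
NT2COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G", "N": "N",
    "R": "Y", "Y": "R",  # purine (A/G) <-> pyrimidine (C/T)
    "S": "S", "W": "W",  # strong (G/C) and weak (A/T) are self-complementary
    "K": "M", "M": "K",  # keto (G/T) <-> amino (A/C)
}

pam_revcom = reverse_complement_with_map("NGG", NT2COMPLEMENT)
```

Passing the mapping as an argument keeps the function independent of the alphabet, so the same code handles plain DNA and PAM patterns written with ambiguity codes.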
function reverse_complement_multintseqreg
python
reverse_complement_multintseqreg(
seq: str,
multint2regcomplement: dict,
nt2complement: dict
) → str
Reverse complement multi-nucleotide regex patterns
Args:
- seq (str): description
- multint2regcomplement (dict): mapping.
- nt2complement (dict): nucleotide to complement
Returns:
str: regex pattern
function hamming_distance
python
hamming_distance(s1: str, s2: str) → int
Return the Hamming distance between equal-length sequences
Args:
- s1 (str): sequence #1
- s2 (str): sequence #2
Raises:
ValueError: Undefined for sequences of unequal length
Returns:
int: distance.
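The documented behaviour (a count of mismatching positions, with a `ValueError` for unequal lengths) can be sketched in a few lines. The function name below is hypothetical and the body is not copied from beditor.

```python
def count_mismatches(s1, s2):
    """Hamming distance: number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(a != b for a, b in zip(s1, s2))


distance = count_mismatches("GATTACA", "GACTATA")
```

For guide design, such a distance is a natural way to count mismatches between a guide and a candidate off-target site of the same length.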
function align
python
align(
q: str,
s: str,
test: bool = False,
psm: float = 2,
pmm: float = 0.5,
pgo: float = -3,
pge: float = -1
) → str
Creates a pairwise local alignment between sequences.
Args:
- q (str): query
- s (str): subject
- test (bool, optional): test-mode. Defaults to False.
Returns:
str: alignment with symbols.
Notes:
REF: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

The match parameters are (CODE: DESCRIPTION):
- x: No parameters. Identical characters have score of 1, otherwise 0.
- m: A match score is the score of identical chars, otherwise mismatch score.
- d: A dictionary returns the score of any pair of characters.
- c: A callback function returns scores.

The gap penalty parameters are (CODE: DESCRIPTION):
- x: No gap penalties.
- s: Same open and extend gap penalties for both sequences.
- d: The sequences have different open and extend gap penalties.
- c: A callback function returns the gap penalties.
function get_orep
python
get_orep(seq: str) → int
Get the overrepresentation
function get_polyt_length
python
get_polyt_length(s: str) → int
Counts the length of the longest polyT stretch (RNA pol3 terminator) in sequence
Args:
- s (str): sequence in string format
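Finding the longest polyT stretch can be sketched with a regular expression. The function name is hypothetical, and uppercasing the input is an assumption made here so that soft-masked (lowercase) sequence is counted as well.

```python
import re


def longest_polyt_run(s):
    """Length of the longest run of consecutive T's in a sequence (0 if there is none)."""
    runs = re.findall(r"T+", s.upper())
    return max((len(run) for run in runs), default=0)


longest = longest_polyt_run("ACTTTTGTT")
```

A cutoff on this value can then be used to discard guides that contain a terminator-like stretch of T's.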
function get_annots_installed
python
get_annots_installed() → DataFrame
Get a list of annotations installed.
Returns:
pd.DataFrame: output.
function get_annots
python
get_annots(
species_name: str = None,
release: int = None,
gtf_path: str = None,
transcript_path: str = None,
protein_path: str = None,
reference_name: str = 'assembly',
annotation_name: str = 'source',
verbose: bool = False,
**kws_Genome
)
Get pyensembl annotation instance
Args:
- species_name (str, optional): species name. Defaults to None.
- release (int, optional): release number. Defaults to None.
- gtf_path (str, optional): GTF path. Defaults to None.
- transcript_path (str, optional): transcripts path. Defaults to None.
- protein_path (str, optional): protein path. Defaults to None.
- reference_name (str, optional): reference name. Defaults to 'assembly'.
- annotation_name (str, optional): annotation name. Defaults to 'source'.
- verbose (bool, optional): verbose. Defaults to False.
Returns: pyensembl annotation instance
function to_pid
python
to_pid(annots, gid: str) → str
To protein ID
Args:
- annots: pyensembl annotation instance
- gid (str): gene ID
Returns:
str: protein ID
function to_one_based_coordinates
python
to_one_based_coordinates(df: DataFrame) → DataFrame
To one based coordinates
Args:
df(pd.DataFrame): input table
Returns:
pd.DataFrame: output table.
module beditor.lib.viz
Visualizations.
function to_igv
python
to_igv(
cfg: dict = None,
gtf_path: str = None,
genome_path: str = None,
output_dir_path: str = None,
threads: int = 1,
output_ext: str = None,
force: bool = False
) → str
To IGV session file.
Args:
- cfg (dict, optional): configuration of the run. Defaults to None.
- gtf_path (str, optional): path to the gtf file. Defaults to None.
- genome_path (str, optional): path to the genome file. Defaults to None.
- output_dir_path (str, optional): path to the output directory. Defaults to None.
- threads (int, optional): threads. Defaults to 1.
- output_ext (str, optional): extension of the output. Defaults to None.
- force (bool, optional): force. Defaults to False.
Returns:
str: path to the session file.
function get_nt_composition
python
get_nt_composition(seqs: list) → DataFrame
Get nt composition.
Args:
seqs(list): list of sequences
Returns:
pd.DataFrame: table with the frequencies of the nucleotides.
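A frequency table of this kind can be sketched with the standard library alone. The sketch assumes that frequencies are computed per position across equal-length sequences, which the description above does not state explicitly, and it returns a list of dicts instead of a `pd.DataFrame`. The function name is hypothetical.

```python
from collections import Counter


def positional_nt_frequencies(seqs, alphabet="ACGT"):
    """Per-position nucleotide frequencies for a list of equal-length sequences."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    rows = []
    # zip(*seqs) yields one tuple per position, holding that position's nucleotides.
    for column in zip(*seqs):
        counts = Counter(column)
        rows.append({nt: counts.get(nt, 0) / len(seqs) for nt in alphabet})
    return rows


freqs = positional_nt_frequencies(["ACGT", "ACGA", "TCGT", "ACGT"])
```

Each row corresponds to one position, so the result can be turned into a table with positions as rows and nucleotides as columns.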
function plot_ntcompos
python
plot_ntcompos(
seqs: list,
pam_pos: str,
pam_len: int,
window: list = None,
ax: Axes = None,
color_pam: str = 'lime',
color_window: str = 'gold'
) → Axes
Plot nucleotide composition
Args:
- seqs (list): list of sequences.
- pam_pos (str): PAM position.
- pam_len (int): PAM length.
- window (list, optional): activity window bounds. Defaults to None.
- ax (plt.Axes, optional): subplot. Defaults to None.
- color_pam (str, optional): color of the PAM. Defaults to 'lime'.
- color_window (str, optional): color of the window. Defaults to 'gold'.
Returns:
plt.Axes: subplot
function plot_ontarget
python
plot_ontarget(
guide_loc: str,
pam_pos: str,
pam_len: int,
guidepam_seq: str,
window: list = None,
show_title: bool = False,
figsize: list = [10, 2],
verbose: bool = False,
kws_sg: dict = {}
) → Axes
Plot the on-target site of an sgRNA.
Args:
- guide_loc (str): sgRNA locus
- pam_pos (str): PAM position
- pam_len (int): PAM length
- guidepam_seq (str): sgRNA and PAM sequence
- window (list, optional): activity window bounds. Defaults to None.
- show_title (bool, optional): show the title. Defaults to False.
- figsize (list, optional): figure size. Defaults to [10,2].
- verbose (bool, optional): verbose. Defaults to False.
- kws_sg (dict, optional): keyword arguments to plot the sgRNA. Defaults to {}.
Returns:
plt.Axes: subplot
TODOs:
1. convert to 1-based coordinates
2. features from the GTF file
function get_plot_inputs
python
get_plot_inputs(df2: DataFrame) → list
Get plot inputs.
Args:
df2(pd.DataFrame): table.
Returns:
list: list of tables.
function plot_library_stats
python
plot_library_stats(
dfs: list,
palette: dict = {True: 'b', False: 'lightgray'},
cutoffs: dict = None,
not_be: bool = True,
dbug: bool = False,
figsize: list = [10, 2.5]
) → list
Plot library stats
Args:
- dfs (list): list of tables.
- palette (type, optional): color palette. Defaults to {True:'b',False:'lightgray'}.
- cutoffs (dict, optional): cutoffs to be applied. Defaults to None.
- not_be (bool, optional): not a base editor. Defaults to True.
- dbug (bool, optional): debug mode. Defaults to False.
- figsize (list, optional): figure size. Defaults to [10,2.5].
Returns:
list: list of subplots.
module beditor.run
Command-line options
function validate_params
python
validate_params(parameters: dict) → bool
Validate the parameters.
Args:
parameters(dict): parameters
Returns:
bool: whether the parameters are valid or not
function cli
python
cli(
editor: str = None,
mutations_path: str = None,
output_dir_path: str = None,
species: str = None,
ensembl_release: int = None,
genome_path: str = None,
gtf_path: str = None,
rna_path: str = None,
prt_path: str = None,
search_window: int = None,
not_be: bool = False,
config_path: str = None,
wd_path: str = None,
threads: int = 1,
kernel_name: str = 'beditor',
verbose='WARNING',
igv_path_prefix=None,
ext: str = None,
force: bool = False,
dbug: bool = False,
skip=None,
**kws
)
beditor command-line (CLI)
Args:
- editor (str, optional): base-editing method, available methods can be listed using command: 'beditor resources'. Defaults to None.
- mutations_path (str, optional): path to the mutation file, the format of which is available at https://github.com/rraadd88/beditor/README.md#Input-format. Defaults to None.
- output_dir_path (str, optional): path to the directory where the outputs should be saved. Defaults to None.
- species (str, optional): species name. Defaults to None.
- ensembl_release (int, optional): Ensembl release number. Defaults to None.
- genome_path (str, optional): path to the genome file, which is not available on Ensembl. Defaults to None.
- gtf_path (str, optional): path to the gene annotations file, which is not available on Ensembl. Defaults to None.
- rna_path (str, optional): path to the transcript sequences file, which is not available on Ensembl. Defaults to None.
- prt_path (str, optional): path to the protein sequences file, which is not available on Ensembl. Defaults to None.
- search_window (int, optional): number of bases to search on either side of a target; if not specified, it is inferred by beditor. Defaults to None.
- not_be (bool, optional): do not process as a base editor. Defaults to False.
- config_path (str, optional): path to the configuration file. Defaults to None.
- wd_path (str, optional): path to the working directory. Defaults to None.
- threads (int, optional): number of threads. Defaults to 1.
- kernel_name (str, optional): name of the jupyter kernel. Defaults to "beditor".
- verbose (str, optional): logging level, one of DEBUG > INFO > WARNING > ERROR > CRITICAL. Defaults to "WARNING".
- igv_path_prefix (type, optional): prefix to be added to the IGV URL. Defaults to None.
- ext (str, optional): file extensions of the output tables. Defaults to None.
- force (bool, optional): overwrite the outputs if they exist. Defaults to False.
- dbug (bool, optional): debug mode (developer). Defaults to False.
- skip (type, optional): skip sections of the workflow (developer). Defaults to None.
Examples: beditor cli -c inputs/mutations/protein/positions.yml
Notes:
Required parameters for a run: editor, mutations_path and output_dir_path, or config_path.
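The requirement above can be read as "either a configuration file, or all three of editor, mutations file and output directory". A minimal sketch of that check follows. It encodes this reading as an assumption, it is not the logic of `validate_params`, and the function name is hypothetical.

```python
def has_required_parameters(parameters):
    """Return True if a config path is given, or if editor, mutations and output paths all are."""
    # Assumption: a configuration file alone is sufficient for a run.
    if parameters.get("config_path"):
        return True
    required = ("editor", "mutations_path", "output_dir_path")
    return all(parameters.get(name) for name in required)


from_config = has_required_parameters({"config_path": "beditor_config.yml"})
from_flags = has_required_parameters(
    {
        "editor": "BE1",
        "mutations_path": "path/to/mutations.tsv",
        "output_dir_path": "path/to/output_directory/",
    }
)
incomplete = has_required_parameters({"editor": "BE1"})
```

The two valid cases mirror the two command forms shown in the usage section: one with `-c` only, and one with `--editor`, `-m` and `-o`.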
function gui
python
gui()
function resources
python
resources()
Owner
- Login: rraadd88
- Kind: user
- Repositories: 3
- Profile: https://github.com/rraadd88
Citation (CITATION.cff)
cff-version: 1.2.0
title: 'beditor: A Computational Workflow for Designing Libraries of sgRNAs for CRISPR-Mediated Base Editing'
message: If you use this software, please cite it using the metadata from this file.
type: software
authors:
  - given-names: Rohan
    family-names: Dandage
    orcid: https://orcid.org/0000-0002-6421-2067
identifiers:
  - type: doi
    value: 10.5281/zenodo.10648264
repository-code: https://github.com/rraadd88/beditor
version: v2.0.1
date-released: '2024-02-11'
Packages
- Total packages: 1
- Total downloads: pypi 62 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 20
- Total maintainers: 1
pypi.org: beditor
A computational workflow for designing libraries of guide RNAs for CRISPR base editing
- Homepage: https://github.com/rraadd88/beditor
- Documentation: https://beditor.readthedocs.io/
- License: General Public License v. 3
- Latest release: 2.0.1 (published about 2 years ago)
Dependencies
- PySimpleGUI *
- datacache ==1.1.4
- dna_features_viewer ==0.1.9
- matplotlib ==2.2.2
- pandas ==0.23.3
- pysam ==0.14.1
- requests ==2.19.1
- scipy ==1.1.0
- seaborn ==0.8.1
- tqdm ==4.23.4
- biopython ==1.71
- datacache ==1.1.4
- dna_features_viewer ==0.1.9
- matplotlib ==2.2.2
- numpy ==1.21.0
- pandas ==0.23.3
- pysam ==0.14.1
- regex ==2018.7.11
- requests ==2.20.0
- scipy ==1.1.0
- seaborn ==0.8.1
- tqdm ==4.23.4
- biopython ==1.71
- datacache ==1.1.4
- dna_features_viewer ==0.1.9
- matplotlib ==2.2.2
- numpy ==1.21.0
- pandas *
- pyensembl ==1.4.0
- pysam ==0.14.1
- pyyaml *
- regex ==2018.7.11
- requests ==2.20.0
- scipy ==1.1.0
- seaborn ==0.8.1
- tqdm ==4.23.4
