https://github.com/cancerit/valiant

Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments



Repository


Basic Info
  • Host: GitHub
  • Owner: cancerit
  • License: agpl-3.0
  • Language: Python
  • Default Branch: develop
  • Homepage:
  • Size: 62.8 MB
Statistics
  • Stars: 5
  • Watchers: 11
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed over 2 years ago

README.md

VaLiAnT

The Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments.

A selection of libraries is included in the examples directory, including all necessary inputs and instructions to generate them.

Please also see the VaLiAnT Wiki for more information on use cases.

Citation

Please cite this paper when using VaLiAnT for your publications:

Variant Library Annotation Tool (VaLiAnT): an oligonucleotide library design and annotation tool for saturation genome editing and other deep mutational scanning experiments.

Barbon L, Offord V, Radford EJ, Butler AP, Gerety SS, Adams DJ, Tan HK, Waters AJ.

Bioinformatics. 2022 Jan 27;38(4):892-899.

DOI: 10.1093/bioinformatics/btab776. PMID: 34791067; PMCID: PMC8796380.

Usage

See the command line interface section for the full list and a more detailed description of the parameters.

Input parameters:

  • species
  • assembly
  • 5' adaptor (optional)
  • 3' adaptor (optional)

Main command input files:

SGE input files:

  • SGE targeton file (TSV)
  • reference genome sequence (FASTA)
  • reference genome index (FAI)
  • PAM protection file (VCF, optional)
  • VCF manifest file (CSV, optional)
  • custom variant files (VCF, optional)
  • features file (GTF/GFF2, optional)
  • codon table with frequencies (CSV, optional)
  • background variant file (VCF, optional)
  • background variant mask file (BED, optional)

cDNA input files:

Output files:

The reference directory should contain both the FASTA file and its index; e.g., if the FASTA file is named genome.fa, a genome.fa.fai file should also be present in the same directory. When running the tool in a container, the directory containing both files should therefore be mounted.
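As a minimal sketch of the naming convention described above, the following check could be run before launching the tool. The function names are hypothetical and not part of VaLiAnT.

```python
from pathlib import Path


def fasta_index_path(fasta_path: str) -> Path:
    """Return the expected index path: the FASTA file name plus '.fai'."""
    fasta = Path(fasta_path)
    return fasta.with_name(fasta.name + ".fai")


def has_fasta_index(fasta_path: str) -> bool:
    """Check that the FASTA file and its index sit in the same directory."""
    return Path(fasta_path).is_file() and fasta_index_path(fasta_path).is_file()
```

When using a container, the same check applies to the host directory that is going to be mounted.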

The features file (SGE-only gff option) is required to detect exonic regions in the targeton, and should therefore be provided in most circumstances. The files should only contain features for one transcript per gene (the gene_id and transcript_id attributes are required to perform this check). The features file should match the assembly of the target reference genome. Any features of type other than CDS and UTR are ignored.

If the codon-table option is not set, the default codon table is used.

Ambiguous nucleotides are not allowed in the reference sequence. Soft-masking is ignored.

Oligonucleotides exceeding a given length (max-length option) will not be included in the unique oligonucleotide files and their metadata will be stored in separate files marked as 'excluded'.

Background variants, if provided, are applied before PAM protection variants. When the CDS features of a transcript are provided (via a GTF/GFF2 file), such variants are applied over a wider range to keep the annotation consistent across targetons. This range is the minimal one that spans at least the entire CDS; it is then extended to the boundaries of any targeton overlapping the CDS, and finally to the start position of the first and the end position of the last background variant intersecting the resulting range.
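The range computation described above can be sketched as follows. This is an illustration under assumptions, not VaLiAnT's implementation: ranges are taken to be one-based and end-inclusive, "overlapping the CDS" is read as overlapping the span from the first to the last CDS feature, and all names are hypothetical.

```python
from typing import Iterable, Tuple

Range = Tuple[int, int]  # (start, end), assumed one-based and end-inclusive


def overlaps(a: Range, b: Range) -> bool:
    """True if the two inclusive ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def extend(current: Range, others: Iterable[Range]) -> Range:
    """Extend the current range to cover every other range that overlaps it."""
    hits = [r for r in others if overlaps(r, current)]
    start = min([current[0]] + [s for s, _ in hits])
    end = max([current[1]] + [e for _, e in hits])
    return start, end


def background_range(
    cds: Iterable[Range], targetons: Iterable[Range], variants: Iterable[Range]
) -> Range:
    """Span the CDS, then extend to overlapping targetons, then to variants."""
    cds = list(cds)
    span = (min(s for s, _ in cds), max(e for _, e in cds))
    span = extend(span, targetons)
    return extend(span, variants)
```

For instance, with CDS features at 100-200 and 300-400, a targeton at 80-120, and background variants at 78-82 and 399-405, the sketch yields the range 78-405.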

Background variants may be filtered out by position by providing a set of genomic ranges to be excluded via a BED file (bg-mask option). Excluding frame-shifting variants may affect the annotation.

By default, errors are raised when background variants are not synonymous or shift the reading frame; the force-bg-ns and force-bg-indels flags may be passed to allow them.

Mutations that overlap background variants are discarded; warnings are always raised to identify them, with their positions expressed in absolute genomic coordinates for custom variants, and targeton-relative coordinates for pattern variants.

Python package

After installing the package in an appropriate virtual environment:

```sh
valiant sge \
    "${TARGETONS_FILE}" \
    "${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

Alternatively, a configuration file can be provided to the main command:

```sh
valiant -c config.json
```

Docker image

After building or pulling the Docker image (quay.io/wtsicgp/valiant:X.X.X, where X.X.X is a version tag):

```sh
docker run \
    -v "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -v "${HOST_REF}":"${REF_DIR}":ro \
    -v "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    valiant \
    valiant sge \
    "${INPUT_DIR}/${TARGETONS_FILE}" \
    "${REF_DIR}/${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${INPUT_DIR}/${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

The HOST_* environment variables represent local paths to be mounted by the Docker container.

Singularity image

After pulling the Docker image with Singularity:

```sh
singularity exec \
    --cleanenv \
    -B "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -B "${HOST_REF}":"${REF_DIR}":ro \
    -B "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    ${SINGULARITY_IMAGE} \
    valiant sge \
    "${INPUT_DIR}/${TARGETONS_FILE}" \
    "${REF_DIR}/${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${INPUT_DIR}/${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

Command line interface

Separate subcommands are provided depending on the target sequence origin:

  • SGE (sge): sequences from reference, genomic coordinates, three target regions;
  • cDNA DMS (cdna): user-provided sequences, relative coordinates, single target region.

The arguments and a few options are the same for both subcommands (see the valiant (sge|cdna) section), but the file formats may vary.

valiant

Main command.

|Option|Format|Default|Description|
|-|-|-|-|
|version|flag|false|Show the version of the tool and quit.|
|config|file path|-|Path to the configuration file.|

valiant (sge|cdna)

Arguments and options required or supported by both subcommands. The format of the input files may be different for SGE and cDNA targets (see the argument descriptions).

|Argument|Format|Description|
|-|-|-|
|OLIGO_INFO|file path|Path to the targeton file (SGE or cDNA format).|
|REF_FASTA|file path|Path to the FASTA file of the target reference genome (sge) or cDNA sequences (cdna).|
|OUTPUT|file path|Output path (should exist already).|
|SPECIES|species name|Target species, to be reported in the oligonucleotide metadata.|
|ASSEMBLY|assembly name|Target assembly, to be reported in the oligonucleotide metadata.|

|Option|Format|Default|Description|
|-|-|-|-|
|codon-table|file path|-|Path to a codon table with frequencies.|
|max-length|integer|300|Maximum oligonucleotide length.|
|adaptor-5|DNA sequence|-|DNA sequence to be added at the 5' end of the oligonucleotide.|
|adaptor-3|DNA sequence|-|DNA sequence to be added at the 3' end of the oligonucleotide.|
|log|log level|WARNING|Name of the preferred log level (see the official documentation of the logging module).|

valiant sge

Options specific to SGE.

The REF_FASTA path is expected to point to a reference genome in FASTA format.

|Option|Format|Default|Description|
|-|-|-|-|
|gff|file path|-|Path to GTF/GFF2 file containing CDS and UTR features; one transcript per gene only.|
|bg|file path|-|Path to a background variant VCF file.|
|pam|file path|-|Path to a PAM protection file.|
|vcf|file path|-|Path to a VCF manifest file.|
|revcomp-minus-strand|flag|false|For minus strand targets, include the reverse complement of the mutated reference sequence in the oligonucleotide.|
|sequences-only|flag|false|Generate the reference sequence retrieval quality check file and quit.|
|mask_bg_fp|file path|-|Path to a BED file to exclude background variants from being applied to the specified genomic intervals.|
|force-bg-ns|flag|false|Allow non-synonymous background variants.|
|force-bg-indels|flag|false|Allow frame-shifting background variants.|

valiant cdna

Options specific to cDNA DMS.

The REF_FASTA path is expected to point to a multi-FASTA containing cDNA sequences.

|Option|Format|Default|Description|
|-|-|-|-|
|annot|file path|-|Path to a cDNA annotation file.|

Mutation types

Types of mutation that apply to any target (label):

Types of mutation that apply to CDS targets only (label):

Variants imported from VCF files are labelled as custom.

Parametric deletion

Non-overlapping stretches of nucleotides of a given length are deleted starting from a given offset. No partial deletions are performed at the end of the target regions. Format: <SPAN>del[<OFFSET>] (the offset is assumed to be zero if not set).

For backwards compatibility, in the metadata table, 1del0 is reported as 1del.

Given the target ACGTAAA, span two, and start offset zero (2del0), e.g.:

GTAAA ACAAA ACGTA

With start offset one (2del1), e.g.:

ATAAA ACGAA ACGTA
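The rule above can be sketched in a few lines of Python. This is an illustrative reimplementation, not VaLiAnT's own code, and the function names are hypothetical; the label parser assumes the `<SPAN>del[<OFFSET>]` format exactly as described.

```python
import re
from typing import List, Tuple


def parse_deletion_label(label: str) -> Tuple[int, int]:
    """Parse '<SPAN>del[<OFFSET>]' into (span, offset); the offset defaults to zero."""
    match = re.fullmatch(r"(\d+)del(\d*)", label)
    if match is None:
        raise ValueError(f"not a parametric deletion label: {label!r}")
    span, offset = match.groups()
    return int(span), int(offset or 0)


def parametric_deletions(target: str, span: int, offset: int = 0) -> List[str]:
    """Delete non-overlapping stretches of `span` nucleotides starting at `offset`."""
    if span < 1 or offset < 0:
        raise ValueError("span must be positive and offset non-negative")
    # Stop early enough that no partial deletion is attempted at the end.
    starts = range(offset, len(target) - span + 1, span)
    return [target[:i] + target[i + span:] for i in starts]
```

Calling `parametric_deletions("ACGTAAA", *parse_deletion_label("2del1"))` reproduces the second example above.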

Single-nucleotide variant

Each nucleotide is replaced with all the alternatives.

Given the target AA, e.g.:

CA GA TA AC AG AT

For CDS targets, the resulting amino acid change is reported.
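A minimal sketch of the substitution step, for illustration only (the function name is hypothetical and the amino acid annotation for CDS targets is left out):

```python
from typing import List

NUCLEOTIDES = "ACGT"


def single_nucleotide_variants(target: str) -> List[str]:
    """Replace each nucleotide in turn with each of the three alternatives."""
    return [
        target[:i] + alt + target[i + 1:]
        for i, ref in enumerate(target)
        for alt in NUCLEOTIDES
        if alt != ref
    ]
```

For the target `AA` this yields the six sequences listed above, in the same order.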

In-frame deletion

Only for CDS targets.

Delete each triplet so that the reading frame is preserved.

Given the target GAAATTTGG with frame 2, e.g.:

GTTTGG GAAAGG
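The example can be reproduced with the sketch below. It assumes that the frame is the zero-based position, within its codon, of the first nucleotide of the target, which is the reading consistent with the example above; this interpretation and the function name are assumptions rather than documented behaviour.

```python
from typing import List


def inframe_deletions(target: str, frame: int = 0) -> List[str]:
    """Delete each complete triplet, skipping partial codons at either end."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    # Assumption: `frame` is the codon position of the first nucleotide,
    # so the first complete codon starts (3 - frame) % 3 bases in.
    first_codon_start = (3 - frame) % 3
    starts = range(first_codon_start, len(target) - 2, 3)
    return [target[:i] + target[i + 3:] for i in starts]
```

With `GAAATTTGG` and frame 2, the complete codons are `AAA` and `TTT`, and deleting each gives the two sequences shown above.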

Alanine codon substitution

Only for CDS targets.

Replace each codon with the top-ranking alanine codon.

Given the target GCAAAATTT, with GCC being the top-ranking alanine codon, e.g.:

GCCAAATTT GCAGCCTTT GCAAAAGCC

Stop codon substitution

Only for CDS targets.

Replace each codon with the top-ranking stop codon.

Given the target TAACCCGGG, with TGA being the top-ranking stop codon, e.g.:

TGACCCGGG TAATGAGGG TAACCCTGA
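The alanine and stop codon substitutions follow the same pattern, sketched below for a target that starts in frame and whose length is a multiple of three. The function name is hypothetical, and the sketch does not address how a codon that already equals the replacement is handled.

```python
from typing import List


def codon_substitutions(target: str, codon: str) -> List[str]:
    """Replace each codon of an in-frame target with the given codon."""
    if len(codon) != 3 or len(target) % 3 != 0:
        raise ValueError("expected a triplet and an in-frame target")
    return [
        target[:i] + codon + target[i + 3:]
        for i in range(0, len(target), 3)
    ]
```

The replacement codon is the top-ranking alanine or stop codon from the codon table, `GCC` and `TGA` in the examples above.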

All amino acid codon substitution

Only for CDS targets.

Replace each codon with the top-ranking codon of all amino acids. Given the default codon table, this results in 19 mutated sequences for each codon mapping to an amino acid (the reference amino acid being excluded) and 20 for each stop codon.

Given the target AAATGA on the plus strand, e.g. (each column representing the sequences generated from one codon):

|Codon 1 (AAA)|Codon 2 (TGA)|
|-|-|
|ATCTGA|AAAATC|
|ATGTGA|AAAATG|
|ACCTGA|AAAACC|
|AACTGA|AAAAAC|
| |AAAAAG|
|AGCTGA|AAAAGC|
|CGGTGA|AAACGG|
|CTGTGA|AAACTG|
|CCCTGA|AAACCC|
|CACTGA|AAACAC|
|CAGTGA|AAACAG|
|GTGTGA|AAAGTG|
|GCCTGA|AAAGCC|
|GACTGA|AAAGAC|
|GAGTGA|AAAGAG|
|GGCTGA|AAAGGC|
|TTCTGA|AAATTC|
|TACTGA|AAATAC|
|TGCTGA|AAATGC|
|TGGTGA|AAATGG|

SNVRE

Only for CDS targets.

Given a set of SNVs, replace triplets according to the following rules:

  • if the SNV results in a synonymous mutation, replace the triplet with all the synonymous triplets of the variant
  • if the SNV results in a missense mutation, replace the triplet with the top-ranking synonymous triplet of the variant
  • if the SNV results in a nonsense mutation, replace the triplet with the top-ranking stop codon

For a given missense or nonsense SNV mutation, if the resulting triplet is already the top-ranking one, the second highest ranking triplet is used to generate the SNVRE mutation instead.

For example, given the following SNVs for the sequence AAAAGT:

|mseq|ref|alt|
|-|-|-|
|CAAAGT|K|Q|
|GAAAGT|K|E|
|TAAAGT|K|STOP|
|ACAAGT|K|T|
|AGAAGT|K|R|
|ATAAGT|K|I|
|AACAGT|K|N|
|AAGAGT|K|K|
|AATAGT|K|N|
|...| | |
|AAAAGC|S|S|
|...| | |

There is only one synonymous mutation for the first triplet (AAGAGT), but since lysine maps to only two codons and one of them is the reference, no SNVRE variants are generated from it. The synonymous mutation for the second triplet (AAAAGC), however, results in the top-ranking codon for serine; since serine maps to six codons, the following four SNVRE variants are generated:

|mseq|ref|alt|
|-|-|-|
|AAATCA|S|S|
|AAATCC|S|S|
|AAATCG|S|S|
|AAATCT|S|S|

For missense mutations, the top-ranking codon (the current being excluded) for each alternative amino acid replaces the reference sequence:

|mseq|ref|alt|snv|snvre|
|-|-|-|-|-|
|CAAAGT|K|Q|CAA|CAG|
|GAAAGT|K|E|GAA|GAG|
|ACAAGT|K|T|ACA|ACC|
|AGAAGT|K|R|AGA|CGG|
|ATAAGT|K|I|ATA|ATC|
|AACAGT|K|N|AAC|AAT|
|AATAGT|K|N|AAT|AAC|

The resulting SNVRE variants would be:

|mseq|ref|alt|
|-|-|-|
|CAGAGT|K|Q|
|GAGAGT|K|E|
|ACCAGT|K|T|
|CGGAGT|K|R|
|ATCAGT|K|I|
|AATAGT|K|N|
|AACAGT|K|N|

For the nonsense mutation:

|mseq|ref|alt|snv|snvre|
|-|-|-|-|-|
|TAAAGT|K|STOP|TAA|TGA|

The resulting SNVRE variant would be:

|mseq|ref|alt|
|-|-|-|
|TGAAGT|K|STOP|

Unique codons do not generate SNVRE variants.
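The selection rules can be condensed into a short sketch. It is not the tool's implementation: the codon ranking below is hypothetical, covers only a handful of amino acids, and is ordered merely so that it agrees with the examples in this section.

```python
from typing import List

# Hypothetical ranking (highest first); only the top-ranking codons are
# implied by the examples, the rest of the order is made up.
RANKED_CODONS = {
    "K": ["AAG", "AAA"],
    "Q": ["CAG", "CAA"],
    "E": ["GAG", "GAA"],
    "N": ["AAC", "AAT"],
    "S": ["AGC", "TCC", "TCT", "TCA", "AGT", "TCG"],
    "W": ["TGG"],
    "STOP": ["TGA", "TAA", "TAG"],
}
CODON_TO_AA = {c: aa for aa, codons in RANKED_CODONS.items() for c in codons}


def snvre_codons(ref_codon: str, snv_codon: str) -> List[str]:
    """Return the replacement triplets for the codon produced by an SNV."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[snv_codon]
    # The triplet produced by the SNV itself is never reused.
    candidates = [c for c in RANKED_CODONS[alt_aa] if c != snv_codon]
    if ref_aa == alt_aa:
        # Synonymous: all remaining synonymous triplets except the reference.
        return sorted(c for c in candidates if c != ref_codon)
    # Missense or nonsense: the best-ranking remaining triplet, if any.
    return candidates[:1]
```

An amino acid with a single codon leaves no candidates, so no SNVRE variant is produced for it.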

Custom variants

Applied to the targeton reference sequence as a whole. Only simple variants such as the following are supported:

  • substitutions
  • insertions (see below)
  • deletions (see below)
  • indels

The classification of the variants is based exclusively on the POS, REF, and ALT fields to be agnostic with respect to the VCF source.

In the VCF format, insertion and deletion positions refer to the base preceding the event, and the reference and alternative sequences both include that preceding base (or the following base, if the variant starts at position one). For consistency with the conventions adopted for generated mutations, such variants are reported in the metadata table shifted right by one, with the preceding (or following) base omitted from the reference and alternative sequences.
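A sketch of the coordinate shift for the common case in which the shared base precedes the event. The function name is hypothetical, and the position-one case (shared base following the event) is deliberately not handled.

```python
from typing import Tuple


def shift_simple_indel(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Drop the shared preceding base of a pure insertion or deletion."""
    is_insertion = len(ref) == 1 and len(alt) > 1 and alt[0] == ref
    is_deletion = len(alt) == 1 and len(ref) > 1 and ref[0] == alt
    if is_insertion or is_deletion:
        # Shift right by one and omit the preceding base from both alleles.
        return pos + 1, ref[1:], alt[1:]
    # Substitutions and other variants are returned unchanged.
    return pos, ref, alt
```

Under this sketch, a VCF insertion `POS=10, REF=A, ALT=ATT` becomes position 11 with an empty reference and `TT` as the alternative.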

Installation

Some of the dependencies are unsupported on Windows, and the tool cannot therefore be installed natively on it. The following options are available:

The instructions that follow apply to Linux and macOS.

Python virtual environment

Please take care to read errors during the dependency installation step carefully. HTSlib (pysam) has system dependencies and will highlight the packages that need to be installed.

Requirements:

  • Python 3.11 or above

To install in a virtual environment:

```sh
# Initialise the virtual environment
python3.11 -m venv .env

# Activate the virtual environment
source .env/bin/activate

# Install the valiant package
pip install .
```

Docker image

To build the Docker container:

```sh
docker build -t valiant .
```

File formats

Configuration file

JSON file collecting the execution parameters. It is always generated as an output (config.json) and can optionally be used as input by the main command, e.g.:

```sh
valiant -c config.json
```

|Property|Format|Description|
|-|-|-|
|appName|valiant|Name of the application (constant).|
|appVersion|x.y.z|Version of the application.|
|mode|sge\|cdna|Execution mode.|
|params|object|Execution parameters.|

An application version mismatch will result in a warning.

The execution parameters depend on the execution mode, and each corresponds to one of the command line arguments or options.

Common parameters

|CLI argument|JSON property|
|-|-|
|oligo_info_fp|oligoInfoFilePath|
|ref_fasta_fp|refFASTAFilePath|
|output_dir|outputDirPath|
|species|species|
|assembly|assembly|

|CLI option|JSON property|
|-|-|
|adaptor-5|adaptor5|
|adaptor-3|adaptor3|
|min-length|minOligoLength|
|max-length|maxOligoLength|
|codon-table|codonTableFilePath|

SGE parameters

|CLI option|JSON property|
|-|-|
|revcomp-minus-strand|reverseComplementOnMinusStrand|
|gff|GFFFilePath|
|bg|backgroundVCFFilePath|
|pam|PAMProtectionVCFFilePath|
|vcf|customVCFManifestFilePath|
|mask_bg_fp|maskBackgroundFilePath|
|force-bg-ns|forceBackgroundNonSynonymous|
|force-bg-indels|forceBackgroundFrameShifting|
|include-no-op-oligo|includeNoOpOligo|

Example:

```json
{
    "appName": "valiant",
    "appVersion": "4.0.0",
    "mode": "sge",
    "params": {
        "species": "homo sapiens",
        "assembly": "GRCh38",
        "adaptor5": "AATGATACGGCGACCACCGA",
        "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
        "minOligoLength": 1,
        "maxOligoLength": 300,
        "codonTableFilePath": null,
        "backgroundVCFFilePath": null,
        "oligoInfoFilePath": "parameter_input_files/brca1_nuc_targeton_input.txt",
        "refFASTAFilePath": "reference_input_files/chr17.fa",
        "outputDirPath": "brca1_nuc_output",
        "reverseComplementOnMinusStrand": true,
        "includeNoOpOligo": false,
        "GFFFilePath": "reference_input_files/ENST00000357654.9.gtf",
        "PAMProtectionVCFFilePath": "parameter_input_files/brca1_protection_edits.vcf",
        "customVCFManifestFilePath": "reference_input_files/brca1_custom_variants_manifest.csv",
        "maskBackgroundFilePath": null,
        "forceBackgroundNonSynonymous": false,
        "forceBackgroundFrameShifting": false
    }
}
```

cDNA parameters

|CLI option|JSON property|
|-|-|
|annot|annotationFilePath|

Example:

```json
{
    "appName": "valiant",
    "appVersion": "4.0.0",
    "mode": "cdna",
    "params": {
        "species": "human",
        "assembly": "pCW57.1",
        "adaptor5": "AATGATACGGCGACCACCGA",
        "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
        "minOligoLength": 1,
        "maxOligoLength": 300,
        "codonTableFilePath": null,
        "oligoInfoFilePath": "examples/cdna/input/cdna_targeton.tsv",
        "refFASTAFilePath": "examples/cdna/input/BRCA1_NP_009225_1_pCW57_1.fa",
        "outputDirPath": "examples/cdna/output",
        "annotationFilePath": "examples/cdna/input/cdna_annot.tsv"
    }
}
```
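For scripts that consume a configuration file independently of the tool, the top-level properties could be checked roughly as follows. This is a hedged sketch: the function name is hypothetical, and only the properties documented in this section (appName, appVersion, mode, params) are inspected.

```python
import json
import warnings
from typing import Tuple


def read_config(text: str, running_version: str) -> Tuple[str, dict]:
    """Parse a configuration document and return (mode, params)."""
    config = json.loads(text)
    if config.get("appName") != "valiant":
        raise ValueError("unexpected appName")
    mode = config.get("mode")
    if mode not in ("sge", "cdna"):
        raise ValueError(f"unexpected mode: {mode!r}")
    # A version mismatch is reported as a warning rather than an error.
    if config.get("appVersion") != running_version:
        warnings.warn("configuration generated by a different application version")
    return mode, config["params"]
```

The returned `params` object can then be mapped onto the command line arguments and options using the tables above.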

SGE targeton file

Tab-separated values (TSV) file describing the reference sequence coordinates and the types of mutation to be applied to the three target regions therein contained (collectively referred to as targeton). Multiple types of mutations can be applied to each target region. The coordinates of the target regions are derived from the genomic range of the second target region and an extension vector describing the lengths of the preceding and following regions.

Duplicate mutation types in any given group within the action vector are ignored.

Spacing is ignored when parsing the extension and action vectors.

The chromosome name needs to match the naming in the GTF/GFF2 file and in the reference genome.

|Field|Format|Description|
|-|-|-|
|ref_chr|string|Chromosome name.|
|ref_strand|+ or -|DNA strand.|
|ref_start|integer|Start position of the reference sequence.|
|ref_end|integer|End position of the reference sequence.|
|r2_start|integer|Start position of the second target region.|
|r2_end|integer|End position of the second target region.|
|ext_vector|`<int>, <int>`|Lengths of the first and third target regions.|
|action_vector|`(<str>, ...), (<str>, ...), (<str>, ...)`|Type of mutation labels grouped by target region.|
|sgrna_vector|`<str>, ...`|sgRNA identifiers matching with SGRNA tags in the PAM protection VCF file.|

Example:

```
ref_chr	ref_strand	ref_start	ref_end	r2_start	r2_end	ext_vector	action_vector	sgrna_vector
chrX	+	41334132	41334320	41334253	41334297	25, 15	(1del), (1del, snv), (1del)	sgrna_1, sgrna_2
```
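The way the three target regions follow from `r2_start`, `r2_end`, and the extension vector can be sketched as below. The sketch assumes inclusive coordinates and that the first and third regions are immediately adjacent to the second; the function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

Region = Optional[Tuple[int, int]]


def target_regions(r2_start: int, r2_end: int, ext_vector: str) -> Tuple[Region, Region, Region]:
    """Derive the inclusive ranges of target regions 1, 2, and 3."""
    # int() tolerates surrounding whitespace, so spacing is ignored.
    ext_1, ext_3 = (int(x) for x in ext_vector.split(","))
    r1 = (r2_start - ext_1, r2_start - 1) if ext_1 > 0 else None
    r3 = (r2_end + 1, r2_end + ext_3) if ext_3 > 0 else None
    return r1, (r2_start, r2_end), r3


def parse_action_vector(action_vector: str) -> List[List[str]]:
    """Split the action vector into one list of mutation labels per region."""
    groups = []
    for group in re.findall(r"\(([^)]*)\)", action_vector):
        labels: List[str] = []
        for label in (item.strip() for item in group.split(",")):
            # Duplicate labels within a group are ignored.
            if label and label not in labels:
                labels.append(label)
        groups.append(labels)
    return groups
```

For the example row, the first region would span 41334228-41334252 and the third 41334298-41334312.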

cDNA targeton file

Tab-separated values (TSV) file describing the target cDNA and the types of mutation to be applied to the target region therein contained (expressed in relative coordinates). Multiple types of mutations can be applied to the target region.

The cDNA identifier (seq_id) has to correspond to an entry in the multi-FASTA and (optionally) annotation files.

|Field|Format|Description|
|-|-|-|
|seq_id|string|cDNA identifier.|
|targeton_start|integer|Targeton start position.|
|targeton_end|integer|Targeton stop position.|
|r2_start|integer|Target region start position.|
|r2_end|integer|Target region stop position.|
|action_vector|`<str>, ...`|Type of mutation labels.|

Example:

```
seq_id	targeton_start	targeton_end	r2_start	r2_end	action_vector
ENST00000357654.9	114	121	114	121	snv,1del,snvre
ENST00000357654.9	114	150	120	130	1del,2del0
```

cDNA annotation file

TSV file describing the CDS region of each cDNA in relative coordinates (one-based and end-inclusive). Gene and transcript identifiers can also be provided.

|Field|Format|Description|
|-|-|-|
|seq_id|string|cDNA identifier.|
|gene_id|string|Gene ID.|
|transcript_id|string|Transcript ID.|
|cds_start|string|cDNA CDS relative start position.|
|cds_end|string|cDNA CDS relative end position.|

Example:

```
seq_id	gene_id	transcript_id	cds_start	cds_end
brca1_357654.9	ENSG00000012048.23	ENST00000357654.9	114	5705
```

VCF manifest file

CSV file listing the VCF files from which to import variants. Each VCF file is given an alias. If a tag is specified (vcf_id_tag), the VCF INFO field will be expected to contain it and its values will be used as variant identifiers; if no tag is specified, the ID field will be used instead.

|Field|Format|Description|
|-|-|-|
|vcf_alias|string|VCF file alias.|
|vcf_id_tag|VCF tag|(Optional) Variant ID tag.|
|vcf_path|file path|VCF file path.|

Example:

```
vcf_alias,vcf_id_tag,vcf_path
clinvar_1,ALLELEID,clinvar_abc.vcf
clinvar_2,ALLELEID,clinvar_xyz.vcf
gnomad,,gnomad_abc.vcf
```
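A manifest in this format can be read with the standard `csv` module. The sketch below (hypothetical function and key names) also records where the variant identifiers would be taken from, following the rule described above.

```python
import csv
import io
from typing import Dict, List


def read_vcf_manifest(text: str) -> List[Dict[str, str]]:
    """Parse the manifest and note the source of the variant identifiers."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        tag = (row.get("vcf_id_tag") or "").strip()
        entries.append({
            "alias": row["vcf_alias"],
            "path": row["vcf_path"],
            # With a tag, IDs come from that INFO tag; otherwise from the ID field.
            "id_source": f"INFO/{tag}" if tag else "ID",
        })
    return entries
```

Applied to the example, the two ClinVar entries would use the `ALLELEID` INFO tag and the gnomAD entry the ID field.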

PAM protection VCF file

VCF file containing single-nucleotide substitution variants linked to sgRNA identifiers via the SGRNA tag.

Example:

```
##fileformat=VCFv4.3
##INFO=
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chrX	41334252	.	G	C	.	.	SGRNA=sgRNA1
chrX	41337416	.	C	T	.	.	SGRNA=sgRNA2
chrX	41339064	.	G	A	.	.	SGRNA=sgRNA3
chrX	41341504	.	T	C	.	.	SGRNA=sgRNA4
chrX	41341509	.	G	A	.	.	SGRNA=sgRNA_4
```

Oligonucleotide metadata file

Comma-separated values (CSV) file containing name, label, and all metadata of the oligonucleotides generated for any given targeton.

For cDNA targets, the reference chromosome (ref_chr) and strand (ref_strand) will be missing and all positions will be relative to the cDNA sequence. All fields related to PAM protection (pam_seq) and custom VCF variants (vcf_alias, vcf_var_id, and vcf_var_in_const), features unavailable for this target type, will also be empty (except for vcf_var_in_const, which will be set to zero).

The MAVE-HGVS strings are all linear genomic (relative to the start of the targeton) and do not include the reference. Because in HGVS insertion positions are described by the flanking nucleotides, those occurring at either end of the reference sequence should be treated differently (see the 3' rule in the relevant HGVS documentation); for consistency between SGE and cDNA mode, simplicity, and given the limited usefulness of liminal insertions, this is not the case in the current implementation, and therefore the invalid position zero might be found in insertion names.

Array fields use the semicolon as separator.

|Index|Field|Format|Description|
|-|-|-|-|
|1|oligo_name|string|Name of the oligonucleotide.|
|2|species|species name|Species.|
|3|assembly|assembly name|Assembly.|
|4|gene_id|string|Gene ID.|
|5|transcript_id|string|Transcript ID.|
|6|src_type|ref\|cdna|Sequence source type (reference genome or cDNA).|
|7|ref_chr|string|Chromosome name.|
|8|ref_strand|+\|-|DNA strand.|
|9|ref_start|integer|Start position of the reference sequence.|
|10|ref_end|integer|End position of the reference sequence.|
|11|revc|0\|1|Whether the oligonucleotide contains the reverse complement of the reference sequence (minus strand transcripts only).|
|12|ref_seq|DNA sequence|Reference sequence.|
|13|pam_seq|DNA sequence|PAM-protected reference sequence.|
|14|vcf_alias|string|VCF file alias (custom mutations only).|
|15|vcf_var_id|string|Variant ID (custom mutations only).|
|16|mut_position|integer|Start position of the mutation.|
|17|ref|DNA sequence|Reference nucleotide or triplet.|
|18|new|DNA sequence|Mutated nucleotide or triplet. Not set for deletions.|
|19|ref_aa|amino acid|Reference amino acid.|
|20|alt_aa|amino acid|Alternative amino acid.|
|21|mut_type|syn\|mis\|non|Mutation type.|
|22|mutator|type of mutator|Label of the type of mutator that generated the oligonucleotide.|
|23|oligo_length|integer|Oligonucleotide length.|
|24|mseq|DNA sequence|Full oligonucleotide sequence (with adaptors, if any).|
|25|mseq_no_adapt|DNA sequence|Oligonucleotide sequence excluding adaptors.|
|26|pam_mut_annot|Array of syn\|mis\|non\|ncd|Applied PAM protection variant mutation types (or ncd if affecting a noncoding region).|
|27|pam_mut_sgrna_id|Array of sgRNA IDs|sgRNA IDs bound to the PAM protection variants spanned by the mutation or affecting the same codons as the mutation, if any.|
|28|mave_nt|MAVE-HGVS string|MAVE-HGVS string corresponding to the mutation.|
|29|mave_nt_ref|MAVE-HGVS string|MAVE-HGVS string corresponding to the mutation, where REF does not include PAM protection.|
|30|vcf_var_in_const|0\|1|Whether the variant is in a region defined as constant (custom mutations only).|
|31|background_variants|MAVE-HGVS strings|MAVE-HGVS strings corresponding to the background variants overlapping the targeton range, semicolon-separated.|
|32|background_seq|DNA sequence|Reference sequence altered by the background variants.|

Example:

```
oligo_name,species,assembly,gene_id,transcript_id,src_type,ref_chr,ref_strand,ref_start,ref_end,revc,ref_seq,pam_seq,vcf_alias,vcf_var_id,mut_position,ref,new,ref_aa,alt_aa,mut_type,mutator,oligo_length,mseq,mseq_no_adapt,pam_mut_annot,pam_mut_sgrna_id,mave_nt,mave_nt_ref,vcf_var_in_const,background_variants,background_seq
ENST00000357654.9.ENSG00000012048.23_chr17:43104102_1del_rc,homo sapiens,GRCh38,ENSG00000012048.23,ENST00000357654.9,ref,chr17,-,43104080,43104330,1,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCGCGATTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,,,43104102,A,,,,,1del,291,AATGATACGGCGACCACCGAGTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCTTCGTATGCCGTCTTCTGCTTG,GTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCT,syn;syn,,g.23del,g.23del,0,,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC
```

Variant file

VCF files containing a subset of the metadata in VCF format. The metadata are stored in the INFO field. The REF field reports the reference sequence including (*_pam.vcf) or excluding (*_ref.vcf) PAM protection edits.

The variants can be linked to the corresponding oligonucleotides via the SGE_OLIGO tag, and, for custom variants, to the original VCF files via the SGE_VCF_ALIAS and SGE_VCF_VAR_ID tags.

INFO tags:

|Tag|Metadata field|Description|
|-|-|-|
|SGE_OLIGO|oligo_name|Corresponding oligonucleotide name.|
|SGE_SRC|mutator|Variant source.|
|SGE_REF|ref|(Optional) Reference sequence, if different from the PAM-protected reference sequence (PAM VCF only).|
|SGE_VCF_ALIAS|vcf_alias|(Optional) VCF variant source file alias, only for custom variants.|
|SGE_VCF_VAR_ID|vcf_var_id|(Optional) VCF variant identifier, only for custom variants.|

Unique oligonucleotides file

Comma-separated values (CSV) file containing only the label and the sequence of the oligonucleotides generated for any given targeton, where the sequences are unique. This is a subset of the oligonucleotide metadata file fields (oligo_name and mseq) and rows. When multiple oligonucleotides have the same sequence, the first name in lexicographic order is chosen.

Example:

oligo_name,mseq ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>A_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAAGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>C_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATACGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>T_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATATGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146474_A_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCCCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146475_C_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146477_G_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
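The deduplication rule (first name in lexicographic order for each distinct sequence) could be reproduced as in the sketch below. The function name is hypothetical, and sorting the output by name is a choice made here for determinism, not a documented property of the file.

```python
from typing import Dict, Iterable, List, Tuple


def unique_oligos(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep one (oligo_name, mseq) pair per distinct sequence."""
    first_name: Dict[str, str] = {}
    for name, seq in rows:
        # Keep the lexicographically smallest name seen for this sequence.
        if seq not in first_name or name < first_name[seq]:
            first_name[seq] = name
    return sorted((name, seq) for seq, name in first_name.items())
```

For example, if oligonucleotides `b` and `a` share the sequence `ACGT`, only `a` is kept.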

Reference sequence retrieval quality check file

Comma-separated values (CSV) file with no header reporting the reference sequences as retrieved based on the genomic coordinates and extension vector provided in the SGE targeton file.

The targeton name is derived from the genomic coordinates of the reference sequence.

|Field|Format|Description|
|-|-|-|
|(Targeton name)|`<CHR>_<START>_<END>_<STRAND>`|Name of the targeton.|
|(Reference genomic range)|`<CHR>:<START>-<END>`|Reference sequence region.|
|(5' constant region start)|integer|Start position of the 5' constant region.|
|(5' constant region sequence)|DNA sequence|Sequence of the 5' constant region.|
|(Target region 1 start)|integer|Start position of target region 1.|
|(Target region 1 sequence)|DNA sequence|Sequence of target region 1.|
|(Target region 2 start)|integer|Start position of target region 2.|
|(Target region 2 sequence)|DNA sequence|Sequence of target region 2.|
|(Target region 3 start)|integer|Start position of target region 3.|
|(Target region 3 sequence)|DNA sequence|Sequence of target region 3.|
|(3' constant region start)|integer|Start position of the 3' constant region.|
|(3' constant region sequence)|DNA sequence|Sequence of the 3' constant region.|

Example:

chr3_10146443_10146687_plus,chr3:10146443-10146687,10146443,GGATTACAGGTGTGGGCCACCGTGCCCAGCC,10146474,ACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAG,10146514,GTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAG,10146637,GTACTGACGTTTTACTTTTTAAAAAGATAAGGTTG,10146672,TTGTGGTAAGTACAGG
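Since the file has no header, a reader has to supply the field names itself. The sketch below uses hypothetical short names for the twelve fields, assumes every region is populated, and adds a simple consistency check that each region starts where the previous one ends.

```python
import csv
from typing import Dict, Union

# Hypothetical field names, in the column order documented above.
FIELDS = [
    "targeton", "ref_range",
    "c5_start", "c5_seq",
    "r1_start", "r1_seq",
    "r2_start", "r2_seq",
    "r3_start", "r3_seq",
    "c3_start", "c3_seq",
]


def parse_qc_row(line: str) -> Dict[str, Union[str, int]]:
    """Parse one headerless row into a dictionary with integer start positions."""
    values = next(csv.reader([line]))
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(values)}")
    row: Dict[str, Union[str, int]] = dict(zip(FIELDS, values))
    for key in FIELDS:
        if key.endswith("_start"):
            row[key] = int(row[key])
    return row


def regions_are_contiguous(row: Dict[str, Union[str, int]]) -> bool:
    """Check that each region starts right after the previous region's sequence."""
    names = ["c5", "r1", "r2", "r3", "c3"]
    return all(
        row[f"{a}_start"] + len(row[f"{a}_seq"]) == row[f"{b}_start"]
        for a, b in zip(names, names[1:])
    )
```

In the example row, the 5' constant region starts at 10146443 and is 31 nucleotides long, which matches the start of target region 1 at 10146474.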

Development

To run the unit tests, install the extra requirements first:

```sh
pip install -r test-requirements.txt
./run_tests.sh
```

LICENSE

```none
VaLiAnT
Copyright (C) 2020, 2021, 2022, 2023, 2024 Genome Research Ltd

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.
```

Owner

  • Name: CASM IT
  • Login: cancerit
  • Kind: organization
  • Email: cgpit@sanger.ac.uk
  • Location: Hinxton, Cambridge, UK

CASM IT provides bioinformatics support for the Cancer, Ageing and Somatic Mutation (CASM) group at the Wellcome Sanger Institute.


Dependencies

Dockerfile docker
  • python 3.9-slim build
src/requirements.txt pypi
  • chardet ==4.0.0
  • click ==8.1.3
  • cython ==0.29.30
  • numpy ==1.21.6
  • pandas ==1.1.5
  • pydantic ==1.9.1
  • pyranges ==0.0.117
  • pysam ==0.19.1
src/setup.py pypi
tests/unit_tests/requirements.txt pypi
  • pytest * test
  • pytest-cov * test