https://github.com/cancerit/valiant

Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 3 DOI reference(s) in README
✓ Academic publication links: links to ncbi.nlm.nih.gov
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.7%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: cancerit
License: agpl-3.0
Language: Python
Default Branch: develop
Homepage:
Size: 62.8 MB

Statistics

Stars: 5
Watchers: 11
Forks: 2
Open Issues: 0
Releases: 0

Created over 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme Changelog License

VaLiAnT

The Variant Library Annotation Tool (VaLiAnT) is an oligonucleotide library design and annotation tool for Saturation Genome Editing and other Deep Mutational Scanning experiments.

A selection of libraries is included in the examples directory, including all necessary inputs and instructions to generate them.

Please also see the VaLiAnT Wiki for more information on use cases.

Citation

Please cite this paper when using VaLiAnT for your publications:

Variant Library Annotation Tool (VaLiAnT): an oligonucleotide library design and annotation tool for saturation genome editing and other deep mutational scanning experiments.

Barbon L, Offord V, Radford EJ, Butler AP, Gerety SS, Adams DJ, Tan HK, Waters AJ.

Bioinformatics. 2022 Jan 27;38(4):892-899.

DOI: 10.1093/bioinformatics/btab776. PMID: 34791067; PMCID: PMC8796380.

Usage

See the command line interface section for the full list and a more detailed description of the parameters.

Input parameters:

species
assembly
5' adaptor (optional)
3' adaptor (optional)

Main command input files:

configuration file (JSON)

SGE input files:

SGE targeton file (TSV)
reference genome sequence (FASTA)
reference genome index (FAI)
PAM protection file (VCF, optional)
VCF manifest file (CSV, optional)
custom variant files (VCF, optional)
features file (GTF/GFF2, optional)
codon table with frequencies (CSV, optional)
background variant file (VCF, optional)
background variant mask file (BED, optional)

cDNA input files:

cDNA targeton file (TSV)
cDNA sequences (single multi-FASTA)
cDNA annotation file (TSV, optional)
codon table with frequencies (CSV, optional)

Output files:

reference sequence retrieval quality check file (CSV, SGE-only)
oligonucleotide metadata file (CSV)
variant file (VCF, SGE-only)
unique oligonucleotides file (CSV)
configuration file (JSON)

The reference directory should contain both the FASTA file and its index; e.g., if the FASTA file is named genome.fa, a genome.fa.fai file should also be present in the same directory. When running the tool in a container, the directory containing both files should therefore be mounted.

The features file (SGE-only gff option) is required to detect exonic regions in the targeton, and should therefore be provided in most circumstances. The file should only contain features for one transcript per gene (the gene_id and transcript_id attributes are required to perform this check). The features file should match the assembly of the target reference genome. Any features of type other than CDS and UTR are ignored.

If the codon-table option is not set, the default codon table will be used.

Ambiguous nucleotides are not allowed in the reference sequence. Soft-masking is ignored.

Oligonucleotides exceeding a given length (max-length option) will not be included in the unique oligonucleotide files and their metadata will be stored in separate files marked as 'excluded'.

Background variants, if provided, are applied before PAM protection variants. When the CDS features of a transcript are provided (via a GTF/GFF2 file), background variants are applied over a range chosen to keep the annotation consistent across targetons: the minimal range of positions that spans at least the entire CDS, further extended to the boundaries of any targeton overlapping the CDS, and finally to the start position of the first and the end position of the last background variant intersecting the resulting range.

Background variants may be filtered out by position by providing a set of genomic ranges to be excluded via a BED file (bg-mask option). Excluding frame-shifting variants may affect the annotation.
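As a minimal sketch of position-based masking, the snippet below drops variant positions that fall inside excluded ranges. It assumes the usual conventions (BED ranges are zero-based and end-exclusive, VCF positions are one-based); the function name `mask_positions` is hypothetical and is not part of VaLiAnT.

```python
def mask_positions(positions: list[int], bed_ranges: list[tuple[int, int]]) -> list[int]:
    """Keep only the one-based positions that fall outside every BED range.

    A one-based position p lies inside a zero-based, end-exclusive range
    (start, end) when start < p <= end.
    """
    return [
        p for p in positions
        if not any(start < p <= end for start, end in bed_ranges)
    ]


# Positions 11 and 20 fall inside the BED range (10, 20); 10 and 21 do not.
print(mask_positions([10, 11, 20, 21], [(10, 20)]))
```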

By default, errors are raised when background variants are not synonymous or shift the reading frame; the force-bg-ns and force-bg-indels flags may be passed to allow them.

Mutations that overlap background variants are discarded; warnings are always raised to identify them, with their positions expressed in absolute genomic coordinates for custom variants, and targeton-relative coordinates for pattern variants.

Python package

After installing the package in an appropriate virtual environment:

```sh
valiant sge \
    "${TARGETONS_FILE}" \
    "${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

Alternatively, a configuration file can be provided to the main command:

```sh
valiant -c config.json
```

Docker image

After building or pulling the Docker image (quay.io/wtsicgp/valiant:X.X.X, where X.X.X is a version tag):

```sh
docker run \
    -v "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -v "${HOST_REF}":"${REF_DIR}":ro \
    -v "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    valiant \
    valiant sge \
    "${INPUT_DIR}/${TARGETONS_FILE}" \
    "${REF_DIR}/${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${INPUT_DIR}/${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

The HOST_* environment variables represent local paths to be mounted by the Docker container.

Singularity image

After pulling the Docker image with Singularity:

```sh
singularity exec \
    --cleanenv \
    -B "${HOST_INPUTS}":"${INPUT_DIR}":ro \
    -B "${HOST_REF}":"${REF_DIR}":ro \
    -B "${HOST_OUTPUT}":"${OUTPUT_DIR}" \
    ${SINGULARITY_IMAGE} \
    valiant sge \
    "${INPUT_DIR}/${TARGETONS_FILE}" \
    "${REF_DIR}/${REFERENCE_FILE}" \
    "${OUTPUT_DIR}" \
    "${SPECIES}" \
    "${ASSEMBLY}" \
    --gff "${INPUT_DIR}/${GTF_FILE}" \
    --adaptor-5 "${ADAPTOR_5}" \
    --adaptor-3 "${ADAPTOR_3}"
```

Command line interface

Separate subcommands are provided depending on the target sequence origin:

SGE (sge): sequences from reference, genomic coordinates, three target regions;
cDNA DMS (cdna): user-provided sequences, relative coordinates, single target region.

The arguments and a few options are the same for both subcommands (see the valiant (sge|cdna) section), but the file formats may vary.

valiant

Main command.

valiant (sge|cdna)

Arguments and options required or supported by both subcommands. The format of the input files may be different for SGE and cDNA targets (see the argument descriptions).

valiant sge

Options specific to SGE.

The REF_FASTA path is expected to point to a reference genome in FASTA format.

valiant cdna

Options specific to cDNA DMS.

The REF_FASTA path is expected to point to a multi-FASTA containing cDNA sequences.

|Option|Format|Default|Description|
|-|-|-|-|
|annot|file path|-|Path to a cDNA annotation file.|

Mutation types

Types of mutation that apply to any target (label):

parametric deletion (e.g.: 1del, 2del0, 2del1)
single-nucleotide variant (snv)

Types of mutation that apply to CDS targets only (label):

in-frame deletion (inframe)
alanine codon substitution (ala)
stop codon substitution (stop)
all amino acid codon substitution (aa)
SNVRE (snvre)

Variants imported from VCF files are labelled as custom.

Parametric deletion

Non-overlapping stretches of nucleotides of a given length are deleted starting from a given offset. No partial deletions are performed at the end of the target regions. Format: <SPAN>del[<OFFSET>] (the offset is assumed to be zero if not set).

For backwards compatibility, in the metadata table, 1del0 is reported as 1del.

Given the target ACGTAAA, span two, and start offset zero (2del0), e.g.:

GTAAA
ACAAA
ACGTA

With start offset one (2del1), e.g.:

ATAAA
ACGAA
ACGTA
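The two examples above can be reproduced with a short sketch. The function name `parametric_deletions` is hypothetical and only illustrates the rule as described here (non-overlapping stretches, no partial deletion at the end); it is not VaLiAnT's own implementation.

```python
def parametric_deletions(target: str, span: int, offset: int = 0) -> list[str]:
    """Delete non-overlapping stretches of `span` nucleotides, starting at `offset`."""
    results = []
    # The range stops before any stretch that would run past the end of the
    # target, so no partial deletions are produced.
    for start in range(offset, len(target) - span + 1, span):
        results.append(target[:start] + target[start + span:])
    return results


print(parametric_deletions("ACGTAAA", span=2, offset=0))  # 2del0
print(parametric_deletions("ACGTAAA", span=2, offset=1))  # 2del1
```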

Single-nucleotide variant

Each nucleotide is replaced with all the alternatives.

Given the target AA, e.g.:

CA
GA
TA
AC
AG
AT

For CDS targets, the resulting amino acid change is reported.
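A minimal sketch of the substitution step (without the amino acid annotation) might look as follows; the function name `single_nucleotide_variants` is hypothetical.

```python
NUCLEOTIDES = "ACGT"


def single_nucleotide_variants(target: str) -> list[str]:
    """Replace each nucleotide of the target, in turn, with every alternative."""
    return [
        target[:i] + alt + target[i + 1:]
        for i, ref in enumerate(target)
        for alt in NUCLEOTIDES
        if alt != ref
    ]


print(single_nucleotide_variants("AA"))
```

A target of length n therefore yields 3n sequences.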

In-frame deletion

Only for CDS targets.

Delete each triplet so that the reading frame is preserved.

Given the target GAAATTTGG with frame 2, e.g.:

GTTTGG
GAAAGG
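The example above can be sketched as below. This assumes that the frame is the codon position (0, 1, or 2) of the first nucleotide of the target, which is consistent with the example but is an interpretation rather than a documented definition; the function name `inframe_deletions` is hypothetical.

```python
def inframe_deletions(target: str, frame: int = 0) -> list[str]:
    """Delete each complete codon of the target, one at a time.

    Assumption: `frame` is the codon position (0, 1, or 2) of the first
    nucleotide, so the first complete codon starts (3 - frame) % 3 bases in.
    """
    first_codon_start = (3 - frame) % 3
    return [
        target[:start] + target[start + 3:]
        # Only starts with three full nucleotides available are visited.
        for start in range(first_codon_start, len(target) - 2, 3)
    ]


print(inframe_deletions("GAAATTTGG", frame=2))
```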

Alanine codon substitution

Only for CDS targets.

Replace each codon with the top-ranking alanine codon.

Given the target GCAAAATTT, with GCC being the top-ranking alanine codon, e.g.:

GCCAAATTT
GCAGCCTTT
GCAAAAGCC

Stop codon substitution

Only for CDS targets.

Replace each codon with the top-ranking stop codon.

Given the target TAACCCGGG, with TGA being the top-ranking stop codon, e.g.:

TGACCCGGG
TAATGAGGG
TAACCCTGA
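Both the alanine and the stop codon substitutions follow the same pattern, sketched below for an in-frame target. The function name `substitute_codons` is hypothetical, and skipping codons that already equal the replacement is an assumption made here to avoid emitting the unchanged reference.

```python
def substitute_codons(target: str, codon: str) -> list[str]:
    """Replace each codon of an in-frame target with `codon`, one at a time."""
    results = []
    for start in range(0, len(target) - 2, 3):
        # Assumption: a codon identical to the replacement is skipped,
        # since substituting it would reproduce the reference sequence.
        if target[start:start + 3] != codon:
            results.append(target[:start] + codon + target[start + 3:])
    return results


print(substitute_codons("GCAAAATTT", "GCC"))  # alanine, GCC top-ranking
print(substitute_codons("TAACCCGGG", "TGA"))  # stop, TGA top-ranking
```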

All amino acid codon substitution

Only for CDS targets.

Replace each codon with the top-ranking codon of all amino acids. Given the default codon table, this results in 19 mutated sequences for each codon mapping to an amino acid (the reference amino acid being excluded) and 20 for each stop codon.

Given the target AAATGA on the plus strand, e.g. (each column representing the sequences generated from one codon):

|Codon 1|Codon 2|
|-|-|
|ATCTGA|AAAATC|
|ATGTGA|AAAATG|
|ACCTGA|AAAACC|
|AACTGA|AAAAAC|
| |AAAAAG|
|AGCTGA|AAAAGC|
|CGGTGA|AAACGG|
|CTGTGA|AAACTG|
|CCCTGA|AAACCC|
|CACTGA|AAACAC|
|CAGTGA|AAACAG|
|GTGTGA|AAAGTG|
|GCCTGA|AAAGCC|
|GACTGA|AAAGAC|
|GAGTGA|AAAGAG|
|GGCTGA|AAAGGC|
|TTCTGA|AAATTC|
|TACTGA|AAATAC|
|TGCTGA|AAATGC|
|TGGTGA|AAATGG|
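The counts in this example (19 sequences for the lysine codon, 20 for the stop codon) can be checked with a small sketch. The top-ranking codons below are read off the example above, the translation table is the standard genetic code, and the function name `all_amino_acid_substitutions` is hypothetical.

```python
from itertools import product

# Standard genetic code, in the conventional TCAG ordering ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Top-ranking codon per amino acid, as they appear in the example above.
TOP_CODONS = {
    "I": "ATC", "M": "ATG", "T": "ACC", "N": "AAC", "K": "AAG",
    "S": "AGC", "R": "CGG", "L": "CTG", "P": "CCC", "H": "CAC",
    "Q": "CAG", "V": "GTG", "A": "GCC", "D": "GAC", "E": "GAG",
    "G": "GGC", "F": "TTC", "Y": "TAC", "C": "TGC", "W": "TGG",
}


def all_amino_acid_substitutions(target: str) -> list[str]:
    """Replace each codon with the top-ranking codon of every other amino acid."""
    results = []
    for start in range(0, len(target) - 2, 3):
        ref_aa = CODON_TO_AA[target[start:start + 3]]
        for aa, codon in TOP_CODONS.items():
            if aa != ref_aa:  # the reference amino acid is excluded
                results.append(target[:start] + codon + target[start + 3:])
    return results


variants = all_amino_acid_substitutions("AAATGA")
print(len(variants))
```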

SNVRE

Only for CDS targets.

Given a set of SNVs, replace triplets according to the following rules:

if the SNV results in a synonymous mutation, replace the triplet with all the synonymous triplets of the variant
if the SNV results in a missense mutation, replace the triplet with the top-ranking synonymous triplet of the variant
if the SNV results in a nonsense mutation, replace the triplet with the top-ranking stop codon

For a given missense or nonsense SNV mutation, if the resulting triplet is already the top-ranking one, the second highest ranking triplet is used to generate the SNVRE mutation instead.

Given the following SNVs for sequence AAAAGT, e.g.:

|mseq|ref|alt|
|-|-|-|
|CAAAGT|K|Q|
|GAAAGT|K|E|
|TAAAGT|K|STOP|
|ACAAGT|K|T|
|AGAAGT|K|R|
|ATAAGT|K|I|
|AACAGT|K|N|
|AAGAGT|K|K|
|AATAGT|K|N|
|...| | |
|AAAAGC|S|S|
|...| | |

There is only one synonymous mutation for the first triplet (AAGAGT), but since lysine maps to only two codons and one of them is the reference, no SNVRE variants are generated from it. The synonymous mutation for the second triplet (AAAAGC), though, results in the top-ranking codon for serine, which maps to six codons, and therefore the following four SNVREs are generated:

|mseq|ref|alt|
|-|-|-|
|AAATCA|S|S|
|AAATCC|S|S|
|AAATCG|S|S|
|AAATCT|S|S|

For missense mutations, the top-ranking codon (the current being excluded) for each alternative amino acid replaces the reference sequence:

|mseq|ref|alt|snv|snvre|
|-|-|-|-|-|
|CAAAGT|K|Q|CAA|CAG|
|GAAAGT|K|E|GAA|GAG|
|ACAAGT|K|T|ACA|ACC|
|AGAAGT|K|R|AGA|CGG|
|ATAAGT|K|I|ATA|ATC|
|AACAGT|K|N|AAC|AAT|
|AATAGT|K|N|AAT|AAC|

The resulting SNVRE variants would be:

|mseq|ref|alt|
|-|-|-|
|CAGAGT|K|Q|
|GAGAGT|K|E|
|ACCAGT|K|T|
|CGGAGT|K|R|
|ATCAGT|K|I|
|AATAGT|K|N|
|AACAGT|K|N|

For the nonsense mutation:

|mseq|ref|alt|snv|snvre|
|-|-|-|-|-|
|TAAAGT|K|STOP|TAA|TGA|

The resulting SNVRE variant would be:

|mseq|ref|alt|
|-|-|-|
|TGAAGT|K|STOP|

Unique codons do not generate SNVRE variants.
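The three rules can be sketched per codon as shown below. This is an illustration under stated assumptions, not VaLiAnT's implementation: the function name `snvre_codons` is hypothetical, translation uses the standard genetic code, and the ranking table is deliberately partial, listing only orderings that follow from the examples in this document.

```python
from itertools import product

# Standard genetic code, in the conventional TCAG ordering ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Partial, illustrative ranking (highest first). A real codon table would
# rank every codon of every amino acid.
RANKED = {
    "Q": ["CAG", "CAA"],
    "E": ["GAG", "GAA"],
    "N": ["AAC", "AAT"],
    "*": ["TGA"],  # lower-ranking stop codons omitted in this sketch
}


def snvre_codons(ref_codon: str, snv_codon: str) -> list[str]:
    """Return the codons replacing `ref_codon` for one SNV-derived codon."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[snv_codon]
    if alt_aa == ref_aa:
        # Synonymous SNV: all the other synonymous codons.
        return sorted(
            codon for codon, aa in CODON_TO_AA.items()
            if aa == alt_aa and codon not in (ref_codon, snv_codon)
        )
    # Missense or nonsense SNV: the top-ranking codon, or the next one down
    # if the SNV already produced the top-ranking codon.
    candidates = [c for c in RANKED.get(alt_aa, []) if c != snv_codon]
    return candidates[:1]


print(snvre_codons("AGT", "AGC"))  # synonymous (serine)
print(snvre_codons("AAA", "AAG"))  # synonymous (lysine), nothing left
print(snvre_codons("AAA", "AAC"))  # missense, SNV is already top-ranking
print(snvre_codons("AAA", "TAA"))  # nonsense
```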

Custom variants

Applied to the targeton reference sequence as a whole. Only simple variants such as the following are supported:

substitutions
insertions (see below)
deletions (see below)
indels

The classification of the variants is based exclusively on the POS, REF, and ALT fields to be agnostic with respect to the VCF source.

In the VCF format, insertion and deletion positions refer to the base preceding the event, and the reference and alternative sequences both include that preceding base (or the following base, if the variant starts at position one). For consistency with the conventions adopted for generated mutations, in the metadata table such variants are instead reported as shifted right by one, with the preceding (or following) base omitted from the reference and alternative sequences.
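A minimal sketch of this conversion, covering only the common case of a shared preceding base (not the position-one case with a following base), might look like this. The function name `to_metadata_convention` is hypothetical.

```python
def to_metadata_convention(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Drop the shared preceding base of a VCF insertion or deletion.

    Substitutions and variants without a single shared anchor base are
    returned unchanged.
    """
    is_anchored = (
        len(ref) != len(alt)
        and min(len(ref), len(alt)) == 1
        and ref[0] == alt[0]
    )
    if is_anchored:
        # Shift right by one and omit the anchor base from both sequences.
        return pos + 1, ref[1:], alt[1:]
    return pos, ref, alt


print(to_metadata_convention(100, "A", "ATG"))  # insertion of TG
print(to_metadata_convention(100, "ATG", "A"))  # deletion of TG
print(to_metadata_convention(100, "A", "C"))    # substitution, unchanged
```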

Installation

Some of the dependencies are unsupported on Windows, and the tool cannot therefore be installed natively on it. The following options are available:

installing the Windows Subsystem for Linux (WSL) and creating a Python virtual environment
installing Docker (requires the WSL or Windows 10 Pro) and building or pulling the Docker image
installing Singularity (requires a virtualisation solution) and building a Singularity image from the Docker image

The instructions that follow apply to Linux and macOS.

Python virtual environment

Please take care to read errors during the dependency installation step carefully. HTSlib (pysam) has system dependencies and will highlight the packages that need to be installed.

Requirements:

Python 3.11 or above

To install in a virtual environment:

```sh
# Initialise the virtual environment
python3.11 -m venv .env

# Activate the virtual environment
source .env/bin/activate

# Install the valiant package
pip install .
```

Docker image

To build the Docker container:

```sh
docker build -t valiant .
```

File formats

Configuration file

JSON file collecting the execution parameters. It is always generated as an output (config.json) and can optionally be used as input by the main command, e.g.:

```sh
valiant -c config.json
```

An application version mismatch will result in a warning.

The execution parameters depend on the execution mode, and each corresponds to one of the command line arguments or options.

Common parameters

SGE parameters

Example:

```json
{
  "appName": "valiant",
  "appVersion": "4.0.0",
  "mode": "sge",
  "params": {
    "species": "homo sapiens",
    "assembly": "GRCh38",
    "adaptor5": "AATGATACGGCGACCACCGA",
    "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
    "minOligoLength": 1,
    "maxOligoLength": 300,
    "codonTableFilePath": null,
    "backgroundVCFFilePath": null,
    "oligoInfoFilePath": "parameter_input_files/brca1_nuc_targeton_input.txt",
    "refFASTAFilePath": "reference_input_files/chr17.fa",
    "outputDirPath": "brca1_nuc_output",
    "reverseComplementOnMinusStrand": true,
    "includeNoOpOligo": false,
    "GFFFilePath": "reference_input_files/ENST00000357654.9.gtf",
    "PAMProtectionVCFFilePath": "parameter_input_files/brca1_protection_edits.vcf",
    "customVCFManifestFilePath": "reference_input_files/brca1_custom_variants_manifest.csv",
    "maskBackgroundFilePath": null,
    "forceBackgroundNonSynonymous": false,
    "forceBackgroundFrameShifting": false
  }
}
```

cDNA parameters

Example:

```json
{
  "appName": "valiant",
  "appVersion": "4.0.0",
  "mode": "cdna",
  "params": {
    "species": "human",
    "assembly": "pCW57.1",
    "adaptor5": "AATGATACGGCGACCACCGA",
    "adaptor3": "TCGTATGCCGTCTTCTGCTTG",
    "minOligoLength": 1,
    "maxOligoLength": 300,
    "codonTableFilePath": null,
    "oligoInfoFilePath": "examples/cdna/input/cdna_targeton.tsv",
    "refFASTAFilePath": "examples/cdna/input/BRCA1_NP_009225_1_pCW57_1.fa",
    "outputDirPath": "examples/cdna/output",
    "annotationFilePath": "examples/cdna/input/cdna_annot.tsv"
  }
}
```

SGE targeton file

Tab-separated values (TSV) file describing the reference sequence coordinates and the types of mutation to be applied to the three target regions therein contained (collectively referred to as targeton). Multiple types of mutations can be applied to each target region. The coordinates of the target regions are derived from the genomic range of the second target region and an extension vector describing the lengths of the preceding and following regions.

Duplicate mutation types in any given group within the action vector are ignored.

Spacing is ignored when parsing the extension and action vectors.

The chromosome name needs to match the naming in the GTF/GFF2 file and in the reference genome.

Example:

|ref_chr|ref_strand|ref_start|ref_end|r2_start|r2_end|ext_vector|action_vector|sgrna_vector|
|-|-|-|-|-|-|-|-|-|
|chrX|+|41334132|41334320|41334253|41334297|25, 15|(1del), (1del, snv), (1del)|sgrna_1, sgrna_2|

cDNA targeton file

Tab-separated values (TSV) file describing the target cDNA and the types of mutation to be applied to the target region therein contained (expressed in relative coordinates). Multiple types of mutations can be applied to the target region.

The cDNA identifier (seq_id) has to correspond to an entry in the multi-FASTA and (optionally) annotation files.

Example:

|seq_id|targeton_start|targeton_end|r2_start|r2_end|action_vector|
|-|-|-|-|-|-|
|ENST00000357654.9|114|121|114|121|snv,1del,snvre|
|ENST00000357654.9|114|150|120|130|1del,2del0|

cDNA annotation file

TSV file describing the CDS region of each cDNA in relative coordinates (one-based and end-inclusive). Gene and transcript identifiers can also be provided.

Example:

|seq_id|gene_id|transcript_id|cds_start|cds_end|
|-|-|-|-|-|
|brca1_357654.9|ENSG00000012048.23|ENST00000357654.9|114|5705|
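Because the CDS coordinates are one-based and end-inclusive, extracting the CDS from a cDNA sequence held in a Python string needs a small adjustment, sketched below. The function name `cds_slice` and the toy sequence are hypothetical.

```python
def cds_slice(sequence: str, cds_start: int, cds_end: int) -> str:
    """Extract a one-based, end-inclusive range from a zero-based string."""
    return sequence[cds_start - 1:cds_end]


# Toy cDNA: two leading bases, a nine-base CDS, two trailing bases.
print(cds_slice("GGATGAAATAGCC", 3, 11))

# The range in the example row spans a whole number of codons.
print((5705 - 114 + 1) % 3 == 0)
```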

VCF manifest file

CSV file listing the VCF files from which to import variants. Each VCF file is given an alias. If a tag is specified (vcf_id_tag), the VCF INFO field will be expected to contain it and its values will be used as variant identifiers; if no tag is specified, the ID field will be used instead.

Example:

vcf_alias,vcf_id_tag,vcf_path
clinvar_1,ALLELEID,clinvar_abc.vcf
clinvar_2,ALLELEID,clinvar_xyz.vcf
gnomad,,gnomad_abc.vcf
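As a rough illustration of how such a manifest could be read, the sketch below parses the example with the standard library and maps an empty tag to `None` (meaning the VCF ID field would be used). The function name `read_manifest` and the output keys are hypothetical.

```python
import csv
import io

MANIFEST = """vcf_alias,vcf_id_tag,vcf_path
clinvar_1,ALLELEID,clinvar_abc.vcf
clinvar_2,ALLELEID,clinvar_xyz.vcf
gnomad,,gnomad_abc.vcf
"""


def read_manifest(handle) -> list[dict]:
    """Parse a VCF manifest; an empty vcf_id_tag becomes None."""
    return [
        {
            "alias": row["vcf_alias"],
            "id_tag": row["vcf_id_tag"] or None,
            "path": row["vcf_path"],
        }
        for row in csv.DictReader(handle)
    ]


for entry in read_manifest(io.StringIO(MANIFEST)):
    print(entry)
```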

PAM protection VCF file

VCF file containing single-nucleotide substitution variants linked to sgRNA identifiers via the SGRNA tag.

Example:

```
##fileformat=VCFv4.3
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
chrX 41334252 . G C . . SGRNA=sgRNA_1
chrX 41337416 . C T . . SGRNA=sgRNA_2
chrX 41339064 . G A . . SGRNA=sgRNA_3
chrX 41341504 . T C . . SGRNA=sgRNA_4
chrX 41341509 . G A . . SGRNA=sgRNA_4
```

Oligonucleotide metadata file

Comma-separated values (CSV) file containing name, label, and all metadata of the oligonucleotides generated for any given targeton.

For cDNA targets, the reference chromosome (ref_chr) and strand (ref_strand) will be missing and all positions will be relative to the cDNA sequence. All fields related to PAM protection (pam_seq) and custom VCF variants (vcf_alias, vcf_var_id, and vcf_var_in_const), features unavailable for this target type, will also be empty (except for vcf_var_in_const, which will be set to zero).

The MAVE-HGVS strings are all linear genomic (relative to the start of the targeton) and do not include the reference. Because in HGVS insertion positions are described by the flanking nucleotides, those occurring at either end of the reference sequence should be treated differently (see the 3' rule in the relevant HGVS documentation); for consistency between SGE and cDNA mode, simplicity, and given the limited usefulness of liminal insertions, this is not the case in the current implementation, and therefore the invalid position zero might be found in insertion names.

Array fields use the semicolon as separator.

Example:

oligo_name,species,assembly,gene_id,transcript_id,src_type,ref_chr,ref_strand,ref_start,ref_end,revc,ref_seq,pam_seq,vcf_alias,vcf_var_id,mut_position,ref,new,ref_aa,alt_aa,mut_type,mutator,oligo_length,mseq,mseq_no_adapt,pam_mut_annot,pam_mut_sgrna_id,mave_nt,mave_nt_ref,vcf_var_in_const,background_variants,background_seq
ENST00000357654.9.ENSG00000012048.23_chr17:43104102_1del_rc,homo sapiens,GRCh38,ENSG00000012048.23,ENST00000357654.9,ref,chr17,-,43104080,43104330,1,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCGCGATTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC,,,43104102,A,,,,,1del,291,AATGATACGGCGACCACCGAGTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCTTCGTATGCCGTCTTCTGCTTG,GTTTCTTTGATTATAATTCATACATTTTTCTCTAACTGCAAACATAATGTTTTCCCTTGTATTTTACAGATGCAAACAGCTATAATTTTGCAAAAAAGGAAAATAACTCTCCTGAACATCTAAAAGATGAAGTTTCTATCATCCAAAGTATGGGCTACAGAAATCGCGCCAAAAGACTTCTACAGAGTGAACCCGAAAATCCTTCCTTGGTAAAACCATTTGTTTTCTCTTCTTCTTCTTCTTCTTTTCT,syn;syn,,g.23del,g.23del,0,,AGAAAAGAAGAAGAAGAAGAAGAAGAAAACAAATGGTTTTACCAAGGAAGGATTTTCGGGTTCACTCTGTAGAAGTCTTTTGGCACGGTTTCTGTAGCCCATACTTTGGATGATAGAAACTTCATCTTTTAGATGTTCAGGAGAGTTATTTTCCTTTTTTGCAAAATTATAGCTGTTTGCATCTGTAAAATACAAGGGAAAACATTATGTTTGCAGTTAGAGAAAAATGTATGAATTATAATCAAAGAAAC

Variant file

VCF files containing a subset of the metadata in VCF format. The metadata are stored in the INFO field. The REF field reports the reference sequence including (*_pam.vcf) or excluding (*_ref.vcf) PAM protection edits.

The variants can be linked to the corresponding oligonucleotides via the SGE_OLIGO tag, and, for custom variants, to the original VCF files via the SGE_VCF_ALIAS and SGE_VCF_VAR_ID tags.

INFO tags:

Unique oligonucleotides file

Comma-separated values (CSV) file containing only the label and the sequence of the oligonucleotides generated for any given targeton, where the sequences are unique. This is a subset of the oligonucleotide metadata file fields (oligo_name and mseq) and rows. When multiple oligonucleotides have the same sequence, the first name in lexicographic order is chosen.

Example:

oligo_name,mseq ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>A_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAAGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>C_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATACGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146513_G>T_snv,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATATGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146474_A_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCCCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146475_C_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG ENST00000256474.3.ENSG00000134086.8_chr3:10146477_G_1del,GGATTACAGGTGTGGGCCACCGTGCCCAGCCACCGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAGGTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAGGTACTGACGTTTTACTTTTTAAAAAGATAAGGTTGTTGTGGTAAGTACAGG
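The selection rule described above (one row per unique sequence, keeping the first name in lexicographic order) can be sketched as follows. The function name `unique_oligos` and the toy names are hypothetical, and sorting the output by name is a choice made here rather than a documented behaviour.

```python
def unique_oligos(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep one (oligo_name, mseq) pair per unique sequence.

    When several oligonucleotides share a sequence, the lexicographically
    smallest name is kept.
    """
    best_name: dict[str, str] = {}
    for name, seq in rows:
        if seq not in best_name or name < best_name[seq]:
            best_name[seq] = name
    # Assumption: the output is sorted by name for reproducibility.
    return sorted((name, seq) for seq, name in best_name.items())


rows = [
    ("oligo_b", "ACGT"),
    ("oligo_a", "ACGT"),  # same sequence as oligo_b, smaller name
    ("oligo_c", "TTTT"),
]
print(unique_oligos(rows))
```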

Reference sequence retrieval quality check file

Comma-separated values (CSV) file with no header reporting the reference sequences as retrieved based on the genomic coordinates and extension vector provided in the SGE targeton file.

The targeton name is derived from the genomic coordinates of the reference sequence.

Example:

chr3_10146443_10146687_plus,chr3:10146443-10146687,10146443,GGATTACAGGTGTGGGCCACCGTGCCCAGCC,10146474,ACCGGTGTGGCTCTTTAACAACCTTTGCTTGTCCCGATAG,10146514,GTCACCTTTGGCTCTTCAGAGATGCAGGGACACACGATGGGCTTCTGGTTAACCAAACTGAATTATTTGTGCCATCTCTCAATGTTGACGGACAGCCTATTTTTGCCAATATCACACTGCCAG,10146637,GTACTGACGTTTTACTTTTTAAAAAGATAAGGTTG,10146672,TTGTGGTAAGTACAGG

Development

To run the unit tests, install the extra requirements first:

```sh
pip install -r test-requirements.txt
./run_tests.sh
```

LICENSE

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Owner

Name: CASM IT
Login: cancerit
Kind: organization
Email: cgpit@sanger.ac.uk
Location: Hinxton, Cambridge, UK

Website: http://www.sanger.ac.uk/science/programmes/cancer-genetics-and-genomics
Repositories: 89
Profile: https://github.com/cancerit

CASM IT provide bioinformatic support for Cancer, Ageing and Somatic Mutation group at the Wellcome Sanger Institute

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

Dockerfile docker

python 3.9-slim build

src/requirements.txt pypi

chardet ==4.0.0
click ==8.1.3
cython ==0.29.30
numpy ==1.21.6
pandas ==1.1.5
pydantic ==1.9.1
pyranges ==0.0.117
pysam ==0.19.1

src/setup.py pypi

tests/unit_tests/requirements.txt pypi

pytest * test
pytest-cov * test
