https://github.com/cancerit/pycroquet

python Crispr Read to Oligo QUantification and Evaluation Tool

Keywords

bioinformatics crispr quantification

Repository

Basic Info
  • Host: GitHub
  • Owner: cancerit
  • License: agpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: develop
  • Homepage:
  • Size: 77.8 MB
Statistics
  • Stars: 2
  • Watchers: 12
  • Forks: 1
  • Open Issues: 3
  • Releases: 0
Created over 4 years ago · Last pushed over 1 year ago

README.md

pyCROQUET

python Crispr Read to Oligo QUantification and Evaluation Tool

Publications

Please contact the following for appropriate referencing methods:

  • Keiran Raine (kr2@sanger.ac.uk)
  • Emre Karakoc (ek11@sanger.ac.uk)
  • Victoria Offord (vo1@sanger.ac.uk)

General

Code is in place to support read input from any of the following formats:

  • fastq (also gzip compressed)
  • sam
  • bam
  • cram

Subcommands

  • single-guide
    • Short single end read quantification.
  • dual-guide
    • Paired end read quantification.
  • long-read
    • Long single end read quantification
  • guides-to-fa
    • Convert guides to fasta for use with samtools tview

Options

Not all options apply to every subcommand; however, the majority are common.

guidelib

Please see the Guide library format for a description of this file.

queries

Currently the dual-guide mode only supports SAM/BAM/CRAM as input. Convert fastq to unmapped CRAM with:

```
# if data has casava read barcode/qc (text after space in read name) please add "-i"
samtools import --output-fmt CRAM,no_ref=1 -@ 4 -1 $READ1 -2 $READ2 -o $OUTFILE.cram
```

chunks

Chunks should be set to a value that allows all CPUs to be utilized. The value is multiplied by the number of CPUs requested, and this gives the number of unique read sequences held in memory during the mapping phase.

This has a direct impact on memory. The value is automatically reduced when too large to allow full use of requested CPUs.
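As a rough illustration of the arithmetic described above, this Python sketch computes how many unique read sequences would be held in memory for a given chunk value and CPU count. The function name `reads_in_memory` and the example values are hypothetical and are not part of pycroquet.

```python
def reads_in_memory(chunks: int, cpus: int) -> int:
    """Unique read sequences held in memory during mapping (chunks x CPUs)."""
    if chunks < 1 or cpus < 1:
        raise ValueError("chunks and cpus must both be positive")
    return chunks * cpus


# Hypothetical values: a chunk value of 50,000 with 4 CPUs requested.
print(reads_in_memory(50_000, 4))  # 200000
```

Doubling either value doubles the number of sequences held, which is why the setting has a direct impact on memory.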

rules

For single-guide --rules MM (allow 2 mismatches in alignment) is a sensible value. For other subcommands the decision is dependent on the library protocol.

Rules have a direct impact on run time as they increase the time taken to abort an alignment. Individual costs are as follows:

  • M = 1
  • I = 2 (single b.p.)
  • D = 2 (single b.p.)

Performance is only impacted by the maximum penalty you allow.

Be aware that if you wish to allow up to 2 mismatches, or 1 mismatch + 1 b.p. insert, you must specify:

```
pycroquet ... --rules MM --rules MI
```
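To make the cost model concrete, here is a minimal Python sketch that totals the penalty of each rule string using the per-event costs listed above and reports the maximum, which is the value the text says drives performance. The names `COSTS`, `rule_penalty` and `max_penalty` are illustrative assumptions, not pycroquet internals.

```python
# Per-event costs as listed above: mismatch, 1 b.p. insertion, 1 b.p. deletion.
COSTS = {"M": 1, "I": 2, "D": 2}


def rule_penalty(rule: str) -> int:
    """Total penalty of one rule string such as 'MM' or 'MI'."""
    return sum(COSTS[event] for event in rule)


def max_penalty(rules: list[str]) -> int:
    """Largest penalty across all supplied rules."""
    return max(rule_penalty(rule) for rule in rules)


rules = ["MM", "MI"]
print({rule: rule_penalty(rule) for rule in rules})  # {'MM': 2, 'MI': 3}
print(max_penalty(rules))  # 3
```

Under this model, adding `--rules MI` to `--rules MM` raises the maximum penalty from 2 to 3.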

Output files

CRAM

For single-guide you can suppress the CRAM output via the --no-alignment (-n) option. In dual-guide it is tightly linked to the pairing code, so it cannot be disabled.

Reads that map uniquely are written with MAPQ>0 (score calculations have not been refined at this time). There are some differences in how to interpret the data depending on whether you are processing single-guide, dual-guide or long-read.

Reads mapping to a sgrna_id

To get the reads that map uniquely to a guide element (`sgrna_seq`) use the `sgrna_id`. This is primarily of use for `single-guide` and `long-read`:

```bash
samtools view -F 4 -q 1 result.cram $SGRNA_ID
```

To get a single instance of reads that map to a guide element but map equally well to others select for the SA tag (requires samtools>=1.12):

```bash
samtools view -F 4 -F 256 -d SA result.cram $SGRNA_ID
```

To get reads that failed to map:

```bash
samtools view -f 4 result.cram
```
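The `-f` and `-F` values in the commands above are SAM flag bit masks. The following Python sketch shows how such masks select records. The bit values come from the SAM specification, while the helper `passes` is a hypothetical stand-in for the filtering that samtools performs.

```python
# Flag bits from the SAM specification used by the commands above.
UNMAPPED = 0x4        # 4: read is unmapped
PAIRED_FIRST = 0x40   # 64: first read of a pair (R1)
SECONDARY = 0x100     # 256: secondary alignment


def passes(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Keep a record if all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


# Mapped reads only (like -F 4).
print(passes(0, exclude=UNMAPPED))                 # True
# Unmapped reads only (like -f 4).
print(passes(4, require=UNMAPPED))                 # True
# A secondary alignment is dropped when 256 is excluded.
print(passes(256, exclude=UNMAPPED | SECONDARY))   # False
```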

Reads assigned to a guide

This is only applicable to dual-guide.

You can pull reads by the guide id using samtools view. This example counts R1 reads mapping to a guide (equivalent to the *.counts.tsv result); replace/set $GUIDE_ID as required:

```bash
samtools view -F 4 -f 64 -c -d YG:$GUIDE_ID result.cram
```

To select all the reads mapped to this guide grouped by readname:

```bash
samtools view -u -F 4 -d YG:$GUIDE_ID result.cram | samtools sort -n - | samtools view -b - > $GUIDE_ID.bam
```
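For readers who prefer to post-process SAM text (for example the output of `samtools view result.cram`), the sketch below mirrors the R1 counting command in plain Python. It assumes the `YG` tag is stored as a string (`YG:Z:`), which is not confirmed by this README; the function name and the example records are made up for illustration.

```python
def count_r1_for_guide(sam_lines, guide_id):
    """Count mapped first-in-pair records carrying the given YG tag value."""
    wanted = f"YG:Z:{guide_id}"  # assumption: YG is a string-typed tag
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4 or not flag & 0x40:  # drop unmapped, keep R1 only
            continue
        if wanted in fields[11:]:  # optional tags start at column 12
            count += 1
    return count


# Made-up records: R1 and R2 of one pair, an unmapped R1, and another guide.
records = [
    "@HD\tVN:1.6",
    "r1\t65\tsg1\t1\t60\t4M\t=\t1\t0\tACGT\tIIII\tYG:Z:guideA",
    "r1\t129\tsg2\t1\t60\t4M\t=\t1\t0\tACGT\tIIII\tYG:Z:guideA",
    "r2\t69\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t65\tsg3\t1\t60\t4M\t=\t1\t0\tACGT\tIIII\tYG:Z:guideB",
]
print(count_r1_for_guide(records, "guideA"))  # 1
```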

Dual guide

FASTQ(.gz) input is not currently supported for dual guide. Please prepare your data appropriately with samtools import:

```bash
samtools import -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -O BAM -o OUTPUT.bam
```

Please review the import options as casava information can be interpreted where appropriate.

Statistic file extension

The dual guide output extends the standard json statistics file adding pair_classifications:

| Classification | Description                            |
|----------------|----------------------------------------|
| match          | same vector F/R                        |
| aberrantmatch  | same vector pair, aberrant orientation |
| fmulti3p       | 5p mapped F, 3p multihit               |
| fmulti5p       | 3p mapped F, 5p multihit               |
| rmulti3p       | 5p mapped R, 3p multihit               |
| rmulti5p       | 3p mapped R, 5p multihit               |
| fopen3p        | 5p mapped F, 3p open (unmapped)        |
| fopen5p        | 3p mapped F, 5p open (unmapped)        |
| ropen3p        | 5p mapped R, 3p open (unmapped)        |
| ropen5p        | 3p mapped R, 5p open (unmapped)        |
| swap           | multi vector, uniq mapped              |
| ambiguous      | both ends multi hit                    |
| nomatch        | multi/unmapped either end              |
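A quick way to inspect these classifications is to convert them to fractions of all pairs. The Python sketch below assumes that `pair_classifications` is a JSON object mapping each classification name to a count; the actual layout of the statistics file is not described here, and the numbers are invented for illustration.

```python
import json

# Invented example; the real statistics file may be structured differently.
stats_json = '{"pair_classifications": {"match": 900, "swap": 60, "nomatch": 40}}'


def classification_fractions(stats: dict) -> dict:
    """Fraction of pairs per classification (empty dict if there are no pairs)."""
    counts = stats.get("pair_classifications", {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


fractions = classification_fractions(json.loads(stats_json))
for name, fraction in fractions.items():
    print(f"{name}\t{fraction:.3f}")
```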

Boundary mode details

The -b/--boundary-mode option controls how the guide and read are allowed to overlap. Each section shows the types of alignment allowed; to be valid, they still need to pass rules and any minimum score.

In all cases XXX indicates original sequence.

exact

Boundary of sequence must be equal between target (guide) and query (read)

```
T: XXXXXX
Q: XXXXXX
```

TinQ - target in query

Like the name suggests, valid alignments include those via exact and:

```
T:  XXXXX
Q: XXXXXX

T:  XXXX
Q: XXXXXX

T: XXXXX
Q: XXXXXX
```

QinT - query in target

Reverse of TinQ, valid alignments include those via exact and:

```
T: XXXXXX
Q:  XXXXX

T: XXXXXX
Q:  XXXX

T: XXXXXX
Q: XXXXX
```

any

No boundary checks are performed. This allows more complex events: all alignments from exact, TinQ and QinT, plus:

```
T:  XXXXXXX
Q: XXXXXXX

T: XXXXXXX
Q:  XXXXXXX
```
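The relationship between the four modes can be sketched in Python by treating the target and query as half-open intervals on a shared axis. This is only an illustration of the boundary logic described above, assuming the two sequences overlap; it ignores rules and minimum score, and the names `boundary_class`, `ALLOWED` and `allowed` are hypothetical rather than pycroquet code.

```python
def boundary_class(t_start, t_end, q_start, q_end):
    """Classify how target and query intervals sit relative to each other."""
    if (t_start, t_end) == (q_start, q_end):
        return "exact"
    if q_start <= t_start and t_end <= q_end:
        return "TinQ"  # target lies inside the query
    if t_start <= q_start and q_end <= t_end:
        return "QinT"  # query lies inside the target
    return "any"       # partial overlap at either end


# Classes accepted by each boundary mode, following the sections above.
ALLOWED = {
    "exact": {"exact"},
    "TinQ": {"exact", "TinQ"},
    "QinT": {"exact", "QinT"},
    "any": {"exact", "TinQ", "QinT", "any"},
}


def allowed(mode, target, query):
    """True if the (start, end) pair of intervals is valid under the given mode."""
    return boundary_class(*target, *query) in ALLOWED[mode]


print(boundary_class(1, 6, 0, 6))          # TinQ
print(allowed("exact", (1, 6), (0, 6)))    # False
print(allowed("any", (1, 8), (0, 7)))      # True
```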

Viewing alignments

You can use samtools tview to view the CRAM file. This is mainly useful when checking fuzzy matching or allowing all boundary types.

To make this more informative generate the fasta file for the sgrna elements:

```
pycroquet guides-to-fa --guidelib guide_library.tsv --fasta sgrna.fa
```

NOTE: a contig is the individual sgrna sequence, not the pair

Now use this with samtools tview to view your alignments. See the command line help to jump directly to a contig of interest, or press ? when using it interactively.

```
samtools tview result.cram sgrna.fa
```

This will give a full screen output like this (N is just padded to screen width for short contigs):

```
1         11
CTAGTTCAGATAAAACAACNNNNNN
...................
...................
...................
...................
...................
................C..
...................
...................
```

Installation

Pypi

```
pip install Cython
pip install pycroquet
```

Docker and Singularity

There are pre-built images containing this codebase on quay.io. When pulling an image you must specify the version; there is no latest tag.

The docker images are known to work correctly after import into a singularity image.

Development

Python 3.9 or later is required.

Linux

```bash
git clone git@github.com:cancerit/pycroquet.git
cd pycroquet
python3 -m venv venv
source venv/bin/activate
pip install -r requirements-dev.txt
python3 ./setup.py develop # dynamic build

# add/activate pre-commit
pip install pre-commit
pre-commit install
```

Mac

```bash
brew update
brew install python@3.9
brew install libmagic
git clone git@github.com:cancerit/pycroquet.git
cd pycroquet
python3.9 -m venv venv
source venv/bin/activate
pip install -r requirements-dev.txt
python3 setup.py develop

# add/activate pre-commit
pip install pre-commit
pre-commit install
```

Testing

There are 4 layers to testing and standards:

  1. Local venv testing
  2. Local pre-commit hooks
  3. Tests embedded in docker build
  4. CI tests

Local venv testing

```bash
./tests/scripts/run_unit_tests.sh
```

Also confirm the distribution can be installed by building and installing it into a different venv:

```bash
rm -rf dist/
python3 setup.py sdist

# new terminal
python3 -m venv tmp-pycroquet-venv
source tmp-pycroquet-venv/bin/activate
pip install ~/pycroquet/dist/pycroquet-*.tar.gz
deactivate
rm -rf tmp-pycroquet-venv
```

Local pre-commit hooks

This project additionally uses git pre-commit hooks via the pre-commit tool. These are concerned with file formats and standards, not the actual execution of code. See ./.pre-commit-config.yaml.

Docker testing

The Docker build includes the unit tests, but removes many of the libraries before the final build stage. This is mainly intended for CI tests.

CI tests

CI includes 2 additional tests, each based on the 2 datasets in the ./examples directory.

Updating licence headers

Please use skywalking-eyes.

Expected workflow:

  1. Check state before modifying .licenserc.yaml:
    • docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header check
    • Files with a header should be reported as 'valid', and those without one as 'invalid'
  2. Modify .licenserc.yaml
  3. Apply the changes:
    • docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
  4. Add/commit changes

This is executed in the CI pipeline.

DO NOT edit the header in the files; instead, modify the date component of content in .licenserc.yaml. The only exception is:

  • README.md

If you need to make more extensive changes to the license, carefully test that the pattern is functional.

LICENSE

``` Copyright (c) 2021-2022

Author: CASM/Cancer IT cgphelp@sanger.ac.uk

This file is part of pycroquet.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

  1. The usage of a range of years within a copyright statement contained within this distribution should be interpreted as being equivalent to a list of years including the first and last year specified and all consecutive years between them. For example, a copyright statement that reads ‘Copyright (c) 2005, 2007- 2009, 2011-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2007, 2008, 2009, 2011, 2012’ and a copyright statement that reads ‘Copyright (c) 2005-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012’. ```

Owner

  • Name: CASM IT
  • Login: cancerit
  • Kind: organization
  • Email: cgpit@sanger.ac.uk
  • Location: Hinxton, Cambridge, UK

CASM IT provides bioinformatics support for the Cancer, Ageing and Somatic Mutation group at the Wellcome Sanger Institute

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

All Time
  • Total Commits: 63
  • Total Committers: 3
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.175
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Keiran Raine k****2@s****k 52
Victoria Offord v****1@s****k 8
Keiran M Raine k****e 3
Issues and Pull Requests

All Time
  • Total issues: 7
  • Total pull requests: 12
  • Average time to close issues: 21 days
  • Average time to close pull requests: 1 day
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 3.57
  • Average comments per pull request: 0.5
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • keiranmraine (4)
  • vaofford (3)
Pull Request Authors
  • keiranmraine (9)
  • superjw (4)
  • vaofford (1)
Top Labels
Issue Labels
enhancement (3) bug (1)
Packages

  • Total packages: 1
  • Total downloads:
    • pypi 3 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 10
  • Total maintainers: 1
pypi.org: pycroquet

python Crispr Read to Oligo QUantification Enhancement Tool

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 3 Last month
Rankings
Dependent packages count: 7.3%
Dependent repos count: 22.1%
Forks count: 30.0%
Stargazers count: 32.0%
Average: 34.2%
Downloads: 79.7%