https://github.com/cancerit/pycroquet
python Crispr Read to Oligo QUantification and Evaluation Tool
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 3 committers (66.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.8%) to scientific vocabulary
Repository
Statistics
- Stars: 2
- Watchers: 12
- Forks: 1
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
pyCROQUET
python Crispr Read to Oligo QUantification and Evaluation Tool
- pyCROQUET
  - Publications
  - General
    - Subcommands
    - Options
      - guidelib
      - queries
      - chunks
      - rules
  - Output files
    - CRAM
    - Dual guide
      - Statistic file extension
  - Boundary mode details
  - Viewing alignments
  - Installation
    - Pypi
    - Docker and Singularity
  - Development
    - Linux
    - Mac
    - Testing
      - Local venv testing
      - Local pre-commit hooks
      - Docker testing
      - CI tests
    - Updating licence headers
  - LICENSE
Publications
Please contact the following for appropriate referencing methods:
- Keiran Raine (kr2@sanger.ac.uk)
- Emre Karakoc (ek11@sanger.ac.uk)
- Victoria Offord (vo1@sanger.ac.uk)
General
Code is in place to support read input in any of the following formats:
- fastq (also gzip compressed)
- sam
- bam
- cram
Subcommands
- single-guide
  - Short single end read quantification.
- dual-guide
  - Paired end read quantification.
- long-read
  - Long single end read quantification.
- guides-to-fa
  - Convert guides to fasta for use with samtools tview.
Options
Not all options apply to every subcommand, but the majority are common.
guidelib
Please see the Guide library format for a description of this file.
queries
Currently the dual-guide mode only supports SAM/BAM/CRAM as input. Convert fastq to unmapped CRAM with:
```
# if data has casava read barcode/qc (text after space in read name) please add "-i"
samtools import --output-fmt CRAM,no_ref=1 -@ 4 -1 $READ1 -2 $READ2 -o $OUTFILE.cram
```
chunks
Chunks should be set to a value that allows all CPUs to be utilized. The value is multiplied by the number of CPUs requested, and this gives the number of unique read sequences held in memory during the mapping phase.
This has a direct impact on memory. The value is automatically reduced when too large to allow full use of requested CPUs.
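As a rough illustration of the sizing described above, the sketch below multiplies the chunk value by the number of CPUs. The function name `reads_held_in_memory` is hypothetical and not part of pycroquet, and the automatic reduction of overly large values is not modelled.

```python
def reads_held_in_memory(chunks: int, cpus: int) -> int:
    """Unique read sequences held in memory during mapping: chunks x CPUs."""
    if chunks < 1 or cpus < 1:
        raise ValueError("chunks and cpus must be positive integers")
    return chunks * cpus


if __name__ == "__main__":
    # Example values only; pick a chunk size that suits the memory available.
    print(reads_held_in_memory(chunks=50_000, cpus=4))  # 200000
```

Doubling either the chunk value or the CPU count doubles the number of sequences held, so memory should be budgeted against the product rather than against the chunk value alone.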
rules
For single-guide --rules MM (allow 2 mismatches in alignment) is a sensible value. For other subcommands the decision
is dependent on the library protocol.
Rules have a direct impact on run time as they increase the time taken to abort an alignment. Individual costs are as follows:
- M = 1
- I = 2 (single b.p.)
- D = 2 (single b.p.)
Performance is only impacted by the maximum penalty you allow.
Be aware that if you wish to allow up to 2 mismatches or 1 mismatch + 1 b.p. insert you must specify:
pycroquet ... --rules MM --rules MI
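To make the cost arithmetic concrete, here is a minimal sketch that scores rule strings using the costs listed above (M = 1, I = 2, D = 2). The helper names `rule_penalty` and `max_penalty` are hypothetical and do not reflect pycroquet's internal implementation.

```python
# Per-operation costs as listed above: mismatch, 1 b.p. insert, 1 b.p. delete.
COSTS = {"M": 1, "I": 2, "D": 2}


def rule_penalty(rule: str) -> int:
    """Sum the cost of each operation letter in a rule such as 'MM' or 'MI'."""
    try:
        return sum(COSTS[op] for op in rule.upper())
    except KeyError as err:
        raise ValueError(f"Unknown rule character: {err.args[0]}") from None


def max_penalty(rules: list[str]) -> int:
    """Largest penalty across all supplied rules (0 when no rules are given)."""
    return max((rule_penalty(rule) for rule in rules), default=0)


if __name__ == "__main__":
    print(rule_penalty("MM"))         # 2
    print(rule_penalty("MI"))         # 3
    print(max_penalty(["MM", "MI"]))  # 3
```

Under these costs, adding `--rules MI` alongside `--rules MM` raises the maximum penalty from 2 to 3, which is the figure the text says governs performance.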
Output files
CRAM
For single-guide you can suppress the CRAM output via the --no-alignment (-n) option. In dual-guide it is tightly
linked to the pairing code so it cannot be disabled.
Reads that map uniquely are written with MAPQ>0 (score calculations have not been refined at this time). There are some
differences in how to interpret the data depending on if you are processing single-guide, dual-guide or long-read.
Reads mapping to a sgrna_id
To get the reads that map uniquely to a guide element (`sgrna_seq`) use the `sgrna_id`. This is primarily of use for `single-guide` and `long-read`:
samtools view -F 4 -q 1 result.cram $SGRNA_ID
To get a single instance of reads that map to a guide element but map equally well to others, select for the SA tag (requires samtools>=1.12):
samtools view -F 4 -F 256 -d SA result.cram $SGRNA_ID
To get reads that failed to map:
samtools view -f 4 result.cram
Reads assigned to a guide
This is only applicable to dual-guide.
You can pull reads by the guide id using samtools view. This example counts R1 reads mapping to a guide (equivalent to the *.counts.tsv
result); replace/set $GUIDE_ID as required:
samtools view -F 4 -f 64 -c -d YG:$GUIDE_ID result.cram
To select all the reads mapped to this guide grouped by readname:
samtools view -u -F 4 -d YG:$GUIDE_ID result.cram | samtools sort -n - | samtools view -b - > $GUIDE_ID.bam
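The `-f` and `-F` values in the commands above are SAM flag bitmasks: 4 marks an unmapped read, 64 the first read of a pair (R1) and 256 a secondary alignment. The sketch below mimics that filtering logic on plain integers. The helper `passes` is hypothetical and exists only to illustrate the bit tests.

```python
UNMAPPED = 0x4        # 4: read is unmapped
FIRST_IN_PAIR = 0x40  # 64: read is the first of a pair (R1)
SECONDARY = 0x100     # 256: secondary alignment


def passes(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Keep a read when all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


if __name__ == "__main__":
    # Flag 99 = paired, proper pair, mate reverse, first in pair: a mapped R1.
    print(passes(99, require=FIRST_IN_PAIR, exclude=UNMAPPED))   # True
    # Flag 147 = paired, proper pair, reverse, second in pair: a mapped R2.
    print(passes(147, require=FIRST_IN_PAIR, exclude=UNMAPPED))  # False
    # Flag 77 = paired, unmapped, mate unmapped, first in pair.
    print(passes(77, require=UNMAPPED))                          # True
```

Here `require` plays the role of `-f` and `exclude` the role of `-F`, so `-F 4 -f 64` keeps mapped R1 reads and `-f 4` keeps only unmapped reads.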
Dual guide
FASTQ(.gz) input is not currently supported for dual guide. Please prepare your data appropriately with samtools import:
samtools import -@ 4 -1 R1.fastq.gz -2 R2.fastq.gz -O BAM -o OUTPUT.bam
Please review the import options as casava information can be interpreted where appropriate.
Statistic file extension
The dual guide output extends the standard json statistics file adding pair_classifications:
| Classification | Description |
|----------------|----------------------------------------|
| match | same vector F/R |
| aberrantmatch | same vector pair, aberrant orientation |
| fmulti3p | 5p mapped F, 3p multihit |
| fmulti5p | 3p mapped F, 5p multihit |
| rmulti3p | 5p mapped R, 3p multihit |
| rmulti5p | 3p mapped R, 5p multihit |
| fopen3p | 5p mapped F, 3p open (unmapped) |
| fopen5p | 3p mapped F, 5p open (unmapped) |
| ropen3p | 5p mapped R, 3p open (unmapped) |
| ropen5p | 3p mapped R, 5p open (unmapped) |
| swap | multi vector, uniq mapped |
| ambiguous | both ends multi hit |
| nomatch | multi/unmapped either end |
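As a sketch of how these classifications might be summarised, the function below computes the share of pairs classified as match. It assumes, without confirmation from the source, that pair_classifications is a mapping of classification name to pair count. The function name `match_fraction` and the example counts are illustrative only.

```python
def match_fraction(pair_classifications: dict[str, int]) -> float:
    """Fraction of classified pairs labelled 'match' (0.0 when there are no pairs)."""
    total = sum(pair_classifications.values())
    if total == 0:
        return 0.0
    return pair_classifications.get("match", 0) / total


if __name__ == "__main__":
    # Made-up counts; the name -> count layout is an assumption about the JSON.
    example = {"match": 80, "swap": 15, "nomatch": 5}
    print(match_fraction(example))  # 0.8
```

If the real statistics file nests the counts differently, only the lookup would need to change; the ratio itself is unaffected.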
Boundary mode details
The -b/--boundary-mode option controls how the guide and read are allowed to overlap. Each section shows the types of
alignment allowed; to be valid they still need to pass the rules and any minimum score.
In all cases XXX indicates original sequence.
exact
Boundary of sequence must be equal between target (guide) and query (read)
T: XXXXXX
Q: XXXXXX
TinQ - target in query
Like the name suggests, valid alignments include those via exact and:
```
T:  XXXXX
Q: XXXXXX

T:  XXXX
Q: XXXXXX

T: XXXXX
Q: XXXXXX
```
QinT - query in target
Reverse of TinQ, valid alignments include those via exact and:
```
T: XXXXXX
Q:  XXXXX

T: XXXXXX
Q:  XXXX

T: XXXXXX
Q: XXXXX
```
any
No boundary checks are performed, this allows more complex events, all alignments from exact, TinQ, QinT plus:
```
T: XXXXXXX
Q:  XXXXXXX

T:  XXXXXXX
Q: XXXXXXX
```
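The four modes can be read as containment tests between the target and query spans. The following is a minimal sketch of that reading, assuming each span is given as half-open start/end coordinates on a shared axis. The function `boundary_ok` is hypothetical and is not pycroquet's implementation, and rules and minimum score checks are not modelled.

```python
def boundary_ok(mode: str, t_start: int, t_end: int, q_start: int, q_end: int) -> bool:
    """Check whether target (guide) and query (read) spans satisfy a boundary mode."""
    if mode == "exact":
        # Both boundaries must coincide.
        return (t_start, t_end) == (q_start, q_end)
    if mode == "TinQ":
        # Target lies within the query; an exact overlap also qualifies.
        return q_start <= t_start and t_end <= q_end
    if mode == "QinT":
        # Query lies within the target; an exact overlap also qualifies.
        return t_start <= q_start and q_end <= t_end
    if mode == "any":
        # No boundary checks are performed.
        return True
    raise ValueError(f"Unknown boundary mode: {mode}")


if __name__ == "__main__":
    print(boundary_ok("exact", 0, 6, 0, 6))  # True
    print(boundary_ok("TinQ", 1, 6, 0, 6))   # True
    print(boundary_ok("QinT", 1, 6, 0, 6))   # False
    print(boundary_ok("any", 0, 7, 1, 8))    # True
```

Staggered spans such as target 0-7 against query 1-8 fail exact, TinQ and QinT and are only accepted under any, which matches the extra cases shown for that mode.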
Viewing alignments
You can use samtools tview to view the cram file. This is mainly useful when checking fuzzy matching or allowing all boundary types.
To make this more informative generate the fasta file for the sgrna elements:
pycroquet guides-to-fa --guidelib guide_library.tsv --fasta sgrna.fa
NOTE: a contig is the individual sgrna sequence, not the pair
Now use this with samtools tview to view your alignments. See the command line help to jump directly to a contig of interest,
or press ? when using it interactively.
samtools tview result.cram sgrna.fa
Will give a full screen output like this (N is just padded to screen width for short contigs):
1 11
CTAGTTCAGATAAAACAACNNNNNN
...................
...................
...................
...................
...................
................C..
...................
...................
Installation
Pypi
pip install Cython
pip install pycroquet
Docker and Singularity
There are pre-built images containing this codebase on quay.io. When pulling an image you must specify
the version, as there is no latest.
The docker images are known to work correctly after import into a singularity image.
Development
Python 3.9 or better required.
Linux
```bash
git clone git@github.com:cancerit/pycroquet.git
cd pycroquet
python3 -m venv venv
source venv/bin/activate
pip install -r requirements-dev.txt
python3 ./setup.py develop # dynamic build

# add/activate pre-commit
pip install pre-commit
pre-commit install
```
Mac
```bash
brew update
brew install python@3.9
brew install libmagic
git clone git@github.com:cancerit/pycroquet.git
cd pycroquet
python3.9 -m venv venv
source venv/bin/activate
pip install -r requirements-dev.txt
python3 setup.py develop

# add/activate pre-commit
pip install pre-commit
pre-commit install
```
Testing
There are 4 layers to testing and standards:
- Local venv testing
- Local pre-commit hooks
- Tests embedded in docker build
- CI tests
Local venv testing
./tests/scripts/run_unit_tests.sh
Also confirm the distribution can be installed by building and installing it into a different venv:
```bash
rm -rf dist/
python3 setup.py sdist
# new terminal
python3 -m venv tmp-pycroquet-venv
source tmp-pycroquet-venv/bin/activate
pip install ~/pycroquet/dist/pycroquet-*.tar.gz
deactivate
rm -rf tmp-pycroquet-venv
```
Local pre-commit hooks
This project additionally uses git pre-commit hooks via the pre-commit tool. These are concerned
with file formats and standards, not the actual execution of code. See ./.pre-commit-config.yaml.
Docker testing
The Docker build includes the unit tests, but removes many of the libraries before the final build stage. It is mainly intended for CI tests.
CI tests
CI includes 2 additional tests, each based on the 2 datasets in the ./examples directory.
Updating licence headers
Please use skywalking-eyes.
Expected workflow:
- Check state before modifying .licenserc.yaml:
  - docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header check
  - You should get some 'valid' here, those without a header as 'invalid'
- Modify .licenserc.yaml
- Apply the changes:
  - docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
- Add/commit changes
This is executed in the CI pipeline.
DO NOT edit the header in the files; please modify the date component of content in .licenserc.yaml. The only exception is:
- README.md
If you need to make more extensive changes to the license, carefully test that the pattern is functional.
LICENSE
```
Copyright (c) 2021-2022
Author: CASM/Cancer IT cgphelp@sanger.ac.uk
This file is part of pycroquet.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.
- The usage of a range of years within a copyright statement contained within this distribution should be interpreted as being equivalent to a list of years including the first and last year specified and all consecutive years between them. For example, a copyright statement that reads ‘Copyright (c) 2005, 2007- 2009, 2011-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2007, 2008, 2009, 2011, 2012’ and a copyright statement that reads ‘Copyright (c) 2005-2012’ should be interpreted as being identical to a statement that reads ‘Copyright (c) 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012’.
```
Owner
- Name: CASM IT
- Login: cancerit
- Kind: organization
- Email: cgpit@sanger.ac.uk
- Location: Hinxton, Cambridge, UK
- Website: http://www.sanger.ac.uk/science/programmes/cancer-genetics-and-genomics
- Repositories: 89
- Profile: https://github.com/cancerit
CASM IT provide bioinformatic support for the Cancer, Ageing and Somatic Mutation group at the Wellcome Sanger Institute
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Keiran Raine | k****2@s****k | 52 |
| Victoria Offord | v****1@s****k | 8 |
| Keiran M Raine | k****e | 3 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 7
- Total pull requests: 12
- Average time to close issues: 21 days
- Average time to close pull requests: 1 day
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 3.57
- Average comments per pull request: 0.5
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- keiranmraine (4)
- vaofford (3)
Pull Request Authors
- keiranmraine (9)
- superjw (4)
- vaofford (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 3 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 10
- Total maintainers: 1
pypi.org: pycroquet
python Crispr Read to Oligo QUantification Enhancement Tool
- Homepage: https://github.com/cancerit/pycroquet
- Documentation: https://pycroquet.readthedocs.io/
- License: AGPL-3.0
- Latest release: 1.6.0 (published almost 4 years ago)