Recent Releases of umi-tools
umi-tools -
UMI-tools output is now deterministic with --random-seed
Many users have had trouble making UMI-tools output deterministic, which previously required both --random-seed and the environment variable PYTHONHASHSEED to be set. From v1.1.6, only --random-seed is required.
Please note that in some cases the implemented solution may make the output from v1.1.6 differ from previous versions, even if --random-seed is set to the same value. The differences will be very slight, and the different outputs represent equally sensible UMI grouping/deduplication since they relate only to how ties are broken.
Thank you @TyberiusPrime, @christianbioinf and others for their suggestions for how to remove the dependency on PYTHONHASHSEED for deterministic output.
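The note above about ties can be illustrated with a minimal sketch. This is not the UMI-tools implementation: `pick_representative` and the example counts are hypothetical. The point is that iteration order over hashed containers of strings can change between interpreter runs unless PYTHONHASHSEED is fixed, whereas sorting the tied candidates and drawing from a locally seeded generator depends only on the seed.

```python
import random


def pick_representative(counts, seed):
    """Return one UMI with the highest count, breaking ties reproducibly.

    counts: dict mapping UMI string -> read count (hypothetical input).
    """
    top = max(counts.values())
    # Sort the tied UMIs so the candidate order never depends on hash ordering.
    tied = sorted(umi for umi, n in counts.items() if n == top)
    # A locally seeded generator makes the choice depend only on the seed.
    return random.Random(seed).choice(tied)


counts = {"ATAT": 10, "ATAA": 10, "GGCC": 3}
print(pick_representative(counts, seed=42))
```

Either of the two tied UMIs is an equally sensible answer; the sketch only guarantees that the same seed yields the same one regardless of input order.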
New features
- umi_tools is now deterministic when using --random-seed - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/550
- Option to extract barcode from read2 only - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/630
- Adds support for python 3.12 - @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/657
Bugfix
- Avoids switching matplotlib backend - @sshen8 in https://github.com/CGATOxford/UMI-tools/pull/640
- count_tab now correctly reads UMI and cell barcodes - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/654
- count_tab now writes out strings not bytes - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/654
- Prevents installation with python < 3 - @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/644
Documentation
- FAQ entry regarding identification of possible duplicate reads/pairs - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/631
- Improved docs regarding chimeric/unmapped/unpaired read pairs - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/629
Other
- Add issue templates - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/632
- Update testing suite to pytest - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/655
New Contributors
- @sshen8 made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/640
- @eachanjohnson made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/654
Full Changelog: https://github.com/CGATOxford/UMI-tools/compare/1.1.5...v1.1.6
- Python
Published by TomSmithCGAT over 1 year ago
umi-tools - 1.1.5
New features
- Enables read suffixes to be removed from single end data: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/591. See https://github.com/CGATOxford/UMI-tools/issues/580 for motivating issue
- Adds a script to prepare umi_tools dedup output for use with RSEM: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/609. See https://github.com/CGATOxford/UMI-tools/issues/465 and https://github.com/CGATOxford/UMI-tools/issues/607 for motivating issues
Bugfix
- Fix lack of help messages in 1.1.4 by @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/586
- Fixes read suffix line end: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/611
Documentation
- Fixed docs for dedup stats filenames: @msto in https://github.com/CGATOxford/UMI-tools/pull/604
New Contributors
- @msto made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/604
Full Changelog: https://github.com/CGATOxford/UMI-tools/compare/1.1.4...1.1.5
- Python
Published by TomSmithCGAT about 2 years ago
umi-tools -
Debug to support python 3.11. Thank you @sjaenick for bringing this to our attention and testing (#563)
- Python
Published by TomSmithCGAT about 3 years ago
umi-tools -
New features
- Adds '--umi-separator' option to umi_tools extract to specify UMI separator. Thanks @opplatek (#548)
Optimisation
- Speeds up read pair mate writing. Significant benefit for transcriptome alignments (#543)
Bugfix
- Handles umi_tools group output to tsv with --per-contig when no gene tags are present. Thanks @mfansler & @akmorrow13 (#577)
- Fixes syntax warning in extract.py. Thanks @rajivnarayan (#558)
- Improves error message for incorrect command line input. Thanks @epruesse (#506 & #537)
- Python
Published by TomSmithCGAT about 3 years ago
umi-tools -
Bugfix
- whitelist --filtered-out with SE reads threw an unassigned error. Thanks @yech1990 for rectifying this (#453)
Also includes a very minor update of syntax (#455)
- Python
Published by TomSmithCGAT almost 5 years ago
umi-tools -
A long overdue release covering some minor functionality updates and bugfixes:
Additional functionality:
- Write out reads failing regex matching with extract/whitelist (see options --filtered-out, --filtered-out2). See #328 for motivation
- Ignore template length with paired-end dedup/group (see option --ignore-tlen). See #357 for motivation. Thanks @skitcattCRUKMI
- Ignore read pair suffixes with extract/whitelist, e.g. /1 or /2 (see option --ignore-read-pair-suffixes). See #325, #391, #418, PierreBSC/Viral-Track#9 for motivation
Performance
- Sped up error correction mapping for cell barcodes in whitelist by using BKTree. Thanks @redst4r. Note that this adds a new python dependency (pybktree) which is available via pip and conda-forge.
- Very slight reduction in memory usage for dedup/group via bugfix to reduce the number of reads being retained in the buffer. Thanks to @mitrinh1 for spotting this (#428). The bug was equivalent to hardcoding the option --buffer-whole-contig on, which ensures all reads with the same start position are grouped together for deduplication, but at the cost of not yielding reads until the end of each contig, thus increasing memory usage. As such, the bug was not detrimental to results output.
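For readers unfamiliar with BK-trees, the following is a minimal pure-Python sketch of the data structure, assuming mismatch (Hamming) distance between equal-length barcodes. It is not the pybktree code and not the UMI-tools whitelist logic; the `BKTree` class and the example barcodes are illustrative only. The speed-up comes from the triangle inequality, which lets a query skip whole subtrees that cannot contain a barcode within the requested distance.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Minimal BK-tree: each node is [item, {distance: child_node}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [item, {}]
                return
            node = child

    def find(self, query, max_dist):
        """Return sorted (distance, item) pairs within max_dist of query."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= max_dist:
                found.append((d, item))
            # Triangle inequality: only subtrees keyed within
            # [d - max_dist, d + max_dist] can contain a match.
            for key, child in children.items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return sorted(found)


tree = BKTree(hamming)
for barcode in ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]:
    tree.add(barcode)
print(tree.find("AAAC", 1))  # [(1, 'AAAA'), (1, 'AAAT')]
```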
Bugfixes:
- Unmapped mates were not properly discarded with dedup and group. Thanks @Daniel-Liu-c0deb0t for rectifying this.
- Python
Published by TomSmithCGAT over 5 years ago
umi-tools -
Debug for KeyError when some reads are missing a cell barcode tag and stats output is required from umi_tools dedup. See comments from @ZHUwj0 in #281
- Python
Published by TomSmithCGAT about 6 years ago
umi-tools - 1.0.0
This release is intended to be a stable release with no plans for significant updates to UMI-tools functionality in the near future. As part of this release, much of the code base has been refactored. It is possible this may have introduced bugs which have not been picked up by the regression testing. If so, please raise an issue and we'll try and rectify with a minor release update ASAP.
Documentation
UMI-tools documentation is now available online: https://umi-tools.readthedocs.io/en/latest/index.html
Along with the previous documentation, the readthedocs pages also include new pages:
- FAQ
- Making use of our Algorithms: The API
New knee method for whitelist
- The method to detect the "knee" in whitelist has been updated (#317). This method should always identify a threshold and is now set as the default method. Note that this knee method appears to be slightly more conservative (fewer cells above threshold) but having identified the knee, one can always re-run whitelist and use --set-cell-number to expand the whitelist if desired
- The old method is still available via --knee-method=density
- In addition, to run the old knee method but allow whitelist to exit without error even if a suitable knee point isn't identified, use the new --allow-threshold-error option (#249)
- Putative errors in CBs above the knee can be detected using --ed-above-threshold (#309)
Explicit options for handling chimeric & improper read pairs (#312)
The behaviour for chimeric read pairs, improper read pairs and unmapped reads can now be explicitly set with the --chimeric-pairs, --unpaired-reads and --unmapped-reads options.
New options
- --temp-dir: Set the directory for temporary files (#254)
- --either-read & --either-read-resolve: Extract the UMI from either read (#175)
Misc
- Updates python testing version to 3.6.7 and drops python 2 testing
- Replace deprecated imp import (#318)
- Debug error with pysam <0.14 (#319)
- Refactor module files
- Moves documentation into dedicated module
- Python
Published by TomSmithCGAT about 7 years ago
umi-tools -
Mainly minor debugs and improved detection of incorrect command line options. Minor updates to documentation.
- Resolves issues correctly skipping reads which have not been assigned (#191 & #273). This involves the addition of the --assigned-status-tag option
Testing for OSX has been dropped due to unresolved issues with travis. We hope to resurrect this in the future!
In line with major python packages (e.g https://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html), support for python 2 will be dropped from January 1st 2019.
- Python
Published by TomSmithCGAT over 7 years ago
umi-tools - 0.5.2
- Adds options to specify a delimiter for a cell barcode or UMI which should be concatenated + options to specify a string splitting the cell barcode or UMI into multiple parts, of which only the first will be used. Note, these options will only work if the barcodes are contained in the BAM tag - if they were appended to the read name using umi_tools extract there is no need for these options. See #217 for motivation:
  - --umi-tag-delimiter=[STRING] = remove the delimiter STRING from the UMI. Defaults to None
  - --umi-tag-split=[STRING] = split UMI by STRING and take only the first portion. Defaults to None
  - --cell-tag-delimiter=[STRING] = remove the delimiter STRING from the cell barcode. Defaults to None
  - --cell-tag-split=[STRING] = split cell barcode by STRING and take only the first portion. Defaults to - to deal with 10X GEMs
- Reduced memory requirements for count --wide-format-cell-counts: #222
- Debugs issues with --bc-pattern2: #201, #221
- Updates documentation: #204, #210, #211 - Thanks @kohlkopf, @hy09 & @cbrueffer
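The difference between the delimiter and split options described above can be sketched as follows. `clean_tag` is a hypothetical helper written for illustration, not a UMI-tools function, and the order in which it applies the two operations when both are given is an assumption.

```python
def clean_tag(value, delimiter=None, split=None):
    """Hypothetical helper mirroring the tag options described above.

    split: keep only the part before the first occurrence of this string.
    delimiter: remove every occurrence of this string, joining the parts.
    """
    if split is not None:
        value = value.split(split)[0]
    if delimiter is not None:
        value = value.replace(delimiter, "")
    return value


# A two-part UMI stored as "ACGT_TTGA" is concatenated by removing "_".
print(clean_tag("ACGT_TTGA", delimiter="_"))  # ACGTTTGA
# A 10X-style cell barcode keeps only the portion before "-".
print(clean_tag("AAACCTGAGAAACCAT-1", split="-"))  # AAACCTGAGAAACCAT
```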
- Python
Published by TomSmithCGAT about 8 years ago
umi-tools - 0.5.1
Minor update. Improves detection of duplicate reads with paired end reads, reduces run time with dedup --output-stats and a few simple debugs.
- Improved identification of duplicate reads from paired end reads - will now use the position of the FIRST splice junction in the read (in reference coords) (#187)
- Speeds up dedup when running with --output-stats (#184)
- Fixes bugs:
  - whitelist --set-cell-number --plot-prefix -> unwanted error
  - dedup gave non-informative error when input contains zero valid reads/read pairs. Now raises a warning but exits with status 0 (#190, #195)
  - count errored if gene identifier contained a ":" (#198)
- Renames --whole-contig option to --buffer-whole-contig to avoid confusion with per-contig option. --whole-contig option will still work but will not be visible in documentation (#196)
- Python
Published by TomSmithCGAT over 8 years ago
umi-tools -
Version 0.5.0 introduces new commands to support single-cell RNA-Seq and reduces run-time. The underlying methods have not changed hence the minor release number uptick.
UMI-tools goes single cell
New commands for single cell RNA-Seq (scRNA-Seq):
- whitelist - Extract cell barcodes (CB) from droplet-based scRNA-Seq fastqs and estimate the number of "true" CBs. Outputs a flatfile listing the true cell barcodes and 'error' barcodes within a set distance. See #97 for a motivating example. Thanks to @Hoohm for input and patience in testing. Thanks to @k3yavi for input in discussions about implementing a 'knee' method.
- count - Count the number of reads per cell per gene after de-duplication. This tool uses the same underlying methods as group and dedup and acts to simplify scRNA-Seq read-counting with umi_tools. See #114, #131
- count_tab - As per count but works from a flatfile input from e.g. featureCounts - See #44, #121, #125
In the process of creating these commands, the options for dealing with UMIs on a "per-gene" basis have been re-jigged to make their purpose clearer. See e.g. #127 for a motivating example.
To perform group, dedup or count on a per-gene basis, the --per-gene option should be provided. This must be combined with either --gene-tag if the BAM contains gene assignments in a tag, or --per-contig if the reads have been aligned to a transcriptome. In the latter case, if the reads have been aligned to a transcriptome where each contig is a transcript, the option --gene-transcript-map can be used to operate at the gene level. These options are standardised across all tools such that one can easily change e.g. a count command into a dedup command.
Updated options:
- extract - Can now accept regex patterns to describe UMI +/- CB encoding in read(s). See the --extract-method=regex option.
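As a rough illustration of what describing the barcode layout with a regex means, the sketch below uses Python's standard `re` module with named groups. The group names `cell_1` and `umi_1`, the 8+6 base layout, and the way the barcodes are appended to the read name are all assumptions made for this example; consult the UMI-tools documentation for the names and output format the tool actually uses.

```python
import re

# Hypothetical layout: 8-base cell barcode, then 6-base UMI, then the insert.
pattern = re.compile(r"^(?P<cell_1>.{8})(?P<umi_1>.{6})")


def extract_barcodes(read_name, sequence):
    """Return (new_read_name, trimmed_sequence), or None if no match."""
    match = pattern.match(sequence)
    if match is None:
        return None
    cell, umi = match.group("cell_1"), match.group("umi_1")
    # Assumed naming scheme: barcodes appended to the read name.
    return f"{read_name}_{cell}_{umi}", sequence[match.end():]


print(extract_barcodes("read1", "AAAACCCCGGTTGGACGTACGTAC"))
# ('read1_AAAACCCC_GGTTGG', 'ACGTACGTAC')
```

A regex also allows layouts that a fixed-position pattern cannot express, such as a variable-length spacer between the cell barcode and the UMI.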
We have written a guide for how to use UMI-tools for scRNA-Seq analysis including estimation of the number of true CBs, flexible extraction of cell barcodes and UMIs and per-cell read-counting as well as common workflow variations.
Reduced run-time (#156)
Introduced a hashing step to limit the scope of the edit-distance comparisons required to build the networks. Big thanks to @mparker2 for this!
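One way such a hashing step can work is sketched below. This is an illustration under stated assumptions, not the code merged in #156: it assumes equal-length, unique UMIs compared by mismatches only, and the function names are hypothetical. Each UMI is cut into threshold + 1 chunks; by the pigeonhole principle, two UMIs within `threshold` mismatches must agree exactly on at least one chunk, so only UMIs sharing a chunk need a full distance comparison.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(umis, threshold):
    """Yield each UMI pair that shares at least one exact chunk."""
    n_chunks = threshold + 1
    length = len(umis[0])
    bounds = [(i * length // n_chunks, (i + 1) * length // n_chunks)
              for i in range(n_chunks)]
    # Hash every (chunk index, chunk sequence) to the UMIs containing it.
    buckets = defaultdict(set)
    for umi in umis:
        for idx, (start, end) in enumerate(bounds):
            buckets[(idx, umi[start:end])].add(umi)
    seen = set()
    for members in buckets.values():
        for pair in combinations(sorted(members), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


def neighbours(umis, threshold):
    """Return sorted pairs within threshold, checking only candidates."""
    return sorted(pair for pair in candidate_pairs(umis, threshold)
                  if hamming(*pair) <= threshold)


umis = ["AAAAAA", "AAAAAT", "AAATTT", "CCCCCC", "CCCCCA"]
print(neighbours(umis, 1))
# [('AAAAAA', 'AAAAAT'), ('CCCCCA', 'CCCCCC')]
```

With many distinct UMIs at a position, most pairs never share a chunk, so the number of distance computations falls well below the all-against-all count.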
Simplified installation ( #145 )
Previously, extensions were cythonized and compiled on the fly using pyximport, requiring users to have access to the install directory the first time the extension was required. Now the cythonized extension is provided, and is compiled at install-time.
- Python
Published by TomSmithCGAT over 8 years ago
umi-tools - 0.4.4
- Tweaks the way group handles paired end BAMs. To simplify the process and ensure all reads are written out, the paired end read (read 2) is now outputted without a group or UMI tag. (#115).
- Introduces the --skip-tags-regex option to enable users to skip descriptive gene tags, such as "Unassigned" when using the --gene-tag option. See #108.
- Bugfixes:
  - If the --transcript-gene-map included transcripts not observed in the BAM, this caused an error when trying to retrieve reads aligned to the transcript. This has been resolved. See #109
  - Allow output to zipped file with extract using python 3 #104
- Improved test coverage (--chrom and --gene-tag options). Thanks @MarinusVL for kindly sharing a BAM with gene tags.
- Python
Published by TomSmithCGAT almost 9 years ago
umi-tools -
Due to a bug in pysam.fetch() paired end files with a large number of contigs could take a long time to process (see #93). This has now been resolved.
Thanks to @gpratt for spotting and resolving this.
- Python
Published by TomSmithCGAT almost 9 years ago
umi-tools -
Added functionality:
Deduplicating on gene ids (#44 for motivation): The user can now group/dedup according to the gene which the read aligns to. This is useful for single cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may be different for reads generated from the same initial molecule. The following options may be used to define the gene_id for each read:
- --per-gene
- --gene-transcript-map
- --gene-tag
Working with BAM tags (#73, #76, #89): UMIs can now be extracted from the BAM tags and group will add a tag to each read describing the read group and UMI. See the following options for controlling this behaviour:
- --extract-umi-method
- --umi-tag
- --umi-group-tag
Output unmapped reads (#78): The group command will now output unmapped reads if the --output-unmapped option is supplied. These reads will not be assigned to any group.
+ bug fixes for group command (#67, #81) and updated documentation (#77, #79)
- Python
Published by TomSmithCGAT almost 9 years ago
umi-tools -
The code has been tweaked to improve run-time. See #69 for a discussion about the changes implemented.
- Python
Published by TomSmithCGAT about 9 years ago
umi-tools -
- Corrects the edit distance comparison used to generate the network for the directional method.
- This will only affect results generated using the directional method and --edit-distance-threshold >1
Previously, using the directional method with the option --edit-distance-threshold set to > 1 did not return the expected set of de-duplicated reads. If you have used the directional method with a threshold >1, we recommend updating UMI-tools and re-running dedup.
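To show where the threshold enters the directional method, here is a minimal sketch of the edge-building step. It assumes mismatch (Hamming) distance and the count criterion from the published description of the method (an edge from A to B requires count(A) >= 2 * count(B) - 1); neither detail is stated in these release notes, and `directional_edges` is a hypothetical name rather than UMI-tools code.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(counts, threshold=1):
    """Return directed (parent, child) edges between UMIs at one position."""
    edges = []
    for parent in sorted(counts):
        for child in sorted(counts):
            if parent == child:
                continue
            # The distance test must use the threshold itself, not a fixed 1.
            close = hamming(parent, child) <= threshold
            abundant = counts[parent] >= 2 * counts[child] - 1
            if close and abundant:
                edges.append((parent, child))
    return edges


counts = {"ACGT": 100, "ACGA": 10, "ACTA": 4}
print(directional_edges(counts, threshold=1))
# [('ACGA', 'ACTA'), ('ACGT', 'ACGA')]
print(directional_edges(counts, threshold=2))
# [('ACGA', 'ACTA'), ('ACGT', 'ACGA'), ('ACGT', 'ACTA')]
```

Raising the threshold to 2 adds the direct edge from ACGT to ACTA, which is the kind of connection a comparison hardwired to a single mismatch would miss.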
- Python
Published by TomSmithCGAT about 9 years ago
umi-tools -
- Debugs python 3 compatibility issues
- Adds python 3 tests
- Python
Published by TomSmithCGAT about 9 years ago
umi-tools -
Minor bump: Resolves setuptools-based installation issue
- Python
Published by TomSmithCGAT about 9 years ago
umi-tools -
Version bump to allow pypi update. No code changes
- Python
Published by TomSmithCGAT over 9 years ago
umi-tools -
- Adds the new group command to group PCR duplicates and return the groups in a tagged BAM file and/or flat file format. This was motivated by multiple requests to group PCR duplicated reads for downstream processes, e.g. #45, #54. Special thanks to Nils Koelling (@koelling) for testing the group command.
- Adds the --umi-separator option for dedup and group for workflows where umi_tools extract is not used to extract the UMI. This was motivated by #58
- Python
Published by TomSmithCGAT over 9 years ago
umi-tools -
From v0.2.6 onwards the *directional-adjacency* method is renamed *directional*
- Python
Published by TomSmithCGAT over 9 years ago
umi-tools -
- Debugs writing out paired end
- Debugs installation
- Python
Published by TomSmithCGAT over 9 years ago
umi-tools -
extract
- New feature: Filter out reads by UMI base-call quality score with the --quality-threshold and --quality-encoding options (#29, #33)
dedup
- Improved performance for paired end files (#31, #35)
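The quality filter added to extract in this release can be pictured with the short sketch below. It assumes a read is kept only if every UMI base meets the threshold, and that the encoding is expressed as an ASCII offset (33 for phred33, 64 for phred64); the exact comparison UMI-tools applies is not stated here, and `umi_passes_quality` is a hypothetical helper.

```python
def umi_passes_quality(umi_quals, threshold, offset=33):
    """Return True if every UMI base quality is at least `threshold`.

    umi_quals: FASTQ quality string covering the UMI bases only.
    offset: ASCII offset of the quality encoding (assumed 33 by default).
    """
    return all(ord(ch) - offset >= threshold for ch in umi_quals)


print(umi_passes_quality("IIIIII", threshold=20))  # True: every base is Q40
print(umi_passes_quality("III#II", threshold=20))  # False: '#' is Q2
```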
- Python
Published by TomSmithCGAT almost 10 years ago
umi-tools - Debugs read extraction from 3' end
Debugs read extraction from 3' end
- Python
Published by TomSmithCGAT almost 10 years ago
umi-tools - Improved memory performance for UMI extraction
Improved memory performance for UMI extraction from paired end reads
- Python
Published by TomSmithCGAT almost 10 years ago