Releases | Open Source Science

umi-tools -

UMI-tools output is now deterministic with `--random-seed`

Many users have had trouble making UMI-tools output deterministic, which previously required both --random-seed and the environment variable PYTHONHASHSEED to be set. From v1.1.6, only --random-seed is required.

Please note that in some cases the implemented solution may make the output from v1.1.6 differ from previous versions, even if --random-seed is set to the same value. The differences will be very slight, and the different outputs represent equally sensible UMI grouping/deduplication, since they relate only to how ties are broken.
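As an illustration of why tie-breaking is where runs can diverge, here is a minimal Python sketch. It is not the UMI-tools implementation; the function name `pick_representative` and the example counts are hypothetical. It assumes that ties between equally abundant UMIs are resolved with a seeded random choice, and shows that sorting the tied candidates first removes any dependence on set or dict iteration order (which, for strings, can vary with PYTHONHASHSEED).

```python
import random
from collections import Counter


def pick_representative(umi_counts, seed):
    """Return the most abundant UMI, breaking ties reproducibly.

    Hypothetical helper: candidates are sorted before the seeded random
    choice, so the result does not depend on iteration order.
    """
    rng = random.Random(seed)
    top = max(umi_counts.values())
    tied = sorted(umi for umi, n in umi_counts.items() if n == top)
    return rng.choice(tied)


counts = Counter({"ATTG": 5, "ATTC": 5, "GTTG": 1})
first = pick_representative(counts, seed=42)

# Rebuild the same counts via a set, whose iteration order is arbitrary.
reordered = {umi: counts[umi] for umi in set(counts)}
second = pick_representative(reordered, seed=42)
```

With the sort in place, `first` and `second` agree for a given seed regardless of how the candidates were stored.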

Thank you @TyberiusPrime, @christianbioinf and others for their suggestions for how to remove the dependency on PYTHONHASHSEED for deterministic output.

New features

umi_tools is now deterministic when using --random-seed - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/550
Option to extract barcode from read2 only - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/630
Adds support for python 3.12 - @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/657

Bugfix

Avoids switching matplotlib backend - @sshen8 in https://github.com/CGATOxford/UMI-tools/pull/640
count_tab now correctly reads UMI and cell barcodes - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/654
count_tab now writes out strings not bytes - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/654
Installation with < python 3 prevented - @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/644

Documentation

FAQ entry regarding identification of possible duplicate reads/pairs - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/631
Improved docs regarding chimeric/unmapped/unpaired read pairs - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/629

Other

Add issue templates - @TomSmithCGAT in https://github.com/CGATOxford/UMI-tools/pull/632
Update testing suite to pytest - @eachanjohnson in https://github.com/CGATOxford/UMI-tools/pull/655

New Contributors

@sshen8 made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/640
@eachanjohnson made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/654

Full Changelog: https://github.com/CGATOxford/UMI-tools/compare/1.1.5...v1.1.6

- Python
Published by TomSmithCGAT over 1 year ago

umi-tools - 1.1.5

New features

Enables read suffixes to be removed from single end data: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/591. See https://github.com/CGATOxford/UMI-tools/issues/580 for motivating issue
Adds a script to prepare umi_tools dedup output for use with RSEM: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/609. See https://github.com/CGATOxford/UMI-tools/issues/465 and https://github.com/CGATOxford/UMI-tools/issues/607 for motivating issues

Bugfix

Fix lack of help messages in 1.1.4 by @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/586
Fixes read suffix line end: @IanSudbery in https://github.com/CGATOxford/UMI-tools/pull/611

Documentation

Fixed docs for dedup stats filenames: @msto in https://github.com/CGATOxford/UMI-tools/pull/604

New Contributors

@msto made their first contribution in https://github.com/CGATOxford/UMI-tools/pull/604

Full Changelog: https://github.com/CGATOxford/UMI-tools/compare/1.1.4...1.1.5

- Python
Published by TomSmithCGAT about 2 years ago

umi-tools -

Debug to support python 3.11. Thank you @sjaenick for bringing this to our attention and testing (#563)

- Python
Published by TomSmithCGAT about 3 years ago

umi-tools -

New features

Adds '--umi-separator' option to umi_tools extract to specify UMI separator. Thanks @opplatek (#548)

Optimisation

Speeds up read pair mate writing. Significant benefit for transcriptome alignments (#543)

Bugfix

Handles umi_tools group output to tsv with --per-contig when no gene tags are present. Thanks @mfansler & @akmorrow13 (#577)
Fixes syntax warning in extract.py. Thanks @rajivnarayan (#558)
Improves error message for incorrect command line input. Thanks @epruesse (#506 & #537)

- Python
Published by TomSmithCGAT about 3 years ago

umi-tools -

Bugfix - whitelist --filtered-out with SE reads threw an unassigned error. Thanks @yech1990 for rectifying this (#453)

Also includes a very minor update of syntax (#455)

- Python
Published by TomSmithCGAT almost 5 years ago

umi-tools - 1.1.1

Updates requirements for pysam version to >0.16.0.1. Thanks @sunnymouse25 (#444)

- Python
Published by TomSmithCGAT over 5 years ago

umi-tools -

A long overdue release covering some minor functionality updates and bugfixes:

Additional functionality:

Write out reads failing regex matching with extract/whitelist (see options --filtered-out, --filtered-out2). See #328 for motivation
Ignore template length with paired-end dedup/group (see option --ignore-tlen). See #357 for motivation. Thanks @skitcattCRUKMI
Ignore read pair suffixes with extract/whitelist e.g /1 or /2. (see option --ignore-read-pair-suffixes). See #325, #391, #418, PierreBSC/Viral-Track#9 for motivation
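The effect of ignoring read pair suffixes can be pictured with a short Python sketch. This is only an illustration of the idea, assuming suffixes are exactly a trailing /1 or /2; the helper name `strip_pair_suffix` and the read names are hypothetical, and this is not how UMI-tools itself is implemented.

```python
def strip_pair_suffix(read_name):
    """Drop a trailing /1 or /2 from a read identifier, if present."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


r1 = "@SRR001.7/1"
r2 = "@SRR001.7/2"

# After stripping, both mates carry the same identifier.
same_template = strip_pair_suffix(r1) == strip_pair_suffix(r2)
```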

Performance

Sped up error correction mapping for cell barcodes in whitelist by using BKTree. Thanks @redst4r. Note that this adds a new python dependency (pybktree) which is available via pip and conda-forge.
Very slight reduction in memory usage for dedup/group via a bugfix that reduces the number of reads retained in the buffer. Thanks to @mitrinh1 for spotting this (#428). The bug was equivalent to hardcoding the option --buffer-whole-contig on, which ensures all reads with the same start position are grouped together for deduplication, but at the cost of not yielding reads until the end of each contig, thus increasing memory usage. As such, the bug was not detrimental to results output.
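For readers unfamiliar with BK-trees, the following is a minimal, self-contained Python sketch of the data structure mentioned above. It is not the pybktree code and not the UMI-tools implementation; the class, the use of Hamming distance, and the example barcodes are assumptions made for illustration. The point is that the triangle inequality lets a lookup skip whole subtrees rather than comparing the query against every barcode.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Minimal BK-tree: each node is [item, {distance: child_node}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [item, {}]
                return
            node = child

    def find(self, query, max_dist):
        """Return sorted (distance, item) pairs within max_dist of query."""
        if self.root is None:
            return []
        hits, stack = [], [self.root]
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= max_dist:
                hits.append((d, item))
            # Triangle inequality: only edges labelled within
            # [d - max_dist, d + max_dist] can lead to matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(hits)


tree = BKTree(hamming)
for barcode in ["AAAA", "AAAT", "CCCC", "GGGG"]:
    tree.add(barcode)

# Barcodes within one mismatch of the query.
matches = tree.find("AAAC", 1)
```

In this toy example the lookup never visits "CCCC" or "GGGG", because their edge label (4) falls outside the range allowed by the query's distance to the root.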

Bugfixes:

Unmapped mates were not properly discarded with dedup and group. Thanks @Daniel-Liu-c0deb0t for rectifying this.

- Python
Published by TomSmithCGAT over 5 years ago

umi-tools -

Debug for KeyError when some reads are missing a cell barcode tag and stats output is required from umi_tools dedup. See comments from @ZHUwj0 in #281

- Python
Published by TomSmithCGAT about 6 years ago

umi-tools - 1.0.0

This release is intended to be a stable release with no plans for significant updates to UMI-tools functionality in the near future. As part of this release, much of the code base has been refactored. It is possible this may have introduced bugs which have not been picked up by the regression testing. If so, please raise an issue and we'll try and rectify with a minor release update ASAP.

Documentation

UMI-tools documentation is now available online: https://umi-tools.readthedocs.io/en/latest/index.html

Along with the previous documentation, the readthedocs pages also include new pages:

FAQ
Making use of our Algorithms: The API

New knee method for whitelist

The method to detect the "knee" in whitelist has been updated (#317). This method should always identify a threshold and is now set as the default method. Note that this knee method appears to be slightly more conservative (fewer cells above threshold) but having identified the knee, one can always re-run whitelist and use --set-cell-number to expand the whitelist if desired
The old method is still available via --knee-method=density
In addition, to run the old knee method but allow whitelist to exit without error even if a suitable knee point isn't identified, use the new --allow-threshold-error option (#249)
Putative errors in CBs above the knee can be detected using --ed-above-threshold (#309)

Explicit options for handling chimeric & improper read pairs (#312)

The behaviour for chimeric read pairs, improper read pairs and unmapped reads can now be explicitly set with the --chimeric-pairs, --unpaired-reads and --unmapped-reads options.

New options

--temp-dir: Set the directory for temporary files (#254)
--either-read & --either-read-resolve: Extract the UMI from either read (#175)

Misc

Updates python testing version to 3.6.7 and drops python 2 testing
Replace deprecated imp import (#318)
Debug error with pysam <0.14 (#319)
Refactor module files
Moves documentation into dedicated module

- Python
Published by TomSmithCGAT about 7 years ago

umi-tools -

Mainly minor debugs and improved detection of incorrect command line options. Minor updates to documentation.

Resolves issues correctly skipping reads which have not been assigned (#191 & #273). This involves the addition of the --assigned-status-tag option

Testing for OSX has been dropped due to unresolved issues with travis. We hope to resurrect this in the future!

In line with major python packages (e.g https://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html), support for python 2 will be dropped from January 1st 2019.

- Python
Published by TomSmithCGAT over 7 years ago

umi-tools - 0.5.4

The default value for --skip_regex was incorrectly formatted. Thanks to @ekernf01 for spotting (#231/#256)

- Python
Published by TomSmithCGAT over 7 years ago

umi-tools - 0.5.3

Debugs wide-format output for count (#227). Thanks @kevin199011

- Python
Published by TomSmithCGAT about 8 years ago

umi-tools - 0.5.2

Adds options to specify a delimiter to remove from a cell barcode or UMI so that its parts are concatenated, plus options to specify a string that splits the cell barcode or UMI into multiple parts, of which only the first will be used. Note that these options only work if the barcodes are contained in a BAM tag; if they were appended to the read name using umi_tools extract, there is no need for these options. See #217 for motivation:
- --umi-tag-delimiter=[STRING] = remove the delimiter STRING from the UMI. Defaults to None
- --umi-tag-split=[STRING] = split UMI by STRING and take only the first portion. Defaults to None
- --cell-tag-delimiter=[STRING] = remove the delimiter STRING from the cell barcode. Defaults to None
- --cell-tag-split=[STRING] = split cell barcode by STRING and take only the first portion. Defaults to - to deal with 10X GEMs
Reduced memory requirements for count --wide-format-cell-counts: #222
Debugs issues with --bc-pattern2: #201, #221
Updates documentation: #204, #210, #211 - Thanks @kohlkopf, @hy09 & @cbrueffer
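The behaviour described for the tag delimiter and tag split options above can be sketched in a few lines of Python. This is an illustration of the described effect only, not the UMI-tools code; the helper name `clean_tag`, the example tag values, and the order in which the two operations are applied when both are given are assumptions.

```python
def clean_tag(value, delimiter=None, split=None):
    """Mimic the described effect of the *-tag-split / *-tag-delimiter options.

    Hypothetical helper: split first (keep the first portion), then remove
    any delimiter from what remains.
    """
    if split is not None:
        value = value.split(split)[0]
    if delimiter is not None:
        value = value.replace(delimiter, "")
    return value


# A UMI stored in two parts joined by "_" is concatenated.
umi = clean_tag("ACGT_TTGA", delimiter="_")

# A cell barcode with a trailing "-1" style suffix keeps only the first part.
cell = clean_tag("AACCGGTT-1", split="-")
```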

- Python
Published by TomSmithCGAT about 8 years ago

umi-tools - 0.5.1

Minor update. Improves detection of duplicate reads with paired end reads, reduces run time with dedup --output-stats and a few simple debugs.

Improved identification of duplicate reads from paired end reads - will now use the position of the FIRST splice junction in the read (in reference coords) (#187)
Speeds up dedup when running with --output-stats - (#184)
Fixes bugs:
- whitelist --set-cell-number --plot-prefix -> unwanted error
- dedup gave non-informative error when input contains zero valid reads/read pairs. Now raises a warning but exits with status 0 (#190, #195)
- count errored if gene identifier contained a ":" (#198)
Renames --whole-contig option to --buffer-whole-contig to avoid confusion with per-contig option. --whole-contig option will still work but will not be visible in documentation (#196)

- Python
Published by TomSmithCGAT over 8 years ago

umi-tools -

Version 0.5.0 introduces new commands to support single-cell RNA-Seq and reduces run-time. The underlying methods have not changed hence the minor release number uptick.

UMI-tools goes single cell

New commands for single cell RNA-Seq (scRNA-Seq):

whitelist - Extract cell barcodes (CB) from droplet-based scRNA-Seq fastqs and estimate the number of "true" CBs. Outputs a flatfile listing the true cell barcodes and 'error' barcodes within a set distance. See #97 for a motivating example. Thanks to @Hoohm for input and patience in testing. Thanks to @k3yavi for input in discussions about implementing a 'knee' method.
count - Count the number of reads per cell per gene after de-duplication. This tool uses the same underlying methods as group and dedup and acts to simplify scRNA-Seq read-counting with umi_tools. See #114, #131
count_tab - As per count but works from a flatfile input from e.g featureCounts - See #44, #121, #125

In the process of creating these commands, the options for dealing with UMIs on a "per-gene" basis have been re-jigged to make their purpose clearer. See e.g #127 for a motivating example.

To perform group, dedup or count on a per-gene basis, the --per-gene option should be provided. This must be combined with either --gene-tag if the BAM contains gene assignments in a tag, or --per-contig if the reads have been aligned to a transcriptome. In the latter case, if the reads have been aligned to a transcriptome where each contig is a transcript, the option --gene-transcript-map can be used to operate at the gene level. These options are standardised across all tools such that one can easily change e.g a count command into a dedup command.

Updated options:

extract - Can now accept regex patterns to describe UMI +/- CB encoding in read(s). See --extract-method=regex option.
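To give a feel for regex-based extraction, here is a small Python sketch using named groups. It uses the standard library `re` module and a made-up read layout (8 bp cell barcode, 6 bp UMI, then three Ts to discard); the pattern, the helper name `extract`, and the example read are assumptions, and the `cell_1` / `umi_1` / `discard_1` group names follow the naming described in the UMI-tools documentation. It is not the UMI-tools implementation.

```python
import re

# Hypothetical read layout: 8 bp cell barcode, 6 bp UMI, then "TTT" to discard.
pattern = re.compile(r"^(?P<cell_1>.{8})(?P<umi_1>.{6})(?P<discard_1>T{3})")


def extract(read_seq):
    """Return (cell barcode, UMI, remaining sequence), or None if no match."""
    m = pattern.match(read_seq)
    if m is None:
        return None
    groups = m.groupdict()
    # Concatenate all cell_* and umi_* groups in name order.
    cell = "".join(v for k, v in sorted(groups.items()) if k.startswith("cell_"))
    umi = "".join(v for k, v in sorted(groups.items()) if k.startswith("umi_"))
    return cell, umi, read_seq[m.end():]


result = extract("AACCGGTTACGTACTTTGGGCCC")
too_short = extract("ACGT")
```

Reads that do not match the pattern return None here, which corresponds loosely to reads that would be filtered out during extraction.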

We have written a guide for how to use UMI-tools for scRNA-Seq analysis including estimation of the number of true CBs, flexible extraction of cell barcodes and UMIs and per-cell read-counting as well as common workflow variations.

Reduced run-time (#156)

Introduced a hashing step to limit the scope of the edit-distance comparisons required to build the networks. Big thanks to @mparker2 for this!
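One way such a hashing step can limit comparisons is a substring index, sketched below in Python. This is an illustration under stated assumptions rather than the code from #156: UMIs are equal length, distance is measured as Hamming distance (mismatches), and all function names are hypothetical. The idea is a pigeonhole argument: if each UMI is cut into threshold + 1 chunks, two UMIs within `threshold` mismatches must share at least one chunk exactly, so only UMIs landing in the same bucket need a full comparison.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def slices(length, parts):
    """Split range(length) into `parts` contiguous (start, end) chunks."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def candidate_pairs(umis, threshold):
    """Yield each pair of UMIs that shares at least one exact chunk."""
    index = defaultdict(set)
    for umi in umis:
        for start, end in slices(len(umi), threshold + 1):
            index[(start, umi[start:end])].add(umi)
    seen = set()
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            if (a, b) not in seen:
                seen.add((a, b))
                yield a, b


umis = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG", "GGGGGG"]
candidates = list(candidate_pairs(umis, threshold=1))

# Only the candidates need the full distance check.
neighbours = sorted(p for p in candidates if hamming(*p) <= 1)

# Brute force for comparison: every pair is checked.
all_pairs = list(combinations(sorted(umis), 2))
brute_force = sorted(p for p in all_pairs if hamming(*p) <= 1)
```

In this toy example the index proposes 2 candidate pairs instead of the 10 pairs a brute-force comparison would check, and both approaches find the same neighbours.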

Simplified installation (#145)

Previously extensions were cythonized and compiled on the fly using pyximport, requiring users to have access to the install directory the first time the extension was required. Now the cythonized extension is provided, and is compiled at install-time.

- Python
Published by TomSmithCGAT over 8 years ago

umi-tools - 0.4.4

Tweaks the way group handles paired end BAMs. To simplify the process and ensure all reads are written out, the paired end read (read 2) is now output without a group or UMI tag (#115).
Introduces the --skip-tags-regex option to enable users to skip descriptive gene tags, such as "Unassigned" when using the --gene-tag option. See #108.
Bugfixes:
- If the --transcript-gene-map included transcripts not observed in the BAM, this caused an error when trying to retrieve reads aligned to the transcript. This has been resolved. See #109
- Allow output to zipped file with extract using python 3 #104
Improved test coverage (--chrom and --gene-tag options). Thanks @MarinusVL for kindly sharing a BAM with gene tags.

- Python
Published by TomSmithCGAT almost 9 years ago

umi-tools - 0.4.3

Improves run time for large networks (see #94, #31).

Thanks to @gpratt for identifying the issue and implementing the solution

- Python
Published by TomSmithCGAT almost 9 years ago

umi-tools - 0.4.2

When using the directional method with the group command, the 'top' UMI within each group was not always the most abundant (see comments in #96). This has now been resolved
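For context, the sketch below shows in Python how a directional grouping can guarantee that the first UMI in each group is the most abundant. It is a simplified illustration, not the UMI-tools code: the function names and counts are hypothetical, distance is taken as Hamming distance, and the edge rule count(a) >= 2 * count(b) - 1 is the one described for the directional method in the UMI-tools publication. Groups are seeded from the most abundant unassigned UMI, so the seed is the "top" UMI by construction.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts, threshold=1):
    """Group UMIs; an edge a -> b needs distance <= threshold
    and counts[a] >= 2 * counts[b] - 1."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    order = sorted(counts, key=lambda u: (-counts[u], u))
    assigned, groups = set(), []
    for seed in order:
        if seed in assigned:
            continue
        group, frontier = [seed], [seed]
        assigned.add(seed)
        while frontier:
            parent = frontier.pop()
            for umi in order:
                if umi in assigned:
                    continue
                if (hamming(parent, umi) <= threshold
                        and counts[parent] >= 2 * counts[umi] - 1):
                    assigned.add(umi)
                    group.append(umi)
                    frontier.append(umi)
        groups.append(group)
    return groups


counts = {"ACGT": 100, "ACGA": 10, "ACCA": 2, "TTTT": 50}
groups = directional_groups(counts)
```

Here "ACCA" joins the first group through "ACGA" even though it is two mismatches from "ACGT", and the first element of each group is its most abundant UMI.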

- Python
Published by TomSmithCGAT almost 9 years ago

umi-tools -

Due to a bug in pysam.fetch() paired end files with a large number of contigs could take a long time to process (see #93). This has now been resolved.

Thanks to @gpratt for spotting and resolving this.

- Python
Published by TomSmithCGAT almost 9 years ago

umi-tools -

Added functionality:

Deduplicating on gene ids (see #44 for motivation): The user can now group/dedup according to the gene which the read aligns to. This is useful for single cell RNA-Seq methods such as e.g CEL-Seq where the position of the read on a transcript may be different for reads generated from the same initial molecule. The following options may be used to define the gene_id for each read: --per-gene --gene-transcript-map --gene-tag
Working with BAM tags (#73, #76, #89): UMIs can now be extracted from the BAM tags and group will add a tag to each read describing the read group and UMI. See following options for controlling this behaviour: --extract-umi-method --umi-tag --umi-group-tag
Output unmapped reads (#78): The group command will now output unmapped reads if the --output-unmapped option is supplied. These reads will not be assigned to any group.

Also includes bug fixes for the group command (#67, #81) and updated documentation (#77, #79)

- Python
Published by TomSmithCGAT almost 9 years ago

umi-tools - 0.3.6

Improves the group command:
- Adds the --subset option as per the dedup command (#74)
- Corrects the flatfile output from the dedup command (#72)

- Python
Published by TomSmithCGAT about 9 years ago

umi-tools -

The code has been tweaked to improve run-time. See #69 for a discussion about the changes implemented.

- Python
Published by TomSmithCGAT about 9 years ago

umi-tools -

Corrects the edit distance comparison used to generate the network for the directional method.
This will only affect results generated using the directional method and --edit-distance-threshold >1

Previously, using the directional method with the option --edit-distance-threshold set to > 1 did not return the expected set of de-duplicated reads. If you have used the directional method with a threshold >1, we recommend updating UMI-tools and re-running dedup.

- Python
Published by TomSmithCGAT about 9 years ago

umi-tools -

Debugs python 3 compatibility issues
Adds python 3 tests

- Python
Published by TomSmithCGAT about 9 years ago

umi-tools -

Minor bump: Resolves setuptools-based installation issue

- Python
Published by TomSmithCGAT about 9 years ago