Recent Releases of TorchMetrics - Measuring Reproducibility in PyTorch

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.8.2] - 2025-09-03

Fixed

  • Fixed BinaryPrecisionRecallCurve to return NaN for precision when no predictions meet a threshold (#3227)
  • Fixed precision_at_fixed_recall and recall_at_fixed_precision to correctly return NaN thresholds when recall/precision conditions are not met (#3226)

Key Contributors

@iamkulbhushansingh

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.8.1...v1.8.2

Scientific Software - Peer-reviewed - Python
Published by Borda 9 months ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.8.1] - 2025-08-07

Changed

  • Added reduction='none' to vif metric (#3196)
  • Float input support for segmentation metrics (#3198)

Fixed

  • Fixed unintended sigmoid normalization in BinaryPrecisionRecallCurve (#3182)

Key Contributors

@iamkulbhushansingh, @PussyCat0700, @simonreise

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.8.0...v1.8.1

Scientific Software - Peer-reviewed - Python
Published by Borda 10 months ago

TorchMetrics - Measuring Reproducibility in PyTorch - First video and vertex metrics

The TorchMetrics v1.8.0 release introduces three flagship metrics, each designed to address critical evaluation needs in real-world applications.

Video Multi-Method Assessment Fusion (VMAF) brings a perceptual video-quality score that closely mirrors human judgment, powering streaming services such as Netflix and YouTube to optimize encoding ladders for consistent viewer experiences and enabling video-restoration labs to quantify improvements achieved by denoising and super-resolution algorithms.

Continuous Ranked Probability Score (CRPS) enables comprehensive evaluation of full predictive distributions rather than point estimates; meteorological centers leverage CRPS to benchmark probabilistic precipitation and temperature forecasts, improving public weather alerts, while energy companies apply it to assess uncertainty in load-demand predictions and refine grid management and trading strategies.
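To make the idea of scoring a full predictive distribution concrete, here is a minimal sketch of CRPS for an ensemble forecast, assuming the common empirical estimator CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are drawn from the ensemble members. The function names are hypothetical, and the library's own estimator (for example, how it counts member pairs) may differ in detail.

```python
from typing import Sequence


def crps_ensemble(members: Sequence[float], observation: float) -> float:
    """CRPS of one ensemble forecast against one observation (empirical estimator)."""
    m = len(members)
    if m == 0:
        raise ValueError("need at least one ensemble member")
    # Accuracy term: mean absolute error of the members against the observation.
    skill = sum(abs(x - observation) for x in members) / m
    # Sharpness term: mean absolute difference over all ordered member pairs (i == j included).
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return skill - 0.5 * spread


def mean_crps(forecasts: Sequence[Sequence[float]], observations: Sequence[float]) -> float:
    """Average CRPS over a batch of (ensemble, observation) pairs."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("forecasts and observations must have the same, non-zero length")
    scores = [crps_ensemble(f, y) for f, y in zip(forecasts, observations)]
    return sum(scores) / len(scores)
```

With a single member the score reduces to the absolute error, which is why CRPS is often described as a generalization of MAE to probabilistic forecasts.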

Lip Vertex Error (LVE) measures the discrepancy between predicted and ground-truth lip vertices of a 3D face mesh to quantify audio-visual synchronization. Localization studios use LVE to validate lip-sync accuracy during film dubbing, while AR/VR developers integrate it into avatar pipelines to ensure natural mouth movements in real-time virtual meetings and social experiences.
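A minimal sketch of the usual LVE definition from the speech-driven 3D facial animation literature: for each frame, take the largest Euclidean error over the lip vertices, then average over frames. The `lip_indices` argument and the function name are hypothetical, and the TorchMetrics implementation may differ (for example, by using squared distances).

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def lip_vertex_error(
    pred_frames: Sequence[Sequence[Vertex]],
    target_frames: Sequence[Sequence[Vertex]],
    lip_indices: Sequence[int],
) -> float:
    """Mean over frames of the maximum Euclidean error among the lip vertices."""
    if len(pred_frames) != len(target_frames) or not pred_frames:
        raise ValueError("pred and target must have the same, non-zero number of frames")
    per_frame_max = []
    for pred, target in zip(pred_frames, target_frames):
        # Only the selected lip vertices contribute; the rest of the face is ignored.
        errors = [math.dist(pred[i], target[i]) for i in lip_indices]
        per_frame_max.append(max(errors))
    return sum(per_frame_max) / len(per_frame_max)
```

Taking the per-frame maximum rather than the mean makes the score sensitive to the single worst-placed lip vertex, which is what makes lip-sync errors visible.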


[1.8.0] - 2025-07-23

Added

  • Added VMAF metric to new video domain (#2991)
  • Added CRPS in regression domain (#3024)
  • Added aggregation_level argument to DiceScore (#3018)
  • Added support for reduction="none" to LearnedPerceptualImagePatchSimilarity (#3053)
  • Added support for single str input in the functional interface of bert_score (#3056)
  • Enhanced BERTScore to evaluate hypotheses against multiple references (#3069)
  • Added Lip Vertex Error (LVE) in multimodal domain (#3090)
  • Added antialias argument to FID metric (#3177)
  • Added mixed input format to segmentation metrics (#3176)

Changed

  • Changed data_range argument in PSNR metric to be a required argument (#3178)
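Why data_range matters for PSNR: the score is 10 log10(data_range² / MSE), so the same pixel error yields very different values depending on whether images are assumed to span [0, 1] or [0, 255]. The sketch below (a hypothetical pure-Python helper on flattened pixel lists, not the library function) makes the dependence explicit; inferring the range from the data is ambiguous, which is presumably why the argument became required.

```python
import math
from typing import Sequence


def psnr(preds: Sequence[float], target: Sequence[float], data_range: float) -> float:
    """Peak signal-to-noise ratio in dB for flattened images."""
    if len(preds) != len(target) or not preds:
        raise ValueError("preds and target must have the same, non-zero length")
    mse = sum((p - t) ** 2 for p, t in zip(preds, target)) / len(preds)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(data_range**2 / mse)
```

For example, an MSE of 1 gives 0 dB with `data_range=1.0` but about 48.1 dB with `data_range=255.0`.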

Removed

  • Removed zero_division argument from DiceScore (#3018)

Key Contributors

@nkaenzig, @rittik9, @simonreise, @SkafteNicki

New Contributors

  • @lantiga made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3054
  • @AlexVerine made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3057
  • @ZhiyuanChen made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3059
  • @ahmedhshahin made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3101
  • @gratus907 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3103
  • @cyyever made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3118
  • @Armannas made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3124
  • @alifa98 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3128
  • @simonreise made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/3176

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.7.0...v1.8.0

Scientific Software - Peer-reviewed - Python
Published by Borda 10 months ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.7.4] - 2025-07-04

Changed

  • Improved numerical stability of Pearson's correlation coefficient (#3152)
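The release note does not describe the change in #3152, so the following is only a generic illustration of the numerical issue. The textbook single-pass formula based on sums of squares suffers catastrophic cancellation when inputs carry a large offset; a two-pass computation that centers the data first avoids it. The function name and the NaN return for constant input are assumptions of this sketch.

```python
import math
from typing import Sequence


def pearson_corrcoef(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-pass Pearson correlation: subtract the means before forming products."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = math.fsum(a * b for a, b in zip(dx, dy))
    var_x = math.fsum(a * a for a in dx)
    var_y = math.fsum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for constant input (assumed convention)
    return cov / math.sqrt(var_x * var_y)
```

Because the deviations are small numbers, values such as `1e8 + 1, 1e8 + 2, 1e8 + 3` still yield a correlation of 1.0 with themselves.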

Fixed

  • Fixed: Ignore zero and negative predictions in retrieval metrics (#3160)
  • Fixed SSIM dist_reduce_fx when reduction=None for distributed training (#3162, #3166)
  • Fixed attribute error (#3154)
  • Fixed incorrect shape in _pearson_corrcoef_update (#3168)

Key Contributors

@AymenKallala, @gratus907, @Isalia20, @rittik9

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.7.3...v1.7.4

Scientific Software - Peer-reviewed - Python
Published by Borda 11 months ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.7.3] - 2025-06-13

Fixed

  • Fixed: ensure WrapperMetric resets wrapped_metric state (#3123)
  • Fixed top_k in multiclass_accuracy (#3117)
  • Fixed compatibility with COCO format for pycocotools 2.0.10 (#3131)

Key Contributors

@rittik9

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.7.2...v1.7.3

Scientific Software - Peer-reviewed - Python
Published by Borda 12 months ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.7.2] - 2025-05-27

Changed

  • Improved performance of _rank_data (#3103)

Fixed

  • Fixed UnboundLocalError in MatthewsCorrCoef (#3059)
  • Fixed MIFID incorrectly converting inputs to byte dtype with custom encoders (#3064)
  • Fixed ignore_index in MultilabelExactMatch (#3085)
  • Fixed: disable non-blocking on MPS (#3101)

Key Contributors

@ahmedhshahin, @gratus907, @rittik9, @ZhiyuanChen

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.7.1...v1.7.2

Scientific Software - Peer-reviewed - Python
Published by Borda about 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.7.1] - 2025-04-06

Changed

  • Enhanced support for adding a MetricCollection to another MetricCollection in the add_metrics function (#3032)

Fixed

  • Fixed handling of absent classes in MeanIoU (#2892)
  • Fixed detection IoU ignoring predictions without ground truth (#3025)
  • Fixed error raised in MulticlassAccuracy when top_k>1 (#3039)

Key Contributors

@Isalia20, @rittik9, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.7.0...v1.7.1

Scientific Software - Peer-reviewed - Python
Published by Borda about 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - More image metrics

The TorchMetrics v1.7.0 release delivers new features and enhancements across multiple domains. In the image domain, significant additions include the ARNIQA and DeepImageStructureAndTextureSimilarity metrics for assessing image quality and similarity. Additionally, the CLIPScore metric now supports more models and processors, expanding its versatility in image-text alignment tasks.

Beyond image analysis, the regression package welcomes the JensenShannonDivergence metric for comparing probability distributions. The clustering package also gains the ClusterAccuracy metric, which scores a clustering by its accuracy after optimally matching predicted cluster ids to ground-truth classes.
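Cluster ids are arbitrary labels, so plain accuracy is meaningless until each cluster is mapped to a class. A minimal sketch, assuming integer cluster ids and class ids in `range(num_classes)`: build a contingency table and pick the one-to-one mapping with the most matches. It brute-forces all permutations for clarity, so it only suits small `num_classes`; a linear assignment (Hungarian) solver is the usual practical choice. The function name is hypothetical.

```python
from itertools import permutations
from typing import Sequence


def cluster_accuracy(preds: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Accuracy under the best one-to-one mapping from cluster ids to class ids."""
    if len(preds) != len(target) or not preds:
        raise ValueError("preds and target must have the same, non-zero length")
    # contingency[c][k] = number of samples assigned to cluster c whose true class is k
    contingency = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, target):
        contingency[p][t] += 1
    # mapping[c] is the class assigned to cluster c; keep the mapping with the most hits.
    best = max(
        sum(contingency[c][mapping[c]] for c in range(num_classes))
        for mapping in permutations(range(num_classes))
    )
    return best / len(preds)
```

A clustering that merely swaps label names, such as predicting `[1, 1, 0, 0, 2, 2]` for classes `[0, 0, 1, 1, 2, 2]`, still scores 1.0.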

In the realm of classification, the Equal Error Rate (EER) metric has been added. EER is the error rate at the decision threshold where the false positive rate equals the false negative rate, and it is a standard summary for verification tasks such as biometric authentication, where both error types are weighted equally. Furthermore, the MeanAveragePrecision metric now includes a functional interface, enhancing its usability and flexibility for users.
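A minimal sketch of EER by threshold scanning, assuming binary labels (1 = positive) and the convention that a score at or above the threshold is predicted positive. It reports the mean of FPR and FNR at the candidate threshold where the two are closest; the library may instead derive EER from its ROC curve, possibly with interpolation, so results can differ slightly. The function name is hypothetical.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER using the observed scores as candidate thresholds."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in neg) / len(neg)  # negatives wrongly accepted
        fnr = sum(s < t for s in pos) / len(pos)  # positives wrongly rejected
        if abs(fpr - fnr) < best_gap:
            best_gap, best_eer = abs(fpr - fnr), (fpr + fnr) / 2
    return best_eer
```

Perfectly separated scores give an EER of 0, while a threshold that misclassifies one of two positives and one of two negatives gives 0.5.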

These updates collectively enhance the capabilities of TorchMetrics, making it an even more comprehensive and indispensable resource for machine learning practitioners and researchers.

[1.7.0] - 2025-03-20

Added

  • Additions to image domain:
    • Added ARNIQA metric (#2953)
    • Added DeepImageStructureAndTextureSimilarity (#2993)
    • Added support for more models and processors in CLIPScore (#2978)
  • Added JensenShannonDivergence metric to regression package (#2992)
  • Added ClusterAccuracy metric to cluster package (#2777)
  • Added Equal Error Rate (EER) to classification package (#3013)
  • Added functional interface to MeanAveragePrecision metric (#3011)

Changed

  • Made num_classes optional for one-hot inputs in MeanIoU (#3012)

Removed

  • Removed Dice from classification (#3017)

Fixed

  • Fixed edge case in integration between class-wise wrapper and metric tracker (#3008)
  • Fixed IndexError in MulticlassAccuracy when using top_k with a single sample (#3021)

Key Contributors

@Isalia20, @LorenzoAgnolucci, @nathanpainchaud, @rittik9, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.6.0...v1.7.0

Scientific Software - Peer-reviewed - Python
Published by Borda about 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.6.3] - 2025-03-13

Fixed

  • Fixed logic in how metric states referencing is handled in MetricCollection (#2990)
  • Fixed integration between class-wise wrapper and metric tracker (#3004)

Key Contributors

@SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.6.2...v1.6.3

Scientific Software - Peer-reviewed - Python
Published by Borda about 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.6.2] - 2025-02-28

Added

  • Added zero_division argument to DiceScore in segmentation package (#2860)
  • Added cache_session to DNSMOS metric to control caching behavior (#2974)
  • Added disable option to nan_strategy in basic aggregation metrics (#2943)

Changed

  • Made num_classes optional for classification in case of micro averaging (#2841)
  • Enhanced CLIPScore to calculate similarities between the same modalities (#2875)

Fixed

  • Fixed DiceScore when there is zero overlap between predictions and targets (#2860)
  • Fixed MeanAveragePrecision for average="micro" when 0 label is not present (#2968)
  • Fixed corner-case in PearsonCorrCoef when input is constant (#2975)
  • Fixed MetricCollection.update giving identical results (#2944)
  • Fixed missing kwargs in PIT metric for permutation wise mode (#2977)
  • Fixed multiple errors in the _final_aggregation function for PearsonCorrCoef (#2980)
  • Fixed incorrect CLIP-IQA type hints (#2952)

Key Contributors

@baskrahmer, @czmrand, @rbedyakin, @rittik9, @SkafteNicki, @wooseopkim

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.6.1...v1.6.2

Scientific Software - Peer-reviewed - Python
Published by Borda about 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.6.1] - 2024-12-25

Changed

  • Enabled specifying weights path for FID (#2867)
  • Removed Device2Host synchronization caused by communication between device and host (#2840)

Fixed

  • Fixed plotting of multilabel confusion matrix (#2858)
  • Fixed issue with shared state in metric collection when using dice score (#2848)
  • Fixed top_k for MulticlassF1Score with one-hot encoding (#2839)
  • Fixed slow calculations of classification metrics with MPS (#2876)

Key Contributors

@Isalia20, @nkaenzig, @podgorki, @rittik9, @yuvalkirstain, @zhaozheng09

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.6.0...v1.6.1

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - More metrics

The TorchMetrics v1.6.0 release introduces several new metrics and methods that extend the library's functionality and usability across various domains.

One of the key additions is the NISQA audio metric, a non-intrusive (reference-free) assessment of speech quality. In the classification domain, the new LogAUC metric summarizes the ROC curve on a logarithmic false-positive-rate axis, emphasizing performance at low false positive rates, and NegativePredictiveValue reports the fraction of negative predictions that are correct, TN / (TN + FN). For regression tasks, the NormalizedRootMeanSquaredError metric divides the RMSE by a normalization factor derived from the target (such as its mean, range, or standard deviation), making errors comparable across targets with different scales.
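A minimal sketch of the normalization idea, assuming three common choices of normalizing factor computed from the target. The function name and the option strings are illustrative assumptions and are not taken from the library's signature. The point it demonstrates is scale independence: rescaling predictions and targets by the same factor leaves NRMSE unchanged, while plain RMSE scales with the data.

```python
import math
from typing import Sequence


def nrmse(preds: Sequence[float], target: Sequence[float], normalization: str = "range") -> float:
    """RMSE divided by a scale factor of the target ("range", "mean" or "std")."""
    n = len(preds)
    if n != len(target) or n == 0:
        raise ValueError("preds and target must have the same, non-zero length")
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, target)) / n)
    mean = sum(target) / n
    if normalization == "range":
        denom = max(target) - min(target)
    elif normalization == "mean":
        denom = mean
    elif normalization == "std":
        denom = math.sqrt(sum((t - mean) ** 2 for t in target) / n)
    else:
        raise ValueError(f"unknown normalization: {normalization}")
    if denom == 0:
        raise ZeroDivisionError("normalization factor is zero")
    return rmse / denom
```

Note that normalization does not make the metric robust to outliers: a single large error still dominates the RMSE in the numerator.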

In the field of image segmentation, the new Dice metric evaluates segmentation models by measuring the overlap between predicted and ground-truth masks. Additionally, the merge_state method has been added to the Metric class, allowing the state of another metric instance to be merged into the current one, for example to combine results accumulated separately on different devices or processes.
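The two ideas in this paragraph fit together: binary Dice, 2|A ∩ B| / (|A| + |B|), depends only on additive counts, so partial states can be merged by summing them. The sketch below uses a hypothetical `DiceState` class on flattened 0/1 masks to illustrate the concept behind merging metric states; it is not the TorchMetrics API. Treating two empty masks as a perfect match is an assumed convention, and libraries differ on it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DiceState:
    """Hypothetical sufficient statistics for binary Dice (all fields are counts)."""

    intersection: int = 0
    pred_sum: int = 0
    target_sum: int = 0

    def update(self, preds: Sequence[int], target: Sequence[int]) -> None:
        # preds and target are flattened binary masks (0 or 1 per pixel).
        for p, t in zip(preds, target):
            self.intersection += p * t
            self.pred_sum += p
            self.target_sum += t

    def merge(self, other: "DiceState") -> None:
        # Counts are additive, so merging two partial states is a field-wise sum.
        self.intersection += other.intersection
        self.pred_sum += other.pred_sum
        self.target_sum += other.target_sum

    def compute(self) -> float:
        denom = self.pred_sum + self.target_sum
        return 1.0 if denom == 0 else 2 * self.intersection / denom
```

Updating one state on the full data and merging two states that each saw half of it yield the same Dice score, which is the property that makes distributed aggregation cheap.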

Furthermore, this release supports propagating the autograd graph in Distributed Data-Parallel (DDP) settings, so that metric values computed from states synchronized across processes can be backpropagated through, for example when a metric is used as part of a training loss. Together, these changes make TorchMetrics a more versatile tool for model evaluation across a wide range of applications.

[1.6.0] - 2024-11-12

Added

  • Added audio metric NISQA (#2792)
  • Added classification metric LogAUC (#2377)
  • Added classification metric NegativePredictiveValue (#2433)
  • Added regression metric NormalizedRootMeanSquaredError (#2442)
  • Added segmentation metric Dice (#2725)
  • Added method merge_state to Metric (#2786)
  • Added support for propagation of the autograd graph in DDP setting (#2754)

Changed

  • Changed naming and input order arguments in KLDivergence (#2800)

Deprecated

  • Deprecated Dice from classification metrics (#2725)

Removed

  • Changed minimum supported PyTorch version to 2.0 (#2671)
  • Dropped support for Python 3.8 (#2827)
  • Removed num_outputs in R2Score (#2800)

Fixed

  • Fixed segmentation Dice + GeneralizedDice for 2d index tensors (#2832)
  • Fixed mixed results of rouge_score with accumulate='best' (#2830)

Key Contributors

@Borda, @cw-tan, @philgzl, @rittik9, @SkafteNicki

New Contributors since 1.5.0

  • @bfolie made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2793
  • @StalkerShurik made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2811
  • @philgzl made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2792
  • @cw-tan made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2754

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.5.0...v1.6.0

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.5.2] - 2024-11-07

Changed

  • Re-added NumPy 2+ support (#2804)

Fixed

  • Fixed wrong IoU scores in detection when either predictions or targets are empty (#2805)
  • Fixed MetricCollection compatibility with torch.jit.script (#2813)
  • Fixed assert in PIT (#2811)
  • Patched np.Inf for numpy 2.0+ (#2826)

Key Contributors

@adamjstewart, @Borda, @SkafteNicki, @StalkerShurik, @yurithefury

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.5.1...v1.5.2

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor compatibility patch

[1.5.1] - 2024-10-22

Fixed

  • Fixed metric collections failing due to the changed _modules dict type in PyTorch 2.5 (#2793)

Key Contributors

@bfolie

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.5.0...v1.5.1

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Shape metric

Shape metrics are quantitative methods used to assess and compare the geometric properties of objects, often in datasets that represent shapes. One such metric is the Procrustes Disparity, which measures the sum of the squared differences between two datasets after applying a Procrustes transformation. This transformation involves scaling, rotating, and translating the datasets to achieve optimal alignment. The Procrustes Disparity is particularly useful when comparing datasets that are similar in structure but not perfectly aligned, allowing for more meaningful comparison by minimizing differences due to orientation or size.
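A minimal sketch of Procrustes disparity for 2D point sets with known correspondences, following the same standardization as `scipy.spatial.procrustes`: center both shapes, scale them to unit Frobenius norm, then align them with the best orthogonal transform and scale. Restricting to 2D allows a closed form without an SVD library: for the 2 × 2 cross-covariance matrix M, (σ₁ + σ₂)² = ‖M‖²_F + 2|det M|, and the disparity equals 1 − (σ₁ + σ₂)². Like the scipy routine, this sketch allows reflections; the function names are hypothetical and this is not the TorchMetrics implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _standardize(points: Sequence[Point]) -> List[Point]:
    """Center the points at the origin and scale them to unit Frobenius norm."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    if norm == 0:
        raise ValueError("all points coincide; the shape is degenerate")
    return [(x / norm, y / norm) for x, y in centered]


def procrustes_disparity_2d(shape_a: Sequence[Point], shape_b: Sequence[Point]) -> float:
    """Sum of squared differences after optimal translation, scaling and orthogonal alignment."""
    if len(shape_a) != len(shape_b) or len(shape_a) < 2:
        raise ValueError("shapes need the same number (>= 2) of corresponding points")
    a = _standardize(shape_a)
    b = _standardize(shape_b)
    # Cross-covariance matrix M = A^T B (2 x 2).
    m00 = sum(p[0] * q[0] for p, q in zip(a, b))
    m01 = sum(p[0] * q[1] for p, q in zip(a, b))
    m10 = sum(p[1] * q[0] for p, q in zip(a, b))
    m11 = sum(p[1] * q[1] for p, q in zip(a, b))
    # Squared sum of singular values of a 2 x 2 matrix: ||M||_F^2 + 2 |det M|.
    trace_norm_sq = m00**2 + m01**2 + m10**2 + m11**2 + 2 * abs(m00 * m11 - m01 * m10)
    # Both shapes have unit norm, so the residual is 1 - (sigma_1 + sigma_2)^2; clamp rounding noise.
    return max(0.0, 1.0 - trace_norm_sq)
```

A unit square compared with a rotated, scaled, and translated copy of itself gives a disparity of (numerically) zero, while comparing it with four collinear points gives 0.6.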

[1.5.0] - 2024-10-18

Added

  • Added segmentation metric HausdorffDistance (#2122)
  • Added audio metric DNSMOS (#2525)
  • Added shape metric ProcrustesDisparity (#2723)
  • Added MetricInputTransformer wrapper (#2392)
  • Added input_format argument to segmentation metrics (#2572)
  • Added multi-output support for MAE metric (#2605)
  • Added truncation argument to BERTScore (#2776)

Changed

  • Improved integration of higher_is_better in MetricTracker (#2649)
  • Updated InfoLM class to dynamically set higher_is_better (#2674)

Deprecated

  • Deprecated num_outputs in R2Score (#2705)

Fixed

  • Fixed corner case in IoU metric for single empty prediction tensors (#2780)
  • Fixed PSNR calculation for integer type input images (#2788)

Key Contributors

@Astraightrain, @grahamannett, @lgienapp, @matsumotosan, @quancs, @SkafteNicki

New Contributors since 1.4.0

  • @kalekundert made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2543
  • @lgienapp made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2392
  • @sweber1 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2634
  • @gxy-gxy made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2347
  • @Astraightrain made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2605
  • @ndrwrbgs made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2640
  • @grahamannett made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2674
  • @petertheprocess made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2721
  • @rittik9 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2726
  • @vkinakh made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2698
  • @likawind made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2732
  • @veera-puthiran-14082 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2753
  • @GPPassos made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2727

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.4.0...v1.5.0

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.4.3] - 2024-10-10

Fixed

  • Fixed PearsonCorrCoef modifying its inputs (#2765)
  • Fixed bug in PESQ metric where NoUtterancesError prevented calculating on a batch of data (#2753)
  • Fixed corner case in MatthewsCorrCoef (#2743)

Key Contributors

@Borda, @SkafteNicki, @veera-puthiran-14082

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.4.2...v1.4.3

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.4.2] - 2024-09-12

Added

  • Re-added Chrf implementation (#2701)

Fixed

  • Fixed wrong aggregation in segmentation.MeanIoU (#2698)
  • Fixed handling zero division error in binary IoU (Jaccard index) calculation (#2726)
  • Fixed padding-related calculation errors in SSIM (#2721)
  • Fixed compatibility of audio domain with new scipy (#2733)
  • Fixed how prefix/postfix works in MultitaskWrapper (#2722)
  • Fixed flakiness in tests related to torch.unique with dim=None (#2650)

Key Contributors

@Borda, @petertheprocess, @rittik9, @SkafteNicki, @vkinakh

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.4.1...v1.4.2

Scientific Software - Peer-reviewed - Python
Published by Borda over 1 year ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.4.1] - 2024-08-02

Changed

  • Text color of the ConfusionMatrix plot is now calculated based on luminance (#2590)
  • Updated _safe_divide to allow Accuracy to run on the GPU (#2640)
  • Improved error messages for intersection detection metrics on wrong user input (#2577)

Removed

  • Dropped Chrf implementation due to licensing issues with the upstream package (#2668)

Fixed

  • Fixed bug in MetricCollection when using compute groups and compute is called more than once (#2571)
  • Fixed class order of panoptic_quality(..., return_per_class=True) output (#2548)
  • Fixed BootstrapWrapper not being reset correctly (#2574)
  • Fixed integration between ClasswiseWrapper and MetricCollection with custom _filter_kwargs method (#2575)
  • Fixed BERTScore calculation: misalignment between preds and target (#2347)
  • Fixed _cumsum helper function in multi-gpu (#2636)
  • Fixed bug in MeanAveragePrecision.coco_to_tm (#2588)
  • Fixed missed f-strings in exceptions/warnings (#2667)

Key Contributors

@Borda, @gxy-gxy, @i-aki-y, @ndrwrbgs, @relativityhd, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.4.0...v1.4.1

Scientific Software - Peer-reviewed - Python
Published by Borda almost 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor dependency correction

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.4.0...v1.4.0.post0

Scientific Software - Peer-reviewed - Python
Published by Borda about 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Metrics for segmentation

In TorchMetrics v1.4, we are happy to introduce a new domain of metrics to the library: segmentation metrics. Segmentation metrics evaluate how well segmentation algorithms perform, i.e., algorithms that take in an image and decide, pixel by pixel, which kind of object each pixel belongs to. These kinds of algorithms are necessary in applications such as self-driving cars. Segmentation metrics are closely related to classification metrics, but for now TorchMetrics expects their input to be formatted differently; see the documentation for more info. MeanIoU and GeneralizedDiceScore have been added to the subpackage, with many more to follow in upcoming releases of TorchMetrics. We are happy to receive any feedback on metrics to add in the future or on the user interface for the new segmentation metrics.

TorchMetrics v1.4 also adds new metrics to the classification and image subpackages and includes multiple bug fixes and other quality-of-life improvements. We refer to the changelog for the complete list of changes.
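As an illustration of the segmentation metrics introduced in v1.4, here is a minimal sketch of mean IoU on flattened index masks, where each entry is a class id. Per class, IoU = |pred ∩ target| / |pred ∪ target|, and the class scores are averaged. Skipping classes absent from both masks is an assumption of this sketch (how to treat absent classes is exactly the kind of detail later releases adjusted), and the function name is hypothetical rather than the library's input format.

```python
from typing import List, Sequence


def mean_iou(preds: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes for flattened index masks."""
    if len(preds) != len(target):
        raise ValueError("preds and target must have the same length")
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(preds, target) if p == c and t == c)
        union = sum(1 for p, t in zip(preds, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks (assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

For `preds=[0, 0, 1, 1]` and `target=[0, 1, 1, 1]` with three classes, class 0 scores 1/2, class 1 scores 2/3, class 2 is skipped, and the mean is 7/12.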

[1.4.0] - 2024-05-03

Added

  • Added SensitivityAtSpecificity metric to classification subpackage (#2217)
  • Added QualityWithNoReference metric to image subpackage (#2288)
  • Added new segmentation metrics:
    • MeanIoU (#1236)
    • GeneralizedDiceScore (#1090)
  • Added support for calculating segmentation quality and recognition quality in PanopticQuality metric (#2381)
  • Added pretty-errors for improving error prints (#2431)
  • Added support for torch.float weighted networks for FID and KID calculations (#2483)
  • Added zero_division argument to selected classification metrics (#2198)

Changed

  • Made __getattr__ and __setattr__ of ClasswiseWrapper more general (#2424)

Fixed

  • Fixed __getitem__ for MetricCollection when prefix/postfix is set (#2430)
  • Fixed axis names with Precision-Recall curve (#2462)
  • Fixed list synchronization with partly empty lists (#2468)
  • Fixed memory leak in metrics using list states (#2492)
  • Fixed bug in computation of ERGAS metric (#2498)
  • Fixed BootStrapper wrapper not working with arguments provided as kwargs (#2503)
  • Fixed warnings being suppressed in MeanAveragePrecision when requested (#2501)
  • Fixed corner-case in binary_average_precision when only negative samples are provided (#2507)

Key Contributors

@baskrahmer, @Borda, @ChristophReich1996, @daniel-code, @furkan-celik, @i-aki-y, @jlcsilva, @NielsRogge, @oguz-hanoglu, @SkafteNicki, @ywchan2005

New Contributors

  • @eamonn-zh made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2345
  • @nsmlzl made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2346
  • @fschlatt made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2364
  • @JonasVerbickas made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2358
  • @AtomicVar made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2391
  • @JDongian made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2400
  • @daniel-code made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2390
  • @baskrahmer made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2457
  • @ChristophReich1996 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2381
  • @lukazso made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2491
  • @S-aiueo32 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2499
  • @dominicgkerr made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2493
  • @Shoumik-Gandre made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2482
  • @randombenj made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2511
  • @NielsRogge made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1236
  • @i-aki-y made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2198

If we forgot someone due to not matching commit email with GitHub account, let us know :]


Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.3.0...v1.4.0

Scientific Software - Peer-reviewed - Python
Published by Borda about 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.3.2] - 2024-03-18

Fixed

  • Fixed negative variance estimates in certain image metrics (#2378)
  • Fixed dtype being changed by deepspeed for certain regression metrics (#2379)
  • Fixed plotting of metric collection when prefix/postfix is set (#2429)
  • Fixed bug when top_k>1 and average="macro" for classification metrics (#2423)
  • Fixed case where label prediction tensors in classification metrics were not validated correctly (#2427)
  • Fixed how auc scores are calculated in PrecisionRecallCurve.plot methods (#2437)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.3.1...v1.3.2

Key Contributors

@Borda, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[1.3.1] - 2024-02-12

Fixed

  • Fixed how backprop is handled in LPIPS metric (#2326)
  • Fixed MultitaskWrapper not being able to be logged in lightning when using metric collections (#2349)
  • Fixed high memory consumption in Perplexity metric (#2346)
  • Fixed cached network in FeatureShare not being moved to the correct device (#2348)
  • Fixed naming of statistics in MeanAveragePrecision with custom max det thresholds (#2367)
  • Fixed custom aggregation in retrieval metrics (#2364)
  • Fixed initialization of aggregation metrics with the default floating type (#2366)
  • Fixed plotting of confusion matrices (#2358)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.3.0...v1.3.1

Key Contributors

@Borda, @fschlatt, @JonasVerbickas, @nsmlzl, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor package patch

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.3.0...v1.3.0.post0

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor package patch

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.3.0...v1.3.0.post

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - New Image metrics & wrappers

[1.3.0] - 2024-01-10

Added

  • Added more tokenizers for SacreBLEU metric (#2068)
  • Added support for logging MultiTaskWrapper directly with Lightning's log_dict method (#2213)
  • Added FeatureShare wrapper to share submodules containing feature extractors between metrics (#2120)
  • Added new metrics to image domain:
    • SpatialDistortionIndex (#2260)
    • CriticalSuccessIndex (#2257)
    • SpatialCorrelationCoefficient (#2248)
  • Added average argument to multiclass versions of PrecisionRecallCurve and ROC (#2084)
  • Added confidence scores when extended_summary=True in MeanAveragePrecision (#2212)
  • Added RetrievalAUROC metric (#2251)
  • Added aggregate argument to retrieval metrics (#2220)
  • Added utility functions in segmentation.utils for future segmentation metrics (#2105)

Changed

  • Changed minimum supported PyTorch version from 1.8 to 1.10 (#2145)
  • Changed x-/y-axis order for PrecisionRecallCurve to be consistent with scikit-learn (#2183)

Deprecated

  • Deprecated metric._update_called (#2141)
  • Deprecated specicity_at_sensitivity in favour of specificity_at_sensitivity (#2199)

Fixed

  • Fixed support for half precision + CPU in metrics requiring topk operator (#2252)
  • Fixed warning incorrectly being raised in Running metrics (#2256)
  • Fixed integration with custom feature extractor in FID metric (#2277)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.2.0...v1.3.0

Key Contributors

@Borda, @HoseinAkbarzadeh, @matsumotosan, @miskfi, @oguz-hanoglu, @SkafteNicki, @stancld, @ywchan2005

New Contributors

  • @pme0 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2114
  • @damiankucharski made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2173
  • @clumsy made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2185
  • @jankng made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2226
  • @tanguymagne made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2230
  • @kyle-dorman made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2184
  • @oguz-hanoglu made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2199
  • @miskfi made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2257
  • @ywchan2005 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2260
  • @HoseinAkbarzadeh made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2248

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Lazy imports

[1.2.1] - 2023-11-30

Added

  • Added error if NoTrainInceptionV3 is initialized without torch-fidelity being installed (#2143)
  • Added support for PyTorch v2.1 (#2142)

Changed

  • Changed default state of SpectralAngleMapper and UniversalImageQualityIndex to be tensors (#2089)
  • Used arange and repeat for deterministic bincount (#2184)

Removed

  • Removed unused lpips third-party package as dependency of LearnedPerceptualImagePatchSimilarity metric (#2230)

Fixed

  • Fixed numerical stability bug in LearnedPerceptualImagePatchSimilarity metric (#2144)
  • Fixed numerical stability issue in UniversalImageQualityIndex metric (#2222)
  • Fixed incompatibility for MeanAveragePrecision with pycocotools backend when too few max_detection_thresholds are provided (#2219)
  • Fixed support for half precision in Perplexity metric (#2235)
  • Fixed device and dtype for LearnedPerceptualImagePatchSimilarity functional metric (#2234)
  • Fixed bug in Metric._reduce_states(...) when using dist_sync_fn="cat" (#2226)
  • Fixed bug in CosineSimilarity where 2d is expected but 1d input was given (#2241)
  • Fixed bug in MetricCollection when using compute groups and compute is called more than once (#2211)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.2.0...v1.2.1

Key Contributors

@Borda, @jankng, @kyle-dorman, @SkafteNicki, @tanguymagne

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Clustering metrics

Torchmetrics v1.2 is out now! The latest release includes 11 new metrics within a new subdomain: Clustering. In this blog post, we briefly explain what clustering is, why it is useful, and which newly added metrics can be used, with code samples.

Clustering - what is it?

Clustering is an unsupervised learning technique. The term unsupervised here refers to the fact that we do not have ground truth targets as we do in classification. The primary goal of clustering is to discover hidden patterns or structures within data without prior knowledge about the meaning or importance of particular features. Clustering is thus a form of data exploration, in contrast to supervised learning, where the goal is “just” to predict which class a data point belongs to.

The key goal of clustering algorithms is to split data into clusters/sets where data points from the same cluster are more similar to each other than to points from the remaining clusters. Some of the most common and widely used clustering algorithms are K-Means, Hierarchical clustering, and Gaussian Mixture Models (GMM).

An objective quality evaluation/measure is required regardless of the clustering algorithm or internal optimization criterion used. In general, we can divide all clustering metrics into two categories: extrinsic metrics and intrinsic metrics.

Extrinsic metrics

Extrinsic metrics require some ground truth labeling, even when used to evaluate an unsupervised method. This may seem counter-intuitive at first since, by definition, clustering does not use such ground truth labeling. However, most clustering algorithms are still developed on datasets where labels are available, and these metrics use this fact to their advantage.
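As an illustration of an extrinsic metric, here is a minimal, dependency-free sketch of the (unadjusted) Rand index behind RandScore: the fraction of point pairs on which the predicted clustering and the ground-truth labeling agree, i.e. both place the pair in the same cluster or both separate it. Because only co-membership matters, the arbitrary cluster IDs an algorithm produces never need to match the label values. This is an illustrative re-implementation with a hypothetical function name, not TorchMetrics' code.

```python
from itertools import combinations


def rand_index(preds, target):
    """Fraction of point pairs on which two labelings agree.

    A pair "agrees" if both labelings put the two points in the same
    cluster, or both put them in different clusters.
    """
    if len(preds) != len(target):
        raise ValueError("preds and target must have the same length")
    pairs = list(combinations(range(len(preds)), 2))
    if not pairs:
        return 1.0  # fewer than two points: trivially identical partitions
    agree = sum((preds[i] == preds[j]) == (target[i] == target[j]) for i, j in pairs)
    return agree / len(pairs)


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, swapped IDs -> 1.0
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # one point in the wrong cluster -> 0.5
```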

Intrinsic metrics

In contrast, intrinsic metrics do not need any ground truth information. These metrics weigh intra-cluster consistency (cohesion of all points assigned to a single set) against how well the clusters are set apart from each other (separation). This is often done by comparing distances in the embedding space.
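As an example of the intrinsic idea, here is a minimal sketch of the classic Dunn index (one of the metrics in this release is DunnIndex): the smallest distance between points of different clusters (separation) divided by the largest cluster diameter (a cohesion measure), so higher is better. It follows the textbook definition with Euclidean distance; TorchMetrics' DunnIndex may use a different variant (for example centroid-based distances), so treat this as an illustration with a hypothetical function name and made-up data.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Classic Dunn index: smallest between-cluster distance / largest cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Cohesion: a cluster's diameter is its largest pairwise (Euclidean) distance.
    max_diameter = max(
        (math.dist(a, b) for members in clusters.values() for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Separation: the smallest distance between two points from different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in clusters[c1]
        for b in clusters[c2]
    )
    return math.inf if max_diameter == 0 else min_separation / max_diameter


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(dunn_index(points, [0, 0, 1, 1]))  # compact, well-separated clusters -> 10.0
print(dunn_index(points, [0, 1, 0, 1]))  # stretched, interleaved clusters -> 0.1
```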

Update to Mean Average Precision

MeanAveragePrecision, the most widely used metric for object detection in computer vision, now supports two new arguments: average and backend.

  • The average argument controls averaging over multiple classes. By the standard definition, the default is macro averaging, where the metric is calculated for each class separately and then averaged. This will continue to be the default in Torchmetrics, but we now also support average="micro". Under this setting, every object is essentially treated as belonging to the same class, and the returned value is therefore calculated over all objects at once.

  • The second argument, backend, is important, as it determines which computational backend is used for the internal computations. Since MeanAveragePrecision is not a simple metric to compute, and we value the correctness of our metrics, we rely on a third-party library for the internal computations. By default, we rely on users having the official pycocotools installed, but with the new argument, we will also support other backends.
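The difference between the two averaging settings shows up even for a much simpler score than mAP. A minimal sketch, using plain precision from hypothetical per-class detection counts instead of average precision (so this is an analogy, not how MeanAveragePrecision computes its value): macro averaging gives every class equal weight, while micro averaging pools all objects, so frequent classes dominate.

```python
def macro_and_micro_precision(counts):
    """counts maps class name -> (true_positives, false_positives)."""
    # Macro: compute the score per class, then average with equal class weights.
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)
    # Micro: pool all objects as if they belonged to one class, then compute once.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro


# Hypothetical counts: a frequent class detected well, a rare class detected poorly.
counts = {"car": (90, 10), "bicycle": (1, 9)}
macro, micro = macro_and_micro_precision(counts)
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.500 micro=0.827
```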

[1.2.0] - 2023-09-22

Added

  • Added metrics to cluster package:
    • MutualInformationScore (#2008)
    • RandScore (#2025)
    • NormalizedMutualInfoScore (#2029)
    • AdjustedRandScore (#2032)
    • CalinskiHarabaszScore (#2036)
    • DunnIndex (#2049)
    • HomogeneityScore (#2053)
    • CompletenessScore (#2053)
    • VMeasureScore (#2053)
    • FowlkesMallowsIndex (#2066)
    • AdjustedMutualInfoScore (#2058)
    • DaviesBouldinScore (#2071)
  • Added backend argument to MeanAveragePrecision (#2034)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.1.0...v1.2.0

New Contributors since v1.1.0

  • @matsumotosan made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2008
  • @GlavitsBalazs made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2042
  • @OmerShubi made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2081
  • @munahaf made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/2082

Key Contributors

@matsumotosan, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Weekly patch release

[1.1.2] - 2023-09-11

Fixed

  • Fixed tie breaking in ndcg metric (#2031)
  • Fixed bug in BootStrapper when very few samples were evaluated that could lead to crash (#2052)
  • Fixed bug when creating multiple plots that led to not all plots being shown (#2060)
  • Fixed performance issues in RecallAtFixedPrecision for large batch sizes (#2042)
  • Fixed bug related to MetricCollection used with custom metrics that have prefix/postfix attributes (#2070)

Contributors

@GlavitsBalazs, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 2 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Weekly patch release

[1.1.1] - 2023-08-29

Added

  • Added average argument to MeanAveragePrecision (#2018)

Fixed

  • Fixed bug in PearsonCorrCoef when updated on single samples at a time (#2019)
  • Fixed support for pixel-wise MSE (#2017)
  • Fixed bug in MetricCollection when used with multiple metrics that return dicts with same keys (#2027)
  • Fixed bug in detection intersection metrics when class_metrics=True resulting in wrong values (#1924)
  • Fixed missing attributes higher_is_better, is_differentiable for some metrics (#2028)

Contributors

@adamjstewart, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Into Generative AI

In version v1.1 of Torchmetrics, five new metrics have been added in total, bringing the total number of metrics up to 128! In particular, we have two exciting new metrics for evaluating your favorite generative models for images.

Perceptual Path length

Introduced in the famous StyleGAN paper back in 2018, the Perceptual Path Length metric quantifies how smoothly a generator manages to interpolate between points in its latent space. Why does the smoothness of the latent space of your generative model matter? Assume you find a point in your latent space that generates an image you like, and you would like to see whether you could find a better one by slightly changing that latent point. If your latent space is not smooth, this becomes very hard, because even small changes to the latent point can lead to large changes in the generated image.
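The quantity being estimated, the expected distance between outputs at two latent points a tiny step eps apart, divided by eps squared, can be sketched in a few lines. This is a minimal Monte Carlo illustration under loud assumptions: the "generators" are toy functions on plain lists rather than an image model, linear interpolation is used throughout, and squared Euclidean distance stands in for the learned perceptual image distance used in the StyleGAN paper. It is not TorchMetrics' PerceptualPathLength implementation, and all function names are hypothetical.

```python
import math
import random


def lerp(a, b, t):
    """Linear interpolation between latent vectors a and b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def squared_distance(a, b):
    """Stand-in for the perceptual image distance used in practice."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def perceptual_path_length(generator, dim, num_samples=1000, eps=1e-4, seed=0):
    """Monte Carlo estimate of E[d(G(lerp(z1, z2, t)), G(lerp(z1, z2, t + eps))) / eps**2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        t = rng.random()
        out_a = generator(lerp(z1, z2, t))
        out_b = generator(lerp(z1, z2, t + eps))
        total += squared_distance(out_a, out_b) / eps**2
    return total / num_samples


# Two toy "generators" mapping a latent vector to a list of numbers.
def smooth_generator(z):
    return [0.5 * x for x in z]


def wiggly_generator(z):
    return [math.sin(20.0 * x) for x in z]


print(perceptual_path_length(smooth_generator, dim=8))  # small: smooth latent space
print(perceptual_path_length(wiggly_generator, dim=8))  # large: tiny steps, big jumps
```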

CLIP image quality assessment

CLIP image quality assessment (CLIPIQA) is a very recently proposed metric introduced in this paper. The metric builds on the OpenAI CLIP model, a multi-modal model for connecting text and images. The core idea behind the metric is that different properties of an image can be assessed by measuring how similar the CLIP embedding of the image is to the CLIP embeddings of a positive and a negative prompt for that given property.
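A minimal sketch of that scoring rule, assuming the CLIP embeddings have already been computed: the similarities to the positive and the negative prompt are turned into a two-way softmax, giving the probability that the image matches the positive prompt. The three-dimensional vectors below are made-up stand-ins for real CLIP embeddings, the function names are hypothetical, and the logit scale of 100 is an assumption meant to mimic CLIP's temperature; TorchMetrics' CLIPImageQualityAssessment handles the model and prompts itself.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def clip_iqa_score(image_emb, positive_emb, negative_emb, logit_scale=100.0):
    """Softmax probability that the image matches the positive prompt over the negative one."""
    s_pos = logit_scale * cosine_similarity(image_emb, positive_emb)
    s_neg = logit_scale * cosine_similarity(image_emb, negative_emb)
    # exp(s_pos) / (exp(s_pos) + exp(s_neg)), written in the equivalent sigmoid form.
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


# Hypothetical embeddings for the prompts "Good photo." and "Bad photo."
good_prompt = [1.0, 0.0, 0.0]
bad_prompt = [0.0, 1.0, 0.0]
print(clip_iqa_score([0.9, 0.1, 0.2], good_prompt, bad_prompt))  # close to 1
print(clip_iqa_score([1.0, 1.0, 0.0], good_prompt, bad_prompt))  # 0.5: equally close to both
```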

VIF, Edit, and SA-SDR

  • VisualInformationFidelity has been added to the image package. First proposed in this paper, it can be used to automatically assess the perceptual quality of images.

  • EditDistance has been added to the text package. It is a very classical text metric that simply measures the number of characters that need to be substituted, inserted, or deleted to transform the predicted text into the reference text.

  • SourceAggregatedSignalDistortionRatio has been added to the audio package. The metric was originally proposed in this paper and is an improvement over the classical Signal-to-Distortion Ratio (SDR) metric (also found in torchmetrics) that provides more stable gradients when training models for meeting-style source separation.
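Both of the last two metrics reduce to short formulas. The sketch below shows the classic dynamic-programming edit (Levenshtein) distance, and a plain, non-scale-invariant form of SA-SDR: 10 * log10 of the total reference energy over the total error energy, with both energies summed across all sources before taking the ratio. Summing first is what keeps the value finite when one reference source is silent, where a per-source SDR would take the logarithm of zero. These are illustrative re-implementations with hypothetical names and made-up signals, not TorchMetrics' EditDistance or SourceAggregatedSignalDistortionRatio code.

```python
import math


def edit_distance(pred, ref):
    """Levenshtein distance: minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(ref) + 1))  # distance from an empty prefix of pred to each prefix of ref
    for i, p in enumerate(pred, start=1):
        curr = [i]  # turning the first i characters of pred into "" takes i deletions
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # delete p
                curr[j - 1] + 1,         # insert r
                prev[j - 1] + (p != r),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


def sa_sdr(refs, ests):
    """Source-aggregated SDR in dB: energies are summed over all sources before the ratio."""
    signal = sum(x * x for ref in refs for x in ref)
    error = sum((x - y) ** 2 for ref, est in zip(refs, ests) for x, y in zip(ref, est))
    return 10.0 * math.log10(signal / error)


print(edit_distance("kitten", "sitting"))  # 3

# Made-up two-source example; the second speaker is silent in the reference.
refs = [[1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 0.0]]
ests = [[0.9, -0.9, 0.9, -0.9], [0.1, 0.0, 0.0, 0.0]]
print(sa_sdr(refs, ests))  # finite, although a per-source SDR of the silent source is not
```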

[1.1.0] - 2023-08-22

Added

  • Added source aggregated signal-to-distortion ratio (SA-SDR) metric (#1882)
  • Added VisualInformationFidelity to image package (#1830)
  • Added EditDistance to text package (#1906)
  • Added top_k argument to RetrievalMRR in retrieval package (#1961)
  • Added support for evaluating "segm" and "bbox" detection in MeanAveragePrecision at the same time (#1928)
  • Added PerceptualPathLength to image package (#1939)
  • Added support for multioutput evaluation in MeanSquaredError (#1937)
  • Added argument extended_summary to MeanAveragePrecision such that precision, recall, iou can be easily returned (#1983)
  • Added warning to ClipScore if long captions are detected and truncated (#2001)
  • Added CLIPImageQualityAssessment to multimodal package (#1931)
  • Added new property metric_state to all metrics for users to investigate currently stored tensors in memory (#2006)

Full Changelog: https://github.com/Lightning-AI/torchmetrics/compare/v1.0.0...v1.1.0


New Contributors since v1.0.0

  • @fansuregrin made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1892
  • @salcc made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1934
  • @IanMaquignaz made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1943
  • @kn made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1955
  • @Vivswan made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1982
  • @njuaplusplus made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1986

Contributors

@bojobo, @lucadiliello, @quancs, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Weekly patch release

[1.0.3] - 2023-08-08

Added

  • Added warning to MeanAveragePrecision if too many detections are observed (#1978)

Fixed

  • Fixed support for int input when multidim_average="samplewise" in classification metrics (#1977)
  • Fixed x/y labels when plotting confusion matrices (#1976)
  • Fixed IoU computation on CUDA (#1982)

Contributors

@borda, @SkafteNicki, @Vivswan

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Weekly patch release

[1.0.2] - 2023-08-03

Added

  • Added warning to PearsonCorrCoef if input has a very small variance for its given dtype (#1926)

Changed

  • Changed all non-task specific classification metrics to be true subtypes of Metric (#1963)

Fixed

  • Fixed bug in CalibrationError where calculations for double precision input were performed in float precision (#1919)
  • Fixed bug related to the prefix/postfix arguments in MetricCollection and ClasswiseWrapper being duplicated (#1918)
  • Fixed missing AUC score when plotting classification metrics that support the score argument (#1948)

Contributors

@borda, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Weekly patch release

[1.0.1] - 2023-07-13

Fixed

  • Fixed corner case when using MetricCollection together with aggregation metrics (#1896)
  • Fixed the use of max_fpr in AUROC metric when only one class is present (#1895)
  • Fixed bug related to empty predictions for IntersectionOverUnion metric (#1892)
  • Fixed bug related to MeanMetric and broadcasting of weights when NaNs are present (#1898)
  • Fixed bug related to expected input format of pycoco in MeanAveragePrecision (#1913)

Contributors

@fansuregrin, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Visualize metrics

We are happy to announce that the first major release of Torchmetrics, version v1.0, is publicly available. We have worked hard on a couple of new features for this milestone release, and with v1.0.0 we have also passed the mark of over 100 metrics implemented in torchmetrics.

Plotting

The big new feature of v1.0 is built-in plotting. As the old saying goes: "A picture is worth a thousand words". This also holds for many things in machine learning, and metrics are one area that is, in some cases, better showcased in a figure than as a list of floats. The only requirement for getting started with the plotting feature is installing matplotlib: either install it with pip install matplotlib or with pip install torchmetrics[visual] (the latter option also installs SciencePlots and uses it as the default plotting style).

The basic interface is the same for any metric. Just call the new .plot method:

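```python
import torch
from torchmetrics.classification import BinaryAccuracy  # any metric works here

metric = BinaryAccuracy()
preds, target = torch.rand(10, 100), torch.randint(0, 2, (10, 100))
for i in range(10):
    metric.update(preds[i], target[i])
fig, ax = metric.plot()
```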

The plot method by default does not require any arguments and will automatically call metric.compute internally on whatever metric states have been accumulated.

[1.0.0] - 2023-07-04

Added

  • Added prefix and postfix arguments to ClasswiseWrapper (#1866)
  • Added speech-to-reverberation modulation energy ratio (SRMR) metric (#1792, #1872)
  • Added new global arg compute_with_cache to control caching behaviour after compute method (#1754)
  • Added ComplexScaleInvariantSignalNoiseRatio for audio package (#1785)
  • Added Running wrapper for calculating running statistics (#1752)
  • Added RelativeAverageSpectralError and RootMeanSquaredErrorUsingSlidingWindow to image package (#816)
  • Added support for SpecificityAtSensitivity Metric (#1432)
  • Added support for plotting of metrics through .plot() method (#1328, #1481, #1480, #1490, #1581, #1585, #1593, #1600, #1605, #1610, #1609, #1621, #1624, #1623, #1638, #1631, #1650, #1639, #1660, #1682, #1786)
  • Added support for plotting of audio metrics through .plot() method (#1434)
  • Added classes to output from MAP metric (#1419)
  • Added Binary group fairness metrics to classification package (#1404)
  • Added MinkowskiDistance to regression package (#1362)
  • Added pairwise_minkowski_distance to pairwise package (#1362)
  • Added new detection metric PanopticQuality (#929, #1527)
  • Added PSNRB metric (#1421)
  • Added ClassificationTask Enum and use in metrics (#1479)
  • Added ignore_index option to exact_match metric (#1540)
  • Added parameter top_k to RetrievalMAP (#1501)
  • Added support for deterministic evaluation on GPU for metrics that use torch.cumsum operator (#1499)
  • Added support for plotting of aggregation metrics through .plot() method (#1485)
  • Added support for Python 3.11 (#1612)
  • Added support for auto clamping of input for metrics that use the data_range (#1606)
  • Added ModifiedPanopticQuality metric to detection package (#1627)
  • Added PrecisionAtFixedRecall metric to classification package (#1683)
  • Added multiple metrics to detection package (#1284)
    • IntersectionOverUnion
    • GeneralizedIntersectionOverUnion
    • CompleteIntersectionOverUnion
    • DistanceIntersectionOverUnion
  • Added MultitaskWrapper to wrapper package (#1762)
  • Added RelativeSquaredError metric to regression package (#1765)
  • Added MemorizationInformedFrechetInceptionDistance metric to image package (#1580)

Changed

  • Changed permutation_invariant_training to allow using a 'permutation-wise' metric function (#1794)
  • Changed update_count and update_called from private to public methods (#1370)
  • Raised exception for invalid kwargs in Metric base class (#1427)
  • Extended EnumStr to raise ValueError for invalid values (#1479)
  • Improved speed and memory consumption of binned PrecisionRecallCurve with large number of samples (#1493)
  • Changed __iter__ method from raising NotImplementedError to TypeError by setting to None (#1538)
  • FID metric will now raise an error if too few samples are provided (#1655)
  • Allowed FID with torch.float64 (#1628)
  • Changed LPIPS implementation to no longer rely on third-party package (#1575)
  • Changed FID matrix square root calculation from scipy to torch (#1708)
  • Changed calculation in PearsonCorrCoef to be more robust in certain cases (#1729)
  • Changed MeanAveragePrecision to pycocotools backend (#1832)

Deprecated

  • Deprecated domain metrics import from package root (#1685, #1694, #1696, #1699, #1703)

Removed

  • Support for Python 3.7 (#1640)

Fixed

  • Fixed support in MetricTracker for MultioutputWrapper and nested structures (#1608)
  • Fixed restrictive check in PearsonCorrCoef (#1649)
  • Fixed integration with jsonargparse and LightningCLI (#1651)
  • Fixed corner case in calibration error for zero confidence input (#1648)
  • Fixed precision-recall curve based computations for float target (#1642)
  • Fixed missing kwarg squeeze in MultiOutputWrapper (#1675)
  • Fixed padding removal for 3d input in MSSSIM (#1674)
  • Fixed max_det_threshold in MAP detection (#1712)
  • Fixed states being saved in metrics that use register_buffer (#1728)
  • Fixed states not being correctly synced and device transferred in MeanAveragePrecision for iou_type="segm" (#1763)
  • Fixed use of prefix and postfix in nested MetricCollection (#1773)
  • Fixed ax plotting logging in MetricCollection (#1783)
  • Fixed lookup for punkt sources being downloaded in RougeScore (#1789)
  • Fixed integration with lightning for CompositionalMetric (#1761)
  • Fixed several bugs in SpectralDistortionIndex metric (#1808)
  • Fixed bug for corner cases in MatthewsCorrCoef (#1812, #1863)
  • Fixed support for half precision in PearsonCorrCoef (#1819)
  • Fixed a number of bugs related to average="macro" in classification metrics (#1821)
  • Fixed off-by-one issue when ignore_index = num_classes + 1 in Multiclass-jaccard (#1860)

New Contributors

  • @theja-vanka made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1372
  • @wilderrodrigues made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1391
  • @Freed-Wu made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1402
  • @reaganjlee made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1405
  • @davidgilbertson made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1412
  • @ValerianRey made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1430
  • @EPronovost made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1427
  • @felixdivo made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1438
  • @ivnvalex made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1447
  • @PangLuo made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1452
  • @JustinGoheen made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1463
  • @DavidZhang73 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1476
  • @7shoe made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1474
  • @srishti-git1110 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1481
  • @niberger made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/929
  • @shhs29 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1434
  • @ihowell made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1525
  • @venomouscyanide made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1480
  • @ItamarChinn made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1540
  • @vincentvaroquauxads made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1521
  • @Bomme made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1501
  • @alexkrz made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1490
  • @clay-curry made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1547
  • @clueless-skywatcher made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1362
  • @marcocaccin made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1527
  • @Piyush-97 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/816
  • @FarzanT made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1583
  • @basveeling made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1651
  • @YeaMerci made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1684
  • @fkroeber made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1712
  • @soma2000-lang made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1421
  • @maxi-w made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1726
  • @wbeardall made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1765
  • @RistoAle97 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1778
  • @cdboer made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1820
  • @bot66 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1828
  • @cs-mshah made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1849
  • @bojobo made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1851
  • @goldenfire6 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1862
  • @martinmeinke made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1860
  • @Dibz15 made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1580
  • @relativityhd made their first contribution in https://github.com/Lightning-AI/torchmetrics/pull/1866

Contributors

@alexkrz, @AndresAlgaba, @basveeling, @Bomme, @Borda, @Callidior, @clueless-skywatcher, @Dibz15, @EPronovost, @fkroeber, @ItamarChinn, @marcocaccin, @martinmeinke, @niberger, @Piyush-97, @quancs, @relativityhd, @shenoynikhil, @shhs29, @SkafteNicki, @soma2000-lang, @srishti-git1110, @stancld, @twsl, @ValerianRey, @venomouscyanide, @wbeardall

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.11.4] - 2023-03-10

Fixed

  • Fixed evaluation of R2Score with near-constant target (#1576)
  • Fixed dtype conversion when the metric is a submodule (#1583)
  • Fixed bug related to top_k>1 and ignore_index!=None in StatScores based metrics (#1589)
  • Fixed corner case for PearsonCorrCoef when running in DDP mode but only on a single device (#1587)
  • Fixed overflow error for specific cases in MAP when big areas are calculated (#1607)

Contributors

@borda, @FarzanT, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: https://github.com/Lightning-AI/metrics/compare/v0.11.3...v0.11.4

Scientific Software - Peer-reviewed - Python
Published by Borda about 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.11.3] - 2023-02-28

Fixed

  • Fixed classification metrics for byte input (#1521)
  • Fixed the use of ignore_index in MulticlassJaccardIndex (#1386)

Contributors

@SkafteNicki, @vincentvaroquauxads

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: https://github.com/Lightning-AI/metrics/compare/v0.11.2...v0.11.3

Scientific Software - Peer-reviewed - Python
Published by Borda about 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.11.2] - 2023-02-21

Fixed

  • Fixed compatibility with XLA in _bincount function (#1471)
  • Fixed type hints in methods belonging to MetricTracker wrapper (#1472)
  • Fixed multilabel in ExactMatch (#1474)

Contributors

@7shoe, @borda, @SkafteNicki, @ValerianRey

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: https://github.com/Lightning-AI/metrics/compare/v0.11.1...v0.11.2

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.11.1] - 2023-01-30

Fixed

  • Fixed type checking on the maximize parameter at the initialization of MetricTracker (#1428)
  • Fixed mixed precision auto-cast for SSIM metric (#1454)
  • Fixed checking for nltk.punkt in RougeScore if the machine is offline (#1456)
  • Fixed incorrect reset method in MultioutputWrapper (#1460)
  • Fixed dtype checking in PrecisionRecallCurve for target tensor (#1457)

Contributors

@borda, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: https://github.com/Lightning-AI/metrics/compare/v0.11.0...v0.11.1

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Adding Multimodal and nominal domain

We are happy to announce that Torchmetrics v0.11 is now publicly available. In Torchmetrics v0.11 we have primarily focused on cleaning up after the large classification refactor from v0.10 and on adding new metrics. With v0.11, Torchmetrics crosses 90+ metrics, nearing the milestone of 100+ metrics.

New domains

In Torchmetrics we are not only looking to expand with new metrics in already established metric domains such as classification or regression, but also new domains. We are therefore happy to report that v0.11 includes two new domains: Multimodal and nominal.

Multimodal

If there is one topic within machine learning that is hot right now, it is generative models, and in particular text-to-image generative models. Stable Diffusion v2 was released just recently, able to create even more photorealistic images from a single text prompt than ever before.

In Torchmetrics v0.11 we are adding a new domain called multimodal to support the evaluation of such models. For now, we are starting out with a single metric, the CLIPScore from this paper, which can be used to evaluate such text-to-image models. CLIPScore has been reported to correlate well with human judgment, so a high CLIPScore for an image-text pair indicates that the caption and the image are very likely related to each other.
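Once embeddings exist, the score itself is simple: a rescaled cosine similarity between the image embedding and the caption embedding, clipped at zero. The sketch below follows the max(100 * cos(E_I, E_C), 0) form as we understand it from TorchMetrics' documentation (the original paper uses a different rescaling weight); the vectors are made up rather than produced by a CLIP model, and clip_score is a hypothetical name.

```python
import math


def clip_score(image_emb, text_emb):
    """CLIPScore-style value: 100 * cosine similarity of the embeddings, floored at 0."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norms = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return max(100.0 * dot / norms, 0.0)


# Made-up 2-D vectors standing in for CLIP image and caption embeddings.
print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical direction -> 100.0
print(clip_score([1.0, 0.0], [0.0, 1.0]))   # unrelated (orthogonal) -> 0.0
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # negative similarity is clipped -> 0.0
```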

Nominal

If you have ever taken a course in statistics or an introduction to machine learning, you have hopefully heard that data attributes can be of different types: nominal, ordinal, interval, and ratio. This essentially refers to how data can be compared. For example, nominal data can neither be ordered nor measured. An example would be data describing the color of your car: blue, red, or green. It does not make sense to compare these values. Ordinal data can be ordered, but the values have no relative meaning. An example would be the safety rating of a car: 1, 2, 3. We can say that 3 is better than 1, but the actual numerical value does not mean anything.

In v0.11 of TorchMetrics, we are adding support for classic metrics on nominal data. Four new metrics have already been added to this domain:

  • CramersV
  • PearsonsContingencyCoefficient
  • TschuprowsT
  • TheilsU

All metrics are measures of association between two nominal variables, giving a value between 0 and 1, with 1 meaning that there is a perfect association between the variables.
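To make "measure of association" concrete, here is a minimal sketch of Cramér's V, the textbook formula sqrt(chi2 / (n * min(r - 1, k - 1))) computed from an r x k contingency table of counts. It is an illustrative re-implementation with hypothetical counts and no bias correction (TorchMetrics' CramersV can optionally apply one), and it assumes every row and column of the table has at least one observation.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x k table of counts (no bias correction)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    r, k = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(r - 1, k - 1)))


# Hypothetical counts: rows = car color (blue, red), columns = owner group (A, B).
print(cramers_v([[10, 0], [0, 10]]))  # color fully determines group -> 1.0
print(cramers_v([[5, 5], [5, 5]]))    # no association -> 0.0
```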

Small improvements

In addition to metrics within the two new domains, v0.11 of Torchmetrics contains other smaller changes and fixes:

  • TotalVariation metric has been added to the image package, which measures the complexity of an image with respect to its spatial variation.

  • MulticlassExactMatch metric has been added to the classification package. It can, for example, be used to measure sentence-level accuracy, where all tokens need to match for a sentence to be counted as correct.

  • KendallRankCorrCoef has been added to the regression package for measuring the rank correlation between two variables.

  • LogCoshError has been added to the regression package for measuring the residual error between two variables. It behaves like the mean squared error close to 0 and like the mean absolute error away from 0.
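To make three of these additions concrete, the following plain-Python sketch shows what they compute on tiny inputs. These are not the TorchMetrics implementations, which operate on batched tensors and offer reduction options; the function names are hypothetical. Total variation here is the anisotropic form (sum of absolute differences between neighbouring pixels), exact match counts a sample as correct only if every position matches, and log-cosh uses the identity log(cosh(x)) = |x| + log1p(exp(-2|x|)) - log(2) to avoid overflow.

```python
import math
from typing import List, Sequence


def total_variation(image: List[List[float]]) -> float:
    """Sum of absolute differences between horizontally and vertically
    adjacent pixels of a single-channel image (rows of equal length)."""
    tv = 0.0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if c + 1 < len(row):
                tv += abs(row[c + 1] - value)       # horizontal neighbour
            if r + 1 < len(image):
                tv += abs(image[r + 1][c] - value)  # vertical neighbour
    return tv


def exact_match(preds: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Fraction of samples (e.g. sentences) whose every position (token) matches."""
    matches = sum(1 for p, t in zip(preds, target) if list(p) == list(t))
    return matches / len(target)


def log_cosh_error(preds: Sequence[float], target: Sequence[float]) -> float:
    """Mean of log(cosh(pred - target)), in a numerically stable form."""
    total = 0.0
    for p, t in zip(preds, target):
        x = abs(p - t)
        total += x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)
    return total / len(target)


print(total_variation([[0.0, 1.0], [0.0, 1.0]]))              # 2.0
print(exact_match([[1, 2, 3], [1, 2, 0]], [[1, 2, 3], [1, 2, 3]]))  # 0.5
# Near 0, log(cosh(x)) is about x**2 / 2 (squared-error-like) ...
print(log_cosh_error([0.01], [0.0]), 0.01 ** 2 / 2)
# ... far from 0, it is about |x| - log(2) (absolute-error-like).
print(log_cosh_error([10.0], [0.0]), 10.0 - math.log(2.0))
```

The last two lines illustrate the behaviour described in the LogCoshError bullet: quadratic for small residuals, linear for large ones.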


Finally, TorchMetrics now only supports PyTorch v1.8 and higher. We had to raise the minimum from v1.3 because we were running into compatibility issues with older versions of PyTorch. We strive to support as many versions of PyTorch as possible, but for the best experience, we always recommend keeping PyTorch and TorchMetrics up to date.


[0.11.0] - 2022-11-30

Added

  • Added MulticlassExactMatch to classification metrics (#1343)
  • Added TotalVariation to image package (#978)
  • Added CLIPScore to new multimodal package (#1314)
  • Added regression metrics:
    • KendallRankCorrCoef (#1271)
    • LogCoshError (#1316)
  • Added new nominal metrics:
    • CramersV (#1298)
    • PearsonsContingencyCoefficient (#1334)
    • TschuprowsT (#1334)
    • TheilsU (#1337)
  • Added option to pass distributed_available_fn to metrics to allow checks for custom communication backend for making dist_sync_fn actually useful (#1301)
  • Added normalize argument to Inception, FID, KID metrics (#1246)

Changed

  • Changed minimum Pytorch version to be 1.8 (#1263)
  • Changed interface for all functional and modular classification metrics after refactor (#1252)

Removed

  • Removed deprecated BinnedAveragePrecision, BinnedPrecisionRecallCurve, RecallAtFixedPrecision (#1251)
  • Removed deprecated LabelRankingAveragePrecision, LabelRankingLoss and CoverageError (#1251)
  • Removed deprecated KLDivergence and AUC (#1251)

Fixed

  • Fixed precision bug in pairwise_euclidean_distance (#1352)

Contributors

@borda, @justusschock, @ragavvenkatesan, @shenoynikhil, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.10.3] - 2022-11-16

Fixed

  • Fixed bug in MetricTracker.best_metric when return_step=False (#1306)
  • Fixed bug to prevent users from going into an infinite loop if trying to iterate over a single metric (#1320)

Contributors

@SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Fixed Performance

[0.10.2] - 2022-10-31

Changed

  • Changed in-place operation to out-of-place operation in pairwise_cosine_similarity (#1288)

Fixed

  • Fixed high memory usage for certain classification metrics when average='micro' (#1286)
  • Fixed precision problems when structural_similarity_index_measure was used with autocast (#1291)
  • Fixed slow performance for confusion matrix-based metrics (#1302)
  • Fixed restrictive dtype checking in spearman_corrcoef when used with autocast (#1303)

Contributors

@SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.10.1] - 2022-10-21

Fixed

  • Fixed broken clone method for classification metrics (#1250)
  • Fixed unintentional downloading of nltk.punkt when lsum not in rouge_keys (#1258)
  • Fixed type casting in MAP metric between bool and float32 (#1150)

Contributors

@dreaquil, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Large changes to classifications

TorchMetrics v0.10 is now out, significantly changing the whole classification package. This blog post will go over the reasons why the classification package needs to be refactored, what it means for our end users, and finally, what benefits it gives. A guide on how to upgrade your code to the recent changes can be found near the bottom.

Why the classification metrics need to change

We have long known that there were some underlying problems with how we initially structured the classification package. Essentially, classification tasks can be divided into binary, multiclass, or multilabel, and determining which task a user is trying to run a given metric on is hard based on the input alone. A package such as sklearn can do this only because it supports input in very specific formats (no multi-dimensional arrays and no support for both integer and probability/logit formats).

This meant that some metrics, especially for binary tasks, could have been calculating something different than expected if the user provided input of a shape other than the expected one. This goes against the core value of TorchMetrics: our users should be able to trust that the metric they are evaluating gives the expected result.

Additionally, classification metrics were inconsistent. For some metrics num_classes=2 meant binary, while for others num_classes=1 meant binary. You can read more about the underlying reasons for this refactor in the related GitHub issues.

The solution

The solution we went with was to split every classification metric into three separate metrics with the prefixes binary_*, multiclass_* and multilabel_*. This solves a number of the above problems out of the box because it becomes easier for us to match our users' expectations for any given input shape. It additionally has some other benefits for both us as developers and our end users:

  • Maintainability: by splitting the code into three distinctive functions, we are (hopefully) lowering the code complexity, making the codebase easier to maintain in the long term.
  • Speed: by completely removing the auto-detection of task at runtime, we can significantly increase computational speed (more on this later).
  • Task-specific arguments: by splitting into three functions, we also make it clearer which input arguments affect the computed result. Take Accuracy as an example: num_classes, top_k and average are arguments that influence the result if you are doing multiclass classification but do nothing for binary classification, and vice versa for the threshold argument. The task-specific versions only contain the arguments that influence the given task.

There are many smaller quality-of-life improvements hidden throughout the refactor; here are our top 3:

Standardized arguments

The input arguments for the classification package are now much more standardized. Here are a few examples:

  • Each metric now only supports arguments that influence the final result. This means that num_classes is removed from all binary_* metrics, is now required for all multiclass_* metrics, and is renamed to num_labels for all multilabel_* metrics.
  • The ignore_index argument is now supported by ALL classification metrics and accepts any value, not only values in the [0, num_classes] range (similar to torch loss functions).
  • We added a new validate_args argument to all classification metrics that lets users skip validation of inputs, making the computations faster. By default, we still do input validation because it is the safest option for the user. Still, if you are confident that the input to the metric is correct, you can now disable this check for a potential speed-up (more on this later).
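The task split and the new ignore_index semantics can be sketched in plain Python. In TorchMetrics v0.10 the real entry points are, for example, `BinaryAccuracy` and `MulticlassAccuracy` in `torchmetrics.classification` (and `binary_accuracy` / `multiclass_accuracy` in the functional API); the functions below are simplified stand-ins, not that code. Assumptions of this sketch: it computes plain overall (micro) accuracy, whereas the TorchMetrics classes also support other averaging modes via `average`, and it treats `p >= threshold` as a positive prediction.

```python
from typing import Optional, Sequence


def binary_accuracy(
    preds: Sequence[float],
    target: Sequence[int],
    threshold: float = 0.5,
    ignore_index: Optional[int] = None,
) -> float:
    """Binary task: probabilities are thresholded; there is no num_classes argument."""
    pairs = [(p, t) for p, t in zip(preds, target) if t != ignore_index]
    correct = sum(int(p >= threshold) == t for p, t in pairs)
    return correct / len(pairs)


def multiclass_accuracy(
    preds: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: Optional[int] = None,
) -> float:
    """Multiclass task: num_classes is required. Positions whose target equals
    ignore_index (any value, e.g. -1 or 255) are dropped before validation."""
    pairs = [(p, t) for p, t in zip(preds, target) if t != ignore_index]
    for p, t in pairs:  # the kind of check that validate_args=False would skip
        if not (0 <= p < num_classes and 0 <= t < num_classes):
            raise ValueError("labels must lie in [0, num_classes)")
    correct = sum(p == t for p, t in pairs)
    return correct / len(pairs)


print(binary_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]))  # 0.5
# -1 lies outside [0, num_classes) but is still a valid ignore_index.
print(multiclass_accuracy([0, 2, 1, 1], [0, 2, 2, -1], num_classes=3, ignore_index=-1))
```

Because each function only exposes the arguments its task needs, there is no way to pass a threshold that silently does nothing in the multiclass case, or a num_classes that is ignored in the binary case.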

Constant memory implementations

Some of the most useful metrics for evaluating classification problems are metrics such as ROC, AUROC, AveragePrecision, etc., because they not only evaluate your model for a single threshold but a whole range of thresholds, essentially giving you the ability to see the trade-off between Type I and Type II errors. However, a big problem with the standard formulation of these metrics (which we have been using) is that they require access to all data for their calculation. Our implementation has been extremely memory-intensive for these kinds of metrics.

In v0.10 of TorchMetrics, all these metrics now have an argument called thresholds. By default, it is None and the metric will still save all targets and predictions in memory as you are used to. However, if this argument is instead set to a tensor, for example torch.linspace(0, 1, 100), the metric will instead use a constant-memory approximation by evaluating the metric under those provided thresholds.

Setting thresholds=None has an approximate memory footprint of O(num_samples), whereas using thresholds=torch.linspace(0, 1, 100) has an approximate memory footprint of O(num_thresholds). In this particular case, users will save memory when the metric is computed on more than 100 samples. Since evaluation in modern machine learning is often done on thousands to millions of data points, this feature can save a lot of memory.

This also means that the Binned* metrics that currently exist in TorchMetrics are being deprecated as their functionality is now captured by this argument.
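The constant-memory idea behind the thresholds argument can be sketched in plain Python: instead of storing every prediction, keep one true-positive, false-positive and false-negative counter per threshold and update them batch by batch. The class name `BinnedPRCurveSketch` is hypothetical, and it is not the TorchMetrics implementation. Two conventions are assumptions of this sketch: a score equal to a threshold counts as positive, and undefined ratios are reported as NaN.

```python
from typing import List, Sequence, Tuple


class BinnedPRCurveSketch:
    """Precision-recall curve with O(num_thresholds) state."""

    def __init__(self, thresholds: Sequence[float]) -> None:
        self.thresholds = list(thresholds)
        n = len(self.thresholds)
        # One counter per threshold; memory does not grow with the number of samples.
        self.tp = [0] * n
        self.fp = [0] * n
        self.fn = [0] * n

    def update(self, preds: Sequence[float], target: Sequence[int]) -> None:
        for p, t in zip(preds, target):
            for i, thr in enumerate(self.thresholds):
                predicted_positive = p >= thr
                if predicted_positive and t == 1:
                    self.tp[i] += 1
                elif predicted_positive:
                    self.fp[i] += 1
                elif t == 1:
                    self.fn[i] += 1

    def compute(self) -> Tuple[List[float], List[float]]:
        precision, recall = [], []
        for tp, fp, fn in zip(self.tp, self.fp, self.fn):
            precision.append(tp / (tp + fp) if tp + fp else float("nan"))
            recall.append(tp / (tp + fn) if tp + fn else float("nan"))
        return precision, recall


curve = BinnedPRCurveSketch(thresholds=[0.25, 0.5, 0.75])
curve.update([0.1, 0.4], [0, 1])  # first batch
curve.update([0.6, 0.9], [0, 1])  # second batch; no raw scores are kept
precision, recall = curve.compute()
print(precision, recall)
```

The trade-off is resolution: the curve is only known at the chosen thresholds, which is why this is an approximation of the exact (store-everything) curve.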

All metrics are faster (ish)

By splitting each metric into 3 separate metrics, we reduce the number of calculations needed. We therefore expected our new implementations to be faster out of the box. The table below shows the timings of different metrics with the old and new implementations (with and without input validation). Numbers in parentheses denote speed-up over the old implementations.

The following observations can be made:

  • Some metrics are a bit faster (1.3x), and others are much faster (4.6x) after the refactor!
  • Disabling input validation can speed things up further. For example, multiclass_confusion_matrix goes from a speedup of 3.36x to 4.81x when input validation is disabled. This is a clear advantage for users who are familiar with the metrics and do not need validation of their input at every update.
  • If we compare binary with multiclass, the biggest speedup can be seen for multiclass problems.
  • Every metric is faster except the precision-recall curve, which is not faster even with the new approximate binning method. This is a bit strange, as the non-approximate version should be equally fast (it's the same code). We are actively looking into this.

[0.10.0] - 2022-10-04

Added

  • Added a new NLP metric InfoLM (#915)
  • Added Perplexity metric (#922)
  • Added ConcordanceCorrCoef metric to regression package (#1201)
  • Added argument normalize to LPIPS metric (#1216)
  • Added support for multiprocessing of batches in PESQ metric (#1227)
  • Added support for multioutput in PearsonCorrCoef and SpearmanCorrCoef (#1200)

Changed

  • Classification refactor (#1054, #1143, #1145, #1151, #1159, #1163, #1167, #1175, #1189, #1197, #1215, #1195)
  • Changed update in FID metric to be done in an online fashion to save memory (#1199)
  • Improved performance of retrieval metrics (#1242)
  • Changed SSIM and MSSSIM update to be online to reduce memory usage (#1231)

Fixed

  • Fixed a bug in ssim when return_full_image=True where the score was still reduced (#1204)
  • Fixed MPS support for:
    • MAE metric (#1210)
    • Jaccard index (#1205)
  • Fixed bug in ClasswiseWrapper such that compute gave wrong result (#1225)
  • Fixed synchronization of empty list states (#1219)

Contributors

@Borda, @bryant1410, @geoffrey-g-delhomme, @justusschock, @lucadiliello, @nicolas-dufour, @Queuecumber, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 3 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.9.3] - 2022-08-22

Added

  • Added global option sync_on_compute to disable automatic synchronization when compute is called (#1107)

Fixed

  • Fixed missing reset in ClasswiseWrapper (#1129)
  • Fixed JaccardIndex multi-label compute (#1125)
  • Fixed SSIM to propagate device if gaussian_kernel is False, and added a test (#1149)

Contributors

@KeVoyer1, @krshrimali, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.9.2] - 2022-06-29

Fixed

  • Fixed mAP calculation for areas with 0 predictions (#1080)
  • Fixed bug where avg precision state and auroc state were not merged when using MetricCollections (#1086)
  • Skipped box conversion if no boxes are present in MeanAveragePrecision (#1097)
  • Fixed inconsistency in docs and code when setting average="none" in AveragePrecision metric (#1116)

Contributors

@23pointsNorth, @kouyk, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor PL compatibility patch

[0.9.1] - 2022-06-08

Added

  • Added specific RuntimeError when metric object is on the wrong device (#1056)
  • Added an option to specify own n-gram weights for BLEUScore and SacreBLEUScore instead of using uniform weights only. (#1075)

Fixed

  • Fixed aggregation metrics when input only contains zeros (#1070)
  • Fixed TypeError when providing superclass arguments as kwargs (#1069)
  • Fixed bug related to state reference in metric collection when using compute groups (#1076)

Contributors

@jlcsilva, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Faster forward

Highlights

TorchMetrics v0.9 is now out, and it brings significant changes to how the forward method works. This blog post goes over these improvements and how they affect both users of TorchMetrics and users that implement custom metrics. TorchMetrics v0.9 also includes several new metrics and bug fixes.

Blog: TorchMetrics v0.9 — Faster forward

The Story of the Forward Method

Since the beginning of TorchMetrics, forward has served the dual purpose of calculating the metric on the current batch and accumulating in a global state. Internally, this was achieved by calling update twice, once for each purpose, which meant repeating the same computation. However, for many metrics, calling update twice is unnecessary to obtain both the local batch statistics and the global accumulation, because the global statistics are simple reductions of the local batch states.

In v0.9, we have finally implemented logic that takes advantage of this and only calls update once before making a simple reduction. As you can see in the figure below, this can make a single call of forward 2x faster in v0.9 compared to v0.8 for the same metric.

With the improvements to forward, many metrics have become significantly faster (up to 2x). It should be noted that this change mainly benefits metrics (for example, ConfusionMatrix) where calling update is expensive.

We went through all existing metrics in TorchMetrics and enabled this feature for all appropriate metrics, which was almost 95% of all metrics. We want to stress that if you are using metrics from TorchMetrics, nothing has changed to the API, and no code changes are necessary.
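The single-update forward described above can be sketched for a simple running-mean metric. `MeanMetricSketch` is a hypothetical class, not TorchMetrics' internal code: it stashes the global state, runs update exactly once on the batch to get the batch value, and then merges the batch state back with a plain sum. This shortcut is only valid because the global state is a simple reduction of batch states, which is what a `full_state_update = False` metric signals.

```python
from typing import Sequence


class MeanMetricSketch:
    """Hypothetical running-mean metric illustrating a single-update forward."""

    # Mirrors the idea of the full_state_update class property in TorchMetrics.
    full_state_update = False

    def __init__(self) -> None:
        self.update_calls = 0  # counts how often update() runs
        self.reset()

    def reset(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, values: Sequence[float]) -> None:
        self.update_calls += 1
        self.total += sum(values)
        self.count += len(values)

    def compute(self) -> float:
        return self.total / self.count

    def forward(self, values: Sequence[float]) -> float:
        # 1. Stash the accumulated global state.
        global_total, global_count = self.total, self.count
        # 2. Compute the batch-only value with a single call to update().
        self.reset()
        self.update(values)
        batch_value = self.compute()
        # 3. Merge: the global state is a simple sum-reduction of batch states.
        self.total += global_total
        self.count += global_count
        return batch_value


metric = MeanMetricSketch()
batch_1 = metric.forward([1.0, 2.0, 3.0])
batch_2 = metric.forward([5.0])
print(batch_1, batch_2, metric.compute(), metric.update_calls)  # 2.0 5.0 2.75 2
```

With the older behaviour, each forward call would have run update twice (once on a fresh state for the batch value, once on the global state), so the counter would read 4 after two batches instead of 2.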

[0.9.0] - 2022-05-31

Added

  • Added RetrievalPrecisionRecallCurve and RetrievalRecallAtFixedPrecision to retrieval package (#951)
  • Added class property full_state_update that determines whether forward should call update once or twice (#984,#1033)
  • Added support for nested metric collections (#1003)
  • Added Dice to classification package (#1021)
  • Added support for segmentation type segm as IoU type for mean average precision (#822)

Changed

  • Renamed reduction argument to average in Jaccard score and added additional options (#874)

Removed

  • Removed deprecated compute_on_step argument (#962, #967, #979 ,#990, #991, #993, #1005, #1004, #1007)

Fixed

  • Fixed non-empty state dict for a few metrics (#1012)
  • Fixed bug when comparing states while finding compute groups (#1022)
  • Fixed torch.double support in stat score metrics (#1023)
  • Fixed FID calculation for non-equal size real and fake input (#1028)
  • Fixed case where KLDivergence could output NaN (#1030)
  • Fixed deterministic for PyTorch<1.8 (#1035)
  • Fixed default value for mdmc_average in Accuracy (#1036)
  • Fixed missing copy of property when using compute groups in MetricCollection (#1052)

Contributors

@Borda, @burglarhobbit, @charlielito, @gianscarpe, @MrShevan, @phaseolud, @razmikmelikbekyan, @SkafteNicki, @tanmoyio, @vumichien

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.8.2] - 2022-05-06

Fixed

  • Fixed multi-device aggregation in PearsonCorrCoef (#998)
  • Fixed MAP metric when using a custom list of thresholds (#995)
  • Fixed compatibility between compute groups in MetricCollection and prefix/postfix arg (#1007)
  • Fixed compatibility with future Pytorch 1.12 in safe_matmul (#1011, #1014)

Contributors

@ben-davidson-6, @Borda, @SkafteNicki, @tanmoyio

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.8.1] - 2022-04-27

Changed

  • Reimplemented the signal_distortion_ratio metric, which removed the absolute requirement of fast-bss-eval (#964)

Fixed

  • Fixed "Sort currently does not support bool dtype on CUDA" error in MAP for empty preds (#983)
  • Fixed BinnedPrecisionRecallCurve when thresholds argument is not provided (#968)
  • Fixed CalibrationError to work on logit input (#985)

Contributors

@DuYicong515, @krshrimali, @quancs, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Faster collection and more metrics!

We are excited to announce that TorchMetrics v0.8 is now available. The release includes several new metrics in the classification and image domains and some performance improvements for those working with metrics collections.

Metric collections just got faster

Common wisdom dictates that you should never evaluate the performance of your models using only a single metric but instead a collection of metrics. For example, it is common to simultaneously evaluate the accuracy, precision, recall, and f1 score in classification. In TorchMetrics, we have long provided the MetricCollection object for chaining such metrics together, giving an easy interface to calculate them all at once. However, in many cases, the metrics in such a collection share some underlying computations, which were repeated for every metric in the collection. In TorchMetrics v0.8 we have introduced the concept of compute_groups to MetricCollection: by default, metrics that share the same computations are auto-detected and grouped together.

Thus, if you are using MetricCollections in your code, upgrading to TorchMetrics v0.8 should automatically make your code run faster without any code changes.
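The saving from compute groups comes from the fact that metrics like precision, recall and F1 are all cheap functions of the same true-positive, false-positive and false-negative counts. The plain-Python sketch below (hypothetical function names, binary labels, and our own choice of returning 0.0 for undefined ratios) computes that shared state once and derives all three metrics from it, which is the kind of sharing MetricCollection detects automatically.

```python
from typing import Dict, Sequence, Tuple


def stat_scores(preds: Sequence[int], target: Sequence[int]) -> Tuple[int, int, int]:
    """Shared state for the group: true positives, false positives, false negatives."""
    tp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, target) if p == 0 and t == 1)
    return tp, fp, fn


def grouped_metrics(preds: Sequence[int], target: Sequence[int]) -> Dict[str, float]:
    # The pass over the data happens once for the whole group ...
    tp, fp, fn = stat_scores(preds, target)
    # ... and each metric is a cheap function of the shared counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(grouped_metrics(preds=[1, 0, 1, 1], target=[1, 0, 0, 1]))
```

Without grouping, each of the three metrics would keep and update its own copy of the same counters.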

Many exciting new metrics

TorchMetrics v0.8 includes several new metrics within the classification and image domain, both for the functional and modular API. We refer to the documentation for the full description of all metrics if you want to learn more about them.

  • SpectralAngleMapper or SAM was added to the image package. This metric can calculate the spectral similarity between given reference spectra and estimated spectra.
  • CoverageError was added to the classification package. This metric can be used when you are working with multi-label data. The metric works similarly to its sklearn counterpart and computes how far you need to go through the ranked scores such that all true labels are covered.
  • LabelRankingAveragePrecision and LabelRankingLoss were added to the classification package. Both metrics are used in multi-label ranking problems, where the goal is to give a better rank to the labels associated with each sample. Each metric gives a measure of how well your model is doing this.
  • ErrorRelativeGlobalDimensionlessSynthesis or ERGAS was added to the image package. This metric can be used to calculate the accuracy of pan-sharpened images, considering the normalized average error of each band of the resulting image.
  • UniversalImageQualityIndex was added to the image package. This metric can assess the difference between two images and considers three different factors: loss of correlation, luminance distortion, and contrast distortion.
  • ClasswiseWrapper was added to the wrapper package. This wrapper can be used in combination with metrics that return multiple values (such as classification metrics with the average=None argument). The wrapper will unwrap the result into a dict with a label for each value.
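To make the SpectralAngleMapper bullet concrete, here is a plain-Python sketch of the spectral angle between a reference spectrum and an estimated spectrum (one value per band), averaged over pixels. The function names are hypothetical and this is not the TorchMetrics implementation, which works on image tensors and has its own reduction options. The cosine is clamped to [-1, 1] so floating-point rounding cannot push `acos` outside its domain.

```python
import math
from typing import Sequence


def spectral_angle(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Angle in radians between a reference spectrum and an estimated spectrum."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_est = math.sqrt(sum(e * e for e in estimate))
    cosine = dot / (norm_ref * norm_est)
    return math.acos(max(-1.0, min(1.0, cosine)))


def mean_spectral_angle(
    ref_pixels: Sequence[Sequence[float]], est_pixels: Sequence[Sequence[float]]
) -> float:
    """Average spectral angle over pixels, each pixel given as a spectrum."""
    angles = [spectral_angle(r, e) for r, e in zip(ref_pixels, est_pixels)]
    return sum(angles) / len(angles)


# A uniformly scaled spectrum has the same shape, so the angle is (close to) 0.
print(spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# Orthogonal spectra are maximally dissimilar for non-negative data: pi / 2.
print(spectral_angle([1.0, 0.0], [0.0, 1.0]))
```

Because the angle ignores the vector length, the metric compares the shape of the spectra rather than their overall brightness.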

[0.8.0] - 2022-04-14

Added

  • Added WeightedMeanAbsolutePercentageError to regression package (#948)
  • Added new classification metrics:
    • CoverageError (#787)
    • LabelRankingAveragePrecision and LabelRankingLoss (#787)
  • Added new image metric:
    • SpectralAngleMapper (#885)
    • ErrorRelativeGlobalDimensionlessSynthesis (#894)
    • UniversalImageQualityIndex (#824)
    • SpectralDistortionIndex (#873)
  • Added support for MetricCollection in MetricTracker (#718)
  • Added support for 3D image and uniform kernel in StructuralSimilarityIndexMeasure (#818)
  • Added smart update of MetricCollection (#709)
  • Added ClasswiseWrapper for better logging of classification metrics with multiple output values (#832)
  • Added **kwargs argument for passing additional arguments to base class (#833)
  • Added negative ignore_index for the Accuracy metric (#362)
  • Added adaptive_k for the RetrievalPrecision metric (#910)
  • Added reset_real_features argument to image quality assessment metrics (#722)
  • Added new keyword argument compute_on_cpu to all metrics (#867)

Changed

  • Made num_classes in jaccard_index a required argument (#853, #914)
  • Added normalizer, tokenizer to ROUGE metric (#838)
  • Improved shape checking of permutation_invariant_training (#864)
  • Allowed reduction None (#891)
  • MetricTracker.best_metric will now give a warning when computing on metrics that do not have a best (#913)

Deprecated

  • Deprecated argument compute_on_step (#792)
  • Deprecated passing in dist_sync_on_step, process_group, dist_sync_fn direct argument (#833)

Removed

  • Removed support for versions of Lightning lower than v1.5 (#788)
  • Removed deprecated functions, and warnings in Text (#773)
    • WER and functional.wer
  • Removed deprecated functions and warnings in Image (#796)
    • SSIM and functional.ssim
    • PSNR and functional.psnr
  • Removed deprecated functions, and warnings in classification and regression (#806)
    • FBeta and functional.fbeta
    • F1 and functional.f1
    • Hinge and functional.hinge
    • IoU and functional.iou
    • MatthewsCorrcoef
    • PearsonCorrcoef
    • SpearmanCorrcoef
  • Removed deprecated functions, and warnings in detection and pairwise (#804)
    • MAP and functional.pairwise.manhatten
  • Removed deprecated functions, and warnings in Audio (#805)
    • PESQ and functional.audio.pesq
    • PIT and functional.audio.pit
    • SDR and functional.audio.sdr and functional.audio.si_sdr
    • SNR and functional.audio.snr and functional.audio.si_snr
    • STOI and functional.audio.stoi

Fixed

  • Fixed device mismatch for MAP metric in specific cases (#950)
  • Improved testing speed (#820)
  • Fixed compatibility of ClasswiseWrapper with the prefix argument of MetricCollection (#843)
  • Fixed BestScore on GPU (#912)
  • Fixed Lsum computation for ROUGEScore (#944)

Contributors

@ankitaS11, @ashutoshml, @Borda, @hookSSi, @justusschock, @lucadiliello, @quancs, @rusty1s, @SkafteNicki, @stancld, @vumichien, @weningerleon, @yassersouri

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.7.3] - 2022-03-22

Fixed

  • Fixed unsafe log operation in TweedieDevianceScore for power=1 (#847)
  • Fixed bug in MAP metric related to either no ground truth or no predictions (#884)
  • Fixed ConfusionMatrix, AUROC and AveragePrecision on GPU when running in deterministic mode (#900)
  • Fixed NaN or Inf results returned by signal_distortion_ratio (#899)
  • Fixed memory leak when using update method with tensor where requires_grad=True (#902)

Contributors

@mtailanian, @quancs, @SkafteNicki

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - JOSS paper

[0.7.2] - 2022-02-10

Fixed

  • Minor patches in JOSS paper.

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Improve mAP performance

[0.7.1] - 2022-02-03

Changed

  • Used torch.bucketize in calibration error when torch>1.8 for faster computations (#769)
  • Improved mAP performance (#742)

Fixed

  • Fixed check for available modules (#772)
  • Fixed Matthews correlation coefficient when the denominator is 0 (#781)

Contributors

@Borda, @ramonemiliani93, @SkafteNicki, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - New NLP metrics and improved API

We are excited to announce that TorchMetrics v0.7 is now publicly available. This release is pretty significant. It includes several new metrics (mainly for NLP), naming and import changes, general improvements to the API, and some other great features. TorchMetrics now has over 60 metrics, and the package is more user-friendly than ever.

NLP metrics - Text package

The text package has been part of TorchMetrics since v0.5. With the growing capability of language generation models, there is also a real need for reliable evaluation metrics. With several added metrics and a unified API, TorchMetrics makes using various metrics even easier! TorchMetrics v0.7 newly includes a couple of machine translation metrics such as chrF, chrF++, Translation Edit Rate, and Extended Edit Distance. Furthermore, it also supports other metrics: Match Error Rate, Word Information Lost, Word Information Preserved, and SQuAD evaluation metrics. Last but not least, we also made it possible to evaluate the ROUGE score using multiple references.

Argument unification

Importantly, all text metrics now assume the preds, target input order with these explicit keyword arguments. Any different naming used before v0.7 is deprecated and will be completely removed in v0.8.

Import and naming changes

TorchMetrics v0.7 brings both larger and smaller changes to how metrics should be imported. The import changes take effect directly in v0.7, meaning that you will most likely need to change the import statement for some specific metrics. All naming changes follow our standard deprecation process, meaning that in v0.7, any metric that is renamed will still work but will raise a warning asking you to use the new metric name. From v0.8, the old metric names will no longer be available.
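A deprecation window like this is commonly implemented by keeping the old name as a thin subclass that emits a `DeprecationWarning` and otherwise behaves like the new class. The sketch below uses hypothetical minimal stand-in classes named after the WER rename; TorchMetrics' own deprecation helpers may be structured differently.

```python
import warnings


class WordErrorRate:
    """Hypothetical stand-in for a metric under its new name."""


class WER(WordErrorRate):
    """Old name kept as a thin alias during the deprecation window."""

    def __init__(self) -> None:
        warnings.warn(
            "`WER` was renamed to `WordErrorRate` in v0.7 and will be removed in v0.8.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    metric = WER()  # still works, but emits a warning

print(isinstance(metric, WordErrorRate), caught[0].category.__name__)
```

Because the alias subclasses the new class, existing code keeps working unchanged until the old name is removed, while the warning points users to the new import.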

[0.7.0] - 2022-01-17

Added

  • Added NLP metrics:
    • MatchErrorRate (#619)
    • WordInfoLost and WordInfoPreserved (#630)
    • SQuAD (#623)
    • CHRFScore (#641)
    • TranslationEditRate (#646)
    • ExtendedEditDistance (#668)
  • Added MultiScaleSSIM into image metrics (#679)
  • Added Signal to Distortion Ratio (SDR) to audio package (#565)
  • Added MinMaxMetric to wrappers (#556)
  • Added ignore_index to retrieval metrics (#676)
  • Added support for multi references in ROUGEScore (#680)
  • Added a default VSCode devcontainer configuration (#621)

Changed

  • Scalar metrics will now consistently have additional dimensions squeezed (#622)
  • Metrics having third party dependencies removed from global import (#463)
  • Made untokenized BLEUScore input consistent with all the other text metrics (#640)
  • Arguments reordered for TER, BLEUScore, SacreBLEUScore, CHRFScore now the expected input order is predictions first and target second (#696)
  • Changed dtype of metric state from torch.float to torch.long in ConfusionMatrix to accommodate larger values (#715)
  • Unify preds, target input argument's naming across all text metrics (#723, #727)
    • bert, bleu, chrf, sacre_bleu, wip, wil, cer, ter, wer, mer, rouge, squad

Deprecated

  • Renamed IoU -> JaccardIndex (#662)
  • Renamed text WER metric: (#714)
    • functional.wer -> functional.word_error_rate
    • WER -> WordErrorRate
  • Renamed correlation coefficient classes: (#710)
    • MatthewsCorrcoef -> MatthewsCorrCoef
    • PearsonCorrcoef -> PearsonCorrCoef
    • SpearmanCorrcoef -> SpearmanCorrCoef
  • Renamed audio STOI metric: (#753, #758)
    • audio.STOI -> audio.ShortTimeObjectiveIntelligibility
    • functional.audio.stoi -> functional.audio.short_time_objective_intelligibility
  • Renamed audio PESQ metrics: (#751)
    • functional.audio.pesq -> functional.audio.perceptual_evaluation_speech_quality
    • audio.PESQ -> audio.PerceptualEvaluationSpeechQuality
  • Renamed audio SDR metrics: (#711)
    • functional.sdr -> functional.signal_distortion_ratio
    • functional.si_sdr -> functional.scale_invariant_signal_distortion_ratio
    • SDR -> SignalDistortionRatio
    • SI_SDR -> ScaleInvariantSignalDistortionRatio
  • Renamed audio SNR metrics: (#712)
    • functional.snr -> functional.signal_noise_ratio
    • functional.si_snr -> functional.scale_invariant_signal_noise_ratio
    • SNR -> SignalNoiseRatio
    • SI_SNR -> ScaleInvariantSignalNoiseRatio
  • Renamed F-score metrics: (#731, #740)
    • functional.f1 -> functional.f1_score
    • F1 -> F1Score
    • functional.fbeta -> functional.fbeta_score
    • FBeta -> FBetaScore
  • Renamed Hinge metric: (#734)
    • functional.hinge -> functional.hinge_loss
    • Hinge -> HingeLoss
  • Renamed image PSNR metrics (#732)
    • functional.psnr -> functional.peak_signal_noise_ratio
    • PSNR -> PeakSignalNoiseRatio
  • Renamed audio PIT metric: (#737)
    • functional.pit -> functional.permutation_invariant_training
    • PIT -> PermutationInvariantTraining
  • Renamed image SSIM metric: (#747)
    • functional.ssim -> functional.structural_similarity_index_measure
    • SSIM -> StructuralSimilarityIndexMeasure
  • Renamed detection MAP to MeanAveragePrecision metric (#754)
  • Renamed Fidelity & LPIPS image metric: (#752)
    • image.FID -> image.FrechetInceptionDistance
    • image.KID -> image.KernelInceptionDistance
    • image.LPIPS -> image.LearnedPerceptualImagePatchSimilarity

Removed

  • Removed embedding_similarity metric (#638)
  • Removed argument concatenate_texts from wer metric (#638)
  • Removed arguments newline_sep and decimal_places from rouge metric (#638)

Fixed

  • Fixed MetricCollection kwargs filtering when no kwargs are present in update signature (#707)

Contributors

@ashutoshml, @Borda, @cuent, @Fariborzzz, @getgaurav2, @janhenriklambrechts, @justusschock, @karthikrangasai, @lucadiliello, @mahinlma, @mathemusician, @mona0809, @mrleu, @puhuk, @quancs, @SkafteNicki, @stancld, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Fixing mAP on GPU

[0.6.2] - 2021-12-15

Fixed

  • Fixed failure caused by torch.sort not supporting bool dtype on CUDA (#665)
  • Fixed mAP to properly check whether ground truths are empty (#684)
  • Fixed initialization of tensors to be on the correct device for MAP metric (#673)

Contributors

@OlofHarrysson, @tkupek, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Own mAP implementation

[0.6.1] - 2021-12-06

Changed

  • Migrate MAP metrics from pycocotools to PyTorch (#632)
  • Use torch.topk instead of torch.argsort in retrieval precision for speedup (#627)

Fixed

  • Fix empty predictions in MAP metric (#594, #610, #624)
  • Fix edge case of AUROC with average=weighted on GPU (#606)
  • Fixed forward in compositional metrics (#645)

Contributors

@Callidior, @SkafteNicki, @tkupek, @twsl, @zuoxingdong

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - More metrics than ever

[0.6.0] - 2021-10-28

We are excited to announce that TorchMetrics v0.6 is now publicly available. TorchMetrics v0.6 does not focus on specific domains but adds a ton of new metrics to several domains, increasing the number of metrics in the repository to over 60! Not only does v0.6 add metrics within already covered domains, it also adds support for two new ones: pairwise metrics and detection.

https://devblog.pytorchlightning.ai/torchmetrics-v0-6-more-metrics-than-ever-e98c3983621e

Pairwise Metrics

TorchMetrics v0.6 offers a new set of metrics in its functional backend for calculating pairwise distances. Given a tensor X with shape [N,d] (N observations, each in d dimensions), a pairwise metric calculates an [N,N] matrix containing the metric for every pair of rows of X.
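To make the [N,N] output concrete, here is a framework-free sketch that computes pairwise Euclidean distances over plain Python lists. It only illustrates the definition and is not TorchMetrics code; the library's functional versions (such as pairwise_euclidean_distance) operate on tensors, and the helper name pairwise_euclidean below is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def pairwise_euclidean(x: Matrix) -> Matrix:
    """Return the [N, N] matrix of Euclidean distances between the rows of x."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # distance between observation i and observation j (both d-dimensional)
            out[i][j] = math.sqrt(sum((a - b) ** 2 for a, b in zip(x[i], x[j])))
    return out


# N = 3 observations, each in d = 2 dimensions
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = pairwise_euclidean(X)
print(D)  # symmetric 3 x 3 matrix with zeros on the diagonal
```

The result is symmetric with a zero diagonal, since each row is compared with every row, including itself.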

Detection

TorchMetrics v0.6 now includes a detection package that provides the MAP metric. The implementation essentially wraps pycocotools, ensuring that we get the correct value, but with the benefit of being able to scale to multiple devices (as any other metric in TorchMetrics).

New additions

  • In the audio package, we have two new metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). Both metrics can be used to assess speech quality.

  • In the retrieval package, we also have two new metrics: R-precision and hit rate. R-precision is the precision at rank R, where R is the number of relevant documents for the query (at this cutoff, precision and recall coincide). The hit rate is the fraction of queries for which at least one relevant document is retrieved among the top results.

  • The text package also receives an update in the form of two new metrics: SacreBLEU score and character error rate. SacreBLEU provides a more systematic way of comparing BLEU scores across tasks. The character error rate is similar to the word error rate, but measures how well an algorithm has predicted a sentence using a character-by-character rather than a word-by-word comparison.

  • The regression package got a single new metric in the form of the Tweedie deviance score metric. Deviance scores are generally a better measure of fit than measures such as squared error when trying to model data coming from highly skewed distributions.

  • Finally, we have added five new metrics for simple aggregation: SumMetric, MeanMetric, MinMetric, MaxMetric and CatMetric. All five metrics take a single input (either native Python floats or torch.Tensor) and keep track of its sum, average, minimum, maximum or concatenation, respectively. These new aggregation metrics are especially useful in combination with self.log from Lightning if you want to log something other than the average of the metric you are tracking.
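The aggregation metrics above only need a small amount of state rather than every value seen so far. The plain-Python sketch below illustrates that idea for a running mean using an update/compute/reset pattern. The class name RunningMean is hypothetical, and this is not the TorchMetrics MeanMetric implementation, which among other things also synchronizes its states across devices.

```python
from typing import Iterable, Union

Number = Union[int, float]


class RunningMean:
    """Hypothetical stand-in for a mean aggregator: keeps only a sum and a count."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: Union[Number, Iterable[Number]]) -> None:
        # accept a single number or an iterable of numbers (e.g. a batch of values)
        values = [value] if isinstance(value, (int, float)) else list(value)
        self.total += sum(values)
        self.count += len(values)

    def compute(self) -> float:
        # raises ZeroDivisionError if nothing has been added since the last reset
        return self.total / self.count


mean = RunningMean()
mean.update(1.0)
mean.update([2.0, 3.0, 6.0])
print(mean.compute())  # (1 + 2 + 3 + 6) / 4 = 3.0
```

A minimum or maximum aggregator follows the same pattern with a single running extreme as its state.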

Detail changes

Added

  • Added audio metrics:
    • Perceptual Evaluation of Speech Quality (PESQ) (#353)
    • Short-Time Objective Intelligibility (STOI) (#353)
  • Added Information retrieval metrics:
    • RetrievalRPrecision (#577)
    • RetrievalHitRate (#576)
  • Added NLP metrics:
    • SacreBLEUScore (#546)
    • CharErrorRate (#575)
  • Added other metrics:
    • Tweedie Deviance Score (#499)
    • Learned Perceptual Image Patch Similarity (LPIPS) (#431)
  • Added MAP (mean average precision) metric to new detection package (#467)
  • Added support for float targets in nDCG metric (#437)
  • Added average argument to AveragePrecision metric for reducing multi-label and multi-class problems (#477)
  • Added MultioutputWrapper (#510)
  • Added metric sweeping:
    • higher_is_better as constant attribute (#544)
    • higher_is_better to rest of codebase (#584)
  • Added simple aggregation metrics: SumMetric, MeanMetric, CatMetric, MinMetric, MaxMetric (#506)
  • Added pairwise submodule with metrics (#553)
    • pairwise_cosine_similarity
    • pairwise_euclidean_distance
    • pairwise_linear_similarity
    • pairwise_manhatten_distance
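The Tweedie deviance score added in this release generalizes squared error through a power parameter: power 0 gives squared error (normal distribution), 1 the Poisson deviance and 2 the gamma deviance. The sketch below computes the mean deviance with the standard closed-form unit deviance for each case. It is a plain-Python illustration under those assumptions, not the TorchMetrics implementation, and it does not validate the power (values strictly between 0 and 1 are not valid Tweedie powers) or the positivity of the inputs.

```python
import math
from typing import List


def tweedie_deviance(preds: List[float], targets: List[float], power: float) -> float:
    """Mean Tweedie deviance between targets y and predicted means mu."""
    total = 0.0
    for mu, y in zip(preds, targets):
        if power == 0:  # normal distribution: plain squared error
            dev = (y - mu) ** 2
        elif power == 1:  # Poisson: y * log(y / mu) is taken as 0 when y == 0
            dev = 2 * ((y * math.log(y / mu) if y > 0 else 0.0) - y + mu)
        elif power == 2:  # gamma distribution
            dev = 2 * (math.log(mu / y) + y / mu - 1)
        else:  # general closed form, e.g. 1 < power < 2 (compound Poisson-gamma)
            dev = 2 * (
                max(y, 0.0) ** (2 - power) / ((1 - power) * (2 - power))
                - y * mu ** (1 - power) / (1 - power)
                + mu ** (2 - power) / (2 - power)
            )
        total += dev
    return total / len(preds)


preds = [2.0, 3.0]
targets = [1.0, 3.0]
print(tweedie_deviance(preds, targets, power=0))  # squared error: (1 - 2) ** 2 / 2 = 0.5
print(tweedie_deviance(preds, targets, power=1))  # Poisson deviance
```

Every branch is zero when the prediction equals the target, and higher powers penalize the same absolute error less as the predicted mean grows.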

Changed

  • AveragePrecision will now output the macro average by default for multilabel and multiclass problems (#477)
  • half, double, float will no longer change the dtype of the metric states. Use metric.set_dtype instead (#493)
  • Renamed AverageMeter to MeanMetric (#506)
  • Changed is_differentiable from property to a constant attribute (#551)
  • ROC and AUROC will no longer throw an error when either the positive or negative class is missing. Instead, they return a score of 0 and emit a warning

Deprecated

  • Deprecated torchmetrics.functional.self_supervised.embedding_similarity in favour of new pairwise submodule

Removed

  • Removed dtype property (#493)

Fixed

  • Fixed bug in F1 with average='macro' and ignore_index!=None (#495)
  • Fixed bug in pit by using the returned first result to initialize device and type (#533)
  • Fixed SSIM metric using too much memory (#539)
  • Fixed bug where device property was not properly updated when the metric was a child of a module (#542)

Contributors

@an1lam, @Borda, @karthikrangasai, @lucadiliello, @mahinlma, @Obus, @quancs, @SkafteNicki, @stancld, @tkupek

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Own NLP implementations

[0.5.1] - 2021-08-30

Added

  • Added device and dtype properties (#462)
  • Added TextTester class for robustly testing text metrics (#450)

Changed

  • Added support for float targets in nDCG metric (#437)

Removed

  • Removed rouge-score as dependency for text package (#443)
  • Removed jiwer as dependency for text package (#446)
  • Removed bert-score as dependency for text package (#473)

Fixed

  • Fixed ranking of samples in SpearmanCorrCoef metric (#448)
  • Fixed bug where compositional metrics were unable to sync because of type mismatch (#454)
  • Fixed metric hashing (#478)
  • Fixed BootStrapper metrics not working on GPU (#462)
  • Fixed the semantic ordering of kernel height and width in SSIM metric (#474)

Contributors

@justusschock, @karthikrangasai, @kingyiusuen, @Obus, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda over 4 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Text-related (NLP) metrics

[0.5.0] - 2021-08-09

This release includes general improvements to the library and new metrics within the NLP domain.

https://devblog.pytorchlightning.ai/torchmetrics-v0-5-nlp-metrics-f4232467b0c5

Natural language processing is arguably one of the most exciting areas of machine learning, with models such as BERT, RoBERTa and GPT-3 really pushing the limits of what automated text translation, recognition and generation systems are capable of.

With the introduction of these models, many metrics have been proposed that measure how well these models perform. TorchMetrics v0.5 includes 4 such metrics: BERT score, BLEU, ROUGE and WER.
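Word error rate, one of the four metrics above, is the word-level edit distance between prediction and reference divided by the number of reference words. The sketch below shows that definition in plain Python with a standard dynamic-programming Levenshtein distance. The function names are hypothetical and this is not the TorchMetrics implementation. Running the same edit distance over characters instead of words gives the character error rate.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, insertions and deletions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution (free when the tokens match)
            ))
        prev = curr
    return prev[-1]


def word_error_rate(preds: List[str], target: List[str]) -> float:
    """Total word-level edits divided by the total number of reference words."""
    errors = sum(edit_distance(t.split(), p.split()) for p, t in zip(preds, target))
    total = sum(len(t.split()) for t in target)
    return errors / total


print(word_error_rate(["the cat sat on mat"], ["the cat sat on the mat"]))  # 1 edit / 6 words
```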

Detail changes

Added

  • Added Text-related (NLP) metrics:
    • Word Error Rate (WER) (#383)
    • ROUGE (#399)
    • BERT score (#424)
    • BLEU score (#360)
  • Added MetricTracker wrapper metric for keeping track of the same metric over multiple epochs (#238)
  • Added other metrics:
    • Symmetric Mean Absolute Percentage error (SMAPE) (#375)
    • Calibration error (#394)
    • Permutation Invariant Training (PIT) (#384)
  • Added support in nDCG metric for target with values larger than 1 (#349)
  • Added support for negative targets in nDCG metric (#378)
  • Added None as reduction option in CosineSimilarity metric (#400)
  • Allowed passing labels in (nsamples, nclasses) to AveragePrecision (#386)

Changed

  • Moved psnr and ssim from functional.regression.* to functional.image.* (#382)
  • Moved image_gradient from functional.image_gradients to functional.image.gradients (#381)
  • Moved R2Score from regression.r2score to regression.r2 (#371)
  • Pearson metric now only stores 6 statistics instead of all predictions and targets (#380)
  • Use torch.argmax instead of torch.topk when k=1 for better performance (#419)
  • Moved check for number of samples in R2 score to support single sample updating (#426)
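Keeping a fixed number of statistics is what lets a correlation metric accumulate over batches without storing the data. The sketch below uses six running sums (count, sum of x, sum of y, sum of x squared, sum of y squared, sum of x*y). Which six statistics TorchMetrics actually stores is not assumed here, and this plain-Python class (StreamingPearson is a hypothetical name) is an illustration rather than the library code. Naive running sums like these can lose precision for large values; a Welford-style update is more robust.

```python
import math
from typing import List


class StreamingPearson:
    """Pearson correlation from six running statistics instead of stored samples."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = self.sy = 0.0    # sums of x and y
        self.sxx = self.syy = 0.0  # sums of squares
        self.sxy = 0.0             # sum of cross products

    def update(self, xs: List[float], ys: List[float]) -> None:
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def compute(self) -> float:
        # covariance and variances (all scaled by n) recovered from the running sums
        cov = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx ** 2 / self.n
        var_y = self.syy - self.sy ** 2 / self.n
        return cov / math.sqrt(var_x * var_y)


p = StreamingPearson()
p.update([1.0, 2.0], [2.0, 4.1])  # first batch
p.update([3.0, 4.0], [5.9, 8.0])  # second batch
print(p.compute())  # same value as computing on all four pairs at once
```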

Deprecated

  • Renamed r2score -> r2_score and kldivergence -> kl_divergence in functional (#371)
  • Moved bleu_score from functional.nlp to functional.text.bleu (#360)

Removed

  • Removed restriction that threshold has to be in (0,1) range to support logit input (#351, #401)
  • Removed restriction that preds could not be bigger than num_classes to support logit input (#357)
  • Removed module regression.psnr and regression.ssim (#382)
  • Removed (#379):
    • function functional.mean_relative_error
    • num_thresholds argument in BinnedPrecisionRecallCurve

Fixed

  • Fixed bug where classification metrics with average='macro' would lead to wrong result if a class was missing (#303)
  • Fixed weighted, multi-class AUROC computation to allow for 0 observations of some class, as contribution to final AUROC is 0 (#376)
  • Fixed _forward_cache and _computed attributes not being moved to the correct device when the metric is moved (#413)
  • Fixed calculation in IoU metric when using ignore_index argument (#328)

Contributors

@BeyondTheProof, @Borda, @CSautier, @discort, @edwardclem, @gagan3012, @hugoperrin, @karthikrangasai, @paul-grundmann, @quancs, @rajs96, @SkafteNicki, @vatch123

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Fixing DDP sync

[0.4.1] - 2021-07-05

Changed

  • Extend typing (#330, #332, #333, #335, #314)

Fixed

  • Fixed DDP by adding is_sync logic to Metric (#339)

Scientific Software - Peer-reviewed - Python
Published by Borda almost 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Multimedia - audio & image quality

Overview

https://devblog.pytorchlightning.ai/torchmetrics-v0-4-introducing-multimedia-metrics-e6380a3ad354

Audio

The first highlight of v0.4.0 is a set of three new metrics for evaluating audio data: scale-invariant signal-to-distortion ratio, scale-invariant signal-to-noise ratio, and signal-to-noise ratio. All these metrics take a predicted audio tensor and a target tensor, both with the shape [...,time], and calculate the metric over the time axis.
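For a single 1-D signal (the time axis), the two families of metrics can be sketched as below. SNR compares the target energy with the energy of the residual, while SI-SDR first projects the estimate onto the target so that rescaling the estimate does not change the score. This is a plain-Python illustration, not the TorchMetrics code: it zero-means both signals for SI-SDR (a common formulation; whether the library does so by default is not assumed) and omits the small epsilon terms a production implementation would add to avoid division by zero.

```python
import math
from typing import List


def snr(preds: List[float], target: List[float]) -> float:
    """Signal-to-noise ratio in dB: target energy over residual energy."""
    noise = sum((t - p) ** 2 for p, t in zip(preds, target))
    signal = sum(t ** 2 for t in target)
    return 10 * math.log10(signal / noise)


def si_sdr(preds: List[float], target: List[float]) -> float:
    """Scale-invariant SDR in dB: compare preds with its projection onto target."""
    # zero-mean both signals
    mp, mt = sum(preds) / len(preds), sum(target) / len(target)
    p = [x - mp for x in preds]
    t = [x - mt for x in target]
    # optimal scaling alpha so that alpha * t best matches p
    alpha = sum(a * b for a, b in zip(p, t)) / sum(b * b for b in t)
    s_target = [alpha * b for b in t]
    e_noise = [a - s for a, s in zip(p, s_target)]
    return 10 * math.log10(sum(s * s for s in s_target) / sum(e * e for e in e_noise))


target = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
preds = [0.1, 0.9, -0.1, -1.1, 0.05, 1.0, 0.0, -0.9]
print(snr(preds, target), si_sdr(preds, target))
# rescaling the estimate changes SNR but leaves SI-SDR unchanged
louder = [2 * x for x in preds]
print(snr(louder, target), si_sdr(louder, target))
```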

Image

Version v0.4.0 also includes a completely new image package. Since its initial 0.2.0 release, TorchMetrics has had both PSNR and SSIM in its regression module, metrics that can be used to evaluate image quality. With the image module, we are adding three new metrics for evaluating the quality of generative models (such as GANs): Inception score (IS), Fréchet inception distance (FID) and kernel inception distance (KID).
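PSNR, one of the image-quality metrics mentioned above, is computed from the mean squared error relative to the largest possible pixel value. Below is a minimal sketch over flattened pixel lists. It assumes pixel values span a known range, given by the hypothetical data_range parameter (1.0 for images scaled to [0, 1]), and it is an illustration of the formula rather than the TorchMetrics implementation.

```python
import math
from typing import List


def psnr(preds: List[float], target: List[float], data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range ** 2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)
    return 10 * math.log10(data_range ** 2 / mse)


target = [0.0, 0.5, 1.0, 0.25]
preds = [0.1, 0.5, 0.9, 0.25]
print(psnr(preds, target))  # MSE = 0.005, so PSNR = 10 * log10(200), about 23 dB
```

Higher PSNR means the reconstruction is closer to the target; identical images would give infinite PSNR, which this sketch does not guard against.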

More Functionality

In addition to the new audio and image packages, we also want to highlight a couple of features:

  • Addition of MeanAbsolutePercentageError (MAPE) metric to the regression package. Useful in regression settings where you want to focus on the relative instead of absolute error.
  • Addition of KLDivergence metric to the classification package. Useful for measuring the divergence between probability distributions, like the ones output by variational auto-encoders.
  • Addition of CosineSimilarity metric to the regression package. Useful for measuring the cosine of the angle between two embedding vectors in domains such as metric learning.
  • As requested by multiple users, Accuracy, Precision, Recall, FBeta, F1, StatScore, Hamming and ConfusionMatrix now directly support unnormalized predictions, e.g. logits from your model. No need to call .softmax(dim=-1) anymore!
  • All modular metrics now have both sync and sync_context methods that give the user full control over when metric states are synced. Note that we still do this automatically whenever the compute method is called.
  • The is_differentiable property has been adopted by many more of our metrics!
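Two of the highlighted metrics reduce to short formulas. The sketch below shows mean absolute percentage error (expressed here as a fraction rather than a percentage) and the cosine similarity of two vectors in plain Python. It illustrates the definitions rather than reproducing the TorchMetrics code, and it omits any epsilon guarding against zero targets or zero-length vectors.

```python
import math
from typing import List


def mape(preds: List[float], target: List[float]) -> float:
    """Mean absolute percentage error |y - y_hat| / |y|, averaged and kept as a fraction."""
    return sum(abs(t - p) / abs(t) for p, t in zip(preds, target)) / len(target)


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# the same absolute error of 10 weighs very differently at different scales
print(mape([110.0], [100.0]), mape([20.0], [10.0]))  # 0.1 vs 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 2.0]))  # orthogonal vectors -> 0.0
```

Cosine similarity ignores vector length, so [1, 1] and [2, 2] score 1.0 even though their Euclidean distance is not zero.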

Thanks

Big thanks to all community members for their contributions and feedback. A special thanks to @quancs for leading the development of the new audio package.

[0.4.0] - 2021-06-24

Added

  • Added Cosine Similarity metric (#305)
  • Added Specificity metric (#210)
  • Added add_metrics method to MetricCollection for adding additional metrics after initialization (#221)
  • Added pre-gather reduction in the case of dist_reduce_fx="cat" to reduce communication cost (#217)
  • Added better error message for AUROC when num_classes is not provided for multiclass input (#244)
  • Added support for unnormalized scores (e.g. logits) in Accuracy, Precision, Recall, FBeta, F1, StatScore, Hamming, ConfusionMatrix metrics (#200)
  • Added MeanAbsolutePercentageError(MAPE) metric. (#248)
  • Added squared argument to MeanSquaredError for computing RMSE (#249)
  • Added FID metric (#213)
  • Added is_differentiable property to ConfusionMatrix, F1, FBeta, Hamming, Hinge, IOU, MatthewsCorrcoef, Precision, Recall, PrecisionRecallCurve, ROC, StatScores (#253)
  • Added audio metrics: SNR, SISDR, SISNR (#292)
  • Added Inception Score metric to image module (#299)
  • Added KID metric to image module (#301)
  • Added sync and sync_context methods for manually controlling when metric states are synced (#302)
  • Added KLDivergence metric (#247)

Changed

  • Forward cache is reset when reset method is called (#260)
  • Improved per-class metric handling for imbalanced datasets for precision, recall, precision_recall, fbeta, f1, accuracy, and specificity (#204)
  • Decorated MetricCollection forward with torch.jit.unused (#307)
  • Renamed thresholds argument to binned metrics for manually controlling the thresholds (#322)

Deprecated

  • Deprecated torchmetrics.functional.mean_relative_error (#248)
  • Deprecated num_thresholds argument in BinnedPrecisionRecallCurve (#322)

Removed

  • Removed argument is_multiclass (#319)

Fixed

  • AUC now also supports higher-dimensional inputs when all but one dimension are of size 1 (#242)
  • Fixed dtype of modular metrics after reset has been called (#243)
  • Fixed calculation in matthews_corrcoef to correctly match formula (#321)

Contributors

@AnselmC, @arvindmuralie77, @bhadreshpsavani, @Borda, @GiannisVagionakis, @hassiahk, @IgorHoholko, @johannespitz, @justusschock, @maximsch2, @pranjaldatta, @quancs, @simran2905, @SkafteNicki, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda almost 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor patch release

[0.3.2] - 2021-05-10

Added

  • Added is_differentiable property:
    • To AUC, AUROC, CohenKappa and AveragePrecision (#178)
    • To PearsonCorrCoef, SpearmanCorrcoef, R2Score and ExplainedVariance (#225)

Changed

  • MetricCollection now returns metrics with prefix from items() and keys() (#209)
  • Calling compute before update will now give a warning (#164)

Removed

  • Removed numpy as dependency (#212)

Fixed

  • Fixed auc calculation and added tests (#197)
  • Fixed loading persisted metric states using load_state_dict() (#202)
  • Fixed PSNR not working with DDP (#214)
  • Fixed metric calculation with unequal batch sizes (#220)
  • Fixed metric concatenation for list states for zero-dim input (#229)
  • Fixed numerical instability in AUROC metric for large input (#230)

Contributors

@bhadreshpsavani, @hlin09, @maximsch2, @SkafteNicki, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Minor PL development patch

Cleaned up remaining inconsistencies and fixed PL develop integration (#191, #192, #193, #194)

Scientific Software - Peer-reviewed - Python
Published by Borda about 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Information retrieval

Information Retrieval

Information retrieval (IR) metrics are used to evaluate how well a system is retrieving information from a database or from a collection of documents. This is the case with search engines, where a query provided by the user is compared with many possible results, some of which are relevant and some are not.

When you query a search engine, you hope that results that could be useful are ranked higher on the results page. However, each query is usually compared with a different set of documents. For this reason, we had to implement a mechanism to allow users to easily compute the IR metrics in cases where each query is compared with a different number of possible candidates.

To support this, IR metrics take an additional argument called indexes that specifies which query each prediction belongs to. All query-document pairs are then grouped by query index, and the final result is computed as the average of the metric over all groups.
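The grouping by query index can be sketched in plain Python: pairs are bucketed by their index, a per-query metric is computed for each bucket, and the results are averaged. Here the per-query metric is reciprocal rank, which makes the result a mean reciprocal rank in the spirit of RetrievalMRR. The function names are hypothetical, this is not the TorchMetrics implementation, and returning 0 for a query without relevant documents is an assumption of this sketch (how the library treats such queries is not assumed here).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def grouped_retrieval_metric(
    indexes: List[int],
    preds: List[float],
    target: List[bool],
    metric: Callable[[List[float], List[bool]], float],
) -> float:
    """Group query-document pairs by query index, score each query, then average."""
    groups: Dict[int, Tuple[List[float], List[bool]]] = defaultdict(lambda: ([], []))
    for idx, p, t in zip(indexes, preds, target):
        groups[idx][0].append(p)
        groups[idx][1].append(t)
    scores = [metric(p, t) for p, t in groups.values()]
    return sum(scores) / len(scores)


def reciprocal_rank(preds: List[float], target: List[bool]) -> float:
    """1 / rank of the first relevant document when sorted by score (0 if none)."""
    ranked = sorted(zip(preds, target), key=lambda pair: pair[0], reverse=True)
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


# query 0 is compared with three candidates, query 1 with only two
indexes = [0, 0, 0, 1, 1]
preds = [0.2, 0.7, 0.4, 0.9, 0.3]
target = [True, False, False, False, True]
mrr = grouped_retrieval_metric(indexes, preds, target, reciprocal_rank)
print(mrr)  # (1/3 + 1/2) / 2
```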

In total, 6 new metrics have been added for doing information retrieval:

  • RetrievalMAP (Mean Average Precision)
  • RetrievalMRR (Mean Reciprocal Rank)
  • RetrievalPrecision (Precision for IR)
  • RetrievalRecall (Recall for IR)
  • RetrievalNormalizedDCG (Normalized Discounted Cumulative Gain)
  • RetrievalFallOut (Fall Out rate for IR)

Special thanks go to @lucadiliello for implementing all the IR metrics.

Expanding and improving the collection

In addition to expanding our collection to the field of information retrieval, this release also includes new metrics for the classification domain:

  • BootStrapper, a metric that can wrap around any other metric in our collection for easy computation of confidence intervals
  • CohenKappa, a statistic used to measure inter-rater reliability for qualitative (categorical) items
  • MatthewsCorrcoef, or phi coefficient, used in machine learning as a measure of the quality of binary (two-class) classifications
  • Hinge loss, used for "maximum-margin" classification, most notably for support vector machines
  • PearsonCorrcoef, a metric for measuring the linear correlation between two sets of data
  • SpearmanCorrcoef, a metric for measuring the rank correlation between two sets of data. It assesses how well the relationship between two variables can be described using a monotonic function.
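As an example of these classification metrics, the Matthews correlation coefficient for binary labels can be computed directly from the four confusion-matrix counts: MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python illustration of that formula, not the TorchMetrics implementation; returning 0 when the denominator is zero is a convention assumed here.

```python
import math
from typing import List


def matthews_corrcoef(preds: List[int], target: List[int]) -> float:
    """Binary MCC computed from the confusion-matrix counts."""
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, target))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, target))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, target))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, target))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention in this sketch: 0.0 when any row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


print(matthews_corrcoef([1, 0, 1, 1], [1, 0, 0, 1]))  # TP=2, TN=1, FP=1, FN=0 -> 1/sqrt(3)
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction).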

Binned metrics

The current implementation of the AveragePrecision and PrecisionRecallCurve has the drawback that it saves all predictions and targets in memory to correctly calculate the metric value. These metrics now receive a binned version that calculates the value at fixed thresholds. This is less precise than the original implementations but much more memory efficient.
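The memory trade-off can be illustrated with a small sketch: instead of storing every prediction, a binned metric keeps three counters (true positives, false positives, false negatives) per fixed threshold and updates them batch by batch. The class name BinnedPR and the choice to report precision 1.0 when nothing is predicted positive are assumptions of this plain-Python sketch, which is not the TorchMetrics BinnedPrecisionRecallCurve implementation.

```python
from typing import List, Tuple


class BinnedPR:
    """Hypothetical binned precision-recall: memory is O(num_thresholds), not O(num_samples)."""

    def __init__(self, thresholds: List[float]) -> None:
        self.thresholds = thresholds
        self.tp = [0] * len(thresholds)
        self.fp = [0] * len(thresholds)
        self.fn = [0] * len(thresholds)

    def update(self, preds: List[float], target: List[int]) -> None:
        for p, t in zip(preds, target):
            for i, thr in enumerate(self.thresholds):
                if p >= thr:  # predicted positive at this threshold
                    if t == 1:
                        self.tp[i] += 1
                    else:
                        self.fp[i] += 1
                elif t == 1:  # missed positive
                    self.fn[i] += 1

    def compute(self) -> List[Tuple[float, float, float]]:
        out = []
        for thr, tp, fp, fn in zip(self.thresholds, self.tp, self.fp, self.fn):
            # convention in this sketch: precision = 1.0 when nothing is predicted positive
            precision = tp / (tp + fp) if tp + fp else 1.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            out.append((thr, precision, recall))
        return out


pr = BinnedPR(thresholds=[0.25, 0.5, 0.75])
pr.update([0.1, 0.6, 0.8], [0, 1, 1])  # batch 1
pr.update([0.4, 0.9], [1, 0])          # batch 2
for thr, precision, recall in pr.compute():
    print(f"threshold={thr}: precision={precision:.2f}, recall={recall:.2f}")
```

Memory now grows with the number of thresholds rather than the number of samples; the price is that the curve is only known at those thresholds.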

Special thanks go to @SkafteNicki for making all this happen.

https://devblog.pytorchlightning.ai/torchmetrics-v0-3-0-information-retrieval-metrics-and-more-c55265e9b94f

[0.3.0] - 2021-04-20

Added

  • Added BootStrapper to easily calculate confidence intervals for metrics (#101)
  • Added Binned metrics (#128)
  • Added metrics for Information Retrieval:
    • Added RetrievalMAP (PL^5032)
    • Added RetrievalMRR (#119)
    • Added RetrievalPrecision (#139)
    • Added RetrievalRecall (#146)
    • Added RetrievalNormalizedDCG (#160)
    • Added RetrievalFallOut (#161)
  • Added other metrics:
    • Added CohenKappa (#69)
    • Added MatthewsCorrcoef (#98)
    • Added PearsonCorrcoef (#157)
    • Added SpearmanCorrcoef (#158)
    • Added Hinge (#120)
  • Added average='micro' as an option in AUROC for multilabel problems (#110)
  • Added multilabel support to ROC metric (#114)
  • Added testing for half precision (#77, #135)
  • Added AverageMeter for ad-hoc averages of values (#138)
  • Added prefix argument to MetricCollection (#70)
  • Added __getitem__ as metric arithmetic operation (#142)
  • Added property is_differentiable to metrics and test for differentiability (#154)
  • Added support for average, ignore_index and mdmc_average in Accuracy metric (#166)
  • Added postfix arg to MetricCollection (#188)

Changed

  • Changed ExplainedVariance from storing all preds/targets to tracking 5 statistics (#68)
  • Changed behavior of confusionmatrix for multilabel data to better match multilabel_confusion_matrix from sklearn (#134)
  • Updated FBeta arguments (#111)
  • Changed reset method to use detach.clone() instead of deepcopy when resetting to default (#163)
  • Metrics passed as dict to MetricCollection will now always be in deterministic order (#173)
  • Allowed MetricCollection pass metrics as arguments (#176)

Deprecated

  • Rename argument is_multiclass -> multiclass (#162)

Removed

  • Prune remaining deprecated (#92)

Fixed

  • Fixed _stable_1d_sort to work when n>=N (PL^6177)
  • Fixed _computed attribute not being correctly reset (#147)
  • Fixed BLEU score (#165)
  • Fixed backwards compatibility for logging with older version of pytorch-lightning (#182)

Contributors

@alanhdu, @arvindmuralie77, @bhadreshpsavani, @Borda, @ethanwharris, @lucadiliello, @maximsch2, @SkafteNicki, @thomasgaudelet, @victorjoos

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 5 years ago

TorchMetrics - Measuring Reproducibility in PyTorch - Initial release

What is Torchmetrics

TorchMetrics is a collection of 25+ PyTorch metric implementations and an easy-to-use API to create custom metrics. It offers:

  • A standardized interface to increase reproducibility
  • Reduced boilerplate
  • Compatibility with distributed training
  • Automatic accumulation over batches
  • Automatic synchronization between multiple devices

You can use TorchMetrics in any PyTorch model, or within PyTorch Lightning to enjoy additional features:

  • Module metrics are automatically placed on the correct device.
  • Native support for logging metrics in Lightning to reduce even more boilerplate.

Using functional metrics

Similar to torch.nn, most metrics have both a module-based and a functional version. The functional versions implement the basic operations required for computing each metric. They are simple Python functions that take torch.Tensors as input and return the corresponding metric as a torch.Tensor.

```python
import torch

# import our library
import torchmetrics

# simulate a classification problem
preds = torch.randn(10, 5).softmax(dim=-1)
target = torch.randint(5, (10,))

acc = torchmetrics.functional.accuracy(preds, target)
```

Using Module metrics

Nearly all functional metrics have a corresponding module-based metric that calls its functional counterpart underneath. The module-based metrics are characterized by having one or more internal metric states (similar to the parameters of a PyTorch module) that allow them to offer additional functionality:

  • Accumulation of multiple batches
  • Automatic synchronization between multiple devices
  • Metric arithmetic

```python
import torch

# import our library
import torchmetrics

# initialize metric
metric = torchmetrics.Accuracy()

nbatches = 10
for i in range(nbatches):
    # simulate a classification problem
    preds = torch.randn(10, 5).softmax(dim=-1)
    target = torch.randint(5, (10,))
    # metric on current batch
    acc = metric(preds, target)
    print(f"Accuracy on batch {i}: {acc}")

# metric on all batches using custom accumulation
acc = metric.compute()
print(f"Accuracy on all data: {acc}")
```

Built-in metrics

  • Accuracy
  • AveragePrecision
  • AUC
  • AUROC
  • F1
  • Hamming Distance
  • ROC
  • ExplainedVariance
  • MeanSquaredError
  • R2Score
  • bleu_score
  • embedding_similarity

And many more!

Contributors

@Borda, @SkafteNicki, @williamFalcon, @teddykoker, @justusschock, @tadejsv, @edenlightning, @ydcjeff, @ddrevicky, @ananyahjha93, @awaelchli, @rohitgr7, @akihironitta, @manipopopo, @Diuven, @arnaudgelas, @s-rog, @c00k1ez, @tgaddair, @elias-ramzi, @cuent, @jpcarzolio, @bryant1410, @shivdhar, @Sordie, @krzysztofwos, @abhik-99, @bernardomig, @peblair, @InCogNiTo124, @j-dsouza, @pranjaldatta, @ananthsub, @deng-cy, @abhinavg97, @tridao, @prampey, @abrahambotros, @ozen, @ShomyLiu, @yuntai, @pwwang

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Scientific Software - Peer-reviewed - Python
Published by Borda about 5 years ago