otrecod.jl

Optimal transport for data recoding

https://github.com/otrecoding/otrecod.jl

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

julia julia-language optimal-transport
Last synced: 9 months ago

Repository

Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created almost 7 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

OTRecod.jl

Valérie Garès & Jérémy Omer, 2022. "Regularized Optimal Transport of Covariates and Outcomes in Data Recoding," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 117(537), pages 320-333, January.

Abstract: When databases are constructed from heterogeneous sources, it is not unusual that different encodings are used for the same outcome. In such case, it is necessary to recode the outcome variable before merging two databases. The method proposed for the recoding is an application of optimal transportation where we search for a bijective mapping between the distributions of such variable in two databases. In this article, we build upon the work by Garés et al., where they transport the distributions of categorical outcomes assuming that they are distributed equally in the two databases. Here, we extend the scope of the model to treat all the situations where the covariates explain the outcomes similarly in the two databases. In particular, we do not require that the outcomes be distributed equally. For this, we propose a model where joint distributions of outcomes and covariates are transported. We also propose to enrich the model by relaxing the constraints on marginal distributions and adding an L1 regularization term. The performances of the models are evaluated in a simulation study, and they are applied to a real dataset.

Article record (IDEAS/RePEc): https://ideas.repec.org/a/taf/jnlasa/v117y2022i537p320-333.html

Installation

The package runs on Julia 1.1 and above. In a Julia session, switch to pkg> mode to add the package:

```julia
julia> ] # switch to pkg> mode
pkg> add https://github.com/otrecoding/OTRecod.jl
```

Alternatively, you can achieve the above using the Pkg API:

```julia
julia> import Pkg
julia> Pkg.add(url = "https://github.com/otrecoding/OTRecod.jl")
```

When finished, make sure that you're back to the Julian prompt (julia>) and bring OTRecod into scope:

```julia
julia> using OTRecod
```

You can test the package with:

```julia
julia> ] # switch to pkg> mode
pkg> test OTRecod
```

To run an example from a dataset:

```julia
julia> using OTRecod

help?> run_directory
search: run_directory

  run_directory(path, method; outname="result.out", maxrelax=0.0, lambda_reg=0.0, nbfiles=0, norme=0, percent_closest=0.2)

  Run one given method on a given number of data files of a given directory. The data files must be the only files with extension ".txt" in the directory.

    • path: name of the directory
    • method: :group or :joint
    • maxrelax: maximum percentage of deviation from expected probability masses
    • lambda_reg: coefficient measuring the importance of the regularization term
    • nbfiles: number of files considered, 0 if all the data files are tested
    • norme: 0, 1 or 2, norm used for distances in the space of covariates
    • percent_closest: percent of closest neighbors taken in the computation of the costs (both distance and regularization related)
    • observed: if nonempty, list of indices of the observed covariates; this makes it possible to exclude some latent variables.
```
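The documented signature might be used as in the following minimal sketch. It assumes that the function is spelled `run_directory` and the regularization keyword `lambda_reg` (underscores are easily lost when help text is rendered). The `data` directory, the output file name, and the keyword values are hypothetical and chosen for illustration only, not as recommended settings.

```julia
using OTRecod

# Hypothetical directory; it should contain only the ".txt" data files to process.
datadir = joinpath(pwd(), "data")

# Illustrative keyword values (assumptions, not recommended settings).
options = (
    outname = "result_joint.out",  # hypothetical output file name
    maxrelax = 0.4,                # allowed deviation from expected probability masses
    lambda_reg = 0.1,              # weight of the regularization term
    nbfiles = 1,                   # process a single data file (0 would mean all files)
    norme = 1,                     # norm used for distances between covariates
    percent_closest = 0.2,         # share of closest neighbors used in the costs
)

if isdir(datadir)
    # :joint is one of the two documented methods (:group or :joint).
    run_directory(datadir, :joint; options...)
else
    @warn "Data directory not found, nothing to run" datadir
end
```

Collecting the keywords in a named tuple keeps the settings in one place, so the same options can be reused when comparing the `:group` and `:joint` methods on the same directory.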

Copyright © 2020 Jeremy Omer jeremy.omer@insa-rennes.fr.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation. See LICENSE file.

Owner

  • Name: Optimal transport to recode data variables
  • Login: otrecoding
  • Kind: organization
  • Location: Rennes

Citation (CITATION.bib)

@article{doi:10.1080/01621459.2020.1775615,
author = {Valérie Garès and Jérémy Omer},
title = {Regularized Optimal Transport of Covariates and Outcomes in Data Recoding},
journal = {Journal of the American Statistical Association},
volume = {117},
number = {537},
pages = {320-333},
year  = {2022},
publisher = {Taylor \& Francis},
doi = {10.1080/01621459.2020.1775615},
URL = { https://doi.org/10.1080/01621459.2020.1775615 },
eprint = { https://doi.org/10.1080/01621459.2020.1775615 },
abstract = { When databases are constructed from heterogeneous sources, it is not unusual that different encodings are used for the same outcome. In such case, it is necessary to recode the outcome variable before merging two databases. The method proposed for the recoding is an application of optimal transportation where we search for a bijective mapping between the distributions of such variable in two databases. In this article, we build upon the work by Garés et al., where they transport the distributions of categorical outcomes assuming that they are distributed equally in the two databases. Here, we extend the scope of the model to treat all the situations where the covariates explain the outcomes similarly in the two databases. In particular, we do not require that the outcomes be distributed equally. For this, we propose a model where joint distributions of outcomes and covariates are transported. We also propose to enrich the model by relaxing the constraints on marginal distributions and adding an L1 regularization term. The performances of the models are evaluated in a simulation study, and they are applied to a real dataset. The code used in the computational assessment and in the simulation of test cases is publicly available on Github repository: https://github.com/otrecoding/OTRecod.jl. }
}

GitHub Events

Total
  • Push event: 1
  • Fork event: 1
Last Year
  • Push event: 1
  • Fork event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 88
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 months
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 45
  • Bot issues: 0
  • Bot pull requests: 86
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • pnavaro (1)
Pull Request Authors
  • github-actions[bot] (45)
  • pnavaro (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/CompatHelper.yml actions
  • julia-actions/setup-julia v1 composite
.github/workflows/ci.yml actions
  • actions/cache v1 composite
  • actions/checkout v3 composite
  • codecov/codecov-action v1 composite
  • julia-actions/julia-buildpkg latest composite
  • julia-actions/julia-docdeploy latest composite
  • julia-actions/julia-processcoverage v1 composite
  • julia-actions/julia-runtest latest composite
  • julia-actions/setup-julia v1 composite