https://github.com/amazon-science/minimax-fair

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Size: 75.2 KB

Statistics

Stars: 9
Watchers: 1
Forks: 7
Open Issues: 2
Releases: 0

Created about 5 years ago · Last pushed about 5 years ago

https://github.com/amazon-science/minimax-fair/blob/main/

# MinimaxFair - Convergent Algorithms for (Relaxed) Minimax Group Fairness

MinimaxFair is a Python package for training ML models for (relaxed) minimax group fairness as discussed in https://arxiv.org/abs/2011.03108

This repository contains python code for
* learning models that achieve minimax group fairness for both regression and classification tasks
* learning models that minimize error subject to relaxed group fairness constraints
* visualizing tradeoffs between fairness and overall error

We also include some examples of fairness sensitive datasets for experimentation, though our package supports any
dataset formatted as a .csv whose columns are labeled

Our algorithms support the following training objectives (loss functions):
* Mean Squared Error
* 0/1 Loss
* Log Loss
* False Positive Rate
* False Negative Rate

Our algorithms support the following model classes:
* Linear Regression
* Logistic Regression
* Paired Regression Classifier from https://github.com/algowatchpenn/GerryFair
* Perceptron
* Multi-Layer Perceptron for Classification - Uses custom wrapper class to work with our algorithm

## Installation

To install the package and prepare for use, run:

```
git clone https://github.com/amazon-research/minimax-fair.git
```

The current version of the package uses the following packages: pandas, numpy, sklearn, matplotlib, pytorch, and
scipy. If you do not have these packages, please run the following command to install them:

```
pip install -r requirements.txt
```

## Using our package

To use our package, we recommend modifying the relevant parameters in `main_driver.py` (which are described in detail
within the file itself) and then running the following in the terminal from the root directory of MinimaxFair:
```
python3 main_driver.py
```

## Datasets

### Downloading pre-configured datasets.

To use one of the pre-configured datasets, you must first download the dataset from the corresponding online repository
, and place it in the `datasets` folder with the specified file name.

We have provided a brief description of each dataset, along with a link to a page from where it can be downloaded
. For some datasets, additional formatting or the running of provided scripts may be required.

#### Descriptions of pre-configured datasets

**Adult** - Predict whether income exceeds $50K/yr based on census data. http://archive.ics.uci.edu/ml/datasets/Adult

Downloading instructions: On the linked webpage click the "Data Folder" link and download `adult.data`. Rename the file to `adult.csv`
and place it in the `datasets` folder in this package. Then, in the `src` folder of this package, run the script
`clean_adult_data.py` which will create the the file `adult_cleaned.csv` which is formatted to be usedwith our pre-configured code.

**Seoul Bike Sharing** - The dataset contains count of public bikes rented at each hour in Seoul Bike sharing System
with
the
corresponding Weather data and Holidays information. https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand

Downloading instructions: On the linked webpage click the "Data Folder" link, download `SeoulBikeData.csv`, and place
in the `datasets` folder. If there are issues, make sure that the csv is saved in utf-8 encoding, and delete the "" symbol
in the column headers "Temperature(C)" and "Dew point temperature(C)".

**Communities and Crime** - Communities within the United States. The data combines socio-economic data from the
1990 US
Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. We added a column
for plurality race, which we used to divide communities into groups. https://github.com/algowatchpenn/GerryFair/tree/master/dataset

Downloading instructions: On the provided webpage, download `communities.csv` and place in the `datasets` folder. Then, in the `src` folder of this package, run the script
`clean_communities_data.py` which will create the the file `communities_cleaned.csv` which is formatted to be used
with our pre-configured code.

**COMPAS** - Data from Broward County Clerks Office, Broward County Sherrif's Office, Florida Department of Corrections
, compiled by ProPublica. https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis

Downloading instructions: Go to the linked github, download the file `compas-scores-two-years.csv` and place in the `datasets`
folder.

**Credit** - German Credit Data for predicting good and bad credit risks, compiled by Prof Dr. Hans Hoffman. https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

Downloading instructions: Open "Data Folder" and download `german.data`. Rename to `german.csv` and place in `datasets`
folder. Then, in the `src` folder of this package, run the script
`setup_credit_dataset.py` which will create the the file `german_cleaned.csv` which is formatted to be used
with our pre-configured code.

**Default** - This dataset contains information on default payments, demographic factors, credit data, history of
payment
, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Downloading instructions: Follow the "Data Folder" link and download `default of credit card clients.xls`. Open the
file in
excel
and delete the top row with X1, X2, etc. Save the file as `default.csv` and place in `datasets` folder.

**Fires** - This is a regression task, where the aim is to predict the burned area of forest fires, in the
northeast region of Portugal, by using meteorological and other data. https://archive.ics.uci.edu/ml/datasets/forest+fires

Downloading instructions: In the Data Folder, download `forestfires.csv` directly into `datasets` folder.

**Heart** - This dataset contains the medical records of 299 patients who had heart failure, collected during their
follow-up period, where each patient profile has 13 clinical features. https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

Downloading instructions: Download `heart_failure_clinical_records_dataset.csv` into `datasets` folder.

**Bank Marketing** - The data is related with direct marketing campaigns (phone calls) of a Portuguese banking
institution. The classification goal is to predict if the client will subscribe a term deposit (variable y). https://archive.ics.uci.edu/ml/datasets/bank+marketing

Downloading instructions: In the Data Folder, download `bank.zip`, extract the folder, and move the files `bank-full
.csv` and `bank.csv` into the `datasets` folder.

**Student** - Predict student performance in secondary education (high school). https://archive.ics.uci.edu/ml/datasets/student+performance

Downloading instructions: In the Data folder, download `student.zip`, extract, and move the file `student-mat.csv` to the `datasets` folder.
If there are errors, the ";" characters may need to be replaced with ", " to meet the csv format. This can also be done by using the data-to-columns
tool in Excel and using the ";" character as a delimiter to split the first column.

**Wine** - Using chemical analysis determine the origin of wines. https://archive.ics.uci.edu/ml/datasets/wine+quality

Downloading instructions: In the Data Folder, download `winequality-red.data`, and `winequality-white.data` and
replace the `.data` extensions with `.csv` for both files, and place them in `datasets` folder. Then, run
`combine_wine_data.py` which will create a new csv named `winequality-full.csv` in the `datasets` folder which is
ready for use.

### Running the code with pre-configured datasets

To use a pre-configured dataset simply change the value of
`use_preconfigured_dataset` to True and select the relevant dataset index in `main_drive .py`. This selects the
relevant parameter values which can be viewed for each dataset in the file `dataset_mapping.py`.

This setting can also be used to generate customized synthetic data using the dataset index 0. Settings for
synthetic data generation are described in more detail in `main_driver.py`.

### Using custom datasets

Our algorithm can be used with any dataset that is in the form of csv files with labeled columns. To use a custom
dataset, do the following:
1) Place the csv file for the dataset in the `datasets` folder which is located in the root directory of the package
2) Specify the relevant parameters for interpreting the dataset in the section starting on line 63 of `main_driver
.py`
* Relevant parameters include target label, column labels for group categories, usable features, etc.
3) Ensure the other parameters in `main_driver.py` have the desired value and run the file from the command line as
described above

## License

Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Citation

If you publish material that uses this code, you can use the following citation:

```js
@inproceedings{diana2021minimax,
title={Minimax Group Fairness: Algorithms and Experiments},
author={Diana, Emily and Gill, Wesley and Kearns, Michael and Kenthapadi, Krishnaram and Roth, Aaron},
year = {2021},
booktitle={Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society}
}
```

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Last Year

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 14 days
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/amazon-science/minimax-fair

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/amazon-science/minimax-fair/blob/main/

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies