ethnicolr

Predict Race and Ethnicity Based on the Sequence of Characters in a Name

https://github.com/appeler/ethnicolr

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.2%) to scientific vocabulary

Keywords

ethnicity lstm machine-learning names race

Keywords from Contributors

text-corpus electoral-rolls gender-classification india distributed-data-collection traveling-salesman
Last synced: 6 months ago

Repository

Predict Race and Ethnicity Based on the Sequence of Characters in a Name

Basic Info
Statistics
  • Stars: 247
  • Watchers: 14
  • Forks: 64
  • Open Issues: 0
  • Releases: 12
Topics
ethnicity lstm machine-learning names race
Created almost 9 years ago · Last pushed 6 months ago
Metadata Files
Readme Citation

README.md

ethnicolr: Predict Race and Ethnicity From Name


We exploit the US census data, the Florida voter registration data, and the Wikipedia data collected by Skiena and colleagues to predict race and ethnicity from the first and last name, or from the last name alone. The granularity of the prediction depends on the dataset. For instance, Skiena et al.'s Wikipedia data is at the ethnic-group level, while the census data we use in the model (the raw data has additional categories for Native Americans and bi-racial people) distinguishes only between Non-Hispanic Whites, Non-Hispanic Blacks, Asians, and Hispanics.

New Package With New Models in Pytorch

https://github.com/appeler/ethnicolr2

Streamlit App

https://ethnicolr.streamlit.app/

Caveats and Notes

If you picked a person at random with the last name 'Smith' in the US in 2010 and asked us to guess this person's race (as measured by the census), the best guess would be based on what is available from the aggregated Census file: it is the Bayes-optimal solution. What, then, are last-name-only predictive models good for? A few things: imputing race and ethnicity for last names that are not in the census file, inferring race and ethnicity in years other than when the census was conducted (if some assumptions hold), inferring the race of people in other countries (if some assumptions hold), etc. The biggest benefit comes in cases where both the first name and last name are known.
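To make the Bayes-optimal baseline concrete, here is a minimal sketch (not part of the package; the shares are illustrative numbers in the shape of the census columns): with nothing but aggregated shares for a last name, the optimal single guess is simply the category with the highest share.

```python
# Illustrative aggregated census shares for the last name "smith"
# (hypothetical numbers, mirroring pctwhite/pctblack/... columns).
smith_shares = {
    "white": 0.7335,
    "black": 0.2222,
    "api": 0.0040,
    "hispanic": 0.0156,
}

# Bayes-optimal guess with no other information: the arg-max category.
best_guess = max(smith_shares, key=smith_shares.get)
print(best_guess)  # white
```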

Install

We strongly recommend installing ethnicolr inside a Python virtual environment (see the venv documentation).

```bash
pip install ethnicolr
```

Notes:

  • The models are run and verified on TensorFlow 2.x using Python 3.10 through 3.12.
  • If you install on Windows, Theano installation typically needs administrator privileges on the shell.

Jupyter Quickstart

```bash
pip install ethnicolr jupyter
ethnicolr_download_models
jupyter notebook ethnicolr/examples
```

Open one of the example notebooks and run the cells to see the package in action.

General API

To see the available command line options for any function, run `<function-name> --help`:

```bash
census_ln --help
usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input

Appends Census columns by last name

positional arguments:
  input                 Input file

optional arguments:
  -h, --help            show this help message and exit
  -y {2000,2010}, --year {2000,2010}
                        Year of Census data (default=2000)
  -o OUTPUT, --output OUTPUT
                        Output file with Census data columns
  -l LAST, --last LAST  Name of the column containing the last name
```

Cleaning Names

The prediction models work best when first and last names contain only alphabetic characters. Before calling the CLI or Python APIs, strip out titles (e.g., Dr, Hon.), middle names, suffixes, punctuation and non-ASCII characters. The pred_wiki_name command automatically normalizes names by removing diacritics and extraneous characters. If the tool still skips entries, check that the first and last name columns are not empty after cleaning.
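A minimal cleaning pass along these lines might look as follows. This is only a sketch, not the package's internal routine; the list of titles and suffixes to drop is an assumption you should adapt to your data.

```python
import re
import unicodedata

# Assumed (illustrative) set of titles and suffixes to strip.
DROP_TOKENS = {"dr", "mr", "mrs", "ms", "hon", "prof", "rev", "jr", "sr", "ii", "iii"}

def clean_name(name: str) -> str:
    """Transliterate diacritics to ASCII, drop punctuation and title/suffix tokens."""
    # e.g. "José" -> "Jose"; characters with no ASCII equivalent are dropped
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Replace anything that is not a letter, space, apostrophe, or hyphen
    name = re.sub(r"[^A-Za-z\s'\-]", " ", name)
    tokens = [t for t in name.lower().split() if t not in DROP_TOKENS]
    return " ".join(tokens)

print(clean_name("Dr. José O'Neill-Smith, Jr."))  # jose o'neill-smith
```

After cleaning, verify that the first- and last-name columns are non-empty before passing them to the prediction functions.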

Examples

To append census data from 2010 to a file with a column header in the first row, specify the column carrying last names using the `-l` option, keeping the rest the same:

```bash
census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv
```

To predict race/ethnicity using the Wikipedia full-name model, specify the last-name and first-name columns with the `-l` and `-f` flags, respectively:

```bash
pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv
```

Functions

We expose several functions, each of which takes either a pandas DataFrame or a CSV file.

  • census_ln(df, lname_col, year=2000)
    • What it does:
    • Removes extra spaces.
    • For names in the census file, appends the probability that a person with that last name belongs to each race/ethnicity.

Parameters


          **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
           containing the names of the individuals to be inferred

          **lname_col** : *{string}* name of the column containing the
           last name

          **year** : *{2000, 2010}, default=2000* year of census to use

  • Output: Appends the following columns to the pandas DataFrame or CSV: pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. See here for what the column names mean.

```python
>>> import pandas as pd
>>> from ethnicolr import census_ln, pred_census_ln
>>> names = [{'name': 'smith'}, {'name': 'zhang'}, {'name': 'jackson'}]
>>> df = pd.DataFrame(names)
>>> df
      name
0    smith
1    zhang
2  jackson
>>> census_ln(df, 'name')
      name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0    smith    73.35    22.22   0.40    0.85      1.63        1.56
1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
2  jackson    41.93    53.02   0.31    1.04      2.18        1.53
```

  • pred_census_ln(df, lname_col, year=2000, num_iter=100, conf_int=1.0)


    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **year** : *{2000, 2010}, default=2000* year of census to use

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty in the model

            **conf_int** : *float, default=1.0* confidence interval for the
             predicted class
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (white, black, asian, or hispanic), api (probability of being Asian), black, hispanic, white. For each race, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

(Using the same dataframe from example above)

```python
>>> census_ln(df, 'name')
      name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0    smith    73.35    22.22   0.40    0.85      1.63        1.56
1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
2  jackson    41.93    53.02   0.31    1.04      2.18        1.53

>>> census_ln(df, 'name', 2010)
      name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic
0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4
1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15
2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5

>>> pred_census_ln(df, 'name')
      name   race       api     black  hispanic     white
0    smith  white  0.002019  0.247235  0.014485  0.736260
1    zhang    api  0.997807  0.000149  0.000470  0.001574
2  jackson  black  0.002797  0.528193  0.014605  0.454405
```
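The num_iter and conf_int parameters summarize repeated stochastic predictions: run the model several times, then reduce each class's probability to a mean, a standard deviation, and percentile bounds. The following is only an illustrative sketch of that summary step using simulated draws, not the package's actual model code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-iteration predicted probabilities for one name and one class
# (a stand-in for num_iter=100 stochastic forward passes of the model).
draws = rng.normal(loc=0.74, scale=0.03, size=100)

# With conf_int=0.9, take the 5th and 95th percentiles as lb/ub.
conf_int = 0.9
lo, hi = np.percentile(draws, [(1 - conf_int) / 2 * 100, (1 + conf_int) / 2 * 100])
print(draws.mean(), draws.std(), lo, hi)
```

This mirrors the `*_mean`, `*_std`, `*_lb`, and `*_ub` columns that the pred_* functions append.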

  • pred_wiki_ln(df, lname_col, num_iter=100, conf_int=1.0)

    • What it does:
    • Removes extra space.
    • Uses the last name wiki model to predict the race and ethnicity.

    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty in the model

            **conf_int** : *float, default=1.0* confidence interval for the
             predicted class
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (the category with the highest probability). For each category, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

The categories are: "Asian,GreaterEastAsian,EastAsian", "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent", "GreaterAfrican,Africans", "GreaterAfrican,Muslim", "GreaterEuropean,British", "GreaterEuropean,EastEuropean", "GreaterEuropean,Jewish", "GreaterEuropean,WestEuropean,French", "GreaterEuropean,WestEuropean,Germanic", "GreaterEuropean,WestEuropean,Hispanic", "GreaterEuropean,WestEuropean,Italian", "GreaterEuropean,WestEuropean,Nordic".

```python
>>> import pandas as pd
>>> names = [
...     {"last": "smith", "first": "john", "true_race": "GreaterEuropean,British"},
...     {"last": "zhang", "first": "simon", "true_race": "Asian,GreaterEastAsian,EastAsian"},
... ]
>>> df = pd.DataFrame(names)
>>> from ethnicolr import pred_wiki_ln, pred_wiki_name
>>> odf = pred_wiki_ln(df, 'last', conf_int=0.9)
['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']
>>> odf
    last  first                         true_race  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
0  Smith   john           GreaterEuropean,British  ...                                0.007382                                0.048828           GreaterEuropean,British
1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  ...                                0.001844                                0.027252  Asian,GreaterEastAsian,EastAsian

[2 rows x 56 columns]
>>> odf.iloc[0, :8]
last                                              Smith
first                                              john
true_race                       GreaterEuropean,British
Asian,GreaterEastAsian,EastAsian_mean          0.016103
Asian,GreaterEastAsian,EastAsian_std           0.009735
Asian,GreaterEastAsian,EastAsian_lb            0.005873
Asian,GreaterEastAsian,EastAsian_ub            0.034637
Asian,GreaterEastAsian,Japanese_mean           0.003814
Name: 0, dtype: object
```

  • pred_wiki_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)

    • What it does:
    • Removes extra space.
    • Uses the full name wiki model to predict the race and ethnicity.

    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **fname_col** : *{string}* name of the column containing the
             first name

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty of predictions

            **conf_int** : *float, default=1.0* confidence interval
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (the category with the highest probability), plus "Asian,GreaterEastAsian,EastAsian", "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent", "GreaterAfrican,Africans", "GreaterAfrican,Muslim", "GreaterEuropean,British", "GreaterEuropean,EastEuropean", "GreaterEuropean,Jewish", "GreaterEuropean,WestEuropean,French", "GreaterEuropean,WestEuropean,Germanic", "GreaterEuropean,WestEuropean,Hispanic", "GreaterEuropean,WestEuropean,Italian", "GreaterEuropean,WestEuropean,Nordic". For each category, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

(Using the same dataframe from example above)

```python
>>> odf = pred_wiki_name(df, 'last', 'first', conf_int=0.9)
['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']
>>> odf
    last  first                         true_race        _name  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
0  Smith   john           GreaterEuropean,British   Smith John  ...                                0.001048                                0.016288           GreaterEuropean,British
1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  Zhang Simon  ...                                0.000019                                0.002470  Asian,GreaterEastAsian,EastAsian

[2 rows x 57 columns]
>>> odf.iloc[0, :8]
last                                              Smith
first                                              john
true_race                       GreaterEuropean,British
_name                                        Smith John
Asian,GreaterEastAsian,EastAsian_mean          0.004111
Asian,GreaterEastAsian,EastAsian_std           0.002929
Asian,GreaterEastAsian,EastAsian_lb            0.001356
Asian,GreaterEastAsian,EastAsian_ub            0.010571
Name: 0, dtype: object
```

  • pred_fl_reg_ln(df, lname_col, num_iter=100, conf_int=1.0)


    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **num_iter** : *int, default=100* number of iterations used to
             calculate the uncertainty

            **conf_int** : *float, default=1.0* confidence interval
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (asian, hispanic, nh_black, or nh_white), asian (probability of being Asian), hispanic, nh_black, nh_white. For each race, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

```python
>>> import pandas as pd
>>> names = [
...     {"last": "sawyer", "first": "john", "true_race": "nh_white"},
...     {"last": "torres", "first": "raul", "true_race": "hispanic"},
... ]
>>> df = pd.DataFrame(names)
>>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat
>>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)
['asian', 'hispanic', 'nh_black', 'nh_white']
>>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673  ...       0.787724      0.051082     0.705290     0.860286  nh_white
1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146  ...       0.102300      0.017828     0.075911     0.130929  hispanic

[2 rows x 20 columns]
>>> odf.iloc[0]
last               Sawyer
first                john
true_race        nh_white
asian_mean       0.009859
asian_std        0.006819
asian_lb         0.005338
asian_ub         0.019673
hispanic_mean    0.021488
hispanic_std     0.004602
hispanic_lb      0.014802
hispanic_ub      0.030148
nh_black_mean    0.180929
nh_black_std     0.052784
nh_black_lb      0.105756
nh_black_ub      0.270238
nh_white_mean    0.787724
nh_white_std     0.051082
nh_white_lb       0.70529
nh_white_ub      0.860286
race             nh_white
Name: 0, dtype: object
```

  • pred_fl_reg_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)

    • What it does:
    • Removes extra space.
    • Uses the full name FL model to predict the race and ethnicity.

    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **fname_col** : *{string}* name of the column containing the
             first name

            **num_iter** : *int, default=100* number of iterations used to
             calculate the uncertainty

            **conf_int** : *float, default=1.0* confidence interval for the
             predicted class
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (asian, hispanic, nh_black, or nh_white), asian (probability of being Asian), hispanic, nh_black, nh_white. For each race, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

(Using the same dataframe from example above)

```python
>>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)
['asian', 'hispanic', 'nh_black', 'nh_white']
>>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691  ...       0.963581      0.015738     0.935445     0.983224  nh_white
1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748  ...       0.092251      0.026675     0.049868     0.139210  hispanic

>>> odf.iloc[1]
last               Torres
first                raul
true_race        hispanic
asian_mean       0.005791
asian_std        0.002906
asian_lb         0.002446
asian_ub         0.011748
hispanic_mean    0.890561
hispanic_std     0.029581
hispanic_lb      0.841328
hispanic_ub      0.937706
nh_black_mean    0.011397
nh_black_std     0.004682
nh_black_lb      0.005829
nh_black_ub      0.020796
nh_white_mean    0.092251
nh_white_std     0.026675
nh_white_lb      0.049868
nh_white_ub       0.13921
race             hispanic
Name: 1, dtype: object
```

  • pred_fl_reg_ln_five_cat(df, lname_col, num_iter=100, conf_int=1.0)


    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string, list, int}* name or location of the
             column containing the last name

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty

            **conf_int** : *float, default=1.0* confidence interval
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (asian, hispanic, nh_black, nh_white, or other), asian (probability of being Asian), hispanic, nh_black, nh_white, other. For each race, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

(Using the same dataframe from example above)

```python
>>> odf = pred_fl_reg_ln_five_cat(df, 'last')
['asian', 'hispanic', 'nh_black', 'nh_white', 'other']
>>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  ...  other_mean  other_std  other_lb  other_ub      race
0  Sawyer  john  nh_white    0.100038   0.020539  0.073266  0.143334       0.044263  ...    0.248466   0.021040  0.219721  0.283785  nh_white
1  Torres  raul  hispanic    0.062390   0.021863  0.033837  0.103737       0.774414  ...    0.117761   0.019524  0.089418  0.150615  hispanic

[2 rows x 24 columns]
>>> odf.iloc[0]
last               Sawyer
first                john
true_race        nh_white
asian_mean       0.100038
asian_std        0.020539
asian_lb         0.073266
asian_ub         0.143334
hispanic_mean    0.044263
hispanic_std     0.013077
hispanic_lb       0.02476
hispanic_ub      0.067965
nh_black_mean    0.230593
nh_black_std     0.063948
nh_black_lb      0.130577
nh_black_ub      0.343513
nh_white_mean    0.376639
nh_white_std     0.048289
nh_white_lb      0.296989
nh_white_ub      0.452834
other_mean       0.248466
other_std         0.02104
other_lb         0.219721
other_ub         0.283785
race             nh_white
Name: 0, dtype: object
```

  • pred_fl_reg_name_five_cat(df, lname_col, fname_col, num_iter=100, conf_int=1.0)

    • What it does:
    • Removes extra space.
    • Uses the full name FL model to predict the race and ethnicity.

    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **fname_col** : *{string}* name of the column containing the
             first name

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty

            **conf_int** : *float, default=1.0* confidence interval
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race (asian, hispanic, nh_black, nh_white, or other), asian (probability of being Asian), hispanic, nh_black, nh_white, other. For each race, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.

(Using the same dataframe from example above)

```python
>>> odf = pred_fl_reg_name_five_cat(df, 'last', 'first')
['asian', 'hispanic', 'nh_black', 'nh_white', 'other']
>>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  ...  other_mean  other_std  other_lb  other_ub      race
0  Sawyer  john  nh_white    0.039310   0.011657  0.025982  0.059719       0.019737  ...    0.192242   0.021004  0.160185  0.226063  nh_white
1  Torres  raul  hispanic    0.020086   0.011765  0.008240  0.041741       0.899110  ...    0.055774   0.017897  0.036245  0.088741  hispanic

[2 rows x 24 columns]
>>> odf.iloc[1]
last               Torres
first                raul
true_race        hispanic
asian_mean       0.020086
asian_std        0.011765
asian_lb          0.00824
asian_ub         0.041741
hispanic_mean     0.89911
hispanic_std     0.042237
hispanic_lb      0.823799
hispanic_ub      0.937612
nh_black_mean    0.005956
nh_black_std     0.006528
nh_black_lb      0.002686
nh_black_ub      0.010134
nh_white_mean    0.019073
nh_white_std     0.009901
nh_white_lb      0.010166
nh_white_ub      0.040081
other_mean       0.055774
other_std        0.017897
other_lb         0.036245
other_ub         0.088741
race             hispanic
Name: 1, dtype: object
```

  • pred_nc_reg_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)

    • What it does:
    • Removes extra space.
    • Uses the full name NC model to predict the race and ethnicity.

    Parameters


            **df** : *{DataFrame, csv}* pandas DataFrame or CSV file
             containing the names of the individuals to be inferred

            **lname_col** : *{string}* name of the column containing the
             last name

            **fname_col** : *{string}* name of the column containing the
             first name

            **num_iter** : *int, default=100* number of iterations used to
             calculate uncertainty

            **conf_int** : *float, default=1.0* confidence interval
    

    • Output: Appends the following columns to the pandas DataFrame or CSV: race + ethnicity. The codebook is here. For each category, it provides the mean, standard error, and the lower and upper bounds of the confidence interval.
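The category labels combine an ethnicity prefix with a race suffix. As a hedged illustration (our reading of codes like "HL+O", not an official mapping shipped with the package; consult the codebook for the authoritative definitions), a decoder might look like:

```python
# Assumed decoding of the NC race+ethnicity codes, e.g. "HL+O":
# prefix HL/NL = Hispanic-Latino / Non-Latino; suffix = race code.
ETHNICITY = {"HL": "Hispanic/Latino", "NL": "Non-Latino"}
RACE = {
    "A": "Asian",
    "B": "Black",
    "I": "American Indian",
    "M": "Multiracial",
    "O": "Other",
    "W": "White",
}

def decode(code: str) -> str:
    """Turn a combined code such as 'HL+O' into a readable label."""
    eth, race = code.split("+")
    return f"{ETHNICITY[eth]}, {RACE[race]}"

print(decode("HL+O"))  # Hispanic/Latino, Other
```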

```python
>>> import pandas as pd
>>> names = [
...     {"last": "hernandez", "first": "hector", "true_race": "HL+O"},
...     {"last": "zhang", "first": "simon", "true_race": "NL+A"},
... ]
>>> df = pd.DataFrame(names)
>>> from ethnicolr import pred_nc_reg_name
>>> odf = pred_nc_reg_name(df, 'last', 'first', conf_int=0.9)
['HL+A', 'HL+B', 'HL+I', 'HL+M', 'HL+O', 'HL+W', 'NL+A', 'NL+B', 'NL+I', 'NL+M', 'NL+O', 'NL+W']
>>> odf
        last   first true_race             _name  ...   NL+W_lb   NL+W_ub  race
0  hernandez  hector      HL+O  Hernandez Hector  ...  0.001256  0.001256  HL+O
1      zhang   simon      NL+A       Zhang Simon  ...  0.000154  0.000154  NL+A

[2 rows x 53 columns]
>>> odf.iloc[0]
last                hernandez
first                  hector
true_race                HL+O
_name        Hernandez Hector
HL+A_mean                 0.0
HL+A_std                  0.0
HL+A_lb                   0.0
HL+A_ub                   0.0
HL+B_mean            0.000654
HL+B_std                  0.0
HL+B_lb              0.000654
HL+B_ub              0.000654
HL+I_mean            0.000032
HL+I_std                  0.0
HL+I_lb              0.000032
HL+I_ub              0.000032
HL+M_mean            0.000541
HL+M_std                  0.0
HL+M_lb              0.000541
HL+M_ub              0.000541
HL+O_mean             0.58944
HL+O_std                  0.0
HL+O_lb               0.58944
HL+O_ub               0.58944
HL+W_mean            0.221309
HL+W_std                  0.0
HL+W_lb              0.221309
HL+W_ub              0.221309
NL+A_mean            0.000044
NL+A_std                  0.0
NL+A_lb              0.000044
NL+A_ub              0.000044
NL+B_mean            0.002199
NL+B_std                  0.0
NL+B_lb              0.002199
NL+B_ub              0.002199
NL+I_mean            0.000004
NL+I_std                  0.0
NL+I_lb              0.000004
NL+I_ub              0.000004
NL+M_mean            0.000008
NL+M_std                  0.0
NL+M_lb              0.000008
NL+M_ub              0.000008
NL+O_mean            0.184513
NL+O_std                  0.0
NL+O_lb              0.184514
NL+O_ub              0.184514
NL+W_mean            0.001256
NL+W_std                  0.0
NL+W_lb              0.001256
NL+W_ub              0.001256
race                     HL+O
Name: 0, dtype: object
```

Application

To illustrate how the package can be used, we impute the race of the campaign contributors recorded by the FEC for the years 2000 and 2010 and tally campaign contributions by race.

Data on the race of all the people in the DIME data is posted here. The underlying Python scripts are posted here.

Data

In particular, we utilize the last name and race data from the 2000 and 2010 censuses, the Wikipedia data collected by Skiena and colleagues, and the Florida voter registration data from early 2017.

Evaluation

  1. SCAN Health Plan, a Medicare Advantage plan that serves over 200,000 members throughout California, used the software to better assess racial disparities in health among the people it serves. It had racial data on only about 47% of its members, so it used the software to infer the race of the remaining 53%. On the labeled data, the last-name model achieved an AUC of 0.9 and 83% accuracy.

  2. Evaluation on NC Data: https://github.com/appeler/ncraceethnicity

Authors

Suriyan Laohaprapanon and Gaurav Sood

Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on them. To maintain this welcoming atmosphere and to collaborate in a fun and productive way, we expect contributors to abide by the Contributor Code of Conduct.

License

The package is released under the MIT License.

Adjacent Repositories

  • appeler/ethnicolr2 Ethnicolr implementation with new models in pytorch
  • appeler/ethnicolor Race and Ethnicity based on name using data from census, voter reg. files, etc.
  • appeler/instate instate: predict the state of residence from last name using the indian electoral rolls
  • appeler/search_names Search a long list of names (patterns) in a large text corpus systematically and quickly
  • appeler/ncraceethnicity Evaluation of some of the ethnicolr models on the NC Voter Registration Data + New Models Based on NC Voter Registration Data.

Owner

  • Name: appeler
  • Login: appeler
  • Kind: organization

Making sense of names.

GitHub Events

Total
  • Issues event: 3
  • Watch event: 19
  • Delete event: 5
  • Issue comment event: 6
  • Push event: 70
  • Pull request review comment event: 2
  • Pull request review event: 2
  • Pull request event: 7
  • Fork event: 2
  • Create event: 6
Last Year
  • Issues event: 3
  • Watch event: 19
  • Delete event: 5
  • Issue comment event: 6
  • Push event: 70
  • Pull request review comment event: 2
  • Pull request review event: 2
  • Pull request event: 7
  • Fork event: 2
  • Create event: 6

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 222
  • Total Committers: 15
  • Avg Commits per committer: 14.8
  • Development Distribution Score (DDS): 0.554
Top Committers
Name Email Commits
Suriyan Laohaprapanon s****t@g****m 99
***** g****7@g****m 71
Gaurav Sood s****u@u****m 17
dependabot[bot] 4****]@u****m 11
Bashar Naji 7****i@u****m 8
Mady s****p@g****m 5
root r****t@j****l 2
Roman Imankulov r****v@g****m 2
Josh Malina j****a@g****m 1
John Chen j****2@l****m 1
Snyk bot g****t@s****o 1
Steven Buss s****s@g****m 1
Rajashekar Chintalapati r****h@g****m 1
Vignesh Chandrasekharan v****4@g****m 1
Ananth V 4****d@u****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 49
  • Total pull requests: 54
  • Average time to close issues: 2 months
  • Average time to close pull requests: 7 days
  • Total issue authors: 43
  • Total pull request authors: 12
  • Average comments per issue: 2.1
  • Average comments per pull request: 0.44
  • Merged pull requests: 46
  • Bot issues: 0
  • Bot pull requests: 15
Past Year
  • Issues: 2
  • Pull requests: 6
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 minute
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.33
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • soodoku (5)
  • mocherson (2)
  • messamat (2)
  • allienu (1)
  • stewmorg (1)
  • barslan16 (1)
  • logrkn (1)
  • MaryFllh (1)
  • andret6 (1)
  • zhiyzuo (1)
  • JasonBock (1)
  • jeremyholtzman (1)
  • gen-li (1)
  • sayandev (1)
  • floswald (1)
Pull Request Authors
  • dependabot[bot] (15)
  • soodoku (12)
  • basharnaji (9)
  • suriyan (8)
  • imankulov (3)
  • AnanthVivekanand (1)
  • johntiger1 (1)
  • VC444 (1)
  • sbuss (1)
  • joshmalina (1)
  • rajashekar (1)
  • snyk-bot (1)
Top Labels
Issue Labels
bug (5) enhancement (1)
Pull Request Labels
dependencies (15) codex (4) python (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 2,477 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 12
  • Total versions: 29
  • Total maintainers: 3
pypi.org: ethnicolr

Predict Race/Ethnicity Based on Sequence of Characters in Names

  • Versions: 29
  • Dependent Packages: 0
  • Dependent Repositories: 12
  • Downloads: 2,477 Last month
Rankings
Dependent repos count: 4.2%
Stargazers count: 4.6%
Forks count: 5.3%
Downloads: 5.8%
Average: 6.0%
Dependent packages count: 10.1%
Maintainers (3)
Last synced: 6 months ago

Dependencies

.github/workflows/python-publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
package.json npm
requirements-rtd.txt pypi
  • numpy *
  • pandas >=1.3.0
  • sphinx ==6.1.3
  • sphinx-rtd-theme ==1.2.0
  • tensorflow >=2.7.2,<3
requirements.txt pypi
  • numpy *
  • pandas >=1.3.0
  • tensorflow >=2.7.2,<3
setup.py pypi
  • pandas >=1.3.0
  • tensorflow >=2.7.2,<3
  • tensorflow-aarch64 >=2.7.2,<3
streamlit/requirements.txt pypi
  • Cython >=0.28.5
  • ethnicolr ==0.9.6
  • joblib *
  • matplotlib *
  • nltk *
  • numpy >=1.22.0
  • pandas >=1.3.0
  • scikit-learn ==0.22.2.post1
  • setuptools >=65.5.1
  • streamlit *
  • tensorflow >=2.7.2,<3
  • tqdm *