earnest

Statistical baby name preference learning

https://github.com/captainpete/earnest

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.3%) to scientific vocabulary

Last synced: 11 months ago · JSON representation ·

Repository

Statistical baby name preference learning

Basic Info

Host: GitHub
Owner: captainpete
License: mit
Language: Jupyter Notebook
Default Branch: main
Size: 294 KB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

Earnest

A quick baby-name preference learning app for unabashed data science nerds.

Uses the names dataset from https://pypi.org/project/names-dataset/
Converts all the names to 875 wide feature frame comprised of:
- 1: male prevalence
- 1: female prevalence
- 105: country prevalence
- 768: embedding vector
By soliciting preferences in repeated 1-vs-20 rounds, learns a ranking
Displays the top and bottom 50 names

More on preferences

This is an Active Learning approach. Active Learning is useful when labelling is expensive, but can be prone to feedback loops depending on how the iterations are constructed. This app presents three columns of names, here referred to as A, B and C, from which the user is to select a single name as the most preferred for the round.

Column A is sampled from the current top 200 names
Column B is sampled from the next 800 (200:1000) names
Column C is sampled uniformly at random

By clicking on a name, the preference is recorded, the model retrained, and the user presented with a new set of names. There is a search box under the columns to allow selection of a name not listed (useful in bootstrapping the model), and a button filled with flowers that simply resamples using the current model.

Iterative results

After a few rounds, I've found this model learns a preference relatively quickly. One imagines the features are quite informative, specifically the embedding vector - which encodes all manner of historical, literate, and cultural associations. Results are displayed after each round.

Bias

If your name is in "the worst" list - please don't take it personally! This model is designed to uncover your preference; mine was clearly very Victorian. It's worth noting also that the nomic embedding model used here will also have a bias that reflects the data.

This project is called Earnest reflecting the preferences of Gwendolen and Cecily, who preferred that name over Jack or John - but what did this say about their preference for other names?

Running

pip install names-dataset
Install streamlit_searchbox, rapidfuzz and xgboost
Install Ollama and the Nomic embed model
Run the load.pynb notebook to produce baby_names_full.pkl
Run streamlit run earnest.py to launch the interface

Support

The code is pretty straight-forward. Best of luck!

Owner

Name: Peter Hollows
Login: captainpete
Kind: user
Location: Melbourne, Australia

Website: https://dojo7.com/
Repositories: 11
Profile: https://github.com/captainpete

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: earnest
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Peter
    family-names: Hollows
    email: github@dojo7.com
repository-code: 'https://github.com/captainpete/earnest'
license: MIT

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science