double-jeopardy-in-llms

Code for "Double Jeopardy and Climate Impact in the Use of Large Language Models." Includes scripts for analyzing socio-economic disparities, tokenization inefficiencies, and LLM utility using FLORES-200, Ethnologue, WDI, and GPT-4 APIs.

https://github.com/worldbank/double-jeopardy-in-llms

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.7%) to scientific vocabulary

Keywords

climate-change ethnologue gdp language llm openai-api tokenization wdi

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: worldbank
License: mpl-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 6.33 MB

Statistics

Stars: 1
Watchers: 6
Forks: 0
Open Issues: 0
Releases: 0

Topics

climate-change ethnologue gdp language llm openai-api tokenization wdi

Created 12 months ago · Last pushed 11 months ago

Metadata Files

Readme Contributing License Code of conduct Citation

README.md

Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

This work investigates the socio-economic disparities and reduced utility that non-English speakers face when using large language models (LLMs). We use the FLORES-200 dataset and Ethnologue to analyze the socio-economic disparities, and OpenAI's GPT-4 API to assess the reduced utility of LLMs for non-English speakers.

Paper on ArXiv

Data sources and APIs

This section provides an overview of the datasets and APIs used in the paper.

FLORES-200 and FLORES+: A multilingual dataset covering about 200 languages, with roughly 1,000 sentences per language. Used for evaluating translation quality and computing the tokenization premium relative to English.
Ethnologue: Provides linguistic data, including the number of speakers, geographic distribution, and writing systems. We use Ethnologue to estimate the number of speakers for each language.
World Development Indicators (WDI): Contains socio-economic data at the country level. Specifically, we use the GDP per capita in current US$ (NY.GDP.PCAP.CD) and annual population growth rate (SP.POP.GROW) indicators to compute the population-weighted GDP for each language and to align population estimates to 2022 based on historical figures from Ethnologue.
OpenAI GPT-4o and GPT-4 Turbo APIs: Used to assess the reduced utility of LLMs for non-English speakers. We applied different prompting methods to translate FLORES sentences: the LLM translated non-English sentences into English, with the original English sentences serving as the benchmark for evaluating translation quality.
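The tokenization premium mentioned above can be illustrated with a minimal sketch. It assumes the premium is the ratio of the total token count for a language to the total token count for the parallel English sentences; that definition, the function names, and the whitespace tokenizer are illustrative assumptions, not the code from the notebooks.

```python
from typing import Callable, Sequence


def tokenization_premium(
    sentences: Sequence[str],
    english_sentences: Sequence[str],
    count_tokens: Callable[[str], int],
) -> float:
    """Return total tokens in a language divided by total tokens in parallel English.

    Assumed definition for illustration; a value above 1.0 means the language
    needs more tokens than English to express the same content.
    """
    if len(sentences) != len(english_sentences):
        raise ValueError("parallel corpora must have the same number of sentences")
    language_total = sum(count_tokens(s) for s in sentences)
    english_total = sum(count_tokens(s) for s in english_sentences)
    if english_total == 0:
        raise ValueError("English reference produced zero tokens")
    return language_total / english_total


def whitespace_token_count(text: str) -> int:
    # Hypothetical stand-in tokenizer; a real run would use a model tokenizer.
    return len(text.split())


# Toy parallel "sentences" (placeholders, not FLORES data).
english = ["one two three", "four five"]
other = ["a b c d e f", "g h i j"]
print(tokenization_premium(other, english, whitespace_token_count))
```

With tiktoken (listed in pyproject.toml), `count_tokens` could plausibly be `lambda s: len(enc.encode(s))` for an encoding such as `enc = tiktoken.encoding_for_model("gpt-4o")`; whether the notebooks do exactly this is not stated here.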

Notebooks

This section provides a listing and brief description of the notebooks used to generate inputs for the paper.

notebooks/: This folder contains the notebooks used to generate the data and analyze the results.
- notebooks/compute-premium-costs.ipynb: Computes the tokenization premium for the FLORES dataset. The calculation of the population-weighted GDP for each language is also done in this notebook.
- notebooks/back-translation-task.ipynb: Generates the back-translation task for the FLORES dataset. The notebook implements the batched translation strategy for the translation task and uses the OpenAI GPT-4o API.
- notebooks/analysis.ipynb: Notebook for additional analysis of the results. Key visualizations are generated in this notebook, including the comparison of the tokenization premiums between two different tokenizers (GPT-4o vs. GPT-4 Turbo).
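As a rough illustration of the population-weighted GDP computed in notebooks/compute-premium-costs.ipynb, the sketch below weights each country's GDP per capita by the number of speakers of a language living there. The weighting scheme, the function name, and the country codes and figures are assumptions made for this example, not values or code taken from the repository.

```python
def population_weighted_gdp(
    speakers_by_country: dict[str, float],
    gdp_per_capita: dict[str, float],
) -> float:
    """Average GDP per capita across countries, weighted by speaker counts."""
    weighted_sum = 0.0
    total_speakers = 0.0
    for country, speakers in speakers_by_country.items():
        if country not in gdp_per_capita:
            # Skip countries with no GDP figure rather than treating them as zero.
            continue
        weighted_sum += speakers * gdp_per_capita[country]
        total_speakers += speakers
    if total_speakers == 0:
        raise ValueError("no speakers in countries with GDP data")
    return weighted_sum / total_speakers


# Hypothetical language spoken in two made-up countries.
speakers = {"AAA": 3_000_000, "BBB": 1_000_000}
gdp = {"AAA": 2_000.0, "BBB": 10_000.0}
weighted = population_weighted_gdp(speakers, gdp)
print(weighted)
```

Because three quarters of the hypothetical speakers live in the lower-income country, the weighted figure sits closer to that country's GDP per capita than a simple average would.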

Also, in the reports/ folder, you can find the figures generated for the paper.

Running the code

This repository uses Poetry to manage dependencies. To install them, run the following command:

poetry install

To review the list of dependencies, please refer to the pyproject.toml file.

VS Code / Cursor users can use the Python extension to run the notebooks.

Otherwise, use the following command to spin up a local Jupyter server:

poetry run jupyter notebook

It is recommended to use a virtual environment to run the code.

Additionally, the notebooks/compute-premium-costs.ipynb notebook uses the OpenAI API. To use the API, you need to set the OPENAI_API_KEY environment variable. You can create a .env file in the root of the repository and add the following:

OPENAI_API_KEY=<your-openai-api-key>
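One way to make the key from the .env file visible to a notebook is sketched below using only the standard library. The `load_env_file` helper is hypothetical and is not part of this repository; the notebooks may load the file differently.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Ignore blanks, comments, and lines that are not KEY=VALUE pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


load_env_file(".env")
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; calls to the OpenAI API will fail.")
```

Values already present in the environment take precedence over the file, so a key exported in the shell is not overwritten.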

Computational resources

This work has been developed using a MacBook Pro with an M1 Pro processor and 64GB of RAM. No GPU is needed for the computations. Access to the OpenAI API is required.

Notes

Some of the notebooks are not publicly available because they handle proprietary data from Ethnologue. One of these notebooks computes the adjusted population based on the historical figures from Ethnologue and the annual population growth rates.
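Since that notebook is not public, the following is only a sketch of how such an adjustment could work. It assumes the historical speaker count is compounded forward year by year using annual growth rates expressed in percent (as SP.POP.GROW is); the function name and all figures are hypothetical.

```python
def project_population(
    base_population: float,
    base_year: int,
    target_year: int,
    growth_pct_by_year: dict[int, float],
) -> float:
    """Compound a population estimate from base_year forward to target_year.

    growth_pct_by_year[y] is assumed to be the percent growth from year y-1 to y.
    A missing year raises KeyError so gaps are not silently ignored.
    """
    if target_year < base_year:
        raise ValueError("target_year must not precede base_year")
    population = float(base_population)
    for year in range(base_year + 1, target_year + 1):
        population *= 1.0 + growth_pct_by_year[year] / 100.0
    return population


# Hypothetical estimate from 2019 aligned to 2022 with made-up growth rates.
growth = {2020: 1.0, 2021: 2.0, 2022: 0.0}
adjusted = project_population(1_000_000, 2019, 2022, growth)
print(round(adjusted))
```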

Citation

Please cite our paper as follows when referencing this work.

@misc{solatorio2024doublejeopardyclimateimpact,
  title={Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers},
  author={Aivin V. Solatorio and Gabriel Stefanini Vicente and Holly Krambeck and Olivier Dupriez},
  year={2024},
  eprint={2410.10665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.10665},
}

Code of Conduct

This project maintains a Code of Conduct to ensure an inclusive and respectful environment for everyone. Please adhere to it in all interactions within our community.

License

This project is licensed under the Mozilla Public License 2.0.

Owner

Name: World Bank Group
Login: worldbank
Kind: organization
Email: github@worldbank.org

Repositories: 182
Profile: https://github.com/worldbank

World Bank Repository for Data Products and tools. Content does not necessarily represent official World Bank Group positions, policies, recommendations, etc.

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Hidden Costs of LLMs"
authors:
  - affiliation: World Bank
    family-names: Solatorio
    given-names: Aivin
  - affiliation: World Bank
    family-names: Stefanini Vicente
    given-names: Gabriel
    orcid: https://orcid.org/0000-0001-6530-3780
keywords:
  - Open Science
  - Large language model

GitHub Events

Total

Watch event: 1
Public event: 1
Push event: 13
Pull request event: 4
Create event: 3

Last Year

Watch event: 1
Public event: 1
Push event: 13
Pull request event: 4
Create event: 3

Dependencies

.github/workflows/gh-pages.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite
peaceiris/actions-gh-pages v3 composite

.github/workflows/release.yml actions

actions/checkout v4 composite
actions/download-artifact v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
pypa/gh-action-pypi-publish release/v1 composite
sigstore/gh-action-sigstore-python v1.2.3 composite

poetry.lock pypi

171 dependencies

pyproject.toml pypi

datasets ^2.18.0
docutils 0.20.1
fire ^0.5.0
groq ^0.6.0
httpx ^0.27.0
ipykernel ^6.29.4
ipywidgets ^8.1.2
joblib ^1.4.0
jupyter-book >=1,<2
kaleido 0.2.1
matplotlib ^3.9.0
nbformat >=4.2.0
openai ^1.30.1
openpyxl ^3.1.2
pandas ^2
plotly ^5.21.0
python >=3.10
requests ^2.28.1
scipy ^1.14.0
statsmodels ^0.14.4
tiktoken ^0.7.0
tokenizers ^0.15.2
torch ^2.2.2
transformers ^4.39.2
