Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 2
  • Watchers: 6
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
  • Readme
  • Contributing
  • License
  • Code of conduct
  • Citation

README.md

ai4data_use

Set up environment

Set up your environment via conda or venv

```bash
# create a virtual environment (replace {myenv} with your environment name)
python -m venv {myenv}

# activate the environment
source {myenv}/bin/activate

# install required packages
pip install -r requirements.txt

# move to the scripts folder
cd scripts
```

Quickstart

If you want to test the entire process without setting anything up, we recommend checking out the notebooks inside the examples folder.

Batch Processing

For batch processing, the following assumes that your research papers are in PDF format in the input directory (in our case, climate-related PRWP documents as well as Adaptation-One-Earth-Policy documents).

You also need to set up your config.yaml and .env files and put your OPENAI_API_KEY in the latter. Also adjust the necessary configurations such as MAX_REQUESTS_PER_BATCH: for large-scale PDF collections you can set it to a maximum of 50,000, which is the API limit for batch processing.
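As an illustration, the `.env` file just needs a line like `OPENAI_API_KEY=sk-...`, and a minimal `config.yaml` might look like the sketch below. Apart from `MAX_REQUESTS_PER_BATCH`, the key names are assumptions; match them to the project's actual schema:

```yaml
# Illustrative config.yaml sketch -- key names other than
# MAX_REQUESTS_PER_BATCH are assumptions, not the project's real schema.
input_dir: input                # directory containing the PDF files
model: gpt-4o-mini              # model used for zero-shot extraction
MAX_REQUESTS_PER_BATCH: 50000   # 50,000 is the OpenAI Batch API limit
```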

Workflow

We have a 3-step data labeling process:

  1. Zero-shot Extraction — Using 4o-mini, we extract potential dataset mentions and their corresponding metadata (if available).
  2. LLM-as-a-Judge Validation — Using the zero-shot extraction outputs, a validation layer tags each dataset mention valid: true if the model thinks it is a dataset mention and false if not, together with its corresponding invalid_reason.
  3. Autonomous Reasoning — Using the output of the LLM-as-a-Judge validation, a final layer incorporates a Devil's Advocate mechanism to challenge its own classification by considering alternative interpretations. It also re-evaluates ambiguous cases and can override the judgements of the earlier layers.

The processes are named "extraction", "judge", and "reasoning".
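To make the hand-off between the three stages concrete, the sketch below shows plausible record shapes at each stage. The field names follow the description above (`valid`, `invalid_reason`) but are otherwise assumptions, not the project's actual output schema:

```python
# Illustrative record shapes for the three-stage pipeline.
# Field names are assumptions based on the README's description.

# 1. Zero-shot extraction: a candidate dataset mention plus any metadata.
extracted = {"mention": "World Development Indicators", "metadata": {"year": "2021"}}

# 2. LLM-as-a-Judge: each mention is tagged valid/invalid with a reason.
judged = {**extracted, "valid": False, "invalid_reason": "Refers to a report, not a dataset"}

# 3. Autonomous reasoning: the Devil's Advocate pass may override the judge.
reasoned = {**judged, "valid": True, "final_reason": "Named data product with a release year"}

# The three process names used by run_batch.py --process <name>.
PROCESSES = ["extraction", "judge", "reasoning"]
```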

Zero-Shot Extraction

```bash
# once the prerequisites and dependencies are satisfied, run the following in a terminal
python run_batch.py --process extraction
```

The script above processes the input directory, builds the batches in the format the OpenAI Batch API expects, and submits them. It also sets up the directories needed for the run and saves the list of batch_ids to a text file so you can track their status.

The helper code below lets you check the status of your batch run.

```python
from openai import OpenAI

def list_batches(client):
    """Lists all submitted batches along with their statuses."""
    try:
        batches = client.batches.list()
        print("All Batch Jobs:")
        for batch in batches:
            print(f"Batch ID: {batch.id}, Status: {batch.status}, Created At: {batch.created_at}")
    except Exception as e:
        print(f"Error listing batches: {e}")

# or use the text file under extraction_outputs to filter the outputs
api_key = "YOUR_API_KEY"  # or get from config using load_config
client = OpenAI(api_key=api_key)
file_path = "extraction_outputs/extraction_batches.txt"

with open(file_path, "r") as f:
    batches_res = f.readlines()

batch_ids = [batch.strip() for batch in batches_res]
batches = client.batches.list()
for batch in batches:
    if batch.id in batch_ids:
        print(f"{batch.id} : {batch.status}")
```

Note: It will take a while for the batches to be completed.
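If you would rather wait programmatically than re-run the listing snippet by hand, a small polling loop along these lines works. `client.batches.retrieve` is the standard OpenAI SDK call; the helper below is otherwise a sketch, and the terminal-status set reflects the Batch API lifecycle:

```python
import time

# States in which an OpenAI batch job can no longer change.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once a batch has reached a final state."""
    return status in TERMINAL_STATUSES

def wait_for_batches(client, batch_ids, interval=60):
    """Poll until every batch in batch_ids reaches a terminal state.

    `client` is an openai.OpenAI instance; this sketch assumes you already
    loaded the IDs from the batch-tracking text file.
    """
    pending = set(batch_ids)
    while pending:
        for batch_id in list(pending):
            status = client.batches.retrieve(batch_id).status
            if is_terminal(status):
                print(f"{batch_id}: {status}")
                pending.discard(batch_id)
        if pending:
            time.sleep(interval)
```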

Once all the batches have completed, retrieve their results by running the following:

```bash

python retrieve_results.py --process extraction

```

It will automatically place the result of each batch run in its corresponding output file under extraction_outputs/extraction.

LLM-as-a-Judge

Once the outputs are saved under extraction_outputs/extraction we can now process the LLM-as-a-Judge pipeline where the model will validate the zero-shot extracted dataset mentions.

```bash

python run_batch.py --process judge

```

Batch IDs for this process will be saved under extraction_outputs/judge_batches.txt; you can again track the batch run until it is completed.

Again, once completed we can run the file to retrieve its results.

```bash

python retrieve_results.py --process judge

```

It will automatically place the result of each batch run in its corresponding output file under extraction_outputs/judge.

Autonomous Reasoning Agent

Once the information is validated by the LLM, we will use the autonomous reasoning agent to further refine and validate the extracted data. The reasoning agent will follow a structured prompt to ensure the accuracy and relevance of the dataset mentions.

```bash

python run_batch.py --process reasoning

```

Batch IDs for this process will be saved under extraction_outputs/reasoning.txt; you can again track the batch run until it is completed.

Once completed we can run the file to retrieve its results.

```bash

python retrieve_results.py --process reasoning

```

It will automatically place the result of each batch run in its corresponding output file under extraction_outputs/reasoning.

Next Steps

Now that you have validated results from the pipeline, you can build a fine-tuning dataset. Once you have the reasoning outputs from the step above, just run the code below.

```bash

python generate_finetune_data.py

```
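The conversion that the script above performs presumably resembles the sketch below, which turns one validated reasoning record into a chat-format fine-tuning example. The field names, prompt wording, and helper name here are illustrative assumptions, not the script's actual logic:

```python
import json

def to_chat_example(paper_text: str, record: dict) -> dict:
    """Build one chat-format fine-tuning example (illustrative schema).

    `record` is assumed to hold the pipeline's final judgement, e.g.
    {"mention": ..., "valid": ...}; adjust to the real output schema.
    """
    return {
        "messages": [
            {"role": "system", "content": "Extract dataset mentions from the text."},
            {"role": "user", "content": paper_text},
            {"role": "assistant", "content": json.dumps(
                {"mention": record["mention"], "valid": record["valid"]}
            )},
        ]
    }

# Writing one example per line produces the JSONL format fine-tuning expects.
example = to_chat_example(
    "We use the World Development Indicators (2021).",
    {"mention": "World Development Indicators", "valid": True},
)
```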

[Optional] Manually Labelled Data

You can also use manually annotated data to fine-tune your model.

Finetuning Your Model

After generating your fine-tuning data, you can fine-tune your model. We have provided a notebook for fine-tuning with Unsloth; you can find it in the examples folder. Follow the instructions in the notebook to load your fine-tuning data and start the fine-tuning process.

Owner

  • Name: World Bank Group
  • Login: worldbank
  • Kind: organization
  • Email: github@worldbank.org

World Bank Repository for Data Products and tools. Content does not necessarily represent official World Bank Group positions, policies, recommendations, etc.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Country borders or names do not necessarily reflect the World Bank Group’s official position. All maps are for illustrative purposes and do not imply the expression of any opinion on the part of the World Bank, concerning the legal status of any country or territory or concerning the delimitation of frontiers or boundaries."
title: "World Bank Data Lab Project Template"
authors:
  - affiliation: World Bank
    family-names: Stefanini Vicente
    given-names: Gabriel
    orcid: https://orcid.org/0000-0001-6530-3780
keywords:
  - Open Science
repository-code: https://github.com/worldbank/template/tree/main

GitHub Events

Total
  • Watch event: 1
  • Delete event: 1
  • Member event: 1
  • Public event: 1
  • Push event: 7
  • Pull request event: 2
  • Create event: 2
Last Year
  • Watch event: 1
  • Delete event: 1
  • Member event: 1
  • Public event: 1
  • Push event: 7
  • Pull request event: 2
  • Create event: 2

Dependencies

.github/workflows/gh-pages.yml actions
  • actions/checkout v4 composite
  • actions/deploy-pages v4 composite
  • actions/setup-python v5 composite
  • actions/upload-pages-artifact v3 composite
.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • bokeh >=3,<4
  • pandas >=2
  • pycountry >=22.3.5
  • requests >=2.28.1