https://github.com/ai4bharat/qe-pe-mteval

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (8.8%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AI4Bharat
Default Branch: master
Size: 14.7 MB

Statistics

Stars: 0
Watchers: 5
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

Quality Estimation and Post-Editing Using LLMs For Indic Languages: How Good Is It?

This repository explores the use of Large Language Models (LLMs) such as GPT-4 and Gemma-2 for machine translation evaluation, focusing on quality estimation (QE) and post-editing (PE) in low-resource Indic languages. It includes fine-tuning setups, synthetic data generation, and performance benchmarks for both reference-based and reference-free scenarios.

Synthetic Data Generation

We generate synthetic error explanations and post-edits using GPT-4, prompted with expert-annotated in-context examples. Our 3-shot prompting strategy significantly improves generation quality over zero-shot methods, enabling the fine-tuning of open-source LLMs for both reference-based and reference-free machine translation evaluation.
The overall generation pipeline is illustrated in the figure below.

Pipeline for Synthetic Explanation and Post-Editing Generation
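
To make the 3-shot prompting idea concrete, here is a minimal sketch of how a prompt with three expert-annotated in-context examples might be assembled before it is sent to GPT-4. The instruction wording, the field names (`source`, `translation`, `error_spans`, `explanation`, `post_edit`), and the helper `build_few_shot_prompt` are all illustrative assumptions and are not taken from this repository; the actual prompts may look quite different.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One annotated example (hypothetical schema, not the repository's)."""
    source: str
    translation: str
    error_spans: str
    explanation: str
    post_edit: str


# Assumed instruction text; the real prompt wording is not given in the README.
INSTRUCTION = (
    "You are evaluating a machine translation. For the given source and "
    "translation, list the error spans, explain each error, and give a post-edit."
)


def format_example(ex: Example, with_answer: bool = True) -> str:
    """Render one example; omit the answer fields for the query item."""
    lines = [f"Source: {ex.source}", f"Translation: {ex.translation}"]
    if with_answer:
        lines += [
            f"Error spans: {ex.error_spans}",
            f"Explanation: {ex.explanation}",
            f"Post-edit: {ex.post_edit}",
        ]
    else:
        # Leave the first answer field open for the model to complete.
        lines.append("Error spans:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[Example], source: str, translation: str) -> str:
    """Build a 3-shot prompt: instruction, three solved examples, one open query."""
    if len(shots) != 3:
        raise ValueError("this sketch expects exactly three in-context examples")
    query = Example(source, translation, "", "", "")
    blocks = [INSTRUCTION]
    blocks += [format_example(s) for s in shots]
    blocks.append(format_example(query, with_answer=False))
    return "\n\n".join(blocks)
```

The returned string would be passed to whichever LLM client is in use; that call is deliberately left out because the README does not describe it.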

Models

We fine-tune different variants of Gemma-9B on a range of tasks by modifying the inputs and outputs. These include generating error spans, error explanations, and post-edits, both with and without references. You can find the training pairs here.

Fine-tuning Tasks

| Model Name | Inputs Provided | Outputs Expected |
|---------------------|--------------------------------------------------|----------------------------------------------|
| Reference-Based | | |
| ErrSp | Source, Translation, Reference | Error Spans |
| ErrSp–Exp | Source, Translation, Reference | Error Spans + Explanations |
| ErrSp–ip–Exp | Source, Translation, Reference, Error Spans | Explanations |
| Reference-Free | | |
| ErrSp | Source, Translation | Error Spans |
| ErrSp–Exp | Source, Translation | Error Spans + Explanations |
| ErrSp–Exp–PE | Source, Translation | Error Spans + Explanations + Post-Edits |
| ErrSp–ip–Exp | Source, Translation, Error Spans | Explanations |
| ErrSp–ip–Exp–PE | Source, Translation, Error Spans | Explanations + Post-Edits |
| ErrSp–ip–PE | Source, Translation, Error Spans | Post-Edits |
| ErrSp–PE | Source, Translation | Error Spans + Post-Edits |
| PE | Source, Translation | Post-Edits |
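
The task table can be read as a mapping from each model variant to the fields placed in the prompt and the fields the model is trained to produce. The sketch below encodes that mapping and builds one (prompt, target) training pair from a record. The record keys, the `Label: value` text layout, and the function `build_pair` are assumptions made for illustration; the repository's actual training-pair format may differ.

```python
from typing import Dict, List, Tuple

# Display labels for the (hypothetical) record keys used in this sketch.
LABELS = {
    "source": "Source",
    "translation": "Translation",
    "reference": "Reference",
    "error_spans": "Error Spans",
    "explanations": "Explanations",
    "post_edits": "Post-Edits",
}

REF = ["source", "translation", "reference"]
FREE = ["source", "translation"]

# (setting, model name) -> (input fields, output fields), following the table above.
TASKS: Dict[Tuple[str, str], Tuple[List[str], List[str]]] = {
    ("reference-based", "ErrSp"): (REF, ["error_spans"]),
    ("reference-based", "ErrSp–Exp"): (REF, ["error_spans", "explanations"]),
    ("reference-based", "ErrSp–ip–Exp"): (REF + ["error_spans"], ["explanations"]),
    ("reference-free", "ErrSp"): (FREE, ["error_spans"]),
    ("reference-free", "ErrSp–Exp"): (FREE, ["error_spans", "explanations"]),
    ("reference-free", "ErrSp–Exp–PE"): (FREE, ["error_spans", "explanations", "post_edits"]),
    ("reference-free", "ErrSp–ip–Exp"): (FREE + ["error_spans"], ["explanations"]),
    ("reference-free", "ErrSp–ip–Exp–PE"): (FREE + ["error_spans"], ["explanations", "post_edits"]),
    ("reference-free", "ErrSp–ip–PE"): (FREE + ["error_spans"], ["post_edits"]),
    ("reference-free", "ErrSp–PE"): (FREE, ["error_spans", "post_edits"]),
    ("reference-free", "PE"): (FREE, ["post_edits"]),
}


def render(fields: List[str], record: Dict[str, str]) -> str:
    """Join the selected fields as 'Label: value' lines."""
    return "\n".join(f"{LABELS[name]}: {record[name]}" for name in fields)


def build_pair(setting: str, task: str, record: Dict[str, str]) -> Tuple[str, str]:
    """Return (prompt, target) text for one record under the given task."""
    inputs, outputs = TASKS[(setting, task)]
    return render(inputs, record), render(outputs, record)
```

For instance, `build_pair("reference-free", "ErrSp–ip–PE", record)` would place the source, translation, and error spans in the prompt and only the post-edit in the target, which is what distinguishes the `ip` (span as input) variants from the others.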

Task Overview

Task Flow of Reference-Based and Reference-Free Settings

This diagram highlights the input-output configurations for different fine-tuning tasks under both reference-based and reference-free settings using Gemma-9B.

Owner

Name: AI4Bhārat
Login: AI4Bharat
Kind: organization
Email: opensource@ai4bharat.org
Location: India

Website: https://ai4bharat.org
Twitter: AI4Bharat
Repositories: 37
Profile: https://github.com/AI4Bharat

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!
