https://github.com/ai4bharat/nirantar

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AI4Bharat
  • Language: Python
  • Default Branch: master
  • Size: 266 KB
Statistics
  • Stars: 6
  • Watchers: 6
  • Forks: 0
  • Open Issues: 1
  • Releases: 1
Created almost 2 years ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

Nirantar

We present Nirantar, a dataset built from a large-scale effort to collect extempore and conversational speech from participants spanning 22 languages across diverse locations in India. Given the number of languages and locations involved, data is collected in incremental batches. Each batch introduces new languages, new domains (locations), or both, creating a practical playground for continual learning (CL). Nirantar contains a total of 3240 hours of human-transcribed speech data covering 208 Indian districts across 22 languages, with 1780 hours newly released as part of this work. The data inflow and the resulting multilingual, multi-domain episodes come from real-world data collection rather than the simulated episodes commonly found in existing CL datasets. In particular, the amount of data collected and the number of languages and domains involved are not uniform across episodes, reflecting a practical, real-world continual learning scenario. The dataset supports training and evaluating CL approaches in three scenarios: Language-Incremental Learning (LIL), Domain-Incremental Learning (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL). To establish the dataset's usefulness, we evaluated several existing CL approaches within these scenarios. Our findings indicate that the behavior of these algorithms varies across the three scenarios, emphasizing the need for detailed, independent studies of each.

Resources

Audio Data

| Part 1 | Part 2 | Part 3 | Part 4 | Part 5 |
|-|-|-|-|-|

To extract the audio, concatenate the downloaded parts and unpack the archive:

cat AUDIOS.tgz.* > AUDIOS.tgz
tar -xzvf AUDIOS.tgz
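
Where a shell is not available, the same two steps can be sketched in Python. This is a minimal illustration, not part of the repository: the function name `join_and_extract` and its parameters are hypothetical, and it assumes the parts sort correctly by filename, which is how the shell glob above orders them.

```python
import glob
import shutil
import tarfile


def join_and_extract(part_glob="AUDIOS.tgz.*", archive="AUDIOS.tgz", dest="."):
    """Concatenate split archive parts in sorted order, then extract the result."""
    parts = sorted(glob.glob(part_glob))
    if not parts:
        raise FileNotFoundError(f"no files match {part_glob!r}")
    # Step 1: equivalent of `cat AUDIOS.tgz.* > AUDIOS.tgz`
    with open(archive, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    # Step 2: equivalent of `tar -xzvf AUDIOS.tgz`
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return parts
```

Calling `join_and_extract()` from the download directory mirrors the two commands above; the returned list shows which parts were joined, which helps spot a missing download.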

Episodic Manifests:

These episodic manifests hold the labels (i.e., transcripts) for three scenarios: Language Incremental Learning (LIL), Domain Incremental Learning (DIL), and Language and Domain Incremental Learning (LIDIL). In addition to the transcript, each manifest entry carries metadata such as speaker_id, gender, age group, state, and district.

| Scenario | LIL | DIL | LIDIL |
|-|-|-|-|


Folder structure:

ROOT
├── DIL
│   ├── episode0
│   │   ├── episode0_train.json
│   │   ├── episode0_valid.json
│   │   ├── episode0_replay.json
│   ├── episode1
│   ├── ...
│   └── episode11
├── LIDIL
│   ├── episode0
│   │   ├── episode0_train.json
│   │   ├── episode0_valid.json
│   │   ├── episode0_replay.json
│   ├── episode1
│   ├── ...
│   └── episode9
└── LIL
    ├── episode0
    │   ├── episode0_train.json
    │   ├── episode0_valid.json
    │   ├── episode0_replay.json
    ├── episode1
    ├── ...
    └── episode9
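
The layout above can be walked programmatically. The following is a small sketch, assuming only the folder and file naming shown in the tree; the helper name `episode_manifests` is hypothetical. Episodes are sorted numerically so that `episode10` follows `episode9` rather than `episode1`.

```python
from pathlib import Path

SPLITS = ("train", "valid", "replay")


def episode_manifests(root, scenario):
    """Return {episode_index: {split: path}} for one scenario folder (LIL, DIL, LIDIL)."""
    found = {}
    for ep_dir in Path(root, scenario).glob("episode*"):
        suffix = ep_dir.name[len("episode"):]
        if not (ep_dir.is_dir() and suffix.isdigit()):
            continue
        files = {}
        for split in SPLITS:
            path = ep_dir / f"{ep_dir.name}_{split}.json"
            if path.exists():
                files[split] = path
        found[int(suffix)] = files
    # Sort by the numeric episode index, not by folder name.
    return dict(sorted(found.items()))
```

For example, `episode_manifests("ROOT", "DIL")[3]["train"]` would point at `ROOT/DIL/episode3/episode3_train.json` if that file exists.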

Evaluation Splits

The evaluation data is not split into episodes; it evolves continually. Episodic evaluations can be run using the language, state, and district information shared in the Metadata section.

Note: We are still looking for ways to effectively release benchmark data to avoid data contamination.


Folder structure

ROOT
├── audios
│   ├── audio1.wav
│   ├── audio2.wav
│   ├── ...
│   └── audioX.wav
└── test.json

Metadata

The following CSVs map each episode number to the language, state, and district of an audio file. Use this metadata for fine-grained evaluation on the test split.

| Metadata | LIL.csv | DIL.csv | LIDIL.csv |
|-|-|-|-|
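
One way to use this metadata is to select, for a given episode, the test entries whose language, state, and district appear in the CSV. The sketch below is an illustration under stated assumptions: the CSV column names `episode`, `language`, `state`, and `district` are hypothetical (check the actual header), the `language` column is assumed to hold the same codes as the manifest's `lang` field, and `entries` is assumed to be a list of dicts already loaded from `test.json`.

```python
import csv
from collections import defaultdict


def load_episode_keys(csv_path):
    """Map episode number -> set of (language, state, district) tuples.

    Column names are assumptions; adjust them to the real CSV header.
    """
    keys = defaultdict(set)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            key = (row["language"], row["state"], row["district"])
            keys[int(row["episode"])].add(key)
    return dict(keys)


def select_for_episode(entries, keys, episode):
    """Keep test entries whose (lang, state, district) belongs to the episode."""
    wanted = keys.get(episode, set())
    return [e for e in entries if (e["lang"], e["state"], e["district"]) in wanted]
```

For LIL only the language component distinguishes episodes, and for DIL only the location, so the key tuple may be narrowed accordingly.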


Manifest format (train/test splits)

{
    "audio_filepath": "<AUDIOS/audios>/2533274790514854_chunk_4.wav",                    # Points to the wav file
    "text": "<TRANSCRIPT>",                   # Transcript for audio, we use Normalized version of the transcript
    "duration": <DURATION>,                                                          #  Audio duration in seconds
    "lang": "<LANG_CODE(ISO)>",                                      # ISO code for language (given in meta data)
    "samples": <NUMBER_OF_SAMPLES>,                                                           # Number of samples
    "verbatim": "<VERBATIM VERSION OF TRANSCRIPT>",                          # Verbatim version of the transcript
    "normalized": "<NORMALIZE>",                                           # Normalized version of the transcript
    "speaker_id": "S4258780200341914",                                                        # Unique speaker ID
    "scenario": "Extempore",                                                                       # Type of data
    "task_name": "KYP - Traveling",                                                                   # Task name
    "gender": "Male",                                                                     # Gender of the speaker
    "age_group": "18-30",                                                              # Age group of the speaker
    "job_type": "Student",                                                              # Job type of the speaker
    "qualification": "Undergrad and Grad.",                                        # Qualification of the speaker
    "area": "Rural",                                                        # Area from which the speaker belongs
    "district": "Barpeta",                                              # District from which the speaker belongs
    "state": "Assam",                                                      # State from which the speaker belongs
    "occupation": "Private tutor",                                                         # Speaker's occupation
    "verification_report": "{}",                                   # Verification markers as given by the QA team
    "chunk_name": "2533274790514854_chunk_4.wav"                                              # Audio chunk name
}
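
To make the field layout concrete, here is a minimal sketch that reads a manifest and totals audio hours per language from the `duration` and `lang` fields. It assumes the file stores one JSON object per line, as NeMo manifests usually do; this README does not state the on-disk layout, so verify it first. The function names are hypothetical.

```python
import json
from collections import defaultdict


def read_manifest(path):
    """Yield one dict per non-empty line (JSON Lines layout is an assumption)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def hours_by_language(entries):
    """Sum the `duration` field (seconds) per `lang` and convert to hours."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry["lang"]] += entry["duration"]
    return {lang: seconds / 3600.0 for lang, seconds in totals.items()}
```

A call such as `hours_by_language(read_manifest("episode0_train.json"))` gives a quick check of how unevenly data is spread across languages within an episode.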

Training and Inference

  • Install NeMo
  • Clone this repository
  • For Training

    ```
    # -------------------- DIL Full Finetune Example --------------------
    type=DIL
    config=dilep3fullfinetune   # config name
    name=${config}              # name of the experiment
    checkpointpath=
    resume=false                # helpful for resuming training
    bash training/scripts/trainhms.sh ${type} ${config} ${name} ${checkpointpath} $resume

    bash avgcheckpoint.sh TIDIL dilep3fullfinetune   # averaging the top 5 checkpoints
    ```

  • For Inference

    OMP_NUM_THREADS=64 python ${RUNNER_PATH}/transcribe_speech.py \
        model_path=$model_path \
        dataset_manifest=<PLACEHOLDER>/$language.json \
        output_filename=$SAVE_PATH \
        langid=$LANGUAGE_ID \
        batch_size=$BATCH_SIZE \
        compute_timestamps=False \
        compute_langs=False \
        cuda=$CUDA_DEVICE_ID \
        amp=True \
        append_pred=False
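
Once inference has written its output file, the predictions can be scored against the reference transcripts. The sketch below computes a corpus-level word error rate with a plain edit distance. It is an illustration, not the repository's evaluation code: it assumes the output file has one JSON object per line, with the reference in `text` and the hypothesis in `pred_text`, so check the keys in your own output before relying on it.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Total word errors divided by total reference words over all pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return errors / words if words else 0.0


def load_pairs(path, ref_key="text", hyp_key="pred_text"):
    """Read (reference, hypothesis) pairs; the key names are assumptions."""
    with open(path, encoding="utf-8") as fh:
        return [(e[ref_key], e[hyp_key])
                for e in map(json.loads, filter(str.strip, fh))]
```

Pooling errors over the whole file, rather than averaging per-utterance rates, keeps short utterances from dominating the score.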

Contact

  • Tahir Javed (tahir@cse.iitm.ac.in)
  • Kaushal Bhogale (CS22D006@cse.iitm.ac.in)

Owner

  • Name: AI4Bhārat
  • Login: AI4Bharat
  • Kind: organization
  • Email: opensource@ai4bharat.org
  • Location: India

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total
  • Watch event: 4
  • Push event: 2
Last Year
  • Watch event: 4
  • Push event: 2

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • harish2704 (1)