https://github.com/audiollms/singlish

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AudioLLMs
  • Default Branch: main
  • Size: 2.93 KB
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Paper 2025, Datasets

Overview

Singlish, a Creole language rooted in English, is prominent in Singapore's multilingual and multicultural landscape. Despite its widespread use, the spoken form of Singlish remains underexplored, limiting insights into its linguistic structure and applications. This project addresses this gap by standardizing and annotating the largest spoken Singlish corpus, drawn from the National Speech Corpus.

The original National Speech Corpus consists of 10k hours of speech and transcriptions. In our first release, we reorganized and released the high-quality portion, extended it into multitask datasets, and added a standard train/test split.

Multitask National Speech Corpus (MNSC)

The MNSC is a comprehensive dataset supporting various tasks:

  • Automatic Speech Recognition (ASR): Transcribing spoken Singlish into text.
  • Spoken Question Answering (SQA): Answering questions based on spoken content.
  • Spoken Dialogue Summarization (SDS): Summarizing spoken dialogues.
  • Paralinguistic Question Answering (PQA): Analyzing paralinguistic features like accent and gender.

The dataset includes standardized splits and a human-verified test set to facilitate research and benchmarking. It is available at MNSC-V1-Huggingface.
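As a rough illustration of how the tasks and splits above might be addressed in code, here is a minimal Python sketch. The `SubsetSpec` class and `load_subset` helper are hypothetical names introduced here, not part of the project. The sketch also assumes, without confirmation from this README, that each task is exposed as a configuration named after its abbreviation and that the splits are called `train` and `test`. The repository id is deliberately left as a parameter: use the one behind the MNSC-V1-Huggingface link, and check the dataset card for the actual subset names.

```python
from dataclasses import dataclass

# Task abbreviations and names, taken from the task list above.
MNSC_TASKS = {
    "ASR": "Automatic Speech Recognition",
    "SQA": "Spoken Question Answering",
    "SDS": "Spoken Dialogue Summarization",
    "PQA": "Paralinguistic Question Answering",
}


@dataclass(frozen=True)
class SubsetSpec:
    """Hypothetical description of one task/split pair."""

    task: str
    split: str  # ASSUMPTION: splits are named "train" and "test"

    def __post_init__(self):
        # Reject anything outside the four documented tasks.
        if self.task not in MNSC_TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if self.split not in ("train", "test"):
            raise ValueError(f"unknown split: {self.split!r}")


def load_subset(repo_id: str, spec: SubsetSpec):
    """Load one subset from the Hugging Face Hub.

    ASSUMPTION: the configuration name equals the task abbreviation.
    The real dataset may organize its subsets differently.
    """
    # Imported lazily so the rest of the sketch runs without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, name=spec.task, split=spec.split)
```

A call would look like `load_subset("<repo-id>", SubsetSpec("ASR", "test"))`, where `<repo-id>` is a placeholder. Validating the task and split up front keeps typos from turning into a less readable download error.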

Other Singlish Corpus

Citation

If you find our resource useful, please consider citing our work:

@article{wang2025advancing,
  title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
  author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
  journal={arXiv preprint arXiv:2501.01034},
  year={2025}
}

Owner

  • Name: AudioLLMs
  • Login: AudioLLMs
  • Kind: organization

GitHub Events

Total
  • Watch event: 2
  • Push event: 5
Last Year
  • Watch event: 2
  • Push event: 5