https://github.com/coqui-ai/open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

Keywords

speech-emotion-recognition speech-processing speech-recognition speech-separation speech-synthesis speech-to-text stt text-to-speech tts voice-activity-detection voice-cloning voice-recognition

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: coqui-ai
License: mit
Default Branch: master
Homepage:
Size: 139 KB

Statistics

Stars: 1,318
Watchers: 56
Forks: 142
Open Issues: 168
Releases: 0

Created over 7 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Code of conduct

💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list has a preference for free (i.e. no $ cost) and truly open corpora (e.g. released under a Creative Commons license or a Community Data License Agreement). Not all these corpora may meet those criteria, but all the following corpora are accessible and usable for research and/or commercial use.
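The Creative Commons labels used in the tables encode their main conditions in the suffixes: BY (attribution), NC (non-commercial), ND (no derivatives), and SA (share-alike). As a rough illustration of the "truly open" criterion above, the sketch below reads those suffixes from a license label. The function name `cc_permissions` and the returned keys are hypothetical, and the result is only a first-pass hint: labels such as "CC-BY-NC 4.0 with some exceptions" still need to be checked against the actual license text.

```python
import re
from typing import Dict, Optional


def cc_permissions(label: str) -> Optional[Dict[str, bool]]:
    """Return rough permissions implied by a Creative Commons label.

    Returns None for labels that do not look like a CC license
    (e.g. "Apache 2.0"), since the suffix logic does not apply to them.
    """
    # Split on anything that is not a letter, digit, or dot,
    # so "CC-BY-NC-ND 4.0" becomes {"CC", "BY", "NC", "ND", "4.0"}.
    tokens = set(re.split(r"[^A-Z0-9.]+", label.upper()))
    if "CC" not in tokens:
        return None
    return {
        "attribution": "BY" in tokens,         # credit must be given
        "commercial_use": "NC" not in tokens,  # NC forbids commercial use
        "derivatives": "ND" not in tokens,     # ND forbids modified versions
        "share_alike": "SA" in tokens,         # derivatives keep the license
    }


for name in ["CC-0", "CC-BY 4.0", "CC-BY-NC-ND 4.0", "Apache 2.0"]:
    print(name, "->", cc_permissions(name))
```

Returning `None` for non-CC labels keeps the helper from guessing about licenses such as GPL, Apache, or the custom ones, which have their own terms.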

Feel free to propose additions to the list!

There's a long backlog of corpora to be added in the Issues, and Pull Requests are very welcome :)

📜 CC-0

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | https://voice.mozilla.org/en/datasets | CC-0 |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
| LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/ | CC-0 |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/ | CC-0 |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/ | CC-0 |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/ | CC-0 |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/ | CC-0 |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/ | CC-0 |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ | CC-0 |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/ | CC-0 |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/ | CC-0 |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/ | CC-0 |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/ | CC-0 |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | https://commons.wikimedia.org/wiki/Category:Odia_pronunciation | mostly(?) CC-0 |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |
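To work with this list programmatically, the tables can be loaded into plain Python records. The sketch below assumes the table text is available as standard pipe-delimited Markdown with one row per line; the helper name `parse_markdown_table` and the two-row `SAMPLE` excerpt are illustrative only, not part of this repository.

```python
from typing import Dict, List


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        # Drop the outer pipes, then split the remaining text into cells.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row (cells made only of dashes/colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Empty cells (e.g. an unknown speaker count) become empty strings.
    return [dict(zip(header, cells)) for cells in body]


SAMPLE = """
| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
| LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
"""

records = parse_markdown_table(SAMPLE)
for record in records:
    print(record["CORPUS"], "->", record["DOWNLOAD"])
```

The size columns mix hours, minutes, and utterance counts, so they are kept as strings here rather than converted to numbers.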

📜 CC-BY

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip | CC-BY 3.0 |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
| VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip | CC-BY 4.0 |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91 | CC-BY 4.0 |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= | CC-BY 4.0 |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz | CC-BY 4.0 |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | https://zenodo.org/record/4110812#.X9j0RmBOkYM | CC-BY 4.0 |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126 | CC-BY 4.0 |

📜 CC-BY-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | https://lindat.cz/repository/xmlui/handle/11234/1-1740 | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by SMC | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | http://www.openslr.org/63/ | CC-BY-SA 4.0 |

📜 CC-BY-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |

📜 CC-BY-NC

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | https://github.com/snakers4/open_stt/#links | CC-BY-NC 4.0 with some exceptions |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | https://github.com/snakers4/open_tts/#links | CC-BY-NC 4.0 with some exceptions |
| OVM – Otázky Václava Moravce | Czech | 35 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3 | CC-BY-NC 3.0 |

📜 CC-BY-NC-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |

📜 CC-BY-NC-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC-BY-NC 4.0 (some audio) / CC-BY-NC-ND 3.0 (most audio) / CC-BY 2.0 (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |
| Czech Parliament Meetings | Czech | 88 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4 | CC-BY-NC-ND 3.0 |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | https://github.com/csikasote/BembaSpeech | CC-BY-NC-ND 4.0 |

📜 CDLA-Permissive

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | https://s3.amazonaws.com/dipco/DiPCo.tgz | CDLA-Permissive-1.0 |

📜 GNU General Public License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
| VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
| VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |

📜 Apache License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |

📜 MIT License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |

📜 BSD 3-Clause License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | http://www.caito.de/data/Training/stt_tts/de_DE.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | http://www.caito.de/data/Training/stt_tts/en_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | http://www.caito.de/data/Training/stt_tts/en_US.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Spanish Corpus | Spanish Spanish | 108 hours and 34 minutes | | http://www.caito.de/data/Training/stt_tts/es_ES.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | http://www.caito.de/data/Training/stt_tts/it_IT.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | http://www.caito.de/data/Training/stt_tts/uk_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | http://www.caito.de/data/Training/stt_tts/ru_RU.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | http://www.caito.de/data/Training/stt_tts/fr_FR.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | http://www.caito.de/data/Training/stt_tts/pl_PL.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |

📜 Custom License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz | Fluent Speech Commands Public License |
| CMU Wilderness | 700 Langs | Alignments distributed without audio or text; total: ~14,000 hours; per lang: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | https://live.bible.is/terms |
| CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access | NASA Media Usage Guidelines |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | Microsoft Speech Corpus (Indian Languages) License |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 | Microsoft Research Data License Agreement |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | https://research.snips.ai/datasets/keyword-spotting | Snips Data License |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | https://research.snips.ai/datasets/spoken-language-understanding | Snips Data License |
| CMU Sphinx Group - AN4 | English | "an4clstk" (~50 minutes); "an4testclstk" (~6 minutes) | "an4clstk": 21 female, 53 male; "an4testclstk": 3 female, 7 male | http://www.speech.cs.cmu.edu/databases/an4/an4raw.bigendian.tar.gz | AN4 |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | https://ftspeech.dk | FT Speech License |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | https://drive.google.com/uc?export=download&confirm=PecV&id=1BVq8MDSE4fBQefVxqCGSl-EcKAcjJLb | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |

Owner

Name: coqui
Login: coqui-ai
Kind: organization
Email: info@coqui.ai

Website: https://coqui.ai
Twitter: coqui_ai
Repositories: 17
Profile: https://github.com/coqui-ai

Coqui, a startup providing open speech tech for everyone 🐸

GitHub Events

Total

Watch event: 78
Fork event: 6

Last Year

Watch event: 78
Fork event: 6

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 109
Total pull requests: 10
Average time to close issues: 12 days
Average time to close pull requests: 2 days
Total issue authors: 9
Total pull request authors: 6
Average comments per issue: 0.47
Average comments per pull request: 0.4
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

JRMeyer (82)
ftyers (3)
lunixbochs (1)
JEMeyer (1)
evanmiltenburg (1)
alpoktem (1)
dabraude (1)
pehonnet (1)
gheyret (1)

Pull Request Authors

ftyers (3)
andrenatal (1)
dabraude (1)
shacharmirkin (1)
JRMeyer (1)
choufractal (1)

Science Score: 23.0%
