corpws-cc0

Corpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0 | A corpus of Welsh texts licensed under the CC0 licence

https://github.com/techiaith/corpws-cc0

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
✓ Institutional organization owner: Organization techiaith has institutional domain (techiaith.bangor.ac.uk)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (2.0%) to scientific vocabulary

Keywords

cc0 commonvoice corpus nlp welsh

Repository

Basic Info

Host: GitHub
Owner: techiaith
License: cc0-1.0
Default Branch: main
Homepage:
Size: 2.99 MB

Statistics

Stars: 1
Watchers: 5
Forks: 0
Open Issues: 0
Releases: 5

Created over 5 years ago · Last pushed over 2 years ago

Corpws CC0

Dyma gorpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0. Ar hyn o bryd, mae'r corpws yn cynnwys bron i 20,000 o frawddegau a thros 180,000 o docynnau, a'r bwriad yw parhau i'w gynyddu wrth i ni gael gafael ar destunau o dan y drwydded briodol. Bwriad y corpws hwn yw galluogi hyfforddi modelau iaith Cymraeg ar gyfer sawl diben gwahanol.

Casglwyd y testunau o wahanol ffynonellau gan gynnwys testunau allan o hawlfraint a thestunau a rannwyd â ni o dan drwydded CC0 gan awduron gwreiddiol, er enghraifft erthyglau Wicipedia a negeseuon Twitter a ysgrifennwyd gan yr unigolion hynny. Mae'r testunau hefyd yn cynnwys brawddegau a awdurwyd gan staff y project er mwyn darparu enghreifftiau o nodweddion ieithyddol penodol i'r corpws.

Casglwyd llawer o'r testunau hyn er mwyn eu cyfrannu i Common Voice, project gan gwmni Mozilla sy'n casglu data agored er mwyn creu lleisiau synthetig ar gyfer ieithoedd y byd. Mae'r ffeil hon felly yn cynnwys nifer o'r un brawddegau a geir yn https://github.com/techiaith/brawddegau-adnabod-lleferydd, ond yn ychwanegol at hynny ceir brawddegau eraill oedd yn rhy hir ar gyfer anghenion Common Voice, neu'n cynnwys nodau neu gynnwys arall a oedd yn anaddas ar gyfer y promptiau recordio.

Ychwanegiad Hydref 2021

Rydym hefyd wedi ychwanegu at gynnwys y corpws hwn drwy ddethol is-set o dros 100k o frawddegau Cymraeg o gorpws CoVost Facebook o gyfieithiadau peirianyddol o frawddegau Saesneg Common Voice. Lluniwyd yr is-set hon (a fwriadwyd yn wreiddiol ar gyfer gweithredu fel promptiau recordio) drwy hidlo allan y brawddegau hynny oedd yn hwy na 15 gair, neu'n cynnwys digidau, acronymau neu dalfyriadau, neu a oedd yn cynnwys geiriau nad oeddynt yn Lecsicon Cymraeg Bangor (ag eithrio rhai geirffurfiau penodol). Gweler https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md am ragor o fanylion. Gan nad brawddegau a awdurwyd yn y Gymraeg yn wreiddiol yw'r rhain, rydym wedi eu cadw ar wahân mewn ail ffeil, sef cycovostsubset.txt, fel y gallwch benderfynu eu defnyddio ai peidio yn ddibynnol ar eich angen penodol chi. Er mai brawddegau a gyfieithwyd yn beirianyddol yw'r rhain, adolygwyd sampl ohonynt gan olygyddion dynol a chael bod llai na 5% ohonynt yn broblemus (ffigwr sy'n cymharu'n dda â realiti y testunau Cymraeg gwreiddiol a gawn ar y we). Yn ogystal, teimlwn fod y brawddegau hyn yn ddefnyddiol gan eu bod yn cynnwys detholiad o bynciau ac amserau a phersonau gramadegol sy'n anodd i'w cael fel arall o fewn casgliad o destunau sydd â thrwydded rydd fel CC0 arni. Er na chredwn y byddai testunau cycovostsubset.txt, yn addas ar gyfer dadansoddiadau diwylliannol a ieithyddol gymdeithasol o'r Gymraeg, credwn eu bod yn werthfawr ar gyfer hyfforddi modelau iaith uniaith Cymraeg lle nad oes digon o destunau gwreiddiol Cymraeg ar gael fel arall.

Ychwanegiad Mawrth 2023

Rydym hefyd wedi ychwanegu at gynnwys y corpws hwn drwy normaleiddio detholiad o’r lleferydd a drawsgrifiwyd yn ‘verbatim’ gennym er mwyn ei gyhoeddi o fewn ein banc trawsgrifiadau. At ei gilydd, rydym wedi normaleiddio dros 4000 o’r trawsgrifiadau hynny a'u hychwanegu at y corpws hwn fel ffeil ar wahân. Gweler: https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/banc-trawsgrifiadau-bangor am fwy o fanylion ynghylch ffurf wreiddiol y trawsgrifiadau a’r egwyddorion trawsgrifio y defnyddiwyd, neu i lwytho’r banc cyfan i lawr.

Cyfrannu

Gallwch ein helpu i gynyddu maint y corpws hwn drwy gyfrannu unrhyw destunau o'ch eiddo chi i ni o dan drwydded CC0 fel eu bod ar gael yn rhydd i bawb. Os am wneud hynny, cysylltwch â techiaith@bangor.ac.uk.

CC0 Corpus

This is a corpus of Welsh texts licensed under the CC0 licence. The corpus currently contains nearly 20,000 sentences and over 180,000 tokens, and our aim is to continue to increase its size as and when we are able to secure texts under the appropriate licence. This corpus is intended to enable the training of Welsh language models for a variety of purposes.
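The sentence and token figures above could be checked with something like the following minimal sketch. It assumes, and the README does not confirm this, that a corpus file is UTF-8 plain text with one sentence per line, and it counts whitespace-separated tokens, which may differ from the tokenisation the authors used. The function names are illustrative only.

```python
from pathlib import Path


def corpus_stats(lines):
    """Return (sentence_count, token_count) for an iterable of text lines.

    Assumption: one sentence per line; blank lines are ignored.
    Tokens are whitespace-separated, a simplification of real tokenisation.
    """
    sentences = [line.strip() for line in lines if line.strip()]
    tokens = sum(len(sentence.split()) for sentence in sentences)
    return len(sentences), tokens


def stats_for_file(path):
    """Read a corpus file (hypothetical path supplied by the caller) and count it."""
    with Path(path).open(encoding="utf-8") as handle:
        return corpus_stats(handle)
```

For example, `stats_for_file("cycovostsubset.txt")` would report the counts for the machine-translated subset file named in this README, provided it follows the assumed layout.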

The texts were collected from various sources, including out-of-copyright texts and texts shared with us under the CC0 licence by their original authors, for example Wikipedia articles and Twitter messages written by those individuals. The texts also include sentences authored by project staff to provide the corpus with examples of specific linguistic features.

Many of these texts were collected for contribution to Common Voice, a project by Mozilla that collects open data to create synthetic voices for world languages. This file therefore contains many of the same sentences found at https://github.com/techiaith/brawddegau-adnabod-lleferydd, but it also contains many sentences that were too long for the needs of Common Voice, or that contained characters or other content unsuitable for the recording prompts.

October 2021 Addition

We have added to the content of this corpus by selecting a subset of over 100k Welsh sentences from the CoVost Facebook corpus of machine-translated English Common Voice sentences. This subset (originally intended to serve as recording prompts) was created by filtering out sentences that exceeded 15 words, contained digits, acronyms or abbreviations, or contained words not found in the Bangor Welsh Lexicon (with the exception of some specific word forms). See https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md for more details. As these sentences were not originally written in Welsh, we have kept them separate in a second file, cycovostsubset.txt, so that you can decide whether or not to use them depending on your specific aims. Although these are machine-translated sentences, a sample of them was reviewed by human editors, who found that fewer than 5% were problematic (a figure that compares well with the original Welsh texts found on the web). We also consider these sentences useful because they contain a selection of topics and grammatical tenses and persons that are otherwise difficult to find within freely licensed texts. As a result, whilst we do not recommend using the cycovostsubset.txt texts for cultural or sociolinguistic analysis of the Welsh language, we believe that they are valuable for training monolingual Welsh language models where there would otherwise be insufficient original Welsh texts available.
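The filtering criteria described above for the October 2021 subset can be sketched roughly as follows. This is not the project's actual script (see the linked CoVost README for that); it is a minimal illustration under stated assumptions: the lexicon is passed in as a plain set of lower-case word forms, acronyms are detected as fully upper-case tokens, abbreviations as dotted forms such as "e.e.", and the exceptions the authors made for specific word forms are not modelled. All names here are hypothetical.

```python
import re

MAX_WORDS = 15  # from the README: sentences longer than 15 words were removed

# Words are runs of letters, optionally joined by an apostrophe or hyphen ("Mae'r").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")

# Heuristic (assumption): dotted abbreviations such as "e.e." or "h.y."
DOTTED_ABBREV_RE = re.compile(r"\b(?:[^\W\d_]\.){2,}")


def keep_sentence(sentence, lexicon, max_words=MAX_WORDS):
    """Return True if the sentence passes all four filters.

    `lexicon` is a set of lower-case word forms standing in for the
    Bangor Welsh Lexicon (hypothetical representation).
    """
    words = WORD_RE.findall(sentence)
    if not words or len(words) > max_words:
        return False  # empty or too long
    if any(ch.isdigit() for ch in sentence):
        return False  # contains digits
    if any(len(word) > 1 and word.isupper() for word in words):
        return False  # looks like an acronym, e.g. "BBC"
    if DOTTED_ABBREV_RE.search(sentence):
        return False  # looks like a dotted abbreviation
    return all(word.lower() in lexicon for word in words)  # lexicon check


def filter_sentences(sentences, lexicon):
    """Keep only the sentences that pass `keep_sentence`."""
    return [s for s in sentences if keep_sentence(s, lexicon)]
```

With a toy lexicon such as `{"mae'r", "gath", "yn", "cysgu"}`, the sentence "Mae'r gath yn cysgu." is kept, while variants containing a digit, an acronym, or an out-of-lexicon word are dropped.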

March 2023 Addition

We have also added to the content of this corpus by normalizing a selection of the speech we transcribed in a 'verbatim' style for publication in our transcription bank. In total, we have normalized over 4000 of those transcriptions and added them to this corpus as a separate file. See https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/banc-trawsgrifiadau-bangor for more information about the original format of the transcriptions and the transcription conventions used, or to download the transcription bank in its entirety.

Contributing

You can help us increase the size of this corpus by contributing any texts that you own to us under the CC0 licence so that they are freely available to everyone. To do so, please contact techiaith@bangor.ac.uk.

Owner

Name: Uned Technolegau Iaith / Language Technologies Unit
Login: techiaith
Kind: organization
Location: Prifysgol Bangor University

Website: http://techiaith.bangor.ac.uk
Twitter: techiaith
Repositories: 82
Profile: https://github.com/techiaith

Uned ymchwil hunan-gynhaliol sy’n datblygu technolegau ar gyfer y Gymraeg / A self-funded research unit that develops technologies for the Welsh language

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Corpws CC0 Corpus
message: >-
  Er ein bod yn rhyddhau'r data hwn o dan drwydded
  CC0, gofynwn yn garedig i chi ystyried rhoi
  cydnabyddiaeth i'r gwaith hwn. | While we are
  releasing this data under the CC0 licence, should
  you use this resource, we kindly ask you to
  consider acknowledging our work.
type: dataset
authors:
  - given-names: Delyth
    family-names: Prys
    email: d.prys@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
    orcid: 'https://orcid.org/0000-0002-4909-6926'
  - given-names: Gruffudd
    family-names: Prys
    email: g.prys@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
    orcid: 'https://orcid.org/0000-0002-2910-2460'
  - given-names: Dewi Bryn
    orcid: 'https://orcid.org/0000-0003-1263-6332'
    family-names: Jones
    email: d.b.jones@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
  - given-names: Gareth Llewellyn
    orcid: 'https://orcid.org/0000-0001-8929-0718'
    family-names: Watkins
    email: g.watkins@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
notes: "Delyth Prys - Cyfraniad ieithyddol/Linguistic contribution, Gareth Watkins - Cyfraniad ieithyddol/Linguistic contribution, Gruffudd Prys - Cyfraniad ieithyddol a thechnegol/Linguistic and technical contribution, Dewi Bryn Jones  - Cyfraniad technegol/Technical contribution"
url: 'https://github.com/techiaith/corpws-CC0'
abstract: >-
  Corpws o frawddegau o destun Cymraeg wedi'u
  trwyddedu o dan drwydded CC0 | A corpus of Welsh
  texts licensed under the CC0 licence
keywords:
  - Welsh
  - Corpus
  - CC0
license: CC0-1.0
version: '21.10'
date-released: '2021-10-28'
