corpws-meincnodi-rhannau-ymadrodd
A corpus for benchmarking Welsh part-of-speech taggers
https://github.com/techiaith/corpws-meincnodi-rhannau-ymadrodd
Statistics
- Stars: 0
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Welsh Part-of-Speech Tagger Benchmarking Corpus
Benchmark Corpus
This is a corpus of approximately 25,000 words, in the form of 1,500 sentences drawn from a variety of sources, intended as a good, balanced test of how accurately Welsh part-of-speech (POS) taggers tag Welsh-language text.
Contents
The Benchmark Corpus is designed to include a broad representation of different types of contemporary Welsh in modern orthography. To reward the ability to generalize to less standard forms and orthographical conventions, the corpus contains a range of different texts. However, the main focus during corpus construction was Welsh-language text as it is produced today.
Efforts were made to ensure sentences included in the corpus were varied in terms of register, style, dialect and subject matter, and reference was made to the frameworks used by CEG and CorCenCC in doing so. As well as ensuring variety in the types of source texts, we also sought to ensure that there was a variety in respect of tense and person within sentence structures.
Tagging and Evaluation
Our intention over the coming months, at the request of the Welsh Government, is to tag these sentences in a way that will allow the evaluation and comparison of different Welsh POS taggers. The true limit on the size of the full Benchmark Corpus will be the number of words from the current corpus that can be correctly tagged during this time. As the different POS taggers use different tagsets, we will develop a general, intermediate tagset to facilitate that comparison, as well as a framework to enable comparison and evaluation.
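As a rough illustration of why an intermediate tagset helps, the sketch below maps the output of two taggers onto a shared set of tags and then scores each against the same gold annotation. Everything in it is an assumption made for illustration: the tag names, the two mapping tables, and the helper functions `to_common` and `accuracy` are hypothetical and are not the tagset or framework the project will publish.

```python
# Hypothetical mappings from two tagger-specific tagsets to an
# invented intermediate tagset. None of these tags come from the project.
TAGGER_A_TO_COMMON = {"NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "JJ": "ADJ", "IN": "ADP"}
TAGGER_B_TO_COMMON = {"E": "NOUN", "B": "VERB", "Ans": "ADJ", "Ar": "ADP"}


def to_common(tags, mapping, unknown="X"):
    """Map tagger-specific tags to the intermediate tagset.

    Tags missing from the mapping become `unknown`, so they count as errors
    rather than being silently dropped.
    """
    return [mapping.get(tag, unknown) for tag in tags]


def accuracy(predicted, gold):
    """Token-level accuracy; both sequences must align one-to-one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Gold tags for one four-token sentence, already in the intermediate tagset.
gold = ["NOUN", "VERB", "ADJ", "ADP"]

# Raw output from each (hypothetical) tagger in its own tagset.
tagger_a_output = ["NN", "VB", "JJ", "IN"]
tagger_b_output = ["E", "B", "E", "Ar"]  # third token mis-tagged as a noun

score_a = accuracy(to_common(tagger_a_output, TAGGER_A_TO_COMMON), gold)
score_b = accuracy(to_common(tagger_b_output, TAGGER_B_TO_COMMON), gold)
print(f"Tagger A: {score_a:.2f}, Tagger B: {score_b:.2f}")
```

Because both outputs are projected onto the same tags before scoring, the two accuracy figures are directly comparable even though the taggers use different native tagsets. A real mapping would also have to decide how to handle distinctions that one tagset makes and the other does not.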
Licence
Unlike our training data, which is licensed under the CC0 licence, this corpus is licensed under the more restrictive CC-BY-SA licence. This is because using CC-BY-SA allows us to include examples from important sources such as Wikipedia, materials from Coleg Cymraeg Cenedlaethol Cymru, and the Siarad and CorCenCC corpora. This benchmark corpus can be freely distributed as long as the attribution and sharealike requirements of the CC-BY-SA licence are respected.
Acknowledging our work
If you use this resource, we kindly ask you to acknowledge and reference our work. Doing so helps us secure future funding to create more useful resources to share.
Acknowledgements
Texts from the following sources were used in this resource:
Corpws Siarad 2014, Corpws:Siarad, Deuchar, M., Davies, P. & Donnelly, K., Accessed on 03/12/2020 http://bangortalk.org.uk/speakers.php?c=siarad
Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]
Gwales.com 2020, Llyfrau, gwales.com, Accessed on 03/12/2020 http://www.gwales.com/books/?tsid=15
Hwb Cymru 2020, Dysgu ac addysgu i Gymru, Hwb Cymru, Accessed on 03/12/2020 https://hwb.gov.wales/
James, E. W. 2018, Williams, William (Pantycelyn), James, E. W., Accessed on 03/12/2020 http://orca.cf.ac.uk/128971/1/Williams%2C%20William%20%28Pantycelyn%29.pdf
Knight D, Morris S, Fitzpatrick T, et al. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh (Version 1.0.0). Cardiff University. http://doi.org/10.17035/d.2020.0119878310
Meddwl.org 2020, Hafan, meddwl.org, Accessed on 03/12/2020 https://meddwl.org/
Porth Coleg Cymraeg Cenedlaethol 2020, Hafan, Coleg Cymraeg Cenedlaethol, Accessed on 03/12/2020 https://wici.porth.ac.uk/index.php/Hafan
Raspberrypi.org 2020, Hello, Raspberrypi.org, Accessed on 03/12/2020 https://projects.raspberrypi.org/
Wici Pobol y Cwm 2020, Home, Wici Pobol y Cwm, Accessed on 03/12/2020 https://pobol-y-cwm.fandom.com/cy/wiki/Main_Page
Wiki Y Cyfryngau Cymraeg 2020, Home, Wiki Y Cyfryngau Cymraeg, Accessed on 03/12/2020 https://y-cyfryngau-cymraeg.fandom.com/cy/wiki/Main_Page
Wicipedia 2020, Croeso i Wicipedia, Sefydliad Wikimedia, Accessed on 03/12/2020 https://cy.wikipedia.org/wiki/Hafan
Ymddiriolaeth Adeiladu Cymru 2019, Rhestr gyfeirio cynllunio digwyddiadau, Ymddiriolaeth Adeiladu Cymru, Accessed on 03/12/2020 http://www.yac.cymru/uploads/resources/2019-03-13-24-2-bct-event-planning-checklist-c.pdf
Owner
- Name: Uned Technolegau Iaith / Language Technologies Unit
- Login: techiaith
- Kind: organization
- Location: Prifysgol Bangor University
- Website: http://techiaith.bangor.ac.uk
- Twitter: techiaith
- Repositories: 82
- Profile: https://github.com/techiaith
A self-funded research unit that develops technologies for the Welsh language
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Corpws Meincnodi Tagwyr Rhan Ymadrodd/Part of Speech Tagger Benchmarking Corpus
message: >-
  If you use this dataset, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Gruffudd
    family-names: Prys
    email: g.prys@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
    orcid: 'https://orcid.org/0000-0002-2910-2460'
  - given-names: Gareth Llewellyn
    orcid: 'https://orcid.org/0000-0001-8929-0718'
    family-names: Watkins
    email: g.watkins@bangor.ac.uk
    affiliation: Bangor University Language Technologies Unit
notes: "Gareth Watkins - Cyfraniad ieithyddol/Linguistic contribution, Gruffudd Prys - Cyfraniad ieithyddol a thechnegol/Linguistic and technical contribution"
url: 'https://github.com/techiaith/corpws-meincnodi-rhannau-ymadrodd'
abstract: >-
  Corpws ar gyfer meincnodi tagwyr rhannau ymadrodd Cymraeg | A corpus for benchmarking Welsh part-of-speech taggers
keywords:
  - Welsh
  - Corpus
  - Benchmark
  - Part of Speech
  - Tagger
  - Evaluation
license: CC-BY-SA-4.0
license-url: 'https://creativecommons.org/licenses/by-sa/4.0/legalcode'
version: '21.01'
date-released: '2021-01-29'