https://github.com/centrefordigitalhumanities/auchann

Generates CHAT annotations from transcript-correction utterance pairs

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary

Last synced: 7 months ago

Repository


Basic Info

Host: GitHub
Owner: CentreForDigitalHumanities
License: bsd-3-clause
Language: Python
Default Branch: develop
Homepage:
Size: 87.9 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 9
Releases: 0

Created about 4 years ago · Last pushed 9 months ago

Metadata Files

Readme License

AuChAnn


AuChAnn is a Python package that provides Automatic CHAT Annotation based on a transcript string and an interpretation (or 'corrected') string. For example, when given:

Transcript: 'Ik wilt nu eh na huis'
Correction: 'Ik wil nu naar huis'

AuChAnn produces:

CHAT-Annotation: 'ik wilt [: wil] nu &-eh na(ar) [* s:r:prep] huis'

CHAT is an annotation convention that was developed for the CHILDES corpus (MacWhinney, 2000) and is used by many linguists to annotate speech. For more information on CHAT, see the manual: https://talkbank.org/manuals/CHAT.html.

AuChAnn was developed specifically to enrich linguistic data consisting of a transcript and a linguist's interpretation, for use with SASTA (https://github.com/CentreForDigitalHumanities/sasta).

Getting Started

You can install AuChAnn using pip:

    pip install auchann

You can also optionally install Sastadev which is used for detecting inflection errors.

    pip install auchann[NL]

When installed, the program can be run interactively from the console using the command auchann.

Import as Library

To use AuChAnn in your own Python applications, import the align_words function from the auchann.align_words module, as shown below. This is the main functionality of the package.

```python
from auchann.align_words import align_words

# Read a transcript and its correction, then print the CHAT annotation
transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction)
print(alignment)
```
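For non-interactive use, the same call can be applied to a list of utterance pairs. The sketch below is only an illustration: `annotate_pairs` is a hypothetical helper, not part of AuChAnn, and it assumes that the value returned by the aligner can be converted to the annotation text with `str()`, as the `print` call above suggests.

```python
from typing import Callable, Iterable, List, Tuple


def annotate_pairs(
    pairs: Iterable[Tuple[str, str]],
    align: Callable[[str, str], object],
) -> List[str]:
    """Apply `align` to each (transcript, correction) pair.

    `align` is expected to behave like AuChAnn's align_words; its result
    is converted with str() (an assumption based on the print() usage).
    """
    return [str(align(transcript, correction)) for transcript, correction in pairs]


# With AuChAnn installed, the aligner would be passed in like this:
#   from auchann.align_words import align_words
#   annotate_pairs([("Ik wilt nu eh na huis", "Ik wil nu naar huis")], align_words)
```

Passing the aligner as an argument keeps the helper testable without AuChAnn installed.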

Settings

Various settings can be adjusted. Default values are used for every unchanged property.

```python
from auchann.align_words import align_words, AlignmentSettings
import editdistance

settings = AlignmentSettings()

# Return the edit distance between the original and correction
settings.calc_distance = lambda original, correction: editdistance.distance(original, correction)

# Return an override of the distance and the error type;
# if error type is None the distance returned will be ignored
# Default method detects inflection errors
settings.detect_error = lambda original, correction: (1, "m") if original == "geloopt" and correction == "liep" else (0, None)

# Sastadev contains a helper function for Dutch which detects inflection errors
from sastadev.deregularise import detect_error
settings.detect_error = detect_error

# How many words could be split from one?
# e.g. das -> da(t) (i)s requires a lookahead of 2
# hoest -> hoe (i)s (he)t requires a lookahead of 3
settings.lookahead = 5

# Allow detection of replacements within a group
# e.g. swapping articles; this will then be marked with
# the specified key
# EXAMPLE:
# Transcript: de huis
# Correction: het huis
# de [: het] [* s:r:gc:art] huis
settings.replacements = {
    's:r:gc:art': ['de', 'het', 'een'],
    's:r:gc:pro': ['dit', 'dat', 'deze'],
    's:r:prep': ['aan', 'uit']
}

# Other lists to adjust
settings.fillers = ['eh', 'hm', 'uh']
settings.fragments = ['ba', 'to', 'mu']

# Example usage
transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction, settings)
print(alignment)
```
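When the one-line lambda for detect_error becomes hard to read, a named function with the same contract may be clearer. The sketch below assumes only what the comments above state: the callable receives the original and the correction and returns a (distance, error type) pair, with None meaning "no override". The lookup table and the function name are hypothetical, and the second table entry is an illustrative overregularisation that does not come from AuChAnn.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: erroneous form -> intended form (illustrative entries only)
KNOWN_INFLECTION_ERRORS: Dict[str, str] = {
    "geloopt": "liep",     # pair taken from the settings example above
    "gekoopt": "gekocht",  # illustrative addition, not from AuChAnn
}


def detect_known_error(original: str, correction: str) -> Tuple[int, Optional[str]]:
    """Return (1, "m") for a listed inflection error, otherwise (0, None)."""
    if KNOWN_INFLECTION_ERRORS.get(original.lower()) == correction.lower():
        return (1, "m")
    # An error type of None signals that the returned distance should be ignored
    return (0, None)


# It would be plugged in the same way as the lambda:
#   settings.detect_error = detect_known_error
```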

How it Works

The align_words function scans the transcript and correction and determines for each token whether a correction token is copied exactly from the transcript, replaces a token from the transcript, is inserted, or whether a transcript token has been omitted. Based on which of these operations has occurred, the function adds the appropriate CHAT annotation to the output string.
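To make the four operations concrete, the sketch below labels tokens using the standard library's `difflib.SequenceMatcher`. This is a stand-in technique chosen for brevity, not AuChAnn's own alignment code, and the function name and tuple layout are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Operation = Tuple[str, Optional[str], Optional[str]]


def token_operations(transcript: str, correction: str) -> List[Operation]:
    """Label tokens as COPY, REPLACE, REMOVE or INSERT (illustrative only)."""
    src = transcript.split()
    dst = correction.split()
    ops: List[Operation] = []
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Correction token is identical to the transcript token
            ops += [("COPY", w, w) for w in src[i1:i2]]
        elif tag == "replace":
            # A run of transcript tokens corresponds to a run of correction tokens
            ops.append(("REPLACE", " ".join(src[i1:i2]), " ".join(dst[j1:j2])))
        elif tag == "delete":
            # Transcript token without a counterpart in the correction
            ops += [("REMOVE", w, None) for w in src[i1:i2]]
        elif tag == "insert":
            # Correction token without a counterpart in the transcript
            ops += [("INSERT", None, w) for w in dst[j1:j2]]
    return ops


for op in token_operations("ik wilt nu eh na huis", "ik wil nu naar huis"):
    print(op)
```

Unlike AuChAnn, this stand-in compares whole tokens for equality only, so it cannot tell that "na" is closer to "naar" than "eh" is.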

The algorithm uses edit distance to establish which words are replacements of each other, i.e. it links a transcript token to a correction token. Words with the lowest available edit distance are matched together, and based on this match the operations COPY and REPLACE are determined. If two candidates have the same edit distance to a token, word position is used to determine the match. The operations REMOVE and INSERT are established if no suitable match can be found for a transcript and correction token respectively.
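The matching rule described above (lowest edit distance first, word position as a tie-breaker) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses a plain Levenshtein distance written out in pure Python instead of the editdistance package, it matches a single token in isolation, and `best_match` is a hypothetical name that does not appear in AuChAnn.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def best_match(token: str, position: int, candidates: Sequence[str]) -> int:
    """Index of the candidate with the lowest edit distance to `token`.

    Ties are broken by how close the candidate's index is to `position`.
    """
    return min(
        range(len(candidates)),
        key=lambda j: (levenshtein(token, candidates[j]), abs(j - position)),
    )


correction = ["ik", "wil", "nu", "naar", "huis"]
print(best_match("wilt", 1, correction))
```

A full aligner also has to make sure a candidate is not claimed by two transcript tokens; that bookkeeping is left out here.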

In addition to establishing these four operations, the function detects several other properties of the transcript and correction which can be expressed in CHAT. For example, it determines whether a word is a filler or fragment, whether a conjugation error has occurred, or if a pronoun, preposition, or article has been used incorrectly.
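As an illustration of these extra properties, the sketch below marks fillers and fragments and tags replacements that fall within one of the configured groups. The word lists and group keys are copied from the Settings section; the function names are hypothetical, the `&-` prefix follows the example output at the top of this README, and the `&+` prefix for fragments is an assumption based on the CHAT convention for phonological fragments. It is not AuChAnn's implementation.

```python
from typing import Dict, List, Optional

FILLERS = {"eh", "hm", "uh"}
FRAGMENTS = {"ba", "to", "mu"}
REPLACEMENTS: Dict[str, List[str]] = {
    "s:r:gc:art": ["de", "het", "een"],
    "s:r:gc:pro": ["dit", "dat", "deze"],
    "s:r:prep": ["aan", "uit"],
}


def mark_filler_or_fragment(token: str) -> str:
    """Prefix fillers with &- and fragments with &+ (the latter is assumed)."""
    if token in FILLERS:
        return f"&-{token}"
    if token in FRAGMENTS:
        return f"&+{token}"
    return token


def replacement_key(original: str, correction: str) -> Optional[str]:
    """Return the group key if both words belong to the same replacement group."""
    for key, group in REPLACEMENTS.items():
        if original != correction and original in group and correction in group:
            return key
    return None


def annotate_replacement(original: str, correction: str) -> str:
    """Build 'original [: correction]', adding '[* key]' for in-group swaps."""
    annotation = f"{original} [: {correction}]"
    key = replacement_key(original, correction)
    return f"{annotation} [* {key}]" if key else annotation


print(mark_filler_or_fragment("eh"))
print(annotate_replacement("de", "het"))
```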

Development

To install the requirements:

    pip install -r requirements.txt

To run the AuChAnn command-line function from the console:

    python -m auchann

Run Tests

    pip install pytest
    pytest

Upload to PyPi

    pip install pip-tools twine
    python setup.py sdist
    twine upload dist/*.tar.gz

Acknowledgments

The research for this software was made possible by the CLARIAH-PLUS project financed by NWO (Grant 184.034.023).

References

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

Owner

Name: Centre for Digital Humanities
Login: CentreForDigitalHumanities
Kind: organization
Email: cdh@uu.nl
Location: Netherlands

Website: https://cdh.uu.nl/
Repositories: 39
Profile: https://github.com/CentreForDigitalHumanities

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

Dependencies

requirements.txt pypi

chamd ==0.5.8
editdistance ==0.6.0
pyyaml ==5.4.1
pyyaml-include ==1.2.post2
sastadev ==0.0.2

setup.py pypi

chamd >=0.5.8
editdistance *
pyyaml-include *
sastadev *

.github/workflows/unit-tests.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
