peppermodules-toolboxtextmodules

https://github.com/sdruskat/peppermodules-toolboxtextmodules

Repository

Basic Info

Host: GitHub
Owner: sdruskat
License: apache-2.0
Language: Java
Default Branch: master
Size: 758 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 2
Releases: 1

Created over 8 years ago · Last pushed over 2 years ago

Pepper modules for the SIL Toolbox interlinear text format

How to cite

If you use the Toolbox Text Modules for Pepper in your work, please cite it with the metadata given below!

Author: Stephan Druskat (ORCiD id: https://orcid.org/0000-0003-4925-7248)
Year: 2018
Title: pepperModules-ToolboxTextModules
Version: 1.0.0
DOI: 10.5281/zenodo.1162208
Release date: 2018-01-24

Note that this metadata is also provided in machine-readable form in the Citation File Format, in the file CITATION.cff.

General information

Pepper is a conversion framework for linguistic data. pepperModules-ToolboxTextModules is a plugin for Pepper that provides an importer and an exporter for the Toolbox Interlinear Text Format, i.e., the text-based export format of SIL Toolbox. The format is frequently used for persisting language documentation data. For examples of Toolbox interlinear text files, see this directory in the GitHub repository teropa/nlp.

With the pepperModules-ToolboxTextModules, the data stored in Toolbox interlinear text files can be transferred to another format. This way, the data can be re-used for other purposes (such as adding different annotation types), or visualized and analyzed, e.g., in ANNIS, a search and visualization platform for linguistic data. For a list of available format converters for Pepper, see the list of known Pepper modules.

Note that there is also a Pepper module for the Toolbox XML format which is not related to this project.

Context

The development of pepperModules-ToolboxTextModules was initiated in the MelaTAMP research project.

Requirements

pepperModules-ToolboxTextModules requires Pepper >= 3.1.1-SNAPSHOT, as it relies on default property values which were introduced in that version. A kickstarter (i.e., standalone) version of Pepper including the correct version can be obtained from the snapshot releases repository for Pepper: Pepper_2018.01.26-SNAPSHOT.zip. This is the earliest version of Pepper that includes the functionality required by pepperModules-ToolboxTextModules; newer versions will work just as well.

Usage

1. Download Pepper_2018.01.26-SNAPSHOT.zip (or newer) and extract it to a directory of your choice.
2. Download the latest pepperModules-ToolboxTextModules .jar from releases and extract it to a directory of your choice.
3. Add the path to the directory containing pepperModules-ToolboxTextModules-<version>.jar to the Pepper configuration file:
   - Open {Pepper directory}/pepper/conf/pepper.properties
   - Remove the comment hash (#) from the line #pepper.dropin.paths= and add the path
4. Start Pepper with the respective command (pepperStart.sh on Linux/Mac, pepperStart.bat on Win).
5. Check that the Toolbox text modules have been resolved by displaying the list of available modules in Pepper with the l command.
6. Start a conversion. You can use the interactive wizard (c), or run a pre-defined workflow (c {path to workflow file}).

Pepper workflow file

Pepper conversions are defined in Pepper workflow files, see the Pepper User Guide.

The available properties for the Toolbox Text Modules are detailed in the following sections.

Importer

Requirements, assumptions, behaviour

Pre-existing meta annotations for ids

During conversion, when the importer encounters a pre-existing meta annotation on an id, it will overwrite the value of this annotation. This is highly unlikely to happen, unless the source file has been manually edited and the duplicate meta annotation introduced in the process.

Properties

Note that required values with a default value do not have to be specified in the workflow file when the default value should be used.

fileExtensions (String) (required): The file extensions that corpus files can have as a comma-separated list.

Default value: txt

idMarker (String): The Toolbox marker that precedes lines with IDs, without the preceding backslash.

Default value: id

refMarker (String) (required): The marker used for references, usually "ref" or "id".

Default value: ref

lexicalMarker (String) (required): The Toolbox marker that precedes lines with source text (usually "words") without the preceding backslash.

Default value: tx

morphologyMarker (String): The Toolbox marker that precedes lines with morphological information, without the preceding backslash.

Default value: mb

lexAnnotationMarkers (String): All Toolbox markers which precede lines with annotations of source text segments (usually "words"), without the preceding backslashes, and as a comma-separated list.

morphologyAnnotationMarkers (String): All Toolbox markers which precede lines with annotations of morphemes, without the preceding backslashes, and as a comma-separated list.

Default value: ge,ps

attachDelimiter (String): Whether detached morphology delimiters (as in "item - item" or similar) should be attached to the previous or the subsequent item, given as a two-item comma-separated list. The first item signifies whether the delimiter should be attached at all (if true, it will be attached). The second item signifies whether the delimiter should be attached to the subsequent item (if true, it will be attached to the subsequent item, making the latter a suffix; if false, it will be attached to the previous item).

Default value: true,true
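
The effect of the two flags can be illustrated with a minimal Java sketch. It is not the importer's actual implementation: the class `AttachDelimiterSketch` and the method `attach` are hypothetical names, and the handling of a delimiter that has no neighbour to attach to is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; names and edge-case handling are assumptions.
final class AttachDelimiterSketch {

    /**
     * @param items     whitespace-separated items of a morphology line
     * @param delimiter the detached delimiter to look for, e.g. "-"
     * @param property  value of attachDelimiter, e.g. "true,true"
     */
    static List<String> attach(List<String> items, String delimiter, String property) {
        String[] flags = property.split(",");
        boolean attach = Boolean.parseBoolean(flags[0].trim());
        boolean toNext = flags.length > 1 && Boolean.parseBoolean(flags[1].trim());
        List<String> result = new ArrayList<>();
        String pending = ""; // delimiter waiting for the subsequent item
        for (String item : items) {
            boolean detached = attach && item.equals(delimiter);
            if (detached && toNext) {
                pending += item;
            } else if (detached && !result.isEmpty()) {
                // attach to the previous item
                int last = result.size() - 1;
                result.set(last, result.get(last) + item);
            } else {
                result.add(pending + item);
                pending = "";
            }
        }
        if (!pending.isEmpty()) {
            result.add(pending); // trailing delimiter without a subsequent item
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> line = Arrays.asList("ka", "-", "ta");
        System.out.println(attach(line, "-", "true,true"));
        System.out.println(attach(line, "-", "true,false"));
    }
}
```

With `true,true` the detached delimiter ends up on the following item, with `true,false` on the preceding one, and with `false,...` the line is left untouched.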

morphemeDelimiters (String): The morpheme delimiters used in the Toolbox files, as a comma-separated two-item list where the first element is the affix delimiter and the second element is the clitic delimiter.

Default value: -,=
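
As a rough illustration of how such a two-item value could be read, the following Java sketch splits the property and labels an item by the delimiter it carries. The class name, the method name and the labels "affix", "clitic" and "stem" are hypothetical and not taken from the module.

```java
// Illustrative only; not the importer's actual implementation.
final class MorphemeDelimiterSketch {

    /**
     * @param item     a single item from a morphology line, e.g. "-s"
     * @param property value of morphemeDelimiters, e.g. "-,="
     * @return "affix", "clitic" or "stem" (hypothetical labels)
     */
    static String classify(String item, String property) {
        String[] delimiters = property.split(",");
        String affix = delimiters[0];  // first element: affix delimiter
        String clitic = delimiters[1]; // second element: clitic delimiter
        if (item.startsWith(affix) || item.endsWith(affix)) {
            return "affix";
        }
        if (item.startsWith(clitic) || item.endsWith(clitic)) {
            return "clitic";
        }
        return "stem";
    }

    public static void main(String[] args) {
        System.out.println(classify("-s", "-,="));
        System.out.println(classify("=ya", "-,="));
        System.out.println(classify("dog", "-,="));
    }
}
```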

liaisonDelimiter (String): The morpheme delimiter used in the Toolbox files to mark "words" represented on the morphological layer that are contracted into words on the lexical layer, e.g., Saliba tane = ta wane. This delimiter can be used for cases where the importer may otherwise not have enough information to figure out that the lexical word should contain the "morphological word".

It will be dropped after parsing and will not show up in either the Salt model or any further model transformations.

The marker is only picked up when it prefixes the second to nth word, i.e., for the Saliba example above, ta _wane (the property default is the underscore _) will be mapped as two items on the morphological layer which are ruled by one item on the lexical layer:

    lex:   |   tane    |
           |-----------|
    morph: | ta | wane |

Default value: _
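
The grouping described for the liaison delimiter can be sketched in Java as follows. This is a simplified illustration under the assumption that the delimiter always appears as a prefix of the second to nth morphological word; `LiaisonSketch` and `group` are hypothetical names and do not reflect the importer's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; not the importer's actual implementation.
final class LiaisonSketch {

    /**
     * Groups morphological words so that each group corresponds to one
     * lexical word. A word prefixed with the liaison delimiter joins the
     * previous group, and the delimiter itself is dropped.
     */
    static List<List<String>> group(List<String> morphWords, String liaison) {
        List<List<String>> groups = new ArrayList<>();
        for (String word : morphWords) {
            boolean joins = !groups.isEmpty()
                    && word.startsWith(liaison)
                    && word.length() > liaison.length();
            if (joins) {
                groups.get(groups.size() - 1).add(word.substring(liaison.length()));
            } else {
                List<String> group = new ArrayList<>();
                group.add(word);
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Morphological line for the Saliba example: "ta _wane"
        System.out.println(group(Arrays.asList("ta", "_wane"), "_"));
    }
}
```

For the Saliba example, the two morphological words end up in a single group, which corresponds to the one lexical word tane.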

subrefDefinitionMarker (String): The marker used to define subrefs.

Default value: subref

subrefAnnotationMarkers (String): The marker which precedes lines with annotations that can potentially span subranges of the complete morphological data source. For details about subrefs see the respective MelaTAMP wiki page.

mergeDuplMarkers (Boolean) (required): Whether lines with the same marker in the same block should be merged into one line.

true: Subsequent lines marked with {marker} are concatenated to the first line marked with {marker}.

false: All lines but the first line marked with {marker} are dropped.

Default value: true
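
The difference between the two settings can be shown with a short Java sketch. It is an illustration rather than the importer's code: `MergeDuplMarkersSketch` and `collect` are hypothetical names, and joining merged lines with a single space is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; not the importer's actual implementation.
final class MergeDuplMarkersSketch {

    /**
     * Collects the lines of one block by marker.
     *
     * @param lines lines of the form "\marker content"
     * @param merge value of mergeDuplMarkers
     */
    static Map<String, String> collect(List<String> lines, boolean merge) {
        Map<String, String> byMarker = new LinkedHashMap<>();
        for (String line : lines) {
            if (!line.startsWith("\\")) {
                continue; // not a marker line
            }
            String[] parts = line.substring(1).trim().split("\\s+", 2);
            String marker = parts[0];
            String content = parts.length > 1 ? parts[1].trim() : "";
            if (!byMarker.containsKey(marker)) {
                byMarker.put(marker, content);
            } else if (merge) {
                // true: concatenate to the first line with this marker
                byMarker.put(marker, (byMarker.get(marker) + " " + content).trim());
            }
            // false: all lines but the first with this marker are dropped
        }
        return byMarker;
    }

    public static void main(String[] args) {
        List<String> block = Arrays.asList("\\tx one two", "\\ge a b", "\\tx three");
        System.out.println(collect(block, true));
        System.out.println(collect(block, false));
    }
}
```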

recordErrors (Boolean) (required): Whether the importer should record errors.

true (default): Errors in the data model will be recorded, i.e., annotations on an error layer (called err) will be added for each line which seems to contain an error. Additionally, another annotation will be added to discrete layers, recording the original faulty line.

`false`: Errors will not be recorded. 

Default value: `true`

normalizeMarkers (Boolean) (required): Whether annotation namespace-name combinations for the default layers should be normalized to Toolbox standards (after the default values for refs, subrefs, lexical and morphological markers).

Default value: false

normalizeDocNames (Boolean) (required): Whether special characters and whitespaces in document names should be replaced with default characters.

Default value: true

fixInterl11n (Boolean) (required): Whether the importer should fix interlinearization.

true (default): Interlinearization errors in the data model will be fixed as follows.

- For **discrepancies between the number of lexical and morphological tokens**, morphological tokens will either be added until their number is equal to that of the lexical tokens (using the property `missingAnnoString`), or all morphological tokens at indices > index of the last lexical token will be dropped.

- For **discrepancies between the number of tokens and their annotations** as defined by `lexAnnotationMarkers` and `morphologyAnnotationMarkers`, annotations will either be added until their number is equal to that of the tokens on the token layer they refer to, or all annotations at indices > index of the last token they refer to will be dropped.

`false`: Interlinearization errors will not be fixed. For missing morphological tokens or annotations, nothing will be inserted. Morphological tokens and annotations at indices > the last index of the lexical tokens, or the last index of the token layer they refer to, respectively, will be concatenated to the last element on their line, separated by whitespace.

**NOTE:** If the property is set to `false`, unfixed interlinearization errors may cause an exception to be thrown at runtime!

Default value: `true`

missingAnnoString (String) (required): A String used to fill interlinearization gaps.

Default value: ***
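
The padding and dropping behaviour described for fixInterl11n can be sketched in Java. This is a simplified illustration for a line of annotations and the number of tokens it refers to; `InterlinearFixSketch` and `fit` are hypothetical names, not part of the module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; not the importer's actual implementation.
final class InterlinearFixSketch {

    /**
     * Makes a line of annotations match the number of tokens it refers to.
     *
     * @param annotations the annotation items found on the line
     * @param tokenCount  number of tokens on the layer the line refers to
     * @param filler      value of missingAnnoString, e.g. "***"
     */
    static List<String> fit(List<String> annotations, int tokenCount, String filler) {
        int kept = Math.min(annotations.size(), tokenCount);
        // Surplus items beyond the last token are dropped ...
        List<String> fixed = new ArrayList<>(annotations.subList(0, kept));
        // ... and gaps are filled with the filler string.
        while (fixed.size() < tokenCount) {
            fixed.add(filler);
        }
        return fixed;
    }

    public static void main(String[] args) {
        System.out.println(fit(Arrays.asList("N"), 3, "***"));
        System.out.println(fit(Arrays.asList("N", "V", "ADJ"), 2, "***"));
    }
}
```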

Exporter

Requirements, assumptions, behaviour

Data source sequences

The exporter works on a single data source sequence only, which is acquired for an id during conversion via sDocumentGraph.getOverlappedDataSourceSequence(idSpan, SALT_TYPE.STEXT_OVERLAPPING_RELATION).get(0).

This is due to the fact that the definitive data source in Toolbox is the single source text, i.e., the string of lexical tokens.

Layers

The Salt model must contain the following SLayers that are picked up by the properties idSpanLayer, refSpanLayer, txTokenLayer, and mbTokenLayer respectively:

A layer containing spans annotated with id-scope annotations
A layer containing spans annotated with ref-scope annotations
A layer containing lexical tokens
A layer containing morphological tokens

The names for these layers must be unique, i.e., the model must contain exactly one SLayer with that name.

Timeline

The Salt model must contain a timeline (STimeline), and the lexical tokens and morphological tokens must be interlinearized via this timeline. Lexical tokens must be aligned along the timeline without any gaps in indices.

Annotations

The Toolbox text format represents annotation names in a single field, the marker, which takes the form \{annotation-name}. It is the user's responsibility to keep these unique if required.

To enable this, markers can be generated automatically from a combination of the relevant fields in Salt: the annotation's layer, its namespace, and its name. This is done by defining the property markerScheme, where the values for layer name, namespace and name are represented in a pattern as l, ns and n.

For an annotation with layer name "layer1", the namespace "ns1" and the name "n1", a property defined as pattern l__ns_n will result in a marker \layer1__ns1_n1.
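
How such a pattern could be expanded is sketched below in Java. The README does not describe how the exporter parses the pattern, so the left-to-right scan used here (where ns is matched before n) is an assumption, and `MarkerSchemeSketch` and `marker` are hypothetical names.

```java
// Illustrative only; not the exporter's actual implementation.
final class MarkerSchemeSketch {

    /**
     * Expands a pattern such as "l__ns_n" into a Toolbox marker.
     * "l" stands for the layer name, "ns" for the namespace and "n" for
     * the annotation name; all other characters are copied literally.
     */
    static String marker(String pattern, String layer, String namespace, String name) {
        StringBuilder sb = new StringBuilder("\\");
        int i = 0;
        while (i < pattern.length()) {
            if (pattern.startsWith("ns", i)) { // check "ns" before "n"
                sb.append(namespace);
                i += 2;
            } else if (pattern.charAt(i) == 'l') {
                sb.append(layer);
                i++;
            } else if (pattern.charAt(i) == 'n') {
                sb.append(name);
                i++;
            } else {
                sb.append(pattern.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(marker("l__ns_n", "layer1", "ns1", "n1"));
    }
}
```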

Properties

Note that required values with a default value do not have to be specified in the workflow file when the default value should be used.

idSpanLayer (String) (required): The Salt layer that contains the spans to be mapped to Toolbox ids.

Default value: id

refSpanLayer (String) (required): The Salt layer that contains the spans to be mapped to Toolbox refs.

Default value: ref

txTokenLayer (String) (required): The Salt layer that contains the tokens to be mapped to Toolbox's tx lines.

Default value: tx

mbTokenLayer (String): The Salt layer that contains the tokens to be mapped to Toolbox's mb lines.

Default value: mb

idIdentifierAnnotation (String) (required): The annotation (namespace::name) that contains the identifiers of ids.

refIdentifierAnnotation (String) (required): The annotation (namespace::name) that contains the identifiers of refs.

txMaterialAnnotations (String): Comma-separated list of annotations which contain primary data, i.e., lexical material which will already be mapped to tokens but still exists as annotation and should thus be left out during export to annotations (as they will already be mapped to \tx).

mbMaterialAnnotations (String): Comma-separated list of annotations which contain primary data, i.e., morphological material which will already be mapped to tokens but still exists as annotation and should thus be left out during export to annotations (as they will already be mapped to \mb).

spaceReplacement (String) (required): String to replace whitespaces in annotation values with, as these whitespaces may break the item count in Toolbox interlinearization.

Default value: -
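
Why whitespace in annotation values matters can be seen in a small Java sketch. It assumes that items on a Toolbox line are counted by splitting on whitespace; `SpaceReplacementSketch`, `replaceSpaces` and `itemCount` are hypothetical names, not part of the exporter.

```java
import java.util.regex.Matcher;

// Illustrative only; not the exporter's actual implementation.
final class SpaceReplacementSketch {

    /** Replaces every whitespace character in an annotation value. */
    static String replaceSpaces(String value, String replacement) {
        // quoteReplacement keeps characters such as '$' literal
        return value.replaceAll("\\s", Matcher.quoteReplacement(replacement));
    }

    /** Counts the whitespace-separated items on a line. */
    static int itemCount(String line) {
        return line.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String value = "big dog";
        System.out.println(itemCount(value));                     // two items
        System.out.println(itemCount(replaceSpaces(value, "-"))); // one item
    }
}
```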

markerMap (String): A map mapping combinations of an annotation container's first layer name, the annotation's namespace and the annotation's name to a new annotation name.

Format: markerMap=layer-name::namespace:name=newName, layer::name=newName, namespace:name=newName, name=newName.

mapLayer (Boolean): Whether the name of the first layer in the list of an annotation container's layers should be mapped onto the marker. If true, this will result in a marker \{layer-name}__{name} for annotations whose container has >= 1 layers.

Default value: false

mapNamespace (Boolean): Whether the namespace of an annotation should be mapped onto the marker. If true, this will result in a marker \{namespace}_{name} for annotations with a non-null namespace.

Default value: false
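
The markers resulting from mapLayer and mapNamespace can be sketched in Java as follows. The class and method names are hypothetical, and the combined form for both flags set to true is an assumption based on the separators given for each property.

```java
// Illustrative only; not the exporter's actual implementation.
final class MarkerFlagsSketch {

    /**
     * Builds a marker from an annotation's first layer name, namespace
     * and name, depending on the mapLayer and mapNamespace flags.
     */
    static String marker(String firstLayer, String namespace, String name,
                         boolean mapLayer, boolean mapNamespace) {
        StringBuilder sb = new StringBuilder("\\");
        if (mapLayer && firstLayer != null) {
            sb.append(firstLayer).append("__"); // \{layer-name}__{name}
        }
        if (mapNamespace && namespace != null) {
            sb.append(namespace).append('_');   // \{namespace}_{name}
        }
        return sb.append(name).toString();
    }

    public static void main(String[] args) {
        System.out.println(marker("layer1", "ns1", "n1", true, false));
        System.out.println(marker("layer1", "ns1", "n1", false, true));
        System.out.println(marker("layer1", "ns1", "n1", false, false));
    }
}
```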

Contribute

Contributions are welcome! When contributing to this repository, please first discuss the change you wish to make via a new issue before making a change.

Contributors

An overview of contributors to this project can be found here.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

Name: Stephan Druskat
Login: sdruskat
Kind: user
Location: Berlin
Company: German Aerospace Center (DLR)

Website: http://sdruskat.net
Twitter: stdruskat
Repositories: 12
Profile: https://github.com/sdruskat

Software Engineering PhD candidate @DLR-SC, Research Software Engineer (https://hexatomic.github.io)

Citation (CITATION.cff)

# This is a file in the Citation File Format (https://citation-file-format.github.io)
# It provides citation metadata for the software you are using.
cff-version: 1.0.3
message: If you use the Toolbox Text Modules for Pepper in your work, please cite it with the metadata given below!
authors:
  - family-names: Druskat
    given-names: Stephan
    orcid: https://orcid.org/0000-0003-4925-7248
  - family-names: Krause
    given-names: Thomas
    orcid: https://orcid.org/0000-0003-3731-2422
title: ToolboxTextModules
version: 1.1.1
doi: 10.5281/zenodo.1297366
date-released: 2021-01-27
repository-code: https://github.com/sdruskat/pepperModules-ToolboxTextModules

Committers

Last synced: about 1 year ago

All Time

Total Commits: 401
Total Committers: 3
Avg Commits per committer: 133.667
Development Distribution Score (DDS): 0.287

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Stephan Druskat	m**l@s**t	286
Stephan Druskat	s**t@u**e	113
dependabot[bot]	4****]	2

Committer Domains (Top 20 + Academic)

uni-jena.de: 1 sdruskat.net: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 4
Total pull requests: 4
Average time to close issues: about 2 months
Average time to close pull requests: 3 days
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 3

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 67.0%