peppermodules-toolboxtextmodules
https://github.com/sdruskat/peppermodules-toolboxtextmodules
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: sdruskat
- License: apache-2.0
- Language: Java
- Default Branch: master
- Size: 758 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
- Releases: 1
Metadata Files
README.md
Pepper modules for the SIL Toolbox interlinear text format
How to cite
If you use the Toolbox Text Modules for Pepper in your work, please cite it with the metadata given below!
- Author: Stephan Druskat (ORCiD id: https://orcid.org/0000-0003-4925-7248)
- Year: 2018
- Title: pepperModules-ToolboxTextModules
- Version: 1.0.0
- DOI: 10.5281/zenodo.1162208
- Release date: 2018-01-24
Note that this metadata is also provided in machine-readable form in the Citation File Format, in the file CITATION.cff.
General information
Pepper is a conversion framework for linguistic data. pepperModules-ToolboxTextModules is a plugin for Pepper that provides an importer and an exporter for the Toolbox Interlinear Text Format, i.e., the text-based export format of SIL Toolbox. The format is frequently used for persisting language documentation data. For examples of Toolbox interlinear text files, see the GitHub repository teropa/nlp.
With the pepperModules-ToolboxTextModules, the data stored in Toolbox interlinear text files can be transferred to another format. This way, the data can be re-used for other purposes (such as adding different annotation types), or visualized and analyzed, e.g., in ANNIS, a search and visualization platform for linguistic data. For a list of available format converters for Pepper, see the list of known Pepper modules.
Note that there is also a Pepper module for the Toolbox XML format which is not related to this project.
Context
The development of pepperModules-ToolboxTextModules was initiated in the MelaTAMP research project.
Requirements
pepperModules-ToolboxTextModules requires Pepper >= 3.1.1-SNAPSHOT, as it relies on default property values which were introduced in this version. A kickstarter (i.e., standalone) version of Pepper including the correct version can be obtained from the snapshot releases repository for Pepper: Pepper_2018.01.26-SNAPSHOT.zip. This is the earliest version of Pepper that includes the functionality required by pepperModules-ToolboxTextModules; newer versions will work just as well.
Usage
- Download Pepper_2018.01.26-SNAPSHOT.zip (or newer) and extract it to a directory of your choice
- Download the latest pepperModules-ToolboxTextModules `.jar` from releases and extract it to a directory of your choice
- Add the path to the directory containing `pepperModules-ToolboxTextModules-<version>.jar` to the Pepper configuration file:
  - Open `{Pepper directory}/pepper/conf/pepper.properties`
  - Remove the comment hash (`#`) from the line `#pepper.dropin.paths=` and add the path
- Start Pepper with the respective command (`pepperStart.sh` on Linux/Mac, `pepperStart.bat` on Win)
- Check that the Toolbox text modules have been resolved by displaying the list of available modules in Pepper with the `l` command
- Start a conversion. You can use the interactive wizard (`c`), or run a pre-defined workflow (`c {path to workflow file}`).
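As an illustration of the configuration step above, the relevant line in `pepper.properties` might change as follows. The directory path is a placeholder, not a location required by Pepper.

```properties
# {Pepper directory}/pepper/conf/pepper.properties

# Before: the drop-in path is commented out
#pepper.dropin.paths=

# After: comment hash removed, path added (placeholder path, replace with your own)
pepper.dropin.paths=/path/to/directory/containing/the/jar/
```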
Pepper workflow file
Pepper conversions are defined in Pepper workflow files; see the Pepper User Guide for details.
The available properties for the Toolbox Text Modules are detailed in the following sections.
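A workflow file that uses these modules might look roughly like the sketch below. The element and attribute names follow the general workflow format as described in the Pepper User Guide and should be verified there. The module names `ToolboxTextImporter` and `ToolboxTextExporter` and both paths are assumptions; the names Pepper actually resolves are shown by the `l` command. Only the property keys and values are taken from this README.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<pepper-job version="1.0">
  <!-- Module name and path are assumptions; check the output of the l command -->
  <importer name="ToolboxTextImporter" path="./toolbox-corpus/">
    <customization>
      <!-- Properties with default values may be omitted when the default is wanted -->
      <property key="fileExtensions">txt</property>
      <property key="lexicalMarker">tx</property>
      <property key="morphologyMarker">mb</property>
      <property key="missingAnnoString">***</property>
    </customization>
  </importer>
  <!-- Module name and path are assumptions as well -->
  <exporter name="ToolboxTextExporter" path="./toolbox-out/">
    <customization>
      <property key="spaceReplacement">-</property>
    </customization>
  </exporter>
</pepper-job>
```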
Importer
Requirements, assumptions, behaviour
Pre-existing meta annotations for ids
During conversion, when the importer encounters a pre-existing meta annotation on an id, it will overwrite the value of this annotation. This is highly unlikely to happen unless the source file has been manually edited and a duplicate meta annotation was introduced in the process.
Properties
Note that required values with a default value do not have to be specified in the workflow file when the default value should be used.
`fileExtensions` (String) (required): The file extensions that corpus files can have, as a comma-separated list.
Default value: `txt`
`idMarker` (String): The Toolbox marker that precedes lines with IDs, without the preceding backslash.
Default value: `id`
`refMarker` (String) (required): The marker used for references, i.e., usually "ref" or "id".
Default value: `ref`
`lexicalMarker` (String) (required): The Toolbox marker that precedes lines with source text (usually "words"), without the preceding backslash.
Default value: `tx`
`morphologyMarker` (String): The Toolbox marker that precedes lines with morphological information, without the preceding backslash.
Default value: `mb`
`lexAnnotationMarkers` (String): All Toolbox markers which precede lines with annotations of source text segments (usually "words"), without the preceding backslashes, and as a comma-separated list.
`morphologyAnnotationMarkers` (String): All Toolbox markers which precede lines with annotations of morphemes, without the preceding backslashes, and as a comma-separated list.
Default value: `ge,ps`
`attachDelimiter` (String): Whether detached morphology delimiters (as in "item - item" or similar) should be attached to the previous or subsequent item, as a two-item comma-separated list, where the first item signifies whether the delimiter should be attached at all (if `true` it will be attached), and the second item signifies whether the delimiter should be attached to the subsequent item (if `true` it will be attached to the subsequent item, making the latter a suffix).
Default value: `true,true`
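The effect of the two flags can be sketched in plain Java. This is an illustration of the behaviour described above, not the importer's actual code: the class `DelimiterAttacher` and its method are hypothetical, and the handling of a delimiter that has no neighbour to attach to is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the module's API.
class DelimiterAttacher {

    // attach:       first flag, whether detached delimiters are attached at all
    // attachToNext: second flag, whether they go onto the subsequent item
    static List<String> attach(List<String> items, Set<String> delimiters,
                               boolean attach, boolean attachToNext) {
        if (!attach) {
            return new ArrayList<>(items); // leave the items untouched
        }
        List<String> result = new ArrayList<>();
        String pending = ""; // delimiters waiting for the subsequent item
        for (String item : items) {
            if (!delimiters.contains(item)) {
                result.add(pending + item);
                pending = "";
            } else if (attachToNext) {
                pending += item;
            } else if (!result.isEmpty()) {
                int last = result.size() - 1;
                result.set(last, result.get(last) + item);
            } else {
                result.add(item); // assumption: keep a leading delimiter as is
            }
        }
        if (!pending.isEmpty()) {
            result.add(pending); // assumption: keep a trailing delimiter as is
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> delimiters = new HashSet<>(Arrays.asList("-", "="));
        List<String> items = Arrays.asList("ta", "-", "wane");
        System.out.println(attach(items, delimiters, true, true));  // [ta, -wane]
        System.out.println(attach(items, delimiters, true, false)); // [ta-, wane]
    }
}
```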
`morphemeDelimiters` (String): The morpheme delimiters used in the Toolbox files, as a comma-separated two-item list where the first element is the affix delimiter and the second element is the clitics delimiter.
Default value: `-,=`
`liaisonDelimiter` (String): The morpheme delimiter used in the Toolbox files to mark "words" represented on the morphological layer that are contracted into words on the lexical layer, e.g., Saliba tane = ta wane. This delimiter can be used for cases where the importer may otherwise not have enough information to figure out that the lexical word should contain the "morphological word".
It will be dropped after parsing and will not show up in either the Salt model or any further model transformations.
The marker is only picked up when used to prefix the second to nth word, i.e., for the Saliba example above, `ta _wane` (the property default is the underscore `_`) will be mapped as two items on the morphological layer which are ruled by one item on the lexical layer:
lex:   |   tane    |
       |-----------|
morph: | ta | wane |
Default value: `_`
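The grouping that the liaison delimiter expresses can be sketched as follows. This is a simplified illustration under the assumption that the delimiter prefixes the second to nth morphological word, as in `ta _wane`; the class `LiaisonGrouper` is hypothetical and does not reproduce the importer's parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the module's API.
class LiaisonGrouper {

    // Groups morphological "words" so that each group corresponds to one
    // lexical word. A word prefixed with the liaison delimiter joins the
    // previous group, and the delimiter itself is dropped.
    static List<List<String>> group(List<String> morphWords, String liaison) {
        List<List<String>> groups = new ArrayList<>();
        for (String word : morphWords) {
            boolean continuesPrevious = word.startsWith(liaison)
                    && word.length() > liaison.length()
                    && !groups.isEmpty();
            if (continuesPrevious) {
                groups.get(groups.size() - 1).add(word.substring(liaison.length()));
            } else {
                List<String> group = new ArrayList<>();
                group.add(word);
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // One lexical word ("tane") ruling two morphological items
        System.out.println(group(Arrays.asList("ta", "_wane"), "_")); // [[ta, wane]]
    }
}
```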
`subrefDefinitionMarker` (String): The marker used to define subrefs.
Default value: `subref`
`subrefAnnotationMarkers` (String): The marker which precedes lines with annotations that can potentially span subranges of the complete morphological data source. For details about subrefs, see the respective MelaTAMP wiki page.
`mergeDuplMarkers` (Boolean) (required): Whether lines with the same marker in the same block should be merged into one line.
- `true`: Subsequent lines marked with {marker} are concatenated to the first line marked with {marker}.
- `false`: All lines but the first line marked with {marker} are dropped.
Default value: `true`
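The two `mergeDuplMarkers` settings can be illustrated with a small sketch. It is not the importer's implementation: the class `DuplicateMarkerMerger` is hypothetical, and joining merged lines with a single space is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the module's API.
class DuplicateMarkerMerger {

    // Maps each marker in a block to its line content. Duplicate markers are
    // either concatenated to the first occurrence or dropped.
    static Map<String, String> merge(List<String> blockLines, boolean mergeDuplMarkers) {
        Map<String, String> byMarker = new LinkedHashMap<>();
        for (String line : blockLines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("\\")) {
                continue; // not a marker line
            }
            String[] parts = trimmed.split("\\s+", 2);
            String marker = parts[0].substring(1); // strip the backslash
            String content = parts.length > 1 ? parts[1] : "";
            if (!byMarker.containsKey(marker)) {
                byMarker.put(marker, content);
            } else if (mergeDuplMarkers) {
                // assumption: merged lines are joined with a single space
                byMarker.put(marker, (byMarker.get(marker) + " " + content).trim());
            }
            // otherwise: all lines but the first are dropped
        }
        return byMarker;
    }

    public static void main(String[] args) {
        List<String> block = Arrays.asList("\\tx one two", "\\ge a b", "\\tx three");
        System.out.println(merge(block, true).get("tx"));  // one two three
        System.out.println(merge(block, false).get("tx")); // one two
    }
}
```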
`recordErrors` (Boolean) (required): Whether the importer should record errors.
- `true` (default): Errors in the data model will be recorded, i.e., annotations on an error layer (called `err`) will be added for each line which seems to contain an error. Additionally, another annotation will be added to discrete layers, recording the original faulty line.
- `false`: Errors will not be recorded.
Default value: `true`
`normalizeMarkers` (Boolean) (required): Whether annotation namespace-name combinations for the default layers should be normalized to Toolbox standards (after the default values for refs, subrefs, lexical and morphological markers).
Default value: `false`
`normalizeDocNames` (Boolean) (required): Whether special characters and whitespaces in document names should be replaced with default characters.
Default value: `true`
`fixInterl11n` (Boolean) (required): Whether the importer should fix interlinearization.
- `true` (default): Interlinearization errors in the data model will be fixed as follows.
  - For **discrepancies between the number of lexical and morphological tokens**, morphological tokens will either be added until their number is equal to that of the lexical tokens (using the property `missingAnnoString`), or all morphological tokens at indices > index of the last lexical token will be dropped.
  - For **discrepancies between the number of tokens and their annotations** as defined by `lexAnnotationMarkers` and `morphologyAnnotationMarkers`, annotations will either be added until their number is equal to that of the token layer they refer to, or all annotations at indices > index of the last token they refer to will be dropped.
- `false`: Interlinearization errors will not be fixed. For missing morphological tokens or annotations, nothing will be inserted. Morphological tokens and annotations at indices > last index of the lexical tokens, or last index of the token layer they refer to, respectively, will be concatenated to the last element on their line, separated by whitespace.
**NOTE:** If the property is set to `false`, unfixed interl11n errors may cause an exception to be thrown at runtime!
Default value: `true`
`missingAnnoString` (String) (required): A string used to fill interlinearization gaps.
Default value: `***`
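The padding, dropping and concatenation behaviour described for `fixInterl11n` can be sketched on plain lists of strings. The class `InterlinearizationFixer` and its methods are hypothetical and only mirror the description above; the importer itself works on the Salt model rather than on string lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the module's API.
class InterlinearizationFixer {

    // fixInterl11n = true: pad with the filler string, or drop surplus items,
    // until the line has exactly targetCount items.
    static List<String> fit(List<String> items, int targetCount, String missingAnnoString) {
        int kept = Math.min(items.size(), targetCount);
        List<String> fitted = new ArrayList<>(items.subList(0, kept));
        while (fitted.size() < targetCount) {
            fitted.add(missingAnnoString);
        }
        return fitted;
    }

    // fixInterl11n = false: nothing is inserted for gaps, and surplus items
    // are concatenated to the last element, separated by whitespace.
    static List<String> concatSurplus(List<String> items, int targetCount) {
        if (targetCount < 1 || items.size() <= targetCount) {
            return new ArrayList<>(items);
        }
        List<String> result = new ArrayList<>(items.subList(0, targetCount - 1));
        result.add(String.join(" ", items.subList(targetCount - 1, items.size())));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fit(Arrays.asList("a"), 3, "***"));               // [a, ***, ***]
        System.out.println(fit(Arrays.asList("a", "b", "c"), 2, "***"));     // [a, b]
        System.out.println(concatSurplus(Arrays.asList("a", "b", "c", "d"), 2)); // [a, b c d]
    }
}
```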
Exporter
Requirements, assumptions, behaviour
Data source sequences
The exporter works on a single data source sequence only, which is acquired for an id during conversion via `sDocumentGraph.getOverlappedDataSourceSequence(idSpan, SALT_TYPE.STEXT_OVERLAPPING_RELATION).get(0)`.
This is due to the fact that the definitive data source in Toolbox is the single source text, i.e., the string of lexical tokens.
Layers
The Salt model must contain the following SLayers that are picked up by the
properties idSpanLayer, refSpanLayer, txTokenLayer, and mbTokenLayer
respectively:
- A layer containing spans annotated with id-scope annotations
- A layer containing spans annotated with ref-scope annotations
- A layer containing lexical tokens
- A layer containing morphological tokens
The names for these layers must be unique, i.e., the model must contain exactly one SLayer with that name.
Timeline
The Salt model must contain a timeline (STimeline), and the lexical tokens
and morphological tokens must be interlinearized via this timeline. Lexical
tokens must be aligned along the timeline without any gaps in indices.
Annotations
The Toolbox text format represents annotation names in a single field, the
marker, which takes the form \{annotation-name}. It is the user's responsibility
to keep these unique if required.
In order to enable this, the markers can automatically be generated by a
combination of the respective relevant fields in Salt, the annotation's layer,
its namespace and its name. This is done by defining the property
markerScheme, where the values for layer name, namespace and name are
represented in a pattern as l, ns and n.
For an annotation with layer name "layer1", the namespace "ns1" and the name
"n1", a property defined as pattern l__ns_n will result in a marker
\layer1__ns1_n1.
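How such a pattern could expand into a marker is sketched below. This is one possible reading of the description, not the exporter's implementation: the class `MarkerSchemeSketch` is hypothetical, and it treats every occurrence of `l`, `ns` and `n` in the pattern as a placeholder, so the pattern cannot contain those letters as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the module's API.
class MarkerSchemeSketch {

    // "ns" comes first in the alternation so that it is not read as "n" + "s".
    private static final Pattern PLACEHOLDER = Pattern.compile("ns|l|n");

    static String toMarker(String scheme, String layer, String namespace, String name) {
        Matcher matcher = PLACEHOLDER.matcher(scheme);
        StringBuffer marker = new StringBuffer("\\"); // Toolbox markers start with a backslash
        while (matcher.find()) {
            String value;
            switch (matcher.group()) {
                case "ns": value = namespace; break;
                case "l":  value = layer;     break;
                default:   value = name;
            }
            matcher.appendReplacement(marker, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(marker);
        return marker.toString();
    }

    public static void main(String[] args) {
        // Prints \layer1__ns1_n1
        System.out.println(toMarker("l__ns_n", "layer1", "ns1", "n1"));
    }
}
```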
Properties
Note that required values with a default value do not have to be specified in the workflow file when the default value should be used.
`idSpanLayer` (String) (required): The Salt layer that contains the spans to be mapped to Toolbox ids.
Default value: `id`
`refSpanLayer` (String) (required): The Salt layer that contains the spans to be mapped to Toolbox refs.
Default value: `ref`
`txTokenLayer` (String) (required): The Salt layer that contains the tokens to be mapped to Toolbox tx lines.
Default value: `tx`
`mbTokenLayer` (String): The Salt layer that contains the tokens to be mapped to Toolbox mb lines.
Default value: `mb`
`idIdentifierAnnotation` (String) (required): The annotation (namespace::name) that contains the identifiers of ids.
`refIdentifierAnnotation` (String) (required): The annotation (namespace::name) that contains the identifiers of refs.
`txMaterialAnnotations` (String): Comma-separated list of annotations which contain primary data, i.e., lexical material which will already be mapped to tokens but still exists as annotations and should thus be left out during export to annotations (as it will already be mapped to `\tx`).
`mbMaterialAnnotations` (String): Comma-separated list of annotations which contain primary data, i.e., morphological material which will already be mapped to tokens but still exists as annotations and should thus be left out during export to annotations (as it will already be mapped to `\mb`).
`spaceReplacement` (String) (required): String to replace whitespaces in annotation values with, as these whitespaces may break the item count in Toolbox interlinearization.
Default value: `-`
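Why whitespace matters for the item count can be seen in a short sketch. The class `SpaceReplacer` is hypothetical, and replacing each run of whitespace with a single replacement string is an assumption; the exporter may treat consecutive whitespace characters differently.

```java
import java.util.regex.Matcher;

// Hypothetical helper for illustration only; not part of the module's API.
class SpaceReplacer {

    // Assumption: each run of whitespace becomes one replacement string.
    static String replaceSpaces(String annotationValue, String spaceReplacement) {
        return annotationValue.replaceAll("\\s+", Matcher.quoteReplacement(spaceReplacement));
    }

    // Counts whitespace-separated items, as on an interlinear Toolbox line.
    static int itemCount(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String gloss = "big house"; // one annotation value containing a space
        System.out.println(itemCount(gloss));                     // 2
        System.out.println(replaceSpaces(gloss, "-"));            // big-house
        System.out.println(itemCount(replaceSpaces(gloss, "-"))); // 1
    }
}
```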
`markerMap` (String): A map mapping combinations of an annotation container's first layer name, the annotation's namespace and the annotation's name to a new annotation name.
Format: `markerMap=layer-name::namespace:name=newName, layer::name=newName, namespace:name=newName, name=newName`.
`mapLayer` (Boolean): Whether the name of the first layer in the list of an annotation container's layers should be mapped onto the marker. If true, this will result in a marker `\{layer-name}__{name}` for annotations whose container has >= 1 layers.
Default value: `false`
`mapNamespace` (Boolean): Whether the namespace of an annotation should be mapped onto the marker. If true, this will result in a marker `\{namespace}_{name}` for annotations with a non-null namespace.
Default value: `false`
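The marker shapes produced by `mapLayer` and `mapNamespace` can be sketched as below. The class `MarkerBuilder` is hypothetical. The result for both flags set at once is an assumption, inferred from the `\layer1__ns1_n1` example in the Annotations section rather than stated for these properties.

```java
// Hypothetical helper for illustration only; not part of the module's API.
class MarkerBuilder {

    static String marker(String layer, String namespace, String name,
                         boolean mapLayer, boolean mapNamespace) {
        StringBuilder marker = new StringBuilder("\\");
        if (mapLayer && layer != null) {
            marker.append(layer).append("__"); // \{layer-name}__{name}
        }
        if (mapNamespace && namespace != null) {
            marker.append(namespace).append("_"); // \{namespace}_{name}
        }
        return marker.append(name).toString();
    }

    public static void main(String[] args) {
        System.out.println(marker("layer1", "ns1", "n1", false, false)); // \n1
        System.out.println(marker("layer1", "ns1", "n1", true, false));  // \layer1__n1
        System.out.println(marker("layer1", "ns1", "n1", false, true));  // \ns1_n1
        // Assumption: combined form when both flags are true
        System.out.println(marker("layer1", "ns1", "n1", true, true));   // \layer1__ns1_n1
    }
}
```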
Contribute
Contributions are welcome! When contributing to this repository, please first discuss the change you wish to make via a new issue before making a change.
Contributors
An overview of contributors to this project can be found here.
License
```
Copyright (c) 2016ff. Stephan Druskat. Exploitation rights belong exclusively to Humboldt-Universität zu Berlin.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
```
Owner
- Name: Stephan Druskat
- Login: sdruskat
- Kind: user
- Location: Berlin
- Company: German Aerospace Center (DLR)
- Website: http://sdruskat.net
- Twitter: stdruskat
- Repositories: 12
- Profile: https://github.com/sdruskat
Software Engineering PhD candidate @DLR-SC, Research Software Engineer (https://hexatomic.github.io)
Citation (CITATION.cff)
# This is a file in the Citation File Format (https://citation-file-format.github.io)
# It provides citation metadata for the software you are using.
cff-version: 1.0.3
message: If you use the Toolbox Text Modules for Pepper in your work, please cite it with the metadata given below!
authors:
  - family-names: Druskat
    given-names: Stephan
    orcid: https://orcid.org/0000-0003-4925-7248
  - family-names: Krause
    given-names: Thomas
    orcid: https://orcid.org/0000-0003-3731-2422
title: ToolboxTextModules
version: 1.1.1
doi: 10.5281/zenodo.1297366
date-released: 2021-01-27
repository-code: https://github.com/sdruskat/pepperModules-ToolboxTextModules
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Stephan Druskat | m****l@s****t | 286 |
| Stephan Druskat | s****t@u****e | 113 |
| dependabot[bot] | 4****] | 2 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 4
- Total pull requests: 4
- Average time to close issues: about 2 months
- Average time to close pull requests: 3 days
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.25
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- sdruskat (4)
Pull Request Authors
- dependabot[bot] (3)
- sdruskat (1)
Dependencies
- ch.qos.logback:logback-classic 1.2.0
- org.hamcrest:hamcrest-all 1.3