annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

https://github.com/korpling/annatto

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

Basic Info
  • Host: GitHub
  • Owner: korpling
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 5.69 MB
Statistics
  • Stars: 3
  • Watchers: 4
  • Forks: 0
  • Open Issues: 18
  • Releases: 0
Created over 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.md

Annatto

This software tests and converts data within the RUEG research group at Humboldt-Universität zu Berlin. The tests continuously evaluate the state of the RUEG corpus data to identify issues with compatibility, consistency, and integrity early, facilitating data handling with regard to annotation, releases, and integration.

For efficiency, annatto relies on the graphANNIS representation and already provides a basic set of data-handling modules. We recommend getting acquainted with the ANNIS Query Language (AQL) to better understand the more advanced features of annatto.

Installing and running annatto

Annatto is a command line program, which is available pre-compiled for Linux, Windows and macOS. Download and extract the latest release file for your platform.

After extracting the binary to a directory of your choice, you can run it by opening a terminal and executing `<path-to-directory>/annatto` on Linux and macOS or `<path-to-directory>\annatto.exe` on Windows. If the annatto binary is located in the current working directory, you can also just execute `./annatto` on Linux and macOS or `annatto.exe` on Windows. In the following examples, the prefix to the path is omitted.

The main usage of annatto is through the command line interface. Run `annatto --help` to get more help on the sub-commands. The most important command is `annatto run <workflow-file>`, which runs all the modules as defined in the given workflow file.

Modules

Annatto comes with a number of modules, which have different types:

Importer modules allow importing files from different formats. More than one importer can be used in a workflow, but then the corpus data needs to be merged using one of the merger manipulators. When running a workflow, the importers are executed first and in parallel.

Graph operation modules change the imported corpus data. They are executed one after another (non-parallel) and in the order they have been defined in the workflow.

Exporter modules export the data into different formats. More than one exporter can be used in a workflow. When running a workflow, the exporters are executed last and in parallel.

To list all available formats (importers, exporters) and graph operations, run

```bash
annatto list
```

To show information about the modules for a given format or graph operation, use

```bash
annatto info <name>
```

The documentation for the modules is also included here.

Creating a workflow file

Annatto workflow files list which importers, graph operations, and exporters to execute. We use a TOML file with the ending .toml to configure the workflow. TOML files can be as simple as key-value pairs, like `config-key = "config-value"`, but they can also represent more complex structures, such as lists. The TOML website has a great "Quick Tour" section which explains the basic concepts of TOML with examples.
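As an illustrative reminder (the keys in this fragment are made up, not annatto options), these are the few TOML constructs the workflow examples below rely on:

```toml
# a simple key-value pair
config-key = "config-value"

# an array (list) value
tests = [ "test-a", "test-b" ]

# an array of tables: each [[step]] header starts a new entry in the list
[[step]]
name = "first"

[[step]]
name = "second"
```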

Import

An import step starts with the header `[[import]]`, followed by a value for the key `path`, which states where to read the corpus from, and for the key `format`, which declares the format the corpus is encoded in. The file path is relative to the workflow file. Importers also have an additional configuration section that follows the `[[import]]` table and is marked with the `[import.config]` header.

```toml
[[import]]
path = "textgrid/exampleCorpus/"
format = "textgrid"

[import.config]
tier_groups = { tok = [ "pos", "lemma", "Inf-Struct" ] }
skip_timeline_generation = true
skip_audio = true
skip_time_annotations = true
audio_extension = "wav"
```

You can have more than one importer, and you can simply list all the different importers at the beginning of the workflow file. An importer always needs to have a configuration header, even if it does not set any specific configuration option.

```toml
[[import]]
path = "a/mycorpus/"
format = "format-a"

[import.config]

[[import]]
path = "b/mycorpus/"
format = "format-b"

[import.config]

[[import]]
path = "c/mycorpus/"
format = "format-c"

[import.config]

# ...
```

Graph operations

Graph operations use the header `[[graph_op]]` and the key `action` to describe which action to execute. Since there are no files to import or export, they don't have a `path` configuration.

```toml
[[graph_op]]
action = "check"

[graph_op.config]
# Empty list of tests
tests = []
```
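Each entry of `tests` pairs an AQL query with an expected result. For the interval form shown in the full example below (e.g. `expected = [ 1, inf ]`), the pass/fail logic boils down to an inclusive range check; the helper below is a hypothetical sketch of that idea, not annatto's actual implementation:

```python
import math

def matches_expected(count, expected):
    """Inclusive interval check: does `count` fall into [lower, upper]?

    In the workflow TOML, `expected = [ 1, inf ]` would map to
    lower=1, upper=math.inf here. Hypothetical helper, not annatto's API.
    """
    lower, upper = expected
    return lower <= count <= upper

# "There is at least one token": any positive match count passes.
print(matches_expected(42, [1, math.inf]))  # True
print(matches_expected(0, [1, math.inf]))   # False
```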

Export

Exporters work similarly to importers, but use the header `[[export]]` instead.

```toml
[[export]]
path = "output/exampleCorpus"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true
```

Full example

You cannot mix import, graph operation, and export headers: list all the import steps first, then the graph operations, and then the export steps.

```toml
[[import]]
path = "conll/ExampleCorpus"
format = "conllu"

[import.config]

[[graph_op]]
action = "check"

[graph_op.config]
report = "list"

[[graph_op.config.tests]]
query = "tok"
expected = [ 1, inf ]
description = "There is at least one token."

[[graph_op.config.tests]]
query = "node ->dep node"
expected = [ 1, inf ]
description = "There is at least one dependency relation."

[[export]]
path = "grapml/"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true
```

Developing annatto

You need to install Rust to compile the project. We recommend installing the following cargo subcommands for developing annatto.

Execute tests

You can run the tests with the default cargo test command. To calculate the code coverage, you can use cargo-llvm-cov:

```bash
cargo llvm-cov --open --all-features --ignore-filename-regex 'tests?\.rs'
```

Performing a release

You need to have cargo-release installed to perform a release. Execute the following cargo command once to install it (together with cargo-about):

```bash
cargo install cargo-release cargo-about
```

To perform a release, switch to the main branch and execute:

```bash
cargo release [LEVEL] --execute
```

The level should be patch, minor or major depending on the changes made in the release. Running the release command will also trigger a CI workflow to create release binaries on GitHub.

Funding

This research was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) – SFB 1412, 416591334 and FOR 2537, 313607803, GZ LU 856/16-1.

Owner

  • Name: korpling
  • Login: korpling
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Annatto
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Thomas
    family-names: Krause
    email: thomas.krause@hu-berlin.de
    affiliation: Humboldt-Universität zu Berlin
    orcid: 'https://orcid.org/0000-0003-3731-2422'
  - given-names: Martin
    family-names: Klotz
    email: martin.klotz@hu-berlin.de
    affiliation: Humboldt-Universität zu Berlin
    orcid: 'https://orcid.org/0000-0002-8078-4516'
repository-code: 'https://github.com/korpling/annatto/'
abstract: >-
  This software aims to test and convert linguistic corpus
  data. Tests aim at continuously evaluating the state of a
  corpus to early identify issues regarding compatibility,
  consistency, and integrity to facilitate data handling
  with regard to annotation, releases and integration. For
  efficiency, Annatto relies on the graphANNIS
  representation and already provides a basic set of data
  handling modules.
license: Apache-2.0
version: 0.39.1
date-released: '2025-09-01'

GitHub Events

Total
  • Create event: 84
  • Release event: 20
  • Issues event: 60
  • Watch event: 2
  • Delete event: 66
  • Issue comment event: 89
  • Push event: 176
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 140
Last Year
  • Create event: 84
  • Release event: 20
  • Issues event: 60
  • Watch event: 2
  • Delete event: 66
  • Issue comment event: 89
  • Push event: 176
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 140

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 127
  • Total pull requests: 359
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 17 hours
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 0.55
  • Average comments per pull request: 1.18
  • Merged pull requests: 333
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 42
  • Pull requests: 146
  • Average time to close issues: 18 days
  • Average time to close pull requests: about 2 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.52
  • Average comments per pull request: 0.87
  • Merged pull requests: 123
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • MartinKl (122)
  • thomaskrause (3)
  • chiarcos (1)
Pull Request Authors
  • MartinKl (338)
  • thomaskrause (116)
Top Labels
Issue Labels
enhancement (62) bug (23) documentation (2) question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 45,044 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 55
  • Total maintainers: 2
crates.io: annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

  • Versions: 55
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 45,044 Total
Rankings
Dependent repos count: 29.0%
Dependent packages count: 34.3%
Forks count: 40.2%
Average: 51.2%
Stargazers count: 55.0%
Downloads: 97.9%
Maintainers (2)
Last synced: 6 months ago

Dependencies

.github/workflows/code_coverage.yml actions
  • SierraSoftworks/setup-grcov v1 composite
  • Swatinem/rust-cache v2.2.0 composite
  • actions-rs/toolchain v1.0.6 composite
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v2 composite
  • baptiste0928/cargo-install v1 composite
  • barecheck/code-coverage-action v1 composite
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • taiki-e/create-gh-release-action v1 composite
.github/workflows/rust.yml actions
  • actions-rs/clippy-check v1.0.7 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • actions/checkout v1 composite
  • baptiste0928/cargo-install v1 composite
  • mbrobbel/rustfmt-check 0.3.0 composite
Cargo.toml cargo
  • pretty_assertions 1.3 development
  • anyhow 1.0
  • csv 1.1
  • encoding_rs_io 0.1.7
  • glob 0.3
  • graphannis 2.4.2
  • graphannis-core 2.4.2
  • indicatif 0.16
  • itertools 0.10
  • log 0.4
  • normpath 1.1
  • ordered-float 3.4.0
  • pest 2.0
  • pest_derive 2.0
  • pyembed 0.22
  • pyo3 0.16
  • quick-xml 0.23
  • rayon 1.1
  • regex 1.4
  • rust-embed 6.3.0
  • serde 1.0
  • serde_derive 1.0
  • smartstring 0.2
  • structopt 0.3
  • tempfile 3
  • thiserror 1.0
  • toml 0.5
  • umya-spreadsheet 0.8
  • xml-rs 0.8