annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

https://github.com/korpling/annatto

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

Basic Info
  • Host: GitHub
  • Owner: korpling
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 5.69 MB
Statistics
  • Stars: 3
  • Watchers: 4
  • Forks: 0
  • Open Issues: 18
  • Releases: 0
Created over 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.md

Annatto

This software tests and converts data within the RUEG research group at Humboldt-Universität zu Berlin. The tests continuously evaluate the state of the RUEG corpus data to identify issues with compatibility, consistency, and integrity early, facilitating data handling with regard to annotation, releases, and integration.

For efficiency, annatto relies on the graphANNIS representation and already provides a basic set of data-handling modules. We recommend getting acquainted with the ANNIS Query Language (AQL) to better understand the more advanced features of annatto.

Installing and running annatto

Annatto is a command line program, which is available pre-compiled for Linux, Windows and macOS. Download and extract the latest release file for your platform.

After extracting the binary to a directory of your choice, you can run it by opening a terminal and executing `<path-to-directory>/annatto` on Linux and macOS or `<path-to-directory>\annatto.exe` on Windows. If the annatto binary is located in the current working directory, you can also just execute `./annatto` on Linux and macOS or `annatto.exe` on Windows. In the following examples, the prefix to the path is omitted.

The main usage of annatto is through the command line interface. Run `annatto --help` to get more help on the sub-commands. The most important command is `annatto run <workflow-file>`, which runs all the modules as defined in the given workflow file.

Modules

Annatto comes with a number of modules, which have different types:

Importer modules allow importing files from different formats. More than one importer can be used in a workflow, but then the corpus data needs to be merged using one of the merger manipulators. When running a workflow, the importers are executed first and in parallel.

Graph operation modules change the imported corpus data. They are executed one after another (non-parallel) and in the order they have been defined in the workflow.

Exporter modules export the data into different formats. More than one exporter can be used in a workflow. When running a workflow, the exporters are executed last and in parallel.

To list all available formats (importers, exporters) and graph operations, run

```bash
annatto list
```

To show information about the modules for a given format or graph operation, use

```bash
annatto info <name>
```

The documentation for the modules is also included here.

Creating a workflow file

Annatto workflow files list which importers, graph operations, and exporters to execute. We use a TOML file with the ending .toml to configure the workflow. TOML files can be as simple as key-value pairs, like `config-key = "config-value"`, but they can also represent more complex structures, such as lists. The TOML website has a great "Quick Tour" section which explains the basic concepts of TOML with examples.
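As an illustrative reminder (the keys in this fragment are made up, not annatto options), these are the few TOML constructs the workflow examples below rely on:

```toml
# a simple key-value pair
config-key = "config-value"

# an array (list) value
tests = [ "test-a", "test-b" ]

# an array of tables: each [[step]] header starts a new entry in the list
[[step]]
name = "first"

[[step]]
name = "second"
```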

Import

An import step starts with the header `[[import]]`, followed by a value for the key `path`, which states where to read the corpus from, and for the key `format`, which declares the format the corpus is encoded in. The file path is relative to the workflow file. Importers also have an additional configuration section that follows the `[[import]]` table and is marked with the `[import.config]` header.

```toml
[[import]]
path = "textgrid/exampleCorpus/"
format = "textgrid"

[import.config]
tier_groups = { tok = [ "pos", "lemma", "Inf-Struct" ] }
skip_timeline_generation = true
skip_audio = true
skip_time_annotations = true
audio_extension = "wav"
```

You can have more than one importer, and you can simply list all the different importers at the beginning of the workflow file. An importer always needs to have a configuration header, even if it does not set any specific configuration option.

```toml
[[import]]
path = "a/mycorpus/"
format = "format-a"

[import.config]

[[import]]
path = "b/mycorpus/"
format = "format-b"

[import.config]

[[import]]
path = "c/mycorpus/"
format = "format-c"

[import.config]

# ...
```

Graph operations

Graph operations use the header `[[graph_op]]` and the key `action` to describe which action to execute. Since there are no files to import or export, they don't have a `path` configuration.

```toml
[[graph_op]]
action = "check"

[graph_op.config]
# Empty list of tests
tests = []
```
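Each entry of `tests` pairs an AQL query with an expected result. For the interval form shown in the full example below (e.g. `expected = [ 1, inf ]`), the pass/fail logic boils down to an inclusive range check; the helper below is a hypothetical sketch of that idea, not annatto's actual implementation:

```python
import math

def matches_expected(count, expected):
    """Inclusive interval check: does `count` fall into [lower, upper]?

    In the workflow TOML, `expected = [ 1, inf ]` would map to
    lower=1, upper=math.inf here. Hypothetical helper, not annatto's API.
    """
    lower, upper = expected
    return lower <= count <= upper

# "There is at least one token": any positive match count passes.
print(matches_expected(42, [1, math.inf]))  # True
print(matches_expected(0, [1, math.inf]))   # False
```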

Export

Exporters work similarly to importers, but use the header `[[export]]` instead.

```toml
[[export]]
path = "output/exampleCorpus"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true
```

Full example

You cannot mix import, graph operation, and export headers: list all the import steps first, then the graph operations, and then the export steps.

```toml
[[import]]
path = "conll/ExampleCorpus"
format = "conllu"

[import.config]

[[graph_op]]
action = "check"

[graph_op.config]
report = "list"

[[graph_op.config.tests]]
query = "tok"
expected = [ 1, inf ]
description = "There is at least one token."

[[graph_op.config.tests]]
query = "node ->dep node"
expected = [ 1, inf ]
description = "There is at least one dependency relation."

[[export]]
path = "grapml/"
format = "graphml"

[export.config]
add_vis = "# no vis"
guess_vis = true
```

Developing annatto

You need to install Rust to compile the project. We recommend installing the following cargo subcommands for developing annatto.

Execute tests

You can run the tests with the default cargo test command. To calculate the code coverage, you can use cargo-llvm-cov:

```bash
cargo llvm-cov --open --all-features --ignore-filename-regex 'tests?\.rs'
```

Performing a release

You need to have cargo-release installed to perform a release. Execute the following cargo command once to install it (together with cargo-about):

```bash
cargo install cargo-release cargo-about
```

To perform a release, switch to the main branch and execute:

```bash
cargo release [LEVEL] --execute
```

The level should be patch, minor or major depending on the changes made in the release. Running the release command will also trigger a CI workflow to create release binaries on GitHub.

Funding

This research was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) – SFB 1412, 416591334 and FOR 2537, 313607803, GZ LU 856/16-1.

Owner

  • Name: korpling
  • Login: korpling
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Annatto
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Thomas
    family-names: Krause
    email: thomas.krause@hu-berlin.de
    affiliation: Humboldt-Universität zu Berlin
    orcid: 'https://orcid.org/0000-0003-3731-2422'
  - given-names: Martin
    family-names: Klotz
    email: martin.klotz@hu-berlin.de
    affiliation: Humboldt-Universität zu Berlin
    orcid: 'https://orcid.org/0000-0002-8078-4516'
repository-code: 'https://github.com/korpling/annatto/'
abstract: >-
  This software aims to test and convert linguistic corpus
  data. Tests aim at continuously evaluating the state of a
  corpus to early identify issues regarding compatibility,
  consistency, and integrity to facilitate data handling
  with regard to annotation, releases and integration. For
  efficiency, Annatto relies on the graphANNIS
  representation and already provides a basic set of data
  handling modules.
license: Apache-2.0
version: 0.39.1
date-released: '2025-09-01'

GitHub Events

Total
  • Create event: 84
  • Release event: 20
  • Issues event: 60
  • Watch event: 2
  • Delete event: 66
  • Issue comment event: 89
  • Push event: 176
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 140
Last Year
  • Create event: 84
  • Release event: 20
  • Issues event: 60
  • Watch event: 2
  • Delete event: 66
  • Issue comment event: 89
  • Push event: 176
  • Pull request review event: 2
  • Pull request review comment event: 2
  • Pull request event: 140

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 127
  • Total pull requests: 359
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 17 hours
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 0.55
  • Average comments per pull request: 1.18
  • Merged pull requests: 333
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 42
  • Pull requests: 146
  • Average time to close issues: 18 days
  • Average time to close pull requests: about 2 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.52
  • Average comments per pull request: 0.87
  • Merged pull requests: 123
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • MartinKl (122)
  • thomaskrause (3)
  • chiarcos (1)
Pull Request Authors
  • MartinKl (338)
  • thomaskrause (116)
Top Labels
Issue Labels
enhancement (62) bug (23) documentation (2) question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cargo 45,044 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 55
  • Total maintainers: 2
crates.io: annatto

Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests.

  • Versions: 55
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 45,044 Total
Rankings
Dependent repos count: 29.0%
Dependent packages count: 34.3%
Forks count: 40.2%
Average: 51.2%
Stargazers count: 55.0%
Downloads: 97.9%
Maintainers (2)
Last synced: 6 months ago

Dependencies

.github/workflows/code_coverage.yml actions
  • SierraSoftworks/setup-grcov v1 composite
  • Swatinem/rust-cache v2.2.0 composite
  • actions-rs/toolchain v1.0.6 composite
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v2 composite
  • baptiste0928/cargo-install v1 composite
  • barecheck/code-coverage-action v1 composite
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • taiki-e/create-gh-release-action v1 composite
.github/workflows/rust.yml actions
  • actions-rs/clippy-check v1.0.7 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • actions/checkout v1 composite
  • baptiste0928/cargo-install v1 composite
  • mbrobbel/rustfmt-check 0.3.0 composite
Cargo.toml cargo
  • pretty_assertions 1.3 development
  • anyhow 1.0
  • csv 1.1
  • encoding_rs_io 0.1.7
  • glob 0.3
  • graphannis 2.4.2
  • graphannis-core 2.4.2
  • indicatif 0.16
  • itertools 0.10
  • log 0.4
  • normpath 1.1
  • ordered-float 3.4.0
  • pest 2.0
  • pest_derive 2.0
  • pyembed 0.22
  • pyo3 0.16
  • quick-xml 0.23
  • rayon 1.1
  • regex 1.4
  • rust-embed 6.3.0
  • serde 1.0
  • serde_derive 1.0
  • smartstring 0.2
  • structopt 0.3
  • tempfile 3
  • thiserror 1.0
  • toml 0.5
  • umya-spreadsheet 0.8
  • xml-rs 0.8