qualle

Implementation of Qualle Framework as proposed in the paper "Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints" and accompanying source code by Martin Toepfer and Christin Seifert.

https://github.com/zbw/qualle

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: zbw
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 895 KB
Statistics
  • Stars: 4
  • Watchers: 5
  • Forks: 4
  • Open Issues: 3
  • Releases: 11
Created over 4 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Qualle

This is an implementation of the Qualle framework as proposed in the paper [1] and accompanying source code.

The framework allows you to train a model that predicts the quality of the result of applying a multi-label classification (MLC) method to a document. In this implementation, only the recall is predicted for a document, but in principle any document-level quality estimation (such as the prediction of precision) can be implemented analogously.
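To make the prediction target concrete, here is a minimal sketch of document-level recall in the usual set-based sense: the share of a document's true labels that the MLC method also predicted. The function name `document_recall` and the handling of an empty ground truth are assumptions made for illustration; they are not part of Qualle's API.

```python
def document_recall(predicted_labels, true_labels):
    """Return the fraction of true labels that also appear among the predictions."""
    predicted = set(predicted_labels)
    true = set(true_labels)
    if not true:
        # Assumption: recall is treated as 0.0 when there is no ground truth.
        return 0.0
    return len(predicted & true) / len(true)


# One of two true labels was predicted, so the recall is 0.5.
recall = document_recall(["Concept0", "Concept1"], ["Concept0", "Concept3"])
print(recall)  # 0.5
```

Qualle estimates this value from the document content and the MLC output alone, without access to the true labels.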

Qualle provides a command-line interface to train and evaluate models. In addition, a REST web service for predicting the recall of an MLC result is provided.

Requirements

Python >= 3.10 is required.

Installation

Choose one of these installation methods:

With pip

Qualle is available on PyPI. You can install Qualle using pip:

pip install qualle

This will install a command line tool called qualle. You can call qualle -h to see the help message, which displays the available modes and options.

Note that it is generally recommended to use a virtual environment to avoid conflicting behaviour with the system package manager.

From source

You can also check out the repository and install the packages from source. This requires poetry:

```shell
# call inside the project directory
poetry install --without ci
```

Docker

You can also use a Docker image from the GitHub Container Registry:

docker pull ghcr.io/zbw/qualle

Alternatively, you can use the Dockerfile included in this project to build a Docker image yourself. E.g.:

docker build -t qualle .

By default, a container built from this image launches a REST interface listening on 0.0.0.0:8000.

You need to make the model file (see the REST interface section below) available to the Docker container via a bind mount or volume. In addition, specify the location of the model file with an environment variable named MDL_FILE:

docker run --rm -it --env MDL_FILE=/model -v /path/to/model:/model -p 8000:8000 ghcr.io/zbw/qualle

Gunicorn is used as the HTTP server. You can use the environment variable GUNICORN_CMD_ARGS to customize Gunicorn settings, such as the number of worker processes:

docker run --rm -it --env MDL_FILE=/model --env GUNICORN_CMD_ARGS="--workers 4" -v /path/to/model:/model -p 8000:8000 ghcr.io/zbw/qualle

You can also use the Docker image to train or evaluate by using the Qualle command line tool:

docker run --rm -it -v /path/to/train_data_file:/train_data_file -v /path/to/model_dir:/mdl_dir ghcr.io/zbw/qualle qualle train /train_data_file /mdl_dir/model

The Qualle command line tool is not available in releases 0.1.0 and 0.1.1. For these releases, you need to call the Python module qualle.main instead:

docker run --rm -it -v /path/to/train_data_file:/train_data_file -v /path/to/model_dir:/model_dir ghcr.io/zbw/qualle:0.1.1 python -m qualle.main train /train_data_file /model_dir/model

Usage

Input data

In order to train a model, evaluate a model or predict the quality of an MLC result you have to provide data.

This can be a tab-separated values (TSV) file in the following format (tabs are shown as \t):

document-content\tpredicted_labels_with_scores\ttrue_labels

where

  • document-content is a string describing the content of the document (more precisely: the string on which the MLC method is trained), e.g. the title
  • predicted_labels_with_scores is a comma-separated list of pairs predicted_label:confidence-score (this is basically the output of the MLC method)
  • true_labels is a comma-separated list of true labels (ground truth)

Note that you can omit the true_labels section if you only want to predict the quality of the MLC result.

For example, a row in the data file could look like this:

Optimal investment policy of the regulated firm\tConcept0:0.5,Concept1:1\tConcept0,Concept3
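As an illustration of this format, the following sketch parses a single row into its three parts. It is not Qualle's own parser: `Row` and `parse_row` are hypothetical names, and the choice to split each pair on its last colon is an assumption.

```python
from typing import NamedTuple


class Row(NamedTuple):
    content: str
    predicted: dict[str, float]  # label -> confidence score
    true_labels: list[str]       # empty when the column is omitted


def parse_row(line: str) -> Row:
    """Split one tab-separated line into content, scored predictions and true labels."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least document content and predicted labels")
    content, predicted_field = fields[0], fields[1]
    predicted = {}
    for pair in filter(None, predicted_field.split(",")):
        # Assumption: split on the last colon so labels may contain colons.
        label, score = pair.rsplit(":", 1)
        predicted[label] = float(score)
    true_field = fields[2] if len(fields) > 2 else ""
    true_labels = [label for label in true_field.split(",") if label]
    return Row(content, predicted, true_labels)


row = parse_row(
    "Optimal investment policy of the regulated firm"
    "\tConcept0:0.5,Concept1:1\tConcept0,Concept3"
)
print(row.predicted)    # {'Concept0': 0.5, 'Concept1': 1.0}
print(row.true_labels)  # ['Concept0', 'Concept3']
```

A row without the third column yields an empty `true_labels` list, which matches the prediction-only case described above.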

For those who use an MLC method via the toolkit Annif for automated subject indexing: You can alternatively specify a full-text document corpus combined with the result of the Annif index method (tested with Annif version 0.59) applied to the corpus. This is a folder with three files per document:

  • doc.annif : result of Annif index method
  • doc.tsv : ground truth
  • doc.txt : document content

As above, you may omit the doc.tsv if you just want to predict the quality of the MLC result.
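The folder layout can be checked before running Qualle with a short script like the sketch below. It only groups files by document name and does not read their contents; the function names are hypothetical, and the demo file names (doc1, doc2, notes.md) are made up for the example.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

KNOWN_SUFFIXES = {".annif", ".tsv", ".txt"}


def group_corpus_files(folder):
    """Map each document stem to its files, keyed by file suffix."""
    documents = defaultdict(dict)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix in KNOWN_SUFFIXES:
            documents[path.stem][path.suffix] = path
    return dict(documents)


def predictable_documents(documents):
    """Keep documents with both content and an Annif result; ground truth is optional."""
    return {
        stem: files
        for stem, files in documents.items()
        if ".txt" in files and ".annif" in files
    }


# Demo on a temporary folder: doc2 has no ground truth, notes.md is ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc1.annif", "doc1.tsv", "doc1.txt",
                 "doc2.annif", "doc2.txt", "notes.md"):
        (Path(tmp) / name).touch()
    grouped = predictable_documents(group_corpus_files(tmp))
    summary = {stem: sorted(files) for stem, files in grouped.items()}

print(summary)
```

Documents that lack a .tsv file remain usable for prediction, while training and evaluation need the ground truth as well.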

Train

To train a model, use the train mode, e.g.:

qualle train /path/to/train_data_file /path/to/output/model

It is also possible to use label calibration (comparison of predicted vs actual labels) using the subthesauri of a thesaurus (such as the STW) as categories (please read the paper for more explanations). Consult the help (see above) for the required options.

Evaluate

You must provide test data and the path to a trained model in order to evaluate that model. Metrics such as the explained variation are printed out, describing the quality of the recall prediction (please consult the paper for more information).
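The README does not spell out how the printed metrics are computed. As an illustration only, the sketch below assumes that "explained variation" refers to the common explained variance score, 1 - Var(y - ŷ) / Var(y), applied to true and predicted recall values. The function name and the sample numbers are hypothetical.

```python
from statistics import pvariance


def explained_variance(true_values, predicted_values):
    """Return 1 - Var(residuals) / Var(true values), using population variances."""
    if len(true_values) != len(predicted_values):
        raise ValueError("both sequences must have the same length")
    residuals = [t - p for t, p in zip(true_values, predicted_values)]
    total = pvariance(true_values)
    if total == 0:
        raise ValueError("true values are constant, so the score is undefined")
    return 1 - pvariance(residuals) / total


# Hypothetical true vs. predicted recall for four documents.
true_recall = [0.0, 0.5, 0.5, 1.0]
predicted_recall = [0.25, 0.5, 0.5, 0.75]
score = explained_variance(true_recall, predicted_recall)
print(round(score, 3))  # 0.75
```

A score of 1.0 means the predicted recall reproduces the true recall up to a constant offset; lower values mean more of the variation remains unexplained.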

REST interface

To perform the prediction on an MLC result, a REST interface can be started. uvicorn is used as the HTTP server. You can also use any ASGI server implementation and create the ASGI app directly with the method qualle.interface.rest.create_app. You need to provide the environment variable MDL_FILE with the path to the model (see qualle.interface.config.RESTSettings).

The REST endpoint expects an HTTP POST request whose body contains the MLC result for a list of documents. The format is JSON as specified in qualle/openapi.json. You can also try out the endpoint in the Swagger UI, accessible at http://address_of_server/docs.

Contribute

Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution.

Qualle code should follow the Black style. Black is included as a development dependency; run black . in the project root to autoformat code. Alternatively, you can use a Git pre-commit hook, which is already configured in the .pre-commit-config.yaml file. The pre-commit tool is also included as a development dependency. To enable the hook, run pre-commit install inside your local virtual environment. Afterwards, Black automatically checks the formatting of new or modified files each time you run git commit.

References

[1] Toepfer, Martin, and Christin Seifert. "Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints." International Conference on Theory and Practice of Digital Libraries. Springer, Cham, 2018. DOI: 10.1007/978-3-030-00066-0_1

Context information

This code was created as part of the subject indexing automatization effort at ZBW - Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.

Owner

  • Name: ZBW - Leibniz Information Centre for Economics
  • Login: zbw
  • Kind: organization
  • Location: Kiel, Hamburg (Germany)

ZBW is a public information provider that supports open science and research in economics. It holds more than 5 million media items and operates web applications.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Bartz
    given-names: Christopher
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Fürneisen
    given-names: Moritz
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Rajendram Bashyam
    given-names: Lakshmi
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Majal
    given-names: Ghulam Mustafa
    affiliation: "ZBW - Leibniz Information Centre for Economics"
title: "qualle (a framework to predict the quality of a multi-label classification result)"
abstract: "This framework allows to train a model which can be used to predict the quality of the result of applying a multi-label classification (MLC) method on a document. In this implementation, only the recall is predicted for a document, but in principle any document-level quality estimation (such as the prediction of precision) can be implemented analogously."
version: 0.5.1
license: Apache-2.0
date-released: 2025-07-28
repository-code: "https://github.com/zbw/qualle"
contact:
  - name: "Automatization of subject indexing using methods from artificial intelligence (AutoSE)"
  - website: "https://www.zbw.eu/en/about-us/key-activities/automated-subject-indexing"
  - email: autose@zbw.eu
  - affiliation: "ZBW - Leibniz Information Centre for Economics"
keywords:
  - "automated subject indexing"
  - "controlled vocabularies"
  - "machine learning"
references:
  - authors:
      - family-names: Toepfer
        given-names: Martin
      - family-names: Seifert
        given-names: Christin
    title: "Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints"
    type: conference-paper
    doi: 10.1007/978-3-030-00066-0_1

GitHub Events

Total
  • Create event: 14
  • Release event: 4
  • Issues event: 24
  • Watch event: 1
  • Delete event: 16
  • Issue comment event: 24
  • Push event: 50
  • Pull request review event: 14
  • Pull request event: 25
Last Year
  • Create event: 14
  • Release event: 4
  • Issues event: 24
  • Watch event: 1
  • Delete event: 16
  • Issue comment event: 24
  • Push event: 50
  • Pull request review event: 14
  • Pull request event: 25

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 151
  • Total Committers: 3
  • Avg Commits per committer: 50.333
  • Development Distribution Score (DDS): 0.073
Top Committers
Name Email Commits
Christopher Bartz c****z@z****u 140
Moritz Fuerneisen m****n@z****u 10
annakasprzik a****k@g****e 1
Committer Domains (Top 20 + Academic)
zbw.eu: 2 gmx.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 51
  • Average time to close issues: 6 days
  • Average time to close pull requests: 7 days
  • Total issue authors: 1
  • Total pull request authors: 6
  • Average comments per issue: 0.12
  • Average comments per pull request: 1.39
  • Merged pull requests: 39
  • Bot issues: 0
  • Bot pull requests: 7
Past Year
  • Issues: 17
  • Pull requests: 35
  • Average time to close issues: 6 days
  • Average time to close pull requests: 4 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.12
  • Average comments per pull request: 1.54
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 7
Top Authors
Issue Authors
  • gmmajal (19)
Pull Request Authors
  • gmmajal (35)
  • cbartz (12)
  • dependabot[bot] (6)
  • Lakshmi-bashyam (4)
  • annakasprzik (1)
  • san-uh (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (6) python (4)

Dependencies

.github/workflows/basic.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/extended.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
.github/workflows/publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • docker/build-push-action v3 composite
  • docker/login-action v2 composite
  • docker/metadata-action v4 composite
Dockerfile docker
  • python 3.8-slim-buster build
poetry.lock pypi
  • attrs 22.2.0 develop
  • bandit 1.7.4 develop
  • black 22.12.0 develop
  • certifi 2022.12.7 develop
  • charset-normalizer 3.0.1 develop
  • coverage 7.0.5 develop
  • dparse 0.6.2 develop
  • exceptiongroup 1.1.0 develop
  • flake8 5.0.4 develop
  • gitdb 4.0.10 develop
  • gitpython 3.1.30 develop
  • httpcore 0.16.3 develop
  • httpx 0.23.3 develop
  • iniconfig 2.0.0 develop
  • mccabe 0.7.0 develop
  • mypy-extensions 0.4.3 develop
  • packaging 21.3 develop
  • pathspec 0.10.3 develop
  • pbr 5.11.1 develop
  • platformdirs 2.6.2 develop
  • pluggy 1.0.0 develop
  • pycodestyle 2.9.1 develop
  • pyflakes 2.5.0 develop
  • pytest 7.2.1 develop
  • pytest-cov 4.0.0 develop
  • pytest-mock 3.10.0 develop
  • pyyaml 6.0 develop
  • requests 2.28.2 develop
  • rfc3986 1.5.0 develop
  • ruamel-yaml 0.17.21 develop
  • ruamel-yaml-clib 0.2.7 develop
  • safety 2.3.5 develop
  • smmap 5.0.0 develop
  • stevedore 4.1.1 develop
  • toml 0.10.2 develop
  • tomli 2.0.1 develop
  • urllib3 1.26.14 develop
  • anyio 3.6.2
  • click 8.1.3
  • colorama 0.4.6
  • fastapi 0.88.0
  • h11 0.14.0
  • idna 3.4
  • isodate 0.6.1
  • joblib 1.2.0
  • numpy 1.24.1
  • pydantic 1.10.4
  • pyparsing 3.0.9
  • rdflib 6.2.0
  • scikit-learn 1.2.0
  • scipy 1.9.3
  • setuptools 66.1.1
  • six 1.16.0
  • sniffio 1.3.0
  • starlette 0.22.0
  • threadpoolctl 3.1.0
  • typing-extensions 4.4.0
  • uvicorn 0.20.0
pyproject.toml pypi
  • fastapi ~0.88
  • pydantic ~1.10
  • python ^3.8
  • rdflib ~6.2
  • scikit-learn ~1.2
  • scipy ~1.9
  • uvicorn ~0.20