stwfsapy

Python reimplementation of stwfsa.

https://github.com/zbw/stwfsapy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Python reimplementation of stwfsa.

Basic Info

Host: GitHub
Owner: zbw
License: apache-2.0
Language: Python
Default Branch: master
Size: 634 KB

Statistics

Stars: 4
Watchers: 4
Forks: 2
Open Issues: 6
Releases: 0

Created about 6 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

stwfsapy

About

This library provides the functionality to find SKOS thesaurus concepts in a text. It is a reimplementation in Python of stwfsa combined with the concept scoring from [1]. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences.
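The matching idea can be illustrated with a toy example. The sketch below is not stwfsapy's actual automaton. It assumes a plain character trie, lower-cased matching at word boundaries, and made-up `http://example.org/...` concept URIs, and the helper names `build_trie` and `find_concepts` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_trie(labels: Dict[str, str]) -> dict:
    """Build a character trie; a node that ends a label stores the concept URI under key None."""
    root: dict = {}
    for label, concept in labels.items():
        node = root
        for ch in label.lower():
            node = node.setdefault(ch, {})
        node[None] = concept
    return root


def find_concepts(text: str, trie: dict) -> List[Tuple[str, int, int]]:
    """Return (concept, start, end) for every label occurrence that sits on word boundaries."""
    lowered = text.lower()
    matches = []
    for start in range(len(lowered)):
        # Only start a match at the beginning of a word.
        if start > 0 and lowered[start - 1].isalnum():
            continue
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break
            # A label ends here; accept it only if the word ends here as well.
            at_word_end = end + 1 == len(lowered) or not lowered[end + 1].isalnum()
            if None in node and at_word_end:
                matches.append((node[None], start, end + 1))
    return matches


# Hypothetical labels; two spellings point to the same concept.
labels = {
    "labour market": "http://example.org/concept/1",
    "labor market": "http://example.org/concept/1",
    "inflation": "http://example.org/concept/2",
}
trie = build_trie(labels)
text = "Inflation affects the labour market."
matches = find_concepts(text, trie)
for concept, start, end in matches:
    print(concept, repr(text[start:end]))
```

In the library, the matched occurrences are additionally scored by a trained classifier; this sketch covers only the matching step.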

Data Requirements

The construction of the automaton requires a SKOS thesaurus represented as a rdflib Graph. Concepts should be related to labels by one of the relations skos:prefLabel, skos:altLabel, or skos:hiddenLabel. (This implementation also includes zbwext:altLabelNarrower and zbwext:altLabelRelated as possible concept-label relations which are specific to ZBW.) Concepts have to be identifiable by rdf:type. The training of the predictor requires annotated text. Each training sample should be annotated with one or more concepts from the thesaurus.
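To make these requirements concrete, here is a minimal sketch of the triples a thesaurus needs to contain. It uses plain Python tuples as a stand-in for an rdflib Graph, and the `http://example.org/...` URIs and the helper `labels_by_concept` are assumptions made for illustration only.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
LABEL_RELATIONS = {SKOS + "prefLabel", SKOS + "altLabel", SKOS + "hiddenLabel"}

# Hypothetical type URI that identifies the concepts to be matched.
DESCRIPTOR = "http://example.org/Descriptor"

# (subject, predicate, object) triples standing in for a parsed thesaurus.
triples = [
    ("http://example.org/c/1", RDF_TYPE, DESCRIPTOR),
    ("http://example.org/c/1", SKOS + "prefLabel", "labour market"),
    ("http://example.org/c/1", SKOS + "altLabel", "job market"),
    # A category of a different type: its label is ignored below.
    ("http://example.org/t/9", RDF_TYPE, "http://example.org/Thsys"),
    ("http://example.org/t/9", SKOS + "prefLabel", "Economics"),
]


def labels_by_concept(triples, concept_type):
    """Collect the labels of every subject whose rdf:type is concept_type."""
    concepts = {s for s, p, o in triples if p == RDF_TYPE and o == concept_type}
    labels = defaultdict(set)
    for s, p, o in triples:
        if s in concepts and p in LABEL_RELATIONS:
            labels[s].add(o)
    return dict(labels)


result = labels_by_concept(triples, DESCRIPTOR)
print(result)
```

A concept without an rdf:type triple, or with labels attached through other relations, would not contribute any labels in this sketch.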

Installation

Requirements

Python >= 3.10,<3.14 is required.

With pip

stwfsapy is available on PyPI. You can install stwfsapy using pip:

pip install stwfsapy

This will install a Python package called stwfsapy.

Note that it is generally recommended to use a virtual environment to avoid conflicting behaviour with the system package manager.

From source

You can also check out the repository and install the package from source. This requires poetry:

```shell
# call inside the project directory
poetry install --without ci
```

Usage

Create predictor

First load your thesaurus.

```python
from rdflib import Graph

g = Graph()
g.parse('/path/to/your/thesaurus')
```

Next, define the type URI for descriptors. If your thesaurus is structured into sub-thesauri by providing categories for the concepts of the thesaurus using, e.g., `skos:Collection`, you can optionally specify the type of these categories via a URI. In this case you should also specify the relation that relates concepts to categories. Furthermore, you can indicate whether this relation is a specialisation relation (as opposed to a generalisation relation, which is the default). For the [STW](http://zbw.eu/stw/) this would be

```python
descriptor_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Descriptor'
thsys_type_uri = 'http://zbw.eu/namespaces/zbw-extensions/Thsys'
thesaurus_relation_type_uri = 'http://www.w3.org/2004/02/skos/core#broader'
is_specialisation = False
```

Create the predictor

```python
from stwfsapy.predictor import StwfsapyPredictor

p = StwfsapyPredictor(
    g,
    descriptor_type_uri,
    thsys_type_uri,
    thesaurus_relation_type_uri,
    is_specialisation,
    langs={'en'},
    simple_english_plural_rules=True)
```

The next step assumes you have loaded your texts into a list X and your labels into a list of lists y, such that for all indices 0 <= i < len(X), the list at y[i] contains the URIs of the correct concepts for X[i]. Then you can train the classifier:

```python
p.fit(X, y)
```

Afterwards you can get the predicted concepts and scores:

```python
p.suggest_proba(['one input text', 'A completely different input text.'])
```

Alternatively, you can get a sparse matrix of scores by calling

```python
p.predict_proba(['one input text', 'Another input text.'])
```

The indices of the concepts are stored in p.concept_map_.
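The expected shape of X and y can be checked before training. The following is a minimal sketch; `check_training_data` is a hypothetical helper and not part of stwfsapy, and the texts and `http://example.org/...` concept URIs are made up for illustration.

```python
def check_training_data(X, y, known_concepts):
    """Raise if X and y are misaligned or a sample lacks valid concept URIs."""
    if len(X) != len(y):
        raise ValueError(f"{len(X)} texts but {len(y)} label lists")
    for i, (text, concepts) in enumerate(zip(X, y)):
        if not isinstance(text, str):
            raise TypeError(f"X[{i}] is not a string")
        if not concepts:
            raise ValueError(f"y[{i}] is empty; each sample needs at least one concept")
        unknown = set(concepts) - set(known_concepts)
        if unknown:
            raise ValueError(f"y[{i}] contains URIs not in the thesaurus: {sorted(unknown)}")


# Hypothetical concept URIs taken from the thesaurus.
known_concepts = {
    "http://example.org/c/1",
    "http://example.org/c/2",
}

X = [
    "The labour market reacted to rising inflation.",
    "Inflation remained high.",
]
y = [
    ["http://example.org/c/1", "http://example.org/c/2"],  # concepts for X[0]
    ["http://example.org/c/2"],                            # concepts for X[1]
]

check_training_data(X, y, known_concepts)  # passes silently for well-formed data
```

Running such a check first turns a misaligned or incompletely annotated data set into an early, readable error instead of a failure inside training.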

Options

All options for the predictor are documented at https://stwfsapy-zbw.readthedocs.io.

Save Model

A trained predictor p can be stored by calling p.store('/path/to/storage/location'). Afterwards it can be loaded as follows:

```python
from stwfsapy.predictor import StwfsapyPredictor

p = StwfsapyPredictor.load('/path/to/storage/location')
```

Contribute

Contributions via pull requests are welcome. Please create an issue beforehand to explain and discuss the reasons for the respective contribution.

References

[1] Toepfer, Martin, and Christin Seifert. "Fusion architectures for automatic subject indexing under concept drift." International Journal on Digital Libraries (IJDL), 2018.

Context information

This code was created as part of the subject indexing automation effort at ZBW – Leibniz Information Centre for Economics. See our homepage for more information, publications, and contact details.

Owner

Name: ZBW - Leibniz Information Centre for Economics
Login: zbw
Kind: organization
Location: Kiel, Hamburg (Germany)

Website: https://www.zbw.eu/en/
Twitter: zbw_news
Repositories: 17
Profile: https://github.com/zbw

ZBW is a public information provider to support open science and research in economics. It holds more than 5 million media items and operates web applications.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Fürneisen
    given-names: Moritz
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Bartz
    given-names: Christopher
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Rajendram Bashyam
    given-names: Lakshmi
    affiliation: "ZBW - Leibniz Information Centre for Economics"
  - family-names: Majal
    given-names: Ghulam Mustafa
    affiliation: "ZBW - Leibniz Information Centre for Economics"
title: "stwfsapy (a library for matching labels of thesaurus concepts via finite-state automata)"
abstract: "This library provides functionality to find the labels of SKOS thesaurus concepts in text. A deterministic finite automaton is constructed from the labels of the thesaurus concepts to perform the matching. In addition, a classifier is trained to score the matched concept occurrences."
version: 0.6.1
license: Apache-2.0
date-released: 2025-06-24
repository-code: "https://github.com/zbw/stwfsapy"
contact:
  - name: "Automatization of subject indexing using methods from artificial intelligence (AutoSE)"
  - website: "https://www.zbw.eu/en/about-us/key-activities/automated-subject-indexing"
  - email: autose@zbw.eu
  - affiliation: "ZBW - Leibniz Information Centre for Economics"
keywords:
  - "automated subject indexing"
  - "controlled vocabularies"
  - "machine learning"
references:
  - authors:
      - family-names: Toepfer
        given-names: Martin
    title: "stwfsa"
    type: software
    repository-code: "https://github.com/zbw/stwfsa"

GitHub Events

Total

Create event: 12
Release event: 4
Issues event: 18
Watch event: 1
Delete event: 8
Issue comment event: 17
Push event: 44
Pull request review comment event: 2
Pull request event: 20
Pull request review event: 14

Last Year

Create event: 12
Release event: 4
Issues event: 18
Watch event: 1
Delete event: 8
Issue comment event: 17
Push event: 44
Pull request review comment event: 2
Pull request event: 20
Pull request review event: 14

Committers

Last synced: over 3 years ago

All Time

Total Commits: 153
Total Committers: 5
Avg Commits per committer: 30.6
Development Distribution Score (DDS): 0.216

Top Committers

Name	Email	Commits
Moritz Fuerneisen	m**n@z**u	120
Christopher Bartz	c**z@z**u	20
annakasprzik	a**k@g**e	8
Moritz Fürneisen	m**u@r**m	4
Christopher Bartz	c**z@g**e	1
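The summary figures above can be reproduced from the commit counts in this table. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits; that definition is an assumption, though it yields the reported value.

```python
# Commit counts per committer identity, as listed in the table above.
commits = [
    ("Moritz Fuerneisen", 120),
    ("Christopher Bartz", 20),
    ("annakasprzik", 8),
    ("Moritz Fürneisen", 4),
    ("Christopher Bartz", 1),
]

counts = [n for _, n in commits]
total_commits = sum(counts)
avg_per_committer = total_commits / len(counts)

# Assumed definition: DDS = 1 - (commits of the top committer / total commits).
dds = 1 - max(counts) / total_commits

print(total_commits, round(avg_per_committer, 1), round(dds, 3))
```

Note that the table lists two people twice under different email addresses, so the score treats each identity as a separate committer.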

Committer Domains (Top 20 + Academic)

gmx.de: 2, zbw.eu: 2, rocketmail.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 20
Total pull requests: 81
Average time to close issues: about 1 month
Average time to close pull requests: 25 days
Total issue authors: 6
Total pull request authors: 5
Average comments per issue: 0.4
Average comments per pull request: 0.99
Merged pull requests: 72
Bot issues: 0
Bot pull requests: 16

Past Year

Issues: 11
Pull requests: 21
Average time to close issues: 7 days
Average time to close pull requests: 3 days
Issue authors: 2
Pull request authors: 3
Average comments per issue: 0.27
Average comments per pull request: 1.57
Merged pull requests: 17
Bot issues: 0
Bot pull requests: 1


Top Authors

Issue Authors

gmmajal (10)
mo-fu (3)
juhoinkinen (3)
Lakshmi-bashyam (1)
annakasprzik (1)
rafle (1)

Pull Request Authors

mo-fu (38)
dependabot[bot] (21)
gmmajal (19)
Lakshmi-bashyam (13)
cbartz (2)

Top Labels

Issue Labels

enhancement (4), bug (1)

Pull Request Labels

dependencies (21), python (1)

Packages

Total packages: 1
Total downloads: 263 last month (PyPI)

Total dependent packages: 2
Total dependent repositories: 2
Total versions: 17
Total maintainers: 3

pypi.org: stwfsapy

A library for matching labels of thesaurus concepts to text and assigning scores to found occurrences.

Homepage: https://github.com/zbw/stwfsapy
Documentation: https://stwfsapy.readthedocs.io/
License: Apache-2.0
Latest release: 0.6.1 (published 12 months ago)

Versions: 17
Dependent Packages: 2
Dependent Repositories: 2
Downloads: 263 Last month

Rankings

Dependent packages count: 3.2%

Dependent repos count: 11.6%

Average: 15.4%

Downloads: 15.5%

Forks count: 19.1%

Stargazers count: 27.8%

Maintainers (3)

AnkasZBW, rblakshmi, gmajal

Last synced: 11 months ago
