compound-splitter
Wrapper and evaluation service for multiple Dutch compound splitters
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 5
- Forks: 0
- Open Issues: 2
- Releases: 1
Metadata Files
README.md
Compound Splitter
This is a basic wrapper for multiple Dutch compound splitters. The purpose of this wrapper is to:
- provide a unified API for multiple compound splitters. The package offers a simple socket server and a Flask application for this purpose.
- evaluate the accuracy of different compound splitters
Intended audience
The package was initially developed for T-scan, a natural language analysis application intended for research. For T-scan, we required that users could choose between different algorithms (hence the need for a unified API), and some evaluation of the quality of those algorithms.
The resulting package is useful if you want to run a compound splitting service (e.g. as part of an API or web application), or if you want to evaluate compound splitting methods. Adding new methods, even ones that are not Python packages, should be feasible if you have programming experience.
If you are looking for a simple, lightweight Python package for compound splitting, this is not it. compound-word-splitter may be a good alternative for you.
Compound splitting methods
The following compound splitters are included:
- compound-splitter-nl, developed by Katja Hoffman, Valentin Jijkoun, Jaap Kamps, and Christof Monz (LGPL-3.0 license). See https://web.archive.org/web/20200813005715/https://ilps.science.uva.nl/resources/compound-splitter-nl/ for the archived website and https://github.com/bminixhofer/ilps-nl-splitter for an archive of the source code.
- SECOS, developed by Martin Riedl and Chris Biemann (Apache-2.0 license). See https://github.com/riedlma/SECOS
- MCS, developed by Patrick Ziering. See https://www.ims.uni-stuttgart.de/en/research/resources/tools/mcs/
As a baseline, we also include a "never" algorithm, which never splits.
Requirements
- Python 3.6+
- Java (only required for MCS)
Installation
Installing with pip
compound-splitters-nl is available as a Python package, which includes the data for all included compound splitter methods. This complete package is too large to be registered on PyPI, but you can download the package from our releases.
The downloaded archive can be installed with pip as a local file:
```bash
pip install compound-splitters-nl-*.tar.gz
# or substitute with your file path
```
If you want to use the web API, you will need to install additional dependencies:
```bash
pip install compound-splitters-nl-*.tar.gz[web_api]
```
Installing from source code
You can also clone the source code repository. In this case, you will still need to download and unpack the data needed for the compound splitter methods. Run installation with:
```bash
pip install -r requirements.txt
python retrieve.py
python prepare.py
```
Tests
```bash
python -m unittest discover tests/
```
Evaluate Different Compound Algorithms
This will evaluate the different algorithms using the reference files in test_sets.
```bash
python -m compound_splitter.evaluate
```
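To make the idea of evaluating a splitter concrete, here is a minimal sketch of an exact-match accuracy score, together with a "never" baseline that returns its input unsplit. The function names, the reference format (a mapping from compound to expected parts), and the exact-match metric are all assumptions for illustration; the package's own evaluation module may measure accuracy differently.

```python
from typing import Callable, Dict, List


def never_split(compound: str) -> List[str]:
    """Hypothetical 'never' baseline: return the word as a single part."""
    return [compound]


def accuracy(
    split: Callable[[str], List[str]],
    reference: Dict[str, List[str]],
) -> float:
    """Fraction of compounds whose predicted parts equal the reference exactly.

    Assumption: `reference` maps each test word to its expected parts.
    """
    if not reference:
        return 0.0
    correct = sum(
        1 for compound, expected in reference.items()
        if split(compound) == expected
    )
    return correct / len(reference)


# Illustrative toy reference, not taken from the package's test sets.
toy_reference = {
    "bedrijfsaansprakelijkheidsverzekering": [
        "bedrijfs", "aansprakelijkheids", "verzekering",
    ],
    "huis": ["huis"],
}
baseline_score = accuracy(never_split, toy_reference)
```

A baseline like this gives a floor for comparison: it scores only on words that should not be split at all.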
Run Web API
```bash
python -m compound_splitter.api_web
```
JSON Interface
GET /list
Lists the splitting methods.
GET /split/<method_name>/<compound>
Splits the compound using the specified method.
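The two endpoints can be called from any HTTP client. The sketch below uses only the Python standard library to build the URLs and fetch the JSON. The base URL is an assumption (Flask's default development address), the helper names are hypothetical, and the shape of the returned JSON is not documented here, so the result is returned as-is.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Flask app listens on Flask's default development address.
BASE_URL = "http://localhost:5000"


def list_url(base_url: str = BASE_URL) -> str:
    """URL for GET /list."""
    return f"{base_url}/list"


def split_url(method_name: str, compound: str, base_url: str = BASE_URL) -> str:
    """URL for GET /split/<method_name>/<compound>, with path segments escaped."""
    return (
        f"{base_url}/split/"
        f"{quote(method_name, safe='')}/{quote(compound, safe='')}"
    )


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON (response shape is not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the web API running:
#   methods = fetch_json(list_url())
#   result = fetch_json(split_url("secos", "bedrijfsaansprakelijkheidsverzekering"))
```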
Run Simple Socket Server
```bash
python -m compound_splitter.socket_server
```
```bash
$ telnet localhost 7005
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
bedrijfsaansprakelijkheidsverzekering,secos
bedrijfs,aansprakelijkheids,verzekering
Connection closed by foreign host.
```
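The same exchange can be scripted instead of typed into telnet. The following is a sketch based only on the session above: the request is `compound,method`, the reply is a comma-separated list of parts, and the server closes the connection afterwards. The newline terminator, the UTF-8 encoding, and the helper names are assumptions.

```python
import socket
from typing import List


def build_request(compound: str, method: str) -> bytes:
    """Encode a request as 'compound,method' plus a newline (assumed terminator)."""
    return f"{compound},{method}\n".encode("utf-8")


def parse_response(data: bytes) -> List[str]:
    """Decode the comma-separated reply into a list of parts."""
    return data.decode("utf-8").strip().split(",")


def split_via_socket(
    compound: str,
    method: str,
    host: str = "localhost",
    port: int = 7005,  # port taken from the telnet session
    timeout: float = 10.0,
) -> List[str]:
    """Send one request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(compound, method))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))


# Usage, with the socket server running:
#   parts = split_via_socket("bedrijfsaansprakelijkheidsverzekering", "secos")
```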
Owner
- Name: UU Digital Humanities Lab
- Login: UUDigitalHumanitieslab
- Kind: organization
- Email: digitalhumanities@uu.nl
- Location: Utrecht
- Website: https://cdh.uu.nl/rsl/
- Repositories: 102
- Profile: https://github.com/UUDigitalHumanitieslab
Research Software Lab · Centre for Digital Humanities · Utrecht University
Citation (CITATION.cff)
cff-version: 1.2.0
title: compound-splitters-nl
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: 'Research Software Lab, Centre for Digital Humanities, Utrecht University'
    website: 'https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/'
    email: cdh@uu.nl
    city: Utrecht
    country: NL
repository-code: >-
  https://github.com/UUDigitalHumanitieslab/compound-splitter
abstract: >-
  Wrapper and evaluation service for multiple Dutch compound
  splitters
license: BSD-3-Clause
version: 0.0.1
references:
  - title: compound-splitter-nl
    authors:
      - given-names: Katja
        family-names: Hoffman
      - given-names: Valentin
        family-names: Jijkoun
      - given-names: Jaap
        family-names: Kamps
      - given-names: Christof
        family-names: Monz
    type: software
    license: LGPL-3.0
    url: https://web.archive.org/web/20200813005715/https://ilps.science.uva.nl/resources/compound-splitter-nl/
  - title: SECOS
    authors:
      - given-names: Martin
        family-names: Riedl
      - given-names: Chris
        family-names: Biemann
    type: software
    license: Apache-2.0
    url: https://github.com/riedlma/SECOS
  - title: Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods
    type: article
    authors:
      - given-names: Martin
        family-names: Riedl
      - given-names: Chris
        family-names: Biemann
    url: https://aclanthology.org/N16-1075.pdf
  - title: MOP Compound Splitter (MCS)
    authors:
      - given-names: Patrick
        family-names: Ziering
    type: software
    url: https://www.ims.uni-stuttgart.de/en/research/resources/tools/mcs/
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Sheean Spoel | s****l@u****l | 36 |
| Luka van der Plas | l****s@g****m | 13 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 2
- Total pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: 21 days
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- oktaal (1)
Pull Request Authors
- lukavdplas (3)
- oktaal (1)
Dependencies
- attrs ==19.3.0
- certifi ==2020.6.20
- chardet ==3.0.4
- click ==7.1.2
- flask ==1.1.2
- idna ==2.10
- iniconfig ==1.0.1
- itsdangerous ==1.1.0
- jinja2 ==2.11.2
- markupsafe ==1.1.1
- more-itertools ==8.4.0
- packaging ==20.4
- pluggy ==0.13.1
- py ==1.9.0
- pyparsing ==2.4.7
- pytest ==6.0.1
- requests ==2.24.0
- six ==1.15.0
- toml ==0.10.1
- urllib3 ==1.25.10
- werkzeug ==1.0.1
- Flask *
- pytest *
- requests *
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite
- requests *