Recent Releases of ucto
ucto - v0.34
[Maarten van Gompel]
* fall back when the local config dir cannot be checked for whatever reason https://github.com/LanguageMachines/ucto/issues/97
* extract custom configuration directory if provided, and fall back to that for includes https://github.com/LanguageMachines/ucto/issues/96
* needs ticcutils >= 0.35
[Ko van der Sloot]
* force use of C++17
* minor code updates
* streamlined GitHub CI file
* adapted some foliatests to recent libfolia versions
* refactored tests:
  - all shell scripts have the .sh extension now
  - use folialint or foliadiff to check FoLiA results
- C++
Published by kosloot over 1 year ago
ucto - v0.30
[Ko van der Sloot]
* using ticcutils >= 0.34. All Unicode is NFC normalized now
* normalization is performed for passthru too. All output should be in the same normalization form (NFC)
* fixed a problem when using the API from Frog
* improved code quality
* added a (dangerous, compile-time only) option to change the magic 'tokconfig-' value
[Maarten van Gompel]
* README.md: added demo screencast
- C++
Published by kosloot over 2 years ago
ucto - v0.27
[Ko van der Sloot]
* removed dependency on libtar
* fixed build when HAVE_TEXTCAT was not set. Improved guards against missing textcat support
[Maarten van Gompel]
* guard against uninitialized/missing textcat (https://github.com/proycon/python-frog#22)
* require latest libfolia, ticcutils and a more recent libxml2
- C++
Published by proycon over 3 years ago
ucto - v0.26
[Ko van der Sloot]
* some code quality improvements
* fix for https://github.com/LanguageMachines/ucto/issues/89
* updated configure.ac
* updated GitHub action
[Maarten van Gompel]
* added MAINTAINERS
* updated codemeta.json
* fix for https://github.com/fbkarsdorp/homebrew-lamachine/issues/17
- C++
Published by kosloot over 3 years ago
ucto - v0.25
[Ko van der Sloot]
* added a test for https://github.com/LanguageMachines/ucto/issues/87
* adapted to latest update in tokconfig-fra (uctodata 0.9)
* deal with unknown languages (as detected by ucto), using iso-639-3 'und' (https://github.com/LanguageMachines/ucto/issues/86)
* don't tokenize unknown languages
* configurable sentence splitter for "und" text
* added tests
* added code to set the separator (--seperators), so ucto can split on more than just spaces
* migrated test wrapper to Python 3 (was still on 2.7)
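The idea of splitting on more than just spaces can be illustrated with a minimal C++17 sketch. This is not ucto's implementation: `split_on` is a hypothetical helper, it works byte-wise on plain strings (so it only suits single-byte separators), and it says nothing about how the command-line option parses its argument.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, NOT part of ucto's API: split `text` on any
// character listed in `separators`, dropping empty pieces.
// Assumption: separators are single-byte (ASCII) characters.
std::vector<std::string> split_on(std::string_view text,
                                  std::string_view separators) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (start < text.size()) {
        // Position of the next separator, or npos when none is left.
        const std::size_t end = text.find_first_of(separators, start);
        const std::size_t stop =
            (end == std::string_view::npos) ? text.size() : end;
        if (stop > start) {
            parts.emplace_back(text.substr(start, stop - start));
        }
        start = stop + 1;  // skip the separator itself
    }
    return parts;
}
```

Calling `split_on("a b,c;;d", " ,;")` yields the four pieces `a`, `b`, `c` and `d`; consecutive separators do not produce empty tokens.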
[Maarten van Gompel]
* set up a Dockerfile
* added build-deps.sh to automatically download, build and install dependencies
* updated software metadata (codemeta.json) to latest requirements as proposed in CLARIAH
* deprecated options -f and -x; they still work but are no longer advertised and give a deprecation notice (https://github.com/LanguageMachines/ucto/issues/88)
* textcat.cfg is now searched for in the user config dir as well as the global config; also allow running without textcat if the config is missing entirely (same as if not compiled in)
* added support for user-based configuration dirs ($XDG_CONFIG_HOME/ucto), which take precedence over global data dirs
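The lookup order described here (user config dir first, global dir second) can be sketched in C++17 roughly as follows. This is an illustration under assumptions, not ucto's code: `candidate_dirs` and `find_config` are hypothetical names, the global directory is passed in by the caller, and the sketch skips the user directory when `XDG_CONFIG_HOME` is unset rather than applying any default.

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: build the search list, user directory first.
// `xdg_config_home` is the value of $XDG_CONFIG_HOME (may be null).
std::vector<fs::path> candidate_dirs(const char* xdg_config_home,
                                     const fs::path& global_dir) {
    std::vector<fs::path> dirs;
    if (xdg_config_home != nullptr && *xdg_config_home != '\0') {
        dirs.push_back(fs::path(xdg_config_home) / "ucto");
    }
    dirs.push_back(global_dir);  // global data dir comes last
    return dirs;
}

// Hypothetical helper: return the first directory entry that holds
// `filename` as a regular file, or nothing if no candidate has it.
std::optional<fs::path> find_config(const std::string& filename,
                                    const std::vector<fs::path>& dirs) {
    for (const auto& dir : dirs) {
        std::error_code ec;  // non-throwing overload: unreadable dirs are skipped
        const fs::path candidate = dir / filename;
        if (fs::is_regular_file(candidate, ec)) {
            return candidate;
        }
    }
    return std::nullopt;
}
```

A caller might use `find_config("textcat.cfg", candidate_dirs(std::getenv("XDG_CONFIG_HOME"), global_dir))` (with `<cstdlib>` included), where `global_dir` is whatever global data directory the installation uses. An empty result then corresponds to the "config is missing entirely" case.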
- C++
Published by proycon almost 4 years ago
ucto - v0.24
- fix for https://github.com/LanguageMachines/ucto/issues/84
- added a solution for https://github.com/LanguageMachines/ucto/issues/53 (only partly)
- added some UnicodeString members to the API
- bumped library version to 6.0, because of API changes
- code cleanup and refactoring
- C++
Published by kosloot over 4 years ago
ucto - v0.23
- added support for the new 'tag' feature in FoLiA, only for tag="token"
- fixed a problem with '-T full' option not always adding text
- use the new TextPolicy class from libfolia
- fix for https://github.com/LanguageMachines/ucto/issues/81
- fix for https://github.com/LanguageMachines/ucto/issues/82
- added code to handle several Unicode joiners
- replaced TravisCI by GitHub action
- %include files may have an extension now
- added tests for new features
- C++
Published by kosloot almost 5 years ago
ucto - v0.17
Bug-fix release:
- solved problems when tokenizing (partly) tokenized FoLiA (this is a very complicated situation and might need more work)
- solved problems with --passthru on FoLiA
- avoid empty lines in FoLiA output
- use the new generate_id attribute for provenance/processors
- added more tests
KNOWN PROBLEM: On TravisCI/MacOSX some tests fail for unclear reasons.
- C++
Published by kosloot almost 7 years ago
ucto - v0.14
[Ko van der Sloot]
* updated usage() and removed -S option (never used)
* make sure the right textclass is assigned to <w> nodes in FoLiA
* minor code fixes/refactorings
* added more tests
* updated man.1 page
[Maarten van Gompel]
* updated README.md
[Iris Hendrickx]
* updated and extended the manual
- C++
Published by kosloot over 7 years ago
ucto - v0.13
[Ko van der Sloot]
* improved configure/build/test
* added a --split option
* fixed -P option
* removed -S option (never used, and only half implemented)
* added a --add-tokens option, to add special tokens for the default language
* generally use the icu:: namespace
* added more tests
* fixed uninitialized variable
* added code to use an alternative search-path for uctodata
[Maarten van Gompel]
* added codemeta.json
- C++
Published by kosloot about 8 years ago
ucto - v0.9.7
- added textredundancy option, default is 'minimal'
- small adaptations to work with FoLiA 1.5 specs
- set textclass on words when outputclass != inputclass
- DON'T filter special characters when inputclass == outputclass
- -F (folia input) is automatically set for .xml files
- more robust against texts with embedded tabs, etc.
- more and better tests added
- better logging and error messaging
- improved language handling. TODO: Language detection in FoLiA
- bug fixes:
- correctly handle xml-comment inside a
- better id generation when parent has no id
- better reaction on overly long 'words'
- C++
Published by kosloot over 8 years ago
ucto - v0.9.6
- Moving data files from etc/ to share/, as they are more data files than configuration files that should be edited. Requires uctodata >= 0.4. Should solve debian packaging issues (#18)
- Minor updates to the manual (#2)
- Some refactoring/code cleanup, temper expectations regarding ucto's date-tagging abilities (#16, thanks also to @sanmai-NL)
- C++
Published by proycon over 9 years ago
ucto - v0.9.4
Major update
- Language support
  - added support for multiple languages
  - auto detection of languages using textcat
- some refactoring
- no more call to exit()
- better logging and warning messages
- some FoLiA output improvements
- bug fixes
  - in passthru
  - issue #11
- C++
Published by kosloot over 9 years ago
ucto - v0.9.0
Major update
- now use uctodata for language-specific information; ucto itself only supports a generic tokenizer
- interactive use now uses the readline library
- accept long options --help and --version
- UTF16BE now works
- better support for crooked Windows files in general
- added a --normalize option to map tokens in a certain TokenClass to its generic name
- C++
Published by kosloot almost 10 years ago