Recent Releases of ticcltools

ticcltools - v0.11

  • require C++17
  • require latest ticcutils
  • Now we use NFC encoded Unicode strings everywhere
  • testrank script results were outdated since 0.10
  • removed dependency on libtar
  • added --follow option to TiCCL-indexer(NT)
  • several code refactorings and cleanups
  • adapted tests
  • updated GitHub CI

- C++
Published by kosloot about 1 year ago

ticcltools - v0.10

[Ko van der Sloot]

  • LDcalc:

    • no longer filter out n-grams with common parts; this was too aggressive
    • removed some more commented-out old code
  • chainclean: added a --caseless option (default is true)
  • removed Roaring versions of the code; they had lacked maintenance for years
  • internally shifting towards UnicodeString in general
  • a lot of C++ cleanup, with some refactoring and splitting up of long blobs of code

- C++
Published by kosloot about 3 years ago

ticcltools - v0.9

Ko van der Sloot:

  • LDcalc: removed code to filter out ngrams with common parts (experimental)

Maarten van Gompel:

  • Added Dockerfile: containerization support
  • Changed repository status to unsupported!

- C++
Published by proycon over 3 years ago

ticcltools - v0.8

  • using more recent functions from ticcutils
  • use more code from ticcl_common
  • attempt to solve https://github.com/LanguageMachines/ticcltools/issues/42
  • some small code refactoring

- C++
Published by kosloot about 4 years ago

ticcltools - v0.7.1

[Ko vd Sloot]

  • changed ICU requirement to at least 5.6
  • some refactoring
  • started implementing a solution for #42
  • added error message when the index file is empty

- C++
Published by proycon over 5 years ago

ticcltools - v0.7

[Martin Reynaert]

  • updated man pages
  • updated README.md

[Ko van der Sloot] Numerous bug fixes and additions. Added a .so for common functions.

The bitType has been changed to uint64_t (the biggest integer type possible), which triggered some code adaptations (values < 0 are not possible).
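As an illustration of why such a change forces code adaptations, here is a minimal sketch of how subtraction behaves on an unsigned 64-bit type. The alias and the helper names `abs_diff` and `naive_diff` are hypothetical and are not taken from the ticcltools sources; only the choice of `uint64_t` comes from the note above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical alias mirroring the release note; the real definition
// lives in the ticcltools sources and may differ.
using bitType = std::uint64_t;

// Naive subtraction: when a < b the result cannot become negative,
// it wraps around modulo 2^64 instead.
bitType naive_diff(bitType a, bitType b) {
  return a - b;
}

// Absolute difference that never forms a negative intermediate value:
// always subtract the smaller operand from the larger one.
bitType abs_diff(bitType a, bitType b) {
  return (a > b) ? a - b : b - a;
}
```

With these helpers, `naive_diff(5, 12)` yields a huge wrapped value rather than -7, while `abs_diff(5, 12)` yields 7. Code that previously relied on a signed result, or on a `value < 0` check, would need this kind of rewrite.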

  • TICCL-unk:

    • some changes in UNK detection
    • added a --hemp option
    • create a .fore.clean file when a background corpus is merged in
  • TICCL-stats:

    • added a -n option to use a newline as delimiter
  • TICCL-indexer(NT):

    • better and faster implementation
    • added --confstats option
  • TICCL-LDcalc:

    • added a --follow option for debugging purposes
    • fix for https://github.com/LanguageMachines/ticcltools/issues/30
    • added --low and --high parameters
  • TICCL-rank:

    • added a --follow option for debugging purposes
    • added --subtractartifrqfeature1 and --subtractartifrqfeature2 options
    • replaced pairs_combined ranking by median ranking
    • added an n-gram filter
  • TICCL-chain:

    • added --nounk option
    • fix for https://github.com/LanguageMachines/ticcltools/issues/38
    • fix for https://github.com/LanguageMachines/ticcltools/issues/37
    • use the alphabet file too with --alph
  • TICCL-chainclean: new module to clean chain ranked files

  • TICCL-anahash:

    • accept lexicons without frequencies too. (also simple word lists)
    • added a -o option

- C++
Published by kosloot almost 6 years ago

ticcltools - v0.6

Intermediate release, with a lot of new code to handle N-grams. Also, a lot of refactoring has been done, for clearer and more maintainable code. This is still work in progress.

  • TICCL-unk:

    • more extensive acronym detection
    • fixed artifreq problems in 'clean' punctuated words
    • added filters for 'unwanted' characters
    • added a ligature filter to convert evil ligatures
    • normalize all hyphens to a 'normal' one (-)
    • use a better definition of punctuation (unicode character class is not good enough to decide)
  • TICCL-lexstat:

    • the 'separator' symbol should get freq=0, so it isn't counted
    • the clip value is added to the output filename
  • TICCL-indexer:

    • indexer and indexerNT now produce the same output, using different strategies when a --foci file is used.
  • TICCL-LDcalc: major overhaul for n-grams

    • added an ngram point column to the output (so NOT backward compatible!)
    • produce a '.short' list for short word corrections
    • produce a '.ambi' file with a list of n-grams related to short words
    • prune a lot of ngrams from the output
  • TICCL-rank:

    • output is sorted now
    • honor the ngram-points from the new LDcalc. (so NOT backward compatible!)
  • TICCL-chain: new module to chain ranked files

  • TICCL-lexclean: added a -x option for 'inverse' alphabet

  • TICCL-anahash:

    • added a --list option to produce a list of words and anagram values
  • added metadata file: codemeta.json

- C++
Published by kosloot over 7 years ago

ticcltools - v0.5

  • updated configuration, also for Mac OSX
  • use of more ticcutils stuff: diacriticsfilter
  • added a TICCL-mergelex program
  • the OMPTHREADLIMIT environment variable was ignored sometimes
  • TICCL-unk:
    • fixed a problem in artifreq handling
    • changed acronym detection (work in progress)
    • added -o option
  • TICCL-lexstat:
    • added TTR output
    • added -o option
  • TICCL-indexer:
    • now also handles --foci file, with some speed-up
    • added a -t option
  • TICCL-LDcalc:
    • be less picky on a few wrong lines in the data
  • added some tests
  • when libroaring is installed we build roaring versions of some modules (experimental)
  • updated man pages

- C++
Published by kosloot about 8 years ago

ticcltools - v0.4

  • first official release.
    • added functions to test on Word2Vec datafiles
    • refactoring and modernizing stuff all around

- C++
Published by kosloot almost 9 years ago