Recent Releases of fugashi

fugashi - v1.3.0: M1 Wheels! Finally!

This release addresses one of the longest-standing issues, #55. Many thanks to @nikitalita for figuring out how to cross-compile MeCab for wheels.

There are no other changes.

- C++
Published by polm over 2 years ago

fugashi - v1.2.1: Python 3.11 Support

This release adds wheels for Python 3.11, with no other changes.

- C++
Published by polm about 3 years ago

fugashi - v1.2.0: Add nbestToNodeList, drop Python 3.6 and earlier

This release of fugashi adds one new feature: Tagger.nbestToNodeList returns the top N possible tokenizations of a string as node lists. Many thanks to @teowenshen for the implementation (#61).
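
As a rough sketch of how the new method might be used, the helper below collects the surface forms of each candidate tokenization. It assumes the method is called as `tagger.nbestToNodeList(text, n)` and returns one list of nodes per candidate; the helper name `nbest_surfaces` is hypothetical and not part of fugashi.

```python
def nbest_surfaces(tagger, text, n):
    """Return the surfaces of the top-n tokenizations of `text`.

    `tagger` is expected to be a fugashi Tagger (or any object with a
    compatible nbestToNodeList method). This helper is hypothetical and
    is not part of fugashi itself.
    """
    # Assumed shape: one list of nodes per candidate tokenization.
    candidates = tagger.nbestToNodeList(text, n)
    return [[node.surface for node in nodes] for nodes in candidates]
```

With fugashi and a dictionary installed, something like `nbest_surfaces(Tagger(), text, 3)` should give up to three alternative segmentations of `text`, best first.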

This release also drops support for Python 3.6 and earlier versions. While the current source should still work with 3.5 and 3.6, wheels are not provided, and it is recommended you upgrade your Python version to one that has not reached end-of-life status. If you must use an older version, you can continue using v1.1.2.

- C++
Published by polm over 3 years ago

fugashi - v1.1.2: Python 3.10 Support, Cleaner Builds

This release adds long overdue wheels for Python 3.10. There are no changes in functionality or API.

On the backend, in addition to fixing issues with the 3.10 version number and quoting, the build process was cleaned up considerably. Many thanks to @lambdadog for the bugfixes and cleanup!

This release does not include wheels for M1 Macs - those may be working, but I've been unable to confirm it. See #55 for details or to help out.

- C++
Published by polm about 4 years ago

fugashi - v1.1.1: Bug Fixes and API Cleanup

This release has a number of stability and API improvements.

  • fugashi-build-dict didn't work in its initial release; that has been fixed.
  • Calls to parseToNode no longer invalidate old node surfaces (#38)
  • Initialization errors now throw an Exception rather than printing output directly (https://github.com/explosion/spaCy/releases/tag/v3.0.7)

Note that the fix to #38 has a number of side effects that may need more extensive evaluation. In particular:

  • memory use will grow very slowly over the life of a Tagger object
  • execution speed will be a bit slower, up to around 10%

It's expected that these will both be addressed before long; despite the issues, the current fix has been deemed suitable for a release because in the vast majority of use cases it will behave more correctly than the previous release.

- C++
Published by polm over 4 years ago

fugashi - Experimental Support for Dictionary Building Added

One feature fugashi hasn't had until now is the ability to build user dictionaries. This feature can be important for improving tokenization quality in many applications. This release adds fugashi-build-dict, a wrapper for MeCab's mecab-dict-index command. You can use it like this:

fugashi-build-dict -d [system-dic-dir] -u mydic.dic input.csv

If you're familiar with MeCab's user dictionary creation process, nothing has changed; any feedback on use or any errors you encounter would be appreciated. If you're not familiar with the dictionary process, just wait a bit - a guide should be released soon.
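
If you want to drive this from Python, here is a minimal sketch that writes the input CSV and assembles the command line shown above. Both function names are hypothetical. The row layout (surface, left ID, right ID, cost, then feature columns) and the UTF-8 encoding are assumptions; the actual columns depend on your system dictionary, so check its documentation.

```python
import csv
from pathlib import Path


def write_user_dict_csv(path, rows):
    """Write user dictionary entries as CSV, one entry per row.

    Each row is assumed to be (surface, left_id, right_id, cost, *features).
    The number and meaning of the feature columns depend on the system
    dictionary, so treat this layout as an assumption.
    """
    with Path(path).open("w", encoding="utf-8", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)


def build_dict_command(system_dic_dir, user_dic, input_csv):
    """Return the argv list for the fugashi-build-dict call shown above."""
    return ["fugashi-build-dict", "-d", str(system_dic_dir),
            "-u", str(user_dic), str(input_csv)]
```

The resulting list can be passed to `subprocess.run(..., check=True)` so that a failed build raises an error instead of passing silently.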

- C++
Published by polm about 5 years ago

fugashi - fugashi v1.0.0

fugashi v1.0 has arrived. :confetti_ball:

This release does not include any major changes to the code. The main purpose of this release is to make it clear that the API has reached a point where it can remain stable moving forward. While there will surely be more patches to clean things up or add minor features, I don't have any major changes planned.

This release does include one small change: previously, __repr__ marked UNKs. This behavior is useful in some situations, but it's easier to add it to generic behavior than take it out, so I removed it. Now you can (mostly) reconstruct the input with ''.join([str(nn) for nn in nodes]).

Thanks for using fugashi, and if there's anything you'd like to see in it please feel free to open an issue.

- C++
Published by polm over 5 years ago

fugashi - Command line scripts and callable Taggers

This isn't a drastic release, but since I've been dragging out the patch numbers it seemed like a good time to bump the minor version. This is v0.2.0! :tada:

The first feature in this release is the addition of command line scripts. Since it's possible to install fugashi without MeCab, you might not have a command-line binary. This fixes that so you can use fugashi as a replacement for mecab. There's also the fugashi-info script, which is similar to mecab -D in that it prints dictionary information. I hope it will be useful when dealing with bugs and installation issues.

The other feature is that Tagger instances are now callable. One of the best features of fugashi is that it makes it much easier to work with MeCab nodes, but the function associated with that - parseToNodeList - had an unfortunately long name. I didn't want to call it parse since that already has meaning in MeCab, but giving it a different name felt odd... so I realized the easiest thing is to make the Tagger instance itself callable. Here's an example of the change this makes possible:

```python
from fugashi import Tagger
tagger = Tagger()

# before
for word in tagger.parseToNodeList(text):
    print(word.surface)

# after
for word in tagger(text):
    print(word.surface)
```

Feels better, doesn't it? I imagine this will be particularly helpful for compact expressions like list comprehensions. And parseToNodeList is still there, so existing code can be used unmodified.
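
For the curious, the pattern itself can be sketched in plain Python: `__call__` simply forwards to the long-named method. `WhitespaceTagger` and `Word` below are hypothetical stand-ins that split on whitespace; the real Tagger is built on MeCab and its implementation may differ.

```python
from collections import namedtuple

# Minimal stand-in for a MeCab node; real fugashi nodes carry much more.
Word = namedtuple("Word", ["surface"])


class WhitespaceTagger:
    """Hypothetical toy tagger that splits on whitespace."""

    def parseToNodeList(self, text):
        return [Word(surface) for surface in text.split()]

    def __call__(self, text):
        # Calling the instance is just shorthand for the long method name.
        return self.parseToNodeList(text)


tagger = WhitespaceTagger()
# The callable form fits naturally into a list comprehension.
surfaces = [word.surface for word in tagger("a toy example")]
```

Keeping the original method and delegating to it is what lets existing code keep working unmodified.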

Lately I've been working more on optimizing SudachiPy than fugashi, but there are still ease-of-use improvements to be made here, and if it works here it can be useful in other tokenizers too. If there's anything you'd like to see let me know.

- C++
Published by polm almost 6 years ago

fugashi - Bundled UniDic Support

This release adds support for installing UniDic from PyPI, whether the easy-to-install unidic-lite or the full-fledged unidic package. Special thanks to @chezou for helping with testing on Windows, which had quoting issues due to backslashes in paths.

This release greatly simplifies installing and using fugashi. Assuming no major issues are found, the next release should be 1.0.0.

- C++
Published by polm almost 6 years ago

fugashi - OSX Build Bugfix Release

This release includes a fix for builds on OSX. See #16 for details; thanks to @HiromuHota for the report and help with the fix.

- C++
Published by polm almost 6 years ago

fugashi - Version 0.1.10: Python 3.5+ and other features

This release includes a number of small fixes from 0.1.9 and two more significant changes.

Unidic 26 Field Format Support

Unidic has a surprising variety of formats, and the 26-field variety wasn't previously supported. This format includes kana accent information and is notably used in the binary distribution of Unidic 2.1.2.

Support for Python 3.5, 3.6

Support for these versions was initially removed due to their short remaining lifespan and lack of a default option in the namedtuple constructor. @tamuhey made the necessary changes to get them working so they're supported for now; thanks!
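
For context, the `defaults` argument to `namedtuple` only arrived in Python 3.7. The sketch below shows one common workaround for older versions, which is to set the defaults on the generated `__new__`. It is not necessarily the change that was made in fugashi, and the helper name and example fields are hypothetical.

```python
import sys
from collections import namedtuple


def namedtuple_with_none_defaults(name, fields):
    """Create a namedtuple type whose fields all default to None.

    `fields` must be a sequence of field names, not a single string.
    """
    if sys.version_info >= (3, 7):
        return namedtuple(name, fields, defaults=(None,) * len(fields))
    # Before 3.7 there is no `defaults` argument, so set the defaults on
    # the generated __new__ after the class has been created.
    cls = namedtuple(name, fields)
    cls.__new__.__defaults__ = (None,) * len(cls._fields)
    return cls


Features = namedtuple_with_none_defaults("Features", ["pos", "lemma", "kana"])
partial = Features("noun")  # lemma and kana fall back to None
```

Defaulting every field to None is handy when an entry carries fewer fields than the dictionary format nominally has.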

Other Changes

  • dummy mecabrc specification for bundled Unidic support (still a work in progress)
  • test fixes and documentation
  • deal with comma-separated values inside fields
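
To illustrate the last item, here is a small sketch of why a plain `split(',')` is not enough once a field contains commas. It assumes such fields are wrapped in double quotes, CSV style; the sample string is made up, `split_features` is a hypothetical helper, and fugashi's own handling may differ.

```python
import csv


def split_features(feature_raw):
    """Split a MeCab-style feature string, honoring double-quoted fields."""
    return next(csv.reader([feature_raw]))


# A naive split breaks the quoted field into two pieces.
naive = 'noun,"1,2",x'.split(",")
# The csv module keeps the quoted field together and drops the quotes.
fields = split_features('noun,"1,2",x')
```

Here `naive` has four items while `fields` has three, with the middle one holding `1,2` intact.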

Upcoming Changes

I'm working on creating a bundled version of Unidic. Modern versions of Unidic are too large to distribute via PyPI, so I'm figuring out the best way to distribute the data.

- C++
Published by polm almost 6 years ago

fugashi - Generic Dictionary Support in v0.1.8

v0.1.8 of fugashi adds support for generic dictionaries. You can now use IPADic or other dictionaries by using a GenericTagger the same way you would use the normal Tagger:

import fugashi
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/ipadic')

It's also possible to specify dictionary fields so you can get convenient access to features no matter what dictionary you use.

```
import fugashi

# the wrapper is just a namedtuple with a default value of None for all fields
MyDictFeatures = fugashi.createdictwrapper('MyDictFeatures', 'lemma alpha beta'.split())
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/customdic', MyDictFeatures)
nodes = tagger.parseToNodes('blah blah')
node = nodes[0]
print(node.lemma, node.alpha, node.beta)
```

Some other changes:

  • the raw feature string is now available as .feature_raw on nodes
  • packaging-related fixes
  • initial mecab-ko-dic (Korean) support; needs more testing

- C++
Published by polm about 6 years ago

fugashi - Fugashi v0.1.5

This update fixes two issues.

  • When Tagger() gets invalid arguments, throw an error
  • Specify Cython dependency correctly (#1)

Thanks to @zdyh for the dependency fix!

- C++
Published by polm about 6 years ago