Recent Releases of fugashi
fugashi - v1.3.0: M1 Wheels! Finally!
This release addresses one of the longest-standing issues, #55. Many thanks to @nikitalita for figuring out how to cross-compile MeCab for wheels.
There are no other changes.
- C++
Published by polm over 2 years ago
fugashi - v1.2.1: Python 3.11 Support
This release adds wheels for Python 3.11, with no other changes.
- C++
Published by polm about 3 years ago
fugashi - v1.2.0: Add nbestToNodeList, drop Python 3.6 and earlier
This release of fugashi adds one new feature: Tagger.nbestToNodeList returns the top N possible tokenizations of a string as node lists. Many thanks to @teowenshen for the implementation (#61).
This release also drops support for Python 3.6 and earlier versions. While the current source should still work with 3.5 and 3.6, wheels are not provided, and it is recommended you upgrade your Python version to one that has not reached end-of-life status. If you must use an older version, you can continue using v1.1.2.
- C++
Published by polm over 3 years ago
fugashi - v1.1.2: Python 3.10 Support, Cleaner Builds
This release adds long overdue wheels for Python 3.10. There are no changes in functionality or API.
On the backend, in addition to fixing issues with the 3.10 version number and quoting, the build process was cleaned up considerably. Many thanks to @lambdadog for the bugfixes and cleanup!
This release does not include wheels for M1 Macs - those may be working, but I've been unable to confirm it. See #55 for details or to help out.
- C++
Published by polm about 4 years ago
fugashi - v1.1.1: Bug Fixes and API Cleanup
This release has a number of stability and API improvements.
- fugashi-build-dict didn't work in its initial release; that has been fixed.
- Calls to parseToNode no longer invalidate old node surfaces (#38)
- Initialization errors now throw an Exception rather than printing output directly (https://github.com/explosion/spaCy/releases/tag/v3.0.7)
Note that the fix to #38 has a number of side effects that may need more extensive evaluation. In particular:
- memory use will grow very slowly over the life of a Tagger object
- execution speed will be a bit slower, up to around 10%
It's expected that these will both be addressed before long; despite the issues, the current fix has been deemed suitable for a release because in the vast majority of use cases it will behave more correctly than the previous release.
- C++
Published by polm over 4 years ago
fugashi - Experimental Support for Dictionary Building Added
One feature fugashi hasn't had until now is the ability to build user dictionaries. This feature can be important for improving tokenization quality in many applications. This release adds fugashi-build-dict, a wrapper for MeCab's mecab-dict-index command. You can use it like this:
fugashi-build-dict -d [system-dic-dir] -u mydic.dic input.csv
If you're familiar with MeCab's user dictionary creation process nothing has changed, so any feedback on use or any errors you encounter would be appreciated. If you're not familiar with the dictionary process, just wait a bit - a guide should be released soon.
- C++
Published by polm about 5 years ago
fugashi - fugashi v1.0.0
fugashi v1.0 has arrived. 🎊
This release does not include any major changes to the code. The main purpose of this release is to make it clear that the API has reached a point where it can remain stable moving forward. While there will surely be more patches to clean things up or add minor features, I don't have any major changes planned.
This release does include one small change: previously, __repr__ marked UNKs. This behavior is useful in some situations, but it's easier to add it to generic behavior than take it out, so I removed it. Now you can (mostly) reconstruct the input with ''.join([str(nn) for nn in nodes]).
Thanks for using fugashi, and if there's anything you'd like to see in it please feel free to open an issue.
- C++
Published by polm over 5 years ago
fugashi - Command line scripts and callable Taggers
This isn't a drastic release, but since I've been dragging out the patch numbers it seemed like a good time to bump the minor version. This is v0.2.0! 🎉
The first feature in this release is the addition of command line scripts. Since it's possible to install fugashi without MeCab, you might not have a command-line binary. This fixes that so you can use fugashi as a replacement for mecab. There's also the fugashi-info script, which is similar to mecab -D in that it prints dictionary information. I hope it will be useful when dealing with bugs and installation issues.
The other feature is that Tagger instances are now callable. One of the best features of fugashi is it makes it much easier to work with MeCab nodes, but the function associated with that - parseToNodeList - had an unfortunately long name. I didn't want to call it parse since that already has meaning in MeCab, but giving it a different name felt odd... so I realized the easiest thing is to make the Tagger instance itself callable. Here's an example of the change this makes possible:
```python
from fugashi import Tagger
tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"  # any Japanese text

# before
for word in tagger.parseToNodeList(text):
    print(word.surface)

# after
for word in tagger(text):
    print(word.surface)
```
Feels better, doesn't it? I imagine this will be particularly helpful for compact expressions like list comprehensions. And parseToNodeList is still there, so existing code can be used unmodified.
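As a rough illustration of how a callable tagger can be put together in plain Python, here is a minimal sketch. It assumes a hypothetical ToyTagger that splits on whitespace instead of calling MeCab, so it is not fugashi's actual implementation; it only shows `__call__` delegating to the longer-named method so that both spellings keep working.

```python
from collections import namedtuple

# Hypothetical stand-in for a MeCab node; real fugashi nodes carry more data.
ToyNode = namedtuple("ToyNode", ["surface"])


class ToyTagger:
    """Toy tagger that splits on whitespace (an assumption for illustration)."""

    def parseToNodeList(self, text):
        return [ToyNode(surface) for surface in text.split()]

    def __call__(self, text):
        # Calling the instance delegates to the long-named method.
        return self.parseToNodeList(text)


tagger = ToyTagger()
# The callable form fits naturally into a list comprehension.
surfaces = [word.surface for word in tagger("a callable tagger")]
```

Because `__call__` only forwards to the existing method, code written against the long name is unaffected.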
Lately I've been working more on optimizing SudachiPy than fugashi, but there are still ease-of-use improvements to be made here, and if it works here it can be useful in other tokenizers too. If there's anything you'd like to see let me know.
- C++
Published by polm almost 6 years ago
fugashi - Bundled UniDic Support
This release adds support for installing UniDic from PyPI, whether the easy-to-install unidic-lite or the full-fledged unidic package. Special thanks to @chezou for helping with testing on Windows, which had quoting issues due to backslashes in paths.
This release greatly simplifies installing and using fugashi. Assuming no major issues are found, the next release should be 1.0.0.
- C++
Published by polm almost 6 years ago
fugashi - OSX Build Bugfix Release
This release includes a fix for builds on OSX. See #16 for details; thanks to @HiromuHota for the report and help with the fix.
- C++
Published by polm almost 6 years ago
fugashi - Version 0.1.10: Python 3.5+ and other features
This release includes a number of small fixes from 0.1.9 and two more significant changes.
Unidic 26 Field Format Support
Unidic has a surprising variety of formats, and the 26-field variety wasn't previously supported. This format includes kana accent information and is notably used in the binary distribution of Unidic 2.1.2.
Support for Python 3.5, 3.6
Support for these versions was initially removed due to their short remaining lifespan and lack of a default option in the namedtuple constructor. @tamuhey made the necessary changes to get them working so they're supported for now; thanks!
Other Changes
- dummy mecabrc specification for bundled Unidic support (still a work in progress)
- test fixes and documentation
- deal with comma-separated values inside fields
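The comma handling mentioned in the last item can be illustrated with the standard library. This is only a sketch, assuming feature strings are CSV-like with double quotes around any field that contains a comma; the sample string is made up, and this is not necessarily how fugashi parses features internally.

```python
import csv


def split_features(feature_string):
    """Split a CSV-style feature string, honoring double-quoted fields."""
    return next(csv.reader([feature_string]))


# Made-up feature string: the second field contains a comma and is quoted.
sample = 'noun,"a,b",lemma'
naive = sample.split(",")        # breaks the quoted field apart
fields = split_features(sample)  # keeps "a,b" as a single field
```

A plain `split(",")` yields four pieces here, while the CSV-aware version yields the intended three fields.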
Upcoming Changes
I'm working on creating a bundled version of Unidic. Modern versions of Unidic are too large to distribute via PyPI, so I'm figuring out the best way to distribute the data.
- C++
Published by polm almost 6 years ago
fugashi - Generic Dictionary Support in v0.1.8
v0.1.8 of fugashi adds support for generic dictionaries. You can now use IPADic or other dictionaries by using a GenericTagger the same way you would use the normal Tagger:
import fugashi
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/ipadic')
It's also possible to specify dictionary fields so you can get convenient access to features no matter what dictionary you use.
```
import fugashi

# the wrapper is just a namedtuple with a default value of None for all fields
MyDictFeatures = fugashi.create_dict_wrapper('MyDictFeatures', 'lemma alpha beta'.split())
tagger = fugashi.GenericTagger('-d/usr/local/lib/mecab/dic/customdic', MyDictFeatures)
nodes = tagger.parseToNodes('blah blah')
node = nodes[0]
print(node.lemma, node.alpha, node.beta)
```
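As a rough sketch of what "a namedtuple with a default value of None for all fields" can look like, the hypothetical helper below (make_wrapper is an illustrative name, not a fugashi function) sets the defaults through `__new__.__defaults__`. That approach also works on Python versions whose namedtuple constructor has no defaults option.

```python
from collections import namedtuple


def make_wrapper(name, fields, default=None):
    """Build a namedtuple class whose fields all default to `default`."""
    wrapper = namedtuple(name, fields)
    # One default per field, so any trailing fields may be omitted.
    wrapper.__new__.__defaults__ = (default,) * len(wrapper._fields)
    return wrapper


MyDictFeatures = make_wrapper("MyDictFeatures", "lemma alpha beta".split())
# A dictionary entry that only supplies the first field still works.
features = MyDictFeatures("taberu")
```

With defaults in place, entries that provide fewer fields than the wrapper declares simply leave the remaining attributes as None instead of raising an error.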
Some other changes:
- the raw feature string is now available as .feature_raw on nodes
- packaging-related fixes
- initial mecab-ko-dic (Korean) support; needs more testing
- C++
Published by polm about 6 years ago
fugashi - Fugashi v0.1.5
This update fixes two issues.
- When Tagger() gets invalid arguments, throw an error
- Specify Cython dependency correctly (#1)
Thanks to @zdyh for the dependency fix!
- C++
Published by polm about 6 years ago