Recent Releases of wordfreq
wordfreq - v3.0.2: packaging fixes
- Updated the range of allowable versions of regex. Versions before 2021.7.6 don't have the regex.Match class.
- Added the extras dependencies as optional dependencies in pyproject.toml.
- Python
Published by rspeer almost 4 years ago
wordfreq - v3.0: The "handle numbers better" release
Previously, wordfreq would group all digit sequences of the same 'shape', with length 2 or more, into a single token and return the frequency of that token, which would be a vast overestimate.
Now it distributes the frequency over all numbers of that shape, with an estimated distribution that allows for Benford's law (lower numbers are more frequent) and a special frequency distribution for 4-digit numbers that look like years (2010 is more frequent than 1020).
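As a rough illustration of the idea (not wordfreq's actual code), the sketch below spreads the frequency of the two-digit shape "00" over the numbers "10" through "99". The function names are hypothetical, and it assumes that the leading digit follows Benford's law while the second digit is uniform. The year-specific distribution for 4-digit numbers is not modeled here.

```python
import math


def benford_first_digit_prob(digit: int) -> float:
    """Benford's law: probability that a number's leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def distribute_two_digit_frequency(shape_frequency: float) -> dict[str, float]:
    """Split the frequency of the shape "00" over the numbers "10".."99".

    Hypothetical simplification: the leading digit follows Benford's law and
    the second digit is treated as uniform. Numbers with a leading zero are
    left out of this sketch.
    """
    frequencies = {}
    for first in range(1, 10):
        # Each leading digit gets its Benford share, split evenly over 10 numbers.
        share = shape_frequency * benford_first_digit_prob(first) / 10
        for second in range(10):
            frequencies[f"{first}{second}"] = share
    return frequencies


freqs = distribute_two_digit_frequency(0.01)
# Numbers with a lower leading digit receive a larger share of the frequency.
print(freqs["10"] > freqs["90"])
# The individual shares add back up to the frequency of the whole shape.
print(round(sum(freqs.values()), 6))
```

Because the Benford probabilities for digits 1 through 9 sum to 1, the per-number estimates never exceed the frequency of the shape as a whole, which is what avoids the overestimate described above.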
More changes related to digits:
- Functions such as iter_wordlist and top_n_list no longer return multi-digit numbers (they used to return them in their "smashed" form, such as "0000").
- lossy_tokenize no longer replaces digit sequences with 0s. That happens instead in a place that's internal to the word_frequency function, so we can look at the values of the digits before they're replaced.
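To make the "smashed" form concrete, here is a minimal sketch of replacing multi-digit sequences with zeros while keeping the original digits available. The pattern and function names are assumptions made for illustration. They are not wordfreq's internals, which may treat separators and non-ASCII digits differently.

```python
import re

# Hypothetical pattern: any run of two or more ASCII digits.
MULTI_DIGIT_RE = re.compile(r"[0-9]{2,}")


def smash_digits(token: str) -> str:
    """Replace every digit in a multi-digit run with '0', keeping the length."""
    return MULTI_DIGIT_RE.sub(lambda m: "0" * len(m.group()), token)


def shape_and_digits(token: str) -> tuple[str, str]:
    """Return the smashed shape for lookup, alongside the untouched token.

    Keeping the original token is what lets a caller inspect the digit
    values before they are replaced.
    """
    return smash_digits(token), token


print(shape_and_digits("2010"))
print(smash_digits("7"))
```

Single digits are left alone in this sketch, matching the statement that only digit sequences of length 2 or more are grouped by shape.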
Other changes:
- wordfreq is now developed using poetry as its package manager, and with pyproject.toml as the source of configuration instead of setup.py.
- The minimum version of Python supported is 3.7.
- Type information is exported using py.typed.
- Python
Published by rspeer almost 4 years ago
wordfreq - v2.5.1
Version 2.5.1 (2021-09-02)
- Import ftfy and use its uncurl_quotes method to turn curly quotes into straight ones, providing consistency with multiple forms of apostrophes.
- Set minimum version requirements on regex, jieba, and langcodes so that tokenization will give consistent results.
- Work around an inconsistency in the msgpack API around strict_map_key=False.
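The effect of straightening quotes can be sketched with the standard library alone. The mapping below is a hypothetical stand-in that covers only the four most common curly quotes. It is not ftfy's implementation of uncurl_quotes, which handles more characters.

```python
# Hypothetical mapping covering only the most common curly quotes.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, often used as an apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with their straight ASCII equivalents."""
    return text.translate(QUOTE_TABLE)


# Both spellings of the apostrophe normalize to the same string,
# so they would map to the same entry in a word list.
print(straighten_quotes("don\u2019t") == straighten_quotes("don't"))
```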
Version 2.5 (2021-04-15)
- Incorporate data from the OSCAR corpus.
- Python
Published by rspeer almost 5 years ago