Releases | Open Source Science

hanlp - v2.1.1 Ancient Chinese Support

After supporting 130 languages, HanLP has officially released an open-source Ancient Chinese model. This model supports automatic word segmentation, lemmatization, part-of-speech tagging, and dependency parsing for Ancient Chinese. Thanks to multi-task learning, this single model can handle all of these tasks, as well as coarse-grained/fine-grained segmentation and UPOS/XPOS/PKU part-of-speech tagging sets.

Blog post: https://www.hankcs.com/nlp/hanlp-ancient-chinese-processing-model-released.html
Demo: https://github.com/hankcs/HanLP/blob/master/plugins/hanlpdemo/hanlpdemo/lzh/demo_mtl.py
Performance: https://hanlp.hankcs.com/docs/api/hanlp/pretrained/mtl.html#hanlp.pretrained.mtl.KYOTOEVAHANTOKLEMPOSUDEPLZH
Visualization: (screenshot from hankcs.com, 2025-01-12)

Full Changelog: https://github.com/hankcs/HanLP/compare/v2.1.0...v2.1.1

- Python
Published by hankcs over 1 year ago

hanlp - v2.1.0 English Support

What's Changed

Release an English MTL model with ModernBERT encoder: EN_TOK_LEM_POS_NER_SRL_UDEP_SDP_CON_MODERNBERT_BASE
Enhance Security Practices for HanLP Based on OpenSSF Scorecard by @Fix3dP0int in https://github.com/hankcs/HanLP/pull/1931

New Contributors

@Fix3dP0int made their first contribution in https://github.com/hankcs/HanLP/pull/1931

Full Changelog: https://github.com/hankcs/HanLP/compare/v2.1.0-beta.62...v2.1.0

- Python
Published by hankcs over 1 year ago

hanlp - v1.8.6 Routine Maintenance

What's Changed

Update the custom dictionary in the Portable edition fix: https://github.com/hankcs/HanLP/issues/1936
Clean up Predefine
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.6

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.6</version> </dependency>

Full Changelog: https://github.com/hankcs/HanLP/compare/v1.8.5...v1.8.6

:tada: Thanks to all users who offered valuable suggestions in the issues!
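
The md5 value listed next to data-for-1.7.5.zip can be used to check a downloaded copy of the data package. A minimal Python sketch, assuming the archive has already been saved locally; the helper names `md5_of_file` and `verify_data_package` are illustrative and not part of HanLP:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_data_package(path, expected_md5):
    """Return True when the file's MD5 matches the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_data_package("data-for-1.7.5.zip", "1d9e1be4378b2dbc635858d9c3517aaa")` would return True only for an intact download, assuming the file sits in the current directory.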

- Python
Published by hankcs over 1 year ago

hanlp - v1.8.5 Routine Maintenance

What's Changed

Fix a possible inconsistency in the mini bigram model on the first segmentation after JRE initialization fix: https://github.com/hankcs/HanLP/issues/1851#issuecomment-1767808746
Fix unexpected segmentation in ViterbiSegment caused by the DoubleArrayTrie not being replaced when a custom dictionary is loaded by @wxy929629 in https://github.com/hankcs/HanLP/pull/1835
fix: correct a calculation error in CWSEvaluator when comparing segmented sentences by @webSue in https://github.com/hankcs/HanLP/pull/1853
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.5

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.5</version> </dependency>

New Contributors

@wxy929629 made their first contribution in https://github.com/hankcs/HanLP/pull/1835

Full Changelog: https://github.com/hankcs/HanLP/compare/v1.8.4...v1.8.5

- Python
Published by hankcs over 1 year ago

hanlp - v2.1.0-beta.62 Routine Release

What's Changed

Release mMiniLMv2L12 version of MTL on UD210
Release a small MTL model trained on our new corpora
Multi-process compatible loader
Support new versions of tensorflow and numpy
Add support for Python 3.10
Implementation of "Graph Pre-training for AMR Parsing and Generation"
Let PipeLine support copy() by @Vela-zz in https://github.com/hankcs/HanLP/pull/1861

New Contributors

@Vela-zz made their first contribution in https://github.com/hankcs/HanLP/pull/1861

Full Changelog: https://github.com/hankcs/HanLP/compare/v2.1.0-beta.0...v2.1.0-beta.62

- Python
Published by hankcs over 1 year ago

hanlp - v1.8.4 Routine Maintenance

Treat < and > as delimiters fix https://bbs.hankcs.com/t/topic/4527
Add a configuration method to Segment for enabling or disabling normalization close https://github.com/hankcs/HanLP/issues/1714
Fix the scorer.boost bug in the score calculation of the text suggestion scorers fix: https://github.com/hankcs/HanLP/issues/1718
bugfix: fix the loop exiting early during full segmentation with the BinTrie by @carl10086 in https://github.com/hankcs/HanLP/pull/1775
Custom dictionaries support the .tsv format fix: https://github.com/hankcs/HanLP/issues/1785
Fix the passing of custom dictionary path arguments fix: https://github.com/hankcs/HanLP/issues/1799
Add enableFastBuild to DoubleArrayTrie by @qiangwang in https://github.com/hankcs/HanLP/pull/1805
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.4

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.4</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!

New Contributors

@carl10086 made their first contribution in https://github.com/hankcs/HanLP/pull/1775
@qiangwang made their first contribution in https://github.com/hankcs/HanLP/pull/1805

Full Changelog: https://github.com/hankcs/HanLP/compare/v1.8.3...v1.8.4

- Python
Published by hankcs over 3 years ago

hanlp - v1.8.3 Routine Maintenance

Fix the interaction between the dynamic custom dictionary and CustomDictionaryForcing fix https://github.com/hankcs/HanLP/issues/1712
Adjust the pinyin of 莎 to sha1,suo1 fix https://github.com/hankcs/HanLP/issues/1670
Determine the default frequency of out-of-vocabulary words dynamically from the total word frequency
next of LongestSearcher in DoubleArrayTrie supports null as a value by @tiandiweizun in https://github.com/hankcs/HanLP/pull/1674
Update the comments in DoubleArrayTrie.java by @TITC in https://github.com/hankcs/HanLP/pull/1699
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.3

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.3</version> </dependency>

Full Changelog: https://github.com/hankcs/HanLP/compare/v1.8.2...v1.8.3

New Contributors

@TITC made their first contribution in https://github.com/hankcs/HanLP/pull/1699

:tada: Thanks to all users who offered valuable suggestions in the issues!

- Python
Published by hankcs over 4 years ago

hanlp - v2.1.0-beta 104 languages, 10 tasks, dual backends

We are proud to announce the beta release of HanLP 2.1, which now offers 10 joint tasks on 104 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.

- Python
Published by hankcs over 4 years ago

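hanlp - v1.8.2 Routine Maintenance and Accuracy Improvements

Adjust the formula, raising Viterbi segmentation accuracy from 94.49 to 94.69 https://bbs.hankcs.com/t/topic/136/61?u=hankcs
Improve the HMM sampling function https://bbs.hankcs.com/t/topic/136/64?u=hankcs
Support disabling automatic refresh of the dictionary cache (CustomDictionaryAutoRefreshCache=false) fix https://github.com/hankcs/HanLP/issues/1655
Fix the reload method of CoreDictionary
Revise the bigram model
Revise the Simplified-Traditional mapping table
Correct the final of lve4 to ve fix https://github.com/hankcs/HanLP/issues/1644
Fix CustomDictionary.reload() fix https://github.com/hankcs/HanLP/issues/1635
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.2

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.2</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!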

- Python
Published by hankcs almost 5 years ago

hanlp - v1.8.1 Routine Maintenance and Fixes

Fix convertToPinyinList fix https://github.com/hankcs/HanLP/issues/1634
Fix incorrect normalization of some characters in CharTable fix https://github.com/hankcs/HanLP/issues/1615
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.1

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.1</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!

- Python
Published by hankcs about 5 years ago

hanlp - v1.8.0 Multiple Instances and Supplementary Characters

Refactor CustomDictionary to support multiple instances https://github.com/hankcs/HanLP/issues/1339
Support supplementary characters such as 𩽾𩾌 (ān kāng) fix https://github.com/hankcs/HanLP/issues/1564
Fix CoreStopWordDictionary.dictionary.clear() fix https://github.com/hankcs/HanLP/issues/1603
Prevent the double-array trie from failing to transition states when given a blank key fix https://bbs.hankcs.com/t/dat/3196/8
Add the hot-reload method CoreDictionary.reload() fix https://github.com/hankcs/HanLP/issues/1594
Add KBeamArcEagerDependencyParser(String modelPath, String cwsModelPath, String posModelPath) fix https://github.com/hankcs/HanLP/issues/1585
Fix Sentence.create on a compound word consisting of a single word
Back up the parameters when constructing HiddenMarkovModel fix https://github.com/hankcs/HanLP/issues/1530
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.8.0

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.8.0</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!

- Python
Published by hankcs over 5 years ago

hanlp - v2.1.0-alpha 104 languages, 10 tasks, dual backends

We are proud to announce the release of HanLP 2.1, which now offers 10 joint tasks on 104 languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning representation (AMR) parsing.

- Python
Published by hankcs over 5 years ago

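hanlp - v1.7.8 Routine Maintenance

CharType uses IOAdapter fix https://github.com/hankcs/HanLP/issues/1480
Complete the files of the portable edition
Add the custom entry “雄安” (Xiong'an)
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.7.8

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.8</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!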

- Python
Published by hankcs almost 6 years ago

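hanlp - v1.7.7 Routine Maintenance and Multiple Improvements

Improve atomic segmentation fix https://github.com/hankcs/HanLP/issues/1421
Fix the exception raised when the number of clusters exceeds the number of documents fix https://github.com/hankcs/HanLP/issues/1397
Use a constructor in place of the static NERInstance.create to make subclassing easier
Remove the mapping 幺=么 fix https://github.com/hankcs/HanLP/issues/1427
CRFModel supports getting all tags
Fix AbstractClassifier.enableProbability fix https://github.com/hankcs/HanLP/issues/1423
Expose the internal members of CWSEvaluator.Result fix https://bbs.hankcs.com/t/topic/887
Make the members of the HMM public
The data package remains compatible with data-for-1.7.5.zip, md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.7.7

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.7</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!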

- Python
Published by hankcs about 6 years ago

hanlp - v2.0.0-alpha.0 NLP for the next decade

HanLP 2.0 embraces state-of-the-art natural language processing with deep learning and massive unlabeled corpora. Featured updates are:

Easy model building and serving with TensorFlow 2.0 and Keras.
Multilingual Support.
Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface.

Currently, HanLP 2.0 is in alpha stage with more killer features on the roadmap. For news and updates, join our forum.

- Python
Published by hankcs over 6 years ago

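hanlp - v1.7.6 The Last Samurai

A whole new era lies ahead, and our journey is toward the stars and the sea. From now on the 1.x branch will continue to receive stability maintenance. The two versions target different scenarios: 2.0 is based on deep learning and targets scenarios with extremely high accuracy requirements, such as end-to-end question answering solutions, while 1.x is based on traditional machine learning and feature engineering and targets scenarios with high speed requirements, such as search engines. 2.0 needs time to be polished; 1.x will be maintained continuously to ensure stability.

Add the method DocVectorModel.nearest(java.lang.String, int) fix https://github.com/hankcs/HanLP/issues/1332
Lexical analyzers now handle spaces fix https://github.com/hankcs/HanLP/issues/797
Revise the supplementary Modern Chinese lexicon fix https://github.com/hankcs/HanLP/issues/1330
NGramDictionaryMaker and similar classes default to UTF-8 encoding fix https://github.com/hankcs/HanLP/issues/1320
WordVectorModel supports custom Map types: https://github.com/hankcs/HanLP/issues/1304
Fix a division-by-zero error in the information entropy calculation fix https://github.com/hankcs/HanLP/issues/1366
Fix the thread safety of Nature
For TF-IDF, the IDF data can be obtained by loading an IDF file
Expose CoreStopWordDictionary.dictionary https://github.com/hankcs/HanLP/issues/1356
Fix loading a custom stop-word file having no effect
Compatible with the data package data-for-1.7.5.zip (or mirror, or network drive), md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.7.6

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.6</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!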
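
The division-by-zero fix in the entropy calculation concerns the case where the total count is zero. A minimal Python sketch of an entropy function with that guard, as an illustration only; the function name and the choice to return 0.0 for empty input are assumptions, not HanLP's Java code:

```python
import math


def entropy(counts):
    """Shannon entropy (natural log) of a mapping from item to count."""
    total = sum(counts.values())
    if total == 0:
        # Without this guard, count / total below would divide by zero.
        return 0.0
    h = 0.0
    for count in counts.values():
        if count > 0:  # skip zero counts, since log(0) is undefined
            p = count / total
            h -= p * math.log(p)
    return h
```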

- Python
Published by hankcs over 6 years ago

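hanlp - v1.7.5 Companion Code for the Book 《自然语言处理入门》 (Introduction to Natural Language Processing)

The new book 《自然语言处理入门》 (Introduction to Natural Language Processing) has been released; you are welcome to browse its companion code.

It is an NLP primer that starts from zero, gives equal weight to basic theory and production code, and provides implementations in both Python and Java. Starting from basic concepts, it introduces step by step the algorithmic principles and engineering implementations of several popular problems: Chinese word segmentation, part-of-speech tagging, named entity recognition, information extraction, text clustering, text classification, and syntactic parsing. By explaining multiple algorithms, the book compares their strengths, weaknesses, and applicable scenarios, and it demonstrates mature production-grade code in detail, helping you truly apply natural language processing in production environments. The book carries forewords and recommendations from Xia Zhihong, founding chair of the Department of Mathematics at Southern University of Science and Technology; Zhou Ming, deputy director of Microsoft Research Asia; Li Hang, director of the ByteDance AI Lab; Liu Qun, chief scientist of speech and semantics at Huawei Noah's Ark Lab; Wang Bin, head of the Xiaomi AI Lab and chief NLP scientist; Zong Chengqing, researcher at the Institute of Automation, Chinese Academy of Sciences; Liu Zhiyuan, associate professor at Tsinghua University; Zhang Huaping, associate professor at Beijing Institute of Technology; and 52nlp. Thank you to all these senior teachers. I hope this project and this book can become a “butterfly effect” in everyone's engineering and study, helping everyone transform into butterflies on the road of NLP.

The forum 蝴蝶效应 (Butterfly Effect) is online! Registration is open for a limited time. It is for discussing HanLP usage and reader feedback, in a format freer than GitHub
DocVectorModel supports a custom segmenter and turning the stop-word filter on or off fix https://github.com/hankcs/HanLP/issues/1253#issuecomment-515501521
Treat line breaks, spaces, and the like as CT_OTHER fix https://github.com/hankcs/HanLP/issues/1283
Fix the repeated bisection clustering algorithm fix https://github.com/hankcs/HanLP/issues/1260#issuecomment-519441039
Make CoreStopWordDictionary.apply return its result
Fix the enableCustomDictionaryForcing method of Analyzer fix https://github.com/hankcs/HanLP/issues/1221
New data package data-for-1.7.5.zip (or mirror), md5=1d9e1be4378b2dbc635858d9c3517aaa
Portable edition upgraded in step to v1.7.5

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.5</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!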

- Python
Published by hankcs over 6 years ago

hanlp - v1.7.4 Simplified-Traditional Conversion Fully Consistent with OpenCC

Losslessly convert the OpenCC dictionaries, with identical results https://github.com/hankcs/OpenCC-to-HanLP fix https://github.com/hankcs/HanLP/issues/1184
The stop-word dictionary supports hot updating: fix https://github.com/hankcs/HanLP/issues/1158
Correct the regular expression in URLTokenizer fix https://github.com/hankcs/HanLP/issues/1188
Fix custom parts of speech fix https://github.com/hankcs/HanLP/issues/1172
Correct CollectionUtility.sortMapByValue(java.util.Map, boolean) fix https://github.com/hankcs/HanLP/issues/1159
Revise the person name dictionary
Correct the A tag of “始##始” in role tagging fix https://github.com/hankcs/HanLP/issues/434
Add unit tests for com.hankcs.hanlp.utility.MathUtilityTest and com.hankcs.hanlp.algorithm.EditDistance
Fine-tune the bigram model fix https://github.com/hankcs/HanLP/issues/1015
New data package data-for-1.7.4.zip (or overseas mirror, or network drive), md5=0e2e1bfc4da6d9305909ce815cbe5a44
Portable edition upgraded in step to v1.7.4

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.4</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!

- Python
Published by hankcs almost 7 years ago

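hanlp - v1.7.3 Routine Maintenance

The perceptron lexical analyzer defaults to the large model trained on six months of the 1998 People's Daily corpus
Optimize DoubleArrayTrie fix https://github.com/hankcs/HanLP/issues/1136
CRFNERecognizer accepts custom named entity labels at construction; add the addNERLabels method @zhangruinan
Prevent unnecessary initialization of ViterbiSegment.dat
Fix the lexical analyzers' handling of dynamically inserted entries fix https://github.com/hankcs/HanLP/issues/271#issuecomment-479719965
The seg interface of the lexical analyzers lets custom parts of speech override statistical ones fix https://github.com/hankcs/HanLP/issues/1156
Revise pinyin
New data package data-for-1.7.3.zip (or network drive), md5=4e4f3695565a75b56427ba4a40731949
Portable edition upgraded in step to v1.7.3

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.3</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!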

- Python
Published by hankcs about 7 years ago

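hanlp - v1.7.2 New Syntactic Parsing Module and Multiple Improvements

Add a beam-search dependency parser based on the ArcEager transition system; deprecate MaxEntDependencyParser
Adjust the Traditional Chinese segmentation strategy fix https://github.com/hankcs/HanLP/issues/1059
Fix an integer overflow in the chi-square test, improving accuracy (95.47->96.08) fix https://github.com/hankcs/HanLP/issues/1075
Make LexicalAnalyzer support TranslatedPersonRecognition and JapanesePersonRecognition fix https://github.com/hankcs/HanLP/issues/1080
Warn that online learning cannot learn new labels
Make seg2sentence of the tokenizers static
Lexical analyzers disable the rule system by default
Correct CustomDictionary.reload(); fix https://github.com/hankcs/HanLP/issues/1100
Fine-tune the unigram and bigram models
New data package data-for-1.7.2.zip (or network drive), md5=2228732bae47b8dc8e410678af72847f
Portable edition upgraded in step to v1.7.2

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.2</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!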

- Python
Published by hankcs over 7 years ago

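hanlp - v1.7.1 Fast Caching and Dynamic Dictionaries

Add a Viterbi segmenter with a customizable user dictionary @AnyListen
Speed up cache generation with BufferedOutputStream, making it 37 times faster
Custom dictionaries are compatible with paths containing spaces fix https://github.com/hankcs/HanLP/issues/1025
Add the isCustomNature method
Make the cache files produced by hot updating include user-defined parts of speech fix https://github.com/hankcs/HanLP/issues/1028
Fix the entrySet method of the mutable DAT fix https://github.com/hankcs/HanLP/issues/1038
Fine-tune the n-gram model, the Simplified-Traditional dictionaries, and others
New data package data-for-1.7.1.zip MD5 = 9b8faa7fc7fddb24e27da27bd404126d
Portable edition upgraded in step to v1.7.1

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.1</version> </dependency>

Thanks to all users who offered valuable suggestions in the issues!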

- Python
Published by hankcs over 7 years ago

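hanlp - v1.7.0 New Text Clustering and Pipeline Segmentation

:triangular_flag_on_post: Add a text clustering module (k-means and repeated bisection)
:triangular_flag_on_post: Lexical analyzers gain a pipeline mode
Add rules to the lexical analyzers: enableRuleBasedSegment https://github.com/hankcs/HanLP/issues/991
Support specifying the data path through a JVM startup parameter: java -DHANLP_ROOT=/opt/hanlp loads /opt/hanlp/data https://github.com/hankcs/HanLP/issues/983
Sentence splitting in segmentation supports specifying the splitting granularity https://github.com/hankcs/HanLP/issues/1018
CustomDictionary.insert("新词语", "词性标签") (new word, part-of-speech tag) supports omitting the frequency
The NeuralNetworkDependencyParser constructor accepts a Segment
TextRankKeyword can be constructed from any segmenter
Optimize the double-array trie, which now shrinks automatically to minimal memory after construction https://github.com/hankcs/HanLP/issues/984
Revise the Simplified-Traditional dictionaries
Fine-tune the n-gram and nr models
New data package data-for-1.7.0.zip MD5 = 4c396f3039230ddfcef20865264512b1
Portable edition upgraded in step to v1.7.0

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.7.0</version> </dependency>

:tada: Happy holidays! Thanks to all users who offered valuable suggestions in the issues!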

- Python
Published by hankcs over 7 years ago

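hanlp - v1.6.8 The World's Largest Chinese Corpus

The new model is trained on a large comprehensive corpus of 100 million characters, currently the largest Chinese word segmentation corpus in the world. Corpus size determines real-world performance, and we hope a corpus of this scale draws attention to the work of corpus construction. You are welcome to try this improvement through the NLPTokenizer.analyze interface or PerceptronLexicalAnalyzer.
Fix the problem caused by “improve UV splitting of person names” fix https://github.com/hankcs/HanLP/issues/932
Do not filter features when the chi-square test in text classification fails fix https://github.com/hankcs/HanLP/issues/920
Deprecate HMMSegment
Revise the Simplified-Traditional dictionaries
New data package data-for-1.6.8.zip md5=0eae09571f080bd99b81f79bee6c6b62
Portable edition upgraded in step to v1.6.8

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.8</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!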

- Python
Published by hankcs almost 8 years ago

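hanlp - v1.6.7 Default Models Trained on a Revised Microsoft Research Corpus

The default perceptron segmentation model is trained on the MSRA Named Entity Corpus
In low-priority user dictionary mode the lexical analyzers merge the statistical segmentation results; in high-priority mode they use longest matching
The user dictionary of the lexical analyzers overrides the part-of-speech tagger's results: https://github.com/hankcs/HanLP/issues/525
Improve UV splitting of person names fix https://github.com/hankcs/HanLP/issues/880
Fix MaxEntDependencyParser fix https://github.com/hankcs/HanLP/issues/914
Add tools for TF and TF-IDF statistics and keyword extraction
Adapt word2vec to IOAdapter and clusters fix https://github.com/hankcs/HanLP/issues/903
Add more parameters to HanLP.extractWords
Add the NERTrainer.tagSet member for the convenience of Python users
Add more corpus manipulation interfaces to Sentence
LinearModel displays compression progress
Fine-tune the person name, bigram, and other models
Revise the Simplified-Traditional dictionaries; revise the place name dictionary against the 2016 administrative division data from the National Bureau of Statistics
New data package data-for-1.6.7.zip md5=4da338b7bcf3939a70b8cc16ed338c45
Portable edition upgraded in step to v1.6.7

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.7</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!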

- Python
Published by hankcs almost 8 years ago

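hanlp - v1.6.6 A CRF Lexical Analyzer with 10x Faster Decoding

Refactor the CRF model into a log-linear model that reuses the Viterbi decoding algorithm of the perceptron framework, making it 10 times faster
Officially deprecate CRFSegment and delete CRFSegmentModel.txt.bin
Syntactic parsers use NLPTokenizer by default
Fix organization name recognition by role tagging under the new Nature framework: https://github.com/hankcs/HanLP/issues/870
Old and new models are incompatible; please download the new data package data-for-1.6.6.zip md5=aea7194670d89f920d59a592568c88ad
Portable edition upgraded in step to v1.6.6

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.6</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!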

- Python
Published by hankcs almost 8 years ago

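hanlp - v1.6.5 Cross-Platform Stable Custom Parts of Speech

Pre-release (beta)

Refactor the Nature enum into a class, avoiding reflection and staying compatible with the latest JDK: https://github.com/hankcs/HanLP/issues/866
Add a perceptron classifier and use it to implement gender recognition from person names
Add first-order and second-order HMMs
Add a Chinese word segmentation evaluation tool
Support using the environment variable HANLP_ROOT in place of root in hanlp.properties
Improve the stability of IOUtil when reading blank files; support UTF-8 files with a BOM
IOUtil.loadDictionary supports marking a default part of speech for a whole dictionary
DoubleArrayTrieSegment and AhoCorasickDoubleArrayTrieSegment can be constructed from dictionary paths
Correct character normalization in the perceptron lexical analyzer when named entity recognition is not performed @wangzhe258369
Fine-tune the person name recognition model and delete erroneous entries
Revise CharTable, removing the unreasonable conversion between 橙子 (orange) and 橘子 (tangerine) @linuxsong
Data package data-for-1.6.4.zip md5=8b5b944f89c4052d0552bf8ad7479010. To get the latest data package, fork the repository and git clone the latest data from it.
Portable edition upgraded in step to v1.6.5

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.5</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!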
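
The HANLP_ROOT item above lets an environment variable stand in for the root entry of hanlp.properties. A minimal Python sketch of that lookup order, for illustration only: the function names are hypothetical, and the assumption that the environment variable wins whenever it is set is ours, not a statement of how HanLP's Java code resolves the path:

```python
import os


def parse_root(properties_text):
    """Return the value of the `root` key in Java-properties-style text, or None."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not key=value pairs
        key, value = line.split("=", 1)
        if key.strip() == "root":
            return value.strip()
    return None


def resolve_root(properties_text, environ=None):
    """Prefer HANLP_ROOT from the environment (assumed precedence), else `root`."""
    environ = os.environ if environ is None else environ
    return environ.get("HANLP_ROOT") or parse_root(properties_text)
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.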

- Python
Published by hankcs almost 8 years ago

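hanlp - v1.6.4 Routine Maintenance

Optimize CorpusLoader and the design of MutableFeatureMap
Optimize new word discovery so that results contain no delimiters: https://github.com/hankcs/HanLP/issues/826
Speed up the TextRank keyword extraction algorithm @hlstudio
Support .csv when hot-updating the user dictionary @patrick_lin
Make word vector loading more robust: https://github.com/hankcs/HanLP/issues/821
Correct the pinyin dictionary against Baidu Hanyu and the online Cihai @AnyListen
Revise the stop-word dictionary @duohappy
Fix problems that occurred when the lexical analyzers disabled the user dictionary; fix the interaction between the lexical analyzers' seg interface and named entity recognition: https://github.com/hankcs/pyhanlp/issues/15#issuecomment-382583304 ; correct the multithreaded averaging of the structured perceptron
Fine-tune the person name recognition model and add month vocabulary
Data package data-for-1.6.4.zip md5=8b5b944f89c4052d0552bf8ad7479010. To get the latest data package, fork the repository and git clone the latest data from it.
Portable edition upgraded in step to v1.6.4

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.4</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!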

- Python
Published by hankcs about 8 years ago

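hanlp - v1.6.3 Dynamic User Dictionaries, Custom Parts of Speech, and Priority

Lexical analyzers support user entries inserted dynamically with CustomDictionary.insert
Lexical analyzers support custom parts of speech in the user dictionary
Lexical analyzers support enableCustomDictionaryForcing to raise the priority of the user dictionary
NLPTokenizer uses the perceptron lexical analyzer by default
Complete the mappings for circled numbers @AnyListen
Expose the feature extraction methods of named entity recognition
TextRankKeyword uses the filter of CoreStopWordDictionary
Remove the BXD pattern from person name recognition and optimize Japanese person name recognition
Fix problems caused by enabling multiple configuration options in ViterbiSegment
Fine-tune the bigram model and the person name recognition model
The data package remains compatible with data-for-1.6.2.zip md5=3ebb9e47ecff740f09c9ec7c21324661. To get the latest data package, fork the repository and git clone the latest data from it.
Portable edition upgraded in step to v1.6.3

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.3</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!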

- Python
Published by hankcs about 8 years ago

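hanlp - v1.6.2 Lexical Analyzers Support Dictionaries, Simplified-Traditional Conversion, and Index Mode

All lexical analyzers support user dictionaries, Simplified-Traditional conversion, offsets, and full-segmentation index mode (requires updating the models and CharTable)
Upgrade CRF segmentation to a CRF lexical analyzer that supports training and is compatible with CRF++
Refactor the lexical analyzers to provide a unified interface.
HanLP.newSegment accepts an algorithm name and constructs the corresponding segmenter
Sentence supports translating part-of-speech tags, for beginners who cannot remember the short codes
Sentence supports output in brat standoff format: http://brat.nlplab.org/standoff.html
Fix the LongestSearcher of DoubleArrayTrie
Revise the lexicon and CharTable and fine-tune the person name recognition model, resolving: https://github.com/hankcs/HanLP/issues/772
New data package data-for-1.6.2.zip md5=3ebb9e47ecff740f09c9ec7c21324661. To get the latest data package, fork the repository and git clone the latest data from it.
Portable edition upgraded in step to v1.6.2

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.2</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!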

- Python
Published by hankcs about 8 years ago

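hanlp - v1.6.1 Routine Maintenance

Evaluate perceptron segmentation performance; correct a problem in the perceptron lexical analyzer with blank strings
Perceptron named entity recognition supports arbitrary NER types; expose getters for the CWS, POS, and NER components of the lexical analyzer
Fix a possible problem when iterating over MutableDoubleArrayTrieInteger
Optimize the heuristic rules of role-tagging person name recognition
Sentence splitting supports granularity
Fine-tune the bigram and person name recognition models
Still compatible with the data package data-for-1.6.0.zip md5=38d19afa881ddb00b213f4680259ce68. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.
Portable edition upgraded in step to v1.6.1

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.1</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!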

- Python
Published by hankcs about 8 years ago

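hanlp - v1.6.0 Perceptron Lexical Analyzer and Dynamic Double-Array Trie

:triangular_flag_on_post: “A Perceptron-Based Framework for Chinese Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition”
:triangular_flag_on_post: “Dynamic Double-Array Trie”
New data package data-for-1.6.0.zip md5=38d19afa881ddb00b213f4680259ce68. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.
Portable edition upgraded in step to v1.6.0

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.6.0</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!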

- Python
Published by hankcs about 8 years ago

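hanlp - v1.5.4 Routine Maintenance

Optimize the efficiency of DoubleArrayTrieSegment
Deprecate CRFDependencyParser: https://github.com/hankcs/HanLP/issues/730
Correct the Tag method of CRF: https://github.com/hankcs/HanLP/issues/703#issuecomment-355587377
Report failure to load the part-of-speech transition matrix of the core dictionary with an IllegalArgumentException: https://github.com/hankcs/HanLP/issues/747
Fine-tune the bigram, person name, and organization name recognition models; revise the Traditional->Taiwan dictionary: https://github.com/hankcs/HanLP/issues/756#issuecomment-362503432
The data package remains compatible with data-for-1.5.3.zip (network drive in China, or overseas link) md5=cadc96db94c3df070855706bb0f8429e. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.

Portable edition upgraded in step to v1.5.4

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.5.4</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!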

- Python
Published by hankcs over 8 years ago

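hanlp - v1.5.3 Happy New Year

The segmenter's thread count defaults to the number of CPU cores on the system
Index mode lets you choose the minimum granularity of segmentation results: https://github.com/hankcs/HanLP/issues/670
Recognize numbers with thousands separators; fix toString() in BaseNode
Fine-tune the person name recognition model and the n-gram model; revise the supplementary Modern Chinese lexicon and the Simplified-Traditional lexicon
Make word2vec command-line argument parsing compatible with the original: https://github.com/hankcs/HanLP/issues/699
Correct the Tag method of CRF: https://github.com/hankcs/HanLP/issues/703
Fix a word2vec caching problem: https://github.com/hankcs/HanLP/issues/718
New word discovery filtering uses LinkedList: https://github.com/hankcs/HanLP/issues/724
Uniformly throw new IllegalArgumentException when model loading fails; see: https://github.com/hankcs/HanLP/issues/477 https://github.com/hankcs/HanLP/issues/116
The data package remains compatible with data-for-1.5.3.zip (network drive in China, or overseas link) md5=cadc96db94c3df070855706bb0f8429e. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.

Portable edition upgraded in step to v1.5.3

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.5.3</version> </dependency>

:tada: Thanks to all contributors and all users who offered valuable suggestions in the issues!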
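
Recognizing numbers with thousands separators, as listed above, can be illustrated with a regular expression. The pattern below is a minimal Python sketch written for this note, not the rule HanLP uses; it assumes comma separators, groups of exactly three digits, and an optional decimal part:

```python
import re

# A grouped number such as 1,234,567 (optionally with decimals),
# or else a plain run of digits (optionally with decimals).
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def find_numbers(text):
    """Return every number found in text, keeping grouped numbers whole."""
    return NUMBER.findall(text)
```

With this pattern, "1,234,567" is kept as one token, while "12,34" is read as two separate numbers because its second group does not have three digits.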

- Python
Published by hankcs over 8 years ago

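hanlp - v1.5.2 Routine Maintenance

Optimize the loading speed of CommonDictionary
Improve robustness when custom entries begin or end with a space
The data package remains compatible with data-for-1.3.3.zip (network drive in China, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.

Portable edition upgraded in step to v1.5.2

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.5.2</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!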

- Python
Published by hankcs over 8 years ago

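hanlp - v1.5.1 Routine Maintenance

Reduce the memory footprint of the new word discovery module: https://github.com/hankcs/HanLP/issues/667
Optimize word2vec and fix problems related to the Vector class: https://github.com/hankcs/HanLP/issues/669
Refactor EnumItemDictionary, dropping the legacy two-step loading of .trie and .dat files in favor of a single unified .bin load
The data package remains compatible with data-for-1.3.3.zip (network drive mirror, China Telecom download, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c

Portable edition upgraded in step to v1.5.1

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.5.1</version> </dependency>

:tada: Thanks to all users who offered valuable suggestions in the issues!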

- Python
Published by hankcs over 8 years ago

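hanlp - v1.5.0 New Word Recognition and Word/Document Vector Modules

:triangular_flag_on_post: “Word Vectors”
:triangular_flag_on_post: “New Word Recognition”
The data package remains compatible with data-for-1.3.3.zip (network drive in China, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.
Portable edition upgraded in step to v1.5.0

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.5.0</version> </dependency>

:tada: Thanks to the company 大快 for open-sourcing the new word recognition and word2vec modules!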

- Python
Published by hankcs over 8 years ago

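hanlp - v1.4.0 New Text Classification and Sentiment Analysis Modules

:triangular_flag_on_post: Please refer to the document “Text Classification and Sentiment Analysis”
The data package remains compatible with data-for-1.3.3.zip (network drive in China, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.
Portable edition upgraded in step to v1.4.0

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.4.0</version> </dependency>

:tada: Thanks to the company 大快 for open-sourcing the text classification module!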

- Python
Published by hankcs over 8 years ago

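hanlp - v1.3.5 New Features, Optimizations, and Maintenance

Substantially optimize CRF segmentation and second-order HMM segmentation; refactor CharacterBasedGenerativeModelSegment @TylunasLi
Custom dictionaries support hot updating: https://github.com/hankcs/HanLP/issues/563 ; the n-gram model supports hot loading: https://github.com/hankcs/HanLP/issues/580
Add a switch that raises the priority of the user dictionary: https://github.com/hankcs/HanLP/issues/633
Support the compound word corpus format of the 1998 People's Daily, such as "[中央/n 人民/n 广播/vn 电台/n]nt"
Expose the maximum iteration count parameter of TextRank keyword extraction: https://github.com/hankcs/HanLP/issues/577
Add an equal method to Term @AnyListen
Strengthen TextRankKeyword for words that are close within the extraction window @tiandiweizun
Text summarization methods support custom sentence delimiters @wangdong
Improve the robustness of the Aho-Corasick automaton and add the hasKeyword interface @fnaith
Fix the problem caused by calling BinTrie.remove with a nonexistent key: https://github.com/hankcs/HanLP/issues/540
Resolve a problem triggered under the mini model when all named entity recognition and numeral recognition are enabled together: https://github.com/hankcs/HanLP/issues/542
Add mappings for superscript and subscript characters to CharTable.txt @AnyListen
Treat non-printable characters such as “\t” as delimiters: https://github.com/hankcs/HanLP/issues/584
Split Chinese numerals from Arabic numerals @jian.li
Correct a string length error in full-width year recognition, correct errors in the number recognition utility, and add test code. Support reading text files that contain a BOM. @TylunasLi
Proofread CoreNatureDictionary.txt, deleting erroneous words that begin with a semicolon: https://github.com/hankcs/HanLP/issues/221#issuecomment-313594433
Fix a bug in the toString method of CoNLLWord @xu2333
Fine-tune the person name recognition model: https://github.com/hankcs/HanLP/issues/562 ; remove the D tag from high-frequency verbs in the model to lower the false hit rate; transliterated person name recognition is no longer triggered by foreign place names
Fix Nature.fromString and IOUtil.loadDictionary: https://github.com/hankcs/HanLP/issues/626
Correct the Simplified-Traditional one-to-many mapping check table, pinyin, and others
The data package remains compatible with data-for-1.3.3.zip (network drive in China, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.

Portable edition upgraded in step to v1.3.5

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.5</version> </dependency>

:tada: Thanks to all contributors and all users who offered valuable suggestions in the issues!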
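
The compound word corpus format mentioned above wraps several word/tag pairs in brackets and attaches a tag to the whole group. A minimal Python sketch of reading such a line, assuming whitespace-separated tokens and ASCII tags; the function `parse_sentence` and its tuple layout are illustrative and do not mirror HanLP's Sentence class:

```python
import re

# Either a bracketed compound "[w1/t1 w2/t2]tag" or a simple "word/tag" token.
TOKEN = re.compile(r"\[([^\]]+)\]/?(\w+)|(\S+)/(\w+)")


def parse_sentence(line):
    """Parse a corpus line into (surface, tag, parts) tuples.

    parts is a list of (word, tag) pairs for a compound and None for a simple word.
    """
    items = []
    for m in TOKEN.finditer(line):
        if m.group(1) is not None:
            # Split the bracket contents into their own word/tag pairs.
            parts = [tuple(tok.rsplit("/", 1)) for tok in m.group(1).split()]
            surface = "".join(word for word, _ in parts)
            items.append((surface, m.group(2), parts))
        else:
            items.append((m.group(3), m.group(4), None))
    return items
```

Applied to "[中央/n 人民/n 广播/vn 电台/n]nt", the sketch yields a single item whose surface form is 中央人民广播电台 with the tag nt and four inner word/tag pairs.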

- Python
Published by hankcs over 8 years ago

hanlp - v1.3.4 Fixes for Resin and Some Cluster IO

Adapt CoreStopWordDictionary to IOAdapter in cluster environments: https://github.com/hankcs/HanLP/issues/530
Fix readBytesFromOtherInputStream on HDFS: https://github.com/hankcs/HanLP/issues/536#issuecomment-302918045
Resolve the IO exception with a custom IOAdapter under Resin: https://github.com/hankcs/HanLP/issues/528
Correct TextUtility.isAllSingleByte: https://github.com/hankcs/HanLP/issues/526
Correct the parts of speech of “每xx” words in the core dictionary: https://github.com/hankcs/HanLP/pull/524
The data package remains compatible with data-for-1.3.3.zip (network drive in China, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c. To get the latest data package, fork the repository and git clone https://github.com/YourName/HanLP.git.

Portable edition upgraded in step to v1.3.4

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.4</version> </dependency>

Special Thanks

@hx78 @realgzq @junphine @cicido @AnyListen

:tada: Thanks to all contributors and all users who offered valuable suggestions in the issues!

- Python
Published by hankcs about 9 years ago

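hanlp - v1.3.3 Routine Maintenance

The CharType binary is generated automatically by the program; all dictionaries and models in the repository are now stored and maintained as plain text
Support comma-separated .csv dictionaries (thanks @driventokill)
Remove the main methods used to load corpora and train models, for the convenience of Spring users: https://github.com/hankcs/HanLP/issues/391
During organization name recognition, words keep their own part of speech rather than that of 未##团: https://github.com/hankcs/HanLP/issues/403#issuecomment-281859486
Add some methods that make corpus processing easier
Organization name recognition restricts nrf to transliteration-style prefixes of feature words, and removes particles such as "的" and "之" that cannot form part of an organization name
Correct one pinyin entry (thanks @mudsu)
Remove logically redundant statements from TextRankKeyword (thanks @jsksxs360)
Optimize index segmentation, using lexicographic order to keep the order of subcomponents stable: https://github.com/hankcs/HanLP/issues/496#issuecomment-298007743 ; improve the completeness of index segmentation and fix various problems in it (thanks to @gxy0451 and @panhaidong for their issues)
Fine-tune the BiGram model, the person name recognition model, and the organization name recognition model
Remove the file existence check from the portable edition so that it fully supports the root configuration item and IOAdapter. Users of older versions who run into compatibility problems should refer to the upgrade guide
New data package data-for-1.3.3.zip (network drive mirror, China Telecom download, or overseas link) md5=71f6fbbcde4ad70b5b97d4a01ca03c3c

Portable edition upgraded in step to v1.3.3

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.3</version> </dependency>

:tada: Thanks to all contributors and all users who offered valuable suggestions in the issues!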

- Python
Published by hankcs about 9 years ago

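hanlp - v1.3.2 Happy New Year

:gift:

1. Upgrade pattern string matching in organization name recognition from AhoCorasick to AhoCorasickDoubleArrayTrie
2. Give a friendlier hint about path configuration problems for the neural network dependency parsing model
3. Index mode supports full segmentation with the user dictionary
4. The default stop-word filter no longer filters single characters
5. Fine-tune the organization name recognition model and the person name recognition model
6. Revise the Simplified-Traditional dictionaries
7. New data package data-for-1.3.2.zip (network drive mirror or China Telecom download)
8. Portable edition upgraded in step to v1.3.2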

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.2</version> </dependency>

- Python
Published by hankcs over 9 years ago

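hanlp - v1.3.1 Routine Maintenance

Migrate all static dependency parsing models to the memory pool
Fix the merging logic of custom dictionaries
The data package remains compatible with data-for-1.3.0.zip
Portable edition upgraded in step to v1.3.1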

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.1</version> </dependency>

- Python
Published by hankcs over 9 years ago

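hanlp - v1.3.0 New IO Interface, Memory Pool, Taiwan Traditional, and Hong Kong Traditional

Unify the IO interface: implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface to run HanLP on different platforms (Hadoop, Redis, etc.)
New memory pool: cache large models whenever memory suffices, and release them automatically otherwise
Support thorough conversion among Simplified, Traditional, Taiwan Traditional, and Hong Kong Traditional Chinese, covering "one Simplified to many Traditional" and "one Traditional to many Simplified" mappings
Pinyin conversion can optionally keep original characters that have no pinyin: https://github.com/hankcs/HanLP/issues/307#issuecomment-241611797
Change the character type of line breaks to delimiter
New data package: data-for-1.3.0.zip
Portable edition upgraded in step to v1.3.0, Maven: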

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.3.0</version> </dependency>

- Python
Published by hankcs over 9 years ago

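hanlp - v1.2.11 Routine Maintenance

The portable edition splits paths with pathSeparator and appends a trailing / automatically
Adjust the Traditional-Simplified dictionaries
Fine-tune the person name and organization name recognition models
Adjust the character normalization table, now maintained as text
HMM part-of-speech tagging remains supported after user-defined parts of speech are enabled dynamically
Fix custom parts of speech on some JVMs
Minor optimization: for words already in the core dictionary, the user dictionary overrides their attributes directly
New data package: data-for-1.2.11.zip
Portable edition upgraded in step to v1.2.11, Maven: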

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.11</version> </dependency>

- Python
Published by hankcs almost 10 years ago

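hanlp - v1.2.10 Support for Custom Parts of Speech

Implement user-defined parts of speech, which can be added both dynamically in code and through dictionary files; see the demo
Implement URL recognition, supporting most IANA top-level domains including ".中国"
BinTrie implements the Externalizable interface and can be serialized directly
Correct the remove method of BinTrie
Minor optimization of DoubleArrayTrie
Add user dictionary support to NShortSegment
Correct the pinyin lexicon
Roll back to the old Simplified-Traditional dictionaries and adjust the Simplified-Traditional segmentation logic
Manually proofread several words and their parts of speech
New data package: data-for-1.2.10.zip
Portable edition upgraded in step to v1.2.10, Maven: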

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.10</version> </dependency>

- Python
Published by hankcs almost 10 years ago

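hanlp - v1.2.9 Routine Maintenance

Correct the count of hidden states in the HMM transition matrix and the calculation of transition probabilities
Fine-tune the place name recognition algorithm
Improve numeral recognition and fix potential problems it caused
Fix problems in the person name recognition module
Extend the documentation and tidy the code
The data package remains compatible with the standard edition data-for-1.2.8-standard.zip or the full edition data-for-1.2.8-full.zip; overseas users are free to use the dedicated OneDrive link for overseas users
Portable edition upgraded in step to v1.2.9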

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.9</version> </dependency>

- Python
Published by hankcs about 10 years ago

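hanlp - v1.2.8 Happy New Year

Add some interfaces to TextRankKeyword and optimize heap sort to implement TopN
Add a fun “synonym rewriting” feature: DemoRewriteText
CoreStopWordDictionary supports custom filtering logic
Make the neural network parser more robust to out-of-vocabulary part-of-speech tags
Allow users in certain extreme cases (a nonstandard Java virtual machine, lack of relevant knowledge, etc.) to use a configuration file at an absolute path
When the user dictionary conflicts with the core dictionary, further guarantee the priority of the user dictionary
Fine-tune the person name and organization name recognition models
Fine-tune the Simplified-Traditional conversion dictionaries
New data set: the standard edition data-for-1.2.8-standard.zip or the full edition data-for-1.2.8-full.zip; overseas users are free to use the dedicated OneDrive link for overseas users
Portable edition upgraded in step to v1.2.8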

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.8</version> </dependency>
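
Selecting the top N keywords with a heap, as the TextRankKeyword item above describes, can be sketched with a bounded min-heap. This Python sketch illustrates the general technique under our own naming; it is not a port of HanLP's Java implementation:

```python
import heapq


def top_n(scores, n):
    """Return the n highest-scoring words, best first, using a heap of size n."""
    if n <= 0:
        return []
    heap = []  # min-heap of (score, word); the weakest kept candidate is heap[0]
    for word, score in scores.items():
        if len(heap) < n:
            heapq.heappush(heap, (score, word))
        elif score > heap[0][0]:
            # Replace the weakest candidate, keeping the heap at size n.
            heapq.heapreplace(heap, (score, word))
    return [word for _, word in sorted(heap, reverse=True)]
```

Keeping only n entries in the heap costs O(m log n) for m candidates, compared with O(m log m) for sorting every score.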

- Python
Published by hankcs over 10 years ago

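hanlp - v1.2.7 A Dependency Parser Based on a Neural Network Model

Add NeuralNetworkDependencyParser, a discriminative dependency parser based on a neural network classification model and a transition system, together with its model file
Add a streaming ByteArrayStream, halving memory usage during deserialization
CoNLLSentence supports iteration with for
Refactor all dependency parsers
Fine-tune the Japanese person name and organization name models
Newly trained CRF segmentation model, incompatible with older versions
New data package: data-for-1.2.7.zip
Portable edition upgraded in step to v1.2.7, Maven: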

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.7</version> </dependency>

- Python
Published by hankcs over 10 years ago

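hanlp - v1.2.6 User Dictionary Priority and Dictionary Support in CRF Segmentation

Improvement: the custom dictionary takes priority over the core dictionary
Greatly expand the Simplified-Traditional divergence dictionary, improving support for Simplified-Traditional conversion and Traditional Chinese segmentation
CoreStopWordDictionary does not filter null parts of speech
Add custom dictionary support to CRFSegment
Fix potential problems in BinTrie and SegmentWrapper
Manual fine-tuning of some models and dictionaries
The data package remains compatible with data-for-1.2.4.zip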

- Python
Published by hankcs over 10 years ago

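hanlp - v1.2.5 Traditional Chinese and CRF Segmentation Optimizations

Add some utilities and expose dynamic read and write access to the internal lexicons
CRFModel supports BiGram Feature Templates, becoming a general-purpose model class
Add the removeAllSentences method to Suggester
Optimize Traditional Chinese segmentation
Improve punctuation handling in CRF segmentation
The data package remains compatible with data-for-1.2.4.zip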

- Python
Published by hankcs over 10 years ago

hanlp - v1.2.4

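Change the role of the user dictionary: after segmentation, it is used to merge adjacent words
KeywordExtractor excludes spaces, line breaks, and the like
Improve the place name recognition module's handling of short place names
Provide friendlier error messages during dictionary loading
Disable character normalization by default
Update the search for the closest numbers between two arrays to an O(n)-time algorithm
Automatically verify that the cache of CoreNatureDictionary.ngram.txt is consistent with the cache of CoreNatureDictionary.txt
Fine-tune the dictionaries; latest data set: data-for-1.2.4.zip
Portable edition upgraded in step to v1.2.4, Maven: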

<dependency> <groupId>com.hankcs</groupId> <artifactId>hanlp</artifactId> <version>portable-1.2.4</version> </dependency>
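
The O(n) search for the closest numbers between two arrays can be illustrated with a two-pointer scan. The Python sketch below assumes both arrays are already sorted in ascending order, which the release note does not state, and returns the smallest absolute difference; it is one common way to reach linear time, not necessarily the algorithm HanLP adopted:

```python
def min_distance(a, b):
    """Smallest |x - y| with x from a and y from b; both sorted ascending, non-empty."""
    if not a or not b:
        raise ValueError("both arrays must be non-empty")
    i = j = 0
    best = abs(a[0] - b[0])
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        if a[i] < b[j]:
            i += 1  # only a larger a[i] can get closer to b[j]
        elif a[i] > b[j]:
            j += 1  # only a larger b[j] can get closer to a[i]
        else:
            return 0  # identical values, cannot do better
    return best
```

Each step advances one of the two pointers, so the scan visits at most len(a) + len(b) positions.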

- Python
Published by hankcs almost 11 years ago