text-sim
Text similarity (matching) computation: baselines, training, inference, and metric analysis. The code ships in both TensorFlow and PyTorch versions.
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 7.1%, to scientific vocabulary)
Keywords
Repository
Basic Info
Statistics
- Stars: 179
- Watchers: 2
- Forks: 32
- Open Issues: 6
- Releases: 0
Topics
Metadata Files
README.md
Text-Similarity
Overview
- Dataset: Chinese/English corpora, ☞ click here
- Paper: detailed notes on the related papers, ☞ click here
- The implemented methods are as follows:
- TF-IDF
- BM25
- LSH
- SIF/uSIF
- FastText
- RNN Base (Siamese RNN, Stack RNN)
- CNN Base (Fast Text, Text CNN, Char CNN, VDCNN)
- Bert Base
- Albert
- NEZHA
- RoBERTa
- SimCSE
- Poly-Encoder
- ColBERT
- RE2 (Simple-Effective-Text-Matching)
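As a point of reference for the classical baselines at the top of this list, TF-IDF matching can be sketched in a few lines with scikit-learn. This is illustrative code independent of this package; the corpus and variable names are made up for the example:

```python
# Illustrative TF-IDF retrieval baseline using scikit-learn (not this package's API).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["this is a useful tool", "a very handy tool", "completely unrelated text"]
query = ["a very useful tool"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)   # (n_docs, vocab) sparse matrix
query_vec = vectorizer.transform(query)       # reuse the fitted vocabulary

scores = cosine_similarity(query_vec, doc_vecs)[0]
best = scores.argmax()
print(best, scores[best])                     # index and score of the best match
```

Documents sharing no terms with the query score exactly zero, which is the main weakness the dense methods further down the list address.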
Usages
Install via pip as shown below, or download the source directly and integrate it into your project:
pip3 install text-sim
1. The examples directory contains preprocess/train/evaluate code for each model, which you can modify as needed.
2. The examples below import the actuator method from examples; prepare the corresponding model config file and it can be run directly.
3. inference.py under examples is the inference code for trained models.
4. The main code lives under sim; the TensorFlow and PyTorch versions are kept separate, and their usage is essentially identical.
5. Related tools (word2vec, tokenizer, data_format) are collected under sim/tools.
TF-IDF
```python
# Example
# Sklearn version
from examples.run_tfidf_sklearn import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Custom version
from examples.run_tfidf import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Tool usage
from sim.tf_idf import TFIdf

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

tfidf = TFIdf(tokens_list, split=" ")
print(tfidf.get_score(query, 0))        # score
print(tfidf.get_score_list(query, 10))  # [(index, score), ...]
print(tfidf.weight())                   # list or numpy array
```
BM25
```python
# Example
from examples.run_bm25 import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Tool usage
from sim.bm25 import BM25

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

bm25 = BM25(tokens_list, split=" ")
print(bm25.get_score(query, 0))        # score
print(bm25.get_score_list(query, 10))  # [(index, score), ...]
print(bm25.weight())                   # list or numpy array
```
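For reference, the Okapi BM25 scoring formula behind a class like the one above can be sketched from scratch. This is illustrative code, not this package's implementation; function and variable names are made up:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25: sum over query terms of
    idf(t) * tf * (k1+1) / (tf + k1 * (1 - b + b * |d| / avgdl))."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequencies
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)  # smoothed IDF
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["very", "handy", "tool"], ["useful", "tool"], ["unrelated", "text"]]
print(bm25_scores(["useful", "tool"], docs))
```

The length normalization term (controlled by b) is what distinguishes BM25 from plain TF-IDF: matches in shorter-than-average documents are boosted.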
LSH
```python
from sim.lsh import E2LSH
from sim.lsh import MinHash

e2lsh = E2LSH()
min_hash = MinHash()

candidates = [[3.6216, 8.6661, -2.8073, -0.44699, 0], ...]
query = [-2.7769, -5.6967, 5.9179, 0.37671, 1]
print(e2lsh.search(candidates, query))     # index in candidates
print(min_hash.search(candidates, query))  # index in candidates
```
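What MinHash actually estimates can be shown in a short sketch independent of sim.lsh (the helper names below are hypothetical): per-token minimum hash values form a signature, and the fraction of matching signature slots approximates the Jaccard similarity of the two token sets.

```python
import random

def minhash_signature(tokens, num_hashes=128, seed=0):
    """Signature of per-token minimum values under num_hashes random hash functions."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    # Each "hash function" is a random affine map (a*x + b) mod prime over hash(token).
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % prime for t in tokens) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching slots estimates |A ∩ B| / |A ∪ B|."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

s1 = minhash_signature({"a", "b", "c", "d"})
s2 = minhash_signature({"a", "b", "c", "e"})
print(estimated_jaccard(s1, s2))  # true Jaccard is 3/5 = 0.6
```

With 128 hash functions the estimate typically lands within a few percent of the true value; more hashes trade memory for accuracy.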
SIF
- A Simple But Tough-To-Beat Baseline For Sentence Embeddings
- Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline

```python
sentences = [["token1", "token2", "..."], ...]
vector = [[[1, 1, 1], [2, 2, 2], [...]], ...]

from sim.sif_usif import SIF
from sim.sif_usif import uSIF

sif = SIF(n_components=5, component_type="svd")
sif.fit(tokens_list=sentences, vector_list=vector)

usif = uSIF(n_components=5, n=1, component_type="svd")
usif.fit(tokens_list=sentences, vector_list=vector)
```
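The SIF algorithm itself is short enough to sketch directly. The numpy code below is illustrative (not this package's implementation): average each sentence's word vectors weighted by a/(a + p(w)), then subtract the projection onto the first principal component of the resulting embedding matrix.

```python
import numpy as np
from collections import Counter

def sif_embeddings(sentences, word_vecs, a=1e-3):
    """SIF: weighted average with weight a/(a + p(w)), then remove the common component."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    emb = np.stack([
        np.mean([a / (a + counts[w] / total) * word_vecs[w] for w in s], axis=0)
        for s in sentences
    ])
    # First right singular vector = first principal direction of the embedding matrix.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - emb @ np.outer(u, u)  # remove projection onto the common component

# Toy word vectors standing in for pretrained embeddings.
vecs = {w: np.random.RandomState(i).randn(8) for i, w in enumerate(["good", "tool", "very", "nice"])}
sents = [["very", "good", "tool"], ["nice", "tool"]]
print(sif_embeddings(sents, vecs).shape)
```

The weight a/(a + p(w)) downweights frequent words (a smooth analogue of stop-word removal), which is where most of the method's strength comes from.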
FastText
- Bag of Tricks for Efficient Text Classification

```python
# TensorFlow version
from examples.tensorflow.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")

# Pytorch version
from examples.pytorch.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```
RNN Base
- Siamese Recurrent Architectures for Learning Sentence Similarity
- Learning Text Similarity with Siamese Recurrent Networks

```python
# TensorFlow version
from examples.tensorflow.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")

# Pytorch version
from examples.pytorch.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")
```
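The Siamese architectures in the papers above score a pair by comparing the two towers' final encodings; Mueller and Thyagarajan use exp(-||h1 - h2||_1). An illustrative numpy sketch (not this repo's code, with made-up encodings):

```python
import numpy as np

def manhattan_similarity(h1, h2):
    """Similarity in (0, 1]: exp of the negative L1 distance between encodings."""
    return np.exp(-np.abs(h1 - h2).sum(axis=-1))

h = np.array([0.2, -0.1, 0.5])           # stand-in for an RNN's final hidden state
print(manhattan_similarity(h, h))         # identical encodings score 1.0
print(manhattan_similarity(h, np.array([1.0, 1.0, 1.0])))  # distant encodings score near 0
```

The exponential keeps the score in a bounded range, which makes it usable directly as a regression target against human similarity judgments.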
CNN Base
- Convolutional Neural Networks for Sentence Classification
- Character-Aware Neural Language Models
- Highway Networks
- Very Deep Convolutional Networks for Text Classification

```python
# TensorFlow version
from examples.tensorflow.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")

# Pytorch version
from examples.pytorch.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```
Bert Base
- Attention Is All You Need

```python
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train")

# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train")
```
Albert
- ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations

```python
# TensorFlow version
from examples.tensorflow.run_albert import actuator
actuator(model_dir="./data/albert_small_zh_google", execute_type="train")

# Pytorch version
from examples.pytorch.run_albert import actuator
actuator(model_dir="./data/albert_chinese_small", execute_type="train")
```
NEZHA
- NEZHA: Neural Contextualized Representation For Chinese Language Understanding

```python
# TensorFlow version
from examples.tensorflow.run_nezha import actuator
actuator(model_dir="./data/NEZHA-Base-WWM", execute_type="train")

# Pytorch version
from examples.pytorch.run_nezha import actuator
actuator(model_dir="./data/nezha-base-wwm", execute_type="train")
```
RoBERTa
- RoBERTa: A Robustly Optimized BERT Pretraining Approach

```python
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_roberta_L-6_H-384_A-12", execute_type="train")

# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese-roberta-wwm-ext", execute_type="train")
```
SimCSE
- SimCSE: Simple Contrastive Learning of Sentence Embeddings

```python
# TensorFlow version
from examples.tensorflow.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
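SimCSE's training objective can be sketched independently of this repo: two dropout-noised encodings of the same batch are aligned with an in-batch contrastive (InfoNCE) loss over cosine similarities. The numpy sketch below is illustrative only; random vectors plus small noise stand in for the encoder's two dropout passes:

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """In-batch InfoNCE: z1[i] should match z2[i] against every other z2[j]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                # (batch, batch) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability for the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # the positive pairs sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))                     # stand-in for sentence encodings
noise = rng.normal(scale=0.01, size=(4, 16))     # stand-in for dropout perturbation
print(simcse_loss(z, z + noise))                 # near 0: each row matches its own pair
```

The low temperature sharpens the softmax so that even small cosine gaps between the positive and the in-batch negatives dominate the loss.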
Poly-Encoder
- Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring

```python
# TensorFlow version
from examples.tensorflow.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
ColBERT
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

```python
# TensorFlow version
from examples.tensorflow.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
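ColBERT's late-interaction scoring can be sketched independently of this repo's classes: for each query token, take the maximum similarity against any document token, then sum over query tokens (MaxSim). In the illustrative numpy sketch below, random unit vectors stand in for BERT token embeddings:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT late interaction: sum over query tokens of the max similarity
    against any document token (embeddings assumed L2-normalized)."""
    sim = query_emb @ doc_emb.T       # (query_tokens, doc_tokens) similarity matrix
    return sim.max(axis=1).sum()      # best doc-token match per query token, summed

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(3, 8)))                       # 3 query token embeddings
d_pos = normalize(np.vstack([q, rng.normal(size=(2, 8))]))   # doc containing the query tokens
d_neg = normalize(rng.normal(size=(5, 8)))                   # unrelated doc
print(maxsim_score(q, d_pos), maxsim_score(q, d_neg))
```

Because documents are encoded offline and MaxSim is just a matrix product plus a row-wise max, this interaction is cheap enough for large-scale retrieval while keeping token-level matching.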
RE2
- Simple and Effective Text Matching with Richer Alignment Features

```python
# TensorFlow version
from examples.tensorflow.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")

# Pytorch version
from examples.pytorch.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")
```
Cite
@misc{text-similarity,
  title={text-similarity},
  author={Bocong Deng},
  year={2021},
  howpublished={\url{https://github.com/DengBoCong/text-similarity}},
}
Reference
Owner
- Name: DengBoCong
- Login: DengBoCong
- Kind: user
- Location: Beijing, China
- Company: HUST - Master of Software Engineering
- Website: http://dengbocong.cn/
- Repositories: 7
- Profile: https://github.com/DengBoCong
Deep Learning | NLP | Java
GitHub Events
Total
- Watch event: 6
- Fork event: 3
Last Year
- Watch event: 6
- Fork event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| DengBoCong | 1****0@q****m | 101 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 7
- Total pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Total issue authors: 7
- Total pull request authors: 0
- Average comments per issue: 0.29
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- skynet-jzm (1)
- tjulh (1)
- zz19980926 (1)
- Exusia-0 (1)
- VirgilG72 (1)
- wybingo (1)
- Huanghong2016 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 3
- Total downloads: pypi 12 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 5
- Total maintainers: 1
proxy.golang.org: github.com/dengbocong/text-similarity
- Documentation: https://pkg.go.dev/github.com/dengbocong/text-similarity#section-documentation
- License: mit
- Latest release: v1.0.7 (published almost 4 years ago)
Rankings
proxy.golang.org: github.com/DengBoCong/text-similarity
- Documentation: https://pkg.go.dev/github.com/DengBoCong/text-similarity#section-documentation
- License: mit
- Latest release: v1.0.7 (published almost 4 years ago)
Rankings
pypi.org: text-sim
Chinese text similarity calculation package of Tensorflow/Pytorch
- Homepage: https://github.com/DengBoCong/text-similarity
- Documentation: https://text-sim.readthedocs.io/
- License: MIT License
- Latest release: 1.0.7 (published almost 4 years ago)
Rankings
Maintainers (1)
Funding
- https://pypi.org/project/text-sim/
Dependencies
- gensim ==4.1.2
- jieba ==0.42.1
- networkx ==2.6.3
- nltk ==3.6.5
- numpy ==1.19.5
- packaging ==21.3
- pandas ==1.3.4
- scikit-learn ==0.24.2
- sentencepiece ==0.1.96
- tensorflow ==2.6.0
- tokenizers ==0.10.3
- torch ==1.10.0
- tqdm ==4.62.3
- transformers ==4.12.5