text-sim
Text similarity (matching) computation: baselines, training, inference, and metric analysis. The code ships in both TensorFlow and PyTorch versions.
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org, scholar.google)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 7.1%, to scientific vocabulary)
Keywords
Repository
Basic Info
Statistics
- Stars: 179
- Watchers: 2
- Forks: 32
- Open Issues: 6
- Releases: 0
Topics
Metadata Files
README.md
Text-Similarity
Overview
- Dataset: Chinese/English corpora, ☞ click here
- Paper: detailed notes on the related papers, ☞ click here
- The implemented methods are as follows:
- TF-IDF
- BM25
- LSH
- SIF/uSIF
- FastText
- RNN Base (Siamese RNN, Stack RNN)
- CNN Base (Fast Text, Text CNN, Char CNN, VDCNN)
- Bert Base
- Albert
- NEZHA
- RoBERTa
- SimCSE
- Poly-Encoder
- ColBERT
- RE2 (Simple-Effective-Text-Matching)
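As a point of reference for the classical baselines at the top of this list, TF-IDF matching can be sketched in a few lines with scikit-learn. This is illustrative code independent of this package; the corpus and variable names are made up for the example:

```python
# Illustrative TF-IDF retrieval baseline using scikit-learn (not this package's API).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["this is a useful tool", "a very handy tool", "completely unrelated text"]
query = ["a very useful tool"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)   # (n_docs, vocab) sparse matrix
query_vec = vectorizer.transform(query)       # reuse the fitted vocabulary

scores = cosine_similarity(query_vec, doc_vecs)[0]
best = scores.argmax()
print(best, scores[best])                     # index and score of the best match
```

Documents sharing no terms with the query score exactly zero, which is the main weakness the dense methods further down the list address.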
Usages
Install via pip as shown below, or download the source directly and integrate it into your project:
pip3 install text-sim
1. The examples directory contains preprocess/train/evaluate code for each model, which you can modify as needed.
2. The examples below import the actuator method from examples; prepare the corresponding model config file and it can be run directly.
3. inference.py under examples is the inference code for trained models.
4. The main code lives under sim; the TensorFlow and PyTorch versions are kept separate, and their usage is essentially identical.
5. Related tools (word2vec, tokenizer, data_format) are collected under sim/tools.
TF-IDF
```python
# Example
# Sklearn version
from examples.run_tfidf_sklearn import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Custom version
from examples.run_tfidf import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Tool usage
from sim.tf_idf import TFIdf

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

tfidf = TFIdf(tokens_list, split=" ")
print(tfidf.get_score(query, 0))        # score
print(tfidf.get_score_list(query, 10))  # [(index, score), ...]
print(tfidf.weight())                   # list or numpy array
```
BM25
```python
# Example
from examples.run_bm25 import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Tool usage
from sim.bm25 import BM25

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

bm25 = BM25(tokens_list, split=" ")
print(bm25.get_score(query, 0))        # score
print(bm25.get_score_list(query, 10))  # [(index, score), ...]
print(bm25.weight())                   # list or numpy array
```
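For reference, the Okapi BM25 scoring formula behind a class like the one above can be sketched from scratch. This is illustrative code, not this package's implementation; function and variable names are made up:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25: sum over query terms of
    idf(t) * tf * (k1+1) / (tf + k1 * (1 - b + b * |d| / avgdl))."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequencies
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)  # smoothed IDF
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["very", "handy", "tool"], ["useful", "tool"], ["unrelated", "text"]]
print(bm25_scores(["useful", "tool"], docs))
```

The length normalization term (controlled by b) is what distinguishes BM25 from plain TF-IDF: matches in shorter-than-average documents are boosted.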
LSH
```python
from sim.lsh import E2LSH
from sim.lsh import MinHash

e2lsh = E2LSH()
min_hash = MinHash()

candidates = [[3.6216, 8.6661, -2.8073, -0.44699, 0], ...]
query = [-2.7769, -5.6967, 5.9179, 0.37671, 1]
print(e2lsh.search(candidates, query))     # index in candidates
print(min_hash.search(candidates, query))  # index in candidates
```
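What MinHash actually estimates can be shown in a short sketch independent of sim.lsh (the helper names below are hypothetical): per-token minimum hash values form a signature, and the fraction of matching signature slots approximates the Jaccard similarity of the two token sets.

```python
import random

def minhash_signature(tokens, num_hashes=128, seed=0):
    """Signature of per-token minimum values under num_hashes random hash functions."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    # Each "hash function" is a random affine map (a*x + b) mod prime over hash(token).
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(t) + b) % prime for t in tokens) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching slots estimates |A ∩ B| / |A ∪ B|."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

s1 = minhash_signature({"a", "b", "c", "d"})
s2 = minhash_signature({"a", "b", "c", "e"})
print(estimated_jaccard(s1, s2))  # true Jaccard is 3/5 = 0.6
```

With 128 hash functions the estimate typically lands within a few percent of the true value; more hashes trade memory for accuracy.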
SIF
- A Simple But Tough-To-Beat Baseline For Sentence Embeddings
- Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline

```python
sentences = [["token1", "token2", "..."], ...]
vector = [[[1, 1, 1], [2, 2, 2], [...]], ...]

from sim.sif_usif import SIF
from sim.sif_usif import uSIF

sif = SIF(n_components=5, component_type="svd")
sif.fit(tokens_list=sentences, vector_list=vector)

usif = uSIF(n_components=5, n=1, component_type="svd")
usif.fit(tokens_list=sentences, vector_list=vector)
```
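The SIF algorithm itself is short enough to sketch directly. The numpy code below is illustrative (not this package's implementation): average each sentence's word vectors weighted by a/(a + p(w)), then subtract the projection onto the first principal component of the resulting embedding matrix.

```python
import numpy as np
from collections import Counter

def sif_embeddings(sentences, word_vecs, a=1e-3):
    """SIF: weighted average with weight a/(a + p(w)), then remove the common component."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    emb = np.stack([
        np.mean([a / (a + counts[w] / total) * word_vecs[w] for w in s], axis=0)
        for s in sentences
    ])
    # First right singular vector = first principal direction of the embedding matrix.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - emb @ np.outer(u, u)  # remove projection onto the common component

# Toy word vectors standing in for pretrained embeddings.
vecs = {w: np.random.RandomState(i).randn(8) for i, w in enumerate(["good", "tool", "very", "nice"])}
sents = [["very", "good", "tool"], ["nice", "tool"]]
print(sif_embeddings(sents, vecs).shape)
```

The weight a/(a + p(w)) downweights frequent words (a smooth analogue of stop-word removal), which is where most of the method's strength comes from.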
FastText
- Bag of Tricks for Efficient Text Classification

```python
# TensorFlow version
from examples.tensorflow.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")

# Pytorch version
from examples.pytorch.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```
RNN Base
- Siamese Recurrent Architectures for Learning Sentence Similarity
- Learning Text Similarity with Siamese Recurrent Networks

```python
# TensorFlow version
from examples.tensorflow.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")

# Pytorch version
from examples.pytorch.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")
```
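The Siamese architectures in the papers above score a pair by comparing the two towers' final encodings; Mueller and Thyagarajan use exp(-||h1 - h2||_1). An illustrative numpy sketch (not this repo's code, with made-up encodings):

```python
import numpy as np

def manhattan_similarity(h1, h2):
    """Similarity in (0, 1]: exp of the negative L1 distance between encodings."""
    return np.exp(-np.abs(h1 - h2).sum(axis=-1))

h = np.array([0.2, -0.1, 0.5])           # stand-in for an RNN's final hidden state
print(manhattan_similarity(h, h))         # identical encodings score 1.0
print(manhattan_similarity(h, np.array([1.0, 1.0, 1.0])))  # distant encodings score near 0
```

The exponential keeps the score in a bounded range, which makes it usable directly as a regression target against human similarity judgments.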
CNN Base
- Convolutional Neural Networks for Sentence Classification
- Character-Aware Neural Language Models
- Highway Networks
- Very Deep Convolutional Networks for Text Classification

```python
# TensorFlow version
from examples.tensorflow.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")

# Pytorch version
from examples.pytorch.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```
Bert Base
- Attention Is All You Need

```python
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train")

# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train")
```
Albert
- ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations

```python
# TensorFlow version
from examples.tensorflow.run_albert import actuator
actuator(model_dir="./data/albert_small_zh_google", execute_type="train")

# Pytorch version
from examples.pytorch.run_albert import actuator
actuator(model_dir="./data/albert_chinese_small", execute_type="train")
```
NEZHA
- NEZHA: Neural Contextualized Representation For Chinese Language Understanding

```python
# TensorFlow version
from examples.tensorflow.run_nezha import actuator
actuator(model_dir="./data/NEZHA-Base-WWM", execute_type="train")

# Pytorch version
from examples.pytorch.run_nezha import actuator
actuator(model_dir="./data/nezha-base-wwm", execute_type="train")
```
RoBERTa
- RoBERTa: A Robustly Optimized BERT Pretraining Approach

```python
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_roberta_L-6_H-384_A-12", execute_type="train")

# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese-roberta-wwm-ext", execute_type="train")
```
SimCSE
- SimCSE: Simple Contrastive Learning of Sentence Embeddings

```python
# TensorFlow version
from examples.tensorflow.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
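SimCSE's training objective can be sketched independently of this repo: two dropout-noised encodings of the same batch are aligned with an in-batch contrastive (InfoNCE) loss over cosine similarities. The numpy sketch below is illustrative only; random vectors plus small noise stand in for the encoder's two dropout passes:

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """In-batch InfoNCE: z1[i] should match z2[i] against every other z2[j]."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                # (batch, batch) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability for the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # the positive pairs sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))                     # stand-in for sentence encodings
noise = rng.normal(scale=0.01, size=(4, 16))     # stand-in for dropout perturbation
print(simcse_loss(z, z + noise))                 # near 0: each row matches its own pair
```

The low temperature sharpens the softmax so that even small cosine gaps between the positive and the in-batch negatives dominate the loss.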
Poly-Encoder
- Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring

```python
# TensorFlow version
from examples.tensorflow.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
ColBERT
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

```python
# TensorFlow version
from examples.tensorflow.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")

# Pytorch version
from examples.pytorch.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```
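ColBERT's late-interaction scoring can be sketched independently of this repo's classes: for each query token, take the maximum similarity against any document token, then sum over query tokens (MaxSim). In the illustrative numpy sketch below, random unit vectors stand in for BERT token embeddings:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT late interaction: sum over query tokens of the max similarity
    against any document token (embeddings assumed L2-normalized)."""
    sim = query_emb @ doc_emb.T       # (query_tokens, doc_tokens) similarity matrix
    return sim.max(axis=1).sum()      # best doc-token match per query token, summed

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(3, 8)))                       # 3 query token embeddings
d_pos = normalize(np.vstack([q, rng.normal(size=(2, 8))]))   # doc containing the query tokens
d_neg = normalize(rng.normal(size=(5, 8)))                   # unrelated doc
print(maxsim_score(q, d_pos), maxsim_score(q, d_neg))
```

Because documents are encoded offline and MaxSim is just a matrix product plus a row-wise max, this interaction is cheap enough for large-scale retrieval while keeping token-level matching.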
RE2
- Simple and Effective Text Matching with Richer Alignment Features

```python
# TensorFlow version
from examples.tensorflow.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")

# Pytorch version
from examples.pytorch.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")
```
Cite
@misc{text-similarity,
  title={text-similarity},
  author={Bocong Deng},
  year={2021},
  howpublished={\url{https://github.com/DengBoCong/text-similarity}},
}
Reference
Owner
- Name: DengBoCong
- Login: DengBoCong
- Kind: user
- Location: Beijing, China
- Company: HUST - Master of Software Engineering
- Website: http://dengbocong.cn/
- Repositories: 7
- Profile: https://github.com/DengBoCong
Deep Learning | NLP | Java
GitHub Events
Total
- Watch event: 6
- Fork event: 3
Last Year
- Watch event: 6
- Fork event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| DengBoCong | 1****0@q****m | 101 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 7
- Total pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Total issue authors: 7
- Total pull request authors: 0
- Average comments per issue: 0.29
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- skynet-jzm (1)
- tjulh (1)
- zz19980926 (1)
- Exusia-0 (1)
- VirgilG72 (1)
- wybingo (1)
- Huanghong2016 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 3
- Total downloads: pypi 12 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 5
- Total maintainers: 1
proxy.golang.org: github.com/dengbocong/text-similarity
- Documentation: https://pkg.go.dev/github.com/dengbocong/text-similarity#section-documentation
- License: mit
- Latest release: v1.0.7 (published almost 4 years ago)
Rankings
proxy.golang.org: github.com/DengBoCong/text-similarity
- Documentation: https://pkg.go.dev/github.com/DengBoCong/text-similarity#section-documentation
- License: mit
- Latest release: v1.0.7 (published almost 4 years ago)
Rankings
pypi.org: text-sim
Chinese text similarity calculation package of Tensorflow/Pytorch
- Homepage: https://github.com/DengBoCong/text-similarity
- Documentation: https://text-sim.readthedocs.io/
- License: MIT License
- Latest release: 1.0.7 (published almost 4 years ago)
Rankings
Maintainers (1)
Funding
- https://pypi.org/project/text-sim/
Dependencies
- gensim ==4.1.2
- jieba ==0.42.1
- networkx ==2.6.3
- nltk ==3.6.5
- numpy ==1.19.5
- packaging ==21.3
- pandas ==1.3.4
- scikit-learn ==0.24.2
- sentencepiece ==0.1.96
- tensorflow ==2.6.0
- tokenizers ==0.10.3
- torch ==1.10.0
- tqdm ==4.62.3
- transformers ==4.12.5