text-sim

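Text similarity (matching) computation, providing baselines, training, inference, metric analysis, and more. The code includes both TensorFlow and PyTorch versions.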

https://github.com/DengBoCong/text-similarity

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, scholar.google
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.1%) to scientific vocabulary

Keywords

bert deep-learning mechine-learing model nlp pytorch similarity text-classification transformer
Last synced: 6 months ago

Repository

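Text similarity (matching) computation, providing baselines, training, inference, metric analysis, and more. The code includes both TensorFlow and PyTorch versions.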

Basic Info
  • Host: GitHub
  • Owner: DengBoCong
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 57.6 MB
Statistics
  • Stars: 179
  • Watchers: 2
  • Forks: 32
  • Open Issues: 6
  • Releases: 0
Topics
bert deep-learning mechine-learing model nlp pytorch similarity text-classification transformer
Created almost 5 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License

README.md

Text-Similarity

[![Blog](https://img.shields.io/badge/blog-@DengBoCong-blue.svg?style=social)](https://www.zhihu.com/people/dengbocong) [![Paper Support](https://img.shields.io/badge/paper-repo-blue.svg?style=social)](https://github.com/DengBoCong/nlp-paper) ![Stars Thanks](https://img.shields.io/badge/Stars-thanks-brightgreen.svg?style=social&logo=trustpilot) ![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=social&logo=appveyor)

Overview

  • Dataset: Chinese/English corpora, ☞ click here
  • Paper: detailed explanations of the related papers, ☞ click here
  • The following methods are implemented:
    • TF-IDF
    • BM25
    • LSH
    • SIF/uSIF
    • FastText
    • RNN Base (Siamese RNN, Stack RNN)
    • CNN Base (Fast Text, Text CNN, Char CNN, VDCNN)
    • Bert Base
    • Albert
    • NEZHA
    • RoBERTa
    • SimCSE
    • Poly-Encoder
    • ColBERT
    • RE2(Simple-Effective-Text-Matching)

Usages

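You can install and use the package with pip (`pip3 install text-sim`), or download the source code and integrate it into your own project.

1. The examples directory contains the preprocess/train/evaluate code for each model, which you can modify as needed.
2. The examples below import the actuator method from examples; once the corresponding model configuration file is prepared, they can be run directly.
3. inference.py in the examples directory contains the inference code for trained models.
4. The main code lives under sim, with the TensorFlow and PyTorch versions stored separately; they are imported in essentially the same way.
5. Related tools, including word2vec, tokenizer, and data_format, are collected under tools in sim.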

TF-IDF

```python
# Example
# Sklearn version
from examples.run_tfidf_sklearn import actuator

actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Custom version
from examples.run_tfidf import actuator

actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Calling the tool directly
from sim.tf_idf import TFIdf

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

tfidf = TFIdf(tokens_list, split=" ")
print(tfidf.get_score(query, 0))  # score
print(tfidf.get_score_list(query, 10))  # [(index, score), ...]
print(tfidf.weight())  # list or numpy array
```
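As a rough illustration of what a TF-IDF similarity score measures, here is a minimal sketch using only the standard library. It is not the implementation in `sim.tf_idf`: the function names `tfidf_vectors` and `cosine`, the smoothed IDF formula, and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps weights positive even for terms found in every document.
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical sample sentences, already tokenized with spaces.
docs = ["a good tool", "a very good tool", "bad weather today"]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # sentences sharing most terms
sim_far = cosine(vecs[0], vecs[2])    # sentences sharing no terms
```

In this sketch, sentences with no terms in common score exactly zero, while overlapping sentences score higher the more of their weighted terms coincide.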

BM25

```python
# Example
from examples.run_bm25 import actuator

actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")

# Calling the tool directly
from sim.bm25 import BM25

tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]

bm25 = BM25(tokens_list, split=" ")
print(bm25.get_score(query, 0))  # score
print(bm25.get_score_list(query, 10))  # [(index, score), ...]
print(bm25.weight())  # list or numpy array
```
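For readers unfamiliar with BM25, the following standard-library sketch shows one common form of the scoring function (Okapi BM25 with a non-negative IDF). It is not the code in `sim.bm25`: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the sample sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every whitespace-tokenized document in `docs` against `query`."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            # The +1 inside the log keeps the IDF non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample sentences, already tokenized with spaces.
docs = ["a good tool", "a very good tool", "bad weather today"]
scores = bm25_scores(docs, "good tool")
best = max(range(len(docs)), key=scores.__getitem__)  # index of the top-scoring document
```

Both of the first two sentences contain every query term, but the shorter one ranks higher because of the length normalization controlled by `b`.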

LSH

```python
from sim.lsh import E2LSH
from sim.lsh import MinHash

e2lsh = E2LSH()
min_hash = MinHash()

candidates = [[3.6216, 8.6661, -2.8073, -0.44699, 0], ...]
query = [-2.7769, -5.6967, 5.9179, 0.37671, 1]
print(e2lsh.search(candidates, query))  # index in candidates
print(min_hash.search(candidates, query))  # index in candidates
```
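To give a sense of the idea behind MinHash, the sketch below estimates the Jaccard similarity of two token sets from short signatures. It is a generic illustration over token sets, not the `MinHash` class in `sim.lsh` (which, per the example above, searches over numeric vectors); the helper names, the use of `hashlib.blake2b` as the hash family, and the 128-hash signature length are assumptions.

```python
import hashlib


def _hash(token, seed):
    # Deterministic 64-bit hash of (seed, token); hashlib avoids Python's
    # per-process randomization of the built-in hash() for strings.
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes=128):
    """Keep the minimum hash of the (non-empty) token set under each seeded hash."""
    token_set = set(tokens)
    return [min(_hash(token, seed) for token in token_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical sample sentences, already tokenized with spaces.
a = "a very good tool".split()
b = "a good tool".split()
c = "bad weather today".split()
sig_a, sig_b, sig_c = (minhash_signature(tokens) for tokens in (a, b, c))

overlap_estimate = estimated_jaccard(sig_a, sig_b)   # true Jaccard is 3/4
disjoint_estimate = estimated_jaccard(sig_a, sig_c)  # true Jaccard is 0
```

Signatures have a fixed length regardless of sentence length, which is what makes them suitable for bucketing candidates in locality-sensitive hashing.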

SIF

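```python
# The import lines for SIF and uSIF are missing from this copy of the README.
# `sentences` and `vector` must be defined beforehand (not shown in the source).
sif = SIF(n_components=5, component_type="svd")
sif.fit(tokens_list=sentences, vector_list=vector)

usif = uSIF(n_components=5, n=1, component_type="svd")
usif.fit(tokens_list=sentences, vector_list=vector)
```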

FastText

```python
# PyTorch version
from examples.pytorch.run_fast_text import actuator

actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```

RNN Base

```python
# PyTorch version
from examples.pytorch.run_siamese_rnn import actuator

actuator("./data/config/siamse_rnn.json", execute_type="train")
```

CNN Base

```python
# PyTorch version
from examples.pytorch.run_cnn_base import actuator

actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
```

Bert Base

  • Attention Is All You Need

```python
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator

actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train")

# PyTorch version
from examples.pytorch.run_basic_bert import actuator

actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train")
```

Albert

```python
# PyTorch version
from examples.pytorch.run_albert import actuator

actuator(model_dir="./data/albert_chinese_small", execute_type="train")
```

NEZHA

```python
# PyTorch version
from examples.pytorch.run_nezha import actuator

actuator(model_dir="./data/nezha-base-wwm", execute_type="train")
```

RoBERTa

```python
# PyTorch version
from examples.pytorch.run_basic_bert import actuator

actuator(model_dir="./data/chinese-roberta-wwm-ext", execute_type="train")
```

SimCSE

```python
# PyTorch version
from examples.pytorch.run_simcse import actuator

actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```

Poly-Encoder

```python
# PyTorch version
from examples.pytorch.run_poly_encoder import actuator

actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```

ColBERT

```python
# PyTorch version
from examples.pytorch.run_colbert import actuator

actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
```

RE2

```python
# PyTorch version
from examples.pytorch.run_re2 import actuator

actuator("./data/config/re2.json", execute_type="train")
```

Cite

@misc{text-similarity,
  title={text-similarity},
  author={Bocong Deng},
  year={2021},
  howpublished={\url{https://github.com/DengBoCong/text-similarity}},
}

Reference

Owner

  • Name: DengBoCong
  • Login: DengBoCong
  • Kind: user
  • Location: Beijing, China
  • Company: HUST - Master of Software Engineering

Deep Learning | NLP | Java

GitHub Events

Total
  • Watch event: 6
  • Fork event: 3
Last Year
  • Watch event: 6
  • Fork event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 101
  • Total Committers: 1
  • Avg Commits per committer: 101.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
DengBoCong 1****0@q****m 101
Committer Domains (Top 20 + Academic)
qq.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 7
  • Total pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Total issue authors: 7
  • Total pull request authors: 0
  • Average comments per issue: 0.29
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • skynet-jzm (1)
  • tjulh (1)
  • zz19980926 (1)
  • Exusia-0 (1)
  • VirgilG72 (1)
  • wybingo (1)
  • Huanghong2016 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 12 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 5
  • Total maintainers: 1
proxy.golang.org: github.com/dengbocong/text-similarity
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.8%
Last synced: 6 months ago
proxy.golang.org: github.com/DengBoCong/text-similarity
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.8%
Last synced: 6 months ago
pypi.org: text-sim

Chinese text similarity calculation package of Tensorflow/Pytorch

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 12 Last month
Rankings
Stargazers count: 5.8%
Forks count: 7.4%
Dependent packages count: 10.0%
Average: 18.6%
Dependent repos count: 21.7%
Downloads: 48.3%
Maintainers (1)
Funding
  • https://pypi.org/project/text-sim/
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • gensim ==4.1.2
  • jieba ==0.42.1
  • networkx ==2.6.3
  • nltk ==3.6.5
  • numpy ==1.19.5
  • packaging ==21.3
  • pandas ==1.3.4
  • scikit-learn ==0.24.2
  • sentencepiece ==0.1.96
  • tensorflow ==2.6.0
  • tokenizers ==0.10.3
  • torch ==1.10.0
  • tqdm ==4.62.3
  • transformers ==4.12.5