similarities

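Similarities: a toolkit for similarity calculation and semantic search. It supports text-to-text, text-to-image, and image-to-image search over hundred-million-scale datasets, is written in Python 3, and works out of the box.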

https://github.com/shibing624/similarities

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.3%) to scientific vocabulary

Keywords

bm25 deep-learning faiss image-search image-similarity matching nlp pytorch search-engine similarity similarity-search text-matching
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 870
  • Watchers: 9
  • Forks: 87
  • Open Issues: 8
  • Releases: 2
Topics
bm25 deep-learning faiss image-search image-similarity matching nlp pytorch search-engine similarity similarity-search text-matching
Created almost 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Citation

README.md

Similarities: Similarity Calculation and Semantic Search

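similarities: a toolkit for similarity calculation and semantic search that supports text and images.

similarities implements a range of similarity calculation and semantic matching and retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over hundred-million-scale datasets, is written in Python 3, installs with pip, and works out of the box.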

Features

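Text similarity calculation + text search

  • Semantic matching models (recommended): this project builds on text2vec to implement text similarity calculation and text search with the CoSENT model
    • Supports Chinese, English, and multilingual SentenceBERT-style pretrained models
    • Supports multiple similarity measures, including Cos Similarity, Dot Product, Hamming Distance, and Euclidean Distance
    • Supports multiple text search algorithms, including SemanticSearch, Faiss, Annoy, and Hnsw
    • Supports efficient retrieval over hundred-million-scale datasets
    • Supports command-line text-to-embedding conversion (multi-GPU), index building, batch retrieval, and starting a service
  • Literal matching models: this project implements Word2Vec, BM25, RankBM25, TFIDF, SimHash, the Cilin synonym thesaurus, HowNet sememe matching, and other literal matching models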
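
The similarity measures named above can be illustrated with a minimal sketch that uses only the standard library. The function names (`cos_sim`, `dot_score`, `euclidean_distance`, `hamming_distance`) are illustrative assumptions and are not taken from the similarities API; in practice the vectors would come from an embedding model rather than being written by hand.

```python
import math
from typing import Sequence


def dot_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(dot_score(a, a))
    norm_b = math.sqrt(dot_score(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot_score(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Hand-written toy vectors standing in for model embeddings.
u = [1.0, 0.0, 1.0]
v = [1.0, 1.0, 0.0]
print("cosine:", cos_sim(u, v))
print("dot:", dot_score(u, v))
print("euclidean:", euclidean_distance(u, v))
print("hamming:", hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

Cosine similarity and dot product are scores (higher means more similar), while Euclidean and Hamming are distances (lower means more similar), so ranking code has to sort in the matching direction.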

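Image similarity calculation / image-text similarity calculation + image-to-image search / text-to-image search

  • CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text features (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. This project uses PyTorch to implement CLIP embedding extraction, index building (with AutoFaiss), batch retrieval, a backend service (with FastAPI), and a frontend (with Gradio)
    • Supports CLIP-series models such as openai/clip-vit-base-patch32
    • Supports Chinese-CLIP-series models such as OFA-Sys/chinese-clip-vit-huge-patch14
    • Supports separate frontend and backend deployment, with a FastAPI backend service and a Gradio frontend
    • Supports efficient retrieval over hundred-million-scale datasets, based on Faiss, with GPU acceleration
    • Supports image-to-image, text-to-image, and vector-to-image search
    • Supports image embedding extraction and text embedding extraction
    • Supports image similarity calculation and image-text similarity calculation
    • Supports command-line image-to-embedding conversion (multi-GPU), index building, batch retrieval, and starting a service
  • Image feature extraction: this project uses cv2 to implement pHash, dHash, wHash, aHash, SIFT, and other image feature extraction algorithms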

Demo

Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

Install

    pip install torch # conda install pytorch
    pip install -U similarities

or

    git clone https://github.com/shibing624/similarities.git
    cd similarities
    pip install -e .

Usage

1. Text embedding similarity calculation

example: examples/text_similarity_demo.py

    from similarities import BertSimilarity

    # Load a Chinese sentence-embedding model
    m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
    # Compare two paraphrased questions about changing the bank card bound to Huabei
    r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
    print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

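2. Text embedding search

Find the texts most similar to a query within a candidate document set. This is commonly used for question matching in QA scenarios, text search, and similar tasks.

SemanticSearch: an exact search algorithm using Cos Similarity + top-K clustering retrieval, suited to datasets of under one million items

example: examples/text_semantic_search_demo.py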
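
The idea behind exact search (score the query against every candidate, then keep the top K) can be sketched in a few lines. This is an illustrative, standard-library-only sketch and not the library's implementation; `semantic_search`, `normalize`, and the toy vectors are assumptions made for the example.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def semantic_search(
    query_emb: Sequence[float],
    corpus_embs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Brute-force search: return (corpus index, cosine score) for the top_k matches."""
    q = normalize(query_emb)
    scored = []
    for idx, emb in enumerate(corpus_embs):
        c = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    # nlargest avoids fully sorting the corpus when top_k is small.
    best = heapq.nlargest(top_k, scored)
    return [(idx, score) for score, idx in best]


# Toy 2-D "embeddings"; real ones would come from a sentence-embedding model.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = semantic_search([1.0, 0.1], corpus, top_k=2)
for idx, score in hits:
    print(idx, round(score, 3))
```

Because every query touches every candidate, the cost grows linearly with corpus size, which is why approximate indexes become attractive for larger collections.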

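Annoy, Hnswlib, and other approximate search algorithms, suited to million-scale datasets

example: examples/fast_text_semantic_search_demo.py

Faiss: efficient vector retrieval, suited to hundred-million-scale datasets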

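3. Literal text similarity calculation and text search

Supports similarity calculation and literal matching search with the Cilin synonym thesaurus, HowNet, word embeddings (WordEmbedding), Tfidf, SimHash, BM25, and other algorithms. These are commonly used for cold-starting text matching.

example: examples/literal_text_semantic_search_demo.py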
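
As an illustration of literal matching, here is a minimal sketch of Okapi BM25 scoring using only the standard library. It is not the library's implementation: the `BM25` class, the `k1` and `b` defaults, the always-positive IDF variant, and whitespace tokenization are all assumptions chosen for the example (Chinese text would first need a word segmenter such as jieba).

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


class BM25:
    """Toy Okapi BM25 ranker over pre-tokenized documents."""

    def __init__(self, corpus_tokens: Sequence[Sequence[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(doc) for doc in corpus_tokens]   # term frequencies per document
        self.doc_len = [len(doc) for doc in corpus_tokens]
        self.avgdl = sum(self.doc_len) / len(self.doc_len)
        n = len(self.docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for doc in self.docs for term in doc)
        # IDF variant with +1 inside the log, so weights never go negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens: Sequence[str], index: int) -> float:
        tf, dl = self.docs[index], self.doc_len[index]
        total = 0.0
        for term in query_tokens:
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            # Length normalization: long documents are penalized via b.
            denom = freq + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * freq * (self.k1 + 1) / denom
        return total

    def search(self, query_tokens: Sequence[str], top_k: int = 3) -> List[Tuple[int, float]]:
        scores = sorted(((self.score(query_tokens, i), i) for i in range(len(self.docs))), reverse=True)
        return [(i, s) for s, i in scores[:top_k]]


documents = [
    "how to change the bank card bound to huabei",
    "huabei repayment date",
    "weather in beijing today",
]
bm25 = BM25([d.split() for d in documents])
results = bm25.search("change huabei bank card".split(), top_k=2)
for idx, score in results:
    print(documents[idx], round(score, 3))
```

Literal scoring like this needs no trained model, which fits the cold-start use case mentioned above, but it cannot match paraphrases that share no terms.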

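4. Image similarity calculation and image search

Supports image similarity calculation and matching search with CLIP, pHash, SIFT, and other algorithms. The Chinese CLIP model supports image-to-image and text-to-image search, as well as image-text cross-search in both Chinese and English.

example: examples/image_semantic_search_demo.py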

Faiss: efficient vector retrieval, suited to hundred-million-scale datasets
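
For the hash-based image features, a difference hash (dHash) can be sketched as follows. This is an illustrative sketch and not the library's cv2-based implementation: it assumes the image has already been converted to grayscale and resized to 8 rows by 9 columns (a step an image library would normally perform), and the names `dhash`, `hamming`, and `hash_similarity` are hypothetical.

```python
from typing import Sequence


def dhash(gray: Sequence[Sequence[int]], hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair.

    `gray` must hold hash_size rows of hash_size + 1 intensities each.
    """
    if len(gray) != hash_size or any(len(row) != hash_size + 1 for row in gray):
        raise ValueError("expected a hash_size x (hash_size + 1) grayscale grid")
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            # Bit is 1 when brightness decreases from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def hash_similarity(a: int, b: int, n_bits: int = 64) -> float:
    """Map Hamming distance to a 0..1 similarity score."""
    return 1 - hamming(a, b) / n_bits


# Toy 8x9 images: a left-to-right gradient, a brighter copy, and a mirrored copy.
gradient = [list(range(9)) for _ in range(8)]
brighter = [[p + 10 for p in row] for row in gradient]
mirrored = [row[::-1] for row in gradient]

h_gradient, h_brighter, h_mirrored = dhash(gradient), dhash(brighter), dhash(mirrored)
print(hash_similarity(h_gradient, h_brighter))
print(hash_similarity(h_gradient, h_mirrored))
```

Because dHash encodes only the sign of neighboring brightness differences, a uniform brightness shift leaves the hash unchanged, while mirroring the gradient flips every bit.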

5. Clustering

The community detection (community_detection) algorithm can cluster large datasets to find clusters, that is, groups of similar sentences.

example: examples/text_clustering_demo.py
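
A simplified sketch of threshold-based community detection is shown below. It is an assumption-laden illustration, not the library's code: each item proposes the set of items whose cosine similarity to it meets a threshold, and the largest non-overlapping sets are kept. The `threshold` and `min_community_size` parameters and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def community_detection(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.75,
    min_community_size: int = 2,
) -> List[List[int]]:
    """Greedy clustering: return lists of indices, largest communities first."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # Everything close enough to item i, including i itself.
        members = [j for j in range(n) if cos_sim(embeddings[i], embeddings[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)
    candidates.sort(key=len, reverse=True)

    assigned = set()
    communities = []
    for members in candidates:
        # Drop items already claimed by a larger community.
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities


# Toy 2-D "sentence embeddings": two tight groups and one outlier.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
clusters = community_detection(vectors, threshold=0.9)
print(clusters)
```

This brute-force version compares every pair, so it is quadratic in the number of items; it is meant to show the grouping rule, not to scale.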

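6. Semantic deduplication of text and images

The paraphrase mining (paraphrase_mining_embeddings) algorithm can mine pairs of sentences with similar meaning from a large collection of sentences or documents. It can be used to detect redundant text and images and for semantic deduplication.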
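
The mining-then-deduplicating idea can be sketched with a brute-force pair scan. This is a minimal illustration under stated assumptions, not the library's routine: `mine_paraphrases` and `deduplicate` are hypothetical helper names, the 0.95 threshold is arbitrary, and the toy vectors stand in for text or image embeddings.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cos_sim(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_paraphrases(
    embeddings: Sequence[Sequence[float]], min_score: float = 0.0
) -> List[Tuple[float, int, int]]:
    """Score every pair (i < j) and return (score, i, j), best first."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cos_sim(embeddings[i], embeddings[j])
        if score >= min_score:
            pairs.append((score, i, j))
    pairs.sort(reverse=True)
    return pairs


def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices to keep: for each near-duplicate pair, drop the later item."""
    dropped = set()
    for _, i, j in mine_paraphrases(embeddings, min_score=threshold):
        if i not in dropped:
            dropped.add(j)
    return [idx for idx in range(len(embeddings)) if idx not in dropped]


# Items 0/1 and 2/3 are near-duplicates; item 4 is distinct.
items = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 1.0]]
top_pairs = mine_paraphrases(items, min_score=0.95)
kept = deduplicate(items, threshold=0.95)
print(top_pairs)
print(kept)
```

Scanning all pairs is quadratic, so for large collections the candidate pairs would typically come from an index-based nearest-neighbor search instead.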

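Command-line interface (CLI)

  • Supports batch extraction of text and image embeddings (embedding)
  • Supports index building (index)
  • Supports batch retrieval (filter)
  • Supports starting a service (server)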

code: cli.py

```

similarities -h

NAME similarities

SYNOPSIS similarities COMMAND

COMMANDS COMMAND is one of the following:

 bert_embedding
   Compute embeddings for a list of sentences

 bert_index
   Build indexes from text embeddings using autofaiss

 bert_filter
   Entry point of bert filter, batch search index

 bert_server
   Main entry point of bert search backend, start the server

 clip_embedding
   Embedding text and image with clip model

 clip_index
   Build indexes from embeddings using autofaiss

 clip_filter
   Entry point of clip filter, batch search index

 clip_server
   Main entry point of clip search backend, start the server

```

run:

```shell
pip install similarities -U
similarities clip_embedding -h

# example
cd examples
similarities clip_embedding data/toy_clip/
```

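  • bert_embedding and the others are subcommands: those starting with bert are text-related, and those starting with clip are image-related
  • For the usage of each subcommand, see similarities clip_embedding -h
  • In the example above, data/toy_clip/ is the input_dir argument of clip_embedding, the input file directory (required)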

Contact

  • Issues (suggestions): GitHub issues
  • Email: xuming: xuming624@qq.com
  • WeChat: add my WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group.

Citation

If you use similarities in your research, please cite it in the following format:

APA:

Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities

BibTeX:

@misc{Xu_Similarities_Compute_similarity,
  title={Similarities: similarity calculation and semantic search toolkit},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/similarities}},
}

License

Licensed under The Apache License 2.0, which permits free commercial use. Please include a link to similarities and the license in your product documentation.

Contribute

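The project code is still rough. If you improve it, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:

  • Add corresponding unit tests under tests
  • Run all unit tests with python -m pytest and make sure they all pass

You can then submit a PR.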

Acknowledgements

Thanks for their great work!

Owner

  • Name: xuming
  • Login: shibing624
  • Kind: user
  • Location: Beijing, China
  • Company: @tencent

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
  orcid: "https://orcid.org/0000-0003-3402-7159"
title: "Similarities: Compute similarity score for humans"
url: "https://github.com/shibing624/similarities"
date-released: 2022-03-05
version: 0.0.4

GitHub Events

Total
  • Issues event: 5
  • Watch event: 112
  • Issue comment event: 7
  • Push event: 1
  • Fork event: 13
Last Year
  • Issues event: 5
  • Watch event: 112
  • Issue comment event: 7
  • Push event: 1
  • Fork event: 13

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 166
  • Total Committers: 5
  • Avg Commits per committer: 33.2
  • Development Distribution Score (DDS): 0.024
Past Year
  • Commits: 15
  • Committers: 2
  • Avg Commits per committer: 7.5
  • Development Distribution Score (DDS): 0.067
Top Committers
Name Email Commits
shibing624 s****4@1****m 162
Dom 9****l 1
Allenpandas 6****7@q****m 1
wiker.yang Y****5 1
flemingxu f****u@t****m 1
Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 41
  • Total pull requests: 4
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 1 day
  • Total issue authors: 36
  • Total pull request authors: 4
  • Average comments per issue: 1.93
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 9
  • Pull requests: 1
  • Average time to close issues: 3 days
  • Average time to close pull requests: about 6 hours
  • Issue authors: 7
  • Pull request authors: 1
  • Average comments per issue: 1.89
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JiejieDeng (4)
  • bingOral (2)
  • idea10101 (2)
  • leonyu879 (1)
  • EASTERNTIGER (1)
  • xiuxiuxius (1)
  • vivisol (1)
  • jliartem (1)
  • lifeitech (1)
  • 1264561652 (1)
  • zhangmianhongni (1)
  • Ponyo1 (1)
  • annian101 (1)
  • yxw2014 (1)
  • XHP007 (1)
Pull Request Authors
  • wikeryong (2)
  • Allenpandas (1)
  • shibing624 (1)
Top Labels
Issue Labels
question (18) enhancement (10) bug (7) wontfix (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 182 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 3
  • Total versions: 23
  • Total maintainers: 1
pypi.org: similarities

Similarities is a toolkit for computing similarity scores between texts and performing text searches.

  • Versions: 23
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 182 Last month
Rankings
Stargazers count: 3.2%
Dependent packages count: 3.2%
Forks count: 6.0%
Average: 6.4%
Dependent repos count: 9.1%
Downloads: 10.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • Pillow *
  • annoy *
  • hnswlib *
  • jieba >=0.39
  • loguru *
  • opencv-python *
  • text2vec >=1.1.5
  • transformers *
setup.py pypi
  • Pillow *
  • annoy *
  • hnswlib *
  • jieba >=0.39
  • loguru *
  • opencv-python *
  • text2vec >=1.1.5
  • transformers *
.github/workflows/ubuntu.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite