similarities
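Similarities: a toolkit for similarity calculation and semantic search. It supports text-to-text, text-to-image, and image-to-image search over datasets at the hundred-million scale, is written in Python 3, and works out of the box.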
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.3%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: shibing624
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://pypi.org/project/similarities/
- Size: 9.61 MB
Statistics
- Stars: 870
- Watchers: 9
- Forks: 87
- Open Issues: 8
- Releases: 2
Topics
Metadata Files
README.md
Similarities: Similarity Calculation and Semantic Search
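similarities: a toolkit for similarity calculation and semantic search that supports text and images.
similarities implements a range of similarity calculation and semantic matching and retrieval algorithms for text and images. It supports text-to-text, text-to-image, and image-to-image search over hundred-million-scale datasets, is written in Python 3, installs with pip, and works out of the box.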
Guide
Features
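Text similarity calculation + text search
- Semantic matching models (recommended): built on text2vec, this project implements text similarity calculation and text search with the CoSENT model
- Supports Chinese, English, and multilingual SentenceBERT-style pretrained models
- Supports several similarity measures, including Cos Similarity, Dot Product, Hamming Distance, and Euclidean Distance
- Supports several text search algorithms, including SemanticSearch, Faiss, Annoy, and Hnsw
- Supports efficient retrieval over hundred-million-scale datasets
- Supports command-line text-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
- Literal matching models: this project implements Word2Vec, BM25, RankBM25, TFIDF, SimHash, the Cilin synonym thesaurus, HowNet sememe matching, and other literal matching models
Image similarity calculation / image-text similarity calculation + image-to-image / text-to-image search
- CLIP (Contrastive Language-Image Pre-Training) model: an image-text matching model that can be used for image and text features (embeddings), similarity calculation, image-text retrieval, and zero-shot image classification. Built on PyTorch, this project implements CLIP vector representation, index building (with AutoFaiss), batch retrieval, a backend service (with FastAPI), and a front-end UI (with Gradio)
- Supports CLIP-family models such as openai/clip-vit-base-patch32
- Supports Chinese-CLIP-family models such as OFA-Sys/chinese-clip-vit-huge-patch14
- Supports separate deployment of front end and back end, with a FastAPI backend service and a Gradio front end
- Supports efficient retrieval over hundred-million-scale datasets, based on Faiss, with GPU acceleration
- Supports image-to-image, text-to-image, and vector-to-image search
- Supports image embedding extraction and text embedding extraction
- Supports image similarity calculation and image-text similarity calculation
- Supports command-line image-to-vector conversion (multi-GPU), index building, batch retrieval, and starting a service
- Image feature extraction: built on cv2, this project implements pHash, dHash, wHash, aHash, SIFT, and other image feature extraction algorithms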
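As a rough illustration of the four similarity measures named in the feature list, the following standard-library sketch computes each one on toy inputs. The function names are chosen for this example only and are not the library's API; the Hamming distance here operates on integer bit fingerprints (such as those produced by SimHash or an image hash), which is an assumption about how it would typically be applied.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two integer fingerprints."""
    return bin(fp_a ^ fp_b).count("1")


u, v = [3.0, 4.0], [4.0, 3.0]
print(dot_product(u, v))                 # 24.0
print(cosine_similarity(u, v))           # 0.96
print(euclidean_distance(u, v))
print(hamming_distance(0b1011, 0b0010))  # 2
```

Cosine similarity and dot product are similarity scores (higher is closer), while Euclidean and Hamming are distances (lower is closer), so ranking code has to sort in the matching direction.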
Demo
Image Search Demo: https://huggingface.co/spaces/shibing624/CLIP-Image-Search

Text Search Demo: https://huggingface.co/spaces/shibing624/similarities

Install
pip install torch # conda install pytorch
pip install -U similarities
or
git clone https://github.com/shibing624/similarities.git
cd similarities
pip install -e .
Usage
1. Text vector similarity calculation
example: examples/text_similarity_demo.py
python
from similarities import BertSimilarity
m = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}") # similarity score: 0.855146050453186
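model_name_or_path: the model name or path. By default, the Chinese semantic matching model shibing624/text2vec-base-chinese is downloaded from the HF model hub and used. For multilingual use, switch to the shibing624/text2vec-base-multilingual model, which supports Chinese, English, Korean, Japanese, German, Italian, and other languages.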
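2. Text vector search
Find the texts in a candidate document set that are most similar to a query. This is commonly used for question matching in QA scenarios, text search, and similar tasks.
SemanticSearch exact search: Cos Similarity + topK clustering retrieval, suited to datasets of under one million items
example: examples/text_semantic_search_demo.py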
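The exact search described above amounts to scoring the query against every corpus vector and keeping the top K. The sketch below shows that idea with the standard library only; `exact_top_k` and the toy two-dimensional vectors are hypothetical names and data for illustration, not the library's SemanticSearch implementation, which works on model embeddings.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query_vec, corpus_vecs, top_k=2):
    """Score every corpus vector against the query and return the best top_k.

    Returns a list of (corpus_index, score), highest score first.
    """
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    best = heapq.nlargest(top_k, scored)
    return [(idx, score) for score, idx in best]


# Toy "embeddings"; real ones would come from a sentence embedding model.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = exact_top_k([1.0, 0.1], corpus, top_k=2)
print(hits)
```

Because every query touches every vector, cost grows linearly with corpus size, which is consistent with the guidance that exact search suits smaller datasets and approximate indexes suit larger ones.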
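Approximate search algorithms such as Annoy and Hnswlib, suited to million-scale datasets
example: examples/fast_text_semantic_search_demo.py
Faiss efficient vector retrieval, suited to hundred-million-scale datasets
Text-to-vector conversion, index building, batch retrieval, and starting the service: examples/faiss_bert_search_server_demo.py
Python client call: examples/faiss_bert_search_client_demo.py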
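3. Literal-matching text similarity calculation and text search
Supports similarity calculation and literal matching search with the Cilin synonym thesaurus, HowNet, word vectors (WordEmbedding), Tfidf, SimHash, BM25, and other algorithms. This is commonly used for cold-starting text matching.
example: examples/literal_text_semantic_search_demo.py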
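To make the literal-matching idea concrete, here is a minimal BM25 scoring sketch using only the standard library. It uses a common IDF variant with `+ 1` inside the logarithm and default `k1`/`b` values; these choices, the function name `bm25_scores`, and the whitespace tokenization are assumptions for illustration and may differ from the library's BM25 classes. Chinese text would first need a word segmenter (jieba appears in the project's dependencies).

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "how to change the bank card bound to huabei".split(),
    "how to open a huabei account".split(),
    "weather forecast for beijing tomorrow".split(),
]
query = "change huabei bank card".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Since scoring depends only on token overlap and corpus statistics, no trained model is needed, which is why literal matching is a practical fit for cold-start scenarios.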
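4. Image similarity calculation and image search
Supports image similarity calculation and matching search with CLIP, pHash, SIFT, and other algorithms. The Chinese CLIP model supports image-to-image and text-to-image search, as well as image-text search in both Chinese and English.
example: examples/image_semantic_search_demo.py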

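Faiss efficient vector retrieval, suited to hundred-million-scale datasets
Image-to-vector conversion, index building, batch retrieval, and starting the service: examples/faiss_clip_search_server_demo.py
Python client call: examples/faiss_clip_search_client_demo.py
Gradio front-end call: examples/faiss_clip_search_gradio_demo.py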
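The hash-based image features mentioned in the feature list (pHash, dHash, aHash, and so on) reduce an image to a short bit string that can be compared with Hamming distance. The sketch below illustrates the difference-hash (dHash) idea on tiny hand-written grayscale grids. It skips the resize and grayscale conversion a real pipeline would do with cv2 or Pillow, and the bit convention (left pixel brighter than right gives 1) is an assumption; actual implementations may use the opposite convention.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontally adjacent pixel pair.

    gray_rows is a list of rows of grayscale values, already downscaled.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if left > right else 0)
    return bits


def hamming(bits_a, bits_b):
    """Count positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(bits_a, bits_b))


img_a = [[10, 20, 30], [30, 20, 10]]
img_b = [[12, 22, 33], [29, 21, 9]]   # slightly noisy copy of img_a
img_c = [[30, 20, 10], [10, 20, 30]]  # gradients reversed

print(hamming(dhash_bits(img_a), dhash_bits(img_b)))  # 0
print(hamming(dhash_bits(img_a), dhash_bits(img_c)))  # 4
```

Because only the direction of brightness change is stored, small brightness or noise differences leave the hash unchanged, while structural changes flip bits.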

5. Clustering
The community detection (community_detection) algorithm can cluster large datasets and find clusters (that is, groups of similar sentences).
example: examples/text_clustering_demo.py
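A simplified view of threshold-based community detection: for each item, collect everything at least `threshold` similar to it, then accept the largest groups first and skip members already assigned. The sketch below is a quadratic-time, standard-library illustration of that idea; the function signature, default values, and toy vectors are assumptions for this example and it is not the optimized routine the library or sentence-transformers provides.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def community_detection(vectors, threshold=0.75, min_community_size=2):
    """Greedy grouping of items whose similarity to a center item meets the threshold."""
    n = len(vectors)
    candidates = []
    for i in range(n):
        # Every item counts as similar to itself, so it is included in its own group.
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Largest candidate groups claim their members first.
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities = []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_community_size:
            communities.append(free)
            assigned.update(free)
    return communities


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
clusters = community_detection(vectors)
print(clusters)  # [[0, 1], [2, 3]]
```

Items that do not reach `min_community_size` (index 4 here) are left unclustered rather than forced into a group, which differs from algorithms like k-means that assign every point.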
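6. Image and text semantic deduplication
The paraphrase mining (paraphrase_mining_embeddings) algorithm can mine sentence pairs with similar meaning from large collections of sentences or documents. It can be used to detect redundant images and text and for semantic deduplication.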
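The mining step described above can be pictured as scoring all pairs of embeddings and keeping those above a threshold; deduplication then keeps one representative per group of near-duplicates. The following standard-library sketch shows both steps on toy vectors. `mine_similar_pairs` and `deduplicate` are hypothetical helper names, the threshold is arbitrary, and the all-pairs loop is only practical for small inputs; it is not the library's implementation.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_similar_pairs(vectors, threshold=0.9):
    """Return (score, i, j) for every pair at or above the threshold, best first."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((score, i, j))
    pairs.sort(reverse=True)
    return pairs


def deduplicate(vectors, threshold=0.9):
    """Keep an item only if it is not a near-duplicate of an item already kept."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[k]) < threshold for k in kept):
            kept.append(i)
    return kept


# Toy embeddings: items 0, 1, and 3 are near-duplicates; item 2 is distinct.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.01]]
pairs = mine_similar_pairs(vectors)
print([(i, j) for _, i, j in pairs])
print(deduplicate(vectors))  # [0, 2]
```

The same logic applies to image embeddings, since both steps only see vectors.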
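Command-line mode (CLI)
- Supports batch computation of text vectors and image vectors (embedding)
- Supports index building (index)
- Supports batch retrieval (filter)
- Supports starting a service (server)
code: cli.py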
```
similarities -h
NAME similarities
SYNOPSIS similarities COMMAND
COMMANDS COMMAND is one of the following:
bert_embedding
Compute embeddings for a list of sentences
bert_index
Build indexes from text embeddings using autofaiss
bert_filter
Entry point of bert filter, batch search index
bert_server
Main entry point of bert search backend, start the server
clip_embedding
Embedding text and image with clip model
clip_index
Build indexes from embeddings using autofaiss
clip_filter
Entry point of clip filter, batch search index
clip_server
Main entry point of clip search backend, start the server
```
run:
```shell
pip install similarities -U
similarities clip_embedding -h

# example
cd examples
similarities clip_embedding data/toy_clip/
```
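- bert_embedding and the other entries are subcommands; those starting with bert are text-related, and those starting with clip are image-related
- For the usage of each subcommand, see similarities clip_embedding -h
- In the example above, data/toy_clip/ is the input_dir argument of clip_embedding, the input file directory (required)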
Contact

Citation
If you use similarities in your research, please cite it in the following format:
APA:
Xu, M. Similarities: Compute similarity score for humans (Version 1.0.1) [Computer software]. https://github.com/shibing624/similarities
BibTeX:
@misc{Xu_Similarities_Compute_similarity,
title={Similarities: similarity calculation and semantic search toolkit},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/similarities}},
}
License
Licensed under The Apache License 2.0, free for commercial use. Please include a link to similarities and the license in your product documentation.
Contribute
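The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:
- Add corresponding unit tests in tests
- Run all unit tests with python -m pytest and make sure they all pass
After that, you can submit a PR.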
Acknowledgements
- A Simple but Tough-to-Beat Baseline for Sentence Embeddings[Sanjeev Arora and Yingyu Liang and Tengyu Ma, 2017]
- https://github.com/liuhuanyong/SentenceSimilarity
- https://github.com/qwertyforce/image_search
- ImageHash - Official Github repository
- https://github.com/openai/CLIP
- https://github.com/OFA-Sys/Chinese-CLIP
- https://github.com/UKPLab/sentence-transformers
- https://github.com/rom1504/clip-retrieval
Thanks for their great work!
Owner
- Name: xuming
- Login: shibing624
- Kind: user
- Location: Beijing, China
- Company: @tencent
- Website: https://blog.csdn.net/mingzai624
- Repositories: 32
- Profile: https://github.com/shibing624
Senior Researcher, Machine Learning Developer, Advertising Risk Control.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Xu"
    given-names: "Ming"
    orcid: "https://orcid.org/0000-0003-3402-7159"
title: "Similarities: Compute similarity score for humans"
url: "https://github.com/shibing624/similarities"
data-released: 2022-03-05
version: 0.0.4
GitHub Events
Total
- Issues event: 5
- Watch event: 112
- Issue comment event: 7
- Push event: 1
- Fork event: 13
Last Year
- Issues event: 5
- Watch event: 112
- Issue comment event: 7
- Push event: 1
- Fork event: 13
Committers
Last synced: 11 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| shibing624 | s****4@1****m | 162 |
| Dom | 9****l | 1 |
| Allenpandas | 6****7@q****m | 1 |
| wiker.yang | Y****5 | 1 |
| flemingxu | f****u@t****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 41
- Total pull requests: 4
- Average time to close issues: about 1 month
- Average time to close pull requests: 1 day
- Total issue authors: 36
- Total pull request authors: 4
- Average comments per issue: 1.93
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 9
- Pull requests: 1
- Average time to close issues: 3 days
- Average time to close pull requests: about 6 hours
- Issue authors: 7
- Pull request authors: 1
- Average comments per issue: 1.89
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- JiejieDeng (4)
- bingOral (2)
- idea10101 (2)
- leonyu879 (1)
- EASTERNTIGER (1)
- xiuxiuxius (1)
- vivisol (1)
- jliartem (1)
- lifeitech (1)
- 1264561652 (1)
- zhangmianhongni (1)
- Ponyo1 (1)
- annian101 (1)
- yxw2014 (1)
- XHP007 (1)
Pull Request Authors
- wikeryong (2)
- Allenpandas (1)
- shibing624 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - pypi 182 last-month
- Total dependent packages: 1
- Total dependent repositories: 3
- Total versions: 23
- Total maintainers: 1
pypi.org: similarities
Similarities is a toolkit for computing similarity scores between texts and performing text searches.
- Homepage: https://github.com/shibing624/similarities
- Documentation: https://similarities.readthedocs.io/
- License: Apache License 2.0
- Latest release: 1.2.3 (published over 1 year ago)
Rankings
Maintainers (1)
Dependencies
- Pillow *
- annoy *
- hnswlib *
- jieba >=0.39
- loguru *
- opencv-python *
- text2vec >=1.1.5
- transformers *
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite