Shekar: A Python Toolkit for Persian Natural Language Processing
Shekar: A Python Toolkit for Persian Natural Language Processing - Published in JOSS (2025)
txtai
💡 All-in-one open-source AI framework for semantic search, LLM orchestration and language model workflows
harmony
The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.
text2vec
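text2vec, text to vector. A text vector representation tool that converts text into vector matrices. It implements text representation and text similarity models such as Word2Vec, RankBM25, Sentence-BERT, and CoSENT, ready to use out of the box.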
https://github.com/lancedb/lance
Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch with more integrations coming.
@llm-tools/embedjs
A NodeJS RAG framework to easily work with LLMs and embeddings
knrscore
KNRScore is a Python package for computing K-Nearest-Rank Similarity, a metric that quantifies local structural similarity between two maps or embeddings.
word2vecelastic
Collect sentences from ElasticSearch, preprocess and train diachronic Word2Vec models
ragoon
High-level library for batched embeddings generation, blazingly-fast web-based RAG and quantized index processing ⚡
https://github.com/featureform/featureform
The Virtual Feature Store. Turn your existing data infrastructure into a feature store.
https://github.com/csinva/interpretable-embeddings
Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024)
marqo-fashionclip
State-of-the-art CLIP/SigLIP embedding models finetuned for the fashion domain. +57% increase in evaluation metrics vs FashionCLIP 2.0.
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/epsilla-cloud/vectordb
Epsilla is a high performance Vector Database Management System
questea
QuestionnaireEmbeddingsAnalysis - an innovative approach to extracting richer information from clinical surveys
https://github.com/ferencberes/online-node2vec
Node Embeddings in Dynamic Graphs
https://github.com/amazon-science/supervised-intent-clustering
This is a package to fine-tune language models in order to create clustering-friendly embeddings.
https://github.com/dru-mara/evalne-gui
EvalNE-GUI: The Graphical User Interface for EvalNE
jupyter-scatter-tutorial
Jupyter Scatter Tutorial (that was first presented at SciPy '23)
open-text-embeddings
Open Source Text Embedding Models with OpenAI Compatible API
lorann
Approximate Nearest Neighbor search using reduced-rank regression, with extremely fast queries, tiny memory usage, and rapid indexing on modern vector embeddings.
tsde
TSDE is a novel SSL framework for TSRL, the first of its kind, effectively harnessing a diffusion process, conditioned on an innovative dual-orthogonal Transformer encoder architecture with a crossover mechanism, and employing a unique IIF mask strategy (KDD 2024, main research track).
langchain-chatbot
AI Chatbot for analyzing/extracting information from data in conversational format.
https://github.com/centre-for-humanities-computing/embedding-explorer
Tools for interactive visual exploration of semantic embeddings.
https://github.com/aida-ugent/debayes
DeBayes: a Bayesian Method for Debiasing Network Embeddings (ICML 2020).
geospatial-rag
AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries, ChatGPT-like interface
clep
🤖 A Python Package for generating new patient representations driven by data and prior knowledge
awesome-generative-ai
A curated list of Generative AI tools, works, models, and references
DiRe - JAX
DiRe - JAX: A JAX based Dimensionality Reduction Algorithm for Large-scale Data - Published in JOSS (2025)
most-different-text-selection
Use embedding data from LLMs to determine the most different text in a given corpus.
model
The Clay Foundation Model - An open source AI model and interface for Earth
https://github.com/d1egoprog/synthetictriples
Paper showcase for the initial version of the Synthetic Triple generation approach
https://github.com/0xibra/linux-tower-gpt-embeddings-experiment
This project is a work-in-progress and serves as an experiment for context injection with GPT and code embeddings. The goal is to use GPT to develop the remaining features of the project.
https://github.com/sergeyklay/clusterium
Text Clustering Toolkit for Bayesian Nonparametric Analysis
https://github.com/amberlee2427/nancy-brain
Nancy's RAG backend and HTTP API/MCP server connectors.
midi2vec
MIDI2vec computes embeddings for representing MIDI data in vector space
comparative-embedding-visualization
A Jupyter widget for comparing two embeddings with shared labels by their confusion, neighborhoods, and size.