ragoon

High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡

https://github.com/louisbrulenaudet/ragoon

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

ai embeddings embeddings-similarity faiss generative-ai groq groqapi llama llama-index llm nlp rag retrieval-augmented-generation vector-database vector-search vectorization
Last synced: 6 months ago

Repository

High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡

Basic Info
  • Host: GitHub
  • Owner: louisbrulenaudet
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage: https://ragoon.readthedocs.io
  • Size: 18.9 MB
Statistics
  • Stars: 67
  • Watchers: 5
  • Forks: 8
  • Open Issues: 0
  • Releases: 9
Topics
ai embeddings embeddings-similarity faiss generative-ai groq groqapi llama llama-index llm nlp rag retrieval-augmented-generation vector-database vector-search vectorization
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme · License · Code of conduct · Citation

README.md


RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing


RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

Quick install

The reference page for RAGoon is available on its official PyPI page: RAGoon.

```bash
pip install ragoon
```

Usage

This section provides an overview of different code blocks that can be executed with RAGoon to enhance your NLP and language model projects.

Embeddings production

This class handles loading a dataset from Hugging Face, processing it to add embeddings using specified models, and provides methods to save and upload the processed dataset.

```python
from ragoon import EmbeddingsDataLoader
from datasets import load_dataset

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    dataset=load_dataset("louisbrulenaudet/dac6-instruct", split="train"),  # If dataset is already loaded.
    # dataset_name="louisbrulenaudet/dac6-instruct",  # If you want to load the dataset from the class.
    model_configs=[
        {"model": "bert-base-uncased", "query_prefix": "Query:"},
        {"model": "distilbert-base-uncased", "query_prefix": "Query:"}
        # Add more model configurations as needed
    ]
)

# Uncomment this line if passing dataset_name instead of dataset.
# loader.load_dataset()

# Process the splits with all models loaded
loader.process(
    column="output",
    preload_models=True
)

# To access the processed dataset
processed_dataset = loader.get_dataset()
print(processed_dataset[0])
```
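The description above also mentions saving and uploading the processed dataset. The corresponding RAGoon helpers are not shown in this excerpt, but since the result is a regular Hugging Face `datasets.Dataset`, the standard `datasets` API works as a fallback; this is a minimal sketch under that assumption, reusing the `loader` from the snippet above:

```python
# Assumption: get_dataset() returns a standard Hugging Face Dataset
# with one embedding column per configured model.
processed_dataset = loader.get_dataset()

# Persist locally with the regular datasets API
processed_dataset.save_to_disk("dac6-instruct-embedded")

# Or push it to the Hugging Face Hub (requires a write token)
# processed_dataset.push_to_hub("your-username/dac6-instruct-embedded", token="hf_token")
```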

You can also embed a single text using multiple models:

```python
from ragoon import EmbeddingsDataLoader

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    model_configs=[
        {"model": "bert-base-uncased"},
        {"model": "distilbert-base-uncased"}
    ]
)

# Load models
loader.load_models()

# Embed a single text with all loaded models
text = "This is a single text for embedding."
embedding_result = loader.batch_encode(text)

# Output the embeddings
print(embedding_result)
```
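The exact structure of `embedding_result` is not documented in this excerpt. Assuming it maps each configured model name to its embedding vector (an assumption, not confirmed behavior), a quick comparison of the two models' outputs could look like this:

```python
import numpy as np

# Assumption: embedding_result is a dict of {model_name: embedding vector}.
vec_a = np.asarray(embedding_result["bert-base-uncased"], dtype=np.float32)
vec_b = np.asarray(embedding_result["distilbert-base-uncased"], dtype=np.float32)

# Cosine similarity between the two models' representations of the same text
cosine = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"Cosine similarity between models: {cosine:.3f}")
```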

Similarity search and index creation

The SimilaritySearch class is instantiated with specific parameters to configure the embedding model and search infrastructure. The chosen model, louisbrulenaudet/tsdae-lemone-mbert-base, is likely a multilingual BERT model fine-tuned with TSDAE (Transformer-based Denoising Auto-Encoder) on a custom dataset. This model choice suggests a focus on multilingual capabilities and improved semantic representations.

The cuda device specification leverages GPU acceleration, crucial for efficient processing of large datasets. The embedding dimension of 768 is typical for BERT-based models, representing a balance between expressiveness and computational efficiency. The ip (inner product) metric is selected for similarity comparisons, which is computationally faster than cosine similarity when vectors are normalized. The i8 dtype indicates 8-bit integer quantization, a technique that significantly reduces memory usage and speeds up similarity search at the cost of a small accuracy trade-off.
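To make the metric choice concrete: once embeddings are L2-normalized, the inner product equals cosine similarity, which is why the cheaper `ip` metric can be used safely. A small, RAGoon-independent illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)

# L2-normalize both vectors
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# For normalized vectors, the inner product and cosine similarity coincide
inner_product = a_n @ b_n
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(inner_product, cosine)
```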

```python
import polars as pl
from ragoon import dataset_loader, SimilaritySearch, EmbeddingsVisualizer

dataset = dataset_loader(
    name="louisbrulenaudet/dac6-instruct",
    streaming=False,
    split="train"
)

dataset.save_to_disk("dataset.hf")

instance = SimilaritySearch(
    model_name="louisbrulenaudet/tsdae-lemone-mbert-base",
    device="cuda",
    ndim=768,
    metric="ip",
    dtype="i8"
)

embeddings = instance.encode(corpus=dataset["output"])

ubinary_embeddings = instance.quantize_embeddings(embeddings=embeddings, quantization_type="ubinary")
int8_embeddings = instance.quantize_embeddings(embeddings=embeddings, quantization_type="int8")

instance.create_usearch_index(int8_embeddings=int8_embeddings, index_path="./usearch_int8.index", save=True)
instance.create_faiss_index(ubinary_embeddings=ubinary_embeddings, index_path="./faiss_ubinary.index", save=True)

top_k_scores, top_k_indices = instance.search(
    query="Définir le rôle d'un intermédiaire concepteur conformément à l'article 1649 AE du Code général des Impôts.",
    top_k=10,
    rescore_multiplier=4
)

# Add a row index so search results can be joined back to the original rows
# (newer Polars releases expose with_row_index, older ones with_row_count).
try:
    dataframe = pl.from_arrow(dataset.data.table).with_row_index()
except AttributeError:
    dataframe = pl.from_arrow(dataset.data.table).with_row_count(name="index")

scores_df = pl.DataFrame(
    {"index": top_k_indices, "score": top_k_scores}
).with_columns(pl.col("index").cast(pl.UInt32))

search_results = dataframe.filter(
    pl.col("index").is_in(top_k_indices)
).join(scores_df, how="inner", on="index")

print(search_results)
```

Embeddings visualization

This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D plot.

```python
from ragoon import EmbeddingsVisualizer

visualizer = EmbeddingsVisualizer(
    index_path="path/to/index",
    dataset_path="path/to/dataset"
)

visualizer.visualize(
    method="pca",
    save_html=True,
    html_file_name="embedding_visualization.html"
)
```
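Since the class also supports t-SNE according to the description above, switching the reduction method should only require changing the `method` argument; the accepted string values are an assumption based on that description, not a documented API:

```python
# Assumption: "tsne" is a valid value for the same `method` parameter shown above.
visualizer.visualize(
    method="tsne",
    save_html=True,
    html_file_name="embedding_visualization_tsne.html"
)
```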

(Plot: interactive 3D embedding visualization)

Dynamic web search

RAGoon is a Python library that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, web scraping, and data augmentation techniques. It integrates various APIs, enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.

RAGoon's core functionality revolves around the concept of few-shot learning, where language models are provided with a small set of high-quality examples to enhance their understanding and generate more accurate outputs. By curating and retrieving relevant data from the web, RAGoon equips language models with the necessary context and knowledge to tackle complex queries and generate insightful responses.

```python
from groq import Groq
from openai import OpenAI
from ragoon import WebRAG

# Initialize RAGoon instance
ragoon = WebRAG(
    google_api_key="your_google_api_key",
    google_cx="your_google_cx",
    completion_client=Groq(api_key="your_groq_api_key")
)

# Search and get results
query = "I want to do a left join in Python Polars"
results = ragoon.search(
    query=query,
    completion_model="Llama3-70b-8192",
    max_tokens=512,
    temperature=1,
)

# Print results
print(results)
```
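The snippet imports both `Groq` and `OpenAI`, which suggests that `completion_client` also accepts an OpenAI client. Treating that as an assumption rather than documented behavior, the swap could look like this (the model name below is purely illustrative):

```python
from openai import OpenAI
from ragoon import WebRAG

# Assumption: completion_client accepts an OpenAI client as well,
# since the README imports both Groq and OpenAI.
ragoon = WebRAG(
    google_api_key="your_google_api_key",
    google_cx="your_google_cx",
    completion_client=OpenAI(api_key="your_openai_api_key")
)

results = ragoon.search(
    query="I want to do a left join in Python Polars",
    completion_model="gpt-4o-mini",  # hypothetical model name, for illustration only
    max_tokens=512,
    temperature=1,
)
print(results)
```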

Badge

Building something cool with RAGoon? Consider adding a badge to your project card.

```markdown
[<img src="https://raw.githubusercontent.com/louisbrulenaudet/ragoon/main/assets/badge.svg" alt="Built with RAGoon" width="200" height="32"/>](https://github.com/louisbrulenaudet/ragoon)
```

Citing this project

If you use this code in your research, please use the following BibTeX entry.

```bibtex
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing},
  howpublished = {\url{https://github.com/louisbrulenaudet/ragoon}},
  year = {2024}
}
```

Feedback

If you have any feedback, please reach out at louisbrulenaudet@icloud.com.

Owner

  • Name: Louis Brulé Naudet
  • Login: louisbrulenaudet
  • Kind: user
  • Location: Paris
  • Company: Université Paris-Dauphine (Paris Sciences et Lettres - PSL)

Research in business taxation and development (NLP, LLM, Computer vision...), University Dauphine-PSL 📖 | Backed by the Microsoft for Startups Hub program

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Brulé Naudet"
  given-names: "Louis"
  orcid: "https://orcid.org/0000-0001-9111-4879"
title: "RAGoon : Improve Large Language Models retrieval using dynamic web-search"
version: 1.0.0
date-released: 2024-05-26

GitHub Events

Total
  • Release event: 1
  • Watch event: 10
  • Push event: 4
  • Pull request event: 1
  • Fork event: 2
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 10
  • Push event: 4
  • Pull request event: 1
  • Fork event: 2
  • Create event: 1

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 60
  • Total Committers: 4
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.05
Past Year
  • Commits: 21
  • Committers: 3
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.095
Top Committers
Name Email Commits
Louis Brulé Naudet l****t@i****m 57
youssefkhalil320 y****0@g****m 1
Yuan-Man 6****X 1
root r****t@D****O 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 3
  • Average time to close issues: 5 minutes
  • Average time to close pull requests: 10 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.33
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • youssefkhalil320 (1)
Pull Request Authors
  • Yuan-ManX (2)
  • ethanke (2)
  • youssefkhalil320 (2)
Top Labels
Issue Labels
Pull Request Labels
documentation (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 41 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 15
  • Total maintainers: 1
pypi.org: ragoon

RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡

  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 41 Last month
Rankings
  • Dependent packages count: 10.9%
  • Average: 36.1%
  • Dependent repos count: 61.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • google-api-python-client *
  • groq *
  • httpx *
  • openai *
  • requests *
pyproject.toml pypi