blackkite

https://github.com/kjgarza/blackkite

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: kjgarza
License: mit
Language: Python
Default Branch: master
Size: 624 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 0

Created over 2 years ago · Last pushed almost 2 years ago


Blackkite: Ingestor and processor of data to MongoDB

Blackkite helps you create a small MongoDB index for semantic search or RAG. By default, it indexes only Markdown files that are in a single directory.

Update environment variables with your MongoDB connection string and OpenAI API key.
Create a new Python environment: python3 -m venv env
Activate the new Python environment: source env/bin/activate
Install the requirements: pip3 install -r requirements.txt
Load, Transform, Embed and Store: python3 vectorize.py {path}
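
The first step above asks for a MongoDB connection string and an OpenAI API key in environment variables. A minimal sketch of reading them and failing early when one is missing might look like the following. The variable names MONGODB_CONN_STRING and OPENAI_API_KEY and the Settings class are assumptions for illustration; this README does not state the names Blackkite actually reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names; check the repository for the real ones.
REQUIRED_VARIABLES = ("MONGODB_CONN_STRING", "OPENAI_API_KEY")


@dataclass(frozen=True)
class Settings:
    mongodb_conn_string: str
    openai_api_key: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the two required settings, raising if either is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        mongodb_conn_string=env["MONGODB_CONN_STRING"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```

Passing a plain dictionary instead of relying on os.environ keeps the function easy to try out without touching your shell configuration.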

Make sure all the files are in the same directory. You can use the script in utils to move them: sh move_.md.sh
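
If you would rather do the flattening from Python, the idea can be sketched as below. This is not the repository's shell script; flatten_markdown is a hypothetical helper, and it copies files rather than moving them so the original tree stays untouched.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_markdown(source_root: Path, target_dir: Path) -> List[Path]:
    """Copy every .md file found under source_root into the single directory target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    # sorted() materializes the listing first, so files copied during the loop
    # are never picked up again even if target_dir lives inside source_root.
    for path in sorted(source_root.rglob("*.md")):
        if target_dir in path.parents:
            continue  # already in the target directory
        destination = target_dir / path.name
        if destination.exists():
            # On a name clash, prefix the parent folder name instead of overwriting.
            destination = target_dir / f"{path.parent.name}_{path.name}"
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied
```

The clash handling is deliberately simple; two files with the same name in identically named folders would still collide.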

Semantic Search Made Easy With LangChain and MongoDB

Enabling semantic search on user-specific data is a multi-step process that includes loading, transforming, embedding and storing data before it can be queried.

That graphic is from the team over at LangChain, whose goal is to provide a set of utilities to greatly simplify this process.

In this tutorial, we'll walk through each of these steps, using MongoDB Atlas as our Store. Specifically, we'll use the AT&T Wikipedia page as our data source. We'll then use libraries from LangChain to Load, Transform, Embed and Store:

Once the source is stored in MongoDB, we can retrieve the data that interests us.

Prerequisites

MongoDB Atlas Subscription (Free Tier is fine)
OpenAI API key

Quick Start Steps

Get the code: git clone https://github.com/wbleonard/atlas-langchain.git
Update params.py with your MongoDB connection string and OpenAI API key.
Create a new Python environment: python3 -m venv env
Activate the new Python environment: source env/bin/activate
Install the requirements: pip3 install -r requirements.txt
Load, Transform, Embed and Store: python3 vectorize.py
Retrieve: python3 query.py -q "Who started AT&T?"
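
The Retrieve step passes the question with a -q flag. A minimal sketch of how such a flag can be parsed with argparse (which is listed in the requirements) follows. The long form --question and the help text are assumptions; the actual query.py may define its arguments differently.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; argv=None means 'read sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Ask a question against the indexed documents."
    )
    # "-q" matches the Quick Start command; "--question" is a hypothetical long form.
    parser.add_argument("-q", "--question", required=True, help="The question to ask")
    return parser.parse_args(argv)


# Example: parse_args(["-q", "Who started AT&T?"]).question
```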

The Details

Load -> Transform -> Embed -> Store

Step 1: Load

There's no shortage of data sources: Slack, YouTube, Git, Excel, Reddit, Twitter, etc. LangChain provides a growing list of integrations that covers these and many more.

For this exercise, we're going to use the WebBaseLoader to load the Wikipedia page for AT&T.

```python
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/AT%26T")
data = loader.load()
```

Step 2: Transform (Split)

Now that we have a bunch of text loaded, it needs to be split into smaller chunks so we can tease out the relevant portion based on our search query. For this example we'll use the recommended RecursiveCharacterTextSplitter. As I have it configured, it attempts to split on paragraphs ("\n\n"), then lines ("\n"), then sentences ("(?<=\. )"), then words (" "), using a chunk size of 1000 characters. So if a paragraph doesn't fit into 1000 characters, it is split at the last word boundary that keeps the chunk under 1000 characters. You can tune chunk_size to your liking: smaller values produce more documents, and larger values produce fewer.

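```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Separators are tried in order: paragraphs, lines, sentences, then words.
# The sentence pattern is a raw string so "\." is not read as a string escape.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " "],
    length_function=len,
)
docs = text_splitter.split_documents(data)
```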
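
To make the fallback order concrete, here is a small, dependency-free sketch of the same idea: try the coarsest separator first, merge neighbouring pieces while they fit, and recurse with the next separator only for pieces that are still too long. It is a simplified illustration, not LangChain's implementation; split_text is a hypothetical name, and chunk overlap is ignored.

```python
import re
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", r"(?<=\. )", " ")


def split_text(text: str, chunk_size: int = 1000,
               separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, rest = separators[0], separators[1:]
    pieces = [piece for piece in re.split(separator, text) if piece]
    # A lookbehind pattern consumes no characters, so nothing is re-inserted on merge.
    joiner = "" if separator.startswith("(?") else separator

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(split_text(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks
```

With a small chunk_size, a two-paragraph text is first cut at the blank line, and only the paragraph that is still too long is cut again at its sentence boundary.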

Step 3: Embed

Embedding is where you associate your text with an LLM to create a vector representation of that text. There are many options to choose from, such as OpenAI and Hugging Face, and LangChain provides a standard interface for interacting with all of them.

For this exercise we're going to use the popular OpenAI embedding. Before proceeding, you'll need an API key for the OpenAI platform, which you will set in params.py.

We're simply going to load the embedder in this step. The real power comes when we store the embeddings in Step 4.

```python
from langchain.embeddings.openai import OpenAIEmbeddings

import params

embeddings = OpenAIEmbeddings(openai_api_key=params.openai_api_key)
```

Step 4: Store

You'll need a vector database to store the embeddings, and lucky for you MongoDB fits that bill. Even luckier for you, the folks at LangChain have a MongoDB Atlas module that will do all the heavy lifting for you! Don't forget to add your MongoDB Atlas connection string to params.py.

```python
from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch

import params

client = MongoClient(params.mongodb_conn_string)
collection = client[params.db_name][params.collection_name]

# Insert the documents in MongoDB Atlas with their embedding
docsearch = MongoDBAtlasVectorSearch.from_documents(
    docs, embeddings, collection=collection, index_name=params.index_name
)
```

You'll find the complete script in vectorize.py, which needs to be run once per data source (and you could easily modify the code to iterate over multiple data sources).
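
One way to iterate over several data sources is to keep the loading step behind a small function and collect the results before splitting and storing them. The sketch below assumes a hypothetical helper, load_all, that takes any loader callable; it is not part of vectorize.py.

```python
from typing import Any, Callable, Iterable, List


def load_all(sources: Iterable[str],
             load_one: Callable[[str], List[Any]]) -> List[Any]:
    """Load every source with load_one and return one combined list of documents."""
    documents: List[Any] = []
    for source in sources:
        documents.extend(load_one(source))
    return documents


# With LangChain installed, load_one could wrap the loader from Step 1, e.g.
#   load_all(urls, lambda url: WebBaseLoader(url).load())
```

The combined list can then go through the same Transform, Embed and Store steps as a single source.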

```zsh
python3 vectorize.py
```

Step 5: Index the Vector Embeddings

The final step before we can query the data is to create a search index on the stored embeddings.

In the Atlas console, create a Search Index named vsearch_index using the JSON Editor, with the following definition:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}
```
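
The "similarity": "cosine" setting means stored vectors are compared with the query vector by the angle between them rather than by their length. As a rough illustration of that measure (not of how Atlas computes or scales its scores internally), a plain-Python version could look like this; cosine_similarity is a name chosen for the example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1, orthogonal vectors score 0, and opposite vectors score -1. The dimension check mirrors why "dimensions" in the index must match the size of the embeddings you store.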

Retrieve

We could now run a search using methods like similarity_search or max_marginal_relevance_search, and that would return the relevant slice of data, which in our case would be an entire paragraph. However, we can continue to harness the power of the LLM to contextually compress the response so that it more directly tries to answer our question.

```python
from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

import params

client = MongoClient(params.mongodb_conn_string)
collection = client[params.db_name][params.collection_name]

vectorStore = MongoDBAtlasVectorSearch(
    collection,
    OpenAIEmbeddings(openai_api_key=params.openai_api_key),
    index_name=params.index_name,
)

# Use the LLM to extract only the parts of each document relevant to the query
llm = OpenAI(openai_api_key=params.openai_api_key, temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorStore.as_retriever(),
)
```

```zsh
python3 query.py -q "Who started AT&T?"

Your question:

Who started AT&T?

AI Response:

AT&T - Wikipedia
"AT&T was founded as Bell Telephone Company by Alexander Graham Bell, Thomas Watson and Gardiner Greene Hubbard after Bell's patenting of the telephone in 1875."[25]
"On December 30, 1899, AT&T acquired the assets of its parent American Bell Telephone, becoming the new parent company."[28]
```
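
The Retrieve section also mentions maximal marginal relevance (MMR) search as an alternative to plain similarity search. The sketch below shows the textbook greedy form of MMR on raw vectors: each pick balances relevance to the query against similarity to what has already been picked. It is an illustration only, not LangChain's or Atlas's implementation, and the function and parameter names are chosen for the example.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_marginal_relevance(query: Sequence[float],
                           candidates: Sequence[Sequence[float]],
                           k: int = 4,
                           lambda_mult: float = 0.5) -> List[int]:
    """Return the indices of up to k candidates, chosen greedily by MMR.

    lambda_mult=1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_index, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = _cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (_cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected
```

With a near-duplicate among the candidates, a low lambda_mult skips it in favour of a less similar result, while lambda_mult=1.0 returns the two closest matches.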

Resources

Owner

Name: Kristian Garza
Login: kjgarza
Kind: user
Location: Berlin

Website: http://uk.linkedin.com/in/kjgarza/
Twitter: kriztean
Repositories: 29
Profile: https://github.com/kjgarza

Citation (CITATION.cff)

cff-version: 1.2.0
authors:
  - family-names: Garza
    given-names: Kristian
    email: kj.garza@gmail.com
    orcid: https://orcid.org/0000-0003-3484-6875
    affiliation:
title: Blackkite Ingestor and processor of data to MongoDB
abstract: Blackkite helps you to create a small MongoDB index for making semantic search or RAG.
  By default, it indexes only markdown files that are in a single directory. To use
  Blackkite, update the environment variables with your MongoDB connection string and
  Open AI API key.
version: 1.0.0
repository: https://github.com/kjgarza/blackkite.git
license: MIT
keywords:
  - MongoDB
  - Semantic Search
  - LangChain
  - Data Processing
  - Python

Dependencies

requirements.txt pypi

argparse *
bs4 *
langchain >=0.0.231
lxml *
openai *
pymongo *
requests *
tiktoken *

poetry.lock pypi

aiohttp 3.9.5
aiosignal 1.3.1
anyio 4.3.0
argparse 1.4.0
async-timeout 4.0.3
attrs 23.2.0
beautifulsoup4 4.12.3
bs4 0.0.2
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
dataclasses-json 0.5.9
distro 1.9.0
dnspython 2.6.1
exceptiongroup 1.2.1
frozenlist 1.4.1
greenlet 3.0.3
h11 0.14.0
httpcore 1.0.5
httpx 0.27.0
idna 3.7
langchain 0.0.231
langchainplus-sdk 0.0.20
lxml 5.2.1
marshmallow 3.21.1
marshmallow-enum 1.5.1
multidict 6.0.5
mypy-extensions 1.0.0
numexpr 2.10.0
numpy 1.26.4
openai 1.23.2
openapi-schema-pydantic 1.2.4
packaging 24.0
pydantic 1.10.15
pymongo 4.6.3
pyyaml 6.0.1
regex 2024.4.16
requests 2.31.0
sniffio 1.3.1
soupsieve 2.5
sqlalchemy 2.0.29
tenacity 8.2.3
tiktoken 0.6.0
tqdm 4.66.2
typing-extensions 4.11.0
typing-inspect 0.9.0
urllib3 2.2.1
yarl 1.9.4

pyproject.toml pypi

argparse *
bs4 *
click *
langchain ^0.0.231
lxml *
openai *
pymongo *
python ^3.10
requests *
tiktoken *
