quick-start-creating-a-vector-database-for-rag

A playground for vector database exploration using Chroma

https://github.com/joshuapowell/quick-start-creating-a-vector-database-for-rag

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary

Keywords

llm rag vector-database

Repository

Basic Info
  • Host: GitHub
  • Owner: joshuapowell
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 43.9 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago

README.md

Quick Start for Creating a Vector Database for LLM with RAG

I'm working to learn more about how I can use large language models (LLMs) with retrieval-augmented generation (RAG) to improve existing processes and work pipelines by offloading repetitive tasks. Part of expanding my knowledge of LLMs and RAG is being able to generate a body of knowledge specific to the process or work pipeline that I'm targeting at the time.

To help me focus on getting to a working RAG setup without being swallowed by the vast amount of information on the topic, or by alternatives such as knowledge graphs, I've decided to begin my exploration with Chroma. Chroma is an open-source (Apache 2.0) vector database that, from what I can tell, provides a simple Python SDK and CLI that I can use to build out my knowledge base.
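To make the idea of a vector database concrete, here is a minimal sketch of its core operation: store each text alongside an embedding vector, then return the stored items closest to a query vector. This is not Chroma's API or implementation. The TinyVectorStore class, the hand-written three-dimensional vectors, and the choice of cosine similarity with a brute-force scan are all assumptions made for illustration; a real database such as Chroma would typically compute embeddings with a model and use an index instead of scanning every record.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: brute-force search, no persistence."""

    def __init__(self) -> None:
        # Maps record id -> (embedding, document text).
        self._records: dict[str, tuple[list[float], str]] = {}

    def add(self, record_id: str, embedding: list[float], document: str) -> None:
        self._records[record_id] = (embedding, document)

    def query(
        self, embedding: list[float], n_results: int = 1
    ) -> list[tuple[str, str, float]]:
        """Return (id, document, similarity) for the closest records, best first."""
        scored = [
            (record_id, document, cosine_similarity(embedding, stored))
            for record_id, (stored, document) in self._records.items()
        ]
        scored.sort(key=lambda item: item[2], reverse=True)
        return scored[:n_results]


store = TinyVectorStore()
# Made-up embeddings; a real pipeline would get these from an embedding model.
store.add("doc-1", [0.9, 0.1, 0.0], "How to reset a password")
store.add("doc-2", [0.0, 0.8, 0.6], "Quarterly sales summary")
store.add("doc-3", [0.8, 0.3, 0.1], "Account lockout policy")

top = store.query([1.0, 0.2, 0.0], n_results=2)
for record_id, document, score in top:
    print(f"{record_id}  {score:.3f}  {document}")
```

The query vector points in roughly the same direction as the two account-related documents, so those rank ahead of the unrelated sales summary. In a RAG setup, the retrieved documents would then be passed to the LLM as context.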

Prerequisites

  1. Python 3.13.2 or later
  2. ChromaDB 1.0.15 or later
  3. ipykernel 6.29.5 or later
  4. Poetry
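One way to confirm these prerequisites before opening a notebook is a small check script. The sketch below is not part of the repository: the parse_version and check_prerequisites helpers are hypothetical names, and the version parsing is deliberately simple (it reads leading numeric components only, so a pre-release such as 2.0.0b1 is treated as 2.0). It relies only on the standard library.

```python
import shutil
import sys
from importlib import metadata

# Minimum versions copied from the prerequisites list above.
MIN_PYTHON = (3, 13, 2)
MIN_PACKAGES = {"chromadb": (1, 0, 15), "ipykernel": (6, 29, 5)}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.0.15' into (1, 0, 15); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_prerequisites() -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if sys.version_info[:3] < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {found} is older than 3.13.2")
    for name, minimum in MIN_PACKAGES.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if parse_version(installed) < minimum:
            wanted = ".".join(str(part) for part in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    if shutil.which("poetry") is None:
        problems.append("poetry was not found on PATH")
    return problems


for problem in check_prerequisites():
    print("Problem:", problem)
```

Run it inside the project's virtual environment so that the installed package versions reflect what the notebooks will actually import.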

Using

Launch one of the notebooks in the notebooks directory in your preferred Jupyter Notebook environment.

  1. Explore how to use a vector database: exploring-chromadb.ipynb
  2. Explore how to ETL PDF data into a vector database: exploring-data-extraction-from-pdf.ipynb
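The transform step of the PDF workflow in the second notebook can be sketched in outline. This is a minimal sketch, assuming the PDF text has already been extracted into a plain string (the README does not name the PDF library, so extraction is left out). The chunk_text and build_records helpers, the chunk sizes, and the id scheme are hypothetical. The output is shaped like the ids, documents, and metadatas lists that a Chroma collection's add method accepts.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(
    text: str, source: str, chunk_size: int = 200, overlap: int = 40
) -> dict[str, list]:
    """Pair each chunk with a unique id and metadata describing its origin."""
    chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
    return {
        "ids": [f"{source}-{index}" for index in range(len(chunks))],
        "documents": chunks,
        "metadatas": [
            {"source": source, "chunk": index} for index in range(len(chunks))
        ],
    }


# Stand-in for text extracted from a PDF; the file name is illustrative.
sample = "Vector databases store embeddings. " * 20
records = build_records(sample, source="example.pdf", chunk_size=120, overlap=20)
print(len(records["ids"]), "chunks; first id:", records["ids"][0])
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. With a Chroma collection in hand, the three lists could then be loaded with a call along the lines of collection.add(**records).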

Development

Virtual Environment via pyenv

The project uses pyenv to manage its Python version and virtual environment.

  1. Install the required version of Python for this project, currently at >=3.13.

  2. Create a new virtual environment for this project using pyenv

pyenv virtualenv <PYTHON_VERSION> audit_webpage_metadata

  3. Activate the virtual environment

pyenv activate audit_webpage_metadata

Install Dependencies via poetry

The project is managed using Poetry, a Python packaging and dependency manager. More information can be found on the official Poetry project website.

  1. Install the project's dependencies (--no-root skips installing the project itself as a package)

poetry install --no-root

Disclaimer

The content, including but not limited to code, text, images, audio, and/or video, hereafter referred to as "content", in this document is provided for informational and educational purposes only. TO THE EXTENT PERMITTED BY APPLICABLE LAW, THE AUTHOR PROVIDES THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT. In no event shall the author or their employer be liable for any claim, damages or other liability, direct or indirect, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the code and content or the use or other dealings in the code and content. Use this code and all other content at your own risk.

Third-party API Disclaimer: Additionally, the code examples in this post may interact with third-party APIs and services. The availability and functionality of these APIs are subject to change without notice. The author is not responsible for any issues arising from changes to these APIs or any downtime or limitations imposed by the service providers. You are responsible for complying with the terms of service and usage policies of any third-party APIs you use in conjunction with this code. Use this code at your own risk, and be aware of potential security implications when connecting to external services.

Product Link Disclaimer: This blog post may contain links to products or services available for purchase. These links are provided to offer readers additional information and resources. The author's opinions expressed in this post are independent and not influenced by any potential commercial relationships. No compensation is received for including these links, and their presence does not constitute an endorsement. Readers are encouraged to conduct their own research before making any purchasing decisions.

Copyright

Copyright © 2025 J.I. Powell. All rights reserved.

Owner

  • Name: Joshua Powell
  • Login: joshuapowell
  • Kind: user
  • Location: Pittsburgh, PA
  • Company: @broadcom

Researcher and engineer with deep expertise developing data products

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Powell, Joshua I. (2025, July 3). Exploring vector database usage as it applies to LLM and RAG. United States."
authors:
- family-names: "Powell"
  given-names: "Joshua"
  orcid: "https://orcid.org/0000-0002-0894-2399"
title: "Exploring vector database usage as it applies to LLM and RAG"
version: 1.0.0
doi: "00.0000/00000000.0000.0000000"
date-released: 2025-07-03
url: "https://github.com/joshuapowell/quick-start-creating-a-vector-database-for-RAG"

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9

Dependencies

pyproject.toml pypi
  • chromadb (>=1.0.15,<2.0.0)
  • ipykernel (>=6.29.5,<7.0.0)