rag_citation_llm

https://github.com/jpdas/rag_citation_llm

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: JPDas
Language: Python
Default Branch: main
Size: 0 Bytes

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

RAG_CITATION_LLM

Adding citations to LLM responses enhances trustworthiness, accuracy, and ethical compliance. By providing verifiable sources, LLMs can reduce the risk of misinformation and promote responsible AI usage.

This is a proof-of-concept (POC) project that shows citations in the LLM response. It builds a LangChain Retrieval-Augmented Generation (RAG) chain with citations, using OpenAI and a FAISS vector store.

Required fine-tuning of the code: To generate citations, keep metadata (e.g. document name, URL, page number) alongside the document chunks in your vector DB. When you add chunks to the LLM context for final answer generation, number them sequentially and ask the LLM to give granular citations in its answer, referencing the chunk numbers. Your code can then look up the metadata of the cited chunks and display it after the LLM answer. You could even show the chunks themselves as excerpts. Display the citations in your answer in this format - "link text"
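The numbering step described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `Chunk`, `build_numbered_context`, and the sample metadata are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved chunk plus the metadata stored with it (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def build_numbered_context(chunks):
    """Label each chunk 'Source N:' so the LLM can cite it as [N]."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        blocks.append(f"Source {number}:\n{chunk.text}")
    return "\n\n".join(blocks)


# Illustrative data only; the file name and page numbers are made up.
chunks = [
    Chunk("PCA rotates the dataset so the new features are uncorrelated.",
          {"doc": "ml_book.pdf", "page": 140}),
    Chunk("k-means clustering assigns each point to the nearest centre.",
          {"doc": "ml_book.pdf", "page": 168}),
]

context = build_numbered_context(chunks)
print(context)
```

Because the numbers follow the order of the list, the position of a chunk is enough to find its metadata again once the answer comes back.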

Example: https://www.perplexity.ai/

Question: "What is principal component analysis?"
Answer: Principal Component Analysis (PCA) is a method used to rotate a dataset in such a way that the new features, known as principal components, are statistically uncorrelated. This process involves finding the direction of maximum variance in the data, which is labeled as "Component 1," and then selecting a subset of these new features based on their importance in explaining the data. PCA is commonly used for dimensionality reduction and feature extraction, allowing for a more efficient representation of the data that is better suited for analysis [1].

[1] means that the LLM drew this information from the first chunk.
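Turning markers such as [1] back into source details could look roughly like the sketch below. It assumes the answer uses bracketed integers and that the chunk metadata is held in a list in the same order the chunks were given to the model; `cited_numbers`, `resolve_citations`, and the sample data are hypothetical.

```python
import re


def cited_numbers(answer):
    """Return the distinct citation numbers in the answer, in order of first appearance."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def resolve_citations(answer, chunk_metadata):
    """Map each cited number to the metadata of the chunk at that 1-based position.

    Numbers that point outside the supplied chunks are skipped, since the
    model may occasionally cite a source that was never provided.
    """
    resolved = {}
    for number in cited_numbers(answer):
        if 1 <= number <= len(chunk_metadata):
            resolved[number] = chunk_metadata[number - 1]
    return resolved


# Illustrative metadata only; the file name and pages are made up.
chunk_metadata = [
    {"doc": "ml_book.pdf", "page": 140},
    {"doc": "ml_book.pdf", "page": 168},
]
answer = "PCA finds uncorrelated components [1] and reduces dimensionality [1][5]."

citations = resolve_citations(answer, chunk_metadata)
for number, meta in citations.items():
    print(f"[{number}] {meta['doc']}, page {meta['page']}")
```

Skipping out-of-range numbers is a design choice for the sketch; an application might instead flag them so that unsupported citations are visible to the user.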

Owner

Name: Jyotiprakash Das
Login: JPDas
Kind: user
Location: Bangalore
Company: Motorola Solutions

Repositories: 1
Profile: https://github.com/JPDas

Citation (citation.py)

import os
import pdfplumber
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts.chat import ChatPromptTemplate

from IPython.display import display, Markdown

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

from dotenv import load_dotenv

load_dotenv()


OPENAI_API_KEY = os.getenv("OPENAI_KEY")

def read_pdf_page_by_page(pdf_path):
    """Reads a PDF page by page and returns the concatenated text.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Text of all pages joined together.
    """

    texts = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            # Guard against pages with no extractable text (None or empty result)
            texts += page.extract_text() or ""

            # texts.append({"page_number": page_num+1, "text": text})

    return texts

def create_vector_store(texts):
    """Creates a FAISS vector store from the text extracted from a PDF."""

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

    docs = text_splitter.split_text(texts)
    
    # Create embeddings
    embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

    metadata=[{"source": text} for text in docs]
    # Create the vector store
    vectorstore = FAISS.from_texts(docs, embeddings, metadatas=metadata)
    return vectorstore

def print_result(result):
    output_text = f"""### Question:
    {result['input']}
    ### Answer:
    {result['answer']}
    ### Sources:
    {result['context']}
    ### All relevant sources:
    {' '.join(list(set([doc.metadata['source'] for doc in result['context']])))}
    """
    return output_text


pdf_path = r"D:\LLM_Experiments\RAG_CITATION\Introduction to Machine Learning with Python ( PDFDrive.com )-min.pdf"

texts = read_pdf_page_by_page(pdf_path)

vector_store = create_vector_store(texts)

llm = ChatOpenAI(temperature = 0.0, model="gpt-4o-mini",  api_key=OPENAI_API_KEY)

CITATION_QA_TEMPLATE = (
    "Please provide an answer based solely on the provided sources. "
    "When referencing information from a source, "
    "cite the appropriate source(s) using their corresponding numbers. "
    "Every answer should include at least one source citation. "
    "Only cite a source when you are explicitly referencing it. "
    "If none of the sources are helpful, you should indicate that. "
    "For example:\n"
    "Source 1:\n"
    "The sky is red in the evening and blue in the morning.\n"
    "Source 2:\n"
    "Water is wet when the sky is red.\n"
    "Query: When is water wet?\n"
    "Answer: Water will be wet when the sky is red [2], "
    "which occurs in the evening [1].\n"
    "Now it's your turn. Below are several numbered sources of information:"
    "\n------\n"
    "{context}"
    "\n------\n"
    "Query: {input}\n"
    "Answer: "
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", CITATION_QA_TEMPLATE),
        ("human", "{input}"),
    ]
)

retriever=vector_store.as_retriever(search_kwargs={"k":3})
query = "What is principal component analysis?"

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

result = chain.invoke({"input": query})

print(result['answer'])
display(Markdown(result['answer']))
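The script above joins all pages into one string before splitting, so page numbers are lost, and each chunk's `source` metadata is the chunk text itself. One way to keep page-level metadata, as the README suggests, is to split each page on its own. The sketch below is an assumption about how that could look, not the author's implementation: `split_fixed` and `chunk_pages` are hypothetical helpers, and the naive fixed-size splitter only stands in for `RecursiveCharacterTextSplitter`.

```python
def split_fixed(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-size splitter with overlap (stand-in for a real text splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def chunk_pages(pages, doc_name, chunk_size=1000, chunk_overlap=200):
    """Split each page separately and attach document name and page number."""
    texts, metadatas = [], []
    for page_number, page_text in enumerate(pages, start=1):
        for piece in split_fixed(page_text or "", chunk_size, chunk_overlap):
            texts.append(piece)
            metadatas.append({"source": doc_name, "page": page_number})
    return texts, metadatas


# Illustrative input: two short "pages" and a tiny chunk size.
pages = ["abcdefghij", "klm"]
texts, metadatas = chunk_pages(pages, "example.pdf", chunk_size=6, chunk_overlap=2)
for text, meta in zip(texts, metadatas):
    print(meta["page"], repr(text))
```

The two returned lists line up index by index, so they could be passed as the texts and `metadatas` arguments of `FAISS.from_texts`, which the script already calls. The trade-off is that chunks never span a page boundary.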

GitHub Events

Total

Push event: 3
Create event: 2

Last Year

Push event: 3
Create event: 2

Dependencies

poetry.lock pypi

aiohappyeyeballs 2.4.3
aiohttp 3.10.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.6.2.post1
asttokens 2.4.1
attrs 24.2.0
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.4.0
colorama 0.4.6
cryptography 43.0.3
dataclasses-json 0.6.7
decorator 5.1.1
distro 1.9.0
executing 2.1.0
faiss-cpu 1.9.0
frozenlist 1.5.0
greenlet 3.1.1
h11 0.14.0
httpcore 1.0.6
httpx 0.27.2
httpx-sse 0.4.0
idna 3.10
ipython 8.29.0
jedi 0.19.1
jiter 0.7.0
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.3.7
langchain-community 0.3.5
langchain-core 0.3.15
langchain-openai 0.2.6
langchain-text-splitters 0.3.2
langsmith 0.1.142
marshmallow 3.23.1
matplotlib-inline 0.1.7
multidict 6.1.0
mypy-extensions 1.0.0
numpy 1.26.4
openai 1.54.3
orjson 3.10.11
packaging 24.2
parso 0.8.4
pdfminer-six 20231228
pdfplumber 0.11.4
pexpect 4.9.0
pillow 11.0.0
prompt-toolkit 3.0.48
propcache 0.2.0
ptyprocess 0.7.0
pure-eval 0.2.3
pycparser 2.22
pydantic 2.9.2
pydantic-core 2.23.4
pydantic-settings 2.6.1
pygments 2.18.0
pypdfium2 4.30.0
python-dotenv 1.0.1
pyyaml 6.0.2
regex 2024.11.6
requests 2.32.3
requests-toolbelt 1.0.0
six 1.16.0
sniffio 1.3.1
sqlalchemy 2.0.35
stack-data 0.6.3
tenacity 9.0.0
tiktoken 0.8.0
tqdm 4.67.0
traitlets 5.14.3
typing-extensions 4.12.2
typing-inspect 0.9.0
urllib3 2.2.3
wcwidth 0.2.13
yarl 1.17.1

pyproject.toml pypi

faiss-cpu ^1.9.0
ipython ^8.29.0
langchain ^0.3.7
langchain-community ^0.3.5
langchain-openai ^0.2.6
pdfplumber ^0.11.4
python ^3.12
