rag_citation_llm
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: JPDas
- Language: Python
- Default Branch: main
- Size: 0 Bytes
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RAGCITATIONLLM
Adding citations to LLM responses enhances trustworthiness, accuracy, and ethical compliance. By providing verifiable sources, LLMs can reduce the risk of misinformation and promote responsible AI usage.
This is a proof-of-concept (POC) project that shows citations in the response. It is a LangChain Retrieval-Augmented Generation (RAG) chain with citations, using OpenAI and a FAISS vector DB.
Required fine-tuning of the code: To generate citations, keep metadata (e.g., document name, URL, page number) alongside the document chunks in your vector DB. When you add chunks to the LLM context for the final answer generation, number them sequentially and ask the LLM to give granular citations in its final answer, referencing the chunk numbers. Your code can then look up the metadata of the cited chunks and display it after the LLM answer. You could even show the chunks themselves as excerpts. Display the citations in your answer in this format: "link text"
Example: https://www.perplexity.ai/
Question: "What is principal component analysis?"
Answer: Principal Component Analysis (PCA) is a method used to rotate a dataset in such a way that the new features, known as principal components, are statistically uncorrelated. This process involves finding the direction of maximum variance in the data, which is labeled as "Component 1," and then selecting a subset of these new features based on their importance in explaining the data. PCA is commonly used for dimensionality reduction and feature extraction, allowing for a more efficient representation of the data that is better suited for analysis [1].
Here, [1] means that the LLM drew this information from the first chunk.
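The numbering and lookup steps described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the helper names (`format_numbered_context`, `cited_numbers`, `resolve_citations`) and the `doc`/`page` metadata values are hypothetical, and the sample sources and answer are borrowed from the prompt template in citation.py.

```python
import re


def format_numbered_context(chunks):
    """Render chunks as 'Source N:' blocks, numbered from 1."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        blocks.append(f"Source {number}:\n{chunk['text']}")
    return "\n\n".join(blocks)


def cited_numbers(answer):
    """Return distinct chunk numbers cited as [n], in order of first appearance."""
    seen = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def resolve_citations(answer, chunks):
    """Map each cited number to that chunk's metadata, skipping out-of-range numbers."""
    resolved = {}
    for number in cited_numbers(answer):
        if 1 <= number <= len(chunks):
            resolved[number] = chunks[number - 1]["metadata"]
    return resolved


# Hypothetical retrieved chunks; the metadata values are made up for illustration.
chunks = [
    {"text": "The sky is red in the evening and blue in the morning.",
     "metadata": {"doc": "example.pdf", "page": 1}},
    {"text": "Water is wet when the sky is red.",
     "metadata": {"doc": "example.pdf", "page": 2}},
]
answer = "Water will be wet when the sky is red [2], which occurs in the evening [1]."

context = format_numbered_context(chunks)  # text to place in the prompt
citations = resolve_citations(answer, chunks)
for number, metadata in citations.items():
    print(f"[{number}] {metadata['doc']}, page {metadata['page']}")
```

Numbering the chunks in the same order in which they are later looked up is what makes `[n]` in the answer resolvable; citation numbers the model invents outside that range are dropped rather than raising an error.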
Owner
- Name: Jyotiprakash Das
- Login: JPDas
- Kind: user
- Location: Bangalore
- Company: Motorola Solutions
- Repositories: 1
- Profile: https://github.com/JPDas
Citation (citation.py)
import os

import pdfplumber
from dotenv import load_dotenv
from IPython.display import display, Markdown
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts.chat import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

# Load environment variables from a local .env file
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_KEY")
def read_pdf_page_by_page(pdf_path):
    """Reads a PDF page by page and returns the concatenated text.
    Args:
        pdf_path (str): Path to the PDF file.
    Returns:
        str: Text of all pages, joined in page order.
    """
    texts = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            # Guard against pages that yield no extractable text
            texts += page.extract_text() or ""
            # texts.append({"page_number": page_num+1, "text": text})
    return texts
def create_vector_store(texts):
    """Creates a FAISS vector store from the extracted PDF text."""
    # Split the text into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    docs = text_splitter.split_text(texts)
    # Create embeddings
    embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
    # Store each chunk's own text as its "source" metadata
    metadata = [{"source": text} for text in docs]
    # Create the vector store
    vectorstore = FAISS.from_texts(docs, embeddings, metadatas=metadata)
    return vectorstore
def print_result(result):
    """Formats the question, answer, and retrieved sources as Markdown text."""
    output_text = f"""### Question:
{result['input']}
### Answer:
{result['answer']}
### Sources:
{result['context']}
### All relevant sources:
{' '.join(list(set([doc.metadata['source'] for doc in result['context']])))}
"""
    return output_text
pdf_path = r"D:\LLM_Experiments\RAG_CITATION\Introduction to Machine Learning with Python ( PDFDrive.com )-min.pdf"
texts = read_pdf_page_by_page(pdf_path)
vector_store = create_vector_store(texts)
llm = ChatOpenAI(temperature=0.0, model="gpt-4o-mini", api_key=OPENAI_API_KEY)
CITATION_QA_TEMPLATE = (
    "Please provide an answer based solely on the provided sources. "
    "When referencing information from a source, "
    "cite the appropriate source(s) using their corresponding numbers. "
    "Every answer should include at least one source citation. "
    "Only cite a source when you are explicitly referencing it. "
    "If none of the sources are helpful, you should indicate that. "
    "For example:\n"
    "Source 1:\n"
    "The sky is red in the evening and blue in the morning.\n"
    "Source 2:\n"
    "Water is wet when the sky is red.\n"
    "Query: When is water wet?\n"
    "Answer: Water will be wet when the sky is red [2], "
    "which occurs in the evening [1].\n"
    "Now it's your turn. Below are several numbered sources of information:"
    "\n------\n"
    "{context}"
    "\n------\n"
    "Query: {input}\n"
    "Answer: "
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", CITATION_QA_TEMPLATE),
        ("human", "{input}"),
    ]
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
query = "What is principal component analysis?"
question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)
result = chain.invoke({"input": query})
print(result['answer'])
display(Markdown(result['answer']))
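The README recommends keeping metadata such as document name and page number with each chunk, whereas the script above stores the chunk text itself as `source`. The following is a rough sketch of one way to carry page numbers through chunking. It is an assumption-laden illustration: `chunk_pages` is a hypothetical helper, and it uses a plain fixed-size character window rather than `RecursiveCharacterTextSplitter`, so chunk boundaries will differ from the script's.

```python
def chunk_pages(pages, source, chunk_size=1000, chunk_overlap=200):
    """Split each page's text into overlapping chunks and keep per-chunk metadata.

    pages: list of page texts, in page order (hypothetical input shape).
    source: document name recorded in every chunk's metadata.
    Returns (texts, metadatas) as two parallel lists.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    texts, metadatas = [], []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), step):
            texts.append(page_text[start:start + chunk_size])
            metadatas.append({"source": source, "page": page_number})
            # Stop once the window has reached the end of the page
            if start + chunk_size >= len(page_text):
                break
    return texts, metadatas


# Tiny made-up pages, with small sizes so the windows are easy to follow.
pages = ["a" * 25, "", "b" * 5]
texts, metadatas = chunk_pages(pages, source="example.pdf", chunk_size=10, chunk_overlap=2)
for text, metadata in zip(texts, metadatas):
    print(metadata["page"], len(text))
```

The two returned lists line up index by index, so they could be passed as the texts and `metadatas` arguments of `FAISS.from_texts`, the call already used in `create_vector_store`. Chunks never span a page boundary in this sketch, which keeps the page attribution unambiguous at the cost of some shorter chunks.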
GitHub Events
Total
- Push event: 3
- Create event: 2
Last Year
- Push event: 3
- Create event: 2
Dependencies
- aiohappyeyeballs 2.4.3
- aiohttp 3.10.10
- aiosignal 1.3.1
- annotated-types 0.7.0
- anyio 4.6.2.post1
- asttokens 2.4.1
- attrs 24.2.0
- certifi 2024.8.30
- cffi 1.17.1
- charset-normalizer 3.4.0
- colorama 0.4.6
- cryptography 43.0.3
- dataclasses-json 0.6.7
- decorator 5.1.1
- distro 1.9.0
- executing 2.1.0
- faiss-cpu 1.9.0
- frozenlist 1.5.0
- greenlet 3.1.1
- h11 0.14.0
- httpcore 1.0.6
- httpx 0.27.2
- httpx-sse 0.4.0
- idna 3.10
- ipython 8.29.0
- jedi 0.19.1
- jiter 0.7.0
- jsonpatch 1.33
- jsonpointer 3.0.0
- langchain 0.3.7
- langchain-community 0.3.5
- langchain-core 0.3.15
- langchain-openai 0.2.6
- langchain-text-splitters 0.3.2
- langsmith 0.1.142
- marshmallow 3.23.1
- matplotlib-inline 0.1.7
- multidict 6.1.0
- mypy-extensions 1.0.0
- numpy 1.26.4
- openai 1.54.3
- orjson 3.10.11
- packaging 24.2
- parso 0.8.4
- pdfminer-six 20231228
- pdfplumber 0.11.4
- pexpect 4.9.0
- pillow 11.0.0
- prompt-toolkit 3.0.48
- propcache 0.2.0
- ptyprocess 0.7.0
- pure-eval 0.2.3
- pycparser 2.22
- pydantic 2.9.2
- pydantic-core 2.23.4
- pydantic-settings 2.6.1
- pygments 2.18.0
- pypdfium2 4.30.0
- python-dotenv 1.0.1
- pyyaml 6.0.2
- regex 2024.11.6
- requests 2.32.3
- requests-toolbelt 1.0.0
- six 1.16.0
- sniffio 1.3.1
- sqlalchemy 2.0.35
- stack-data 0.6.3
- tenacity 9.0.0
- tiktoken 0.8.0
- tqdm 4.67.0
- traitlets 5.14.3
- typing-extensions 4.12.2
- typing-inspect 0.9.0
- urllib3 2.2.3
- wcwidth 0.2.13
- yarl 1.17.1
- faiss-cpu ^1.9.0
- ipython ^8.29.0
- langchain ^0.3.7
- langchain-community ^0.3.5
- langchain-openai ^0.2.6
- pdfplumber ^0.11.4
- python ^3.12