inter-iit-13-pathway-legalqa-chatbot
https://github.com/himanshu-skid19/inter-iit-13-pathway-legalqa-chatbot
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: himanshu-skid19
- Language: Python
- Default Branch: main
- Size: 74.8 MB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 11
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PathLex - Agentic RAG Application for Legal QA
Team 67
Table of Contents
Introduction
We introduce PathLex, an advanced agentic Retrieval-Augmented Generation (RAG) system specifically tailored for the legal domain. Built on Pathway’s real-time data processing capabilities and leveraging LLMCompiler's dynamic task planning, PathLex addresses critical limitations in existing legal RAG systems, such as hallucinations, retrieval inaccuracies, and long-context handling. With innovations in chunking, multi-tier replanning, and robust fallback mechanisms, including a human-in-the-loop framework, PathLex ensures precise, context-aware answers with verifiable citations. This work lays the foundation for intelligent automation in high-stakes domains, demonstrating the potential for transformative improvements in legal information systems.
Key Features
1. Parallel Task Execution
- Challenge: Traditional RAG systems process retrieval queries sequentially, introducing latency when handling multiple queries.
- Solution: LLMCompiler employs a planner-executor architecture to identify independent retrieval tasks and execute them in parallel, significantly reducing latency.
2. Dynamic Replanning for Retrieval
- Challenge: In multi-hop queries, intermediate retrieval results often necessitate changes in subsequent queries or reasoning.
- Solution: LLMCompiler adapts dynamically through a dynamic execution graph, recomputing task dependencies as results come in, ensuring actions remain contextually relevant.
3. Enhanced Retrieval Precision with Task-Specific Tools
- Challenge: Generic retrieval tools often lack precision for task-specific needs.
- Solution: LLMCompiler integrates specialized retrieval tools, dynamically assigning the most relevant tool for each task to improve precision.
4. Scalability to Complex Queries
- Challenge: Traditional RAG systems struggle with multi-step queries involving intricate reasoning and dependencies.
- Solution: LLMCompiler creates directed acyclic graphs (DAGs) for task execution, efficiently managing complex reasoning and retrieval dependencies.
5. Plan-and-Solve Alignment
- Challenge: Treating retrieval and generation as a monolithic process can lead to inefficiencies.
- Solution: LLMCompiler breaks tasks into manageable sub-steps (e.g., retrieval → analysis → generation), optimizing each independently for accuracy and efficiency.
6. Reduced Token Usage and Cost
- Challenge: Excessive token consumption increases costs in traditional RAG workflows.
- Solution: Inspired by ReWOO (Xu et al., 2023), LLMCompiler decouples reasoning from execution, minimizing unnecessary LLM invocations and reducing token usage.
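The planner-executor pattern described above can be sketched with Python's `asyncio`: independent retrieval tasks run concurrently, while dependent tasks wait on their inputs, mirroring the task-fetching unit. The task names and stub "tools" here are illustrative, not PathLex's actual interfaces.

```python
import asyncio

# Hypothetical DAG: each task lists the task ids it depends on.
# "search_statute" and "search_case_law" are independent and run in
# parallel; "synthesize" waits for both results before executing.
TASKS = {
    "search_statute": {"deps": [], "fn": lambda _: "IPC s.302"},
    "search_case_law": {"deps": [], "fn": lambda _: "State v. X"},
    "synthesize": {"deps": ["search_statute", "search_case_law"],
                   "fn": lambda inputs: " + ".join(inputs)},
}

async def run_dag(tasks):
    done = {}
    events = {name: asyncio.Event() for name in tasks}

    async def run(name):
        spec = tasks[name]
        # Wait until every dependency has published its result.
        await asyncio.gather(*(events[d].wait() for d in spec["deps"]))
        inputs = [done[d] for d in spec["deps"]]
        done[name] = spec["fn"](inputs)
        events[name].set()

    await asyncio.gather(*(run(n) for n in tasks))
    return done

results = asyncio.run(run_dag(TASKS))
print(results["synthesize"])  # IPC s.302 + State v. X
```

Because only true data dependencies serialize execution, the two searches finish in roughly the time of one, which is the latency win LLMCompiler's DAG scheduling targets.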
Installation and Setup
Initial Steps
Note: Pathway's server can only run on Linux or macOS. To run it on Windows, it is recommended that you use WSL or run the server in Docker.
Create a new virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate
```

Install the requirements:

```bash
pip install -r requirements.txt
```
1. Setting up the Pathway VectorStore
This is the initial step required to run Pathway's Vector Store such that it can be connected to our main pipeline for retrieval. Pathway offers real-time data processing facilities that allow one to add or remove documents from a vector store in real-time.
Prerequisites
- An OpenAI or VoyageAI API key, depending on which embeddings you decide to use (we recommend `voyage-3` by default). Replace the placeholders in `run-server.py`.
- We use a custom parser in our solution, so `custom_parser.py` must be in the root directory.
- We modified the VectorStoreServer to suit our purpose, so `server.py` must also be in your root directory.
- An active internet connection; a disruption might crash the server.

Steps to run
- Go to the server directory: `cd pathway_server`
- Create a directory named `data` (or any other name) and upload your documents to it. Note: you may use other data sources as well, such as Google Drive; just replace the file path with the Drive URL for it to work.
- Install tesseract-ocr and set your TESSDATA path in `run-server.py`, in place of `TESSDATA_PREFIX`.
- Replace your OpenAI/VoyageAI key in `run-server.py`.

Finally, simply run:

```bash
python run-server.py
```

The server is hosted on `127.0.0.1`, port `8745`, by default, though you may change this if you wish. You can test that the server is running and working with:

```bash
python test_server.py
```

Do note that creating embeddings may take a long time, during which retrieval requests to the vector store may return a read timeout error. This goes away once the embeddings have been created, so it is recommended that you add only a few documents to the server at a time.
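Beyond `test_server.py`, the retrieval endpoint can be exercised directly. The sketch below builds the HTTP request by hand; the `/v1/retrieve` path and JSON payload shape follow Pathway's published VectorStoreServer REST interface and are an assumption here, since the repository's customized `server.py` may differ.

```python
import json
import urllib.request

PATHWAY_HOST = "127.0.0.1"  # defaults used throughout this README
PATHWAY_PORT = 8745

def build_retrieve_request(query: str, k: int = 3):
    # Assumed endpoint/payload per Pathway's VectorStoreServer REST API:
    # POST /v1/retrieve with {"query": ..., "k": ...}.
    url = f"http://{PATHWAY_HOST}:{PATHWAY_PORT}/v1/retrieve"
    body = json.dumps({"query": query, "k": k}).encode()
    return url, body

def retrieve(query: str, k: int = 3):
    # Posts the request to the running server; not executed here.
    # A generous timeout helps while embeddings are still being built.
    url, body = build_retrieve_request(query, k)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

url, body = build_retrieve_request("What does Section 302 IPC prescribe?", k=2)
print(url)
```

Calling `retrieve(...)` against a running server should return the top-k chunks with their metadata, which is what the main pipeline's tools consume.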
2. Setting up the environment variables
We use a few models and services in our pipeline. You will have to put a .env file in the root directory with the following parameters filled in:
```
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
COHERE_API_KEY=your_cohere_api_key
LANGCHAIN_HUB_API_KEY=your_langchain_hub_api_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_HOST=https://cloud.langfuse.com
PATHWAY_HOST=127.0.0.1
PATHWAY_PORT=8745
```
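Loading these variables is usually done with a package such as python-dotenv; a minimal dependency-free equivalent is sketched below. The file name and parsing rules mirror the example above (one `KEY=VALUE` per line, whitespace around `=` tolerated).

```python
import os

def load_env(path=".env"):
    # Minimal .env reader: KEY=VALUE per line, blank lines and
    # '#' comments ignored, whitespace around '=' stripped.
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Example: write a sample file, load it, and read a value back.
with open("sample.env", "w") as fh:
    fh.write("PATHWAY_HOST = 127.0.0.1\nPATHWAY_PORT = 8745\n")
load_env("sample.env")
print(os.environ["PATHWAY_HOST"])  # 127.0.0.1
```

In the actual pipeline the values would then be read with `os.environ[...]` or `os.getenv(...)` wherever a key or host is needed.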
3. Scripts for Quick Testing
We have provided Python scripts of our pipeline, without a UI, to make testing easier. You may use these scripts for quick evaluation, or run the complete UI by following the steps in the next section.
Steps to run:
- Ensure you have followed the steps mentioned in sections 1 and 2 to run the Pathway VectorStore and set up all the environment variables.
- You can test our pipeline by running:

```bash
python main.py "<insert query here>"
```
4. How to run the front end
Here we assume that you are running the frontend on a Windows device. For Linux/macOS, you can find similar steps to install the respective packages online.
Prerequisites
- NodeJS
- npm
- Download the NodeJS MSI
- Download the npm zip
Steps to run:
- Extract the MSI you downloaded above (with 7-Zip) into a directory named "node".
- Add the "node" directory to your PATH environment variable.
- Extract the npm zip into a different directory (not under the node directory).
- Now you should have node and npm working; use these commands to check:

```bash
node --version
npm --version
```

- To use the UI, we need a MongoDB Atlas API. You can follow these steps to create one.
- Add your `.env` file to `legal-chatbot-frontend\src\final_scripts` with your API keys, server localhost URL, server port number, or the ngrok link.
- Replace `MONGO_URL` in `legal-chatbot-frontend\src\final_scripts\globals_.py` with your own hosted MongoDB URL, with the database named `legalchatbot`. This is very important.
- This MongoDB URL must also be replaced in `legal-chatbot-backend\server.js`, in place of `your_mongo_url`.

Install the frontend dependencies:

```bash
cd legal-chatbot-frontend
npm i
```

Install the backend dependencies:

```bash
cd legal-chatbot-backend
npm i
```

To start the client, run:

```bash
cd legal-chatbot-frontend
npm run dev
```

To start the server, run:

```bash
cd legal-chatbot-backend
node server.js
```
5. Solution Pipeline & Usage Guide
- The owner logs in via the login page.
- The user can revisit their previous chats, or create a new chat.
- In a new chat, the user can put a legal query.
- After entering the legal query, the user can see the dynamic thought process on the left. This happens in the order of
- Plan and Schedule: This contains the plan made by the planner to answer the given query. The planner produces a plan with a list of tools to call and schedules them to be executed in parallel by the task-fetching unit.
- Joiner: The joiner agent decides whether to proceed with Generation or to Replan based on the outputs of the tool calls executed by the scheduler. It decides this by using an LLM to analyze the results of the tool calls and create a Thought and an Action along with feedback in case it decides to re-plan.
- Rewrite: The Rewrite agent receives the Thought and Feedback from the joiner, based on which it decides how to rewrite the query. It has a grade-documents function that generates a score for each chunk retrieved by the tool, allowing it to identify "good" documents among the retrieved documents that are relevant to the user query.
- Generate: This agent reviews the outputs from previous tool calls and the query to generate the final answer that is to be shown to the user.
- HIL: Humans can give feedback to rewrite or generate according to the retrieved docs.
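The Plan → Join → (Rewrite | Generate) loop above can be sketched as a small control loop. The stage names follow the README; the stub functions are illustrative stand-ins for the LLM-backed agents (the real joiner produces a Thought/Action via an LLM, here it simply replans when retrieval comes back empty).

```python
# Stub agents standing in for the LLM-backed pipeline stages.
def plan_and_schedule(query):
    # Pretend retrieval: only a sufficiently specific query yields chunks.
    return ["relevant chunk"] if "Section" in query else []

def joiner(chunks):
    # Decide whether to proceed to Generation or to Replan.
    return "generate" if chunks else "replan"

def rewrite(query):
    # Illustrative feedback-driven query refinement.
    return query + " under Section 302 IPC"

def generate(query, chunks):
    return f"Answer to '{query}' using {len(chunks)} chunk(s)"

def run_pipeline(query, max_replans=2):
    for _ in range(max_replans + 1):
        chunks = plan_and_schedule(query)
        if joiner(chunks) == "generate":
            return generate(query, chunks)
        query = rewrite(query)      # rewrite, then replan
    return "I don't know"           # fallback after exhausting replans

print(run_pipeline("What is the punishment for murder?"))
```

The bounded replan count plays the role of PathLex's multi-tier replanning cutoff, after which the human-in-the-loop (HIL) fallback would take over.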
6. Architecture
```
├── Experiments and miscellaneous
│   ├── beam_retriever_train_and_exp.py
│   ├── lumber chunking.py
│   └── meta chunking.py
├── HIL.py
├── README.md
├── Reports
│   ├── Pathway_MidEval_Report.pdf
│   └── endterm_report.pdf
├── agents.py
├── anthropic_functions.py
├── beam_retriever.py
├── beam_tool.py
├── citations.py
├── demo_videos
│   ├── demo.mp4
│   └── summary_video.mkv
├── globals_.py
├── imports.py
├── joiner.py
├── legal-chatbot-backend
│   ├── .gitignore
│   ├── cache1.txt
│   ├── models
│   │   ├── Chat.js
│   │   ├── Script_response.js
│   │   └── User.js
│   ├── package-lock.json
│   ├── package.json
│   ├── server.js
│   ├── test.txt
│   └── text_files
│       ├── generate.txt
│       ├── human_input_node.txt
│       ├── join.txt
│       ├── plan_and_schedule.txt
│       └── rewrite.txt
├── legal-chatbot-frontend
│   ├── .gitignore
│   ├── README.md
│   ├── eslint.config.js
│   ├── index.html
│   ├── package-lock.json
│   ├── package.json
│   ├── public
│   │   ├── send_btn.svg
│   │   └── vite.svg
│   ├── src
│   │   ├── App.css
│   │   ├── App.jsx
│   │   ├── Home.css
│   │   ├── Home.jsx
│   │   ├── SignIn.css
│   │   ├── SignIn.jsx
│   │   ├── SignUp.css
│   │   ├── SignUp.jsx
│   │   ├── assets
│   │   │   └── react.svg
│   │   ├── files.jsx
│   │   ├── final_scripts
│   │   │   ├── .gitignore
│   │   │   ├── HIL.py
│   │   │   ├── agents.py
│   │   │   ├── anthropic_functions.py
│   │   │   ├── beam_retriever.py
│   │   │   ├── beam_tool.py
│   │   │   ├── citations.py
│   │   │   ├── get_all_files.py
│   │   │   ├── globals_.py
│   │   │   ├── imports.py
│   │   │   ├── joiner.py
│   │   │   ├── main.py
│   │   │   ├── output.txt
│   │   │   ├── output_parser.py
│   │   │   ├── pathway_server
│   │   │   │   ├── custom_parser.py
│   │   │   │   ├── run-server.py
│   │   │   │   ├── server.py
│   │   │   │   └── test_server.py
│   │   │   ├── planner.py
│   │   │   ├── prompts.py
│   │   │   ├── requirements.txt
│   │   │   ├── task_fetching_unit.py
│   │   │   ├── test.py
│   │   │   ├── tools.py
│   │   │   └── utils.py
│   │   ├── index.css
│   │   └── main.jsx
│   └── vite.config.js
├── main.py
├── output.txt
├── output_parser.py
├── pathway_server
│   ├── custom_parser.py
│   ├── run-server.py
│   ├── server.py
│   └── test_server.py
├── planner.py
├── prompts.py
├── requirements.txt
├── task_fetching_unit.py
├── tools.py
└── utils.py
```
Team Members
Owner
- Login: himanshu-skid19
- Kind: user
- Repositories: 1
- Profile: https://github.com/himanshu-skid19
Citation (citations.py)
```python
from imports import *
from utils import *

import json


def remove_uuid(text):
    # Strip hyphenated lowercase-hex UUIDs from a string.
    return re.sub(r'\b[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}\b', '', text)


def extract_document_name(file_path):
    # Pull the trailing file name (pdf/docx/txt) out of a full path.
    match = re.search(r'[^/\\]+(?:\.pdf|\.docx|\.txt)', file_path, re.IGNORECASE)
    if match:
        return match.group(0)
    return file_path


def get_docs(answer):
    docs = []
    for data in reversed(answer):
        key = next(iter(data.keys()))
        value = data[key]['messages']
        for i in value:
            if isinstance(i, FunctionMessage):
                if i.content != 'join':
                    docs.append(i.content)
    if len(docs) == 0:
        return [], []
    content = docs[0]
    # Split on the "CHUNK ENDS HERE" delimiter
    chunks = re.split(r'\n----------------CHUNK ENDS HERE ------------------\n', content)
    if chunks[-1] == '':
        chunks.pop()
    processed_chunks = []
    process_chunks_citations = set()
    process_chunks_results = set()
    for chunk in chunks:
        # Clean and strip unnecessary newlines or hyphens
        chunk = chunk.strip()
        chunk = re.sub(r'-{2,}', '', chunk)  # Remove runs of multiple hyphens
        chunk = re.sub(r'\n+', '\n', chunk)  # Normalize newlines
        # Extract source and page details
        source_match = re.search(r'Source: (.+?)\n', chunk)
        page_match = re.search(r'Page Number: (\d+)', chunk)
        if source_match and page_match:
            source = source_match.group(1).strip()
            page = page_match.group(1).strip()
            doc_name = extract_document_name(source)
            # Combine into a single "SOURCE" line
            metadata = f"SOURCE: {doc_name}, Page: {page}"
            # Remove the original metadata from the chunk
            chunk = re.sub(r'Document Summary:\n.+?Page Content:\n', '', chunk, flags=re.DOTALL)
            # Prepend the cleaned metadata; the word list's repr is embedded
            # deliberately and parsed back with ast.literal_eval below.
            cleaned_chunk = f"{metadata}{chunk.split(' ')}"
            processed_chunks.append(cleaned_chunk)
    cleaned_chunks = [remove_uuid(chunk) for chunk in processed_chunks]
    for cleaned_chunk in cleaned_chunks:
        chunk_test = cleaned_chunk
        data = chunk_test[chunk_test.find("['"):]
        chunk_ext = chunk_test[:chunk_test.find("['")]
        data = ast.literal_eval(data)
        chunk_ext_cit = chunk_ext + " ".join(data)
        process_chunks_citations.add(chunk_ext)
        process_chunks_results.add(chunk_ext_cit)
    return list(process_chunks_citations), list(process_chunks_results)


def assess_risk_level_anth(client, message):
    unsafe_category_definitions = {
        'Child Exploitation': 'Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.',
        'Hate': """Content that is hateful toward people on the basis of their protected characteristics
            (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease),
            or content that perpetuates negative stereotypes.""",
        'Indiscriminate Weapons': 'Content that enables the creation of weapons of mass destruction.',
        'Self-Harm': 'Content that enables, encourages, or excuses acts of intentional self-harm.',
        'Sex Crimes': 'Content that enables, encourages, or excuses the commission of sex crimes.',
        'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
        'Violent Crimes': 'Content that enables, encourages, or excuses the commission of violent crimes.',
    }
    unsafe_category_str = '\n'.join(f'{category}' for category in unsafe_category_definitions.keys())

    # Construct the prompt for Claude, including the message, unsafe categories, and risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message, i.e. whether it warrants moderation,
    based on the unsafe categories listed below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Assign a risk level based on your confidence that the user's message should be moderated
    based on the defined unsafe categories:
    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk

    If the message contains content that might violate some of the unsafe categories,
    but is a valid legal query, then assign a low risk level to it.
    e.g.
    Query: What were the findings of the autopsy conducted on Anuradha, and how did they contribute to the conclusion that she was violently raped, murdered and sodomised?
    The message is a valid legal query, but contains content that might violate some of the unsafe categories, hence assign a low risk level to it.
    Query: What are the laws pertaining to porn and sex in India?
    The message is a valid legal query, but contains some unsafe categories in it, so assign a low risk level to it.
    Query: I am wanting to do sex with a teen, should I do it?
    The message is not a valid legal query, and contains content that violates the unsafe categories, hence assign a high risk level to it.

    Respond with ONLY a JSON object, using the format below:
    {{
        "risk_level": <Numerical field denoting the risk level>
    }}"""

    # Send the request to Claude for risk assessment
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )
    # Parse the JSON response from Claude and extract the risk level
    assessment = json.loads(response.content[0].text)
    risk_level = assessment["risk_level"]
    return risk_level


def check_guardrails(question):
    if isinstance(llm_with_fallback.call(), ChatOpenAI):
        client = OpenAI()
        response = client.moderations.create(
            model="omni-moderation-latest",
            input=question,
        )
        category_scores = response.results[0].category_scores
        threshold = 0.65
        # Flag the question if any category score exceeds the threshold
        for attribute, score in vars(category_scores).items():
            if score > threshold:
                return True
        return False
    elif isinstance(llm_with_fallback.call(), ChatAnthropic):
        client = anthropic.Anthropic()
        risk_level = assess_risk_level_anth(client, question)
        return risk_level == 3


def get_citations_with_ans(query, answer):
    if len(answer) == 0:
        return "No citations found for the given answer."
    ans = answer[-1]['generate']['messages'][0]
    cit, retrieved_contexts = get_docs(answer=answer)

    # System prompt for the OpenAI path
    prompt = f"""
    You are a legal expert tasked with evaluating and formatting citations for a given query and answer. Follow these steps:
    1. Understand the Query and Answer:
    Read the query and the provided answer carefully to grasp the specific information being sought and its context.
    2. Assess Main Citations Using Retrieved Contexts:
    Main citations (referred to as "Document Paths") correspond to the retrieved contexts.
    Evaluate the relevance of each retrieved context to the answer based on its content and connection to the query and answer.
    If the retrieved context contributes to or supports the answer, include its corresponding document path in Main Citations.
    If the answer indicates "I don't know" or similar, mark Main Citations as None.
    In all other cases, include the relevant document paths, even if the answer is partially derived from reasoning on the retrieved contexts.
    3. Extract In-Context Citations:
    Identify specific legal citations (e.g., "Smithson, 2018") directly referenced in the retrieved contexts that support the answer.
    If no relevant in-context citations are found, write None under In-Context Citations.
    4. Generate the Response:
    Main Citations:
    If the answer indicates "I don't know" or similar, write None.
    Otherwise, include all relevant document paths based on the retrieved contexts.
    In-Context Citations:
    If relevant citations are found, list them.
    If not, write None.
    Follow the formatting structure below:
    Response Structure:
    Main Citations:
    List relevant document paths in each newline, or write "None" if the answer is "I don't know" or similar.
    In-Context Citations:
    List specific citations in each newline, or write "None" if no in-context citations are found.
    Query: {query}
    Answer: {ans}
    Main Citations (Document Paths):
    {cit}
    Retrieved Contexts:
    {retrieved_contexts}
    Your response must be concise and include only citations (main and in-context) that are directly relevant to the query and answer. Always provide None explicitly under In-Context Citations if no relevant citations are found. Similarly, only provide None under Main Citations if the answer clearly indicates "I don't know" or equivalent.
    """

    # System prompt for the Anthropic path
    prompt1 = f"""
    You are a legal expert tasked with evaluating and formatting citations for a given query and answer. Follow these steps:
    1. Understand the Query and Answer:
    Read the query and the provided answer carefully to grasp the specific information being sought and its context.
    2. Assess Main Citations Using Retrieved Contexts:
    Main citations (referred to as "Document Paths") correspond to the retrieved contexts.
    Evaluate the relevance of each retrieved context to the answer based on its content and connection to the query and answer.
    If the retrieved context contributes to or supports the answer, include its corresponding document path in Main Citations.
    If the answer indicates "I don't know" or similar, mark Main Citations as None.
    In all other cases, include the relevant document paths, even if the answer is partially derived from reasoning on the retrieved contexts.
    3. Extract In-Context Citations:
    Identify specific legal citations (e.g., "Smithson, 2018") directly referenced in the retrieved contexts that support the answer.
    If no relevant in-context citations are found, write None under In-Context Citations.
    4. Generate the Response:
    Provide ONLY the following output format:
    Main Citations:
    List relevant document paths in each newline, or write "None" if the answer is "I don't know" or similar.
    In-Context Citations:
    List specific citations in each newline, or write "None" if no in-context citations are found.
    Strictly follow these rules:
    - Produce ONLY the two-line output above
    - Do not add any introductory or explanatory text
    - Do not include any notes or commentary
    - Ensure the output is exactly as shown, even if you have no citations to report
    """

    if isinstance(llm_with_fallback.call(), ChatOpenAI):
        client = OpenAI()
        completion = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": f"""
                **Query**: {query}
                **Answer**: {ans}
                **Main Citations (Document Paths)**: {cit}
                **Retrieved Contexts**:
                {retrieved_contexts}
                """},
            ],
        )
        response = completion.choices[0].message.content
    elif isinstance(llm_with_fallback.call(), ChatAnthropic):
        client = anthropic.Anthropic()
        completion = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            temperature=0,
            max_tokens=1024,
            system=prompt1,
            messages=[
                {"role": "user", "content": f"""
                **Query**: {query}
                **Answer**: {ans}
                **Main Citations (Document Paths)**: {cit}
                **Retrieved Contexts**:
                {retrieved_contexts}
                """}
            ]
        )
        response = completion.content[0].text

    final_response = f"Answer: \n{ans}\n{response}"
    return final_response
```
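A self-contained illustration of the metadata extraction that `get_docs` in `citations.py` performs per chunk: the UUID-stripping and file-name regexes are copied from the file above, while the sample chunk text is fabricated to match the `Source:`/`Page Number:` format the parser expects.

```python
import re

def remove_uuid(text):
    # Same pattern as citations.py: strips hyphenated lowercase-hex UUIDs.
    return re.sub(r'\b[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}\b', '', text)

def extract_document_name(file_path):
    # Same pattern as citations.py: trailing pdf/docx/txt file name.
    match = re.search(r'[^/\\]+(?:\.pdf|\.docx|\.txt)', file_path, re.IGNORECASE)
    return match.group(0) if match else file_path

# Fabricated chunk in the expected format (path and page are examples).
chunk = "Source: /data/ipc_cases/judgment-2019.pdf\nPage Number: 12\n..."
source = re.search(r'Source: (.+?)\n', chunk).group(1)
page = re.search(r'Page Number: (\d+)', chunk).group(1)
print(f"SOURCE: {extract_document_name(source)}, Page: {page}")
# SOURCE: judgment-2019.pdf, Page: 12
```

This compact `SOURCE: <file>, Page: <n>` line is what the citation prompts later receive as a "Document Path".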
GitHub Events
Total
- Watch event: 2
- Member event: 1
- Push event: 7
- Pull request event: 2
- Fork event: 12
- Create event: 2
Last Year
- Watch event: 2
- Member event: 1
- Push event: 7
- Pull request event: 2
- Fork event: 12
- Create event: 2