inter-iit-13-pathway-legalqa-chatbot
https://github.com/himanshu-skid19/inter-iit-13-pathway-legalqa-chatbot
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: himanshu-skid19
- Language: Python
- Default Branch: main
- Size: 74.8 MB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 11
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PathLex - Agentic RAG Application for Legal QA
Team 67
Table of Contents
Introduction
We introduce PathLex, an advanced agentic Retrieval-Augmented Generation (RAG) system specifically tailored for the legal domain. Built on Pathway’s real-time data processing capabilities and leveraging LLMCompiler's dynamic task planning, PathLex addresses critical limitations in existing legal RAG systems, such as hallucinations, retrieval inaccuracies, and long-context handling. With innovations in chunking, multi-tier replanning, and robust fallback mechanisms, including a human-in-the-loop framework, PathLex ensures precise, context-aware answers with verifiable citations. This work lays the foundation for intelligent automation in high-stakes domains, demonstrating the potential for transformative improvements in legal information systems.
Key Features
1. Parallel Task Execution
- Challenge: Traditional RAG systems process retrieval queries sequentially, introducing latency when handling multiple queries.
- Solution: LLMCompiler employs a planner-executor architecture to identify independent retrieval tasks and execute them in parallel, significantly reducing latency.
2. Dynamic Replanning for Retrieval
- Challenge: In multi-hop queries, intermediate retrieval results often necessitate changes in subsequent queries or reasoning.
- Solution: LLMCompiler adapts dynamically through a dynamic execution graph, recomputing task dependencies as results come in, ensuring actions remain contextually relevant.
3. Enhanced Retrieval Precision with Task-Specific Tools
- Challenge: Generic retrieval tools often lack precision for task-specific needs.
- Solution: LLMCompiler integrates specialized retrieval tools, dynamically assigning the most relevant tool for each task to improve precision.
4. Scalability to Complex Queries
- Challenge: Traditional RAG systems struggle with multi-step queries involving intricate reasoning and dependencies.
- Solution: LLMCompiler creates directed acyclic graphs (DAGs) for task execution, efficiently managing complex reasoning and retrieval dependencies.
5. Plan-and-Solve Alignment
- Challenge: Treating retrieval and generation as a monolithic process can lead to inefficiencies.
- Solution: LLMCompiler breaks tasks into manageable sub-steps (e.g., retrieval → analysis → generation), optimizing each independently for accuracy and efficiency.
6. Reduced Token Usage and Cost
- Challenge: Excessive token consumption increases costs in traditional RAG workflows.
- Solution: Inspired by ReWOO (Xu et al., 2023), LLMCompiler decouples reasoning from execution, minimizing unnecessary LLM invocations and reducing token usage.
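The planner-executor pattern described above can be sketched with Python's `asyncio`: independent retrieval tasks run concurrently, while dependent tasks wait on their inputs, mirroring the task-fetching unit. The task names and stub "tools" here are illustrative, not PathLex's actual interfaces.

```python
import asyncio

# Hypothetical DAG: each task lists the task ids it depends on.
# "search_statute" and "search_case_law" are independent and run in
# parallel; "synthesize" waits for both results before executing.
TASKS = {
    "search_statute": {"deps": [], "fn": lambda _: "IPC s.302"},
    "search_case_law": {"deps": [], "fn": lambda _: "State v. X"},
    "synthesize": {"deps": ["search_statute", "search_case_law"],
                   "fn": lambda inputs: " + ".join(inputs)},
}

async def run_dag(tasks):
    done = {}
    events = {name: asyncio.Event() for name in tasks}

    async def run(name):
        spec = tasks[name]
        # Wait until every dependency has published its result.
        await asyncio.gather(*(events[d].wait() for d in spec["deps"]))
        inputs = [done[d] for d in spec["deps"]]
        done[name] = spec["fn"](inputs)
        events[name].set()

    await asyncio.gather(*(run(n) for n in tasks))
    return done

results = asyncio.run(run_dag(TASKS))
print(results["synthesize"])  # IPC s.302 + State v. X
```

Because only true data dependencies serialize execution, the two searches finish in roughly the time of one, which is the latency win LLMCompiler's DAG scheduling targets.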
Installation and Setup
Initial Steps
Note: Pathway's server can only run on Linux or macOS. To run it on Windows, it is recommended that you use WSL or run the server in Docker.
Create a new virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate
```

Install the requirements:

```bash
pip install -r requirements.txt
```
1. Setting up the Pathway VectorStore
This is the initial step required to run Pathway's Vector Store such that it can be connected to our main pipeline for retrieval. Pathway offers real-time data processing facilities that allow one to add or remove documents from a vector store in real-time.
Prerequisites
- An OpenAI or VoyageAI API key, depending on which embeddings you decide to use (we recommend `voyage-3` by default). Replace the placeholders in `run-server.py`.
- We use a custom parser in our solution, so `custom_parser.py` must be in the root directory.
- We modified the VectorStoreServer to suit our purpose, so `server.py` must also be in your root directory.
- An active internet connection; a disruption might crash the server.

Steps to run
- Go to the server directory: `cd pathway_server`
- Create a directory named `data` (or any other name) and upload your documents to it. Note: you may use other data sources as well, such as Google Drive; just replace the file path with the Drive URL for it to work.
- Install tesseract-ocr and set your TESSDATA path in `run-server.py`, in place of `TESSDATA_PREFIX`.
- Replace your OpenAI/VoyageAI key in `run-server.py`.

Finally, simply run:

```bash
python run-server.py
```

The server is hosted on `127.0.0.1`, port `8745`, by default, though you may change this if you wish. You can test that the server is running and working with:

```bash
python test_server.py
```

Do note that creating embeddings may take a long time, during which retrieval requests to the vector store may return a read timeout error. This goes away once the embeddings have been created, so it is recommended that you add only a few documents to the server at a time.
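Beyond `test_server.py`, the retrieval endpoint can be exercised directly. The sketch below builds the HTTP request by hand; the `/v1/retrieve` path and JSON payload shape follow Pathway's published VectorStoreServer REST interface and are an assumption here, since the repository's customized `server.py` may differ.

```python
import json
import urllib.request

PATHWAY_HOST = "127.0.0.1"  # defaults used throughout this README
PATHWAY_PORT = 8745

def build_retrieve_request(query: str, k: int = 3):
    # Assumed endpoint/payload per Pathway's VectorStoreServer REST API:
    # POST /v1/retrieve with {"query": ..., "k": ...}.
    url = f"http://{PATHWAY_HOST}:{PATHWAY_PORT}/v1/retrieve"
    body = json.dumps({"query": query, "k": k}).encode()
    return url, body

def retrieve(query: str, k: int = 3):
    # Posts the request to the running server; not executed here.
    # A generous timeout helps while embeddings are still being built.
    url, body = build_retrieve_request(query, k)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

url, body = build_retrieve_request("What does Section 302 IPC prescribe?", k=2)
print(url)
```

Calling `retrieve(...)` against a running server should return the top-k chunks with their metadata, which is what the main pipeline's tools consume.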
2. Setting up the environment variables
We use a few models and services in our pipeline. You will have to put a .env file in the root directory with the following parameters filled in:
```
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
COHERE_API_KEY=your_cohere_api_key
LANGCHAIN_HUB_API_KEY=your_langchain_hub_api_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_HOST=https://cloud.langfuse.com
PATHWAY_HOST=127.0.0.1
PATHWAY_PORT=8745
```
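Loading these variables is usually done with a package such as python-dotenv; a minimal dependency-free equivalent is sketched below. The file name and parsing rules mirror the example above (one `KEY=VALUE` per line, whitespace around `=` tolerated).

```python
import os

def load_env(path=".env"):
    # Minimal .env reader: KEY=VALUE per line, blank lines and
    # '#' comments ignored, whitespace around '=' stripped.
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Example: write a sample file, load it, and read a value back.
with open("sample.env", "w") as fh:
    fh.write("PATHWAY_HOST = 127.0.0.1\nPATHWAY_PORT = 8745\n")
load_env("sample.env")
print(os.environ["PATHWAY_HOST"])  # 127.0.0.1
```

In the actual pipeline the values would then be read with `os.environ[...]` or `os.getenv(...)` wherever a key or host is needed.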
3. Scripts for Quick Testing
We have provided Python scripts of our pipeline, without a UI, to make testing easier. You may use these scripts for quick evaluation, or run the complete UI by following the steps in the next section.
Steps to run:
- Ensure you have followed the steps mentioned in sections 1 and 2 to run the Pathway VectorStore and set up all the environment variables.
- You can test our pipeline by running:

```bash
python main.py "<insert query here>"
```
4. How to run the front end
Here we assume that you are running the frontend on a Windows device. For Linux/macOS, you can find similar steps to install the respective packages online.
Prerequisites
- NodeJS
- npm
- Download the NodeJS MSI
- Download the npm zip
Steps to run:
- Extract the MSI you downloaded above (with 7-Zip) into a directory named "node".
- Add the "node" directory to your PATH environment variable.
- Extract the npm zip into a different directory (not under the node directory).
- Now you should have node and npm working; use these commands to check:

```bash
node --version
npm --version
```

- To use the UI, we need a MongoDB Atlas API. You can follow these steps to create one.
- Add your `.env` file to `legal-chatbot-frontend\src\final_scripts` with your API keys, server localhost URL, server port number, or the ngrok link.
- Replace `MONGO_URL` in `legal-chatbot-frontend\src\final_scripts\globals_.py` with your own hosted MongoDB URL, with the database named `legalchatbot`. This is very important.
- This MongoDB URL must also be replaced in `legal-chatbot-backend\server.js`, in place of `your_mongo_url`.

Install the frontend dependencies:

```bash
cd legal-chatbot-frontend
npm i
```

Install the backend dependencies:

```bash
cd legal-chatbot-backend
npm i
```

To start the client, run:

```bash
cd legal-chatbot-frontend
npm run dev
```

To start the server, run:

```bash
cd legal-chatbot-backend
node server.js
```
5. Solution Pipeline & Usage Guide
- The owner logs in via the login page.
- The user can revisit their previous chats, or create a new chat.
- In a new chat, the user can put a legal query.
- After entering the legal query, the user can see the dynamic thought process on the left. This happens in the order of
- Plan and Schedule: This contains the plan made by the planner to answer the given query. The planner produces a plan with a list of tools to call and schedules them to be executed in parallel by the task-fetching unit.
- Joiner: The joiner agent decides whether to proceed with Generation or to Replan based on the outputs of the tool calls executed by the scheduler. It decides this by using an LLM to analyze the results of the tool calls and create a Thought and an Action along with feedback in case it decides to re-plan.
- Rewrite: The Rewrite agent receives the Thought and Feedback from the joiner, based on which it decides how to rewrite the query. It has a grade-documents function that generates a score for each chunk retrieved by the tool, allowing it to identify "good" documents among the retrieved documents that are relevant to the user query.
- Generate: This agent reviews the outputs from previous tool calls and the query to generate the final answer that is to be shown to the user.
- HIL: Humans can give feedback to rewrite or generate according to the retrieved docs.
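The Plan → Join → (Rewrite | Generate) loop above can be sketched as a small control loop. The stage names follow the README; the stub functions are illustrative stand-ins for the LLM-backed agents (the real joiner produces a Thought/Action via an LLM, here it simply replans when retrieval comes back empty).

```python
# Stub agents standing in for the LLM-backed pipeline stages.
def plan_and_schedule(query):
    # Pretend retrieval: only a sufficiently specific query yields chunks.
    return ["relevant chunk"] if "Section" in query else []

def joiner(chunks):
    # Decide whether to proceed to Generation or to Replan.
    return "generate" if chunks else "replan"

def rewrite(query):
    # Illustrative feedback-driven query refinement.
    return query + " under Section 302 IPC"

def generate(query, chunks):
    return f"Answer to '{query}' using {len(chunks)} chunk(s)"

def run_pipeline(query, max_replans=2):
    for _ in range(max_replans + 1):
        chunks = plan_and_schedule(query)
        if joiner(chunks) == "generate":
            return generate(query, chunks)
        query = rewrite(query)      # rewrite, then replan
    return "I don't know"           # fallback after exhausting replans

print(run_pipeline("What is the punishment for murder?"))
```

The bounded replan count plays the role of PathLex's multi-tier replanning cutoff, after which the human-in-the-loop (HIL) fallback would take over.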
6. Architecture
```
├── Experiments and miscellaneous
│   ├── beam_retriever_train_and_exp.py
│   ├── lumber chunking.py
│   └── meta chunking.py
├── HIL.py
├── README.md
├── Reports
│   ├── Pathway_MidEval_Report.pdf
│   └── endterm_report.pdf
├── agents.py
├── anthropic_functions.py
├── beam_retriever.py
├── beam_tool.py
├── citations.py
├── demo_videos
│   ├── demo.mp4
│   └── summary_video.mkv
├── globals_.py
├── imports.py
├── joiner.py
├── legal-chatbot-backend
│   ├── .gitignore
│   ├── cache1.txt
│   ├── models
│   │   ├── Chat.js
│   │   ├── Script_response.js
│   │   └── User.js
│   ├── package-lock.json
│   ├── package.json
│   ├── server.js
│   ├── test.txt
│   └── text_files
│       ├── generate.txt
│       ├── human_input_node.txt
│       ├── join.txt
│       ├── plan_and_schedule.txt
│       └── rewrite.txt
├── legal-chatbot-frontend
│   ├── .gitignore
│   ├── README.md
│   ├── eslint.config.js
│   ├── index.html
│   ├── package-lock.json
│   ├── package.json
│   ├── public
│   │   ├── send_btn.svg
│   │   └── vite.svg
│   ├── src
│   │   ├── App.css
│   │   ├── App.jsx
│   │   ├── Home.css
│   │   ├── Home.jsx
│   │   ├── SignIn.css
│   │   ├── SignIn.jsx
│   │   ├── SignUp.css
│   │   ├── SignUp.jsx
│   │   ├── assets
│   │   │   └── react.svg
│   │   ├── files.jsx
│   │   ├── final_scripts
│   │   │   ├── .gitignore
│   │   │   ├── HIL.py
│   │   │   ├── agents.py
│   │   │   ├── anthropic_functions.py
│   │   │   ├── beam_retriever.py
│   │   │   ├── beam_tool.py
│   │   │   ├── citations.py
│   │   │   ├── get_all_files.py
│   │   │   ├── globals_.py
│   │   │   ├── imports.py
│   │   │   ├── joiner.py
│   │   │   ├── main.py
│   │   │   ├── output.txt
│   │   │   ├── output_parser.py
│   │   │   ├── pathway_server
│   │   │   │   ├── custom_parser.py
│   │   │   │   ├── run-server.py
│   │   │   │   ├── server.py
│   │   │   │   └── test_server.py
│   │   │   ├── planner.py
│   │   │   ├── prompts.py
│   │   │   ├── requirements.txt
│   │   │   ├── task_fetching_unit.py
│   │   │   ├── test.py
│   │   │   ├── tools.py
│   │   │   └── utils.py
│   │   ├── index.css
│   │   └── main.jsx
│   └── vite.config.js
├── main.py
├── output.txt
├── output_parser.py
├── pathway_server
│   ├── custom_parser.py
│   ├── run-server.py
│   ├── server.py
│   └── test_server.py
├── planner.py
├── prompts.py
├── requirements.txt
├── task_fetching_unit.py
├── tools.py
└── utils.py
```
Team Members
Owner
- Login: himanshu-skid19
- Kind: user
- Repositories: 1
- Profile: https://github.com/himanshu-skid19
Citation (citations.py)
```python
from imports import *
from utils import *

import json


def remove_uuid(text):
    # Strip hyphenated lowercase-hex UUIDs from a string.
    return re.sub(r'\b[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}\b', '', text)


def extract_document_name(file_path):
    # Pull the trailing file name (pdf/docx/txt) out of a full path.
    match = re.search(r'[^/\\]+(?:\.pdf|\.docx|\.txt)', file_path, re.IGNORECASE)
    if match:
        return match.group(0)
    return file_path


def get_docs(answer):
    docs = []
    for data in reversed(answer):
        key = next(iter(data.keys()))
        value = data[key]['messages']
        for i in value:
            if isinstance(i, FunctionMessage):
                if i.content != 'join':
                    docs.append(i.content)
    if len(docs) == 0:
        return [], []
    content = docs[0]
    # Split on the "CHUNK ENDS HERE" delimiter
    chunks = re.split(r'\n----------------CHUNK ENDS HERE ------------------\n', content)
    if chunks[-1] == '':
        chunks.pop()
    processed_chunks = []
    process_chunks_citations = set()
    process_chunks_results = set()
    for chunk in chunks:
        # Clean and strip unnecessary newlines or hyphens
        chunk = chunk.strip()
        chunk = re.sub(r'-{2,}', '', chunk)  # Remove runs of multiple hyphens
        chunk = re.sub(r'\n+', '\n', chunk)  # Normalize newlines
        # Extract source and page details
        source_match = re.search(r'Source: (.+?)\n', chunk)
        page_match = re.search(r'Page Number: (\d+)', chunk)
        if source_match and page_match:
            source = source_match.group(1).strip()
            page = page_match.group(1).strip()
            doc_name = extract_document_name(source)
            # Combine into a single "SOURCE" line
            metadata = f"SOURCE: {doc_name}, Page: {page}"
            # Remove the original metadata from the chunk
            chunk = re.sub(r'Document Summary:\n.+?Page Content:\n', '', chunk, flags=re.DOTALL)
            # Prepend the cleaned metadata; the word list's repr is embedded
            # deliberately and parsed back with ast.literal_eval below.
            cleaned_chunk = f"{metadata}{chunk.split(' ')}"
            processed_chunks.append(cleaned_chunk)
    cleaned_chunks = [remove_uuid(chunk) for chunk in processed_chunks]
    for cleaned_chunk in cleaned_chunks:
        chunk_test = cleaned_chunk
        data = chunk_test[chunk_test.find("['"):]
        chunk_ext = chunk_test[:chunk_test.find("['")]
        data = ast.literal_eval(data)
        chunk_ext_cit = chunk_ext + " ".join(data)
        process_chunks_citations.add(chunk_ext)
        process_chunks_results.add(chunk_ext_cit)
    return list(process_chunks_citations), list(process_chunks_results)


def assess_risk_level_anth(client, message):
    unsafe_category_definitions = {
        'Child Exploitation': 'Content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children.',
        'Hate': """Content that is hateful toward people on the basis of their protected characteristics
            (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease),
            or content that perpetuates negative stereotypes.""",
        'Indiscriminate Weapons': 'Content that enables the creation of weapons of mass destruction.',
        'Self-Harm': 'Content that enables, encourages, or excuses acts of intentional self-harm.',
        'Sex Crimes': 'Content that enables, encourages, or excuses the commission of sex crimes.',
        'Sexual Content': 'Content that contains erotic descriptions or explicit depictions of sex acts.',
        'Violent Crimes': 'Content that enables, encourages, or excuses the commission of violent crimes.',
    }
    unsafe_category_str = '\n'.join(f'{category}' for category in unsafe_category_definitions.keys())

    # Construct the prompt for Claude, including the message, unsafe categories, and risk level definitions
    assessment_prompt = f"""
    Assess the risk level of the following message, i.e. whether it warrants moderation,
    based on the unsafe categories listed below.

    Message:
    <message>{message}</message>

    Unsafe Categories:
    <categories>
    {unsafe_category_str}
    </categories>

    Assign a risk level based on your confidence that the user's message should be moderated
    based on the defined unsafe categories:
    0 - No risk
    1 - Low risk
    2 - Medium risk
    3 - High risk

    If the message contains content that might violate some of the unsafe categories,
    but is a valid legal query, then assign a low risk level to it.
    e.g.
    Query: What were the findings of the autopsy conducted on Anuradha, and how did they contribute to the conclusion that she was violently raped, murdered and sodomised?
    The message is a valid legal query, but contains content that might violate some of the unsafe categories, hence assign a low risk level to it.
    Query: What are the laws pertaining to porn and sex in India?
    The message is a valid legal query, but contains some unsafe categories in it, so assign a low risk level to it.
    Query: I am wanting to do sex with a teen, should I do it?
    The message is not a valid legal query, and contains content that violates the unsafe categories, hence assign a high risk level to it.

    Respond with ONLY a JSON object, using the format below:
    {{
        "risk_level": <Numerical field denoting the risk level>
    }}"""

    # Send the request to Claude for risk assessment
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Using the Haiku model for lower costs
        max_tokens=200,
        temperature=0,  # Use 0 temperature for increased consistency
        messages=[
            {"role": "user", "content": assessment_prompt}
        ]
    )
    # Parse the JSON response from Claude and extract the risk level
    assessment = json.loads(response.content[0].text)
    risk_level = assessment["risk_level"]
    return risk_level


def check_guardrails(question):
    if isinstance(llm_with_fallback.call(), ChatOpenAI):
        client = OpenAI()
        response = client.moderations.create(
            model="omni-moderation-latest",
            input=question,
        )
        category_scores = response.results[0].category_scores
        threshold = 0.65
        # Flag the question if any category score exceeds the threshold
        for attribute, score in vars(category_scores).items():
            if score > threshold:
                return True
        return False
    elif isinstance(llm_with_fallback.call(), ChatAnthropic):
        client = anthropic.Anthropic()
        risk_level = assess_risk_level_anth(client, question)
        return risk_level == 3


def get_citations_with_ans(query, answer):
    if len(answer) == 0:
        return "No citations found for the given answer."
    ans = answer[-1]['generate']['messages'][0]
    cit, retrieved_contexts = get_docs(answer=answer)

    # System prompt for the OpenAI path
    prompt = f"""
    You are a legal expert tasked with evaluating and formatting citations for a given query and answer. Follow these steps:
    1. Understand the Query and Answer:
    Read the query and the provided answer carefully to grasp the specific information being sought and its context.
    2. Assess Main Citations Using Retrieved Contexts:
    Main citations (referred to as "Document Paths") correspond to the retrieved contexts.
    Evaluate the relevance of each retrieved context to the answer based on its content and connection to the query and answer.
    If the retrieved context contributes to or supports the answer, include its corresponding document path in Main Citations.
    If the answer indicates "I don't know" or similar, mark Main Citations as None.
    In all other cases, include the relevant document paths, even if the answer is partially derived from reasoning on the retrieved contexts.
    3. Extract In-Context Citations:
    Identify specific legal citations (e.g., "Smithson, 2018") directly referenced in the retrieved contexts that support the answer.
    If no relevant in-context citations are found, write None under In-Context Citations.
    4. Generate the Response:
    Main Citations:
    If the answer indicates "I don't know" or similar, write None.
    Otherwise, include all relevant document paths based on the retrieved contexts.
    In-Context Citations:
    If relevant citations are found, list them.
    If not, write None.
    Follow the formatting structure below:
    Response Structure:
    Main Citations:
    List relevant document paths in each newline, or write "None" if the answer is "I don't know" or similar.
    In-Context Citations:
    List specific citations in each newline, or write "None" if no in-context citations are found.
    Query: {query}
    Answer: {ans}
    Main Citations (Document Paths):
    {cit}
    Retrieved Contexts:
    {retrieved_contexts}
    Your response must be concise and include only citations (main and in-context) that are directly relevant to the query and answer. Always provide None explicitly under In-Context Citations if no relevant citations are found. Similarly, only provide None under Main Citations if the answer clearly indicates "I don't know" or equivalent.
    """

    # System prompt for the Anthropic path
    prompt1 = f"""
    You are a legal expert tasked with evaluating and formatting citations for a given query and answer. Follow these steps:
    1. Understand the Query and Answer:
    Read the query and the provided answer carefully to grasp the specific information being sought and its context.
    2. Assess Main Citations Using Retrieved Contexts:
    Main citations (referred to as "Document Paths") correspond to the retrieved contexts.
    Evaluate the relevance of each retrieved context to the answer based on its content and connection to the query and answer.
    If the retrieved context contributes to or supports the answer, include its corresponding document path in Main Citations.
    If the answer indicates "I don't know" or similar, mark Main Citations as None.
    In all other cases, include the relevant document paths, even if the answer is partially derived from reasoning on the retrieved contexts.
    3. Extract In-Context Citations:
    Identify specific legal citations (e.g., "Smithson, 2018") directly referenced in the retrieved contexts that support the answer.
    If no relevant in-context citations are found, write None under In-Context Citations.
    4. Generate the Response:
    Provide ONLY the following output format:
    Main Citations:
    List relevant document paths in each newline, or write "None" if the answer is "I don't know" or similar.
    In-Context Citations:
    List specific citations in each newline, or write "None" if no in-context citations are found.
    Strictly follow these rules:
    - Produce ONLY the two-line output above
    - Do not add any introductory or explanatory text
    - Do not include any notes or commentary
    - Ensure the output is exactly as shown, even if you have no citations to report
    """

    if isinstance(llm_with_fallback.call(), ChatOpenAI):
        client = OpenAI()
        completion = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": f"""
                **Query**: {query}
                **Answer**: {ans}
                **Main Citations (Document Paths)**: {cit}
                **Retrieved Contexts**:
                {retrieved_contexts}
                """},
            ],
        )
        response = completion.choices[0].message.content
    elif isinstance(llm_with_fallback.call(), ChatAnthropic):
        client = anthropic.Anthropic()
        completion = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            temperature=0,
            max_tokens=1024,
            system=prompt1,
            messages=[
                {"role": "user", "content": f"""
                **Query**: {query}
                **Answer**: {ans}
                **Main Citations (Document Paths)**: {cit}
                **Retrieved Contexts**:
                {retrieved_contexts}
                """}
            ]
        )
        response = completion.content[0].text

    final_response = f"Answer: \n{ans}\n{response}"
    return final_response
```
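A self-contained illustration of the metadata extraction that `get_docs` in `citations.py` performs per chunk: the UUID-stripping and file-name regexes are copied from the file above, while the sample chunk text is fabricated to match the `Source:`/`Page Number:` format the parser expects.

```python
import re

def remove_uuid(text):
    # Same pattern as citations.py: strips hyphenated lowercase-hex UUIDs.
    return re.sub(r'\b[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}\b', '', text)

def extract_document_name(file_path):
    # Same pattern as citations.py: trailing pdf/docx/txt file name.
    match = re.search(r'[^/\\]+(?:\.pdf|\.docx|\.txt)', file_path, re.IGNORECASE)
    return match.group(0) if match else file_path

# Fabricated chunk in the expected format (path and page are examples).
chunk = "Source: /data/ipc_cases/judgment-2019.pdf\nPage Number: 12\n..."
source = re.search(r'Source: (.+?)\n', chunk).group(1)
page = re.search(r'Page Number: (\d+)', chunk).group(1)
print(f"SOURCE: {extract_document_name(source)}, Page: {page}")
# SOURCE: judgment-2019.pdf, Page: 12
```

This compact `SOURCE: <file>, Page: <n>` line is what the citation prompts later receive as a "Document Path".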
GitHub Events
Total
- Watch event: 2
- Member event: 1
- Push event: 7
- Pull request event: 2
- Fork event: 12
- Create event: 2
Last Year
- Watch event: 2
- Member event: 1
- Push event: 7
- Pull request event: 2
- Fork event: 12
- Create event: 2