https://github.com/deepset-ai/haystack-sagemaker

πŸš€ This repo is a showcase of how you can use models deployed on AWS SageMaker in your Haystack Retrieval Augmented Generative AI pipelines

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • β—‹ CITATION.cff file
  • βœ“ codemeta.json file (found)
  • β—‹ .zenodo.json file
  • β—‹ DOI references
  • β—‹ Academic publication links
  • β—‹ Committers with academic emails
  • β—‹ Institutional organization owner
  • β—‹ JOSS paper metadata
  • β—‹ Scientific vocabulary similarity (low similarity of 10.8% to scientific vocabulary)

Keywords

aws, haystack, llm, nlp, opensearch, sagemaker

Keywords from Contributors

agent, transformers

Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: deepset-ai
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 1.06 MB
Statistics
  • Stars: 13
  • Watchers: 1
  • Forks: 3
  • Open Issues: 2
  • Releases: 0
Topics
aws, haystack, llm, nlp, opensearch, sagemaker
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme: README.md

README.md

Haystack Retrieval-Augmented Generative QA Pipelines with SageMaker JumpStart

This repo is a showcase of how you can use models deployed on SageMaker JumpStart in your Haystack Retrieval Augmented Generative AI pipelines.

Instructions:
- Starting an OpenSearch service
- Indexing Documents to OpenSearch
- The RAG Pipeline

The Repo Structure
This repository contains two runnable Python scripts, one for indexing and one for the retrieval-augmented pipeline, with instructions on how to run them below:

opensearch_indexing_pipeline.py

rag_pipeline.py

We've also included notebooks for both in `notebooks/`, which you can optionally use to create and run each pipeline step by step.

Prerequisites
To run the Haystack pipelines and use the models from SageMaker in this repository, you need an AWS account, and we suggest setting up the AWS CLI on your machine.

The Data

This showcase includes some documents we've crawled from the OpenSearch website and documentation pages. You can index these into your own `OpenSearchDocumentStore` using `opensearch_indexing_pipeline.py` or `notebooks/opensearch_indexing_pipeline.ipynb`.

The Model

For this demo, we deployed the falcon-40b-instruct model on SageMaker JumpStart. To deploy a model on JumpStart, log in to your AWS account, open SageMaker Studio, navigate to JumpStart, and deploy falcon-40b-instruct; this may take a few minutes. Once the model is deployed, you can use your own credentials in the PromptNode.
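
Once the endpoint is up, wiring it into Haystack takes only a few lines. Here is a minimal sketch, assuming the Haystack v1 `PromptNode` API; the endpoint name below is illustrative, so use the name SageMaker shows after deployment:

```python
import os

from haystack.nodes import PromptNode

# "falcon-40b-instruct-endpoint" is a placeholder; use your real endpoint name.
prompt_node = PromptNode(
    model_name_or_path="falcon-40b-instruct-endpoint",
    model_kwargs={
        "aws_profile_name": os.environ.get("AWS_PROFILE_NAME"),
        "aws_region_name": os.environ.get("AWS_REGION_NAME"),
    },
    max_length=256,
)

# Quick smoke test against the deployed model
print(prompt_node("What is OpenSearch?"))
```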

Starting an OpenSearch service

Option 1: OpenSearch service on AWS

Requirements: An AWS account and AWS CLI

You can use the provided CloudFormation template `cloudformation/opensearch-index.yaml` to deploy an OpenSearch service on AWS.

Set `--stack-name` and `OSPassword` to your preferred values and run the following. You may also change the default `OSDomainName` and `OSUsername` values (set to `opensearch-haystack-domain` and `admin` respectively) in `opensearch-index.yaml`:

```bash
aws cloudformation create-stack --stack-name HaystackOpensearch \
  --template-body file://cloudformation/opensearch-index.yaml \
  --parameters ParameterKey=InstanceType,ParameterValue=r5.large.search \
               ParameterKey=InstanceCount,ParameterValue=3 \
               ParameterKey=OSPassword,ParameterValue=Password123!
```

You can then retrieve the OpenSearch host required to write documents by running:

```bash
aws cloudformation describe-stacks --stack-name HaystackOpensearch \
  --query "Stacks[0].Outputs[?OutputKey=='OpenSearchEndpoint'].OutputValue" \
  --output text
```

Option 2: Local OpenSearch service

Requirements: Docker

Another option is to have a local OpenSearch service. For this, you may simply run:

```python
from haystack.utils import launch_opensearch

# Starts an OpenSearch container via Docker
launch_opensearch()
```

This will start an OpenSearch service on `localhost:9200`.

The Indexing Pipeline: Write Documents to OpenSearch

To run the scripts and notebooks provided here, first clone the repo and install the requirements:

```bash
git clone git@github.com:deepset-ai/haystack-sagemaker.git
cd haystack-sagemaker
pip install -r requirements.txt
```

Writing documents

You can use a Haystack indexing pipeline to prepare and write documents to an `OpenSearchDocumentStore`.

1. Set your environment variables:

```bash
export OPENSEARCH_HOST='your_opensearch_host'
export OPENSEARCH_PORT='your_opensearch_port'
export OPENSEARCH_USERNAME='your_opensearch_username'
export OPENSEARCH_PASSWORD='your_opensearch_password'
```

2. Use the indexing pipeline to write the preprocessed documents to your OpenSearch index:

Option 1:

For this demo, we've prepared documents which have been crawled from the OpenSearch documentation and website. As an example of how you may use an S3 bucket, we've also made them available here and here.

Run `python opensearch_indexing_pipeline.py --fetch-files` to fetch these two files from S3, or modify the source code in `opensearch_indexing_pipeline.py` to fetch your own files from an S3 bucket. This will fetch the specified files from the S3 bucket and put them in `data/`. The script will then preprocess and prepare Documents from these files, and write them to your `OpenSearchDocumentStore`.
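
If you adapt the script to your own bucket, the S3 fetch itself is a short boto3 call. A minimal sketch, where the bucket and key names are placeholders rather than the repo's actual ones:

```python
import os

import boto3

BUCKET = "your-bucket-name"             # placeholder bucket
KEYS = ["file_1.json", "file_2.json"]   # placeholder object keys

s3 = boto3.client("s3")
os.makedirs("data", exist_ok=True)
for key in KEYS:
    # Download each object into the local data/ directory
    s3.download_file(BUCKET, key, os.path.join("data", key))
```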

Option 2:

Run `python opensearch_indexing_pipeline.py`

This will write the same files, already available in `data/`, to your `OpenSearchDocumentStore`.
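
For orientation, here is a condensed sketch of the kind of indexing pipeline a script like `opensearch_indexing_pipeline.py` builds, assuming the Haystack v1 API; the file path and preprocessing parameters are illustrative:

```python
import os

from haystack import Pipeline
from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes import PreProcessor, TextConverter

# Connect to OpenSearch using the environment variables set earlier
document_store = OpenSearchDocumentStore(
    host=os.environ["OPENSEARCH_HOST"],
    port=int(os.environ["OPENSEARCH_PORT"]),
    username=os.environ["OPENSEARCH_USERNAME"],
    password=os.environ["OPENSEARCH_PASSWORD"],
)

# Convert raw files to Documents, split them into passages, write to the store
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=TextConverter(), name="Converter", inputs=["File"])
indexing_pipeline.add_node(
    component=PreProcessor(split_by="word", split_length=200, split_overlap=20),
    name="PreProcessor",
    inputs=["Converter"],
)
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

indexing_pipeline.run(file_paths=["data/example.txt"])  # placeholder file
```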

The RAG Pipeline

An indexing pipeline prepares and writes documents to a DocumentStore so that they are in a format which is usable by your choice of NLP pipeline and language models.

A query pipeline, on the other hand, is any combination of Haystack nodes that consumes a user query and produces a response. Here, you will find a retrieval-augmented question answering pipeline in `rag_pipeline.py`.

First, set your SageMaker endpoint and AWS environment variables:

```bash
export SAGEMAKER_MODEL_ENDPOINT=your_falcon_40b_instruct_endpoint
export AWS_PROFILE_NAME=your_aws_profile
export AWS_REGION_NAME=your_aws_region
```

Running the following will start a retrieval-augmented QA pipeline with the prompt defined in the PromptTemplate. Feel free to modify this template, or use one of our prompts from the PromptHub to experiment with different instructions.

```bash
python rag_pipeline.py
```

Then, ask some questions about OpenSearch πŸ₯³ πŸ‘‡

https://github.com/deepset-ai/haystack-sagemaker/assets/15802862/40563962-2d75-415b-bac4-b25eaa5798e5
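
For reference, here is a condensed sketch of the kind of RAG pipeline `rag_pipeline.py` builds, assuming the Haystack v1 API; the BM25 retriever choice, prompt text, and `top_k` value are illustrative rather than the repo's exact configuration:

```python
import os

from haystack import Pipeline
from haystack.document_stores import OpenSearchDocumentStore
from haystack.nodes import AnswerParser, BM25Retriever, PromptNode, PromptTemplate

document_store = OpenSearchDocumentStore(
    host=os.environ["OPENSEARCH_HOST"],
    port=int(os.environ["OPENSEARCH_PORT"]),
    username=os.environ["OPENSEARCH_USERNAME"],
    password=os.environ["OPENSEARCH_PASSWORD"],
)

# Fetch candidate documents from OpenSearch for each query
retriever = BM25Retriever(document_store=document_store, top_k=5)

# Illustrative prompt; the repo defines its own PromptTemplate
rag_prompt = PromptTemplate(
    prompt="Given the context, answer the question.\n"
           "Context: {join(documents)}\nQuestion: {query}\nAnswer:",
    output_parser=AnswerParser(),
)

# The SageMaker endpoint name is resolved from the environment variables set above
prompt_node = PromptNode(
    model_name_or_path=os.environ["SAGEMAKER_MODEL_ENDPOINT"],
    default_prompt_template=rag_prompt,
    model_kwargs={
        "aws_profile_name": os.environ["AWS_PROFILE_NAME"],
        "aws_region_name": os.environ["AWS_REGION_NAME"],
    },
)

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

result = pipeline.run(query="What is OpenSearch?")
print(result["answers"][0].answer)
```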

Owner

  • Name: deepset
  • Login: deepset-ai
  • Kind: organization
  • Email: hello@deepset.ai
  • Location: Berlin, Germany

Building enterprise search systems powered by latest NLP & open-source.

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 19
  • Total Committers: 2
  • Avg Commits per committer: 9.5
  • Development Distribution Score (DDS): 0.053
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Tuana Γ‡elik t****k@d****i 18
Malte Pietsch m****h@d****i 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 1
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ibuchh (1)
Pull Request Authors
  • TuanaCelik (2)
  • dtaivpp (1)
Top Labels
Issue Labels
Pull Request Labels