https://github.com/berkeleyljj/massive-serve-jinjian

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.3%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: berkeleyljj
Language: Python
Default Branch: main
Size: 37 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 12 months ago · Last pushed 10 months ago

Metadata Files

Readme

Introduction

This repository contains the Compact‑DS Dive Public API and server code. It exposes a production‑ready Flask service for retrieval‑augmented generation (RAG) backed by a billion‑scale FAISS IVFPQ index. The server provides adjustable settings and search modes at low-latency. A small CLI helps download/prepare indices and start the server, and an example domain (index_dev/) shows the expected data layout. Please refer to the Quickstart below to set up.

Expected data layout (under DATASTORE_PATH)

<DATASTORE_PATH>/ <domain_name>/ config.json # loader config (encoder, nprobe, index filename, etc.) index/ # a single merged FAISS file (*.faiss) passages/ # *.jsonl shards for text lookup

Quickstart

Datasets

CompactDS-102GB
- Core index and passages. Please refer to the dataset card for details.
Full embeddings
- PubMed embeddings are sharded, so combine locally if needed: bash cat massiveds-pubmed--passages7_00.pkl_{aa,ab,ac,ad,ae,af,ag,ah,ai} \ > massiveds-pubmed--passages7_00.pkl

1) Download the dataset/index from Hugging Face

Choose a local data root (DATASTOREPATH). Example: ```bash export DATASTOREPATH=/home/ubuntu/massive-serve-dev ```
Download the dataset into $DATASTORE_PATH/<domain_name> (example uses index_dev): bash huggingface-cli download <ORG_OR_USER>/<DATASET_REPO> \ --repo-type dataset \ --local-dir $DATASTORE_PATH/index_dev

Notes: - The directory should include an index/ directory with a FAISS index and a passages/ directory with .jsonl files. - If your index is uploaded in split/chunked form, see Step 3 to combine shards.

2) Build the position mapping arrays

The server looks up passage text by FAISS index id using position mapping arrays. Generate them once from your passages/ directory:

Open build_arr.py and set INPUT_DIR to your passages folder, e.g.: python INPUT_DIR = "/home/ubuntu/massive-serve-dev/index_dev/passages"
Then run from the repo root: bash python build_arr.py

This writes three files next to the script (and the server expects them under index_dev/ as configured by the code): - index_dev/position_array.npy - index_dev/filename_index_array.npy - index_dev/filename_list.npy

3) Combine FAISS index shards

Your index/ folder contains split parts, combine them into a single .faiss file before serving.

Simple shard set (concatenate all ...faiss_** parts in order; do NOT include the .meta file): bash cd $DATASTORE_PATH/index_dev/index # Example names: index_IVFPQ.100000000.768.65536.64.faiss_aa, ..._ab, ..._ac, ... cat $(ls index_IVFPQ.100000000.768.65536.64.faiss_* | sort) > index_full.faiss

After combining, ensure there is exactly one .faiss file in index/ (e.g., index/index_full.faiss).

4) Launch the API server

Use index_dev as the domain name (provided by index_dev/config.json): bash DATASTORE_PATH=/home/ubuntu/massive-serve-dev \ python -m massive_serve.cli serve --domain_name index_dev

By default the server starts at port 30888 and exposes /search and /vote endpoints.

5) Test the API

For the full reference and examples, see API_DOCUMENTATION.md. You can use curl commands to run quick tests.

Single query: bash curl -X POST http://compactds.duckdns.org:30888/search \ -H "Content-Type: application/json" \ -d '{"query": "Tell me more about Albert Einstein", "n_docs": 5, "nprobe": 32}'

Batched queries: bash curl -X POST http://compactds.duckdns.org:30888/search \ -H "Content-Type: application/json" \ -d '{"queries": ["quantum computing", "Who is Nikola Tesla", "AI ethics"], "n_docs": 2}'

Citation

@article{lyu2025compactds, title={Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks}, author={Xinxi Lyu and Michael Duan and Rulin Shao and Pang Wei Koh and Sewon Min} journal={arXiv preprint arXiv:2507.01297}, year={2025} }

Owner

Name: Jinjian Liu
Login: berkeleyljj
Kind: user

Repositories: 1
Profile: https://github.com/berkeleyljj

GitHub Events

Total

Delete event: 1
Push event: 26
Pull request event: 3
Create event: 2

Last Year

Delete event: 1
Push event: 26
Pull request event: 3
Create event: 2

Dependencies

pyproject.toml pypi

setup.py pypi

click *
faiss-cpu *
flask *
flask-cors *
numpy *
sentence-transformers *
torch *
tqdm *
transformers *

rerank/contriever/requirements.txt pypi

beir ==1.0.0
torch ==1.11.0
transformers ==4.18.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science