https://github.com/lancedb/lance
Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 91 committers (2.2%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: lancedb
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Homepage: https://lancedb.github.io/lance/
- Size: 27.5 MB
Statistics
- Stars: 5,305
- Watchers: 51
- Forks: 450
- Open Issues: 808
- Releases: 330
Metadata Files
README.md
**Modern columnar data format for ML. Convert from Parquet in 2 lines of code for 100x faster random access, zero-cost schema evolution, rich secondary indices, versioning, and more.**
**Compatible with Pandas, DuckDB, Polars, Pyarrow, and Ray with more integrations on the way.**
Documentation • Blog • Discord • X
Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:
- Building search engines and feature stores.
- Large-scale ML training requiring high performance IO and shuffles.
- Storing, querying, and inspecting deeply nested data for robotics or large blobs like images, point clouds, and more.
The key features of Lance include:
- High-performance random access: 100x faster than Parquet without sacrificing scan performance.
- Vector search: find nearest neighbors in milliseconds and combine OLAP-queries with vector search.
- Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.
- Ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark and more on the way.
[!TIP] Lance is in active development and we welcome contributions. Please see our contributing guide for more information.
Quick Start
Installation
```shell
pip install pylance
```
To install a preview release:
```shell
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ pylance
```
[!TIP] Preview releases are released more often than full releases and contain the latest features and bug fixes. They receive the same level of testing as full releases. We guarantee they will remain published and available for download for at least 6 months. When you want to pin to a specific version, prefer a stable release.
Converting to Lance
```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

# Write a small Parquet dataset to convert.
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

# Open the Parquet dataset and write it back out as Lance.
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
Reading Lance data
```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```
Pandas
```python
df = dataset.to_table().to_pandas()
df
```
DuckDB
```python
import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
```
Vector search
Download the sift1m subset
```shell
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```
Convert it to Lance
```python
import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vecdata.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```
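The `.fvecs` files in the TEXMEX corpus are commonly described as a sequence of records, each a little-endian 4-byte integer dimension followed by that many 4-byte floats. Under that assumption, the snippet above, which skips only the first 4-byte header, would read the later headers as float values. The following is a minimal record-aware reader, offered as a sketch; `read_fvecs` is a hypothetical helper and not part of Lance.

```python
import struct


def read_fvecs(buf: bytes) -> list:
    """Parse a .fvecs byte buffer into a list of float vectors.

    Assumes each record is an int32 dimension followed by that many float32s.
    """
    vectors = []
    offset = 0
    while offset < len(buf):
        # Each record starts with a little-endian int32 holding the dimension.
        (dim,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        # The header is followed by `dim` little-endian float32 values.
        vectors.append(list(struct.unpack_from(f"<{dim}f", buf, offset)))
        offset += 4 * dim
    return vectors


# Tiny synthetic buffer holding two 3-dimensional vectors (made-up data).
sample = struct.pack("<i3f", 3, 1.0, 2.0, 3.0) + struct.pack("<i3f", 3, 4.0, 5.0, 6.0)
parsed = read_fvecs(sample)
print(parsed)
```

For the full 1M-vector file, a vectorized reader would likely be much faster than this pure-Python loop; the sketch only illustrates the record layout.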
Build the index
```python
sift1m.create_index("vector",
                    index_type="IVF_PQ",
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
```
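To make the two parameters concrete: IVF groups vectors into `num_partitions` clusters so a query only visits a few of them, and PQ splits each vector into `num_sub_vectors` chunks that are each replaced by a short code. The arithmetic below is a rough sketch for the SIFT settings used here. It assumes float32 inputs and 8-bit PQ codes, which is a common configuration but is an assumption here, not a statement about Lance's internals; `pq_layout` is a hypothetical helper.

```python
def pq_layout(ndims, nvecs, num_partitions, num_sub_vectors, bits_per_code=8):
    """Back-of-the-envelope sizes for an IVF_PQ index (assumed 8-bit codes)."""
    if ndims % num_sub_vectors != 0:
        raise ValueError("ndims must be divisible by num_sub_vectors")
    raw_bytes = ndims * 4                            # float32 storage per vector
    pq_bytes = num_sub_vectors * bits_per_code // 8  # one code per sub-vector
    return {
        "sub_vector_dims": ndims // num_sub_vectors,
        "raw_bytes_per_vector": raw_bytes,
        "pq_bytes_per_vector": pq_bytes,
        "compression_ratio": raw_bytes / pq_bytes,
        "avg_vectors_per_partition": nvecs / num_partitions,
    }


layout = pq_layout(ndims=128, nvecs=1_000_000, num_partitions=256, num_sub_vectors=16)
for name, value in layout.items():
    print(f"{name}: {value}")
```

Under these assumptions each 128-dimensional vector shrinks from 512 bytes to 16 bytes of codes, and an average partition holds roughly 3,900 vectors.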
Search the dataset
```python
# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]
```
Directory structure
| Directory | Description |
|---|---|
| rust | Core Rust implementation |
| python | Python bindings (PyO3) |
| java | Java bindings (JNI) |
| docs | Documentation source |
What makes Lance different
Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.
Vector index: Vector index for similarity search over embedding space. Supports both CPUs (x86_64 and arm) and GPUs (Nvidia (CUDA) and Apple Silicon (MPS)).
Encodings: To achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.
Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
Versioning: A Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.
Fast updates (ROADMAP): Updates will be supported via write-ahead logs.
Rich secondary indices: Supports BTree, Bitmap, full-text search, label list, NGrams, and more.
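The versioning idea above, where a manifest records each snapshot, can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `Manifest`, `ToyDataset`, and the fragment file names are hypothetical and do not reflect Lance's actual on-disk layout or API. It shows why an append can create a new version without copying data: the new manifest simply references the existing files plus one more.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """One immutable snapshot: a version number and the data files it references."""
    version: int
    fragments: tuple


@dataclass
class ToyDataset:
    """Hypothetical in-memory model of manifest-based versioning."""
    manifests: list = field(default_factory=list)

    def _commit(self, fragments):
        # Every write records a new manifest; older manifests are never modified.
        manifest = Manifest(version=len(self.manifests) + 1, fragments=tuple(fragments))
        self.manifests.append(manifest)
        return manifest

    def append(self, fragment):
        # An append reuses every existing fragment and adds one new file.
        current = self.manifests[-1].fragments if self.manifests else ()
        return self._commit(current + (fragment,))

    def overwrite(self, fragment):
        # An overwrite starts a snapshot that references only the new file.
        return self._commit((fragment,))

    def checkout(self, version):
        # Versions are numbered from 1 in this toy model.
        return self.manifests[version - 1]


ds = ToyDataset()
ds.append("frag-0.lance")
ds.append("frag-1.lance")
ds.overwrite("frag-2.lance")
print(ds.checkout(2).fragments)  # the earlier snapshot is still readable
print(ds.checkout(3).fragments)
```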
Benchmarks
Vector search
We used the SIFT dataset, with 1M vectors of 128 dimensions, to benchmark vector search.
- For 100 randomly sampled query vectors, we get <1ms average response time (on a 2023 M2 MacBook Air)

- ANNs are always a trade-off between recall and performance
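Recall in this trade-off is usually measured as recall@k: the fraction of the true k nearest neighbors that the approximate index actually returned. A minimal sketch follows; `recall_at_k` is a hypothetical helper and the id lists are made-up illustration data, not benchmark results.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / k


exact = [7, 3, 9, 1, 4]   # made-up ground-truth neighbor ids from an exact search
approx = [7, 9, 2, 1, 8]  # made-up ids returned by an approximate index
print(recall_at_k(approx, exact, k=5))  # 3 of the 5 true neighbors were found
```

Averaging this value over many queries gives a single recall figure to weigh against query latency.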

Vs. parquet
We created a Lance dataset from the Oxford Pet dataset for preliminary performance testing of Lance against Parquet and raw image/XML files. For analytics queries, Lance is 50-100x faster than reading the raw metadata. For batched random access, Lance is 100x faster than both Parquet and raw files.

Why are you building yet another data format?!
The machine learning development cycle involves the steps:
```mermaid
graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;
```
People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images / sensor data for deep learning, which are difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based technologies such as Delta Lake and Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but must then convert the data into training-friendly formats such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice.
While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored to multistage ML development cycles to reduce data silos.
A comparison of different data formats at each stage of the ML development cycle.
| | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---|---|---|---|---|---|---|
| Analytics | Fast | Fast | Slow | Slow | Decent | Fast |
| Feature Engineering | Fast | Fast | Decent | Slow | Decent | Good |
| Training | Fast | Decent | Slow | Fast | N/A | N/A |
| Exploration | Fast | Slow | Fast | Slow | Fast | Decent |
| Infra Support | Rich | Rich | Decent | Limited | Rich | Rich |
Community Highlights
Lance is currently used in production by:
- LanceDB, a serverless, low-latency vector database for ML applications
- LanceDB Enterprise, hyperscale LanceDB with enterprise SLA.
- Leading multimodal Gen AI companies for training over petabyte-scale multimodal data.
- Self-driving car company for large-scale storage, retrieval and processing of multi-modal data.
- E-commerce company for billion-scale+ vector personalized search.
- and more.
Presentations, Blogs and Talks
- Designing a Table Format for ML Workloads, Feb 2025.
- Transforming Multimodal Data Management with LanceDB, Ray Summit, Oct 2024.
- Lance v2: A columnar container format for modern data, Apr 2024.
- Lance Deep Dive, July 2023.
- Lance: A New Columnar Data Format, SciPy 2022, Austin, TX, July 2022.
Owner
- Name: Lance DB
- Login: lancedb
- Kind: organization
- Location: United States of America
- Repositories: 1
- Profile: https://github.com/lancedb
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Lei Xu | l****i@l****m | 683 |
| Weston Pace | w****e@g****m | 363 |
| Will Jones | w****7@g****m | 273 |
| Chang She | 7****n | 212 |
| BubbleCal | b****l@o****m | 191 |
| Rob Meng | r****g@g****m | 127 |
| Lance Release | l****v@l****m | 103 |
| Lance Release | l****v@e****i | 37 |
| LuQQiu | l****b@g****m | 32 |
| broccoliSpicy | 9****y | 30 |
| gsilvestrin | g****n | 29 |
| Bert | a****t@l****m | 23 |
| vinoyang | y****7@g****m | 21 |
| huangzhaowei | c****x@g****m | 21 |
| Chongchen Chen | c****y@q****m | 16 |
| Rok Mihevc | r****k@m****g | 15 |
| Jai Chopra | j****a@g****m | 14 |
| Raunak Shah | r****0@g****m | 14 |
| jay | j****n@b****m | 12 |
| Jiacheng Yang | 9****b | 9 |
| jacketsj | j****j | 9 |
| dsgibbons | g****t@g****o | 7 |
| Tanay Mehta | h****y@g****m | 6 |
| Yue | n****m@g****m | 6 |
| Ishan Anand | a****n@o****m | 5 |
| Jack Ye | y****n@g****m | 5 |
| Wyatt Alt | w****t@g****m | 5 |
| Xin Hao | h****t@g****m | 5 |
| Utkarsh Gautam | u****7@g****m | 4 |
| universalmind303 | c****d@g****m | 4 |
| and 61 more... | ||
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1,186
- Total pull requests: 3,473
- Average time to close issues: about 2 months
- Average time to close pull requests: 8 days
- Total issue authors: 151
- Total pull request authors: 131
- Average comments per issue: 0.79
- Average comments per pull request: 1.17
- Merged pull requests: 2,631
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 607
- Pull requests: 1,674
- Average time to close issues: 18 days
- Average time to close pull requests: 5 days
- Issue authors: 92
- Pull request authors: 88
- Average comments per issue: 0.63
- Average comments per pull request: 1.34
- Merged pull requests: 1,194
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- wjones127 (294)
- westonpace (220)
- eddyxu (94)
- jackye1995 (66)
- chebbyChefNEQ (47)
- yanghua (46)
- broccoliSpicy (36)
- changhiskhan (29)
- BubbleCal (19)
- Jay-ju (17)
- majin1102 (16)
- Xuanwo (16)
- jacketsj (13)
- SaintBacchus (12)
- tonyf (12)
Pull Request Authors
- westonpace (732)
- eddyxu (497)
- wjones127 (475)
- BubbleCal (464)
- chebbyChefNEQ (198)
- jackye1995 (96)
- yanghua (89)
- LuQQiu (84)
- Jay-ju (73)
- broccoliSpicy (69)
- SaintBacchus (54)
- Xuanwo (48)
- albertlockett (43)
- majin1102 (37)
- chenkovsky (36)
Packages
- Total packages: 23
- Total downloads:
  - cargo: 2,795,723 total
- Total dependent packages: 52 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 1,699
- Total maintainers: 6
proxy.golang.org: github.com/lancedb/lance
- Documentation: https://pkg.go.dev/github.com/lancedb/lance#section-documentation
- License: apache-2.0
- Latest release: v0.34.0, published 6 months ago
crates.io: lance
A columnar data format that is 100x faster than Parquet for random access.
- Documentation: https://docs.rs/lance/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (6)
repo1.maven.org: com.lancedb:lance-core
Lance Format Java API
- Homepage: http://lancedb.com/
- Documentation: https://appdoc.app/artifact/com.lancedb/lance-core/
- License: The Apache Software License, Version 2.0
- Latest release: 0.35.0, published 6 months ago
repo1.maven.org: com.lancedb:lance-parent
Lance Format Java API
- Homepage: http://lancedb.com/
- Documentation: https://appdoc.app/artifact/com.lancedb/lance-parent/
- License: The Apache Software License, Version 2.0
- Latest release: 0.35.0, published 6 months ago
crates.io: lance-tools
Tools for interacting with Lance files and tables
- Documentation: https://docs.rs/lance-tools/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (1)
crates.io: lance-arrow
Arrow Extension for Lance
- Documentation: https://docs.rs/lance-arrow/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (1)
crates.io: lance-linalg
A columnar data format that is 100x faster than Parquet for random access.
- Documentation: https://docs.rs/lance-linalg/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (5)
crates.io: lance-testing
A columnar data format that is 100x faster than Parquet for random access.
- Documentation: https://docs.rs/lance-testing/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (5)
crates.io: lance-index
Lance indices implementation
- Documentation: https://docs.rs/lance-index/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (1)
crates.io: lance-datagen
A columnar data format that is 100x faster than Parquet for random access.
- Documentation: https://docs.rs/lance-datagen/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (1)
crates.io: vercel_blob
A rust client for the Vercel Blob Storage API
- Documentation: https://docs.rs/vercel_blob/
- License: Apache-2.0
- Latest release: 0.1.0, published over 2 years ago
Maintainers (1)
crates.io: fsst
FSST string compression for Lance
- Documentation: https://docs.rs/fsst/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (4)
crates.io: lance-bitpacking
Vendored copy of https://github.com/spiraldb/fastlanes for use in Lance
- Documentation: https://docs.rs/lance-bitpacking/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (1)
crates.io: lance-examples
Lance examples in Rust
- Documentation: https://docs.rs/lance-examples/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (2)
crates.io: lance-encoding
Encoders and decoders for the Lance file format
- Documentation: https://docs.rs/lance-encoding/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (4)
crates.io: lance-jni
JNI bindings for Lance Columnar format
- Documentation: https://docs.rs/lance-jni/
- License: Apache-2.0
- Latest release: 0.31.0, published 8 months ago
Maintainers (4)
crates.io: lance-core
Lance Columnar Format -- Core Library
- Documentation: https://docs.rs/lance-core/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (5)
crates.io: lance-test-macros
A columnar data format that is 100x faster than Parquet for random access.
- Documentation: https://docs.rs/lance-test-macros/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (1)
crates.io: lance-encoding-datafusion
Encoders and decoders for the Lance file format that rely on datafusion
- Documentation: https://docs.rs/lance-encoding-datafusion/
- License: Apache-2.0
- Latest release: 0.30.0, published 8 months ago
Maintainers (3)
crates.io: lance-io
I/O utilities for Lance
- Documentation: https://docs.rs/lance-io/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (5)
crates.io: lance-table
Utilities for the Lance table format
- Documentation: https://docs.rs/lance-table/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (5)
crates.io: lance-file
Utilities for the Lance file format
- Documentation: https://docs.rs/lance-file/
- License: Apache-2.0
- Latest release: 0.34.0, published 6 months ago
Maintainers (4)
crates.io: lance-datafusion
Internal utilities used by other lance modules to simplify working with datafusion
- Documentation: https://docs.rs/lance-datafusion/
- License: Apache-2.0
- Latest release: 0.35.0, published 6 months ago
Maintainers (5)
Dependencies
- ./.github/workflows/build_linux_wheel * composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- PyO3/maturin-action v1 composite
- PyO3/maturin-action v1 composite
- PyO3/maturin-action v1 composite
- actions/upload-artifact v3 composite
- Swatinem/rust-cache v2 composite
- actions/checkout v3 composite
- katyo/publish-crates v2 composite
- ./.github/workflows/build_linux_wheel * composite
- actions/checkout v3 composite
- actions/configure-pages v2 composite
- actions/deploy-pages v1 composite
- actions/setup-python v4 composite
- actions/upload-pages-artifact v1 composite
- actions/checkout v3 composite
- ./.github/workflows/bump-version * composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- ad-m/github-push-action master composite
- ./.github/workflows/build_linux_wheel * composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/github-script v6 composite
- actions/setup-node v3 composite
- ./.github/workflows/build_linux_wheel * composite
- ./.github/workflows/build_mac_wheel * composite
- ./.github/workflows/build_windows_wheel * composite
- ./.github/workflows/upload_wheel * composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- ./.github/workflows/build_linux_wheel * composite
- ./.github/workflows/build_mac_wheel * composite
- ./.github/workflows/build_windows_wheel * composite
- ./.github/workflows/run_integtests * composite
- ./.github/workflows/run_tests * composite
- Swatinem/rust-cache v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- amazon/dynamodb-local * docker
- lazybit/minio * docker
- Swatinem/rust-cache v2 composite
- actions/checkout v3 composite
- libduckdb-sys 0.8.1 development
- arrow-array 43.0.0
- arrow-schema 43.0.0
- futures 0.3
- lazy_static 1.4.0
- num-traits 0.2
- tokio 1.23
- all_asserts 2.3.1 development
- approx 0.5.1 development
- clap 4.1.1 development
- dirs 5.0.0 development
- mock_instant 0.3.1 development
- arrow 43.0.0
- arrow-ipc 43.0
- async-recursion 1.0
- async-trait 0.1.60
- aws-config 0.56
- aws-credential-types 0.56
- aws-sdk-dynamodb 0.30.0
- byteorder 1.4.3
- bytes 1.3
- cblas 0.4.0
- chrono 0.4.23
- clap 4.1.1
- dashmap 5
- datafusion 28.0.0
- half 2.2.1
- http 0.2.9
- lapack 0.19.0
- lru_time_cache 0.11
- moka 0.11.3
- num-traits 0.2
- num_cpus 1.0
- ordered-float 3.6.0
- path-absolutize 3.0.14
- pin-project 1.0
- prost 0.10
- prost-types 0.10
- rand 0.8.3
- roaring 0.10.1
- shellexpand 3.0.0
- tfrecord 0.14.0
- url 2.3
- uuid 1.2
- ubuntu 22.04 build
- breathe *
- cython *
- duckdb >=0.8
- fastai *
- jupyterlab *
- pandas *
- piccolo-theme *
- pyarrow *
- sphinx ==7.1.2
- tensorflow *
- xmltodict *
- numpy >=1.22
- pyarrow >=10