extralit

Fast and accurate systemic data extraction with LLM assistance

https://github.com/extralit/extralit

Keywords

data-extraction literature-review llm

Last synced: 6 months ago · JSON representation

Repository

Fast and accurate systemic data extraction with LLM assistance

Basic Info

Host: GitHub
Owner: Extralit
License: apache-2.0
Language: Python
Default Branch: develop
Homepage: https://extralit.ai/
Size: 667 MB

Statistics

Stars: 24
Watchers: 0
Forks: 22
Open Issues: 13
Releases: 9

Topics

data-extraction literature-review llm

Created over 2 years ago · Last pushed 6 months ago

Metadata Files

Readme Contributing Funding License Code of conduct Citation Codeowners


What is Extralit?

Extralit (EXTRAct LITerature) is a data extraction workflow with user-friendly UI, designed for LLM-assisted scientific data extraction and other unstructured document intelligence tasks. It focuses on data accuracy above all else, and further integrates human feedback loops for continuous LLM refinement and collaborative data extraction.

Why Use Extralit?

Precision First: Built for high data accuracy, ensuring reliable results.
Human-in-the-Loop: Seamlessly integrate human annotations to refine LLM outputs and collaborate on data validation.
Flexible & Scalable: Available as a Python SDK, CLI, and Web UI with multiple deployment options to fit your workflow.

Key Features:

Schema-Driven Extraction: Define structured schemas for context-aware, high-accuracy data extraction across scientific domains.
Advanced PDF Processing: AI-powered OCR detects complex table structures in both digital and scanned PDFs.
Built-in Validation: Automatically verify extracted data for accuracy in both the annotation UI and the data pipeline outputs.
User-Friendly Interface: Easily review, edit, and validate data with team-based consensus workflows.
Data Flywheel: Collect human annotations to monitor performance and build fine-tuning datasets for continuous improvement.
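To make the schema-driven extraction and built-in validation ideas above more concrete, here is a minimal sketch in plain Python. It does not use Extralit's API: the `Field` class, the `STUDY_SCHEMA` columns, and `validate_row` are all hypothetical names invented for illustration, and the real schema format may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Field:
    """One column of a hypothetical extraction schema (not Extralit's API)."""
    name: str
    dtype: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # optional value constraint


# Hypothetical schema for a table extracted from a paper (illustrative only).
STUDY_SCHEMA = [
    Field("study_id", str),
    Field("sample_size", int, check=lambda n: n > 0),
    Field("prevalence", float, required=False, check=lambda p: 0.0 <= p <= 1.0),
]


def validate_row(row, schema):
    """Return a list of readable problems; an empty list means the row passes."""
    errors = []
    for field in schema:
        value = row.get(field.name)
        if value is None:
            # Missing values are only a problem for required fields.
            if field.required:
                errors.append(f"{field.name}: missing required value")
            continue
        if not isinstance(value, field.dtype):
            errors.append(
                f"{field.name}: expected {field.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if field.check is not None and not field.check(value):
            errors.append(f"{field.name}: value {value!r} failed its check")
    return errors
```

Under these assumptions, a row extracted by an LLM would be passed through `validate_row` before being accepted, and any returned messages could be shown to an annotator for correction.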

Start extracting smarter with Extralit!

Recent News

May 2025: Extralit selected for Google Summer of Code 2025! We're working on Scientific PDF Data Extraction and Interactive Schema Editor UI projects.
Looking to contribute? Check out our GSoC projects or open issues to get started!

Getting started

Installation

Install the client package

```bash
pip install extralit
```

If you already have a server deployed and login credentials, obtain your API key in the User Settings. You can manage your extraction workspace through the CLI with:

```bash
extralit login --api-url http://
# You will be prompted for an API key to log in to your account
```

Server setup

See https://docs.extralit.ai/latest/getting_started/quickstart/

Project Architecture

Extralit is built on top of Argilla, extending its capabilities with enhanced data extraction, validation, and human-in-the-loop workflows. Its core components are:

Python SDK: Installable with pip install extralit, it interacts with the web server and provides an API to manage data extraction workflows.
FastAPI Server: The backbone of Extralit, handling users, storage, and API interactions. It manages application data using a relational database (PostgreSQL by default).
Web UI: A web application for visualizing and annotating your data and for managing users and teams. It is built with Vue.js and Nuxt.js and is deployed alongside the FastAPI Server within our Docker image.
Vector Database: Stores the records data and performs scalable vector similarity searches and basic document searches. Elasticsearch and AWS OpenSearch are currently supported, and each can be deployed as a separate Docker image.
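As a rough illustration of what a vector similarity search computes, the sketch below ranks a few records by cosine similarity to a query vector. It is a brute-force toy in plain Python, not how Elasticsearch or OpenSearch implement search, and the record IDs and three-dimensional vectors in `RECORDS` are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, records, k=2):
    """Return the k (record_id, score) pairs most similar to the query vector."""
    scored = [(rec_id, cosine_similarity(query, vec)) for rec_id, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical record embeddings; real embeddings have hundreds of dimensions.
RECORDS = {
    "table_1": [1.0, 0.0, 0.0],
    "table_2": [0.9, 0.1, 0.0],
    "figure_3": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], RECORDS, k=2)
```

A dedicated vector database replaces this linear scan with an approximate index so that searches stay fast as the number of records grows.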


Owner

Name: Extralit
Login: extralit
Kind: organization

Repositories: 1
Profile: https://github.com/extralit

Schema-driven data extraction from scientific literature with LLM- and experts-in-the-loop

GitHub Events

Total

Create event: 52
Release event: 2
Issues event: 44
Watch event: 6
Delete event: 40
Issue comment event: 62
Push event: 580
Pull request review comment event: 64
Pull request review event: 54
Pull request event: 79
Fork event: 5

Last Year

Create event: 52
Release event: 2
Issues event: 44
Watch event: 6
Delete event: 40
Issue comment event: 62
Push event: 580
Pull request review comment event: 64
Pull request review event: 54
Pull request event: 79
Fork event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 48
Total pull requests: 76
Average time to close issues: 10 days
Average time to close pull requests: 12 days
Total issue authors: 9
Total pull request authors: 11
Average comments per issue: 0.19
Average comments per pull request: 0.74
Merged pull requests: 38
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 48
Pull requests: 72
Average time to close issues: 10 days
Average time to close pull requests: 10 days
Issue authors: 9
Pull request authors: 11
Average comments per issue: 0.19
Average comments per pull request: 0.75
Merged pull requests: 37
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

JonnyTran (38)
priyankeshh (3)
ArthrowAbstract (1)
LiChiaTing (1)
Nakshatra05 (1)
JTran-IDM (1)
dawn-tran (1)
Akshita-Goel (1)
bitnami-bot (1)

Pull Request Authors

JonnyTran (42)
Copilot (11)
priyankeshh (9)
Akshita-Goel (3)
Ashutoshx7 (2)
nafisatahasin (2)
ArthrowAbstract (2)
Nakshatra05 (2)
nafisa404 (1)
SanjayUG (1)
Ashutosh-KARNX7 (1)

Top Labels

Issue Labels

enhancement (12) refactor (10) bug (6) feature (3) deployment (3) refactoring (3) dependencies (1) ui/ux (1) documentation (1) python (1) packaging (1) monorepo (1) argilla-server (1) extralit (1) help wanted (1)

Pull Request Labels

enhancement (2) bug (1) refactor (1)

Packages

Total packages: 2
Total downloads (PyPI): 330 last month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 24
Total maintainers: 1

pypi.org: extralit

Open-source tool for accurate & fast scientific literature data extraction with LLM and human-in-the-loop.

Documentation: https://extralit.readthedocs.io/
License: Apache 2.0
Latest release: 0.6.1
published 6 months ago

Versions: 14
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 192 Last month

Rankings

Dependent packages count: 10.8%

Average: 35.7%

Dependent repos count: 60.7%

Maintainers (1)

JonnyTran

Last synced: 6 months ago

pypi.org: extralit-server

Open-source tool for accurate & fast scientific literature data extraction with LLM and human-in-the-loop.

Documentation: https://extralit-server.readthedocs.io/
License: Apache-2.0
Latest release: 0.6.1
published 6 months ago

Versions: 10
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 138 Last month

Rankings

Dependent packages count: 10.8%

Average: 35.7%

Dependent repos count: 60.7%

Maintainers (1)

JonnyTran

Last synced: 6 months ago

extralit

Science Score: 26.0%
