extralit

Fast and accurate systemic data extraction with LLM assistance

https://github.com/extralit/extralit

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

data-extraction literature-review llm
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Extralit
  • License: apache-2.0
  • Language: Python
  • Default Branch: develop
  • Homepage: https://extralit.ai/
  • Size: 667 MB
Statistics
  • Stars: 24
  • Watchers: 0
  • Forks: 22
  • Open Issues: 13
  • Releases: 9
Topics
data-extraction literature-review llm
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing Funding License Code of conduct Citation Codeowners

README.md

Extralit

Documentation | Quickstart | Architecture

What is Extralit?

Extralit (EXTRAct LITerature) is a data extraction workflow with a user-friendly UI, designed for LLM-assisted scientific data extraction and other unstructured document intelligence tasks. It prioritizes data accuracy above all else and integrates human feedback loops for continuous LLM refinement and collaborative data extraction.

Why Use Extralit?

  • Precision First: Built for high data accuracy, ensuring reliable results.
  • Human-in-the-Loop: Seamlessly integrate human annotations to refine LLM outputs and collaborate on data validation.
  • Flexible & Scalable: Available as a Python SDK, CLI, and Web UI with multiple deployment options to fit your workflow.

Key Features:

  • Schema-Driven Extraction: Define structured schemas for context-aware, high-accuracy data extraction across scientific domains.
  • Advanced PDF Processing: AI-powered OCR detects complex table structures in both digital and scanned PDFs.
  • Built-in Validation: Automatically verify extracted data for accuracy in both the annotation UI and the data pipeline outputs.
  • User-Friendly Interface: Easily review, edit, and validate data with team-based consensus workflows.
  • Data Flywheel: Collect human annotations to monitor performance and build fine-tuning datasets for continuous improvement.
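
To make the schema-driven extraction and built-in validation ideas above more concrete, here is a minimal sketch in plain Python. It does not use the Extralit SDK: the `Field` class, the `validate` function, and the `STUDY_SCHEMA` fields are hypothetical names invented for illustration, and the real schema format may look quite different.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Field:
    """One column of a hypothetical extraction schema (not an Extralit class)."""
    name: str
    dtype: type
    required: bool = True
    check: Optional[Callable[[Any], bool]] = None  # optional value-level rule


def validate(record: Dict[str, Any], schema: List[Field]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors: List[str] = []
    for field in schema:
        value = record.get(field.name)
        if value is None:
            if field.required:
                errors.append(f"{field.name}: missing required value")
            continue
        if not isinstance(value, field.dtype):
            errors.append(
                f"{field.name}: expected {field.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if field.check is not None and not field.check(value):
            errors.append(f"{field.name}: failed value check")
    return errors


# Hypothetical schema for a table of study results extracted from a paper.
STUDY_SCHEMA = [
    Field("study_id", str),
    Field("sample_size", int, check=lambda n: n > 0),
    Field("mortality_rate", float, check=lambda r: 0.0 <= r <= 1.0),
    Field("notes", str, required=False),
]

good = {"study_id": "S1", "sample_size": 120, "mortality_rate": 0.35}
bad = {"study_id": "S2", "sample_size": -5, "mortality_rate": "high"}

print(validate(good, STUDY_SCHEMA))  # no problems reported
print(validate(bad, STUDY_SCHEMA))   # one problem per failing field
```

Returning a list of problems rather than raising on the first failure suits a review workflow, since an annotator can see every issue in a record at once.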

Start extracting smarter with Extralit!

Recent News

  • May 2025: Extralit selected for Google Summer of Code 2025! We're working on Scientific PDF Data Extraction and Interactive Schema Editor UI projects.
  • Looking to contribute? Check out our GSoC projects or open issues to get started!

Getting started

Installation

Install the client package

```bash
pip install extralit
```

If you already have a server deployed and login credentials, obtain your API key in the User Settings. You can manage your extraction workspace through the CLI with:

```bash
extralit login --api-url http://

# You will be prompted for an API key to log in to your account
```

Server setup

See https://docs.extralit.ai/latest/getting_started/quickstart/

Project Architecture

Extralit is built on top of Argilla, extending its capabilities with enhanced data extraction, validation, and human-in-the-loop workflows. Its core components are:

  • Python SDK: Installable with pip install extralit, it interacts with the web server and provides an API for managing data extraction workflows.
  • FastAPI Server: The backbone of Extralit, handling users, storage, and API interactions. It manages application data using a relational database (PostgreSQL by default).
  • Web UI: A web application to visualize and annotate your data, users and teams. It is built with Vue.js and Nuxt.js and is directly deployed alongside the FastAPI Server within our Docker image.
  • Vector Database: A vector database that stores record data and performs scalable vector similarity searches and basic document searches. Elasticsearch and AWS OpenSearch are currently supported, and each can be deployed as a separate Docker image.
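
As a rough illustration of what a vector similarity search over records involves, the sketch below ranks a few in-memory vectors by cosine similarity to a query. It is a brute-force stand-in written for this explanation, not how Extralit or its search backend is implemented. The record IDs and vectors are made up, and a production search backend would typically use a dedicated index instead of a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(rid, cosine_similarity(query, vec)) for rid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical record embeddings keyed by record ID.
index = {
    "rec-1": [1.0, 0.0, 0.0],
    "rec-2": [0.9, 0.1, 0.0],
    "rec-3": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
print([rid for rid, _ in results])
```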

Owner

  • Name: Extralit
  • Login: extralit
  • Kind: organization

Schema-driven data extraction from scientific literature with LLM- and experts-in-the-loop

GitHub Events

Total
  • Create event: 52
  • Release event: 2
  • Issues event: 44
  • Watch event: 6
  • Delete event: 40
  • Issue comment event: 62
  • Push event: 580
  • Pull request review comment event: 64
  • Pull request review event: 54
  • Pull request event: 79
  • Fork event: 5
Last Year
  • Create event: 52
  • Release event: 2
  • Issues event: 44
  • Watch event: 6
  • Delete event: 40
  • Issue comment event: 62
  • Push event: 580
  • Pull request review comment event: 64
  • Pull request review event: 54
  • Pull request event: 79
  • Fork event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 48
  • Total pull requests: 76
  • Average time to close issues: 10 days
  • Average time to close pull requests: 12 days
  • Total issue authors: 9
  • Total pull request authors: 11
  • Average comments per issue: 0.19
  • Average comments per pull request: 0.74
  • Merged pull requests: 38
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 48
  • Pull requests: 72
  • Average time to close issues: 10 days
  • Average time to close pull requests: 10 days
  • Issue authors: 9
  • Pull request authors: 11
  • Average comments per issue: 0.19
  • Average comments per pull request: 0.75
  • Merged pull requests: 37
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JonnyTran (38)
  • priyankeshh (3)
  • ArthrowAbstract (1)
  • LiChiaTing (1)
  • Nakshatra05 (1)
  • JTran-IDM (1)
  • dawn-tran (1)
  • Akshita-Goel (1)
  • bitnami-bot (1)
Pull Request Authors
  • JonnyTran (42)
  • Copilot (11)
  • priyankeshh (9)
  • Akshita-Goel (3)
  • Ashutoshx7 (2)
  • nafisatahasin (2)
  • ArthrowAbstract (2)
  • Nakshatra05 (2)
  • nafisa404 (1)
  • SanjayUG (1)
  • Ashutosh-KARNX7 (1)
Top Labels
Issue Labels
enhancement (12) refactor (10) bug (6) feature (3) deployment (3) refactoring (3) dependencies (1) ui/ux (1) documentation (1) python (1) packaging (1) monorepo (1) argilla-server (1) extralit (1) help wanted (1)
Pull Request Labels
enhancement (2) bug (1) refactor (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 330 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 24
  • Total maintainers: 1
pypi.org: extralit

Open-source tool for accurate & fast scientific literature data extraction with LLM and human-in-the-loop.

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 192 Last month
Rankings
Dependent packages count: 10.8%
Average: 35.7%
Dependent repos count: 60.7%
Maintainers (1)
Last synced: 6 months ago
pypi.org: extralit-server

Open-source tool for accurate & fast scientific literature data extraction with LLM and human-in-the-loop.

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 138 Last month
Rankings
Dependent packages count: 10.8%
Average: 35.7%
Dependent repos count: 60.7%
Maintainers (1)
Last synced: 6 months ago