https://github.com/adithya-s-k/synthdoc

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: adithya-s-k
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 72.8 MB
Statistics
  • Stars: 6
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

SynthDoc

License: Apache 2.0 · Python 3.9+

A library for generating synthetic documents to train and evaluate document understanding models. SynthDoc supports multiple languages, fonts, and document layouts, so you can build diverse datasets for OCR, document layout analysis, visual question answering, and retrieval.

Key Features

  • 🌐 Multi-language Document Generation: Create documents in various languages using LLMs.
  • VQA Dataset Generation: Produce rich Visual Question Answering datasets with hard negatives for robust model training.
  • 🔄 Document Translation: Translate the text in documents to other languages while preserving the original layout.
  • 🧩 HuggingFace Integration: Outputs datasets directly in HuggingFace Dataset format for seamless integration with your ML pipelines.
  • ⚙️ Flexible Configuration: Easily configure LLM providers (OpenAI, Groq, Anthropic, Ollama) and API keys using a .env file.
  • 🚀 Extensible Workflows: A modular architecture that makes it easy to add new document generation and augmentation workflows.

Available Workflows

1. Raw Document Generation

Generate synthetic documents from scratch using Large Language Models (LLMs) to create diverse content across multiple languages.

Purpose: Create original document content for training data augmentation and model robustness testing.

Process:

  1. An LLM generates contextually appropriate content in the specified language.
  2. The content is rendered into document images with proper formatting.
  3. The output is a standardized HuggingFace dataset.

Output Schema:

  • image: The generated document page as a PIL Image.
  • image_path: Path to the saved image file.
  • page_number: The page number.
  • language: The language of the generated content.
  • prompt: The prompt used for content generation.
  • And other metadata...
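The schema above can be pictured as one record per page. The following is a minimal standard-library sketch, not SynthDoc's own code: `REQUIRED_RAW_FIELDS` and `missing_raw_fields` are hypothetical names, the image is represented by a placeholder object rather than a PIL Image, and the example path is invented for illustration. The real dataset also carries extra metadata fields that are not listed in this README.

```python
# Hypothetical helper: checks a record against the documented raw-document fields.
REQUIRED_RAW_FIELDS = ("image", "image_path", "page_number", "language", "prompt")


def missing_raw_fields(record: dict) -> list:
    """Return the documented field names that are absent from `record`."""
    return [name for name in REQUIRED_RAW_FIELDS if name not in record]


# Example record; `object()` stands in for the PIL Image held in the real dataset.
example_record = {
    "image": object(),
    "image_path": "output/page_0001.png",  # assumed path, for illustration only
    "page_number": 1,
    "language": "en",
    "prompt": "Generate a short technical report about climate change.",
}

print(missing_raw_fields(example_record))      # no documented field is missing
print(missing_raw_fields({"language": "en"}))  # lists the four absent fields
```

A check like this may be useful before feeding generated records into a training pipeline that expects every documented field to be present.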


2. VQA (Visual Question Answering) Generation

Generate question-answer pairs for visual question answering tasks, including hard negatives for training robust retrieval models.

Purpose: Create comprehensive VQA datasets for training and evaluating visual document understanding models.

Process:

  1. General VQA: Generate diverse question-answer pairs about document content, layout, and visual elements.
  2. Hard Negative VQA: Create challenging negative examples that are semantically similar but factually incorrect.
  3. Similarity Scoring: Generate similarity scores for retrieval training.

Output Schema (extends raw document schema):

  • questions: List of generated questions.
  • answers: Corresponding ground truth answers.
  • hard_negatives: Challenging incorrect answers.
  • And other VQA-related metadata...
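One way the VQA fields might be consumed for retrieval training is as (question, positive, negative) triplets. The sketch below assumes that `questions`, `answers`, and `hard_negatives` are parallel lists with one hard negative per question. That alignment is an assumption of this example, not something the schema above guarantees, and `build_triplets` is a hypothetical helper rather than part of SynthDoc.

```python
from typing import Dict, List, Tuple


def build_triplets(record: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Pair each question with its answer and one hard negative.

    Assumes the three lists are aligned index by index; this layout is an
    assumption of the sketch, not a documented property of the dataset.
    """
    questions = record["questions"]
    answers = record["answers"]
    negatives = record["hard_negatives"]
    if not (len(questions) == len(answers) == len(negatives)):
        raise ValueError("questions, answers and hard_negatives must have equal length")
    return list(zip(questions, answers, negatives))


# Invented sample values, for illustration only.
sample = {
    "questions": ["What is the report about?"],
    "answers": ["Climate change"],
    "hard_negatives": ["Climate policy funding"],
}
print(build_triplets(sample))
```

If the actual dataset stores several hard negatives per question, the pairing step would need to iterate over a nested list instead.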


3. Document Translation

Translate the text within a document image to one or more target languages while preserving the visual layout.

Purpose: Adapt existing document datasets for multi-lingual model training.

Process:

  1. Layout Detection: A YOLO-based model detects text blocks in the source image.
  2. OCR: Text is extracted from each detected block.
  3. Translation: The extracted text is translated to the target language(s).
  4. Rendering: The translated text is rendered back onto a copy of the original image in the same location, using appropriate fonts.
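The four stages above compose naturally as a small pipeline. The sketch below shows only the control flow for steps 1 to 3, with each stage passed in as a plain function so it runs without a detector, OCR engine, or translator. All names here (`TranslatedBlock`, `translate_page`, and the stage parameters) are hypothetical and do not reflect SynthDoc's internal API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TranslatedBlock:
    box: Box
    source_text: str
    translated_text: str


def translate_page(
    image: object,
    detect_blocks: Callable[[object], List[Box]],
    read_text: Callable[[object, Box], str],
    translate: Callable[[str, str], str],
    target_language: str,
) -> List[TranslatedBlock]:
    """Run detection, OCR and translation for one page.

    The returned blocks keep their original boxes so that a rendering step
    could draw each translation in the same location.
    """
    blocks = []
    for box in detect_blocks(image):       # step 1: layout detection
        text = read_text(image, box)       # step 2: OCR on the block
        if not text.strip():
            continue                       # skip blocks with no readable text
        translated = translate(text, target_language)  # step 3: translation
        blocks.append(TranslatedBlock(box, text, translated))
    return blocks


# Stub stages so the sketch runs on its own.
demo = translate_page(
    image=None,
    detect_blocks=lambda img: [(0, 0, 100, 20), (0, 30, 100, 50)],
    read_text=lambda img, box: "Hello" if box[1] == 0 else "  ",
    translate=lambda text, lang: f"[{lang}] {text}",
    target_language="es",
)
print(demo)
```

Passing the stages in as callables keeps the ordering explicit and makes it easy to swap in a real detector or translation backend.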

Output: A HuggingFace dataset containing the translated document images.

Installation

Basic Installation

    pip install git+https://github.com/adithya-s-k/SynthDoc.git

With LLM Support (Recommended)

To enable content generation with LLMs, install with the llm extra:

    pip install "git+https://github.com/adithya-s-k/SynthDoc.git[llm]"

For Development

    git clone https://github.com/adithya-s-k/SynthDoc.git
    cd SynthDoc
    pip install -e ".[llm]"

Quick Start

1. Configure Environment (Recommended)

SynthDoc uses LiteLLM to connect to 100+ LLM providers.

  1. Copy the environment template:

         cp env.template .env

  2. Edit your .env file and add your API keys. The DEFAULT_LLM_MODEL will be used for generation.

    ```env
    # .env file
    OPENAI_API_KEY=your_openai_key
    GROQ_API_KEY=your_groq_key
    DEFAULT_LLM_MODEL=gpt-4o-mini
    ```
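To make the .env format concrete, here is a minimal standard-library sketch of how KEY=VALUE lines like those above could be read. It is an illustration only: python-dotenv appears in the project's dependencies and is the more robust choice for real use, and `parse_env_text` is a hypothetical helper, not a SynthDoc function.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder values, mirroring the template style used above.
sample_env = """
# .env file
OPENAI_API_KEY=your_openai_key
DEFAULT_LLM_MODEL=gpt-4o-mini
"""
config = parse_env_text(sample_env)
print(config["DEFAULT_LLM_MODEL"])
```

This simplified parser does not handle multi-line values, `export` prefixes, or inline comments.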

2. Use SynthDoc

Create a main.py file:

```python
from synthdoc import SynthDoc

# SynthDoc automatically loads from .env file
# It will prompt you to specify an output directory on first run.
synth = SynthDoc()

# --- Workflow 1: Generate Raw Documents ---
print("Generating raw documents...")
raw_dataset = synth.generate_raw_docs(
    language="en",
    num_pages=2,
    prompt="Generate a short technical report about climate change.",
)
print(f"Generated {len(raw_dataset)} raw documents.")
print(raw_dataset[0])

# --- Workflow 2: Generate VQA ---
print("\nGenerating VQA dataset...")
vqa_dataset = synth.generate_vqa(
    source_documents=raw_dataset,
    num_questions_per_doc=3,
)
print(f"Generated {len(vqa_dataset)} VQA samples.")
print(vqa_dataset[0]['questions'])

# --- Workflow 3: Translate Documents ---
print("\nTranslating documents...")
# This workflow requires a local YOLO model, which will be auto-downloaded.
translated_dataset = synth.translate_documents(
    input_dataset=raw_dataset,
    target_languages=["es", "fr"],  # Translate to Spanish and French
)
print(f"Translated documents into {len(translated_dataset)} language versions.")
print(translated_dataset)
```

Manual Configuration (Not Recommended)

You can override the .env configuration by passing parameters directly.

```python
# Override model and API key
synth_manual = SynthDoc(llm_model="groq/llama-3-8b-8192", api_key="your-groq-key")

# Use local Ollama models (no API key needed)
synth_ollama = SynthDoc(llm_model="ollama/llama2")
```
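The model strings above follow the LiteLLM convention of an optional provider prefix separated from the model name by a slash. The sketch below shows how such a string splits into its two parts. `split_model_string` is a hypothetical helper written for illustration; how SynthDoc itself interprets the string is not shown in this README.

```python
from typing import Optional, Tuple


def split_model_string(model: str) -> Tuple[Optional[str], str]:
    """Split "provider/model" into (provider, model); provider is None if absent."""
    provider, sep, name = model.partition("/")
    if not sep:
        return None, model
    return provider, name


print(split_model_string("groq/llama-3-8b-8192"))
print(split_model_string("ollama/llama2"))
print(split_model_string("gpt-4o-mini"))
```

Only the first slash is treated as the separator, so a model name that itself contains slashes stays intact in the second element.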

Roadmap

  • [x] Core document generation pipeline
  • [x] Multi-language content generation via LLMs
  • [x] VQA generation module
  • [x] Document translation workflow
  • [x] Layout Augmentation: Programmatically alter document layouts.
  • [ ] PDF Augmentation: Recombine elements from a corpus of PDFs to create new documents.
  • [ ] Handwriting Synthesis: Generate documents with realistic handwritten fonts.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Owner

  • Name: Adithya S K
  • Login: adithya-s-k
  • Kind: user
  • Location: Indian
  • Company: Cognitivelab

Exploring Generative AI • Google DSC Lead'23 • Cloud & Full Stack Engineer • Drones & IoT • FOSS Contributor

GitHub Events

Total
  • Watch event: 3
  • Delete event: 3
  • Issue comment event: 1
  • Push event: 4
  • Pull request event: 6
  • Pull request review event: 3
  • Create event: 1
Last Year
  • Watch event: 3
  • Delete event: 3
  • Issue comment event: 1
  • Push event: 4
  • Pull request event: 6
  • Pull request review event: 3
  • Create event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 15
  • Average time to close issues: N/A
  • Average time to close pull requests: 14 days
  • Total issue authors: 0
  • Total pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.53
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 15
  • Average time to close issues: N/A
  • Average time to close pull requests: 14 days
  • Issue authors: 0
  • Pull request authors: 3
  • Average comments per issue: 0
  • Average comments per pull request: 0.53
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • laxmanclo (8)
  • adithya-s-k (4)
  • samarth777 (3)

Dependencies

poetry.lock pypi
  • 117 dependencies
pyproject.toml pypi
  • beautifulsoup4 >=4.12.2
  • datasets >=2.14.0
  • deep-translator >=1.11.4
  • doclayout-yolo >=0.0.4
  • fonttools >=4.43.0
  • google-genai >=1.25.0
  • huggingface_hub >=0.19.0
  • litellm >=1.74.2
  • markdown >=3.5.1
  • numpy >=1.24.0
  • opencv-python >=4.8.0
  • pdfplumber >=0.10.0
  • pillow >=10.0.0
  • pydantic >=2.5.0
  • pymupdf >=1.26.3
  • pypdf >=3.17.0
  • pytesseract >=0.3.13
  • python-dotenv >=1.0.0
  • reportlab >=4.0.4
  • requests >=2.31.0
  • rich >=13.7.0
  • scikit-image >=0.21.0
  • tqdm >=4.66.0
  • transformers >=4.35.0
  • typer >=0.9.0
uv.lock pypi
  • 162 dependencies