https://github.com/adithya-s-k/synthdoc
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: adithya-s-k
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 72.8 MB
Statistics
- Stars: 6
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
SynthDoc
A comprehensive library for generating synthetic documents designed for training and evaluating models in document understanding tasks. SynthDoc supports multiple languages, fonts, and document layouts to create diverse training datasets for OCR models, document layout analysis, visual question answering, and retrieval systems.
Key Features
- 🌐 Multi-language Document Generation: Create documents in various languages using LLMs.
- ❓ VQA Dataset Generation: Produce rich Visual Question Answering datasets with hard negatives for robust model training.
- 🔄 Document Translation: Translate the text in documents to other languages while preserving the original layout.
- 🧩 HuggingFace Integration: Outputs datasets directly in HuggingFace `Dataset` format for seamless integration with your ML pipelines.
- ⚙️ Flexible Configuration: Easily configure LLM providers (OpenAI, Groq, Anthropic, Ollama) and API keys using a `.env` file.
- 🚀 Extensible Workflows: A modular architecture that makes it easy to add new document generation and augmentation workflows.
Available Workflows
1. Raw Document Generation
Generate synthetic documents from scratch using Large Language Models (LLMs) to create diverse content across multiple languages.
Purpose: Create original document content for training data augmentation and model robustness testing.
Process:
1. An LLM generates contextually appropriate content in the specified language.
2. The content is rendered into document images with proper formatting.
3. The output is a standardized HuggingFace dataset.
Output Schema:
- image: The generated document page as a PIL Image.
- image_path: Path to the saved image file.
- page_number: The page number.
- language: The language of the generated content.
- prompt: The prompt used for content generation.
- And other metadata...
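As a rough illustration of the schema above, one row of the raw-document dataset could be modeled as follows. This is a sketch, not SynthDoc's own types: the field names come from the list above, while `RawDocRecord`, `validate_record`, and the `extra` bucket for the remaining metadata are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Field names taken from the Output Schema list; everything else is hypothetical.
REQUIRED_FIELDS = ("image", "image_path", "page_number", "language", "prompt")


@dataclass
class RawDocRecord:
    image: Any            # a PIL Image in the real dataset; left untyped here
    image_path: str
    page_number: int
    language: str
    prompt: str
    extra: Dict[str, Any] = field(default_factory=dict)  # "other metadata"


def validate_record(row: Dict[str, Any]) -> RawDocRecord:
    """Check that a dataset row has the documented fields and wrap it."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    # Anything not in the documented list is kept as undocumented metadata.
    extra = {k: v for k, v in row.items() if k not in REQUIRED_FIELDS}
    return RawDocRecord(**{name: row[name] for name in REQUIRED_FIELDS}, extra=extra)
```

A check like this can be useful before feeding generated rows into a training pipeline, since the schema lists further metadata fields without naming them.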
2. VQA (Visual Question Answering) Generation
Generate question-answer pairs for visual question answering tasks, including hard negatives for training robust retrieval models.
Purpose: Create comprehensive VQA datasets for training and evaluating visual document understanding models.
Process:
1. General VQA: Generate diverse question-answer pairs about document content, layout, and visual elements.
2. Hard Negative VQA: Create challenging negative examples that are semantically similar but factually incorrect.
3. Similarity Scoring: Generate similarity scores for retrieval training.
Output Schema (extends raw document schema):
- questions: List of generated questions.
- answers: Corresponding ground truth answers.
- hard_negatives: Challenging incorrect answers.
- And other VQA-related metadata...
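To show how these fields might feed retrieval training, the sketch below flattens one VQA row into (question, correct answer, hard negative) triplets. The README does not specify how `hard_negatives` is nested, so the layout used here (one list of incorrect answers per question, parallel to `questions` and `answers`) is an assumption, and `iter_triplets` is a hypothetical helper rather than part of SynthDoc.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_triplets(row: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (question, correct_answer, hard_negative) triplets from one row.

    Assumes `questions` and `answers` are parallel lists and that
    `hard_negatives` holds one list of incorrect answers per question.
    """
    questions: List[str] = row["questions"]
    answers: List[str] = row["answers"]
    negatives: List[List[str]] = row["hard_negatives"]
    for question, answer, wrong_answers in zip(questions, answers, negatives):
        for wrong in wrong_answers:
            yield question, answer, wrong
```

If the real dataset stores hard negatives as a flat list instead, only the inner loop would need to change.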
3. Document Translation
Translate the text within a document image to one or more target languages while preserving the visual layout.
Purpose: Adapt existing document datasets for multi-lingual model training.
Process:
1. Layout Detection: A YOLO-based model detects text blocks in the source image.
2. OCR: Text is extracted from each detected block.
3. Translation: The extracted text is translated to the target language(s).
4. Rendering: The translated text is rendered back onto a copy of the original image in the same location, using appropriate fonts.
Output: A HuggingFace dataset containing the translated document images.
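The four stages above can be sketched as a small pipeline with pluggable stage functions. This only illustrates the data flow and is not SynthDoc's implementation: `translate_page`, `TranslatedBlock`, and the callable signatures are hypothetical, and the rendering step is left to whatever consumes the returned blocks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TranslatedBlock:
    box: Box
    source_text: str
    translated_text: str


def translate_page(
    image: object,
    detect_blocks: Callable[[object], List[Box]],
    read_text: Callable[[object, Box], str],
    translate: Callable[[str, str], str],
    target_language: str,
) -> List[TranslatedBlock]:
    """Run detection, OCR and translation; rendering would consume the result."""
    blocks: List[TranslatedBlock] = []
    for box in detect_blocks(image):
        text = read_text(image, box).strip()
        if not text:
            continue  # skip blocks where OCR found no text
        blocks.append(TranslatedBlock(box, text, translate(text, target_language)))
    return blocks
```

Keeping each block's bounding box next to its translation is what lets a renderer draw the new text in the same location as the original. The dependency list includes `doclayout-yolo`, `pytesseract`, and `deep-translator`, which could plausibly back the three callables, but that mapping is a guess.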
Installation
Basic Installation
```bash
pip install git+https://github.com/adithya-s-k/SynthDoc.git
```
With LLM Support (Recommended)
To enable content generation with LLMs, install with the llm extra:
```bash
pip install "git+https://github.com/adithya-s-k/SynthDoc.git[llm]"
```
For Development
```bash
git clone https://github.com/adithya-s-k/SynthDoc.git
cd SynthDoc
# Quote the extra so shells such as zsh do not treat the brackets as a glob
pip install -e ".[llm]"
```
Quick Start
1. Configure Environment (Recommended)
SynthDoc uses LiteLLM to connect to 100+ LLM providers.
Copy the environment template:
```bash
cp env.template .env
```
Edit your `.env` file and add your API keys. The `DEFAULT_LLM_MODEL` will be used for generation.
```env
# .env file
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key
DEFAULT_LLM_MODEL=gpt-4o-mini
```
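To make the configuration step concrete, here is a minimal standard-library sketch of reading such a `.env` file and resolving the default model. It is a stand-in for illustration only: the dependency list includes `python-dotenv`, which presumably does the real loading, and `read_env_file`, `resolve_model`, and the precedence order (process environment first, then the file, then a fallback) are assumptions rather than SynthDoc's documented behavior.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_model(file_values: Dict[str, str], fallback: str = "gpt-4o-mini") -> str:
    """Prefer a real environment variable, then the .env value, then a fallback."""
    return os.environ.get("DEFAULT_LLM_MODEL") or file_values.get("DEFAULT_LLM_MODEL", fallback)
```

For example, `resolve_model(read_env_file())` would pick up `DEFAULT_LLM_MODEL` from the file created above unless the variable is already set in the shell.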
2. Use SynthDoc
Create a main.py file:
```python
from synthdoc import SynthDoc

# SynthDoc automatically loads from .env file
# It will prompt you to specify an output directory on first run.
synth = SynthDoc()

# --- Workflow 1: Generate Raw Documents ---
print("Generating raw documents...")
raw_dataset = synth.generate_raw_docs(
    language="en",
    num_pages=2,
    prompt="Generate a short technical report about climate change."
)
print(f"Generated {len(raw_dataset)} raw documents.")
print(raw_dataset[0])

# --- Workflow 2: Generate VQA ---
print("\nGenerating VQA dataset...")
vqa_dataset = synth.generate_vqa(
    source_documents=raw_dataset,
    num_questions_per_doc=3
)
print(f"Generated {len(vqa_dataset)} VQA samples.")
print(vqa_dataset[0]['questions'])

# --- Workflow 3: Translate Documents ---
print("\nTranslating documents...")
# This workflow requires a local YOLO model, which will be auto-downloaded.
translated_dataset = synth.translate_documents(
    input_dataset=raw_dataset,
    target_languages=["es", "fr"]  # Translate to Spanish and French
)
print(f"Translated documents into {len(translated_dataset)} language versions.")
print(translated_dataset)
```
Manual Configuration (Not Recommended)
You can override the .env configuration by passing parameters directly.
```python
# Override model and API key
synth_manual = SynthDoc(llm_model="groq/llama-3-8b-8192", api_key="your-groq-key")

# Use local Ollama models (no API key needed)
synth_ollama = SynthDoc(llm_model="ollama/llama2")
```
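The model strings above follow a `provider/model` pattern. As a small sketch of how such a string could be inspected before constructing a client, the helpers below split off the provider prefix and flag whether an API key is likely needed. Both `split_model_name` and `needs_api_key` are hypothetical, and the rule that only `ollama/` models run without a key is inferred from the comment in the example, not from SynthDoc or LiteLLM documentation.

```python
from typing import Optional, Tuple


def split_model_name(model: str) -> Tuple[Optional[str], str]:
    """Split "provider/model" into its parts; provider is None when no prefix."""
    provider, sep, name = model.partition("/")
    if not sep:
        return None, model
    return provider, name


def needs_api_key(model: str) -> bool:
    """Per the README, local Ollama models need no API key; assume others do."""
    provider, _ = split_model_name(model)
    return provider != "ollama"
```

A check like `needs_api_key(model)` could be used to fail early with a clear message when a hosted model is selected but no key is configured.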
Roadmap
- [x] Core document generation pipeline
- [x] Multi-language content generation via LLMs
- [x] VQA generation module
- [x] Document translation workflow
- [x] Layout Augmentation: Programmatically alter document layouts.
- [ ] PDF Augmentation: Recombine elements from a corpus of PDFs to create new documents.
- [ ] Handwriting Synthesis: Generate documents with realistic handwritten fonts.
Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Owner
- Name: Adithya S K
- Login: adithya-s-k
- Kind: user
- Location: Indian
- Company: Cognitivelab
- Website: https://adithyask.com/
- Twitter: adithya_s_k
- Repositories: 60
- Profile: https://github.com/adithya-s-k
Exploring Generative AI • Google DSC Lead'23 • Cloud & Full Stack Engineer • Drones & IoT • FOSS Contributor
GitHub Events
Total
- Watch event: 3
- Delete event: 3
- Issue comment event: 1
- Push event: 4
- Pull request event: 6
- Pull request review event: 3
- Create event: 1
Last Year
- Watch event: 3
- Delete event: 3
- Issue comment event: 1
- Push event: 4
- Pull request event: 6
- Pull request review event: 3
- Create event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 15
- Average time to close issues: N/A
- Average time to close pull requests: 14 days
- Total issue authors: 0
- Total pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.53
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 15
- Average time to close issues: N/A
- Average time to close pull requests: 14 days
- Issue authors: 0
- Pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 0.53
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- laxmanclo (8)
- adithya-s-k (4)
- samarth777 (3)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 117 dependencies
- beautifulsoup4 >=4.12.2
- datasets >=2.14.0
- deep-translator >=1.11.4
- doclayout-yolo >=0.0.4
- fonttools >=4.43.0
- google-genai >=1.25.0
- huggingface_hub >=0.19.0
- litellm >=1.74.2
- markdown >=3.5.1
- numpy >=1.24.0
- opencv-python >=4.8.0
- pdfplumber >=0.10.0
- pillow >=10.0.0
- pydantic >=2.5.0
- pymupdf >=1.26.3
- pypdf >=3.17.0
- pytesseract >=0.3.13
- python-dotenv >=1.0.0
- reportlab >=4.0.4
- requests >=2.31.0
- rich >=13.7.0
- scikit-image >=0.21.0
- tqdm >=4.66.0
- transformers >=4.35.0
- typer >=0.9.0
- 162 dependencies