synthetic-paper-generator
generating synthetic scientific articles in LaTeX using PyLaTeX, rendering them to PDF, and testing OCR/parsing tools in a controlled way.
Science Score: 52.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: Organization ucl-arc has institutional domain (www.ucl.ac.uk)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: UCL-ARC
- License: other
- Language: Python
- Default Branch: main
- Size: 924 KB
Statistics
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Synthetic Scientific Paper Generator
This tool generates LaTeX documents using PyLaTeX with configurable structures, renders them as PDFs, and saves the original structure in JSON format for OCR comparison.
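The generate-then-compare idea can be illustrated with a minimal sketch. It is not the project's code: it builds the LaTeX source with plain string formatting instead of PyLaTeX so that it runs without extra packages, and the `paper` dictionary layout and the `to_latex` helper are assumptions made for illustration only.

```python
import json

# Hypothetical paper structure; the real generator's JSON schema may differ.
paper = {
    "title": "A Synthetic Study",
    "sections": [
        {"heading": "Introduction", "text": "This paper is entirely synthetic."},
        {"heading": "Methods", "text": "Content was produced from a fixed structure."},
    ],
}


def to_latex(structure: dict) -> str:
    """Render the structure as a small article-class LaTeX document.

    No escaping of LaTeX special characters is done here, so the input
    text is assumed to be free of characters such as % or &.
    """
    lines = [
        r"\documentclass{article}",
        r"\title{" + structure["title"] + "}",
        r"\begin{document}",
        r"\maketitle",
    ]
    for section in structure["sections"]:
        lines.append(r"\section{" + section["heading"] + "}")
        lines.append(section["text"])
    lines.append(r"\end{document}")
    return "\n".join(lines)


# The LaTeX source would be compiled to PDF by a LaTeX engine.
latex_source = to_latex(paper)
# The same structure, serialised as JSON, serves as ground truth for OCR comparison.
ground_truth = json.dumps(paper, indent=2)
```

Because the PDF and the JSON come from the same structure, any text an OCR tool later extracts from the PDF can be checked against known content.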
Quick Start
System Requirements
- Python 3.11 or higher
- Tesseract OCR (for OCR capabilities)
- Poppler (for PDF processing)
Installing System Dependencies
On macOS:
bash
brew install tesseract poppler
On Ubuntu/Debian:
bash
sudo apt-get install tesseract-ocr poppler-utils
Install the package in development mode:
bash
pip install -e .
Run the generator (from project root directory):
By default, the generator uses the Faker library to create synthetic content (no LLM calls):
bash
python -m src.synthetic_paper_generator.generate
To use an LLM for content generation (requires API keys and configuration):
bash
python -m src.synthetic_paper_generator.generate --use-llm true
- The --use-llm flag controls whether to use an LLM for content generation.
- Default: false (uses Faker for abstracts and introductions)
- Set to true to use your configured LLM provider (e.g., Azure OpenAI).
Run the OCR pipeline (from project root directory):
bash
python -m src.synthetic_paper_generator.ocr_pipeline
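A flag that takes an explicit true or false value, like --use-llm above, could be parsed along the following lines. This is a sketch using the standard-library `argparse` module; how the project actually parses the flag is not shown in the README, and `str_to_bool` and `build_parser` are hypothetical helper names.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'true'/'false' style strings to a bool for argparse."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single --use-llm option (default: false)."""
    parser = argparse.ArgumentParser(description="Generate synthetic papers (sketch).")
    parser.add_argument(
        "--use-llm",
        type=str_to_bool,
        default=False,
        help="use an LLM instead of Faker for content generation",
    )
    return parser


# Equivalent to passing `--use-llm true` on the command line.
args = build_parser().parse_args(["--use-llm", "true"])
print(args.use_llm)  # True
```

Taking the value as a string rather than using a bare switch matches the `--use-llm true` form shown above, and rejecting unknown values avoids silently treating a typo as false.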
Output Structure
The tool generates the following directory structure under the output folder:
output/
├── pdf/                   # Generated PDF files
│   └── paper_*.pdf        # Generated scientific papers
├── json/                  # Original structure in JSON format
│   └── paper_*.json       # JSON files containing paper structure
└── ocr_txt/               # OCR results from different engines
    ├── tesseract/         # Tesseract OCR results
    │   └── paper_*.txt
    ├── marker/            # Marker PDF parser results
    │   └── paper_*.txt
    └── docling/           # Docling OCR results
        └── paper_*.txt
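Given the layout above, the files belonging to one paper can be gathered by matching file stems. The following is a minimal sketch using only `pathlib`; `collect_outputs` is a hypothetical helper, not part of the package, and it assumes that the JSON and OCR text files share the PDF's base name.

```python
from pathlib import Path


def collect_outputs(output_dir: Path) -> dict[str, dict[str, Path]]:
    """Map each paper stem to the files found for it under output_dir."""
    papers: dict[str, dict[str, Path]] = {}
    ocr_root = output_dir / "ocr_txt"
    for pdf in sorted((output_dir / "pdf").glob("paper_*.pdf")):
        entry: dict[str, Path] = {"pdf": pdf}
        # Ground-truth structure saved alongside the PDF.
        json_path = output_dir / "json" / f"{pdf.stem}.json"
        if json_path.exists():
            entry["json"] = json_path
        # One text file per OCR engine, keyed by the engine's directory name.
        if ocr_root.is_dir():
            for engine_dir in sorted(p for p in ocr_root.iterdir() if p.is_dir()):
                txt_path = engine_dir / f"{pdf.stem}.txt"
                if txt_path.exists():
                    entry[engine_dir.name] = txt_path
        papers[pdf.stem] = entry
    return papers
```

Calling `collect_outputs(Path("output"))` returns one entry per generated paper, which makes it easy to spot papers that an engine has not processed yet.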
Understanding the Output
PDF Files (output/pdf/):
- Generated scientific papers in PDF format
- Each file is named paper_<timestamp>.pdf
JSON Structure (output/json/):
- Contains the original structure of each generated paper
- Useful for comparing OCR results with ground truth
- Each file corresponds to a PDF file with the same name
OCR Results (output/ocr_txt/):
- Results from different OCR engines
- Each engine has its own subdirectory
- Text files contain extracted content from PDFs
- Useful for comparing OCR accuracy across different engines
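One simple way to compare an engine's text output with the JSON ground truth is a character-level similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library and makes no assumption about the JSON schema: it collects every string value it finds. The function names are hypothetical, and this is not necessarily the metric the project uses.

```python
import json
from difflib import SequenceMatcher


def flatten_text(node) -> list[str]:
    """Collect every string value in a nested JSON structure, in order."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [s for value in node.values() for s in flatten_text(value)]
    if isinstance(node, list):
        return [s for item in node for s in flatten_text(item)]
    return []  # numbers, booleans and null carry no text


def normalise(text: str) -> str:
    """Lower-case the text and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def similarity(ground_truth, ocr_text: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalised texts are identical."""
    expected = normalise(" ".join(flatten_text(ground_truth)))
    observed = normalise(ocr_text)
    return SequenceMatcher(None, expected, observed).ratio()


# Illustrative data only; real inputs would be read from output/json/ and output/ocr_txt/.
truth = json.loads('{"title": "A Study", "sections": [{"text": "Hello world"}]}')
score = similarity(truth, "A  Study\nHello world")
```

Normalising case and whitespace before comparing keeps line-wrapping differences between engines from dominating the score. Computing the same ratio for each engine's output gives a rough ranking.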
Configuration
You can customize the document generation by modifying config.yaml. The configuration file allows you to:
- Adjust document structure
- Modify content generation parameters
- Change output formats and locations
About
Project Team
- Sagar Uprety (s.uprety@ucl.ac.uk)
- Tim Repke (tim.repke@pik-potsdam.de)
Building Documentation
The MkDocs HTML documentation can be built locally by running
sh
tox -e docs
from the root of the repository. The built documentation will be written to site.
Alternatively, to build and preview the documentation locally in a Python environment with the optional docs dependencies installed, run
sh
mkdocs serve
Roadmap
- [x] Initial Research
- [ ] Minimum viable product <-- You are Here
- [ ] Alpha Release
- [ ] Feature-Complete Release
Owner
- Name: UCL Advanced Research Computing Centre
- Login: UCL-ARC
- Kind: organization
- Location: United Kingdom
- Website: https://www.ucl.ac.uk/arc
- Twitter: ucl_arc
- Repositories: 9
- Profile: https://github.com/UCL-ARC
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
authors:
- family-names: "Uprety"
given-names: "Sagar"
email: "s.uprety@ucl.ac.uk"
repository-code: "https://github.com/UCL-ARC/synthetic-paper-generator"
title: "synthetic_paper_generator: generating synthetic scientific articles in LaTeX using PyLaTeX, rendering them to PDF, and testing OCR/parsing tools in a controlled way."
license: "MIT"
GitHub Events
Total
- Watch event: 1
- Member event: 1
- Push event: 1
- Create event: 3
Last Year
- Watch event: 1
- Member event: 1
- Push event: 1
- Create event: 3
Issues and Pull Requests
Last synced: 6 months ago
Dependencies
- actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
- peaceiris/actions-gh-pages 4f9cc6602d3f66b9c108549d475ec49e8ef4d45e composite
- actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
- actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
- Pillow *
- docling *
- marker-pdf *
- pdf2image *
- pytesseract *
- torch *
- torchvision *
- Pillow >=10.2.0
- docling *
- faker *
- litellm >=1.30.7
- marker-pdf *
- pdf2image >=1.16.3
- pylatex *
- pytesseract >=0.3.10
- python-dotenv >=1.0.0
- pyyaml *