synthetic-paper-generator

A tool for generating synthetic scientific articles in LaTeX using PyLaTeX, rendering them to PDF, and testing OCR/parsing tools in a controlled way.

https://github.com/ucl-arc/synthetic-paper-generator

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization ucl-arc has institutional domain (www.ucl.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: UCL-ARC
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 924 KB
Statistics
  • Stars: 2
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

Synthetic Scientific Paper Generator

This tool generates LaTeX documents using PyLaTeX with configurable structures, renders them as PDFs, and saves the original structure in JSON format for OCR comparison.

Quick Start

System Requirements

  • Python 3.11 or higher
  • Tesseract OCR (for OCR capabilities)
  • Poppler (for PDF processing)

Installing System Dependencies

On macOS:

```bash
brew install tesseract poppler
```

On Ubuntu/Debian:

```bash
sudo apt-get install tesseract-ocr poppler-utils
```
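Before running the tool, it can help to confirm that the system dependencies are actually on the PATH. The following is a minimal sketch, not part of the project: the function name `find_system_dependencies` is hypothetical, and it assumes that the `tesseract` executable indicates Tesseract OCR and that `pdftoppm` (shipped with Poppler's command-line utilities) indicates Poppler.

```python
import shutil

# Assumption: these executables indicate that each dependency is installed.
# `pdftoppm` is one of the command-line tools that ships with Poppler.
REQUIRED_EXECUTABLES = {
    "Tesseract OCR": "tesseract",
    "Poppler": "pdftoppm",
}


def find_system_dependencies(executables=REQUIRED_EXECUTABLES):
    """Map each dependency name to its executable path, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in executables.items()}


if __name__ == "__main__":
    for name, path in find_system_dependencies().items():
        # Print the resolved path, or a clear marker when the tool is missing.
        print(f"{name}: {path if path else 'NOT FOUND'}")
```

A `None` value means the executable was not found, in which case the install commands above are the place to start.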

  1. Install the package in development mode:

```bash
pip install -e .
```

  2. Run the generator (from project root directory):

By default, the generator uses the Faker library to create synthetic content (no LLM calls):

```bash
python -m src.synthetic_paper_generator.generate
```

To use an LLM for content generation (requires API keys and configuration):

```bash
python -m src.synthetic_paper_generator.generate --use-llm true
```

  • The --use-llm flag controls whether to use an LLM for content generation.
  • Default: false (uses Faker for abstracts and introductions)
  • Set to true to use your configured LLM provider (e.g., Azure OpenAI).
  3. Run the OCR pipeline (from the project root directory):

```bash
python -m src.synthetic_paper_generator.ocr_pipeline
```
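If the generator needs to be driven from another Python script, the command shown in step 2 can be wrapped with `subprocess`. This is only a sketch: the helper names `build_generate_command` and `run_generator` are hypothetical, and the only module path and flag used are the ones documented above.

```python
import subprocess
import sys


def build_generate_command(use_llm: bool = False) -> list[str]:
    """Build the documented generator command for the current interpreter."""
    cmd = [sys.executable, "-m", "src.synthetic_paper_generator.generate"]
    if use_llm:
        # Documented flag; without it the generator falls back to Faker.
        cmd += ["--use-llm", "true"]
    return cmd


def run_generator(use_llm: bool = False, project_root: str = "."):
    """Run the generator from the project root; raises if it exits non-zero."""
    return subprocess.run(
        build_generate_command(use_llm), cwd=project_root, check=True
    )
```

Calling `run_generator(use_llm=True)` requires the same API keys and configuration as the command-line invocation.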

Output Structure

The tool generates the following directory structure under the output folder:

```
output/
├── pdf/                 # Generated PDF files
│   └── paper_*.pdf      # Generated scientific papers
├── json/                # Original structure in JSON format
│   └── paper_*.json     # JSON files containing paper structure
└── ocr_txt/             # OCR results from different engines
    ├── tesseract/       # Tesseract OCR results
    │   └── paper_*.txt
    ├── marker/          # Marker PDF parser results
    │   └── paper_*.txt
    └── docling/         # Docling OCR results
        └── paper_*.txt
```
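Given this layout, the files belonging to one paper can be gathered by file stem. The sketch below is illustrative and not part of the project: `collect_outputs` is a hypothetical helper, and it assumes that the OCR text files share the same stem as the PDF they were extracted from.

```python
from pathlib import Path

# Engine subdirectories as listed under output/ocr_txt/.
OCR_ENGINES = ("tesseract", "marker", "docling")


def collect_outputs(output_dir="output"):
    """Group generated files by paper name (the stem shared across folders)."""
    root = Path(output_dir)
    papers = {}
    for pdf in sorted((root / "pdf").glob("paper_*.pdf")):
        stem = pdf.stem
        json_path = root / "json" / f"{stem}.json"
        # Keep only the engines that actually produced a text file.
        ocr = {}
        for engine in OCR_ENGINES:
            txt = root / "ocr_txt" / engine / f"{stem}.txt"
            if txt.exists():
                ocr[engine] = txt
        papers[stem] = {
            "pdf": pdf,
            "json": json_path if json_path.exists() else None,
            "ocr": ocr,
        }
    return papers
```

Each entry maps a paper name to its PDF, its ground-truth JSON (or `None` if missing), and a dictionary of per-engine OCR text files.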

Understanding the Output

  1. PDF Files (output/pdf/):

    • Generated scientific papers in PDF format
    • Each file is named paper_<timestamp>.pdf
  2. JSON Structure (output/json/):

    • Contains the original structure of each generated paper
    • Useful for comparing OCR results with ground truth
    • Each file corresponds to a PDF file with the same name
  3. OCR Results (output/ocr_txt/):

    • Results from different OCR engines
    • Each engine has its own subdirectory
    • Text files contain extracted content from PDFs
    • Useful for comparing OCR accuracy across different engines
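As a rough illustration of comparing OCR output with the ground truth, the sketch below scores one OCR text against one JSON file using the standard library. It is not the project's evaluation method. The JSON schema is not documented here, so the code makes no assumption about key names and simply concatenates every string value it finds; the function names are hypothetical.

```python
import difflib
import json
import re


def flatten_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from flatten_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from flatten_strings(item)


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(ground_truth_json, ocr_text):
    """Return a 0..1 character-level similarity between ground truth and OCR text."""
    truth = normalize(" ".join(flatten_strings(json.loads(ground_truth_json))))
    matcher = difflib.SequenceMatcher(
        None, truth, normalize(ocr_text), autojunk=False
    )
    return matcher.ratio()
```

Character-level matching with `difflib` can be slow on long documents, so a word-level comparison or a dedicated metric such as character error rate may be a better fit for larger batches.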

Configuration

You can customize the document generation by modifying config.yaml. The configuration file allows you to:

  • Adjust document structure
  • Modify content generation parameters
  • Change output formats and locations

About

Project Team

  • Sagar Uprety (s.uprety@ucl.ac.uk)
  • Tim Repke (tim.repke@pik-potsdam.de)

Building Documentation

The MkDocs HTML documentation can be built locally by running

```sh
tox -e docs
```

from the root of the repository. The built documentation will be written to site.

Alternatively, to build and preview the documentation locally, run the following in a Python environment with the optional docs dependencies installed:

```sh
mkdocs serve
```

Roadmap

  • [x] Initial Research
  • [ ] Minimum viable product <-- You are Here
  • [ ] Alpha Release
  • [ ] Feature-Complete Release

Owner

  • Name: UCL Advanced Research Computing Centre
  • Login: UCL-ARC
  • Kind: organization
  • Location: United Kingdom

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
authors:
  - family-names: "Uprety"
    given-names: "Sagar"
    email: "s.uprety@ucl.ac.uk"
repository-code: "https://github.com/UCL-ARC/synthetic-paper-generator"
title: "synthetic_paper_generator: generating synthetic scientific articles in LaTeX using PyLaTeX, rendering them to PDF, and testing OCR/parsing tools in a controlled way."
license: "MIT"

GitHub Events

Total
  • Watch event: 1
  • Member event: 1
  • Push event: 1
  • Create event: 3
Last Year
  • Watch event: 1
  • Member event: 1
  • Push event: 1
  • Create event: 3


Dependencies

.github/workflows/docs.yml actions
  • actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
  • peaceiris/actions-gh-pages 4f9cc6602d3f66b9c108549d475ec49e8ef4d45e composite
.github/workflows/linting.yml actions
  • actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
.github/workflows/tests.yml actions
  • actions/cache 5a3ec84eff668545956fd18022155c47e93e2684 composite
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python a26af69be951a213d495a4c3e4e4022e16d87065 composite
pyproject.toml pypi
  • Pillow *
  • docling *
  • marker-pdf *
  • pdf2image *
  • pytesseract *
  • torch *
  • torchvision *
requirements.txt pypi
  • Pillow >=10.2.0
  • docling *
  • faker *
  • litellm >=1.30.7
  • marker-pdf *
  • pdf2image >=1.16.3
  • pylatex *
  • pytesseract >=0.3.10
  • python-dotenv >=1.0.0
  • pyyaml *