BenchmarkDataNLP.jl

BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking - Published in JOSS (2025)

https://github.com/mantzaris/benchmarkdatanlp.jl

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

corpus-data data-generation data-generator llm-training nlp
Last synced: 6 months ago

Repository

Generate synthetic text using a variety of methods, e.g. context-free grammars (CFGs), with parameterized complexity to test your NLP methods (such as LLMs)

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Topics
corpus-data data-generation data-generator llm-training nlp
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md

BenchmarkDataNLP.jl

License: MIT · DOI · Documentation · Build Status

Overview

BenchmarkDataNLP.jl is a Julia package (easily usable from other languages by calling Julia) that generates synthetic text datasets for natural language processing (NLP) experimentation, with characters drawn from the Korean Hangul Unicode block. The primary goal is to let researchers and developers produce language-like corpora of varying sizes and complexities without immediately investing in large-scale real-world data collection or computationally expensive training runs. The toolbox provides multiple generation algorithms: context-free grammars (CFG), RDF/triple-store-based corpora, finite state machine (FSM) expansions, and template-based text generation. Each supports a complexity parameter, so you can quickly obtain controlled, structured text for model prototyping or debugging.

Some Features

  • Tunable Complexity: A complexity parameter (often 1–100, or up to 1000) influences the vocabulary size, grammar roles and expansions, the probability of terminal tokens, the number of subjects/predicates/objects (in RDF), and more.
  • Deterministic or Random Generation: Some modes (e.g., deterministic CFG expansions, round-robin adjacency walks) produce fully reproducible text on which 100% accuracy is achievable, while other modes (e.g., random adjacency in FSM, polysemy in CFG) inject randomness for more varied output.
  • Multiple approaches:
    • CFG: Creates random context-free grammars, expansions, and sentences
    • RDF: Builds a triple-store from subject/predicate/object sets and turns them into text lines or paragraphs
    • FSM: Generates text by stepping through a finite state machine's adjacency/state set
    • Template-based (TPS): Fills placeholders in skeleton templates with a partitioned vocabulary
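The CFG approach listed above can be illustrated with a toy grammar. The sketch below is a minimal, hypothetical Python example of how context-free expansion produces sentences; it is not the package's implementation (which is in Julia and builds its own randomized grammars over Hangul tokens):

```python
import random

# Toy context-free grammar: nonterminals map to lists of productions.
# The Hangul tokens here are arbitrary terminals chosen for illustration.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["갃"], ["갇", "갊"]],
    "VP": [["갆"], ["갆", "NP"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol; symbols not in GRAMMAR are terminals."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    return [tok for part in production for tok in expand(part, rng)]

# A seeded RNG makes the expansion reproducible, mirroring the package's
# deterministic-versus-random distinction.
sentence = " ".join(expand("S", random.Random(0)))
print(sentence)
```

Raising the complexity parameter corresponds, roughly, to growing such a grammar: more nonterminals, more productions, and a larger terminal vocabulary.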

JSON Lines Output: By default, each module writes .jsonl files, split into train, test, and validation sets (80% / 10% / 10%).
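For example, 1000 generated sentences yield 800 training, 100 test, and 100 validation lines. A quick sketch of the split arithmetic (in Python for illustration; the package's exact rounding behavior is an assumption here):

```python
def split_counts(num_sentences):
    """Apply the 80/10/10 train/test/validation split described above."""
    train = int(num_sentences * 0.8)
    test = int(num_sentences * 0.1)
    validation = num_sentences - train - test  # remainder goes to validation
    return train, test, validation

print(split_counts(1000))  # (800, 100, 100)
```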

Quick Start Example

Installation

  1. Open the Julia REPL, enter package mode by pressing `]`, and run `add https://github.com/mantzaris/BenchmarkDataNLP.jl`. After installation, leave package mode (Backspace) and type `using BenchmarkDataNLP`.
  2. For development, clone the repository with `git clone https://github.com/mantzaris/BenchmarkDataNLP.jl`, move into the repository directory with `cd BenchmarkDataNLP.jl`, open the Julia REPL, press `]`, run `dev .`, then exit package mode and run `using BenchmarkDataNLP`.

```julia
using BenchmarkDataNLP

# Generate a dataset using a context-free grammar (CFG) at complexity 20, with
# 1000 sentences (800 lines in training, 100 in testing, 100 in validation),
# written to the path of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20, num_sentences = 1000, enable_polysemy = false,
                    output_dir = "/home/user/Documents", base_filename = "MyDataset")

# Generate using a finite state machine (FSM) based approach
generate_fsm_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyFSM",
                    use_context = true, random_adjacency = true, max_length = 12)

# Generate using an RDF based approach
generate_rdf_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyRDF",
                    filler_ratio = 0.2, max_filler = 2, use_context = true)

# Generate using a template strings approach
generate_tps_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "TemplatedTest",
                    n_templates = 5, max_placeholders_in_template = 4, deterministic = false)
```

Entries in the .jsonl files produced will look like:

{"text":"갃갇갊 갆 갇 갆 갃가갇."}

The characters are drawn from the Hangul region of Unicode.
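Since the tokens are Hangul syllables, a downstream consumer can sanity-check generated lines against the Hangul Syllables block (U+AC00 through U+D7A3). A small, hypothetical validation sketch in Python:

```python
import json

def is_hangul_syllable(ch):
    """True if ch lies in the Hangul Syllables Unicode block (U+AC00-U+D7A3)."""
    return "\uAC00" <= ch <= "\uD7A3"

# Sample entry from the README above.
line = '{"text":"갃갇갊 갆 갇 갆 갃가갇."}'
text = json.loads(line)["text"]

# Every character apart from spaces and the final period is a Hangul syllable.
content = [ch for ch in text if ch not in " ."]
print(all(is_hangul_syllable(ch) for ch in content))  # True
```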

Referencing

Citing this work:

@article{Mantzaris2025,
  doi       = {10.21105/joss.07844},
  url       = {https://doi.org/10.21105/joss.07844},
  year      = {2025},
  publisher = {The Open Journal},
  volume    = {10},
  number    = {109},
  pages     = {7844},
  author    = {Alexander V. Mantzaris},
  title     = {BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking},
  journal   = {Journal of Open Source Software}
}

Owner

  • Name: a.v.mantzaris
  • Login: mantzaris
  • Kind: user
  • Location: USA

Excited about the future of technology. Happy to participate in shaping that future through theory and practice.

JOSS Publication

BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking
Published
May 27, 2025
Volume 10, Issue 109, Page 7844
Authors
Alexander V. Mantzaris ORCID
Department of Statistics and Data Science, University of Central Florida (UCF), USA
Editor
Julia Romanowska ORCID
Tags
NLP benchmarking data generation language models

GitHub Events

Total
  • Release event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 102
  • Pull request review event: 5
  • Pull request review comment event: 5
  • Pull request event: 2
  • Fork event: 1
  • Create event: 5
Last Year
  • Release event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 102
  • Pull request review event: 5
  • Pull request review comment event: 5
  • Pull request event: 2
  • Fork event: 1
  • Create event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 9 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 9 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • jromanowska (2)

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • julia-actions/setup-julia v2 composite
.github/workflows/documentation.yml actions
  • actions/checkout v4 composite
  • julia-actions/cache v2 composite
  • julia-actions/setup-julia v2 composite