BenchmarkDataNLP.jl

BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking - Published in JOSS (2025)

https://github.com/mantzaris/benchmarkdatanlp.jl

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

corpus-data data-generation data-generator llm-training nlp
Last synced: 6 months ago

Repository

Generate synthetic text using a variety of methods, e.g. context-free grammars (CFGs), with parameterized complexity to test your NLP methods (such as LLMs)

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Topics
corpus-data data-generation data-generator llm-training nlp
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md

BenchmarkDataNLP.jl

License: MIT · DOI · Documentation · Build Status

Overview

BenchmarkDataNLP.jl is a Julia package (easily usable from other languages by calling Julia) that generates synthetic text datasets for natural language processing (NLP) experimentation, with characters drawn from the Korean Hangul Unicode block. The primary goal is to let researchers and developers produce language-like corpora of varying sizes and complexities without immediately investing in large-scale real-world data collection or computationally expensive training runs. The toolbox provides multiple generation algorithms: context-free grammars (CFG), RDF/triple-store-based corpora, finite state machine (FSM) expansions, and template-based text generation. Each supports a complexity parameter, so you can quickly obtain controlled, structured text for model prototyping or debugging.

Some Features

  • Tunable Complexity: A complexity parameter (often 1–100, or up to 1000) influences the vocabulary size, grammar roles and expansions, the probability of terminal tokens, the number of subjects/predicates/objects (in RDF), and more.
  • Deterministic or Random Generation: Some modes (e.g., deterministic CFG expansions, round-robin adjacency walks) produce fully reproducible text on which 100% accuracy is achievable, while other modes (e.g., random adjacency in FSM, polysemy in CFG) inject randomness for more varied output.
  • Multiple approaches:
    • CFG: Creates random context-free grammars, expansions, and sentences
    • RDF: Builds a triple-store from subject/predicate/object sets and turns them into text lines or paragraphs
    • FSM: Generates text by stepping through a finite state machine's adjacency/state set
    • Template-based (TPS): Fills placeholders in skeleton templates with a partitioned vocabulary
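The CFG approach listed above can be illustrated with a toy grammar. The sketch below is a minimal, hypothetical Python example of how context-free expansion produces sentences; it is not the package's implementation (which is in Julia and builds its own randomized grammars over Hangul tokens):

```python
import random

# Toy context-free grammar: nonterminals map to lists of productions.
# The Hangul tokens here are arbitrary terminals chosen for illustration.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["갃"], ["갇", "갊"]],
    "VP": [["갆"], ["갆", "NP"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol; symbols not in GRAMMAR are terminals."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    return [tok for part in production for tok in expand(part, rng)]

# A seeded RNG makes the expansion reproducible, mirroring the package's
# deterministic-versus-random distinction.
sentence = " ".join(expand("S", random.Random(0)))
print(sentence)
```

Raising the complexity parameter corresponds, roughly, to growing such a grammar: more nonterminals, more productions, and a larger terminal vocabulary.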

JSON Lines Output: By default, each module writes .jsonl files, split into train, test, and validation sets (80% / 10% / 10%).
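For example, 1000 generated sentences yield 800 training, 100 test, and 100 validation lines. A quick sketch of the split arithmetic (in Python for illustration; the package's exact rounding behavior is an assumption here):

```python
def split_counts(num_sentences):
    """Apply the 80/10/10 train/test/validation split described above."""
    train = int(num_sentences * 0.8)
    test = int(num_sentences * 0.1)
    validation = num_sentences - train - test  # remainder goes to validation
    return train, test, validation

print(split_counts(1000))  # (800, 100, 100)
```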

Quick Start Example

Installation

  1. Open the Julia REPL, enter package mode by pressing `]`, and run `add https://github.com/mantzaris/BenchmarkDataNLP.jl`. After installation, leave package mode (Backspace) and type `using BenchmarkDataNLP`.
  2. For development, clone the repository with `git clone https://github.com/mantzaris/BenchmarkDataNLP.jl`, move into the repository directory with `cd BenchmarkDataNLP.jl`, open the Julia REPL, press `]`, run `dev .`, then exit package mode and run `using BenchmarkDataNLP`.

```julia
using BenchmarkDataNLP

# Generate a dataset using a context-free grammar (CFG) at complexity 20, with
# 1000 sentences (800 lines in training, 100 in testing, 100 in validation),
# written to the path of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20, num_sentences = 1000, enable_polysemy = false,
                    output_dir = "/home/user/Documents", base_filename = "MyDataset")

# Generate using a finite state machine (FSM) based approach
generate_fsm_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyFSM",
                    use_context = true, random_adjacency = true, max_length = 12)

# Generate using an RDF based approach
generate_rdf_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyRDF",
                    filler_ratio = 0.2, max_filler = 2, use_context = true)

# Generate using a template strings approach
generate_tps_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "TemplatedTest",
                    n_templates = 5, max_placeholders_in_template = 4, deterministic = false)
```

Entries in the .jsonl files produced will look like:

{"text":"갃갇갊 갆 갇 갆 갃가갇."}

The characters are drawn from the Hangul region of Unicode.
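Since the tokens are Hangul syllables, a downstream consumer can sanity-check generated lines against the Hangul Syllables block (U+AC00 through U+D7A3). A small, hypothetical validation sketch in Python:

```python
import json

def is_hangul_syllable(ch):
    """True if ch lies in the Hangul Syllables Unicode block (U+AC00-U+D7A3)."""
    return "\uAC00" <= ch <= "\uD7A3"

# Sample entry from the README above.
line = '{"text":"갃갇갊 갆 갇 갆 갃가갇."}'
text = json.loads(line)["text"]

# Every character apart from spaces and the final period is a Hangul syllable.
content = [ch for ch in text if ch not in " ."]
print(all(is_hangul_syllable(ch) for ch in content))  # True
```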

Referencing

Citing this work:

@article{Mantzaris2025,
  doi       = {10.21105/joss.07844},
  url       = {https://doi.org/10.21105/joss.07844},
  year      = {2025},
  publisher = {The Open Journal},
  volume    = {10},
  number    = {109},
  pages     = {7844},
  author    = {Alexander V. Mantzaris},
  title     = {BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking},
  journal   = {Journal of Open Source Software}
}

Owner

  • Name: a.v.mantzaris
  • Login: mantzaris
  • Kind: user
  • Location: USA

Excited about the future of technology. Happy to participate in shaping that future through theory and practice.

JOSS Publication

BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking
Published
May 27, 2025
Volume 10, Issue 109, Page 7844
Authors
Alexander V. Mantzaris ORCID
Department of Statistics and Data Science, University of Central Florida (UCF), USA
Editor
Julia Romanowska ORCID
Tags
NLP benchmarking data generation language models

GitHub Events

Total
  • Release event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 102
  • Pull request review event: 5
  • Pull request review comment event: 5
  • Pull request event: 2
  • Fork event: 1
  • Create event: 5
Last Year
  • Release event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 102
  • Pull request review event: 5
  • Pull request review comment event: 5
  • Pull request event: 2
  • Fork event: 1
  • Create event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 9 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 9 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • jromanowska (2)

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • julia-actions/setup-julia v2 composite
.github/workflows/documentation.yml actions
  • actions/checkout v4 composite
  • julia-actions/cache v2 composite
  • julia-actions/setup-julia v2 composite