BenchmarkDataNLP.jl
BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking - Published in JOSS (2025)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 7 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to joss.theoj.org
- ○ Academic email domains
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Generate synthetic text with a variety of methods, e.g. Context-Free Grammars (CFGs), with parameterized complexity, to test your NLP methods (such as LLMs).
Basic Info
- Host: GitHub
- Owner: mantzaris
- License: MIT
- Language: Julia
- Default Branch: main
- Homepage: https://mantzaris.github.io/BenchmarkDataNLP.jl/
- Size: 1.4 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
BenchmarkDataNLP.jl
Overview
BenchmarkDataNLP.jl is a Julia project (easily usable from other languages by calling Julia) that generates synthetic text datasets for natural language processing (NLP) experimentation. The generated text uses characters selected from the Hangul (Korean) Unicode block. The primary goal is to let researchers and developers produce language-like corpora of varying sizes and complexities without immediately investing in large-scale real-world data collection or computationally expensive training runs. The toolbox provides multiple generation algorithms, each supporting a complexity parameter: Context-Free Grammars (CFG), RDF/triple-store-based corpora, Finite State Machine (FSM) expansions, and template-based text generation. You can quickly obtain controlled, structured text for model prototyping or debugging.
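To make the idea of a complexity-driven Hangul vocabulary concrete, here is a minimal sketch. It is not the package's implementation: the helper names `hangul_alphabet` and `make_vocabulary`, and the rule that ties alphabet and vocabulary size to `complexity`, are assumptions chosen only for illustration. The one fixed fact it relies on is that the Unicode Hangul Syllables block spans U+AC00 to U+D7A3.

```julia
using Random

# Bounds of the Unicode Hangul Syllables block.
const HANGUL_START = 0xAC00
const HANGUL_END   = 0xD7A3

# First `n` characters of the Hangul Syllables block.
function hangul_alphabet(n::Integer)
    1 <= n <= (HANGUL_END - HANGUL_START + 1) ||
        throw(ArgumentError("n must fit inside the Hangul Syllables block"))
    return [Char(HANGUL_START + i) for i in 0:n-1]
end

# Hypothetical scaling rule (an assumption, not the package's):
# alphabet size and vocabulary size both grow with `complexity`.
function make_vocabulary(complexity::Integer;
                         rng = Random.default_rng(), max_word_len = 3)
    alphabet = hangul_alphabet(5 + complexity)
    n_words  = 10 * complexity
    vocab = Set{String}()
    while length(vocab) < n_words
        len = rand(rng, 1:max_word_len)
        push!(vocab, join(rand(rng, alphabet, len)))
    end
    return sort!(collect(vocab))
end

# A fixed seed makes the vocabulary reproducible.
vocab = make_vocabulary(20; rng = MersenneTwister(42))
println(length(vocab), " words, e.g. ", first(vocab, 3))
```

Using an artificial alphabet like this keeps the corpus free of any real-language statistics, so a model can only learn the structure that the generator put there.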
Some Features
- Tunable complexity: a complexity parameter (often 1–100, or up to 1000) influences vocabulary size, grammar roles/expansions, the probability of terminal tokens, the number of subjects/predicates/objects (in RDF), and more.
- Deterministic or random generation: some methods (e.g., deterministic CFG expansions, round-robin adjacency walks) produce fully reproducible text on which 100% accuracy is achievable; other modes (e.g., random adjacency in FSM, polysemy in CFG) inject randomness for more varied outputs.
- Multiple approaches:
  - CFG: creates random context-free grammars, expansions, and sentences
  - RDF: builds a triple store from subject/predicate/object sets and turns the triples into text lines or paragraphs
  - FSM: generates text by stepping through the adjacency structure of a finite state machine
  - Template-based (TPS): fills placeholders in skeleton templates with a partitioned vocabulary
- JSON Lines output: by default, each module writes .jsonl files, split into train, test, and validation sets (80% / 10% / 10%).
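The difference between a deterministic round-robin adjacency walk and a random one can be sketched as follows. This is an illustrative toy, not the package's code: the function `fsm_sentence`, its arguments, and the three-state machine `ADJ` are all hypothetical.

```julia
using Random

# Hypothetical toy machine: each state lists the states it may move to.
const ADJ = Dict(
    "가" => ["각", "갂"],
    "각" => ["가"],
    "갂" => ["각"],
)

# Walk the machine from `start`, emitting one token per visited state.
# With random_adjacency = false the successors of each state are taken in
# round-robin order, so the output is fully reproducible.
function fsm_sentence(adjacency::Dict{String,Vector{String}}, start::String,
                      max_length::Int;
                      random_adjacency::Bool = false,
                      rng = Random.default_rng())
    visits = Dict(k => 0 for k in keys(adjacency))  # per-state round-robin counters
    state  = start
    tokens = [state]
    while length(tokens) < max_length
        successors = get(adjacency, state, String[])
        isempty(successors) && break                # dead end: stop early
        if random_adjacency
            next_state = rand(rng, successors)
        else
            idx = visits[state] % length(successors) + 1
            visits[state] += 1
            next_state = successors[idx]
        end
        push!(tokens, next_state)
        state = next_state
    end
    return join(tokens, " ") * "."
end

println(fsm_sentence(ADJ, "가", 6))                           # deterministic walk
println(fsm_sentence(ADJ, "가", 6; random_adjacency = true))  # varies between runs
```

In the round-robin mode the next token is a function of the walk so far, which is why a model can in principle reach 100% accuracy on such data; the random mode caps achievable accuracy below that.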
Quick Start Example
Installation
- Open the Julia REPL, enter package mode by pressing `]`, and run `add https://github.com/mantzaris/BenchmarkDataNLP.jl`. After installation, leave package mode (backspace) and type `using BenchmarkDataNLP`.
- For development, clone the repo with `git clone https://github.com/mantzaris/BenchmarkDataNLP.jl`, move into the repo directory with `cd BenchmarkDataNLP.jl`, open the Julia REPL, press `]`, run `dev .`, then exit package mode and type `using BenchmarkDataNLP`.
```julia
using BenchmarkDataNLP

# Generate a dataset using a Context-Free Grammar at complexity 20: 1000 sentences
# (800 lines for training, 100 for testing, 100 for validation), written to the
# directory of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20, num_sentences = 1000, enable_polysemy = false,
                    output_dir = "/home/user/Documents", base_filename = "MyDataset")

# Generate using a Finite State Machine based approach
generate_fsm_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyFSM",
                    use_context = true, random_adjacency = true, max_length = 12)

# Generate using an RDF based approach
generate_rdf_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "MyRDF",
                    filler_ratio = 0.2, max_filler = 2, use_context = true)

# Generate using a Template Strings approach
generate_tps_corpus(20, 1000; output_dir = "/home/user/Documents", base_name = "TemplatedTest",
                    n_templates = 5, max_placeholders_in_template = 4, deterministic = false)
```
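For readers who want a feel for what CFG-based generation involves, the following is a small self-contained sketch. It does not use or reproduce the package's internals: the grammar `TOY_GRAMMAR`, the functions `expand` and `cfg_sentence`, and the depth cut-off are assumptions made for illustration.

```julia
using Random

# Hypothetical toy grammar: each nonterminal maps to its alternative
# right-hand sides. Symbols without an entry are terminals.
const TOY_GRAMMAR = Dict(
    "S"  => [["NP", "VP"]],
    "NP" => [["가"], ["각", "NP"]],
    "VP" => [["갂"], ["갂", "NP"]],
)

# Recursively expand `symbol` into a list of terminal tokens.
function expand(symbol, grammar, rng; depth = 0, max_depth = 5)
    haskey(grammar, symbol) || return [symbol]      # terminal symbol
    alternatives = grammar[symbol]
    # Beyond max_depth always take the first alternative, which in this toy
    # grammar is non-recursive, so the expansion is guaranteed to terminate.
    rhs = depth >= max_depth ? first(alternatives) : rand(rng, alternatives)
    tokens = String[]
    for s in rhs
        append!(tokens, expand(s, grammar, rng; depth = depth + 1, max_depth = max_depth))
    end
    return tokens
end

cfg_sentence(grammar, rng) = join(expand("S", grammar, rng), " ") * "."

# Seeding the generator makes the sampled sentences reproducible.
rng = MersenneTwister(2025)
for _ in 1:3
    println(cfg_sentence(TOY_GRAMMAR, rng))
end
```

A larger grammar with more roles and alternatives would play the part that the complexity parameter plays in the package, although the exact mapping used there is not shown here.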
Entries in the .jsonl files produced will look like:
{"text":"갃갇갊 갆 갇 갆 갃가갇."}
The characters are drawn from the Hangul region of Unicode.
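An entry of this shape, together with the 80% / 10% / 10% split described earlier, could be produced along the following lines. This is a sketch under stated assumptions rather than the package's writer: the helper names, the hand-rolled JSON escaping, and the `<base>_<split>.jsonl` file naming are all hypothetical.

```julia
# Escape the characters that JSON requires to be escaped inside a string.
function json_escape(s::AbstractString)
    io = IOBuffer()
    for c in s
        if c == '"'
            print(io, "\\\"")
        elseif c == '\\'
            print(io, "\\\\")
        elseif c == '\n'
            print(io, "\\n")
        elseif c == '\t'
            print(io, "\\t")
        elseif c < ' '
            print(io, "\\u", string(UInt32(c), base = 16, pad = 4))
        else
            print(io, c)            # Hangul and other non-ASCII pass through as UTF-8
        end
    end
    return String(take!(io))
end

# One JSON Lines record with a single "text" field.
jsonl_line(text) = "{\"text\":\"" * json_escape(text) * "\"}"

# 80/10/10 split sizes using integer arithmetic; validation takes the remainder.
function split_counts(n::Integer)
    n_train = (8 * n) ÷ 10
    n_test  = n ÷ 10
    return (n_train, n_test, n - n_train - n_test)
end

# Write train/test/validation files; the file naming here is hypothetical.
function write_splits(sentences::Vector{String}, output_dir, base_name)
    n_train, n_test, _ = split_counts(length(sentences))
    ranges = (train      = 1:n_train,
              test       = n_train+1:n_train+n_test,
              validation = n_train+n_test+1:length(sentences))
    paths = String[]
    for (name, r) in pairs(ranges)
        path = joinpath(output_dir, "$(base_name)_$(name).jsonl")
        open(path, "w") do io
            for s in sentences[r]
                println(io, jsonl_line(s))
            end
        end
        push!(paths, path)
    end
    return paths
end

println(jsonl_line("갃갇갊 갆 갇 갆 갃가갇."))
```

Because every line is an independent JSON object, the files can be streamed line by line by most data-loading tools without reading a whole corpus into memory.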
Referencing
Citing this work:
@article{Mantzaris2025,
  doi = {10.21105/joss.07844},
  url = {https://doi.org/10.21105/joss.07844},
  year = {2025},
  publisher = {The Open Journal},
  volume = {10},
  number = {109},
  pages = {7844},
  author = {Alexander V. Mantzaris},
  title = {BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking},
  journal = {Journal of Open Source Software}
}
Owner
- Name: a.v.mantzaris
- Login: mantzaris
- Kind: user
- Location: USA
- Twitter: avmantzaris
- Repositories: 34
- Profile: https://github.com/mantzaris
Excited about the future of technology. Happy to participate in shaping that future through theory and practice.
JOSS Publication
BenchmarkDataNLP.jl: Synthetic Data Generation for NLP Benchmarking
Tags
NLP benchmarking data generation language models
GitHub Events
Total
- Release event: 2
- Delete event: 1
- Issue comment event: 1
- Push event: 102
- Pull request review event: 5
- Pull request review comment event: 5
- Pull request event: 2
- Fork event: 1
- Create event: 5
Last Year
- Release event: 2
- Delete event: 1
- Issue comment event: 1
- Push event: 102
- Pull request review event: 5
- Pull request review comment event: 5
- Pull request event: 2
- Fork event: 1
- Create event: 5
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 9 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 9 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- jromanowska (2)
Dependencies
- actions/checkout v4 composite
- julia-actions/setup-julia v2 composite
- actions/checkout v4 composite
- julia-actions/cache v2 composite
- julia-actions/setup-julia v2 composite
