KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP


https://github.com/mantzaris/keemenapreprocessing.jl

Science Score: 87.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

julia natural-language-processing nlp text-encoding textprocessing tokenization
Last synced: 7 days ago

Repository

Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more

Basic Info
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Topics
julia natural-language-processing nlp text-encoding textprocessing tokenization
Created 9 months ago · Last pushed about 1 month ago
Metadata Files
Readme License

README.md

KeemenaPreprocessing

License: MIT · Dev docs · Build status


One-stop text pre-processor for Julia - clean -> tokenise -> segment -> build vocabulary -> align levels -> save bundle.

KeemenaPreprocessing.jl is a corpus-level preprocessing substrate for ML/NLP pipelines in Julia. It builds a deterministic PreprocessBundle from raw text using a streaming, two-pass workflow with predictable memory behavior. The key output is a reproducible artifact: token id streams plus offset tables and cross-level alignments (byte/char/word/sentence/etc.) suitable for downstream modeling, annotation alignment, and evaluation.

Intended for:

  • Researchers and engineers preprocessing large corpora for training or evaluating ML/NLP models
  • Workflows that need stable offsets and cross-references (for aligning spans, annotations, evaluation, and error analysis)

Not ideal for:

  • Users looking for a full NLP toolkit (tagging, parsing, NER, lemmatization, etc.)
  • Users wanting a library that bundles many tokenizer implementations or enforces a specific tokenizer ecosystem


What you get

  • Vocabulary

    • deterministic id <-> token tables
    • minimum-frequency filtering
    • user-defined special tokens
  • Tokenisation

    • byte, character, whitespace or Unicode-word
    • pluggable custom function
  • Offset vectors

    • word, sentence, paragraph and document boundaries
    • always begin with 1 and end with n_tokens + 1
  • Alignment cross-maps

    • byte <-> char <-> word indices (forward & backward)
  • Streaming mode

    • constant-memory two-pass pipeline
    • choose vector of bundles or single merged bundle
  • Bundles

    • everything packed into a PreprocessBundle
    • save / load with JLD2 in one line
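
The offset convention above (vectors that begin at 1 and end at n_tokens + 1) can be sketched in a few lines of plain Julia. This is illustrative only; the variable and function names here are hypothetical, not the package's API:

```julia
# Toy token stream and a sentence-offset vector following the convention:
# offsets begin at 1 and end at n_tokens + 1.
tokens = ["the", "cat", "sat", ".", "dogs", "bark", "."]
sentence_offsets = [1, 5, 8]

# Tokens of sentence i are tokens[offsets[i] : offsets[i+1] - 1]
sentence(i) = tokens[sentence_offsets[i] : sentence_offsets[i+1] - 1]

@assert sentence(1) == ["the", "cat", "sat", "."]
@assert sentence(2) == ["dogs", "bark", "."]
@assert sentence_offsets[end] == length(tokens) + 1
```

Because every level's offsets share this convention, a span at one level can be converted to another by plain index arithmetic, with no sentinel or off-by-one special cases at the corpus boundaries.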

Scope and ecosystem

  • KeemenaPreprocessing focuses on building a deterministic, aligned preprocessing artifact for downstream modeling
  • Tokenizer packages (like WordTokenizers.jl) focus on fast sentence/word splitting and configurable tokenizers, including global configurability via set_tokenizer / set_sentence_splitter
  • BPE/tokenizer-model packages (like BytePairEncoding.jl) focus on subword tokenization methods (including GPT-2 byte-level BPE and tiktoken)
  • KeemenaPreprocessing integrates with these via callables rather than hard dependencies, avoiding lock-in to upstream conventions and preserving reproducible pipelines

  • Bundles (portable preprocessing artifacts)

    • everything is packed into a PreprocessBundle (plain Julia structs + arrays)
    • convenience persistence via JLD2 (save_preprocess_bundle / load_preprocess_bundle)
    • JLD2 is a default convenience backend, not a constraint: advanced users can serialize the bundle differently (e.g. HDF5/Arrow/custom layouts) if they need cross-language interchange, memory mapping, or indexed random access
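
As a toy illustration of why plain structs and arrays keep the bundle backend-agnostic, the snippet below round-trips a hypothetical stand-in type through Julia's stdlib Serialization; `ToyBundle` is not the package's actual type, and a real bundle would use JLD2, HDF5, Arrow, or any other backend the same way:

```julia
using Serialization

# Hypothetical stand-in for a bundle made of plain structs + arrays
struct ToyBundle
    token_ids        :: Vector{Int}
    sentence_offsets :: Vector{Int}
end

b = ToyBundle([4, 9, 2, 7], [1, 3, 5])

# Any serializer that handles plain data works; stdlib shown here
path = tempname()
serialize(path, b)
b2 = deserialize(path)

@assert b2.token_ids == b.token_ids
@assert b2.sentence_offsets == b.sentence_offsets
```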

Quick example (full corpus in RAM)

```julia
using KeemenaPreprocessing

docs = ["First document.", "Second document..."]

cfg = PreprocessConfiguration(
    tokenizer_name          = :unicode,
    record_sentence_offsets = true,
    minimum_token_frequency = 2)

bundle = preprocess_corpus(docs; config = cfg)

word_ids = get_token_ids(bundle, :word)
println("tokens: ", length(word_ids))
```

The single call does all of: load, clean, tokenise, build vocabulary, record offsets, assemble bundle.
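
The vocabulary step of that pipeline can be sketched in a few lines: a minimal, illustrative implementation of deterministic id <-> token tables with minimum-frequency filtering and user-defined special tokens. This is not the package's internals; `build_vocab` and its keywords are hypothetical names:

```julia
# Illustrative vocabulary builder: count, filter by frequency,
# sort for deterministic ids, prepend special tokens.
function build_vocab(tokens; min_freq = 2, specials = ["<unk>"])
    freqs = Dict{String,Int}()
    for t in tokens
        freqs[t] = get(freqs, t, 0) + 1
    end
    kept = sort([t for (t, c) in freqs if c >= min_freq])  # sort => deterministic
    id_to_token = vcat(specials, kept)
    token_to_id = Dict(t => i for (i, t) in enumerate(id_to_token))
    return id_to_token, token_to_id
end

toks = ["a", "b", "a", "c", "b", "a"]
id2t, t2id = build_vocab(toks)
@assert id2t == ["<unk>", "a", "b"]   # "c" filtered out (frequency 1)
@assert t2id["a"] == 2
```

Sorting the kept tokens is what makes the id assignment reproducible across runs, independent of `Dict` iteration order.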


Processing huge corpora with constant memory

```julia
using KeemenaPreprocessing, Downloads

# Two Project Gutenberg books
alice = Downloads.download(
    "https://www.gutenberg.org/files/11/11-0.txt", "alice.txt")
time_machine = Downloads.download(
    "https://www.gutenberg.org/files/35/35-0.txt", "time_machine.txt")

cfg = PreprocessConfiguration(tokenizer_name = :whitespace)

merged = preprocess_corpus_streaming_full(
    [alice, time_machine];  # any iterable of sources
    cfg          = cfg,
    chunk_tokens = 5_000)   # ~5k tokens per internal chunk

println("total tokens: ", length(get_token_ids(merged, :word)))
```

preprocess_corpus_streaming_full runs the two-pass streaming pipeline, merges all internal chunks on the fly, and returns one cohesive bundle covering the entire corpus—ideal when downstream code expects a single artefact yet you still need strict memory bounds during preprocessing.
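
The two-pass idea can be sketched in plain Julia: a first pass over the chunks counts token frequencies, then a second pass re-streams the same chunks to emit ids, so peak memory is bounded by the chunk size rather than the corpus size. This is toy code; none of these names come from the package:

```julia
# Stand-in for a re-streamable chunked source (e.g. files read in pieces)
chunks() = (["to", "be"], ["or", "not"], ["to", "be"])

# Pass 1: frequencies, one chunk in memory at a time
freqs = Dict{String,Int}()
for chunk in chunks(), t in chunk
    freqs[t] = get(freqs, t, 0) + 1
end
vocab = sort(collect(keys(freqs)))          # deterministic ordering
t2id  = Dict(t => i for (i, t) in enumerate(vocab))

# Pass 2: re-stream the source and encode
ids = Int[]
for chunk in chunks(), t in chunk
    push!(ids, t2id[t])
end
@assert ids == [4, 1, 3, 2, 4, 1]
```

The price of the second pass is re-reading the source, which is why the pipeline needs sources that can be iterated twice (files, URLs re-downloaded to disk), not one-shot streams.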


Installing

It can be installed from the General registry with import Pkg; Pkg.add("KeemenaPreprocessing"), or by pressing ] to enter package mode and typing add KeemenaPreprocessing; then, back at the REPL prompt, run using KeemenaPreprocessing.

For the dev version: open the Julia REPL, enter package mode by pressing ], and run: add https://github.com/mantzaris/KeemenaPreprocessing.jl


Contributing to KeemenaPreprocessing.jl

Contributions are welcome and collaboration is encouraged.

How to contribute

Reporting bugs

Please open a GitHub issue and include:

  • Julia version
  • KeemenaPreprocessing.jl version (from Project.toml or Pkg.status())
  • A minimal reproducible example
  • Expected behavior vs. actual behavior, with error messages

Proposing changes

Open an issue first if the change is large or affects the public API, so we can agree on a direction before significant work is done.

Pull requests

  1. Fork the repository and create a feature branch
  2. Keep pull requests focused (one logical change per PR); this makes review easier
  3. Add tests for bug fixes and new features; clear, descriptive test names help
  4. Update documentation if behavior or API changes
  5. Ensure CI is green

Community guidelines

Please be respectful and constructive. This project follows the Julia Community Standards.

Owner

  • Name: a.v.mantzaris
  • Login: mantzaris
  • Kind: user
  • Location: USA

Excited about the future of technology. Happy to participate in shaping that future through theory and practice.

JOSS Publication

KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation and Streaming Offset Bundling for Julia NLP
Published
February 23, 2026
Volume 11, Issue 118, Page 9348
Authors
Alexander V. Mantzaris ORCID
Department of Statistics and Data Science, University of Central Florida (UCF), USA
Editor
Owen Lockwood ORCID
Tags
NLP Text Processing Tokenization Corpus Cleaning

GitHub Events

Total
  • Create event: 111
  • Commit comment event: 3
  • Release event: 2
  • Delete event: 1
  • Pull request event: 2
  • Fork event: 1
  • Issues event: 3
  • Watch event: 2
  • Issue comment event: 2
  • Push event: 83
Last Year
  • Create event: 111
  • Commit comment event: 3
  • Release event: 2
  • Delete event: 1
  • Pull request event: 2
  • Fork event: 1
  • Issues event: 3
  • Watch event: 2
  • Issue comment event: 2
  • Push event: 83

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 70
  • Total Committers: 1
  • Avg Commits per committer: 70.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 70
  • Committers: 1
  • Avg Commits per committer: 70.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
mantzaris a****s@g****m 70

Issues and Pull Requests

Last synced: 24 days ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: less than a minute
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: less than a minute
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JuliaTagBot (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • julia 1 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
juliahub.com: KeemenaPreprocessing

Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1 Total
Rankings
Dependent repos count: 8.2%
Average: 21.6%
Dependent packages count: 35.1%
Last synced: 15 days ago

Dependencies

.github/workflows/CI.yml actions
  • actions/checkout v4 composite
  • julia-actions/cache v2 composite
  • julia-actions/julia-buildpkg v1 composite
  • julia-actions/julia-docdeploy v1 composite
  • julia-actions/julia-runtest v1 composite
  • julia-actions/setup-julia v2 composite
.github/workflows/CompatHelper.yml actions
.github/workflows/TagBot.yml actions
  • JuliaRegistries/TagBot v1 composite