KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP - Published in JOSS (2026)
Science Score: 87.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more
Basic Info
- Host: GitHub
- Owner: mantzaris
- License: mit
- Language: Julia
- Default Branch: main
- Homepage: https://mantzaris.github.io/KeemenaPreprocessing.jl/dev/
- Size: 992 KB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
KeemenaPreprocessing
One-stop text pre-processor for Julia - clean -> tokenise -> segment -> build vocabulary -> align levels -> save bundle.
KeemenaPreprocessing.jl is a corpus-level preprocessing substrate for ML/NLP pipelines in Julia. It builds a deterministic PreprocessBundle from raw text using a streaming, two-pass workflow with predictable memory behavior. The key output is a reproducible artifact: token id streams plus offset tables and cross-level alignments (byte/char/word/sentence/etc.) suitable for downstream modeling, annotation alignment, and evaluation.
Intended for:
- Researchers and engineers preprocessing large corpora for training or evaluating ML/NLP models.
- Workflows that need stable offsets/cross-references (for aligning spans, annotations, evaluation, error analysis).
Not ideal for:
- Users looking for a full NLP toolkit (tagging, parsing, NER, lemmatization, etc.).
- Users wanting a library that bundles many tokenizer implementations or enforces a specific tokenizer ecosystem.
What you get
Vocabulary
- deterministic id <-> token tables
- minimum-frequency filtering
- user-defined special tokens
Tokenisation
- byte, character, whitespace or Unicode-word
- pluggable custom function
Offset vectors
- word, sentence, paragraph and document boundaries
- always begin with 1 and end with n_tokens + 1
Alignment cross-maps
- byte <-> char <-> word indices (forward & backward)
Streaming mode
- constant-memory two-pass pipeline
- choose vector of bundles or single merged bundle
Bundles
- everything packed into a PreprocessBundle
- save / load with JLD2 in one line
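To make the offset convention and the byte <-> char cross-map listed above more concrete, here is a minimal sketch in plain Julia (Base only). It does not call KeemenaPreprocessing and is not taken from its source; the helpers `sentence_offsets` and `unit` are hypothetical names used purely for illustration, and the "a full stop ends a sentence" rule is a simplifying assumption.

```julia
# Illustrative sketch in plain Julia (Base only); not the package's internals.
# All helper names here are hypothetical.

tokens = ["The", "cat", "sat", ".", "It", "purred", "."]

# Sentence offsets over the token stream: the index of each sentence's first
# token, followed by a final sentinel equal to n_tokens + 1.
function sentence_offsets(tokens::Vector{String})
    offsets = [1]
    for (i, tok) in enumerate(tokens)
        # a "." closes a sentence; the next one (if any) starts at i + 1
        if tok == "." && i < length(tokens)
            push!(offsets, i + 1)
        end
    end
    push!(offsets, length(tokens) + 1)
    return offsets
end

# Thanks to the sentinel, unit k is always offsets[k] : offsets[k + 1] - 1.
unit(tokens, offsets, k) = tokens[offsets[k]:(offsets[k + 1] - 1)]

offsets = sentence_offsets(tokens)
println(offsets)                   # [1, 5, 8]
println(unit(tokens, offsets, 2))  # ["It", "purred", "."]

# Byte <-> char cross-map for a string with a 2-byte character (U+00EF).
s = "na\u00efve"
char_to_byte = collect(eachindex(s))       # first byte of each character
byte_to_char = zeros(Int, ncodeunits(s))   # owning character of each byte
for (c, b) in enumerate(char_to_byte)
    last_byte = c < length(char_to_byte) ? char_to_byte[c + 1] - 1 : ncodeunits(s)
    byte_to_char[b:last_byte] .= c
end
println(char_to_byte)  # [1, 2, 3, 5, 6]
println(byte_to_char)  # [1, 2, 3, 3, 4, 5]
```

The trailing sentinel is what lets every unit, including the last one, be sliced with the same expression, and the number of units is simply `length(offsets) - 1`.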
Scope and ecosystem
- KeemenaPreprocessing focuses on building a deterministic, aligned preprocessing artifact for downstream modeling
- Tokenizer packages (like WordTokenizers.jl) focus on fast sentence/word splitting and configurable tokenizers, including global configurability via set_tokenizer/set_sentence_splitter
- BPE/tokenizer-model packages (like BytePairEncoding.jl) focus on subword tokenization methods (including GPT-2 byte-level BPE and tiktoken)
KeemenaPreprocessing integrates with these via callables rather than hard dependencies, to avoid locking users into upstream conventions and to preserve reproducible pipelines.
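As a rough illustration of what "via callables" can look like, the sketch below defines a small tokenizer as an ordinary Julia function that maps a string to a vector of token strings. The function `simple_tokenizer` is hypothetical, and the commented-out configuration line is an assumption about how a callable might be supplied; check the package documentation for the exact keyword before relying on it.

```julia
# Hypothetical custom tokenizer: any callable mapping a string to a vector of
# token strings. Here: lowercase the text, then keep runs of letters/digits.
function simple_tokenizer(text::AbstractString)
    return [String(m.match) for m in eachmatch(r"[\p{L}\p{N}]+", lowercase(text))]
end

println(simple_tokenizer("Hello, World! 42 times."))
# ["hello", "world", "42", "times"]

# Assumption (not verified against the package API): a callable may be passed
# where a tokenizer symbol is otherwise given, for example
#   cfg = PreprocessConfiguration(tokenizer_name = simple_tokenizer)
```

A function from another tokenizer package could be wrapped in the same way, which keeps that package an optional choice of the caller rather than a dependency of the pipeline.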
Bundles (portable preprocessing artifacts)
- everything is packed into a PreprocessBundle (plain Julia structs + arrays)
- convenience persistence via JLD2 (save_preprocess_bundle/load_preprocess_bundle)
- JLD2 is a default convenience backend, not a constraint: advanced users can serialize the bundle differently (e.g. HDF5/Arrow/custom layouts) if they need cross-language interchange, memory mapping, or indexed random access
Quick example (full corpus in RAM)
```julia
using KeemenaPreprocessing

docs = ["First document.", "Second document..."]

cfg = PreprocessConfiguration(
    tokenizer_name          = :unicode,
    record_sentence_offsets = true,
    minimum_token_frequency = 2)

bundle = preprocess_corpus(docs; config = cfg)

word_ids = get_token_ids(bundle, :word)
println("tokens:", length(word_ids))
```
The single call does all of: load, clean, tokenise, build vocabulary, record offsets, assemble bundle.
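The vocabulary step with minimum-frequency filtering can be pictured with the following plain-Julia sketch. It is not the package's implementation: `build_vocab`, its keyword names, the `"<UNK>"` special token, and the ordering rule (descending count, ties broken alphabetically) are all assumptions chosen to show one way of getting deterministic id <-> token tables.

```julia
# Illustrative sketch in plain Julia; `build_vocab` is a hypothetical helper.
function build_vocab(tokens::Vector{String};
                     min_freq::Int = 1,
                     specials::Vector{String} = ["<UNK>"])
    counts = Dict{String,Int}()
    for t in tokens
        counts[t] = get(counts, t, 0) + 1
    end
    kept = [t for (t, c) in counts if c >= min_freq]
    # Sort by descending count, ties broken alphabetically, so the ids do not
    # depend on Dict iteration order.
    sort!(kept; by = t -> (-counts[t], t))
    id_to_token = vcat(specials, kept)
    token_to_id = Dict(t => i for (i, t) in enumerate(id_to_token))
    return id_to_token, token_to_id
end

tokens = ["first", "document", ".", "second", "document", ".", ".", "."]
id_to_token, token_to_id = build_vocab(tokens; min_freq = 2)

# Tokens below the threshold fall back to the special token's id.
unk = token_to_id["<UNK>"]
ids = [get(token_to_id, t, unk) for t in tokens]

println(id_to_token)  # ["<UNK>", ".", "document"]
println(ids)          # [1, 3, 2, 1, 3, 2, 2, 2]
```

Placing the special tokens first keeps their ids fixed regardless of the corpus, which is one common reason for this layout.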
Processing huge corpora with constant memory
```julia
using KeemenaPreprocessing, Downloads

# Two Project Gutenberg books
alice = Downloads.download(
    "https://www.gutenberg.org/files/11/11-0.txt", "alice.txt")
time_machine = Downloads.download(
    "https://www.gutenberg.org/files/35/35-0.txt", "time_machine.txt")

cfg = PreprocessConfiguration(tokenizer_name = :whitespace)

merged = preprocess_corpus_streaming_full(
    [alice, time_machine];   # any iterable of sources
    cfg = cfg,
    chunk_tokens = 5_000)    # ~5 k tokens per internal chunk

println("total tokens:", length(get_token_ids(merged, :word)))
```
preprocess_corpus_streaming_full runs the two-pass streaming pipeline,
merges all internal chunks on the fly, and returns one cohesive bundle
covering the entire corpus. This is ideal when downstream code expects a single
artefact but preprocessing must stay within strict memory bounds.
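The general shape of a two-pass, chunked pipeline with on-the-fly merging can be sketched in plain Julia as follows. This is only an illustration of the idea, not the package's code: `foreach_chunk` and `two_pass_encode` are hypothetical helpers, an in-memory vector of lines stands in for files read lazily, and alphabetical id assignment is an arbitrary choice.

```julia
# Illustrative sketch in plain Julia; not the package's implementation.
# `foreach_chunk` and `two_pass_encode` are hypothetical helpers.

# Call f on successive groups of at most `chunk_tokens` whitespace tokens,
# reusing one buffer so only a single chunk is held at a time.
function foreach_chunk(f, lines, chunk_tokens::Int)
    buf = String[]
    for line in lines, tok in split(line)
        push!(buf, String(tok))
        if length(buf) == chunk_tokens
            f(buf)
            empty!(buf)
        end
    end
    isempty(buf) || f(buf)
    return nothing
end

function two_pass_encode(lines; chunk_tokens::Int = 4)
    # Pass 1: stream every chunk, keeping only the frequency table.
    counts = Dict{String,Int}()
    foreach_chunk(lines, chunk_tokens) do chunk
        for t in chunk
            counts[t] = get(counts, t, 0) + 1
        end
    end
    # Sorted keys give ids that do not depend on Dict iteration order.
    token_to_id = Dict(t => i for (i, t) in enumerate(sort!(collect(keys(counts)))))

    # Pass 2: stream again, encode each chunk, and merge it immediately.
    ids = Int[]
    chunk_offsets = [1]
    foreach_chunk(lines, chunk_tokens) do chunk
        append!(ids, [token_to_id[t] for t in chunk])
        push!(chunk_offsets, length(ids) + 1)
    end
    return ids, chunk_offsets
end

lines = ["the cat sat", "the dog sat down", "a cat ran"]  # stand-in for file lines
ids, chunk_offsets = two_pass_encode(lines)
println(ids)            # [7, 2, 6, 7, 3, 6, 4, 1, 2, 5]
println(chunk_offsets)  # [1, 5, 9, 11]
```

The first pass is needed because ids cannot be assigned until the full vocabulary is known; the second pass then only ever holds one chunk of raw text plus the compact integer output.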
Installing
The package is available from the General registry: run import Pkg; Pkg.add("KeemenaPreprocessing"), or press ] in the REPL to enter package mode and type add KeemenaPreprocessing. Then return to the REPL prompt and run using KeemenaPreprocessing.
For the development version: open the Julia REPL, enter package mode by pressing ], and run: add https://github.com/mantzaris/KeemenaPreprocessing.jl
Contributing to KeemenaPreprocessing.jl
Contributions are welcome, and collaboration is encouraged.
How to contribute
Reporting bugs
Please open a GitHub issue and include:
- Julia version
- KeemenaPreprocessing.jl version (from Project.toml or Pkg.status())
- A minimal reproducible example
- Expected behavior vs actual behavior with the error messages
Proposing changes
Open an issue first if the change is large or affects the public API, so we can agree on a direction before the work is done rather than discovering afterwards that a different plan would have been better.
Pull requests
- Fork the repository and create a feature branch
- Keep pull requests focused (one logical change per PR) as it makes review easier
- Add tests for bug fixes and new features; clear test names help
- Update documentation if behavior or API changes
- Ensure CI is green
Community guidelines
Please be respectful and constructive. This project follows the Julia Community Standards.
Owner
- Name: a.v.mantzaris
- Login: mantzaris
- Kind: user
- Location: USA
- Twitter: avmantzaris
- Repositories: 35
- Profile: https://github.com/mantzaris
Excited about the future of technology. Happy to participate in shaping that future through theory and practice.
JOSS Publication
KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation and Streaming Offset Bundling for Julia NLP
Authors
Tags
NLP Text Processing Tokenization Corpus Cleaning
GitHub Events
Total
- Create event: 111
- Commit comment event: 3
- Release event: 2
- Delete event: 1
- Pull request event: 2
- Fork event: 1
- Issues event: 3
- Watch event: 2
- Issue comment event: 2
- Push event: 83
Last Year
- Create event: 111
- Commit comment event: 3
- Release event: 2
- Delete event: 1
- Pull request event: 2
- Fork event: 1
- Issues event: 3
- Watch event: 2
- Issue comment event: 2
- Push event: 83
Issues and Pull Requests
Last synced: 24 days ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: less than a minute
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: less than a minute
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 2.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- JuliaTagBot (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: julia 1 total
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
juliahub.com: KeemenaPreprocessing
Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more
- Homepage: https://mantzaris.github.io/KeemenaPreprocessing.jl/dev/
- Documentation: https://docs.juliahub.com/General/KeemenaPreprocessing/stable/
- License: MIT
- Latest release: 0.1.1 (published 2 months ago)
Rankings
Dependencies
- actions/checkout v4 composite
- julia-actions/cache v2 composite
- julia-actions/julia-buildpkg v1 composite
- julia-actions/julia-docdeploy v1 composite
- julia-actions/julia-runtest v1 composite
- julia-actions/setup-julia v2 composite
- JuliaRegistries/TagBot v1 composite
