Recent Releases of hojichar
hojichar - 0.15.1
Fix segmentation fault on fugashi related filters
- Added `max_parse_chars` to `DiscardTooManyNouns` and `WordRepetitionRatioFilter` to prevent `fugashi` segfaults on very large texts.
- Defaults keep existing behavior for typical document sizes.
What's Changed
- Fix segfault on too long docs parsing by fugashi by @hoppiece in https://github.com/HojiChar/HojiChar/pull/83
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.0...v0.15.1
- Python
Published by hoppiece 7 months ago
hojichar - 0.15.0
What's New
Deduplication Module Overhaul — hojichar.filters.deduplication
GenerateDedupLSH
Now powered by the Rust-based engine rensa: computes MinHash + LSH at ≈ 2,000 docs/s (single-thread), ideal for large-scale near-duplicate detection.

New Deduplicators

| Class | Where it shines | Notes |
|-------|-----------------|-------|
| `InlineDeduplicator` | In-memory, single-process workloads | Handles ≈ 100 M docs on a beefy box. |
| `RedisDeduplicator` | Distributed pipelines (`hojichar.Parallel`, Spark, etc.) | Stores LSH keys in a pre-provisioned Redis server. |
| `RedisBloomDeduplicator` | Web-scale corpora (≳ 10B docs) | Uses RedisBloom scalable Bloom filters; trades a tiny FP rate for massive RAM savings. |
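For readers unfamiliar with MinHash + LSH, the idea can be sketched in pure Python. This is only an illustration of the general technique, not the rensa engine or HojiChar's implementation; `shingles`, `minhash`, `lsh_keys`, and `ToyInlineDeduplicator` are hypothetical names, and the parameters (5-character shingles, 64 hashes, 16 bands) are arbitrary choices for the example.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Return the set of character n-grams of the text."""
    return {text[i : i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text: str, num_perm: int = 64) -> list:
    """MinHash signature: for each salted hash, keep the minimum over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for g in grams
            )
        )
    return signature


def lsh_keys(signature: list, bands: int = 16) -> list:
    """Split the signature into bands; each band is hashed into one bucket key."""
    rows = len(signature) // bands
    keys = []
    for b in range(bands):
        band = repr(signature[b * rows : (b + 1) * rows]).encode("utf-8")
        keys.append(f"{b}:{hashlib.blake2b(band, digest_size=8).hexdigest()}")
    return keys


class ToyInlineDeduplicator:
    """Hypothetical in-memory deduplicator: a document is flagged as a
    duplicate if any of its LSH keys has been seen before."""

    def __init__(self) -> None:
        self.seen = set()

    def is_duplicate(self, text: str) -> bool:
        keys = lsh_keys(minhash(text))
        duplicate = any(k in self.seen for k in keys)
        self.seen.update(keys)
        return duplicate
```

In this sketch the set of seen keys lives in process memory; a distributed variant would keep the same keys in an external store instead, which is the role the table above assigns to Redis.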
What's Changed
- Faster implementation of Near-Deduplication with MinHash LSH by @hoppiece in https://github.com/HojiChar/HojiChar/pull/81
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.1...v0.15.0
- Python
Published by hoppiece 8 months ago
hojichar - 0.14.1
Fix https://github.com/HojiChar/HojiChar/pull/80
What's Changed
- Update README.md by @hoppiece in https://github.com/HojiChar/HojiChar/pull/78
- JSONDumper: Exclude '_initstats' from exported extras in JSON output by @hoppiece in https://github.com/HojiChar/HojiChar/pull/80
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.0...v0.14.1
- Python
Published by hoppiece 8 months ago
hojichar - 0.14.0
🚀 New Features in v0.14.0
Asynchronous Pipeline Support
- Added support for building asynchronous text-processing pipelines.
- You can now define non-blocking filters by extending the `AsyncFilter` class.
- Multiple asynchronous filters can be composed together using `AsyncCompose`.
- `AsyncCompose` also accepts traditional synchronous `Filter` classes; they are automatically wrapped and executed in an asynchronous manner.
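The composition idea described above can be sketched with plain `asyncio`. This is a minimal illustration under assumptions, not HojiChar's code: `ToyAsyncCompose`, `AsyncUpper`, and `SyncStrip` are hypothetical stand-ins, and running synchronous filters in a worker thread is just one plausible way to wrap them.

```python
import asyncio
import inspect


class AsyncUpper:
    """Hypothetical asynchronous filter."""

    async def apply(self, text: str) -> str:
        await asyncio.sleep(0)  # stands in for non-blocking I/O
        return text.upper()


class SyncStrip:
    """Hypothetical traditional synchronous filter."""

    def apply(self, text: str) -> str:
        return text.strip()


class ToyAsyncCompose:
    """Applies filters in order, awaiting async ones and wrapping sync ones."""

    def __init__(self, filters):
        self.filters = filters

    async def apply(self, text: str) -> str:
        for f in self.filters:
            if inspect.iscoroutinefunction(f.apply):
                text = await f.apply(text)
            else:
                # Run the synchronous filter in a worker thread so it
                # does not block the event loop.
                text = await asyncio.to_thread(f.apply, text)
        return text


pipeline = ToyAsyncCompose([SyncStrip(), AsyncUpper()])
result = asyncio.run(pipeline.apply("  hello  "))
```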
AsyncChatAPI — High-Throughput LLM Integration Example
- In recent years, core NLP processing has increasingly shifted outside the CPU, with LLMs (Large Language Models) becoming a key component.
- As a practical example of an asynchronous filter, we introduce the `AsyncChatAPI` class, a high-throughput LLM inference interface compatible with the OpenAI API.
- Using `AsyncChatAPI`, you can send up to 1K req/sec to an LLM endpoint with simple, declarative code:
```python
import os

from hojichar import AsyncCompose
from hojichar.filters.document_filters import JSONLoader, JSONDumper
from hojichar.utils.async_handlers import write_stream_to_file

# NOTE: AsyncChatAPI must be imported as well; its module path is not given in these notes.

async_pipeline = AsyncCompose(
    [
        JSONLoader(input_key="text"),
        AsyncChatAPI(
            model_id="gpt-4o",
            openai_endpoint_url="https://api.openai.com/v1",
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            max_concurrent_requests=128,
            output_key="llm_output",
            message_generator=lambda doc: [{"role": "user", "content": doc.text[:1000]}],
        ),
        JSONDumper(export_extras=True),
    ]
)

# The `await` below assumes an async context (e.g. inside an `async def` or a notebook).
with open("input.jsonl") as f:
    with async_pipeline:
        async_output_stream = (str(doc) async for doc in async_pipeline.apply_stream(f))
        await write_stream_to_file(async_output_stream, "output.jsonl", chunk_size=128)
```
Since AsyncChatAPI follows the OpenAI API specification, it can seamlessly interact with self-hosted LLM endpoints such as vLLM.
This enables you to build robust, high-throughput, and easy-to-maintain pipelines for "Chain of LLMs" style workflows, where multiple LLMs are orchestrated together in a single declarative pipeline.
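A cap on concurrent requests, like the `max_concurrent_requests=128` setting in the example, is commonly implemented with a semaphore. The following is a rough sketch of that pattern only; `fake_llm_call` and `run_bounded` are hypothetical helpers, no real endpoint is contacted, and nothing here is claimed to match how `AsyncChatAPI` works internally.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for an HTTP request to an LLM endpoint."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def run_bounded(prompts, max_concurrent_requests: int = 4):
    """Send all prompts, keeping at most `max_concurrent_requests` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def one(prompt: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_llm_call(prompt)
            finally:
                in_flight -= 1

    # gather preserves the order of the inputs in its results
    results = await asyncio.gather(*(one(p) for p in prompts))
    return results, peak


results, peak = asyncio.run(run_bounded([f"p{i}" for i in range(20)], 4))
```

Because the calls are awaited rather than run one after another, throughput is bounded by the concurrency limit and the endpoint's latency rather than by the CPU.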
What's Changed
- Async pipeline by @hoppiece in https://github.com/HojiChar/HojiChar/pull/72
- AsyncFilter: `AsyncChatAPI` by @hoppiece in https://github.com/HojiChar/HojiChar/pull/77
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.13.1...v0.14.0
- Python
Published by hoppiece 8 months ago
hojichar - 0.13.0
HojiChar v0.13.0 Release Notes
We are pleased to announce the release of HojiChar v0.13.0, introducing significant improvements to batch and stream processing, filter management, and statistics tracking.
Main PRs:
- https://github.com/HojiChar/HojiChar/pull/71
- https://github.com/HojiChar/HojiChar/pull/69
✨ New Features and Improvements
Unified Batch and Stream Processing for Filters and Pipelines
- The `Filter` and `Compose` classes now support both batch processing and stream processing, providing greater flexibility and efficiency for text processing pipelines.
- New methods added:
  - `apply_batch(documents: Sequence[Document])`: Enables optimized batch processing. Users can override this method to implement custom batch logic. By default, it applies the existing `apply` method iteratively.
  - `apply_stream(documents: Iterable[Document])`: Processes an iterable stream of `Document` objects.
- The `use_batch` flag in the `Filter` constructor controls whether `apply_stream` internally uses `apply_batch` for stream processing (default: `False`).
- The `batch_size` parameter defines the batch size when `use_batch` is enabled.
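The relationship between `apply`, `apply_batch`, and `apply_stream` described above can be sketched as follows. This is a simplified illustration that works on plain strings rather than `Document` objects; `ToyFilter` and `BatchSizeRecorder` are hypothetical classes, and the chunking logic is an assumption about how such a flag could behave, not HojiChar's source.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


class ToyFilter:
    def __init__(self, use_batch: bool = False, batch_size: int = 128) -> None:
        self.use_batch = use_batch
        self.batch_size = batch_size

    def apply(self, doc: str) -> str:
        return doc.lower()

    def apply_batch(self, docs: Sequence[str]) -> List[str]:
        # Default behavior: fall back to per-document apply.
        # Subclasses may override this with real batch logic.
        return [self.apply(d) for d in docs]

    def apply_stream(self, docs: Iterable[str]) -> Iterator[str]:
        if not self.use_batch:
            for d in docs:
                yield self.apply(d)
            return
        it = iter(docs)
        while True:
            # Pull up to batch_size documents from the stream at a time.
            batch = list(islice(it, self.batch_size))
            if not batch:
                break
            yield from self.apply_batch(batch)


class BatchSizeRecorder(ToyFilter):
    """Overrides apply_batch only to record the batch sizes it receives."""

    def __init__(self, **kwargs) -> None:
        super().__init__(**kwargs)
        self.batch_sizes: List[int] = []

    def apply_batch(self, docs: Sequence[str]) -> List[str]:
        self.batch_sizes.append(len(docs))
        return super().apply_batch(docs)


recorder = BatchSizeRecorder(use_batch=True, batch_size=2)
output = list(recorder.apply_stream(["A", "B", "C", "D", "E"]))
```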
Context Manager and Resource Management
- Filters now support resource cleanup via:
  - `shutdown()` method for explicitly releasing resources (e.g., closing database connections).
  - Context manager support (`with` statement) automatically calls `shutdown()` upon exit.
- The `Compose` class will propagate the `shutdown()` call to all contained filters.
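The cleanup behavior described above follows the standard Python context-manager protocol. Here is a minimal sketch of the pattern, assuming hypothetical `ToyResourceFilter` and `ToyCompose` classes rather than the library's own:

```python
class ToyResourceFilter:
    """Hypothetical filter that owns a resource (e.g. a connection)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def shutdown(self) -> None:
        # Release the resource explicitly.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.shutdown()
        return False  # do not swallow exceptions


class ToyCompose(ToyResourceFilter):
    """Hypothetical pipeline that propagates shutdown() to its filters."""

    def __init__(self, filters) -> None:
        super().__init__("compose")
        self.filters = filters

    def shutdown(self) -> None:
        for f in self.filters:
            f.shutdown()
        super().shutdown()


db_filter = ToyResourceFilter("db")
api_filter = ToyResourceFilter("api")
with ToyCompose([db_filter, api_filter]) as pipeline:
    open_inside = not db_filter.closed  # resources are still open here
```

Leaving the `with` block calls `shutdown()` on the pipeline, which in turn closes every contained filter, even if the body raised an exception.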
Refined Filter Control and Statistics Handling
Key control logic has been shifted from `Compose` to the `Filter` level for better modularity:
- Probabilistic filter application via the `p` parameter.
- Skipping rejected documents controlled by `skip_rejected`.
- Random state management:
  - Users can pass an integer seed or a `numpy.random.Generator` instance via the `random_state` parameter.
  - If unspecified, filters inherit the shared random generator from `Compose`.
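To make the three controls concrete, here is a small sketch of how they could interact. It is an assumption-laden illustration: `ToyDoc` and `ToyDiscardShort` are hypothetical, and the standard library's `random.Random` is used as a stand-in for `numpy.random.Generator` so the example has no third-party dependencies.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyDoc:
    text: str
    is_rejected: bool = False


class ToyDiscardShort:
    """Hypothetical filter that rejects texts shorter than 5 characters."""

    def __init__(self, p: float = 1.0, skip_rejected: bool = True, random_state=None) -> None:
        self.p = p
        self.skip_rejected = skip_rejected
        # Accept either a ready-made generator or an integer seed (or None).
        if isinstance(random_state, random.Random):
            self.rng = random_state
        else:
            self.rng = random.Random(random_state)

    def apply(self, doc: ToyDoc) -> ToyDoc:
        if len(doc.text) < 5:
            doc.is_rejected = True
        return doc

    def __call__(self, doc: ToyDoc) -> ToyDoc:
        if self.skip_rejected and doc.is_rejected:
            return doc  # already rejected upstream: leave untouched
        if self.rng.random() >= self.p:
            return doc  # filter is applied only with probability p
        return self.apply(doc)


def decisions(seed: int) -> list:
    """Rejection decisions for 50 short documents under a fixed seed."""
    f = ToyDiscardShort(p=0.5, random_state=seed)
    return [f(ToyDoc("abc")).is_rejected for _ in range(50)]
```

Passing the same generator object to several filters would give them a shared random stream, which is one way to read the "inherit the shared random generator" behavior described above.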
Statistics Tracking Overhaul:
- New `hojichar.core.models.Statistics` class replaces the legacy `inspection.py` statistics system (now deprecated).
- Each `Filter` instance independently tracks its own statistics.
- Access statistics via:
  - `Filter.get_statistics()` returns a `Statistics` object.
  - `Filter.get_statistics_map()` returns a dictionary representation.
- The `Compose` class provides:
  - `get_total_statistics()` for aggregated `Statistics`.
  - `get_total_statistics_map()` for aggregated statistics as dictionaries.
Statistics Object Structure:
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    name: Optional[str] = None
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    diff_bytes: int = 0
    diff_chars: int = 0
    cumulative_time_ns: int = 0
```
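As a rough illustration of how counters like these might be maintained around a single filter call, consider the sketch below. `ToyStats` and `tracked_apply` are hypothetical; the sketch covers only a subset of the fields, and the real update rules (for example how `diff_bytes` and `diff_chars` are computed) are not described in these notes.

```python
import time
from dataclasses import asdict, dataclass


@dataclass
class ToyStats:
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    cumulative_time_ns: int = 0


def tracked_apply(stats: ToyStats, fn, text: str):
    """Call fn(text) and update stats; fn returns None to discard the text."""
    stats.input_num += 1
    stats.input_chars += len(text)
    stats.input_bytes += len(text.encode("utf-8"))

    start = time.perf_counter_ns()
    result = fn(text)
    stats.cumulative_time_ns += time.perf_counter_ns() - start

    if result is None:
        stats.discard_num += 1
    else:
        stats.output_num += 1
        stats.output_chars += len(result)
        stats.output_bytes += len(result.encode("utf-8"))
    return result


stats = ToyStats()
tracked_apply(stats, str.strip, " あい ")    # kept, whitespace removed
tracked_apply(stats, lambda t: None, "x")   # discarded
stats_map = asdict(stats)                   # dictionary representation
```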
🚨 Deprecations and Planned Removals
The following classes and modules are scheduled for deprecation starting with v1.0.0:
- `Token` and `TokenFilter`
  - And token-related implementations in `Document`. Use `Document.extras` to handle such information.
- The `inspection.py` module for statistics.
  - Use the `Statistics` class to get stats.
  - The `Compose.statistic` and `Compose.statistic_obj` properties will be removed in the future.
Please migrate to the updated Statistics class and new filter methods to ensure future compatibility.
🛠️ For developers
- The project has migrated from Poetry to uv for package management and building.
- Python
Published by hoppiece 8 months ago
hojichar - 0.11.1
Changes
This update adds the `__repr__` method to the `Document` class, enhancing object representation for easier debugging.
For instance, when you inspect a `Document` object, you'll now see a detailed representation produced by the `__repr__` method.
```python
from hojichar import Document

doc = Document("Hello, world", extras={"date": "2024-10-03"})

repr(doc)
# "Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})"

eval(repr(doc))
# Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})
```
- Python
Published by hoppiece over 1 year ago
hojichar - 0.11.0
What's New in This Release
We're excited to introduce a series of new filters in this version. They are particularly useful for handling noisy datasets such as Common Crawl.
New Filters Added:
`hojichar.filters.document_filters`:
- `DiscardTooManyNouns`: Removes "word salad" with excessive nouns in Japanese texts.
- `CharRepetitionRatioFilter`: Filters out entries based on character repetition ratios to reduce noise.
- `WordRepetitionRatioFilter`: Discards entries with repetitive word patterns in Japanese texts.
- `DiscardTooManySpecialTokens`: Cleans up entries overloaded with special tokens or symbols, judged as noise.
- `SingleCharacterRepetitionFilter`: Removes entries where single characters are overly repeated.
- `DiscardTooManyEndingEllipsis`: Eliminates entries ending with multiple ellipses such as `...`.
- `DiscardTooShortLines`: Filters out repetitions of unusually short lines.

`hojichar.filters.language_identification`:
- `LanguageIdentificationByFastText`: Employs FastText for high-performance language identification.
- `AcceptJapaneseByFastText`: Japanese LID filter.
Installation Notes:
- Some of these new filters require additional dependency libraries. Install hojichar with all extras by running:
  ```bash
  pip install 'hojichar[all]'
  ```
- The `mmh3` package, which the `hojichar.filters.deduplication` module depends on, has also been moved to the extras package. To use it, you will need to specify `hojichar[all]` in the same way as above.
Additional Updates:
- Full Support for Python 3.12: We now officially support Python 3.12, ensuring compatibility and enhanced performance across more environments.
- Python
Published by hoppiece over 1 year ago
hojichar -
Fixes
- Fix access when the `extras` argument is passed to the `Document`: https://github.com/HojiChar/HojiChar/pull/55
- Python
Published by hoppiece over 1 year ago
hojichar - 0.10.0
New Features in Document and JSONLoader
This release is mainly about https://github.com/HojiChar/HojiChar/pull/48
- We've enhanced the Document class by introducing an extras attribute for storing additional metadata.
- This metadata can be utilized in filters.
- We've added an extra_keys argument to the JSONLoader to facilitate the handling of these extra metadata fields.
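The idea of carrying metadata alongside the text can be sketched like this. It is a simplified stand-in, not the library's code: `ToyDocument` and `load_json_line` are hypothetical, and only the `extras` attribute and the `extra_keys` argument names are taken from the notes above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    text: str
    extras: dict = field(default_factory=dict)


def load_json_line(line: str, key: str = "text", extra_keys=()) -> ToyDocument:
    """Parse one JSON line, keeping the listed extra fields as metadata."""
    data = json.loads(line)
    extras = {k: data[k] for k in extra_keys if k in data}
    return ToyDocument(text=data[key], extras=extras)


line = '{"text": "hi", "url": "https://example.com", "lang": "ja"}'
doc = load_json_line(line, extra_keys=["url"])
```

A downstream filter could then consult `doc.extras` (for example the source URL) when deciding whether to keep a document.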
Acknowledgments
Thanks to @shinofumijp for their valuable contributions to these features!
- Python
Published by hoppiece over 1 year ago
hojichar - 0.8.0
Changes
- Statistics are now displayed during processing.
  - The number of MB processed is now displayed while redirecting standard output or writing with the `--output` option.
- Added `--input` option to HojiChar CLI. It takes a file path as an argument and specifies an input file.
  - A progress bar and the remaining expected time are displayed when this option is used.
- `tqdm` has been added to the dependencies for the above changes.
- Python
Published by hoppiece over 2 years ago
hojichar - 0.6.0
Changes
Changes were made to make the analysis of discarded text more convenient.
- ignore_filtered flag of the hojichar.Compose is abolished.
- skip_rejected flag is added to the hojichar.Filter.
- --all option is added to the CLI.
- dump_reason flag is added to hojichar.document_filters.JSONDumper.
- The Filter and its primitive member variables are now logged when the filter discards a document.
- Discarded text is no longer converted to blank characters.
With these changes, we can analyze the following outputs
Profile myprofile.py:
```python
from hojichar import Compose
from hojichar.filters import document_filters as dflt

FILTER = Compose(
    [
        dflt.JSONLoader(key="text", ignore=True),
        dflt.DiscardAdultContentJa(p=0.9, ignore_confused=True),
        dflt.JSONDumper(dump_reason=True, skip_rejected=False),
    ],
)
```
And
```bash
cat dirtytexts.jsonl | hojichar -p myprofile.py --all | jq
```
Get such lines:
```json
{
  "text": "劇訳表示。 : 経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n< 【防衛省】15年度予算、概算要求が過去最高額へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n<防災の日>経産省トイレットペーパー備蓄PR「1か月分」",
  "is_rejected": true,
  "reason": {
    "name": "DiscardAdultContentJa",
    "p": 0.9,
    "matched_text": "えろ",
    "matched_text_neighbor": "へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆"
  }
}
```
- Python
Published by hoppiece over 2 years ago
hojichar - 0.5.0
What's New
This release mainly incorporates the new features in #12
Main new feature
- The HojiChar CLI 🎉
- The definition of the profile passed to the CLI
Other changes
Some new filters are added.
- hojichar.document_filters.Sleep
- hojichar.document_filters.JSONDumper
- Python
Published by hoppiece over 2 years ago
hojichar - 0.3.0
Modified the module structure of some filters.
- For practicality, we abolish the filter modules cleaners and normalization. These general text preprocessors are now included in document_filters.
- On the other hand, for clarity, deduplication and tokenization, which are used for specific purposes, have been separated from document_filters.
Several bugs were fixed.
- Each filter module is now imported correctly from hojichar.
- Python
Published by hoppiece about 3 years ago