Recent Releases of hojichar
hojichar - 0.15.1
Fix segmentation fault on fugashi related filters
- Added `max_parse_chars` to `DiscardTooManyNouns` and `WordRepetitionRatioFilter` to prevent `fugashi` segfaults on very large texts.
- Defaults keep existing behavior for typical document sizes.
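The idea of capping parser input can be sketched as follows. This is only an illustration under the assumption that the cap truncates the text before it reaches the morphological analyzer; `toy_tokenize` and `noun_ratio` are hypothetical stand-ins, and only the parameter name `max_parse_chars` comes from the note above.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (surface, part-of-speech)


def toy_tokenize(text: str) -> List[Token]:
    """Stand-in for a real tagger: capitalized words count as nouns."""
    return [(w, "NOUN" if w[:1].isupper() else "OTHER") for w in text.split()]


def noun_ratio(
    text: str,
    tokenize: Callable[[str], List[Token]] = toy_tokenize,
    max_parse_chars: Optional[int] = None,
) -> float:
    """Share of noun tokens, parsing at most `max_parse_chars` characters."""
    if max_parse_chars is not None:
        text = text[:max_parse_chars]  # bound the tagger's input size
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(pos == "NOUN" for _, pos in tokens) / len(tokens)
```

With `max_parse_chars=None` the whole text is parsed, which mirrors the statement that defaults keep existing behavior.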
What's Changed
- Fix segfault on too long docs parsing by fugashi by @hoppiece in https://github.com/HojiChar/HojiChar/pull/83
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.0...v0.15.1
- Python
Published by hoppiece 11 months ago
hojichar - 0.15.0
What's New
Deduplication Module Overhaul — hojichar.filters.deduplication
GenerateDedupLSH
Now powered by the Rust-based engine rensa: computes MinHash + LSH at ≈ 2 000 docs / s (single-thread), ideal for large-scale near-duplicate detection.

New Deduplicators

| Class | Where it shines | Notes |
|-------|-----------------|-------|
| `InlineDeduplicator` | In-memory, single-process workloads | Handles ≈ 100 M docs on a beefy box. |
| `RedisDeduplicator` | Distributed pipelines (`hojichar.Parallel`, Spark, etc.) | Stores LSH keys in a pre-provisioned Redis server. |
| `RedisBloomDeduplicator` | Web-scale corpora (≳ 10B docs) | Uses RedisBloom scalable Bloom filters; trades a tiny FP rate for massive RAM savings. |
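For readers new to the technique, here is a minimal pure-Python sketch of MinHash with LSH banding and an in-memory set of seen keys. It does not use rensa or any HojiChar class; the function names, shingle size, and band count are illustrative assumptions.

```python
import hashlib
from typing import List, Set, Tuple

BandKey = Tuple[int, Tuple[int, ...]]


def shingles(text: str, n: int = 5) -> Set[str]:
    """Character n-grams of the text."""
    return {text[i : i + n] for i in range(max(1, len(text) - n + 1))}


def _hash(gram: str, seed: int) -> int:
    # A seeded 64-bit hash; each seed acts as one "permutation".
    salt = seed.to_bytes(4, "big")
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(text: str, num_perm: int = 64) -> List[int]:
    """Signature: the minimum hash of the shingle set under each seed."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def lsh_keys(signature: List[int], bands: int = 16) -> List[BandKey]:
    """Split the signature into bands; each band is one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows : (b + 1) * rows])) for b in range(bands)]


def find_near_duplicates(docs: List[str]) -> List[int]:
    """Indices of documents sharing at least one bucket with an earlier one."""
    seen: Set[BandKey] = set()
    duplicates = []
    for i, doc in enumerate(docs):
        keys = lsh_keys(minhash(doc))
        if any(k in seen for k in keys):
            duplicates.append(i)
        seen.update(keys)
    return duplicates
```

Swapping the in-process `seen` set for a shared store is what lets the same key scheme work across processes, which is the role the table assigns to the Redis-backed variants.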
What's Changed
- Faster implementation of Near-Deduplication with MinHash LSH by @hoppiece in https://github.com/HojiChar/HojiChar/pull/81
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.1...v0.15.0
Published by hoppiece 11 months ago
hojichar - 0.14.1
Fix https://github.com/HojiChar/HojiChar/pull/80
What's Changed
- Update README.md by @hoppiece in https://github.com/HojiChar/HojiChar/pull/78
- JSONDumper: Exclude '_initstats' from exported extras in JSON output by @hoppiece in https://github.com/HojiChar/HojiChar/pull/80
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.0...v0.14.1
Published by hoppiece 11 months ago
hojichar - 0.14.0
🚀 New Features in v0.14.0
Asynchronous Pipeline Support
- Added support for building asynchronous text-processing pipelines.
- You can now define non-blocking filters by extending the `AsyncFilter` class.
- Multiple asynchronous filters can be composed together using `AsyncCompose`.
- `AsyncCompose` also accepts traditional synchronous `Filter` classes; they are automatically wrapped and executed in an asynchronous manner.
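The composition idea can be sketched with plain `asyncio`. `ToyAsyncCompose` and `as_async` are hypothetical names, and running synchronous steps in a worker thread is an assumption about how such wrapping could work, not a description of HojiChar's internals.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, List, Union

SyncFn = Callable[[str], str]
AsyncFn = Callable[[str], Awaitable[str]]


def as_async(fn: Union[SyncFn, AsyncFn]) -> AsyncFn:
    """Return `fn` unchanged if it is a coroutine function, else wrap it."""
    if inspect.iscoroutinefunction(fn):
        return fn

    async def wrapper(text: str) -> str:
        # Run the blocking callable in a worker thread so the loop stays free.
        return await asyncio.to_thread(fn, text)

    return wrapper


class ToyAsyncCompose:
    """Hypothetical pipeline: applies each step in order to one document."""

    def __init__(self, steps: List[Union[SyncFn, AsyncFn]]) -> None:
        self.steps = [as_async(s) for s in steps]

    async def apply(self, text: str) -> str:
        for step in self.steps:
            text = await step(text)
        return text


async def slow_upper(text: str) -> str:
    await asyncio.sleep(0.01)  # stands in for non-blocking I/O
    return text.upper()


def strip_spaces(text: str) -> str:  # an ordinary synchronous step
    return text.strip()
```

Usage: `asyncio.run(ToyAsyncCompose([strip_spaces, slow_upper]).apply("  hello "))` mixes one synchronous and one asynchronous step in a single pipeline.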
AsyncChatAPI — High-Throughput LLM Integration Example
- In recent years, core NLP processing has increasingly shifted outside the CPU, with LLMs (Large Language Models) becoming a key component.
- As a practical example of an asynchronous filter, we introduce the `AsyncChatAPI` class, a high-throughput LLM inference interface compatible with the OpenAI API.
- Using `AsyncChatAPI`, you can send up to 1K req/sec to an LLM endpoint with simple, declarative code:
```python
import os

from hojichar import AsyncCompose
from hojichar.filters.document_filters import JSONLoader, JSONDumper
from hojichar.utils.async_handlers import write_stream_to_file

# NOTE: AsyncChatAPI must be imported as well; this note does not show its import path.

async_pipeline = AsyncCompose(
    [
        JSONLoader(input_key="text"),
        AsyncChatAPI(
            model_id="gpt-4o",
            openai_endpoint_url="https://api.openai.com/v1",
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            max_concurrent_requests=128,
            output_key="llm_output",
            message_generator=lambda doc: [{"role": "user", "content": doc.text[:1000]}],
        ),
        JSONDumper(export_extras=True),
    ]
)

# Top-level `await` needs an async context (inside `async def` or an async REPL).
with open("input.jsonl") as f:
    with async_pipeline:
        async_output_stream = (str(doc) async for doc in async_pipeline.apply_stream(f))
        await write_stream_to_file(async_output_stream, "output.jsonl", chunk_size=128)
```
Since AsyncChatAPI follows the OpenAI API specification, it can seamlessly interact with self-hosted LLM endpoints such as vLLM.
This enables you to build robust, high-throughput, and easy-to-maintain pipelines for "Chain of LLMs" style workflows, where multiple LLMs are orchestrated together in a single declarative pipeline.
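High request throughput of this kind usually rests on keeping many requests in flight while capping their number, like the limit the example sets to 128. A minimal sketch of that pattern with `asyncio.Semaphore` follows; `fake_llm` and `run_bounded` are hypothetical, and no real endpoint is called.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fake_llm(prompt: str) -> str:
    """Placeholder for an HTTP call to a chat-completions endpoint."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def run_bounded(
    prompts: List[str],
    call: Callable[[str], Awaitable[str]] = fake_llm,
    max_concurrent_requests: int = 128,
) -> List[str]:
    """Issue all requests concurrently, never more than the cap at a time."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await call(prompt)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(guarded(p) for p in prompts)))
```

Because the coroutines spend most of their time waiting on the network, a single thread can keep the endpoint saturated; the semaphore only prevents overload.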
What's Changed
- Async pipeline by @hoppiece in https://github.com/HojiChar/HojiChar/pull/72
- AsyncFilter: `AsyncChatAPI` by @hoppiece in https://github.com/HojiChar/HojiChar/pull/77
Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.13.1...v0.14.0
Published by hoppiece 12 months ago
hojichar - 0.13.0
HojiChar v0.13.0 Release Notes
We are pleased to announce the release of HojiChar v0.13.0, introducing significant improvements to batch and stream processing, filter management, and statistics tracking.
Main PR
- https://github.com/HojiChar/HojiChar/pull/71
- https://github.com/HojiChar/HojiChar/pull/69
✨ New Features and Improvements
Unified Batch and Stream Processing for Filters and Pipelines
- The `Filter` and `Compose` classes now support both batch processing and stream processing, providing greater flexibility and efficiency for text processing pipelines.
- New methods added:
  - `apply_batch(documents: Sequence[Document])`: Enables optimized batch processing. Users can override this method to implement custom batch logic. By default, it applies the existing `apply` method iteratively.
  - `apply_stream(documents: Iterable[Document])`: Processes an iterable stream of `Document` objects.
- The `use_batch` flag in the `Filter` constructor controls whether `apply_stream` internally uses `apply_batch` for stream processing (default: `False`).
- The `batch_size` parameter defines the batch size when `use_batch` is enabled.
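The relationship between the three methods can be sketched as below. `ToyFilter` is a hypothetical class operating on plain strings; it mirrors the described behavior but is not HojiChar's `Filter`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


class ToyFilter:
    """Hypothetical filter with the apply / apply_batch / apply_stream split."""

    def __init__(self, use_batch: bool = False, batch_size: int = 128) -> None:
        self.use_batch = use_batch
        self.batch_size = batch_size

    def apply(self, doc: str) -> str:
        return doc.lower()

    def apply_batch(self, docs: Sequence[str]) -> List[str]:
        # Default: per-document processing; override for real batching.
        return [self.apply(d) for d in docs]

    def apply_stream(self, docs: Iterable[str]) -> Iterator[str]:
        if not self.use_batch:
            for doc in docs:
                yield self.apply(doc)
            return
        iterator = iter(docs)
        # Pull fixed-size chunks lazily so the stream is never fully materialized.
        while batch := list(islice(iterator, self.batch_size)):
            yield from self.apply_batch(batch)
```

A subclass that overrides `apply_batch` (for example to call a vectorized model) benefits automatically once `use_batch=True` is set.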
Context Manager and Resource Management
- Filters now support resource cleanup via:
  - `shutdown()` method for explicitly releasing resources (e.g., closing database connections).
  - Context manager support (`with` statement) automatically calls `shutdown()` upon exit.
- The `Compose` class will propagate the `shutdown()` call to all contained filters.
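A minimal sketch of this cleanup pattern, using hypothetical `ToyResourceFilter` and `ToyPipeline` classes rather than HojiChar's own:

```python
from typing import List


class ToyResourceFilter:
    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def shutdown(self) -> None:
        """Release held resources; here we only flip a flag."""
        self.closed = True

    def __enter__(self) -> "ToyResourceFilter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.shutdown()  # runs even if the body raised


class ToyPipeline(ToyResourceFilter):
    def __init__(self, filters: List[ToyResourceFilter]) -> None:
        super().__init__("pipeline")
        self.filters = filters

    def shutdown(self) -> None:
        for f in self.filters:  # propagate to every contained filter
            f.shutdown()
        super().shutdown()
```

Wrapping the pipeline in a single `with` block is then enough to close every filter it contains.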
Refined Filter Control and Statistics Handling
Key control logic has been shifted from `Compose` to the `Filter` level for better modularity:
- Probabilistic filter application via the `p` parameter.
- Skipping rejected documents controlled by `skip_rejected`.
- Random state management:
  - Users can pass an integer seed or a `numpy.random.Generator` instance via the `random_state` parameter.
  - If unspecified, filters inherit the shared random generator from `Compose`.
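The control flow these parameters describe can be sketched as follows. `ToyGate` and `Doc` are hypothetical, and the standard library's `random.Random` is used here in place of `numpy.random.Generator` to keep the example dependency-free.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Doc:
    text: str
    is_rejected: bool = False


class ToyGate:
    """Hypothetical wrapper deciding whether a filter function runs at all."""

    def __init__(
        self,
        fn: Callable[[Doc], Doc],
        p: float = 1.0,
        skip_rejected: bool = True,
        random_state: Optional[int] = None,
    ) -> None:
        self.fn = fn
        self.p = p
        self.skip_rejected = skip_rejected
        self.rng = random.Random(random_state)  # seeded for reproducibility

    def __call__(self, doc: Doc) -> Doc:
        if self.skip_rejected and doc.is_rejected:
            return doc  # already discarded upstream: leave untouched
        if self.rng.random() >= self.p:
            return doc  # filter not applied this time
        return self.fn(doc)
```

Two gates built with the same `random_state` make the same apply-or-skip decisions, which is what makes sampled filtering reproducible.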
Statistics Tracking Overhaul:
- New `hojichar.core.models.Statistics` class replaces the legacy `inspection.py` statistics system (now deprecated).
- Each `Filter` instance independently tracks its own statistics.
- Access statistics via:
  - `Filter.get_statistics()` returns a `Statistics` object.
  - `Filter.get_statistics_map()` returns a dictionary representation.
- The `Compose` class provides:
  - `get_total_statistics()` for aggregated `Statistics`.
  - `get_total_statistics_map()` for aggregated statistics as dictionaries.
Statistics Object Structure:
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    name: Optional[str] = None
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    diff_bytes: int = 0
    diff_chars: int = 0
    cumulative_time_ns: int = 0
```
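As an illustration of what these counters allow, the sketch below derives a discard rate and a throughput figure. The dataclass is re-declared locally from the field list above so the snippet runs on its own; `summarize` is a hypothetical helper, and the interpretation of each field is inferred from its name.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Statistics:  # local copy of the documented fields, not an import
    name: Optional[str] = None
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    diff_bytes: int = 0
    diff_chars: int = 0
    cumulative_time_ns: int = 0


def summarize(s: Statistics) -> Dict[str, float]:
    """Derived ratios, guarding against empty input and zero elapsed time."""
    seconds = s.cumulative_time_ns / 1e9
    return {
        "discard_rate": s.discard_num / s.input_num if s.input_num else 0.0,
        "docs_per_sec": s.input_num / seconds if seconds else 0.0,
    }
```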
🚨 Deprecations and Planned Removals
The following classes and modules are scheduled for deprecation as of v1.0.0:
- `Token` and `TokenFilter`
  - And token-related implementations in `Document`. Use `Document.extras` to hold such information.
- The `inspection.py` module for statistics.
  - Use the `Statistics` class to get stats.
  - The `Compose.statistic` and `Compose.statistic_obj` properties will be removed in the future.
Please migrate to the updated Statistics class and new filter methods to ensure future compatibility.
🛠️ For developers
- The project has migrated from Poetry to uv for package management and building.
Published by hoppiece 12 months ago
hojichar - 0.11.1
Changes
This update adds a `__repr__` method to the `Document` class, making objects easier to inspect while debugging.
For instance, calling `repr()` on a `Document` object now returns a detailed representation:
```python
>>> from hojichar import Document
>>> doc = Document("Hello, world", extras={"date": "2024-10-03"})
>>> repr(doc)
"Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})"
>>> eval(repr(doc))
Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})
```
Published by hoppiece over 1 year ago
hojichar - 0.11.0
What's New in This Release
We're excited to introduce a series of new filters in this version. They are particularly useful for handling noisy datasets such as Common Crawl.
New Filters Added:
hojichar.filters.document_filters:
- `DiscardTooManyNouns`: Removes "word salad" with excessive nouns in Japanese texts.
- `CharRepetitionRatioFilter`: Filters out entries based on character repetition ratios to reduce noise.
- `WordRepetitionRatioFilter`: Discards entries with repetitive word patterns in Japanese texts.
- `DiscardTooManySpecialTokens`: Cleans up entries overloaded with special tokens or symbols, judged as noise.
- `SingleCharacterRepetitionFilter`: Removes entries where single characters are overly repeated.
- `DiscardTooManyEndingEllipsis`: Eliminates entries ending with multiple ellipses such as `...`.
- `DiscardTooShortLines`: Filters out repetitions of unusually short lines.
hojichar.filters.language_identification:
- `LanguageIdentificationByFastText`: Employs FastText for high-performance language identification.
- `AcceptJapaneseByFastText`: Japanese LID filter.
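To give a feel for repetition-based filtering, here is one simple way a character repetition ratio could be defined. This is an illustrative metric with an arbitrary threshold, not the formula used by `CharRepetitionRatioFilter`; both function names are hypothetical.

```python
from collections import Counter


def char_repetition_ratio(text: str, n: int = 5) -> float:
    """Fraction of character n-gram occurrences whose n-gram appears more than once."""
    grams = [text[i : i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0  # text shorter than n: nothing to measure
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def keep(text: str, threshold: float = 0.3) -> bool:
    """Keep the text only if its repetition ratio stays at or below the threshold."""
    return char_repetition_ratio(text) <= threshold
```

Boilerplate-heavy crawl text tends to score high on such a ratio, while ordinary prose scores low.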
Installation Notes:
- Some of these new filters require additional dependency libraries. To get them, install hojichar with all extras:
  ```bash
  pip install 'hojichar[all]'
  ```
- The `mmh3` package, which the `hojichar.filters.deduplication` module depends on, has also been added to the extras package. To use it, specify `hojichar[all]` in the same way as above.
Additional Updates:
- Full Support for Python 3.12: We now officially support Python 3.12, ensuring compatibility and enhanced performance across more environments.
Published by hoppiece over 1 year ago
hojichar -
Fixes
- Fix access when the `extras` argument is passed to the `Document`: https://github.com/HojiChar/HojiChar/pull/55
Published by hoppiece almost 2 years ago
hojichar - 0.10.0
New Features in Document and JSONLoader
This release is mainly about https://github.com/HojiChar/HojiChar/pull/48
- We've enhanced the Document class by introducing an extras attribute for storing additional metadata.
- This metadata can be utilized in filters.
- We've added an extra_keys argument to the JSONLoader to facilitate the handling of these extra metadata fields.
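The flow from JSON fields to `extras` to a metadata-aware filter can be sketched as below. `ToyDocument`, `load_json_line`, and `is_recent` are hypothetical stand-ins written with the standard library; they only mimic the roles of `Document`, `JSONLoader`, and a filter.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class ToyDocument:
    text: str
    extras: Dict[str, Any] = field(default_factory=dict)


def load_json_line(line: str, key: str = "text", extra_keys: Iterable[str] = ()) -> ToyDocument:
    """Parse one JSON line; copy the listed fields into `extras` when present."""
    record = json.loads(line)
    extras = {k: record[k] for k in extra_keys if k in record}
    return ToyDocument(text=record[key], extras=extras)


def is_recent(doc: ToyDocument, since: str = "2024-01-01") -> bool:
    """Example predicate that reads metadata instead of the text."""
    # ISO dates compare correctly as strings; a missing date counts as old.
    return doc.extras.get("date", "") >= since
```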
Acknowledgments
Thanks to @shinofumijp for their valuable contributions to these features!
Published by hoppiece almost 2 years ago
hojichar - 0.8.0
Changes
- Statistics are now displayed during processing.
  - The number of MB processed is now displayed while redirecting standard output or writing with the `--output` option.
- Added `--input` option to HojiChar CLI. It takes a file path as an argument and specifies an input file.
  - A progress bar and the remaining expected time are displayed when this option is used.
- `tqdm` has been added to the dependencies for the above changes.
Published by hoppiece almost 3 years ago
hojichar - 0.6.0
Changes
Changes were made to make the analysis of discarded text more convenient.
- ignore_filtered flag of the hojichar.Compose is abolished.
- skip_rejected flag is added to the hojichar.Filter.
- --all option is added to the CLI.
- dump_reason flag is added to hojichar.document_filters.JSONDumper.
- The Filter and its primitive member variables are now logged when the filter discards a document.
- Discarded text is no longer converted to blank characters.
With these changes, we can analyze outputs such as the following.
Profile myprofile.py:
```python
from hojichar import Compose
from hojichar.filters import document_filters as dflt

FILTER = Compose(
    [
        dflt.JSONLoader(key="text", ignore=True),
        dflt.DiscardAdultContentJa(p=0.9, ignore_confused=True),
        dflt.JSONDumper(dump_reason=True, skip_rejected=False),
    ],
)
```

And

```bash
cat dirtytexts.jsonl | hojichar -p myprofile.py --all | jq
```

Get such lines:

```json
{
  "text": "劇訳表示。 : 経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n< 【防衛省】15年度予算、概算要求が過去最高額へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n<防災の日>経産省トイレットペーパー備蓄PR「1か月分」",
  "is_rejected": true,
  "reason": {
    "name": "DiscardAdultContentJa",
    "p": 0.9,
    "matched_text": "えろ",
    "matched_text_neighbor": "へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆"
  }
}
```
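Once reasons are dumped, a few lines of Python are enough to see which filters discard the most. The sketch below tallies the `reason.name` field shown in the sample record; `count_reasons` is a hypothetical helper, and it assumes that records without a rejection carry no `reason` or an empty one.

```python
import json
from collections import Counter
from typing import Iterable


def count_reasons(lines: Iterable[str]) -> Counter:
    """Tally the `reason.name` of every dumped record that carries a reason."""
    tally: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        reason = json.loads(line).get("reason")
        if reason:
            tally[reason.get("name", "unknown")] += 1
    return tally
```

Passing an open file handle for the JSONL output works directly, since the function only iterates over lines.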
Published by hoppiece almost 3 years ago
hojichar - 0.5.1
What's new
- Major revision of the README. We have tried to improve the English content and make it more understandable to newcomers. The README is still incomplete, and we will continue to improve it.
- Cleaned up the doc/ directory.
Published by hoppiece about 3 years ago
hojichar - 0.5.0
What's New
This release mainly incorporates the new features in #12
Main new feature
- The HojiChar CLI 🎉
- The definition of the profile passed to the CLI
Other changes
Some new filters are added.
- hojichar.document_filters.Sleep
- hojichar.document_filters.JSONDumper
Published by hoppiece about 3 years ago
hojichar - 0.3.0
Modified the module structure of some filters.
- For practicality, we abolish the filter modules cleaners and normalization. These general text preprocessors are now included in document_filters.
- On the other hand, for clarity, deduplication and tokenization, which are used for specific purposes, have been separated from document_filters.
Several bugs were fixed.
- Each filter module is now imported correctly from hojichar.
Published by hoppiece over 3 years ago