hojichar - 0.15.4

What's Changed

Adapt profile loading to AsyncCompose by @hoppiece in https://github.com/HojiChar/HojiChar/pull/86

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.3...v0.15.4

- Python
Published by hoppiece 7 months ago

hojichar - v0.15.3

What's Changed

Add deprecated statistics properties for compatibility with previous version. by @hoppiece in https://github.com/HojiChar/HojiChar/pull/85

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.2...v0.15.3

- Python
Published by hoppiece 7 months ago

hojichar - 0.15.2

What's Changed

Support Big-endian CPU and fix some tiny bug about deduplication by @hoppiece in https://github.com/HojiChar/HojiChar/pull/84

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.1...v0.15.2

- Python
Published by hoppiece 7 months ago

hojichar - 0.15.1

Fix segmentation fault on `fugashi` related filters

Added max_parse_chars to DiscardTooManyNouns and WordRepetitionRatioFilter to prevent fugashi segfaults on very large texts.
Defaults keep existing behavior for typical document sizes.

What's Changed

Fix segfault on too long docs parsing by fugashi by @hoppiece in https://github.com/HojiChar/HojiChar/pull/83

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.15.0...v0.15.1

- Python
Published by hoppiece 7 months ago

hojichar - 0.15.0

What's New

Deduplication Module Overhaul — hojichar.filters.deduplication

GenerateDedupLSH
Now powered by the Rust‑based engine rensa:
computes MinHash + LSH at ≈ 2 000 docs / s (single‑thread) — ideal for large‑scale near‑duplicate detection.
New Deduplicators

| Class | Where it shines | Notes |
|-------|-----------------|-------|
| InlineDeduplicator | In-memory, single-process workloads | Handles ≈ 100 M docs on a beefy box. |
| RedisDeduplicator | Distributed pipelines (hojichar.Parallel, Spark, etc.) | Stores LSH keys in a pre-provisioned Redis server. |
| RedisBloomDeduplicator | Web-scale corpora (≳ 10B docs) | Uses RedisBloom scalable Bloom filters, trading a tiny false-positive rate for massive RAM savings. |
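To make the MinHash + LSH idea concrete, here is a minimal pure-Python sketch of near-duplicate detection with an in-memory key set. It is not how rensa or HojiChar implement it; the names `shingles`, `minhash_signature`, `lsh_keys`, and `ToyInlineDeduplicator`, as well as the parameters `num_perm` and `bands`, are hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of the text."""
    return {text[i : i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for g in grams
            )
        )
    return signature


def lsh_keys(signature: list[int], bands: int = 16) -> list[str]:
    """Split the signature into bands; each band is hashed into one bucket key."""
    rows = len(signature) // bands
    return [
        f"{b}:" + hashlib.md5(repr(signature[b * rows : (b + 1) * rows]).encode()).hexdigest()
        for b in range(bands)
    ]


class ToyInlineDeduplicator:
    """Hypothetical in-memory deduplicator: a text is a duplicate if any of its keys was seen."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        keys = lsh_keys(minhash_signature(text))
        duplicate = any(k in self.seen for k in keys)
        self.seen.update(keys)  # remember the keys for later documents
        return duplicate
```

In this sketch the only state is the set of bucket keys, so the same logic could in principle keep the keys in an external store instead of a Python set, which is the trade-off the table above describes.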

What's Changed

Faster implementation of Near-Deduplication with MinHash LSH by @hoppiece in https://github.com/HojiChar/HojiChar/pull/81

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.1...v0.15.0

- Python
Published by hoppiece 8 months ago

hojichar - 0.14.1

Fix https://github.com/HojiChar/HojiChar/pull/80

What's Changed

Update README.md by @hoppiece in https://github.com/HojiChar/HojiChar/pull/78
JSONDumper: Exclude '_initstats' from exported extras in JSON output by @hoppiece in https://github.com/HojiChar/HojiChar/pull/80

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.14.0...v0.14.1

- Python
Published by hoppiece 8 months ago

🚀 New Features in v0.14.0

Asynchronous Pipeline Support

Added support for building asynchronous text-processing pipelines.
You can now define non-blocking filters by extending the AsyncFilter class.
Multiple asynchronous filters can be composed together using AsyncCompose.
AsyncCompose also accepts traditional synchronous Filter classes — they are automatically wrapped and executed in an asynchronous manner.
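The wrapping of synchronous steps can be pictured with plain `asyncio`. The sketch below is an illustration under assumptions, not HojiChar's implementation: `ToyAsyncCompose`, `as_async`, `strip_spaces`, and `shout` are hypothetical names, and the steps operate on plain strings rather than Document objects.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Iterable, List, Union

SyncStep = Callable[[str], str]
AsyncStep = Callable[[str], Awaitable[str]]


def as_async(step: Union[SyncStep, AsyncStep]) -> AsyncStep:
    """Return coroutine functions unchanged; run plain functions in a worker thread."""
    if inspect.iscoroutinefunction(step):
        return step

    async def wrapper(text: str) -> str:
        return await asyncio.to_thread(step, text)

    return wrapper


class ToyAsyncCompose:
    """Hypothetical composer that chains sync and async steps in order."""

    def __init__(self, steps: Iterable[Union[SyncStep, AsyncStep]]) -> None:
        self.steps = [as_async(s) for s in steps]

    async def apply(self, text: str) -> str:
        for step in self.steps:
            text = await step(text)
        return text

    async def apply_many(self, texts: Iterable[str]) -> List[str]:
        # Documents are processed concurrently; gather keeps the input order.
        return list(await asyncio.gather(*(self.apply(t) for t in texts)))


def strip_spaces(text: str) -> str:  # a synchronous step
    return text.strip()


async def shout(text: str) -> str:  # an asynchronous step
    await asyncio.sleep(0)  # stands in for non-blocking I/O
    return text.upper()


result = asyncio.run(ToyAsyncCompose([strip_spaces, shout]).apply_many(["  hello ", "world  "]))
```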

AsyncChatAPI — High-Throughput LLM Integration Example

In recent years, core NLP processing has increasingly shifted outside the CPU, with LLMs (Large Language Models) becoming a key component.
As a practical example of an asynchronous filter, we introduce the AsyncChatAPI class, a high-throughput LLM inference interface compatible with the OpenAI API.
Using AsyncChatAPI, you can send up to 1K req/sec to an LLM endpoint with simple, declarative code:

```python
import asyncio
import os

from hojichar import AsyncCompose
from hojichar.filters.document_filters import JSONLoader, JSONDumper
from hojichar.utils.async_handlers import write_stream_to_file

# AsyncChatAPI must be imported as well; its module path is not shown in this release note.

async_pipeline = AsyncCompose(
    [
        JSONLoader(input_key="text"),
        AsyncChatAPI(
            model_id="gpt-4o",
            openai_endpoint_url="https://api.openai.com/v1",
            openai_api_key=os.getenv("OPENAI_API_KEY"),
            max_concurrent_requests=128,
            output_key="llm_output",
            # Build the chat messages from the first 1000 characters of each document.
            message_generator=lambda doc: [{"role": "user", "content": doc.text[:1000]}],
        ),
        JSONDumper(export_extras=True),
    ]
)


async def main() -> None:
    # `await` is only valid inside a coroutine, so the I/O loop lives in main().
    with open("input.jsonl") as f:
        with async_pipeline:
            async_output_stream = (str(doc) async for doc in async_pipeline.apply_stream(f))
            await write_stream_to_file(async_output_stream, "output.jsonl", chunk_size=128)


asyncio.run(main())
```

Since AsyncChatAPI follows the OpenAI API specification, it can seamlessly interact with self-hosted LLM endpoints such as vLLM.
This enables you to build robust, high-throughput, and easy-to-maintain pipelines for "Chain of LLMs" style workflows, where multiple LLMs are orchestrated together in a single declarative pipeline.

What's Changed

Async pipeline by @hoppiece in https://github.com/HojiChar/HojiChar/pull/72
AsyncFilter: AsyncChatAPI by @hoppiece in https://github.com/HojiChar/HojiChar/pull/77

Full Changelog: https://github.com/HojiChar/HojiChar/compare/v0.13.1...v0.14.0

- Python
Published by hoppiece 8 months ago

hojichar - 0.13.1

Fixed pyproject.toml so that the Python version supported by pypi.org is retrieved correctly.

- Python
Published by hoppiece 8 months ago

hojichar - 0.13.0

HojiChar v0.13.0 Release Notes

We are pleased to announce the release of HojiChar v0.13.0, introducing significant improvements to batch and stream processing, filter management, and statistics tracking.

Main PRs:
- https://github.com/HojiChar/HojiChar/pull/71
- https://github.com/HojiChar/HojiChar/pull/69

✨ New Features and Improvements

Unified Batch and Stream Processing for Filters and Pipelines

The Filter and Compose classes now support both batch processing and stream processing, providing greater flexibility and efficiency for text processing pipelines.
New methods added:
- apply_batch(documents: Sequence[Document]): Enables optimized batch processing. Users can override this method to implement custom batch logic. By default, it applies the existing apply method iteratively.
- apply_stream(documents: Iterable[Document]): Processes an iterable stream of Document objects.
- The use_batch flag in the Filter constructor controls whether apply_stream internally uses apply_batch for stream processing (default: False).
- The batch_size parameter defines the batch size when use_batch is enabled.
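The relationship between `apply`, `apply_batch`, and `apply_stream` described above can be sketched as follows. This is a toy model under assumptions, not the library's code: `ToyFilter` is a hypothetical class that works on plain strings, and only the `use_batch` and `batch_size` names are taken from the release note.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


class ToyFilter:
    """Hypothetical stand-in for a filter; it processes plain strings."""

    def __init__(self, use_batch: bool = False, batch_size: int = 128) -> None:
        self.use_batch = use_batch
        self.batch_size = batch_size

    def apply(self, text: str) -> str:
        return text.strip().lower()

    def apply_batch(self, texts: Sequence[str]) -> List[str]:
        # Default behavior: apply one by one. A subclass could override this
        # with vectorized or batched logic (for example, one model call per batch).
        return [self.apply(t) for t in texts]

    def apply_stream(self, texts: Iterable[str]) -> Iterator[str]:
        if not self.use_batch:
            for t in texts:
                yield self.apply(t)
            return
        iterator = iter(texts)
        # Pull at most batch_size items at a time without materializing the stream.
        while batch := list(islice(iterator, self.batch_size)):
            yield from self.apply_batch(batch)


outputs = list(ToyFilter(use_batch=True, batch_size=2).apply_stream([" A", "B ", " C "]))
```

Because the stream is consumed lazily in chunks, memory use in this sketch is bounded by `batch_size` rather than by the size of the input.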

Context Manager and Resource Management

Filters now support resource cleanup via:
- shutdown() method for explicitly releasing resources (e.g., closing database connections).
- Context manager support (with statement) automatically calls shutdown() upon exit.
The Compose class will propagate the shutdown() call to all contained filters.
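A minimal sketch of this cleanup pattern, assuming nothing about HojiChar's internals: `ToyResourceFilter` and `ToyCompose` are hypothetical classes, and the `closed` flag stands in for a real resource such as a database connection.

```python
from typing import Iterable, List


class ToyResourceFilter:
    """Hypothetical filter that owns a resource and releases it in shutdown()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True  # a real filter would close its connection here

    def __enter__(self) -> "ToyResourceFilter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and on exceptions; returning None lets exceptions propagate.
        self.shutdown()


class ToyCompose(ToyResourceFilter):
    """Hypothetical composer that forwards shutdown() to every contained filter."""

    def __init__(self, filters: Iterable[ToyResourceFilter]) -> None:
        super().__init__("compose")
        self.filters: List[ToyResourceFilter] = list(filters)

    def shutdown(self) -> None:
        for f in self.filters:
            f.shutdown()
        super().shutdown()


a, b = ToyResourceFilter("a"), ToyResourceFilter("b")
with ToyCompose([a, b]) as pipeline:
    still_open = not a.closed and not b.closed  # resources are live inside the block
```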

Refined Filter Control and Statistics Handling

Key control logic has been shifted from Compose to the Filter level for better modularity:
- Probabilistic filter application via the p parameter.
- Skipping rejected documents controlled by skip_rejected.
- Random state management:
  - Users can pass an integer seed or a numpy.random.Generator instance via the random_state parameter.
  - If unspecified, filters inherit the shared random generator from Compose.

Statistics Tracking Overhaul:
- New hojichar.core.models.Statistics class replaces the legacy inspection.py statistics system (now deprecated).
- Each Filter instance independently tracks its own statistics.
- Access statistics via:
  - Filter.get_statistics() returns a Statistics object.
  - Filter.get_statistics_map() returns a dictionary representation.
- The Compose class provides:
  - get_total_statistics() for aggregated Statistics.
  - get_total_statistics_map() for aggregated statistics as dictionaries.

Statistics Object Structure:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statistics:
    name: Optional[str] = None
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    diff_bytes: int = 0
    diff_chars: int = 0
    cumulative_time_ns: int = 0
```
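As an illustration of how these counters might be read, the sketch below derives a discard rate and a mean processing time per document from the fields listed above. The dataclass is a local copy of the field list in this note, and the `summarize` helper and the sample numbers are hypothetical; they are not part of HojiChar.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Statistics:  # local copy of the field list shown in the release note
    name: Optional[str] = None
    input_num: int = 0
    input_bytes: int = 0
    input_chars: int = 0
    output_num: int = 0
    output_bytes: int = 0
    output_chars: int = 0
    discard_num: int = 0
    diff_bytes: int = 0
    diff_chars: int = 0
    cumulative_time_ns: int = 0


def summarize(stats: Statistics) -> Dict[str, Union[str, float, None]]:
    """Hypothetical helper: derive per-document ratios, guarding against empty input."""
    n = stats.input_num
    return {
        "name": stats.name,
        "discard_rate": stats.discard_num / n if n else 0.0,
        "mean_ms_per_doc": stats.cumulative_time_ns / n / 1e6 if n else 0.0,
    }


# Made-up numbers, only to exercise the helper.
summary = summarize(
    Statistics(name="ToyFilter", input_num=200, output_num=150, discard_num=50, cumulative_time_ns=4_000_000)
)
```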

🚨 Deprecations and Planned Removals

The following classes and modules are scheduled for deprecation as of v1.0.0:

- Token and TokenFilter
  - This includes the token-related implementations in Document. Use Document.extras to hold such information.
- The inspection.py module for statistics.
  - Use the Statistics class to get statistics.
  - The Compose.statistic and Compose.statistic_obj properties will be removed in the future.

Please migrate to the updated Statistics class and new filter methods to ensure future compatibility.

🛠️ For developers

The project has migrated from Poetry to uv for package management and building.

- Python
Published by hoppiece 8 months ago

hojichar - 0.12.0

Changes

Merges https://github.com/HojiChar/HojiChar/pull/67
- Add export_extras options for JSONDumper.

- Python
Published by hoppiece 9 months ago

hojichar - 0.11.4

Fixes the bug:

https://github.com/HojiChar/HojiChar/issues/65
Fix division by zero in some filters.

- Python
Published by hoppiece over 1 year ago

hojichar - 0.11.3

Bug fix

The following bug has been fixed:

An ImportError occurred because the requests package was missing when HojiChar was installed without the [all] option.

- Python
Published by hoppiece over 1 year ago

hojichar - 0.11.2

Bug Fix

The following bug has been fixed:

An ImportError was raised when importing the hojichar package after installing it with pip install hojichar, without the [all] option.

- Python
Published by hoppiece over 1 year ago

hojichar - 0.11.1

Changes

This update adds the `__repr__` method to the Document class, enhancing object representation for easier debugging.

For instance, when you inspect a Document object, you now see a detailed representation produced by `__repr__`:

```python
>>> from hojichar import Document
>>> doc = Document("Hello, world", extras={"date": "2024-10-03"})
>>> repr(doc)
"Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})"
>>> eval(repr(doc))
Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})
```

- Python
Published by hoppiece over 1 year ago

hojichar - 0.11.0

What's New in This Release

We're excited to introduce a series of new filters in this version. They are particularly useful for handling noisy datasets such as Common Crawl.

New Filters Added:

`hojichar.filters.document_filters`:

DiscardTooManyNouns: Removes "word salad" with excessive nouns in Japanese texts.
CharRepetitionRatioFilter: Filters out entries based on character repetition ratios to reduce noise.
WordRepetitionRatioFilter: Discards entries with repetitive word patterns in Japanese texts.
DiscardTooManySpecialTokens: Cleans up entries overloaded with special tokens or symbols, judged as noise.
SingleCharacterRepetitionFilter: Removes entries where single characters are overly repeated.
DiscardTooManyEndingEllipsis: Eliminates entries ending with multiple ellipses such as ....
DiscardTooShortLines: Filters out repetitions of unusually short lines.

`hojichar.filters.language_identification`:

LanguageIdentificationByFastText: Employs FastText for high-performance language identification.
AcceptJapaneseByFastText: Japanese LID filter.

Installation Notes:

Some of these new filters require additional dependency libraries. To use them, install hojichar with the all extra:

```bash
pip install 'hojichar[all]'
```

The mmh3 package, which the hojichar.filters.deduplication module depends on, has also been added to the extras. To use that module, specify hojichar[all] in the same way as above.

Additional Updates:

Full Support for Python 3.12: We officially now support Python 3.12, ensuring compatibility and enhanced performance across more environments.

- Python
Published by hoppiece over 1 year ago

hojichar -

Fixes

Fix access when the extras argument is passed to the Document: https://github.com/HojiChar/HojiChar/pull/55

- Python
Published by hoppiece over 1 year ago

hojichar - 0.10.0

New Features in Document and JSONLoader

This release is mainly about https://github.com/HojiChar/HojiChar/pull/48
- We've enhanced the Document class by introducing an extras attribute for storing additional metadata.
- This metadata can be utilized in filters.
- We've added an extra_keys argument to the JSONLoader to facilitate the handling of these extra metadata fields.

Acknowledgments

Thanks to @shinofumijp for their valuable contributions to these features!

- Python
Published by hoppiece over 1 year ago

hojichar - 0.9.0

Changes

Added the hojichar.Parallel class. With this, users can parallelize the processing of the Compose class without delving into the details of parallel implementation.
Removed the hojichar.utils.process module.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.8.1

Changes

Added the `--version` (or `-v`) option to the hojichar command, which shows the version of HojiChar.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.8.0

Changes

Statistics are now displayed during processing.
- The number of MB processed is now displayed while redirecting standard output or writing with the --output option.
Added --input option to HojiChar CLI. It takes a file path as an argument and specifies an input file.
- A progress bar and the remaining expected time are displayed when this option is used.
The tqdm has been added to the dependencies for the above changes.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.7.2

Fixes

Fixed an issue where HojiChar CLI would not stop with a single Ctrl+C during processing.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.7.1

Changes

Fixed a problem where, when the filters argument of a Compose object contained another Compose, the statistics displayed only Compose instead of expanding the filters it bundles.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.7.0

Changes

HojiChar CLI is now parallelized by the number of CPU cores by default 🎉. Use the --jobs/-j option to specify the number of parallel jobs.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.6.1

Made the JSON of the dumped statistics easier to read:
- Displayed in UTF-8, not ASCII
- Indented

- Python
Published by hoppiece over 2 years ago

hojichar - 0.6.0

Changes

Changes were made to make the analysis of discarded text more convenient.
- The ignore_filtered flag of hojichar.Compose is abolished.
- The skip_rejected flag is added to hojichar.Filter.
- The --all option is added to the CLI.
- The dump_reason flag is added to hojichar.document_filters.JSONDumper.
- When a filter discards a document, the Filter and its primitive member variables are logged.
- Discarded text is no longer converted to blank characters.

With these changes, we can analyze outputs such as the following.

Profile myprofile.py:

```python
from hojichar import Compose
from hojichar.filters import document_filters as dflt

FILTER = Compose(
    [
        dflt.JSONLoader(key="text", ignore=True),
        dflt.DiscardAdultContentJa(p=0.9, ignore_confused=True),
        # Keep rejected documents in the output and attach the reason for rejection.
        dflt.JSONDumper(dump_reason=True, skip_rejected=False),
    ],
)
```

Then `cat dirtytexts.jsonl | hojichar -p myprofile.py --all | jq` produces lines such as:

```json
{
  "text": "劇訳表示。 : 経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n< 【防衛省】15年度予算、概算要求が過去最高額へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n<防災の日>経産省トイレットペーパー備蓄PR「1か月分」",
  "is_rejected": true,
  "reason": {
    "name": "DiscardAdultContentJa",
    "p": 0.9,
    "matched_text": "えろ",
    "matched_text_neighbor": "へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆"
  }
}
```

- Python
Published by hoppiece over 2 years ago

hojichar - 0.5.3

Added PEP561 compliant py.typed to the package.
Now mypy can use our type annotations in HojiChar.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.5.2

Fix the behavior of --dump-stats option in HojiChar CLI.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.5.1

What's new

Major revision of the README. We have tried to improve the content in English and to make it more understandable to newcomers. We will continue to improve the README as it is still insufficient.
Clean up for doc/ directory.

- Python
Published by hoppiece over 2 years ago

hojichar - 0.5.0

What's New

This release mainly incorporates the new features in #12

Main new feature

The HojiChar CLI 🎉
The definition of the profile passed to the CLI

Other changes

Some new filters are added.
- hojichar.document_filters.Sleep
- hojichar.document_filters.JSONDumper

- Python
Published by hoppiece over 2 years ago

hojichar - 0.4.0

Allowed control of hash value parameters for deduplication.
Removed unnecessary processing inside the hash value calculation. This makes them incompatible with previous hash values.

- Python
Published by hoppiece almost 3 years ago

hojichar - 0.3.1

The package version is now accessible via `hojichar.__version__`.

- Python
Published by hoppiece about 3 years ago

hojichar - 0.3.0

Modified the module structure of some filters.
- For practicality, the filter modules cleaners and normalization are abolished. These general text preprocessors are now included in document_filters.
- On the other hand, for clarity, deduplication and tokenization, which are used for specific purposes, have been separated from document_filters.

Several bugs were fixed.
- Each filter module is now imported correctly from hojichar.

- Python
Published by hoppiece about 3 years ago

hojichar - 0.2.2

This is the first release for PyPI 🎉

- Python
Published by hoppiece about 3 years ago
