Unishox

Unishox: A hybrid encoder for Short Unicode Strings - Published in JOSS (2022)

https://github.com/siara-cc/unishox2

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: ieee.org, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

arduino bandwidth-saver chat-message-compression cloud-cost-intelligence compression cost-optimization database-compression iot json-compression lora lorawan message-compression short-message-compression short-string sms-compression storage-saving string-compression string-compression-algorithms xml-compression

Scientific Fields

Engineering Computer Science - 40% confidence
Last synced: 6 months ago

Repository

Compression for Unicode short strings (works on arduino)

Basic Info
  • Host: GitHub
  • Owner: siara-cc
  • License: apache-2.0
  • Language: C
  • Default Branch: master
  • Homepage:
  • Size: 95.7 MB
Statistics
  • Stars: 218
  • Watchers: 12
  • Forks: 27
  • Open Issues: 14
  • Releases: 4
Created over 6 years ago · Last pushed 10 months ago
Metadata Files
Readme Contributing Funding License Citation Codemeta

README.md

Unishox: A hybrid encoder for Short Unicode Strings


General-purpose compression utilities such as zip and gzip do not compress short strings well and often expand them. They also use a lot of memory, which makes them unusable in constrained environments such as Arduino. The Unishox algorithm was developed for compressing (and decompressing) short strings individually.

This is a C/C++ library. See here for the CPython version and here for the JavaScript version, which is interoperable with this library.

The contenders for Unishox are Smaz, Shoco, Unicode.org's SCSU and BOCU (implementations here and here) and AIMCS (Implementation here).

Note: Unishox provides the best compression for short text and is not meant to be compared with general-purpose compression algorithms such as lz4, snappy, lzma, brotli and zstd.

Applications

  • Faster transfer of text over low-speed networks such as LoRa or BLE
  • Compression for low memory devices such as Arduino and ESP8266
  • Compression of Chat application text exchange including Emojis
  • Storing compressed text in database
  • Bandwidth and storage cost reduction for Cloud


Unishox3 Alpha

The next version, Unishox3, includes multi-level static dictionaries residing in RAM or Flash memory and provides much better compression than Unishox2. A preview is available in the Unishox3_Alpha folder, together with a makefile. To compile and try it, use the following steps:

    cd Unishox3_Alpha
    make
    ../usx3 "The quick brown fox jumped over the lazy dog"

This is just a preview, and the specification and dictionaries are expected to change before Unishox3 is released. However, this folder will be retained, so anyone who used it for compressing strings can still use it to decompress them.

Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

How it works

Unishox is a hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes to each letter in the Character Set (entropy coding). It also encodes repeating letter sets separately (dictionary coding). For Unicode characters, delta coding is used.
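To illustrate the delta coding idea only, here is a minimal sketch in C that turns a UTF-8 string into differences between consecutive code points. The function names `utf8_decode` and `codepoint_deltas` are hypothetical and not part of the Unishox API, and the sketch does not reproduce how Unishox actually packs deltas into bits; that is defined in the specification article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 on malformed or truncated input.
 * Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;
    if (len > n) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: writes the difference between each code point and
 * the previous one into deltas (the first delta is relative to 0).
 * Returns the number of code points, or -1 on malformed input or if
 * deltas has fewer than the required number of slots. */
static int codepoint_deltas(const char *utf8, size_t len,
                            int32_t *deltas, size_t max)
{
    assert(utf8 != NULL && deltas != NULL);
    const unsigned char *s = (const unsigned char *)utf8;
    uint32_t prev = 0, cp = 0;
    size_t pos = 0, count = 0;
    while (pos < len) {
        size_t used = utf8_decode(s + pos, len - pos, &cp);
        if (used == 0 || count == max) return -1;
        deltas[count++] = (int32_t)cp - (int32_t)prev;
        prev = cp;
        pos += used;
    }
    return (int)count;
}
```

For the Greek string "αβγ" (U+03B1, U+03B2, U+03B3) the deltas are 945, 1, 1. After the first character, differences within one script tend to be small, which is why they can be stored in fewer bits than the code points themselves.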

The model used for arriving at the prefix-free code is shown below:

[Figure: model used to arrive at the prefix-free codes]
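As a reminder of what "prefix-free" means, the following C sketch checks that no code in a set is a prefix of another, which is the property that lets a decoder read codes back without separators. The function name `is_prefix_free` and the codes used in the usage note are made up for illustration; they are not the Unishox code table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if no code in the set is a prefix of another code
 * (identical codes also count as a clash), otherwise 0.
 * Each code is a string made of the characters '0' and '1'. */
static int is_prefix_free(const char *const codes[], size_t n)
{
    assert(codes != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len_i = strlen(codes[i]);
        for (size_t j = 0; j < n; j++) {
            /* codes[i] is a prefix of codes[j] if the first len_i
             * characters of codes[j] match codes[i] exactly. */
            if (i != j && strncmp(codes[i], codes[j], len_i) == 0)
                return 0;
        }
    }
    return 1;
}
```

With the illustrative set {"0", "10", "110", "111"} the function returns 1, so a bit stream such as 010110 splits unambiguously into 0, 10, 110. With {"0", "01", "11"} it returns 0, because "0" is a prefix of "01".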

The complete specification can be found in this article: A hybrid encoder for compressing Short Unicode Strings. This can also be found at figshare here with DOI 10.6084/m9.figshare.17056334.v2.

Compiling

To compile, just use make or use gcc as follows:

    gcc -std=c99 -o test_unishox2 test_unishox2.c unishox2.c

Unit tests (automated)

For testing the compiled program, use:

    ./test_unishox2 -t

This invokes the run_unit_tests() function in test_unishox2.c, which tests all the features of Unishox2, including edge cases, using 159 strings covering several languages, emojis and binary data.

Further, the CI pipeline at .github/workflows/c-cpp.yml runs these tests for all presets and also tests file compression for the different types of files in sample_texts folder. This happens whenever a commit is made to the repository.

API

    int unishox2_compress_simple(const char *in, int len, char *out);
    int unishox2_decompress_simple(const char *in, int len, char *out);

Usage

To see Unishox in action, simply try to compress a string:

    ./test_unishox2 "Hello World"

To compress and decompress a file, use:

    ./test_unishox2 -c <input_file> <compressed_file>
    ./test_unishox2 -d <compressed_file> <decompressed_file>

Note: Unishox is good for text content of up to a few kilobytes. It does not give good ratios when compressing large files or binary files.

Character Set

Unishox supports the entire Unicode character set. As of now it supports UTF-8 as input and output encoding.

Achieving better overall compression

Unishox is designed and developed for short texts, where other methods do poorly. To achieve better overall compression across inputs of different sizes, the following logic could be used. The magic bit(s) at the beginning of the compressed bytes can be used to identify whether Unishox or another method produced them:

    if (size < 1024)
        output = compress_with_unishox(input);
    else
        output = compress_with_any_other(input);

The threshold size 1024 is arbitrary and if speed is not a concern, it is also possible to compress with both and use the best.
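The two strategies described above can be sketched in C as plain decision functions. Everything here is hypothetical (the `METHOD_*` identifiers, `SHORT_TEXT_THRESHOLD`, `choose_by_size` and `choose_by_result`); the actual compressors are left out, and a negative length is assumed to signal that a compressor failed.

```c
#include <assert.h>

/* Hypothetical method identifiers; a real container would record the
 * choice in the leading magic bit(s) of the compressed bytes. */
enum { METHOD_SHORT = 0, METHOD_GENERAL = 1 };

/* The threshold is arbitrary, as noted in the text. */
enum { SHORT_TEXT_THRESHOLD = 1024 };

/* Strategy 1: decide from the input size alone, before compressing. */
static int choose_by_size(int input_len)
{
    assert(input_len >= 0);
    return input_len < SHORT_TEXT_THRESHOLD ? METHOD_SHORT : METHOD_GENERAL;
}

/* Strategy 2: compress with both methods and keep the smaller result.
 * short_len and general_len are the two compressed lengths; a negative
 * value means that compressor failed. Returns the chosen method, or -1
 * if both failed. Ties go to the short-string method. */
static int choose_by_result(int short_len, int general_len)
{
    if (short_len < 0 && general_len < 0)
        return -1;
    if (general_len < 0 || (short_len >= 0 && short_len <= general_len))
        return METHOD_SHORT;
    return METHOD_GENERAL;
}
```

`choose_by_size` costs nothing but can pick the worse method near the threshold. `choose_by_result` always picks the smaller output at the price of running both compressors.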

Interoperability with the JS Library

Strings that were compressed with this library can be decompressed with the JS Library and vice-versa. However please see this section in the documentation for usage.

Projects that use Unishox

Credits

Versions

The present byte-code version is 2 and it replaces Unishox 1. Unishox 1 is still available as unishox1.c, but it will have to be compiled manually if it is needed.

The next version will be Unishox3. It will include multi-level static dictionaries residing in RAM or Flash memory, which greatly improve compression ratios compared to Unishox2. However, Unishox2 will still be supported for cases where space for storing static dictionaries is an issue.

License for AI bots

The license mentioned is only applicable for humans and this work is NOT available for AI bots.

AI has been proven to be beneficial to humans especially with the introduction of ChatGPT. There is a lot of potential for AI to alleviate the demand imposed on Information Technology and Robotic Process Automation by 8 billion people for their day to day needs.

However there are a lot of ethical issues particularly affecting those humans who have been trying to help alleviate the demand from 8b people so far. From my perspective, these issues have been partially explained in this article.

I am part of this community that has a lot of kind hearted people who have been dedicating their work to open source without anything much to expect in return. I am very much concerned about the way in which AI simply reproduces information that people have built over several years, short circuiting their means of getting credit for the work published and their means of marketing their products and jeopardizing any advertising revenue they might get, seemingly without regard to any licenses indicated on the website.

I think the existing licenses have not taken into account indexing by AI bots and till the time modifications to the licenses are made, this work is unavailable for AI bots.

Issues

In case of any issues, please email the author (Arundale Ramanathan) at arun@siara.cc or create a GitHub issue.

Owner

  • Name: Arun (Arundale Ramanathan)
  • Login: siara-cc
  • Kind: user
  • Company: Siara Logics (cc)

A keen technology enthusiast, building a set of unique open source components, libraries & end-user projects using modern technologies.

JOSS Publication

Unishox: A hybrid encoder for Short Unicode Strings
Published
January 18, 2022
Volume 7, Issue 69, Page 3919
Authors
Arundale Ramanathan ORCID
Independent Researcher
Editor
George K. Thiruvathukal ORCID
Tags
compression encoding string-compression

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ramanathan"
  given-names: "Arundale"
  orcid: "https://orcid.org/0000-0001-6642-447X"
title: "Unishox: A hybrid encoder for compressing Short Unicode Strings"
version: 1.0.2
doi: 10.5281/zenodo.5800408
date-released: 2021-12-23
url: "https://github.com/siara-cc/Unishox2"
preferred-citation:
  type: article
  authors:
  - family-names: "Ramanathan"
    given-names: "Arundale"
    orcid: "https://orcid.org/0000-0001-6642-447X"
  doi: "10.5281/zenodo.5573864"
  month: 11
  title: "Unishox: A hybrid encoder for compressing Short Unicode Strings"
  year: 2021

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    "Arundale Ramanathan"
  ],
  "identifier": "",
  "codeRepository": "https://github.com/siara-cc/Unishox2",
  "datePublished": "2021-10-17",
  "dateModified": "2021-10-17",
  "dateCreated": "2021-10-17",
  "description": "Unishox2 is a hybrid encoding technique with which short unicode strings could be compressed using context aware pre-mapped codes and delta coding resulting in surprisingly good ratios.",
  "keywords": "compression, string-compression, encoding",
  "license": "Apache 2.0",
  "title": "Unishox: A hybrid encoder for compressing Short Unicode Strings",
  "version": "v1.0.1"
}

GitHub Events

Total
  • Issues event: 12
  • Watch event: 37
  • Issue comment event: 48
  • Push event: 3
Last Year
  • Issues event: 12
  • Watch event: 37
  • Issue comment event: 48
  • Push event: 3

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 199
  • Total Committers: 10
  • Avg Commits per committer: 19.9
  • Development Distribution Score (DDS): 0.251
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name | Email | Commits
Arun | a****n@s****c | 149
James Z.M. Gao | g****g@3****n | 25
Luis Díaz Más | p****o@g****m | 8
Arundale Ramanathan | a****n@A****l | 8
Jonathan Greenblatt | g****b@l****t | 2
Jm Casler | jm@c****g | 2
Emmanuel Ortiz | e****5@g****m | 2
Kyle Niemeyer | k****r@f****m | 1
Chris Partridge | c****s@p****h | 1
Jonathan Greenblatt | d****y@r****1 | 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 46
  • Total pull requests: 20
  • Average time to close issues: 6 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 31
  • Total pull request authors: 9
  • Average comments per issue: 4.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 0
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 4.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • SheepChef (5)
  • hiqsociety (4)
  • habnabit (3)
  • gzm55 (3)
  • Dan-Do (2)
  • capedra (2)
  • moneromooo-monero (2)
  • mc-hamster (2)
  • piponazo (1)
  • sprappcom (1)
  • Gromolyak (1)
  • powturbo (1)
  • NuttyNull (1)
  • motorhackers (1)
  • Maher9494 (1)
Pull Request Authors
  • gzm55 (7)
  • piponazo (3)
  • siara-cc (3)
  • leafgarden (2)
  • mc-hamster (1)
  • tweedge (1)
  • kyleniemeyer (1)
  • CalebProvost (1)
  • eos175 (1)

Packages

  • Total packages: 2
  • Total downloads: unknown
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 0
proxy.golang.org: github.com/siara-cc/Unishox2
  • Versions: 0
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.0%
Average: 6.2%
Dependent repos count: 6.4%
Last synced: 6 months ago
proxy.golang.org: github.com/siara-cc/unishox2
  • Versions: 0
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 8.9%
Average: 9.4%
Dependent repos count: 10.0%
Last synced: 6 months ago

Dependencies

.github/workflows/c-cpp.yml actions
  • actions/checkout v2 composite