unihan-etl

Export UNIHAN's database to csv, json or yaml

https://github.com/cihai/unihan-etl

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary

Keywords

chinese chinese-dictionary chinese-words cjk dictionary japanese korean unicode unihan unihan-database

Keywords from Contributors

chinese-characters japanese-dictionary astronomy parsing glotaran pyglotaran target-analysis jwst

Repository

Basic Info
Statistics
  • Stars: 59
  • Watchers: 4
  • Forks: 13
  • Open Issues: 13
  • Releases: 7
Topics
chinese chinese-dictionary chinese-words cjk dictionary japanese korean unicode unihan unihan-database
Created about 12 years ago · Last pushed 6 months ago

README.md

unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.

unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.

This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.

As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).

The UNIHAN database

The UNIHAN database organizes data across multiple files, exemplified below:

```tsv
U+3400	kCantonese	jau1
U+3400	kDefinition	(same as U+4E18 丘) hillock or mound
U+3400	kMandarin	qiū
U+3401	kCantonese	tim2
U+3401	kDefinition	to lick; to taste, a mat, bamboo bark
U+3401	kHanyuPinyin	10019.020:tiàn
U+3401	kMandarin	tiàn
```
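As a rough illustration of how these one-field-per-line rows relate to one record per character, the sketch below groups raw lines by codepoint. It assumes tab-separated `codepoint`, `field`, `value` columns and `#` comment lines; the function name `group_by_codepoint` is hypothetical and this is not unihan-etl's own implementation.

```python
from typing import Dict, Iterable


def group_by_codepoint(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Collect `codepoint<TAB>field<TAB>value` rows into one dict per codepoint."""
    records: Dict[str, Dict[str, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split on the first two tabs only, so the value keeps any later tabs.
        ucn, field, value = line.split("\t", 2)
        records.setdefault(ucn, {"ucn": ucn})[field] = value
    return records


sample = [
    "U+3400\tkCantonese\tjau1",
    "U+3400\tkMandarin\tqiū",
    "U+3401\tkHanyuPinyin\t10019.020:tiàn",
]
records = group_by_codepoint(sample)
print(records["U+3400"])
```

Each codepoint ends up with only the fields that actually appeared for it, which is why flat exports need an empty cell or `null` for missing fields.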

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents an entry. To complicate matters further, there are more variations:

```tsv
U+5EFE	kHanyuPinyin	10513.110,10514.010,10514.020:gǒng
U+5364	kHanyuPinyin	10093.130:xī,lǔ 74609.020:lǔ,xī
```

kHanyuPinyin supports multiple entries delimited by spaces. Within an entry, ":" (colon) separates locations in the work from pinyin readings, and "," (comma) separates multiple locations or readings. This is just one of 90 fields contained in the database.
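The delimiter rules above can be sketched in a few lines of Python. This is a minimal illustration, not unihan-etl's actual expansion code: the function names are hypothetical, and the location layout (one volume digit, four page digits, a dot, two character digits, one virtual digit) is an assumption inferred from the structured output shown in this README, where `10019.020` becomes volume 1, page 19, character 2, virtual 0.

```python
from typing import Any, Dict, List


def parse_location(location: str) -> Dict[str, int]:
    """Split a location such as '10019.020' into its numeric parts."""
    return {
        "volume": int(location[0]),
        "page": int(location[1:5]),
        "character": int(location[6:8]),  # index 5 is the dot
        "virtual": int(location[8]),
    }


def parse_hanyu_pinyin(value: str) -> List[Dict[str, Any]]:
    """Expand a raw kHanyuPinyin value into locations and readings."""
    entries = []
    for entry in value.split(" "):  # spaces delimit entries
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),  # commas delimit readings
            }
        )
    return entries


parsed = parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
print(parsed[1]["locations"][0]["volume"], parsed[1]["readings"])
```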

Tabular, "Flat" output

CSV (default)

```console
$ unihan-etl
```

```csv
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
```

With `$ unihan-etl -F yaml --no-expand`:

```yaml
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
```

To preview in the CLI, try tabview or csvlens.
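For working with the flat CSV export from Python, the standard library's `csv.DictReader` is usually enough. The sketch below parses the sample rows shown above from an in-memory string; for a real export you would open the file instead (with `newline=""` and `encoding="utf-8"`), and its path depends on your configuration.

```python
import csv
import io

# Sample rows copied from the CSV output above.
sample_csv = """char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Missing fields arrive as empty strings in the flat CSV format.
mandarin_by_char = {row["char"]: row["kMandarin"] for row in rows}
print(mandarin_by_char)
```

Note that the quoted kDefinition cell keeps its embedded comma, so splitting lines on "," by hand would break this row.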

JSON

```console
$ unihan-etl -F json --no-expand
```

```json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
```

Tools:

YAML

```console
$ unihan-etl -F yaml --no-expand
```

```yaml
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
```

Filter via the CLI with yq.

"Structured" output

Codepoints can pack a lot more detail; unihan-etl carefully extracts these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.

Why not CSV? Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML accept key-values and hierarchical entries.

JSON

```console
$ unihan-etl -F json
```

```json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]
```
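Because empty values are pruned from structured output, code that consumes it should not assume every key is present. The sketch below loads a trimmed copy of the sample above and collects dictionary locations, treating a missing kHanyuPinyin as an empty list. The helper name `dictionary_pages` is hypothetical and exists only for this illustration.

```python
import json

# Trimmed copy of the structured JSON sample above.
sample = json.loads("""
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kMandarin": {"zh-Hans": "qiū", "zh-Hant": "qiū"}
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kHanyuPinyin": [
      {
        "locations": [{"volume": 1, "page": 19, "character": 2, "virtual": 0}],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {"zh-Hans": "tiàn", "zh-Hant": "tiàn"}
  }
]
""")


def dictionary_pages(record):
    """Return (volume, page) pairs; a pruned field counts as no entries."""
    return [
        (location["volume"], location["page"])
        for entry in record.get("kHanyuPinyin", [])
        for location in entry["locations"]
    ]


for record in sample:
    print(record["char"], dictionary_pages(record))
```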

YAML

```console
$ unihan-etl -F yaml
```

```yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
```

Features

  • automatically downloads UNIHAN from the internet
  • strives for accuracy with the specifications described in UNIHAN's database design
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
  • supports Python >= 3.7 and PyPy

If you encounter a problem or have a question, please create an issue.

Installation

To download and build your own UNIHAN export:

```console
$ pip install --user unihan-etl
```

or by pipx:

```console
$ pipx install unihan-etl
```

Developmental releases

pip:

```console
$ pip install --user --upgrade --pre unihan-etl
```

pipx:

```console
$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
// Usage: unihan-etl@next -F json
```

Usage

unihan-etl offers customizable builds via its command line arguments.

See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.

To output CSV, the default format:

```console
$ unihan-etl
```

To output JSON:

```console
$ unihan-etl -F json
```

To output YAML:

```console
$ pip install --user pyyaml
$ unihan-etl -F yaml
```

To output only the kDefinition field as CSV:

```console
$ unihan-etl -f kDefinition
```

To output multiple fields, separate with spaces:

```console
$ unihan-etl -f kCantonese kDefinition
```

To output to a custom file:

```console
$ unihan-etl --destination ./exported.csv
```

To output to a custom file (templated file extension):

```console
$ unihan-etl --destination ./exported.{ext}
```
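One way to picture the templated extension is as ordinary Python string formatting. The sketch below is an assumption about the behavior, not a quote of unihan-etl's code: it supposes that `{ext}` is replaced with the name of the chosen output format, and `resolve_destination` is a hypothetical helper.

```python
def resolve_destination(template: str, fmt: str) -> str:
    """Fill an `{ext}` placeholder with the output format name (assumed behavior)."""
    return template.format(ext=fmt)


for fmt in ("csv", "json", "yaml"):
    print(resolve_destination("./exported.{ext}", fmt))
```

Under that assumption, a destination without a placeholder, such as `./exported.csv`, passes through unchanged.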

See unihan-etl CLI arguments for advanced usage examples.
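When scripting exports, the flags shown above can be assembled programmatically. This sketch only builds an argument list from the `-F`, `-f`, and `--destination` options documented in this section; `build_command` and `run_export` are hypothetical helpers, and running the command requires `unihan-etl` to be installed and on your `PATH`.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_command(
    fmt: str = "csv",
    fields: Sequence[str] = (),
    destination: Optional[str] = None,
) -> List[str]:
    """Assemble a unihan-etl invocation from the flags shown above."""
    cmd = ["unihan-etl", "-F", fmt]
    if fields:
        cmd += ["-f", *fields]  # multiple fields are space-separated arguments
    if destination:
        cmd += ["--destination", destination]
    return cmd


def run_export(cmd: List[str]) -> None:
    """Run the export, failing early if the CLI is not installed."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("unihan-etl was not found on PATH")
    subprocess.run(cmd, check=True)


cmd = build_command("json", ["kCantonese", "kDefinition"], "./exported.{ext}")
print(" ".join(cmd))
# run_export(cmd)  # uncomment to actually download and export
```

Passing the arguments as a list avoids shell quoting issues, which matters for values such as the `{ext}` placeholder.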

Code layout

```console
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py        # argparse, download, extract, transform UNIHAN's data
  options.py     # configuration object
  constants.py   # immutable data vars (field to filename mappings, etc)
  expansion.py   # extracting details baked inside of fields
  types.py       # type annotations
  util.py        # utility / helper functions

# test suite
tests/*
```

API

The package is Python under the hood, so you can use its full API. Example:

```python
>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
```

Developing

```console
$ git clone https://github.com/cihai/unihan-etl.git
```

```console
$ cd unihan-etl
```

Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).

Owner

  • Name: cihai
  • Login: cihai
  • Kind: organization
  • Email: cihai@git-pull.com
  • Location: 中国。Unknown Dynasty.

United front for open, permissive, high quality CJK datasets

Citation (CITATION.cff)

cff-version: 1.2.0
message: >-
  If you use this software, please cite it as below.
  NOTE: Change "x.y" by the version you use. If you are unsure about which version
  you are using run: `pip show unihan-etl`."
authors:
- family-names: "Narlock"
  given-names: "Tony"
  orcid: "https://orcid.org/0000-0002-2568-415X"
title: "unihan-etl"
type: software
version: x.y
url: "https://unihan-etl.git-pull.com"

GitHub Events

Total
  • Issues event: 1
  • Watch event: 8
  • Delete event: 3
  • Issue comment event: 9
  • Pull request review comment event: 6
  • Pull request review event: 1
  • Pull request event: 5
  • Fork event: 1
  • Create event: 3
Last Year
  • Issues event: 1
  • Watch event: 8
  • Delete event: 3
  • Issue comment event: 9
  • Pull request review comment event: 6
  • Pull request review event: 1
  • Pull request event: 5
  • Fork event: 1
  • Create event: 3

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 1,106
  • Total Committers: 4
  • Avg Commits per committer: 276.5
  • Development Distribution Score (DDS): 0.105
Top Committers
Name Email Commits
Tony Narlock t****y@g****m 990
pyup-bot g****t@p****o 106
dependabot-preview[bot] 2****]@u****m 8
pre-commit-ci[bot] 6****]@u****m 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 17
  • Total pull requests: 152
  • Average time to close issues: 4 months
  • Average time to close pull requests: 23 days
  • Total issue authors: 5
  • Total pull request authors: 7
  • Average comments per issue: 1.59
  • Average comments per pull request: 1.36
  • Merged pull requests: 71
  • Bot issues: 0
  • Bot pull requests: 75
Past Year
  • Issues: 2
  • Pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: 8 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 1.21
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 8
Top Authors
Issue Authors
  • tony (12)
  • garfieldnate (2)
  • frankier (1)
  • void285 (1)
  • jamesbcd (1)
Pull Request Authors
  • tony (64)
  • dependabot-preview[bot] (41)
  • dependabot[bot] (32)
  • pyup-bot (11)
  • pre-commit-ci[bot] (2)
  • gitter-badger (1)
  • kianmeng (1)
Top Labels
Issue Labels
enhancement (1) bug (1)
Pull Request Labels
dependencies (73) python (2) github_actions (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 511 last-month
  • Total dependent packages: 3
  • Total dependent repositories: 7
  • Total versions: 58
  • Total maintainers: 1
pypi.org: unihan-etl

Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML

  • Versions: 58
  • Dependent Packages: 3
  • Dependent Repositories: 7
  • Downloads: 511 Last month
Rankings
Dependent packages count: 3.2%
Dependent repos count: 5.5%
Downloads: 7.1%
Average: 7.1%
Stargazers count: 9.6%
Forks count: 10.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • alabaster 0.7.12 develop
  • atomicwrites 1.4.1 develop
  • attrs 21.4.0 develop
  • babel 2.10.3 develop
  • beautifulsoup4 4.11.1 develop
  • black 22.6.0 develop
  • certifi 2022.6.15 develop
  • charset-normalizer 2.1.0 develop
  • click 8.1.3 develop
  • codecov 2.1.12 develop
  • colorama 0.4.5 develop
  • coverage 6.4.2 develop
  • docutils 0.18.1 develop
  • flake8 3.9.2 develop
  • furo 2022.6.21 develop
  • idna 3.3 develop
  • imagesize 1.4.1 develop
  • importlib-metadata 4.12.0 develop
  • iniconfig 1.1.1 develop
  • isort 5.10.1 develop
  • jinja2 3.1.2 develop
  • livereload 2.6.3 develop
  • markdown-it-py 2.1.0 develop
  • markupsafe 2.1.1 develop
  • mccabe 0.6.1 develop
  • mdit-py-plugins 0.3.0 develop
  • mdurl 0.1.1 develop
  • mypy 0.961 develop
  • mypy-extensions 0.4.3 develop
  • myst-parser 0.18.0 develop
  • packaging 21.3 develop
  • pathspec 0.9.0 develop
  • platformdirs 2.5.2 develop
  • pluggy 1.0.0 develop
  • py 1.11.0 develop
  • pycodestyle 2.7.0 develop
  • pyflakes 2.3.1 develop
  • pygments 2.12.0 develop
  • pyparsing 3.0.9 develop
  • pytest 7.1.2 develop
  • pytest-cov 3.0.0 develop
  • pytest-rerunfailures 10.2 develop
  • pytest-watcher 0.2.3 develop
  • pytz 2022.1 develop
  • pyyaml 6.0 develop
  • requests 2.28.1 develop
  • six 1.16.0 develop
  • snowballstemmer 2.2.0 develop
  • soupsieve 2.3.2.post1 develop
  • sphinx 5.0.2 develop
  • sphinx-argparse 0.3.1 develop
  • sphinx-autobuild 2021.3.14 develop
  • sphinx-autodoc-typehints 1.18.3 develop
  • sphinx-basic-ng 0.0.1a12 develop
  • sphinx-copybutton 0.5.0 develop
  • sphinx-inline-tabs 2021.4.11b8 develop
  • sphinx-issues 3.0.1 develop
  • sphinxcontrib-applehelp 1.0.2 develop
  • sphinxcontrib-devhelp 1.0.2 develop
  • sphinxcontrib-htmlhelp 2.0.0 develop
  • sphinxcontrib-jsmath 1.0.1 develop
  • sphinxcontrib-qthelp 1.0.3 develop
  • sphinxcontrib-serializinghtml 1.1.5 develop
  • sphinxext-opengraph 0.6.3 develop
  • sphinxext-rediraffe 0.2.7 develop
  • tomli 2.0.1 develop
  • tornado 6.2 develop
  • typed-ast 1.5.4 develop
  • typing-extensions 4.3.0 develop
  • urllib3 1.26.10 develop
  • watchdog 2.1.9 develop
  • zipp 3.8.1 develop
  • appdirs 1.4.4
  • unicodecsv 0.14.1
  • zhon 1.1.5
pyproject.toml pypi
  • black * develop
  • codecov * develop
  • coverage * develop
  • docutils ~0.18.0 develop
  • flake8 * develop
  • furo * develop
  • isort * develop
  • mypy * develop
  • myst_parser * develop
  • pytest * develop
  • pytest-cov * develop
  • pytest-rerunfailures * develop
  • pytest-watcher ^0.2.3 develop
  • sphinx * develop
  • sphinx-argparse * develop
  • sphinx-autobuild * develop
  • sphinx-autodoc-typehints * develop
  • sphinx-copybutton * develop
  • sphinx-inline-tabs * develop
  • sphinx-issues * develop
  • sphinxext-opengraph * develop
  • sphinxext-rediraffe * develop
  • appdirs *
  • python ^3.7
  • unicodecsv *
  • zhon *
.github/workflows/docs.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • dorny/paths-filter v2.7.0 composite
  • jakejarvis/cloudflare-purge-action v0.3.0 composite
  • jakejarvis/s3-sync-action v0.5.1 composite
.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite