Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.8%) to scientific vocabulary
Repository
Export UNIHAN's database to csv, json or yaml
Basic Info
- Host: GitHub
- Owner: cihai
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://unihan-etl.git-pull.com
- Size: 2.75 MB
Statistics
- Stars: 59
- Watchers: 4
- Forks: 13
- Open Issues: 13
- Releases: 7
Metadata Files
README.md
unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.
unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.
This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.
As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).
The UNIHAN database
The UNIHAN database organizes data across multiple files, exemplified below:
tsv
U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kMandarin qiū
U+3401 kCantonese tim2
U+3401 kDefinition to lick; to taste, a mat, bamboo bark
U+3401 kHanyuPinyin 10019.020:tiàn
U+3401 kMandarin tiàn
Values vary in shape and structure depending on their field type.
kHanyuPinyin maps Unicode codepoints to locations in Hànyǔ Dà Zìdiǎn, where 10019.020:tiàn represents a single entry. To complicate matters, a value can take further forms:
tsv
U+5EFE kHanyuPinyin 10513.110,10514.010,10514.020:gǒng
U+5364 kHanyuPinyin 10093.130:xī,lǔ 74609.020:lǔ,xī
kHanyuPinyin supports multiple entries delimited by spaces. A ":" (colon) separates locations in the work from pinyin readings. A "," (comma) separates multiple locations or readings within an entry. This is just one of 90 fields contained in the database.
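As a rough illustration of these delimiters, the following Python sketch splits a raw kHanyuPinyin value into locations and readings. It is not unihan-etl's own implementation: the function names and the regular expression are hypothetical, and the location layout (one volume digit, four page digits, two character digits, one "virtual" digit) is an assumption inferred from the structured output shown further down in this README.

```python
import re

# Assumed layout of a location such as "10019.020":
# volume (1 digit), page (4 digits), ".", character (2 digits), virtual (1 digit).
LOCATION_RE = re.compile(
    r"^(?P<volume>\d)(?P<page>\d{4})\.(?P<character>\d{2})(?P<virtual>\d)$"
)


def parse_location(location):
    """Turn a location string such as '10019.020' into a dict of integers."""
    match = LOCATION_RE.match(location)
    if match is None:
        raise ValueError(f"unexpected location format: {location!r}")
    return {name: int(digits) for name, digits in match.groupdict().items()}


def parse_hanyu_pinyin(value):
    """Split a raw kHanyuPinyin value into entries of locations and readings."""
    entries = []
    for entry in value.split(" "):  # spaces delimit entries
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),  # commas delimit readings
            }
        )
    return entries


# One entry with one location and one reading.
single = parse_hanyu_pinyin("10019.020:tiàn")
# Two space-delimited entries, each with two readings.
double = parse_hanyu_pinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
```

Here `single` holds one entry and `double` holds two; a malformed location raises `ValueError` rather than being silently skipped, which seemed the safer choice for a sketch.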
Tabular, "Flat" output
CSV (default)
console
$ unihan-etl
csv
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn
With $ unihan-etl -F yaml --no-expand:
yaml
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
To preview in the CLI, try tabview or csvlens.
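The flat CSV export can also be read back with Python's standard library. The sketch below is a minimal example under stated assumptions: the sample rows are copied from the CSV output above, `load_rows` is a hypothetical helper, and mapping empty cells to `None` is a choice made here to mirror the `null` values in the flat YAML and JSON output.

```python
import csv
import io

# Sample rows copied from the CSV output above. A real export would be opened
# with open(path, newline="", encoding="utf-8") instead of io.StringIO.
SAMPLE = (
    "char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin\n"
    "㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū\n"
    '㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn\n'
)


def load_rows(handle):
    """Read flat rows and index them by UCN, mapping empty cells to None."""
    rows = {}
    for row in csv.DictReader(handle):
        rows[row["ucn"]] = {key: (value or None) for key, value in row.items()}
    return rows


rows = load_rows(io.StringIO(SAMPLE))
```

Because kDefinition contains commas, it is quoted in the export; `csv.DictReader` handles that quoting, which a plain `line.split(",")` would not.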
JSON
console
$ unihan-etl -F json --no-expand
json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]
YAML
console
$ unihan-etl -F yaml --no-expand
yaml
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401
Filter via the CLI with yq.
"Structured" output
Codepoints can pack a lot more detail into a single field. unihan-etl extracts these values into a uniform structure and prunes empty values.
Because a flat table cannot hold nested values, structured output is available as JSON, YAML, and Python lists/dicts.
JSON
console
$ unihan-etl -F json
json
[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]
YAML
console
$ unihan-etl -F yaml
yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
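Comparing the flat and structured outputs above suggests how individual fields are expanded. The Python sketch below reproduces that transformation for three fields only. It is an approximation, not the code in unihan-etl's expansion module: the helper names are hypothetical, and the rules (kDefinition split on semicolons, kCantonese split on spaces, kMandarin holding one reading for both scripts or a zh-Hans reading followed by a zh-Hant one) are assumptions drawn from the sample data.

```python
def expand_definition(value):
    """Split a flat kDefinition string on semicolons into a list."""
    return [part.strip() for part in value.split(";")]


def expand_mandarin(value):
    """Map a flat kMandarin string to zh-Hans / zh-Hant readings.

    Assumption: a single reading applies to both scripts; with two
    space-separated readings, the first is zh-Hans and the second zh-Hant.
    """
    readings = value.split(" ")
    return {"zh-Hans": readings[0], "zh-Hant": readings[-1]}


# Hypothetical field-to-expander mapping covering only the sample fields.
EXPANDERS = {
    "kDefinition": expand_definition,
    "kMandarin": expand_mandarin,
    "kCantonese": lambda value: value.split(" "),
}


def expand_row(flat_row):
    """Expand known fields and prune empty values from a flat row."""
    expanded = {}
    for field, value in flat_row.items():
        if value is None or value == "":
            continue  # empty values are pruned from structured output
        expander = EXPANDERS.get(field)
        expanded[field] = expander(value) if expander else value
    return expanded


flat = {
    "char": "㐀",
    "ucn": "U+3400",
    "kCantonese": "jau1",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kHanyuPinyin": None,
    "kMandarin": "qiū",
}
structured = expand_row(flat)
```

With the sample row, `structured` drops the empty kHanyuPinyin key and matches the first record of the structured JSON output.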
Features
- automatically downloads UNIHAN from the internet
- strives for accuracy with the specifications described in UNIHAN's database design
- export to JSON, CSV and YAML (requires pyyaml) via -F
- configurable to export specific fields via -f
- accounts for encoding conflicts due to the Unicode-heavy content
- designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
- core component and dependency of cihai, a CJK library
- data package support
- expansion of multi-value delimited fields in YAML, JSON and python dictionaries
- supports Python >= 3.7 and pypy
If you encounter a problem or have a question, please create an issue.
Installation
To download and build your own UNIHAN export:
console
$ pip install --user unihan-etl
or by pipx:
console
$ pipx install unihan-etl
Developmental releases
pip:
console
$ pip install --user --upgrade --pre unihan-etl
pipx:
console
$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
// Usage: unihan-etl@next -F json
Usage
unihan-etl offers customizable builds via its command line arguments.
See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.
To output CSV, the default format:
console
$ unihan-etl
To output JSON:
console
$ unihan-etl -F json
To output YAML:
console
$ pip install --user pyyaml
$ unihan-etl -F yaml
To output only the kDefinition field as CSV:
console
$ unihan-etl -f kDefinition
To output multiple fields, separate with spaces:
console
$ unihan-etl -f kCantonese kDefinition
To output to a custom file:
console
$ unihan-etl --destination ./exported.csv
To output to a custom file (templated file extension):
console
$ unihan-etl --destination ./exported.{ext}
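To make the templated form concrete, here is a small Python sketch of how an `{ext}` placeholder could be resolved. It rests on an assumption: that the placeholder is replaced by the name of the selected output format (csv, json or yaml). `resolve_destination` is a hypothetical helper, not part of unihan-etl's API.

```python
def resolve_destination(template, output_format):
    """Fill an optional {ext} placeholder with the chosen output format."""
    if "{ext}" in template:
        return template.replace("{ext}", output_format)
    return template  # paths without a placeholder are used unchanged


json_path = resolve_destination("./exported.{ext}", "json")
csv_path = resolve_destination("./exported.csv", "csv")
```

Under that assumption, the same `--destination` value can be reused across `-F csv`, `-F json` and `-F yaml` runs without overwriting earlier exports.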
See unihan-etl CLI arguments for advanced usage examples.
Code layout
```console
# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
tests/*
```
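For orientation, the `{XDG cache dir}` and `{XDG data dir}` placeholders can be approximated on Linux with the standard library alone. This is a sketch under assumptions: it follows the XDG Base Directory defaults (`~/.cache` and `~/.local/share`), whereas the package lists appdirs as a dependency and may resolve different locations on other platforms. Both helper names are hypothetical.

```python
import os
from pathlib import Path


def xdg_dir(env_var, fallback):
    """Return the directory named by env_var, or the XDG default under $HOME."""
    value = os.environ.get(env_var)
    return Path(value) if value else Path.home() / fallback


def unihan_etl_dirs():
    """Hypothetical helper: likely cache and output directories on Linux."""
    return {
        "cache": xdg_dir("XDG_CACHE_HOME", ".cache") / "unihan_etl",
        "data": xdg_dir("XDG_DATA_HOME", ".local/share") / "unihan_etl",
    }
```

Calling `unihan_etl_dirs()` returns two `pathlib.Path` objects; it only computes paths and does not create or read anything on disk.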
API
The package is written in Python, so its full API is available to you. Example:
```python
>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
```
Developing
console
$ git clone https://github.com/cihai/unihan-etl.git
console
$ cd unihan-etl
Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).
Owner
- Name: cihai
- Login: cihai
- Kind: organization
- Email: cihai@git-pull.com
- Location: 中国。Unknown Dynasty.
- Website: https://cihai.git-pull.com
- Repositories: 6
- Profile: https://github.com/cihai
United front for open, permissive, high quality CJK datasets
Citation (CITATION.cff)
cff-version: 1.2.0
message: >-
  If you use this software, please cite it as below.
  NOTE: Change "x.y" by the version you use. If you are unsure about which version
  you are using run: `pip show unihan-etl`."
authors:
  - family-names: "Narlock"
    given-names: "Tony"
    orcid: "https://orcid.org/0000-0002-2568-415X"
title: "unihan-etl"
type: software
version: x.y
url: "https://unihan-etl.git-pull.com"
GitHub Events
Total
- Issues event: 1
- Watch event: 8
- Delete event: 3
- Issue comment event: 9
- Pull request review comment event: 6
- Pull request review event: 1
- Pull request event: 5
- Fork event: 1
- Create event: 3
Last Year
- Issues event: 1
- Watch event: 8
- Delete event: 3
- Issue comment event: 9
- Pull request review comment event: 6
- Pull request review event: 1
- Pull request event: 5
- Fork event: 1
- Create event: 3
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 1,106
- Total Committers: 4
- Avg Commits per committer: 276.5
- Development Distribution Score (DDS): 0.105
Top Committers
| Name | Email | Commits |
|---|---|---|
| Tony Narlock | t****y@g****m | 990 |
| pyup-bot | g****t@p****o | 106 |
| dependabot-preview[bot] | 2****]@u****m | 8 |
| pre-commit-ci[bot] | 6****]@u****m | 2 |
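The summary figures above can be reproduced from the table. The Python sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, so treat it as an assumption that happens to match the reported value.

```python
# Commit counts copied from the "Top Committers" table.
COMMITS = {
    "Tony Narlock": 990,
    "pyup-bot": 106,
    "dependabot-preview[bot]": 8,
    "pre-commit-ci[bot]": 2,
}


def committer_stats(commits):
    """Return total commits, average per committer, and the assumed DDS."""
    total = sum(commits.values())
    average = total / len(commits)
    # Assumed definition: DDS = 1 - (top committer's share of all commits).
    dds = 1 - max(commits.values()) / total
    return total, average, dds


total, average, dds = committer_stats(COMMITS)
```

With these counts the total is 1,106 commits, the average is 276.5 per committer, and the score rounds to 0.105, consistent with the figures listed under "All Time".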
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 17
- Total pull requests: 152
- Average time to close issues: 4 months
- Average time to close pull requests: 23 days
- Total issue authors: 5
- Total pull request authors: 7
- Average comments per issue: 1.59
- Average comments per pull request: 1.36
- Merged pull requests: 71
- Bot issues: 0
- Bot pull requests: 75
Past Year
- Issues: 2
- Pull requests: 14
- Average time to close issues: N/A
- Average time to close pull requests: 8 days
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 1.21
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 8
Top Authors
Issue Authors
- tony (12)
- garfieldnate (2)
- frankier (1)
- void285 (1)
- jamesbcd (1)
Pull Request Authors
- tony (64)
- dependabot-preview[bot] (41)
- dependabot[bot] (32)
- pyup-bot (11)
- pre-commit-ci[bot] (2)
- gitter-badger (1)
- kianmeng (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 511 last month
- Total dependent packages: 3
- Total dependent repositories: 7
- Total versions: 58
- Total maintainers: 1
pypi.org: unihan-etl
Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML
- Documentation: https://unihan-etl.git-pull.com
- License: MIT
- Latest release: 0.37.0 (published about 1 year ago)
Dependencies
- alabaster 0.7.12 develop
- atomicwrites 1.4.1 develop
- attrs 21.4.0 develop
- babel 2.10.3 develop
- beautifulsoup4 4.11.1 develop
- black 22.6.0 develop
- certifi 2022.6.15 develop
- charset-normalizer 2.1.0 develop
- click 8.1.3 develop
- codecov 2.1.12 develop
- colorama 0.4.5 develop
- coverage 6.4.2 develop
- docutils 0.18.1 develop
- flake8 3.9.2 develop
- furo 2022.6.21 develop
- idna 3.3 develop
- imagesize 1.4.1 develop
- importlib-metadata 4.12.0 develop
- iniconfig 1.1.1 develop
- isort 5.10.1 develop
- jinja2 3.1.2 develop
- livereload 2.6.3 develop
- markdown-it-py 2.1.0 develop
- markupsafe 2.1.1 develop
- mccabe 0.6.1 develop
- mdit-py-plugins 0.3.0 develop
- mdurl 0.1.1 develop
- mypy 0.961 develop
- mypy-extensions 0.4.3 develop
- myst-parser 0.18.0 develop
- packaging 21.3 develop
- pathspec 0.9.0 develop
- platformdirs 2.5.2 develop
- pluggy 1.0.0 develop
- py 1.11.0 develop
- pycodestyle 2.7.0 develop
- pyflakes 2.3.1 develop
- pygments 2.12.0 develop
- pyparsing 3.0.9 develop
- pytest 7.1.2 develop
- pytest-cov 3.0.0 develop
- pytest-rerunfailures 10.2 develop
- pytest-watcher 0.2.3 develop
- pytz 2022.1 develop
- pyyaml 6.0 develop
- requests 2.28.1 develop
- six 1.16.0 develop
- snowballstemmer 2.2.0 develop
- soupsieve 2.3.2.post1 develop
- sphinx 5.0.2 develop
- sphinx-argparse 0.3.1 develop
- sphinx-autobuild 2021.3.14 develop
- sphinx-autodoc-typehints 1.18.3 develop
- sphinx-basic-ng 0.0.1a12 develop
- sphinx-copybutton 0.5.0 develop
- sphinx-inline-tabs 2021.4.11b8 develop
- sphinx-issues 3.0.1 develop
- sphinxcontrib-applehelp 1.0.2 develop
- sphinxcontrib-devhelp 1.0.2 develop
- sphinxcontrib-htmlhelp 2.0.0 develop
- sphinxcontrib-jsmath 1.0.1 develop
- sphinxcontrib-qthelp 1.0.3 develop
- sphinxcontrib-serializinghtml 1.1.5 develop
- sphinxext-opengraph 0.6.3 develop
- sphinxext-rediraffe 0.2.7 develop
- tomli 2.0.1 develop
- tornado 6.2 develop
- typed-ast 1.5.4 develop
- typing-extensions 4.3.0 develop
- urllib3 1.26.10 develop
- watchdog 2.1.9 develop
- zipp 3.8.1 develop
- appdirs 1.4.4
- unicodecsv 0.14.1
- zhon 1.1.5
- black * develop
- codecov * develop
- coverage * develop
- docutils ~0.18.0 develop
- flake8 * develop
- furo * develop
- isort * develop
- mypy * develop
- myst_parser * develop
- pytest * develop
- pytest-cov * develop
- pytest-rerunfailures * develop
- pytest-watcher ^0.2.3 develop
- sphinx * develop
- sphinx-argparse * develop
- sphinx-autobuild * develop
- sphinx-autodoc-typehints * develop
- sphinx-copybutton * develop
- sphinx-inline-tabs * develop
- sphinx-issues * develop
- sphinxext-opengraph * develop
- sphinxext-rediraffe * develop
- appdirs *
- python ^3.7
- unicodecsv *
- zhon *
- actions/checkout v3 composite
- actions/setup-python v4 composite
- dorny/paths-filter v2.7.0 composite
- jakejarvis/cloudflare-purge-action v0.3.0 composite
- jakejarvis/s3-sync-action v0.5.1 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- codecov/codecov-action v3 composite
- pypa/gh-action-pypi-publish release/v1 composite