jptranstokenizer

Japanese Tokenizer for transformers library

https://github.com/retarfi/jptranstokenizer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.8%) to scientific vocabulary

Keywords

japanese natural-language-processing nlp transformer

Last synced: 10 months ago

Repository

Japanese Tokenizer for transformers library

Basic Info

Host: GitHub
Owner: retarfi
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 813 KB

Statistics

Stars: 5
Watchers: 1
Forks: 1
Open Issues: 2
Releases: 11

Topics

japanese natural-language-processing nlp transformer

Created almost 4 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

jptranstokenizer: Japanese Tokenizer for transformers

This repository provides a Japanese tokenizer for the Hugging Face transformers library.
You can use JapaneseTransformerTokenizer in the same way as transformers.BertJapaneseTokenizer.
Issues may also be written in Japanese.

Documentation

Documentation is available on Read the Docs.

Install

pip install jptranstokenizer

Quickstart

This example uses jptranstokenizer.JapaneseTransformerTokenizer with the sentencepiece model of nlp-waseda/roberta-base-japanese and Juman++.
Before running it, install pyknp and Juman++.

```python
from jptranstokenizer import JapaneseTransformerTokenizer

tokenizer = JapaneseTransformerTokenizer.from_pretrained("nlp-waseda/roberta-base-japanese")
tokens = tokenizer.tokenize("外国人参政権")
# tokens: ['▁外国', '▁人', '▁参政', '▁権']
```
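The output above suggests a two-stage pipeline: the text is first split into words, and each word is then split into subword pieces marked with "▁". The following is a minimal, self-contained sketch of that idea only. It does not call jptranstokenizer, Juman++, or sentencepiece; the lexicon, the vocabulary, and the function names `segment_words`, `split_subwords`, and `toy_tokenize` are hypothetical, and greedy longest-match is a simplification of what the real analyzers and subword models do.

```python
from typing import List, Set

# Hypothetical toy lexicon standing in for a morphological analyzer such as Juman++.
TOY_LEXICON: List[str] = ["外国", "参政", "人", "権"]
# Hypothetical toy subword vocabulary; "▁" marks the start of a word.
TOY_VOCAB: Set[str] = {"▁外国", "▁人", "▁参政", "▁権"}


def segment_words(text: str, lexicon: List[str]) -> List[str]:
    """Greedy longest-match word segmentation (a simplification of real analyzers)."""
    by_length = sorted(lexicon, key=len, reverse=True)
    words, i = [], 0
    while i < len(text):
        # Fall back to a single character when no lexicon entry matches.
        word = next((w for w in by_length if text.startswith(w, i)), text[i])
        words.append(word)
        i += len(word)
    return words


def split_subwords(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match subword split of one word, prefixed with "▁"."""
    rest, pieces = "▁" + word, []
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No vocabulary entry matches: emit an unknown token for the remainder.
            pieces.append("<unk>")
            rest = ""
    return pieces


def toy_tokenize(text: str) -> List[str]:
    """Word segmentation followed by per-word subword splitting."""
    tokens: List[str] = []
    for word in segment_words(text, TOY_LEXICON):
        tokens.extend(split_subwords(word, TOY_VOCAB))
    return tokens


print(toy_tokenize("外国人参政権"))
```

With this toy lexicon and vocabulary, the sketch produces the same four pieces as the Quickstart output. A different word segmenter could split the same string differently, which is why the choice of word tokenizer matters.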

Note that different dependencies are required depending on the type of tokenizer you use.
See also the Quickstart on Read the Docs.
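Because the required dependencies vary by tokenizer type, it can help to check which optional modules are importable before loading a tokenizer. This is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module list is an assumption for illustration: only pyknp is named above, so consult the documentation for the modules your tokenizer type actually needs.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module list for a Juman++ plus sentencepiece setup; adjust to your case.
candidates = ["pyknp", "sentencepiece", "transformers"]
missing = missing_modules(candidates)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All candidate modules are importable.")
```

Note that this only checks Python packages. Juman++ itself is an external program, so it has to be verified separately.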

Citation

Another paper is forthcoming, so check here again before you cite this work.

This Implementation

@inproceedings{Suzuki-2023-nlp,
  jtitle = {{異なる単語分割システムによる日本語事前学習言語モデルの性能評価}},
  title = {{Performance Evaluation of Japanese Pre-trained Language Models with Different Word Segmentation Systems}},
  jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 和泉, 潔},
  author = {Suzuki, Masahiro and Sakaji, Hiroki and Izumi, Kiyoshi},
  jbooktitle = {言語処理学会第29回年次大会 (NLP2023)},
  booktitle = {29th Annual Meeting of the Association for Natural Language Processing (NLP)},
  year = {2023},
  pages = {894-898}
}

Related Work

Pretrained Japanese BERT models (containing Japanese tokenizer)
- Author: NLP Lab. in Tohoku University
- https://github.com/cl-tohoku/bert-japanese

SudachiTra
- Author: Works Applications
- https://github.com/WorksApplications/SudachiTra

UD_Japanese-GSD
- Author: megagonlabs
- https://github.com/megagonlabs/UD_Japanese-GSD

Juman++
- Author: Kurohashi Lab. in Kyoto University
- https://github.com/ku-nlp/jumanpp

Owner

Name: Masahiro Suzuki
Login: retarfi
Kind: user
Location: Tokyo
Company: Nikko Asset Management Co., Ltd.

Website: https://msuzuki.me/
Twitter: retarfi_
Repositories: 4
Profile: https://github.com/retarfi

Ph.D. student at the University of Tokyo / NLP Engineer in Finance

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "鈴木"
  given-names: "雅弘"
  orcid: "https://orcid.org/0000-0001-8519-5617"
- family-names: "坂地"
  given-names: "泰紀"
  orcid: "https://orcid.org/0000-0001-5030-625X"
- family-names: "和泉"
  given-names: "潔"
title: "jptranstokenizer: Japanese Tokenizer for transformers"
version: 0.3.2
date-released: 2023-05-09
url: "https://github.com/retarfi/jptranstokenizer"
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "鈴木"
    given-names: "雅弘"
    orcid: "https://orcid.org/0000-0001-8519-5617"
  - family-names: "坂地"
    given-names: "泰紀"
    orcid: "https://orcid.org/0000-0001-5030-625X"
  - family-names: "和泉"
    given-names: "潔"
  booktitle: "言語処理学会 第29回年次大会 (NLP2023)"
  month: 3
  start: 894
  end: 898
  title: "異なる単語分割システムによる日本語事前学習言語モデルの性能評価"
  year: 2023

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 11
Total pull requests: 109
Average time to close issues: 18 days
Average time to close pull requests: about 9 hours
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 1.0
Average comments per pull request: 0.17
Merged pull requests: 101
Bot issues: 0
Bot pull requests: 27

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

retarfi (9)

Pull Request Authors

retarfi (42)
github-actions[bot] (16)

Top Labels

Issue Labels

documentation (4) enhancement (2) wontfix (1) bug (1)

Packages

Total packages: 1
Total downloads:
- pypi 40 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 12
Total maintainers: 1

pypi.org: jptranstokenizer

Japanese tokenizer with transformers library

Documentation: https://jptranstokenizer.readthedocs.io/
License: MIT
Latest release: 0.4.0
published over 2 years ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 40 Last month

Rankings

Dependent packages count: 10.1%

Average: 16.7%

Downloads: 18.4%

Dependent repos count: 21.6%

Maintainers (1)

retarfi