https://github.com/centre-for-humanities-computing/chinese-tokenizer
A Rusty way of tokenizing Chinese texts
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 17.8%, to scientific vocabulary)
Repository
A Rusty way of tokenizing Chinese texts
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
A Rust-y tokenizer for Chinese texts
This is a short program for tokenizing Chinese text, using a Rust port of jieba.
The default tokenizer is a maximum-likelihood matching algorithm that works from a Chinese lexicon (i.e. it is dictionary-based). However, jieba-rs also implements a Hidden Markov Model (HMM) tokenizer. The preferred tokenizer can easily be selected by making the necessary changes in src/main.rs.
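In practice, the choice comes down to the boolean HMM flag that jieba-rs exposes on its cut method. The following is a minimal sketch of that choice, independent of the actual contents of src/main.rs:

use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let text = "今天天气很好";

    // Dictionary-based segmentation only: hmm = false.
    let dict_only: Vec<&str> = jieba.cut(text, false);

    // Dictionary plus HMM fallback for out-of-vocabulary words: hmm = true.
    let with_hmm: Vec<&str> = jieba.cut(text, true);

    println!("{:?}", dict_only);
    println!("{:?}", with_hmm);
}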
Getting started
To run the program on your machine, you'll first need to install Rust and the Cargo package manager. Installation differs slightly depending on whether you use macOS, Linux, or Windows; the official Rust installation guide covers all three platforms.
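On macOS and Linux, the standard route is the rustup installer script (Windows users instead download rustup-init.exe from rustup.rs):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh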
Once that's done, copy your data into the empty 'data' folder. Note that the program currently only supports folder structures one level deep. In other words:
data/subfolder/file.txt
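For example, a corpus split into subfolders (the folder and file names below are placeholders) would look like this:

data/
├── corpus_a/
│   ├── doc_001.txt
│   └── doc_002.txt
└── corpus_b/
    └── doc_003.txt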
Be sure to check the comments at the beginning of src/main.rs. Some paths and variables may need to be modified to suit your needs.
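As a purely hypothetical illustration of the kind of values you may need to adjust (the identifiers below are placeholders, not the actual names used in src/main.rs):

// Placeholder names; check the comments in src/main.rs for the real identifiers.
const INPUT_DIR: &str = "data";      // root folder containing the corpus subfolders
const OUTPUT_DIR: &str = "output";   // where the tokenized output is written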
Building the program
With Cargo, you have two options for running the program. First, you can simply run the following from the project's root directory:
cargo run --release
This builds the local package and executes the binary. However, you can also run these two steps separately.
First build:
cargo build --release
Then run:
./target/release/chinese
Note that in both cases we're using the --release flag, which prompts the compiler to perform optimisations that substantially improve the tokenizer's performance.
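Should you need to adjust those optimisations, the release profile can be configured in Cargo.toml; the snippet below simply restates Cargo's defaults for release builds:

[profile.release]
opt-level = 3   # full optimisations (Cargo's default for --release)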
NB!
This was written quite quickly to solve a specific problem and is still essentially a work in progress. It will work for any collection of Chinese texts, as long as the corpus is structured in the format outlined above. However, I hope at some point to return to this and make it more flexible, as well as offer the user the chance to set certain flags.
Author
Author: rdkm89
Date: 2020-01-13
Built with
This tokenizer pipeline depends on jieba-rs by GitHub user messense. The original repository for that project can be found on GitHub.
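For reference, jieba-rs is declared as an ordinary Cargo dependency; the version below is illustrative rather than the exact one pinned in this repository's Cargo.toml:

[dependencies]
jieba-rs = "0.5"   # illustrative version; check crates.io for the current release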
License
This project is licensed under the MIT License - see the LICENSE.md file for details
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0