https://github.com/centre-for-humanities-computing/chinese-tokenizer

A Rusty way of tokenizing Chinese texts

Science Score: 13.0%

This score indicates how likely the project is to be science-related, based on the following indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 17.8%)

Keywords

jieba · rust · tokenizer
Last synced: 5 months ago

Repository

A Rusty way of tokenizing Chinese texts

Basic Info
  • Host: GitHub
  • Owner: centre-for-humanities-computing
  • License: MIT
  • Language: Rust
  • Default Branch: master
  • Size: 9.77 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
jieba · rust · tokenizer
Created about 6 years ago · Last pushed about 6 years ago
Metadata Files
Readme · License

README.md

A Rust-y tokenizer for Chinese texts

This is a short program for tokenizing Chinese text, using a Rust port of jieba.

The default tokenizer is a maximum likelihood matching algorithm that works from a Chinese lexicon (i.e. it is dictionary-based). However, jieba-rs also implements a Hidden Markov Model (HMM) tokenizer, which can segment words missing from the lexicon. The preferred tokenizer can be selected by making the necessary changes in src/main.rs.
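
To illustrate the difference between the two modes, here is a minimal sketch using the jieba-rs API (assuming jieba-rs is declared as a dependency in Cargo.toml; the actual src/main.rs may wire this up differently). In jieba-rs, the HMM step is toggled by a boolean flag on cut:

```rust
use jieba_rs::Jieba;

fn main() {
    // Build the tokenizer from the bundled default dictionary.
    let jieba = Jieba::new();

    let text = "我们都喜欢自然语言处理";

    // Dictionary-based maximum likelihood matching only: hmm = false.
    let dict_only: Vec<&str> = jieba.cut(text, false);

    // Dictionary matching plus the Hidden Markov Model: hmm = true.
    let with_hmm: Vec<&str> = jieba.cut(text, true);

    println!("dict only: {:?}", dict_only);
    println!("with HMM:  {:?}", with_hmm);
}
```

With hmm = true, spans that the dictionary cannot cover are re-segmented by the HMM (via the Viterbi algorithm) instead of falling back to single characters.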

Getting started

In order to run the program on your machine, you'll first need to install Rust and the Cargo package manager. This can be done in a number of different ways, depending on whether you use macOS, Linux, or Windows; see the official installation guide at https://www.rust-lang.org/tools/install for details.

Once that's completed, you'll need to copy your data into the empty data/ folder. Note that the program currently only handles corpora whose folder structure is exactly one level deep. In other words:

data/subfolder/file.txt

Be sure to check the comments at the beginning of src/main.rs. Some paths and variables may need to be modified to suit your needs.
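
For concreteness, a one-level-deep traversal of the data/ folder might look like the following sketch (hypothetical code, not taken from this repository's src/main.rs):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Collect every file sitting exactly one level below `root`,
/// i.e. the data/subfolder/file.txt layout described above.
fn collect_corpus_files(root: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for subfolder in fs::read_dir(root)? {
        let subfolder = subfolder?.path();
        if subfolder.is_dir() {
            for entry in fs::read_dir(&subfolder)? {
                let path = entry?.path();
                if path.is_file() {
                    files.push(path);
                }
            }
        }
    }
    Ok(files)
}
```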

Building the program

With Rust, you have two options for running the program. First, you can simply do the following in the root directory:

cargo run --release

This builds the local package and executes the binary. However, you can also run these steps separately.

First build:

cargo build --release

Then run:

./target/release/chinese

Note that in both cases, we're using the --release flag. This prompts the compiler to perform optimisations that substantially improve the performance of the tokenizer.

NB!

This was written quite quickly to solve a specific problem and is still essentially a work in progress. It will work for any collection of Chinese texts, as long as the corpus is structured in the format outlined above. However, I hope to return to this at some point to make it more flexible, as well as to offer the user the chance to set certain flags.

Author

Author: rdkm89
Date: 2020-01-13

Built with

This tokenizer pipeline depends on jieba-rs by GitHub user messense. The original repository for that project can be found at https://github.com/messense/jieba-rs.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark


Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0