https://github.com/centre-for-humanities-computing/chinese-tokenizer
A Rusty way of tokenizing Chinese texts
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 17.8%, to scientific vocabulary)
Repository
A Rusty way of tokenizing Chinese texts
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
A Rust-y tokenizer for Chinese texts
This is a short program for tokenizing Chinese text, using a Rust port of jieba.
The default tokenizer is a maximum-likelihood matching algorithm that works from a Chinese lexicon (i.e. it is dictionary-based). However, jieba-rs also implements a Hidden Markov Model (HMM) tokenizer. The preferred tokenizer can easily be selected by making the necessary changes in src/main.rs.
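In practice, the choice comes down to the boolean HMM flag that jieba-rs exposes on its cut method. The following is a minimal sketch of that choice, independent of the actual contents of src/main.rs:

use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let text = "今天天气很好";

    // Dictionary-based segmentation only: hmm = false.
    let dict_only: Vec<&str> = jieba.cut(text, false);

    // Dictionary plus HMM fallback for out-of-vocabulary words: hmm = true.
    let with_hmm: Vec<&str> = jieba.cut(text, true);

    println!("{:?}", dict_only);
    println!("{:?}", with_hmm);
}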
Getting started
To run the program on your machine, you'll first need to install Rust and the Cargo package manager. Installation differs slightly depending on whether you use macOS, Linux, or Windows; the official Rust installation guide covers all three platforms.
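On macOS and Linux, the standard route is the rustup installer script (Windows users instead download rustup-init.exe from rustup.rs):

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh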
Once that's done, copy your data into the empty 'data' folder. Note that the program currently only supports folder structures one level deep. In other words:
data/subfolder/file.txt
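For example, a corpus split into subfolders (the folder and file names below are placeholders) would look like this:

data/
├── corpus_a/
│   ├── doc_001.txt
│   └── doc_002.txt
└── corpus_b/
    └── doc_003.txt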
Be sure to check the comments at the beginning of src/main.rs. Some paths and variables may need to be modified to suit your needs.
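As a purely hypothetical illustration of the kind of values you may need to adjust (the identifiers below are placeholders, not the actual names used in src/main.rs):

// Placeholder names; check the comments in src/main.rs for the real identifiers.
const INPUT_DIR: &str = "data";      // root folder containing the corpus subfolders
const OUTPUT_DIR: &str = "output";   // where the tokenized output is written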
Building the program
With Cargo, you have two options for running the program. First, you can simply run the following from the project's root directory:
cargo run --release
This builds the local package and executes the binary. However, you can also run these two steps separately.
First build:
cargo build --release
Then run:
./target/release/chinese
Note that in both cases we're using the --release flag, which prompts the compiler to perform optimisations that substantially improve the tokenizer's performance.
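Should you need to adjust those optimisations, the release profile can be configured in Cargo.toml; the snippet below simply restates Cargo's defaults for release builds:

[profile.release]
opt-level = 3   # full optimisations (Cargo's default for --release)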
NB!
This was written quite quickly to solve a specific problem and is still essentially a work in progress. It will work for any collection of Chinese texts, as long as the corpus is structured in the format outlined above. However, I hope at some point to return to this and make it more flexible, as well as offer the user the chance to set certain flags.
Author
Author: rdkm89
Date: 2020-01-13
Built with
This tokenizer pipeline depends on jieba-rs by GitHub user messense. The original repository for that project can be found on GitHub.
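For reference, jieba-rs is declared as an ordinary Cargo dependency; the version below is illustrative rather than the exact one pinned in this repository's Cargo.toml:

[dependencies]
jieba-rs = "0.5"   # illustrative version; check crates.io for the current release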
License
This project is licensed under the MIT License - see the LICENSE.md file for details
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0