RcppMeCab

RcppMeCab: Rcpp Interface of CJK Morpheme Analyzer MeCab

https://github.com/junhewk/rcppmecab

Keywords

cjk nlp pos r rcpp tagger

Repository

Basic Info
  • Host: GitHub
  • Owner: junhewk
  • Language: C++
  • Default Branch: master
  • Homepage:
  • Size: 3.04 MB
Statistics
  • Stars: 25
  • Watchers: 2
  • Forks: 9
  • Open Issues: 3
  • Releases: 1
Created almost 8 years ago · Last pushed over 1 year ago

README.md

RcppMeCab

RcppMeCab is an Rcpp wrapper for MeCab, a part-of-speech and morphological analyzer. It handles UTF-8 natively in its C++ code and works with the CJK (Chinese, Japanese, and Korean) MeCab libraries. Because the analysis runs in C++ through Rcpp, texts can be analyzed faster from R.

Please see this for easy installation and usage examples in Korean.

Changes in 0.0.1.3-3

  • A single character vector passed to pos() now returns a character vector, not a list.
  • pos() and posParallel() return plain lists rather than named lists. The original texts were removed from the results because including them does not fit R conventions.
  • Some typos in the code and explanations were corrected.

Installation

Linux and Mac OSX

First, install the MeCab build for your language of choice.

Second, you can install RcppMeCab from CRAN with:

```
install.packages("RcppMeCab") # build from source

# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install developmental version
```

Windows

Set the language you want to analyze with the environment variable MECAB_LANG. The default value is ko; to analyze Japanese or Chinese, set it to ja before installing the package.

```
install.packages("RcppMeCab") # for installing Korean version

# or, install for Japanese
Sys.setenv(MECAB_LANG = 'ja') # for installing Japanese developmental version
install.packages("RcppMeCab", type="source") # build from source

# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install developmental version
```

To run the analysis, you also need the MeCab binary and a dictionary.

For Korean:

Install the mecab-ko-msvc and mecab-ko-dic-msvc builds that match your 32-bit or 64-bit Windows version into C:\mecab. Then pass the dictionary directory to the RcppMeCab functions.

Version Information for Korean

The current mecab-ko-msvc release does not work in R. Please use mecab-ko-msvc 0.9.2 or earlier.

For Japanese:

Install the MeCab binary, then pass the dictionary directory to the RcppMeCab functions. For example: `pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")`

Usage

This package provides two functions, pos and posParallel.

    pos(sentence) # returns a list; the sentence appears in the names of the list
    pos(sentence, join = FALSE) # yields morphemes only (tags are given in the vector names)
    pos(sentence, format = "data.frame") # returns the result as a data frame
    pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
    posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory, but is much faster than a single-threaded loop

  • sentence: the text to analyze
  • join: If TRUE, the output form is morpheme/tag. If FALSE, the output is the morpheme alone, with the tag stored as an attribute (the vector names).
  • format: The default is a list. If you set this to "data.frame", the function returns the result as a data frame.
  • sys_dic: the directory in which dicrc, model.bin, and the other dictionary files are located. The default value is ""; you can set your own default with options(mecabSysDic = "").
  • user_dic: a user dictionary file compiled by `mecab-dict-index`. The default value is also "".
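The calls and arguments above can be combined as in the following sketch. It assumes RcppMeCab and a MeCab dictionary are already installed. The sample sentence and the dictionary path are illustrative placeholders, not values from this README, and the analysis is skipped when the package or the directory is missing.

```r
# A minimal sketch, assuming RcppMeCab and a MeCab dictionary are installed.
# The sentence and the dictionary path are placeholders (assumptions).
sentence <- "안녕하세요"
dic_dir <- "/full/path/to/mecab/dic"  # hypothetical; must be a full path

if (requireNamespace("RcppMeCab", quietly = TRUE) && dir.exists(dic_dir)) {
  # Set the default system dictionary once for the session
  options(mecabSysDic = dic_dir)

  # Default output: morpheme/tag strings
  tagged <- RcppMeCab::pos(sentence)

  # Morphemes only, with the tags carried in the vector names
  morphemes <- RcppMeCab::pos(sentence, join = FALSE)

  # Data frame output
  tagged_df <- RcppMeCab::pos(sentence, format = "data.frame")

  # Parallel version for a character vector holding several texts
  tagged_all <- RcppMeCab::posParallel(c(sentence, sentence))

  print(tagged)
} else {
  message("RcppMeCab or the dictionary directory is not available; skipping.")
}
```

Naming the arguments (join, format, sys_dic, user_dic) rather than passing them by position avoids handing a dictionary path to the wrong parameter.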

Notification for the dictionary

Do not use abbreviated paths such as the tilde expression (~/). Provide the full path name in sys_dic and user_dic.
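One way to obtain a full path in R is to expand the tilde yourself before calling pos(). The sketch below uses only base R; the dictionary location is a hypothetical example and does not need to exist for the code to run.

```r
# Expand "~" into a full path before handing it to pos().
# The dictionary location below is a hypothetical example.
dic_short <- "~/mecab/dic/mecab-ko-dic"

dic_full <- path.expand(dic_short)                     # replaces the leading "~"
dic_full <- normalizePath(dic_full, mustWork = FALSE)  # resolves the path where possible

print(dic_full)
# Then, for example: pos(sentence, sys_dic = dic_full)
```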

Compiling User Dictionary

The MeCab API has a DictionaryCompiler, but it calls die(). Calling it through Rcpp therefore crashes the entire R session, so it will not be included in the RcppMeCab functions.

For Japanese, please refer to the MeCab documentation.

Unix and Mac OSX

You need a model file (model_file) if you want the library to estimate costs automatically.

You need the entire mecab-ko-dic source if you want to compile a Korean user dictionary. The user dictionary must be prepared as a CSV file. The CSV structure is described in the Japanese and Korean documentation.

Compile:

```
$ /usr/local/libexec/mecab/mecab-dict-index -m <model_file> -d <mecab_dic_location> -u <user_dictionary_file_name> -f <CSV file charset> -t <original dictionary charset> <target_csv>

# example
$ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
```
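If you prefer to stay inside R, the same command can be launched as a separate process with base R's system2(). This is only a sketch: the paths mirror the example above and are assumptions about your installation. Because the compiler runs outside the R process, a failure there should not take down the R session.

```r
# Sketch: run mecab-dict-index as an external process from R.
# All paths are taken from the example above; adjust them to your system.
indexer <- "/usr/local/libexec/mecab/mecab-dict-index"

args <- c(
  "-m", "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin",
  "-d", shQuote(path.expand("~/mecab-ko-dic-2.0.3-20170922")),
  "-u", "userdic.dic",
  "-f", "utf8",
  "-t", "utf8",
  shQuote(path.expand("~/person.csv"))
)

if (file.exists(indexer)) {
  status <- system2(indexer, args)  # returns the exit status of the command
  if (status != 0) warning("mecab-dict-index exited with status ", status)
} else {
  message("mecab-dict-index not found at ", indexer)
}
```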

Windows

  • Korean: mecab-ko-msvc has mecab-dict-index.exe.
  • Japanese: MeCab binary version has mecab-dict-index.exe.

Use it to compile the dictionary in the same way as the Linux binary.

TODOs

  • Provide multilanguage manuals for international support

Author

Junhewk Kim (junhewk.kim@gmail.com)

Contributor

Kato Akiru

Owner

  • Name: Junhewk Kim
  • Login: junhewk
  • Kind: user
  • Location: Seoul, South Korea
  • Company: Yonsei University

Medical Humanities Researcher, Narrative Bioethicist, Text Analysis Enthusiast

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 75
  • Total Committers: 3
  • Avg Commits per committer: 25.0
  • Development Distribution Score (DDS): 0.453
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Junhewk Kim | j****m@g****m | 41 |
| Kato Akiru | a****4@g****m | 31 |
| Junhewk Kim | jk@J****e | 3 |

Packages

  • Total packages: 1
  • Total downloads:
    • cran 189 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 2
  • Total maintainers: 1
cran.r-project.org: RcppMeCab

'rcpp' Wrapper for 'mecab' Library

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 189 Last month
Rankings
Forks count: 7.9%
Stargazers count: 10.6%
Average: 19.8%
Dependent repos count: 24.0%
Downloads: 27.6%
Dependent packages count: 28.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • Rcpp * imports
  • RcppParallel * imports
  • spelling * suggests
  • testthat * suggests
.github/workflows/R-CMD-check.yml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v1 composite