sudachir

R Interface to 'Sudachi'

https://github.com/uribo/sudachir

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

japanese-language nlp rpackage

Repository

Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Created over 5 years ago · Last pushed about 3 years ago

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# sudachir 

SudachiR is an R interface to [Sudachi](https://github.com/WorksApplications/sudachi.rs), a Japanese morphological analyzer.


[![CRAN status](https://www.r-pkg.org/badges/version/sudachir)](https://CRAN.R-project.org/package=sudachir)
[![R build status](https://github.com/uribo/sudachir/workflows/R-CMD-check/badge.svg)](https://github.com/uribo/sudachir/actions)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)


## Installation

You can install the released version of `{sudachir}` from CRAN with:

``` r
install.packages("sudachir")
```

You can also install the development version from GitHub:

``` r
# Install {remotes} first if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("uribo/sudachir")
```

## Usage

### Set up 'r-sudachipy' environment

`{sudachir}` works with [sudachipy](https://github.com/WorksApplications/sudachi.rs/tree/develop/python) (>= 0.6.\*) via the [reticulate](https://github.com/rstudio/reticulate/) package.

To get started, you need a Python environment in which sudachipy and its dictionaries are already installed.

The package provides the function `install_sudachipy()`, which prepares a Python virtual environment for you. It installs the required modules (`sudachipy`, `sudachidict_core`, `pandas`), although you can also install them manually.

```{r}
library(reticulate)
library(sudachir)

if (!virtualenv_exists("r-sudachipy")) {
  install_sudachipy()
}

use_virtualenv("r-sudachipy", required = TRUE)
```
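
If you would rather install the modules by hand, something along these lines should work. This is a minimal sketch that relies on reticulate's `virtualenv_create()` and `virtualenv_install()`; it reuses the environment name `r-sudachipy` from the example above and assumes the PyPI package names match the module names listed earlier.

``` r
library(reticulate)

# Create the virtual environment only if it does not exist yet.
if (!virtualenv_exists("r-sudachipy")) {
  virtualenv_create("r-sudachipy")
}

# Install the modules listed above (assumed to be the PyPI package names).
virtualenv_install(
  "r-sudachipy",
  packages = c("sudachipy", "sudachidict_core", "pandas")
)

# Bind the current R session to that environment.
use_virtualenv("r-sudachipy", required = TRUE)
```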

### Tokenize sentences

Use `tokenize_to_df()` to tokenize text. It accepts either a character vector or a data frame with `doc_id` and `text` columns.

```{r}
txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)
tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))
```

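Use the `col_select` argument to choose which dictionary features are returned, either by position or by name. The `into` argument sets the feature column names; `dict_features("en")` supplies English names such as `pos1` and `pos2`.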

```{r}
tokenize_to_df(txt, col_select = 1:3) |>
  dplyr::glimpse()

tokenize_to_df(
  txt, 
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::glimpse()
```
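
Because the result is a data frame, it can be summarised with ordinary dplyr verbs. The following sketch counts tokens per top-level part of speech; it assumes the `r-sudachipy` environment is already active and that the selected `pos1` feature appears as a column of that name in the output.

``` r
library(sudachir)

txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)

# Tokenize with English feature names, keeping only the first two POS levels,
# then count how often each top-level part of speech occurs.
pos_counts <- tokenize_to_df(
  txt,
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::count(pos1, sort = TRUE)

pos_counts
```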

The `as_tokens()` function tidies the tokens and their first part-of-speech information into a list of named tokens. You can also use `form()` as a shorthand for `tokenize_to_df(txt) |> as_tokens()`.

```{r}
tokenize_to_df(txt) |> as_tokens(type = "surface")

form(txt, type = "surface")
form(txt, type = "normalized")
form(txt, type = "dictionary")
form(txt, type = "reading")
```
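
A list of tokens is easy to turn back into space-separated text, which some downstream tools expect. The helper below is a hypothetical illustration rather than part of `{sudachir}`; it is shown on a hand-made placeholder list and assumes that `as_tokens()` returns one character vector per document.

``` r
# Hypothetical helper: collapse each document's tokens into a single string.
join_tokens <- function(tokens, sep = " ") {
  vapply(
    tokens,
    function(x) paste(unname(x), collapse = sep),
    character(1)
  )
}

# Hand-made stand-in for a token list, for illustration only.
example_tokens <- list(c("a", "b", "c"), c("d", "e"))
joined <- join_tokens(example_tokens)
joined
```

With a configured environment, `tokenize_to_df(txt) |> as_tokens("surface", pos = FALSE) |> join_tokens()` would apply the same helper to real output.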

### Change split mode

Sudachi provides three split modes: A (shortest units), B (intermediate units), and C (longest units, close to named entities). Pass the mode to `rebuild_tokenizer()` and supply the result as the `instance` argument.

```{r}
tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
  as_tokens("surface", pos = FALSE)

tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
  as_tokens("surface", pos = FALSE)
```
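
To see how much the split mode changes the segmentation, you can compare token counts across modes. This is a rough sketch that assumes the `r-sudachipy` environment is active and that `as_tokens()` returns one element per document, so `lengths()` gives the number of tokens in each.

``` r
library(sudachir)

txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)

modes <- c("A", "B", "C")

# For each split mode, tokenize the text and count the tokens per document.
token_counts <- sapply(modes, function(m) {
  tokenize_to_df(txt, instance = rebuild_tokenizer(mode = m)) |>
    as_tokens("surface", pos = FALSE) |>
    lengths()
})

# One column per mode, one row per document.
token_counts
```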

### Change dictionary edition

You can change dictionary options, such as the dictionary edition, with the `rebuild_tokenizer()` function.

```{r}
if (py_module_available("sudachidict_full")) {
  tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
  tokenize_to_df(txt, instance = tokenizer_full) |>
    as_tokens("surface", pos = FALSE)
}
```
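
The block above only runs when the `sudachidict_full` module is available. One possible way to add it to the existing environment is sketched below; it assumes the virtual environment is named `r-sudachipy` and that the PyPI package name matches the module name.

``` r
library(reticulate)

# Install the full dictionary into the existing virtual environment
# (package name assumed to match the module name checked above).
virtualenv_install("r-sudachipy", packages = "sudachidict_full")

use_virtualenv("r-sudachipy", required = TRUE)

# TRUE once the dictionary module can be imported.
full_available <- py_module_available("sudachidict_full")
full_available
```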

Owner

  • Name: Shinya Uryu
  • Login: uribo
  • Kind: user
  • Location: Tokushima, Japan
  • Company: Tokushima University (徳島大学)

R / Data Engineer / Geo / Ecology / Visualization / Tokushima University (徳島大学)

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 56
  • Total Committers: 3
  • Avg Commits per committer: 18.667
  • Development Distribution Score (DDS): 0.536
Past Year
  • Commits: 21
  • Committers: 1
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Shinya Uryu (s****7@g****m): 26 commits
  • paithiov909 (a****4@g****m): 25 commits
  • Shinya Uryu (u****a@t****p): 5 commits

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 4
  • Average time to close issues: 3 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • uribo (3)
  • barracuda156 (1)
Pull Request Authors
  • paithiov909 (3)
  • uribo (1)
Top Labels
Issue Labels
release 🚀 (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 198 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 1
  • Total maintainers: 1
cran.r-project.org: sudachir

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 198 Last month
Rankings
Forks count: 17.0%
Stargazers count: 20.6%
Dependent repos count: 23.8%
Dependent packages count: 28.7%
Average: 34.3%
Downloads: 81.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • cli >= 2.1.0 imports
  • dplyr >= 1.0.2 imports
  • glue >= 1.4.2 imports
  • magrittr >= 1.5 imports
  • purrr >= 0.3.4 imports
  • reticulate >= 1.17 imports
  • rlang >= 0.4.8 imports
  • tibble >= 3.0.4 imports
  • tidyselect >= 1.1.0 imports
  • rstudioapi * suggests
  • testthat * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc master composite
  • r-lib/actions/setup-r master composite