sunflower

An R Package to Assess and Categorize Production Errors in Spanish

https://github.com/ismaelgutier/sunflower

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 6 DOI reference(s) in README
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.8%) to scientific vocabulary

Keywords

aphasia aphasia-help errors-classifier logopedia neuropsychology r repetitive-attempts rlanguage rpackage slp spanish spanish-language speech-language-pathology

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: ismaelgutier
License: gpl-3.0
Language: R
Default Branch: main
Homepage:
Size: 19.1 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1


Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

README.Rmd

---
output: github_document
---



```{r, include = FALSE}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

```

# sunflower: A Package to Assess and Categorize Language Production Errors



![](https://img.shields.io/badge/sunflower-v._1.02-orange?style=flat&link=https%3A%2F%2Fgithub.com%2Fismaelgutier%2Fsunflower) 
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-%233493ad.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Language](https://img.shields.io/badge/Language-grey?style=flat&logo=R)](https://www.r-project.org/)
[![Paper](https://img.shields.io/badge/Paper-Frontiers%20in%20Psychology-brightgreen)](https://doi.org/10.3389/fpsyg.2025.1538196)





*sunflower* is a package designed to assist clinicians and researchers in the fields of Speech Therapy and (Neuro)psychology of Language. Its primary goal is to facilitate the management of multiple-response data and to compute formal similarity indices that assess the quality of oral and written productions, in Spanish, by patients with aphasia and related disorders such as apraxia of speech. The package also classifies these productions according to classical typologies in the field, once formal and semantic similarity measures have been computed. To compute semantic similarity, *sunflower* partially relies on natural language processing models such as word2vec. The outputs provided by this package facilitate statistical analyses in R, a widely used tool in the field for data wrangling, visualization, and analysis.

## Installation

*sunflower* can be installed as an R package with:

```r
install.packages("devtools")
devtools::install_github("ismaelgutier/sunflower")
```
```{r include=FALSE}
require("sunflower")
require("tidyverse")
require("htmltools")
require("readxl")
```

The *sunflower* functions are designed to be chained with the pipe operator (`%>%`), which comes from *magrittr* and is made available by the [*tidyverse* package](https://www.tidyverse.org/). This lets *sunflower* work seamlessly with other packages in the *tidyverse*, such as *dplyr* for data wrangling, *readr* for data reading, and *ggplot2* for data visualization, so that reading, scoring, and plotting can happen in a single pipeline.

## How to use

### Loading the packages

Once installed, we only need to load the *sunflower* package. However, as previously mentioned, the *tidyverse* package can also be valuable for other complementary tasks.
```r
require("sunflower")
require("tidyverse")
```

### Compute Formal Quality Indexes

The package ships with several sample dataframes for anyone interested in testing the functions: `IGC_sample`, `IGC_long_sample`, `IGC_long_phon_sample` and `simulated_sample`. For instance, we can load one of them as follows:

```{r}
df_to_formal_metrics = sunflower::IGC_long_phon_sample
```

However, in this example we conduct the formal quality analysis using broad phonological transcriptions from a larger dataset.

```{r echo=FALSE}
df_to_formal_metrics = readxl::read_xlsx("manuscript/data/long_with_phon.xlsx") %>%
  dplyr::select(ID_general, test, task_type, task_modality, ID, 
                item_ID = task_item_ID, item, response, 
                RA = cda_behavior, attempt = Attempt,
                item_phon = target_word_transcrito_clean, response_phon = response_word_transcrito_clean) %>%
  dplyr::filter(test %in% c("SnodgrassVanderwart", "BETA", "EPLA", "Gutiérrez-Cordero")) %>%
  dplyr::filter(!stringr::str_detect(task_type, "nonword")) %>%
  dplyr::arrange(ID)
```

```{r}
formal_metrics_computed = df_to_formal_metrics %>% 
    get_formal_similarity(item_col = "item", 
                          response_col = "response",
                          attempt_col = "attempt",
                          group_cols = c("ID", "item_ID"))
```
```{r include=FALSE}
 formal_metrics_computed = formal_metrics_computed %>% select(-c(test, task_modality, comment_warning))
```

Display some of the results from the formal quality analysis.

```{r echo=FALSE}
formal_metrics_computed %>% head(9) %>% knitr::kable()
```

***Note.*** Scroll the table to the right to see all the columns and metrics.
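
As an intuition for what a formal similarity index measures, the sketch below computes a normalized edit-distance similarity in base R. It is not the implementation behind `get_formal_similarity()`: the helper `toy_similarity()` and the item/response pairs are invented for illustration, and the package computes its own set of metrics.

```r
# Illustrative only: NOT sunflower's implementation.
# `toy_similarity` is a hypothetical helper; the word pairs are made up.
toy_similarity <- function(item, response) {
  # Levenshtein distance for each item/response pair (base R, utils::adist)
  dist <- mapply(function(i, r) drop(utils::adist(i, r)),
                 item, response, USE.NAMES = FALSE)
  # Scale by the longer string so that 1 means identical strings
  1 - dist / pmax(nchar(item), nchar(response))
}

toy <- data.frame(item     = c("libro", "casa"),
                  response = c("lintro", "casa"),
                  stringsAsFactors = FALSE)
toy$similarity <- toy_similarity(toy$item, toy$response)
toy
```

Here the first pair differs by one substitution and one insertion (distance 2 over a maximum length of 6), while the second pair is identical and therefore scores 1.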


### Obtain Positional Accuracy Data

Next, apply `positional_accuracy()` to the output of the previous step to obtain the accuracy at each position:

```{r message=FALSE}
positions_accuracy = formal_metrics_computed %>% 
  positional_accuracy(item_col = "item_phon", 
                      response_col = "response_phon", 
                      match_col = "adj_strict_match_pos")
```

Display the results of the positional accuracy analysis.

```{r echo=FALSE}
positions_accuracy %>%
  filter(response == "lintro") %>%  # Item for which response is "lintro"
  select(ID:response_phon, RA, attempt, position:element_in_response) %>% 
  knitr::kable()
```

Plotting this dataframe gives the following figure:

```{r include=FALSE}
# Convert targetL to character to avoid problems when combining dataframes
positions_accuracy <- positions_accuracy %>%
  dplyr::mutate(targetL = as.character(targetL))

# Duplicate and modify the dataframe to create 'positions_general'
positions_general <- positions_accuracy %>%
  dplyr::mutate(targetL = "General")

# Combine both dataframes
positions <- dplyr::bind_rows(positions_accuracy, positions_general)

# Manually specify the levels in the desired order
desired_levels <- c("3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
                    "13", "14", "15", "17", "21", "22", "24", "48", "General")

# Convert correct_pos to numeric and order targetL as a factor following desired_levels
positions <- positions %>%
  dplyr::mutate(correct_pos = as.numeric(correct_pos),
         targetL = factor(targetL, levels = desired_levels)) %>%
  dplyr::arrange(correct_pos, targetL)

# Define a set of linetypes that can be recycled
custom_linetypes <- rep(c("solid", "dashed", "dotted", "longdash", "dotdash"),
                        length.out = nlevels(positions$targetL))

# Compute accuracy and count the number of observations per group
plot_positions <- positions %>%
  group_by(position, targetL) %>%
  summarize(acc = mean(correct_pos, na.rm = TRUE),
            n = n()) %>%
  ggplot(aes(x = as.numeric(position), y = acc, group = targetL,
             fill = targetL, color = targetL, lty = targetL)) +
  geom_line(size = 0.70, alpha = 0.6) +
  geom_point(aes(size = n), shape = 21, color = "black", alpha = 0.6) +
  scale_linetype_manual(values = custom_linetypes) +
  theme(panel.border = element_rect(colour = "black", fill = NA),
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  scale_x_continuous(breaks = seq(min(positions$position, na.rm = TRUE), 
                                  max(positions$position, na.rm = TRUE), 
                                  by = 1)) +
  ylab("Proportion (%) of correct phonemes") +
  xlab("Phoneme position") +
  guides(fill = guide_legend(title = "Word Length", ncol = 2),  # Set the legend to 2 columns
         lty = guide_legend(title = "Word Length", ncol = 2),
         color = guide_legend(title = "Word Length", ncol = 2),
         size = guide_legend(title = "Datapoints", ncol = 2)) +
  papaja::theme_apa() +
  theme(legend.position = "right")
```

```{r plot_positions, out.width = "75%", fig.align="center", echo=FALSE}
plot_positions
```

***Note.*** This plot depicts the positional accuracy of `r nrow(positions)` datapoints.
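
To make the idea of positional accuracy concrete, here is a minimal base R sketch that scores each position of a target against the element produced at the same position. It assumes a strict, unaligned comparison and is not a description of how `positional_accuracy()` works internally. The helper `toy_positional()`, the column `element_in_item`, and the example words are hypothetical.

```r
# Illustrative only: strict position-by-position comparison, no alignment.
# `toy_positional` and `element_in_item` are hypothetical names.
toy_positional <- function(item, response) {
  target   <- strsplit(item, "")[[1]]
  produced <- strsplit(response, "")[[1]]
  # Pad with NA (or truncate) so the response has as many slots as the target
  length(produced) <- length(target)
  data.frame(
    position            = seq_along(target),
    element_in_item     = target,
    element_in_response = produced,
    # 1 when the same element sits at the same position, 0 otherwise
    correct_pos         = as.integer(!is.na(produced) & produced == target),
    stringsAsFactors    = FALSE
  )
}

# Made-up item/response pairs, stacked in long format
toy <- rbind(toy_positional("libro", "lintro"),
             toy_positional("casa", "ca"))

# Proportion of correct elements at each position across both items
acc_by_position <- aggregate(correct_pos ~ position, data = toy, FUN = mean)
acc_by_position
```

A summary of this kind, split by target length, is what the plot above displays for the real data.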


### Classify Errors

*sunflower* can classify production errors once a set of indices describing each response to a stimulus has been obtained and each response has been contextualized as coming from repeated attempts or from a single production. This process involves three steps.

First, a lexicality check of the response is performed using the `check_lexicality()` function, which determines whether the response is a real word. With `criterion = "database"`, the package searches for the response in a database such as *BuscaPalabras* ([BPal](https://www.uv.es/~mperea/Davis_Perea_in_press.pdf)) and compares its frequency with that of the target word, treating the response as a real word depending on whether its frequency is higher or not. Alternatively, with `criterion = "dictionary"`, the response is checked against a dictionary (*sunflower* searches for responses among entries from the *Real Academia Española*, [RAE](https://www.rae.es/)).
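
The two criteria can be pictured with a toy lookup in base R. This is one possible reading of the description above, not the logic of `check_lexicality()`: the helper `toy_check_lexicality()`, the three-word lexicon, and its frequency values are all invented for illustration.

```r
# Illustrative only: a toy lexicon with made-up frequencies.
toy_lexicon <- data.frame(word = c("casa", "caballo", "pito"),
                          freq = c(120, 35, 4),
                          stringsAsFactors = FALSE)

# Hypothetical helper, NOT sunflower's check_lexicality()
toy_check_lexicality <- function(item, response, lexicon,
                                 criterion = c("dictionary", "database")) {
  criterion  <- match.arg(criterion)
  in_lexicon <- response %in% lexicon$word
  if (criterion == "dictionary") {
    # Dictionary criterion: a response is a word if it has an entry
    return(in_lexicon)
  }
  # Database criterion (assumed reading): the response must be listed
  # and be more frequent than the target
  item_freq <- lexicon$freq[match(item, lexicon$word)]
  resp_freq <- lexicon$freq[match(response, lexicon$word)]
  in_lexicon & !is.na(item_freq) & resp_freq > item_freq
}

toy_check_lexicality("caballo", c("casa", "talablo"), toy_lexicon, "dictionary")
toy_check_lexicality("caballo", "casa", toy_lexicon, "database")
```

The real function works on dataframe columns (`item_col`, `response_col`) and on full-size resources, so this sketch only conveys the difference between the two criteria.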

Next, similarity measures between the targets and the responses are obtained using various algorithms within the `get_formal_similarity()` function. Finally, the cosine similarity between the two productions is computed, where possible, using the `get_semantic_similarity()` function, which relies on an NLP model. In our case, the parameter `model = m_w2v` refers to a model loaded from a binary file containing Spanish Billion Words corpus embeddings created with the *word2vec* algorithm. This file is included in the dependency-bundle zip (for more information, see the markdown in the vignettes), which can be found in our supplementary [OSF mirror repository](https://osf.io/akuxv/).
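
For readers unfamiliar with cosine similarity, the following base R sketch shows the computation on two small vectors. The vectors are invented stand-ins for word embeddings and `cosine_similarity()` is a hypothetical helper. In the package, the embeddings come from the word2vec model, and the computation is handled by `get_semantic_similarity()`.

```r
# Illustrative only: cosine similarity between two numeric vectors.
# `cosine_similarity` is a hypothetical helper; the vectors are made up.
cosine_similarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  # Dot product divided by the product of the Euclidean norms
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

target_vec   <- c(1, 2, 3)   # stand-in for the target's embedding
response_vec <- c(2, 4, 6)   # same direction, different length
unrelated    <- c(3, 0, -1)  # orthogonal to target_vec

cosine_similarity(target_vec, response_vec)
cosine_similarity(target_vec, unrelated)
```

Vectors pointing in the same direction score 1 regardless of their length, while orthogonal vectors score 0, which is why the measure is convenient for comparing embeddings.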

```{r include=FALSE}
df_to_classify = sunflower::IGC_long_phon_sample

m_w2v = word2vec::read.word2vec(file = "dependency-bundle/sbw_vectors.bin", normalize = TRUE)
```

```{r}
errors_classified = df_to_classify %>% 
  check_lexicality(item_col = "item", response_col = "response", criterion = "database") %>%
  get_formal_similarity(item_col = "item", response_col = "response", 
                        attempt_col = "attempt", group_cols = c("ID", "item_ID")) %>%
  get_semantic_similarity(item_col = "item", response_col = "response", model = m_w2v) %>%
  classify_errors(response_col = "response", item_col = "item",
                  access_col = "accessed", RA_col = "RA", also_classify_RAs = TRUE)
```

Display the classification that was conducted.

```{r echo=FALSE}

errors_classified %>%
  dplyr::filter(
    (ID == 32 & item_ID == 32 & response == "rasca") |
    (ID == 276 & item_ID == 7 & response == "rasca") |
    (ID == 272 & item_ID == 3 & response == "caballo") |
    (ID == 134 & item_ID == 10 & response == "sisi") |
    (ID == 148 & item_ID == 24 & response == "are") |
    (ID == 17 & item_ID == 17 & response == "malacar") |
    (ID == 140 & item_ID == 16 & response == "pito") |
    (ID == 8 & item_ID == 8 & response == "talablo") |
    (ID == 9 & item_ID == 9)
  ) %>%
  dplyr::select(ID, item_ID, item, response, RA, attempt, correct, nonword:check_comment) %>%
  knitr::kable()  # Display as a table

```

***Note.*** Scroll the table to the right to see all the columns and errors.

## Making it faster - A guided usage tutorial

A script that runs all the functions relatively quickly on a sample can be downloaded from our OSF repository. This can be helpful both for novice users and for those who want to explore the package's functionality in a more straightforward or faster way. Users only need to run the code in that script, which requires the word2vec model made available in the dependency-bundle zip.
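
The setup code of this README loads the model with `word2vec::read.word2vec()`. A defensive version of that step might look like the sketch below. It assumes the dependency bundle was unzipped into a `dependency-bundle` folder in the working directory, so adjust `model_path` to wherever the file actually lives.

```r
# Path assumed from this README's setup code; change it if the bundle
# was unzipped elsewhere.
model_path <- file.path("dependency-bundle", "sbw_vectors.bin")

if (!requireNamespace("word2vec", quietly = TRUE)) {
  message("Package 'word2vec' is not installed; install it before loading the model.")
} else if (!file.exists(model_path)) {
  message("Model file not found at: ", model_path)
} else {
  # Load the binary embeddings; the resulting object is what gets passed
  # as `model = m_w2v` to get_semantic_similarity()
  m_w2v <- word2vec::read.word2vec(file = model_path, normalize = TRUE)
}
```

Checking for the file first gives a clearer message than the error raised when the binary is missing.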

## The published work

This work has been published and can be accessed [here](https://doi.org/10.3389/fpsyg.2025.1538196). It can be cited as follows: 

Gutiérrez-Cordero, I., & García-Orza, J. (2025). sunflower: an R package for handling multiple response attempts and conducting error analysis in aphasia and related disorders. *Frontiers in Psychology*, *16*, 1538196. https://doi.org/10.3389/fpsyg.2025.1538196

## Acknowledgments

Thanks to Cristian Cardellino for making his work on the [Spanish Billion Word Corpus and Embeddings](https://crscardellino.github.io/SBWCE/) publicly available.

## Hello!

Any suggestions, comments, or questions about the package's functionality are warmly welcomed. If you’d like to contribute to the project, please feel free to get in touch. 🌻

Owner

Name: Ismael Gutiérrez Cordero
Login: ismaelgutier
Kind: user

Repositories: 1
Profile: https://github.com/ismaelgutier

Citation (CITATION)

citation <- function(package = "sunflower") {
  citEntry(
    entry = "Article",
    title = "sunflower: an R package for handling multiple response attempts and conducting error analysis in aphasia and related disorders",
    author = c(
      person("Ismael", "Gutiérrez-Cordero", role = c("aut", "cre")),
      person("Javier", "García-Orza", role = "aut")
    ),
    year = "2025",
    journal = "Frontiers in Psychology",
    volume = "16",
    pages = "1538196",
    doi = "10.3389/fpsyg.2025.1538196",
    url = "https://doi.org/10.3389/fpsyg.2025.1538196"
  )
}

GitHub Events

Total

Release event: 1
Push event: 45
Pull request event: 44
Create event: 1

Last Year

Release event: 1
Push event: 45
Pull request event: 44
Create event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 8
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 8
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

ismaelgutier (33)

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
PTXQC * imports
dplyr >= 1.0.7 imports
magrittr * imports
purrr * imports
reshape2 * imports
stats * imports
stringdist * imports
stringr * imports
tibble * imports
tictoc * imports
tidyr >= 1.1.4 imports
tidyverse * imports
word2vec * imports
knitr * suggests
rmarkdown * suggests
testthat >= 3.0.0 suggests
