paracorp

Concordancer for parallel, bilingual corpora

https://github.com/gederajeg/paracorp


Repository

Basic Info

Host: GitHub
Owner: gederajeg
License: other
Language: R
Default Branch: main
Homepage: https://gederajeg.github.io/paracorp/
Size: 484 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created over 4 years ago · Last pushed over 4 years ago

README.Rmd

---
output: github_document
bibliography: mybibs.bib
link-citations: true
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# paracorp


[![R-CMD-check](https://github.com/gederajeg/paracorp/workflows/R-CMD-check/badge.svg)](https://github.com/gederajeg/paracorp/actions)
[![Codecov test coverage](https://codecov.io/gh/gederajeg/paracorp/branch/main/graph/badge.svg)](https://app.codecov.io/gh/gederajeg/paracorp?branch=main)
[![](https://img.shields.io/badge/doi-10.17605/OSF.IO/HV9CU-lightblue.svg)](https://doi.org/10.17605/OSF.IO/HV9CU)


The goal of **paracorp** is to provide R functionality for generating parallel concordances (Keyword-in-Context [KWIC] displays) from parallel, bilingual corpora. The first attempt is implemented in the `para_conc()` function, which is built on top of the [tidyverse](https://www.tidyverse.org/) suite of packages. Please use the following citation if **paracorp** is used in publications:

```{r how-to-cite}
citation("paracorp")
```


The **paracorp** package is part of the following [research project](https://udayananetworking.unud.ac.id/lecturer/research/880-gede-primahadi-wijaya-rajeg/a-model-for-translation-study-based-on-english-indonesian-translation-database-and-its-pedagogical-implication-1179) [@rajeg_material_2021]:

> Rajeg, Gede Primahadi Wijaya, I Made Rajeg, Putu Dea Indah Kartini & I Gede Semara Dharma Putra. 2021. Material pendukung untuk *MODEL KAJIAN TERJEMAHAN BERBASIS BANK DATA TERJEMAHAN DIGITAL INGGRIS-INDONESIA DAN IMPLIKASI PEDAGOGISNYA*. Open Science Framework. https://doi.org/10.17605/OSF.IO/Y6ESA. https://osf.io/y6esa/.

The output of the research has been disseminated in several seminars [@rajeg_pemanfaatan_2021; @rajeg_derajat_2021].

## Installation

You can install the development version of paracorp from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("gederajeg/paracorp")
```

## Examples

The **paracorp** package comes with internal sample data from an English-Indonesian parallel corpus of the science genre developed by the PAN BPPT project [@adriani_development_2009; @bppt_statistical_2009]. The data are available as two character vectors: `sci_en` (the English text) and `sci_id` (the Indonesian text). The two vectors are line-aligned, so each element of `sci_en` corresponds to the element of `sci_id` at the same position.
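The idea of line alignment can be sketched in base R with two toy vectors. This is only an illustration of the principle, not the internals of `para_conc()`; the vectors `toy_en` and `toy_id` and the data frame `toy_pairs` are made up for this example and are not part of the package.

```r
# Toy aligned vectors (hypothetical data, not taken from the package)
toy_en <- c("You should read this.", "It may rain.", "We should go.")
toy_id <- c("Anda harus membaca ini.", "Mungkin akan hujan.", "Kita harus pergi.")

# Aligned corpora must have the same number of lines
stopifnot(length(toy_en) == length(toy_id))

# Positions of the English lines that contain the search term
hits <- grep("\\bshould\\b", toy_en)

# The same positions retrieve the aligned Indonesian lines
toy_pairs <- data.frame(source = toy_en[hits],
                        translation = toy_id[hits],
                        stringsAsFactors = FALSE)
toy_pairs
```

Because the lookup relies purely on position, any shift in one of the vectors (for example, a dropped line) would silently pair sentences with the wrong translations.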

The code snippet below shows how to generate a parallel concordance with the English modal verb "should" as the search term, and how to present its Indonesian translation (shown in the `TRANSLATION` column of the output table).

```{r example-eng-idn}
library(paracorp) # load the package

# in this example, the English text is used as the source text
my_para_conc <- para_conc(source_text = sci_en, 
                          target_text = sci_id, 
                          pattern = "\\bshould\\b", # regular expression pattern
                          conc_sample = 20) # retrieve 20 random concordance lines

# peek into the results as tibble/data frame
head(my_para_conc)

```

The printed messages show that, in addition to returning the concordance as a tibble/data frame, `para_conc()` also saves it to a tab-separated plain-text file (named `'parallel_conc.txt'` by default). This file can be opened in MS Excel for further corpus-based analyses.
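A saved tab-separated file can also be read back into R instead of a spreadsheet program. The sketch below uses a small stand-in table written to a temporary file so that it runs on its own. The `NODE` column name and the toy rows are assumptions made for illustration; only the `TRANSLATION` column is named in this README. To read the real output, the temporary path would be replaced with `"parallel_conc.txt"`.

```r
# Hypothetical stand-in for a concordance table; the real columns may differ
toy_conc <- data.frame(
  NODE = c("should", "should"),
  TRANSLATION = c("Anda harus membaca ini.", "Kita harus pergi."),
  stringsAsFactors = FALSE
)

# Write the table as tab-separated text to a temporary file
tmp <- tempfile(fileext = ".txt")
write.table(toy_conc, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; quote = "" keeps apostrophes and quotation marks
# inside the corpus text from being treated as field delimiters
conc_back <- read.delim(tmp, quote = "", stringsAsFactors = FALSE)
conc_back

# Clean up the temporary file
unlink(tmp)
```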

### Suppressing the automatic plain-text output

You can suppress the automatic plain-text-output behaviour by specifying `filename = FALSE` as shown below. In this situation, the output of `para_conc()` is only the tibble/data frame.

```{r suppress-automatic-output}
# suppress automatic output file behaviour with `filename = FALSE`
my_para_conc <- para_conc(source_text = sci_en, 
                          target_text = sci_id, 
                          pattern = "\\bshould\\b", # regular expression pattern
                          conc_sample = 20, # retrieve 20 random concordance lines
                          filename = FALSE) # suppress automatic output file 

# peek into the results as tibble/data frame
head(my_para_conc)
```


### Switching the source- and target-text inputs

The input corpora can also be swapped, depending on the nature of the corpora or the research question(s). In the example below, the Indonesian text is passed to the `source_text` argument and the English text to the `target_text` argument. In this case, the `pattern` argument of `para_conc()` should contain an Indonesian search term.

```{r example-idn-eng}
# in this example, the Indonesian text is used as the source text
my_para_conc <- para_conc(source_text = sci_id, 
                          target_text = sci_en, 
                          pattern = "\\bmungkin\\b", # regular expression pattern
                          conc_sample = 20) # retrieve 20 random concordance lines

# peek into the results as tibble/data frame
head(my_para_conc)
```
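The word-boundary markers (`\\b`) in the pattern matter for a word such as "mungkin", which also occurs inside longer words like "kemungkinan". The base R sketch below illustrates the difference on two made-up sentences. It assumes that the regular expression engine used by `para_conc()` treats `\\b` the same way as base R's `grepl()`, which this README does not state.

```r
# Toy Indonesian sentences (hypothetical, not from the package data)
toy_id <- c("Itu mungkin benar.", "Ada kemungkinan hujan.")

# Without word boundaries, "mungkin" also matches inside "kemungkinan",
# so both sentences are flagged as matches
loose <- grepl("mungkin", toy_id)

# With \\b on both sides, only the standalone word matches,
# so only the first sentence is flagged
strict <- grepl("\\bmungkin\\b", toy_id)

data.frame(sentence = toy_id, loose = loose, strict = strict)
```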

### Sampling numbers

If the requested sample size is **greater than** or **equal to** the number of matches of the search pattern, `para_conc()` prints a message describing the situation and retrieves all matches found, because a sample is meant to be smaller than the total number of matches.

The snippet below shows the scenario and the printed message when the requested sample size is **equal to** the number of matches.

```{r sample-number-behaviour-1}
# requested sample size is equal to the number of matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 64, filename = FALSE)
```

Meanwhile, the snippet below shows the scenario and the printed message when the requested sample size is **greater than** the number of matches.

```{r sample-number-behaviour-2}
# requested sample size is greater than the number of matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 67, filename = FALSE)
```
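The sampling rule described in this section can be sketched as a small base R function. The helper `sample_or_all()` is hypothetical and written only to illustrate the rule; it is not part of **paracorp**, and the actual implementation and message wording in `para_conc()` may differ.

```r
# Hypothetical helper illustrating the sampling rule; not part of paracorp
sample_or_all <- function(matches, n) {
  if (n >= length(matches)) {
    message("Requested sample size (", n,
            ") is not smaller than the number of matches (",
            length(matches), "); returning all matches.")
    return(matches)
  }
  # Sample positions rather than values, which stays safe for short vectors
  matches[sample.int(length(matches), n)]
}

toy_matches <- paste("line", 1:5)

n_fewer   <- length(sample_or_all(toy_matches, 3)) # a random subset of 3
n_equal   <- length(sample_or_all(toy_matches, 5)) # all 5 matches, with a message
n_greater <- length(sample_or_all(toy_matches, 9)) # all 5 matches, with a message
```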

### No matches

When no matches are found for the string given in the `pattern` argument, `para_conc()` prints a message saying so and produces no output. See the example below.

```{r no-match}
# For instance, searching for an Indonesian word when the source text is in English
# will most likely produce a no-match message.
para_conc(sci_en, sci_id, pattern = "\\bmungkin\\b", conc_sample = 20, filename = FALSE)

```



```{r delete-saved-file, echo = FALSE}
unlink("parallel_conc.txt")
```


## R Session Info

```{r sessinfo}
devtools::session_info()
```

## References

Owner

Name: Gede Primahadi Wijaya Rajeg
Login: gederajeg
Kind: user
Location: Bali, Indonesia
Company: Universitas Udayana

Website: https://udayananetworking.unud.ac.id/lecturer/880-gede-primahadi-wijaya-rajeg
Twitter: PrimahadiWijaya
Repositories: 3
Profile: https://github.com/gederajeg

I am interested in cognitive linguistics, corpus linguistics, and construction grammar. I use R for all things data science in linguistics.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Rajeg
    given-names: Gede Primahadi Wijaya
    orcid: https://orcid.org/0000-0002-2047-8621
title: "paracorp: Concordancer for parallel, bilingual corpora"
version: 0.0.1
doi: 10.17605/OSF.IO/HV9CU
date-released: 2021-12-10
license: MIT
repository-code: "https://github.com/gederajeg/paracorp"
