Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
✓DOI references
Found 5 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (19.6%) to scientific vocabulary
Last synced: 6 months ago
·
JSON representation
·
Repository
Concordancer for parallel, bilingual corpora
Basic Info
- Host: GitHub
- Owner: gederajeg
- License: other
- Language: R
- Default Branch: main
- Homepage: https://gederajeg.github.io/paracorp/
- Size: 484 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Created about 4 years ago
· Last pushed about 4 years ago
Metadata Files
Readme
License
Citation
README.Rmd
---
output: github_document
bibliography: mybibs.bib
link-citations: true
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# paracorp
[](https://github.com/gederajeg/paracorp/actions)
[](https://app.codecov.io/gh/gederajeg/paracorp?branch=main)
[](https://doi.org/10.17605/OSF.IO/HV9CU)
The goal of **paracorp** is to provide an R functionality for generating parallel concordance (Keyword-in-Context [KWIC] display) from a parallel/bilingual corpora. The first attempt is implemented in the `para_conc()` function that is built on top of the [tidyverse](https://www.tidyverse.org/) suit of packages. Please use the following citation if **paracorp** is used in publications:
```{r how-to-cite}
citation("paracorp")
```
The **paracorp** package is part of the following [research project](https://udayananetworking.unud.ac.id/lecturer/research/880-gede-primahadi-wijaya-rajeg/a-model-for-translation-study-based-on-english-indonesian-translation-database-and-its-pedagogical-implication-1179) [@rajeg_material_2021]:
> Rajeg, Gede Primahadi Wijaya, I Made Rajeg, Putu Dea Indah Kartini & I Gede Semara Dharma Putra. 2021. Material pendukung untuk *MODEL KAJIAN TERJEMAHAN BERBASIS BANK DATA TERJEMAHAN DIGITAL INGGRIS-INDONESIA DAN IMPLIKASI PEDAGOGISNYA*. Open Science Framework. https://doi.org/10.17605/OSF.IO/Y6ESA. https://osf.io/y6esa/.
The output of the research has been disseminated in several seminars [@rajeg_pemanfaatan_2021; @rajeg_derajat_2021].
## Installation
You can install the development version of paracorp from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("gederajeg/paracorp")
```
## Examples
The **paracorp** package comes with internal sample data of English-Indonesian parallel corpora from the science genre developed by the PAN BPPT project [@adriani_development_2009; @bppt_statistical_2009]. The data are available in the form of character vectors called `sci_en` (for the English text) whose line is aligned with the Indonesian version (`sci_id`).
The code-snippet below shows how to generate a parallel concordance for the English modal verb "should" as the target, search-term and present the Indonesian translation (shown in the `TRANSLATION` column in the output table).
```{r example-eng-idn}
library(paracorp) # load the package
# in this example, the English text is used as the source text
my_para_conc <- para_conc(source_text = sci_en,
target_text = sci_id,
pattern = "\\bshould\\b", # regular expression pattern
conc_sample = 20) # retrieve 20 random concordance lines
# peek into the results as tibble/data frame
head(my_para_conc)
```
The printed messages show that, by default, `para_conc()` also saves the concordance into a tab-separated plain text (by default called `'parallel_conc.txt'`), in addition to returning a tibble/data frame format of the concordance. The tab-separated `'parallel_conc.txt'` file can be opened in MS Excel for further corpus-based analyses.
### Suppressing the automatic plain-text output
You can suppress the automatic plain-text-output behaviour by specifying `filename = FALSE` as shown below. In this situation, the output of `para_conc()` is only the tibble/data frame.
```{r suppress-automatic-output}
# suppress automatic output file behaviour with `filename = FALSE`
my_para_conc <- para_conc(source_text = sci_en,
target_text = sci_id,
pattern = "\\bshould\\b", # regular expression pattern
conc_sample = 20, # retrieve 20 random concordance lines
filename = FALSE) # suppress automatic output file
# peek into the results as tibble/data frame
head(my_para_conc)
```
### Switching the source- and target-text inputs
Moreover, the position of the input corpora can be reversed depending on the nature of the corpora or the research question(s). In the example below, the Indonesian text is entered into the `source_text` argument while the English text is entered into the `target_text` argument. In this case, the input string in the `pattern` argument of `para_conc()` should represent the Indonesian target-keyword.
```{r example-idn-eng}
# in this example, the Indonesian text is used as the source text
my_para_conc <- para_conc(source_text = sci_id,
target_text = sci_en,
pattern = "\\bmungkin\\b", # regular expression pattern
conc_sample = 20) # retrieve 20 random concordance lines
# peek into the results as tibble/data frame
head(my_para_conc)
```
### Sampling numbers
If the requested number of sample (out of all matches) is **greater than** or **equal to** the number of matches of the search pattern, `para_conc()` will print messages indicating these situations, and will retrieve all matches found, rather than generating sample that is supposed to be fewer than the total matches.
The snippet below shows the scenario and printed message when the requested number of sample is **equal to** the number of matches.
```{r sample-number-behaviour-1}
# sample number requested is equal to the matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 64, filename = FALSE)
```
Meanwhile, the snippet below shows the scenario and printed message when the requested number of sample is **greater than** the number of matches.
```{r sample-number-behaviour-2}
# sample number requested is greater than the matches
para_conc(sci_en, sci_id, pattern = "should", conc_sample = 67, filename = FALSE)
```
### No matches
When no matches were found for the string given in the `pattern` argument, `para_conc()` will also print out the message informing so and no output will be produced. See the example below.
```{r no-match}
# For instance, searching for an Indonesian word when the source text is in English
# will most likely produce such no-match message.
para_conc(sci_en, sci_id, pattern = "\\bmungkin\\b", conc_sample = 20, filename = FALSE)
```
```{r delete-saved-file, echo = FALSE}
unlink("parallel_conc.txt")
```
## R Session Info
```{r sessinfo}
devtools::session_info()
```
## References
Owner
- Name: Gede Primahadi Wijaya Rajeg
- Login: gederajeg
- Kind: user
- Location: Bali, Indonesia
- Company: Universitas Udayana
- Website: https://udayananetworking.unud.ac.id/lecturer/880-gede-primahadi-wijaya-rajeg
- Twitter: PrimahadiWijaya
- Repositories: 3
- Profile: https://github.com/gederajeg
I am interested in cognitive linguistics, corpus linguistics, and construction grammar. I use R for all things data science in linguistics.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Rajeg
given-names: Gede Primahadi Wijaya
orcid: https://orcid.org/0000-0002-2047-8621
title: "paracorp: Concordancer for parallel, bilingual corpora"
version: 0.0.1
doi: 10.17605/OSF.IO/HV9CU
date-released: 2021-12-10
license: MIT
repository-code: "https://github.com/gederajeg/paracorp"
GitHub Events
Total
Last Year
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0