https://github.com/chainsawriot/textsdc

Statistical Data Cleaning For Text Data

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: chainsawriot
License: gpl-3.0
Language: R
Default Branch: master
Size: 186 KB

Statistics

Stars: 7
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created almost 7 years ago · Last pushed over 4 years ago

Metadata Files

Readme License

README.Rmd

---
output: github_document
---



```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
devtools::load_all()
```
# textsdc

The goal of textsdc (text statistical data cleaning) is to clean text data statistically. The current version supports:

1. text deduplication using a very simple similarity-based algorithm.

Future versions should also support:

1. removal of boilerplate text.

Related packages:

1. [quanteda](https://github.com/quanteda/quanteda) - for text analysis
2. [textclean](https://github.com/trinker/textclean) - for normalization of text data

## Installation

You can install the experimental version of textsdc from GitHub:

```{r eval = FALSE}
devtools::install_github("chainsawriot/textsdc")
```

## Example

### Deduplication

Identify possible duplicates in your input text.

```{r, eval = FALSE}
library(textsdc)
```

```{r example}
lyrics <- c("He drinks a Whiskey drink",
            "he drinks a Vodka drink",
            "He drinks a Lager drink",
            "he drinks a Cider drink",
            "He sings the songs that remind him of the good times",
            "He sings the songs that remind him of the best times",
            "Oh Danny Boy",
            "Danny Boy",
            "Danny Boy",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down")
dups <- calculate_textsdc(lyrics)
dups
```

```{r example2}
dups$dist_matrix
```
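To give a feel for what a similarity-based comparison can look like, here is a minimal base R sketch. It is not the textsdc implementation: this README does not state which measure the package uses or what `dist_matrix` contains, so the helper name `pairwise_similarity` and the choice of a normalised Levenshtein similarity are assumptions made purely for illustration.

```r
# Hypothetical helper, NOT part of textsdc.
# Similarity = 1 - (edit distance / length of the longer text), so
# identical texts score 1 and unrelated texts score close to 0.
pairwise_similarity <- function(x) {
  x <- tolower(x)                             # ignore case differences
  d <- utils::adist(x)                        # pairwise Levenshtein distances
  max_len <- outer(nchar(x), nchar(x), pmax)  # longer length within each pair
  1 - d / pmax(max_len, 1)                    # guard against empty strings
}

sim <- pairwise_similarity(c("Danny Boy", "Oh Danny Boy", "Danny Boy"))
sim[1, 3]  # identical texts: 1
sim[1, 2]  # three insertions over twelve characters: 0.75
```

Normalising by the longer text keeps scores comparable between short and long inputs.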

Extract the deduplicated version:

```{r example21}
clean_textsdc(dups)
```


Adjust the threshold for duplication.

```{r example3}
dups2 <- calculate_textsdc(lyrics, threshold = 0.9)
dups2
```

```{r example4}
clean_textsdc(dups2)
```
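The effect of a threshold can be sketched with a small hand-made similarity matrix. This assumes, as an illustration only, that the threshold is a minimum similarity and that a text is flagged when it is close enough to any earlier text; `flag_duplicates` is a hypothetical helper and textsdc may define its threshold differently.

```r
# Hypothetical helper, NOT part of textsdc.
# Flags item j when its similarity to any earlier item i < j reaches the threshold.
flag_duplicates <- function(sim, threshold) {
  vapply(seq_len(nrow(sim)), function(j) {
    j > 1 && any(sim[seq_len(j - 1), j] >= threshold)
  }, logical(1))
}

# Toy matrix: items 1 and 3 are identical, item 2 is a near match of both.
sim <- matrix(c(1.00, 0.75, 1.00,
                0.75, 1.00, 0.75,
                1.00, 0.75, 1.00), nrow = 3, byrow = TRUE)

strict  <- flag_duplicates(sim, threshold = 0.99)  # FALSE FALSE TRUE
lenient <- flag_duplicates(sim, threshold = 0.70)  # FALSE TRUE TRUE
```

Under this assumption, lowering the threshold flags more texts as duplicates, and the first text is never flagged because nothing precedes it.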

You can also use a percentile-based threshold, e.g. assuming that 70% of the texts are not duplicates.

```{r example5}
dups3 <- calculate_textsdc(lyrics, threshold = 0.7, percentile = TRUE)
dups3
```

```{r example6}
clean_textsdc(dups3)
```
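One way to read a percentile-based threshold is to derive the cutoff from the data instead of fixing it in advance. The sketch below takes each text's highest similarity to any other text and uses a quantile of those scores as the cutoff. This is one possible interpretation, not a description of what `percentile = TRUE` does internally; `percentile_cutoff` is a hypothetical helper, and `prob = 0.5` is used only to keep the numbers simple.

```r
# Hypothetical helper, NOT part of textsdc.
# Returns the `prob` quantile of each item's best similarity to another item.
percentile_cutoff <- function(sim, prob) {
  diag(sim) <- NA                            # ignore self-similarity
  best <- apply(sim, 1, max, na.rm = TRUE)   # best match per item
  unname(stats::quantile(best, probs = prob))
}

# Toy matrix: items 3 and 4 are close to each other, items 1 and 2 are not.
sim <- matrix(c(1.0, 0.2, 0.1, 0.1,
                0.2, 1.0, 0.3, 0.1,
                0.1, 0.3, 1.0, 0.9,
                0.1, 0.1, 0.9, 1.0), nrow = 4, byrow = TRUE)

cutoff <- percentile_cutoff(sim, prob = 0.5)  # midway between 0.3 and 0.9
diag(sim) <- NA
candidates <- apply(sim, 1, max, na.rm = TRUE) > cutoff  # FALSE FALSE TRUE TRUE
```

A data-driven cutoff like this adapts to the corpus, at the cost of assuming a fixed share of the texts is unique.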

Example with CJK text:

```{r cjk1}
demands2 <- c("徹底撤回修例",
              "收回暴動定義",
              "撤銷對至今為止所有反送中抗爭者控罪",
              "徹底追究警隊濫權情況",
              "以行政命令解散立法會，立即實行雙真普選",
              "撤銷對至今為止所有反送中抗爭者控罪",
              "解散立法會，立即實行雙真普選")
dups4 <- calculate_textsdc(demands2, threshold = 0.7, percentile = TRUE)
dups4
```

```{r cjk2}
clean_textsdc(dups4)
```

Four precedence options determine which text is kept from each set of duplicates: earlier, longer, shorter, and random.

Default: earlier

```{r earlier}
metallica <- c("The Unforgiven",
               "The Unforgiven II",
               "The Unforgiven III",
               "Fight Fire With Fire",
               "Master of Puppets",
               "For Whom The Bell Tolls",
               "For Whom The Bell Toll",
               "Master of Puppets")
metallica_dups <- calculate_textsdc(metallica, threshold = 0.7)
clean_textsdc(metallica_dups)
```

Longer

```{r longer}
clean_textsdc(metallica_dups, precedence = "longer")
```

Shorter

```{r shorter}
clean_textsdc(metallica_dups, precedence = "shorter")
```

Random

```{r random}
clean_textsdc(metallica_dups, precedence = "random")
```
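The four precedence options can be pictured as different rules for picking one representative from a group of texts already judged to be duplicates. The sketch below assumes the option names mean what they suggest (first seen, most characters, fewest characters, random pick); `pick_representative` is a hypothetical helper rather than textsdc's code, and the group is made up for illustration.

```r
# Hypothetical helper, NOT part of textsdc.
# Chooses which member of a duplicate group to keep.
pick_representative <- function(texts,
                                precedence = c("earlier", "longer",
                                               "shorter", "random")) {
  precedence <- match.arg(precedence)
  idx <- switch(precedence,
    earlier = 1L,                              # first occurrence in the input
    longer  = which.max(nchar(texts)),         # most characters
    shorter = which.min(nchar(texts)),         # fewest characters
    random  = sample.int(length(texts), 1L)    # uniform random pick
  )
  texts[idx]
}

group <- c("The Unforgiven", "The Unforgiven II", "The Unforgiven III")
pick_representative(group)                          # "The Unforgiven"
pick_representative(group, precedence = "longer")   # "The Unforgiven III"
pick_representative(group, precedence = "shorter")  # "The Unforgiven"
```

With `precedence = "random"` the result varies between runs unless a seed is set with `set.seed()`.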

Owner

Login: chainsawriot
Kind: user
Location: Germany
Company: @gesistsa

Website: http://www.chainsawriot.com
Repositories: 241
Profile: https://github.com/chainsawriot

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
