ngram

Fast n-Gram Tokenization

https://github.com/wrathematics/ngram

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Keywords

ngram r text text-mining

Last synced: 6 months ago

Repository

Fast n-Gram Tokenization

Basic Info

Host: GitHub
Owner: wrathematics
License: other
Language: C
Default Branch: master
Homepage:
Size: 1.87 MB

Statistics

Stars: 70
Watchers: 11
Forks: 24
Open Issues: 3
Releases: 3

Topics

ngram r text text-mining

Created almost 12 years ago · Last pushed about 2 years ago

Metadata Files

Readme Changelog License

ngram

Version: 3.2.2
License: BSD 2-Clause
Project home: https://github.com/wrathematics/ngram
Bug reports: https://github.com/wrathematics/ngram/issues
Author: Drew Schmidt and Christian Heckendorf

ngram is an R package for constructing n-grams ("tokenizing"), as well as generating new text based on the n-gram structure of a given text input ("babbling"). The package can be used for serious analysis or for creating "bots" that say amusing things. See details section below for more information.

The package is designed to be extremely fast at tokenizing, summarizing, and babbling tokenized corpora. Because of the architectural design, we are also able to handle very large volumes of text, with performance scaling very nicely. Benchmarks and example usage can be found in the package vignette.

Package Details

The original purpose for the package was to combine the book "Modern Applied Statistics in S" with the collected works of H. P. Lovecraft and generate amusing nonsense. This resulted in the post Modern Applied Statistics in R'lyeh. I had originally tried several other available R packages to do this, but they were taking hours on a subset of the full combined corpus to preprocess the data into a somewhat inconvenient format. However, the ngram package can do the preprocessing into the desired format in well under a second (with about half of the preprocessing time spent on copying data for R coherency).

The package is mostly C, with the returned object (to R) being an external pointer. In fact, the underlying C code can be compiled as a standalone library. There is some minimal compatibility with exporting the data to proper R data structures, but it is incomplete at this time.

For more information, see the package vignette.

Installation

You can install the stable version from CRAN using the usual install.packages():

```r
install.packages("ngram")
```

The development version is maintained on GitHub:

```r
remotes::install_github("wrathematics/ngram")
```

Example Usage

Here we present a few simple examples on how to use the ngram package. See the package vignette for more detailed information on package usage.

Tokenization, Summarizing, and Babbling

Let's take the sequence

```r
x <- "a b a c a b b"
```

Eagle-eyed readers will recognize this as the blood code from Mortal Kombat, but you can pretend it's something boring like an amino acid sequence or something. We can form the n-gram structure of this sequence with the ngram function:

```r
library(ngram)

ng <- ngram(x, n=3)
```
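To make the idea concrete, here is a minimal base-R sketch of what forming an n-gram structure means: each run of `n` consecutive words is paired with the word that follows it. This is only an illustration, not the package's C implementation; the function name `build_ngrams` is hypothetical, and splitting on single spaces is an assumption.

```r
# Illustrative base-R sketch; `build_ngrams` is a hypothetical helper,
# not part of the ngram package.
build_ngrams <- function(text, n) {
  # Assumption: words are separated by single spaces.
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  stopifnot(length(words) >= n)

  # Every position where an n-gram can start.
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )

  # The word following each n-gram; NA when the n-gram ends the text.
  nxt <- words[starts + n]

  # Group followers by n-gram, keeping first-appearance order.
  split(nxt, factor(grams, levels = unique(grams)))
}

x <- "a b a c a b b"
followers <- build_ngrams(x, n = 3)
names(followers)
```

For the example sequence this yields five distinct 3-grams, with the last one (`a b b`) having no follower because it ends the input.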

There are various ways of printing the object.

```r
ng
# [1] "An ngram object with 5 3-grams"

print(ng, output="truncated")
# a b a
# c {1} |
#
# a c a
# b {1} |
#
# b a c
# a {1} |
#
# a b b
# NULL {1} |
#
# c a b
# b {1} |
```

With output="truncated", only the first 5 n-grams will be shown (here there are only 5 total). To see all (in the case of having more than 5), you can set output="full".

There are several "getter" functions, but they are incomplete (see Notes section below). Perhaps the most useful of them generates a "phrase table", or a list of n-grams by their frequency and proportion in the input text:

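```r
# The table below lists 2-grams, so it is built from an n=2 object.
ng2 <- ngram(x, n=2)
get.phrasetable(ng2)
#   ngrams freq      prop
# 1    a b    2 0.3333333
# 2    b a    1 0.1666667
# 3    c a    1 0.1666667
# 4    a c    1 0.1666667
# 5    b b    1 0.1666667
```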
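The counting behind a phrase table can be sketched in base R for the 2-grams of the example sequence. This is a rough illustration rather than the package's implementation; `phrase_table` is a hypothetical name, and rows with equal frequency may come out in a different order than the package produces.

```r
# Illustrative base-R sketch; `phrase_table` is a hypothetical helper,
# not part of the ngram package.
phrase_table <- function(text, n) {
  # Assumption: words are separated by single spaces.
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  stopifnot(length(words) >= n)

  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )

  # Count each distinct n-gram, most frequent first.
  counts <- sort(table(grams), decreasing = TRUE)

  data.frame(
    ngrams = names(counts),
    freq = as.integer(counts),
    # Proportion is the count divided by the total number of n-grams.
    prop = as.numeric(counts) / length(grams),
    stringsAsFactors = FALSE
  )
}

pt <- phrase_table("a b a c a b b", n = 2)
pt
```

The sequence contains six 2-grams in total, which is why `a b` (seen twice) has proportion 2/6 and every other 2-gram has 1/6.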

Finally, we can use the glory of Markov Chains to babble new sequences:

```r
babble(ng=ng, genlen=12)
# [1] "a b b c a b b a b a c a "
```

For reproducibility, use the seed argument:

```r
babble(ng=ng, genlen=12, seed=1234)
# [1] "a b a c a b b a b b a b "
```

At this time, we note that the seed may not guarantee the same results across machines. Currently only Solaris produces different values from mainstream platforms (Windows, Mac, Linux, FreeBSD), but potentially others could as well.
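For readers curious how Markov-chain babbling works in principle, the following base-R sketch generates text by repeatedly sampling a follower of the most recent n-gram. It is not the package's algorithm: the name `babble_sketch` is hypothetical, it uses R's own random number generator (so its output will not match `babble()`), and restarting from a random n-gram at a dead end is an assumption made for this example.

```r
# Illustrative base-R sketch; `babble_sketch` is a hypothetical helper,
# not part of the ngram package.
babble_sketch <- function(text, n, genlen, seed = NULL) {
  # Note: this sets R's global RNG state when a seed is given.
  if (!is.null(seed)) set.seed(seed)

  # Assumption: words are separated by single spaces.
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  stopifnot(length(words) > n, genlen >= n)

  # Only n-grams that are followed by another word can continue the chain.
  starts <- seq_len(length(words) - n)
  keys <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  followers <- split(words[starts + n], keys)

  # Start from a randomly chosen n-gram.
  out <- strsplit(sample(keys, 1), " ", fixed = TRUE)[[1]]

  while (length(out) < genlen) {
    key <- paste(tail(out, n), collapse = " ")
    choices <- followers[[key]]
    if (is.null(choices)) {
      # Dead end (assumed behavior): restart from a random n-gram.
      out <- c(out, strsplit(sample(keys, 1), " ", fixed = TRUE)[[1]])
    } else {
      out <- c(out, choices[sample.int(length(choices), 1)])
    }
  }

  paste(out[seq_len(genlen)], collapse = " ")
}

first <- babble_sketch("a b a c a b b", n = 3, genlen = 12, seed = 1234)
second <- babble_sketch("a b a c a b b", n = 3, genlen = 12, seed = 1234)
identical(first, second)
```

Fixing the seed makes repeated runs on the same machine and R version agree, which mirrors the purpose of the `seed` argument described above.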

Weka-Like Tokenization

There is also a tokenizer that behaves identically to the one in the RWeka package (only the ngram one is significantly faster!). Using the same sequence x as above:

```r
ngram::ngram_asweka(x, min=2, max=3)
#  [1] "a b a" "b a c" "a c a" "c a b" "a b b" "a b"   "b a"   "a c"   "c a"
# [10] "a b"   "b b"
```
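The shape of that result can be reproduced with a short base-R sketch that emits every n-gram from the largest order down to the smallest, keeping duplicates. This only mimics the output shown above for the example sequence; `ngrams_range` is a hypothetical name, and it makes no claim about how the package or RWeka implement the tokenizer.

```r
# Illustrative base-R sketch; `ngrams_range` is a hypothetical helper,
# not part of the ngram package.
ngrams_range <- function(text, min, max) {
  stopifnot(min >= 1, max >= min)

  # Assumption: words are separated by single spaces.
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]

  # All n-grams of a single order, in order of appearance.
  grams_of <- function(n) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
  }

  # Largest order first, matching the ordering of the output above.
  unlist(lapply(seq(max, min), grams_of))
}

tokens <- ngrams_range("a b a c a b b", min = 2, max = 3)
tokens
```

Unlike the n-gram object built earlier, this returns a plain character vector in which repeated n-grams (here `a b`) appear once per occurrence.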

Owner

Name: Drew Schmidt
Login: wrathematics
Kind: user
Location: Knoxville, Tennessee

Website: https://hpcran.org
Twitter: wrathematics
Repositories: 120
Profile: https://github.com/wrathematics

I like R, C, and HPC.

GitHub Events

Total

Pull request event: 1
Fork event: 1

Last Year

Pull request event: 1
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 304
Total Committers: 5
Avg Commits per committer: 60.8
Development Distribution Score (DDS): 0.204

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
wrathematics	w**s@g**m	242
Drew Schmidt	s**t@m**u	38
Christian Heckendorf	h**c@g**m	22
Ville Vaara	v**a@g**m	1
Joe Russell	j**l@g**m	1

Committer Domains (Top 20 + Academic)

math.utk.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 8
Total pull requests: 3
Average time to close issues: 21 days
Average time to close pull requests: 1 day
Total issue authors: 8
Total pull request authors: 3
Average comments per issue: 1.5
Average comments per pull request: 0.67
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

jakob-r (1)
JosephPotashnik (1)
SteveBronder (1)
Yongbaek-Kim (1)
arthur0421 (1)
russey (1)
szhang1121 (1)
heckendorfc (1)

Pull Request Authors

MichaelChirico (2)
villevaara (1)
russey (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 2,232 last-month
Total docker downloads: 42,028

Total dependent packages: 4
Total dependent repositories: 11
Total versions: 11
Total maintainers: 1

cran.r-project.org: ngram

Fast n-Gram 'Tokenization'

Homepage: https://github.com/wrathematics/ngram
Documentation: http://cran.r-project.org/web/packages/ngram/ngram.pdf
License: BSD 2-clause License + file LICENSE
Latest release: 3.2.3
published about 2 years ago

Versions: 11
Dependent Packages: 4
Dependent Repositories: 11
Downloads: 2,232 Last month
Docker Downloads: 42,028

Rankings

Forks count: 3.3%

Stargazers count: 5.1%

Dependent repos count: 8.8%

Dependent packages count: 9.3%

Average: 10.3%

Downloads: 12.7%

Docker downloads count: 22.8%

Maintainers (1)

wrathematics@gmail.com

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.0.0 depends
methods * imports
