https://github.com/chainsawriot/resdtmf

Responsible Document-Term Matrix Format

Basic Info

Host: GitHub
Owner: chainsawriot
License: other
Language: R
Default Branch: master
Size: 94.7 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed about 6 years ago

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# resdtmf 




The goal of Responsible Document-term Matrix Format (`resdtmf`, pronounced "res-dumf" /ˈɹɪzdəmf/) is to provide a machine-readable, plain-text and exchangeable file format for document-term matrices (dtm, or in quanteda's parlance, document-feature matrices).

Currently, there is no standard format for document-term matrices. A resdtmf file is a JSON file with the following five components:

1. `triplets`: a collection of triplets, each a tuple of 3 values: docid (document id), tid (term id), f (frequency)
2. `features`: a collection of features, each a tuple of 2 values: tid (term id), term (the term itself)
3. `dumped_docvars`: metadata of each document
4. `dumped_meta`: metadata of the entire dtm
5. `order_of_content`: a collection of tuples of 2 values: order (the position of the document in the dtm), docid (document id)

This is an example of a resdtmf file.

```json
{
  "triplets": [
    {
      "docid": "text1",
      "tid": 1,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 1,
      "f": 1
    },
    {
      "docid": "text1",
      "tid": 2,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 2,
      "f": 1
    },
    {
      "docid": "text1",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 4,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 5,
      "f": 1
    }
  ],
  "features": [
    {
      "tid": 1,
      "term": "i"
    },
    {
      "tid": 2,
      "term": "love"
    },
    {
      "tid": 3,
      "term": "you"
    },
    {
      "tid": 4,
      "term": "me"
    },
    {
      "tid": 5,
      "term": "hate"
    }
  ],
  "dumped_docvars": [
    {
      "docid": "text1",
      "sentiment": 1
    },
    {
      "docid": "text2",
      "sentiment": 1
    },
    {
      "docid": "text3",
      "sentiment": 0
    }
  ],
  "dumped_meta": [],
  "order_of_content": [
    {
      "order": 1,
      "docid": "text1"
    },
    {
      "order": 2,
      "docid": "text2"
    },
    {
      "order": 3,
      "docid": "text3"
    }
  ]
}
```
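To make the roles of `triplets`, `features` and `order_of_content` concrete, here is a minimal base-R sketch that rebuilds a dense matrix from the example above. It is not the package's own import code: `rebuild_dense()` is a hypothetical helper, and the three data frames are typed in by hand to stand in for what a JSON parser would return.

```r
# Hand-typed stand-ins for the parsed components of the example file
# (assumption: a JSON parser would yield data frames shaped like these).
triplets <- data.frame(
  docid = c("text1", "text3", "text1", "text2", "text1",
            "text2", "text3", "text2", "text3"),
  tid = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  f = rep(1, 9),
  stringsAsFactors = FALSE
)
features <- data.frame(
  tid = 1:5,
  term = c("i", "love", "you", "me", "hate"),
  stringsAsFactors = FALSE
)
order_of_content <- data.frame(
  order = 1:3,
  docid = c("text1", "text2", "text3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the sparse triplets into a dense
# documents-by-terms matrix.
rebuild_dense <- function(triplets, features, order_of_content) {
  # Row order comes from order_of_content, column order from tid.
  docs <- order_of_content$docid[order(order_of_content$order)]
  features <- features[order(features$tid), ]
  m <- matrix(0, nrow = length(docs), ncol = nrow(features),
              dimnames = list(docs, features$term))
  # Each triplet addresses one (row, column) cell of the matrix.
  idx <- cbind(match(triplets$docid, docs),
               match(triplets$tid, features$tid))
  m[idx] <- triplets$f
  m
}

dense <- rebuild_dense(triplets, features, order_of_content)
dense
```

Cells that no triplet mentions stay at zero, which is why the format only needs to store the non-zero frequencies.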

We also believe that a responsible DTM should carry enough metadata to describe the meaning of the data. This package supports [Dublin Core](https://dublincore.org/) (DCMES 1.1).
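For orientation, DCMES 1.1 defines fifteen elements. The base-R sketch below lists them and flags metadata fields that fall outside the set. `non_dcmes_fields()` and the `meta` list are hypothetical names used only for illustration and are not part of this package.

```r
# The fifteen DCMES 1.1 element names, in lower case.
dcmes_elements <- c("contributor", "coverage", "creator", "date",
                    "description", "format", "identifier", "language",
                    "publisher", "relation", "rights", "source",
                    "subject", "title", "type")

# Hypothetical helper: names in `meta` that are not DCMES elements.
non_dcmes_fields <- function(meta) {
  setdiff(names(meta), dcmes_elements)
}

# Illustrative metadata for one document (made-up values).
meta <- list(title = "Romeo + Juliet", language = "en", sentiment = 1)
non_dcmes_fields(meta)
```

Here `sentiment` is an ordinary document variable rather than a Dublin Core element, so it is the only field reported.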

## Installation

Install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("chainsawriot/resdtmf")
```
## Example - Basic serialization

Suppose you have a simple document-feature matrix like this:

```{r example}
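# library() stops with an error if a package is missing, whereas
# require() only returns FALSE and lets the script carry on.
library(quanteda)
library(magrittr)
library(resdtmf)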

quanteda::corpus(c('i love you', 'you love me', 'i hate you'),
                 docvars = data.frame(sentiment = c(1,1,0))) %>%
    quanteda::dfm() -> input_dfm
input_dfm
```

This document-feature matrix can be exported into a json file with:

```{r}
export_resdtmf(input_dfm, "example.json")
```

The file is machine-readable.

```{r}
readLines("example.json")
```

It can be imported easily back into R.

```{r}
example_dfm <- import_resdtmf("example.json")
example_dfm
```

And the metadata is preserved.

```{r}
docvars(example_dfm)
```

And the imported dfm is equal to the original.

```{r}
all.equal(example_dfm, input_dfm)
```

It is also possible to convert the imported dfm into another format, e.g. "data.frame".

```{r}
example_dfm_df <- import_resdtmf("example.json", convert_to = "data.frame")
class(example_dfm_df)
```

```{r}
example_dfm_df
```

Example: serializing a DTM created using the `data_corpus_inaugural` data.

```{r}
inaugural_dfm <- dfm(data_corpus_inaugural)
export_resdtmf(inaugural_dfm, "inaug_dfm.json")
```

```{r}
inaugural_dfm_from_json <- import_resdtmf("inaug_dfm.json")
inaugural_dfm_from_json
```

```{r}
all.equal(inaugural_dfm, inaugural_dfm_from_json)
```

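With `compress = TRUE`, the output is written to `inaug_dfm2.json.zip`; the file sizes below compare the two exports.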

```{r}
export_resdtmf(inaugural_dfm, "inaug_dfm2.json", compress = TRUE)
file.size("inaug_dfm.json")
file.size("inaug_dfm2.json.zip")
```

```{r}
inaugural_dfm_from_json_zip <- import_resdtmf("inaug_dfm2.json.zip")
all.equal(inaugural_dfm, inaugural_dfm_from_json_zip)
```

## Example - Dublin Core

```{r}
quanteda::corpus(c('i love you',
                   'you love me',
                   'baka shinji',
                   'ich liebe dict obwohl du ssss bist'),
                 docvars = data.frame(sentiment = c(1,1,0,1))) %>%
    quanteda::dfm() -> input_dfm
input_dfm
```

```{r}
dc_meta <- create_dc(
    title = c("Romeo + Juliet", "Moulin Rouge!", "Neon Genesis: Evangelion", "Mord ist mein Geschäft, Liebling"),
    format = "Document-term Matrix",
    language = c("en", "en", "ja", "de"))

input_dfm2 <- put_dc(input_dfm, dc_meta)
input_dfm2
```

Inspecting DCMES 1.1 data (similar to the `tm::DublinCore` method).

```{r}
inspect_dc(input_dfm2[4,])
```

Serialization

```{r}
export_resdtmf(input_dfm2, "input_dfm2.json")
input_dfm3 <- import_resdtmf("input_dfm2.json")
inspect_dc(input_dfm3[4,])
```

```{r, include = FALSE}
### Clean up
unlink("inaug_dfm2.json.zip")
unlink("inaug_dfm.json")
unlink("input_dfm2.json")
unlink("example.json")
```

Owner

Login: chainsawriot
Kind: user
Location: Germany
Company: @gesistsa

Website: http://www.chainsawriot.com
Repositories: 241
Profile: https://github.com/chainsawriot

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 4
Total pull requests: 0
Average time to close issues: about 18 hours
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
