https://github.com/chainsawriot/resdtmf

Responsible Document-Term Matrix Format

https://github.com/chainsawriot/resdtmf

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.5%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Responsible Document-Term Matrix Format

Basic Info
  • Host: GitHub
  • Owner: chainsawriot
  • License: other
  • Language: R
  • Default Branch: master
  • Size: 94.7 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 6 years ago · Last pushed about 6 years ago
Metadata Files
Readme License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# resdtmf 




The goal of Responsible Document-term Matrix Format (`resdtmf`, pronounced as "res-dumf" /ˈɹɪzdəmf/) is to create a machine-readable, plain-text and exchangable file format of document-term matrices (dtm, or in quanteda's parlance, document-feature matrices).

Currently, there is no standard format for document-term matrics. A resdtmf file is a JSON file with the following five components:

1. `triplets`: a collection of triplets which are tuples of 3 values: docid (document id), tid (term id), f (frequncy)
2. `features`: a collection of features which are tuples of 2 values: tid (term id), term (the term itself)
3. `dumped_docvars`: meta-data of each document
4. `dumped_meta`: meta-data of the entire dtm
5. `order_of_content`: a collection of tuples of 2 values: order (numeric sequence of order), docid.

This is an example of a resdtmf file.

```json
{
  "triplets": [
    {
      "docid": "text1",
      "tid": 1,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 1,
      "f": 1
    },
    {
      "docid": "text1",
      "tid": 2,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 2,
      "f": 1
    },
    {
      "docid": "text1",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 3,
      "f": 1
    },
    {
      "docid": "text2",
      "tid": 4,
      "f": 1
    },
    {
      "docid": "text3",
      "tid": 5,
      "f": 1
    }
  ],
  "features": [
    {
      "tid": 1,
      "term": "i"
    },
    {
      "tid": 2,
      "term": "love"
    },
    {
      "tid": 3,
      "term": "you"
    },
    {
      "tid": 4,
      "term": "me"
    },
    {
      "tid": 5,
      "term": "hate"
    }
  ],
  "dumped_docvars": [
    {
      "docid": "text1",
      "sentiment": 1
    },
    {
      "docid": "text2",
      "sentiment": 1
    },
    {
      "docid": "text3",
      "sentiment": 0
    }
  ],
  "dumped_meta": [],
  "order_of_content": [
    {
      "order": 1,
      "docid": "text1"
    },
    {
      "order": 2,
      "docid": "text2"
    },
    {
      "order": 3,
      "docid": "text3"
    }
  ]
}
```

We also believe that a responsible DTM should have enough meta data to describe the meaning of the data. This package supports [Dublin Core](https://dublincore.org/) (DCMES 1.1).

## Installation

Install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("chainsawriot/resdtmf")
```
## Example - Basic serialization

Suppose you have a simple document-feature matrix like this:

```{r example}
require(quanteda)
require(magrittr)
require(resdtmf)

quanteda::corpus(c('i love you', 'you love me', 'i hate you'),
                 docvars = data.frame(sentiment = c(1,1,0))) %>%
    quanteda::dfm() -> input_dfm
input_dfm
```

This document-feature matrix can be exported into a json file with:

```{r}
export_resdtmf(input_dfm, "example.json")
```

The file is machine-readable.

```{r}
readLines("example.json")
```

It can be imported easily back into R.

```{r}
example_dfm <- import_resdtmf("example.json")
example_dfm
```

And the metadata is preserved.

```{r}
docvars(example_dfm)
```

And everything is equal.

```{r}
all.equal(example_dfm, input_dfm)
```

It is also possible to convert the imported dfm into another format, e.g. "data.frame".

```{r}
example_dfm_df <- import_resdtmf("example.json", convert_to = "data.frame")
class(example_dfm_df)
```

```{r}
example_dfm_df
```

Example: serializing a DTM created using the `data_corpus_inaugural` data.

```{r}
inaugural_dfm <- dfm(data_corpus_inaugural)
export_resdtmf(inaugural_dfm, "inaug_dfm.json")
```

```{r}
inaugural_dfm_from_json <- import_resdtmf("inaug_dfm.json")
inaugural_dfm_from_json
```

```{r}
all.equal(inaugural_dfm, inaugural_dfm_from_json)
```

Using compression

```{r}
export_resdtmf(inaugural_dfm, "inaug_dfm2.json", compress = TRUE)
file.size("inaug_dfm.json")
file.size("inaug_dfm2.json.zip")
```

```{r}
inaugural_dfm_from_json_zip <- import_resdtmf("inaug_dfm2.json.zip")
all.equal(inaugural_dfm, inaugural_dfm_from_json_zip)
```

## Example - Dublin Core

```{r}
quanteda::corpus(c('i love you',
                   'you love me',
                   'baka shinji',
                   'ich liebe dict obwohl du ssss bist'),
                 docvars = data.frame(sentiment = c(1,1,0,1))) %>%
    quanteda::dfm() -> input_dfm
input_dfm
```

```{r}
dc_meta <- create_dc(
    title = c("Romeo + Juliet", "Moulin Rouge!", "Neon Genesis: Evangelion", "Mord ist mein Geschäft, Liebling"),
    format = "Document-term Matrix",
    language = c("en", "en", "ja", "de"))

input_dfm2 <- put_dc(input_dfm, dc_meta)
input_dfm2
```

Inspecting DCMES 1.5 data (similar to the `tm::DublinCore` method).

```{r}
inspect_dc(input_dfm2[4,])
```

Serialization

```{r}
export_resdtmf(input_dfm2, "input_dfm2.json")
input_dfm3 <- import_resdtmf("input_dfm2.json")
inspect_dc(input_dfm3[4,])
```

```{r, include = FALSE}
### Clean up
unlink("inaug_dfm2.json.zip")
unlink("inaug_dfm.json")
unlink("input_dfm2.json")
unlink("example.json")
```

Owner

  • Login: chainsawriot
  • Kind: user
  • Location: Germany
  • Company: @gesistsa

GitHub Events

Total
Last Year

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: about 18 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • chainsawriot (4)
Pull Request Authors
Top Labels
Issue Labels
bug (2)
Pull Request Labels