siftr

Fuzzily search a dataframe's names, labels, and levels to find the variable you need.

https://github.com/desiquintans/siftr

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary

Keywords

dataframe interactive r

Last synced: 9 months ago

Repository

Fuzzily search a dataframe's names, labels, and levels to find the variable you need.

Basic Info

Host: GitHub
Owner: DesiQuintans
License: other
Language: R
Default Branch: main
Homepage:
Size: 23.7 MB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 11
Releases: 0

Topics

dataframe interactive r

Created about 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog License

`siftr`

If you work as an analyst, you probably shift projects often and need to get oriented in a new dataset quickly. siftr is an interactive tool that helps you find the column you need in a large dataframe using powerful 'fuzzy' searches.

It was designed with medical, census, and survey data in mind, where dataframes can reach hundreds of columns and millions of rows.

Installation

``` r
# CRAN soon

# Or install the live development version from GitHub.
remotes::install_github("DesiQuintans/siftr")
```

Starting `siftr` with every R session

For convenience, you can add siftr to your .Rprofile so that it is immediately available when you start R.

``` r
file.edit(file.path("~", ".Rprofile"))  # Opens your global .Rprofile for editing.
```

Add this line and save it:

``` r
options(defaultPackages = c('datasets', 'utils', 'grDevices', 'graphics', 'stats', 'methods', 'siftr'))
```

Functions in `siftr`

| Function            | Description                                            |
|:--------------------|:-------------------------------------------------------|
| sift()              | Search through a dataframe's columns.                  |
| sift.name()         | Only search variable names (i.e. column names).        |
| sift.desc()         | Only search descriptive labels.                        |
| sift.factors()      | Only search factor labels (and value labels).          |
| save_dictionary()   | Save the data dictionary for use with tsv2label.       |
| options_sift()      | Get and set options related to how siftr functions.    |
| mtcars_lab          | A dataset bundled with the package for testing.        |

Ways of searching in `siftr`

Exact matching with or without regular expressions
Fuzzy matching with or without regular expressions
Orderless exact matching with or without regular expressions

Examples

``` r
library(siftr)
data(starwars, package = "dplyr")
```

By default, sift() searches for exact matches in a column's names, labels, levels, and unique values. As a convenience, you can type bare names in (i.e. color instead of "color") for simple queries.

``` r
sift(starwars, color)

#> ℹ Building dictionary for 'starwars'. This only happens when it changes.
#> ✔ Dictionary was built in 0.01 secs.
#>
#> 4 hair_color
#> Type: character Missing: 5 % All same? No
#> Peek: auburn, grey, grey, brown, blond, white, auburn, white, …
#> 5 skin_color
#> Type: character Missing: 0 % All same? No
#> Peek: white, blue, grey, red, green-tan, brown, fair, blue, ye…
#> 6 eye_color
#> Type: character Missing: 0 % All same? No
#> Peek: blue-gray, yellow, unknown, red, blue, gold, black, haze…
#>
#> ✔ There were 3 results for query `color`.
```

As you can see, sift() returns lots of useful information about the variables it has found: the column number and name, its type, how much of it is NA/NaN, whether all of its values are the same, and a random peek at some of the column's unique values.

The .dist argument opts in to approximate searching. It can take an integer (the number of characters that can be flexibly matched) or a double between 0 and 1 (e.g. 0.25 = 25% of the query pattern's length can be flexibly matched).

``` r
sift(starwars, homewolrd, .dist = 0.25)

#> 10 homeworld
#> Type: character Missing: 11 % All same? No
#> Peek: Serenno, Trandosha, Aleen Minor, Cerea, Cato Neimoidia, …
#>
#> ✔ There was 1 result for query `homewolrd`.
```
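To build intuition for how a fractional .dist could translate into an allowance of edits, here is a small sketch using base R's agrepl(), whose max.distance argument also accepts either an integer or a fraction of the pattern length. This is only an analogy: that siftr behaves the same way internally is an assumption, and the helper fuzzy_hit() is hypothetical and not part of siftr.

``` r
# Hypothetical helper (not part of siftr): approximate matching with base R.
fuzzy_hit <- function(query, candidates, dist = 0) {
  if (dist == 0) {
    # No tolerance requested: plain, literal substring matching.
    return(grepl(query, candidates, fixed = TRUE))
  }
  # agrepl() treats max.distance < 1 as a fraction of the pattern length
  # and larger values as an absolute number of edits.
  agrepl(query, candidates, max.distance = dist, fixed = TRUE)
}

cols <- c("name", "homeworld", "species")

fuzzy_hit("homewolrd", cols)               # exact: the typo matches nothing
fuzzy_hit("homewolrd", cols, dist = 0.25)  # fuzzy: only 'homeworld' is within reach
```

With a 9-character query, a fraction of 0.25 rounds up to an allowance of 3 edits, which is enough to absorb the two swapped letters in the typo.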

You can search with regular expressions, but these must be given as character strings.

``` r
sift(starwars, "gr(a|e)y")

#> 4 hair_color
#> Type: character Missing: 5 % All same? No
#> Peek: auburn, grey, grey, brown, blond, white, auburn, white, …
#> 5 skin_color
#> Type: character Missing: 0 % All same? No
#> Peek: white, blue, grey, red, green-tan, brown, fair, blue, ye…
#> 6 eye_color
#> Type: character Missing: 0 % All same? No
#> Peek: blue-gray, yellow, unknown, red, blue, gold, black, haze…
#>
#> ✔ There were 3 results for query `gr(a|e)y`.
```

If you give multiple queries, then you will get an orderless look-around search.

``` r
sift(mtcars_lab, gallon, mileage)

#> ℹ Building dictionary for 'mtcars_lab'. This only happens when it changes.
#> ✔ Dictionary was built in 0.01 secs.
#>
#> 2 mpg
#> Mileage (miles per gallon)
#> Type: double Missing: 0 % All same? No
#> Peek: 15.2, 21.5, 15, 30.4, 16.4, 14.3, 24.4, 15.5, 19.2, 22.8…
#>
#> ✔ There was 1 result for query `(?=.*gallon)(?=.*mileage)`.
```
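The query reported in that output is a chain of regex look-aheads, one per search term, which is why the terms can appear in any order. The sketch below rebuilds such a pattern in base R to show the idea. The helper orderless() is hypothetical (not a siftr function), and matching case-insensitively is an assumption made here so that "mileage" finds "Mileage".

``` r
# Hypothetical helper (not part of siftr): build an orderless pattern.
orderless <- function(...) {
  terms <- c(...)
  # Each term becomes a look-ahead that must succeed somewhere in the string.
  paste0("(?=.*", terms, ")", collapse = "")
}

pattern <- orderless("gallon", "mileage")
pattern

label <- "Mileage (miles per gallon)"

# Look-aheads need the PCRE engine, hence perl = TRUE.
grepl(pattern, label, perl = TRUE, ignore.case = TRUE)

# If any one term is absent, the whole pattern fails.
grepl(orderless("gallon", "kilometres"), label, perl = TRUE, ignore.case = TRUE)
```

Because every look-ahead starts from the same position, the label matches whether "gallon" comes before or after "mileage".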

Finally (and most powerfully), you can combine regular expressions and orderless look-around searches.

``` r
sift(starwars, color, "[a-z]{4}_")

#> 4 hair_color
#> Type: character Missing: 5 % All same? No
#> Peek: blond, unknown, none, auburn, grey, blonde, brown, auburn,…
#>
#> 5 skin_color
#> Type: character Missing: 0 % All same? No
#> Peek: white, brown mottle, white, blue, fair, green, yellow, blu…
#>
#> ✔ There were 2 results for query `(?=.*color)(?=.*[a-z]4_)`.
```

`siftr` works best on labelled data

sift() searches through these fields:

A column's name (colnames(df))
Its label (attr(col, "label"); placed by many packages including haven and labelled)
Its value labels (attr(col, "labels"); often hold-overs from SPSS or SAS datasets)
Its factor levels (levels(col))
Its unique values (unique(col)), sampled at random for large datasets

The more of these fields you can fill out, the more informative and powerful sift() will be.
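As a rough illustration of what "filling out" these fields can look like, the sketch below labels a toy dataframe using only base R. The dataframe and its label text are made up for this example; packages such as haven, labelled, or tsv2label would normally set these attributes for you.

``` r
# A toy dataframe (invented for illustration).
df <- data.frame(mpg = c(21.0, 22.8), am = c(1, 0))

# Variable labels: a free-text description stored on the column itself.
attr(df$mpg, "label") <- "Mileage (miles per gallon)"
attr(df$am,  "label") <- "Transmission"

# Value labels: a named vector mapping descriptions to the stored codes.
attr(df$am, "labels") <- c(Automatic = 0, Manual = 1)

# Factor levels: human-readable categories instead of bare numbers.
df$gear <- factor(c(4, 3), levels = c(3, 4, 5),
                  labels = c("Three", "Four", "Five"))

# Each of these is a field that the list above says sift() searches.
colnames(df)
attr(df$mpg, "label")
attr(df$am, "labels")
levels(df$gear)
```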

siftr pairs well with one of my other packages, tsv2label, which can label, rename, and factorise a dataset using a plain text dictionary.

Owner

Name: Desi Quintans
Login: DesiQuintans
Kind: user

Website: http://www.desiquintans.com
Repositories: 18
Profile: https://github.com/DesiQuintans

GitHub Events

Total

Issues event: 3
Watch event: 1

Last Year

Issues event: 3
Watch event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 38
Total pull requests: 1
Average time to close issues: about 2 months
Average time to close pull requests: 1 minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.26
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 18
Pull requests: 1
Average time to close issues: 3 days
Average time to close pull requests: 1 minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.11
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

DesiQuintans (33)

Pull Request Authors

DesiQuintans (2)

Top Labels

Issue Labels

enhancement (6) bug (4) invalid (1) question (1) style (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 144 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

cran.r-project.org: siftr

Fuzzily Search a Dataframe to Find Relevant Columns

Homepage: https://github.com/DesiQuintans/siftr
Documentation: http://cran.r-project.org/web/packages/siftr/siftr.pdf
License: MIT + file LICENSE
Status: removed
Latest release: 1.1.0
published almost 3 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 144 Last month

Rankings

Forks count: 28.7%

Dependent packages count: 29.2%

Dependent repos count: 34.9%

Stargazers count: 35.2%

Average: 43.5%

Downloads: 89.6%

Maintainers (1)

science@desiquintans.com

Last synced: about 1 year ago
