pii

Repo for the pii package, an easy way to identify personally identifiable information in your data

https://github.com/jacobpstein/pii

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (16.2%) to scientific vocabulary

Keywords

pii r

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: jacobpstein
License: other
Language: R
Default Branch: main
Size: 55.7 KB

Statistics

Stars: 7
Watchers: 1
Forks: 0
Open Issues: 3
Releases: 0

Topics

pii r

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# pii: A package for dealing with personally identifiable information


[![CRAN status](https://www.r-pkg.org/badges/version/pii)](https://CRAN.R-project.org/package=pii)


The goal of pii is to flag columns that potentially contain personally identifiable information (PII). This package was inspired by concerns that survey data might be shared without users realizing the files contain PII. It is based on a set of [standard guidelines](https://www.usaid.gov/sites/default/files/2022-05/508saa.pdf) from the United States Agency for International Development, though what counts as PII is open to debate.

The main function of the `pii` package, `check_PII`, looks for the following:

- Names
- Email addresses
- Phone numbers
- Locations (e.g., city or village name)
- Geo-coordinates
- Disability status
- Combinations of the above that might identify someone

The function flags potential PII dynamically by comparing each column's uniqueness to the median uniqueness across the dataset, adjusted by a margin (e.g., 20%). Numeric and date columns are skipped. Columns with mixed classes (e.g., text and numbers) are flagged, as this type of information is more likely to contain PII.
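
To make the uniqueness comparison concrete, here is a minimal sketch in base R. It is not the package's implementation: the helper names `uniqueness` and `flag_unique_columns`, the `margin` argument, and the choice to take the median over the non-numeric, non-date columns only are all assumptions made for illustration.

```r
# Hypothetical helper, not part of the pii package: share of distinct
# non-missing values in a column (1 means every value is unique).
uniqueness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  length(unique(x)) / length(x)
}

# Flag non-numeric, non-date columns whose uniqueness exceeds the median
# uniqueness of the candidate columns by more than `margin` (assumed 20%).
flag_unique_columns <- function(df, margin = 0.20) {
  skip <- vapply(
    df,
    function(col) is.numeric(col) || inherits(col, c("Date", "POSIXt")),
    logical(1)
  )
  candidates <- df[!skip]
  if (ncol(candidates) == 0) return(character(0))
  scores <- vapply(candidates, uniqueness, numeric(1))
  threshold <- median(scores, na.rm = TRUE) * (1 + margin)
  names(scores)[!is.na(scores) & scores > threshold]
}

# Made-up toy data: `respondent` is unique per row, the rest are not.
toy <- data.frame(
  respondent = c("Ana", "Ben", "Cy", "Dee", "Eli", "Fay"),
  region     = c("N", "N", "S", "S", "N", "S"),
  sex        = c("F", "M", "M", "F", "M", "F"),
  age        = c(31, 45, 28, 52, 39, 60),
  stringsAsFactors = FALSE
)

flagged <- flag_unique_columns(toy)
print(flagged)
```

In this toy data only `respondent` clears the threshold, while `age` is never considered because it is numeric. A relative threshold like this adapts to each dataset, but it can miss PII when most columns are highly unique, which is one reason to treat the result as a first pass.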

This function provides a first step in flagging *potential* PII within your data. Nothing beats gaining familiarity with the data, strong documentation, and careful data management.  

## Installation

You can install the development version of pii from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("jacobpstein/pii")
```

You can also install the released version from CRAN with:

``` r
install.packages("pii")
```

## Example

Let's say you're working with the `mtcars` data set. But this isn't just any version of `mtcars`: it's more like data about characters from the hit feature film [*Cars*](https://cars.disney.com)! You happen to have phone numbers and the location of each car's house. A colleague at a partner org asks for the data, but you aren't sure whether it contains PII, so you run the `check_PII` function.
```{r example}

library(dplyr)
library(tibble)
library(pii)

# Set a seed for reproducibility
set.seed(101624)

# Number of rows in the cars dataset
n <- nrow(mtcars)

# Generate car phone numbers as strings
phone_numbers <- sprintf("555-%03d-%04d", sample(100:999, n, replace = TRUE), sample(1000:9999, n, replace = TRUE))

# Generate latitudes for where the cars live (range roughly between -90 and 90)
latitudes <- runif(n, min = -90, max = 90)

# Generate longitudes for where the cars live (range roughly between -180 and 180)
longitudes <- runif(n, min = -180, max = 180)

# Merge new columns into the mtcars dataset using mutate
mtcars_with_pii <- mtcars %>%
  mutate(phone_number = phone_numbers,
         latitude = latitudes,
         longitude = longitudes) %>%
  # we also have row names with actual car names!
  rownames_to_column(var = "car_name")

# run our function over the data
mtcars_pii <- check_PII(mtcars_with_pii)

# take a look at the output
print(mtcars_pii)

```

The `check_PII` function flags combinations of columns that together could identify individuals. It also flags columns whose names suggest they contain PII, such as `phone_number`.
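
As a rough illustration of name-based flagging, the sketch below matches column names against a short keyword list. The keywords and the helper `flag_by_column_name` are hypothetical and chosen only for this example; the patterns `check_PII` actually uses may differ.

```r
# Illustrative keyword list; the patterns pii actually uses may differ.
pii_name_patterns <- c("name", "email", "phone", "address", "city",
                       "village", "lat", "lon", "gps", "disab")

# Return the column names that contain any keyword, ignoring case.
flag_by_column_name <- function(df, patterns = pii_name_patterns) {
  regex <- paste(patterns, collapse = "|")
  grep(regex, names(df), ignore.case = TRUE, value = TRUE)
}

# One made-up row with column names similar to the example above.
example_cols <- data.frame(
  car_name = "Mazda RX4",
  phone_number = "555-123-4567",
  latitude = 12.3,
  longitude = 45.6,
  mpg = 21
)

name_hits <- flag_by_column_name(example_cols)
print(name_hits)
```

Short keywords such as `lat` also match unrelated names like `plate`, so a name-based check of this kind tends to over-flag and its hits still need a manual look.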

## Separate your PII

Once you have run the `check_PII` function, you might want to remove the flagged columns from your data frame so that the data can easily be shared. The `split_PII_data` function removes the columns flagged by `check_PII`, puts them into a separate data frame, and creates a unique join key should you need to merge them back in at some point.

```{r example2}

# use our data from earlier
car_df_split <- split_PII_data(mtcars_with_pii, exclude_columns = c("car_name", "mpg", "cyl", "drat", "wt", "qsec", "vs", "am", "gear", "carb"))

# this creates a list containing two data frames: one with PII, one without

car_df_to_share <- car_df_split$non_pii_data

car_PII <- car_df_split$pii_data

```

Note that the `exclude_columns` argument lets you keep specific columns in the data even if they were flagged as PII.
```{r example3}

# take a look at our non-PII data
head(car_df_to_share)

```

Seems ok. Meanwhile, you can put the PII in a secure, encrypted location. But let's take a peek...
```{r example4}

# take a look at our PII data
head(car_PII)

```
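
If you later need to recombine the two pieces, the join key makes that a one-line merge. The sketch below shows the split-and-rejoin idea in base R without using the package: `split_with_key` and the key name `join_key` are assumptions for illustration, so check the output of `split_PII_data` for the key column it actually creates.

```r
# Hypothetical stand-in for the split step; the function and the key name
# `join_key` are assumptions, not the package's API.
split_with_key <- function(df, pii_columns, key = "join_key") {
  df[[key]] <- seq_len(nrow(df))
  keep <- setdiff(names(df), pii_columns)
  list(
    non_pii_data = df[, keep, drop = FALSE],
    pii_data     = df[, c(key, pii_columns), drop = FALSE]
  )
}

# Made-up data with one column treated as PII.
cars_small <- data.frame(
  car_name     = c("Mazda RX4", "Datsun 710", "Valiant"),
  mpg          = c(21.0, 22.8, 18.1),
  phone_number = c("555-101-2020", "555-303-4040", "555-505-6060"),
  stringsAsFactors = FALSE
)

parts <- split_with_key(cars_small, pii_columns = "phone_number")

# Merge the two pieces back together on the shared key.
rejoined <- merge(parts$non_pii_data, parts$pii_data, by = "join_key")
print(rejoined)
```

Because both data frames carry the same key, the shareable file can be distributed on its own while the PII file stays in secure storage until a merge is actually needed.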

Owner

Login: jacobpstein
Kind: user

Repositories: 1
Profile: https://github.com/jacobpstein

GitHub Events

Total

Issues event: 2
Watch event: 6
Push event: 35
Create event: 2

Last Year

Issues event: 2
Watch event: 6
Push event: 35
Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

jacobpstein (3)

Packages

Total packages: 1
Total downloads:
- cran 195 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

cran.r-project.org: pii

Search Data Frames for Personally Identifiable Information

Homepage: https://github.com/jacobpstein/pii
Documentation: http://cran.r-project.org/web/packages/pii/pii.pdf
License: MIT + file LICENSE
Latest release: 1.3.0
published over 1 year ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 195 Last month

Rankings

Dependent packages count: 28.0%

Dependent repos count: 34.5%

Average: 49.8%

Downloads: 86.8%

Maintainers (1)

jacobpstein@gmail.com