PII-Codex

PII-Codex: a Python library for PII detection, categorization, and severity assessment - Published in JOSS (2023)

https://github.com/edyvision/pii-codex

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: researchgate.net, wiley.com, acm.org, joss.theoj.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

analysis data-analysis personal-identifiable-information pii pii-detection poetry presidio python python3 research research-software research-tool

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 40% confidence
Last synced: 4 months ago

Repository

A research python package for detecting, categorizing, and assessing the severity of personal identifiable information (PII)

Basic Info
  • Host: GitHub
  • Owner: EdyVision
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 835 KB
Statistics
  • Stars: 89
  • Watchers: 3
  • Forks: 10
  • Open Issues: 5
  • Releases: 16
Created over 3 years ago · Last pushed 4 months ago
Metadata Files
Readme Contributing License Citation Zenodo

README.md

![alt text](https://github.com/EdyVision/pii-codex/blob/main/docs/PII_Codex_Logo.svg?raw=true) PII Detection, Categorization, and Severity Assessment [![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/) ![](https://img.shields.io/badge/code%20style-black-000000.svg) [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/EdyVision/pii-codex/graphs/commit-activity) [![codecov](https://codecov.io/gh/EdyVision/pii-codex/branch/main/graph/badge.svg?token=QO7DNMP87X)](https://codecov.io/gh/EdyVision/pii-codex) [![License](https://img.shields.io/badge/License-BSD_3--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause) [![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/) [![DOI](https://zenodo.org/badge/533554671.svg)](https://zenodo.org/badge/latestdoi/533554671) [![status](https://joss.theoj.org/papers/5296a84bba0925e682dcddf14bec5880/status.svg)](https://joss.theoj.org/papers/5296a84bba0925e682dcddf14bec5880)

Author: Eidan Rosado - @EdyVision
Affiliation: Nova Southeastern University, College of Computing and Engineering

Project Background

The PII Codex project was built as a core part of an ongoing research effort in Personal Identifiable Information (PII) detection and risk assessment (to be publicly released later in 2023). There was a need not only to detect PII in text, but also to identify its severity and its associated categorizations in cybersecurity research and policy documentation, and to give others in similar research efforts a way to reproduce or extend the research. PII Codex is a combination of systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016), while the ranking assigns category severities on the scale provided by Schwartz and Solove (2012): Non-Identifiable, Semi-Identifiable, and Identifiable.
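The three-level severity scale can be sketched as an enumeration. This is a hypothetical illustration, not the library's own type: the class name `RiskLevel`, the helper `risk_level_definition`, and the values 1 and 2 are assumptions. Only the pairing of level 3 with "Identifiable" appears in this README's sample output.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Hypothetical encoding of the Schwartz and Solove risk continuum."""

    NON_IDENTIFIABLE = 1   # assumed value, not confirmed by the README
    SEMI_IDENTIFIABLE = 2  # assumed value, not confirmed by the README
    IDENTIFIABLE = 3       # matches "risk_level": 3 in the sample output


def risk_level_definition(level: int) -> str:
    """Return a display label such as 'Semi-Identifiable' for a level."""
    return RiskLevel(level).name.replace("_", "-").title()


print(risk_level_definition(3))
```

An `IntEnum` keeps the levels ordered and numeric, so they can be averaged into a mean risk score while still carrying a readable name.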

The outputs of the primary PII Codex analysis and adapter functions are AnalysisResult or AnalysisResultSet objects that provide a listing of detections, severities, mean risk scores for each string processed, and summary statistics on the analysis. The final outputs do not contain the original texts; instead, they indicate where to find the detections, should the end user need this information in their analysis.
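Because results carry positions and not the original text, a caller who still holds the text can apply the positions directly. The sketch below is a minimal illustration and not part of PII-Codex: the function `redact` is hypothetical, and it assumes each detection exposes `start` and `end` offsets (as in the sample output later in this README) with `end` exclusive and no overlapping spans.

```python
from typing import Iterable, Mapping


def redact(
    text: str,
    detections: Iterable[Mapping[str, int]],
    mask: str = "<REDACTED>",
) -> str:
    """Replace each detected span in `text` with `mask`.

    Assumes `start` is inclusive, `end` is exclusive, and spans do not overlap.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(detections, key=lambda d: d["start"], reverse=True):
        text = text[: span["start"]] + mask + text[span["end"]:]
    return text


original = "Hi! My name is Ann"
print(redact(original, [{"start": 15, "end": 18}]))
```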

Statement of Need

The general knowledge base of identifiable data, the usage restrictions of this data, and the associated policies surrounding it have shifted drastically over the years. The tech industry has had to adjust to many policy changes regarding the tracking of individuals, the usage of data from online profiles and platforms, and the right to be forgotten entirely from a service or platform (GDPR). While the shift has provided data protections around the globe, the majority of technology users continue to have little to no control over their personal information with third-party data consumers (Trepte, 2020).

Knowing whether identifiable data types exist in a data set can prevent accidental sharing of such data by enabling its detection in the first place. This software package additionally presents sanitized strings and the reasons why a token was considered to be PII, which permits the results to be published.

Potential Usages

Potential usages include sanitizing dataset strings (e.g. a collection of social media posts) and presenting results to users of software that examines their interactions (e.g. UX research on user awareness in cybersecurity applications).


Running Locally with Poetry

This project uses Poetry. To run this project, install Poetry and follow the instructions in /docs/LOCAL_SETUP.md.

Note: This project has only been tested on Ubuntu and macOS with Python versions 3.9 and 3.10. You may need to upgrade pip ahead of installation.

Installing with PIP

A video capture of the install is provided in the LOCAL_SETUP.md file. Make sure you set up a virtual environment with either Python 3.9 or 3.10 and upgrade pip with:

```bash
pip install --upgrade pip
pip install -U pip setuptools wheel # only needed if you haven't already done so
```

Before adding pii-codex to your project, download the spaCy en_core_web_lg model:

```bash
pip install -U spacy
python3 -m spacy download en_core_web_lg
```

For more details on spaCy installation and usage, refer to their docs.

The repository releases are hosted on PyPI and can be installed with:

```bash
pip install pii-codex
pip install "pii-codex[detections]"
```

Note: The extras installed with pii-codex[detections] are the spaCy, Microsoft Presidio Analyzer, and Microsoft Anonymizer packages.

Using Poetry:

```bash
poetry update
poetry add pii-codex
poetry install pii-codex --extras="detections"
```

For those using Google Colab, check out the example notebook:

Open In Colab

Usage

Video capture of usage provided in LOCAL_SETUP.md.

Sample Input / Output

The built-in analyzer uses Microsoft Presidio. Feed in a collection of strings with `analyze_collection()` or just a single string with `analyze_item()`. Those analyzing a collection of strings will also be provided with statistics calculated on the risk scores for detected items.

```python
from pii_codex.services.analysis_service import PIIAnalysisService

PIIAnalysisService().analyze_collection(
    texts=["your collection of strings"],
    language_code="en",
    collection_name="Data Set Label",  # Optional labeling
    collection_type="SAMPLE",  # Defaults to POPULATION; used in stats calculations
)
```

For those analyzing social media posts, you can also pass in a data param (a dataframe with a text column and a metadata column) instead of a simple text array. The metadata currently supported are URL, LOCATION, and SCREEN_NAME.
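The shape of such input can be sketched with plain Python records. Everything here is an assumption made for illustration: the column names `text` and `metadata`, the upper-case metadata keys, the helper `build_record`, and the example values are not taken from the library.

```python
from typing import Any, Dict, Optional

# Metadata keys listed as supported in this README (exact casing is assumed).
SUPPORTED_METADATA_KEYS = {"URL", "LOCATION", "SCREEN_NAME"}


def build_record(text: str, metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one row with a text field and a metadata field (hypothetical helper)."""
    metadata = metadata or {}
    unsupported = set(metadata) - SUPPORTED_METADATA_KEYS
    if unsupported:
        raise ValueError(f"Unsupported metadata keys: {sorted(unsupported)}")
    return {"text": text, "metadata": metadata}


records = [
    build_record(
        "Check out my new site!",
        {"URL": "https://example.com", "SCREEN_NAME": "example_user"},
    ),
    build_record("Great hiking today.", {"LOCATION": "Denver"}),
]
print(records)
```

Passing `records` to `pandas.DataFrame(records)` would yield a frame with one `text` column and one `metadata` column. Whether the library expects exactly these column names should be checked against its documentation.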

Sample output (results object converted to dict from notebook):

```python
{
    "collection_name": "PII Collection 1",
    "collection_type": "POPULATION",
    "analyses": [
        {
            "analysis": [
                {
                    "pii_type_detected": "PERSON",
                    "risk_level": 3,
                    "risk_level_definition": "Identifiable",
                    "cluster_membership_type": "Financial Information",
                    "hipaa_category": "Protected Health Information",
                    "dhs_category": "Linkable",
                    "nist_category": "Directly PII",
                    "entity_type": "PERSON",
                    "score": 0.85,
                    "start": 21,
                    "end": 24,
                }
            ],
            "index": 0,
            "risk_score_mean": 3,
            "sanitized_text": "Hi! My name is <REDACTED>",
        },
        ...
    ],
    "detection_count": 5,
    "risk_scores": [3, 2.6666666666666665, 1, 2, 1],
    "risk_score_mean": 1.9333333333333333,
    "risk_score_mode": 1,
    "risk_score_median": 2,
    "risk_score_standard_deviation": 0.8273115763993905,
    "risk_score_variance": 0.6844444444444444,
    "detected_pii_types": {
        "LOCATION",
        "EMAIL_ADDRESS",
        "URL",
        "PHONE_NUMBER",
        "PERSON",
    },
    "detected_pii_type_frequencies": {
        "PERSON": 1,
        "EMAIL_ADDRESS": 1,
        "PHONE_NUMBER": 1,
        "URL": 1,
        "LOCATION": 1,
    },
}
```
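The summary statistics in the sample output can be reproduced from its `risk_scores` list with Python's standard `statistics` module. This is a sketch and not the library's implementation: the function name `summarize_risk_scores` is hypothetical, and it assumes that POPULATION selects population variance and standard deviation while SAMPLE selects the sample versions.

```python
import statistics
from typing import Dict, Sequence


def summarize_risk_scores(
    scores: Sequence[float], collection_type: str = "POPULATION"
) -> Dict[str, float]:
    """Summarize per-text risk scores (hypothetical helper)."""
    # Assumption: SAMPLE uses n - 1 in the denominator, POPULATION uses n.
    is_sample = collection_type == "SAMPLE"
    return {
        "risk_score_mean": statistics.mean(scores),
        "risk_score_mode": statistics.mode(scores),
        "risk_score_median": statistics.median(scores),
        "risk_score_standard_deviation": (
            statistics.stdev(scores) if is_sample else statistics.pstdev(scores)
        ),
        "risk_score_variance": (
            statistics.variance(scores) if is_sample else statistics.pvariance(scores)
        ),
    }


summary = summarize_risk_scores([3, 2.6666666666666665, 1, 2, 1])
print(summary)
```

With the default POPULATION setting, the mean, mode, median, variance, and standard deviation agree with the values in the sample output, which suggests that output was computed with population statistics.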

Docs

For more information on usage, check out the respective documentation for guidance on using PII-Codex.

| Topic | Document | Description |
|-----------------------------|----------------------------|------------------------------------------------------------------------------------------|
| PII Type Mappings | PII Mappings | Overview of how to perform mappings between PII types and how to review stored PII types. |
| PII Detections and Analysis | PII Detection and Analysis | Overview of how to detect and analyze strings |
| Local Repo Setup | Local Repo Setup | Instructions for local repository setup |
| Example Analysis | Example Analysis Notebook | Notebook with example analysis using MSFT Presidio |
| PII-Codex Docs | docs/pii_codex/index.html | Autogenerated docs on classes, services, and models |


Attributions

This project benefited greatly from a number of PII research works: Milne et al. (2016) for the definition of the types and categories; Schwartz and Solove (2012) for the severity levels of Non-Identifiable, Semi-Identifiable, and Identifiable; and the documentation by NIST, DHS (2012), and HIPAA (the full list of foundational publications is provided below). Special thanks go to all the open source projects and frameworks that made the setup and structuring of this project much easier, like Poetry, Microsoft Presidio, spaCy (2017), Jupyter, and several others.

Foundational Publications

The following publications inspired and provided a foundation for this repository:

| Concept | Document | Description |
|-------------------------------------------|----------------------------|--------------------------------------------------------------------------------|
| PII Type Mappings | Milne et al. (2016) | PII token categories and NIST and DHS categorizations. |
| Risk Continuum | Schwartz & Solove (2011) | Risk continuum concept and definition (what led to the ranking in PII-Codex). |
| Privacy and Affordances | Trepte (2020) | Background on third-party data consumption and user control (or lack thereof). |
| Social Media and Privacy | Beigi & Liu (2010) | Privacy issues with social media and third-party data consumption. |
| Privacy Settings and Data Access | Moura & Serrão (2016) | Privacy settings, data access, and unauthorized usage. |
| Information Privacy Review | Bélanger & Crossler (2011) | Concept of aggregation of data to identify individuals. |
| Big Data and Third Party Data Consumption | Tene & Polonetsky (2013) | Third-party data usage, user control, and privacy. |
| PII and Confidentiality | McCallister et al. (2010) | NIST guidance on PII confidentiality protections for federal agencies. |
| Data Capitalism and Privacy | West (2017) | Data capitalism, surveillance, and privacy. |

The remaining resources such as python library citations, cloud service provider docs, and cybersecurity guidelines are included in the paper.bib file.

Community Guidelines

For community guidelines and contribution instructions, please view the CONTRIBUTING.md file.

Owner

  • Name: Eidan Rosado
  • Login: EdyVision
  • Kind: user
  • Location: Colorado

Engineer, Entrepreneur, and amateur artist. Computer Scientist interested in all things Voice, HCI, and social computing.

JOSS Publication

PII-Codex: a Python library for PII detection, categorization, and severity assessment
Published
June 20, 2023
Volume 8, Issue 86, Page 5402
Authors
Eidan J. Rosado ORCID
College of Computing and Engineering, Nova Southeastern University, Fort Lauderdale, FL 33314, USA
Editor
Arfon Smith ORCID
Tags
PII PII topology risk categories personal identifiable information

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Rosado
    given-names: Eidan J.
    orcid: https://orcid.org/0000-0003-0665-098X
title: "pii-codex: a Python library for PII detection, categorization, and severity assessment"
version: 0.4.6
doi: 10.5281/zenodo.7212576
date-released: 2023-06-18

GitHub Events

Total
  • Issues event: 2
  • Watch event: 17
  • Issue comment event: 1
  • Push event: 5
  • Create event: 1
Last Year
  • Issues event: 2
  • Watch event: 17
  • Issue comment event: 1
  • Push event: 5
  • Create event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 49
  • Total Committers: 1
  • Avg Commits per committer: 49.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Eidan Rosado 1****n 49

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 6
  • Total pull requests: 34
  • Average time to close issues: almost 2 years
  • Average time to close pull requests: 5 days
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 1.33
  • Average comments per pull request: 0.35
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • xqrt (1)
  • sagar-roy-mulberri (1)
  • subhradip-bose (1)
  • HardKothari (1)
  • dilawar (1)
  • shoaibmalek21 (1)
Pull Request Authors
  • EdyVision (33)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 56 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 15
  • Total maintainers: 1
pypi.org: pii-codex

PII Detection, Categorization, and Severity Assessment

  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 56 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 17.1%
Stargazers count: 17.9%
Average: 18.4%
Forks count: 19.6%
Dependent repos count: 30.6%
Maintainers (1)
Last synced: 4 months ago

Dependencies

poetry.lock pypi
  • 140 dependencies
pyproject.toml pypi
  • Faker ^14.2.1 develop
  • assertpy ^1.1 develop
  • black ^22.8.0 develop
  • coverage ^6.4.4 develop
  • en_core_web_lg * develop
  • importlib-resources ^5.9.0 develop
  • ipykernel ^6.16.0 develop
  • jupyter ^1.0.0 develop
  • matplotlib ^3.6.0 develop
  • pylint ^2.15.0 develop
  • pytest ^7.1.3 develop
  • seaborn ^0.12.0 develop
  • dataclasses-json ^0.5.7
  • numpy ^1.23.2
  • pandas ^1.4.4
  • presidio-analyzer ^2.2.29
  • pydantic ^1.8.2
  • python >=3.9,<3.11
  • spacy ^3.4.1
.github/workflows/checks.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v3 composite
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • actions/create-release latest composite
  • actions/setup-python v3 composite
.github/workflows/test.yml actions
  • actions/checkout v3 composite
  • actions/checkout master composite
  • actions/setup-python v3 composite
  • codecov/codecov-action v3 composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite