anjana

ANJANA is a Python library for anonymizing sensitive data

https://github.com/ifca-advanced-computing/anjana

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

anonymity data data-analytics data-anonymization data-privacy data-science databases k-anonymity l-diversity open-data privacy security t-closeness
Last synced: 6 months ago

Repository

ANJANA is a Python library for anonymizing sensitive data

Basic Info
Statistics
  • Stars: 34
  • Watchers: 1
  • Forks: 3
  • Open Issues: 7
  • Releases: 7
Topics
anonymity data data-analytics data-anonymization data-privacy data-science databases k-anonymity l-diversity open-data privacy security t-closeness
Created almost 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

ANJANA


Anonymity as major assurance of personal data privacy:

ANJANA is a Python library for anonymizing sensitive data.

The following anonymity techniques are implemented, based on the Python library pyCANON:

* k-anonymity.
* (α,k)-anonymity.
* ℓ-diversity.
* Entropy ℓ-diversity.
* Recursive (c,ℓ)-diversity.
* t-closeness.
* Basic β-likeness.
* Enhanced β-likeness.
* δ-disclosure privacy.
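To make the first and third techniques concrete, here is a minimal sketch that measures k and ℓ on a toy table using only the standard library. The function names `equivalence_classes`, `measure_k` and `measure_l` and the records themselves are illustrative assumptions, not part of ANJANA or pyCANON.

```python
from collections import defaultdict

# Toy records (hypothetical): age and city are quasi-identifiers,
# disease is the sensitive attribute.
records = [
    {"age": "[20, 30)", "city": "Kerala", "disease": "Cancer"},
    {"age": "[20, 30)", "city": "Kerala", "disease": "TB"},
    {"age": "[20, 30)", "city": "Kerala", "disease": "TB"},
    {"age": "[10, 20)", "city": "Kerala", "disease": "Cancer"},
    {"age": "[10, 20)", "city": "Kerala", "disease": "Heart-related"},
]


def equivalence_classes(rows, quasi_ident):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[qi] for qi in quasi_ident)].append(row)
    return classes


def measure_k(rows, quasi_ident):
    """k is the size of the smallest equivalence class."""
    return min(len(c) for c in equivalence_classes(rows, quasi_ident).values())


def measure_l(rows, quasi_ident, sens_att):
    """l is the smallest number of distinct sensitive values in any class."""
    return min(
        len({row[sens_att] for row in c})
        for c in equivalence_classes(rows, quasi_ident).values()
    )


print(measure_k(records, ["age", "city"]))             # 2
print(measure_l(records, ["age", "city"], "disease"))  # 2
```

A table satisfies k-anonymity for any k up to the measured value, and likewise for ℓ-diversity.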

Installation

First, we strongly recommend the use of a virtual environment. In Linux:

```bash
virtualenv .venv -p python3
source .venv/bin/activate
```

Using pip:

Install anjana (Linux and Windows):

```bash
pip install anjana
```

Using git:

Install the latest version of anjana (Linux and Windows):

```bash
pip install git+https://github.com/IFCA-Advanced-Computing/anjana.git
```

Getting started

To anonymize your data you need to provide:

* The pandas dataframe with the data to be anonymized. Each column can contain identifiers, quasi-identifiers or sensitive attributes.
* The list with the names of the identifiers in the dataframe, so that they can be suppressed.
* The list with the names of the quasi-identifiers in the dataframe.
* The sensitive attribute (only one), when applying techniques other than k-anonymity.
* The level of anonymity to be applied, e.g. k (for k-anonymity), ℓ (for ℓ-diversity), t (for t-closeness), β (for basic or enhanced β-likeness), etc.
* The maximum level of record suppression allowed (from 0 to 100, acting as the percentage of suppressed records).
* A dictionary containing one dictionary for each quasi-identifier with the hierarchies and the levels.
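As a rough illustration of the suppression limit, the sketch below computes the percentage of records that sit in equivalence classes smaller than k and compares it with the allowed level. The helper `suppression_needed` and the sample tuples are hypothetical. How ANJANA itself chooses between suppressing records and generalizing further is not described here, so this only shows what the percentage means.

```python
from collections import Counter


def suppression_needed(qi_rows, k):
    """Percentage of records that sit in equivalence classes smaller than k."""
    sizes = Counter(qi_rows)
    too_small = sum(n for n in sizes.values() if n < k)
    return 100 * too_small / len(qi_rows)


# Hypothetical quasi-identifier tuples (age interval, gender) after generalization.
qi_rows = [
    ("[20, 30)", "Female"), ("[20, 30)", "Female"), ("[20, 30)", "Female"),
    ("[20, 30)", "Male"), ("[20, 30)", "Male"), ("[20, 30)", "Male"),
    ("[10, 20)", "Male"), ("[10, 20)", "Male"), ("[10, 20)", "Male"),
    ("[10, 20)", "Female"),
]

k = 3
supp_level = 20  # at most 20% of the records may be dropped

needed = suppression_needed(qi_rows, k)
print(needed)                # 10.0
print(needed <= supp_level)  # True
```

Here one record out of ten breaks 3-anonymity, so dropping it stays within a 20% limit.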

Example: apply k-anonymity, ℓ-diversity and t-closeness to the adult dataset with some predefined hierarchies:

```python
import pandas as pd
import anjana
from anjana.anonymity import k_anonymity, l_diversity, t_closeness

# Read and process the data
data = pd.read_csv("adult.csv")
data.columns = data.columns.str.strip()
cols = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "sex",
    "native-country",
]
for col in cols:
    data[col] = data[col].str.strip()

# Define the identifiers, quasi-identifiers and the sensitive attribute
quasi_ident = [
    "age",
    "education",
    "marital-status",
    "occupation",
    "sex",
    "native-country",
]
ident = ["race"]
sens_att = "salary-class"

# Select the desired level of k, l and t
k = 10
l_div = 2
t = 0.5

# Select the suppression limit allowed
supp_level = 50

# Import the hierarchies for each quasi-identifier. Define a dictionary containing them
hierarchies = {
    "age": dict(pd.read_csv("hierarchies/age.csv", header=None)),
    "education": dict(pd.read_csv("hierarchies/education.csv", header=None)),
    "marital-status": dict(pd.read_csv("hierarchies/marital.csv", header=None)),
    "occupation": dict(pd.read_csv("hierarchies/occupation.csv", header=None)),
    "sex": dict(pd.read_csv("hierarchies/sex.csv", header=None)),
    "native-country": dict(pd.read_csv("hierarchies/country.csv", header=None)),
}

# Apply the three functions: k-anonymity, l-diversity and t-closeness
data_anon = k_anonymity(data, ident, quasi_ident, k, supp_level, hierarchies)
data_anon = l_diversity(
    data_anon, ident, quasi_ident, sens_att, k, l_div, supp_level, hierarchies
)
data_anon = t_closeness(
    data_anon, ident, quasi_ident, sens_att, k, t, supp_level, hierarchies
)
```
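To give an intuition for the t parameter, the following sketch measures t on a toy table. It assumes a categorical sensitive attribute with equal ground distance between values, in which case the distance between two distributions reduces to half the sum of absolute differences. The helpers `distribution` and `measure_t` and the sample rows are illustrative assumptions; ANJANA and pyCANON may compute the distance differently, in particular for numerical attributes.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items()}


def measure_t(rows, quasi_ident, sens_att):
    """Largest distance between a class distribution and the global one."""
    overall = distribution([r[sens_att] for r in rows])
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[qi] for qi in quasi_ident)].append(r[sens_att])
    worst = 0.0
    for values in classes.values():
        local = distribution(values)
        # Variational distance: valid for categorical values at equal distance.
        dist = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, dist)
    return worst


# Hypothetical generalized rows with a binary sensitive attribute.
rows = [
    {"age": "[20, 30)", "salary-class": ">50K"},
    {"age": "[20, 30)", "salary-class": "<=50K"},
    {"age": "[30, 40)", "salary-class": "<=50K"},
    {"age": "[30, 40)", "salary-class": "<=50K"},
]

print(measure_t(rows, ["age"], "salary-class"))  # 0.25
```

A table satisfies t-closeness for any t at or above the measured value, so this toy table would pass with t = 0.5.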

The code above runs in less than 4 seconds on the more than 30,000 records of the original dataset.

Define your own hierarchies

All the anonymity functions available in ANJANA receive a dictionary with the hierarchies to be applied to the quasi-identifiers. The keys of this dictionary are the names of the quasi-identifier (QI) columns to which a hierarchy is to be applied. If you do not want to generalize some QIs, just leave them out of the dictionary. The value for each key is itself a dictionary: key 0 holds the raw column (as it is in the original dataset), key 1 holds the first level of transformation of the values of the original column, and so on, with one key per hierarchy level.
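One way to picture how such a dictionary can be used is as a lookup from the raw values at level 0 to the values at a chosen level. The sketch below is a plain-Python illustration with a hypothetical `generalize` helper and made-up values; it does not show how ANJANA applies hierarchies internally.

```python
def generalize(column, hierarchy, level):
    """Replace each raw value with its counterpart at the requested level."""
    lookup = dict(zip(hierarchy[0], hierarchy[level]))
    return [lookup[value] for value in column]


# Hypothetical hierarchy for an "age" column with two generalization levels.
age_hierarchy = {
    0: [29, 24, 19],                          # raw values
    1: ["[25, 30)", "[20, 25)", "[15, 20)"],  # first level: 5-year intervals
    2: ["[20, 30)", "[20, 30)", "[10, 20)"],  # second level: 10-year intervals
}

print(generalize([19, 29, 24], age_hierarchy, 1))
# ['[15, 20)', '[25, 30)', '[20, 25)']
print(generalize([19, 29, 24], age_hierarchy, 2))
# ['[10, 20)', '[20, 30)', '[20, 30)']
```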

For a better understanding, let's look at the following example. Suppose that we have the following simulated dataset (extracted from the hospital_extended.csv dataset used for testing purposes) with age, gender and city as quasi-identifiers, name as identifier and disease as sensitive attribute. For the QIs, we want to apply the following hierarchies: intervals of 5 years (first level) and 10 years (second level) for age, and suppression as the first level for both gender and city.

| name      | age | gender | city       | disease        |
|-----------|-----|--------|------------|----------------|
| Ramsha    | 29  | Female | Tamil Nadu | Cancer         |
| Yadu      | 24  | Female | Kerala     | Viralinfection |
| Salima    | 28  | Female | Tamil Nadu | TB             |
| Sunny     | 27  | Male   | Karnataka  | No illness     |
| Joan      | 24  | Female | Kerala     | Heart-related  |
| Bahuksana | 23  | Male   | Karnataka  | TB             |
| Rambha    | 19  | Male   | Kerala     | Cancer         |
| Kishor    | 29  | Male   | Karnataka  | Heart-related  |
| Johnson   | 17  | Male   | Kerala     | Heart-related  |
| John      | 19  | Male   | Kerala     | Viralinfection |

Then, in order to create the hierarchies we can define the following dictionary:

```python
import numpy as np

# `data` is the pandas dataframe holding the table above
age = data['age'].values
# Values: 29 24 28 27 24 23 19 29 17 19

# First level: 5-year intervals, one label per record
age_5years = ['[25, 30)', '[20, 25)', '[25, 30)', '[25, 30)', '[20, 25)',
              '[20, 25)', '[15, 20)', '[25, 30)', '[15, 20)', '[15, 20)']

# Second level: 10-year intervals, one label per record
age_10years = ['[20, 30)', '[20, 30)', '[20, 30)', '[20, 30)', '[20, 30)',
               '[20, 30)', '[10, 20)', '[20, 30)', '[10, 20)', '[10, 20)']

hierarchies = {
    "age": {0: age, 1: age_5years, 2: age_10years},
    "gender": {
        0: data["gender"].values,
        1: np.array([""] * len(data["gender"].values))  # Suppression
    },
    "city": {
        0: data["city"].values,
        1: np.array([""] * len(data["city"].values))  # Suppression
    },
}
```
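In a hand-written hierarchy like this one, every level lists one generalized value per entry of level 0, so it can be worth checking that the levels line up before anonymizing. The following sketch does so with a hypothetical `check_hierarchy` helper; it is not part of ANJANA, and the checks reflect an assumption about what a consistent hierarchy looks like.

```python
def check_hierarchy(hierarchy):
    """Return a list of problems found in one quasi-identifier hierarchy."""
    problems = []
    raw = list(hierarchy[0])
    for level in sorted(hierarchy):
        values = list(hierarchy[level])
        # Every level must be aligned position by position with level 0.
        if len(values) != len(raw):
            problems.append(
                f"level {level}: expected {len(raw)} values, got {len(values)}"
            )
            continue
        # A raw value should always generalize to the same value.
        seen = {}
        for r, g in zip(raw, values):
            if seen.setdefault(r, g) != g:
                problems.append(
                    f"level {level}: raw value {r!r} maps to {seen[r]!r} and {g!r}"
                )
                break
    return problems


good = {0: [29, 24, 29], 1: ["[25, 30)", "[20, 25)", "[25, 30)"]}
bad = {0: [29, 24, 29], 1: ["[25, 30)", "[20, 25)", "[20, 25)"], 2: ["[20, 30)"]}

print(check_hierarchy(good))       # []
print(len(check_hierarchy(bad)))   # 2
```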

You can also use the function generate_intervals() from utils to create the interval-based hierarchy as follows:

```python
import numpy as np
from anjana.anonymity import utils

age = data['age'].values

hierarchies = {
    "age": {
        0: data["age"].values,
        1: utils.generate_intervals(data["age"].values, 0, 100, 5),
        2: utils.generate_intervals(data["age"].values, 0, 100, 10),
    },
    "gender": {
        0: data["gender"].values,
        1: np.array([""] * len(data["gender"].values))  # Suppression
    },
    "city": {
        0: data["city"].values,
        1: np.array([""] * len(data["city"].values))  # Suppression
    },
}
```
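For intuition about what an interval-based level contains, the sketch below reproduces the 5-year and 10-year labels of the hand-written example with a small helper. `to_interval` is a hypothetical function written for this illustration, not ANJANA's implementation. It also takes only a lower bound and a step, since reading the library call's arguments as lower bound, upper bound and step is an assumption.

```python
def to_interval(value, lower, step):
    """Label the half-open interval [start, start + step) that contains value."""
    start = lower + ((value - lower) // step) * step
    return f"[{start}, {start + step})"


ages = [29, 24, 28, 27, 24, 23, 19, 29, 17, 19]

age_5years = [to_interval(a, 0, 5) for a in ages]
age_10years = [to_interval(a, 0, 10) for a in ages]

print(age_5years[:3])   # ['[25, 30)', '[20, 25)', '[25, 30)']
print(age_10years[:3])  # ['[20, 30)', '[20, 30)', '[20, 30)']
```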

License

This project is licensed under the Apache 2.0 license.

Citation

If you are using anjana you can cite it as follows:

```bibtex
@article{sainzpardo2024anjana,
  title={An Open Source Python Library for Anonymizing Sensitive Data},
  author={S{\'a}inz-Pardo D{\'\i}az, Judith and L{\'o}pez Garc{\'\i}a, {\'A}lvaro},
  journal={Scientific data},
  volume={11},
  number={1},
  pages={1289},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```

Funding and acknowledgments

This work is funded by the European Union through the SIESTA project (Horizon Europe) under Grant number 101131957.


Note: Anjana and the mythology of Cantabria

"La Anjana" is a character from the mythology of Cantabria. Known as the good fairy of Cantabria, generous and protective of all people, she helps the poor, the suffering and those who stray in the forest.

- Partially extracted from: Cotera, Gustavo. Mitología de Cantabria. Ed. Tantin, Santander, 1998.

Owner

  • Name: IFCA Advanced Computing and e-Science group
  • Login: IFCA-Advanced-Computing
  • Kind: organization
  • Location: Santander, Spain

Citation (CITATION.cff)

cff-version: 0.0.6
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sáinz-Pardo Díaz"
  given-names: "Judith"
  orcid: "https://orcid.org/0000-0002-8387-578X"
- family-names: "López García"
  given-names: "Álvaro"
  orcid: "https://orcid.org/0000-0002-0013-4602"
title: "ANJANA"
version: 1.1.0
date-released: 2024-05-13
url: "https://github.com/IFCA-Advanced-Computing/anjana"
identifiers:
  - type: doi
    value: 10.5281/zenodo.11186382
  - type: doi
    value: 10.1038/s41597-024-04019-z
references:
  - type: article
    authors:
      - family-names: "Sáinz-Pardo Díaz"
        given-names: "Judith"
        orcid: "https://orcid.org/0000-0002-8387-578X"
      - family-names: "López García"
        given-names: "Álvaro"
        orcid: "https://orcid.org/0000-0002-0013-4602"
    title: "An Open Source Python Library for Anonymizing Sensitive Data"
    journal: "Scientific Data"
    year: "2024"
    doi: "10.1038/s41597-024-04019-z"


GitHub Events

Total
  • Release event: 1
  • Watch event: 16
  • Delete event: 34
  • Issue comment event: 31
  • Push event: 47
  • Pull request event: 77
  • Fork event: 1
  • Create event: 40
Last Year
  • Release event: 1
  • Watch event: 16
  • Delete event: 34
  • Issue comment event: 31
  • Push event: 47
  • Pull request event: 77
  • Fork event: 1
  • Create event: 40

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 32
  • Average time to close issues: N/A
  • Average time to close pull requests: 13 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.47
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 27
Past Year
  • Issues: 0
  • Pull requests: 31
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.45
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 27
Top Authors
Issue Authors
  • dependabot[bot] (2)
  • judithspd (1)
Pull Request Authors
  • dependabot[bot] (97)
  • judithspd (17)
  • alvarolopez (1)
Top Labels
Issue Labels
dependencies (2) autorelease: pending (1) github_actions (1)
Pull Request Labels
dependencies (95) github_actions (14) autorelease: pending (9)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 223 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 7
  • Total maintainers: 2
pypi.org: anjana

ANJANA is an open source framework for applying different anonymity techniques.

  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 223 Last month
Rankings
Dependent packages count: 9.6%
Average: 36.4%
Dependent repos count: 63.1%
Maintainers (2)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi
test-requirements.txt pypi