https://github.com/ArtLabss/open-data-anonymizer

Python Data Anonymization & Masking Library For Data Science Tasks

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

anonymization data-anonymization data-encoding data-science machine-learning pandas pdf pdf-anonymization python python-data-anonymization
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ArtLabss
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Homepage: https://www.artlabs.tech
  • Size: 40.2 MB
Statistics
  • Stars: 272
  • Watchers: 8
  • Forks: 35
  • Open Issues: 6
  • Releases: 8
Created over 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License Code of conduct Authors

README.md

anonympy 🕶️




With ❤️ by ArtLabs

Overview

General Data Anonymization library for images, PDFs and tabular data. See ArtLabs/projects for more or similar projects.


Main Features

Ease of use - this package was written to be as intuitive as possible.

Tabular

  • Efficient - based on pd.DataFrame
  • Numerous anonymization methods
    • Numeric data
      • Generalization - Binning
      • Perturbation
      • PCA Masking
      • Generalization - Rounding
    • Categorical data
      • Synthetic Data
      • Resampling
      • Tokenization
      • Partial Email Masking
    • Datetime data
      • Synthetic Date
      • Perturbation
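
The tabular techniques listed above can be illustrated without the library. The sketch below is not anonympy's implementation: the helper names (`bin_value`, `round_to`, `perturb`, `tokenize`, `mask_email`) and the salted-hash token scheme are assumptions made for illustration, and only the Python standard library is used.

```python
import hashlib
import random


def bin_value(x, width):
    """Generalization by binning: replace x with the interval that contains it."""
    low = (x // width) * width
    return (low, low + width)


def round_to(x, base):
    """Generalization by rounding: snap x to the nearest multiple of base."""
    return base * round(x / base)


def perturb(x, scale, rng):
    """Perturbation: add uniform noise drawn from [-scale, scale]."""
    return x + rng.uniform(-scale, scale)


def tokenize(value, salt):
    """Tokenization: replace a value with a short salted hash (hypothetical scheme)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:10]


def mask_email(email):
    """Partial email masking: keep only the first and last character of the local part."""
    local, _, domain = email.partition("@")
    if len(local) <= 2:
        return "*" * len(local) + "@" + domain
    return local[0] + "*" * (len(local) - 2) + local[-1] + "@" + domain


rng = random.Random(0)  # fixed seed so the run is reproducible
age_bin = bin_value(33, 10)            # (30, 40)
salary = round_to(59234.32, 10000)     # 60000
noisy_age = perturb(33, 2, rng)        # some value in [31, 35]
token = tokenize("http://www.example.com", salt="s3cret")
masked = mask_email("josefrazier@owen.com")
```

Binning and rounding trade precision for privacy in a predictable way, while perturbation keeps values precise but slightly wrong. Which one fits depends on whether downstream analysis needs exact ranges or approximate magnitudes.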

Images

  • Anonymization techniques
    • Personal Images (faces)
      • Blurring
      • Pixelated Face Blurring
      • Salt and Pepper Noise
    • General Images
      • Blurring
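
To make two of the image techniques above concrete, here is a minimal sketch of pixelation and salt-and-pepper noise on a grayscale image stored as a list of rows. It is an illustration only, not the library's OpenCV-based code; the function names `pixelate` and `salt_and_pepper` and the tiny 4x4 "face" are hypothetical.

```python
import random


def pixelate(img, block):
    """Replace each block x block tile with its average intensity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for top in range(0, h, block):
        for left in range(0, w, block):
            rows = range(top, min(top + block, h))
            cols = range(left, min(left + block, w))
            tile = [img[y][x] for y in rows for x in cols]
            mean = sum(tile) // len(tile)
            for y in rows:
                for x in cols:
                    out[y][x] = mean
    return out


def salt_and_pepper(img, amount, rng):
    """Set roughly `amount` of the pixels to pure black (0) or pure white (255)."""
    out = [row[:] for row in img]
    for y in range(len(out)):
        for x in range(len(out[0])):
            if rng.random() < amount:
                out[y][x] = rng.choice((0, 255))
    return out


# A 4x4 grayscale gradient standing in for a detected face region.
face = [[16 * (4 * y + x) for x in range(4)] for y in range(4)]
pixelated = pixelate(face, block=2)
noisy = salt_and_pepper(face, amount=0.5, rng=random.Random(0))
```

Larger blocks (or a higher noise amount) hide more detail at the cost of a less usable image.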

PDF

  • Find sensitive information and cover it with black boxes
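
The idea of finding sensitive text and covering it with black boxes can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's pipeline: the `(text, box)` word format stands in for OCR output, the e-mail regular expression is deliberately loose, and `find_sensitive_boxes` and `cover_boxes` are hypothetical names.

```python
import re

# A loose e-mail pattern; good enough for a demo, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sensitive_boxes(words):
    """Return the bounding boxes of OCR words that look like e-mail addresses.

    `words` is a list of (text, (left, top, right, bottom)) tuples, a
    hypothetical stand-in for what an OCR engine reports.
    """
    return [box for text, box in words if EMAIL_RE.fullmatch(text)]


def cover_boxes(page, boxes, fill=0):
    """Paint every box black on a grayscale page stored as a list of rows."""
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                page[y][x] = fill
    return page


page = [[255] * 8 for _ in range(4)]  # blank white page, 8 px wide, 4 px tall
words = [("Contact:", (0, 0, 3, 2)),
         ("eryan@lewis.com", (3, 0, 8, 2))]
boxes = find_sensitive_boxes(words)
cover_boxes(page, boxes)
```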

Text, Sound

  • In Development


Installation

Dependencies

  1. Python (>= 3.7)
  2. cape-dataframes
  3. faker
  4. pandas
  5. OpenCV
  6. pytesseract
  7. transformers
  8. . . . . .

Install with pip

The easiest way to install anonympy is with pip:

pip install anonympy

Install from source

Installing the library from source code is also possible

git clone https://github.com/ArtLabss/open-data-anonimizer.git
cd open-data-anonimizer
pip install -r requirements.txt
make bootstrap

Downloading Repository

Or you could download this repository from pypi and run the following:

```
cd open-data-anonimizer
python setup.py install
```

Usage Example

[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wg4g4xWTSLvThYHYLKDIKSJEC4ChQHaM?usp=sharing)

More examples here

Tabular

```python
from anonympy.pandas import dfAnonymizer
from anonympy.pandas.utils_pandas import load_dataset

df = load_dataset()
print(df)
```

|   |  name | age |  birthdate |   salary |                                  web |                email |       ssn |
|--:|------:|----:|-----------:|---------:|-------------------------------------:|---------------------:|----------:|
| 0 | Bruce |  33 | 1915-04-17 | 59234.32 | http://www.alandrosenburgcpapc.co.uk | josefrazier@owen.com | 343554334 |
| 1 |  Tony |  48 | 1970-05-29 | 49324.53 |    http://www.capgeminiamerica.co.uk |      eryan@lewis.com | 656564664 |

```python
# Calling the generic function
anonym = dfAnonymizer(df)
anonym.anonymize(inplace=False)  # changes will be returned, not applied
```

|   | name            | age | birthdate  | salary  | web        | email               | ssn         |
|---|-----------------|-----|------------|---------|------------|---------------------|-------------|
| 0 | Stephanie Patel | 30  | 1915-05-10 | 60000.0 | 5968b7880f | pjordan@example.com | 391-77-9210 |
| 1 | Daniel Matthews | 50  | 1971-01-21 | 50000.0 | 2ae31d40d4 | tparks@example.org  | 872-80-9114 |

```python
# Or applying a specific anonymization technique to a column
from anonympy.pandas.utils_pandas import available_methods

anonym.categorical_columns
# ['name', 'web', 'email', 'ssn']
available_methods('categorical')
# categorical_fake  categorical_fake_auto  categorical_resampling  categorical_tokenization  categorical_email_masking

anonym.anonymize({'name': 'categorical_fake',  # {'column_name': 'method_name'}
                  'age': 'numeric_noise',
                  'birthdate': 'datetime_noise',
                  'salary': 'numeric_rounding',
                  'web': 'categorical_tokenization',
                  'email': 'categorical_email_masking',
                  'ssn': 'column_suppression'})
print(anonym.to_df())
```

|   |              name | age |  birthdate |  salary |        web |          email |
|--:|------------------:|----:|-----------:|--------:|-----------:|---------------:|
| 0 |         Paul Lang |  31 | 1915-04-17 | 60000.0 | 8ee92fb1bd |  j**r@owen.com |
| 1 | Michael Gillespie |  42 | 1970-05-29 | 50000.0 | 51b615c92e | e**n@lewis.com |


Images

```python
# Passing an Image
import cv2
from anonympy.images import imAnonymizer

img = cv2.imread('salty.jpg')
anonym = imAnonymizer(img)

blurred = anonym.face_blur((31, 31), shape='r', box='r')  # blurring shape and bounding box ('r' / 'c')
pixel = anonym.face_pixel(blocks=20, box=None)
sap = anonym.face_SaP(shape='c', box=None)
```

blurred | pixel | sap
:-------------------------:|:-------------------------:|:-------------------------:
input_img1 | output_img1 | sap_image

```python
# Passing a Folder
path = 'C:/Users/shakhansho.sabzaliev/Downloads/Data'  # images are inside Data folder
dst = 'D:/'  # destination folder
anonym = imAnonymizer(path, dst)

anonym.blur(method='median', kernel=11)
```

This will create a folder named Output in the dst directory.

```python
# The Data folder had the following structure

|   1.jpg
|   2.jpg
|   3.jpeg
|
\---test
    |   4.png
    |   5.jpeg
    |
    \---test2
            6.png

# The Output folder will have the same structure and file names but blurred images
```
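
For readers curious how an Output folder can mirror the source tree, the sketch below walks a directory and writes a processed copy of each image to the same relative path. It is an assumption-laden illustration rather than the library's code: `mirror_images`, `IMAGE_SUFFIXES`, and the `process` callback (which stands in for the blurring step) are hypothetical.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def mirror_images(src, dst, process):
    """Apply `process` to each image's bytes under src and write the result
    to the same relative path under dst/Output."""
    src, out_root = Path(src), Path(dst) / "Output"
    written = []
    for path in sorted(src.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            target = out_root / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(process(path.read_bytes()))
            written.append(target.relative_to(out_root).as_posix())
    return written


# Demo on a throwaway tree; the identity function stands in for blurring.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "Data"
    (data / "test" / "test2").mkdir(parents=True)
    for name in ("1.jpg", "test/4.png", "test/test2/6.png"):
        (data / name).write_bytes(b"fake image bytes")
    result = mirror_images(data, tmp, process=lambda raw: raw)
```

Keeping relative paths intact means the anonymized copies can replace the originals in any pipeline that addresses files by path.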


PDF

To initialize a pdfAnonymizer object, first install pytesseract and poppler. Then either pass the paths to both binaries as arguments or add those paths to the system variables.
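
Before deciding whether explicit paths are needed, it can help to check if the binaries are already discoverable. The snippet below is a small standard-library sketch: `find_binaries` is a hypothetical helper, and the executable names are assumptions (`tesseract` for Tesseract-OCR and `pdftoppm`, one of Poppler's command-line tools), not something the library is confirmed to look up in this way.

```python
import shutil


def find_binaries(names=("tesseract", "pdftoppm")):
    """Map each executable name to its full path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


found = find_binaries()
missing = [name for name, path in found.items() if path is None]
if missing:
    print("Not on PATH, pass explicit paths instead:", ", ".join(missing))
else:
    print("Tesseract and Poppler were found on PATH")
```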

```python
from anonympy.pdf import pdfAnonymizer

# Paths are specified explicitly because they are not in the system variables.
# Raw strings keep backslashes such as "\t" from being read as escape sequences.
anonym = pdfAnonymizer(path_to_pdf=r"Downloads\test.pdf",
                       pytesseract_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
                       poppler_path=r"C:\Users\shakhansho\Downloads\Release-22.01.0-0\poppler-22.01.0\Library\bin")

# Calling the generic function
anonym.anonymize(output_path='output.pdf',
                 remove_metadata=True,
                 fill='black',
                 outline='black')
```

test.pdf | output.pdf |
:-------------------------:|:-------------------------:|
test_img | output_img |

If you only want to hide specific information, use the other methods instead of anonymize:

```python
anonym = pdfAnonymizer(path_to_pdf=r"Downloads\test.pdf")
anonym.pdf2images()  # images are stored in anonym.images variable
anonym.images2text(anonym.images)  # texts are stored in anonym.texts

# Entities of interest
locs: dict = anonym.find_LOC(anonym.texts[0])  # index refers to page number
emails: dict = anonym.find_emails(anonym.texts[0])  # {page_number: [coords]}
coords: list = locs['page_1'] + emails['page_1']

anonym.cover_box(anonym.images[0], coords)
display(anonym.images[0])
```

Development

Contributions

The Contributing Guide has detailed information about contributing code and documentation.

Important Links

License

BSD-3

Code of Conduct

Please see Code of Conduct. All community members are expected to follow it.

Owner

  • Name: ArtLabs
  • Login: ArtLabss
  • Kind: organization

Machine Learning for eCommerce and Retail

GitHub Events

Total
  • Issues event: 1
  • Watch event: 26
  • Issue comment event: 1
  • Fork event: 5
Last Year
  • Issues event: 1
  • Watch event: 26
  • Issue comment event: 1
  • Fork event: 5

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 11
  • Total pull requests: 21
  • Average time to close issues: 3 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 11
  • Total pull request authors: 5
  • Average comments per issue: 2.36
  • Average comments per pull request: 0.14
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.5
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • iqramchoudhury (1)
  • jvmncs (1)
  • gpcr (1)
  • fbvdka (1)
  • ShNadi (1)
  • rscmendes (1)
  • prathnasingh (1)
  • schackartk (1)
  • captrespect (1)
Pull Request Authors
  • shukkkur (14)
  • mirmozavr (4)
  • mitmirzutun (1)
  • JostBrand (1)
  • lnagel (1)
Top Labels
Issue Labels
bug (4) enhancement (4) documentation (1)
Pull Request Labels
bug (2) documentation (1)

Dependencies

requirements.txt pypi
  • Faker ==10.0.0
  • Pillow ==9.0.1
  • PyPDF2 ==1.26.0
  • black *
  • coverage *
  • flake8 *
  • flake8-black *
  • imutils ==0.5.4
  • isort *
  • matplotlib ==3.5.1
  • numpy *
  • numpy ==1.22.0
  • opencv_python ==4.5.5.62
  • pandas *
  • pdf2image ==1.16.0
  • poppler-utils *
  • pyarrow ==6.0.1
  • pycryptodome *
  • pyspark *
  • pytesseract ==0.3.9
  • pytest *
  • pytest-cov *
  • pytest-httpserver *
  • pytest-mock *
  • pyyaml *
  • requests *
  • responses *
  • scikit_learn ==1.0.2
  • setuptools ==52.0.0
  • texttable ==1.6.4
  • transformers ==4.17.0
  • validators *
setup.py pypi
  • PyPDF2 *
  • faker *
  • numpy *
  • opencv_python *
  • pandas *
  • pdf2image *
  • poppler-utils *
  • pycryptodome *
  • pytesseract *
  • pyyaml *
  • requests *
  • rfc3339 *
  • scikit-learn *
  • setuptools *
  • texttable *
  • transformers *
  • validators *
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
.github/workflows/pylinter.yml actions
  • actions/checkout v2 composite
  • programmingwithalex/pylinter v1.4.3 composite
.github/workflows/python-app.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite