https://github.com/ArtLabss/open-data-anonymizer
Python Data Anonymization & Masking Library For Data Science Tasks
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.7%) to scientific vocabulary
Keywords
Repository
Python Data Anonymization & Masking Library For Data Science Tasks
Basic Info
- Host: GitHub
- Owner: ArtLabss
- License: bsd-3-clause
- Language: Python
- Default Branch: main
- Homepage: https://www.artlabs.tech
- Size: 40.2 MB
Statistics
- Stars: 272
- Watchers: 8
- Forks: 35
- Open Issues: 6
- Releases: 8
Topics
Metadata Files
README.md
anonympy 🕶️
Overview
General Data Anonymization library for images, PDFs and tabular data. See ArtLabs/projects for more or similar projects.
Main Features
Ease of use - this package was written to be as intuitive as possible.
Tabular
- Efficient - based on pd.DataFrame
- Numerous anonymization methods
  - Numeric data
    - Generalization - Binning
    - Perturbation
    - PCA Masking
    - Generalization - Rounding
  - Categorical data
    - Synthetic Data
    - Resampling
    - Tokenization
    - Partial Email Masking
  - Datetime data
    - Synthetic Date
    - Perturbation
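To make the numeric techniques in the list above concrete, here is a minimal standard-library sketch of binning, rounding and perturbation. It is an illustration only and not anonympy's implementation: the function names `bin_value`, `round_to` and `perturb`, the bin width, the rounding base and the noise range are all assumptions chosen for this example.

```python
import random

def bin_value(x, width):
    """Generalization by binning: return the half-open interval [lo, hi) that contains x."""
    lo = (x // width) * width
    return (lo, lo + width)

def round_to(x, base):
    """Generalization by rounding: snap x to the nearest multiple of base."""
    return base * round(x / base)

def perturb(x, max_noise, rng):
    """Perturbation: add uniform noise drawn from [-max_noise, max_noise]."""
    return x + rng.uniform(-max_noise, max_noise)

ages = [33, 48]
salaries = [59234.32, 49324.53]
rng = random.Random(0)  # seeded only so the sketch is repeatable

binned_ages = [bin_value(a, 10) for a in ages]              # [(30, 40), (40, 50)]
rounded_salaries = [round_to(s, 10000) for s in salaries]   # [60000, 50000]
noisy_ages = [perturb(a, 5, rng) for a in ages]             # each within 5 of the original
```

Binning and rounding trade precision for privacy in a predictable way, while perturbation keeps the value's scale but makes individual records inexact.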
Images
- Anonymization techniques
  - Personal Images (faces)
    - Blurring
    - Pixelated Face Blurring
    - Salt and Pepper Noise
  - General Images
    - Blurring
PDF
- Find sensitive information and cover it with black boxes
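As a rough illustration of the salt and pepper noise technique listed above, the sketch below corrupts a fraction of the pixels in a grayscale image stored as plain nested lists. It is not the library's code: the `salt_and_pepper` function, its `amount` parameter and the list-of-rows image format are assumptions made for this example.

```python
import random

def salt_and_pepper(image, amount, rng):
    """Return a copy of a grayscale image (rows of 0-255 ints) in which roughly
    `amount` of the pixels are replaced by white (salt) or black (pepper)."""
    noisy = [row[:] for row in image]  # copy so the input image is left untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = 255 if rng.random() < 0.5 else 0
    return noisy

image = [[128] * 8 for _ in range(8)]  # hypothetical 8x8 mid-gray image
noisy = salt_and_pepper(image, amount=0.3, rng=random.Random(42))
```

For face anonymization the same idea would be applied only inside a detected face region rather than across the whole image.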
Text, Sound
- In Development
Installation
Dependencies
- Python (>= 3.7)
- cape-dataframes
- faker
- pandas
- OpenCV
- pytesseract
- transformers
- . . . . .
Install with pip
The easiest way to install anonympy is with pip:
pip install anonympy
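After installing, one way to confirm that the package is visible to the current interpreter is to query its installed version. This is a small sketch using only the standard library; the helper name `installed_version` is made up for this example, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

print(installed_version("anonympy"))  # prints None if the install did not succeed
```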
Install from source
Installing the library from source code is also possible
git clone https://github.com/ArtLabss/open-data-anonimizer.git
cd open-data-anonimizer
pip install -r requirements.txt
make bootstrap
Downloading Repository
Alternatively, download this repository from PyPI and run the following:
```
cd open-data-anonimizer
python setup.py install
```
Usage Example
[Open in Colab](https://colab.research.google.com/drive/1wg4g4xWTSLvThYHYLKDIKSJEC4ChQHaM?usp=sharing) | More examples here
Tabular
```python
from anonympy.pandas import dfAnonymizer
from anonympy.pandas.utils_pandas import load_dataset

df = load_dataset()
print(df)
```
|   | name | age | birthdate | salary | web | email | ssn |
|--:|------:|----:|-----------:|---------:|-------------------------------------:|---------------------:|----------:|
| 0 | Bruce | 33 | 1915-04-17 | 59234.32 | http://www.alandrosenburgcpapc.co.uk | josefrazier@owen.com | 343554334 |
| 1 | Tony | 48 | 1970-05-29 | 49324.53 | http://www.capgeminiamerica.co.uk | eryan@lewis.com | 656564664 |
```python
# Calling the generic function
anonym = dfAnonymizer(df)
anonym.anonymize(inplace = False)  # changes will be returned, not applied
```
|   | name | age | birthdate | salary | web | email | ssn |
|------|-----------------|--------|------------|---------|------------|---------------------|-------------|
| 0 | Stephanie Patel | 30 | 1915-05-10 | 60000.0 | 5968b7880f | pjordan@example.com | 391-77-9210 |
| 1 | Daniel Matthews | 50 | 1971-01-21 | 50000.0 | 2ae31d40d4 | tparks@example.org | 872-80-9114 |
```python
# Or applying a specific anonymization technique to a column
from anonympy.pandas.utils_pandas import available_methods

anonym.categorical_columns
# ['name', 'web', 'email', 'ssn']
available_methods('categorical')
# categorical_fake  categorical_fake_auto  categorical_resampling  categorical_tokenization  categorical_email_masking

anonym.anonymize({'name': 'categorical_fake',  # {'column_name': 'method_name'}
                  'age': 'numeric_noise',
                  'birthdate': 'datetime_noise',
                  'salary': 'numeric_rounding',
                  'web': 'categorical_tokenization',
                  'email': 'categorical_email_masking',
                  'ssn': 'column_suppression'})
print(anonym.to_df())
```

|   | name | age | birthdate | salary | web | email |
|--:|------:|----:|-----------:|---------:|-------------------------------------:|---------------------:|
| 0 | Paul Lang | 31 | 1915-04-17 | 60000.0 | 8ee92fb1bd | j**r@owen.com |
| 1 | Michael Gillespie | 42 | 1970-05-29 | 50000.0 | 51b615c92e | e**n@lewis.com |
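The masked emails and short hexadecimal tokens in the output above can be understood through a small standalone sketch. This is not how anonympy implements these methods: `mask_email` and `tokenize` are hypothetical helpers, and the choices to keep the first and last character of the local part and to use a salted, truncated SHA-256 digest are assumptions for illustration.

```python
import hashlib

def mask_email(email, mask_char="*"):
    """Partial email masking: keep the first and last character of the local part."""
    local, sep, domain = email.partition("@")
    if not sep:
        raise ValueError(f"not an email address: {email!r}")
    if len(local) <= 2:
        masked = mask_char * len(local)  # too short to keep any character
    else:
        masked = local[0] + mask_char * (len(local) - 2) + local[-1]
    return masked + "@" + domain

def tokenize(value, salt, length=10):
    """Tokenization: map a value to a short, repeatable hex token using a secret salt."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:length]

masked = mask_email("josefrazier@owen.com")
token = tokenize("http://www.alandrosenburgcpapc.co.uk", salt="hypothetical-secret")
```

The same input and salt always give the same token, so joins across tables still work, but the salt has to stay secret because low-entropy values could otherwise be recovered by brute force.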
Images
```python
# Passing an Image
import cv2
from anonympy.images import imAnonymizer

img = cv2.imread('salty.jpg')
anonym = imAnonymizer(img)

blurred = anonym.face_blur((31, 31), shape='r', box = 'r')  # blurring shape and bounding box ('r' / 'c')
pixel = anonym.face_pixel(blocks=20, box=None)
sap = anonym.face_SaP(shape = 'c', box=None)
```

blurred | pixel | sap
:-------------------------:|:-------------------------:|:-------------------------:
```python
# Passing a Folder
path = 'C:/Users/shakhansho.sabzaliev/Downloads/Data'  # images are inside the Data folder
dst = 'D:/'  # destination folder
anonym = imAnonymizer(path, dst)

anonym.blur(method = 'median', kernel = 11)
```
This creates an Output folder in the dst directory.
```python
# The Data folder had the following structure

# |   1.jpg
# |   2.jpg
# |   3.jpeg
# |
# ---test
#     |   4.png
#     |   5.jpeg
#     |
#     ---test2
#             6.png

# The Output folder will have the same structure and file names but blurred images
```
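The way an output tree can mirror the input tree is sketched below with `pathlib`. This only illustrates the idea of preserving relative paths and is not the library's code: `mirrored_paths`, the `IMAGE_SUFFIXES` set and the default `out_name` are assumptions for this example.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions

def mirrored_paths(src, dst, out_name="Output"):
    """Pair every image under src with a target under dst/out_name,
    keeping the same relative subfolder and file name."""
    src = Path(src)
    out_root = Path(dst) / out_name
    pairs = []
    for path in sorted(src.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            pairs.append((path, out_root / path.relative_to(src)))
    return pairs
```

Each returned pair is (source image, destination path); a caller would create `target.parent` with `mkdir(parents=True, exist_ok=True)` before writing the processed image there.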
PDF
To initialize a pdfAnonymizer object, you first need to install pytesseract and poppler, then either pass the paths to both sets of binaries as arguments or add those paths to your system variables.
```python
from anonympy.pdf import pdfAnonymizer

# need to specify paths, since I don't have them in system variables
# (raw strings keep the Windows backslashes literal)
anonym = pdfAnonymizer(path_to_pdf = r"Downloads\test.pdf",
                       pytesseract_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe",
                       poppler_path = r"C:\Users\shakhansho\Downloads\Release-22.01.0-0\poppler-22.01.0\Library\bin")

# Calling the generic function
anonym.anonymize(output_path = 'output.pdf',
                 remove_metadata = True,
                 fill = 'black',
                 outline = 'black')
```
test.pdf | output.pdf |
:-------------------------:|:-------------------------:|
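Whether the Tesseract and Poppler binaries are already reachable through the system path can be checked before deciding to pass explicit paths. The following is a minimal sketch using `shutil.which`; the helper `find_binary` is hypothetical, and `pdftoppm` is used here as a representative Poppler command-line tool.

```python
import shutil

def find_binary(name, explicit_path=None):
    """Return explicit_path if given, otherwise the location of `name` on PATH (or None)."""
    return explicit_path or shutil.which(name)

tesseract = find_binary("tesseract")
pdftoppm = find_binary("pdftoppm")  # one of the Poppler command-line tools

for label, path in [("Tesseract", tesseract), ("Poppler", pdftoppm)]:
    print(f"{label}: {path or 'not on PATH, pass its location explicitly'}")
```

In the example above, the Tesseract argument points at the executable itself while the Poppler argument points at the directory containing the binaries.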
If you only want to hide specific information, use the individual methods below instead of anonymize.
```python
anonym = pdfAnonymizer(path_to_pdf = r"Downloads\test.pdf")
anonym.pdf2images()  # images are stored in anonym.images variable
anonym.images2text(anonym.images)  # texts are stored in anonym.texts

# Entities of interest
locs: dict = anonym.find_LOC(anonym.texts[0])  # index refers to page number
emails: dict = anonym.find_emails(anonym.texts[0])  # {page_number: [coords]}
coords: list = locs['page_1'] + emails['page_1']

anonym.cover_box(anonym.images[0], coords)
display(anonym.images[0])  # display() is available in Jupyter/IPython
```
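The example above concatenates the coordinate lists for a single page. For documents with several pages, the per-page dictionaries could be merged in one step, as in this sketch. The helper `merge_page_coords` is hypothetical, and the sample dictionaries with `(x1, y1, x2, y2)` boxes are invented data whose layout is an assumption, not the library's documented format.

```python
def merge_page_coords(*findings):
    """Combine several {page_key: [boxes]} dictionaries into one, page by page."""
    merged = {}
    for found in findings:
        for page, boxes in found.items():
            merged.setdefault(page, []).extend(boxes)
    return merged

# Hypothetical results; the (x1, y1, x2, y2) box layout is assumed for illustration.
locs = {"page_1": [(10, 20, 110, 40)], "page_2": [(5, 5, 50, 15)]}
emails = {"page_1": [(30, 60, 200, 80)]}

coords = merge_page_coords(locs, emails)
# {'page_1': [(10, 20, 110, 40), (30, 60, 200, 80)], 'page_2': [(5, 5, 50, 15)]}
```

Each page's merged list can then be passed to the covering step together with that page's image.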
Development
Contributions
The Contributing Guide has detailed information about contributing code and documentation.
Important Links
- Official source code repo: https://github.com/ArtLabss/open-data-anonimizer
- Download releases: https://pypi.org/project/anonympy/
- Issue tracker: https://github.com/ArtLabss/open-data-anonimizer/issues
License
Code of Conduct
Please see Code of Conduct. All community members are expected to follow it.
Owner
- Name: ArtLabs
- Login: ArtLabss
- Kind: organization
- Website: https://artlabs.tech/
- Repositories: 2
- Profile: https://github.com/ArtLabss
Machine Learning for eCommerce and Retail
GitHub Events
Total
- Issues event: 1
- Watch event: 26
- Issue comment event: 1
- Fork event: 5
Last Year
- Issues event: 1
- Watch event: 26
- Issue comment event: 1
- Fork event: 5
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 11
- Total pull requests: 21
- Average time to close issues: 3 days
- Average time to close pull requests: 2 days
- Total issue authors: 11
- Total pull request authors: 5
- Average comments per issue: 2.36
- Average comments per pull request: 0.14
- Merged pull requests: 20
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.5
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- iqramchoudhury (1)
- jvmncs (1)
- gpcr (1)
- fbvdka (1)
- ShNadi (1)
- rscmendes (1)
- prathnasingh (1)
- schackartk (1)
- captrespect (1)
Pull Request Authors
- shukkkur (14)
- mirmozavr (4)
- mitmirzutun (1)
- JostBrand (1)
- lnagel (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- Faker ==10.0.0
- Pillow ==9.0.1
- PyPDF2 ==1.26.0
- black *
- coverage *
- flake8 *
- flake8-black *
- imutils ==0.5.4
- isort *
- matplotlib ==3.5.1
- numpy *
- numpy ==1.22.0
- opencv_python ==4.5.5.62
- pandas *
- pdf2image ==1.16.0
- poppler-utils *
- pyarrow ==6.0.1
- pycryptodome *
- pyspark *
- pytesseract ==0.3.9
- pytest *
- pytest-cov *
- pytest-httpserver *
- pytest-mock *
- pyyaml *
- requests *
- responses *
- scikit_learn ==1.0.2
- setuptools ==52.0.0
- texttable ==1.6.4
- transformers ==4.17.0
- validators *
- PyPDF2 *
- faker *
- numpy *
- opencv_python *
- pandas *
- pdf2image *
- poppler-utils *
- pycryptodome *
- pytesseract *
- pyyaml *
- requests *
- rfc3339 *
- scikit-learn *
- setuptools *
- texttable *
- transformers *
- validators *
- actions/checkout v2 composite
- github/codeql-action/analyze v1 composite
- github/codeql-action/autobuild v1 composite
- github/codeql-action/init v1 composite
- actions/checkout v2 composite
- programmingwithalex/pylinter v1.4.3 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite