anonymize-ddp
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 8 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: UtrechtUniversity
- License: mit
- Language: Python
- Default Branch: main
- Size: 85.2 MB
Statistics
- Stars: 0
- Watchers: 10
- Forks: 0
- Open Issues: 7
- Releases: 2
Metadata Files
ReadMe.md
Anonymize-DDP
Pseudonymizing software for data download packages (DDP), specifically focussed on Instagram.
About Anonymize-DDP
Date: December 2020
Researchers:
* Laura Boeschoten (l.boeschoten@uu.nl)

Research Software Engineers:
* Martine de Vos (m.g.devos@uu.nl)
* Roos Voorvaart (r.voorvaart@uu.nl)
Built With
The blurring of text in images and videos is based on a pre-trained version of the EAST model. The extracted sensitive information in the data download package is replaced with pseudonymized substitutes using the AnonymoUUs package.
License
The code in this project is licensed under the MIT License.
Attribution and academic use
The scientific paper detailing the first release of anonymize-ddp is available here.
A data set consisting of 11 personal Instagram archives, or Data-Download Packages, was created to validate the anonymization procedure.
This data set is publicly available at
Getting Started
Prerequisites
This project makes use of Python 3.8 and Poetry for managing dependencies. To clone this repository, you'll need Git installed on your computer.
Preparatory steps
Before running the software, the following steps need to be taken:
Clone repository
Run the following code in the command line:
```
# Clone this repository
$ git clone https://github.com/UtrechtUniversity/anonymize-ddp

# Go into the repository
$ cd anonymize-ddp/anonymize-ddp

# Install dependencies
$ poetry install
```
Download DDP
To download your Instagram data package:
- Go to www.instagram.com and log in
- Click on your profile picture, go to Settings and Privacy and Security
- Scroll to Data download and click Request download
- Enter your email address and click Next
- Enter your password and click Request download
Instagram will deliver your data in a compressed zip folder with format username_YYYYMMDD.zip (i.e., Instagram handle and date of download). For Mac users this might be different, so make sure to check that all provided files are zipped into one folder with the name username_YYYYMMDD.zip. Save the DDP(s) in the data folder.
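The naming convention above can be checked before a package is placed in the data folder. The following is a minimal sketch, not part of anonymize-ddp: `parse_ddp_name` is a hypothetical helper, and it assumes that the username is everything before the final underscore and that the date consists of eight digits.

```python
import re
from datetime import date, datetime
from typing import Optional, Tuple

# Assumed pattern: <username>_<YYYYMMDD>.zip, where the username itself may contain underscores.
DDP_NAME = re.compile(r"^(?P<username>.+)_(?P<stamp>\d{8})\.zip$")


def parse_ddp_name(filename: str) -> Optional[Tuple[str, date]]:
    """Return (username, download date), or None if the name does not fit the convention."""
    match = DDP_NAME.match(filename)
    if match is None:
        return None
    try:
        downloaded = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return match.group("username"), downloaded


if __name__ == "__main__":
    for name in ["janjansen_20201215.zip", "jan_jansen_20201215.zip", "janjansen.zip"]:
        print(name, "->", parse_ddp_name(name))
```

A name that returns `None` is a hint that the archive should be renamed or re-zipped before running the software.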
Create additional files
Before you can run the software, you need to make sure that the src folder contains the following items:
* Facial blurring software: The frozen_east_text_detection.pb software, necessary for the facial blurring of images and videos, can be downloaded from GitHub.
* Participant file (only relevant for participant-based studies with predefined participant IDs): An overview of all participants' usernames and participant IDs (e.g., participants.csv). We recommend placing this file in the anonymize folder. However, you can save this file anywhere you like, as long as you refer to the path correctly when running the software. The file can have any name you prefer, as long as it is saved as .csv and contains two columns: the first with the original Instagram handles (e.g., janjansen) and the second with the participant IDs (e.g., PP001).
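As an illustration of the two-column layout, the sketch below writes and reads such a participant file with Python's standard `csv` module. It rests on assumptions: the README does not say whether a header row is expected, so none is written here, and the second handle (`pietpietersen`) and both function names are made up for the example.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict


def write_participants(path: Path, mapping: Dict[str, str]) -> None:
    """Write one 'Instagram handle,participant ID' row per participant (no header row)."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(mapping.items())


def read_participants(path: Path) -> Dict[str, str]:
    """Read the two-column file back into a handle -> participant ID dictionary."""
    with path.open(newline="", encoding="utf-8") as fh:
        return {row[0]: row[1] for row in csv.reader(fh) if len(row) >= 2}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "participants.csv"
        write_participants(target, {"janjansen": "PP001", "pietpietersen": "PP002"})
        print(read_participants(target))
```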
Run software
When all preceding steps are taken, the data download packages can be pseudonymized.
Note that the poetry run command executes the given command inside the project’s virtual environment.
Run the program with at least the arguments -i for the data input folder (e.g., data) and -o for the data output folder (e.g., results/output):
```
$ poetry run python anonymize/anonymizing_instagram_uu.py [OPTIONS]

Options:
  -i  path to folder containing zipfiles (e.g., -i data/raw)
  -o  path to folder where files will be unpacked and pseudonymized (e.g., -o data/processed)
  -l  path to log file
  -p  path to participants list to use corresponding participant IDs (e.g., -p src/participants.csv)
  -c  replace capitalized names only (when not entering this option, the default = False; not case sensitive) (e.g., -c)
```
An overview of the program's workflow is shown below:

The output of the program is a copy of the zipped data download package in which all names, usernames, email addresses, and phone numbers are pseudonymized and all pictures and videos are blurred. This pseudonymized data download package is saved in the output folder.
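To give an idea of what key-based replacement of names, usernames, and email addresses involves, here is a simplified sketch. It is not the AnonymoUUs API and not the project's actual implementation: the function `pseudonymize`, its `capitalized_only` parameter (a loose reading of the -c option), and the substitute strings are all assumptions made for illustration.

```python
import re
from typing import Dict


def pseudonymize(text: str, key_map: Dict[str, str], capitalized_only: bool = False) -> str:
    """Replace whole-word occurrences of each key entry with its substitute.

    capitalized_only=True matches entries exactly as written (case sensitive);
    the default ignores case, mirroring the 'not case sensitive' default of -c.
    """
    if not key_map:
        return text
    # Try longer entries first so a full name wins over a shorter entry it contains.
    ordered = sorted(key_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(item) for item in ordered) + r")(?!\w)",
        flags=0 if capitalized_only else re.IGNORECASE,
    )
    lookup = key_map if capitalized_only else {k.lower(): v for k, v in key_map.items()}

    def substitute(match: re.Match) -> str:
        found = match.group(0)
        return lookup[found if capitalized_only else found.lower()]

    return pattern.sub(substitute, text)


if __name__ == "__main__":
    # Hypothetical key: original value -> substitute.
    key = {"Jan Jansen": "__name1", "janjansen": "PP001", "jan@example.com": "__email"}
    print(pseudonymize("Mail jan@example.com or JanJansen", key))
    print(pseudonymize("jan jansen met Jan Jansen", key, capitalized_only=True))
```

Matching on word boundaries keeps the sketch from rewriting longer words that merely contain a handle; whether the real tool behaves the same way is not stated in this README.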
Validation
The validation procedure determines how well the anonymization code de-identifies text. It compares the results of the automated anonymization with the ideal expected result, i.e., a manually created ground truth.
For this validation, an example data set is used, which includes:
* A set of 11 DDPs with nonsense content
* A ground-truth file with the results of manually labeling the PII in these DDPs
The example data set is available at .
Prerequisites
scikit-learn
Preparatory steps
Before running the software, the following steps need to be taken:
- Collect data: both Instagram DDPs and corresponding ground truth data
- Anonymize Instagram DDPs
Anonymize DDPs
Run the automated anonymization as described in the main Readme. After the anonymization, make sure you have separate folders with the following data:
* original DDPs
* anonymized DDPs
* key files
Perform evaluation
When all preceding steps are taken, the evaluation can be performed.
```
$ cd anonymize/validation
$ poetry run python validation_script.py [OPTIONS]

Options:
  -r  path to file with results of manual labeling
  -p  path to folder with anonymized datapackages; output of anonymization
  -k  path to folder with key files; output of anonymization
```
Output
Evaluation metrics:
* table with recall, precision and F1
* four folders with specific occurrences of FP, FN, TP and special hashes
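For reference, precision, recall, and F1 follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses the standard definitions; the function name and the example counts are made up, and it is not taken from validation_script.py.

```python
from typing import Dict


def evaluation_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only, not results from the validation data set.
    print(evaluation_metrics(tp=8, fp=2, fn=2))
```

In this setting a false negative is a piece of PII that was left in the output, so recall is usually the metric to watch most closely.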
Testing
Run the tests with the available test data to check the consistency of the evaluation procedure. From the root folder:
```
# Go to the main folder of the poetry project
$ cd anonymize-ddp/anonymize-ddp

# Run the test
$ poetry run pytest
```
Owner
- Name: Utrecht University
- Login: UtrechtUniversity
- Kind: organization
- Email: info.rdm@uu.nl
- Location: Utrecht, The Netherlands
- Website: https://www.uu.nl
- Repositories: 85
- Profile: https://github.com/UtrechtUniversity
The central place for managing code and software for Utrecht University researchers and employees
Citation (CITATION.cff)
# YAML 1.2
---
authors:
  -
    affiliation: "Utrecht University"
    family-names: Voorvaart
    given-names: Roos
    orcid: "https://orcid.org/0000-0002-4411-8495"
  -
    affiliation: "Utrecht University"
    family-names: "de Vos"
    given-names: "Martine G"
    orcid: "https://orcid.org/0000-0001-5301-1713"
  -
    affiliation: "Utrecht University"
    family-names: Boeschoten
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-3536-0474"
  -
    affiliation: "Utrecht University"
    family-names: "van den Goorbergh"
    given-names: Ruben
    orcid: "https://orcid.org/0000-0003-3229-3015"
  -
    affiliation: "Utrecht University"
    family-names: Kaandorp
    given-names: Casper
    orcid: "https://orcid.org/0000-0001-6326-6680"
cff-version: "1.0.3"
doi: "10.5281/zenodo.5211335"
keywords:
  - "pseudonymization"
  - "python"
license: "MIT"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/UtrechtUniversity/anonymize-ddp/releases/tag/v1.0.1"
title: "anonymize-ddp: pseudonimizing software for data download packages "
version: "1.0.1"
date-released: "2021-08-17"
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Martine de Vos | m****s@g****m | 65 |
| Voorvaart | r****t@u****l | 40 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 2
- Total pull requests: 10
- Average time to close issues: N/A
- Average time to close pull requests: 5 months
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 8
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- MartineDeVos (2)
Pull Request Authors
- dependabot[bot] (8)
- MartineDeVos (2)
Dependencies
- atomicwrites 1.4.0 develop
- attrs 21.2.0 develop
- colorama 0.4.4 develop
- iniconfig 1.1.1 develop
- packaging 21.2 develop
- pluggy 1.0.0 develop
- py 1.10.0 develop
- pyparsing 2.4.7 develop
- pytest 6.2.5 develop
- toml 0.10.2 develop
- anonymouus 0.0.8
- certifi 2021.10.8
- charset-normalizer 2.0.7
- facenet-pytorch 2.5.2
- idna 3.3
- imutils 0.5.4
- joblib 1.1.0
- numpy 1.21.1
- opencv-python 4.5.3.56
- pandas 1.3.4
- pathlib 1.0.1
- pillow 8.4.0
- progressbar2 3.55.0
- python-dateutil 2.8.2
- python-utils 2.5.6
- pytz 2021.3
- requests 2.26.0
- scikit-learn 0.24.2
- scipy 1.6.1
- six 1.16.0
- threadpoolctl 3.0.0
- torch 1.9.1
- torchvision 0.10.1
- typing-extensions 3.10.0.2
- urllib3 1.26.7
- zipp 3.6.0
- pytest ^6.0.0 develop
- anonymoUUs 0.0.8
- facenet-pytorch ^2.5.2
- imutils ^0.5.4
- opencv-python ^4.5.3
- pandas ^1.3.2
- pathlib ^1.0.1
- progressbar2 ^3.53.1
- python ^3.8
- python-dateutil ^2.8.2
- scikit-learn ^0.24.2
- torchvision ^0.10.0
- zipp ^3.5.0
- anonymoUUs ==0.0.8
- facenet-pytorch >=2.2.9
- imutils >=0.5.3
- numpy >=1.17.2
- opencv-python >=4.2.0.34
- pandas >=0.25.1
- pathlib >=1.0.1
- progressbar2 >=3.51.4
- python-dateutil >=2.8.1
- scikit-learn >=0.24.1
- torchvision >=0.6.1
- zipp >=0.6.0