anonymize-ddp

https://github.com/utrechtuniversity/anonymize-ddp

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 8 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: UtrechtUniversity
License: mit
Language: Python
Default Branch: main
Size: 85.2 MB

Statistics

Stars: 0
Watchers: 10
Forks: 0
Open Issues: 7
Releases: 2

Created over 5 years ago · Last pushed almost 3 years ago

Metadata Files

Readme License Citation

Anonymize-DDP

Pseudonymizing software for data download packages (DDPs), specifically focused on Instagram.

About Anonymize-DDP
Getting Started

About Anonymize-DDP

Date: December 2020

Researchers:
* Laura Boeschoten (l.boeschoten@uu.nl)

Research Software Engineers:
* Martine de Vos (m.g.devos@uu.nl)
* Roos Voorvaart (r.voorvaart@uu.nl)

Built With

The blurring of text in images and videos is based on a pre-trained version of the EAST model. The AnonymoUUs package is used to replace the extracted sensitive information in the data download package with pseudonymized substitutes.

License

The code in this project is licensed under the MIT License.

Attribution and academic use

The scientific paper detailing the first release of anonymize-ddp is available here.

A data set consisting of 11 personal Instagram archives, or Data-Download Packages, was created to validate the anonymization procedure. This data set is publicly available at

Getting Started

Prerequisites

This project makes use of Python 3.8 and Poetry for managing dependencies. To clone this repository, you'll need Git installed on your computer.

Preparatory steps

Before running the software, the following steps need to be taken:

Clone repository
Download DDP
Create additional files

Clone repository

Run the following code in the command line:

```
# Clone this repository
$ git clone https://github.com/UtrechtUniversity/anonymize-ddp

# Go into the repository
$ cd anonymize-ddp/anonymize-ddp

# Install dependencies
$ poetry install
```

Download DDP

To download your Instagram data package:

Go to www.instagram.com and log in
Click on your profile picture, go to Settings, then Privacy and Security
Scroll to Data download and click Request download
Enter your email address and click Next
Enter your password and click Request download

Instagram will deliver your data in a compressed zip folder with format username_YYYYMMDD.zip (i.e., Instagram handle and date of download). For Mac users this might be different, so make sure to check that all provided files are zipped into one folder with the name username_YYYYMMDD.zip. Save the DDP(s) in the data folder.
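The naming convention above can be checked before running the software. The following is a minimal sketch, not part of anonymize-ddp: the helper name `parse_ddp_name` and the regular expression are assumptions made for illustration, based only on the `username_YYYYMMDD.zip` format described here.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed pattern: <Instagram handle>_<8-digit date>.zip, as described in the text.
DDP_NAME = re.compile(r"^(?P<username>.+)_(?P<date>\d{8})\.zip$")


def parse_ddp_name(path):
    """Return (username, date) if the file name looks like username_YYYYMMDD.zip, else None.

    Hypothetical helper for illustration; anonymize-ddp may validate names differently.
    """
    match = DDP_NAME.match(Path(path).name)
    if match is None:
        return None
    try:
        # Reject digit runs that are not real calendar dates (e.g. 20201340).
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return None
    return match.group("username"), match.group("date")
```

For example, `parse_ddp_name("data/janjansen_20201201.zip")` yields the handle and the date, while a file named `janjansen.zip` is rejected and should be renamed or re-zipped first.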

Create additional files

Before you can run the software, you need to make sure that the src folder contains the following items:
* Facial blurring software: The frozen_east_text_detection.pb file, necessary for the blurring of images and videos, can be downloaded from GitHub.
* Participant file*: An overview of all participants' usernames and participant IDs (e.g., participants.csv). We recommend placing this file in the anonymize folder. However, you can save this file anywhere you like, as long as you refer to the path correctly while running the software.

* N.B. Only relevant for participant-based studies with predefined participant IDs. This file can have whatever name you prefer, as long as it is saved as .csv and contains two columns: the first being the original Instagram handles (e.g., janjansen) and the second the participant IDs (e.g., PP001).
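As an illustration of the two-column layout described above, the sketch below writes and reads such a file with Python's standard `csv` module. The helper names are hypothetical, and whether the tool expects a header row is not stated here, so this sketch assumes none; check the behaviour of your installed version before relying on it.

```python
import csv
import tempfile
from pathlib import Path


def write_participants(path, mapping):
    """Write one row per participant: Instagram handle, participant ID.

    Assumption: no header row (the README does not say whether one is expected).
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for username, participant_id in mapping.items():
            writer.writerow([username, participant_id])


def read_participants(path):
    """Read the two-column file back into a dict of handle -> participant ID."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


if __name__ == "__main__":
    # Round-trip demo in a temporary folder; the handles and IDs are made up.
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "participants.csv"
        write_participants(target, {"janjansen": "PP001", "pietpieters": "PP002"})
        print(read_participants(target))
```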

Run software

When all preceding steps are taken, the data download packages can be pseudonymized. Note that the poetry run command executes the given command inside the project's virtual environment. Run the program with at least the arguments -i for the data input folder (i.e., data) and -o for the data output folder (i.e., results/output):

```
$ poetry run python anonymize/anonymizing_instagram_uu.py [OPTIONS]

Options:
  -i  path to folder containing zipfiles (i.e., -i data/raw)
  -o  path to folder where files will be unpacked and pseudonymized (i.e., -o data/processed)
  -l  path to log file
  -p  path to participants list to use corresponding participant IDs (e.g., -p src/participants.csv)
  -c  replace capitalized names only (when not entering this option, the default = False; not case sensitive) (e.g., -c)
```

An overview of the program's workflow is shown below:

The output of the program is a copy of the zipped data download package with all names, usernames, email addresses, and phone numbers pseudonymized, and all pictures and videos blurred. This pseudonymized data download package is saved in the output folder.
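When processing several folders, it can help to assemble the command line from the options listed above instead of typing it each time. The sketch below only builds the argument list; `build_command` is a hypothetical helper, the flags are the ones documented in this README, and the script path is passed in by the caller (use the path from the command shown above).

```python
import shlex


def build_command(script, input_dir, output_dir,
                  log_file=None, participants=None, capitalized_only=False):
    """Assemble the poetry invocation from the documented options.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    cmd = ["poetry", "run", "python", script, "-i", input_dir, "-o", output_dir]
    if log_file:
        cmd += ["-l", log_file]          # optional path to log file
    if participants:
        cmd += ["-p", participants]      # optional participants list
    if capitalized_only:
        cmd.append("-c")                 # replace capitalized names only
    return cmd


if __name__ == "__main__":
    # "anonymize/script.py" is a placeholder; substitute the real script path.
    command = build_command("anonymize/script.py", "data/raw", "data/processed",
                            participants="src/participants.csv")
    print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the folder that contains the Poetry project.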

Validation

The validation procedure determines how well the anonymization code de-identifies text. It compares the results of the automated anonymization with the ideal expected result, i.e., a manually created ground truth.

For this validation an example data set is used, which includes:
* A set of 11 DDPs with nonsense content
* A ground-truth file with the results of manually labeling the PII in these DDPs

The example data set is available at .

Prerequisites

scikit-learn

Preparatory steps

Before running the software, the following steps need to be taken:

Collect data: both Instagram DDPs and corresponding ground truth data
Anonymize Instagram DDPs

Anonymize DDPs

Run the automated anonymization as described in the main Readme. After the anonymization, make sure you have separate folders with the following data:
* original DDPs
* anonymized DDPs
* key files

Perform evaluation

When all preceding steps are taken, the evaluation can be performed.

```
$ cd anonymize/validation
$ poetry run python validation_script.py [OPTIONS]

Options:
  -r  path to file with results of manual labeling
  -p  path to folder with anonymized datapackages; output of anonymization
  -k  path to folder with key files; output of anonymization
```

Output

Evaluation metrics:
* table with recall, precision and F1
* four folders with specific occurrences of FP, FN, TP and special hashes
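For reference, the three metrics in the table follow from the TP, FP and FN counts by the standard definitions. The sketch below shows that arithmetic in plain Python; it is not the code of validation_script.py, and the function name and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```python
def evaluation_metrics(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts.

    Illustrative only; an undefined ratio (zero denominator) is reported as 0.0 here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up counts: 8 PII items found correctly, 2 false alarms, 2 missed.
    print(evaluation_metrics(tp=8, fp=2, fn=2))
```

In a de-identification setting, recall is usually the number to watch, since every false negative is a piece of PII that was left in the data.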

Testing

Run the tests with the available test data to check the consistency of the evaluation procedure. From the root folder:

```
# Go to the main folder of the poetry project
$ cd anonymize-ddp/anonymize-ddp

# Run the test
$ poetry run pytest
```

Owner

Name: Utrecht University
Login: UtrechtUniversity
Kind: organization
Email: info.rdm@uu.nl
Location: Utrecht, The Netherlands

Website: https://www.uu.nl
Repositories: 85
Profile: https://github.com/UtrechtUniversity

The central place for managing code and software for Utrecht University researchers and employees

Citation (CITATION.cff)

# YAML 1.2
---
authors: 
  -
    affiliation: "Utrecht University"
    family-names: Voorvaart
    given-names: Roos
    orcid: "https://orcid.org/0000-0002-4411-8495"
  -
    affiliation: "Utrecht University"
    family-names: "de Vos"
    given-names: "Martine G"
    orcid: "https://orcid.org/0000-0001-5301-1713"
  -
    affiliation: "Utrecht University"
    family-names: Boeschoten
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-3536-0474"
  -
    affiliation: "Utrecht University"
    family-names: "van den Goorbergh"
    given-names: Ruben
    orcid: "https://orcid.org/0000-0003-3229-3015"
  -
    affiliation: "Utrecht University"
    family-names: Kaandorp
    given-names: Casper
    orcid: "https://orcid.org/0000-0001-6326-6680"
cff-version: "1.0.3"
doi: "10.5281/zenodo.5211335"
keywords: 
  - "pseudonymization"
  - "python"
license: "MIT"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/UtrechtUniversity/anonymize-ddp/releases/tag/v1.0.1"
title: "anonymize-ddp: pseudonimizing software for data download packages "
version: "1.0.1"
date-released: "2021-08-17"

Committers

Last synced: about 1 year ago

All Time

Total Commits: 105
Total Committers: 2
Avg Commits per committer: 52.5
Development Distribution Score (DDS): 0.381

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Martine de Vos	m**s@g**m	65
Voorvaart	r**t@u**l	40

Committer Domains (Top 20 + Academic)

uu.nl: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 2
Total pull requests: 10
Average time to close issues: N/A
Average time to close pull requests: 5 months
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 1.0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 8

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

MartineDeVos (2)

Pull Request Authors

dependabot[bot] (8)
MartineDeVos (2)

Top Labels

Issue Labels

Pull Request Labels

dependencies (8)

Dependencies

anonymize-ddp/poetry.lock pypi

atomicwrites 1.4.0 develop
attrs 21.2.0 develop
colorama 0.4.4 develop
iniconfig 1.1.1 develop
packaging 21.2 develop
pluggy 1.0.0 develop
py 1.10.0 develop
pyparsing 2.4.7 develop
pytest 6.2.5 develop
toml 0.10.2 develop
anonymouus 0.0.8
certifi 2021.10.8
charset-normalizer 2.0.7
facenet-pytorch 2.5.2
idna 3.3
imutils 0.5.4
joblib 1.1.0
numpy 1.21.1
opencv-python 4.5.3.56
pandas 1.3.4
pathlib 1.0.1
pillow 8.4.0
progressbar2 3.55.0
python-dateutil 2.8.2
python-utils 2.5.6
pytz 2021.3
requests 2.26.0
scikit-learn 0.24.2
scipy 1.6.1
six 1.16.0
threadpoolctl 3.0.0
torch 1.9.1
torchvision 0.10.1
typing-extensions 3.10.0.2
urllib3 1.26.7
zipp 3.6.0

anonymize-ddp/pyproject.toml pypi

pytest ^6.0.0 develop
anonymoUUs 0.0.8
facenet-pytorch ^2.5.2
imutils ^0.5.4
opencv-python ^4.5.3
pandas ^1.3.2
pathlib ^1.0.1
progressbar2 ^3.53.1
python ^3.8
python-dateutil ^2.8.2
scikit-learn ^0.24.2
torchvision ^0.10.0
zipp ^3.5.0

requirements.txt pypi

anonymoUUs ==0.0.8
facenet-pytorch >=2.2.9
imutils >=0.5.3
numpy >=1.17.2
opencv-python >=4.2.0.34
pandas >=0.25.1
pathlib >=1.0.1
progressbar2 >=3.51.4
python-dateutil >=2.8.1
scikit-learn >=0.24.1
torchvision >=0.6.1
zipp >=0.6.0
