Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: UtrechtUniversity
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 85.2 MB
Statistics
  • Stars: 0
  • Watchers: 10
  • Forks: 0
  • Open Issues: 7
  • Releases: 2
Created over 5 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

ReadMe.md

Anonymize-DDP

Pseudonymizing software for data download packages (DDPs), specifically focussed on Instagram.

About Anonymize-DDP

Date: December 2020

Researchers:

  • Laura Boeschoten (l.boeschoten@uu.nl)

Research Software Engineers:

  • Martine de Vos (m.g.devos@uu.nl)
  • Roos Voorvaart (r.voorvaart@uu.nl)

Built With

The blurring of text in images and videos is based on a pre-trained version of the EAST model. The sensitive information extracted from the data download package is replaced with pseudonymized substitutes using the AnonymoUUs package.

License

The code in this project is licensed under the MIT License.

Attribution and academic use

The scientific paper detailing the first release of anonymize-ddp is available here.

A data set consisting of 11 personal Instagram archives, or Data-Download Packages, was created to validate the anonymization procedure. This data set is publicly available at DOI

Getting Started

Prerequisites

This project makes use of Python 3.8 and Poetry for managing dependencies. To clone this repository, you'll need Git installed on your computer.

Preparatory steps

Before running the software, the following steps need to be taken:

  1. Clone repository
  2. Download DDP
  3. Create additional files

Clone repository

Run the following code in the command line:

```
# Clone this repository
$ git clone https://github.com/UtrechtUniversity/anonymize-ddp

# Go into the repository
$ cd anonymize-ddp/anonymize-ddp

# Install dependencies
$ poetry install
```

Download DDP

To download your Instagram data package:

  1. Go to www.instagram.com and log in
  2. Click on your profile picture, go to Settings, and then to Privacy and Security
  3. Scroll to Data download and click Request download
  4. Enter your email address and click Next
  5. Enter your password and click Request download

Instagram will deliver your data in a compressed zip folder with format username_YYYYMMDD.zip (i.e., Instagram handle and date of download). For Mac users this might be different, so make sure to check that all provided files are zipped into one folder with the name username_YYYYMMDD.zip. Save the DDP(s) in the data folder.
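The naming convention above can be checked before running the software. The following is a minimal sketch, not part of anonymize-ddp: the function names `parse_ddp_name` and `find_misnamed` are hypothetical, and the pattern assumes that the handle is everything before the final underscore and that the date is a valid eight-digit calendar date.

```python
import re
from datetime import datetime
from pathlib import Path

# Pattern assumed from the README text: <instagram handle>_<YYYYMMDD>.zip
DDP_NAME = re.compile(r"^(?P<handle>.+)_(?P<date>\d{8})\.zip$")


def parse_ddp_name(filename):
    """Return (handle, download date), or None if the name does not match."""
    match = DDP_NAME.match(filename)
    if match is None:
        return None
    try:
        date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date
        return None
    return match.group("handle"), date


def find_misnamed(folder):
    """List the zip files in `folder` whose names do not follow the pattern."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.zip")
        if parse_ddp_name(path.name) is None
    )
```

Calling `find_misnamed("data")` on the data folder would list any archives that need to be renamed or re-zipped, which may be useful for the Mac case mentioned above.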

Create additional files

Before you can run the software, make sure that the src folder contains the following items:

  • Text detection model: the frozen_east_text_detection.pb file, necessary for blurring text in images and videos, can be downloaded from GitHub
  • Participant file*: an overview of all participants' usernames and participant IDs (e.g., participants.csv). We recommend placing this file in the anonymize folder. However, you can save this file anywhere you like, as long as you refer to the path correctly while running the software.

* N.B. The participant file is only relevant for participant-based studies with predefined participant IDs. The file can have any name, as long as it is saved as .csv and contains two columns: the first with the original Instagram handles (e.g., janjansen) and the second with the participant IDs (e.g., PP001).
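A participant file in the two-column layout described above could be created and sanity-checked as follows. This is a sketch under assumptions: the helper names are hypothetical, and the README does not say whether a header row is expected, so none is written here.

```python
import csv


def write_participants(path, pairs):
    """Write (Instagram handle, participant ID) rows to a two-column CSV."""
    with open(path, "w", newline="", encoding="utf-8") as file:
        csv.writer(file).writerows(pairs)


def check_participants(path):
    """Return a list of problems found in the participant file."""
    problems = []
    seen = set()
    with open(path, newline="", encoding="utf-8") as file:
        for number, row in enumerate(csv.reader(file), start=1):
            # Every row needs exactly two non-empty cells
            if len(row) != 2 or not all(cell.strip() for cell in row):
                problems.append(f"row {number}: expected 2 non-empty columns")
                continue
            # A handle should map to only one participant ID
            if row[0] in seen:
                problems.append(f"row {number}: duplicate handle {row[0]}")
            seen.add(row[0])
    return problems
```

For example, `write_participants("participants.csv", [("janjansen", "PP001")])` followed by `check_participants("participants.csv")` returns an empty list when the file is well formed.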

Run software

Once all preceding steps have been completed, the data download packages can be pseudonymized. Note that the poetry run command executes the given command inside the project's virtual environment. Run the program with at least the arguments -i for the data input folder (e.g., data) and -o for the data output folder (e.g., results/output):

```
$ poetry run python anonymize/anonymizing_instagram_uu.py [OPTIONS]

Options:
  -i  path to folder containing zipfiles (e.g., -i data/raw)
  -o  path to folder where files will be unpacked and pseudonymized (e.g., -o data/processed)
  -l  path to log file
  -p  path to participants list to use corresponding participant IDs (e.g., -p src/participants.csv)
  -c  replace capitalized names only (when not entering this option, the default = False; not case sensitive) (e.g., -c)
```

An overview of the program's workflow is shown below: flowanonymize.png

The output of the program is a copy of the zipped data download package in which all names, usernames, email addresses, and phone numbers are pseudonymized and all pictures and videos are blurred. This pseudonymized data download package is saved in the output folder.
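To illustrate what key-based replacement of usernames means, here is a simplified sketch. It is not the AnonymoUUs API or the project's implementation; the function `pseudonymize` and its `capitalized_only` parameter are hypothetical, and the parameter is only a loose analogue of the -c option.

```python
import re


def pseudonymize(text, key, capitalized_only=False):
    """Replace every whole-word occurrence of a key with its substitute."""
    if not key:
        return text
    # Case-insensitive by default; exact-case matching when capitalized_only
    flags = 0 if capitalized_only else re.IGNORECASE
    # Longest keys first, so "janjansen" is preferred over "jan"
    ordered = sorted(key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in ordered) + r")\b", flags
    )
    lookup = {k.lower(): v for k, v in key.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


key = {"janjansen": "PP001"}
print(pseudonymize("Hi janjansen, tagged JanJansen", key))
# prints "Hi PP001, tagged PP001"
```

The real software additionally handles names, email addresses, phone numbers, and media files, which this sketch does not attempt.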

Validation

The validation procedure determines how well the anonymization code de-identifies text. It compares the results of the automated anonymization with the ideal expected result, i.e., a manually created ground truth.

For this validation an example data set is used, which includes:

  • A set of 11 DDPs with nonsense content
  • A ground-truth file with the results of manually labeling the PII in these DDPs

The example data set is available at DOI.

Prerequisites

scikit-learn

Preparatory steps

Before running the software, the following steps need to be taken:

  1. Collect data: both Instagram DDPs and corresponding ground truth data
  2. Anonymize Instagram DDPs

Anonymize DDPs

Run the automated anonymization as described in the main Readme. After the anonymization, make sure you have separate folders with the following data:

  • original DDPs
  • anonymized DDPs
  • key files

Perform evaluation

When all preceding steps are taken, the evaluation can be performed.

```
$ cd anonymize/validation
$ poetry run python validation_script.py [OPTIONS]

Options:
  -r  path to file with results of manual labeling
  -p  path to folder with anonymized datapackages; output of anonymization
  -k  path to folder with key files; output of anonymization
```

Output

Evaluation metrics:

  • a table with recall, precision, and F1
  • four folders with specific occurrences of FP, FN, TP, and special hashes
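For reference, the three reported metrics follow from the TP, FP, and FN counts in the usual way. The sketch below assumes the standard definitions (TP: PII correctly pseudonymized, FP: non-PII replaced, FN: PII missed); it is not taken from validation_script.py, and the counts in the example are made up.

```python
def evaluation_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
print(evaluation_metrics(tp=90, fp=10, fn=30))
```

Recall is typically the metric to watch in a de-identification setting, because a false negative means that a piece of PII was left in the data.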

Testing

Run the tests with the available test data to check the consistency of the evaluation procedure. From the root folder:

```
# Go to the main folder of the poetry project
$ cd anonymize-ddp/anonymize-ddp

# Run the test
$ poetry run pytest
```

Owner

  • Name: Utrecht University
  • Login: UtrechtUniversity
  • Kind: organization
  • Email: info.rdm@uu.nl
  • Location: Utrecht, The Netherlands

The central place for managing code and software for Utrecht University researchers and employees

Citation (CITATION.cff)

# YAML 1.2
---
authors: 
  -
    affiliation: "Utrecht University"
    family-names: Voorvaart
    given-names: Roos
    orcid: "https://orcid.org/0000-0002-4411-8495"
  -
    affiliation: "Utrecht University"
    family-names: "de Vos"
    given-names: "Martine G"
    orcid: "https://orcid.org/0000-0001-5301-1713"
  -
    affiliation: "Utrecht University"
    family-names: Boeschoten
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-3536-0474"
  -
    affiliation: "Utrecht University"
    family-names: "van den Goorbergh"
    given-names: Ruben
    orcid: "https://orcid.org/0000-0003-3229-3015"
  -
    affiliation: "Utrecht University"
    family-names: Kaandorp
    given-names: Casper
    orcid: "https://orcid.org/0000-0001-6326-6680"
cff-version: "1.0.3"
doi: "10.5281/zenodo.5211335"
keywords: 
  - "pseudonymization"
  - "python"
license: "MIT"
message: "If you use this software, please cite it using these metadata."
repository-code: "https://github.com/UtrechtUniversity/anonymize-ddp/releases/tag/v1.0.1"
title: "anonymize-ddp: pseudonimizing software for data download packages "
version: "1.0.1"
date-released: "2021-08-17"

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 105
  • Total Committers: 2
  • Avg Commits per committer: 52.5
  • Development Distribution Score (DDS): 0.381
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Martine de Vos m****s@g****m 65
Voorvaart r****t@u****l 40
Committer Domains (Top 20 + Academic)
uu.nl: 1
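The Development Distribution Score listed above can be reproduced from the per-committer commit counts, assuming the common definition of DDS as one minus the top committer's share of all commits. The function name is hypothetical, and the zero-commit case returning 0.0 is an assumption chosen to match the Past Year figures.

```python
def development_distribution_score(commit_counts):
    """Return 1 - (commits by the top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        # No commits in the period, reported as 0.0 in the listing
        return 0.0
    return 1 - max(commit_counts) / total


# Commit counts from the Top Committers list: 65 and 40
print(round(development_distribution_score([65, 40]), 3))
# prints 0.381
```

A score near 0 means a single committer made nearly all commits; higher values indicate that work is spread over more people.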

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 2
  • Total pull requests: 10
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 months
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 8
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • MartineDeVos (2)
Pull Request Authors
  • dependabot[bot] (8)
  • MartineDeVos (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (8)

Dependencies

anonymize-ddp/poetry.lock pypi
  • atomicwrites 1.4.0 develop
  • attrs 21.2.0 develop
  • colorama 0.4.4 develop
  • iniconfig 1.1.1 develop
  • packaging 21.2 develop
  • pluggy 1.0.0 develop
  • py 1.10.0 develop
  • pyparsing 2.4.7 develop
  • pytest 6.2.5 develop
  • toml 0.10.2 develop
  • anonymouus 0.0.8
  • certifi 2021.10.8
  • charset-normalizer 2.0.7
  • facenet-pytorch 2.5.2
  • idna 3.3
  • imutils 0.5.4
  • joblib 1.1.0
  • numpy 1.21.1
  • opencv-python 4.5.3.56
  • pandas 1.3.4
  • pathlib 1.0.1
  • pillow 8.4.0
  • progressbar2 3.55.0
  • python-dateutil 2.8.2
  • python-utils 2.5.6
  • pytz 2021.3
  • requests 2.26.0
  • scikit-learn 0.24.2
  • scipy 1.6.1
  • six 1.16.0
  • threadpoolctl 3.0.0
  • torch 1.9.1
  • torchvision 0.10.1
  • typing-extensions 3.10.0.2
  • urllib3 1.26.7
  • zipp 3.6.0
anonymize-ddp/pyproject.toml pypi
  • pytest ^6.0.0 develop
  • anonymoUUs 0.0.8
  • facenet-pytorch ^2.5.2
  • imutils ^0.5.4
  • opencv-python ^4.5.3
  • pandas ^1.3.2
  • pathlib ^1.0.1
  • progressbar2 ^3.53.1
  • python ^3.8
  • python-dateutil ^2.8.2
  • scikit-learn ^0.24.2
  • torchvision ^0.10.0
  • zipp ^3.5.0
requirements.txt pypi
  • anonymoUUs ==0.0.8
  • facenet-pytorch >=2.2.9
  • imutils >=0.5.3
  • numpy >=1.17.2
  • opencv-python >=4.2.0.34
  • pandas >=0.25.1
  • pathlib >=1.0.1
  • progressbar2 >=3.51.4
  • python-dateutil >=2.8.1
  • scikit-learn >=0.24.1
  • torchvision >=0.6.1
  • zipp >=0.6.0