anonymouus

Python package to replace identifiable strings in multiple files and folders at once.

https://github.com/utrechtuniversity/anonymouus

Keywords

de-identification python text

Keywords from Contributors

internet-archive web-scraping

Last synced: 6 months ago

Repository

Python package to replace identifiable strings in multiple files and folders at once.

Basic Info

Host: GitHub
Owner: UtrechtUniversity
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 145 KB

Statistics

Stars: 5
Watchers: 7
Forks: 0
Open Issues: 4
Releases: 1

Topics

de-identification python text

Created over 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

anonymoUUs

anonymoUUs is a Python package for replacing identifiable strings in multiple files and folders at once. It can be used to pseudonymise data files and therefore contributes to protecting personal data.

The goal of anonymoUUs is to substitute multiple identifying strings with pseudo-IDs to avoid traceable relationships between data batches. A single data batch typically consists of multiple nested folders that contain multiple files in multiple formats. AnonymoUUs runs through the entire file tree, looking for keywords and replacing them with the provided substitutes, including in:
- the file contents
- file names
- folder names
- zipped folders

Note: Whereas replacing personal details with non-personal details can make data less identifiable, it does not guarantee anonymous data!
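To make the idea of a tree-wide substitution concrete, here is a minimal sketch using only the Python standard library. It is not the anonymoUUs implementation: the function name `pseudonymise_tree` is hypothetical, every file is assumed to be UTF-8 text, and zipped folders are not handled.

```python
import re
import tempfile
from pathlib import Path


def pseudonymise_tree(root: Path, mapping: dict) -> None:
    """Replace every key of `mapping` in file contents, file names and folder names."""
    # One alternation pattern covering all (escaped) keys, longest key first.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def swap(text: str) -> str:
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    # Deepest paths first, so renaming a folder never invalidates
    # a path that still has to be visited.
    paths = sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True)
    for path in paths:
        if path.is_file():
            original = path.read_text(encoding="utf-8", errors="ignore")
            path.write_text(swap(original), encoding="utf-8")
        new_name = swap(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))


# Small demonstration on a throw-away folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Bob").mkdir()
    (root / "Bob" / "chat_Bob.txt").write_text("Bob says hi", encoding="utf-8")

    pseudonymise_tree(root, {"Bob": "subject-01"})

    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    demo_text = (root / "subject-01" / "chat_subject-01.txt").read_text(encoding="utf-8")
```

Processing the deepest paths first is a design choice of this sketch: it lets files be renamed before the folders that contain them.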

Supported file formats

AnonymoUUs can work with multiple text-based file types, like .txt, .html, .json and .csv. UTF-8 encoding is assumed. Users have several options to provide keyword-replacement mappings and to customize the behaviour of the software. Visit the usage section for more information.

Getting Started

Prerequisites

To install and run anonymoUUs, you need:
- an active Python installation;
- a folder containing the data to be pseudonymised;
- a keyword-replacement mapping file.

Installation

To install anonymoUUs, in your terminal run:

```sh
$ pip install anonymoUUs
```

Example workflow

To get started with a simple example, you can go through this Jupyter notebook, which runs through a minimal example of anonymoUUs.

Prerequisites:
- download the test data from the test_data folder
- make sure you have Jupyter Notebook installed

Usage

To run the software, you need to take the following steps:
1. Provide the path to the data to be substituted
2. Provide the keyword-replacement mapping
3. Create and customize the Anonymize object
4. Perform the substitutions

1. Input Data

Provide the path to the folder where the data resides, for example:

```python
from pathlib import Path

test_data = Path('../test_data/')
```

Details:
- Files are opened depending on their extension. Extensions that are not recognised will be skipped.
- Errors will be ignored.
- The standard version of this package assumes 'UTF-8' encoding. Since reading file contents is done with a single function, it is easy to adjust, for example to also read other encodings. You can do so by overloading it in an extension:

```python
# standard reading function
def readfile(self, source: Path):
    # the context manager closes the file even if reading fails
    with open(source, 'r', encoding='utf-8', errors='ignore') as f:
        contents = list(f)
    return contents
```
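As an illustration of overloading the reading function, the sketch below uses a stand-in base class rather than the real anonymoUUs class, so `Utf8Reader` and `Latin1Reader` are hypothetical names. Only the overriding pattern itself is the point; in practice you would subclass the package's own class.

```python
import tempfile
from pathlib import Path


class Utf8Reader:
    """Stand-in for a class that carries the UTF-8 reading function shown above."""

    def readfile(self, source: Path):
        with open(source, 'r', encoding='utf-8', errors='ignore') as f:
            return list(f)


class Latin1Reader(Utf8Reader):
    """Extension that overrides the reading function to read Latin-1 files."""

    def readfile(self, source: Path):
        with open(source, 'r', encoding='latin-1') as f:
            return list(f)


# Write a Latin-1 encoded file and read it with both readers.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("café\n".encode("latin-1"))
    as_utf8 = Utf8Reader().readfile(sample)      # the 'é' byte is silently dropped
    as_latin1 = Latin1Reader().readfile(sample)  # the 'é' is preserved
```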

2. Mapping

In order to replace words or patterns, you need a replacement mapping. AnonymoUUs allows mappings in the form of a dictionary, a csv file or a function.
- In all cases, the keys will be replaced by the provided values.
- It is also possible to provide string patterns to replace, using regular expressions (regex) in the keys. AnonymoUUs will replace every matching pattern with the provided replacement string.
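The difference between plain keys and regex keys can be sketched with the standard `re` module. This is an illustration only, not the package's internal logic, and `apply_mapping` is a hypothetical helper.

```python
import re


def apply_mapping(text: str, mapping: dict) -> str:
    """Replace each key of `mapping` (plain string or compiled regex) by its value."""
    for key, replacement in mapping.items():
        # Plain-string keys are escaped so they match literally;
        # compiled patterns are used as they are.
        pattern = key if isinstance(key, re.Pattern) else re.compile(re.escape(key))
        # A callable replacement avoids backslash processing in the value.
        text = pattern.sub(lambda m: replacement, text)
    return text


mapping = {'Bob': 'subject-01', re.compile('ca.*?er'): 'subject-02'}
mapped = apply_mapping('Bob mailed casper yesterday', mapping)
print(mapped)
```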

Dictionary mapping

To use a dictionary-type mapping, simply provide the (path to the) dictionary (file) and apply the Anonymize function. Note that you can provide a regular expression using re.compile('regex') to look for string patterns.

```python
import re

from anonymoUUs import Anonymize

# Using a dictionary and regular expression for subject 02:
my_dict = {
    'Bob': 'subject-01',
    re.compile('ca.*?er'): 'subject-02',
}

anonymize_dict = Anonymize(my_dict)
```

CSV file mapping

To use a CSV for mapping, simply provide the path to the file. AnonymoUUs converts the provided csv file into a dictionary.

Requirements:
- The csv file needs to contain column headers (any format).
- The csv file needs to have the keys (which need to be replaced, e.g., names) in the first column and the values (the replacements, e.g., numbers) in the second column.
- The path can be a String, Path or PosixPath.

It is possible to add regular expressions as keywords in the csv file. Make sure they start with the prefix r#:

| key | value |
| --- | --- |
| r#ca.*?er | replacement string |

```python
# Using a csv file
key_csv = test_data/'keys.csv'

anonymize_csv = Anonymize(key_csv)
```
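How a csv file with an `r#` prefix could be turned into a dictionary is sketched below. This is an assumption about the general approach, not the package's actual parser, and `mapping_from_csv` is a hypothetical name.

```python
import csv
import io
import re


def mapping_from_csv(handle) -> dict:
    """Build a key -> replacement dictionary from a two-column csv with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the column headers
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore empty or incomplete rows
        key, value = row[0], row[1]
        if key.startswith('r#'):
            # Keys with the r# prefix are treated as regular expressions.
            key = re.compile(key[2:])
        mapping[key] = value
    return mapping


sample = io.StringIO("name,id\nBob,subject-01\nr#ca.*?er,subject-02\n")
csv_mapping = mapping_from_csv(sample)
regex_keys = [k for k in csv_mapping if isinstance(k, re.Pattern)]
```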

Function mapping

If you are replacing strings with a pattern, you can also use a function to 'calculate' the replacement string. The function takes a found match and should return its replacement. The function must have at least one input argument.

```python
# Define function
def replace(match, **kwargs):
    result = 'default-replacement'
    match = int(match)
    threshold = kwargs.get("threshold", 4000)
    if match < threshold:
        result = 'special-replacement'
    return result

# Substitute using the defined replace function
anon = Anonymize(replace, pattern=r'\d{4}', threshold=1000)
anon.substitute(
    '/Users/casperkaandorp/Desktop/test.json',
    '/Users/casperkaandorp/Desktop/result-folder'
)
```

Note the possibility to provide additional arguments when you initialize an Anonymize object. These will be passed to the replacement function (in this example, the `threshold` is passed to the `replace` function).
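The mechanism behind a function mapping can be tried out without any files by using `re.sub`, which accepts a callable as replacement. The helper `substitute_with_function` below is hypothetical and only mimics how a match and extra keyword arguments might reach the replacement function.

```python
import re


def replace(match, **kwargs):
    result = 'default-replacement'
    threshold = kwargs.get("threshold", 4000)
    if int(match) < threshold:
        result = 'special-replacement'
    return result


def substitute_with_function(text, func, pattern, **kwargs):
    # re.sub calls the lambda for every match; the matched text and the
    # extra keyword arguments are forwarded to the replacement function.
    return re.sub(pattern, lambda m: func(m.group(0), **kwargs), text)


line = 'ids: 0999 and 2500'
function_result = substitute_with_function(line, replace, r'\d{4}', threshold=1000)
print(function_result)
```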

3. Create an Anonymize object

By default, the Anonymize function is case sensitive. Basic use:

```python
from anonymoUUs import Anonymize

anonymize_object = Anonymize(keys)
```

Performance is probably best when your keywords can be generalized into a single regular expression. AnonymoUUs will search these patterns and replace them instead of matching the entire dictionary/csv-file against file contents or file/folder-paths. Example:

```python
anonymize_regex = Anonymize(my_dict, pattern=r'[A-B]\d{4}')
```

Arguments

The regular expressions that take care of the replacements can be modified by using the flags parameter. It takes one or more flags from Python's re module. Multiple flags are combined with a bitwise OR (the | operator). Example for a case-insensitive substitution:

```python
anonymize_regex = Anonymize(my_dict, flags=re.IGNORECASE)
```
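Combining flags works the same way as in plain `re` usage. The following stand-alone sketch (it does not use anonymoUUs) combines two standard flags with the `|` operator:

```python
import re

# IGNORECASE matches 'Bob' and 'BOB'; MULTILINE lets ^ match at every line start.
flags = re.IGNORECASE | re.MULTILINE
pattern = re.compile(r'^bob', flags)

text = "Bob called.\nBOB left."
flagged = pattern.sub('subject-01', text)
print(flagged)
```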

By using the use_word_boundaries argument (defaults to False), the algorithm ignores substring matches. If 'ted' is a key in your dictionary, without use_word_boundaries the algorithm will replace the 'ted' part in, for instance, 'created_at'. You can overcome this problem by setting `use_word_boundaries` to True. It will put the `\b`-anchor around your regex pattern or dictionary keys. The beauty of the boundary anchors is that '@' is considered a boundary as well, and thus names in email addresses can be replaced. Example:

```python
anonymize_regex = Anonymize(my_dict, use_word_boundaries=True)
```
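The effect of the `\b` anchor can be checked with the standard `re` module alone. In this sketch, `build_pattern` is a hypothetical helper that wraps the keys in word boundaries on request; it illustrates the regex behaviour and is not the package's own code.

```python
import re


def build_pattern(keys, use_word_boundaries=False):
    """Compile one alternation for all keys, optionally wrapped in \\b anchors."""
    alternation = "|".join(re.escape(k) for k in keys)
    if use_word_boundaries:
        alternation = rf"\b(?:{alternation})\b"
    return re.compile(alternation)


text = "created_at: ted@example.com"
loose = build_pattern(['ted']).sub('XXX', text)
strict = build_pattern(['ted'], use_word_boundaries=True).sub('XXX', text)
print(loose)
print(strict)
```

Without the anchors the 'ted' inside 'created_at' is replaced as well; with them only the name in the email address is replaced, because '@' is not a word character.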

It is also possible to specify how to re-zip unzipped folders:

```python
# specifying a zip-format to zip unpacked archives after processing (.zip is default)
anonymize_zip = Anonymize('/Users/casper/Desktop/keys.csv', zip_format='gztar')
```
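The format name 'gztar' is one of the archive formats known to Python's `shutil.make_archive`. Assuming the option maps onto those names (an assumption, not something this document confirms), the sketch below shows what such a format produces; folder and file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    source = work / "download"
    source.mkdir()
    (source / "notes.txt").write_text("subject-01 says hi", encoding="utf-8")

    # 'gztar' yields <base>.tar.gz; 'zip' would yield <base>.zip.
    archive = shutil.make_archive(str(work / "download"), "gztar", root_dir=str(source))
    archive_name = Path(archive).name

    # Unpack again to confirm the round trip.
    restored = work / "restored"
    shutil.unpack_archive(archive, str(restored))
    restored_text = (restored / "notes.txt").read_text(encoding="utf-8")
```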

4. Substitute data

The substitute method is the step where the specified keys are replaced by their replacements. It replaces all occurrences of the specified words with the substitutions, in all files in the provided source folder.

Basic use:

```python
anonymize_object.substitute(source_path, target_path)
```

Arguments:
- source_path (required): path to the original file, folder or zip-archive to perform the substitutions on, either a string or a Path object.
- target_path (optional): a string or Path object indicating where the results need to be written. The path will be created if it does not yet exist.

If target_path is provided, anonymoUUs will create a processed copy of the source into the target folder. If the source is a single file, and the file path does not contain elements that will be replaced, and the target folder is identical to the source folder, then the processed result will get a 'copy' extension to prevent overwriting.

When target_path is omitted, the source will be overwritten by a processed version of it.

```python
# process the datadownload.zip file, replace all patterns and write a copy to the 'bucket' folder.
anonymize_regex.substitute(
    '/Users/casper/Desktop/datadownload.zip',
    '/Users/casper/Desktop/bucket'
)

# process the 'download' folder and replace the original by its processed version
anonymize_regex.substitute('/Users/casper/Desktop/download')

# process a single file, and replace it
anonymize_regex.substitute('/Users/casper/Desktop/myfile.json')
```
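The difference between writing a processed copy and overwriting the source can be sketched for a single file as follows. The function `process_file` is hypothetical and far simpler than the real substitute method: it handles one plain-text file and one literal keyword.

```python
import tempfile
from pathlib import Path
from typing import Optional


def process_file(source: Path, old: str, new: str,
                 target_dir: Optional[Path] = None) -> Path:
    """Replace `old` by `new`; write to `target_dir` if given, else overwrite `source`."""
    text = source.read_text(encoding="utf-8", errors="ignore").replace(old, new)
    if target_dir is None:
        destination = source  # no target: the original is overwritten
    else:
        target_dir.mkdir(parents=True, exist_ok=True)  # create the target if missing
        destination = target_dir / source.name
    destination.write_text(text, encoding="utf-8")
    return destination


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "myfile.txt"
    src.write_text("Bob says hi", encoding="utf-8")

    # With a target folder: the source stays untouched.
    copy_path = process_file(src, "Bob", "subject-01", Path(tmp) / "bucket")
    copy_text = copy_path.read_text(encoding="utf-8")
    untouched = src.read_text(encoding="utf-8")

    # Without a target folder: the source is replaced.
    process_file(src, "Bob", "subject-01")
    overwritten = src.read_text(encoding="utf-8")
```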

Validation

The validation procedure determines the performance of anonymization software. It compares the results of the automated anonymization with a manually labeled ground truth. The validation procedure checks whether all occurrences of personally identifiable information, as detected in the manually labeled ground truth, are correctly substituted.
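The core of such a check can be sketched in a few lines. This is a simplified illustration under the assumption that the ground truth is a list of labeled strings; it is not the logic of the validation script in this repository, and `find_missed` is a hypothetical name.

```python
def find_missed(labels, anonymized_text):
    """Return the labeled strings that still occur in the anonymized text."""
    return [label for label in labels if label in anonymized_text]


ground_truth = ['Bob', 'casper']                      # manually labeled identifiers
anonymized = 'subject-01 mailed casper yesterday'     # output of an automated run

missed = find_missed(ground_truth, anonymized)
# Share of labeled identifiers that were removed.
share_removed = 1 - len(missed) / len(ground_truth)
print(missed, share_removed)
```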

Prepare

Clone this repository to run the validation.

Make sure you have these data present:
* anonymized files
* key file
* manually labeled ground truth

Example data can be found in the test_data in this folder.

Create your manually labeled file with dedicated software like Label Studio. In the graphical user interface you can easily add custom labels to the sensitive information in your text files.

Validate

Run from the command line:

```
$ cd tests
$ python validation.py [OPTIONS]

Options:
  --anymdir  path to folder with anonymized data
  --gtfile   path to labeled groundtruth file (json)
  --keyfile  path to key file (csv)
```

Attribution and academic use

The code in this project is licensed under the MIT license. This software is archived at Zenodo. Please cite this software using the metadata in the citation file.

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

You can contribute by:
1. Opening an Issue
2. Suggesting edits to the code
3. Suggesting edits to the documentation
4. If you are unfamiliar with GitHub, feel free to contact us.

To contribute to content directly:

1. Fork the project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Contact

You can contact the Utrecht University Research Engineering team by email.

Project Link: https://github.com/UtrechtUniversity/anonymouus.

Owner

Name: Utrecht University
Login: UtrechtUniversity
Kind: organization
Email: info.rdm@uu.nl
Location: Utrecht, The Netherlands

Website: https://www.uu.nl
Repositories: 85
Profile: https://github.com/UtrechtUniversity

The central place for managing code and software for Utrecht University researchers and employees

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: anonymoUUs
message: Please cite this software as follows
type: software
authors:
  - given-names: Casper
    family-names: Kaandorp
    email: c.s.kaandorp@uu.nl
    affiliation: Utrecht University
    orcid: 'https://orcid.org/0000-0001-6326-6680'
version: 0.0.8
doi: 10.5281/zenodo.5751861 
date-released: 2021-12-03


Committers

Last synced: almost 3 years ago

All Time

Total Commits: 65
Total Committers: 4
Avg Commits per committer: 16.25
Development Distribution Score (DDS): 0.554

Top Committers

Name	Email	Commits
Martine de Vos	m**s@g**m	29
Casper Kaandorp	c**p@u**l	19
maartenschermer	m**r@u**l	15
DorienHuijser	d**r@o**m	2

Committer Domains (Top 20 + Academic)

uu.nl: 2

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 6
Total pull requests: 6
Average time to close issues: about 1 month
Average time to close pull requests: about 5 hours
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 0.83
Average comments per pull request: 0.0
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

MartineDeVos (5)
nehamoopen (1)

Pull Request Authors

MartineDeVos (4)
maartenschermer (2)


Packages

Total packages: 1
Total downloads:
- pypi 45 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 8
Total maintainers: 1

pypi.org: anonymouus

A tool to substitute patterns/names in a file tree

Homepage: https://github.com/UtrechtUniversity/anonymouus
Documentation: https://anonymouus.readthedocs.io/
License: MIT
Latest release: 0.0.8
published almost 5 years ago

Versions: 8
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 45 Last month

Rankings

Dependent packages count: 10.0%

Average: 21.6%

Dependent repos count: 21.7%

Stargazers count: 23.0%

Downloads: 23.3%

Forks count: 29.8%

Maintainers (1)

cskaandorp

Last synced: 6 months ago

anonymouus

Science Score: 54.0%
