anonymouus

Python package to replace identifiable strings in multiple files and folders at once.

https://github.com/utrechtuniversity/anonymouus

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Keywords

de-identification python text

Keywords from Contributors

internet-archive web-scraping
Last synced: 6 months ago

Repository

Python package to replace identifiable strings in multiple files and folders at once.

Basic Info
  • Host: GitHub
  • Owner: UtrechtUniversity
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 145 KB
Statistics
  • Stars: 5
  • Watchers: 7
  • Forks: 0
  • Open Issues: 4
  • Releases: 1
Topics
de-identification python text
Created over 5 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

anonymoUUs

anonymoUUs is a Python package for replacing identifiable strings in multiple files and folders at once. It can be used to pseudonymise data files and therefore contributes to protecting personal data.

The goal of anonymoUUs is to substitute multiple identifying strings with pseudo-IDs to avoid traceable relationships between data batches. A single data batch typically consists of multiple nested folders that contain multiple files in multiple formats. AnonymoUUs runs through the entire file tree, looking for keywords and replacing them with the provided substitutes, including in:

- the file contents
- file names
- folder names
- zipped folders
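
The idea of walking a file tree and replacing keywords in contents, file names and folder names can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the function name `pseudonymise_tree` and the demo folder layout are hypothetical, every file is assumed to be UTF-8 text, and zipped folders are left out.

```python
import re
import tempfile
from pathlib import Path


def pseudonymise_tree(root: Path, mapping: dict[str, str]) -> None:
    """Replace every key of `mapping` in file contents, file names and folder names."""
    # Longest keys first, so a longer key wins over a shorter key it contains.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def swap(text: str) -> str:
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    # Handle the deepest paths first so renaming a folder never invalidates a child path.
    for path in sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            path.write_text(swap(text), encoding="utf-8")
        new_name = swap(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))


# Demo on a throwaway folder (hypothetical file and folder names)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Bob_files").mkdir()
    (root / "Bob_files" / "notes_Bob.txt").write_text("Bob wrote this.", encoding="utf-8")
    pseudonymise_tree(root, {"Bob": "subject-01"})
    result_file = root / "subject-01_files" / "notes_subject-01.txt"
    result_text = result_file.read_text(encoding="utf-8")
```

Processing the deepest paths first is a design choice of this sketch: a file is rewritten and renamed before its parent folder is renamed, so no path goes stale during the walk.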

Note: Whereas replacing personal details with non-personal details can make data less identifiable, it does not guarantee anonymous data!

Supported file formats

AnonymoUUs can work with multiple text-based file types, like .txt, .html, .json and .csv. UTF-8 encoding is assumed. Users have several options to provide keyword-replacement mappings and to customize the behaviour of the software; see the Usage section for more information.

Getting Started

Prerequisites

To install and run anonymoUUs, you need:

- an active Python installation;
- a folder containing the data to be pseudonymised;
- a keyword-replacement mapping file.

Installation

To install anonymoUUs, in your terminal run:

```sh
$ pip install anonymoUUs
```

Example workflow

To get started with a simple example, you can go through this Jupyter notebook, which runs through a minimal example of anonymoUUs.

Prerequisites:

- download the test data from the test_data folder
- make sure you have Jupyter Notebook installed

Usage

To run the software, you need to take the following steps:

1. Provide the path to the data to be substituted
2. Provide the keyword-replacement mapping
3. Create and customize the Anonymize object
4. Perform the substitutions

1. Input Data

Provide the path to the folder where the data resides, for example:

```python
from pathlib import Path

test_data = Path('../test_data/')
```

Details:

- Files are opened depending on their extension. Files with an unrecognised extension are skipped.
- Errors will be ignored.
- The standard version of this package assumes 'UTF-8' encoding. Because file contents are read by a single function, it is easy to adjust, for example to read other encodings as well. You can do so by overloading it in an extension:

```python
# standard reading function
def readfile(self, source: Path):
    # read all lines as UTF-8, silently dropping bytes that cannot be decoded
    with open(source, 'r', encoding='utf-8', errors='ignore') as f:
        contents = list(f)
    return contents
```
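
As an illustration of what an adjusted reading function could look like, the sketch below tries several encodings before falling back to the lenient UTF-8 behaviour. It is a standalone function with a hypothetical name (`read_lines`) and a hypothetical list of candidate encodings; how it would be wired into an extension of the package is not shown here and depends on the package's internals.

```python
import tempfile
from pathlib import Path


def read_lines(source: Path, encodings=("utf-8", "latin-1")) -> list[str]:
    """Return the lines of `source`, trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            with open(source, "r", encoding=encoding) as f:
                return list(f)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: mirror the standard behaviour and drop undecodable bytes.
    with open(source, "r", encoding="utf-8", errors="ignore") as f:
        return list(f)


# Demo: a Latin-1 encoded file that is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "latin1.txt"
    sample.write_bytes("café\n".encode("latin-1"))
    lines = read_lines(sample)
```

Note that `latin-1` can decode any byte sequence, so placing it last among the candidates means the final fallback is only reached when it is left out of the list.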

2. Mapping

In order to replace words or patterns, you need a replacement mapping. AnonymoUUs allows mappings in the form of a dictionary, a csv file or a function.

- In all cases, the keys will be replaced by the provided values.
- It is also possible to provide string patterns to replace, using regular expressions (regex) in the keys. AnonymoUUs will replace every matching pattern with the provided replacement string.

Dictionary mapping

To use a dictionary-type mapping, provide the dictionary (or the path to the dictionary file) when creating the Anonymize object. Note that you can provide a regular expression using re.compile('regex') to look for string patterns.

```python
import re

from anonymoUUs import Anonymize

# Using a dictionary and a regular expression for subject 02:
my_dict = {
    'Bob': 'subject-01',
    re.compile('ca.*?er'): 'subject-02',
}

anonymize_dict = Anonymize(my_dict)
```
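
To make the effect of such a mapping concrete, here is a small sketch that applies a dictionary with plain-string and compiled-regex keys to a single string, using only the `re` module. The helper `apply_mapping` is hypothetical and is not part of anonymoUUs; the package may process keys in a different way or order.

```python
import re


def apply_mapping(text: str, mapping: dict) -> str:
    """Replace every key in `text`; keys may be plain strings or compiled patterns."""
    for key, replacement in mapping.items():
        if isinstance(key, re.Pattern):
            text = key.sub(replacement, text)       # regex key
        else:
            text = text.replace(key, replacement)   # literal key
    return text


my_dict = {
    'Bob': 'subject-01',
    re.compile('ca.*?er'): 'subject-02',
}

print(apply_mapping("Bob emailed casper yesterday", my_dict))
# prints: subject-01 emailed subject-02 yesterday
```

The non-greedy `.*?` keeps the regex match as short as possible, so only `casper` is replaced and the later `er` in `yesterday` is left alone.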

CSV file mapping

To use a CSV for mapping, simply provide the path to the file. AnonymoUUs converts the provided csv file into a dictionary.

Requirements:

- The csv file needs to contain column headers (any format).
- The csv file needs to have the keys (which need to be replaced, e.g., names) in the first column and the values (the replacements, e.g., numbers) in the second column.
- The path can be a String, Path or PosixPath.

It is possible to add a regular expression as a keyword in the csv file. Make sure such keywords start with the prefix r#:

| key | value |
| --- | --- |
| r#ca.*?er | replacement string |

```python
# Using a csv file
key_csv = test_data/'keys.csv'

anonymize_csv = Anonymize(key_csv)
```
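
The conversion from a csv file to a dictionary described above can be sketched as follows. This is an assumption-laden illustration, not the package's code: the function name `mapping_from_csv` is hypothetical, a comma is assumed as delimiter, and the `r#` prefix is handled by compiling the remainder of the key as a regular expression.

```python
import csv
import io
import re


def mapping_from_csv(handle) -> dict:
    """Build a replacement mapping from the first two columns of a csv file."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore empty or incomplete rows
        key, value = row[0], row[1]
        if key.startswith("r#"):
            key = re.compile(key[2:])  # treat the rest as a regular expression
        mapping[key] = value
    return mapping


# Demo with an in-memory csv instead of a file on disk
csv_text = "key,value\nBob,subject-01\nr#ca.*?er,subject-02\n"
mapping = mapping_from_csv(io.StringIO(csv_text))
```

With a real file, the same function could be called as `mapping_from_csv(open(path, newline='', encoding='utf-8'))`, ideally inside a `with` block.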

Function mapping

If you are replacing strings with a pattern, you can also use a function to 'calculate' the replacement string. The function takes a found match and should return its replacement. The function must have at least one input argument.

```python
# Define function
def replace(match, **kwargs):
    result = 'default-replacement'
    match = int(match)
    threshold = kwargs.get("threshold", 4000)
    if match < threshold:
        result = 'special-replacement'
    return result

# Substitute using the defined replace function
anon = Anonymize(replace, pattern=r'\d{4}', threshold=1000)
anon.substitute(
    '/Users/casperkaandorp/Desktop/test.json',
    '/Users/casperkaandorp/Desktop/result-folder'
)
```

Note that additional arguments provided when you initialize an Anonymize object are passed on to the replacement function (in this example, `threshold` is passed to the `replace` function).
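
The mechanism of a function mapping can be tried out on a plain string with `re.sub`, which accepts a callable as replacement. The wrapper `substitute_with_function` below is a hypothetical helper for illustration; it only mimics the behaviour described above (the matched text and the extra keyword arguments are handed to the function) and says nothing about how anonymoUUs does this internally.

```python
import re


def replace(match, **kwargs):
    result = 'default-replacement'
    threshold = kwargs.get("threshold", 4000)
    if int(match) < threshold:
        result = 'special-replacement'
    return result


def substitute_with_function(text, pattern, func, **kwargs):
    """Call `func` on the text of every match of `pattern` and insert its return value."""
    return re.sub(pattern, lambda m: func(m.group(0), **kwargs), text)


text = "ids: 0999 and 2500"
result = substitute_with_function(text, r'\d{4}', replace, threshold=1000)
print(result)
# prints: ids: special-replacement and default-replacement
```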

3. Create an Anonymize object

By default, Anonymize is case sensitive. Basic use:

```python
from anonymoUUs import Anonymize

anonymize_object = Anonymize(keys)
```

Performance is probably best when your keywords can be generalized into a single regular expression. AnonymoUUs will search these patterns and replace them instead of matching the entire dictionary/csv-file against file contents or file/folder-paths. Example:

```python
anonymize_regex = Anonymize(my_dict, pattern=r'[A-B]\d{4}')
```

Arguments

The regular expressions that take care of the replacements can be modified with the flags parameter. It takes one or more of the flags defined in Python's re module. Multiple flags are combined with a bitwise OR (the | operator). Example for a case-insensitive substitution:

```python
anonymize_regex = Anonymize(my_dict, flags=re.IGNORECASE)
```
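
How combined flags behave can be checked directly with the `re` module. The snippet below is a standalone sketch with made-up sample text; it combines `re.IGNORECASE` and `re.MULTILINE` to replace a name only at the start of a line, regardless of case.

```python
import re

# Combine flags with the bitwise OR operator.
flags = re.IGNORECASE | re.MULTILINE

pattern = re.compile(r'^bob', flags)
text = "Bob said hi\nBOB left\nnot bob"
result = pattern.sub('subject-01', text)
print(result)
# prints:
# subject-01 said hi
# subject-01 left
# not bob
```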

With the use_word_boundaries argument (defaults to False), the algorithm ignores substring matches. If 'ted' is a key in your dictionary, then without use_word_boundaries the algorithm will replace the 'ted' part in, for instance, 'created_at'. You can overcome this problem by setting `use_word_boundaries` to True. It will put the `\b`-anchor around your regex pattern or dictionary keys. The beauty of the boundary anchors is that '@' is considered a boundary as well, and thus names in email addresses can be replaced. Example:

```python
anonymize_regex = Anonymize(my_dict, use_word_boundaries=True)
```
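
The difference that word boundaries make can be seen with plain `re.sub`. The sample text and the replacement 'subject-03' are invented for this sketch, which only demonstrates the `\b` anchor itself, not the package's internal pattern construction.

```python
import re

key = 'ted'
text = "created_at: 2021, contact: ted@example.com"

# Without boundaries, the 'ted' inside 'created_at' is replaced as well.
plain = re.sub(re.escape(key), 'subject-03', text)

# With \b on both sides, only the standalone 'ted' before '@' is replaced.
bounded = re.sub(r'\b' + re.escape(key) + r'\b', 'subject-03', text)

print(plain)    # creasubject-03_at: 2021, contact: subject-03@example.com
print(bounded)  # created_at: 2021, contact: subject-03@example.com
```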

It is also possible to specify how to re-zip unzipped folders:

```python
# specifying a zip-format to zip unpacked archives after processing (.zip is default)
anonymize_zip = Anonymize('/Users/casper/Desktop/keys.csv', zip_format='gztar')
```
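
The value 'gztar' is also one of the archive format names used by Python's `shutil` module. Whether anonymoUUs hands the format name to `shutil` is an assumption that the text above does not confirm, but the sketch below shows what re-packing a processed folder with that format name looks like in the standard library. Folder and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Archive format names known to shutil on this system, e.g. 'zip', 'tar', 'gztar'.
available = [name for name, _description in shutil.get_archive_formats()]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "processed"
    folder.mkdir()
    (folder / "data.txt").write_text("subject-01", encoding="utf-8")

    # base_name is given without extension; make_archive appends '.tar.gz' for 'gztar'.
    archive = shutil.make_archive(
        str(Path(tmp) / "processed_archive"), "gztar", root_dir=str(folder)
    )
    archive_name = Path(archive).name
```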

4. Substitute data

The substitute method is the step where the specified keys are replaced by their replacements. It replaces all occurrences of the specified words with the substitutions, in all files in the provided source folder.

Basic use:

```python
anonymize_object.substitute(source_path, target_path)
```

Arguments:

- source_path (required): path to the original file, folder or zip-archive to perform the substitutions on, either a string or a Path object.
- target_path (optional): a string or Path object indicating where the results need to be written. The path will be created if it does not yet exist.

If target_path is provided, anonymoUUs will create a processed copy of the source into the target folder. If the source is a single file, and the file path does not contain elements that will be replaced, and the target folder is identical to the source folder, then the processed result will get a 'copy' extension to prevent overwriting.

When target_path is omitted, the source will be overwritten by a processed version of it.

```python
# process the datadownload.zip file, replace all patterns and write a copy to the 'bucket' folder.
anonymize_regex.substitute(
    '/Users/casper/Desktop/datadownload.zip',
    '/Users/casper/Desktop/bucket'
)

# process the 'download' folder and replace the original by its processed version
anonymize_regex.substitute('/Users/casper/Desktop/download')

# process a single file, and replace it
anonymize_regex.substitute('/Users/casper/Desktop/myfile.json')
```

Validation

The validation procedure measures the performance of the anonymization software. It compares the results of the automated anonymization with a manually labeled ground truth, and checks whether all occurrences of personally identifiable information, as detected in the manually labeled ground truth, are correctly substituted.
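
The core check of such a validation can be sketched in a few lines: given the strings labeled as sensitive in the ground truth, look for any that still occur in the anonymized text. This is a simplified illustration with hypothetical function names and invented sample data; it is not the repository's validation script and does not read Label Studio output.

```python
def find_leaks(anonymized_text: str, labeled_strings: list[str]) -> list[str]:
    """Return the labeled strings that still occur in the anonymized text."""
    return [s for s in labeled_strings if s in anonymized_text]


def substituted_fraction(anonymized_text: str, labeled_strings: list[str]) -> float:
    """Fraction of labeled strings that no longer occur in the text."""
    if not labeled_strings:
        return 1.0
    leaks = find_leaks(anonymized_text, labeled_strings)
    return 1 - len(leaks) / len(labeled_strings)


ground_truth = ["Bob", "casper", "bob@example.com"]
anonymized = "subject-01 emailed subject-02 via bob@example.com"

leaks = find_leaks(anonymized, ground_truth)
score = substituted_fraction(anonymized, ground_truth)
print(leaks)  # ['bob@example.com']
```

In this invented sample, the email address slipped through because the key 'Bob' was matched case-sensitively, which is the kind of miss a validation run is meant to surface.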

Prepare

Clone this repository to run the validation.

Make sure you have these data present:

* anonymized files
* key file
* manually labeled ground truth

Example data can be found in the test_data folder.

Create your manually labeled file with dedicated software like Label Studio. In the graphical user interface you can easily add custom labels to the sensitive information in your text files.

Validate

Run from the command line:

```
$ cd tests
$ python validation.py [OPTIONS]

Options:
  --anymdir  path to folder with anonymized data
  --gtfile   path to labeled groundtruth file (json)
  --keyfile  path to key file (csv)
```

Attribution and academic use

The code in this project is licensed under the MIT license. This software is archived at Zenodo (DOI: 10.5281/zenodo.5751861). Please cite this software using the metadata in the citation file.

Contributing

Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

You can contribute by:

1. Opening an Issue
2. Suggesting edits to the code
3. Suggesting edits to the documentation
4. If you are unfamiliar with GitHub, feel free to contact us.

To contribute to content directly:

  1. Fork the project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

You can contact the Utrecht University Research Engineering team by email.

Project Link: https://github.com/UtrechtUniversity/anonymouus.

Owner

  • Name: Utrecht University
  • Login: UtrechtUniversity
  • Kind: organization
  • Email: info.rdm@uu.nl
  • Location: Utrecht, The Netherlands

The central place for managing code and software for Utrecht University researchers and employees

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: anonymoUUs
message: Please cite this software as follows
type: software
authors:
  - given-names: Casper
    family-names: Kaandorp
    email: c.s.kaandorp@uu.nl
    affiliation: Utrecht University
    orcid: 'https://orcid.org/0000-0001-6326-6680'
version: 0.0.8
doi: 10.5281/zenodo.5751861 
date-released: 2021-12-03

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 65
  • Total Committers: 4
  • Avg Commits per committer: 16.25
  • Development Distribution Score (DDS): 0.554
Top Committers
Name Email Commits
Martine de Vos m****s@g****m 29
Casper Kaandorp c****p@u****l 19
maartenschermer m****r@u****l 15
DorienHuijser d****r@o****m 2
Committer Domains (Top 20 + Academic)
uu.nl: 2

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 6
  • Total pull requests: 6
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 5 hours
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.83
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • MartineDeVos (5)
  • nehamoopen (1)
Pull Request Authors
  • MartineDeVos (4)
  • maartenschermer (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 45 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
pypi.org: anonymouus

A tool to substitute patterns/names in a file tree

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 45 Last month
Rankings
Dependent packages count: 10.0%
Average: 21.6%
Dependent repos count: 21.7%
Stargazers count: 23.0%
Downloads: 23.3%
Forks count: 29.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • charset_normalizer *
  • odfpy *
  • openpyxl *
  • pandas *
  • xlrd *
  • xlsxwriter *