https://github.com/ben-aaron188/textwash
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ben-aaron188
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 3.12 MB
Statistics
- Stars: 25
- Watchers: 2
- Forks: 5
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
Textwash
UPDATE: Textwash is now available for Dutch! See below for details of how you can run the Dutch anonymization model.
Textwash is an automated text anonymisation tool written in Python. The tool can be used to anonymise unstructured text data. To achieve this, Textwash identifies and extracts personally-identifiable information (e.g., names, dates) from text and replaces the identified entities with a generic identifier (e.g., Jane Doe is replaced with PERSONFIRSTNAME1 PERSONLASTNAME1).
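To make the replacement step concrete, here is a minimal sketch of numbered placeholder substitution. It is not Textwash's implementation: the function name `replace_entities` and the list of `(phrase, category)` pairs are hypothetical stand-ins for whatever the model detects, and the placeholder format simply mirrors the example above.

```python
import re
from collections import defaultdict


def replace_entities(text, entities):
    """Replace each detected phrase with a numbered generic identifier.

    `entities` is a hypothetical list of (phrase, category) pairs; in the
    real tool these would come from the trained model.
    """
    counters = defaultdict(int)  # running index per category
    placeholders = {}            # phrase -> placeholder (first category wins)
    for phrase, category in entities:
        if phrase not in placeholders:
            counters[category] += 1
            placeholders[phrase] = f"{category}{counters[category]}"
    if not placeholders:
        return text
    # Match longer phrases first and substitute in a single pass, so that
    # already inserted placeholders are never rewritten.
    ordered = sorted(placeholders, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    return pattern.sub(lambda match: placeholders[match.group(0)], text)


detected = [
    ("Jane", "PERSONFIRSTNAME"),
    ("Doe", "PERSONLASTNAME"),
    ("John", "PERSONFIRSTNAME"),
]
print(replace_entities("Jane Doe met John. Jane left.", detected))
# PERSONFIRSTNAME1 PERSONLASTNAME1 met PERSONFIRSTNAME2. PERSONFIRSTNAME1 left.
```

Repeated mentions of the same phrase map to the same identifier, which keeps the anonymised text readable.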
Why is this software special?
Textwash was designed to be a tool that meets the highest standards that we have for text anonymisation. The following principles guided our development decisions:
- Complete and transparent evaluation: you can find a full empirical evaluation of this tool in the paper linked below. We put the tool to various tests and show what it can(not) do - this includes a motivated intruder test where humans try to re-identify persons from Textwash-anonymised documents.
- Data never leave your system: at no point does the Textwash tool require you to upload (text) data or use an API. The tool is entirely functional offline (you can try it by switching off your Internet connection). This feature is essential to avoid any data leakage or possible risks for your data.
- Open source: the code base is open source and can be inspected, used and modified in line with the GNU General Public License 3 (GPL-3.0). We do this because we think it is essential that you know what this tool does.
- Learning-based anonymisation: since the information that can reveal personal data is complex, we are not using a dictionary-based approach (e.g., looking up keywords in a static database). Instead, the core of Textwash is a machine learning model that assigns category probabilities to phrases and anonymises them accordingly.
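As a toy illustration of the last principle, the sketch below turns per-category scores for one phrase into probabilities and picks the most likely category. This README does not describe the model's internals, so everything here is an assumption: the function names, the `NONE` label for "leave as is", and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert a dict of raw category scores into probabilities."""
    highest = max(scores.values())
    exps = {label: math.exp(value - highest) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def decide(scores, keep_label="NONE"):
    """Return (category, probability); category is None if the phrase is kept."""
    probabilities = softmax(scores)
    best = max(probabilities, key=probabilities.get)
    category = None if best == keep_label else best
    return category, probabilities[best]


# Hypothetical scores for the phrase "Amsterdam".
category, probability = decide({"NONE": 0.1, "LOCATION": 3.0, "DATE": 0.2})
print(category, round(probability, 2))
```

Unlike a dictionary lookup, a decision of this kind depends on the scores the model assigns in context, not on whether the phrase appears in a fixed list.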
Note for researchers/organisations/other users
We would be glad if Textwash is helpful to you. But even if you prefer to use another tool, we strongly encourage you to ask its developers to provide, as a bare minimum, (i) an evaluation of their tool that shows empirically what it can and cannot do (you can point them to our evaluation approach and ask them to show how their tool performs on our evaluation dataset), and (ii) the reasons why they require you to send your data to online services or an API (you should never do this, and good software does not require it).
If they refuse to provide this, you should be skeptical.
Note for commercial anonymisation tools
We have looked hard to find a tool that is as transparent, open and data-averse (as in: not unnecessarily collecting data) as ours. We did not find any.
If you have a tool that meets these requirements, we would be glad to promote it and list it here. If you think your tool is better, we would love to see your evaluation results - you can use all the data we used and we'd be happy to assist with setting up the human intruder evaluation.
Quick start guide
Textwash is built in Python 3. To run the software, we recommend first creating an Anaconda environment and installing the required dependencies into it. For details on how to get and install Anaconda, see the Anaconda documentation.
$ conda create -n textwash python=3.7
$ conda activate textwash
$ pip install -r requirements.txt
Additionally, you need to download the trained model folders from here. Once you have downloaded the tgz file, unpack it and place its contents in the data directory. Important: the model folders (en and nl) should sit directly in ./data, not inside a models parent directory, so that the relative paths to the models are ./data/en and ./data/nl. Otherwise, you will encounter the Repo id must be in the form 'repo_name' ... error.
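Before running the tool, a quick check of the folder layout can catch this mistake early. The helper below is a small sketch and not part of Textwash; the function name `missing_model_dirs` is hypothetical, and it assumes you run it from the repository root.

```python
from pathlib import Path


def missing_model_dirs(data_dir="data", languages=("en", "nl")):
    """Return the model folders that are not directly inside data_dir."""
    root = Path(data_dir)
    return [str(root / lang) for lang in languages if not (root / lang).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("Model folders found directly in ./data")
```

If a folder is reported as missing, check whether the archive was unpacked into an extra parent directory and move en and nl up one level.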
Using Textwash
Textwash can be used to anonymise txt files. To do this, run anon.py and provide the language via --language ('en' for English, 'nl' for Dutch), the path to the input files via --input_dir, and the path to the output folder via --output_dir. For example, running
$ python3 anon.py --language en --input_dir examples --output_dir anonymised_examples --cpu
anonymises the three example texts in the examples directory. In doing so, Textwash loads the downloaded model into memory, then automatically anonymises the inputs and writes the anonymised files to the provided output folder anonymised_examples.
Textwash works best when running on a GPU. If no GPU is available, you should use the --cpu flag as in the snippet above. If you have a GPU, remove the --cpu flag and Textwash will use PyTorch with CUDA support.
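If you call Textwash from a script, the choice between GPU and CPU can be automated. The wrapper below is a sketch under stated assumptions: the helper names are hypothetical, it must be run from the directory that contains anon.py, and it uses only the flags shown above. `torch.cuda.is_available()` is the standard PyTorch check for a usable CUDA device.

```python
import subprocess
import sys


def gpu_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def build_anon_command(language, input_dir, output_dir, use_cpu):
    """Assemble the anon.py command line from the flags described above."""
    command = [
        sys.executable, "anon.py",
        "--language", language,
        "--input_dir", input_dir,
        "--output_dir", output_dir,
    ]
    if use_cpu:
        command.append("--cpu")
    return command


def run_textwash(language="en", input_dir="examples",
                 output_dir="anonymised_examples"):
    """Run anon.py, adding --cpu only when no GPU is detected."""
    command = build_anon_command(language, input_dir, output_dir,
                                 use_cpu=not gpu_available())
    subprocess.run(command, check=True)
```

Calling `run_textwash()` with its defaults corresponds to the example command above on a machine without a GPU.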
Entity selection
Textwash can furthermore be restricted to only consider a subset of all available entity types for anonymisation.
The complete list of available entity types is as follows:
- ADDRESS
- DATE
- EMAIL_ADDRESS
- LOCATION
- NUMERIC
- OCCUPATION
- ORGANIZATION
- OTHER_IDENTIFYING_ATTRIBUTE
- PERSON_FIRSTNAME
- PERSON_LASTNAME
- PHONE_NUMBER
- PRONOUN
- TIME
Using the --entities flag, individual entity types can be selected for anonymisation. These entity types need to be separated by commas.
For example, if you would only like to anonymise the LOCATION and PERSON_FIRSTNAME entity types, run
$ python3 anon.py --input_dir examples --output_dir anonymised_examples --cpu --entities LOCATION,PERSON_FIRSTNAME
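When the entity selection comes from user input, it can help to validate it before building the --entities value. The sketch below is not part of Textwash, and the helper name `entities_argument` is hypothetical. The entity names are written with underscores, following the PERSON_FIRSTNAME spelling in the command above; the underscore spelling of the other multi-word types is an assumption.

```python
# Entity type names; the underscore spelling is assumed from the
# PERSON_FIRSTNAME example in the command above.
AVAILABLE_ENTITIES = {
    "ADDRESS", "DATE", "EMAIL_ADDRESS", "LOCATION", "NUMERIC", "OCCUPATION",
    "ORGANIZATION", "OTHER_IDENTIFYING_ATTRIBUTE", "PERSON_FIRSTNAME",
    "PERSON_LASTNAME", "PHONE_NUMBER", "PRONOUN", "TIME",
}


def entities_argument(selected):
    """Validate entity names and join them into a comma-separated string."""
    cleaned = [name.strip().upper() for name in selected]
    unknown = [name for name in cleaned if name not in AVAILABLE_ENTITIES]
    if unknown:
        raise ValueError("Unknown entity types: " + ", ".join(unknown))
    # dict.fromkeys removes duplicates while preserving the given order.
    return ",".join(dict.fromkeys(cleaned))


print(entities_argument(["location", "PERSON_FIRSTNAME"]))
# LOCATION,PERSON_FIRSTNAME
```

The returned string can be passed directly as the value of the --entities flag.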
Examples
The examples directory contains detailed person descriptions. The corresponding versions anonymised by Textwash are in the examples_anonymised directory.
Who can use Textwash?
Textwash is developed with non-profit open science principles. If you are a researcher, a research organization, working in the public sector or a non-profit organization, you are free to use this software. Please make sure you cite our work as follows:
(will be added soon)
If you intend to use this software commercially without our consent, please be advised that this software is released under the GNU General Public License 3 (GPL-3.0).
You may copy, distribute and modify the software as long as you track changes and their dates in the source files and keep modifications under the GPL. You can distribute an application that uses a GPL library commercially, but you must also provide the source code.
Who developed Textwash?
Textwash is a multi-year project that is led by Maximilian Mozes (University College London) and Bennett Kleinberg (Tilburg University and University College London).
The work is supported by a SAGE Proof of Concept Grant and an Open Science grant from the Dutch Research Council (NWO).
Questions and Comments
Please open a GitHub Issue if you have any questions or remarks.
Owner
- Name: BKleinberg
- Login: ben-aaron188
- Kind: user
- Website: https://bkleinberg.net/
- Repositories: 18
- Profile: https://github.com/ben-aaron188
GitHub Events
Total
- Issues event: 4
- Watch event: 9
- Issue comment event: 2
- Push event: 2
- Pull request review event: 1
- Pull request event: 4
- Fork event: 4
Last Year
- Issues event: 4
- Watch event: 9
- Issue comment event: 2
- Push event: 2
- Pull request review event: 1
- Pull request event: 4
- Fork event: 4
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 4
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 17 days
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.33
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 17 days
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.33
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- StefKirsch (3)
- ShekharNarayanan (1)
Pull Request Authors
- StefKirsch (1)
- juhanurmi (1)
- ShekharNarayanan (1)
- maximilianmozes (1)
Dependencies
- numpy ==1.20.3
- torch ==1.9.0
- transformers ==2.6.0