philter-ucsf

Open source clinical text de-identification

https://github.com/bchsi/philter-ucsf

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    2 of 9 committers (22.2%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Keywords from Contributors

interactive serializer packaging network-simulation hacking autograding observability embedded optim standardization
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BCHSI
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: master
  • Size: 8.96 MB
Statistics
  • Stars: 136
  • Watchers: 4
  • Forks: 58
  • Open Issues: 12
  • Releases: 0
Created over 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

If you use this software for any publication, please cite: Norgeot, B., Muenzen, K., Peterson, T.A. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digit. Med. 3, 57 (2020). https://doi.org/10.1038/s41746-020-0258-y

Installing Philter

To install Philter from PyPI, run the following command:

```bash
pip3 install philter-ucsf
```

The main philter code will be executed by running:

```bash
python3 -m philter_ucsf [flags, see below]
```

However, we strongly suggest that you download the project source code and run all sample commands below from the project's home directory before running the installed version of Philter.

Installing Requirements

To install the Python requirements, run the following command:

```bash
pip3 install -r requirements.txt
```

Running Philter: A Step-by-Step Guide

Philter is a command-line clinical text de-identification tool that removes protected health information (PHI) from any plain text file. Although the software has built-in evaluation capabilities and can compare Philter PHI-reduced notes with a corresponding set of ground truth annotations, annotations are not required to run Philter. The following steps may be used to 1) run Philter in the command line without ground truth annotations, or 2) generate Philter-compatible annotations and run Philter in evaluation mode using ground truth annotations. Although any set of notes and corresponding annotations may be used with Philter, the examples provided here correspond to the I2B2 dataset, which Philter uses in its default configuration.

Before running Philter either with or without evaluation, make sure to familiarize yourself with the various options that may be used for any given Philter run:

Flags:

-i (input): Path to the directory or the file that contains the clinical note(s). The default is ./data/i2b2_notes/
-a (anno): Path to the directory or the file that contains the PHI annotation(s). The default is ./data/i2b2_anno/
-o (output): Path to the directory to save the PHI-reduced notes in. The default is ./data/i2b2_results/
-f (filters): Path to the config file. The default is ./configs/philter_delta.json
-x (xml): Path to the json file that contains all xml data. The default is ./data/phi_notes.json
-c (coords): Output path to the json file that will contain the coordinate map data. The default is ./data/coordinates.json
-v (verbose): When verbose is true, will emit messages about script progress. The default is True
-e (run_eval): When run_eval is true, will run our eval script and emit summarized results to terminal
-t (freq_table): When freq_table is true, will output a unigram/bigram frequency table of all note words and their PHI/non-PHI counts. The default is False
-n (initials): When initials is true, will include annotated initials PHI in recall/precision calculations. The default is True
--eval_output: Path to the directory that the detailed eval files will be written to. The default is ./data/phi/
--outputformat: Define format of annotation. Allowed values are "asterisk" and "i2b2". The default is "asterisk"
--ucsfformat: When ucsfformat is true, will adjust the eval script for a slightly different xml format. The default is False
--prod: When prod is true, will run the script with output in i2b2 xml format without running the eval script. The default is False
--cachepos: Path to a directory to store/load the pos data for all notes. If no path is specified, memory caching will be used
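When Philter runs are scripted, it can help to assemble the flags above programmatically. The following is a minimal sketch, not part of Philter itself: `build_philter_command` is a hypothetical helper, and it only uses flags documented in the list above.

```python
import shlex


def build_philter_command(input_dir, output_dir, config,
                          anno_dir=None, xml_json=None,
                          prod=False, outputformat=None):
    """Assemble an argument list for main.py from the documented flags.

    Hypothetical helper for illustration only; it is not shipped with Philter.
    """
    cmd = ["python3", "main.py", "-i", input_dir, "-o", output_dir, "-f", config]
    if anno_dir is not None:
        cmd += ["-a", anno_dir]          # ground truth annotations (evaluation mode)
    if xml_json is not None:
        cmd += ["-x", xml_json]          # json summary of the xml data
    if prod:
        cmd.append("--prod=True")        # skip the eval script
    if outputformat is not None:
        cmd += ["--outputformat", outputformat]
    return cmd


cmd = build_philter_command(
    "./data/i2b2_notes/", "./data/i2b2_results/", "./configs/philter_delta.json",
    prod=True, outputformat="asterisk",
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8+
```

The returned list can be passed directly to `subprocess.run` from the project's home directory.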

0. Curating I2B2 XML Files

To remove non-HIPAA PHI annotations from the I2B2 XML files, run the following command:

-i Path to the directory that contains the original I2B2 xml files
-o Path to the directory where the curated files will be written

```bash
python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
```
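To illustrate what this kind of curation involves, here is a rough sketch that drops annotation tags of selected types from an I2B2-style XML note. It is not the logic of `improve_i2b2_notes.py`: the XML layout (a `TEXT` element plus a `TAGS` element whose children carry a `TYPE` attribute), the function name `drop_tags`, and the set of types to drop are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of annotation types to remove; the real script may differ.
TYPES_TO_DROP = {"PROFESSION"}

SAMPLE_XML = """<deIdi2b2>
<TEXT><![CDATA[John Doe works as a nurse.]]></TEXT>
<TAGS>
<NAME id="P0" start="0" end="8" text="John Doe" TYPE="PATIENT"/>
<PROFESSION id="P1" start="20" end="25" text="nurse" TYPE="PROFESSION"/>
</TAGS>
</deIdi2b2>"""


def drop_tags(xml_string, types_to_drop):
    """Return the XML with every annotation whose TYPE is in types_to_drop removed."""
    root = ET.fromstring(xml_string)
    tags = root.find("TAGS")
    for tag in list(tags):  # copy the child list so removal is safe while iterating
        if tag.get("TYPE") in types_to_drop:
            tags.remove(tag)
    # Note: ElementTree serializes the note text escaped rather than as CDATA.
    return ET.tostring(root, encoding="unicode")


cleaned = drop_tags(SAMPLE_XML, TYPES_TO_DROP)
print(cleaned)
```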

1. Running Philter WITHOUT evaluation (no ground-truth annotations required)

a. Make sure the input file(s) are in plain text format. If you are using the I2B2 dataset (or any other dataset in XML or other formats), the note text must be extracted from each original file and be saved in individual text files. Examples of properly formatted input files can be found in ./data/i2b2_notes/.

b. Store all input file(s) in the same directory, and create an output directory (if you want the PHI-reduced notes to be stored somewhere other than the default location).

c. Create a configuration file with specified filters (if you do not want to use the default configuration file).

d. Run Philter in the command line using either default or custom parameters.

Use the following command to run a single job and output files in XML format:

```bash
python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True
```

IMPORTANT NOTE: XML-formatted files do NOT have PHI-reduced text. Instead, they contain the original note text with the PHI tags identified by Philter.

If you'd like to output ONLY the PHI-reduced text with asterisks obscuring Philter-identified PHI, simply add the --outputformat "asterisk" option:

```bash
python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk"
```
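The idea behind the asterisk format can be sketched in a few lines. This is an illustration only, not Philter's implementation: `mask_phi` is a hypothetical function, and it assumes that each non-whitespace character inside a PHI span is replaced by one asterisk so the note keeps its length and layout.

```python
def mask_phi(text, spans):
    """Replace every non-whitespace character inside each (start, end) span with '*'.

    Spans are half-open character offsets; this masking rule is an assumption.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, min(end, len(chars))):
            if not chars[i].isspace():
                chars[i] = "*"
    return "".join(chars)


note = "Seen by Dr. Smith on 01/02/2010."
name_start = note.index("Smith")
date_start = note.index("01/02/2010")
spans = [
    (name_start, name_start + len("Smith")),
    (date_start, date_start + len("01/02/2010")),
]
masked = mask_phi(note, spans)
print(masked)
```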

To run multiple jobs simultaneously, the input notes for each job must be located in a separate directory to avoid cross-contamination between output files. For example, if you wanted to run Philter on 1000 notes simultaneously in two processes, the two input directories might look like:

  1. ./data/batch1/500inputnotes_batch1/
  2. ./data/batch2/500inputnotes_batch2/

In this example, the following two commands would be used to start running each job in the background:

```bash
nohup python3 main.py -i ./data/batch1/500inputnotes_batch1/ -o ./data/i2b2resultstest/ -f ./configs/philter_delta.json --prod=True > ./data/batch1/batch1terminalout.txt 2>&1 &
```

```bash
nohup python3 main.py -i ./data/batch2/500inputnotes_batch2/ -o ./data/i2b2resultstest/ -f ./configs/philter_delta.json --prod=True > ./data/batch2/batch2terminalout.txt 2>&1 &
```
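Before launching parallel jobs, the notes have to be divided among the batch directories. One simple way to do that is a round-robin split, sketched below. `split_into_batches` and the file names are hypothetical; copying each group into its own directory is left out.

```python
def split_into_batches(filenames, n_batches):
    """Distribute file names round-robin so each batch gets a disjoint, near-equal share."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches = [[] for _ in range(n_batches)]
    for idx, name in enumerate(sorted(filenames)):
        batches[idx % n_batches].append(name)
    return batches


# Hypothetical note file names, standing in for the contents of an input directory.
notes = [f"note_{i:04d}.txt" for i in range(1000)]
batches = split_into_batches(notes, 2)
print([len(b) for b in batches])
```

Each resulting group would then be copied into its own input directory and passed to a separate job with -i.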

2. Running Philter WITH evaluation (ground truth annotations required)

a. Create Philter-compatible annotation files using the transformation script located in ./generate_dataset/. This script expects notes in xml format, and transforms each input file into two plain text files: 1) the original note text, and 2) the note text with asterisks obscuring PHI. A properly formatted xml input can be found in ./data/i2b2_xml, and examples of the two outputs can be found in ./data/i2b2_notes and ./data/i2b2_anno, respectively. Additionally, this script creates a .json file that contains the original text from each note, followed by the PHI annotations in json format. An example of this output file can be found at ./data/phi_notes_i2b2.json. This is the file to supply to Philter's -x option.

Flags:

-x Path to the directory that contains the note xml files
-o Path to the json file that will contain a summary of the phi in the xml files
-n Path to the directory where you would like to store the plain text notes
-a Path to the directory where you would like to store the plain text annotations

Use the following command to create these input files from notes in XML format:

```bash
python3 ./generate_dataset/main_ucsf_updated.py -x ./data/i2b2_xml/ -o ./data/phi_notes_i2b2.json -n ./data/i2b2_notes/ -a ./data/i2b2_anno/
```

Note: If this command produces an ElementTree.ParseError, you may need to remove .DS_Store from ./data/i2b2_xml.
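As a rough picture of what such a transformation does, the sketch below reads one I2B2-style XML note and extracts the note text together with a list of PHI annotations. It is not the code of `main_ucsf_updated.py`. The XML layout (a `TEXT` element and a `TAGS` element whose children have `start`, `end`, `text` and `TYPE` attributes), the function name `parse_note`, and the json structure are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE_XML = """<deIdi2b2>
<TEXT><![CDATA[Patient John Doe was seen on 2010-01-02.]]></TEXT>
<TAGS>
<NAME id="P0" start="8" end="16" text="John Doe" TYPE="PATIENT"/>
<DATE id="P1" start="29" end="39" text="2010-01-02" TYPE="DATE"/>
</TAGS>
</deIdi2b2>"""


def parse_note(xml_string):
    """Return (note_text, phi_list) for one note in the assumed XML layout."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT")
    phi = [
        {
            "start": int(tag.get("start")),
            "end": int(tag.get("end")),
            "type": tag.get("TYPE"),
            "text": tag.get("text"),
        }
        for tag in root.find("TAGS")
    ]
    return text, phi


text, phi = parse_note(SAMPLE_XML)
# Hypothetical summary structure; the real .json file may be organized differently.
print(json.dumps({"text": text, "phi": phi}, indent=2))
```

Checking that `text[start:end]` equals each annotation's `text` attribute is a quick way to catch offset errors before running an evaluation.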

b-c. See Step 1b-c above

d. Run Philter in evaluation mode using the following command:

```bash
python3 main.py -i ./data/i2b2_notes/ -a ./data/i2b2_anno/ -o ./data/i2b2_results/ -x ./data/phi_notes_i2b2.json -f=./configs/philter_delta.json --outputformat "asterisk"
```

By default, this will output PHI-reduced notes (.txt format) in the specified output directory. If this command is used with the --outputformat i2b2 flag (or with no --outputformat specified, since i2b2 format is the default option), the evaluation script will not be run and the script will output notes with the original text and the Philter PHI tags (.xml format) in the specified output directory.
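For intuition about the recall and precision figures an evaluation reports, here is a simplified token-level sketch. It is not Philter's eval script: `precision_recall` is a hypothetical function, and it assumes that the ground truth annotation and the Philter output are both asterisk-masked versions of the same note, so that splitting on whitespace yields aligned tokens.

```python
def precision_recall(gold_text, predicted_text):
    """Compare two asterisk-masked versions of a note token by token.

    A token counts as PHI if it contains at least one '*'.
    """
    gold_tokens = gold_text.split()
    pred_tokens = predicted_text.split()
    if len(gold_tokens) != len(pred_tokens):
        raise ValueError("both texts must split into the same number of tokens")
    tp = fp = fn = 0
    for gold, pred in zip(gold_tokens, pred_tokens):
        gold_phi, pred_phi = "*" in gold, "*" in pred
        if gold_phi and pred_phi:
            tp += 1      # PHI correctly masked
        elif pred_phi:
            fp += 1      # non-PHI masked by mistake
        elif gold_phi:
            fn += 1      # PHI that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = "Seen by Dr. ***** on ********** in clinic."
pred = "Seen by Dr. ***** on 01/02/2010 in ******."
precision, recall = precision_recall(gold, pred)
print(precision, recall)
```

In this example the name is caught, the date is missed, and "clinic." is masked unnecessarily, so both scores are 0.5.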

Owner

  • Name: BCHSI
  • Login: BCHSI
  • Kind: organization

GitHub Events

Total
  • Issues event: 1
  • Watch event: 24
  • Member event: 1
  • Issue comment event: 4
  • Fork event: 9
Last Year
  • Issues event: 1
  • Watch event: 24
  • Member event: 1
  • Issue comment event: 4
  • Fork event: 9

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 94
  • Total Committers: 9
  • Avg Commits per committer: 10.444
  • Development Distribution Score (DDS): 0.394
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Kathleen Muenzen 3****n 57
Kathleen k****n@y****m 15
tschaffter t****r@g****m 6
kmuenzen k****n@K****l 6
Gundolf Schenk 3****s 3
Paul M. Heider h****p@m****u 2
dependabot[bot] 4****] 2
RedChrists g****k@u****u 2
beaunorgeot b****t@g****m 1
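The summary figures above can be reproduced from the per-committer counts. The sketch below assumes the commonly used definition of the Development Distribution Score, one minus the top committer's share of all commits; the variable names are illustrative.

```python
# Commit counts from the Top Committers list, one entry per listed identity.
commits = [57, 15, 6, 6, 3, 2, 2, 2, 1]

total = sum(commits)
average = total / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits) / total

print(total, round(average, 3), round(dds, 3))
```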

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 13
  • Total pull requests: 16
  • Average time to close issues: 6 months
  • Average time to close pull requests: 5 months
  • Total issue authors: 9
  • Total pull request authors: 10
  • Average comments per issue: 1.23
  • Average comments per pull request: 0.81
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • kmuenzen (3)
  • Adeva1 (1)
  • DSLituiev (1)
  • ewartj (1)
  • lesterlitch (1)
  • gknor (1)
  • fbraun4358 (1)
  • soulaven (1)
  • nlaimr (1)
Pull Request Authors
  • dependabot[bot] (3)
  • tschaffter (2)
  • TimOrme (2)
  • landiisotta (1)
  • callandramoore (1)
  • tmills (1)
  • katie-ta (1)
  • markskrass (1)
  • paulheider (1)
  • Trott (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
dependencies (3)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 662 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 2
    (may contain duplicates)
  • Total versions: 3
  • Total maintainers: 1
pypi.org: philter-ucsf

An open-source PHI-filtering software

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 659 Last month
Rankings
Forks count: 6.4%
Stargazers count: 7.9%
Dependent packages count: 10.0%
Downloads: 11.2%
Average: 11.4%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 8 months ago
pypi.org: philter-ucsf-beta

An open-source PHI-filtering software

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 3 Last month
Rankings
Forks count: 6.3%
Stargazers count: 7.7%
Dependent packages count: 10.0%
Dependent repos count: 21.7%
Average: 21.8%
Downloads: 63.4%
Maintainers (1)
Last synced: 8 months ago

Dependencies

requirements.txt pypi
  • chardet ==3.0.4
  • nltk ==3.5
  • numpy ==1.19.0
  • pandas ==1.0.5
  • xmltodict ==0.12.0