Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 9 committers (22.2%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.6%) to scientific vocabulary
Repository
Open source clinical text de-identification
Basic Info
- Host: GitHub
- Owner: BCHSI
- License: bsd-3-clause
- Language: Python
- Default Branch: master
- Size: 8.96 MB
Statistics
- Stars: 136
- Watchers: 4
- Forks: 58
- Open Issues: 12
- Releases: 0
Metadata Files
README.md
If you use this software for any publication, please cite: Norgeot, B., Muenzen, K., Peterson, T.A. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digit. Med. 3, 57 (2020). https://doi.org/10.1038/s41746-020-0258-y
Installing Philter
To install Philter from PyPI, run the following command:
```bash
pip3 install philter-ucsf
```
The main Philter code is executed by running:
```bash
python3 -m philter_ucsf [flags, see below]
```
However, we strongly suggest that you download the project source code and run all sample commands below from the home directory before running the installed version of Philter.
Installing Requirements
To install the Python requirements, run the following command:
```bash
pip3 install -r requirements.txt
```
Running Philter: A Step-by-Step Guide
Philter is a command-line based clinical text de-identification software that removes protected health information (PHI) from any plain text file. Although the software has built-in evaluation capabilities and can compare Philter PHI-reduced notes with a corresponding set of ground truth annotations, annotations are not required to run Philter. The following steps may be used to 1) run Philter in the command line without ground truth annotations, or 2) generate Philter-compatible annotations and run Philter in evaluation mode using ground truth annotations. Although any set of notes and corresponding annotations may be used with Philter, the examples provided here will correspond to the I2B2 dataset, which Philter uses in its default configuration.
Before running Philter either with or without evaluation, make sure to familiarize yourself with the various options that may be used for any given Philter run:
Flags:
-i (input): Path to the directory or the file that contains the clinical note(s), the default is ./data/i2b2_notes/
-a (anno): Path to the directory or the file that contains the PHI annotation(s), the default is ./data/i2b2_anno/
-o (output): Path to the directory to save the PHI-reduced notes in, the default is ./data/i2b2_results/
-f (filters): Path to the config file, the default is ./configs/philter_delta.json
-x (xml): Path to the json file that contains all xml data, the default is ./data/phi_notes.json
-c (coords): Output path to the json file that will contain the coordinate map data, the default is ./data/coordinates.json
-v (verbose): When verbose is true, will emit messages about script progress. The default is True
-e (runeval): When runeval is true, will run our eval script and emit summarized results to terminal
-t (freqtable): When freqtable is true, will output a unigram/bigram frequency table of all note words and their PHI/non-PHI counts. The default is False
-n (initials): When initials is true, will include annotated initials PHI in recall/precision calculations. The default is True
--eval_output: Path to the directory that the detailed eval files will be written to, the default is ./data/phi/
--outputformat: Define format of annotation, allowed values are "asterisk", "i2b2". The default is "asterisk"
--ucsfformat: When ucsfformat is true, will adjust eval script for slightly different xml format. The default is False
--prod: When prod is true, this will run the script with output in i2b2 xml format without running the eval script. The default is False
--cachepos: Path to a directory to store/load the pos data for all notes. If no path is specified then memory caching will be used
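When Philter runs are launched from a script rather than typed by hand, it can help to assemble the flags programmatically. The following is a minimal sketch, not part of Philter itself: `build_philter_command` is a hypothetical helper, and it only uses flags listed above and the `main.py` entry point used in the examples in this guide.

```python
import sys


def build_philter_command(input_dir, output_dir, config_path,
                          prod=False, outputformat=None,
                          anno_dir=None, xml_json=None, script="main.py"):
    """Assemble an argument list for a Philter run (hypothetical helper)."""
    cmd = [sys.executable, script,
           "-i", input_dir,
           "-o", output_dir,
           "-f", config_path]
    if anno_dir is not None:
        cmd += ["-a", anno_dir]
    if xml_json is not None:
        cmd += ["-x", xml_json]
    if prod:
        cmd.append("--prod=True")
    if outputformat is not None:
        # Only the two values documented for --outputformat are accepted.
        if outputformat not in ("asterisk", "i2b2"):
            raise ValueError("outputformat must be 'asterisk' or 'i2b2'")
        cmd += ["--outputformat", outputformat]
    return cmd


cmd = build_philter_command("./data/i2b2_notes/", "./data/i2b2_results/",
                            "./configs/philter_delta.json",
                            prod=True, outputformat="asterisk")
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project home directory.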
0. Curating I2B2 XML Files
To remove non-HIPAA PHI annotations from the I2B2 XML files, run the following command:
-i Path to the directory that contains the original I2B2 xml files
-o Path to the directory where the curated files will be written
```bash
python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/
```
1. Running Philter WITHOUT evaluation (no ground-truth annotations required)
a. Make sure the input file(s) are in plain text format. If you are using the I2B2 dataset (or any other dataset in XML or other formats), the note text must be extracted from each original file and be saved in individual text files. Examples of properly formatted input files can be found in ./data/i2b2_notes/.
b. Store all input file(s) in the same directory, and create an output directory (if you want the PHI-reduced notes to be stored somewhere other than the default location).
c. Create a configuration file with specified filters (if you do not want to use the default configuration file).
d. Run Philter in the command line using either default or custom parameters.
Use the following command to run a single job and output files in XML format:
```bash
python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True
```
IMPORTANT NOTE: XML-formatted files do NOT have PHI-reduced text. Instead, they contain the original note text with the PHI tags identified by Philter.
If you'd like to output ONLY the PHI-reduced text with asterisks obscuring Philter-identified PHI, simply add the --outputformat "asterisk" option:
```bash
python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk"
```
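To make the difference between the two output styles concrete, the sketch below shows what asterisk masking amounts to once PHI positions are known. It is an illustration only and not Philter's implementation: `mask_spans` is a hypothetical helper, and it assumes PHI is given as (start, end) character offsets with an exclusive end.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace every non-whitespace character inside the given spans."""
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            # Whitespace is kept so the note keeps its original layout.
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)


note = "Patient John Smith seen on 01/02/2020."
phi_spans = [(8, 18), (27, 37)]  # "John Smith" and the date
masked = mask_spans(note, phi_spans)
print(masked)
# Patient **** ***** seen on **********.
```

Because each character is replaced one-for-one, the masked note has the same length as the original, which keeps character offsets comparable between the two.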
To run multiple jobs simultaneously, the input notes for each job must be located in a separate directory to avoid cross-contamination between output files. For example, if you wanted to run Philter on 1000 notes simultaneously in two processes, the two input directories might look like:
- ./data/batch1/500inputnotes_batch1/
- ./data/batch2/500inputnotes_batch2/
In this example, the following two commands would be used to start running each job in the background:
```bash
nohup python3 main.py -i ./data/batch1/500inputnotes_batch1/ -o ./data/i2b2resultstest/ -f ./configs/philter_delta.json --prod=True > ./data/batch1/batch1terminalout.txt 2>&1 &
```
```bash
nohup python3 main.py -i ./data/batch2/500inputnotes_batch2/ -o ./data/i2b2resultstest/ -f ./configs/philter_delta.json --prod=True > ./data/batch2/batch2terminalout.txt 2>&1 &
```
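Preparing the per-job input directories described above can be scripted. The following is a minimal sketch under stated assumptions: `split_into_batches` and the `batch1`, `batch2` directory names are hypothetical, notes are assumed to be `.txt` files in a single directory, and files are copied rather than moved so the original set stays intact.

```python
import shutil
import tempfile
from pathlib import Path


def split_into_batches(note_dir, batch_root, n_batches):
    """Copy the .txt notes in note_dir into n_batches separate directories."""
    notes = sorted(Path(note_dir).glob("*.txt"))
    batch_dirs = []
    for b in range(n_batches):
        batch_dir = Path(batch_root) / f"batch{b + 1}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        # Round-robin assignment: every note lands in exactly one batch.
        for note in notes[b::n_batches]:
            shutil.copy2(note, batch_dir / note.name)
        batch_dirs.append(batch_dir)
    return batch_dirs


# Demo on throwaway files so the sketch can be run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "notes"
    src.mkdir()
    for i in range(5):
        (src / f"note_{i}.txt").write_text(f"note {i}")
    batches = split_into_batches(src, Path(tmp) / "data", 2)
    batch_sizes = [len(list(d.glob("*.txt"))) for d in batches]
    batch_names = [d.name for d in batches]
    print(batch_names, batch_sizes)
```

Each returned directory can then be passed as the -i argument of its own background job.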
2. Running Philter WITH evaluation (ground truth annotations required)
a. Create Philter-compatible annotation files using the transformation script located in ./generate_dataset/. This script expects notes in xml format, and transforms each input file into two plain text files: 1) the original note text, and 2) the note text with asterisks obscuring PHI. A properly formatted xml input can be found in ./data/i2b2_xml, and examples of the two outputs can be found in ./data/i2b2_notes and ./data/i2b2_anno, respectively. Additionally, this script creates a .json file that contains the original text from each note, followed by the PHI annotations in json format. An example of this output file can be found at ./data/phi_notes_i2b2.json. This is the file that is passed to Philter with the -x option.
Flags:
-x Path to the directory that contains the note xml files
-o Path to the json file that will contain a summary of the phi in the xml files
-n Path to the directory where you would like to store the plain text notes
-a Path to the directory where you would like to store the plain text annotations
Use the following command to create these input files from notes in XML format:
```bash
python3 ./generate_dataset/main_ucsf_updated.py -x ./data/i2b2_xml/ -o ./data/phi_notes_i2b2.json -n ./data/i2b2_notes/ -a ./data/i2b2_anno/
```
Note: If this command produces an ElementTree.ParseError, you may need to remove .DS_Store from ./data/i2b2_xml.
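For readers who want to see what this transformation involves, here is a minimal sketch. It is not the transformation script itself. It assumes an I2B2-style layout in which the note text sits in a `<TEXT>` element and each PHI annotation is a child of `<TAGS>` carrying `start`, `end`, and `text` attributes. The function names and the returned dictionary layout are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def list_xml_files(directory):
    """Return only *.xml files, skipping strays such as .DS_Store."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.suffix.lower() == ".xml")


def parse_note_xml(xml_string):
    """Extract the note text and PHI annotations from one XML note."""
    root = ET.fromstring(xml_string)
    text = root.findtext("TEXT", default="")
    phi = []
    tags = root.find("TAGS")
    if tags is not None:
        for tag in tags:
            phi.append({
                "type": tag.tag,
                "start": int(tag.get("start")),
                "end": int(tag.get("end")),
                "text": tag.get("text"),
            })
    return {"text": text, "phi": phi}


sample = """<deIdi2b2>
<TEXT><![CDATA[Seen by Dr. Lee on 2020-01-02.]]></TEXT>
<TAGS>
<NAME id="P0" start="12" end="15" text="Lee" TYPE="DOCTOR" />
<DATE id="P1" start="19" end="29" text="2020-01-02" TYPE="DATE" />
</TAGS>
</deIdi2b2>"""

parsed = parse_note_xml(sample)
print(json.dumps(parsed, indent=2))
```

Filtering the directory listing by extension before parsing is one way to avoid the ParseError mentioned in the note above.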
b-c. See Step 1b-c above
d. Run Philter in evaluation mode using the following command:
```bash
python3 main.py -i ./data/i2b2_notes/ -a ./data/i2b2_anno/ -o ./data/i2b2_results/ -x ./data/phi_notes_i2b2.json -f=./configs/philter_delta.json --outputformat "asterisk"
```
By default, this will output PHI-reduced notes (.txt format) in the specified output directory. If this command is used with the --outputformat i2b2 flag (or with no --outputformat specified, since i2b2 format is the default option), the evaluation script will not be run and the script will output notes with the original text and the Philter PHI tags (.xml format) in the specified output directory.
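The recall and precision figures reported in evaluation mode compare Philter's output against the ground truth annotations. The sketch below illustrates that idea at the token level. It is not Philter's eval script: `token_level_scores` is a hypothetical helper, and it assumes both notes are asterisk-masked, whitespace-tokenized to the same length, and that any token containing an asterisk counts as PHI.

```python
def token_level_scores(gold_masked, pred_masked, mask_char="*"):
    """Compare two asterisk-masked versions of the same note token by token."""
    gold_tokens = gold_masked.split()
    pred_tokens = pred_masked.split()
    if len(gold_tokens) != len(pred_tokens):
        raise ValueError("notes are not token-aligned")
    tp = fp = fn = tn = 0
    for gold, pred in zip(gold_tokens, pred_tokens):
        gold_phi = mask_char in gold
        pred_phi = mask_char in pred
        if gold_phi and pred_phi:
            tp += 1   # PHI correctly masked
        elif pred_phi:
            fp += 1   # non-PHI masked by mistake
        elif gold_phi:
            fn += 1   # PHI that was missed
        else:
            tn += 1   # non-PHI correctly kept
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "recall": recall, "precision": precision}


gold = "Patient **** ***** seen on ********** by Dr. ***"
pred = "Patient **** Smith seen on ********** by *** ***"
scores = token_level_scores(gold, pred)
print(scores)
```

In this example one PHI token ("Smith") is missed and one non-PHI token ("Dr.") is masked, so both recall and precision come out at 0.75.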
Owner
- Name: BCHSI
- Login: BCHSI
- Kind: organization
- Repositories: 5
- Profile: https://github.com/BCHSI
GitHub Events
Total
- Issues event: 1
- Watch event: 24
- Member event: 1
- Issue comment event: 4
- Fork event: 9
Last Year
- Issues event: 1
- Watch event: 24
- Member event: 1
- Issue comment event: 4
- Fork event: 9
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kathleen Muenzen | 3****n | 57 |
| Kathleen | k****n@y****m | 15 |
| tschaffter | t****r@g****m | 6 |
| kmuenzen | k****n@K****l | 6 |
| Gundolf Schenk | 3****s | 3 |
| Paul M. Heider | h****p@m****u | 2 |
| dependabot[bot] | 4****] | 2 |
| RedChrists | g****k@u****u | 2 |
| beaunorgeot | b****t@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 13
- Total pull requests: 16
- Average time to close issues: 6 months
- Average time to close pull requests: 5 months
- Total issue authors: 9
- Total pull request authors: 10
- Average comments per issue: 1.23
- Average comments per pull request: 0.81
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 3.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- kmuenzen (3)
- Adeva1 (1)
- DSLituiev (1)
- ewartj (1)
- lesterlitch (1)
- gknor (1)
- fbraun4358 (1)
- soulaven (1)
- nlaimr (1)
Pull Request Authors
- dependabot[bot] (3)
- tschaffter (2)
- TimOrme (2)
- landiisotta (1)
- callandramoore (1)
- tmills (1)
- katie-ta (1)
- markskrass (1)
- paulheider (1)
- Trott (1)
Packages
- Total packages: 2
- Total downloads: pypi 662 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 2 (may contain duplicates)
- Total versions: 3
- Total maintainers: 1
pypi.org: philter-ucsf
An open-source PHI-filtering software
- Homepage: https://github.com/BCHSI/philter-ucsf
- Documentation: https://philter-ucsf.readthedocs.io/
- License: BSD License
- Latest release: 1.0.3, published almost 6 years ago
Rankings
Maintainers (1)
pypi.org: philter-ucsf-beta
An open-source PHI-filtering software
- Homepage: https://github.com/BCHSI/philter-ucsf
- Documentation: https://philter-ucsf-beta.readthedocs.io/
- License: BSD License
- Latest release: 0.0.2, published over 2 years ago
Rankings
Maintainers (1)
Dependencies
- chardet ==3.0.4
- nltk ==3.5
- numpy ==1.19.0
- pandas ==1.0.5
- xmltodict ==0.12.0