data-processing
Data processing scripts for the project What Works When for Whom?
Repository
Basic Info
- Host: GitHub
- Owner: e-mental-health
- License: other
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://www.research-software.nl/projects/what-works-when-for-whom
- Size: 890 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Data Processing (What Works When for Whom?)
Data processing scripts of the e-Mental Health project. The scripts deal with two data sets: OVK and Tactus. Data related to our paper "De-identification of Dutch Medical Text" can be found in the data repository.
OVK mails
The OVK data consist of Word files with e-mails. First, each Word file was converted to text:
abiword -t text 1234.docx # creates file 1234.text
Next, extra lines were added to the text files to indicate where a new email starts. A typical separator line is "Date: 1 Jan 2000":
ovkPrepare.py 1234.text > 1234.prepared # expects a numeric file name
The program tries to guess the author of an email but this sometimes fails. The missed names are indicated by the string "???" and need to be added manually.
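As a minimal sketch of how the separator lines can be used downstream, the following splits a prepared text into separate emails and flags emails that still need a manual author fix. It assumes that every email starts with a line beginning with "Date:", as in the example separator above. The names `split_emails` and `has_unknown_author` are hypothetical and are not part of ovkPrepare.py.

```python
import re

# Assumption: each email starts with a separator line such as "Date: 1 Jan 2000".
SEPARATOR = re.compile(r"^Date\s*:")


def split_emails(text):
    """Split prepared text into a list of emails, each a list of lines."""
    emails = []
    for line in text.splitlines():
        # Start a new email at every separator line (and at the first line).
        if SEPARATOR.match(line) or not emails:
            emails.append([])
        emails[-1].append(line)
    return emails


def has_unknown_author(email):
    """Return True if the email still contains the placeholder "???"."""
    return any("???" in line for line in email)
```

Emails for which `has_unknown_author` returns `True` are the ones where the author name still has to be added by hand.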
The files contain personal information, which needs to be removed. Therefore all names and numbers were removed from the files. This process consists of two steps: 1. named entity recognition (ner), and 2. anonymization:
ner.py < 1234.prepared > 1234.ner # requires frog, see comment below
anonymize.py 1234.ner # creates file 1234.ner.out, see comment below
For named entity recognition, we rely on the program frog, which is part of the LaMachine package. After installing the package, we run it as follows:
docker run -p 8080:8080 -t -i proycon/lamachine
lamachine@abcd1234:/$ frog -S 8080 --skip mcpla
After starting LaMachine like this, the ner.py program is able to process the texts.
The anonymization process (anonymize.py) is interactive. Each new entity will be displayed on the screen and the user is required to enter the entity type (like PER, ORG or LOC) or an empty string (which stands for: no entity).
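The interactive step can be pictured with the following sketch. It is not the code of anonymize.py: the function names are hypothetical, the `ask` parameter only exists so that the prompt can be replaced in tests, and the placeholder `NUM` for numbers is an assumption (the source only states that numbers are replaced by dummy strings).

```python
def label_entities(candidates, known=None, ask=input):
    """Ask the user once for the type of every new candidate entity."""
    known = {} if known is None else known
    for entity in candidates:
        if entity not in known:
            answer = ask(f"{entity} (PER/ORG/LOC or empty for no entity): ")
            # An empty answer is stored as "" and means: not an entity.
            known[entity] = answer.strip().upper()
    return known


def anonymize_tokens(tokens, labels, number_tag="NUM"):
    """Replace labeled tokens by their entity type and numbers by number_tag."""
    result = []
    for token in tokens:
        if token.isdigit():
            result.append(number_tag)  # assumption: hypothetical placeholder
        else:
            # Tokens without a label (or with an empty label) are kept as-is.
            result.append(labels.get(token) or token)
    return result
```

Storing the answers in a dictionary means that each entity is only shown to the user the first time it occurs, which matches the "each new entity" behaviour described above.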
In the file 1234.ner.out, all names and numbers have been replaced by dummy strings (like PER for person names). Mail headers, which signal the start of a new mail, start on a new line with a capitalized word followed by a space and a colon (like: "Date :"). All other lines contain a single sentence.
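A file in this format could be read back with a sketch like the one below. It assumes that a block of header lines opens every mail and that a header is recognized purely by its shape (capitalized word, space, colon). The function name `read_anonymized` is hypothetical and the code is not taken from ovk2table.py.

```python
import re

# Assumption: a header line looks like "Date : ..." (capitalized word, space, colon).
HEADER = re.compile(r"^([A-Z][A-Za-z]*) :\s*(.*)$")


def read_anonymized(lines):
    """Group lines into mails with a dict of headers and a list of sentences."""
    emails = []
    in_headers = False
    for line in lines:
        line = line.rstrip("\n")
        match = HEADER.match(line)
        if match:
            # The first header after a sentence (or at the start) opens a new mail.
            if not in_headers:
                emails.append({"headers": {}, "sentences": []})
                in_headers = True
            emails[-1]["headers"][match.group(1)] = match.group(2)
        elif line.strip():
            if not emails:
                emails.append({"headers": {}, "sentences": []})
            emails[-1]["sentences"].append(line)
            in_headers = False
    return emails
```

Two limitations follow from these assumptions: a sentence that happens to start with a capitalized word followed by " :" is mistaken for a header, and a mail without any sentences is merged with the mail that follows it.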
Finally, the output files of the anonymization process can be converted to a csv table which can be used for analysis:
ovk2table.py 1234.ner.out ... > all.csv
The program ovk2table.py assumes that the files contain emails in chronological order. The local file reversed.txt contains the names of files with the emails in reversed chronological order.
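The role of reversed.txt can be illustrated as follows. This is only a sketch: it assumes that reversed.txt lists one file name per line and that the mails of a file are available as a list, and the function names are hypothetical rather than taken from ovk2table.py.

```python
import os


def read_reversed_names(path="reversed.txt"):
    """Read the names of the files that store mails newest-first."""
    # Assumption: one file name per line, blank lines are ignored.
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def chronological(file_name, emails, reversed_names):
    """Return the mails of a file in chronological order."""
    if os.path.basename(file_name) in reversed_names:
        return list(reversed(emails))
    return list(emails)
```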
The Jupyter notebooks pca.ipynb and pca-results.ipynb can be used for analyzing the data in the csv tables. pca.ipynb uses principal component analysis to visualize the emails in a two-dimensional space. It also displays words which are specific for subsets of the emails. pca-results.ipynb performs the same task, but in addition to building models from the words in the emails, it also displays information on the treatment progress.
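To make the idea of projecting emails onto two dimensions concrete, here is a small standard-library sketch. It is not the code of the notebooks, whose exact features and libraries are not described here. It assumes whitespace tokenization and raw word counts, and it computes the first two principal components by power iteration with deflation. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def bag_of_words(texts):
    """Return the vocabulary and one row of word counts per text."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for text in texts:
        row = [0.0] * len(vocab)
        for word, count in Counter(text.lower().split()).items():
            row[index[word]] = float(count)
        rows.append(row)
    return vocab, rows


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the column mean from every column."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, iterations=200, seed=0):
    """Power iteration on X^T X; returns a unit-length direction."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]          # X v
        w = [dot(scores, col) for col in zip(*rows)]    # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(texts):
    """Return one (x, y) coordinate pair per text."""
    _, rows = bag_of_words(texts)
    rows = center(rows)
    axes = []
    for _ in range(2):
        v = top_component(rows)
        scores = [dot(row, v) for row in rows]
        axes.append(scores)
        # Deflation: remove the component that was just found.
        rows = [[x - s * c for x, c in zip(row, v)] for row, s in zip(rows, scores)]
    return list(zip(*axes))
```

For real data, a library implementation of principal component analysis is usually preferable; the sketch only shows what the two plotted coordinates represent.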
Tactus mails
The Tactus data consist of XML files with emails. These can be converted to a csv table:
tactus2table.py 1234.xml ... # creates file ../output/emails.csv
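The general shape of such a conversion is sketched below. The Tactus XML schema is not described here, so the element names (`Message`, `Sender`, `DateSent`, `Subject`, `Body`) are purely hypothetical placeholders, as are the function names. The code is not taken from tactus2table.py.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumption: hypothetical element names; the real Tactus schema may differ.
MESSAGE_TAG = "Message"
FIELDS = ["Sender", "DateSent", "Subject", "Body"]


def messages_to_rows(xml_text, client_id):
    """Extract one dict per message element from the XML of one client."""
    root = ET.fromstring(xml_text)
    rows = []
    for message in root.iter(MESSAGE_TAG):
        row = {"client": client_id}
        for field in FIELDS:
            # Missing or empty child elements become empty strings.
            row[field] = (message.findtext(field) or "").strip()
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the rows as a csv table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["client"] + FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

A call such as `write_csv(messages_to_rows(xml_text, "1234"), io.StringIO())` writes the table to memory; passing an open file instead writes it to disk.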
The texts of the mails need to be anonymized. There is no automatic solution for this yet.
The data in the csv table can be analysed with the Jupyter notebooks pca-tactus.ipynb and pca-tactus-results.ipynb, in the same way as the OVK data.
Metadata
The OVK metadata is stored in the SPSS file opve.sav. The file was converted to csv with R:
library(foreign)
data <- read.spss("file.sav")
write.csv(data,file="file.csv")
The Tactus metadata is stored in the XML file of each client. The program tactus2table.py extracts the metadata and stores it in five files in the directory ../output (Intake.csv and Lijst.*).
Installation
Clone the repository:
git clone git@github.com:eriktks/data-processing.git
Change into the top-level directory:
cd data-processing
Dependencies
- Python 3.6
License
Copyright (c) 2018, Erik Tjong Kim Sang
Apache Software License 2.0
Contributing
Contributing authors so far:
- Erik Tjong Kim Sang
Owner
- Name: What Works When for Whom?
- Login: e-mental-health
- Kind: organization
- Email: erikt@xs4all.nl
- Location: Amsterdam, The Netherlands
- Website: https://www.esciencecenter.nl/project/what-works-when-for-whom
- Repositories: 2
- Profile: https://github.com/e-mental-health
project What Works When for Whom?, Netherlands eScience Center
Citation (CITATION.cff)
# YAML 1.2
---
abstract: "Data processing scripts for the project What Works When for Whom? "
authors:
  -
    affiliation: "Netherlands eScience Center"
    family-names: "Tjong Kim Sang"
    given-names: Erik
    orcid: "https://orcid.org/0000-0002-8431-081X"
cff-version: "1.1.0"
date-released: 2021-03-09
license: "Apache-2.0"
message: "If you use this software, please cite it using these metadata."
title: Data Processing (What Works When for Whom?)
version: "1.0"
...
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1