lives
Python code for the analysis and creation of machine-learned models for the NIH-funded project "Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors."
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: clulab
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://reporter.nih.gov/search/qfhaBJoM20qq64VSqwCScg/project-details/10109452
- Size: 705 KB
Statistics
- Stars: 4
- Watchers: 6
- Forks: 0
- Open Issues: 5
- Releases: 0
Metadata Files
README.md
Computational rescue of untapped behavioral data from the LIvES study
Python code for the analysis and creation of machine-learned models for the NIH-funded project "Using natural language processing to determine predictors of a healthy diet and physical activity behavior change in ovarian cancer survivors." https://reporter.nih.gov/search/qfhaBJoM20qq64VSqwCScg/project-details/10109452
About
In most behavioral interventions, the focus is primarily on whether the intervention improves the medical outcome of interest. As a result, the large amounts of data that study procedures generate while the intervention is carried out are used only for study management. This data, which we call "secondary process-related data," is often recorded and archived but never analyzed. We rescue this data with machine learning, natural language processing, and artificial intelligence, and leverage it to help automate fidelity monitoring and identify low-level, non-traditional predictors of lifestyle behavioral outcomes.
More information
You can find more information about this project in the following resources:
Damian Yukio Romero Diaz. 2022. Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors. (2022). DOI:https://doi.org/10.48321/D1BK5T
Cynthia A. Thomson, Tracy E. Crane, Austin Miller, David O. Garcia, Karen Basen-Engquist, and David S. Alberts. 2016. A randomized trial of diet and physical activity in women treated for stage II–IV ovarian cancer: Rationale and design of the Lifestyle Intervention for Ovarian Cancer Enhanced Survival (LIVES): An NRG Oncology/Gynecologic Oncology Group (GOG-225) Study. Contemporary Clinical Trials 49, (July 2016), 181–189. DOI:https://doi.org/10.1016/j.cct.2016.07.005
Freylersythe, S., Sharp, R., Culnan, J., Romero Diaz, D. Y., Zhao, Y., Franks, G. H., Nitschke, R., Bethard, S. J., & Crane, T. E. (2022). Lessons Learned from a Secondary Analysis Using Natural Language Processing and Machine Learning from a Lifestyle Intervention: SBM Conference Handout [Data set]. https://github.com/clulab/SBM2022LIvES
Attribution
Please cite the contents in this repository as follows:
Bethard, S. J., Romero Diaz, D. Y., Culnan, J., & Zhao, Y. (2022). LIvES Project: Code and Code Documentation (Version 0.1.0) [Computer software]. https://github.com/clulab/lives
License
The code in this repository is distributed under an MIT license. You can find the details here.
Repo for implementing the Whisper model for the LIvES project
Python code for implementing the Whisper model to transcribe English and Spanish interviews and to translate Spanish interviews into English.
Setup
You will need to set up an appropriate coding environment:
- Python (version 3.7 or higher)
- PyTorch
- The following command will pull and install the latest commit from the openai/whisper repository: `pip install git+https://github.com/openai/whisper.git`
- It also requires the command-line tool `ffmpeg` to be installed on your system, which is available from most package managers:
```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```
- You may need `rust` installed as well, in case [tokenizers](https://pypi.org/project/tokenizers/) does not provide a pre-built wheel for your platform: `pip install setuptools-rust`
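Before running the transcription scripts, it can help to confirm that the required command-line tools are actually on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and is not part of this repository or of Whisper.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["ffmpeg"])
    if missing:
        print("Missing command-line tools:", ", ".join(missing))
    else:
        print("ffmpeg found on PATH")
```

An empty list means every tool was found; anything else names what still needs to be installed with one of the package-manager commands above.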
Models and languages
Whisper provides five model sizes. We use the base model here, which is multilingual and has 71,825,920 parameters. To use a different model size, see the approximate memory requirements and relative speeds in the Available models and languages section of the Whisper README.
In this project, we use the Whisper model to transcribe and translate English and Spanish audio. It can also transcribe and translate other languages; all available languages are listed in tokenizer.py.
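The parameter count and the language list mentioned here can be checked directly. This is a sketch, assuming `openai-whisper` and PyTorch are installed; `summarize` and `describe_whisper_model` are hypothetical helper names, while `whisper.load_model`, `model.is_multilingual`, and `whisper.tokenizer.LANGUAGES` come from the Whisper package.

```python
def summarize(is_multilingual, n_params):
    """Format a one-line description of a model."""
    kind = "multilingual" if is_multilingual else "English-only"
    return f"Model is {kind} and has {n_params:,} parameters."


def describe_whisper_model(name="base"):
    """Load a Whisper model and describe it (downloads weights on first use)."""
    # Imported lazily so that summarize() works without Whisper installed.
    import whisper
    from whisper.tokenizer import LANGUAGES  # maps language code -> name

    model = whisper.load_model(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{len(LANGUAGES)} languages, e.g. en={LANGUAGES['en']}, es={LANGUAGES['es']}")
    return summarize(model.is_multilingual, n_params)


# Formatting check with the figure quoted for the base model.
print(summarize(True, 71_825_920))
```

Calling `describe_whisper_model("base")` is the part that needs the Whisper package and a network connection for the first download.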
Python usage
To transcribe and translate the audio, we use the following functions.
From audio.py:
- The `load_audio()` method reads the audio file and returns a NumPy array containing the audio waveform, in float32 dtype: `[-0.00018311 -0.00024414 -0.00030518 ... -0.00146484 -0.00195312 -0.00210571]`
- The `pad_or_trim()` method pads or trims the audio array to N_SAMPLES (480000) so that it fits 30 seconds.
```
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000: number of samples in a chunk
```
- The `log_mel_spectrogram()` method computes the log-Mel spectrogram of a NumPy array containing the audio waveform and returns a Tensor containing the Mel spectrogram. The shape of the Tensor is (80, 3000).
```
tensor([[-0.5296, -0.5296, -0.5296, ...,  0.0462,  0.2417,  0.1118],
        [-0.5296, -0.5296, -0.5296, ...,  0.0443,  0.1246, -0.1071],
        [-0.5296, -0.5296, -0.5296, ...,  0.2268,  0.0590, -0.2129],
        ...,
        [-0.5296, -0.5296, -0.5296, ..., -0.5296, -0.5296, -0.5296],
        [-0.5296, -0.5296, -0.5296, ..., -0.5296, -0.5296, -0.5296],
        [-0.5296, -0.5296, -0.5296, ..., -0.5296, -0.5296, -0.5296]])
```
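The arithmetic behind the 30-second chunk and the (80, 3000) spectrogram shape can be sketched without Whisper installed. The hop length of 160 samples and the 80 Mel bins are taken from Whisper's audio.py as an assumption about the installed version; `pad_or_trim_list` is a hypothetical list-based stand-in that only illustrates the padding and trimming behavior, since the real `pad_or_trim()` works on arrays and tensors.

```python
SAMPLE_RATE = 16000                      # samples per second
CHUNK_LENGTH = 30                        # seconds per chunk
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE   # 480000 samples per chunk
HOP_LENGTH = 160                         # samples between frames (assumed)
N_MELS = 80                              # Mel bins for the base model (assumed)
N_FRAMES = N_SAMPLES // HOP_LENGTH       # frames per 30-second chunk


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Illustrative stand-in: zero-pad at the end or cut to `length`."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)   # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)   # 45 seconds of audio

# Both clips end up with exactly N_SAMPLES samples.
print(len(pad_or_trim_list(short_clip)), len(pad_or_trim_list(long_clip)))
# Expected spectrogram shape for one chunk.
print((N_MELS, N_FRAMES))
```

Audio longer than 30 seconds is cut by this step, so anything past the first chunk is only reached when `transcribe()` slides over the file itself.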
From decoding.py:
- The `detect_language()` method detects the language of the log-Mel spectrogram and returns a Tensor (`_`) and a probability distribution (`probs`) that maps each language to its probability.
```
_, probs = model.detect_language(mel)  # mel: the log-Mel spectrogram computed above
print(f"Detected language: {max(probs, key=probs.get)}")

probs:
{'en': 0.9958220720291138, 'zh': 7.025230297585949e-05, 'de': 0.00015919747238513082, 'es': 0.0003416460531298071, 'ru': 0.00030879987752996385, 'ko': 0.00028310518246144056, 'fr': 0.00021966002532280982,...}
```
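Looking at more than the single best language can help spot ambiguous recordings, for example interviews that mix English and Spanish. The sketch below ranks a `probs` dictionary of the kind shown above; `top_languages` is a hypothetical helper, and the dictionary is a subset of the values printed in this README.

```python
def top_languages(probs, k=3):
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Subset of the probabilities shown in the example output.
probs = {
    "en": 0.9958220720291138,
    "zh": 7.025230297585949e-05,
    "de": 0.00015919747238513082,
    "es": 0.0003416460531298071,
    "ru": 0.00030879987752996385,
    "ko": 0.00028310518246144056,
    "fr": 0.00021966002532280982,
}

for language, probability in top_languages(probs):
    print(f"{language}: {probability:.6f}")
```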
From transcribe.py:
- The `transcribe()` method transcribes the audio file and returns a dictionary containing the resulting text (`"text"`), segment-level details (`"segments"`), and the spoken language or the language you want to translate (`"language"`). The parameters we pass to this method are the audio file, the language, and fp16=False/True. Each entry in `"segments"` shows the segment's start and end. With this information, we can then get the time that each audio file's transcription (or translation) takes.
** When the model runs on the CPU and you set fp16=True, you will get the warning message "FP16 is not supported on CPU; using FP32 instead". Set fp16=False to avoid the warning.
```
model.transcribe(audio_file, language="english", fp16=False)

result:
{'text': " It's now a good time for a call. I'm sorry it's really late. I think my calls have been going past what they should be.", 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 2.88, 'text': " It's now a good time for a call. I'm sorry it's really late.", 'tokens': [50364, 467, 311, 586, 257, 665, 565, 337, 257, 818, 13, 286, 478, 2597, 309, 311, 534, 3469, 13, 50508, 50508, 876, 452, 5498, 362, 668, 516, 1791, 437, 436, 820, 312, 13, 50680], 'temperature': 0.0, 'avg_logprob': -0.28695746830531527, 'compression_ratio': 1.163265306122449, 'no_speech_prob': 0.020430153235793114}, {'id': 1, 'seek': 288, 'start': 2.88, 'end': 30.88, 'text': ' I think my calls have been going past what they should be.', 'tokens': [50364, 286, 519, 452, 5498, 362, 668, 516, 1791, 437, 436, 820, 312, 13, 51764], 'temperature': 0.0, 'avg_logprob': -0.7447051405906677, 'compression_ratio': 0.90625, 'no_speech_prob': 0.02010437287390232}], 'language': 'english'}
```
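The fp16 note and the segment timings can be combined in one small wrapper. This is a sketch under stated assumptions, not the project's actual script: `choose_fp16`, `segment_spans`, and `transcribe_file` are hypothetical names, and `transcribe_file` assumes `openai-whisper` and PyTorch are installed. Passing `task="translate"` asks Whisper to translate into English instead of transcribing.

```python
def choose_fp16(device):
    """Use half precision only on CUDA devices; the CPU falls back to FP32."""
    return str(device).startswith("cuda")


def segment_spans(result):
    """Return (start, end, text) for every segment in a transcribe() result."""
    return [(s["start"], s["end"], s["text"].strip()) for s in result["segments"]]


def transcribe_file(path, language="english", task="transcribe", model_name="base"):
    """Transcribe or translate one audio file with Whisper."""
    # Imported lazily so the helpers above work without Whisper installed.
    import torch
    import whisper

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name, device=device)
    return model.transcribe(path, language=language, task=task,
                            fp16=choose_fp16(device))


# Trimmed copy of the example result (token lists and scores omitted).
example = {
    "text": " It's now a good time for a call. I'm sorry it's really late."
            " I think my calls have been going past what they should be.",
    "segments": [
        {"start": 0.0, "end": 2.88,
         "text": " It's now a good time for a call. I'm sorry it's really late."},
        {"start": 2.88, "end": 30.88,
         "text": " I think my calls have been going past what they should be."},
    ],
    "language": "english",
}

for start, end, text in segment_spans(example):
    print(f"{start:6.2f} -> {end:6.2f}  {text}")
```

Choosing fp16 from the device avoids the CPU warning without hard-coding the flag in every call.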
** When transcribing and translating audio that has been split by turn, make sure the length of each turn is greater than 0 before running the script on the short audio.
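A check of this kind could look like the following sketch. The turn layout (dictionaries with `speaker`, `start`, and `end` in seconds) is an assumption made for illustration; the repository's own turn format may differ, and `drop_empty_turns` is a hypothetical helper.

```python
def drop_empty_turns(turns):
    """Keep only turns whose end time is strictly after their start time."""
    return [turn for turn in turns if turn["end"] > turn["start"]]


# Hypothetical turn boundaries in seconds.
turns = [
    {"speaker": "A", "start": 0.0, "end": 2.88},
    {"speaker": "B", "start": 2.88, "end": 2.88},   # zero-length turn
    {"speaker": "A", "start": 2.88, "end": 30.88},
]

kept = drop_empty_turns(turns)
print(f"kept {len(kept)} of {len(turns)} turns")
```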
Owner
- Name: Computational Language Understanding Lab (CLU Lab) at University of Arizona
- Login: clulab
- Kind: organization
- Location: Tucson, AZ
- Website: http://clulab.org
- Repositories: 72
- Profile: https://github.com/clulab
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'LIvES Project: Code and Code Documentation'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Steven John
    family-names: Bethard
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0001-9560-6491'
  - given-names: Damian Yukio
    family-names: Romero Diaz
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0003-4661-0296'
  - given-names: John
    family-names: Culnan
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0001-7327-1053'
  - given-names: Yiyun
    family-names: Zhao
    affiliation: University of Arizona
repository-code: 'https://github.com/clulab/lives'
abstract: >-
  Python code for the analysis and creation of
  machine-learned models for the NIH-funded project
  "Using natural language processing to determine
  predictors of healthy diet and physical activity
  behavior change in ovarian cancer survivors." For
  more information, please consult our Data
  Management Plan at:
  https://dmphub.cdlib.org/dmps/doi:10.48321/D1BK5T
  NIH NCI R21CA256680 (MPI: Crane/Bethard) for
  01/01/2021-12/31/2022.
keywords:
  - Behavioral intervention
  - Cancer research
  - Machine learning
  - Natural language processing
license: MIT
version: 0.1.0
date-released: '2022-10-04'
GitHub Events
Total
- Watch event: 1
- Push event: 2
Last Year
- Watch event: 1
- Push event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 16
- Average time to close issues: N/A
- Average time to close pull requests: about 2 hours
- Total issue authors: 1
- Total pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.06
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- damian-romero (3)
Pull Request Authors
- damian-romero (13)
- jmculnan (2)
- bethard (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- pyannote.audio >=2.0
- pyannote.core *
- pyannote.database *
- pyannote.metrics *
- tqdm *
- pip3
- python 3.10