lives

Python code for the analysis and creation of machine-learned models for the NIH-funded project "Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors."

https://github.com/clulab/lives

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Python code for the analysis and creation of machine-learned models for the NIH-funded project "Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors."

Basic Info
Statistics
  • Stars: 4
  • Watchers: 6
  • Forks: 0
  • Open Issues: 5
  • Releases: 0
Created about 5 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

Computational rescue of untapped behavioral data from the LIvES study

Python code for the analysis and creation of machine-learned models for the NIH-funded project "Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors." https://reporter.nih.gov/search/qfhaBJoM20qq64VSqwCScg/project-details/10109452

About

In most behavioral interventions, the focus is primarily on whether the intervention improves the medical outcome of interest. As a result, study procedures generate enormous amounts of data during the operationalization of the intervention. This data, which we call "secondary process-related data," is typically recorded and archived for study management only and never analyzed. We rescue this data through machine learning, natural language processing, and artificial intelligence, leveraging it to help automate fidelity monitoring and to identify low-level, non-traditional predictors of lifestyle behavioral outcomes.

More information

You can find more information about this project in the following resources:

Damian Yukio Romero Diaz. 2022. Using natural language processing to determine predictors of healthy diet and physical activity behavior change in ovarian cancer survivors. DOI:https://doi.org/10.48321/D1BK5T

Cynthia A. Thomson, Tracy E. Crane, Austin Miller, David O. Garcia, Karen Basen-Engquist, and David S. Alberts. 2016. A randomized trial of diet and physical activity in women treated for stage II–IV ovarian cancer: Rationale and design of the Lifestyle Intervention for Ovarian Cancer Enhanced Survival (LIVES): An NRG Oncology/Gynecologic Oncology Group (GOG-225) Study. Contemporary Clinical Trials 49, (July 2016), 181–189. DOI:https://doi.org/10.1016/j.cct.2016.07.005

Freylersythe, S., Sharp, R., Culnan, J., Romero Diaz, D. Y., Zhao, Y., Franks, G. H., Nitschke, R., Bethard, S. J., & Crane, T. E. (2022). Lessons Learned from a Secondary Analysis Using Natural Language Processing and Machine Learning from a Lifestyle Intervention: SBM Conference Handout [Data set]. https://github.com/clulab/SBM2022LIvES

Attribution

Please cite the contents in this repository as follows:

Bethard, S. J., Romero Diaz, D. Y., Culnan, J., & Zhao, Y. (2022). LIvES Project: Code and Code Documentation (Version 0.1.0) [Computer software]. https://github.com/clulab/lives

License

The code in this repository is distributed under an MIT license. You can find the details here.

Repo for implementing the Whisper model for the LIvES project

Python code that uses the Whisper model to transcribe English and Spanish interviews and to translate Spanish interviews into English.

Setup

You will need to set up an appropriate coding environment:

  • Python (version 3.7 or higher)
  • PyTorch
  • The following command pulls and installs the latest commit from the openai/whisper repository:

```
pip install git+https://github.com/openai/whisper.git
```

  • The command-line tool ffmpeg is also required on your system; it is available from most package managers:

```
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```

  • You may need rust installed as well, in case tokenizers (https://pypi.org/project/tokenizers/) does not provide a pre-built wheel for your platform:

```
pip install setuptools-rust
```
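Because Whisper shells out to ffmpeg when loading audio, a missing binary surfaces only later as a confusing error. A minimal sanity check can be run first; `check_ffmpeg` is our own hypothetical helper, not part of Whisper:

```python
import shutil

def check_ffmpeg() -> bool:
    """Return True if the ffmpeg binary is reachable on the PATH."""
    return shutil.which("ffmpeg") is not None

if not check_ffmpeg():
    print("ffmpeg not found; install it with your package manager first.")
```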

Models and languages

Whisper provides five model sizes. We use the base model here, which is multilingual and has 71,825,920 parameters. If you want to use a different model size and weigh the approximate memory requirements and relative speed, see the Available models and languages section here.

In this project, we use this Whisper model to transcribe and translate English and Spanish audio. It can also transcribe and translate many other languages; all available languages are listed in tokenizer.py.

Python usage

To transcribe and translate the audio, we will use the following functions to get the result.

From audio.py:

  • The load_audio() method reads the audio file and returns a NumPy array containing the audio waveform, in float32 dtype.

```
[-0.00018311 -0.00024414 -0.00030518 ... -0.00146484 -0.00195312 -0.00210571]
```

  • The pad_or_trim() method pads or trims the audio array to N_SAMPLES samples to fit 30 seconds.

```
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000: number of samples in a chunk
```

  • The log_mel_spectrogram() method computes the log-Mel spectrogram of a NumPy array containing the audio waveform and returns a Tensor containing the Mel spectrogram; the shape of the Tensor will be (80, 3000).

```
tensor([[-0.5296, -0.5296, -0.5296,  ...,  0.0462,  0.2417,  0.1118],
        [-0.5296, -0.5296, -0.5296,  ...,  0.0443,  0.1246, -0.1071],
        [-0.5296, -0.5296, -0.5296,  ...,  0.2268,  0.0590, -0.2129],
        ...,
        [-0.5296, -0.5296, -0.5296,  ..., -0.5296, -0.5296, -0.5296],
        [-0.5296, -0.5296, -0.5296,  ..., -0.5296, -0.5296, -0.5296],
        [-0.5296, -0.5296, -0.5296,  ..., -0.5296, -0.5296, -0.5296]])
```
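The pad-or-trim step above can be sketched in plain NumPy; `pad_or_trim_sketch` is our own illustrative re-implementation under the constants shown above, not Whisper's actual function:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_LENGTH = 30
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000 samples = 30 seconds

def pad_or_trim_sketch(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Zero-pad or truncate a waveform so it is exactly `length` samples."""
    if audio.shape[0] >= length:
        return audio[:length]          # too long: keep the first 30 seconds
    return np.pad(audio, (0, length - audio.shape[0]))  # too short: pad with silence

padded = pad_or_trim_sketch(np.zeros(1000, dtype=np.float32))     # short clip
trimmed = pad_or_trim_sketch(np.zeros(600_000, dtype=np.float32)) # long clip
# both now contain exactly N_SAMPLES (480000) samples
```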

From decoding.py:

  • The detect_language() method detects the language of the log-Mel spectrogram and returns a Tensor of language tokens (_) and the probability distribution (probs), which maps each candidate language to its probability.

```
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# probs:
{'en': 0.9958220720291138, 'zh': 7.025230297585949e-05, 'de': 0.00015919747238513082, 'es': 0.0003416460531298071, 'ru': 0.00030879987752996385, 'ko': 0.00028310518246144056, 'fr': 0.00021966002532280982, ...}
```
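Picking the winning language from probs is simply an argmax over the dictionary. A self-contained sketch, using probabilities abbreviated from the sample output above:

```python
def most_likely_language(probs: dict) -> str:
    """Return the language code with the highest detection probability."""
    return max(probs, key=probs.get)

# abbreviated sample distribution
probs = {'en': 0.99582, 'zh': 0.00007, 'de': 0.00016, 'es': 0.00034}
print(f"Detected language: {most_likely_language(probs)}")  # Detected language: en
```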

From transcribe.py:

  • The transcribe() method transcribes the audio file and returns a dictionary containing the resulting text ("text"), segment-level details ("segments"), and the spoken language or the language you want to translate to ("language"). The parameters we pass to this method are the audio file, the language, and fp16=False/True. Each entry in "segments" records that segment's start and end times; with this information, we can then determine the time span each transcription (or translation) covers.

** When the model is running on the CPU and you set fp16=True, you will get the warning "FP16 is not supported on CPU; using FP32 instead". Set fp16=False to avoid the warning.

```
result = model.transcribe(audio, language="english", fp16=False)

# result:
{'text': " It's now a good time for a call. I'm sorry it's really late. I think my calls have been going past what they should be.", 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 2.88, 'text': " It's now a good time for a call. I'm sorry it's really late.", 'tokens': [50364, 467, 311, 586, 257, 665, 565, 337, 257, 818, 13, 286, 478, 2597, 309, 311, 534, 3469, 13, 50508, 50508, 876, 452, 5498, 362, 668, 516, 1791, 437, 436, 820, 312, 13, 50680], 'temperature': 0.0, 'avg_logprob': -0.28695746830531527, 'compression_ratio': 1.163265306122449, 'no_speech_prob': 0.020430153235793114}, {'id': 1, 'seek': 288, 'start': 2.88, 'end': 30.88, 'text': ' I think my calls have been going past what they should be.', 'tokens': [50364, 286, 519, 452, 5498, 362, 668, 516, 1791, 437, 436, 820, 312, 13, 51764], 'temperature': 0.0, 'avg_logprob': -0.7447051405906677, 'compression_ratio': 0.90625, 'no_speech_prob': 0.02010437287390232}], 'language': 'english'}
```

** When transcribing or translating audio that has been split by turn, make sure the length of each turn is longer than 0 before running the script on the short audio files.
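Given a transcribe() result like the one shown above, the span each segment covers falls out of the "start" and "end" fields directly. A small sketch; `segment_durations` is our own helper, and the sample dictionary is trimmed down from the output above:

```python
def segment_durations(result: dict) -> list:
    """Seconds of audio covered by each segment of a transcribe() result."""
    return [seg["end"] - seg["start"] for seg in result["segments"]]

# trimmed-down version of the sample result shown above
sample = {
    "text": " It's now a good time for a call. ...",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.88},
        {"id": 1, "start": 2.88, "end": 30.88},
    ],
    "language": "english",
}
durations = segment_durations(sample)  # [2.88, 28.0] (up to float rounding)
```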

Owner

  • Name: Computational Language Understanding Lab (CLU Lab) at University of Arizona
  • Login: clulab
  • Kind: organization
  • Location: Tucson, AZ

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'LIvES Project: Code and Code Documentation'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Steven John
    family-names: Bethard
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0001-9560-6491'
  - given-names: Damian Yukio
    family-names: Romero Diaz
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0003-4661-0296'
  - given-names: John
    family-names: Culnan
    affiliation: University of Arizona
    orcid: 'https://orcid.org/0000-0001-7327-1053'
  - given-names: Yiyun
    family-names: Zhao
    affiliation: University of Arizona
repository-code: 'https://github.com/clulab/lives'
abstract: >-
  Python code for the analysis and creation of
  machine-learned models for the NIH-funded project
  "Using natural language processing to determine
  predictors of healthy diet and physical activity
  behavior change in ovarian cancer survivors." For
  more information, please consult our Data
  Management Plan at:
  https://dmphub.cdlib.org/dmps/doi:10.48321/D1BK5T


  NIH NCI R21CA256680 (MPI: Crane/Bethard) for
  01/01/2021-12/31/2022.
keywords:
  - Behavioral intervention
  - Cancer research
  - Machine learning
  - Natural language processing
license: MIT
version: 0.1.0
date-released: '2022-10-04'

GitHub Events

Total
  • Watch event: 1
  • Push event: 2
Last Year
  • Watch event: 1
  • Push event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 16
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 hours
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.06
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • damian-romero (3)
Pull Request Authors
  • damian-romero (13)
  • jmculnan (2)
  • bethard (1)
Top Labels
Issue Labels
enhancement (2)
Pull Request Labels
enhancement (3)

Dependencies

requirements.txt pypi
  • pyannote.audio >=2.0
  • pyannote.core *
  • pyannote.database *
  • pyannote.metrics *
  • tqdm *
environment.yaml conda
  • pip3
  • python 3.10