https://github.com/12sqawdwq/cough_detection

Cough sound detection, including private modifications of open-source code and self-written file conversion scripts

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.9%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Cough sound detection, including private modifications of open-source code and self-written file conversion scripts

Basic Info
  • Host: GitHub
  • Owner: 12sqawdwq
  • Language: Python
  • Default Branch: main
  • Size: 1.3 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago

https://github.com/12sqawdwq/cough_detection/blob/main/

# PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This repo contains code for our paper: **PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition** [1]. A variety of CNNs are trained on the large-scale AudioSet dataset [2], which contains 5000 hours of audio with 527 sound classes. A mean average precision (mAP) of 0.439 is achieved using our proposed Wavegram-Logmel-CNN system, outperforming the Google baseline of 0.317 [3]. The PANNs have been used for audio tagging and sound event detection. They have also been fine-tuned on several audio pattern recognition tasks, where they outperformed several state-of-the-art systems.

## Environments
The codebase is developed with Python 3.7. Install requirements as follows:
```
pip install -r requirements.txt
```

## Audio tagging using pretrained models
Users can infer the tags of an audio recording using pretrained models without training. Details can be viewed at [scripts/0_inference.sh](scripts/0_inference.sh). First, download one pretrained model from https://zenodo.org/record/3987831, for example, the model named "Cnn14_mAP=0.431.pth". Then, execute the following commands to run inference on this [audio](resources/R9_ZSCveAHg_7s.wav):
```
CHECKPOINT_PATH="Cnn14_mAP=0.431.pth"
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_mAP%3D0.431.pth?download=1
MODEL_TYPE="Cnn14"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path="resources/R9_ZSCveAHg_7s.wav" \
    --cuda
```

The result is then printed to the screen and looks like:
```
Speech: 0.893
Telephone bell ringing: 0.754
Inside, small room: 0.235
Telephone: 0.183
Music: 0.092
Ringtone: 0.047
Inside, large room or hall: 0.028
Alarm: 0.014
Animal: 0.009
Vehicle: 0.008
embedding: (2048,)
```
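The listing above shows the highest-scoring classes in descending order of probability. As a minimal sketch of that ranking step (not the code in `pytorch/inference.py`), the helper below sorts label and score pairs and keeps the top `k`. The function name `top_k_tags` is hypothetical, and the scores are copied from the example output in shuffled order.

```python
def top_k_tags(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Sort by score, highest first, then keep the first k entries.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Values taken from the example output above, deliberately out of order.
labels = ["Music", "Speech", "Telephone", "Telephone bell ringing", "Inside, small room"]
scores = [0.092, 0.893, 0.183, 0.754, 0.235]

for label, score in top_k_tags(scores, labels, k=3):
    print(f"{label}: {score:.3f}")
```

With the real model, `scores` would be the 527-element clip-level output and `labels` the AudioSet class names.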

To use the 16 kHz model for inference instead, run:
```
CHECKPOINT_PATH="Cnn14_16k_mAP=0.438.pth"   # Trained by a later code version, achieves higher mAP than the paper.
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_16k_mAP%3D0.438.pth?download=1
MODEL_TYPE="Cnn14_16k"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py audio_tagging \
    --sample_rate=16000 \
    --window_size=512 \
    --hop_size=160 \
    --mel_bins=64 \
    --fmin=50 \
    --fmax=8000 \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path='resources/R9_ZSCveAHg_7s.wav' \
    --cuda
```
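The 16 kHz options halve the window and hop sizes along with the sample rate, so the time resolution of the spectrogram frontend stays the same. The sketch below works through that arithmetic and compares it with the 32 kHz settings that appear in the training and fine-tuning commands of this README (`sample_rate=32000`, `window_size=1024`, `hop_size=320`). The helper name `frontend_summary` is hypothetical.

```python
def frontend_summary(sample_rate, window_size, hop_size):
    """Convert STFT settings given in samples into time-based quantities."""
    return {
        "window_ms": 1000.0 * window_size / sample_rate,  # analysis window length
        "hop_ms": 1000.0 * hop_size / sample_rate,        # step between frames
        "frames_per_second": sample_rate / hop_size,
        "nyquist_hz": sample_rate / 2,                    # upper bound for fmax
    }


cfg_32k = frontend_summary(sample_rate=32000, window_size=1024, hop_size=320)
cfg_16k = frontend_summary(sample_rate=16000, window_size=512, hop_size=160)

print("32 kHz:", cfg_32k)
print("16 kHz:", cfg_16k)
```

Both configurations use a 32 ms window with a 10 ms hop, giving 100 frames per second. The `--fmax=8000` option equals the Nyquist frequency of the 16 kHz model, while `fmax=14000` stays below the 16 kHz Nyquist limit of the 32 kHz model.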

## Sound event detection using pretrained models
Some PANNs, namely DecisionLevelMax (the best), DecisionLevelAvg, and DecisionLevelAtt, can be used for frame-wise sound event detection. For example, execute the following commands to run sound event detection on this [audio](resources/R9_ZSCveAHg_7s.wav):

```
CHECKPOINT_PATH="Cnn14_DecisionLevelMax_mAP=0.385.pth"
wget -O $CHECKPOINT_PATH https://zenodo.org/record/3987831/files/Cnn14_DecisionLevelMax_mAP%3D0.385.pth?download=1
MODEL_TYPE="Cnn14_DecisionLevelMax"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/inference.py sound_event_detection \
    --model_type=$MODEL_TYPE \
    --checkpoint_path=$CHECKPOINT_PATH \
    --audio_path="resources/R9_ZSCveAHg_7s.wav" \
    --cuda
```

[Figure: visualization of the sound event detection result]

Please see https://www.youtube.com/watch?v=QyFNIhRxFrY for a sound event detection demo.
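Frame-wise detection yields one probability per class per frame. A common way to turn such a sequence into events is to threshold it and merge consecutive active frames into onset and offset times. The sketch below illustrates that idea for a single class. It is not taken from this repository: the function name `frames_to_events`, the 0.5 threshold, the frame rate, and the toy probabilities are all assumptions chosen for illustration.

```python
def frames_to_events(frame_probs, threshold=0.5, frames_per_second=100):
    """Merge consecutive frames with prob >= threshold into (onset_s, offset_s) pairs."""
    events = []
    start = None  # index of the first frame of the currently open event, if any
    for i, prob in enumerate(frame_probs):
        active = prob >= threshold
        if active and start is None:
            start = i                      # an event begins at this frame
        elif not active and start is not None:
            events.append((start / frames_per_second, i / frames_per_second))
            start = None                   # the event ended before this frame
    if start is not None:                  # close an event that runs to the end
        events.append((start / frames_per_second, len(frame_probs) / frames_per_second))
    return events


# Toy per-frame probabilities for one class (made up for illustration).
probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.6, 0.2]
events = frames_to_events(probs, threshold=0.5, frames_per_second=10)
print(events)  # [(0.2, 0.5), (0.7, 0.9)]
```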

For those users who only want to use the pretrained models for inference, we have prepared a **panns_inference** tool which can be easily installed by:
```
pip install panns_inference
```

Please visit https://github.com/qiuqiangkong/panns_inference for details of panns_inference.

## Train PANNs from scratch
Users can train PANNs from scratch as follows.

## 1. Download dataset
The [scripts/1_download_dataset.sh](scripts/1_download_dataset.sh) script is used for downloading all audio and metadata from the internet. The total size of AudioSet is around 1.1 TB. Note that some videos may no longer be available on YouTube, so the number of files downloaded can vary over time. Our downloaded version contains 20550 / 22160 of the balanced training subset, 1913637 / 2041789 of the unbalanced training subset, and 18887 / 20371 of the evaluation subset.
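To put those counts in perspective, the short sketch below computes the fraction of each subset that was successfully downloaded. It uses only the numbers quoted above; the variable and function names are illustrative.

```python
# (downloaded, listed) clip counts quoted in this README
subsets = {
    "balanced_train": (20550, 22160),
    "unbalanced_train": (1913637, 2041789),
    "eval": (18887, 20371),
}


def coverage(downloaded, listed):
    """Fraction of listed clips that were actually downloaded."""
    return downloaded / listed


for name, (downloaded, listed) in subsets.items():
    print(f"{name}: {downloaded} / {listed} = {coverage(downloaded, listed):.1%}")
```

Each subset comes out at roughly 93% of the listed clips, so a fresh download can be expected to be somewhat smaller than the official segment lists.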

For reproducibility, our downloaded dataset can be accessed at [https://pan.baidu.com/s/13WnzI1XDSvqXZQTS-Kqujg](https://pan.baidu.com/s/13WnzI1XDSvqXZQTS-Kqujg), password: 0vc2

The downloaded data looks like:

dataset_root
├── audios
│    ├── balanced_train_segments
│    │    └── ... (~20550 wavs, the number can be different from time to time)
│    ├── eval_segments
│    │    └── ... (~18887 wavs)
│    └── unbalanced_train_segments
│         ├── unbalanced_train_segments_part00
│         │    └── ... (~46940 wavs)
│         ...
│         └── unbalanced_train_segments_part40
│              └── ... (~39137 wavs)
└── metadata
     ├── balanced_train_segments.csv
     ├── class_labels_indices.csv
     ├── eval_segments.csv
     ├── qa_true_counts.csv
     └── unbalanced_train_segments.csv

## 2. Pack waveforms into hdf5 files
The [scripts/2_pack_waveforms_to_hdf5s.sh](scripts/2_pack_waveforms_to_hdf5s.sh) script packs all raw waveforms into 43 large hdf5 files to speed up training: one for the balanced training subset, one for the evaluation subset, and 41 for the unbalanced training subset. The packed files look like:

workspace
└── hdf5s
     ├── targets (2.3 GB)
     │    ├── balanced_train.h5
     │    ├── eval.h5
     │    └── unbalanced_train
     │        ├── unbalanced_train_part00.h5
     │        ...
     │        └── unbalanced_train_part40.h5
     └── waveforms (1.1 TB)
          ├── balanced_train.h5
          ├── eval.h5
          └── unbalanced_train
              ├── unbalanced_train_part00.h5
              ...
              └── unbalanced_train_part40.h5

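As a rough sanity check on the 1.1 TB figure for the packed waveforms, the sketch below estimates the storage needed for the downloaded clips. It assumes every clip is stored as 10 seconds of 32 kHz audio with 16-bit integer samples. The clip length and sample format are assumptions made for this estimate, not details stated in this section; the clip counts are the ones quoted in the download step.

```python
# Downloaded clip counts quoted in the "Download dataset" step.
NUM_CLIPS = 20550 + 1913637 + 18887

SAMPLE_RATE = 32000    # Hz, as in the sample_rate=32000 log paths of this README
CLIP_SECONDS = 10      # assumption: each AudioSet clip is stored as 10 s of audio
BYTES_PER_SAMPLE = 2   # assumption: waveforms are stored as 16-bit integers

total_bytes = NUM_CLIPS * SAMPLE_RATE * CLIP_SECONDS * BYTES_PER_SAMPLE

print(f"clips: {NUM_CLIPS}")
print(f"estimated size: {total_bytes / 1e12:.2f} TB ({total_bytes / 2**40:.2f} TiB)")

# The 43 packed files: one balanced, one eval, and parts 00 to 40 of the unbalanced subset.
num_files = 1 + 1 + 41
print(f"hdf5 files per folder tree: {num_files}")
```

Under these assumptions the estimate lands at about 1.25 TB (about 1.14 TiB), which is in the same range as the size shown in the tree above.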
## 3. Create training indexes
The [scripts/3_create_training_indexes.sh](scripts/3_create_training_indexes.sh) script is used for creating training indexes. Those indexes are used for sampling mini-batches.

## 4. Train
The [scripts/4_train.sh](scripts/4_train.sh) script contains training, saving checkpoints, and evaluation.

```
WORKSPACE="your_workspace"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/main.py train \
    --workspace=$WORKSPACE \
    --data_type='full_train' \
    --window_size=1024 \
    --hop_size=320 \
    --mel_bins=64 \
    --fmin=50 \
    --fmax=14000 \
    --model_type='Cnn14' \
    --loss_type='clip_bce' \
    --balanced='balanced' \
    --augmentation='mixup' \
    --batch_size=32 \
    --learning_rate=1e-3 \
    --resume_iteration=0 \
    --early_stop=1000000 \
    --cuda
```

## Results
The CNN models are trained on a single Tesla-V100-PCIE-32GB card. (Training also works on a GPU with 12 GB of memory.) Training takes around 3 to 7 days.

```
Validate bal mAP: 0.005
Validate test mAP: 0.005
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics.pkl
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics_2019-09-21_04-05-05.pickle
iteration: 0, train time: 8.261 s, validate time: 219.705 s
------------------------------------
...
------------------------------------
Validate bal mAP: 0.637
Validate test mAP: 0.431
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics.pkl
    Dump statistics to /workspaces/pub_audioset_tagging_cnn_transfer/statistics/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/statistics_2019-09-21_04-05-05.pickle
iteration: 600000, train time: 3253.091 s, validate time: 1110.805 s
------------------------------------
Model saved to /workspaces/pub_audioset_tagging_cnn_transfer/checkpoints/main/sample_rate=32000,window_size=1024,hop_size=320,mel_bins=64,fmin=50,fmax=14000/data_type=full_train/Cnn13/loss_type=clip_bce/balanced=balanced/augmentation=mixup/batch_size=32/600000_iterations.pth
...
```

A **mean average precision (mAP)** of **0.431** is obtained.

[Figure: Results of PANNs on AudioSet tagging. Dashed and solid lines are training mAP and evaluation mAP, respectively. The six plots show the results with different: (a) architectures; (b) data balancing and data augmentation; (c) embedding size; (d) amount of training data; (e) sampling rate; (f) number of mel bins.]

## Performance of different systems
The top rows show previously proposed methods that use embedding features provided by Google. The previous best system achieved an mAP of 0.369 using large feature-attention neural networks. We propose to train neural networks directly from audio recordings. Our CNN14 achieves an mAP of 0.431, and Wavegram-Logmel-CNN achieves an mAP of 0.439.

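The mAP values reported here are averages of per-class average precision (AP). As a minimal sketch of how such a number is formed, the pure-Python code below computes AP for each class by ranking clips by score, then averages over classes. It is an illustration only: the evaluation code in this repository may compute AP differently (for example in how tied scores are handled), and the function names and toy data are hypothetical.

```python
def average_precision(y_true, y_score):
    """AP for one class: mean precision measured at the rank of each positive clip."""
    # Rank clip indices by descending score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision among the top `rank` clips
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(targets, scores):
    """targets and scores are lists of clips, each a list with one entry per class."""
    num_classes = len(targets[0])
    aps = [
        average_precision([t[c] for t in targets], [s[c] for s in scores])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy example: 4 clips, 2 classes (values made up for illustration).
targets = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.4], [0.1, 0.3]]

print(f"mAP = {mean_average_precision(targets, scores):.4f}")  # mAP = 0.9167
```

In the toy data the first class has AP 5/6 (positives at ranks 1 and 3) and the second has AP 1.0 (positives at ranks 1 and 2), giving an mAP of 11/12.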
## Plot figures of [1]
To reproduce all figures of [1], run:

```
wget -O paper_statistics.zip https://zenodo.org/record/3987831/files/paper_statistics.zip?download=1
unzip paper_statistics.zip
python3 utils/plot_for_paper.py plot_classwise_iteration_map
python3 utils/plot_for_paper.py plot_six_figures
python3 utils/plot_for_paper.py plot_complexity_map
python3 utils/plot_for_paper.py plot_long_fig
```

## Fine-tune on new tasks
After downloading the pretrained models, building fine-tuned systems for new tasks is simple:

```
MODEL_TYPE="Transfer_Cnn14"
CHECKPOINT_PATH="Cnn14_mAP=0.431.pth"
CUDA_VISIBLE_DEVICES=0 python3 pytorch/finetune_template.py train \
    --sample_rate=32000 \
    --window_size=1024 \
    --hop_size=320 \
    --mel_bins=64 \
    --fmin=50 \
    --fmax=14000 \
    --model_type=$MODEL_TYPE \
    --pretrained_checkpoint_path=$CHECKPOINT_PATH \
    --cuda
```

Here is an example of fine-tuning PANNs to GTZAN music classification: https://github.com/qiuqiangkong/panns_transfer_to_gtzan

## Demos
We apply the audio tagging system to build a sound event detection (SED) system. The SED prediction is obtained by applying the audio tagging system on consecutive 2-second segments. The demo video can be viewed at:
https://www.youtube.com/watch?v=7TEtDMzdLeY

## FAQs
If you run into an out-of-memory error, try reducing the batch size.

## Cite
[1] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2880-2894.

## Reference
[2] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M. and Ritter, M. "Audio Set: An ontology and human-labeled dataset for audio events." In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776-780, 2017.

[3] Hershey, S., Chaudhuri, S., Ellis, D.P., Gemmeke, J.F., Jansen, A., Moore, R.C., Plakal, M., Platt, D., Saurous, R.A., Seybold, B. and Slaney, M. "CNN architectures for large-scale audio classification." In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131-135, 2017.

## External links
Other work on music transfer learning includes:
https://github.com/jordipons/sklearn-audio-transfer-learning
https://github.com/keunwoochoi/transfer_learning_music
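
The Demos section describes obtaining SED predictions by running the audio tagging system on consecutive 2-second segments. A minimal sketch of that segmentation step is given below. The function name `split_into_segments` is hypothetical, and dropping a trailing partial segment is an assumption made here for simplicity rather than documented behavior of the demo.

```python
def split_into_segments(waveform, sample_rate, segment_seconds=2.0):
    """Split a 1-D sequence of samples into consecutive, non-overlapping segments."""
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    # Step through the waveform one full segment at a time; a trailing
    # remainder shorter than seg_len is dropped (assumption for this sketch).
    return [
        waveform[start:start + seg_len]
        for start in range(0, len(waveform) - seg_len + 1, seg_len)
    ]


# Toy signal: 10 samples at 2 Hz, so each 2-second segment holds 4 samples.
toy_waveform = list(range(10))
segments = split_into_segments(toy_waveform, sample_rate=2, segment_seconds=2.0)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each segment would then be passed to the tagging model, and the per-segment class probabilities laid out along the time axis.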

Owner

  • Name: victor lucifer Wilson
  • Login: 12sqawdwq
  • Kind: user

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
  • Create event: 2
Last Year
  • Watch event: 1
  • Push event: 1
  • Create event: 2