https://github.com/dcavar/elan2split

Split ELAN Annotation Files and corresponding speech files into a corpus format for common ASR and Forced Aligners

https://github.com/dcavar/elan2split

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

cpp11 elan forced-alignment sox speech-corpus speech-recognition xerxes xml
Last synced: 4 months ago · JSON representation

Repository

Split ELAN Annotation Files and corresponding speech files into a corpus format for common ASR and Forced Aligners

Basic Info
  • Host: GitHub
  • Owner: dcavar
  • License: apache-2.0
  • Language: C++
  • Default Branch: master
  • Homepage: http://damir.cavar.me/
  • Size: 16.6 KB
Statistics
  • Stars: 10
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Topics
cpp11 elan forced-alignment sox speech-corpus speech-recognition xerxes xml
Created over 7 years ago · Last pushed over 7 years ago
Metadata Files
Readme License

README.md

ELAN2split

(C) 2015-2018 by Damir Cavar <dcavar@iu.edu>

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Purpose

ELAN2split is a tool that creates pairs of audio/transcription files that correspond to time-aligned segments in an ELAN file. Each time-aligned segment is saved in two files, the trimmed WAVE-file from the original recording and the transcription or annotation text in the corresponding tier that can be selected via command line. This corpus is ideal to build and train a Forced Aligner and initial speech corpora and subsequently to train speech recognizers. I built this tool to work with the Prosodylab-Aligner.

This is a command line tool. It does not come with a graphical interface.

Download Binaries

The following binaries require only a sox binary in the system's PATH:

On Mac OSX or Linux make sure that the unzipped binary has executable flags. You can set the executable flag using:

chmod +x elan2split

On all systems (Windows since Windows 7) you can run the following command in the command line tool to test wether SoX is installed on your system:

where sox

On Windows 10 with the optional Windows Subsystem for Linux and bash installed you can follow the same compile and build instructions of Ubuntu as described below in the Compilation section.

Introduction

This code emerged out of the need to create data or corpora for the development of speech recognizers or other speech processing tools, as for example Forced Alignment for audio and textual transcription.

The program will perform various tasks on ELAN Annotation Files and corresponding audio media files. It has been tested only on ELAN version 4.x and WAV (Microsoft) signed 16 bit PCM audio files.

The program can perform the following tasks:

  • list the labels of all annotation tiers
  • list all associated media files and check whether the media files are accessible
  • segment the audio based on the annotation intervals of any existing tier and generate pairs of audio and annotation text files in a specific location

Command Line

To list all tier labels in an ELAN annotation file (here for example test.eaf):

elan2split -l test.eaf

To list all associated media files in the eaf-file:

elan2split -m test.eaf

To extract all annotation intervals as individual WAV- and text-files:

elan2split test.eaf

This will use the first tier to perform the split. It will output all files in the current working directory.

If you want to specify the output directory (it will be automatically created, if not existing):

elan2split -o ./hr-corpus test.eaf

More options are explained in the help text:

elan2split -h

The putput files will have specific names. The names are using the ELAN tier ID, the start and end time of the annotation interval in milliseconds, and the ending .wav and .txt accordingly. The file endings for the text-files can be specified to ".lab" for example using the following option:

elan2split -s .lab

The tier name can be added to the text-file name (preceding the suffix) using the following option:

elan2split -n

In the current version all time metric is assumed to be using milliseconds in the ELAN annotation file. This will be more flexible in future. Make sure that your time scale in your ELAN annotation is using milliseconds.

Requirements

Runtime

The segmentation is based on SoX. Please make sure that the sox binary is accessible via command line and that the folder with the sox binary is in the system's PATH variable.

ELAN Annotation Files (eaf) are XML files. When processing eaf-files, the XML validator requires internet acces to fetch the eaf-schema specification (if specified in the eaf-file header). In some future release I will provide the option to specify some local copy of the schema for validation to remove the requirement to be online during runtime.

Compilation

The code is common C++11. Besides the common compiler environment it is using the following extensions and libraries:

  • CMake version 3.2.2 or newer (although 3.0 might work as well)
  • Boost libraries version 1.47.0 or newer, and in particular the Boost components program_options, filesystem, system
  • Xerces C++

To compile the binary run the following commands in the source folder on some *nix-type system with the necessary development environment and libraries installed:

cmake .
make

This should generate the binary elan2split.

Owner

  • Name: Damir Cavar
  • Login: dcavar
  • Kind: user
  • Location: Bloomington, IN
  • Company: Indiana University

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 2
  • Total Committers: 2
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.5
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Damir Cavar d****r@m****m 1
Damir Cavar d****r@g****m 1
Committer Domains (Top 20 + Academic)
me.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels