• Hynek Boril, Ph.D.


      Associate Professor

      Electrical and Computer Engineering Department

      University of Wisconsin-Platteville

Tools


QCN-RASTALP

QCN-RASTALP is designed to compensate for the cepstral variance introduced by channel variations, additive noise, and the Lombard effect.

Quantile-based cepstral dynamics normalization (QCN) builds on the concept of cepstral mean and variance normalization (CMN, CVN). CMN and CVN assume that the distributions of cepstral coefficients are Gaussian (CVN), or at least symmetric about their means (CMN). However, analyses have shown that the low-order cepstral coefficients in particular tend to be skewed and multimodal (having more than one significant peak) in clean speech, and that the presence of noise and changes in talking style considerably affect the distribution shapes and skewness. When the actual cepstral distributions drift from Gaussian, the mean and variance become less effective in representing the actual range in which the cepstral samples occur, i.e., their 'dynamic range'. QCN instead determines the dynamic range from quantiles of the cepstral histogram, which bound a certain portion of the sample occurrences. Rather than the distribution means, the quantile means are subtracted from all samples, and the dynamic range is then normalized to unity.
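As a rough illustration of the normalization described above, the following Python sketch processes each cepstral dimension independently. The function names, the 4th/96th-percentile setting (`low_q=0.04`), and the use of the midpoint between the two quantiles as the 'quantile mean' are assumptions made for this example; it is not the distributed QCN source code.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile (0 <= q <= 1) of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def qcn(frames, low_q=0.04):
    """Quantile-based normalization of a list of cepstral frames.

    frames: list of equal-length lists (one list of coefficients per frame).
    low_q:  assumed lower quantile; the upper quantile is 1 - low_q.
    """
    n_coeffs = len(frames[0])
    normalized = [list(f) for f in frames]
    for k in range(n_coeffs):
        track = sorted(f[k] for f in frames)
        q_lo = quantile(track, low_q)
        q_hi = quantile(track, 1.0 - low_q)
        center = 0.5 * (q_lo + q_hi)   # assumed form of the "quantile mean"
        spread = q_hi - q_lo           # quantile-bounded dynamic range
        if spread == 0.0:
            spread = 1.0               # constant track: only remove the offset
        for f in normalized:
            f[k] = (f[k] - center) / spread
    return normalized
```

With this choice, samples lying exactly at the lower and upper quantiles map to -0.5 and +0.5, so the quantile-bounded range has unit width regardless of how skewed the distribution is between them.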

QCN-RASTALP combines QCN with low-pass temporal filtering adopted from RASTA. The original RASTA performs band-pass filtering in which the high-pass component corresponds to CMN; in QCN-RASTALP, the high-pass filtering is bypassed and only the low-pass component is preserved. The new low-pass filter significantly reduces the transient distortions seen in the original RASTA and allows the sub-optimal CMN to be replaced by other cepstral compensations.
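The effect of keeping only a low-pass path can be illustrated with a simple stand-in filter. The sketch below uses a first-order recursive low-pass, which is not the actual RASTALP filter (its coefficients are not given on this page); the function name and the smoothing constant `alpha` are hypothetical.

```python
def lowpass_trajectories(frames, alpha=0.5):
    """Smooth each coefficient track over time.

    Implements y[t] = alpha * x[t] + (1 - alpha) * y[t-1] per coefficient,
    a generic first-order low-pass used here only as a stand-in.
    frames: list of equal-length lists (one list of coefficients per frame).
    """
    if not frames:
        return []
    out = [list(frames[0])]            # initialize the state with the first frame
    for frame in frames[1:]:
        prev = out[-1]
        out.append([alpha * x + (1.0 - alpha) * p for x, p in zip(frame, prev)])
    return out
```

A constant trajectory passes through unchanged, so the mean is left for a separate compensation such as QCN to handle, while rapid frame-to-frame fluctuations are attenuated.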

Source code for QCN-RASTALP and RASTALP is provided below.

References

    Boril, H., Hansen, J. H. L. (2010). “Unsupervised Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments,” IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1379-1393. [pdf] [bib]

    Boril, H., Hansen, J. H. L. (2011). “UT-Scope: Towards LVCSR under Lombard Effect Induced by Varying Types and Levels of Noisy Background,” IEEE ICASSP'11, 4472-4475, Prague, Czech Republic, May 2011. [pdf] [bib]

Rectangular Filterbank Cepstral Coefficients (RFCC)

The RFCC front-end is inspired by perceptual linear prediction (PLP) cepstral features. The original Bark-frequency trapezoid filters are replaced by a bank of 24 uniform, non-overlapping rectangular filters distributed over a linear frequency scale. RFCC was initially proposed for robust ASR in noisy/Lombard speech conditions (20Bands-LPC) [Boril and Hansen, 2010]. RFCCs are extracted using CTUCopy, an open-source feature extraction and enhancement tool.
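The filterbank stage alone can be sketched in a few lines of Python. This is only an illustration of uniform, non-overlapping rectangular bands on a linear frequency axis; the function name is hypothetical, the bands are assumed to span the whole supplied spectrum, and the remaining steps of the PLP-style chain (as performed by CTUCopy) are not shown.

```python
def rectangular_filterbank(power_spectrum, n_bands=24):
    """Sum power-spectrum bins in n_bands equal-width, non-overlapping bands.

    power_spectrum: list of non-negative values on a linear frequency axis.
    Band edges are placed by integer division, so every bin falls in exactly
    one band and band widths differ by at most one bin.
    """
    n_bins = len(power_spectrum)
    energies = []
    for b in range(n_bands):
        start = b * n_bins // n_bands
        stop = (b + 1) * n_bins // n_bands
        energies.append(sum(power_spectrum[start:stop]))
    return energies
```

Because the filters have unit weight and do not overlap, the band energies always add up to the total energy of the input spectrum.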

The RFCC features used in the CRSS NIST-SRE'12 submission were extracted using the config file below.

  • CTUConfig_RFCC.txt - RFCC config file for CTUCopy. Note that dithering is turned off in the RFCC front-end; however, external dithering was applied to the original wave files prior to feature extraction to handle digital silences.

References

    Boril, H., Hansen, J. H. L. (2010). “Unsupervised Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments,” IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 1379-1393. [pdf] [bib]

    CTUCopy Toolkit Download Page

    CTUCopy Manpage

Arkread, arkwrite, ark2scp, htk2ark - Matlab and C Functions for Reading and Generating Kaldi's ark Feature Files

Arkread() reads the content of an ark feature file into Matlab matrices. Arkwrite() produces an ark file from Matlab index and feature matrices. Ark2scp() produces a complementary Kaldi-style scp list file for an ark file. Htk2ark() converts HTK feature files to Kaldi ark feature files. The function takes a list of input HTK feature files, stores them in a single ark file, and also generates a complementary Kaldi scp file with the list of utterance files and fast access addresses.
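To show what the 'fast access addresses' look like, here is a small Python sketch that splits one scp entry into its parts. It assumes the common Kaldi form `<utterance-id> <ark-path>:<byte-offset>`; the function name is hypothetical and is not part of the Matlab or C tools listed here.

```python
def parse_scp_line(line):
    """Split a Kaldi-style scp entry into (utterance_id, ark_path, offset).

    The offset is returned as None when the entry carries no byte offset.
    """
    key, rxfilename = line.strip().split(None, 1)
    path, sep, offset = rxfilename.rpartition(":")
    if sep and offset.isdigit():
        return key, path, int(offset)
    return key, rxfilename, None
```

For example, `parse_scp_line("utt1 /data/feats.ark:17")` returns `("utt1", "/data/feats.ark", 17)`, where the path and offset are made-up values.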

Matlab codes:

C code:

Kaldi Project Page:

Pitch Tracker DTFE (Direct Time Domain Frequency Estimator)

DTFE (also denoted DFE) is a novel algorithm for fundamental frequency estimation and voiced/unvoiced (V/UV) classification performed directly in the time domain. The algorithm is designed to provide real-time pitch detection with time and frequency resolution comparable or superior to autocorrelation-based schemes while significantly reducing computational costs. The DTFE algorithm comprises spectral shaping, adaptive thresholding, and F0 candidate selection based on consistency criteria. The primary application is on clean speech signals (close-talk channels).

  • dtfe.exe - MS Windows binary, requires Microsoft .NET framework (available through MS Windows Update)

References

    Boril, H. (2008). “Robust speech recognition: Analysis and equalization of Lombard effect in Czech corpora,” Ph.D. dissertation, Czech Technical University in Prague, Czech Republic (Section 4.1, pp. 30-41). [pdf] [bib]

    Boril, H. and Pollák, P. (2004). “Direct time domain fundamental frequency estimation of speech in noisy conditions,” in Proc. EUSIPCO 2004, volume 1, 1003-1006 (Vienna, Austria). [pdf] [bib]

    Boril, H. and Pollák, P. (2006). “Pitch-marking Based on the DFE Algorithm.” Lecture, 6th ECESS and TC-STAR WP3 Meeting (Berlin, Germany). [pdf] [bib]

Executing dtfe.exe

  • DTFE is executed with two command line parameters: 'dtfe.exe <input_wave_file> <output_F0_text_file>'.
  • The input is required to be a single-channel (mono) sound file in the Windows PCM '.wav' format. The following sampling frequencies are supported: 8k, 16k, 22.05k, 32k, 44.1k, 48k, 96k, 192k (Hz).
  • The output text file contains two columns: F0 estimates in the first column, followed by the corresponding time labels in seconds in the second column. Note that each frequency estimate occupies two consecutive lines representing the F0 sample's onset and offset times, respectively. The exceptions are the first and last rows of the file, which mark the onset of the first voiced island and the offset of the last voiced island in the waveform, respectively - see the example below.
  • DTFE can be conveniently executed from Matlab - see the example code below.
  • The wav file used in the example can be downloaded here.

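    inputFile = 'example.wav';     % single-channel PCM wav file
    outputFile = 'example.txt';    % text file that will receive the F0 track
    % Run DTFE through the Windows command shell
    dos(['dtfe.exe ' inputFile ' ' outputFile]);
    % Column 1 holds the F0 estimates, column 2 the time labels in seconds
    [frequency, time] = textread(outputFile, '%f %f');
    plot(time, frequency);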

    DTFE Example
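For users without Matlab, the same two steps (running the binary, then reading the two-column text file) can be sketched in Python. The function names are hypothetical, and the parser relies only on the column layout described above; it makes no attempt to pair onset and offset rows.

```python
import subprocess


def run_dtfe(wav_path, f0_path, exe="dtfe.exe"):
    """Call the DTFE binary as: dtfe.exe <input_wave_file> <output_F0_text_file>."""
    subprocess.run([exe, wav_path, f0_path], check=True)


def parse_dtfe_output(text):
    """Split the two-column output text into F0 and time lists."""
    frequency, time = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        frequency.append(float(fields[0]))
        time.append(float(fields[1]))
    return frequency, time
```

After `run_dtfe("example.wav", "example.txt")`, the file content can be read with `open("example.txt").read()` and passed to `parse_dtfe_output`, which returns the F0 and time columns as two lists ready for plotting.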

 

Last Updated 9-3-2021