There is some debate in the community regarding the use of the DCT, instead of directly using the log Mel fiterbank features, particularly for deep neural network based acoustic models. Some research groups, like Google, use filterbanks (fbanks) while Kaldi mostly uses MFCCs, especially in its TDNN chain models. Since filterbank energies are correlated and cannot be used directly with a Gaussian mixture with diagonal covariance, we apply a discrete cosine transform (DCT) to decorrelate them.
Here is Dan Povey’s take on this:
The reason we use MFCC is because they are more easily compressible, being decorrelated; we dump them to disk with compression to 1 byte per coefficient. But we dump all the coefficients, so it’s equivalent to filterbanks times a full-rank matrix, no information is lost.
手机扫一扫
移动阅读更方便
你可能感兴趣的文章