Audio Feature Extraction — FFT & MFCCs

The FFT converts audio to the frequency domain; MFCCs are compact, perceptually motivated features that power classical ASR systems.

Intermediate · 20 min read

From Waveform to Features

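Raw waveforms are hard to learn from: at 16 kHz, one second of audio is 16,000 samples, and the cues that distinguish speech sounds lie in the frequency content rather than in individual sample values. The classical pipeline is waveform → STFT → mel spectrogram → MFCCs.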
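To make the STFT step concrete, here is a minimal NumPy sketch: slice the signal into overlapping frames, apply a Hann window, and take a real FFT of each frame. The helper name `stft_power` and the 1 kHz test tone are illustrative assumptions, not part of librosa. Unlike librosa's default, this sketch does not pad or center the frames, so frame counts will differ slightly.

```python
import numpy as np

def stft_power(signal, n_fft=512, win_length=400, hop_length=160):
    """Power spectrogram via framing, Hann windowing, and a real FFT (no padding)."""
    n_frames = 1 + (len(signal) - win_length) // hop_length
    window = np.hanning(win_length)
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each 400-sample frame to n_fft before transforming
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return (np.abs(spectrum) ** 2).T  # shape: (n_fft // 2 + 1, n_frames)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)  # hypothetical input: 1 s, 1 kHz sine

power = stft_power(tone)
peak_bin = int(power.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512  # bin index -> frequency in Hz
print(power.shape, peak_hz)
```

Each column of `power` is the spectrum of one 25 ms frame, and the strongest bin maps back to the tone's frequency. A mel spectrogram is this matrix multiplied by a bank of triangular filters spaced on the mel scale.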

import librosa
import numpy as np

audio, sr = librosa.load("speech.wav", sr=16000, mono=True)

# 1. Mel spectrogram: 25 ms windows (400 samples) with a 10 ms hop (160 samples) at 16 kHz
mel_spec = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_mels=80, n_fft=512, hop_length=160, win_length=400, fmin=80, fmax=7600,
)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)  # power -> dB, relative to the peak
print(f"Mel spectrogram shape: {mel_db.shape}")  # (80, time_frames)

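# 2. MFCCs: compact, decorrelated features.
# Extra keyword arguments are forwarded to melspectrogram, so the front end matches step 1
# (otherwise librosa falls back to its defaults, n_fft=2048 and n_mels=128).
mfccs   = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13, n_mels=80, n_fft=512, hop_length=160, win_length=400, fmin=80, fmax=7600,
)
delta   = librosa.feature.delta(mfccs)           # first-order (velocity) coefficients
delta2  = librosa.feature.delta(mfccs, order=2)  # second-order (acceleration) coefficients
features = np.vstack([mfccs, delta, delta2])  # 39-dim feature vector per frame
print(f"MFCC feature shape: {features.shape}")  # (39, time_frames)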
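The step from a log-mel spectrogram to MFCCs is a discrete cosine transform along the mel axis, keeping only the first few coefficients. The sketch below spells that out in plain NumPy. The helper `dct2_ortho` is a hypothetical name, and the random `log_mel` array is a placeholder standing in for a real (n_mels, frames) log-mel spectrogram. It assumes an orthonormal DCT-II, which is the transform type librosa's `mfcc` uses by default.

```python
import numpy as np

def dct2_ortho(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II matrix for length-n_in inputs."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)  # extra scaling on the k = 0 row for orthonormality
    return basis

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(80, 100))  # placeholder data, not real speech

# Project 80 mel bands onto the 13 lowest-order cosine bases, frame by frame
mfcc = dct2_ortho(13, 80) @ log_mel
print(mfcc.shape)
```

Truncating to the low-order coefficients keeps the smooth spectral envelope and discards fine detail across mel bands, which is why 13 values per frame can stand in for 80.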

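NOTE: Modern ASR often skips MFCCs. End-to-end models such as Whisper take log-mel spectrograms as input and learn higher-level features from them, starting with a small convolutional front end ahead of the Transformer encoder. MFCCs still matter for understanding traditional systems and for resource-constrained environments.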


Part of the Speech Recognition & LLM Engineering series on Tekivex.