Spectral Descriptors
Spectral descriptors are computed from the FFT of each audio frame. A Hann window is applied to the frame before the FFT.
Spectral Flatness
ID: flatness |
2
Spectral flatness indicates how noisy versus tonal a sound is. A high flatness means the spectrum is uniform like white noise, while a low flatness shows clear peaks, like a sustained musical note.
-
Equation
\[Flatness = \frac{\exp\left( \frac{1}{K} \sum_{k=0}^{K-1} \ln(|X[k]|^2) \right)}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|^2}\]
-
Notes
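As a rough illustration of the flatness equation, the sketch below computes the ratio of the geometric mean to the arithmetic mean of the power spectrum in plain Python. The function name and the `eps` floor are assumptions made for this example, not part of OpenScofo.

```python
import math

def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean of the power spectrum divided by its arithmetic mean."""
    power = [m * m for m in magnitudes]
    k = len(power)
    # eps is an assumed floor so that ln() stays finite for empty bins.
    log_mean = sum(math.log(max(p, eps)) for p in power) / k
    arithmetic_mean = sum(power) / k
    return math.exp(log_mean) / max(arithmetic_mean, eps)

noise_like = spectral_flatness([1.0, 1.0, 1.0, 1.0])       # uniform spectrum
tone_like = spectral_flatness([1.0, 0.001, 0.001, 0.001])  # one dominant bin
```

A uniform spectrum gives a value close to 1, while a single dominant bin pushes the value toward 0.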
Spectral Flux
ID: flux |
6
Spectral flux measures how quickly the spectrum of a sound changes over time. High flux indicates sudden changes or transients, like drum hits, while low flux corresponds to steady, continuous sounds.
-
Equation
\(Flux = \sum_{k=1}^{K-1} \max(0, |X[k]| - |X_{prev}[k]|)\)
-
Notes
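The flux equation keeps only the bins whose magnitude increased since the previous frame. A minimal sketch, assuming both frames are plain lists of magnitudes of equal length (the function name is hypothetical):

```python
def spectral_flux(magnitudes, previous):
    """Sum of positive magnitude increases, starting at bin 1 as in the equation."""
    return sum(
        max(0.0, cur - prev)
        for cur, prev in zip(magnitudes[1:], previous[1:])
    )

# Bin 1 rises by 0.5, bin 2 falls (ignored), bin 3 is unchanged.
flux = spectral_flux([0.0, 1.0, 2.0, 0.0], [0.0, 0.5, 3.0, 0.0])
```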
Spectral Irregularity
ID: irregularity
Spectral irregularity quantifies how uneven or jagged a spectrum is between adjacent frequency bins. High irregularity indicates complex, inharmonic, or noisy timbres, while low values suggest smooth, harmonic sounds.
-
Equation
\(Irregularity = \frac{\sum_{k=1}^{K-1} (|X[k]| - |X[k-1]|)^2}{\sum_{k=0}^{K-1} |X[k]|^2}\)
-
Notes
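The irregularity equation can be sketched as the squared difference between neighbouring bins, normalized by the total squared magnitude. The function name and the zero-energy guard are assumptions for illustration:

```python
def spectral_irregularity(magnitudes):
    """Squared bin-to-bin differences divided by the total squared magnitude."""
    numerator = sum(
        (magnitudes[k] - magnitudes[k - 1]) ** 2
        for k in range(1, len(magnitudes))
    )
    denominator = sum(m * m for m in magnitudes)
    # Assumed convention: a silent frame has zero irregularity.
    return numerator / denominator if denominator > 0 else 0.0

smooth = spectral_irregularity([1.0, 1.0, 1.0, 1.0])
jagged = spectral_irregularity([1.0, 0.0, 1.0, 0.0])
```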
Spectral Crest
ID: crest |
6
Spectral crest measures the ratio of the highest spectral peak to the average spectral amplitude. A high crest indicates a tone dominated by strong harmonics or transients, while a low crest corresponds to more even, noise-like spectra.
-
Equation
\(Crest = \frac{\max_{k} |X[k]|}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|}\)
-
Notes
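A short sketch of the crest equation, dividing the largest magnitude by the mean magnitude. The function name and the handling of a silent frame are assumptions:

```python
def spectral_crest(magnitudes):
    """Largest magnitude divided by the mean magnitude."""
    mean = sum(magnitudes) / len(magnitudes)
    # Assumed convention: return 0 for a silent frame instead of dividing by 0.
    return max(magnitudes) / mean if mean > 0 else 0.0

flat_crest = spectral_crest([1.0, 1.0, 1.0, 1.0])  # every bin equals the mean
peak_crest = spectral_crest([4.0, 0.0, 0.0, 0.0])  # all energy in one bin
```

For a single non-zero bin the crest equals the number of bins, which is its upper bound.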
Spectral Skewness
ID: skewness |
6
Spectral skewness measures the asymmetry of the spectral distribution around its centroid. It indicates whether the energy is biased toward low or high frequencies.
-
Equation
\(Skewness = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^3 |X[k]|}{Spread^3 \sum_{k=0}^{K-1} |X[k]|}\)
-
Notes
Spectral Kurtosis
ID: kurtosis |
6
Spectral kurtosis measures the peakedness or tailedness of the spectral distribution around its centroid. It quantifies how concentrated the spectral energy is in a few frequency bins versus being evenly spread.
-
Equation
\(Kurtosis = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^4 |X[k]|}{Spread^4 \sum_{k=0}^{K-1} |X[k]|}\)
-
Notes
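Skewness and kurtosis are commonly computed as standardized third and fourth moments of the magnitude-weighted frequency distribution. The sketch below assumes that convention (without subtracting 3 from the kurtosis); it is not confirmed to match the OpenScofo or essentia implementation, and all names are hypothetical.

```python
import math

def spectral_moments(freqs, magnitudes):
    """Return (centroid, spread, skewness, kurtosis), weighted by magnitude."""
    total = sum(magnitudes)
    if total <= 0:
        return 0.0, 0.0, 0.0, 0.0
    centroid = sum(f * m for f, m in zip(freqs, magnitudes)) / total

    def central(order):
        # Magnitude-weighted central moment of the given order.
        return sum((f - centroid) ** order * m
                   for f, m in zip(freqs, magnitudes)) / total

    spread = math.sqrt(central(2))
    if spread == 0:
        return centroid, 0.0, 0.0, 0.0
    return centroid, spread, central(3) / spread ** 3, central(4) / spread ** 4

symmetric = spectral_moments([0.0, 1.0, 2.0], [1.0, 2.0, 1.0])
low_heavy = spectral_moments([0.0, 1.0, 2.0, 3.0], [4.0, 1.0, 1.0, 1.0])
```

A symmetric spectrum gives zero skewness, while a spectrum with most of its magnitude in the lowest bin and a tail toward high frequencies gives positive skewness.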
Spectral RollOff
ID: rolloff |
6
Spectral rolloff indicates the frequency below which a fixed percentage of a sound’s spectral energy is contained. Higher values correspond to sounds perceived as brighter or sharper, lower values to sounds perceived as darker or warmer.
-
Equation
TODO
-
Notes
TODO
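Until the equation is documented, the general idea of rolloff can be sketched as a cumulative sum over the power spectrum. The 85% default, the use of power rather than magnitude, and the function name are all assumptions chosen for illustration; this page does not state the percentage actually used.

```python
def spectral_rolloff(freqs, magnitudes, fraction=0.85):
    """Lowest bin frequency at which the cumulative power reaches `fraction`."""
    power = [m * m for m in magnitudes]
    threshold = fraction * sum(power)
    cumulative = 0.0
    for freq, p in zip(freqs, power):
        cumulative += p
        if cumulative >= threshold:
            return freq
    return freqs[-1]

freqs = [0.0, 100.0, 200.0, 300.0]
half = spectral_rolloff(freqs, [1.0, 1.0, 1.0, 1.0], fraction=0.5)
most = spectral_rolloff(freqs, [1.0, 1.0, 1.0, 1.0])
```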
Spectral Entropy
ID: entropy |
6
Spectral entropy indicates how uniformly a sound’s spectral energy is distributed across frequencies. Higher values correspond to sounds perceived as noisier or more disordered, lower values to sounds perceived as more tonal or structured.
-
Equation
TODO
-
Notes
TODO
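Until the equation is documented, one common formulation treats the normalized power spectrum as a probability distribution and takes its Shannon entropy. The sketch below assumes that formulation and divides by \(\ln K\) so the result lies in [0, 1]; the normalization and the function name are assumptions, not the documented OpenScofo behaviour.

```python
import math

def spectral_entropy(magnitudes):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    power = [m * m for m in magnitudes]
    total = sum(power)
    if total <= 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power]
    # Bins with zero probability contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))

uniform = spectral_entropy([1.0, 1.0, 1.0, 1.0])
single = spectral_entropy([1.0, 0.0, 0.0, 0.0])
```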
Spectral Centroid
ID: centroid |
3
Spectral centroid indicates the “center of mass” of a sound’s spectrum. Higher values correspond to sounds perceived as brighter, lower values to sounds perceived as darker or warmer.
-
Equation
\[Centroid = \frac{\sum_{k=0}^{K-1} f_k |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\]
-
Notes
Centroid Velocity
ID: velocity |
6
Centroid velocity measures how quickly the spectral centroid changes over time, reflecting dynamic shifts in brightness or timbre.
-
Equation
\(Velocity = |Centroid_t - Centroid_{t-1}|\)
-
Notes
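The centroid and centroid velocity definitions can be sketched together. The example assumes the bin frequencies are \(f_k = k \cdot f_s / N\) for sample rate \(f_s\) and FFT size \(N\), which this page does not state explicitly; all function names are hypothetical.

```python
def bin_frequencies(num_bins, sample_rate, fft_size):
    """Center frequency of each FFT bin (assumed mapping k * fs / N)."""
    return [k * sample_rate / fft_size for k in range(num_bins)]

def spectral_centroid(freqs, magnitudes):
    """Magnitude-weighted mean frequency."""
    total = sum(magnitudes)
    if total <= 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

def centroid_velocity(current, previous):
    """Absolute centroid change between two consecutive frames."""
    return abs(current - previous)

freqs = bin_frequencies(5, sample_rate=8000, fft_size=8)
frame_a = spectral_centroid(freqs, [0.0, 1.0, 0.0, 1.0, 0.0])
frame_b = spectral_centroid(freqs, [0.0, 0.0, 0.0, 1.0, 0.0])
velocity = centroid_velocity(frame_b, frame_a)
```

Here the first frame has equal magnitude at 1000 Hz and 3000 Hz, so its centroid is 2000 Hz; the second frame has energy only at 3000 Hz, giving a velocity of 1000 Hz per frame.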
Spectral Spread
ID: spread |
3
Spectral spread quantifies how dispersed the energy is around the spectral centroid, indicating whether the sound is focused (narrow) or diffuse (wide) in frequency.
-
Equation
\(Spread = \sqrt{\frac{\sum_{k=0}^{K-1} (f_k - Centroid)^2 |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}}\)
-
Notes
High Frequency Ratio
ID: hfr
High Frequency Ratio measures the proportion of energy in the upper part of the spectrum, reflecting the brightness or presence of high-pitched content in a sound.
-
Equation
\(HFR = \frac{\sum_{k=K/4}^{K-1} |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\)
-
Notes
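A minimal sketch of the HFR equation, where the upper part of the spectrum starts at bin \(K/4\). Integer division for odd bin counts and the function name are assumptions:

```python
def high_frequency_ratio(magnitudes):
    """Share of total magnitude found in bins K/4 and above."""
    total = sum(magnitudes)
    if total <= 0:
        return 0.0
    start = len(magnitudes) // 4  # assumed rounding when K is not a multiple of 4
    return sum(magnitudes[start:]) / total

even = high_frequency_ratio([1.0] * 8)
low_only = high_frequency_ratio([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

A flat spectrum yields 0.75, because three quarters of the bins lie at or above \(K/4\).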
Zero Crossing Rate
ID: zcr |
5
Zero Crossing Rate counts how often the waveform crosses the zero amplitude line, indicating the noisiness or percussiveness of a sound.
-
Equation
\(ZCR = \frac{1}{N} \sum_{n=1}^{N-1} \mathbb{I}\{\text{sgn}(x[n]) \neq \text{sgn}(x[n-1])\}\)
-
Notes
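The ZCR equation counts sign changes between consecutive samples and divides by the frame length. A small sketch, assuming \(\text{sgn}(0) = 0\) so that a sample of exactly zero counts as a change from either side (the function name is hypothetical):

```python
def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose signs differ."""
    def sign(value):
        return (value > 0) - (value < 0)

    n = len(samples)
    crossings = sum(
        1 for i in range(1, n) if sign(samples[i]) != sign(samples[i - 1])
    )
    return crossings / n

alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
rising = zero_crossing_rate([1.0, 2.0, 3.0])
```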
Standard Deviation
ID: std
Standard deviation of the normalized spectral power around its mean, \(\mu = \frac{1}{K}\) (the mean of \(K\) values that sum to 1):
-
Equation
\(StdDev = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left( |X_{norm}[k]| - \mu \right)^2}\)
-
Notes
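The sketch below illustrates the standard deviation equation, assuming the spectrum is normalized so that its values sum to 1, which is what a mean of \(1/K\) implies. The function name is hypothetical.

```python
import math

def spectral_std(values):
    """Standard deviation of a unit-sum normalized spectrum around 1/K."""
    total = sum(values)
    k = len(values)
    if total <= 0:
        return 0.0
    normalized = [v / total for v in values]
    mu = 1.0 / k
    return math.sqrt(sum((x - mu) ** 2 for x in normalized) / k)

uniform_std = spectral_std([2.0, 2.0, 2.0, 2.0])
peaked_std = spectral_std([1.0, 0.0, 0.0, 0.0])
```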
Pitch & PitchConfidence
ID: pitch
Estimated fundamental frequency and confidence, calculated using the YIN algorithm's cumulative mean normalized difference function (CMNDF):
-
Equation
\(d'_t(\tau) = \begin{cases} 1 & \text{if } \tau = 0 \\ \frac{d_t(\tau)}{\frac{1}{\tau} \sum_{j=1}^{\tau} d_t(j)} & \text{otherwise} \end{cases}\)
-
Notes
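To make the CMNDF concrete, the following sketch computes the YIN difference function \(d_t(\tau)\) directly in the time domain and then normalizes it as in the equation. It is an unoptimized illustration with hypothetical names; it does not cover how the pitch or the confidence is finally selected from the curve.

```python
import math

def difference_function(samples, max_lag):
    """d(tau): summed squared difference between the frame and its shifted copy."""
    window = len(samples) - max_lag
    return [
        sum((samples[j] - samples[j + tau]) ** 2 for j in range(window))
        for tau in range(max_lag + 1)
    ]

def cmndf(diff):
    """Cumulative mean normalized difference: d(tau) over the mean of d(1..tau)."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(diff)):
        running += diff[tau]
        out.append(diff[tau] * tau / running if running > 0 else 1.0)
    return out

# A sine wave with a period of exactly 8 samples.
signal = [math.sin(2 * math.pi * n / 8) for n in range(64)]
curve = cmndf(difference_function(signal, max_lag=16))
best_lag = min(range(1, 13), key=lambda tau: curve[tau])
```

The deepest dip of the curve falls at a lag of 8 samples, the period of the test signal.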
Normalized Magnitude
Vector of magnitude values normalized by the FFT size \(N\):
-
Equation
\(|X[k]| = \frac{Power[k]}{N}\)
-
Notes
Harmonicity
ID: harmonicity
Only a very simple version is implemented for now.
Harmonicity measures how well a sound’s spectrum aligns with a harmonic series (integer multiples of a fundamental frequency). High harmonicity indicates a clear pitched sound with harmonically related partials, while low harmonicity indicates inharmonic or noise-like spectra.
-
Equation
\[Harmonicity = \frac{\max_{k>0} |X[k]|}{\sum_{k>0} |X[k]|}\]
-
Notes
A future version will probably be based on Yu (2006).
Log-Mel Spectrogram
ID: logmel |
5
The log-mel spectrum represents how the energy of a sound is distributed across perceptual frequency bands, using the mel scale and a logarithmic (dB) compression to approximate human loudness perception.
-
Equation
\[ E_m = \sum_{k=0}^{K-1} H_m(k)\,P[k] \]
\[ \text{LogMel}[m] = \max\left(L_{min},\; 10 \log_{10}(E_m)\right) \]
-
Notes
- \(P[k]\) is the power spectrum (\(|X[k]|^2\))
- \(H_m(k)\) is the mel filterbank (triangular filters)
- \(E_m\) is the energy in mel band \(m\)
- Log scaling (dB) approximates human loudness perception
- \(L_{min}\) prevents \(-\infty\) (numerical stability)
- Often followed by a top-dB clipping (e.g., 80 dB range)
- This is computed per frame; stacking over time forms a mel spectrogram
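The two equations above can be sketched for a single frame as follows. The filterbank is passed in as a precomputed list of weight rows, and the -100 dB default for \(L_{min}\) is an assumption chosen for the example; the names are hypothetical.

```python
import math

def log_mel(power, filterbank, floor_db=-100.0):
    """Apply each mel filter to the power spectrum, then convert to floored dB."""
    bands = []
    for weights in filterbank:
        energy = sum(w * p for w, p in zip(weights, power))
        # An empty band would give log10(0), so it falls back to the floor.
        level = 10.0 * math.log10(energy) if energy > 0 else float("-inf")
        bands.append(max(floor_db, level))
    return bands

# Two toy rectangular filters over four bins (real mel filters are triangular).
toy_filters = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
frame = log_mel([1.0, 1.0, 0.0, 0.0], toy_filters)
```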
MFCC
ID: mfcc |
3
MFCCs summarize the shape of a sound’s spectrum on a perceptual, mel-based scale, giving a compact representation of timbre and tone color as humans hear it.
-
Equation
\(MFCC[i] = \sum_{m=0}^{M-1} \cos\left( \frac{\pi i (m + 0.5)}{M} \right) \max(L_{min}, 10 \log_{10}(E_m))\)
-
Notes
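The MFCC equation is an unnormalized DCT-II over the log-mel band values. A minimal sketch, assuming the band values have already been converted to floored dB (the function name is hypothetical):

```python
import math

def mfcc(log_mel_bands, num_coeffs):
    """Unnormalized DCT-II of the log-mel bands, as written in the equation."""
    m = len(log_mel_bands)
    return [
        sum(
            math.cos(math.pi * i * (j + 0.5) / m) * value
            for j, value in enumerate(log_mel_bands)
        )
        for i in range(num_coeffs)
    ]

# A flat log-mel spectrum has energy only in coefficient 0.
coeffs = mfcc([2.0, 2.0, 2.0, 2.0], num_coeffs=3)
```

Coefficient 0 is the sum of the band values, and higher coefficients describe progressively finer variation across the mel bands.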
Chroma
ID: chroma |
4
Chroma features capture the intensity of the twelve pitch classes (C, C♯, …, B) in a sound, representing its harmonic and melodic content independently of octave.
-
Equation
\(Chroma[c] = \sum_{k=0}^{K-1} W_{c}[k] \cdot Power[k]\)
-
Notes
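One simple choice for the weights \(W_c[k]\) assigns each bin entirely to its nearest equal-tempered pitch class. The sketch below assumes that hard assignment and a 440 Hz reference for A4; the weighting actually used may be smoother, and the names are hypothetical.

```python
import math

def pitch_class(freq, a4=440.0):
    """Nearest equal-tempered pitch class of a frequency (0 = C, 9 = A)."""
    midi = 69 + 12 * math.log2(freq / a4)
    return int(round(midi)) % 12

def chroma(freqs, power):
    """Sum the power of every bin into its nearest pitch class."""
    bins = [0.0] * 12
    for freq, p in zip(freqs, power):
        if freq > 0:  # the DC bin has no pitch
            bins[pitch_class(freq)] += p
    return bins

# A3 and A4 fold into the same class; C4 lands in class 0.
vector = chroma([0.0, 220.0, 440.0, 261.63], [5.0, 1.0, 2.0, 3.0])
```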
- OpenScofo uses a Hann window and FFTW3 for the FFT.
- Descriptor compatible with librosa to within the order of \(10^{-9}\).
- Descriptor compatible with librosa to within the order of \(10^{-5}\).
- Descriptor compatible with librosa to within the order of \(10^{-3}\).
- Descriptor compatible with essentia to within the order of \(10^{-4}\).