Spectral Descriptors

Spectral descriptors are computed from the output of an FFT. A Hann window is applied to the signal before the FFT.


Spectral Flatness

ID: flatness | 2

Spectral flatness indicates how noisy versus tonal a sound is. A high flatness means the spectrum is uniform like white noise, while a low flatness shows clear peaks, like a sustained musical note.

  • Equation
    \[Flatness = \frac{\exp\left( \frac{1}{K} \sum_{k=0}^{K-1} \ln(|X[k]|^2) \right)}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|^2}\]
  • Notes
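
The flatness equation can be sketched in plain Python as the ratio of the geometric mean to the arithmetic mean of the power spectrum. This is only an illustration, not the OpenScofo implementation: the function name and the small `eps` guard against `log(0)` are assumptions that do not appear in the equation.

```python
import math

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum |X[k]|^2."""
    power = [m * m for m in mag]
    k = len(power)
    # eps is an assumption (not part of the equation); it avoids log(0) on empty bins.
    log_mean = sum(math.log(p + eps) for p in power) / k
    arith_mean = sum(power) / k
    return math.exp(log_mean) / (arith_mean + eps)

flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])   # uniform spectrum: close to 1
peaky = spectral_flatness([0.0, 0.0, 4.0, 0.0])  # single peak: close to 0
```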

Spectral Flux

ID: flux | 6

Spectral flux measures how quickly the spectrum of a sound changes over time. High flux indicates sudden changes or transients, like drum hits, while low flux corresponds to steady, continuous sounds.

  • Equation

    \(Flux = \sum_{k=1}^{K-1} \max(0, |X[k]| - |X_{prev}[k]|)\)

  • Notes
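
A minimal sketch of the flux equation, assuming the current and previous magnitude frames are plain lists of equal length. The function name is hypothetical.

```python
def spectral_flux(mag, prev_mag):
    """Sum of positive magnitude increases between two consecutive frames."""
    # The sum starts at k = 1, as in the equation, so the DC bin is skipped.
    return sum(max(0.0, m - p) for m, p in zip(mag[1:], prev_mag[1:]))

prev_frame = [0.5, 0.1, 0.1, 0.1]
onset = spectral_flux([0.5, 0.9, 0.1, 0.0], prev_frame)  # only bin 1 rises
steady = spectral_flux(prev_frame, prev_frame)           # no change between frames
```

Because of the `max(0, ...)` term, only rising bins contribute, so a decaying sound yields a flux near zero.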

Spectral Irregularity

ID: irregularity

Spectral irregularity quantifies how uneven or jagged a spectrum is between adjacent frequency bins. High irregularity indicates complex, inharmonic, or noisy timbres, while low values suggest smooth, harmonic sounds.

  • Equation

    \(Irregularity = \frac{\sum_{k=1}^{K-1} (|X[k]| - |X[k-1]|)^2}{\sum_{k=0}^{K-1} |X[k]|^2}\)

  • Notes
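
The irregularity equation translates almost directly into code. The sketch below assumes a list of magnitudes and returns 0 for a silent frame, which is a choice made here and not something the equation specifies.

```python
def spectral_irregularity(mag):
    """Squared differences of adjacent bins, normalized by total energy."""
    num = sum((mag[k] - mag[k - 1]) ** 2 for k in range(1, len(mag)))
    den = sum(m * m for m in mag)
    # Returning 0.0 for an all-zero frame is an assumption to avoid dividing by zero.
    return num / den if den > 0 else 0.0

smooth = spectral_irregularity([1.0, 1.0, 1.0, 1.0])  # flat spectrum
jagged = spectral_irregularity([1.0, 0.0, 1.0, 0.0])  # alternating bins
```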

Spectral Crest

ID: crest | 6

Spectral crest measures the ratio of the highest spectral peak to the average spectral amplitude. A high crest indicates a tone dominated by strong harmonics or transients, while a low crest corresponds to more even, noise-like spectra.

  • Equation

    \(Crest = \frac{\max_{k} |X[k]|}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|}\)

  • Notes
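
As an illustration of the crest equation, the following sketch divides the largest magnitude by the mean magnitude. The zero-frame guard is an assumption.

```python
def spectral_crest(mag):
    """Ratio of the largest magnitude to the mean magnitude."""
    mean = sum(mag) / len(mag)
    # Returning 0.0 for an all-zero frame is an assumption, not part of the equation.
    return max(mag) / mean if mean > 0 else 0.0

crest_flat = spectral_crest([1.0, 1.0, 1.0, 1.0])  # peak equals the mean
crest_peak = spectral_crest([0.0, 0.0, 4.0, 0.0])  # one dominant bin
```

With this definition the crest ranges from 1 (perfectly flat spectrum) up to K (all energy in a single bin).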

Spectral Skewness

ID: skewness | 6

Spectral skewness measures the asymmetry of the spectral distribution around its centroid. It indicates whether the energy is biased toward low or high frequencies.

  • Equation

    Standard moment-based form, using the Centroid and Spread defined on this page:

    \(Skewness = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^3 |X[k]|}{Spread^3 \sum_{k=0}^{K-1} |X[k]|}\)

  • Notes

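Spectral Kurtosis

ID: kurtosis | 6

Spectral kurtosis measures the peakedness or tailedness of the spectral distribution around its centroid. It quantifies how concentrated the spectral energy is in a few frequency bins versus being evenly spread.

  • Equation

    Standard moment-based form, using the Centroid and Spread defined on this page (some implementations subtract 3 to report excess kurtosis):

    \(Kurtosis = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^4 |X[k]|}{Spread^4 \sum_{k=0}^{K-1} |X[k]|}\)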

  • Notes

Spectral RollOff

ID: rolloff | 6

Spectral rolloff indicates the frequency below which a fixed percentage of a sound’s spectral energy is contained. Higher values correspond to perceptually brighter or sharper sounds, while lower values correspond to darker or warmer ones.

  • Equation

    TODO

  • Notes

    TODO
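
Since the equation is still marked TODO, the following is only a sketch of the common textbook definition, not necessarily what OpenScofo computes. It assumes energy is measured as \(|X[k]|^2\), that `freqs` holds the center frequency of each bin, and that the threshold is 85%; all three are assumptions.

```python
def spectral_rolloff(mag, freqs, fraction=0.85):
    """Lowest bin frequency at which cumulative energy reaches `fraction` of the total."""
    energy = [m * m for m in mag]
    target = fraction * sum(energy)  # fraction = 0.85 is a hypothetical default
    cumulative = 0.0
    for f, e in zip(freqs, energy):
        cumulative += e
        if cumulative >= target:
            return f
    return freqs[-1] if freqs else 0.0

bin_freqs = [0.0, 100.0, 200.0, 300.0]
rolloff_flat = spectral_rolloff([1.0, 1.0, 1.0, 1.0], bin_freqs)  # energy spread evenly
rolloff_low = spectral_rolloff([2.0, 0.0, 0.0, 0.0], bin_freqs)   # energy in the first bin
```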


Spectral Entropy

ID: entropy | 6

Spectral entropy indicates how uniformly a sound’s spectral energy is distributed across frequencies. Higher values correspond to perceptually noisier or more disordered sounds, while lower values correspond to more tonal or structured ones.

  • Equation

    TODO

  • Notes

    TODO
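
The equation here is still marked TODO. One common formulation, shown only as a sketch under stated assumptions, treats the normalized power spectrum as a probability distribution and takes its Shannon entropy. Dividing by \(\ln K\) to map the result to the range 0 to 1 is a choice made for this example.

```python
import math

def spectral_entropy(mag):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    power = [m * m for m in mag]
    total = sum(power)
    if total <= 0 or len(power) < 2:
        return 0.0  # assumption: silent or single-bin frames have zero entropy
    # Bins with zero power contribute nothing (the limit of p*log(p) as p -> 0 is 0).
    h = -sum((p / total) * math.log(p / total) for p in power if p > 0)
    return h / math.log(len(power))

entropy_noise = spectral_entropy([1.0, 1.0, 1.0, 1.0])  # uniform spectrum
entropy_tone = spectral_entropy([0.0, 3.0, 0.0, 0.0])   # single peak
```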

Spectral Centroid

ID: centroid | 3

Spectral centroid indicates the “center of mass” of a sound’s spectrum. Higher values correspond to perceptually brighter sounds, while lower values correspond to darker or warmer ones.

  • Equation
    \[Centroid = \frac{\sum_{k=0}^{K-1} f_k |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\]
  • Notes

Centroid Velocity

ID: velocity | 6

Centroid velocity measures how quickly the spectral centroid changes over time, reflecting dynamic shifts in brightness or timbre.

  • Equation

    \(Velocity = |Centroid_t - Centroid_{t-1}|\)

  • Notes

Spectral Spread

ID: spread | 3

Spectral spread quantifies how dispersed the energy is around the spectral centroid, indicating whether the sound is focused (narrow) or diffuse (wide) in frequency.

  • Equation

    \(Spread = \sqrt{\frac{\sum_{k=0}^{K-1} (f_k - Centroid)^2 |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}}\)

  • Notes
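
The centroid and spread equations can be sketched together, since the spread is the magnitude-weighted standard deviation around the centroid. The example assumes `freqs` holds the frequency \(f_k\) of each bin. The function names and the zero-frame guards are assumptions.

```python
import math

def spectral_centroid(mag, freqs):
    """Magnitude-weighted mean frequency."""
    total = sum(mag)
    return sum(f * m for f, m in zip(freqs, mag)) / total if total > 0 else 0.0

def spectral_spread(mag, freqs):
    """Magnitude-weighted standard deviation around the centroid."""
    total = sum(mag)
    if total <= 0:
        return 0.0  # assumption: a silent frame has no spread
    c = spectral_centroid(mag, freqs)
    variance = sum((f - c) ** 2 * m for f, m in zip(freqs, mag)) / total
    return math.sqrt(variance)

f_k = [0.0, 100.0, 200.0, 300.0]
centroid = spectral_centroid([0.0, 1.0, 1.0, 0.0], f_k)  # midway between 100 and 200
spread = spectral_spread([0.0, 1.0, 1.0, 0.0], f_k)      # each bin is 50 Hz from the centroid
```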

High Frequency Ratio

ID: hfr

High Frequency Ratio measures the proportion of energy in the upper part of the spectrum, reflecting the brightness or presence of high-pitched content in a sound.

  • Equation

    \(HFR = \frac{\sum_{k=K/4}^{K-1} |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\)

  • Notes

Zero Crossing Rate

ID: zcr | 5

Zero Crossing Rate counts how often the waveform crosses the zero amplitude line, indicating the noisiness or percussiveness of a sound.

  • Equation

    \(ZCR = \frac{1}{N} \sum_{n=1}^{N-1} \mathbb{I}\{\text{sgn}(x[n]) \neq \text{sgn}(x[n-1])\}\)

  • Notes
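
A minimal sketch of the zero crossing rate, operating on time-domain samples. It assumes \(\text{sgn}(0) = 0\), so a sample that is exactly zero counts as a sign change relative to a nonzero neighbor; other conventions exist.

```python
def zero_crossing_rate(x):
    """Fraction of samples whose sign differs from the previous sample."""
    n = len(x)
    if n < 2:
        return 0.0

    def sign(v):
        # Returns -1, 0 or 1; treating 0 as its own sign is an assumption.
        return (v > 0) - (v < 0)

    crossings = sum(1 for i in range(1, n) if sign(x[i]) != sign(x[i - 1]))
    return crossings / n

zcr_alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])  # crosses at every step
zcr_ramp = zero_crossing_rate([1.0, 2.0, 3.0, 4.0])           # never crosses
```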

Standard Deviation

ID: std

Standard deviation of the normalized spectral power compared to the mean (\(\mu = \frac{1}{K}\)):

  • Equation

    \(StdDev = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left( |X_{norm}[k]| - \mu \right)^2}\)

  • Notes

Pitch & PitchConfidence

ID: pitch

Estimated fundamental frequency and confidence, calculated using the YIN algorithm's cumulative mean normalized difference function (CMNDF):

  • Equation

    \(d'_t(\tau) = \begin{cases} 1 & \text{if } \tau = 0 \\ \frac{d_t(\tau)}{\frac{1}{\tau} \sum_{j=1}^{\tau} d_t(j)} & \text{otherwise} \end{cases}\)

  • Notes
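
The CMNDF can be sketched as follows. The squared-difference function \(d_t(\tau)\) used here follows the usual YIN formulation and is an assumption, since it is not spelled out on this page. The function names and the toy sine input are illustrative only.

```python
import math

def difference_function(x, max_tau):
    """d(tau): sum of squared differences between the frame and its lagged copy."""
    w = len(x) - max_tau  # window length so that x[j + tau] stays in range
    return [sum((x[j] - x[j + tau]) ** 2 for j in range(w))
            for tau in range(max_tau + 1)]

def cmndf(d):
    """Cumulative mean normalized difference: d'(0) = 1, else d(tau) * tau / sum(d[1..tau])."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(d)):
        running += d[tau]
        # Falling back to 1.0 when the running sum is zero is an assumption.
        out.append(d[tau] * tau / running if running > 0 else 1.0)
    return out

# Toy input: a sine wave with a period of exactly 8 samples.
signal = [math.sin(2 * math.pi * n / 8) for n in range(64)]
cm = cmndf(difference_function(signal, 12))
best_lag = min(range(1, 13), key=lambda t: cm[t])  # lag with the deepest dip
```

The lag at the deepest dip is a period estimate in samples; dividing the sample rate by it gives a frequency estimate. A full YIN implementation also applies an absolute threshold and parabolic interpolation, which this sketch omits.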

Normalized Magnitude

Vector of magnitude values normalized by the FFT size \(N\):

  • Equation

    \(|X[k]| = \frac{Power[k]}{N}\)

  • Notes

Harmonicity

ID: harmonicity

Only a very simple version is implemented for now.

Harmonicity measures how well a sound’s spectrum aligns with a harmonic series (integer multiples of a fundamental frequency). High harmonicity indicates a clear pitched sound with harmonically related partials, while low harmonicity indicates inharmonic or noise-like spectra.

  • Equation
    \[Harmonicity = \frac{\max_{k>0} |X[k]|}{\sum_{k>0} |X[k]|}\]
  • Notes

    A future version will probably be based on Yu (2006).


Log-Mel Spectrogram

ID: logmel | 5

The log-mel spectrum represents how the energy of a sound is distributed across perceptual frequency bands, using the mel scale and a logarithmic (dB) compression to approximate human loudness perception.

  • Equation
    \[ E_m = \sum_{k=0}^{K-1} H_m(k)\,P[k] \]
    \[ \text{LogMel}[m] = \max\left(L_{min},\; 10 \log_{10}(E_m)\right) \]
  • Notes
    • \(P[k]\) is the power spectrum (\(|X[k]|^2\))
    • \(H_m(k)\) is the mel filterbank (triangular filters)
    • \(E_m\) is the energy in mel band \(m\)
    • Log scaling (dB) approximates human loudness perception
    • \(L_{min}\) prevents \(-\infty\) (numerical stability)
    • Often followed by a top-dB clipping (e.g., 80 dB range)
    • This is computed per frame; stacking over time forms a mel spectrogram
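
The two equations above can be sketched for a single frame as follows. The tiny 3-band filterbank is a hypothetical stand-in for real triangular mel filters, and the default \(L_{min}\) of -100 dB is an assumption.

```python
import math

def log_mel(power, filterbank, l_min=-100.0):
    """Apply each filter H_m to the power spectrum, then convert to dB with a floor."""
    out = []
    for h in filterbank:
        energy = sum(w * p for w, p in zip(h, power))  # E_m
        # An empty band would give log10(0), so it is clamped to l_min directly.
        out.append(max(l_min, 10.0 * math.log10(energy)) if energy > 0 else l_min)
    return out

power_spectrum = [1.0, 10.0, 100.0, 0.0]
toy_filterbank = [            # hypothetical weights, one row per mel band
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
logmel_frame = log_mel(power_spectrum, toy_filterbank)
```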

MFCC

ID: mfcc | 3

MFCCs summarize the shape of a sound’s spectrum on a perceptual, mel-based scale, giving a compact representation of timbre and tone color as humans hear it.

  • Equation

    \(MFCC[i] = \sum_{m=0}^{M-1} \cos\left( \frac{\pi i (m + 0.5)}{M} \right) \max(L_{min}, 10 \log_{10}(E_m))\)

  • Notes
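
The MFCC equation is an unnormalized DCT-II applied to the log-mel values. The sketch below takes an already computed log-mel frame as input. The function name is hypothetical and no scaling beyond what the equation shows is applied.

```python
import math

def mfcc_from_logmel(logmel, n_coeffs):
    """Unnormalized DCT-II of the log-mel band values, as in the equation."""
    m_bands = len(logmel)
    return [
        sum(math.cos(math.pi * i * (m + 0.5) / m_bands) * logmel[m]
            for m in range(m_bands))
        for i in range(n_coeffs)
    ]

# A constant log-mel frame puts everything into coefficient 0.
coeffs = mfcc_from_logmel([2.0, 2.0, 2.0, 2.0], n_coeffs=3)
```

Coefficient 0 is the sum of the band values, so it tracks overall level, while higher coefficients describe progressively finer detail of the spectral envelope.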

Chroma

ID: chroma | 4

Chroma features capture the intensity of the twelve pitch classes (C, C♯, …, B) in a sound, representing its harmonic and melodic content independently of octave.

  • Equation

    \(Chroma[c] = \sum_{k=0}^{K-1} W_{c}[k] \cdot Power[k]\)

  • Notes
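
To illustrate the chroma equation, the sketch below uses the simplest possible weighting: each bin contributes its full power to the nearest equal-tempered pitch class, so \(W_c[k]\) is either 0 or 1. This hard assignment, the A4 = 440 Hz reference, and the function name are assumptions; practical chroma filterbanks typically use smooth, overlapping weights.

```python
import math

def chroma_from_power(power, freqs, a4=440.0):
    """Accumulate bin power into 12 pitch classes (index 0 = C, 9 = A)."""
    chroma = [0.0] * 12
    for f, p in zip(freqs, power):
        if f <= 0:
            continue  # the DC bin has no pitch
        midi = 69 + 12 * math.log2(f / a4)  # fractional MIDI note number
        chroma[int(round(midi)) % 12] += p
    return chroma

# Bins at DC, A3 (220 Hz), A4 (440 Hz) and roughly C4 (261.63 Hz).
chroma = chroma_from_power([5.0, 1.0, 2.0, 3.0], [0.0, 220.0, 440.0, 261.63])
```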

  1. OpenScofo uses a Hann window and FFTW3 for the FFT.

  2. Descriptor agrees with librosa to within the order of \(10^{-9}\)

  3. Descriptor agrees with librosa to within the order of \(10^{-5}\)

  4. Descriptor agrees with librosa to within the order of \(10^{-3}\)

  5. Descriptor is fully compatible with librosa

  6. Descriptor agrees with essentia to within the order of \(10^{-4}\)