Spectral Descriptors

Spectral descriptors are computed from the output of an FFT. A Hann window is applied to the signal before the FFT. ¹

`Spectral Flatness`

ID: flatness | ²

Spectral flatness indicates how noisy versus tonal a sound is. A high flatness means the spectrum is uniform like white noise, while a low flatness shows clear peaks, like a sustained musical note.

Equation

\[Flatness = \frac{\exp\left( \frac{1}{K} \sum_{k=0}^{K-1} \ln(|X[k]|^2) \right)}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|^2}\]
Notes
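
A minimal Python sketch of the flatness formula, assuming `mags` holds the magnitudes \(|X[k]|\) of one frame. The function name and the small `eps` guard against \(\ln(0)\) are assumptions added for illustration; the formula itself does not specify them.

```python
import math

def spectral_flatness(mags, eps=1e-12):
    """Geometric mean of the power spectrum divided by its arithmetic mean."""
    power = [m * m for m in mags]
    K = len(power)
    # eps is an illustrative guard (assumption) so that empty bins do not give log(0)
    log_mean = sum(math.log(p + eps) for p in power) / K
    arith_mean = sum(power) / K
    return math.exp(log_mean) / (arith_mean + eps)
```

A uniform spectrum such as `[1, 1, 1, 1]` gives a value close to 1, while a single-peak spectrum such as `[1, 0, 0, 0]` gives a value close to 0.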

`Spectral Flux`

ID: flux | ⁶

Spectral flux measures how quickly the spectrum of a sound changes over time. High flux indicates sudden changes or transients, like drum hits, while low flux corresponds to steady, continuous sounds.

Equation

\(Flux = \sum_{k=1}^{K-1} \max(0, |X[k]| - |X_{prev}[k]|)\)
Notes
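
The flux formula can be sketched in Python as follows, assuming `mags` and `prev_mags` are the magnitude spectra of the current and previous frames. The function name is hypothetical. As in the formula, the sum starts at \(k = 1\), so the DC bin is skipped, and only increases in magnitude contribute.

```python
def spectral_flux(mags, prev_mags):
    """Sum of positive magnitude differences between two frames, skipping bin 0."""
    return sum(max(0.0, m - p) for m, p in zip(mags[1:], prev_mags[1:]))
```

For example, `spectral_flux([0, 1, 2, 3], [5, 1, 1, 5])` counts only the rise in bin 2 and yields 1.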

`Spectral Irregularity`

ID: irregularity

Spectral irregularity quantifies how uneven or jagged a spectrum is between adjacent frequency bins. High irregularity indicates complex, inharmonic, or noisy timbres, while low values suggest smooth, harmonic sounds.

Equation

\(Irregularity = \frac{\sum_{k=1}^{K-1} (|X[k]| - |X[k-1]|)^2}{\sum_{k=0}^{K-1} |X[k]|^2}\)
Notes
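
A short Python sketch of the irregularity formula, assuming `mags` holds \(|X[k]|\) for one frame. The function name and the zero return value for a silent frame are assumptions made for illustration.

```python
def spectral_irregularity(mags):
    """Squared differences between adjacent bins, normalized by total power."""
    num = sum((mags[k] - mags[k - 1]) ** 2 for k in range(1, len(mags)))
    den = sum(m * m for m in mags)
    # Returning 0 for a silent frame is an assumption, not part of the formula
    return num / den if den > 0 else 0.0
```

A flat spectrum `[1, 1, 1, 1]` gives 0, while the alternating spectrum `[1, 0, 1, 0]` gives 1.5.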

`Spectral Crest`

ID: crest | ⁶

Spectral crest measures the ratio of the highest spectral peak to the average spectral amplitude. A high crest indicates a tone dominated by strong harmonics or transients, while a low crest corresponds to more even, noise-like spectra.

Equation

\(Crest = \frac{\max_{k} |X[k]|}{\frac{1}{K} \sum_{k=0}^{K-1} |X[k]|}\)
Notes
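
The crest formula translates almost directly into Python. This is a sketch, assuming `mags` holds \(|X[k]|\) for one frame; the function name and the handling of a silent frame are assumptions.

```python
def spectral_crest(mags):
    """Largest magnitude divided by the mean magnitude."""
    mean = sum(mags) / len(mags)
    # Returning 0 for a silent frame is an assumption, not part of the formula
    return max(mags) / mean if mean > 0 else 0.0
```

A flat spectrum gives a crest of 1, and a single peak in a frame of \(K\) bins gives a crest of \(K\).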

`Spectral Skewness`

ID: skewness | ⁶

Spectral skewness measures the asymmetry of the spectral distribution around its centroid. It indicates whether the energy is biased toward low or high frequencies.

Equation

\(Skewness = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^3 |X[k]|}{Spread^3 \sum_{k=0}^{K-1} |X[k]|}\)
Notes

`Spectral Kurtosis`

ID: kurtosis | ⁶

Spectral kurtosis measures the peakedness or tailedness of the spectral distribution around its centroid. It quantifies how concentrated the spectral energy is in a few frequency bins versus being evenly spread.

Equation

\(Kurtosis = \frac{\sum_{k=0}^{K-1} (f_k - Centroid)^4 |X[k]|}{Spread^4 \sum_{k=0}^{K-1} |X[k]|}\)
Notes

`Spectral RollOff`

ID: rolloff | ⁶

Spectral rolloff indicates the frequency below which a fixed percentage of a sound’s spectral energy is contained. Higher values correspond to perceptually brighter or sharper sounds, while lower values correspond to darker or warmer ones.

Equation

TODO
Notes

TODO
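
One common way to compute a rolloff is sketched below in Python. It is an illustration only and may differ from the OpenScofo implementation: the function name, the use of power (squared magnitude) as the energy measure, and the default fraction of 0.85 are all assumptions. `freqs` is assumed to hold the center frequency \(f_k\) of each bin.

```python
def spectral_rolloff(mags, freqs, fraction=0.85):
    """Lowest bin frequency at which the cumulative power reaches `fraction` of the total."""
    power = [m * m for m in mags]  # assumption: energy measured as squared magnitude
    threshold = fraction * sum(power)
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= threshold:
            return f
    return freqs[-1] if freqs else 0.0
```

With a flat spectrum over bins at 0, 100, 200 and 300 Hz, a fraction of 0.5 returns 100 Hz and the default fraction returns 300 Hz.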

`Spectral Entropy`

ID: entropy | ⁶

Spectral entropy indicates how uniformly a sound’s spectral energy is distributed across frequencies. Higher values correspond to perceptually noisier or more disordered sounds, while lower values correspond to more tonal or structured ones.

Equation

TODO
Notes

TODO
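
A possible definition, sketched in Python, is the Shannon entropy of the power spectrum after normalizing it to sum to 1. This is an assumption made for illustration and may differ from the OpenScofo implementation, in particular in the choice of logarithm base (base 2 here) and in whether the result is further normalized. The function name is hypothetical.

```python
import math

def spectral_entropy(mags):
    """Shannon entropy, in bits, of the normalized power spectrum."""
    power = [m * m for m in mags]
    total = sum(power)
    if total <= 0:
        return 0.0  # assumption: a silent frame has zero entropy
    probs = [p / total for p in power]
    # Bins with zero probability contribute nothing to the sum
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

Under this definition a flat spectrum over 4 bins gives 2 bits, and a spectrum with a single non-zero bin gives 0.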

`Spectral Centroid`

ID: centroid | ³

Spectral centroid indicates the “center of mass” of a sound’s spectrum. Higher values correspond to perceptually brighter sounds, while lower values correspond to darker or warmer ones.

Equation

\[Centroid = \frac{\sum_{k=0}^{K-1} f_k |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\]
Notes

`Centroid Velocity`

ID: velocity | ⁶

Centroid velocity measures how quickly the spectral centroid changes over time, reflecting dynamic shifts in brightness or timbre.

Equation

\(Velocity = |Centroid_t - Centroid_{t-1}|\)
Notes

`Spectral Spread`

ID: spread | ³

Spectral spread quantifies how dispersed the energy is around the spectral centroid, indicating whether the sound is focused (narrow) or diffuse (wide) in frequency.

Equation

\(Spread = \sqrt{\frac{\sum_{k=0}^{K-1} (f_k - Centroid)^2 |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}}\)
Notes
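
The spread depends on the centroid, so the Python sketch below computes both. It assumes `mags` holds \(|X[k]|\) and `freqs` holds the bin frequencies \(f_k\); the function names and the zero return value for a silent frame are assumptions.

```python
import math

def spectral_centroid(mags, freqs):
    """Magnitude-weighted mean frequency."""
    total = sum(mags)
    if total <= 0:
        return 0.0  # assumption: silent frame
    return sum(f * m for f, m in zip(freqs, mags)) / total

def spectral_spread(mags, freqs):
    """Magnitude-weighted standard deviation of frequency around the centroid."""
    total = sum(mags)
    if total <= 0:
        return 0.0  # assumption: silent frame
    c = spectral_centroid(mags, freqs)
    return math.sqrt(sum((f - c) ** 2 * m for f, m in zip(freqs, mags)) / total)
```

For two equal peaks at 0 Hz and 200 Hz, the centroid is 100 Hz and the spread is 100 Hz.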

`High Frequency Ratio`

ID: hfr

High Frequency Ratio measures the proportion of energy in the upper part of the spectrum, reflecting the brightness or presence of high-pitched content in a sound.

Equation

\(HFR = \frac{\sum_{k=K/4}^{K-1} |X[k]|}{\sum_{k=0}^{K-1} |X[k]|}\)
Notes
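
A Python sketch of the ratio, assuming `mags` holds \(|X[k]|\) for one frame. The function name is hypothetical, and the use of integer division for the lower bound \(K/4\) is an assumption for the case where \(K\) is not a multiple of 4.

```python
def high_frequency_ratio(mags):
    """Share of total magnitude found in bins K/4 .. K-1."""
    total = sum(mags)
    if total <= 0:
        return 0.0  # assumption: silent frame
    start = len(mags) // 4  # assumption: integer division for the lower bound
    return sum(mags[start:]) / total
```

A flat spectrum over 8 bins gives 0.75, since 6 of the 8 bins lie at or above \(K/4\).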

`Zero Crossing Rate`

ID: zcr | ⁵

Zero Crossing Rate counts how often the waveform crosses the zero amplitude line, indicating the noisiness or percussiveness of a sound.

Equation

\(ZCR = \frac{1}{N} \sum_{n=1}^{N-1} \mathbb{I}\{\text{sgn}(x[n]) \neq \text{sgn}(x[n-1])\}\)
Notes
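
The formula can be sketched in Python as follows, assuming `x` holds the \(N\) time-domain samples of one frame. The function name is hypothetical, and treating a sample equal to zero as positive is an assumption; other sign conventions give different counts for signals that touch zero.

```python
def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ, divided by N."""
    # Assumption: a sample equal to 0 is treated as positive
    signs = [v >= 0 for v in x]
    crossings = sum(1 for n in range(1, len(x)) if signs[n] != signs[n - 1])
    return crossings / len(x)
```

The alternating signal `[1, -1, 1, -1]` has 3 crossings over 4 samples, giving 0.75.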

`Standard Deviation`

ID: std

Standard deviation of the normalized spectral power compared to the mean (\(\mu = \frac{1}{K}\)):

Equation

\(StdDev = \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left( |X_{norm}[k]| - \mu \right)^2}\)
Notes

`Pitch` & `PitchConfidence`

ID: pitch

Estimated fundamental frequency and confidence, calculated using the YIN algorithm's cumulative mean normalized difference function (CMNDF):

Equation

\(d'_t(\tau) = \begin{cases} 1 & \text{if } \tau = 0 \\ \frac{d_t(\tau)}{\frac{1}{\tau} \sum_{j=1}^{\tau} d_t(j)} & \text{otherwise} \end{cases}\)
Notes
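
A minimal Python sketch of the CMNDF. The formula takes the difference function \(d_t(\tau)\) as given; the sketch assumes the usual YIN definition, the sum of squared differences between the signal and a copy of itself delayed by \(\tau\). Function names and the fixed window length are assumptions, and the later YIN steps (thresholding, parabolic interpolation) are not shown.

```python
def difference(x, tau_max):
    """d(tau) = sum_j (x[j] - x[j + tau])^2 over a window of len(x) - tau_max samples."""
    window = len(x) - tau_max  # assumption: same window length for every lag
    return [sum((x[j] - x[j + tau]) ** 2 for j in range(window))
            for tau in range(tau_max + 1)]

def cmndf(d):
    """Cumulative mean normalized difference: d'(0) = 1, else d(tau) * tau / sum_{j<=tau} d(j)."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(d)):
        running += d[tau]
        # Assumption: return 1 when the running sum is 0 (e.g. a silent frame)
        out.append(d[tau] * tau / running if running > 0 else 1.0)
    return out
```

For a signal with a period of 4 samples, such as `[0, 1, 0, -1]` repeated, the CMNDF drops to 0 at \(\tau = 4\), which is the lag a pitch estimator would pick.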

`Normalized Magnitude`

Vector of magnitude values normalized by the FFT size \(N\):

Equation

\(|X[k]| = \frac{Power[k]}{N}\)
Notes

`Harmonicity`

ID: harmonicity

Only a very simple version is implemented for now.

Harmonicity measures how well a sound’s spectrum aligns with a harmonic series (integer multiples of a fundamental frequency). High harmonicity indicates a clear pitched sound with harmonically related partials, while low harmonicity indicates inharmonic or noise-like spectra.

Equation

\[Harmonicity = \frac{\max_{k>0} |X[k]|}{\sum_{k>0} |X[k]|}\]
Notes

A future version will probably be based on Yu (2006).

`Log-Mel Spectrogram`

ID: logmel | ⁵

The log-mel spectrum represents how the energy of a sound is distributed across perceptual frequency bands, using the mel scale and a logarithmic (dB) compression to approximate human loudness perception.

Equation

\[ E_m = \sum_{k=0}^{K-1} H_m(k)\,P[k] \]

\[ \text{LogMel}[m] = \max\left(L_{min},\; 10 \log_{10}(E_m)\right) \]
Notes
- \(P[k]\) is the power spectrum (\(|X[k]|^2\))
- \(H_m(k)\) is the mel filterbank (triangular filters)
- \(E_m\) is the energy in mel band \(m\)
- Log scaling (dB) approximates human loudness perception
- \(L_{min}\) prevents \(-\infty\) (numerical stability)
- Often followed by a top-dB clipping (e.g., 80 dB range)
- This is computed per frame; stacking over time forms a mel spectrogram
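
The two equations can be sketched in Python as follows, assuming the mel filterbank is already available as a list of rows, one row of weights \(H_m(k)\) per band. The function name and the default floor of \(-80\) dB are assumptions; building the triangular filters is not shown.

```python
import math

def log_mel(power, filterbank, l_min=-80.0):
    """Per-band energy E_m = sum_k H_m(k) P[k], converted to dB and floored at l_min."""
    out = []
    for row in filterbank:
        energy = sum(h * p for h, p in zip(row, power))
        # An empty band would give log10(0), so it is mapped directly to the floor
        out.append(max(l_min, 10.0 * math.log10(energy)) if energy > 0 else l_min)
    return out
```

With `power = [1, 1, 0, 0]` and two rectangular toy filters `[[1, 1, 0, 0], [0, 0, 1, 1]]`, the first band is about 3.01 dB and the second is clamped to the floor.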

`MFCC`

ID: mfcc | ³

MFCCs summarize the shape of a sound’s spectrum on a perceptual, mel-based scale, giving a compact representation of timbre and tone color as humans hear it.

Equation

\(MFCC[i] = \sum_{m=0}^{M-1} \cos\left( \frac{\pi i (m + 0.5)}{M} \right) \max(L_{min}, 10 \log_{10}(E_m))\)
Notes
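
The sum in the formula is an unnormalized DCT-II of the log-mel values. A Python sketch, assuming `log_mel_values` already holds \(\max(L_{min}, 10 \log_{10}(E_m))\) for each of the \(M\) bands; the function name and the `n_coeffs` parameter are hypothetical.

```python
import math

def mfcc(log_mel_values, n_coeffs):
    """Unnormalized DCT-II of the log-mel band values."""
    M = len(log_mel_values)
    return [
        sum(math.cos(math.pi * i * (m + 0.5) / M) * v
            for m, v in enumerate(log_mel_values))
        for i in range(n_coeffs)
    ]
```

For a constant input, coefficient 0 equals the sum of the band values and every higher coefficient is zero up to rounding error, which is why coefficient 0 is often read as an overall level.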

`Chroma`

ID: chroma | ⁴

Chroma features capture the intensity of the twelve pitch classes (C, C♯, …, B) in a sound, representing its harmonic and melodic content independently of octave.

Equation

\(Chroma[c] = \sum_{k=0}^{K-1} W_{c}[k] \cdot Power[k]\)
Notes
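
The formula leaves the weights \(W_c[k]\) unspecified. The Python sketch below uses the simplest possible choice as an assumption: each bin is assigned entirely to the nearest equal-tempered pitch class, with A4 tuned to 440 Hz. Real chroma filterbanks typically use smoother weights, so this is an illustration rather than the OpenScofo implementation. The function name is hypothetical.

```python
import math

def chroma(power, freqs, a4=440.0):
    """Accumulate power into 12 pitch classes (index 0 = C, 9 = A)."""
    out = [0.0] * 12
    for p, f in zip(power, freqs):
        if f <= 0:
            continue  # the DC bin has no pitch
        midi = 69 + 12 * math.log2(f / a4)  # MIDI note number, A4 = 69
        # Assumption: hard assignment to the nearest semitone (W_c[k] is 0 or 1)
        out[round(midi) % 12] += p
    return out
```

Bins at 440 Hz and 880 Hz both land in pitch class A (index 9), showing the octave independence described above.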

¹ OpenScofo uses a Hann window and FFTW3 for the FFT.
² Descriptor compatible with librosa to the order of \(10^{-9}\).
³ Descriptor compatible with librosa to the order of \(10^{-5}\).
⁴ Descriptor compatible with librosa to the order of \(10^{-3}\).
⁵ Descriptor fully compatible with librosa.
⁶ Descriptor compatible with essentia to the order of \(10^{-4}\).
