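librosa is a widely used Python library for audio processing.

It provides a function called stft, which computes the short-time Fourier transform (STFT) of an audio signal. librosa.stft returns a complex-valued matrix D, where D[f, t] is the coefficient of frequency bin f at frame t.

- np.abs(stft(...)) takes the modulus of each complex entry, that is, the magnitude (amplitude) of each frequency bin. It is not the real part.
- np.angle(stft(...)) takes the argument of each complex entry, that is, the phase of each frequency bin. It is not the imaginary part.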
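
The difference between magnitude/phase and real/imaginary parts can be checked with a few lines of NumPy. The matrix `D` below is a hand-written stand-in for an STFT output (an assumption for illustration, not real audio).

```python
import numpy as np

# Hypothetical 2x2 complex matrix standing in for the output of stft()
D = np.array([[3 + 4j, 1j],
              [-2 + 0j, 1 - 1j]])

magnitude = np.abs(D)   # modulus sqrt(re^2 + im^2), not the real part
phase = np.angle(D)     # argument atan2(im, re) in radians, not the imaginary part

# Magnitude and phase together recover the original complex matrix
reconstructed = magnitude * np.exp(1j * phase)

print(magnitude[0, 0], D.real[0, 0])  # 5.0 versus 3.0
```

Because `magnitude * exp(1j * phase)` gives back `D`, the two arrays together carry the same information as the complex matrix.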

This function returns a complex-valued matrix D such that
- np.abs(D[f, t]) is the magnitude of frequency bin f at frame t, and
- np.angle(D[f, t]) is the phase of frequency bin f at frame t.

The integers t and f can be converted to physical units by means of the utility functions frames_to_samples and fft_frequencies.
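
The index-to-unit conversion is simple arithmetic, sketched below in plain NumPy. The helper names `bin_to_hz`, `frame_to_sample` and `frame_to_seconds` are hypothetical and written only for this illustration; in practice the librosa utilities named above would be used. The values of `sr`, `n_fft` and `hop_length` are assumed defaults.

```python
import numpy as np

sr = 22050        # sample rate in Hz (assumed)
n_fft = 2048      # FFT size (assumed)
hop_length = 512  # n_fft // 4

def bin_to_hz(f, sr, n_fft):
    """Centre frequency in Hz of frequency bin f."""
    return f * sr / n_fft

def frame_to_sample(t, hop_length):
    """Sample index that frame t is aligned with."""
    return t * hop_length

def frame_to_seconds(t, sr, hop_length):
    """Time in seconds that frame t is aligned with."""
    return frame_to_sample(t, hop_length) / sr

bins = np.arange(1 + n_fft // 2)
freqs = bin_to_hz(bins, sr, n_fft)

print(freqs[0], freqs[-1])          # 0.0 and sr / 2
print(frame_to_sample(10, hop_length))
```

The last bin always lands on the Nyquist frequency `sr / 2`, which is why the matrix has `1 + n_fft / 2` rows.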

1. The `librosa.stft` function

librosa.stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=None, pad_mode='reflect')

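Parameters:

y: the audio time series (a real-valued array).
n_fft: the FFT size, that is, the length of each windowed frame after zero-padding. Adjacent frames relate to the hop as win_length = hop_length + overlap.
hop_length: the frame shift, i.e. the number of samples between adjacent frames. If unspecified, defaults to win_length // 4.
win_length: each frame of audio is multiplied by window of length win_length and then zero-padded to match n_fft. Defaults to win_length = n_fft.
window: string, tuple, number, function, or array of shape (n_fft,). One of:
- a window specification (string, tuple, or number), see scipy.signal.get_window;
- a window function, such as scipy.signal.windows.hann;
- a vector or array of length n_fft.
center: bool.
- If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length].
- If False, D[:, t] begins at y[t * hop_length].
dtype: the complex dtype of D. By default it is inferred from the precision of the input signal (for example, complex64 for float32 input).
pad_mode: if center=True, the padding mode used at the edges of the signal. By default the STFT uses reflection padding. Ignored when center=False.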
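
The interplay of win_length and n_fft can be sketched as follows: a window of win_length samples is centred inside a zero vector of length n_fft. This is a NumPy-only approximation written for illustration; `periodic_hann` and `pad_window` are hypothetical helper names, and the tiny sizes are chosen only to keep the output readable.

```python
import numpy as np

def periodic_hann(win_length):
    """Periodic Hann window: 0.5 - 0.5 * cos(2 * pi * n / N)."""
    n = np.arange(win_length)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / win_length)

def pad_window(window, n_fft):
    """Zero-pad a window symmetrically so that it has length n_fft."""
    lpad = (n_fft - len(window)) // 2
    rpad = n_fft - len(window) - lpad
    return np.pad(window, (lpad, rpad), mode="constant")

n_fft, win_length = 16, 8
fft_window = pad_window(periodic_hann(win_length), n_fft)

print(np.round(fft_window, 3))
```

Only the middle win_length samples of each n_fft-long frame contribute to the transform; the zeros on either side add frequency bins without adding new signal content.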

Returns:
The STFT matrix D, with shape = $(1 + \frac{n_{fft}}{2},\ n_{frames})$
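
The output shape can be predicted from the parameters alone. The helper `stft_shape` below is a hypothetical function written for this note (not part of librosa); it applies the defaults described above and assumes the signal is at least n_fft samples long when center=False.

```python
def stft_shape(n_samples, n_fft=2048, hop_length=None, win_length=None, center=True):
    """Predict the (rows, columns) shape of the STFT matrix."""
    if win_length is None:
        win_length = n_fft            # default window length
    if hop_length is None:
        hop_length = win_length // 4  # default hop
    if center:
        n_samples += 2 * (n_fft // 2)  # padding on both sides
    n_frames = 1 + (n_samples - n_fft) // hop_length
    return 1 + n_fft // 2, n_frames

print(stft_shape(22050))                # (1025, 44)
print(stft_shape(22050, center=False))  # (1025, 40)
```

One second of audio at 22050 Hz therefore gives 44 centred frames with the default settings, and 40 frames when centring is turned off.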

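2. Number of output frames of stft

Taking the magnitude of the STFT of an audio signal gives its linear spectrogram.

Applying mel-scale weighted sums to the linear spectrogram gives the mel spectrogram, which is widely used in speech recognition and speech synthesis.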
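
The "weighted sum on the mel scale" is a matrix product between a filterbank and the linear spectrogram. The sketch below builds a simplified triangular filterbank with the HTK-style formula mel = 2595 * log10(1 + f / 700); this is an assumption made for brevity. librosa's own `librosa.filters.mel` uses a different mel formula and normalization by default, so its numbers will differ. The random `linear_spec` is a stand-in for `np.abs(D)`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters, shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.arange(1 + n_fft // 2) * sr / n_fft
    # n_mels + 2 band edges, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    weights = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        weights[m] = np.maximum(0.0, np.minimum(rising, falling))
    return weights

rng = np.random.default_rng(0)
linear_spec = np.abs(rng.standard_normal((257, 10)))  # stand-in for np.abs(D)

mel_basis = mel_filterbank(sr=16000, n_fft=512, n_mels=40)
mel_spec = mel_basis @ linear_spec  # each mel band is a weighted sum of FFT bins

print(mel_spec.shape)  # (40, 10)
```

Each row of `mel_basis` is one triangle over the FFT bins, so the product reduces 257 linear bins to 40 mel bands per frame.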

The STFT first splits the audio into frames and then applies a Fourier transform to each frame separately.
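
A minimal NumPy sketch of this frame-then-transform procedure is shown below. It is a simplified illustration, not librosa's implementation: `simple_stft` is a hypothetical name, the window is a periodic Hann of length n_fft, and the test signal is a synthetic 440 Hz sine at an assumed sample rate of 8000 Hz.

```python
import numpy as np

def simple_stft(y, n_fft=200, hop_length=100):
    """Minimal STFT: centre-pad, split into frames, window, FFT each frame."""
    n = np.arange(n_fft)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft)  # periodic Hann
    y = np.pad(y, n_fft // 2, mode="reflect")             # centre the frames
    n_frames = 1 + (len(y) - n_fft) // hop_length
    # One column per frame, each frame n_fft samples long
    frames = np.stack(
        [y[t * hop_length: t * hop_length + n_fft] for t in range(n_frames)],
        axis=1,
    )
    # Real-input FFT along the time axis of each frame
    return np.fft.rfft(window[:, None] * frames, axis=0)

sr = 8000
t = np.arange(1000) / sr
y = np.sin(2 * np.pi * 440.0 * t)

D = simple_stft(y)
print(D.shape)  # (101, 11)
```

With a bin spacing of sr / n_fft = 40 Hz, the 440 Hz tone falls on bin 11, which is where the magnitude of a frame from the middle of the signal peaks.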

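When computing the STFT with stft, the number of output frames looks odd at first.

Suppose the audio has 400 samples and the frame length is 100 with no overlap between frames (so the hop length is also 100). One would naturally expect 4 frames.

In practice there are 5 frames, which is surprising.

The reason is as follows.

To simplify the discussion, assume a frame length of 200, a hop length of 100, and an FFT size of 200.

Take an audio clip of 430 samples (illustrated by a figure in the original post); after the STFT it yields 5 frames.

The process is as follows:

- Pad the audio on both the left and the right; the padding on each side is half the FFT size, here 100 samples.
- The audio length becomes 430 + 200 = 630.
- The number of frames is 1 + (630 - 200) // 100 = 5, because a frame of 200 samples starts every 100 samples.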

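So for librosa's stft with a hop length of 100, any audio length x with 400 <= x < 500 produces 5 output frames.

More generally, the number of output frames is audio_length // hop_length + 1 (with the default center=True and an even n_fft).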
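
The arithmetic can be written as a small pure-Python function. `n_frames_centered` is a hypothetical helper for this note; it assumes center=True, so the signal grows by n_fft // 2 samples on each side before framing.

```python
def n_frames_centered(n_samples, n_fft, hop_length):
    """Frame count after padding n_fft // 2 samples on each side."""
    padded = n_samples + 2 * (n_fft // 2)
    return 1 + (padded - n_fft) // hop_length

# Worked example from the text: 430 samples, n_fft = 200, hop = 100
print(n_frames_centered(430, 200, 100))  # 5

# Every length in [400, 500) gives 5 frames
counts = {n_frames_centered(n, 200, 100) for n in range(400, 500)}
print(counts)  # {5}

# For even n_fft the closed form n // hop + 1 holds for any length
matches = all(
    n_frames_centered(n, 200, 100) == n // 100 + 1 for n in range(1, 2000)
)
print(matches)  # True
```

For an even n_fft the padding exactly cancels the frame length, which is why n_fft drops out of the closed form.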

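Why does this matter? Whenever frames must line up with the original samples, remember to zero-pad the audio to the length implied by the number of frames (n_frames * hop_length) before doing anything else.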
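
One way to do this alignment is sketched below, under the assumption that each frame should own exactly hop_length samples. The signal `y` is a hypothetical placeholder and the frame count uses the formula for center=True.

```python
import numpy as np

hop_length = 100
y = np.ones(430)                      # hypothetical sample-level sequence

n_frames = len(y) // hop_length + 1   # 5 frames with center=True
target_len = n_frames * hop_length    # 500 samples

# Zero-pad at the end so that the samples split evenly across frames
y_padded = np.pad(y, (0, target_len - len(y)), mode="constant")

# Row t now holds the hop_length samples associated with frame t
per_frame = y_padded.reshape(n_frames, hop_length)

print(len(y_padded), per_frame.shape)  # 500 (5, 100)
```

After padding, frame-level features and sample-level data can be indexed with the same frame number t.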

If you are interested, you can write some code to verify this yourself.
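
One possible check, sketched with NumPy only, pads a dummy signal and counts how many full frames fit. `count_frames` is a hypothetical helper; it mimics the padding and framing steps rather than calling librosa. With librosa installed, the column count of `librosa.stft(np.zeros(430), n_fft=200, hop_length=100)` should match.

```python
import numpy as np

def count_frames(n_samples, n_fft=200, hop_length=100):
    """Pad a dummy signal and count the frame start positions that fit."""
    y = np.zeros(n_samples)
    y = np.pad(y, n_fft // 2, mode="reflect")
    starts = range(0, len(y) - n_fft + 1, hop_length)
    return len(starts)

for n in (399, 400, 430, 499, 500):
    print(n, count_frames(n))
```

The count steps from 4 to 5 at 400 samples and from 5 to 6 at 500 samples.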

3. code

@cache(level=20)
def stft(
    y,
    n_fft=2048,
    hop_length=None,
    win_length=None,
    window="hann",
    center=True,
    dtype=None,
    pad_mode="reflect",
):
    """Short-time Fourier transform (STFT).

    The STFT represents a signal in the time-frequency domain by
    computing discrete Fourier transforms (DFT) over short overlapping
    windows.

    This function returns a complex-valued matrix D such that

    - ``np.abs(D[f, t])`` is the magnitude of frequency bin ``f``
      at frame ``t``, and

    - ``np.angle(D[f, t])`` is the phase of frequency bin ``f``
      at frame ``t``.

    The integers ``t`` and ``f`` can be converted to physical units by means
    of the utility functions `frames_to_sample` and `fft_frequencies`.

    Parameters
    ----------
    y : np.ndarray [shape=(n,)], real-valued
        input signal

    n_fft : int > 0 [scalar]
        length of the windowed signal after padding with zeros.
        The number of rows in the STFT matrix ``D`` is ``(1 + n_fft/2)``.
        The default value, ``n_fft=2048`` samples, corresponds to a physical
        duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the
        default sample rate in librosa. This value is well adapted for music
        signals. However, in speech processing, the recommended value is 512,
        corresponding to 23 milliseconds at a sample rate of 22050 Hz.
        In any case, we recommend setting ``n_fft`` to a power of two for
        optimizing the speed of the fast Fourier transform (FFT) algorithm.

    hop_length : int > 0 [scalar]
        number of audio samples between adjacent STFT columns.

        Smaller values increase the number of columns in ``D`` without
        affecting the frequency resolution of the STFT.

        If unspecified, defaults to ``win_length // 4`` (see below).

    win_length : int <= n_fft [scalar]
        Each frame of audio is windowed by ``window`` of length ``win_length``
        and then padded with zeros to match ``n_fft``.

        Smaller values improve the temporal resolution of the STFT (i.e. the
        ability to discriminate impulses that are closely spaced in time)
        at the expense of frequency resolution (i.e. the ability to discriminate
        pure tones that are closely spaced in frequency). This effect is known
        as the time-frequency localization trade-off and needs to be adjusted
        according to the properties of the input signal ``y``.

        If unspecified, defaults to ``win_length = n_fft``.

    window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]
        Either:

        - a window specification (string, tuple, or number);
          see `scipy.signal.get_window`

        - a window function, such as `scipy.signal.windows.hann`

        - a vector or array of length ``n_fft``

        Defaults to a raised cosine window (`'hann'`), which is adequate for
        most applications in audio signal processing.

        .. see also:: `filters.get_window`

    center : boolean
        If ``True``, the signal ``y`` is padded so that frame
        ``D[:, t]`` is centered at ``y[t * hop_length]``.

        If ``False``, then ``D[:, t]`` begins at ``y[t * hop_length]``.

        Defaults to ``True``,  which simplifies the alignment of ``D`` onto a
        time grid by means of `librosa.frames_to_samples`.
        Note, however, that ``center`` must be set to `False` when analyzing
        signals with `librosa.stream`.

        .. see also:: `librosa.stream`

    dtype : np.dtype, optional
        Complex numeric type for ``D``.  Default is inferred to match the
        precision of the input signal.

    pad_mode : string or function
        If ``center=True``, this argument is passed to `np.pad` for padding
        the edges of the signal ``y``. By default (``pad_mode="reflect"``),
        ``y`` is padded on both sides with its own reflection, mirrored around
        its first and last sample respectively.
        If ``center=False``,  this argument is ignored.

        .. see also:: `numpy.pad`

    Returns
    -------
    D : np.ndarray [shape=(1 + n_fft/2, n_frames), dtype=dtype]
        Complex-valued matrix of short-term Fourier transform
        coefficients.

    See Also
    --------
    istft : Inverse STFT
    reassigned_spectrogram : Time-frequency reassigned spectrogram

    Notes
    -----
    This function caches at level 20.

    Examples
    --------
    >>> y, sr = librosa.load(librosa.ex('trumpet'))
    >>> S = np.abs(librosa.stft(y))
    >>> S
    array([[5.395e-03, 3.332e-03, ..., 9.862e-07, 1.201e-05],
           [3.244e-03, 2.690e-03, ..., 9.536e-07, 1.201e-05],
           ...,
           [7.523e-05, 3.722e-05, ..., 1.188e-04, 1.031e-03],
           [7.640e-05, 3.944e-05, ..., 5.180e-04, 1.346e-03]],
          dtype=float32)

    Use left-aligned frames, instead of centered frames

    >>> S_left = librosa.stft(y, center=False)

    Use a shorter hop length

    >>> D_short = librosa.stft(y, hop_length=64)

    Display a spectrogram

    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots()
    >>> img = librosa.display.specshow(librosa.amplitude_to_db(S,
    ...                                                        ref=np.max),
    ...                                y_axis='log', x_axis='time', ax=ax)
    >>> ax.set_title('Power spectrogram')
    >>> fig.colorbar(img, ax=ax, format="%+2.0f dB")
    """

    # By default, use the entire frame
    if win_length is None:
        win_length = n_fft

    # Set the default hop, if it's not already specified
    if hop_length is None:
        hop_length = int(win_length // 4)

    fft_window = get_window(window, win_length, fftbins=True)

    # Pad the window out to n_fft size
    fft_window = util.pad_center(fft_window, n_fft)

    # Reshape so that the window can be broadcast
    fft_window = fft_window.reshape((-1, 1))

    # Check audio is valid
    util.valid_audio(y)

    # Pad the time series so that frames are centered
    if center:
        if n_fft > y.shape[-1]:
            warnings.warn(
                "n_fft={} is too small for input signal of length={}".format(
                    n_fft, y.shape[-1]
                )
            )

        y = np.pad(y, int(n_fft // 2), mode=pad_mode)

    elif n_fft > y.shape[-1]:
        raise ParameterError(
            "n_fft={} is too large for input signal of length={}".format(
                n_fft, y.shape[-1]
            )
        )

    # Window the time series.
    y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)

    if dtype is None:
        dtype = util.dtype_r2c(y.dtype)

    # Pre-allocate the STFT matrix
    stft_matrix = np.empty(
        (int(1 + n_fft // 2), y_frames.shape[1]), dtype=dtype, order="F"
    )

    fft = get_fftlib()

    # how many columns can we fit within MAX_MEM_BLOCK?
    n_columns = util.MAX_MEM_BLOCK // (stft_matrix.shape[0] * stft_matrix.itemsize)
    n_columns = max(n_columns, 1)

    for bl_s in range(0, stft_matrix.shape[1], n_columns):
        bl_t = min(bl_s + n_columns, stft_matrix.shape[1])

        stft_matrix[:, bl_s:bl_t] = fft.rfft(
            fft_window * y_frames[:, bl_s:bl_t], axis=0
        )
    return stft_matrix
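
The final loop of the listing transforms the frames in blocks of columns so that the temporary arrays stay within a memory budget. The sketch below reproduces that idea with NumPy only, to show that block-wise processing gives the same result as one large FFT. The random `frames` array and the tiny `max_mem_block` budget are assumptions for illustration and are not librosa's values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft, n_frames = 64, 10
frames = rng.standard_normal((n_fft, n_frames))  # stand-in for windowed frames

max_mem_block = 2048  # hypothetical byte budget, chosen small on purpose
out = np.empty((1 + n_fft // 2, n_frames), dtype=np.complex128, order="F")

# How many columns fit within the budget? Always process at least one.
n_columns = max(max_mem_block // (out.shape[0] * out.itemsize), 1)

for bl_s in range(0, n_frames, n_columns):
    bl_t = min(bl_s + n_columns, n_frames)
    out[:, bl_s:bl_t] = np.fft.rfft(frames[:, bl_s:bl_t], axis=0)

# Same result as transforming all frames at once
same = np.allclose(out, np.fft.rfft(frames, axis=0))
print(n_columns, same)  # 3 True
```

Since each column is transformed independently, the block size affects only peak memory use and not the values in the output.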
