Fundamentals of Speech Signal Processing for Deep Learning

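Contents

Audio processing pipeline
- Common spectra: magnitude spectrum and mel spectrum
Time domain --> frequency domain
Framing
- Window length
- Frame shift
- Feature extraction workflow for speech signals
Mel spectrum
- Extracting a mel spectrum with librosa
- Extracting a mel spectrum with Tacotron (recommended)
Summary
Related material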

Source: notes on the video 《[天子学习]深度学习中的语音信号处理基础&代码实现》 by 维涅斯的大火花 (highly recommended!)
https://www.bilibili.com/video/BV1f3411C7kb

Speech can be represented in two ways:

Time domain
Frequency domain

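Audio processing pipeline

After the signal is read in:

Framing, pre-emphasis, and windowing.
FFT (fast Fourier transform) --> STFT (short-time Fourier transform) spectrum, also called the linear spectrum. The STFT spectrum is complex-valued and contains both magnitude and phase.
Take the absolute value of each complex number --> magnitude spectrum (also called the amplitude spectrum; squaring the absolute value gives the power spectrum). Compared with the STFT, the phase information is discarded.
Mel filtering --> mel spectrum (melspectrum).
Take the logarithm --> log-mel spectrum. The log mel filter-bank energies are what is usually called Fbank features.
DCT (discrete cosine transform) --> MFCC, which describe the spectral envelope.
Compute dynamic features (delta MFCC) --> output the feature vector. (A spectral envelope can be extracted with pyworld.)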
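The last steps of the pipeline (logarithm, DCT, dynamic features) can be sketched with NumPy alone. This is a minimal illustration rather than a reference implementation: it assumes a log-mel matrix of shape (n_mels, frames) is already available, uses an unnormalized DCT-II, keeps 13 coefficients (a common but arbitrary choice), and approximates the delta features with a plain frame-to-frame difference. The names `dct2_basis`, `mfcc_from_log_mel`, and `delta` are made up for this example.

```python
import numpy as np

def dct2_basis(n_out, n_in):
    """Unnormalized DCT-II basis of shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """Apply the DCT along the mel axis: (n_mels, T) -> (n_mfcc, T)."""
    return dct2_basis(n_mfcc, log_mel.shape[0]) @ log_mel

def delta(features):
    """First-order difference along time; output has the same shape as the input."""
    return np.diff(features, axis=1, prepend=features[:, :1])

# Random stand-in for a real log-mel spectrogram (80 mel bands, 101 frames).
log_mel = np.log(np.random.rand(80, 101) + 1e-5)
mfcc = mfcc_from_log_mel(log_mel)
features = np.concatenate([mfcc, delta(mfcc)], axis=0)
print(mfcc.shape, features.shape)
```

Production toolkits typically use an orthonormal DCT and a regression-based delta, so the values here will differ from theirs by scaling.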

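Common spectra: magnitude spectrum and mel spectrum

The spectra used most often in deep learning for speech are the magnitude spectrum and the mel spectrum.

Both can usually be extracted with librosa or torchaudio.

Fbank and MFCC features carry relatively little information, so they are a poorer fit for today's large-data settings.

MFCC are low-dimensional, but some specialized tasks still use them, for example emotional voice conversion.

Understanding how the STFT spectrum is extracted matters most.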

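Speech is a time-domain signal, that is, it changes continuously over time.

A time-domain speech signal can be stored as a one-dimensional vector: the Y axis is amplitude and the X axis is the sample index.

A sensor samples the signal at a fixed time interval. The reciprocal of that interval is the sampling rate, measured in Hz.

Signal length (in samples) = duration (time, in seconds) * sampling rate (sample rate)

For example, a speech signal that lasts 2 s (time notation) at a sampling rate of 16000 Hz is stored as a one-dimensional vector of 2 * 16000 values. Equivalently, its length is 32000 (sample notation).

A length given in samples only determines the actual duration once the sampling rate is known.

So for a fixed number of samples, a higher sampling rate means a shorter duration.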
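The relation between the two notations is simple arithmetic. The helper names below are hypothetical and only serve to make the conversion explicit.

```python
def num_samples(duration_s, sample_rate):
    """Time notation -> sample notation."""
    return int(round(duration_s * sample_rate))

def duration_seconds(n_samples, sample_rate):
    """Sample notation -> time notation."""
    return n_samples / sample_rate

print(num_samples(2.0, 16000))          # 2 s at 16 kHz
print(duration_seconds(32000, 16000))   # the same vector read back as seconds
print(duration_seconds(32000, 32000))   # same sample count, doubled rate: half the time
```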

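Why deep learning models usually do not take raw samples as input:

1. The amount of data is too large.

2. Humans do not perceive sound as a time-domain signal. The cochlea decomposes sound into components of different frequencies, so what we mainly perceive is the frequency and intensity of each component.

Human hearing covers roughly 20 Hz to 20 kHz, while speech lies mostly within 200 Hz to 8000 Hz. (By the sampling theorem, capturing content up to 8000 Hz requires a sampling rate of at least 16000 Hz, which is why speech is commonly sampled at 16000 Hz.)

For example, take a sound made of three components: $10\sin(x) + 0.1\sin(100x) + 0.01\sin(1000x)$. A listener mainly perceives the component $10\sin(x)$ (angular frequency 1, amplitude 10), because its amplitude dominates the other two.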

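Time domain --> frequency domain

The basic task of signal processing: let the computer work out which component signals a real speech signal is made of.

Which algorithm recovers the frequency of each component? The Fourier transform.

The basic idea of the Fourier transform: choose a set of basis signals (sinusoids) at a suitable frequency spacing, then compute the amplitude and phase of each basis signal in the speech.

In other words, one segment of signal yields one set of amplitudes and phases, one pair per basis frequency.

In a frequency-domain plot, the X axis is frequency and the Y axis is the amplitude of the component at that frequency.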
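As a small check of this idea, the three-component signal from the earlier example can be decomposed with NumPy's FFT. The sketch assumes the signal is sampled over exactly one period of its slowest component (4096 points on [0, 2π)), so each component falls exactly on one frequency bin. The factor 2/N converts DFT magnitudes back to sinusoid amplitudes.

```python
import numpy as np

N = 4096                                   # samples over one period [0, 2*pi)
x = 2 * np.pi * np.arange(N) / N
signal = 10 * np.sin(x) + 0.1 * np.sin(100 * x) + 0.01 * np.sin(1000 * x)

spectrum = np.fft.rfft(signal)             # one complex value per frequency
amplitude = 2 * np.abs(spectrum) / N       # rescale DFT magnitudes to sinusoid amplitudes
phase = np.angle(spectrum)                 # phase of each component

for k in (1, 100, 1000):
    print(k, round(float(amplitude[k]), 4))
```

The amplitudes at bins 1, 100, and 1000 match the coefficients 10, 0.1, and 0.01 of the original sum, and every other bin is numerically zero.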

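Question: is it enough to compute a single set of amplitudes and phases for a whole utterance?

Answer: no.

Time resolution and frequency resolution trade off against each other (an uncertainty relation analogous to Heisenberg's uncertainty principle), so the Fourier transform should not be applied to a segment that is too long or too short. A long segment averages over sounds that change within it, and a short segment cannot separate nearby frequencies.

Too long: for example, more than 50 ms. Too short: for example, around 10 ms or 5 ms.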
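A rough way to see the trade-off: a frame of duration T cannot separate frequencies that are closer than about 1/T. The helper below is a hypothetical illustration of that rule of thumb, not a precise resolution formula (the exact figure also depends on the window shape).

```python
def frequency_resolution_hz(frame_ms):
    """Approximate smallest frequency difference a frame of this duration can resolve."""
    return 1000.0 / frame_ms

for frame_ms in (5, 10, 25, 50, 200):
    print(f"{frame_ms:>4} ms frame -> about {frequency_resolution_hz(frame_ms):.0f} Hz")
```

Short frames follow fast changes in the speech but blur frequencies together; long frames do the opposite.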

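See also: 《形象易懂的傅里叶变换、短时傅里叶变换和小波变换》 (an accessible introduction to the Fourier transform, the short-time Fourier transform, and the wavelet transform): https://mp.weixin.qq.com/s/CRqhHIlYYRjYJ64PZZnUkQ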

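Framing

A speech signal may last anywhere from 2 s to 1 min.

How are amplitude and phase extracted from it? By framing, which slides a window along the signal much like a one-dimensional convolution.

Window length

Window length (window size / window length): the number of samples in each frame, typically 25 ms or 1024 samples.

In practice, 25–50 ms has proven to be a suitable length.

A Fourier transform is applied to each frame to obtain its amplitudes and phases.

In other words, once a long signal is split into frames, it yields many sets of amplitudes and phases.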

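Frame shift

Frame shift (hop size / hop length): the distance the window moves between consecutive frames.

Number of frames:

n_frames = signal_length // hop_length + 1   (// is integer division)

The number of frames is the number of sets of amplitudes and phases obtained. With the centered padding that librosa and torchaudio apply by default, it depends only on the frame shift and not on the window length. Understanding how framing works makes this clear.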
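The frame-count formula can be checked with a few lines of plain Python. The sketch contrasts two conventions: centered padding (the formula above, assuming an even n_fft) and no padding, where only windows that fit entirely inside the signal are kept. All function names are illustrative.

```python
def frames_centered(n_samples, hop_length):
    """Frame count when the signal is padded by half a window on each side."""
    return n_samples // hop_length + 1

def frames_unpadded(n_samples, win_length, hop_length):
    """Frame count when only windows that fit entirely inside the signal are kept."""
    if n_samples < win_length:
        return 0
    return (n_samples - win_length) // hop_length + 1

def split_frames(samples, win_length, hop_length):
    """Slice a sequence into overlapping frames without padding."""
    count = frames_unpadded(len(samples), win_length, hop_length)
    return [samples[i * hop_length : i * hop_length + win_length] for i in range(count)]

samples = list(range(16000))
print(frames_centered(len(samples), 160))
print(len(split_frames(samples, 320, 160)))
```

Without padding the window length does enter the count, which is why the "hop only" rule holds specifically for the centered convention.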

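Framing turns speech from a one-dimensional vector into a two-dimensional matrix.

Question: when a Fourier transform is applied to the signal inside each window, how are the frequencies of the basis signals chosen?

That is, which signals do these "many sets" of amplitudes and phases belong to?

Answer: they are set by the algorithm's parameters. In code, this is usually the n_fft parameter.

In speech signal processing, the number of amplitude/phase pairs per frame is also the feature dimension.

With n_fft = 256, for example, there are actually 129 different frequencies.

How are these frequencies determined, and why 129?

For a real-valued signal, the Fourier transform gives: number of frequencies = n_fft // 2 + 1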
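The count n_fft // 2 + 1 follows from a symmetry of the DFT: for real input, bin k and bin n_fft - k are complex conjugates, so only bins 0 to n_fft // 2 carry independent information. A short NumPy check, using a random frame as stand-in data:

```python
import numpy as np

n_fft = 256
frame = np.random.default_rng(0).standard_normal(n_fft)   # one real-valued frame

full = np.fft.fft(frame)      # n_fft complex values
half = np.fft.rfft(frame)     # only the non-redundant part

# For real input, bin k and bin n_fft - k are complex conjugates.
k = np.arange(1, n_fft // 2)
mirror_ok = np.allclose(full[k], np.conj(full[n_fft - k]))

print(len(full), len(half), mirror_ok)
```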

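What are the frequencies themselves?

They start at 0 and increase linearly in steps of SR / n_fft, up to SR / 2.

The frequencies are, in order: $0,\ 1 \cdot \frac{SR}{\text{n\_fft}},\ 2 \cdot \frac{SR}{\text{n\_fft}},\ \dots,\ \frac{\text{n\_fft}}{2} \cdot \frac{SR}{\text{n\_fft}}$. When n_fft equals the window length, the step SR / n_fft is exactly 1/T, where T is the duration of the window.

Example: window length = 320, SR = 16000 (sampling rate)

Duration of the window: T = 320 / 16000 = 0.02 s

With n_fft = 320, the frequency step is 1/T = 1/0.02 = 50 Hz

and the 161 frequencies are: 0 Hz, 50 Hz, 100 Hz, 150 Hz, …, 8000 Hz

With n_fft = 256 instead (which requires a window of at most 256 samples), the step is 16000 / 256 = 62.5 Hz

and the 129 frequencies are: 0 Hz, 62.5 Hz, 125 Hz, …, 8000 Hz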
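For reference, the center frequency of bin k in an n_fft-point DFT at sampling rate SR is k * SR / n_fft, for k from 0 to n_fft // 2. The hypothetical helper below lists those frequencies for two settings at 16000 Hz.

```python
def bin_frequencies(sample_rate, n_fft):
    """Center frequency in Hz of every bin of a one-sided n_fft-point DFT."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]

freqs_256 = bin_frequencies(16000, 256)
freqs_320 = bin_frequencies(16000, 320)
print(len(freqs_256), freqs_256[:3], freqs_256[-1])
print(len(freqs_320), freqs_320[:3], freqs_320[-1])
```

In both cases the last bin sits at SR / 2 = 8000 Hz; only the spacing and the number of bins change with n_fft.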

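Next, the Fourier transform computes the amplitude and phase at each of these frequencies.

Different parameter settings (window length, n_fft) therefore produce different sets of frequencies and different resolutions.

The window length sets how much signal each frame analyzes, and so how finely nearby frequencies can be told apart.
n_fft sets how many frequencies are computed and the spacing between them.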

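Feature extraction workflow for speech signals

1. Start from a speech signal (time domain) with known sampling rate SR and duration Time.

2. Set the parameters: win_length (window length, the number of samples analyzed in each frame), hop_length (frame shift), and n_fft (which fixes how many frequencies are extracted from each frame). These three parameters determine the properties of the two-dimensional matrix.

3. The frequency-domain signal (two-dimensional):
Number of frames = (SR * Time) // hop_length + 1
Feature dimension = n_fft // 2 + 1
Tensor shape: (feature dimension, number of frames)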

import torch
import torchaudio

waveform = torch.rand(1, 16000)
# tensor([[0.5449, 0.5231, 0.0253,  ..., 0.1546, 0.1296, 0.1924]])
# Lengths are given in samples here, so no sampling rate needs to be set.
# Note: 0 < win_length <= n_fft
# Spectrogram() drops the phase; with its default power=2.0 it returns the power spectrum.
stft_func = torchaudio.transforms.Spectrogram(n_fft=350, win_length=320, hop_length=160)
STFT_spec = stft_func(waveform)
STFT_spec.shape
# torch.Size([1, 176, 101])

176 = 350 // 2 + 1

101 = 16000 // 160 + 1 # Number of frames; it depends on the duration of the speech. The longer the signal, the more frames it yields.
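The same shape can be reproduced with a hand-written STFT in NumPy, which makes the framing steps explicit. This is a sketch under stated assumptions, not torchaudio's implementation: it uses a periodic Hann window zero-padded to n_fft, reflect padding of n_fft // 2 samples on each side, and returns magnitudes (torchaudio's `Spectrogram` returns squared magnitudes by default). The name `stft_magnitude` is made up for this example.

```python
import numpy as np

def stft_magnitude(signal, n_fft, win_length, hop_length):
    """Magnitude STFT with a centered Hann window and reflect padding."""
    window = np.hanning(win_length + 1)[:-1]          # periodic Hann window
    left = (n_fft - win_length) // 2
    padded_window = np.zeros(n_fft)
    padded_window[left:left + win_length] = window    # zero-pad the window to n_fft

    padded = np.pad(signal, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    frames = np.stack([
        padded[i * hop_length : i * hop_length + n_fft] * padded_window
        for i in range(n_frames)
    ], axis=1)                                        # (n_fft, n_frames)
    return np.abs(np.fft.rfft(frames, axis=0))        # (n_fft // 2 + 1, n_frames)

signal = np.random.rand(16000)
spec = stft_magnitude(signal, n_fft=350, win_length=320, hop_length=160)
print(spec.shape)
```

The result has 176 rows (350 // 2 + 1) and 101 columns (16000 // 160 + 1), matching the torchaudio output above apart from the batch dimension.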

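Mel spectrum

Mel spectrum: the magnitude spectrum multiplied by a mel filter-bank matrix (the "mel transform"), which gives an 80-dimensional mel spectrum.

Its 80 frequency bands are spaced in a way that is closer to how human hearing perceives frequency.

The price is that it carries somewhat less speech information than the magnitude spectrum. It is therefore common in speech tasks whose output is meant for human listeners.

Note: in deep learning, "mel spectrum" almost always means the log-mel spectrum!

The Fourier (magnitude) spectrum is sampled at equal intervals, i.e. linearly, which is why it is also called the linear spectrum. Its frequencies look like: 10 Hz, 20 Hz, 30 Hz, …

The mel scale grows roughly logarithmically with frequency, so it is nonlinear: equal steps in Hz correspond to smaller and smaller steps in mel, schematically 10, 15, 17, 18, … Equivalently, mel bands are dense at low frequencies and sparse at high frequencies.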
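To make the nonlinearity concrete, the sketch below uses one common Hz-to-mel formula, mel = 2595 * log10(1 + f / 700) (the HTK convention). That choice is an assumption for illustration: the text above does not fix a formula, and librosa uses a slightly different (Slaney) scale by default. Points equally spaced in mel are converted back to Hz to show how the gaps widen.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel scale (assumed here for illustration)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(n_points, f_min, f_max):
    """Frequencies in Hz that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_points - 1)
    return [mel_to_hz(lo + i * step) for i in range(n_points)]

centers = mel_spaced_frequencies(10, 0.0, 8000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print([round(f) for f in centers])
print([round(g) for g in gaps])
```

Each gap between neighbouring frequencies is larger than the one before, which is the "dense at low frequencies, sparse at high frequencies" behaviour described above.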

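Speech deep learning task	Spectrum used in mainstream papers
Speech recognition (speech -> text symbols)	Melspec, or features learned by a neural network
Voiceprint recognition, speaker verification (speech -> speaker label)	Melspec
Speech synthesis (text symbols -> speech)	Melspec
Speech denoising	STFT spec, amp spec
Voice (timbre) conversion	Melspec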

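In most cases:

Magnitude spectrum: n_fft = 1024 (feature dimension n_dim = 513)

Mel spectrum: n_mels = 80

The magnitude spectrum carries more information than the mel spectrum, but it also contains more clutter and noise. The mel spectrum concentrates on what human hearing resolves, so 80 dimensions are generally enough to describe a human voice.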
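The reduction from 513 to 80 dimensions is a single matrix product with a bank of triangular filters. The sketch below builds such a bank by hand in NumPy. It is an illustrative construction, not librosa's `filters.mel`: it assumes the HTK-style mel formula, unnormalized triangles, and a 16000 Hz sampling rate, and `mel_filterbank` is a hypothetical name.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lower, center, upper = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

mel_basis = mel_filterbank(sr=16000, n_fft=1024, n_mels=80)
magnitude = np.random.rand(513, 101)            # stand-in magnitude spectrogram
mel = mel_basis @ magnitude                     # (80, 513) @ (513, 101) -> (80, 101)
log_mel = np.log(np.maximum(mel, 1e-5))         # log-mel, with a floor to avoid log(0)
print(mel_basis.shape, mel.shape, log_mel.shape)
```

Each row of `mel_basis` averages a group of neighbouring STFT bins, so the product keeps the overall spectral shape while discarding fine frequency detail.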

Extracting a mel spectrum with librosa

Features extracted this way did not work well in training; the Tacotron approach is recommended instead.

import librosa
import soundfile
import numpy as np
from matplotlib import pyplot as plt
import scipy.signal as signal
import copy

sr = 16000  # Sample rate
n_fft = 1024  # FFT points (samples); gives n_fft // 2 + 1 = 513 frequency bins
frame_shift = 0.0125  # seconds
frame_length = 0.05  # seconds (time notation)
hop_length = int(sr * frame_shift)  # frame shift in samples
win_length = int(sr * frame_length)  # window length in samples
n_mels = 80  # Number of Mel banks to generate
power = 1.2  # Exponent for amplifying the predicted magnitude
n_iter = 100  # Number of inversion iterations
preemphasis = .97  # or None; pre-emphasis coefficient
max_db = 100
ref_db = 20
top_db = 15


def get_spectrograms(fpath):
    '''Returns normalized log(melspectrogram) and log(magnitude) from `fpath`.
    Args:
      fpath: A string. The full path of a sound file.
    Returns:
      mel: A 2d array of shape (T, n_mels) <- Transposed
      mag: A 2d array of shape (T, 1+n_fft/2) <- Transposed
    '''
    # Load the waveform. librosa can resample on load; the global sr is used so
    # that hop_length, win_length and the mel filters all refer to the same rate.
    # y is the one-dimensional time-domain signal.
    y, _ = librosa.load(fpath, sr=sr)

    plt.figure()
    plt.title("origin: waveform")
    plt.plot(y)
    plt.show()

    # Trimming: remove leading and trailing silence, i.e. anything more than
    # top_db decibels below the peak.
    y, _ = librosa.effects.trim(y, top_db=top_db)

    # Preemphasis (a classic signal-processing step)
    y = np.append(y[0], y[1:] - preemphasis * y[:-1])

    # stft: short-time Fourier transform.
    # `linear` is the STFT spectrum with both magnitude and phase. Every element
    # is a complex number a+bj, which np.abs turns into a real magnitude.
    linear = librosa.stft(y=y,
                          n_fft=n_fft,
                          hop_length=hop_length,
                          win_length=win_length)

    # magnitude spectrogram: drop the phase and keep only the magnitude
    mag = np.abs(linear)  # (1+n_fft//2, T)

    # mel spectrogram: mel_basis is the mel filter matrix; multiplying it by the
    # magnitude spectrogram (mag) gives the mel spectrogram
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1+n_fft//2)
    mel = np.dot(mel_basis, mag)  # (n_mels, T)

    # to decibel: take the logarithm
    mel = 20 * np.log10(np.maximum(1e-5, mel))
    mag = 20 * np.log10(np.maximum(1e-5, mag))

    # normalize
    mel = np.clip((mel - ref_db + max_db) / max_db, 1e-8, 1)
    mag = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1)

    # Transpose
    mel = mel.T.astype(np.float32)  # (T, n_mels)
    mag = mag.T.astype(np.float32)  # (T, 1+n_fft//2)

    return mel, mag


def melspectrogram2wav(mel):
    '''# Generate wave file from spectrogram'''
    # transpose
    mel = mel.T

    # de-normalize
    mel = (np.clip(mel, 0, 1) * max_db) - max_db + ref_db

    # to amplitude (the reverse of "to decibel")
    mel = np.power(10.0, mel * 0.05)
    m = _mel_to_linear_matrix(sr, n_fft, n_mels)
    mag = np.dot(m, mel)

    # wav reconstruction
    wav = griffin_lim(mag)

    # de-preemphasis
    wav = signal.lfilter([1], [1, -preemphasis], wav)

    # trim
    wav, _ = librosa.effects.trim(wav)

    return wav.astype(np.float32)


def spectrogram2wav(mag):
    '''# Generate wave file from spectrogram'''
    # transpose
    mag = mag.T

    # de-normalize
    mag = (np.clip(mag, 0, 1) * max_db) - max_db + ref_db

    # to amplitude
    mag = np.power(10.0, mag * 0.05)

    # wav reconstruction
    wav = griffin_lim(mag)

    # de-preemphasis
    wav = signal.lfilter([1], [1, -preemphasis], wav)

    # trim
    wav, _ = librosa.effects.trim(wav)

    return wav.astype(np.float32)


def _mel_to_linear_matrix(sr, n_fft, n_mels):
    m = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    m_t = np.transpose(m)
    p = np.matmul(m, m_t)
    d = [1.0 / x if np.abs(x) > 1.0e-8 else x for x in np.sum(p, axis=0)]
    return np.matmul(m_t, np.diag(d))


def griffin_lim(spectrogram):
    '''Applies Griffin-Lim's raw.'''
    X_best = copy.deepcopy(spectrogram)
    for i in range(n_iter):
        X_t = invert_spectrogram(X_best)
        est = librosa.stft(X_t, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
        phase = est / np.maximum(1e-8, np.abs(est))
        X_best = spectrogram * phase
    X_t = invert_spectrogram(X_best)
    y = np.real(X_t)
    return y


def invert_spectrogram(spectrogram):
    '''spectrogram: [f, t]'''
    return librosa.istft(spectrogram, hop_length=hop_length, win_length=win_length, window="hann")


def plot_spectrogram_to_numpy(spectrogram):
    fig, ax = plt.subplots(figsize=(12, 3))
    im = ax.imshow(spectrogram, aspect="auto", origin="lower",
                   interpolation='none')
    plt.colorbar(im, ax=ax)
    plt.xlabel("Frames")
    plt.ylabel("Channels")
    plt.tight_layout()

    fig.canvas.draw()
    data = save_figure_to_numpy(fig)
    plt.close()
    return data


def save_figure_to_numpy(fig):
    # save it to a numpy array (RGB, dropping the alpha channel)
    data = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()
    return data


def test1():
    # Given one utterance
    p = './0001_000351.wav'
    aa = get_spectrograms(p)
    print("melspec size: {} , stft spec size:{}".format(aa[0].shape, aa[1].shape))
    # size : (frames, ndim)

    plt.figure()
    plt.title("origin melspec & stft spec")
    plt.subplot(2, 1, 2)
    plt.imshow(plot_spectrogram_to_numpy(aa[1].T))
    plt.subplot(2, 1, 1)
    plt.imshow(plot_spectrogram_to_numpy(aa[0].T))
    plt.show()

    wav1 = melspectrogram2wav(aa[0])  # input size : (frames, ndim)

    plt.figure()
    plt.title("mel2wav: waveform")
    plt.plot(wav1)
    plt.show()

    soundfile.write(p.replace('.w', '_gff.w'), wav1, sr)
    print("finished change ")

    # Plot the spectra of the converted speech
    aa = get_spectrograms(p.replace('.w', '_gff.w'))
    plt.figure()
    plt.title("new wav  :  melspec & stft spec")
    plt.subplot(2, 1, 2)
    plt.imshow(plot_spectrogram_to_numpy(aa[1].T))
    plt.subplot(2, 1, 1)
    plt.imshow(plot_spectrogram_to_numpy(aa[0].T))
    plt.show()


if __name__ == '__main__':
    test1()

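Extracting a mel spectrum with Tacotron (recommended)

The basic idea is the same as above (with minor differences), but the training results have been consistently good.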

import numpy as np
import torch
import librosa.util as librosa_util
import torch.nn.functional as F
from torch.autograd import Variable
from scipy.signal import get_window
from librosa.util import pad_center, tiny
from scipy.io.wavfile import read
from librosa.filters import mel as librosa_mel_fn


class STFT(torch.nn.Module):
    """adapted from Prem Seetharaman's https://github.com/pseeth/pytorch-stft"""

    def __init__(self, filter_length=800, hop_length=200, win_length=800,
                 window='hann'):
        super(STFT, self).__init__()
        self.filter_length = filter_length
        self.hop_length = hop_length
        self.win_length = win_length
        self.window = window
        self.forward_transform = None
        scale = self.filter_length / self.hop_length
        fourier_basis = np.fft.fft(np.eye(self.filter_length))

        cutoff = int((self.filter_length / 2 + 1))
        fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]),
                                   np.imag(fourier_basis[:cutoff, :])])

        forward_basis = torch.FloatTensor(fourier_basis[:, None, :])
        inverse_basis = torch.FloatTensor(
            np.linalg.pinv(scale * fourier_basis).T[:, None, :])

        if window is not None:
            assert(filter_length >= win_length)
            # get window and zero center pad it to filter_length
            fft_window = get_window(window, win_length, fftbins=True)
            fft_window = pad_center(fft_window, size=filter_length)
            fft_window = torch.from_numpy(fft_window).float()

            # window the bases
            forward_basis *= fft_window
            inverse_basis *= fft_window

        self.register_buffer('forward_basis', forward_basis.float())
        self.register_buffer('inverse_basis', inverse_basis.float())

    def transform(self, input_data):
        num_batches = input_data.size(0)
        num_samples = input_data.size(1)

        self.num_samples = num_samples

        # similar to librosa, reflect-pad the input
        input_data = input_data.view(num_batches, 1, num_samples)
        input_data = F.pad(
            input_data.unsqueeze(1),
            (int(self.filter_length / 2), int(self.filter_length / 2), 0, 0),
            mode='reflect')
        input_data = input_data.squeeze(1)

        forward_transform = F.conv1d(
            input_data,
            Variable(self.forward_basis, requires_grad=False),
            stride=self.hop_length,
            padding=0)

        cutoff = int((self.filter_length / 2) + 1)
        real_part = forward_transform[:, :cutoff, :]
        imag_part = forward_transform[:, cutoff:, :]

        magnitude = torch.sqrt(real_part**2 + imag_part**2)
        phase = torch.autograd.Variable(
            torch.atan2(imag_part.data, real_part.data))

        return magnitude, phase

    def inverse(self, magnitude, phase):
        recombine_magnitude_phase = torch.cat(
            [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1)

        inverse_transform = F.conv_transpose1d(
            recombine_magnitude_phase,
            Variable(self.inverse_basis, requires_grad=False),
            stride=self.hop_length,
            padding=0)

        if self.window is not None:
            window_sum = window_sumsquare(
                self.window, magnitude.size(-1), hop_length=self.hop_length,
                win_length=self.win_length, n_fft=self.filter_length,
                dtype=np.float32)
            # remove modulation effects
            approx_nonzero_indices = torch.from_numpy(
                np.where(window_sum > tiny(window_sum))[0])
            window_sum = torch.autograd.Variable(
                torch.from_numpy(window_sum), requires_grad=False)
            window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum
            inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]

            # scale by hop ratio
            inverse_transform *= float(self.filter_length) / self.hop_length

        inverse_transform = inverse_transform[:, :, int(self.filter_length/2):]
        inverse_transform = inverse_transform[:, :, :-int(self.filter_length/2):]

        return inverse_transform

    def forward(self, input_data):
        self.magnitude, self.phase = self.transform(input_data)
        reconstruction = self.inverse(self.magnitude, self.phase)
        return reconstruction


def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
                     n_fft=800, dtype=np.float32, norm=None):
    """
    # from librosa 0.6
    Compute the sum-square envelope of a window function at a given hop length.

    This is used to estimate modulation effects induced by windowing
    observations in short-time fourier transforms.

    Parameters
    ----------
    window : string, tuple, number, callable, or list-like
        Window specification, as in `get_window`

    n_frames : int > 0
        The number of analysis frames

    hop_length : int > 0
        The number of samples to advance between frames

    win_length : [optional]
        The length of the window function.  By default, this matches `n_fft`.

    n_fft : int > 0
        The length of each analysis frame.

    dtype : np.dtype
        The data type of the output

    Returns
    -------
    wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
        The sum-squared envelope of the window function
    """
    if win_length is None:
        win_length = n_fft

    n = n_fft + hop_length * (n_frames - 1)
    x = np.zeros(n, dtype=dtype)

    # Compute the squared window at the desired length
    win_sq = get_window(window, win_length, fftbins=True)
    win_sq = librosa_util.normalize(win_sq, norm=norm)**2
    win_sq = librosa_util.pad_center(win_sq, size=n_fft)

    # Fill the envelope
    for i in range(n_frames):
        sample = i * hop_length
        x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))]
    return x


def griffin_lim(magnitudes, stft_fn, n_iters=30):
    """
    PARAMS
    ------
    magnitudes: spectrogram magnitudes
    stft_fn: STFT class with transform (STFT) and inverse (ISTFT) methods
    """
    angles = np.angle(np.exp(2j * np.pi * np.random.rand(*magnitudes.size())))
    angles = angles.astype(np.float32)
    angles = torch.autograd.Variable(torch.from_numpy(angles))
    signal = stft_fn.inverse(magnitudes, angles).squeeze(1)

    for i in range(n_iters):
        _, angles = stft_fn.transform(signal)
        signal = stft_fn.inverse(magnitudes, angles).squeeze(1)
    return signal


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # C: compression factor
    # Values are clamped to clip_val and passed through the natural log, with no
    # further normalization, so the resulting log-mel spectrum contains negative
    # values. The librosa version above is normalized to (0, 1] and has none.
    return torch.log(torch.clamp(x, min=clip_val) * C)


def dynamic_range_decompression(x, C=1):
    # C: compression factor used to compress
    return torch.exp(x) / C


def load_wav_to_torch(full_path):
    sampling_rate, data = read(full_path)
    return torch.FloatTensor(data.astype(np.float32)), sampling_rate


class TacotronSTFT(torch.nn.Module):
    def __init__(self, filter_length=1024, hop_length=256, win_length=1024,
                 n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0,
                 mel_fmax=11025):
        super(TacotronSTFT, self).__init__()
        self.n_mel_channels = n_mel_channels
        self.sampling_rate = sampling_rate
        self.stft_fn = STFT(filter_length, hop_length, win_length)
        mel_basis = librosa_mel_fn(
            sr=sampling_rate, n_fft=filter_length, n_mels=n_mel_channels,
            fmin=mel_fmin, fmax=mel_fmax)
        mel_basis = torch.from_numpy(mel_basis).float()
        self.register_buffer('mel_basis', mel_basis)

    def spectral_normalize(self, magnitudes):
        output = dynamic_range_compression(magnitudes)
        return output

    def spectral_de_normalize(self, magnitudes):
        output = dynamic_range_decompression(magnitudes)
        return output

    def mel_spectrogram(self, y):
        """Computes mel-spectrograms from a batch of waves
        PARAMS
        ------
        y: Variable(torch.FloatTensor) with shape (B, T) in range [-1, 1]

        RETURNS
        -------
        mel_output: torch.FloatTensor of shape (B, n_mel_channels, T)
        """
        y = y.clamp(min=-1, max=1)
        # assert(torch.min(y.data) >= -1)
        # assert(torch.max(y.data) <= 1)

        magnitudes, phases = self.stft_fn.transform(y)
        magnitudes = magnitudes.data
        mel_output = torch.matmul(self.mel_basis, magnitudes)
        mel_output = self.spectral_normalize(mel_output)  # natural logarithm
        return mel_output


if __name__ == "__main__":
    pass

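Summary

Since deep learning began delivering good results broadly around 2015, most speech tasks have used the STFT (magnitude) spectrum or the mel spectrogram (melspec) as training input.
In most papers the melspec is log-compressed by default: even when the paper only says "melspec", the training code applies a log() function.
The feature dimension of the STFT spectrum can be chosen freely; n_fft values of 1024, 512, or 256 (513, 257, or 129 frequency bins) are customary, while most training setups still use an 80-dimensional melspec.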
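The two code listings above apply the logarithm in different ways: the librosa-style code uses 20 * log10 (decibels), while the Tacotron-style code uses the natural log. The plain-Python sketch below compares the two on a few values, with the same floor of 1e-5 that both listings use. The function names are illustrative; the two scales differ only by the constant factor 20 / ln(10).

```python
import math

def log_compress(x, clip_val=1e-5):
    """Natural-log compression with a floor, as in the Tacotron-style code."""
    return math.log(max(x, clip_val))

def to_decibel(x, clip_val=1e-5):
    """20 * log10 compression with the same floor, as in the librosa-style code."""
    return 20.0 * math.log10(max(x, clip_val))

for value in (0.0, 1e-3, 1.0, 10.0):
    print(value, round(log_compress(value), 3), round(to_decibel(value), 1))
```

Because the difference is only a constant factor, which one a model is trained on matters mainly for matching the feature range expected by pretrained vocoders or published code.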