Speech Separation
Speaker Separation
- Training Data
- Evaluation
  - Signal-to-noise ratio (SNR)
  - Scale invariant signal-to-distortion ratio (SI-SDR / SI-SNR)
  - SI-SDR improvement
  - More...
- Permutation Issue
- Deep Clustering
  - Masking
  - Ideal Binary Mask (IBM)
  - Deep Clustering
- Permutation Invariant Training (PIT)
- TasNet – Time-domain Audio Separation Network
- SepFormer - Attention is All You Need in Speech Separation
- More …
  - Unknown number of speakers
  - Multiple Microphones
  - Visual Information
  - Task-oriented Optimization
- To learn more ……
References

Speech Separation

Cocktail party effect: Humans can focus on the voice of a single speaker in a crowded, noisy environment.

Speech Separation

Speech Enhancement: speech-nonspeech separation (de-noising), i.e. separating a person's voice from background noise
Speaker Separation: multi-speaker talking, i.e. separating the audio of several people speaking at once

Speaker Separation

The discussion below focuses on the case of two speakers and a single microphone, with a speaker-independent dataset (training and testing speakers are completely different).

Input and output have the same length, so a seq2seq model is not needed.

Training Data

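Training Data: It is easy to generate training data. Add the waveforms of clean single-speaker recordings to form the mixed input, and keep the original recordings as the ground truth.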
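A minimal sketch of how such a training pair could be built, assuming the two clean utterances are already loaded as NumPy arrays. The function name `mix_utterances` and the random stand-in signals are illustrative only and do not come from any particular dataset recipe.

```python
import numpy as np

def mix_utterances(s1, s2):
    """Sum two clean waveforms after trimming both to the shorter length."""
    n = min(len(s1), len(s2))
    s1, s2 = s1[:n], s2[:n]
    # The mixture is the model input; the stacked clean signals are the targets.
    return s1 + s2, np.stack([s1, s2])

rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(16000)  # random stand-in for a clean recording
speaker_b = rng.standard_normal(12000)  # a second, shorter recording
mixture, targets = mix_utterances(speaker_a, speaker_b)
print(mixture.shape, targets.shape)
```

Real recipes usually also rescale the two utterances to a chosen relative level before summing; that step is left out here.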

Evaluation

Signal-to-noise ratio (SNR)

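$\mathrm{SNR}=10 \log _{10} \frac{\|\hat{X}\|^{2}}{\|E\|^{2}}$

Here $\hat{X}$ is the ground-truth speech signal and $X^*$ is the model output (both are vectors), and $E=\hat{X}-X^*$ is the error between them.

The smaller $\|E\|$ is, the larger the SNR.

Drawbacks

Case 1: $X^*$ and $\hat{X}$ sound exactly the same and differ only in volume, yet $E$ is large
Case 2: Simply making the output louder can increase the SNR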
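The first drawback can be reproduced numerically. The sketch below assumes the ground truth and the estimate are equal-length NumPy vectors; `snr_db` and the synthetic signals are illustrative names, not part of any library.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB, with the error defined as reference minus estimate."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(8000)                # stand-in for the ground truth
noisy = x_hat + 0.1 * rng.standard_normal(8000)  # a fairly good estimate

snr_same_scale = snr_db(x_hat, noisy)
snr_half_volume = snr_db(x_hat, 0.5 * noisy)     # same sound, half the volume
print(snr_same_scale, snr_half_volume)
```

Halving the volume leaves the content unchanged, yet the SNR drops sharply because the error vector now contains roughly half of the reference itself.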

Scale invariant signal-to-distortion ratio (SI-SDR / SI-SNR)

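$\text{SI-SDR}=10 \log _{10} \frac{\left\|X_{T}\right\|^{2}}{\left\|X_{E}\right\|^{2}}$

where $X_T$ is the projection of the output $X^*$ onto the ground truth $\hat{X}$, and $X_E = X^* - X_T$ is the remaining component. The closer $X^*$ is to being parallel to $\hat{X}$, the larger the SI-SDR.

SI-SDR fixes both of the SNR drawbacks mentioned above:

(1) If $X^*$ and $\hat{X}$ differ only in volume, then $X_E = 0$ and $\text{SI-SDR}=10 \log _{10} \frac{\left\|X_{T}\right\|^{2}}{\left\|X_{E}\right\|^{2}}=+\infty$
(2) Scaling the output by a factor $k$ scales $X_T$ and $X_E$ by the same factor, so the value does not change:
$\text{SI-SDR}=10 \log _{10} \frac{\left\|X_{T}\right\|^{2}}{\left\|X_{E}\right\|^{2}}=10 \log _{10} \frac{\left\|kX_{T}\right\|^{2}}{\left\|kX_{E}\right\|^{2}}$ (the same)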
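A small sketch of the SI-SDR computation, assuming $X_T$ is the projection of the estimate onto the reference and $X_E$ is the residual. The helper name `si_sdr_db` and the synthetic signals are illustrative.

```python
import numpy as np

def si_sdr_db(reference, estimate):
    """Scale-invariant SDR in dB for two equal-length vectors."""
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    x_t = scale * reference   # component of the estimate parallel to the reference
    x_e = estimate - x_t      # everything else counts as distortion
    return 10 * np.log10(np.sum(x_t ** 2) / np.sum(x_e ** 2))

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(8000)
noisy = x_hat + 0.1 * rng.standard_normal(8000)

sdr_full = si_sdr_db(x_hat, noisy)
sdr_half = si_sdr_db(x_hat, 0.5 * noisy)  # rescaled output, same content
print(sdr_full, sdr_half)
```

Both values agree up to floating-point error, which is property (2). An estimate exactly parallel to the reference would give $X_E = 0$ and an unbounded score, so practical implementations often add a small constant to the denominator.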

SI-SDR improvement

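Compute the SI-SDR of the mixed audio and of the model's separated output, each against the ground truth, then take the difference:
$\text{SI-SDR}_i = \text{SI-SDR}_2 - \text{SI-SDR}_1$, where $\text{SI-SDR}_1$ is measured on the mixture and $\text{SI-SDR}_2$ on the separated output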
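As a worked example of the improvement score, the sketch below scores a pretend model output that has removed most of the interfering speaker. All names and signals are hypothetical stand-ins.

```python
import numpy as np

def si_sdr_db(reference, estimate):
    """Scale-invariant SDR in dB (projection onto the reference vs. residual)."""
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    x_t = scale * reference
    x_e = estimate - x_t
    return 10 * np.log10(np.sum(x_t ** 2) / np.sum(x_e ** 2))

def si_sdr_improvement(reference, mixture, separated):
    """SI-SDR of the separated output minus SI-SDR of the unprocessed mixture."""
    return si_sdr_db(reference, separated) - si_sdr_db(reference, mixture)

rng = np.random.default_rng(0)
s1 = rng.standard_normal(8000)   # target speaker
s2 = rng.standard_normal(8000)   # interfering speaker
mixture = s1 + s2
separated = s1 + 0.1 * s2        # pretend output: interference reduced tenfold

gain = si_sdr_improvement(s1, mixture, separated)
print(gain)
```

Reducing the interference amplitude by a factor of ten yields an improvement of roughly 20 dB, independent of how loud the output is.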

More…

Perceptual Evaluation of Speech Quality (PESQ) was designed to evaluate the quality, and the score ranges from -0.5 to 4.5.
Short-Time Objective Intelligibility (STOI) was designed to measure intelligibility (how understandable the speech is), and the score ranges from 0 to 1.

Permutation Issue

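With training data and an evaluation metric in hand, it looks as if we could train a model directly: minimize the L1 / L2 error between the generated vectors and the ground truth, or maximize SI-SDR.

It is not that simple. The mixture contains two sources, so should the model's first branch output the red audio or the blue audio? In other words, in what order should the ground-truth targets be arranged (the Permutation Issue)?

If the dataset contained only the red and the blue audio, fixing one order would be enough. But the dataset contains many different speakers, and fixing an output order for every pair of speakers can cause trouble during training. For example, suppose the first branch is assigned the red and yellow audio and the second branch the blue and green audio, while blue and green belong to speakers of different genders and so do red and yellow. Given a mixed male-female input, the model is then asked to output the male voice (yellow, green) on the first branch for some samples and the female voice (red, blue) on the first branch for others, which can hurt training. We would rather have the first branch always output the male voice and the second branch always output the female voice, which suggests grouping the audio by gender first.

However, some male voices are high-pitched and some female voices are low-pitched, so we might want to group by pitch instead. There are many possible criteria (gender, pitch, loudness), so how do we pick the right one? Deep Clustering and PIT, introduced below, make it possible to apply deep learning in the speaker-independent setting.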

permutation problem: predicted channels are well separated but inconsistent along time (This occurs when the sources to separate are similar to each other)

Deep Clustering

paper: Deep clustering: Discriminative embeddings for segmentation and separation

Deep Clustering: Learning deep embeddings for clustering

Masking

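Speaker Separation amounts to generating two matrices $X_1,X_2$ from one matrix $X$.
Since $X_1,X_2$ are very similar to $X$, the model can be simplified: train only a Mask Generator that produces two mask matrices, and multiply each element-wise with $X$ to obtain $X_1,X_2$ (the mask can be binary or continuous).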

Ideal Binary Mask (IBM)

Each audio is represented by its spectrogram.

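Ideal Binary Mask (IBM): Compare the spectrogram matrices of the two sources element by element. Where one matrix has the larger value at a position, its mask is 1 at that position; otherwise it is 0. This gives one mask matrix per source, and multiplying each mask element-wise with the spectrogram of the mixture recovers the original source (this works quite well).

The task now is to have the model produce the IBM when given the spectrogram of the mixture.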
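The comparison described above can be sketched in a few lines, assuming the two sources are given as magnitude spectrograms of equal shape. The function name, the toy matrices, and the assumption that magnitudes add in the mixture are all simplifications made for this illustration.

```python
import numpy as np

def ideal_binary_masks(mag1, mag2):
    """Return one binary mask per source from two magnitude spectrograms."""
    mask1 = (mag1 > mag2).astype(mag1.dtype)  # 1 where source 1 dominates
    return mask1, 1.0 - mask1                 # source 2 takes the remaining bins

rng = np.random.default_rng(0)
mag1 = rng.random((4, 6))   # toy spectrogram of source 1 (frequency x time)
mag2 = rng.random((4, 6))   # toy spectrogram of source 2
mask1, mask2 = ideal_binary_masks(mag1, mag2)

mixture_mag = mag1 + mag2   # simplifying assumption: magnitudes simply add
est1 = mask1 * mixture_mag  # element-wise masking of the mixture
est2 = mask2 * mixture_mag
print(mask1)
```

Every time-frequency bin is assigned to exactly one source, so the two masks always sum to a matrix of ones.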

Deep Clustering

Deep Clustering: From the $D \times T$ spectrogram of the mixture, first generate $D \times T \times A$ spectrogram embeddings (one embedding for every point of the spectrogram), each normalized to have an L2 norm of 1. Then run K-means on these embeddings to split them into two clusters; the positions belonging to each cluster form the two binary masks.
- The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering.
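The decoding step can be illustrated with a small NumPy sketch. It assumes the embeddings already exist (here they are synthesized from a known mask) and uses a plain k-means with a deterministic farthest-point initialization; the helper names are hypothetical and this is not the reference implementation from the paper.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        d = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def masks_from_embeddings(embeddings, k=2):
    """embeddings: (D, T, A) array -> list of k binary masks of shape (D, T)."""
    d, t, a = embeddings.shape
    labels = kmeans_labels(embeddings.reshape(d * t, a), k).reshape(d, t)
    return [(labels == j).astype(float) for j in range(k)]

# Toy embeddings: bins of source 1 point near one axis, bins of source 2 near another.
rng = np.random.default_rng(0)
true_mask = rng.random((5, 7)) > 0.5
A = 4
emb = np.where(true_mask[..., None], np.eye(A)[0], np.eye(A)[1])
emb = emb + 0.05 * rng.standard_normal((5, 7, A))
emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # unit L2 norm per bin

masks = masks_from_embeddings(emb, k=2)
```

The cluster indices carry no speaker identity, so the recovered masks match the true ones only up to a swap of the two outputs.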

Deep Clustering – Training

During training, the true IBM is used as supervision: the embeddings produced by Embedding Generation should be far apart for elements with different IBM values and close together for elements with the same value (i.e. we need an objective function that minimizes the distances between embeddings of elements within a partition, while maximizing the distances between embeddings for elements in different partitions). The subsequent K-means clustering can then produce the correct IBM.
Notably, it is possible to train with two speakers but test on three speakers ($K = 3$ during $k$-means)! This suggests that Embedding Generation has captured meaningful features of the spectrogram.

Math Form

Let $V=f_\theta(x)\in\mathbb{R}^{N\times D}$ be the embeddings of $N$ input elements, where each element is a time-frequency bin $(t, f)$, $f_\theta$ is a neural network, and $D$ here denotes the embedding dimension (written $A$ above). $VV^T\in\mathbb{R}^{N\times N}$ is then the similarity (affinity) matrix between embeddings.
Let $Y\in \mathbb{R}^{N\times C}$ indicate the class (cluster) of each element. It consists of $N$ one-hot vectors, where $C$ is the total number of classes. $YY^T$ is therefore a binary affinity matrix: an entry is 1 if the two elements belong to the same class and 0 otherwise.
The loss function can then be written as:
$\begin{aligned} \mathcal{C}_{Y}(V) &=\left\|V V^{T}-Y Y^{T}\right\|_{\mathrm{F}}^{2}=\sum_{i, j}\left(\left\langle v_{i}, v_{j}\right\rangle-\left\langle y_{i}, y_{j}\right\rangle\right)^{2} \end{aligned}$
Assume each embedding $v_i$ is unit-norm, i.e. $|v_i|^2=1$. Then
$\begin{aligned} \mathcal{C}_{Y}(V) &=\sum_{i, j: y_{i}=y_{j}}\left(1-2\left\langle v_{i}, v_{j}\right\rangle\right)+\sum_{i, j}\left\langle v_{i}, v_{j}\right\rangle^{2} \\ &=\sum_{i, j: y_{i}=y_{j}}\left(\left|v_{i}-v_{j}\right|^{2}-1\right)+\sum_{i, j}\left\langle v_{i}, v_{j}\right\rangle^{2} \end{aligned}$
The first term pulls embeddings of the same class together, while the second term pushes all embeddings to differ from one another, which prevents collapse to a trivial solution.
Computing the $N\times N$ affinity matrix directly is too expensive. Using $\|A\|^2_F=\operatorname{tr}(A^TA)$, the loss can be rewritten to reduce the cost:
$\begin{aligned} \mathcal{C}_{Y}(V) &=\left\|V V^{T}-Y Y^{T}\right\|_{\mathrm{F}}^{2} \\&=\operatorname{tr}\left((V V^{T}-Y Y^{T})(V V^{T}-Y Y^{T})\right) \\&=\operatorname{tr}(V V^{T}V V^{T})-2\operatorname{tr}(V V^{T}Y Y^{T})+\operatorname{tr}(Y Y^{T}Y Y^{T}) \\&=\operatorname{tr}(V^{T}VV^{T}V)-2\operatorname{tr}(Y^{T}V V^{T}Y)+\operatorname{tr}(Y^{T}Y Y^{T}Y) \\&=\|V^TV\|_F^2-2\|V^TY\|_F^2+\|Y^TY\|_F^2 \end{aligned}$
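The equivalence of the two forms can be checked numerically. The sketch below uses random unit-norm embeddings and random one-hot labels; the sizes and function names are chosen only for illustration.

```python
import numpy as np

def dc_loss_naive(V, Y):
    """|| V V^T - Y Y^T ||_F^2, which builds two N x N matrices."""
    return np.sum((V @ V.T - Y @ Y.T) ** 2)

def dc_loss_low_rank(V, Y):
    """Same value computed from D x D, D x C and C x C matrices only."""
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

rng = np.random.default_rng(0)
N, D, C = 200, 8, 2                            # bins, embedding size, classes
V = rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings
Y = np.eye(C)[rng.integers(0, C, size=N)]      # random one-hot labels

naive = dc_loss_naive(V, Y)
low_rank = dc_loss_low_rank(V, Y)
print(naive, low_rank)
```

The second form never materializes an $N \times N$ matrix, which matters because $N$ is the number of time-frequency bins and is typically far larger than $D$ or $C$.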

Deep clustering works well, but it has one shortcoming: the training process is not end-to-end.

Permutation Invariant Training (PIT)

paper: Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks

Given a speaker separation model $\theta^i$, we can determine the permutation (choose the permutation with the smaller loss).
But we need the permutation to train the speaker separation model, which looks like a chicken-and-egg problem.

PIT

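PIT resolves this with a loop: randomly initialize the model parameters, use that model to decide the order of the labels, and train the model with that order. Then use the updated model to decide the label order again, and repeat.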
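The "decide the label order" step can be sketched as a loss that tries every ordering of the targets and keeps the cheapest one. This assumes a mean-squared-error criterion and NumPy arrays of shape (speakers, samples); `pit_mse` is a hypothetical helper, and a real training loop would backpropagate through the selected loss in a deep learning framework.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Return (lowest MSE over all target orderings, the ordering that achieves it)."""
    n = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((estimates - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 100))
# Pretend model output: good estimates, but emitted in swapped order.
estimates = targets[::-1] + 0.01 * rng.standard_normal((2, 100))

loss, perm = pit_mse(estimates, targets)
print(loss, perm)
```

Trying all orderings costs $n!$ loss evaluations, which is cheap for two or three speakers but grows quickly with the number of speakers.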

Results

paper: Interrupted and cascaded permutation invariant training for speech separation
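The paper's results show that at the start of training the label order for the same pair of samples is swapped frequently, but as training proceeds the order gradually settles.
They also show that the label order learned by PIT is effective. Comparing settings (a), (b), and (c), a manually fixed label order does not perform as well as PIT. In addition, an already trained PIT model can be used to generate a fixed label order for another model to learn from. This is Cascaded PIT, which achieves better results.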

TasNet – Time-domain Audio Separation Network

paper: Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

TasNet

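What makes TasNet special is that it operates directly in the time domain: the input and output are waveforms rather than spectrograms. The mixture waveform is processed in short segments, each a 16-$d$ vector (2 ms of audio). The time-domain input first passes through an Encoder and becomes a feature map. The Encoder plays a role similar to the Fourier transform, but it is learned by the model. The Separator produces two masks from the feature map; each mask is multiplied element-wise with the feature map and fed to the Decoder, which outputs a separated time-domain signal. The Decoder is thus analogous to the inverse Fourier transform.

If the dataset is speaker independent, PIT is needed when training TasNet.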

Encoder and Decoder

The Encoder and Decoder are simply two linear layers: one maps each 16-dimensional audio vector to a 512-dimensional feature vector, and the other maps a feature vector back to an audio vector.
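A shape-level sketch of this pipeline, assuming non-overlapping 16-sample frames, random matrices in place of learned weights, and random values in place of the Separator's masks. It only illustrates the data flow; the actual model learns these weights and may frame the signal differently.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME, FEATURES = 16, 512   # sizes quoted in the notes

# Random stand-ins for the learned linear layers.
W_enc = 0.1 * rng.standard_normal((FRAME, FEATURES))
W_dec = 0.1 * rng.standard_normal((FEATURES, FRAME))

def encode(frames):
    """(num_frames, 16) audio frames -> (num_frames, 512) feature vectors."""
    return frames @ W_enc

def decode(features):
    """(num_frames, 512) feature vectors -> (num_frames, 16) audio frames."""
    return features @ W_dec

waveform = rng.standard_normal(160)        # toy mixture, 10 frames long
frames = waveform.reshape(-1, FRAME)       # assumption: no overlap between frames
features = encode(frames)

masks = rng.random((2,) + features.shape)  # placeholder for the Separator output
separated = [decode(m * features).reshape(-1) for m in masks]
print(features.shape, separated[0].shape)
```

Each mask selects part of the shared feature map, and the same Decoder turns each masked feature map back into a waveform of the original length.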

Separator

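The Separator takes the Encoder output as input, i.e. a sequence of 512-$d$ vectors, each corresponding to 2 ms of audio. It consists of several layers of 1-D convolution, where each layer produces a new vector from two neighboring vectors of the previous layer's output. After several conv layers, a transform followed by a sigmoid yields two masks (the sigmoid keeps the mask values between 0 and 1, which matches intuition). Each mask is multiplied with the Encoder output, and the result is passed to the Decoder to obtain a separated time-domain signal.
The actual Separator is more complex: one basic block keeps stacking layers up to $d = 128$, so that each output mask vector sees 128 input audio vectors.
The Separator stacks several such basic blocks. With 3 blocks, the output mask takes 1.53 s of input audio into account.

I did not fully follow this part of the video; I will go back to the paper when I have time.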

Depthwise Separable Convolution

TasNet also uses Depthwise Separable Convolution to make the network lighter (Ref: youtube)
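To see why this makes the network lighter, the sketch below compares weight counts for a standard 1-D convolution and a depthwise separable one. The channel and kernel sizes are hypothetical values picked for illustration, and bias terms are ignored.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Every output channel has its own kernel over every input channel."""
    return c_in * c_out * kernel

def depthwise_separable_conv1d_params(c_in, c_out, kernel):
    """One kernel per input channel, then a 1x1 convolution to mix channels."""
    depthwise = c_in * kernel
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, kernel = 512, 512, 3   # hypothetical layer sizes
standard = standard_conv1d_params(c_in, c_out, kernel)
separable = depthwise_separable_conv1d_params(c_in, c_out, kernel)
print(standard, separable, standard / separable)
```

With many channels the pointwise term dominates, so the saving approaches a factor equal to the kernel size.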

SepFormer - Attention is All You Need in Speech Separation

paper: Attention is All You Need in Speech Separation

More …

Unknown number of speakers

paper: Recursive speech separation for unknown number of speakers

idea: recursively separating a speaker

Multiple Microphones

paper: FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing

Visual Information

paper: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Google AI Blog: Looking to Listen: Audio-Visual Speech Separation

Task-oriented Optimization

Who would listen to the results of speech enhancement or speaker separation?

paper

[Fu, et al., ICML’19] MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement, ICML, 2019
[Shon, et al., INTERSPEECH’19] VoiceID Loss: Speech Enhancement for Speaker Verification, INTERSPEECH, 2019

To learn more ……

A Wavenet for Speech Denoising
Alternative Objective Functions for Deep Clustering
Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective
Phase-aware Speech Enhancement with Deep Complex U-Net
Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation
Wavesplit: End-to-End Speech Separation by Speaker Clustering

References

Deep learning for human language processing 2020 spring (Hung-yi Lee's course)
A must-read paper and tutorial list for speech separation based on neural networks
Speech Separation benchmarks
