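0. Notes

https://github.com/auspicious3000/autovc

Listening to the demo, in the cases that involve unseen speakers the converted timbre really does resemble the target, but the audio quality is not good enough for commercial use.

Before reproducing the code from the repository, one thing needs to be confirmed:

The pretrained speaker encoder in the open-source code can be used directly to extract speaker embeddings for all VCTK speakers.
A colleague used this pretrained speaker encoder to extract VCTK embeddings, but the training loss never got as low as the paper reports,
so the converted timbre ended up not resembling the target speaker.
We therefore need to find out which corpus the pretrained speaker encoder was trained on.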

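0.1 Issues

https://github.com/auspicious3000/autovc/issues/55

According to the discussion in this issue, nobody's reproduction turned out well.

https://github.com/auspicious3000/autovc/issues/49

This reproduction is not great either.

https://github.com/auspicious3000/autovc/issues/48

The training recipe in this one is worth using as a reference.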

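1. Download the code

git clone https://github.com/auspicious3000/autovc.git
Rename the directory to AutoVC_hujk17.
Push it to my own git repository so that changes are easy to review: https://github.com/ruclion/AutoVC_hujk17

2. Set up the environment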

conda create -n pytorch_p36 python=3.6.5
conda activate pytorch_p36
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
conda install -c conda-forge librosa
pip install tqdm
pip install wavenet_vocoder

Installation is very slow; this can be changed by following: https://blog.csdn.net/Arthur_Holmes/article/details/105095088

2.1. Using Xiaobo's environment

cd jay
source ./venv/bin/activate

3. Reading the code

3.1. make_spect.py

Builds the mel-spectrograms.

import os
import pickle
import numpy as np
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel
from numpy.random import RandomState

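3.1.1. butter_highpass(cutoff, fs, order=5)

I did not understand this function at first. As far as the code shows, it designs a digital Butterworth high-pass filter: cutoff is the cutoff frequency in Hz, fs is the sampling rate in Hz, and the returned (b, a) are the filter coefficients that main() later passes to signal.filtfilt.

def butter_highpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    # signal.butter expects the cutoff as a fraction of the Nyquist frequency
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a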
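
To see what the filter does in practice, here is a minimal sketch on a synthetic signal. The 440 Hz tone and the 0.1 DC offset are made up for illustration; the 30 Hz cutoff and the 16 kHz rate are the values main() uses. The point is only that slow drift is removed while content well above the cutoff passes through almost unchanged.

```python
import numpy as np
from scipy import signal


def butter_highpass(cutoff, fs, order=5):
    """Design a Butterworth high-pass filter; cutoff and fs are in Hz."""
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a


fs = 16000
t = np.arange(fs) / fs                      # one second of samples
tone = 0.5 * np.sin(2 * np.pi * 440 * t)    # stands in for speech (assumption)
x = tone + 0.1                              # add a constant offset ("drift")

b, a = butter_highpass(30, fs, order=5)
y = signal.filtfilt(b, a, x)                # forward-backward, zero-phase filtering

print('mean before: %.4f' % x.mean())
print('mean after:  %.4f' % y.mean())
print('std of tone: %.4f, std after filtering: %.4f' % (tone.std(), y.std()))
```

filtfilt runs the filter forward and then backward, so the output is not delayed relative to the input, which matters when the frames are later aligned with other features.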

3.1.2. def pySTFT(x, fft_length=1024, hop_length=256)

This computes the linear (magnitude) spectrogram. Why not use librosa?

def pySTFT(x, fft_length=1024, hop_length=256):
    # Reflect-pad half a window on each side so that frames are centered
    x = np.pad(x, int(fft_length//2), mode='reflect')

    # Build a (num_frames, fft_length) view of overlapping frames without copying
    noverlap = fft_length - hop_length
    shape = x.shape[:-1]+((x.shape[-1]-noverlap)//hop_length, fft_length)
    strides = x.strides[:-1]+(hop_length*x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape,
                                             strides=strides)

    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T

    return np.abs(result)
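
On the "why not librosa" question: the repository does not say. As far as I can tell the function follows the usual centered-STFT recipe (reflect padding, periodic Hann window, rfft per frame), and the only unusual part is the stride trick that builds the overlapping frames without copying. The sketch below checks that reading by comparing it with an explicit loop; stft_loop is a hypothetical helper written for this note, not part of the repository.

```python
import numpy as np
from scipy.signal import get_window


def pySTFT(x, fft_length=1024, hop_length=256):
    """Magnitude STFT using a strided (zero-copy) view of the frames."""
    x = np.pad(x, int(fft_length // 2), mode='reflect')
    noverlap = fft_length - hop_length
    shape = x.shape[:-1] + ((x.shape[-1] - noverlap) // hop_length, fft_length)
    strides = x.strides[:-1] + (hop_length * x.strides[-1], x.strides[-1])
    frames = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
    window = get_window('hann', fft_length, fftbins=True)
    return np.abs(np.fft.rfft(window * frames, n=fft_length).T)


def stft_loop(x, fft_length=1024, hop_length=256):
    """Same computation, one frame at a time (hypothetical reference)."""
    x = np.pad(x, fft_length // 2, mode='reflect')
    window = get_window('hann', fft_length, fftbins=True)
    n_frames = (len(x) - fft_length) // hop_length + 1
    columns = []
    for i in range(n_frames):
        frame = x[i * hop_length:i * hop_length + fft_length]
        columns.append(np.abs(np.fft.rfft(window * frame)))
    return np.stack(columns, axis=1)


rng = np.random.RandomState(0)
x = rng.randn(16000)             # one second of noise at 16 kHz
fast = pySTFT(x)
slow = stft_loop(x)
print(fast.shape, slow.shape)    # (frequency bins, frames)
print(np.allclose(fast, slow))
```

If the two agree, swapping in a library STFT with the same padding, window and hop should be a drop-in change, but that is worth verifying numerically before retraining anything.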

3.1.x. main()

mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

# audio file directory
rootDir = './wavs'
# spectrogram directory
targetDir = './spmel'

dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)

for subdir in sorted(subdirList):
    print(subdir)
    if not os.path.exists(os.path.join(targetDir, subdir)):
        os.makedirs(os.path.join(targetDir, subdir))
    _,_, fileList = next(os.walk(os.path.join(dirName,subdir)))
    # Seed the noise generator from the numeric part of the speaker folder name
    prng = RandomState(int(subdir[1:]))
    for fileName in sorted(fileList):
        # Read audio file
        x, fs = sf.read(os.path.join(dirName,subdir,fileName))
        # Remove drifting noise
        y = signal.filtfilt(b, a, x)
        # Add a little random noise for model robustness
        wav = y * 0.96 + (prng.rand(y.shape[0])-0.5)*1e-06
        # Compute spect
        D = pySTFT(wav).T
        # Convert to mel and normalize
        D_mel = np.dot(D, mel_basis)
        D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
        S = np.clip((D_db + 100) / 100, 0, 1)
        # save spect
        np.save(os.path.join(targetDir, subdir, fileName[:-4]),
                S.astype(np.float32), allow_pickle=False)
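
The last three lines of the loop are the part that has to match the vocoder, so a small worked example helps. The sketch below applies the same formula to a few made-up amplitudes. With these constants, amplitudes at or below about 6.3e-5 (-84 dB) map to 0 and amplitudes at or above about 6.3 (+16 dB) map to 1. The function name normalize_mel is mine, not the repository's.

```python
import numpy as np

MIN_LEVEL = np.exp(-100 / 20 * np.log(10))   # 1e-5, i.e. -100 dB


def normalize_mel(mel_amplitude):
    """Map linear mel amplitudes to [0, 1] with the formula from make_spect.py."""
    d_db = 20 * np.log10(np.maximum(MIN_LEVEL, mel_amplitude)) - 16
    return np.clip((d_db + 100) / 100, 0, 1)


amps = np.array([1e-6, 1e-4, 1e-2, 1.0, 10.0])   # made-up test amplitudes
print(normalize_mel(amps))   # approximately [0, 0.04, 0.44, 0.84, 1]
```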

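3.2. make_metadata.py

Builds the training metadata and the speaker embeddings.

For each speaker, 10 utterances are picked at random, a random segment of about 2 s (128 frames) is cut from each, and the speaker encoder outputs for those segments are averaged.

The same embedding is then used for the whole of training and never changes.

In practice this is already very close to a one-hot speaker embedding.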

"""
Generate speaker embeddings and metadata for training
"""
import os
import pickle
from model_bl import D_VECTOR
from collections import OrderedDict
import numpy as np
import torch

C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
c_checkpoint = torch.load('3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_key = key[7:]
    new_state_dict[new_key] = val
C.load_state_dict(new_state_dict)
num_uttrs = 10
len_crop = 128

# Directory containing mel-spectrograms
rootDir = './spmel'
dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)

speakers = []
for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)
    _, _, fileList = next(os.walk(os.path.join(dirName,speaker)))

    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    embs = []
    for i in range(num_uttrs):
        tmp = np.load(os.path.join(dirName, speaker, fileList[idx_uttrs[i]]))
        candidates = np.delete(np.arange(len(fileList)), idx_uttrs)
        # choose another utterance if the current one is too short
        while tmp.shape[0] < len_crop:
            idx_alt = np.random.choice(candidates)
            tmp = np.load(os.path.join(dirName, speaker, fileList[idx_alt]))
            candidates = np.delete(candidates, np.argwhere(candidates==idx_alt))
        left = np.random.randint(0, tmp.shape[0]-len_crop)
        melsp = torch.from_numpy(tmp[np.newaxis, left:left+len_crop, :]).cuda()
        emb = C(melsp)
        embs.append(emb.detach().squeeze().cpu().numpy())
    utterances.append(np.mean(embs, axis=0))

    # create file list
    for fileName in sorted(fileList):
        utterances.append(os.path.join(speaker,fileName))
    speakers.append(utterances)

with open(os.path.join(rootDir, 'train.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)
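
The averaging step can be tried without the checkpoint. In the sketch below fake_encoder is a hypothetical stand-in for D_VECTOR (any function that returns a unit-length vector will do), and the utterances are random arrays. One detail worth checking in the script above: np.random.randint(0, n) raises a ValueError when n is 0, so an utterance with exactly len_crop frames passes the "too short" test and then fails at the crop. The random_crop helper below adds 1 to the upper bound to avoid that; this is my change, not the repository's.

```python
import numpy as np


def random_crop(mel, len_crop, rng):
    """Return a random window of len_crop frames from a (frames, bins) array."""
    assert mel.shape[0] >= len_crop
    # +1 so that an utterance of exactly len_crop frames is accepted
    left = rng.randint(0, mel.shape[0] - len_crop + 1)
    return mel[left:left + len_crop]


def fake_encoder(crop):
    """Hypothetical stand-in for D_VECTOR: returns a unit-length vector."""
    v = crop.mean(axis=0)[:16]
    return v / np.linalg.norm(v)


rng = np.random.RandomState(0)
# Ten made-up utterances of 128 to 399 frames, 80 mel bins each
utterances = [rng.rand(rng.randint(128, 400), 80) for _ in range(10)]
embs = [fake_encoder(random_crop(u, 128, rng)) for u in utterances]
speaker_emb = np.mean(embs, axis=0)

print(speaker_emb.shape)
print(np.linalg.norm(speaker_emb))   # at most 1: the mean is not re-normalized
```

Because the mean of unit vectors is generally shorter than 1, the stored speaker embedding is not unit-length even though each individual encoder output is. Whether that matters for training is something I have not verified.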

3.3. data_loader.py

Each item is a fixed-length crop of about 2 s (len_crop frames), returned together with the embedding of the speaker it belongs to.

from torch.utils import data
import torch
import numpy as np
import pickle
import os

from multiprocessing import Process, Manager


class Utterances(data.Dataset):
    """Dataset class for the Utterances dataset."""

    def __init__(self, root_dir, len_crop):
        """Initialize and preprocess the Utterances dataset."""
        self.root_dir = root_dir
        self.len_crop = len_crop
        self.step = 10

        metaname = os.path.join(self.root_dir, "train.pkl")
        meta = pickle.load(open(metaname, "rb"))

        """Load data using multiprocessing"""
        manager = Manager()
        meta = manager.list(meta)
        dataset = manager.list(len(meta)*[None])
        processes = []
        for i in range(0, len(meta), self.step):
            p = Process(target=self.load_data,
                        args=(meta[i:i+self.step],dataset,i))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        self.train_dataset = list(dataset)
        self.num_tokens = len(self.train_dataset)

        print('Finished loading the dataset...')

    def load_data(self, submeta, dataset, idx_offset):
        for k, sbmt in enumerate(submeta):
            uttrs = len(sbmt)*[None]
            for j, tmp in enumerate(sbmt):
                if j < 2:  # fill in speaker id and embedding
                    uttrs[j] = tmp
                else:  # load the mel-spectrograms
                    uttrs[j] = np.load(os.path.join(self.root_dir, tmp))
            dataset[idx_offset+k] = uttrs

    def __getitem__(self, index):
        # pick a random speaker
        dataset = self.train_dataset
        list_uttrs = dataset[index]
        emb_org = list_uttrs[1]

        # pick random uttr with random crop
        a = np.random.randint(2, len(list_uttrs))
        tmp = list_uttrs[a]
        if tmp.shape[0] < self.len_crop:
            len_pad = self.len_crop - tmp.shape[0]
            uttr = np.pad(tmp, ((0,len_pad),(0,0)), 'constant')
        elif tmp.shape[0] > self.len_crop:
            left = np.random.randint(tmp.shape[0]-self.len_crop)
            uttr = tmp[left:left+self.len_crop, :]
        else:
            uttr = tmp

        return uttr, emb_org

    def __len__(self):
        """Return the number of spkrs."""
        return self.num_tokens


def get_loader(root_dir, batch_size=16, len_crop=128, num_workers=0):
    """Build and return a data loader."""

    dataset = Utterances(root_dir, len_crop)

    worker_init_fn = lambda x: np.random.seed((torch.initial_seed()) % (2**32))
    data_loader = data.DataLoader(dataset=dataset,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  num_workers=num_workers,
                                  drop_last=True,
                                  worker_init_fn=worker_init_fn)
    return data_loader
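
The "about 2 s" figure and the crop-or-pad rule in __getitem__ can be checked in isolation. This is a minimal sketch with made-up inputs; crop_or_pad is my name for the three-way branch above, and the constants are the ones used in make_spect.py and get_loader.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz, as in make_spect.py
HOP_LENGTH = 256      # samples between frames, as in pySTFT
LEN_CROP = 128        # frames per training example


def crop_or_pad(mel, len_crop, rng):
    """Return exactly len_crop frames: zero-pad short inputs, crop long ones."""
    n = mel.shape[0]
    if n < len_crop:
        return np.pad(mel, ((0, len_crop - n), (0, 0)), 'constant')
    if n > len_crop:
        left = rng.randint(n - len_crop)
        return mel[left:left + len_crop, :]
    return mel


rng = np.random.RandomState(0)
short = crop_or_pad(np.ones((100, 80)), LEN_CROP, rng)   # padded with zeros
long_ = crop_or_pad(np.ones((300, 80)), LEN_CROP, rng)   # randomly cropped
print(short.shape, long_.shape)
print('seconds per crop:', LEN_CROP * HOP_LENGTH / SAMPLE_RATE)
```

Note that __len__ returns the number of speakers, so one "epoch" of this loader draws one random crop per speaker rather than visiting every utterance.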

3.4. model_vc.py

Defines the model architecture; standard PyTorch usage.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class LinearNorm(torch.nn.Module):
    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
        super(LinearNorm, self).__init__()
        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)

        torch.nn.init.xavier_uniform_(
            self.linear_layer.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, x):
        return self.linear_layer(x)


class ConvNorm(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
        super(ConvNorm, self).__init__()
        if padding is None:
            assert(kernel_size % 2 == 1)
            padding = int(dilation * (kernel_size - 1) / 2)

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    padding=padding, dilation=dilation,
                                    bias=bias)

        torch.nn.init.xavier_uniform_(
            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        conv_signal = self.conv(signal)
        return conv_signal


class Encoder(nn.Module):
    """Encoder module:
    """
    def __init__(self, dim_neck, dim_emb, freq):
        super(Encoder, self).__init__()
        self.dim_neck = dim_neck
        self.freq = freq

        convolutions = []
        for i in range(3):
            conv_layer = nn.Sequential(
                ConvNorm(80+dim_emb if i==0 else 512,
                         512,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(512))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm = nn.LSTM(512, dim_neck, 2, batch_first=True, bidirectional=True)

    def forward(self, x, c_org):
        x = x.squeeze(1).transpose(2,1)
        c_org = c_org.unsqueeze(-1).expand(-1, -1, x.size(-1))
        x = torch.cat((x, c_org), dim=1)

        for conv in self.convolutions:
            x = F.relu(conv(x))
        x = x.transpose(1, 2)

        self.lstm.flatten_parameters()
        outputs, _ = self.lstm(x)
        out_forward = outputs[:, :, :self.dim_neck]
        out_backward = outputs[:, :, self.dim_neck:]

        codes = []
        for i in range(0, outputs.size(1), self.freq):
            codes.append(torch.cat((out_forward[:,i+self.freq-1,:],out_backward[:,i,:]), dim=-1))

        return codes


class Decoder(nn.Module):
    """Decoder module:
    """
    def __init__(self, dim_neck, dim_emb, dim_pre):
        super(Decoder, self).__init__()

        self.lstm1 = nn.LSTM(dim_neck*2+dim_emb, dim_pre, 1, batch_first=True)

        convolutions = []
        for i in range(3):
            conv_layer = nn.Sequential(
                ConvNorm(dim_pre,
                         dim_pre,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(dim_pre))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm2 = nn.LSTM(dim_pre, 1024, 2, batch_first=True)

        self.linear_projection = LinearNorm(1024, 80)

    def forward(self, x):

        #self.lstm1.flatten_parameters()
        x, _ = self.lstm1(x)
        x = x.transpose(1, 2)

        for conv in self.convolutions:
            x = F.relu(conv(x))
        x = x.transpose(1, 2)

        outputs, _ = self.lstm2(x)

        decoder_output = self.linear_projection(outputs)

        return decoder_output


class Postnet(nn.Module):
    """Postnet
        - Five 1-d convolution with 512 channels and kernel size 5
    """

    def __init__(self):
        super(Postnet, self).__init__()
        self.convolutions = nn.ModuleList()

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(80, 512,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='tanh'),
                nn.BatchNorm1d(512))
        )

        for i in range(1, 5 - 1):
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(512,
                             512,
                             kernel_size=5, stride=1,
                             padding=2,
                             dilation=1, w_init_gain='tanh'),
                    nn.BatchNorm1d(512))
            )

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(512, 80,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='linear'),
                nn.BatchNorm1d(80))
        )

    def forward(self, x):
        for i in range(len(self.convolutions) - 1):
            x = torch.tanh(self.convolutions[i](x))

        x = self.convolutions[-1](x)

        return x


class Generator(nn.Module):
    """Generator network."""
    def __init__(self, dim_neck, dim_emb, dim_pre, freq):
        super(Generator, self).__init__()

        self.encoder = Encoder(dim_neck, dim_emb, freq)
        self.decoder = Decoder(dim_neck, dim_emb, dim_pre)
        self.postnet = Postnet()

    def forward(self, x, c_org, c_trg):

        codes = self.encoder(x, c_org)
        if c_trg is None:
            return torch.cat(codes, dim=-1)

        tmp = []
        for code in codes:
            tmp.append(code.unsqueeze(1).expand(-1,int(x.size(1)/len(codes)),-1))
        code_exp = torch.cat(tmp, dim=1)

        encoder_outputs = torch.cat((code_exp, c_trg.unsqueeze(1).expand(-1,x.size(1),-1)), dim=-1)

        mel_outputs = self.decoder(encoder_outputs)

        mel_outputs_postnet = self.postnet(mel_outputs.transpose(2,1))
        mel_outputs_postnet = mel_outputs + mel_outputs_postnet.transpose(2,1)

        mel_outputs = mel_outputs.unsqueeze(1)
        mel_outputs_postnet = mel_outputs_postnet.unsqueeze(1)

        return mel_outputs, mel_outputs_postnet, torch.cat(codes, dim=-1)
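
The information bottleneck is the part of the model that is easiest to misread, so here is a minimal NumPy sketch of just the indexing: the encoder keeps one code per block of freq frames (forward direction read at the end of the block, backward direction at the start), and the generator repeats each code freq times before decoding. The batch axis is dropped and random arrays stand in for the LSTM outputs; both are simplifications for illustration.

```python
import numpy as np


def downsample_codes(out_forward, out_backward, freq):
    """Keep one code per block of freq frames.

    The forward output is read at the last frame of each block and the
    backward output at the first frame, then the two are concatenated.
    """
    codes = []
    for i in range(0, out_forward.shape[0], freq):
        codes.append(np.concatenate([out_forward[i + freq - 1],
                                     out_backward[i]]))
    return codes


def upsample_codes(codes, freq):
    """Repeat each code freq times to restore the original frame count."""
    return np.concatenate([np.tile(c, (freq, 1)) for c in codes], axis=0)


T, dim_neck, freq = 128, 32, 32          # values used in the paper setup
rng = np.random.RandomState(0)
out_forward = rng.randn(T, dim_neck)     # stand-in for the forward LSTM output
out_backward = rng.randn(T, dim_neck)    # stand-in for the backward LSTM output

codes = downsample_codes(out_forward, out_backward, freq)
code_exp = upsample_codes(codes, freq)
print(len(codes), codes[0].shape, code_exp.shape)
```

With T = 128 and freq = 32 the content of a 2 s crop is squeezed into 4 codes of 64 values each. It also shows why the number of frames has to be a multiple of freq: the index i + freq - 1 would otherwise run past the end of the sequence.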

3.5. solver_encoder.py

Implements the training procedure as described in the paper.

from model_vc import Generator
import torch
import torch.nn.functional as F
import time
import datetime


class Solver(object):

    def __init__(self, vcc_loader, config):
        """Initialize configurations."""

        # Data loader.
        self.vcc_loader = vcc_loader

        # Model configurations.
        self.lambda_cd = config.lambda_cd
        self.dim_neck = config.dim_neck
        self.dim_emb = config.dim_emb
        self.dim_pre = config.dim_pre
        self.freq = config.freq

        # Training configurations.
        self.batch_size = config.batch_size
        self.num_iters = config.num_iters

        # Miscellaneous.
        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device('cuda:0' if self.use_cuda else 'cpu')
        self.log_step = config.log_step

        # Build the model and tensorboard.
        self.build_model()

    def build_model(self):

        self.G = Generator(self.dim_neck, self.dim_emb, self.dim_pre, self.freq)

        self.g_optimizer = torch.optim.Adam(self.G.parameters(), 0.0001)

        self.G.to(self.device)

    def reset_grad(self):
        """Reset the gradient buffers."""
        self.g_optimizer.zero_grad()

    #=====================================================================================================================================#

    def train(self):
        # Set data loader.
        data_loader = self.vcc_loader

        # Print logs in specified order
        keys = ['G/loss_id','G/loss_id_psnt','G/loss_cd']

        # Start training.
        print('Start training...')
        start_time = time.time()
        for i in range(self.num_iters):

            # =================================================================================== #
            #                             1. Preprocess input data                                #
            # =================================================================================== #

            # Fetch data.
            try:
                x_real, emb_org = next(data_iter)
            except:
                data_iter = iter(data_loader)
                x_real, emb_org = next(data_iter)

            x_real = x_real.to(self.device)
            emb_org = emb_org.to(self.device)

            # =================================================================================== #
            #                               2. Train the generator                                #
            # =================================================================================== #

            self.G = self.G.train()

            # Identity mapping loss
            x_identic, x_identic_psnt, code_real = self.G(x_real, emb_org, emb_org)
            g_loss_id = F.mse_loss(x_real, x_identic)
            g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt)

            # Code semantic loss.
            code_reconst = self.G(x_identic_psnt, emb_org, None)
            g_loss_cd = F.l1_loss(code_real, code_reconst)

            # Backward and optimize.
            g_loss = g_loss_id + g_loss_id_psnt + self.lambda_cd * g_loss_cd
            self.reset_grad()
            g_loss.backward()
            self.g_optimizer.step()

            # Logging.
            loss = {}
            loss['G/loss_id'] = g_loss_id.item()
            loss['G/loss_id_psnt'] = g_loss_id_psnt.item()
            loss['G/loss_cd'] = g_loss_cd.item()

            # =================================================================================== #
            #                                 4. Miscellaneous                                    #
            # =================================================================================== #

            # Print out training information.
            if (i+1) % self.log_step == 0:
                et = time.time() - start_time
                et = str(datetime.timedelta(seconds=et))[:-7]
                log = "Elapsed [{}], Iteration [{}/{}]".format(et, i+1, self.num_iters)
                for tag in keys:
                    log += ", {}: {:.4f}".format(tag, loss[tag])
                print(log)
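
The objective in train() is the sum of two reconstruction terms and one code term, g_loss = loss_id + loss_id_psnt + lambda_cd * loss_cd. The NumPy sketch below reproduces that arithmetic on made-up arrays. It also illustrates a shape detail that may be relevant to the loss not matching the paper. Reading the shapes off the code above (my own reading, worth confirming with a print in the real loop), x_real is (batch, frames, 80) while the generator outputs carry an extra axis, (batch, 1, frames, 80). Subtracting the two broadcasts to (batch, batch, frames, 80), so each utterance would also be compared with the other reconstructions in the batch; as far as I know F.mse_loss broadcasts mismatched shapes in the same way. The default lambda_cd=1.0 here is only a placeholder.

```python
import numpy as np


def autovc_loss(x_real, x_identic, x_identic_psnt,
                code_real, code_reconst, lambda_cd=1.0):
    """Combine the three terms the way Solver.train() does.

    lambda_cd=1.0 is a placeholder default chosen for this sketch.
    """
    loss_id = np.mean((x_real - x_identic) ** 2)             # mean squared error
    loss_id_psnt = np.mean((x_real - x_identic_psnt) ** 2)   # same, after the postnet
    loss_cd = np.mean(np.abs(code_real - code_reconst))      # mean absolute error
    total = loss_id + loss_id_psnt + lambda_cd * loss_cd
    return total, loss_id, loss_id_psnt, loss_cd


rng = np.random.RandomState(0)
x_real = rng.rand(2, 128, 80)                  # (batch, frames, mels)
x_identic = x_real[:, np.newaxis] + 0.1        # (batch, 1, frames, mels)
x_identic_psnt = x_real[:, np.newaxis] + 0.2
code_real = np.zeros((2, 256))
code_reconst = np.full((2, 256), 0.5)

# The shapes differ by one axis, so plain subtraction broadcasts across the batch
print((x_real - x_identic).shape)

# Dropping the singleton axis compares each utterance only with its own output
total, loss_id, loss_id_psnt, loss_cd = autovc_loss(
    x_real, x_identic[:, 0], x_identic_psnt[:, 0], code_real, code_reconst)
print(total, loss_id, loss_id_psnt, loss_cd)
```

If the real tensors do have these shapes, squeezing the singleton axis before the two MSE terms would be the first thing I would try when comparing loss values against the paper.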

3.6. conversion.py

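Runs inference with the AutoVC model. It is fairly simple; most of the code goes into preparing the input data.

Every utterance in the metadata is converted to every speaker, so the cost is O(n^2). The sequence length is not limited, so it may differ a lot from the roughly 2 s segments seen during training.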

import os
import pickle
import torch
import numpy as np
from math import ceil
from model_vc import Generator


def pad_seq(x, base=32):
    len_out = int(base * ceil(float(x.shape[0])/base))
    len_pad = len_out - x.shape[0]
    assert len_pad >= 0
    return np.pad(x, ((0,len_pad),(0,0)), 'constant'), len_pad


device = 'cuda:0'
G = Generator(32,256,512,32).eval().to(device)

g_checkpoint = torch.load('autovc.ckpt')
G.load_state_dict(g_checkpoint['model'])

metadata = pickle.load(open('metadata.pkl', "rb"))

spect_vc = []

for sbmt_i in metadata:

    x_org = sbmt_i[2]
    x_org, len_pad = pad_seq(x_org)
    uttr_org = torch.from_numpy(x_org[np.newaxis, :, :]).to(device)
    emb_org = torch.from_numpy(sbmt_i[1][np.newaxis, :]).to(device)

    for sbmt_j in metadata:

        emb_trg = torch.from_numpy(sbmt_j[1][np.newaxis, :]).to(device)

        with torch.no_grad():
            _, x_identic_psnt, _ = G(uttr_org, emb_org, emb_trg)

        if len_pad == 0:
            uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()
        else:
            uttr_trg = x_identic_psnt[0, 0, :-len_pad, :].cpu().numpy()

        spect_vc.append( ('{}x{}'.format(sbmt_i[0], sbmt_j[0]), uttr_trg) )


with open('results.pkl', 'wb') as handle:
    pickle.dump(spect_vc, handle)
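
The padding step exists because the encoder samples one code every 32 frames, so the input length has to be a multiple of 32. A minimal sketch of pad_seq on a made-up 100-frame utterance, including the trimming that conversion.py does afterwards:

```python
import numpy as np
from math import ceil


def pad_seq(x, base=32):
    """Zero-pad the time axis up to the next multiple of base."""
    len_out = int(base * ceil(float(x.shape[0]) / base))
    len_pad = len_out - x.shape[0]
    assert len_pad >= 0
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad


mel = np.ones((100, 80), dtype=np.float32)   # made-up 100-frame utterance
padded, len_pad = pad_seq(mel)
print(padded.shape, len_pad)

# After conversion, the padded frames are dropped again
restored = padded if len_pad == 0 else padded[:-len_pad]
print(restored.shape)
```

The len_pad == 0 case needs its own branch because padded[:-0] would return an empty array rather than the whole sequence.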

3.7. vocoder.py

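Standard vocoder usage. Calling it is not hard; what matters is that the mel features are preprocessed in the same format the vocoder expects.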

import torch
import librosa
import pickle
from synthesis import build_model
from synthesis import wavegen

spect_vc = pickle.load(open('results.pkl', 'rb'))
device = torch.device("cuda")
model = build_model().to(device)
checkpoint = torch.load("checkpoint_step001000000_ema.pth")
model.load_state_dict(checkpoint["state_dict"])

for spect in spect_vc:
    name = spect[0]
    c = spect[1]
    print(name)
    waveform = wavegen(model, c=c)
    librosa.output.write_wav(name+'.wav', waveform, sr=16000)
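
One practical note on the last line: the librosa.output module was removed in librosa 0.8.0, so with a newer librosa than the script was written for, write_wav is no longer available. soundfile.write, which make_spect.py already depends on, is the usual replacement. The sketch below instead uses only the standard-library wave module so that it runs without extra packages; write_wav_16bit is a hypothetical helper, and a sine tone stands in for the vocoder output.

```python
import os
import tempfile
import wave

import numpy as np


def write_wav_16bit(path, waveform, sr=16000):
    """Write a float waveform in [-1, 1] as 16-bit mono PCM."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype('<i2')
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)        # 2 bytes per sample = 16 bits
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())


# A one-second 440 Hz tone stands in for the vocoder output (assumption)
sr = 16000
waveform = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
write_wav_16bit(path, waveform, sr)

with wave.open(path, 'rb') as f:
    n_channels, rate, n_frames = f.getnchannels(), f.getframerate(), f.getnframes()
print(n_channels, rate, n_frames)
```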

3.8. model_bl.py

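The speaker encoder. The structure is rather too simple: an RNN, then the last frame passed through a fully connected layer, then L2 normalization. It would be worth trying other encoders.

Combined with the 2 s crops and the averaging over 10 utterances, some of its weaknesses may be masked.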

import torch
import torch.nn as nn


class D_VECTOR(nn.Module):
    """d vector speaker embedding."""
    def __init__(self, num_layers=3, dim_input=40, dim_cell=256, dim_emb=64):
        super(D_VECTOR, self).__init__()
        self.lstm = nn.LSTM(input_size=dim_input, hidden_size=dim_cell,
                            num_layers=num_layers, batch_first=True)
        self.embedding = nn.Linear(dim_cell, dim_emb)

    def forward(self, x):
        self.lstm.flatten_parameters()
        lstm_out, _ = self.lstm(x)
        embeds = self.embedding(lstm_out[:,-1,:])
        norm = embeds.norm(p=2, dim=-1, keepdim=True)
        embeds_normalized = embeds.div(norm)
        return embeds_normalized
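
Since the encoder output is L2-normalized, the dot product of two embeddings is their cosine similarity. That gives a cheap way to look into the concern from section 0: embed a few utterances per VCTK speaker and check that same-speaker pairs score clearly higher than different-speaker pairs. The sketch below only shows the arithmetic; the random vectors are stand-ins, not real embeddings.

```python
import numpy as np


def l2_normalize(v):
    """Scale vectors along the last axis to unit length, like D_VECTOR.forward."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


rng = np.random.RandomState(0)
emb_a = l2_normalize(rng.randn(256))   # stand-in for one utterance embedding
emb_b = l2_normalize(rng.randn(256))   # stand-in for another

# For unit-length vectors the dot product is the cosine similarity
sim_aa = float(np.dot(emb_a, emb_a))
sim_ab = float(np.dot(emb_a, emb_b))
print(sim_aa, sim_ab)
```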

3.9. hparams.py

# NOTE: If you want full control for model architecture. please take a look
# at the code and change whatever you want. Some hyper parameters are hardcoded.

class Map(dict):
    """
    Example:
    m = Map({'first_name': 'Eduardo'}, last_name='Pool', age=24, sports=['Soccer'])

    Credits to epool:
    https://stackoverflow.com/questions/2352181/how-to-use-a-dot-to-access-members-of-dictionary
    """

    def __init__(self, *args, **kwargs):
        super(Map, self).__init__(*args, **kwargs)
        for arg in args:
            if isinstance(arg, dict):
                for k, v in arg.items():
                    self[k] = v

        if kwargs:
            # items() instead of the Python 2 iteritems() in the quoted code
            for k, v in kwargs.items():
                self[k] = v

    def __getattr__(self, attr):
        return self.get(attr)

    def __setattr__(self, key, value):
        self.__setitem__(key, value)

    def __setitem__(self, key, value):
        super(Map, self).__setitem__(key, value)
        self.__dict__.update({key: value})

    def __delattr__(self, item):
        self.__delitem__(item)

    def __delitem__(self, key):
        super(Map, self).__delitem__(key)
        del self.__dict__[key]


# Default hyperparameters:
hparams = Map({
    'name': "wavenet_vocoder",

    # Convenient model builder
    'builder': "wavenet",

    # Input type:
    # 1. raw [-1, 1]
    # 2. mulaw [-1, 1]
    # 3. mulaw-quantize [0, mu]
    # If input_type is raw or mulaw, network assumes scalar input and
    # discretized mixture of logistic distributions output, otherwise one-hot
    # input and softmax output are assumed.
    # **NOTE**: if you change the one of the two parameters below, you need to
    # re-run preprocessing before training.
    'input_type': "raw",
    'quantize_channels': 65536,  # 65536 or 256

    # Audio:
    'sample_rate': 16000,
    # this is only valid for mulaw is True
    'silence_threshold': 2,
    'num_mels': 80,
    'fmin': 125,
    'fmax': 7600,
    'fft_size': 1024,
    # shift can be specified by either hop_size or frame_shift_ms
    'hop_size': 256,
    'frame_shift_ms': None,
    'min_level_db': -100,
    'ref_level_db': 20,
    # whether to rescale waveform or not.
    # Let x is an input waveform, rescaled waveform y is given by:
    # y = x / np.abs(x).max() * rescaling_max
    'rescaling': True,
    'rescaling_max': 0.999,
    # mel-spectrogram is normalized to [0, 1] for each utterance and clipping may
    # happen depends on min_level_db and ref_level_db, causing clipping noise.
    # If False, assertion is added to ensure no clipping happens.
    'allow_clipping_in_normalization': True,

    # Mixture of logistic distributions:
    'log_scale_min': float(-32.23619130191664),

    # Model:
    # This should equal to `quantize_channels` if mu-law quantize enabled
    # otherwise num_mixture * 3 (pi, mean, log_scale)
    'out_channels': 10 * 3,
    'layers': 24,
    'stacks': 4,
    'residual_channels': 512,
    'gate_channels': 512,  # split into 2 groups internally for gated activation
    'skip_out_channels': 256,
    'dropout': 1 - 0.95,
    'kernel_size': 3,
    # If True, apply weight normalization as same as DeepVoice3
    'weight_normalization': True,
    # Use legacy code or not. Default is True since we already provided a model
    # based on the legacy code that can generate high-quality audio.
    # Ref: https://github.com/r9y9/wavenet_vocoder/pull/73
    'legacy': True,

    # Local conditioning (set negative value to disable))
    'cin_channels': 80,
    # If True, use transposed convolutions to upsample conditional features,
    # otherwise repeat features to adjust time resolution
    'upsample_conditional_features': True,
    # should np.prod(upsample_scales) == hop_size
    'upsample_scales': [4, 4, 4, 4],
    # Freq axis kernel size for upsampling network
    'freq_axis_kernel_size': 3,

    # Global conditioning (set negative value to disable)
    # currently limited for speaker embedding
    # this should only be enabled for multi-speaker dataset
    'gin_channels': -1,  # i.e., speaker embedding dim
    'n_speakers': -1,

    # Data loader
    'pin_memory': True,
    'num_workers': 2,

    # train/test
    # test size can be specified as portion or num samples
    'test_size': 0.0441,  # 50 for CMU ARCTIC single speaker
    'test_num_samples': None,
    'random_state': 1234,

    # Loss

    # Training:
    'batch_size': 2,
    'adam_beta1': 0.9,
    'adam_beta2': 0.999,
    'adam_eps': 1e-8,
    'amsgrad': False,
    'initial_learning_rate': 1e-3,
    # see lrschedule.py for available lr_schedule
    'lr_schedule': "noam_learning_rate_decay",
    'lr_schedule_kwargs': {},  # {"anneal_rate": 0.5, "anneal_interval": 50000},
    'nepochs': 2000,
    'weight_decay': 0.0,
    'clip_thresh': -1,
    # max time steps can either be specified as sec or steps
    # if both are None, then full audio samples are used in a batch
    'max_time_sec': None,
    'max_time_steps': 8000,
    # Hold moving averaged parameters and use them for evaluation
    'exponential_moving_average': True,
    # averaged = decay * averaged + (1 - decay) * x
    'ema_decay': 0.9999,

    # Save
    # per-step intervals
    'checkpoint_interval': 10000,
    'train_eval_interval': 10000,
    # per-epoch interval
    'test_eval_epoch_interval': 5,
    'save_optimizer_state': True,

    # Eval:
})


def hparams_debug_string():
    # hparams is a dict subclass here, so index it by key; the quoted code
    # called hparams.values() and then indexed the result, which fails on Python 3
    hp = ['  %s: %s' % (name, hparams[name]) for name in sorted(hparams)]
    return 'Hyperparameters:\n' + '\n'.join(hp)

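The parts circled in red in the code are unclear to me:

The code never seems to use max_time_steps. Does it belong to WaveNet?
checkpoint_interval and the related values are not used either. Ask my colleague or check the issues.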
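
My tentative reading: hparams.py looks like it was carried over from r9y9/wavenet_vocoder (the comment on 'legacy' links to that repository), so training-only entries such as max_time_steps and checkpoint_interval are probably read by that project's training script and simply go unused when the vocoder is only run for inference here. I have not confirmed this against the upstream code. What can be checked locally is that the entries which do matter at inference are consistent with make_spect.py. A minimal sketch, using a cut-down Map and a hand-copied subset of the values above:

```python
import numpy as np


class Map(dict):
    """Cut-down version of the Map class above: dict with attribute access."""
    def __getattr__(self, attr):
        return self.get(attr)


# A hand-copied subset of the hyperparameters listed above
hparams = Map({
    'sample_rate': 16000,
    'hop_size': 256,
    'num_mels': 80,
    'upsample_scales': [4, 4, 4, 4],
    'max_time_steps': 8000,
})

# The upsampling network has to expand one mel frame into hop_size samples
upsample_factor = int(np.prod(hparams.upsample_scales))
print(upsample_factor, hparams.hop_size)

# Missing keys come back as None instead of raising, because __getattr__
# is implemented with dict.get()
print(hparams.not_a_real_key)
```

The second print is worth keeping in mind when debugging: a misspelled hyperparameter name silently yields None rather than an AttributeError.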

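4. Running the code

4.1. conversion.py

python conversion.py

4.2. vocoder.py

I do not understand the part underlined in red.

python vocoder.py

It is just rather slow; later I will try Griffin-Lim (GL) instead.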
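
For a Griffin-Lim experiment the saved spectrograms first have to be mapped from [0, 1] back to linear mel amplitudes, by inverting the normalization at the end of make_spect.py. The sketch below does only that step and checks it with a round trip on made-up amplitudes. The function names are mine. Going from mel amplitudes to a waveform additionally needs a mel-to-linear inversion and a Griffin-Lim implementation (librosa.griffinlim is one option), which is not shown here.

```python
import numpy as np

MIN_LEVEL = np.exp(-100 / 20 * np.log(10))   # 1e-5, as in make_spect.py


def normalize_mel(mel_amplitude):
    """Forward mapping used in make_spect.py: amplitudes -> [0, 1]."""
    d_db = 20 * np.log10(np.maximum(MIN_LEVEL, mel_amplitude)) - 16
    return np.clip((d_db + 100) / 100, 0, 1)


def denormalize_mel(S):
    """Inverse mapping: [0, 1] values back to linear mel amplitudes."""
    d_db = S * 100 - 100 + 16
    return np.power(10.0, d_db / 20)


amps = np.array([1e-3, 1e-2, 0.1, 1.0])      # made-up amplitudes, inside the unclipped range
roundtrip = denormalize_mel(normalize_mel(amps))
print(roundtrip)
```

Values that were clipped at 0 or 1 during normalization cannot be recovered exactly; they come back as the clipping limits.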

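5. Completing the training code

5.1. Validation

5.1.1. main.py

5.1.2. dataloader.py

Keep the 2 s cropping scheme for now; do not introduce padding or length sorting.

5.1.3. solve_encoder.py

Fix the way the dataloader is used, changing it to:

5.2. TensorBoard

pip install tensorboardX

5.2.1 solve_encoder.py

5.3. VCTK processing

5.3.0 Silence removal with a self-adjusting noise level

Thanks to my colleague.

Compared with the alternatives, the results are very good:

Silent segments are removed.
The lip-smacking sounds at the start are kept.

5.3.1. Mel-spectrograms

Convert everything to mel.

5.3.2. metadata

I rewrote this part: paths are mostly listed in txt files, with the arrays stored as npy. I am not used to working with pkl.

It is also easier to inspect.

Similar to Rayhane's layout, there is one main meta file.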
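
The actual file layout is not shown in these notes, so the following is only a sketch of what a txt-plus-npy metadata file could look like. The pipe-separated fields (speaker, mel path, frame count) are an assumption made for illustration, loosely in the spirit of the train.txt used by Rayhane's Tacotron-2 code; write_meta and read_meta are hypothetical helpers.

```python
import os
import tempfile

import numpy as np


def write_meta(path, rows):
    """Write one 'speaker|mel_path|n_frames' line per utterance (assumed layout)."""
    with open(path, 'w', encoding='utf-8') as f:
        for speaker, mel_path, n_frames in rows:
            f.write('%s|%s|%d\n' % (speaker, mel_path, n_frames))


def read_meta(path):
    """Parse the file back into (speaker, mel_path, n_frames) tuples."""
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            speaker, mel_path, n_frames = line.rstrip('\n').split('|')
            rows.append((speaker, mel_path, int(n_frames)))
    return rows


# Two made-up utterances saved as npy, then listed in a single meta file
root = tempfile.mkdtemp()
rows = []
for speaker, name, n_frames in [('p225', 'p225_001', 150), ('p226', 'p226_001', 200)]:
    mel_path = os.path.join(root, name + '.npy')
    np.save(mel_path, np.zeros((n_frames, 80), dtype=np.float32))
    rows.append((speaker, mel_path, n_frames))

meta_path = os.path.join(root, 'meta.txt')
write_meta(meta_path, rows)
loaded = read_meta(meta_path)
print(loaded[0][0], loaded[0][2], np.load(loaded[0][1]).shape)
```

Storing the frame count in the text file makes it possible to filter out utterances that are too short without opening every npy file.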

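5.3.3. Modifying the dataloader

Since the metadata format changed, the dataloader needs a small adjustment -> data_loader_hujk17.py

I wrote a fairly simple one.

6. Train

6.1 Hyperparameters

Reasons:

The speaker dim can only be 256, because that is what the pretrained speaker encoder outputs (dim_emb=256).
My colleague uses a speaker dim of 128 with neck and freq both set to 16; I double both, to 32 and 32.
The paper also uses 32 and 32, and so does the pretrained model.

6.2. Data usage

VCTK has 106 speakers in total, split into seen and unseen; the seen speakers are further split into train and val.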
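
The notes do not say how many speakers are held out or how the validation utterances are chosen, so the sketch below is only one way to make such a split reproducible. The number of unseen speakers (6), the validation ratio (0.1), the seeds, and the speaker and utterance ids are all placeholders.

```python
import random


def split_speakers(speakers, n_unseen, seed=0):
    """Hold out n_unseen speakers as 'unseen'; the rest are 'seen'."""
    rng = random.Random(seed)
    shuffled = sorted(speakers)
    rng.shuffle(shuffled)
    return sorted(shuffled[n_unseen:]), sorted(shuffled[:n_unseen])


def split_utterances(utterances, val_ratio, seed=0):
    """Within one seen speaker, split utterances into train and val."""
    rng = random.Random(seed)
    shuffled = sorted(utterances)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_ratio))
    return sorted(shuffled[n_val:]), sorted(shuffled[:n_val])


# 106 made-up speaker ids; the real VCTK ids are not contiguous
speakers = ['p%03d' % i for i in range(225, 331)]
seen, unseen = split_speakers(speakers, n_unseen=6)
print(len(seen), len(unseen))

# 20 made-up utterance names for one seen speaker
utterances = ['%s_%03d' % (seen[0], i) for i in range(1, 21)]
train, val = split_utterances(utterances, val_ratio=0.1)
print(len(train), len(val))
```

Sorting before shuffling with a fixed seed keeps the split identical across runs and machines, regardless of the order in which the file system lists the directories.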

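7. Inference

7.1 conversion.py

The cross-conversion version.

7.2. inference.py

Write an A+B mode version, modeled on my colleague's.

7.3. make_spect.py

The output of my make_spect can be used directly for VC. From AutoVC_hujk17/full_106_spmel_nosli:

Pick any two lines.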


AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss, reproducing the paper's code