AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss: Code Reproduction Notes
0. Notes
https://github.com/auspicious3000/autovc
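Listening to the demo: in the unseen-speaker cases the converted timbre does resemble the target, but the audio quality is not good enough for commercial use.
Before reproducing the code from the repository, the following needs to be confirmed:
- Whether the pretrained speaker encoder in the open-source code can be used directly to extract speaker information for all VCTK speakers
- A colleague currently extracts VCTK embeddings with the pretrained speaker encoder, and the training loss does not get as low as the paper reports
- As a result, the converted timbre does not end up resembling the target
- So we need to find out which corpus the pretrained speaker encoder was trained on
0.1 Issues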
https://github.com/auspicious3000/autovc/issues/55
According to this discussion, most people's reproductions did not turn out well.
https://github.com/auspicious3000/autovc/issues/49
This reproduction is not ideal either.
https://github.com/auspicious3000/autovc/issues/48
The training recipe in this one is worth using as a reference.
1. Download the code
- git clone https://github.com/auspicious3000/autovc.git
- Rename it to AutoVC_hujk17
- Push it to my own Git repository so changes are easy to track: https://github.com/ruclion/AutoVC_hujk17
2. Set up the environment
- conda create -n pytorch_p36 python=3.6.5
- conda activate pytorch_p36
- conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
- conda install -c conda-forge librosa
- pip install tqdm
- pip install wavenet_vocoder
Installation is very slow; this can be changed by following: https://blog.csdn.net/Arthur_Holmes/article/details/105095088
2.1. Using Xiaobo's environment
- cd jay
- source ./venv/bin/activate
3. Reading the code
3.1. make_spect.py
Generates the mel-spectrograms.
import os
import pickle
import numpy as np
import soundfile as sf
from scipy import signal
from scipy.signal import get_window
from librosa.filters import mel
from numpy.random import RandomState
3.1.1. butter_highpass(cutoff, fs, order=5)
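I did not understand this function at first.
def butter_highpass(cutoff, fs, order=5):
    # Design a digital Butterworth high-pass filter.
    nyq = 0.5 * fs                  # Nyquist frequency
    normal_cutoff = cutoff / nyq    # cutoff as a fraction of Nyquist
    b, a = signal.butter(order, normal_cutoff, btype='high', analog=False)
    return b, a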
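To make the purpose of butter_highpass concrete, here is a small sketch of what the filter does when applied with signal.filtfilt, as main() does with a 30 Hz cutoff at 16 kHz. The test signal (a 440 Hz tone with a 0.5 DC offset) is made up for illustration and is not from the repository.

```python
import numpy as np
from scipy import signal


def butter_highpass(cutoff, fs, order=5):
    """Same design as in make_spect.py: Butterworth high-pass coefficients."""
    nyq = 0.5 * fs
    b, a = signal.butter(order, cutoff / nyq, btype="high", analog=False)
    return b, a


fs = 16000
t = np.arange(fs) / fs
# Illustrative input: constant offset (what the filter should remove) + 440 Hz tone.
x = 0.5 + np.sin(2 * np.pi * 440 * t)

b, a = butter_highpass(30, fs, order=5)
# filtfilt runs the filter forward and backward, so the result has no phase shift.
y = signal.filtfilt(b, a, x)

mid = slice(2000, -2000)  # ignore the edges, where filter transients live
print("mean before:", x[mid].mean(), "mean after:", y[mid].mean())
```

The offset and any slow drift below about 30 Hz disappear, while the speech band passes through essentially unchanged.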
3.1.2. def pySTFT(x, fft_length=1024, hop_length=256)
Computes the linear magnitude spectrogram. Why not use librosa?
def pySTFT(x, fft_length=1024, hop_length=256):

    x = np.pad(x, int(fft_length//2), mode='reflect')

    noverlap = fft_length - hop_length
    shape = x.shape[:-1]+((x.shape[-1]-noverlap)//hop_length, fft_length)
    strides = x.strides[:-1]+(hop_length*x.strides[-1], x.strides[-1])
    result = np.lib.stride_tricks.as_strided(x, shape=shape,
                                             strides=strides)

    fft_window = get_window('hann', fft_length, fftbins=True)
    result = np.fft.rfft(fft_window * result, n=fft_length).T

    return np.abs(result)
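The strided framing in pySTFT can look opaque, so the sketch below checks it against a plain Python loop over frames. It uses NumPy only: the periodic Hann window is built with np.hanning(N + 1)[:-1], which I am assuming is equivalent to get_window('hann', N, fftbins=True); the function names are mine. With reflect padding of fft_length // 2 on both sides, an input of L samples yields 1 + L // hop_length frames, which is the same frame count a centered STFT produces.

```python
import numpy as np


def stft_mag_strided(x, fft_length=1024, hop_length=256):
    """Magnitude STFT using the same as_strided framing as pySTFT."""
    x = np.pad(x, fft_length // 2, mode="reflect")
    n_frames = (x.shape[-1] - (fft_length - hop_length)) // hop_length
    frames = np.lib.stride_tricks.as_strided(
        x,
        shape=(n_frames, fft_length),
        strides=(hop_length * x.strides[-1], x.strides[-1]),
    )
    # Periodic Hann window (assumed stand-in for scipy's get_window).
    window = np.hanning(fft_length + 1)[:-1]
    return np.abs(np.fft.rfft(window * frames, n=fft_length)).T


def stft_mag_loop(x, fft_length=1024, hop_length=256):
    """Reference implementation: slice one frame at a time."""
    x = np.pad(x, fft_length // 2, mode="reflect")
    window = np.hanning(fft_length + 1)[:-1]
    n_frames = 1 + (len(x) - fft_length) // hop_length
    cols = []
    for i in range(n_frames):
        frame = x[i * hop_length:i * hop_length + fft_length]
        cols.append(np.abs(np.fft.rfft(window * frame)))
    return np.stack(cols, axis=1)


rng = np.random.RandomState(0)
x = rng.randn(16000)  # one second of noise at 16 kHz
fast = stft_mag_strided(x)
slow = stft_mag_loop(x)
print(fast.shape, np.allclose(fast, slow))
```

Both versions return an array of shape (fft_length // 2 + 1, n_frames); the strided one simply avoids copying each frame.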
3.1.x. main()
mel_basis = mel(16000, 1024, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))
b, a = butter_highpass(30, 16000, order=5)

# audio file directory
rootDir = './wavs'
# spectrogram directory
targetDir = './spmel'

dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)

for subdir in sorted(subdirList):
    print(subdir)
    if not os.path.exists(os.path.join(targetDir, subdir)):
        os.makedirs(os.path.join(targetDir, subdir))
    _,_, fileList = next(os.walk(os.path.join(dirName,subdir)))
    prng = RandomState(int(subdir[1:]))
    for fileName in sorted(fileList):
        # Read audio file
        x, fs = sf.read(os.path.join(dirName,subdir,fileName))
        # Remove drifting noise
        y = signal.filtfilt(b, a, x)
        # Add a little random noise for model robustness
        wav = y * 0.96 + (prng.rand(y.shape[0])-0.5)*1e-06
        # Compute spect
        D = pySTFT(wav).T
        # Convert to mel and normalize
        D_mel = np.dot(D, mel_basis)
        D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
        S = np.clip((D_db + 100) / 100, 0, 1)
        # save spect
        np.save(os.path.join(targetDir, subdir, fileName[:-4]),
                S.astype(np.float32), allow_pickle=False)
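The normalization at the end of main() is easier to follow with numbers. The sketch below repeats the same arithmetic on a few hand-picked magnitudes: min_level equals 1e-5 (that is, -100 dB), 16 dB is subtracted as a reference level, and the result is mapped from [-100, 0] dB to [0, 1] and clipped. The helper names and the inverse function are my own additions for illustration.

```python
import numpy as np

MIN_LEVEL = np.exp(-100 / 20 * np.log(10))  # same expression as make_spect.py, equals 1e-5


def normalize_mel(mel_magnitude):
    """Linear mel magnitude -> value in [0, 1], as in make_spect.py."""
    d_db = 20 * np.log10(np.maximum(MIN_LEVEL, mel_magnitude)) - 16
    return np.clip((d_db + 100) / 100, 0, 1)


def denormalize_mel(s):
    """Inverse mapping (illustrative); exact only where nothing was clipped."""
    d_db = s * 100 - 100 + 16
    return 10 ** (d_db / 20)


mags = np.array([1e-5, 1e-2, 1.0, 10 ** 0.8])
norm = normalize_mel(mags)
print(norm)
print(denormalize_mel(norm[1:3]))
```

A magnitude of 1.0 lands at 0.84, anything at or below roughly 6.3e-5 is clipped to 0, and anything at or above 10^0.8 is clipped to 1. Whatever vocoder consumes these features has to assume the same mapping.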
3.2. make_metadata.py
Builds the training metadata and the speaker embeddings.
For each speaker, 10 utterances are chosen and a random crop of 128 frames (about 2 s) is taken from each; the speaker-encoder outputs of these crops are averaged.
The same averaged embedding is used for the whole of training and never changes.
In practice this is very close to a fixed one-hot (lookup) speaker embedding.
"""
Generate speaker embeddings and metadata for training
"""
import os
import pickle
from model_bl import D_VECTOR
from collections import OrderedDict
import numpy as np
import torch

C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
c_checkpoint = torch.load('3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_key = key[7:]
    new_state_dict[new_key] = val
C.load_state_dict(new_state_dict)
num_uttrs = 10
len_crop = 128

# Directory containing mel-spectrograms
rootDir = './spmel'
dirName, subdirList, _ = next(os.walk(rootDir))
print('Found directory: %s' % dirName)

speakers = []
for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    utterances = []
    utterances.append(speaker)
    _, _, fileList = next(os.walk(os.path.join(dirName,speaker)))

    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    embs = []
    for i in range(num_uttrs):
        tmp = np.load(os.path.join(dirName, speaker, fileList[idx_uttrs[i]]))
        candidates = np.delete(np.arange(len(fileList)), idx_uttrs)
        # choose another utterance if the current one is too short
        while tmp.shape[0] < len_crop:
            idx_alt = np.random.choice(candidates)
            tmp = np.load(os.path.join(dirName, speaker, fileList[idx_alt]))
            candidates = np.delete(candidates, np.argwhere(candidates==idx_alt))
        # NOTE: np.random.randint(0, 0) raises ValueError, so an utterance of
        # exactly len_crop frames would fail here.
        left = np.random.randint(0, tmp.shape[0]-len_crop)
        melsp = torch.from_numpy(tmp[np.newaxis, left:left+len_crop, :]).cuda()
        emb = C(melsp)
        embs.append(emb.detach().squeeze().cpu().numpy())
    utterances.append(np.mean(embs, axis=0))

    # create file list
    for fileName in sorted(fileList):
        utterances.append(os.path.join(speaker,fileName))
    speakers.append(utterances)

with open(os.path.join(rootDir, 'train.pkl'), 'wb') as handle:
    pickle.dump(speakers, handle)
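One detail of the averaging step is worth seeing numerically. D_VECTOR returns L2-normalized embeddings, but make_metadata.py takes a plain mean over the 10 crops and does not renormalize, so the stored speaker vector generally has a norm below 1. The sketch below uses random unit vectors as stand-ins for real encoder outputs (an assumption for illustration only) and also spells out the crop duration implied by len_crop=128 with a hop of 256 samples at 16 kHz.

```python
import numpy as np


def average_embedding(embs):
    """Plain mean over per-crop embeddings, as make_metadata.py does."""
    return np.mean(embs, axis=0)


rng = np.random.RandomState(0)
raw = rng.randn(10, 256)  # stand-in for 10 speaker-encoder outputs
embs = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # unit norm per crop

spk = average_embedding(embs)
spk_norm = np.linalg.norm(spk)
crop_seconds = 128 * 256 / 16000  # frames * hop / sample rate

print("norm of averaged embedding:", spk_norm)
print("crop length in seconds:", crop_seconds)
```

With real embeddings of one speaker the crops agree far more than random vectors do, so the norm stays closer to 1. Whether renormalizing would help is an open question that these notes do not test.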
3.3. data_loader.py
Each item is a fixed-length crop (about 2 s) of one utterance, returned together with the matching speaker embedding.
from torch.utils import data
import torch
import numpy as np
import pickle
import os

from multiprocessing import Process, Manager


class Utterances(data.Dataset):
    """Dataset class for the Utterances dataset."""

    def __init__(self, root_dir, len_crop):
        """Initialize and preprocess the Utterances dataset."""
        self.root_dir = root_dir
        self.len_crop = len_crop
        self.step = 10

        metaname = os.path.join(self.root_dir, "train.pkl")
        meta = pickle.load(open(metaname, "rb"))

        """Load data using multiprocessing"""
        manager = Manager()
        meta = manager.list(meta)
        dataset = manager.list(len(meta)*[None])
        processes = []
        for i in range(0, len(meta), self.step):
            p = Process(target=self.load_data,
                        args=(meta[i:i+self.step],dataset,i))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

        self.train_dataset = list(dataset)
        self.num_tokens = len(self.train_dataset)

        print('Finished loading the dataset...')

    def load_data(self, submeta, dataset, idx_offset):
        for k, sbmt in enumerate(submeta):
            uttrs = len(sbmt)*[None]
            for j, tmp in enumerate(sbmt):
                if j < 2: # fill in speaker id and embedding
                    uttrs[j] = tmp
                else: # load the mel-spectrograms
                    uttrs[j] = np.load(os.path.join(self.root_dir, tmp))
            dataset[idx_offset+k] = uttrs

    def __getitem__(self, index):
        # pick a random speaker
        dataset = self.train_dataset
        list_uttrs = dataset[index]
        emb_org = list_uttrs[1]

        # pick random uttr with random crop
        a = np.random.randint(2, len(list_uttrs))
        tmp = list_uttrs[a]
        if tmp.shape[0] < self.len_crop:
            len_pad = self.len_crop - tmp.shape[0]
            uttr = np.pad(tmp, ((0,len_pad),(0,0)), 'constant')
        elif tmp.shape[0] > self.len_crop:
            left = np.random.randint(tmp.shape[0]-self.len_crop)
            uttr = tmp[left:left+self.len_crop, :]
        else:
            uttr = tmp

        return uttr, emb_org

    def __len__(self):
        """Return the number of spkrs."""
        return self.num_tokens


def get_loader(root_dir, batch_size=16, len_crop=128, num_workers=0):
    """Build and return a data loader."""

    dataset = Utterances(root_dir, len_crop)

    worker_init_fn = lambda x: np.random.seed((torch.initial_seed()) % (2**32))
    data_loader = data.DataLoader(dataset=dataset,
                                  batch_size=batch_size,
                                  shuffle=True,
                                  num_workers=num_workers,
                                  drop_last=True,
                                  worker_init_fn=worker_init_fn)
    return data_loader
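The core of __getitem__ is the pad-or-crop rule that forces every training example to exactly len_crop frames. A standalone sketch of that rule follows; the function name and the explicit rng argument are mine, the logic mirrors the three branches in the listing.

```python
import numpy as np


def crop_or_pad(mel, len_crop=128, rng=np.random):
    """Return exactly len_crop frames: zero-pad short inputs, random-crop long ones."""
    n_frames = mel.shape[0]
    if n_frames < len_crop:
        return np.pad(mel, ((0, len_crop - n_frames), (0, 0)), "constant")
    if n_frames > len_crop:
        left = rng.randint(n_frames - len_crop)
        return mel[left:left + len_crop, :]
    return mel


rng = np.random.RandomState(0)
short = crop_or_pad(np.ones((50, 80), dtype=np.float32), rng=rng)
exact = crop_or_pad(np.ones((128, 80), dtype=np.float32), rng=rng)
long_ = crop_or_pad(np.ones((300, 80), dtype=np.float32), rng=rng)
print(short.shape, exact.shape, long_.shape)
```

Short utterances are padded with zeros at the end, which in the normalized mel scale corresponds to the clipped floor level.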
3.4. model_vc.py
Defines the model architecture; standard PyTorch usage.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class LinearNorm(torch.nn.Module):
    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
        super(LinearNorm, self).__init__()
        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)

        torch.nn.init.xavier_uniform_(
            self.linear_layer.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, x):
        return self.linear_layer(x)


class ConvNorm(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
        super(ConvNorm, self).__init__()
        if padding is None:
            assert(kernel_size % 2 == 1)
            padding = int(dilation * (kernel_size - 1) / 2)

        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, stride=stride,
                                    padding=padding, dilation=dilation,
                                    bias=bias)

        torch.nn.init.xavier_uniform_(
            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        conv_signal = self.conv(signal)
        return conv_signal


class Encoder(nn.Module):
    """Encoder module:
    """
    def __init__(self, dim_neck, dim_emb, freq):
        super(Encoder, self).__init__()
        self.dim_neck = dim_neck
        self.freq = freq

        convolutions = []
        for i in range(3):
            conv_layer = nn.Sequential(
                ConvNorm(80+dim_emb if i==0 else 512,
                         512,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(512))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm = nn.LSTM(512, dim_neck, 2, batch_first=True, bidirectional=True)

    def forward(self, x, c_org):
        x = x.squeeze(1).transpose(2,1)
        c_org = c_org.unsqueeze(-1).expand(-1, -1, x.size(-1))
        x = torch.cat((x, c_org), dim=1)

        for conv in self.convolutions:
            x = F.relu(conv(x))
        x = x.transpose(1, 2)

        self.lstm.flatten_parameters()
        outputs, _ = self.lstm(x)
        out_forward = outputs[:, :, :self.dim_neck]
        out_backward = outputs[:, :, self.dim_neck:]

        codes = []
        for i in range(0, outputs.size(1), self.freq):
            codes.append(torch.cat((out_forward[:,i+self.freq-1,:],out_backward[:,i,:]), dim=-1))

        return codes


class Decoder(nn.Module):
    """Decoder module:
    """
    def __init__(self, dim_neck, dim_emb, dim_pre):
        super(Decoder, self).__init__()

        self.lstm1 = nn.LSTM(dim_neck*2+dim_emb, dim_pre, 1, batch_first=True)

        convolutions = []
        for i in range(3):
            conv_layer = nn.Sequential(
                ConvNorm(dim_pre,
                         dim_pre,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='relu'),
                nn.BatchNorm1d(dim_pre))
            convolutions.append(conv_layer)
        self.convolutions = nn.ModuleList(convolutions)

        self.lstm2 = nn.LSTM(dim_pre, 1024, 2, batch_first=True)

        self.linear_projection = LinearNorm(1024, 80)

    def forward(self, x):

        #self.lstm1.flatten_parameters()
        x, _ = self.lstm1(x)
        x = x.transpose(1, 2)

        for conv in self.convolutions:
            x = F.relu(conv(x))
        x = x.transpose(1, 2)

        outputs, _ = self.lstm2(x)

        decoder_output = self.linear_projection(outputs)

        return decoder_output


class Postnet(nn.Module):
    """Postnet
        - Five 1-d convolution with 512 channels and kernel size 5
    """

    def __init__(self):
        super(Postnet, self).__init__()
        self.convolutions = nn.ModuleList()

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(80, 512,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='tanh'),
                nn.BatchNorm1d(512))
        )

        for i in range(1, 5 - 1):
            self.convolutions.append(
                nn.Sequential(
                    ConvNorm(512,
                             512,
                             kernel_size=5, stride=1,
                             padding=2,
                             dilation=1, w_init_gain='tanh'),
                    nn.BatchNorm1d(512))
            )

        self.convolutions.append(
            nn.Sequential(
                ConvNorm(512, 80,
                         kernel_size=5, stride=1,
                         padding=2,
                         dilation=1, w_init_gain='linear'),
                nn.BatchNorm1d(80))
        )

    def forward(self, x):
        for i in range(len(self.convolutions) - 1):
            x = torch.tanh(self.convolutions[i](x))

        x = self.convolutions[-1](x)

        return x


class Generator(nn.Module):
    """Generator network."""
    def __init__(self, dim_neck, dim_emb, dim_pre, freq):
        super(Generator, self).__init__()

        self.encoder = Encoder(dim_neck, dim_emb, freq)
        self.decoder = Decoder(dim_neck, dim_emb, dim_pre)
        self.postnet = Postnet()

    def forward(self, x, c_org, c_trg):

        codes = self.encoder(x, c_org)
        if c_trg is None:
            return torch.cat(codes, dim=-1)

        tmp = []
        for code in codes:
            tmp.append(code.unsqueeze(1).expand(-1,int(x.size(1)/len(codes)),-1))
        code_exp = torch.cat(tmp, dim=1)

        encoder_outputs = torch.cat((code_exp, c_trg.unsqueeze(1).expand(-1,x.size(1),-1)), dim=-1)

        mel_outputs = self.decoder(encoder_outputs)

        mel_outputs_postnet = self.postnet(mel_outputs.transpose(2,1))
        mel_outputs_postnet = mel_outputs + mel_outputs_postnet.transpose(2,1)

        mel_outputs = mel_outputs.unsqueeze(1)
        mel_outputs_postnet = mel_outputs_postnet.unsqueeze(1)

        return mel_outputs, mel_outputs_postnet, torch.cat(codes, dim=-1)
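The information bottleneck is implemented by the sampling at the end of Encoder.forward and the expansion at the start of Generator.forward. The NumPy sketch below reproduces just that indexing for a single utterance, with random arrays standing in for the BiLSTM outputs (an assumption for illustration; no network is involved). Every freq frames, the forward direction contributes its last frame of the segment and the backward direction its first, and each resulting code is then repeated to fill the segment again.

```python
import numpy as np


def downsample_codes(out_forward, out_backward, freq):
    """out_*: (T, dim_neck). One code of size 2 * dim_neck per freq frames."""
    codes = []
    for i in range(0, out_forward.shape[0], freq):
        codes.append(np.concatenate((out_forward[i + freq - 1], out_backward[i])))
    return codes


def upsample_codes(codes, n_frames):
    """Repeat each code so the sequence is n_frames long again."""
    reps = n_frames // len(codes)
    return np.concatenate([np.tile(c, (reps, 1)) for c in codes], axis=0)


T, dim_neck, freq = 128, 32, 32
rng = np.random.RandomState(0)
out_forward = rng.randn(T, dim_neck)   # stand-in for the forward LSTM half
out_backward = rng.randn(T, dim_neck)  # stand-in for the backward LSTM half

codes = downsample_codes(out_forward, out_backward, freq)
code_exp = upsample_codes(codes, T)
print(len(codes), codes[0].shape, code_exp.shape)
```

With dim_neck=32 and freq=32, a 128-frame crop is squeezed into 4 codes of 64 values each (256 numbers in total) before the decoder sees it. The indexing assumes T is a multiple of freq, which is why conversion.py pads inputs to a multiple of 32.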
3.5. solver_encoder.py
Implements the training procedure as described in the paper.
from model_vc import Generator
import torch
import torch.nn.functional as F
import time
import datetime


class Solver(object):

    def __init__(self, vcc_loader, config):
        """Initialize configurations."""

        # Data loader.
        self.vcc_loader = vcc_loader

        # Model configurations.
        self.lambda_cd = config.lambda_cd
        self.dim_neck = config.dim_neck
        self.dim_emb = config.dim_emb
        self.dim_pre = config.dim_pre
        self.freq = config.freq

        # Training configurations.
        self.batch_size = config.batch_size
        self.num_iters = config.num_iters

        # Miscellaneous.
        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device('cuda:0' if self.use_cuda else 'cpu')
        self.log_step = config.log_step

        # Build the model and tensorboard.
        self.build_model()

    def build_model(self):

        self.G = Generator(self.dim_neck, self.dim_emb, self.dim_pre, self.freq)

        self.g_optimizer = torch.optim.Adam(self.G.parameters(), 0.0001)

        self.G.to(self.device)

    def reset_grad(self):
        """Reset the gradient buffers."""
        self.g_optimizer.zero_grad()

    #=====================================================================================================================================#

    def train(self):
        # Set data loader.
        data_loader = self.vcc_loader

        # Print logs in specified order
        keys = ['G/loss_id','G/loss_id_psnt','G/loss_cd']

        # Start training.
        print('Start training...')
        start_time = time.time()
        for i in range(self.num_iters):

            # =================================================================================== #
            #                             1. Preprocess input data                                #
            # =================================================================================== #

            # Fetch data.
            try:
                x_real, emb_org = next(data_iter)
            except:
                data_iter = iter(data_loader)
                x_real, emb_org = next(data_iter)

            x_real = x_real.to(self.device)
            emb_org = emb_org.to(self.device)

            # =================================================================================== #
            #                               2. Train the generator                                #
            # =================================================================================== #

            self.G = self.G.train()

            # Identity mapping loss
            # NOTE: x_real is (B, T, 80) while the generator outputs are
            # (B, 1, T, 80), so these two losses are computed on broadcast tensors.
            x_identic, x_identic_psnt, code_real = self.G(x_real, emb_org, emb_org)
            g_loss_id = F.mse_loss(x_real, x_identic)
            g_loss_id_psnt = F.mse_loss(x_real, x_identic_psnt)

            # Code semantic loss.
            code_reconst = self.G(x_identic_psnt, emb_org, None)
            g_loss_cd = F.l1_loss(code_real, code_reconst)

            # Backward and optimize.
            g_loss = g_loss_id + g_loss_id_psnt + self.lambda_cd * g_loss_cd
            self.reset_grad()
            g_loss.backward()
            self.g_optimizer.step()

            # Logging.
            loss = {}
            loss['G/loss_id'] = g_loss_id.item()
            loss['G/loss_id_psnt'] = g_loss_id_psnt.item()
            loss['G/loss_cd'] = g_loss_cd.item()

            # =================================================================================== #
            #                                 4. Miscellaneous                                    #
            # =================================================================================== #

            # Print out training information.
            if (i+1) % self.log_step == 0:
                et = time.time() - start_time
                et = str(datetime.timedelta(seconds=et))[:-7]
                log = "Elapsed [{}], Iteration [{}/{}]".format(et, i+1, self.num_iters)
                for tag in keys:
                    log += ", {}: {:.4f}".format(tag, loss[tag])
                print(log)
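One thing that stands out when tracing shapes through the listings: the data loader yields x_real with shape (B, T, 80), while Generator.forward applies unsqueeze(1) to its outputs, giving (B, 1, T, 80). If that reading is right, the two MSE terms compare tensors that broadcast to (B, B, T, 80), so every utterance is compared against every reconstruction in the batch. The NumPy sketch below illustrates the effect with a made-up batch; it is a stand-in for the PyTorch call, which follows the same broadcasting rules, and it is only a possible contributor to the loss gap mentioned in section 0, not a confirmed cause.

```python
import numpy as np


def mse(a, b):
    """Mean squared error with NumPy broadcasting."""
    return np.mean((a - b) ** 2)


B, T, D = 4, 128, 80
rng = np.random.RandomState(0)
x_real = rng.rand(B, T, D)
# Pretend the model reconstructs its input perfectly, but with the extra axis.
x_identic = x_real[:, np.newaxis, :, :].copy()  # shape (B, 1, T, D)

broadcast_shape = (x_real - x_identic).shape
loss_broadcast = mse(x_real, x_identic)        # mixes different utterances
loss_aligned = mse(x_real, x_identic[:, 0])    # shapes match: (B, T, D)

print(broadcast_shape, loss_broadcast, loss_aligned)
```

Even a perfect reconstruction gets a clearly non-zero loss in the broadcast case, so the loss cannot go to zero. Dropping the extra axis before computing the loss (for example by indexing it away as above) is one way to align the shapes.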
3.6. conversion.py
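Runs inference with the AutoVC model. It is fairly simple; most of the code goes into preparing the input data.
Every utterance in the metadata is converted to every speaker (n^2 conversions). The sequence length is not limited, so it may differ considerably from the roughly 2 s crops used during training.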
import os
import pickle
import torch
import numpy as np
from math import ceil
from model_vc import Generator


def pad_seq(x, base=32):
    len_out = int(base * ceil(float(x.shape[0])/base))
    len_pad = len_out - x.shape[0]
    assert len_pad >= 0
    return np.pad(x, ((0,len_pad),(0,0)), 'constant'), len_pad


device = 'cuda:0'
G = Generator(32,256,512,32).eval().to(device)

g_checkpoint = torch.load('autovc.ckpt')
G.load_state_dict(g_checkpoint['model'])

metadata = pickle.load(open('metadata.pkl', "rb"))

spect_vc = []

for sbmt_i in metadata:

    x_org = sbmt_i[2]
    x_org, len_pad = pad_seq(x_org)
    uttr_org = torch.from_numpy(x_org[np.newaxis, :, :]).to(device)
    emb_org = torch.from_numpy(sbmt_i[1][np.newaxis, :]).to(device)

    for sbmt_j in metadata:

        emb_trg = torch.from_numpy(sbmt_j[1][np.newaxis, :]).to(device)

        with torch.no_grad():
            _, x_identic_psnt, _ = G(uttr_org, emb_org, emb_trg)

        if len_pad == 0:
            uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()
        else:
            uttr_trg = x_identic_psnt[0, 0, :-len_pad, :].cpu().numpy()

        spect_vc.append( ('{}x{}'.format(sbmt_i[0], sbmt_j[0]), uttr_trg) )


with open('results.pkl', 'wb') as handle:
    pickle.dump(spect_vc, handle)
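pad_seq is what lets utterances of arbitrary length pass through a model whose bottleneck samples every 32 frames. The sketch below runs the same function on two made-up lengths and shows why conversion.py treats len_pad == 0 as a special case: slicing with [:-0] would return an empty array instead of the full one.

```python
import numpy as np
from math import ceil


def pad_seq(x, base=32):
    """Zero-pad the time axis up to the next multiple of base (as in conversion.py)."""
    len_out = int(base * ceil(float(x.shape[0]) / base))
    len_pad = len_out - x.shape[0]
    assert len_pad >= 0
    return np.pad(x, ((0, len_pad), (0, 0)), "constant"), len_pad


padded_a, pad_a = pad_seq(np.ones((90, 80), dtype=np.float32))
padded_b, pad_b = pad_seq(np.ones((64, 80), dtype=np.float32))
print(padded_a.shape, pad_a, padded_b.shape, pad_b)

# Trimming the padding back off: [:-0] is an empty slice, hence the branch.
trimmed_a = padded_a[:-pad_a] if pad_a else padded_a
empty = padded_b[:-0]
print(trimmed_a.shape, empty.shape)
```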
3.7. vocoder.py
Standard vocoder usage. Calling it is not hard; what matters is that the feature format matches the one used in preprocessing.
import torch
import librosa
import pickle
from synthesis import build_model
from synthesis import wavegen

spect_vc = pickle.load(open('results.pkl', 'rb'))
device = torch.device("cuda")
model = build_model().to(device)
checkpoint = torch.load("checkpoint_step001000000_ema.pth")
model.load_state_dict(checkpoint["state_dict"])

for spect in spect_vc:
    name = spect[0]
    c = spect[1]
    print(name)
    waveform = wavegen(model, c=c)
    # NOTE: librosa.output was removed in librosa 0.8, so this call needs an older librosa.
    librosa.output.write_wav(name+'.wav', waveform, sr=16000)
3.8. model_bl.py
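The speaker encoder. The structure is rather simple: an RNN, then the last frame through a fully connected layer, then L2 normalization. Other encoders could be tried here.
Combined with the 2 s crops and the averaging over 10 utterances, some of its weaknesses may be hidden.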
import torch
import torch.nn as nn


class D_VECTOR(nn.Module):
    """d vector speaker embedding."""
    def __init__(self, num_layers=3, dim_input=40, dim_cell=256, dim_emb=64):
        super(D_VECTOR, self).__init__()
        self.lstm = nn.LSTM(input_size=dim_input, hidden_size=dim_cell,
                            num_layers=num_layers, batch_first=True)
        self.embedding = nn.Linear(dim_cell, dim_emb)

    def forward(self, x):
        self.lstm.flatten_parameters()
        lstm_out, _ = self.lstm(x)
        embeds = self.embedding(lstm_out[:,-1,:])
        norm = embeds.norm(p=2, dim=-1, keepdim=True)
        embeds_normalized = embeds.div(norm)
        return embeds_normalized
3.9. hparams.py
# NOTE: If you want full control for model architecture. please take a look
# at the code and change whatever you want. Some hyper parameters are hardcoded.


class Map(dict):
    """
    Example:
    m = Map({'first_name': 'Eduardo'}, last_name='Pool', age=24, sports=['Soccer'])

    Credits to epool:
    https://stackoverflow.com/questions/2352181/how-to-use-a-dot-to-access-members-of-dictionary
    """

    def __init__(self, *args, **kwargs):
        super(Map, self).__init__(*args, **kwargs)
        for arg in args:
            if isinstance(arg, dict):
                for k, v in arg.items():
                    self[k] = v

        if kwargs:
            # The listing used kwargs.iteritems(), which exists only in Python 2.
            for k, v in kwargs.items():
                self[k] = v

    def __getattr__(self, attr):
        return self.get(attr)

    def __setattr__(self, key, value):
        self.__setitem__(key, value)

    def __setitem__(self, key, value):
        super(Map, self).__setitem__(key, value)
        self.__dict__.update({key: value})

    def __delattr__(self, item):
        self.__delitem__(item)

    def __delitem__(self, key):
        super(Map, self).__delitem__(key)
        del self.__dict__[key]


# Default hyperparameters:
hparams = Map({
    'name': "wavenet_vocoder",

    # Convenient model builder
    'builder': "wavenet",

    # Input type:
    # 1. raw [-1, 1]
    # 2. mulaw [-1, 1]
    # 3. mulaw-quantize [0, mu]
    # If input_type is raw or mulaw, network assumes scalar input and
    # discretized mixture of logistic distributions output, otherwise one-hot
    # input and softmax output are assumed.
    # **NOTE**: if you change the one of the two parameters below, you need to
    # re-run preprocessing before training.
    'input_type': "raw",
    'quantize_channels': 65536,  # 65536 or 256

    # Audio:
    'sample_rate': 16000,
    # this is only valid for mulaw is True
    'silence_threshold': 2,
    'num_mels': 80,
    'fmin': 125,
    'fmax': 7600,
    'fft_size': 1024,
    # shift can be specified by either hop_size or frame_shift_ms
    'hop_size': 256,
    'frame_shift_ms': None,
    'min_level_db': -100,
    'ref_level_db': 20,
    # whether to rescale waveform or not.
    # Let x is an input waveform, rescaled waveform y is given by:
    # y = x / np.abs(x).max() * rescaling_max
    'rescaling': True,
    'rescaling_max': 0.999,
    # mel-spectrogram is normalized to [0, 1] for each utterance and clipping may
    # happen depends on min_level_db and ref_level_db, causing clipping noise.
    # If False, assertion is added to ensure no clipping happens.
    'allow_clipping_in_normalization': True,

    # Mixture of logistic distributions:
    'log_scale_min': float(-32.23619130191664),

    # Model:
    # This should equal to `quantize_channels` if mu-law quantize enabled
    # otherwise num_mixture * 3 (pi, mean, log_scale)
    'out_channels': 10 * 3,
    'layers': 24,
    'stacks': 4,
    'residual_channels': 512,
    'gate_channels': 512,  # split into 2 groups internally for gated activation
    'skip_out_channels': 256,
    'dropout': 1 - 0.95,
    'kernel_size': 3,
    # If True, apply weight normalization as same as DeepVoice3
    'weight_normalization': True,
    # Use legacy code or not. Default is True since we already provided a model
    # based on the legacy code that can generate high-quality audio.
    # Ref: https://github.com/r9y9/wavenet_vocoder/pull/73
    'legacy': True,

    # Local conditioning (set negative value to disable))
    'cin_channels': 80,
    # If True, use transposed convolutions to upsample conditional features,
    # otherwise repeat features to adjust time resolution
    'upsample_conditional_features': True,
    # should np.prod(upsample_scales) == hop_size
    'upsample_scales': [4, 4, 4, 4],
    # Freq axis kernel size for upsampling network
    'freq_axis_kernel_size': 3,

    # Global conditioning (set negative value to disable)
    # currently limited for speaker embedding
    # this should only be enabled for multi-speaker dataset
    'gin_channels': -1,  # i.e., speaker embedding dim
    'n_speakers': -1,

    # Data loader
    'pin_memory': True,
    'num_workers': 2,

    # train/test
    # test size can be specified as portion or num samples
    'test_size': 0.0441,  # 50 for CMU ARCTIC single speaker
    'test_num_samples': None,
    'random_state': 1234,

    # Loss

    # Training:
    'batch_size': 2,
    'adam_beta1': 0.9,
    'adam_beta2': 0.999,
    'adam_eps': 1e-8,
    'amsgrad': False,
    'initial_learning_rate': 1e-3,
    # see lrschedule.py for available lr_schedule
    'lr_schedule': "noam_learning_rate_decay",
    'lr_schedule_kwargs': {},  # {"anneal_rate": 0.5, "anneal_interval": 50000},
    'nepochs': 2000,
    'weight_decay': 0.0,
    'clip_thresh': -1,
    # max time steps can either be specified as sec or steps
    # if both are None, then full audio samples are used in a batch
    'max_time_sec': None,
    'max_time_steps': 8000,
    # Hold moving averaged parameters and use them for evaluation
    'exponential_moving_average': True,
    # averaged = decay * averaged + (1 - decay) * x
    'ema_decay': 0.9999,

    # Save
    # per-step intervals
    'checkpoint_interval': 10000,
    'train_eval_interval': 10000,
    # per-epoch interval
    'test_eval_epoch_interval': 5,
    'save_optimizer_state': True,

    # Eval:
})


def hparams_debug_string():
    values = hparams.values()
    hp = ['  %s: %s' % (name, values[name]) for name in sorted(values)]
    return 'Hyperparameters:\n' + '\n'.join(hp)
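The parts of the code circled in red are not clear to me:
- max_time_steps does not seem to be used anywhere in this code. Is it only for WaveNet?
- checkpoint_interval and similar values are not used either. Ask my colleague or check the issues.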
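Since the vocoder has to consume the mel-spectrograms produced by make_spect.py, it is worth comparing the audio settings in the two listings side by side. The sketch below copies the relevant values from hparams.py and from the make_spect.py main() into two plain dictionaries (the dictionary and function names are mine) and checks two things: that the product of upsample_scales equals hop_size, as the comment in hparams.py requires, and which settings differ.

```python
from functools import reduce
from operator import mul

# Values copied from the hparams.py listing.
vocoder_audio = {
    "sample_rate": 16000, "num_mels": 80, "fmin": 125, "fmax": 7600,
    "fft_size": 1024, "hop_size": 256,
}
upsample_scales = [4, 4, 4, 4]

# Values copied from make_spect.py (mel(...) call and pySTFT defaults).
make_spect_audio = {
    "sample_rate": 16000, "num_mels": 80, "fmin": 90, "fmax": 7600,
    "fft_size": 1024, "hop_size": 256,
}


def diff_settings(a, b):
    """Return {key: (a_value, b_value)} for keys whose values differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


upsample_total = reduce(mul, upsample_scales, 1)
mismatches = diff_settings(vocoder_audio, make_spect_audio)
print(upsample_total == vocoder_audio["hop_size"], mismatches)
```

The only difference in the listed values is fmin (125 in hparams.py, 90 in make_spect.py). Whether the pretrained vocoder actually relies on the hparams.py value is not something these notes establish, so this is a point to verify rather than a confirmed problem.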
4. Running the code
4.1. conversion.py
- python conversion.py
4.2. vocoder.py
I do not understand the part underlined in red.
- python vocoder.py
It is rather slow; later I will try Griffin-Lim (GL) instead.
5. Completing the training code
5.1. Validation
5.1.1. main.py
5.1.2. dataloader.py
For now, keep the 2 s cropping scheme and do not introduce padding or sorting by length.
5.1.3. solve_encoder.py
Fix the way the dataloader is used; change it to:
5.2. TensorBoard
- pip install tensorboardX
5.2.1 solve_encoder.py
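5.3. VCTK processing
5.3.0 Removing silence with a self-adjusting noise threshold
Thanks to my colleague.
In comparison, the results are very good:
- Silent segments are removed
- The mouth noises at the start of an utterance are preserved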
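The colleague's implementation is not shown in these notes, so the following is only a hypothetical sketch of what "self-adjusting" silence removal could look like: the noise floor is estimated per file as a low percentile of the frame energies, and leading and trailing frames that stay within a margin of that floor are cut. All names, the percentile, and the margin are assumptions chosen for illustration.

```python
import numpy as np


def trim_silence(wav, frame_len=1024, hop=256, floor_percentile=10, margin_db=15.0):
    """Hypothetical adaptive trim: keep the span between the first and last
    frame whose RMS level exceeds (estimated noise floor + margin_db)."""
    n_frames = 1 + max(0, len(wav) - frame_len) // hop
    rms_db = np.empty(n_frames)
    for i in range(n_frames):
        frame = wav[i * hop:i * hop + frame_len]
        rms_db[i] = 20 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-10)
    # The threshold adapts to each file: floor is taken from the quietest frames.
    threshold = np.percentile(rms_db, floor_percentile) + margin_db
    active = np.where(rms_db > threshold)[0]
    if len(active) == 0:
        return wav
    start = active[0] * hop
    end = min(len(wav), active[-1] * hop + frame_len)
    return wav[start:end]


# Synthetic check: 0.5 s of faint noise, 1 s of tone, 0.5 s of faint noise.
sr = 16000
rng = np.random.RandomState(0)
noise = 1e-3 * rng.randn(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
wav = np.concatenate([noise, tone, 1e-3 * rng.randn(sr // 2)])

trimmed = trim_silence(wav)
print(len(wav), len(trimmed))
```

Because only the leading and trailing quiet frames are cut, anything between the first and last active frame, including short low-level sounds inside the utterance, is left untouched.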
5.3.1. Mel-spectrograms
Convert all audio to mel-spectrograms.
5.3.2. metadata
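Rewrote this part: paths are stored in txt files alongside npy arrays, since I am not used to working with pkl.
This is also easier to inspect.
As in Rayhane's layout, there is one main meta file.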
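The exact format of the rewritten metadata is not given here, so the sketch below only illustrates the general idea under assumed conventions: one pipe-separated line per utterance in a main meta.txt (speaker, relative path of the mel .npy file, number of frames), with the arrays themselves saved as individual .npy files. The field order, file names, and helper names are hypothetical.

```python
import os
import tempfile

import numpy as np


def write_meta(meta_path, rows):
    """rows: iterable of (speaker, relative_mel_path, n_frames)."""
    with open(meta_path, "w", encoding="utf-8") as f:
        for speaker, mel_path, n_frames in rows:
            f.write("{}|{}|{}\n".format(speaker, mel_path, n_frames))


def read_meta(meta_path):
    rows = []
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            speaker, mel_path, n_frames = line.rstrip("\n").split("|")
            rows.append((speaker, mel_path, int(n_frames)))
    return rows


with tempfile.TemporaryDirectory() as root:
    # Illustrative speaker folder and file name.
    mel = np.zeros((130, 80), dtype=np.float32)
    rel_path = os.path.join("p225", "p225_001.npy")
    os.makedirs(os.path.join(root, "p225"))
    np.save(os.path.join(root, rel_path), mel)

    meta_path = os.path.join(root, "meta.txt")
    write_meta(meta_path, [("p225", rel_path, mel.shape[0])])

    rows = read_meta(meta_path)
    loaded = np.load(os.path.join(root, rows[0][1]))

print(rows, loaded.shape)
```

A text file like this can be opened, grepped, and filtered by length without loading any arrays, which is the inspection advantage mentioned above.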
5.3.3. Modifying the dataloader
Because the metadata generation changed, the dataloader needs small adjustments -> data_loader_hujk17.py
I wrote a fairly simple one.
6. Train
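6.1 Hyperparameters
Reasons:
- The speaker embedding dim can only be 256
- My colleague uses a speaker dim of 128 with neck and freq both 16; I double all of them, so neck and freq are 32 and 32
- The paper also uses 32 and 32, and so does the pretrained model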
6.2. Data usage
VCTK has 106 speakers in total here, split into seen and unseen; the seen speakers are further split into train and val.
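The split sizes are not stated, so the sketch below only illustrates the two-level structure described above: speakers are first divided into seen and unseen, and then the utterances of each seen speaker are divided into train and val. The speaker IDs, the number of unseen speakers (6), and the validation ratio (10%) are placeholders, not the values used in this reproduction.

```python
import random


def split_speakers(speakers, n_unseen, seed=0):
    """Speaker-level split: returns (seen, unseen), both sorted."""
    rng = random.Random(seed)
    order = sorted(speakers)
    rng.shuffle(order)
    return sorted(order[n_unseen:]), sorted(order[:n_unseen])


def split_utterances(utterances, val_ratio=0.1, seed=0):
    """Utterance-level split within one seen speaker: returns (train, val)."""
    rng = random.Random(seed)
    order = sorted(utterances)
    rng.shuffle(order)
    n_val = max(1, int(len(order) * val_ratio))
    return sorted(order[n_val:]), sorted(order[:n_val])


# Placeholder IDs: 106 speakers, 50 utterances for one of them.
speakers = ["spk{:03d}".format(i) for i in range(106)]
seen, unseen = split_speakers(speakers, n_unseen=6)
train, val = split_utterances(["utt{:03d}".format(i) for i in range(50)])
print(len(seen), len(unseen), len(train), len(val))
```

Fixing the seed keeps the unseen set stable across runs, which matters because zero-shot results are only meaningful if those speakers never appear in training.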
7. Inference
7.1 conversion.py
The cross-conversion version (every utterance to every speaker).
7.2. inference.py
Write an A+B-mode script modeled on my colleague's.
7.3. make_spect.py
The output of my make_spect can be used directly for VC. From AutoVC_hujk17/full_106_spmel_nosli:
pick any two lines.