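0. Notes

  1. Google-ZYX means the model includes a VAE.
  2. Phoneme-HCSI means Chinese uses the lab's (HCSI) G2P and prosody labels; English uses KbGit, with spaces replaced.
  3. DBMIX means the bilingual/mixed-lingual corpus was purchased from Databaker (标贝).
  4. To stay consistent with the 'Chunchun' (春春) corpus, only 10,000 Chinese and 2,000 English sentences are used for now.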

1. Code changes

1.1. Copy the previous project

Source Git repo: https://github.com/ruclion/Fantasy_Mix-Lingual_Tacotron_Version_2_Google-ZYX

Project directory: /ceph/home/hujk17/Fantasy_Mix-Lingual_Tacotron_Version_4_Google-ZYX-Phoneme-HCSI-DBMIX

Git repo for this project: https://github.com/ruclion/Fantasy_Mix-Lingual_Tacotron_Version_4_Google-ZYX-Phoneme-HCSI-DBMIX

1.2. G2P_CN_HCSI: producing the Chinese phoneme text

1.2.1. Processing the Databaker Chinese data

The main script is databaker_G2P.py. One copy lives in this project: /ceph/home/hujk17/Fantasy_Mix-Lingual_Tacotron_Version_4_Google-ZYX-Phoneme-HCSI-DBMIX/G2P_CN_HCSI

Another copy is at: /ceph/home/hujk17/G2P_CN_HCSI




1.2.2. Chinese pinyin, symbols, and prosody

Symbols are separated by '_'. pinyin_G2P_2.py decomposes pinyin into symbols (initials and finals).
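
To make the decomposition concrete, here is a minimal sketch of splitting tone-numbered pinyin into an initial and a final and joining the symbols with '_'. It is not the code in pinyin_G2P_2.py: the function names, the initials table, and the choice to treat 'y' and 'w' as initials are all assumptions made for illustration.

```python
# Hypothetical sketch; this is NOT the actual pinyin_G2P_2.py.
# Two-letter initials come first so that 'zh' wins over 'z'.
# Treating 'y' and 'w' as initials is a simplifying assumption.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_pinyin(syllable):
    """Split a syllable such as 'zhong1' into ['zh', 'ong1']."""
    tone = ""
    if syllable and syllable[-1].isdigit():
        syllable, tone = syllable[:-1], syllable[-1]
    for initial in INITIALS:
        # Require a non-empty final after the initial.
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):] + tone]
    # Zero-initial syllable such as 'an4': the whole syllable is the final.
    return [syllable + tone]


def pinyin_to_symbols(pinyin_line):
    """Convert space-separated pinyin into a '_'-separated symbol string."""
    symbols = []
    for syllable in pinyin_line.split():
        symbols.extend(split_pinyin(syllable))
    return "_".join(symbols)
```

With this sketch the tone digit stays on the final, so initials end up with no trailing digit, which matches the tone handling discussed in 1.2.3.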


1.2.3. Attaching tone embeddings to symbols

Reference 1 (Alibaba, 2020):

  • Instead of using a unified phone set across languages, we combine English and Mandarin phone sets together as a whole. For English utterances, we use 44 British English phoneme symbols plus 3 possible stress symbols. For Mandarin utterances, we use 62 Pinyin initials and finals plus 5 possible tones. The tone or stress symbols are attached to the corresponding phoneme symbols. We also use symbols to indicate in-utterance pauses and utterance ends.
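  • Interpretation [1]: Chinese tones are 1, 2, 3, 4, 5, 6; English stress is 7, 8, 9 (0, 1, 2 plus 7). Initials and consonants take the tag of the final or vowel.
  • Interpretation [2]: Chinese tones are 1, 2, 3, 4, 5, 6; English stress is 7, 8, 9 (0, 1, 2 plus 7). Initials, which carry no tone, use 10 as a placeholder, and consonants, which carry no stress, use 11.
  • Interpretation [3]: 0 means "none". For tone, 0 covers initials and all English symbols; for stress, 0 covers consonants and all Chinese symbols.
  • That said, [1] feels simple to me, and it keeps the categories separate: different feature categories describe different kinds of information, so overlap between them is not a problem. (Chuan-ge once shared a BC paper on this; TODO: add it here...)

Reference 2 (Google, 2019):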

  • [Not yet understood, TODO...] Characters/Graphemes: Embeddings corresponding to each character or grapheme are the default inputs for end-to-end TTS models [2, 20, 23], requiring the model to implicitly learn how to pronounce input words (i.e. grapheme-to-phoneme conversion [26]) as part of the synthesis task. Extending a grapheme-based input vocabulary to a multilingual setting is straightforward, by simply concatenating grapheme sets in the training corpus for each language. This can grow quickly for languages with large alphabets, e.g. our Mandarin vocabulary contains over 4.5k tokens. We simply concatenate all graphemes appearing in the training corpus, leading to a total of 4,619 tokens. Equivalent graphemes are shared across languages. During inference all previously unseen characters are mapped to a special out-of-vocabulary (OOV) symbol.
  • Phonemes: Using phoneme inputs simplifies the TTS task, as the model no longer needs to learn complicated pronunciation rules for languages such as English. Similar to our grapheme-based model, equivalent phonemes are shared across languages. We concatenate all possible phoneme symbols, for a total of 88 tokens. To support Mandarin, we include tone information by learning phoneme-independent embeddings for each of the 4 possible tones, and broadcast each tone embedding to all phoneme embeddings inside the corresponding syllable. For English and Spanish, tone embeddings are replaced by stress embeddings which include primary and secondary stresses. A special symbol is used when there is no tone or stress.
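  • Interpretation [4]: Chinese tones are 1, 2, 3, 4; English stress is 5, 6 (1, 2 plus 4); 0 means neutral tone or no stress. All phonemes in a syllable share one tag, which keeps the syllable internally consistent. IPA is used, 88 symbols in total.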
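
The "broadcast each tone embedding to all phoneme embeddings inside the corresponding syllable" idea can be sketched at the ID level as follows. This only reflects my reading in [4]; the function name and the (phonemes, tone) input format are assumptions, and the paper's actual data pipeline is not shown here.

```python
def broadcast_tones(syllables):
    """Give every phoneme in a syllable the tone/stress ID of that syllable.

    `syllables` is a list of (phoneme_list, tone_id) pairs; this input
    format is a hypothetical choice for the sketch.
    """
    phonemes, tones = [], []
    for syllable_phonemes, tone_id in syllables:
        phonemes.extend(syllable_phonemes)
        # One tone ID per phoneme, so both sequences stay aligned.
        tones.extend([tone_id] * len(syllable_phonemes))
    return phonemes, tones
```

The two returned lists have the same length, so a phoneme embedding and a tone embedding could be looked up per position and combined. Note that this needs syllable boundaries, which is exactly what is hard to get for English consonants.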

Reference 3 (CUHK, 2020):

  • English words and Chinese characters are transcribed as phonemes as input with stress and tonal information respectively.

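For now I adopt interpretation [2]: it is simple, I know how to implement it, and it is mostly reasonable. The things I do not know how to handle are: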

  1. "I have $250 in my pocket.", # number -> spell-out
    ['AY1', ' ', 'HH', 'AE1', 'V', ' ', 'T', 'UW1', ' ', 'HH', 'AH1', 'N', 'D', 'R', 'AH0', 'D', ' ', 'F', 'IH1', 'F', 'T', 'IY0', ' ', 'D', 'AA1', 'L', 'ER0', 'Z', ' ', 'IH0', 'N', ' ', 'M', 'AY1', ' ', 'P', 'AA1', 'K', 'AH0', 'T', ' ', '.']
  2. I do not know how to assign the consonants above to syllables; that requires some expert knowledge.

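Current scheme: a tone digit attaches to the symbol that precedes it, and a stress digit attaches to the symbol that precedes it. A symbol with no trailing digit gets 10 (Chinese) or 11 (English).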
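
A minimal sketch of this per-symbol rule, assuming symbols look like 'ong1' or 'AY1' with at most one trailing digit. The function name and the 'cn' / 'en' language flag are hypothetical; the constants 10, 11, and the offset 7 come from interpretation [2].

```python
import re

NO_TONE_ID = 10     # Chinese symbol without a tone digit (e.g. an initial)
NO_STRESS_ID = 11   # English symbol without a stress digit (e.g. a consonant)
STRESS_OFFSET = 7   # English stress 0, 1, 2 becomes 7, 8, 9

_TRAILING_DIGIT = re.compile(r"^(.*?)(\d)$")


def split_symbol(symbol, lang):
    """Return (base_symbol, tone_or_stress_id) for one symbol.

    `lang` is 'cn' or 'en'; this flag is an assumption of the sketch.
    """
    match = _TRAILING_DIGIT.match(symbol)
    if match is None:
        # No trailing digit: use the language-specific placeholder.
        return symbol, (NO_TONE_ID if lang == "cn" else NO_STRESS_ID)
    base, digit = match.group(1), int(match.group(2))
    if lang == "cn":
        return base, digit
    return base, digit + STRESS_OFFSET
```

For example, the first symbols of the sentence above, ['AY1', ' ', 'HH', 'AE1', 'V'], map to ('AY', 8), (' ', 11), ('HH', 11), ('AE', 8), ('V', 11). No syllable boundaries are needed, which is why this scheme sidesteps problem 2.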

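1.3. G2P_EN_Kb: producing the English phoneme text

The data comes with its own phonemes, but I do not understand the HCSI rules, so the English sentences are passed through G2P_EN_Kb instead.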



1.4. preprocess.py

1.4.1. Things I am unsure about

# M-AILABS (and other datasets) trim params (these parameters are usually correct for any data, but definitely must be tuned for specific speakers)
trim_fft_size = 512,
trim_hop_size = 128,
trim_top_db = 63,

These values come from the Chunchun setup, not from DB (Databaker), so they may cause problems. I have not dealt with this yet.
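
To show what the three parameters control, here is a simplified, pure-Python stand-in for energy-based silence trimming. It is not the project's trimming code; I assume the hparams feed something like librosa.effects.trim (which takes top_db, frame_length, and hop_length), and this sketch only imitates the idea with non-centered frames and RMS energy.

```python
import math


def trim_silence(samples, top_db=63.0, frame_length=512, hop_length=128):
    """Return (start, end) sample indices of the non-silent region.

    A frame counts as silent when its RMS is more than `top_db` dB below
    the loudest frame. Simplified sketch, not librosa's implementation.
    """
    if not samples:
        return 0, 0
    rms = []
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    peak = max(rms)
    if peak == 0.0:
        return 0, 0  # all-zero signal: nothing to keep
    threshold = peak * 10.0 ** (-top_db / 20.0)
    loud = [i for i, value in enumerate(rms) if value > threshold]
    start = loud[0] * hop_length
    end = min(len(samples), loud[-1] * hop_length + frame_length)
    return start, end
```

A larger top_db keeps quieter material at the edges, so a value tuned on one speaker's recordings can cut into soft onsets or leave noise on another corpus. That is the risk with reusing the Chunchun values on DB.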

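1.4.2. Minor path adjustments in the code

Omitted; the paths just need to be rewritten.

1.4.3. Exceptions

Some wav files are malformed, which made preprocess exit for no apparent reason. Catching all exceptions fixes this.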
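
A sketch of the catch-all pattern, assuming a per-utterance function that may raise on a bad wav. The names process_all and process_one are hypothetical and do not refer to functions in preprocess.py.

```python
import logging

logger = logging.getLogger("preprocess")


def process_all(items, process_one):
    """Run `process_one` on every item, skipping items that raise.

    Returns (results, failed_items) so the bad files can be inspected later.
    """
    results, failed = [], []
    for item in items:
        try:
            results.append(process_one(item))
        except Exception as exc:  # deliberately broad: any bad wav is skipped
            logger.warning("skipping %s: %r", item, exc)
            failed.append(item)
    return results, failed
```

Logging the skipped files rather than silently dropping them makes it possible to check afterwards how many utterances were lost.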

1.4.4. Results

1.4. preprocess.py

1.4.1. Building symbols from the EN phoneme file

Path: /ceph/home/hujk17/Fantasy_Mix-Lingual_Tacotron_Version_4_Google-ZYX-Phoneme-HCSI-DBMIX/G2P_EN_Kb/databaker_MIX_Phoneme/DBMIX_EN_symbolsList_symbol_split.csv.txt

1.4.2. Building symbols from the CN phoneme file

Path: /ceph/home/hujk17/Fantasy_Mix-Lingual_Tacotron_Version_4_Google-ZYX-Phoneme-HCSI-DBMIX/G2P_CN_HCSI/databaker_MIX_Phoneme/DBMIX_CN_symbolsList_symbol_split.csv.txt

1.4.3. Merging


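  • English: the 0, 1, 2 that I forgot to convert earlier become 7, 8, 9; symbols with no trailing stress digit (consonants and prosody symbols) get 11 appended.
  • Chinese: symbols with no trailing tone digit (initials and prosody symbols) get 10 appended.
  • With this, text_to_sequence_MIX_Phoneme_Version in tacotron no longer needs to distinguish Chinese symbols from English ones.
  • Some Chinese sentences contain English words; ignored for now. TODO...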
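
Once every symbol carries a trailing ID from 1 to 11, the downstream parser can be language-agnostic. The sketch below illustrates that point; it is not text_to_sequence_MIX_Phoneme_Version, and both the function name and the token format (symbol immediately followed by its ID, e.g. 'zh10' or 'AE8') are assumptions.

```python
import re

# Symbol text followed by a one- or two-digit tone/stress ID.
_TOKEN = re.compile(r"^(.+?)(\d{1,2})$")


def tokens_to_ids(tokens):
    """Split normalized tokens into parallel symbol and tone/stress lists."""
    symbols, tone_stress = [], []
    for token in tokens:
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"token without tone/stress ID: {token!r}")
        symbols.append(match.group(1))
        tone_stress.append(int(match.group(2)))
    return symbols, tone_stress
```

The same call handles Chinese, English, and code-switched input, since the ID range itself (1 to 6 and 10 for Chinese, 7 to 9 and 11 for English) encodes the language-specific information.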

TODO: sync this to Git.





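1.5. Changes to Feeder, Train, and Tacotron

Fairly tedious but routine; details omitted.

2. Training

Training is going well so far. I am not yet recording the step at which the alignment converged; moving on for now.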

3. Synthesis

  1. When I found out about her death I was shocked, but not surprised, she said.
  2. The latter serve as a worm aphrodisiac, getting the hermaphroditic worms to breed more often.
  3. Artistic gymnastics, rhythmic gymnastics, trampoline, weightlifting, handball.
  4. 那些庄稼田园在果果眼里感觉太亲切了
  5. 她把鞋子拎在手上光着脚丫故意踩在水洼里
  6. 我为男主角感到有点遗憾
  7. When I found out about her death 她把鞋子拎在手上 I was shocked, but not surprised, she said.
  8. 她把鞋子拎在手上 When I found out about her death 光着脚丫故意踩在水洼里

