Notes on Using nematus, the University of Edinburgh's Neural Machine Translation System

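*Blog post: http://blog.csdn.net/wangxinginnlp/article/details/64921476

*There is no step-by-step instruction and the code left me somewhat confused, so I wrote these notes for later reference. [Note: everything marked in red is a pitfall.]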

Code:

nematus https://github.com/rsennrich/nematus

subword-nmt https://github.com/rsennrich/subword-nmt

Data:

nematus ships with a 1,000-sentence-pair English-German (En-De) parallel corpus.

Environment:

The nematus README lists the following requirements:

Nematus requires the following packages:

Python >= 2.7
numpy
Theano >= 0.7 (and its dependencies).

we recommend executing the following command in a Python virtual environment: pip install numpy numexpr cython tables theano

the following packages are optional, but highly recommended

CUDA >= 7 (only GPU training is sufficiently fast)
cuDNN >= 4 (speeds up training substantially)

you can run Nematus locally. To install it, execute python setup.py install

Open questions:

1. Why are there multiple source dictionaries? Is it to support Linguistic Input Features, with one dictionary per feature?

nematus accepts multiple source dictionaries.

The code in nmt.py that receives the source dictionaries:

train = TextIterator(datasets[0], datasets[1],
                     dictionaries[:-1], dictionaries[-1],
                     ....)

TextIterator in data_iterator.py contains:

self.source_dicts = []
for source_dict in source_dicts:
    self.source_dicts.append(load_dict(source_dict))

That is indeed the case; I simply had not read the README carefully.

--dictionaries PATH [PATH ...]

network vocabularies (one per source factor, plus target vocabulary)

*The Chinese side of my Chinese-English training corpus contained the "|" character, and training crashed outright.

w = [self.source_dicts[i][f] if f in self.source_dicts[i] else 1 for (i,f) in enumerate(w.split('|'))]

Here w is an input unit (a word by default), and the Linguistic Input Features of one unit are separated by "|".
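The lookup above can be tried out in isolation. The following is a minimal sketch, not nematus code: the two toy dictionaries, the helper name unit_to_ids and the sample units are made up for illustration, and only the list comprehension mirrors the quoted line. It also shows one plausible way a stray "|" in the corpus leads to a crash: the unit splits into more parts than there are dictionaries.

```python
# Toy stand-ins for the loaded source dictionaries (hypothetical values):
# one dictionary per source factor, id 1 is used for unknown units.
source_dicts = [
    {"eos": 0, "UNK": 1, "the": 2, "house": 3},   # factor 0: word
    {"eos": 0, "UNK": 1, "DT": 2, "NN": 3},       # factor 1: POS tag
]

def unit_to_ids(w, dicts):
    """Map one input unit such as 'house|NN' to one id per factor."""
    return [dicts[i][f] if f in dicts[i] else 1
            for (i, f) in enumerate(w.split('|'))]

ids_ok = unit_to_ids("house|NN", source_dicts)     # both factors are known
ids_unk = unit_to_ids("garden|NN", source_dicts)   # unknown word maps to 1

# With a single dictionary, a literal "|" in the text produces more parts
# than there are dictionaries, so the lookup in dicts[1] fails.
try:
    unit_to_ids("a|b", source_dicts[:1])
    crashed = False
except IndexError:
    crashed = True

print(ids_ok, ids_unk, crashed)
```

If the corpus is not meant to carry factors, replacing or escaping "|" in the raw text before training should avoid this failure.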

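Step 1: Unpack the files

**The folders nematus-master and subword-nmt-master are in the same directory.

Step 2: Data processing and vocabulary generation.

**Use the preprocess.sh script that ships with nematus [the script seems to have a problem] to process the parallel corpus.

The shipped preprocess.sh is: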

#!/bin/bash

P=$1

# source language (example: fr)
S=$2
# target language (example: en)
T=$3

# path to nematus/data
P1=$4

# path to subword NMT scripts (can be downloaded from https://github.com/rsennrich/subword-nmt)
P2=$5

# tokenize
perl $P1/tokenizer.perl -threads 5 -l $S < {P}.${S} > {P}.${S}.tok
perl $P1/tokenizer.perl -threads 5 -l $T < {P}.${T} > {P}.${T}.tok

# learn BPE on joint vocabulary:
cat {P}.${S}.tok {P}.${T}.tok | python $P2/learn_bpe.py -s 20000 > ${S}${T}.bpe

python $P2/apply_bpe.py -c ${S}${T}.bpe < {P}.${S}.tok > {P}.${S}.tok.bpe
python $P2/apply_bpe.py -c ${S}${T}.bpe < {P}.${T}.tok > {P}.${T}.tok.bpe

# build dictionary
python $P1/build_dictionary.py {P}.${S}.tok.bpe
python $P1/build_dictionary.py {P}.${T}.tok.bpe

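However, P and P1 appear to be used the wrong way round in the script above. The file paths are also written as {P} without a leading $, so the variable is never expanded.

It can be rewritten by hand as follows: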

#!/bin/bash

# directory containing the preprocessing scripts (the current directory)
P=$1

# source language identifier (example: fr)
S=$2

# target language identifier (example: en)
T=$3

# data location, used below as the corpus path prefix (original comment: path to nematus/data)
P1=$4

# path to subword NMT scripts (can be downloaded from https://github.com/rsennrich/subword-nmt)
P2=$5

# tokenize both sides of the parallel corpus
perl $P/tokenizer.perl -threads 5 -l $S < ${P1}.${S} > ${P1}.${S}.tok
perl $P/tokenizer.perl -threads 5 -l $T < ${P1}.${T} > ${P1}.${T}.tok

# learn BPE on joint vocabulary:
cat ${P1}.${S}.tok ${P1}.${T}.tok | python $P2/learn_bpe.py -s 20000 > ${S}${T}.bpe

# apply the learned joint BPE codes to both sides of the corpus
python $P2/apply_bpe.py -c ${S}${T}.bpe < ${P1}.${S}.tok > ${P1}.${S}.tok.bpe
python $P2/apply_bpe.py -c ${S}${T}.bpe < ${P1}.${T}.tok > ${P1}.${T}.tok.bpe

# build a dictionary for each side of the subword-segmented corpus
python $P/build_dictionary.py ${P1}.${S}.tok.bpe
python $P/build_dictionary.py ${P1}.${T}.tok.bpe

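Run the command: ./preprocess.sh ./ en de ../test/data/corpus ../../subword-nmt-master/

Files in the data directory nematus-master/test/data before running the command (shipped with the system): (screenshot in the original post)

While the command is running: (screenshot in the original post)

Files in the data directory nematus-master/test/data after the command has finished: (screenshot in the original post)

corpus.en is the corpus shipped with the system.

corpus.en.tok is the corpus after tokenization (perl $P/tokenizer.perl -threads 5 -l $S < ${P1}.${S} > ${P1}.${S}.tok) [the shipped corpus is in fact already tokenized].

corpus.en.tok.bpe is the corpus after subword processing.

corpus.en.tok.bpe.json is the English (en) vocabulary in JSON format.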
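To make the content of such a JSON vocabulary concrete, here is a minimal sketch of a frequency-sorted vocabulary builder. It is not the nematus build_dictionary.py script: the function name build_vocab and the toy corpus are made up, and the reserved entries "eos" = 0 and "UNK" = 1 are an assumption chosen to agree with the `else 1` fallback in the lookup quoted earlier.

```python
import json
from collections import Counter

def build_vocab(lines):
    """Return a token -> id mapping, most frequent tokens first.

    Ids 0 and 1 are reserved (assumption: 0 = end of sentence,
    1 = unknown token)."""
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {"eos": 0, "UNK": 1}
    # Sort by descending frequency, then alphabetically for stable output.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if tok not in vocab:          # never overwrite a reserved entry
            vocab[tok] = len(vocab)
    return vocab

# Toy BPE-segmented text; "@@" marks a subword that continues the word.
toy_corpus = ["the hou@@ se", "the gar@@ den"]
vocab = build_vocab(toy_corpus)
print(json.dumps(vocab, ensure_ascii=False))
```

Each subword unit gets its own entry, so the vocabulary is built from the .tok.bpe files rather than from the raw text.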

nematus supports two storage formats for the vocabulary. The function in nematus/util.py that reads a dictionary is:

def load_dict(filename):
    try:
        with open(filename, 'rb') as f:
            return unicode_to_utf8(json.load(f))
    except:
        with open(filename, 'rb') as f:
            return pkl.load(f)

Dictionary formats: 1. JSON format [generated by default by the scripts shipped with nematus]

2. pkl format
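The "try JSON first, fall back to pickle" behaviour can be checked with a small self-contained sketch. This is a Python 3 illustration, not the nematus function: load_vocab, the toy vocabulary and the temporary file names are all made up for the demo.

```python
import json
import os
import pickle
import tempfile

def load_vocab(filename):
    """Load a vocabulary stored as JSON, falling back to pickle."""
    try:
        with open(filename, "r", encoding="utf-8") as f:
            return json.load(f)
    except ValueError:
        # Both JSON parse errors and UTF-8 decode errors are ValueErrors,
        # so a binary pickle file ends up here.
        with open(filename, "rb") as f:
            return pickle.load(f)

toy_vocab = {"eos": 0, "UNK": 1, "the": 2}
tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "vocab.json")
pkl_path = os.path.join(tmpdir, "vocab.pkl")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(toy_vocab, f)
with open(pkl_path, "wb") as f:
    pickle.dump(toy_vocab, f)

from_json = load_vocab(json_path)
from_pkl = load_vocab(pkl_path)
print(from_json == from_pkl == toy_vocab)
```

Catching only ValueError, rather than using a bare except as in the quoted function, keeps unrelated problems such as a missing file visible.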

Step 3: Model training

The test_train.sh script shipped in nematus-master/test:

#!/bin/bash

# warning: this test is useful to check if training fails, and what speed you can achieve
# the toy datasets are too small to obtain useful translation results,
# and hyperparameters are chosen for speed, not for quality.
# For a setup that preprocesses and trains a larger data set,
# check https://github.com/rsennrich/wmt16-scripts/tree/master/sample

mkdir -p models

../nematus/nmt.py \
  --model models/model.npz \
  --datasets data/corpus.en data/corpus.de \
  --dictionaries data/vocab.en.json data/vocab.de.json \
  --dim_word 256 \
  --dim 512 \
  --n_words_src 30000 \
  --n_words 30000 \
  --maxlen 50 \
  --optimizer adam \
  --lrate 0.0001 \
  --batch_size 40 \
  --no_shuffle \
  --dispFreq 500 \
  --finish_after 500

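The parameters are documented in README.md of nematus-master. The most important ones are --datasets and --dictionaries, which point to the parallel training corpus and to the two vocabularies, respectively.

Because the shipped corpus has only 1,000 sentences and training is capped at 500 iterations (the --finish_after parameter), training finishes quickly.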
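A rough back-of-the-envelope calculation shows how little training this is. The sketch below assumes that --finish_after counts parameter updates (minibatches) and that no sentence is dropped by the --maxlen filter, so the real number of batches per pass may be slightly lower.

```python
import math

corpus_size = 1000     # sentence pairs in the toy corpus
batch_size = 40        # --batch_size
finish_after = 500     # --finish_after (assumed to count updates)

# One pass over the corpus takes this many updates.
updates_per_epoch = math.ceil(corpus_size / batch_size)
# Number of passes over the corpus before training stops.
epochs = finish_after / updates_per_epoch
print(updates_per_epoch, epochs)
```

Under these assumptions the run makes about 20 passes over 1,000 sentence pairs.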

After training, the model is saved at the path given by --model. The files generated under that path: (screenshot in the original post)

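Step 4: Model testing

test_translate.py downloads the WMT16 En->De and En->Ro models trained by Sennrich and colleagues. If you do not want to download them, you have to modify the code by hand.

1. Run it directly and let it download the models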

THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu python test_translate.py

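**The En->De model is 615MB, so the download takes a while.

For the En->De model files, see: http://data.statmt.org/rsennrich/wmt16_systems/de-en/

2. Change the configuration and run the model you have just trained. Comment out the model-download code: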

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import os
import unittest
import requests

sys.path.append(os.path.abspath('../nematus'))
from translate import main as translate


def load_wmt16_model(src, target):
    path = os.path.join('models', '{0}-{1}'.format(src,target))
    try:
        os.makedirs(path)
    except OSError:
        pass
    for filename in ['model.npz', 'model.npz.json', 'vocab.{0}.json'.format(src), 'vocab.{0}.json'.format(target)]:
        if not os.path.exists(os.path.join(path, filename)):
            r = requests.get('http://data.statmt.org/rsennrich/wmt16_systems/{0}-{1}/'.format(src,target) + filename, stream=True)
            with open(os.path.join(path, filename), 'wb') as f:
                for chunk in r.iter_content(1024**2):
                    f.write(chunk)


class TestTranslate(unittest.TestCase):
    """
    Regression tests for translation with WMT16 models
    """

    '''
    def setUp(self):
        """
        Download pre-trained models
        """
        #print '-------------'
        load_wmt16_model('en','de')
        load_wmt16_model('en','ro')
    '''

    def outputEqual(self, output1, output2):
        """given two translation outputs, check that output string is identical,
        and probabilities are equal within rounding error.
        """
        for i, (line, line2) in enumerate(zip(open(output1).readlines(), open(output2).readlines())):
            if not i % 2:
                self.assertEqual(line, line2)
            else:
                probs = map(float, line.split())
                # read the probabilities of the second output from line2
                probs2 = map(float, line2.split())
                for p, p2 in zip(probs, probs2):
                    self.assertAlmostEqual(p, p2, 5)

    # English-German WMT16 system, no dropout
    def test_ende(self):
        os.chdir('models/en-de/')
        translate(['model.npz'], open('../../en-de/in'), open('../../en-de/out','w'), k=12, normalize=True, n_process=1, suppress_unk=True, print_word_probabilities=True)
        os.chdir('../..')
        self.outputEqual('en-de/ref','en-de/out')

    '''
    # English-Romanian WMT16 system, dropout
    def test_enro(self):
        os.chdir('models/en-ro/')
        translate(['model.npz'], open('../../en-ro/in'), open('../../en-ro/out','w'), k=12, normalize=True, n_process=1, suppress_unk=True, print_word_probabilities=True)
        os.chdir('../..')
        self.outputEqual('en-ro/ref','en-ro/out')
    '''


if __name__ == '__main__':
    unittest.main()

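The code reads model.npz from the models/en-de/ directory, so create models/en-de/ and copy the freshly trained model.npz into it.

This raises an error: model.npz.json is missing. model.npz.json is the model's configuration file, while model.npz holds the model parameters.

Another error follows: the dictionary paths in model.npz.json are wrong. The model.npz.json saved in the previous step records the vocabulary files as relative paths. Because the working directory has changed, those files can no longer be found, so the "dictionaries" attribute has to be edited.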
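Instead of editing the file by hand, the paths can be rewritten with a few lines of Python. This is a minimal sketch under stated assumptions: the only thing taken from the notes is that the config is a JSON file with a "dictionaries" list. The helper name absolutize_dictionaries and the demo config contents are made up, and base_dir must be the directory from which training was originally launched.

```python
import json
import os
import tempfile

def absolutize_dictionaries(config_path, base_dir):
    """Rewrite the "dictionaries" entries of a model config as absolute paths.

    base_dir is the directory the relative paths were originally relative to
    (the directory training was launched from)."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["dictionaries"] = [
        p if os.path.isabs(p) else os.path.abspath(os.path.join(base_dir, p))
        for p in config["dictionaries"]
    ]
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config["dictionaries"]

# Demo on a throwaway config file with made-up contents.
workdir = tempfile.mkdtemp()
demo_config = os.path.join(workdir, "model.npz.json")
with open(demo_config, "w", encoding="utf-8") as f:
    json.dump({"dictionaries": ["data/vocab.en.json", "data/vocab.de.json"]}, f)

fixed = absolutize_dictionaries(demo_config, workdir)
print(fixed)
```

With absolute paths in the config, the model can be loaded regardless of the directory the test changes into.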

However, because the training corpus was so small, the model is poorly trained, and every decoded sentence consists entirely of ",".

Step 5: Restoring words from subwords

TBD
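Until this step is written up, the following placeholder sketch shows the usual idea. It assumes the default subword-nmt separator "@@", where a unit ending in "@@" continues into the next unit. The function name remove_bpe and the sample sentence are made up for illustration.

```python
import re

def remove_bpe(line):
    """Join subword units by deleting the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

restored = remove_bpe("the hou@@ se is gre@@ en")
print(restored)
```

The second alternative in the pattern handles a marker left dangling at the end of a line.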

Step 6: Evaluating translation quality

TBD
