NLP - sentencepiece

Contents

I. About sentencepiece
II. Installation
- 1. Python module
- 2. Build and install the SentencePiece command line tools from C++ source
- 3. Build and install using vcpkg
- 4. Download and install SentencePiece from signed released wheels
III. Command line usage
- 1. Train a model
- 2. Encode raw text into sentence pieces/ids
- 3. Decode sentence pieces/ids into raw text
- 4. End-to-End Example
- 5. Export vocabulary list
- 6. Redefine special meta tokens
- 7. Vocabulary restriction
IV. Python usage

I. About sentencepiece

github : https://github.com/google/sentencepiece
Paper: "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing": https://aclanthology.org/D18-2012.pdf

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.

SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences.

SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

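Put simply, a character sequence that recurs often enough is treated as a single unit.
Because pieces are learned from raw text, a piece can be coarser than the units a word segmenter would produce, for example in Chinese.
Training relies mainly on statistical measures, such as occurrence frequency and left/right connectivity, together with perplexity, to arrive at the final result.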
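
To make the frequency idea concrete, here is a toy sketch of the classic BPE merge loop: count adjacent symbol pairs and merge the most frequent one. The corpus and the helper names are hypothetical, and this is not SentencePiece's actual implementation, which trains on raw sentences and also offers a unigram language model.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by the pair itself so the result is deterministic.
    return max(pairs, key=lambda pair: (pairs[pair], pair))


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: word (as a tuple of characters) -> frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    merges.append(pair)

print(merges)
print(words)
```

Each round turns the most frequent pair into a new symbol, so frequent sequences such as "est" gradually become single pieces.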

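II. Installation

SentencePiece has two parts: training a model and using a model.
The training part is implemented in C++ and can be compiled into a binary executable; training produces a model file and a vocabulary file.
The trained model can be used either through the binary programs or from Python. The generated vocabulary file is plain text and editable, so it can be read and used from any language.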

1. Python module

SentencePiece provides a Python wrapper that supports both training and segmentation.
You can install the Python binary package with:

% pip install sentencepiece

For more detail, see Python module

2. Build and install the SentencePiece command line tools from C++ source

The following tools and libraries are required:

cmake
C++11 compiler
gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then you can build and install the command line tools as follows:

git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

On macOS, replace the last command with sudo update_dyld_shared_cache.

3. Build and install using vcpkg

vcpkg : https://github.com/Microsoft/vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

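The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors.
If the version is out of date, please create an issue on the vcpkg repository: https://github.com/Microsoft/vcpkg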

4. Download and install SentencePiece from signed released wheels

You can download the wheel from the GitHub releases page: https://github.com/google/sentencepiece/releases/latest

During the release process, SLSA3 signatures are generated with OpenSSF's slsa-framework/slsa-github-generator:
https://github.com/slsa-framework/slsa-github-generator

To verify a release binary:
1. Install the verification tool: https://github.com/slsa-framework/slsa-verifier#installation
2. Download the provenance file attestation.intoto.jsonl from https://github.com/google/sentencepiece/releases/latest
3. Run the verifier, then install the verified wheel:

slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>
pip install wheel_file.whl

III. Command line usage

1. Train a model

Training syntax:

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

For example:

spm_train --input='../corpus.txt' --model_prefix='../mypiece' --vocab_size=8000 --character_coverage=1 --model_type='bpe'
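
When training is launched from a script, the same command line can be assembled programmatically. The sketch below only builds the argument list from keyword arguments; the helper name build_spm_train_command is hypothetical, and the flag values are the ones from the example above.

```python
import shlex


def build_spm_train_command(**flags):
    """Turn keyword arguments into an spm_train argument list (--name=value)."""
    args = ["spm_train"]
    for name, value in flags.items():
        args.append(f"--{name}={value}")
    return args


command = build_spm_train_command(
    input="../corpus.txt",
    model_prefix="../mypiece",
    vocab_size=8000,
    character_coverage=1.0,
    model_type="bpe",
)
print(shlex.join(command))
# To actually run it (requires spm_train on PATH), pass the list to
# subprocess.run(command, check=True).
```

Keeping the arguments as a list avoids shell quoting problems when paths contain spaces.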

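Parameters

--input: the raw corpus file to train on; no tokenization, normalization, or other preprocessing is needed.
By default, SentencePiece normalizes the input with Unicode NFKC.
Multiple files can be passed as a comma-separated list.
--model_prefix: prefix of the output model name.
Two files are generated: <model_name>.model and <model_name>.vocab (the vocabulary).
--vocab_size: vocabulary size after training, e.g. 8000, 16000, or 32000.
Larger values make training slower. Very small values (< 4000) reportedly may fail to train, though this depends on the corpus: the end-to-end example in this article trains with --vocab_size=1000.
--character_coverage: the fraction of characters in the corpus covered by the model. Use 0.9995 for languages with large character sets such as Chinese or Japanese, and 1.0 for languages with small character sets.
--model_type: the model type. Choose from unigram (default), bpe, char, or word.

--max_sentence_length: maximum sentence length, default 4192. The length is measured in bytes, so a Chinese character, which usually takes 3 bytes in UTF-8, counts as 3 rather than 1.
--max_sentencepiece_length: maximum length of a sentence piece, default 16.
--seed_sentencepiece_size: size of the seed sentence piece set that training starts from, default 1000000.
--num_threads: number of threads, default 16.
--use_all_vocab: use all tokens as the vocabulary; only takes effect for word/char models.
--input_sentence_size: maximum number of sentences the trainer loads, default 0.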
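
The meaning of --character_coverage can be illustrated with a small sketch: keep the most frequent characters until their occurrences reach the requested fraction of the corpus. The function name and the sample text are hypothetical, and this is only an illustration of the idea, not the trainer's actual code.

```python
from collections import Counter


def chars_needed_for_coverage(text, coverage):
    """Return the most frequent characters whose occurrences cover `coverage` of the text."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        if covered >= coverage * total:
            break  # the requested fraction is already covered
        kept.append(ch)
        covered += n
    return kept


# Hypothetical corpus: one very common character and two rare ones.
text = "a" * 98 + "b" + "c"
print(chars_needed_for_coverage(text, 0.9))  # rare characters are dropped
print(chars_needed_for_coverage(text, 1.0))  # every character is kept
```

With a large character set, a coverage slightly below 1.0 keeps very rare characters out of the vocabulary, which is why 0.9995 is suggested for Chinese or Japanese.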

2. Encode raw text into sentence pieces/ids

spm_encode --model=<model_file> --output_format=piece < input > output
spm_encode --model=<model_file> --output_format=id < input > output

Use the --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

spm_encode --extra_options=eos (add </s> only)
spm_encode --extra_options=bos:eos (add <s> and </s>)
spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
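
The effect of these options on a piece sequence can be mimicked in a few lines. This is a sketch of the behavior described above, assuming the colon-separated options are applied from left to right; the function name and the sample pieces are hypothetical.

```python
def apply_extra_options(pieces, extra_options, bos="<s>", eos="</s>"):
    """Apply colon-separated options (reverse, bos, eos) to a list of pieces."""
    result = list(pieces)
    for option in extra_options.split(":"):
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, bos)
        elif option == "eos":
            result.append(eos)
        elif option:
            raise ValueError(f"unsupported option: {option}")
    return result


pieces = ["▁Hello", "▁world"]  # hypothetical segmentation
print(apply_extra_options(pieces, "eos"))
print(apply_extra_options(pieces, "bos:eos"))
print(apply_extra_options(pieces, "reverse:bos:eos"))
```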

SentencePiece supports nbest segmentation and segmentation sampling with the --output_format=(nbest|sample)_(piece|id) flags.

spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
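
As a rough illustration of what --alpha does when sampling from n-best candidates, the sketch below assumes the usual subword regularization formulation, in which candidate i is drawn with probability proportional to P(x_i)^alpha. The function name and the candidate log probabilities are hypothetical.

```python
import math
import random


def sampling_distribution(log_probs, alpha):
    """Normalize P(x_i)^alpha over the candidates, given their log probabilities."""
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log probabilities of three candidate segmentations.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
sharp = sampling_distribution(log_probs, alpha=1.0)  # follows the model scores
flat = sampling_distribution(log_probs, alpha=0.0)   # uniform over candidates
print(sharp, flat)

# Draw one candidate index according to the alpha=0.5 distribution.
weights = sampling_distribution(log_probs, alpha=0.5)
print(random.choices(range(len(log_probs)), weights=weights, k=1)[0])
```

Under this assumption, a smaller alpha flattens the distribution and yields more varied segmentations, while a larger alpha concentrates on the best one.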

3. Decode sentence pieces/ids into raw text

spm_decode --model=<model_file> --input_format=piece < input > output
spm_decode --model=<model_file> --input_format=id < input > output

Use the --extra_options flag to decode text that was encoded in reverse order.

spm_decode --extra_options=reverse < input > output

4. End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can find that the original input sentence is restored from the vocabulary id sequence.
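
The restoration works because whitespace is kept inside the pieces as the meta symbol ▁ (U+2581). A minimal sketch of detokenizing the piece sequence from the example above, without calling the library (the function name is hypothetical):

```python
def detokenize(pieces):
    """Concatenate pieces and turn the meta symbol U+2581 back into spaces."""
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")


# The piece sequence printed by spm_encode in the example above.
pieces = "▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .".split()
print(detokenize(pieces))
```

No language-specific rules are needed for this step, which is what makes the tokenization reversible.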

5. Export vocabulary list

spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
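
Since the exported file is plain text, it can be loaded without the library. The sketch below assumes each line holds a piece and its log probability separated by a tab, and that ids start at 0 on the first line (consistent with <unk> having id 0 by default). The function name, sample lines, and scores are hypothetical.

```python
import math


def parse_vocab(lines):
    """Build id->piece and piece->id tables plus log probabilities from vocab lines."""
    id_to_piece, piece_to_id, log_probs = [], {}, []
    for index, line in enumerate(lines):
        piece, score = line.rstrip("\n").split("\t")
        id_to_piece.append(piece)
        piece_to_id[piece] = index  # the id is the zero-based line number
        log_probs.append(float(score))
    return id_to_piece, piece_to_id, log_probs


# Hypothetical excerpt of an exported vocabulary file.
sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁the\t-3.2\n", "s\t-3.5\n"]
id_to_piece, piece_to_id, log_probs = parse_vocab(sample)
print(piece_to_id["▁the"], id_to_piece[4], math.exp(log_probs[3]))
```

For a real file, pass an open file object, for example open(path, encoding="utf-8"), instead of the sample list.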

6. Redefine special meta tokens

By default, SentencePiece uses the Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens, which have the ids 0, 1, and 2 respectively.
This mapping can be redefined at training time:

spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g. bos_id=-1, the corresponding special token is disabled. Note that the unknown id cannot be disabled. An id for padding (<pad>) can be defined with --pad_id=3.

If you want to assign ids to other special tokens, see Use custom symbols:

https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
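
The id rules above can be summarized in a small sketch that computes which special token ends up at which id. The function name is hypothetical; treating padding as disabled by default and rejecting duplicate ids are assumptions made for illustration, not a description of the trainer's internals.

```python
def special_token_layout(unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Map each enabled special token id to its token; -1 disables a token."""
    if unk_id < 0:
        raise ValueError("the unknown token cannot be disabled")
    layout = {}
    tokens = (("<unk>", unk_id), ("<s>", bos_id), ("</s>", eos_id), ("<pad>", pad_id))
    for token, token_id in tokens:
        if token_id == -1:
            continue  # this special token is disabled
        if token_id in layout:
            # Assumption: two special tokens may not share an id.
            raise ValueError(f"id {token_id} is used by both {layout[token_id]} and {token}")
        layout[token_id] = token
    return dict(sorted(layout.items()))


print(special_token_layout())                                # default ids
print(special_token_layout(bos_id=0, eos_id=1, unk_id=5))    # the spm_train example
print(special_token_layout(pad_id=3))                        # with padding enabled
```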

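7. Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option, so that it only produces symbols that also appear in the vocabulary (with at least some frequency). This technique is described on the subword-nmt page, and the usage is basically the same as in subword-nmt.

Assuming that L1 and L2 are the two languages (source/target), train a shared spm model and get the resulting vocabulary for each language: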

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
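
The threshold logic can be sketched as follows, assuming each line of the generated vocabulary file holds a piece and its frequency separated by whitespace, and that pieces with a frequency of at least the threshold are kept. The function names and the sample counts are hypothetical; how spm_encode re-segments a rejected piece is not modeled here.

```python
def load_restricted_vocab(lines, threshold):
    """Return the set of pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        piece, freq = parts
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed


def out_of_vocab(pieces, allowed):
    """List the pieces that the restriction would not allow as output symbols."""
    return [piece for piece in pieces if piece not in allowed]


# Hypothetical per-language vocabulary counts.
vocab_lines = ["▁the\t1200", "▁tele\t75", "scope\t12"]
allowed = load_restricted_vocab(vocab_lines, threshold=50)
print(sorted(allowed))
print(out_of_vocab(["▁the", "▁tele", "scope"], allowed))
```

Pieces that are rare in one language are thereby kept out of that language's segmented output, even though the model itself is shared.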

IV. Python usage

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/tmp/test.model")  # load a trained model file
text = "食材上不会有这样的纠结"
print(sp.EncodeAsPieces(text))

伊织 2022-11-01 (the weather turned cold for the second time)