Not long ago I released a Chinese character-segmentation tool (referred to here as version 1.0). It simply splits every character apart and joins them with spaces. When I used its output for an entity-classification task, the results were poor. A likely reason is the following.
A date processed by version 1.0 looks like this:
Original text: 2020年4月2日
After version 1.0: 2 0 2 0 年 4 月 2 日
Splitting the digits apart like this probably makes it hard for the model to recognize the string as a date.

Today I upgraded the tool from version 1.0 to version 2.0. Version 2.0 uses BERT's built-in WordPiece tokenizer to split text into single Chinese characters (English text is split into subwords). After version 2.0, the same date becomes:
2020 年 4 月 2 日
With this segmentation the model performs much better than it did on the version 1.0 output.
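To make the comparison concrete, here is a minimal sketch of how the date example can be reproduced with FullTokenizer from listing 2 below. The vocabulary path is a placeholder for wherever the chinese_L-12_H-768_A-12 model is unpacked, and the exact split of the digits depends on which number strings that vocabulary contains.

import tokenization  # the tokenization.py from listing 2, assumed to be on the import path

# Placeholder path; point it at your local pretrained-model directory.
VOCAB_FILE = "chinese_L-12_H-768_A-12/vocab.txt"

tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=True)
print(" ".join(tokenizer.tokenize("2020年4月2日")))
# If "2020" is in the vocabulary this prints: 2020 年 4 月 2 日
# Otherwise the digits fall back to WordPiece sub-tokens such as 20 ##20.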

The source code follows.
1. bert_token.py

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), r"D:\work\Entity-Relation-Extraction-my\pretrained_model\chinese_L-12_H-768_A-12")))

import tokenization

'''
Split text into single characters (subwords for English) with BERT's built-in WordPiece tokenizer.
@Author: 西兰
@Date: 2020-4-1
'''


class Model_data_preparation(object):

    def __init__(self, RAW_DATA_INPUT_DIR="data", DATA_OUTPUT_DIR="my_data",
                 vocab_file_path="vocab.txt", do_lower_case=True):
        '''
        :param RAW_DATA_INPUT_DIR: input directory, usually the raw-data directory
        :param DATA_OUTPUT_DIR: output directory, usually the data folder for the classification task
        :param vocab_file_path: vocabulary path, usually the vocab file of the pretrained model
        :param do_lower_case: defaults to True
        '''
        # BERT's built-in WordPiece tokenizer; Chinese is always split into single characters
        self.bert_tokenizer = tokenization.FullTokenizer(
            vocab_file=self.get_vocab_file_path(vocab_file_path),
            do_lower_case=do_lower_case)  # initialize the bert_token tool
        self.DATA_INPUT_DIR = self.get_data_input_dir(RAW_DATA_INPUT_DIR)
        self.DATA_OUTPUT_DIR = os.path.join(os.path.dirname(__file__), DATA_OUTPUT_DIR)
        print("Data input path:", self.DATA_INPUT_DIR)
        print("Data output path:", self.DATA_OUTPUT_DIR)

    # Get the input-data directory
    def get_data_input_dir(self, DATA_INPUT_DIR):
        DATA_INPUT_DIR = os.path.join(
            os.path.abspath(os.path.join(os.path.dirname(__file__), "../../")), DATA_INPUT_DIR)
        return DATA_INPUT_DIR

    # Get the vocabulary path
    def get_vocab_file_path(self, vocab_file_path):
        vocab_file_path = os.path.join(
            os.path.abspath(os.path.join(os.path.dirname(__file__), r"D:\work\Entity-Relation-Extraction-my\pretrained_model\chinese_L-12_H-768_A-12")),
            vocab_file_path)
        return vocab_file_path

    # Process the raw data
    def separate_raw_data(self):
        # if not os.path.exists(self.DATA_OUTPUT_DIR):
        #     os.makedirs(os.path.join(self.DATA_OUTPUT_DIR, "train"))
        #     os.makedirs(os.path.join(self.DATA_OUTPUT_DIR, "valid"))
        #     os.makedirs(os.path.join(self.DATA_OUTPUT_DIR, "test"))
        token_in_f = open('./bert_raw_data_token_in.txt', "w", encoding='utf-8')
        token_in_not_UNK_f = open('./bert_raw_data_token_in_not_UNK.txt', "w", encoding='utf-8')
        with open('./raw_data.txt', 'r', encoding='utf-8') as f:
            count_numbers = 0
            while True:
                line = f.readline()
                if line:
                    count_numbers += 1
                    text_tokened = self.bert_tokenizer.tokenize(line)
                    text_tokened_not_UNK = self.bert_tokenizer.tokenize_not_UNK(line)
                    # print(text_tokened)
                    token_in_f.write(" ".join(text_tokened) + "\n")
                    token_in_not_UNK_f.write(" ".join(text_tokened_not_UNK) + "\n")
                else:
                    break
        print("all numbers", count_numbers)
        print("\n")
        token_in_f.close()
        token_in_not_UNK_f.close()


if __name__ == "__main__":
    RAW_DATA_DIR = "raw_data/company/all"
    DATA_OUTPUT_DIR = "bert_raw_data"
    bert_dir = ''
    model_data = Model_data_preparation(RAW_DATA_INPUT_DIR=RAW_DATA_DIR, DATA_OUTPUT_DIR=DATA_OUTPUT_DIR)
    model_data.separate_raw_data()
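A quick usage note on the script above: it reads ./raw_data.txt (one sentence per line), tokenizes each line twice, and writes two parallel files, ./bert_raw_data_token_in.txt, where out-of-vocabulary tokens become [UNK], and ./bert_raw_data_token_in_not_UNK.txt, where they are kept verbatim. Before running it (for example with python bert_token.py), adjust the hard-coded chinese_L-12_H-768_A-12 path so that get_vocab_file_path finds vocab.txt on your machine; note that the RAW_DATA_DIR and DATA_OUTPUT_DIR values set in __main__ are only printed and not actually used by separate_raw_data, which works with the hard-coded file names.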

2. tokenization.py (the official BERT tokenization module, plus the added tokenize_not_UNK and WordpieceTokenizer_not_UNK)

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes."""from __future__ import absolute_import
from __future__ import division
from __future__ import print_functionimport collections
import re
import unicodedata
import six
import tensorflow as tfdef validate_case_matches_checkpoint(do_lower_case, init_checkpoint):"""Checks whether the casing config is consistent with the checkpoint name."""# The casing has to be passed in by the user and there is no explicit check# as to whether it matches the checkpoint. The casing information probably# should have been stored in the bert_config.json file, but it's not, so# we have to heuristically detect it to validate.if not init_checkpoint:returnm = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint)if m is None:returnmodel_name = m.group(1)lower_models = ["uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12","multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"]cased_models = ["cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16","multi_cased_L-12_H-768_A-12"]is_bad_config = Falseif model_name in lower_models and not do_lower_case:is_bad_config = Trueactual_flag = "False"case_name = "lowercased"opposite_flag = "True"if model_name in cased_models and do_lower_case:is_bad_config = Trueactual_flag = "True"case_name = "cased"opposite_flag = "False"if is_bad_config:raise ValueError("You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. ""However, `%s` seems to be a %s model, so you ""should pass in `--do_lower_case=%s` so that the fine-tuning matches ""how the model was pre-training. If this error is wrong, please ""just comment out this check." % (actual_flag, init_checkpoint,model_name, case_name, opposite_flag))def convert_to_unicode(text):"""Converts `text` to Unicode (if it's not already), assuming utf-8 input."""if six.PY3:if isinstance(text, str):return textelif isinstance(text, bytes):return text.decode("utf-8", "ignore")else:raise ValueError("Unsupported string type: %s" % (type(text)))elif six.PY2:if isinstance(text, str):return text.decode("utf-8", "ignore")elif isinstance(text, unicode):return textelse:raise ValueError("Unsupported string type: %s" % (type(text)))else:raise ValueError("Not running on Python2 or Python 3?")def printable_text(text):"""Returns text encoded in a way suitable for print or `tf.logging`."""# These functions want `str` for both Python2 and Python3, but in one case# it's a Unicode string and in the other it's a byte string.if six.PY3:if isinstance(text, str):return textelif isinstance(text, bytes):return text.decode("utf-8", "ignore")else:raise ValueError("Unsupported string type: %s" % (type(text)))elif six.PY2:if isinstance(text, str):return textelif isinstance(text, unicode):return text.encode("utf-8")else:raise ValueError("Unsupported string type: %s" % (type(text)))else:raise ValueError("Not running on Python2 or Python 3?")def load_vocab(vocab_file):"""Loads a vocabulary file into a dictionary."""vocab = collections.OrderedDict()index = 0with tf.gfile.GFile(vocab_file, "r") as reader:while True:token = convert_to_unicode(reader.readline())if not token:breaktoken = token.strip()vocab[token] = indexindex += 1return vocabdef convert_by_vocab(vocab, items):"""Converts a sequence of [tokens|ids] using the vocab."""output = []for item in items:output.append(vocab[item])return outputdef convert_tokens_to_ids(vocab, tokens):return convert_by_vocab(vocab, tokens)def convert_ids_to_tokens(inv_vocab, ids):return convert_by_vocab(inv_vocab, ids)def whitespace_tokenize(text):"""Runs basic whitespace cleaning and splitting on a piece of text."""text = text.strip()if not text:return []tokens = text.split()return tokensclass FullTokenizer(object):"""Runs end-to-end tokenziation."""def __init__(self, vocab_file, 
do_lower_case=True):self.vocab = load_vocab(vocab_file)self.inv_vocab = {v: k for k, v in self.vocab.items()}self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)self.wordpiece_tokenizer_not_UNK = WordpieceTokenizer_not_UNK(vocab=self.vocab)def tokenize(self, text):split_tokens = []for token in self.basic_tokenizer.tokenize(text):for sub_token in self.wordpiece_tokenizer.tokenize(token):split_tokens.append(sub_token)return split_tokensdef tokenize_not_UNK(self, text):split_tokens = []for token in self.basic_tokenizer.tokenize(text):for sub_token in self.wordpiece_tokenizer_not_UNK.tokenize(token):split_tokens.append(sub_token)return split_tokensdef convert_tokens_to_ids(self, tokens):return convert_by_vocab(self.vocab, tokens)def convert_ids_to_tokens(self, ids):return convert_by_vocab(self.inv_vocab, ids)class BasicTokenizer(object):"""Runs basic tokenization (punctuation splitting, lower casing, etc.)."""def __init__(self, do_lower_case=True):"""Constructs a BasicTokenizer.Args:do_lower_case: Whether to lower case the input."""self.do_lower_case = do_lower_casedef tokenize(self, text):"""Tokenizes a piece of text."""text = convert_to_unicode(text)text = self._clean_text(text)# This was added on November 1st, 2018 for the multilingual and Chinese# models. This is also applied to the English models now, but it doesn't# matter since the English models were not trained on any Chinese data# and generally don't have any Chinese data in them (there are Chinese# characters in the vocabulary because Wikipedia does have some Chinese# words in the English Wikipedia.).text = self._tokenize_chinese_chars(text)orig_tokens = whitespace_tokenize(text)split_tokens = []for token in orig_tokens:if self.do_lower_case:token = token.lower()token = self._run_strip_accents(token)split_tokens.extend(self._run_split_on_punc(token))output_tokens = whitespace_tokenize(" ".join(split_tokens))return output_tokensdef _run_strip_accents(self, text):"""Strips accents from a piece of text."""text = unicodedata.normalize("NFD", text)output = []for char in text:cat = unicodedata.category(char)if cat == "Mn":continueoutput.append(char)return "".join(output)def _run_split_on_punc(self, text):"""Splits punctuation on a piece of text."""chars = list(text)i = 0start_new_word = Trueoutput = []while i < len(chars):char = chars[i]if _is_punctuation(char):output.append([char])start_new_word = Trueelse:if start_new_word:output.append([])start_new_word = Falseoutput[-1].append(char)i += 1return ["".join(x) for x in output]def _tokenize_chinese_chars(self, text):"""Adds whitespace around any CJK character."""output = []for char in text:cp = ord(char)if self._is_chinese_char(cp):output.append(" ")output.append(char)output.append(" ")else:output.append(char)return "".join(output)def _is_chinese_char(self, cp):"""Checks whether CP is the codepoint of a CJK character."""# This defines a "chinese character" as anything in the CJK Unicode block:#   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)## Note that the CJK Unicode block is NOT all Japanese and Korean characters,# despite its name. The modern Korean Hangul alphabet is a different block,# as is Japanese Hiragana and Katakana. 
Those alphabets are used to write# space-separated words, so they are not treated specially and handled# like the all of the other languages.if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #(cp >= 0x3400 and cp <= 0x4DBF) or  #(cp >= 0x20000 and cp <= 0x2A6DF) or  #(cp >= 0x2A700 and cp <= 0x2B73F) or  #(cp >= 0x2B740 and cp <= 0x2B81F) or  #(cp >= 0x2B820 and cp <= 0x2CEAF) or(cp >= 0xF900 and cp <= 0xFAFF) or  #(cp >= 0x2F800 and cp <= 0x2FA1F)):  #return Truereturn Falsedef _clean_text(self, text):"""Performs invalid character removal and whitespace cleanup on text."""output = []for char in text:cp = ord(char)if cp == 0 or cp == 0xfffd or _is_control(char):continueif _is_whitespace(char):output.append(" ")else:output.append(char)return "".join(output)class WordpieceTokenizer_not_UNK(object):"""Runs WordPiece tokenziation."""def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):self.vocab = vocabself.unk_token = unk_tokenself.max_input_chars_per_word = max_input_chars_per_worddef tokenize(self, text):"""Tokenizes a piece of text into its word pieces.This uses a greedy longest-match-first algorithm to perform tokenizationusing the given vocabulary.For example:input = "unaffable"output = ["un", "##aff", "##able"]Args:text: A single token or whitespace separated tokens. This should havealready been passed through `BasicTokenizer.Returns:A list of wordpiece tokens."""text = convert_to_unicode(text)output_tokens = []for token in whitespace_tokenize(text):chars = list(token)if len(chars) > self.max_input_chars_per_word:output_tokens.append(self.unk_token)continueis_bad = Falsestart = 0sub_tokens = []while start < len(chars):end = len(chars)cur_substr = Nonewhile start < end:substr = "".join(chars[start:end])if start > 0:substr = "##" + substrif substr in self.vocab:cur_substr = substrbreakend -= 1if cur_substr is None:is_bad = Truebreaksub_tokens.append(cur_substr)start = endif is_bad:#output_tokens.append(self.unk_token)output_tokens.append(token)else:output_tokens.extend(sub_tokens)return output_tokensclass WordpieceTokenizer(object):"""Runs WordPiece tokenziation."""def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):self.vocab = vocabself.unk_token = unk_tokenself.max_input_chars_per_word = max_input_chars_per_worddef tokenize(self, text):"""Tokenizes a piece of text into its word pieces.This uses a greedy longest-match-first algorithm to perform tokenizationusing the given vocabulary.For example:input = "unaffable"output = ["un", "##aff", "##able"]Args:text: A single token or whitespace separated tokens. 
This should havealready been passed through `BasicTokenizer.Returns:A list of wordpiece tokens."""text = convert_to_unicode(text)output_tokens = []for token in whitespace_tokenize(text):chars = list(token)if len(chars) > self.max_input_chars_per_word:output_tokens.append(self.unk_token)continueis_bad = Falsestart = 0sub_tokens = []while start < len(chars):end = len(chars)cur_substr = Nonewhile start < end:substr = "".join(chars[start:end])if start > 0:substr = "##" + substrif substr in self.vocab:cur_substr = substrbreakend -= 1if cur_substr is None:is_bad = Truebreaksub_tokens.append(cur_substr)start = endif is_bad:output_tokens.append(self.unk_token)else:output_tokens.extend(sub_tokens)return output_tokensdef _is_whitespace(char):"""Checks whether `chars` is a whitespace character."""# \t, \n, and \r are technically contorl characters but we treat them# as whitespace since they are generally considered as such.if char == " " or char == "\t" or char == "\n" or char == "\r":return Truecat = unicodedata.category(char)if cat == "Zs":return Truereturn Falsedef _is_control(char):"""Checks whether `chars` is a control character."""# These are technically control characters but we count them as whitespace# characters.if char == "\t" or char == "\n" or char == "\r":return Falsecat = unicodedata.category(char)if cat.startswith("C"):return Truereturn Falsedef _is_punctuation(char):"""Checks whether `chars` is a punctuation character."""cp = ord(char)# We treat all non-letter/number ASCII as punctuation.# Characters such as "^", "$", and "`" are not in the Unicode# Punctuation class but we treat them as punctuation anyways, for# consistency.if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or(cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):return Truecat = unicodedata.category(char)if cat.startswith("P"):return Truereturn False
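Relative to the upstream BERT file, the only additions are FullTokenizer.tokenize_not_UNK and WordpieceTokenizer_not_UNK, which keep an out-of-vocabulary word as-is instead of replacing it with [UNK]. Here is a minimal sketch of the difference, assuming tokenization.py is importable, the vocabulary path points at your local chinese_L-12_H-768_A-12 copy, and the character 𠮷 happens to be absent from that vocabulary:

import tokenization

# Placeholder path; point it at your local pretrained-model directory.
VOCAB_FILE = "chinese_L-12_H-768_A-12/vocab.txt"
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=True)

text = "𠮷野家"  # assume 𠮷 is out of vocabulary here
print(tokenizer.tokenize(text))          # e.g. ['[UNK]', '野', '家']
print(tokenizer.tokenize_not_UNK(text))  # e.g. ['𠮷', '野', '家']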

That's all for this post. You're welcome to follow my WeChat official account; submissions are welcome as well.
