POS Tagging: the HMM Model

1. Data preparation

from utils_pos import get_word_tag, preprocess
import pandas as pd
from collections import defaultdict
import math
import numpy as np

# Load the training corpus (one "word\ttag" pair per line)
with open("WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()

# Load the vocabulary (one word per line)
with open("hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')

# Build the word-to-index dictionary
vocab = {}
# Get the index of the corresponding words.
for i, word in enumerate(sorted(voc_l)):
    vocab[word] = i

print("Vocabulary dictionary, key is the word, value is a unique integer")
cnt = 0
for k, v in vocab.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break

# Load the tagged test corpus
with open("WSJ_24.pos", 'r') as f:
    y = f.readlines()
# Test-set line format: 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n'

# corpus without tags, preprocessed
_, prep = preprocess(vocab, "test.words")

2. Training the HMM

Compute the transition matrix and the emission matrix.
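Both matrices are relative frequencies with add-alpha smoothing. Written out, the formulas used by create_transition_matrix and create_emission_matrix are as follows. The symbols are notation introduced here for illustration: C(.) is a count from the training corpus, t_i is the i-th tag, w_j is the j-th vocabulary word, N is the number of tags, and |V| is the vocabulary size.

```latex
A_{ij} = \frac{C(t_i, t_j) + \alpha}{C(t_i) + \alpha N},
\qquad
B_{ij} = \frac{C(t_i, w_j) + \alpha}{C(t_i) + \alpha |V|}
```

With alpha > 0 every entry is strictly positive, so the logarithms taken later in the Viterbi step are always defined.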

def create_dictionaries(training_corpus, vocab):
    """
    Input:
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    # initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)

    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--'

    # use 'i' to track the line number in the corpus
    i = 0

    # Each item in the training corpus contains a word and its POS tag
    # Go through each word and its tag in the training corpus
    for word_tag in training_corpus:
        # Increment the word_tag count
        i += 1

        # Every 50,000 words, print the word count
        if i % 50000 == 0:
            print(f"word count = {i}")

        # get the word and tag using the get_word_tag helper function (imported from utils_pos.py)
        word, tag = get_word_tag(word_tag, vocab)

        # Increment the transition count for the previous tag and this tag
        transition_counts[(prev_tag, tag)] += 1

        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1

        # Increment the tag count
        tag_counts[tag] += 1

        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag

    return emission_counts, transition_counts, tag_counts


def create_transition_matrix(alpha, tag_counts, transition_counts):
    '''
    Input:
        alpha: number used for smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        transition_counts: a dictionary mapping (prev_tag, tag) to its count
    Output:
        A: matrix of dimension (num_tags, num_tags)
    '''
    # Get a sorted list of unique POS tags
    all_tags = sorted(tag_counts.keys())

    # Count the number of unique POS tags
    num_tags = len(all_tags)

    # Initialize the transition matrix 'A'
    A = np.zeros((num_tags, num_tags))

    # Get the unique transition tuples (previous POS, current POS)
    trans_keys = set(transition_counts.keys())

    # Go through each row of the transition matrix A
    for i in range(num_tags):
        # Go through each column of the transition matrix A
        for j in range(num_tags):
            # Initialize the count of the (prev POS, current POS) to zero
            count = 0

            # Define the tuple (prev POS, current POS) from the
            # tags at positions i and j of the all_tags list
            key = (all_tags[i], all_tags[j])

            # If the tuple exists in the transition counts dictionary, use its count
            if key in trans_keys:
                count = transition_counts[key]

            # Get the count of the previous tag (index position i) from tag_counts
            count_prev_tag = tag_counts[all_tags[i]]

            # Apply smoothing using the count of the tuple, alpha,
            # the count of the previous tag, and the total number of tags
            A[i, j] = (count + alpha) / (count_prev_tag + alpha * num_tags)

    return A


def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
    '''
    Input:
        alpha: tuning parameter used in smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        vocab: a dictionary where keys are words in vocabulary and value is an index.
               Within the function it is treated as a list.
    Output:
        B: a matrix of dimension (num_tags, len(vocab))
    '''
    # get the number of POS tags
    num_tags = len(tag_counts)

    # Get a list of all POS tags
    all_tags = sorted(tag_counts.keys())

    # Get the total number of unique words in the vocabulary
    num_words = len(vocab)

    # Treat the vocabulary as a list of words so it can be indexed by column.
    # This works whether vocab is passed as the dictionary or as list(vocab);
    # the dictionary was built in sorted order, so position j matches index j.
    all_words = list(vocab)

    # Initialize the emission matrix B with places for
    # tags in the rows and words in the columns
    B = np.zeros((num_tags, num_words))

    # Get a set of all (POS, word) tuples
    # from the keys of the emission_counts dictionary
    emis_keys = set(emission_counts.keys())

    # Go through each row (POS tags)
    for i in range(num_tags):
        # Go through each column (words)
        for j in range(num_words):
            # Initialize the emission count for the (POS tag, word) to zero
            count = 0

            # Define the (POS tag, word) tuple for this row and column
            key = (all_tags[i], all_words[j])

            # If the tuple exists as a key in emission counts, use its count
            if key in emis_keys:
                count = emission_counts[key]

            # Get the count of the POS tag
            count_tag = tag_counts[all_tags[i]]

            # Apply smoothing and store the smoothed value
            # into the emission matrix B for this row and column
            B[i, j] = (count + alpha) / (count_tag + alpha * num_words)

    return B
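A quick way to sanity-check the smoothing is to apply it to a tiny count matrix and confirm that every row is a probability distribution. The sketch below is a vectorized toy version, not the code above: the helper name smoothed_row_probs and the toy counts are made up for illustration, and it assumes the row total can stand in for the tag count (in the real corpus, tag_counts and the row sums of the transition counts can differ slightly at the corpus boundaries).

```python
import numpy as np


def smoothed_row_probs(counts, alpha):
    """Add-alpha smoothing applied to each row of a count matrix.

    Hypothetical helper for illustration; the row total is used in
    place of the per-tag count from tag_counts.
    """
    counts = np.asarray(counts, dtype=float)
    num_outcomes = counts.shape[1]
    row_totals = counts.sum(axis=1, keepdims=True)
    return (counts + alpha) / (row_totals + alpha * num_outcomes)


# Toy counts: 3 "tags", with one tag that was never followed by anything
toy_counts = [[3, 1, 0],
              [0, 0, 0],
              [2, 2, 4]]

A_toy = smoothed_row_probs(toy_counts, alpha=0.001)

# Each row sums to 1, and the all-zero row falls back to a uniform distribution
print(A_toy.sum(axis=1))
print(A_toy[1])
```

The all-zero row is the case smoothing exists for: without alpha, the division would be 0/0, and unseen transitions would get probability zero.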

3. Prediction with the Viterbi algorithm
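The functions in this section work in log space. As a reading aid, the recurrence they implement can be written as below, using the code's own array names. Two pieces of notation are introduced here for illustration: s is the row index of the start tag '--s--', and idx(w) is the vocabulary index vocab[w] of word w.

```latex
\begin{aligned}
\text{best\_probs}[j, 0] &= \log A[s, j] + \log B[j, \mathrm{idx}(w_0)] \\
\text{best\_probs}[j, i] &= \max_{k} \bigl( \text{best\_probs}[k, i-1] + \log A[k, j] + \log B[j, \mathrm{idx}(w_i)] \bigr) \\
\text{best\_paths}[j, i] &= \arg\max_{k} \bigl( \text{best\_probs}[k, i-1] + \log A[k, j] + \log B[j, \mathrm{idx}(w_i)] \bigr)
\end{aligned}
```

The backward pass then picks the highest-scoring tag in the last column and follows best_paths back to the first word.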

def initialize(states, tag_counts, A, B, corpus, vocab):
    '''
    Input:
        states: a list of all possible parts-of-speech
        tag_counts: a dictionary mapping each tag to its respective count
        A: Transition Matrix of dimension (num_tags, num_tags)
        B: Emission Matrix of dimension (num_tags, len(vocab))
        corpus: a sequence of words whose POS is to be identified, in a list
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        best_probs: matrix of dimension (num_tags, len(corpus)) of floats
        best_paths: matrix of dimension (num_tags, len(corpus)) of integers
    '''
    # Get the total number of unique POS tags
    num_tags = len(tag_counts)

    # Initialize best_probs matrix
    # POS tags in the rows, number of words in the corpus as the columns
    best_probs = np.zeros((num_tags, len(corpus)))

    # Initialize best_paths matrix
    # POS tags in the rows, number of words in the corpus as columns
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)

    # Define the start token
    s_idx = states.index("--s--")

    # Go through each of the POS tags
    for i in range(num_tags):
        # Handle the special case when the transition from start token to POS tag i is zero
        if A[s_idx, i] == 0:
            # Initialize best_probs at POS tag 'i', column 0, to negative infinity
            best_probs[i, 0] = float('-inf')
        # For all other cases when transition from start token to POS tag i is non-zero:
        else:
            # Initialize best_probs at POS tag 'i', column 0:
            # log transition from the start token plus log emission of the first word
            best_probs[i, 0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])

    return best_probs, best_paths


def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
    '''
    Input:
        A, B: The transition and emission matrices respectively
        test_corpus: a list containing a preprocessed corpus
        best_probs: an initialized matrix of dimension (num_tags, len(corpus))
        best_paths: an initialized matrix of dimension (num_tags, len(corpus))
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        best_probs: a completed matrix of dimension (num_tags, len(corpus))
        best_paths: a completed matrix of dimension (num_tags, len(corpus))
    '''
    # Get the number of unique POS tags (which is the num of rows in best_probs)
    num_tags = best_probs.shape[0]

    # Go through every word in the corpus starting from word 1
    # Recall that word 0 was initialized in `initialize()`
    for i in range(1, len(test_corpus)):
        # Print number of words processed, every 5000 words
        if i % 5000 == 0:
            print("Words processed: {:>8}".format(i))

        # For each unique POS tag that the current word can be
        for j in range(num_tags):
            # Initialize best_prob for word i to negative infinity
            best_prob_i = float('-inf')

            # Initialize best_path for current word i to None
            best_path_i = None

            # For each POS tag that the previous word can be:
            for k in range(num_tags):
                # Calculate the probability =
                # best probs of POS tag k, previous word i-1 +
                # log(prob of transition from POS k to POS j) +
                # log(prob that emission of POS j is word i)
                prob = best_probs[k, i-1] + math.log(A[k, j]) + math.log(B[j, vocab[test_corpus[i]]])

                # check if this path's probability is greater than
                # the best probability up to and before this point
                if prob > best_prob_i:
                    # Keep track of the best probability
                    best_prob_i = prob

                    # Keep track of the POS tag of the previous word
                    # that is part of the best path: save the index (integer)
                    # associated with that previous word's POS tag
                    best_path_i = k

            # Save the best probability for the given current word's POS tag
            # and the position of the current word inside the corpus
            best_probs[j, i] = best_prob_i

            # Save the unique integer ID of the previous POS tag
            # into best_paths matrix, for the POS tag of the current word
            # and the position of the current word inside the corpus.
            best_paths[j, i] = best_path_i

    return best_probs, best_paths


def viterbi_backward(best_probs, best_paths, corpus, states):
    '''
    This function returns the best path.
    '''
    # Get the number of words in the corpus
    # which is also the number of columns in best_probs, best_paths
    m = best_paths.shape[1]

    # Initialize array z, same length as the corpus
    z = [None] * m

    # Get the number of unique POS tags
    num_tags = best_probs.shape[0]

    # Initialize the best probability for the last word
    best_prob_for_last_word = float('-inf')

    # Initialize pred array, same length as corpus
    pred = [None] * m

    ## Step 1 ##
    # Go through each POS tag for the last word (last column of best_probs)
    # in order to find the row (POS tag integer ID)
    # with highest probability for the last word
    for k in range(num_tags):
        # If the probability of POS tag at row k
        # is better than the previously best probability for the last word:
        if best_probs[k, -1] > best_prob_for_last_word:
            # Store the new best probability for the last word
            best_prob_for_last_word = best_probs[k, -1]

            # Store the unique integer ID of the POS tag
            # which is also the row number in best_probs
            z[m - 1] = k

    # Convert the last word's predicted POS tag from its unique integer ID
    # into the string representation using the 'states' list.
    # Index with z[m - 1], not the loop variable k, which after the loop
    # is always the last tag index rather than the best one.
    pred[m - 1] = states[z[m - 1]]

    ## Step 2 ##
    # Find the best POS tags by walking backward through the best_paths
    # From the last word in the corpus to the 0th word in the corpus
    for i in range(m-1, 0, -1):
        # Retrieve the unique integer ID of
        # the POS tag for the word at position 'i' in the corpus
        pos_tag_for_word_i = z[i]

        # In best_paths, go to the row representing the POS tag of word i
        # and the column representing the word's position in the corpus
        # to retrieve the predicted POS for the word at position i-1 in the corpus
        z[i - 1] = best_paths[pos_tag_for_word_i, i]

        # Get the previous word's POS tag in string form from the 'states' list,
        # where the position is the unique integer ID of the POS tag
        # and the value is the string representation of that POS tag
        pred[i - 1] = states[z[i - 1]]

    return pred
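To see the forward and backward passes work end to end without the WSJ files, here is a minimal, self-contained sketch on a two-state toy HMM. It is a compact NumPy variant, not the code above: the function name viterbi_log is made up for this example, the toy probabilities are illustrative values, and a separate start distribution log_pi is assumed in place of the '--s--' row of A.

```python
import numpy as np


def viterbi_log(log_pi, log_A, log_B, obs):
    """Most likely state sequence for the observation indices in obs.

    log_pi: (S,)   log start probabilities (stands in for the '--s--' row)
    log_A:  (S, S) log transitions, log_A[k, j] = log P(state j | state k)
    log_B:  (S, V) log emissions,   log_B[j, o] = log P(obs o | state j)
    """
    num_states, T = log_A.shape[0], len(obs)
    best_probs = np.full((num_states, T), -np.inf)
    best_paths = np.zeros((num_states, T), dtype=int)

    # Initialization: start probability plus emission of the first observation
    best_probs[:, 0] = log_pi + log_B[:, obs[0]]

    # Forward pass
    for t in range(1, T):
        # scores[k, j]: best score ending in state k at t-1, then moving to j
        scores = best_probs[:, t - 1][:, None] + log_A
        best_paths[:, t] = scores.argmax(axis=0)
        best_probs[:, t] = scores.max(axis=0) + log_B[:, obs[t]]

    # Backward pass: start from the best final state and follow the backpointers
    path = [int(best_probs[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(best_paths[path[-1], t]))
    return path[::-1]


# Toy model: states 0 = Healthy, 1 = Fever; observations 0 = normal, 1 = cold, 2 = dizzy
pi = np.array([0.6, 0.4])
A_toy = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
B_toy = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])

path = viterbi_log(np.log(pi), np.log(A_toy), np.log(B_toy), obs=[0, 1, 2])
print(path)  # [0, 0, 1]
```

Replacing the inner loop over k with one argmax per column does the same work as the triple loop in viterbi_forward, which matters on a corpus of tens of thousands of words.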

4. Testing the model

def compute_accuracy(pred, y):
    '''
    Input:
        pred: a list of the predicted parts-of-speech
        y: a list of lines where each word is separated by a '\t' (i.e. word \t tag)
    Output:
        the fraction of valid labels that the predictions match
    '''
    num_correct = 0
    total = 0

    # Zip together the predictions and the labels
    # (the loop variable is named 'label' so it does not shadow the list y)
    for prediction, label in zip(pred, y):
        # Split the label into the word and the POS tag
        word_tag_tuple = label.split()

        # Check that there is actually a word and a tag,
        # no more and no less than 2 items
        if len(word_tag_tuple) != 2:
            continue

        # store the word and tag separately
        word, tag = word_tag_tuple

        # Check if the POS tag label matches the prediction
        if prediction == tag:
            # count the number of times that the prediction
            # and label match
            num_correct += 1

        # keep track of the total number of examples (that have valid labels)
        total += 1

    return num_correct / total

Predicting each word's most frequent tag alone gives an accuracy of 0.85; the HMM model reaches 0.95.
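For reference, the most-frequent-tag baseline mentioned above can be sketched in a few lines. This is an illustrative version, not the code that produced the 0.85 figure: the function names, the default tag "NN" for unseen words, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_words):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tag_counter.most_common(1)[0][0]
            for word, tag_counter in counts.items()}


def baseline_accuracy(model, tagged_words, default_tag="NN"):
    """Accuracy of the lookup model; unseen words get default_tag (an assumption)."""
    correct = sum(model.get(word, default_tag) == tag
                  for word, tag in tagged_words)
    return correct / len(tagged_words)


# Toy data for illustration only
train = [("the", "DT"), ("run", "NN"), ("run", "VB"), ("run", "VB"), ("dog", "NN")]
test = [("the", "DT"), ("run", "NN"), ("dog", "NN"), ("cat", "NN")]

model = train_most_frequent_tag(train)
acc = baseline_accuracy(model, test)
print(acc)  # 0.75
```

The baseline ignores context entirely, so an ambiguous word such as "run" always gets the same tag; the transition matrix is what lets the HMM choose between tags based on the neighboring words.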
