POS Tagging with a Hidden Markov Model (HMM)
1. Data preparation
from utils_pos import get_word_tag, preprocess
import pandas as pd
from collections import defaultdict
import math
import numpy as np

with open("WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()

with open("hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')
# Build the word-to-index dictionary
vocab = {}
# Get the index of the corresponding words.
for i, word in enumerate(sorted(voc_l)):
    vocab[word] = i

print("Vocabulary dictionary, key is the word, value is a unique integer")
cnt = 0
for k, v in vocab.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break

with open("WSJ_24.pos", 'r') as f:
    y = f.readlines()
# Test-set line format: 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n'
#corpus without tags, preprocessed
_, prep = preprocess(vocab, "test.words")
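The line format noted above (a word, a tab, a tag, then a newline) can be split as in the minimal sketch below. `split_word_tag` is a hypothetical helper written for illustration; it is not the `get_word_tag` function from `utils_pos`, whose handling of blank lines and out-of-vocabulary words is not shown in this post.

```python
def split_word_tag(line):
    """Split a 'word<TAB>tag' line into (word, tag).

    Returns None for blank or malformed lines (an assumption made for this sketch).
    """
    parts = line.split()
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


sample = ['economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', '\n']
pairs = [split_word_tag(line) for line in sample]
print(pairs)
```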
2. Training the HMM
Compute the transition matrix and the emission matrix.
def create_dictionaries(training_corpus, vocab):
    """
    Input:
        training_corpus: a corpus where each line has a word followed by its tag
        vocab: a dictionary where keys are words in the vocabulary and values are indices
    Output:
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    # Initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)

    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--'

    # Use 'i' to track the line number in the corpus
    i = 0

    # Each item in the training corpus contains a word and its POS tag
    for word_tag in training_corpus:
        i += 1
        # Every 50,000 words, print the word count
        if i % 50000 == 0:
            print(f"word count = {i}")

        # Get the word and tag using the get_word_tag helper (imported from utils_pos.py)
        word, tag = get_word_tag(word_tag, vocab)
        # Increment the transition count for the previous tag and this tag
        transition_counts[(prev_tag, tag)] += 1
        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1
        # Increment the tag count
        tag_counts[tag] += 1
        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag

    return emission_counts, transition_counts, tag_counts


def create_transition_matrix(alpha, tag_counts, transition_counts):
    """
    Input:
        alpha: number used for smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        transition_counts: a dictionary mapping (prev_tag, tag) to its count
    Output:
        A: matrix of dimension (num_tags, num_tags)
    """
    # Get a sorted list of unique POS tags
    all_tags = sorted(tag_counts.keys())
    # Count the number of unique POS tags
    num_tags = len(all_tags)
    # Initialize the transition matrix 'A'
    A = np.zeros((num_tags, num_tags))
    # Get the unique transition tuples (previous POS, current POS)
    trans_keys = set(transition_counts.keys())

    # Rows are previous tags, columns are current tags
    for i in range(num_tags):
        for j in range(num_tags):
            # Count of the (prev POS, current POS) tuple, zero if never seen
            count = 0
            key = (all_tags[i], all_tags[j])
            if key in trans_keys:
                count = transition_counts[key]
            # Count of the previous tag (index position i)
            count_prev_tag = tag_counts[all_tags[i]]
            # Add-alpha smoothing over the num_tags possible next tags
            A[i, j] = (count + alpha) / (count_prev_tag + alpha * num_tags)

    return A


def create_emission_matrix(alpha, tag_counts, emission_counts, vocab):
    """
    Input:
        alpha: tuning parameter used in smoothing
        tag_counts: a dictionary mapping each tag to its respective count
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        vocab: the vocabulary words; this function indexes it by position (vocab[j]),
               so it must be passed as a list, e.g. list(vocab), not as the dictionary
    Output:
        B: a matrix of dimension (num_tags, len(vocab))
    """
    # Get the number of POS tags
    num_tags = len(tag_counts)
    # Get a sorted list of all POS tags
    all_tags = sorted(tag_counts.keys())
    # Get the total number of unique words in the vocabulary
    num_words = len(vocab)
    # Initialize the emission matrix B: tags in the rows, words in the columns
    B = np.zeros((num_tags, num_words))
    # Get a set of all (POS, word) tuples from the keys of emission_counts
    emis_keys = set(emission_counts.keys())

    # Rows are POS tags, columns are words
    for i in range(num_tags):
        for j in range(num_words):
            # Count of the (POS tag, word) tuple, zero if never seen
            count = 0
            key = (all_tags[i], vocab[j])
            if key in emis_keys:
                count = emission_counts[key]
            # Count of the POS tag
            count_tag = tag_counts[all_tags[i]]
            # Add-alpha smoothing over the num_words possible words
            B[i, j] = (count + alpha) / (count_tag + alpha * num_words)

    return B
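Both matrices use the same add-alpha smoothing: (count + alpha) / (row total + alpha * number of columns). The sketch below restates that formula in vectorized NumPy on made-up counts. `smooth_rows`, `toy_counts`, and the three-tag example are illustrative assumptions, not part of the original code.

```python
import numpy as np


def smooth_rows(counts, row_totals, alpha):
    """Add-alpha smoothing: (count + alpha) / (row_total + alpha * num_columns)."""
    counts = np.asarray(counts, dtype=float)
    row_totals = np.asarray(row_totals, dtype=float).reshape(-1, 1)
    return (counts + alpha) / (row_totals + alpha * counts.shape[1])


# Made-up transition counts for tags [DT, NN, VB]:
# rows are the previous tag, columns are the current tag.
toy_counts = np.array([[0, 8, 2],
                       [1, 3, 6],
                       [5, 4, 1]])
# In this toy example each tag count equals its row sum.
toy_tag_counts = toy_counts.sum(axis=1)

A_toy = smooth_rows(toy_counts, toy_tag_counts, alpha=0.001)
print(A_toy.round(4))
print(A_toy.sum(axis=1))
```

Smoothing turns the unseen DT-to-DT transition into a small positive probability, so taking its logarithm later is safe. When the row totals match the row sums, as here, every row still sums to 1.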
3. Prediction with the Viterbi algorithm
def initialize(states, tag_counts, A, B, corpus, vocab):
    """
    Input:
        states: a list of all possible parts of speech
        tag_counts: a dictionary mapping each tag to its respective count
        A: transition matrix of dimension (num_tags, num_tags)
        B: emission matrix of dimension (num_tags, len(vocab))
        corpus: a list of words whose POS tags are to be identified
        vocab: a dictionary where keys are words in the vocabulary and values are indices
    Output:
        best_probs: matrix of dimension (num_tags, len(corpus)) of floats
        best_paths: matrix of dimension (num_tags, len(corpus)) of integers
    """
    # Get the total number of unique POS tags
    num_tags = len(tag_counts)
    # POS tags in the rows, words of the corpus in the columns
    best_probs = np.zeros((num_tags, len(corpus)))
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    # Index of the start token
    s_idx = states.index("--s--")

    for i in range(num_tags):
        # Special case: the transition from the start token to POS tag i is zero
        if A[s_idx, i] == 0:
            best_probs[i, 0] = float('-inf')
        else:
            # log P(tag i | start) + log P(first word | tag i)
            best_probs[i, 0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])

    return best_probs, best_paths


def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
    """
    Input:
        A, B: the transition and emission matrices respectively
        test_corpus: a list containing a preprocessed corpus
        best_probs: an initialized matrix of dimension (num_tags, len(corpus))
        best_paths: an initialized matrix of dimension (num_tags, len(corpus))
        vocab: a dictionary where keys are words in the vocabulary and values are indices
    Output:
        best_probs: a completed matrix of dimension (num_tags, len(corpus))
        best_paths: a completed matrix of dimension (num_tags, len(corpus))
    """
    # Number of unique POS tags (the number of rows in best_probs)
    num_tags = best_probs.shape[0]

    # Start from word 1; word 0 was handled in initialize()
    for i in range(1, len(test_corpus)):
        # Print the number of words processed, every 5000 words
        if i % 5000 == 0:
            print("Words processed: {:>8}".format(i))

        # For each POS tag that the current word can have
        for j in range(num_tags):
            best_prob_i = float('-inf')
            best_path_i = None

            # For each POS tag that the previous word can have
            for k in range(num_tags):
                # best log prob of tag k at word i-1
                # + log(prob of transition from tag k to tag j)
                # + log(prob that tag j emits word i)
                prob = (best_probs[k, i - 1]
                        + math.log(A[k, j])
                        + math.log(B[j, vocab[test_corpus[i]]]))
                # Keep the best probability and the previous tag that produced it
                if prob > best_prob_i:
                    best_prob_i = prob
                    best_path_i = k

            # Save the best probability for tag j at word i
            best_probs[j, i] = best_prob_i
            # Save the integer ID of the previous tag on that best path
            best_paths[j, i] = best_path_i

    return best_probs, best_paths


def viterbi_backward(best_probs, best_paths, corpus, states):
    """
    This function returns the best path.
    """
    # Number of words in the corpus (the number of columns in best_probs, best_paths)
    m = best_paths.shape[1]
    # z holds the integer tag IDs, same length as the corpus
    z = [None] * m
    # Number of unique POS tags
    num_tags = best_probs.shape[0]
    # Initialize the best probability for the last word
    best_prob_for_last_word = float('-inf')
    # pred holds the tag strings, same length as the corpus
    pred = [None] * m

    # Step 1: find the row (POS tag integer ID) with the highest
    # probability in the last column of best_probs
    for k in range(num_tags):
        if best_probs[k, -1] > best_prob_for_last_word:
            # Store the new best probability for the last word
            best_prob_for_last_word = best_probs[k, -1]
            # Store the integer ID of the POS tag (the row number in best_probs)
            z[m - 1] = k

    # Convert the last word's tag ID to its string form using the 'states' list.
    # Index with z[m - 1], not the loop variable k, which ends at num_tags - 1.
    pred[m - 1] = states[z[m - 1]]

    # Step 2: walk backward through best_paths, from the last word to word 0
    for i in range(m - 1, 0, -1):
        # Integer ID of the POS tag chosen for the word at position i
        pos_tag_for_word_i = z[i]
        # best_paths[tag of word i, i] is the tag ID of the word at position i-1
        z[i - 1] = best_paths[pos_tag_for_word_i, i]
        # Convert the previous word's tag ID to its string form
        pred[i - 1] = states[z[i - 1]]

    return pred
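The three functions above implement the usual Viterbi steps in log space: initialize, forward pass, backward trace. The self-contained sketch below runs the same procedure on a tiny made-up HMM so the result can be checked by hand. Everything in it (`viterbi`, `pi`, the two-tag, two-word model) is an illustrative assumption; unlike the code above, it keeps the start distribution in a separate vector `pi` rather than in the `'--s--'` row of A.

```python
import numpy as np


def viterbi(log_pi, log_A, log_B, observations):
    """Return the most likely state index sequence for a list of observation indices."""
    num_states = log_A.shape[0]
    T = len(observations)
    best_probs = np.full((num_states, T), -np.inf)
    best_paths = np.zeros((num_states, T), dtype=int)

    # Initialization: start probability plus emission of the first observation
    best_probs[:, 0] = log_pi + log_B[:, observations[0]]

    # Forward pass
    for t in range(1, T):
        # scores[k, j]: best log prob of being in k at t-1, then moving k -> j
        scores = best_probs[:, t - 1][:, None] + log_A
        best_paths[:, t] = scores.argmax(axis=0)
        best_probs[:, t] = scores.max(axis=0) + log_B[:, observations[t]]

    # Backward trace from the best final state
    path = [int(best_probs[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(best_paths[path[-1], t]))
    return path[::-1]


states = ['NN', 'VB']
words = {'dogs': 0, 'run': 1}
pi = np.array([0.6, 0.4])                  # P(first tag)
A = np.array([[0.3, 0.7], [0.8, 0.2]])     # P(next tag | previous tag)
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # P(word | tag)

sentence = ['dogs', 'run']
path = viterbi(np.log(pi), np.log(A), np.log(B), [words[w] for w in sentence])
tags = [states[s] for s in path]
print(tags)  # ['NN', 'VB']
```

Working in log space replaces products of many small probabilities with sums, which avoids floating-point underflow on long corpora.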
4. Testing the model
def compute_accuracy(pred, y):
    """
    Input:
        pred: a list of the predicted parts of speech
        y: a list of lines where each word is separated from its tag by a '\t' (i.e. word \t tag)
    Output:
        accuracy: the fraction of validly labeled words whose prediction matches the label
    """
    num_correct = 0
    total = 0

    # Zip together the predictions and the labels
    for prediction, label in zip(pred, y):
        # Split the label into the word and the POS tag
        word_tag_tuple = label.split()
        # Skip lines that do not contain exactly a word and a tag
        if len(word_tag_tuple) != 2:
            continue
        # Store the word and tag separately
        word, tag = word_tag_tuple
        # Count the predictions that match the label
        if prediction == tag:
            num_correct += 1
        # Keep track of the total number of examples that have valid labels
        total += 1

    return num_correct / total
If each word is simply assigned its most frequent tag, accuracy reaches 0.85; the HMM model reaches 0.95.
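The most-frequent-tag baseline mentioned above can be sketched as follows. The post does not show how its baseline was computed, so the helper names, the `default_tag` fallback for unseen words, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_lines):
    """Map each word to the tag it carries most often in 'word<TAB>tag' lines."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, tag = parts
        counts[word][tag] += 1
    return {word: tag_counter.most_common(1)[0][0]
            for word, tag_counter in counts.items()}


def baseline_accuracy(word_to_tag, tagged_lines, default_tag="NN"):
    """Score the baseline; unseen words get default_tag (an assumption of this sketch)."""
    num_correct = 0
    total = 0
    for line in tagged_lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, tag = parts
        if word_to_tag.get(word, default_tag) == tag:
            num_correct += 1
        total += 1
    return num_correct / total


toy_train = ["the\tDT\n", "dog\tNN\n", "runs\tVBZ\n", "the\tDT\n",
             "run\tNN\n", "run\tVB\n", "run\tVB\n", "\n"]
toy_test = ["the\tDT\n", "run\tNN\n", "cat\tNN\n", "runs\tVBZ\n"]

model = train_most_frequent_tag(toy_train)
accuracy = baseline_accuracy(model, toy_test)
print(accuracy)  # 0.75
```

The one error in the toy test set is "run", tagged NN in context but VB most often in training. The baseline ignores the neighboring tags, which is the information the HMM's transition matrix adds.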