Processing POS Tags with Python

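Here we look at some part-of-speech (POS) tagged data, extract some distributional information from it, and then use word/POS-tag co-occurrence counts as the basis for building a simple POS tagger.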

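Open one of the data files to look at the data format (known as Brill format, stored in a txt file). It is a convenient format to process: the text is tokenised, one sentence per line, with a single space between tokens. Each token is given together with its POS tag, in the form TOKEN/POS. You will see that some punctuation marks are treated as separate tokens under this tokenisation scheme.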

There/EX is/VBZ no/DT slow-motion/JJ close-up/NN ,/, blood-and-guts/JJ portrayal/NN of/IN the/DT animal/NN 's/POS demise/NN ./.
Keep/VB in/IN mind/NN that/IN this/DT is/VBZ the/DT same/JJ movie/NN in/IN which/WDT a/DT character/NN is/VBZ flattened/VBN by/IN a/DT steamroller/NN only/RB to/TO pop/VB right/JJ back/NN up/IN and/CC peer/VB in/IN the/DT window/NN of/IN a/DT Boeing/NNP 747/CD --/: from/IN the/DT outside/NN --/: as/IN it/PRP takes/VBZ off/IN ./.
I/PRP will/MD be/VB the/DT first/JJ to/TO agree/VB that/IN there/EX is/VBZ much/JJ to/TO be/VB found/VBN wrong/RB with/IN modern/JJ movie/NN making/NN ./.
Many/JJ modern/JJ scriptwriters/NNS seem/VBP to/TO be/VB incapable/JJ of/IN writing/VBG drama/NN ,/, or/CC anything/NN else/RB ,/, without/IN foul-mouthed/JJ cursing/NN ./.
Sex/NN and/CC violence/NN are/VBP routinely/RB included/VBN even/RB when/WRB they/PRP are/VBP irrelevant/JJ to/TO the/DT script/NN ,/, and/CC high-tech/JJ special/JJ effects/NNS are/VBP continually/RB substituted/VBN for/IN good/JJ plot/NN and/CC character/NN development/NN ./.
In/IN short/JJ ,/, we/PRP have/VBP a/DT movie/NN and/CC television/NN industry/NN that/WDT is/VBZ either/DT incapable/JJ or/CC petrified/JJ of/IN making/VBG a/DT movie/NN unless/IN it/PRP carries/VBZ a/DT PG-13/NN or/CC R/NN rating/NN ./.
Hence/RB copious/JJ amounts/NNS of/IN gratuitous/JJ sex/NN ,/, violence/NN and/CC gutter/NN language/NN are/VBP included/VBN as/IN a/DT crutch/NN ./.
However/RB ,/, these/DT faults/NNS are/VBP not/RB the/DT exclusive/JJ property/NN of/IN modern/JJ comedies/NNS ,/, and/CC I/PRP believe/VBP Mr./NNP Knight/NNP errs/VBZ when/WRB he/PRP attempts/VBZ to/TO link/VB this/DT modern/JJ phenomenon/NN too/RB closely/RB to/TO a/DT single/JJ category/NN of/IN movie/NN making/NN ./.
Michael/NNP Smith/NNP St./NNP Louis/NNP Rochester/NNP Telephone/NNP Corp./NNP said/VBD it/PRP agreed/VBD to/TO buy/VB Viroqua/NNP Telephone/NNP Co./NNP of/IN Viroqua/NNP ,/, Wis/NNP ./.
Terms/NNS were/VBD n't/RB disclosed/VBN ./.
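
As a quick illustration of the TOKEN/POS format shown above, the sketch below splits one line into (token, tag) pairs. The function name split_tagged_line is hypothetical, and splitting on the last slash only is an assumption made here so that a token which itself contains a slash stays in one piece.

```python
def split_tagged_line(line):
    """Split a Brill-format line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash only, so a token that itself contains
        # a slash is kept whole (assumption, see lead-in).
        token, _, tag = item.rpartition('/')
        pairs.append((token, tag))
    return pairs


sample = "Terms/NNS were/VBD n't/RB disclosed/VBN ./."
pairs = split_tagged_line(sample)
print(pairs)
```

Note that the final item "./." comes out as the token "." with the tag ".", which is how punctuation appears in this format.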

Basic part:

import sys, re, getopt

################################################################
# Command line options handling, and help

opts, args = getopt.getopt(sys.argv[1:], 'hd:t:')
#args = '-a -b -cfoo -d bar a1 a2'.split()
#opts, args = getopt.getopt(args, 'abc:d:')
opts = dict(opts)

def printHelp():
    progname = sys.argv[0]
    progname = progname.split('/')[-1] # strip out extended path
    # NOTE: __doc__ is the module docstring; the script needs one
    # (containing '<PROGNAME>') at the top, otherwise __doc__ is None
    # and this line fails.
    help = __doc__.replace('<PROGNAME>', progname, 1)
    print('-' * 60, help, '-' * 60, file=sys.stderr)
    sys.exit()

if '-h' in opts:
    printHelp()

if '-d' not in opts:
    print("\n** ERROR: must specify training data file (opt: -d FILE) **", file=sys.stderr)
    printHelp()

print(args)
if len(args) > 0:
    print("\n** ERROR: unexpected input on command line **", file=sys.stderr)
    printHelp()

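Extend the supplied starter code into a script that processes the training data file, splits each sentence into its tokens, and counts the co-occurrences of tokens and POS tags. Take the tokens as given in the data file (i.e. keep their capitalisation). Your data structure should be a two-level dictionary mapping from tokens to POS tags to counts (i.e. {term → {postag → count}}), where count is the number of times the token occurred with that POS tag. We call this dictionary the term-postag-count dictionary.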
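
To make the shape of the {term → {postag → count}} dictionary concrete, here is a minimal sketch on two made-up sentences. Using defaultdict and Counter from the standard library is a choice made for this illustration, not something the exercise requires, and the name term_postag_counts is hypothetical.

```python
from collections import Counter, defaultdict

# Toy input: two short tagged sentences (made up for illustration).
lines = [
    "the/DT back/NN door/NN",
    "go/VB back/RB to/TO the/DT back/NN",
]

# term -> {postag -> count}; Counter supplies the zero default for new tags.
term_postag_counts = defaultdict(Counter)
for line in lines:
    for item in line.split():
        token, _, tag = item.rpartition('/')
        term_postag_counts[token][tag] += 1

print(dict(term_postag_counts['back']))
print(dict(term_postag_counts['the']))
```

Here "back" ends up with two inner entries because it was seen with two different tags, while "the" has a single entry.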


count_word = dict()
with open('training_data.txt', "r", encoding="utf-8") as in_file:
    #re.findall("[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",s)
    for line in in_file:
        pattern = line.split()
        # matches = pattern.findall(line.lower())
        for allword in pattern:
            word = allword.split('/')[0]   # token
            pa = allword.split('/')[-1]    # POS tag
            if word in count_word:
                if pa in count_word[word]:
                    count_word[word][pa] += 1
                else:
                    count_word[word][pa] = 1
            else:
                count_word[word] = {pa: 1}

            # highest count of any tag seen so far for this word
            number = max(count_word[word].values())
            #print(word,"->",pa)
            print(word, "->", pa, "->", number)
            #print(number)
        # as reconstructed here, this stops after the first line;
        # remove it to count the whole file
        break

Set the run configuration (command-line arguments) before running the code:

-d training.txt

Alternative solution:

def parseLine(line):
    wdtags = line.split()
    wdtagpairs = []
    for wdtag in wdtags:
        parts = wdtag.split('/')
        wdtagpairs.append((parts[0], parts[1]))
    return wdtagpairs

wordTagCounts = {}
# This is main data structure of lexicon - a two-level
# dictionary, mapping {words->{tags->counts}}

print('<reading data for new lexicon ....>', file=sys.stderr)
with open(opts['-d']) as data_in:
    for line in data_in:
        for (wd, tag) in parseLine(line):
            if wd not in wordTagCounts:
                wordTagCounts[wd] = {}
            if tag in wordTagCounts[wd]:
                wordTagCounts[wd][tag] += 1
            else:
                wordTagCounts[wd][tag] = 1
print('<done>', file=sys.stderr)

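By processing the term-postag-count dictionary, determine the answers to the following questions. Your code should print them to the screen.
(a) Determine the full set of POS tags used in the data and their overall relative frequencies. Print them out in descending order of relative frequency. What is the single most common tag?
(b) How many of the items ("types") in your lexicon are ambiguous, i.e. have more than one POS tag? What proportion of the overall tokens in the training data correspond to unambiguous lexical items?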
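
Questions (a) and (b) can be sketched on a toy dictionary as follows. The counts are made up, and the variable names are hypothetical. The sketch reports the proportion of unambiguous tokens, as question (b) asks.

```python
from collections import Counter

# Toy term -> {postag -> count} dictionary (made-up counts).
counts = {
    'the':  {'DT': 5},
    'back': {'NN': 2, 'RB': 1},
    'run':  {'VB': 1, 'NN': 1},
}

# (a) Sum the counts per tag, then divide by the total number of tokens.
tag_totals = Counter()
for tag_counts in counts.values():
    tag_totals.update(tag_counts)
total_tokens = sum(tag_totals.values())
rel_freqs = [(tag, n / total_tokens) for tag, n in tag_totals.most_common()]

# (b) A type is ambiguous if it was seen with more than one tag.
ambiguous_types = [w for w, tc in counts.items() if len(tc) > 1]
unambiguous_tokens = sum(sum(tc.values()) for tc in counts.values() if len(tc) == 1)

for tag, freq in rel_freqs:
    print(f'{tag:>4} {freq:6.2%}')
print('ambiguous types:', len(ambiguous_types))
print('unambiguous token proportion:', unambiguous_tokens / total_tokens)
```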

# ANALYSE word-tag-count dictionary, to compute:
# -- proportion of types that have more than one tag
# -- accuracy naive tagger would have on the training data
# -- most common tags globally

tagCounts = {}
ambiguousTypes = 0
ambiguousTokens = 0
allTypes = len(wordTagCounts)
allTokens = 0
correctTokens = 0

for wd in wordTagCounts:
    values = wordTagCounts[wd].values()
    if len(values) > 1:
        ambiguousTypes += 1
        ambiguousTokens += sum(values)
    correctTokens += max(values)
    allTokens += sum(values)
    for t, c in wordTagCounts[wd].items():
        if t in tagCounts:
            tagCounts[t] += c
        else:
            tagCounts[t] = c

print('Proportion of word types that are ambiguous: %5.1f%% (%d / %d)' % \
      ((100.0 * ambiguousTypes) / allTypes, ambiguousTypes, allTypes), file=sys.stderr)

print('Proportion of tokens that are ambiguous in data: %5.1f%% (%d / %d)' % \
      ((100.0 * ambiguousTokens) / allTokens, ambiguousTokens, allTokens), file=sys.stderr)

print('Accuracy of naive tagger on training data: %5.1f%% (%d / %d)' % \
      ((100.0 * correctTokens) / allTokens, correctTokens, allTokens), file=sys.stderr)

tags = sorted(tagCounts, key=lambda x:tagCounts[x], reverse=True)

print('Top Ten Tags by count:', file=sys.stderr)
for tag in tags[:10]:
    count = tagCounts[tag]
    print('   %9s %6.2f%% (%5d / %d)' % \
          (tag, (100.0 * count) / allTokens, count, allTokens), file=sys.stderr)

I don’t understand the if len(values)>1 loop and why correctTokens += max(values)
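
One way to read the two lines asked about above: values holds the counts of every tag seen with one word, so len(values) > 1 is a check (not a loop) for whether that word occurred with more than one tag. A naive tagger that always picks a word's most frequent tag gets exactly max(values) of that word's training occurrences right. A small sketch with made-up counts:

```python
# Made-up counts for a single word: seen 7 times as NN, 2 as RB, 1 as VB.
values = {'NN': 7, 'RB': 2, 'VB': 1}.values()

is_ambiguous = len(values) > 1      # more than one tag seen -> ambiguous type
occurrences = sum(values)           # all tokens of this word in the training data
correct_if_naive = max(values)      # tokens that carry the most frequent tag

print(is_ambiguous, occurrences, correct_if_naive)
```

Summing max(values) over all words and dividing by the total token count therefore gives the naive tagger's accuracy on the training data.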

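You will face the problem that the test data contains so-called unknown words, i.e. words not seen in the training data. Start by assuming that all unknown words are tagged incorrectly (by assigning every unknown word a non-PTB tag symbol, e.g. UNK, as its POS tag). Determine the tagging accuracy this approach achieves on the test data. Note: the test data is provided in the same format as the training data, i.e. with the gold-standard POS tag attached to each token. So you need to separate the token and the tag in each token/POS pair, apply the naive tagger to the token, and then compare its output with the gold-standard POS tag, computing accuracy as you go.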
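
The scoring procedure described above can be sketched on in-memory data. The function name score, the toy lexicon and the toy test pairs are all hypothetical; a real run would read the pairs from the test file.

```python
def score(tagged_pairs, maxtag, unknown_tag='UNK'):
    """Return (correct, total) for a most-frequent-tag tagger."""
    correct = 0
    total = 0
    for token, gold in tagged_pairs:
        # Known word: its most frequent training tag. Unknown word: default tag.
        predicted = maxtag.get(token, unknown_tag)
        total += 1
        if predicted == gold:
            correct += 1
    return correct, total


# Hypothetical lexicon (word -> most frequent tag) and gold-standard test pairs.
maxtag = {'the': 'DT', 'movie': 'NN'}
test_pairs = [('the', 'DT'), ('movie', 'NN'), ('steamroller', 'NN')]

baseline = score(test_pairs, maxtag)                    # unknown words tagged UNK
with_nn = score(test_pairs, maxtag, unknown_tag='NN')   # unknown words tagged NN
print(baseline, with_nn)
```

Because UNK never matches a gold-standard tag, the first score is a lower bound: every unknown word counts as an error.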

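A key factor in how well real-world tagging systems perform is how they handle unknown words. Explore other ways of handling unknown words in the test data, to achieve the best accuracy for your naive tagger. An obvious first move is to assign every unknown word the single most common tag (which you should have identified above). You might also consider whether there are subclasses of terms that can be easily recognised and assigned a different default tag. If you have the time and motivation, you could explore using word-ending suffixes as cues for predicting the POS of unknown words. For example, English words ending in -ing are often verbs (with POS VBG), while words ending in -ly are usually adverbs (RB).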

# Function to 'guess' tag for unknown words

digitRE = re.compile(r'\d')
jj_ends_RE = re.compile('(ed|us|ic|ble|ive|ary|ful|ical|less)$')

# NOTE: if you uncomment the 'return' at the start of the following
# definition, the score achieved will be that where all unknown words
# are tagged *incorrectly* (as UNK). Uncommenting instead the third
# 'return', will yield the score where the default tag for unknown
# words is NNP. Otherwise, the definition attempts to guess the
# correct tags for unknown words based on their suffix or other
# characteristics.

def tagUnknown(wd):
#    return 'UNK'
#    return 'NN'
#    return 'NNP'
    if wd[0:1].isupper():
        return 'NNP'
    if '-' in wd:
        return 'JJ'
    if digitRE.search(wd):
        return 'CD'
    if jj_ends_RE.search(wd):
        return 'JJ'
    if wd.endswith('s'):
        return 'NNS'
    if wd.endswith('ly'):
        return 'RB'
    if wd.endswith('ing'):
        return 'VBG'
    # NOTE: as shown here, a word matching none of the rules falls through
    # and the function returns None, which never equals a gold-standard tag.

what is if wd[0:1].
what is digitRE.search(wd)
what is jj_ends_RE.search(wd)
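
The three expressions asked about above can be tried in isolation. The sketch below reuses the two regular expressions from tagUnknown; the example words are picked for illustration.

```python
import re

digitRE = re.compile(r'\d')
jj_ends_RE = re.compile('(ed|us|ic|ble|ive|ary|ful|ical|less)$')

# wd[0:1] is a slice holding the first character (or '' for an empty string),
# so unlike wd[0] it cannot raise IndexError on an empty word.
first_upper = 'Boeing'[0:1].isupper()
empty_upper = ''[0:1].isupper()

# search() scans the whole string and returns a match object, or None.
has_digit = digitRE.search('PG-13') is not None
no_digit = digitRE.search('movie') is None

# The trailing $ anchors the listed suffixes to the end of the word.
jj_like = jj_ends_RE.search('gratuitous') is not None
not_jj_like = jj_ends_RE.search('useful-ish') is None

print(first_upper, empty_upper, has_digit, no_digit, jj_like, not_jj_like)
```

In tagUnknown the match object itself is used as the condition: a match object is truthy and None is falsy, so each if fires only when the pattern is found.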

# Apply naive tagging method to test data, and score performance

if '-t' in opts:

    # Compute 'most common' tag for each known word - store in maxtag dictionary
    maxtag = {}
    for wd in wordTagCounts:
        tags = sorted(wordTagCounts[wd], key=lambda x:wordTagCounts[wd][x], reverse=True)
        maxtag[wd] = tags[0]

    print('<tagging test data ....>', file=sys.stderr)

    # Tag each word of test data, and score
    alltest = 0
    correct = 0
    with open(opts['-t'], 'r') as test:   # 'with' closes the file afterwards
        for line in test:
            for wd, truetag in parseLine(line):
                if wd in maxtag:
                    newtag = maxtag[wd]
                else:
                    newtag = tagUnknown(wd)
                alltest += 1
                if newtag == truetag:
                    correct += 1
    print('<done>', file=sys.stderr)

    print("Score on test data: %5.1f%% (%5d / %5d)" % \
          ((100.0*correct)/alltest, correct, alltest), file=sys.stderr)

I don’t understand the tags = …
what is truetag?
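
On the two questions above: the tags = sorted(...) line orders one word's tags by how often each was seen, so that tags[0] is the most frequent one, and truetag is the gold-standard tag that parseLine reads from the test file next to each word. A small sketch with made-up counts (the max() variant is an alternative shown for comparison, not part of the original code):

```python
# Made-up tag counts for one word.
tag_counts = {'RB': 2, 'NN': 7, 'VB': 1}

# sorted() over a dict iterates its keys (the tags); the key function ranks
# each tag by its count, and reverse=True puts the largest count first.
tags = sorted(tag_counts, key=lambda t: tag_counts[t], reverse=True)
most_common = tags[0]

# max() with the same ranking gives the top tag without sorting everything.
also_most_common = max(tag_counts, key=tag_counts.get)

# truetag: the gold-standard tag attached to the word in the test data.
wd, truetag = 'back/NN'.split('/')
is_correct = (most_common == truetag)

print(tags, most_common, also_most_common, is_correct)
```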

Set the run configuration (command-line arguments) before running the code:

-d training.txt -t test.txt
