There/EX is/VBZ no/DT slow-motion/JJ close-up/NN ,/, blood-and-guts/JJ portrayal/NN of/IN the/DT animal/NN 's/POS demise/NN ./.
Keep/VB in/IN mind/NN that/IN this/DT is/VBZ the/DT same/JJ movie/NN in/IN which/WDT a/DT character/NN is/VBZ flattened/VBN by/IN a/DT steamroller/NN only/RB to/TO pop/VB right/JJ back/NN up/IN and/CC peer/VB in/IN the/DT window/NN of/IN a/DT Boeing/NNP 747/CD --/: from/IN the/DT outside/NN --/: as/IN it/PRP takes/VBZ off/IN ./.
I/PRP will/MD be/VB the/DT first/JJ to/TO agree/VB that/IN there/EX is/VBZ much/JJ to/TO be/VB found/VBN wrong/RB with/IN modern/JJ movie/NN making/NN ./.
Many/JJ modern/JJ scriptwriters/NNS seem/VBP to/TO be/VB incapable/JJ of/IN writing/VBG drama/NN ,/, or/CC anything/NN else/RB ,/, without/IN foul-mouthed/JJ cursing/NN ./.
Sex/NN and/CC violence/NN are/VBP routinely/RB included/VBN even/RB when/WRB they/PRP are/VBP irrelevant/JJ to/TO the/DT script/NN ,/, and/CC high-tech/JJ special/JJ effects/NNS are/VBP continually/RB substituted/VBN for/IN good/JJ plot/NN and/CC character/NN development/NN ./.
In/IN short/JJ ,/, we/PRP have/VBP a/DT movie/NN and/CC television/NN industry/NN that/WDT is/VBZ either/DT incapable/JJ or/CC petrified/JJ of/IN making/VBG a/DT movie/NN unless/IN it/PRP carries/VBZ a/DT PG-13/NN or/CC R/NN rating/NN ./.
Hence/RB copious/JJ amounts/NNS of/IN gratuitous/JJ sex/NN ,/, violence/NN and/CC gutter/NN language/NN are/VBP included/VBN as/IN a/DT crutch/NN ./.
However/RB ,/, these/DT faults/NNS are/VBP not/RB the/DT exclusive/JJ property/NN of/IN modern/JJ comedies/NNS ,/, and/CC I/PRP believe/VBP Mr./NNP Knight/NNP errs/VBZ when/WRB he/PRP attempts/VBZ to/TO link/VB this/DT modern/JJ phenomenon/NN too/RB closely/RB to/TO a/DT single/JJ category/NN of/IN movie/NN making/NN ./.
Michael/NNP Smith/NNP St./NNP Louis/NNP Rochester/NNP Telephone/NNP Corp./NNP said/VBD it/PRP agreed/VBD to/TO buy/VB Viroqua/NNP Telephone/NNP Co./NNP of/IN Viroqua/NNP ,/, Wis/NNP ./.
import sys, re, getopt

################################################################
# Command line options handling, and help

opts, args = getopt.getopt(sys.argv[1:], 'hd:t:')

#args = '-a -b -cfoo -d bar a1 a2'.split()
#opts, args = getopt.getopt(args, 'abc:d:')

opts = dict(opts)

def printHelp():
    progname = sys.argv[0]
    progname = progname.split('/')[-1] # strip out extended path
    # __doc__ is the module docstring (not shown in these notes)
    help = __doc__.replace('<PROGNAME>', progname, 1)
    print('-' * 60, help, '-' * 60, file=sys.stderr)
    sys.exit()

if '-h' in opts:
    printHelp()

if '-d' not in opts:
    print("\n** ERROR: must specify training data file (opt: -d FILE) **", file=sys.stderr)
    printHelp()

if len(args) > 0:
    print("\n** ERROR: unexpected input on command line **", file=sys.stderr)
    printHelp()

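Extend the starter code provided to give a script that processes the training data file, splitting each sentence into its tokens and counting the co-occurrences of tokens and POS tags. Take the tokens as given in the data file (i.e. with regard to capitalisation). Your data structure should be a two-level dictionary from tokens to POS tags to counts (i.e. {term → {postag → count}}), where count is the number of times the token occurs with that POS tag. We call this dictionary the term-postag-count dictionary.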

count_word = dict()
with open('training_data.txt', "r", encoding="utf-8") as in_file:
    #re.findall("[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",s)
    for line in in_file:
        pattern = line.split()
        # matches = pattern.findall(line.lower())
        for allword in pattern:
            word = allword.split('/')[0]
            pa = allword.split('/')[-1]
            if word in count_word:
                if pa in count_word[word]:
                    count_word[word][pa] += 1
                else:
                    count_word[word][pa] = 1
            else:
                count_word[word] = {pa: 1}
            number = max(count_word[word].values())
            #print(word,"->",pa)
            print(word, "->", pa, "->", number)
            #print(number)
        break  # stops after the first line; remove to process the whole file


-d training.txt


def parseLine(line):
    wdtags = line.split()
    wdtagpairs = []
    for wdtag in wdtags:
        parts = wdtag.split('/')
        wdtagpairs.append((parts[0], parts[1]))
    return wdtagpairs

wordTagCounts = {}
# This is main data structure of lexicon - a two-level
# dictionary, mapping {words->{tags->counts}}

print('<reading data for new lexicon ....>', file=sys.stderr)
with open(opts['-d']) as data_in:
    for line in data_in:
        for (wd, tag) in parseLine(line):
            if wd not in wordTagCounts:
                wordTagCounts[wd] = {}
            if tag in wordTagCounts[wd]:
                wordTagCounts[wd][tag] += 1
            else:
                wordTagCounts[wd][tag] = 1
print('<done>', file=sys.stderr)
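
One detail worth checking in parseLine: split('/') assumes every token contains exactly one slash. For a token such as 1/2/CD (a made-up example; whether the training file contains tokens like this is an assumption), parts[1] would be '2' rather than 'CD'. Below is a minimal sketch of a variant that splits on the last slash only and builds the same two-level dictionary. The names parse_line_rsplit and count_word_tags are hypothetical and not part of the starter code.

```python
from collections import defaultdict

def parse_line_rsplit(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    Hypothetical variant of parseLine: rsplit('/', 1) splits on the LAST
    slash only, so a word that itself contains '/' keeps its full form.
    Assumes every token contains at least one slash.
    """
    pairs = []
    for wdtag in line.split():
        word, tag = wdtag.rsplit('/', 1)
        pairs.append((word, tag))
    return pairs

def count_word_tags(lines):
    """Build the two-level {word -> {tag -> count}} dictionary."""
    counts = defaultdict(dict)
    for line in lines:
        for word, tag in parse_line_rsplit(line):
            counts[word][tag] = counts[word].get(tag, 0) + 1
    return dict(counts)

# Made-up sample lines in the same word/TAG format as the data file
sample = ["the/DT back/NN of/IN the/DT 1/2/CD", "pop/VB right/JJ back/RB up/IN"]
counts = count_word_tags(sample)
print(counts["the"])   # {'DT': 2}
print(counts["back"])  # {'NN': 1, 'RB': 1}
print(counts["1/2"])   # {'CD': 1}
```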

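(a) Determine the full set of POS tags used in the data and their overall relative frequencies. Print them out in descending order of relative frequency. What is the single most common tag?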
(b) How many of the items ("types") in your dictionary are ambiguous, i.e. have more than one tag?

# ANALYSE word-tag-count dictionary, to compute:
# -- proportion of types that have more than one tag
# -- accuracy naive tagger would have on the training data
# -- most common tags globally

tagCounts = {}
ambiguousTypes = 0
ambiguousTokens = 0
allTypes = len(wordTagCounts)
allTokens = 0
correctTokens = 0

for wd in wordTagCounts:
    values = wordTagCounts[wd].values()
    if len(values) > 1:
        ambiguousTypes += 1
        ambiguousTokens += sum(values)
    correctTokens += max(values)
    allTokens += sum(values)
    for t, c in wordTagCounts[wd].items():
        if t in tagCounts:
            tagCounts[t] += c
        else:
            tagCounts[t] = c

print('Proportion of word types that are ambiguous: %5.1f%% (%d / %d)' % \
      ((100.0 * ambiguousTypes) / allTypes, ambiguousTypes, allTypes), file=sys.stderr)

print('Proportion of tokens that are ambiguous in data: %5.1f%% (%d / %d)' % \
      ((100.0 * ambiguousTokens) / allTokens, ambiguousTokens, allTokens), file=sys.stderr)

print('Accuracy of naive tagger on training data: %5.1f%% (%d / %d)' % \
      ((100.0 * correctTokens) / allTokens, correctTokens, allTokens), file=sys.stderr)

tags = sorted(tagCounts, key=lambda x:tagCounts[x], reverse=True)

print('Top Ten Tags by count:', file=sys.stderr)
for tag in tags[:10]:
    count = tagCounts[tag]
    print('   %9s %6.2f%% (%5d / %d)' % \
          (tag, (100.0 * count) / allTokens, count, allTokens), file=sys.stderr)
  • I don’t understand the if len(values)>1 loop and why correctTokens += max(values)
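
On the question above: if len(values) > 1 is a condition, not a loop. values holds the counts for one word, one count per tag, so more than one value means the word was seen with more than one tag, i.e. it is ambiguous. correctTokens += max(values) works because the naive tagger always gives a word its most frequent tag, so on the training data it gets exactly max(values) of that word's occurrences right. A small worked example, using made-up counts rather than numbers from the data file:

```python
# Made-up counts for illustration only
word_tag_counts = {
    "back": {"NN": 3, "RB": 5, "VB": 2},  # seen with 3 tags: ambiguous
    "the": {"DT": 10},                    # seen with 1 tag: unambiguous
}

ambiguous_types = 0
ambiguous_tokens = 0
correct_tokens = 0
all_tokens = 0

for wd, tag_counts in word_tag_counts.items():
    values = tag_counts.values()
    if len(values) > 1:
        # more than one tag recorded for this word
        ambiguous_types += 1
        ambiguous_tokens += sum(values)
    # the naive tagger picks the most frequent tag, so it is right
    # on max(values) of this word's sum(values) occurrences
    correct_tokens += max(values)
    all_tokens += sum(values)

print(ambiguous_types, ambiguous_tokens)  # 1 10
print(correct_tokens, all_tokens)         # 15 20
```

Here the naive tagger would score 15 / 20 = 75% on the training data: it tags all 10 occurrences of "back" as RB (5 right, 5 wrong) and all 10 occurrences of "the" as DT (all right).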


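A key factor in how well real-world tagging systems perform is their handling of unknown words. Explore other ways of handling unknown words in the test data, in order to get the best accuracy from your naive tagger. An obvious first move is to assign every unknown word the single most common tag (which you should already have identified above). You might also consider whether there are subclasses of terms that can be easily identified and assigned a different default tag. If you have the time and motivation, you can explore using word-ending suffixes as cues for predicting the POS of unknown words. For example, words ending in -ing in English are often verbs (with POS VBG), while words ending in -ly are usually adverbs (RB).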

# Function to 'guess' tag for unknown words

digitRE = re.compile(r'\d')
jj_ends_RE = re.compile(r'(ed|us|ic|ble|ive|ary|ful|ical|less)$')

# NOTE: if you uncomment the 'return' at the start of the following
# definition, the score achieved will be that where all unknown words
# are tagged *incorrectly* (as UNK). Uncommenting instead the third
# 'return', will yield the score where the default tag for unknown
# words is NNP. Otherwise, the definition attempts to guess the
# correct tags for unknown words based on their suffix or other
# characteristics.

def tagUnknown(wd):
#    return 'UNK'
#    return 'NN'
#    return 'NNP'
    if wd[0:1].isupper():
        return 'NNP'
    if '-' in wd:
        return 'JJ'
    if digitRE.search(wd):     # condition reconstructed from digitRE above
        return 'CD'
    if jj_ends_RE.search(wd):  # condition reconstructed from jj_ends_RE above
        return 'JJ'
    if wd.endswith('s'):
        return 'NNS'
    if wd.endswith('ly'):
        return 'RB'
    if wd.endswith('ing'):
        return 'VBG'
    return 'NN'  # assumed fallback; without it the function returns None
  • what is if wd[0:1].isupper()?
  • what is
  • what is
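
On wd[0:1]: this is a slice holding the first character of the word. Unlike wd[0], a slice does not raise IndexError on an empty string; it returns '' and ''.isupper() is False. The two regular expressions defined before tagUnknown are meant to be used with .search(): digitRE matches if the word contains a digit anywhere, and jj_ends_RE matches if the word ends in one of the listed adjective suffixes (the $ anchors the match to the end of the word). A quick check, with example words chosen purely for illustration:

```python
import re

digitRE = re.compile(r'\d')
jj_ends_RE = re.compile(r'(ed|us|ic|ble|ive|ary|ful|ical|less)$')

# Slicing: first character, safe even for an empty string
upper_checks = {w: w[0:1].isupper() for w in ["Boeing", "window", ""]}
print(upper_checks)  # {'Boeing': True, 'window': False, '': False}

# search() returns a match object (truthy) or None (falsy)
has_digit = {w: bool(digitRE.search(w)) for w in ["747", "PG-13", "movie"]}
print(has_digit)     # {'747': True, 'PG-13': True, 'movie': False}

# '$' means the suffix must be at the END of the word
jj_like = {w: bool(jj_ends_RE.search(w)) for w in ["careless", "lesson"]}
print(jj_like)       # {'careless': True, 'lesson': False}
```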
# Apply naive tagging method to test data, and score performance

if '-t' in opts:

    # Compute 'most common' tag for each known word - store in maxtag dictionary
    maxtag = {}
    for wd in wordTagCounts:
        tags = sorted(wordTagCounts[wd], key=lambda x:wordTagCounts[wd][x], reverse=True)
        maxtag[wd] = tags[0]

    print('<tagging test data ....>', file=sys.stderr)

    # Tag each word of test data, and score
    alltest = 0
    correct = 0
    with open(opts['-t'], 'r') as test:
        for line in test:
            for wd, truetag in parseLine(line):
                if wd in maxtag:
                    newtag = maxtag[wd]
                else:
                    newtag = tagUnknown(wd)
                alltest += 1
                if newtag == truetag:
                    correct += 1
    print('<done>', file=sys.stderr)

    print("Score on test data: %5.1f%% (%5d / %5d)" % \
          ((100.0*correct)/alltest, correct, alltest), file=sys.stderr)
  • I don’t understand the tags = …
  • what is truetag?
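
On tags = sorted(...): iterating over a dictionary yields its keys, so sorted(wordTagCounts[wd], key=..., reverse=True) returns that word's tags ordered by count, highest first, and tags[0] is the most frequent tag. truetag is the tag attached to the word in the test file (the gold-standard answer); the tagger's guess, newtag, is compared against it to compute the score. A small sketch with made-up counts and test pairs:

```python
tag_counts = {"NN": 3, "RB": 5, "VB": 2}  # made-up counts for one word

# Sort the tags (dictionary keys) by their count, largest first
tags = sorted(tag_counts, key=lambda t: tag_counts[t], reverse=True)
print(tags)     # ['RB', 'NN', 'VB']
print(tags[0])  # RB

# max() with a key gives the same most frequent tag without a full sort
best = max(tag_counts, key=tag_counts.get)
print(best)     # RB

# Scoring: truetag comes from the test data, maxtag[wd] is the tagger's guess
test_pairs = [("back", "RB"), ("back", "NN")]  # (word, truetag), illustrative
maxtag = {"back": best}
correct = sum(1 for wd, truetag in test_pairs if maxtag[wd] == truetag)
print(correct, len(test_pairs))  # 1 2
```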


-d training.txt -t test.txt

