Corpus Data Processing Case Studies (Computing Collocation Strength, Removing Stopwords from a Word List, and KWIC Concordancing)

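7.5 Computing Collocation Strength

Collocation is a marker of idiomatic language and an important indicator for distinguishing native from non-native usage, so both corpus linguistics and language teaching attach great importance to its study. For example, Chinese "吃饭" is a verb-noun collocation in which the verb "吃" (eat) combines with the noun "饭" (meal), whereas English more often says "have a meal" and rarely "eat a meal". Chinese can say either "喝茶" (drink tea) or "吃茶" (eat tea), whereas English says only "drink tea", never "eat tea". Likewise, Chinese says "吃药" (literally "eat medicine"), while English says "take medicine". Computational linguistics also pays close attention to collocation, and sometimes refers to the N-grams discussed in the previous section as collocations.
We can measure collocation strength simply by counting N-gram frequencies, or with statistical tests such as chi-square, pointwise mutual information (PMI), and the log-likelihood ratio. The NLTK library provides modules that apply these tests to bigrams and longer N-grams.
This section discusses how to compute N-gram frequencies and how to compute chi-square, PMI, and similar scores for N-grams. In addition, when we discuss the Stanford CoreNLP package in Section 7.12, we will show how to use syntactic parsing to extract verb-noun, adjective-noun, and other collocations and to compute their collocation strength.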
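
To make the PMI score mentioned above concrete, here is a minimal sketch that computes it by hand from raw counts, using PMI(w1, w2) = log2(c(w1, w2) * N / (c(w1) * c(w2))). The token list and the helper name pmi are assumptions made for illustration; they are not taken from the book.

```python
import math
from collections import Counter

# Toy token list (an assumption for illustration, not the book's data).
tokens = "never saw my father never saw my mother my father".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)


def pmi(w1, w2):
    """Pointwise mutual information of an observed bigram, in bits.

    Only valid for bigrams that occur at least once; an unseen pair
    would give log2(0) and raise an error.
    """
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * n_tokens / (unigram_counts[w1] * unigram_counts[w2]))


print("never saw:", pmi("never", "saw"))
print("my father:", pmi("my", "father"))
```

In this toy data "never" is always followed by "saw", while "my" is followed by two different nouns, so "never saw" receives the higher PMI.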

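7.5.1 Computing Collocation Frequency

In Section 7.4 we extracted the N-grams of a text, cleaned them, and saved the result in the list n_grams_AlphaNum. We can use the frequency of these N-grams as a measure of their strength, for example by keeping only those that occur at least twice. See the code below.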

# Compute the frequency of each n-gram in n_grams_AlphaNum.
# Put this snippet after the second snippet of code in Section 7.4.
freq_dict = {}
for i in n_grams_AlphaNum:
    if i in freq_dict:
        freq_dict[i] += 1
    else:
        freq_dict[i] = 1

# Print every four-word n-gram that occurs at least twice.
for j in freq_dict:
    if freq_dict[j] >= 2:
        print(j[0], j[1], j[2], j[3], '\t', freq_dict[j])

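Because the sample text is very small, every four-word N-gram extracted from it has a frequency of 1, so this example prints nothing. Readers can try it on a larger text.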
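
The same counting can be sketched more compactly with collections.Counter. The token list below is a hypothetical stand-in for the cleaned output of Section 7.4, chosen so that one four-word N-gram repeats; the names four_grams and repeated are likewise illustrative.

```python
from collections import Counter

# Hypothetical token list standing in for the cleaned output of Section 7.4.
tokens = ("on the authority of his tombstone and "
          "on the authority of my sister").split()

# Build four-word n-grams as tuples so they can be used as dictionary keys.
n = 4
four_grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

freq = Counter(four_grams)

# Keep only the n-grams seen at least twice, most frequent first.
repeated = [(gram, count) for gram, count in freq.most_common() if count >= 2]
for gram, count in repeated:
    print(" ".join(gram), "\t", count)
```

Counter removes the need for the explicit if/else update, and most_common() returns the N-grams already sorted by frequency.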

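7.5.2 Computing the Collocation Strength of Bigrams

The collocations module of NLTK provides classes such as BigramAssocMeasures for computing the association strength of bigrams. See the code below.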

import nltk
import nltk.collocations

string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''

# Tokenize the lowercased text.
string_tokenized = nltk.word_tokenize(string.lower())

# Collect all bigrams, then score them with the log-likelihood ratio.
finder = nltk.collocations.BigramCollocationFinder.from_words(string_tokenized)
bgm = nltk.collocations.BigramAssocMeasures()
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)

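In the code above, we first import nltk.collocations and define the text to be processed, then tokenize the text with nltk.word_tokenize(). Next, we use BigramCollocationFinder.from_words() from nltk.collocations to extract the bigrams of the tokenized text and assign the result to finder. If we run print(finder), the result is <nltk.collocations.BigramCollocationFinder object at 0x919eacc>; in other words, finder is a BigramCollocationFinder object.
We then create an nltk.collocations.BigramAssocMeasures() instance and call finder.score_ngrams() to compute the likelihood_ratio score of each bigram. finder.score_ngrams() takes a single argument, the association measure to apply, which can be bgm.likelihood_ratio, bgm.student_t, bgm.chi_sq, bgm.pmi, bgm.dice, and so on. In this example we use bgm.likelihood_ratio. The printed result (partial) is as follows: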

[(('never', 'saw'), 19.91866838483344),
 (('my', 'father'), 19.09923162702352),
 (('father', "'s"), 16.09958337506456),
 (('(', 'for'), 11.354974483983558),
 (('--', 'mrs.'), 11.354974483983558),
 (('a', 'square'), 11.354974483983558),
 (('an', 'odd'), 11.354974483983558),
 (('any', 'likeness'), 11.354974483983558),
 (('black', 'hair'), 11.354974483983558),
 (('curly', 'black'), 11.354974483983558),
 (('dark', 'man'), 11.354974483983558),
 .......
 (('of', 'the'), 1.6576837641065838),
 ((',', 'my'), 0.4658676573818533)]
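
The list above contains punctuation tokens and many bigrams that occur only once. One possible way to narrow it down, sketched here under the assumption that a toy token list split on whitespace is acceptable, is to filter the finder before scoring with apply_word_filter() and apply_freq_filter(), and then ask for the top results with nbest(). The thresholds are illustrative choices, not values from the book.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy token list (an assumption for illustration); str.split() is used so
# the sketch does not depend on any downloaded tokenizer data.
tokens = ("as i never saw my father or my mother and never saw "
          "any likeness of either of them").split()

finder = BigramCollocationFinder.from_words(tokens)
bgm = BigramAssocMeasures()

# Drop bigrams containing a token that is not purely alphabetic
# (for example punctuation produced by a tokenizer).
finder.apply_word_filter(lambda w: not w.isalpha())

# Drop bigrams that occur fewer than 2 times.
finder.apply_freq_filter(2)

# Return at most 5 of the best-scoring bigrams, without their scores.
top = finder.nbest(bgm.likelihood_ratio, 5)
print(top)
```

On this toy input only "never saw" occurs twice, so it is the only bigram that survives the frequency filter.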

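7.5.3 Computing the Collocation Strength of Trigrams

As with bigrams, the collocations module of NLTK provides classes such as TrigramAssocMeasures for computing the association strength of trigrams. See the code below.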

import nltk
import nltk.collocations

string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''

# Tokenize the lowercased text.
string_tokenized = nltk.word_tokenize(string.lower())

# Collect all trigrams, then score them with the log-likelihood ratio.
bgm = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_words(string_tokenized)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)

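The only difference from the bigram computation is that BigramAssocMeasures and BigramCollocationFinder are replaced by TrigramAssocMeasures and TrigramCollocationFinder. As with bigrams, the test used to compute trigram collocation strength is passed to finder.score_ngrams(). In this example we use bgm.likelihood_ratio.
Finally, we print the result, which (in part) is as follows: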

[(('my', 'father', "'s"), 35.19881500208808), (('never', 'saw', 'any'), 28.50105414657721), (('my', 'father', 'or'), 26.635121101238198), (('and', 'never', 'saw'), 25.747333628748777), (('i', 'never', 'saw'), 25.747333628748777), ((',', 'on', 'the'), 9.527528647915378)]
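
Because the measure is just an argument to score_ngrams(), the same trigrams can be ranked under several measures for comparison. The sketch below assumes a toy token list and uses the hypothetical variable names tam and best; raw_freq, pmi, and likelihood_ratio are all methods of TrigramAssocMeasures.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy token list (an assumption for illustration, not the book's text).
tokens = "my father or my mother and my father s family name".split()

finder = TrigramCollocationFinder.from_words(tokens)
tam = TrigramAssocMeasures()  # hypothetical variable name

# Score the same trigrams with three measures and keep the top-ranked
# (trigram, score) pair under each one.
best = {}
for name in ("raw_freq", "pmi", "likelihood_ratio"):
    measure = getattr(tam, name)
    scored = finder.score_ngrams(measure)  # sorted from highest to lowest score
    best[name] = scored[0]
    print(name, scored[0])
```

With so little data every trigram occurs once, so raw frequency cannot separate them; on a larger text the rankings under the different measures would be expected to diverge.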

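7.6 Removing Stopwords from a Word List

Stopwords are very high-frequency words in a text, such as pronouns, prepositions, and adverbs. When analysing a word list we usually want to look at content words and are less interested in stopwords, so we can use Python to remove the stopwords and focus on the remaining words. NLTK ships with stopword lists for many languages; the English list can be loaded with stopwords.words('english'). See the code below.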

import nltk
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')
print(stopwords_list)  # print the stopword list

string = '''I give Pirrip as my father's family name, on the authority of his tombstone and my sister,--Mrs. Joe Gargery, who married the blacksmith. As I never saw my father or my mother, and never saw any likeness of either of them (for their days were long before the days of photographs), my first fancies regarding what they were like were unreasonably derived from their tombstones. The shape of the letters on my father's, gave me an odd idea that he was a square, stout, dark man, with curly black hair.'''

# Tokenize the lowercased text and print every token that is not a stopword.
wordlist = nltk.word_tokenize(string.lower())
for word in wordlist:
    if word not in stopwords_list:
        print(word)

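We define the text to be processed, tokenize it with nltk.word_tokenize() to build the word list, and assign the result to the variable wordlist. Finally, a for ... in loop iterates over the words in wordlist and prints each word that is not in the stopword list.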
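
A variation on the loop above, sketched with a tiny hand-written stopword set so that it runs without downloading any NLTK data: converting the stopwords to a set makes each membership test fast, and an extra isalpha() check also drops punctuation tokens and fragments such as "'s". The set contents and the name content_words are assumptions for illustration; in practice the set would be built with set(stopwords.words('english')).

```python
# Tiny hand-written stopword set, an assumption for illustration only;
# in practice it would be set(stopwords.words('english')).
stopword_set = {"i", "my", "as", "on", "the", "of", "his", "and"}

# Tokens as a tokenizer might return them for the start of the sample text.
tokens = ["i", "give", "pirrip", "as", "my", "father", "'s", "family",
          "name", ",", "on", "the", "authority", "of", "his", "tombstone"]

# Keep purely alphabetic tokens that are not stopwords.
content_words = [w for w in tokens if w.isalpha() and w not in stopword_set]
print(content_words)
```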

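7.7 KWIC Concordancing of a Corpus

When searching for a keyword in WordSmith or AntConc, the results are usually displayed in Key Word in Context (KWIC) format: the search keyword is aligned in the centre of each line, with a certain number of words or characters on either side as context, so that the researcher can read the keyword in context. NLTK provides the concordance() function for KWIC searches.
Consider the following example. We want to search the text ge.txt for the keyword 'but' and return the results in KWIC format. The code is as follows: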

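import nltk

# Open ge.txt and read its full contents.
with open(r'D:\works\文本分析\leopythonbookdata-master\texts\ge.txt', 'r') as file_in:
    raw_text = file_in.read()

# Tokenize, wrap the tokens in an nltk.Text object, and run the search.
tokens = nltk.word_tokenize(raw_text)
nltk_text = nltk.Text(tokens)
nltk_text.concordance('but')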

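The code above first imports nltk and opens a file handle for ge.txt. It then reads ge.txt and tokenizes the text. Next, nltk.Text() converts the list of tokens into an NLTK Text object, because concordance() works only on Text objects. Finally, concordance() searches for 'but'.
The basic form of the call is concordance(keyword, width=75, lines=25), where keyword is the word to search for (the parameter is named word in NLTK), width is the number of characters in each result line, and lines is the maximum number of concordance lines returned. The defaults given here are 75 characters and 25 lines; some NLTK releases use a width of 79 by default, so check the installed version. To accept the defaults, pass only the search word; width and lines can also be set to other values.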
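
To show what a KWIC display involves, here is a minimal pure-Python sketch that collects a fixed window of tokens on each side of every hit and right-aligns the left context so the keyword lines up. It is not NLTK's implementation; the function name kwic, the window parameter, and the token list are assumptions for illustration.

```python
def kwic(tokens, keyword, window=4):
    """Return (left, node, right) context strings for each hit of keyword.

    Matching is case-insensitive; window is the number of tokens kept
    on each side of the hit.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Toy token list (an assumption for illustration, not the contents of ge.txt).
tokens = ("I never saw my father but I saw his tombstone , "
          "but never his likeness").split()

lines = kwic(tokens, "but", window=3)

# Right-align the left context so that the keyword forms a column.
left_width = max((len(left) for left, _, _ in lines), default=0)
for left, node, right in lines:
    print(f"{left:>{left_width}}  {node}  {right}")
```

This sketch measures context in tokens, whereas the width argument of concordance() is described above in characters.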
