python中处理WordNet

>>>from nltk.corpus import wordnet as wn

>>> wn.synsets('motorcar')

>>> wn.synset('car.n.01').lemma_names

2.5 WordNet

WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus(辞典)but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym(同义词)sets. We’ll begin by looking at synonyms and how they are accessed in WordNet.

Senses and Synonyms 意义和同义词

Consider the sentence in (1a). If we replace the word motorcar in (1a) with automobile, to get (1b), the meaning of the sentence stays pretty much the same:

(1) a. Benz is credited with the invention of the motorcar.

b. Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e., they are synonyms.

We can explore these words with the help of WordNet:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('motorcar')

[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or “synonym set,”(同义词集)a collection of synonymous words (or “lemmas”):

>>> wn.synset('car.n.01').lemma_names

['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g., car can also signify a train carriage, a gondola(货车), or an elevator car. However, we are only interested in the single meaning that is common to all words of this synset. Synsets also come with a prose(平凡的) definition and some example sentences:

>>> wn.synset('car.n.01').definition

'a motor vehicle with four wheels; usually propelled by an internal combustion engine(内燃机)'

>>> wn.synset('car.n.01').examples

['he needs a car to get to work']

Although definitions help humans to understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma(一个同义词集的单词配对称为词条). We can get all the lemmas for a given synset①, look up a particular lemma②, get the synset corresponding to a lemma③, and get the “name” of a lemma④:

>>> wn.synset('car.n.01').lemmas ①

[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),

Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]

>>> wn.lemma('car.n.01.automobile') ②

Lemma('car.n.01.automobile')

>>> wn.lemma('car.n.01.automobile').synset ③

Synset('car.n.01')

>>> wn.lemma('car.n.01.automobile').name ④

'automobile'

Unlike the words automobile and motorcar, which are unambiguous and have one synset, the word car is ambiguous, having five synsets:

>>> wn.synsets('car')

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),

Synset('cable_car.n.01')]

>>> for synset in wn.synsets('car'):

... print synset.lemma_names

...

['car', 'auto', 'automobile', 'machine', 'motorcar']

['car', 'railcar', 'railway_car', 'railroad_car']

['car', 'gondola']

['car', 'elevator_car']

['cable_car', 'car']

For convenience, we can access all the lemmas involving the word car as follows:

>>> wn.lemmas('car')

[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'),

Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

Your Turn: Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the same operations shown earlier.

The WordNet Hierarchy WordNet层次结构

WordNet synsets correspond to abstract concepts(抽象的概念), and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are calledunique beginnersor root synsets(根同义词集). Others, such as gas guzzler(油老虎) and hatchback(带掀式后背的小轿车), are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2-8.

Figure 2-8. Fragment of WordNet concept hierarchy: Nodes correspond to synsets; edges indicate the hypernym(上位词)/hyponym(下位词) relation, i.e., the relation between superordinate(同hypernym) and subordinate(从属的) concepts

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific—the (immediate) hyponyms.

>>> motorcar = wn.synset('car.n.01')

>>> types_of_motorcar = motorcar.hyponyms()

>>> types_of_motorcar[26]

Synset('ambulance.n.01')

>>> sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas])

['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',

'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',

'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',

'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',

'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',

'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',

'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',

'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',

'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',

'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',

'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon',

'wagon']

We can also navigate up(浏览) the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

>>> motorcar.hypernyms()

[Synset('motor_vehicle.n.01')]

>>> paths = motorcar.hypernym_paths()

>>> len(paths)

2

>>> [synset.name for synset in paths[0]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',

'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01',

'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

>>> [synset.name for synset in paths[1]]

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01',

'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01',

'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']

We can get the most general hypernyms (or root hypernyms) of a synset as follows:

>>> motorcar.root_hypernyms()

[Synset('entity.n.01')]

Your Turn:Try out NLTK’s convenient graphical WordNet browser: nltk.app.wordnet(). Explore the WordNet hierarchy by following the hypernym and hyponym links.

More Lexical Relations 更多词汇关系

Hypernyms and hyponyms are called lexical relations(词汇关系) because they relate one synset to another. These two relations navigate up and down the “is-a” hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms 部分) or to the things they are contained in (holonyms整体). For example, the parts of a tree are its trunk(树干), crown(树冠), and so on; these are the part_meronyms(). The substance(实质) a tree is made of includes heartwood(心材) and sapwood(边材), i.e., the substance_meronyms(). A collection of trees forms a forest, i.e., the member_holonyms():

>>> wn.synset('tree.n.01').part_meronyms()

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('stump.n.01'),

Synset('trunk.n.01'), Synset('limb.n.02')]

>>> wn.synset('tree.n.01').substance_meronyms()

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

>>> wn.synset('tree.n.01').member_holonyms()

[Synset('forest.n.01')]

To see just how intricate(复杂的) things can get, consider the word mint, which has several closely related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

>>> for synset in wn.synsets('mint', wn.NOUN):

... print synset.name + ':', synset.definition

...

batch.n.02: (often followed by `of') a large number or amount or extent

mint.n.02: any north temperate(北温带)plant of the genus Mentha with aromatic leaves and

small mauve(淡紫色) flowers

mint.n.03: any member of the mint family of plants

mint.n.04: the leaves of a mint plant used fresh or candied

mint.n.05: a candy(糖果) that is flavored(加味)with a mint oil

mint.n.06: a plant where money is coined by authority of the government

>>> wn.synset('mint.n.04').part_holonyms()

[Synset('mint.n.02')]

>>> wn.synset('mint.n.04').substance_holonyms()

[Synset('mint.n.05')]

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails(蕴涵) stepping. Some verbs have multiple entailments:

>>> wn.synset('walk.v.01').entailments()

[Synset('step.v.01')]

>>> wn.synset('eat.v.01').entailments()

[Synset('swallow.v.01'), Synset('chew.v.01')]

>>> wn.synset('tease.v.03').entailments()

[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g., antonymy(反义词组):

>>> wn.lemma('supply.n.02.supply').antonyms()

[Lemma('demand.n.02.demand')]

>>> wn.lemma('rush.v.01.rush').antonyms()

[Lemma('linger.v.04.linger')]

>>> wn.lemma('horizontal.a.01.horizontal').antonyms()

[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]

>>> wn.lemma('staccato.r.01.staccato').antonyms() 不连续的

[Lemma('legato.r.01.legato')] 连奏的

You can see the lexical relations, and the other methods defined on a synset, using

dir(). For example, try dir(wn.synset('harmony.n.02')).

Semantic Similarity 语义相似度

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term such as vehicle will match documents containing specific terms such aslimousine(豪华轿车).

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common(共同之处) (see Figure 2-8). If two synsets share a very specific hypernym—one that is low down(实情?) in the hypernym hierarchy—they must be closely related.

>>> right = wn.synset('right_whale.n.01')

>>> orca = wn.synset('orca.n.01')

>>> minke = wn.synset('minke_whale.n.01')

>>> tortoise = wn.synset('tortoise.n.01')

>>> novel = wn.synset('novel.n.01')

>>> right.lowest_common_hypernyms(minke)

[Synset('baleen_whale.n.01')]

>>> right.lowest_common_hypernyms(orca)

[Synset('whale.n.02')]

>>> right.lowest_common_hypernyms(tortoise)

[Synset('vertebrate.n.01')]

>>> right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

Of course we know that whale is very specific (and baleen whale even more so), whereas vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

>>> wn.synset('baleen_whale.n.01').min_depth()

14

>>> wn.synset('whale.n.02').min_depth()

13

>>> wn.synset('vertebrate.n.01').min_depth()

8

>>> wn.synset('entity.n.01').min_depth()

0

Similarity measures have been defined over the collection of WordNet synsets that incorporate this insight. For example,path_similarityassigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found 没有路径就返回-1). Comparing a synset with itself will return 1(与自己比较返回1). Consider the following similarity scores, relating right whale(露脊鲸) to minke whale(小须鲸

), orca(逆戟鲸), tortoise, and novel. Although the numbers won’t mean much, they decrease as we move away from the semantic space(语义空间) of sea creatures to inanimate objects(静物).

>>> right.path_similarity(minke)

0.25

>>> right.path_similarity(orca)

0.16666666666666666

>>> right.path_similarity(tortoise)

0.076923076923076927

>>> right.path_similarity(novel)

0.043478260869565216

Several other similarity measures are available; you can typehelp(wn) for more information. NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verb net.

python中处理WordNet相关推荐

  1. Python中常用的数据分析工具(模块)有哪些?

    本期Python培训分享:Python中常用的数据分析工具(模块)有哪些?Python本身的数据分析功能并不强,需要安装一些第三方的扩展库来增强它的能力.我们课程用到的库包括NumPy.Pandas. ...

  2. 独家 | 快速掌握spacy在python中进行自然语言处理(附代码链接)

    作者:Paco Nathan 翻译:笪洁琼 校对:和中华 本文约6600字,建议阅读15分钟. 本文简要介绍了如何使用spaCy和Python中的相关库进行自然语言处理(有时称为"文本分析& ...

  3. 【Python】Python中的文本处理

    作者 | KahEm Chu 编译 | VK 来源 | Towards Data Science 互联网连接了世界,而Facebook.Twitter和Reddit等社交媒体为人们提供了表达自己对某个 ...

  4. python中nlp的库_用于nlp的python中的网站数据清理

    python中nlp的库 The most important step of any data-driven project is obtaining quality data. Without t ...

  5. 推荐 :快速掌握spacy在python中进行自然语言处理(附代码链接)

    作者:Paco Nathan 翻译:笪洁琼 校对:和中华 本文约6600字,建议阅读15分钟. 本文简要介绍了如何使用spaCy和Python中的相关库进行自然语言处理(有时称为"文本分析& ...

  6. python中文文档-Python语言、主要工具与类库中文文档

    Python是Guido van Rossum在1989年圣诞节期间,为了打发无聊的圣诞节而编写的一个编程语言. Python 提供了非常完善的基础代码库,覆盖了网络.文件.GUI.数据库.文本等大量 ...

  7. 如何优雅的在python中暂停死循环?

    死循环 有时候在工作中可能会遇到要一直执行某个功能的程序,这时候死循环就派上用途了,python中死循环的具体形式大致如下 while True:run_your_code() 结束死循环 通常我们结 ...

  8. 关于python中的dict和defaultdict

    dict 在Python中如果访问字典中不存在的键,会引发KeyError异常,所以一般当我们比如统计一句话的词频时候,我们总是使用这样的处理方式: strings = ('puppy', 'kitt ...

  9. python中的新式类与旧式类的一些基于descriptor的概念(上)

    python中基于descriptor的一些概念(上) 1. 前言 2. 新式类与经典类 2.1 内置的object对象 2.2 类的方法 2.2.1 静态方法 2.2.2 类方法 2.3 新式类(n ...

  10. Python中yield和yield from的用法

    yield 后面接的是 future 对象 调用方 委托生成器 yield from 直接给出循环后的结果 yield from 委托者和子生成器直接通信 yield from 直接处理stopIte ...

最新文章

  1. Linux有问必答-如何创建和挂载XFS文件系统
  2. java 表现层:jsp、freemarker、velocity
  3. 论信息系统的项目范围管理
  4. 一起学nRF51xx 15 - spis
  5. UGUI_判断鼠标或者手指是否点击在UI上
  6. c++ 用类统计不及格人数_统计小课堂13
  7. 原生JS实现淡入淡出效果(fadeIn/fadeOut/fadeTo)
  8. 如何通过属性给实体赋值
  9. 在CentOS 6上使用yum安装lnmp服务
  10. 触发更新机制_王者荣耀1.14更新:11名英雄调整,韩信加强,鲁班大师重做
  11. android sqlite SQLiteDatabase 操作大全 不看后悔!必收藏!看后精通SQLITE (第三部分,完整代码)
  12. qq手机令牌 for android3.3 官方安装版,原QQ安全助手|QQ手机管家 for Android 安卓版v3.3.0 - PC6安卓网...
  13. c语言选择题题及答案,C语言选择题练习及答案.doc
  14. rf扫描枪_RF枪操作的简要步骤
  15. 2022A特种设备相关管理(电梯)特种作业证考试题库及在线模拟考试
  16. win7计算机属性资源管理器停止工作,Win7系统Windows资源管理器已停止工作怎么解决?...
  17. Pycharm中的红色小闪电含义
  18. exp/expdp 与 imp/impdp命令导入导出数据库详解
  19. 京东12年被裁,昨天赔偿到账了,加调休和年假总共47万多,感谢公司!
  20. 私有化部署——企业数据保护伞

热门文章

  1. Flask-Uploads文件上传的简单使用
  2. JavaScript开发工具大全
  3. [转载]从MyEclipse到IntelliJ IDEA-让你摆脱鼠标,全键盘操作
  4. jQuery动画的实现
  5. Smuxi 0.8.10 发布 - IRC 客户端软件
  6. echarts改变颜色属性的demo
  7. 【bzoj4897】[Thu Summer Camp2016]成绩单 区间dp
  8. Hadoop、Hbase基本命令及调优方式
  9. 无线安全审计工具FruityWifi初体验
  10. 【网络流24题】魔术球