How to extract word stems from text in MATLAB: English stemming algorithms

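There are many ways to stem English words, and in practice stemming can involve machine learning, data mining and other fields.

This article covers a few of the original algorithms that are easy to implement: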

Lovins (1968)

Porter (1980)

Porter2 (2000)

1. Lovins

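The Lovins stemmer is the earliest of these algorithms.

1.1. Overview

The algorithm uses the following components:

ending: a word suffix. There are 294 of them; the full list is in 1.Appendix.A.

condition: the condition under which an ending may be removed. Each ending is tied to one condition, and there are 29 conditions in total; the full list is in 1.Appendix.B.

transformation: a rule for rewriting the end of the stem after the ending has been removed. There are 35 of them; the full list is in 1.Appendix.C.

The algorithm has two steps:

Scan the ending list from the longest ending to the shortest and pick the first ending that the word ends with and whose condition is satisfied.

Apply a transformation to the remaining stem so that its end takes the proper form.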

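1.2. Example

Step 1

Take the word nationally. Scanning the ending list from longest to shortest, the first match is ationally B under .09.

Its condition is B Minimum stem length = 3, which requires at least 3 letters to remain after the ending is removed.

Removing ationally from nationally leaves only n, so the condition is not satisfied.

Continuing the scan, the next match is ionally A under .07., whose condition is A No restrictions on stem.

So ionally is chosen as the ending.

Step 2

The stem of nationally is nat. No transformation matches it, so it is output unchanged.

As another example, take sitting. Step 1 gives the stem sitt (ending ing, condition N), and step 2 applies transformation 1 (remove one letter of a double), so the final output is sit.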
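The two steps can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: it carries three endings, three conditions and one transformation from the appendices rather than the full tables, it expects lowercase input, and the names `ENDINGS`, `CONDITIONS`, `strip_ending`, `undouble` and `lovins_sketch` are made up for this example.

```python
# Illustrative subset of the Lovins tables; NOT the full 294 endings.
ENDINGS = {            # ending -> condition code (see 1.Appendix.A)
    "ationally": "B",
    "ionally": "A",
    "ing": "N",
}

CONDITIONS = {         # condition code -> predicate on the remaining stem
    "A": lambda stem: True,                      # no restrictions
    "B": lambda stem: len(stem) >= 3,            # minimum stem length 3
    # N: minimum stem length 4 after s**, elsewhere 3
    "N": lambda stem: len(stem) >= (4 if stem[-3:-2] == "s" else 3),
}

DOUBLES = "bdglmnprst"  # letters covered by transformation 1


def strip_ending(word):
    """Step 1: scan endings from longest to shortest, keep the first that passes."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                return stem
    return word


def undouble(stem):
    """Step 2, transformation 1 only: drop one letter of a final double."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem


def lovins_sketch(word):
    return undouble(strip_ending(word))


print(lovins_sketch("nationally"))  # nat
print(lovins_sketch("sitting"))     # sit
```

For nationally the loop first rejects ationally (the stem n is too short for condition B) and then accepts ionally, which mirrors the walkthrough above.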

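1.Appendix.A List of endings

Endings are grouped by length (.11. means 11 letters); each ending is followed by the code of its condition.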

.11.

alistically B arizability A izationally B

.10.

antialness A arisations A arizations A entialness A

.09.

allically C antaneous A antiality A arisation A

arization A ationally B ativeness A eableness E

entations A entiality A entialize A entiation A

ionalness A istically A itousness A izability A

izational A

.08.

ableness A arizable A entation A entially A

eousness A ibleness A icalness A ionalism A

ionality A ionalize A iousness A izations A

lessness A

.07.

ability A aically A alistic B alities A

ariness E aristic A arizing A ateness A

atingly A ational B atively A ativism A

elihood E encible A entally A entials A

entiate A entness A fulness A ibility A

icalism A icalist A icality A icalize A

ication G icianry A ination A ingness A

ionally A isation A ishness A istical A

iteness A iveness A ivistic A ivities A

ization F izement A oidally A ousness A

.06.

aceous A acious B action G alness A

ancial A ancies A ancing B ariser A

arized A arizer A atable A ations B

atives A eature Z efully A encies A

encing A ential A enting C entist A

eously A ialist A iality A ialize A

ically A icance A icians A icists A

ifully A ionals A ionate D ioning A

ionist A iously A istics A izable E

lessly A nesses A oidism A

.05.

acies A acity A aging B aical A

alist A alism B ality A alize A

allic BB anced B ances B antic C

arial A aries A arily A arity B

arize A aroid A ately A ating I

ation B ative A ators A atory A

ature E early Y ehood A eless A

elity A ement A enced A ences A

eness E ening E ental A ented C

ently A fully A ially A icant A

ician A icide A icism A icist A

icity A idine I iedly A ihood A

inate A iness A ingly B inism J

inity CC ional A ioned A ished A

istic A ities A itous A ively A

ivity A izers F izing F oidal A

oides A otide A ously A

.04.

able A ably A ages B ally B

ance B ancy B ants B aric A

arly K ated I ates A atic B

ator A ealy Y edly E eful A

eity A ence A ency A ened E

enly E eous A hood A ials A

ians A ible A ibly A ical A

ides L iers A iful A ines M

ings N ions B ious A isms B

ists A itic H ized F izer F

less A lily A ness A ogen A

ward A wise A ying B yish A

.03.

acy A age B aic A als BB

ant B ars O ary F ata A

ate A eal Y ear Y ely E

ene E ent C ery E ese A

ful A ial A ian A ics A

ide L ied A ier A ies P

ily A ine M ing N ion Q

ish C ism B ist A ite AA

ity A ium A ive A ize F

oid A one R ous A

.02.

ae A al BB ar X as B

ed E en F es E ia A

ic A is A ly B on S

or T um U us V yl R

s' A 's A

.01.

a A e A i A o A

s W y B

1.Appendix.B List of conditions

A No restrictions on stem

B Minimum stem length = 3

C Minimum stem length = 4

D Minimum stem length = 5

E Do not remove ending after e

F Minimum stem length = 3 and do not remove ending after e

G Minimum stem length = 3 and remove ending only after f

H Remove ending only after t or ll

I Do not remove ending after o or e

J Do not remove ending after a or e

K Minimum stem length = 3 and remove ending only after l, i or u*e

L Do not remove ending after u, x or s, unless s follows o

M Do not remove ending after a, c, e or m

N Minimum stem length = 4 after s**, elsewhere = 3

O Remove ending only after l or i

P Do not remove ending after c

Q Minimum stem length = 3 and do not remove ending after l or n

R Remove ending only after n or r

S Remove ending only after dr or t, unless t follows t

T Remove ending only after s or t, unless t follows o

U Remove ending only after l, m, n or r

V Remove ending only after c

W Do not remove ending after s or u

X Remove ending only after l, i or u*e

Y Remove ending only after in

Z Do not remove ending after f

AA Remove ending only after d, f, ph, th, l, er, or, es or t

BB Minimum stem length = 3 and do not remove ending after met or ryst

CC Remove ending only after l
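Each condition is a test on the stem that would remain after the ending is removed, so one plausible way to encode the table is one small predicate per code. The sketch below covers only four of the 29 conditions; the function names are invented for illustration and lowercase stems are assumed.

```python
def cond_E(stem):
    """E: do not remove ending after e."""
    return not stem.endswith("e")


def cond_G(stem):
    """G: minimum stem length 3 and remove ending only after f."""
    return len(stem) >= 3 and stem.endswith("f")


def cond_L(stem):
    """L: do not remove ending after u, x or s, unless s follows o."""
    if stem.endswith(("u", "x")):
        return False
    if stem.endswith("s"):
        return stem.endswith("os")
    return True


def cond_AA(stem):
    """AA: remove ending only after d, f, ph, th, l, er, or, es or t."""
    return stem.endswith(("d", "f", "ph", "th", "l", "er", "or", "es", "t"))


print(cond_E("agre"), cond_G("classif"), cond_L("glucos"), cond_AA("graph"))
```

Passing a tuple to `str.endswith` keeps the "only after ..." conditions close to the wording of the table.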

1.Appendix.C List of transformations

1 remove one of double b, d, g, l, m, n, p, r, s, t

2 iev -> ief

3 uct -> uc

4 umpt -> um

5 rpt -> rb

6 urs -> ur

7 istr -> ister

7a metr -> meter

8 olv -> olut

9 ul -> l except following a, o, i

10 bex -> bic

11 dex -> dic

12 pex -> pic

13 tex -> tic

14 ax -> ac

15 ex -> ec

16 ix -> ic

17 lux -> luc

18 uad -> uas

19 vad -> vas

20 cid -> cis

21 lid -> lis

22 erid -> eris

23 pand -> pans

24 end -> ens except following s

25 ond -> ons

26 lud -> lus

27 rud -> rus

28 her -> hes except following p, t

29 mit -> mis

30 ent -> ens except following m

31 ert -> ers

32 et -> es except following n

33 yt -> ys

34 yz -> ys
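A table-driven sketch of the transformation step might look like the following. It is an illustration, not a complete implementation: only a handful of the rules above are included, and it assumes that rule 1 is tried first and that at most one of the remaining rules is applied afterwards, which the list itself does not state. `RULES` and `transform` are hypothetical names.

```python
# (old stem ending, replacement, letters that block the rule when they
#  come directly before the old ending). Subset of the table above.
RULES = [
    ("iev", "ief", ""),
    ("uct", "uc", ""),
    ("metr", "meter", ""),
    ("olv", "olut", ""),
    ("ul", "l", "aoi"),     # rule 9: except following a, o, i
    ("end", "ens", "s"),    # rule 24: except following s
    ("et", "es", "n"),      # rule 32: except following n
]


def transform(stem):
    # Rule 1: remove one of double b, d, g, l, m, n, p, r, s, t.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        stem = stem[:-1]
    # Assumption: apply the first remaining rule that matches, then stop.
    for old, new, blocked in RULES:
        if stem.endswith(old):
            before = stem[-len(old) - 1 : -len(old)]  # letter before `old`, or ""
            if before and before in blocked:
                continue
            return stem[: -len(old)] + new
    return stem


for s in ("sitt", "believ", "resolv", "angul", "faul", "nat"):
    print(s, "->", transform(s))
```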

2. Porter

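2.1. Overview

Vowels and consonants

Vowels and consonants are defined slightly differently from the usual convention:

Vowel - A, E, I, O, U, and Y when it follows a consonant

Consonant - every letter other than A, E, I, O, U, except Y when it follows a consonant (so Y at the start of a word or after a vowel is a consonant)

Grouping a word

A run of consecutive vowels is treated as one vowel group V, and a run of consecutive consonants as one consonant group C, so any word can be written as alternating V and C groups. For example: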

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC

porter -> p/o/rt/e/r -> CVCVC

application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC

apple -> a/ppl/e -> VCV

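Putting this together, every word can be written in terms of VC groups: $$ [C](VC)^m[V] $$

Here the square brackets mark optional parts and (VC)^m means VC repeated m times. The parameter m plays a role similar to the stem length in the Lovins conditions and is used in the checks that follow. For the four examples above, m is 3, 2, 4 and 1 respectively.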
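The grouping and the value of m can be computed with a few lines of Python. This is a sketch under the definitions above, assuming lowercase input; `cv_tags`, `cv_form` and `measure` are names chosen for this example.

```python
import re


def cv_tags(word):
    """Tag each letter 'v' or 'c'; y is a vowel only right after a consonant."""
    tags = ""
    for i, ch in enumerate(word):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return tags


def cv_form(word):
    """Collapse runs of equal tags, e.g. 'cvccvc' -> 'CVCVC'."""
    return re.sub(r"(.)\1+", r"\1", cv_tags(word)).upper()


def measure(word):
    """m: the number of VC groups in [C](VC)^m[V]."""
    return len(re.findall(r"v+c+", cv_tags(word)))


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, cv_form(w), measure(w))
```

Counting matches of `v+c+` is equivalent to counting VC pairs in the collapsed form, because every vowel run that is followed by a consonant run contributes exactly one pair.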

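Rules

The Porter algorithm is built around rules of the form:

(condition) S1 -> S2

The condition is evaluated on the stem that remains after S1 is removed. Besides m, it can use the following notation:

m - the number of VC groups in the stem

* - any sequence of characters; it is combined with an uppercase letter, v, d or o (for example *S, *v*, *d, *o)

uppercase letter - a literal letter; *S means the stem ends with S

v - a vowel; *v* means the stem contains a vowel

d - a double consonant; *d means the stem ends with two identical consonants

o - cvc; *o means the stem ends with consonant-vowel-consonant, where the second c is not W, X or Y

S1 is the suffix of the word and S2 is the suffix that replaces it.

Unlike Lovins, which produces its output in a single pass, Porter passes a word through a chain of rules before the final result is output.

For example, hopping first matches the rule (*v*) ING ->, which gives hopp.

Then the rule (*d and not (*L or *S or *Z)) -> single letter turns hopp into hop.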
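The three pattern conditions translate naturally into small predicates on the stem. The following is a sketch with invented helper names (`cv_tags`, `contains_vowel`, `ends_double_consonant`, `ends_cvc`), assuming lowercase input and the vowel and consonant definitions given earlier.

```python
def cv_tags(word):
    """Tag each letter 'v' or 'c'; y is a vowel only right after a consonant."""
    tags = ""
    for i, ch in enumerate(word):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return tags


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return "v" in cv_tags(stem)


def ends_double_consonant(stem):
    """*d: the stem ends with two identical consonants."""
    return cv_tags(stem).endswith("cc") and stem[-1] == stem[-2]


def ends_cvc(stem):
    """*o: the stem ends cvc and the final consonant is not w, x or y."""
    return cv_tags(stem).endswith("cvc") and stem[-1] not in "wxy"


# hopping: removing ING leaves hopp, which contains a vowel and ends in a double
print(contains_vowel("hopp"), ends_double_consonant("hopp"), ends_cvc("hop"))
```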

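Flow

The algorithm applies the rules from top to bottom. Some rules are special: when they fire, additional rules have to be processed.

Because there are many rules, they are grouped into steps. The grouping is a logical division (an implementation can also use it for optimization), and the algorithm simply runs the steps from first to last: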

do Step_1a

do Step_1b (if the second or third rule of Step 1b fires, do some extra work)

do Step_1c

do Step_2

do Step_3

do Step_4

do Step_5a

do Step_5b

The details of each step are given in the appendix.

2.2. Example

2.Appendix Step 1a

SSES -> SS

IES -> I

SS -> SS

S ->
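Step 1a has no conditions, so a direct sketch is a short suffix table tried in the listed order. The function name `step_1a` and the lowercase input are assumptions of this example.

```python
def step_1a(word):
    # Suffixes in the listed order; longer ones come before their own tails,
    # so the first match is also the longest match.
    for s1, s2 in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(s1):
            return word[: -len(s1)] + s2
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
```

The SS -> SS rule looks like a no-op, but it stops the S -> rule from stripping the final s of a word such as caress.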

2.Appendix Step 1b

(m>0) EED -> EE

(*v*) ED ->

(*v*) ING ->

If the second or third of the rules in Step 1b is successful, the following is done:

AT -> ATE

BL -> BLE

IZ -> IZE

(*d and not (*L or *S or *Z)) -> single letter

(m=1 and *o) -> E
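Step 1b, including its follow-up rules, can be sketched as below. The helper names (`cv_tags`, `measure`, `cleanup`, `step_1b`) are invented for this example, input is assumed to be lowercase, and the sketch assumes that once EED matches, the ED and ING rules are not tried even if the m>0 condition fails.

```python
import re


def cv_tags(word):
    """Tag each letter 'v' or 'c'; y is a vowel only right after a consonant."""
    tags = ""
    for i, ch in enumerate(word):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return tags


def measure(stem):
    """m: the number of VC groups."""
    return len(re.findall(r"v+c+", cv_tags(stem)))


def cleanup(stem):
    """Extra rules that run after ED or ING has been removed."""
    tags = cv_tags(stem)
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    # (*d and not (*L or *S or *Z)) -> single letter
    if tags.endswith("cc") and stem[-1] == stem[-2] and stem[-1] not in "lsz":
        return stem[:-1]
    # (m=1 and *o) -> E
    if measure(stem) == 1 and tags.endswith("cvc") and stem[-1] not in "wxy":
        return stem + "e"
    return stem


def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and "v" in cv_tags(stem):  # (*v*)
            return cleanup(stem)
    return word


for w in ("hopping", "conflated", "filing", "agreed", "feed", "falling"):
    print(w, "->", step_1b(w))
```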

2.Appendix Step 1c

(*v*) Y -> I

2.Appendix Step 2

(m>0) ATIONAL -> ATE

(m>0) TIONAL -> TION

(m>0) ENCI -> ENCE

(m>0) ANCI -> ANCE

(m>0) IZER -> IZE

(m>0) ABLI -> ABLE

(m>0) ALLI -> AL

(m>0) ENTLI -> ENT

(m>0) ELI -> E

(m>0) OUSLI -> OUS

(m>0) IZATION -> IZE

(m>0) ATION -> ATE

(m>0) ATOR -> ATE

(m>0) ALISM -> AL

(m>0) IVENESS -> IVE

(m>0) FULNESS -> FUL

(m>0) OUSNESS -> OUS

(m>0) ALITI -> AL

(m>0) IVITI -> IVE

(m>0) BILITI -> BLE
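Since every rule in Step 2 shares the condition m>0, the step lends itself to a lookup table. The sketch below assumes lowercase input and that only the longest matching S1 is considered, even when its condition fails; `STEP_2`, `measure` and `step_2` are names made up for this example.

```python
import re

STEP_2 = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def measure(stem):
    """m: the number of VC groups (y is a vowel only after a consonant)."""
    tags = ""
    for i, ch in enumerate(stem):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return len(re.findall(r"v+c+", tags))


def step_2(word):
    # Try suffixes from longest to shortest and stop at the first match.
    for s1 in sorted(STEP_2, key=len, reverse=True):
        if word.endswith(s1):
            stem = word[: -len(s1)]
            return stem + STEP_2[s1] if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "valenci", "sensibiliti"):
    print(w, "->", step_2(w))
```

The word rational shows why the longest match matters: ATIONAL matches but leaves the stem r with m=0, so the word is left unchanged rather than falling through to TIONAL.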

2.Appendix Step 3

(m>0) ICATE -> IC

(m>0) ATIVE ->

(m>0) ALIZE -> AL

(m>0) ICITI -> IC

(m>0) ICAL -> IC

(m>0) FUL ->

(m>0) NESS ->

2.Appendix Step 4

(m>1) AL ->

(m>1) ANCE ->

(m>1) ENCE ->

(m>1) ER ->

(m>1) IC ->

(m>1) ABLE ->

(m>1) IBLE ->

(m>1) ANT ->

(m>1) EMENT ->

(m>1) MENT ->

(m>1) ENT ->

(m>1 and (*S or *T)) ION ->

(m>1) OU ->

(m>1) ISM ->

(m>1) ATE ->

(m>1) ITI ->

(m>1) OUS ->

(m>1) IVE ->

(m>1) IZE ->
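Step 4 only deletes suffixes, and ION is the one rule with an extra condition on the stem. A possible sketch, with hypothetical names `STEP_4`, `measure` and `step_4`, lowercase input, and the same longest-match assumption as in the other steps:

```python
import re

STEP_4 = ("al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
          "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize")


def measure(stem):
    """m: the number of VC groups (y is a vowel only after a consonant)."""
    tags = ""
    for i, ch in enumerate(stem):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return len(re.findall(r"v+c+", tags))


def step_4(word):
    for s1 in sorted(STEP_4, key=len, reverse=True):
        if word.endswith(s1):
            stem = word[: -len(s1)]
            ok = measure(stem) > 1
            if s1 == "ion":                      # (m>1 and (*S or *T)) ION ->
                ok = ok and stem.endswith(("s", "t"))
            return stem if ok else word
    return word


for w in ("revival", "adoption", "replacement", "adjustment", "relate"):
    print(w, "->", step_4(w))
```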

2.Appendix Step 5a

(m>1) E ->

(m=1 and not *o) E ->

2.Appendix Step 5b

(m > 1 and *d and *L) -> single letter
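The two final steps tidy up a trailing e and a trailing double l. The sketch below uses invented helper names and assumes lowercase input; in Step 5b the condition is evaluated on the whole word, because S1 is empty.

```python
import re


def cv_tags(word):
    """Tag each letter 'v' or 'c'; y is a vowel only right after a consonant."""
    tags = ""
    for i, ch in enumerate(word):
        vowel = ch in "aeiou" or (ch == "y" and i > 0 and tags[-1] == "c")
        tags += "v" if vowel else "c"
    return tags


def measure(stem):
    return len(re.findall(r"v+c+", cv_tags(stem)))


def ends_cvc(stem):
    """*o: ends consonant-vowel-consonant, final consonant not w, x or y."""
    return cv_tags(stem).endswith("cvc") and stem[-1] not in "wxy"


def step_5a(word):
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step_5b(word):
    # (m > 1 and *d and *L) -> single letter
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word


print(step_5a("probate"), step_5a("rate"), step_5a("cease"))
print(step_5b("controll"), step_5b("roll"))
```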
