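The Porter stemming algorithm (or "Porter stemmer") is a process for removing the commoner morphological and inflexional endings from English words. Its main use is as part of a term normalisation process, which is usually carried out when setting up an information retrieval system.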
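To make the term normalisation role concrete, here is a minimal sketch of an inverted index that passes document terms and query terms through the same normaliser. Everything in it is illustrative: `toy_normalise` is a hypothetical stand-in that applies only the plural rules from the stemmer's first step (`sses` to `ss`, `ies` to `i`, drop a final `s` unless it follows another `s`), not the full Porter algorithm, and `build_index` and `search` are names invented for this example.

```python
from collections import defaultdict


def toy_normalise(term):
    """Hypothetical stand-in for a stemmer: plural rules only."""
    term = term.lower()
    if len(term) <= 2:          # very short words are left alone
        return term
    if term.endswith("sses"):   # caresses -> caress
        return term[:-2]
    if term.endswith("ies"):    # ponies -> poni
        return term[:-2]
    if term.endswith("s") and not term.endswith("ss"):  # cats -> cat
        return term[:-1]
    return term


def build_index(docs):
    """Map each normalised term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_normalise(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing the normalised query."""
    return sorted(index.get(toy_normalise(query), set()))


docs = {1: "the cat sat", 2: "two cats and three ponies"}
index = build_index(docs)
print(search(index, "cats"))  # [1, 2]
```

Because the query is normalised the same way as the indexed text, a search for "cats" also finds the document that only contains "cat". A real system would plug the full stemmer in where `toy_normalise` is used.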

#!/usr/bin/env python
import sys


class PorterStemmer:

    def __init__(self):
        """The main part of the stemming algorithm starts here.
        b is a buffer holding a word to be stemmed. The letters are in b[k0],
        b[k0+1] ... ending at b[k]. In fact k0 = 0 in this demo program. k is
        readjusted downwards as the stemming progresses. Zero termination is
        not in fact used in the algorithm.

        Note that only lower case sequences are stemmed. Forcing to lower case
        should be done before stem(...) is called.
        """
        self.b = ""  # buffer for word to be stemmed
        self.k = 0
        self.k0 = 0
        self.j = 0   # j is a general offset into the string

    def cons(self, i):
        """cons(i) is TRUE <=> b[i] is a consonant."""
        if self.b[i] == 'a' or self.b[i] == 'e' or self.b[i] == 'i' or self.b[i] == 'o' or self.b[i] == 'u':
            return 0
        if self.b[i] == 'y':
            if i == self.k0:
                return 1
            else:
                return (not self.cons(i - 1))
        return 1

    def m(self):
        """m() measures the number of consonant sequences between k0 and j.
        if c is a consonant sequence and v a vowel sequence, and <..>
        indicates arbitrary presence,

           <c><v>       gives 0
           <c>vc<v>     gives 1
           <c>vcvc<v>   gives 2
           <c>vcvcvc<v> gives 3
           ....
        """
        n = 0
        i = self.k0
        # skip the optional leading consonant sequence
        while 1:
            if i > self.j:
                return n
            if not self.cons(i):
                break
            i = i + 1
        i = i + 1
        while 1:
            # skip a vowel sequence
            while 1:
                if i > self.j:
                    return n
                if self.cons(i):
                    break
                i = i + 1
            i = i + 1
            n = n + 1
            # skip the consonant sequence that follows it
            while 1:
                if i > self.j:
                    return n
                if not self.cons(i):
                    break
                i = i + 1
            i = i + 1

    def vowelinstem(self):
        """vowelinstem() is TRUE <=> k0,...j contains a vowel"""
        for i in range(self.k0, self.j + 1):
            if not self.cons(i):
                return 1
        return 0

    def doublec(self, j):
        """doublec(j) is TRUE <=> j,(j-1) contain a double consonant."""
        if j < (self.k0 + 1):
            return 0
        if (self.b[j] != self.b[j-1]):
            return 0
        return self.cons(j)

    def cvc(self, i):
        """cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel -
        consonant and the second consonant is not w, x or y.
        """
        if i < (self.k0 + 2) or not self.cons(i) or self.cons(i-1) or not self.cons(i-2):
            return 0
        ch = self.b[i]
        if ch == 'w' or ch == 'x' or ch == 'y':
            return 0
        return 1

    def ends(self, s):
        """ends(s) is TRUE <=> k0,...k ends with the string s."""
        length = len(s)
        if s[length - 1] != self.b[self.k]:  # tiny speed-up
            return 0
        if length > (self.k - self.k0 + 1):
            return 0
        if self.b[self.k-length+1:self.k+1] != s:
            return 0
        self.j = self.k - length
        return 1

    def setto(self, s):
        """setto(s) sets (j+1),...k to the characters in the string s,
        readjusting k.
        """
        length = len(s)
        self.b = self.b[:self.j+1] + s + self.b[self.j+length+1:]
        self.k = self.j + length

    def r(self, s):
        """r(s) is used further down."""
        if self.m() > 0:
            self.setto(s)

    def step1ab(self):
        """step1ab() gets rid of plurals and -ed or -ing. e.g.

           caresses  ->  caress
           ponies    ->  poni
           ties      ->  ti
           caress    ->  caress
           cats      ->  cat

           feed      ->  feed
           agreed    ->  agree
           disabled  ->  disable

           matting   ->  mat
           mating    ->  mate
           meeting   ->  meet
           milling   ->  mill
           messing   ->  mess

           meetings  ->  meet
        """
        if self.b[self.k] == 's':
            if self.ends("sses"):
                self.k = self.k - 2
            elif self.ends("ies"):
                self.setto("i")
            elif self.b[self.k - 1] != 's':
                self.k = self.k - 1
        if self.ends("eed"):
            if self.m() > 0:
                self.k = self.k - 1
        elif (self.ends("ed") or self.ends("ing")) and self.vowelinstem():
            self.k = self.j
            if self.ends("at"):   self.setto("ate")
            elif self.ends("bl"): self.setto("ble")
            elif self.ends("iz"): self.setto("ize")
            elif self.doublec(self.k):
                self.k = self.k - 1
                ch = self.b[self.k]
                if ch == 'l' or ch == 's' or ch == 'z':
                    self.k = self.k + 1
            elif (self.m() == 1 and self.cvc(self.k)):
                self.setto("e")

    def step1c(self):
        """step1c() turns terminal y to i when there is another vowel in the stem."""
        if (self.ends("y") and self.vowelinstem()):
            self.b = self.b[:self.k] + 'i' + self.b[self.k+1:]

    def step2(self):
        """step2() maps double suffixes to single ones.
        so -ization ( = -ize plus -ation) maps to -ize etc. note that the
        string before the suffix must give m() > 0.
        """
        if self.b[self.k - 1] == 'a':
            if self.ends("ational"):   self.r("ate")
            elif self.ends("tional"):  self.r("tion")
        elif self.b[self.k - 1] == 'c':
            if self.ends("enci"):      self.r("ence")
            elif self.ends("anci"):    self.r("ance")
        elif self.b[self.k - 1] == 'e':
            if self.ends("izer"):      self.r("ize")
        elif self.b[self.k - 1] == 'l':
            if self.ends("bli"):       self.r("ble")  # --DEPARTURE--
            # To match the published algorithm, replace this phrase with
            #   if self.ends("abli"):      self.r("able")
            elif self.ends("alli"):    self.r("al")
            elif self.ends("entli"):   self.r("ent")
            elif self.ends("eli"):     self.r("e")
            elif self.ends("ousli"):   self.r("ous")
        elif self.b[self.k - 1] == 'o':
            if self.ends("ization"):   self.r("ize")
            elif self.ends("ation"):   self.r("ate")
            elif self.ends("ator"):    self.r("ate")
        elif self.b[self.k - 1] == 's':
            if self.ends("alism"):     self.r("al")
            elif self.ends("iveness"): self.r("ive")
            elif self.ends("fulness"): self.r("ful")
            elif self.ends("ousness"): self.r("ous")
        elif self.b[self.k - 1] == 't':
            if self.ends("aliti"):     self.r("al")
            elif self.ends("iviti"):   self.r("ive")
            elif self.ends("biliti"):  self.r("ble")
        elif self.b[self.k - 1] == 'g':  # --DEPARTURE--
            if self.ends("logi"):      self.r("log")
        # To match the published algorithm, delete this phrase

    def step3(self):
        """step3() deals with -ic-, -full, -ness etc. similar strategy to step2."""
        if self.b[self.k] == 'e':
            if self.ends("icate"):     self.r("ic")
            elif self.ends("ative"):   self.r("")
            elif self.ends("alize"):   self.r("al")
        elif self.b[self.k] == 'i':
            if self.ends("iciti"):     self.r("ic")
        elif self.b[self.k] == 'l':
            if self.ends("ical"):      self.r("ic")
            elif self.ends("ful"):     self.r("")
        elif self.b[self.k] == 's':
            if self.ends("ness"):      self.r("")

    def step4(self):
        """step4() takes off -ant, -ence etc., in context <c>vcvc<v>."""
        if self.b[self.k - 1] == 'a':
            if self.ends("al"): pass
            else: return
        elif self.b[self.k - 1] == 'c':
            if self.ends("ance"): pass
            elif self.ends("ence"): pass
            else: return
        elif self.b[self.k - 1] == 'e':
            if self.ends("er"): pass
            else: return
        elif self.b[self.k - 1] == 'i':
            if self.ends("ic"): pass
            else: return
        elif self.b[self.k - 1] == 'l':
            if self.ends("able"): pass
            elif self.ends("ible"): pass
            else: return
        elif self.b[self.k - 1] == 'n':
            if self.ends("ant"): pass
            elif self.ends("ement"): pass
            elif self.ends("ment"): pass
            elif self.ends("ent"): pass
            else: return
        elif self.b[self.k - 1] == 'o':
            if self.ends("ion") and (self.b[self.j] == 's' or self.b[self.j] == 't'): pass
            elif self.ends("ou"): pass  # takes care of -ous
            else: return
        elif self.b[self.k - 1] == 's':
            if self.ends("ism"): pass
            else: return
        elif self.b[self.k - 1] == 't':
            if self.ends("ate"): pass
            elif self.ends("iti"): pass
            else: return
        elif self.b[self.k - 1] == 'u':
            if self.ends("ous"): pass
            else: return
        elif self.b[self.k - 1] == 'v':
            if self.ends("ive"): pass
            else: return
        elif self.b[self.k - 1] == 'z':
            if self.ends("ize"): pass
            else: return
        else:
            return
        if self.m() > 1:
            self.k = self.j

    def step5(self):
        """step5() removes a final -e if m() > 1, and changes -ll to -l if
        m() > 1.
        """
        self.j = self.k
        if self.b[self.k] == 'e':
            a = self.m()
            if a > 1 or (a == 1 and not self.cvc(self.k-1)):
                self.k = self.k - 1
        if self.b[self.k] == 'l' and self.doublec(self.k) and self.m() > 1:
            self.k = self.k - 1

    def stem(self, p, i, j):
        """Stem the word held in p[i..j] (inclusive) and return the stem."""
        # copy the parameters into statics
        self.b = p
        self.k = j
        self.k0 = i
        if self.k <= self.k0 + 1:
            return self.b  # --DEPARTURE--

        # With this line, strings of length 1 or 2 don't go through the
        # stemming process, although no mention is made of this in the
        # published algorithm. Remove the line to match the published
        # algorithm.

        self.step1ab()
        self.step1c()
        self.step2()
        self.step3()
        self.step4()
        self.step5()
        return self.b[self.k0:self.k+1]


if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        if word:
                            output += p.stem(word, 0, len(word)-1)
                            word = ''
                        output += c.lower()
                # each line already ends with its own newline character
                print(output, end='')
            infile.close()
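Most of the suffix rules in the listing are gated on the measure computed by m(), so it helps to see that quantity in isolation. The sketch below is not part of the original listing: `is_consonant` and `measure` are hypothetical standalone helpers that work on a whole string rather than on the b/k0/j buffer. They follow the same convention as cons(), where `y` counts as a consonant at the start of the word or directly after a vowel, and they count each point where a vowel run is followed by a consonant.

```python
def is_consonant(word, i):
    """Mirror of cons(): a, e, i, o, u are vowels; 'y' is a consonant
    only at position 0 or when the preceding letter is a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in <c>(vc)^m<v>."""
    count = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            count += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return count


for w in ("tree", "by", "trouble", "ivy", "private"):
    print(w, measure(w))
```

Under these assumptions "tree" and "by" have measure 0, "trouble" and "ivy" have measure 1, and "private" has measure 2. That is why a rule requiring m() > 1 leaves short stems untouched.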
