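Example: a search engine

  A search engine consists of four parts: a searcher, an indexer, a retriever, and a user interface.

  The searcher is the crawler. The content it crawls is passed to the indexer, which builds an index and stores it in an internal database. The user issues a query through the user interface; the query is parsed and sent to the retriever, which looks up the index efficiently and returns the results to the user.

  The following five files are the crawled samples to search.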

# 1.txt
# I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
# I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
# I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
# This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
# And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

A simple search engine

SearchEngineBase is the base class; corpus means the body of text to be indexed.
class SearchEngineBase(object):
    def __init__(self):
        pass

    # Use the file path as the id and pass it, with the content, to process_corpus
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # Indexer: the file path is the id; the processed content is stored as the index
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    # Retriever
    def search(self, query):
        raise Exception('search not implemented.')

# Polymorphism: main works with any SearchEngineBase subclass
def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = dict()

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## Output ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt

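Drawbacks: the index keeps the full text of every document, which takes a lot of space, and every search scans all of that text. In addition, a query can only be a single word or several consecutive words; it cannot handle multiple words scattered across different positions.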
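
The second limitation can be seen in a minimal sketch. The text is a shortened excerpt of 1.txt, and `substring_match` is a hypothetical helper that mirrors the `query in text` test used in SimpleEngine.search.

```python
# Shortened excerpt of 1.txt, used only for illustration.
text = ("I have a dream that my four little children will one day live "
        "in a nation where they will not be judged by the color of their skin")


def substring_match(query, text):
    """Return True only if query appears in text as one contiguous run."""
    return query in text


contiguous = substring_match("little children", text)  # adjacent words
scattered = substring_match("dream children", text)    # both present, not adjacent
print(contiguous, scattered)  # prints: True False
```

Both words of the second query occur in the text, but because they are not adjacent the substring test rejects the document.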

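Bag of words model

This version uses the bag of words model, one of the simplest models in NLP.
process_corpus calls parse_text_to_words to put each word of the document into a set.
search also breaks the query keywords into a set, checks it against each indexed document, and adds the ids of matching documents to the result list.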
import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = dict()

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results

    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True
        # Alternative 1 (not equivalent: it returns after checking only one word):
        # for query_word in query_words:
        #     return False if query_word not in words else True
        # Alternative 2:
        # result = filter(lambda x: x not in words, query_words)
        # return False if (len(list(result)) > 0) else True

    @staticmethod
    def parse_text_to_words(text):
        # Replace punctuation and newlines with spaces using a regular expression
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Build the list of all words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)

########## Output ##########
# i have a dream
# found 3 result(s):
# 1.txt
# 2.txt
# 3.txt
# freedom children
# found 1 result(s):
# 5.txt

  Drawback: every search still has to iterate over all documents.
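
The set-based matching, and the full scan it still requires, can be sketched with a subset test. This is a minimal illustration rather than the engine itself: `to_words`, `docs`, and the module-level `search` are hypothetical names, and the two documents are shortened excerpts of the samples.

```python
import re


def to_words(text):
    """Lowercase the text and return its words as a set."""
    return set(re.sub(r'[^\w ]', ' ', text).lower().split())


# Shortened excerpts of two sample files; the keys stand in for file ids.
docs = {
    '1.txt': 'I have a dream that my four little children will one day live in a nation',
    '5.txt': "when we allow freedom ring, all of God's children will be free at last",
}
index = {doc_id: to_words(text) for doc_id, text in docs.items()}


def search(query):
    query_words = to_words(query)
    # A document matches when every query word is in its word set.
    # Note that each search still visits every document in the index.
    return [doc_id for doc_id, words in index.items() if query_words <= words]


print(search('children dream'))    # prints: ['1.txt']
print(search('freedom children'))  # prints: ['5.txt']
```

Word order and position no longer matter, but the cost of a search still grows with the number of documents.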

Inverted index

With an inverted index, what is stored is now a dictionary mapping word -> ids.
import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = dict()

    # Build the index: word -> ids
    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            # Files are added in id order, so each list stays sorted,
            # e.g. {'little': ['1.txt', '2.txt'], ...}
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If any query word has no inverted list, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:
            # First, collect the id each inverted list currently points at
            current_ids = []
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]  # ['1.txt', '2.txt']

                # One inverted list has been fully traversed, so the search is done
                if current_index >= len(current_inverted_list):
                    return result
                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are equal, every query word
            # appears in the document that id refers to
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # Otherwise, advance the pointer of the list holding the smallest id
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # Replace punctuation and newlines with spaces using a regular expression
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Build the list of all words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)

########## Output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little vicious
# found 1 result(s):
# 2.txt
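
The word -> ids mapping can also be queried with set intersection. The sketch below is a simpler variant, not the multi-pointer merge used in the engine above; `to_words`, `docs`, and the module-level `search` are hypothetical names, and the documents are shortened excerpts of the samples.

```python
import re
from collections import defaultdict


def to_words(text):
    """Lowercase the text and return its words as a set."""
    return set(re.sub(r'[^\w ]', ' ', text).lower().split())


# Shortened excerpts of three sample files.
docs = {
    '1.txt': 'my four little children will one day live in a nation',
    '2.txt': 'with its vicious racists, little black boys and black girls',
    '3.txt': 'every valley shall be exalted',
}

# Build the inverted index: word -> set of document ids.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in to_words(text):
        inverted_index[word].add(doc_id)


def search(query):
    words = to_words(query)
    if not words:
        return []
    # Look up only the lists for the query words; unknown words give an empty set.
    id_sets = [inverted_index.get(word, set()) for word in words]
    return sorted(set.intersection(*id_sets))


print(search('little'))          # prints: ['1.txt', '2.txt']
print(search('little vicious'))  # prints: ['2.txt']
```

Only the id lists of the query words are touched, so documents that contain none of them are never visited.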

LRUCache

If more than 90% of searches are repeats, consider adding a cache to improve performance, implemented here with the Least Recently Used (LRU) algorithm.

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)

        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## Output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt
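
For readers without pylru installed, the least-recently-used policy itself can be sketched with the standard library. `SimpleLRUCache` is a hypothetical stand-in with the same has/get/set interface as the LRUCache wrapper above; it is an illustration of the eviction rule, not a description of how pylru is implemented.

```python
from collections import OrderedDict


class SimpleLRUCache:
    """Minimal LRU cache: the least recently used key is evicted first."""

    def __init__(self, size=32):
        self.size = size
        self._data = OrderedDict()

    def has(self, key):
        return key in self._data

    def get(self, key):
        # Reading a key marks it as the most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.size:
            # Drop the entry at the front, which is the least recently used.
            self._data.popitem(last=False)


cache = SimpleLRUCache(size=2)
cache.set('little', ['1.txt', '2.txt'])
cache.set('vicious', ['2.txt'])
cache.get('little')            # 'little' is now the most recently used
cache.set('dream', ['1.txt'])  # evicts 'vicious'
print(cache.has('little'), cache.has('vicious'), cache.has('dream'))
# prints: True False True
```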

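  Note that BOWInvertedIndexEngineWithCache inherits from two classes.

In the constructor, super(BOWInvertedIndexEngineWithCache, self).__init__() initializes the BOWInvertedIndexEngine parent class.
The other parent in the multiple inheritance has to be initialized explicitly with LRUCache.__init__(self), because SearchEngineBase.__init__ does not call super().__init__(), so the chain never reaches LRUCache.
BOWInvertedIndexEngineWithCache overrides the search method; inside it, the search method of the parent class BOWInvertedIndexEngine is called with:
result = super(BOWInvertedIndexEngineWithCache, self).search(query)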
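
The initialization order can be checked with a small sketch. The classes `Base`, `Engine`, `Cache`, and `EngineWithCache` and the `init_cache` flag are hypothetical stand-ins for the classes above, reduced to their constructors.

```python
class Base:
    def __init__(self):
        # Like SearchEngineBase.__init__, this does not call super().__init__(),
        # so the cooperative chain stops here.
        self.base_ready = True


class Engine(Base):
    def __init__(self):
        super().__init__()
        self.engine_ready = True


class Cache:
    def __init__(self, size=32):
        self.size = size


class EngineWithCache(Engine, Cache):
    def __init__(self, init_cache=True):
        super().__init__()        # runs Engine.__init__, then Base.__init__
        if init_cache:
            Cache.__init__(self)  # Cache.__init__ must be called explicitly


mro_names = [cls.__name__ for cls in EngineWithCache.__mro__]
print(mro_names)
# prints: ['EngineWithCache', 'Engine', 'Base', 'Cache', 'object']
print(hasattr(EngineWithCache(), 'size'))                  # prints: True
print(hasattr(EngineWithCache(init_cache=False), 'size'))  # prints: False
```

Cache comes after Base in the method resolution order, so without the explicit call its constructor never runs and the size attribute is missing.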

References

  Geek Time (极客时间) column "Python核心技术与实战"

  https://time.geekbang.org/column/intro/176

Reposted from: https://www.cnblogs.com/xiaoguanqiu/p/10984178.html
