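Bloom filter implementation, method 1: write your own

Reference: http://www.cnblogs.com/naive/p/5815433.html

The two arguments to BloomFilter are the size of the bloom filter (the number of bits in the bit array) and the number of hash functions.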

#!/usr/bin/env python
# coding: utf-8
from bitarray import bitarray

# 3rd party
import mmh3
import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os

ls = os.linesep


class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        # Bit array of `size` bits, all cleared at the start
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Use the loop counter as the mmh3 seed to get hash_count different hashes
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # The item may be present only if every one of its bits is set
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False
        return out


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
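
The size and hash count passed to BloomFilter(1000, 10) are picked by hand here. As a rough sketch of how they could be chosen instead, the block below uses the standard textbook approximations for a Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n items and a target false-positive rate p. The function names optimal_parameters and expected_fp_rate are made up for this illustration and are not part of the code above or of any library.

```python
import math


def optimal_parameters(n_items, fp_rate):
    """Return (size_in_bits, hash_count) for n_items and a target false-positive rate.

    Hypothetical helper based on the usual approximations:
    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    size = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(size / n_items * math.log(2)))
    return size, hash_count


def expected_fp_rate(size, hash_count, n_items):
    """Approximate false-positive rate (1 - e^(-k*n/m))^k after n_items insertions."""
    return (1.0 - math.exp(-hash_count * n_items / size)) ** hash_count


if __name__ == "__main__":
    # The hand-picked filter above: 1000 bits, 10 hashes, 19 animals inserted
    print(expected_fp_rate(1000, 10, 19))
    # What the formulas suggest for 19 items at a 0.1% target rate
    print(optimal_parameters(19, 0.001))
```

These are approximations that assume independent, uniformly distributed hash functions, so treat the numbers as estimates rather than guarantees.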

Bloom filter implementation, method 2: use pybloom

Reference: http://www.jianshu.com/p/f57187e2b5b9

#!/usr/bin/env python
# coding: utf-8
from pybloom import BloomFilter
import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os

ls = os.linesep


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom takes a capacity and a target error rate instead of a bit count and hash count
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))

Output

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
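
None of the 20 probe words in the output above is a false positive, which is what you would expect from a filter that is nearly empty. To see false positives appear, the filter has to be loaded much more heavily. The block below is a minimal sketch of such an experiment. It is not the code from either method: SimpleBloom and measure_false_positive_rate are hypothetical names, and the filter uses only the standard library (salted SHA-256 in place of mmh3, a bytearray in place of bitarray) so that it runs without extra packages.

```python
import hashlib


class SimpleBloom:
    """Stdlib-only Bloom filter sketch (hypothetical, not the article's class)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Derive hash_count indexes by salting SHA-256 with the hash number
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def __contains__(self, item):
        return all(self.bits[index] for index in self._indexes(item))


def measure_false_positive_rate(size, hash_count, n_inserted, n_probes):
    """Insert n_inserted keys, probe n_probes other keys, return the hit ratio."""
    bloom = SimpleBloom(size, hash_count)
    for i in range(n_inserted):
        bloom.add(f"in-{i}")
    # No false negatives: every inserted key must be reported as present
    assert all(f"in-{i}" in bloom for i in range(n_inserted))
    # Keys with the "out-" prefix were never inserted, so every hit is a false positive
    hits = sum(f"out-{i}" in bloom for i in range(n_probes))
    return hits / n_probes


if __name__ == "__main__":
    print(measure_false_positive_rate(1000, 10, 19, 2000))   # lightly loaded
    print(measure_false_positive_rate(1000, 10, 500, 2000))  # heavily overloaded
```

With the same 1000 bits and 10 hashes, inserting 500 keys instead of 19 sets most of the bits, so most lookups for absent keys should come back as false positives.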

Reposted from: https://www.cnblogs.com/tonglin0325/p/7043886.html
