Python Crawler Study: Bloom Filters

Bloom filter implementation, method 1: write it yourself

Reference: http://www.cnblogs.com/naive/p/5815433.html

The two BloomFilter parameters are the size of the filter (the number of bits in the bit array) and the number of hash functions.

#!/usr/bin/env python
# coding: utf-8
from bitarray import bitarray
# 3rd party
import mmh3
import scrapy
from bs4 import BeautifulSoup as BS  # bs4 replaces the legacy BeautifulSoup 3 package; unused below
import os

ls = os.linesep


class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        # Bit array of `size` bits, all cleared at the start
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Set one bit per hash function; the loop counter is the hash seed
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # The item may be present only if every one of its bits is set
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False
        return out


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
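
In a crawler, the usual job of such a filter is to remember which URLs have already been visited. The following is only a minimal sketch of that idea, not the code from the referenced post. To stay runnable without third-party packages it swaps in hashlib (SHA-256 with double hashing) for mmh3 and a bytearray for bitarray. The names SimpleBloomFilter and filter_new_urls and the example.com URLs are made up for illustration.

```python
import hashlib


class SimpleBloomFilter:
    """Bloom filter on a bytearray; hashlib stands in for mmh3 (assumption)."""

    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray((size + 7) // 8)

    def _indexes(self, item):
        # Double hashing: derive hash_count indexes from two 64-bit
        # halves of a single SHA-256 digest (h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index // 8] |= 1 << (index % 8)

    def __contains__(self, item):
        # all() stops at the first bit that is not set.
        return all(self.bits[index // 8] & (1 << (index % 8))
                   for index in self._indexes(item))


def filter_new_urls(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording each one."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Hypothetical URLs, used only to show the deduplication step.
seen_urls = SimpleBloomFilter(1000, 10)
batch = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
print(filter_new_urls(batch, seen_urls))
```

Because a Bloom filter can report false positives, a crawler built this way may occasionally skip a page it has never visited. It never revisits a page it has already recorded, since there are no false negatives.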

Bloom filter implementation, method 2: use pybloom

Reference: http://www.jianshu.com/p/f57187e2b5b9

#!/usr/bin/env python
# coding: utf-8
from pybloom import BloomFilter
import scrapy
from bs4 import BeautifulSoup as BS  # bs4 replaces the legacy BeautifulSoup 3 package; unused below
import os

ls = os.linesep


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"]

    def parse(self, response):
        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom takes the expected capacity and the target error rate
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
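
Unlike method 1, the pybloom constructor here is given an expected capacity and a target error rate instead of a bit count and a hash count. The two views are linked by the textbook Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The sketch below just evaluates those formulas; optimal_parameters is a hypothetical helper name, and pybloom's own internal sizing may differ in detail.

```python
import math


def optimal_parameters(capacity, error_rate):
    """Return (bit_count, hash_count) from the textbook formulas."""
    # m = -n * ln(p) / (ln 2)^2
    bit_count = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    hash_count = max(1, int(round(bit_count / float(capacity) * math.log(2))))
    return bit_count, hash_count


# Same arguments as BloomFilter(1000, 0.001) above.
bits, hashes = optimal_parameters(1000, 0.001)
print(bits, hashes)
```

For a capacity of 1000 and an error rate of 0.001 this works out to roughly 14 bits per item, noticeably more than the 1000 bits used in method 1, which only had to hold 19 items.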

Output

dog is in bloom filter as expected
cat is in bloom filter as expected
giraffe is in bloom filter as expected
fly is in bloom filter as expected
mosquito is in bloom filter as expected
horse is in bloom filter as expected
eagle is in bloom filter as expected
bird is in bloom filter as expected
bison is in bloom filter as expected
boar is in bloom filter as expected
butterfly is in bloom filter as expected
ant is in bloom filter as expected
anaconda is in bloom filter as expected
bear is in bloom filter as expected
chicken is in bloom filter as expected
dolphin is in bloom filter as expected
donkey is in bloom filter as expected
crow is in bloom filter as expected
crocodile is in bloom filter as expected
badger is not in the bloom filter as expected
cow is not in the bloom filter as expected
pig is not in the bloom filter as expected
sheep is not in the bloom filter as expected
bee is not in the bloom filter as expected
wolf is not in the bloom filter as expected
fox is not in the bloom filter as expected
whale is not in the bloom filter as expected
shark is not in the bloom filter as expected
fish is not in the bloom filter as expected
turkey is not in the bloom filter as expected
duck is not in the bloom filter as expected
dove is not in the bloom filter as expected
deer is not in the bloom filter as expected
elephant is not in the bloom filter as expected
frog is not in the bloom filter as expected
falcon is not in the bloom filter as expected
goat is not in the bloom filter as expected
gorilla is not in the bloom filter as expected
hawk is not in the bloom filter as expected
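
No false positive appears among the twenty animals that were never inserted. A rough way to see why that is unsurprising is the standard approximation p ≈ (1 - e^(-kn/m))^k for a filter with m bits, k hash functions and n inserted items. The sketch below plugs in the method 1 values (1000 bits, 10 hash functions, 19 animals). The function name is made up for illustration, and the formula assumes ideally independent hash functions, so it is only an estimate.

```python
import math


def estimated_false_positive_rate(size, hash_count, inserted):
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-hash_count * inserted / float(size))) ** hash_count


# Method 1 parameters: 1000 bits, 10 hash functions, 19 inserted animals.
rate = estimated_false_positive_rate(1000, 10, 19)
print("estimated false positive rate: {:.2e}".format(rate))
```

With so few items in the filter the estimate is far below one in a million, so twenty lookups are very unlikely to hit a false positive. The rate climbs quickly as the number of inserted items approaches and exceeds what the bit array was sized for.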

Reposted from: https://www.cnblogs.com/tonglin0325/p/7043886.html
