Python crawler (lots of errors)

I don't really know how to write Python, but I gave it a try today.

#!/usr/bin/python
# vim: set fileencoding=utf-8:
import urllib2
import re
import sqlite3
import hashlib
import random
from BeautifulSoup import BeautifulSoup


class SpriderUrl:
    # Initialize with the start URL and the domain to stay within
    def __init__(self, url, domain_name):
        self.url = url
        self.domain_name = domain_name
        # re.escape keeps the dots in the domain name literal
        self.pattern = re.compile(r'(.*)://' + re.escape(domain_name))

    # Return the same-domain hrefs found on one page
    def getLinks(self, url):
        body_text = urllib2.urlopen(url).read()
        soup = BeautifulSoup(body_text)
        hrefs = [link.get('href') for link in soup.findAll('a')]
        # <a> tags without an href give None, which re.match cannot handle
        return [h for h in hrefs if h and self.pattern.match(h)]

    # Crawl the site and store every URL found
    def getUrl(self):
        # Pick a random name for the sqlite3 database file
        md5_str = hashlib.md5(str(random.randint(1, 100000)) + "aa")
        print "data_name:" + md5_str.hexdigest()
        con = sqlite3.connect(md5_str.hexdigest() + ".db")
        # In SQLite, "integer primary key" is what makes id auto-increment
        con.execute("""create table url_data(id integer primary key, url TEXT not null)""")
        urls = [self.url]
        # Pop each URL as it is processed so the loop can terminate
        while len(urls) > 0:
            url = urls.pop()
            try:
                links = self.getLinks(url)
            except Exception:
                # Skip pages that fail to download or parse
                continue
            for link in links:
                # Use ? placeholders instead of building SQL by concatenation
                cur = con.execute("select * from url_data where url=?", (link,))
                if cur.fetchone() is None:
                    urls.append(link)
                    con.execute("insert into url_data(url) values(?)", (link,))
            con.commit()
        print "Done"


t = SpriderUrl('http://www.baidu.com/', "www.baidu.com")
t.getUrl()
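
For comparison, here is a minimal sketch of the same crawl loop in Python 3, using only the standard library. It is not the code from this post: the names `LinkParser`, `extract_links`, `fetch_page` and `crawl` are made up for illustration, `html.parser` stands in for BeautifulSoup, and the download function is passed in as a parameter so the loop can be tried without touching the network. The idea is the same as above: a queue of pages still to visit, with the `url_data` table acting as the record of what has already been seen.

```python
import re
import sqlite3
import urllib.request
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip <a> tags without an href
                self.hrefs.append(href)


def extract_links(html, domain_name):
    """Return the hrefs in html that point at domain_name."""
    parser = LinkParser()
    parser.feed(html)
    pattern = re.compile(r"(.*)://" + re.escape(domain_name))
    return [h for h in parser.hrefs if pattern.match(h)]


def fetch_page(url):
    """Download a page as text; assumes UTF-8 and replaces undecodable bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, domain_name, fetch, con):
    """Breadth-first crawl; fetch(url) is assumed to return the page HTML as str."""
    con.execute(
        "create table if not exists url_data("
        "id integer primary key, url text not null)"
    )
    # Record the start URL so links back to it are not queued again
    con.execute("insert into url_data(url) values(?)", (start_url,))
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # page could not be fetched: skip it
        for link in extract_links(html, domain_name):
            seen = con.execute(
                "select 1 from url_data where url=?", (link,)
            ).fetchone()
            if seen is None:
                con.execute("insert into url_data(url) values(?)", (link,))
                queue.append(link)
        con.commit()
```

To run it against a real site, something like `crawl('http://www.baidu.com/', 'www.baidu.com', fetch_page, sqlite3.connect('urls.db'))` should work, where `urls.db` is just an example file name. Each URL is queued at most once, so the loop ends when no unseen same-domain links remain.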

Reposted from: https://www.cnblogs.com/xiaoCon/p/3478725.html
