Python Crawler, "Salted Fish" Edition

It mainly uses the urllib2 and BeautifulSoup modules.
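Note that urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request. As a rough sketch of what the page-fetching step might look like there (the names `build_request`, `get_source`, and `DEFAULT_HEADERS`, the shortened User-Agent string, and the timeout value are illustrative assumptions, not part of the original script):

```python
# Python 3 sketch: urllib2 was split into urllib.request and urllib.error.
from urllib.request import Request, urlopen

# Hypothetical, shortened User-Agent used only for illustration.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(url, headers=None):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url=url, headers=headers or DEFAULT_HEADERS)


def get_source(url, timeout=10):
    """Fetch a page and return its raw bytes (this performs network I/O)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Building the request separately from opening it keeps the header logic easy to check without touching the network; `get_source` is the only function that actually connects.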

# encoding=utf-8
import re
import urllib2
import datetime
import MySQLdb
from bs4 import BeautifulSoup
import sys

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding("utf-8")


class Splider(object):
    def __init__(self):
        print u'Starting to crawl content...'

    # Fetch the HTML source of a page
    def getsource(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2652.0 Safari/537.36'}
        req = urllib2.Request(url=url, headers=headers)
        socket = urllib2.urlopen(req)
        content = socket.read()
        socket.close()
        return content

    # changepage generates the links for the different page numbers
    def changepage(self, url, total_page):
        now_page = int(re.search(r'page/(\d+)', url, re.S).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # flags must be passed by keyword; the 4th positional argument of re.sub is count
            link = re.sub(r'page/(\d+)', 'page/%d' % i, url, flags=re.S)
            page_group.append(link)
        return page_group

    # Fetch the text and image URLs of a child page
    def getchildrencon(self, child_url):
        conobj = {}
        content = self.getsource(child_url)
        soup = BeautifulSoup(content, 'html.parser', from_encoding='utf-8')
        content = soup.find('div', {'class': 'c-article_content'})
        img = re.findall('src="(.*?)"', str(content), re.S)
        conobj['con'] = content.get_text()
        conobj['img'] = ';'.join(img)
        return conobj

    # Parse a listing page and collect the fields of every article on it
    def getcontent(self, html_doc):
        soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')
        tag = soup.find_all('div', {'class': 'promo-feed-headline'})
        info = {}
        i = 0
        for link in tag:
            info[i] = {}
            title_desc = link.find('h3')
            info[i]['title'] = title_desc.get_text()
            post_date = link.find('div', {'class': 'post-date'})
            pos_d = post_date['data-date'][0:10]
            info[i]['content_time'] = pos_d
            info[i]['source'] = 'whowhatwear'
            source_link = link.find('a', href=re.compile(r"section=fashion-trends"))
            source_url = 'http://www.whowhatwear.com' + source_link['href']
            info[i]['source_url'] = source_url
            # Follow the link to the article page and parse it
            in_content = self.getsource(source_url)
            in_soup = BeautifulSoup(in_content, 'html.parser', from_encoding='utf-8')
            soup_content = in_soup.find('section', {'class': 'widgets-list-content'})
            info[i]['content'] = soup_content.get_text().strip('\n')
            text_con = in_soup.find('section', {'class': 'text'})
            # The original fell back to an undefined NULL; use an empty string instead
            summary = text_con.get_text().strip('\n') if text_con is not None else ''
            info[i]['summary'] = summary[0:200] + '...'
            img_list = re.findall('src="(.*?)"', str(soup_content), re.S)
            info[i]['imgs'] = ';'.join(img_list)
            info[i]['create_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            i += 1
        return info

    # Write every collected article to MySQL
    def saveinfo(self, content_info):
        conn = MySQLdb.Connect(host='127.0.0.1', user='root', passwd='123456', port=3306, db='test', charset='utf8')
        cursor = conn.cursor()
        # Let the driver quote every value instead of formatting them into the SQL string
        sql = "insert into t_fashion_spider2(`title`,`summary`,`content`,`content_time`,`imgs`,`source`,`source_url`,`create_time`) values (%s,%s,%s,%s,%s,%s,%s,%s)"
        for each in content_info:
            for k, v in each.items():
                cursor.execute(sql, (v['title'], v['summary'], v['content'], v['content_time'], v['imgs'], v['source'], v['source_url'], v['create_time']))
        conn.commit()
        cursor.close()
        conn.close()


if __name__ == '__main__':
    classinfo = []
    p_num = 5
    url = 'http://www.whowhatwear.com/section/fashion-trends/page/1'
    jikesplider = Splider()
    all_links = jikesplider.changepage(url, p_num)
    for link in all_links:
        print u'Processing page: ' + link
        html = jikesplider.getsource(link)
        info = jikesplider.getcontent(html)
        classinfo.append(info)
    jikesplider.saveinfo(classinfo)
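The pagination step in changepage can be tried on its own, without any network access. The following is a minimal standalone sketch in Python 3 syntax; the function name `change_page` and the explicit error for URLs without a page segment are assumptions added for illustration, not part of the original script. One detail worth knowing: the fourth positional argument of `re.sub` is `count`, not `flags`, so flags should be passed by keyword.

```python
import re


def change_page(url, total_page):
    """Return the URLs from the URL's current page through total_page, inclusive."""
    match = re.search(r"page/(\d+)", url)
    if match is None:
        raise ValueError("URL has no 'page/<n>' segment: %s" % url)
    now_page = int(match.group(1))
    # Rewrite only the first 'page/<n>' segment for each page number.
    return [
        re.sub(r"page/\d+", "page/%d" % i, url, count=1)
        for i in range(now_page, total_page + 1)
    ]


links = change_page("http://www.whowhatwear.com/section/fashion-trends/page/1", 3)
for link in links:
    print(link)
```

If the starting page is already past `total_page`, the range is empty and the function returns an empty list, so the crawl loop simply does nothing.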

Reposted from: https://www.cnblogs.com/WYlover/p/10728793.html
