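I. Introduction

    This example uses Selenium with PhantomJS, plus pyquery for parsing, to scrape the information on the CCTV program ("column") listing page (http://tv.cctv.com/lm/).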

   

II. Site information

    [The screenshots of the page structure from the original post were not preserved.]


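III. Data scraping

    First, collect the links of all pages to be scraped (39 pages in total) and save them to the database. Only the first page has a real URL; pages 2 to 39 are reached by running the page's own JavaScript pagination call.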

    

    def getUrls(self):
        urls = []
        urls.append('http://tv.cctv.com/lm/')
        for index in range(2,40):
            urls.append("javascript:window.scroll(0,145);DataInteraction({0});showPageTitle_fenyei2('ELMT1413526954890942',{0});".format(index))
        self.db.SaveCCTVColumnUrls(urls,'0')
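As a quick sanity check of the page count, the list that getUrls builds can be reproduced without a database. This is only a minimal Python 3 sketch: `build_page_urls` is a hypothetical helper name, while the JavaScript call and element ID are copied from the code above.

```python
FIRST_PAGE = 'http://tv.cctv.com/lm/'
# JavaScript call copied from getUrls(); {0} is the page number.
PAGE_SCRIPT = ("javascript:window.scroll(0,145);DataInteraction({0});"
               "showPageTitle_fenyei2('ELMT1413526954890942',{0});")


def build_page_urls(last_page=39):
    """Return the first-page URL followed by one pagination script per later page."""
    urls = [FIRST_PAGE]
    for index in range(2, last_page + 1):
        urls.append(PAGE_SCRIPT.format(index))
    return urls


urls = build_page_urls()
print(len(urls))   # 39
print(urls[1])
```

The first entry is loaded with `driver.get`, and every later entry is meant to be passed to `driver.execute_script`, which is how CatchData in the full listing treats them.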


    Based on the site information above, each page is scraped as follows.

    1. First, select the list of entries

      

      Selector code: Elements = doc("div[id='text_box_0']").find('dl').find('dd')

    2. Column name and link

      

      column1Element = element.find('div[class="text"]').find('h3').find('a')

      columnName = column1Element.text().encode('utf8').replace(',', ',').replace('\n', '')

      columnUrl = column1Element.attr('href')
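The same traversal (list container, then each `dd`, then the link inside `div.text > h3`) can be tried offline. The sketch below is an illustration only: it swaps pyquery for the standard library's `xml.etree.ElementTree` so that it runs without extra packages, and the HTML fragment, column names and URLs are made up rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the rendered page (hypothetical data).
SAMPLE = """
<html><body>
<div id="text_box_0">
  <dl>
    <dd>
      <div class="img"><a href="http://example.com/a"><img src="http://example.com/a.jpg"/></a></div>
      <div class="text"><h3><a href="http://example.com/a">Column A</a></h3></div>
    </dd>
    <dd>
      <div class="text"><h3><a href="http://example.com/b">Column B</a></h3></div>
    </dd>
  </dl>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)
# Step 1: the list container, then every <dd> entry beneath it.
box = root.find(".//div[@id='text_box_0']")
columns = []
for dd in box.iter('dd'):
    # Step 2: the link inside div.text > h3 carries the name and the URL.
    link = dd.find(".//div[@class='text']/h3/a")
    columns.append((link.text.strip().replace('\n', ''), link.get('href')))

print(columns)
```

ElementTree only accepts well-formed markup, so on real pages a forgiving parser such as pyquery (as used in this post) remains the practical choice.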

IV. Implementation code

    

# coding=utf-8
import os
import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from datetime import datetime,timedelta
import selenium.webdriver.support.ui as ui
import time
from pyquery import PyQuery as pq
import columnData
import mongoDB
# Note: this listing is Python 2 (print statements, `except X, e` syntax).
class cctvColumnInfo:
    def __init__(self):
        # Get the IEDriverServer.exe path from a config file
        # self.urls = self.getUrls()
        # IEDriverServer ='C:\Program Files\Internet Explorer\IEDriverServer.exe'
        # self.driver = webdriver.Ie(IEDriverServer)
        # self.driver.maximize_window()
        self.driver = webdriver.PhantomJS(service_args=['--load-images=false'])
        #service_args=['--load-images=false']
        self.driver.set_page_load_timeout(10)
        self.driver.maximize_window()
        self.db = mongoDB.mongoDbBase()

    def WriteUrl(self,url):
        fileName = os.path.join(os.getcwd(), 'cctvColumn/cctvColumn_url.txt')
        with open(fileName, 'a') as f:
            f.write('\n'+url)

    def getUrls(self):
        urls = []
        urls.append('http://tv.cctv.com/lm/')
        for index in range(2,40):
            urls.append("javascript:window.scroll(0,145);DataInteraction({0});showPageTitle_fenyei2('ELMT1413526954890942',{0});".format(index))
        self.db.SaveCCTVColumnUrls(urls,'0')
        # return urls

    def WriteLog(self, message,date):
        fileName = os.path.join(os.getcwd(), 'cctvColumn/cctvColumn-'+date + '.txt')
        with open(fileName, 'a') as f:
            f.write(message)

    def getColumnInfo(self, colInfo):
        # Split "<first-broadcast time>主持人<host>播出频道<channel>" on the two markers
        ts = colInfo.split('主持人')
        firstBroadcastTime = ts[0]
        ts1 = ts[1].split('播出频道')
        columnHost = '主持人' + ts1[0]
        broadcastChannel = '播出频道' + ts1[1]
        return firstBroadcastTime, columnHost, broadcastChannel

    def CatchData(self):
        urlIndex = 0
        urls = self.db.GetCCTVColumnUrls()
        itemIndex = 0
        for u in urls:
            url = u['url']
            try:
                # The first page is a real URL; later pages are JavaScript pagination calls
                if url == 'http://tv.cctv.com/lm/':
                    self.driver.get(url)
                else:
                    self.driver.execute_script(url)
                urlIndex += 1
                time.sleep(2)
                selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
                doc = pq(selenium_html)
                # Elements = doc("div[@id='text_box_0']/dl/dd")
                Elements = doc("div[id='text_box_0']").find('dl').find('dd')
                message = ''
                # for element in Elements:
                column_name = url.encode('utf8')
                print url
                for element in Elements.items():
                    colobj = columnData.columnData()
                    itemIndex+=1
                    firstBroadcastTime = ''
                    ReplayBroadcastTime = ''
                    firstBroadcastChannel = ''
                    # column1Element = element.find('div[@class="text"]/h3/a')
                    # column1Element = element.find_element_by_xpath("//div[@class='ui-page-next']")
                    column1Element = element.find('div[class="text"]').find('h3').find('a')
                    # NOTE: as extracted, replace(',', ',') is a no-op; the source post most
                    # likely swapped commas for another character so that field text cannot
                    # break the comma-separated log line.
                    columnName = column1Element.text().encode('utf8').replace(',', ',').replace('\n', '')
                    columnUrl = column1Element.attr('href')
                    colobj.setColumnName(columnName)
                    colobj.setColumnUrl(columnUrl)
                    column_name += '\n' + columnName
                    # time.sleep(3)
                    print columnName
                    # column2Element = element.find('div[@class="text"]/p/a')
                    column2Element = element.find('div[class="text"]').find('p').find('a')
                    columnTimeName = column2Element.text().encode('utf8').replace(',', ',').replace('\n', '')
                    columnTimeUrl = column2Element.attr('href')
                    colobj.setColumnTimeName(columnTimeName)
                    colobj.setColumnTimeUrl(columnTimeUrl)
                    # print columnTimeName + '; ' + columnTimeUrl
                    # column34Elements = element.find('div[@class="text"]/span/a')
                    column34Elements = element.find('div[class="text"]').find('span').find('a')
                    # for column34Element in column34Elements:
                    column34Index = 0
                    pastVideoUrl = ''
                    officialWebsiteUrl = ''
                    # First link: past videos; any later link: official site
                    for column34Element in column34Elements.items():
                        if column34Index == 0:
                            pastVideoUrl = column34Element.attr('href')
                            colobj.setPastVideoUrl(pastVideoUrl)
                        else:
                            officialWebsiteUrl = column34Element.attr('href')
                            colobj.setOfficialWebsiteUrl(officialWebsiteUrl)
                        column34Index += 1
                    # columnImageElement = element.find('div[@class="img"]/a/img')
                    columnImageElement = element.find('div[class="img"]').find('a').find('img')
                    colImgUrl = columnImageElement.attr('src')
                    if colImgUrl is None:
                        # Some entries use class "image" instead of "img"
                        columnImageElement = element.find('div[class="image"]').find('a').find('img')
                        colImgUrl = columnImageElement.attr('src')
                    # print colImgUrl
                    colobj.setColImgUrl(colImgUrl)
                    # First-broadcast time
                    firstBroadcastTime1 = ''
                    # Host
                    columnHost = ''
                    # Broadcast channel
                    firstBroadcastChannel1 =''
                    # columnInfos = element.find('div[@class="lr"]/div')
                    columnInfos = element.find('div[class="lr"]').find('div')
                    if columnInfos:
                        for colInfo in columnInfos.items():
                            firstBroadcastTime1, columnHost, firstBroadcastChannel1 = self.getColumnInfo(colInfo.text().encode('utf8').replace(',', ',').replace('\n', ''))
                            columnHost = columnHost.replace(',', ',')
                            if not firstBroadcastTime:
                                firstBroadcastTime = firstBroadcastTime1
                            if not firstBroadcastChannel:
                                firstBroadcastChannel = firstBroadcastChannel1
                    colobj.setColumnHost(columnHost)
                    colobj.setFirstBroadcastChannel(firstBroadcastChannel1)
                    colobj.setFirstBroadcastTime(firstBroadcastTime1)
                    # Fields: column name, first-broadcast time, replay time, broadcast channel, host,
                    # column URL, column name 1 (with time), column name 1 URL, past-videos URL,
                    # official site URL, column image URL
                    mess = '\n{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}'.format(columnName, firstBroadcastTime,ReplayBroadcastTime,firstBroadcastChannel, columnHost,columnUrl, columnTimeName,columnTimeUrl, pastVideoUrl,officialWebsiteUrl, colImgUrl)
                    # print mess
                    message += mess
                    self.db.SaveCCTVColumnData(colobj,itemIndex)
                    self.db.SaveCCTVColumnUrl(columnUrl, '1', columnName)
                date = time.strftime('%Y-%m-%d')
                self.WriteLog(message, date)
                self.WriteUrl(column_name)
                self.db.SetCCTVColumnUrlCrawlState(url)
            except TimeoutException,e:
                print 'timeout url:  '+url
        self.driver.close()
        self.driver.quit()

    def getBroadCast(self):
        urls = self.db.GetSubCCTVColumnUrls()
        for u in urls:
            firstBroadcastTime = ''
            ReplayBroadcastTime = ''
            firstBroadcastChannel = ''
            messsage = ''
            url = u['url']
            # url='http://tv.cctv.com/lm/xqds'
            # url='http://tv.cctv.com/lm/24xiaoshi/'
            columnName = u['columnName']
            #     u'http://tv.cctv.com/lm/kanjian'
            try:
                self.driver.get(url)
                time.sleep(2)
                selenium_html = self.driver.execute_script("return document.documentElement.outerHTML")
                doc = pq(selenium_html)
                Elements = doc("p[class='p_1']")
                index = 0
                # Layout A: three <p class="p_1"> elements hold time, replay time and channel
                for element in Elements.items():
                    if index == 0:
                        firstBroadcastTime = element.text().encode('utf8').replace(',', ',').replace('\n', '')
                    elif index == 1:
                        ReplayBroadcastTime = element.text().encode('utf8').replace(',', ',').replace('\n', '')
                    elif index == 2:
                        firstBroadcastChannel = element.text().encode('utf8').replace(',', ',').replace('\n', '')
                        break
                    index += 1
                # Layout B: no p.p_1 found, fall back to the table in div.head_msg
                if index == 0:
                    Elements = doc("div[class='head_msg']").find('table').find('tbody').find('tr')
                    for element in Elements.items():
                        messsage+=element.text().encode('utf8').replace(',', ',').replace('\n', '')
                    if messsage:
                        firstBroadcastTime, ReplayBroadcastTime, firstBroadcastChannel= self.getBroadInfo(columnName.encode('utf8'),messsage)
                self.db.SetCCTVColumnUrlCrawlState(url)
                if firstBroadcastChannel:
                    colobj = columnData.columnData()
                    colobj.setColumnName(columnName)
                    colobj.setFirstBroadcastTime(firstBroadcastTime)
                    colobj.setFirstBroadcastChannel(firstBroadcastChannel)
                    colobj.setReplayBroadcastTime(ReplayBroadcastTime)
                    self.db.UpdateCCTVColumnData(colobj)
                print '\n'
                print url
                print columnName
                print firstBroadcastTime
                print firstBroadcastChannel
                print ReplayBroadcastTime
            except TimeoutException, e:
                print 'TimeoutException:'+url

    def getBroadInfo(self,columnName,column):
        # Sample input: column ='首播频道: CCTV-14首播时间: 周三17:15'
        firstBroadcastTime = ''
        ReplayBroadcastTime = ''
        firstBroadcastChannel = ''
        column=column.replace('栏目大全','')
        # Cut off trailing navigation text and normalize channel names
        if '>>' in column:
            index = column.index('>>')
            column = column[0:index]
        if 'CCTV13' in column:
            column = column.replace('CCTV13', 'CCTV-13')
        if 'CCTV6' in column:
            column = column.replace('CCTV6', 'CCTV-6')
        if 'CCTV1' in column:
            column = column.replace('CCTV1','CCTV-1')
        if '官方微信' in column:
            index = column.index('官方微信')
            column = column[0:index]
        # if '停播' in column or '关闭' in column:
        #     return firstBroadcastTime, ReplayBroadcastTime, firstBroadcastChannel
        # elif '>>' in column:
        #     index = column.index('>>')
        #     column = column[0:index]
        if '首播时间' in column:
            if '重播时间' in column:
                cols = column.split('重播时间')
                firstBroadcastTime = cols[0]
                if '独播频道' in cols[1]:
                    ReplayBroadcastTime = '重播时间' + cols[1].split('独播频道')[0]
                    firstBroadcastChannel = '独播频道' + cols[1].split('独播频道')[1]
                elif '首播频道' in cols[1]:
                    ReplayBroadcastTime = '重播时间' + cols[1].split('首播频道')[0]
                    firstBroadcastChannel = '首播频道' + cols[1].split('首播频道')[1]
                elif '播出频道' in cols[1]:
                    ReplayBroadcastTime = '重播时间' + cols[1].split('播出频道')[0]
                    firstBroadcastChannel = '播出频道' + cols[1].split('播出频道')[1]
            elif '独播频道' in column:
                cols = column.split('独播频道')
                firstBroadcastTime = cols[0]
                firstBroadcastChannel = '独播频道' + cols[1]
            elif '播出频道' in column:
                cols = column.split('播出频道')
                firstBroadcastTime = cols[0]
                firstBroadcastChannel = '播出频道' + cols[1]
            elif '首播频道' in column:
                cols = column.split('首播频道')
                index = column.index('首播频道')
                if index==0:
                    # Channel comes first, time second
                    cols = column.split('首播时间')
                    firstBroadcastChannel = cols[0]
                    firstBroadcastTime = '首播时间' + cols[1]
                else:
                    firstBroadcastTime = cols[0]
                    firstBroadcastChannel = '首播频道' + cols[1]
        else:
            if '首播(' in column and '重播(' in column:
                if '独播频道' in column:
                    cols = column.split('独播频道')
                    firstBroadcastChannel = '独播频道' + cols[1]
                    firstBroadcastTime = cols[0]
                    # Sample input: '首播(生活): 一-六18:52 日18:42重播(生活): 一-五 日16:08首播(文史): 一-五22:43六日22:33/30重播(文史): 二-五06:46六日06:24'
                    if '(生活版)' in columnName:
                        if '首播(文史)' in firstBroadcastTime:
                            temp = firstBroadcastTime.split('首播(文史)')[0]
                            if '重播(生活)' in temp:
                                firstBroadcastTime = '首播时间: '+temp.split('重播(生活)')[0].replace('首播(生活): ','')
                                ReplayBroadcastTime = '重播时间: '+temp.split('重播(生活)')[1].replace(': ','')
                    # Sample input: 首播(文史): 一-五22:43六日22:33/30重播(文史): 二-五06:46六日06:24首播(生活): 一-六18:52 日18:42重播(生活): 一-五 日16:08
                    elif '(文史版)' in columnName:
                        if '首播(生活)' in firstBroadcastTime:
                            temp = firstBroadcastTime.split('首播(生活)')[0]
                            if '重播(文史)' in temp:
                                firstBroadcastTime = '首播时间: '+temp.split('重播(文史)')[0].replace('首播(文史): ','')
                                ReplayBroadcastTime = '重播时间: '+ temp.split('重播(文史)')[1].replace(': ','')
            elif '播出频道' in column:
                cols = column.split('播出频道')
                firstBroadcastTime = cols[0]
                firstBroadcastChannel = '播出频道' + cols[1]
            elif '首播频道' in column:
                cols = column.split('首播频道')
                firstBroadcastTime = cols[0]
                firstBroadcastChannel = '首播频道' + cols[1]
        return firstBroadcastTime,ReplayBroadcastTime,firstBroadcastChannel

    def exportColumnInfo(self):
        columns = self.db.GetCCTVColumnData()
        for col in columns:
            columnName = col['columnName'].encode('utf8')
            firstBroadcastTime = col['firstBroadcastTime'].encode('utf8')
            firstBroadcastTime=firstBroadcastTime.replace('首播时间: ','')
            # Strip field labels and normalize channel names before export
            firstBroadcastChannel = col['firstBroadcastChannel'].encode('utf8').replace("播出频道:", "").replace("独播频道:", "").replace("首播频道:", "")
            firstBroadcastChannel =firstBroadcastChannel.replace(")","").replace("(","").replace("CCTV-8电视剧","CCTV-8 电视剧")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV-1综合频道", "CCTV-1 综合频道")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV-1高清频道", "CCTV-1 高清频道")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV13", "CCTV-13")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV1", "CCTV-1")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV-少儿", "CCTV-14 少儿")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV6", "CCTV-6")
            firstBroadcastChannel = firstBroadcastChannel.replace("CCTV-12社会与法", "CCTV-12 社会与法")
            replayBroadcastTime = col['replayBroadcastTime'].encode('utf8')
            replayBroadcastTime = replayBroadcastTime.replace('重播时间:', '')
            columnHost = col['columnHost'].encode('utf8')
            columnUrl = col['columnUrl'].encode('utf8')
            columnTimeName = col['columnTimeName'].encode('utf8')
            columnTimeUrl = col['columnTimeUrl']
            if columnTimeUrl:
                columnTimeUrl = columnTimeUrl.encode('utf8')
            officialWebsiteUrl = col['officialWebsiteUrl'].encode('utf8')
            pastVideoUrl = col['pastVideoUrl'].encode('utf8')
            colImgUrl = col['colImgUrl'].encode('utf8')
            # Fields: column name, first-broadcast time, replay time, broadcast channel, host,
            # column URL, column name 1 (with time), column name 1 URL, past-videos URL,
            # official site URL, column image URL
            message = '\n{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}'.format(columnName, firstBroadcastTime,replayBroadcastTime,firstBroadcastChannel, columnHost,columnUrl, columnTimeName,columnTimeUrl, pastVideoUrl,officialWebsiteUrl, colImgUrl)
            date = time.strftime('%Y-%m-%d')
            self.WriteLog(message, date)


obj = cctvColumnInfo()
# obj.getUrls()
# obj.CatchData()
# obj.getBroadCast()
obj.exportColumnInfo()
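The marker-based splitting in getColumnInfo can be tried in isolation. The following is a minimal Python 3 sketch, not the original method: `split_column_info` is a hypothetical name, the sample text is invented, and `str.partition` is used in place of `split` so that a missing marker yields empty strings instead of an IndexError.

```python
def split_column_info(text, host_marker='主持人', channel_marker='播出频道'):
    """Split '<time><host_marker><host><channel_marker><channel>' into three parts.

    Each marker is kept at the front of its own part, as in getColumnInfo.
    If the host marker is absent, the whole text is returned as the first part.
    """
    before_host, host_sep, after_host = text.partition(host_marker)
    host_part, channel_sep, channel_part = after_host.partition(channel_marker)
    first_time = before_host
    host = host_sep + host_part            # '' when the host marker is absent
    channel = channel_sep + channel_part   # '' when the channel marker is absent
    return first_time, host, channel


# Invented sample text in the shape the method expects.
sample = '首播时间:周一 20:00主持人:张三播出频道:CCTV-1'
print(split_column_info(sample))
print(split_column_info('首播时间:周一 20:00'))
```

For text that contains each marker exactly once, the result matches what the `split`-based original returns; the two differ only when a marker is missing or repeated.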


# coding=utf-8
import os
from pymongo import MongoClient
from pymongo import ASCENDING, DESCENDING
import codecs
import time
import columnData
import datetime
import re


class mongoDbBase:
    # def __init__(self, databaseIp = '127.0.0.1',databasePort = 27017,user = "ott",password= "ott", mongodbName='OTT_DB'):
    def __init__(self, connstr='mongodb://ott:ott@127.0.0.1:27017/', mongodbName='OTT'):
        # client = MongoClient(connstr)
        # self.db = client[mongodbName]
        client = MongoClient('127.0.0.1', 27017)
        self.db = client.OTT
        self.db.authenticate('ott', 'ott')

    def SaveCCTVColumnData(self,columnData,index):
        count = self.db.column_data.find({'columnName': columnData.getColumnName()}).count()
        if count == 0:
            dictM ={'columnName':columnData.getColumnName(),
                    'firstBroadcastTime':columnData.getFirstBroadcastTime(),
                    'replayBroadcastTime':'',
                    'firstBroadcastChannel':columnData.getFirstBroadcastChannel(),
                    'columnHost':columnData.getColumnHost(),
                    'columnUrl':columnData.getColumnUrl(),
                    'columnTimeName':columnData.getColumnTimeName(),
                    'columnTimeUrl':columnData.getColumnTimeUrl(),
                    'officialWebsiteUrl':columnData.getOfficialWebsiteUrl(),
                    'pastVideoUrl': columnData.getPastVideoUrl(),
                    'colImgUrl':columnData.getColImgUrl(),
                    'index':index}
            self.db.column_data.insert(dictM)

    def GetCCTVColumnData(self):
        columns = self.db.column_data.find({},{'_id':0})
        return columns

    def UpdateCCTVColumnData(self, columnData):
        dictM ={'$set':{'replayBroadcastTime':columnData.getReplayBroadcastTime(),
                        'firstBroadcastTime':columnData.getFirstBroadcastTime(),
                        'firstBroadcastChannel': columnData.getFirstBroadcastChannel()}}
        self.db.column_data.update({"columnName":columnData.getColumnName()},dictM)

    def SaveCCTVColumnUrl(self, url,suburl,columnName):
        dictM = {'url': url, 'iscrawl': '0','suburl':suburl,'columnName':columnName}
        # db.urls.find({iscrawl:'1'}).count()
        count = self.db.columnurls.find({'url': url}).count()
        if count == 0:
            self.db.columnurls.insert(dictM)

    def SaveCCTVColumnUrls(self, urlList,suburl):
        index = 0
        for url in urlList:
            # db.urls.find({iscrawl:'1'}).count()
            count = self.db.columnurls.find({'url': url}).count()
            if count == 0:
                dictM = {'url': url, 'iscrawl': '0', 'suburl': suburl,'index':index}
                self.db.columnurls.insert(dictM)
            index += 1
        # self.db.Meeting.update({'title': meet["title"],'date': meet["date"]}, {'$set': dictM}, {'upsert': True})

    def GetCCTVColumnUrls(self):
        urls = self.db.columnurls.find({'iscrawl': '0','suburl':'0'}, {'_id': 0, 'url': 1})
        # for url in urls:
        #     #http://top.chinaz.com/hangye/index_yule.html
        #     print urls['url']
        #     break
        return urls

    def GetSubCCTVColumnUrls(self):
        urls = self.db.columnurls.find({'iscrawl': '0', 'suburl': '1'}, {'_id': 0, 'url': 1,'columnName':1})
        # urls = self.db.columnurls.find({'firstBroadcastChannel': re.compile('栏目'), 'suburl': '1'}, {'_id': 0, 'url': 1, 'columnName': 1})
        return urls

    # def SetUrlCrawlState(self,urlList):
    #     for url in urlList:
    #         self.db.urls.update({'url':url},{'$set':{'iscrawl':'1'}})

    def SetCCTVColumnUrlCrawlState(self, url):
        # db.urls.update({iscrawl:'1'},{'$set':{iscrawl:'0'}},false,true)
        self.db.columnurls.update({'url': url}, {'$set': {'iscrawl': '1'}})

# d = mongoDbBase()
# urls = []
# urls.append('abc')
# # d.SaveUrls(urls)
# d.SetUrlCrawlState(urls)
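The `columnurls` collection acts as a resumable work queue: a URL is stored only once, starts with `iscrawl` set to '0', and is flipped to '1' after a successful scrape, so a rerun picks up only what is left. The sketch below illustrates that bookkeeping in Python 3 with a plain dictionary standing in for MongoDB; `UrlQueue` and its method names are hypothetical and not part of the original code.

```python
class UrlQueue:
    """In-memory stand-in for the columnurls collection (illustration only)."""

    def __init__(self):
        self._docs = {}  # url -> document with the same fields as above

    def save_urls(self, url_list, suburl):
        for index, url in enumerate(url_list):
            if url not in self._docs:  # same effect as the count() == 0 check
                self._docs[url] = {'url': url, 'iscrawl': '0',
                                   'suburl': suburl, 'index': index}

    def pending(self, suburl):
        """URLs of the given kind that have not been crawled yet."""
        return [d['url'] for d in self._docs.values()
                if d['iscrawl'] == '0' and d['suburl'] == suburl]

    def mark_crawled(self, url):
        if url in self._docs:
            self._docs[url]['iscrawl'] = '1'


queue = UrlQueue()
queue.save_urls(['page-1', 'page-2'], '0')
queue.save_urls(['page-2', 'page-3'], '0')   # 'page-2' is not stored twice
queue.mark_crawled('page-1')
print(queue.pending('0'))   # ['page-2', 'page-3']
```

In the real class the same roles are played by SaveCCTVColumnUrls, GetCCTVColumnUrls and SetCCTVColumnUrlCrawlState.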


    # Requires `import requests` at the top of the module.
    def download(self, url, name):
        try:
            # url='http://pp.myapp.com/ma_icon/0/icon_10910_1523714409/96'
            # name='D:\work\python_crawl\down\2019.jpg'
            pic = requests.get(url, timeout=5)
            with open(name, 'wb') as f:
                f.write(pic.content)
        except requests.exceptions.ConnectionError:
            print('Unable to download this image')
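One detail worth noting about the commented-out example path above: it is written as a plain string literal with backslashes, and in Python `\201` is an octal escape, so the file name would silently change if that line were enabled as is. A minimal Python 3 sketch of the difference, using the standard `ntpath` module so that it behaves the same on any operating system:

```python
import ntpath

# Plain literal: '\201' is read as an octal escape (one character, 0x81),
# so the intended file name '2019.jpg' is corrupted.
plain = 'down\2019.jpg'
# Raw literal: backslashes are kept exactly as written.
raw = r'D:\work\python_crawl\down\2019.jpg'

print(plain == 'down\x819.jpg')   # True
print(ntpath.basename(raw))       # 2019.jpg
print(ntpath.dirname(raw))        # D:\work\python_crawl\down
```

Writing Windows paths as raw strings, or with forward slashes, avoids the problem.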

Reposted from: https://www.cnblogs.com/shaosks/p/8759388.html
