Requirement: scrape each movie's title, rating, and lead actors, for example:
捉妖记2 梁朝伟 白百何 9.3分
喵星人 古天乐 马丽 9.0分
祖宗十九代 岳云鹏 吴京 8.9分
奇门遁甲 大鹏 倪妮 9.0分
勇敢者游戏:决战丛林 道恩・强森 凯文・哈特 9.3分

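First, analyze the page URLs. Page 1 is https://dianying.2345.com/list/-------.html and page 2 is https://dianying.2345.com/list/-------2.html, so each URL is composed of https://dianying.2345.com/list/ + ------- (seven hyphens) + page number + .html.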
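The URL pattern above can be sketched as a small helper. This is only an illustration: the names BASE_URL, SEPARATOR, and page_url are made up here, and the seven-hyphen separator is taken from the '-------' literal used in the script further down.

```python
BASE_URL = 'https://dianying.2345.com/list/'
SEPARATOR = '-------'  # seven hyphens, as in the script's URL literal


def page_url(page):
    """Build the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    # The article shows page 1 without a number; later pages append it.
    suffix = '' if page == 1 else str(page)
    return BASE_URL + SEPARATOR + suffix + '.html'


for n in (1, 2, 3):
    print(page_url(n))
```

The comment in the script's constructor suggests that -------1.html is also accepted for the first page, which is why the crawler can simply append the page number every time.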
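Next, analyze the content to scrape:

Inspecting the page markup shows that everything to scrape sits inside the div with class 'v_picConBox mt15'.

Movie title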

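There is one small scraping pitfall to watch for here. On every listing page, the 2345 movie site interleaves advertisements among the movie entries (an annoying habit). Take care not to scrape the ads: the HTML of an ad entry differs from that of a movie entry, so code written for movie entries will not work on them.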
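One way to skip the ads is to keep only the li entries that carry a media attribute, which is the check the script below relies on. The following is a minimal sketch on hand-written sample markup: SAMPLE_HTML and parse_movies are invented for illustration, the real page structure may differ, and html.parser is used instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; an assumption about the page layout, not a copy of it.
SAMPLE_HTML = """
<div class="v_picConBox mt15">
  <ul class="v_picTxt pic180_240 clearfix">
    <li media="1">
      <span class="pRightBottom"><em>9.3分</em></span>
      <em class="emTit"><a href="#">Movie A</a></em>
      <span class="sDes"><em><a href="#">Actor One</a></em><em><a href="#">Actor Two</a></em></span>
    </li>
    <li><a href="#">Advertisement</a></li>
  </ul>
</div>
"""


def parse_movies(html):
    """Return (title, actors, score) tuples, skipping entries without 'media'."""
    soup = BeautifulSoup(html, 'html.parser')
    box = soup.find('div', attrs={'class': 'v_picConBox mt15'})
    movies = []
    for li in box.find_all('li'):
        if li.get('media') is None:
            continue  # ad entries have no media attribute in this sample
        title = li.find('em', attrs={'class': 'emTit'}).find('a').get_text().strip()
        score = li.find('span', attrs={'class': 'pRightBottom'}).find('em').get_text().strip()
        actor_links = li.find('span', attrs={'class': 'sDes'}).find_all('a')
        actors = [a.get_text().strip() for a in actor_links]
        movies.append((title, actors, score))
    return movies


print(parse_movies(SAMPLE_HTML))
```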

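The code follows.

Note: the blog platform may not display double underscores. `if __name__ == '__main__'` and `def __init__(self)` must be written with a double underscore (__) before and after name, main, and init.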

"""
Created on Fri Apr 13 19:08:03 2018

@author: oyzm
"""

import urllib.request
import urllib.error
from urllib.parse import quote
from bs4 import BeautifulSoup
import pybloom_live
import string
import codecs
import time
import random
from mylog import MyLog

# Bloom filter for de-duplication; the script creates it but does not use it further
boom = pybloom_live.BloomFilter(capacity=10000, error_rate=0.0001)

class DyItem(object):
    title = None   # movie title
    action = None  # lead actors
    score = None   # rating

class GetDianYing(object):
    def __init__(self):
        # e.g. https://dianying.2345.com/list/-------1.html
        self.urlBase = 'https://dianying.2345.com/list/'
        self.mylog = MyLog()
        self.pages = self.getPages(self.urlBase)
        self.context = self.spider(self.urlBase, self.pages)
        self.piplines(self.context)

    # Send the request; returns None if the request fails
    def getHttp(self, url):
        try:
            # time.sleep(random.randint(0, 5))
            url = quote(url, string.printable)
            response = urllib.request.urlopen(url)
        except urllib.error.URLError as e:
            self.mylog.debug('Failed to fetch %s, reason: %s' % (url, e))
        else:
            self.mylog.debug('Fetched %s successfully' % url)
            return response.read()

    # Get the number of pages
    def getPages(self, urlBase):
        httpRe = self.getHttp(urlBase)
        soup = BeautifulSoup(httpRe, 'lxml')
        liTag = soup.find("div", attrs={"class": "v_page"})
        tags = liTag.find_all("a")
        lastTag = tags[-2]
        total = int(lastTag.get_text().strip())
        # No proxy is set up, so only crawl 5 pages for testing
        return 5

    # Crawl
    def spider(self, urlBase, pages):
        url = ''
        context = []
        # Listing pages are numbered from 1
        for i in range(1, pages + 1):
            url = urlBase + '-------' + str(i) + '.html'
            self.mylog.debug('Start crawling %s' % url)
            httpRe = self.getHttp(url)
            soup = BeautifulSoup(httpRe, 'lxml')
            divTag = soup.find("div", attrs={"class": "v_picConBox mt15"})
            ulTag = divTag.find("ul", attrs={"class": "v_picTxt pic180_240 clearfix"})
            liTag = ulTag.find_all("li")
            for tag in liTag:
                dyItem = DyItem()
                # Skip ads: only movie entries carry a "media" attribute
                if tag.get("media") is not None:
                    dyItem.score = tag.find("span", attrs={"class": "pRightBottom"}).find("em").get_text()
                    dyItem.title = tag.find("em", attrs={"class": "emTit"}).find("a").get_text()
                    Faction = tag.find("span", attrs={"class": "sDes"})
                    Saction = Faction.find_all("em")
                    # Actors are joined with spaces: a b
                    actions = ''
                    for action in Saction:
                        action = action.find("a").get_text().strip()
                        actions = actions + action + ' '
                    dyItem.action = actions
                    context.append(dyItem)
                    self.mylog.debug("Scraped movie <<%s>> successfully" % dyItem.title)
        return context

    # Store
    def piplines(self, context):
        dyName = '2345电影.txt'
        nowTime = time.strftime('%Y-%m-%d %H:%M:%S\r\n', time.localtime())
        with codecs.open(dyName, 'w', 'utf8') as fp:
            fp.write('run time:%s' % nowTime)
            for item in context:
                fp.write('%s \t %s \t %s \t \r\n' % (item.title, item.action, item.score))
                self.mylog.info(u'Saved movie <<%s>> to %s' % (item.title, dyName))


if __name__ == '__main__':
    GetDianYing()
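In the script above, getHttp returns nothing when a request fails, so the caller would hand None to BeautifulSoup. One possible way to handle this is to check for None and skip the page. The sketch below is an assumption about how that could look, not part of the original script: fetch, crawl, and the opener parameter are hypothetical names, and opener exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        response = opener(url)
    except urllib.error.URLError as exc:
        print('failed to fetch %s: %s' % (url, exc))
        return None
    return response.read()


def crawl(urls, opener=urllib.request.urlopen):
    """Fetch each URL and keep only the pages that downloaded successfully."""
    pages = []
    for url in urls:
        body = fetch(url, opener)
        if body is None:
            continue  # skip failed pages instead of passing None to a parser
        pages.append(body)
    return pages
```

Re-enabling the commented-out time.sleep(random.randint(0, 5)) between requests would also make the crawler gentler on the site, which matters because the script runs without a proxy.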

Logging code (mylog)

# -*- coding: utf-8 -*-

"""
Created on Fri Apr 6 20:49:35 2018

@author: oyzm
"""

import logging
import getpass
import sys

class MyLog(object):
    def __init__(self):
        self.user = getpass.getuser()
        # Create the logger
        self.logger = logging.getLogger(self.user)
        # Set the level
        self.logger.setLevel(logging.DEBUG)

        # Derive the log file name from the script name (drop the ".py" suffix)
        self.logName = sys.argv[0][0:-3] + '.log'
        # Define the log format
        self.formatter = logging.Formatter(' %(asctime)-12s %(filename)s %(funcName)s %(name)s %(message)s\n')

        # Define the handlers: ERROR and above go to the file, DEBUG and above to the console
        self.fileHandle = logging.FileHandler(self.logName, encoding='utf-8')
        self.fileHandle.setFormatter(self.formatter)
        self.fileHandle.setLevel(logging.ERROR)
        self.streamHandle = logging.StreamHandler()
        self.streamHandle.setFormatter(self.formatter)
        self.streamHandle.setLevel(logging.DEBUG)

        # Add the handlers
        self.logger.addHandler(self.fileHandle)
        self.logger.addHandler(self.streamHandle)

    # Log at each level
    def debug(self, msg):
        self.logger.debug(msg)

    def error(self, msg):
        self.logger.error(msg)

    def warn(self, msg):
        # Logger.warn is deprecated; warning is the supported name
        self.logger.warning(msg)

    def info(self, msg):
        self.logger.info(msg)

    def critical(self, msg):
        self.logger.critical(msg)


if __name__ == '__main__':
    ml = MyLog()
    ml.debug('This is an error test')
    ml.warn('This is an error test')
    ml.error('This is an error test')
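MyLog attaches two handlers with different levels to one logger: ERROR for the file and DEBUG for the console. To see what that means in practice, here is a small sketch that uses in-memory streams in place of the real file and console. The helper build_logger, the logger name 'demo', and the simplified format string are all made up for this illustration.

```python
import io
import logging


def build_logger(name, file_stream, console_stream):
    """Logger with an ERROR-level 'file' handler and a DEBUG-level 'console' handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo output out of the root logger
    formatter = logging.Formatter('%(levelname)s %(message)s')

    file_like = logging.StreamHandler(file_stream)  # stands in for FileHandler
    file_like.setLevel(logging.ERROR)
    file_like.setFormatter(formatter)

    console = logging.StreamHandler(console_stream)
    console.setLevel(logging.DEBUG)
    console.setFormatter(formatter)

    logger.addHandler(file_like)
    logger.addHandler(console)
    return logger


file_buf, console_buf = io.StringIO(), io.StringIO()
log = build_logger('demo', file_buf, console_buf)
log.debug('debug message')
log.error('error message')

print('file handler received:')
print(file_buf.getvalue())
print('console handler received:')
print(console_buf.getvalue())
```

With this setup the debug and info calls made by the crawler appear only on the console, while the log file stays limited to errors.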
