A Python script for scraping the Bing daily wallpaper

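1. Background

Some of the daily background images on Bing search make good desktop wallpapers, but only some of them come with a download option. Clicking through to download one every day is tedious, so I took this as a first exercise in writing a Python crawler. The result is very simple.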

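2. Related techniques

2.1 Python crawler references

2.2 Python regular expressions

2.3 Handling login

Some websites require you to log in before they serve content; this is probably true of most sites.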
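The script in this post never has to log in, since the Bing home page is public. For sites that do, one common approach is to keep the session cookies between requests. Below is a minimal sketch using only the standard library. The function names are made up for this example, and the login URL and form field names are hypothetical and differ from site to site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    # The cookie jar stores whatever cookies the server sets, and the opener
    # sends them back on later requests made through it.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, fields):
    # POST the form fields (hypothetical names such as "username"/"password").
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=data, timeout=10) as resp:
        return resp.status
```

After a successful login, pages that need the session would be fetched with opener.open(...) instead of urllib.request.urlopen(...).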

2.4 logging: the built-in logging library
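As a quick illustration of the logging module, here is a minimal sketch that sends records to both the console and a file. The helper name, the logger name, and the file name are placeholders chosen for this example, not the ones used in the script.

```python
import logging


def make_logger(name, log_path):
    # One logger, two handlers: console output plus an appended log file.
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s",
                                  datefmt="%Y-%m-%d %H:%M:%S")
    handlers = (
        logging.StreamHandler(),
        logging.FileHandler(log_path, mode="a", encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling make_logger("demo", "demo.log").info("hello") would print one line to the console and append the same line to demo.log.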

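3. Crawler implementation

The crawler has three parts: request, parse, and save.

Only the main logic is shown below. See GitHub for the complete code.

3.1 Request script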

import logging
import urllib.request


def getHtml(url):
    # Fetch the page and return its body decoded as UTF-8 text.
    page = urllib.request.urlopen(url)
    html = page.read()
    if html:
        logging.debug("Get Response:" + str(len(html)))
    else:
        logging.warning("Request failed!")
    return html.decode('utf-8')
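Note that urlopen raises an exception on network errors rather than returning an empty body, so the else branch above rarely fires. A hedged variant that adds a timeout, a User-Agent header, and explicit error handling might look like the sketch below. The function name, the header value, and the 10-second timeout are my own choices and not part of the original script.

```python
import logging
import urllib.error
import urllib.request


def get_html_safe(url, timeout=10):
    # Some servers reject requests that lack a browser-like User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as page:
            raw = page.read()
            # Fall back to UTF-8 when the response does not declare a charset.
            charset = page.headers.get_content_charset() or "utf-8"
    except (urllib.error.URLError, OSError) as exc:
        logging.warning("Request failed: %s", exc)
        return None
    logging.debug("Get Response:%d", len(raw))
    return raw.decode(charset, errors="replace")
```

The caller then has to check for None before passing the result on to the parser.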

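3.2 Parsing script

The parsing script is the core. It defines two approaches: one matches with a regular expression, the other parses the document tree with BeautifulSoup. The document-tree approach was developed against a page saved from the browser, but the saved page turned out to differ from the response returned by requesting http://cn.bing.com/ directly, because JavaScript post-processes the page after it loads. That makes the tree approach unusable for the crawler, so I fell back to regular-expression matching, which works reasonably well.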
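To make the regular-expression approach concrete, here is a sketch of a somewhat tighter pattern than a fixed-length wildcard match. It assumes the page embeds the image path as url: followed by a double-quoted string ending in .jpg, which is the shape of the first match in the sample log output in section 4. The real page markup may differ and has likely changed since. The function name and the sample string are made up for this example.

```python
import re

# Capture the double-quoted path after "url:", stopping at the closing quote.
JPG_PATTERN = re.compile(r'url:\s*"([^"]+\.jpg)"')


def find_jpg_path(html):
    # Return the first matching path, or None if the page has no match.
    match = JPG_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical page fragment built around the path seen in the log output.
sample = 'g_img={url: "/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg",id:"bgDiv"}'
print(find_jpg_path(sample))
```

Because the capture group excludes the quotes, no further split on the quote character is needed afterwards.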

import json
import logging
import re

from bs4 import BeautifulSoup

# host is not defined in this excerpt; it is assumed here to be the same
# Bing base URL that the save script uses.
host = "http://cn.bing.com"


def getJpg(html):
    # Match strings of the form url:...jpg. I did not work out a more precise
    # pattern, so this simply allows 10 to 90 characters in between.
    reg = r'(url:.{10,90}jpg)'
    logging.debug("Using re " + reg + " to get Jpg")
    jpgre = re.compile(reg)
    jpglist = re.findall(jpgre, html)
    if jpglist:
        logging.debug("Get jpg list(" + str(len(jpglist)) + "):" + str(jpglist))
        jpgUrl = jpglist[0].split('"')[1]
        imageUrl = host + jpgUrl
        logging.info("Get jpg url:" + imageUrl)
        return imageUrl
    # No match: say so explicitly instead of silently returning None.
    logging.warning("No jpg url found")
    return None


def bingParser(html):
    # Parsing the live response directly fails to find the element:
    # soup = BeautifulSoup(html, "html.parser")
    # Parsing a page saved from the browser worked at first.
    with open('Bing.html') as f:
        soup = BeautifulSoup(f, "html.parser")
    print(soup.title)
    print(type(soup.a))
    print(soup.select('#bgDiv'))
    style = (soup.select('#bgDiv')[0].attrs['style']).strip()
    print(style)
    json_style = json.dumps(style)
    print(json_style)
    imageurl = style.strip().split(';')[-3:-2]
    # print(imageurl[0].split('"')[1])
    imageUrl = (imageurl[0].split('"')[1])
    # imageUrl = (imageurl[0].split(':')[1].strip().split('"')[1])
    print(imageUrl)
    return imageUrl

3.3 Save script

The save script is the one you actually run, so it calls the other scripts.

import logging
import sys
import urllib.parse
import urllib.request

import parseHtml
import request

# Configure logging
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    filename='bingcn.log',
                    filemode='a'
                    )

host = "http://cn.bing.com"
logging.info("From:" + host)
html = request.getHtml(host)
imageurl = parseHtml.getJpg(html)
if not imageurl:
    # Nothing matched; stop before the string concatenation below fails.
    logging.error("No image url found")
    sys.exit(1)
logging.info("Image url:" + imageurl)
fileName = imageurl.split('/')[-1]
logging.info("Image file name:" + fileName)


def saveImg(imageURL, fileName):
    url = imageURL
    logging.info('Image file url:' + url)
    # url = urllib.parse.urlencode(url)
    u = urllib.request.urlopen(url)
    data = u.read()
    # The with-statement closes the file even if the write fails.
    with open(fileName, 'wb') as f:
        f.write(data)
    logging.info("Save file :" + imageURL)


saveImg(imageurl, fileName)
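The file name above is taken from the last path segment of the image URL. Here is a small sketch of that step as reusable helpers, written so that a query string would not end up in the file name. The helper names and the optional directory argument are illustrative and not part of the original script.

```python
import os
import posixpath
import urllib.parse


def file_name_from_url(image_url):
    # Use only the path component, so "?query" and "#fragment" are ignored.
    path = urllib.parse.urlsplit(image_url).path
    return posixpath.basename(path)


def save_bytes(data, file_name, directory="."):
    # Write the downloaded bytes and return the path that was written.
    target = os.path.join(directory, file_name)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

With these, the download step reduces to reading the response body and calling save_bytes(data, file_name_from_url(url)).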

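4. Running

The scripts are written for Python 3; just run saveImage.py.

If logging to a file is enabled, the log file bingcn.log appears in the current directory, and the downloaded image is saved there as well.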

james@james:~/code/hello-world/code/python/networkong/pycrowler/crowler_bingcn > python3 saveImage.py

2017-06-26 14:36:05 saveImage.py[line:19] INFO From:http://cn.bing.com

2017-06-26 14:36:06 request.py[line:12] DEBUG Get Response:126510

2017-06-26 14:36:06 parseHtml.py[line:91] DEBUG Using re (url:.{10,90}jpg) to get Jpg

2017-06-26 14:36:06 parseHtml.py[line:95] DEBUG Get jpg list(2):['url: "/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg', "url:'\\/az\\/hprichbg\\/rb\\/CallanishSS_ZH-CN12559903397_1920x1080.jpg"]

2017-06-26 14:36:06 parseHtml.py[line:98] INFO Get jpg url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg

2017-06-26 14:36:06 saveImage.py[line:24] INFO Image url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg

2017-06-26 14:36:06 saveImage.py[line:26] INFO Image file name:MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg

2017-06-26 14:36:06 saveImage.py[line:30] INFO Image file url:http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg

2017-06-26 14:36:06 saveImage.py[line:36] INFO Save file :http://cn.bing.com/az/hprichbg/rb/MadagascarLemurs_ZH-CN7754035615_1920x1080.jpg
