Python Crawler Difficulties: A Brief Look at Python Web Crawlers

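Introduction to Python web crawlers:

Sometimes we need to copy the images from a web page. The usual manual way is to right-click each image and choose "Save picture as ...".

A Python web crawler can copy all of the images in one go.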

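The steps are as follows:

1. Fetch the HTML of the page to crawl.

2. Store and process the fetched HTML: store the raw HTML, split it into a list of pieces, and use a regular expression to pick out the image links.

3. Save each image to the local disk using its link.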
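As a rough illustration of step 2, the sketch below runs the split-and-filter idea on a small in-memory string instead of on files. The sample HTML and the helper name `filter_jpg_links` are made up for this example and are not part of the original script.

```python
import re


def filter_jpg_links(html):
    """Split html on double quotes, then keep pieces shaped like https...jpg."""
    pieces = re.split('"+', html)
    pattern = re.compile(r"^https.*jpg$")
    return [piece for piece in pieces if pattern.match(piece)]


# Hypothetical sample page, not taken from a real site.
sample = (
    '<img src="https://example.com/a.jpg">'
    '<a href="https://example.com/page.html">x</a>'
    '<img src="http://example.com/b.jpg">'
)
links = filter_jpg_links(sample)
print(links)
```

Only the first link survives: the second piece does not end in "jpg", and the third starts with "http" rather than "https".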

The main difficulties are getting familiar with urllib and using regular expressions to find the image links.
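For the urllib part, here is a minimal sketch of how a request might be built and sent with basic error handling. The helper names `build_request` and `fetch`, the User-Agent value, and the timeout are assumptions chosen for illustration; some servers may respond differently depending on the User-Agent header, which is why one is set here.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read()
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("request failed:", err)
        return None


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://example.com/")
print(req.get_header("User-agent"))
```

Calling `fetch(url)` would perform the actual download; it is left out of the example so that the sketch runs without network access.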

The code is as follows:

import re
import urllib.request


def getHtml(url):
    # Download the page and return its body as text.
    with urllib.request.urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def write(html, htmlfile):
    # Save the raw HTML to a text file (html.txt).
    try:
        with open(htmlfile, mode="w", encoding="utf-8") as f:
            f.write(html)
    except OSError:
        print("write html file failed")


def getImg2(html, initialFile, finalFile):
    # Split the HTML on double quotes and write one piece per line (re.txt).
    imgre1 = re.compile('"+')
    imglist = re.split(imgre1, html)
    with open(initialFile, mode="w", encoding="utf-8") as f1:
        for item in imglist:
            f1.write("\n")
            f1.write(item)

    # Keep only the lines that start with "https" and end with "jpg".
    imgre2 = re.compile(r"^https.*jpg$")
    with open(initialFile, mode="r", encoding="utf-8") as f2, \
            open(finalFile, mode="w", encoding="utf-8") as f3:
        for line in f2:
            if re.match(imgre2, line) is not None:
                f3.write(line)


def saveImg2(imagefile):
    # Download every link listed in imagefile as 0.jpg, 1.jpg, ...
    with open(imagefile, mode="r", encoding="utf-8") as f_imglist2:
        templist = f_imglist2.readlines()
    x = 0
    for line in templist:
        # Strip the trailing newline before using the line as a URL.
        urllib.request.urlretrieve(line.strip(), "%s.jpg" % x)
        x = x + 1


url = "https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gbk&word=%BA%FB%B5%FB&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111"
htmlfile = "D:\\New\\html.txt"
SplitFile = "D:\\New\\re.txt"
imgefile = "D:\\New\\imglist.txt"

html = getHtml(url)
write(html, htmlfile)  # store the raw HTML (step 2)
print("get html complete!")
getImg2(html, SplitFile, imgefile)
print("get Image link list complete!")
saveImg2(imgefile)
print("Save Image complete!")
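The script above writes the split pieces to one file and the filtered links to another before downloading. As an alternative sketch, not the approach used in the article, the links could be collected in a single pass with `re.findall`. The pattern, the helper name `extract_jpg_links`, and the sample HTML are assumptions made for this example; real pages may need a different pattern.

```python
import re

# An https URL ending in .jpg that contains no double quote, whitespace, or angle bracket.
JPG_URL = re.compile(r'https://[^"\s<>]+?\.jpg')


def extract_jpg_links(html):
    """Return the unique .jpg links found in html, in first-seen order."""
    seen = []
    for url in JPG_URL.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


# Hypothetical sample page, not taken from a real site.
sample = (
    '<img src="https://example.com/a.jpg">'
    '<img src="https://example.com/b.jpg">'
    '<img src="https://example.com/a.jpg">'
    '<img src="https://example.com/c.png">'
)
found = extract_jpg_links(sample)
print(found)
```

Dropping duplicates before downloading avoids fetching the same image twice; the returned list could then be passed to a download loop like the one in `saveImg2`.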
