To reproduce the results in a senior labmate's paper, I need to crawl data from Flickr; only the photo metadata is required, not the images themselves:

(Approach I succeeded and approach II failed; this is a record for my own reference.)

I. Using Python's icrawler package

icrawler is a lightweight framework that ships with a ready-made Flickr crawler, but with that built-in method:

① it cannot crawl the metadata of Flickr photos, and with my limited skills I was unable to modify or correctly call its source code;

② I did not know how to drive the crawler assembled from the feeder, parser and downloader; and when I tried to use the feeder, parser and downloader separately, I could not figure out how the url_queue and task_queue between them are wired together.

After studying class inheritance and method overriding in Python, I managed to get icrawler to crawl exactly what I wanted.
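Roughly, the trick is to subclass icrawler's Downloader and override its download method so that it records each parsed task instead of fetching the image file, then hand that class to FlickrImageCrawler. Below is a minimal sketch of the idea, not my exact code: the class name MetaDownloader and the CSV output are made up, it assumes the FlickrImageCrawler constructor forwards downloader_cls to the base Crawler, and it only keeps task['file_url'] (in my real run I also tweaked the feeder and parser so that the taken-date and has_geo filters and richer metadata came through).

import csv

from icrawler import Downloader
from icrawler.builtin import FlickrImageCrawler


class MetaDownloader(Downloader):
    # a "downloader" that saves metadata instead of downloading image files
    def download(self, task, default_ext, timeout=5, max_retry=3, **kwargs):
        # every parsed task carries its target URL under task['file_url']
        with open('flickr_meta.csv', 'a', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow([task['file_url']])
        print('save....')
        # note: this skips icrawler's own download counting, so max_num is only
        # enforced loosely


crawler = FlickrImageCrawler('YOUR_API_KEY',               # see ① in section II
                             downloader_cls=MetaDownloader)
crawler.crawl(max_num=4000, tags='Beijing')                # search filters go here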

But, just as in II below, it cannot keep making requests indefinitely. (Here is a snippet from a run in progress, followed by the errors that appeared.)

2022-08-16 22:37:38,497 - INFO - downloader - image #197 https://www.flickr.com/photos/79191095@N00/51821259303/
save....
2022-08-16 22:37:40,248 - INFO - downloader - image #198    https://www.flickr.com/photos/79191095@N00/51821258453/
2022-08-16 22:37:42,918 - INFO - downloader - image #199    https://www.flickr.com/photos/greathan/51820563139/
save....
2022-08-16 22:37:43,764 - INFO - downloader - image #200    https://www.flickr.com/photos/tomros_pics/51588916180/
save....
2022-08-16 22:37:45,271 - INFO - downloader - image #201    https://www.flickr.com/photos/rebelsabu/51516613806/
save....
2022-08-16 22:37:47,779 - INFO - downloader - image #202    https://www.flickr.com/photos/rebelsabu/51489771857/
save....
2022-08-16 22:37:48,984 - INFO - downloader - image #203    https://www.flickr.com/photos/hysnikapo/51225667481/
save....
2022-08-16 22:37:50,819 - INFO - downloader - image #204    https://www.flickr.com/photos/rebelsabu/51210131734/
save....
2022-08-16 22:37:52,343 - INFO - downloader - image #205    https://www.flickr.com/photos/shyish/51170912410/
save....
save....
2022-08-16 22:37:53,178 - INFO - downloader - image #206    https://www.flickr.com/photos/shyish/51163693309/
2022-08-16 22:37:57,853 - INFO - downloader - image #207    https://www.flickr.com/photos/shyish/51157978260/
save....
save....
2022-08-16 22:37:59,355 - INFO - downloader - image #208    https://www.flickr.com/photos/shyish/51140112694/
2022-08-16 22:37:59,734 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:00,098 - INFO - downloader - image #209    https://www.flickr.com/photos/shyish/51139336116/
save....
save....
2022-08-16 22:38:00,762 - INFO - downloader - image #210    https://www.flickr.com/photos/shyish/51130136567/
2022-08-16 22:38:01,612 - INFO - downloader - image #211    https://www.flickr.com/photos/shyish/51128767872/
save....
save....
2022-08-16 22:38:03,031 - INFO - downloader - image #212    https://www.flickr.com/photos/shyish/51128728456/
2022-08-16 22:38:05,131 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:08,047 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:10,847 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=2, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0
2022-08-16 22:38:13,051 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:16,248 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:18,055 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:21,715 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:23,058 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:27,169 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=3, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0
2022-08-16 22:38:28,066 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:33,080 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:33,567 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 2
2022-08-16 22:38:38,085 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:39,123 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 1
2022-08-16 22:38:43,096 - INFO - downloader - downloader-001 is waiting for new download tasks
2022-08-16 22:38:44,513 - ERROR - parser - Exception caught when fetching page https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=0bc578353b653bab01df2c4a674a8b4e&format=json&nojsoncallback=1&has_geo=1&tags=Beijing&min_taken_date=2019-01-01&max_taken_date=2019-12-31&page=4, error: HTTPSConnectionPool(host='api.flickr.com', port=443): Read timed out. (read timeout=5), remaining retry times: 0

Workaround:

Crawling Flickr with icrawler requires a VPN to get over the firewall; switching to a different VPN node every few thousand images is enough to keep it going.

II. Crawling Flickr by hand, using what I learned from a Bilibili web-scraping course

(1) Call FlickrFeeder from icrawler.builtin.flickr to obtain the URLs of the result pages that contain the photos I need

1. Modify the FlickrFeeder source code

The url_queue between the feeder and the parser showed up as a Python "generator object" that I did not know how to print out, so in the feed method of the FlickrFeeder class I added urllist = [], urllist.append(complete_url) and print(urllist). This yields urllist:

urllist = []   # added by me
for i in range(page, page + page_max):
    if self.signal.get('reach_max_num'):
        break
    complete_url = '{}&page={}'.format(url, i)
    while True:
        try:
            self.output(complete_url, block=False)
        except:
            if self.signal.get('reach_max_num'):
                break
        else:
            break
    self.logger.debug('put url to url_queue: {}'.format(complete_url))   # complete_url is a str
    urllist.append(complete_url)   # added by me
print(urllist)   # type(urllist) --> list   # added by me

2. FlickrFeeder calls the flickr.photos.search API method, so you only need to pass in the search parameters:

import datetime
import requests
from icrawler.builtin.flickr import FlickrFeeder

signal = {'signal1': 'reach_max_num'}   # a plain dict stands in for icrawler's Signal object here
session = requests.Session()
apikey = 'YOUR_API_KEY'   # the key applied for in ① below
feeder = FlickrFeeder(thread_num=1, signal=signal, session=session)   # instantiate a feeder object
feeder.feed(apikey=apikey, max_num=4000, tags=['Hong Kong'], min_taken_date=datetime.date(2013, 1, 1), max_taken_date=datetime.date(2013, 1, 31), has_geo=1)

① Apply for an apikey here: The App Garden on Flickr.

② The Flickr API method documentation is here: Flickr Services.

(2) Parse the urllist obtained in (1) to get each photo_id, then use the Flickr API's flickr.photos.getInfo method to fetch the metadata I need.

1. The urllist printed in (1) is of the form "base_url + page={1..40}"; even when a search has, say, 140 pages of results, it only ever returns pages 1 to 40. So pick any one of the URLs, delete the trailing page={} parameter, and use the part before it as the "base_url".

Request this "base_url" and read the total number of pages from the response.
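Concretely, since the feeder appends the page parameter last, get_pages_url (the base_url) can be produced from any urllist entry with one line, for example:

get_pages_url = urllist[0].rsplit('&page=', 1)[0]   # any entry works: drop the trailing &page=N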

response = json.loads(requests.get(get_pages_url).content.decode(encoding='utf-8'))
pages = int(response['photos']['pages'])   # total number of result pages

A simple for i in range(pages): loop then walks through every page.

2. Requesting each page URL, e.g. https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=958bac39f63d627982cebbffc6733a4e&format=json&nojsoncallback=1&tags=%5B%27hong+kong%27%5D&min_taken_date=2014-04-01&max_taken_date=2014-04-30&has_geo=1, yields the id of every matching photo; that id is the parameter we need for the next step of requesting and parsing each photo's metadata.

Two nested loops, first over each page and then over each photo_id on that page, cover all the photos:

import json
import requests
from urllib.parse import urlencode

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}  # UA spoofing
proxies = {"https": None}  # supply your own proxy IPs here if you have them

photolist = []  # list for all photos matching the search parameters
for i in range(pages):  # loop over every result page
    session = requests.Session()
    # get_url is the same base_url as above; the Flickr paging parameter is 'page'
    response = session.get(get_url, params={'page': str(i + 1)}, proxies=proxies, headers=headers)
    response.close()  # close the connection right away (content is already downloaded)
    content_json = response.content.decode(encoding='utf-8')
    content = json.loads(content_json)
    photos = content['photos']['photo']  # the list of photos on this page
    for photo in photos:  # loop over each photo
        photo_id = photo['id']  # the id of this photo
        base_url = 'https://api.flickr.com/services/rest/?'
        # flickr.photos.getInfo returns metadata such as taken time, longitude,
        # latitude, tags and the image URL
        params = {'method': 'flickr.photos.getInfo',
                  'api_key': apikey,
                  'photo_id': photo_id,
                  'format': 'json',
                  'nojsoncallback': 1}
        session = requests.Session()
        ret = session.get(base_url + urlencode(params), proxies=proxies)
        info = json.loads(ret.content.decode())
        infolist = []  # the metadata fields I need for this photo
        infolist.append(photo_id)
        nsid = info['photo']['owner']['nsid']
        infolist.append(nsid)
        username = info['photo']['owner']['username']
        infolist.append(username)
        taken_time = info['photo']['dates']['taken']
        infolist.append(taken_time)
        lon = info['photo']['location']['longitude']
        infolist.append(lon)
        lat = info['photo']['location']['latitude']
        infolist.append(lat)
        # locality = info['photo']['location']['locality']['_content']
        # infolist.append(locality)
        # url = info['photo']['urls']['url'][0]['_content']
        # infolist.append(url)
        tags = info['photo']['tags']['tag']
        tag_str = ""
        for tag in tags:
            tag_str = tag_str + tag['raw'] + ", "
        infolist.append(tag_str)
        photolist.append(infolist)
# (in my original code this block sits inside a function and ends with `return photolist`)

(3) Save the metadata to an Excel sheet

I won't go into detail here.
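For completeness, one option is pandas (with openpyxl installed for .to_excel()); a minimal sketch, with column names that simply mirror the order of fields appended to infolist above:

import pandas as pd

columns = ['photo_id', 'nsid', 'username', 'taken_time', 'longitude', 'latitude', 'tags']
pd.DataFrame(photolist, columns=columns).to_excel('flickr_metadata.xlsx', index=False)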

III. An unsolved problem

The code in II did download the metadata and write it into an Excel sheet, but I simply cannot get around Flickr's limit on how often a crawler may hit the API. The amount of data I need is fairly large: a single year has more than 30,000 qualifying photos, and I need six years of data. By now the Flickr server seems to have throttled me and I really cannot keep crawling. Things I tried:

Switched Wi-Fi networks.

Switched proxy IPs.

Tried adding try/except.

Added time.sleep().

Added response.close().

Used the Fiddler packet-capture tool.

Set the default socket timeout.

Tried the method from the link below; it didn't help either:

python 爬虫:https; HTTPSConnectionPool(host='z.jd.com', port=443) - 简书 (jianshu.com)

None of it worked. Sob ┭┮﹏┭┮ (For the record, a rough sketch of what the retry/timeout tweaks looked like follows.)
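The try/except, time.sleep() and socket-timeout tweaks amounted to something like the sketch below (fetch_with_retry is a made-up helper name, not my actual code):

import socket
import time

import requests

socket.setdefaulttimeout(10)   # the "default socket wait time" setting


def fetch_with_retry(session, url, max_retry=3, **kwargs):
    # wrap each request in try/except and back off with time.sleep() before retrying
    for attempt in range(max_retry):
        try:
            response = session.get(url, timeout=10, **kwargs)
            response.close()   # release the connection promptly
            return response
        except requests.exceptions.RequestException:
            time.sleep(5 * (attempt + 1))
    return None   # give up after max_retry attempts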
