Scraping celebrity face data from mingxing.com (明星网)
from bs4 import BeautifulSoup
import os
import requests
import time
1 Downloading the data
1.1 Request analysis
- Request
GET /upload/thumb/2015/11-16/0-uwo1Wk.jpg HTTP/1.1
Host: img.mingxing.com
Referer: http://img.mingxing.com//mingxing//20181015/88aa35c304dc06e822bb2efdd33497a5.jpg
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
def get_img(url, path):
    """Download the image at url and save it to path."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        # The image URL itself is sent as the Referer.
        "Referer": url,
    }
    response = requests.get(url, headers=headers)
    # print(response.content)
    with open(path, "wb") as fw:
        fw.write(response.content)

if __name__ == "__main__":
    url = "http://img.mingxing.com//mingxing//20181015/88aa35c304dc06e822bb2efdd33497a5.jpg"
    path = "./dataset/tmp.jpg"
    # Make sure the target folder exists, otherwise open() fails.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    get_img(url, path)
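get_img writes whatever bytes the server returns, so an HTML error page would be saved as a .jpg just the same. Below is a minimal sketch of a sanity check, assuming it is enough to look at the leading magic bytes of the three formats that show up in this crawl (JPEG, PNG, GIF). The helper name sniff_image_type is hypothetical and not part of the original script.

```python
# Hypothetical helper, not part of the original script.
# Each entry pairs a file signature (magic bytes) with a format name.
_MAGIC = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_type(data):
    """Return 'jpeg', 'png' or 'gif' if data starts with a known signature, else None."""
    for magic, kind in _MAGIC:
        if data.startswith(magic):
            return kind
    return None
```

Inside get_img, one could call sniff_image_type(response.content) and skip the write when it returns None.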
2 Celebrity list page
- Request
GET /ziliao/index?&p=1 HTTP/1.1
Host: www.mingxing.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cookie: __51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21
2.1 Celebrity list for a single page
URL_MINGXING_CELEBRITY_LIST = "http://www.mingxing.com/ziliao/index"
<div>: class="page_starlist", the celebrity list
-><ul>
--><li>
---><a>: celebrity page URL
----><span>
-----><img>: src = celebrity image URL, alt = celebrity name
def get_celebrities_one_page(url, idx_page):
    """Return the celebrities listed on page idx_page as a list of dicts."""
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        "Upgrade-Insecure-Requests": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "__51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21",
    }
    # The page index is passed as the query parameter "p".
    params = {"p": idx_page}
    response = requests.get(url, params=params, headers=headers)
    html = response.text
    # print(html)
    soup = BeautifulSoup(html, 'lxml')
    # print(soup.find("div", class_="page_starlist").find_all("img"))
    lst_celebrities = []
    # Each <img> in the list carries the name (alt) and thumbnail (src);
    # its parent <a> links to the celebrity page.
    for item in soup.find("div", class_="page_starlist").find_all("img"):
        lst_celebrities.append({
            "name": item.get("alt").strip(),
            "url": "http://www.mingxing.com" + item.find_parent("a").get("href"),
            "img_urls": [item.get("src")],
        })
        # print(item.find_parent("a")["href"])
        # print(item["src"], item["alt"])
    return lst_celebrities

if __name__ == "__main__":
    idx_page = 1
    print(get_celebrities_one_page(URL_MINGXING_CELEBRITY_LIST, idx_page))
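The function above builds the celebrity page URL by gluing "http://www.mingxing.com" in front of the href. That works as long as every href is site-relative. A small sketch of a more forgiving alternative using urljoin from the standard library, which also copes with hrefs that are already absolute; BASE_URL and absolute_url are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urljoin

# Hypothetical constant and helper, not part of the original script.
BASE_URL = "http://www.mingxing.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; absolute hrefs are returned unchanged."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(absolute_url("/mingxing/index/name/luhan.html"))
```

Swapping this in for the string concatenation would not change the result for the relative links shown in the output of this post.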
2.2 Celebrity list across multiple pages
NUM_PAGES = 10
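def get_celebrities(url, num_pages):
    """Collect the celebrities from list pages 1 through num_pages."""
    lst_celebrities = []
    # Pages are numbered from 1, so the upper bound is num_pages + 1;
    # range(1, num_pages) would skip the last page.
    for idx_page in range(1, num_pages + 1):
        lst_celebrities.extend(get_celebrities_one_page(url, idx_page))
        # Pause between page requests.
        time.sleep(3)
    return lst_celebrities

if __name__ == "__main__":
    lst_celebrities = get_celebrities(URL_MINGXING_CELEBRITY_LIST, NUM_PAGES)
    print(lst_celebrities)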
[{'name': '鹿晗', 'url': 'http://www.mingxing.com/mingxing/index/name/luhan.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/6/17097.jpg']}, {'name': '迪丽热巴', 'url': 'http://www.mingxing.com/mingxing/index/name/dilireba.html', 'img_urls':
…
['http://img.mingxing.com/upload/thumb/5/14274.jpg']}, {'name': '王艺洁', 'url': 'http://www.mingxing.com/mingxing/index/name/wangyijie.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/5/14276.jpg']}, {'name': '段林希', 'url': 'http://www.mingxing.com/mingxing/index/name/duanlinxi.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/5/14277.jpg']}]
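The merged list is a plain concatenation of the per-page results. Whether the site ever repeats a celebrity across list pages is not confirmed here; assuming it can happen, a minimal sketch of removing duplicates by page URL while keeping the original order could look like this. The function name dedupe_celebrities is hypothetical.

```python
def dedupe_celebrities(celebrities):
    """Keep the first entry for each celebrity page URL, preserving order."""
    seen = set()
    unique = []
    for celebrity in celebrities:
        if celebrity["url"] in seen:
            continue  # already collected from an earlier page
        seen.add(celebrity["url"])
        unique.append(celebrity)
    return unique
```

It would be applied once to the list returned by get_celebrities, before any images are downloaded.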
3 Celebrity page
GET /mingxing/index/name/luhan.html HTTP/1.1
Host: www.mingxing.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://www.mingxing.com/ziliao/index?&p=1
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cookie: __51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A29%7D; __tins__18838395=%7B%22sid%22%3A%201553844269026%2C%20%22vd%22%3A%201%2C%20%22expires%22%3A%201553846069026%7D; __51laig__=30; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553843231
<ul>: class="page_starphoto", the celebrity's photo list
-><li>
--><a>
---><span>
----><img>: src = celebrity image URL
def get_celebrity_img_urls(url):
    """Return the image URLs found in the photo list of a celebrity page."""
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        "Upgrade-Insecure-Requests": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "__51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21",
        # "Referer": url,
    }
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    lst_imgs = []
    # Every <img> inside <ul class="page_starphoto"> is one photo.
    for item in soup.find("ul", class_="page_starphoto").find_all("img"):
        lst_imgs.append(item["src"])
        # print(item["src"])
    return lst_imgs

if __name__ == "__main__":
    print(get_celebrity_img_urls("http://www.mingxing.com/mingxing/index/name/luhan.html"))
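To try the extraction logic without hitting the site, the same tree walk can be run on a small hand-written snippet. The sketch below deliberately swaps BeautifulSoup for html.parser from the standard library so it has no third-party dependency; the class StarPhotoParser, the function extract_photo_urls and the sample markup are all made up for illustration and only mirror the structure described above. Unlike soup.find(...).find_all(...), it returns an empty list when the page has no page_starphoto list instead of raising an error.

```python
from html.parser import HTMLParser


class StarPhotoParser(HTMLParser):
    """Collect <img src> values nested inside <ul class="page_starphoto">."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # <ul> nesting depth inside the target list, 0 = outside
        self.img_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            # Enter the target list, or track a nested <ul> inside it.
            if self.depth > 0 or "page_starphoto" in classes:
                self.depth += 1
        elif tag == "img" and self.depth > 0 and attrs.get("src"):
            self.img_urls.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth > 0:
            self.depth -= 1


def extract_photo_urls(html):
    """Return the photo URLs in html, or [] if the photo list is missing."""
    parser = StarPhotoParser()
    parser.feed(html)
    parser.close()
    return parser.img_urls


# Made-up sample markup that follows the structure described in this section.
SAMPLE_HTML = """
<ul class="page_starphoto">
  <li><a href="#"><span><img src="http://img.mingxing.com/a.jpg"></span></a></li>
  <li><a href="#"><span><img src="http://img.mingxing.com/b.jpg"></span></a></li>
</ul>
<img src="http://img.mingxing.com/outside.jpg">
"""

if __name__ == "__main__":
    print(extract_photo_urls(SAMPLE_HTML))
```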
4 Building the celebrity face dataset
if __name__ == "__main__":
    NUM_PAGES = 10
    DATASET_PATH = "./dataset"
    # Celebrity list
    lst_celebrities = get_celebrities(URL_MINGXING_CELEBRITY_LIST, NUM_PAGES)
    for celebrity in lst_celebrities:
        # One folder per celebrity
        celebrity_dir = os.path.join(DATASET_PATH, celebrity["name"])
        print("*" * 10)
        print("celebrity: {}".format(celebrity["name"]))
        if not os.path.exists(celebrity_dir):
            os.makedirs(celebrity_dir)
        # Celebrity page: add its photos to the thumbnail from the list page
        celebrity["img_urls"].extend(get_celebrity_img_urls(celebrity["url"]))
        idx_img = 0
        for img_url in celebrity["img_urls"]:
            idx_img += 1
            img_path = os.path.join(celebrity_dir, "{:04d}.jpg".format(idx_img))
            get_img(img_url, img_path)
            print("download {} ---> {}".format(img_url, img_path))
            time.sleep(3)
**********
celebrity: 鹿晗
download http://img.mingxing.com/upload/thumb/6/17097.jpg ---> ./dataset\鹿晗\0001.jpg
download http://img.mingxing.com/mingxing//20180928/2e8dc41ba5f72d2e0ed005541a515a54.jpg ---> ./dataset\鹿晗\0002.jpg
download http://img.mingxing.com/mingxing//20180319/c84bf559d0dd0e2fae005f84a4016f6c.jpg ---> ./dataset\鹿晗\0003.jpg
download http://img.mingxing.com/upload/thumb/2017/06-28/0-5QoNse.jpg ---> ./dataset\鹿晗\0004.jpg
download http://img.mingxing.com/upload/thumb/2017/05-02/0-gOgsmr.jpg ---> ./dataset\鹿晗\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/12-20/0-2GSIij.jpg ---> ./dataset\鹿晗\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/07-07/0-yaTLqz.jpg ---> ./dataset\鹿晗\0007.jpg
download http://img.mingxing.com/upload/thumb/2016/04-21/0-iDv7Fj.jpg ---> ./dataset\鹿晗\0008.jpg
download http://img.mingxing.com/upload/thumb/2016/04-11/0-FTCO8H.jpg ---> ./dataset\鹿晗\0009.jpg
download http://img.mingxing.com/upload/thumb/2016/03-21/0-op5Sbt.jpg ---> ./dataset\鹿晗\0010.jpg
download http://img.mingxing.com/upload/thumb/2015/12-30/0-08NWI0.jpg ---> ./dataset\鹿晗\0011.jpg
download http://img.mingxing.com/upload/thumb/2015/12-08/0-wlgnGF.jpg ---> ./dataset\鹿晗\0012.jpg
download http://img.mingxing.com/upload/thumb/2015/11-16/0-uwo1Wk.jpg ---> ./dataset\鹿晗\0013.jpg
**********
celebrity: 迪丽热巴
download http://img.mingxing.com/content/20180103/535f03beaa9b7f0cb3c6f2f302886bf8.jpg ---> ./dataset\迪丽热巴\0001.jpg
download http://img.mingxing.com/mingxing//20181015/14b77dfea0cad1360955d818fcbb0de6.jpg ---> ./dataset\迪丽热巴\0002.jpg
download http://img.mingxing.com/mingxing//20180921/28e35a28498d760e908abce74fd40f5f.jpg ---> ./dataset\迪丽热巴\0003.jpg
download http://img.mingxing.com/mingxing//20180726/17702f5a9b8b998cbb0c70c260b40ad3.gif ---> ./dataset\迪丽热巴\0004.jpg
download http://img.mingxing.com/mingxing//20180620/ea20b15f13f6b34d1b4764553bfba7a9.png ---> ./dataset\迪丽热巴\0005.jpg
download http://img.mingxing.com/mingxing//20180417/985a84ccae9646f31f4dd717ccd40508.jpg ---> ./dataset\迪丽热巴\0006.jpg
download http://img.mingxing.com/mingxing//20180411/5376e604692d6fb42ae7a48e73143eb8.jpg ---> ./dataset\迪丽热巴\0007.jpg
download http://img.mingxing.com/mingxing/20180301/bdd3cbbf262d7793f21ed10975744c22.jpg ---> ./dataset\迪丽热巴\0008.jpg
download http://img.mingxing.com/mingxing/20180301/3418a7189704f4e68f81a29b4320af87.jpg ---> ./dataset\迪丽热巴\0009.jpg
download http://img.mingxing.com/mingxing/20180227/d6aa477ed34271c06fe9edb4dccc9e94.jpg ---> ./dataset\迪丽热巴\0010.jpg
download http://img.mingxing.com/mingxing/20180227/92dccee3c3ab96b8aae57f2f0469b1c2.jpg ---> ./dataset\迪丽热巴\0011.jpg
download http://img.mingxing.com/mingxing/20180226/0fc7ff656cabc975cbb349daeb6ee793.jpg ---> ./dataset\迪丽热巴\0012.jpg
download http://img.mingxing.com/mingxing/20180225/45a68453086b2307eaf10b7921b7e199.jpg ---> ./dataset\迪丽热巴\0013.jpg
...
celebrity: 约翰尼·德普
download http://img.mingxing.com/upload/thumb/5/14261.jpg ---> ./dataset\约翰尼·德普\0001.jpg
download http://img.mingxing.com/upload/thumb/2016/05-24/0-re7Tem.jpg ---> ./dataset\约翰尼·德普\0002.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-X6RYXs.jpg ---> ./dataset\约翰尼·德普\0003.jpg
download http://img.mingxing.com/upload/thumb/2016/03-25/0-bxK5os.jpg ---> ./dataset\约翰尼·德普\0004.jpg
download http://img.mingxing.com/upload/thumb/2016/03-25/0-h77lr9.jpg ---> ./dataset\约翰尼·德普\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-U3Y3EK.jpg ---> ./dataset\约翰尼·德普\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-WUdojP.jpg ---> ./dataset\约翰尼·德普\0007.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-ghntJ4.jpg ---> ./dataset\约翰尼·德普\0008.jpg
download http://img.mingxing.com/upload/thumb/2016/02-26/0-G2Th8a.jpg ---> ./dataset\约翰尼·德普\0009.jpg
download http://img.mingxing.com/upload/thumb/2016/02-23/0-cARUg7.jpg ---> ./dataset\约翰尼·德普\0010.jpg
download http://img.mingxing.com/upload/thumb/2016/02-18/0-DLYZNo.jpg ---> ./dataset\约翰尼·德普\0011.jpg
download http://img.mingxing.com/upload/thumb/2016/01-29/0-Pe5YMh.jpg ---> ./dataset\约翰尼·德普\0012.jpg
**********
celebrity: 雨果·维文
download http://img.mingxing.com/upload/thumb/5/14262.jpg ---> ./dataset\雨果·维文\0001.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-Pm9m6p.jpg ---> ./dataset\雨果·维文\0002.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-kqjoN7.jpg ---> ./dataset\雨果·维文\0003.jpg
download http://img.mingxing.com/upload/thumb/2016/04-08/0-03NXtB.jpg ---> ./dataset\雨果·维文\0004.jpg
download http://img.mingxing.com/upload/thumb/2016/03-30/0-TJRqeD.jpg ---> ./dataset\雨果·维文\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/02-26/0-Wuurq1.jpg ---> ./dataset\雨果·维文\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/02-18/0-4fqOgM.jpg ---> ./dataset\雨果·维文\0007.jpg
**********
celebrity: 希亚·拉博夫
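The log above shows a .gif and a .png being saved under a .jpg name, because the script hard-codes the extension. A minimal sketch of deriving the extension from the URL instead, assuming the URL path ends in the real file type and falling back to .jpg otherwise; image_extension and image_filename are hypothetical helpers, not part of the original script.

```python
import os
from urllib.parse import urlsplit

# Extensions accepted as-is; anything else falls back to the default.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def image_extension(url, default=".jpg"):
    """Return the lower-cased file extension found in the URL path, or default."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower()
    return ext if ext in KNOWN_EXTENSIONS else default


def image_filename(idx_img, url):
    """Build a zero-padded file name such as 0004.gif for the idx_img-th image."""
    return "{:04d}{}".format(idx_img, image_extension(url))
```

In the download loop, os.path.join(celebrity_dir, image_filename(idx_img, img_url)) would then take the place of the hard-coded "{:04d}.jpg" pattern.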