Scraping celebrity face data from mingxing.com (明星网)

from bs4 import BeautifulSoup
import os
import requests
import time

1 Downloading data

1.1 Request analysis

  • Request
GET /upload/thumb/2015/11-16/0-uwo1Wk.jpg HTTP/1.1
Host: img.mingxing.com
Referer:http://img.mingxing.com//mingxing//20181015/88aa35c304dc06e822bb2efdd33497a5.jpg
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
def get_img(url, path):
    """Download the image at `url` and write the raw bytes to `path`."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        # The captured request carries a Referer; the image URL itself is sent here.
        "Referer": url,
    }
    response = requests.get(url, headers=headers)
    with open(path, "wb") as fw:
        fw.write(response.content)


if __name__ == "__main__":
    url = "http://img.mingxing.com//mingxing//20181015/88aa35c304dc06e822bb2efdd33497a5.jpg"
    path = "./dataset/tmp.jpg"
    # Make sure the target folder exists before writing into it.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    get_img(url, path)
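get_img writes whatever bytes the server returns, so an error page would also end up on disk under a .jpg name. A minimal sketch of a check that could run on response.content before saving, assuming only JPEG, PNG and GIF matter here (the three extensions that show up in this site's image URLs). The names IMAGE_SIGNATURES and sniff_image_format are hypothetical helpers, not part of the original script.

```python
# Leading "magic" bytes of the three image formats considered in this sketch.
IMAGE_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def sniff_image_format(data):
    """Return 'jpeg', 'png' or 'gif' if `data` starts with that signature, else None."""
    for name, signature in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


if __name__ == "__main__":
    # A JPEG header is recognised; an HTML error body is not.
    print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
    print(sniff_image_format(b"<html>403 Forbidden</html>"))
```

In get_img this could be used as a guard: write the file only when sniff_image_format(response.content) is not None.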

2 Celebrity list page

  • Request
GET /ziliao/index?&p=1 HTTP/1.1
Host: www.mingxing.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cookie: __51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21

2.1 Celebrity list for a single page

URL_MINGXING_CELEBRITY_LIST = "http://www.mingxing.com/ziliao/index"
<div>: class="page_starlist", the celebrity list
-><ul>
--><li>
---><a>: URL of the celebrity's page
----><span>
-----><img>: src = celebrity image URL, alt = celebrity name
def get_celebrities_one_page(url, idx_page):
    """Return the celebrities on list page `idx_page` as a list of dicts."""
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        "Upgrade-Insecure-Requests": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "__51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21",
    }
    # The page number is passed as the query parameter "p".
    params = {"p": idx_page}
    response = requests.get(url, params=params, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')

    lst_celebrities = []
    # Each <img> in div.page_starlist holds the name (alt) and a thumbnail (src);
    # its enclosing <a> links to the celebrity's own page.
    for item in soup.find("div", class_="page_starlist").find_all("img"):
        lst_celebrities.append({
            "name": item.get("alt").strip(),
            "url": "http://www.mingxing.com" + item.find_parent("a").get("href"),
            "img_urls": [item.get("src")],
        })
    return lst_celebrities


if __name__ == "__main__":
    idx_page = 1
    print(get_celebrities_one_page(URL_MINGXING_CELEBRITY_LIST, idx_page))
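The celebrity URL above is built by prefixing the site root to the href, which works as long as every href is a site-relative path. A small sketch of the same step with the standard library's urljoin, which also copes with hrefs that are already absolute. BASE_URL and absolute_url are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

# Site root, as used in the string concatenation above.
BASE_URL = "http://www.mingxing.com"


def absolute_url(href, base=BASE_URL):
    """Resolve `href` against `base`; absolute hrefs are returned unchanged."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(absolute_url("/mingxing/index/name/luhan.html"))
    print(absolute_url("http://img.mingxing.com/upload/thumb/6/17097.jpg"))
```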

2.2 Celebrity list across multiple pages

NUM_PAGES = 10
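def get_celebrities(url, num_pages):
    """Collect the celebrities from list pages 1..num_pages."""
    lst_celebrities = []
    # Pages are 1-based, so the range has to include num_pages itself.
    for idx_page in range(1, num_pages + 1):
        lst_celebrities.extend(get_celebrities_one_page(url, idx_page))
        # Pause between requests to avoid hammering the site.
        time.sleep(3)
    return lst_celebrities


if __name__ == "__main__":
    lst_celebrities = get_celebrities(URL_MINGXING_CELEBRITY_LIST, NUM_PAGES)
    print(lst_celebrities)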

[{'name': '鹿晗', 'url': 'http://www.mingxing.com/mingxing/index/name/luhan.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/6/17097.jpg']},
 {'name': '迪丽热巴', 'url': 'http://www.mingxing.com/mingxing/index/name/dilireba.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/5/14274.jpg']},
 {'name': '王艺洁', 'url': 'http://www.mingxing.com/mingxing/index/name/wangyijie.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/5/14276.jpg']},
 {'name': '段林希', 'url': 'http://www.mingxing.com/mingxing/index/name/duanlinxi.html', 'img_urls': ['http://img.mingxing.com/upload/thumb/5/14277.jpg']}]
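Each celebrity later gets one folder named after the "name" field, so a repeated entry would download the same images twice. Whether the site actually repeats a celebrity across pages is not known, so this is only a defensive sketch under that assumption. merge_celebrities is a hypothetical helper that merges entries sharing the same page URL.

```python
def merge_celebrities(celebrities):
    """Merge entries with the same 'url', keeping first-seen order and unique image URLs."""
    merged = {}
    for entry in celebrities:
        key = entry["url"]
        if key not in merged:
            merged[key] = {"name": entry["name"], "url": key, "img_urls": []}
        for img_url in entry["img_urls"]:
            # Skip image URLs that were already collected for this celebrity.
            if img_url not in merged[key]["img_urls"]:
                merged[key]["img_urls"].append(img_url)
    return list(merged.values())


if __name__ == "__main__":
    sample = [
        {"name": "鹿晗", "url": "http://www.mingxing.com/mingxing/index/name/luhan.html",
         "img_urls": ["http://img.mingxing.com/upload/thumb/6/17097.jpg"]},
        {"name": "鹿晗", "url": "http://www.mingxing.com/mingxing/index/name/luhan.html",
         "img_urls": ["http://img.mingxing.com/upload/thumb/6/17097.jpg"]},
    ]
    print(len(merge_celebrities(sample)))
```

It could be applied to the list returned by get_celebrities before any downloading starts.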

3 Celebrity page

GET /mingxing/index/name/luhan.html HTTP/1.1
Host: www.mingxing.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://www.mingxing.com/ziliao/index?&p=1
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Cookie: __51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A29%7D; __tins__18838395=%7B%22sid%22%3A%201553844269026%2C%20%22vd%22%3A%201%2C%20%22expires%22%3A%201553846069026%7D; __51laig__=30; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553843231
<ul>: class="page_starphoto", the celebrity's photo list
-><li>
--><a>
---><span>
----><img>: src = celebrity image URL
def get_celebrity_img_urls(url):
    """Return the image URLs found in the photo list of a celebrity's page."""
    headers = {
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6821.400 QQBrowser/10.3.3040.400",
        "Upgrade-Insecure-Requests": "1",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cookie": "__51cke__=; UM_distinctid=169c7147623b32-041b6f54ccddbe-3257487f-232800-169c7147624ddd; CNZZDATA30054349=cnzz_eid%3D515205138-1553821270-https%253A%252F%252Fwww.baidu.com%252F%26ntime%3D1553821270; PHPSESSID=9btdnv30htpj54ies9em19pan1; right_adv=%7B%22time%22%3A%222019329%22%2C%22number%22%3A20%7D; __tins__18838395=%7B%22sid%22%3A%201553825001869%2C%20%22vd%22%3A%203%2C%20%22expires%22%3A%201553827085186%7D; __51laig__=21",
        # "Referer": url,
    }
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')

    lst_imgs = []
    # Every <img> inside ul.page_starphoto is one photo of the celebrity.
    for item in soup.find("ul", class_="page_starphoto").find_all("img"):
        lst_imgs.append(item["src"])
    return lst_imgs


if __name__ == "__main__":
    get_celebrity_img_urls("http://www.mingxing.com/mingxing/index/name/luhan.html")

4 Building the celebrity face dataset

if __name__ == "__main__":
    NUM_PAGES = 10
    DATASET_PATH = "./dataset"

    # Celebrity list
    lst_celebrities = get_celebrities(URL_MINGXING_CELEBRITY_LIST, NUM_PAGES)
    for celebrity in lst_celebrities:
        # One folder per celebrity
        celebrity_dir = os.path.join(DATASET_PATH, celebrity["name"])
        print("*" * 10)
        print("celebrity: {}".format(celebrity["name"]))
        if not os.path.exists(celebrity_dir):
            os.makedirs(celebrity_dir)

        # Celebrity page: append its photos to the thumbnail from the list page
        celebrity["img_urls"].extend(get_celebrity_img_urls(celebrity["url"]))
        idx_img = 0
        for img_url in celebrity["img_urls"]:
            idx_img += 1
            # Files are numbered 0001.jpg, 0002.jpg, ... whatever the source format is.
            img_path = os.path.join(celebrity_dir, "{:04d}.jpg".format(idx_img))
            get_img(img_url, img_path)
            print("download {} ---> {}".format(img_url, img_path))
            time.sleep(3)
**********
celebrity: 鹿晗
download http://img.mingxing.com/upload/thumb/6/17097.jpg ---> ./dataset\鹿晗\0001.jpg
download http://img.mingxing.com/mingxing//20180928/2e8dc41ba5f72d2e0ed005541a515a54.jpg ---> ./dataset\鹿晗\0002.jpg
download http://img.mingxing.com/mingxing//20180319/c84bf559d0dd0e2fae005f84a4016f6c.jpg ---> ./dataset\鹿晗\0003.jpg
download http://img.mingxing.com/upload/thumb/2017/06-28/0-5QoNse.jpg ---> ./dataset\鹿晗\0004.jpg
download http://img.mingxing.com/upload/thumb/2017/05-02/0-gOgsmr.jpg ---> ./dataset\鹿晗\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/12-20/0-2GSIij.jpg ---> ./dataset\鹿晗\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/07-07/0-yaTLqz.jpg ---> ./dataset\鹿晗\0007.jpg
download http://img.mingxing.com/upload/thumb/2016/04-21/0-iDv7Fj.jpg ---> ./dataset\鹿晗\0008.jpg
download http://img.mingxing.com/upload/thumb/2016/04-11/0-FTCO8H.jpg ---> ./dataset\鹿晗\0009.jpg
download http://img.mingxing.com/upload/thumb/2016/03-21/0-op5Sbt.jpg ---> ./dataset\鹿晗\0010.jpg
download http://img.mingxing.com/upload/thumb/2015/12-30/0-08NWI0.jpg ---> ./dataset\鹿晗\0011.jpg
download http://img.mingxing.com/upload/thumb/2015/12-08/0-wlgnGF.jpg ---> ./dataset\鹿晗\0012.jpg
download http://img.mingxing.com/upload/thumb/2015/11-16/0-uwo1Wk.jpg ---> ./dataset\鹿晗\0013.jpg
**********
celebrity: 迪丽热巴
download http://img.mingxing.com/content/20180103/535f03beaa9b7f0cb3c6f2f302886bf8.jpg ---> ./dataset\迪丽热巴\0001.jpg
download http://img.mingxing.com/mingxing//20181015/14b77dfea0cad1360955d818fcbb0de6.jpg ---> ./dataset\迪丽热巴\0002.jpg
download http://img.mingxing.com/mingxing//20180921/28e35a28498d760e908abce74fd40f5f.jpg ---> ./dataset\迪丽热巴\0003.jpg
download http://img.mingxing.com/mingxing//20180726/17702f5a9b8b998cbb0c70c260b40ad3.gif ---> ./dataset\迪丽热巴\0004.jpg
download http://img.mingxing.com/mingxing//20180620/ea20b15f13f6b34d1b4764553bfba7a9.png ---> ./dataset\迪丽热巴\0005.jpg
download http://img.mingxing.com/mingxing//20180417/985a84ccae9646f31f4dd717ccd40508.jpg ---> ./dataset\迪丽热巴\0006.jpg
download http://img.mingxing.com/mingxing//20180411/5376e604692d6fb42ae7a48e73143eb8.jpg ---> ./dataset\迪丽热巴\0007.jpg
download http://img.mingxing.com/mingxing/20180301/bdd3cbbf262d7793f21ed10975744c22.jpg ---> ./dataset\迪丽热巴\0008.jpg
download http://img.mingxing.com/mingxing/20180301/3418a7189704f4e68f81a29b4320af87.jpg ---> ./dataset\迪丽热巴\0009.jpg
download http://img.mingxing.com/mingxing/20180227/d6aa477ed34271c06fe9edb4dccc9e94.jpg ---> ./dataset\迪丽热巴\0010.jpg
download http://img.mingxing.com/mingxing/20180227/92dccee3c3ab96b8aae57f2f0469b1c2.jpg ---> ./dataset\迪丽热巴\0011.jpg
download http://img.mingxing.com/mingxing/20180226/0fc7ff656cabc975cbb349daeb6ee793.jpg ---> ./dataset\迪丽热巴\0012.jpg
download http://img.mingxing.com/mingxing/20180225/45a68453086b2307eaf10b7921b7e199.jpg ---> ./dataset\迪丽热巴\0013.jpg
...
celebrity: 约翰尼·德普
download http://img.mingxing.com/upload/thumb/5/14261.jpg ---> ./dataset\约翰尼·德普\0001.jpg
download http://img.mingxing.com/upload/thumb/2016/05-24/0-re7Tem.jpg ---> ./dataset\约翰尼·德普\0002.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-X6RYXs.jpg ---> ./dataset\约翰尼·德普\0003.jpg
download http://img.mingxing.com/upload/thumb/2016/03-25/0-bxK5os.jpg ---> ./dataset\约翰尼·德普\0004.jpg
download http://img.mingxing.com/upload/thumb/2016/03-25/0-h77lr9.jpg ---> ./dataset\约翰尼·德普\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-U3Y3EK.jpg ---> ./dataset\约翰尼·德普\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-WUdojP.jpg ---> ./dataset\约翰尼·德普\0007.jpg
download http://img.mingxing.com/upload/thumb/2016/03-17/0-ghntJ4.jpg ---> ./dataset\约翰尼·德普\0008.jpg
download http://img.mingxing.com/upload/thumb/2016/02-26/0-G2Th8a.jpg ---> ./dataset\约翰尼·德普\0009.jpg
download http://img.mingxing.com/upload/thumb/2016/02-23/0-cARUg7.jpg ---> ./dataset\约翰尼·德普\0010.jpg
download http://img.mingxing.com/upload/thumb/2016/02-18/0-DLYZNo.jpg ---> ./dataset\约翰尼·德普\0011.jpg
download http://img.mingxing.com/upload/thumb/2016/01-29/0-Pe5YMh.jpg ---> ./dataset\约翰尼·德普\0012.jpg
**********
celebrity: 雨果·维文
download http://img.mingxing.com/upload/thumb/5/14262.jpg ---> ./dataset\雨果·维文\0001.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-Pm9m6p.jpg ---> ./dataset\雨果·维文\0002.jpg
download http://img.mingxing.com/upload/thumb/2016/04-13/0-kqjoN7.jpg ---> ./dataset\雨果·维文\0003.jpg
download http://img.mingxing.com/upload/thumb/2016/04-08/0-03NXtB.jpg ---> ./dataset\雨果·维文\0004.jpg
download http://img.mingxing.com/upload/thumb/2016/03-30/0-TJRqeD.jpg ---> ./dataset\雨果·维文\0005.jpg
download http://img.mingxing.com/upload/thumb/2016/02-26/0-Wuurq1.jpg ---> ./dataset\雨果·维文\0006.jpg
download http://img.mingxing.com/upload/thumb/2016/02-18/0-4fqOgM.jpg ---> ./dataset\雨果·维文\0007.jpg
**********
celebrity: 希亚·拉博夫
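The log shows a .gif and a .png source both saved under a .jpg name. A minimal sketch of one way to keep the original extension when building the file name, assuming the extension in the URL path can be trusted. image_filename and KNOWN_EXTENSIONS are hypothetical names, not part of the script above.

```python
import os
from urllib.parse import urlparse

# Extensions accepted as-is; anything else falls back to the default.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def image_filename(img_url, idx_img, default_ext=".jpg"):
    """Build a numbered file name that keeps the extension found in the URL path."""
    ext = os.path.splitext(urlparse(img_url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = default_ext
    return "{:04d}{}".format(idx_img, ext)


if __name__ == "__main__":
    url = "http://img.mingxing.com/mingxing//20180726/17702f5a9b8b998cbb0c70c260b40ad3.gif"
    print(image_filename(url, 4))  # 0004.gif
```

In the download loop, os.path.join(celebrity_dir, image_filename(img_url, idx_img)) would take the place of the fixed "{:04d}.jpg" pattern.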
