Table of Contents

  • Preface
  • I. Prerequisites
  • II. The Code
    • 1. Importing the libraries
    • 2. Sending requests, parsing data, and saving locally
    • 3. Full code
  • Summary

Preface

I've been working with deep learning for a while now, and we're building a small project that uses a CNN (convolutional neural network) to recognize pictures of the twelve Chinese zodiac animals. Training a model usually takes a lot of data, so this post is a brief walkthrough of the data-collection step.


Today we'll combine the Selenium automation framework with BeautifulSoup in Python, download the data over multiple threads to speed things up, and finally save everything into our dataset.


I. Prerequisites

Before we start, we need the following (a quick environment check is sketched right after this list):
1. Python 3.10 (any Python 3 version works; I'm using the latest 3.10)
2. requests
3. BeautifulSoup, for parsing the pages
4. Selenium (install the matching browser driver and add it to your PATH)
5. threading, for multithreaded downloads
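
As a quick sanity check before running the spider, the minimal sketch below confirms that the packages import and that Selenium can actually start Chrome (i.e., the driver is on your PATH). The version prints are just informational; adjust to your own setup:

# Minimal environment check: verifies the required packages import
# and that Selenium can launch Chrome (fails if the driver is missing).
import bs4
import requests
import selenium
from selenium import webdriver

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("selenium:", selenium.__version__)

driver = webdriver.Chrome()  # raises here if chromedriver is missing/mismatched
driver.quit()
print("Chrome driver OK")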

II. The Code

Disclaimer: this code is for technical exchange only. If it infringes on anyone's rights, please contact me and I will take it down.

1. Importing the libraries

First, import the libraries we'll need:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import threading  # used to run multiple download threads
import requests
import time
import random
import os

2. Sending requests, parsing data, and saving locally

First, use the page structure to locate the search input box:

We're using Chrome as the browser here. Configure the Chrome driver in the code first, then locate the input box; the code is as follows:

def parser_content(self):
    # Set up the Chrome driver
    driver = webdriver.Chrome()
    # Maximize the window
    driver.maximize_window()
    # Load the page
    driver.get(self.url)
    # Implicit wait of up to 10 s
    driver.implicitly_wait(10)
    # Hard wait of 2 s
    time.sleep(2)
    # Locate the search box and type the keyword
    driver.find_element(By.XPATH, "//input[@class='s_ipt']").send_keys(self.name)
    time.sleep(1)
    # Simulate pressing Enter to open the image results page
    driver.find_element(By.XPATH, "//input[@class='s_ipt']").send_keys(Keys.ENTER)
    driver.implicitly_wait(10)
    time.sleep(2)
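
As a side note, the hard time.sleep calls above can be replaced with explicit waits, which return as soon as the element is ready instead of always pausing. A minimal sketch using Selenium's standard WebDriverWait API (the 10 s timeout is just illustrative) that could stand in for the locate-and-type steps:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for the search box to appear, then type the keyword.
search_box = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//input[@class='s_ipt']"))
)
search_box.send_keys(self.name)
search_box.send_keys(Keys.ENTER)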

Next, we land on the image results page:


Here we need to scroll the page down to load more images, and save the collected image links into a list; the code is as follows:

start = time.time()
# Vertical scrolling --------------------------------------------------------
temp_height = 0
for ii in range(1, 1000000, 8):
    js1 = "document.documentElement.scrollTop={}".format(ii)
    driver.execute_script(js1)
    time.sleep(0.01)
    # Check whether the scrollbar has reached the bottom of the page
    check_height = driver.execute_script(
        "return document.documentElement.scrollTop || window.pageYOffset || document.body.scrollTop;")
    if check_height == temp_height:
        break
    temp_height = check_height
    # Time limit: stop automatically after 45 s
    if time.time() - start > 45:
        break
# ---------------------------------------------------------------------------
# Collect all image detail-page links
url_lst = driver.find_elements(By.XPATH, "//div[@class='imgbox-border']/a")
for item in url_lst[1:201]:  # raise the upper bound to collect more links
    new_url = item.get_attribute("href")
    # print(new_url)
    self.lst1.append(new_url)
print("Fetched " + str(len(self.lst1)) + " image links this time!")
# Close the browser
driver.quit()
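
For reference, the same "scroll until the height stops growing" idea is often written against document.body.scrollHeight instead of stepping the scroll position pixel by pixel. A hedged alternative sketch (coarser jumps, so far fewer JavaScript round-trips; the 1 s pause is an assumption about how long lazy-loaded images need):

# Jump to the bottom repeatedly until the page height stops growing.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give lazy-loaded images time to appear
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height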

After running this we can see that we've collected the links we wanted. Next, we visit each link's detail page.

We use BeautifulSoup to parse out the actual image URL from each detail page, and launch four threads to download the images; the code is as follows:

def download(self, start, end):
    # Fetch each detail page in the given slice, parse out the real image
    # URL, and save the image to the dataset folder.
    for page_url in self.lst1[start:end]:
        try:
            resp = requests.get(page_url, headers=self.headers)
            resp.encoding = "utf-8"
            soup = BeautifulSoup(resp.text, 'html.parser')
            wrapper = soup.find('div', class_='img-wrapper')
            img_url = wrapper.find('img')['src']
            res = requests.get(img_url)
            # Build the random file name once, so the saved path and the
            # log message refer to the same file.
            filename = "".join(str(random.randint(1, 9)) for _ in range(8)) + ".jpg"
            save_file = os.path.join(self.path, self.save_path, filename)
            with open(save_file, "wb") as file:
                file.write(res.content)
            print("Download finished, saved to: " + save_file)
        except Exception:
            pass  # skip pages that fail to load or parse

def multi_thread(self):
    # Split the link list into four slices and download each in its own
    # thread. Note: pass the bound method itself to target= (no parentheses);
    # calling it would run the downloads sequentially in the main thread.
    quarter = len(self.lst1) // 4
    bounds = [(0, quarter), (quarter, 2 * quarter),
              (2 * quarter, 3 * quarter), (3 * quarter, len(self.lst1))]
    threads = [threading.Thread(target=self.download, args=b) for b in bounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
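
If you ever want more (or fewer) workers without splitting the list by hand, the standard library's concurrent.futures thread pool distributes the links for you. A minimal sketch; download_one is a hypothetical helper equivalent to one iteration of the loop above, and both methods would be dropped into the Spider class:

from concurrent.futures import ThreadPoolExecutor

def download_one(self, page_url):
    # One iteration of the download loop: fetch the detail page,
    # parse out the image URL, and save the file.
    try:
        resp = requests.get(page_url, headers=self.headers)
        resp.encoding = "utf-8"
        soup = BeautifulSoup(resp.text, 'html.parser')
        img_url = soup.find('div', class_='img-wrapper').find('img')['src']
        res = requests.get(img_url)
        filename = "".join(str(random.randint(1, 9)) for _ in range(8)) + ".jpg"
        with open(os.path.join(self.path, self.save_path, filename), "wb") as file:
            file.write(res.content)
    except Exception:
        pass

def multi_thread(self):
    # The pool replaces the manual four-way split; max_workers sets concurrency.
    with ThreadPoolExecutor(max_workers=4) as pool:
        pool.map(self.download_one, self.lst1)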

3. Full code

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import threading
import requests
import time
import random
import os


class Spider():
    def __init__(self, name, xname):
        self.url = "https://image.baidu.com/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3883.400 QQBrowser/10.8.4559.400",
            "Referer": "https://www.baidu.com/"
        }
        self.name = name              # search keyword
        self.lst1 = []                # collected detail-page links
        self.path = './data/train/'   # dataset root (save path)
        self.save_path = xname        # sub-folder for this category
        # Create ./data/train/<xname> (including parents) if it doesn't exist
        os.makedirs(os.path.join(self.path, self.save_path), exist_ok=True)

    def parser_content(self):
        # Set up the Chrome driver
        driver = webdriver.Chrome()
        # Maximize the window
        driver.maximize_window()
        # Load the page
        driver.get(self.url)
        # Implicit wait of up to 10 s
        driver.implicitly_wait(10)
        # Hard wait of 2 s
        time.sleep(2)
        # Locate the search box and type the keyword
        driver.find_element(By.XPATH, "//input[@class='s_ipt']").send_keys(self.name)
        time.sleep(1)
        # Simulate pressing Enter to open the image results page
        driver.find_element(By.XPATH, "//input[@class='s_ipt']").send_keys(Keys.ENTER)
        driver.implicitly_wait(10)
        time.sleep(2)

        start = time.time()
        # Vertical scrolling ------------------------------------------------
        temp_height = 0
        for ii in range(1, 1000000, 8):
            js1 = "document.documentElement.scrollTop={}".format(ii)
            driver.execute_script(js1)
            time.sleep(0.01)
            # Check whether the scrollbar has reached the bottom of the page
            check_height = driver.execute_script(
                "return document.documentElement.scrollTop || window.pageYOffset || document.body.scrollTop;")
            if check_height == temp_height:
                break
            temp_height = check_height
            # Time limit: stop automatically after 45 s
            if time.time() - start > 45:
                break
        # --------------------------------------------------------------------
        # Collect all image detail-page links
        url_lst = driver.find_elements(By.XPATH, "//div[@class='imgbox-border']/a")
        for item in url_lst[1:201]:  # raise the upper bound to collect more links
            new_url = item.get_attribute("href")
            # print(new_url)
            self.lst1.append(new_url)
        print("Fetched " + str(len(self.lst1)) + " image links this time!")
        # Close the browser
        driver.quit()

    def download(self, start, end):
        # Fetch each detail page in the given slice, parse out the real image
        # URL, and save the image to the dataset folder.
        for page_url in self.lst1[start:end]:
            try:
                resp = requests.get(page_url, headers=self.headers)
                resp.encoding = "utf-8"
                soup = BeautifulSoup(resp.text, 'html.parser')
                wrapper = soup.find('div', class_='img-wrapper')
                img_url = wrapper.find('img')['src']
                res = requests.get(img_url)
                # Build the random file name once, so the saved path and the
                # log message refer to the same file.
                filename = "".join(str(random.randint(1, 9)) for _ in range(8)) + ".jpg"
                save_file = os.path.join(self.path, self.save_path, filename)
                with open(save_file, "wb") as file:
                    file.write(res.content)
                print("Download finished, saved to: " + save_file)
            except Exception:
                pass  # skip pages that fail to load or parse

    def multi_thread(self):
        # Split the link list into four slices and download each in its own
        # thread. Pass the bound method itself to target= (no parentheses);
        # calling it would run the downloads sequentially in the main thread.
        quarter = len(self.lst1) // 4
        bounds = [(0, quarter), (quarter, 2 * quarter),
                  (2 * quarter, 3 * quarter), (3 * quarter, len(self.lst1))]
        threads = [threading.Thread(target=self.download, args=b) for b in bounds]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


if __name__ == '__main__':
    name = input("Enter the image keyword to download: ")
    xname = input("Enter the folder name to create (in English): ")
    spider = Spider(name, xname)
    spider.parser_content()
    spider.multi_thread()
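
Since the end goal is a twelve-class zodiac dataset, you can also drive the spider in a loop instead of typing each category by hand. A hedged sketch (the Chinese keywords and English folder names below are illustrative assumptions; adjust to your dataset layout):

# Hypothetical batch driver: one crawl per zodiac sign.
ZODIAC = [
    ("鼠", "rat"), ("牛", "ox"), ("虎", "tiger"), ("兔", "rabbit"),
    ("龙", "dragon"), ("蛇", "snake"), ("马", "horse"), ("羊", "goat"),
    ("猴", "monkey"), ("鸡", "rooster"), ("狗", "dog"), ("猪", "pig"),
]

for keyword, folder in ZODIAC:
    spider = Spider(keyword, folder)
    spider.parser_content()
    spider.multi_thread()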

That completes the code. Finally, let's take a look at the result.


Opening the folder we named at the start, we can see the images have been downloaded and saved there, and the download was quite fast.
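
Before feeding the images into training, it's worth checking that every file is actually a readable image, since a few downloads may be truncated. A minimal sketch using Pillow (an extra dependency, not used by the spider itself; the folder path is a hypothetical example):

import os
from PIL import Image

folder = "./data/train/dog"  # hypothetical category folder
bad = []
for fname in os.listdir(folder):
    fpath = os.path.join(folder, fname)
    try:
        with Image.open(fpath) as im:
            im.verify()  # raises if the file is truncated or not an image
    except Exception:
        bad.append(fpath)

print(str(len(bad)) + " corrupt files found")
for fpath in bad:
    os.remove(fpath)  # drop unreadable files from the dataset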

Summary

Interest is the best teacher.
That's it for today's code.
