Project overview

  • askWeb/index.py — website scraping class
  • database/index.py — database class (database wrapper)
  • utils/index.py — utility functions
  • main.py — project entry point
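Judging from the imports in the listings below, the third-party dependencies are roughly aiohttp, beautifulsoup4, requests, urllib3 and pymysql; installing them with something like the following should be enough (exact versions are not pinned in the article):

pip install aiohttp beautifulsoup4 requests urllib3 pymysql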

1. main.py — project entry point

from askWeb.index import AskUrl
import datetime
from database import database
from utils import utils
import asyncio
import aiohttp


# Scrape one site
# urlName: site name
# url: site address
# requestType: request type (get/post)
# webType: site type
async def getUrlContent(urlName, url, requestType="get", webType=1):
    # start time
    startTime = datetime.datetime.now()
    print("抓取" + urlName + "开始计时...........")
    await AskUrl(url).handleGetYourContent(requestType, webType)
    # end time
    endTime = datetime.datetime.now()
    lastTime = str((endTime - startTime).seconds)
    print("抓取" + urlName + "总用时:" + lastTime)


# Base URL of 今日热榜 (tophub)
# "https://tophub.today"
if __name__ == "__main__":
    startTime = datetime.datetime.now()
    # the sites to scrape
    urlArr = [
        {
            "urlName": "b站热榜",
            "url": "https://www.bilibili.com/v/popular/rank/all",
            "requestType": "get",
            "type": 1,
        },
        {
            "urlName": "微博热榜",
            "url": "https://tophub.today/n/KqndgxeLl9",
            "requestType": "get",
            "type": 6,
        },
        {
            "urlName": "微信热榜",
            "url": "https://tophub.today/n/WnBe01o371",
            "requestType": "get",
            "type": 5,
        },
        {
            "urlName": "抖音视频榜",
            "url": "https://tophub.today/n/WnBe01o371",
            "requestType": "get",
            "type": 4,
        },
        {
            "urlName": "CSDN综合热榜",
            "url": "https://blog.csdn.net/phoenix/web/blog/hot-rank",
            "requestType": "get",
            "type": 3,
        },
        {
            "urlName": "IT资讯热榜",
            "url": "https://it.ithome.com/",
            "requestType": "get",
            "type": 2,
        },
    ]
    # task list
    task_list = []
    # enumerate() turns an iterable into an indexed sequence
    for key, value in enumerate(urlArr):
        future = asyncio.ensure_future(
            getUrlContent(value["urlName"], value["url"], value["requestType"], value["type"]))
        task_list.append(future)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(task_list))
    # loop.close() is left commented out: the tasks were still winding down when the
    # loop was closed, which raised errors. If you have a better approach, leave a comment.
    # loop.close()
    # end time
    endTime = datetime.datetime.now()
    # elapsed time
    lastTime = str((endTime - startTime).seconds)
    print("总用时:" + lastTime)
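On newer Python versions (3.10+), calling asyncio.get_event_loop() without a running loop raises a deprecation warning. A minimal alternative sketch using asyncio.run and asyncio.gather, which creates and closes the loop itself and therefore also sidesteps the loop.close() issue mentioned above, could look like this (the main() wrapper is hypothetical, not part of the project):

# Hypothetical alternative entry point; assumes Python 3.7+ and reuses
# getUrlContent() and urlArr exactly as defined above.
async def main():
    tasks = [
        getUrlContent(value["urlName"], value["url"], value["requestType"], value["type"])
        for value in urlArr
    ]
    # wait for all scrapers concurrently
    await asyncio.gather(*tasks)

# asyncio.run(main())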

2. askWeb/index.py — website scraping class

To write the results to the database, uncomment the handleInsert calls and adapt the fields to your own table; a sketch of an enabled insert follows the listing below.

import ssl
import time
from urllib.error import HTTPError

import aiohttp
from bs4 import BeautifulSoup  # HTML parsing / data extraction
import requests
import urllib3
from urllib import request
from http import cookiejar
from utils import utils
import database.database
import random
import json

urllib3.disable_warnings()
ssl._create_default_https_context = ssl._create_unverified_context


# Website access class
class AskUrl():
    # class initialisation
    def __init__(self, url):
        self.dealSql = database.database.ConnectSql()
        self.url = url

    # Pick a random User-Agent
    def handleRandomUserAgent(self):
        allUserAgent = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36 OPR/87.0.4390.45 (Edition Campaign 34)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.42 Safari/537.36 Edg/103.0.1264.21",
        ]
        return allUserAgent[random.randint(0, 3)]

    # Get the site's cookies
    def getWebCookie(self):
        # a CookieJar instance to hold the cookies
        cookie = cookiejar.CookieJar()
        # build a cookie handler with urllib's HTTPCookieProcessor
        handler = request.HTTPCookieProcessor(cookie)
        # build an opener from the handler
        opener = request.build_opener(handler)
        print(self.url)
        # open the page
        try:
            opener.open(self.url)
        except HTTPError as e:
            print("捕获网站cookie报错信息:%s" % e)
            return ""
        cookieStr = ""
        for item in cookie:
            cookieStr = cookieStr + item.name + "=" + item.value + ";"
        return cookieStr[:-1]

    # Async coroutine: request the site
    async def visitWeb(self, method, param="", header="", session=""):
        # silence request warnings
        requests.packages.urllib3.disable_warnings()
        proxies = {
            "http": None,
            "https": None,
        }
        if header == "":
            cookie = self.getWebCookie()
            # print(cookie, "cookiecookiecookie")
            header = {
                "Cache-Control": "no-cache",
                "Cookie": cookie,
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,"
                          "application/signed-exchange;v=b3;q=0.9",
                'User-Agent': self.handleRandomUserAgent(),
            }
        if method == 'get':
            async with await session.get(self.url, data=param and param or {}, headers=header) as resp:
                page_text = await resp.content.read(999999999999999)
        else:
            async with await session.post(self.url, data=param and param or "", headers=header) as resp:
                page_text = await resp.content.read(999999999999999)
        # convert the encoding to avoid garbled Chinese characters
        page_text = page_text.decode("utf-8", "ignore")
        # load the page source into a BeautifulSoup object
        soup = BeautifulSoup(page_text, 'html.parser')
        # print(soup)
        return soup

    # Grab the page title and header meta information
    # Note: visitWeb is a coroutine, so this helper would need to be awaited
    # inside an aiohttp session if it were used.
    def handleGetWebTitleAndLogo(self):
        # 1. load the page source into a BeautifulSoup object
        soup = self.visitWeb("get")
        try:
            webTitle = soup.select("title")  # site title
            webTitle = webTitle and webTitle[0].text or ""
        except HTTPError as e:
            webTitle = ""
            print("网站标题报错信息:%s" % e)
        try:
            webLogo = soup.select("link[type='image/x-icon']")  # site logo
            webLogo = webLogo and webLogo[0].get("href") or ""
        except HTTPError as e:
            webLogo = ""
            print("网站logo报错信息:%s" % e)
        try:
            webDescription = soup.select("meta[name='description']")  # site description
            webDescription = webDescription and webDescription[0].get("content") or ""
        except HTTPError as e:
            webDescription = ""
            print("网站描述报错信息:%s" % e)
        try:
            webKeywords = soup.select("meta[name='keywords']")  # site keywords
            webKeywords = webKeywords and (webKeywords[0].get("content") is None and "" or webKeywords[0].get("content")) or ""
        except HTTPError as e:
            webKeywords = ""
            print("网站关键词报错信息:%s" % e)
        return {"webTitle": webTitle, "webLogo": webLogo, "webDescription": webDescription, "webKeywords": webKeywords}

    # Extract the data you want / filter the page content
    # type: which site to scrape
    async def handleGetYourContent(self, requestType="get", type=1, params=""):
        """
        aiohttp: send the HTTP request
        1. create a ClientSession object
        2. send a get/post/put request through the ClientSession
        3. await the result (the coroutine is suspended)
        """
        async with aiohttp.ClientSession() as session:
            # 1. load the page source into a BeautifulSoup object
            soup = await self.visitWeb(requestType, params, session=session)
            if type == 1:
                await self.handleGrabBliWeb(soup)
            elif type == 2:
                await self.handleItHomeWeb(soup)
            elif type == 3:
                await self.handleGetCsdnWeb(soup)
            elif type == 4:
                await self.handleGetDyVideoWeb(soup)
            elif type == 5:
                await self.handleGetWeChatWeb(soup)
            elif type == 6:
                await self.handleGetWeiBoWeb(soup)
            print('操作完成')

    # Weibo hot-search ranking
    async def handleGetWeiBoWeb(self, soup, num=1, page=0):
        # 2. fetch the list rows by tag
        li_list = soup.select(".table")[0].select("tbody>tr")
        # loop over the rows
        for item in li_list:
            href = item.select(".al a")[0].get("href")  # link
            title = item.select(".al a")[0].text  # title
            hotData = item.select("td")[2].text
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, hot_num=hotData, type=5,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 访问量: %s 访问路径:%s" % (num, title, hotData, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1

    # WeChat hot-article ranking
    async def handleGetWeChatWeb(self, soup, num=1):
        # 2. fetch the list rows by tag
        li_list = soup.select(".table")[0].select("tbody>tr")
        # loop over the rows
        for item in li_list:
            href = item.select(".al a")[0].get("href")  # link
            title = item.select(".al a")[0].text  # title
            hotData = item.select("td")[2].text
            hotData = hotData.split(" ")[0]  # heat score
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, hot_num=hotData, type=5,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 访问量: %s 访问路径:%s" % (num, title, hotData, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1

    # Douyin short-video ranking
    async def handleGetDyVideoWeb(self, soup, num=1):
        # 2. fetch the list rows by tag
        li_list = soup.select(".table")[0].select("tbody>tr")
        # loop over the rows
        for item in li_list:
            href = item.select(".al a")[0].get("href")  # link
            title = item.select(".al a")[0].text  # title
            hotData = item.select("td")[2].text  # heat score
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, hot_num=hotData, type=4,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 访问量: %s 访问路径:%s" % (num, title, hotData, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1

    # CSDN article ranking
    async def handleGetCsdnWeb(self, soup, num=1, page=0):
        # 2. the data comes from an API: parse the JSON string into a list
        li_list = json.loads(str(soup), strict=False)["data"]
        # loop over the rows
        for item in li_list:
            href = item["articleDetailUrl"]  # link
            title = item["articleTitle"]  # title
            hotData = item["hotRankScore"]  # heat score
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, hot_num=hotData, type=3,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 热度量:%s 访问路径:%s" % (num, title, hotData, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1
        # fetch the next page (first four pages only)
        if page < 4:
            curPage = page + 1
            async with aiohttp.ClientSession() as session:
                soup = await self.visitWeb("get", {"page": curPage, "pageSize": 25, "type": ""}, session=session)
                return await self.handleGetCsdnWeb(soup, num, curPage)

    # Bilibili ranking
    async def handleGrabBliWeb(self, soup, num=1):
        # 2. fetch the list items by tag
        li_list = soup.select(".rank-list-wrap>ul>li")
        # loop over the items
        for item in li_list:
            href = item.select(".info a")[0].get("href")  # link
            title = item.select(".info a")[0].text  # title
            # "".join() strips the whitespace
            hotData = "".join(item.select(".info .detail-state .data-box")[0].text.split())  # play count
            if href.find("//", 0) >= 0:
                href = href.split("//")[1]
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, hot_num=hotData, type=1,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 访问量: %s 访问路径:%s" % (num, title, hotData, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1

    # ITHome ranking
    async def handleItHomeWeb(self, soup, num=1, nexPage=1):
        # first page
        if nexPage == 1:
            # 2. fetch the list items by tag
            li_list = soup.select(".fl>ul>li")
        # second page onwards
        else:
            li_list = soup.select("li")
        # loop over the items
        for item in li_list:
            href = item.select("a[class='img']")[0].get("href")  # link
            title = item.select("a[class='img']")[0].select("img")[0].get("alt")  # title
            # res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href, type=2,
            #                                 add_time=utils.FormatDate(), update_time=utils.FormatDate())
            res = True
            if res:
                data = "第 %s 条插入成功:标题: %s 访问路径:%s" % (num, title, href)
                print(data)
            else:
                data = "第 %s 条插入失败"
                print(data)
            # time.sleep(1)
            num += 1
        # the second page is loaded through a POST API
        if nexPage == 1:
            nexPageUrl = "https://it.ithome.com/category/domainpage"
            header = {
                "Cache-Control": "no-cache",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
            }
            param = {"domain": "it", "subdomain": "", "ot": int(time.time()) * 1000}
            async with aiohttp.ClientSession() as session:
                resData = await AskUrl(nexPageUrl).visitWeb("post", param=param, header=header, session=session)
                # strict=False is required when parsing the JSON, otherwise it raises
                soup = BeautifulSoup(json.loads(str(resData), strict=False)["content"]["html"], 'html.parser')
                return await self.handleItHomeWeb(soup, num, nexPage + 1)
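For reference, this is roughly what one of the commented-out inserts looks like once enabled, assuming a g_hot_list table whose columns match the keyword arguments used throughout the listing (title, url, hot_num, type, add_time, update_time); adjust the names to your own schema:

            # inside one of the loops above, replacing `res = True`
            res = self.dealSql.handleInsert(table="g_hot_list", title=title, url=href,
                                            hot_num=hotData, type=5,
                                            add_time=utils.FormatDate(),
                                            update_time=utils.FormatDate())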

3. utils/index.py — utility functions

import time


# Convert a timestamp (defaults to now) into a formatted date string
def FormatDate(timeNow="", fmt="%Y-%m-%d %H:%M:%S"):
    if timeNow == "":
        # current time
        timeNow = int(time.time())
    # convert to local time
    time_local = time.localtime(timeNow)
    # convert to the target format (e.g. 2016-05-09 18:59:20)
    dt = time.strftime(fmt, time_local)
    return dt


# Convert a date string such as 2022-1-1 into a timestamp
def time_to_str(val):
    return int(time.mktime(time.strptime(val, "%Y-%m-%d")))


# Current timestamp
def cur_time_to_str():
    return int(time.mktime(time.localtime(time.time())))
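A quick sanity check of the helpers above; the import path follows the `from utils import utils` convention used in askWeb/index.py, and the printed values are only illustrative (they depend on the local timezone):

from utils import utils

print(utils.FormatDate())                # current time, e.g. 2022-06-25 18:59:20
print(utils.FormatDate(1651046400))      # format an existing Unix timestamp
print(utils.time_to_str("2022-01-01"))   # date string -> Unix timestamp (local midnight)
print(utils.cur_time_to_str())           # current Unix timestamp as an int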

4. database/index.py — database class (database wrapper)

Fill in your own host IP, MySQL account and password.


import pymysql


# Database connection class
class ConnectSql():
    # member attributes
    # MySQL host IP
    __host = "xxxx"
    # MySQL user
    __user = "xxxx"
    # MySQL password
    __passwd = "xxxxx"
    # MySQL port
    __port = 3306
    # database name
    __db = "xxxx"
    # character set
    __charset = "utf8"
    cur = ""

    # constructor / class initialisation
    def __init__(self):
        try:
            # connect to the database
            self.conn = pymysql.connect(host=self.__host, user=self.__user, password=self.__passwd,
                                        port=self.__port, database=self.__db, charset=self.__charset)
            self.cur = self.conn.cursor()  # create a cursor object
        except pymysql.Error as e:
            print("链接错误:%s" % e)

    # destructor: release the object
    def __del__(self):
        print("摧毁")

    # close the database
    def closedatabase(self):
        # close the connection if it is open, otherwise do nothing
        if self.conn and self.cur:
            self.cur.close()
            self.conn.close()
        return True

    # run execute() and report whether any rows were affected
    def handleExcute(self, sql):
        try:
            self.cur.execute(sql)  # execute the SQL statement
            self.conn.commit()  # commit to the database
            count = self.cur.rowcount
            if count > 0:
                return True
            else:
                return False
        except pymysql.Error as e:
            print("错误内容:", e)
            # the statement failed, roll back the transaction
            self.conn.rollback()
            self.closedatabase()
            return False

    # run a raw SQL statement
    def dealMysql(self, dataSql):
        self.handleExcute(dataSql)

    # insert data
    def handleInsert(self, **params):
        table = "table" in params and params["table"] or ""
        sql = "INSERT INTO %s(" % table
        del params["table"]
        fields = ""
        values = ""
        for k, v in params.items():
            fields += "%s," % k
            # pick the quoting that matches the value type
            if type(v) == type("test"):
                values += "'%s'," % v
            elif type(v) == type(1):
                values += "%s," % v
        fields = fields.rstrip(',')
        values = values.rstrip(',')
        sql = sql + fields + ")values(" + values + ")"
        print(sql, "handleInsert")
        return self.handleExcute(sql)

    # delete data
    def handleDel(self, **params):
        table = "table" in params and params["table"] or ""
        where = "where" in params and params["where"] or ""
        sql = "DELETE FROM %s WHERE %s " % (table, where)
        print(sql, "handleDel")
        return self.handleExcute(sql)

    # update data
    def handleUpdate(self, **params):
        table = "table" in params and params["table"] or ""
        where = "where" in params and params["where"] or ""
        params.pop("table")
        params.pop("where")
        sql = "UPDATE %s SET " % table
        for k, v in params.items():
            # pick the quoting that matches the value type; separate fields with commas
            if type(v) == type("test"):
                sql += "%s='%s'," % (k, v)
            elif type(v) == type(1):
                sql += "%s=%s," % (k, v)
        sql = sql.rstrip(',') + " WHERE %s" % where
        print(sql, "handleUpdate")
        return self.handleExcute(sql)

    # query multiple rows
    def handleFindAllData(self, **params):
        # table fields where order limit
        table = "table" in params and params["table"] or ""
        where = "where" in params and "WHERE " + params["where"] or ""
        field = "field" in params and params["field"] or "*"
        order = "order" in params and "ORDER BY " + params["order"] or ""
        sql = "SELECT %s FROM %s %s %s" % (field, table, where, order)
        print(sql, "handleFindAllData")
        return self.handleExcute(sql)

    # query a single row
    def handleFindOneData(self, **params):
        # table fields where order limit
        table = "table" in params and params["table"] or ""
        where = "where" in params and "WHERE " + params["where"] or ""
        field = "field" in params and params["field"] or "*"
        order = "order" in params and "ORDER BY " + params["order"] or ""
        sql = "SELECT %s FROM %s %s %s LIMIT 1" % (field, table, where, order)
        print(sql, "handleFindOneData")
        return self.handleExcute(sql)
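A minimal usage sketch, assuming the connection settings above have been filled in and a g_hot_list table with the columns used by askWeb/index.py already exists:

from database.database import ConnectSql
from utils import utils

db = ConnectSql()
# insert one row; handleInsert returns True when the row was written
ok = db.handleInsert(table="g_hot_list", title="测试标题", url="example.com",
                     hot_num="12345", type=1,
                     add_time=utils.FormatDate(), update_time=utils.FormatDate())
print(ok)
db.closedatabase()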

Project source code download: https://download.csdn.net/download/qq_36977923/85762109?spm=1001.2014.3001.5501

✨ Working through these pitfalls took some effort; if the post helped, your support is appreciated.
