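Multi-threaded crawling of Baidu keyword search results, with real URL resolution

Project goal: practice.

Project requirement: given a list of keywords, query Baidu for each one and save the results to a file.

Problems encountered:

1. Reading values out of a Python list: when the nesting is hard to see, iterate with for index, item in enumerate(array): and print each element to inspect it.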
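As a minimal sketch of that inspection trick, the snippet below walks a nested list with enumerate and labels every record with its position. The sample data and the helper name describe are hypothetical; the shape (one inner list per result page, one record per hit) only mirrors what the crawler in this post collects.

```python
# Hypothetical sample: one inner list per result page, one record per hit.
results = [
    [["diet plan", 0, 1, "Title A", "https://example.com/a"]],
    [["diet plan", 1, 1, "Title B", "https://example.com/b"],
     ["diet plan", 1, 2, "Title C", "https://example.com/c"]],
]


def describe(nested):
    """Return one line per record, tagged with its position in the nesting."""
    lines = []
    for page_index, page in enumerate(nested):
        for record_index, record in enumerate(page):
            lines.append("[{}][{}] {!r}".format(page_index, record_index, record))
    return lines


for line in describe(results):
    print(line)
```

Seeing the two indices side by side makes it obvious which level a value such as item[3] belongs to.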

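2. Selecting the wanted element. There are two ways:

The first is direct attribute access: tag.h3.a['href'].

The second is to search and then loop: tagh3 = result.find_all('h3'); for h3 in tagh3: href = h3.find('a').get('href')

3. Building the search URLs. Each keyword is combined with each page number, and the pn parameter advances by 10 per page:

out_url = [(key, page, "https://www.baidu.com/s?wd={}&pn={}".format(key, page * 10),) for key in keys for page in range(pages)]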
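The same (key, page, url) tuples can also be built with the query string percent-encoded explicitly. This is only a sketch: build_urls is a hypothetical helper, and dropping blank or repeated keywords is an assumption borrowed from the disabled file-reading code in the listing, not something the one-liner above does.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.baidu.com/s"  # same endpoint as in the post


def build_urls(keys, pages):
    """Return (key, page, url) tuples, skipping blank and duplicate keywords."""
    seen = set()
    tasks = []
    for raw in keys:
        key = raw.strip()
        if not key or key in seen:
            continue  # assumption: blank or repeated keywords are not worth a request
        seen.add(key)
        for page in range(pages):
            # urlencode percent-encodes non-ASCII text and turns spaces into '+'
            query = urlencode({"wd": key, "pn": page * 10})
            tasks.append((key, page, "{}?{}".format(BASE_URL, query)))
    return tasks


tasks = build_urls(["减肥", "   ", "减肥", "a b"], pages=2)
for task in tasks:
    print(task)
```

Encoding the keyword up front keeps the saved URL identical to what is actually sent on the wire.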

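4. Dropping Baidu's own content from the results:

title = tag.h3.a.text; if '百度' in title: break

5. Dropping Baidu's aggregated blocks (such as video collections), whose links do not start with http:

if not href.startswith('http'): break

Note that break abandons the rest of the page as soon as one such entry appears; use continue instead to skip only that entry.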
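The two checks from items 4 and 5 can be folded into one predicate. This is a sketch under assumptions: keep_result and the sample (title, href) pairs are hypothetical, and the list comprehension skips unwanted entries one by one rather than stopping at the first one as break does.

```python
def keep_result(title, href):
    """Return True for an ordinary result that is worth resolving."""
    if "百度" in title:               # Baidu's own properties
        return False
    if not href.startswith("http"):  # aggregated blocks without an absolute link
        return False
    return True


# Hypothetical (title, href) pairs as they might come out of the parsed page.
candidates = [
    ("Diet plan for beginners", "http://www.baidu.com/link?url=abc"),
    ("减肥_百度百科", "http://www.baidu.com/link?url=def"),
    ("Video collection", "/sf/vsearch?wd=x"),
    ("How to lose weight", "https://www.baidu.com/link?url=ghi"),
]

kept = [(title, href) for title, href in candidates if keep_result(title, href)]
print(kept)
```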

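6. Getting the real URL behind a Baidu search result:

baidu_url = requests.get(url=href, headers=myhead, allow_redirects=False)

real_url = baidu_url.headers['Location']  # the original address of the page

if real_url.startswith('http'):

allow_redirects=False is the key point: it stops requests from following the redirect, so the 302 response and its Location header can be read directly.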
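Indexing headers['Location'] raises KeyError whenever the response is not a redirect. The sketch below isolates the decision in a small pure function; pick_real_url is a hypothetical helper that takes the status code and headers of an unfollowed response, so it runs without any network access.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def pick_real_url(status_code, headers):
    """Return the redirect target of an unfollowed response, or None."""
    if status_code not in REDIRECT_CODES:
        return None
    # HTTP header names are case-insensitive, but a plain dict is not.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location", "")
    return location if location.startswith("http") else None


print(pick_real_url(302, {"Location": "https://example.com/article"}))
print(pick_real_url(200, {"Content-Type": "text/html"}))
```

With requests it would be called as pick_real_url(baidu_url.status_code, baidu_url.headers) on a response fetched with allow_redirects=False.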

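7. Passing tasks and results between threads:

self.work_queue = Queue()  # task queue

self.result_queue = Queue()  # result queue

8. Threads hanging:

Always loop with while not self.work_queue.empty(): rather than while True:. With while True: and a blocking get(), a worker waits forever once the queue has been drained, and join() never returns.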
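Items 7 and 8 can be sketched together as a small pool built on the two queues. This is an illustration under assumptions, not the post's WorkManager: worker, run_all and add are hypothetical names. It keeps the while not empty() loop and adds a queue.Empty guard, because another thread can take the last task between the check and the get.

```python
import queue
import threading


def worker(work_queue, result_queue):
    """Drain the task queue, then return so the thread can be joined."""
    while not work_queue.empty():
        try:
            func, args = work_queue.get(block=False)
        except queue.Empty:
            break  # another worker took the last task after our empty() check
        try:
            result_queue.put(func(*args))
        except Exception as error:
            result_queue.put(error)  # one failed task should not kill the worker
        finally:
            work_queue.task_done()


def run_all(func, arg_tuples, thread_num=4):
    """Run func(*args) for every tuple and return the results in any order."""
    work_queue, result_queue = queue.Queue(), queue.Queue()
    for args in arg_tuples:  # fill the queue before any thread starts
        work_queue.put((func, args))
    threads = [threading.Thread(target=worker, args=(work_queue, result_queue))
               for _ in range(thread_num)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results


def add(a, b):
    return a + b


print(sorted(run_all(add, [(1, 2), (3, 4), (5, 6)])))
```

Filling the queue before starting the threads matters: a worker that starts against an empty queue exits immediately.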

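9. That is all; the code follows. It was adjusted slightly to make debugging easier, and the adjustments are explained in the comments.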

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
from retrying import retry
import requests
from bs4 import BeautifulSoup
import threading
from queue import Queue

lock = threading.RLock()


class WorkManager(object):
    def __init__(self, do_job, works, thread_num=25):
        self.job = do_job
        self.work_queue = Queue()  # task queue
        self.result_queue = Queue()  # result queue
        self.threads = []
        self.__init_work_queue(works)
        self.__init_thread_pool(thread_num)

    # Initialise the work queue: enqueue every task
    def __init_work_queue(self, works):
        for item in works:
            # print('__init_work_queue item:', item)  # the argument tuple
            self.work_queue.put((self.job, item))  # enqueue the task function together with its arguments

    # Initialise the threads. The number of concurrent threads does make a difference; I have not worked out why
    def __init_thread_pool(self, thread_num):
        for i in range(thread_num):
            self.threads.append(Work(self.work_queue, self.result_queue))

    # Wait for all threads to finish
    def wait_allcomplete(self):
        '''
        @description: wait for the threads to finish and collect their results
        @return: result_list
        '''
        for item in self.threads:
            if item.is_alive():  # isAlive() was removed in Python 3.9
                item.join()
        result_list = []
        for i in range(self.result_queue.qsize()):
            res = self.result_queue.get()
            # print('wait_allcomplete:', res)
            result_list.append(res)
        return result_list


class Work(threading.Thread):
    def __init__(self, work_queue, result_queue):
        threading.Thread.__init__(self)
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.start()  # start the thread

    def run(self):
        # never use an infinite loop here
        while not self.work_queue.empty():
            try:
                do, args = self.work_queue.get(block=False)  # non-blocking dequeue
                # print('Work args:', args)  # a list or tuple; check this carefully
                result = do(*args)  # unpack the list or tuple into positional arguments
                # print('work run result:', result, flush=True)
                self.result_queue.put(result)  # store the function's return value
                self.work_queue.task_done()  # mark the task as done
                with lock:
                    print('{}\tdone\twith\t{}\tat\t{}'.format(threading.current_thread().name, args[0], get_stime()), flush=True)
            except Exception as error:
                print(error, flush=True)
                break


def get_stime():
    ct = time.time()
    local_time = time.localtime(ct)
    data_head = time.strftime("%Y-%m-%d %H:%M:%S", local_time)
    data_secs = (ct - int(ct)) * 1000
    stamp = "%s.%03d" % (data_head, data_secs)
    return stamp


myhead = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',  # sdch and br dropped: requests cannot decode them without extra packages
    'Accept-Language': 'zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'close',
    'Proxy-Connection': 'no-cache',
}


def parse_url(url, params=None, headers=myhead, proxies=None, timeout=6, ecode='utf-8',
              wait_random_min=200, wait_random_max=3000, stop_max_attempt_number=100):
    @retry(wait_random_min=wait_random_min, wait_random_max=wait_random_max, stop_max_attempt_number=stop_max_attempt_number)
    def _parse_url(url):
        response = requests.get(url, params=params, headers=headers, proxies=proxies, timeout=timeout)
        assert response.status_code == 200
        # Because of the status_code == 200 check, this helper cannot be used to
        # resolve Baidu's real URLs: those responses have code 302
        return response.content.decode(ecode)

    try:
        response = _parse_url(url)
        soup = BeautifulSoup(response, 'lxml')
        for s in soup(["script", "style"]):
            s.extract()
    except requests.exceptions.ConnectionError as e:
        print('ConnectionError:', e, url, flush=True)
        soup = None
    except requests.exceptions.ChunkedEncodingError as e:
        print('ChunkedEncodingError:', e, url, flush=True)
        soup = None
    except Exception as e:
        print('Unfortunately Unknown Error:', e, url, flush=True)
        soup = None
    return soup


def fd():
    import win32ui
    _dlg = win32ui.CreateFileDialog(1)  # 1 means an "open file" dialog
    _dlg.SetOFNInitialDir('c:/')  # initial directory shown by the dialog
    _dlg.DoModal()
    filename = _dlg.GetPathName()  # path of the selected file
    return filename


def make_urls(pages):
    # Disabled: read the keywords from a file chosen in a dialog.
    # _k = []
    # _file = fd()
    # if not _file:
    #     return False
    # res = _file.split('.')[0:-1]  # file name with full path, extension removed
    # with open(_file) as f:
    #     for row in f.readlines():
    #         row = row.strip()  # strips whitespace by default  '#^\s*$'
    #         if len(row) == 0:
    #             break  # drop zero-length lines
    #         _k.append(row)
    # keys = sorted(set(_k), key=_k.index)
    # For ease of demonstration, a list replaces reading the file
    keys = ["减肥计划", "减肥运动", "如何减肥", "怎么减肥", "有效减肥", "郑多燕减肥", "减肥视频", "减肥", "减肥方法",
            "减肥食谱", "   ", "减肚子", "腰腹减肥", "\t", "减腰", "减肥法", "减肥法"]
    out_url = [(key, page, "https://www.baidu.com/s?wd={}&pn={}".format(key, page * 10),) for key in keys for page in range(pages)]
    return 'baidu', out_url
    # return res[0], out_url


def getkeys(key, page, url):
    _texts = []
    result = parse_url(url=url)
    if result is None:  # the page could not be fetched or parsed
        return _texts
    # Method 1 (disabled)
    # tagh3 = result.find_all('h3')
    # index = 0
    # for h3 in tagh3:
    #     href = h3.find('a').get('href')
    #     title = h3.find('a').text
    #     if '百度' in title:
    #         break
    #     if not href.startswith('http'):
    #         break
    #     baidu_url = requests.get(url=href, headers=myhead, allow_redirects=False)  # do not follow the redirect
    #     real_url = baidu_url.headers['Location']  # the original address of the page
    #     if real_url.startswith('http'):
    #         index += 1
    #         _texts.append([index, title, real_url])
    # End of method 1

    # Method 2, same effect as method 1
    allTags = result.find_all('div', ['result-op c-container xpath-log', 'result c-container'])
    # 'result-op c-container xpath-log'   # Baidu's own content
    index = 0
    for tag in allTags:
        href = tag.h3.a['href']
        title = tag.h3.a.text
        if '百度' in title:
            break  # note: this also drops the remaining results on the page
        if not href.startswith('http'):
            break
        baidu_url = requests.get(url=href, headers=myhead, allow_redirects=False)
        real_url = baidu_url.headers.get('Location', '')  # the original address of the page; empty if there was no redirect
        if real_url.startswith('http'):
            index += 1
            _texts.append([key, page, index, title, real_url])
    # End of method 2
    return _texts


def savefile(_filename, lists):
    # Write the crawled records in lists to a file
    print('[' + _filename + '] saving...', end='', flush=True)
    lists.sort()
    with open(_filename, 'a', encoding='utf-8') as f:
        for lists_line in lists:
            for index, item in enumerate(lists_line):
                f.write('key:' + item[0] + '\tpage:' + str(item[1]) + '\tindex:' + str(item[2]) + '\ttitle:' + item[3] + '\turl:' + item[4] + '\n')
    print('[' + _filename + '] saved.', flush=True)


def main():
    start = time.time()
    try:
        _name, urls = make_urls(10)
    except Exception as e:
        print(e)
        return False
    work_manager = WorkManager(getkeys, urls)  # arguments: task function, list of argument tuples; thread_num defaults to 25
    texts = work_manager.wait_allcomplete()
    savefile(_name + '_百度词频.txt', texts)
    print("threadPool cost all time: %s" % (time.time() - start), flush=True)


if __name__ == "__main__":
    main()
# threadPool cost all time: 27.787729501724243

