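Scraping real-time hot topics from Baidu Top Search (百度搜索风云榜) with Python

Python web-scraping source code collection (continuously updated)

Baidu Top Search (百度搜索风云榜): http://top.baidu.com/

Source code: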

import os
import json
from datetime import datetime
from datetime import timezone
from datetime import timedelta
from collections import OrderedDict

import requests
from bs4 import BeautifulSoup


def get_utc8now():
    # Current time converted to UTC+8
    utcnow = datetime.now(timezone.utc)
    utc8now = utcnow.astimezone(timezone(timedelta(hours=8)))
    return utc8now


def save_as_json(filename, records):
    dict_obj = {}
    # Load earlier results so that repeated runs append instead of overwriting
    if os.path.exists(filename):
        with open(filename, 'r', encoding='utf-8') as f:
            dict_obj = json.load(f, object_pairs_hook=OrderedDict)
    time_str = str(get_utc8now())
    # One {'time', 'count'} entry per keyword for this run
    for keyword, search_index in records:
        time_count_dict = {'time': time_str, 'count': search_index}
        dict_obj.setdefault(keyword, []).append(time_count_dict)
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(dict_obj, f, indent=4, separators=(',', ': '),
                  ensure_ascii=False, sort_keys=False)


def crawl_baidu_top(buzz_no=1):
    response = requests.get('http://top.baidu.com/buzz?b={}'.format(buzz_no))
    response.encoding = 'gb18030'
    soup = BeautifulSoup(response.text, 'html.parser')
    table_tag = soup.find('table', {'class': 'list-table'})
    item_tags = table_tag.find_all('tr')
    keywords, search_indices = [], []
    for item in item_tags:
        keyword_tag = item.find('td', {'class': 'keyword'})
        last_tag = item.find('td', {'class': 'last'})
        # Skip header rows and any row without both cells
        if (keyword_tag is not None) and (last_tag is not None):
            keyword_title_tag = keyword_tag.find('a', {'class': 'list-title'})
            keywords.append(keyword_title_tag.text.strip())
            search_indices.append(last_tag.text.strip())
    return list(zip(keywords, search_indices))


if __name__ == '__main__':
    now = get_utc8now()
    year_str = now.strftime('%Y')
    date_str = now.strftime('%Y%m%d')
    # One directory per year, one JSON file per day
    os.makedirs(year_str, exist_ok=True)
    filename = os.path.join(year_str, '{} 实时热点.json'.format(date_str))
    records = crawl_baidu_top()
    save_as_json(filename, records)
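The row-parsing logic in crawl_baidu_top can be tried offline, without a network request. The sketch below is a minimal illustration and not the author's code: SAMPLE_HTML is made-up markup that only mirrors the class names the source selects on (list-table, keyword, list-title, last), and parse_top_table is a hypothetical helper name. It also skips rows with a missing link instead of raising, which is an assumption about how such rows should be handled.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may differ.
SAMPLE_HTML = """
<table class="list-table">
  <tr><th>Rank</th><th>Keyword</th><th>Search index</th></tr>
  <tr>
    <td class="first">1</td>
    <td class="keyword"><a class="list-title" href="#">keyword A</a></td>
    <td class="last"><span>12345</span></td>
  </tr>
  <tr>
    <td class="first">2</td>
    <td class="keyword"><a class="list-title" href="#"> keyword B </a></td>
    <td class="last"> 678 </td>
  </tr>
</table>
"""


def parse_top_table(html):
    """Return (keyword, search_index) pairs from a list-table, or [] if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    table_tag = soup.find('table', {'class': 'list-table'})
    if table_tag is None:
        return []
    records = []
    for item in table_tag.find_all('tr'):
        keyword_tag = item.find('td', {'class': 'keyword'})
        last_tag = item.find('td', {'class': 'last'})
        # Header rows have no keyword/last cells, so they are skipped here
        if keyword_tag is None or last_tag is None:
            continue
        title_tag = keyword_tag.find('a', {'class': 'list-title'})
        if title_tag is None:
            continue
        records.append((title_tag.text.strip(), last_tag.text.strip()))
    return records


records = parse_top_table(SAMPLE_HTML)
for keyword, search_index in records:
    print(keyword, search_index)
```

Keeping the parsing separate from the HTTP request makes it easier to check the selectors against a saved copy of the page if the live layout changes.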

Run:

Run again:
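Going by save_as_json in the source, a second run on the same day should append a new time/count entry under each keyword rather than replace the file. The sketch below reproduces only that append step with made-up data: append_records, the keywords, the counts, and the 'run-1'/'run-2' labels are all hypothetical stand-ins, and a temporary directory is used so nothing is written next to the script.

```python
import json
import os
import tempfile


def append_records(filename, records, time_str):
    """Append one {'time', 'count'} entry per keyword, keeping earlier entries."""
    data = {}
    if os.path.exists(filename):
        with open(filename, 'r', encoding='utf-8') as f:
            data = json.load(f)
    for keyword, count in records:
        data.setdefault(keyword, []).append({'time': time_str, 'count': count})
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.json')
    # First run: one keyword
    append_records(path, [('keyword A', '100')], 'run-1')
    # Second run: the same keyword again plus a new one
    result = append_records(
        path, [('keyword A', '120'), ('keyword B', '50')], 'run-2')

print(json.dumps(result, indent=4, ensure_ascii=False))
```

After the second call, 'keyword A' holds two entries (one per run) and 'keyword B' holds one, so each keyword accumulates a small time series over the day.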
