Scraping IT Home (IT之家) industry news

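Target site: https://it.ithome.com/ityejie/ . The script pages through the article list, then opens each article's detail page and extracts its content.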

    import json

    import requests
    from lxml import etree
    from pymongo import MongoClient

    # AJAX endpoint that the list page calls to load each page of articles
    url = 'https://it.ithome.com/ithome/getajaxdata.aspx'
    # Headers copied from the browser request. The HTTP/2 pseudo-headers
    # (authority, method, path, scheme) and the hard-coded content-length are
    # dropped: they are not regular header fields, and requests computes the
    # length itself. accept-encoding is left to requests so the server is not
    # invited to send Brotli, which requests decodes only when a Brotli
    # package is installed.
    headers = {
        'accept': 'text/html, */*; q=0.01',
        'accept-language': 'zh-CN,zh;q=0.9',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        # Captured from a browser session; it may have expired
        'cookie': 'BAIDU_SSP_lcr=https://www.hao123.com/link/https/?key=http%3A%2F%2Fwww.ithome.com%2F&&monkey=m-kuzhan-group1&c=B329C2F33C91DEACCFAEB1680305F198; Hm_lvt_f2d5cbe611513efcf95b7f62b934c619=1530106766; ASP.NET_SessionId=tyxenfioljanx4xwsvz3s4t4; Hm_lvt_cfebe79b2c367c4b89b285f412bf9867=1530106547,1530115669; BEC=228f7aa5e3abfee5d059195ad34b4137|1530117889|1530109082; Hm_lpvt_f2d5cbe611513efcf95b7f62b934c619=1530273209; Hm_lpvt_cfebe79b2c367c4b89b285f412bf9867=1530273261',
        'origin': 'https://it.ithome.com',
        'referer': 'https://it.ithome.com/ityejie/',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
    }

    client = MongoClient()  # defaults to localhost:27017
    db = client['ithome']
    collection = db['ithome']
    max_page = 1000


    def get_page(page):
        """Fetch one page of the article list and process every article on it."""
        form_data = {
            'categoryid': '31',
            'type': 'pccategorypage',
            'page': page,
        }
        try:
            r = requests.post(url, data=form_data, headers=headers, timeout=10)
            if r.status_code == 200:
                # The response body is a string; parse it into an HTML element tree
                tree = etree.HTML(r.text)
                link_list = tree.xpath('//h2/a/@href')

                print('Extracting articles from page ' + str(page))
                for index, link in enumerate(link_list, start=1):
                    print('Parsing article ' + str(index) + ' on page ' + str(page))
                    print('Link: ' + link)
                    loadpage(link)
        except requests.RequestException as e:
            print('Error', e.args)


    def loadpage(link):
        """Fetch one article's detail page, extract its fields, and store them."""
        detail_headers = {'user-agent': headers['user-agent']}
        try:
            response = requests.get(link, headers=detail_headers, timeout=10)
            if response.status_code == 200:
                node = etree.HTML(response.text)

                ithome = {}
                # xpath() returns a list; each query here is expected to match
                # exactly one element, so take index 0.
                # Title
                ithome['title'] = node.xpath('//*[@id="wrapper"]/div[1]/div[2]/h1')[0].text
                # Publication time (the original key was misspelled 'data')
                ithome['date'] = node.xpath('//*[@id="pubtime_baidu"]')[0].text
                # Body: join the text of every paragraph
                ithome['content'] = "".join(node.xpath('//*[@id="paragraph"]/p/text()')).strip()
                # Author
                ithome['author'] = node.xpath('//*[@id="author_baidu"]/strong')[0].text
                # Comment count. Note: this comes back empty in practice,
                # presumably because the page fills it in after loading.
                ithome['commentcount'] = node.xpath('//span[@id="commentcount"]')[0].text

                # Write the file first: insert_one() adds an '_id' ObjectId to
                # the dict, which json.dumps cannot serialize.
                write_to_file(ithome)
                save_to_mongo(ithome)
        except requests.RequestException as e:
            print('Error', e.args)
        except IndexError:
            # An XPath matched nothing, so the page layout is not the expected one
            print('Skipped (unexpected page layout):', link)


    def write_to_file(content):
        # Append one JSON object per line; the with block closes the file
        with open('ithome.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + '\n')


    def save_to_mongo(result):
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        if collection.insert_one(result).acknowledged:
            print('Saved to Mongo')


    if __name__ == '__main__':
        for page in range(1, max_page + 1):
            get_page(page)
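The detail-page parser takes index `[0]` of each XPath result, so a single missing element discards the whole article. A minimal sketch of a more tolerant approach is a small helper that returns a default when nothing matches. The helper name `first_text`, the `fields` dict, and the sample markup are all hypothetical. The demo uses the standard library's `xml.etree.ElementTree` on a tiny well-formed snippet only so that it runs without lxml; the list returned by lxml's `xpath()` could be passed to the helper in the same way.

```python
import xml.etree.ElementTree as ET


def first_text(nodes, default=None):
    """Return the stripped text of the first node in `nodes`, or `default`.

    `nodes` is the list returned by an XPath query: either elements (which
    carry a .text attribute) or plain strings. An empty list, or a first
    node with no text, yields `default`.
    """
    if not nodes:
        return default
    first = nodes[0]
    text = first if isinstance(first, str) else first.text
    if text is None:
        return default
    return text.strip() or default


# Hypothetical, simplified stand-in for a detail page (not the real markup).
SAMPLE = (
    '<html><body>'
    '<h1> Sample title </h1>'
    '<span id="pubtime_baidu">2018-06-29 10:00:00</span>'
    '<span id="commentcount"></span>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)
fields = {
    'title': first_text(root.findall('.//h1')),
    'pubtime': first_text(root.findall(".//*[@id='pubtime_baidu']")),
    # No author element in the sample, so this falls back to None
    'author': first_text(root.findall(".//*[@id='author_baidu']/strong")),
    # Element exists but is empty, so the explicit default is used
    'commentcount': first_text(root.findall(".//*[@id='commentcount']"), default='0'),
}
print(fields)
```

With this pattern an article whose author or comment count is missing is still saved, with the gap recorded as `None` or a chosen default.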

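The main loop always requests pages 1 through `max_page` (1000), even if the list runs out much earlier. One possible refinement, sketched below, stops once a page yields no new links and skips links that were already seen. The function `crawl_pages` and its parameters are hypothetical, and wiring it to the script would require `get_page` to return its link list instead of processing it directly. The fake page data exists only to make the sketch runnable offline.

```python
def crawl_pages(fetch_links, handle_link, max_page=1000, max_empty=1):
    """Walk list pages 1..max_page, stopping early when pages stop yielding links.

    fetch_links(page) -> list of article URLs on that page (may be empty)
    handle_link(url)  -> process a single article
    max_empty         -> consecutive pages without new links tolerated before stopping
    Returns the number of distinct articles handled.
    """
    seen = set()
    empty_streak = 0
    for page in range(1, max_page + 1):
        new_on_page = 0
        for link in fetch_links(page):
            if link in seen:
                continue  # already processed on an earlier page
            seen.add(link)
            handle_link(link)
            new_on_page += 1
        if new_on_page:
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= max_empty:
                break
    return len(seen)


# Offline stand-in for the list endpoint: two pages, one overlapping link.
FAKE_PAGES = {
    1: ['https://example.com/a', 'https://example.com/b'],
    2: ['https://example.com/b', 'https://example.com/c'],
}
requested = []
visited = []


def fake_fetch(page):
    requested.append(page)
    return FAKE_PAGES.get(page, [])


total = crawl_pages(fake_fetch, visited.append)
print(total, visited, requested)
```

In this run only pages 1 to 3 are requested: page 3 is empty, so the loop ends there instead of continuing to page 1000.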
Reposted from: https://www.cnblogs.com/wanglinjie/p/9246369.html
