Scraping Sina News with Python

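Work at my company has been slow lately, and I have had a lot of free time. As a new graduate only a few months into the job, that made me restless, so I decided to teach myself something new. Many people around me were already working with web crawlers, and I have recently been hopelessly hooked on Wu Xuanyi: I vote for her every day and want to monitor how the vote count grows, which is a perfect job for a crawler. So I decided to start learning Python. (Guides for setting up a Python environment are easy to find online; my IDE is PyCharm.)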

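To get started with Python, first pick a site to practice on; I chose Sina News. Before scraping a news site, always check whether it offers an interface intended for developers. News sites generally publish their own RSS feeds, which make it easy to get clean data. If you scrape the news pages directly, a lot of ads get in the way.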
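To show why an RSS feed is easier to work with than a news page, here is a minimal sketch that pulls titles and links out of a feed using only the standard library. The sample feed and the `parse_items` helper are my own illustration, not part of the crawler in this post; a real feed from Sina may contain extra fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, shaped like a standard RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://news.sina.com.cn/c/example-1.shtml</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://news.sina.com.cn/c/example-2.shtml</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one {'title', 'link'} dict per <item> element in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter('item'):
        items.append({
            'title': (item.findtext('title') or '').strip(),
            'link': (item.findtext('link') or '').strip(),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry['title'], entry['link'])
```

Every entry arrives as a small, regular record with no ads or page layout around it, which is the convenience described above.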

Enough talk, on to the code!

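config.ini is the configuration file. To make later changes easier, the crawler's settings are kept out of the code. Its contents are as follows: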

[info]
postUrl = http://bigdata.ossou.cn/api/pythonSave
scrapyUrl = http://rss.sina.com.cn/news/china/focus15.xml
patternUrl = http://news.sina.com.cn/.*?.shtml
patternImage = http://n.sinaimg.cn/.*?$
imageUrl = images/
loadImageUrl = http://bigdata.ossou.cn/sinaScrapy/
time = 7200
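As a rough illustration of how these settings are consumed, the sketch below loads two of them with `configparser` and applies `patternUrl` to a piece of text. The config is inlined as a string so the example runs on its own, and the sample `<link>` fragment is made up. Note that the `.` before `shtml` in the pattern is unescaped, so it matches any character; that is loose, but it still finds the article links.

```python
import configparser
import re

# Two of the settings from config.ini, inlined so the sketch is self-contained.
CONFIG_TEXT = """
[info]
scrapyUrl = http://rss.sina.com.cn/news/china/focus15.xml
patternUrl = http://news.sina.com.cn/.*?.shtml
"""

conf = configparser.ConfigParser()
conf.read_string(CONFIG_TEXT)

# Option names are case-insensitive by default, so 'patternUrl' resolves fine.
pattern_url = conf.get('info', 'patternUrl')

# Hypothetical fragment of feed text containing one article link.
sample = '<link>http://news.sina.com.cn/c/2018-06-01/doc-example.shtml</link>'
links = re.findall(pattern_url, sample)
print(links)
```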

sinaScrapy.py contains the main source code for scraping Sina News:

import socket
import re
import os
import time
import json
import configparser
from bs4 import BeautifulSoup
from urllib import request


# Holds the details of a single news item
class News(object):
    title = ''
    title_pic_url = ''
    content = ''
    public_time = ''
    author = ''
    acc_count = 0


class SpiderMain(object):
    # Load the crawler settings from config.ini
    def __init__(self):
        self.conf = configparser.ConfigParser()
        self.conf.read('config.ini')
        self.scrapyUrl = self.conf.get('info', 'scrapyUrl')  # RSS feed to crawl
        self.imageUrl = self.conf.get('info', 'imageUrl')  # local directory for downloaded images
        self.postUrl = self.conf.get('info', 'postUrl')  # API endpoint that stores the results
        self.patternUrl = self.conf.get('info', 'patternUrl')  # regex matching a single article URL
        self.patternImage = self.conf.get('info', 'patternImage')  # regex matching in-article image URLs
        self.loadImageUrl = self.conf.get('info', 'loadImageUrl')  # base URL of the images on the server

    # POST the scraped data to the server
    def request(self, data):
        url = self.postUrl
        print(data)
        jdata = json.dumps(data, ensure_ascii=False).encode('utf-8')  # serialize to UTF-8 encoded JSON
        print(jdata)
        req = request.Request(url, jdata)  # supplying a body makes this a POST request
        response = request.urlopen(req)  # send the request
        print(response.read())  # print the server's response

    # Fetch the raw content of a page
    def getHtmlContent(self, url):
        socket.setdefaulttimeout(20)  # timeout in seconds
        req = request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36")
        response = request.urlopen(req)
        content = response.read()
        return content

    # Extract each article URL from the feed, scrape it, and upload the results
    def getItemLink(self, content):
        soup = BeautifulSoup(content, "html.parser")
        contents = soup.find_all('item')  # one <item> per news entry
        pattern = re.compile(str(self.patternUrl))  # matches article links
        listNews = []
        for item in contents:
            try:
                news = News()
                soup = BeautifulSoup(str(item), "html.parser")
                title = soup.find('title')
                links = re.findall(pattern, str(item))
                pushTime, body, picUrl, author = self.getMainContent(links[0])
                news.title = title.get_text().replace('\n', '')
                news.public_time = pushTime
                news.content = body
                news.author = author
                news.title_pic_url = picUrl
                listNews.append(news.__dict__)  # convert the object to a dict
            except Exception as e:
                print(e)  # skip articles that fail to parse
        print(listNews)
        print(len(listNews))
        if len(listNews) > 0:
            self.request(listNews)

    # Scrape one article: publish time, body HTML, title image path, author
    def getMainContent(self, url):
        html = self.getHtmlContent(url)
        soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
        # A missing field raises here; getItemLink catches it and skips the article
        pushTime = soup.find('span', class_='date').get_text()
        print('Published at:', pushTime)
        content = str(soup.find("div", class_="article"))
        author = soup.find("a", class_="source").get_text()
        images = soup.find_all("img", {"src": re.compile(str(self.patternImage))})  # in-article images
        titlePath = ''
        try:
            for i, image in enumerate(images):
                # Download each image and point the article at the server-side copy
                oldName = image['src']
                print(oldName)
                imageDir = self.imageUrl
                if not os.path.isdir(imageDir):
                    os.mkdir(imageDir)
                millis = int(round(time.time() * 1000))
                path = imageDir + str(millis) + '.jpg'
                newUrl = self.loadImageUrl + path
                content = content.replace(oldName, newUrl)
                request.urlretrieve(oldName, path)  # save the image locally
                if i == 0:
                    titlePath = path  # the first image doubles as the title image
                    print('Title image:', titlePath)
        except Exception as e:
            print(e)  # keep the article even if an image fails to download
        print(pushTime, titlePath, author)
        return pushTime, content, titlePath, author


if __name__ == "__main__":
    obj = SpiderMain()
    content = obj.getHtmlContent(obj.scrapyUrl)
    obj.getItemLink(content)
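One detail of the upload step is worth a closer look. The spider posts a JSON body but sets no Content-Type, so urllib falls back to `application/x-www-form-urlencoded`. The sketch below builds (without sending) an equivalent request with an explicit JSON header. Whether the server behind `postUrl` needs this header is something I have not verified, and both the `build_post` helper and the sample record are made up for illustration.

```python
import json
from urllib import request

# Hypothetical record, shaped like the dicts the spider builds from News objects.
records = [{
    'title': '示例标题',
    'public_time': '2018-06-01 08:00',
    'content': '<div class="article">...</div>',
    'author': 'Example Source',
    'title_pic_url': 'images/1527811200000.jpg',
}]


def build_post(url, data):
    """Build (but do not send) a POST request carrying data as UTF-8 JSON."""
    body = json.dumps(data, ensure_ascii=False).encode('utf-8')
    return request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json; charset=utf-8'},
    )


req = build_post('http://bigdata.ossou.cn/api/pythonSave', records)
print(req.get_method())           # a request with a body defaults to POST
print(req.data.decode('utf-8'))   # ensure_ascii=False keeps Chinese text readable
```

Passing the result to `request.urlopen(req)` would send it, just as the spider's `request` method does.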

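That completes a simple Sina News scraper. The feed updates in real time, so if you are interested you can add a timer to run it periodically.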
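For anyone who wants to try the timer idea, here is one minimal way it could look. I am assuming the `time = 7200` entry in config.ini is meant as the interval in seconds; the code above never reads it, so that is a guess. The `run_periodically` helper, with its `max_runs` and `sleep` parameters, is hypothetical and exists mainly so the loop can be tried without actually waiting two hours.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between runs.

    max_runs and sleep are only there to make the sketch easy to try out;
    leave them at their defaults to loop forever with real sleeps.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as e:
            print(e)  # keep the loop alive if one crawl fails
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


# Dry run: three "crawls" with a fake sleep that only records the pauses.
pauses = []
count = run_periodically(lambda: print('crawling...'), 7200,
                         max_runs=3, sleep=pauses.append)
print(count, pauses)
```

In sinaScrapy.py the job would be something like `lambda: obj.getItemLink(obj.getHtmlContent(obj.scrapyUrl))`, with the interval read from the config via `conf.getint('info', 'time')`.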
