Scraping the details of news items on the campus news front page, using regular expressions and extracted functions

import re

import requests
from bs4 import BeautifulSoup

url = "http://news.gzcc.cn/html/xiaoyuanxinwen/"


def getClickCount(url):
    # News ID: the part of the article URL after the last '/' and before '.html'
    newsId = re.search(r'_(.*)\.html', url).group(1).split('/')[-1]
    # The original hard-coded id=9183 here, so every article reported the same count;
    # the ID extracted from the URL is used instead
    HitUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)
    # Keep what follows the last '.html' in the response and strip the ("...") wrapper
    hitNumber = requests.get(HitUrl).text.split('.html')[-1].lstrip("('").rstrip("');")
    print("Click count:", hitNumber)
    print("News ID:", newsId)


def getNewDetail(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    # Build the parse tree with BeautifulSoup's HTML parser
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            # Front-page article title
            title = news.select('.news-list-title')[0].text
            # Front-page article description
            description = news.select('.news-list-description')[0].text
            # Front-page article info (not printed below)
            info = news.select('.news-list-info')[0].text
            # Front-page article link
            href = news.select('a')[0]['href']

            # Fetch and parse the article's own page
            detailRes = requests.get(href)
            detailRes.encoding = 'utf-8'
            detailSoup = BeautifulSoup(detailRes.text, 'html.parser')
            # Info line of the article
            newinfo = detailSoup.select('.show-info')[0].text
            # Article body
            content = detailSoup.select('#content')[0].text

            # The info line is split on whitespace; fields are taken by position
            fields = newinfo.split()
            date = fields[0]         # date
            time = fields[1]         # time of day
            author = fields[2]       # author
            checker = fields[3]      # reviewer
            source = fields[4]       # source
            Photography = fields[5]  # photographer

            print('------------------------------------------------------------------------------')
            print("Title: " + title)
            print("\nDescription: " + description)
            print("\nInfo:\n" + date + ' ' + time + '\n' + author + '\n' + checker + '\n' + source + '\n' + Photography)
            getClickCount(href)  # click count and news ID
            print("\nLink: " + href)
            print(content)
            print('------------------------------------------------------------------------------')


getNewDetail(url)
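The two regular-expression steps in the script above (pulling the news ID out of an article URL and pulling the click count out of the counter response) can be tried offline, without any network request. The following is a minimal sketch, not the author's code: the helper names `parse_news_id` and `parse_click_count`, the sample article URL, and the sample response string are all assumptions made for illustration. The only thing taken from the script is that the count follows the last `.html` in the response.

```python
import re


def parse_news_id(url):
    """Return the digits just before '.html' at the end of the URL, or None."""
    match = re.search(r'/(\d+)\.html$', url)
    return match.group(1) if match else None


def parse_click_count(response_text):
    """Return the number inside the last .html('N') call in the text, or None."""
    counts = re.findall(r"\.html\('(\d+)'\)", response_text)
    return int(counts[-1]) if counts else None


if __name__ == '__main__':
    # Hypothetical article URL, modelled on the pattern used in the script
    sample_url = 'http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html'
    # Hypothetical counter response; the real API output may differ
    sample_response = "$('#todaydowns').html('5');$('#hits').html('123');"

    print('News ID:', parse_news_id(sample_url))
    print('Click count:', parse_click_count(sample_response))
```

Returning `None` instead of calling `.group(1)` on a failed match keeps a non-article link, such as the front-page URL itself, from raising `AttributeError`.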

Reposted from: https://www.cnblogs.com/FZW1874402927/p/8747466.html
