Crawling approach

Step 1: use requests to fetch the HTML source of the news index page.

import requests

def get_page(url):   # fetch the page source
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        print("Fail to get page")

# headers and offset are defined elsewhere in the script (not shown here)
url = "http://news.fzu.edu.cn/html/fdyw/" + str(offset) + ".html"
html = get_page(url)
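The `offset` in the URL above suggests that the index is split across numbered pages. The post does not show the loop over those pages, so what follows is only a minimal sketch of how it might look. The names `list_page_url`, `crawl_index_pages`, and `max_pages` are hypothetical, as is the assumption that numbering starts at 1. The fetch and page-handling functions are passed in as arguments so the loop can be tried without touching the network.

```python
def list_page_url(offset):
    """Build the URL of one index page, the same way the snippet above does."""
    return "http://news.fzu.edu.cn/html/fdyw/" + str(offset) + ".html"


def crawl_index_pages(fetch, handle_page, max_pages=50):
    """Visit index pages in order until handle_page asks to stop.

    fetch(url) returns the HTML text, or None when the request failed.
    handle_page(html) returns True when crawling should stop.
    Starting at offset 1 and the max_pages cap are assumptions.
    """
    visited = []
    for offset in range(1, max_pages + 1):
        url = list_page_url(offset)
        html = fetch(url)
        if html is None:       # failed request: skip this page
            continue
        visited.append(url)
        if handle_page(html):  # e.g. an article outside the date range was seen
            break
    return visited


# Offline demonstration with fake pages instead of real requests.
fake_pages = {
    list_page_url(1): "page 1",
    list_page_url(2): "page 2 (last)",
    list_page_url(3): "page 3",
}
visited = crawl_index_pages(fake_pages.get, lambda html: "last" in html, max_pages=3)
```

In the real script, `get_page` would be passed as `fetch`. Because it returns `None` when the status code is not 200, the `None` check keeps a failed request from reaching the parser.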

Step 2: get the URL of each article, extracting the date and title first.

from pyquery import PyQuery as pq

def get_articles(html, new_list):
    doc = pq(html)
    articles = doc('.list_main_content li')

get_articles(html, new_list)
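To make the extraction in this step concrete, here is a rough sketch of pulling the date, title, and link out of each list item. It deliberately swaps pyquery for the standard library's `html.parser` so that it runs without extra packages. The sample markup is invented for illustration: the real index page may place the date in a different element.

```python
from html.parser import HTMLParser


class ListParser(HTMLParser):
    """Collect one dict per <li>: link href, link text, and the text outside the link."""

    def __init__(self):
        super().__init__()
        self.items = []     # finished entries
        self._cur = None    # entry being built while inside an <li>
        self._in_a = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._cur = {"date": "", "title": "", "href": ""}
        elif tag == "a" and self._cur is not None:
            self._in_a = True
            self._cur["href"] = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag == "li" and self._cur is not None:
            self.items.append(self._cur)
            self._cur = None

    def handle_data(self, data):
        if self._cur is None:
            return
        # Text inside the link is the title; other text in the item is the date.
        key = "title" if self._in_a else "date"
        self._cur[key] += data.strip()


# Hypothetical markup, loosely modelled on the '.list_main_content li' selector.
SAMPLE = """
<ul class="list_main_content">
  <li><span>2020-12-30</span><a href="/html/fdyw/2020/12/30/a.html">First headline</a></li>
  <li><span>2019-12-31</span><a href="/html/fdyw/2019/12/31/b.html">Older headline</a></li>
</ul>
"""

parser = ListParser()
parser.feed(SAMPLE)
parser.close()
entries = parser.items
```

With pyquery, the same fields come from `article('a').text()` and `article('a').attr('href')`, as the next step shows.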

Step 3: restrict the crawl by date, and send a GET request to the URL of each news item.

# this fragment runs inside the loop over articles in get_articles
if new["date"][:4] == "2020":   # only crawl articles from 2020
    new["title"] = article('a').text()  # title
    url = 'http://news.fzu.edu.cn' + article('a').attr('href')
    html_new = get_page(url)
    get_other_data(html_new, new)
    new_list.append(new)
elif new["date"][:4] == "2021":
    continue
else:
    global flag
    flag = 1
    return
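The branching above only works if the index lists articles newest first: 2021 items are skipped, 2020 items are kept, and the first older item ends the crawl. One possible way to express that decision without a global flag is a small helper that returns the action to take. The name `classify_by_year` and the `target_year` parameter are hypothetical. Note that this sketch skips every year later than the target, which is slightly broader than the original check for exactly "2021".

```python
def classify_by_year(date_str, target_year="2020"):
    """Decide what to do with an article dated date_str (format 'YYYY-...').

    Returns 'keep' for the target year, 'skip' for newer articles, and
    'stop' for older ones. Assumes the index is sorted newest first.
    Four-digit year strings compare correctly as plain strings.
    """
    year = date_str[:4]
    if year == target_year:
        return "keep"
    if year > target_year:
        return "skip"
    return "stop"


def select_articles(dates, target_year="2020"):
    """Walk dates in page order and collect the ones to crawl."""
    kept = []
    for date_str in dates:
        action = classify_by_year(date_str, target_year)
        if action == "stop":
            break
        if action == "keep":
            kept.append(date_str)
    return kept


selected = select_articles(["2021-01-03", "2020-12-31", "2020-06-01", "2019-12-30", "2020-01-01"])
```

The last date in the demonstration list is never reached, because the 2019 entry ends the walk, just as `return` does in the original fragment.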

Step 4: extract the remaining fields (author, body text, and view count) from the source of each news page.

def get_other_data(html, new):
    doc = pq(html)
    data = doc('.detail_main_content')
    author = data('#author').text()  # author
    new["author"] = author
    page_views_str = data('script').text()  # view count
    a1 = page_views_str.find("url")
    a2 = page_views_str.find("timeout")
    page_views_url = page_views_str[a1 + 5:a2 - 2]
    page_views_url = "http://news.fzu.edu.cn" + page_views_url
    page_views = requests.post(page_views_url).text
    new["page_views"] = page_views
    content = ""    # body text
    paragraphs = doc('#news_content_display')
    for p in paragraphs('p').items():
        content += p.text() + "\n"
    new["content"] = content
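The slice `page_views_str[a1 + 5:a2 - 2]` depends on the exact spacing of the inline script: it expects `url:"` immediately before the path and `",` immediately before `timeout`. The sketch below walks through that slice on a made-up script string and compares it with a regular expression that tolerates whitespace. The sample script and its path are hypothetical, and the regex is an alternative offered here, not what the original code does.

```python
import re

# Invented script text; the real page's script may differ.
SAMPLE_SCRIPT = '$.ajax({type:"post",url:"/example/read_count?id=42",timeout:3000})'


def url_by_slicing(script_text):
    """Same arithmetic as the original: skip 'url:"' and drop '",' before 'timeout'."""
    a1 = script_text.find("url")
    a2 = script_text.find("timeout")
    return script_text[a1 + 5:a2 - 2]


def url_by_regex(script_text):
    """Match url: followed by a quoted value, allowing optional whitespace."""
    match = re.search(r'url\s*:\s*["\']([^"\']+)["\']', script_text)
    return match.group(1) if match else None


sliced = url_by_slicing(SAMPLE_SCRIPT)
matched = url_by_regex(SAMPLE_SCRIPT)

# With extra spaces, the fixed offsets leave the quotes in place.
spaced = 'url: "/x", timeout: 1'
sliced_spaced = url_by_slicing(spaced)
matched_spaced = url_by_regex(spaced)
```

If the site ever reformats its script, the regex version returns `None` when nothing matches, which is easier to detect than a silently wrong slice.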

Step 5: store the results in the database.

db = pymysql.connect(host='localhost', user='root', password='beli3579', port=3306, db='fzu_new')
cursor = db.cursor()
cursor.execute("DROP TABLE IF EXISTS news")
sql = '''create table news(date varchar(20),title varchar(70),author varchar(50),page_views varchar(20),content varchar(3000))'''
cursor.execute(sql)
for new in new_list:
    sql = 'insert into news(date,title,author,page_views,content) values(%s,%s,%s,%s,%s)'
    try:
        if cursor.execute(sql, tuple(new.values())):
            print('Success to the database')
            db.commit()
    except:
        print('Fail to the database')
        db.rollback()
db.close()
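One fragile point in the insert above is `tuple(new.values())`, which relies on each dict's keys having been added in the same order as the table columns. In the crawl code, `title` is set before `author`, but where `date` is set is not shown. A sketch of building the row from an explicit column list follows. It uses the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL so it runs anywhere, and the helper name `save_news` is hypothetical. SQLite uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

COLUMNS = ("date", "title", "author", "page_views", "content")


def save_news(conn, news_items):
    """Insert news dicts using an explicit column order; returns the row count."""
    sql = ("INSERT INTO news (date, title, author, page_views, content) "
           "VALUES (?, ?, ?, ?, ?)")
    # Look up each column by name so dict insertion order does not matter.
    rows = [tuple(item.get(col, "") for col in COLUMNS) for item in news_items]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(sql, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news (date TEXT, title TEXT, author TEXT, page_views TEXT, content TEXT)"
)

# Keys deliberately out of column order.
sample = [{"title": "T", "date": "2020-01-02", "content": "body",
           "author": "A", "page_views": "10"}]
saved = save_news(conn, sample)
stored = conn.execute(
    "SELECT date, title, author, page_views, content FROM news"
).fetchall()
```

The same idea carries over to pymysql by keeping the `%s` placeholders and passing the explicitly ordered tuple to `cursor.execute`.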

Problems encountered

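In Chrome's Inspect tool the view count of a news item is displayed, but it could not be scraped from the page source.
It turned out that the count is loaded by an Ajax request, so it is not part of the initial HTML and has to be fetched with a separate request, as in step 4.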
