Scraping Book Information from the SJTU Library with Python

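When looking up books in the library catalogue, the search results have to be paged through one page at a time, and each page shows very little. I wrote this crawler to make searching easier.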

# -*- coding: utf-8 -*-
# @author: 、Edgar
# @version: 1.1
import threading
import time

import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch the page source; return None if the request fails."""
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/76.0.3809.100 Safari/537.36"}
    try:
        response = requests.get(url, headers=header)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
    except requests.RequestException as e:
        # Base class of the HTTP, connection and timeout errors raised by requests.
        print(e)
        return None
    return response.text


def is_last_page(soup):
    """Return False if this is the last page, otherwise the URL of the next page."""
    target = soup.find('a', {"title": "Next"})
    if target is None:
        return False
    return target["href"]


def spider(soup, file):
    """Scrape a search results page (title, availability and so on) and write it to file."""
    tr_list = soup.find("table", {"cellspacing": "1"}).find_all("tr", {"valign": "baseline"})
    for tr in tr_list:
        td_list = tr.find_all("td")
        num = "No.: " + td_list[0].get_text().replace(" ", '').strip()
        call_num = "Call number: " + td_list[2].get_text().replace(" ", '').strip()
        name = "Title: " + td_list[3].get_text().replace("\n", '').strip()
        author = "Author: " + td_list[4].get_text().replace("\n", '').strip()
        year = "Year: " + td_list[5].get_text().replace(" ", '').strip()
        info = "Library (total/on loan): " + td_list[6].get_text().replace(" ", '').strip()
        # The holdings cell may link to a page with per-copy details.
        link = td_list[6].a
        info_link = link.get("href") if link is not None else None
        sort = "Type: " + td_list[7].get_text().replace(" ", '').strip()
        data = "\n".join([num, call_num, name, author, year, info, sort]) + "\n"
        if info_link is None:
            more_data = "No detailed information\n"
        else:
            more_data = spider_more(info_link)
        file.write(data + more_data)
        file.write("-" * 58 + "\n")


def spider_more(url):
    """Fetch the per-copy details of a book."""
    html = get_html(url)
    if html is None:
        return "No detailed information\n"
    soup = BeautifulSoup(html, "lxml")
    # The second table with cellspacing="2" lists the copies; skip its header row.
    tr_list = soup.find_all("table", {"cellspacing": "2"})[1].find_all("tr")[1:]
    total_data = ''
    for num, tr in enumerate(tr_list, start=1):
        td = tr.find_all("td")
        status = "Item status: " + td[2].get_text()
        return_time = "Due date: " + td[3].get_text()
        location = "Location: " + td[5].get_text()
        bar_code = "Barcode: " + td[8].get_text()
        total_data += ("Details of copy {}:\n".format(num) + status + "\n" + return_time + "\n"
                       + location + "\n" + bar_code + "\n\n")
    return "\n" + total_data


def main(url, file):
    """Scrape the first page, then follow the "Next" links until the last page."""
    while url:
        html = get_html(url)
        if html is None:
            break
        soup = BeautifulSoup(html, 'lxml')
        spider(soup, file)
        url = is_last_page(soup)
        if url:
            time.sleep(6)  # pause before requesting the next results page


class promote(threading.Thread):
    """Print a dot every two seconds to show that the download is still running."""

    def run(self):
        print("Downloading data: ", end="", flush=True)
        while True:
            print(".", flush=True, end="")
            time.sleep(2)


if __name__ == "__main__":
    url = input("Paste the URL of your SJTU library search results page: ")
    pro = promote(daemon=True)
    pro.start()
    start_time = time.time()
    with open("lib_data.txt", "a", encoding="utf-8") as file:
        main(url, file)
    end_time = time.time()
    print("\nTotal time: {} s".format(end_time - start_time))
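The pagination check in is_last_page can be tried without touching the library site. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's html.parser so it runs with no third-party packages, the names NextLinkParser and find_next_page are hypothetical, the sample markup is made up, and resolving the link with urljoin is an added assumption in case the site returns a relative href.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a title="Next"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.next_href is None:
            attr_map = dict(attrs)
            if attr_map.get("title") == "Next":
                self.next_href = attr_map.get("href")


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Assumption: the href may be relative, so resolve it against the current page.
    return urljoin(base_url, parser.next_href)


# Made-up sample markup, not copied from the library site.
page_with_next = ('<p><a title="Previous" href="/p1">prev</a> '
                  '<a title="Next" href="/p3">next</a></p>')
last_page = '<p><a title="Previous" href="/p2">prev</a></p>'

print(find_next_page(page_with_next, "http://example.com/p2"))
print(find_next_page(last_page, "http://example.com/p3"))
```

Returning None instead of False for the last page keeps the return value either a URL or nothing, which makes a `while url:` loop straightforward.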

Running the program writes the results to a text file (lib_data.txt), which can be opened and read directly.
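Because the script writes a line of 58 hyphens after every book, the output file can also be split back into one record per book and searched. The following is a minimal sketch under that assumption; split_records, search_records and load_records are hypothetical helper names, not part of the original script.

```python
SEPARATOR = "-" * 58  # the script writes this line after every book


def split_records(text):
    """Split the report text into one string per book."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == SEPARATOR:
            record = "\n".join(current).strip()
            if record:
                records.append(record)
            current = []
        else:
            current.append(line)
    # Lines after the final separator (an unfinished record) are ignored.
    return records


def search_records(records, keyword):
    """Return the records that contain keyword, ignoring case."""
    keyword = keyword.lower()
    return [record for record in records if keyword in record.lower()]


def load_records(path="lib_data.txt"):
    """Read the file produced by the crawler and split it into records."""
    with open(path, encoding="utf-8") as f:
        return split_records(f.read())
```

For example, `search_records(load_records(), "python")` would return every book record whose text mentions "python".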

Appendix:
SJTU Library website: http://www.lib.sjtu.edu.cn/f/main/index.shtml

Note (September 22, 2019): searching from the library's home page now returns results in a different layout than before. Searching at
http://opac.lib.sjtu.edu.cn instead avoids the problem.
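Since the crawler expects a results page from the OPAC site named in the note above, the pasted link could be checked before any request is sent. This is only a sketch: is_opac_url is a hypothetical helper, and it assumes that the host name alone is enough to tell the two page layouts apart.

```python
from urllib.parse import urlparse

OPAC_HOST = "opac.lib.sjtu.edu.cn"  # host named in the note above


def is_opac_url(url):
    """Return True if url is an http(s) link on the OPAC host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and parts.hostname == OPAC_HOST


print(is_opac_url("http://opac.lib.sjtu.edu.cn"))
print(is_opac_url("http://www.lib.sjtu.edu.cn/f/main/index.shtml"))
```

A caller could ask for the link again when the check fails rather than start a crawl that would not find the expected tables.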
