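Scraping Data from the North China Electric Power University Library's Reader Purchase-Recommendation System

  • Preface
  • Tools
  • Page analysis
    • 1. Recommendation data
    • 2. Detailed book information
  • Code
    • 1. Scraping the recommendation data
    • 2. Scraping the full book information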

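Preface

This semester's term project for the Data Warehousing and Data Mining course is to implement a data mining method in code. I had noticed the university library's purchase-recommendation system before and found its data quite interesting, so I took this opportunity to try working with it.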

Tools

  • Python 3.7
  • PyCharm
  • Google Chrome

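Page analysis

On the campus network the site is usually accessed by IP address. Off campus, the university provides a VPN service, and the WebVPN supports automatic login with no session time limit. Before scraping, my guess was that putting the cookie that holds my campus VPN login information into the request headers would be enough to simulate a logged-in session, and that turned out to be the case.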
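The cookie idea can be sketched as follows. This is only an illustration under assumptions: the cookie value is a placeholder to be replaced with the string copied from your own browser session, and `build_request` is a hypothetical helper name. The request is built but not sent, so the snippet runs without network access.

```python
import urllib.request as ur

# Placeholder (assumption): replace with the cookie string copied from the
# browser's developer tools after logging in to the campus WebVPN.
COOKIE = "webvpn_username=<your value>; PHPSESSID=<your value>"


def build_request(url, cookie=COOKIE):
    """Return a Request object that carries the logged-in session cookie."""
    return ur.Request(
        url=url,
        headers={"User-Agent": "Mozilla/5.0", "Cookie": cookie},
    )


req = build_request(
    "https://202-204-70-2-8080.webvpn.ncepu.edu.cn/asord/asord_hist.php?page=1"
)
# The server sees the same Cookie header a logged-in browser would send.
print(req.get_header("Cookie"))
```

Passing the request to `urllib.request.urlopen()` would then fetch the page as the logged-in user, for as long as the session behind the cookie remains valid.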
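The data come mainly from the following two pages:

  • https://202-204-70-2-8080.webvpn.ncepu.edu.cn/asord/asord_hist.php
  • https://202-204-70-2-8080.webvpn.ncepu.edu.cn/opac/search_adv.php#/index

    The recommendation page publishes quite detailed data, but not enough for our purposes. It does not show, for example, whether the library actually managed to order the book, when the book was shelved, or the recommended title's classification, price, and abstract, all of which matter for the later analysis. We therefore use the second page to run a catalogue search on each title scraped from the first page and obtain the additional information we need.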

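1. Recommendation data

Inspecting the page with the browser's developer tools gives us the URL, the cookie, and the XPath expressions used to locate the data.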

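2. Detailed book information

A few points to note in this part:

  1. The catalogue search page uses POST, so the query is submitted as form data, and the form data must be serialized to JSON.
  2. If the book is not found, the "total" key in the returned response is 0. If it is found, we take the first book in relevance order, extract the MARC number that uniquely identifies it, and follow the link. The second-level page then gives all the detailed information on the book.
  3. The shelving time cannot actually be looked up here. However, the 005 field of the MARC record holds a timestamp; in the MARC format this field records the date and time the record was last processed. Taken together with the recommendation date and the publication date, it is reasonable to infer that it approximates when the book was shelved in the NCEPU library.
  4. On the second-level page, the format and layout of the information differ from book to book. I never found a good scraping strategy or storage format, and in the end decided to put all the information into a list and export it as a local .npy file.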
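To compare the 005 timestamp with the recommendation date, it helps to turn it into a date object. The sketch below assumes the value follows the usual MARC layout `YYYYMMDDHHMMSS.F`; the sample value is made up for illustration and `parse_marc_005` is a hypothetical helper name.

```python
from datetime import datetime


def parse_marc_005(value):
    """Parse a MARC 005 value such as '20190312154530.0' into a datetime.

    Only the first 14 characters (date and time to the second) are used;
    the fractional part after the dot is ignored.
    """
    return datetime.strptime(value.strip()[:14], "%Y%m%d%H%M%S")


# Made-up sample value, not taken from the library's data.
stamp = parse_marc_005("20190312154530.0")
print(stamp.date())
```

Subtracting the recommendation date from this value would give a rough processing delay for each recommended book, under the same inference about what the field means.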

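Code

1. Scraping the recommendation data

Libraries used:

  • urllib, the HTTP request library built into Python 3
  • lxml, a parsing library for Python, used here to locate elements with XPath
  • xlwt, used to write the scraped data to an Excel file

Code that fetches a page and returns the parsed result: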

import urllib.request as ur

import lxml.etree as le

import user_agent  # local module, listed below

# Cookie string copied from the browser after logging in to the campus WebVPN.
# It carries the login information; the author's own session cookie is omitted here.
COOKIE = '<your WebVPN cookie string>'

# page is the page number used for pagination
page = 1


def get_response(page):
    request = ur.Request(
        url='https://202-204-70-2-8080.webvpn.ncepu.edu.cn/asord/asord_hist.php?page=' + str(page),
        # The request headers carry a User-Agent and a cookie. The User-Agent
        # changes randomly to disguise the header; the cookie holds the login.
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': COOKIE,
        },
    )
    response = ur.urlopen(request).read().decode('utf-8')
    # print(type(response))
    lxml_x = le.HTML(response)
    return lxml_x

The user_agent.get_user_agent_pc() function used above returns a random User-Agent string to disguise the request header:

import random

# Desktop User-Agent strings
user_agent_pc = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    # Firefox
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    # Opera
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    # QQ Browser
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)',
    # Sogou Browser
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    # 360 Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    # UC Browser
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
]

# Mobile User-Agent strings
user_agent_phone = [
    # iPhone
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # iPad
    'Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    # Android
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # QQ Browser for Android
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    # Opera Mobile for Android
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    # Android tablet (Moto Xoom)
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
]


def get_user_agent_pc():
    return random.choice(user_agent_pc)


def get_user_agent_phone():
    return random.choice(user_agent_phone)

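get_response() now gives us the parsed page. Next we pick out the fields we care about: title, author, recommendation date, the library's purchase status, and the library's feedback: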

n = 1  # next row to write in the sheet (row 0 holds the header)
try:
    for page in range(1, 33):  # 32 pages
        response_ = get_response(page)
        for i in range(2, 22):  # 20 records per page
            # Locate each field with XPath
            title = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][2]/text()')[0]
            author = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][3]/text()')[0]
            date = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][5]/text()')[0]
            status = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][6]/text()')[0]
            note = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][7]/text()')
            # note and status can come back empty, so substitute values here.
            # xpath() returns a list, so take its first item before writing it.
            note = note[0] if note else '无'  # '无' means "none"
            if len(status) == 2:
                status = response_.xpath('//*[@id="container"]/table/tr[' + str(i) + ']/td[@class="whitetext"][6]/font/text()')[0]
            # Write to the Excel sheet; mysheet is defined below
            mysheet.write(n, 0, title)
            mysheet.write(n, 1, author)
            mysheet.write(n, 2, date)
            mysheet.write(n, 3, status)
            mysheet.write(n, 4, note)
            n += 1
            # print(title, author, date, status)
            print(n)
except IndexError:
    # The last page holds fewer than 20 records, so the [0] lookup fails there
    pass
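The XPath expressions in the loop differ only in the row index, the column index, and an optional child element, so they could be generated by a small helper. This is a sketch only; `cell_xpath` is a hypothetical name that does not appear in the original code.

```python
def cell_xpath(row, col, child=""):
    """Build the XPath for the text of one cell of the recommendation table.

    row   -- 1-based <tr> index (data rows start at 2, after the header row)
    col   -- 1-based index among the <td class="whitetext"> cells of that row
    child -- optional child element, e.g. "font" for the status fallback
    """
    path = '//*[@id="container"]/table/tr[%d]/td[@class="whitetext"][%d]' % (row, col)
    if child:
        path += "/" + child
    return path + "/text()"


print(cell_xpath(2, 2))                # title of the first record
print(cell_xpath(2, 6, child="font"))  # status wrapped in a <font> element
```

With such a helper, a lookup in the loop would read `response_.xpath(cell_xpath(i, 2))[0]`, which keeps the column numbers visible in one place.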

Before the block above runs, mysheet must be defined. It is created by the add_sheet() method of xlwt.Workbook and is the worksheet the data are written to.

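import xlwt

workbook = xlwt.Workbook(encoding='utf-8')
# Sheet name: "recommendation data"
mysheet = workbook.add_sheet('荐购数据', cell_overwrite_ok=True)
# Column headers: title, author, recommendation date, status, note
header = ['书名', '作者', '荐购日期', '状态', '备注']
for i in range(0, 5):
    mysheet.write(0, i, header[i])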

Finally, once this page has been scraped, the data are saved to an Excel file.

workbook.save('荐购表.xls')
print('Exported to the Excel file!')

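Let's run it:
It took 19 seconds in total to scrape all 632 records from the site. The project root now contains a new file, 荐购表.xls, and opening it shows the scraped records.
With the 632 recommendation records stored locally, we move on to the next scraping step.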

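2. Scraping the full book information

Libraries used:

  • requests, an HTTP library built on urllib3
  • lxml, see above
  • xlrd, used to read data from the Excel file
  • numpy (Numerical Python), used here to save a list to disk and load it back

Key code: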

  • Build the search strings and search for each book

import json

import requests
import xlrd

import user_agent  # local module from part 1

# Cookie string copied from the browser after logging in to the campus WebVPN;
# the author's own session cookie is omitted here.
COOKIE = '<your WebVPN cookie string>'

# Assumption: table is the first sheet of the file saved in part 1
table = xlrd.open_workbook('荐购表.xls').sheet_by_index(0)
search_word = []
marc_s = []

# Search in the form "title + author"
for i in range(1, 629):
    title = table.cell(i, 0).value
    author = table.cell(i, 1).value
    search_word.append(title + ' ' + author)

url = 'https://202-204-70-2-8080.webvpn.ncepu.edu.cn/opac/ajax_search_adv.php'
for j in search_word:
    form_data = {
        "searchWords": [{"fieldList": [{"fieldCode": "", "fieldValue": j}]}],
        "filters": [], "limiter": [],
        "sortField": "relevance", "sortType": "desc",
        "pageSize": 20, "pageCount": 1,
        "locale": "zh_CN", "first": True,
    }
    headers = {
        'User-Agent': user_agent.get_user_agent_pc(),
        'Cookie': COOKIE,
        'Content-Type': 'application/json',
    }
    # The form is submitted as JSON, so convert it with json.dumps()
    response = requests.post(url=url, headers=headers, data=json.dumps(form_data))
    r = response.text
    try:
        # Take the text between '"marcRecNo":"' (13 characters) and '","num":'
        marcRecNo = r[r.index('"marcRecNo":"') + 13:r.index('","num":')]
    except ValueError:
        # str.index() raises ValueError when the book was not found
        marcRecNo = ''
    # The MARC number uniquely identifies each book
    marc_s.append(marcRecNo)
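Since the search endpoint returns JSON, the MARC number could also be read from the parsed response rather than by slicing the raw text. The sketch below assumes only what the text states: a "total" key that is 0 when nothing is found, and a "marcRecNo" key somewhere in the result. The nesting of the real response is not documented here, so the helper searches for the key recursively; `first_marc_rec_no` and the "content" key in the sample are hypothetical.

```python
import json


def first_marc_rec_no(response_text):
    """Return the first marcRecNo in a search response, or '' if none."""
    data = json.loads(response_text)
    if not isinstance(data, dict) or data.get("total", 0) == 0:
        return ""
    # Depth-first search that visits children in their original order,
    # so the first hit corresponds to the top-ranked result.
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if "marcRecNo" in node:
                return str(node["marcRecNo"])
            stack.extend(reversed(list(node.values())))
        elif isinstance(node, list):
            stack.extend(reversed(node))
    return ""


# Made-up sample response; the real structure may differ.
sample = ('{"total": 2, "content": ['
          '{"marcRecNo": "abc123", "num": 1}, '
          '{"marcRecNo": "def456", "num": 2}]}')
print(first_marc_rec_no(sample))
```

Parsing the response this way avoids depending on the exact order of keys in the raw text, at the cost of one extra `json.loads()` call per search.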
  • Look up more information by MARC number

import urllib.request as ur

import lxml.etree as le
import numpy as np

import user_agent  # local module from part 1

# Cookie string copied from the browser after logging in to the campus WebVPN;
# the author's own session cookie is omitted here.
COOKIE = '<your WebVPN cookie string>'

for marc in marc_s:
    if marc == '':
        continue
    request = ur.Request(
        url='https://202-204-70-2-8080.webvpn.ncepu.edu.cn/opac/item.php?marc_no=' + marc,
        headers={
            'User-Agent': user_agent.get_user_agent_pc(),
            'Cookie': COOKIE,
        },
    )
    response = ur.urlopen(request).read()
    lxml_ = le.HTML(response)
    # Collect every text node under the detail list
    inf_s = lxml_.xpath('//*[@id="item_detail"]/dl/descendant::*/text()')
    del inf_s[-6:]
    print(inf_s)
    m = np.array(inf_s)
    # Save the book information locally. The 书籍信息 ("book information")
    # directory must already exist; np.save() does not create it.
    try:
        np.save('书籍信息/%s.npy' % inf_s[1], m)
    except Exception:
        # Skip books whose record cannot be saved
        pass
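The try/except around np.save() silently drops any book that cannot be saved. One plausible cause, which is an assumption and not something the source confirms, is a title containing characters that are not allowed in file names, such as "/" or ":". A hypothetical helper like `safe_filename` below could clean the title before it is used as a file name.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that are invalid in common file systems with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    return cleaned or fallback


print(safe_filename("C/C++ 程序设计: 入门"))
```

The save call would then become `np.save('书籍信息/%s.npy' % safe_filename(inf_s[1]), m)`.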
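  • Run the program


    The console output and the folder containing the code show that the book information has been saved locally. Done!
    Although a few simple lines of code appear to achieve the goal, problems kept cropping up along the way. My coding skills are limited, so I did all my debugging and test runs late at night, hoping not to put pressure on the university network.
    After scraping, I found the data thin and lacking in dimensions. My data analysis skills are limited too: I had planned to use the data for association rules and Bayesian-network-based prediction, but gave up after those attempts went nowhere. I'll treat this as practice and a record of my learning, so please take it with a smile.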
