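Contents

Basic Python crawler operations
- Python standard library: urllib
  - Fetching information
  - Uploading information
- Third-party library: urllib3
  - Fetching information
  - Uploading information
- Third-party library: requests
  - Fetching response attributes
  - Simulating browser access
    - Direct access rejected with 403
    - Adding headers
  - Error handling
  - Setting a proxy
- Using BeautifulSoup
  - Installation
  - Parsing an HTML string
  - Parsing a document file
  - Parsing a web page
Scraping 12306 train tickets
- Special note
- Configuring Qt in PyCharm
  - Downloading PyCharm
  - Installing and configuring Qt
  - Drawing the interface
- Code files
  - MianWindow.py
  - query_request.py
  - get_stations.py
- Packaging with PyInstaller
  - Running directly
  - Running the packaged program
- (Appendix) A simple scraping walkthrough
Git address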

Basic Python crawler operations

Python standard library: urllib

Fetching information

import urllib.request

# Open the URL and decode the response body as UTF-8
response = urllib.request.urlopen('http://www.baidu.com/')
print(response.read().decode('utf-8'))

<!DOCTYPE html><!--STATUS OK-->
... (output omitted) ...
<script>if(navigator.cookieEnabled){document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";}</script><script src="http://ss.bdimg.com/static/superman/js/components/hotsearch-8598bcf712.js"></script></body></html>

Uploading information

import urllib.parse
import urllib.request

# URL-encode the form fields and convert them to bytes for the POST body
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read().decode('utf-8')
print(html)

{"args": {}, "data": "", "files": {}, "form": {"word": "hello"}, "headers": {"Accept-Encoding": "identity", "Content-Length": "10", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Python-urllib/3.6", "X-Amzn-Trace-Id": "Root=1-5ec0a4a4-4da16208bf73bce8b0088a14"}, "json": null, "origin": "182.137.240.188", "url": "http://httpbin.org/post"
}
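To see where the "Content-Length": "10" in the response comes from, it can help to build the POST body on its own without sending anything. The snippet below is only a small illustration using the same form field as above.

```python
import urllib.parse

# Same form field as in the upload example
form = {'word': 'hello'}

# urlencode() produces the application/x-www-form-urlencoded text;
# urlopen() needs bytes, hence the encode()
body = urllib.parse.urlencode(form).encode('utf-8')

print(body)
print(len(body))
```

The body is ten bytes long, which matches the Content-Length header echoed back by httpbin.org.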

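Third-party library: urllib3

Despite its name, urllib3 is a third-party package rather than part of the standard library; install it with pip install urllib3.

Fetching information

import urllib3

# PoolManager handles connection pooling and thread safety
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.baidu.com')
print(response.data.decode())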

<!DOCTYPE html><!--STATUS OK-->
<html>
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>百度一下，你就知道</title><link href="http://s1.bdstatic.com/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
... (output omitted) ...
</body></html>

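Uploading information

import urllib3

# PoolManager handles connection pooling and thread safety
http = urllib3.PoolManager()
# Values passed through `fields` on a POST are sent as multipart/form-data by default
response = http.request('POST', 'http://httpbin.org/post', fields={'word': 'hello'})
print(response.data.decode())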

{"args": {}, "data": "", "files": {}, "form": {"word": "hello"}, "headers": {"Accept-Encoding": "identity", "Content-Length": "128", "Content-Type": "multipart/form-data; boundary=14c9abc7c138e04cfb65714db9055b19", "Host": "httpbin.org", "X-Amzn-Trace-Id": "Root=1-5ec0a5af-6146cbc0386719987392e77c"}, "json": null, "origin": "182.137.240.188", "url": "http://httpbin.org/post"
}
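Unlike the urllib example, the Content-Type here is multipart/form-data. As a rough sketch of what urllib3 does with the fields argument, the body can be encoded offline with urllib3's own helper. The boundary string is copied from the sample response above purely for illustration; urllib3 normally generates a random one.

```python
from urllib3.filepost import encode_multipart_formdata

# Boundary copied from the sample output above (assumption for illustration only)
boundary = '14c9abc7c138e04cfb65714db9055b19'

# Returns the encoded body (bytes) and the matching Content-Type header value
body, content_type = encode_multipart_formdata({'word': 'hello'}, boundary=boundary)

print(content_type)
print(body.decode('utf-8'))
```

If a plain application/x-www-form-urlencoded body is preferred, passing encode_multipart=False to http.request should have that effect.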

Third-party library: requests

Fetching response attributes

import requests

response = requests.get('http://www.baidu.com/')
# Set the encoding explicitly so that Chinese text is not garbled
response.encoding = 'utf-8'
print('Status code\n', response.status_code)
print('Request URL\n', response.url)
print('Headers\n', response.headers)
print('Cookie\n', response.cookies)
print('Text source\n', response.text)
print('Byte source\n', response.content)

Status code
 200
Request URL
 http://www.baidu.com/
Headers
 {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 17 May 2020 02:55:45 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Connection': 'close', 'Transfer-Encoding': 'chunked'}
Cookie
 <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
Text source
 <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html
......
<img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
Byte source
 b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> ......</div> </div> </body> </html>\r\n'
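Why the explicit response.encoding = 'utf-8' matters: the Content-Type header in the sample output is 'text/html' with no charset, and in that case requests, as far as I know, falls back to ISO-8859-1 when building response.text. The snippet below imitates that mismatch with plain byte strings; it is an offline illustration, not what requests runs internally.

```python
# Page title taken from the sample output earlier in this article
title = '百度一下，你就知道'

# The bytes a server would send for a UTF-8 page
raw = title.encode('utf-8')

# Decoding with the wrong codec never fails, it just yields mojibake
garbled = raw.decode('iso-8859-1')
print(repr(garbled))

# Decoding with the right codec restores the text
print(raw.decode('utf-8'))
```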

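Simulating browser access

Direct access rejected with 403

import requests

url = 'https://www.whatismyip.com/'
response = requests.get(url)
print(response.status_code)

Adding headers

Browsers send request headers with every request. Some sites have anti-scraping checks, so we set headers to make the request look like it comes from a browser. Taking Google Chrome as an example: press F12 to open the developer tools, select the Network tab at the top, and press F5 to refresh. Among the captured requests, select the entry for the page you need (www.whatismyip.com in this example), open its Headers panel, and scroll down to copy the user-agent value.

import requests

url = 'https://www.whatismyip.com/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content.decode('utf-8'))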

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8"></script>
</body>
</html>
<script>
jQuery(function () {jQuery('[data-toggle="tooltip"]').tooltip();
});
</script>
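The same idea works with the standard library. As a minimal sketch, a urllib.request.Request object can carry the user-agent header; building it sends nothing, so the header can be inspected before any request is made.

```python
import urllib.request

url = 'https://www.whatismyip.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

# Creating the Request object does not touch the network
request = urllib.request.Request(url, headers=headers)

# urllib stores header names capitalized, e.g. 'User-agent'
print(request.get_header('User-agent'))

# To actually send it:
# response = urllib.request.urlopen(request, timeout=10)
```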

Error handling

Set a very short timeout to force a network timeout error:

import requests

try:
    response = requests.get('http://www.baidu.com', timeout=0.1)
    print(response.status_code)
except Exception as error:
    print('Timeout:', str(error))

Timeout: HTTPConnectionPool(host='127.0.0.1', port=1080): Read timed out. (read timeout=0.1)

Handling several kinds of request errors:

import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

# First attempt: no headers, short timeout
try:
    response = requests.get('https://www.whatismyip.com/', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except HTTPError:
    print('HttpError')
except RequestException:
    print('RequestError')

# Second attempt: browser-like user-agent, slightly longer timeout
try:
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
    response = requests.get('https://www.whatismyip.com/', timeout=1, headers=headers)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except HTTPError:
    print('HttpError')
except RequestException:
    print('RequestError')

RequestError
Timeout
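The order of the except clauses above matters because the requests exceptions form a hierarchy: the specific classes must come before the RequestException base class. The sketch below mirrors that ordering in a small helper; classify is a hypothetical name introduced only for this illustration.

```python
from requests.exceptions import ConnectionError as RequestsConnectionError
from requests.exceptions import HTTPError, ReadTimeout, RequestException, Timeout


def classify(error):
    """Return the label the except chain above would print (hypothetical helper)."""
    if isinstance(error, ReadTimeout):
        return 'Timeout'
    if isinstance(error, HTTPError):
        return 'HttpError'
    if isinstance(error, RequestException):
        # Catch-all for every other requests exception
        return 'RequestError'
    return 'Other'


# ReadTimeout -> Timeout -> RequestException
print(issubclass(ReadTimeout, Timeout), issubclass(Timeout, RequestException))
print(classify(ReadTimeout()))
print(classify(RequestsConnectionError()))
```

Note that requests only raises HTTPError when response.raise_for_status() is called; a plain requests.get() that receives a 403 returns normally, so the HttpError branch in the example above is not reached unless that call is added.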

Setting a proxy

Proxies can be obtained from the Xici free proxy IP site (西刺免费代理IP).

import requests

# One proxy per URL scheme
proxy = {'https': '119.84.112.139:80', 'http': '101.132.190.101:80'}

response_https = requests.get('https://www.baidu.com', proxies=proxy)
print('response_https:', response_https.status_code)

response_http = requests.get('http://www.baidu.com', proxies=proxy)
print('response_http:', response_http.status_code)

response_https: 200
response_http: 200
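The requests documentation writes proxy URLs with an explicit scheme (for example http://host:port). A small helper can normalize bare host:port strings into that form; build_proxies is a hypothetical name, and the addresses are simply the sample ones used above, which are unlikely to still be live.

```python
def build_proxies(http_addr, https_addr, scheme='http'):
    """Build a requests-style proxies mapping from bare 'host:port' strings (hypothetical helper)."""
    def with_scheme(addr):
        # Leave addresses that already carry a scheme untouched
        return addr if '://' in addr else '{}://{}'.format(scheme, addr)

    return {'http': with_scheme(http_addr), 'https': with_scheme(https_addr)}


proxies = build_proxies('101.132.190.101:80', '119.84.112.139:80')
print(proxies)

# Usage, assuming the proxies are reachable:
# requests.get('https://www.baidu.com', proxies=proxies, timeout=5)
```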

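Using BeautifulSoup

Installation

Refer to the official Beautiful Soup documentation for details.

On Windows 10, install it directly with:

pip install bs4

If you are using a recent version of Debian or Ubuntu, you can install it through the system package manager:

$ apt-get install python-bs4

Beautiful Soup 4 is published through PyPI, so if you cannot install it with the system package manager, you can use easy_install or pip instead. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(PyPI also hosts a package named BeautifulSoup, but that is probably not what you want: it is the Beautiful Soup 3 release. Many projects still use BS3, so the BeautifulSoup package remains available, but for a new project you should install beautifulsoup4.)

If you have neither easy_install nor pip, you can download the BS4 source and install it with setup.py.

$ python setup.py install

If none of the methods above works, the Beautiful Soup license allows you to bundle the BS4 code with your project, so that it can be used without installation.

The author developed Beautiful Soup under Python 2.7 and Python 3.2; in theory it should work on all current Python versions.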

Parsing an HTML string

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features='lxml')

soup.title

<title>The Dormouse's story</title>

soup.title.name

'title'

soup.title.string

"The Dormouse's story"

soup.title.parent.name

'head'

soup.p

<p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

['title']

soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
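Building on the lookups above, a typical next step is to pull structured data out of the tags. The sketch below collects the link text and href of every a tag with class sister. It uses Python's built-in html.parser instead of lxml so that it runs without extra dependencies; that is a substitution made for this example only.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

# html.parser ships with Python, so lxml is not required here
soup = BeautifulSoup(html_doc, features='html.parser')

# class_ (with a trailing underscore) filters on the CSS class
links = [(a.get_text(), a['href']) for a in soup.find_all('a', class_='sister')]
for name, href in links:
    print(name, href)

# Attribute filters work as keyword arguments too
second = soup.find('a', id='link2')
print(second.get_text())
```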

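Parsing a document file

Pass an open file object to parse a local file. Note that the snippet prints soup.prettify without calling it, so the output below is the repr of the bound method (which embeds the parsed document) rather than the indented markup; write print(soup.prettify()) to get the formatted version.

with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, features='lxml')
print(soup.prettify)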

<bound method Tag.prettify of <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>>

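Parsing a web page

import requests
from bs4 import BeautifulSoup

# Download the page, then parse the HTML text
response = requests.get('http://news.baidu.com')
soup = BeautifulSoup(response.text, features='lxml')
print(soup.find('title').text)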

百度新闻——海量中文资讯平台

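Scraping 12306 train tickets

Special note

This case study is adapted from the book 《零基础学 Python》 (Learn Python from Scratch). The book's original code was written in 2018, and since 12306 has been updated in the meantime, the old scraping approach no longer works. The main change here is to the query function: the scraping method was revised, along with the interface layout and other details.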

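Configuring Qt in PyCharm

Downloading PyCharm

PyCharm can be downloaded from its official website.

PyCharm comes in a Professional edition and a Community edition. The Community edition is free and open source. The Professional edition has a 30-day trial, after which a paid license is required (Alipay is accepted). I use the Professional edition, pycharm-professional-2020.1.1.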

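Installing and configuring Qt

Install it with the following command:

pip install PyQt5

Note: I use an Anaconda environment; adjust the paths below to match your own machine.

Configure three External Tools: go to File → Settings → Tools → External Tools and click the + button. The settings for the three tools are as follows.

Qt Designer: used to draw the interface

Property	Value
Name	Qt Designer (any name you like)
Description	Create Qt UI (description, optional)
Program	E:\Anaconda3\Library\bin\designer.exe (path depends on your Python environment)
Arguments	(none)
Working directory	E:\Anaconda3\Library\bin

PyUIC: converts the UI file into code that Python can use

Property	Value
Name	PyUIC (any name you like)
Description	UI to py file (description, optional)
Program	E:\Anaconda3\envs\tensorflow1.x\python.exe (path depends on your Python environment)
Arguments	-m PyQt5.uic.pyuic $FileName$ -o $FileNameWithoutExtension$.py
Working directory	$FileDir$

qrc2py: converts the resource file into a file that Python can use. (Resource files can also be added by hand after the UI has been converted to a py file, but that is more tedious, so here they are added directly while editing the UI.)

Property	Value
Name	qrc2py (any name you like)
Description	(none; description, optional)
Program	E:\Anaconda3\envs\tensorflow1.x\Scripts\pyrcc5.exe (path depends on your Python environment)
Arguments	$FileName$ -o $FileNameWithoutExtension$_rc.py
Working directory	$FileDir$

After configuration, the External Tools list looks as follows: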
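The PyUIC and qrc2py tools are just command lines with PyCharm macros filled in, so the same conversions can be run outside the IDE. The sketch below only builds the argument lists that correspond to the Arguments rows above. The helper names are hypothetical, and source.qrc is an assumed file name inferred from the import source_rc line used later in this article.

```python
import sys
from pathlib import Path


def pyuic_command(ui_file, python=sys.executable):
    """Mirror: -m PyQt5.uic.pyuic $FileName$ -o $FileNameWithoutExtension$.py"""
    ui = Path(ui_file)
    return [python, '-m', 'PyQt5.uic.pyuic', ui.name, '-o', ui.stem + '.py']


def pyrcc_command(qrc_file, pyrcc='pyrcc5'):
    """Mirror: $FileName$ -o $FileNameWithoutExtension$_rc.py"""
    qrc = Path(qrc_file)
    return [pyrcc, qrc.name, '-o', qrc.stem + '_rc.py']


print(pyuic_command('MianWindow.ui'))
print(pyrcc_command('source.qrc'))

# To run a command from the directory that holds the file ($FileDir$):
# import subprocess
# subprocess.run(pyuic_command('MianWindow.ui'), cwd='.', check=True)
```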

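Drawing the interface

Open the Qt Designer external tool configured above.

Drawing the UI requires basic familiarity with Qt. The left side holds layout elements such as buttons and other widgets; the right side is where widget properties such as name and size are adjusted. The interface uses a resource file, so the generated qrc file has to be converted later. How to draw the interface is not covered in detail here; the finished interface is shown in the figure:

After drawing the interface, convert it to a py file with the PyUIC tool, as shown below. The UI file and the py file share the same name, MianWindow.

Converting the qrc file works the same way as converting the UI file. After the conversion, add the following line to MianWindow.py:

import source_rc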

Code files

The main program files are as follows.

MianWindow.py

# -*- coding: utf-8 -*-

# Form implementation generated from reading ui file 'MianWindow.ui'
#
# Created by: PyQt5 UI code generator 5.14.2
#
# WARNING! All changes made in this file will be lost!

from PyQt5 import QtCore, QtGui, QtWidgets
import ast
import sys
import time
from get_stations import *
import source_rc

from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import *
from query_request import *
from PyQt5.QtGui import *


class Ui_mainWindow(object):
    # setupUi is generated by Qt Designer
    def setupUi(self, mainWindow):
        mainWindow.setObjectName("mainWindow")
        mainWindow.resize(960, 850)
        mainWindow.setMinimumSize(QtCore.QSize(480, 360))
        mainWindow.setMaximumSize(QtCore.QSize(960, 1080))
        self.centralwidget = QtWidgets.QWidget(mainWindow)
        self.centralwidget.setObjectName("centralwidget")
        self.label_title_img = QtWidgets.QLabel(self.centralwidget)
        self.label_title_img.setGeometry(QtCore.QRect(0, 0, 960, 225))
        self.label_title_img.setStyleSheet("background-image: url(:/png/src/bg1.png);")
        self.label_title_img.setText("")
        self.label_title_img.setObjectName("label_title_img")
        self.widget_input = QtWidgets.QWidget(self.centralwidget)
        self.widget_input.setGeometry(QtCore.QRect(0, 225, 640, 80))
        self.widget_input.setStyleSheet("background-image: url(:/png/src/bg2.png);")
        self.widget_input.setObjectName("widget_input")
        self.label_departure = QtWidgets.QLabel(self.widget_input)
        self.label_departure.setGeometry(QtCore.QRect(0, 32, 60, 15))
        self.label_departure.setObjectName("label_departure")
        self.lineEdit_departure = QtWidgets.QLineEdit(self.widget_input)
        self.lineEdit_departure.setGeometry(QtCore.QRect(55, 30, 113, 21))
        self.lineEdit_departure.setText("")
        self.lineEdit_departure.setObjectName("lineEdit_departure")
        self.lineEdit_destination = QtWidgets.QLineEdit(self.widget_input)
        self.lineEdit_destination.setGeometry(QtCore.QRect(235, 30, 113, 21))
        self.lineEdit_destination.setText("")
        self.lineEdit_destination.setObjectName("lineEdit_destination")
        self.label_destination = QtWidgets.QLabel(self.widget_input)
        self.label_destination.setGeometry(QtCore.QRect(175, 32, 60, 15))
        self.label_destination.setObjectName("label_destination")
        self.lineEdit_date = QtWidgets.QLineEdit(self.widget_input)
        self.lineEdit_date.setGeometry(QtCore.QRect(430, 30, 113, 21))
        self.lineEdit_date.setObjectName("lineEdit_date")
        self.label_date = QtWidgets.QLabel(self.widget_input)
        self.label_date.setGeometry(QtCore.QRect(355, 32, 72, 15))
        self.label_date.setObjectName("label_date")
        self.pushButton_inquire = QtWidgets.QPushButton(self.widget_input)
        self.pushButton_inquire.setGeometry(QtCore.QRect(548, 26, 93, 28))
        self.pushButton_inquire.setObjectName("pushButton_inquire")
        self.label_departure.raise_()
        self.lineEdit_departure.raise_()
        self.lineEdit_destination.raise_()
        self.label_destination.raise_()
        self.label_date.raise_()
        self.pushButton_inquire.raise_()
        self.lineEdit_date.raise_()
        self.label_logo = QtWidgets.QLabel(self.centralwidget)
        self.label_logo.setGeometry(QtCore.QRect(640, 225, 320, 80))
        self.label_logo.setStyleSheet("background-image: url(:/png/src/logo.png);")
        self.label_logo.setText("")
        self.label_logo.setObjectName("label_logo")
        self.widget_train_class = QtWidgets.QWidget(self.centralwidget)
        self.widget_train_class.setGeometry(QtCore.QRect(0, 305, 960, 35))
        self.widget_train_class.setStyleSheet("background-image: url(:/png/src/bg3.png);")
        self.widget_train_class.setObjectName("widget_train_class")
        self.label_train_class = QtWidgets.QLabel(self.widget_train_class)
        self.label_train_class.setGeometry(QtCore.QRect(20, 10, 72, 15))
        self.label_train_class.setObjectName("label_train_class")
        self.checkBox_G = QtWidgets.QCheckBox(self.widget_train_class)
        self.checkBox_G.setGeometry(QtCore.QRect(120, 9, 91, 19))
        self.checkBox_G.setObjectName("checkBox_G")
        self.checkBox_D = QtWidgets.QCheckBox(self.widget_train_class)
        self.checkBox_D.setGeometry(QtCore.QRect(280, 9, 91, 19))
        self.checkBox_D.setObjectName("checkBox_D")
        self.checkBox_Z = QtWidgets.QCheckBox(self.widget_train_class)
        self.checkBox_Z.setGeometry(QtCore.QRect(440, 9, 91, 19))
        self.checkBox_Z.setObjectName("checkBox_Z")
        self.checkBox_T = QtWidgets.QCheckBox(self.widget_train_class)
        self.checkBox_T.setGeometry(QtCore.QRect(600, 9, 91, 19))
        self.checkBox_T.setObjectName("checkBox_T")
        self.checkBox_K = QtWidgets.QCheckBox(self.widget_train_class)
        self.checkBox_K.setGeometry(QtCore.QRect(760, 9, 91, 19))
        self.checkBox_K.setObjectName("checkBox_K")
        self.label_information = QtWidgets.QLabel(self.centralwidget)
        self.label_information.setGeometry(QtCore.QRect(0, 340, 960, 62))
        self.label_information.setStyleSheet("background-image: url(:/png/src/bg4.png);")
        self.label_information.setText("")
        self.label_information.setObjectName("label_information")
        self.tableView_information = QtWidgets.QTableView(self.centralwidget)
        self.tableView_information.setGeometry(QtCore.QRect(0, 402, 960, 448))
        self.tableView_information.setObjectName("tableView_information")
        self.model = QStandardItemModel()  # Model that stores the table data
        # Stretch columns to fill the available width; widths are not user-adjustable
        self.tableView_information.horizontalHeader().setSectionResizeMode(QHeaderView.Stretch)
        # Hide the horizontal header
        self.tableView_information.horizontalHeader().setVisible(False)
        # Hide the vertical header
        self.tableView_information.verticalHeader().setVisible(False)
        # Set the font size of the table content
        font = QtGui.QFont()
        font.setPointSize(10)
        self.tableView_information.setFont(font)
        # Make the table content read-only
        self.tableView_information.setEditTriggers(QAbstractItemView.NoEditTriggers)
        # Always show the vertical scroll bar
        self.tableView_information.setVerticalScrollBarPolicy(Qt.ScrollBarAlwaysOn)
        self.widget_input.raise_()
        self.label_title_img.raise_()
        self.label_logo.raise_()
        self.widget_train_class.raise_()
        self.label_information.raise_()
        self.tableView_information.raise_()
        mainWindow.setCentralWidget(self.centralwidget)

        self.retranslateUi(mainWindow)
        QtCore.QMetaObject.connectSlotsByName(mainWindow)

    def retranslateUi(self, mainWindow):
        _translate = QtCore.QCoreApplication.translate
        mainWindow.setWindowTitle(_translate("mainWindow", "12306官网查询"))
        self.label_departure.setText(_translate("mainWindow", "出发地："))
        self.label_destination.setText(_translate("mainWindow", "目的地："))
        self.label_date.setText(_translate("mainWindow", "出发日期："))
        self.pushButton_inquire.setText(_translate("mainWindow", "查询"))
        self.label_train_class.setText(_translate("mainWindow", "车次类型："))
        self.checkBox_G.setText(_translate("mainWindow", "G-高铁"))
        self.checkBox_D.setText(_translate("mainWindow", "D-动车"))
        self.checkBox_Z.setText(_translate("mainWindow", "Z-直达"))
        self.checkBox_T.setText(_translate("mainWindow", "T-特快"))
        self.checkBox_K.setText(_translate("mainWindow", "K-快车"))
        self.lineEdit_date.setText(get_time())  # Show today's date as the departure date
        self.pushButton_inquire.clicked.connect(self.on_click)  # Click handler of the query button
        self.checkBox_G.stateChanged.connect(self.change_G)  # High-speed (G) checked/unchecked
        self.checkBox_D.stateChanged.connect(self.change_D)  # EMU (D) checked/unchecked
        self.checkBox_Z.stateChanged.connect(self.change_Z)  # Direct (Z) checked/unchecked
        self.checkBox_T.stateChanged.connect(self.change_T)  # Express (T) checked/unchecked
        self.checkBox_K.stateChanged.connect(self.change_K)  # Fast (K) checked/unchecked

    # Uncheck all train-type check boxes
    def checkBox_default(self):
        self.checkBox_G.setChecked(False)
        self.checkBox_D.setChecked(False)
        self.checkBox_Z.setChecked(False)
        self.checkBox_T.setChecked(False)
        self.checkBox_K.setChecked(False)

    # Click handler of the query button
    def on_click(self):
        get_from = self.lineEdit_departure.text()  # Departure station
        get_to = self.lineEdit_destination.text()  # Destination station
        get_date = self.lineEdit_date.text()  # Departure date
        # Check whether the station file exists
        if isStations():
            # Read all stations and convert the text to a dict
            stations = ast.literal_eval(read())
            # Departure, destination and date must all be filled in
            if get_from != "" and get_to != "" and get_date != "":
                # The station names must exist and the date format must be valid
                if get_from in stations and get_to in stations and is_valid_date(get_date):
                    # Day of the year of the entered date
                    inputYearDay = time.strptime(get_date, "%Y-%m-%d").tm_yday
                    # Day of the year of the current system date
                    yearToday = time.localtime(time.time()).tm_yday
                    # Entered date minus the current date
                    timeDifference = inputYearDay - yearToday
                    # A difference of 0 means today's tickets;
                    # 12306 only allows queries within 30 days
                    if timeDifference >= 0 and timeDifference <= 28:
                        from_station = stations[get_from]  # Code of the departure station
                        to_station = stations[get_to]  # Code of the destination station
                        # Send the query request and collect the returned records
                        data = query(get_date, from_station, to_station)
                        self.checkBox_default()
                        if len(data) != 0:
                            # Show the ticket information in the table
                            self.displayTable(len(data), 16, data)
                        else:
                            self.messageDialog('警告', '没有返回的网络数据！')
                    else:
                        self.messageDialog('警告', '超出查询日期的范围内,'
                                                 '不可查询昨天的车票信息,以及29天以后的车票信息！')
                else:
                    self.messageDialog('警告', '输入的站名不存在,或日期格式不正确！')
            else:
                self.messageDialog('警告', '请填写车站名称！')
        else:
            self.messageDialog('警告', '未下载车站查询文件！')

    # High-speed (G) check box handler
    def change_G(self, state):
        if state == QtCore.Qt.Checked:
            # Checked: add the G trains to the data to display
            g_vehicle()
            self.displayTable(len(type_data), 16, type_data)
        else:
            # Unchecked: remove the G trains
            r_g_vehicle()
            self.displayTable(len(type_data), 16, type_data)

    # EMU (D) check box handler
    def change_D(self, state):
        if state == QtCore.Qt.Checked:
            d_vehicle()
            self.displayTable(len(type_data), 16, type_data)
        else:
            r_d_vehicle()
            self.displayTable(len(type_data), 16, type_data)

    # Direct (Z) check box handler
    def change_Z(self, state):
        if state == QtCore.Qt.Checked:
            z_vehicle()
            self.displayTable(len(type_data), 16, type_data)
        else:
            r_z_vehicle()
            self.displayTable(len(type_data), 16, type_data)

    # Express (T) check box handler
    def change_T(self, state):
        if state == QtCore.Qt.Checked:
            t_vehicle()
            self.displayTable(len(type_data), 16, type_data)
        else:
            r_t_vehicle()
            self.displayTable(len(type_data), 16, type_data)

    # Fast (K) check box handler
    def change_K(self, state):
        if state == QtCore.Qt.Checked:
            k_vehicle()
            self.displayTable(len(type_data), 16, type_data)
        else:
            r_k_vehicle()
            self.displayTable(len(type_data), 16, type_data)

    # Show a message box; title is the caption, message is the text
    def messageDialog(self, title, message):
        msg_box = QMessageBox(QMessageBox.Warning, title, message)
        msg_box.exec_()

    # Show the train information table.
    # train: number of trains, used as the row count.
    # info: number of details per train (seat types and so on), used as the column count.
    def displayTable(self, train, info, data):
        self.model.clear()
        for row in range(train):
            for column in range(info):
                # Create the cell content
                item = QStandardItem(data[row][column])
                # Add the cell to the data model
                self.model.setItem(row, column, item)
        # Attach the model to the table view
        self.tableView_information.setModel(self.model)


# Get the current system date in the format required by the request
def get_time():
    # Current timestamp
    now = int(time.time())
    # Convert to a date string; other formats such as "%Y-%m-%d %H:%M:%S" are possible
    timeStruct = time.localtime(now)
    strTime = time.strftime("%Y-%m-%d", timeStruct)
    return strTime


def is_valid_date(text):
    '''Return True if text is a valid YYYY-MM-DD date string.'''
    try:
        time.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Create and show the main window
def show_MainWindow():
    app = QtWidgets.QApplication(sys.argv)  # QApplication is the entry point of the GUI program
    MainWindow = QtWidgets.QMainWindow()  # QMainWindow is a window type with a built-in menu
    ui = Ui_mainWindow()  # Instantiate the UI class
    ui.setupUi(MainWindow)  # Build the window UI
    MainWindow.show()  # Show the window
    # app.exec_() starts the main event loop, which dispatches operating system
    # events to the window; when the loop ends, sys.exit() exits the program and
    # releases its resources. The method is named exec_() rather than exec()
    # because exec was a reserved keyword in Python 2.
    sys.exit(app.exec_())


# Program entry point
if __name__ == '__main__':
    if not isStations():
        getStation()
    show_MainWindow()
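One caveat about the date check in on_click: it subtracts tm_yday values (day of the year), which turns negative whenever the travel date falls in the next calendar year, so late-December queries for January would be rejected. A possible alternative, sketched below with hypothetical helper names and the same 0 to 28 day window, is to subtract datetime.date objects instead.

```python
from datetime import date, datetime


def days_ahead(date_text, today):
    """Number of days from today to the YYYY-MM-DD date in date_text."""
    target = datetime.strptime(date_text, '%Y-%m-%d').date()
    return (target - today).days


def in_query_window(date_text, today, max_days=28):
    """True if the date is valid and between 0 and max_days days ahead."""
    try:
        diff = days_ahead(date_text, today)
    except ValueError:
        # Not a valid YYYY-MM-DD string
        return False
    return 0 <= diff <= max_days


# Fixed "today" so the example is reproducible; real code would pass date.today()
today = date(2020, 12, 20)
print(days_ahead('2021-01-05', today))
print(in_query_window('2021-01-05', today))
print(in_query_window('2020-12-19', today))
```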

query_request.py

import ast

import requests

from get_stations import *

'''
Field indexes in each '|'-separated result record:
5-7 destination, 3 train number, 6 departure station, 8 departure time,
9 arrival time, 10 duration, 26 no seat, 29 hard seat, 24 soft seat,
28 hard sleeper, 33 EMU sleeper, 23 soft sleeper, 21 deluxe soft sleeper,
30 second class, 31 first class, 32 business / premier class
'''

data = []  # All train records after clean-up
type_data = []  # Records remaining after filtering by train type


def query(date, from_station, to_station):
    print(date, from_station, to_station)
    data.clear()  # Clear the previous records
    type_data.clear()  # Clear the previous filtered records
    # Set the cookie
    cookie = 'JSESSIONID=245782306A8F72B197AE2ADA05F463A8; BIGipServerotn=1708720394.50210.0000; RAIL_EXPIRATION=1589844843366; RAIL_DEVICEID=R0B-jSpnTZ4NSWa2MVZNuBUvmwAoHG22Rqb8eQm0qu7ZUWdpbKElaHY3oqEUR8AG2ooarmYVW3kNP98Lkhn5YqoPa5KUUB8IMjRdPEZ-iZbqgyh-gOFgMNRRpieZq3GBI36yzGkOErVsDyR9NWWrDJY_EThOSJ5f; BIGipServerpassport=921174282.50215.0000; route=9036359bb8a8a461c164a04f8f50b252; _jc_save_fromStation=%u5317%u4EAC%2C{}; _jc_save_toStation=%u4E0A%u6D77%2C{}; _jc_save_fromDate={}; _jc_save_toDate={}; _jc_save_wfdc_flag=dc'.format(from_station, to_station, date, date)
    # Set the request headers
    headers = {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36',
        'Cookie': cookie
    }
    # Query URL
    url = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date={}&leftTicketDTO.from_station={}&leftTicketDTO.to_station={}&purpose_codes=ADULT'.format(date, from_station, to_station)
    # Send the query request
    response = requests.get(url, headers=headers)
    # Set the encoding
    response.encoding = 'utf-8'
    # Convert the JSON data to a dict and pick out the result list
    result = response.json()
    result = result['data']['result']
    # Check whether the station file exists
    if isStations():
        # Read all stations and convert the text to a dict
        stations = ast.literal_eval(read())
        if len(result) != 0:  # Check whether any data was returned
            for i in result:
                # Split the record into its fields
                tmp_list = i.split('|')
                # The result holds station codes, so look up the station names in the station dict
                from_station = list(stations.keys())[list(stations.values()).index(tmp_list[6])]
                to_station = list(stations.keys())[list(stations.values()).index(tmp_list[7])]
                # Build the seat list; empty fields ("") are replaced with "--" below
                seat = [tmp_list[3], from_station, to_station, tmp_list[8], tmp_list[9], tmp_list[10],
                        tmp_list[32], tmp_list[31], tmp_list[30], tmp_list[21], tmp_list[23],
                        tmp_list[33], tmp_list[28], tmp_list[24], tmp_list[29], tmp_list[26]]
                newSeat = []
                # Replace empty seat fields with "--" so they are easy to spot
                for s in seat:
                    if s == "":
                        s = "--"
                    newSeat.append(s)  # Save the cleaned field
                data.append(newSeat)
    return data  # Return the cleaned train records


# Add the high-speed (G) trains
def g_vehicle():
    if len(data) != 0:
        for g in data:  # Loop over all trains
            if g[0].startswith('G'):  # Train number starts with G
                type_data.append(g)


# Remove the high-speed (G) trains
def r_g_vehicle():
    if len(data) != 0 and len(type_data) != 0:
        for g in data:
            if g[0].startswith('G'):
                type_data.remove(g)


# Add the EMU (D) trains
def d_vehicle():
    if len(data) != 0:
        for d in data:
            if d[0].startswith('D'):
                type_data.append(d)


# Remove the EMU (D) trains
def r_d_vehicle():
    if len(data) != 0 and len(type_data) != 0:
        for d in data:
            if d[0].startswith('D'):
                type_data.remove(d)


# Add the direct (Z) trains
def z_vehicle():
    if len(data) != 0:
        for z in data:
            if z[0].startswith('Z'):
                type_data.append(z)


# Remove the direct (Z) trains
def r_z_vehicle():
    if len(data) != 0 and len(type_data) != 0:
        for z in data:
            if z[0].startswith('Z'):
                type_data.remove(z)


# Add the express (T) trains
def t_vehicle():
    if len(data) != 0:
        for t in data:
            if t[0].startswith('T'):
                type_data.append(t)


# Remove the express (T) trains
def r_t_vehicle():
    if len(data) != 0 and len(type_data) != 0:
        for t in data:
            if t[0].startswith('T'):
                type_data.remove(t)


# Add the fast (K) trains
def k_vehicle():
    if len(data) != 0:
        for k in data:
            if k[0].startswith('K'):
                type_data.append(k)


# Remove the fast (K) trains
def r_k_vehicle():
    if len(data) != 0 and len(type_data) != 0:
        for k in data:
            if k[0].startswith('K'):
                type_data.remove(k)
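The ten add/remove functions differ only in the prefix letter, so the same filtering could be expressed with one function that rebuilds the filtered list from the currently checked prefixes. This is an alternative sketch, not the code the application uses; filter_by_prefix and the sample rows are made up for illustration.

```python
def filter_by_prefix(rows, prefixes):
    """Return the rows whose train number (first column) starts with one of the prefixes."""
    wanted = tuple(prefixes)
    # str.startswith accepts a tuple and matches any of its members
    return [row for row in rows if row[0].startswith(wanted)]


# Made-up rows: only the first column (train number) matters here
sample = [['G101', '北京南'], ['D321', '北京南'], ['K1109', '北京'], ['G7', '北京南']]

print(filter_by_prefix(sample, ['G']))
print(filter_by_prefix(sample, ['D', 'K']))
print(filter_by_prefix(sample, []))
```

One behavioral difference to be aware of: this returns rows in the order of the query result, whereas the original functions append rows in the order the check boxes are clicked.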

get_stations.py

import os
import re

import requests


# Download the station names
def getStation():
    url = 'https://kyfw.12306.cn/otn/resources/js/framework/station_name.js?station_version=1.9142'
    response = requests.get(url, verify=True)
    # Match each Chinese station name followed by its upper-case station code
    stations = re.findall(r'([\u4e00-\u9fa5]+)\|([A-Z]+)', response.text)
    stations = dict(stations)
    stations = str(stations)
    write(stations)


# Write the station file
def write(item):
    with open('stations.txt', 'w', encoding='utf-8') as f:
        f.write(item)


# Read the station file
def read():
    with open('stations.txt', 'r', encoding='utf-8') as f:
        data = f.readline()
        return data


# Check whether the station file exists
def isStations():
    return os.path.exists('stations.txt')
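To make the regular expression in getStation concrete, here it is applied to a short hand-written string. The sample imitates the name|CODE layout that the pattern expects; the real station_name.js is much longer, and its exact layout is an assumption here. The last lines also build a reverse (code to name) dict, which is a cheaper lookup than the list(...).index(...) calls in query_request.py.

```python
import re

# Hand-written sample in the shape the regex expects (assumed layout)
sample = "var station_names ='@bjb|北京北|VAP|beijingbei|bjb|0@bjd|北京东|BOP|beijingdong|bjd|1';"

# Each match is a (Chinese name, upper-case code) pair
pairs = re.findall(r'([\u4e00-\u9fa5]+)\|([A-Z]+)', sample)
stations = dict(pairs)
print(stations)

# Reverse mapping for translating codes in query results back to names
code_to_name = {code: name for name, code in stations.items()}
print(code_to_name['VAP'])
```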

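Packaging with PyInstaller

Running directly

Running the main program gives the following:

Running the packaged program

If the packaged program fails to run, see the solutions in the posts "pygame 实现 flappybird 并打包成 exe 运行文件" and "使用 Pygame 创建五子棋游戏".
Enter the following directly on the command line:

pyinstaller -F -w -i logo.ico main.py

The packaged program runs as follows: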

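(Appendix) A simple scraping walkthrough

Open the official 12306 ticket query site and enter Beijing to Shanghai, as shown below.

Press F12, then press F5 to refresh (after refreshing you may need to click Query again). The final view should look as follows. It contains a lot of information: besides the train data there are also the page's image files and so on.

Find the entry that carries the train information. Its name is roughly query?leftTicketDTO.train_date=2020-05-17&leftTicketDTO.from_station=BJP&leftTicketDTO.to_station=SHH&purpose_codes=ADULT (it may differ depending on the date).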
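The request name above is just a query string, so it can be assembled and taken apart with the standard library. The sketch below rebuilds the URL from its four parameters, using the base address that appears in query_request.py; build_query_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address taken from query_request.py
BASE = 'https://kyfw.12306.cn/otn/leftTicket/query'


def build_query_url(date, from_code, to_code):
    """Assemble the ticket query URL from its parameters (hypothetical helper)."""
    params = [
        ('leftTicketDTO.train_date', date),
        ('leftTicketDTO.from_station', from_code),
        ('leftTicketDTO.to_station', to_code),
        ('purpose_codes', 'ADULT'),
    ]
    # A list of pairs keeps the parameter order fixed
    return BASE + '?' + urlencode(params)


url = build_query_url('2020-05-17', 'BJP', 'SHH')
print(url)

# Going the other way: split a captured URL back into its parameters
print(parse_qs(urlsplit(url).query))
```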

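The Headers and Response tabs contain the information we need. Headers holds the Cookie, the request headers, the User-Agent and so on. Response holds the train information, including stations, times and seat availability, along with some obfuscated anti-scraping data. We need regular expressions, string processing methods and the like to extract the information; see query_request.py for the details.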
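As a rough illustration of that string processing, each entry in the response is one long '|'-separated record, and the fields are picked out by position. The sketch below uses a few of the indexes listed in the comment at the top of query_request.py. The record itself is synthetic (the train number and times are invented), and parse_record is a hypothetical helper.

```python
# Field positions taken from the comment at the top of query_request.py
FIELDS = {
    'train': 3, 'from_code': 6, 'to_code': 7,
    'depart': 8, 'arrive': 9, 'duration': 10,
    'second_class': 30, 'first_class': 31, 'business': 32,
}


def parse_record(record):
    """Split one '|'-separated record and replace empty fields with '--'."""
    parts = record.split('|')
    return {name: (parts[index] or '--') for name, index in FIELDS.items()}


# Synthetic record: every value below is invented for the example
parts = [''] * 36
parts[3], parts[6], parts[7] = 'G101', 'BJP', 'SHH'
parts[8], parts[9], parts[10] = '06:36', '12:40', '06:04'
parts[30], parts[32] = '有', '5'
record = '|'.join(parts)

parsed = parse_record(record)
print(parsed)
```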

Git address

All files have been uploaded to GitHub.
Stars are welcome.
