Scraping All 12306 Train Numbers and Timetables (Part 2)

At the end of the previous post I sketched the approach for the first small goal. Here I start by showing what that first goal produced.

1. Results

After scraping, the data for each day is saved in its own folder. I scraped roughly one month of train information, and each folder holds a number of txt files.

Inside each folder, every txt file is named after the search keyword used for that request, as described in the previous post.

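Opening one of the txt files shows the raw response text, which is hard to read as it stands. To inspect it, copy the content and paste it into an online JSON formatter such as http://www.json.cn/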

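Once formatted, the structure is easy to see. If the keyword returned no trains, the data list is empty, but the overall structure is the same as for a response that does contain trains. All we have to do is iterate over the data list and pull out the fields we need.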
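To make the structure concrete, here is a minimal sketch of walking the data list. The field names are the ones used by the processing script later in this post; the sample values and the helper name `extract_trains` are made up for illustration and are not real 12306 output.

```python
import json

# Hypothetical sample shaped like the response described above.
# Field names match the processing script in this post; values are invented.
SAMPLE_RESPONSE = """
{
  "data": [
    {
      "station_train_code": "C1001",
      "train_no": "0000000000",
      "from_station": "A",
      "to_station": "B",
      "total_num": "5"
    }
  ]
}
"""


def extract_trains(raw_text):
    """Return one tuple per train found in a single response body."""
    payload = json.loads(raw_text)
    rows = []
    # "data" is an empty list when the keyword matched nothing;
    # "or []" also covers the case where it is null.
    for item in payload.get("data") or []:
        rows.append((
            item["station_train_code"],
            item["train_no"],
            item["from_station"],
            item["to_station"],
            item["total_num"],
        ))
    return rows


print(extract_trains(SAMPLE_RESPONSE))
print(extract_trains('{"data": []}'))
```

An empty data list simply yields no rows, so files for keywords without any trains need no special handling.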

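After processing every txt file we end up with a single table. With that, the first goal is basically achieved: we can now get every train number in the country together with its corresponding internal train ID.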

2. Code for the first goal

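I split the keywords to crawl into three categories. The first needs only one loop: prefixes such as C0 to C9 are searched as they are. The second needs two loops: a further digit is appended to each prefix, giving keywords such as C10 to C19. The third is pure digits with no letter. This one is special, because searching for 130 also returns every lettered train whose number contains 130. If that is unclear, open 12306 and try a few searches yourself. If you have a better approach, I'd be glad to hear it; for now this brute-force method is all I have.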
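The three categories can be sketched as a single keyword generator. This is only an illustration of the scheme described above; the function name `build_keywords` and its parameters are hypothetical, and the three scripts below build their keywords inline instead.

```python
def build_keywords(one_level, two_level, numeric_range=range(100, 1000)):
    """Build the full keyword list from the three categories.

    one_level:     prefixes searched as they are (category 1)
    two_level:     prefixes that get one more digit 0-9 appended (category 2)
    numeric_range: pure-digit keywords (category 3)
    """
    keywords = list(one_level)
    for prefix in two_level:
        for digit in range(10):
            keywords.append(f"{prefix}{digit}")
    keywords.extend(str(n) for n in numeric_range)
    return keywords


# Small demonstration with a few of the prefixes used in this post.
demo = build_keywords(["C1", "C9"], ["C2", "T"], range(100, 103))
print(len(demo), demo[:5], demo[-3:])
```

Keeping the category lists as parameters makes it easy to move a prefix from the first category to the second if it starts returning too many results.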

Category 1 code

import os

import requests

# Base URL; the keyword and the date are filled in for each request
url = "https://search.12306.cn/search/v1/train/search?keyword={}&date={}"

# Dates to query. The trains that run differ from day to day, so query
# several days and deduplicate afterwards to get as complete a list as possible.
date_list = [
    "20201211", "20201212", "20201213", "20201214", "20201215", "20201216",
    "20201217", "20201218", "20201219", "20201220", "20201221", "20201222",
    "20201223", "20201224", "20201225", "20201226", "20201227", "20201228",
    "20201229", "20201230", "20201231", "20210101", "20210102", "20210103",
    "20210104", "20210105", "20210106", "20210107", "20210108",
]

# These keywords returned fewer than 100 results on the first try. I expect
# they will never grow past 200, so they are searched without an extra digit.
keyword_list = ["C1", "C9", "D0", "D4", "D9", "G4", "G9", "K2", "K3", "K4", "K5", "K6"]

# Request headers that imitate a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}

# Proxy as an anti-scraping precaution. Format: "scheme": "scheme://ip:port".
# requests looks proxies up by the lowercase URL scheme, so the key must be
# "http"/"https" (an uppercase "HTTP" key is never matched), and an https URL
# needs an "https" entry. Find a working proxy on a proxy listing site such as
# 无忧代理; the address below has most likely expired. Putting several proxies
# in a list and picking one with random may work even better.
proxies = {
    "http": "http://182.34.27.89:9999",
    "https": "http://182.34.27.89:9999",
}

root_dir = "爬取的12306车次数据"  # output root folder ("scraped 12306 train data")

for date in date_list:
    # One folder per date; created if it does not exist yet
    day_dir = os.path.join(root_dir, date)
    os.makedirs(day_dir, exist_ok=True)
    for keyword in keyword_list:
        full_url = url.format(keyword, date)  # build the complete URL
        response = requests.get(url=full_url, headers=headers, proxies=proxies)
        file_path = os.path.join(day_dir, keyword + ".txt")  # file named after the keyword
        # Create the file and write the response body into it
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(response.content.decode("utf-8"))
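The proxy comment in the Category 1 script suggests keeping several proxies in a list and picking one at random. A minimal sketch of that idea follows; the helper name `pick_proxies` is hypothetical and the addresses are placeholders from the 203.0.113.0/24 documentation range, not working proxies.

```python
import random

# Placeholder addresses (reserved documentation range); replace with
# proxies you have verified yourself.
PROXY_POOL = [
    "http://203.0.113.10:9999",
    "http://203.0.113.11:9999",
    "http://203.0.113.12:9999",
]


def pick_proxies(pool):
    """Pick one proxy at random and return it in the dict shape requests expects."""
    address = random.choice(pool)
    # Lowercase scheme keys; the same proxy is used for http and https URLs.
    return {"http": address, "https": address}


print(pick_proxies(PROXY_POOL))
```

In the scraping loop this would be used as `requests.get(url=full_url, headers=headers, proxies=pick_proxies(PROXY_POOL))`, so that each request may go out through a different proxy.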

Category 2 code

import os

import requests

url = "https://search.12306.cn/search/v1/train/search?keyword={}&date={}"

date_list = [
    "20201211", "20201212", "20201213", "20201214", "20201215", "20201216",
    "20201217", "20201218", "20201219", "20201220", "20201221", "20201222",
    "20201223", "20201224", "20201225", "20201226", "20201227", "20201228",
    "20201229", "20201230", "20201231", "20210101", "20210102", "20210103",
    "20210104", "20210105", "20210106", "20210107", "20210108",
]

# Prefixes that need one more digit (0-9) appended before searching
keyword_list = [
    "C2", "C3", "C5", "C6", "C7", "C8",
    "D1", "D2", "D3", "D5", "D6", "D7", "D8",
    "G1", "G2", "G3", "G5", "G6", "G7", "G8",
    "K1", "K7", "K8", "K9",
    "T", "Z",
]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}

# Proxy. Format: "scheme": "scheme://ip:port", with lowercase scheme keys.
# The address below has most likely expired; replace it with a working one.
proxies = {
    "http": "http://222.189.190.151:9999",
    "https": "http://222.189.190.151:9999",
}

root_dir = "爬取的12306车次数据"  # output root folder ("scraped 12306 train data")

for date in date_list:
    day_dir = os.path.join(root_dir, date)
    os.makedirs(day_dir, exist_ok=True)
    for prefix in keyword_list:
        for k in range(0, 10):
            key_word = prefix + str(k)  # e.g. "C2" + "0" -> "C20"
            full_url = url.format(key_word, date)
            response = requests.get(url=full_url, headers=headers, proxies=proxies)
            file_path = os.path.join(day_dir, key_word + ".txt")
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(response.content.decode("utf-8"))

Category 3 code

import os

import requests

url = "https://search.12306.cn/search/v1/train/search?keyword={}&date={}"

date_list = [
    "20201211", "20201212", "20201213", "20201214", "20201215", "20201216",
    "20201217", "20201218", "20201219", "20201220", "20201221", "20201222",
    "20201223", "20201224", "20201225", "20201226", "20201227", "20201228",
    "20201229", "20201230", "20201231", "20210101", "20210102", "20210103",
    "20210104", "20210105", "20210106", "20210107", "20210108",
]

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}

# Proxy. Format: "scheme": "scheme://ip:port", with lowercase scheme keys.
# The address below has most likely expired; replace it with a working one.
proxies = {
    "http": "http://171.12.115.212:9999",
    "https": "http://171.12.115.212:9999",
}

root_dir = "爬取的12306车次数据"  # output root folder ("scraped 12306 train data")

for date in date_list:
    day_dir = os.path.join(root_dir, date)
    os.makedirs(day_dir, exist_ok=True)
    # Pure-digit keywords: 100 to 999
    for j in range(100, 1000):
        key_word = str(j)
        full_url = url.format(key_word, date)
        response = requests.get(url=full_url, headers=headers, proxies=proxies)
        file_path = os.path.join(day_dir, key_word + ".txt")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(response.content.decode("utf-8"))

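Run these three scripts and you get a set of folders, each containing a number of txt files. Next we read these txt files in bulk and process them to produce the final table.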

3. Preliminary processing of the scraped data

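Here I only show the processing for the third category, the pure-digit keywords. The other two categories need only small changes to the same script.
Note the flag in the code. It is there because some of the txt files written by the three scripts above may be invalid. With that many requests, a few or a few dozen will always fail; you see failures even when testing requests by hand one at a time, let alone in a bulk run like this.
The fix is to re-request whenever a txt file turns out to be invalid: send the request again, write the response into the same txt file, and process it again. This repeats until the file reads correctly, and only then does the script move on to the next txt file.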

import csv
import json
import os

import requests

url = "https://search.12306.cn/search/v1/train/search?keyword={}&date={}"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
}

# Proxy. Format: "scheme": "scheme://ip:port", with lowercase scheme keys.
# The address below has most likely expired; replace it with a working one.
proxies = {
    "http": "http://222.189.190.36:9999",
    "https": "http://222.189.190.36:9999",
}

date_list = [
    "20201211", "20201212", "20201213", "20201214", "20201215", "20201216",
    "20201217", "20201218", "20201219", "20201220", "20201221", "20201222",
    "20201223", "20201224", "20201225", "20201226", "20201227", "20201228",
    "20201229", "20201230", "20201231", "20210101", "20210102", "20210103",
    "20210104", "20210105", "20210106", "20210107", "20210108",
]

# Output folder ("12306 raw data processing")
os.makedirs("12306爬虫原始数据处理", exist_ok=True)
file_name = os.path.join("爬取的12306车次数据", "{}", "{}.txt")

with open("12306爬虫原始数据处理/所有车次信息3.csv", "w", newline="", encoding="utf-8") as f:
    # Alternative: append to an existing file instead
    # with open("12306爬虫原始数据处理/所有车次信息5.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # writer.writerow(["train code", "train no", "from station", "to station", "total stops"])
    for date in date_list:
        for j in range(100, 1000):
            key_word = str(j)
            file_name_ful = file_name.format(date, key_word)
            flag = 1  # 1 = this file has not been read successfully yet
            while flag == 1:
                try:
                    with open(file_name_ful, "r", encoding="utf-8") as f1:
                        data_dic = json.loads(f1.read())
                    data_list = data_dic["data"]
                    # The original tested date_list here, which is always true
                    if data_list is not None:
                        for data in data_list:
                            station_train_code = data["station_train_code"]  # train code
                            train_no = "#" + data["train_no"] + "#"  # internal train ID
                            from_station = data["from_station"]  # departure station
                            to_station = data["to_station"]  # terminal station
                            total_num = data["total_num"]  # total number of stops
                            writer.writerow([station_train_code, train_no, from_station, to_station, total_num])
                    flag = 0
                except Exception:
                    # The file is missing or invalid: request again and overwrite it
                    ful_url = url.format(key_word, date)
                    response = requests.get(url=ful_url, headers=headers, proxies=proxies)
                    with open(file_name_ful, "w", encoding="utf-8") as f3:
                        f3.write(response.content.decode("utf-8"))
                    print("Folder {}: {}.txt is invalid, requesting again".format(date, key_word))
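The retry loop above keeps re-requesting until the file parses, so it can spin forever if the server keeps returning bad responses. A bounded variant might look like the sketch below. The names `load_valid_payload`, `refetch`, and `max_attempts` are hypothetical and not part of the original script; `refetch` stands in for whatever function performs the request and returns the response text.

```python
import json


def load_valid_payload(path, refetch, max_attempts=3):
    """Read a saved response; re-request and overwrite it while it is invalid.

    path:         txt file written by one of the scraping scripts
    refetch:      zero-argument callable returning fresh response text
    max_attempts: total number of read attempts before giving up
    Returns the parsed dict, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        if attempt > 0:
            # The previous read failed: overwrite the file with a new response.
            with open(path, "w", encoding="utf-8") as f:
                f.write(refetch())
        try:
            with open(path, "r", encoding="utf-8") as f:
                payload = json.load(f)
            # A usable response is a dict whose "data" entry is a list.
            if isinstance(payload.get("data"), list):
                return payload
        except (OSError, ValueError, AttributeError):
            # Missing file, invalid JSON, or JSON that is not an object.
            pass
    return None
```

Passing the request in as a callable keeps the file handling separate from the network code, and a `None` result lets the caller log the keyword and move on instead of blocking the whole run.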

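When this is done for all three categories you have three csv files. Deduplicate each of them and merge them into one final file, and you have the train information for every train in the country together with its internal train ID.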
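One way to do the deduplication and merge is sketched below with the standard csv module. It assumes the three files have no header row (the header line is commented out in the script above) and treats two rows as duplicates only when every column matches; the function name `merge_and_dedupe`, the English header names, and the file paths in the usage note are illustrative.

```python
import csv


def merge_and_dedupe(csv_paths, out_path):
    """Merge several header-less csv files into one, dropping duplicate rows.

    Returns the number of unique rows written (excluding the header).
    """
    seen = set()
    unique_rows = []
    for path in csv_paths:
        with open(path, "r", newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                key = tuple(row)
                # Skip blank lines and rows already seen in any input file.
                if row and key not in seen:
                    seen.add(key)
                    unique_rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["station_train_code", "train_no", "from_station",
                         "to_station", "total_num"])
        writer.writerows(unique_rows)
    return len(unique_rows)
```

It could be called as `merge_and_dedupe(["a.csv", "b.csv", "c.csv"], "all_trains.csv")`, with the three paths replaced by the csv files produced for the three categories. Rows keep the order in which they were first seen.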
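The last step is to take this final table, send a request for each train in it, and fetch the detailed list of stops for every train. I'll leave that for the next part!
Link to the next part:
https://blog.csdn.net/weixin_43316129/article/details/111473027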