Crawling USTC EPC Courses

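At USTC, the Daily Communication English and Academic Communication English courses require 20 class hours of classroom study plus another 20 hours of practice sessions at the English Practice Center (EPC) before credit is granted. EPC sessions are hard to book, so here I use Python to crawl the EPC site and pick up slots that other students release. The script refreshes the EPC website once a minute and emails me whenever someone drops a session.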

Analyzing the EPC Website

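The EPC home page asks for a CAPTCHA at login, so my first idea was to simulate a logged-in session with cookies, as follows.
Log in to EPC, press F12 to open the developer tools, open the Network tab, tick Preserve log, and click "Situational Dialogue". Locate the right URL in the panel and copy its request headers into Python as the headers. This does work, but it has a drawback: every run requires logging in through the browser first and copying fresh cookies. I later found by experiment that logging in from Python does not require the CAPTCHA, so I use a different approach here.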
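For the cookie-based approach, the header lines copied from the Network tab have to become a Python dict. A minimal sketch of that step follows; the helper name `parse_raw_headers` and the sample header values are placeholders for illustration, not part of the original script.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a 'Name: value' line
        headers[name.strip()] = value.strip()
    return headers


# Placeholder values: paste the real lines from the Network tab here.
RAW = """
Host: epc.ustc.edu.cn
User-Agent: Mozilla/5.0
Cookie: xxxxxx=xxxxxx
Referer: http://epc.ustc.edu.cn/
"""

headers = parse_raw_headers(RAW)
print(headers['Cookie'])
```

The resulting dict can be passed as the `headers` argument of `requests.get`. The cookie still expires, which is why this route needs a fresh copy for every run.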

In the same way, find the request in the Network tab whose method is POST; that is the login URL. I use requests.Session() to keep the cookies across requests.

The code is as follows:

import requests
import smtplib
import time
from bs4 import BeautifulSoup as bs  # parse pages with BeautifulSoup
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

MAX = 13   # upper bound on the week number to report; the largest week is 21
INI = 125  # sentinel value for a failed request, chosen arbitrarily
session = requests.Session()

# Log in to EPC
url_login = 'http://epc.ustc.edu.cn/n_left.asp'
data = {
    'submit_type': 'user_login',
    'name': 'xxxxxx',
    'pass': 'xxxxxx',
    'user_type': '2',
    'Submit': 'LOG IN'
}
resp = session.post(url=url_login, data=data)

# Parse a page and return a list: [week, weekday, then the strings of the class-time cell]
def getInfo(url):
    resp = session.get(url)
    resp.encoding = resp.apparent_encoding
    # print(resp.text)
    soup = bs(resp.text, 'html.parser')
    tds = soup.select('td[align="center"]')
    # Header cells:
    #   tds[0] show bookable only, tds[1] booking unit, tds[2] week,
    #   tds[3] weekday, tds[4] teacher, tds[5] hours,
    #   tds[6] class time, tds[7] classroom, ...
    # First data row:
    #   tds[14] week number, tds[15].string weekday,
    #   tds[16].string teacher, tds[18].strings class time
    return [int(tds[14].string[1:3]), tds[15].string] + [string for string in tds[18].strings]
def getEPC():
    # Returns a dict:
    #   key:   session type name, or INI on failure
    #   value: the list returned by getInfo, or [INI] on failure
    try:
        url1 = 'http://epc.ustc.edu.cn/m_practice.asp?second_id=2001'  # Situational Dialogue
        url2 = 'http://epc.ustc.edu.cn/m_practice.asp?second_id=2002'  # Topical Discussion
        url3 = 'http://epc.ustc.edu.cn/m_practice.asp?second_id=2003'  # Debate (defined but not queried)
        url4 = 'http://epc.ustc.edu.cn/m_practice.asp?second_id=2004'  # Drama
        url7 = 'http://epc.ustc.edu.cn/m_practice.asp?second_id=2007'  # Pronunciation Practice
        info = {}
        info['Situational Dialogue'] = getInfo(url1)
        info['Topical Discussion'] = getInfo(url2)
        info['Drama'] = getInfo(url4)
        info['Pronunciation Practice'] = getInfo(url7)
        return info
    except Exception:
        # Any network or parsing error is reported through the sentinel
        return {INI: [INI]}
# Send the notification email
def SendMail(text):
    subject = 'EPC Crawling'
    sender = 'xxxxxx'
    receiver = ['xxxxxx', 'xxxxxx']
    msg = MIMEMultipart('mixed')
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = ', '.join(receiver)
    text = MIMEText(text, 'plain', 'utf-8')
    msg.attach(text)
    smtp = smtplib.SMTP()
    smtp.connect('xxxxxx')
    username = 'xxxxxx'
    password = 'xxxxxx'
    smtp.login(username, password)
    smtp.sendmail(sender, receiver, msg.as_string())
    smtp.quit()
# Main program
if __name__ == "__main__":
    while True:
        status = True
        info = getEPC()
        print(time.ctime(), ':')
        for key, value in info.items():
            print('{}:{}'.format(key, value))
        print('\n')
        for key, value in info.items():
            if value[0] == INI:
                text = "EPC crawling stopped."
                print(text)
                # SendMail(text)
                status = False
                break
            if value[0] < MAX:  # a separate filter rule could be added here
                text = 'There is a course of {} in week{},{},{},{}.'.format(key, value[0], value[1], value[2], value[3])
                print(text)
                SendMail(text)
        if not status:
            break
        time.sleep(60)  # change the refresh interval here

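Each notification email carries one line of text in the format built in the main program, 'There is a course of {} in week{},{},{},{}.', with the session type and the crawled fields filled in.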

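My mailbox is linked to WeChat, so I get the notification almost as soon as a session is crawled and can then decide whether I am able to book it.
Because the USTC campus network is not stable, a second script is used to call the code above and restart it whenever it exits: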

import os
cmd = 'py epc.py'
while True:
    os.system(cmd)  # a 5-second pause between attempts could be added here
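
The 5-second pause mentioned in the comment could look roughly like the sketch below. The function name `keep_running` and its parameters are assumptions made for illustration. It uses `subprocess.call` in place of `os.system`, and the `run` and `sleep` hooks exist only so the loop can be exercised without waiting.

```python
import subprocess
import sys
import time


def keep_running(cmd, pause=5, max_runs=None, run=subprocess.call, sleep=time.sleep):
    """Run cmd again each time it exits, pausing between attempts.

    max_runs=None loops forever; a number limits the attempts.
    Returns how many times cmd was started.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        if runs:
            sleep(pause)  # wait before every retry, but not before the first run
        run(cmd)
        runs += 1
    return runs


# Short demonstration with a command that exits immediately.
print(keep_running([sys.executable, '-c', 'pass'], pause=0, max_runs=2))
```

For the crawler itself the call would be `keep_running(['py', 'epc.py'])`, which never returns and restarts the script five seconds after each exit.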

Before using Python I had booked only 8 EPC sessions in two months; with Python I completed 12 in two weeks. Technology changes life.

Conclusion

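Now that my EPC hours are complete, this code can be made public. Two things remain unsatisfying. First, the filter rule is too simple, which produces a pile of junk emails; a stricter rule would cut them down and could also check whether a particular session type has an opening in a particular time slot. Second, the script cannot book sessions automatically: after a notification I still have to log in on the web page and book by hand, which is a hassle. I leave these for others to improve.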
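
As a starting point for the first item, here is one possible shape for a stricter filter, combined with a check that suppresses repeat emails for the same slot. It is only a sketch: the names `wanted` and `new_matches` are hypothetical, the sample data is made up, and the weekday strings must match whatever the site actually prints.

```python
def wanted(name, value, max_week, types=None, weekdays=None):
    """Return True if a crawled entry passes the filter.

    value follows the crawler's layout: value[0] is the week number,
    value[1] the weekday string as printed by the site.
    """
    if value[0] >= max_week:
        return False
    if types is not None and name not in types:
        return False
    if weekdays is not None and value[1] not in weekdays:
        return False
    return True


def new_matches(info, seen, **rule):
    """Return entries that pass the rule and were not reported before."""
    fresh = {}
    for name, value in info.items():
        key = (name, tuple(value))
        if wanted(name, value, **rule) and key not in seen:
            seen.add(key)  # remember it so the next poll stays quiet
            fresh[name] = value
    return fresh


# Made-up sample in the shape getEPC() returns; the strings are placeholders.
sample = {
    'Drama': [12, 'Mon', '19:00-20:30'],
    'Topical Discussion': [15, 'Tue', '14:30-16:00'],
}
seen = set()
print(new_matches(sample, seen, max_week=13, types={'Drama'}))
print(new_matches(sample, seen, max_week=13, types={'Drama'}))
```

In the main loop, SendMail would then be called only for the entries that `new_matches` returns, so an unchanged slot triggers one email instead of one per minute.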