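Preface

Raw dataset

Scraping

Scraping each member's solved-problem count on BUCT

Scraping each member's Codeforces max rating, registration age, and solved count

Scraping each member's effective practice time

Data processing

Models

Linear Regression

XGBoost regression + GridSearchCV tuning

Random Forest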

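Preface

At the start of the course:

Me: Professor, can we do deep learning for the project?

A certain professor: Of course you can.

Our group had already finished the capstone for him: a CNN that classifies plant species.

We even had the results.

Then the professor came back with: "Classifying flowers and grass is too easy, it doesn't have enough code for a capstone." (instant classic)

That same evening: "Hey, why don't you predict awards for our ACM team members!"

Me: (...... words fail me)

And so the following story began.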

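Raw dataset

From the professor I got each student's name, student number, class (not much use), Codeforces id, and AtCoder id (many had none).

After merging, I had roughly 500 records.

One round of cleaning (dropping rows with too much missing information, then deduplicating) left about 400.

Further deletions along the way left fewer than 400 usable records in the end.

Then I drew up some classes,

filled in from the list of past ACM award winners that the professor gave me.

Specifically:

class1: no awards, Lanqiao Cup provincial third prize, or minor filler awards

class2: Lanqiao Cup provincial second prize, Tianti (GPLT) national third prize

class3: ICPC/CCPC bronze, Lanqiao Cup national second/third/merit prize, Tianti national second prize

class4: ICPC/CCPC silver, Lanqiao Cup national first prize, Tianti national first prize

class5: ICPC/CCPC gold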
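The class list can be written down as a lookup table. This is only a minimal sketch: the award keys and the `label_for` helper are hypothetical names made up for illustration, and it assumes (the post does not say) that a member with several awards takes the highest class among them.

```python
# Hypothetical award keys; the class numbers follow the list above.
AWARD_CLASS = {
    "lanqiao_provincial_3": 1,
    "lanqiao_provincial_2": 2,
    "tianti_national_3": 2,
    "icpc_bronze": 3,
    "ccpc_bronze": 3,
    "lanqiao_national_2": 3,
    "lanqiao_national_3": 3,
    "lanqiao_national_merit": 3,
    "tianti_national_2": 3,
    "icpc_silver": 4,
    "ccpc_silver": 4,
    "lanqiao_national_1": 4,
    "tianti_national_1": 4,
    "icpc_gold": 5,
    "ccpc_gold": 5,
}


def label_for(awards):
    """Return the class label (1-5) for a list of award keys.

    No awards, or only awards missing from the table (the "filler"
    prizes), fall into class 1. Assumption: the highest class wins.
    """
    return max((AWARD_CLASS.get(a, 1) for a in awards), default=1)


print(label_for([]))                                       # 1
print(label_for(["lanqiao_provincial_2", "icpc_bronze"]))  # 3
```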

And so this absurd project got under way with great fanfare.

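Scraping

Scraping each member's solved-problem count on BUCT

XPath plus F12 network inspection to find the right tag.

Parse the page with XPath to get the data.

Code below (the cookie is redacted; that one really can't be shared):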

# Target node: //div[@class="extra content"]
import pandas as pd
import requests
from lxml import etree

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
    'Cookie': 'XSYWSND',  # redacted
}


def create_request(name):
    print(name)
    url = 'https://buctcoder.com/userinfo.php?user=' + str(name)
    return requests.get(url=url, headers=HEADERS)


def get_content_html(request):
    request.encoding = 'utf-8'
    return etree.HTML(request.text)


def get_solve_number(content):
    solve_result = content.xpath('//div[@class="extra content"]/a/text()')
    try:
        # The link text is split on spaces; the count is the second token.
        return solve_result[0].split(" ")[1]
    except IndexError:
        # Node missing or text in an unexpected shape: record 0.
        return 0


if __name__ == '__main__':
    ids = []
    solve_set = []
    # Column 3 identifies the member; the student number is column 1
    # in work.csv and column 2 in oldwork.csv.
    for filename, number_col in (('work.csv', 1), ('oldwork.csv', 2)):
        for node in pd.read_csv(filename).values:
            ids.append(node[3])
            request = create_request(int(node[number_col]))
            content = get_content_html(request)
            solve_set.append(get_solve_number(content))
    data = pd.DataFrame({'name': ids, 'buct_solve': solve_set})
    data.to_csv('data_buct_solve.csv', index=False)

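Scraping each member's Codeforces max rating, registration age, and solved count

Locating the max rating in the page

Codeforces is sneaky here:

a blue-rated user's name carries the class user-blue,

while a green-rated user's becomes user-green.

So I wrote a list of colors and tried each one in turn.

Locating the solved count in the page

Locating the registration time in the page

The registration time has to be converted to days according to its unit, otherwise the values are not comparable.

Code below: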

import pandas as pd
import requests
from lxml import etree

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
}

# Approximate conversions kept from the original: 28-day months, 336-day years.
DAYS_PER_UNIT = {'week': 7, 'month': 4 * 7, 'year': 12 * 4 * 7}


def create_request(name):
    url = 'https://codeforces.com/profile/' + str(name)
    return requests.get(url=url, headers=HEADERS)


def get_content_html(request):
    request.encoding = 'utf-8'
    return etree.HTML(request.text)


def get_max_score(content):
    color_set = ['gray', 'green', 'cyan', 'blue', 'violet', 'orange', 'red', 'legendary']
    # Try the strongest rank color first.
    for color in color_set[::-1]:
        result = content.xpath('//span[@class="user-' + color + '"]/text()')
        # With 4 matching text nodes the max rating is the last one,
        # with 2 it is the second.
        if len(result) == 4:
            return int(result[3])
        elif len(result) == 2:
            return int(result[1])
    return 0


def get_solve_problem(content):
    solve_result = content.xpath('//div[@class="_UserActivityFrame_counterValue"]/text()')
    # The first counter holds the solved count followed by a unit word.
    return int(solve_result[0].split(' ')[0])


def get_age_time(content):
    time_result = content.xpath('//span[@class="format-humantime"]/text()')
    parts = time_result[-1].split(' ')
    # Accept singular and plural units; anything else is treated as days.
    return int(parts[0]) * DAYS_PER_UNIT.get(parts[1].rstrip('s'), 1)


def collect(names, maxscore):
    """Combine one member's handles: best rating, summed solves and ages."""
    solvepro = 0
    timeage = 0
    for name in names:
        if not pd.isna(name):
            print(name)
            content = get_content_html(create_request(name))
            maxscore = max(maxscore, get_max_score(content))
            solvepro += get_solve_problem(content)
            timeage += get_age_time(content)
    return maxscore, solvepro, timeage


if __name__ == '__main__':
    rows = []
    # work.csv: up to three handles in columns 4-6; with no handle the rating stays -1.
    for node in pd.read_csv('work.csv').values:
        rows.append((node[3], *collect(node[4:7], -1)))
    # oldwork.csv: one handle in column 6; with no handle the rating stays 0.
    for node in pd.read_csv('oldwork.csv').values:
        rows.append((node[3], *collect([node[6]], 0)))
    data = pd.DataFrame(rows, columns=['name', 'cf_max_rating', 'cf_solve', 'cf_time'])
    data.to_csv('data.csv', index=False)

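Scraping each member's effective practice time

The endpoint https://codeforces.com/api/user.rating?handle=<user id>

returns, as JSON, the account's full contest history, one entry per rated contest.

The timestamps in it are in seconds, so their differences are too.

In other words, subtracting the timestamp of the user's first contest from that of the last one and converting the unit gives the user's effective practice time.

Code below: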

import json

import jsonpath
import pandas as pd
import requests

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
}


def create_request(name):
    url = 'https://codeforces.com/api/user.rating?handle=' + str(name)
    return requests.get(url=url, headers=HEADERS)


def get_content(request):
    try:
        request.encoding = 'utf-8'
        return json.loads(request.text)
    except ValueError:
        # The body was not valid JSON.
        return False


def get_time(content):
    if content is False:
        return 0
    contest_list = jsonpath.jsonpath(content, '$..result..ratingUpdateTimeSeconds')
    if not contest_list:
        # No rated contests (or the lookup failed).
        return 0
    # Seconds between the first and last contest, converted to whole days.
    return (contest_list[-1] - contest_list[0]) // 3600 // 24


def longest_span(names):
    """Longest contest span in days over one member's handles."""
    time_tmp = 0
    for name in names:
        if not pd.isna(name):
            print(name)
            content = get_content(create_request(name))
            time_tmp = max(time_tmp, get_time(content))
    return time_tmp


if __name__ == '__main__':
    rows = []
    # work.csv keeps up to three handles in columns 4-6, oldwork.csv one in column 6.
    for node in pd.read_csv('work.csv').values:
        rows.append((node[3], longest_span(node[4:7])))
    for node in pd.read_csv('oldwork.csv').values:
        rows.append((node[3], longest_span([node[6]])))
    print(rows)
    data = pd.DataFrame(rows, columns=['name', 'real_time'])
    data.to_csv('data_time.csv', index=False)

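Data processing

The scraped data came out as a rather dirty table.

Some values are 0 because of network failures during scraping, some accounts have never entered a single contest, and so on.

At this point we can delete or edit rows as we see fit (there was little enough data to begin with).

Because the later training uses classification models, the continuous features are binned to make them discrete.

Based on how the data looks, we chose equal-width binning.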
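For reference, equal-width binning splits the range between the smallest and largest value into intervals of the same width and replaces each value with the index of its interval. The sketch below is illustrative only: `equal_width_bins` is a hypothetical helper, the ratings are made up, and the number of bins (4) is an assumption, since the post does not give one.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range to split.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


cf_max_rating = [0, 800, 1200, 1450, 1600, 1900, 2400]  # made-up ratings
print(equal_width_bins(cf_max_rating, 4))  # [0, 1, 2, 2, 2, 3, 3]
```

`pandas.cut` with an integer `bins` argument also produces equal-width bins, though its intervals are right-closed by default, so values that sit exactly on a boundary can be assigned differently from this sketch.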

The final data for the regression models looks like this:

And the data for the classification models looks like this:

Models

For the regression models, normalize first.
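The post does not say which normalization was used. A common choice is min-max scaling, which maps each feature column to the range [0, 1]; the sketch below assumes that choice, and `min_max_scale` and the sample counts are made up for illustration.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


cf_solve = [10, 60, 110, 510]  # made-up solved counts
print(min_max_scale(cf_solve))  # [0.0, 0.1, 0.2, 1.0]
```

Scaling each column separately keeps a feature with a large range (such as the solved count) from dominating one with a small range when the regression is fitted.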

This gives the following data:

Linear Regression

First, a quick Linear Regression to test the waters.

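from sklearn.linear_model import LinearRegression

# x_train / y_train come from the data prepared in the previous steps.
lin = LinearRegression()
lin.fit(x_train, y_train)
# Predictions are made on the training data itself.
prediction = lin.predict(x_train)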

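The score really was poor, only 60-something as far as I remember.

ZZX is that strong and the model only rates him at bronze level? (the model takes the blame)

XGBoost regression + GridSearchCV tuning

I forgot to check the score at the time, but it was much higher.

It gave ZZX 3.8, a little low but acceptable.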
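The tuning code is not shown in the post, so here is only a rough sketch of what a grid search with cross-validation looks like. Everything in it is an assumption: the data is synthetic, the grid values and the scoring metric are arbitrary, and scikit-learn's `GradientBoostingRegressor` stands in for the XGBoost regressor so that the example needs nothing beyond scikit-learn. xgboost's `XGBRegressor` follows the same scikit-learn estimator interface, so it should be usable with `GridSearchCV` in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 200 rows, 4 normalized features, labels in 1..5.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = np.clip(np.round(1 + 4 * X[:, 0] + rng.normal(0, 0.3, 200)), 1, 5)

# A deliberately tiny grid; a real search would usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every parameter combination is scored with 5-fold cross-validation.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE of the best setting
```

Because the score comes from held-out folds rather than from the training rows, it also gives a fairer number than predicting on the training set.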

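Next come the classification models.

Random Forest

A random forest.

It reaches at least 85% accuracy on the training set, which is decent enough.

There was no test set for us to try it on.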
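When there is no separate test set, k-fold cross-validation is one common way to estimate accuracy on unseen rows from the training data alone. The sketch below is not the code used in this project: the data is synthetic, and the forest size and fold count are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 400 rows, 4 binned features (0..4), classes 1..5.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(400, 4))
y = np.clip(X[:, 0] + rng.integers(0, 2, size=400), 0, 4) + 1

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold CV: each row is scored by a model that never saw it during training.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

A random forest can fit its own training rows very closely, so training-set accuracy tends to overstate how well it generalizes; the cross-validated mean is usually the more honest figure.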

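Closing

Who would have guessed that, by sheer coincidence, the foreign instructor at the final defense was the same one from my machine learning course defense three days earlier (I hope my accent didn't leave too deep an impression).

Luckily he didn't recognize me, and in the end he even gave our group a bit of praise; maybe the technical write-up was to his taste.

Professor, please don't go too crazy with the grade /doge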
