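Scraping 10 Years of Premier League Data

Introduction: today we scrape a foreign football website for 10 years of Premier League data, mainly the two teams in each match and the time of the first goal.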

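1. Site analysis

URL: https://www.premierleague.com/results (a VPN or proxy may be needed to reach it). The information we want is mainly the two teams in each match and the time of the first goal. Right-clicking and viewing the page source does not show this information, so it is probably loaded dynamically. Press F12 to open the developer tools.

There we find the URL where the data actually lives:
https://footballapi.pulselive.com/football/fixtures?comps=1&compSeasons=79&teams=1,127,131,43,4,6,7,159,26,10,11,12,23,20,42,45,21,33,36,25&page=0&pageSize=40&sort=desc&statuses=C&altIds=true
The response is JSON, with 40 matches per page. The URL is long, and by experiment I found that the teams parameter can be dropped, which gives:
https://footballapi.pulselive.com/football/fixtures?comps=1&compSeasons=79&page=0&pageSize=40&sort=desc&statuses=C&altIds=true

This is the URL for the 17-18 season. Scrolling down the results page loads more matches, and the only change in the URL is that page goes from 0 to 10. To scrape the 16-17 season, look up the corresponding URL in the same way. After checking, I found that the only thing that changes between seasons is the compSeasons value. So scraping several seasons comes down to two nested loops: the outer loop runs over the compSeasons values, and for each season an inner loop runs over page.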
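The two nested loops can be sketched as a small URL generator. This is only a minimal sketch, assuming the query parameters shown in the shortened URL above; build_url and iter_urls are hypothetical helper names introduced here for illustration, and 79 and 54 are the compSeasons values for 17-18 and 16-17.

```python
from urllib.parse import urlencode

BASE_URL = 'https://footballapi.pulselive.com/football/fixtures'


def build_url(season, page, page_size=40):
    """Build the fixtures URL for one season id and one page (hypothetical helper)."""
    params = {
        'comps': 1,
        'compSeasons': season,
        'page': page,
        'pageSize': page_size,
        'sort': 'desc',
        'statuses': 'C',
        'altIds': 'true',
    }
    return BASE_URL + '?' + urlencode(params)


def iter_urls(seasons, pages):
    """Outer loop over season ids, inner loop over page numbers."""
    for season in seasons:
        for page in pages:
            yield build_url(season, page)


# Two seasons, pages 0 to 10 each
urls = list(iter_urls(['79', '54'], range(0, 11)))
print(len(urls))
print(urls[0])
```

Building the query string with urlencode keeps the parameter names in one place, so changing the page size or adding a season does not require editing a long concatenated string.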

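2. JSON data analysis

With the data source identified, the next step is to work out how to pull the parts we need out of the JSON. The data we want is mainly stored in the following places.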
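Judging from the keys read by the scraper in section 3, the relevant part of the response looks roughly like the structure below. This is a hand-written sketch, not a real response: only the key names (content, teams, team, name, goals, clock, label) come from the article's code, while the team names and clock labels are invented, and a real response carries many more fields.

```python
# Hand-written sample that mirrors only the keys used in this article;
# the team names and clock labels are invented for illustration.
sample = {
    'content': [
        {
            'teams': [
                {'team': {'name': 'Home FC'}},
                {'team': {'name': 'Away United'}},
            ],
            'goals': [
                {'clock': {'label': "34'00"}},
                {'clock': {'label': "12'00"}},
            ],
        },
        {
            'teams': [
                {'team': {'name': 'Away United'}},
                {'team': {'name': 'Home FC'}},
            ],
            'goals': [],  # a goalless match has an empty list
        },
    ]
}

for match in sample['content']:
    home = match['teams'][0]['team']['name']   # first entry: home side
    away = match['teams'][1]['team']['name']   # second entry: away side
    labels = [goal['clock']['label'] for goal in match['goals']]
    print(home, away, labels)
```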

3. Python code

Once the data locations are known we can fetch the data. The code is as follows:

import csv
import json
import re
import time

import requests
from requests.exceptions import RequestException

headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Origin': 'https://www.premierleague.com',
    'Referer': 'https://www.premierleague.com/results?co=1&se=17&cl=-1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.17 Safari/537.36',
}


def get_page(url):
    """Fetch a URL and return the response body as text, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as err:
        print('Failed to fetch page')
        print(err)
        return None


def parse_page(url):
    """Return a list of (home team, first-goal minute, away team) tuples."""
    html = get_page(url)
    if html is None:
        # Nothing was fetched, so there is nothing to parse
        return []
    data = json.loads(html)
    info = []
    for match in data['content']:
        home_name = match['teams'][0]['team']['name']
        away_name = match['teams'][1]['team']['name']
        minutes = []
        for goal in match['goals']:
            # Keep only the first two characters of the clock label and read
            # the digits as an int, so min() compares minutes numerically
            found = re.match(r'\d+', goal['clock']['label'][:2])
            if found:
                minutes.append(int(found.group()))
        # Matches without goals are recorded as the string 'None'
        first_time = min(minutes) if minutes else 'None'
        info.append((home_name, first_time, away_name))
    return info


def write2csv(info):
    """Append the rows to all_data.csv."""
    with open('all_data.csv', 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerows(info)


if __name__ == '__main__':
    # compSeasons value for each season, taken from the URLs on the site
    # (the optional teams=... parameter is left out):
    # 18-19: 210   17-18: 79   16-17: 54   15-16: 42   14-15: 27   13-14: 22
    # 12-13: 21    11-12: 20   10-11: 19   09-10: 18   08-09: 17
    # https://www.premierleague.com/match/6706
    seasons = ['17', '18', '19', '20', '21', '22', '27', '42', '54', '79', '210']
    for season in seasons:
        # Pages 0 to 10, as observed in the site analysis
        for page in range(0, 11):
            time.sleep(3)
            print('Currently at', season, page)
            url = ('https://footballapi.pulselive.com/football/fixtures?comps=1&compSeasons='
                   + str(season) + '&page=' + str(page)
                   + '&pageSize=40&sort=desc&statuses=C&altIds=true')
            info = parse_page(url)
            write2csv(info)

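The scraped data is written to a CSV file. For the goal time I kept only the first two characters, which makes it easier to use later. For goalless matches, where there is no first goal, I write None instead.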
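One caveat about keeping only the first two characters: min() on strings compares character by character, so a single-digit minute can sort after a two-digit one. Below is a minimal sketch of a numeric alternative, assuming the label begins with the minute written as digits. The labels are invented examples, since the exact label format is not shown in this article, and minute_of is a hypothetical helper.

```python
import re


def minute_of(label):
    """Return the leading run of digits in a clock label as an int, or None."""
    found = re.match(r'\d+', label)
    return int(found.group()) if found else None


labels = ["9'00", "12'00", "45 +2'00"]   # invented examples

# Comparing the two-character prefixes as text
as_text = min(label[:2] for label in labels)
# Comparing the parsed minutes as numbers
as_number = min(minute_of(label) for label in labels)

print(as_text)     # string comparison picks '12'
print(as_number)   # numeric comparison picks 9
```

Storing the minute as an integer also means the later int() conversion in the analysis step cannot fail on a stray quote character.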

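4. Data analysis

After collecting the data, I tabulated the first-goal time for each home team to see how the time of the first goal is distributed in each team's home matches. First, the code gathers the first-goal time of every match for each team: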

import csv
import os

from matplotlib import pyplot as plt
import numpy as np

# This part splits the combined CSV by team, creating one home-match file
# per team name. It is run once and then not used again.
# zhudui_list = []  # zhudui = home team
# with open('all_data.csv', 'r', encoding='utf-8-sig') as f:
#     reader = csv.reader(f)
#     for line in reader:
#         zhudui = line[0]
#         zhudui_list.append(zhudui)
#         # Open in append mode so each team's file collects one row per home match
#         with open('./mainname/' + zhudui + '.csv', 'a', newline='', encoding='utf-8-sig') as e:
#             writer = csv.writer(e)
#             writer.writerow([line[1]])

# Make sure the output folder for the figures exists
os.makedirs('Pictures', exist_ok=True)

# The data read here comes from the previous part of the code
for csvfile in os.listdir('mainname/'):
    print(csvfile)
    with open('mainname/' + csvfile, 'r', encoding='utf-8-sig') as f:
        time_list = []
        reader = csv.reader(f)
        for line in reader:
            if line[0] == 'None':
                # Matches without a goal are plotted at 100
                time_list.append(100)
            else:
                time_list.append(int(line[0]))
    plt.figure(figsize=(12, 6))
    plt.title(csvfile[:-4])  # file name without '.csv' is the team name
    plt.hist(time_list, rwidth=0.95, bins=20)
    plt.xticks(np.arange(0, 100, 5))
    plt.xlabel('min')
    plt.savefig('Pictures/' + csvfile[:-4] + '.tiff', dpi=60)
    plt.show()
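The one-off splitting step (the commented-out part of the code) boils down to grouping rows by home team. A minimal in-memory sketch of the same idea is shown here, using invented rows in the home team, first-goal time, away team column order of all_data.csv; group_by_home is a hypothetical helper, not part of the original script.

```python
from collections import defaultdict


def group_by_home(rows):
    """Map each home team to the list of its first-goal times ('None' kept as-is)."""
    grouped = defaultdict(list)
    for home, first_goal, _away in rows:
        grouped[home].append(first_goal)
    return dict(grouped)


# Invented rows in the same column order as all_data.csv
rows = [
    ('Home FC', '12', 'Away United'),
    ('Home FC', 'None', 'Third Town'),
    ('Away United', '45', 'Home FC'),
]
print(group_by_home(rows))
```

Grouping in memory avoids reopening a per-team file for every row, at the cost of holding the whole table in memory, which is not a concern for a few thousand matches.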

5. Results

There is a chart for every team, so I will not show them all here.
