This is a Scrapy crawler walkthrough; please have your Scrapy environment ready before you start.
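If the environment is not set up yet, something like `pip install scrapy pymysql` (the exact command depends on your Python setup, so treat it as an assumption) installs the two packages used in this post: Scrapy itself and the MySQL driver used by the pipeline.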

  • Get the URL to crawl
  • Preliminary crawler setup
  • Open the project in PyCharm and write the spider files
  • Run the crawler

Get the URL to crawl
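The player list used below comes from the JSON endpoint https://china.nba.com/static/data/league/playerlist.json, the kind of URL you typically find in the browser's developer tools (Network tab) on the NBA China player-list page. Before writing any Scrapy code it is worth sanity-checking the endpoint and the field names the spider will rely on. The snippet below is a minimal sketch, not part of the Scrapy project; it assumes the requests package is installed and that the endpoint still returns the payload/players structure used later in the spider.

import json

import requests

# Endpoint found via the browser's network tab (assumption: still live and unchanged)
URL = 'https://china.nba.com/static/data/league/playerlist.json'

resp = requests.get(URL, timeout=10)
players = json.loads(resp.text)['payload']['players']

# Peek at the first player to confirm the keys the spider expects
first = players[0]
print(first['playerProfile']['firstNameEn'], first['playerProfile']['lastNameEn'])
print(first['teamProfile']['displayConference'], first['teamProfile']['division'])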



Preliminary crawler setup
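Assuming Scrapy is already installed, the project skeleton this post works in can be generated with the standard Scrapy CLI; the project name nbaProject, the spider name nbaSpider, and the domain nba.com all match the code in the next section.

scrapy startproject nbaProject
cd nbaProject
scrapy genspider nbaSpider nba.com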

Open the project in PyCharm and write the spider files
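After scrapy startproject the files edited below sit in the usual Scrapy layout. The spider filename is an assumption (only the class name and spider name appear in the original code); everything else is the standard generated structure.

nbaProject/
    scrapy.cfg
    nbaProject/
        items.py          # field definitions (step 1)
        pipelines.py      # MySQL pipeline (step 4)
        settings.py       # project settings (step 3)
        spiders/
            nbaSpider.py  # the spider (step 2)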

  1. The item definitions (items.py)
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NbaprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # every field is declared with the same fixed form -> scrapy.Field()
    # English name
    engName = scrapy.Field()
    # Chinese name
    chName = scrapy.Field()
    # Height
    height = scrapy.Field()
    # Weight
    weight = scrapy.Field()
    # Country (English)
    contryEn = scrapy.Field()
    # Country (Chinese)
    contryCh = scrapy.Field()
    # Years in the NBA
    experience = scrapy.Field()
    # Jersey number
    jerseyNo = scrapy.Field()
    # Draft year
    draftYear = scrapy.Field()
    # Team name (English)
    engTeam = scrapy.Field()
    # Team name (Chinese)
    chTeam = scrapy.Field()
    # Position
    position = scrapy.Field()
    # Conference
    displayConference = scrapy.Field()
    # Division
    division = scrapy.Field()
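Every attribute is declared with the same scrapy.Field() call; at run time an NbaprojectItem behaves like a dict, which is why the spider below can fill it with plain item['engName'] = ... assignments.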
  2. The spider file
import scrapy
import json
from nbaProject.items import NbaprojectItem


class NbaspiderSpider(scrapy.Spider):
    name = 'nbaSpider'
    allowed_domains = ['nba.com']
    # The URL(s) crawled first; more than one can be listed
    # start_urls = ['http://nba.com/']
    start_urls = ['https://china.nba.com/static/data/league/playerlist.json']

    # Handle the response for each start URL
    def parse(self, response):
        # The site returns JSON, so parse it with the json module first
        data = json.loads(response.text)['payload']['players']
        # Counter for the progress message
        count = 1
        for i in data:
            # Create an item object to hold the fields;
            # this is the NbaprojectItem imported above
            item = NbaprojectItem()
            # English name
            item['engName'] = str(i['playerProfile']['firstNameEn'] + i['playerProfile']['lastNameEn'])
            # Chinese name
            item['chName'] = str(i['playerProfile']['firstName'] + i['playerProfile']['lastName'])
            # Country (English)
            item['contryEn'] = str(i['playerProfile']['countryEn'])
            # Country (Chinese)
            item['contryCh'] = str(i['playerProfile']['country'])
            # Height
            item['height'] = str(i['playerProfile']['height'])
            # Weight
            item['weight'] = str(i['playerProfile']['weight'])
            # Years in the NBA
            item['experience'] = str(i['playerProfile']['experience'])
            # Jersey number
            item['jerseyNo'] = str(i['playerProfile']['jerseyNo'])
            # Draft year
            item['draftYear'] = str(i['playerProfile']['draftYear'])
            # Team name (English)
            item['engTeam'] = str(i['teamProfile']['code'])
            # Team name (Chinese)
            item['chTeam'] = str(i['teamProfile']['displayAbbr'])
            # Position
            item['position'] = str(i['playerProfile']['position'])
            # Conference
            item['displayConference'] = str(i['teamProfile']['displayConference'])
            # Division
            item['division'] = str(i['teamProfile']['division'])
            # Print progress
            print("Sent", count, "items")
            count += 1
            # Hand the item back to the engine -> on to the pipeline
            yield item
  3. The settings file: enable the pipeline
# Scrapy settings for nbaProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# ---------- section left unchanged ----------
BOT_NAME = 'nbaProject'

SPIDER_MODULES = ['nbaProject.spiders']
NEWSPIDER_MODULE = 'nbaProject.spiders'
# ---------- section left unchanged ----------

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'nbaProject (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ---------- modified section (look up this setting yourself if you want the details) ----------
# ROBOTSTXT_OBEY = True
# ---------- modified section ----------

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'nbaProject.middlewares.NbaprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'nbaProject.middlewares.NbaprojectDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the item pipeline
# ---------- modified section ----------
ITEM_PIPELINES = {
   'nbaProject.pipelines.NbaprojectPipeline': 300,
}
# ---------- modified section ----------

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
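Two notes on the modified parts: commenting out ROBOTSTXT_OBEY = True falls back to Scrapy's built-in default of False, so the spider will not check robots.txt before requesting the JSON endpoint; and the 300 next to NbaprojectPipeline is a priority between 0 and 1000 (lower runs first), so with a single pipeline its exact value does not matter.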
  4. The pipeline file: write the fields to MySQL
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class NbaprojectPipeline:
    # Constructor
    def __init__(self):
        # Connect to the database -- replace the placeholders with your own
        # connection details (3306 is MySQL's default port; change it if yours differs)
        self.connect = pymysql.connect(host='your-host', user='your-user',
                                       passwd='your-password', db='your-db', port=3306)
        # Get a cursor
        self.cursor = self.connect.cursor()
        # Create a table to hold the item fields
        createTableSql = """
            create table if not exists `nbaPlayer`(
                playerId INT UNSIGNED AUTO_INCREMENT,
                engName varchar(80),
                chName varchar(20),
                height varchar(20),
                weight varchar(20),
                contryEn varchar(50),
                contryCh varchar(20),
                experience int,
                jerseyNo int,
                draftYear int,
                engTeam varchar(50),
                chTeam varchar(50),
                position varchar(50),
                displayConference varchar(50),
                division varchar(50),
                primary key(playerId)
            )charset=utf8;
        """
        # Run the SQL statement
        self.cursor.execute(createTableSql)
        self.connect.commit()
        print("Table created")

    # Every item yielded by the spider is handled here
    def process_item(self, item, spider):
        # Print the item so progress is visible
        print(item)
        # SQL statement
        insert_sql = """
            insert into nbaPlayer(
                playerId, engName, chName, height, weight, contryEn, contryCh,
                experience, jerseyNo, draftYear, engTeam, chTeam, position,
                displayConference, division
            ) VALUES (null, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        # Insert the row
        # (arguments: the SQL statement, and the item fields that fill its placeholders)
        self.cursor.execute(insert_sql, (item['engName'], item['chName'], item['height'], item['weight'],
                                         item['contryEn'], item['contryCh'], item['experience'], item['jerseyNo'],
                                         item['draftYear'], item['engTeam'], item['chTeam'], item['position'],
                                         item['displayConference'], item['division']))
        # Commit -- without this nothing is saved to the database
        self.connect.commit()
        print("Row committed!")
        return item
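The pipeline above never closes the MySQL connection. A close_spider hook is a natural addition; the method below is a sketch that is not in the original code, using only the standard Scrapy pipeline hook and the pymysql objects already created in __init__.

    # Hypothetical addition: called once when the spider finishes,
    # so the cursor and connection are released cleanly
    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()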

Run the crawler
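From the project root, `scrapy crawl nbaSpider` starts the spider. If you would rather launch it from PyCharm with an ordinary Run configuration, a small launcher script (a common convenience, not part of the generated project) does the same thing:

# run.py -- place it next to scrapy.cfg and run it like any Python script
from scrapy import cmdline

# Equivalent to typing "scrapy crawl nbaSpider" in the project root
cmdline.execute("scrapy crawl nbaSpider".split())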

Data scrolling by on the screen

Check the data in the database
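A quick way to confirm the rows landed, without leaving Python, is to query the nbaPlayer table created by the pipeline; the connection details below are placeholders, just as in the pipeline above.

import pymysql

# Use the same connection details as the pipeline (placeholders here)
connect = pymysql.connect(host='your-host', user='your-user', passwd='your-password',
                          db='your-db', port=3306)
cursor = connect.cursor()

# Count the rows and show a few players
cursor.execute("select count(*) from nbaPlayer")
print("rows:", cursor.fetchone()[0])

cursor.execute("select chName, engName, chTeam, jerseyNo from nbaPlayer limit 5")
for row in cursor.fetchall():
    print(row)

cursor.close()
connect.close()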

And just like that, the player data has been scraped back home~

Original content takes effort; please leave the author a small like~
