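Funds are a popular way to invest, and using Python to screen funds by their historical gains and losses is a practical way to narrow down the choice. This article uses bond funds (stable and relatively low-risk) as the example: it scrapes the fund data and then applies a simple selection strategy to pick funds.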

1. Database preparation

1.1 Installing MySQL on Ubuntu

Using Ubuntu as the example, first install the MySQL database by running the following three commands:

sudo apt-get install mysql-server
sudo apt install mysql-client
sudo apt install libmysqlclient-dev

Run the following command to check that the installation succeeded:

sudo netstat -tap | grep mysql

If the output includes a mysqld entry in the LISTEN state, the installation succeeded.

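Log in to mysql with the following command and type your password at the prompt (a password written after "-p" and a space would be read as a database name):

mysql -u root -p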

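1.2 Setting the MySQL password (skip this step if you already set one during installation)

The installer may skip the password prompt, which leaves you without a known root password. In that case, reset the password for the root user. First read the maintenance credentials: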

sudo cat /etc/mysql/debian.cnf

The file lists an account and its password.

Log in to mysql with that account and password, then change the root password:

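use mysql;    # switch to the mysql database
update mysql.user set authentication_string=password('your-new-password') where user='root' and Host='localhost';    # change the password
update user set plugin="mysql_native_password" where user='root';    # limited to root so other accounts keep their plugin
flush privileges;
quit;
# Note: password() exists in MySQL 5.7 but was removed in MySQL 8.0.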

After restarting the MySQL server, you can log in directly with the root account.

1.3 Allowing remote access to MySQL

First edit the file /etc/mysql/mysql.conf.d/mysqld.cnf:

sudo vi /etc/mysql/mysql.conf.d/mysqld.cnf

Comment out the line "bind-address = 127.0.0.1" by putting a # in front of it.
Save and exit, then log in to mysql and run the grant commands:

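grant all on *.* to root@'%' identified by 'your-password' with grant option;
flush privileges;
# Note: "grant ... identified by" works on MySQL 5.7; MySQL 8.0 removed it, so there the account has to be created with CREATE USER before granting.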

Then run quit to leave the mysql client, and restart MySQL with the following command:

sudo service mysql restart

The MySQL service on the Ubuntu host can now be reached over a remote connection.
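Before debugging credentials, it can help to confirm from the client machine that the MySQL port is reachable at all. The following is a minimal sketch using only the standard library; the helper name port_is_open is made up for illustration, and 3306 is assumed to be the port because it is the one used later in this article.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, port_is_open('your-ubuntu-host', 3306) returning False usually points at bind-address or a firewall rather than at the MySQL account.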

1.4 Creating the database and table

Log in to mysql, then create the database and the table in turn:

create database funds;
use funds;
CREATE TABLE funds (
    code varchar(6) NOT NULL,
    day varchar(25) NULL,
    dayOfGrowth double(10,2) NULL,
    fromBuild double(10,2) NULL,
    fromThisYear double(10,2) NULL,
    name varchar(25) NULL,
    recent1Month double(10,2) NULL,
    recent1Week double(10,2) NULL,
    recent1Year double(10,2) NULL,
    recent2Year double(10,2) NULL,
    recent3Month double(10,2) NULL,
    recent3Year double(10,2) NULL,
    recent6Month double(10,2) NULL,
    serviceCharge double(10,2) NULL,
    unitNetWorth double(14,4) NULL,
    upEnoughAmount varchar(10) NULL
);
alter table funds convert to character set utf8;  # Chinese text may otherwise be stored garbled, so set the character set explicitly.

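2. Crawling Tiantian Fund (天天基金) data with Scrapy

2.1 Creating the Scrapy project

Change to the directory where you want to keep the project and create it: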

scrapy startproject funds

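After the project is created you will find several generated files. Here is what each of them is for:

scrapy.cfg: project configuration, mainly basic settings for the Scrapy command-line tool (the crawler's actual settings live in settings.py)
items.py: data templates used to structure the scraped data, similar to a Django Model
pipelines.py: data processing, typically persisting the structured data
settings.py: configuration such as recursion depth, concurrency and download delay
spiders: the spider directory, where you create files and write the crawling rules

You can now open the project in PyCharm and start developing.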

2.2 The code

Create a .py file anywhere inside the funds project, for example scrapy_start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl fundsList".split())

Step 1: define the item (items.py)

import scrapy


class FundsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    code = scrapy.Field()            # fund code
    name = scrapy.Field()            # fund name
    unitNetWorth = scrapy.Field()    # net asset value per unit
    day = scrapy.Field()             # date
    dayOfGrowth = scrapy.Field()     # daily growth rate
    recent1Week = scrapy.Field()     # last week
    recent1Month = scrapy.Field()    # last month
    recent3Month = scrapy.Field()    # last three months
    recent6Month = scrapy.Field()    # last six months
    recent1Year = scrapy.Field()     # last year
    recent2Year = scrapy.Field()     # last two years
    recent3Year = scrapy.Field()     # last three years
    fromThisYear = scrapy.Field()    # year to date
    fromBuild = scrapy.Field()       # since inception
    serviceCharge = scrapy.Field()   # service charge
    upEnoughAmount = scrapy.Field()  # minimum purchase amount

Step 2: write the spider

import json

import scrapy

from funds.items import FundsItem


class FundsSpider(scrapy.Spider):
    name = 'fundsList'  # Unique name that identifies the spider; this is the name passed to "scrapy crawl"
    # Domains the spider may visit. 'eastmoney.com' also covers the API host
    # fundapi.eastmoney.com, which 'fund.eastmoney.com' would not.
    allowed_domains = ['eastmoney.com']

    # Initial URLs. When crawling starts from start_urls, the server's responses
    # are passed to parse(self, response) automatically.
    # Note: this page gives direct access to the data of all funds.
    # start_urls = ['http://fundact.eastmoney.com/banner/pg.html#ln']

    # custom_settings overrides settings per spider, whereas settings.py is global.
    # It becomes useful once a Scrapy project contains several spiders.
    # custom_settings = {
    #
    # }

    # A spider's initial requests come from start_requests(), which by default reads
    # start_urls and generates Requests with parse as the callback.
    # Overriding start_requests means no Requests are generated from start_urls.
    def start_requests(self):
        url = 'https://fundapi.eastmoney.com/fundtradenew.aspx?ft=zq&sc=1n&st=desc&pi=1&pn=3000&cp=&ct=&cd=&ms=&fr=&plevel=&fst=&ftype=&fr1=&fl=0&isab='
        # 'ft=zq' in the URL above selects bond funds. Change it to pg, gp, hh, zs, QDII or LOF
        # (equity-oriented, stock, hybrid, index, QDII, LOF) as needed.
        yield scrapy.Request(url, callback=self.parse_funds_list)

    def parse_funds_list(self, response):
        datas = response.body.decode('UTF-8')
        # Extract the JSON part: from the first '{' up to and including the first '}'
        datas = datas[datas.find('{'):datas.find('}') + 1]
        # Wrap each field name in double quotes so the text becomes valid JSON
        datas = datas.replace('datas', '"datas"')
        datas = datas.replace('allRecords', '"allRecords"')
        datas = datas.replace('pageIndex', '"pageIndex"')
        datas = datas.replace('pageNum', '"pageNum"')
        datas = datas.replace('allPages', '"allPages"')
        jsonBody = json.loads(datas)
        jsonDatas = jsonBody['datas']
        fundsItems = []
        for data in jsonDatas:
            fundsItem = FundsItem()
            fundsArray = data.split('|')  # each record is one '|'-separated string
            fundsItem['code'] = fundsArray[0]
            fundsItem['name'] = fundsArray[1]
            fundsItem['day'] = fundsArray[3]
            fundsItem['unitNetWorth'] = fundsArray[4]
            fundsItem['dayOfGrowth'] = fundsArray[5]
            fundsItem['recent1Week'] = fundsArray[6]
            fundsItem['recent1Month'] = fundsArray[7]
            fundsItem['recent3Month'] = fundsArray[8]
            fundsItem['recent6Month'] = fundsArray[9]
            fundsItem['recent1Year'] = fundsArray[10]
            fundsItem['recent2Year'] = fundsArray[11]
            fundsItem['recent3Year'] = fundsArray[12]
            fundsItem['fromThisYear'] = fundsArray[13]
            fundsItem['fromBuild'] = fundsArray[14]
            fundsItem['serviceCharge'] = fundsArray[18]
            fundsItem['upEnoughAmount'] = fundsArray[24]
            fundsItems.append(fundsItem)
        return fundsItems
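The parsing step in the spider can be tried without Scrapy or a network connection. The sketch below is a variant, not the spider's exact code: it quotes the keys with one regular expression instead of repeated str.replace calls, so that a key name appearing inside a record would not be touched. The response shape (a JavaScript assignment with unquoted keys) is inferred from the spider, the variable name rankData is an assumption, and the sample records are made up.

```python
import json
import re

# Made-up sample in the shape the spider expects; not real fund data.
SAMPLE = ('var rankData = {datas:["000001|Demo Bond A|zq|2020-01-02|1.0500|0.10",'
          '"000002|Demo Bond B|zq|2020-01-02|1.2000|0.05"],'
          'allRecords:2,pageIndex:1,pageNum:3000,allPages:1};')

KEYS = ("datas", "allRecords", "pageIndex", "pageNum", "allPages")


def parse_rank_data(text):
    """Return the records as lists of fields split on '|'."""
    # Keep only the object literal between the first '{' and the last '}'
    body = text[text.find("{"):text.rfind("}") + 1]
    # Quote a key only where it directly follows '{' or ',' and precedes ':'
    pattern = r'([{,])\s*(%s)\s*:' % "|".join(KEYS)
    body = re.sub(pattern, r'\1"\2":', body)
    return [row.split("|") for row in json.loads(body)["datas"]]


rows = parse_rank_data(SAMPLE)
print(rows[0][0], rows[0][1])
```

Each inner list corresponds to fundsArray in the spider, so the same indexes (0 for the code, 1 for the name, and so on) apply.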

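Step 3: configure settings.py

custom_settings can override settings for an individual spider, whereas everything in settings.py is global. It becomes useful once a Scrapy project contains several spiders.
This project currently has only one spider, so for now the spider is configured in settings.py.
DEFAULT_REQUEST_HEADERS is set as follows (because this crawl only calls an API endpoint, it also works without this setting):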

DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'st_pvi=72856792768813; UM_distinctid=1604442b00777b-07f0a512f81594-5e183017-100200-1604442b008b52; qgqp_b_id=f10107e9d27d5fe2099a361a52fcb296; st_si=08923516920112; ASP.NET_SessionId=s3mypeza3w34uq2zsnxl5azj',
    'Host': 'fundapi.eastmoney.com',
    'Referer': 'http://fundact.eastmoney.com/banner/pg.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

Set ITEM_PIPELINES:

ITEM_PIPELINES = {
    'funds.pipelines.FundsPipeline': 300,
}

pipelines.py, which writes the data to my local database:

import pymysql.cursors


class FundsPipeline(object):
    def process_item(self, item, spider):
        # Connect to the database
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='root',
                                     database='funds',
                                     charset='utf8mb4',
                                     cursorclass=pymysql.cursors.DictCursor)
        # %s placeholders let the driver quote the values, so a quote character
        # in a fund name cannot break the statement
        sql = ("INSERT INTO funds(code,name,unitNetWorth,day,dayOfGrowth,recent1Week,recent1Month,"
               "recent3Month,recent6Month,recent1Year,recent2Year,recent3Year,fromThisYear,fromBuild,"
               "serviceCharge,upEnoughAmount) "
               "VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        values = (item['code'], item['name'], item['unitNetWorth'], item['day'], item['dayOfGrowth'],
                  item['recent1Week'], item['recent1Month'], item['recent3Month'], item['recent6Month'],
                  item['recent1Year'], item['recent2Year'], item['recent3Year'], item['fromThisYear'],
                  item['fromBuild'], item['serviceCharge'], item['upEnoughAmount'])
        try:
            with connection.cursor() as cursor:
                cursor.execute(sql, values)  # run the statement
            connection.commit()  # commit the insert
        finally:
            connection.close()
        return item
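The table in section 1.4 declares most columns as DOUBLE, while the spider stores every field as a string. If the API returns an empty string or a placeholder for a period a fund has not yet existed (an assumption, not verified here), MySQL may reject or silently coerce the value. A minimal sketch of a cleaning step that could run at the start of process_item follows; the names NUMERIC_FIELDS, to_number and clean_item are made up for illustration.

```python
# Columns declared as DOUBLE in the CREATE TABLE statement of section 1.4
NUMERIC_FIELDS = (
    "unitNetWorth", "dayOfGrowth", "recent1Week", "recent1Month",
    "recent3Month", "recent6Month", "recent1Year", "recent2Year",
    "recent3Year", "fromThisYear", "fromBuild", "serviceCharge",
)


def to_number(value):
    """Convert a scraped string to float; return None if it is empty or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_item(item):
    """Return a plain dict copy of item with the numeric fields converted."""
    cleaned = dict(item)
    for field in NUMERIC_FIELDS:
        if field in cleaned:
            cleaned[field] = to_number(cleaned[field])
    return cleaned
```

A None value is sent to MySQL as NULL, which the nullable DOUBLE columns accept.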

Run python scrapy_start.py to start crawling the fund data and storing it in the MySQL database.

3. Writing the strategy

import pandas as pd
import pymysql


# Connect to the database and load the funds table into a DataFrame
def connect_mysql(host, username, password, port, database):
    conn = pymysql.connect(host=host, user=username, password=password,
                           port=port, database=database, charset='utf8')
    cur = conn.cursor()  # create a cursor
    sql = "select * from funds;"  # the table created in section 1.4
    cur.execute(sql)  # run the query
    data = cur.fetchall()  # fetch all rows
    col = [i[0] for i in cur.description]  # column names
    conn.close()
    zq_frame = pd.DataFrame(list(map(list, data)), columns=col)
    return zq_frame


# Replace 'your-mysql-host' with the address of your MySQL server
funds_zq = connect_mysql('your-mysql-host', 'root', 'root', 3306, 'funds')


# The strategy sets a range for the last week, month, half year and so on; adjust the
# ranges as you see fit. This is only a simple, conservative screen.
def strategy(frame):
    frame = frame[(frame['dayOfGrowth'] > 0) & (frame['dayOfGrowth'] < 0.25)]
    frame = frame[(frame['recent1Month'] > 0.7) & (frame['recent1Month'] < 1.5)]
    frame = frame[(frame['recent1Week'] > 0.1) & (frame['recent1Week'] < 0.5)]
    frame = frame[(frame['recent1Year'] > 6) & (frame['recent1Year'] < 15)]
    frame = frame[(frame['recent2Year'] > 12) & (frame['recent2Year'] < 30)]
    frame = frame[(frame['recent3Month'] > 2) & (frame['recent3Month'] < 4)]
    frame = frame[(frame['recent6Month'] > 3.5) & (frame['recent6Month'] < 7)]
    frame = frame[frame['serviceCharge'] < 0.1]
    return frame


result = strategy(funds_zq)
print(result)
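Because every condition in the strategy is an open range on one column, the thresholds can also be kept in a single table, which makes them easier to adjust. The following is a dependency-free sketch on plain dictionaries rather than a DataFrame; the names RANGES, passes and select are made up for illustration, and the sample rows are invented, not real funds.

```python
# (lower, upper) bounds are exclusive, matching the > and < comparisons in strategy()
RANGES = {
    "dayOfGrowth": (0, 0.25),
    "recent1Week": (0.1, 0.5),
    "recent1Month": (0.7, 1.5),
    "recent3Month": (2, 4),
    "recent6Month": (3.5, 7),
    "recent1Year": (6, 15),
    "recent2Year": (12, 30),
}
MAX_SERVICE_CHARGE = 0.1


def passes(row, ranges=RANGES, max_charge=MAX_SERVICE_CHARGE):
    """Return True if the row lies inside every range and the charge is below the cap."""
    for column, (low, high) in ranges.items():
        value = row.get(column)
        if value is None or not (low < value < high):
            return False
    charge = row.get("serviceCharge")
    return charge is not None and charge < max_charge


def select(rows):
    """Keep only the rows that pass the screen."""
    return [row for row in rows if passes(row)]


# Invented rows: DEMO01 is inside every range, DEMO02 has too high a one-year return
base = {"dayOfGrowth": 0.1, "recent1Week": 0.3, "recent1Month": 1.0,
        "recent3Month": 3.0, "recent6Month": 5.0, "recent1Year": 8.0,
        "recent2Year": 16.0, "serviceCharge": 0.08}
SAMPLE_ROWS = [dict(base, code="DEMO01"), dict(base, code="DEMO02", recent1Year=20.0)]
print([row["code"] for row in select(SAMPLE_ROWS)])
```

Rows with a missing value are dropped here, which is also what the DataFrame comparisons do with NaN.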

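The screen ends up selecting two funds.

Let's look at these two funds in more detail.

Both funds show steady growth, with annualized returns of roughly 7% to 8%, about three times what Yu'e Bao paid at the time of writing.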
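An annualized figure can be derived from the multi-year columns. The sketch below assumes that recent2Year and recent3Year hold cumulative returns in percent over the whole period, which the source does not state explicitly; the function name annualized is made up for illustration. It applies the usual compound-growth formula (1 + r)^(1/n) - 1.

```python
def annualized(cumulative_pct, years):
    """Convert a cumulative return in percent over `years` years to a compound annual rate in percent."""
    growth = 1 + cumulative_pct / 100.0
    return (growth ** (1.0 / years) - 1) * 100.0


# A cumulative 16% over two years is about 7.7% per year
print(round(annualized(16, 2), 2))
```

Under that assumption, the strategy's two-year range of 12 to 30 corresponds to roughly 5.8% to 14.0% per year.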
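This article focuses on bond funds. If you can tolerate more risk, simply switch the crawl to equity-oriented or stock funds. I hope you find it helpful.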
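Switching fund types only means changing the ft parameter of the API URL used in the spider. A minimal sketch that rebuilds that URL from its parts is shown below; the function name build_url is made up, and reading pi as the page index and pn as the page size is an assumption based on the parameter names.

```python
from urllib.parse import urlencode

BASE_URL = "https://fundapi.eastmoney.com/fundtradenew.aspx"

# Type codes as listed in the spider's comment
FUND_TYPES = {
    "zq": "bond", "pg": "equity-oriented", "gp": "stock",
    "hh": "hybrid", "zs": "index", "QDII": "QDII", "LOF": "LOF",
}


def build_url(fund_type="zq", page=1, page_size=3000):
    """Build the list URL for one fund type, keeping the other parameters as in the spider."""
    if fund_type not in FUND_TYPES:
        raise ValueError("unknown fund type: %r" % (fund_type,))
    params = [
        ("ft", fund_type), ("sc", "1n"), ("st", "desc"),
        ("pi", page), ("pn", page_size),
        ("cp", ""), ("ct", ""), ("cd", ""), ("ms", ""), ("fr", ""),
        ("plevel", ""), ("fst", ""), ("ftype", ""), ("fr1", ""),
        ("fl", 0), ("isab", ""),
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_url("gp"))
```

With the default arguments the function reproduces the bond-fund URL hard-coded in start_requests.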

