Quantitative Stock Screening in Python

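What is stock selection?

Stock selection is an active investment strategy: each stock's prospects are first analysed with some rule or algorithm, then a portfolio is built and held for the long term. The stocks in the portfolio are generally expected to have low correlation with one another, so that diversification lowers the portfolio's overall risk. A portfolio of highly correlated stocks can drop sharply when the broad market weakens.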

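Which model do we use?

Academics have proposed many models for choosing stocks, the most classic being Markowitz's portfolio theory. Here we use the MM trend model (Mark Minervini's Trend Template), a technical screening method proposed by a well-known US trader. The core idea is to measure a stock's momentum with technical indicators, pick the stocks with the most potential, then buy and hold them.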

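The MM trend model

The price is above the 150-day and 200-day moving averages
The 150-day moving average is above the 200-day moving average
The 200-day moving average has been rising for at least 1 month
The 50-day moving average is above the 150-day and 200-day moving averages
The price is above the 50-day moving average
The price is at least 30% above its 52-week low
The price is within 25% of its 52-week high
Relative strength (RS) is at least 70. Relative strength here compares the stock with the broad market: RS = stock 1-year return / benchmark index 1-year return, expressed as a percentage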
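
The relative-strength and price-range rules reduce to simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and refusing a non-positive benchmark return is an added assumption rather than part of the template (the ratio changes sign and stops being comparable in that case).

```python
def relative_strength(stock_ret: float, benchmark_ret: float) -> float:
    """RS as a percentage: stock 1-year return over benchmark 1-year return."""
    if benchmark_ret <= 0:
        # Assumption for this sketch: the ratio is not meaningful here
        raise ValueError("benchmark return must be positive")
    return stock_ret / benchmark_ret * 100


def price_range_checks(close: float, low_52w: float, high_52w: float):
    """Return (at least 30% above the low, within 25% of the high)."""
    above_low = close >= low_52w * 1.30
    near_high = close >= high_52w * 0.75
    return above_low, near_high


# Illustrative numbers only, not real market data
print(relative_strength(stock_ret=0.18, benchmark_ret=0.12))
print(price_range_checks(close=14.0, low_52w=10.0, high_52w=16.0))
```

A stock that returned 18% while the benchmark returned 12% outperformed by half, so its RS is well above the threshold of 70.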

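About Mark Minervini

One of the best-known traders in the United States. He reportedly achieved a cumulative return of around 30,000% and is said to have become a multimillionaire before the age of 34. For details, see Jack Schwager's book Stock Market Wizards.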

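What are the technical challenges of stock screening?

Where do we get historical data for a large number of stocks?
How do we keep the computation fast when there are many stocks?

This article implements the MM model in Python and addresses both of these challenges.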

import os
import datetime as dt
import time
from typing import Any, Dict, Optional, List

import requests
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import talib
import multiprocessing as mp
from requests.exceptions import ConnectionError, Timeout

%matplotlib inline
plt.style.use("fivethirtyeight")

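1. Getting historical data from Trochil (蜂鸟数据)

Trochil (蜂鸟数据) is an emerging financial data provider. It offers real-time quotes and historical data for stocks, forex, commodity futures and cryptocurrencies through an API, which makes it a convenient source of free data.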

# Helper functions that fetch data through the API

def fetch_trochil(url: str,
                  params: Dict[str, str],
                  attempt: int = 3,
                  timeout: int = 3) -> Any:
    """Wrap requests.get with retries.

    Returns the "data" field of the JSON response, or None if every attempt fails.
    """
    for i in range(attempt):
        try:
            resp = requests.get(url, params, timeout=timeout)
            resp.raise_for_status()
            data = resp.json()["data"]
            if not data:
                raise Exception("empty dataset")
            return data
        except (ConnectionError, Timeout) as e:
            # Network problem: print it and wait a little longer before each retry
            print(e)
            time.sleep((i + 1) * 0.5)
    return None


def fetch_cnstocks(apikey: str) -> pd.DataFrame:
    """Fetch the list of A-share products from Trochil."""
    url = "https://api.trochil.cn/v1/cnstock/markets"
    params = {"apikey": apikey}
    res = fetch_trochil(url, params)
    return pd.DataFrame.from_records(res)


def fetch_daily_ohlc(symbol: str,
                     date_from: dt.datetime,
                     date_to: dt.datetime,
                     apikey: str) -> pd.DataFrame:
    """Fetch daily historical bars for an A-share stock from Trochil."""
    url = "https://api.trochil.cn/v1/cnstock/history"
    params = {
        "symbol": symbol,
        "start_date": date_from.strftime("%Y-%m-%d"),
        "end_date": date_to.strftime("%Y-%m-%d"),
        "freq": "daily",
        "apikey": apikey
    }
    res = fetch_trochil(url, params)
    return pd.DataFrame.from_records(res)


def fetch_index_ohlc(symbol: str,
                     date_from: dt.datetime,
                     date_to: dt.datetime,
                     apikey: str) -> pd.DataFrame:
    """Fetch daily historical data for a stock index."""
    url = "https://api.trochil.cn/v1/index/daily"
    params = {
        "symbol": symbol,
        "start_date": date_from.strftime("%Y-%m-%d"),
        "end_date": date_to.strftime("%Y-%m-%d"),
        "apikey": apikey
    }
    res = fetch_trochil(url, params)
    return pd.DataFrame.from_records(res)
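
The retry loop in fetch_trochil gives up quietly once the attempts run out. If you would rather have the failure surface, one option is to raise after the last attempt. The sketch below is a generic illustration that makes no network calls: `retry_call` is a hypothetical helper (not part of requests or the Trochil API), and it catches Python's built-in ConnectionError and TimeoutError rather than the requests exception classes.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(func: Callable[[], T],
               attempts: int = 3,
               base_delay: float = 0.5,
               retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError)) -> T:
    """Call func() up to `attempts` times, waiting longer after each failure."""
    last_error = None
    for i in range(attempts):
        try:
            return func()
        except retry_on as e:
            last_error = e
            if i < attempts - 1:
                # Linear backoff: base_delay, 2 * base_delay, ...
                time.sleep((i + 1) * base_delay)
    # All attempts failed: raise instead of returning None
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Stand-in for a network call that fails twice and then succeeds
state = {"calls": 0}

def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "payload"

print(retry_call(flaky_request, attempts=3, base_delay=0.0))
```

Raising makes a failed download visible to the caller, who can then decide whether to skip the symbol or stop the run.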

1.1 Product list

First fetch the symbols of all companies listed on the Shanghai and Shenzhen A-share markets.

apikey = os.getenv("TROCHIL_API")  # use your apikey
cnstocks = fetch_cnstocks(apikey)
cnstocks

	symbol	name	exchange
0	SH600000	浦发银行	上海证券交易所
1	SH600004	白云机场	上海证券交易所
2	SH600006	东风汽车	上海证券交易所
3	SH600007	中国国贸	上海证券交易所
4	SH600008	首创股份	上海证券交易所
...	...	...	...
3784	SZ300815	玉禾田	深证证券交易所
3785	SZ300816	艾可蓝	深证证券交易所
3786	SZ300817	双飞股份	深证证券交易所
3787	SZ300818	耐普矿机	深证证券交易所
3788	SZ300820	英杰电气	深证证券交易所

3789 rows × 3 columns

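We now have product information for 3,789 Shanghai and Shenzhen A-shares. The prefix 'SH' marks stocks listed on the Shanghai Stock Exchange and 'SZ' marks stocks listed on the Shenzhen Stock Exchange. Only Shanghai-listed stocks are used in the modelling.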

# Keep only the stocks whose symbol starts with 'SH'
cnstocks_shsz = cnstocks.query("symbol.str.startswith('SH')")
cnstocks_shsz

	symbol	name	exchange
0	SH600000	浦发银行	上海证券交易所
1	SH600004	白云机场	上海证券交易所
2	SH600006	东风汽车	上海证券交易所
3	SH600007	中国国贸	上海证券交易所
4	SH600008	首创股份	上海证券交易所
...	...	...	...
1580	SH688369	致远互联	上海证券交易所
1581	SH688388	嘉元科技	上海证券交易所
1582	SH688389	普门科技	上海证券交易所
1583	SH688398	赛特新材	上海证券交易所
1584	SH688399	硕世生物	上海证券交易所

1585 rows × 3 columns

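1.2 Historical data for individual stocks

Fetch the daily price history of the Shanghai Stock Exchange stocks from Trochil. The MM trend model needs at least 260 trading days (52 weeks × 5 days) of history, which some newly listed or delisted stocks do not have, so stocks with fewer than 260 daily bars are dropped.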

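%%time
# Download the history from 2019 to today
# Drop stocks with fewer than 260 daily bars while downloading
date_from = dt.datetime(2019, 1, 1)
date_to = dt.datetime.today()
symbols = cnstocks_shsz.symbol.to_list()
min_klines = 260

# Download one symbol at a time; the Trochil API has no per-minute request limit
# Collect the frames in a list first, then merge and clean them after the download
ohlc_list = []
for symbol in symbols:
    try:
        ohlc = fetch_daily_ohlc(symbol, date_from, date_to, apikey)
        if ohlc is not None and len(ohlc) >= min_klines:
            ohlc.set_index("datetime", inplace=True)
            ohlc_list.append(ohlc)
    except Exception as e:
        # Skip any symbol whose download or parsing fails
        pass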

CPU times: user 21.7 s, sys: 349 ms, total: 22 s
Wall time: 49.3 s

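Downloading the history of more than 1,500 stocks (just under 400 trading days each) takes less than a minute. Next we merge and clean the data, then store it locally for later analysis.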

ohlc_joined = pd.concat(ohlc_list)
ohlc_joined.info()

<class 'pandas.core.frame.DataFrame'>
Index: 532756 entries, 2019-01-02 to 2020-07-29
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   open    532756 non-null  float64
 1   high    532756 non-null  float64
 2   low     532756 non-null  float64
 3   close   532756 non-null  float64
 4   volume  532756 non-null  float64
 5   symbol  532756 non-null  object
dtypes: float64(5), object(1)
memory usage: 28.5+ MB

Check whether there are any missing values.

ohlc_joined.isnull().sum()

open      0
high      0
low       0
close     0
volume    0
symbol    0
dtype: int64

Save the data locally as a CSV file. Later runs can read it straight from disk and avoid the time spent on API requests.

ohlc_joined.to_csv("cnstock_daily_ohlc.csv", index=True)
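
Reading the file back later needs two arguments to restore the date index. The sketch below shows the round trip on a tiny made-up frame; it writes to an in-memory buffer instead of cnstock_daily_ohlc.csv so that it runs anywhere, and the sample prices are illustrative only.

```python
import io

import pandas as pd

# A small stand-in for ohlc_joined: long format, indexed by "datetime"
sample = pd.DataFrame(
    {
        "close": [9.70, 9.81, 9.90, 9.77],
        "symbol": ["SH600000", "SH600000", "SH600004", "SH600004"],
    },
    index=pd.Index(
        ["2019-01-02", "2019-01-03", "2019-01-02", "2019-01-03"],
        name="datetime",
    ),
)

# Write as CSV (an in-memory buffer stands in for a file on disk)
buffer = io.StringIO()
sample.to_csv(buffer, index=True)
buffer.seek(0)

# index_col restores the index; parse_dates=True converts it to timestamps
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=True)
print(restored)
print(type(restored.index).__name__)
```

With a real file, pass the file name in place of the buffer.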

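1.3 Shanghai Composite Index

Fetch the price history of the Shanghai Composite Index and compute its cumulative return over the past year, which is needed for each stock's relative strength.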

benchmark = fetch_index_ohlc("shci", date_from, date_to, apikey)
benchmark.tail()

	datetime	open	high	low	close	volume
377	2020-07-23	3306.15	3336.30	3257.83	3325.11	4.070000e+10
378	2020-07-24	3310.65	3319.13	3184.96	3196.77	4.271000e+10
379	2020-07-27	3210.39	3221.98	3174.66	3205.23	2.993000e+10
380	2020-07-28	3226.13	3245.30	3208.49	3227.96	2.894000e+10
381	2020-07-29	3221.99	3294.55	3209.99	3294.55	3.249000e+10

# Cumulative 1-year return, taking 1 year as 252 trading days
benchmark_ann_ret = benchmark.close.pct_change(252).iloc[-1]
benchmark_ann_ret

0.12150312157460808
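
For readers less familiar with pct_change, calling it with a period n compares each value with the one n rows earlier, so the last element is the return over the most recent n rows. A small sketch with made-up prices and a short lookback in place of 252:

```python
import pandas as pd

# Made-up closing prices, for illustration only
close = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])
lookback = 3

# Return over the last `lookback` rows, as computed by pandas
ret = close.pct_change(lookback)

# The same number by hand: latest close over the close `lookback` rows earlier
manual = close.iloc[-1] / close.iloc[-1 - lookback] - 1

print(ret.iloc[-1], manual)
```

The first `lookback` values of the result are NaN, which is why a stock needs more than 252 bars before its 1-year return is defined.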

2. Screening

def screen(close: pd.Series, benchmark_ann_ret: float) -> pd.Series:
    """Apply the MM screening rules to one stock and report whether it passes.

    Args:
        close(pd.Series): closing prices, indexed by date
        benchmark_ann_ret(float): 1-year return of the benchmark index, used for relative strength
    """
    # 50-, 150- and 200-day moving averages (exponential)
    ema_50 = talib.EMA(close, 50).iloc[-1]
    ema_150 = talib.EMA(close, 150).iloc[-1]
    ema_200 = talib.EMA(close, 200).iloc[-1]

    # 20-day smoothing of the 200-day average, used to judge whether it is rising
    ema_200_smooth = talib.EMA(talib.EMA(close, 200), 20).iloc[-1]

    # 52-week high and low of the close
    high_52week = close.rolling(52 * 5).max().iloc[-1]
    low_52week = close.rolling(52 * 5).min().iloc[-1]

    # Latest close
    cl = close.iloc[-1]

    # Condition 1: close above the 150-day and 200-day averages
    condition_1 = cl > ema_150 and cl > ema_200

    # Condition 2: 150-day average above the 200-day average
    condition_2 = ema_150 > ema_200

    # Condition 3: 200-day average rising for 1 month
    condition_3 = ema_200 > ema_200_smooth

    # Condition 4: 50-day average above the 150-day and 200-day averages
    condition_4 = ema_50 > ema_150 and ema_50 > ema_200

    # Condition 5: close above the 50-day average
    condition_5 = cl > ema_50

    # Condition 6: close at least 30% above the 52-week low
    condition_6 = cl >= low_52week * 1.3

    # Condition 7: close within 25% of the 52-week high
    condition_7 = high_52week * 0.75 <= cl <= high_52week * 1.25

    # Condition 8: relative strength of at least 70
    rs = close.pct_change(252).iloc[-1] / benchmark_ann_ret * 100
    condition_8 = rs >= 70

    # The stock passes only if every condition holds
    meet_criterion = all([
        condition_1, condition_2, condition_3, condition_4,
        condition_5, condition_6, condition_7, condition_8,
    ])

    out = {
        "rs": round(rs, 2),
        "close": cl,
        "ema_50": ema_50,
        "ema_150": ema_150,
        "ema_200": ema_200,
        "high_52week": high_52week,
        "low_52week": low_52week,
        "meet_criterion": meet_criterion
    }
    return pd.Series(out)
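
To see which rule a stock fails, it can help to return the trend conditions one by one. The sketch below does this for conditions 1 to 7 on synthetic prices. It is a variation, not the function above: it swaps talib.EMA for pandas' ewm (seeded with the first value, so the numbers differ slightly from TA-Lib), and `ema` and `trend_conditions` are hypothetical names.

```python
import numpy as np
import pandas as pd


def ema(series: pd.Series, span: int) -> pd.Series:
    """Exponential moving average via pandas, seeded with the first value."""
    return series.ewm(span=span, adjust=False).mean()


def trend_conditions(close: pd.Series) -> pd.Series:
    """Return conditions 1-7 of the trend template as separate booleans."""
    ema_200_series = ema(close, 200)
    e50 = ema(close, 50).iloc[-1]
    e150 = ema(close, 150).iloc[-1]
    e200 = ema_200_series.iloc[-1]
    e200_smooth = ema(ema_200_series, 20).iloc[-1]
    high_52w = close.rolling(52 * 5).max().iloc[-1]
    low_52w = close.rolling(52 * 5).min().iloc[-1]
    cl = close.iloc[-1]
    return pd.Series({
        "c1_close_above_150_200": bool(cl > e150 and cl > e200),
        "c2_150_above_200": bool(e150 > e200),
        "c3_200_rising": bool(e200 > e200_smooth),
        "c4_50_above_150_200": bool(e50 > e150 and e50 > e200),
        "c5_close_above_50": bool(cl > e50),
        "c6_30pct_above_low": bool(cl >= low_52w * 1.3),
        "c7_within_25pct_of_high": bool(cl >= high_52w * 0.75),
    })


# Synthetic series: a steady uptrend and a steady downtrend (not market data)
days = np.arange(300)
rising = pd.Series(10.0 * 1.002 ** days)
falling = pd.Series(10.0 * 0.998 ** days)

print(trend_conditions(rising))
print(trend_conditions(falling))
```

The steady uptrend satisfies every condition and the steady downtrend satisfies none, which is the behaviour the template is designed to capture.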

2.1 Synchronous

We first screen synchronously, applying the same screening function to about 1,400 stocks.

# Screen only the stocks that have enough history
symbols_to_screen = list(ohlc_joined.symbol.unique())

# Reshape the frame from long format to wide format
ohlc_joined_wide = ohlc_joined.pivot(columns="symbol", values="close").ffill()
ohlc_joined_wide.head()

symbol	sh600000	sh600004	sh600006	sh600007	sh600008	sh600009	sh600010	sh600011	sh600012	sh600015	...	sh603987	sh603988	sh603989	sh603990	sh603991	sh603993	sh603996	sh603997	sh603998	sh603999
datetime
2019-01-02	9.70	9.90	3.59	12.96	3.43	50.38	1.47	7.19	5.73	7.31	...	6.45	9.86	19.52	32.69	18.30	3.71	7.58	7.76	4.53	4.75
2019-01-03	9.81	9.77	3.63	12.95	3.37	49.70	1.49	7.00	5.42	7.35	...	6.44	9.82	18.86	32.00	17.83	3.74	7.50	7.90	4.58	4.75
2019-01-04	9.96	9.90	3.66	13.18	3.43	49.82	1.52	7.10	5.50	7.49	...	6.63	9.96	19.42	32.10	18.35	3.83	7.70	7.86	4.72	4.99
2019-01-07	9.98	9.94	3.74	13.28	3.49	49.78	1.52	7.08	5.60	7.45	...	6.63	10.12	19.60	32.07	18.81	3.89	7.86	7.86	4.88	5.06
2019-01-08	9.96	9.79	3.79	13.40	3.45	49.84	1.51	7.16	5.76	7.41	...	6.99	10.40	19.90	32.25	18.90	3.85	8.13	7.85	4.92	5.05

5 rows × 1399 columns
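
The long-to-wide reshape is easier to follow on a tiny frame. In the sketch below (made-up rows, for illustration only), SH600004 has no bar on 2019-01-03, so the pivot leaves a gap there and the forward fill carries the previous close over.

```python
import pandas as pd

# Long format: one row per (date, symbol), indexed by date
long_df = pd.DataFrame(
    {
        "datetime": ["2019-01-02", "2019-01-02", "2019-01-03",
                     "2019-01-04", "2019-01-04"],
        "symbol": ["SH600000", "SH600004", "SH600000",
                   "SH600000", "SH600004"],
        "close": [9.70, 9.90, 9.81, 9.96, 9.90],
    }
).set_index("datetime")

# Wide format: one row per date, one column per symbol
wide = long_df.pivot(columns="symbol", values="close").ffill()
print(wide)
```

Each column of the wide frame is then a single stock's closing-price series, which is the input that screen expects.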

%%time
results = ohlc_joined_wide.apply(screen, benchmark_ann_ret=benchmark_ann_ret)
results = results.T

CPU times: user 2.97 s, sys: 6.47 ms, total: 2.98 s
Wall time: 2.97 s

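The synchronous run takes about 3 seconds. That is acceptable during research but not in production. Imagine turning the screener into a product: a user picks the criteria, clicks to screen, and has to wait at least 3 seconds for the result, which makes for a poor user experience. Next we try multiprocessing to address this.

First, let's see which stocks meet the criteria.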

results.query("meet_criterion == True").sort_values("rs", ascending=False)

	rs	close	ema_50	ema_150	ema_200	high_52week	low_52week	meet_criterion
symbol
sh603069	4018.29	40	23.7712	15.9565	14.4162	43.09	5.99	True
sh603378	3670.14	72.5	49.6385	36.9516	33.5383	72.5	11.77	True
sh603976	3155.27	76.76	44.6285	30.1172	27.5381	76.76	13.93	True
sh603129	2714.32	94.04	72.0693	54.2714	49.5389	94.5	18.07	True
sh600859	2501.24	62	47.3097	29.8663	26.5997	74.12	11.84	True
...	...	...	...	...	...	...	...	...
sh603029	84.33	16.14	15.4671	14.7685	14.6689	17.89	12.4	True
sh601717	83.81	6.6	6.10555	5.93716	5.93643	7.3	4.8	True
sh600824	80.77	4.14	4.03203	3.67566	3.64035	5.17	2.91	True
sh600887	80.06	35.87	32.2961	30.7098	30.4364	36	27.12	True
sh600807	73.2	5.02	4.77558	4.28381	4.23731	5.77	2.97	True

389 rows × 8 columns

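389 stocks meet the criteria. From a quantitative trading point of view, a list this long suggests the screen has not really singled out the most promising stocks, though this of course depends on the parameters chosen.

Whether the model works is not the subject of this article (we explore it in other articles), so don't focus too much on this point for now.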

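2.2 Multiprocessing

Next we try to speed up the screening with multiple processes and see whether the time can be brought down to around 1 second or less. The core idea of multiprocessing is divide and conquer: similar computations are distributed to different CPU cores and the results are combined at the end. We use the multiprocessing module to implement it.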

%%time
# Define the worker function
def screen_stocks(df: pd.DataFrame, benchmark_ann_ret: float) -> pd.DataFrame:
    results = df.apply(screen, benchmark_ann_ret=benchmark_ann_ret)
    return results.T

# Split the frame by columns into four parts, one for each of four processes
df_chunks = np.array_split(ohlc_joined_wide, 4, axis=1)

# Manage the process pool with a multiprocessing.Pool object
with mp.Pool(processes=4) as p:
    future_results = [
        p.apply_async(screen_stocks, kwds={"df": df, "benchmark_ann_ret": benchmark_ann_ret})
        for df in df_chunks
    ]
    results = pd.concat([r.get() for r in future_results])

CPU times: user 934 ms, sys: 204 ms, total: 1.14 s
Wall time: 1.06 s

With four processes the computation time drops to about 1 second, and the results are exactly the same.

results.query("meet_criterion == True").sort_values("rs", ascending=False)

	rs	close	ema_50	ema_150	ema_200	high_52week	low_52week	meet_criterion
symbol
sh603069	4018.29	40	23.7712	15.9565	14.4162	43.09	5.99	True
sh603378	3670.14	72.5	49.6385	36.9516	33.5383	72.5	11.77	True
sh603976	3155.27	76.76	44.6285	30.1172	27.5381	76.76	13.93	True
sh603129	2714.32	94.04	72.0693	54.2714	49.5389	94.5	18.07	True
sh600859	2501.24	62	47.3097	29.8663	26.5997	74.12	11.84	True
...	...	...	...	...	...	...	...	...
sh603029	84.33	16.14	15.4671	14.7685	14.6689	17.89	12.4	True
sh601717	83.81	6.6	6.10555	5.93716	5.93643	7.3	4.8	True
sh600824	80.77	4.14	4.03203	3.67566	3.64035	5.17	2.91	True
sh600887	80.06	35.87	32.2961	30.7098	30.4364	36	27.12	True
sh600807	73.2	5.02	4.77558	4.28381	4.23731	5.77	2.97	True

389 rows × 8 columns
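
The divide-and-conquer pattern used above can be sketched on a toy workload. This is one possible arrangement, not the only one: it splits column positions with NumPy rather than splitting the frame itself, uses Pool.map with functools.partial instead of apply_async, and all three function names are hypothetical.

```python
import multiprocessing as mp
from functools import partial
from typing import List

import numpy as np
import pandas as pd


def column_chunks(df: pd.DataFrame, n_chunks: int) -> List[pd.DataFrame]:
    """Split a wide frame into column-wise pieces of near-equal size."""
    parts = np.array_split(np.arange(df.shape[1]), n_chunks)
    return [df.iloc[:, idx] for idx in parts if len(idx) > 0]


def last_return(df: pd.DataFrame, lookback: int) -> pd.Series:
    """Toy worker: each column's return over the last `lookback` rows."""
    return df.iloc[-1] / df.iloc[-1 - lookback] - 1


def parallel_last_return(df: pd.DataFrame, lookback: int, processes: int) -> pd.Series:
    """Run the worker on each chunk in its own process and stitch the results."""
    worker = partial(last_return, lookback=lookback)
    with mp.Pool(processes=processes) as pool:
        pieces = pool.map(worker, column_chunks(df, processes))
    return pd.concat(pieces)
```

In a script, call `parallel_last_return(prices, lookback=252, processes=4)` from under an `if __name__ == "__main__":` guard, since worker processes may re-import the module on some platforms. Because each column is processed independently, concatenating the pieces gives the same result as running the worker on the whole frame.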

Next we measure how the computation time changes with the number of processes, to decide on the best number.

max_processors = mp.cpu_count()
time_used = {}
for processors in range(1, max_processors + 1):
    df_chunks = np.array_split(ohlc_joined_wide, processors, axis=1)
    t0 = time.time()
    with mp.Pool(processors) as p:
        future_results = [
            p.apply_async(screen_stocks, kwds={"df": df, "benchmark_ann_ret": benchmark_ann_ret})
            for df in df_chunks
        ]
        results = pd.concat([r.get() for r in future_results])
    elapsed = time.time() - t0
    time_used[processors] = elapsed

fig, ax = plt.subplots(figsize=(12, 7))
ax = sns.pointplot(x=list(time_used.keys()), y=list(time_used.values()))
ax.set_xlabel("CPU cores")
ax.set_ylabel("Time used(seconds)")
ax.set_title("Computation time vs CPU Cores", loc="left")

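As the plot shows, using two processes cuts the computation time roughly in half, as expected. As the number of processes approaches the maximum, each additional process saves less time. This is not hard to understand: the machine is handling other tasks at the same time, and starting processes and passing data between them has its own cost, so even with processors=12 the cores cannot all be fully used. For now, four processes is a reasonable choice, bringing the time down from 3.5 seconds to about 1 second.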
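
One common way to reason about these diminishing returns is Amdahl's law, which bounds the speedup when only part of a job can run in parallel. The sketch below is a textbook model, not a measurement from this experiment: the 0.95 parallel fraction is an assumed value for illustration, and the model ignores process start-up and data-transfer costs.

```python
def amdahl_speedup(parallel_fraction: float, processes: int) -> float:
    """Theoretical speedup under Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processes)


# Assumed parallel fraction of 0.95, for illustration only
for n in (1, 2, 4, 8, 12):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Under this model the speedup can never exceed 1 / (1 - parallel_fraction), so each extra process adds less than the one before it.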

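3. Summary

This article showed how to do quantitative stock screening in Python, including:

Fetching the A-share product list and the price history of Shanghai-listed stocks from Trochil.
Writing a custom function that implements the screening logic of the MM model.
Using multiprocessing to cut the screening time substantially.

The next step is backtesting: building a portfolio from the MM model, tuning the screening parameters, and checking whether it can deliver excess returns.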
