Visualizing the stock market structure || Clustering and visualizing the CSI 300 stocks

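The first half of this post is a translation of the scikit-learn example "Visualizing the stock market structure", with line-by-line comments on the code and all reference links listed at the end. The second half applies the same approach to the CSI 300 (沪深300) stocks.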

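This example uses several unsupervised learning techniques to extract the stock market structure from variations in historical quotes.

The quantity used is the daily variation in quote price: quotes that are linked tend to move together during a day.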
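
To make this quantity concrete, here is a minimal sketch with made-up prices. The two stocks, the four trading days, and all numbers are assumptions for illustration, not data from the example. It computes the daily variation as close minus open and scales each stock to unit standard deviation, as the script later does before fitting its models.

```python
import numpy as np

# Hypothetical open/close prices: 2 stocks (rows) x 4 trading days (columns).
open_prices = np.array([[10.0, 10.5, 10.2, 10.8],
                        [50.0, 49.0, 49.5, 51.0]])
close_prices = np.array([[10.4, 10.3, 10.9, 10.7],
                         [49.2, 49.6, 50.8, 50.5]])

# Daily variation: close minus open, one value per stock per day.
variation = close_prices - open_prices

# Put days in rows and stocks in columns, then scale each stock to unit
# standard deviation so that high-priced stocks do not dominate.
X = variation.T.copy()
X /= X.std(axis=0)

print(variation)
print(X.std(axis=0))
```

After scaling, every column of `X` has standard deviation 1, so the later models compare co-movement rather than price level.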

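Learning a graph structure

We use sparse inverse covariance estimation to find which quotes are correlated conditionally on the others.

Specifically, sparse inverse covariance gives us a graph, that is, a list of connections.

For each stock, the stocks it is connected to are the ones that help explain its fluctuations.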
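
As a rough illustration of what the estimator returns, the sketch below fits `GraphicalLassoCV` to synthetic data. The chain-shaped ground truth, the sample size, and the seed are assumptions chosen for this illustration and have nothing to do with the stock data.

```python
import numpy as np
from sklearn import covariance

rng = np.random.RandomState(0)

# Assumed ground truth: a "chain" of 5 variables in which each variable is
# conditionally linked only to its immediate neighbours (tridiagonal precision).
n_features = 5
true_precision = (np.eye(n_features)
                  + 0.4 * np.eye(n_features, k=1)
                  + 0.4 * np.eye(n_features, k=-1))
true_covariance = np.linalg.inv(true_precision)

# Draw synthetic observations and standardize each column.
X = rng.multivariate_normal(np.zeros(n_features), true_covariance, size=500)
X /= X.std(axis=0)

# Cross-validated sparse inverse covariance estimation.
edge_model = covariance.GraphicalLassoCV(cv=5)
edge_model.fit(X)

# Off-diagonal entries of precision_ that are far from zero suggest a
# conditional link between the corresponding pair of variables.
print(np.round(edge_model.precision_, 2))
```

The estimated `precision_` matrix plays the role of the graph: entry (i, j) is read as the strength of the link between variables i and j once all other variables are accounted for.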

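Clustering

We use clustering to group together quotes that behave similarly.

Among the many clustering techniques available in scikit-learn, we use Affinity Propagation because it does not enforce equal-size clusters and it determines the number of clusters from the data automatically.

This gives a different indication than the graph: the graph reflects conditional relations between variables, while the clustering reflects marginal properties. Variables clustered together can be seen as having a similar impact at the level of the whole market.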
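
The sketch below shows the call pattern on toy data. The two groups of points, their sizes, and the similarity measure (negative squared Euclidean distance) are assumptions for this illustration; the script itself passes the estimated covariance matrix as the similarity. How many clusters come out depends on the `preference` parameter, which is left at its default here.

```python
import numpy as np
from sklearn import cluster

rng = np.random.RandomState(0)

# Hypothetical 2D points forming two groups of different sizes.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(12, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.1, size=(5, 2))
points = np.vstack([group_a, group_b])

# Similarity matrix: negative squared Euclidean distance between points.
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
similarity = -np.sum(diff ** 2, axis=2)

# The number of clusters is not passed in; it is inferred from the data.
centers, labels = cluster.affinity_propagation(similarity, random_state=0)

for i in range(labels.max() + 1):
    print('Cluster %i: points %s' % (i + 1, np.flatnonzero(labels == i)))
```

`labels` holds one integer per input point, which is the array the script later uses to color the nodes.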

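Embedding in 2D space

For visualization purposes, we need to lay out the different stocks on a 2D canvas. For this we use manifold learning techniques to retrieve a 2D embedding.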
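
A minimal sketch of the embedding step on random data follows. The data shape (200 observations of 30 series) and the seed are assumptions for this illustration; the estimator settings are the ones used in the script.

```python
import numpy as np
from sklearn import manifold

rng = np.random.RandomState(0)

# Hypothetical standardized data: 200 observations (rows) of 30 series (columns).
X = rng.normal(size=(200, 30))

# Same settings as the example: 2 output dimensions, dense solver, 6 neighbours.
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)

# Each series is one sample, so the model is fitted on X.T (30 x 200);
# transposing the result gives one row of x values and one row of y values.
embedding = node_position_model.fit_transform(X.T).T
print(embedding.shape)  # (2, 30)
```

Row 0 of `embedding` then supplies the x coordinates of the nodes and row 1 the y coordinates.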

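Visualization

The outputs of the three models are combined in a 2D graph in which nodes represent stocks and edges represent the links found by the sparse covariance model. Each model controls one visual attribute:

Cluster labels define the node colors
The sparse covariance model sets the strength of the edges
The 2D embedding positions the nodes in the plane

This example has a fair amount of visualization-related code, because the visualization is essential for displaying the graph. One of the challenges is to position the labels so that they overlap as little as possible. For this we use a heuristic based on the direction of the nearest neighbor along each axis.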
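
The label heuristic can be isolated as a small function. This is a sketch only: the function name `label_alignment` and the three sample coordinates are made up for illustration, while the logic follows the labeling loop in the script. As in the script, the node's own offset is overwritten with 1, which assumes coordinate differences between distinct nodes are smaller than 1.

```python
import numpy as np


def label_alignment(embedding, index):
    """Pick text alignment for node `index` from its nearest neighbours per axis.

    `embedding` has shape (2, n_nodes): row 0 holds x, row 1 holds y.
    """
    x, y = embedding[:, index]
    dx = x - embedding[0]
    dx[index] = 1  # exclude the node itself from the argmin below
    dy = y - embedding[1]
    dy[index] = 1
    # Horizontal offset to the node closest in y,
    # vertical offset to the node closest in x.
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    horizontal = 'left' if this_dx > 0 else 'right'
    vertical = 'bottom' if this_dy > 0 else 'top'
    return horizontal, vertical


# Hypothetical coordinates for three nodes.
embedding = np.array([[0.0, 0.1, 0.5],
                      [0.0, 0.3, 0.02]])
print(label_alignment(embedding, 0))  # ('right', 'top')
print(label_alignment(embedding, 2))  # ('left', 'top')
```

For node 0, the node nearest in y (node 2) lies to its right, so the text is right-aligned and extends away from that neighbor; the node nearest in x (node 1) lies above, so the text is placed below the anchor.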

Output of the script:

Cluster 1: Apple, Amazon, Yahoo
Cluster 2: Comcast, Cablevision, Time Warner
Cluster 3: ConocoPhillips, Chevron, Total, Valero Energy, Exxon
Cluster 4: Cisco, Dell, HP, IBM, Microsoft, SAP, Texas Instruments
Cluster 5: Boeing, General Dynamics, Northrop Grumman, Raytheon
Cluster 6: AIG, American express, Bank of America, Caterpillar, CVS, DuPont de Nemours, Ford, General Electrics, Goldman Sachs, Home Depot, JPMorgan Chase, Marriott, 3M, Ryder, Wells Fargo, Wal-Mart
Cluster 7: McDonald's
Cluster 8: GlaxoSmithKline, Novartis, Pfizer, Sanofi-Aventis, Unilever
Cluster 9: Kellogg, Coca Cola, Pepsi
Cluster 10: Colgate-Palmolive, Kimberly-Clark, Procter Gamble
Cluster 11: Canon, Honda, Navistar, Sony, Toyota, Xerox

The code:

from __future__ import print_function   # only needed on Python 2; Python 3 does not need this line
import sys
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection   # see Ref. 2 (LineCollection structure and usage)
import pandas as pd
from sklearn import cluster, covariance, manifold   # see Refs. 3, 4, 5
print(__doc__)   # print the module docstring

# #############################################################################
# Retrieve the data from Internet

# The data is from 2003 - 2008. This is reasonably calm: (not too long ago so
# that we get high-tech firms, and before the 2008 crash). This kind of
# historical data can be obtained from APIs like the quandl.com and
# alphavantage.co ones.

symbol_dict = {
    'TOT': 'Total', 'XOM': 'Exxon', 'CVX': 'Chevron',
    'COP': 'ConocoPhillips', 'VLO': 'Valero Energy', 'MSFT': 'Microsoft',
    'IBM': 'IBM', 'TWX': 'Time Warner', 'CMCSA': 'Comcast',
    'CVC': 'Cablevision', 'YHOO': 'Yahoo', 'DELL': 'Dell',
    'HPQ': 'HP', 'AMZN': 'Amazon', 'TM': 'Toyota',
    'CAJ': 'Canon', 'SNE': 'Sony', 'F': 'Ford',
    'HMC': 'Honda', 'NAV': 'Navistar', 'NOC': 'Northrop Grumman',
    'BA': 'Boeing', 'KO': 'Coca Cola', 'MMM': '3M',
    'MCD': 'McDonald\'s', 'PEP': 'Pepsi', 'K': 'Kellogg',
    'UN': 'Unilever', 'MAR': 'Marriott', 'PG': 'Procter Gamble',
    'CL': 'Colgate-Palmolive', 'GE': 'General Electrics', 'WFC': 'Wells Fargo',
    'JPM': 'JPMorgan Chase', 'AIG': 'AIG', 'AXP': 'American express',
    'BAC': 'Bank of America', 'GS': 'Goldman Sachs', 'AAPL': 'Apple',
    'SAP': 'SAP', 'CSCO': 'Cisco', 'TXN': 'Texas Instruments',
    'XRX': 'Xerox', 'WMT': 'Wal-Mart', 'HD': 'Home Depot',
    'GSK': 'GlaxoSmithKline', 'PFE': 'Pfizer', 'SNY': 'Sanofi-Aventis',
    'NVS': 'Novartis', 'KMB': 'Kimberly-Clark', 'R': 'Ryder',
    'GD': 'General Dynamics', 'RTN': 'Raytheon', 'CVS': 'CVS',
    'CAT': 'Caterpillar', 'DD': 'DuPont de Nemours'}   # dict mapping 56 ticker symbols to company names

# Sort the (symbol, name) pairs, convert them to an array and transpose it
# to shape (2, 56), then unpack the two rows into two NumPy arrays.
symbols, names = np.array(sorted(symbol_dict.items())).T

quotes = []   # list that will hold the quote history of each symbol

for symbol in symbols:
    print('Fetching quote history for %r' % symbol, file=sys.stderr)   # Ref. 6
    # Ref. 7: replace raw.githubusercontent.com with github.com to browse the data as a normal page
    url = ('https://raw.githubusercontent.com/scikit-learn/examples-data/'
           'master/financial-data/{}.csv')
    # Ref. 8 (str.format). quotes becomes a list of DataFrames, one per symbol,
    # each holding that symbol's full history (date, open, close).
    quotes.append(pd.read_csv(url.format(symbol)))

# Ref. 9: stack the per-symbol series into a 2D array of shape
# (n_stocks, n_days); each row is the closing-price series of one stock.
close_prices = np.vstack([q['close'] for q in quotes])
open_prices = np.vstack([q['open'] for q in quotes])

# The daily variations of the quotes are what carry most information
variation = close_prices - open_prices   # close minus open is the signal used below

# #############################################################################
# Learn a graphical structure from the correlations
edge_model = covariance.GraphicalLassoCV(cv=5)   # instantiate GraphicalLassoCV; for Lasso see Ref. 10, for GraphicalLassoCV see Ref. 14

# standardize the time series: using correlations rather than covariance
# is more efficient for structure recovery
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)

# #############################################################################
# Cluster using affinity propagation

# Returns the indices of the cluster centers and the cluster label of each sample, see Ref. 11
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()   # largest label; labels are consecutive integers starting at 0

for i in range(n_labels + 1):   # iterates over labels 0 .. n_labels
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))   # list the members of each cluster

# #############################################################################
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane

# We use a dense eigen_solver to achieve reproducibility (arpack is
# initiated with random vectors that we don't control). In addition, we
# use a large number of neighbors to capture the large-scale structure.
# See Refs. 5 and 13: 6 neighbors, reduced to 2 dimensions.
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)

# Fit the model and reduce the dimension; the result has shape (2, n_stocks)
embedding = node_position_model.fit_transform(X.T).T

# #############################################################################
# Visualization
plt.figure(1, facecolor='w', figsize=(10, 8))   # create the figure: white background, 10 x 8 inches
plt.clf()   # clear the current figure
ax = plt.axes([0., 0., 1., 1.])   # create an axes object that fills the figure
plt.axis('off')   # see Ref. 12

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()   # start from the precision matrix
d = 1 / np.sqrt(np.diag(partial_correlations))   # see Ref. 15
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]   # Ref. 16: reshape d to an (n, 1) column
# Ref. 17: keep the upper triangle above the diagonal and compare with 0.02 to get a boolean mask
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)

# Plot the nodes using the coordinates of our embedding
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.nipy_spectral)   # Refs. 18, 19

# Plot the edges
start_idx, end_idx = np.where(non_zero)   # Ref. 20: row and column indices of the True entries
# a sequence of (*line0*, *line1*, *line2*), where::
#            linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
lc = LineCollection(segments, zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))   # Ref. 2
lc.set_array(values)
lc.set_linewidths(15 * values)
ax.add_collection(lc)

# Add a label to each node while avoiding overlap between labels
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):

    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]   # Ref. 21
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.nipy_spectral(label / float(n_labels)),
                       alpha=.6))   # add the text label, see Ref. 22

# np.ptp(a) replaces a.ptp(), which was removed from ndarray in NumPy 2.0
plt.xlim(embedding[0].min() - .15 * np.ptp(embedding[0]),
         embedding[0].max() + .10 * np.ptp(embedding[0]))   # set the x-axis range, see Ref. 23
plt.ylim(embedding[1].min() - .03 * np.ptp(embedding[1]),
         embedding[1].max() + .03 * np.ptp(embedding[1]))

plt.show()
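
The normalization of the precision matrix in the visualization step can be checked on a small case. The sketch below uses a made-up 3 x 3 precision matrix (an assumption for illustration). Dividing entry (i, j) by the square roots of diagonal entries i and j gives a matrix with unit diagonal; the partial correlation itself is the negative of each off-diagonal entry, which does not matter in the script because only absolute values are used.

```python
import numpy as np

# Hypothetical precision matrix for 3 variables (a chain 0 - 1 - 2).
precision = np.array([[2.0, -1.0, 0.0],
                      [-1.0, 2.0, -1.0],
                      [0.0, -1.0, 2.0]])

# Scale rows and columns by 1 / sqrt(diagonal) so the diagonal becomes 1.
d = 1 / np.sqrt(np.diag(precision))
normalized = precision * d * d[:, np.newaxis]

# Keep each pair once (upper triangle, k=1 skips the diagonal) and
# threshold small values, as the script does with 0.02.
non_zero = np.abs(np.triu(normalized, k=1)) > 0.02
start_idx, end_idx = np.where(non_zero)
print(list(zip(start_idx.tolist(), end_idx.tolist())))  # [(0, 1), (1, 2)]
```

Each (start, end) pair becomes one line segment in the `LineCollection`, and the absolute normalized value sets its color and width.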

To download the script, see "Download Python source code: plot_stock_market.py".

-----------------------Divider: the translation ends above; the CSI 300 application follows--------------------------

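Visualizing the structure of the CSI 300 stocks

The scikit-learn example maps the relationships among the 56 stocks listed in symbol_dict. What does the same algorithm produce on the A-share market?

For details, see: "CSI 300 stock clustering visualization example || complete runnable code explained line by line"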

Reference

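1. Visualizing the stock market structure
2. Matplotlib.collections.LineCollection: structure and usage
3. sklearn.cluster clustering module: structure (classes || functions) and usage
4. sklearn.covariance module: structure and usage
5. sklearn.manifold (manifold learning) module: structure and usage || LLE parameters, attributes, and methods explained
6. sys module stdin || stdout || stderr: structure and usage
7. How raw.githubusercontent.com relates to GitHub
8. Python string formatting function str.format(): structure and usage
9. numpy.vstack() || stack() || hstack(): structure and usage
10. Lasso explained: history, mathematical formulation, physical meaning, Python implementation
11. sklearn.cluster.affinity_propagation(): structure, usage, and the AP algorithm explained
12. matplotlib.pyplot.axis() and matplotlib.pyplot.axes(): structure and usage
13. manifold.LocallyLinearEmbedding (LLE): the locally linear embedding algorithm for manifold learning explained
14. sklearn.covariance.GraphicalLassoCV: structure and usage (parameters, attributes, methods)
15. numpy.diag(): structure and usage || parameters explained
16. numpy.newaxis: structure and usage || parameters explained
17. numpy.triu() || numpy.tril(): structure and usage || parameters explained
18. matplotlib.pyplot.cm: structure and usage || parameters explained
19. matplotlib.pyplot.scatter: structure and usage || parameters explained
20. numpy.where(), numpy.nonzero(): structure and usage || parameters explained
21. numpy.argmin() || argmax(): structure and usage || axis explained
22. matplotlib.pyplot.text(): structure and usage || parameters explained
23. matplotlib.pyplot.xlim(), ylim(), axis(): structure and usage || parameters explained
24. CSI 300 stock clustering visualization example || complete runnable code explained line by line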
