Machine Learning Code Roundup

Table of Contents

  • Machine Learning Code Roundup
    • XGBoost
      • Example 1
      • Example 2
    • SVM
      • Example 1
      • Example 2
    • EM
      • Example 1
      • Example 2: GMM
    • Bayesian Networks
      • Example 1
    • LDA
Machine Learning Code Roundup

XGBoost

sklearn ships a gradient-boosting implementation, but there is a more powerful standalone package; simply run

pip3 install xgboost

to install it, although the installation can be a bumpy ride.

Next, we need to know the general workflow for using xgboost: load the data into a DMatrix, set the booster parameters, call xgb.train, and finally predict. None of the examples below departs from this framework:
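Here is a minimal sketch of that flow on synthetic data (the arrays and numbers are made up purely for illustration):

import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, invented just to show the flow
np.random.seed(0)
X = np.random.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)                                       # 1. wrap the data in a DMatrix
param = {'max_depth': 2, 'eta': 0.3, 'objective': 'binary:logistic'}   # 2. set the booster parameters
bst = xgb.train(param, dtrain, num_boost_round=5)                      # 3. train for a few boosting rounds
pred = bst.predict(xgb.DMatrix(X))                                     # 4. predict (probabilities here)
print(pred[:5])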

Example 1

This example involves the agaricus (mushroom) dataset. Agaricus is a genus of mushrooms (the name is often glossed as the Brazilian mushroom, Agaricus blazei). These mushrooms come in many species, some poisonous and some not. Can we predict whether a given mushroom is poisonous?

import xgboost as xgb
import numpy as np

# 1. Basic usage of xgboost
# 2. Custom loss function: its gradient and second derivative
train_data = 'xgboost_data/agaricus_train.txt'
test_data = 'xgboost_data/agaricus_test.txt'

# Define a custom loss function (logistic loss): return gradient and hessian
def log_reg(y_hat, y):
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()
    h = p * (1.0 - p)
    return g, h

# Error rate; in this example a prediction < 0.5 means not poisonous
def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)

if __name__ == "__main__":
    # Read the data
    data_train = xgb.DMatrix(train_data)
    data_test = xgb.DMatrix(test_data)
    # Set the parameters
    param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # logitraw
    watchlist = [(data_test, 'eval'), (data_train, 'train')]
    n_round = 7
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist, obj=log_reg, feval=error_rate)
    # Compute the error rate
    y_hat = bst.predict(data_test)
    y = data_test.get_label()
    print('y_hat', y_hat)
    print('y', y)
    error = sum(y != (y_hat > 0.5))
    error_rate = float(error) / len(y_hat)
    print('Total samples:\t', len(y_hat))
    print('Errors:\t%4d' % error)
    print('Error rate:\t%.5f%%' % (100 * error_rate))

Notes:

The log_reg and error_rate functions defined at the top are used in the train call below, as the obj and feval arguments respectively. In other words: boost with the user-defined loss function log_reg, and evaluate with the user-defined error metric error_rate.

About the train function:

def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
          maximize=None, early_stopping_rounds=None, evals_result=None,
          verbose_eval=True, xgb_model=None, callbacks=None)
"""
dtrain:          the training data
num_boost_round: the number of boosting iterations
evals:           a list of (DMatrix, name) pairs naming which sets to evaluate during training
"""

The params argument of train brings us to the Booster parameters:

  • max_depth: the maximum depth of each decision tree

  • eta: the learning rate (the library default is 0.3)

  • silent: silent mode; if set to 1, the model prints nothing while running

  • objective: the loss function to optimize, e.g. binary:logistic for binary classification or reg:linear for regression

xgboost stores its data in the DMatrix data structure: conceptually an ordinary two-dimensional matrix, but internally optimized by xgboost.

The get_label method keeps appearing in the code above, so what exactly is a label?

One English explanation puts it plainly:

The label is the name of some category. If you're building a machine learning system to distinguish fruits coming down a conveyor belt, labels for training samples might be "apple", "orange", "banana". The features are any kind of information you can extract about each sample. In our example, you might have one feature for colour, another for weight, another for length, and another for width. Maybe you would have some measure of concavity or linearity or ball-ness.

In practice: the label says what the sample ultimately is, while the features are its individual attributes.
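As a quick, hypothetical illustration of labels and features (the fruit values are invented):

import numpy as np
import xgboost as xgb

X = np.array([[5.1, 130.0],     # features: e.g. a colour score and a weight
              [6.2, 180.0]])
y = np.array([0, 1])            # labels: 0 = apple, 1 = orange
dmat = xgb.DMatrix(X, label=y)  # a DMatrix stores the features and their labels together
print(dmat.get_label())         # -> [0. 1.]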

Example 2

This example uses the iris dataset. There are many species of iris; this dataset contains three (Setosa, Versicolor, Virginica), and the species differ in attributes such as petal width and sepal length. Let's train XGBoost and see whether it can classify the data effectively.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split   # formerly in sklearn.cross_validation

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'xgboost_data/iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=50)
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]
    # Tree depth 2, learning rate 0.3
    param = {'max_depth': 2, 'eta': 0.3, 'silent': 1, 'objective': 'multi:softmax', 'num_class': 3}
    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)
    result = y_test.reshape(1, -1) == y_hat
    print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
    print('END.....\n')

Notes:

  • The code uses the pd.Categorical method, which provides categorization (and optional ordering):

pandas.Categorical(val, categories=None, ordered=None, dtype=None)
"""
val       : [list-like] the values of the categorical.
categories: [index-like] the unique categories.
ordered   : [boolean] if False, the categorical is treated as unordered.
dtype     : [CategoricalDtype] an instance.
Raises ValueError if the categories do not validate, and
TypeError if ordered=True is given but the values cannot be sorted.
Returns a Categorical variable.
"""

  • reshape(1, -1) reshapes to 1 row

    reshape(2, -1) reshapes to 2 rows

    reshape(-1, 1) reshapes to 1 column

    reshape(-1, 2) reshapes to 2 columns

Both helpers are demonstrated in the snippet below.
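A tiny demonstration of both (the values are invented):

import numpy as np
import pandas as pd

species = pd.Categorical(['Iris-setosa', 'Iris-virginica', 'Iris-setosa'])
print(species.codes)             # -> [0 1 0]: each category mapped to an integer code

a = np.arange(6)
print(a.reshape(1, -1).shape)    # (1, 6): one row
print(a.reshape(2, -1).shape)    # (2, 3): two rows
print(a.reshape(-1, 1).shape)    # (6, 1): one column
print(a.reshape(-1, 2).shape)    # (3, 2): two columns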

SVM

Example 1

Once again we take the classic iris dataset, this time classifying it with an SVM.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'

if __name__ == "__main__":
    path = "./iris.data"  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes  # encode the flower species as integer codes
    x = x[[0, 1]]  # keep only columns 0 and 1
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)
    # Classifier
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
    clf.fit(x_train, y_train.ravel())
    # Accuracy
    print(clf.score(x_train, y_train))
    print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
    print(clf.score(x_test, y_test))
    print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))
    # decision_function
    print('decision_function:\n', clf.decision_function(x_train))
    print('\npredict:\n', clf.predict(x_train))
    # Plot
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]  # grid sample points
    grid_test = np.stack((x1.flat, x2.flat), axis=1)           # test points
    # Z = clf.decision_function(grid_test)    # distance from the samples to the decision surface
    grid_hat = clf.predict(grid_test)         # predicted class values
    grid_hat = grid_hat.reshape(x1.shape)     # match the shape of the input grid
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)         # samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', zorder=10)   # circle the test samples
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris SVM', fontsize=16)
    plt.grid(ls=':')
    plt.tight_layout(pad=1.5)
    plt.show()

Code notes:

  • The relevant parameters of svm.SVC:

    • C=1.0:

      The penalty parameter of SVC. A large C yields high accuracy on the training set but weak generalization; a small C reduces the penalty for misclassification, tolerates errors, and generalizes better.

    • kernel='rbf': the kernel function, 'rbf' by default; it can be 'linear' (linear kernel), 'poly' (polynomial kernel), 'rbf' (Gaussian kernel), or 'sigmoid' (sigmoid kernel)

    • degree: the degree of the 'poly' kernel, 3 by default; ignored by the other kernels.

    • gamma: the kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default is 'auto' (numerically, 1 / number of features)

    • coef0: the constant term of the kernel function; only meaningful for 'poly' and 'sigmoid'.

    • probability: whether to enable probability estimates, False by default

    • tol: the error tolerance for stopping training, 1e-3 by default

    • max_iter: the maximum number of iterations; -1 means no limit.

    • decision_function_shape: 'ovo', 'ovr' or None, default=None

      ovo: one-vs-one

      ovr: one-vs-rest (a quick shape check of the two options follows this list)
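As a sketch of what decision_function_shape actually changes (made-up four-class data): with k classes, 'ovo' produces k(k-1)/2 pairwise columns while 'ovr' produces k, one per class.

import numpy as np
from sklearn import svm

X = np.random.rand(40, 2)
y = np.arange(40) % 4    # four classes, each guaranteed to be present

for shape in ('ovo', 'ovr'):
    clf = svm.SVC(C=1.0, kernel='rbf', gamma='auto', decision_function_shape=shape)
    clf.fit(X, y)
    print(shape, clf.decision_function(X).shape)   # ovo -> (40, 6), ovr -> (40, 4)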

Example 2

# -*- coding:utf-8 -*-
import numpy as np
from sklearn import svm
from scipy import stats
from sklearn.metrics import accuracy_score
import matplotlib as mpl
import matplotlib.pyplot as plt

def extend(a, b, r):
    # Widen the interval [a, b] around its midpoint by the factor r
    x = b - a
    m = (a + b) / 2
    return m - r * x / 2, m + r * x / 2

if __name__ == "__main__":
    # Build our own synthetic sample set
    np.random.seed(0)
    N = 20
    x = np.empty((4*N, 2))  # an uninitialized (4N, 2) array, filled in below
    means = [(-1, 1), (1, 1), (1, -1), (-1, -1)]
    sigmas = [np.eye(2), 2*np.eye(2), np.diag((1, 2)), np.array(((2, 1), (1, 2)))]  # four covariance matrices
    for i in range(4):
        mn = stats.multivariate_normal(means[i], sigmas[i]*0.3)
        x[i*N:(i+1)*N, :] = mn.rvs(N)
    a = np.array((0, 1, 2, 3)).reshape((-1, 1))
    y = np.tile(a, N).flatten()
    clf = svm.SVC(C=1, kernel='rbf', gamma=1, decision_function_shape='ovo')
    clf.fit(x, y)
    y_hat = clf.predict(x)
    acc = accuracy_score(y, y_hat)
    np.set_printoptions(suppress=True)
    print('Correctly predicted samples: %d, accuracy: %.2f%%' % (round(acc*4*N), 100*acc))
    # decision_function
    print(clf.decision_function(x))
    print(y_hat)
    x1_min, x2_min = np.min(x, axis=0)
    x1_max, x2_max = np.max(x, axis=0)
    x1_min, x1_max = extend(x1_min, x1_max, 1.05)
    x2_min, x2_max = extend(x2_min, x2_max, 1.05)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    x_test = np.stack((x1.flat, x2.flat), axis=1)
    y_test = clf.predict(x_test)
    y_test = y_test.reshape(x1.shape)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#A0FFA0', '#6060FF', '#F080F0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g', 'b', 'm'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_test, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=40, c=y, cmap=cm_dark, alpha=0.7)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.grid(True)
    plt.tight_layout(pad=2.5)
    plt.title('SVM multi-class: One/One or One/Other', fontsize=18)
    plt.show()

Code notes (a small sketch follows this list):

  • scipy.stats.multivariate_normal

    Creates a multivariate normal distribution with a user-specified mean and covariance matrix.

  • rvs(size=number of samples)

    Draws the given number of random variates from a distribution; in the code above, it draws the samples from each multivariate normal.
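A minimal sketch of both calls (mean and covariance invented for illustration):

import numpy as np
from scipy import stats

mn = stats.multivariate_normal(mean=[0, 0], cov=np.eye(2))  # 2-D standard normal
samples = mn.rvs(size=5)                                    # draw 5 random samples
print(samples.shape)                                        # -> (5, 2)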

EM

Example 1


import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances_argmin

if __name__ == '__main__':
    style = 'myself'
    np.random.seed(0)
    mu1_fact = (0, 0, 0)            # true mean of the first component
    cov1_fact = np.diag((1, 2, 3))  # its (diagonal) covariance
    data1 = np.random.multivariate_normal(mu1_fact, cov1_fact, 400)
    mu2_fact = (2, 2, 1)
    cov2_fact = np.array(((1, 1, 3), (1, 2, 1), (0, 0, 1)))
    data2 = np.random.multivariate_normal(mu2_fact, cov2_fact, 100)
    data = np.vstack((data1, data2))
    y = np.array([True] * 400 + [False] * 100)

    if style == 'sklearn':
        g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)
        g.fit(data)
        print('Component weight:\t', g.weights_[0])
        print('Means:\n', g.means_, '\n')
        print('Covariances:\n', g.covariances_, '\n')
        mu1, mu2 = g.means_
        sigma1, sigma2 = g.covariances_
    else:
        num_iter = 100
        n, d = data.shape
        # Initialization
        mu1 = data.min(axis=0)
        mu2 = data.max(axis=0)
        sigma1 = np.identity(d)
        sigma2 = np.identity(d)
        pi = 0.5
        # EM
        for i in range(num_iter):
            # E step: compute the responsibilities
            norm1 = multivariate_normal(mu1, sigma1)
            norm2 = multivariate_normal(mu2, sigma2)
            tau1 = pi * norm1.pdf(data)
            tau2 = (1 - pi) * norm2.pdf(data)
            gamma = tau1 / (tau1 + tau2)
            # M step: re-estimate the parameters
            mu1 = np.dot(gamma, data) / np.sum(gamma)
            mu2 = np.dot((1 - gamma), data) / np.sum((1 - gamma))
            sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)
            sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
            pi = np.sum(gamma) / n
            print(i, ":\t", mu1, mu2)
        print('Component weight:\t', pi)
        print('Means:\t', mu1, mu2)
        print('Covariances:\n', sigma1, '\n\n', sigma2, '\n')

    # Assign each sample to a component
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = norm1.pdf(data)
    tau2 = norm2.pdf(data)

    fig = plt.figure(figsize=(13, 7), facecolor='w')
    ax = fig.add_subplot(121, projection='3d')
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c='b', s=30, marker='o', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('Raw data', fontsize=18)
    ax = fig.add_subplot(122, projection='3d')
    # Match the estimated components to the true ones
    order = pairwise_distances_argmin([mu1_fact, mu2_fact], [mu1, mu2], metric='euclidean')
    print(order)
    if order[0] == 0:
        c1 = tau1 > tau2
    else:
        c1 = tau1 < tau2
    c2 = ~c1
    acc = np.mean(y == c1)
    print('Accuracy: %.2f%%' % (100*acc))
    ax.scatter(data[c1, 0], data[c1, 1], data[c1, 2], c='r', s=30, marker='o', depthshade=True)
    ax.scatter(data[c2, 0], data[c2, 1], data[c2, 2], c='g', s=30, marker='^', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('EM classification', fontsize=18)
    plt.suptitle('An implementation of the EM algorithm', fontsize=21)
    plt.subplots_adjust(top=0.90)
    plt.tight_layout()
    plt.show()
  • np.vstack: stacks matrices vertically.

    np.hstack: stacks matrices horizontally.

  • np.identity(m): creates an m×m identity matrix (all three helpers are demonstrated in the snippet below)
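A tiny demonstration of the three helpers:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.vstack((a, b)).shape)   # (4, 2): stacked vertically
print(np.hstack((a, b)).shape)   # (2, 4): stacked horizontally
print(np.identity(3))            # 3x3 identity matrix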

Example 2: GMM

The example here is the classic one of the height distributions of men and women.


import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = ['SimHei']  # only needed for CJK plot labels
mpl.rcParams['axes.unicode_minus'] = False
# Alternative font setup:
# from matplotlib.font_manager import FontProperties
# font_set = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=15)
# then pass fontproperties=font_set to the text calls

def expand(a, b):
    # Widen the interval [a, b] by 5% on each side
    d = (b - a) * 0.05
    return a - d, b + d

if __name__ == '__main__':
    data = np.loadtxt('HeightWeight.csv', dtype=float, delimiter=',', skiprows=1)
    y, x = np.split(data, [1, ], axis=1)
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.6, random_state=0)
    gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
    x_min = np.min(x, axis=0)
    x_max = np.max(x, axis=0)
    gmm.fit(x)
    print('Means =\n', gmm.means_)
    print('Covariances =\n', gmm.covariances_)
    y_hat = gmm.predict(x)
    y_test_hat = gmm.predict(x_test)
    acc = np.mean(y_hat.ravel() == y.ravel())
    acc_test = np.mean(y_test_hat.ravel() == y_test.ravel())
    acc_str = 'Training accuracy: %.2f%%' % (acc * 100)
    acc_test_str = 'Test accuracy: %.2f%%' % (acc_test * 100)
    print(acc_str)
    print(acc_test_str)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.figure(figsize=(9, 7), facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=50, c=y.ravel(), marker='o', cmap=cm_dark, edgecolors='k')
    plt.scatter(x_test[:, 0], x_test[:, 1], s=60, c=y_test.ravel(), marker='^', cmap=cm_dark, edgecolors='k')
    p = gmm.predict_proba(grid_test)
    print(p)
    p = p[:, 0].reshape(x1.shape)
    CS = plt.contour(x1, x2, p, levels=(0.1, 0.5, 0.8), colors=list('rgb'), linewidths=2)
    plt.clabel(CS, fontsize=15, fmt='%.1f', inline=True)
    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    xx = 0.9*ax1_min + 0.1*ax1_max
    yy = 0.1*ax2_min + 0.9*ax2_max
    plt.text(xx, yy, acc_str, fontsize=18)
    yy = 0.15*ax2_min + 0.85*ax2_max
    plt.text(xx, yy, acc_test_str, fontsize=18)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.xlabel('Height (cm)', fontsize='large')
    plt.ylabel('Weight (kg)', fontsize='large')
    plt.title('GMM parameters estimated with EM', fontsize=20)
    plt.grid()
    plt.show()

Code notes (a small sketch follows this list):

  • np.ravel(): flattens an array to one dimension, without copying the source data when a view suffices

  • predict vs. predict_proba:

    predict: after training, returns the predicted result as a label value

    predict_proba: returns an n-row, k-column array in which the entry in row i, column j is the model's predicted probability that sample i carries label j; each row sums to 1.
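A small sketch on synthetic heights (the numbers are invented) showing the difference between predict and predict_proba on a GaussianMixture:

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
heights = np.concatenate([np.random.normal(160, 5, 50),    # one made-up group
                          np.random.normal(175, 5, 50)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print(gmm.predict(heights[:3]))        # hard component labels
print(gmm.predict_proba(heights[:3]))  # 3 rows, 2 columns; each row sums to 1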

Bayesian Networks

Example 1

This example uses Gaussian naive Bayes to classify the iris data; the code itself is not hard.

Note: it uses a Pipeline: first standardization, then polynomial feature expansion, and finally Gaussian naive Bayes.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

filePath = '/home/johnny/PycharmProjects/pythonProject1/Machine_Learning/Data/iris.data'

if __name__ == "__main__":
    data = pd.read_csv(filePath, header=None)
    x, y = data[np.arange(4)], data[4]
    y = pd.Categorical(values=y).codes
    feature_names = 'sepal length', 'sepal width', 'petal length', 'petal width'
    features = [0, 1]
    x = x[features]
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
    priors = np.array((1, 2, 4), dtype=float)
    priors /= priors.sum()
    # Since the iris data is class-balanced, setting priors is not really necessary
    gnb = Pipeline([
        ('sc', StandardScaler()),
        ('poly', PolynomialFeatures(degree=1)),
        ('clf', GaussianNB(priors=priors))])
    gnb.fit(x, y.ravel())
    y_hat = gnb.predict(x)
    print('Training accuracy: %.2f%%' % (100 * accuracy_score(y, y_hat)))
    y_test_hat = gnb.predict(x_test)
    print('Test accuracy: %.2f%%' % (100 * accuracy_score(y_test, y_test_hat)))
    # Plot
    N, M = 500, 500     # number of sample points along each axis
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)                    # grid sample points
    x_grid = np.stack((x1.flat, x2.flat), axis=1)   # test points
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_grid_hat = gnb.predict(x_grid)                # predicted values
    y_grid_hat = y_grid_hat.reshape(x1.shape)
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_grid_hat, cmap=cm_light)   # show the predicted regions
    plt.scatter(x[features[0]], x[features[1]], c=y, edgecolors='k', s=50, cmap=cm_dark)
    plt.scatter(x_test[features[0]], x_test[features[1]], c=y_test, marker='^', edgecolors='k', s=120, cmap=cm_dark)
    plt.xlabel(feature_names[features[0]], fontsize=13)
    plt.ylabel(feature_names[features[1]], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('GaussianNB for Iris', fontsize=18)
    plt.grid(True)
    plt.show()

LDA

This one uses gensim, which needs to be installed; a plain pip install gensim does the job.

from gensim import corpora, models, similarities
from pprint import pprint

path = './LDA_test.txt'

if __name__ == '__main__':
    f = open(path)
    stop_list = set('for a of the and to in'.split())
    print('After')
    texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
    print('Text = ')
    print(texts)
    # Build a dictionary of the unique words in texts (the words end up in lexicographic id order)
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    # Bag-of-words vector for each document
    corpus = [dictionary.doc2bow(text) for text in texts]
    print("corpus", corpus)
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    corpus_tfidf = corpus   # NOTE: this overwrites the TF-IDF weights with raw counts; remove this line to use TF-IDF
    print('TF-IDF:')
    for c in corpus_tfidf:
        print(c)
    print('\nLDA Model:')
    num_topics = 2
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha='auto', eta='auto', minimum_probability=0.001, passes=10)
    doc_topic = [doc_t for doc_t in lda[corpus_tfidf]]
    print('Document-Topic:\n')
    pprint(doc_topic)
    for doc_topic in lda.get_document_topics(corpus_tfidf):
        print(doc_topic)
    for topic_id in range(num_topics):
        print('Topic', topic_id)
        pprint(lda.show_topic(topic_id))
    similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
    print('Similarity:')
    pprint(list(similarity))
    hda = models.HdpModel(corpus_tfidf, id2word=dictionary)
    topic_result = [a for a in hda[corpus_tfidf]]
    print('\n\nUSE WITH CARE--\nHDA Model:')
    pprint(topic_result)
    print('HDA Topics:')
    print(hda.print_topics(num_topics=2, num_words=5))

Code notes (a small doc2bow demonstration follows this list):

  • doc2bow

    Counts the occurrences of each distinct word, converts each word to its integer word id, and returns the result as a sparse vector of (word id, count) pairs. The ids themselves can be looked up in dictionary.

  • similarities.MatrixSimilarity

    I only understood this one after jumping to its source (Ctrl+B in PyCharm): it computes the cosine similarity of the document corpus. Cosine similarity judges how similar two vectors are by the cosine of the angle between them.

    Compute cosine similarity against a corpus of documents by storing the index matrix in memory.
    Unless the entire matrix fits into main memory, use :class:`~gensim.similarities.docsim.Similarity` instead.
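A minimal doc2bow demonstration (the toy documents are invented):

from gensim import corpora

texts = [['human', 'computer', 'interface'],
         ['computer', 'survey', 'computer']]
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)           # word -> integer id mapping
print(dictionary.doc2bow(texts[1]))  # sparse (id, count) pairs; 'computer' is counted twice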
    
