Machine Learning Code Roundup

Table of Contents

  • Machine Learning Code Roundup
    • XGBoost
      • Example 1
      • Example 2
    • SVM
      • Example 1
      • Example 2
    • EM
      • Example 1
      • Example 2: GMM
    • Bayesian Networks
      • Example 1
    • LDA
Machine Learning Code Roundup

XGBoost

sklearn ships a gradient-boosting implementation, but there is a more powerful standalone package; simply run

pip3 install xgboost

to install it, although the installation can be a bumpy ride.

Next, we need to know the general workflow for using xgboost: load the data into a DMatrix, set the booster parameters, call xgb.train, and finally predict. None of the examples below departs from this framework:
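Here is a minimal sketch of that flow on synthetic data (the arrays and numbers are made up purely for illustration):

import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, invented just to show the flow
np.random.seed(0)
X = np.random.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)                                       # 1. wrap the data in a DMatrix
param = {'max_depth': 2, 'eta': 0.3, 'objective': 'binary:logistic'}   # 2. set the booster parameters
bst = xgb.train(param, dtrain, num_boost_round=5)                      # 3. train for a few boosting rounds
pred = bst.predict(xgb.DMatrix(X))                                     # 4. predict (probabilities here)
print(pred[:5])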

Example 1

This example involves the agaricus (mushroom) dataset. Agaricus is a genus of mushrooms (the name is often glossed as the Brazilian mushroom, Agaricus blazei). These mushrooms come in many species, some poisonous and some not. Can we predict whether a given mushroom is poisonous?

import xgboost as xgb
import numpy as np

# 1. Basic usage of xgboost
# 2. Custom loss function: its gradient and second derivative
train_data = 'xgboost_data/agaricus_train.txt'
test_data = 'xgboost_data/agaricus_test.txt'

# Define a custom loss function (logistic loss): return gradient and hessian
def log_reg(y_hat, y):
    p = 1.0 / (1.0 + np.exp(-y_hat))
    g = p - y.get_label()
    h = p * (1.0 - p)
    return g, h

# Error rate; in this example a prediction < 0.5 means not poisonous
def error_rate(y_hat, y):
    return 'error', float(sum(y.get_label() != (y_hat > 0.5))) / len(y_hat)

if __name__ == "__main__":
    # Read the data
    data_train = xgb.DMatrix(train_data)
    data_test = xgb.DMatrix(test_data)
    # Set the parameters
    param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # logitraw
    watchlist = [(data_test, 'eval'), (data_train, 'train')]
    n_round = 7
    bst = xgb.train(param, data_train, num_boost_round=n_round, evals=watchlist, obj=log_reg, feval=error_rate)
    # Compute the error rate
    y_hat = bst.predict(data_test)
    y = data_test.get_label()
    print('y_hat', y_hat)
    print('y', y)
    error = sum(y != (y_hat > 0.5))
    error_rate = float(error) / len(y_hat)
    print('Total samples:\t', len(y_hat))
    print('Errors:\t%4d' % error)
    print('Error rate:\t%.5f%%' % (100 * error_rate))

Notes:

The log_reg and error_rate functions defined at the top are used in the train call below, as the obj and feval arguments respectively. In other words: boost with the user-defined loss function log_reg, and evaluate with the user-defined error metric error_rate.

About the train function:

def train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None,
          maximize=None, early_stopping_rounds=None, evals_result=None,
          verbose_eval=True, xgb_model=None, callbacks=None)
"""
dtrain:          the training data
num_boost_round: the number of boosting iterations
evals:           a list of (DMatrix, name) pairs naming which sets to evaluate during training
"""

The params argument of train brings us to the Booster parameters:

  • max_depth: the maximum depth of each decision tree

  • eta: the learning rate (the library default is 0.3)

  • silent: silent mode; if set to 1, the model prints nothing while running

  • objective: the loss function to optimize, e.g. binary:logistic for binary classification or reg:linear for regression

xgboost stores its data in the DMatrix data structure: conceptually an ordinary two-dimensional matrix, but internally optimized by xgboost.

The get_label method keeps appearing in the code above, so what exactly is a label?

One English explanation puts it plainly:

The label is the name of some category. If you're building a machine learning system to distinguish fruits coming down a conveyor belt, labels for training samples might be "apple", "orange", "banana". The features are any kind of information you can extract about each sample. In our example, you might have one feature for colour, another for weight, another for length, and another for width. Maybe you would have some measure of concavity or linearity or ball-ness.

In practice: the label says what the sample ultimately is, while the features are its individual attributes.
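As a quick, hypothetical illustration of labels and features (the fruit values are invented):

import numpy as np
import xgboost as xgb

X = np.array([[5.1, 130.0],     # features: e.g. a colour score and a weight
              [6.2, 180.0]])
y = np.array([0, 1])            # labels: 0 = apple, 1 = orange
dmat = xgb.DMatrix(X, label=y)  # a DMatrix stores the features and their labels together
print(dmat.get_label())         # -> [0. 1.]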

Example 2

This example uses the iris dataset. There are many species of iris; this dataset contains three (Setosa, Versicolor, Virginica), and the species differ in attributes such as petal width and sepal length. Let's train XGBoost and see whether it can classify the data effectively.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split   # formerly in sklearn.cross_validation

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'xgboost_data/iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=50)
    data_train = xgb.DMatrix(x_train, label=y_train)
    data_test = xgb.DMatrix(x_test, label=y_test)
    watch_list = [(data_test, 'eval'), (data_train, 'train')]
    # Tree depth 2, learning rate 0.3
    param = {'max_depth': 2, 'eta': 0.3, 'silent': 1, 'objective': 'multi:softmax', 'num_class': 3}
    bst = xgb.train(param, data_train, num_boost_round=6, evals=watch_list)
    y_hat = bst.predict(data_test)
    result = y_test.reshape(1, -1) == y_hat
    print('Accuracy:\t', float(np.sum(result)) / len(y_hat))
    print('END.....\n')

Notes:

  • The code uses the pd.Categorical method, which provides categorization (and optional ordering):

pandas.Categorical(val, categories=None, ordered=None, dtype=None)
"""
val       : [list-like] the values of the categorical.
categories: [index-like] the unique categories.
ordered   : [boolean] if False, the categorical is treated as unordered.
dtype     : [CategoricalDtype] an instance.
Raises ValueError if the categories do not validate, and
TypeError if ordered=True is given but the values cannot be sorted.
Returns a Categorical variable.
"""

  • reshape(1, -1) reshapes to 1 row

    reshape(2, -1) reshapes to 2 rows

    reshape(-1, 1) reshapes to 1 column

    reshape(-1, 2) reshapes to 2 columns

Both helpers are demonstrated in the snippet below.
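A tiny demonstration of both (the values are invented):

import numpy as np
import pandas as pd

species = pd.Categorical(['Iris-setosa', 'Iris-virginica', 'Iris-setosa'])
print(species.codes)             # -> [0 1 0]: each category mapped to an integer code

a = np.arange(6)
print(a.reshape(1, -1).shape)    # (1, 6): one row
print(a.reshape(2, -1).shape)    # (2, 3): two rows
print(a.reshape(-1, 1).shape)    # (6, 1): one column
print(a.reshape(-1, 2).shape)    # (3, 2): two columns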

SVM

Example 1

Once again we take the classic iris dataset, this time classifying it with an SVM.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'

if __name__ == "__main__":
    path = "./iris.data"  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[range(4)], data[4]
    y = pd.Categorical(y).codes  # encode the flower species as integer codes
    x = x[[0, 1]]  # keep only columns 0 and 1
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)
    # Classifier
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
    clf.fit(x_train, y_train.ravel())
    # Accuracy
    print(clf.score(x_train, y_train))
    print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
    print(clf.score(x_test, y_test))
    print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))
    # decision_function
    print('decision_function:\n', clf.decision_function(x_train))
    print('\npredict:\n', clf.predict(x_train))
    # Plot
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]  # grid sample points
    grid_test = np.stack((x1.flat, x2.flat), axis=1)           # test points
    # Z = clf.decision_function(grid_test)    # distance from the samples to the decision surface
    grid_hat = clf.predict(grid_test)         # predicted class values
    grid_hat = grid_hat.reshape(x1.shape)     # match the shape of the input grid
    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)         # samples
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', zorder=10)   # circle the test samples
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris SVM', fontsize=16)
    plt.grid(ls=':')
    plt.tight_layout(pad=1.5)
    plt.show()

Code notes:

  • The relevant parameters of svm.SVC:

    • C=1.0:

      The penalty parameter of SVC. A large C yields high accuracy on the training set but weak generalization; a small C reduces the penalty for misclassification, tolerates errors, and generalizes better.

    • kernel='rbf': the kernel function, 'rbf' by default; it can be 'linear' (linear kernel), 'poly' (polynomial kernel), 'rbf' (Gaussian kernel), or 'sigmoid' (sigmoid kernel)

    • degree: the degree of the 'poly' kernel, 3 by default; ignored by the other kernels.

    • gamma: the kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default is 'auto' (numerically, 1 / number of features)

    • coef0: the constant term of the kernel function; only meaningful for 'poly' and 'sigmoid'.

    • probability: whether to enable probability estimates, False by default

    • tol: the error tolerance for stopping training, 1e-3 by default

    • max_iter: the maximum number of iterations; -1 means no limit.

    • decision_function_shape: 'ovo', 'ovr' or None, default=None

      ovo: one-vs-one

      ovr: one-vs-rest (a quick shape check of the two options follows this list)
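As a sketch of what decision_function_shape actually changes (made-up four-class data): with k classes, 'ovo' produces k(k-1)/2 pairwise columns while 'ovr' produces k, one per class.

import numpy as np
from sklearn import svm

X = np.random.rand(40, 2)
y = np.arange(40) % 4    # four classes, each guaranteed to be present

for shape in ('ovo', 'ovr'):
    clf = svm.SVC(C=1.0, kernel='rbf', gamma='auto', decision_function_shape=shape)
    clf.fit(X, y)
    print(shape, clf.decision_function(X).shape)   # ovo -> (40, 6), ovr -> (40, 4)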

Example 2

# -*- coding:utf-8 -*-
import numpy as np
from sklearn import svm
from scipy import stats
from sklearn.metrics import accuracy_score
import matplotlib as mpl
import matplotlib.pyplot as plt

def extend(a, b, r):
    # Widen the interval [a, b] around its midpoint by the factor r
    x = b - a
    m = (a + b) / 2
    return m - r * x / 2, m + r * x / 2

if __name__ == "__main__":
    # Build our own synthetic sample set
    np.random.seed(0)
    N = 20
    x = np.empty((4*N, 2))  # an uninitialized (4N, 2) array, filled in below
    means = [(-1, 1), (1, 1), (1, -1), (-1, -1)]
    sigmas = [np.eye(2), 2*np.eye(2), np.diag((1, 2)), np.array(((2, 1), (1, 2)))]  # four covariance matrices
    for i in range(4):
        mn = stats.multivariate_normal(means[i], sigmas[i]*0.3)
        x[i*N:(i+1)*N, :] = mn.rvs(N)
    a = np.array((0, 1, 2, 3)).reshape((-1, 1))
    y = np.tile(a, N).flatten()
    clf = svm.SVC(C=1, kernel='rbf', gamma=1, decision_function_shape='ovo')
    clf.fit(x, y)
    y_hat = clf.predict(x)
    acc = accuracy_score(y, y_hat)
    np.set_printoptions(suppress=True)
    print('Correctly predicted samples: %d, accuracy: %.2f%%' % (round(acc*4*N), 100*acc))
    # decision_function
    print(clf.decision_function(x))
    print(y_hat)
    x1_min, x2_min = np.min(x, axis=0)
    x1_max, x2_max = np.max(x, axis=0)
    x1_min, x1_max = extend(x1_min, x1_max, 1.05)
    x2_min, x2_max = extend(x2_min, x2_max, 1.05)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    x_test = np.stack((x1.flat, x2.flat), axis=1)
    y_test = clf.predict(x_test)
    y_test = y_test.reshape(x1.shape)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#A0FFA0', '#6060FF', '#F080F0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g', 'b', 'm'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_test, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=40, c=y, cmap=cm_dark, alpha=0.7)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.grid(True)
    plt.tight_layout(pad=2.5)
    plt.title('SVM multi-class: One/One or One/Other', fontsize=18)
    plt.show()

Code notes (a small sketch follows this list):

  • scipy.stats.multivariate_normal

    Creates a multivariate normal distribution with a user-specified mean and covariance matrix.

  • rvs(size=number of samples)

    Draws the given number of random variates from a distribution; in the code above, it draws the samples from each multivariate normal.
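A minimal sketch of both calls (mean and covariance invented for illustration):

import numpy as np
from scipy import stats

mn = stats.multivariate_normal(mean=[0, 0], cov=np.eye(2))  # 2-D standard normal
samples = mn.rvs(size=5)                                    # draw 5 random samples
print(samples.shape)                                        # -> (5, 2)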

EM

Example 1


import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import pairwise_distances_argmin

if __name__ == '__main__':
    style = 'myself'
    np.random.seed(0)
    mu1_fact = (0, 0, 0)            # true mean of the first component
    cov1_fact = np.diag((1, 2, 3))  # its (diagonal) covariance
    data1 = np.random.multivariate_normal(mu1_fact, cov1_fact, 400)
    mu2_fact = (2, 2, 1)
    cov2_fact = np.array(((1, 1, 3), (1, 2, 1), (0, 0, 1)))
    data2 = np.random.multivariate_normal(mu2_fact, cov2_fact, 100)
    data = np.vstack((data1, data2))
    y = np.array([True] * 400 + [False] * 100)

    if style == 'sklearn':
        g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)
        g.fit(data)
        print('Component weight:\t', g.weights_[0])
        print('Means:\n', g.means_, '\n')
        print('Covariances:\n', g.covariances_, '\n')
        mu1, mu2 = g.means_
        sigma1, sigma2 = g.covariances_
    else:
        num_iter = 100
        n, d = data.shape
        # Initialization
        mu1 = data.min(axis=0)
        mu2 = data.max(axis=0)
        sigma1 = np.identity(d)
        sigma2 = np.identity(d)
        pi = 0.5
        # EM
        for i in range(num_iter):
            # E step: compute the responsibilities
            norm1 = multivariate_normal(mu1, sigma1)
            norm2 = multivariate_normal(mu2, sigma2)
            tau1 = pi * norm1.pdf(data)
            tau2 = (1 - pi) * norm2.pdf(data)
            gamma = tau1 / (tau1 + tau2)
            # M step: re-estimate the parameters
            mu1 = np.dot(gamma, data) / np.sum(gamma)
            mu2 = np.dot((1 - gamma), data) / np.sum((1 - gamma))
            sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)
            sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
            pi = np.sum(gamma) / n
            print(i, ":\t", mu1, mu2)
        print('Component weight:\t', pi)
        print('Means:\t', mu1, mu2)
        print('Covariances:\n', sigma1, '\n\n', sigma2, '\n')

    # Assign each sample to a component
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = norm1.pdf(data)
    tau2 = norm2.pdf(data)

    fig = plt.figure(figsize=(13, 7), facecolor='w')
    ax = fig.add_subplot(121, projection='3d')
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c='b', s=30, marker='o', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('Raw data', fontsize=18)
    ax = fig.add_subplot(122, projection='3d')
    # Match the estimated components to the true ones
    order = pairwise_distances_argmin([mu1_fact, mu2_fact], [mu1, mu2], metric='euclidean')
    print(order)
    if order[0] == 0:
        c1 = tau1 > tau2
    else:
        c1 = tau1 < tau2
    c2 = ~c1
    acc = np.mean(y == c1)
    print('Accuracy: %.2f%%' % (100*acc))
    ax.scatter(data[c1, 0], data[c1, 1], data[c1, 2], c='r', s=30, marker='o', depthshade=True)
    ax.scatter(data[c2, 0], data[c2, 1], data[c2, 2], c='g', s=30, marker='^', depthshade=True)
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.set_title('EM classification', fontsize=18)
    plt.suptitle('An implementation of the EM algorithm', fontsize=21)
    plt.subplots_adjust(top=0.90)
    plt.tight_layout()
    plt.show()
  • np.vstack: stacks matrices vertically.

    np.hstack: stacks matrices horizontally.

  • np.identity(m): creates an m×m identity matrix (all three helpers are demonstrated in the snippet below)
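A tiny demonstration of the three helpers:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.vstack((a, b)).shape)   # (4, 2): stacked vertically
print(np.hstack((a, b)).shape)   # (2, 4): stacked horizontally
print(np.identity(3))            # 3x3 identity matrix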

Example 2: GMM

The example here is the classic one of the height distributions of men and women.


import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
import matplotlib as mpl
import matplotlib.colors
import matplotlib.pyplot as plt

mpl.rcParams['font.sans-serif'] = ['SimHei']  # only needed for CJK plot labels
mpl.rcParams['axes.unicode_minus'] = False
# Alternative font setup:
# from matplotlib.font_manager import FontProperties
# font_set = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=15)
# then pass fontproperties=font_set to the text calls

def expand(a, b):
    # Widen the interval [a, b] by 5% on each side
    d = (b - a) * 0.05
    return a - d, b + d

if __name__ == '__main__':
    data = np.loadtxt('HeightWeight.csv', dtype=float, delimiter=',', skiprows=1)
    y, x = np.split(data, [1, ], axis=1)
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.6, random_state=0)
    gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
    x_min = np.min(x, axis=0)
    x_max = np.max(x, axis=0)
    gmm.fit(x)
    print('Means =\n', gmm.means_)
    print('Covariances =\n', gmm.covariances_)
    y_hat = gmm.predict(x)
    y_test_hat = gmm.predict(x_test)
    acc = np.mean(y_hat.ravel() == y.ravel())
    acc_test = np.mean(y_test_hat.ravel() == y_test.ravel())
    acc_str = 'Training accuracy: %.2f%%' % (acc * 100)
    acc_test_str = 'Test accuracy: %.2f%%' % (acc_test * 100)
    print(acc_str)
    print(acc_test_str)
    cm_light = mpl.colors.ListedColormap(['#FF8080', '#77E0A0'])
    cm_dark = mpl.colors.ListedColormap(['r', 'g'])
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    x1_min, x1_max = expand(x1_min, x1_max)
    x2_min, x2_max = expand(x2_min, x2_max)
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]
    grid_test = np.stack((x1.flat, x2.flat), axis=1)
    grid_hat = gmm.predict(grid_test)
    grid_hat = grid_hat.reshape(x1.shape)
    plt.figure(figsize=(9, 7), facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
    plt.scatter(x[:, 0], x[:, 1], s=50, c=y.ravel(), marker='o', cmap=cm_dark, edgecolors='k')
    plt.scatter(x_test[:, 0], x_test[:, 1], s=60, c=y_test.ravel(), marker='^', cmap=cm_dark, edgecolors='k')
    p = gmm.predict_proba(grid_test)
    print(p)
    p = p[:, 0].reshape(x1.shape)
    CS = plt.contour(x1, x2, p, levels=(0.1, 0.5, 0.8), colors=list('rgb'), linewidths=2)
    plt.clabel(CS, fontsize=15, fmt='%.1f', inline=True)
    ax1_min, ax1_max, ax2_min, ax2_max = plt.axis()
    xx = 0.9*ax1_min + 0.1*ax1_max
    yy = 0.1*ax2_min + 0.9*ax2_max
    plt.text(xx, yy, acc_str, fontsize=18)
    yy = 0.15*ax2_min + 0.85*ax2_max
    plt.text(xx, yy, acc_test_str, fontsize=18)
    plt.xlim((x1_min, x1_max))
    plt.ylim((x2_min, x2_max))
    plt.xlabel('Height (cm)', fontsize='large')
    plt.ylabel('Weight (kg)', fontsize='large')
    plt.title('GMM parameters estimated with EM', fontsize=20)
    plt.grid()
    plt.show()

Code notes (a small sketch follows this list):

  • np.ravel(): flattens an array to one dimension, without copying the source data when a view suffices

  • predict vs. predict_proba:

    predict: after training, returns the predicted result as a label value

    predict_proba: returns an n-row, k-column array in which the entry in row i, column j is the model's predicted probability that sample i carries label j; each row sums to 1.
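A small sketch on synthetic heights (the numbers are invented) showing the difference between predict and predict_proba on a GaussianMixture:

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(0)
heights = np.concatenate([np.random.normal(160, 5, 50),    # one made-up group
                          np.random.normal(175, 5, 50)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(heights)
print(gmm.predict(heights[:3]))        # hard component labels
print(gmm.predict_proba(heights[:3]))  # 3 rows, 2 columns; each row sums to 1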

Bayesian Networks

Example 1

This example uses Gaussian naive Bayes to classify the iris data; the code itself is not hard.

Note: it uses a Pipeline: first standardization, then polynomial feature expansion, and finally Gaussian naive Bayes.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

filePath = '/home/johnny/PycharmProjects/pythonProject1/Machine_Learning/Data/iris.data'

if __name__ == "__main__":
    data = pd.read_csv(filePath, header=None)
    x, y = data[np.arange(4)], data[4]
    y = pd.Categorical(values=y).codes
    feature_names = 'sepal length', 'sepal width', 'petal length', 'petal width'
    features = [0, 1]
    x = x[features]
    x, x_test, y, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
    priors = np.array((1, 2, 4), dtype=float)
    priors /= priors.sum()
    # Since the iris data is class-balanced, setting priors is not really necessary
    gnb = Pipeline([
        ('sc', StandardScaler()),
        ('poly', PolynomialFeatures(degree=1)),
        ('clf', GaussianNB(priors=priors))])
    gnb.fit(x, y.ravel())
    y_hat = gnb.predict(x)
    print('Training accuracy: %.2f%%' % (100 * accuracy_score(y, y_hat)))
    y_test_hat = gnb.predict(x_test)
    print('Test accuracy: %.2f%%' % (100 * accuracy_score(y_test, y_test_hat)))
    # Plot
    N, M = 500, 500     # number of sample points along each axis
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    x1, x2 = np.meshgrid(t1, t2)                    # grid sample points
    x_grid = np.stack((x1.flat, x2.flat), axis=1)   # test points
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    y_grid_hat = gnb.predict(x_grid)                # predicted values
    y_grid_hat = y_grid_hat.reshape(x1.shape)
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, y_grid_hat, cmap=cm_light)   # show the predicted regions
    plt.scatter(x[features[0]], x[features[1]], c=y, edgecolors='k', s=50, cmap=cm_dark)
    plt.scatter(x_test[features[0]], x_test[features[1]], c=y_test, marker='^', edgecolors='k', s=120, cmap=cm_dark)
    plt.xlabel(feature_names[features[0]], fontsize=13)
    plt.ylabel(feature_names[features[1]], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('GaussianNB for Iris', fontsize=18)
    plt.grid(True)
    plt.show()

LDA

This one uses gensim, which needs to be installed; a plain pip install gensim does the job.

from gensim import corpora, models, similarities
from pprint import pprint

path = './LDA_test.txt'

if __name__ == '__main__':
    f = open(path)
    stop_list = set('for a of the and to in'.split())
    print('After')
    texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
    print('Text = ')
    print(texts)
    # Build a dictionary of the unique words in texts (the words end up in lexicographic id order)
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    # Bag-of-words vector for each document
    corpus = [dictionary.doc2bow(text) for text in texts]
    print("corpus", corpus)
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    corpus_tfidf = corpus   # NOTE: this overwrites the TF-IDF weights with raw counts; remove this line to use TF-IDF
    print('TF-IDF:')
    for c in corpus_tfidf:
        print(c)
    print('\nLDA Model:')
    num_topics = 2
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha='auto', eta='auto', minimum_probability=0.001, passes=10)
    doc_topic = [doc_t for doc_t in lda[corpus_tfidf]]
    print('Document-Topic:\n')
    pprint(doc_topic)
    for doc_topic in lda.get_document_topics(corpus_tfidf):
        print(doc_topic)
    for topic_id in range(num_topics):
        print('Topic', topic_id)
        pprint(lda.show_topic(topic_id))
    similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
    print('Similarity:')
    pprint(list(similarity))
    hda = models.HdpModel(corpus_tfidf, id2word=dictionary)
    topic_result = [a for a in hda[corpus_tfidf]]
    print('\n\nUSE WITH CARE--\nHDA Model:')
    pprint(topic_result)
    print('HDA Topics:')
    print(hda.print_topics(num_topics=2, num_words=5))

Code notes (a small doc2bow demonstration follows this list):

  • doc2bow

    Counts the occurrences of each distinct word, converts each word to its integer word id, and returns the result as a sparse vector of (word id, count) pairs. The ids themselves can be looked up in dictionary.

  • similarities.MatrixSimilarity

    I only understood this one after jumping to its source (Ctrl+B in PyCharm): it computes the cosine similarity of the document corpus. Cosine similarity judges how similar two vectors are by the cosine of the angle between them.

    Compute cosine similarity against a corpus of documents by storing the index matrix in memory.
    Unless the entire matrix fits into main memory, use :class:`~gensim.similarities.docsim.Similarity` instead.
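A minimal doc2bow demonstration (the toy documents are invented):

from gensim import corpora

texts = [['human', 'computer', 'interface'],
         ['computer', 'survey', 'computer']]
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)           # word -> integer id mapping
print(dictionary.doc2bow(texts[1]))  # sparse (id, count) pairs; 'computer' is counted twice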
    
