Contents

  • 1. sklearn.decomposition.TruncatedSVD
  • 2. sklearn.feature_extraction.text.TfidfVectorizer
  • 3. Code practice
  • 4. References

Notes on Latent Semantic Analysis (LSA) from the book "Statistical Learning Methods" (《统计学习方法》)

1. sklearn.decomposition.TruncatedSVD

sklearn.decomposition.TruncatedSVD: official documentation

class sklearn.decomposition.TruncatedSVD(n_components=2,
algorithm='randomized', n_iter=5, random_state=None, tol=0.0)

Main parameters:

  • n_components: default = 2, the number of topics

  • algorithm: default = "randomized", the SVD solver to use ("arpack" or "randomized")

  • n_iter: optional (default 5), the number of iterations
    Number of iterations for randomized SVD solver. Not used by ARPACK.
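
As a rough illustration of the algorithm and n_iter parameters, the sketch below fits both solvers on a small random matrix and compares the singular values they find. The data, the seed, and the variable names are assumptions made up for this example; they do not come from the original notes.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy data: 8 samples x 12 features, fixed seed for repeatability.
rng = np.random.RandomState(0)
A = rng.rand(8, 12)

# Randomized solver: n_iter sets the number of power iterations.
svd_rand = TruncatedSVD(n_components=3, algorithm="randomized",
                        n_iter=5, random_state=0)
svd_rand.fit(A)

# ARPACK solver: n_iter is ignored, and n_components must be
# smaller than min(A.shape).
svd_arpack = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0)
svd_arpack.fit(A)

print(svd_rand.singular_values_)
print(svd_arpack.singular_values_)
```

On a matrix this small the two solvers should agree closely; on large sparse matrices the randomized solver is an approximation, and raising n_iter trades speed for accuracy.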

Attributes:

  • components_, shape (n_components, n_features)

  • explained_variance_, shape (n_components,)
    The variance of the training samples transformed by a projection to each component.

  • explained_variance_ratio_, shape (n_components,)
    Percentage of variance explained by each of the selected components.

  • singular_values_, shape (n_components,)
    The singular values corresponding to each of the selected components.
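
To make the attribute shapes concrete, here is a minimal sketch on made-up sparse data. The matrix size, sparsity level, and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse input: 50 samples x 20 features, roughly 20% non-zero.
rng = np.random.RandomState(42)
dense = rng.rand(50, 20)
dense[dense < 0.8] = 0.0
X_demo = csr_matrix(dense)

svd = TruncatedSVD(n_components=5, random_state=42)
X_reduced = svd.fit_transform(X_demo)

print(X_reduced.shape)                      # (n_samples, n_components)
print(svd.components_.shape)                # (n_components, n_features)
print(svd.explained_variance_.shape)        # (n_components,)
print(svd.explained_variance_ratio_.sum())  # share of the total variance kept
print(svd.singular_values_)                 # largest singular value first

# explained_variance_ is the per-component variance of the transformed samples.
print(np.allclose(svd.explained_variance_, X_reduced.var(axis=0)))
```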

2. sklearn.feature_extraction.text.TfidfVectorizer

sklearn.feature_extraction.text.TfidfVectorizer: official documentation
Converts a collection of raw documents into a TF-IDF matrix.

class sklearn.feature_extraction.text.TfidfVectorizer(input='content',
encoding='utf-8', decode_error='strict', strip_accents=None,
lowercase=True, preprocessor=None, tokenizer=None, analyzer='word',
stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1),
max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True,
sublinear_tf=False)

For the parameters, this blog post gives a clear explanation.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)
print(X)
Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 8)    0.38408524091481483
  (0, 3)    0.38408524091481483
  (0, 6)    0.38408524091481483
  (0, 2)    0.5802858236844359
  (0, 1)    0.46979138557992045
  (1, 8)    0.281088674033753
  (1, 3)    0.281088674033753
  (1, 6)    0.281088674033753
  (1, 1)    0.6876235979836938
  (1, 5)    0.5386476208856763
  (2, 8)    0.267103787642168
  (2, 3)    0.267103787642168
  (2, 6)    0.267103787642168
  (2, 0)    0.511848512707169
  (2, 7)    0.511848512707169
  (2, 4)    0.511848512707169
  (3, 8)    0.38408524091481483
  (3, 3)    0.38408524091481483
  (3, 6)    0.38408524091481483
  (3, 2)    0.5802858236844359
  (3, 1)    0.46979138557992045
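
To see where these weights come from, the sketch below recomputes them by hand from raw term counts, assuming the default settings (smooth_idf=True, norm='l2', sublinear_tf=False). The variable names are chosen for this example; the formula is the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by l2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts, shape (n_docs, n_terms).
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (smooth_idf=True)
tfidf = counts * idf                        # term frequency times idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row

# Compare with what TfidfVectorizer produces using its defaults.
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf, reference))
print(tfidf[0, 8])  # weight of 'this' in the first document
```

For example, 'this' occurs in all four documents, so its idf is ln(5/5) + 1 = 1, while 'first' occurs in two, giving ln(5/3) + 1 ≈ 1.51. That is why 'first' gets a larger weight than 'this' in the first document.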

3. Code practice

# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/5/1 10:27
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 17.LSA.py
# @Reference: https://cloud.tencent.com/developer/article/1530432
import numpy as np
from sklearn.decomposition import TruncatedSVD  # LSA (latent semantic analysis)
from sklearn.feature_extraction.text import TfidfVectorizer  # turns a text collection into a weight matrix

# 5 documents
docs = ["Love is patient, love is kind. It does not envy, it does not boast, it is not proud.",
        "It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.",
        "Love does not delight in evil but rejoices with the truth.",
        "It always protects, always trusts, always hopes, always perseveres.",
        "Love never fails. But where there are prophecies, they will cease; where there are tongues, "
        "they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # convert to a TF-IDF weight matrix
print("--------TF-IDF weights---------")
print(X)
print("--------Features (words)---------")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
print(words)
print(len(words), "features (words)")  # 52 words

topics = 4
lsa = TruncatedSVD(n_components=topics)  # latent semantic analysis with 4 topics
X1 = lsa.fit_transform(X)  # fit and transform
print("--------LSA singular values---------")
print(lsa.singular_values_)
print("--------The 5 documents in the 4-topic vector space---------")
print(X1)  # representation of the 5 documents in the 4-topic vector space

pick_docs = 2  # pick the 2 most representative documents for each topic
# argsort returns the indices that sort the array in ascending order;
# the reversed slice keeps the pick_docs largest values, largest first
topic_docid = [X1[:, t].argsort()[:-(pick_docs + 1):-1] for t in range(topics)]
print("--------Top 2 documents for each topic---------")
print(topic_docid)

# print("--------lsa.components_---------")
# print(lsa.components_)  # 4 topics x 52 words, the topic vector space
pick_keywords = 3  # pick 3 keywords for each topic
topic_keywdid = [lsa.components_[t].argsort()[:-(pick_keywords + 1):-1] for t in range(topics)]
print("--------Top 3 keywords for each topic---------")
print(topic_keywdid)

print("--------LSA results---------")
for t in range(topics):
    print("Topic {}".format(t))
    print("\t Keywords: {}".format(", ".join(words[topic_keywdid[t][j]] for j in range(pick_keywords))))
    for i in range(pick_docs):
        print("\t\t Document {}".format(i))
        print("\t\t", docs[topic_docid[t][i]])

Output:

--------TF-IDF weights---------
  (0, 24)   0.3031801002944161
  (0, 19)   0.4547701504416241
  (0, 32)   0.2263512201359201
  (0, 22)   0.2263512201359201
  (0, 20)   0.3825669873635752
  (0, 12)   0.3031801002944161
  (0, 28)   0.4547701504416241
  (0, 14)   0.2263512201359201
  (0, 6)    0.2263512201359201
  (0, 36)   0.2263512201359201
  (1, 19)   0.28327311337182914
  (1, 20)   0.4765965465346523
  (1, 12)   0.14163655668591457
  (1, 28)   0.42490967005774366
  (1, 11)   0.21148886348790247
  (1, 30)   0.21148886348790247
  (1, 40)   0.21148886348790247
  (1, 39)   0.21148886348790247
  (1, 13)   0.21148886348790247
  (1, 2)    0.21148886348790247
  (1, 21)   0.21148886348790247
  (1, 27)   0.21148886348790247
  (1, 37)   0.21148886348790247
  (1, 29)   0.21148886348790247
  (1, 51)   0.21148886348790247
  :    :
  (3, 46)   0.22185332169737518
  (3, 17)   0.22185332169737518
  (3, 33)   0.22185332169737518
  (4, 24)   0.09483932399667956
  (4, 19)   0.09483932399667956
  (4, 20)   0.0797818291938777
  (4, 7)    0.1142518110942895
  (4, 25)   0.14161217495916
  (4, 16)   0.14161217495916
  (4, 48)   0.42483652487747997
  (4, 43)   0.42483652487747997
  (4, 3)    0.28322434991832
  (4, 34)   0.14161217495916
  (4, 44)   0.28322434991832
  (4, 49)   0.42483652487747997
  (4, 8)    0.14161217495916
  (4, 45)   0.14161217495916
  (4, 5)    0.14161217495916
  (4, 41)   0.14161217495916
  (4, 23)   0.14161217495916
  (4, 31)   0.14161217495916
  (4, 4)    0.14161217495916
  (4, 9)    0.14161217495916
  (4, 0)    0.14161217495916
  (4, 26)   0.14161217495916
--------Features (words)---------
['13', 'always', 'angered', 'are', 'away', 'be', 'boast', 'but', 'cease', 'corinthians', 'delight', 'dishonor', 'does', 'easily', 'envy', 'evil', 'fails', 'hopes', 'in', 'is', 'it', 'keeps', 'kind', 'knowledge', 'love', 'never', 'niv', 'no', 'not', 'of', 'others', 'pass', 'patient', 'perseveres', 'prophecies', 'protects', 'proud', 'record', 'rejoices', 'seeking', 'self', 'stilled', 'the', 'there', 'they', 'tongues', 'trusts', 'truth', 'where', 'will', 'with', 'wrongs']
52 features (words)
--------LSA singular values---------
[1.29695724 1.00165234 0.98752651 0.94862686]
--------The 5 documents in the 4-topic vector space---------
[[ 0.85667347 -0.00334881 -0.11274158 -0.14912237]
 [ 0.80868148  0.09220662 -0.16057627 -0.33804609]
 [ 0.46603522 -0.3005665  -0.06851382  0.82322097]
 [ 0.13423034  0.92315127  0.22573307  0.2806665 ]
 [ 0.24297388 -0.22857306  0.9386499  -0.08314939]]
--------Top 2 documents for each topic---------
[array([0, 1], dtype=int64), array([3, 1], dtype=int64), array([4, 3], dtype=int64), array([2, 3], dtype=int64)]
--------Top 3 keywords for each topic---------
[array([28, 20, 19], dtype=int64), array([ 1, 46, 33], dtype=int64), array([49, 48, 43], dtype=int64), array([10, 42, 18], dtype=int64)]
--------LSA results---------
Topic 0
     Keywords: not, it, is
         Document 0
         Love is patient, love is kind. It does not envy, it does not boast, it is not proud.
         Document 1
         It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 1
     Keywords: always, trusts, perseveres
         Document 0
         It always protects, always trusts, always hopes, always perseveres.
         Document 1
         It does not dishonor others, it is not self-seeking, it is not easily angered, it keeps no record of wrongs.
Topic 2
     Keywords: will, where, there
         Document 0
         Love never fails. But where there are prophecies, they will cease; where there are tongues, they will be stilled; where there is knowledge, it will pass away. (1 Corinthians 13:4-8 NIV)
         Document 1
         It always protects, always trusts, always hopes, always perseveres.
Topic 3
     Keywords: delight, the, in
         Document 0
         Love does not delight in evil but rejoices with the truth.
         Document 1
         It always protects, always trusts, always hopes, always perseveres.
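
The link between these numbers and the underlying SVD can be checked with a short sketch. It assumes the small four-sentence corpus from section 2 (to keep the example short) and an arbitrary choice of two topics; the variable names are made up for illustration. It compares the truncated result with a full SVD from NumPy and confirms that the document coordinates are the projection of the TF-IDF matrix onto the topic vectors.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
X_small = TfidfVectorizer().fit_transform(corpus)  # shape (4, 9)

k = 2  # number of topics kept (arbitrary choice for this sketch)
lsa_small = TruncatedSVD(n_components=k, random_state=0)
doc_topic = lsa_small.fit_transform(X_small)  # documents in topic space, shape (4, k)

# Full SVD of the dense matrix for comparison: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X_small.toarray(), full_matrices=False)

# The truncated singular values should match the k largest of the full SVD.
print(np.allclose(lsa_small.singular_values_, s[:k]))

# Document coordinates are X projected onto the topic vectors
# (the rows of components_).
projected = X_small @ lsa_small.components_.T
print(np.allclose(doc_topic, projected))
```

On real corpora the randomized solver only approximates the leading singular values, so an exact match like this should be expected only on small matrices.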

4. References

This post draws mainly on the following article, with thanks to its author:
sklearn: topic analysis of text with TruncatedSVD
