Understanding CountVectorizer and TfidfVectorizer in sklearn.feature_extraction.text

"""
Understanding CountVectorizer and TfidfVectorizer in sklearn
"""
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["there is a dog dog", "here is a cat"]
count_vec = CountVectorizer()
a = count_vec.fit_transform(sentences)
print(a.toarray())
print(count_vec.vocabulary_)
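"""
Output of count_vec.vocabulary_:
{'dog': 1, 'there': 4, 'here': 2, 'cat': 0, 'is': 3}
Each word maps to its column index in the count matrix.
The single-character token "a" is missing because the default token_pattern
only keeps tokens of two or more characters.
"""

print("=" * 10)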
tf_vec = TfidfVectorizer()
b = tf_vec.fit_transform(sentences)
print(b.toarray())
print(tf_vec.vocabulary_)
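print(tf_vec.idf_)  # inverse document frequency
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(tf_vec.get_feature_names_out())


def mytf_idf(s):
    # Hand-rolled TF-IDF that mirrors TfidfVectorizer's default settings
    words = tf_vec.get_feature_names_out()
    tf_matrix = np.zeros((len(s), len(words)), dtype=np.float32)
    smooth = 1
    # Document frequencies start at the smoothing term
    df_matrix = np.ones(len(words), dtype=np.float32) * smooth
    for i in range(len(s)):
        # Count the whitespace-separated tokens of sentence i once
        word_counts = Counter(s[i].split())
        for j in range(len(words)):
            cnt = word_counts.get(words[j], 0)
            tf_matrix[i][j] = cnt
            if cnt > 0:
                df_matrix[j] += 1
    # With this smoothing the idf is never below 1
    # (it is exactly 1 for a term that occurs in every document)
    idf_matrix = np.log((len(s) + smooth) / df_matrix) + 1
    matrix = tf_matrix * idf_matrix
    # L2-normalize each row
    matrix = matrix / np.linalg.norm(matrix, 2, axis=1).reshape(matrix.shape[0], 1)
    print(matrix)


print("=" * 10)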
mytf_idf(sentences)
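
The hand-rolled version above can be checked numerically rather than by comparing printed matrices. The following is a minimal sketch, assuming TfidfVectorizer is left at its defaults (smooth_idf=True, norm="l2", sublinear_tf=False) and that whitespace splitting agrees with the vectorizer's tokenizer for these two sentences. The variable names (`manual`, `idf_matches`, `tfidf_matches`) are illustrative and not part of the original script.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["there is a dog dog", "here is a cat"]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
tfidf = vec.fit_transform(sentences).toarray()

# Vocabulary terms ordered by column index (works across sklearn versions)
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Raw term counts per sentence, using plain whitespace tokenization
counts = np.array(
    [[s.split().count(w) for w in vocab] for s in sentences], dtype=float
)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
n_docs = len(sentences)
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# tf * idf, then L2-normalize each row
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

idf_matches = np.allclose(idf, vec.idf_)
tfidf_matches = np.allclose(manual, tfidf)
print(idf_matches, tfidf_matches)
```

If either flag comes out False on other input, the likely cause is tokenization: the vectorizer lowercases text and strips punctuation, which a plain `split()` does not.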
"""
TODO:
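* IDF could be learned: train it by backpropagation in a neural network instead of computing it directly
* Sometimes CountVectorizer does not need the counts, only whether a word occurred at all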
"""
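
For the second TODO item, CountVectorizer already has a `binary` parameter that records presence instead of counts. A small sketch on the same two sentences, with illustrative variable names (`counts`, `presence`):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["there is a dog dog", "here is a cat"]

# Default behaviour: each cell holds the number of occurrences
counts = CountVectorizer().fit_transform(sentences).toarray()

# binary=True: every nonzero count is set to 1
presence = CountVectorizer(binary=True).fit_transform(sentences).toarray()

print(counts)    # "dog" (column 1) is counted twice in the first sentence
print(presence)  # the same cell is 1 here
```

TfidfVectorizer accepts the same `binary` flag, in which case the term-frequency part becomes 0 or 1 before the idf weighting is applied.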

Reposted from: https://www.cnblogs.com/weiyinfu/p/9558755.html
