ML/NB: Classifying and Evaluating the 20-Newsgroups Text Dataset with the Naive Bayes (NB) Algorithm (TfidfVectorizer, Stop Words Not Removed)

Contents

Output

Design approach

Core code


Output

Design approach

Core code

class TfidfVectorizer Found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is expected
        to be a list of filenames that need reading to fetch the raw content
        to analyze.
        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.
        Otherwise the input is expected to be the sequence strings or bytes
        items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have an
        direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate
        stop list is returned. 'english' is currently the only supported
        string value.
        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.
        If None, no stop words will be used. max_df can be set to a value in
        the range [0.7, 1.0) to automatically detect and filter stop words
        based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used if
        ``analyzer == 'word'``. The default regexp selects tokens of 2 or
        more alphanumeric characters (punctuation is completely ignored and
        always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf is
        binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights) when ``use_idf`` is set
        to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:
          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).
        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None,
                 vocabulary=None, binary=False, dtype=np.int64, norm='l2',
                 use_idf=True, smooth_idf=True, sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit
        (or fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
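
The section above quotes only the scikit-learn library source of TfidfVectorizer; the post's own classification script is not reproduced. As a reference, here is a minimal sketch of the workflow the title describes: TF-IDF features with the default stop_words=None (so stop words are not removed) feeding a MultinomialNB classifier on fetch_20newsgroups. It assumes scikit-learn >= 0.18; the variable names and the 25% test split are illustrative choices, not the author's original code.

# Minimal sketch (not the author's original script): tf-idf features without
# stop-word removal + multinomial Naive Bayes on the 20-newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the full dataset (about 18,000 posts spread over 20 newsgroup categories).
news = fetch_20newsgroups(subset='all')

# Hold out part of the data for evaluation (the 25% test size is an assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# Default TfidfVectorizer: stop_words=None, so no stop words are filtered out.
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)   # learn vocabulary + idf on training text
X_test_tfidf = tfidf_vec.transform(X_test)         # reuse the same vocabulary on test text

# Multinomial Naive Bayes accepts the sparse tf-idf matrix directly.
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)
y_pred = mnb.predict(X_test_tfidf)

print('Accuracy:', mnb.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))

Even without an explicit stop-word list, the idf term down-weights words that appear in most documents, which is why tf-idf features usually remain serviceable when stop words are kept; passing stop_words='english' instead shrinks the vocabulary and can further improve the scores.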

