ML/NB: Classifying and Evaluating the 20-Newsgroups Text Dataset with the Naive Bayes (NB) Algorithm (TfidfVectorizer, Stop Words Not Removed)

Contents

Output

Design approach

Core code


Output

Design approach

Core code

class TfidfVectorizer Found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is expected
        to be a list of filenames that need reading to fetch the raw content
        to analyze.
        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.
        Otherwise the input is expected to be the sequence strings or bytes
        items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have an
        direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.
        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate
        stop list is returned. 'english' is currently the only supported
        string value.
        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.
        If None, no stop words will be used. max_df can be set to a value in
        the range [0.7, 1.0) to automatically detect and filter stop words
        based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used if
        ``analyzer == 'word'``. The default regexp selects tokens of 2 or
        more alphanumeric characters (punctuation is completely ignored and
        always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf is
        binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights) when ``use_idf`` is set
        to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:
          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).
        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None,
                 vocabulary=None, binary=False, dtype=np.int64, norm='l2',
                 use_idf=True, smooth_idf=True, sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit
        (or fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
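
The section above quotes only the scikit-learn library source of TfidfVectorizer; the post's own classification script is not reproduced. As a reference, here is a minimal sketch of the workflow the title describes: TF-IDF features with the default stop_words=None (so stop words are not removed) feeding a MultinomialNB classifier on fetch_20newsgroups. It assumes scikit-learn >= 0.18; the variable names and the 25% test split are illustrative choices, not the author's original code.

# Minimal sketch (not the author's original script): tf-idf features without
# stop-word removal + multinomial Naive Bayes on the 20-newsgroups dataset.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the full dataset (about 18,000 posts spread over 20 newsgroup categories).
news = fetch_20newsgroups(subset='all')

# Hold out part of the data for evaluation (the 25% test size is an assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# Default TfidfVectorizer: stop_words=None, so no stop words are filtered out.
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(X_train)   # learn vocabulary + idf on training text
X_test_tfidf = tfidf_vec.transform(X_test)         # reuse the same vocabulary on test text

# Multinomial Naive Bayes accepts the sparse tf-idf matrix directly.
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, y_train)
y_pred = mnb.predict(X_test_tfidf)

print('Accuracy:', mnb.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))

Even without an explicit stop-word list, the idf term down-weights words that appear in most documents, which is why tf-idf features usually remain serviceable when stop words are kept; passing stop_words='english' instead shrinks the vocabulary and can further improve the scores.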

