The path is:

D:\software\python27\Lib\site-packages\sklearn\datasets

(this is the sklearn\datasets folder of the author's Python 2.7 installation; adjust the prefix to match your own environment).

Replace the contents of twenty_newsgroups.py with the following. The download statements inside download_20newsgroups are commented out, so the loader works from a locally supplied 20news-bydate.tar.gz instead of fetching it over the network:

"""Caching loader for the 20 newsgroups text classification datasetThe description of the dataset is available on the official website at:http://people.csail.mit.edu/jrennie/20Newsgroups/Quoting the introduction:The 20 Newsgroups data set is a collection of approximately 20,000newsgroup documents, partitioned (nearly) evenly across 20 differentnewsgroups. To the best of my knowledge, it was originally collectedby Ken Lang, probably for his Newsweeder: Learning to filter netnewspaper, though he does not explicitly mention this collection. The 20newsgroups collection has become a popular data set for experimentsin text applications of machine learning techniques, such as textclassification and text clustering.This dataset loader will download the recommended "by date" variant of the
dataset and which features a point in time split between the train and
test sets. The compressed dataset size is around 14 Mb compressed. Once
uncompressed the train set is 52 MB and the test set is 34 MB.The data is downloaded, extracted and cached in the '~/scikit_learn_data'
folder.The `fetch_20newsgroups` function will not vectorize the data into numpy
arrays but the dataset lists the filenames of the posts and their categories
as target labels.The `fetch_20newsgroups_tfidf` function will in addition do a simple tf-idf
vectorization step."""
# Copyright (c) 2011 Olivier Grisel <olivier.grisel@ensta.org>
# License: BSD 3 clauseimport os
import logging
import tarfile
import pickle
import shutil
import re
import codecsimport numpy as np
import scipy.sparse as spfrom .base import get_data_home
from .base import Bunch
from .base import load_files
from ..utils import check_random_state
from ..feature_extraction.text import CountVectorizer
from ..preprocessing import normalize
from ..externals import joblib, sixif six.PY3:from urllib.request import urlopen
else:from urllib2 import urlopenlogger = logging.getLogger(__name__)URL = ("http://people.csail.mit.edu/jrennie/""20Newsgroups/20news-bydate.tar.gz")
ARCHIVE_NAME = "20news-bydate.tar.gz"
CACHE_NAME = "20news-bydate.pkz"
TRAIN_FOLDER = "20news-bydate-train"
TEST_FOLDER = "20news-bydate-test"def download_20newsgroups(target_dir, cache_path):"""Download the 20 newsgroups data and stored it as a zipped pickle."""archive_path = os.path.join(target_dir, ARCHIVE_NAME)train_path = os.path.join(target_dir, TRAIN_FOLDER)test_path = os.path.join(target_dir, TEST_FOLDER)# if not os.path.exists(target_dir):#     os.makedirs(target_dir)## if os.path.exists(archive_path):#     # Download is not complete as the .tar.gz file is removed after#     # download.#     logger.warn("Download was incomplete, downloading again.")#     os.remove(archive_path)# logger.warn("Downloading dataset from %s (14 MB)", URL)# opener = urlopen(URL)# open(archive_path, 'wb').write(opener.read())logger.info("Decompressing %s", archive_path)tarfile.open(archive_path, "r:gz").extractall(path=target_dir)os.remove(archive_path)# Store a zipped picklecache = dict(train=load_files(train_path, encoding='latin1'),test=load_files(test_path, encoding='latin1'))compressed_content = codecs.encode(pickle.dumps(cache), 'zlib_codec')open(cache_path, 'wb').write(compressed_content)shutil.rmtree(target_dir)return cachedef strip_newsgroup_header(text):"""Given text in "news" format, strip the headers, by removing everythingbefore the first blank line."""_before, _blankline, after = text.partition('\n\n')return after_QUOTE_RE = re.compile(r'(writes in|writes:|wrote:|says:|said:'r'|^In article|^Quoted from|^\||^>)')def strip_newsgroup_quoting(text):"""Given text in "news" format, strip lines beginning with the quotecharacters > or |, plus lines that often introduce a quoted section(for example, because they contain the string 'writes:'.)"""good_lines = [line for line in text.split('\n')if not _QUOTE_RE.search(line)]return '\n'.join(good_lines)def strip_newsgroup_footer(text):"""Given text in "news" format, attempt to remove a signature block.As a rough heuristic, we assume that signatures are set apart by eithera blank line or a line made of hyphens, and that it is the last such linein the file (disregarding blank lines at the end)."""lines = text.strip().split('\n')for line_num in range(len(lines) - 1, -1, -1):line = lines[line_num]if line.strip().strip('-') == '':breakif line_num > 0:return '\n'.join(lines[:line_num])else:return textdef fetch_20newsgroups(data_home=None, subset='train', categories=None,shuffle=True, random_state=42,remove=(),download_if_missing=True):"""Load the filenames and data from the 20 newsgroups dataset.Parameters----------subset: 'train' or 'test', 'all', optionalSelect the dataset to load: 'train' for the training set, 'test'for the test set, 'all' for both, with shuffled ordering.data_home: optional, default: NoneSpecify an download and cache folder for the datasets. If None,all scikit-learn data is stored in '~/scikit_learn_data' subfolders.categories: None or collection of string or unicodeIf None (default), load all the categories.If not None, list of category names to load (other categoriesignored).shuffle: bool, optionalWhether or not to shuffle the data: might be important for models thatmake the assumption that the samples are independent and identicallydistributed (i.i.d.), such as stochastic gradient descent.random_state: numpy random number generator or seed integerUsed to shuffle the dataset.download_if_missing: optional, True by defaultIf False, raise an IOError if the data is not locally availableinstead of trying to download the data from the source site.remove: tupleMay contain any subset of ('headers', 'footers', 'quotes'). 
Each ofthese are kinds of text that will be detected and removed from thenewsgroup posts, preventing classifiers from overfitting onmetadata.'headers' removes newsgroup headers, 'footers' removes blocks at theends of posts that look like signatures, and 'quotes' removes linesthat appear to be quoting another post.'headers' follows an exact standard; the other filters are not alwayscorrect."""data_home = get_data_home(data_home=data_home)cache_path = os.path.join(data_home, CACHE_NAME)twenty_home = os.path.join(data_home, "20news_home")cache = Noneif os.path.exists(cache_path):try:with open(cache_path, 'rb') as f:compressed_content = f.read()uncompressed_content = codecs.decode(compressed_content, 'zlib_codec')cache = pickle.loads(uncompressed_content)except Exception as e:print(80 * '_')print('Cache loading failed')print(80 * '_')print(e)if cache is None:if download_if_missing:cache = download_20newsgroups(target_dir=twenty_home,cache_path=cache_path)else:raise IOError('20Newsgroups dataset not found')if subset in ('train', 'test'):data = cache[subset]elif subset == 'all':data_lst = list()target = list()filenames = list()for subset in ('train', 'test'):data = cache[subset]data_lst.extend(data.data)target.extend(data.target)filenames.extend(data.filenames)data.data = data_lstdata.target = np.array(target)data.filenames = np.array(filenames)data.description = 'the 20 newsgroups by date dataset'else:raise ValueError("subset can only be 'train', 'test' or 'all', got '%s'" % subset)if 'headers' in remove:data.data = [strip_newsgroup_header(text) for text in data.data]if 'footers' in remove:data.data = [strip_newsgroup_footer(text) for text in data.data]if 'quotes' in remove:data.data = [strip_newsgroup_quoting(text) for text in data.data]if categories is not None:labels = [(data.target_names.index(cat), cat) for cat in categories]# Sort the categories to have the ordering of the labelslabels.sort()labels, categories = zip(*labels)mask = np.in1d(data.target, labels)data.filenames = data.filenames[mask]data.target = data.target[mask]# searchsorted to have continuous labelsdata.target = np.searchsorted(labels, data.target)data.target_names = list(categories)# Use an object array to shuffle: avoids memory copydata_lst = np.array(data.data, dtype=object)data_lst = data_lst[mask]data.data = data_lst.tolist()if shuffle:random_state = check_random_state(random_state)indices = np.arange(data.target.shape[0])random_state.shuffle(indices)data.filenames = data.filenames[indices]data.target = data.target[indices]# Use an object array to shuffle: avoids memory copydata_lst = np.array(data.data, dtype=object)data_lst = data_lst[indices]data.data = data_lst.tolist()return datadef fetch_20newsgroups_vectorized(subset="train", remove=(), data_home=None):"""Load the 20 newsgroups dataset and transform it into tf-idf vectors.This is a convenience function; the tf-idf transformation is done using thedefault settings for `sklearn.feature_extraction.text.Vectorizer`. For moreadvanced usage (stopword filtering, n-gram extraction, etc.), combinefetch_20newsgroups with a custom `Vectorizer` or `CountVectorizer`.Parameters----------subset: 'train' or 'test', 'all', optionalSelect the dataset to load: 'train' for the training set, 'test'for the test set, 'all' for both, with shuffled ordering.data_home: optional, default: NoneSpecify an download and cache folder for the datasets. 
If None,all scikit-learn data is stored in '~/scikit_learn_data' subfolders.remove: tupleMay contain any subset of ('headers', 'footers', 'quotes'). Each ofthese are kinds of text that will be detected and removed from thenewsgroup posts, preventing classifiers from overfitting onmetadata.'headers' removes newsgroup headers, 'footers' removes blocks at theends of posts that look like signatures, and 'quotes' removes linesthat appear to be quoting another post.Returns-------bunch : Bunch objectbunch.data: sparse matrix, shape [n_samples, n_features]bunch.target: array, shape [n_samples]bunch.target_names: list, length [n_classes]"""data_home = get_data_home(data_home=data_home)filebase = '20newsgroup_vectorized'if remove:filebase += 'remove-' + ('-'.join(remove))target_file = os.path.join(data_home, filebase + ".pk")# we shuffle but use a fixed seed for the memoizationdata_train = fetch_20newsgroups(data_home=data_home,subset='train',categories=None,shuffle=True,random_state=12,remove=remove)data_test = fetch_20newsgroups(data_home=data_home,subset='test',categories=None,shuffle=True,random_state=12,remove=remove)if os.path.exists(target_file):X_train, X_test = joblib.load(target_file)else:vectorizer = CountVectorizer(dtype=np.int16)X_train = vectorizer.fit_transform(data_train.data).tocsr()X_test = vectorizer.transform(data_test.data).tocsr()joblib.dump((X_train, X_test), target_file, compress=9)# the data is stored as int16 for compactness# but normalize needs floatsX_train = X_train.astype(np.float64)X_test = X_test.astype(np.float64)normalize(X_train, copy=False)normalize(X_test, copy=False)target_names = data_train.target_namesif subset == "train":data = X_traintarget = data_train.targetelif subset == "test":data = X_testtarget = data_test.targetelif subset == "all":data = sp.vstack((X_train, X_test)).tocsr()target = np.concatenate((data_train.target, data_test.target))else:raise ValueError("%r is not a valid subset: should be one of ""['train', 'test', 'all']" % subset)return Bunch(data=data, target=target, target_names=target_names)
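
With the modified loader in place, the data can be read without any network access. Below is a minimal usage sketch, assuming 20news-bydate.tar.gz has already been downloaded manually and copied into the 20news_home folder under the scikit-learn data home (by default ~/scikit_learn_data), which is where the code above looks for the archive:

# Minimal usage sketch; assumes ~/scikit_learn_data/20news_home/20news-bydate.tar.gz
# already exists (downloaded by hand), matching the paths used by the loader above.
from sklearn.datasets import fetch_20newsgroups

# With the download block commented out, the first call only decompresses the
# local archive and caches it as ~/scikit_learn_data/20news-bydate.pkz.
news = fetch_20newsgroups(subset='all')

print(len(news.data))          # 18846 posts in the full "by date" collection
print(news.target_names[:5])   # a few of the 20 category names

Later calls skip the extraction step and read the cached 20news-bydate.pkz directly. Note that, as in the upstream loader, os.remove(archive_path) deletes the .tar.gz after extraction, so keep a separate copy of the archive in case the cache ever needs to be rebuilt.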
