Python third-party modules, machine learning: Scikit-Learn overview, base classes, datasets, exceptions

Official documentation (English): https://scikit-learn.org/stable/
Documentation in Chinese: https://scikit-learn.org.cn/ and https://www.cntofu.com/book/170/index.html

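I. Basics
1. Overview:

Scikit-Learn (sklearn for short) is an open-source Python machine learning library built on NumPy, SciPy, and Matplotlib.
It covers the whole workflow from data preprocessing to model training and includes almost all mainstream machine learning algorithms.
Its components are highly reusable, which helps programmers develop efficiently.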

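2. Features:

List of submodules: https://blog.csdn.net/newmarui/article/details/52094383

① Classification: identifying which category an object belongs to
   Common algorithms: SVM (support vector machines), nearest neighbors, random forest
   Typical applications: spam detection, image recognition
② Regression: predicting a continuous-valued attribute associated with an object
   Common algorithms: SVR (support vector regression), ridge regression, Lasso
   Typical applications: drug response, stock price prediction
③ Clustering: automatically grouping similar objects
   Common algorithms: k-Means, spectral clustering, mean-shift
   Typical applications: customer segmentation, grouping experiment outcomes
④ Dimensionality reduction: reducing the number of random variables to consider
   Common algorithms: PCA (principal component analysis), feature selection, non-negative matrix factorization
   Typical applications: visualization, increased efficiency
⑤ Model selection: comparing, validating, and choosing parameters and models; the goal is to improve accuracy through parameter tuning
   Common modules: grid search, cross validation, metrics
⑥ Preprocessing: feature extraction and normalization
   Common modules: preprocessing, feature extraction
   Typical applications: transforming input data (such as text) into data that machine learning algorithms can use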
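
As a rough illustration of how several of these areas fit together, the sketch below chains preprocessing (StandardScaler), classification (SVC), and model selection (5-fold cross validation) on the built-in iris dataset. The choice of estimator, kernel, and fold count is an assumption made for the example, not a recommendation from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in classification dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Preprocessing + classification combined into one estimator.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Model selection: estimate accuracy with 5-fold cross validation.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, so no information from the held-out fold leaks into preprocessing.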

3. Installation:

pip install scikit-learn
#The PyPI package is named scikit-learn; the import name is sklearn

II. Choosing an algorithm

See: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

III. Base classes and utility functions
1. Overview:

The sklearn.base module provides the base classes for all estimators

2. Base classes:

Base class for all estimators in scikit-learn:sklearn.base.BaseEstimator()
Mixin class for all bicluster estimators in scikit-learn:sklearn.base.BiclusterMixin
Mixin class for all classifiers in scikit-learn:sklearn.base.ClassifierMixin
Mixin class for all cluster estimators in scikit-learn:sklearn.base.ClusterMixin
Mixin class for all density estimators in scikit-learn:sklearn.base.DensityMixin
Mixin class for all regression estimators in scikit-learn:sklearn.base.RegressorMixin
Mixin class for all transformers in scikit-learn:sklearn.base.TransformerMixin
Transformer mixin that performs feature selection given a support mask:sklearn.feature_selection.SelectorMixin
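
A minimal sketch of how these base classes are typically combined: inheriting from BaseEstimator supplies get_params()/set_params(), and ClassifierMixin supplies score() (mean accuracy). ThresholdClassifier and its threshold parameter are hypothetical names invented for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts 1 when the first feature exceeds a threshold."""

    def __init__(self, threshold=0.5):
        # Constructor arguments are stored unchanged so get_params() can find them.
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is learned here; only the seen class labels are recorded.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > self.threshold).astype(int)


X = [[0.1], [0.9], [0.4], [0.8]]
y = [0, 1, 0, 1]

clf = ThresholdClassifier().fit(X, y)
params = clf.get_params()        # provided by BaseEstimator
accuracy = clf.score(X, y)       # provided by ClassifierMixin
clf.set_params(threshold=0.85)   # provided by BaseEstimator
accuracy_after = clf.score(X, y)
print(params, accuracy, accuracy_after)
```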

3. Utility functions:

Construct a new unfitted estimator with the same parameters:sklearn.base.clone(<estimator>[,safe=True])
Return True if the given estimator is (probably) a classifier:[<out>=]sklearn.base.is_classifier(<estimator>)
Return True if the given estimator is (probably) a regressor:[<out>=]sklearn.base.is_regressor(<estimator>)

Context manager for global scikit-learn configuration:sklearn.config_context([**new_config])
Retrieve current values for configuration set by set_config:[<config>=]sklearn.get_config()
Set global scikit-learn configuration:sklearn.set_config([assume_finite=None,working_memory=None,print_changed_only=None,display=None])
Print useful debugging information:sklearn.show_versions()
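
The following sketch exercises a few of these utilities. It assumes LogisticRegression and Ridge as example estimators and assume_finite as the configuration key to toggle; any other estimator or key would serve equally well.

```python
import sklearn
from sklearn.base import clone, is_classifier, is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

original = LogisticRegression(C=0.5)

# clone() returns a new, unfitted estimator with identical parameters.
copy = clone(original)
same_params = copy.get_params() == original.get_params()
is_new_object = copy is not original

# Type checks on estimator instances.
clf_check = is_classifier(original)
reg_check = is_regressor(Ridge())

# config_context() changes the global configuration only inside the with-block.
before = sklearn.get_config()["assume_finite"]
with sklearn.config_context(assume_finite=True):
    inside = sklearn.get_config()["assume_finite"]
after = sklearn.get_config()["assume_finite"]

print(same_params, is_new_object, clf_check, reg_check, before, inside, after)
```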

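IV. Datasets

Official documentation: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

1. Overview:

The sklearn.datasets module is used to obtain built-in or custom datasets

2. Loaders
(1) Overview:

Loaders are mainly used to load the built-in datasets

(2) Utility functions: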

Delete the data home cache:sklearn.datasets.clear_data_home([data_home=None])
  #Parameters:
  #  data_home: path of the cache directory; str or None (None means "~/scikit_learn_data")
  #Note: the directory and everything in it are deleted

Dump a dataset in svmlight/libsvm format:sklearn.datasets.dump_svmlight_file(<X>,<y>,<f>[,zero_based=True,comment=None,query_id=None,multilabel=False])
  #Parameters:
  #  X: the samples; array-like or sparse matrix of shape (n_samples, n_features)
  #  y: the target values; array-like or sparse matrix of shape (n_samples,) or (n_samples, n_labels)
  #  f: where to write the dataset; str or file-like object opened in binary mode
  #  zero_based: if True, column indices are written zero-based; if False, one-based
  #  comment: comment to insert at the top of the file; Unicode str or ASCII bytes
  #  query_id: array containing pairwise preference constraints (qid in svmlight format); array-like of shape (n_samples,)
  #  multilabel: whether the targets are multilabel; bool

Return the path of the scikit-learn data directory:sklearn.datasets.get_data_home([data_home=None])

Load text files with categories as subfolder names:[<data>=]sklearn.datasets.load_files(<container_path>[,description=None,categories=None,load_content=True,shuffle=True,encoding=None,decode_error='strict',random_state=0])

Load a dataset file in svmlight/libsvm format into a CSR sparse matrix:[<X>,<y>,<query_id>=]sklearn.datasets.load_svmlight_file(<f>[,n_features=None,dtype=<class 'numpy.float64'>,multilabel=False,zero_based='auto',query_id=False,offset=0,length=-1])
  #Parameters: those not listed here are the same as for sklearn.datasets.dump_svmlight_file()
  #  f: the file to load; str, file-like object, or int (file descriptor)
  #  n_features: number of features to use; int
  #  dtype: data type of the loaded dataset; numpy data type
  #  zero_based: True or False as above, or "auto" to detect the indexing from the file contents
  #  query_id: whether to return the query ids; bool
  #  offset: ignore the first offset bytes by seeking forward, then discard the following bytes up to the next newline character; int
  #  length: if strictly positive, stop reading new lines once the position in the file reaches the (offset + length) bytes threshold; int
  #Returns:
  #  X: the samples; sparse matrix of shape (n_samples, n_features)
  #  y: the sample labels; ndarray of shape (n_samples,), or a list of tuples of length n_samples in the multilabel case
  #  query_id: the query ids; array of shape (n_samples,); only returned when query_id=True

Load datasets from multiple files in SVMlight format:[<Xy>=]sklearn.datasets.load_svmlight_files(<files>[,n_features=None,dtype=<class 'numpy.float64'>,multilabel=False,zero_based='auto',query_id=False,offset=0,length=-1])
  #Parameters: those not listed here are the same as for sklearn.datasets.load_svmlight_file()
  #  files: the files to load; array-like of str, file-like objects, or int
  #Returns:
  #  Xy: the datasets; [<X1>,<y1>[,<q1>],...<Xn>,<yn>[,<qn>]]; <qi> is only returned when query_id=True
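
A small round-trip sketch for the svmlight helpers: a dense array is dumped into an in-memory buffer and read back as a CSR sparse matrix. The use of io.BytesIO in place of a file on disk, and the tiny 3 × 3 array, are choices made for the example only.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 0.0]])
y = np.array([0, 1, 0])

# Write the dataset in svmlight/libsvm format into a binary buffer.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)

# Rewind and load it back; n_features is given explicitly so that
# trailing all-zero columns cannot change the resulting shape.
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, n_features=3, zero_based=True)

print(type(X_loaded).__name__, X_loaded.shape)
print(X_loaded.toarray())
print(y_loaded)
```

Only the non-zero entries are written to the file, which is why the loader returns a sparse matrix.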

(3) Datasets for classification:

Load the filenames and data from the 20 newsgroups dataset:[<bunch>=]sklearn.datasets.fetch_20newsgroups([data_home=None,subset='train',categories=None,shuffle=True,random_state=42,remove=(),download_if_missing=True,return_X_y=False])
Load and vectorize the 20 newsgroups dataset:[<bunch>=]sklearn.datasets.fetch_20newsgroups_vectorized([subset='train',remove=(),data_home=None,download_if_missing=True,return_X_y=False,normalize=True,as_frame=False])
Load the covertype dataset:[<dataset>=]sklearn.datasets.fetch_covtype([data_home=None,download_if_missing=True,random_state=None,shuffle=False,return_X_y=False,as_frame=False])
Load the kddcup99 dataset:[<data>=]sklearn.datasets.fetch_kddcup99([subset=None,data_home=None,shuffle=False,random_state=None,percent10=True,download_if_missing=True,return_X_y=False,as_frame=False])
Load the Labeled Faces in the Wild (LFW) pairs dataset:[<data>=]sklearn.datasets.fetch_lfw_pairs([subset='train',data_home=None,funneled=True,resize=0.5,color=False,slice_=(slice(70,195),slice(78,172)),download_if_missing=True])
Load the Labeled Faces in the Wild (LFW) people dataset:[<dataset>=]sklearn.datasets.fetch_lfw_people([data_home=None,funneled=True,resize=0.5,min_faces_per_person=0,color=False,slice_=(slice(70,195),slice(78,172)),download_if_missing=True,return_X_y=False])
Load the Olivetti faces data-set from AT&T:[<data>=]sklearn.datasets.fetch_olivetti_faces([data_home=None,shuffle=False,random_state=0,download_if_missing=True,return_X_y=False])
Load the RCV1 multilabel dataset:[<dataset>=]sklearn.datasets.fetch_rcv1([data_home=None,subset='all',download_if_missing=True,random_state=None,shuffle=False,return_X_y=False])
Load and return the breast cancer wisconsin dataset:[<data>=]sklearn.datasets.load_breast_cancer([return_X_y=False,as_frame=False])
Load and return the digits dataset:[<data>=]sklearn.datasets.load_digits([n_class=10,return_X_y=False,as_frame=False])
Load and return the iris dataset:[<data>=]sklearn.datasets.load_iris([return_X_y=False,as_frame=False])
Load and return the wine dataset:[<data>=]sklearn.datasets.load_wine([return_X_y=False,as_frame=False])
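
A short sketch of the two return styles shared by the load_* functions above: by default a Bunch object with data, target, and metadata attributes, or an (X, y) tuple when return_X_y=True. Iris and wine are used here because they ship with the library and need no download.

```python
from sklearn.datasets import load_iris, load_wine

# Default: a Bunch, a dict-like object with attribute access.
iris = load_iris()
print(iris.data.shape)           # feature matrix
print(iris.target.shape)         # class labels
print(list(iris.target_names))   # label names
print(iris.feature_names)

# return_X_y=True: only the feature matrix and the target vector.
X_wine, y_wine = load_wine(return_X_y=True)
print(X_wine.shape, y_wine.shape)
```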

(4) Datasets for regression:

Load the California housing dataset:[<dataset>=]sklearn.datasets.fetch_california_housing([data_home=None,download_if_missing=True,return_X_y=False,as_frame=False])
Load and return the boston house-prices dataset:[<data>=]sklearn.datasets.load_boston([return_X_y=False])#deprecated in scikit-learn 1.0 and removed in 1.2
Load and return the diabetes dataset:[<data>=]sklearn.datasets.load_diabetes([return_X_y=False,as_frame=False])
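
For the regression datasets the target is a continuous value rather than a class label. The sketch below loads the diabetes dataset, which is bundled with the library, and fits an ordinary least squares model; LinearRegression and scoring on the training data are choices made only to keep the example short.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# 442 samples, 10 features, continuous target.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)

# Fit a linear model and report R^2 on the training data itself
# (an optimistic estimate; use a held-out split for real evaluation).
reg = LinearRegression().fit(X, y)
r2_train = reg.score(X, y)
print("training R^2:", r2_train)
```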

(5) Other datasets:

Fetch dataset from openml by name or dataset id:[<data>=]sklearn.datasets.fetch_openml([name=None,version='active',data_id=None,data_home=None,target_column='default-target',cache=True,return_X_y=False,as_frame='auto'])
Loader for species distribution dataset from Phillips et al. (2006):[<data>=]sklearn.datasets.fetch_species_distributions([data_home=None,download_if_missing=True])
Load and return the physical exercise linnerud dataset:[<data>=]sklearn.datasets.load_linnerud([return_X_y=False,as_frame=False])
Load the numpy array of a single sample image:[<img>=]sklearn.datasets.load_sample_image(<image_name>)#loads china.jpg or flower.jpg
Load sample images for image manipulation:[<data>=]sklearn.datasets.load_sample_images()#loads both china.jpg and flower.jpg

3. Sample generators:

Generate an array with constant block diagonal structure for biclustering:[<X>,<rows>,<cols>=]sklearn.datasets.make_biclusters(<shape>,<n_clusters>[,noise=0.0,minval=10,maxval=100,shuffle=True,random_state=None])
Generate isotropic Gaussian blobs for clustering:[<X>,<y>,<centers>=]sklearn.datasets.make_blobs([n_samples=100,n_features=2,centers=None,cluster_std=1.0,center_box=(-10.0,10.0),shuffle=True,random_state=None,return_centers=False])
Generate an array with block checkerboard structure for biclustering:[<X>,<rows>,<cols>=]sklearn.datasets.make_checkerboard(<shape>,<n_clusters>[,noise=0.0,minval=10,maxval=100,shuffle=True,random_state=None])
Make a large circle containing a smaller circle in 2d:[<X>,<y>=]sklearn.datasets.make_circles([n_samples=100,shuffle=True,noise=None,random_state=None,factor=0.8])
Generate a random n-class classification problem:[<X>,<y>=]sklearn.datasets.make_classification([n_samples=100,n_features=20,n_informative=2,n_redundant=2,n_repeated=0,n_classes=2,n_clusters_per_class=2,weights=None,flip_y=0.01,class_sep=1.0,hypercube=True,shift=0.0,scale=1.0,shuffle=True,random_state=None])
Generate the “Friedman #1” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman1([n_samples=100,n_features=10,noise=0.0,random_state=None])
Generate the “Friedman #2” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman2([n_samples=100,noise=0.0,random_state=None])
Generate the “Friedman #3” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman3([n_samples=100,noise=0.0,random_state=None])
Generate isotropic Gaussian and label samples by quantile:[<X>,<y>=]sklearn.datasets.make_gaussian_quantiles([mean=None,cov=1.0,n_samples=100,n_features=2,n_classes=3,shuffle=True,random_state=None])
Generates data for binary classification used in Hastie et al:[<X>,<y>=]sklearn.datasets.make_hastie_10_2([n_samples=12000,random_state=None])
Generate a mostly low rank matrix with bell-shaped singular values:[<X>=]sklearn.datasets.make_low_rank_matrix([n_samples=100,n_features=100,effective_rank=10,tail_strength=0.5,random_state=None])
Make two interleaving half circles:[<X>,<y>=]sklearn.datasets.make_moons([n_samples=100,shuffle=True,noise=None,random_state=None])
Generate a random multilabel classification problem:[<X>,<y>,<p_c>,<p_w_c>=]sklearn.datasets.make_multilabel_classification([n_samples=100,n_features=20,n_classes=5,n_labels=2,length=50,allow_unlabeled=True,sparse=False,return_indicator='dense',return_distributions=False,random_state=None])
Generate a random regression problem:[<X>,<y>,<coef>=]sklearn.datasets.make_regression([n_samples=100,n_features=100,n_informative=10,n_targets=1,bias=0.0,effective_rank=None,tail_strength=0.5,noise=0.0,shuffle=True,coef=False,random_state=None])
Generate an S curve dataset:[<X>,<t>=]sklearn.datasets.make_s_curve([n_samples=100,noise=0.0,random_state=None])
Generate a signal as a sparse combination of dictionary elements:[<data>,<dictionary>,<code>=]sklearn.datasets.make_sparse_coded_signal(<n_samples>,<n_components>,<n_features>,<n_nonzero_coefs>[,random_state=None])
Generate a sparse symmetric definite positive matrix:[<prec>=]sklearn.datasets.make_sparse_spd_matrix([dim=1,alpha=0.95,norm_diag=False,smallest_coef=0.1,largest_coef=0.9,random_state=None])
Generate a random regression problem with sparse uncorrelated design:[<X>,<y>=]sklearn.datasets.make_sparse_uncorrelated([n_samples=100,n_features=10,random_state=None])
Generate a random symmetric, positive-definite matrix:[<X>=]sklearn.datasets.make_spd_matrix(<n_dim>[,random_state=None])
Generate a swiss roll dataset:[<X>,<t>=]sklearn.datasets.make_swiss_roll([n_samples=100,noise=0.0,random_state=None])
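
A brief sketch calling a few of the generators listed above. The sample counts, feature counts, and random_state=0 are arbitrary values picked for the example. Note that make_blobs returns the centers only when return_centers=True, and make_regression returns the coefficients only when coef=True.

```python
import numpy as np
from sklearn.datasets import (make_blobs, make_classification, make_moons,
                              make_regression)

# Three isotropic Gaussian blobs for clustering.
X_b, y_b, centers = make_blobs(n_samples=60, n_features=2, centers=3,
                               random_state=0, return_centers=True)

# A random binary classification problem.
X_c, y_c = make_classification(n_samples=50, n_features=8, n_informative=3,
                               n_redundant=1, random_state=0)

# Two interleaving half circles with a little Gaussian noise.
X_m, y_m = make_moons(n_samples=40, noise=0.1, random_state=0)

# A random regression problem; coef holds the true linear coefficients.
X_r, y_r, coef = make_regression(n_samples=30, n_features=5, n_informative=2,
                                 coef=True, random_state=0)

print(X_b.shape, centers.shape)
print(X_c.shape, np.unique(y_c))
print(X_m.shape, np.unique(y_m))
print(X_r.shape, coef)
```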

V. Exceptions
1. Overview:

The sklearn.exceptions module defines all of scikit-learn's custom exceptions and warnings

2. Usage:

Custom warning to capture convergence problems:class sklearn.exceptions.ConvergenceWarning
Warning used to notify implicit data conversions happening in the code:class sklearn.exceptions.DataConversionWarning
Custom warning to notify potential issues with data dimensionality:class sklearn.exceptions.DataDimensionalityWarning
Warning used to notify the user of inefficient computation:class sklearn.exceptions.EfficiencyWarning
Warning class used if there is an error while fitting the estimator:class sklearn.exceptions.FitFailedWarning
Exception class to raise if estimator is used before fitting:class sklearn.exceptions.NotFittedError
Warning used when the metric is invalid:class sklearn.exceptions.UndefinedMetricWarning
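
A sketch of how two of these classes typically surface in practice: NotFittedError is raised when predict() is called before fit(), and ConvergenceWarning is emitted when a solver stops before converging. LogisticRegression with max_iter=1 is an assumption chosen to force early stopping; it is not a setting from the original text.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning, NotFittedError
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1)  # deliberately too few iterations

# Using an estimator before fitting raises NotFittedError.
try:
    clf.predict(X)
    raised = False
except NotFittedError as exc:
    raised = True
    print("not fitted:", exc)

# Record warnings during fitting and keep only the convergence ones.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X, y)
convergence = [w for w in caught if issubclass(w.category, ConvergenceWarning)]
print("convergence warnings:", len(convergence))
```

The same catch_warnings pattern with warnings.simplefilter("ignore", ConvergenceWarning) can be used to silence the warning once its cause is understood.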
