Matrix normalization in Python: a walkthrough of sklearn's common data-preprocessing normalization methods
Standard normalization (z-score)
Normalizes the data to zero mean and unit variance.
The sklearn.preprocessing.scale function: standardize a dataset along any axis.
First, the main source code. At first glance it looks messy, but on closer reading it is just the core logic plus some conditional handling for sparse matrices and similar cases.
#coding=utf-8
import numpy as np
from scipy import sparse

def _handle_zeros_in_scale(scale, copy=True):
    '''Makes sure that whenever scale is zero, we handle it correctly.
    This happens in most scalers when we have constant features.'''
    # if we are fitting on 1D arrays, scale might be a scalar
    if np.isscalar(scale):
        if scale == .0:
            scale = 1.
        return scale
    elif isinstance(scale, np.ndarray):
        if copy:
            # New array to avoid side-effects
            scale = scale.copy()
        scale[scale == 0.0] = 1.0
        return scale
# The helpers used below (check_array, mean_variance_axis,
# inplace_column_scale, FLOAT_DTYPES) are sklearn internals; in the version
# this source was excerpted from they live in sklearn.utils,
# sklearn.utils.sparsefuncs and sklearn.utils.validation.
import warnings

def scale(X, axis=0, with_mean=True, with_std=True, copy=True):
    """Standardize a dataset along any axis
    Center to the mean and component wise scale to unit variance.
    Read more in the :ref:`User Guide `.
    Parameters
    ----------
    X : {array-like, sparse matrix}
        The data to center and scale.
    axis : int (0 by default)
        axis used to compute the means and standard deviations along. If 0,
        independently standardize each feature, otherwise (if 1) standardize
        each sample.
    with_mean : boolean, True by default
        If True, center the data before scaling.
    with_std : boolean, True by default
        If True, scale the data to unit variance (or equivalently,
        unit standard deviation).
    copy : boolean, optional, default True
        set to False to perform inplace row normalization and avoid a
        copy (if the input is already a numpy array or a scipy.sparse
        CSC matrix and if axis is 1).
    Notes
    -----
    This implementation will refuse to center scipy.sparse matrices
    since it would make them non-sparse and would potentially crash the
    program with memory exhaustion problems.
    Instead the caller is expected to either set explicitly
    `with_mean=False` (in that case, only variance scaling will be
    performed on the features of the CSC matrix) or to call `X.toarray()`
    if he/she expects the materialized dense array to fit in memory.
    To avoid memory copy the caller should pass a CSC matrix.
    See also
    --------
    StandardScaler: Performs scaling to unit variance using the ``Transformer`` API
        (e.g. as part of a preprocessing :class:`sklearn.pipeline.Pipeline`).
    """  # noqa
    X = check_array(X, accept_sparse='csc', copy=copy, ensure_2d=False,
                    warn_on_dtype=True, estimator='the scale function',
                    dtype=FLOAT_DTYPES)
    if sparse.issparse(X):
        if with_mean:
            raise ValueError(
                "Cannot center sparse matrices: pass `with_mean=False` instead"
                " See docstring for motivation and alternatives.")
        if axis != 0:
            raise ValueError("Can only scale sparse matrix on axis=0, "
                             " got axis=%d" % axis)
        if with_std:
            _, var = mean_variance_axis(X, axis=0)
            var = _handle_zeros_in_scale(var, copy=False)
            inplace_column_scale(X, 1 / np.sqrt(var))
    else:
        X = np.asarray(X)
        if with_mean:
            mean_ = np.mean(X, axis)
        if with_std:
            scale_ = np.std(X, axis)
        # Xr is a view on the original array, broadcasting on the axis we are
        # interested in. The next line is confusing at first: the code keeps
        # operating on Xr yet finally returns X. The trick is that
        # np.rollaxis returns a *view* of the input array -- Xr and X differ
        # only in form but share the same underlying data, so every in-place
        # update to Xr also updates X (np.shares_memory(X, Xr) confirms it).
        Xr = np.rollaxis(X, axis)
        if with_mean:
            Xr -= mean_
            mean_1 = Xr.mean(axis=0)
            # Verify that mean_1 is 'close to zero'. If X contains very
            # large values, mean_1 can also be very large, due to a lack of
            # precision of mean_. In this case, a pre-scaling of the
            # concerned feature is efficient, for instance by its mean or
            # maximum.
            if not np.allclose(mean_1, 0):
                warnings.warn("Numerical issues were encountered "
                              "when centering the data "
                              "and might not be solved. Dataset may "
                              "contain too large values. You may need "
                              "to prescale your features.")
                Xr -= mean_1
        if with_std:
            scale_ = _handle_zeros_in_scale(scale_, copy=False)
            Xr /= scale_
            if with_mean:
                mean_2 = Xr.mean(axis=0)
                # If mean_2 is not 'close to zero', it comes from the fact that
                # scale_ is very small so that mean_2 = mean_1/scale_ > 0, even
                # if mean_1 was close to zero. The problem is thus essentially
                # due to the lack of precision of mean_. A solution is then to
                # subtract the mean again:
                if not np.allclose(mean_2, 0):
                    warnings.warn("Numerical issues were encountered "
                                  "when scaling the data "
                                  "and might not be solved. The standard "
                                  "deviation of the data is probably "
                                  "very close to 0. ")
                    Xr -= mean_2
    return X
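The view trick described in the comments can be confirmed directly: np.rollaxis returns a view, so in-place operations on Xr mutate X as well, which is why the function can return X at the end (a minimal sketch):

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)
Xr = np.rollaxis(X, 0)            # with axis=0, Xr is simply a view of X
assert np.shares_memory(X, Xr)    # same underlying buffer
Xr -= Xr.mean(axis=0)             # center in place through the view
print(X.mean(axis=0))             # X itself is now centered: [0. 0. 0.]
```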
A simplified version of scale:
def scale_mean_var(input_arr, axis=0):
    # Equivalent to:
    #   from sklearn import preprocessing
    #   input_arr = preprocessing.scale(input_arr.astype('float'))
    # (the axis argument is kept for symmetry, but the body assumes axis=0)
    mean_ = np.mean(input_arr, axis=0)
    scale_ = np.std(input_arr, axis=0)
    # subtract the mean
    output_arr = input_arr - mean_
    # check whether the mean is now close to zero
    mean_1 = output_arr.mean(axis=0)
    if not np.allclose(mean_1, 0):
        output_arr -= mean_1
    # replace zero standard deviations (constant features) with 1
    # scale_ = _handle_zeros_in_scale(scale_, copy=False)
    scale_[scale_ == 0.0] = 1.0
    # divide by the standard deviation
    output_arr /= scale_
    # check the mean once more after scaling
    mean_2 = output_arr.mean(axis=0)
    if not np.allclose(mean_2, 0):
        output_arr -= mean_2
    return output_arr
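Assuming the simplified function above (reproduced here so the snippet runs standalone), a quick check that it really produces zero mean and unit variance per column:

```python
import numpy as np

def scale_mean_var(input_arr, axis=0):
    # simplified z-score normalization, as in the section above
    mean_ = np.mean(input_arr, axis=0)
    scale_ = np.std(input_arr, axis=0)
    output_arr = input_arr - mean_
    mean_1 = output_arr.mean(axis=0)
    if not np.allclose(mean_1, 0):
        output_arr -= mean_1
    scale_[scale_ == 0.0] = 1.0
    output_arr /= scale_
    mean_2 = output_arr.mean(axis=0)
    if not np.allclose(mean_2, 0):
        output_arr -= mean_2
    return output_arr

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
Z = scale_mean_var(X)
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))
```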
Min-max normalization
The sklearn.preprocessing.minmax_scale function: transforms features by scaling each feature to a given range.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
A simplified version is straightforward:
def max_min(input_arr, o_min, o_max):
    """
    Scale the input to the range [o_min, o_max].
    Note: uses the global min/max; pass axis=0 to np.min/np.max for
    per-feature scaling. Assumes i_max != i_min.
    """
    i_min = np.min(input_arr)
    i_max = np.max(input_arr)
    # clipping to [i_min, i_max] is a no-op here; it only matters when
    # fixed bounds are supplied instead of the observed min/max
    out_arr = np.clip(input_arr, i_min, i_max)
    out_arr = (out_arr - i_min) / (i_max - i_min)
    if o_max == 1 and o_min == 0:
        return out_arr
    out_arr = out_arr * (o_max - o_min) + o_min
    return out_arr
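For example, mapping a small array onto [0, 1] and onto [-1, 1] with the simplified function (reproduced so the example is self-contained; remember it uses the global min/max, not per-column ones):

```python
import numpy as np

def max_min(input_arr, o_min, o_max):
    # simplified min-max scaling, as in the section above
    i_min = np.min(input_arr)
    i_max = np.max(input_arr)
    out_arr = (np.asarray(input_arr, dtype=float) - i_min) / (i_max - i_min)
    if o_max == 1 and o_min == 0:
        return out_arr
    return out_arr * (o_max - o_min) + o_min

x = np.array([1.0, 3.0, 5.0])
print(max_min(x, 0, 1))    # [0.  0.5 1. ]
print(max_min(x, -1, 1))   # [-1.  0.  1.]
```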
Max-abs normalization
The maxabs_scale function: scale each feature by its maximum absolute value.
def maxabs_scale(input_arr, axis=0):
    """
    Scale each feature to the [-1, 1] range without breaking the sparsity.
    """
    # renamed from `input` to avoid shadowing the builtin; the original
    # also wrote `numpy.ndarray` although numpy was imported as np
    if not isinstance(input_arr, np.ndarray):
        input_arr = np.asarray(input_arr).astype(np.float32)
    maxabs = np.max(np.abs(input_arr), axis=axis)
    out_array = input_arr / maxabs
    return out_array
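A quick usage example of the function above (reproduced so it runs standalone): each column is divided by its own maximum absolute value, so signs are preserved and the result lies in [-1, 1].

```python
import numpy as np

def maxabs_scale(input_arr, axis=0):
    # simplified max-abs scaling, as in the section above
    input_arr = np.asarray(input_arr, dtype=float)
    maxabs = np.max(np.abs(input_arr), axis=axis)
    return input_arr / maxabs

out = maxabs_scale([[1.0, -2.0], [-4.0, 1.0]])
print(out)   # columns scaled by 4 and 2: [[ 0.25 -1.  ] [-1.    0.5 ]]
```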