Basic Methods for Missing Data Imputation (1): k-Nearest Neighbors (kNN) Imputation

1. Introduction to kNN

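The KNNImputer class fills in missing values using the k-Nearest Neighbors (kNN) approach. Each sample's missing values are imputed from the values of the n_neighbors nearest neighbors found in the training set. Note that if a sample is missing more than one feature, it can have a different set of n_neighbors donors for each feature being imputed.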

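Each missing feature is then imputed as the average, either weighted or unweighted, of these neighbors. If fewer than n_neighbors donor neighbors are available, the training-set mean of that feature is used instead. The total number of samples in the training set is, of course, always greater than or equal to the number of nearest neighbors available for imputation. The latter depends on the overall sample size and on how many samples are excluded from the nearest-neighbor calculation because they have too many missing features (controlled by row_max_missing).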
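The procedure described above can be sketched in plain NumPy. This is only a minimal illustration, not the code of missingpy or scikit-learn: the helper names masked_euclidean and knn_impute are hypothetical, and the distance is assumed to be the Euclidean distance over the features observed in both rows, rescaled by the ratio of total to shared features.

```python
import numpy as np


def masked_euclidean(x, y):
    """Euclidean distance over features observed in both x and y,
    rescaled by (total features / shared features). Assumed definition."""
    shared = ~np.isnan(x) & ~np.isnan(y)
    if not shared.any():
        return np.nan  # no common observed feature: distance undefined
    sq = np.sum((x[shared] - y[shared]) ** 2)
    return np.sqrt(x.size / shared.sum() * sq)


def knn_impute(X, n_neighbors=2):
    """Fill each NaN with the unweighted mean of that feature over the
    n_neighbors nearest rows in which the feature is observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)  # fallback value per feature
    for i, j in zip(*np.where(np.isnan(X))):
        # Potential donors: other rows where feature j is observed.
        donors = [r for r in range(X.shape[0])
                  if r != i and not np.isnan(X[r, j])]
        dists = np.array([masked_euclidean(X[i], X[r]) for r in donors])
        usable = ~np.isnan(dists)
        if usable.sum() < n_neighbors:
            # Too few donors: fall back to the feature mean, as in the text.
            filled[i, j] = col_means[j]
            continue
        donor_rows = np.array(donors)[usable]
        nearest = np.argsort(dists[usable])[:n_neighbors]
        filled[i, j] = X[donor_rows[nearest], j].mean()
    return filled


nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
result = knn_impute(X, n_neighbors=2)
print(result)
```

For the first row, the two nearest donors for the third feature are the rows [3, 4, 3] and [nan, 6, 5], so the filled value is (3 + 5) / 2 = 4. For the third row, the two nearest donors for the first feature are [3, 4, 3] and [8, 8, 7], which gives (3 + 8) / 2 = 5.5.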

2. Code Example

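The following snippet shows how to replace missing values, encoded as np.nan, with the mean feature value of the two nearest neighbors of each row that contains missing values: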

>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

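If the code above raises an error (for example, because missingpy is not installed or is incompatible with the installed scikit-learn version), use the KNNImputer that ships with scikit-learn (version 0.22 or later) instead: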

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
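Because the neighbors are looked up in the training set, the imputer can also be fitted once and then applied to samples it has not seen. The sketch below uses the scikit-learn class with weights="distance"; the array X_new is a made-up example and is not part of the original data.

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X_train = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]], dtype=float)
# Hypothetical unseen samples, each with one missing feature.
X_new = np.array([[nan, 5, 4], [7, nan, 6]], dtype=float)

# weights="distance" weights each donor by the inverse of its distance.
imputer = KNNImputer(n_neighbors=2, weights="distance")
imputer.fit(X_train)                 # training rows become the potential donors
X_filled = imputer.transform(X_new)  # donors are taken from X_train only
print(X_filled)
```

Observed entries of X_new are left unchanged; only the NaN positions are filled from the training rows.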

3. API Reference

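The KNNImputer API is as follows. The signature and parameters below are those of the missingpy implementation; the scikit-learn class has a different signature (for example, it has no row_max_missing or col_max_missing parameter).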

KNNImputer(missing_values="NaN", n_neighbors=5, weights="uniform", metric="masked_euclidean", row_max_missing=0.5, col_max_missing=0.8, copy=True)

Parameters
----------
missing_values : integer or "NaN", optional (default = "NaN")
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For missing values encoded as
    ``np.nan``, use the string value "NaN".

n_neighbors : int, optional (default = 5)
    Number of neighboring samples to use for imputation.

weights : str or callable, optional (default = "uniform")
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      In this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

metric : str or callable, optional (default = "masked_euclidean")
    Distance metric for searching neighbors. Possible values:

    - 'masked_euclidean'
    - [callable] : a user-defined function which conforms to the
      definition of _pairwise_callable(X, Y, metric, **kwds). In other
      words, the function accepts two arrays, X and Y, and a
      ``missing_values`` keyword in **kwds and returns a scalar distance
      value.

row_max_missing : float, optional (default = 0.5)
    The maximum fraction of columns (i.e. features) that can be missing
    before the sample is excluded from nearest neighbor imputation. It
    means that such rows will not be considered a potential donor in
    ``fit()``, and in ``transform()`` their missing feature values will be
    imputed to be the column mean for the entire dataset.

col_max_missing : float, optional (default = 0.8)
    The maximum fraction of rows (or samples) that can be missing
    for any feature beyond which an error is raised.

copy : boolean, optional (default = True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, if metric is
    "masked_euclidean" and copy=False then missing_values in the
    input matrix X will be overwritten with zeros.

Attributes
----------
statistics_ : 1-D array of length {n_features}
    The 1-D array contains the mean of each feature calculated using
    observed (i.e. non-missing) values. This is used for imputing
    missing values in samples that are either excluded from nearest
    neighbors search because they have too many ( > row_max_missing)
    missing features or because all of the sample's k-nearest neighbors
    (i.e., the potential donors) also have the relevant feature value
    missing.

Methods
-------
fit(X, y=None):
    Fit the imputer on X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    self : object
        Returns self.

transform(X):
    Impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape = [n_samples, n_features]
        The input data to complete.

    Returns
    -------
    X : {array-like}, shape = [n_samples, n_features]
        The imputed dataset.

fit_transform(X, y=None, **fit_params):
    Fit KNNImputer and impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    X : {array-like}, shape (n_samples, n_features)
        Returns imputed dataset.
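To make the two thresholds concrete, the following NumPy sketch mimics the checks that row_max_missing and col_max_missing describe. The function check_missing_thresholds is a hypothetical helper written for illustration, and the strict ">" comparison is an assumption based on the wording of the docstring; the library's internal logic may differ.

```python
import numpy as np


def check_missing_thresholds(X, row_max_missing=0.5, col_max_missing=0.8):
    """Return a boolean mask of rows that may act as donors.

    Raises ValueError if any feature is missing in more than
    col_max_missing of the rows (assumed behaviour).
    """
    mask = np.isnan(np.asarray(X, dtype=float))
    col_fraction = mask.mean(axis=0)  # fraction of rows missing, per feature
    if np.any(col_fraction > col_max_missing):
        bad = np.where(col_fraction > col_max_missing)[0]
        raise ValueError(f"columns {bad.tolist()} exceed col_max_missing")
    row_fraction = mask.mean(axis=1)  # fraction of features missing, per row
    return row_fraction <= row_max_missing


nan = np.nan
X = [[1, 2, nan], [3, 4, 3], [nan, nan, 5], [8, 8, 7]]
donor_mask = check_missing_thresholds(X)
print(donor_mask)
```

Here the third row is missing two of its three features (about 0.67, above the default of 0.5), so it is excluded as a donor; according to the docstring, its missing values would then be filled with the column means stored in statistics_.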

Reference:

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.
