pip install missingpy


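The KNNImputer class fills in missing values using the k-Nearest Neighbors (KNN) algorithm. Each missing value of a sample is estimated from the values of the n_neighbors nearest neighbors found in the training set. Note that a sample with more than one missing feature can have a different set of n_neighbors donor neighbors for each feature being imputed.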

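Each missing feature is then imputed as the weighted or unweighted average of these neighbors. If fewer than n_neighbors donor neighbors are available, the training-set mean of that feature is used instead. The total number of samples in the training set is, of course, always greater than or equal to the number of nearest neighbors available for imputation. That number depends on the overall sample size and on how many samples are excluded from the nearest-neighbor calculation because they have too many missing features (controlled by row_max_missing).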
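The procedure described above can be sketched by hand with NumPy. This is a minimal illustration, not the library's implementation: the helper names `masked_euclidean` and `knn_impute` are hypothetical, the distance is assumed to follow the definition used by scikit-learn's `nan_euclidean_distances` (squared differences over the features observed in both rows, rescaled by total features / shared features), only uniform weights are handled, and the row_max_missing exclusion is ignored.

```python
import numpy as np

def masked_euclidean(a, b):
    """Distance over features observed in both rows, rescaled by
    (number of features / number of shared features). Assumed definition."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # no feature in common: treat the row as infinitely far
    sq = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(a.size / shared.sum() * sq)

def knn_impute(X, n_neighbors=2):
    """Hypothetical helper: uniform-weight KNN imputation of every NaN in X."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)  # fallback value for each feature
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the other rows in which feature j is observed,
        # so each missing feature gets its own donor set.
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]
        donors.sort(key=lambda r: masked_euclidean(X[i], X[r]))
        if len(donors) < n_neighbors:
            out[i, j] = col_means[j]  # too few donors: use the feature mean
        else:
            out[i, j] = X[donors[:n_neighbors], j].mean()
    return out

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
result = knn_impute(X, n_neighbors=2)
print(result)
```

For the missing value in the first row, the two nearest donors hold 3 and 5 in the third column, giving 4. For the third row, the two nearest donors hold 3 and 8 in the first column, giving 5.5.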



>>> import numpy as np
>>> from missingpy import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])


>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> nan = np.nan
>>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
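To see the effect of the weights parameter, the same data can be imputed with weights="distance". The sketch below uses scikit-learn's KNNImputer and assumes that its inverse-distance weighting is applied to the distances between the incomplete row and its donors. The variable name `X_distance` is only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# Each donor is weighted by the inverse of its distance to the incomplete row,
# so closer donors contribute more to the imputed value.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_distance = imputer.fit_transform(X)
print(X_distance)
```

In the first row the nearer donor (value 3) is twice as close as the other (value 5), so the imputed value moves from 4 toward 3, to (2*3 + 1*5) / 3, about 3.67. In the third row both donors are equally distant, so the result stays at 5.5.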



KNNImputer(missing_values="NaN", n_neighbors=5, weights="uniform", metric="masked_euclidean", row_max_missing=0.5, col_max_missing=0.8, copy=True)

Parameters
----------
missing_values : integer or "NaN", optional (default = "NaN")
    The placeholder for the missing values. All occurrences of
    `missing_values` will be imputed. For missing values encoded as
    ``np.nan``, use the string value "NaN".

n_neighbors : int, optional (default = 5)
    Number of neighboring samples to use for imputation.

weights : str or callable, optional (default = "uniform")
    Weight function used in prediction. Possible values:

    - 'uniform' : uniform weights. All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      In this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

metric : str or callable, optional (default = "masked_euclidean")
    Distance metric for searching neighbors. Possible values:

    - 'masked_euclidean'
    - [callable] : a user-defined function which conforms to the
      definition of _pairwise_callable(X, Y, metric, **kwds). In other
      words, the function accepts two arrays, X and Y, and a
      ``missing_values`` keyword in **kwds and returns a scalar distance
      value.

row_max_missing : float, optional (default = 0.5)
    The maximum fraction of columns (i.e. features) that can be missing
    before the sample is excluded from nearest neighbor imputation. It
    means that such rows will not be considered a potential donor in
    ``fit()``, and in ``transform()`` their missing feature values will be
    imputed to be the column mean for the entire dataset.

col_max_missing : float, optional (default = 0.8)
    The maximum fraction of rows (or samples) that can be missing
    for any feature beyond which an error is raised.

copy : boolean, optional (default = True)
    If True, a copy of X will be created. If False, imputation will
    be done in-place whenever possible. Note that, if metric is
    "masked_euclidean" and copy=False then missing_values in the
    input matrix X will be overwritten with zeros.

Attributes
----------
statistics_ : 1-D array of length {n_features}
    The 1-D array contains the mean of each feature calculated using
    observed (i.e. non-missing) values. This is used for imputing
    missing values in samples that are either excluded from nearest
    neighbors search because they have too many ( > row_max_missing)
    missing features or because all of the sample's k-nearest neighbors
    (i.e., the potential donors) also have the relevant feature value
    missing.

Methods
-------
fit(X, y=None):
    Fit the imputer on X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    self : object
        Returns self.

transform(X):
    Impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape = [n_samples, n_features]
        The input data to complete.

    Returns
    -------
    X : {array-like}, shape = [n_samples, n_features]
        The imputed dataset.

fit_transform(X, y=None, **fit_params):
    Fit KNNImputer and impute all missing values in X.

    Parameters
    ----------
    X : {array-like}, shape (n_samples, n_features)
        Input data, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    Returns
    -------
    X : {array-like}, shape (n_samples, n_features)
        Returns imputed dataset.
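The fit and transform methods can also be called separately, so that donors come from a training set while the values being imputed belong to new data. The sketch below illustrates this with scikit-learn's KNNImputer, which exposes the same three methods. The sample `X_new` is a hypothetical unseen row made up for this example.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_train = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
X_new = np.array([[np.nan, 4, 4]])  # hypothetical unseen sample

imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputer.fit(X_train)                      # stores the training rows as potential donors
X_new_imputed = imputer.transform(X_new)  # neighbors are searched in X_train only
print(X_new_imputed)
```

Here the two nearest training rows that have the first feature observed are [3, 4, 3] and [1, 2, nan], so the missing value is filled with the mean of 3 and 1, which is 2.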


  1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics, Vol. 17, No. 6, 2001, pp. 520-525.

