The datasets we have used so far have been described in terms of features. In m04_Recommending Movies w Affinity Analysis_Apriori_sys.stdout.flush_df.iterrows_Sort nested dict嵌套_Linli522362242的专栏-CSDN博客, Recommending Movies Using Affinity Analysis, we used a transaction-centric dataset. However, ultimately this was just a different format for representing feature-based data.

There are many other types of datasets, including text, images, sounds, movies, or even real objects. Most data mining algorithms, however, rely on having numerical or categorical features. This means we need a way to represent these types before we input them into the data mining algorithm.

In this chapter, we will discuss how to extract numerical and categorical features, and choose the best features when we do have them. We will discuss some common patterns and techniques for extracting features.

The key concepts introduced in this chapter include:

  • Extracting features from datasets
  • Creating new features
  • Selecting good features
  • Creating your own transformer for custom datasets

Feature extraction

Extracting features is one of the most critical tasks in data mining, and it generally affects your end result more than the choice of data mining algorithm. Unfortunately, there are no hard and fast rules for choosing features that will result in high performance data mining. In many ways, this is where the science of data mining becomes more of an art. Creating good features relies on intuition, domain expertise, data mining experience, trial and error, and sometimes a little luck.

The difference between feature selection and feature extraction is that while we maintain the original features when we use feature selection algorithms, such as sequential backward selection, we use feature extraction to transform or project the data onto a new feature space. In the context of dimensionality reduction, feature extraction can be understood as an approach to data compression with the goal of maintaining most of the relevant information. In practice, feature extraction is not only used to improve storage space or the computational efficiency of the learning algorithm, but can also improve the predictive performance by reducing the curse of dimensionality—especially if we are working with non-regularized models.

Representing reality in models

Not all datasets are presented in terms of features. Sometimes, a dataset consists of nothing more than all of the books that have been written by a given author. Sometimes, it is the film of each of the movies released in 1979. At other times, it is a library collection of interesting historical artifacts.

From these datasets, we may want to perform a data mining task. For the books, we may want to know the different categories that the author writes in. In the films, we may wish to see how women are portrayed. In the historical artifacts, we may want to know whether they are from one country or another. It isn't possible to just pass these raw datasets into a decision tree and see what the result is.

For a data mining algorithm to assist us here, we need to represent these as features. Features are a way to create a model, and the model provides an approximation of reality in a way that data mining algorithms can understand. Therefore, a model is just a simplified version of some aspect of the real world. As an example, the game of chess is a simplified model of historical warfare.

Selecting features has another advantage: it reduces the complexity of the real world into a more manageable model. Imagine how much information it would take to properly, accurately, and fully describe a real-world object to someone who has no background knowledge of the item. You would need to describe the size, weight, texture, composition, age, flaws, purpose, origin, and so on.

The complexity of real objects is too much for current algorithms, so we use these simpler models instead.

This simplification also focuses our intent in the data mining application. In later chapters, we will look at clustering, where this is critically important. If you put random features in, you will get random results out.

However, there is a downside, as this simplification reduces the detail and may remove good indicators of the things we wish to perform data mining on.

Thought should always be given to how to represent reality in the form of a model. Rather than just using what has been used in the past, you need to consider the goal of the data mining exercise. What are you trying to achieve? In m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客, Predicting Sports Winners with Decision Trees, we created features by thinking about the goal (predicting winners) and used a little domain knowledge to come up with ideas for new features.

Not all features need to be numeric or categorical. Algorithms have been developed that work directly on text, graphs, and other data structures. In this book, we mainly use numeric or categorical features.

The Adult dataset is a great example of taking a complex reality and attempting to model it using features. In this dataset, the aim is to estimate if someone earns more than $50,000 per year. To download the dataset, navigate to Index of /ml/machine-learning-databases/adult and click on the Data link. Download the adult.data and adult.names into a directory named Adult in your data folder.

###############################

Download the file:

cp8_Sentiment_urlretrieve_pyprind_tarfile_bag词袋_walk目录_regex_verbose_N-gram_Hash_colab_verbose_文本向量化_Linli522362242的专栏-CSDN博客

import urllib.request
import time
import sys
import os

dataset_filename = 'adult.data'
dataset_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    time.sleep(1)  # 1 second
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    currentLoad = progress_size / (1024.**2)  # 1 MB = 1024 KB, 1 KB = 1024 bytes
    speed = currentLoad / duration
    percent = count * block_size * 100. / total_size
    sys.stdout.write("\r%d%% | %d MB | speed=%.2f MB/s | %d sec elapsed"
                     % (percent, currentLoad, speed, duration))
    sys.stdout.flush()

# if the file ('adult.data') does not exist, download it
if not os.path.isfile( dataset_filename ):
    # urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
    # The third argument, if present, is a callable that will be called once on establishment of
    # the network connection and once after each block read thereafter.
    # The callable will be passed three arguments: a count of blocks transferred so far,
    #                                              a block size in bytes,
    #                                              and the total size of the file (bytes).
    urllib.request.urlretrieve(dataset_URL, dataset_filename, reporthook)

OR

import requests

dataset_filename = 'adult.data'
dataset_URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

r = requests.get( dataset_URL )
with open(dataset_filename, 'wb') as f:
    f.write( r.content )

Read online data:

This dataset takes a complex task and describes it in features. These features describe the person, their environment, their background, and their life status.

import pandas as pd

url_adult = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adult = pd.read_csv( url_adult, header=None,
                     names=["Age", "Work-Class", "fnlwgt", "Education",
                            "Education-Num", "Marital-Status", "Occupation",
                            "Relationship", "Race", "Sex", "Capital-gain",
                            "Capital-loss", "Hours-per-week", "Native-Country",
                            "Earnings-Raw"] )
adult.iloc[10:20]


Dealing with missing data

cp4 Training Sets Preprocessing_StringIO_dropna_categorical_feature_Encode_Scale_L1_L2_bbox_to_ancho_Linli522362242的专栏-CSDN博客

adult.dtypes


There is a subtlety here: missing values in this dataset are marked with the '?' character. Sometimes they can be detected with df.isin(['?']).any(),
#########################
for example, n3_knn breastCancer NaiveBayesLikelihood_voter_manhat_Euclid_Minkow_空值?_SBS特征选取_Laplace_zip_NLP_spam_Linli522362242的专栏-CSDN博客

breast_cancer.isin(['?']).any()


#########################
and sometimes it cannot, as in the following situation (the raw values here carry a leading space, such as ' ?', so an exact match on '?' finds nothing):

adult.isin(['?']).any()

adult.isnull().any() # isnull : Alias of isna.


solutions:

import numpy as np
adult = adult.replace( to_replace=r'\?', value=np.nan, regex=True)
adult.iloc[10:16]

adult.isna().any()

adult.shape

adult.isnull().sum(axis=0)

adult.dropna(inplace=True)
# OR adult = adult.dropna(inplace=False) # inplace = False will return a modified object, so we need to save it
adult.isnull().any()

adult[10:16]

 Note: because the original row at index 14 contained a null value (NaN), it was dropped.

###############################

import pandas as pd

url_adult = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adult = pd.read_csv( url_adult, # or 'adult.data'
                     sep=',',
                     keep_default_na=False,
                     header=None,
                     names=["Age", "Work-Class", "fnlwgt", "Education",
                            "Education-Num", "Marital-Status", "Occupation",
                            "Relationship", "Race", "Sex", "Capital-gain",
                            "Capital-loss", "Hours-per-week", "Native-Country",
                            "Earnings-Raw"] )

The adult file itself contains two blank lines at the end of the file. By default, pandas will interpret the penultimate new line to be an empty (but valid) row. To remove this, we remove any line with invalid numbers (the use of inplace just makes sure the same DataFrame is affected, rather than creating a new one):

adult[-5:]

We do not have this issue here, because newer versions of pandas handle the trailing blank lines for us.

If you are using an older version of pandas, add the following code:

adult.dropna( how='all', inplace=True )

Having a look at the dataset, we can see a variety of features from adult.columns:

adult.columns

The results show each of the feature names that are stored inside an Index object from pandas:

Common feature patterns

While there are millions of ways to create features, there are some common patterns that are employed across different disciplines. However, choosing appropriate features is tricky and it is worth considering how a feature might correlate to the end result. As the adage says, don't judge a book by its cover—it is probably not worth considering the size of a book if you are interested in the message contained within.

Some commonly used features focus on the physical properties of the real world objects being studied, for example:

  • Spatial properties such as the length, width, and height of an object
  • Weight and/or density of the object
  • Age of an object or its components
  • The type of the object
  • The quality of the object

Other features might rely on the usage or history of the object:

  • The producer, publisher, or creator of the object
  • The year of manufacturing
  • The use of the object

Other features describe a dataset in terms of its components:

  • Frequency of a given subcomponent, such as a word in a book
  • Number of subcomponents and/or the number of different subcomponents
  • Average size of the subcomponents, such as the average sentence length

Ordinal features allow us to perform ranking, sorting, and grouping of similar values. (Recall the distinction between ordered categorical features, i.e. ordinal ones such as t-shirt size, where we can define an order XL > L > M, and unordered categorical features, i.e. nominal ones such as t-shirt color.) As we have seen in previous chapters, features can be numerical or categorical. Numerical features, such as continuous ones (e.g. house price), are often described as being ordinal. For example, three people, Alice, Bob and Charlie, may have heights of 1.5 m, 1.6 m and 1.7 m. We would say that Alice and Bob are more similar in height than are Alice and Charlie.

The Adult dataset that we loaded in the last section contains examples of continuous, ordinal features. For example, the Hours-per-week feature tracks how many hours per week people work. Certain operations make sense on a feature like this. They include computing the mean, standard deviation, minimum and maximum. There is a function in pandas for giving some basic summary stats of this type:

adult['Hours-per-week'].describe()


     Some of these operations do not make sense for other features. For example, it doesn't make sense to compute the sum of the education statuses.

There are also features that are not numerical, but still ordinal. The Education feature in the Adult dataset is an example of this. For example, a Bachelor's degree is a higher education status than finishing high school, which is a higher status than not completing high school. It doesn't quite make sense to compute the mean of these values, but we can create an approximation by taking the median value. The dataset gives a helpful feature Education-Num, which assigns a number that is basically equivalent to the number of years of education completed. This allows us to quickly compute the median:

adult['Education-Num'].median()

The result is 10, or completing one year of education past high school. If we didn't have the Education-Num feature, we could compute the median by creating an ordering over the education values ourselves, as sketched below.
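A minimal sketch of that idea (not from the book): map the Education strings to a hand-made ranking and take the median of the mapped values. The ranking below mirrors the Education-Num coding, and the leading spaces match how the raw file is parsed here:

education_order = { ' Preschool': 1, ' 1st-4th': 2, ' 5th-6th': 3, ' 7th-8th': 4,
                    ' 9th': 5, ' 10th': 6, ' 11th': 7, ' 12th': 8, ' HS-grad': 9,
                    ' Some-college': 10, ' Assoc-voc': 11, ' Assoc-acdm': 12,
                    ' Bachelors': 13, ' Masters': 14, ' Prof-school': 15, ' Doctorate': 16 }
# median of our own ordinal encoding; should agree with adult['Education-Num'].median()
adult['Education'].map(education_order).median()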

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure( figsize=(12, 9) )
sns.swarmplot( x='Education-Num',
               y='Hours-per-week',
               hue='Earnings-Raw',
               data=adult[::50],
               size=12 )

Later, we will create a LongHours feature, which tells us if a person works more than 40 hours per week. This turns our continuous feature (Hours-per-week) into a categorical one.

Features can also be categorical. For instance, a ball can be a tennis ball, cricket ball, football, or any other type of ball. Categorical features are also referred to as nominal features. For nominal features, the values are either the same or they are different. While we could rank balls by size or weight, just the category alone isn't enough to compare things. A tennis ball is not a cricket ball, and it is also not a football. We could argue that a tennis ball is more similar to a cricket ball (say, in size), but the category alone doesn't differentiate this—they are the same, or they are not.

We can convert categorical features to numerical features using one-hot encoding, as we saw in m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客, Predicting Sports Winners with Decision Trees. For the aforementioned categories of balls, we can create three new binary features: is a tennis ball, is a cricket ball, and is a football. For a tennis ball, the vector would be [1, 0, 0]. A cricket ball has the values [0, 1, 0], while a football has the values [0, 0, 1]. These features are binary, but can be used as continuous features by many algorithms. One key reason for doing this is that it easily allows for direct numerical comparison (such as computing the distance between samples).
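As a minimal illustration (not the book's code), pd.get_dummies produces exactly these binary columns; note that pandas orders the columns alphabetically, so the vectors come out permuted relative to the ordering described above:

import pandas as pd

balls = pd.Series(['tennis ball', 'cricket ball', 'football', 'tennis ball'])
# columns: 'cricket ball', 'football', 'tennis ball'; each row is a binary indicator vector
pd.get_dummies(balls)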

Mapping: convert a string representation (label/class/category) to integers ########

https://blog.csdn.net/Linli522362242/article/details/108230328



import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
df

Mapping ordinal features(size, XL>L>M)

size_mapping = {'XL': 3,'L': 2,'M': 1}
df['size'] = df['size'].map(size_mapping)
df

Ordinal features (now integer values) do not need to go any further (i.e. no one-hot encoding is required for them).

######### Optional: Encoding Ordinal Features

If we are unsure about the numerical differences between the categories of ordinal features, or the difference between two ordinal values is not defined, we can also encode them using a threshold encoding with 0/1 values. For example, we can split the feature "size" with values M, L, and XL into two new features "x > M" and "x > L". Let's consider the original DataFrame:

df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                   ['red', 'L', 13.5, 'class1'],
                   ['blue', 'XL', 15.3, 'class2']])
df.columns = ['color', 'size', 'price', 'classlabel']
df


     We can use the apply method of pandas' DataFrames to write custom lambda expressions in order to encode these variables using the value-threshold approach:

df['x > M'] = df['size'].apply( lambda x: 1 if x in {'L', 'XL'} else 0 )  # return 1 if x in {'L', 'XL'} else 0
df['x > L'] = df['size'].map( lambda x: 1 if x == 'XL' else 0 )           # return 1 if x == 'XL' else 0
del df['size']
df

The encoding is M: (0, 0), L: (1, 0), XL: (1, 1), i.e. a regular (dense) 0/1 representation.
#########
     If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. We can use it as follows:

inv_size_mapping = {v: k for k,v in size_mapping.items()}  # map each integer back to its original string
df['size'].map(inv_size_mapping)

import numpy as np
# create a mapping dict
# to convert class labels from strings to integers
class_mapping = { label: idx for idx, label in enumerate( np.unique(df['classlabel']) )}
class_mapping

The result is a dict mapping each class-label string to an integer.

Mapping unordered categorical (nominal) labels

# to convert class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df


We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels(integer) back to the original string representation

inv_class_mapping = {v: k for k,v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

DataFrame get_dummies ###########

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None, then all the columns with object or category dtype will be converted (here just color, since only the color values are strings).

sparse : bool, default False

Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular (dense) NumPy array (False). For this small example the setting makes no visible difference in the output, so we leave it at the default and focus on drop_first.

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

# one-hot encoding via pandas
pd.get_dummies( df[ ['price', 'color', 'size'] ], columns = ['color'])

(color is an unordered categorical, i.e. nominal, feature)

Multicollinearity guard in get_dummies

pd.get_dummies( df[ ['price', 'color', 'size'] ], drop_first=True )

If we remove the color_blue column, the feature information is still preserved: if we observe color_green=0 and color_red=0, it implies that the observation must be blue.

1. LabelEncoder(): convert a string representation (label/class/category) to integers

from sklearn.preprocessing import LabelEncoder

X = df[['color', 'size', 'price']].values
# X
# array([['green', 1, 10.1],
#        ['red', 2, 13.5],
#        ['blue', 3, 15.3]], dtype=object)
color_le = LabelEncoder()
X[:,0] = color_le.fit_transform(X[:, 0])
X

X[:, 0] is an unordered categorical (nominal) feature encoded as integers; color_le.inverse_transform(X[:, 0]) maps the integers back to the original color strings.

A learning algorithm will now assume that green is larger than blue, and red is larger than green. A common workaround for this problem is to use a technique called one-hot encoding.

2. OneHotEncoder: convert a categorical column (label/class/category) into a sparse one-hot matrix

from sklearn.preprocessing import OneHotEncoder
X=df[['color', 'size', 'price']].values
# X
# array([['green', 1, 10.1],
#        ['red', 2, 13.5],
#        ['blue', 3, 15.3]], dtype=object)
color_ohe=OneHotEncoder()
# X[:,0] ==> array(['green', 'red', 'blue'], dtype=object)
# X[:,0].reshape(-1,1) ==>
# array([['green'],
#        ['red'],
#        ['blue']], dtype=object)
color_ohe.fit_transform( X[:,0].reshape(-1,1) ).toarray()

When we are using one-hot encoding datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue.

color_ohe = OneHotEncoder(drop='first')
color_ohe.fit_transform( X[:,0].reshape(-1,1) ).toarray()  # toarray(): regular (dense) NumPy array

     These features are binary, but can be used as continuous features by many algorithms.
     By default, the OneHotEncoder we initialized returns a sparse matrix when we use the transform method, and we converted the sparse matrix representation into a regular (dense) NumPy array for the purpose of visualization via the toarray method. Sparse matrices are a more efficient way of storing large datasets and one that is supported by many scikit-learn functions, which is especially useful if an array contains a lot of zeros. To omit the toarray step, we could alternatively initialize the encoder as OneHotEncoder(..., sparse=False) to return a regular NumPy array.
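A small sketch of that option (the keyword depends on your scikit-learn version: older releases use sparse=..., while scikit-learn >= 1.2 renames it to sparse_output=...):

from sklearn.preprocessing import OneHotEncoder

# dense output directly, no toarray() needed (assumes an older scikit-learn with the sparse keyword)
color_ohe_dense = OneHotEncoder(sparse=False)
color_ohe_dense.fit_transform( X[:,0].reshape(-1,1) )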
###
m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客

from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
X_teams = onehot.fit_transform( X_teams ).todense()
X_teams[-3:]

todense() returns a matrix.

ColumnTransformer: process multiple columns ########################


from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values
# X
# array([['green', 1, 10.1],
#        ['red', 2, 13.5],
#        ['blue', 3, 15.3]], dtype=object)

#                             ( name     , transformer    , columns )
c_transf = ColumnTransformer([('onehot' , OneHotEncoder(), [0]),
                              ('nothing', 'passthrough'  , [1, 2])  # 'passthrough': pass the [columns] through untransformed
                             ])                                     # 'drop': drop the [columns]
c_transf.fit_transform(X).astype(float)

color_ohe = OneHotEncoder(categories='auto',drop='first')

VS 

##########

The Adult dataset contains several categorical features, with Work-Class being one example. While we could argue that some values are of higher rank than others (for instance, a person with a job is likely to have a better income than a person without), it doesn't make sense for all values. For example, a person working for the state government is not more or less likely to have a higher income than someone working in the private sector.

We can view the unique values for this feature in the dataset using the unique() function:

adult['Work-Class'].unique()

There are some missing values in the preceding dataset, but they won't affect our computations in this example.

Similarly, we can convert numerical features to categorical features through a process called discretization, as we saw in m04_Recommending Movies w Affinity Analysis_Apriori_sys.stdout.flush_df.iterrows_Sort nested dict嵌套_Linli522362242的专栏-CSDN博客, Recommending Movies Using Affinity Analysis. We can call any person taller than 1.7 m tall, and any person shorter than 1.7 m short. This gives us a categorical feature (although still an ordinal one). We do lose some data here. For instance, two people, one 1.69 m tall and one 1.71 m, will be in two different categories and considered drastically different from each other. In contrast, a person 1.2 m tall will be considered "of roughly the same height" as the person 1.69 m tall! This loss of detail is a side effect of discretization, and it is an issue that we deal with when creating models.
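A minimal sketch of this height discretization (hypothetical values, not from the dataset):

import pandas as pd

heights = pd.Series([1.2, 1.69, 1.71, 1.85])   # metres
pd.cut(heights, bins=[0, 1.7, 3.0], labels=['short', 'tall'])
# 1.69 m -> 'short' and 1.71 m -> 'tall' land in different categories,
# while 1.2 m and 1.69 m share the same 'short' label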

In the Adult dataset, we can create a LongHours feature, which tells us if a person works more than 40 hours per week. This turns our continuous feature (Hours-per-week) into a categorical one:
##############

13_Loading & Preproces Data from multiple CSV with TF 2_Feature Columns_TF eXtended_num_oov_buckets_Linli522362242的专栏-CSDN博客

Bucketized column

Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age (a continuous feature) into several buckets (categories) using a bucketized column. Notice that the one-hot values describe which age range each row matches. Buckets include the left boundary and exclude the right boundary. For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:

The model will represent the buckets as follows:

Why would you want to split a number — a perfectly valid input to your model — into a categorical value? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one; four weights create a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories, since only one of the elements is set (1) and the other three elements are cleared (0). When we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that it can use to learn more complex relationships.
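A short sketch of the year example using TensorFlow feature columns (the boundary values are made up for illustration, and tf.feature_column is assumed to be the API meant by the linked post):

import tensorflow as tf

year = tf.feature_column.numeric_column('year')
# buckets: (-inf, 1960), [1960, 1980), [1980, 2000), [2000, +inf)
bucketized_year = tf.feature_column.bucketized_column(year, boundaries=[1960, 1980, 2000])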

##############

adult['LongHours'] = adult['Hours-per-week'] > 40
adult.head(n=10)

Creating good features

Modeling, and the loss of information that the simplification causes, are the reasons why we do not have data mining methods that can just be applied to any dataset. A good data mining practitioner will have, or obtain, domain knowledge in the area they are applying data mining to. They will look at the problem, the available data, and come up with a model that represents what they are trying to achieve.

For instance, a height feature may describe one component of a person, but may not describe their academic performance well. If we were attempting to predict a person's grade, we may not bother measuring each person's height.

This is where data mining becomes more art than science. Extracting good features is difficult and is the topic of significant and ongoing research. Choosing better classification algorithms can improve the performance of a data mining application, but choosing better features is often a better option.

In all data mining applications, you should first outline what you are looking for before you start designing the methodology that will find it. This will dictate the types of features you are aiming for, the types of algorithms that you can use, and the expectations on the final result.

Feature selection

We will often have a large number of features to choose from, but we wish to select only a small subset. There are many possible reasons for this:

  • Reducing complexity: Many data mining algorithms need more time and resources as the number of features increases. Reducing the number of features is a great way to make an algorithm run faster or with fewer resources.
  • Reducing noise: Adding extra features doesn't always lead to better performance. Extra features may confuse the algorithm, causing it to find correlations and patterns that don't have meaning (this is common in smaller datasets). Choosing only the appropriate features is a good way to reduce the chance of random correlations that have no real meaning.
  • Creating readable models: While many data mining algorithms will happily compute an answer for models with thousands of features, the results may be difficult to interpret for a human. In these cases, it may be worth using fewer features and creating a model that a human can understand.

Some classification algorithms can handle data with issues such as these. Getting the data right and getting the features to effectively describe the dataset you are modeling can still assist algorithms.

There are some basic tests we can perform, such as ensuring that the features are at least different. If a feature's values are all the same, it can't give us extra information to perform our data mining.

The VarianceThreshold transformer in scikit-learn, for instance, will remove any feature that doesn't have at least a minimum level of variance in the values. To show how this works, we first create a simple matrix using NumPy:

import numpy as np

X = np.arange(30).reshape( (10,3) )
X

The result is the numbers zero to 29, in three columns and 10 rows. This represents a synthetic dataset with 10 samples and three features:

Then, we set the entire second column/feature to the value 1:

X[:,1] = 1
X


We can now create a VarianceThreshold transformer and apply it to our dataset:

from sklearn.feature_selection import VarianceThreshold

Running the import on my local environment failed partway through the traceback with:

ImportError: cannot import name '_libsvm_sparse' from 'sklearn.svm' (C:\Anaconda3\envs\tensorflow\lib\site-packages\sklearn\svm\__init__.py)

Apparently my local scikit-learn installation is broken and feature_selection cannot be imported (I checked the environment by running conda list from cmd.exe), so I ran the following in Google Colab instead:

from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold()
Xt = vt.fit_transform(X)
Xt

threshold : float, default=0

Features with a training-set variance lower than this threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
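To make the threshold parameter concrete, here is a small sketch with made-up values; the nearly constant third column has variance below 0.2 and is removed along with the truly constant second column:

from sklearn.feature_selection import VarianceThreshold
import numpy as np

X_small = np.array([[0, 1, 0.0],
                    [1, 1, 0.1],
                    [0, 1, 0.0],
                    [1, 1, 0.1]])
vt_small = VarianceThreshold(threshold=0.2)
vt_small.fit_transform(X_small)   # only the first column (variance 0.25) survives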


Variances:

np.var( X, axis=0)


OR

np.nanvar(X, axis=0)

OR

( X.std(axis=0) )**2

Note that NumPy's var and std use ddof=0 by default, i.e. the population variance and population standard deviation (see cp4 Training Sets Preprocessing_StringIO_dropna_categorical_feature_Encode_Scale_L1_L2_bbox_to_ancho_Linli522362242的专栏-CSDN博客). We can verify this by computing the population variance by hand:

sum( ( X-X.mean(axis=0) )**2 )/(X.shape[0])

which matches ( X.std(axis=0) )**2 above, proving that NumPy's std uses ddof=0 (population variance, hence population standard deviation).
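The ddof argument makes the population/sample distinction explicit (a quick check, not from the original post):

np.var(X, axis=0, ddof=0)   # population variance, NumPy's default
np.var(X, axis=0, ddof=1)   # sample (unbiased) variance, divides by N-1 instead of N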

We can observe the variances for each column by printing the vt.variances_ attribute:

vt.variances_

To see exactly how variances_ is computed, we can look at the scikit-learn source:

​​​​​​scikit-learn/_variance_threshold.py at 2beed55847ee70d363bdbfe14ee4401438fba057 · scikit-learn/scikit-learn · GitHub

def fit(self, X, y=None):
    ...
    if hasattr(X, "toarray"):   # sparse matrix
        _, self.variances_ = mean_variance_axis(X, axis=0)
        if self.threshold == 0:
            mins, maxes = min_max_axis(X, axis=0)
            peak_to_peaks = maxes - mins
    else:
        self.variances_ = np.nanvar(X, axis=0)
        if self.threshold == 0:
            peak_to_peaks = np.ptp(X, axis=0)

    if self.threshold == 0:
        # Use peak-to-peak to avoid numeric precision issues
        # for constant features
        compare_arr = np.array([self.variances_, peak_to_peaks])
        self.variances_ = np.nanmin(compare_arr, axis=0)
hasattr( X, 'toarray' )   # False

Since X is not a sparse matrix, execution goes to the else branch:

np.nanvar(X, axis=0)

np.ptp(X, axis=0)
# numpy.ptp(a, axis=None, out=None, keepdims=<no value>)
# Range of values (maximum - minimum) along an axis, equivalent to:
np.max(X, axis=0) - np.min(X, axis=0)

Because threshold == 0, the variances are then clipped against the peak-to-peak values:

compare_arr = np.array([self.variances_, peak_to_peaks])
self.variances_ = np.nanmin(compare_arr, axis=0)

The result shows that while the first and third columns contain at least some information, the second column has no variance:

A simple and obvious test like this is always good to run when seeing data for the first time. Features with no variance do not add any value to a data mining application; however, they can slow down the performance of the algorithm.

Selecting the best individual features

If we have a number of features, the problem of finding the best subset is a difficult task. It relates to solving the data mining problem itself, multiple times. As we saw in m04_Recommending Movies w Affinity Analysis_Apriori_sys.stdout.flush_df.iterrows_Sort nested dict嵌套_Linli522362242的专栏-CSDN博客, Recommending Movies Using Affinity Analysis, subset-based tasks grow exponentially as the number of features increases. This exponential growth in time needed is also true for finding the best subset of features.

A workaround to this problem is not to look for a subset that works well together, but rather to just find the best individual features. This univariate feature selection gives us a score based on how well a feature performs by itself. This is usually done for classification tasks, and we generally measure some type of correlation between a variable and the target class.

The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods of computing the quality of a feature.

There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ2) test. Other methods include mutual information and entropy.
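As a hedged sketch of swapping in a different scorer (synthetic data, not the Adult dataset, and not code from the book): scikit-learn's mutual_info_classif can be passed to SelectKBest in place of chi2:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_demo_best = selector.fit_transform(X_demo, y_demo)
selector.scores_   # one mutual-information score per original feature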

We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:

X = adult[ ['Age', 'Education-Num', 'Capital-gain', 'Capital-loss', 'Hours-per-week']].values
X


     We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True. Otherwise, it will be False. Let's look at the code:

adult['Earnings-Raw'][:10], adult['Earnings-Raw'][-10:],

y = ( adult['Earnings-Raw'] == ' >50K' ).values
y[:10], y[-10:]


###############

CHAID: Humidity has 2 categories and our expected values should be evenly distributed in order to calculate how distinguishing the variable is:

The chi-squared test measures the deviation between a sample's observed values (Play tennis) and the theoretically expected values. The larger the deviation between the observed and expected values, the larger the chi-squared statistic; the smaller the deviation, the smaller the statistic; and if the two are exactly equal, the chi-squared value is 0, meaning the observations fit the theory perfectly. Note that the chi-squared test applies to categorical variables.

Calculating the χ² (chi-squared) value:

$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$


Calculating degrees of freedom = (r-1) * (c-1)
Where
        r = number of row categories (categories of the variable),
        c = number of response categories.

Here, there are two row categories (High and Normal) and two column categories (No and Yes).
Hence, degrees of freedom = (r-1) * (c-1) = (2-1) * (2-1) = 1 (c = 2, since the response is No or Yes).
p-value for Chi-square 2.8 with 1 d.f = 0.0942
p-value can be obtained with the following Excel formulae: = CHIDIST (2.8, 1) = 0.0942

from scipy import stats
# https://www.graduatetutor.com/statistics-tutor/probability-density-function-pdf-and-cumulative-distribution-function-cdf/
pval = 1 - stats.chi2.cdf( 2.8, 1 )  # 1 - Cumulative Distribution Function
pval
pval


     In a similar way, we will calculate the p-value for all variables and select the best variable with a low p-value (high chi-squared value).

###############

Next, we create our transformer using the chi2 function and a SelectKBest transformer:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

transformer = SelectKBest( score_func=chi2, k=3 )
Xt_chi2 = transformer.fit_transform( X, y )
Xt_chi2

(only the k=3 selected feature columns remain)

The resulting matrix now contains only three features. We can also get the scores for each column, allowing us to find out which features were used. Let's look at the code:

print( transformer.scores_ )

(the scores correspond, in order, to 'Age', 'Education-Num', 'Capital-gain', 'Capital-loss', 'Hours-per-week')

The highest values are for the first, third, and fourth columns, which correspond to the Age, Capital-gain, and Capital-loss features. Based on a univariate feature selection, these are the best features to choose.

################
     We often call the number of independent variables in an expression its "degrees of freedom". To determine the degrees of freedom of an expression: if it contains n variables, of which k are constrained by sample statistics (common statistics include the sample mean, sample variance, and sample range), then the expression has n-k degrees of freedom. For example, an expression containing the n variables ξ1, ξ2, …, ξn, where ξ1 through ξn-1 are mutually independent and ξn is the mean of the remaining variables, has n-1 degrees of freedom.

from scipy import stats

# 2 : >50K or not
# degrees of freedom = (r-1)*(2-1) = r-1, here taken as X.shape[1]-1
pval = 1-stats.chi2.cdf( transformer.scores_, X.shape[1]-1 )
pval

transformer.pvalues_

Clearly, the p-values cannot be used here, because they are all 0.
################

If you'd like to find out more about the features in the Adult dataset, take a look at the adult.names file that comes with the dataset and the academic paper it references.
Index of /ml/machine-learning-databases/adult

We could also implement other correlations, such as the Pearson's correlation coefficient. This is implemented in SciPy, a library used for scientific computing (scikit-learn uses it as a base).
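A sketch of how Pearson's r could be wrapped so that SelectKBest can use it as a scoring function (scipy.stats.pearsonr scores one column at a time, so we loop over the columns of the X defined above; the helper name multivariate_pearsonr is just an illustrative choice):

import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

def multivariate_pearsonr(X, y):
    # SelectKBest expects (scores, pvalues); use |r| so strong negative
    # correlations also count as good features
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)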
#######################################
cp10_回归预测连续目标变量_boston_Residual_plot_mlxtend_sns_pd_covariance_correlation_RANSAC_R2_Ridge_C_F_A_K_树_Linli522362242的专栏-CSDN博客

In the previous section, we visualized the data distributions of the Housing dataset variables in the form of histograms and scatterplots. Next, we will create a correlation matrix to quantify and summarize linear relationships between variables https://blog.csdn.net/Linli522362242/article/details/103387527. A correlation matrix is closely related to the covariance matrix that we covered in the section Unsupervised dimensionality reduction via principal component analysis in cp5_Compressing Data via Dimensionality Reduction_feature extraction_PCA_LDA_convergence_kernel PCA https://blog.csdn.net/Linli522362242/article/details/105196037.

Constructing the covariance matrix

The symmetric d × d-dimensional covariance matrix, where d is the number of dimensions in the dataset, stores the pairwise covariances between the different features. For example, the covariance between two features $x_j$ and $x_k$ on the population level can be calculated via the following equation:

$\sigma_{jk} = \frac{1}{N}\sum_{i=1}^{N}\left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$

versus the sample covariance:

$\sigma_{jk} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_j^{(i)} - \bar{x}_j\right)\left(x_k^{(i)} - \bar{x}_k\right)$

     The reason the sample covariance matrix has N-1 in the denominator rather than N is essentially that the population mean $\mu$ is not known and is replaced by the sample mean.
     Here, $\bar{x}_j$ and $\bar{x}_k$ are the sample means of features j and k, respectively. Note that the sample means are zero if we standardize the dataset (cp4 Training Sets Preprocessing_StringIO_dropna_categorical_feature_Encode_Scale_L1_L2_bbox_to_ancho_Linli522362242的专栏-CSDN博客). A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions. For example, a covariance matrix of three features can then be written as (note that $\Sigma$ stands for the Greek uppercase letter sigma, which is not to be confused with the sum symbol):

$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}$

We can interpret the correlation matrix as being a rescaled version of the covariance matrix. In fact, the correlation matrix is identical to a covariance matrix computed from standardized features.

The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficient (often abbreviated as Pearson's r), which measures the linear dependence between pairs of features. The correlation coefficients are in the range –1 to 1. Two features have a perfect positive correlation if r = 1, no correlation if r = 0, and a perfect negative correlation if r = –1. As mentioned previously, Pearson's correlation coefficient can simply be calculated as the covariance between two features, x and y (numerator), divided by the product of their standard deviations (denominator):

$r = \frac{\sum_{i=1}^{N}\left[\left(x^{(i)} - \mu_x\right)\left(y^{(i)} - \mu_y\right)\right]}{\sqrt{\sum_{i=1}^{N}\left(x^{(i)} - \mu_x\right)^2}\;\sqrt{\sum_{i=1}^{N}\left(y^{(i)} - \mu_y\right)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$

Covariance versus correlation for standardized features

We can show that the covariance between a pair of standardized features is, in fact, equal to their linear correlation coefficient. To show this, let's first standardize the features x and y to obtain their z-scores, which we will denote as $x'$ and $y'$, respectively:

$x' = \frac{x - \bar{x}}{\sigma_x}, \qquad y' = \frac{y - \bar{y}}{\sigma_y}$

The (population) covariance of these standardized features is then

$\sigma'_{xy} = \frac{1}{N}\sum_{i=1}^{N} x'^{(i)} y'^{(i)} = \frac{1}{N}\sum_{i=1}^{N} \frac{\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = r$
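A quick numerical check of that claim (synthetic data, not part of the original post): the covariance of z-scored features equals the Pearson correlation of the raw features.

import numpy as np

rng = np.random.RandomState(0)
feat_a = rng.normal(size=100)
feat_b = 0.5 * feat_a + rng.normal(size=100)

a_z = (feat_a - feat_a.mean()) / feat_a.std()   # z-scores (population std, ddof=0)
b_z = (feat_b - feat_b.mean()) / feat_b.std()

np.cov(a_z, b_z, ddof=0)[0, 1], np.corrcoef(feat_a, feat_b)[0, 1]   # the two values match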
