Table of Contents

  • Covariance and correlation
  • How can you select k for k means?
  • Naive Bayes
    • Why is Naive Bayes “naive”?
    • Naive Bayes vs. RF
  • Q: What is the difference between online and batch learning (offline learning)?
  • How to identify outliers?
  • Reduce overfitting
  • Q: What are the feature selection methods used to select the right variables?
  • Q: Why is mean square error a bad measure of model performance? What would you suggest instead?
  • PCA
  • Linear regression
    • Assumptions
    • What Happens When We Break Some of These Assumptions? (And What Can We Do About It?)
    • regularization
    • drawback of linear model
  • Tree
    • Prevent Overfitting
  • More features than data points
    • More features than data points in linear regression?
  • SVM
    • What are the support vectors in SVM?
    • Primal and Dual (features vs. samples)
    • SVC and LinearSVC
  • Bagging, Boosting, Stacking
    • Linear regression & Boosting
    • When would you use random forests vs. SVM and why?
  • Random Forest
    • Explain what the bootstrap sampling method is and give an example of when it’s used.
    • Pros and cons
  • AdaBoost vs Gradient Boosting
    • Explain the "gradient" part of the Gradient Boosting
    • Gradient Boosting implementations
    • Bootstrapping
  • Gradient Descent(Logistic Classification)
  • TF-IDF
  • Cross validation, test set, and final model
  • Skewed data and Log
  • Multicollinearity
    • Baseline model
    • Interpretable vs Explainable Machine Learning
  • Imbalanced data
    • how to deal with imbalanced data?
    • SMOTE
    • Evaluation Metrics
    • Type I vs Type II Error
  • Deep Learning
  • Initializing all the weights with all (constant) zero
  • Softmax
  • Activation function?
  • Vanishing/Exploding Gradient Problem
  • Optimizer
  • Feature Engineering
  • Features with High cardinality
  • BinaryEncoding

Covariance and correlation

  • Both covariance and correlation measure the linear relationship and the dependency between two variables.

  • Correlation values are standardized. (Pearson correlation coefficient)

  • Covariance values are not standardized.

  • Correlation only measures the linear relationship between two variables.

  • If the correlation is zero, the two variables have no linear relationship, but they may still have a non-linear relationship.
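
A minimal NumPy sketch (NumPy assumed available) illustrating these points, including a correlation near zero despite a perfect non-linear relationship:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(0, 0.5, size=1000)

print(np.cov(x, y)[0, 1])            # covariance: scale-dependent, unbounded
print(np.corrcoef(x, y)[0, 1])       # correlation: standardized to [-1, 1]
print(np.corrcoef(x, x ** 2)[0, 1])  # ~0, yet y = x^2 is perfectly (non-linearly) dependent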

Basic statistics and machine learning concepts.
Interview tip: avoid introducing extra concepts that invite follow-up questions from the interviewer; but if you know an introduced concept very well, you can deliberately steer the interviewer toward it.


https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e

How can you select k for k means?

You can use the elbow method, which is a popular method used to determine the optimal value of k. Essentially, what you do is plot the squared error for each value of k on a graph (value of k on the x-axis and squared error on the y-axis). Once the graph is made, the point where the distortion declines the most is the elbow point.

https://medium.com/analytics-vidhya/elbow-method-of-k-means-clustering-algorithm-a0c916adc540
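
A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available; make_blobs provides toy data and inertia_ is the within-cluster sum of squared errors:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared errors

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("sum of squared errors")
plt.show()  # the elbow is where the curve bends most sharply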

Naive Bayes

Why is Naive Bayes “naive”?

Naive Bayes is naive because it makes a strong assumption: the features are assumed to be conditionally independent of one another given the class, which is almost never true in practice.

Naive Bayes vs. RF

Naive Bayes is better in the sense that it is easy to train and understand the process and results. A random forest can seem like a black box. Therefore, a Naive Bayes algorithm may be better in terms of implementation and understanding.

However, in terms of performance, a random forest is typically stronger because it is an ensemble technique.

Q: What is the difference between online and batch learning (offline learning)?

  • In offline learning, the whole training data must be available at the time of model training. Only when training is completed can the model be used for predicting.
  • In contrast, online algorithms process data sequentially. They produce a model and put it in operation without having the complete training dataset available at the beginning. The model is continuously updated during operation as more training data arrives.
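
A minimal sketch of online learning with scikit-learn's SGDClassifier (assumed available), which updates the model chunk by chunk via partial_fit (older sklearn versions spell the loss "log" rather than "log_loss"):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y)                      # must be declared on the first call
for start in range(0, len(X), 1000):        # pretend data arrives in chunks
    clf.partial_fit(X[start:start + 1000], y[start:start + 1000], classes=classes)

print(clf.score(X, y))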

How to identify outliers?

There are a couple of ways to identify outliers:

  • We can use visualization tools such as box plots, scatter plots, or histograms to spot outliers directly from the distribution of a feature.
  • Z-score/standard deviations: since 99.7% of normally distributed data lie within three standard deviations of the mean, we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Equivalently, we can calculate the z-score of a given point; if its absolute value exceeds 3, the point is an outlier.
    Note that there are a few caveats to this method: the data must be normally distributed, it is not applicable to small data sets, and the presence of too many outliers can throw off the z-scores.
  • Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR; this comes to approximately 2.698 standard deviations. Both rules are sketched below.
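
A minimal NumPy sketch of both rules (NumPy assumed available) on toy data with one injected outlier:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 1000), [8.0]])  # one injected outlier

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])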

Reduce overfitting

  • Cross-validation
  • Regularization
  • Reduce the number of features
  • Ensemble Learning Techniques

Q: What are the feature selection methods used to select the right variables?

There are two types of methods for feature selection: filter methods and wrapper methods.

Filter methods include the following:

  • Linear discrimination analysis
  • ANOVA
  • Chi-Square (We aim to select the features which are highly dependent on the target.)

Wrapper methods: evaluate models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

include the following:

  • Forward Selection: starts with one predictor and adds more iteratively. At each subsequent iteration, the best of the remaining original predictors are added based on performance criteria.
  • Backward Selection: We test all the features and start removing them to see what works better

Business sense

  • based on what features makes sense in business. Especially when we need interpretability of the model.

https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
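
A minimal sketch of one filter method (chi-square) and one wrapper method (forward selection), assuming scikit-learn is available:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features most dependent on the target by chi-square score
X_filtered = SelectKBest(chi2, k=2).fit_transform(X, y)

# Wrapper: greedily add the feature that most improves cross-validated accuracy
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward")
X_wrapped = sfs.fit_transform(X, y)
print(X_filtered.shape, X_wrapped.shape)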

Q: Why is mean square error a bad measure of model performance? What would you suggest instead?

Mean Squared Error (MSE) gives a relatively high weight to large errors, so it tends to put too much emphasis on large deviations (e.g., outliers). A more robust alternative is MAE (mean absolute error).


PCA

Principal component analysis (PCA) projects data to a lower-dimensional linear subspace, in a way that preserves the axes of highest variance in the data.

An important pre-processing step before applying PCA is to standardize features to have zero mean and unit variance.
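
A minimal sketch, assuming scikit-learn is available: standardize first, then project onto the top principal components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipe.fit_transform(X)  # zero-mean, unit-variance features -> top-2 PCs
print(pipe.named_steps["pca"].explained_variance_ratio_)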

Linear regression

Assumptions

  • Linearity: Linear relationship between x & y
  • Independence: samples are i.i.d (Independent and identically distributed)
  • No perfect collinearity: The independent variables do not share a perfect, linear relationship.
  • Zero conditional mean: The error term is unrelated to our independent variables.
  • Homoscedasticity: the variance of the error term is constant.
  • Normality: the residuals follow a normal distribution

What Happens When We Break Some of These Assumptions? (And What Can We Do About It?)

  • https://towardsdatascience.com/what-happens-when-you-break-the-assumptions-of-linear-regression-f78f2fe90f3a
  • https://www.quality-control-plan.com/StatGuide/linreg_ass_viol.htm#Lack%20of%20independence

1. Non-linear

  • Our parameter estimates would be biased, and our model would make poor predictions.

How to determine linear/non-linear relationship

  • You could create a scatter plot between the two variables and see if the relationship between them is linear or non-linear.
  • You can then compare the performance between a linear regression and a non-linear regression, and choose the function that performs best.
  • Fit the data with a linear regression, and then plot the residuals. If there is no obvious pattern in the residual plot, then the linear regression was likely the correct model. However, if the residuals look non-random, then perhaps a non-linear regression would be the better choice; see the sketch below.
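
A minimal sketch of the residual-plot check, assuming scikit-learn and matplotlib are available; the toy data is deliberately non-linear so the pattern shows up:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 200)  # truly non-linear relationship

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

plt.scatter(model.predict(x), residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()  # a U-shaped pattern suggests the linear model is misspecified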

2. Our sample is non-random

Thus, it is good practice to do some EDA prior to building a regression model to confirm that the two groups are not drastically different

For serially correlated Y values, the estimates of the slope and intercept will be unbiased, but the estimates of their variances will not be reliable.

If you are unsure whether your Y values are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.


3. We Have Perfect Collinearity

  • Perfect (or near-perfect) collinearity does not impact prediction, but it can impact inference.

How to solve it

  • drop one of the variables

4. Our Error Term is Correlated with One of Our Independent Variables

  • This occurs when our regression model differs from the true model; there is typically some omitted-variable bias.

How to solve it

  • First, if you know the variables that should be included in the true model, then you can add these variables to the model you are building. This is the best solution; however, it is also unrealistic because we can never truly know what variables are in the true model.
  • The second solution is to conduct a Randomized Control Trial (RCT). In an RCT, the researcher randomly allocates participants to the treatment group or the control group. Because the treatment is given randomly, the relationship between the error term and the independent variables is equal to zero.

5. violating the homoscedasticity

https://stats.stackexchange.com/questions/22800/what-are-the-dangers-of-violating-the-homoscedasticity-assumption-for-linear-reg

  • does not introduce a bias in the estimates of your coefficients.
  • the standard errors of the coefficients are wrong. This means that one cannot compute any t-statistics and p-values and consequently hypothesis testing is not possible.

How to solve it

  • use robust standard errors

6. Our Errors are Non-Normal

  • the results of our hypothesis tests and confidence intervals will be inaccurate.

How to solve it

  • transform your target variable so that it becomes normal. This can have the effect of making the errors normal, as well. The log transform and square root transform are most common.

regularization

  • Lasso regression: L1-norm, Feature selection
  • Ridge regression: L2-norm
  • Elastic-net regression: L1-norm, L2-norm, Feature selection
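
A minimal sketch of the three penalties with scikit-learn (assumed available):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)    # L1: drives some coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2: shrinks coefficients toward 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), "coefficients zeroed by the lasso")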

drawback of linear model

  • A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity.
  • A linear model can’t be used for discrete or binary outcomes.
  • You can’t vary the model flexibility of a linear model.

What is flexibility in machine learning?

  • You can think of “Flexibility” of a model as the model’s “curvy-ness” when graphing the model equation
  • Greater flexibility corresponds to lower bias but higher variance. It allows fitting a wider variety of functions, but increases the risk of overfitting.

https://stackoverflow.com/questions/26437372/what-is-the-definition-of-flexibility-of-a-method-in-machine-learning
https://stats.stackexchange.com/questions/338009/how-is-model-flexibility-measured-or-quantified-what-units-is-it-measured-in/338896

Tree

Prevent Overfitting

  • Pruning

    • Reduced error

      • Starting at the leaves, each node is replaced with its most popular class.
      • If the validation metric is not negatively affected, then the change is kept, else it is reverted.
      • Reduced error pruning has the advantage of speed and simplicity.
    • Cost complexity

https://stats.stackexchange.com/questions/193538/how-to-choose-alpha-in-cost-complexity-pruning


  • Early stopping

    • Maximum depth
    • Maximum leaf nodes
    • Minimum samples split
    • Minimum impurity decrease
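
A minimal sketch of cost-complexity pruning with scikit-learn (assumed available); the same DecisionTreeClassifier also exposes the early-stopping parameters listed above (max_depth, max_leaf_nodes, min_samples_split, min_impurity_decrease):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the tree's own cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"alpha={alpha:.4f}  cv accuracy={score:.3f}")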

More features than data points

  • Feature selection involves selecting a subset of predictors to use as input to predictive models.

Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.g., correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.g., RFE, Recursive Feature Elimination).


  • Projection Methods (Dimension reduction) create a lower-dimensionality representation of samples that preserves the relationships observed in the data.

They are often used for visualization, although the dimensionality reduction nature of the techniques may also make them useful as a data transform to reduce the number of predictors. This might include techniques from linear algebra, such as SVD and PCA.


  • Regularized Algorithms. Standard machine learning algorithms may be adapted to use regularization during the training process.

This will penalize models based on the number of features used or weighting of features, encouraging the model to perform well and minimize the number of predictors used in the model.

This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.g., regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO.


There is no best method and it is recommended to use controlled experiments to test a suite of different methods.

These remedies are essentially the same for all models.


More features than data points in linear regression?


The remedies are essentially the same:

  • Regularization
  • Dimension Reduction
  • Subset Selection

  • https://machinelearningmastery.com/how-to-handle-big-p-little-n-p-n-in-machine-learning/
  • https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e
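
A minimal sketch of regularization in the p >> n setting, assuming scikit-learn is available; the lasso recovers a sparse model even with 10x more features than samples:

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 50, 500                    # 10x more features than samples
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(0, 0.1, n)  # only 5 features matter

model = LassoCV(cv=5).fit(X, y)
print("non-zero coefficients:", (model.coef_ != 0).sum())  # close to 5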

SVM

What are the support vectors in SVM?

The support vectors are the data points that lie on the margin boundaries (or, in the soft-margin case, inside the margin or on the wrong side of it); they alone determine the position of the maximum-margin hyperplane.

Primal and Dual (features vs. samples)

Would training the soft-margin SVMs either using primal or the dual formulation yield the same results on a test dataset? Why?

Yes, because the optimization problem is convex, strong duality holds, and the primal and dual formulations yield the same solution.

  • The primal form is preferred when we don’t need to apply the kernel trick and the dataset is large but the dimension of each data point is small.
  • The dual form is preferred when the data has a huge dimension and we need to apply the kernel trick.

Note that this is not about having more features than samples; if there are more features than samples, the main remedies are still the ones described in the previous section.


  • https://medium.com/geekculture/the-optimization-behind-svm-primal-and-dual-form-5cca1b052f45

SVC and LinearSVC

  • SVC: the implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.
  • LinearSVC: similar to SVC with kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and scales better to large numbers of samples.

https://stackoverflow.com/questions/45384185/what-is-the-difference-between-linearsvc-and-svckernel-linear
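
A minimal sketch of the two scikit-learn classes (assumed available); both fit a linear decision boundary, but via different solvers:

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

svc = SVC(kernel="linear").fit(X, y)        # libsvm: fit time ~quadratic in n_samples
lin = LinearSVC(max_iter=10000).fit(X, y)   # liblinear: scales to much larger n
print(svc.score(X, y), lin.score(X, y))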

Bagging, Boosting, Stacking

  • https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

  • https://zhuanlan.zhihu.com/p/36822575

  • The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

  • Indeed, to be able to “solve” a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but not so many degrees of freedom that it suffers from high variance. This is the well-known bias-variance tradeoff.

  • Weak learners: High bias or high variance.

  • Bagging (reduces variance) often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.

  • Boosting (reduces bias) often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones), and combines them following a deterministic strategy.

  • Stacking (reduces bias) often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models’ predictions.

Linear regression & Boosting

  • https://stats.stackexchange.com/questions/186966/gradient-boosting-for-linear-regression-why-does-it-not-work
  • https://stats.stackexchange.com/questions/230388/how-does-linear-base-learner-works-in-boosting-and-how-does-it-works-in-the-xgb

We can use linear regression as the base learner, but it is not common in practice:

  • The final model prediction has the same functional form as the full linear regression.
  • A possible defense of boosting in this situation could be the implicit regularization it provides. Possibly (I haven’t played with this) you could use the early stopping feature of a gradient booster, along with a cross validation, to stop short of the full linear regression. This would provide a regularization to your regression, and possibly help with overfitting. This is not particularly practical, as one has very efficient and well-understood options like ridge regression and the elastic net in this setting.
  • Boosting shines when there is no terse functional form around. Boosting decision trees lets the functional form of the regressor/classifier evolve slowly to fit the data, often resulting in complex shapes one could not have dreamed up by hand and eye. When a simple functional form is desired, boosting is not going to help you find it (or at least is probably a rather inefficient way to find it).
  • For example, the combination of n trees is still a tree, but this tree will be significantly bigger and complexity of such tree is impractical. On the other hand in case of linear functions complexity is exactly the same.

When would you use random forests vs. SVM and why?

There are a couple of reasons why a random forest is a better choice of an algorithm than a support vector machine:

  • Random forests allow you to determine feature importance; SVMs can’t do this directly.
  • Random forests are much quicker and simpler to build than an SVM.
  • For multi-class classification problems, SVMs require a one-vs-rest scheme, which is less scalable and more memory-intensive.

Random Forest

Explain what the bootstrap sampling method is and give an example of when it’s used.

Technically speaking, the bootstrap sampling method is a resampling method that uses random sampling with replacement.

It’s an essential part of the random forest algorithm, as well as other ensemble learning algorithms.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance.

https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
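
A minimal NumPy sketch of a single bootstrap sample (NumPy assumed available):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)

# A bootstrap sample: same size as the data, drawn with replacement
sample = rng.choice(data, size=len(data), replace=True)
print(sample)                  # duplicates are expected
print(np.unique(sample).size)  # on average ~63.2% of the points appear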

Pros and cons

Pros:

  • Random Forests can be used for both classification and regression tasks.
  • Random Forests work well with both categorical and numerical data. No scaling or transformation of variables is usually necessary.
  • Random Forests implicitly perform feature selection and generate uncorrelated decision trees. It does this by choosing a random set of features to build each decision tree. This also makes it a great model when you have to work with a high number of features in the data.
  • Random Forests are fairly robust to outliers, because splitting depends only on the ordering of feature values (effectively binning the variables).
  • Random Forests can handle linear and non-linear relationships well.
  • Random Forests generally provide high accuracy and balance the bias-variance trade-off well. Since the model averages predictions across the many decision trees it builds, it reduces the variance as well.
  • OOB for model selection

Cons:

  • Random Forests are not easily interpretable. They provide feature importances, but not the complete visibility into individual effects that linear regression coefficients offer.
  • Random Forests can be computationally intensive for large datasets.
  • Random forest is like a black box algorithm, you have very little control over what the model does.
  • In problems with multiple categorical variables, random forest may not be able to increase the accuracy of the base learner.
  • For data whose categorical attributes have different numbers of levels, attributes with more levels exert a larger influence on the random forest, so the feature importances it produces on such data are not reliable. (Similar to the point above.)
  • Random forests can overfit on some noisy classification and regression problems.

https://zhuanlan.zhihu.com/p/27160995

  • https://en.wikipedia.org/wiki/Random_forest#cite_note-:3-37
  • https://medium.datadriveninvestor.com/random-forest-pros-and-cons-c1c42fb64f04

AdaBoost vs Gradient Boosting

  • https://datascience.stackexchange.com/questions/39193/adaboost-vs-gradient-boosting?newreg=7b9da28f7a7546349f2218c04dda8ac8
  • https://medium.com/analytics-vidhya/what-is-gradient-boosting-how-is-it-different-from-ada-boost-2d5ff5767cb2

The main differences are that Gradient Boosting is a generic algorithm to find approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function. Hence, Gradient Boosting is much more flexible.

AdaBoost:

  • Exponential loss
  • decision stumps (just split the data into two regions)
  • Each subsequent stump gives more weight to the samples that were incorrectly classified by the previous stump.

Gradient Boosting:

  • Squared loss
  • Trees are larger than stumps, but GBDT still restricts the size of each tree.
  • Each decision tree is built on the residuals of the previous model.

Explain the “gradient” part of the Gradient Boosting

Every tree is fit to the residuals of the model built so far.
For squared loss, those residuals are exactly the negative gradient of the loss with respect to the current predictions, so each new tree takes a gradient-descent step in function space.
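
To make the “gradient” concrete, here is a minimal hand-rolled sketch of gradient boosting with squared loss, assuming scikit-learn and NumPy are available (a real implementation would initialize with the mean rather than zero):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

lr, trees = 0.1, []
pred = np.zeros_like(y)
for _ in range(100):
    residual = y - pred                  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)         # take a small step in function space

print("training MSE:", np.mean((y - pred) ** 2))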

Gradient Boosting implementations

GradientBoostingClassifier

  • Early implementation of Gradient Boosting in sklearn
  • Typically slow on large datasets
  • The most important parameters are the number of estimators and the learning rate
  • Supports both binary & multi-class classification
  • Supports sparse data

HistGradientBoostingClassifier
  • Orders of magnitude faster than GradientBoostingClassifier on large datasets
  • Inspired by the LightGBM implementation
  • Histogram-based split finding in tree learning
  • Does not support sparse data
  • Supports both binary & multi-class classification
  • Natively supports categorical features
  • Does not support monotonicity constraints

XGBoost
  • One of the most popular implementations of gradient boosting
  • Fast approximate split finding based on histograms
  • Supports GPU training, sparse data & missing values
  • Adds L1 and L2 penalties on leaf weights
  • Monotonicity & feature interaction constraints
  • Works well with pipelines in sklearn due to a compatible interface
  • Does not support categorical variables natively

LightGBM
  • Supports GPU training, sparse data & missing values
  • Histogram-based node splitting
  • Uses Gradient-based One-Side Sampling (GOSS) for tree learning
  • Exclusive feature bundling to handle sparse features
  • Generally faster than XGBoost on CPUs
  • Supports distributed training on frameworks like Ray, Spark, Dask, etc.
  • CLI version

  • What makes LightGBM lightning fast?
    https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e

CatBoost
  • Optimized for categorical features
  • Uses target encoding to handle categorical features
  • Uses ordered boosting to build “symmetric” trees
  • Overfitting detector
  • Tooling support (Jupyter notebook & TensorBoard visualization)
  • Supports GPU training, sparse data & missing values
  • Monotonicity constraints

Bootstrapping

  • https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Disadvantages

  • Several training samples (of the same size) are created by sampling the dataset with replacement.
  • Each training sample is then used to train a model.
  • The outputs from each of the models are averaged to make the final prediction.
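
A minimal sketch of this procedure using scikit-learn's BaggingRegressor (assumed available):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions are averaged
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                       bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))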

Disadvantage:

  • Time-consuming.
  • Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees, and the result depends on how representative the sample is. Because sampling with replacement duplicates some points, the distribution of a bootstrap training set differs from that of the original data set, which introduces estimation bias.

Gradient Descent(Logistic Classification)

  • All steps of calculation.

  • Python Code

import math

def sigmoid(x, a, b):
    # Logistic function applied to the linear score a*x + b
    return 1.0 / (1.0 + math.exp(-a * x - b))

def gradient_descent_step(x, y, a, b, lr=0.1):
    # One gradient-descent update on the logistic (cross-entropy) loss for a
    # single sample (x, y): the gradient w.r.t. a is (p - y) * x and w.r.t. b
    # is (p - y), where p = sigmoid(x, a, b).
    p = sigmoid(x, a, b)
    a = a - lr * (p - y) * x
    b = b - lr * (p - y)
    return a, b

TF-IDF

  • https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d

  • TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection or corpus.

  • Term Frequency (TF): It is a measure of the frequency of a word (w) in a document (d). TF is defined as the ratio of a word’s occurrence in a document to the total number of words in a document. The denominator term in the formula is to normalize since all the corpus documents are of different lengths.

  • Inverse Document Frequency (IDF): It is the measure of the importance of a word. Term frequency (TF) does not consider the importance of words: some words, such as ‘of’, ‘and’, etc., can be the most frequent yet carry little significance. IDF weights each word by how rare it is across the corpus D.

  • Term Frequency — Inverse Document Frequency (TFIDF)

  • TFIDF gives more weight to words that are rare in the corpus (across all the documents).

  • TFIDF gives more importance to words that are frequent within a document.

  • Advantages: Bag of words (BoW) converts text into a feature vector by counting the occurrences of words in a document, but it does not consider the importance of words. TFIDF builds on the BoW model and adds information about which words in a document are more and less relevant. The importance of a word in a text is of great significance in information retrieval.

  • Disadvantages of TFIDF: it is unable to capture semantics. For example, ‘funny’ and ‘humorous’ are synonyms, but TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the vocabulary is vast.
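
A minimal sketch with scikit-learn's TfidfVectorizer (assumed available) on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # sparse matrix: documents x vocabulary
print(vec.get_feature_names_out())
print(X.toarray().round(2))      # rare words get higher weights than "the"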

Cross validation, test set, and final model

4995 Lecture 2

  1. Development data: Training data + Validation data
  2. Testing data

Fit preprocessing (e.g., scalers or encoders) on the training data; only transform the validation and testing data.

Train final model on Development data.
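
A minimal sketch of this workflow with scikit-learn (assumed available); the pipeline guarantees "fit on training, transform on validation/test" inside each cross-validation fold:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=5)
search.fit(X_dev, y_dev)             # refits the best model on all development data

print(search.score(X_test, y_test))  # final, untouched test set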

Skewed data and Log

Skew degrades a model’s ability (especially for regression-based models) to describe typical cases, because it must also accommodate rare cases with extreme values; e.g., a model fit to right-skewed data will predict better on data points with lower values than on those with higher values. Skewed data also works poorly with many statistical methods. Tree-based models, however, are largely unaffected.

Log transformation

  • https://reinec.medium.com/my-notes-handling-skewed-data-5984de303725
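
A minimal sketch with NumPy and SciPy (both assumed available):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # right-skewed toy data

print(round(float(skew(x)), 2))            # strongly positive skew
print(round(float(skew(np.log1p(x))), 2))  # much closer to 0; log1p also handles zeros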

Multicollinearity

  • https://martechcareer.com/advice/2020/09/14-mmm
  • https://towardsdatascience.com/multicollinearity-why-is-it-bad-5335030651bf
  • https://zhuanlan.zhihu.com/p/72722146#:~:text=%E5%A4%9A%E9%87%8D%E5%85%B1%E7%BA%BF%E6%80%A7%E9%97%AE%E9%A2%98%E5%B0%B1%E6%98%AF,%E9%97%B4%E7%9C%9F%E5%AE%9E%E7%9A%84%E5%85%B3%E7%B3%BB%E4%BA%86%E3%80%82

Linear model:
Multicollinearity greatly harms interpretability for a linear model, but only slightly hurts predictive performance.

We can understand this from the meaning of the coefficients (weights).
The coefficient (weight) of a feature means: if we fix the values of the other features and increase this feature by 1, the prediction changes by 1 × coefficient. However, if there is multicollinearity in our data, we cannot assume that one feature changes while the other features stay fixed. The coefficients therefore become very confusing and essentially unexplainable.

  1. In the case of linear regression, multicollinearity leads to weight/coefficient estimates with very high variance (since (XᵀX)⁻¹ becomes huge), so it is difficult to say anything precise about the coefficients. In some cases it hurts model performance as well as model interpretability.

  2. Yes, one-hot encoding leads to features that have dependencies (i.e., multicollinearity). One way to reduce it would be to drop the variables that you believe would not impact your model’s interpretability. Another way is to use the Variance Inflation Factor to drop features that have high multicollinearity:

https://www.statisticshowto.com/variance-inflation-factor/

In this case, you could make this part of your model selection process, where you iteratively drop features using VIF, then estimating the performance of the model on a validation set and finally choosing the model with the highest performance.
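
A minimal sketch of VIF with statsmodels (assumed installed) on toy data containing one nearly collinear pair:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(0, 0.1, size=200)    # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1, x2, x3])  # include an intercept column

for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
# Rule of thumb: VIF > 5-10 flags problematic multicollinearity,
# so x1 and x2 will stand out here.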

Tree model:

  1. Yes, multicollinearity can impact both model performance and interpretability in decision trees. In your example, under a different split X2 could have a higher information gain than X1 and hence end up being the more important feature, which makes the model harder to interpret. Additionally, during prediction this model can underperform, especially on samples where the dependency between X1 and X2 does not hold.

How to fix multicollinearity?

  1. Do correlation analysis, and manually delete one of any two highly correlated features.
  2. PCA (dimensionality reduction).
  3. Ridge regression.
  4. Add more samples.
  5. Statistical methods, like the Variance Inflation Factor (VIF).

Baseline model

  • https://towardsdatascience.com/how-to-build-a-baseline-model-be6ce42389fc

  • https://datascience.stackexchange.com/questions/30912/what-does-baseline-mean-in-the-context-of-machine-learning

  • A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset.

statistics and randomness

  • Classification baselines:

    • “stratified”: generates predictions by respecting the training set’s class distribution.
    • “most_frequent”: always predicts the most frequent label in the training set
    • “prior”: always predicts the class that maximizes the class prior.
    • “uniform”: generates predictions uniformly at random.
    • “constant”: always predicts a constant label that is provided by the user.
  • Regression baselines:

    • “median”: always predicts the median of the training set
    • “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
    • “constant”: always predicts a constant value that is provided by the user.
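
The classification strategies above match scikit-learn's DummyClassifier (and the regression ones its DummyRegressor); a minimal classification sketch, assuming scikit-learn is available:

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for strategy in ["stratified", "most_frequent", "prior", "uniform"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
    print(strategy, round(baseline.score(X_te, y_te), 3))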

Machine Learning

  • Baseline model should be simple. Simple models are less likely to overfit. If you see that your baseline is already overfitting, it makes no sense to go for more complex modeling, as the complexity will kill the performance.

  • Baseline model should be interpretable. Interpretability will help you to get a better understanding of your data and will show you a direction for the feature engineering.

Interpretable vs Explainable Machine Learning

https://towardsdatascience.com/interperable-vs-explainable-machine-learning-1fa525e12f48

  • An interpretable model can be understood by a human without any other aids/techniques. We can understand how these models make predictions by looking only at the model summary/parameters; an interpretable model provides its own explanation. Examples: decision trees and linear regression.
  • An explainable model does not provide its own explanation. On their own, these models are too complicated to be understood by humans and they require additional techniques, like LIME and SHAP, to understand how they make predictions.

Imbalanced data

  • https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5

how to deal with imbalanced data?

  • Oversampling (SMOTE)
  • Undersampling
  • Metrics: ROC curve, F1 score
  • Change the classification threshold (the default is typically 0.5)
  • Increase the weights on minority samples
  • Adjust the objective function

SMOTE

  • SMOTE works by utilizing a k-nearest-neighbor algorithm to create synthetic data. It first chooses a random point from the minority class and finds that point’s k nearest minority-class neighbors. A synthetic point is then created by interpolating between the random point and one of its randomly selected neighbors.
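
A minimal sketch using the imbalanced-learn package (assumed installed):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # minority class oversampled to parity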

Evaluation Metrics

  • Threshold-based metrics

    • Classification Accuracy
    • Precision, Recall & F1-score

  • Ranking-based metrics

    • Average Precision (AP)
    • Area Under Curve (AUC)


Accuracy:

  • Classification accuracy can be misleading in the case of imbalanced datasets
  • Accuracy Paradox

Precision, Recall & F1-score

Normalization

  • Normalize by the true number of labels (row-normalizing the confusion matrix gives per-class recall).
  • Normalize by the predicted number of labels (column-normalizing gives per-class precision).

By convention, the minority class is treated as the positive class.


Macro-averaging is a more balanced metric across classes.

Choosing the Right Metric

  • Problem-specific
  • Balanced accuracy is better than accuracy (most of the time)
  • Consider the cost associated with misclassification:
    • Predicting that an individual has no cancer when he/she has cancer (a false negative) is far more costly than the other way round.
    • Predicting an email as spam when it is not (a false positive) has a higher cost than predicting an email as not spam.
  • Choose recall when the cost of false negatives (Type II errors) is high.
  • Choose precision when the cost of false positives (Type I errors) is high.

Precision-Recall (PR) Curve

  • A precision-recall curve shows the relationship between precision and recall at every cut-off point.
  • It visualizes the effect of the selected threshold on performance.

Receiver Operating Curve (ROC)

  • Another useful tool to visualize the performance of a classification model

  • ROC depicts the relationship between False Positive Rate (FPR) and True Positive Rate/Recall (TPR)

  • FPR = FP / (TN + FP)

Area Under ROC (AUROC)

  • Area Under ROC (AUROC) provides an aggregate measure of model performance across all possible classification thresholds.

  • AUROC varies between 0 and 1; a model that predicts randomly (or predicts a constant) has an AUROC of 0.5.
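
A minimal sketch computing both ranking-based metrics with scikit-learn (assumed available):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # ranking metrics need scores, not labels

print("AP:   ", round(average_precision_score(y_te, scores), 3))
print("AUROC:", round(roc_auc_score(y_te, scores), 3))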

Type I vs Type II Error

  • https://www.scribbr.com/statistics/type-i-and-type-ii-errors/#:~:text=In%20statistics%2C%20a%20Type%20I%20error%20means%20rejecting%20the%20null,hypothesis%20when%20it’s%20actually%20false.&text=The%20risk%20of%20making%20a%20Type%20II%20error%20is%20inversely,statistical%20power%20of%20a%20test

Tradeoff between Type I and Type II errors:

  • Setting a lower significance level decreases a Type I error risk, but increases a Type II error risk.

  • Increasing the power of a test decreases a Type II error risk, but increases a Type I error risk.

  • The null hypothesis distribution shows all possible results you’d obtain if the null hypothesis is true. The correct conclusion for any point on this distribution means not rejecting the null hypothesis.

  • The alternative hypothesis distribution shows all possible results you’d obtain if the alternative hypothesis is true. The correct conclusion for any point on this distribution means rejecting the null hypothesis.

Type I and Type II errors occur where these two distributions overlap. In the usual diagram, one shaded area represents alpha, the Type I error rate, and the other represents beta, the Type II error rate.

By setting the Type I error rate, you indirectly influence the size of the Type II error rate as well.

  • Is a Type I or Type II error worse?

For statisticians, a Type I error is usually worse. In practical terms, however, either type of error could be worse depending on your research context.

A Type I error means mistakenly going against the main statistical assumption of a null hypothesis. This may lead to new policies, practices or treatments that are inadequate or a waste of resources.

In contrast, a Type II error means failing to reject a null hypothesis that is actually false. It may only result in missed opportunities to innovate, but these can also have important practical consequences.

Example:

Consequences of a Type I error
Based on the incorrect conclusion that the new drug intervention is effective, over a million patients are prescribed the medication, despite risks of severe side effects and inadequate research on the outcomes. The consequences of this Type I error also mean that other treatment options are rejected in favor of this intervention.

Consequences of a Type II error
If a Type II error is made, the drug intervention is considered ineffective when it can actually improve symptoms of the disease. This means that a medication with important clinical significance doesn’t reach a large number of patients who could tangibly benefit from it.

Example:

You decide to get tested for COVID-19 based on mild symptoms. There are two errors that could potentially occur:

Type I error (false positive): the test result says you have coronavirus, but you actually don’t.

Type II error (false negative): the test result says you don’t have coronavirus, but you actually do.

Deep Learning

What is the difference between the Adam optimizer and SGD?

Initializing all the weights with all (constant) zero

  • https://www.deeplearning.ai/ai-notes/initialization/#:~:text=Initializing%20all%20the%20weights%20with,the%20same%20features%20during%20training.&text=Thus%2C%20both%20neurons%20will%20evolve,neurons%20from%20learning%20different%20things.

Initializing all the weights with zeros leads the neurons to learn the same features during training.
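
A minimal NumPy sketch of the symmetry problem on a toy two-unit network (the setup is illustrative): with identical weights, both hidden units receive identical gradients and therefore can never learn different features.

import numpy as np

x = np.array([1.0, 2.0])   # one input sample
W1 = np.zeros((2, 2))      # hidden layer: both rows (units) identical
w2 = np.ones(2)            # output layer
y = 1.0

h = np.tanh(W1 @ x)        # both hidden activations are equal
err = (w2 @ h) - y
grad_W1 = np.outer(w2 * (1 - h ** 2) * err, x)  # backprop through tanh
print(grad_W1)             # both rows identical -> units stay identical forever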

Softmax

  • Softmax is used in multi-class classification.
  • The softmax function turns numbers into probabilities. The sum of these probabilities is one, and each probability represents a potential outcome.
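
A minimal, numerically stable NumPy sketch (NumPy assumed available):

import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; largest logit wins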

Activation function?

Add non-linearity into a neural network.

  • Sigmoid: computationally expensive, causes the vanishing gradient problem, and is not zero-centred. It is generally used for binary classification problems.
  • Softmax: a more generalised form of the sigmoid, used in multi-class classification problems. Like the sigmoid, it produces values in the range 0–1, so it is used as the final layer in classification models.
  • Tanh (hyperbolic tangent): compared to the sigmoid, it solves just one problem: it is zero-centred.
  • ReLU (Rectified Linear Unit): generally a better activation function than sigmoid and tanh because gradient descent converges faster with it, and it largely avoids the vanishing gradient problem (though units can “die” and get stuck at zero).

https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253
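
For reference, the activation functions above in a minimal NumPy sketch (NumPy assumed available):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # range (0, 1), not zero-centred

def tanh(z):
    return np.tanh(z)            # range (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)      # cheap; gradient is 1 for z > 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z))
print(tanh(z))
print(relu(z))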

Vanishing/Exploding Gradient Problem

  • In a network of n hidden layers, n derivatives are multiplied together during backpropagation. If the derivatives are large, the gradient increases exponentially as we propagate back through the model until it eventually explodes; this is the exploding gradient problem.
  • Alternatively, if the derivatives are small, the gradient decreases exponentially as we propagate back through the model until it eventually vanishes; this is the vanishing gradient problem.
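
A minimal NumPy sketch of the vanishing case: the sigmoid's derivative is at most 0.25, and backpropagation multiplies one such factor per layer (the single pre-activation value used here is illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

grad = 1.0
a = 0.5                    # an illustrative pre-activation value at every layer
for layer in range(30):
    s = sigmoid(a)
    grad *= s * (1 - s)    # sigmoid'(a) <= 0.25
print(grad)                # ~1e-19: the gradient has effectively vanished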

How to avoid it?

https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11#:~:text=In%20a%20network%20of%20n,the%20problem%20of%20exploding%20gradient%20.

Optimizer

https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6

how to select optimizer?

Feature Engineering

  • https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114

Features with High cardinality

  • https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b#:~:text=A%20categorical%20feature%20is%20said,absence)%20in%20the%20categorical%20variable.

  • https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

BinaryEncoding

https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding
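
A minimal sketch using the category_encoders package (assumed installed); each category's integer id is spread across ~log2(n) binary columns, a middle ground between label encoding and one-hot:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY", "Austin"]})

encoded = ce.BinaryEncoder(cols=["city"]).fit_transform(df)
print(encoded)  # far fewer columns than one-hot for high-cardinality features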


Reduce overfitting in NN model?

https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a

https://towardsdatascience.com/exploit-your-hyperparameters-batch-size-and-learning-rate-as-regularization-9094c1c99b55

https://www.quora.com/Why-does-decreasing-the-learning-rate-also-increases-over-fitting-rate-in-a-neural-network

https://www.1point3acres.com/bbs/thread-660530-1-1.html

high-cardinality


Cluster:
https://www.analyticsvidhya.com/blog/2017/02/test-data-scientist-clustering/
