Table of Contents

  • Covariance and correlation
  • How can you select k for k means?
  • Naive Bayes
    • Why is Naive Bayes “naive”?
    • Naive Bayes vs. RF
  • Q: What is the difference between online and batch learning (offline learning)?
  • How to identify outliers?
  • Reduce overfitting
  • Q: What are the feature selection methods used to select the right variables?
  • Q: Why is mean square error a bad measure of model performance? What would you suggest instead?
  • PCA
  • Linear regression
    • Assumptions
    • What Happens When We Break Some of These Assumptions? (And What Can We Do About It?)
    • regularization
    • drawback of linear model
  • Tree
    • Prevent Overfitting
  • More features than data points
    • More features than data points in linear regression?
  • SVM
    • What are the support vectors in SVM?
    • Primal and Dual (features vs. samples)
    • SVC and LinearSVC
  • Bagging, Boosting, Stacking
    • Linear regression & Boosting
    • When would you use random forests vs. SVM and why?
  • Random Forest
    • Explain what the bootstrap sampling method is and give an example of when it’s used.
    • Pros and cons
  • AdaBoost vs Gradient Boosting
    • Explain the "gradient" part of the Gradient Boosting
    • Gradient Boosting implementations
    • Bootstrapping
  • Gradient Descent(Logistic Classification)
  • TF-IDF
  • Cross validation, test set, and final model
  • Skewed data and Log
  • Multicollinearity
    • Baseline model
    • Interpretable vs Explainable Machine Learning
  • Imbalanced data
    • how to deal with imbalanced data?
    • SMOTE
    • Evaluation Metrics
    • Type I vs Type II Error
  • Deep Learning
  • Initializing all the weights with all (constant) zero
  • Softmax
  • Activation function?
  • Vanishing/Exploding Gradient Problem
  • Optimizer
  • Feature Engineering
  • Features with High cardinality
  • BinaryEncoding

Covariance and correlation

  • Both covariance and correlation measure the linear relationship and the dependency between two variables.

  • Correlation values are standardized. (Pearson correlation coefficient)

  • Covariance values are not standardized.

  • Correlation only measures the linear relationship between two variables.

  • If the correlation is zero, the two variables have no linear relationship, but they may still have a non-linear relationship.
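
A minimal NumPy sketch (NumPy assumed available) illustrating these points, including a correlation near zero despite a perfect non-linear relationship:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(0, 0.5, size=1000)

print(np.cov(x, y)[0, 1])            # covariance: scale-dependent, unbounded
print(np.corrcoef(x, y)[0, 1])       # correlation: standardized to [-1, 1]
print(np.corrcoef(x, x ** 2)[0, 1])  # ~0, yet y = x^2 is perfectly (non-linearly) dependent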

Basic statistics and machine learning concepts.
Interview tip: avoid introducing extra concepts that invite follow-up questions from the interviewer; but if you know an introduced concept very well, you can deliberately steer the interviewer toward it.


https://towardsdatascience.com/120-data-scientist-interview-questions-and-answers-you-should-know-in-2021-b2faf7de8f3e

How can you select k for k means?

You can use the elbow method, which is a popular method used to determine the optimal value of k. Essentially, what you do is plot the squared error for each value of k on a graph (value of k on the x-axis and squared error on the y-axis). Once the graph is made, the point where the distortion declines the most is the elbow point.

https://medium.com/analytics-vidhya/elbow-method-of-k-means-clustering-algorithm-a0c916adc540
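
A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available; make_blobs provides toy data and inertia_ is the within-cluster sum of squared errors:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared errors

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("sum of squared errors")
plt.show()  # the elbow is where the curve bends most sharply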

Naive Bayes

Why is Naive Bayes “naive”?

Naive Bayes is naive because it makes a strong assumption: the features are assumed to be conditionally independent of one another given the class, which is almost never true in practice.

Naive Bayes vs. RF

Naive Bayes is better in the sense that it is easy to train and understand the process and results. A random forest can seem like a black box. Therefore, a Naive Bayes algorithm may be better in terms of implementation and understanding.

However, in terms of performance, a random forest is typically stronger because it is an ensemble technique.

Q: What is the difference between online and batch learning (offline learning)?

  • In offline learning, the whole training data must be available at the time of model training. Only when training is completed can the model be used for predicting.
  • In contrast, online algorithms process data sequentially. They produce a model and put it in operation without having the complete training dataset available at the beginning. The model is continuously updated during operation as more training data arrives.
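
A minimal sketch of online learning with scikit-learn's SGDClassifier (assumed available), which updates the model chunk by chunk via partial_fit (older sklearn versions spell the loss "log" rather than "log_loss"):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y)                      # must be declared on the first call
for start in range(0, len(X), 1000):        # pretend data arrives in chunks
    clf.partial_fit(X[start:start + 1000], y[start:start + 1000], classes=classes)

print(clf.score(X, y))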

How to identify outliers?

There are a couple of ways to identify outliers:

  • We can use visualization tools such as box plots, scatter plots, or histograms to spot outliers directly from the distribution of a feature.
  • Z-score/standard deviations: since 99.7% of normally distributed data lie within three standard deviations of the mean, we can calculate the size of one standard deviation, multiply it by 3, and identify the data points outside of this range. Equivalently, we can calculate the z-score of a given point; if its absolute value exceeds 3, the point is an outlier.
    Note that there are a few caveats to this method: the data must be normally distributed, it is not applicable to small data sets, and the presence of too many outliers can throw off the z-scores.
  • Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR; this comes to approximately 2.698 standard deviations. Both rules are sketched below.
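
A minimal NumPy sketch of both rules (NumPy assumed available) on toy data with one injected outlier:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 1000), [8.0]])  # one injected outlier

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])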

Reduce overfitting

  • Cross-validation
  • Regularization
  • Reduce the number of features
  • Ensemble Learning Techniques

Q: What are the feature selection methods used to select the right variables?

There are two types of methods for feature selection: filter methods and wrapper methods.

Filter methods include the following:

  • Linear discrimination analysis
  • ANOVA
  • Chi-Square (We aim to select the features which are highly dependent on the target.)

Wrapper methods: evaluate models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

include the following:

  • Forward Selection: starts with one predictor and adds more iteratively. At each subsequent iteration, the best of the remaining original predictors are added based on performance criteria.
  • Backward Selection: We test all the features and start removing them to see what works better

Business sense

  • based on what features makes sense in business. Especially when we need interpretability of the model.

https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
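
A minimal sketch of one filter method (chi-square) and one wrapper method (forward selection), assuming scikit-learn is available:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features most dependent on the target by chi-square score
X_filtered = SelectKBest(chi2, k=2).fit_transform(X, y)

# Wrapper: greedily add the feature that most improves cross-validated accuracy
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward")
X_wrapped = sfs.fit_transform(X, y)
print(X_filtered.shape, X_wrapped.shape)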

Q: Why is mean square error a bad measure of model performance? What would you suggest instead?

Mean Squared Error (MSE) gives a relatively high weight to large errors, so it tends to put too much emphasis on large deviations (e.g., outliers). A more robust alternative is MAE (mean absolute error).


PCA

Principal component analysis (PCA) projects data to a lower-dimensional linear subspace, in a way that preserves the axes of highest variance in the data.

An important pre-processing step before applying PCA is to standardize features to have zero mean and unit variance.
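
A minimal sketch, assuming scikit-learn is available: standardize first, then project onto the top principal components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipe.fit_transform(X)  # zero-mean, unit-variance features -> top-2 PCs
print(pipe.named_steps["pca"].explained_variance_ratio_)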

Linear regression

Assumptions

  • Linearity: Linear relationship between x & y
  • Independence: samples are i.i.d (Independent and identically distributed)
  • No perfect collinearity: The independent variables do not share a perfect, linear relationship.
  • Zero conditional mean: The error term is unrelated to our independent variables.
  • Homoscedasticity: the variance of the error term is constant.
  • Normality: the residuals follow a normal distribution

What Happens When We Break Some of These Assumptions? (And What Can We Do About It?)

  • https://towardsdatascience.com/what-happens-when-you-break-the-assumptions-of-linear-regression-f78f2fe90f3a
  • https://www.quality-control-plan.com/StatGuide/linreg_ass_viol.htm#Lack%20of%20independence

1. Non-linear

  • Our parameter estimates would be biased, and our model would make poor predictions.

How to determine linear/non-linear relationship

  • You could create a scatter plot between the two variables and see if the relationship between them is linear or non-linear.
  • You can then compare the performance between a linear regression and a non-linear regression, and choose the function that performs best.
  • Fit the data with a linear regression, and then plot the residuals. If there is no obvious pattern in the residual plot, then the linear regression was likely the correct model. However, if the residuals look non-random, then perhaps a non-linear regression would be the better choice; see the sketch below.
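
A minimal sketch of the residual-plot check, assuming scikit-learn and matplotlib are available; the toy data is deliberately non-linear so the pattern shows up:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 200)  # truly non-linear relationship

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

plt.scatter(model.predict(x), residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()  # a U-shaped pattern suggests the linear model is misspecified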

2. Our sample is non-random

Thus, it is good practice to do some EDA prior to building a regression model to confirm that the two groups are not drastically different

For serially correlated Y values, the estimates of the slope and intercept will be unbiased, but the estimates of their variances will not be reliable.

If you are unsure whether your Y values are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.


3. We Have Perfect Collinearity

  • Perfect (or near-perfect) collinearity does not impact prediction, but it can impact inference.

How to solve it

  • drop one of the variables

4. Our Error Term is Correlated with One of Our Independent Variables

  • This occurs when our regression model differs from the true model; there is typically some omitted-variable bias.

How to solve it

  • First, if you know the variables that should be included in the true model, then you can add these variables to the model you are building. This is the best solution; however, it is also unrealistic because we can never truly know what variables are in the true model.
  • The second solution is to conduct a Randomized Control Trial (RCT). In an RCT, the researcher randomly allocates participants to the treatment group or the control group. Because the treatment is given randomly, the relationship between the error term and the independent variables is equal to zero.

5. violating the homoscedasticity

https://stats.stackexchange.com/questions/22800/what-are-the-dangers-of-violating-the-homoscedasticity-assumption-for-linear-reg

  • does not introduce a bias in the estimates of your coefficients.
  • the standard errors of the coefficients are wrong. This means that one cannot compute any t-statistics and p-values and consequently hypothesis testing is not possible.

How to solve it

  • use robust standard errors

6. Our Errors are Non-Normal

  • the results of our hypothesis tests and confidence intervals will be inaccurate.

How to solve it

  • transform your target variable so that it becomes normal. This can have the effect of making the errors normal, as well. The log transform and square root transform are most common.

regularization

  • Lasso regression: L1-norm, Feature selection
  • Ridge regression: L2-norm
  • Elastic-net regression: L1-norm, L2-norm, Feature selection
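
A minimal sketch of the three penalties with scikit-learn (assumed available):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)    # L1: drives some coefficients to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2: shrinks coefficients toward 0
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print((lasso.coef_ == 0).sum(), "coefficients zeroed by the lasso")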

drawback of linear model

  • A linear model holds some strong assumptions that may not be true in application. It assumes a linear relationship, multivariate normality, no or little multicollinearity, no auto-correlation, and homoscedasticity.
  • A linear model can’t be used for discrete or binary outcomes.
  • You can’t vary the model flexibility of a linear model.

What is flexibility in machine learning?

  • You can think of “Flexibility” of a model as the model’s “curvy-ness” when graphing the model equation
  • Greater flexibility corresponds to lower bias but higher variance. It allows fitting a wider variety of functions, but increases the risk of overfitting.

https://stackoverflow.com/questions/26437372/what-is-the-definition-of-flexibility-of-a-method-in-machine-learning
https://stats.stackexchange.com/questions/338009/how-is-model-flexibility-measured-or-quantified-what-units-is-it-measured-in/338896

Tree

Prevent Overfitting

  • Pruning

    • Reduced error

      • Starting at the leaves, each node is replaced with its most popular class.
      • If the validation metric is not negatively affected, then the change is kept, else it is reverted.
      • Reduced error pruning has the advantage of speed and simplicity.
    • Cost complexity

https://stats.stackexchange.com/questions/193538/how-to-choose-alpha-in-cost-complexity-pruning


  • Early stopping

    • Maximum depth
    • Maximum leaf nodes
    • Minimum samples split
    • Minimum impurity decrease
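
A minimal sketch of cost-complexity pruning with scikit-learn (assumed available); the same DecisionTreeClassifier also exposes the early-stopping parameters listed above (max_depth, max_leaf_nodes, min_samples_split, min_impurity_decrease):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the tree's own cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::10]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"alpha={alpha:.4f}  cv accuracy={score:.3f}")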

More features than data points

  • Feature selection involves selecting a subset of predictors to use as input to predictive models.

Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.g., correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.g., RFE, Recursive Feature Elimination).


  • Projection Methods (Dimension reduction) create a lower-dimensionality representation of samples that preserves the relationships observed in the data.

They are often used for visualization, although the dimensionality reduction nature of the techniques may also make them useful as a data transform to reduce the number of predictors. This might include techniques from linear algebra, such as SVD and PCA.


  • Regularized Algorithms. Standard machine learning algorithms may be adapted to use regularization during the training process.

This will penalize models based on the number of features used or weighting of features, encouraging the model to perform well and minimize the number of predictors used in the model.

This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.g., regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO.


There is no best method and it is recommended to use controlled experiments to test a suite of different methods.

These remedies are essentially the same for all models.


More features than data points in linear regression?


The remedies are essentially the same:

  • Regularization
  • Dimension Reduction
  • Subset Selection

  • https://machinelearningmastery.com/how-to-handle-big-p-little-n-p-n-in-machine-learning/
  • https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e
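
A minimal sketch of regularization in the p >> n setting, assuming scikit-learn is available; the lasso recovers a sparse model even with 10x more features than samples:

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 50, 500                    # 10x more features than samples
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(0, 0.1, n)  # only 5 features matter

model = LassoCV(cv=5).fit(X, y)
print("non-zero coefficients:", (model.coef_ != 0).sum())  # close to 5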

SVM

What are the support vectors in SVM?

The support vectors are the data points that lie on the margin boundaries (or, in the soft-margin case, inside the margin or on the wrong side of it); they alone determine the position of the maximum-margin hyperplane.

Primal and Dual (features vs. samples)

Would training the soft-margin SVMs either using primal or the dual formulation yield the same results on a test dataset? Why?

Yes, because the optimization problem is convex, strong duality holds, and the primal and dual formulations yield the same solution.

  • The primal form is preferred when we don’t need to apply the kernel trick and the dataset is large but the dimension of each data point is small.
  • The dual form is preferred when the data has a huge dimension and we need to apply the kernel trick.

Note that this is not about having more features than samples; if there are more features than samples, the main remedies are still the ones described in the previous section.


  • https://medium.com/geekculture/the-optimization-behind-svm-primal-and-dual-form-5cca1b052f45

SVC and LinearSVC

  • SVC: the implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.
  • LinearSVC: similar to SVC with kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and scales better to large numbers of samples.

https://stackoverflow.com/questions/45384185/what-is-the-difference-between-linearsvc-and-svckernel-linear
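
A minimal sketch of the two scikit-learn classes (assumed available); both fit a linear decision boundary, but via different solvers:

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

svc = SVC(kernel="linear").fit(X, y)        # libsvm: fit time ~quadratic in n_samples
lin = LinearSVC(max_iter=10000).fit(X, y)   # liblinear: scales to much larger n
print(svc.score(X, y), lin.score(X, y))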

Bagging, Boosting, Stacking

  • https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

  • https://zhuanlan.zhihu.com/p/36822575

  • The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

  • Indeed, to be able to “solve” a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but not so many degrees of freedom that it suffers from high variance. This is the well-known bias-variance tradeoff.

  • Weak learners: High bias or high variance.

  • Bagging (reduces variance) often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them following some kind of deterministic averaging process.

  • Boosting (reduces bias) often considers homogeneous weak learners, learns them sequentially in a very adaptive way (each base model depends on the previous ones), and combines them following a deterministic strategy.

  • Stacking (reduces bias) often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models’ predictions.

Linear regression & Boosting

  • https://stats.stackexchange.com/questions/186966/gradient-boosting-for-linear-regression-why-does-it-not-work
  • https://stats.stackexchange.com/questions/230388/how-does-linear-base-learner-works-in-boosting-and-how-does-it-works-in-the-xgb

We can use linear regression as the base learner, but it is not common in practice:

  • The final model prediction has the same functional form as the full linear regression.
  • A possible defense of boosting in this situation could be the implicit regularization it provides. Possibly (I haven’t played with this) you could use the early stopping feature of a gradient booster, along with a cross validation, to stop short of the full linear regression. This would provide a regularization to your regression, and possibly help with overfitting. This is not particularly practical, as one has very efficient and well-understood options like ridge regression and the elastic net in this setting.
  • Boosting shines when there is no terse functional form around. Boosting decision trees lets the functional form of the regressor/classifier evolve slowly to fit the data, often resulting in complex shapes one could not have dreamed up by hand and eye. When a simple functional form is desired, boosting is not going to help you find it (or at least is probably a rather inefficient way to find it).
  • For example, the combination of n trees is still a tree, but this tree will be significantly bigger and complexity of such tree is impractical. On the other hand in case of linear functions complexity is exactly the same.

When would you use random forests vs. SVM and why?

There are a couple of reasons why a random forest is a better choice of an algorithm than a support vector machine:

  • Random forests allow you to determine feature importance; SVMs can’t do this directly.
  • Random forests are much quicker and simpler to build than an SVM.
  • For multi-class classification problems, SVMs require a one-vs-rest scheme, which is less scalable and more memory-intensive.

Random Forest

Explain what the bootstrap sampling method is and give an example of when it’s used.

Technically speaking, the bootstrap sampling method is a resampling method that uses random sampling with replacement.

It’s an essential part of the random forest algorithm, as well as other ensemble learning algorithms.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance.

https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
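
A minimal NumPy sketch of a single bootstrap sample (NumPy assumed available):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)

# A bootstrap sample: same size as the data, drawn with replacement
sample = rng.choice(data, size=len(data), replace=True)
print(sample)                  # duplicates are expected
print(np.unique(sample).size)  # on average ~63.2% of the points appear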

Pros and cons

Pros:

  • Random Forests can be used for both classification and regression tasks.
  • Random Forests work well with both categorical and numerical data. No scaling or transformation of variables is usually necessary.
  • Random Forests implicitly perform feature selection and generate uncorrelated decision trees. It does this by choosing a random set of features to build each decision tree. This also makes it a great model when you have to work with a high number of features in the data.
  • Random Forests are fairly robust to outliers, because splitting depends only on the ordering of feature values (effectively binning the variables).
  • Random Forests can handle linear and non-linear relationships well.
  • Random Forests generally provide high accuracy and balance the bias-variance trade-off well. Since the model averages predictions across the many decision trees it builds, it reduces the variance as well.
  • OOB for model selection

Cons:

  • Random Forests are not easily interpretable. They provide feature importances, but not the complete visibility into individual effects that linear regression coefficients offer.
  • Random Forests can be computationally intensive for large datasets.
  • Random forest is like a black box algorithm, you have very little control over what the model does.
  • In problems with multiple categorical variables, random forest may not be able to increase the accuracy of the base learner.
  • For data whose categorical attributes have different numbers of levels, attributes with more levels exert a larger influence on the random forest, so the feature importances it produces on such data are not reliable. (Similar to the point above.)
  • Random forests can overfit on some noisy classification and regression problems.

https://zhuanlan.zhihu.com/p/27160995

  • https://en.wikipedia.org/wiki/Random_forest#cite_note-:3-37
  • https://medium.datadriveninvestor.com/random-forest-pros-and-cons-c1c42fb64f04

AdaBoost vs Gradient Boosting

  • https://datascience.stackexchange.com/questions/39193/adaboost-vs-gradient-boosting?newreg=7b9da28f7a7546349f2218c04dda8ac8
  • https://medium.com/analytics-vidhya/what-is-gradient-boosting-how-is-it-different-from-ada-boost-2d5ff5767cb2

The main differences are that Gradient Boosting is a generic algorithm to find approximate solutions to the additive modeling problem, while AdaBoost can be seen as a special case with a particular loss function. Hence, Gradient Boosting is much more flexible.

AdaBoost:

  • Exponential loss
  • decision stumps (just split the data into two regions)
  • Each subsequent stump gives more weight to the samples that were incorrectly classified by the previous stump.

Gradient Boosting:

  • Squared loss
  • Trees are larger than stumps, but GBDT still restricts the size of each tree.
  • Each decision tree is built on the residuals of the previous model.

Explain the “gradient” part of the Gradient Boosting

Every tree is fit to the residuals of the model built so far.
For squared loss, those residuals are exactly the negative gradient of the loss with respect to the current predictions, so each new tree takes a gradient-descent step in function space.
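
To make the “gradient” concrete, here is a minimal hand-rolled sketch of gradient boosting with squared loss, assuming scikit-learn and NumPy are available (a real implementation would initialize with the mean rather than zero):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

lr, trees = 0.1, []
pred = np.zeros_like(y)
for _ in range(100):
    residual = y - pred                  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += lr * tree.predict(X)         # take a small step in function space

print("training MSE:", np.mean((y - pred) ** 2))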

Gradient Boosting implementations

GradientBoostingClassifier

  • Early implementation of Gradient Boosting in sklearn
  • Typically slow on large datasets
  • The most important parameters are the number of estimators and the learning rate
  • Supports both binary & multi-class classification
  • Supports sparse data

HistGradientBoostingClassifier
  • Orders of magnitude faster than GradientBoostingClassifier on large datasets
  • Inspired by the LightGBM implementation
  • Histogram-based split finding in tree learning
  • Does not support sparse data
  • Supports both binary & multi-class classification
  • Natively supports categorical features
  • Does not support monotonicity constraints

XGBoost
  • One of the most popular implementations of gradient boosting
  • Fast approximate split finding based on histograms
  • Supports GPU training, sparse data & missing values
  • Adds L1 and L2 penalties on leaf weights
  • Monotonicity & feature interaction constraints
  • Works well with pipelines in sklearn due to a compatible interface
  • Does not support categorical variables natively

LightGBM
  • Supports GPU training, sparse data & missing values
  • Histogram-based node splitting
  • Uses Gradient-based One-Side Sampling (GOSS) for tree learning
  • Exclusive feature bundling to handle sparse features
  • Generally faster than XGBoost on CPUs
  • Supports distributed training on frameworks like Ray, Spark, Dask, etc.
  • CLI version

  • What makes LightGBM lightning fast?
    https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e

CatBoost
  • Optimized for categorical features
  • Uses target encoding to handle categorical features
  • Uses ordered boosting to build “symmetric” trees
  • Overfitting detector
  • Tooling support (Jupyter notebook & TensorBoard visualization)
  • Supports GPU training, sparse data & missing values
  • Monotonicity constraints

Bootstrapping

  • https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Disadvantages

  • Several training samples (of the same size) are created by sampling the dataset with replacement.
  • Each training sample is then used to train a model.
  • The outputs from each of the models are averaged to make the final prediction.
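
A minimal sketch of this procedure using scikit-learn's BaggingRegressor (assumed available):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 50 trees, each trained on a bootstrap sample; predictions are averaged
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                       bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))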

Disadvantage:

  • Time-consuming.
  • Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees, and the result depends on how representative the sample is. Because sampling with replacement duplicates some points, the distribution of a bootstrap training set differs from that of the original data set, which introduces estimation bias.

Gradient Descent(Logistic Classification)

  • All steps of calculation.

  • Python Code

import math

def sigmoid(x, a, b):
    # Logistic function applied to the linear score a*x + b
    return 1.0 / (1.0 + math.exp(-a * x - b))

def gradient_descent_step(x, y, a, b, lr=0.1):
    # One gradient-descent update on the logistic (cross-entropy) loss for a
    # single sample (x, y): the gradient w.r.t. a is (p - y) * x and w.r.t. b
    # is (p - y), where p = sigmoid(x, a, b).
    p = sigmoid(x, a, b)
    a = a - lr * (p - y) * x
    b = b - lr * (p - y)
    return a, b

TF-IDF

  • https://towardsdatascience.com/text-vectorization-term-frequency-inverse-document-frequency-tfidf-5a3f9604da6d

  • TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection or corpus.

  • Term Frequency (TF): It is a measure of the frequency of a word (w) in a document (d). TF is defined as the ratio of a word’s occurrence in a document to the total number of words in a document. The denominator term in the formula is to normalize since all the corpus documents are of different lengths.

  • Inverse Document Frequency (IDF): It is the measure of the importance of a word. Term frequency (TF) does not consider the importance of words: some words, such as ‘of’, ‘and’, etc., can be the most frequent yet carry little significance. IDF weights each word by how rare it is across the corpus D.

  • Term Frequency — Inverse Document Frequency (TFIDF)

  • TFIDF gives more weight to words that are rare in the corpus (across all the documents).

  • TFIDF gives more importance to words that are frequent within a document.

  • Advantages: Bag of words (BoW) converts text into a feature vector by counting the occurrences of words in a document, but it does not consider the importance of words. TFIDF builds on the BoW model and adds information about which words in a document are more and less relevant. The importance of a word in a text is of great significance in information retrieval.

  • Disadvantages of TFIDF: it is unable to capture semantics. For example, ‘funny’ and ‘humorous’ are synonyms, but TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the vocabulary is vast.
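
A minimal sketch with scikit-learn's TfidfVectorizer (assumed available) on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)      # sparse matrix: documents x vocabulary
print(vec.get_feature_names_out())
print(X.toarray().round(2))      # rare words get higher weights than "the"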

Cross validation, test set, and final model

4995 Lecture 2

  1. Development data: Training data + Validation data
  2. Testing data

Fit preprocessing (e.g., scalers or encoders) on the training data; only transform the validation and testing data.

Train final model on Development data.
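
A minimal sketch of this workflow with scikit-learn (assumed available); the pipeline guarantees "fit on training, transform on validation/test" inside each cross-validation fold:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=5)
search.fit(X_dev, y_dev)             # refits the best model on all development data

print(search.score(X_test, y_test))  # final, untouched test set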

Skewed data and Log

Skew degrades a model’s ability (especially for regression-based models) to describe typical cases, because it must also accommodate rare cases with extreme values; e.g., a model fit to right-skewed data will predict better on data points with lower values than on those with higher values. Skewed data also works poorly with many statistical methods. Tree-based models, however, are largely unaffected.

Log transformation

  • https://reinec.medium.com/my-notes-handling-skewed-data-5984de303725
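
A minimal sketch with NumPy and SciPy (both assumed available):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # right-skewed toy data

print(round(float(skew(x)), 2))            # strongly positive skew
print(round(float(skew(np.log1p(x))), 2))  # much closer to 0; log1p also handles zeros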

Multicollinearity

  • https://martechcareer.com/advice/2020/09/14-mmm
  • https://towardsdatascience.com/multicollinearity-why-is-it-bad-5335030651bf
  • https://zhuanlan.zhihu.com/p/72722146#:~:text=%E5%A4%9A%E9%87%8D%E5%85%B1%E7%BA%BF%E6%80%A7%E9%97%AE%E9%A2%98%E5%B0%B1%E6%98%AF,%E9%97%B4%E7%9C%9F%E5%AE%9E%E7%9A%84%E5%85%B3%E7%B3%BB%E4%BA%86%E3%80%82

Linear model:
Multicollinearity greatly harms interpretability for a linear model, but only slightly hurts predictive performance.

We can understand this from the meaning of the coefficients (weights).
The coefficient (weight) of a feature means: if we fix the values of the other features and increase this feature by 1, the prediction changes by 1 × coefficient. However, if there is multicollinearity in our data, we cannot assume that one feature changes while the other features stay fixed. The coefficients therefore become very confusing and essentially unexplainable.

  1. In the case of linear regression, multicollinearity leads to weight/coefficient estimates with very high variance (since (XᵀX)⁻¹ becomes huge), so it is difficult to say anything precise about the coefficients. In some cases it hurts model performance as well as model interpretability.

  2. Yes, one-hot encoding leads to features that have dependencies (i.e., multicollinearity). One way to reduce it would be to drop the variables that you believe would not impact your model’s interpretability. Another way is to use the Variance Inflation Factor to drop features that have high multicollinearity:

https://www.statisticshowto.com/variance-inflation-factor/

In this case, you could make this part of your model selection process, where you iteratively drop features using VIF, then estimating the performance of the model on a validation set and finally choosing the model with the highest performance.
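
A minimal sketch of VIF with statsmodels (assumed installed) on toy data containing one nearly collinear pair:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(0, 0.1, size=200)    # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1, x2, x3])  # include an intercept column

for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
# Rule of thumb: VIF > 5-10 flags problematic multicollinearity,
# so x1 and x2 will stand out here.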

Tree model:

  1. Yes, multicollinearity can impact both model performance and interpretability in decision trees. In your example, under a different split X2 could have a higher information gain than X1 and hence end up being the more important feature, which makes the model harder to interpret. Additionally, during prediction this model can underperform, especially on samples where the dependency between X1 and X2 does not hold.

How to fix multicollinearity?

  1. Do correlation analysis, and manually delete one of any two highly correlated features.
  2. PCA (dimensionality reduction).
  3. Ridge regression.
  4. Add more samples.
  5. Statistical methods, like the Variance Inflation Factor (VIF).

Baseline model

  • https://towardsdatascience.com/how-to-build-a-baseline-model-be6ce42389fc

  • https://datascience.stackexchange.com/questions/30912/what-does-baseline-mean-in-the-context-of-machine-learning

  • A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset.

statistics and randomness

  • Classification baselines:

    • “stratified”: generates predictions by respecting the training set’s class distribution.
    • “most_frequent”: always predicts the most frequent label in the training set
    • “prior”: always predicts the class that maximizes the class prior.
    • “uniform”: generates predictions uniformly at random.
    • “constant”: always predicts a constant label that is provided by the user.
  • Regression baselines:

    • “median”: always predicts the median of the training set
    • “quantile”: always predicts a specified quantile of the training set, provided with the quantile parameter.
    • “constant”: always predicts a constant value that is provided by the user.
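
The classification strategies above match scikit-learn's DummyClassifier (and the regression ones its DummyRegressor); a minimal classification sketch, assuming scikit-learn is available:

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for strategy in ["stratified", "most_frequent", "prior", "uniform"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
    print(strategy, round(baseline.score(X_te, y_te), 3))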

Machine Learning

  • Baseline model should be simple. Simple models are less likely to overfit. If you see that your baseline is already overfitting, it makes no sense to go for more complex modeling, as the complexity will kill the performance.

  • Baseline model should be interpretable. Interpretability will help you to get a better understanding of your data and will show you a direction for the feature engineering.

Interpretable vs Explainable Machine Learning

https://towardsdatascience.com/interperable-vs-explainable-machine-learning-1fa525e12f48

  • An interpretable model can be understood by a human without any other aids/techniques. We can understand how these models make predictions by looking only at the model summary/parameters; an interpretable model provides its own explanation. Examples: decision trees and linear regression.
  • An explainable model does not provide its own explanation. On their own, these models are too complicated to be understood by humans and they require additional techniques, like LIME and SHAP, to understand how they make predictions.

Imbalanced data

  • https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5

how to deal with imbalanced data?

  • Oversampling (SMOTE)
  • Undersampling
  • Metrics: ROC curve, F1 score
  • Change the classification threshold (the default is typically 0.5)
  • Increase the weights on minority samples
  • Adjust the objective function

SMOTE

  • SMOTE works by utilizing a k-nearest-neighbor algorithm to create synthetic data. It first chooses a random point from the minority class and finds that point’s k nearest minority-class neighbors. A synthetic point is then created by interpolating between the random point and one of its randomly selected neighbors.
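
A minimal sketch using the imbalanced-learn package (assumed installed):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # minority class oversampled to parity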

Evaluation Metrics

  • Threshold-based metrics

    • Classification Accuracy
    • Precision, Recall & F1-score

  • Ranking-based metrics

    • Average Precision (AP)
    • Area Under Curve (AUC)


Accuracy:

  • Classification accuracy can be misleading in the case of imbalanced datasets
  • Accuracy Paradox

Precision, Recall & F1-score

Normalization

  • Normalize by the true number of labels (row-normalizing the confusion matrix gives per-class recall).
  • Normalize by the predicted number of labels (column-normalizing gives per-class precision).

By convention, the minority class is treated as the positive class.


Macro-averaging is a more balanced metric across classes.

Choosing the Right Metric

  • Problem-specific
  • Balanced accuracy is better than accuracy (most of the time)
  • Consider the cost associated with misclassification:
    • Predicting that an individual has no cancer when he/she has cancer (a false negative) is far more costly than the other way round.
    • Predicting an email as spam when it is not (a false positive) has a higher cost than predicting an email as not spam.
  • Choose recall when the cost of false negatives (Type II errors) is high.
  • Choose precision when the cost of false positives (Type I errors) is high.

Precision-Recall (PR) Curve

  • A precision-recall curve shows the relationship between precision and recall at every cut-off point.
  • It visualizes the effect of the selected threshold on performance.

Receiver Operating Curve (ROC)

  • Another useful tool to visualize the performance of a classification model

  • ROC depicts the relationship between False Positive Rate (FPR) and True Positive Rate/Recall (TPR)

  • FPR = FP / (TN + FP)

Area Under ROC (AUROC)

  • Area Under ROC (AUROC) provides an aggregate measure of model performance across all possible classification thresholds.

  • AUROC varies between 0 and 1; a model that predicts randomly (or predicts a constant) has an AUROC of 0.5.
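
A minimal sketch computing both ranking-based metrics with scikit-learn (assumed available):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # ranking metrics need scores, not labels

print("AP:   ", round(average_precision_score(y_te, scores), 3))
print("AUROC:", round(roc_auc_score(y_te, scores), 3))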

Type I vs Type II Error

  • https://www.scribbr.com/statistics/type-i-and-type-ii-errors/#:~:text=In%20statistics%2C%20a%20Type%20I%20error%20means%20rejecting%20the%20null,hypothesis%20when%20it’s%20actually%20false.&text=The%20risk%20of%20making%20a%20Type%20II%20error%20is%20inversely,statistical%20power%20of%20a%20test

Tradeoff between Type I and Type II errors:

  • Setting a lower significance level decreases a Type I error risk, but increases a Type II error risk.

  • Increasing the power of a test decreases a Type II error risk, but increases a Type I error risk.

  • The null hypothesis distribution shows all possible results you’d obtain if the null hypothesis is true. The correct conclusion for any point on this distribution means not rejecting the null hypothesis.

  • The alternative hypothesis distribution shows all possible results you’d obtain if the alternative hypothesis is true. The correct conclusion for any point on this distribution means rejecting the null hypothesis.

Type I and Type II errors occur where these two distributions overlap. In the usual diagram, one shaded area represents alpha, the Type I error rate, and the other represents beta, the Type II error rate.

By setting the Type I error rate, you indirectly influence the size of the Type II error rate as well.

  • Is a Type I or Type II error worse?

For statisticians, a Type I error is usually worse. In practical terms, however, either type of error could be worse depending on your research context.

A Type I error means mistakenly going against the main statistical assumption of a null hypothesis. This may lead to new policies, practices or treatments that are inadequate or a waste of resources.

In contrast, a Type II error means failing to reject a null hypothesis that is actually false. It may only result in missed opportunities to innovate, but these can also have important practical consequences.

Example:

Consequences of a Type I error
Based on the incorrect conclusion that the new drug intervention is effective, over a million patients are prescribed the medication, despite risks of severe side effects and inadequate research on the outcomes. The consequences of this Type I error also mean that other treatment options are rejected in favor of this intervention.

Consequences of a Type II error
If a Type II error is made, the drug intervention is considered ineffective when it can actually improve symptoms of the disease. This means that a medication with important clinical significance doesn’t reach a large number of patients who could tangibly benefit from it.

Example:

You decide to get tested for COVID-19 based on mild symptoms. There are two errors that could potentially occur:

Type I error (false positive): the test result says you have coronavirus, but you actually don’t.

Type II error (false negative): the test result says you don’t have coronavirus, but you actually do.

Deep Learning

What is the difference between the Adam optimizer and SGD?

Initializing all the weights with all (constant) zero

  • https://www.deeplearning.ai/ai-notes/initialization/#:~:text=Initializing%20all%20the%20weights%20with,the%20same%20features%20during%20training.&text=Thus%2C%20both%20neurons%20will%20evolve,neurons%20from%20learning%20different%20things.

Initializing all the weights with zeros leads the neurons to learn the same features during training.
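
A minimal NumPy sketch of the symmetry problem on a toy two-unit network (the setup is illustrative): with identical weights, both hidden units receive identical gradients and therefore can never learn different features.

import numpy as np

x = np.array([1.0, 2.0])   # one input sample
W1 = np.zeros((2, 2))      # hidden layer: both rows (units) identical
w2 = np.ones(2)            # output layer
y = 1.0

h = np.tanh(W1 @ x)        # both hidden activations are equal
err = (w2 @ h) - y
grad_W1 = np.outer(w2 * (1 - h ** 2) * err, x)  # backprop through tanh
print(grad_W1)             # both rows identical -> units stay identical forever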

Softmax

  • Softmax is used in multi-class classification.
  • The softmax function turns numbers into probabilities. The sum of these probabilities is one, and each probability represents a potential outcome.
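
A minimal, numerically stable NumPy sketch (NumPy assumed available):

import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; largest logit wins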

Activation function?

Add non-linearity into a neural network.

  • Sigmoid: computationally expensive, causes the vanishing gradient problem, and is not zero-centred. It is generally used for binary classification problems.
  • Softmax: a more generalised form of the sigmoid, used in multi-class classification problems. Like the sigmoid, it produces values in the range 0–1, so it is used as the final layer in classification models.
  • Tanh (hyperbolic tangent): compared to the sigmoid, it solves just one problem: it is zero-centred.
  • ReLU (Rectified Linear Unit): generally a better activation function than sigmoid and tanh because gradient descent converges faster with it, and it largely avoids the vanishing gradient problem (though units can “die” and get stuck at zero).

https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253
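
For reference, the activation functions above in a minimal NumPy sketch (NumPy assumed available):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # range (0, 1), not zero-centred

def tanh(z):
    return np.tanh(z)            # range (-1, 1), zero-centred

def relu(z):
    return np.maximum(0, z)      # cheap; gradient is 1 for z > 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z))
print(tanh(z))
print(relu(z))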

Vanishing/Exploding Gradient Problem

  • In a network of n hidden layers, n derivatives are multiplied together during backpropagation. If the derivatives are large, the gradient increases exponentially as we propagate back through the model until it eventually explodes; this is the exploding gradient problem.
  • Alternatively, if the derivatives are small, the gradient decreases exponentially as we propagate back through the model until it eventually vanishes; this is the vanishing gradient problem.
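
A minimal NumPy sketch of the vanishing case: the sigmoid's derivative is at most 0.25, and backpropagation multiplies one such factor per layer (the single pre-activation value used here is illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

grad = 1.0
a = 0.5                    # an illustrative pre-activation value at every layer
for layer in range(30):
    s = sigmoid(a)
    grad *= s * (1 - s)    # sigmoid'(a) <= 0.25
print(grad)                # ~1e-19: the gradient has effectively vanished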

How to avoid it?

https://towardsdatascience.com/the-vanishing-exploding-gradient-problem-in-deep-neural-networks-191358470c11#:~:text=In%20a%20network%20of%20n,the%20problem%20of%20exploding%20gradient%20.

Optimizer

https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6

how to select optimizer?

Feature Engineering

  • https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114

Features with High cardinality

  • https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b#:~:text=A%20categorical%20feature%20is%20said,absence)%20in%20the%20categorical%20variable.

  • https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

BinaryEncoding

https://stats.stackexchange.com/questions/325263/binary-encoding-vs-one-hot-encoding
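
A minimal sketch using the category_encoders package (assumed installed); each category's integer id is spread across ~log2(n) binary columns, a middle ground between label encoding and one-hot:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY", "Austin"]})

encoded = ce.BinaryEncoder(cols=["city"]).fit_transform(df)
print(encoded)  # far fewer columns than one-hot for high-cardinality features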


Reduce overfitting in NN model?

https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a

https://towardsdatascience.com/exploit-your-hyperparameters-batch-size-and-learning-rate-as-regularization-9094c1c99b55

https://www.quora.com/Why-does-decreasing-the-learning-rate-also-increases-over-fitting-rate-in-a-neural-network

https://www.1point3acres.com/bbs/thread-660530-1-1.html

high-cardinality


Cluster:
https://www.analyticsvidhya.com/blog/2017/02/test-data-scientist-clustering/
