Model Index

  • Common academic terms and translations
  • Naive Bayes
  • Decision tree
  • Generalisation and Evaluation
  • Linear regression
  • Logistic regression
  • Regularisation
  • SVM
  • PCA
  • K-means
  • GMM
  • KNN
  • Hierarchical Agglomerative Clustering

Common academic terms and translations

• independent and identically distributed: 独立同分布
• dataset: 数据集
• denote: 表示
• Principal Components Analysis: 主成分分析
• Logistic Regression: 逻辑回归
• k-Nearest Neighbour Method: k近邻算法
• Clustering: 聚类
• Linear regression: 线性回归
• eigenvector: 特征向量
• eigenvalue: 特征值
• bias: 偏置常数
• maximum likelihood: 极大似然
• gradient descent: 梯度下降
• prior probabilities: 先验概率
• hyperplane: 超平面
• Information entropy: 信息熵


Naive Bayes

prior probabilities: $p(c_1)$, $p(c_2)$.

equation: $p(c_1|x) = \frac{p(c_1)p(x|c_1)}{p(x)} = \frac{p(c_1)p(x|c_1)}{p(c_1)p(x|c_1)+p(c_2)p(x|c_2)}$
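
A minimal sketch of this two-class posterior computation in Python. The priors and the Gaussian class-conditional densities below are made-up illustrative values, not from the notes:

```python
import math

# Hypothetical priors and class-conditional (Gaussian) likelihood parameters.
priors = {"c1": 0.6, "c2": 0.4}
params = {"c1": (0.0, 1.0), "c2": (2.0, 1.5)}  # (mean, std) per class

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior(x):
    """p(c|x) via Bayes' rule: prior * likelihood, normalised over both classes."""
    joint = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
    evidence = sum(joint.values())  # p(x) = sum_c p(c) p(x|c)
    return {c: joint[c] / evidence for c in joint}

print(posterior(1.0))  # e.g. {'c1': 0.63..., 'c2': 0.36...}, sums to 1
```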

Decision tree

Information entropy calculation:

Entropy equation: $E = -\sum\limits_i^C p_i\log_2 p_i$

Example: in one split we have 1 red, 2 blue, and 3 green points:
$E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{2}{6}\log_2(\frac{2}{6})+\frac{3}{6}\log_2(\frac{3}{6})) \approx 1.46$

if we get 3 blues in one split (only blue):
$E = -(1\log_2(1)) = 0$

Information gain:

consider the node before the split (5 points of each class):
$E = -(\frac{5}{10}\log_2(\frac{5}{10})+\frac{5}{10}\log_2(\frac{5}{10})) = 1$

and consider the two splits afterwards:
the left split: $E = 0$; the right split: $E = -(\frac{1}{6}\log_2(\frac{1}{6})+\frac{5}{6}\log_2(\frac{5}{6})) = 0.65$
In the final calculation, we need to weight these two splits by their sizes:
$E_{split} = \frac{4}{10}\times 0 + \frac{6}{10}\times 0.65 = 0.39$

So the $Gain = 1 - 0.39 = 0.61$ (equal to how much entropy we removed).

In other words: higher information gain = more entropy removed.
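
A minimal sketch of these entropy and information-gain calculations, reproducing the worked example above (pure Python; no claims about any particular library's API):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

print(round(entropy([1, 2, 3]), 2))                       # ~1.46
print(round(entropy([3]), 2))                             # 0.0
print(round(information_gain([5, 5], [[4], [1, 5]]), 2))  # ~0.61
```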

Generalisation and Evaluation

Under-fitting:

• predictor too simplistic (too rigid)
• not powerful enough to capture salient patterns in data
• we can find another predictor $F'$ with smaller $E_{train}$ and $E_{gen}$

Over-fitting:

• predictor too complex (flexible)
• fits “noise” in the training data

ROC curve:

|                    | Real Positive | Real Negative |
| ------------------ | ------------- | ------------- |
| Predicted Positive | TP            | FP            |
| Predicted Negative | FN            | TN            |

$TPR = \frac{TP}{P} = \frac{TP}{TP+FN}$

$FPR = \frac{FP}{N} = \frac{FP}{FP+TN}$

A better model has a smaller FPR and a larger TPR.

AUC (Area Under the Curve)

range: 0~1

The larger the AUC is, the better the model predicts.
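
A minimal sketch of computing TPR and FPR from labels and thresholded scores, plus an AUC estimate (pure Python; the toy labels, scores, and the 0.5 threshold are made up for illustration):

```python
def tpr_fpr(y_true, y_pred):
    """True-positive rate and false-positive rate from binary labels/predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def auc(y_true, scores):
    """AUC as the probability that a random positive scores higher than a
    random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(tpr_fpr(y_true, y_pred))  # (0.666..., 0.333...)
print(auc(y_true, scores))      # 0.888...
```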

Linear regression

process:

  1. draw a line through the data
  2. measure the distance from the line to the data points (the residuals), square each distance, and add them up (record this sum of squared residuals)
  3. rotate the line a little bit, measure the residuals, square them, and sum the squares (record it)
  4. repeat steps 2 and 3 many times, then check the recorded sums of squared residuals and choose the parameters with the least sum of squared residuals as the output (“Least Squares”)

Initial:

$SS(mean) = (data - mean)^2$

Variance around the mean $= \frac{(data-mean)^2}{n}$

$Var(mean) = \frac{SS(mean)}{n}$

Fit:

$SS(fit) = (data - line)^2$

$Var(fit) = \frac{SS(fit)}{n}$

$R^2 = \frac{Var(mean)-Var(fit)}{Var(mean)} = \frac{SS(mean)-SS(fit)}{SS(mean)}$

$R^2$ tells us how much of the variance in y can be explained by taking x into account.
For example: if we get Var(mean) = 11.1 and Var(fit) = 4.4,
we get $R^2 = \frac{11.1-4.4}{11.1} = 0.6 = 60\%$.

There is a 60% reduction in variance when we take the x axis into account.
Alternatively, we can say that x “explains” 60% of the variance in y.
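
A minimal sketch of this $R^2$ calculation with NumPy, fitting the line by least squares. The toy x/y data are made up for illustration:

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least-squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_mean = np.sum((y - y.mean()) ** 2)  # SS(mean): squared residuals around the mean
ss_fit = np.sum((y - y_hat) ** 2)      # SS(fit): squared residuals around the line
r_squared = (ss_mean - ss_fit) / ss_mean

print(f"R^2 = {r_squared:.3f}")  # fraction of the variance in y explained by x
```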

In linear regression, equations with more parameters will never make SS(fit) worse than equations with fewer parameters.
Why? Because least squares will drive the coefficient of any term that makes SS(fit) worse to 0 so that, in a sense, the term no longer exists.

In summary, linear regression:

  1. Quantifies the relationship in the data (this is $R^2$); this needs to be large.
  2. Determines how reliable that relationship is (this is the p-value, which we calculate with F, the F-distribution); this needs to be small.

Logistic regression

formula: $p = \mathrm{sigmoid}(y) = \frac{1}{1+e^{-y}} = \frac{1}{1+e^{-(w^Tx+w_0)}}$

Evaluation of the model:
Cross-entropy loss: $L = -\sum [y_{true}\log(p)+(1-y_{true})\log(1-p)]$
target: to minimize the loss function
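
A minimal sketch of the sigmoid and the cross-entropy loss, trained with plain gradient descent on made-up 1-D data (NumPy; the data, learning rate, and iteration count are illustrative assumptions, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p):
    """L = -sum[y*log(p) + (1-y)*log(1-p)], clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical 1-D data: class 1 tends to have larger x.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

w, w0, lr = 0.0, 0.0, 0.1
for _ in range(500):
    p = sigmoid(w * x + w0)
    # Gradient of the cross-entropy loss w.r.t. w and w0.
    w -= lr * np.sum((p - y) * x)
    w0 -= lr * np.sum(p - y)

print(cross_entropy(y, sigmoid(w * x + w0)))  # loss after training
```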

Pros:

  1. generates a probability for the unlabelled data

  2. reduces the influence of outliers, making the prediction more accurate

Cons:

  1. can only model a linear decision boundary, so it works poorly when the classes are not linearly separable

Regularisation

How to deal with overfitting:

  1. Reduce the number of features.
    – Manually select which features to keep.
    – Model selection algorithms (later in the course).
  2. Regularization.
    – Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
    – Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
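
A minimal sketch of L2 (ridge) regularization added to a least-squares loss, showing how the penalty shrinks the parameter magnitudes while keeping all features (NumPy; the data and the λ values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                     # hypothetical design matrix
theta_true = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ theta_true + 0.1 * rng.normal(size=20)   # hypothetical targets

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - X theta||^2 + lam * ||theta||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

for lam in [0.0, 1.0, 10.0]:
    theta = ridge_fit(X, y, lam)
    print(lam, np.round(theta, 3), "||theta|| =", round(np.linalg.norm(theta), 3))
# Larger lam -> smaller parameter magnitudes (all features kept, just shrunk).
```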

SVM

How Linear SVM works:

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data and then, based on these transformations, finds an optimal boundary between the possible outputs.

target:

find the maximum-margin hyperplane:

$$\begin{cases} w^Tx_i+w_0 \ge 1, & \text{for } y_i=+1 \\ w^Tx_i+w_0 \le -1, & \text{for } y_i=-1 \end{cases}$$

and we have two hyperplanes:

$$\begin{cases} w^Tx_i+w_0 = 1 \\ w^Tx_i+w_0 = -1 \end{cases}$$

For the support vectors (the points that satisfy these equalities):

The distance to the hyperplane is $\frac{1}{||w||}$.
So the width of the margin is $d_+ + d_- = \frac{2}{||w||}$.

As we can see, we want to maximize $\frac{2}{||w||}$.
Instead, we do it by minimizing $\frac{||w||^2}{2}\ (= \frac{w^Tw}{2})$.

constraints:

$y_i(w^Tx_i+w_0)-1 \ge 0$

For the nonlinear case:

we use a kernel function $K(x,x') = \varphi(x)^T \varphi(x')$

Example: assume $x = [x_1,x_2]^T$; to transform the data set into a quadratic feature set:
$x \rightarrow \varphi(x) = [x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ 1]^T$

$K(x',x) = \varphi(x')^T\varphi(x) = x_1^2x_1'^2 + x_2^2x_2'^2 + 2x_1x_2x_1'x_2' + 2x_1x_1' + 2x_2x_2' + 1 = (x_1x_1' + x_2x_2' + 1)^2 = (1+x^Tx')^2$
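
A minimal sketch verifying this identity numerically: the dot product of the explicit quadratic feature maps equals the polynomial kernel $(1+x^Tx')^2$ (NumPy; the two sample vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, x_prime, k=2):
    """Polynomial kernel K(x, x') = (1 + x^T x')^k."""
    return (1.0 + x @ x_prime) ** k

x = np.array([1.0, 2.0])        # arbitrary sample points
x_prime = np.array([0.5, -1.0])

print(phi(x) @ phi(x_prime))    # feature-space dot product
print(poly_kernel(x, x_prime))  # same value, without building phi explicitly
```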

For kernel functions we have:

Linear kernel: $K(x,x') = x^Tx'$

Polynomial kernel: $K(x,x') = [1+x^Tx']^k$

Radial basis kernel: $K(x,x') = \exp[-\frac{1}{2}||x-x'||^2]$

PCA

Principal component analysis (PCA) is a commonly used unsupervised learning method. It uses an orthogonal transformation to convert observations of possibly correlated variables into a small number of linearly uncorrelated variables, which are called the principal components.

How PCA works:

The target of PCA is to find a transformation matrix $W$ that maps high-dimensional vectors to low-dimensional vectors. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
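
A minimal sketch of PCA via an eigendecomposition of the covariance matrix, projecting hypothetical 3-D data onto the top 2 principal components (NumPy; the data and the choice of 2 components are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical data: 100 samples, 3 features

# 1. Centre the data.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition.
cov = np.cov(X_centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: symmetric matrix

# 3. Sort components by decreasing eigenvalue (variance explained).
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:2]]           # transformation matrix: top-2 components

# 4. Project onto the new coordinate system.
X_reduced = X_centred @ W
print(X_reduced.shape)                         # (100, 2)
print(eigenvalues[order] / eigenvalues.sum())  # fraction of variance per component
```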

K-means

process (a sketch in code follows this list):

  1. select the number of clusters you want to identify in your data; this is the “K” in “K-means clustering”.
  2. randomly select K distinct data points (these are the initial clusters).
  3. measure the distance between the 1st point and each of the K initial clusters.
  4. assign the 1st point to the nearest cluster.
  5. repeat steps 3 and 4 for each remaining point.
  6. calculate the mean of each cluster.
  7. repeat steps 3-6 until the cluster assignments no longer change between iterations, then stop.
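
A minimal sketch of this procedure in NumPy (random initialisation, then alternating assign/update steps; the toy 2-D data are made up for illustration):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: random initial points, then alternate assign/update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2
    for _ in range(n_iters):
        # Steps 3-5: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 6: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # step 7: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```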

The K-means algorithm is sensitive to the initial clusters.

how to choose the best K:

We can run the K-means algorithm from K=1 to K=n (where n is the total number of points) and record the reduction in variation for each K. Plotting these values gives an “elbow plot”: the big reductions in variation stop at the “elbow”, and we pick K at that elbow.

how is K-means clustering different from hierarchical clustering?

K-means clustering specifically tries to put the data into the number of clusters we tell it to.
Hierarchical clustering just tells us, pairwise, which two things are most similar.

GMM

process (a sketch in code follows this list):

  1. guess the number of clusters (the number of Gaussian distributions).
  2. for every Gaussian distribution, give the parameters (expectation, variance, and weight) random values.
  3. for every instance, calculate the probability of it under each Gaussian distribution.
  4. for each Gaussian distribution, the contribution of each instance to the distribution is represented by its probability; use these probabilities as weights to calculate a new expectation and variance to replace the old ones.
  5. repeat steps 3 and 4 until each Gaussian distribution’s expectation and variance converge.
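
A minimal sketch of this EM procedure for a 1-D mixture of two Gaussians (NumPy; the toy data, the choice of two components, and the fixed iteration count are illustrative assumptions):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
# Hypothetical 1-D data drawn from two groups.
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.5, 100)])

# Steps 1-2: two components with rough initial parameters.
means, variances, weights = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # Step 3 (E-step): responsibility of each component for each instance.
    likelihood = np.stack([w * gaussian_pdf(x, m, v)
                           for w, m, v in zip(weights, means, variances)])
    resp = likelihood / likelihood.sum(axis=0)
    # Step 4 (M-step): probability-weighted updates of weight, mean and variance.
    nk = resp.sum(axis=1)
    weights = nk / len(x)
    means = (resp * x).sum(axis=1) / nk
    variances = (resp * (x - means[:, None]) ** 2).sum(axis=1) / nk

print(np.round(means, 2), np.round(variances, 2), np.round(weights, 2))
```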

KNN

How KNN works:

KNN works by finding the distances between a query and all the examples in the data, selecting the K examples closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).

How to choose the best K:

The optimal value of K can be selected using a held-out validation dataset: we pick the value of K that gives the lowest classification error over the validation dataset.

How K impact the result:

Low values of K (like K=1 or K=2) can be noisy and subject to the effects of outliers.
Large values of K smooth things over, but we don’t want K to be so large that a category with only a few samples in it is always outvoted by other categories.
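
A minimal sketch of KNN classification (distance to every example, take the K nearest, majority vote); NumPy, with made-up 2-D training data:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - query, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]                # most frequent label

# Hypothetical training data: two labelled groups in 2-D.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.0]), k=3))  # -> 1
```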

Hierarchical Agglomerative Clustering

process description:

Agglomerative clustering starts by assigning each instance to its own cluster, and iteratively merging pairs of clusters into a replacement cluster until we are left with a single cluster containing all the instances. This single cluster is the root of the hierarchy (called a dendrogram), the two clusters that are merged into it are its children, and so on recursively. The algorithm runs for n − 1 iterations, where n is the number of datapoints. At each iteration we merge the two clusters that have the smallest distance between them, according to some distance metric.

single linkage:

This is the distance between the closest members of the two clusters.

example: say we have the one-dimensional data: -1, 2, 3, 8, 10
the merge process is:
  1. merge 2 and 3, since their distance is 1
  2. merge 8 and 10, since their distance is 2
  3. merge -1 and {2,3}, since their distance is 3 (the minimum distance between -1 and the group {2,3}, namely between -1 and 2)
  4. merge {-1,2,3} and {8,10}, since their distance is 5 (the minimum distance between {-1,2,3} and {8,10}, namely between 3 and 8)
  5. done.

formula: $\min\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

complete linkage:

This is the distance between the members that are the farthest apart.

example: say we have the same one-dimensional data: -1, 2, 3, 8, 10
the merge process is:
  1. merge 2 and 3, since their distance is 1
  2. merge 8 and 10, since their distance is 2
  3. merge -1 and {2,3}, since their distance is 4 (the maximum distance between -1 and the group {2,3}, namely between -1 and 3)
  4. merge {-1,2,3} and {8,10}, since their distance is 11 (the maximum distance between {-1,2,3} and {8,10}, namely between -1 and 10)
  5. done.

formula: $\max\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$

average linkage:

This method involves looking at the distances between all pairs and averages all of these distances. This is also called UPGMA-Unweighted Pair Group Mean Averaging.

example: say we have the same one-dimensional data: -1, 2, 3, 8, 10
the merge process is:
  1. merge 2 and 3, since their distance is 1
  2. merge 8 and 10, since their distance is 2
  3. merge -1 and {2,3}, since their distance is 3.5 (the average distance between -1 and the group {2,3}: (|-1-2| + |-1-3|)/2 = (3+4)/2 = 3.5)
  4. merge {-1,2,3} and {8,10}, since their distance is 7.67 (the average of all six pairwise distances between {-1,2,3} and {8,10}: (9+11+6+8+5+7)/6 ≈ 7.67)
  5. done.

formula: $\frac{1}{n_1 n_2} \sum\limits_{x_1 \in c_1,\, x_2 \in c_2} D(x_1,x_2)$ (assuming $c_1$ has $n_1$ points and $c_2$ has $n_2$ points)

Intuitive explanation: single linkage picks the smallest among the clusters' closest-member distances; complete linkage picks the smallest among the farthest-member distances; average linkage picks the smallest among the average (intermediate) distances.
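
A minimal sketch computing the three linkage distances between two clusters, reproducing the 1-D example above (pure Python; the cluster contents come from the worked example):

```python
from itertools import product

def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(abs(a - b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the members that are farthest apart."""
    return max(abs(a - b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Average over all pairwise distances between the two clusters."""
    dists = [abs(a - b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

c1, c2 = [-1, 2, 3], [8, 10]
print(single_linkage(c1, c2))             # 5
print(complete_linkage(c1, c2))           # 11
print(round(average_linkage(c1, c2), 2))  # 7.67
```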
