Azure Machine Learning

Detailed Notes for Machine Learning Foundation Course by Microsoft Azure & Udacity, 2020 on Lesson 4 — Supervised & Unsupervised Learning

This lesson covers two of Machine Learning’s fundamental approaches: supervised and unsupervised learning. You will learn about classification, regression, clustering, representation learning, and more.

Supervised Learning: Classification

In a classification problem, the outputs are categorical or discrete.

Some of the most common types of classification problems include:

  • Classification on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.

  • Classification on image or sound data: The training data consists of images or sounds whose categories are already known.

  • Classification on text data: The training data consists of texts whose categories are already known.

Examples of classification problems include:

  • Computer Vision
  • Speech Recognition
  • Biometric Identification
  • Document Classification
  • Sentiment Analysis
  • Credit Scoring
  • Anomaly Detection

Categories of Algorithms:

At a high level, classification algorithms fall into three main categories:

  • Two-Class (Binary) Classification: used when the prediction has to be made between only two categories, e.g. True/False or Yes/No.

  • Multi-Class Single-Label Classification: used when there are multiple categories to predict from but each output belongs to exactly one of them, e.g. red, yellow, green, or blue.

  • Multi-Class Multi-Label Classification: used when there are multiple categories to predict from and each output can belong to several of them at once, e.g. any combination of red, yellow, green, and blue.
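
The difference between single-label and multi-label output can be sketched in a few lines. This is only an illustration: the per-class scores, the function names, and the 0.5 threshold are assumptions, not part of the course material.

```python
def predict_single_label(scores):
    """Return the one class with the highest score (multi-class, single-label)."""
    return max(scores, key=scores.get)


def predict_multi_label(scores, threshold=0.5):
    """Return every class whose score reaches the threshold (multi-label)."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-class scores produced by some trained model.
scores = {"red": 0.7, "yellow": 0.1, "green": 0.6, "blue": 0.2}

single = predict_single_label(scores)
multi = predict_multi_label(scores)
print(single)  # red
print(multi)   # ['green', 'red']
```

The single-label prediction always picks exactly one class, while the multi-label prediction may return none, one, or several.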

Two-Class Classification Algorithms:

Multi-Class Classification Algorithms:

Multi-Class Algorithms

Multi-Class Algorithms Hyperparameters:

  • Multi-Class Logistic Regression: a well-known statistical method that predicts the probability of an outcome and is popular in classification tasks. The two key parameters to configure this algorithm are:
    1. Optimization Tolerance: controls when to stop iterating. If the improvement between iterations is less than the specified threshold, the algorithm stops and returns the current model.
    2. Regularization Weight: regularization is a method of preventing overfitting by penalizing models with extreme coefficient values. The Regularization Weight controls how much to penalize the model at each iteration.

  • Multi-Class Neural Network: a typical example includes an input layer, a hidden layer, and an output layer. The relationship between the input and the output is learned by training the neural network on input data. The three key parameters to configure the Multi-Class Neural Network are:
    1. Number of Hidden Nodes: customizes the number of nodes in the hidden layer of the neural network.
    2. Learning Rate: controls the size of the step taken at each iteration before correction.
    3. Number of Learning Iterations: the maximum number of times the algorithm should process the training cases.

  • Multi-Class Decision Forest: an ensemble of Decision Trees. The algorithm works by building multiple Decision Trees and then voting on the most popular output class. The five key parameters to configure the Multi-Class Decision Forest are:
    1. Resampling Method: controls the method used to create the individual Decision Trees.
    2. Number of Decision Trees: specifies the maximum number of Decision Trees that can be created in the ensemble.
    3. Maximum Depth of Decision Trees: limits the depth of any Decision Tree.
    4. Number of Random Splits per Node: the number of splits to use when building each node of the tree.
    5. Minimum Number of Samples per Leaf Node: controls the minimum number of cases required to create any terminal node in a tree.
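
The voting step of a decision forest can be sketched as a simple majority vote. The function name and the sample predictions below are hypothetical, and the individual trees are assumed to have been trained already; this is not the Azure module's implementation.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class predicted by the most trees.

    Ties go to the class that appears first in the list.
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical outputs of five already-trained trees for one input row.
votes = ["cat", "dog", "cat", "bird", "cat"]
winner = forest_vote(votes)
print(winner)  # cat
```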

Supervised Learning: Regression

In a regression problem, the output is numerical or continuous.

Introduction to Regression

Common types of regression problems include:

  • Regression on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.

  • Regression on image or sound data: The training data consists of images or sounds whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform the images or sounds into numerical vectors accepted by the algorithms.

  • Regression on text data: The training data consists of texts whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform the text into numerical vectors accepted by the algorithms.

Examples of Regression Problems:

  • Housing prices
  • Customer churn
  • Customer Lifetime Value
  • Forecasting (time series)
  • Anomaly detection

Categories of Algorithms

Common machine learning algorithms for regression problems include:

Linear Regression

  • Models a linear relationship between one or more independent variables and a numeric outcome (the dependent variable)
  • Fast training
  • Two popular approaches to measuring error and fitting the regression line:
  1. Ordinary Least Squares method: computes the error as the sum of the squared distances from the actual values to the predicted line, and fits the model by minimizing this squared error. This method assumes a strong linear relationship between the independent and the dependent variables.

  2. Gradient Descent: iteratively adjusts the model parameters to reduce the error at each step of the training process.
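
Both approaches can be sketched for the simplest case of one independent variable. This is a minimal illustration, not the Azure module's implementation: the function names, the learning rate of 0.05, the iteration count, and the toy data are all assumptions.

```python
def fit_ols(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit the same line by stepping against the gradient of the mean squared error."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(iterations):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

ols_slope, ols_intercept = fit_ols(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
```

On this data both methods recover a slope of about 2 and an intercept of about 1. The closed-form fit gets there in one pass, while gradient descent approaches the same answer step by step.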

Decision Forest Regression

  • An ensemble learning method using multiple decision trees
  • Each tree outputs a distribution as a prediction
  • Aggregation is performed to find a distribution closest to the combined distribution
  • Accurate, fast training times
  • Supports some of the same hyperparameters as the Multi-Class Decision Forest algorithm, e.g. Number of Trees, Max Depth, etc.

Neural Net Regression

  • The label column must be a numerical data type
  • A fully connected Neural Network: Input layer + one Hidden layer + Output layer
  • Accurate, long training times
  • Supports the same hyperparameters as the Multi-Class Neural Network algorithm, e.g. Number of Hidden Nodes, Learning Rate, Number of Iterations, etc.

Automate the Training of Regressors

Automated Machine Learning enables the automated exploration of the combinations needed to successfully produce a trained model. AutoML intelligently tests multiple combinations of algorithms and hyperparameters in parallel and returns the best one. It enables building Machine Learning models with high-scale efficiency and productivity, all while sustaining model quality. The resulting models can be:

  1. Deployed into production, or
  2. Further refined and customized

Beyond the primary metric, you can also review a comprehensive set of performance metrics and charts to further assess the model performance.
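
The core idea of trying several combinations and keeping the one with the best primary metric can be sketched as a plain exhaustive search. This is a deliberately simplified stand-in, not how AutoML works internally: the "models" are fixed lines, the primary metric is assumed to be mean squared error on a small validation set, and every name below is hypothetical.

```python
from itertools import product


def mean_squared_error(predict, data):
    """Primary metric for this sketch: average squared prediction error."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def make_line(slope, intercept):
    """Hypothetical 'model': a fixed line y = slope * x + intercept."""
    return lambda x: slope * x + intercept


def search_best(slopes, intercepts, validation_data):
    """Score every (slope, intercept) combination and keep the lowest error."""
    best_params, best_score = None, float("inf")
    for slope, intercept in product(slopes, intercepts):
        score = mean_squared_error(make_line(slope, intercept), validation_data)
        if score < best_score:
            best_params, best_score = (slope, intercept), score
    return best_params, best_score


validation_data = [(0, 1), (1, 3), (2, 5)]  # points on y = 2x + 1
best_params, best_score = search_best([1, 2, 3], [0, 1], validation_data)
print(best_params, best_score)  # (2, 1) 0.0
```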

Unsupervised Learning

In unsupervised learning, algorithms learn from unlabeled data by looking for hidden structures in the data.

Obtaining unlabeled data is comparatively inexpensive, and unsupervised learning can be used to uncover very useful information in such data.

Types of Unsupervised Machine Learning

Clustering: organizes entities from the input data into a finite number of subsets or clusters

Feature Learning: transforms sets of inputs into other inputs that are potentially more useful in solving a given problem

Anomaly Detection: identifies two major groups of entities: 1) Normal, 2) Abnormal (anomalies)

Some other types include Dimensionality Reduction, Feature Extraction, Neural Networks, Principal Component Analysis, and Matrix Factorization.

Semi-Supervised Learning

Semi-supervised learning combines the supervised and unsupervised approaches; typically it involves having small amounts of labeled data and large amounts of unlabeled data.

The problem:

  • Labeled data is difficult and expensive to acquire
  • Unlabeled data is usually inexpensive to acquire

The solution:

Use a small amount of labeled data and a much larger amount of unlabeled data:

  • Self-Training: train the model using the labeled data and use it to make predictions on the unlabeled data. The output is a dataset that is fully labeled and can be used in a Supervised Learning approach.

  • Multi-view Training: train multiple models on different views of the data, which can include various feature selections, parts of the training data, or various model architectures.

  • Self-ensemble Training: similar to Multi-view Training, except that a single model is trained on different views of the data.
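
Self-Training can be sketched with a very small model. The one-dimensional nearest-centroid classifier, the function names, and the toy data below are all assumptions chosen to keep the example short; any supervised model could take its place.

```python
def train_centroids(labeled):
    """Train a tiny 1-D nearest-centroid classifier: one mean value per label."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Predict the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def self_train(labeled, unlabeled):
    """Pseudo-label the unlabeled values with a model trained on the labeled ones."""
    model = train_centroids(labeled)
    pseudo_labeled = [(value, predict(model, value)) for value in unlabeled]
    return labeled + pseudo_labeled


labeled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
unlabeled = [1.5, 8.0, 3.0, 11.0]

# The result is a fully labeled dataset usable for ordinary supervised training.
full_dataset = self_train(labeled, unlabeled)
```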

Clustering

Clustering is the problem of organizing entities from the input data into a finite number of subsets or clusters; the goal is to maximize both intra-cluster similarity and inter-cluster differences.

Applications of Clustering Algorithms:

  • Personalization and target marketing
  • Document classification
  • Fraud detection
  • Medical imaging
  • City planning

Clustering Algorithms:

  • Centroid-based Clustering: organizes data into clusters based on the distance of members from the centroid of the cluster, e.g. K-Means.

  • Density-based Clustering: clusters members that are closely packed together; it can learn clusters of arbitrary shapes.

  • Distribution-based Clustering: the underlying assumption is that the data has an inherent distribution type, such as the normal distribution. The algorithm clusters based on the probability of a member belonging to a particular distribution.

  • Hierarchical Clustering: builds a tree of clusters. This is best suited to hierarchical data such as taxonomies.

K-Means Clustering:

K-means is a centroid-based unsupervised clustering algorithm.

It creates up to a target number (K) of clusters and groups similar members together in a cluster. The objective is to minimize intra-cluster distances (the squared error of the distance between the members of a cluster and its center).
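
Written as a formula, this objective is commonly expressed as the within-cluster sum of squared distances. The symbols are notation introduced here for illustration: $C_k$ is the set of members of cluster $k$, $\mu_k$ is its centroid, and $x_i$ is a data point.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

A smaller $J$ means the members sit closer to their cluster centers.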

K-Means Clustering Algorithm:

Steps:

  1. Initialize the centroid locations.
  2. Assign each member to the cluster represented by the closest centroid.
  3. Compute the new cluster centroids based on the current cluster membership.
  4. Check for convergence.
  • There are different convergence criteria: 1) measure how much the centroid locations change as a result of the new cluster membership; if the total change is less than a given tolerance, the algorithm assumes convergence and stops; 2) stop after a fixed number of iterations. If the convergence criterion is not met, the algorithm iterates again starting from step 2.
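
The four steps can be sketched in plain Python as follows. This is a minimal illustration rather than the Azure module's implementation: centroids are initialized from the first k points, and the tolerance and iteration cap are assumed default values.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, tolerance=1e-6, max_iterations=100):
    # Step 1: initialize centroids; here simply the first k points (an assumption).
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        # Step 2: assign each member to the cluster of its closest centroid.
        assignments = [closest(p, centroids) for p in points]
        # Step 3: recompute each centroid as the mean of its current members.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 4: converged when the total centroid movement is below the tolerance.
        shift = sum(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tolerance:
            break
    assignments = [closest(p, centroids) for p in points]
    return centroids, assignments


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 1]
```

The loop stops on whichever criterion is met first: the centroids barely move, or the iteration cap is reached.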

K-Means Module Configurations:

  • Number of Centroids: the number of clusters you want the algorithm to begin with. The algorithm starts with this number of data points and iterates to find the optimal configuration.

  • Initialization approach: how the initial centroids are selected. The options for initialization are First N, Random, or the K-Means++ algorithm.

  • Distance metric: the default is the Euclidean distance.

  • Normalize features: uses the Min-Max Normalizer to scale numeric data points to the range from zero to one.

  • Assign label mode: used only if your dataset already has a label column. Optionally, the label values can be used to guide the selection of the clusters. Another use of the label column is to fill in missing values.

  • Number of iterations: dictates the number of times the algorithm should iterate over the training data before it finalizes the selection of centroids.
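
Min-max normalization itself is a one-line formula, (value - min) / (max - min), sketched below. The function name and sample values are hypothetical; returning zeros for a constant column is a choice made for this sketch and not necessarily what the Azure module does.

```python
def min_max_normalize(values):
    """Scale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - low) / (high - low) for v in values]


ages = [20, 30, 40, 60]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling every feature to the same range matters for K-Means because the Euclidean distance would otherwise be dominated by whichever feature has the largest raw values.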

Lesson Summary

This lesson covered two of Machine Learning’s fundamental approaches: supervised and unsupervised learning.

First, we learned about supervised learning. Specifically, we learned:

  • More about classification and regression, two of the most representative supervised learning tasks

  • Some of the major algorithms involved in supervised learning, as well as how to evaluate and compare their performance

  • How to use automated machine learning to automate the training and selection of classifiers and regressors

Next, the lesson focused on unsupervised learning, including:

  • Its most representative learning task, clustering

  • How unsupervised learning can address challenges like lack of labeled data, the curse of dimensionality, overfitting, feature engineering, and outliers

  • An introduction to representation learning
