Using Gradient Boosting Machines for Classification in R

Background

Purpose of analysis:

Understand the factors driving student success so that the Open University can allocate resources to improve it.

Description of dataset:

The Open University Learning Analytics Dataset is a publicly available dataset containing data about courses, students, and their interactions with the virtual learning environment (VLE) for seven selected courses (modules).

As the unique identifier across all the data tables was the combination of student ID, module, and presentation, data was aggregated at this level. The following variables were used in the analysis:

| Variable Name | Variable Type |
| --- | --- |
| id_student | unique identifier/primary key |
| code_module | categorical |
| code_presentation | categorical |
| gender | categorical |
| region | categorical |
| highest education | categorical |
| imd band | categorical |
| age band | categorical |
| num of previous attempts | numerical |
| studied credits | categorical |
| disability | categorical |
| final result | categorical |
| sum weighted score | numerical |
| average module length | numerical |
| average submission duration | numerical |
| average proportion content accessed | numerical |
| average date registration | numerical |
| trimmed assessment type | categorical |

Methodology

Data transformation

  • Combine datasets based on the unique identifier (student ID, code_module, and code_presentation)
  • Aggregate variables at the unique identifier level
  • Update nominal variable types from character to factor

Predictors:

  • code module
  • code presentation (recoded to 1 = B and 2 = J)
  • gender, region, highest education, IMD band, age band, number of previous attempts, studied credits
  • average submission duration (averaged because a single code presentation can have multiple assessments with varying submission durations)
  • average module length
  • average proportion of content accessed (the sum of clicks on “ou” content and resources/quizzes/glossaries divided by the total sum of clicks per code presentation)
  • average date of registration
  • trimmed assessment type (a single code presentation can have multiple assessments, including several of the same type; it is important to determine whether the mix of assessment types per course and presentation drives student success)

Excluded variables: student ID and sum of weighted scores

Reason for exclusion: student ID is identifying information; the sum of weighted scores is correlated with the final result.

Define the “success” target variable

  • Response variable: success (“Pass” or “Distinction” = “Yes”; “Fail” or “Withdrawn” = “No”)

Reason for not using “final result” as the success factor: there is a disproportionate number of “Pass” records within the dataset, reducing the model's accuracy at predicting “Fail”, “Withdrawn”, and “Distinction”. As such, the response variable was binarised.
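
A minimal sketch of this binarisation in R, assuming the aggregated data sits in a data frame called student_data with a final_result column (both names are illustrative):

```r
library(dplyr)

# Collapse the four-level final result into a two-level success flag
student_data <- student_data %>%
  mutate(success = factor(ifelse(final_result %in% c("Pass", "Distinction"),
                                 "Yes", "No")))
```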

  • Check for nulls, missing values, and outliers in each variable that will be input into the model

Why do we check for missing data? If a large proportion of the data is missing or null, the sample is not representative enough to provide accurate results. This does not apply when the data is missing for a valid reason.
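
A quick check along these lines (again using the illustrative student_data frame):

```r
# Count missing values per column; eyeball ranges for outliers
colSums(is.na(student_data))
summary(student_data)
```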

Data set-up

• Split the data into training and test sets to evaluate model accuracy

Why do we split the data into training and test sets? The model is trained on the training set. Model accuracy is then measured on the test set to determine how well the model predicts success versus non-success on data it has not “seen”.
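
A sketch of the split using the h2o package (an assumption on my part, though it matches the DRF and mean per-class error terminology used later; the 75/25 ratio and seed are illustrative):

```r
library(h2o)
h2o.init()

# Send the prepared data to the H2O cluster and hold out 25% for testing
students_h2o <- as.h2o(student_data)
splits <- h2o.splitFrame(students_h2o, ratios = 0.75, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]
```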

Data modelling

  • Analytical method: gradient boosting machine (GBM)
  • Alternative methods such as Distributed Random Forest (DRF) and a Generalized Linear Model (GLMNET) could have been used, but the GBM's accuracy was good enough
  • Check model accuracy using a confusion matrix (the proportion of cases predicted correctly) and the area under the curve (how well we did compared to random chance)

Output

  • Top predictors (by variable importance score)

Data ETL

All data exploration, transformation, and loading of the final dataset was done in Alteryx, both to avoid writing code and to test Alteryx's functionality on basic data transformation steps that would typically be carried out in R, such as joins, mutate, and group by.

First, I joined three datasets (assessments.csv, studentAssessments.csv, and courses.csv) using the assessment ID as the primary key.
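
For reference, a dplyr sketch of the equivalent R step (file names are from the OULAD download; since courses.csv carries no assessment ID, I assume it attaches on module and presentation):

```r
library(readr)
library(dplyr)

assessments         <- read_csv("assessments.csv")
student_assessments <- read_csv("studentAssessments.csv")
courses             <- read_csv("courses.csv")

# Join marks to assessment metadata, then attach course-level information
block1 <- student_assessments %>%
  inner_join(assessments, by = "id_assessment") %>%
  inner_join(courses, by = c("code_module", "code_presentation"))
```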

Next, I engineered two features: weighted_score and submission_duration. Dates are usually meaningless unless they are transformed into useful features. The presentation code only had two month categories, which I converted to 1 and 2 for ease of reference.
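
A sketch of these two features; the exact formulas are my assumptions based on the descriptions (the score scaled by the assessment weight, and the gap between submission and deadline):

```r
# weighted_score: assessment score scaled by its weight
# submission_duration: days between the deadline (date) and submission
block1 <- block1 %>%
  mutate(weighted_score      = score * weight / 100,
         submission_duration = date_submitted - date)
```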

The next step was to add data on student interactions with the virtual learning environment (VLE). Some feature engineering was done for ease of analysis, such as creating a new variable called activity_type_sum that collapses the activity types into two broad categories: content access and browsing. The reason for doing this is that granular categories only produce more features and reduce the number of observations per category. The number of clicks was then summed by this activity-type feature. The proportion of total activity that is browsing related versus content-access related was also calculated. This is a good way to create a feature that is relative to another feature and scaled by total activity, ensuring that all students are represented on a similar scale by their activity type.
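
A sketch of this step; the mapping of raw activity types into the two groups is my assumption, based on the earlier description of “ou” content, resources, quizzes, and glossaries:

```r
student_vle <- read_csv("studentVle.csv")
vle         <- read_csv("vle.csv")

# Collapse granular activity types into two broad groups, then compute
# each group's share of a student's total clicks
block2 <- student_vle %>%
  inner_join(vle, by = c("id_site", "code_module", "code_presentation")) %>%
  mutate(activity_type_sum = ifelse(
    activity_type %in% c("oucontent", "resource", "quiz", "glossary"),
    "content_access", "browsing")) %>%
  group_by(id_student, code_module, code_presentation, activity_type_sum) %>%
  summarise(sum_clicks = sum(sum_click), .groups = "drop") %>%
  group_by(id_student, code_module, code_presentation) %>%
  mutate(proportion_clicks = sum_clicks / sum(sum_clicks)) %>%
  ungroup()
```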

Block 1 was joined to Block 2 using student_id, code_module, and code_presentation as the primary key. The resulting output is shown below.

The above output (Block 3) was joined with the student registration data using student_id, code_module, and code_presentation to bring across the date_registration field.

The date_unregistered field was ignored as it had a lot of missing values. Moreover, students with a populated date_unregistered field have “Withdrawn” as their final_result, which is our target/response variable. So the date_unregistered field appears to be a proxy measure for final_result, and it makes sense to exclude it from the analysis.

As shown above, for a given id_student, code_module, and code_presentation, the module_presentation_length, proportion_content, and date_registration values are repeated. As we want unique records, we can aggregate the data as follows (a dplyr sketch follows the list):

  • Summarise the weighted score using the total sum
  • Average the submission duration
  • Average the module presentation length (other aggregates such as minimum, maximum, and median could also be used)
  • Average proportion_content_access
  • Average date_registration
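
Assuming joined holds Block 3 plus the registration data from the steps above (all names illustrative), the aggregation might look like:

```r
# One row per student x module x presentation x assessment type
block4 <- joined %>%
  group_by(id_student, code_module, code_presentation, assessment_type) %>%
  summarise(
    sum_weighted_score            = sum(weighted_score, na.rm = TRUE),
    avg_submission_duration       = mean(submission_duration, na.rm = TRUE),
    avg_module_length             = mean(module_presentation_length),
    avg_proportion_content_access = mean(proportion_clicks, na.rm = TRUE),
    avg_date_registration         = mean(date_registration, na.rm = TRUE),
    .groups = "drop")
```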

Data is now at the student_id, code_module, code_presentation, and assessment_type level; however, the target variable (final_result) is at the student_id, code_module, and code_presentation level. Hence, this data will need to be further aggregated.

Let’s look at student info first. A unique record here is id_student, code_module, and code_presentation. So we need to go back a step and summarise to the student_id, code_module, and code_presentation level to represent all assessments taken by an individual, still using the previous summary formulas.

By doing this we have 8 unique assessment-type combinations that a student can take for a given code module and code presentation. Assessment types are not repeated (the list is trimmed), so if a student took 3 TMAs this is not reflected, as shown below.

A variable could be created to count the number of assessments per assessment type, but it would contain many missing values, as not all courses include all three types of assessment. Now we are ready to join to the student info data, with the output shown below.

Now we have 18 columns. We have been told that a presentation may differ if presented in February vs. October. We will assume that it does not differ year on year (i.e. 2013B is the same as 2014B). As such, we will recode code_presentation as a binary variable: 1 for B and 2 for J.
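
A one-line version of this recode, assuming the original “2013B”-style codes and an illustrative student_info frame:

```r
# Presentation month as binary: B (February) = 1, J (October) = 2
student_info <- student_info %>%
  mutate(code_presentation = ifelse(substr(code_presentation, 5, 5) == "B", 1, 2))
```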

The final output is shown below.

It is finally time for some data exploration.

Exploratory Data Analysis

Categorical variables can be represented with bar charts where the y-axis is the frequency of occurrence of a given category. For example, in the chart below we can see that the most frequently taken code module is FFF, followed by BBB. There are seven unique code modules with no missing values.
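
A chart along those lines can be produced with ggplot2 (student_data is the illustrative frame from earlier):

```r
library(ggplot2)

# Frequency of each code module
ggplot(student_data, aes(x = code_module)) +
  geom_bar() +
  labs(x = "Code module", y = "Frequency")
```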

Data can also be summarised numerically, using a five-number summary for continuous variables and the mode for categorical variables, as shown below.
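
In base R, summary() covers both cases at once:

```r
# Five-number summary (plus the mean) for numeric columns,
# frequency counts for factor columns
summary(student_data)
```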

One insight we can draw from the summary below is that the most common student is a Scottish male, presenting without a disability, with an imd_band between 20 and 40%, and with the typical Pass as their final result.

Now we can move towards modelling the dataset.

Machine Learning Model

We have been asked to assist the Open University in better understanding student success.

  • We will assume that student success is measured via the final result, where Pass and Distinction are indicators of “success” and Withdrawn and Fail are indicators of “non-success”.
  • For the independent variables, we will use all variables from the previous table except weighted_score. The weighted score determines the final result for a given student, so it is highly correlated (multicollinear) with the final result and will be excluded.
  • Student ID is identifying information and will not be used as a predictor.

GBM (gradient boosting machine) was the model of choice. This type of model creates a series of weak learners (shallow trees), where each new tree tries to improve on the error rate of the previous tree; the final tree is the one with the lowest error rate. It is an ensemble machine learning method, as several trees are combined to provide the final results. However, unlike in a random forest, these trees are created in sequence rather than in parallel. Furthermore, the trees are not independent: each depends on the previous tree's error rate, with subsequent trees trying harder to improve the prediction for the more difficult cases. How aggressively each tree corrects the last is controlled by a parameter called the learning rate.

The model was run for 500 rounds (500 trees) with minimum and maximum tree depths of 4. Typically it is not good to have very deep trees, as this can lead to overfitting: the algorithm tries to explain every observation in the dataset, increasing the depth of the tree until it produces leaves containing a very small number of observations that fit the given rule.
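
A hedged sketch of this first fit with h2o, predicting the four-level final result (column names are illustrative; capping max_depth at 4 is what makes the model summary report both minimum and maximum depths of 4):

```r
predictors <- setdiff(colnames(train),
                      c("id_student", "sum_weighted_score",
                        "final_result", "success"))

# First pass: predict the four-level final result directly
gbm_final_result <- h2o.gbm(x = predictors, y = "final_result",
                            training_frame = train,
                            ntrees = 500, max_depth = 4, seed = 42)
```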

We can see from the above output that the model has an RMSE (root mean squared error) of 0.55, which is quite high. It is particularly bad at predicting Distinction and Fail, which may be due to the imbalance in the dataset: we know from our exploratory data analysis that Pass is the most common final result.

To counteract this imbalance, the target variable was redefined as “success” (Distinction and Pass) and “failure” (Fail and Withdrawn). Combining categories is a common way to deal with imbalanced datasets. Other ways are to undersample (i.e. reduce the number of instances of the most frequent class) or oversample (i.e. create artificial observations for the infrequent classes).

The model was re-run, with the following output. Here we can see that the mean per-class error has dropped significantly. The area under the curve (AUC) is another accuracy metric; it tells you how well the model classifies cases correctly (i.e. maximises the true positive rate, TPR). The higher the AUC, the more accurate the model. As AUC is measured between 0 and 1, an AUC of 0.87 is pretty good.
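
The re-run and its accuracy checks might look like this (same assumed h2o set-up as before):

```r
# Refit against the binarised target, then score the held-out test set
gbm_success <- h2o.gbm(x = predictors, y = "success",
                       training_frame = train,
                       ntrees = 500, max_depth = 4, seed = 42)

perf <- h2o.performance(gbm_success, newdata = test)
h2o.auc(perf)               # area under the ROC curve
h2o.confusionMatrix(perf)   # per-class and overall error rates
```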

Another metric commonly used in classification problems is the F1 score, the harmonic mean of precision and recall. Both metrics aim to maximise the TPR while minimising either the false negative rate (recall) or the false positive rate (precision). A true positive is a success classified correctly as a success; a false negative is a success labelled as a failure; a false positive is a failure labelled as a success. For the F1 score to be high, both precision and recall need to be high.
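
Worked through in R with made-up counts (tp, fp, and fn here are purely illustrative):

```r
# F1 as the harmonic mean of precision and recall
tp <- 900; fp <- 150; fn <- 120
precision <- tp / (tp + fp)   # 0.857
recall    <- tp / (tp + fn)   # 0.882
f1 <- 2 * precision * recall / (precision + recall)
f1                            # ~0.87
```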

The confusion matrix indicates an overall error rate of 17.11%, which is mainly driven by how good the model is at classifying successes. The model is not so good at classifying failures, with an error rate of 39.07%. Again, this may be because “Pass” is overrepresented in the data. Thus, the results should be treated with caution and the model re-run on a more balanced dataset.

Now, let’s look at the top predictors of success or failure by looking at the variable importance list.

  • The top 3 variables are code module, trimmed_assessment_type, and average submission duration.
  • The bottom 3 variables for predicting whether a given student will reach a successful outcome are disability status, average date of registration, and gender.
  • Note: as code module and code presentation are part of the unique identifier, they should arguably have been excluded from the analysis. However, as the February and October presentations may differ for some courses, both variables were kept in the model. It is possible that excluding these variables would increase accuracy or make other variables more “important”. (A sketch for pulling the importance list follows.)
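
Assuming the h2o model from above, the ranked list comes straight from the fitted object:

```r
# Ranked variable importance scores for the fitted GBM
h2o.varimp(gbm_success)
h2o.varimp_plot(gbm_success, num_of_features = 10)
```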

Now, let’s visualise the top predictors to better understand the model. The stacked bar plot below shows the proportion of records by course module and final_result. We can deduce that students are more likely to successfully complete the AAA, EEE, and GGG courses than other courses.

  • From the above table, we can see that there is a 100% success rate when an exam is the only assessment for a given course and presentation.
  • If computer-marked assessments (CMAs) are the only course component, there is a very high failure/withdrawal rate. It would be interesting to investigate why having a CMA as part of a presentation's assessment leads to a decrease in the success rate.

The histograms above show the average submission duration by success and failure.

It appears that when students are successful, they are more likely to submit their assignment within 10 days (+/-) of the assessment submission date.

Wrapping up

Machine learning was used to quickly identify top contributors to student success.

Recommendations for model improvement include:

  • Working with a balanced dataset
  • Including proxy measures for resource allocation within the dataset
  • Counting the number of assessments by type per course and presentation as a feature
  • Removing categorical variables that are associated with each other (i.e. using a chi-squared test of independence)

Hopefully, you now have a better understanding of utilising GBM for a classification problem, the pitfalls of a classification problem (i.e. an imbalanced dataset), and the use of various accuracy metrics.

The reference to all R code is provided in my git repository: https://github.com/shedoesdatascience/openlearning

Translated from: https://towardsdatascience.com/using-gradient-boosting-machines-for-classification-in-r-b22b2f8ec1f1
