Random forest algorithm demo in Python with Spark

Key parameters

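The two most important parameters, and the ones that most often need tuning to improve results, are numTrees and maxDepth.

numTrees (number of decision trees): increasing the number of trees reduces the variance of the predictions, which generally gives higher accuracy at test time. Training time grows roughly linearly with numTrees.
maxDepth: the maximum depth allowed for each decision tree in the forest; the same parameter is described for single decision trees. A deeper tree makes the model more expressive, but it also takes longer to train and is more prone to overfitting. Note, however, that a random forest and a single decision tree place different demands on this parameter. A random forest votes on or averages the predictions of many trees, which reduces the variance of the result, so it is less likely to overfit than a single tree. A random forest can therefore use a larger maxDepth than a single decision tree model would.
Some references even say that every tree in a random forest should be grown as fully as possible, with no pruning. Either way, it is still worth experimenting with maxDepth to see whether it improves predictions.
The other two parameters, subsamplingRate and featureSubsetStrategy, generally do not need tuning. They can be changed to speed up training, but doing so may affect the model's predictive quality (if you need to tune them, read the original guide text quoted below carefully).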
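The variance argument for numTrees can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modeled as the true value plus independent Gaussian noise, and the function name and its parameters are made up for this illustration. Trees in a real forest are correlated, so the reduction seen in practice is smaller than in this idealized setting.

```python
import random
import statistics


def simulate_prediction_variance(num_trees, trials=2000, noise_sd=1.0, seed=0):
    """Variance of the average of `num_trees` independent noisy predictions.

    Illustrative model only: every "tree" predicts the true value (0.0)
    plus independent Gaussian noise with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    averaged = []
    for _ in range(trials):
        # One simulated forest: num_trees independent noisy predictions.
        predictions = [rng.gauss(0.0, noise_sd) for _ in range(num_trees)]
        # The forest's output is the mean of its trees' predictions.
        averaged.append(sum(predictions) / num_trees)
    return statistics.pvariance(averaged)


if __name__ == "__main__":
    for n in (1, 3, 10, 50):
        print(n, round(simulate_prediction_variance(n), 4))
```

Under the independence assumption the variance shrinks roughly in proportion to 1/numTrees, while the work done per simulated forest grows linearly with the number of trees, which mirrors the trade-off described above.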

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.
The first two parameters we mention are the most important, and tuning them can often improve performance:
（1）numTrees: Number of trees in the forest.
Increasing the number of trees will decrease the variance in predictions, improving the model’s test-time accuracy.
Training time increases roughly linearly in the number of trees.
（2）maxDepth: Maximum depth of each tree in the forest.
Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
（3）subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
（4）featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
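To make featureSubsetStrategy concrete, the sketch below turns a strategy name into a number of candidate features per node. The helper `features_per_node` is hypothetical and not part of pyspark. The rounding rules and the way "auto" is resolved reflect my understanding of Spark's behavior and should be checked against the Spark version you run. The feature count of 100 is an arbitrary example value.

```python
import math


def features_per_node(strategy, num_features, num_trees=3, is_classification=True):
    """Hypothetical helper: candidate features considered at each tree node.

    Assumed rules (verify against your Spark version):
      auto     -> all for a single tree; sqrt for a classification forest;
                  onethird for a regression forest
      all      -> every feature
      sqrt     -> ceil(sqrt(n))
      log2     -> max(1, ceil(log2(n)))
      onethird -> ceil(n / 3)
    """
    if strategy == "auto":
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unsupported strategy: " + strategy)


if __name__ == "__main__":
    for name in ("auto", "all", "sqrt", "log2", "onethird"):
        print(name, features_per_node(name, num_features=100))
```

Fewer candidates per node means less work per split, which is where the training speed-up comes from. Too few candidates can leave a node without any informative feature to split on.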

"""
Random Forest Classification Example.
"""
from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonRandomForestClassificationExample")

    # Load and parse the data file into an RDD of LabeledPoint.
    data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    # Train a RandomForest model.
    #  Empty categoricalFeaturesInfo indicates all features are continuous.
    #  Note: Use larger numTrees in practice.
    #  Setting featureSubsetStrategy="auto" lets the algorithm choose.
    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                         numTrees=3, featureSubsetStrategy="auto",
                                         impurity='gini', maxDepth=4, maxBins=32)

    # Evaluate model on test instances and compute test error
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # Index into the (label, prediction) pair: "lambda (v, p): ..." is Python 2 only syntax.
    testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification forest model:')
    print(model.toDebugString())

    # Save and load model
    model.save(sc, "target/tmp/myRandomForestClassificationModel")
    sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

What the learned model looks like:

TreeEnsembleModel classifier with 3 trees

  Tree 0:
    If (feature 511 <= 0.0)
     If (feature 434 <= 0.0)
      Predict: 0.0
     Else (feature 434 > 0.0)
      Predict: 1.0
    Else (feature 511 > 0.0)
     Predict: 0.0
  Tree 1:
    If (feature 490 <= 31.0)
     Predict: 0.0
    Else (feature 490 > 31.0)
     Predict: 1.0
  Tree 2:
    If (feature 302 <= 0.0)
     If (feature 461 <= 0.0)
      If (feature 208 <= 107.0)
       Predict: 1.0
      Else (feature 208 > 107.0)
       Predict: 0.0
     Else (feature 461 > 0.0)
      Predict: 1.0
    Else (feature 302 > 0.0)
     Predict: 0.0
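To read this output as a predictor, the three trees can be hand-translated into plain Python and combined by majority vote. This is a sketch, not Spark code: the function names are made up, a feature vector is assumed to be a dict mapping feature index to value with missing entries treated as 0.0, and majority voting is my understanding of how MLlib combines the trees of a classification forest.

```python
from collections import Counter


def tree_0(f):
    # If (feature 511 <= 0.0) then split on feature 434, else predict 0.0
    if f.get(511, 0.0) <= 0.0:
        return 0.0 if f.get(434, 0.0) <= 0.0 else 1.0
    return 0.0


def tree_1(f):
    # A single split on feature 490
    return 0.0 if f.get(490, 0.0) <= 31.0 else 1.0


def tree_2(f):
    # Nested splits on features 302, 461 and 208
    if f.get(302, 0.0) <= 0.0:
        if f.get(461, 0.0) <= 0.0:
            return 1.0 if f.get(208, 0.0) <= 107.0 else 0.0
        return 1.0
    return 0.0


def forest_predict(features):
    """Majority vote over the three trees (assumed combining rule)."""
    votes = Counter(tree(features) for tree in (tree_0, tree_1, tree_2))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(forest_predict({}))
    print(forest_predict({434: 5.0, 490: 100.0}))
```

With three trees and two classes a vote can never tie, so the prediction is always the label chosen by at least two of the trees.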

Reposted from: https://www.cnblogs.com/bonelee/p/7204096.html
