by Parminder Singh

What I’ve learned from competing in machine learning contests on Kaggle

Recently I decided to get more serious about my data science skills, so I went looking for a place to practice. That search led me to Kaggle.

The experience has been very positive.

When I arrived at Kaggle, I was confused about what to do and how everything works. This article will help you to overcome the confusion that I experienced.

I joined the Redefining Cancer Treatment contest because it was for a noble cause. The data was also more manageable because it was text-based.

Where to code

What makes Kaggle great is that you don’t need a cloud server that creates results for you. Kaggle has a feature where you can run scripts and notebooks inside Kaggle for free, as long as they finish executing within an hour. I used Kaggle’s notebooks for many of my submissions, and experimented with many variables.

Overall it was a great experience.

For the contests, you need to use images or have a large corpus of text, and you will need a fast personal computer (PC) or a cloud container. My PC is crappy, so I used Amazon Web Services’ (AWS) c4.2xlarge instance. It was powerful enough for the text and cost only $0.40 per hour. I also had a free $150 credit from the GitHub student developer pack, so I didn’t need to worry about the cost.

Later, when I took part in the Dog Breed Identification playground contest, I worked a lot with images, so I had to upgrade my instance to g2.2xlarge. It cost $0.65 per hour, but it had graphics processing unit (GPU) power, so it could process thousands of images in just a few minutes.

The g2.2xlarge instance was still not large enough to hold all of the data I worked with, so I cached the intermediate data as files and deleted the data from RAM using del <variable name>. This helped me avoid a ResourceExhaustionError or a MemoryError. Both were equally disheartening.
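The pattern is simple. Here is a minimal sketch in NumPy (the file name and array shape are made up for illustration):

```python
import gc

import numpy as np

# Stand-in for an expensive intermediate result (in the contest this came
# from feature extraction over the image data).
features = np.random.rand(10_000, 50)

# Cache it to disk so a later run can skip the expensive step...
np.save("features_cache.npy", features)

# ...then free the RAM before the next memory-hungry step.
del features
gc.collect()

# Reload only when the cached data is needed again.
features = np.load("features_cache.npy")
```

The same idea applies to any large intermediate object: save it, delete the reference, and let the garbage collector reclaim the memory.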

How to get started with Kaggle competitions

It’s not as scary as it sounds. The Discussion and Kernel tabs for every contest are a marvellous way to get started. A few days after the start of a contest, you will see several starter kernels appear in the Kernel tab. You can use these to get started.

Instead of handling the loading and creation of submissions yourself, just deal with the manipulation of the data. I prefer the XGBoost starter kernels. Their code is always short and ranks high on the leaderboards.

Extreme Gradient Boosting (XGBoost) is based on the decision tree model. It is very fast and amazingly accurate, even with default parameters. For large data I prefer to use the Light Gradient Boosting Machine (LightGBM). It is similar in concept to XGBoost, but approaches the problem a bit differently. There is a catch: it is not as accurate. So you can experiment using LightGBM, and when you know it is working well, switch to XGBoost (they have similar APIs).

Check the discussions every few days to see if someone has found a new approach. If someone does, use it in your script and test to see if you benefit from it.

How to go up in the leaderboard

So you have your starter code cooked and want to rise higher? There are many possible approaches:

  • Cross validation (CV): Always split the training data into 80% and 20%. That way, when you train on 80% of the data, you can check the model against the remaining 20% to see whether it is any good. To quote the discussion boards on Kaggle, “Always trust your CV more than the leaderboard.” The public leaderboard is scored on only 50% to 70% of the actual test set, so you cannot judge the quality of your solution from those percentages alone. Sometimes your model might be great overall but perform badly on the particular data in the public test set.

  • Cache your intermediate data: You will do less work next time. Focus on a specific step rather than running everything from the start. Almost all Python objects can be pickled, but for efficiency, prefer the .save() and .load() functions of the library you are using.

  • Use GridSearchCV: It is a great module that lets you provide a set of candidate values for each parameter. It tries all possible combinations until it finds the optimal set. This is great automation for optimization. A finely tuned XGBoost can beat a generic neural network on many problems.

  • Use the model appropriate to the problem: Using a knife in a gunfight is not a good idea. I have a simple approach: for text data, use XGBoost or a Keras LSTM. For image data, use a pre-trained Keras model (I use Inception most of the time) with some custom bottleneck layers.

  • Combine models: Using a kitchen knife for everything is not enough. You need a Swiss army knife. Try combining various models to get even more accurate predictions. For example, the Inception and Xception models together work great for image data. Combined models take a lot of RAM, which g2.2xlarge might not provide, so avoid them unless you really want that accuracy boost.

  • Feature extraction: Make the work easier for the model by extracting multiple simpler features from one feature, or combining several features into one. For example, you can extract the country and area code from a phone number. Models are not very intelligent; they are just algorithms that fit data. So make sure that the data is in a shape that allows an optimal fit.
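The first and third points above combine naturally: tune on an 80% split, then sanity-check on the untouched 20%. A minimal sketch with scikit-learn (toy data and a deliberately tiny parameter grid; real grids are larger):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in data for the contest's training set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The 80/20 split: tune on the 80%, sanity-check on the held-out 20%.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# GridSearchCV tries every combination in the grid and keeps the best
# parameter set by cross-validated score.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_holdout, y_holdout))
```

The hold-out score is the number to trust: it is computed on data the search never saw, which is exactly the "trust your CV more than the leaderboard" advice in practice.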

What else to do on Kaggle

Other than being a competition platform for data science, Kaggle is also a platform for exploring datasets and creating kernels that explore insights into the data.

So you can choose any dataset from the top five on the datasets page and just go with it. The data might be weird, and you might experience difficulty as a beginner. What matters is that you analyze the data and create visualizations for it, which contributes to your learning.

Which libraries to use for analysis

  • For visualizations, explore the seaborn and matplotlib libraries.
  • For data manipulation, explore NumPy and pandas.
  • For data preprocessing, explore the sklearn.preprocessing module.
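A tiny sketch of how the manipulation and preprocessing pieces fit together (the column names and numbers here are invented; plotting with seaborn/matplotlib is left out so the snippet runs headless):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up CPU-style data, just to have something to manipulate.
df = pd.DataFrame({"clock_ghz": [2.4, 3.1, 3.8, 2.9],
                   "cores": [4, 8, 16, 8]})

# pandas/NumPy for manipulation: derive a new feature column...
df["ghz_per_core"] = df["clock_ghz"] / df["cores"]

# ...and sklearn.preprocessing to standardize features before modelling.
scaled = StandardScaler().fit_transform(df)
print(scaled.mean(axis=0))  # every column is now centred at (roughly) zero
```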

The pandas library has some basic plot functions too, and they are extremely convenient:

intel_sorted["Instruction_Set"].value_counts().plot(kind='pie')

That one line of code makes a pie chart of the “Instruction_Set” column. And the best thing is that it still looks pretty.

Why do all this

Machine learning is a beautiful field with lots of development going on. Participating in these contests will help you to learn a lot about algorithms and the various approaches to data. I myself learned a lot of these things from Kaggle.

Also, to be able to say, “My AI is in the top 15% for <insert contest name here>” is pretty dope.

Some extras from my journey

The graph below represents my kernel’s exploration of the Intel CPU dataset on Kaggle:

My solution for the Redefining Cancer Treatment contest:

That’s all, folks.

Thanks for reading. I hope I made you feel more confident about participating in Kaggle’s contests.

See you on the leaderboards.

Originally published at https://www.freecodecamp.org/news/what-i-learned-from-kaggle-contests-d3123e17a36b/
