Tabular Data Binary Classification: All Tips and Tricks from 5 Kaggle Competitions

This article was originally written by Shahul ES and posted on the Neptune blog.

In this article, I will discuss tips and tricks for improving the performance of a binary classification model on structured data. They are drawn from the solutions to some of Kaggle's top tabular data competitions. Without further ado, let's begin.

These are the five competitions I went through to write this article:

Home Credit Default Risk
Santander Customer Transaction Prediction
VSB Power Line Fault Detection
Microsoft Malware Prediction
IEEE-CIS Fraud Detection

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your dataset. If your data is large (3 GB or more, for Kaggle kernels and more basic laptops), you may find it difficult to load and process with limited resources. Here are some of the articles and kernels that I have found useful in such situations.

Faster data loading with pandas.
Data compression techniques to reduce the size of data by 70%.
Optimize memory by reducing the size of some attributes.
Use open-source libraries such as Dask to read and manipulate the data; they perform parallel computing and save memory.
Use cuDF.
Convert data to Parquet format.
Convert data to Feather format.
Reduce memory usage to optimize RAM.
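
As a rough illustration of the memory-reduction idea in the list above, the sketch below downcasts numeric columns to the smallest dtype that still holds their values. The helper name `reduce_mem_usage` and the toy frame are assumptions made for this example, not code taken from the linked kernels, and the savings on real data will differ.

```python
import numpy as np
import pandas as pd


def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 keeps roughly 7 significant digits; check this is enough
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for a large competition table (hypothetical columns)
df = pd.DataFrame({
    "id": np.arange(100_000, dtype=np.int64),
    "amount": np.random.default_rng(0).random(100_000),
    "flag": np.random.default_rng(1).integers(0, 2, 100_000),
})
small = reduce_mem_usage(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```

Downcasting floats trades precision for memory, so it is worth checking that model scores do not change after the conversion.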

Data exploration

Data exploration always helps you understand the data better and gain insights from it. Before developing machine learning models, top competitors always read and do a lot of exploratory data analysis (EDA). This helps with feature engineering and data cleaning.

EDA for Microsoft malware detection.
Time series EDA for malware detection.
Complete EDA for home credit loan prediction.
Complete EDA for Santander prediction.
EDA for VSB Power Line Fault Detection.
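
A first pass at this kind of exploration might look like the sketch below, which reports the dtype, missing share, and cardinality of each column, plus the class balance of the target. The function `quick_eda` and the toy columns are hypothetical; the linked kernels go much further, with plots and domain-specific checks.

```python
import numpy as np
import pandas as pd


def quick_eda(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Summarize dtype, missing share, and cardinality for each column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    print("Target balance:")
    print(df[target].value_counts(normalize=True))
    # Columns with the most missing values come first
    return summary.sort_values("missing_frac", ascending=False)


df = pd.DataFrame({
    "age": [25, 40, np.nan, 31, 58, np.nan],
    "city": ["A", "B", "A", None, "C", "A"],
    "target": [0, 0, 1, 0, 0, 1],
})
report = quick_eda(df, target="target")
print(report)
```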

Data preparation

After data exploration, the first thing to do is to use those insights to prepare the data and tackle issues such as class imbalance and the encoding of categorical data. Let's see the methods used to do it.

Methods to tackle class imbalance.
Data augmentation by Synthetic Minority Oversampling Technique (SMOTE).
Fast in-place shuffle for augmentation.
Finding synthetic samples in the dataset.
Signal denoising used in signal processing competitions.
Finding patterns of missing data.
Methods to handle missing data.
An overview of various encoding techniques for categorical data.
Building a model to predict missing values.
Random shuffling of data to create a new synthetic training set.
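
The shuffle-based augmentation mentioned in the list can be sketched as follows: synthetic rows for one class are built by permuting each column independently among the rows of that class. This is only a minimal sketch, and it assumes the features are roughly independent given the class; the function name `shuffle_augment` and its parameters are made up for this example.

```python
import numpy as np


def shuffle_augment(X, y, label=1, n_copies=2, seed=0):
    """Append synthetic rows for `label`, made by shuffling each column within that class."""
    rng = np.random.default_rng(seed)
    rows = X[y == label]
    new_X, new_y = [X], [y]
    for _ in range(n_copies):
        synthetic = rows.copy()
        for j in range(synthetic.shape[1]):
            # Permute column j among rows of the same class only
            synthetic[:, j] = rng.permutation(synthetic[:, j])
        new_X.append(synthetic)
        new_y.append(np.full(len(synthetic), label, dtype=y.dtype))
    return np.vstack(new_X), np.concatenate(new_y)


# Toy imbalanced data: 20 positives out of 200 rows
X = np.random.default_rng(0).normal(size=(200, 5))
y = np.zeros(200, dtype=int)
y[:20] = 1
X_aug, y_aug = shuffle_augment(X, y, label=1, n_copies=2)
print(X_aug.shape, y_aug.mean())
```

Augmentation like this belongs inside the training fold only; applying it before the train-validation split would leak validation rows into training.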

Feature engineering

Next, you can check the most popular features and feature engineering techniques used in these top Kaggle competitions. Feature engineering varies from problem to problem, depending on the domain.

Target encoding with cross-validation for better encoding.
Entity embedding to handle categories.
Encoding cyclic features for deep learning.
Manual feature engineering methods.
Automated feature engineering techniques using featuretools.
Top hand-crafted features used in Microsoft malware detection.
Denoising NN for feature extraction.
Feature engineering using the RAPIDS framework.
Things to remember while processing features using LGBM.
Lag features and moving averages.
Principal component analysis for dimensionality reduction.
LDA for dimensionality reduction.
Best hand-crafted LGBM features for Microsoft malware detection.
Generating frequency features.
Dropping variables with different train and test distributions.
Aggregate time series features for the Home Credit competition.
Time series features used in Home Credit Default Risk.
Scale, standardize, and normalize with sklearn.
Handcrafted features for the Home Credit Default Risk competition.
Handcrafted features used in Santander Transaction Prediction.
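
To make two of the items above concrete, here is a minimal sketch of out-of-fold target encoding and of a frequency feature. Each row is encoded with target statistics computed on the other folds, which limits target leakage. The function name, the smoothing value, and the toy data are assumptions for illustration, not the code from the linked kernels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding with additive smoothing."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        cat_fit, y_fit = cat.iloc[fit_idx], y.iloc[fit_idx]
        prior = y_fit.mean()  # fallback for rare or unseen categories
        stats = y_fit.groupby(cat_fit.to_numpy()).agg(["sum", "count"])
        smooth = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded


rng = np.random.default_rng(0)
cat = pd.Series(rng.choice(["a", "b", "c"], size=300))
y = pd.Series((rng.random(300) < 0.3).astype(int))

te = target_encode_oof(cat, y)
# Frequency feature: share of rows that carry each category
freq = cat.map(cat.value_counts(normalize=True))
print(te.head(), freq.head(), sep="\n")
```

For the test set, the usual approach is to encode with statistics computed on the full training set.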

Feature selection

After generating many features from your data, you need to decide which of them to use in your model to get the best performance. This step also includes identifying the impact each feature has on your model. Let's see some of the most popular feature selection methods.

Six ways to do feature selection using sklearn.
Permutation feature importance.
Adversarial feature validation.
Feature selection using null importances.
Tree explainer using SHAP.
DeepNN explainer using SHAP.
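
Permutation feature importance, one of the methods listed above, can be sketched with scikit-learn's `permutation_importance`. The synthetic dataset and the random forest are assumptions chosen to keep the example self-contained; with `shuffle=False`, the three informative features are columns 0 to 2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = drop in validation AUC when a single column is shuffled
result = permutation_importance(
    model, X_va, y_va, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Measuring importance on held-out data rather than on the training set avoids rewarding features the model has merely memorized.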

Modeling

After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. This is a collection of some of the most used ML models in structured data classification challenges.

Random forest classifier.
XGBoost: gradient boosted decision trees.
LightGBM for distributed and faster training.
CatBoost to handle categorical data.
Naive Bayes classifier.
Gaussian naive Bayes model.
LGBM + CNN model used in the 3rd place solution of Santander Customer Transaction Prediction.
Knowledge distillation in neural networks.
Follow the regularized leader method.
Comparison between LGB boosting methods (goss, gbdt and dart).
NN + focal loss experiment.
Keras NN with time series splitter.
5th place NN architecture with code for Santander Transaction Prediction.

Hyperparameter tuning

LGBM hyperparameter tuning methods.
Automated model tuning methods.
Parameter tuning with hyperopt.
Bayesian optimization for hyperparameter tuning.
GPyOpt hyperparameter optimization.

Evaluation

Choosing a suitable validation strategy is very important for avoiding huge shake-ups or poor model performance on the private test set.

A traditional 80:20 split does not work well in many cases. In most cases, cross-validation gives a better estimate of model performance than a single train-validation split.

There are different variations of KFold cross-validation, such as group k-fold, and you should choose among them according to your data.
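
The sketch below contrasts three scikit-learn splitters on toy data, to show what each variation guarantees. The data, the 10% positive rate, and the `groups` array (imagine a customer id) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

n = 120
X = np.random.default_rng(0).normal(size=(n, 3))
y = np.zeros(n, dtype=int)
y[::10] = 1                              # 12 positives, 10% of rows
groups = np.repeat(np.arange(12), 10)    # hypothetical customer ids, 10 rows each

# StratifiedKFold keeps the positive rate equal across folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
positives_per_fold = [int(y[va].sum()) for _, va in skf.split(X, y)]

# GroupKFold never puts the same group in both train and validation
gkf = GroupKFold(n_splits=3)
group_overlap = [
    len(set(groups[tr]) & set(groups[va])) for tr, va in gkf.split(X, y, groups)
]

# TimeSeriesSplit validates only on rows that come after the training rows
tss = TimeSeriesSplit(n_splits=3)
train_before_valid = [bool(tr.max() < va.min()) for tr, va in tss.split(X)]

print(positives_per_fold, group_overlap, train_before_valid)
```

As a rule of thumb, stratify when classes are imbalanced, group when rows from one entity are correlated, and split by time when the test set lies in the future.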

K-fold cross-validation.
Stratified KFold cross-validation.
Group KFold.
Adversarial validation to check whether train and test distributions are similar.
Time series split validation.
Extensive time series splitter.
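
Adversarial validation, listed above, can be sketched as follows: label training rows 0 and test rows 1, train a classifier to tell them apart, and read the cross-validated AUC. A value near 0.5 suggests similar distributions, while a value close to 1 suggests a shift. The function name, the random forest, and the synthetic shift are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def adversarial_auc(train_X, test_X, seed=0):
    """Cross-validated AUC of a classifier that separates train rows from test rows."""
    X = np.vstack([train_X, test_X])
    is_test = np.r_[np.zeros(len(train_X)), np.ones(len(test_X))].astype(int)
    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=seed
    )
    return cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()


rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5))
test_same = rng.normal(size=(500, 5))
test_shifted = rng.normal(size=(500, 5))
test_shifted[:, 0] += 2.0                # simulate drift in one feature

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
print(f"same distribution: {auc_same:.3f}, shifted: {auc_shifted:.3f}")
```

When the AUC is high, the classifier's feature importances point to the drifting columns, which are candidates for dropping or transforming.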

Note:

There are various metrics that you can use to evaluate the performance of your tabular models. A number of useful classification metrics are listed and explained here.
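
As a small illustration, a few common binary classification metrics can be computed with scikit-learn as below. The labels and probabilities are made up, and the 0.5 threshold used for F1 is an assumption that is rarely optimal on imbalanced data.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score, f1_score, log_loss, roc_auc_score
)

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.1, 0.4, 0.2, 0.05, 0.9])

auc = roc_auc_score(y_true, y_prob)            # ranking quality, threshold-free
ap = average_precision_score(y_true, y_prob)   # area under the precision-recall curve
ll = log_loss(y_true, y_prob)                  # penalizes over-confident errors
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int))  # needs a hard threshold

print(f"AUC={auc:.3f}  AP={ap:.3f}  logloss={ll:.3f}  F1={f1:.3f}")
```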

Other training tricks

GPU acceleration for LGBM.
Use the GPU efficiently.
Free Keras memory.
Save and load models to save runtime and memory.
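
Saving and reloading a fitted model might look like the sketch below, which uses `joblib` with a scikit-learn estimator. The logistic regression, the file name, and the temporary directory are assumptions for this example; gradient boosting libraries and Keras ship their own save and load functions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"    # hypothetical file name
    joblib.dump(model, path)             # write the fitted model to disk
    restored = joblib.load(path)         # later: reload instead of retraining

same = bool(np.allclose(model.predict_proba(X), restored.predict_proba(X)))
print("predictions identical after reload:", same)
```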

Ensemble

In a competitive environment, you won't get to the top of the leaderboard without ensembling. Selecting the appropriate ensembling or stacking method is very important for getting the maximum performance out of your models.

Let's see some of the popular ensembling techniques used in Kaggle competitions:

Weighted average ensemble.
Stacked generalization ensemble.
Out-of-fold predictions.
Blending with linear regression.
Use optuna to determine blending weights.
Power average ensemble.
Power 3.5 blending strategy.
Blending diverse models.
Different stacking approaches.
AUC weight optimization.
Geometric mean for low correlation predictions.
Weighted rank average.
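
Several of the blending schemes listed above reduce to a few lines of NumPy. The sketch below shows one common reading of each; the function names are made up, and the power average in particular (mean of predictions raised to a power) is an assumption about what the linked kernels do.

```python
import numpy as np
from scipy.stats import rankdata


def weighted_average(preds, weights):
    """Weighted mean over models; preds has shape (n_models, n_samples)."""
    w = np.asarray(weights, dtype=float)
    return np.average(np.asarray(preds, dtype=float), axis=0, weights=w / w.sum())


def power_average(preds, power=3.5):
    """Mean of predictions raised to a power, which favors confident positives."""
    return np.mean(np.asarray(preds, dtype=float) ** power, axis=0)


def geometric_mean(preds, eps=1e-12):
    """Geometric mean of probabilities, computed in log space."""
    p = np.clip(np.asarray(preds, dtype=float), eps, 1.0)
    return np.exp(np.mean(np.log(p), axis=0))


def weighted_rank_average(preds, weights):
    """Average of normalized ranks; useful for AUC, where only order matters."""
    ranks = np.array([rankdata(p) / len(p) for p in preds])
    return weighted_average(ranks, weights)


preds = np.array([
    [0.20, 0.80, 0.55, 0.10],   # model A
    [0.40, 0.60, 0.70, 0.05],   # model B
])
print(weighted_average(preds, [3, 1]))
print(power_average(preds))
print(geometric_mean(preds))
print(weighted_rank_average(preds, [1, 1]))
```

Blend weights should be chosen on out-of-fold predictions, not on the training predictions, or the blend will overfit.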

Final thoughts

In this article, you saw many popular and effective ways to improve the performance of your tabular data binary classification model. Hopefully, you will find them useful in your projects.

This article was originally written by Shahul ES and posted on the Neptune blog. You can find more in-depth articles for machine learning practitioners there.


Translated from: https://medium.com/neptune-ai/tabular-data-binary-classification-all-tips-and-tricks-from-5-kaggle-competitions-51667b21876e
