Python Machine Learning Pipeline

Below are the usual steps involved in building the ML pipeline:

Import Data
Exploratory Data Analysis (EDA)
Missing Value Imputation
Outlier Treatment
Feature Engineering
Model Building
Feature Selection
Model Interpretation
Save the Model
Model Deployment

Problem Statement and Getting the Data

I'm using a relatively large and complex data set to demonstrate the process: the one from the Kaggle competition IEEE-CIS Fraud Detection.

Navigate to Data Explorer and you will see something like this:

Select train_transaction.csv and you will see a preview of what the data looks like. Click the download icon highlighted by a red arrow to get the data.

Other than the usual library imports, you will need to make sure 2 additional libraries are installed:

pip install pyarrow
pip install fast_ml

Key Highlights

This is the first article in a series on building a machine learning pipeline. In this article, we will focus on optimizations around importing the data into a Jupyter notebook and making things run faster.

There are 3 key things to note in this article:

Python zipfile
Reducing the memory usage of the dataset
A faster way of saving/loading working datasets

1: Import Data

After you have downloaded the zipped file, it is convenient to unzip it with Python rather than by hand.

Tip 1: Use the zipfile module from the Python standard library to unzip the file.

import zipfile

with zipfile.ZipFile('train_transaction.csv.zip', mode='r') as zip_ref:
    zip_ref.extractall('data/')

This will create a folder data and unzip the CSV file train_transaction.csv in that folder.
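As a hedged illustration of the same zipfile calls, the sketch below builds a tiny stand-in archive in a temporary directory (the file name demo.csv and its contents are made up for the example), lists what the archive contains, and extracts it. It only shows that extractall creates the target folder when it is missing; it is not tied to the Kaggle file.

```python
import os
import tempfile
import zipfile

# Build a tiny stand-in archive so the example runs anywhere
# (demo.csv and its contents are hypothetical)
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, 'demo.csv.zip')
with zipfile.ZipFile(archive_path, mode='w') as zf:
    zf.writestr('demo.csv', 'id,amount\n1,9.5\n2,3.0\n')

target_dir = os.path.join(workdir, 'data')
with zipfile.ZipFile(archive_path, mode='r') as zip_ref:
    members = zip_ref.namelist()    # file names stored in the archive
    zip_ref.extractall(target_dir)  # creates target_dir if it does not exist

extracted = sorted(os.listdir(target_dir))
print(members, extracted)
```

Checking namelist() before extracting is a cheap way to confirm what an unfamiliar archive will write to disk.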

We will use the pandas read_csv function to load the data set into the Jupyter notebook.

import pandas as pd

# The CSV was extracted into the data/ folder in the previous step
%time trans = pd.read_csv('data/train_transaction.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 23.2 s, sys: 7.87 s, total: 31 s
Wall time: 32.5 s
Memory usage of dataframe is 1775.1524047851562 MB
Shape of dataframe is (590540, 394)

In memory, this data takes about 1.7 GB (1,775 MB) and has more than half a million rows (590,540) and 394 columns.

Tip 2: We will use a function from fast_ml to reduce this memory usage.

from fast_ml.utilities import reduce_memory_usage

%time trans = reduce_memory_usage(trans, convert_to_category=False)

---- Output ----
Memory usage of dataframe is 1775.15 MB
Memory usage after optimization is: 542.35 MB
Decreased by 69.4%
CPU times: user 2min 25s, sys: 2min 57s, total: 5min 23s
Wall time: 5min 56s

This step took almost 6 minutes of wall time, but it reduced the memory usage by almost 70%, from 1,775 MB to 542 MB. That is quite a significant reduction.
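To give a feel for where such savings can come from, here is a minimal sketch of numeric downcasting using plain pandas. This is not fast_ml's implementation; downcast_numeric is a hypothetical helper and the toy data is made up. It relies only on pd.to_numeric with its downcast parameter, which picks the smallest integer or float dtype that can hold the values.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink numeric columns to the smallest dtype that fits."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# Toy data: pandas defaults to 64-bit types for both columns
demo = pd.DataFrame({
    'count': np.arange(1000, dtype='int64'),
    'amount': np.arange(1000, dtype='float64') * 0.5,
})

before = demo.memory_usage(index=False).sum()
small = downcast_numeric(demo)
after = small.memory_usage(index=False).sum()

print(small.dtypes.to_dict())
print(f'{before} bytes before, {after} bytes after')
```

Downcasting floats trades precision for memory, so it is worth checking that the columns involved tolerate 32-bit precision before applying it to real data.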

For further analysis, we will create a sample dataset of 200k records just so that our data processing steps won’t take a long time to run.

# Take a sample of 200k records
%time trans = trans.sample(n=200000)

# Reset the index, because sampling leaves it shuffled
trans.reset_index(inplace=True, drop=True)

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of sample dataframe is {df_size} MB')

---- Output ----
CPU times: user 1.39 s, sys: 776 ms, total: 2.16 s
Wall time: 2.43 s
Memory usage of sample dataframe is 185.20355224609375 MB
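One optional variation, shown here as a sketch on made-up toy data: sample draws a different subset on every run unless a seed is given. Passing random_state (the value 42 is arbitrary) makes the sample repeatable, which can help when later steps need to be compared across notebook restarts.

```python
import pandas as pd

# Toy dataframe standing in for the full transaction table
demo = pd.DataFrame({'id': range(10)})

# The same seed yields the same rows on every call
sample_a = demo.sample(n=4, random_state=42).reset_index(drop=True)
sample_b = demo.sample(n=4, random_state=42).reset_index(drop=True)

print(sample_a.equals(sample_b))
```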

Now we will save this sample to the local drive in CSV format:

import os

os.makedirs('data', exist_ok=True)

# Write the sample as CSV; index=False keeps the row index out of the file
trans.to_csv('data/train_transaction_sample.csv', index=False)

Tip 3: Use the feather format instead of CSV.

import os

os.makedirs('data', exist_ok=True)
trans.to_feather('data/train_transaction_sample')

When you load the data back from these 2 files, you will observe a significant difference in performance.

Load the saved sample data: CSV format

%time trans = pd.read_csv('data/train_transaction_sample.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 7.37 s, sys: 1.06 s, total: 8.42 s
Wall time: 8.5 s
Memory usage of dataframe is 601.1964111328125 MB
Shape of dataframe is (200000, 394)

Load the saved sample data: feather format

# Read from the same data/ folder the feather file was saved to
%time trans = pd.read_feather('data/train_transaction_sample')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 1.32 s, sys: 930 ms, total: 2.25 s
Wall time: 892 ms
Memory usage of dataframe is 183.67779541015625 MB
Shape of dataframe is (200000, 394)

Notice 2 things here:

i. Loading the CSV file took almost 10 times as long as loading the feather file (8.5 s versus 892 ms of wall time).

ii. The feather file retained the reduced memory footprint (about 184 MB), whereas the data loaded from CSV is back to about 601 MB, so we would have to run the reduce_memory_usage function again.
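The second point follows from the fact that CSV is plain text and carries no dtype information, so pandas infers 64-bit types again on load, while feather stores the column types alongside the data. A minimal sketch of the CSV half of that effect, using a made-up toy dataframe and an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data with compact dtypes, standing in for the reduced dataframe
demo = pd.DataFrame({
    'flag': np.array([0, 1, 1, 0], dtype='int8'),
    'amount': np.array([0.5, 1.5, 2.5, 3.5], dtype='float32'),
})

# Round-trip through CSV text held in memory
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

bytes_before = demo.memory_usage(index=False).sum()
bytes_after = reloaded.memory_usage(index=False).sum()

print(demo.dtypes.to_dict())
print(reloaded.dtypes.to_dict())
print(bytes_before, bytes_after)
```

If CSV has to be used, passing an explicit dtype mapping to read_csv is one way to avoid re-running the memory reduction after every load.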

Closing Notes:

Please feel free to share your thoughts, suggestions, and feedback.
We will use the new sample data set that we created for our further analysis.
We will talk about Exploratory Data Analysis in the next article.
Github link

Translated from: https://towardsdatascience.com/building-a-machine-learning-pipeline-part-1-b19f8c8317ae
