
Cross-Column-Based Data Manipulation in Python

Let's suppose your manager gives you a random dataset and tells you to do some "basic" cleaning: "Keep only the records that have values in columns A, B, and C, or the records that don't have any value in those three columns at all." How would you approach that?


Introduction

Feature engineering can be really basic, such as scaling and encoding, but it can sometimes be mysterious. Recently, I came across this type of cross-column-based requirement, and it got me thinking: how can cleaning logic sound so simple in words yet be inherently complicated in data cleaning and feature engineering?


Straightforward in words, not so intuitive in code

Most of the data manipulation skills we learn from existing courses focus on row-wise applications. These functions are well covered by data analysis packages such as Pandas, meaning analysts and data scientists don't have to create new applications from scratch. However, things are not that easy when it comes to column-wise applications, especially ones involving multiple columns. If we want to perform the simple request I mentioned in the introduction, we have to mix something more advanced, such as logical relationships, into the process.


In this post, I will demonstrate a basic use case and a more complicated situation involving cross-column manipulation. I will also include the complete manipulation functions, showcasing how to apply them in a regular ETL process. The techniques shared in this article include:


- Computing cross-column aggregated metrics for numerical values

- Validating records with cross-column logical operations

Introducing the Dataset

Last spring, my friend and I competed in Datathon 2020, hosted by Brown University. The topic was predicting where prospective homebuyers are, based on credit-bureau data and demographic data at the zip code level. The features include metrics such as bank card limit, number of open bank cards, and balance of the most recent mortgage.


The provided datasets cover April to September 2019, and the CSV files are split accordingly. There are over 36,000,000 rows and over 300 columns in total, and the files add up to about 10GB. I will use this dataset to demonstrate the ideas.


A snippet of the data set


Problem 1: Reducing cross-dataset aggregated metrics with an aggregation method

Simply put, the data is a mess. After combining datasets across different time periods, the total size of the data could freeze a regular laptop whenever a simple computation is applied. Thus, I wanted to come up with a way to generalize the same metric across different months (e.g., there is a "number of open bank cards" metric for each month; I'd like to reduce the total number of features but keep that information).


Various machine learning techniques (e.g., PCA, clustering) could achieve similar effects. However, I would transform the features in the most direct fashion, which is averaging.


Solution 1: Utilizing Pandas built-in calculations cautiously

That is just saying that we need to average the metrics over time. We can easily achieve that with pd.DataFrame.mean(). The Pandas package also provides various aggregation functions, such as sum() and std().

The expected result

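As a minimal sketch (with hypothetical column names mirroring the month-suffix pattern used later), averaging a metric across months is just a row-wise mean over the monthly columns:

import pandas as pd

# two records with the same metric reported for three months
df = pd.DataFrame({'open_cards_4': [2, 5],
                   'open_cards_5': [3, 5],
                   'open_cards_6': [4, 5]})

# axis='columns' averages across the monthly columns within each row
df['open_cards_avg'] = df[['open_cards_4', 'open_cards_5', 'open_cards_6']].mean(axis='columns')
print(df['open_cards_avg'].tolist())  # [3.0, 5.0]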

This is a relatively easy task. However, the trick is setting up the correct pipeline that organizes the new columns and builds up a new data frame. Here are a few tips worth mentioning:


- pd.DataFrame() doesn't let you specify the data type of each column individually the way pd.read_csv() does (it is weird, I know). To accommodate this limitation, it is recommended to apply astype() after building the new data frame.

- Also, if you have different types of numerical data to assign at the same time, watch out for possible loss of information when using lists of values to build a pd.DataFrame(). Auto-conversion between integer and floating-point can mangle values without any warning! (Not to mention that numeric data converted to string values can also lose a few digits in the process.) A sketch of this pitfall follows below.
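Here is a minimal sketch of that pitfall with made-up values: routing an integer column and a float column through a single NumPy array silently upcasts everything to float, and astype() restores the intended type afterward:

import numpy as np
import pandas as pd

ids = [10001, 10002]   # meant to stay integers (think zip codes)
avgs = [3.5, 7.25]     # genuine floating-point metrics

# building from a single array forces one common dtype on both columns
df = pd.DataFrame(np.array([ids, avgs]).T, columns=['zip5', 'metric_avg'])
print(df.dtypes)       # both columns are float64 now

# restore the intended type after construction, as suggested above
df = df.astype({'zip5': int}).astype({'zip5': str})
print(df.dtypes)       # zip5 is back to a non-numeric (object) column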

The whole function is shown below. Note that df is the data to be processed, and df_col is just for building the list of column names.


#### aggregate column function ####
import numpy as np
import pandas as pd

def agg_col_calculation(df):
    ### df is the data frame containing all data from different months
    exclude_col_list = ['zip5', 'age', 'household_count', 'person_count',
                        'homebuyers', 'first_homebuyers']
    month_list = [4, 5, 6, 7, 8, 9]

    # import the original columns for building the new column template
    df_col = pd.read_csv('/content/gdrive/My Drive/Brown_Datathon_Jeff/zip9_coded_201904_pv.csv',
                         nrows=2)
    col_list = df_col.columns[2:]

    avg_series_list = []
    ## compute the aggregated metrics
    for c in col_list:
        working_col_list = [c + f'_{m}' for m in month_list]
        avg_series = df[working_col_list].mean(axis='columns')
        avg_series_list.append(avg_series)
    print('complete appending!')

    # set the new names of columns
    new_col_list = [c + '_avg' for c in col_list]
    # preprocess the data before building the data frame
    exclude_col_array = np.array(df[exclude_col_list]).T

    ## make the new df
    df_avg = pd.DataFrame(np.array([*exclude_col_array, *np.array(avg_series_list)]).T,
                          index=df.index, columns=[*exclude_col_list, *new_col_list],
                          dtype='float')

    ## handle the data type issue
    df_avg = df_avg.astype({'zip5': int})
    df_avg = df_avg.astype({'zip5': str})

    print(df_avg.shape)
    return df_avg

df = agg_col_calculation(df)

Problem 2: Cleaning data with cross-column logical relations

Up to this point, it was not too much trouble. But here comes the tough part. Although we had narrowed down the data, something was off.


Common sense says that if a record (an area) has a mortgage, then there should be a value in each of the columns describing that first mortgage (average loan amount, average balance, and proportion). If a record has no mortgage at all, then there should be no value in any of these three columns (since there is no mortgage in that area). However, I found discrepancies just by eyeballing a few rows. Some rows have the average loan amount but no average balance, and some rows only have the proportion of the second loan without the other two counterparts.


Description of three coexisting columns


This type of error can seriously damage a model if it prevents the model from identifying the correct patterns in high-dimensional data. To make the model or the analysis more robust, the questionable records should be excluded from the data set.


Thinking with the Venn diagram

A Venn diagram is helpful for understanding the problem. For what I tried to do here, and for most similar ad-hoc situations, I am looking at subsetting the completely-non-intersecting part and the totally-intersecting part, which are 000 (All - (A∪B∪C)) and 111 (A⋂B⋂C) respectively.


Source: https://en.wikipedia.org/wiki/Venn_diagram

With this idea in mind, and because Pandas supports logical operators (&, |, ~), I could configure the logic like this:


## 111
(df[col_1].astype(bool) & df[col_2].astype(bool) & df[col_3].astype(bool))

## 000
(~df[col_1].astype(bool) & ~df[col_2].astype(bool) & ~df[col_3].astype(bool))

p.s. In case you are wondering: no, this is not a case where you need to convert AND into OR when taking the negative logical condition (since the elements here are all binary, True or False). See this thread for another example, and the quick demonstration below.
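To see this concretely, here is a toy check (the series are made up): negating each condition, ~A & ~B, selects only the all-empty rows, whereas De Morgan's negation of the whole conjunction, ~(A & B), would select every row that is merely not fully filled:

import pandas as pd

A = pd.Series([True, True, False])
B = pd.Series([True, False, False])

print((~A & ~B).tolist())   # [False, False, True] -> only the all-empty row
print((~(A & B)).tolist())  # [False, True, True]  -> any row not fully filled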

Then I integrated those logical operators into one line and made a series:


## 111 or 000
s_bool = ((df[col_1].astype(bool) & df[col_2].astype(bool) & df[col_3].astype(bool))
          | (~df[col_1].astype(bool) & ~df[col_2].astype(bool) & ~df[col_3].astype(bool)))

After concatenating all the boolean series into one data frame, I used all() with apply() and a lambda to create a short function, applying it along the columns axis. To be more specific, all(x) performs another logical operation to make sure every section in a row (a group of columns; e.g., the columns for the 3rd loan in June, or the columns for the 5th mortgage in August) obeys the rule I set in the first place (the 111 or 000 relationship in the Venn diagram). This came down to a final series of booleans:


s_agg_bool = df_bool.apply(lambda x: all(x), axis='columns')
# We can then use this series of booleans to subset the right records out of the raw data.
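Put together on toy data (the column names here are hypothetical), the whole check can be sketched like this; b.all(axis='columns') is just a compact equivalent of chaining & across the three columns:

import pandas as pd

df = pd.DataFrame({'loan_amount':  [100.0, 0.0, 250.0],
                   'loan_balance': [ 80.0, 0.0,   0.0],
                   'loan_prop':    [  0.3, 0.0,   0.1]})

b = df[['loan_amount', 'loan_balance', 'loan_prop']].astype(bool)

# keep a row only if it is all-filled (111) or all-empty (000)
s_bool = b.all(axis='columns') | (~b).all(axis='columns')
print(df.loc[s_bool])   # keeps rows 0 and 1, drops the inconsistent row 2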

Here is the final ETL function I used to put the dataset through the cleaning process. Note that it was designed to apply to multiple groups of columns, such as metrics from different months or from different categories:

def cross_col_correct(df):
    mtg_list = ['mortgage1_limit', 'mortgage1_balance', 'mortgage1_open']
    heq_list = ['homeequity1_limit', 'homeequity1_balance', 'homeequity1_open']
    s_bool_list = []

    # convert all null values into 0
    df.iloc[:, 2:] = df.iloc[:, 2:].applymap(lambda x: 0 if np.isnan(x) else x)

    # monthly individual columns check
    for col_list in [mtg_list, heq_list]:
        # make different variable names (mortgage1 -> mortgage2 -> ... -> mortgage5)
        for i in [0, 1, 2, 3, 4]:
            if i == 0:
                col_list = col_list
            else:
                col_list = [col.replace(str(i), str(i+1)) for col in col_list]
            # logical operators
            s_bool = ((df[col_list[0]].astype(bool) & df[col_list[1]].astype(bool) & df[col_list[2]].astype(bool))
                      | (~df[col_list[0]].astype(bool) & ~df[col_list[1]].astype(bool) & ~df[col_list[2]].astype(bool)))
            s_bool_list.append(s_bool)

    # aggregated columns check
    total_mtg_check_list = ['total_mortgage_balance', 'total_mortgage_limit', 'mortgage1_balance']
    total_heq_check_list = ['total_homeequity_balance', 'total_homeequity_limit', 'homeequity1_balance']
    for col_list in [total_mtg_check_list, total_heq_check_list]:
        s_bool2 = ((df[col_list[0]].astype(bool) & df[col_list[1]].astype(bool) & df[col_list[2]].astype(bool))
                   | (~df[col_list[0]].astype(bool) & ~df[col_list[1]].astype(bool) & ~df[col_list[2]].astype(bool)))
        s_bool_list.append(s_bool2)

    # make a boolean df containing T/F for each category for each number (e.g. mortgage1, homeequity2, ...)
    df_bool = pd.concat(s_bool_list, axis=1, sort=False)
    # make the final boolean series (if a row fails to have consistent True values in df_bool, it will be dropped)
    s_agg_bool = df_bool.apply(lambda x: all(x), axis='columns')
    df = df.loc[s_agg_bool]

    print('Complete cross columns correct!')
    return df

df = cross_col_correct(df)

p.s. Before doing the logical work, we have to convert all NaN values into 0 and all numerical values into the boolean data type, so that we can utilize the mechanics of Pandas logic (any zero value becomes False and any non-zero value becomes True).
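This step matters more than it looks: NaN is a non-zero float, so it counts as True under astype(bool). A quick check (fillna(0) is an equivalent way to do the conversion used in the function above):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 0.0, 12.5])
print(s.astype(bool).tolist())            # [True, False, True] -- NaN is truthy!
print(s.fillna(0).astype(bool).tolist())  # [False, False, True]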

Conclusion

I quickly ran through a few tips and examples of cross-column-based data manipulation. However, I know this is hardly a thorough guide for similar use cases.


Data manipulation is surprisingly difficult because we often face dynamic problems where a single function to impute missing values or identify biased data is far from enough in real-world scenarios. After trying to find related resources, I realized there is no systematic guide or discussion that thoroughly covers this practical material, since its essence is hard to classify and non-traditional. Therefore, this post is an attempt to share possible solutions, and I hope it can spark new ideas and surface more non-ordinary methods.


Photo by Cristian Escobar on Unsplash

Congrats, and thanks for reading! Feel free to reply with any thoughts and opinions, or just drop a message on my LinkedIn page!


p.s. If you are interested in some COVID-19 state-wise info, be sure to visit this neat personal dashboard to track the curves.


Jeff


Translated from: https://towardsdatascience.com/cross-column-based-data-manipulation-in-python-dfa5d8ffdd64
