############################## Method 1 ###################################################

[1] The code is as follows (the version below may lose data precision):

import numpy as np

def memory_usage_mb(df, *args, **kwargs):
    """Dataframe memory usage in MB."""
    return df.memory_usage(*args, **kwargs).sum() / 1024**2

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest integer type whose range covers the column.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                c_prec = df[col].apply(lambda x: np.finfo(x).precision).max()
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max and c_prec == np.finfo(np.float32).precision:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
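A minimal usage sketch (the toy DataFrame and its column names are made up purely for illustration; in practice df would be loaded from your data file):

import numpy as np
import pandas as pd

# Toy frame: a small-range int64 column and a random float64 column.
df = pd.DataFrame({'a': np.arange(100, dtype=np.int64),
                   'b': np.random.rand(100)})
df = reduce_mem_usage(df)
print(df.dtypes)  # 'a' is downcast to int8; 'b' keeps float64 because its precision exceeds float32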

############################## Method 2 ###################################################

[2] The code is as follows (the version below does not lose data precision):

import pandas as pd

def memory_usage_mb(df, *args, **kwargs):
    """Dataframe memory usage in MB."""
    return df.memory_usage(*args, **kwargs).sum() / 1024**2

def reduce_memory_usage(df, deep=True, verbose=True, categories=True):
    # All types that we want to change for "lighter" ones.
    # int8 and float16 are not included because we cannot reduce
    # those data types. float32 is not included because float16
    # has too low precision.
    numeric2reduce = ["int16", "int32", "int64", "float64"]
    start_mem = 0
    if verbose:
        start_mem = memory_usage_mb(df, deep=deep)
    # .items() works on both old and new pandas (.iteritems() was removed in pandas 2.0).
    for col, col_type in df.dtypes.items():
        best_type = None
        if col_type == "object":
            df[col] = df[col].astype("category")
            best_type = "category"
        elif col_type in numeric2reduce:
            downcast = "integer" if "int" in str(col_type) else "float"
            df[col] = pd.to_numeric(df[col], downcast=downcast)
            best_type = df[col].dtype.name
        # Log the conversion performed.
        if verbose and best_type is not None and best_type != str(col_type):
            print(f"Column '{col}' converted from {col_type} to {best_type}")
    if verbose:
        end_mem = memory_usage_mb(df, deep=deep)
        diff_mem = start_mem - end_mem
        percent_mem = 100 * diff_mem / start_mem
        print(f"Memory usage decreased from"
              f" {start_mem:.2f}MB to {end_mem:.2f}MB"
              f" ({diff_mem:.2f}MB, {percent_mem:.2f}% reduction)")
    return df
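For reference, a small sketch of what pd.to_numeric's downcast option does on an integer column; it only picks a smaller dtype when every value still fits, which is why this route does not lose precision:

import pandas as pd

s = pd.Series([1, 2, 300], dtype='int64')
print(pd.to_numeric(s, downcast='integer').dtype)  # int16: 300 does not fit in int8, so int16 is chosen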

The bug in the reduce_memory_usage code above is this line:

df[col] = df[col].astype("category")

After this conversion, calling fillna on the column raises an error, because fillna on a category column only accepts values that already exist among its categories, whereas an object column accepts any fill value.

So any fillna work must be done before the memory-reduction step.
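A minimal sketch of the failure mode and the workaround (the 'missing' fill value is only an example):

import pandas as pd

s = pd.Series(['a', 'b', None]).astype('category')
# s.fillna('missing')  # raises TypeError: 'missing' is not one of the existing categories

# Workaround: fill while the column is still object dtype, then convert.
s_ok = pd.Series(['a', 'b', None]).fillna('missing').astype('category')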

Usage:

import datatable as dt

train = dt.fread(folder + "train.csv")
train = train.to_pandas()
train = reduce_mem_usage(train)

############################## Method 3 ###################################################

The code comes from [3]:

import numpy as np
import pandas as pd

def get_stats(df):
    stats = pd.DataFrame(index=df.columns, columns=['na_count', 'n_unique', 'type', 'memory_usage'])
    for col in df.columns:
        stats.loc[col] = [df[col].isna().sum(), df[col].nunique(dropna=False), df[col].dtypes,
                          df[col].memory_usage(deep=True, index=False) / 1024**2]
    stats.loc['Overall'] = [stats['na_count'].sum(), stats['n_unique'].sum(), None,
                            df.memory_usage(deep=True).sum() / 1024**2]
    return stats

def print_header():
    print('col         conversion        dtype    na    uniq  size')
    print()

def print_values(name, conversion, col):
    template = '{:10}  {:16}  {:>7}  {:2}  {:6}  {:1.2f}MB'
    print(template.format(name, conversion, str(col.dtypes), col.isna().sum(),
                          col.nunique(dropna=False), col.memory_usage(deep=True, index=False) / 1024 ** 2))

# safe downcast
def sd(col, max_loss_limit=0.001, avg_loss_limit=0.001, na_loss_limit=0, n_uniq_loss_limit=0, fillna=0):
    """
    max_loss_limit - don't allow any float to lose precision more than this value. Any values are ok
        for GBT algorithms as long as you don't lose unique values.
        See https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Precision_limitations_on_decimal_values_in_[0,_1]
    avg_loss_limit - same but calculates avg throughout the series.
    na_loss_limit - not really useful.
    n_uniq_loss_limit - very important parameter. If you have a float field with very high cardinality
        you can set this value to something like n_records * 0.01 in order to allow some field relaxing.
    """
    is_float = str(col.dtypes)[:5] == 'float'
    na_count = col.isna().sum()
    n_uniq = col.nunique(dropna=False)
    try_types = ['float16', 'float32']
    if na_count <= na_loss_limit:
        try_types = ['int8', 'int16', 'float16', 'int32', 'float32']
    for type in try_types:
        col_tmp = col
        # float to int conversion => try to round to minimize casting error
        if is_float and (str(type)[:3] == 'int'):
            col_tmp = col_tmp.copy().fillna(fillna).round()
        col_tmp = col_tmp.astype(type)
        max_loss = (col_tmp - col).abs().max()
        avg_loss = (col_tmp - col).abs().mean()
        na_loss = np.abs(na_count - col_tmp.isna().sum())
        n_uniq_loss = np.abs(n_uniq - col_tmp.nunique(dropna=False))
        if max_loss <= max_loss_limit and avg_loss <= avg_loss_limit and na_loss <= na_loss_limit and n_uniq_loss <= n_uniq_loss_limit:
            return col_tmp
    # field can't be converted
    return col

def reduce_mem_usage_sd(df, deep=True, verbose=False, obj_to_cat=False):
    numerics = ['int16', 'uint16', 'int32', 'uint32', 'int64', 'uint64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=deep).sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        # collect stats
        na_count = df[col].isna().sum()
        n_uniq = df[col].nunique(dropna=False)
        # numerics
        if col_type in numerics:
            df[col] = sd(df[col])
        # strings
        if (col_type == 'object') and obj_to_cat:
            df[col] = df[col].astype('category')
        if verbose:
            print(f'Column {col}: {col_type} -> {df[col].dtypes}, na_count={na_count}, n_uniq={n_uniq}')
        new_na_count = df[col].isna().sum()
        if (na_count != new_na_count):
            print(f'Warning: column {col}, {col_type} -> {df[col].dtypes} lost na values. Before: {na_count}, after: {new_na_count}')
        new_n_uniq = df[col].nunique(dropna=False)
        if (n_uniq != new_n_uniq):
            print(f'Warning: column {col}, {col_type} -> {df[col].dtypes} lost unique values. Before: {n_uniq}, after: {new_n_uniq}')
    end_mem = df.memory_usage(deep=deep).sum() / 1024 ** 2
    percent = 100 * (start_mem - end_mem) / start_mem
    if verbose:
        print('Mem. usage decreased from {:5.2f} Mb to {:5.2f} Mb ({:.1f}% reduction)'.format(start_mem, end_mem, percent))
    return df
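print_header and print_values are defined above but not exercised in the usage below; a hedged sketch of how they could be combined with sd for per-column logging (train is assumed to be an existing DataFrame):

print_header()
for col in train.select_dtypes(include='number').columns:
    old_dtype = str(train[col].dtypes)
    train[col] = sd(train[col])
    print_values(col, f'{old_dtype} -> {train[col].dtypes}', train[col])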

Usage:

print("缩小前,train情况统计")
stats = get_stats(train)
print(stats)
train= reduce_mem_usage_sd(train, verbose=True)
print("缩小后,test情况统计")
stats = get_stats(train)
print(stats)
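The n_uniq_loss_limit hint from the sd docstring could be applied to a single high-cardinality float column roughly like this (the column name 'some_ratio' and the 1% threshold are hypothetical):

n_records = len(train)
# Allow roughly 1% of the rows' worth of unique values to collapse when downcasting.
train['some_ratio'] = sd(train['some_ratio'],
                         max_loss_limit=0.01,
                         avg_loss_limit=0.01,
                         n_uniq_loss_limit=int(n_records * 0.01))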

Code sources:

[1] https://www.kaggle.com/gemartin/load-data-reduce-memory-usage

[2] https://www.kaggle.com/c/ieee-fraud-detection/discussion/107653#latest-619384
