The original question:

I am iterating over thousands of pickle files containing Python Pandas DataFrames. The frames vary in their number of rows (between approximately 600 and 1300) but not in their number of columns (exactly 636). I transform each of them (applying exactly the same transformation to every one) and append the result to a single CSV file using the DataFrame.to_csv() method.

An excerpt of the to_csv code:

if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)
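
A side note on this pattern, in case it matters: as far as I understand, every to_csv(filename, mode='a') call opens the target file, appends to it, and closes it again. A rough sketch of what I believe each append amounts to (using the fact that to_csv with path_or_buf=None returns the CSV text as a string):

# What I believe each append call roughly does: open, write, close - every time.
with open(finalnormCSVFile, 'a') as f:
    f.write(dftemp.to_csv(None, header=False))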

What bothers me is that it starts off very fast, but performance decreases exponentially as it runs. I kept a log of the processing times:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058
count = 0
time: 0:00:00

2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697

2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128

2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459

2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692

2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618

2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877

2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646

2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114

2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602

2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577

2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802

2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343

2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335

2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590

2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412

2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339

2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204

2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146

2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948

2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383

2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974

2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205

2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652

2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415

2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070

2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234

2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679

2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979

2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212

2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924

2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472

2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556

2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356

count = 3324
end of cycle: 2015-03-26 03:59:54.161842
end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784

Update #1:

I did what you suggested, @Alexander, and it surely is related to the to_csv() method:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410
count = 0
time: 0:00:00

2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631

2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541

2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016

2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237

2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361

2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384

2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633

2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624

2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006

2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800

2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756

2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996

2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653

2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465

2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006

2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323

2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602

2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465

2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692

2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336

2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537

2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320

2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145

2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342

2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319

2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824

2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567

2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148

2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286

2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294

2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584

2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725

2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513

count = 3324
end of cycle: 2015-03-26 06:03:47.321182
end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772

And, as requested, here is the source code:

import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute', 'Attribute'
]

exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
]

path = "./dataset_final/"
pickleFiles = [f for f in listdir(path) if isfile(join(path, f))]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")

for picklefile in pickleFiles:
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')
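
An aside on the script itself (separate from the timing question): the row-by-row .loc loop that builds 'Concatenated String Index' could presumably be vectorized. A sketch of what I think is an equivalent one-liner, assuming ConcatIndex is always integer-valued:

# Hypothetical vectorized replacement for the per-row .loc loop above:
# convert the whole column to strings, then left-pad each to 10 digits.
df['Concatenated String Index'] = df['ConcatIndex'].astype(str).str.zfill(10)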

Update #2:

As suggested, I ran %prun %run "./DataSetNormalization.py" for the first couple of hundred pickle files, and these are the results:

136373640 function calls (136342619 primitive calls) in 1018.769 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
220 667.069 3.032 667.069 3.032 {method 'close' of '_io.TextIOWrapper' objects}
1540 42.046 0.027 46.341 0.030 {pandas.lib.write_csv_rows}
219 34.886 0.159 34.886 0.159 {built-in method collect}
3520 16.782 0.005 16.782 0.005 {pandas.algos.take_2d_axis1_object_object}
78323 9.948 0.000 9.948 0.000 {built-in method empty}
25336892 9.645 0.000 12.635 0.000 {built-in method isinstance}
1433941 9.344 0.000 9.363 0.000 generic.py:1845(__setattr__)
221051/220831 7.387 0.000 119.767 0.001 indexing.py:194(_setitem_with_indexer)
723540 7.312 0.000 7.312 0.000 {method 'reduce' of 'numpy.ufunc' objects}
273414 7.137 0.000 20.642 0.000 internals.py:2656(set)
604245 6.846 0.000 6.850 0.000 {method 'copy' of 'numpy.ndarray' objects}
1760 6.566 0.004 6.566 0.004 {pandas.lib.isnullobj}
276274 5.315 0.000 5.315 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1719244 5.264 0.000 5.266 0.000 {built-in method array}
1102450 5.070 0.000 29.543 0.000 internals.py:1804(make_block)
1045687 5.056 0.000 10.209 0.000 index.py:709(__getitem__)
1 4.718 4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
602485 4.575 0.000 15.087 0.000 internals.py:2586(iget)
441662 4.562 0.000 33.386 0.000 internals.py:2129(apply)
272754 4.550 0.000 4.550 0.000 internals.py:1291(set)
220883 4.073 0.000 4.073 0.000 {built-in method charmap_encode}
4781222 3.805 0.000 4.349 0.000 {built-in method getattr}
52143 3.673 0.000 3.673 0.000 {built-in method truediv}
1920486 3.671 0.000 3.672 0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
1096730 3.513 0.000 8.370 0.000 internals.py:3035(__init__)
875899 3.508 0.000 14.458 0.000 series.py:134(__init__)
334357 3.420 0.000 3.439 0.000 {pandas.lib.infer_dtype}
2581268 3.419 0.000 4.774 0.000 {pandas.lib.values_from_object}
1102450 3.036 0.000 6.110 0.000 internals.py:59(__init__)
824856 2.888 0.000 45.749 0.000 generic.py:1047(_get_item_cache)
2424185 2.657 0.000 3.870 0.000 numeric.py:1910(isscalar)
273414 2.505 0.000 9.332 0.000 frame.py:2113(_sanitize_column)
1646198 2.491 0.000 2.880 0.000 index.py:698(__contains__)
879639 2.461 0.000 2.461 0.000 generic.py:87(__init__)
552988 2.385 0.000 4.451 0.000 internals.py:3565(_get_blkno_placements)
824856 2.349 0.000 51.282 0.000 frame.py:1655(__getitem__)
220831 2.224 0.000 21.670 0.000 internals.py:460(setitem)
326437 2.183 0.000 11.352 0.000 common.py:1862(_possibly_infer_to_datetimelike)
602485 2.167 0.000 16.974 0.000 frame.py:1982(_box_item_values)
602485 2.087 0.000 23.202 0.000 internals.py:2558(get)
770739 2.036 0.000 6.471 0.000 internals.py:1238(__init__)
276494 1.966 0.000 1.966 0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076 1.935 0.000 1.972 0.000 {built-in method len}
220831 1.924 0.000 76.647 0.000 indexing.py:372(setter)
220 1.893 0.009 1.995 0.009 {built-in method load}
1920486 1.855 0.000 8.198 0.000 index.py:1173(get_loc)
112860 1.828 0.000 9.607 0.000 common.py:202(_isnull_ndarraylike)
602485 1.707 0.000 8.903 0.000 series.py:238(from_array)
875899 1.688 0.000 2.493 0.000 series.py:263(_set_axis)
3300 1.661 0.001 1.661 0.001 {method 'tolist' of 'numpy.ndarray' objects}
1102670 1.609 0.000 2.024 0.000 internals.py:108(mgr_locs)
4211850 1.593 0.000 1.593 0.000 {built-in method issubclass}
1335546 1.501 0.000 2.253 0.000 generic.py:297(_get_axis_name)
273414 1.411 0.000 37.866 0.000 frame.py:1994(__setitem__)
441662 1.356 0.000 7.884 0.000 indexing.py:982(_convert_to_indexer)
220831 1.349 0.000 131.331 0.001 indexing.py:95(__setitem__)
273414 1.329 0.000 23.170 0.000 generic.py:1138(_set_item)
326437 1.276 0.000 6.203 0.000 fromnumeric.py:2259(prod)
274734 1.271 0.000 2.113 0.000 shape_base.py:60(atleast_2d)
273414 1.242 0.000 34.396 0.000 frame.py:2072(_set_item)
602485 1.183 0.000 1.979 0.000 generic.py:1061(_set_as_cached)
934422 1.175 0.000 1.894 0.000 {method 'view' of 'numpy.ndarray' objects}
1540 1.144 0.001 58.217 0.038 format.py:1409(_save_chunk)
220831 1.144 0.000 9.198 0.000 indexing.py:139(_convert_tuple)
441662 1.137 0.000 3.036 0.000 indexing.py:154(_convert_scalar_indexer)
220831 1.087 0.000 1.281 0.000 arrayprint.py:343(array2string)
1332026 1.056 0.000 3.997 0.000 generic.py:310(_get_axis)
602485 1.046 0.000 9.949 0.000 frame.py:1989(_box_col_values)
220 1.029 0.005 1.644 0.007 internals.py:2429(_interleave)
824856 1.025 0.000 46.777 0.000 frame.py:1680(_getitem_column)
1491578 1.022 0.000 2.990 0.000 common.py:58(_check)
782616 1.010 0.000 3.513 0.000 numeric.py:394(asarray)
290354 0.988 0.000 1.386 0.000 internals.py:1950(shape)
220831 0.958 0.000 15.392 0.000 generic.py:2101(copy)
273414 0.940 0.000 1.796 0.000 indexing.py:1520(_convert_to_index_sliceable)
220831 0.920 0.000 1.558 0.000 common.py:1110(_possibly_downcast_to_dtype)
220611 0.914 0.000 0.914 0.000 {pandas.lib.is_bool_array}
498646 0.906 0.000 0.906 0.000 {method 'clear' of 'dict' objects}
715345 0.848 0.000 13.083 0.000 common.py:132(_isnull_new)
452882 0.824 0.000 1.653 0.000 index.py:256(__array_finalize__)
602485 0.801 0.000 0.801 0.000 internals.py:208(iget)
52583 0.748 0.000 2.038 0.000 common.py:1223(_fill_zeros)
606005 0.736 0.000 6.755 0.000 internals.py:95(make_block_same_class)
708971 0.732 0.000 2.156 0.000 internals.py:3165(values)
1760378 0.724 0.000 0.724 0.000 internals.py:2025(_get_items)
109560 0.720 0.000 6.140 0.000 nanops.py:152(_get_values)
220831 0.718 0.000 11.017 0.000 internals.py:2395(copy)
924669 0.712 0.000 1.298 0.000 common.py:2248(_get_dtype_type)
1515796 0.698 0.000 0.868 0.000 {built-in method hasattr}
220831 0.670 0.000 4.299 0.000 internals.py:435(copy)
875899 0.661 0.000 0.661 0.000 series.py:285(_set_subtyp)
220831 0.648 0.000 0.649 0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
452882 0.640 0.000 0.640 0.000 index.py:218(_reset_identity)
715345 0.634 0.000 1.886 0.000 {pandas.lib.isscalar}
1980 0.626 0.000 1.172 0.001 internals.py:3497(_merge_blocks)
220831 0.620 0.000 2.635 0.000 common.py:1933(_is_bool_indexer)
272754 0.608 0.000 0.899 0.000 internals.py:1338(should_store)
220831 0.599 0.000 3.463 0.000 series.py:482(__getitem__)
498645 0.591 0.000 1.497 0.000 generic.py:1122(_clear_item_cache)
1119390 0.584 0.000 1.171 0.000 index.py:3936(_ensure_index)
220831 0.573 0.000 1.883 0.000 index.py:222(view)
814797 0.555 0.000 0.905 0.000 internals.py:3086(_values)
52583 0.543 0.000 15.545 0.000 ops.py:469(wrapper)
220831 0.536 0.000 3.760 0.000 internals.py:371(_try_cast_result)
228971 0.533 0.000 0.622 0.000 generic.py:1829(__getattr__)
769651 0.528 0.000 0.528 0.000 {built-in method min}
224351 0.509 0.000 2.030 0.000 generic.py:1099(_maybe_update_cacher)
...

I will rerun it to confirm, but it certainly looks related to pandas' to_csv() method, because most of the running time goes to io and the csv writer. Why is it behaving this way? Any suggestions?

Update #3:

Well, I did the full %prun test, and indeed almost 90% of the time spent is in {method 'close' of '_io.TextIOWrapper' objects}. So I guess that is where the problem lies... What do you all think?
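
If close() really is the bottleneck, one workaround I am considering (just a sketch, not tested) is to accumulate the CSV text in an in-memory buffer and write the file once at the end, so the file is opened and closed exactly once. This assumes the concatenated CSV text fits in memory:

from io import StringIO

buf = StringIO()
for i, picklefile in enumerate(pickleFiles):
    # ... same manipulation/normalization as above, producing dftemp ...
    dftemp.to_csv(buf, header=(i == 0))  # write the header only for the first file

# One open/close for the whole run instead of one per pickle file.
with open(finalnormCSVFile, 'w') as f:
    f.write(buf.getvalue())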

My questions are:

> What is causing the performance degradation?

> Does pandas.DataFrame.to_csv() in append mode load the whole file each time it writes to it?

> Is there a way to improve the process?
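
For the third question, the simplest change I can think of (again a sketch, resting on the assumption that to_csv also accepts an open file object for path_or_buf) would be to open the CSV once outside the loop and pass the handle in instead of a filename, so no reopening or closing happens per iteration:

# Sketch: a single open() for the whole run; header written only once.
with open(finalnormCSVFile, 'w') as f:
    for count, picklefile in enumerate(pickleFiles):
        # ... same per-file processing as above, producing dftemp ...
        dftemp.to_csv(f, header=(count == 0))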
