I want to perform my own complex operations on financial data in dataframes in a sequential manner.

For example, I am using the following MSFT CSV file taken from Yahoo Finance:

Date,Open,High,Low,Close,Volume,Adj Close

I then do the following:

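#!/usr/bin/env python
from pandas import *

# Use the Date column as the index so df.index[i] is the row's date
df = read_csv('table.csv', index_col='Date')

for i, row in enumerate(df.values):
    date = df.index[i]
    # One value per remaining column, in CSV order
    open, high, low, close, volume, adjclose = row
    # now perform analysis on open/close based on date, etc..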

Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.




The newest versions of pandas now include a built-in function for iterating over rows.

for index, row in df.iterrows():
    # do some logic here

Or, if you want it faster, use itertuples().

But unutbu's suggestion to use numpy functions, so that you avoid iterating over rows at all, will produce the fastest code.
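To illustrate what avoiding the loop can look like, here is a minimal sketch. The sample values are invented, the Change and Direction columns are hypothetical names, and the up/down rule is only a stand-in for whatever analysis you actually need:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names follow the CSV header in the question.
df = pd.DataFrame(
    {"Open": [30.0, 31.0, 29.5], "Close": [30.5, 30.0, 30.0]},
    index=pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"]),
)

# Whole-column arithmetic: no Python-level loop over rows.
df["Change"] = df["Close"] - df["Open"]

# np.where evaluates the condition for every row at once.
df["Direction"] = np.where(df["Change"] > 0, "up", "down")

print(df)
```

Whether your own logic can be expressed this way depends on how much each row depends on the previous ones; when it can, this is usually the first thing to try.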


I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.

There's also iterkv, which iterates through (column, series) tuples.
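A quick way to see the difference between iterrows and itertuples is to pull one item from each. This is only a sketch on a made-up two-row frame; the column names a and b are placeholders:

```python
import pandas as pd

# Hypothetical two-row frame, just to inspect what each iterator yields.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["x", "y"])

# iterrows yields (index, Series) pairs; each row becomes a Series.
index, row = next(df.iterrows())
print(type(row).__name__, index, row["a"], row["b"])

# itertuples yields one tuple per row, with the index as its first element.
tup = next(df.itertuples())
print(tup[0], tup[1], tup[2])
```

Note that because iterrows packs each row into a single Series, values from columns of different dtypes can be upcast to a common type (the integer in column a comes back as a float here), whereas itertuples keeps each value as it is.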


Just as a small addition, you can also use apply if you have a complex function that you want to run on a single column:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html

df[b] = df[a].apply(lambda col: do stuff with col here)
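As a runnable sketch of the same idea: when apply is called on a single column (a Series), the function receives one value at a time rather than the whole column. The column names a and b and the classify helper below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

def classify(value):
    # Placeholder for a more complex per-value computation.
    return "even" if value % 2 == 0 else "odd"

# The function is called once for each element of column "a".
df["b"] = df["a"].apply(classify)
print(df)
```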




As has been mentioned before, a pandas object is most efficient when you process the whole array at once. However, for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.

import time
import pandas as pd

t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []  # elapsed time in seconds, one entry per method

# 1. iterrows
C = []
A = time.time()
for i, r in t.iterrows():
    C.append((r['a'], r['b']))
B.append(time.time() - A)

# 2. itertuples
C = []
A = time.time()
for ir in t.itertuples():
    C.append((ir[1], ir[2]))
B.append(time.time() - A)

# 3. zip over the columns
C = []
A = time.time()
for r in zip(t['a'], t['b']):
    C.append((r[0], r[1]))
B.append(time.time() - A)

print(B)

Result (seconds, in the order iterrows, itertuples, zip):

[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]

This is probably not the best way to measure the time consumption, but it's quick for me.
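If you want somewhat steadier numbers, one option is the standard library's timeit module. The sketch below repeats the same three loops; the helper function names are made up for this example, and the timings will differ from machine to machine:

```python
import timeit
import pandas as pd

t = pd.DataFrame({"a": range(0, 10000), "b": range(10000, 20000)})

def with_iterrows():
    return [(r["a"], r["b"]) for _, r in t.iterrows()]

def with_itertuples():
    return [(ir[1], ir[2]) for ir in t.itertuples()]

def with_zip():
    return list(zip(t["a"], t["b"]))

for fn in (with_iterrows, with_itertuples, with_zip):
    # Take the best of 3 runs, one call per run.
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```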

Here are some pros and cons, IMHO:

  • .iterrows(): returns the index and the row items in separate variables, but is significantly slower
  • .itertuples(): faster than .iterrows(), but returns the index together with the row items; ir[0] is the index
  • zip: quickest, but no access to the index of the row
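On the last point, one possible workaround is to pass the index to zip alongside the columns. A small sketch with made-up data:

```python
import pandas as pd

t = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# Zipping the index together with the columns recovers the row label.
rows = [(idx, a, b) for idx, a, b in zip(t.index, t["a"], t["b"])]
print(rows)
```

I have not timed this variant, so treat any speed expectation as an assumption.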

