Python数据处理相关语法整理

简介
Python自身特性总结
编程Tips
拿到新电脑配环境时做的事：
一些加速python代码的技巧
Python项目代码结构
量化策略指标计算
- 绝对收益率
- 最大回撤
- 给定累计收益率序列计算收益率序列
- 年化收益率
- 年化波动率
- 夏普比率
-----------------------------------------------------
------------------- 分割线 -------------------------
------------------------------------------------------
字符串
字典
- 字典取数
Datetime库
- 字符串和日期相互转换
- 天数之间的加减
Scipy.stats
Numpy 库
- 数学函数
- 正态分布取样 / 多元正态分布取样
- 排名
- 数组/矩阵的堆叠
- 矩阵(matrix)相关操作
- 广播机制
- 生成固定的随机数
- 排序
- np.nan的处理
- 正负无穷
Pandas库
- 计算相关性
- 统计频数
- 描述性统计describe
- 多层索引Multiindex
- dataframe的整合与形变
- 数学函数
- 计算协方差和相关系数
- 对每一行每一列分别操作
- 时间窗口
- - 计算相关系数
  - 点乘
- 对dataframe的列进行遍历
- 对dataframe的行进行遍历
- 对Series进行遍历
- 生成日期序列
- 读取表格
- 删除满足条件的行
- 增删改查
- - 两个dataframe合并
  - 增加一行
  - 增加一列
  - 删除列
- 设置索引
- - set_index
  - reset_index
- 取索引
- 更改Series和DataFrame列名
- 缺失值/空字符串/inf/字符串/非数值处理
- 表格排序
- - df.sort_values()
  - df.sort_index()
  - df.set_index()
- 表格画图 - df.plot()
- 采样
- - df.resample()
- 对元素进行映射apply
- Shift(num)
- 两个表的连接
- 两个表的遍历
- groupby
- 内连接，左连接，右连接，全外连接(pd.merge)
Matplotlib.pyplot库
- 显示中文
- plt.plot()
- 颜色设置
- Fig和axes和plt画图的区别
- axes 相关属性设置
- plt.subplots
- 设置横纵坐标间隔
- 设置X\Y轴显示的最大值和最小值
- 双坐标轴
- 绘制水平直线
- 图例、网格, 设置
- 一个一个添加子图
- 绘制频率直方图
- 画条形图
- 画3D图
- 画散点图
Pytorch
Serverless
随机数
- 函数
- 随机取样
- 随机分成N组
正则表达式
- re.match（从起始位置开始匹配）
- re.search（扫描整个字符串并返回第一个成功的匹配）
dataclasses库
functools库 --- 缓存函数结果
- 缓存函数结果
有用的代码片段
- 线性回归
- 从列表中随机选取N个数
- 输出重定向
- 并行
- - 使用joblib来并行运行程序
  - 并行之使用Multiprocessing 获取函数值
- 监控内存使用情况
- Subprocess
- 计算运行时间
- 计算每段代码的运行时间的函数装饰器
- 使用tqdm显示for循环的进度条和耗时
- 序列化python对象
- 计算股票行业哑变量因子
- 因子中性化、去极值、标准化
- 非线性规划（Scipy.optimize.minimize）
- - 风险平价
- 拷贝文件
- 拷贝整个文件夹

简介

本篇博文主要是自己在处理金融类数据、编写金融类代码的时候的总结

Email: 6763624736@qq.com

Python自身特性总结

python的所有类，都是type这个类的实例
metaclass可以允许父类对子类进行修改
variable_name : DataType
python可以以这样的方式指定变量的类型
目录下如果有__init__.py，那么这个目录就会被当成一个包，当import这个目录的时候，会先执行__init__里面的语句
a[::-1]，表示把a的数组里的元素倒序输出，但是只限制在一维里的元素
第三个-1其实是表示步长
一些pip不能安装的，需要下载轮子的包的下载网址
python包下载网址

编程Tips

一般修改dataframe或者Series要重新赋值，比如 df = df.concat([df1,df2],axis=1)
np.isnan 不能识别None，而pd.isnull 适用范围比较广
读取数据设定index后，可以使用 drop_duplicate 一下
如果Series的index是整数的话，索引 s[-1]会报没有-1这个key的错误
如果对dataframe或者Series的行进行筛选，并且要更改的话
记得使用df.loc[ ] 而不是直接 df[ ]
Pandas和Numpy常用库

拿到新电脑配环境时做的事：

关闭自动重启
我的电脑->管理->设置，windows update禁用，并且选项上面选择不操作

点击windows图标，进入设置，设置电源选项

点击windows图标，进入设置，关闭windows更新

一些加速python代码的技巧

使用numba
尽量在循环内少用Series和Dataframe的loc进行索引，这样会造成速度下降
少用dataframe的loc和append来往里面加元素，最好是先在python原生的列表里操作
numpy，dataframe，series只有在统计分析的时候才比较快（就是数据不变了已经），如果要往里面加元素，最好是原生的列表操作比较快
你对列表直接遍历进行操作比你把列表变成Series再进行apply操作要快

Python项目代码结构

python项目代码架构

量化策略指标计算

策略指标计算公式

指标公式：

Calmar（卡玛比率）= 超额收益 / 最大回撤

绝对收益率

给定收益率序列(pd.Series)，计算绝对收益率

def get_absolute_return(return_s):return ((return_s+1).cumprod().iloc[-1]-1)

最大回撤

给定累积收益率序列计算最大回撤

def compute_portfolio_max_dawdown(strategy_cumu_return_list):# 计算最大回撤i = int(np.argmax((np.maximum.accumulate(strategy_cumu_return_list) - strategy_cumu_return_list) /np.maximum.accumulate(strategy_cumu_return_list)))if i == 0:return 0j = int(np.argmax(strategy_cumu_return_list[:i]))  # 开始位置max_drawdown = (strategy_cumu_return_list[j] - strategy_cumu_return_list[i]) / strategy_cumu_return_list[j]return max_drawdown

其中

np.maximum.accumulate(arr)-arr # 计算每个点的回撤绝对值
(np.maximum.accumulate(arr)-arr) /np.maximum.accumulate(arr) # 计算每个点的回撤比例

给定收益率序列计算最大回撤

def compute_portfolio_max_dawdown(portfolio_return_array):strategy_cumu_return_list = (np.array(portfolio_return_array) + 1).cumprod()# 计算最大回撤i = int(np.argmax((np.maximum.accumulate(strategy_cumu_return_list) - strategy_cumu_return_list) /np.maximum.accumulate(strategy_cumu_return_list)))if i == 0:return 0j = int(np.argmax(strategy_cumu_return_list[:i]))  # 开始位置max_drawdown = (strategy_cumu_return_list[j] - strategy_cumu_return_list[i]) / strategy_cumu_return_list[j]return max_drawdown

给定累计收益率序列计算收益率序列

def compute_return(cumu_return_lst):shift_return_lst = [1] + list(cumu_return_lst)[:-1]return_lst = np.array(cumu_return_lst)/np.array(shift_return_lst) - 1return_lst[0] = 0return return_lst

年化收益率

给定累计收益率序列，计算年化收益率

# day_num是回测天数
def compute_annual_yield(cumu_return_lst):day_num = len(cumu_return_lst)annual_return = cumu_return_lst[-1]**(252/day_num) - 1return annual_return

年化波动率

给定收益率序列，计算年化波动率

def compute_annual_std(return_lst, daily_data_cnt):annual_std = np.std(return_lst)*math.sqrt(252*daily_data_cnt)return annual_std

夏普比率

给定收益率序列，计算夏普

def compute_portfolio_sharpe_ratio(portfolio_return_array, day_num, daily_data_cnt):strategy_cumu_return_list = (np.array(portfolio_return_array) + 1).cumprod()annual_yield = strategy_cumu_return_list[-1] ** (252 / day_num) - 1annual_std = np.std(portfolio_return_array) * math.sqrt(252 * daily_data_cnt)return (annual_yield - 0.04) / annual_std

-----------------------------------------------------

------------------- 分割线 -------------------------

------------------------------------------------------

字符串

字符串大小写转换
字符串大小写转换

字典

字典取数

d.get(key, default)
key是键值，若不存在，则返回default

Datetime库

字符串和日期相互转换

startDate = "2018-10-01"
endDate = "2018-10-31"###字符转化为日期
startTime = datetime.datetime.strptime(startDate, '%Y-%m-%d').time()
endTime = datetime.datetime.strptime(endDate, '%Y-%m-%d').time()now = datetime.datetime.now()
print(now)###日期转化为字符串
print("--1---:" + datetime.datetime.strftime(startTime, "%Y-%m-%d"))
print("--2---:" + datetime.datetime.strftime(endTime, "%Y-%m-%d"))

天数之间的加减

两个日期直接加减可以获得天数

timedelta().days可获得天数

datetime.timedelta(days = XX)是间隔的天数

Scipy.stats

简介：scipy.stats是和数据统计相关的包，有各种统计函数

scipy.stats.rankdata()
可对数组中的数据进行排序
例子

Numpy 库

数学函数

np.cov(x)&np.var(x)
np.cov(x)&np.var(x)两者区别

>>> from numpy import cov
>>> cov([1, 2, 3], [2, 12, 14])
array([[  1.        ,   6.        ],[  6.        ,  41.33333333]])

正态分布取样 / 多元正态分布取样

正态分布取样

import numpy as npnp.random.normal(loc=mean, scale=std, size=(,))

多元正态分布取样

import numpy as np
sample = np.random.multivariate_normal(mean=[0,0], cov=[[1,0.5],[0.5,1.5]],size=200)

多元正态分布可以设置协方差矩阵

排名

降序排名：np.argsort(-arr).argsort()
升序排名：np.argsort(arr).argsort()

数组/矩阵的堆叠

np.vstack

T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.vstack((T, S, M))
print(X)
# result
[[ 9 15 25 14 10 18  0 16  5 19 16 20][39 56 93 61 50 75 32 85 42 70 66 80][38 56 90 63 56 77 30 80 41 79 64 88]]

np.hstack

T = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
S = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
M = np.asarray([38, 56, 90, 63, 56, 77, 30, 80, 41, 79, 64, 88])
X = np.hstack((T, S, M))
print(X)
# result
[ 9 15 25 14 10 18  0 16  5 19 16 20 39 56 93 61 50 75 32 85 42 70 66 8038 56 90 63 56 77 30 80 41 79 64 88]

矩阵(matrix)相关操作

将列表转换成矩阵

print(np.matrix([1,3,2]))
# result
[[1 3 2]]

矩阵对应位置相乘：np.multiply()
np.multiply(), np.dot(), * 的区别博客

np.dot()：对于秩为1的数组，执行对应位置相乘，然后再相加；
对于秩不为1的二维数组，执行矩阵乘法运算；超过二维的可以参考numpy库介绍。

* 星号乘法 : 对数组执行对应位置相乘，对矩阵执行矩阵乘法运算

广播机制

numpy中的广播机制

生成固定的随机数

为了生成固定的随机数，我们需要使用种子(seed)
相同的种子会生成相同的随机数
用法：

randomState = np.random.RandomState(0)
a = randomState.randint(10, size=(5,6))
# 这里的randomState其实就是在前面加了 np.random.seed(0)的 np.random, 后面可以调用randint等随机数生成方法

如果是使用np.random.seed(0)再使用np.random.randint调用，每次调用之前都要加np.random.seed(0)

排序

ndarray.sort()

np.nan的处理

np.nan其实是个float，

但是如果数组里有np.nan，在进行诸如np.argmax()，np.max()之类的统计计算的时候会反正nan

正负无穷

正无穷：np.inf
负无穷: -np.inf

Pandas库

计算相关性

# 较慢
df.corr()# 较快
pd.DataFrame(np.corrcoef(df.values, rowvar=False), index = df.index, columns=df.columns)

统计频数

描述性统计describe

注意如果不是数值类型的进行describe, 会出现 count unique freq这些。
可以注意一些是否需要进行astype一下
如果想把Series横着加到Dataframe里出现错误，
要先把Series转换成Dataframe

多层索引Multiindex

m_index1=pd.Index([("A","x1"),("A","x2"),("B","y1"),("B","y2"),("B","y3")],names=['class1', 'class2'])
df1=pd.DataFrame(np.random.randint(1,10,(5,3)),index=m_index1)

df.index.names = […]
index.get_level_values(level)
获取多重索引第N曾的索引值
index.tolist()

dataframe的整合与形变

dataframe的整合与形变
注意* 要设置stack(dropna=False), dropna默认为False了

数学函数

数学函数集合：
Series及Dataframe数值计算和统计基础函数应用总结

count(),min(),quantile(),sum(),mean(),median(),std(),skew(),kurt()cumsum(),cumprod()value_counts(),unique()print(df.count(),'→ count统计非Na值的数量\n')
print(df.min(),'→ min统计最小值\n',df['key2'].max(),'→ max统计最大值\n')
print(df.quantile(q=0.75),'→ quantile统计分位数，参数q确定位置\n')
print(df.sum(),'→ sum求和\n')
print(df.mean(),'→ mean求平均值\n')
print(df.median(),'→ median求算数中位数，50%分位数\n')
print(df.std(),'\n',df.var(),'→ std,var分别求标准差，方差\n')
print(df.skew(),'→ skew样本的偏度\n')
print(df.kurt(),'→ kurt样本的峰度\n')

使用rolling().apply(lambda XXXX) 会很慢，
因为apply其实是一个loop。

如果rolling()有实现数学函数，那么可以用，例如 rolling().max()

如果没有的数学函数，例如argmax()，那么可以考虑以下解决方案

方案一：超快

df.rolling(window=n).max()

如果rolling后面实现了一些函数可以用

方案二：超快
使用from numpy.lib.stride_tricks import sliding_window_view 的sliding_window_view

from numpy.lib.stride_tricks import sliding_window_view
sliding_window_view(df1, (d, len(df1.columns))).argmax(axis=2).squeeze()

axis=1：表示沿着行做argmax
axis=2：表示沿着列做argmax

参考stackoverflow

这个实现了一些例如 argmax, argmin, argsort之类的函数

例1：

def _ts_decay_linear(x1: pd.DataFrame, d):weight = np.arange(d) + 1weight = weight / weight.sum()result = sliding_window_view(x1, (d, len(x1.columns))).swapaxes(2,3).dot(weight).squeeze()result = np.concatenate([[[np.nan] * x1.shape[1] for i in range(d - 1)], result], axis=0)result = pd.DataFrame(result, index=x1.index, columns=x1.columns)return result

例2：

def _ts_rank(x1:pd.DataFrame, d):result = sliding_window_view(x1, (d, len(x1.columns))).argsort(axis=2).argsort(axis=2).squeeze()[:,-1,:]result = np.concatenate([[[np.nan] * x1.shape[1] for i in range(d - 1)], result], axis=0)result = pd.DataFrame(result, index=x1.index, columns=x1.columns)return result

方案三（使用原生列表处理，较慢）：

def _ts_argmin(x1:pd.DataFrame, d):result = []for row in x1.T.values.tolist():tmp_lst = []tmp_lst = tmp_lst + [np.nan] * (d - 1)for i in range(d, len(row) + 1):slice_lst = row[i - d: i]tmp_lst.append(slice_lst.index(min(slice_lst)))result.append(tmp_lst)result = list(map(list, zip(*result)))result = pd.DataFrame(data=result, index=x1.index, columns=x1.columns)return result

计算协方差和相关系数

import pandas as pd
# 对于s1是Series的情形
cov = s1.cov(s2)
cor = s1.corr(s2)
# 对于df1是dataframe的情形
df1.corrwith(df2,axis=0/1)

遇到Nan的话会忽略Nan进行计算

对每一行每一列分别操作

dataframe.apply(args=(arg1,arg2),axis=0/1)

axis=1: 对每一行进行操作

import numpy as np
import pandas as pda = pd.DataFrame(data=np.random.randint(10,size=(2,5)), index=pd.date_range(start='2020-01-01', end='2020-01-02', freq='D'))
print(a)
a= a.apply(lambda x: x.sum(),axis=1)
print(a)# 结果0  1  2  3  4
2020-01-01  8  2  5  1  2
2020-01-02  3  9  2  4  02020-01-01    18
2020-01-02    18

axis=0: 对每一列进行操作

import numpy as np
import pandas as pda = pd.DataFrame(data=np.random.randint(10,size=(2,5)), index=pd.date_range(start='2020-01-01', end='2020-01-02', freq='D'))
print(a)
a= a.apply(lambda x: x.sum(),axis=0)
print(a)# 结果0  1  2  3  4
2020-01-01  2  7  6  0  7
2020-01-02  5  7  1  1  80     7
1    14
2     7
3     1
4    15

时间窗口

data.rolling(window = num).sum()

注意：如果data是dataframe他可以指定axis=0, axis=0 就是一列一列进行操作，axis=1就是一行一行进行操作

计算相关系数

df1.corrwith(df2, method=‘spearman’, axis=1)

参数：
axis=0 表示计算对应列之间的相关系数
axis=1 表示计算对应行之间的相关系数
遇到Nan的话会忽略Nan进行计算

点乘

a.dot(b)
最好用这个，不要用 * 直接乘，如果b是list或者array会出问题

对dataframe的列进行遍历

iteriterms()

import pandas as pd
import numpy as npstock_return_df = pd.read_excel('./A_share_monthly_return.xlsx',index_col=0,parse_dates=True)for col_name, col in stock_return_df.iteritems():print(col)

对dataframe的行进行遍历

3种遍历方法

在iterrows()里对row进行更改会直接修改原始dataframe的行，例：

df1 = pd.DataFrame(data=random_state.randint(10000, size=(3774, 3000)), index=pd.date_range('2010-01-01', '2020-05-01', freq='d'))print(df1)for idx, row in df1.iterrows():row[0] = 0print(df1)
# 结果0     1     2     3     4     ...  2995  2996  2997  2998  2999
2010-01-01  2732  9845  3264  4859  9225  ...  4154  9289  2609  4369  8497
2010-01-02  9305  3001  4810  2906  7437  ...  2644  7230  3830  2067  6998
2010-01-03  1555  1659  4481  9719  5869  ...  2936  3356  7052  9589   818
2010-01-04  2824  6661  4645  4593  2863  ...  5722  4968  4699  3115  6194
2010-01-05  5117  3212  4009  5082   206  ...  3989  1137  8580  9623  9954
...          ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...
2020-04-27  7157  7164  8072  1829  5243  ...  6620  8079  9726  9272  3106
2020-04-28  8603  3301  6819  5708  6772  ...  2344  4667  1416  5496  7303
2020-04-29  4922  9285  2712  3649   567  ...  6840  8727  1475  6463  4575
2020-04-30  1604  9847  9379  1088  5234  ...  4701  9478  7822  6443   652
2020-05-01  4877   895  2257  9885  6252  ...  5241  7137   679   804  6447[3774 rows x 3000 columns]0     1     2     3     4     ...  2995  2996  2997  2998  2999
2010-01-01     0  9845  3264  4859  9225  ...  4154  9289  2609  4369  8497
2010-01-02     0  3001  4810  2906  7437  ...  2644  7230  3830  2067  6998
2010-01-03     0  1659  4481  9719  5869  ...  2936  3356  7052  9589   818
2010-01-04     0  6661  4645  4593  2863  ...  5722  4968  4699  3115  6194
2010-01-05     0  3212  4009  5082   206  ...  3989  1137  8580  9623  9954
...          ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...
2020-04-27     0  7164  8072  1829  5243  ...  6620  8079  9726  9272  3106
2020-04-28     0  3301  6819  5708  6772  ...  2344  4667  1416  5496  7303
2020-04-29     0  9285  2712  3649   567  ...  6840  8727  1475  6463  4575
2020-04-30     0  9847  9379  1088  5234  ...  4701  9478  7822  6443   652
2020-05-01     0   895  2257  9885  6252  ...  5241  7137   679   804  6447[3774 rows x 3000 columns]Process finished with exit code 0

对Series进行遍历

for row_num, (factor_name, val) in enumerate(s.items()):pass

生成日期序列

pd.daterange(start= , end= , periods= ,freq= )
freq参数详情见链接

生成年频日期

for date in pd.date_range(start='2011',periods=10,freq='Y'):print(date)

读取表格

pd.read_csv(file_name, index_col=0, parse_dates=True) 默认以 ‘,’ 分割
pd.read_table(file_name) 默认以 ‘\t’ 分割
pd.read_table 的sep参数可以指定分割符
index_col = False的话表示读取的文件没有index列，index_col=0表示第0列是index
parse_dates=[‘col_name’]会将对应列的日期字符串转换成日期的格式
*当文件为空的时候会报错

删除满足条件的行

dataframe删除满足条件的行

增删改查

两个dataframe合并

*注意循环理最好不要有concat，循环里可以把series变成list，加道一个大list里。最后把那个list变成dataframe，这样快些

pd.concate([df1, df2], axis=0/1)
axis=0：把行拼在一起

import numpy as np
import pandas as pda = pd.DataFrame(data=np.random.randint(4,size=(2,2)))
b = pd.DataFrame(data=np.random.randint(4,size=(2,2)))print(pd.concat([a,b], axis=0))# 结果0  1
0  2  2
1  0  1
0  1  1
1  1  0

axis=1：把列拼在一起

import numpy as np
import pandas as pda = pd.DataFrame(data=np.random.randint(4,size=(2,2)))
b = pd.DataFrame(data=np.random.randint(4,size=(2,2)))print(pd.concat([a,b], axis=1))# 结果0  1  0  1
0  2  3  3  0
1  0  1  1  2

在concate的时候各个元素会进行对齐：

s1 = pd.Series([1,2], index= ['a','b'])
s2 = pd.Series([3,4], index= ['b','a'])
print(pd.concat([s1, s2],axis=1))# 结果:0  1
a  1  4
b  2  3

增加一行

使用 df.append(s1)
df.append(df1)
*如果要append Series，那么要指定Series的name，或者要指定 ignore_index=True
使用df.concat([df1,df2], axis =0)

注意，上面使用完了要赋值回去，如df = df.append()

增加一列

DataFrame.insert(pos, column_name, value, allow_duplicates=False)

参数：

pos : 参数column插入的位置，如果想插入到第一列则为0，取值范围： 0 <= pos <= len(columns),其中len(columns)为Dataframe的列数
column_name :给插入数据value取列名，可为数字，字符串等
value : 可以是整数，Series或者数组等
allow_duplicates : 默认 False，如果插入的列已存在则报错

注意，这个不会返回一个新的表格，直接在原来的表格上改了

删除列

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,10).reshape((3,3)),columns=['A','B','C'])
print(df)
print()
df = df.drop(columns=['A'])
print(df)

运行结果：

'''A  B  C
0  1  2  3
1  4  5  6
2  7  8  9B  C
0  2  3
1  5  6
2  8  9
'''

设置索引

set_index

原型：DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

keys：列标签或列标签/数组列表，需要设置为索引的列
drop：默认为True，删除用作新索引的列
append：默认为False，是否将列附加到现有索引
inplace：默认为False，适当修改DataFrame(不要创建新对象)
verify_integrity：默认为false，检查新索引的副本。否则，请将检查推迟到必要时进行。将其设置为false将提高该方法的性能。

reset_index

reset_index():

函数原型：DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill=‘’)

作用：把索引设置为0,1,2,3,4…

参数解释:

level：int、str、tuple或list，默认无，仅从索引中删除给定级别。默认情况下移除所有级别。控制了具体要还原的那个等级的索引
drop：drop为False则索引列会被还原为普通列，否则会丢失
inplace：默认为false，适当修改DataFrame(不要创建新对象)
col_level：int或str，默认值为0，如果列有多个级别，则确定将标签插入到哪个级别。默认情况下，它将插入到第一级。
col_fill：对象，默认‘’，如果列有多个级别，则确定其他级别的命名方式。如果没有，则重复索引名

取索引

.loc[] 或 .iloc[]

多重索引要加()，如： .loc[(1,2),3]

取某日期最近日期/前一最近日期/后一最近日期的值:

import numpy as np
import pandas as pd
import rea = pd.DataFrame(data=np.random.randint(10,size=(15,5)), index=pd.date_range(start='2020-01-01', end='2021-04-01', freq='M'))
result = a.index.get_loc('2020-03-29', method='ffill')
print(result)
print(a.iloc[a.index.get_loc('2020-03-15', method='ffill')])
# 结果:
10    8
1    6
2    3
3    6
4    5

索引的条件过滤
df.loc[() & ()], df.loc[() | ()], df.loc[~()]
isin,
索引条件过滤参考文章

更改Series和DataFrame列名

对于Series，因为只有一列，所以相当于改了整个Series的Name

s.rename("my_name")

对于DataFrame
1、直接大批量修改。
df.columns = [‘A’,‘B’]

2、只修改几个列
df.rename(columns={‘a’:‘A’}, inplace=Ture)

缺失值/空字符串/inf/字符串/非数值处理

非数值类型

def isnumber(x):try:return float(x)except:return None

对dataframe的每一列都apply这个就行了，或者使用applymap对每一个单元格进行操作

对于inf类型：
df[np.isinf(df)] = np.nan
接下来再顺便处理nan即可

对于NaN类型的：

df.dropna(axis = 0/1, how='all', thresh = 0/1/2..， subset= ['XX'])

参数说明：

axis：axis=0去掉的是行；axis=1去掉的是列
how：'all’代表只有当行/列全部是NaN的时候才被去掉
thresh：表示当有NaN数量>thresh个的时候，该行/列才被去掉
subset：去除指定列中含空值的行

例子：
axis = 1，去掉列

import numpy as np
import pandas as pda = pd.DataFrame(data=np.random.randint(10,size=(2,5)), index=pd.date_range(start='2020-01-01', end='2020-01-02', freq='D'))
a.iloc[0,0]= np.nan
print(a)
print(a.dropna(axis=1))# 结果0  1  2  3  4
2020-01-01  NaN  8  0  5  3
2020-01-02  4.0  6  3  7  61  2  3  4
2020-01-01  8  0  5  3
2020-01-02  6  3  7  6

axis=0, 去掉行

a = pd.DataFrame(data=np.random.randint(10,size=(2,5)), index=pd.date_range(start='2020-01-01', end='2020-01-02', freq='D'))
a.iloc[0,0]= np.nan
print(a)
print(a.dropna(axis=0))# 结果0  1  2  3  4
2020-01-01  NaN  6  8  1  1
2020-01-02  2.0  0  7  5  30  1  2  3  4
2020-01-02  2.0  0  7  5  3

填补缺失值：
df.fillna(method=“bfill/ffill”)

method：'bfill’向前填充，‘pad’、'ffill’向后填充

对于空字符串
例如：df.iloc[:,2] = df.iloc[:,2].str.split(‘,’, expand=True).replace(‘’, np.nan)
不要用df.replace
把所有字符串换成None

import pandas as pd
for col in df.columns:df[col] = pd.to_numeric(df[col], errors='coerce')

表格排序

df.sort_values()

作用：既可以根据列数据，也可根据行数据排序。
注意：必须指定by参数，即必须指定哪几行或哪几列；无法根据index名和columns名排序（由.sort_index()执行）

df.sort_index()

df. sort_index()可以完成和df. sort_values()完全相同的功能，但python更推荐用只用df. sort_index()对“根据行标签”和“根据列标签”排序，其他排序方式用df.sort_values()。

df.set_index()

DataFrame可以通过set_index方法，可以设置单索引和复合索引。
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
append添加新索引，drop为False，inplace为True时，索引将会还原为列

表格画图 - df.plot()

使用方法链接

使用DataFrame的plot方法绘制图像会按照数据的每一列绘制一条曲线，默认按照列columns的名称在适当的位置展示图例，比matplotlib绘制节省时间，且DataFrame格式的数据更规范，方便向量化及计算。

设置画出的类型，折线图还是柱状图
折线图不用设置
kind = ‘bar’
df.plot()返回ax，可对ax进行设置
plot()里面的label参数无效，会被column的名字覆盖掉

采样

df.resample()

resample一般需要index是日期的形式，根据某一频率采样后，得到的是相应的dataframe（包含相应日期的行），一般后面会跟数学函数，比如 mean(), sum()等，也可以使用apply()对这个dataframe进行操作

常用的函数：
first(), last()

对元素进行映射apply

df[col_name].apply(function_name, args=(xx,xx,…))

Shift(num)

shift是把列总体往下移num格

两个表的连接

方法一：

merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True, indicator=False)

left：左边的dataframe
right：右边的dataframe
how：'inner’内连接 or 'left’左外连接 or 'left’右外连接 or 'outer’全连接
on: 用于连接的列索引名称
left_on：左边dataframe用于连接的列名称
right_on：右边dataframe用于连接的列名称

两个表的遍历

groupby

import pandas as pd
import matplotlib.pyplot as pltdata_df = pd.read_csv('./group.csv', encoding='gbk').set_index(['date','memberName', 'productName'])total_position = (data_df['longNumber'] + data_df['shortNumber']).rename('totalPosition')
total_position_sum = total_position.groupby(by=['date','productName']).sum().rename('totalPositionSum')

groupyby表示，按by指定的列进行分组，然后进行指定操作

groupby后面跟的一些有用的函数：
nunique()，取unique的数量

内连接，左连接，右连接，全外连接(pd.merge)

pd.merge(left=, right=, on= ,how=)
DataFrame的合并merge

Matplotlib.pyplot库

显示中文

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

plt.plot()

marker: marker

颜色设置

设置颜色教程

Fig和axes和plt画图的区别

Fig和Axes和plt画图的区别

axes 相关属性设置

axes相关属性设置

注意，设置子图标题：
ax.title.set_text(‘…’)

plt.subplots

plt.subplots()参考博文
理解 fig, axes= plt.subplots(2,2)，其中axes是返回的子图, fig 是画布
为了在某个子图上画画，一般在plot函数下会有一个参数ax,传入ax=ax即可在相应的子图上画

为了使得子图间不重叠，可以使用方法 fig.tight_layout()

设置横纵坐标间隔

xticks和xticklabels

ax.set_xticks( lst )
lst里面放的是需要显示的数据点的位置

ax_set_xticklabels( lst )
lst里面是x轴任意刻度的标签，即把其他值用作标签

或者ticks和labels一起设置
日期坐标设置相关
例如：
假如A是需要画的Series，可以这样设置日期
plt.xticks(range(1, len(A.index),90),A.index[range(1, len(A.index),90)],rotation=90)
第一个参数是，设置需要显示坐标的数据点的pos，第二个参数是每个数据点的label，第三个参数是旋转多少度
MultipleLocator

from matplotlib.pyplot import MultipleLocator
y_major_locator=MultipleLocator(10)
ax.yaxis.set_major_locator(y_major_locator)

设置X\Y轴显示的最大值和最小值

链接
plt.axes(x_min, x_max, y_min, y_max)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

双坐标轴

ax1 = zero_rate_s.plot()
forward_rate_s.plot(ax=ax1)
ax1.set_xlabel('month')
ax1.set_ylabel('interest rate')
ax1.legend(loc='upper left')ax2 = ax1.twinx()
discount_factor_s.plot(color='green', ax=ax2)
ax2.set_ylabel('discount factor')
ax2.legend(loc='upper right')

绘制水平直线

ax.axhline(y=0, color=‘black’, linestyle=‘–’)

图例、网格, 设置

plt.legend() / ax.legend()
legend(labels=[])可以设置每根曲线在图例里的名称
plt.grid(True)

参数:
matplotlin.pyplot.grid(b, which, axis, color, linestyle, linewidth， **kwargs)
- b : 布尔值。就是是否显示网格线的意思。官网说如果b设置为None，且kwargs长度为0，则切换网格状态。
- which : 取值为’major’, ‘minor’， ‘both’。默认为’major’。
- axis : 取值为‘both’， ‘x’，‘y’。就是想绘制哪个方向的网格线。
- color : 这就不用多说了，就是设置网格线的颜色。或者直接用c来代替color也可以。
- linestyle :也可以用ls来代替linestyle，设置网格线的风格，是连续实线，虚线或者其它不同的线条。 | ‘-’ | ‘–’ | ‘-.’ | ‘:’ | ‘None’ | ‘’ | ‘’]
- linewidth : 设置网格线的宽度

一个一个添加子图

import matplotlib.pyplot as plt#创建新的figure
fig = plt.figure()#必须通过add_subplot()创建一个或多个绘图
ax = fig.add_subplot(221)#绘制2x2两行两列共四个图，编号从1开始
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)#图片的显示
plt.show()

绘制频率直方图

python绘制频率直方图

f=plt.figure()
ax1=f.add_subplot(111)bins=np.arange(-60, 60, step=5)ax1.hist(ts, bins=bins, alpha=1,color='steelblue', edgecolor='none' ,rwidth=0.8,density =False)
# density表示纵坐标是否表示为频率ax1.set_xticks(bins)

用Python为直方图绘制拟合曲线的两种方法

封装成函数形式

def plot_distribution(ts, lower_bound, upper_bound, step):f = plt.figure()ax1 = f.add_subplot(111)bins = np.arange(lower_bound, upper_bound, step= step)ax1.hist(ts, bins=bins, alpha=0.5,color='steelblue', edgecolor='none', rwidth=0.8,density=False)ax1.set_xticks(bins)plt.show()

画条形图

plt.bar()画图

import numpy as np
import matplotlib.pyplot as plt
# 数据
x = np.arange(4)
Bj = [52, 55, 63, 53]
Sh = [44, 66, 55, 41]
bar_width = 0.3
# 绘图 x 表示 从那里开始
plt.bar(x, Bj, bar_width)
plt.bar(x+bar_width, Sh, bar_width, align="center")
# 展示图片
plt.show()

画3D图

ax.plot_surface()
官方链接

画散点图

python散点图的绘制

Pytorch

Serverless

随机数

函数

np.random.randint

import numpy as npx=np.random.randint(100, size=(5))
# 生成一个[0,100)的随机数，大小为5print(x)

np.random.rand
生成的数组元素在[0,1)之间，参数是你第N维的大小

import numpy as npl1 = np.random.rand(2,3)print(l1)'''
[[0.44488226 0.7159999  0.11339061][0.42876384 0.49648113 0.27221476]]
'''

随机取样

random.sample()
random.choice()

随机分成N组

randomState = np.random.RandomState(0)
def divideIntoNstrand(listTemp, n):twoList = [[] for i in range(n)]for i,e in enumerate(listTemp):twoList[i%n].append(e)return twoList

randomState.shuffle(group_factor)
tmp_split_factor_group = divideIntoNstrand(group_factor, split_num)

正则表达式

正则表达式菜鸟教程

re.match（从起始位置开始匹配）

re.match参考链接

re.match(pattern, string, flags=0)

re.search（扫描整个字符串并返回第一个成功的匹配）

re.search(pattern, string, flag)

如果没有匹配会返回None

成功匹配后，使用group()返回的是匹配的字符串
使用groups()返回的是元组

例子1：

factor_path = r'E:\code_repo\gp_cta\result_seed0_population2000_parsi0.04'
for file_name in os.listdir(factor_path):re_result = re.search(r'best_program_(\d+).csv', file_name)if re_result is not None:print(re_result.group())
# result
best_program_0.csv
best_program_1.csv
best_program_2.csv
best_program_3.csv
best_program_4.csv

例子2：

factor_path = r'E:\code_repo\gp_cta\result_seed0_population2000_parsi0.04'
for file_name in os.listdir(factor_path):re_result = re.search(r'best_program_(\d+).csv', file_name)if re_result is not None:print(re_result.groups())
# result
('0',)
('1',)
('2',)
('3',)
('4',)

re.search(pattern, string).groups() ，如果有匹配到，返回匹配到的各个组的元素

如果要加括号，记得把通配符放在括号里面，如(\d+)

dataclasses库

一个dataclass是指“一个带有默认值的可变的namedtuple”，广义的定义就是有一个类，它的属性均可公开访问，可以带有默认值并能被修改，而且类中含有与这些属性相关的类方法，那么这个类就可以称为dataclass，再通俗点讲，dataclass就是一个含有数据及操作数据方法的容器。
dataclasses参考链接

functools库 — 缓存函数结果

缓存函数结果

from functools import lru_cache@lru_cache(maxsize=999)
def load_bar_data(symbol: str,exchange: Exchange,interval: Interval,start: datetime,end: datetime
):""""""return database_manager.load_bar_data(symbol, exchange, interval, start, end)

在函数头前面加lru_cache可以起到缓存函数结果的作用, max_size=num,表示最多缓存num个结果

有用的代码片段

线性回归

1. 方法一
线性回归及可决系数R^2的计算

from sklearn import linear_model
cft = linear_model.LinearRegression()
cft.fit(x,y)
beta, alpha  = cft.coef_, cft.intercept_

from sklearn.metrics import r2_score
sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None, multioutput=’uniform_average’)

注意，如果x是single feature，要reshape(-1,1)

2. 方法二
可计算p值的方法
Find p-value (significance) in scikit-learn LinearRegression

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import statsdiabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.targetX2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
predicted_y = est2.predict(X2)
print(est2.summary())df1 = pd.concat((est2.params, est2.bse,est2.tvalues,est2.pvalues), axis=1)
df1 = df1.rename(columns={0: 'coef', 1: 'std_err',2:'t-statistics',3:'p-value'})
df1 = df1.round(decimals=3)
df1.to_excel('summary.xlsx')

statsmodel的结果提取
statsmodel结果提取

从列表中随机选取N个数

从列表中选取N个数参考链接

如果对象是list，那么可以使用如下代码

import random
data = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
sample_num = 5
random.sample(data, sample_num)

如果对象是Numpy的ndarray，或者是Pandas的Series或者DataFrame，需要先生成下标，例如对于Series来说：

contract_list = pd.Series(os.listdir('./complete_code')).apply(lambda x: x[:-4])
contract_list = contract_list.rename('address')sample_num = 2000
sample_list = [i for i in range(contract_list.shape[0])]
sample_list = random.sample(sample_list, sample_num) # 生成下标sample_Series = contract_list.iloc[sample_list]

输出重定向

# 输出定向到文件
class Logger(object):def __init__(self, filename="Default.log"):self.terminal = sys.stdoutself.log = open(filename, "a")def write(self, message):self.terminal.write(message)self.log.write(message)def flush(self):passsys.stdout = Logger(filename='./out.txt')
sys.stderr = Logger(filename='./err.txt')

并行

由于python解释器CPython只有一个GIL（Global Interpreter Lock），线程运行要获得这个锁才能运行，因此python无法实现多线程。
关于python的GIL多线程看这里

为了实现并行计算，充分利用多核CPU，Python只能运用多进程。

但是众所周知，进程切换比线程开销大，所以性能较多线程比较差。

多进程并行例子

I/O密集型任务：Python中可以使用多线程试试
（涉及到网络、磁盘IO的任务都是IO密集型任务）
from multiprocessing.dummy import Pool

CPU密集型任务：Python中只能使用多进程并行
from multiprocessing import Pool

使用joblib来并行运行程序

参考链接
例子


from joblib import Parallel, delayed
import time
def single(a):""" 定义一个简单的函数  """time.sleep(1)  # 休眠1sprint(a)  start = time.time()  # 记录开始的时间
Parallel(n_jobs=3)(delayed(single)(i) for i in range(10))   # 并行化处理
Time = time.time() - start  # 计算执行的时间
print(str(Time)+'s')

delayed第二个括号里的i表示传递给single函数的参数

运行结果：

#运行结果如下
0
1
2
3
4
5
6
7
8
9
4.833665370941162s

并行之使用Multiprocessing 获取函数值

import pandas as pd
import os
from multiprocessing import Pooldef detectDistribution(src_path, filename):print('Checking {}'.format(filename))tmp_fullpath = os.path.join(src_path, filename)tmp_s = pd.read_pickle(tmp_fullpath).iloc[:,0]tmp_s_mean = tmp_s.mean()tmp_s_std = tmp_s.std()bins=[tmp_s_mean - 2.5* tmp_s_std,tmp_s_mean - 1.5* tmp_s_std,tmp_s_mean - 0.5* tmp_s_std,tmp_s_mean + 0.5 * tmp_s_std,tmp_s_mean + 1.5 * tmp_s_std,tmp_s_mean + 2.5* tmp_s_std]s = pd.cut(tmp_s, bins=bins)s_val_cnt = s.value_counts().sort_index(ascending=True)print('Done {}'.format(filename))if (s_val_cnt.iloc[0]< s_val_cnt.iloc[1] < s_val_cnt.iloc[2] ) and (s_val_cnt.iloc[2]>s_val_cnt.iloc[3]>s_val_cnt.iloc[4]):return Trueelse:return  Falseif __name__ == '__main__':src_path = r'D:\jiangjinyu\stock\2_pkl_version4.0\allpkl_f1_xgb_20220819\_1430'nthread = 30p = Pool(nthread)result_s = pd.Series(index = os.listdir(src_path))for tmp_filename in os.listdir(src_path):flag = p.apply_async(detectDistribution,(src_path, tmp_filename))# flag = detectDistribution(src_path, tmp_filename)result_s.loc[tmp_filename] = flagp.close()p.join()for idx, row in result_s.items():result_s.loc[idx] = row.get()result_s.to_csv('check_f1_report.csv')

监控内存使用情况

方法

Subprocess

subprocess教程
subprocess.run会阻塞当前进程
subprocess.Popen会打开一个新的进程，当前进程继续往下执行。

注意，stdout默认是None，这样会把输出打印到屏幕上。

import subprocessif __name__ ==  '__main__':# 创建一个子进程p = subprocess.Popen('python test.py', shell=True)#p.poll可以检查进程是否终止while p.poll() is None:passprint(p.stdout)

import subprocessif __name__ ==  '__main__':# 创建一个子进程p = subprocess.run('python test.py', shell=True)

计算运行时间

import time
import math# 计算消耗时间，并格式化的函数
def timeSince(since):now = time.time()s = now - sincem = math.floor(s / 60)s -= m * 60return '%dm %ds' % (m, s)start_time = time.time()
# ...
# ...some code
# ...
print(timeSince(start_time)) # 消耗的时间

计算每段代码的运行时间的函数装饰器

from functools import wraps
from line_profiler import LineProfilerdef profile_wrapper(func):def decorate():@wraps(func)def wrapper(*args, **kwargs):prof = LineProfiler()prof_wrapper = prof(func)ret = prof_wrapper(*args, **kwargs)prof.print_stats()return retreturn wrapperreturn decorate()

在需要统计运行时间的代码上加上@profile_wrapper就可以了

使用tqdm显示for循环的进度条和耗时

一般使用方法为：

for item in tqdm(iterable, total=):

关于total参数的说明

序列化python对象

使用 dill 库

dill.dump(obj, fileobj)dill.load(obj, fileobj)

计算股票行业哑变量因子

import pandas as pddef change_code(code):pre = code[-2:]num = code[:-3]return pre.lower() + '.' + numif __name__ == '__main__':A_code_lst = pd.read_csv('A_code_lst.txt').iloc[:, 0].apply(change_code).to_list()industry_dummy_variable= pd.DataFrame(index=A_code_lst, dtype='float')stock_industry = pd.read_csv('./stock_industry.csv', encoding='gbk')# stock_industry['code'] = stock_industry['code'].apply(lambda x:x[3:])stock_industry = stock_industry.set_index(['code'])stock_industry = stock_industry.loc[A_code_lst]industry_list = stock_industry['industry'].dropna().unique()for industry in industry_list:tmp_s = pd.Series(index=A_code_lst, dtype='float').rename('{}'.format(industry))tmp_s = tmp_s.fillna(0)tmp_s.loc[ stock_industry['industry'] == industry ] = 1industry_dummy_variable = pd.concat([industry_dummy_variable, tmp_s], axis=1)industry_dummy_variable.T.to_csv('./industry_dummy_variable.csv')

因子中性化、去极值、标准化

注意：一般市值中性化的时候是月度的市值进行中性化

去极值使用的函数： np.clip(array, min, max)

中位数去极值

# 中位数去极值
factor_row_median = factor_row.median()
factor_delta_median = (factor_row - factor_row_median).abs().median()
factor_row = np.clip(factor_row, factor_row_median - 5 * factor_delta_median, factor_row_median + 5 * factor_delta_median)

np.where(a<1,None,a)
把数组a中小于1的数设置为None，否则设置为原来的数

tmp=np.where(df['annual_return'] > factor_row_median + 5 * factor_delta_median, None, df['annual_return'])df['annual_return']=np.where(df['annual_return'] < factor_row_median - 5 * factor_delta_median, None, tmp)

函数化：

# Notice: factor_row是series
def exclude_extreme(factor_row):factor_row_median = factor_row.median()factor_delta_median = (factor_row - factor_row_median).abs().median()tmp = np.where(factor_row> factor_row_median + 5 * factor_delta_median, None, factor_row)return np.where(factor_row < factor_row_median - 5 * factor_delta_median, None, tmp)

非线性规划（Scipy.optimize.minimize）

关于scipy.optimize.minimize的介绍

风险平价

# 组合内资产的协方差矩阵（在当前上下文中指策略之间的协方差矩阵）
V = np.matrix([[123,37.5,70,30],[37.5, 122, 72, 13.5],[70, 72, 321, -32],[30, 13.5, -32, 52]])R = np.matrix([[14],[12],[15],[7]])def calculate_portfolio_var(w, V):'''计算投资组合的风险:param w: 向量，表示各个资产在投资组合中的权重，其实对于这里的输入是一个 1*n 的矩阵:param V: 资产之间的协方差矩阵:return: 投资组合收益率的方差 sigma^2 （表示投资组合的风险）'''w = np.matrix(w)# w*V*w.T最后是一个1*1的矩阵来着，所以需要取[0,0]# w*V*w 是二次型return (w*V*w.T)[0, 0]def calculate_risk_contribution(w, V):'''计算各个资产对投资组合的风险贡献:param w: 向量，表示各个资产在投资组合中的权重，其实对于这里的输入是一个 1*n 的矩阵:param V: 资产之间的协方差矩阵:return:'''w = np.matrix(w)sigma = np.sqrt(calculate_portfolio_var(w, V))# 边际风险贡献, marginal risk contribution# MRC是一个 n*1 的矩阵，代表各个资产的边际风险贡献MRC =V*w.T# 各个资产对投资组合的风险贡献程度RC = np.multiply(MRC, w.T) / sigmareturn RCdef risk_budget_objective(w, params):'''使用优化求解器求解目标:param w: 原始的投资组合中各个资产的权重，是优化器的初始迭代点:param params: params[0]代表各资产的协方差矩阵params[1]代表希望各资产对组合风险的贡献程度:return:'''# 计算投资组合风险V = params[0]expected_rc = params[1]sig_p = np.sqrt(calculate_portfolio_var(w, V))risk_target = np.asmatrix(np.multiply(sig_p, expected_rc))asset_RC = calculate_risk_contribution(w, V)J = sum(np.square(asset_RC - risk_target.T))[0, 0]return Jdef total_weight_constraint(w):'''在约束求解器中，这个函数的类型是eq, 表示最后返回的这个值要等于0:param w::return:'''return np.sum(w) - 1.0def long_only_contraint(w):# 表示w中的元素都要大于等于0return wdef solve_risk_parity_weight(original_w, expected_rc, V):'''解决风险平价的权重:param expected_rc: 期望的:param V: 资产间的协方差矩阵:return:'''# original_w = [0.25, 0.25, 0.25, 0.25]constraint = ({'type': 'eq','fun': total_weight_constraint},{'type': 'ineq','fun': long_only_contraint})res = minimize(risk_budget_objective,np.array(original_w),args=[V, expected_rc],method='SLSQP',constraints=constraint,options={'disp':False})return np.asmatrix(res.x)

拷贝文件

拷贝整个文件夹

import shutil
def CopyFile(filepath, newPath):fileNames = os.listdir(filepath)for file in fileNamesnewDir = os.path.join(filepath, file)if os.path.isfile(newDir):newFile = os.path.join(newPath, file)shutil.copyfile(newDir,newFile)else:CopyFile(newDir, newPath)

Python数据处理相关语法整理相关推荐

Python数据处理相关小例编程
有5名某界大佬xiaoyun.xiaohong.xiaoteng.xiaoyi和xiaoyang,其QQ号分别是88888.5555555.11111.1234321和1212121,用字典将这些数据 ...
数据挖掘，数据处理--相关英文单词整理，一片笔记足够！
不需要刻意去记,遇到了有个印象能认出来就行.来源于网络,以及自己学习过程中的整理. 这里做一个整理,方便初学者学习. box and whisker plot: 框须图 brushing: 画刷法 c ...
python数据处理相关
1.enumerate(sequence, start=0)函数枚举函数,传入一个序列或迭代对象,枚举输出一个序号对象. 可用于对序列数据进行编号. 2.字符串的split()方法和strip()方 ...
python数组堆叠,堆叠数组-python数据处理
堆叠数组-python数据处理堆叠数组-python数据处理从深度看,数组既可以横向叠放,也可以竖向叠放.为此,可以使用vstack().dstack().hstack().column_stac ...
Python GUI编程-了解相关技术[整理]
Python GUI编程-了解相关技术[整理] 我们可以看到,其实python进行GUI开发并没有自身的相关库,而是借用第三方库进行开发.tkinter是tcl/tk相关,pyGTK是Gtk相关,wx ...
python用于计算和数据处理的包有哪些_chapter3-2 常用数据处理包Numpy整理2
Numpy是Python数据处理分析和建模中最常用的包之一,笔者希望通过本文和下一篇的常用技巧补充,将自己在科研和工作中积累和摘抄的数据处理包使用经验和技巧笔记展示给大家~ 本部分技巧知识点主要包括: ...
Python后端相关技术/工具栈
Python后端相关技术/工具栈转载http://python.jobbole.com/83486/ 整理下目前涉及到的python的技术栈和工具栈(用过或了解的, 其他的后续用到再补充) 编辑器 ...
python自动化办公书籍-python自动化办公知识点整理汇总
知乎上有人提问:用python进行办公自动化都需要学习什么知识呢? 这可能是很多非IT职场人士面临的困惑,想把python用到工作中,却不知如何下手?python在自动化办公领域越来越受欢迎,批量处理 ...
太赞了！用200道题彻底搞定Python数据处理！
前言 Pandas与NumPy都是Python数据分析中的利器,但是对着官方文档学习是十分枯燥且低效的方式,因此我精心挑选了200个Python数据处理中的常用操作,并整理成习题的形式创作了Panda ...

Python数据处理相关语法整理

Python数据处理相关语法整理

简介

Python自身特性总结

编程Tips

拿到新电脑配环境时做的事：

一些加速python代码的技巧

Python项目代码结构

量化策略指标计算

绝对收益率

最大回撤

给定累计收益率序列计算收益率序列

年化收益率

年化波动率

夏普比率

-----------------------------------------------------

------------------- 分割线 -------------------------

------------------------------------------------------

字符串

字典

字典取数

Datetime库

字符串和日期相互转换

天数之间的加减

Scipy.stats

Numpy 库

数学函数

正态分布取样 / 多元正态分布取样

排名

数组/矩阵的堆叠

矩阵(matrix)相关操作

广播机制

生成固定的随机数

排序

np.nan的处理

正负无穷

Pandas库

计算相关性

统计频数

描述性统计describe

多层索引Multiindex

dataframe的整合与形变

数学函数

计算协方差和相关系数

对每一行每一列分别操作

时间窗口

计算相关系数

点乘

对dataframe的列进行遍历

对dataframe的行进行遍历

对Series进行遍历

生成日期序列

读取表格

删除满足条件的行

增删改查

两个dataframe合并

增加一行

增加一列

删除列

设置索引

set_index

reset_index

取索引

更改Series和DataFrame列名

缺失值/空字符串/inf/字符串/非数值 处理

表格排序

df.sort_values()

df.sort_index()

df.set_index()

表格画图 - df.plot()

采样

df.resample()

对元素进行映射apply

Shift(num)

两个表的连接

两个表的遍历

groupby

内连接，左连接，右连接，全外连接(pd.merge)

Matplotlib.pyplot库

显示中文

缺失值/空字符串/inf/字符串/非数值处理