https://r4ds.had.co.nz/transform.html
The R operations from that chapter can also be done in Python, as shown below.

Python:

pip install nycflights13
pip install dfply
from dfply import *
import numpy as np
import pandas as pd
from nycflights13 import flights
from nycflights13 import airports
from nycflights13 import airlines
df=flights
df.head()
airports.head()

filter()

R:

nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, month == 1, day == 1)

Python:

df[(df['month']==1)&(df['day']==1)]
flights>>filter_by((X.month==1)&(X.day==1))
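The R example with %in% has no Python counterpart above. A minimal sketch of one way to write it in pandas, using a small made-up table (`toy`) instead of the real flights data:

```python
import pandas as pd

# Toy stand-in for the flights table (values are made up for illustration).
toy = pd.DataFrame({
    "month": [1, 11, 12, 6, 11],
    "day": [1, 5, 25, 14, 30],
})

# R: filter(flights, month %in% c(11, 12))
nov_dec = toy[toy["month"].isin([11, 12])]

# The same filter written with DataFrame.query
nov_dec_q = toy.query("month in [11, 12]")

print(nov_dec)
```

Both forms keep the original row index, so the two results are identical.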

arrange()

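R:

arrange(flights, year, month, day)
arrange(flights, desc(dep_delay))

Python:

df.sort_values(by=["year","month","day"])       # ascending, like arrange()
df.sort_values(by="dep_delay",ascending=False)  # like arrange(flights, desc(dep_delay))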

select()

R:

select(flights, year, month, day)
select(flights, time_hour, air_time, everything())

There are a number of helper functions you can use within select():

starts_with("abc"): matches names that begin with “abc”.
ends_with("xyz"): matches names that end with “xyz”.
contains("ijk"): matches names that contain “ijk”.
matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
num_range("x", 1:3): matches x1, x2 and x3.

Python:

df[["year","month","day"]]
df.filter(regex='Q',axis=1)  # select columns whose names match a regular expression
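The select() helpers listed above have rough pandas counterparts. The sketch below is one possible mapping, not an official equivalence table; `toy` is a made-up one-row frame that borrows a few flights column names:

```python
import pandas as pd

# One-row toy frame reusing a few flights column names (values are made up).
toy = pd.DataFrame({
    "year": [2013], "dep_delay": [2.0], "arr_delay": [11.0],
    "air_time": [227.0], "time_hour": ["2013-01-01 05:00:00"],
})

# starts_with("dep") / ends_with("delay") / contains("time")
starts = toy.loc[:, toy.columns.str.startswith("dep")]
ends = toy.loc[:, toy.columns.str.endswith("delay")]
contains = toy.filter(like="time", axis=1)

# matches("(.)\\1"): column names that contain a repeated character
repeated = toy.filter(regex=r"(.)\1", axis=1)

# select(flights, time_hour, air_time, everything()): move two columns to the front
front = ["time_hour", "air_time"]
reordered = toy[front + [c for c in toy.columns if c not in front]]

print(list(reordered.columns))
```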

mutate(): create new variables

R:

flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60)

Python:

df['gain'] = df['dep_delay']- df['arr_delay']
df['speed'] = df['distance'] / df['air_time'] * 60
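The two assignments above modify df in place, whereas mutate() returns a new table. A sketch of a closer match using DataFrame.assign, on a made-up two-row frame (`toy`):

```python
import pandas as pd

# Toy rows with the columns used by the mutate() example (values are made up).
toy = pd.DataFrame({
    "dep_delay": [2.0, 4.0],
    "arr_delay": [11.0, 20.0],
    "distance": [1400, 1416],
    "air_time": [200.0, 236.0],
})

# assign() returns a new DataFrame and leaves `toy` unchanged, like mutate().
result = toy.assign(
    gain=lambda d: d["dep_delay"] - d["arr_delay"],
    speed=lambda d: d["distance"] / d["air_time"] * 60,
)

print(result)
```

Using lambdas lets each new column refer to the frame being built, which is convenient inside a method chain.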

summarise()

R:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

Python:

df.describe().loc['mean']       # mean of every numeric column
df=df.dropna(axis=0,how='any')  # drop rows with any missing value, then recompute
df.describe().loc['mean']
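For a single column, the mean can also be taken directly. The sketch below (made-up values) shows how pandas handles missing values compared with R's na.rm argument:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (values are made up).
toy = pd.DataFrame({"dep_delay": [2.0, np.nan, 10.0]})

# R: summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
delay = toy["dep_delay"].mean()                     # NaN is skipped by default

# R: mean(dep_delay) without na.rm, which propagates the missing value
delay_strict = toy["dep_delay"].mean(skipna=False)

print(delay, delay_strict)
```

Because pandas skips NaN by default, dropping rows with dropna() first is only needed when rows with a missing value in any column should be excluded from every summary.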

Grouped summaries with summarise()

R:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))

Python:

flights.groupby(['year','month','day'])['dep_delay'].agg('mean')  # aggregate a single column (a Series)
g=df.groupby(['year','month','day'])
for key,df2 in g:
    print(key)
    print(df2.describe().loc['mean'])
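The loop above prints one summary per group. When a single result table is wanted, as summarise() returns, named aggregation is one option. A sketch on made-up data; the output column names `delay` and `n` are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Toy flights-like data for two days (values are made up).
toy = pd.DataFrame({
    "year": [2013] * 4,
    "month": [1, 1, 1, 1],
    "day": [1, 1, 2, 2],
    "dep_delay": [2.0, np.nan, 10.0, 20.0],
})

# One row per (year, month, day); as_index=False keeps the keys as columns.
by_day = (
    toy.groupby(["year", "month", "day"], as_index=False)
       .agg(delay=("dep_delay", "mean"),   # mean skips NaN
            n=("dep_delay", "size"))       # size counts rows, NaN included
)

print(by_day)
```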

Grouped mutates (and filters)

Python:

g=df.groupby(['year','month','day'])
for key,df2 in g:
    print(key)
    print(df2.sort_values(by=["arr_delay"],ascending=False)[:10])  # 10 largest arrival delays of the day
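The same "worst rows per group" selection can be done without an explicit loop. A sketch on made-up data, keeping the top 2 rows per day instead of the top 10:

```python
import pandas as pd

# Toy data: three flights on day 1, two on day 2 (values are made up).
toy = pd.DataFrame({
    "day": [1, 1, 1, 2, 2],
    "arr_delay": [5.0, 30.0, 12.0, 7.0, 45.0],
})

# Sort once by delay, then keep the first 2 rows of every group.
worst = (
    toy.sort_values("arr_delay", ascending=False)
       .groupby("day")
       .head(2)
)

print(worst)
```

The result is a single DataFrame in sorted order, which is easier to process further than printed output.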

Python:

g=df.groupby(['dest'])
for key,df2 in g:
    if(len(df2)>365):
        print(key)
        print(df2)
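Keeping only the groups above a size threshold can also be written with GroupBy.filter. A sketch on made-up data; the threshold `min_flights` is lowered from 365 to 2 so the toy table is enough to show the effect:

```python
import pandas as pd

# Toy data: ATL has 3 flights, BOS 1, ORD 2 (values are made up).
toy = pd.DataFrame({
    "dest": ["ATL", "ATL", "ATL", "BOS", "ORD", "ORD"],
    "arr_delay": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

min_flights = 2  # illustrative threshold; the loop above uses 365

# Keep every row that belongs to a destination with more than min_flights rows.
popular = toy.groupby("dest").filter(lambda g: len(g) > min_flights)

print(popular)
```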

Python:

df1=df[df['arr_delay']>0].copy()  # copy so the new column is not written to a view of df
s=df1['arr_delay'].sum()
df1['prop_delay']=df1['arr_delay']/s
df1[['year','month','day','dest','arr_delay','prop_delay']]
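The code above divides each delay by the overall total. If the share within each destination is wanted instead (a grouped mutate), transform is one way to do it. A sketch on made-up data:

```python
import pandas as pd

# Toy data with two destinations (values are made up).
toy = pd.DataFrame({
    "dest": ["ATL", "ATL", "BOS", "BOS", "BOS"],
    "arr_delay": [10.0, 30.0, -5.0, 20.0, 60.0],
})

# Keep delayed flights only, then divide by the per-destination total.
delayed = toy[toy["arr_delay"] > 0].copy()
group_total = delayed.groupby("dest")["arr_delay"].transform("sum")
delayed["prop_delay"] = delayed["arr_delay"] / group_total

print(delayed)
```

transform returns a Series aligned with the original rows, so the division is done row by row without a merge.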

12 Tidy data (data cleaning)

Python:

import numpy as np
import pandas as pd
from plydata.tidy import *
from plydata import select
country=['Afghanistan','Brazil','China']
year1999=[745, 37737, 212258]
year2000=[2666,80488,213766]
df=pd.DataFrame(columns=['country','1999','2000'])
df['country']=country
df['1999']=year1999
df['2000']=year2000

pivot_longer: converting between wide and long tables

R:

table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

Python:
pandas melt
https://blog.csdn.net/maymay_/article/details/80039677
pandas pivot_table
https://www.cnblogs.com/Yanjy-OnlyOne/p/11195621.html

df.melt(id_vars='country',var_name='year',value_name='cases')

from plydata.tidy import *
from plydata import select
df >> pivot_longer( cols=select('1999','2000'),names_to='year',values_to='cases')


Python (plain loop):

b=[]
for i in range(len(df)):
    for j in ['1999','2000']:
        a=[]
        a.append(df.iloc[i]['country'])
        a.append(j)
        a.append(df.iloc[i][j])  # cases for year j
        b.append(a)
df1=pd.DataFrame(b,columns=['country','year','cases'])
df1

df2=pd.DataFrame([['Afghanistan', '1999', 19987071],['Afghanistan', '2000', 20595360],['Brazil', '1999',  172006362],['Brazil', '2000',174504898],['China', '1999', 1272915272],['China', '2000', 1280428583]])
df2.columns=['country','year','population']
df2

df_merge =df1.merge(df2,on=['country','year'])
df_merge

df4=pd.DataFrame([
    ['Afghanistan', '1999', 'cases', 745],
    ['Afghanistan', '1999', 'population', 19987071],
    ['Afghanistan', '2000', 'cases', 2666],
    ['Afghanistan', '2000', 'population', 20595360],
    ['Brazil', '1999', 'cases', 37737],
    ['Brazil', '1999', 'population', 172006362],
])
df4.columns=["country","year","type" ,"count"]
df4


Using a package (plydata):

df4 >> pivot_wider(names_from='type',values_from='count')


Using plain Python code:

b=[]
for i in range(len(df4)):
    a=[]
    a.extend(df4.iloc[i][['country','year']].tolist())
    a.append(df4[(df4['country']==a[0])&(df4['year']==a[1])&(df4['type']=='cases')]['count'].values[0])
    a.append(df4[(df4['country']==a[0])&(df4['year']==a[1])&(df4['type']=='population')]['count'].values[0])
    b.append(a)
b=pd.DataFrame(b)
b=b.drop_duplicates()
b.columns=['country','year','cases','population']
b
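pandas can also do this long-to-wide reshaping itself. A sketch using DataFrame.pivot on a small table with the same columns as df4; it assumes pandas 1.1 or newer, where `index` accepts a list:

```python
import pandas as pd

# Long table: one row per (country, year, type).
long = pd.DataFrame(
    [["Afghanistan", "1999", "cases", 745],
     ["Afghanistan", "1999", "population", 19987071],
     ["Afghanistan", "2000", "cases", 2666],
     ["Afghanistan", "2000", "population", 20595360]],
    columns=["country", "year", "type", "count"],
)

# Spread the values of "type" into columns, one row per (country, year).
wide = long.pivot(index=["country", "year"], columns="type", values="count").reset_index()
wide.columns.name = None  # drop the leftover "type" label on the column axis

print(wide)
```

pivot raises an error if a (country, year, type) combination occurs twice; pivot_table with an aggregation function is the usual fallback in that case.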

12.4 Separating and uniting

12.4.1 Separate

df5=pd.DataFrame([['Afghanistan', '1999','745/19987071'],['Afghanistan', '2000', "2666/20595360"],['Brazil', '1999', "37737/172006362"],['Brazil', '2000',"80488/174504898"],['China', '1999', "212258/1272915272"],['China', '2000', "213766/1280428583"]])
df5.columns=["country","year","rate"]
df5


R:

table3 %>% separate(rate, into = c("cases", "population"))

Python:

df5['rate'].str.split('/',n=1,expand=True)
df5['cases']=df5['rate'].apply(lambda x:x.split('/')[0])
df5['population']=df5['rate'].apply(lambda x:x.split('/')[1])
df5

df5['century']=df5['year'].apply(lambda x:x[:2])
df5['years']=df5['year'].apply(lambda x:x[2:])
df5
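The pieces produced by splitting are still strings. A sketch of the same separation with a conversion to integers added, on a made-up two-row table (`toy`) shaped like df5:

```python
import pandas as pd

# Toy table with a combined "cases/population" column.
toy = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "rate": ["745/19987071", "37737/172006362"],
})

# Split once on "/" into two columns, then convert each piece to an integer.
parts = toy["rate"].str.split("/", n=1, expand=True)
toy["cases"] = parts[0].astype("int64")
toy["population"] = parts[1].astype("int64")
toy = toy.drop(columns="rate")  # separate() removes the input column by default

print(toy.dtypes)
```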

12.4.2 Unite

df5>> unite('new', 'century', 'years')

df5['new']=df5['century'].str.cat(df5['years'])
df5


fill()

import pandas as pd
import numpy as np
a = np.arange(25, dtype=float).reshape((5, 5))
print(len(a))
for i in range(len(a)):
    a[i, :i] = np.nan
a[3, 0] = 25.0
df = pd.DataFrame(data=a, columns = list('ABCDE'))
print(df)
print(df.fillna(value=0))
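fillna(value=0) replaces missing values with a constant, while tidyr's fill() carries the last observed value forward. A sketch of the forward-fill variant in pandas; the table and its column names are made up:

```python
import numpy as np
import pandas as pd

# Toy table where a name is written only on the first row of each block.
toy = pd.DataFrame({
    "person": ["Ann", np.nan, np.nan, "Bob"],
    "response": [7.0, 10.0, np.nan, 4.0],
})

# Forward-fill only the "person" column, like fill(person) in tidyr.
filled = toy.assign(person=toy["person"].ffill())

print(filled)
```

Calling ffill() on the whole DataFrame would fill every column, including `response`, which is often not what is wanted.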

13 Relational data

from nycflights13 import flights
from nycflights13 import airports
from nycflights13 import airlines
from dfply import *
import numpy as np
import pandas as pd
flights2=flights>>select(X.year,X.month,X.day, X.hour, X.origin, X.dest, X.tailnum, X.carrier)
airlines>>left_join(flights2, by='carrier')  # dfply joins take the key via by=

df1 = pd.DataFrame({'col1': ['one', 'two', 'three'],'col2': [1, 2, 3]})
df2 = pd.DataFrame({'col1': ['one', 'four', 'three'],'col2': [1, 4, 3]})
df1>>inner_join(df2, by='col1')

df1 = pd.DataFrame({'col1': ['one', 'two', 'three'],'col2': [1, 2, 3]})
df2 = pd.DataFrame({'col1': ['one', 'four', 'three'],'col2': [1, 4, 3]})
df1>>outer_join(df2, by='col1')

df1 = pd.DataFrame({'col1': ['one', 'two', 'three'],'col2': [1, 2, 3]})
df2 = pd.DataFrame({'col1': ['one', 'four', 'three'],'col2': [1, 4, 3]})
df1>>right_join(df2, by='col1')
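The same joins can be written with pandas alone through DataFrame.merge and its `how` parameter. A sketch reusing the two small tables from above:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["one", "two", "three"], "col2": [1, 2, 3]})
df2 = pd.DataFrame({"col1": ["one", "four", "three"], "col2": [1, 4, 3]})

# Inner join: only keys present in both tables.
inner = df1.merge(df2, on="col1", how="inner", suffixes=("_x", "_y"))

# Left join: every row of df1; missing matches become NaN.
left = df1.merge(df2, on="col1", how="left")

# Full outer join; indicator=True adds a _merge column naming the source of each row.
outer = df1.merge(df2, on="col1", how="outer", indicator=True)

print(outer)
```

The `_merge` column is a quick way to check which keys failed to match, similar in spirit to anti_join() in dplyr.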


https://plydata.readthedocs.io/en/stable/api.html
