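pandas.DataFrame usage summary:

1 df[df.Datatype=='train'] returns a DataFrame. The expression inside the brackets, df.Datatype=='train', returns a boolean Series, which carries both an index and values.

2 df['Class'] returns a Series: type(tr['Class']) = <class 'pandas.core.series.Series'>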

3
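The return types noted above can be checked on a small table. This is a minimal sketch, assuming a made-up three-row DataFrame whose column names (Datatype, Class, c_1) simply mirror the ones used in this note; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from this note.
df = pd.DataFrame({
    "Datatype": ["train", "test", "train"],
    "Class": [0, 1, 1],
    "c_1": [0.4, 0.9, 0.6],
})

mask = df.Datatype == "train"   # element-wise comparison -> boolean Series
train = df[mask]                # boolean-mask indexing -> DataFrame
labels = df["Class"]            # single column label -> Series
features = df[["c_1"]]          # list of labels -> DataFrame, even for one column
label_array = train["Class"].values  # .values -> numpy.ndarray (no index)

print(type(mask).__name__, mask.dtype)
print(type(train).__name__, train.shape)
print(type(labels).__name__)
print(type(features).__name__)
print(type(label_array).__name__)
```

A quick rule of thumb: a single label in the brackets gives a Series, while a list of labels or a boolean mask gives a DataFrame.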

####source code###############################################################################

import pandas as pd

# Takes in dataframes and a list of selected features (column names)
# and returns (train_x, train_y), (test_x, test_y)
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and 
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''
    
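    # combine the text/label columns with the feature columns (rows align on the index)
    df = pd.concat([complete_df,features_df],axis=1)
    # boolean-mask indexing keeps only the training rows and returns a DataFrame
    tr = df[df.Datatype=='train']
    
    print("type(sf=",type(tr))
    print("tr=",tr)
    # the comparison on its own returns a boolean Series
    print("type(df.Datatype=='train')=",type(df.Datatype == 'train'))
    # get the training features as a NumPy array
    train_x = tr[selected_features].values
    print("train_x=",train_x)
    # debug only: shows that any column comparison returns a boolean Series;
    # "Class" is a column name, not a Datatype value, so the rows shown in the output are False
    t = df.Datatype == "Class"
    print("type(t)=",type(t),"t=",t)
    # And training class labels (0 or 1)
    train_y = tr['Class'].values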
    
    # get the test features and labels
    test= df[df.Datatype == 'test']
    test_x = test[selected_features].values
    test_y = test['Class'].values
    
    return (train_x, train_y), (test_x, test_y)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
test_selection = list(features_df)[:2] # first couple columns as a test
print("test_selection=",test_selection)
# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)

#result##############################################################################################

test_selection= ['c_1', 'c_2']
type(sf= <class 'pandas.core.frame.DataFrame'>
tr=               File Task  Category  Class  \
0   g0pA_taska.txt    a         0      0
2   g0pA_taskc.txt    c         2      1
3   g0pA_taskd.txt    d         1      1
4   g0pA_taske.txt    e         0      0
5   g0pB_taska.txt    a         0      0
..             ...  ...       ...    ...
89  g4pD_taske.txt    e         1      1
90  g4pE_taska.txt    a         1      1
91  g4pE_taskb.txt    b         2      1
92  g4pE_taskc.txt    c         3      1
93  g4pE_taskd.txt    d         0      0

                                                 Text Datatype       c_1  \
0   inheritance is a basic concept of object orien...    train  0.398148
2   the vector space model also called term vector...    train  0.869369
3   bayes theorem was names after rev thomas bayes...    train  0.593583
4   dynamic programming is an algorithm design tec...    train  0.544503
5   inheritance is a basic concept in object orien...    train  0.329502
..                                                ...      ...       ...
89  dynamic programming is a method of providing s...    train  0.845188
90  object oriented programming is a style of prog...    train  0.485000
91  pagerankalgorithm is also known as link analys...    train  0.950673
92  the definition of term depends on the applicat...    train  0.551220
93   bayes theorem or bayes rule  or something cal...    train  0.361257

         c_2       c_3       c_4       c_5       c_6  lcs_word
0   0.079070  0.009346  0.000000  0.000000  0.000000  0.191781
2   0.719457  0.613636  0.515982  0.449541  0.382488  0.846491
3   0.268817  0.156757  0.108696  0.081967  0.060440  0.316062
4   0.115789  0.031746  0.005319  0.000000  0.000000  0.242574
5   0.053846  0.007722  0.003876  0.000000  0.000000  0.161172
..       ...       ...       ...       ...       ...       ...
89  0.546218  0.400844  0.347458  0.302128  0.273504  0.643725
90  0.105528  0.025253  0.005076  0.000000  0.000000  0.242718
91  0.878378  0.823529  0.800000  0.780822  0.761468  0.839506
92  0.328431  0.285714  0.252475  0.233831  0.220000  0.283019
93  0.031579  0.000000  0.000000  0.000000  0.000000  0.161765

[70 rows x 13 columns]
type(df.Datatype=='train')= <class 'pandas.core.series.Series'>
train_x= [[0.39814815 0.07906977]
 [0.86936937 0.71945701]
 [0.59358289 0.2688172 ]
 [0.54450262 0.11578947]
 [0.32950192 0.05384615]
 [0.59030837 0.15044248]
 [0.75977654 0.50561798]
 [0.51612903 0.07027027]
 [0.44086022 0.11891892]
 [0.97945205 0.91724138]
 [0.95138889 0.7972028 ]
 [0.97647059 0.85798817]
 [0.81176471 0.55621302]
 [0.44117647 0.03030303]
 [0.48888889 0.06741573]
 [0.81395349 0.67058824]
 [0.61111111 0.15492958]
 [1.         1.        ]
 [0.63402062 0.20207254]
 [0.58293839 0.29047619]
 [0.63793103 0.42857143]
 [0.42038217 0.07692308]
 [0.68776371 0.40677966]
 [0.67664671 0.31927711]
 [0.76923077 0.53355705]
 [0.71226415 0.37914692]
 [0.62992126 0.33992095]
 [0.71573604 0.26020408]
 [0.33206107 0.03065134]
 [0.71721311 0.36213992]
 [0.87826087 0.71179039]
 [0.52980132 0.35548173]
 [0.57211538 0.14009662]
 [0.31967213 0.04115226]
 [0.53       0.13567839]
 [0.78       0.65829146]
 [0.65269461 0.18674699]
 [0.44394619 0.15315315]
 [0.66502463 0.39108911]
 [0.72815534 0.30731707]
 [0.76204819 0.54984894]
 [0.94701987 0.67333333]
 [0.36842105 0.0619469 ]
 [0.53289474 0.09933775]
 [0.61849711 0.16860465]
 [0.51030928 0.09326425]
 [0.57983193 0.11814346]
 [0.40703518 0.06565657]
 [0.51546392 0.09310345]
 [0.58454106 0.27669903]
 [0.6171875  0.33858268]
 [1.         0.96153846]
 [0.99166667 0.96638655]
 [0.5505618  0.15819209]
 [0.41935484 0.07608696]
 [0.83516484 0.45555556]
 [0.92708333 0.69473684]
 [0.492891   0.05714286]
 [0.70873786 0.52682927]
 [0.86338798 0.66483516]
 [0.96060606 0.92097264]
 [0.43801653 0.08333333]
 [0.73366834 0.35353535]
 [0.51388889 0.09302326]
 [0.48611111 0.07906977]
 [0.84518828 0.54621849]
 [0.485      0.10552764]
 [0.95067265 0.87837838]
 [0.55121951 0.32843137]
 [0.36125654 0.03157895]]
type(t)= <class 'pandas.core.series.Series'> t= 0     False
1     False
2     False
3     False
4     False
      ...
95    False
96    False
97    False
98    False
99    False
Name: Datatype, Length: 100, dtype: bool
Tests Passed!
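The same split can be reproduced without the notebook's data. The sketch below assumes two small stand-in tables for complete_df and features_df; their values, the file names, and the "other" Datatype value are hypothetical, and the helper `split` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for complete_df and features_df (same row index).
complete_df = pd.DataFrame({
    "File": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "Class": [0, 1, 1, 0],
    "Datatype": ["train", "test", "train", "other"],
})
features_df = pd.DataFrame({
    "c_1": [0.40, 0.87, 0.59, 1.00],
    "c_2": [0.08, 0.72, 0.27, 1.00],
})
selected = ["c_1", "c_2"]

# Column-wise concat: rows are matched on the index, columns are placed side by side.
df = pd.concat([complete_df, features_df], axis=1)


def split(frame, datatype, features):
    """Return (x, y) NumPy arrays for the rows whose Datatype equals `datatype`."""
    part = frame[frame["Datatype"] == datatype]           # DataFrame of matching rows
    return part[features].values, part["Class"].values   # 2-D features, 1-D labels


train_x, train_y = split(df, "train", selected)
test_x, test_y = split(df, "test", selected)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)
```

Selecting with a list of column names keeps the result two-dimensional, so `.values` yields one row per sample, while the single-label selection for the class column yields a flat array.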

#https://notebookinstance.notebook.us-east-2.sagemaker.aws/notebooks/CN-ML_SageMaker_Studies/Project_Plagiarism_Detection/2_Plagiarism_Feature_Engineering.ipynb
