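Contents

  • 1. Load the CSV dataset and build a Dataset
    • 1.1 Read the CSV data into a DataFrame with pandas
    • 1.2 Build a Dataset from the DataFrame
  • 2. Wrap the data in feature columns
  • 3. Build and train the model
  • 4. Build crossed features
  • 5. Predict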

This section implements logistic regression with the Estimator API.

(1) The dataset is the Titanic data in CSV format.

import tensorflow as tf
import pandas as pd
from IPython.display import clear_output
print(tf.__version__)
print(pd.__version__)
2.3.0
1.0.1

1. Load the CSV dataset and build a Dataset

1.1 Read the CSV data into a DataFrame with pandas

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

Let's take a look at the data:

dftrain.head()
   sex     age   n_siblings_spouses  parch  fare     class  deck     embark_town  alone
0  male    22.0  1                   0      7.2500   Third  unknown  Southampton  n
1  female  38.0  1                   0      71.2833  First  C        Cherbourg    n
2  female  26.0  0                   0      7.9250   Third  unknown  Southampton  y
3  female  35.0  1                   0      53.1000  First  C        Southampton  n
4  male    28.0  0                   0      8.4583   Third  unknown  Queenstown   y
dftrain.describe()
       age         n_siblings_spouses  parch       fare
count  627.000000  627.000000          627.000000  627.000000
mean   29.631308   0.545455            0.379585    34.385399
std    12.511818   1.151090            0.792999    54.597730
min    0.750000    0.000000            0.000000    0.000000
25%    23.000000   0.000000            0.000000    7.895800
50%    28.000000   0.000000            0.000000    15.045800
75%    35.000000   1.000000            0.000000    31.387500
max    80.000000   8.000000            5.000000    512.329200

The shapes of the training and evaluation data are:

dftrain.shape,dfeval.shape
((627, 9), (264, 9))

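The age distribution:

# Split the values into `bins` intervals, count the entries in each, and draw a histogram
dftrain.age.hist(bins=20)
# dftrain.age is equivalent to dftrain['age']; pandas exposes each column as an attribute of the same name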
<matplotlib.axes._subplots.AxesSubplot at 0x7f83dd224c50>

Next, the sex distribution:

dftrain.sex.value_counts()
male      410
female    217
Name: sex, dtype: int64
# Show the counts as a chart
dftrain.sex.value_counts().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x7f83e18777d0>

# Then the passenger class
dftrain['class'].value_counts().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x7f83e072a410>

The survival rate by sex:

pd.concat([dftrain,y_train],axis=1).groupby('sex').survived.mean()
sex
female    0.778802
male      0.180488
Name: survived, dtype: float64
pd.concat([dftrain,y_train],axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survived')
Text(0.5, 0, '% survived')

1.2 Build a Dataset from the DataFrame

Use the training data above to build a tf.data.Dataset:

# This line only demonstrates how to convert a DataFrame into a Dataset; the variable is not used below.
dataset = tf.data.Dataset.from_tensor_slices((dict(dftrain), y_train))
for ds in dataset:
    print(ds)

def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():
        # Build a Dataset from the DataFrame
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            ds = ds.shuffle(1000)
        ds = ds.batch(batch_size).repeat(num_epochs)
        return ds
    return input_function

train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)
ds = make_input_fn(dftrain, y_train, batch_size=10)()
for feature_batch, label_batch in ds.take(1):
    print('Some feature keys:', list(feature_batch.keys()))
    print()
    print('A batch of class:', feature_batch['class'].numpy())
    print()
    print('A batch of Labels:', label_batch.numpy())
Some feature keys: ['sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

A batch of class: [b'Second' b'Second' b'Third' b'Second' b'Third' b'Third' b'Third'
 b'Third' b'Third' b'Third']

A batch of Labels: [0 0 0 1 0 0 0 1 0 0]
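
As a rough check on how many batches such an input function yields: `batch` is applied before `repeat`, so every epoch ends with one partial batch when the row count is not a multiple of the batch size. The helper below is a plain-Python sketch of that arithmetic; `batches_yielded` is a hypothetical name, not a TensorFlow function, and it assumes the default `drop_remainder=False`.

```python
import math


def batches_yielded(num_rows, batch_size, num_epochs):
    """Count the batches produced by ds.batch(batch_size).repeat(num_epochs).

    Assumes the final, smaller batch of each epoch is kept
    (drop_remainder=False), so each epoch has ceil(num_rows / batch_size) batches.
    """
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batches_per_epoch * num_epochs


# Training input: 627 rows, batch_size=32, num_epochs=10
print(batches_yielded(627, 32, 10))
# Evaluation input: 264 rows, batch_size=32, num_epochs=1
print(batches_yielded(264, 32, 1))
```

For the training input this gives 20 batches per epoch and 200 in total, which agrees with the `global_step` of 200 reported after training in section 3.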
# Note: feature_columns is defined in section 2 below, so run that code before these lines.
age_column = feature_columns[7]
tf.keras.layers.DenseFeatures([age_column])(feature_batch).numpy()
array([[23.],
       [28.],
       [32.],
       [31.],
       [28.],
       [ 4.],
       [28.],
       [25.],
       [35.],
       [28.]], dtype=float32)
gender_column = feature_columns[0]
tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(gender_column)])(feature_batch).numpy()
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]], dtype=float32)

2. Wrap the data in feature columns

CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck', 'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    # Use the distinct values seen in the training data as the vocabulary
    vocabulary = dftrain[feature_name].unique()
    feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))
feature_columns[5]
VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
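
To make the vocabulary-list column more concrete, here is a minimal plain-Python sketch of the idea: each string is mapped to its position in the vocabulary, unknown strings fall back to `default_value=-1` (the value shown in the output above), and an indicator (one-hot) encoding turns the id into a vector. The helpers `lookup_id` and `one_hot` are hypothetical names for illustration, not TensorFlow APIs, and the all-zero row for an unknown value is my understanding of the indicator column's behaviour rather than something shown in this article.

```python
def lookup_id(value, vocabulary, default_value=-1):
    """Return the index of value in vocabulary, or default_value if absent."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return default_value


def one_hot(value, vocabulary):
    """Encode value as a one-hot list; an unknown value (id -1) becomes all zeros."""
    idx = lookup_id(value, vocabulary)
    return [1.0 if i == idx else 0.0 for i in range(len(vocabulary))]


# Vocabulary in order of first appearance, as dftrain['sex'].unique() would list it
sex_vocabulary = ['male', 'female']
print(one_hot('male', sex_vocabulary))
print(one_hot('female', sex_vocabulary))
print(one_hot('unseen', sex_vocabulary))
```

Under this sketch `'male'` encodes as `[1.0, 0.0]`, the same pattern as the rows of the `gender_column` output shown earlier.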

3. Build and train the model

TensorFlow's pre-built LinearClassifier makes it easy to build and train the model:

linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)
clear_output()
print(result)
{'accuracy': 0.75, 'accuracy_baseline': 0.625, 'auc': 0.825375, 'auc_precision_recall': 0.7897542, 'average_loss': 0.5112618, 'label/mean': 0.375, 'loss': 0.50658554, 'precision': 0.6386555, 'prediction/mean': 0.47108686, 'recall': 0.7676768, 'global_step': 200}
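
One of the reported metrics can be checked by hand. As I understand it, `accuracy_baseline` is the accuracy of always predicting the majority class, so it depends only on `label/mean`. The sketch below reproduces that arithmetic in plain Python; `accuracy_baseline` here is a hypothetical helper written for illustration, not the Estimator's own code.

```python
def accuracy_baseline(label_mean):
    """Accuracy obtained by always predicting the more frequent class.

    label_mean is the fraction of positive (1) labels in the evaluation set.
    """
    return max(label_mean, 1.0 - label_mean)


num_eval_rows = 264   # from dfeval.shape
label_mean = 0.375    # 'label/mean' in the evaluation result

# Number of positive labels implied by the mean
print(round(num_eval_rows * label_mean))
# Majority-class accuracy
print(accuracy_baseline(label_mean))
```

With a label mean of 0.375 the majority class is "did not survive", so the baseline is 0.625. The trained model's accuracy of 0.75 is therefore the figure to compare against 0.625, not against 0.5.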

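4. Build crossed features

All the features above are single features, but sometimes a combination of features correlates with the outcome more strongly. For example, sex alone or age alone may say little about the outcome, while a particular combination of age and sex may correlate with it strongly.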

age_x_gender = tf.feature_column.crossed_column(['age', 'sex'], hash_bucket_size=100)
derived_feature_columns = [age_x_gender]
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns+derived_feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)
clear_output()
print(result)
{'accuracy': 0.7765151, 'accuracy_baseline': 0.625, 'auc': 0.85026014, 'auc_precision_recall': 0.7782748, 'average_loss': 0.48670053, 'label/mean': 0.375, 'loss': 0.4771559, 'precision': 0.7631579, 'prediction/mean': 0.29976445, 'recall': 0.5858586, 'global_step': 200}
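
The idea behind a crossed feature with `hash_bucket_size=100` can be sketched in plain Python: the values being crossed are combined into one key, the key is hashed, and the hash is reduced modulo the number of buckets, so each combination lands in one of 100 categorical slots. This is only an illustration under stated assumptions: `cross_bucket` is a hypothetical helper, and MD5 stands in for TensorFlow's internal fingerprint function, so the bucket ids will not match the ones TensorFlow computes.

```python
import hashlib


def cross_bucket(values, hash_bucket_size):
    """Map a combination of feature values to a bucket id in [0, hash_bucket_size).

    MD5 is an assumption used for illustration; TensorFlow uses its own hash.
    """
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


# The same (age, sex) combination always maps to the same bucket
print(cross_bucket((22.0, 'male'), 100))
print(cross_bucket((22.0, 'male'), 100))
# A different combination usually, but not always, maps to a different bucket
print(cross_bucket((38.0, 'female'), 100))
```

Because the number of buckets is fixed, distinct combinations can collide in the same bucket; a larger `hash_bucket_size` reduces collisions at the cost of more weights to learn.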

5. Predict

pred_dicts = list(linear_est.predict(eval_input_fn))
# Take the predicted probability of class 1 (survived) for each example
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
probs.plot(kind='hist', bins=20, title='predicted probabilities')
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx. If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2. To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/vx/w_50bfjj6xn9j5_lqhfrbcv00000gn/T/tmpo5ivmpz_/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
<matplotlib.axes._subplots.AxesSubplot at 0x7f83dccac390>
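
For a binary linear classifier, the value read from `pred['probabilities'][1]` is the sigmoid of a weighted sum of the inputs. The plain-Python sketch below shows that relationship; the weights, bias and feature values are made-up numbers for illustration, and `predict_proba` is a hypothetical helper, not the Estimator's implementation.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Return [P(class 0), P(class 1)] for one example of a linear model."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(logit)
    return [1.0 - p, p]


# Made-up weights and inputs, only to show the shape of the computation
weights = [0.8, -0.5]
bias = 0.1
print(predict_proba([1.0, 2.0], weights, bias))
# A logit of exactly 0 gives equal probability to both classes
print(predict_proba([0.0, 0.0], [1.0, 1.0], 0.0))
```

The two entries always sum to 1, which is why taking index 1 alone is enough to describe the prediction.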

Compute and plot the ROC curve:

from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)
(0, 1.05)
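
To show what the curve above is made of, here is a small plain-Python sketch of an ROC computation: examples are sorted by descending score, each distinct score is treated as a threshold, and the false positive rate and true positive rate are recorded at each one. It is a simplified stand-in for illustration, not scikit-learn's implementation; `roc_points` and `auc_trapezoid` are hypothetical names, and the four labels and scores are toy values rather than model output.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    pairs = sorted(zip(scores, labels), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before emitting a point
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels, scores)
print(fpr)
print(tpr)
print(auc_trapezoid(fpr, tpr))
```

On this toy input the area is 0.75. The `auc` value in the evaluation results is the same kind of quantity, computed on the real predictions.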
