TensorFlow Worked Example 4: Logistic Regression with Estimator

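Contents

1. Load the CSV dataset and build a Dataset
- 1.1 Read the CSV data into a DataFrame with pandas
- 1.2 Build a Dataset from the DataFrame
2. Wrap the data in feature columns
3. Build and train the model
4. Build crossed features
5. Predict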

This part implements logistic regression with the Estimator API, using the Titanic passenger data in CSV format as the dataset.

import tensorflow as tf
import pandas as pd
from IPython.display import clear_output
print(tf.__version__)
print(pd.__version__)

2.3.0
1.0.1

1. Load the CSV dataset and build a Dataset

1.1 Read the CSV data into a DataFrame with pandas

dftrain = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/train.csv')
dfeval = pd.read_csv('https://storage.googleapis.com/tf-datasets/titanic/eval.csv')
y_train = dftrain.pop('survived')
y_eval = dfeval.pop('survived')

Let's take a look at the data:

dftrain.head()

	sex	age	n_siblings_spouses	fare	class	deck	embark_town	alone
0	male	22.0	1	7.2500	Third	unknown	Southampton	n
1	female	38.0	1	71.2833	First	C	Cherbourg	n
2	female	26.0	0	7.9250	Third	unknown	Southampton	y
3	female	35.0	1	53.1000	First	C	Southampton	n
4	male	28.0	0	8.4583	Third	unknown	Queenstown	y

dftrain.describe()

	age	n_siblings_spouses	parch	fare
count	627.000000	627.000000	627.000000	627.000000
mean	29.631308	0.545455	0.379585	34.385399
std	12.511818	1.151090	0.792999	54.597730
min	0.750000	0.000000	0.000000	0.000000
25%	23.000000	0.000000	0.000000	7.895800
50%	28.000000	0.000000	0.000000	15.045800
75%	35.000000	1.000000	0.000000	31.387500
max	80.000000	8.000000	5.000000	512.329200

The sizes of the training and evaluation sets are:

dftrain.shape,dfeval.shape

((627, 9), (264, 9))

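Look at the age distribution:

# Split the values into `bins` intervals, count the values in each, and draw a histogram
dftrain.age.hist(bins=20)
# dftrain.age is equivalent to dftrain['age']; pandas exposes each column as an attribute of the same name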

<matplotlib.axes._subplots.AxesSubplot at 0x7f83dd224c50>
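What `hist(bins=20)` does can be sketched in plain Python: the value range is cut into equal-width intervals and the values in each interval are counted. This is a minimal sketch, not the pandas or matplotlib implementation; `equal_width_histogram` is a hypothetical helper, the ages are made up, and it assumes the values are not all identical.

```python
def equal_width_histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The last interval is closed on the right, so the maximum lands in it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Toy ages, made up for illustration (not taken from the dataset).
toy_ages = [0.75, 4, 22, 23, 28, 28, 28, 35, 38, 80]
counts, edges = equal_width_histogram(toy_ages, bins=4)
print(counts)  # [2, 7, 0, 1]
```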

Now look at sex:

dftrain.sex.value_counts()

male      410
female    217
Name: sex, dtype: int64

# Show the counts as a chart
dftrain.sex.value_counts().plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x7f83e18777d0>

# Now look at the passenger class
dftrain['class'].value_counts().plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x7f83e072a410>

Look at the survival rate by sex:

pd.concat([dftrain,y_train],axis=1).groupby('sex').survived.mean()

sex
female    0.778802
male      0.180488
Name: survived, dtype: float64

pd.concat([dftrain,y_train],axis=1).groupby('sex').survived.mean().plot(kind='barh').set_xlabel('% survived')

Text(0.5, 0, '% survived')
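The `groupby('sex').survived.mean()` call computes, for each sex, the fraction of rows with `survived == 1`. The sketch below reproduces that arithmetic in plain Python. The survivor counts (169 women, 74 men) do not appear in the article; they are inferred here from the printed rates and group sizes, and `group_mean` is a hypothetical helper.

```python
from collections import defaultdict


def group_mean(rows, key, value):
    """Mean of `value` for each distinct `key`, like a groupby followed by mean."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        acc = totals[row[key]]
        acc[0] += row[value]
        acc[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


# Group sizes come from value_counts(); survivor counts are inferred, not printed.
rows = (
    [{"sex": "female", "survived": 1}] * 169
    + [{"sex": "female", "survived": 0}] * (217 - 169)
    + [{"sex": "male", "survived": 1}] * 74
    + [{"sex": "male", "survived": 0}] * (410 - 74)
)
rates = group_mean(rows, "sex", "survived")
print({k: round(v, 6) for k, v in rates.items()})
# {'female': 0.778802, 'male': 0.180488}
```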

1.2 Build a Dataset from the DataFrame

Build a tf.data.Dataset from the training data above:

# This only demonstrates how to convert a DataFrame into a Dataset; the variable is not used below.
dataset = tf.data.Dataset.from_tensor_slices((dict(dftrain), y_train))
for ds in dataset:
    print(ds)

def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():
        # Build a Dataset from the DataFrame
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            ds = ds.shuffle(1000)
        ds = ds.batch(batch_size).repeat(num_epochs)
        return ds
    return input_function

train_input_fn = make_input_fn(dftrain, y_train)
eval_input_fn = make_input_fn(dfeval, y_eval, num_epochs=1, shuffle=False)

ds = make_input_fn(dftrain, y_train, batch_size=10)()
for feature_batch, label_batch in ds.take(1):
    print('Some feature keys:', list(feature_batch.keys()))
    print()
    print('A batch of class:', feature_batch['class'].numpy())
    print()
    print('A batch of Labels:', label_batch.numpy())

Some feature keys: ['sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

A batch of class: [b'Second' b'Second' b'Third' b'Second' b'Third' b'Third' b'Third'
 b'Third' b'Third' b'Third']

A batch of Labels: [0 0 0 1 0 0 0 1 0 0]
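The order of `shuffle`, `batch`, and `repeat` in `make_input_fn` determines how many batches a training run sees. The following is a pure-Python sketch of that ordering, not TensorFlow code; `input_batches` and its seeded shuffling are hypothetical stand-ins. With the 627 training rows and the defaults `batch_size=32`, `num_epochs=10`, each epoch has 19 full batches plus one batch of 19 rows, so the run yields 200 batches. That agrees with the `global_step` of 200 in the evaluation results printed in this article.

```python
import random


def input_batches(n_examples, batch_size=32, num_epochs=10, shuffle=True, seed=0):
    """Yield batches of row indices in the order shuffle -> batch -> repeat."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    for _ in range(num_epochs):
        if shuffle:
            # Reshuffle on every pass over the data.
            rng.shuffle(indices)
        # Batching happens inside each epoch, so the last batch may be partial.
        for start in range(0, n_examples, batch_size):
            yield indices[start:start + batch_size]


batches = list(input_batches(627))
print(len(batches))      # 200 batches in total (20 per epoch x 10 epochs)
print(len(batches[19]))  # 19 rows in the final, partial batch of the first epoch
```

Because batching comes before repeating, batches never mix rows from two epochs; swapping the two calls would remove the partial batch at the end of every epoch except the last.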

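# feature_columns is built in section 2 below; run that code before this cell.
# Index 7 is the numeric 'age' column (the 7 categorical columns come first).
age_column = feature_columns[7]
tf.keras.layers.DenseFeatures([age_column])(feature_batch).numpy()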

array([[23.],
       [28.],
       [32.],
       [31.],
       [28.],
       [ 4.],
       [28.],
       [25.],
       [35.],
       [28.]], dtype=float32)

gender_column = feature_columns[0]
tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(gender_column)])(feature_batch).numpy()

array([[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.],[1., 0.]], dtype=float32)

2. Wrap the data in feature columns

CATEGORICAL_COLUMNS = ['sex', 'n_siblings_spouses', 'parch', 'class', 'deck',
                       'embark_town', 'alone']
NUMERIC_COLUMNS = ['age', 'fare']

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique()
    feature_columns.append(
        tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocabulary))

for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name, dtype=tf.float32))

feature_columns[5]

VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Southampton', 'Cherbourg', 'Queenstown', 'unknown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)
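A vocabulary-list column maps each string to its position in the vocabulary, and an indicator column turns that position into a one-hot vector. Below is a rough plain-Python sketch of the idea using the `embark_town` vocabulary shown above. `one_hot` is a hypothetical helper, and the all-zeros result for an unseen value is an assumption about how `default_value=-1` with `num_oov_buckets=0` behaves, not something the article demonstrates.

```python
def one_hot(value, vocabulary):
    """One-hot encode `value` by its position in `vocabulary`."""
    vec = [0.0] * len(vocabulary)
    if value in vocabulary:
        vec[vocabulary.index(value)] = 1.0
    # Assumption: an out-of-vocabulary value leaves the vector all zeros.
    return vec


embark_vocab = ("Southampton", "Cherbourg", "Queenstown", "unknown")
print(one_hot("Cherbourg", embark_vocab))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot("Belfast", embark_vocab))    # [0.0, 0.0, 0.0, 0.0]
```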

3. Build and train the model

TensorFlow's pre-built LinearClassifier makes it easy to build and train the model:

linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)
clear_output()
print(result)

{'accuracy': 0.75, 'accuracy_baseline': 0.625, 'auc': 0.825375, 'auc_precision_recall': 0.7897542, 'average_loss': 0.5112618, 'label/mean': 0.375, 'loss': 0.50658554, 'precision': 0.6386555, 'prediction/mean': 0.47108686, 'recall': 0.7676768, 'global_step': 200}
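The reported metrics can be cross-checked against each other. The sketch below recomputes them from a confusion matrix. The four counts are not printed anywhere in the article; they are inferred here from the 264 evaluation rows, `label/mean`, `precision`, and `recall`, so treat this as a consistency check rather than output of the model. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    positives = tp + fn
    negatives = fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting the more common class.
        "accuracy_baseline": max(positives, negatives) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "label/mean": positives / total,
    }


# Inferred counts for the first model (assumption, see the note above).
m = binary_metrics(tp=76, fp=43, fn=23, tn=122)
print({k: round(v, 6) for k, v in m.items()})
# {'accuracy': 0.75, 'accuracy_baseline': 0.625, 'precision': 0.638655,
#  'recall': 0.767677, 'label/mean': 0.375}
```

Read this way, `accuracy_baseline` of 0.625 means that a model which always predicts "did not survive" would already be right 62.5% of the time, so the accuracy of 0.75 is the figure to compare against it.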

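4. Build crossed features

The features above are all used individually, but sometimes a combination of features correlates more strongly with the outcome. For example, sex alone or age alone may say little about the result, while a particular combination of age and sex may correlate strongly with it.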

age_x_gender = tf.feature_column.crossed_column(['age', 'sex'], hash_bucket_size=100)
derived_feature_columns = [age_x_gender]
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns+derived_feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(eval_input_fn)
clear_output()
print(result)

{'accuracy': 0.7765151, 'accuracy_baseline': 0.625, 'auc': 0.85026014, 'auc_precision_recall': 0.7782748, 'average_loss': 0.48670053, 'label/mean': 0.375, 'loss': 0.4771559, 'precision': 0.7631579, 'prediction/mean': 0.29976445, 'recall': 0.5858586, 'global_step': 200}
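A crossed column treats each (age, sex) pair as a single categorical value and hashes it into one of `hash_bucket_size` buckets, so the model can learn one weight per bucket. The sketch below illustrates that idea in plain Python. It is not TensorFlow's implementation: `crossed_bucket` is a hypothetical helper, and MD5 merely stands in for whatever fingerprint function TensorFlow uses, so the bucket numbers will differ from the real ones.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Map a tuple of feature values to a bucket id in [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).digest()  # stand-in hash (assumption)
    return int.from_bytes(digest[:8], "big") % hash_bucket_size


bucket_a = crossed_bucket((22.0, "male"), hash_bucket_size=100)
bucket_b = crossed_bucket((22.0, "female"), hash_bucket_size=100)
print(bucket_a, bucket_b)
```

With only 100 buckets, different (age, sex) pairs can collide and share a weight; a larger `hash_bucket_size` reduces collisions at the cost of more parameters.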

5. Predict

pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])
probs.plot(kind='hist', bins=20, title='predicted probabilities')

INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Layer linear/linear_model is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/vx/w_50bfjj6xn9j5_lqhfrbcv00000gn/T/tmpo5ivmpz_/model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

<matplotlib.axes._subplots.AxesSubplot at 0x7f83dccac390>
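Each prediction's `probabilities` entry holds the probabilities of class 0 and class 1, which is why the code above reads index 1 for the survival probability. In a two-class logistic regression, that value is the sigmoid of a weighted sum of the features. The following is a minimal sketch of that computation; the feature names, weights, and bias are hypothetical and are not the values learned by `linear_est`.

```python
import math


def sigmoid(z):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Return [P(class 0), P(class 1)] for a linear model with a sigmoid output."""
    logit = bias + sum(weights[name] * x for name, x in features.items())
    p = sigmoid(logit)
    return [1.0 - p, p]


# Hypothetical weights for illustration only.
weights = {"is_female": 2.0, "fare": 0.01}
proba = predict_proba({"is_female": 1.0, "fare": 50.0}, weights, bias=-1.5)
print(proba)
```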

Plot the ROC curve:

from sklearn.metrics import roc_curve
from matplotlib import pyplot as plt

fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.xlim(0,)
plt.ylim(0,)

(0, 1.05)
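The ROC curve comes from sweeping a decision threshold over the predicted probabilities and recording the false positive rate and true positive rate at each step. The sketch below shows that sweep in plain Python on a four-example toy input. It is not scikit-learn's implementation (which may also drop redundant points), `roc_points` and `auc` are hypothetical helpers, and it assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Examples with the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
print(toy_fpr)                # [0.0, 0.0, 0.5, 0.5, 1.0]
print(toy_tpr)                # [0.0, 0.5, 0.5, 1.0, 1.0]
print(auc(toy_fpr, toy_tpr))  # 0.75
```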
