This project is mainly about building a deep learning model and reducing dimensionality with PCA.

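Background

In the car sales industry, tax underpayment takes several forms: under-invoicing and under-reporting revenue; keeping license plate registration, mortgage, and insurance income off the books; and failing to recognize warranty claim payments in time. These practices cost the government substantial tax revenue. Some of a dealership's operating indicators can, to some extent, reveal its tendency to evade taxes. The sample data contains various attributes of taxpayers in the car sales industry together with a label indicating whether each one evaded taxes. By extracting the taxpayers' operating features, we can build a tax evasion recognition model and identify taxpayers who evade taxes.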

Main steps of the analysis

1.1 Loading the data

import pandas as pd
%matplotlib inline
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['FangSong'] # Set the default font (needed to render Chinese labels)
mpl.rcParams['axes.unicode_minus'] = False # Fix the minus sign '-' rendering as a box in saved figures

test=pd.read_csv('1.csv')
test.head()

	纳税人编号	销售类型	销售模式	汽车销售平均毛利	维修毛利	企业维修收入占销售收入比重	增值税税负	存货周转率	成本费用利润率	整体理论税负	整体税负控制数	办牌率	单台办牌手续费收入	代办保险率	保费返还率	输出
0	1	国产轿车	4S店	0.0635	0.3241	0.0879	0.0084	8.5241	0.0018	0.0166	0.0147	0.4000	0.02	0.7155	0.1500	正常
1	2	国产轿车	4S店	0.0520	0.2577	0.1394	0.0298	5.2782	-0.0013	0.0032	0.0137	0.3307	0.02	0.2697	0.1367	正常
2	3	国产轿车	4S店	0.0173	0.1965	0.1025	0.0067	19.8356	0.0014	0.0080	0.0061	0.2256	0.02	0.2445	0.1301	正常
3	4	国产轿车	一级代理商	0.0501	0.0000	0.0000	0.0000	1.0673	-0.3596	-0.1673	0.0000	0.0000	0.00	0.0000	0.0000	异常
4	5	进口轿车	4S店	0.0564	0.0034	0.0066	0.0017	12.8470	-0.0014	0.0123	0.0095	0.0039	0.08	0.0117	0.1872	正常

Column names and category values are kept in Chinese because the code refers to them. In English: 纳税人编号 = taxpayer ID; 销售类型 = sales type; 销售模式 = sales model; 汽车销售平均毛利 = average gross margin on car sales; 维修毛利 = gross margin on repairs; 企业维修收入占销售收入比重 = repair revenue as a share of sales revenue; 增值税税负 = VAT burden; 存货周转率 = inventory turnover; 成本费用利润率 = profit-to-cost ratio; 整体理论税负 = overall theoretical tax burden; 整体税负控制数 = overall tax burden control figure; 办牌率 = license plate registration rate; 单台办牌手续费收入 = registration fee income per vehicle; 代办保险率 = insurance agency rate; 保费返还率 = premium rebate rate; 输出 = output (the label). Among the values, 国产轿车 = domestic sedan, 进口轿车 = imported sedan, 4S店 = 4S dealership, 一级代理商 = first-tier agent, 正常 = normal, 异常 = abnormal.

test['输出'].unique()

array(['正常', '异常'], dtype=object)

# Encode the label as a number: '正常' (normal) -> 1, '异常' (abnormal) -> 0
def function(a):
    return 1 if '正常' in a else 0
test['输出'] = test['输出'].apply(function)
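
The encoding step above sends anything that is not '正常' to 0, so a misspelled or unexpected label would silently count as abnormal. As a hedged alternative, the sketch below uses an explicit mapping and fails fast on unknown labels. The toy `labels` series and the `mapping` dictionary are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy labels standing in for the '输出' column
labels = pd.Series(['正常', '异常', '正常'], name='输出')

# Explicit mapping: an unexpected label becomes NaN instead of silently 0
mapping = {'正常': 1, '异常': 0}
encoded = labels.map(mapping)

# Fail fast if any label was not covered by the mapping
assert encoded.notna().all(), 'unmapped label found'
encoded = encoded.astype(int)
print(encoded.tolist())
```

On the real data the same idea would read `test['输出'].map(mapping)`, assuming the column only ever contains these two values, as `unique()` reported.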

1.2 Checking for missing values

test.isnull().sum()

纳税人编号            0
销售类型             0
销售模式             0
汽车销售平均毛利         0
维修毛利             0
企业维修收入占销售收入比重    0
增值税税负            0
存货周转率            0
成本费用利润率          0
整体理论税负           0
整体税负控制数          0
办牌率              0
单台办牌手续费收入        0
代办保险率            0
保费返还率            0
输出               0
dtype: int64

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 16 columns):
纳税人编号            124 non-null int64
销售类型             124 non-null object
销售模式             124 non-null object
汽车销售平均毛利         124 non-null float64
维修毛利             124 non-null float64
企业维修收入占销售收入比重    124 non-null float64
增值税税负            124 non-null float64
存货周转率            124 non-null float64
成本费用利润率          124 non-null float64
整体理论税负           124 non-null float64
整体税负控制数          124 non-null float64
办牌率              124 non-null float64
单台办牌手续费收入        124 non-null float64
代办保险率            124 non-null float64
保费返还率            124 non-null float64
输出               124 non-null int64
dtypes: float64(12), int64(2), object(2)
memory usage: 15.6+ KB

1.3 Visualizing the data

test.hist(figsize=(20,20))

[Figure: histograms of the numeric columns, drawn as a 4×4 grid of subplots (output_12_1.png)]

# Encode the categorical variables with pandas
test_1 = pd.get_dummies(test)

test_1.head()

	纳税人编号	汽车销售平均毛利	维修毛利	企业维修收入占销售收入比重	增值税税负	存货周转率	成本费用利润率	整体理论税负	整体税负控制数	办牌率	...	销售类型_国产轿车	销售类型_进口轿车	销售模式_4S店	销售模式_一级代理商
0	1	0.0635	0.3241	0.0879	0.0084	8.5241	0.0018	0.0166	0.0147	0.4000	...	1	0	1	0
1	2	0.0520	0.2577	0.1394	0.0298	5.2782	-0.0013	0.0032	0.0137	0.3307	...	1	0	1	0
2	3	0.0173	0.1965	0.1025	0.0067	19.8356	0.0014	0.0080	0.0061	0.2256	...	1	0	1	0
3	4	0.0501	0.0000	0.0000	0.0000	1.0673	-0.3596	-0.1673	0.0000	0.0000	...	1	0	0	1
4	5	0.0564	0.0034	0.0066	0.0017	12.8470	-0.0014	0.0123	0.0095	0.0039	...	0	1	1	0

5 rows × 27 columns
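
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on three rows copied from the table above. Dropping 纳税人编号 before encoding is an assumption of this sketch (an identifier carries no business meaning as a feature); the original notebook keeps that column.

```python
import pandas as pd

# Miniature of the raw table, using three of the rows shown earlier
toy = pd.DataFrame({
    '纳税人编号': [1, 4, 5],
    '销售类型': ['国产轿车', '国产轿车', '进口轿车'],
    '销售模式': ['4S店', '一级代理商', '4S店'],
    '维修毛利': [0.3241, 0.0, 0.0034],
})

# Assumption of this sketch: remove the identifier so it is not used as a feature
features = toy.drop(columns=['纳税人编号'])

# Only string (object) columns are expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(features, dtype=int)
print(list(encoded.columns))
print(encoded)
```

Each category value becomes its own 0/1 column named `<column>_<value>`, which is why the frame grows from 16 to 27 columns in the notebook.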

2.1 Dimensionality reduction with PCA

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Separate the label from the features
# (note: the features still include 纳税人编号, the taxpayer ID)
y = test_1['输出']
x = test_1.drop(['输出'], axis=1)
# Standardize the features
X_scaler = StandardScaler()
x = X_scaler.fit_transform(x)
# PCA: keep enough components to retain 90% of the variance
pca = PCA(n_components=0.9)
X = pca.fit_transform(x)

print('data shape: {0}; no. positive: {1}; no. negative: {2}'.format(X.shape, y[y==1].shape[0], y[y==0].shape[0]))
# Split the data into training and test sets (10% held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

data shape: (124, 17); no. positive: 71; no. negative: 53
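
Passing a fraction such as 0.9 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction, which is how 17 components were chosen here. The sketch below illustrates that selection on synthetic data (a stand-in, not the tax data); the variable names and the `svd_solver='full'` setting are choices made for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 124 rows, 8 correlated features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(124, 3))
mixing = rng.normal(size=(3, 8))
data = latent @ mixing + 0.1 * rng.normal(size=(124, 8))

scaled = StandardScaler().fit_transform(data)

# Fit a full PCA to see how explained variance accumulates per component
full = PCA(svd_solver='full').fit(scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 0.9
k = int(np.searchsorted(cumulative, 0.9, side='right') + 1)

# A fractional n_components makes PCA perform the same selection itself
reduced = PCA(n_components=0.9, svd_solver='full').fit(scaled)
print(k, reduced.n_components_)
print(np.round(cumulative, 3))
```

Printing `cumulative` for the real feature matrix would show how much variance each of the 17 retained components contributes.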

3.1 SVM model

from sklearn.svm import SVC
from sklearn.metrics import classification_report
svm = SVC() # Define the SVM model (importing SVC directly avoids shadowing the sklearn.svm module)
# Fit the model
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))

             precision    recall  f1-score   support

          0       0.57      0.80      0.67         5
          1       0.83      0.62      0.71         8

avg / total       0.73      0.69      0.70        13
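
The report above rests on only 13 test samples, so a different random split could move these scores considerably. One hedged way to get a steadier estimate is stratified cross-validation with scaling and PCA inside a pipeline, so that each fold fits them on its own training part only. The sketch uses synthetic data from `make_classification` as a stand-in for the tax data; the fold count and scoring metric are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same number of rows as the tax data
X_demo, y_demo = make_classification(
    n_samples=124, n_features=20, n_informative=5, random_state=0
)

# Scaling and PCA are pipeline steps, so they are refit on each training fold
model = make_pipeline(StandardScaler(), PCA(n_components=0.9), SVC())

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1')
print(scores.mean(), scores.std())
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature frame and the encoded 输出 column.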

3.2 Decision tree model

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier() # Define the decision tree model
# Fit the model
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))

             precision    recall  f1-score   support

          0       0.44      0.80      0.57         5
          1       0.75      0.38      0.50         8

avg / total       0.63      0.54      0.53        13

# Predictions of the SVM model on the test set
svm.predict(X_test)

array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1], dtype=int64)

Building the LM neural network model

# Build the neural network model (Keras 2 API)
from keras.models import Sequential
from keras.layers import Dense, Activation
net_file = 'net.model'  # path for the saved weights
net = Sequential()  # create the model
net.add(Dense(units=10, input_dim=17))  # 17 PCA components in, 10 hidden units
net.add(Activation('relu'))
net.add(Dense(units=1))  # single output unit
net.add(Activation('sigmoid'))
net.compile(loss='binary_crossentropy', optimizer='adam')
net.fit(X_train, y_train, epochs=2, batch_size=10)  # 2 epochs, batches of 10 samples
net.save_weights(net_file)  # save the trained weights

Using Theano backend.
WARNING (theano.configdefaults): g++ not available, if using conda: `conda install m2w64-toolchain`
D:\sofewore\anaconda\lib\site-packages\theano\configdefaults.py:560: UserWarning: DeprecationWarning: there is no c++ compiler.This is deprecated and with Theano 0.11 a c++ compiler will be mandatory
  warnings.warn("DeprecationWarning: there is no c++ compiler."
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Epoch 1/2
111/111 [==============================] - 0s 4ms/step - loss: 0.7726
Epoch 2/2
111/111 [==============================] - 0s 4ms/step - loss: 0.7472
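
To show what this small network computes, the following NumPy sketch writes out the forward pass of a 17 → 10 → 1 architecture (ReLU hidden layer, sigmoid output) and the binary cross-entropy loss. It is an illustration with random weights and random data, not the trained Keras model; all names in it are hypothetical. For reference, always predicting 0.5 gives a loss of ln 2 ≈ 0.693, so the losses of 0.77 and 0.75 logged above suggest that two epochs are not yet enough for the network to beat that baseline.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass of a 17 -> 10 -> 1 network with ReLU and sigmoid."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2).ravel()

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 17))       # one batch of 10 samples, 17 features
y_demo = rng.integers(0, 2, size=10)     # random 0/1 labels
W1, b1 = rng.normal(scale=0.1, size=(17, 10)), np.zeros(10)
W2, b2 = rng.normal(scale=0.1, size=(10, 1)), np.zeros(1)

p = forward(X_demo, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size
print(p.shape, n_params, binary_crossentropy(y_demo, p))
```

The parameter count printed here (191) matches what a network of this shape has: 17×10 weights plus 10 biases in the hidden layer, and 10 weights plus 1 bias in the output layer.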

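predict_result = (net.predict(X_test) > 0.5).astype(int).reshape(len(X_test))  # predicted classes at a 0.5 threshold (predict_classes is not available in newer Keras releases)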

predict_result

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

print(classification_report(y_test,predict_result))

             precision    recall  f1-score   support

          0       0.45      1.00      0.62         5
          1       1.00      0.25      0.40         8

avg / total       0.79      0.54      0.49        13
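
The numbers in this report can be checked by hand. The network predicted class 1 only twice, and the precision of 1.00 for class 1 means both were correct; with 8 true class-1 samples and 5 true class-0 samples, the confusion counts follow directly. The sketch below recomputes the metrics from those counts in plain Python.

```python
# Confusion counts implied by the report and the 13 predictions above
tp = 2   # true class 1, predicted 1
fn = 6   # true class 1, predicted 0
tn = 5   # true class 0, predicted 0
fp = 0   # true class 0, predicted 1

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Class 1 is the positive class here
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# For class 0 the roles swap: tn are its hits, fn its false alarms
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'class 0: precision={precision_0:.4f} recall={recall_0:.4f} f1={f1(precision_0, recall_0):.4f}')
print(f'class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1(precision_1, recall_1):.4f}')
print(f'accuracy={accuracy:.4f}')
```

The accuracy of 7/13 equals the support-weighted recall of 0.54 in the "avg / total" row, which is lower than the SVM's 0.69 on the same split: the network labels almost everything as class 0.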
