Differential Privacy in Deep Learning

I would like to thank Mr. Akshay Kulkarni for guiding me on my journey in publishing my first-ever article.

INTRODUCTION

As a large number of our day-to-day activities move online, the amount of personal and sensitive data being recorded is also on the rise. This surge in data has led to a corresponding rise in data analysis tools, in the form of machine learning and deep learning, that are permeating every possible industry. These techniques are also applied to sensitive user data to derive actionable insights. The objective of these models is to discover overall patterns rather than an individual’s habits.

Photo by Kevin Ku on Unsplash

Deep learning is evolving to become the industry standard in many automation procedures. But it is also infamous for learning the minute, fine details of the training dataset. This aggravates the privacy risk, as the model weights now encode the finer user details and, under hostile inspection, could potentially reveal user information. For example, Fredrikson et al. demonstrated a model-inversion attack that recovers images from a facial recognition system [1]. Given the abundance of freely available data, it is safe to assume that a determined adversary can get hold of the auxiliary information required to extract user information from the model weights.

WHAT IS DIFFERENTIAL PRIVACY?

Differential privacy is a theory that provides certain mathematical guarantees about the privacy of user information. It aims to reduce the impact of any one individual’s data on the overall result. This means that one would draw the same inference about an individual whether or not their data was present in the input of the analysis. As the number of analyses performed on the data grows, so does the risk of exposing user information. The results of differentially private computations are immune to a wide range of privacy attacks.

Here X is an individual. Image by Author.

This is achieved by adding carefully tuned noise (characterized by epsilon) during the computation, making it difficult for hackers to identify any user. This addition of noise also erodes the accuracy of the computation, so there is a trade-off between accuracy and the privacy protection offered. The level of privacy is measured by epsilon, which is inversely proportional to the extent of privacy protection: the higher the epsilon, the weaker the protection of the data and the higher the chance of revealing user information. Pure ε-differential privacy is an ideal case that is very difficult to achieve in practical scenarios, hence (ε, δ)-differential privacy is used, under which the algorithm is ε-differentially private with probability (1 − δ). Hence, the closer δ is to 0, the better. Delta is usually set to a value no larger than the reciprocal of the number of training samples.

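For reference, the formal statement behind this guarantee (the standard definition, not specific to this article) is that a randomized mechanism M is (ε, δ)-differentially private if, for every pair of datasets D and D′ differing in a single individual’s record and for every set of outcomes S,

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S] + δ

Setting δ = 0 recovers pure ε-differential privacy, the ideal case mentioned above.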

Gradient manipulation. Image by Author

Mind you that we aim to safeguard the model, not the data itself, from hostile inspection. This is done by adding noise during the calculation of the model weights, which has the additional advantage of regularizing the model. Specifically, in deep learning we integrate differential privacy by opting for differentially private optimizers, because that is where most of the computation happens. The gradients are first computed as the gradient of the loss with respect to the weights. These gradients are then clipped according to the l2_norm_clip parameter, and noise controlled by the noise_multiplier parameter in the TensorFlow Privacy library is added to them.

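As an illustration of the clipping and noising described above, here is a minimal NumPy sketch of one DP-SGD gradient computation. It is not the TensorFlow Privacy implementation; the function name and the exact noise scaling are simplifications for this example.

import numpy as np

def dp_sgd_gradient(per_example_grads, l2_norm_clip, noise_multiplier):
    # Illustrative sketch of the DP-SGD clip-and-noise step, not library code.
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale each per-example gradient so its L2 norm is at most l2_norm_clip.
        clipped.append(g * min(1.0, l2_norm_clip / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    # Add Gaussian noise whose standard deviation scales with the clipping bound.
    noise = np.random.normal(0.0, noise_multiplier * l2_norm_clip, size=summed.shape)
    return (summed + noise) / len(per_example_grads)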

We aim to preserve the privacy of the model weights by applying differential privacy in the form of a Differentially Private Stochastic Gradient Descent (DP-SGD) optimizer. Here, an attempt is made to apply differential privacy to a deep neural network, VGG19, on the task of image classification on the CIFAR10 dataset, to understand the impact on model performance and privacy. If such a model were trained on sensitive and private images, it would be paramount that these images not be leaked, as they could be put to malicious use.

IMPLEMENTATION OF DP-SGD

The TensorFlow Privacy library was used to implement the DP-SGD optimizer and to compute epsilon. The hyperparameters are not finely tuned to get the maximum accuracy for the benchmark, since we need it only for a comparative study and excessive tuning might skew the intended comparison. The function that computes epsilon takes the number of steps as input, calculated as (epochs × number of training examples) / batch size. The number of steps is a measure of how many times the model sees the training data. As the number of steps increases, the model sees the training data more often and incorporates its finer details by overfitting, which means the model has a higher chance of revealing user information.

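The compute_epsilon helper called at the end of the training script below is not shown in the original snippet. The following is a sketch of what it might look like, adapted from the TensorFlow Privacy tutorials; the import path of the RDP accountant and its exact signature may differ across library versions, so treat this as an assumption.

from tensorflow_privacy.privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent

def compute_epsilon(steps, batch_size=128, n_train=50000, noise_multiplier=1.0, delta=None):
    # Estimate the epsilon spent by DP-SGD after `steps` optimizer steps.
    if noise_multiplier == 0.0:
        return float('inf')
    if delta is None:
        delta = 1.0 / (n_train * 1.5)   # delta kept below 1/n, as in the training script
    orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
    sampling_probability = batch_size / n_train  # chance of an example being sampled per step
    rdp = compute_rdp(q=sampling_probability,
                      noise_multiplier=noise_multiplier,
                      steps=steps,
                      orders=orders)
    return get_privacy_spent(orders, rdp, target_delta=delta)[0]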

import tensorflow as tf
from tensorflow.keras import optimizers
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# import path may vary across tensorflow_privacy versions
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer

# data loading
num_classes = 10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

# network params
batch_size   = 128
dropout      = 0.5
weight_decay = 0.0001
iterations   = len(x_train) // batch_size

# dpsgd params
dpsgd = True
learning_rate = 0.15
noise_multiplier = 1.0
l2_norm_clip = 1.0
epochs = 100
microbatches = batch_size

if dpsgd:
    optimizer = DPGradientDescentGaussianOptimizer(
        l2_norm_clip=l2_norm_clip,
        noise_multiplier=noise_multiplier,
        num_microbatches=microbatches,
        learning_rate=learning_rate)
    # reduction is set to NONE to get the loss in a vector form (one value per example)
    loss = tf.keras.losses.CategoricalCrossentropy(
        reduction=tf.compat.v1.losses.Reduction.NONE)
    print('DPSGD')
else:
    optimizer = optimizers.SGD(lr=learning_rate)
    loss = tf.keras.losses.CategoricalCrossentropy()
    print('Vanilla SGD')

# where model is the neural network architecture you want to use (VGG19 here)
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])

# augmenting images
datagen = ImageDataGenerator(horizontal_flip=True,
                             width_shift_range=0.125,
                             height_shift_range=0.125,
                             fill_mode='constant', cval=0.)
datagen.fit(x_train)

# start training
model.fit(datagen.flow(x_train, y_train, batch_size=batch_size),
          steps_per_epoch=iterations,
          epochs=epochs,
          validation_data=(x_test, y_test))

if dpsgd:
    eps = compute_epsilon(epochs * len(x_train) // batch_size)
    print('Delta = %f, Epsilon = %.2f' % (1 / (len(x_train) * 1.5), eps))
else:
    print('Trained with vanilla non-private SGD optimizer')

The benchmark uses SGD provided by the Keras library with the same learning rate as DP-SGD. The network parameters are kept the same to make a fair comparison.

Word of caution: since DP-SGD takes weight decay as a parameter, applying weight decay to each of the layers of the neural network individually raises an error.

Image by Author

A few observations:

  • Model training with DP-SGD is slower than the benchmark.
  • Model training with DP-SGD is noisier than the benchmark, which may be due to gradient clipping and the addition of noise.
  • Eventually, the model trained with DP-SGD achieves a decent performance in comparison with the benchmark.

DEALING WITH IMBALANCED DATA

We try to simulate a real-world scenario by creating an imbalanced dataset and then training the model. This is done by randomly assigning user-defined weights to each of the classes. We artificially create a severe imbalance that causes the model to underperform. To combat this imbalance we employ data-level approaches instead of making changes to the model itself. The reduced number of instances causes the epsilon to increase. After oversampling with the Synthetic Minority Oversampling TEchnique (SMOTE) we see an increase in accuracy. This lets us deal with imbalance without making changes to the model itself.

Number of instances for each class (10 classes in total): [3000 3500 250 2000 1000 1500 500 50 5000 2500]

import random
import numpy as np

def imbalance(y, weights, x_train, y_train):
    # y is expected to hold integer class labels (apply before one-hot encoding)
    random.shuffle(weights)
    indices = []
    for i in range(10):
        inds = np.where(y == i)[0]
        np.random.shuffle(inds)
        # keep only a weighted fraction of the samples of class i
        keep = int(weights[i] * len(inds))
        print(i, keep)
        indices += list(inds[:keep])
    return x_train[indices], y_train[indices]

weights = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1]
x_train, y_train = imbalance(y_train, weights, x_train, y_train)

# Implementing SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
x_train = x_train.reshape(len(x_train), 32 * 32 * 3)  # where the shape of each image is (32, 32, 3)
x_train, y_train = smote.fit_resample(x_train, y_train)
x_train = x_train.reshape(len(x_train), 32, 32, 3)

We observe that:

  • After oversampling the data with SMOTE, the model performance improves over the model trained on the imbalanced dataset.
  • The model trained on the original balanced data still outperforms the model trained on the oversampled data.

From the experiments, we observe that data-level approaches suffice to combat the problem of imbalance even with the differentially private optimizer.

TUNING THE NOISE_MULTIPLIER PARAMETER

The hyperparameter in the TensorFlow Privacy repertoire that has a direct impact on epsilon is the noise multiplier.

Image by Author

Noise = 100.0, Epsilon = 0.18
Noise = 10.0, Epsilon = 0.36
Noise = 1.0, Epsilon = 3.44

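As a usage illustration only, such a sweep could be scripted with the compute_epsilon sketch from earlier. The epsilon values above are the results reported for this experiment; the snippet below is not guaranteed to reproduce them exactly, since they depend on the accountant implementation and the chosen delta.

# Hypothetical sweep over noise_multiplier, reusing the compute_epsilon sketch above.
steps = 100 * 50000 // 128   # epochs * number of training examples / batch size
for noise in [100.0, 10.0, 1.0]:
    eps = compute_epsilon(steps, batch_size=128, n_train=50000, noise_multiplier=noise)
    print('Noise = %.1f, Epsilon = %.2f' % (noise, eps))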

We observe that:

  • Increasing the noise reduces the value of epsilon, which should imply better privacy protection.
  • It is also known that the level of privacy protection is inversely proportional to model performance.
  • But we observe no such relation in practice: increasing the noise multiplier has almost no impact on performance.

CONCLUSION

The departure from theory seen when tuning the noise_multiplier may be attributed to the large depth of the model, which renders the gradient manipulation ineffective. From the code given in the TensorFlow Privacy source, we see that the value of epsilon is independent of the model training and depends only on the noise_multiplier, batch_size, epochs, and delta values that we select. Epsilon has good theoretical backing as a measure of the risk to privacy: it takes into account the factors that determine how many times the model sees the training data, and it makes intuitive sense that the more the model sees the data, the greater the risk of user information being exposed through the model weights. But it still falls short of actually gauging how safe the model weights are against hostile inspection. This is a cause for concern, because we have no idea whether our model weights are immune to hacking. There is now a need for a metric that measures how opaque the model weights are, that is, how little information they reveal about the users.

FUTURE WORK

As discussed in the conclusion, coming up with a metric to judge the privacy of the model is the first priority. It could even be an attack model that tries to derive user information from the model weights; the number of training data points the attack is able to identify correctly can then be quantified as the risk to privacy. Once a reliable measure has been formalized, we can extend differentially private optimizers to more complex tasks in computer vision and natural language processing, and to other architectures as well.

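Purely as an illustration of what such a probe might look like (a hypothetical sketch, not part of the experiments above), a simple confidence-thresholding membership-inference baseline compares how often the trained model is highly confident on training members versus unseen test data; a large gap hints at memorization. The function name and threshold below are made up for this sketch.

import numpy as np

def membership_inference_baseline(model, x_train, y_train, x_test, y_test, threshold=0.9):
    # Toy privacy probe: fraction of samples whose predicted confidence for the
    # true class exceeds the threshold, measured on members (train) vs non-members (test).
    def hit_rate(x, y):
        probs = model.predict(x, verbose=0)
        true_class_conf = probs[np.arange(len(y)), y.argmax(axis=1)]  # y is one-hot
        return float(np.mean(true_class_conf > threshold))
    return hit_rate(x_train, y_train), hit_rate(x_test, y_test)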

REFERENCES

  1. M. Fredrikson, S. Jha, and T. Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures”. In CCS, pages 1322–1333. ACM, 2015.

  2. M. Abadi et al. “Deep learning with differential privacy”. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2016.

  3. TensorFlow Privacy, https://github.com/tensorflow/privacy

Translated from: https://towardsdatascience.com/differential-privacy-in-deep-learning-cf9cc3591d28
