Using Differential Privacy to Harness Big Data and Preserve Privacy

The modern world runs on “big data,” the massive data sets used by governments, firms, and academic researchers to conduct analyses, unearth patterns, and drive decision-making. When it comes to data analysis, bigger can be better: The more high-quality data is incorporated, the more robust the analysis will be. Large-scale data analysis is becoming increasingly powerful thanks to machine learning and has a wide range of benefits, such as informing public-health research, reducing traffic, and identifying systemic discrimination in loan applications.

But there’s a downside to big data, as it requires aggregating vast amounts of potentially sensitive personal information. Whether amassing medical records, scraping social media profiles, or tracking banking and credit card transactions, data scientists risk jeopardizing the privacy of the individuals whose records they collect. And once data is stored on a server, it may be stolen, shared, or compromised. “Improper disclosure of such data can have adverse consequences for a data subject’s private information, or even lead to civil liability or bodily harm,” explains data scientist An Nguyen, in his article, “Understanding Differential Privacy.”

Computer scientists have worked for years to try to find ways to make data more private, but even if they attempt to de-identify data — for example, by removing individuals’ names or other parts of a data set — it is often possible for others to “connect the dots” and piece together information from multiple sources to determine a supposedly anonymous individual’s identity (via a so-called re-identification or linkage attack).

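To see how such a linkage attack works, consider a toy sketch with made-up data (our own illustration, not from the article): an “anonymized” medical record with no name attached can be re-identified by joining it against a public voter roll on shared quasi-identifiers such as ZIP code, birth date, and sex.

```python
import pandas as pd

# Toy linkage attack on made-up data: the medical table carries no names,
# but it shares quasi-identifiers with a public voter roll.
medical = pd.DataFrame({
    "zip": ["94704"], "birth_date": ["1975-03-02"], "sex": ["F"],
    "diagnosis": ["hypertension"],
})
voters = pd.DataFrame({
    "name": ["Jane Doe"], "zip": ["94704"],
    "birth_date": ["1975-03-02"], "sex": ["F"],
})
# Joining on the quasi-identifiers re-attaches a name to the "anonymous" record.
linked = medical.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```
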
Fortunately, in recent years, computer scientists have developed a promising new approach to privacy-preserving data analysis known as “differential privacy” that allows researchers to unearth the patterns within a data set — and derive observations about the population as a whole — while obscuring the information about each individual’s records.

The solution: differential privacy

Differential privacy (also known as “epsilon indistinguishability”) was first developed in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. In a 2016 lecture, Dwork defined differential privacy as being achieved when “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.”

How is this possible? Differential privacy works by adding a pre-determined amount of randomness, or “noise,” into a computation performed on a data set. As an example, imagine that five people each answer “yes” or “no” to a survey question, but before their responses are accepted, they have to flip a coin. If they flip heads, they answer the question honestly. But if they flip tails, they have to flip the coin again: if the second toss is tails, they respond “yes,” and if heads, they respond “no,” regardless of their actual answer to the question.

As a result of this process, we would expect a quarter of respondents (0.5 × 0.5, those who flip tails twice) to answer “yes,” even if their actual answer would have been “no.” With sufficient data, the researcher would be able to factor in this probability and still determine the overall population’s response to the original question, but every individual in the data set would be able to plausibly deny that their actual response was included.

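The coin-flip scheme is straightforward to simulate. The sketch below (our own illustration, not code from the article) applies the mechanism to each respondent and then debiases the aggregate: a reported “yes” occurs with probability 0.5p + 0.25 when the true “yes” rate is p, so the estimator 2 × (observed rate) − 0.5 recovers p.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    # First flip: heads -> answer honestly.
    if random.random() < 0.5:
        return true_answer
    # Second flip: tails -> report "yes", heads -> report "no",
    # regardless of the respondent's actual answer.
    return random.random() < 0.5

def estimate_true_rate(reports: list) -> float:
    # P(reported yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

random.seed(0)
truths = [random.random() < 0.30 for _ in range(100_000)]  # true rate: 30%
reports = [randomized_response(t) for t in truths]
print(f"estimated true 'yes' rate: {estimate_true_rate(reports):.3f}")  # ~0.30
```
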
Of course, researchers don’t actually use coin tosses and instead rely on algorithms that, based on a pre-determined probability, similarly alter some of the responses in the data set. The more responses are changed by the algorithm, the more privacy is preserved for the individuals in the data set. The trade-off, of course, is that as more “noise” is added to the computation (that is, as a greater percentage of responses are changed), the accuracy of the data analysis goes down.

When Dwork and her colleagues first defined differential privacy, they used the Greek symbol ε, or epsilon, to mathematically define the privacy loss associated with the release of data from a data set. This value defines just how much differential privacy is provided by a particular algorithm: The lower the value of epsilon, the more each individual’s privacy is protected. The higher the epsilon, the more accurate the data analysis — but the less privacy is preserved.

A lower epsilon value results in greater privacy — but lower accuracy
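
To make epsilon concrete, consider the coin-flip scheme above (a standard textbook calculation, not a figure from the article): a reported “yes” is at most three times likelier when the true answer is “yes” (probability 0.75) than when it is “no” (probability 0.25), and symmetrically for “no.” The log of that worst-case likelihood ratio is the privacy loss, giving ε = ln 3 ≈ 1.1.

```python
import math

# Coin-flip scheme: P(report yes | true yes) = 0.5 + 0.25 = 0.75,
#                   P(report yes | true no)  = 0.25.
# The privacy loss is the log of the worst-case likelihood ratio.
epsilon = math.log(0.75 / 0.25)
print(f"epsilon = ln(3) ≈ {epsilon:.2f}")  # ≈ 1.10
```
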
When the data is perturbed (i.e. the “noise” is added) while still on a user’s device, it’s known as local differential privacy. When the noise is added to a computation after the data has been collected, it’s called central differential privacy. With this latter method, the more you query a data set, the more information risks being leaked about the individual records. Therefore, the central model requires constantly searching for new sources of data to maintain high levels of privacy.

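In the central model, one standard way to add that noise is the Laplace mechanism, a textbook building block that the article alludes to but does not name. Below is a minimal sketch under that assumption, for a simple counting query: adding or removing a single record changes a count by at most 1 (its “sensitivity”), so Laplace noise with scale 1/ε makes the released count ε-differentially private.

```python
import numpy as np

def private_count(records, epsilon):
    # A count has sensitivity 1: one record changes it by at most 1.
    # Laplace noise with scale sensitivity/epsilon yields epsilon-DP.
    true_count = sum(records)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

data = [True] * 500 + [False] * 500  # 500 "yes" records
for eps in (0.1, 1.0, 10.0):
    # Lower epsilon -> wider noise -> more privacy, less accuracy.
    print(f"eps={eps:>4}: noisy count = {private_count(data, eps):.1f}")
```

Each answered query consumes some of this privacy budget, which is why, as noted above, repeatedly querying the same data set in the central model leaks progressively more information.
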
Either way, a key goal of differential privacy is to ensure that the results of a given query will not be affected by the presence (or absence) of a single record. Differential privacy also makes data less attractive to would-be attackers and can help prevent them from connecting personal data from multiple platforms.

Differential privacy in practice

Differential privacy has already gained widespread adoption by governments, firms, and researchers. The U.S. Census Bureau, for example, uses it for “disclosure avoidance,” and Apple uses differential privacy to analyze user data ranging from emoji suggestions to Safari crashes. Google has even released an open-source version of a differential privacy library used in many of the company’s core products.

Using a concept known as “elastic sensitivity,” developed in recent years by researchers at UC Berkeley, differential privacy is being extended to real-world SQL queries. The ride-sharing service Uber adopted this approach to study everything from traffic patterns to drivers’ earnings, all while protecting users’ privacy. By incorporating elastic sensitivity into a system that requires massive amounts of user data to connect riders with drivers, the company can help protect its users from snooping.

Consider, for example, how implementing elastic sensitivity could protect a high-profile Uber user, such as Ivanka Trump. As Andy Greenberg wrote in Wired: “If an Uber business analyst asks how many people are currently hailing cars in midtown Manhattan — perhaps to check whether the supply matches the demand — and Ivanka Trump happens to be requesting an Uber at that moment, the answer wouldn’t reveal much about her in particular. But if a prying analyst starts asking the same question about the block surrounding Trump Tower, for instance, Uber’s elastic sensitivity would add a certain amount of randomness to the result to mask whether Ivanka, specifically, might be leaving the building at that time.”

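As a toy numerical illustration of that masking (our own sketch, not Uber’s actual elastic-sensitivity implementation): whether the true rider count on the block is 4 or 5, Laplace noise spreads the reported answers over heavily overlapping ranges, so a single query result cannot reliably reveal whether one specific person is present.

```python
import numpy as np

epsilon = 0.5
noise = lambda: np.random.laplace(0.0, 1.0 / epsilon)

# Riders currently on the block around Trump Tower (hypothetical numbers).
count_without_ivanka = 4
count_with_ivanka = 5

# With noise of scale 1/epsilon, the two noisy answers are drawn from
# heavily overlapping distributions and are hard to tell apart.
print("without:", count_without_ivanka + noise())
print("with:   ", count_with_ivanka + noise())
```
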
Still, for all its benefits, most organizations are not yet using differential privacy. It requires large data sets, it is computationally intensive, and organizations may lack the resources or personnel to deploy it. They also may not want to reveal how much private information they’re using — and potentially leaking.

Another concern is that organizations that use differential privacy may be overstating how much privacy they’re providing. A firm may claim to use differential privacy, but in practice could use such a high epsilon value that the actual privacy provided would be limited.

Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community.

To address whether differential privacy is being properly deployed, Dwork, together with UC Berkeley researchers Nitin Kohli and Deirdre Mulligan, have proposed the creation of an “Epsilon Registry” to encourage companies to be more transparent. “Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community,” they wrote in the Journal of Privacy and Confidentiality. “To serve these purposes, we propose the creation of the Epsilon Registry — a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.”

As a final note, organizations should not rely on differential privacy alone, but rather should use it as just one defense in a broader arsenal, alongside other measures like encryption and access control. Organizations should disclose the sources of data they’re using for their analysis, along with the steps they’re taking to protect that data. Combining such practices with low-epsilon differential privacy will go a long way toward realizing the benefits of “big data” while reducing the leakage of sensitive personal data.

This article was cross-posted by Brookings TechStream. The video was animated by Annalise Kamegawa. The Center for Long-Term Cybersecurity would like to thank Nitin Kohli, PhD student in the UC Berkeley School of Information, and Paul Laskowski, Assistant Adjunct Professor in the UC Berkeley School of Information, for providing their expertise to review this video and article.

Source: https://medium.com/cltc-bulletin/using-differential-privacy-to-harness-big-data-and-preserve-privacy-349d84799862
