k-anonymity, l-diversity and t-closeness

1. Introduction

1.1 Definition of equivalence class

Equivalence class: a set of records in an anonymized table that have the same values for the quasi-identifier (QI) attributes.

1.2 Definition of record

Record: a row in a relational (multidimensional) table; each record corresponds to one individual.

1.3 Definition of attributes

Attribute: each record consists of several attributes, which fall into three categories: explicit identifiers (EI), quasi-identifiers (QI), and sensitive data (SD).

1.4 Two kinds of disclosure: identity disclosure and attribute disclosure

  1. Identity disclosure: occurs when an individual is linked to a particular record in the released table.
  2. Attribute disclosure: occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than would be possible before the data release.

Identity disclosure usually leads to attribute disclosure: once an identity is linked to a record, the attributes of that record are revealed as well. Attribute disclosure, however, does not necessarily lead to identity disclosure. Note also that disclosure of even an incorrect attribute value can help an adversary infer an identity.

2. k-anonymity

2.1 Definition of k-anonymity

A table satisfies k-anonymity if every equivalence class contains at least k records that are indistinguishable from one another on the quasi-identifier attributes.
k-anonymity defends effectively against identity disclosure, but it does not provide sufficient protection against attribute disclosure.
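The definition above can be checked mechanically: group the records by their quasi-identifier values and verify that every equivalence class has at least k members. A minimal sketch in Python (the toy table, column roles, and function name are illustrative assumptions, not from the original papers):

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """True iff every equivalence class (records sharing the same
    quasi-identifier values) contains at least k records."""
    classes = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(size >= k for size in classes.values())

# Toy table: (ZIP, Age, Disease); QI = {ZIP, Age}, SD = Disease.
table = [
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Cancer"),
    ("4790*", ">=40", "Flu"),
    ("4790*", ">=40", "Cancer"),
]
print(is_k_anonymous(table, qi_indices=[0, 1], k=2))  # True: class sizes are 3 and 2
```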

2.2 Steps of k-anonymity

  1. Remove the explicit identifiers.
  2. Coarsen the quasi-identifiers, usually by generalization and suppression.
    An example is shown in the figure below:

An example of a homogeneity attack:
Suppose Alice knows that Bob is a 27-year old man living in ZIP 47678 and Bob’s record is in the table. From Table 2, Alice can conclude that Bob corresponds to one of the first three records, and thus must have heart disease.
An example of a background-knowledge attack:
Suppose that, by knowing Carl’s age and zip code, Alice can conclude that Carl corresponds to a record in the last equivalence class in Table 2. Furthermore, suppose that Alice knows that Carl has very low risk for heart disease. This background knowledge enables Alice to conclude that Carl most likely has cancer.

2.3 Generalization



2.4 Suppression

Suppression is introduced to lower the level of generalization and keep the data more precise. When a small number of records (tuples with fewer than k occurrences, called outliers) would otherwise force a higher level of generalization, suppression adjusts the generalization process. As shown in the figure below, the records in italics can simply be replaced by *, because they violate 2-anonymity and would otherwise cause over-generalization.
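The interplay between the two operations can be sketched as: generalize a quasi-identifier column to some level of its hierarchy, then suppress the outlier tuples that still occur fewer than k times. This is only an illustrative sketch under a simple ZIP-truncation hierarchy (the helper names and data are my assumptions, not Samarati's actual algorithm):

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Attribute-level generalization: replace the last `level` digits with '*'."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def anonymize(records, level, k):
    """Generalize the ZIP column (index 0), then suppress the outlier
    tuples whose equivalence class still has fewer than k occurrences."""
    generalized = [(generalize_zip(r[0], level),) + r[1:] for r in records]
    counts = Counter(g[0] for g in generalized)
    return [g for g in generalized if counts[g[0]] >= k]

rows = [("47677", "flu"), ("47678", "flu"), ("13053", "cancer")]
print(anonymize(rows, level=2, k=2))
# [('476**', 'flu'), ('476**', 'flu')] -- the lone 130** tuple is suppressed
```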

2.5 k-Minimal Generalization (with Suppression)

The application of generalization and suppression to a private table PT produces more general (less precise) and less complete (if some tuples are suppressed) tables that provide better protection of the respondents' identities. A minimal generalization avoids generalizing or suppressing more than necessary; k-minimal generalization with suppression is based on the following definition of distance vector.

Definition (Distance vector). Let T_i(A_1,...,A_n) and T_j(A_1,...,A_n) be two tables such that T_j is a generalization of T_i. The distance vector of T_j from T_i is the vector DV_{i,j} = [d_1,...,d_n], where each d_z, z = 1,...,n, is the length of the unique path between dom(A_z, T_i) and dom(A_z, T_j) in the domain generalization hierarchy DGH_{D_z}.


Figure 7 above shows that the blanks in the anonymized tables are suppressed cells, while the remaining tuples satisfy k-anonymity. This raises the question of whether it is better to lose precision through generalization or completeness through suppression. Samarati's answer is to define MaxSup, specifying the maximum number of tuples that can be suppressed. The definition of k-minimal generalization with suppression follows:

Intuitively, this definition states that a generalization T_j is k-minimal iff it satisfies k-anonymity, it does not enforce more suppression than is allowed (|T_i| − |T_j| ≤ MaxSup), and there does not exist another generalization satisfying these conditions with a distance vector smaller than that of T_j. For example, for Figure 1 below:

With MaxSup = 2, QI = {Race, ZIP}, and k = 2, there are two k-minimal generalizations with suppression, shown in Figure 7.

2.6 Classification of k-anonymity techniques


Generalization can be applied to: (i) attribute, generalizing an entire column; (ii) cell, which has too high a complexity; (iii) none.
Suppression can be applied to: (i) tuple, performed at the level of a row; (ii) attribute, performed at the level of a column; (iii) cell, performed at the level of single cells; (iv) none.
This topic is too broad to cover in full here; for details see 《Classification of k-Anonymity Techniques》.

3. L-Diversity

3.1 Definition of L-Diversity

An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity. “Well-represented” can be instantiated in three ways:

  1. Distinct l-diversity: ensures there are at least l distinct values for the sensitive attribute in each equivalence class. It cannot defend against probabilistic inference attacks: if one sensitive value appears much more frequently than the others within an equivalence class, an attacker still gains information.

  2. Entropy l-diversity: the entropy of an equivalence class E is defined as

    Entropy(E)=-\sum_{s \in S}p(E,s)\log p(E,s)

    where S is the domain of the sensitive attribute, and p(E,s) is the fraction of records in E that have sensitive value s. A table is said to have entropy l-diversity if for every equivalence class E, Entropy(E) ≥ log l.

  3. Recursive (c,l)-diversity: makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely. Let m be the number of values in an equivalence class, and r_i, 1 ≤ i ≤ m, be the number of times the i-th most frequent sensitive value appears in an equivalence class E. Then E is said to have recursive (c,l)-diversity if r_1 < c(r_l + r_{l+1} + ... + r_m). A table is said to have recursive (c,l)-diversity if all of its equivalence classes have recursive (c,l)-diversity.
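The three instantiations of “well-represented” can each be checked directly on the multiset of sensitive values of one equivalence class. A minimal sketch (the function names are mine; natural logarithms are used, which is consistent as long as Entropy(E) and log l use the same base):

```python
import math
from collections import Counter

def distinct_l_diversity(sensitive_values, l):
    """At least l distinct sensitive values in the class."""
    return len(set(sensitive_values)) >= l

def entropy_l_diversity(sensitive_values, l):
    """Entropy(E) = -sum_s p(E,s) log p(E,s) must be >= log(l)."""
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n)
                   for c in Counter(sensitive_values).values())
    return entropy >= math.log(l)

def recursive_cl_diversity(sensitive_values, c, l):
    """r_1 < c * (r_l + ... + r_m), with r_i the i-th largest value count."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

ec = ["flu", "flu", "cancer", "bronchitis"]
print(distinct_l_diversity(ec, 3))       # True: 3 distinct values
print(recursive_cl_diversity(ec, 2, 2))  # True: r_1 = 2 < 2 * (1 + 1)
```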

3.2 Limitations of L-Diversity

  1. l-diversity may be difficult and unnecessary to achieve. When one sensitive value dominates (or is extremely rare), Entropy(E) is small, so satisfying Entropy(E) ≥ log l forces l to be very small, and l-diversity becomes meaningless. For example, for the distribution (1/100, 99/100):

    Entropy(E)=\frac{1}{100}\log 100+\frac{99}{100} \log \frac{100}{99}\approx 0.056

    so entropy l-diversity can only hold for log l ≤ 0.056, i.e., l ≤ 1.06; even 2-diversity (log 2 ≈ 0.69) is unattainable, which makes the requirement meaningless.

  2. l-diversity is insufficient to prevent attribute disclosure.

    In the table above, the sensitive values within one equivalence class take three distinct values, but they are semantically close: all of them are stomach diseases.

4. T-Closeness

4.1 Information Gain

Information gain can be represented as the difference between the posterior belief and the prior belief, i.e., the change in the observer's belief before and after seeing the table.

4.2 Definition of T-Closeness

Consider three states. Before seeing the table, the observer's belief is B_0. After the QI values are generalized (revealing the overall distribution Q of the sensitive attribute), the belief is B_1. After seeing the full released table (revealing the distribution of the sensitive attribute within each equivalence class), the belief is B_2. l-diversity in effect limits the gap between B_0 and B_2, whereas t-closeness limits the gap between B_1 and B_2. In other words, treating Q, the overall distribution of the sensitive attribute, as public information: we do not limit the observer's information gain about the population as a whole, but limit the extent to which the observer can learn additional information about specific individuals.

An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class (P) and the distribution of the attribute in the whole table (Q) is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

4.3 Definition of the distance

For Tables 3 and 4, the overall distribution of Income is Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}. The distribution in the first equivalence class of Table 4 is P_1 = {3K, 4K, 5K}, and in the second P_2 = {6K, 8K, 11K}. Intuitively, P_1 leaks more information than P_2, because the values in P_1 are all low incomes; we express this as D[P_1,Q] > D[P_2,Q].

4.3.1 Earth Mover's Distance (EMD)

The EMD can be defined formally via the well-studied transportation problem. Let P = (p_1, p_2, ..., p_m), Q = (q_1, q_2, ..., q_m), and let d_{ij} be the ground distance between element i of P and element j of Q. We want to find a flow F = [f_{ij}], where f_{ij} is the flow of mass from element i of P to element j of Q, that minimizes the overall work:

WORK(P,Q,F)=\sum_{i=1}^{m}\sum_{j=1}^{m}d_{ij}f_{ij}
subject to the following constraints:

f_{ij}\ge 0 \qquad 1 \le i \le m,\ 1\le j \le m \qquad (c1)

p_{i}-\sum_{j=1}^{m}f_{ij}+\sum_{j=1}^{m}f_{ji}=q_{i} \qquad 1 \le i \le m \qquad (c2)

\sum_{i=1}^{m}\sum_{j=1}^{m}f_{ij}=\sum_{i=1}^{m}p_{i}=\sum_{i=1}^{m}q_{i}=1 \qquad (c3)
These three constraints guarantee that PP is transformed to QQ by the mass flow FF. Once the transportation problem is solved, the EMD is defined to be the total work, i.e.,

D[P,Q]=WORK(P,Q,F)=\sum_{i=1}^{m}\sum_{j=1}^{m}d_{ij}f_{ij}
Fact 1. If 0 ≤ d_{ij} ≤ 1 for all i, j, then 0 ≤ D[P,Q] ≤ 1.
The above fact follows directly from constraints (c1) and (c3). It says that if the ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

Fact 2. Given two equivalence classes E_1 and E_2, let P_1, P_2, and P be the distribution of a sensitive attribute in E_1, E_2, and E_1 ∪ E_2, respectively. Then

D[P,Q]\le \frac{|E_{1}|}{|E_{1}|+|E_{2}|}D[P_{1},Q]+\frac{|E_{2}|}{|E_{1}|+|E_{2}|}D[P_{2},Q]
This means that when merging two equivalence classes, the maximum distance of any equivalence class from the overall distribution can never increase. Thus t-closeness is achievable for any t≥0t \ge 0.

Because of these two facts, t-closeness with EMD satisfies the following two properties:

  1. Generalization Property. Let T be a table, and let A and B be two generalizations on T such that A is more general than B. If T satisfies t-closeness using B, then T also satisfies t-closeness using A.
  2. Subset Property. Let T be a table and let C be a set of attributes in T. If T satisfies t-closeness with respect to C, then T also satisfies t-closeness with respect to any set of attributes D such that D ⊂ C.

4.3.2 Computing the EMD

4.3.2.1 EMD for Numerical Attributes

Numerical attribute values are ordered. Let the attribute domain be {v_1, v_2, ..., v_m}, where v_i is the i-th smallest value.
Ordered Distance. The distance between two values is based on the number of values between them in the total order, i.e., ordered_dist(v_i, v_j) = |i − j| / (m − 1).

Formally, let r_i = p_i − q_i (i = 1, 2, ..., m); then the distance between P and Q can be calculated as:

D[P,Q]=\frac{1}{m-1}\left(|r_{1}|+|r_{1}+r_{2}|+...+|r_{1}+r_{2}+...+r_{m-1}|\right)=\frac{1}{m-1}\sum_{i=1}^{m-1}\left|\sum_{j=1}^{i}r_{j}\right|
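Because mass only needs to flow between adjacent values in the total order, the formula can be evaluated in a single pass over the prefix sums of r_i. A minimal sketch (the function name is mine):

```python
def ordered_emd(p, q):
    """EMD over an ordered numerical domain:
    D[P,Q] = 1/(m-1) * sum_{i=1}^{m-1} | sum_{j<=i} (p_j - q_j) |."""
    m = len(p)
    prefix, total = 0.0, 0.0
    for j in range(m - 1):       # the m-th prefix sum is always 0
        prefix += p[j] - q[j]
        total += abs(prefix)
    return total / (m - 1)

# P1 = {3K, 4K, 5K} over the 9-value Income domain of Section 4.3:
p1 = [1/3] * 3 + [0.0] * 6
q = [1/9] * 9
print(round(ordered_emd(p1, q), 3))  # 0.375
```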

4.3.2.2 EMD for Categorical Attributes

For categorical attributes, a total order often does not exist. We consider two distance measures.
Equal Distance:

D[P,Q]=\frac{1}{2}\sum_{i=1}^{m}|p_{i}-q_{i}|=\sum_{p_{i}\ge q_{i}}(p_{i}-q_{i})=-\sum_{p_{i}< q_{i}}(p_{i}-q_{i})
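With the equal ground distance (d_ij = 1 for i ≠ j), the EMD reduces to half the L1 distance between the two distributions, i.e., the total variation distance. A one-line sketch (the function name is mine):

```python
def equal_emd(p, q):
    """EMD with equal ground distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(equal_emd([0.5, 0.5, 0.0], [1/3, 1/3, 1/3]))  # ~ 1/3
```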
Hierarchical Distance:
Given a domain hierarchy and two distributions P and Q, we define the extra of a leaf node that corresponds to element i to be p_i − q_i, and the extra of an internal node N to be the sum of the extras of the leaf nodes below N. This extra function can be defined recursively as:

extra(N)= \begin{cases} p_{i}-q_{i} & \text{if } N \text{ is a leaf}\\ \sum_{C \in Child(N)}extra(C) & \text{otherwise} \end{cases}
where Child(N) is the set of child nodes of N. The extra function has the property that the sum of the extra values of the nodes at any one level is 0.

We further define two other functions for internal nodes:

pos\_extra(N)=\sum_{C\in Child(N)\wedge extra(C)>0}|extra(C)|

neg\_extra(N)=\sum_{C\in Child(N)\wedge extra(C)<0}|extra(C)|
We use cost(N) to denote the cost of moving mass between the children branches of N. An optimal flow moves exactly extra(N) into or out of the subtree rooted at N. Suppose that pos_extra(N) > neg_extra(N); then extra(N) = pos_extra(N) − neg_extra(N), and extra(N) needs to move out (this cost is counted in the cost of N's parent node). In addition, one has to move neg_extra(N) among the children branches to even them all out; thus,

cost(N)=\frac{height(N)}{H}\min(pos\_extra(N),neg\_extra(N))
Then the earth mover's distance can be written as:

D[P,Q]=\sum_{N}cost(N)

where N ranges over the non-leaf nodes.
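The extra/pos_extra/neg_extra recursion above translates directly into a post-order traversal of the hierarchy. A minimal sketch, assuming the hierarchy is given as nested lists whose leaves are indices into P and Q, with all leaves at the same depth (the representation and function name are my assumptions):

```python
def hierarchical_emd(tree, p, q, H):
    """EMD under the hierarchical ground distance: sum of cost(N) over the
    internal nodes N, with cost(N) = height(N)/H * min(pos_extra, neg_extra)."""
    def walk(node):
        if isinstance(node, int):              # leaf i: extra = p_i - q_i
            return p[node] - q[node], 0.0, 0   # (extra, cost, height)
        pos = neg = cost = 0.0
        height = 0
        for child in node:
            e, c, h = walk(child)
            cost += c
            height = max(height, h + 1)
            if e > 0:
                pos += e
            else:
                neg -= e
        cost += (height / H) * min(pos, neg)   # cost(N) at this internal node
        return pos - neg, cost, height
    _, cost, _ = walk(tree)
    return cost

# Two-level hierarchy over 4 leaves: {{0,1},{2,3}}, H = 2.
tree = [[0, 1], [2, 3]]
p = [0.5, 0.5, 0.0, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(hierarchical_emd(tree, p, q, H=2))  # 0.5: move 0.5 mass across the root
```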

4.4 Analysis of t-Closeness with EMD

From Tables 3 and 4 we know Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}, P_1 = {3K, 4K, 5K} and P_2 = {6K, 8K, 11K}. We now use the EMD to compute D[P_1,Q] and D[P_2,Q]. By the earlier definition, v_1 = 3K, v_2 = 4K, ..., v_9 = 11K, and the distance between v_i and v_j is |i − j| / 8, so the maximum distance is 1.

For D[P_1,Q], one optimal mass flow that transforms P_1 to Q moves 1/9 probability mass across the following pairs: (5K→11K), (5K→10K), (5K→9K), (4K→8K), (4K→7K), (4K→6K), (3K→5K), (3K→4K). The cost of this is 1/9 × (6+5+4+4+3+2+2+1)/8 = 27/72 = 0.375.
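The same number also follows from the closed-form prefix-sum formula of Section 4.3.2.1, and the analogous computation for P_2 = {6K, 8K, 11K} gives 1/6 ≈ 0.167, confirming D[P_1,Q] > D[P_2,Q]. A quick check (the index layout v_1 = 3K, ..., v_9 = 11K is as above):

```python
# Re-check the worked example with the prefix-sum formula for numerical EMD.
q = [1/9] * 9                                       # overall Income distribution Q
classes = {
    "D[P1,Q]": [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0],   # P1 = {3K, 4K, 5K}
    "D[P2,Q]": [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3],   # P2 = {6K, 8K, 11K}
}
results = {}
for name, p in classes.items():
    prefix, total = 0.0, 0.0
    for j in range(8):                              # |prefix sums of r_j = p_j - q_j|
        prefix += p[j] - q[j]
        total += abs(prefix)
    results[name] = total / 8
print(results)   # D[P1,Q] = 0.375, D[P2,Q] = 1/6 ~ 0.167
```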

5. Comparison of k-anonymity, l-diversity and t-closeness

6. References

k-Anonymity, V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
