k-anonymity, l-diversity and t-closeness

1. Introduction

1.1 Definition of equivalence class

Equivalence class: a set of records in an anonymized table that have the same values for the quasi-identifier (QI) attributes.

1.2 Definition of record

Record: a row in a relational (multidimensional) table; each record corresponds to one individual.

1.3 Definition of attributes

Attribute: each record consists of several attributes, which fall into three categories: explicit identifiers (EI), quasi-identifiers (QI), and sensitive data (SD).

1.4 Two kinds of disclosure: identity disclosure and attribute disclosure

  1. Identity disclosure: occurs when an individual is linked to a particular record in the released table.
  2. Attribute disclosure: occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than would be possible before the data release.

Identity disclosure usually leads to attribute disclosure: once an identity is linked to a record, the attributes of that record are revealed as well. Attribute disclosure, however, does not necessarily lead to identity disclosure. Note also that disclosure of even an incorrect attribute value can help an adversary infer an identity.

2. k-anonymity

2.1 Definition of k-anonymity

A table satisfies k-anonymity if every equivalence class contains at least k records that are indistinguishable from one another on the quasi-identifier attributes.
k-anonymity defends effectively against identity disclosure, but it does not provide sufficient protection against attribute disclosure.
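The definition above can be checked mechanically: group the records by their quasi-identifier values and verify that every equivalence class has at least k members. A minimal sketch in Python (the toy table, column roles, and function name are illustrative assumptions, not from the original papers):

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """True iff every equivalence class (records sharing the same
    quasi-identifier values) contains at least k records."""
    classes = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(size >= k for size in classes.values())

# Toy table: (ZIP, Age, Disease); QI = {ZIP, Age}, SD = Disease.
table = [
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Heart Disease"),
    ("476**", "2*", "Cancer"),
    ("4790*", ">=40", "Flu"),
    ("4790*", ">=40", "Cancer"),
]
print(is_k_anonymous(table, qi_indices=[0, 1], k=2))  # True: class sizes are 3 and 2
```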

2.2 Steps of k-anonymity

  1. Remove the explicit identifiers.
  2. Coarsen the quasi-identifiers, usually by generalization and suppression.
    An example is shown in the figure below:

An example of a homogeneity attack:
Suppose Alice knows that Bob is a 27-year old man living in ZIP 47678 and Bob’s record is in the table. From Table 2, Alice can conclude that Bob corresponds to one of the first three records, and thus must have heart disease.
An example of a background-knowledge attack:
Suppose that, by knowing Carl’s age and zip code, Alice can conclude that Carl corresponds to a record in the last equivalence class in Table 2. Furthermore, suppose that Alice knows that Carl has very low risk for heart disease. This background knowledge enables Alice to conclude that Carl most likely has cancer.

2.3 Generalization



2.4 Suppression

Suppression is introduced to lower the level of generalization and keep the data more precise. When a small number of records (tuples with fewer than k occurrences, called outliers) would otherwise force a higher level of generalization, suppression adjusts the generalization process. As shown in the figure below, the records in italics can simply be replaced by *, because they violate 2-anonymity and would otherwise cause over-generalization.
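The interplay between the two operations can be sketched as: generalize a quasi-identifier column to some level of its hierarchy, then suppress the outlier tuples that still occur fewer than k times. This is only an illustrative sketch under a simple ZIP-truncation hierarchy (the helper names and data are my assumptions, not Samarati's actual algorithm):

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Attribute-level generalization: replace the last `level` digits with '*'."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def anonymize(records, level, k):
    """Generalize the ZIP column (index 0), then suppress the outlier
    tuples whose equivalence class still has fewer than k occurrences."""
    generalized = [(generalize_zip(r[0], level),) + r[1:] for r in records]
    counts = Counter(g[0] for g in generalized)
    return [g for g in generalized if counts[g[0]] >= k]

rows = [("47677", "flu"), ("47678", "flu"), ("13053", "cancer")]
print(anonymize(rows, level=2, k=2))
# [('476**', 'flu'), ('476**', 'flu')] -- the lone 130** tuple is suppressed
```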

2.5 k-Minimal Generalization (with Suppression)

The application of generalization and suppression to a private table PT produces more general (less precise) and less complete (if some tuples are suppressed) tables that provide better protection of the respondents' identities. A minimal generalization avoids generalizing or suppressing more than necessary; k-minimal generalization with suppression is based on the following definition of distance vector.

Definition (Distance vector). Let T_i(A_1,...,A_n) and T_j(A_1,...,A_n) be two tables such that T_j is a generalization of T_i. The distance vector of T_j from T_i is the vector DV_{i,j} = [d_1,...,d_n], where each d_z, z = 1,...,n, is the length of the unique path between dom(A_z, T_i) and dom(A_z, T_j) in the domain generalization hierarchy DGH_{D_z}.


Figure 7 above shows that the blanks in the anonymized tables are suppressed cells, while the remaining tuples satisfy k-anonymity. This raises the question of whether it is better to lose precision through generalization or completeness through suppression. Samarati's answer is to define MaxSup, specifying the maximum number of tuples that can be suppressed. The definition of k-minimal generalization with suppression follows:

Intuitively, this definition states that a generalization T_j is k-minimal iff it satisfies k-anonymity, it does not enforce more suppression than is allowed (|T_i| − |T_j| ≤ MaxSup), and there does not exist another generalization satisfying these conditions with a distance vector smaller than that of T_j. For example, for Figure 1 below:

With MaxSup = 2, QI = {Race, ZIP}, and k = 2, there are two k-minimal generalizations with suppression, shown in Figure 7.

2.6 Classification of k-anonymity techniques


Generalization can be applied to: (i) attribute, generalizing an entire column; (ii) cell, which has too high a complexity; (iii) none.
Suppression can be applied to: (i) tuple, performed at the level of a row; (ii) attribute, performed at the level of a column; (iii) cell, performed at the level of single cells; (iv) none.
This topic is too broad to cover in full here; for details see 《Classification of k-Anonymity Techniques》.

3. L-Diversity

3.1 Definition of L-Diversity

An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity. “Well-represented” can be instantiated in three ways:

  1. Distinct l-diversity: ensures there are at least l distinct values for the sensitive attribute in each equivalence class. It cannot defend against probabilistic inference attacks: if one sensitive value appears much more frequently than the others within an equivalence class, an attacker still gains information.

  2. Entropy l-diversity: the entropy of an equivalence class E is defined as

    Entropy(E)=-\sum_{s \in S}p(E,s)\log p(E,s)

    where S is the domain of the sensitive attribute, and p(E,s) is the fraction of records in E that have sensitive value s. A table is said to have entropy l-diversity if for every equivalence class E, Entropy(E) ≥ log l.

  3. Recursive (c,l)-diversity: makes sure that the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely. Let m be the number of values in an equivalence class, and r_i, 1 ≤ i ≤ m, be the number of times the i-th most frequent sensitive value appears in an equivalence class E. Then E is said to have recursive (c,l)-diversity if r_1 < c(r_l + r_{l+1} + ... + r_m). A table is said to have recursive (c,l)-diversity if all of its equivalence classes have recursive (c,l)-diversity.
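The three instantiations of “well-represented” can each be checked directly on the multiset of sensitive values of one equivalence class. A minimal sketch (the function names are mine; natural logarithms are used, which is consistent as long as Entropy(E) and log l use the same base):

```python
import math
from collections import Counter

def distinct_l_diversity(sensitive_values, l):
    """At least l distinct sensitive values in the class."""
    return len(set(sensitive_values)) >= l

def entropy_l_diversity(sensitive_values, l):
    """Entropy(E) = -sum_s p(E,s) log p(E,s) must be >= log(l)."""
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n)
                   for c in Counter(sensitive_values).values())
    return entropy >= math.log(l)

def recursive_cl_diversity(sensitive_values, c, l):
    """r_1 < c * (r_l + ... + r_m), with r_i the i-th largest value count."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])

ec = ["flu", "flu", "cancer", "bronchitis"]
print(distinct_l_diversity(ec, 3))       # True: 3 distinct values
print(recursive_cl_diversity(ec, 2, 2))  # True: r_1 = 2 < 2 * (1 + 1)
```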

3.2 Limitations of L-Diversity

  1. l-diversity may be difficult and unnecessary to achieve. When one sensitive value dominates (or is extremely rare), Entropy(E) is small, so satisfying Entropy(E) ≥ log l forces l to be very small, and l-diversity becomes meaningless. For example, for the distribution (1/100, 99/100):

    Entropy(E)=\frac{1}{100}\log 100+\frac{99}{100} \log \frac{100}{99}\approx 0.056

    so entropy l-diversity can only hold for log l ≤ 0.056, i.e., l ≤ 1.06; even 2-diversity (log 2 ≈ 0.69) is unattainable, which makes the requirement meaningless.

  2. l-diversity is insufficient to prevent attribute disclosure.

    In the table above, the sensitive values within one equivalence class take three distinct values, but they are semantically close: all of them are stomach diseases.

4. T-Closeness

4.1 Information Gain

Information gain can be represented as the difference between the posterior belief and the prior belief, i.e., the change in the observer's belief before and after seeing the table.

4.2 Definition of T-Closeness

Consider three states. Before seeing the table, the observer's belief is B_0. After the QI values are generalized (revealing the overall distribution Q of the sensitive attribute), the belief is B_1. After seeing the full released table (revealing the distribution of the sensitive attribute within each equivalence class), the belief is B_2. l-diversity in effect limits the gap between B_0 and B_2, whereas t-closeness limits the gap between B_1 and B_2. In other words, treating Q, the overall distribution of the sensitive attribute, as public information: we do not limit the observer's information gain about the population as a whole, but limit the extent to which the observer can learn additional information about specific individuals.

An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class (P) and the distribution of the attribute in the whole table (Q) is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

4.3 Definition of the distance

For Tables 3 and 4, the overall distribution of Income is Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}. The distribution in the first equivalence class of Table 4 is P_1 = {3K, 4K, 5K}, and in the second P_2 = {6K, 8K, 11K}. Intuitively, P_1 leaks more information than P_2, because the values in P_1 are all low incomes; we express this as D[P_1,Q] > D[P_2,Q].

4.3.1 Earth Mover's Distance (EMD)

The EMD can be defined formally via the well-studied transportation problem. Let P = (p_1, p_2, ..., p_m), Q = (q_1, q_2, ..., q_m), and let d_{ij} be the ground distance between element i of P and element j of Q. We want to find a flow F = [f_{ij}], where f_{ij} is the flow of mass from element i of P to element j of Q, that minimizes the overall work:

WORK(P,Q,F)=\sum_{i=1}^{m}\sum_{j=1}^{m}d_{ij}f_{ij}
subject to the following constraints:

f_{ij}\ge 0 \qquad 1 \le i \le m,\ 1\le j \le m \qquad (c1)

p_{i}-\sum_{j=1}^{m}f_{ij}+\sum_{j=1}^{m}f_{ji}=q_{i} \qquad 1 \le i \le m \qquad (c2)

\sum_{i=1}^{m}\sum_{j=1}^{m}f_{ij}=\sum_{i=1}^{m}p_{i}=\sum_{i=1}^{m}q_{i}=1 \qquad (c3)
These three constraints guarantee that PP is transformed to QQ by the mass flow FF. Once the transportation problem is solved, the EMD is defined to be the total work, i.e.,

D[P,Q]=WORK(P,Q,F)=\sum_{i=1}^{m}\sum_{j=1}^{m}d_{ij}f_{ij}
Fact 1. If 0 ≤ d_{ij} ≤ 1 for all i, j, then 0 ≤ D[P,Q] ≤ 1.
The above fact follows directly from constraints (c1) and (c3). It says that if the ground distances are normalized, i.e., all distances are between 0 and 1, then the EMD between any two distributions is between 0 and 1. This gives a range from which one can choose the t value for t-closeness.

Fact 2. Given two equivalence classes E_1 and E_2, let P_1, P_2, and P be the distribution of a sensitive attribute in E_1, E_2, and E_1 ∪ E_2, respectively. Then

D[P,Q]\le \frac{|E_{1}|}{|E_{1}|+|E_{2}|}D[P_{1},Q]+\frac{|E_{2}|}{|E_{1}|+|E_{2}|}D[P_{2},Q]
This means that when merging two equivalence classes, the maximum distance of any equivalence class from the overall distribution can never increase. Thus t-closeness is achievable for any t≥0t \ge 0.

Because of these two facts, t-closeness with EMD satisfies the following two properties:

  1. Generalization Property. Let T be a table, and let A and B be two generalizations on T such that A is more general than B. If T satisfies t-closeness using B, then T also satisfies t-closeness using A.
  2. Subset Property. Let T be a table and let C be a set of attributes in T. If T satisfies t-closeness with respect to C, then T also satisfies t-closeness with respect to any set of attributes D such that D ⊂ C.

4.3.2 Computing the EMD

4.3.2.1 EMD for Numerical Attributes

Numerical attribute values are ordered. Let the attribute domain be {v_1, v_2, ..., v_m}, where v_i is the i-th smallest value.
Ordered Distance. The distance between two values is based on the number of values between them in the total order, i.e., ordered_dist(v_i, v_j) = |i − j| / (m − 1).

Formally, let r_i = p_i − q_i (i = 1, 2, ..., m); then the distance between P and Q can be calculated as:

D[P,Q]=\frac{1}{m-1}\left(|r_{1}|+|r_{1}+r_{2}|+...+|r_{1}+r_{2}+...+r_{m-1}|\right)=\frac{1}{m-1}\sum_{i=1}^{m-1}\left|\sum_{j=1}^{i}r_{j}\right|
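Because mass only needs to flow between adjacent values in the total order, the formula can be evaluated in a single pass over the prefix sums of r_i. A minimal sketch (the function name is mine):

```python
def ordered_emd(p, q):
    """EMD over an ordered numerical domain:
    D[P,Q] = 1/(m-1) * sum_{i=1}^{m-1} | sum_{j<=i} (p_j - q_j) |."""
    m = len(p)
    prefix, total = 0.0, 0.0
    for j in range(m - 1):       # the m-th prefix sum is always 0
        prefix += p[j] - q[j]
        total += abs(prefix)
    return total / (m - 1)

# P1 = {3K, 4K, 5K} over the 9-value Income domain of Section 4.3:
p1 = [1/3] * 3 + [0.0] * 6
q = [1/9] * 9
print(round(ordered_emd(p1, q), 3))  # 0.375
```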

4.3.2.2 EMD for Categorical Attributes

For categorical attributes, a total order often does not exist. We consider two distance measures.
Equal Distance:

D[P,Q]=\frac{1}{2}\sum_{i=1}^{m}|p_{i}-q_{i}|=\sum_{p_{i}\ge q_{i}}(p_{i}-q_{i})=-\sum_{p_{i}< q_{i}}(p_{i}-q_{i})
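With the equal ground distance (d_ij = 1 for i ≠ j), the EMD reduces to half the L1 distance between the two distributions, i.e., the total variation distance. A one-line sketch (the function name is mine):

```python
def equal_emd(p, q):
    """EMD with equal ground distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(equal_emd([0.5, 0.5, 0.0], [1/3, 1/3, 1/3]))  # ~ 1/3
```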
Hierarchical Distance:
Given a domain hierarchy and two distributions P and Q, we define the extra of a leaf node that corresponds to element i to be p_i − q_i, and the extra of an internal node N to be the sum of the extras of the leaf nodes below N. This extra function can be defined recursively as:

extra(N)= \begin{cases} p_{i}-q_{i} & \text{if } N \text{ is a leaf}\\ \sum_{C \in Child(N)}extra(C) & \text{otherwise} \end{cases}
where Child(N) is the set of child nodes of N. The extra function has the property that the sum of the extra values of the nodes at any one level is 0.

We further define two other functions for internal nodes:

pos\_extra(N)=\sum_{C\in Child(N)\wedge extra(C)>0}|extra(C)|

neg\_extra(N)=\sum_{C\in Child(N)\wedge extra(C)<0}|extra(C)|
We use cost(N) to denote the cost of moving mass between the children branches of N. An optimal flow moves exactly extra(N) into or out of the subtree rooted at N. Suppose that pos_extra(N) > neg_extra(N); then extra(N) = pos_extra(N) − neg_extra(N), and extra(N) needs to move out (this cost is counted in the cost of N's parent node). In addition, one has to move neg_extra(N) among the children branches to even them all out; thus,

cost(N)=\frac{height(N)}{H}\min(pos\_extra(N),neg\_extra(N))
Then the earth mover's distance can be written as:

D[P,Q]=\sum_{N}cost(N)

where N ranges over the non-leaf nodes.
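The extra/pos_extra/neg_extra recursion above translates directly into a post-order traversal of the hierarchy. A minimal sketch, assuming the hierarchy is given as nested lists whose leaves are indices into P and Q, with all leaves at the same depth (the representation and function name are my assumptions):

```python
def hierarchical_emd(tree, p, q, H):
    """EMD under the hierarchical ground distance: sum of cost(N) over the
    internal nodes N, with cost(N) = height(N)/H * min(pos_extra, neg_extra)."""
    def walk(node):
        if isinstance(node, int):              # leaf i: extra = p_i - q_i
            return p[node] - q[node], 0.0, 0   # (extra, cost, height)
        pos = neg = cost = 0.0
        height = 0
        for child in node:
            e, c, h = walk(child)
            cost += c
            height = max(height, h + 1)
            if e > 0:
                pos += e
            else:
                neg -= e
        cost += (height / H) * min(pos, neg)   # cost(N) at this internal node
        return pos - neg, cost, height
    _, cost, _ = walk(tree)
    return cost

# Two-level hierarchy over 4 leaves: {{0,1},{2,3}}, H = 2.
tree = [[0, 1], [2, 3]]
p = [0.5, 0.5, 0.0, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(hierarchical_emd(tree, p, q, H=2))  # 0.5: move 0.5 mass across the root
```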

4.4 Analysis of t-Closeness with EMD

From Tables 3 and 4 we know Q = {3K, 4K, 5K, 6K, 7K, 8K, 9K, 10K, 11K}, P_1 = {3K, 4K, 5K} and P_2 = {6K, 8K, 11K}. We now use the EMD to compute D[P_1,Q] and D[P_2,Q]. By the earlier definition, v_1 = 3K, v_2 = 4K, ..., v_9 = 11K, and the distance between v_i and v_j is |i − j| / 8, so the maximum distance is 1.

For D[P_1,Q], one optimal mass flow that transforms P_1 to Q moves 1/9 probability mass across the following pairs: (5K→11K), (5K→10K), (5K→9K), (4K→8K), (4K→7K), (4K→6K), (3K→5K), (3K→4K). The cost of this is 1/9 × (6+5+4+4+3+2+2+1)/8 = 27/72 = 0.375.
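The same number also follows from the closed-form prefix-sum formula of Section 4.3.2.1, and the analogous computation for P_2 = {6K, 8K, 11K} gives 1/6 ≈ 0.167, confirming D[P_1,Q] > D[P_2,Q]. A quick check (the index layout v_1 = 3K, ..., v_9 = 11K is as above):

```python
# Re-check the worked example with the prefix-sum formula for numerical EMD.
q = [1/9] * 9                                       # overall Income distribution Q
classes = {
    "D[P1,Q]": [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0],   # P1 = {3K, 4K, 5K}
    "D[P2,Q]": [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3],   # P2 = {6K, 8K, 11K}
}
results = {}
for name, p in classes.items():
    prefix, total = 0.0, 0.0
    for j in range(8):                              # |prefix sums of r_j = p_j - q_j|
        prefix += p[j] - q[j]
        total += abs(prefix)
    results[name] = total / 8
print(results)   # D[P1,Q] = 0.375, D[P2,Q] = 1/6 ~ 0.167
```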

5. Comparison of k-anonymity, l-diversity and t-closeness

6. References

k-Anonymity, V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati
