Abstract

1 Introduction

In membership inference, the adversary's goal is to determine whether a given data sample was used to train the target ML model.

Existing privacy attacks rely on the confidence scores (i.e., class probabilities or logits) output by the ML model. Membership inference succeeds because of the inherent overfitting of ML models: a model outputs higher scores for the samples it was trained on. Figure 1 shows which components of the ML model are accessible under this score-based threat model.

A major drawback of these score-based attacks is that they fail when only the predicted label is available (i.e., the model's final output rather than its confidence scores).

This motivates us to focus on a new and rarely studied class of membership inference attacks, called decision-based attacks, in which the adversary relies solely on the model's final output, i.e., the top-1 predicted label, as input to the attack model.

In this paper, we propose two decision-based attacks for different scenarios: the transfer attack and the boundary attack.

Transfer Attack

We assume the adversary has an auxiliary dataset (namely a shadow dataset) that comes from the same distribution as the target model's training set. The same assumption holds for previous score-based attacks [35, 46, 48, 49]. The adversary first queries the target model in a manner analogous to a cryptographic oracle, thereby relabeling the shadow dataset with the target model's predicted labels. Then, the adversary uses the relabeled shadow dataset to train a local shadow model that mimics the behavior of the target model. In this way, the relabeled shadow dataset carries sufficient information from the target model, and membership information is transferred to the shadow model as well. Finally, the adversary leverages the shadow model to launch a score-based membership inference attack locally.
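As a rough illustration, the sketch below outlines this pipeline. The helper names (target_predict_label, train_score_attack) and the generic fit/predict_proba interface are hypothetical placeholders, not the paper's implementation:

```python
def transfer_attack(target_predict_label, shadow_x, shadow_model,
                    candidate_x, train_score_attack):
    """Sketch of the transfer attack (all helper names are hypothetical).

    target_predict_label: black-box oracle returning top-1 labels only.
    shadow_x:             shadow dataset from the same distribution.
    shadow_model:         any local classifier exposing fit/predict_proba.
    candidate_x:          samples whose membership we want to infer.
    train_score_attack:   builds a score-based attack from the shadow model.
    """
    # Step 1: relabel the shadow dataset with the target model's predictions.
    shadow_y = target_predict_label(shadow_x)

    # Step 2: train a local shadow model on the relabeled shadow dataset,
    # so that membership information is transferred into it.
    shadow_model.fit(shadow_x, shadow_y)

    # Step 3: run a score-based membership inference attack locally,
    # using the shadow model's confidence scores on the candidate samples.
    score_attack = train_score_attack(shadow_model)
    scores = shadow_model.predict_proba(candidate_x)
    return score_attack(scores)  # per-sample membership scores
```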

Boundary Attack

Collecting data, especially sensitive and private data, is a non-trivial task. We therefore consider a harder and more realistic scenario in which the adversary has neither a shadow dataset nor a shadow model. To compensate for the missing information in this setting, we shift our focus from the target model's output to its input. Our key intuition here is that it is harder to perturb member samples than non-member samples. The adversary queries the target model on candidate data samples and perturbs them until the model's predicted label changes. The adversary then uses the magnitude of the perturbation to distinguish member from non-member samples.

Extensive experimental evaluation shows that both of our attacks achieve strong performance. In particular, our boundary attack even outperforms previous score-based attacks in some cases. Furthermore, we present a new perspective on the success of current membership inference and show that the distance between a sample and an ML model's decision boundary is strongly correlated with the sample's membership status.

Finally, we evaluate our attacks against multiple defense mechanisms: generalization enhancement [46, 50, 54], privacy enhancement [4], and confidence score perturbation [27, 38, 56]. The results show that our attacks can bypass most of these defenses unless heavy regularization is applied; however, heavy regularization leads to a significant degradation of the model's accuracy.

In general, our contributions are as follows:

  • We perform a systematic investigation of membership leakage in label-only exposures of ML models and introduce decision-based membership inference attacks, which are highly relevant to real-world applications and important for gauging model privacy.
  • We propose two types of decision-based attacks for different scenarios, namely the transfer attack and the boundary attack. Extensive experiments demonstrate that both attacks achieve better performance than the baseline attack, and even outperform previous score-based attacks in some cases.
  • We propose a new perspective on the reasons for the success of membership inference, and perform a quantitative and qualitative analysis to demonstrate that members of an ML model are more distant from the model’s decision boundary than non-members.
  • We evaluate multiple defenses against our decision-based attacks and show that our novel attacks can still achieve reasonable performance unless heavy regularization is applied.

4 Boundary Attack

4.1 Key Intuition

The intuition behind our attack stems from the overfitting nature of ML models. More concretely, an ML model outputs higher confidence scores on samples it was trained on; since the model is more confident on member samples, it should be harder to change its output for them.

Figure 6 depicts two randomly selected member samples (Figures 6a, 6c) and non-member samples (Figures 6b, 6d) with respect to M-0 trained on CIFAR-10. We can observe that the top score of the member samples is indeed much higher than that of the non-member samples. We then use cross-entropy (Equation 1) to quantify the difficulty for an ML model to change its predicted label for a data sample to other labels.

Table 2 shows the cross-entropy between the confidence scores and the other labels for these samples. We can observe that the cross-entropy of member samples is significantly larger than that of non-member samples. This leads to the following observation on membership information.
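As a minimal sketch, assuming Equation 1 is the standard cross-entropy between the model's confidence vector and a one-hot encoding of an alternative label (the exact formulation is not reproduced here), this quantity could be computed as follows:

```python
import numpy as np

def cross_entropy_to_label(confidence, other_label, eps=1e-12):
    """Cross-entropy between a confidence vector and a one-hot 'other' label.

    confidence:  model's output probability vector for one sample.
    other_label: index of a label different from the predicted one.
    A larger value means it is harder to push the prediction to that label.
    """
    return -np.log(confidence[other_label] + eps)

# Example: a member-like (confident) vs. non-member-like (flatter) output.
member_conf = np.array([0.97, 0.01, 0.01, 0.01])
nonmember_conf = np.array([0.55, 0.25, 0.15, 0.05])
print(cross_entropy_to_label(member_conf, other_label=1))     # ~4.61, larger
print(cross_entropy_to_label(nonmember_conf, other_label=1))  # ~1.39, smaller
```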

Observation

Given an ML model and a set of data samples, the cost of changing the target model's predicted label is larger for member samples than for non-member samples. Moreover, since a black-box ML model only provides label information, the adversary can only perturb the data samples to change the target model's predicted labels; the perturbation required to change a member sample's label is therefore larger than that for a non-member sample. Consequently, the adversary can decide whether a sample is a member by observing the magnitude of the perturbation.

4.2 Methodology

Our attack consists of the following three steps: decision change, perturbation measurement, and membership inference. The full algorithm can be found in Appendix Algorithm 2.

Decision Change

We use two state-of-the-art black-box adversarial attacks: HopSkipJump [12] and QEBA [30].
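For the decision-change step, a minimal sketch using ART's HopSkipJump implementation might look as follows; the placeholder model, input shape, and parameter values are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
import torch.nn as nn
from art.attacks.evasion import HopSkipJump
from art.estimators.classification import PyTorchClassifier

# Placeholder CIFAR-10-sized model; in practice this wraps the query
# interface to the target model, since the attack only needs predicted labels.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
classifier = PyTorchClassifier(model=model, loss=nn.CrossEntropyLoss(),
                               input_shape=(3, 32, 32), nb_classes=10)

# Untargeted decision change: perturb candidates until the top-1 label flips.
attack = HopSkipJump(classifier=classifier, targeted=False,
                     norm=2, max_iter=50, max_eval=1000)
candidate_samples = np.random.rand(4, 3, 32, 32).astype(np.float32)
x_adv = attack.generate(x=candidate_samples)

# The decision has changed once the predicted labels differ.
changed = (np.argmax(classifier.predict(x_adv), axis=1)
           != np.argmax(classifier.predict(candidate_samples), axis=1))
```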

Perturbation Measurement

Once the model's final output has been changed, we can measure the magnitude of the perturbation added to the candidate input sample. In general, adversarial attack techniques use Lp distances, i.e., L0, L1, L2, and L∞, to measure the perceptual similarity between the original image and the perturbed sample. We therefore adopt the Lp distances to measure the magnitude of the perturbation.
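A minimal sketch of this measurement step, computing the four Lp distances with NumPy (variable names are illustrative):

```python
import numpy as np

def perturbation_magnitudes(x_orig, x_adv):
    """Return the L0, L1, L2, and L-infinity distances between an original
    sample and its perturbed (decision-changed) version."""
    diff = (x_adv - x_orig).ravel()
    return {
        "L0": float(np.count_nonzero(diff)),
        "L1": float(np.abs(diff).sum()),
        "L2": float(np.linalg.norm(diff)),
        "Linf": float(np.abs(diff).max()),
    }
```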

Membership Inference

After obtaining the magnitude of the perturbations, the adversary simply considers a candidate sample with a perturbation larger than a threshold as a member sample, and vice versa. Similar to the transfer attack, we mainly use AUC as our evaluation metric. We also provide a general and simple method for choosing a threshold in Section 4.4.

4.3 Experimental Setup

We use the same experimental setup as presented in Section 3.3, such as the dataset splitting strategy and the 6 target models trained on training sets Dtrain of different sizes. In the decision change stage, we use the implementation of a popular Python library (ART) for HopSkipJump, and the authors' source code for QEBA. Note that we only apply untargeted decision change, i.e., changing the initial decision of the target model to any other random decision. Besides, both HopSkipJump and QEBA require multiple queries to perturb data samples until their predicted labels change. We set 15,000 queries for HopSkipJump and 7,000 for QEBA. We further study the influence of the number of queries on the attack performance. For space reasons, we report the results of the HopSkipJump scheme in the main body of our paper. Results of the QEBA scheme can be found in Appendix Figure 14 and Figure 15.

4.4 Results

Distribution of Perturbation

First, we show the distribution of the perturbation between a perturbed sample and its original version for member and non-member samples in Figure 7. Both HopSkipJump and QEBA use the L2 distance to constrain the size of the perturbation, so we also report results using the L2 distance. As expected, the magnitude of the perturbation for member samples is indeed larger than that for non-member samples. For example, in Figure 7, the average L2 distance is 1.0755 for member samples and 0.1102 for non-member samples. Moreover, models trained on larger training sets, i.e., with a lower level of overfitting, require smaller perturbations to change the final prediction. As the level of overfitting increases, the adversary has to change member samples to a larger extent. The reason is that a more overfitted ML model has memorized its training samples to a larger extent, so it is harder to change their predicted labels (i.e., a larger perturbation is needed).

Attack AUC Performance

We report the AUC scores over all datasets in Figure 8. In particular, we compare 4 different distance metrics, i.e., L0, L1, L2, and L∞, for each decision change scheme. From Figure 8, we can observe that the L1, L2, and L∞ metrics achieve the best performance across all datasets. For instance, in Figure 8 (M-1, CIFAR-10), the AUC scores for the L1, L2, and L∞ metrics are 0.8969, 0.8963, and 0.9033, respectively, while the AUC score for the L0 metric is 0.7405. From Figure 15 (in Appendix), we can observe the same for the QEBA scheme: the L1, L2, and L∞ metrics achieve the best performance across all datasets, while the L0 metric performs the worst. Therefore, an adversary can simply choose the same distance metric adopted by the adversarial attack to measure the magnitude of the perturbation.

Effects of Number of Queries

To mount the boundary attack against real-world ML applications such as Machine Learning as a Service (MLaaS), the adversary cannot issue as many queries as they want to the target model, since a large number of queries increases the cost of the attack and may raise the model provider's suspicion. We therefore measure the attack performance under different numbers of queries. Here, we report the HopSkipJump scheme for M-5 over all datasets. We vary the number of queries from 0 to 15,000 and evaluate the attack performance based on the L2 metric. As shown in Figure 9, the AUC increases quickly with the number of queries at the beginning. After about 2,500 queries, the attack performance becomes stable. From these results, we argue that query limiting would likely not be a suitable defense. For instance, when querying only 131 times, the AUC for CIFAR-10 is 0.8228 and for CIFAR-100 is 0.9266. At this point, although the perturbed sample is still far from its original sample's decision boundary, the magnitude of the perturbation for member samples is still relatively larger than that for non-member samples. Thus, the adversary can still differentiate member and non-member samples.

Threshold Choosing

Here, we focus on choosing a threshold for our boundary attack, where the adversary is not equipped with a shadow dataset. We provide a simple and general method for choosing a threshold. Concretely, we generate a set of random samples in the same feature space as the target model's training set. In the case of image classification, we sample each pixel of an image from a uniform distribution. Next, we treat these randomly generated samples as non-members and query the target model with them. Then, we apply adversarial attack techniques to these random samples to change the labels initially predicted by the target model. Finally, we use these samples' perturbations to estimate a threshold, i.e., we find a suitable top-t percentile over these perturbations. The algorithm can be found in Appendix Algorithm 3.
