[Paper Reproduction 3] Algorithm 2: Clustered sampling based on model similarity

This post continues from the previous one, [Paper Code Reproduction 2] Clustered sampling based on sample size: https://blog.csdn.net/admin11111111/article/details/120817883

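I. Algorithm 2 workflow:

1. Compute a similarity matrix from the clients' gradients, using cosine similarity as the measure.

2. Run hierarchical agglomerative clustering on the similarity matrix, using Ward linkage as the distance between clusters.

3. Build a new weight matrix from the clustering result. The n_sampled largest clusters (10 here; with equal client weights these are the clusters containing the most workers) keep their weights unchanged.

4. The weights of the remaining clusters are redistributed, giving the new weight matrix distri_clusters. It serves as the probability matrix for sampling clients, and the samples held by the sampled subset of clients are used as the training data for the round.

Note: the weights here are not model parameters. They are the probability weights with which clients are sampled.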
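Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the helper name cosine_similarity_matrix and the toy gradients are assumptions, and it assumes the rows of the similarity matrix are passed to SciPy's linkage as observations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def cosine_similarity_matrix(gradients: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `gradients`."""
    norms = np.linalg.norm(gradients, axis=1, keepdims=True)
    unit = gradients / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


# Hypothetical toy data: 6 clients, each with a flattened 4-dimensional gradient.
rng = np.random.default_rng(0)
toy_gradients = rng.normal(size=(6, 4))

# Step 1: similarity matrix of shape (n_clients, n_clients).
sim_matrix = cosine_similarity_matrix(toy_gradients)

# Step 2: Ward linkage, treating each row of the similarity matrix as one
# observation. The result has shape (n_clients - 1, 4).
linkage_matrix = linkage(sim_matrix, method="ward")
```

The resulting linkage_matrix has the layout that get_clusters_with_alg2 expects: one row per merge, with the two merged indices in the first two columns.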

from copy import deepcopy

import numpy as np
from scipy.cluster.hierarchy import fcluster


def get_clusters_with_alg2(
    linkage_matrix: np.array, n_sampled: int, weights: np.array
):
    """Algorithm 2"""
    epsilon = int(10 ** 10)

    # associate each client to a cluster
    link_matrix_p = deepcopy(linkage_matrix)
    augmented_weights = deepcopy(weights)

    for i in range(len(link_matrix_p)):
        idx_1, idx_2 = int(link_matrix_p[i, 0]), int(link_matrix_p[i, 1])

        new_weight = np.array(
            [augmented_weights[idx_1] + augmented_weights[idx_2]]
        )
        augmented_weights = np.concatenate((augmented_weights, new_weight))
        # replace the merge distance with the merged weight;
        # index [0] avoids converting a 1-element array to int
        link_matrix_p[i, 2] = int(new_weight[0] * epsilon)

    clusters = fcluster(
        link_matrix_p, int(epsilon / n_sampled), criterion="distance"
    )

    n_clients, n_clusters = len(clusters), len(set(clusters))

    # Associate each cluster to its number of clients in the cluster
    pop_clusters = np.zeros((n_clusters, 2)).astype(int)
    for i in range(n_clusters):
        pop_clusters[i, 0] = i + 1
        for client in np.where(clusters == i + 1)[0]:
            pop_clusters[i, 1] += int(weights[client] * epsilon * n_sampled)

    pop_clusters = pop_clusters[pop_clusters[:, 1].argsort()]

    distri_clusters = np.zeros((n_sampled, n_clients)).astype(int)

    # n_sampled biggest clusters that will remain unchanged
    kept_clusters = pop_clusters[n_clusters - n_sampled :, 0]

    for idx, cluster in enumerate(kept_clusters):
        for client in np.where(clusters == cluster)[0]:
            distri_clusters[idx, client] = int(
                weights[client] * n_sampled * epsilon
            )

    # spread the clients of the remaining clusters over the distributions
    k = 0
    for j in pop_clusters[: n_clusters - n_sampled, 0]:
        clients_in_j = np.where(clusters == j)[0]
        np.random.shuffle(clients_in_j)

        for client in clients_in_j:
            weight_client = int(weights[client] * epsilon * n_sampled)

            while weight_client > 0:
                sum_proba_in_k = np.sum(distri_clusters[k])
                u_i = min(epsilon - sum_proba_in_k, weight_client)

                distri_clusters[k, client] = u_i
                weight_client += -u_i

                sum_proba_in_k = np.sum(distri_clusters[k])
                if sum_proba_in_k == 1 * epsilon:
                    k += 1

    # normalize each row into a probability distribution
    distri_clusters = distri_clusters.astype(float)
    for l in range(n_sampled):
        distri_clusters[l] /= np.sum(distri_clusters[l])

    return distri_clusters
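To make the role of the returned matrix concrete, here is a small sketch of how one client could be drawn from each row. The function name sample_clients and the toy 2 x 4 matrix are assumptions for illustration; the sketch assumes each row of distri_clusters is one sampling distribution over all clients.

```python
import numpy as np


def sample_clients(
    distri_clusters: np.ndarray, rng: np.random.Generator
) -> np.ndarray:
    """Draw one client index from each row (one row per sampling distribution)."""
    n_clients = distri_clusters.shape[1]
    return np.array([rng.choice(n_clients, p=row) for row in distri_clusters])


# Hypothetical matrix: 2 clients are drawn per round out of 4 clients.
# Each row sums to 1, and each column sums to n_sampled * weight = 2 * 0.25.
toy_distri = np.array(
    [
        [0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5],
    ]
)

sampled = sample_clients(toy_distri, np.random.default_rng(0))
```

With this toy matrix the first draw always comes from clients 0 and 1 and the second from clients 2 and 3, so the two groups are each represented exactly once per round.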

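The previous post noted: "If a malicious attacker who actually holds very little data claims to hold a lot, the system can break down. In other words, a large reported data size sharply raises the sampling probability, which then skews the gradient computation."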
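A quick worked example shows how strongly an inflated sample count can shift the sampling weight. The numbers are hypothetical, and the sketch assumes each client's weight is its reported sample count divided by the total reported count.

```python
def sampling_weights(reported_sizes):
    """Weight of each client = its reported sample count / total reported count."""
    total = sum(reported_sizes)
    return [n / total for n in reported_sizes]


# Hypothetical setup: 100 clients with 100 samples each.
honest = [100] * 100

# Client 0 claims 10,000 samples instead of 100.
inflated = [10_000] + [100] * 99

w_honest = sampling_weights(honest)[0]      # 100 / 10000 = 0.01
w_inflated = sampling_weights(inflated)[0]  # 10000 / 19900, about 0.50
```

Under these assumptions the attacker's weight rises from 1% to roughly 50% just by misreporting its data size.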

II. Attack on the sample size

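So I also set the clients' sampling weights by hand for Algorithm 2, but the effect was not noticeable. The reason is that when the gradients are hierarchically clustered by similarity, the distances between clusters are small, so the clustering produces the same number of clusters after the weights are set as before (when all clients have equal weight). In other words, merely changing the reported amount of data does little to change the gradients here.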

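The results after setting the weights by hand are shown in the figure below; they are essentially the same as the result figure given in the paper.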

So I moved on to manipulating the gradients directly.

III. Attack on the gradients

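Modifying the gradients on the server side does affect Algorithm 2. But a modification made on the server no longer counts as a malicious attacker among the workers; it only means that the system itself has failed, in which case there is no point in testing its robustness against worker attacks.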

On the other hand, it does show that directly modifying the gradients changes the outcome.

The figure clearly shows that modifying the gradients makes Algorithm 2 partly break down: it no longer approaches the target model.

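Moreover, directly modifying the representative gradients affects the model even when only one of them is modified. The figure below shows the effect of attacking a single representative gradient: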

        # UPDATE THE HISTORY OF LATEST GRADIENT
        if sampling == "clustered_2":
            gradients_i = get_gradients(
                sampling, previous_global_model, clients_models
            )
            for idx, gradient in zip(sampled_clients_for_grad, gradients_i):
                if idx == sampled_clients_for_grad[0]:
                    gradients[idx] = backward(np.array(gradient))
                else:
                    gradients[idx] = gradient

This snippet attacks a single representative gradient directly.
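The post does not show what backward() does. Purely as an illustration of a gradient-tampering function, and not as the function used in these experiments, the hypothetical flip_gradient below negates every layer of a representative gradient, which is one simple way to push an update in the wrong direction.

```python
import numpy as np


def flip_gradient(layer_grads, scale=1.0):
    """Hypothetical attack: negate (and optionally rescale) every layer's update."""
    return [-scale * np.asarray(g) for g in layer_grads]


# Toy "representative gradient" made of two layers with different shapes.
toy_grad = [np.ones((2, 3)), np.array([0.5, -0.5])]

attacked = flip_gradient(toy_grad)
```

Working layer by layer keeps the sketch valid when layers have different shapes, since such a list cannot be stacked into one regular array.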

Interestingly, attacking the gradients inside the get_gradients() function has no effect. The result is shown in the figure below:

The code is as follows:

def get_gradients(sampling, global_m, local_models):
    """return the `representative gradient` formed by the difference between
    the local work and the sent global model"""

    local_model_params = []
    for model in local_models:
        local_model_params += [
            [tens.detach().cpu().numpy() for tens in list(model.parameters())]
        ]  # 20211014

    global_model_params = [
        tens.detach().cpu().numpy() for tens in list(global_m.parameters())
    ]  # 20211014

    local_model_grads = []
    # lst = list(range(100))
    # lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    # This always attacks the first 10 clients, but nothing guarantees that
    # those 10 clients end up in the same cluster. Representative client
    # gradients are chosen by cluster size, so a client in a small cluster is
    # not taken as a representative gradient and is not sampled, and the
    # gradient change then has no effect.
    for id, local_params in enumerate(local_model_params):
        # if id in lst:
        #     local_model_grads += [[backward(local_weights - global_weights) for local_weights, global_weights in zip(local_params, global_model_params)]]
        # else:
        local_model_grads += [
            [
                local_weights - global_weights
                for local_weights, global_weights in zip(
                    local_params, global_model_params
                )
            ]
        ]
    # local_model_grads = backward(np.array(local_model_grads))
    return local_model_grads

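After closer analysis, the likely reason is that the attack always targets the first 10 clients. Unless those 10 clients are all assigned to the same cluster, it is possible that none of them is sampled, because each round draws 10 clients according to cluster size (the number of clients in a cluster). That is, if the first 10 clients are all scattered outside the 10 largest clusters, every sampled client may be an unattacked one; if they sit in a single cluster, that cluster has size 10 and is certain to be sampled.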
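The argument above can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions: one client is drawn independently from each row of the distribution matrix, and the function name prob_any_attacked_sampled and both matrices are hypothetical.

```python
import numpy as np


def prob_any_attacked_sampled(distri_clusters, attacked):
    """P(at least one attacked client is drawn), one independent draw per row."""
    p_attacked_per_row = distri_clusters[:, attacked].sum(axis=1)
    return 1.0 - np.prod(1.0 - p_attacked_per_row)


# Hypothetical setup: 4 clients, 2 draws per round, clients 0 and 1 attacked.
attacked = [0, 1]

# Case 1: both attacked clients fill one distribution by themselves.
together = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.5, 0.5]])

# Case 2: the attacked clients are spread over both distributions.
spread = np.array([[0.5, 0.0, 0.5, 0.0],
                   [0.0, 0.5, 0.0, 0.5]])

p_together = prob_any_attacked_sampled(together, attacked)  # 1 - 0 * 1 = 1.0
p_spread = prob_any_attacked_sampled(spread, attacked)      # 1 - 0.5 * 0.5 = 0.75
```

When the attacked clients share a distribution of their own, one of them is drawn every round; once they are spread out, there is a nonzero chance that a round contains no attacked client at all.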

Next week I will try making the modification directly in local_learning, since only that would count as a genuine attack on the worker side.