【KDD 2020】Local Community Detection in Multiple Networks


ABSTRACT

Local community detection aims to find a set of densely connected nodes containing given query nodes. Most existing local community detection methods are designed for a single network. However, a single network can be noisy and incomplete. Multiple networks are more informative in real-world applications: they can contain multiple types of nodes and multiple types of node proximities, and complementary information from different networks helps to improve detection accuracy. In this paper, we propose a novel RWM (Random Walk in Multiple networks) model to find relevant local communities in all networks for a given query node set from one network. RWM sends a random walker into each network to obtain the local proximity w.r.t. the query nodes (i.e., node visiting probabilities). Walkers with similar visiting probabilities reinforce each other. They restrict the probability propagation to the neighborhood of the query nodes, identifying relevant subgraphs in each network and disregarding irrelevant parts. We provide rigorous theoretical foundations for RWM and develop two speed-up strategies with performance guarantees. Comprehensive experiments on synthetic and real-world datasets evaluate the effectiveness and efficiency of RWM.

INTRODUCTION

As a fundamental task in large network analysis, local community detection has attracted extensive attention recently. Unlike time-consuming global community detection, local community detection aims to detect a set of densely connected nodes (i.e., the target local community) that contains a given query node or query node set [1, 2, 4, 6, 7, 15, 17, 29, 33].

Most existing local community detection methods are designed for a single network. However, a single network may contain noisy edges and miss some connections, which results in unsatisfactory performance. On the other hand, in real-life applications, more informative multiple network structures are often constructed from different sources or domains [18, 20, 22, 25, 28].

There are two main kinds of multiple networks. Fig. 1 shows an example of multiple networks with the same node set and different types of edges (called multiplex networks [14, 21] or multi-layer networks). Each node represents an employee of a university [13]. The three networks reflect different relationships between employees: co-workers, lunch-together, and Facebook-friends. Notice that the two off-line networks (i.e., co-workers and lunch-together) share more similar connections with each other than the off-line networks share with the online one (i.e., Facebook-friends).

Fig. 2 is another example, taken from the DBLP dataset, with multiple domains of nodes and edges (called multi-domain networks [22]). The left part is the author-collaboration network and the right part is the paper-citation network. A cross-edge connecting an author and a paper indicates that the author wrote the paper. We see that authors and papers in the same research field may have dense cross-links. Suppose that we are interested in one author and want to find both the relevant author community and the paper community, which includes papers in the research field of the query author. Then cross-edges can provide complementary information from the query (author) domain to other domains (paper) and vice versa. Detecting relevant local communities in the two domains can enhance each other.
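To make the role of cross-edges concrete, the following minimal sketch pushes visiting probability from the author domain into the paper domain along author-paper links. The function name `transit_probability`, the toy data, and the even split of an author's probability among their papers are assumptions made for illustration; the text above does not specify how cross-edges are weighted.

```python
def transit_probability(cross_edges, source_probs, num_targets):
    """Push visiting probability from one domain into another along cross-edges.

    cross_edges maps a source node index to the list of target node indices it
    is linked to. Each source node splits its probability evenly among its
    links (an assumption of this sketch). Probability held by a source node
    without any cross-edge is not carried over.
    """
    target_probs = [0.0] * num_targets
    for source, prob in enumerate(source_probs):
        links = cross_edges.get(source, [])
        for target in links:
            target_probs[target] += prob / len(links)
    return target_probs


# Hypothetical toy data: 3 authors, 4 papers.
wrote = {0: [0, 1], 1: [1, 2], 2: [3]}
author_probs = [1.0, 0.0, 0.0]  # all probability sits on query author 0
paper_probs = transit_probability(wrote, author_probs, num_targets=4)
print(paper_probs)  # [0.5, 0.5, 0.0, 0.0]
```

Starting from a single query author, the probability lands only on that author's papers, which gives a walker in the paper domain a query-relevant place to start.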

Note that multiplex networks are a special case of multi-domain networks in which all networks share the same node set and the cross-network relations are one-to-one mappings. Few works discuss local community detection in such comprehensive but complicated multiple networks. In particular, no existing work can find query-relevant local communities in all domains of the more general multi-domain networks.

In this paper, we focus on local community detection in multiple networks. Given a query node in one network, our task is to detect a query-relevant local community in each network. A straightforward method is to find the local community in each network independently. However, this baseline only works for the special case of multiplex networks and cannot be generalized to multi-domain networks. In addition, this simple approach does not consider the complementary information in multiple networks (e.g., similar local structures among the three relations in Fig. 1).

In this paper, we propose a random walk based method, RWM (Random Walk in Multiple networks), to find query-relevant local communities in multiple networks simultaneously. The key idea is to integrate complementary influences from multiple networks. We send out a random walker in each network to explore the network topology based on the corresponding transition probabilities. For networks containing the query node, the probability vectors of the walkers are initialized with the query node. For other networks, probability vectors are initialized by transiting visiting probability through the cross-connections among networks. The probability vectors of the walkers are then updated step by step. The vectors represent the walkers' visiting histories. Intuitively, if two walkers share similar or relevant visiting histories (measured by cosine similarity after transiting through cross-connections), the visited local topologies in the corresponding networks are relevant, and the two walkers should reinforce each other with highly weighted influences. On the other hand, if the local topologies visited by two walkers are less relevant, smaller weights should be assigned. We update each walker's visiting probability vector by aggregating the other walkers' influences at each step. In this way, the transition probabilities of each network are modified dynamically. Compared with traditional random walk models, where the transition matrix is time-independent [2, 15, 17, 30, 31], RWM can confine most of the visiting probability to the most query-relevant subgraph in each network and ignore irrelevant or noisy parts.
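The mechanism described above can be illustrated with a simplified sketch for the multiplex case, where all networks share one node set, so the cross-connections reduce to the identity mapping. It is not the authors' exact update rule: the function name `rwm_sketch`, the restart probability `restart`, the decay factor `lam`, and the rule that accumulates cosine similarities into the walker weights are all assumptions of this sketch. Every node is assumed to have at least one edge in every network.

```python
import math


def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    return adj


def column_stochastic(adj):
    """Normalize each column so that it sums to 1 (assumes no isolated node)."""
    n = len(adj)
    sums = [sum(adj[r][c] for r in range(n)) for c in range(n)]
    return [[adj[r][c] / sums[c] for c in range(n)] for r in range(n)]


def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rwm_sketch(adjs, query, restart=0.5, lam=0.7, steps=30):
    """Run one walker per network; walkers with similar vectors reinforce each other."""
    k, n = len(adjs), len(adjs[0])
    trans = [column_stochastic(a) for a in adjs]
    seed = [1.0 if v == query else 0.0 for v in range(n)]
    xs = [seed[:] for _ in range(k)]
    # weights[i][j]: influence of network j on walker i; starts as the identity,
    # so the first step of each walker follows its own network only.
    weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for t in range(1, steps + 1):
        new_xs = []
        for i in range(k):
            row_sum = sum(weights[i])
            mixed = [0.0] * n
            for j in range(k):
                share = weights[i][j] / row_sum
                moved = mat_vec(trans[j], xs[i])  # walker i steps along network j
                mixed = [m + share * s for m, s in zip(mixed, moved)]
            new_xs.append([(1 - restart) * m + restart * s
                           for m, s in zip(mixed, seed)])
        xs = new_xs
        # Similar visiting histories earn larger weights; lam ** t damps later steps.
        for i in range(k):
            for j in range(k):
                weights[i][j] += (lam ** t) * cosine(xs[i], xs[j])
    return xs


# Hypothetical toy data: networks a and b share two triangles {0,1,2} and {3,4,5};
# network c is a ring that ignores this structure.
net_a = adjacency(6, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
net_b = adjacency(6, [(0, 1), (0, 2), (1, 2), (1, 4), (3, 4), (3, 5), (4, 5)])
net_c = adjacency(6, [(0, 3), (3, 1), (1, 4), (4, 2), (2, 5), (5, 0)])
walks = rwm_sketch([net_a, net_b, net_c], query=0)
top3_in_a = sorted(range(6), key=lambda v: -walks[0][v])[:3]
print(sorted(top3_in_a))
```

Because the mixing weights change at every step, the effective transition matrix of each walker is time-dependent, which is the property that distinguishes this scheme from a standard random walk with restart.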

Theoretically, we provide rigorous analyses of the convergence properties of the proposed model. We also develop two speed-up strategies. We conduct extensive experiments on real and synthetic datasets to demonstrate the advantages of our RWM model in effectiveness and efficiency for local community detection in multiple networks.

Our contributions are summarized as follows.

• We propose a novel random walk model, RWM, to detect local communities in multiple networks for a given query node set. In particular, to the best of our knowledge, this is the first work that can detect all query-relevant local communities in all networks for general multi-domain networks.

• Fast estimation strategies and sound theoretical foundations are provided to guarantee the effectiveness and efficiency of RWM.

• Results of comprehensive experiments on synthetic and real multiple networks verify the advantages of RWM.

RELATED WORK

Local community detection is one of the cutting-edge problems in graph mining. For a single network, local search based methods [23, 28], flow-based methods [29], and subgraph cohesiveness-optimization based methods, such as k-clique [24, 36], k-truss [9, 11], and k-core [3], have been developed to detect local communities. A recent survey of local community detection and community search can be found in [10].

In addition, random walk based methods have been routinely applied to detect local communities in a single network [2, 5, 8, 15, 32, 34, 35]. A walker explores the network following the topological transitions, and the node visiting probability is usually used to determine the detection results. For example, PRN [2] runs a lazy random walk to update the visiting probability and sweeps the ranking list to find the local community with the minimum conductance. The heat kernel method applies heat kernel coefficients as the decay factor to find small and dense communities [15]. In [35], the authors propose a motif-based random walk model and obtain node sets with the minimal motif conductance. MWC [5] sends multiple walkers to explore the network to alleviate the query-bias issue. Note that all the aforementioned methods are designed for a single network.
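The walk-then-sweep pattern mentioned for PRN can be sketched as follows. This is a minimal illustration rather than the algorithm of [2]: a plain power iteration of a lazy random walk with restart stands in for whatever approximation the original method uses, and the function names, the restart probability `restart`, and the toy graph are assumptions. Nodes are ranked by degree-normalized visiting probability, and the prefix of the ranking with the smallest conductance is returned.

```python
def lazy_walk_with_restart(graph, query, restart=0.15, steps=100):
    """graph: dict mapping each node to the set of its neighbors (undirected)."""
    probs = {v: 0.0 for v in graph}
    probs[query] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in graph}
        for u, p in probs.items():
            nxt[u] += (1 - restart) * 0.5 * p  # lazy: stay put half the time
            share = (1 - restart) * 0.5 * p / len(graph[u])
            for v in graph[u]:
                nxt[v] += share                # otherwise move to a neighbor
        nxt[query] += restart                  # restart mass returns to the query
        probs = nxt
    return probs


def conductance(graph, members):
    """Cut edges divided by the smaller of the two side volumes."""
    members = set(members)
    cut = sum(1 for u in members for v in graph[u] if v not in members)
    vol_in = sum(len(graph[u]) for u in members)
    vol_out = sum(len(graph[u]) for u in graph) - vol_in
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else float("inf")


def sweep_cut(graph, probs):
    """Scan prefixes of the degree-normalized ranking; keep the best one."""
    order = sorted((v for v in graph if probs[v] > 0),
                   key=lambda v: probs[v] / len(graph[v]), reverse=True)
    best_set, best_phi = set(), float("inf")
    for size in range(1, len(order) + 1):
        phi = conductance(graph, order[:size])
        if phi < best_phi:
            best_set, best_phi = set(order[:size]), phi
    return best_set, best_phi


# Hypothetical toy graph: two triangles joined by the bridge edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, phi = sweep_cut(toy, lazy_walk_with_restart(toy, query=0))
print(community, phi)
```

On the toy graph the sweep stops at the triangle containing the query node, since cutting the single bridge edge gives the lowest conductance among all prefixes. Recomputing the conductance for each prefix is quadratic and is kept here only for readability.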

Some methods have been proposed for global community detection in multiple networks. An overview can be found in [13]. However, they focus on detecting all communities, which is time-consuming and independent of any given query node. Very little work aims to find query-oriented local communities in multiple networks. For the special case of multiplex networks, a greedy algorithm has been developed to find a community shared by all networks [12]. Similarly, a modified random walk model is proposed in [8] for multiplex networks. For general multi-domain networks, the method in [34] can only detect one local community, in the query network domain. More importantly, these methods assume that all networks share similar or consistent structures. Our method, RWM, does not make such an assumption. RWM can automatically identify relevant local structures and ignore irrelevant ones in all networks.

RANDOM WALK IN MULTIPLE NETWORKS

LOCAL COMMUNITY DETECTION WITH RESTART STRATEGY