[Data] Community Detection Datasets

Contents
Link-based datasets
Datasets with links and discrete attributes
Datasets with links and text attributes
Other common dataset collections
Datasets collected by Mark Newman
Social and Information Network Analysis
Link-based datasets
Zachary karate club

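The Zachary network is a social network built from observations of a karate club at an American university. It contains 34 nodes and 78 edges; nodes represent club members, and edges represent friendships between members. The karate club network has become a classic benchmark for detecting community structure in complex networks [1]. [Download link]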

American College football

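The College Football network is a social network that Newman built from the American college football season. It contains 115 nodes and 616 edges; nodes represent football teams, and an edge between two nodes means the two teams played a game against each other. The 115 participating college teams are divided into 12 conferences. Teams first play games within their own conference and then games against teams from other conferences, so more games are played within conferences than between them. The conferences can therefore be taken as the ground-truth community structure of the network [2]. [Download link]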
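Because the conferences act as ground truth, a detected partition of this network can be scored against them. The sketch below computes normalized mutual information (NMI) in plain Python. NMI is a metric picked here for illustration (the post itself does not prescribe one), and the function name `nmi` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """NMI between two labelings, normalized by the mean of the two entropies."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true, h_pred = entropy(count_true), entropy(count_pred)
    # I(T;P) = sum over cells of p(t,p) * log(p(t,p) / (p(t) * p(p)))
    mutual_info = sum(
        (c / n) * math.log((c * n) / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    if h_true == 0 and h_pred == 0:
        return 1.0  # both labelings place every node in a single group
    return 2 * mutual_info / (h_true + h_pred)

# Toy example: six teams, two conferences, and a detected partition
# that matches the conferences up to a renaming of the labels.
conferences = [0, 0, 0, 1, 1, 1]
detected = ["a", "a", "a", "b", "b", "b"]
print(nmi(conferences, detected))
```

The score does not depend on how the communities are named, only on how the nodes are grouped, which is why the renamed partition above is treated as a perfect match.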

Dolphin social network

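The Dolphin dataset is a dolphin social network that D. Lusseau et al. built by observing, over seven years, the interactions among a group of 62 dolphins in Doubtful Sound, New Zealand. The network has 62 nodes and 159 edges. Nodes represent dolphins, and edges represent frequent association between dolphins [3]. [Download link]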

netscience dataset

Netscience is a coauthorship network of scientists working on network theory and experiment. The dataset contains all components of the network, for a total of 1589 scientists [12]. [Download link, access code: 4bfc]

Datasets with links and discrete attributes
Political blogs

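This dataset was compiled by Lada Adamic in 2005 and records the political leaning of blogs. It contains 1490 nodes and 19090 edges. Each node carries one attribute (0 or 1) indicating whether the blog is liberal or conservative [4]. [Download link]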
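With a 0/1 label on every node, one quick check on a dataset like this is how often a link joins two blogs with the same label. The following is a minimal sketch; the function name, the node names, and the toy edges are made up for illustration and are not taken from the dataset.

```python
def same_label_edge_fraction(edges, label):
    """Fraction of edges whose two endpoints carry the same attribute value."""
    if not edges:
        raise ValueError("edge list is empty")
    same = sum(1 for u, v in edges if label[u] == label[v])
    return same / len(edges)

# Hypothetical toy graph: four blogs, two per label.
label = {"b1": 0, "b2": 0, "b3": 1, "b4": 1}
edges = [("b1", "b2"), ("b3", "b4"), ("b2", "b3"), ("b1", "b3")]
print(same_label_edge_fraction(edges, label))
```

A value close to 1 would suggest that the attribute lines up with the link structure, which is the situation attributed-graph clustering methods try to exploit.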

DBLP Dataset

DBLP (Digital Bibliography & Library Project) is a computer science bibliography. In this dataset, authors are treated as users, the titles of an author's papers form that user's text, and coauthorship relationships form the links between users.
Monthly updated DBLP data: [Data link]
Preprocessed DBLP dataset: [Data link]
DBLP dataset: [Usage notes 1, Usage notes 2]

DBLP-10K
DBLP Bibliography data from four research areas: database (DB), data mining (DM), information retrieval (IR) and artificial intelligence (AI). We build a coauthor graph with top 5,000 authors and their coauthor relationships. In addition, we use two relevant attributes: prolific and primary topic. For attribute “prolific”, authors with ≥ 20 papers are labeled as highly prolific; authors with ≥ 10 and < 20 papers are labeled as prolific and authors with < 10 papers are labeled as low prolific. For attribute “primary topic”, we use a topic modeling approach (PLSA) to extract 100 research topics from a document collection composed of paper titles from the selected authors. Each extracted topic consists of a probability distribution of keywords which are most representative of the topic. Then each author will have one out of 100 topics as his/her primary topic [5]. [Download link, access code: 0674]
DBLP-1K, DBLP-5K
These two datasets are built by selecting the top 1,000 and top 5,000 authors directly from the DBLP-10K dataset. For DBLP-5K, see [6].
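The "prolific" attribute described for DBLP-10K is a simple thresholding of each author's paper count. A minimal sketch of that rule follows; the function name `prolific_label` is hypothetical, and the thresholds are the ones quoted in the description.

```python
def prolific_label(paper_count):
    """Map an author's paper count to the three-level 'prolific' attribute."""
    if paper_count < 0:
        raise ValueError("paper_count must be non-negative")
    if paper_count >= 20:
        return "highly prolific"
    if paper_count >= 10:
        return "prolific"
    return "low prolific"

for count in (25, 12, 3):
    print(count, prolific_label(count))
```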
Facebook Friendship Datasets

The datasets contain the Facebook networks (from a date in Sept. 2005) of these colleges: Caltech, Princeton, Georgetown and UNC Chapel Hill. The links represent friendships on Facebook. Each user has the following attributes: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory (house), year and high school [10]. [Download link, access code: 264c]

Datasets with links and text attributes
Enron Email Dataset

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders [7]. [Download link]
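To use a mail corpus like this for community detection, the messages first have to be turned into links. The sketch below assumes each message is stored as plain text with standard `From`, `To` and `Cc` headers (an assumption about the file layout, not something stated above) and extracts directed sender-to-recipient pairs with Python's standard `email` module. The sample message and the function name are made up.

```python
from email import message_from_string
from email.utils import getaddresses

def sender_recipient_edges(raw_message):
    """Return (sender, recipient) address pairs found in one raw message."""
    msg = message_from_string(raw_message)
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr for _, addr in getaddresses(recipient_headers)]
    return [(s.lower(), r.lower()) for s in senders for r in recipients if s and r]

# Hypothetical message, not taken from the corpus.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Schedule\n"
    "\n"
    "See attached.\n"
)
print(sender_recipient_edges(raw))
```

The message body and subject, which this sketch ignores, would supply the text attributes of each node or link.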

Enron Mail subset
A subset of about 1700 labeled email messages (4.5M). These were chosen in a semi-motivated fashion (focusing on business-related emails and the California Energy Crisis and on emails that occurred later in the collection, trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, nor generality are made about these labelings. The subset is divided into 11 classes following the [category scheme]. [Download link]
March 2005 version of the [Enron mail dataset]
CiteSeer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details [8]. [Download link]

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details [8]. [Download link]
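CiteSeer and Cora share the same shape: one 0/1 word vector and one class label per publication, plus a list of citation links. The sketch below assumes a tab-separated layout with one content row per paper (`id`, word flags, class label) and one link row per citation (`cited id`, `citing id`). That layout is an assumption for illustration; the README shipped with the download is the authoritative description. The toy rows and function names are made up.

```python
def parse_content(lines):
    """Return {paper_id: (word_vector, class_label)} from tab-separated rows."""
    papers = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        paper_id, *flags, label = fields
        papers[paper_id] = ([int(flag) for flag in flags], label)
    return papers

def parse_cites(lines, known_ids):
    """Return (citing, cited) pairs, dropping links to papers without content."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        cited, citing = fields  # assumed column order: cited paper first
        if cited in known_ids and citing in known_ids:
            edges.append((citing, cited))
    return edges

# Hypothetical toy data: two papers, a three-word dictionary.
content = ["p1\t0\t1\t1\tNeural_Networks\n", "p2\t1\t0\t0\tTheory\n"]
cites = ["p1\tp2\n", "p1\tp9\n"]  # p9 has no content row and is dropped

papers = parse_content(content)
edges = parse_cites(cites, papers)
print(papers)
print(edges)
```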

WebKB

The WebKB dataset consists of 877 web pages classified into one of five classes. The hyperlink network consists of 1608 links. Each page in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details [9]. [Download link]

Terrorists

Terrorist Attacks

This dataset consists of 1293 terrorist attacks, each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. [Download link]

Flickr dataset

The Flickr image sharing network consists of nodes which represent Flickr users, and edges indicate follow relations between users. We use tags of images uploaded by a given user as her attributes. In this network, the ground-truth communities are defined as user-created interest-based groups that have more than five members. [Download link, access code: ffdb]
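The ground-truth rule above (keep only groups with more than five members) is easy to express as a filter. This is a minimal sketch on made-up group data; the function name and the input shape (a mapping from group name to member list) are assumptions for illustration.

```python
def ground_truth_communities(groups, min_size_exclusive=5):
    """Keep groups that have more than `min_size_exclusive` distinct members."""
    communities = {}
    for name, members in groups.items():
        unique_members = set(members)
        if len(unique_members) > min_size_exclusive:
            communities[name] = unique_members
    return communities

# Hypothetical groups: only "landscapes" has more than five distinct members.
groups = {
    "landscapes": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "macro": ["u1", "u2", "u3", "u4", "u5"],
    "street": ["u7", "u7", "u8", "u9", "u10", "u11"],  # u7 listed twice
}
print(sorted(ground_truth_communities(groups)))
```

Since a user can join several groups, communities defined this way may overlap.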

Other common dataset collections

Stanford Large Network Dataset Collection
Social networks
online social networks, edges represent interactions between people
Networks with ground-truth communities
ground-truth network communities in social and information networks
Communication networks
email communication networks with edges representing communication
Citation networks
nodes represent papers, edges represent citations
Collaboration networks
nodes represent scientists, edges represent collaborations (co-authoring a paper)
Web graphs
nodes represent webpages and edges are hyperlinks
Amazon networks
nodes represent products and edges link commonly co-purchased products
Internet networks
nodes represent computers and edges communication
Road networks
nodes represent intersections and edges roads connecting the intersections
Autonomous systems
graphs of the internet
Signed networks
networks with positive and negative edges (friend/foe, trust/distrust)
Location-based online social networks
Social networks with geographic check-ins
Wikipedia networks and metadata
Talk, editing and voting data from Wikipedia
Twitter and Memetracker
Memetracker phrases, links and 467 million Tweets
Online communities
Data from online communities such as Reddit and Flickr
Online reviews
Data from online review systems such as BeerAdvocate and Amazon

Datasets collected by Mark Newman
Introduction and related community detection algorithms: http://www-personal.umich.edu/~mejn/
Datasets: http://www-personal.umich.edu/~mejn/netdata/
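Network files from collections like this one are often distributed in GML format, with `node [ id ... ]` and `edge [ source ... target ... ]` blocks. Assuming that layout, the sketch below pulls node ids and undirected edges out of GML text with regular expressions. It is not a full GML parser (it expects `source` to come directly before `target` and ignores all other fields), so a library reader such as networkx's `read_gml` is the more robust choice when available. The sample text and function name are made up.

```python
import re

NODE_RE = re.compile(r"node\s*\[\s*id\s+(-?\d+)")
EDGE_RE = re.compile(r"edge\s*\[\s*source\s+(-?\d+)\s+target\s+(-?\d+)")

def read_simple_gml(text):
    """Return (nodes, edges) with edges stored as sorted, deduplicated pairs."""
    nodes = {int(node_id) for node_id in NODE_RE.findall(text)}
    edges = set()
    for source, target in EDGE_RE.findall(text):
        u, v = int(source), int(target)
        nodes.update((u, v))
        if u != v:  # drop self-loops
            edges.add((min(u, v), max(u, v)))
    return nodes, edges

# Hypothetical three-node graph; the last edge repeats 2-3 in reverse order.
sample = """
graph [
  node [ id 1 ]
  node [ id 2 ]
  node [ id 3 ]
  edge [ source 1 target 2 ]
  edge [ source 2 target 3 ]
  edge [ source 3 target 2 ]
]
"""
nodes, edges = read_simple_gml(sample)
print(len(nodes), len(edges))
```

Comparing the resulting node and edge counts with the figures quoted for each dataset is a cheap way to confirm that a file was read correctly.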

Social and Information Network Analysis
KDD Cup Dataset
http://www.cs.cornell.edu/projects/kddcup/datasets.html
Stack Overflow Data
http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
Youtube dataset
YouTube videos are nodes. An edge a->b means video b is in the related video list (first 20 only) of video a.
http://netsg.cs.sfu.ca/youtubedata/
Amazon Data
The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes). http://snap.stanford.edu/data/amazon-meta.html
[1]: W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977)
[2]: M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
[3]: D. Lusseau, K. Schneider, O. J. Boisseau et al. The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-Lasting Associations. Behavioral Ecology and Sociobiology, 2003, 54(4): 396-405.
[4]: L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005)
[5]: Zhou Y, Cheng H, Yu J X. Clustering large attributed graphs: An efficient incremental approach[C]//Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010: 689-698.
[6]: Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718-729.
[7]: Klimt B, Yang Y. Introducing the Enron Corpus[C]//CEAS. 2004.
[8]: Yang T, Jin R, Chi Y, et al. Combining link and content for community detection: a discriminative approach[C]//Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009: 927-936.
[9]: Lu Q, Getoor L. Link-based classification[C]//ICML. 2003, 3: 496-503.
[10]: Dang T A, Viennet E. Community detection based on structural and attribute similarities[C]//International Conference on Digital Society (ICDS). 2012: 7-12.
[11]: Xirong Li, Cees G.M. Snoek, and Marcel Worring, Learning Social Tag Relevance by Neighbor Voting, in IEEE Transactions on Multimedia (T-MM), 2009
[12]: Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical review E, 2006, 74(3): 036104.

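Copyright notice: This is an original article by CSDN blogger "wzgang123", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/wzgang123/article/details/51089521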
