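Community Detection Datasets
Contents
Link-based datasets
Datasets with links and discrete attributes
Datasets with links and text attributes
Other common dataset collections
Datasets collected by Mark Newman
Social and Information Network Analysis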
Link-based datasets
Zachary karate club

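The Zachary network is a social network built by observing a karate club at a US university. It contains 34 nodes and 78 edges; nodes represent club members and edges represent friendships between members. The karate club network has become a classic benchmark for detecting community structure in complex networks [1]. [Download link]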
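
After downloading a copy, a quick sanity check is to confirm that the node and edge counts match the figures quoted above (34 and 78). The sketch below assumes the file is a plain whitespace-separated edge list with one edge per line; the actual download may use another format such as GML, and the names `load_edge_list` and `graph_size` are illustrative only. The sample lines are toy data, not the real karate club edges.

```python
from collections import defaultdict


def load_edge_list(lines):
    """Build an undirected adjacency map from 'u v' lines.

    Blank lines and lines starting with '#' are skipped; self-loops are
    ignored and duplicate edges collapse because neighbours are stored in sets.
    """
    adj = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        if u == v:
            continue
        adj[u].add(v)
        adj[v].add(u)
    return adj


def graph_size(adj):
    """Return (number of nodes, number of undirected edges)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return n, m


# Toy input (assumption: not the real dataset); the last edge is a duplicate.
toy_lines = ["# toy edge list", "1 2", "1 3", "2 3", "3 4", "3 4"]
toy_adj = load_edge_list(toy_lines)
print(graph_size(toy_adj))  # (4, 4)
```

For a real file, passing `open(path)` instead of `toy_lines` should work as long as the format assumption holds.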

American College football

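College Football network. A social network that Newman built from the schedule of American college football games. It contains 115 nodes and 616 edges; nodes represent football teams, and an edge between two nodes means the two teams played a game against each other. The 115 college teams are divided into 12 conferences. Teams first play games within their own conference and then play teams from other conferences, so games between teams in the same conference are more frequent than games between teams in different conferences. The conferences therefore serve as the ground-truth community structure of the network [2]. [Download link]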
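
The claim that intra-conference games dominate can be checked by measuring the fraction of edges whose endpoints share a community label. The following is a minimal sketch on toy data; the function name `intra_edge_fraction` and the team and conference labels are hypothetical and do not come from the dataset.

```python
def intra_edge_fraction(edges, community):
    """Fraction of edges whose two endpoints carry the same community label."""
    edges = list(edges)
    if not edges:
        return 0.0
    intra = sum(1 for u, v in edges if community[u] == community[v])
    return intra / len(edges)


# Toy example (assumption): five teams in two conferences.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"), ("C", "D")]
toy_conference = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1}

print(intra_edge_fraction(toy_edges, toy_conference))  # 0.8
```

A value well above the fraction expected under random labels suggests that the labels are a reasonable ground truth for community detection.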

Dolphin social network

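The Dolphin dataset is a dolphin social network that D. Lusseau et al. obtained by observing, over a period of 7 years, the interactions among a group of 62 dolphins in Doubtful Sound, New Zealand. The network has 62 nodes and 159 edges. Nodes represent dolphins and edges represent frequent contact between dolphins [3]. [Download link]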

netscience dataset

Netscience is a coauthorship network of scientists working on network theory and experiment. The dataset contains all components of the network, for a total of 1589 scientists [12]. [Download link, access code: 4bfc]

Datasets with links and discrete attributes
Political blogs

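This dataset was compiled by Lada Adamic in 2005 and records the political leaning of blogs. It contains 1490 nodes and 19090 edges. Each node carries one attribute (0 or 1) indicating whether the blog is liberal (Democratic) or conservative [4]. [Download link]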

DBLP Dataset

Digital Bibliography Project (DBLP) is a computer science bibliography. In this data set, authors are considered as users, the paper titles of the authors are the text of users and the coauthorship relationship forms the links of users. 
Monthly updated DBLP data: [data link]
Preprocessed DBLP dataset: [data link]
DBLP dataset: [usage notes 1, usage notes 2]

DBLP-10K 
DBLP Bibliography data from four research areas: database (DB), data mining (DM), information retrieval (IR) and artificial intelligence (AI). We build a coauthor graph with top 5,000 authors and their coauthor relationships. In addition, we use two relevant attributes: prolific and primary topic. For attribute “prolific”, authors with ≥ 20 papers are labeled as highly prolific; authors with ≥ 10 and < 20 papers are labeled as prolific and authors with < 10 papers are labeled as low prolific. For attribute “primary topic”, we use a topic modeling approach (PLSA) to extract 100 research topics from a document collection composed of paper titles from the selected authors. Each extracted topic consists of a probability distribution of keywords which are most representative of the topic. Then each author will have one out of 100 topics as his/her primary topic [5]. [Download link, access code: 0674]
DBLP-1K, DBLP-5K
These two datasets are built directly from DBLP-10K by selecting the top 1000 and top 5000 authors. For DBLP-5K, see reference [6].
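
The "prolific" attribute described for DBLP-10K is a simple threshold rule on an author's paper count. A minimal sketch of that rule follows; the function name `prolific_label` and the sample author names are illustrative and not part of the dataset.

```python
def prolific_label(paper_count):
    """Map a paper count to the three 'prolific' levels described above."""
    if paper_count >= 20:
        return "highly prolific"
    if paper_count >= 10:
        return "prolific"
    return "low prolific"


# Hypothetical authors and paper counts, for illustration only.
sample_counts = {"author_a": 35, "author_b": 12, "author_c": 3}
labels = {name: prolific_label(n) for name, n in sample_counts.items()}
print(labels)
```

The boundary cases follow the text: 20 papers counts as highly prolific, 10 to 19 as prolific, and anything below 10 as low prolific.
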
Facebook Friendship Datasets

The datasets contain the Facebook networks (from a date in Sept. 2005) from these colleges: Caltech, Princeton, Georgetown and UNC Chapel Hill. The links represent the friendship on Facebook. Each user has the following attributes: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory (house), year and high school [10]. [Download link, access code: 264c]

Datasets with links and text attributes
Enron Email Dataset

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders [7]. [Download link]

Enron Mail subset 
A subset of about 1700 labeled email messages (4.5M). These were chosen in a semi-motivated fashion (focusing on business-related emails and the California Energy Crisis and on emails that occurred later in the collection, trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, nor generality are made about these labelings. The subset is divided into 11 categories following this [categorization]. [Download link]
March 2005 version of the [Enron mail dataset]
CiteSeer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details [8]. [Download link]
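
The 0/1-valued word vector used by this and the following citation datasets marks, for each dictionary word, whether it occurs in the publication. The sketch below illustrates the encoding on a toy dictionary; the function name `binary_word_vector` and the four-word dictionary are assumptions for illustration, not the dataset's actual vocabulary or file layout.

```python
def binary_word_vector(words, dictionary):
    """Return a 0/1 list: entry i is 1 if dictionary[i] occurs in words."""
    present = set(words)
    return [1 if term in present else 0 for term in dictionary]


# Toy dictionary and document (assumption); the real dictionary has 3703 words.
toy_dictionary = ["graph", "community", "network", "cluster"]
toy_document = "community detection on a network network".split()

print(binary_word_vector(toy_document, toy_dictionary))  # [0, 1, 1, 0]
```

Repeated words still map to 1, since the vector records presence rather than frequency.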

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details [8]. [Download link]

WebKB

The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details [9]. [Download link]

Terrorists

The README file in the dataset provides more details. [Download link]

Terrorist Attacks

This dataset consists of 1293 terrorist attacks each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. [Download link]

Flickr dataset

The Flickr image sharing network consists of nodes which represent Flickr users, and edges indicate follow relations between users. We use tags of images uploaded by a given user as her attributes. In this network, the ground-truth communities are defined as user-created interest-based groups that have more than five members. [Download link, access code: ffdb]
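
The ground-truth rule above (keep only groups with more than five members) amounts to a size filter over the user-created groups. A minimal sketch, assuming the groups are available as a mapping from group name to member list; the function name `ground_truth_communities` and the sample groups are hypothetical.

```python
def ground_truth_communities(groups, min_exclusive=5):
    """Keep groups whose number of distinct members is greater than min_exclusive."""
    kept = {}
    for name, members in groups.items():
        unique_members = set(members)
        if len(unique_members) > min_exclusive:
            kept[name] = unique_members
    return kept


# Toy groups (assumption): user ids are plain integers.
toy_groups = {
    "landscapes": [1, 2, 3, 4, 5, 6],
    "macro": [1, 2, 3, 4, 5],
    "street": [7, 8, 9, 10, 11, 12, 12, 13],
}

print(sorted(ground_truth_communities(toy_groups)))  # ['landscapes', 'street']
```

Members are deduplicated before counting so that repeated entries do not inflate a group's size.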

Other common dataset collections

Stanford Large Network Dataset Collection 
Social networks 
online social networks, edges represent interactions between people
Networks with ground-truth communities 
ground-truth network communities in social and information networks
Communication networks 
email communication networks with edges representing communication
Citation networks 
nodes represent papers, edges represent citations
Collaboration networks 
nodes represent scientists, edges represent collaborations (co-authoring a paper)
Web graphs 
nodes represent webpages and edges are hyperlinks
Amazon networks 
nodes represent products and edges link commonly co-purchased products
Internet networks 
nodes represent computers and edges communication
Road networks 
nodes represent intersections and edges roads connecting the intersections
Autonomous systems 
graphs of the internet
Signed networks 
networks with positive and negative edges (friend/foe, trust/distrust)
Location-based online social networks 
Social networks with geographic check-ins
Wikipedia networks and metadata 
Talk, editing and voting data from Wikipedia
Twitter and Memetracker 
Memetracker phrases, links and 467 million Tweets
Online communities 
Data from online communities such as Reddit and Flickr
Online reviews 
Data from online review systems such as BeerAdvocate and Amazon

Datasets collected by Mark Newman
Introduction and related community detection algorithms: http://www-personal.umich.edu/~mejn/
Datasets: http://www-personal.umich.edu/~mejn/netdata/

Social and Information Network Analysis
KDD Cup Dataset 
http://www.cs.cornell.edu/projects/kddcup/datasets.html
Stack Overflow Data 
http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
Youtube dataset 
YouTube videos as nodes. Edge a->b means video b is in the related video list (first 20 only) of a video a. 
http://netsg.cs.sfu.ca/youtubedata/
Amazon Data 
The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes). http://snap.stanford.edu/data/amazon-meta.html
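
The edge definition given for the YouTube dataset (a directed edge a->b when video b is among the first 20 related videos of a) can be sketched as follows. The input layout, a mapping from video id to its ordered related-video list, and the function name `related_video_edges` are assumptions for illustration; the actual files may be organised differently.

```python
def related_video_edges(related, limit=20):
    """Return directed edges (a, b) for the first `limit` related videos of each a."""
    edges = []
    for video, related_list in related.items():
        for other in related_list[:limit]:
            edges.append((video, other))
    return edges


# Toy input (assumption): video "v0" lists 25 related videos, "v1" lists 2.
toy_related = {
    "v0": [f"r{i}" for i in range(25)],
    "v1": ["v0", "r3"],
}

toy_edges_yt = related_video_edges(toy_related)
print(len(toy_edges_yt))  # 22
```

Only the first 20 entries of each list produce edges, so "v0" contributes 20 edges and "v1" contributes 2.
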
[1]: W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977) 
[2]: M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
[3]: D. Lusseau, K. Schneider, O. J. Boisseau et al. The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-Lasting Associations. Behavioral Ecology and Sociobiology, 2003, 54(4): 396-405
[4]: L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005) 
[5]: Zhou Y, Cheng H, Yu J X. Clustering large attributed graphs: An efficient incremental approach[C]//Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010: 689-698. 
[6]: Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718-729. 
[7]: Klimt B, Yang Y. Introducing the Enron Corpus[C]//CEAS. 2004. 
[8]: Yang T, Jin R, Chi Y, et al. Combining link and content for community detection: a discriminative approach[C]//Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009: 927-936. 
[9]: Lu Q, Getoor L. Link-based classification[C]//ICML. 2003, 3: 496-503. 
[10]: Dang T A, Viennet E. Community detection based on structural and attribute similarities[C]//International Conference on Digital Society (ICDS). 2012: 7-12. 
[11]: Xirong Li, Cees G.M. Snoek, and Marcel Worring, Learning Social Tag Relevance by Neighbor Voting, in IEEE Transactions on Multimedia (T-MM), 2009 
[12]: Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical review E, 2006, 74(3): 036104.

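Copyright notice: This is an original article by CSDN blogger "wzgang123", licensed under CC 4.0 BY-SA. Reprints must include a link to the original source and this notice.
Original link: https://blog.csdn.net/wzgang123/article/details/51089521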
