Introduction to Data Science Lab: Network Structure and Social Group Evolution Based on Twitter

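Analysis and preprocessing

Inspecting the JSON structure

Drag any one of the JSON files into a browser and inspect its structure with Chrome's developer tools.
Of the fields, name does not actually need to be extracted; nick is unique and only allows English letters, digits, and underscores (\w), so it serves as the user's unique identifier.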

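Reading the data iteratively

Loading everything first and processing it afterwards wastes memory, so a generator (yield) is used to read the records lazily.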

import json
import os

def extract_info(batch):
    # Yield one (nick, date, name, content) tuple per post in a batch.
    plist = batch["response"]["list"]
    for post in plist:
        nick = post['trackback_author_nick']
        name = post['trackback_author_name']
        date = post['trackback_date']
        content = post['content']
        yield nick, date, name, content

def get_infos(fplist):
    # Walk every file in the given directories and yield its records lazily.
    f_i = 0
    for fp in fplist:
        fs = os.listdir(fp)
        for filename in fs:
            with open(fp + filename, encoding="utf-8") as f:
                try:
                    batch = json.load(f)
                except ValueError:
                    # Skip unreadable files instead of reusing a stale batch.
                    print("{} is error json file".format(filename))
                    continue
            for info in extract_info(batch):
                yield list(info)
            f_i += 1
    print("read all {} files".format(f_i))

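Extracting relationships

The network is built from the retweet and comment relationships between users. When user A @-mentions user B in the content field, we consider user A and user B to be connected.

The mentioned users' nicks therefore have to be extracted with a regular expression.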

import re

# Raw string so that \w is not treated as a string escape.
pattern = re.compile(r'@(\w+)')

def get_at_list(content):
    return pattern.findall(content)
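As a quick check of the pattern, the sketch below runs it on two made-up tweet texts (both sample strings are assumptions, not taken from the dataset). It also shows one thing worth knowing for the JP data: in Python 3, \w matches non-ASCII letters as well, so a nick followed directly by Japanese text is captured together with that text unless re.ASCII is passed.

```python
import re

# Same pattern as in the text, plus an ASCII-only variant for comparison.
mention_pattern = re.compile(r'@(\w+)')
ascii_mention_pattern = re.compile(r'@(\w+)', re.ASCII)

def find_mentions(content, regex=mention_pattern):
    """Return the nicks mentioned in a tweet text, in order of appearance."""
    return regex.findall(content)

# Hypothetical sample texts, for illustration only.
english_sample = "RT @Alice_01: stay safe @bob, thanks!"
japanese_sample = "@bobさん ありがとう"

mentions = find_mentions(english_sample)
unicode_match = find_mentions(japanese_sample)
ascii_match = find_mentions(japanese_sample, ascii_mention_pattern)

print(mentions)
print(unicode_match, ascii_match)
```

Whether the ASCII-only variant is the better choice depends on how the nicks appear in the real tweets; the pattern used in this report is the Unicode-aware default.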

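Splitting by time

The data is split at the time of the earthquake (13:46 on 11 March 2011), so the time format has to be unified; here everything is expressed as a timestamp.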

import datetime
equake_time = datetime.datetime(2011,3,11,13,46).timestamp()
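One detail to keep in mind: calling timestamp() on a naive datetime interprets it in the local time zone of the machine running the notebook, so the value of equake_time depends on where the code runs. The sketch below pins the time zone explicitly and labels timestamps as Pre or Post. Treating the time in the text as Japan Standard Time (UTC+9) is an assumption, and period_of is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the earthquake time given in the text is in Japan Standard Time.
JST = timezone(timedelta(hours=9))
quake_ts = datetime(2011, 3, 11, 13, 46, tzinfo=JST).timestamp()

def period_of(ts, boundary=quake_ts):
    """Label a timestamp 'Pre' (at or before the boundary) or 'Post' (after it)."""
    return 'Pre' if ts <= boundary else 'Post'

print(period_of(quake_ts - 3600))  # one hour before the boundary
print(period_of(quake_ts))         # exactly at the boundary
print(period_of(quake_ts + 1))     # one second after the boundary
```

The boundary itself counts as Pre, which matches the date <= equake_time comparison used in this report.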

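Output to CSV, and save the relationship files as well so they can be read back later

Pay attention to data cleaning: records are de-duplicated by the uniqueness of nick + date.

Self-mentions also have to be removed, to avoid self-loops in the graph later.

In addition, when a user mentions themselves, the nick in info is the lowercase form of the nick in the @-mention, so the comparison has to take that into account.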
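The cleaning rules above can be tried in isolation on a few made-up records. This is a minimal sketch, not the code used for the real export: clean_mentions and the sample records are hypothetical, and it lower-cases both sides of the self-mention comparison, which is a slightly more defensive variant of the check described above.

```python
def clean_mentions(records):
    """Yield (ater, atee) pairs from (nick, date, mentioned_nicks) records.

    Duplicate records (same nick and date) are skipped, and self-mentions
    are dropped so that no self-loops appear in the graph.
    """
    seen = set()
    for nick, date, mentioned in records:
        key = (nick, date)
        if key in seen:              # same user and timestamp: duplicate record
            continue
        seen.add(key)
        for atee in mentioned:
            atee = atee.lower()
            if atee == nick.lower():  # self-mention would become a self-loop
                continue
            yield nick, atee

# Hypothetical sample records, for illustration only.
records = [
    ("alice", 100, ["Bob"]),
    ("alice", 100, ["Bob"]),           # duplicate of the first record
    ("alice", 101, ["Alice"]),         # self-mention with different casing
    ("bob", 102, ["alice", "carol"]),
]
pairs = list(clean_mentions(records))
print(pairs)
```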

import csv

for c in ['EN', 'JP']:
    key_dic = {}
    at_dic = {'Pre': {}, 'Post': {}}
    with open('data/' + c + 'All' + '.csv', 'w', newline='', encoding='utf-8') as f0:
        writer = csv.writer(f0)
        writer.writerow(['nick', 'date'])
        for info in get_infos(['data/' + c + 'alljson/']):
            ater = info[0]
            date = info[1]
            key = (ater, date)
            # De-duplicate on (nick, date).
            if key not in key_dic:
                key_dic[key] = 1
                writer.writerow([ater, date])
                for atee in get_at_list(info[3]):
                    atee = atee.lower()
                    # Skip self-mentions to avoid self-loops.
                    if atee == ater:
                        continue
                    i = 'Pre' if date <= equake_time else 'Post'
                    if (ater, atee) not in at_dic[i]:
                        at_dic[i][(ater, atee)] = 0
                    at_dic[i][(ater, atee)] += 1
    for type in ['Pre', 'Post']:
        with open('data/' + c + type + '.csv', 'w', newline='', encoding='utf-8') as f1:
            writer = csv.writer(f1)
            writer.writerow(['ater', 'atee', 'count'])
            for item in at_dic[type].items():
                writer.writerow(list(item[0]) + [item[1]])
del key_dic
del at_dic

Building the network

Building the graph

import pandas as pd
import networkx as nx
from tqdm import tqdm_notebook as tqdm

def build_graph(df: pd.DataFrame):
    g = nx.Graph()
    # Each row is (ater, atee, count); count becomes the edge weight.
    g.add_weighted_edges_from(df.values.tolist())
    # for row in tqdm(df.iterrows(), total=len(df)):
    #     ater = row[1]['nick']
    #     for at in row[1]['atlist']:
    #         data = g.get_edge_data(ater, at)
    #         if data is None:
    #             g.add_edge(ater, at, weight=1)
    #         else:
    #             g[data[0]][data[1]].update({'weight': data[]})
    return g

Removing nodes that are not shared by both graphs

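def drop_diff_point(G1: nx.Graph, G2: nx.Graph):
    # Remove from G1 every node that is missing from G2 ...
    nodes = list(G1.nodes())
    for node in nodes:
        if not (node in G2):
            G1.remove_node(node)
    # ... and from G2 every node that is missing from G1.
    nodes = list(G2.nodes())
    for node in nodes:
        if not (node in G1):
            G2.remove_node(node)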
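The effect of this step is that both graphs end up with exactly the same node set, the intersection of the two, which is why #nodes is identical for Pre and Post in the statistics table further down. The idea can be sketched without networkx on plain edge lists; restrict_to_common_nodes and the sample edges are hypothetical and only illustrate the set logic.

```python
def restrict_to_common_nodes(edges_a, edges_b):
    """Keep only the edges whose two endpoints occur in both edge lists.

    Each edge is a tuple (u, v, weight). Returns the two filtered edge lists
    and the set of common nodes.
    """
    nodes_a = {n for edge in edges_a for n in edge[:2]}
    nodes_b = {n for edge in edges_b for n in edge[:2]}
    common = nodes_a & nodes_b

    def keep(edges):
        return [e for e in edges if e[0] in common and e[1] in common]

    return keep(edges_a), keep(edges_b), common

# Hypothetical sample data, for illustration only.
pre_edges = [("a", "b", 2), ("b", "c", 1)]
post_edges = [("a", "b", 5), ("a", "d", 1)]
pre_kept, post_kept, common = restrict_to_common_nodes(pre_edges, post_edges)
print(sorted(common), pre_kept, post_kept)
```

One difference from the graph version: a shared node that loses all of its edges stays in the networkx graph as an isolated node, whereas here it only survives in the returned set of common nodes.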

Loading into memory

dfs = {'EN': {}, 'JP': {}}
nets = {'EN': {}, 'JP': {}}
for c in ['EN', 'JP']:
    for type in ['All', 'Pre', 'Post']:
        dfs[c][type] = pd.read_csv('data/' + c + type + '.csv')
        if type != 'All':
            nets[c][type] = build_graph(dfs[c][type])
    drop_diff_point(nets[c]['Pre'], nets[c]['Post'])

Saving the network models

for c in ['EN', 'JP']:
    for type in ['Pre', 'Post']:
        nx.write_gml(nets[c][type], 'data/' + c + type + '.gml')

Results

Comparison of basic network statistics

from prettytable import PrettyTable
table = PrettyTable([' ', '  ', '#users', '#tweets', '#links', '#nodes', 'avg_deg', 'largest_com', 'avg_clustering'])
for c in ['JP', 'EN']:
    for type in ['Pre', 'Post']:
        dff = dfs[c]['All']
        if type == 'Pre':
            df = dff[dff['date'] <= equake_time]
        else:
            df = dff[dff['date'] > equake_time]
        n_usrs = len(df['nick'].unique())
        n_tws = len(df)
        n_links = dfs[c][type]['count'].sum()
        n_nodes = nets[c][type].number_of_nodes()
        d = average_deg(nets[c][type])
        s = largest_com(nets[c][type])
        clu = average_clu(nets[c][type])
        table.add_row([c, type, n_usrs, n_tws, n_links, n_nodes, round(d/2, 3), s, round(clu/2, 3)])
print(table)

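Because the mention relationship is directed, I personally think the degree and clustering coefficient should not be taken directly from the undirected graph, so both are divided by 2;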
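For the degree, the halving has a simple reading. The definition of average_deg is not shown here; assuming it returns the usual mean undirected degree, 2E/N for E edges and N nodes, dividing by 2 gives E/N, the mean number of links per node when each edge is counted once as a directed mention. The sketch below works through that arithmetic on a made-up edge list (average_degree is a hypothetical stand-in, and the sketch says nothing about the clustering coefficient).

```python
from collections import defaultdict

def average_degree(edges):
    """Mean undirected degree: every edge adds 1 to both endpoints, so it equals 2E/N."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(degree.values()) / len(degree)

# Hypothetical graph with 4 nodes and 4 edges.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
undirected_avg = average_degree(edges)  # 2 * 4 edges / 4 nodes
halved = undirected_avg / 2             # 4 edges / 4 nodes
print(undirected_avg, halved)
```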

Also, somewhat oddly, neither the paper nor the template provided by the instructor de-duplicates users; here I did de-duplicate them.

+----+------+--------+---------+--------+--------+---------+-------------+--------------+
|    |      | #users | #tweets | #links | #nodes | avg_deg | largest_com | avg_cluster  |
+----+------+--------+---------+--------+--------+---------+-------------+--------------+
| JP | Pre  |  4000  |  39347  | 25738  |  5467  |  1.602  |     4392    |     0.035    |
| JP | Post |  5383  |  102669 | 90825  |  5467  |  3.390  |     4949    |     0.045    |
| EN | Pre  |  3887  |  44124  | 29215  |  4922  |  1.338  |     4024    |     0.044    |
| EN | Post |  4436  |  57462  | 38099  |  4922  |  1.500  |     4204    |     0.048    |
+----+------+--------+---------+--------+--------+---------+-------------+--------------+

Individual-level degree analysis

import matplotlib.pyplot as plt

for c in ['JP', 'EN']:
    f, ax = plt.subplots(figsize=(6,6))
    individualdegree(nets[c]['Pre'], nets[c]['Post'], c, ax)

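To make the before/after change easier to see, I added the diagonal.

At the individual level, JP has more points with y > x after the earthquake, meaning that individual degrees increased, i.e. users formed more connections; EN is steadier, with little overall change.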
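The visual impression from the scatter plot can also be put into numbers by counting the nodes on each side of the diagonal. The sketch below assumes x is a node's degree before the earthquake and y its degree after; compare_degrees and the sample degrees are hypothetical, and individualdegree itself is not reproduced here.

```python
def compare_degrees(pre_degree, post_degree):
    """Count nodes above, on, and below the diagonal y = x.

    pre_degree and post_degree map node -> degree; x is the pre degree and
    y the post degree (a node missing from post_degree counts as degree 0).
    """
    counts = {'above': 0, 'on': 0, 'below': 0}
    for node, x in pre_degree.items():
        y = post_degree.get(node, 0)
        if y > x:
            counts['above'] += 1
        elif y == x:
            counts['on'] += 1
        else:
            counts['below'] += 1
    return counts

# Hypothetical degrees, for illustration only.
pre = {'a': 1, 'b': 3, 'c': 2}
post = {'a': 4, 'b': 3, 'c': 1}
side_counts = compare_degrees(pre, post)
print(side_counts)
```

With the networkx graphs used in this report, dict(G.degree()) gives a node-to-degree mapping of the shape this function expects.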

Cumulative degree analysis

for c in ['JP', 'EN']:
    for type in ['Pre', 'Post']:
        draw_degree_chart(nets[c][type], type, cumlutive_degree_distribution(nets[c][type]))
    plt.legend()
    plt.title(c)
    plt.show()

The earthquake had little effect on the US (EN) network, whereas for Japan the cumulative degree rose across the board.
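The helper that computes the curve is not shown in this report. One common definition, assumed here, is the fraction of nodes whose degree is at least k, for every observed degree k. A minimal sketch under that assumption, with a hypothetical function name and made-up degrees:

```python
from collections import Counter

def cumulative_degree_distribution(degrees):
    """Return {k: fraction of nodes with degree >= k} for each observed degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    dist = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        dist[k] = remaining / n
        remaining -= counts[k]
    return dist

# Hypothetical degree sequence of a 4-node graph.
dist = cumulative_degree_distribution([1, 1, 2, 3])
print(dist)
```

A curve that shifts upward after the earthquake then means that, at every k, a larger share of users has at least k connections.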

Gephi analysis

Network statistics

Community partitioning and rendering

Community analysis

from prettytable import PrettyTable
from networkx.algorithms.community import k_clique_communities

table = PrettyTable(['Region', 'Period', 'community_num', 'max_com_size'])
cliques = {'JP': {'Pre': None, 'Post': None}, 'EN': {'Pre': None, 'Post': None}}
for c in ['JP', 'EN']:
    for type in ['Pre', 'Post']:
        # k-clique communities with k = 4
        clique = k_clique_communities(nets[c][type], 4)
        clique = list(clique)
        clique_size = [len(cl) for cl in clique]
        cliques[c][type] = clique
        table.add_row([c, type, len(clique), max(clique_size)])
print(table)

+--------+--------+---------------+--------------+
| Region | Period | community_num | max_com_size |
+--------+--------+---------------+--------------+
|   JP   |  Pre   |       20      |     218      |
|   JP   |  Post  |       35      |     711      |
|   EN   |  Pre   |       27      |      97      |
|   EN   |  Post  |       34      |      73      |
+--------+--------+---------------+--------------+

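As the table shows, the number of communities grows in both regions (EN from 27 to 34, JP from 20 to 35), which is the normal trend as a network develops. The difference lies in the largest community: for EN it shrinks slightly (97 to 73), while for JP it more than triples (218 to 711), which indirectly reflects that social aggregation in Japan became stronger; the rendered graphs show this more directly.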

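Sankey diagram

Also known as an alluvial diagram. I looked for libraries and almost all of them are written in R; floweaver was the one good exception, so I used it.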
https://github.com/ricklupton/floweaver

from floweaver import *
from ipysankeywidget import SankeyWidget

c = 'JP'
# c = 'EN'
clique_dic = {'Pre':{}, 'Post': {}}
for type in ['Pre', 'Post']:
    # Rank communities by size and label the 20 largest as 1..20.
    cliques[c][type].sort(reverse=True, key=len)
    for i in range(20):
        for name in cliques[c][type][i]:
            clique_dic[type][name] = i + 1
df1 = pd.DataFrame.from_dict(clique_dic['Pre'], orient='index', columns=['source'])
df2 = pd.DataFrame.from_dict(clique_dic['Post'], orient='index', columns=['target'])
df3 = df1.join(df2, how='inner').astype('Int64')
df3 = df3.reset_index()
df3.columns = ['source', 'pre', 'post']
df3['type'] = 1
df3['value'] = 1
df3['target'] = df3['source']

from floweaver import *

size = dict(width=888, height=666)
nodes = {
    'PreTop10': ProcessGroup(df3['source'].tolist()),
    'PostTop10': ProcessGroup(df3['source'].tolist()),
}
ordering = [['PreTop10'], ['PostTop10']]
bundles = [Bundle('PreTop10', 'PostTop10'),]
sdd = SankeyDefinition(nodes, bundles, ordering)
pre_partition = Partition.Simple('process', [(i, df3[df3.pre==i].source.tolist()) for i in list(range(100))[1:]])
post_partition = Partition.Simple('process', [(i, df3[df3.post==i].source.tolist()) for i in list(range(100))[1:]])
nodes['PreTop10'].partition = pre_partition
nodes['PostTop10'].partition = post_partition
weave(sdd, df3[['source', 'target', 'type', 'value']]).to_widget(**size)

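In the JP Sankey diagram, users from almost all communities merge together and very few communities split up, which again shows that the communities became more tightly knit.

In the EN Sankey diagram, the merging of community users is less pronounced than for JP, as can be seen from the growth of the largest community and from where its members come. In addition, most of the EN post-earthquake communities that do not appear in the diagram are made up of previously scattered users (they did not belong to a community before, so they are not among the sources on the left), and the community changes are also simpler.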
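The flows that the Sankey diagram draws can also be tabulated directly, which makes statements such as "most communities merge into the largest one" easy to check. The sketch below counts how many users move from each pre-earthquake community to each post-earthquake community. community_flows and the sample memberships are hypothetical; only users present on both sides are counted, mirroring the inner join used to build df3.

```python
from collections import Counter

def community_flows(pre_membership, post_membership):
    """Count users moving from each pre community to each post community.

    Both arguments map user -> community label. Users that appear on only
    one side are ignored, like an inner join on the user.
    """
    flows = Counter()
    for user, pre_c in pre_membership.items():
        if user in post_membership:
            flows[(pre_c, post_membership[user])] += 1
    return flows

# Hypothetical memberships, for illustration only.
pre = {'u1': 1, 'u2': 1, 'u3': 2, 'u4': 3}
post = {'u1': 1, 'u2': 1, 'u3': 1, 'u5': 2}
flows = community_flows(pre, post)
print(dict(flows))
```

In this toy example, u5 belongs to a post community but had no pre community, so it contributes no flow, which is the same reason some post-earthquake communities are missing from the diagram.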