python for data analysis 操作usagov_bitly

python for data analysis 操作usagov_bitly_data示例

import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
Gecko) Chrome/17.0.963.78 Safari/535.11',
u'al': u'en-US,en;q=0.8',
u'c': u'US',
u'cy': u'Danvers',
u'g': u'A6qOVH',
u'gr': u'MA',
u'h': u'wfLQtf',
u'hc': 1331822918,
u'hh': u'1.usa.gov',
u'l': u'orofrog',
u'll': [42.576698, -70.954903],
u'nk': 1,
u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
u't': 1331923247,
u'tz': u'America/New_York',
u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
In [19]: records[0]['tz']
Out[19]: u'America/New_York'

Counting Time Zones with pandas

In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame
Out[292]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_ 120 non-null values
a 3440 non-null values
al 3094 non-null values
c 2919 non-null values
cy 2919 non-null values
g 3440 non-null values
gr 2919 non-null values
h 3440 non-null values
hc 3440 non-null values
hh 3440 non-null values
kw 93 non-null values
l 3440 non-null values
ll 2919 non-null values
nk 3440 non-null values
r 3440 non-null values
t 3440 non-null values
tz 3440 non-null values
u 3440 non-null values
dtypes: float64(4), object(14)
In [293]: frame['tz'][:10]
Out[293]:
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz

The Series object returned by frame[‘tz’] has a method value_counts that gives us what we’re looking for:

In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33

You can do a bit of munging to fill in a substitute value for unknown and missing time zone data in the records. The fillna function can replace missing (NA) values and unknown (empty strings) values can be replaced by boolean array indexing:

In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35

Making a horizontal bar plot can be accomplished using the plot method on the counts objects:

In [301]: tz_counts[:10].plot(kind='barh', rot=0)

We’ll explore more tools for working with this kind of data. For example, the a field contains information about the browser, device, or application used to perform the URL shortening:

In [302]: frame['a'][1]
Out[302]: u'GoogleMaps/RochesterNY'
In [303]: frame['a'][50]
Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
In [304]: frame['a'][51]
Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1In [305]: results = Series([x.split()[0] for x in frame.a.dropna()])
In [306]: results[:5]
Out[306]:
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0In [307]: results.value_counts()[:8]
Out[307]:
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4

suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let’s say that a user is on Windows if the string ‘Windows’ is in the agent string. Since some of the agents are missing, I’ll exclude these from the data:

In [308]: cframe = frame[frame.a.notnull()]
In [309]: operating_system = np.where(cframe['a'].str.contains('Windows'),
.....: 'Windows', 'Not Windows')
In [310]: operating_system[:5]
Out[310]:
0 Windows
1 Not Windows
2 Windows
3 Not Windows
4 Windows
Name: aIn [311]: by_tz_os = cframe.groupby(['tz', operating_system])

The group counts, analogous to the value_counts function above, can be computed using size. This result is then reshaped into a table with unstack:

In [312]: agg_counts = by_tz_os.size().unstack().fillna(0)
In [313]: agg_counts[:10]
Out[313]:
a Not Windows Windows
tz
245 276
Africa/Cairo 0 3
Africa/Casablanca 0 1
Africa/Ceuta 0 2
Africa/Johannesburg 0 1
Africa/Lusaka 0 1
America/Anchorage 4 1
America/Argentina/Buenos_Aires 1 0
America/Argentina/Cordoba 0 1
America/Argentina/Mendoza 0 1

Finally, let’s select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:

# Use to sort in ascending order
In [314]: indexer = agg_counts.sum(1).argsort()
In [315]: indexer[:10]
Out[315]:
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55

I then use take to select the rows in that order, then slice off the last 10 rows:

In [316]: count_subset = agg_counts.take(indexer)[-10:]
In [317]: count_subset
Out[317]:
a Not Windows Windows
tz
America/Sao_Paulo 13 20
Europe/Madrid 16 19
Pacific/Honolulu 0 36
Asia/Tokyo 2 35
Europe/London 43 31
America/Denver 132 59
America/Los_Angeles 130 252
America/Chicago 115 285
245 276
America/New_York 339 912In [319]: count_subset.plot(kind='barh', stacked=True)

The plot doesn’t make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1 then plotted again

normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)

python for data analysis 操作usagov_bitly_data示例相关推荐

Python for Data Analysis
本文只是一篇类似导向性的分享, 并没有原创内容, 主要是书籍和网络资源的整理, 仅供参考. 可能会有后续补充更新. 资源 A Byte of Python 这是给没有使用过 Python 的人员的入门 ...
数据分析---《Python for Data Analysis》学习笔记【04】
<Python for Data Analysis>一书由Wes Mckinney所著,中文译名是<利用Python进行数据分析>.这里记录一下学习过程,其中有些方法和书中不同 ...
python 数据分析百度网盘_[百度网盘]利用Python进行数据分析(Python For Data Analysis中文版).pdf - Jan-My31的博客 - 磁力点点...
利用Python进行数据分析(Python For Data Analysis中文版).pdf - Jan-My31的博客 2018-5-27 · 链接:https://pan.baidu.com/s ...
《利用Python进行数据分析： Python for Data Analysis 》学习随笔
NoteBook of <Data Analysis with Python> 3.IPython基础 Tab自动补齐变量名变量方法路径解释 ?解释, ??显示函数源码 ?搜索命名 ...
第03章 Python的数据结构、函数和文件--Python for Data Analysis 2nd
本章讨论Python的内置功能,这些功能本书会用到很多.虽然扩展库,比如pandas和Numpy,使处理大数据集很方便,但它们是和Python的内置数据处理工具一同使用的. 我们会从Python最基础 ...
Python for Data Analysis v2 | Notes_ Chapter 3 Python 的数据结构、函数和文件
本人以简书作者 SeanCheney 系列专题文章并结合原书为学习资源,记录个人笔记,仅作为知识记录及后期复习所用,原作者地址查看简书 SeanCheney,如有错误,还望批评指教.--ZJ 原作者 ...
[学习笔记]Python for Data Analysis, 3E-9.绘图和可视化
进行信息可视化(有时称为绘图)是数据分析中最重要的任务之一.它可能是探索过程中的一部分-例如,帮助识别异常值或所需的数据转换,或作为生成模型想法的一种方式.对于其他人来说,为Web构建交互式可视化可能 ...
[学习笔记]Python for Data Analysis, 3E-8.数据整理：连接、合并和重塑
在许多应用程序中,数据可能分布在多个文件或数据库中,或者以不便于分析的形式排列.本章重点介绍有助于合并.联接和重新排列数据的工具. 首先,介绍一下pandas中的分层索引的概念,这个概念在其中一些操作 ...
[学习笔记]Python for Data Analysis, 3E-11.时间序列
时间序列数据是许多不同领域结构化数据的重要形式,如金融.经济.生态.神经科学和物理学.在许多时间点重复记录的任何内容都会形成一个时间序列.许多时间序列都是固定频率,也就是说数据点回根据某些规律以固定的 ...

python for data analysis 操作usagov_bitly_data示例

python for data analysis 操作usagov_bitly_data示例相关推荐

最新文章

热门文章