Apache log analysis with Pandas

This article was written by Nikolay Koldunov; the original article is
Apache log analysis with Pandas

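Note: the figures in this copy are broken and cannot be displayed, so please see the original for the plots. This version serves only as an introduction.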

%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.

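In this notebook we show a simple example of analyzing Apache access logs with pandas. This is my first experience with pandas, and I am sure there are better and more efficient ways to do some of the things shown here. Comments, suggestions, and corrections of my broken English are very welcome. You can send me an email or open a PR against this notebook's GitHub repository.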

Loading and parsing the data

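We will need the apachelog module to parse the log. We also need to know the log format that is set in the Apache configuration. In my case I have no access to the Apache config, but the hosting provider describes the format on its help page. Below is the format itself and a short description of each element: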

format = r'%V %h  %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Here (mostly copied from this SO post):

%V          - the server name according to the UseCanonicalName setting
%h          - remote host (i.e. the client IP)
%l          - identity of the user determined by identd (not usually used since not reliable)
%u          - user name determined by HTTP authentication
%t          - time the request was received
%r          - request line from the client ("GET / HTTP/1.0")
%>s         - status code sent from the server to the client (200, 404, etc.)
%b          - size of the response to the client (in bytes)
\"%i\"      - Referer, the page that linked to this URL
User-agent  - the browser identification string
%T          - time taken to serve the request, in seconds
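To make the field layout concrete, here is a minimal sketch that splits a line in this format with Python's standard re module instead of apachelog. The pattern and the group names (server, ip, and so on) are assumptions made for this illustration, and the sketch does not handle escaped quotes inside quoted fields.

```python
import re

# Illustrative pattern for the format above; group names are made up for this sketch.
LOG_PATTERN = re.compile(
    r'(?P<server>\S+) (?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) '
    r'(?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<duration>\S+)'
)

line = ('koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] '
        '"GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" '
        '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) '
        'Chrome/23.0.1271.64 Safari/537.11" 0')

# match() returns None when the line does not fit the pattern.
match = LOG_PATTERN.match(line)
fields = match.groupdict() if match else {}
print(fields['ip'], fields['status'], fields['size'])
```

The rest of this notebook uses apachelog for the actual parsing.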

In [3]:import apachelog, sys

Set the format:

In [4]:fformat = r'%V %h %l %u %t \"%r\" %>s %b \"%i\" \"%{User-Agent}i\" %T'

Create the parser:

In [5]:p = apachelog.parser(fformat)

Sample string:

koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0

In [6]:sample_string = 'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'

In [7]:data = p.parse(sample_string)

In [8]:data

Out[8]:
{'%>s': '200','%T': '0','%V': 'koldunov.net','%b': '65237','%h': '85.26.235.202','%i': 'http://koldunov.net/?p=364','%l': '-','%r': 'GET /?p=364 HTTP/1.0','%t': '[16/Mar/2013:00:19:43 +0400]','%u': '-','%{User-Agent}i': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

So the parser works. Now let's load some real-world data (sample files are here and here):

In [9]:log = open('access_log_for_pandas').readlines()

Parse each line and build a list of dictionaries:

In [10]:
log_list = []
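for line in log:
    try:
        data = p.parse(line)
    except:
        sys.stderr.write("Unable to parse %s" % line)
        continue  # skip this line so stale data is not appended again
    # Reformat '[16/Mar/2013:00:19:43 +0400]' into '16/Mar/2013 00:19:43 +0400'
    data['%t'] = data['%t'][1:12]+' '+data['%t'][13:21]+' '+data['%t'][22:27]
    log_list.append(data)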

We had to tweak the time format a bit, otherwise pandas would not be able to parse it.
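The slicing in the loop is easier to follow on a single timestamp. The sketch below applies the same slices to the sample value and then shows an alternative that parses the raw Apache timestamp directly with the standard library. The alternative is not what the notebook does, and it assumes an English locale, since %b matches locale-dependent month names.

```python
from datetime import datetime

raw = '[16/Mar/2013:00:19:43 +0400]'

# Same slices as in the loop: date, time, and UTC offset, joined by spaces.
tweaked = raw[1:12] + ' ' + raw[13:21] + ' ' + raw[22:27]
print(tweaked)

# Alternative: parse the raw timestamp in one step with an explicit format.
parsed = datetime.strptime(raw, '[%d/%b/%Y:%H:%M:%S %z]')
print(parsed.isoformat())
```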

Creating and tuning the DataFrame

The loop above produces a list of dictionaries, which can be converted into a DataFrame:

import pandas as pd
import numpy as np
from pandas import Series, DataFrame

df = DataFrame(log_list)

Show the first two rows of the DataFrame:

df[0:2]

-	%>s	%T	%V	%b	%h	%i	%l	%r	%t	%u	%{User-Agent}i
0	200	0	www.oceanographers.ru	26126	109.165.31.156	-	-	GET /index.php?option=com_content&task=section...	16/Mar/2013 08:00:25 +0400	-	Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...
1	200	0	www.oceanographers.ru	10532	109.165.31.156	http://www.oceanographers.ru/index.php?option=...	-	GET /templates/ja_procyon/css/template_css.css...	16/Mar/2013 08:00:25 +0400	-	Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20...

We are not going to use all of the data, so let's delete some columns:

del df['%T']; del df['%V']; del df['%i']; del df['%l']; del df['%u']; del df['%{User-Agent}i']

And rename the remaining columns to something human-readable:

df = df.rename(columns={'%>s': 'Status', '%b':'b', '%h':'IP', '%r':'Request', '%t': 'Time'})
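As a variation, the deletion and the renaming can be combined by selecting only the wanted columns and renaming them in one expression. This is a sketch on a single made-up record with the same keys the parser produces, and the variable names (records, names, toy) are hypothetical.

```python
import pandas as pd

# Toy record with the same keys the parser produces (values shortened for the sketch).
records = [{'%>s': '200', '%T': '0', '%V': 'example.org', '%b': '26126',
            '%h': '109.165.31.156', '%i': '-', '%l': '-',
            '%r': 'GET /index.php HTTP/1.0',
            '%t': '16/Mar/2013 08:00:25 +0400', '%u': '-',
            '%{User-Agent}i': 'Mozilla/5.0'}]

names = {'%>s': 'Status', '%b': 'b', '%h': 'IP', '%r': 'Request', '%t': 'Time'}

# Keep only the wanted columns (in the order of the mapping), then rename them.
toy = pd.DataFrame(records)[list(names)].rename(columns=names)
print(toy.columns.tolist())
```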

First 5 rows of the resulting DataFrame:

df.head()

-	Status	b	IP	Request	Time
0	200	26126	109.165.31.156	GET /index.php?option=com_content&task=section...	16/Mar/2013 08:00:25 +0400
1	200	10532	109.165.31.156	GET /templates/ja_procyon/css/template_css.css...	16/Mar/2013 08:00:25 +0400
2	200	1853	109.165.31.156	GET /templates/ja_procyon/switcher.js HTTP/1.0	16/Mar/2013 08:00:25 +0400
3	200	37153	109.165.31.156	GET /includes/js/overlib_mini.js HTTP/1.0	16/Mar/2013 08:00:25 +0400
4	200	3978	109.165.31.156	GET /modules/ja_transmenu/transmenuh.css HTTP/1.0	16/Mar/2013 08:00:25 +0400

Convert the Time column to datetime format and make it the index (pop drops the original Time column):

df.index = pd.to_datetime(df.pop('Time'))
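The following sketch shows the same step on two made-up rows. Passing an explicit format string is an addition for this example, not something the notebook does. It spares pandas from guessing the layout and, because of the %z offset, yields a timezone-aware index.

```python
import pandas as pd

toy = pd.DataFrame({
    'Status': ['200', '404'],
    'Time': ['16/Mar/2013 08:00:25 +0400', '16/Mar/2013 08:05:10 +0400'],
})

# pop() removes the column from the frame and returns it as a Series.
toy.index = pd.to_datetime(toy.pop('Time'), format='%d/%b/%Y %H:%M:%S %z')
print(toy.index)
```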

The Status variable is a string, so we need to convert it to int:

df['Status'] = df['Status'].astype('int')

Some rows of the b column contain the '-' string, so we cannot convert them with astype:

df['b'][93]

Out[19]:
'-'

We can apply a custom function to this column that converts every dash to NaN and all other values to floats, and additionally converts bytes to megabytes:

def dash2nan(x):
    if x == '-':
        x = np.nan
    else:
        x = float(x)/1048576.
    return x

df['b'] = df['b'].apply(dash2nan)

I am sure there is a more elegant way to do this.
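One candidate for a more compact version is pd.to_numeric with errors='coerce', which converts the whole column at once and turns unparseable values into NaN. This is a sketch on a made-up series, offered as one possible option rather than as the author's method.

```python
import pandas as pd

sizes = pd.Series(['26126', '-', '1048576'])

# errors='coerce' turns anything non-numeric (here the dash) into NaN.
megabytes = pd.to_numeric(sizes, errors='coerce') / 1048576.0
print(megabytes)
```

Note that this would also silently turn any other malformed value into NaN, whereas dash2nan raises an error for anything that is neither a dash nor a number.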

Traffic analysis

First, the simplest plot: outgoing traffic from the website:

df['b'].plot()

<matplotlib.axes.AxesSubplot at 0xbf7574c>

It looks like somebody downloaded something big from the website around 9 a.m.

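But the first thing you probably want to know is how many visits (in fact, hits) your site gets and how they are distributed over time. We resample the b variable into 5-minute intervals and count the number of requests in each interval. Actually, in this case it does not matter which variable we use; the numbers simply show how many times information was requested from the website.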

df_s = df['b'].resample('5min').count()
df_s.plot()

Out[23]:
<matplotlib.axes.AxesSubplot at 0xc14588c>

![Number of requests per 5-minute interval][8]
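The interval counting can be tried on a small made-up series. One detail worth knowing, shown in this sketch: count() skips NaN values, so rows whose size was a dash are not counted, while size() counts every row in the bin. The timestamps and sizes here are invented for illustration.

```python
import pandas as pd

times = pd.to_datetime(['2013-03-16 08:00:25', '2013-03-16 08:01:10',
                        '2013-03-16 08:04:59', '2013-03-16 08:07:30'])
# Made-up response sizes in MB; None stands for a '-' entry converted to NaN.
toy = pd.Series([0.02, None, 0.01, 0.5], index=times, dtype=float)

# size() counts every row in each 5-minute bin; count() skips NaN values.
requests = toy.resample('5min').size()
non_missing = toy.resample('5min').count()
print(requests.tolist(), non_missing.tolist())
```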

We can count not only the number of requests per time interval, but also the total traffic in each interval:

df_b = df['b'].resample('10min').agg(['count','sum'])
df_b['count'].plot( color='r')
legend()
df_b['sum'].plot(secondary_y=True)

Out[24]:
<matplotlib.axes.AxesSubplot at 0xc2d53ac>

![Request count and total traffic per 10-minute interval][9]

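As you can see, the number of server requests does not always track the amount of traffic, and the correlation is actually not very high: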

df_b.corr()

|-| count| sum
|count| 1.000000| 0.512629
|sum| 0.512629| 1.000000
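To see how the two aggregated columns and their correlation matrix come about, here is a minimal sketch on invented data. The timestamps and sizes are made up, so the resulting numbers say nothing about the real log.

```python
import pandas as pd

times = pd.to_datetime(['2013-03-16 08:00', '2013-03-16 08:02',
                        '2013-03-16 08:04', '2013-03-16 08:11',
                        '2013-03-16 08:21', '2013-03-16 08:23'])
toy = pd.Series([0.1, 0.2, 0.1, 5.0, 0.3, 0.1], index=times)  # made-up sizes in MB

# One row per 10-minute bin, with a request count and a traffic total.
stats = toy.resample('10min').agg(['count', 'sum'])
print(stats)

# corr() returns the pairwise Pearson correlation between the columns.
corr = stats.corr()
print(corr)
```

In this toy case the bin with the fewest requests carries the most traffic, so the correlation comes out negative.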

We can take a closer look at the morning peak:

df_b['2013-03-16 6:00':'2013-03-16 10:00']['sum'].plot()

Out[26]:
<matplotlib.axes.AxesSubplot at 0xc3f5dac>

![Traffic between 06:00 and 10:00 on 2013-03-16][10]

It looks like the traffic peak was caused by a single request. Let's find it. Select all requests with a response larger than 20 MB:

df[df['b']>20]

-	Status	b	IP	Request
Time
2013-03-16 09:02:59	200	21.365701	77.50.248.20	GET /books/Bondarenko.pdf HTTP/1.0

This is a PDF file of a book, which explains the outgoing traffic peak at 2013-03-16 09:02:59.
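The selection above uses a boolean mask. When the threshold is not known in advance, idxmax() is another way to locate the peak. The sketch below uses a small made-up frame in which only the peak row mirrors the result shown above.

```python
import pandas as pd

times = pd.to_datetime(['2013-03-16 09:00:10', '2013-03-16 09:02:59',
                        '2013-03-16 09:03:05'])
# Made-up rows, except that the middle one mirrors the peak found in the log.
toy = pd.DataFrame({'b': [0.02, 21.365701, 0.5],
                    'Request': ['GET / HTTP/1.0',
                                'GET /books/Bondarenko.pdf HTTP/1.0',
                                'GET /style.css HTTP/1.0']},
                   index=times)

# Boolean mask: keeps only the rows where the condition holds.
big = toy[toy['b'] > 20]

# idxmax() returns the index label (here a timestamp) of the largest value.
peak_time = toy['b'].idxmax()
print(big)
print(peak_time)
```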

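Roughly 20 MB is a big request (at least for our website), but what is the typical size of a server response? A histogram of the response sizes (below 20 MB) looks like this: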

cc = df[df['b']<20]
cc.b.hist(bins=10)

Out[28]:
<matplotlib.axes.AxesSubplot at 0xc52374c>

![Histogram of response sizes below 20 MB][11]

So most of the files are smaller than 0.5 MB. In fact they are even smaller:

cc = df[df['b']<0.3]
cc.b.hist(bins=100)

Out[29]:
<matplotlib.axes.AxesSubplot at 0xc5760ec>

![Histogram of response sizes below 0.3 MB][12]
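Besides histograms, a few summary numbers can answer the "typical size" question without a plot. This is a sketch on invented sizes; the values are not taken from the real log.

```python
import pandas as pd

sizes = pd.Series([0.01, 0.02, 0.02, 0.03, 0.05, 0.1, 0.4, 21.4])  # made-up sizes in MB

# The median is barely affected by the single very large download.
median = sizes.median()
print(median)

# Fraction of responses below 0.3 MB (the mean of a boolean mask).
share_small = (sizes < 0.3).mean()
print(share_small)
```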