Hadoop in Action: Forum Clickstream Log Analysis
Introduction
Website clickstream logs record, for example, which link a user clicked, which page they stayed on longest, which search terms they used, and their total browsing time. All of this can be captured in the site's logs, and analyzing it yields information that is critical to running the site; the more complete the collected data, the more precise the analysis. The project mainly uses MapReduce, Hive, Sqoop, Spring, SpringMVC, MyBatis, and ECharts. In the data-cleaning and ETL stages, the main tools are MapReduce, Hive, and Sqoop; I have put that code on GitHub. After the data is extracted, a JavaEE stack is used for visualization; that code is here.
Concepts
The Clickstream Concept
The clickstream concept focuses on the user's entire path through the site. The individual clicks recorded in the web log are the "points" on that path, while the clickstream is the "line" that strings those points together. You can also think of each "point" as a Page of the site and the "line" as a Session. Clickstream data is therefore derived from the web logs, but it carries more information than the raw logs, so statistics computed from it are richer and more useful.
Generating the Clickstream Model
Concretely, clickstream data is assembled from the scattered click records in the logs, so the data model needs two tables: Pageviews and Visits.
1. The access-log table from which the clickstream is generated
Timestamp | IP | Cookie | Session | Request URL | Referer
2012-01-01 12:31:12 | 101.0.0.1 | User01 | S001 | /a/... | somesite.com
2012-01-01 12:31:16 | 201.0.0.2 | User02 | S002 | /a/... | -
2012-01-01 12:33:06 | 101.0.0.2 | User03 | S002 | /b/... | baidu.com
2012-01-01 15:16:39 | 234.0.0.3 | User01 | S003 | /c/... | google.com
2012-01-01 15:17:11 | 101.0.0.1 | User01 | S004 | /d/... | /c/...
2012-01-01 15:19:23 | 101.0.0.1 | User01 | S004 | /e/... | /d/....
2. The Pageviews table (page views grouped by session): one record per URL visited within a session; a gap of half an hour or more between hits is treated as the start of a new session.
Session | UserID | Time | Page URL | Stay (s) | Step
S001 | User01 | 2012-01-01 12:31:12 | /a/.... | 30 | 1
S002 | User02 | 2012-01-01 12:31:16 | /a/.... | 10 | 1
S002 | User02 | 2012-01-01 12:33:06 | /b/.... | 110 | 2
S002 | User02 | 2012-01-01 12:35:06 | /e/.... | 30 | 3
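The 30-minute rule that splits hits into sessions can be sketched outside Hadoop. Below is a minimal, hypothetical helper (not part of the project code): given one visitor's hit timestamps sorted ascending, it assigns a session index that advances whenever the gap to the previous hit reaches 30 minutes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sessionization sketch: timestamps are epoch milliseconds, sorted ascending.
// A gap of >= 30 minutes starts a new session, mirroring the Pageviews model above.
public class Sessionizer {
    static final long SESSION_GAP_MS = 30 * 60 * 1000L;

    // Returns a 0-based session index for each hit.
    public static List<Integer> assignSessions(long[] timestamps) {
        List<Integer> sessions = new ArrayList<>();
        int current = 0;
        for (int i = 0; i < timestamps.length; i++) {
            if (i > 0 && timestamps[i] - timestamps[i - 1] >= SESSION_GAP_MS) {
                current++; // gap too large: a new visit begins
            }
            sessions.add(current);
        }
        return sessions;
    }

    public static void main(String[] args) {
        // hits at t=0s, 4s, 114s, then two hours later
        long[] ts = {0L, 4_000L, 114_000L, 7_200_000L};
        System.out.println(assignSessions(ts)); // [0, 0, 0, 1]
    }
}
```

The MapReduce job later in this post applies the same rule per IP, additionally numbering the pages inside each session with a step counter.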
3. The clickstream Visits table
Session | Start time | End time | Entry page | Exit page | Pages | IP | Cookie | Referer
S001 | 2012-01-01 12:31:12 | 2012-01-01 12:31:12 | /a/... | /a/... | 1 | 101.0.0.1 | User01 | somesite.com
S002 | 2012-01-01 12:31:16 | 2012-01-01 12:35:06 | /a/... | /e/... | 3 | 201.0.0.2 | User02 | -
S003 | 2012-01-01 12:35:42 | 2012-01-01 12:35:42 | /c/... | /c/... | 1 | 234.0.0.3 | User03 | baidu.com
S004 | 2012-01-01 15:16:39 | 2012-01-01 15:19:23 | /c/... | /e/... | 3 | 101.0.0.1 | User01 | google.com
…… | …… | …… | …… | …… | …… | …… | …… | ……
This is the clickstream model. Once web logs are transformed into clickstream data, many site-analytics metrics become simple to compute; that is the "magic" of the clickstream. Based on clickstream data we can derive many common site metrics.
Multi-dimensional Traffic Analysis
Segmentation means slicing a metric along different dimensions, examining how the same metric behaves in each segment, locating the problematic part, and then optimizing it.
Common Site Metrics
PV (Page View): page views or clicks, measuring how many pages the site's users open. Within a statistics period, every page open or refresh counts as one PV, and repeated opens or refreshes of the same page accumulate.
UV (Unique Visitor): the number of distinct users (identified by cookie) who visit the site within one day; one client machine counts as one visitor, so UV can be read as the number of machines that visited the site. The site recognizes a returning machine by its cookies: changing IP without clearing cookies leaves the UV count unchanged, while visiting without saving cookies, clearing cookies, or switching devices adds 1. Multiple visits from the same client between 00:00 and 24:00 count as a single visitor.
IP (Internet Protocol): the number of distinct IP addresses that viewed pages within one day. No matter how many pages one IP views, it counts as 1; a different IP adds 1. Visitors are distinguished by their WAN IP, so several users (several LAN IPs) behind the same router (one WAN IP) may be recorded as a single IP visitor, while a user who keeps switching IPs may be counted multiple times.
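As a quick illustration of how the three metrics diverge, here is a toy computation (the class and its inputs are hypothetical, not from the project): PV counts every hit, UV counts distinct cookies, and the IP metric counts distinct addresses.

```java
import java.util.HashSet;
import java.util.Set;

// Toy PV/UV/IP computation over (ip, cookie) hit records.
public class TrafficMetrics {
    // Each hit is {ip, cookie}; returns {PV, UV, IP}.
    public static int[] compute(String[][] hits) {
        Set<String> cookies = new HashSet<>();
        Set<String> addrs = new HashSet<>();
        for (String[] hit : hits) {
            addrs.add(hit[0]);   // distinct addresses -> IP metric
            cookies.add(hit[1]); // distinct cookies   -> UV metric
        }
        return new int[]{hits.length, cookies.size(), addrs.size()};
    }

    public static void main(String[] args) {
        String[][] hits = {
            {"101.0.0.1", "User01"}, // same user refreshes twice -> PV 2, UV 1
            {"101.0.0.1", "User01"},
            {"101.0.0.1", "User02"}, // second user behind the same router IP
        };
        int[] m = compute(hits);
        System.out.println(m[0] + " " + m[1] + " " + m[2]); // 3 2 1
    }
}
```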
Project Architecture
The analysis is not a one-off: it is recomputed at a fixed cadence, so the stages of the pipeline must run in a strict dependency order, which means managing and scheduling a large number of task units. The project therefore needs a task-scheduling module.
Technology Selection
For clickstream log analysis, the reliability and fault-tolerance requirements on the collection side are usually not severe, so the general-purpose Flume log-collection framework is entirely sufficient.
This project uses Flume for log collection.
Data Content
Each log record contains:
IP address
Time
Request method
Request URL
Status code
Response bytes
Referer URL
User agent
Data Cleaning
After collecting the site logs (the raw log data is available here), we need to clean and normalize them according to the project's requirements.
The goals of this step are to:
filter out non-conforming records;
convert and normalize formats;
separate out the base data for the different topics (site sections/paths) needed by the later statistics.
The implementation:
package cn.edu.hust.preprocess;
import cn.edu.hust.preprocess.domain.WebLogBean;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.*;

public class ClickStream {

    static class ClickStreamMapper extends Mapper<LongWritable, Text, Text, WebLogBean> {

        public static String formatDate(String dateStr) {
            if (dateStr == null || StringUtils.isBlank(dateStr)) return "2012-04-04 12.00.00";
            SimpleDateFormat format = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
            SimpleDateFormat format1 = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.CHINA);
            String result = null;
            try {
                Date date = format.parse(dateStr);
                result = format1.format(date);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return result;
        }

        static Text k = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            try {
                String message = value.toString();
                String[] splits = message.split(" ");
                if (splits.length < 12) return;
                String time = formatDate(splits[3].substring(1));
                String method = splits[5].substring(1);
                String protocol = StringUtils.isBlank(splits[7]) ? "HTTP/1.1" : splits[7].substring(0, splits[7].length() - 1);
                int status = StringUtils.isBlank(splits[8]) ? 0 : Integer.parseInt(splits[8]);
                int bytes = StringUtils.isBlank(splits[9]) ? 0 : Integer.parseInt(splits[9]);
                // check splits[10] (the referer field) here, not splits[9]
                String from_url = StringUtils.isBlank(splits[10]) ? "-" : splits[10].substring(1, splits[10].length() - 1);
                StringBuilder sb = new StringBuilder();
                for (int i = 11; i < splits.length; i++) {
                    sb.append(splits[i]);
                }
                String s = sb.toString();
                String platform = s.substring(1, s.length() - 1);
                WebLogBean ms = new WebLogBean(splits[0], time, method, splits[6], protocol, status, bytes, from_url, platform);
                k.set(splits[0]);
                context.write(k, ms);
            } catch (Exception e) {
                // skip malformed lines
            }
        }
    }

    static class ClickStreamReducer extends Reducer<Text, WebLogBean, NullWritable, Text> {
        Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<WebLogBean> values, Context context) throws IOException, InterruptedException {
            ArrayList<WebLogBean> beans = new ArrayList<WebLogBean>();
            try {
                for (WebLogBean bean : values) {
                    WebLogBean webLogBean = new WebLogBean();
                    try {
                        BeanUtils.copyProperties(webLogBean, bean);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    beans.add(webLogBean);
                }
                // Sort the beans chronologically
                Collections.sort(beans, new Comparator<WebLogBean>() {
                    public int compare(WebLogBean o1, WebLogBean o2) {
                        try {
                            Date d1 = toDate(o1.getTimeStr());
                            Date d2 = toDate(o2.getTimeStr());
                            if (d1 == null || d2 == null) return 0;
                            return d1.compareTo(d2);
                        } catch (Exception e) {
                            e.printStackTrace();
                            return 0;
                        }
                    }
                });
                /*
                 * Walk the sorted beans, detect individual visits, and number
                 * the pages inside each visit with a step counter.
                 */
                int step = 1;
                String session = UUID.randomUUID().toString();
                for (int i = 0; i < beans.size(); i++) {
                    WebLogBean bean = beans.get(i);
                    // With a single record, output it directly
                    if (1 == beans.size()) {
                        // Use a default stay time of 60s
                        v.set(session + "," + bean.getIp() + "," + bean.getTimeStr() + "," + bean.getRequest_url() + "," + step + "," + (60) + "," + bean.getFrom_url() + "," + bean.getPlatform() + "," + bean.getBytes() + "," + bean.getStatus());
                        context.write(NullWritable.get(), v);
                        session = UUID.randomUUID().toString();
                        break;
                    }
                    // With more than one record, skip the first; it is emitted while handling the second
                    if (i == 0) {
                        continue;
                    }
                    // Time difference between this hit and the previous one
                    long timeDiff = timeDiff(toDate(bean.getTimeStr()), toDate(beans.get(i - 1).getTimeStr()));
                    if (timeDiff < 30 * 60 * 1000) {
                        // Less than 30 minutes apart: emit the previous page view within the current visit
                        v.set(session + "," + beans.get(i - 1).getIp() + "," + beans.get(i - 1).getTimeStr() + "," + beans.get(i - 1).getRequest_url() + "," + step + "," + (timeDiff / 1000) + "," + beans.get(i - 1).getFrom_url() + "," + beans.get(i - 1).getPlatform() + "," + beans.get(i - 1).getBytes() + "," + beans.get(i - 1).getStatus());
                        context.write(NullWritable.get(), v);
                        step++;
                    } else {
                        // 30 minutes or more apart: emit the previous page view, then split off a new visit
                        v.set(session + "," + beans.get(i - 1).getIp() + "," + beans.get(i - 1).getTimeStr() + "," + beans.get(i - 1).getRequest_url() + "," + step + "," + (60) + "," + beans.get(i - 1).getFrom_url() + "," + beans.get(i - 1).getPlatform() + "," + beans.get(i - 1).getBytes() + "," + beans.get(i - 1).getStatus());
                        context.write(NullWritable.get(), v);
                        // Reset the step counter and open a new session
                        step = 1;
                        session = UUID.randomUUID().toString();
                    }
                    // If this is the last record, emit it directly
                    if (i == beans.size() - 1) {
                        // Use a default stay time of 60s
                        v.set(session + "," + bean.getIp() + "," + bean.getTimeStr() + "," + bean.getRequest_url() + "," + step + "," + (60) + "," + bean.getFrom_url() + "," + bean.getPlatform() + "," + bean.getBytes() + "," + bean.getStatus());
                        context.write(NullWritable.get(), v);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private static Date toDate(String timeStr) throws ParseException {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.CHINA);
        return df.parse(timeStr);
    }

    private static long timeDiff(String time1, String time2) throws ParseException {
        Date d1 = toDate(time1);
        Date d2 = toDate(time2);
        return d1.getTime() - d2.getTime();
    }

    private static long timeDiff(Date time1, Date time2) throws ParseException {
        return time1.getTime() - time2.getTime();
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(ClickStream.class);
        // Set the job's mapper and reducer
        job.setMapperClass(ClickStreamMapper.class);
        job.setReducerClass(ClickStreamReducer.class);
        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(WebLogBean.class);
        // Reduce output types
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(4);
        // Input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean flag = job.waitForCompletion(true);
        System.exit(flag ? 0 : 1);
    }
}
The MapReduce job uses a custom bean; the code is below:
package cn.edu.hust.preprocess.domain;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class WebLogBean implements Writable {
    private String ip;          // source IP
    private String timeStr;     // access time
    private String method;      // request method
    private String request_url; // requested url
    private String protocol;    // protocol used
    private int status;         // status code
    private int bytes;          // bytes sent
    private String from_url;    // referer url
    private String platform;    // client platform

    public WebLogBean() {
    }

    public WebLogBean(String ip, String timeStr, String method, String request_url, String protocol, int status, int bytes, String from_url, String platform) {
        this.ip = ip;
        this.timeStr = timeStr;
        this.method = method;
        this.request_url = request_url;
        this.protocol = protocol;
        this.status = status;
        this.bytes = bytes;
        this.from_url = from_url;
        this.platform = platform;
    }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }
    public String getTimeStr() { return timeStr; }
    public void setTimeStr(String timeStr) { this.timeStr = timeStr; }
    public String getMethod() { return method; }
    public void setMethod(String method) { this.method = method; }
    public String getRequest_url() { return request_url; }
    public void setRequest_url(String request_url) { this.request_url = request_url; }
    public String getProtocol() { return protocol; }
    public void setProtocol(String protocol) { this.protocol = protocol; }
    public int getStatus() { return status; }
    public void setStatus(int status) { this.status = status; }
    public int getBytes() { return bytes; }
    public void setBytes(int bytes) { this.bytes = bytes; }
    public String getFrom_url() { return from_url; }
    public void setFrom_url(String from_url) { this.from_url = from_url; }
    public String getPlatform() { return platform; }
    public void setPlatform(String platform) { this.platform = platform; }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(ip);
        dataOutput.writeUTF(this.timeStr);
        dataOutput.writeUTF(this.method);
        dataOutput.writeUTF(this.request_url);
        dataOutput.writeUTF(this.protocol);
        dataOutput.writeInt(this.status);
        dataOutput.writeInt(this.bytes);
        dataOutput.writeUTF(this.from_url);
        dataOutput.writeUTF(this.platform);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.ip = dataInput.readUTF();
        this.timeStr = dataInput.readUTF();
        this.method = dataInput.readUTF();
        this.request_url = dataInput.readUTF();
        this.protocol = dataInput.readUTF();
        this.status = dataInput.readInt();
        this.bytes = dataInput.readInt();
        this.from_url = dataInput.readUTF();
        this.platform = dataInput.readUTF();
    }
}
After cleaning, the data is in the shape we need, and we can load it into Hive for ETL.
The ETL Process
1. Create the visit model
#1. Create the visit model
create external table click_stream_visit(
session string,
ip string,
timestr string,
request_url string,
setp string,
stayLong string,
from_url string,
platform string,
byte string,
status string
) partitioned by(datestr string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Note that the table is partitioned by date.
We then load the output of the cleaning step into this table.
load data inpath '/log/output/part-r-0000*' into table click_stream_visit partition(datestr='2012-01-04');
2. Create the ODS (staging) tables
This covers the dimension tables and further extraction of the data.
#2. Create the ODS table
drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
session string,
remote_addr string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string,
step string,
stayLong string
)partitioned by (datestr string)
row format delimited
fields terminated by ',';
insert into table ods_click_pageviews partition(datestr='2012-01-04') select session,ip,timestr,request_url,status,byte,from_url,platform,setp,stayLong from click_stream_visit;
#Create the time dimension table
drop table if exists dim_time;
create table dim_time(
year string,
month string,
day string,
hour string)
row format delimited
fields terminated by ',';
#Create the browser dimension table
create table dim_browser(
browser string
);
#Create the terminal (OS) dimension table
create table dim_os(
os string
);
##Create the region dimension table
create table dim_region(
province string,
city string
);
3. The detail table
#Create the detail table
drop table ods_weblog_detail;
create table ods_weblog_detail(
remote_addr string, --source IP
remote_user string, --user identifier
time_local string, --full access time
daystr string, --access date
timestr string, --access time of day
yearstr string, --year
month string, --month
day string, --day
hour string, --hour
request string, --requested url
status string, --response code
body_bytes_sent string, --bytes transferred
http_referer string, --referer url
ref_host string, --referer host
ref_path string, --referer path
ref_query string, --referer query string
http_user_agent string,--client user agent
os string, --operating system
province string,
city string )partitioned by(datestr string) row format delimited
fields terminated by ',';
With these tables in place, we can extract the data. This requires a few custom functions, so we implement our own UDFs; the code is below.
package cn.edu.hust.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Maps a user-agent string to a browser name
public class BrowserUtils extends UDF {
    public String evaluate(String s) {
        if (s.toLowerCase().contains("chrome")) return "Chrome";
        else if (s.toLowerCase().contains("firefox")) return "Firefox";
        else if (s.toLowerCase().contains("mozilla")) return "Mozilla";
        else if (s.toLowerCase().contains("ie")) return "IE";
        else if (s.toLowerCase().contains("opera")) return "Opera";
        else if (s.toLowerCase().contains("safari")) return "Safari";
        else if (s.toLowerCase().contains("uc")) return "UC";
        else if (s.toLowerCase().contains("qq")) return "QQ";
        else return "Others";
    }
}
package cn.edu.hust.udf;

import cn.edu.hust.udf.bean.Pair;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;

public class CityUtils extends UDF {
    static ArrayList<HashMap<Pair<Long, Long>, String>> ips = new ArrayList<HashMap<Pair<Long, Long>, String>>();

    static {
        // Load the IP range table once per JVM
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(new FileInputStream("/home/hadoop/ip.txt")));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                String[] splits = line.split("\\|");
                Long up = Long.parseLong(splits[2]);
                Long down = Long.parseLong(splits[3]);
                Pair<Long, Long> pair = new Pair<Long, Long>();
                pair.setFirst(up);
                pair.setSecond(down);
                StringBuilder sb = new StringBuilder();
                sb.append(splits[6]).append("|" + splits[7]);
                HashMap<Pair<Long, Long>, String> ip = new HashMap<Pair<Long, Long>, String>();
                ip.put(pair, sb.toString());
                ips.add(ip);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Look up "province|city" for an IP via binary search over the ranges
    public static synchronized String getProvinceAndCity(String ip) {
        String[] splits = ip.split("\\.");
        double value = 0;
        for (int i = 0; i < splits.length; i++) {
            value += Long.parseLong(splits[i]) * Math.pow(2, 8 * (3 - i));
        }
        int high = ips.size() - 1;
        int low = 0;
        while (low <= high) {
            int mid = (low + high) / 2;
            Pair<Long, Long> pair = (Pair<Long, Long>) ips.get(mid).keySet().toArray()[0];
            if (value >= pair.getFirst() && value <= pair.getSecond()) {
                return (String) ips.get(mid).values().toArray()[0];
            } else if (value > pair.getSecond()) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return "未知|未知"; // "unknown|unknown"
    }

    public synchronized String evaluate(String s) {
        String[] t = getProvinceAndCity(s).split("\\|");
        if (t.length < 2) return t[0];
        return t[1]; // the city part
    }
}
package cn.edu.hust.udf;

import cn.edu.hust.udf.bean.Pair;
import org.apache.hadoop.hive.ql.exec.UDF;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;

public class IPUtils extends UDF {
    static ArrayList<HashMap<Pair<Long, Long>, String>> ips = new ArrayList<HashMap<Pair<Long, Long>, String>>();

    static {
        // Load the IP range table once per JVM
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(new FileInputStream("/home/hadoop/ip.txt")));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                String[] splits = line.split("\\|");
                Long up = Long.parseLong(splits[2]);
                Long down = Long.parseLong(splits[3]);
                Pair<Long, Long> pair = new Pair<Long, Long>();
                pair.setFirst(up);
                pair.setSecond(down);
                StringBuilder sb = new StringBuilder();
                sb.append(splits[6]).append("|" + splits[7]);
                HashMap<Pair<Long, Long>, String> ip = new HashMap<Pair<Long, Long>, String>();
                ip.put(pair, sb.toString());
                ips.add(ip);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Look up "province|city" for an IP via binary search over the ranges
    public static synchronized String getProvinceAndCity(String ip) {
        String[] splits = ip.split("\\.");
        double value = 0;
        for (int i = 0; i < splits.length; i++) {
            value += Long.parseLong(splits[i]) * Math.pow(2, 8 * (3 - i));
        }
        int high = ips.size() - 1;
        int low = 0;
        while (low <= high) {
            int mid = (low + high) / 2;
            Pair<Long, Long> pair = (Pair<Long, Long>) ips.get(mid).keySet().toArray()[0];
            if (value >= pair.getFirst() && value <= pair.getSecond()) {
                return (String) ips.get(mid).values().toArray()[0];
            } else if (value > pair.getSecond()) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return "未知|未知"; // "unknown|unknown"
    }

    public synchronized String evaluate(String s) {
        return getProvinceAndCity(s).split("\\|")[0]; // the province part
    }
}
package cn.edu.hust.udf.bean;

import java.io.Serializable;

public class Pair<T, U> implements Serializable {
    private T first;
    private U second;

    public Pair() {
    }

    public Pair(T first, U second) {
        this.first = first;
        this.second = second;
    }

    public T getFirst() { return first; }
    public void setFirst(T first) { this.first = first; }
    public U getSecond() { return second; }
    public void setSecond(U second) { this.second = second; }
}
package cn.edu.hust.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Maps a user-agent string to an operating-system name
public class OSUtils extends UDF {
    public String evaluate(String s) {
        if (s.toLowerCase().contains("windows")) return "Windows";
        else if (s.toLowerCase().contains("macos")) return "MacOS";
        else if (s.toLowerCase().contains("linux")) return "Linux";
        else if (s.toLowerCase().contains("android")) return "Android";
        else if (s.toLowerCase().contains("ios")) return "IOS";
        else return "Others";
    }
}
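Both IP UDFs above first reduce a dotted-quad address to a single number and then binary-search the range table. That conversion on its own, as a standalone sketch (the class name `IpMath` is hypothetical; the shift form is equivalent to the `octet * 2^(8*(3-i))` sum in the UDFs):

```java
// Dotted-quad -> numeric form, as used by the IP lookup UDFs above.
public class IpMath {
    public static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        long value = 0;
        for (int i = 0; i < parts.length; i++) {
            value = (value << 8) + Long.parseLong(parts[i]); // shift in each octet
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(ipToLong("1.2.3.4"));   // 16909060
        System.out.println(ipToLong("0.0.0.255")); // 255
    }
}
```

With every range stored as a (start, end) pair of such numbers and the table sorted by start, a standard binary search finds the enclosing range in O(log n), which is why the UDFs can afford a lookup per row.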
Package the project as a jar and register it in Hive:
#Create our own functions; see the project source for details
add jar /home/hadoop/logAnalyzeHelper.jar;
#Create temporary functions
create temporary function getOS as 'cn.edu.hust.udf.OSUtils';
create temporary function getBrowser as 'cn.edu.hust.udf.BrowserUtils';
create temporary function getProvince as 'cn.edu.hust.udf.IPUtils';
create temporary function getCity as 'cn.edu.hust.udf.CityUtils';
Load data into the dimension tables
#Load the OS dimension table
insert into dim_os select distinct getOS(http_user_agent) from ods_click_pageviews;
#Load the browser dimension table
insert into dim_browser select distinct getBrowser(http_user_agent) from ods_click_pageviews;
##Load the region dimension data
insert into dim_region (city,province)
select distinct a.city as city,a.province as province
from ods_weblog_detail a
join (select distinct province from ods_weblog_detail) b
on a.province=b.province
where a.datestr='2012-01-04';
#Load the detail table
insert into ods_weblog_detail partition(datestr='2012-01-04')
select remote_addr,session,time_local,substring(time_local,0,10) as daystr,substring(time_local,12) as timestr,substring(time_local,0,4) as yearstr,substring(time_local,6,2) as month,
substring(time_local,9,2) as day,substring(time_local,12,2) as hour,split(request,"\\?")[0],status
,body_bytes_sent,http_referer,parse_url(http_referer,'HOST') as ref_host,parse_url(http_referer,'PATH') as ref_path,
parse_url(http_referer,'QUERY') as ref_query,getBrowser(http_user_agent) as http_user_agent,
getOS(http_user_agent) as os ,getProvince(remote_addr),getCity(remote_addr) from ods_click_pageviews;
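The substring arithmetic in the insert above is easy to get wrong because Hive's `substring` is 1-based (it treats index 0 as 1), while Java's is 0-based. A small sketch of the same field extraction in plain Java (the class name `TimeFields` is hypothetical, for illustration only):

```java
// Extracting date parts the way the Hive query above does, but with
// Java's 0-based substring instead of Hive's 1-based one.
public class TimeFields {
    // timeLocal is "yyyy-MM-dd HH:mm:ss"; returns {daystr, timestr, year, month, day, hour}
    public static String[] split(String timeLocal) {
        String daystr  = timeLocal.substring(0, 10);  // Hive: substring(time_local, 0, 10)
        String timestr = timeLocal.substring(11);     // Hive: substring(time_local, 12)
        String year    = timeLocal.substring(0, 4);   // Hive: substring(time_local, 0, 4)
        String month   = timeLocal.substring(5, 7);   // Hive: substring(time_local, 6, 2)
        String day     = timeLocal.substring(8, 10);  // Hive: substring(time_local, 9, 2)
        String hour    = timeLocal.substring(11, 13); // Hive: substring(time_local, 12, 2)
        return new String[]{daystr, timestr, year, month, day, hour};
    }

    public static void main(String[] args) {
        String[] f = split("2012-01-01 12:31:12");
        System.out.println(String.join("|", f)); // 2012-01-01|12:31:12|2012|01|01|12
    }
}
```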
Module Development
#Statistics along the time dimension
select count(1),yearstr,month,day,hour from ods_weblog_detail
group by yearstr,month,day,hour;
##PV per hour
drop table dw_pvs_hour;
create table dw_pvs_hour(year string,month string,day string,hour string,pvs bigint)
row format delimited
fields terminated by '\t';
###Insert the data
insert into table dw_pvs_hour
select a.yearstr as year ,a.month as month,a.day as day,a.hour as hour,
count(1) as pvs
from ods_weblog_detail a
group by a.yearstr,a.month,a.day,a.hour;
##PV per day
drop table dw_pvs_day;
create table dw_pvs_day(pvs bigint,year string,month string,day string)
row format delimited
fields terminated by '\t';
###Insert the data
insert into table dw_pvs_day
select count(1) as pvs,a.year as year,a.month as month,a.day as day from dim_time a
join ods_weblog_detail b
on a.year=b.yearstr and a.month=b.month and a.day=b.day
group by a.year,a.month,a.day;
##PV by browser
drop table dw_pvs_browser;
create table dw_pvs_browser(pvs bigint,browser string,
year string,month string,day string)
row format delimited
fields terminated by '\t';
###Load the data
insert into dw_pvs_browser
select count(1) as pvs, a.browser as browser,
b.yearstr as year,
b.month as month,b.day
as day from dim_browser a
join ods_weblog_detail b
on a.browser=b.http_user_agent
group by a.browser,b.yearstr,month,day order by pvs desc;
##PV by operating system
drop table dw_pvs_os;
create table dw_pvs_os(
pvs bigint,
os string,
year string,
month string,
day string
);
insert into dw_pvs_os
select count(1) as pvs, a.os as os,
b.yearstr as year,
b.month as month,b.day
as day from dim_os a
join ods_weblog_detail b
on a.os=b.os
group by a.os,b.yearstr,month,day order by pvs desc;
##PV by region
drop table dw_pvs_region;
create table dw_pvs_region(pvs bigint,province string,
city string,year string,
month string,day string)
row format delimited
fields terminated by '\t';
###Load the data
insert into dw_pvs_region
select count(1) as pvs,a.province as province,
a.city as city,b.yearstr as year,
b.month as month,b.day as day from dim_region a
join ods_weblog_detail b on
a.province=b.province and a.city=b.city
group by a.province,a.city,b.yearstr,month,day order by pvs desc;
##UV statistics
drop table dw_uv;
create table dw_uv(
uv int,
year varchar(4),
month varchar(2),
day varchar(2)
);
###Load the data
insert into dw_uv
select count(1) as uv,a.yearstr as year,
a.month as month,a.day as day from
(select distinct remote_user,yearstr,month,day from ods_weblog_detail) a
group by a.yearstr,a.month,a.day;
##IP statistics
drop table dw_ip;
create table dw_ip(
ip int,
year varchar(4),
month varchar(2),
day varchar(2)
);
###Load the data
insert into dw_ip
select count(1) as ip,a.yearstr as year,
a.month as month,a.day as day from
(select distinct remote_addr,yearstr,month,day from ods_weblog_detail) a
group by a.yearstr,a.month,a.day;
#Average page views per user
##total page requests / distinct users
drop table dw_avgpv_user_d;
create table dw_avgpv_user_d(
day string,
avgpv string);
###Insert the data
insert into table dw_avgpv_user_d
select '2012-01-04',sum(b.pvs)/count(b.remote_user) from
(select remote_user,count(1) as pvs from ods_weblog_detail where datestr='2012-01-04' group by remote_user) b;
#PV totals by referer
##Grouped by hour
drop table dw_pvs_referer_h;
create table dw_pvs_referer_h(referer_url string,referer_host string,year string,month string,day string,hour string,pv_referer_cnt bigint);
###Insert the data
insert into table dw_pvs_referer_h
select split(http_referer,"\\?")[0],ref_host,yearstr,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,yearstr,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,yearstr asc,pv_referer_cnt desc;
Exporting to MySQL with Sqoop
Create the tables in MySQL, then export the data into them.
#Export the data to be visualized into MySQL
##Tables to create in MySQL
drop table dw_pvs_hour;
create table dw_pvs_hour(
id int primary key auto_increment,
year varchar(4),
month varchar(2),day varchar(2),
hour varchar(2),pvs int);
###Export with sqoop
bin/sqoop export --connect jdbc:mysql://10.211.55.16:3306/log --username root --password root --table dw_pvs_day --columns pvs,year,month,day --export-dir '/user/hive/warehouse/loganalyze.db/dw_pvs_day/' --fields-terminated-by '\t';
drop table dw_pvs_day;
create table dw_pvs_day(
id int primary key auto_increment,
year varchar(4),
month varchar(2),
day varchar(2),
pvs int);
###Export with sqoop
bin/sqoop export --connect jdbc:mysql://10.211.55.16:3306/log --username root --password root --table dw_pvs_browser --columns pvs,browser,year,month,day --export-dir '/user/hive/warehouse/loganalyze.db/dw_pvs_browser/' --fields-terminated-by '\t';
drop table dw_pvs_browser;
create table dw_pvs_browser(
id int primary key auto_increment,
browser varchar(20),
year varchar(4),
month varchar(2),
day varchar(2),
pvs int);
create table dw_pvs_os(
id int primary key auto_increment,
pvs bigint,
os varchar(10),
year varchar(4),
month varchar(2),
day varchar(2)
);
###Export with sqoop
bin/sqoop export --connect jdbc:mysql://10.211.55.16:3306/log --username root --password root --table dw_pvs_region --columns pvs,province,city,year,month,day --export-dir '/user/hive/warehouse/loganalyze.db/dw_pvs_region/' --fields-terminated-by '\t';
drop table dw_pvs_region;
create table dw_pvs_region(
id int primary key auto_increment,
province varchar(20),
city varchar(20),
year varchar(4),
month varchar(2),
day varchar(2),
pvs int);
###UV statistics
drop table dw_uv;
create table dw_uv(
id int primary key auto_increment,
year varchar(4),
month varchar(2),
day varchar(2),
uv int);
##IP statistics
drop table dw_ip;
create table dw_ip(
id int primary key auto_increment,
year varchar(4),
month varchar(2),
day varchar(2),
ip int);
##Average page views per user
drop table dw_avgpv_user_d;
create table dw_avgpv_user_d(
id int primary key auto_increment,
day varchar(12),
avgpv float);
drop table dw_pvs_referer_h;
create table dw_pvs_referer_h(
id int primary key auto_increment,
referer_url varchar(800),
referer_host varchar(200),
year varchar(4),
month varchar(2),
day varchar(2),
hour varchar(2),
pv_referer_cnt bigint);
Visualizing the Data with JavaEE
First create the JavaEE project.
The required pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>cn.edu.hust</groupId>
  <artifactId>ForumLogAnaloyze</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>war</packaging>
  <name>ForumLogAnaloyze Maven Webapp</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <!-- Spring version -->
    <spring.version>4.0.2.RELEASE</spring.version>
    <mybatis.version>3.2.6</mybatis.version>
    <!-- logging versions -->
    <slf4j.version>1.7.7</slf4j.version>
    <log4j.version>1.2.17</log4j.version>
  </properties>
  <dependencies>
    <dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.11</version><scope>test</scope></dependency>
    <dependency><groupId>jstl</groupId><artifactId>jstl</artifactId><version>1.2</version></dependency>
    <!-- Spring core -->
    <dependency><groupId>org.springframework</groupId><artifactId>spring-core</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-web</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-oxm</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-tx</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-jdbc</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-webmvc</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-aop</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-context-support</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework</groupId><artifactId>spring-test</artifactId><version>${spring.version}</version></dependency>
    <dependency><groupId>org.springframework.data</groupId><artifactId>spring-data-redis</artifactId><version>1.6.2.RELEASE</version></dependency>
    <!-- MyBatis core -->
    <dependency><groupId>org.mybatis</groupId><artifactId>mybatis</artifactId><version>${mybatis.version}</version></dependency>
    <!-- MyBatis/Spring integration -->
    <dependency><groupId>org.mybatis</groupId><artifactId>mybatis-spring</artifactId><version>1.2.2</version></dependency>
    <!-- JavaEE API -->
    <dependency><groupId>javax</groupId><artifactId>javaee-api</artifactId><version>7.0</version></dependency>
    <!-- MySQL JDBC driver -->
    <dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>5.1.36</version></dependency>
    <!-- DBCP, used to configure the datasource in applicationContext.xml -->
    <dependency><groupId>commons-dbcp</groupId><artifactId>commons-dbcp</artifactId><version>1.2.2</version></dependency>
    <dependency><groupId>commons-lang</groupId><artifactId>commons-lang</artifactId><version>2.0</version></dependency>
    <!-- logging: log start -->
    <dependency><groupId>log4j</groupId><artifactId>log4j</artifactId><version>${log4j.version}</version></dependency>
    <!-- object formatting for log output -->
    <dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId><version>1.1.41</version></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId><version>${slf4j.version}</version></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId><version>${slf4j.version}</version></dependency>
    <!-- log end -->
    <!-- JSON -->
    <dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-mapper-asl</artifactId><version>1.9.13</version></dependency>
    <!-- Redis -->
    <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId><version>2.9.0</version></dependency>
    <!-- file upload -->
    <dependency><groupId>commons-fileupload</groupId><artifactId>commons-fileupload</artifactId><version>1.3.1</version></dependency>
    <dependency><groupId>commons-io</groupId><artifactId>commons-io</artifactId><version>2.4</version></dependency>
    <dependency><groupId>commons-codec</groupId><artifactId>commons-codec</artifactId><version>1.9</version></dependency>
    <dependency><groupId>org.apache.ant</groupId><artifactId>ant</artifactId><version>1.9.1</version></dependency>
    <dependency><groupId>org.aspectj</groupId><artifactId>aspectjweaver</artifactId><version>1.9.1</version></dependency>
    <!-- https://mvnrepository.com/artifact/com.google.code.gson/gson -->
    <dependency><groupId>com.google.code.gson</groupId><artifactId>gson</artifactId><version>2.8.2</version></dependency>
  </dependencies>
  <build>
    <resources>
      <resource>
        <directory>src/main/java</directory>
        <includes>
          <include>**/*.properties</include>
          <include>**/*.xml</include>
        </includes>
        <filtering>false</filtering>
      </resource>
    </resources>
    <plugins>
      <plugin>
        <groupId>org.mybatis.generator</groupId>
        <artifactId>mybatis-generator-maven-plugin</artifactId>
        <version>1.3.2</version>
        <configuration>
          <!-- location of the generator config file -->
          <configurationFile>src/main/resources/generatorConfig.xml</configurationFile>
          <verbose>true</verbose>
          <overwrite>true</overwrite>
        </configuration>
        <executions>
          <execution>
            <id>Generate MyBatis Artifacts</id>
            <goals><goal>generate</goal></goals>
          </execution>
        </executions>
        <dependencies>
          <dependency><groupId>org.mybatis.generator</groupId><artifactId>mybatis-generator-core</artifactId><version>1.3.2</version></dependency>
        </dependencies>
      </plugin>
    </plugins>
  </build>
</project>
The project structure is as follows.
The full implementation is available on my GitHub (here).
Final Results
PV by hour
Visitor operating-system statistics
Visitor browser share
Visitor region statistics