文章目录

数据分析平台介绍
- 网站流量分析项目的一些问题
1．数据处理流程
- 1.数据采集
- 2.数据预处理
- 3.数据入库
- 4.数据分析(ETL)
- 5.数据展现
2.mr程序进行数据预处理理论方面
3.流量分析常见分类
- 骨灰级指标
- - IP
  - PV
  - UV
- 基础级指标
- 复合级指标
- 基础分析（PV,IP,UV）
4.统计分析
- 建表
- - 事实表设计
  - 维度表设计
  - 访问日志明细宽表
- 以下ETL需求小结
- 1.流量分析
- - 1.1多维度统计PV总量
  - - 按时间维度
    - 按终端维度
  - 1.2按referer维度（按照来访维度统计pv）
  - 1.3统计pv总量最大的来源TOPN (分组TOP)
  - 1.4人均浏览页数
- 2.受访分析
- - 2.1 各页面访问统计（各页面PV）
  - 2.2热门页面统计
- 3.访客分析
- 4.访客Visit分析（点击流模型）
- 5.关键路径转化率分析（漏斗模型）

数据分析平台介绍

非结构化数据: 视频、音频、图片
半结构化数据：如网站流量日志要做数据预处理 处理成结构化数据

网站流量分析项目的一些问题

网站流量分析项目中，数据采集环节使用的技术是?

flume

网站流量分析项目中，数据预处理环节使用的技术是？

MapReduce

网站流量分析项目中，建设数仓使用的技术是？

hive

网站流量分析项目中，数据导出环节使用的技术是？

sqoop

建设数仓，首先要确定业务指标。网站流量分析项目中，说说你知道复合级指标.

+ 人均浏览的页数：人均浏览页数=浏览次数/独立访客，体现网站对访客的吸引程度
+ 跳出率：指某一范围内单页访问次数与总访问次数的百分比
+ 退出率：退出的访问者与综合访问量的百分比

网站流量分析中，基础分析有哪些？

+ 趋势分析：通过流量趋势的变化形态，为分析网站访客访问规律、网站发展状况提供参考
+ 对比分析：根据选定的两个对比时段，提供流量在时间上的纵向对比报表，进而发现网站发展状况、发展规律、流量变化率等
+ 当前在线：提供当前时刻站点上的访客量，以及15分钟内的流量、来源、受访等变化情况，方便用户及时了解当前网站流量状况
+ 访问明细：提供最近7天的访客访问记录，可安每个PV或访问行为显示

网站流量分析中，来源分析有哪些？

来源分类、搜索引擎、搜索词、最近7日的访客搜索记录、来路域名、来路页面、来源升降榜等

网站流量分析中，受访分析有哪些？

受访域名、受访页面、受访升降榜、热点图、用户视点、访问轨迹等

网站日志流量分析中，访客分析有哪些？

地区运营商、终端详情、新老访客、忠诚度、活跃度

网站流量分析项目中，整个过程的流程调度使用的技术是?

azkaban

什么是事实表？

可以理解为发生在现实生活中的操作型事件，比如一次购买行为就是一个事实，下订单就是一个事实。由若干维度和度量组成的事实：可以理解为一个动词, 主题，比如下订单事实表里面包含维度的主键 需要内容关联查即可事实表中的每行数据代表一个业务事件（下单、支付、退款、评价等）。
“事实”这个术语表示的是业务事件的度量值（可统计次数、个数、金额等），例如，订单事件中的下单金额。
每个事实表的行包括：具有可加性的数值型的度量值、与维度表相连接的外键、通常具有两个以上的外键、外键之间表示维度表之间多对多的关系。事实表的特点：
1、行数多
2、列少

什么是维度表？

维度：一个名词，group by后面通常可跟维度比如从地区角度看哪个地区的销售额最多，地区就是一个维度，维度信息存储的表就是维度表，存储的一般是对事实描述的信息维度 数据量小 比较固定（如地域维度，再比如用户 相对于下订单的事实）骨灰级指标、基础指标、复合级指标都是维度维度表的特征：
1、范围很宽（列比较多）；
2、跟事实表相比，行数比较少；
3、内容相对固定。

维度建模的三种模式

+ 星型模式星型模式是以事实表为中心，所有的维度表直接连接在事实表上，像星星一样.维表只和事实表关联，维表之间没有关联.每个维表主键为单列，且该主键放置在事实表中，作为两边连接的外键.
+ 雪花模式在星型模式的基础上，维度表可以拥有其他维度表这种模型不太容易理解，维护成本比较高性能方面需要关联多层维表，性能也比星型模型要低。所以一般不是很常用
+ 星座模式星型模式是基于一张事实表的，而星座模式是基于多张事实表的，而且共享维度信息。

维度建模有什么好处？为什么有时候不严格遵守维度建模的原则组织事实表和维度表？

+ 面向业务主题，使得数据模型非常好理解
+ 数据冗余小（因为很多具体的信息都存在相应的维度表中了）
+ 结构清晰（表结构一目了然）增加使用成本，比如查询时要关联多张表对于一些高频的字段，频繁的做查询会非常拖累性能，所以我们经常把这些高频字段放到事实表

1．数据处理流程

1.数据采集

数据采集概念，目前行业会有两种解释：一是数据从无到有的过程（web服务器打印的日志、自定义采集的日志等）叫做数据采集；另一方面也有把通过使用Flume等工具把数据采集到指定位置的这个过程叫做数据采集。
关于具体含义要结合语境具体分析，明白语境中具体含义即可。

2.数据预处理

通过mapreduce程序对采集到的原始日志数据进行预处理，比如清洗、滤除脏数据，结构化，格式整理（标准化），等，并且梳理成点击流模型数据。

3.数据入库

将预处理之后的数据导入到HIVE仓库中相应的库和表中。

4.数据分析(ETL)

项目的核心内容，即根据需求开发ETL分析语句，得出各种统计结果。

5.数据展现

将分析所得数据进行数据可视化，一般通过图表进行展示。

市面上有许多开源的数据可视化软件、工具。比如Echarts.

2.mr程序进行数据预处理理论方面

我们采集的日志是半结构化数据，所以需要做预处理，处理成结构化数据

将采集的日志处理成如下图表（原始访问日志表）

思路：通过io把日志流拿到，然后读每一行，以空格分隔开，再进行清洗

3.流量分析常见分类

骨灰级指标

IP

1天之内，访问网站的不重复IP数。一天内相同IP地址多次访问网站只被计算1次。曾经IP指标可以用来表示用户访问身份，目前则更多的用来获取访问者的地理位置信息。

PV

PageView浏览量: 即通常说的PV值，用户每打开1个网站页面，记录1个PV。用户多次打开同一页面PV累计多次。通俗解释就是页面被加载的总次数。

UV

Unique PageView: 1天之内，访问网站的不重复用户数（以浏览器cookie为依据），一天内同一访客多次访问网站只被计算1次, 不同用户的点击次数访问次数。

基础级指标

访问次数：访客从进入网站到离开网站的一系列活动记为一次访问，也称会话(session),1次访问(会话)可能包含多个PV。
网站停留时间：访问者在网站上花费的时间。
页面停留时间：访问者在某个特定页面或某组网页上所花费的时间。

复合级指标

人均浏览页数：平均每个独立访客产生的PV。人均浏览页数=浏览次数/独立访客。体现网站对访客的吸引程度。
跳出率:指某一范围内单页访问次数或访问者与总访问次数的百分比。其中跳出指单页访问或访问者的次数，即在一次访问中访问者进入网站后只访问了一个页面就离开的数量。
退出率:指某一范围内退出的访问者与综合访问量的百分比。其中退出指访问者离开网站的次数，通常是基于某个范围的。

基础分析（PV,IP,UV）

趋势分析：根据选定的时段，提供网站流量数据，通过流量趋势变化形态，为您分析网站访客的访问规律、网站发展状况提供参考。

对比分析：根据选定的两个对比时段，提供网站流量在时间上的纵向对比报表，帮您发现网站发展状况、发展规律、流量变化率等。

当前在线：提供当前时刻站点上的访客量，以及最近15分钟流量、来源、受访、访客变化情况等，方便用户及时了解当前网站流量状况。

访问明细：提供最近7日的访客访问记录，可按每个PV或每次访问行为（访客的每次会话）显示，并可按照来源、搜索词等条件进行筛选。通过访问明细，用户可以详细了解网站流量的累计过程，从而为用户快速找出流量变动原因提供最原始、最准确的依据。

4.统计分析

建表

事实表设计

维度表设计

-- 创建原始日志数据表
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';-- 创建点击流模型pageviews表
create table ods_click_pageviews(
session string,
remote_addr string,
remote_user string,
time_local string,
request string,
visit_step string,
page_staylong string,
http_referer string,
http_user_agent string,
body_bytes_sent string,
status string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';-- 创建点击流visit模型表
create table ods_click_stream_visit(
session     string,
remote_addr string,
inTime      string,
outTime     string,
inPage      string,
outPage     string,
referal     string,
pageVisits  int)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';load data local inpath '/export/servers/webLog/ods_weblog_origin' overwrite into table
ods_weblog_origin partition(datestr='2020-10-17');load data local inpath '/export/servers/webLog/ods_click_pageviews' overwrite into table
ods_click_pageviews partition(datestr='2020-10-17');load data local inpath '/export/servers/webLog/ods_click_stream_visit' overwrite into table
ods_click_stream_visit partition(datestr='2020-10-17');

访问日志明细宽表

建明细表ods_weblog_detailcreate table ods_weblog_detail(
valid           string, --有效标识
remote_addr     string, --来源IP
remote_user     string, --用户标识
time_local      string, --访问完整时间
daystr          string, --访问日期
timestr         string, --访问时间
month           string, --访问月
day             string, --访问日
hour            string, --访问时
request         string, --请求的url
status          string, --响应码
body_bytes_sent string, --传输字节数
http_referer    string, --来源url
ref_host        string, --来源的host
ref_path        string, --来源的路径
ref_query       string, --来源参数query
ref_query_id    string, --来源参数query的值
http_user_agent string --客户终端标识
)
partitioned by(datestr string);通过查询插入数据到明细宽表  ods_weblog_detail中分步：
--抽取refer_url到中间表  t_ods_tmp_referurl
--也就是将来访url分离出host  path  query  query iddrop table if exists t_ods_tmp_referurl;create table t_ods_tmp_referurl as
SELECT a.*,b.*
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace (http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id')
b as host, path, query, query_id; --抽取转换time_local字段到中间表明细表 t_ods_tmp_detailcreate table tt11tt as
SELECT a.*,b.*
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace (http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id')
b as host, path, query, query_id; drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as
select b.*,substring(time_local,0,10) as daystr,
substring(time_local,12) as tmstr,
substring(time_local,6,2) as month,
substring(time_local,9,2) as day,
substring(time_local,12,2) as hour
From t_ods_tmp_referurl b;以上语句可以改写成：
insert into table ods_weblog_detail partition(datestr='2020-10-17')
select c.valid,c.remote_addr,c.remote_user,c.time_local,
substring(c.time_local,0,10) as daystr,
substring(c.time_local,12) as tmstr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,11,3) as hour,
c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
from
(SELECT
a.valid,a.remote_addr,a.remote_user,a.time_local,
a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
FROM ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b asref_host, ref_path, ref_query, ref_query_id) c;show partitions ods_weblog_detail;

以下ETL需求小结

# 1.流量分析## 1.1多维度统计PV总量按时间维度：group by 年月日时，count统计按终端维度：like筛选终端即可## 1.2按refer维度，按照来访维度统计1.每小时各来访url的PV量：group by 月日时 url，count即可，一般host不为空2.每小时各来访host的PV量：group by 时间、host，也是host不为空## 1.3统计PV总量最大的来源topN开窗函数根据上述统计的各url来访PV量，partition by 时间，对PV量排序 取 topN即可## 1.4人均浏览页数人均浏览页数 = 总页面请求数 / 去重后的总人数# 2. 受访分析## 2.1 各页面访问统计（各页面的PV）group by url串 count统计 ## 2.2 热门页面统计统计某个分区里受访页面的top10： where 分区，group by url串，排序分页取10统计每日最热门页面的top10： group 到日、url串 对count() 排序分页取10 # 3.访客分析## 按时间维度统计独立访客及其产生的PV量： group by 时间，ip## 统计每天新访客： 新旧两表左连接，旧表ip是空的，即为新访客# 4.访客visit分析（点击流模型）## 回头/单次访客统计： group ip 次数大于1## 查询今日所有回头访客及其访问次数：# 5.关键路径转化率分析（漏斗模型）## 查询每一个步骤的总访问人数union all 联合查询 分别查每一步的总人数，group步骤## 查询每一步骤相对于路径起点人数的比例表自连接，条件第一步每一步相对起点人数比例 = 每一步的人数/第一步的人数## 查询每一步骤相对于上一步骤的漏出率

1.流量分析

1.1多维度统计PV总量

按时间维度

 -- --计算每小时pvs（页面被加载的总次数），注意gruop by语法
select count(*) as pvs, month, day, hour from ods_weblog_detail group by month, day, hour;-- 每小时的pvscreate table dw_pvs_everyhour_oneday(month string, day string, hour string, pvs bigint) partitioned by(datestr string);insert into table dw_pvs_everyhour_oneday partition(datestr='20201015')
select a.month as month, a.day as day, a.hour as hour, count(*) as pvs from ods_weblog_detail a
where a.datestr='20201015' group by a.month,a.day,a.hour;-- 每天的pvscreate table if not exists dw_pvs_everyday(pvs bigint,month string,day string);insert into table dw_pvs_everyday
select count(*) as pvs, a.month as month, a.day as day from ods_weblog_detail a
group by a.month, a.day;

按终端维度

 select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200;

1.2按referer维度（按照来访维度统计pv）

 -- 统计每小时各来访url产生的pv量create table dw_pvs_referer_everyhour(referer_url string, referer_host string, month string, day string, hour string, pv_referer_cnt bigint) partitioned by(datestr string);insert into table dw_pvs_referer_everyhour partition(datestr='20201015') select http_referer, ref_host, month, day, hour, count(1) as pv_referer_cntfrom ods_weblog_detail group by http_referer,ref_host,month,day,hour having ref_host is not nullorder by hour asc,day asc,month asc,pv_referer_cnt desc;drop table dw_pvs_referer_everyhour;-- --统计每小时各来访host的产生的pv数并排序
drop table dw_pvs_refererhost_everyhour;create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);insert into table dw_pvs_refererhost_everyhour partition(datestr='20201015')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

1.3统计pv总量最大的来源TOPN (分组TOP)

--需求：按照时间维度，统计一天内各小时产生最多pvs的来源topN
分组求topN，先分组，再求每组内的topN--row_number函数
select ref_host,ref_host_cnts,concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_pvs_refererhost_everyhour;--综上可以得出
drop table dw_pvs_refhost_topn_everyhour;create table dw_pvs_refhost_topn_everyhour(
hour string,
toporder string,
ref_host string,
ref_host_cnts string
)partitioned by(datestr string);insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20201015')
select t.hour,t.od,t.ref_host,t.ref_host_cnts from(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_pvs_refererhost_everyhour) t where od<=3;

1.4人均浏览页数

--需求描述：统计今日所有来访者平均请求的页面数。
--总页面请求数/去重总人数
drop table dw_avgpv_user_everyday;create table dw_avgpv_user_everyday(day string, avgpv string);insert into table dw_avgpv_user_everyday
select '20201015',sum(b.pvs)/count(b.remote_addr)
from
(
select remote_addr, count(1) as pvs from ods_weblog_detail where datestr='20201015'
group by remote_addr
) b;

2.受访分析

2.1 各页面访问统计（各页面PV）

select request as request,count(1) as request_counts from
ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;

2.2热门页面统计

-- 统计20201015这个分区里面的受访页面的top10
drop table dw_hotpages_everyday;create table dw_hotpages_everyday(day string,url string,pvs string);insert into table dw_hotpages_everyday
select '20201015',a.request,a.request_counts from
(
select request as request,count(request) as request_counts
from ods_weblog_detail
where datestr='20201015'
group by request
having request is not null
) a
order by a.request_counts desc limit 10;-- 统计每日最热门页面的top10
select a.month,a.day,a.request ,concat(a.month,a.day),a.total_request
from (
select month,day, request,count(1) as total_request
from ods_weblog_detail
where datestr = '20130918'
group by  request ,month ,day
having request is not null
order by total_request desc limit 10
) a；

3.访客分析

--需求：按照时间维度来统计独立访客及其产生的pv量按照时间维度比如小时来统计独立访客及其产生的 pv 。时间维度：时
drop table dw_user_dstc_ip_h;create table dw_user_dstc_ip_h(
remote_addr string,
pvs      bigint,
hour     string);insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
Where datestr='20201015'
group by concat(month,day,hour),remote_addr;--在上述基础之上，可以继续分析，比如每小时独立访客总数
select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;时间维度：日
select remote_addr,count(1) as counts,concat(month,day) as day
from ods_weblog_detail
Where datestr='20201015'
group by concat(month,day),remote_addr;时间维度： 月
select remote_addr,count(1) as counts,month
from ods_weblog_detail
group by month,remote_addr;-- 每日新访客
-- 需求：将每天的新访客统计出来--历日去重访客累积表
drop table dw_user_dsct_history;create table dw_user_dsct_history(
day string, ip string) partitioned by(datestr string);select a.remote_addr ,a.day
from (
select  remote_addr,'20201015' as day
from ods_weblog_detail newIp
where datestr ='20201015'
group by remote_addr
) a
left join dw_user_dsct_history hist
on a.remote_addr = hist.ip
where hist.ip is null;--每日新访客表
drop table dw_user_new_d;create table dw_user_new_d (
day string, ip string)
partitioned by(datestr string);-- 每日新用户插入新访客表
insert into table dw_user_new_d partition(datestr='20201015')
select tmp.day as day,tmp.today_addr as new_ip
from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
from
(
select distinct remote_addr as remote_addr,"20201015" as day
from ods_weblog_detail where datestr="20201015"
) today
left join
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;--每日新用户追加到历史累计表
insert into table dw_user_dsct_history partition(datestr='20201015')
select day,ip from dw_user_new_d where datestr='20201015';验证：
select count(distinct remote_addr) from ods_weblog_detail;select count(1) from dw_user_dsct_history where datestr='20201015';select count(1) from dw_user_new_d where datestr='20201015';

4.访客Visit分析（点击流模型）

--  回头/单次访客统计
select remote_addr ,count(remote_addr) ipcount
from  ods_click_stream_visit
group by remote_addr
having ipcount > 1 -- 查询今日所有回头访客及其访问次数。
drop table dw_user_returning;create table dw_user_returning(
day string,
remote_addr string,
acc_cnt string)
partitioned by (datestr string);insert overwrite table dw_user_returning partition(datestr='20201015')
select tmp.day,tmp.remote_addr,tmp.acc_cnt
from
(select '20201015' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit
group by remote_addr) tmp
where tmp.acc_cnt>1;-- 人均访问频次,使用所有的独立访问的人，即独立的session个数除以所有的去重IP即可
-- 人均访问的频次，频次表示我们来了多少个session
--  次数都是使用session来进行区分，一个session就是表示一次
select count(session)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20201015';select count(1)
from ods_click_stream_visit
where datestr ='20201015'-- 人均页面浏览量，所有的页面点击次数累加除以所有的独立去重IP总和即可
select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20201015';

5.关键路径转化率分析（漏斗模型）

求两个指标：第一个指标：每一步相对于第一步的转化率第二个指标：每一步相对于上一步的转化率# 使用模型生成的数据，可以满足我们的转化率的求取
load data inpath '/weblog/clickstream/pageviews/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');----------------------------------------------------------
---1、查询每一个步骤的总访问人数Step1、  /item                     1000          相对上一步        相对第一步   1000
Step2、  /category                 800             0.8              0.8         1800
Step3、  /index                    500             0.625            0.5          2300
Step4、  /order                    100             0.2              0.1          2400create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/item%'
union all
select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews
where datestr='20130920'
and request like '/category%'
union all
select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920'
and request like '/order%'
union all
select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920'
and request like '/index%';+---------------------+----------------------+--+
| dw_oute_numbs.step  | dw_oute_numbs.numbs  |
+---------------------+----------------------+--+
| step1               | 1029                 |
| step2               | 1029                 |
| step3               | 1028                 |
| step4               | 1018                 |
+---------------------+----------------------+--+----------------------------------------------------------------------------
--2、查询每一步骤相对于路径起点人数的比例
--级联查询，自己跟自己joinselect rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr;自join后结果如下图所示：
每一步相对于第一步的转化率+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step2   | 1029     |
| step3   | 1028     | step2   | 1029     |
| step4   | 1018     | step2   | 1029     |
| step1   | 1029     | step3   | 1028     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step3   | 1028     |
| step4   | 1018     | step3   | 1028     |
| step1   | 1029     | step4   | 1018     |
| step2   | 1029     | step4   | 1018     |
| step3   | 1028     | step4   | 1018     |
| step4   | 1018     | step4   | 1018     |
+---------+----------+---------+----------+--+过滤只取step1的所有的数据
select tempTab.rnnumbs/tempTab.rrnumbs from (
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr where rr.step = 'step1'
) tempTab;--每一步的人数/第一步的人数==每一步相对起点人数比例
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';--------------------------------------------------------------------------------
--3、查询每一步骤相对于上一步骤的漏出率
--首先通过自join表过滤出每一步跟上一步的记录select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;select newTable.rnnumbs/newTable.rrnumbs from (
select * from (
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
) tmpTable
where  cast(substr(tmpTable.rrStep,5,1) as int ) =  cast(substr(tmpTable.rnstep,5,1) as int )-1
) newTable where temTable.rrstep.截串  >= temTable.rnstep.截串注意：cast为hive的内置函数，主要用于类型的转换
用例：
select  cast(1 as  float);
select  cast('2018-06-22' as date);+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step4   | 1018     |
+---------+----------+---------+----------+--+--然后就可以非常简单的计算出每一步相对上一步的漏出率
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
from dw_oute_numbs rn
inner join
dw_oute_numbs rr
) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;-----------------------------------------------------------------------------------
--4、汇总以上两种指标
select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;

数据分析---网站日志流分析相关推荐

大数据离线---网站日志流量分析系统（1）---简介及框架
本次介绍网站日志流量分析系统,首先是简介和架构.后面会对架构中需要的每个模块的进行逐个介绍.本篇主要分为两个部分网站日志流量分析系统简介整体技术流程和架构 1. 网站日志流量分析系统简介 1.1点 ...
宝塔面板下载网站日志，分析百度蜘蛛和谷歌蜘蛛的爬取记录
准备工作:安装好宝塔面板.一个能正常访问的网站.一个在线网站分析工具 1.打开宝塔面板--文件--根目录--www--wwwlogs--域名.log(这个就是网站的日志文件)--选中需要下载的网站日志 ...
网站日志流量分析系统之（日志收集）
一.概述网站日志流量分析系统之(日志埋点)这里我们已经将相关数据通过ajax发送至日志服务器,这里我只用了一台日志服务器(本机Windows环境),日志收集主要分为以下几个步骤: ①日志服务器集结合 ...
Day25_20180602_网站日志流量分析项目06：数据分析的实现及任务调度框架
一.数据仓库模型 | valid | string | | | ip | string | | | user_id | string | | | time | string | | | daystr ...
数据分析网站-竞品分析数据来源
产品汪经常需要做竞品分析,要用到许多专业的图表和数据.数据从何而来?需要我们利用专业的数据研究网站来搜集. 常用的数据搜集网站: 1.百度指数网址:http://index.baidu.com/ 主 ...
大数据离线---网站日志流量分析系统---日志数据原始信息
这里式日志数据的原始信息 194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/r ...
MOOC网站日志分析
网站日志分析是针对万网云虚拟主机存储的日志数据提供的一站式多维度.可视化的一系列数据分析报表,虚机客户可以通过这些分析报表来查看近期站点的流量趋势.洞察站点用户行为.用户地域属性等. MOOC网站日志 ...
阿里云大数据认证——MOOC网站日志分析-课堂笔记
阿里云Clouder认证四.MOOC网站日志分析 1. 课程目标 (1) 了解网站日志分析的意义 (2) 了解网站日志如何搜集 (3) 使用编程语言对日志数据进行处理 (4) 使用编MaxCompu ...
日志分析系统分类有哪些_SEO优化中网站日志起到的重要性作用
在网站优化中,有一点是优化者避免不了的--网站日志的分析.网站日志分析是一个成功的SEO大师的日常优化工作的一部分,通过网站数据的分析,可以更清晰地知道网站的现状,可以挖掘网站潜在的问题,做好优化改善 ...

数据分析---网站日志流分析