注意:pig中用run或者exec 运行脚本。除了cd和ls,其他命令不用。在本代码中用rm和mv命令做例子,容易出错。

另外,pig只有在store或dump时候才会真正加载数据,否则,只是加载代码,不具体操作数据。所以在rm操作时必须注意该文件是否已经生成。如果rm的文件为生成,可以第三文件,进行mv改名操作

SET job.name 'test_age_reporth_istorical';-- 定义任务名字,在http://172.XX.XX.XX:50030/jobtracker.jsp中查看任务状态,失败成功。

SET job.priority HIGH;--优先级

--注册jar包,用于读取sequence file和输出分析结果文件
REGISTER piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); --读取二进制文件,函数名定义

%default Cleaned_Log /user/C/data/XXX/cleaned/$date/*/part* --$date是外部传入参数

%default AD_Data /user/XXX/data/xxx/metadata/ad/part*
%default Campaign_Data /user/xxx/data/xxx/metadata/campaign/part*
%default Social_Data /user/xxx/data/report/socialdata/part*

--所有的输出文件路径:
%default Industry_Path $file_path/report/historical/age/$year/industry
%default Industry_SUM $file_path/report/historical/age/$year/industry_sum
%default Industry_TMP $file_path/report/historical/age/$year/industry_tmp

%default Industry_Brand_Path $file_path/report/historical/age/$year/industry_brand
%default Industry_Brand_SUM $file_path/report/historical/age/$year/industry_brand_sum
%default Industry_Brand_TMP $file_path/report/historical/age/$year/industry_brand_tmp

%default ALL_Path $file_path/report/historical/age/$year/all
%default ALL_SUM $file_path/report/historical/age/$year/all_sum
%default ALL_TMP $file_path/report/historical/age/$year/all_tmp

%default output_path /user/xxx/tmp/result

origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',') --读取日志文件
AS (ad_network_id:chararray,
    xxx_ad_id:chararray,
    guid:chararray,
    id:chararray,
    create_time:chararray,
    action_time:chararray,
    log_type:chararray, 
    ad_id:chararray,
    positioning_method:chararray,
    location_accuracy:chararray,
    lat:chararray, 
    lon:chararray,
    cell_id:chararray,
    lac:chararray,
    mcc:chararray,
    mnc:chararray,
    ip:chararray,
    connection_type:chararray,
    android_id:chararray,
    android_advertising_id:chararray,
    openudid:chararray,
    mac_address:chararray,
    uid:chararray,
    density:chararray,
    screen_height:chararray,
    screen_width:chararray,
    user_agent:chararray,
    app_id:chararray,
    app_category_id:chararray,
    device_model_id:chararray,
    carrier_id:chararray,
    os_id:chararray,
    device_type:chararray,
    os_version:chararray,
    country_region_id:chararray,
    province_region_id:chararray,
    city_region_id:chararray,
    ip_lat:chararray,
    ip_lon:chararray,
    quadkey:chararray);

--loading metadata/ad(adId,campaignId) 
metadata_ad = LOAD '$AD_Data' USING PigStorage(',') AS (adId:chararray, campaignId:chararray);

--loading metadata/campaign数据(campaignId, industryId, brandId)
metadata_campaign = LOAD '$Campaign_Data' USING PigStorage(',') AS (campaignId:chararray, industryId:chararray, brandId:chararray);

--ad and campaign for inner join
joinAdCampaignByCampaignId = JOIN metadata_ad BY campaignId,metadata_campaign BY campaignId;--(adId,campaignId,campaignId,industryId,brandId)
--filtering out redundant column of joinAdCampaignByCampaignId
joined_ad_campaign_data = FOREACH joinAdCampaignByCampaignId GENERATE $0 AS adId,$3 AS industryId,$4 AS brandId; --(adId,industryId,brandId)

--extract column for analyzing
origin_historical_age = FOREACH origin_cleaned_data GENERATE xxx_ad_id,guid,log_type;--(xxx_ad_id,guid,log_type)
--distinct
distinct_origin_historical_age = DISTINCT origin_historical_age;--(xxx_ad_id,guid,log_type)

--loading metadata_region(guid_social, sex, age, income, edu, hobby)
metadata_social = LOAD '$Social_Data' USING PigStorage(',') AS (guid_social:chararray, sex:chararray, age:chararray, income:chararray, edu:chararray, hobby:chararray);
--extract needed column in metadata_social
social_age = FOREACH metadata_social GENERATE guid_social,age;

--join socialData(metadata_social) and logData(distinct_origin_historical_age):
joinedByGUID = JOIN social_age BY guid_social, distinct_origin_historical_age BY guid;
--(guid_social, age; xxx_ad_id,guid,log_type)

--generating analyzing age data
joined_orgin_age_data = FOREACH joinedByGUID GENERATE xxx_ad_id,guid,log_type,age;
joinedByAdId = JOIN joined_ad_campaign_data BY adId, joined_orgin_age_data BY xxx_ad_id; --(adId,industryId,brandId,xxx_ad_id,guid,log_type,age)
--filtering
all_current_data = FOREACH joinedByAdId GENERATE guid,log_type,industryId,brandId,age; --(guid,log_type,industryId,brandId,age)

--for industry analyzing
industry_current_data = FOREACH all_current_data GENERATE industryId,guid,age,log_type;  --(industryId,guid,age,log_type)

--load all in the path "industry"
industry_existed_Data = LOAD '$Industry_Path' USING PigStorage(',') AS (industryId:chararray,guid:chararray,age:chararray,log_type:chararray);

--merge with history data 
union_Industry = UNION industry_existed_Data, industry_current_data;
distict_union_industry = DISTINCT union_Industry;
group_industry = GROUP distict_union_industry BY ($2,$0,$3);
count_guid_for_industry = FOREACH group_industry GENERATE FLATTEN(group),COUNT($1.$1);

rm $Industry_SUM;
STORE count_guid_for_industry INTO '$Industry_SUM' USING PigStorage(',');

--storing union industry data(current and history)
STORE distict_union_industry INTO '$Industry_TMP' USING PigStorage(',');
rm $Industry_Path
mv $Industry_TMP $Industry_Path

--counting guid for industry and brand 
industry_brand_current = FOREACH all_current_data GENERATE age,industryId,brandId,log_type,guid;
--(age,industryId,brandId,log_type,guid)

--load history data of industry_brand
industry_brand_history = LOAD '$Industry_Brand_Path' USING PigStorage(',') AS(age:chararray, industryId:chararray, brandId:chararray, log_type:chararray, guid:chararray);

--union all data of industry_brand
union_industry_brand = UNION industry_brand_current,industry_brand_history;
unique_industry_brand = DISTINCT union_industry_brand;
--(age,industryId,brandId,log_type,guid)

--counting users' number for industry and brand
group_industry_brand = GROUP unique_industry_brand BY ($0,$1,$2,$3);
count_guid_for_industry_brand = FOREACH group_industry_brand GENERATE FLATTEN(group),COUNT($1.$4);

rm $Industry_Brand_SUM;
STORE count_guid_for_industry_brand INTO '$Industry_Brand_SUM' USING PigStorage(',');

STORE unique_industry_brand INTO '$Industry_Brand_TMP' USING PigStorage(',');
rm $Industry_Brand_Path;
mv $Industry_Brand_TMP $Industry_Brand_Path

--counting user number for age and logtype
current_data = FOREACH all_current_data GENERATE age,log_type,guid;--(age,log_type,guid)

--load history data of age and logtype
history_data = LOAD '$ALL_Path' USING PigStorage(',') AS(age:chararray,log_type:chararray,guid:chararray);

--union current and history data
union_all_data = UNION history_data, current_data;
unique_all_data = DISTINCT union_all_data;

--count users' number
group_all_data = GROUP unique_all_data BY ($0,$1);
count_guid_for_age_logtype = FOREACH group_all_data GENERATE FLATTEN(group),COUNT($1.$2);

rm $ALL_SUM;
STORE count_guid_for_age_logtype INTO '$ALL_SUM' USING PigStorage(',');

STORE unique_all_data INTO '$ALL_TMP' USING PigStorage(',');
rm $ALL_Path
mv $ALL_TMP $ALL_Path

pig简单的代码实例:报表统计行业中的点击和曝光量相关推荐

  1. ul、li列表简单实用代码实例

    利用ul和li可以实现列表效果,下面就是一个简单的演示. 代码如下: 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 ...

  2. python封装sql脚本_pymysql的简单封装代码实例

    这篇文章主要介绍了pymysql的简单封装代码实例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下 #coding=utf-8 #!/usr/bin/ ...

  3. 简单python代码实例_求简洁优美的python代码例子、片段、参考资料

    展开全部 建2113议你去看一本书:<计算机程序5261的构造与解释>.里面4102用的语言是Scheme,一种Lisp的方言.通1653过这本书学习程序的抽象.封装,以及重要的函数式编程 ...

  4. 计算机C语言代码实例:统计0~9出现的次数

    统计0~9出现的次数 输出格式为 数字:次数 代码: #include<stdio.h> int main() {const int number=10;int i,x;int count ...

  5. 代码+实例:深度学习中的“轴”全解

    ©PaperWeekly 原创 · 作者|海晨威 学校|同济大学硕士生 研究方向|自然语言处理 在深度学习中,轴,指的就是张量的层级,一般通过参数 axis/dim 来设定.很多张量的运算.神经网络的 ...

  6. 将Doc或者Docx文档处理成html的代码逻辑;统计word中的字数,段数,句数,读取word中文档内容的代码逻辑...

    将Doc或者Docx文档处理成html的代码逻辑 下面是maven的配置代码: <!-- 文档处理所需的jar的依赖 --><dependency><groupId> ...

  7. 将Doc或者Docx文档处理成html的代码逻辑;统计word中的字数,段数,句数,读取word中文档内容的代码逻辑

    将Doc或者Docx文档处理成html的代码逻辑 下面是maven的配置代码: <!-- 文档处理所需的jar的依赖 --><dependency><groupId> ...

  8. UFT开发代码实例:将Excel中的数据保存为数组

    2019独角兽企业重金招聘Python工程师标准>>> 1.核心代码 '读取Excel表格到一个数组的函数 ' 输入: ' sFileName: Excel文件 ' sSheetNa ...

  9. Matplotlib实例教程 | 统计DataFrame中文本长度分布(条形统计图)

最新文章

  1. c mysql安装教程 pdf_MySQL下载安装、配置与使用教程详细版(win7x64)
  2. 各种编码范围总结以及linux下面的编码批量转化
  3. Google发布文档数据库Firestore
  4. mysql数据库设计原则_mysql数据库设计总结
  5. Qt4_实现Edit菜单
  6. 数据存储的问题(1)
  7. 基于canvas的前端图片压缩
  8. mysql中in和exists区别
  9. Java Lambda
  10. c语言电子时钟课程设计报告,电子时钟嵌入式课程设计报告
  11. 统计文章中的单词数量
  12. 用腾讯云COS制作个人图床
  13. 电脑加装内存条的教程
  14. Caused by: java.lang.NoSuchMethodError:No virtual method isSuccess()Z in class Lretrofit2/Response;
  15. PHP反射ReflectionClass、ReflectionMethod
  16. 容得下生命‬的不完美,也经得起世事的颠簸,将人生的一切都根植于生活
  17. 阿里云首席科学家章文嵩(正明)离职,大牛技术一览
  18. QQ Linux版体验
  19. 恶搞小程序--鼠标乱飞
  20. Android 根目录listFiles()文件列表返回值为null

热门文章

  1. 有哪些好用的互联网数据抓取,数据采集,页面解析工具?
  2. 树莓派装Ubuntu系统配置串口引脚与stm32通信
  3. 代理商管理系统/代理商信息管理系统
  4. 论文阅读 TEMPORAL GRAPH NETWORKS FOR DEEP LEARNING ON DYNAMIC GRAPHS
  5. ubuntu18.04 安装五笔拼音
  6. STL中list的remove和remove_if的用法
  7. 无线耳机哪个品牌好一点?综合体验,推荐几款高性价比的无线耳机
  8. 【论文阅读】【3d目标检测】Group-Free 3D Object Detection via Transformers
  9. 五一假期搭建个django后端项目
  10. C/C++代码格式规范(二)