Hadoop MapReduce Programs: SingletonTableJoin

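Requirement: a single-table join (self-join). Given a file of child-parent relationships, derive the grandchild-grandparent relationships.

Sample input: child-parent.txt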

xiaoming daxiong

daxiong alice

daxiong jack

Output:

xiaoming alice

xiaoming jack

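Analysis and design:

Mapper design:

1. <k1,v1>: k1 is the position of a line in the input (with the default TextInputFormat, its byte offset), and v1 is the content of that line.

2. Left table <k2,v2>: k2 is the parent's name, and v2 is (1, child's name), where 1 is the left-table tag.

3. Right table <k3,v3>: k3 is the child's name, and v3 is (2, parent's name), where 2 is the right-table tag.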
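The mapper design above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration: the class MapEmitSketch and its emit helper are hypothetical names, and the (key, value) pairs are modeled as two-element string arrays rather than Hadoop Text objects.

```java
import java.util.ArrayList;
import java.util.List;

public class MapEmitSketch {

    // Hypothetical helper: returns the (key, value) pairs that the mapper
    // design would emit for one "child parent" input line.
    public static List<String[]> emit(String line) {
        List<String[]> out = new ArrayList<>();
        String[] fields = line.trim().split("\\s+");
        // Skip blank or malformed lines and a "child parent" header line.
        if (fields.length < 2 || "child".equals(fields[0])) {
            return out;
        }
        String child = fields[0];
        String parent = fields[1];
        out.add(new String[] {parent, "1 " + child}); // left table, tag 1
        out.add(new String[] {child, "2 " + parent}); // right table, tag 2
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"xiaoming daxiong", "daxiong alice", "daxiong jack"};
        for (String line : lines) {
            for (String[] kv : emit(line)) {
                System.out.println(kv[0] + "\t" + kv[1]);
            }
        }
    }
}
```

For the three sample lines, the key daxiong ends up with the values "1 xiaoming", "2 alice" and "2 jack", which is the grouping the reducer relies on.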

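Reducer design:

4. <k4,v4>: k4 is the shared key (one person's name), and v4 is a list<String> of all tagged values emitted for that key.

5. Take the Cartesian product to get <k5,v5>: k5 is the grandchild's name, and v5 is the grandparent's name. For a given key, the values tagged 1 are that person's children and the values tagged 2 are that person's parents, so every (child, parent) combination is a grandchild-grandparent pair.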
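The reduce step can likewise be sketched in plain Java for a single key. The class ReduceJoinSketch and its join helper are hypothetical names used only for illustration; the input list stands in for the values Hadoop would group under one key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceJoinSketch {

    // Hypothetical helper: joins the tagged values that arrived under one key
    // and returns "grandchild grandparent" lines.
    public static List<String> join(List<String> taggedValues) {
        List<String> grandChildren = new ArrayList<>();
        List<String> grandParents = new ArrayList<>();
        for (String value : taggedValues) {
            String[] record = value.split(" ");
            if (record.length < 2) {
                continue; // skip empty or malformed values
            }
            if ("1".equals(record[0])) {
                grandChildren.add(record[1]); // tag 1: a child of the key
            } else if ("2".equals(record[0])) {
                grandParents.add(record[1]);  // tag 2: a parent of the key
            }
        }
        // Cartesian product of the two sides; empty if either side is empty.
        List<String> pairs = new ArrayList<>();
        for (String grandChild : grandChildren) {
            for (String grandParent : grandParents) {
                pairs.add(grandChild + " " + grandParent);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Values grouped under the key "daxiong" for the sample input.
        List<String> valuesForDaxiong = Arrays.asList("1 xiaoming", "2 alice", "2 jack");
        for (String pair : join(valuesForDaxiong)) {
            System.out.println(pair);
        }
    }
}
```

Keys that have values on only one side, such as xiaoming (only "2 daxiong") or alice (only "1 daxiong"), produce no output, which is why the sample result contains just the two lines for xiaoming.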

Program:

SingletonTableJoinMapper class:

package com.cn.singletonTableJoin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SingletonTableJoinMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Read the first two whitespace-separated fields: child, then parent.
        StringTokenizer itr = new StringTokenizer(value.toString());
        if (itr.countTokens() < 2) {
            return; // skip blank or malformed lines
        }
        String childName = itr.nextToken();
        String parentName = itr.nextToken();

        // Skip a header line whose first field is "child".
        if (!"child".equals(childName)) {
            // Left table: key = parent, value = "1 <child>".
            context.write(new Text(parentName), new Text("1 " + childName));
            // Right table: key = child, value = "2 <parent>".
            context.write(new Text(childName), new Text("2 " + parentName));
        }
    }
}

SingletonTableJoinReduce class:

package com.cn.singletonTableJoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SingletonTableJoinReduce extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> grandChild = new ArrayList<String>();
        List<String> grandParent = new ArrayList<String>();

        // Split the tagged values for this key into the two sides of the join.
        for (Text value : values) {
            String[] record = value.toString().split(" ");
            if (record.length < 2) {
                continue; // skip empty or malformed values
            }
            if ("1".equals(record[0])) {
                grandChild.add(record[1]);   // tag 1: a child of the key
            } else if ("2".equals(record[0])) {
                grandParent.add(record[1]);  // tag 2: a parent of the key
            }
        }

        // Cartesian product; emits nothing if either side is empty.
        for (String grandchild : grandChild) {
            for (String grandparent : grandParent) {
                context.write(new Text(grandchild), new Text(grandparent));
            }
        }
    }
}

SingletonTableJoin class:

package com.cn.singletonTableJoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Single-table join.
 *
 * @author root
 */
public class SingletonTableJoin {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: SingletonTableJoin <in> <out>");
            System.exit(2);
        }

        // Create a job. This constructor is deprecated in Hadoop 2.x,
        // where Job.getInstance(conf, "SingletonTableJoin") is preferred.
        Job job = new Job(conf, "SingletonTableJoin");
        job.setJarByClass(SingletonTableJoin.class);

        // Set the input and output paths.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // Set the mapper and reducer classes.
        job.setMapperClass(SingletonTableJoinMapper.class);
        job.setReducerClass(SingletonTableJoinReduce.class);

        // Set the output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Make summarizing a habit.

Reposted from: https://www.cnblogs.com/xubiao/p/5759422.html
