Converting a TextFile to a SequenceFile in Hive

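For business reasons, I need to import SequenceFile files into Hive, but the SequenceFile files used so far were delivered by Flume.

So a Hadoop MapReduce (MR) job is needed to convert the TextFile data to SequenceFile, after which it can be imported into Hive.

The code is as follows: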

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Converts a text file into a sequence file.
 * @author gongmf
 * email: 1376818286@qq.com
 */
public class TextToSequencefile {

    // TextInputFormat supplies each line's byte offset as a LongWritable key
    // and the line itself as a Text value.
    public static class ReaderMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
//          if (value == null) { return; }
            String str = value.toString();
            // This truncation is specific to my business needs; comment it out if you do not need it.
            if (str == null || str.length() < 14) {
                return;
            }
            str = str.substring(14);
            context.write(key, new Text(str));
        }
    }

    // Not used by this map-only job (see setNumReduceTasks(0) in main).
    public static class WriterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Takes Iterable, not Iterator; with Iterator the method would never override reduce().
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        // section 1: parse the arguments and create the job
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: TextToSequencefile <input path> <output path>");
            System.exit(2);
        }
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "TextToSequencefile");
        job.setJarByClass(TextToSequencefile.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.NONE); // whether to compress

        // section 2: key and value types
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // section 3: mapper only, no combiner or reducer
        job.setMapperClass(ReaderMapper.class);
//        job.setCombinerClass(WriterReducer.class);
//        job.setReducerClass(WriterReducer.class);
        job.setNumReduceTasks(0);

        // section 4: input and output paths
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        SequenceFileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // section 5: run the job and exit with its status
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
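
The only record-level logic in the mapper is the truncation step, so it may be worth trying on its own before submitting a job. Below is a minimal sketch with no Hadoop dependency. The class name `PrefixStripper`, the method `stripPrefix`, and the sample record are hypothetical and introduced only for illustration; the length 14 is taken from the mapper.

```java
import java.util.Optional;

/** Standalone sketch of the mapper's truncation rule; not part of the original job. */
public class PrefixStripper {
    // Number of leading characters dropped, as in the mapper.
    public static final int PREFIX_LENGTH = 14;

    /**
     * Returns the line without its first PREFIX_LENGTH characters, or an empty
     * Optional when the line is null or shorter than the prefix (the mapper
     * skips such records instead of writing them).
     */
    public static Optional<String> stripPrefix(String line) {
        if (line == null || line.length() < PREFIX_LENGTH) {
            return Optional.empty();
        }
        return Optional.of(line.substring(PREFIX_LENGTH));
    }

    public static void main(String[] args) {
        // Hypothetical sample record: a 14-character prefix followed by the payload.
        System.out.println(stripPrefix("20240101120000payload")); // prints Optional[payload]
        System.out.println(stripPrefix("short"));                 // prints Optional.empty
    }
}
```

Note that a line of exactly 14 characters passes the length check and yields an empty string, so the mapper would write an empty value for it rather than skip it.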
