Using Hadoop from C++

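How the Hadoop framework works:

The flow is as follows: the input is converted into the context format used by the mapper; after the mapper has processed it, the result is converted into the context format used by the reducer; after the reducer has processed that, the output is produced.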
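To make the flow above concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce stages in plain C++. It does not use Hadoop at all. The names KeyValue, MapFn, ReduceFn, runLocalJob, and countByFirstLetter are invented for this illustration, and the grouping by key only imitates what the framework does between the mapper and the reducer.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases for this sketch; they are not part of the Hadoop API.
using KeyValue = std::pair<std::string, std::string>;
using MapFn = std::function<std::vector<KeyValue>(const std::string&)>;
using ReduceFn =
    std::function<std::string(const std::string&, const std::vector<std::string>&)>;

// Runs map -> shuffle (group by key) -> reduce entirely in memory.
std::map<std::string, std::string> runLocalJob(const std::vector<std::string>& input,
                                               const MapFn& mapFn,
                                               const ReduceFn& reduceFn) {
    // Map phase: each record may emit zero or more key/value pairs.
    // Shuffle: pairs are grouped by key (std::map also sorts the keys).
    std::map<std::string, std::vector<std::string>> grouped;
    for (const std::string& record : input) {
        for (const KeyValue& kv : mapFn(record)) {
            grouped[kv.first].push_back(kv.second);
        }
    }

    // Reduce phase: one call per key, with all values for that key.
    std::map<std::string, std::string> output;
    for (const auto& entry : grouped) {
        output[entry.first] = reduceFn(entry.first, entry.second);
    }
    return output;
}

// Example job: count how many records start with each letter.
std::map<std::string, std::string> countByFirstLetter(const std::vector<std::string>& input) {
    MapFn mapFn = [](const std::string& record) {
        std::vector<KeyValue> out;
        if (!record.empty()) {
            out.emplace_back(record.substr(0, 1), "1");
        }
        return out;
    };
    ReduceFn reduceFn = [](const std::string&, const std::vector<std::string>& values) {
        return std::to_string(values.size());
    };
    return runLocalJob(input, mapFn, reduceFn);
}
```

Calling countByFirstLetter({"apple", "avocado", "banana"}) returns a map with a -> "2" and b -> "1". In a real job the grouping step is performed by the framework across machines, so only the two callbacks correspond to code you write.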

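C++ libraries and header files:

Hadoop ships its C++ API as libraries and header files. After installing Hadoop, the libraries are under hadoop/hadoop-2.8.0/lib/native and the header files are under hadoop/hadoop-2.8.0/include. Copy them to the system directories /usr/lib64 and /usr/include, copy them directly into your own project, or point to them from a Makefile.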

C++ programming pattern:

Write a map class that publicly inherits from Mapper and a reduce class that publicly inherits from Reducer. The template is as follows:

class localmapper : public HadoopPipes::Mapper
{
public:
    localmapper(HadoopPipes::TaskContext & context) {}
    void map(HadoopPipes::MapContext & context) {}
};

class localreducer : public HadoopPipes::Reducer
{
public:
    localreducer(HadoopPipes::TaskContext & context) {}
    void reduce(HadoopPipes::ReduceContext & context) {}
};

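The map method of the localmapper class defines how each input record is filtered and emitted as a key/value pair (the framework then shuffles the pairs by key), and the reduce method of the localreducer class defines how the values of each key are combined into the output. An example:

1. Write the input, here tmp.txt, as follows: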

[root@master helloworld]# cat tmp.txt
a:067
b:066
a:100
b:089
b:099

2. Write the map and reduce methods, as follows:

#include <limits.h>
#include <stdint.h>
#include <string.h>

#include <algorithm>
#include <string>

/* Hadoop headers */
#include <Pipes.hh>
#include <TemplateFactory.hh>
#include <StringUtils.hh>

using namespace std;

/* Hadoop's mapper and reducer, and the contexts each of them uses */
using HadoopPipes::TaskContext;
using HadoopPipes::Mapper;
using HadoopPipes::MapContext;
using HadoopPipes::Reducer;
using HadoopPipes::ReduceContext;

/* two functions from Hadoop's utility set */
using HadoopUtils::toInt;
using HadoopUtils::toString;

/* Hadoop entry point */
using HadoopPipes::TemplateFactory;
using HadoopPipes::runTask;

/* publicly inherit from Hadoop's Mapper */
class LocalMapper : public Mapper
{
public:
    LocalMapper(TaskContext & context) {}

    /* map function, uses MapContext */
    void map(MapContext & context)
    {
        /* read one line of input from the text */
        string line  = context.getInputValue();
        string key   = line.substr(0, 1);
        string value = line.substr(2, 3);

        /* emit according to the filter condition; here the value must not be 100 */
        if (value != "100")
        {
            context.emit(key, value);
        }
    }
};

/* publicly inherit from Reducer */
class LocalReducer : public Reducer
{
public:
    LocalReducer(TaskContext & context) {}

    /* reduce function, uses ReduceContext */
    void reduce(ReduceContext & context)
    {
        int max_value = 0;

        /* iterate over all values of one key and pick the output; here the maximum */
        while (context.nextValue())
        {
            max_value = max(max_value, toInt(context.getInputValue()));
        }
        context.emit(context.getInputKey(), toString(max_value));
    }
};

int main()
{
    return runTask(TemplateFactory<LocalMapper, LocalReducer>());
}
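The map function above slices each line at fixed offsets, which assumes a one-character key and would throw std::out_of_range on a line shorter than two characters (for example an empty line). As a sketch of a more defensive alternative, the helper below splits at the first ':' instead. The name splitRecord is hypothetical, it requires C++17 for std::optional, and it is not something the Hadoop Pipes API provides.

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: splits "key:value" at the first ':'.
// Returns std::nullopt when the line has no ':' or an empty key or value.
std::optional<std::pair<std::string, std::string>> splitRecord(const std::string& line) {
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 >= line.size()) {
        return std::nullopt;
    }
    return std::make_pair(line.substr(0, colon), line.substr(colon + 1));
}
```

Inside map, one could call splitRecord(context.getInputValue()) and emit only when a value is returned; for the line a:067 it yields the key a and the value 067, and for malformed lines the record is simply skipped.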

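3. Compile with the following command:

g++ helloworld.cpp -lcrypto -lssl -L/root/hadoop/hadoop-2.8.0/lib/native -lhadooppipes -lhadooputils -lpthread

The pthread library has to be linked (-lpthread) because the Hadoop Pipes library uses threads internally. Compilation produces a.out.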

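4. Upload a.out and tmp.txt to HDFS with the following commands:

hdfs dfs -mkdir /helloworld
hdfs dfs -put tmp.txt /helloworld
hdfs dfs -put a.out /helloworld

The first line creates the directory /helloworld in HDFS (it is not visible on the local filesystem); the next two lines put the input and the executable into that directory.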

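5. Start the job with the following script:

[root@master helloworld]# cat start.sh
hadoop pipes -Dhadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /helloworld/tmp.txt -output output -program /helloworld/a.out

The -input argument gives the path of the input file. The -output argument gives the location in HDFS where the output is stored; this directory is created by the job and must not exist beforehand. The -program argument gives the path of the executable. Output like the following indicates success: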

[root@master helloworld]# ./start.sh
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1
18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2
18/09/25 15:54:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537857363076_0004
18/09/25 15:54:24 INFO impl.YarnClientImpl: Submitted application application_1537857363076_0004
18/09/25 15:54:24 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1537857363076_0004/
18/09/25 15:54:24 INFO mapreduce.Job: Running job: job_1537857363076_0004
18/09/25 15:54:38 INFO mapreduce.Job: Job job_1537857363076_0004 running in uber mode : false
18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%
18/09/25 15:55:14 INFO mapreduce.Job: Job job_1537857363076_0004 completed successfully
18/09/25 15:55:14 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=38
        FILE: Number of bytes written=414120
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=223
        HDFS: Number of bytes written=10
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=34605
        Total time spent by all reduces in occupied slots (ms)=11779
        Total time spent by all map tasks (ms)=34605
        Total time spent by all reduce tasks (ms)=11779
        Total vcore-milliseconds taken by all map tasks=34605
        Total vcore-milliseconds taken by all reduce tasks=11779
        Total megabyte-milliseconds taken by all map tasks=35435520
        Total megabyte-milliseconds taken by all reduce tasks=12061696
    Map-Reduce Framework
        Map input records=5
        Map output records=4
        Map output bytes=24
        Map output materialized bytes=44
        Input split bytes=178
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=44
        Reduce input records=4
        Reduce output records=2
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=453
        CPU time spent (ms)=3630
        Physical memory (bytes) snapshot=548720640
        Virtual memory (bytes) snapshot=6198038528
        Total committed heap usage (bytes)=378470400
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=45
    File Output Format Counters
        Bytes Written=10
18/09/25 15:55:14 INFO util.ExitUtil: Exiting with status 0

This line shows that the job has one input file:

18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1

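This line shows that the input was divided into two splits, which is why two map tasks are launched (see Launched map tasks=2 in the counters). It reports the number of input splits, not the number of datanodes, although this cluster does happen to have two datanodes: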

18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2

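The following three lines show the processing order of the Hadoop framework: the map phase runs first, and the reduce phase completes only after all map tasks have finished: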

18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%

6. View the result:

[root@master helloworld]# hdfs dfs -cat output/*
a   67
b   99

Here the value 100 under a was excluded, so the largest remaining value for a is 67; the largest value for b is 99.
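The expected result can be checked without a cluster. The sketch below replays the same filter and maximum logic on the five input lines in plain C++. The function name maxPerKey is hypothetical, std::stoi stands in for HadoopUtils::toInt, and the code assumes every value is numeric. Like the reducer above, it starts each maximum at 0, so negative values would never be reported.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical local stand-in for the job: same slicing, filter, and max as above.
std::map<std::string, int> maxPerKey(const std::vector<std::string>& lines) {
    std::map<std::string, int> best;
    for (const std::string& line : lines) {
        if (line.size() < 3) {
            continue;  // too short to hold "k:v"
        }
        const std::string key = line.substr(0, 1);
        const std::string value = line.substr(2, 3);
        if (value == "100") {
            continue;  // map-side filter: drop values equal to 100
        }
        const int number = std::stoi(value);  // "067" is parsed as decimal 67
        auto it = best.find(key);
        if (it == best.end()) {
            best[key] = std::max(0, number);  // mirrors max_value = 0 in the reducer
        } else {
            it->second = std::max(it->second, number);
        }
    }
    return best;
}
```

Passing the five lines of tmp.txt gives 67 for a and 99 for b, matching the job output. A key whose values are all filtered out gets no entry at all, just as the reducer is never called for a key that the mapper never emitted.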

7. Where job logs are stored

hadoop-2.8.0/logs/userlogs

The logs of every job are stored here, for example:

[root@master userlogs]# ls
application_1537857363076_0002  application_1537857363076_0003  application_1537857363076_0004

Inside each job directory, every container has stderr, stdout, and syslog files; most of the content is usually in syslog.

8. See how much of a job each datanode handled:

Here a.out is run twice to observe the effect. There are two datanodes; on the first one the log directory looks like this:

[root@master application_1537857363076_0003]# ls
container_1537857363076_0003_01_000001  container_1537857363076_0003_01_000004

On the second datanode the log directory looks like this:

[root@slave1 application_1537857363076_0003]# ls
container_1537857363076_0003_01_000002  container_1537857363076_0003_01_000003

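Note, however, that the work is not always split evenly across the two datanodes. After running the job again, the logs look like this:

First datanode: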

[root@master application_1537857363076_0004]# ls
container_1537857363076_0004_01_000004

Second datanode:

[root@slave1 application_1537857363076_0004]# ls
container_1537857363076_0004_01_000001  container_1537857363076_0004_01_000002  container_1537857363076_0004_01_000003

Here you can see that master handled only 1 container, while slave1 handled 3.
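The directory names in these listings follow the pattern container_<cluster timestamp>_<application number>_<attempt>_<container number>. As a rough sketch, the helper below splits such a name into its parts so that listings from several nodes can be grouped by application. The struct ContainerId and the function parseContainerId are hypothetical names for this illustration, and the code assumes the five-field form shown above.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for the fields of a container directory name.
struct ContainerId {
    std::string application;  // e.g. "application_1537857363076_0004"
    int attempt = 0;          // application attempt number
    int container = 0;        // container number within the attempt
};

// Parses names such as "container_1537857363076_0004_01_000001".
// Returns false when the name does not have the assumed five-field form.
bool parseContainerId(const std::string& name, ContainerId& out) {
    std::vector<std::string> parts;
    std::stringstream stream(name);
    std::string part;
    while (std::getline(stream, part, '_')) {
        parts.push_back(part);
    }
    if (parts.size() != 5 || parts[0] != "container") {
        return false;
    }
    try {
        out.attempt = std::stoi(parts[3]);
        out.container = std::stoi(parts[4]);
    } catch (const std::exception&) {
        return false;  // non-numeric attempt or container field
    }
    out.application = "application_" + parts[1] + "_" + parts[2];
    return true;
}
```

For the second run, the four names parse to containers 1 to 4 of application_1537857363076_0004. Assuming the usual YARN convention that the first container of an attempt hosts the ApplicationMaster, this is consistent with the counters above: one ApplicationMaster, two map tasks, and one reduce task.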
