• Pitfall 1

On newer Raspberry Pi images the bundled Java is the arm32 build, so the JAVA_HOME set later in the article has to point at the arm32 directory:

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt/jre/     # path used in the article
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre/   # path on newer images
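A quick way to confirm which JDK directory your image actually ships (my addition, not part of the original note) is to list the JVM directory or follow the java symlink:

ls /usr/lib/jvm/
readlink -f $(which java)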

• Pitfall 2

In Hadoop 2.x the core-site.xml property used to specify the NameNode address is fs.defaultFS (the old 1.x releases used fs.default.name):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.0.52:54310</value>
  </property>
</configuration>
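A quick way to check which value a running Hadoop installation actually resolves for this property (again my addition) is the getconf tool:

hdfs getconf -confKey fs.defaultFS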

The original article follows below.

Building a Raspberry Pi Hadoop cluster (Part 1)
Posted on 28 September 2015

Original article: https://web.archive.org/web/20170221231927/http://www.becausewecangeek.com/building-a-raspberry-pi-hadoop-cluster-part-1/

Since I love everything related to Raspberry Pis and am active as a data scientist, I decided I needed my own Pi cluster. Luckily for me, version 2 was already available when I started this project, so this cluster will have a little bit more oomph compared to clusters using the original Pi. It takes quite a bit of setup to get a Hadoop cluster going, so in this first part we will limit ourselves to a single-node "cluster".

The cluster

If you have ever visited my house you will have seen a lot of LEGO lying around. LEGO is by far my most expensive hobby, so when I needed to design a Raspberry Pi cluster case I didn't have to look far. My cluster case would be built from bricks.
I started by "designing" the case in OneNote to get the dimensions right. If you need any help converting millimeters to studs, I suggest checking out this amazing page.

OneNote sketch for creating the LEGO Raspberry Pi cluster

The case has room for 6 Raspberry Pis, my Lacie Cloudbox NAS, a TP-Link TL-SG1008D 8-port gigabit switch and the power supply. I took a bit of a gamble and chose the Orico 9-port USB 2.0 hub, hoping that it would provide enough juice for the Pis. I have successfully run 5 Pis using this single hub.

I designed the case in Lego Digital Designer and ordered the missing bricks via BrickLink. About an hour of building later this beauty was ready to be deployed in my storeroom.

In case you're wondering, that's a Cubesensor base station at the top. We have the hardware, let's do some software.

Getting started

I'm assuming you already have a working Raspberry Pi 2 running Raspbian and that you know how to log in via SSH. We will install the latest version of Hadoop, which is 2.7.1. Great, let's get started and log in via SSH.

Creating a new group and user

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

Everything Hadoop will be happening via the hduser. Let's change to this user.

su hduser

Generating SSH keys

Although we are using a single-node setup in this part, I decided to create the SSH keys right away. These will be the keys that the nodes use to talk to each other.

cd ~
mkdir .ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys

To verify that everything is working, you can simply open an SSH connection to localhost.

ssh localhost
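If ssh localhost still prompts for a password, the usual culprit is the permissions on the freshly created key files; this fix is not in the original article, but it is worth trying since SSH refuses authorized_keys files that are group- or world-writable:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys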

Installing the elephant in the room called Hadoop

wget ftp://apache.belnet.be/mirrors/ftp.apache.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
sudo mkdir /opt
cd ~
sudo tar -xvzf hadoop-2.7.1.tar.gz -C /opt/
cd /opt
sudo chown -R hduser:hadoop hadoop-2.7.1/

Depending on what you already did with your Pi, the /opt directory may already exist.

Hadoop is now installed, but we still need quite a bit of tinkering to get it configured right.

Setting a few environment variables
First we need to set a few environment variables. There are a few ways to do this, but I always do it by editing the .bashrc file.

nano ~/.bashrc

Add the following lines at the end of the file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export HADOOP_HOME=/opt/hadoop-2.7.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Changes to .bashrc are not applied the moment you save the file. You can either log out and log in again to pick up the new environment variables, or you can:
source ~/.bashrc
If everything is configured right, you should be able to print the installed version of Hadoop.

$hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0

From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /opt/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
Configuring Hadoop 2.7.1
Let's go to the directory that contains all of Hadoop's configuration files. We want to edit the hadoop-env.sh file. For some reason we need to configure JAVA_HOME manually in this file; Hadoop seems to ignore our $JAVA_HOME otherwise.

cd $HADOOP_CONF_DIR
nano hadoop-env.sh

Yes, I use nano. Look for the line saying JAVA_HOME and change it to your Java install directory. This was how the line looked after I changed it:

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm-vfp-hflt/jre/

There are quite a few files that need to be edited now. These are XML files; you just have to paste the code bits below between the <configuration> tags.

nano core-site.xml

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/hdfs/tmp</value>
</property>

nano hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

cp mapred-site.xml.template mapred-site.xml
nano mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx210m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>256</value>
</property>

The first property tells Hadoop that we want to use YARN as the MapReduce framework. The other properties are settings specific to our Raspberry Pi. For example, we specify that the YARN MapReduce Application Master gets 256 megabytes of RAM, and so do the Map and Reduce containers. These values allow us to actually run jobs; the default of 1.5 GB is more than our Pi can deliver with its 1 GB of RAM.

nano yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>768</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value>
</property>

This file tells Hadoop some information about this node, like the maximum amount of memory and the number of cores that can be used. We limit the usable RAM to 768 megabytes, which leaves a bit of memory for the OS and the Hadoop daemons. A container will always receive an amount of memory that is a multiple of the minimum allocation of 128 megabytes. For example, a container that needs 450 megabytes will get 512 megabytes assigned.
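As a small illustration of that rounding rule (my own sketch, not from the article), this is the arithmetic YARN effectively applies to each container request, using the 128 megabyte minimum configured above:

# Round a 450 MB request up to the next multiple of the 128 MB minimum allocation.
REQUEST_MB=450
MIN_ALLOC_MB=128
echo $(( (REQUEST_MB + MIN_ALLOC_MB - 1) / MIN_ALLOC_MB * MIN_ALLOC_MB ))   # prints 512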

Preparing HDFS

sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format

Booting Hadoop

cd $HADOOP_HOME/sbin
start-dfs.sh
start-yarn.sh

If you want to verify that everything is working you can use the jps command. In the output of this command you can see that Hadoop components like the NameNode are running. The numbers can be ignored; they are process IDs.

7696 ResourceManager
7331 DataNode
7464 SecondaryNameNode
8107 Jps
7244 NameNode
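Another quick sanity check (not shown in the original article) is to ask the running daemons themselves whether the DataNode and NodeManager have registered; on this single-node setup each command should list exactly one node:

hdfs dfsadmin -report
yarn node -list
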
Running a first MapReduce job
For our first job we need some data. I selected a number of books from Project Gutenberg and concatenated them into one large file. There's something for everyone: Shakespeare, Homer, Edgar Allan Poe... The resulting file is about 16 MB in size.

wget http://www.gutenberg.org/cache/epub/11/pg11.txt
wget http://www.gutenberg.org/cache/epub/74/pg74.txt
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
wget http://www.gutenberg.org/cache/epub/5200/pg5200.txt
wget http://www.gutenberg.org/cache/epub/2591/pg2591.txt
wget http://www.gutenberg.org/cache/epub/6130/pg6130.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/8800/pg8800.txt
wget http://www.gutenberg.org/cache/epub/345/pg345.txt
wget http://www.gutenberg.org/cache/epub/1497/pg1497.txt
wget http://www.gutenberg.org/cache/epub/135/pg135.txt
wget http://www.gutenberg.org/cache/epub/41/pg41.txt
wget http://www.gutenberg.org/cache/epub/120/pg120.txt
wget http://www.gutenberg.org/cache/epub/22381/pg22381.txt
wget http://www.gutenberg.org/cache/epub/2600/pg2600.txt
wget http://www.gutenberg.org/cache/epub/236/pg236.txt
cat pg*.txt > books.txt

The books.txt file cannot be read by Hadoop from our traditional Linux file system; it needs to be stored on HDFS. We can easily copy it there.

hdfs dfs -copyFromLocal books.txt /books.txt

You can make sure that the copy operation went properly by listing the contents of the HDFS root directory.

hdfs dfs -ls /
Now let's count the occurrence of words in this giant book. We are in luck: the kind developers of Hadoop provide an example that does exactly that.

hadoop jar /opt/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /books.txt /books-result
You can view the progress by surfing to http://<ip-of-your-pi>:8088/cluster. After the job is done you can find the output in the /books-result directory. We can view the results of this MapReduce (V2) job using the hdfs command:

hdfs dfs -cat /books-result/part-r* | head -n 20
Since we are talking about multiple books, printing the entire list might take a while. If you look at the output you see that the wordcount example has room for improvement: uppercase and lowercase words are counted separately, and symbols and characters around a word make things messy. But it's time for a first benchmark: how long did our single-node Raspberry Pi 2 work on this wordcount? The average execution time of 5 jobs was measured to be 3 minutes and 25 seconds.
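If you want cleaner counts, a simple normalization pass before copying the file to HDFS already helps a lot. This is my own aside, not part of the benchmark above:

# Lowercase everything and break on non-alphanumeric characters, so "Whale," and
# "whale" end up as the same key, then stage the cleaned file on HDFS.
tr '[:upper:]' '[:lower:]' < books.txt | tr -cs '[:alnum:]' '\n' > books-clean.txt
hdfs dfs -copyFromLocal books-clean.txt /books-clean.txt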

Some more benchmark jobs
In the next part we will extend our Hadoop cluster to use all 4 Pis and make it a real cluster. The same wordcount job will be run again and the execution time will be compared to our 3 minutes and 25 seconds. Just to be safe I include some larger test files as well.
I found some public datasets based on IMDb here, so I used the plot.gz file. This results in an uncompressed text file of about 350 megabytes. Using the same command I set the Pi to work. This time the job was split into 3 parts, so a second core could start counting (up until now we had only used 1 wordcount container and thus 1 core). The result? A Raspberry Pi that was getting a bit warm.

CPU Temp: 73.4ºC
GPU Temp: 73.4ºC
Seems like a LEGO cluster case is not the best in terms of cooling, so I'll have to look into that. But the brave Pi gets the job done (although I can't say without a sweat, pun intended). The average execution time of counting the word occurrences in this text file was 25 minutes and 40 seconds.
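The article doesn't show how these temperatures were read; on Raspbian you can get both values yourself (my addition) with:

vcgencmd measure_temp                            # GPU temperature via the firmware
cat /sys/class/thermal/thermal_zone0/temp        # CPU temperature in millidegrees Celsius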

My hunger for word occurrence counts was still gnawing, so I decided to seek an even larger corpus. I found the Blog Authorship Corpus, which I downloaded, stripped of XML tags and concatenated into a 788 MB file. If you want to reproduce this file, you can find the steps I took below.

wget http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
unzip blogs.zip
cd blogs/
cat *.xml | sed -e 's/<[^>]*>//g' >> blogs.txt
Crunching this file took 1 hour, 8 minutes and 12 seconds. I didn't have the patience to average a few timings. The aggregate resource allocation tells us that this job required 3068130 MB-seconds and 11980 vcore-seconds.
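If you want to pull those aggregate numbers for your own runs (my addition), the ResourceManager reports them per application:

# List finished applications to find the id, then print its report, which should
# include the aggregate MB-seconds and vcore-seconds.
yarn application -list -appStates FINISHED
yarn application -status <application-id>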

Dataset                  Execution time
Gutenberg books          3 minutes 25 seconds
IMDb plots               25 minutes 40 seconds
Blog Authorship Corpus   1 hour 8 minutes 12 seconds
Conclusion
So the marvellous LEGO cluster has one working Hadoop node. In the next post I will add the other Pi's to the cluster.

Update: Some sources claim a performance increase after switching to Oracle Java. After installing Oracle Java 8 the blogs dataset took 1h, 47m and 47s. This time does not indicate any significant performance increase. I didn't look any further into this claim.
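The update doesn't show the switch itself; on the Raspbian images of that period, installing and selecting Oracle Java 8 typically looked like the commands below (the package name is my assumption, so check what apt actually offers):

sudo apt-get update
sudo apt-get install oracle-java8-jdk
sudo update-alternatives --config java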

The first Raspberry Pi node of the Hadoop cluster
