Original article: https://pierrevillard.com/2016/08/13/apache-nifi-1-0-0-cluster-setup/
The article is well written and its step-by-step English is easy to follow, so it is reproduced here as-is rather than translated.

As you may know, version 1.0.0-BETA of Apache NiFi was released a few days ago. The upcoming 1.0.0 release will be a great moment for the community, as it will mark a lot of work over the last few months, with many new features being added.

The objective of the Beta release is to give people a chance to try this new version and to give feedback before the official major release, which will come shortly. If you want to preview this new version with a completely new look, you can download the binaries here, unzip them, and run NiFi ('./bin/nifi.sh start', or './bin/run-nifi.bat' on Windows), then access http://localhost:8080/nifi/.
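A minimal sketch of that preview on Linux, assuming the Beta artifact from the Apache archive (the exact URL and file name are my assumption; use whatever mirror and package you actually downloaded):

# Download, unpack and start the 1.0.0-BETA preview (URL is an assumption)
wget https://archive.apache.org/dist/nifi/1.0.0-BETA/nifi-1.0.0-BETA-bin.tar.gz
tar xzf nifi-1.0.0-BETA-bin.tar.gz
cd nifi-1.0.0-BETA
./bin/nifi.sh start
# then browse to http://localhost:8080/nifi/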

The objective of this post is to briefly explain how to set up an unsecured NiFi cluster with this new version (a post on setting up a secured cluster will come shortly, with explanations on how to use a new tool that will ship with NiFi to ease the installation of a secured cluster).

One really important change with this new version is the new paradigm around cluster installation. From the NiFi documentation, we can read:

Starting with the NiFi 1.0 release, NiFi employs a Zero-Master Clustering paradigm. Each of the nodes in a NiFi cluster performs the same tasks on the data but each operates on a different set of data. Apache ZooKeeper elects one of the nodes as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. As a DataFlow manager, you can interact with the NiFi cluster through the UI of any node in the cluster. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points to the cluster.

[Figure: Zero-Master Clustering (from the NiFi documentation)]

OK, let's start with the installation. As you may know, it is strongly recommended to use an odd number of ZooKeeper instances, with at least 3 nodes (to maintain a majority, also called a quorum). NiFi comes with an embedded instance of ZooKeeper, but you are free to use an existing cluster of ZooKeeper instances if you want. In this article, we will use the embedded ZooKeeper option.

I will use my computer as the first instance. I also launched two virtual machines (with a minimal CentOS 7). All 3 of my instances are able to communicate with each other on the required ports. On each machine, I configure my /etc/hosts file with:

192.168.1.17 node-3
192.168.56.101 node-2
192.168.56.102 node-1
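A quick way to confirm that the three hosts resolve and can reach each other (a simple sketch, assuming ICMP is allowed between the machines) is to run, on each node:

# hostnames come from the /etc/hosts entries above
for host in node-1 node-2 node-3; do ping -c 1 $host; done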

I deploy the binaries on my three instances and unzip them. I now have a NiFi directory on each one of my nodes.

The first thing is to configure the list of the ZK (ZooKeeper) instances in the configuration file './conf/zookeeper.properties'. Since our three NiFi instances will run the embedded ZK instance, I just have to complete the file with the following properties:

server.1=node-1:2888:3888
server.2=node-2:2888:3888
server.3=node-3:2888:3888
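For context, the relevant part of './conf/zookeeper.properties' then looks roughly like this (clientPort and dataDir are the shipped defaults as far as I know; double-check them in your version):

# excerpt of ./conf/zookeeper.properties
clientPort=2181
dataDir=./state/zookeeper
server.1=node-1:2888:3888
server.2=node-2:2888:3888
server.3=node-3:2888:3888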

Then, everything happens in the ‘./conf/nifi.properties‘. First, I specify that NiFi must run an embedded ZK instance, with the following property:

nifi.state.management.embedded.zookeeper.start=true

I also specify the ZK connect string:

nifi.zookeeper.connect.string=node-1:2181,node-2:2181,node-3:2181

As you can notice, the ./conf/zookeeper.properties file has a property named dataDir. By default, this value is set to ./state/zookeeper. If more than one NiFi node is running an embedded ZK instance, it is important to tell each server which one it is.

To do that, you need to create a file named myid and place it in ZK's data directory. The content of this file should be the ID of the server, matching the server.N entries specified above. On node-1, I'll do:

mkdir ./state
mkdir ./state/zookeeper
echo 1 > ./state/zookeeper/myid 

The same operation needs to be done on each node (don't forget to change the ID).
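As a sketch, the equivalent commands on the two other nodes (the ID must match the corresponding server.N entry):

# on node-2
mkdir -p ./state/zookeeper && echo 2 > ./state/zookeeper/myid
# on node-3
mkdir -p ./state/zookeeper && echo 3 > ./state/zookeeper/myid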

If you don't do this, you may see the following kind of exception in the logs:

Caused by: java.lang.IllegalArgumentException: ./state/zookeeper/myid file is missing

Then we move on to the clustering properties. For this article, we are setting up an unsecured cluster, so we must keep:

nifi.cluster.protocol.is.secure=false

Then, we have the following properties:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-1
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

I set the FQDN of the node I am configuring, and I choose the arbitrary port 9999 for the communication with the elected cluster coordinator. I apply the same configuration on my other nodes:

nifi.cluster.is.node=true
nifi.cluster.node.address=node-2
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

and

nifi.cluster.is.node=true
nifi.cluster.node.address=node-3
nifi.cluster.node.protocol.port=9999
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=

We have configured the exchanges between the nodes and the cluster coordinator; now let's move on to the exchanges between the nodes themselves (to balance the flow data).

We have the following properties:

nifi.remote.input.host=node-1
nifi.remote.input.secure=false
nifi.remote.input.socket.port=9998
nifi.remote.input.http.enabled=true
nifi.remote.input.http.transaction.ttl=30 sec

Again, I set the FQDN of the node I am configuring, and I choose the arbitrary port 9998 for the Site-to-Site (S2S) exchanges between the nodes of my cluster. The same applies to all the nodes (just change the host property to the correct FQDN).

It is also important to set the FQDN for the web server property, otherwise we may get strange behaviors with all nodes identified as ‘localhost’ in the UI. Consequently, for each node, set the following property with the correct FQDN:

nifi.web.http.host=node-1
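If you prefer to script these per-node edits instead of editing the file by hand, here is a sketch with GNU sed (shown for node-2; the property names are exactly the ones listed above):

# set the three host-specific properties in ./conf/nifi.properties (example for node-2)
sed -i 's/^nifi.cluster.node.address=.*/nifi.cluster.node.address=node-2/' ./conf/nifi.properties
sed -i 's/^nifi.remote.input.host=.*/nifi.remote.input.host=node-2/' ./conf/nifi.properties
sed -i 's/^nifi.web.http.host=.*/nifi.web.http.host=node-2/' ./conf/nifi.properties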

And that’s all! Easy, isn’t it?

OK, let’s start our nodes and let’s tail the logs to see what’s going on there!

./bin/nifi.sh start && tail -f ./logs/nifi-app.log

If you look at the logs, you should see that one of the nodes gets elected as the cluster coordinator, and then you should see heartbeats created by the three nodes and sent to the cluster coordinator every 5 seconds.
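To spot the election without scrolling through the whole log, a rough filter helps (the exact wording of the messages varies between versions, so treat this as an approximation):

grep -i coordinator ./logs/nifi-app.log | tail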

You can connect to the UI using the node you want (you can have multiple users connected to different nodes; modifications will be applied on each node). Let's go to:

http://node-2:8080/nifi
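You can also query the cluster state from the command line through the REST API (to my knowledge the endpoint below exists in NiFi 1.x, but the response shape may vary between versions):

curl -s http://node-2:8080/nifi-api/controller/cluster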

Here is what it looks like:

[Screenshot: the NiFi UI, showing 3 connected nodes in the top-left corner]

As you can see in the top-left corner, there are 3 nodes in our cluster. Besides, if we go into the menu (button in the top-right corner) and select the cluster page, we have details on our three nodes:

[Screenshot: the cluster page with details on the three nodes]

We see that my node-2 has been elected as cluster coordinator, and that my node-3 is my primary node. This distinction is important because some processors must run on a unique node (for data consistency), and in this case we will want them to run "On primary node" (example below).

We can display details on a specific node (“information” icon on the left):

[Screenshot: details of a specific node]

OK, let's add a processor like GetTwitter. Since the flow will run on all nodes (with balanced data between the nodes), this processor must run on a unique node if we don't want to duplicate data. So, in the scheduling strategy, we will choose the strategy "On primary node". This way, we don't duplicate data, and if the primary node changes (because the node dies or gets disconnected), we won't lose data; the workflow will still be executed.

[Screenshot: GetTwitter scheduling strategy set to "On primary node"]

Then I can connect my processor to a PutFile processor to save the tweets in JSON by setting a local directory (/tmp/twitter):

[Screenshot: GetTwitter connected to PutFile]

If I run this flow, all my JSON tweets will be stored on the primary node; the data won't be balanced. To balance the data, I need to use an RPG (Remote Process Group); the RPG will connect to the coordinator to evaluate the load of each node and balance the data over the nodes. It gives us the following flow:

[Screenshot: the flow with the Remote Process Group and the "RPG" input port]

I have added an input port called "RPG", then I have added a Remote Process Group that I connected to "http://node-2:8080/nifi", and I enabled transmission so that the Remote Process Group was aware of the existing input ports on my cluster. Then, in the Remote Process Group configuration, I enabled the RPG input port. I then connected my GetTwitter to the Remote Process Group and selected the RPG input port. Finally, I connected my RPG input port to my PutFile processor.

When running the flow, I now have balanced data all over my nodes (I can check in the local directory ‘/tmp/twitter‘ on each node).
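A quick sanity check of the balancing, run on every node (assuming the PutFile directory configured above):

# the count should grow on all nodes, not just the primary
ls /tmp/twitter | wc -l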

That's all for this post. I hope you enjoyed it and that it will be helpful if you are setting up a NiFi cluster. All comments/remarks are very welcome, and I kindly encourage you to download Apache NiFi, try it, and give feedback to the community if you have any.
