Original article: http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html

HAProxy: Cornerstone of Reliable Websites

One primary goal of the infrastructure teams here at Yelp is to get as close to zero downtime as possible. This means that when users make requests for www.yelp.com we want to ensure that they get a response, and that they get a response as fast as possible. One way we do that at Yelp is by using the excellent HAProxy load balancer. We use it everywhere: for our external load balancing, internal load balancing, and with our move to a Service Oriented Architecture, we find ourselves running HAProxy on every machine at Yelp as part of SmartStack.

We love the flexibility that SmartStack gives us in developing our SOA, but that flexibility comes at a cost. When services or service backends are added or permanently removed, HAProxy has to reload across our entire infrastructure. These reloads can cause reliability problems because while HAProxy is top notch at not dropping traffic while it is running, it can (and does) drop traffic during reloads.

HAProxy Reloads Drop Traffic

As of version 1.5.11, HAProxy does not support zero downtime restarts or reloads of configuration. Instead, it supports fast reloads: a new HAProxy instance starts up, attempts to use SO_REUSEPORT to bind to the same ports that the old HAProxy is listening on, and sends a signal to the old HAProxy instance to shut down. This technique is very close to zero downtime on modern Linux kernels, but there is a brief period of time during which both processes are bound to the port. During this critical window it is possible for traffic to get dropped due to the way that the Linux kernel (mis)handles multiple accepting processes. In particular, new connections can be answered with a RST from HAProxy: SYN packets can get put into the old HAProxy’s socket queue right before it calls close, and those connections are then reset.

There are various workarounds to this issue. For example, Willy Tarreau, the primary maintainer of HAProxy, has suggested that users drop SYN packets for the duration of the HAProxy restart so that TCP automatically recovers. Unfortunately, RFC 6298 dictates that the initial SYN timeout should be 1s, and the Linux kernel faithfully hardcodes this. As such, dropping SYNs means that any connection that attempts to establish during the 20-50ms of an HAProxy reload will encounter an extra second of latency or more. The exact latency depends on the TCP implementation of the client: while some mobile devices retry as fast as 200ms, many devices only retry after 3s. Given the number of HAProxy reloads and the level of traffic Yelp has, this becomes a barrier to the reliability of our services.
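
For reference, a minimal sketch of this SYN-drop workaround (the same approach is benchmarked in the experiments below), assuming HAProxy listens on port 80 (a placeholder) and that the service reload performs the fast reload described above:

# Refuse new connections for the duration of the reload; affected clients
# retransmit the SYN after the kernel's hardcoded 1s initial timeout.
iptables -I INPUT -p tcp --dport 80 --syn -j DROP
service haproxy reload
iptables -D INPUT -p tcp --dport 80 --syn -j DROP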

Making HAProxy Reloads Not Drop Traffic

To avoid this latency, we built on the solution proposed by Willy. His solution actually works very well at not dropping traffic, but the extra second of latency is a problem. A better solution for us would be to delay the SYN packets until the reload is done, as that would impose only the latency of the HAProxy reload on new connections. To do this, we turned to Linux queueing disciplines (qdiscs). Queueing disciplines manipulate how network packets are handled within the Linux kernel. Specifically, you can control how packets are enqueued and dequeued, which provides the ability to rate limit, prioritize, or otherwise order outgoing network traffic. For more information on qdiscs, I highly recommend reading the lartc howto as well as the relevant man pages.

After some light bedtime reading of the Linux kernel source code, one of our SREs, Josh Snyder, discovered a relatively undocumented qdisc that has been available since Linux 3.4: the plug queueing discipline. Using the plug qdisc, we were able to implement zero downtime HAProxy reloads with the following standard Linux technologies:

  • tc: Linux traffic control. This allows us to set up queueing disciplines that route traffic based on filters. On newer Linux distributions there is also libnl-utils, which provides interfaces to some of the newer qdiscs (such as the plug qdisc).
  • iptables: Linux tool for packet filtering and NAT. This allows us to mark incoming SYN packets.

SmartStack clients connect to the loopback interface to make a request to HAProxy, which conveniently means that incoming traffic is also outgoing traffic on that interface. This means that we can set up a queueing discipline on the loopback interface that looks something like Figure 1.

Figure 1: Queueing Discipline

This sets up a classful implementation of the standard pfifo_fast queueing discipline using the prio qdisc, but with a fourth “plug” lane. A plug qdisc has the capability to queue packets without dequeuing them, and then to flush those packets on command. This capability, in combination with an iptables rule, allows us to redirect SYN packets to the plug during a reload of HAProxy and then unplug after the reload. The handles (e.g. ‘1:1’, ‘30:’) are labels that allow us to connect qdiscs together and send packets to particular qdiscs using filters; for more information consult the lartc howto referenced above.

We then programmed this functionality into a script we call qdisc_tool. This tool allows our infrastructure to “protect” an HAProxy reload: we plug traffic, reload HAProxy, and then release the plug, delivering all the delayed SYN packets. This invocation looks something like:

qdisc_tool protect <normal HAProxy reload command>
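
The qdisc_tool script itself is not included in this post. A hypothetical minimal version of such a wrapper, assuming the qdisc and iptables setup described in the following sections is already in place, might look like:

#!/bin/bash
# Hypothetical sketch of a qdisc_tool-style "protect" wrapper; not Yelp's actual tool.
# Usage: qdisc_tool protect <reload command...>
set -eu
case "${1:-}" in
protect)
    shift
    # Make sure the plug is always released, even if the wrapped command fails
    trap 'nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite' EXIT
    # Buffer marked SYN packets in the plug qdisc while HAProxy reloads
    nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --buffer
    # Run the wrapped reload command (e.g. "service haproxy reload")
    "$@"
    ;;
*)
    echo "usage: $0 protect <command...>" >&2
    exit 1
    ;;
esac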

We can easily reproduce this technique with standard userspace utilities on modern Linux distributions such as Ubuntu Trusty. If your setup does not have nl-qdisc-add but does have a 3.4+ Linux kernel, you can manipulate the plug via netlink manually.

Set up the Queuing Disciplines

Before we can do graceful HAProxy reloads, we must first set up the queueing discipline described above using tc and nl-qdisc-add. Note that every command must be run as root.

# Set up the queuing discipline
tc qdisc add dev lo root handle 1: prio bands 4
tc qdisc add dev lo parent 1:1 handle 10: pfifo limit 1000
tc qdisc add dev lo parent 1:2 handle 20: pfifo limit 1000
tc qdisc add dev lo parent 1:3 handle 30: pfifo limit 1000
# Create a plug qdisc with 1 meg of buffer
nl-qdisc-add --dev=lo --parent=1:4 --id=40: plug --limit 1048576
# Release the plug
nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite
# Set up the filter, any packet marked with “1” will be
# directed to the plug
tc filter add dev lo protocol ip parent 1:0 prio 1 handle 1 fw classid 1:4
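
To confirm the tree is in place, you can list the qdiscs, classes, and filters on the loopback interface; the handles should match the ones used above:

# The prio root (1:) should show four children: pfifo qdiscs 10:, 20:, 30: and the plug 40:
tc qdisc show dev lo
tc class show dev lo
# The fw filter should steer packets marked "1" into class 1:4 (the plug lane)
tc filter show dev lo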

Mark SYN Packets

We want all SYN packets to be routed to the plug lane, which we can accomplish with iptables. We use a link-local address so that we redirect only the traffic we want during the reload, and clients always have the option of making a request to 127.0.0.1 if they wish to avoid the plug. Note that this assumes you have set up a link-local address at 169.254.255.254.

iptables -t mangle -I OUTPUT -p tcp -s 169.254.255.254 --syn -j MARK --set-mark 1
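
To check that SYNs are actually being marked, you can watch the per-rule packet counters in the mangle table while generating traffic against the link-local address:

# List the OUTPUT chain of the mangle table with packet/byte counters;
# the MARK rule's counters should increase as new connections are made
iptables -t mangle -L OUTPUT -n -v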

Toggle the Plug While Reloading

Once everything is set up, all we need to do to gracefully reload HAProxy is to buffer SYNs before the reload, do the reload, and then release all SYNs after the reload. This will cause any connections that attempt to establish during the restart to observe latency equal to the amount of time it takes HAProxy to restart.

nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --buffer
service haproxy reload
nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite

In production we observe that this technique adds about 20ms of latency to incoming connections during the restart, but drops no requests.

Design Tradeoffs

This design has some benefits and some drawbacks. The largest drawback is that this works only for outgoing links and not for incoming traffic. This is because of the way that queueing disciplines work in Linux, namely that you can only shape outgoing traffic. For incoming traffic, one must redirect to an intermediary interface and then shape the outgoing traffic from that intermediary. We are working on integrating a solution similar to this for our external load balancers, but it is not yet in production.
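
For illustration, one common way to shape incoming traffic is to redirect it to an intermediate functional block (ifb) device and apply the egress rules there; a rough sketch (interface names are placeholders, and this is not the setup we run in production):

# Load the ifb module and bring up the intermediary device
modprobe ifb numifbs=1
ip link set dev ifb0 up
# Attach an ingress qdisc to the real interface
tc qdisc add dev eth0 handle ffff: ingress
# Redirect all incoming IP traffic to ifb0, where egress-style qdiscs
# (such as the prio/plug tree above) can then be applied
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0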

Furthermore, the qdiscs could also probably be tuned more efficiently. For example, we could insert the plug qdisc at the first prio lane and adjust the priomap accordingly to ensure that SYNs always get processed before other packets, or we could tune buffer sizes on the pfifo/plug qdiscs. I believe that for this to work with an interface that is not loopback, the plug lane would have to be moved to the first lane to ensure SYN deliverability.
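
An untested sketch of that variant, with the plug in the first lane and the default priomap values shifted so regular traffic lands in bands 1-3 (the interface name is a placeholder):

# Four-band prio with the plug as band 0; the 16 priomap entries are the
# defaults (1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1) shifted up by one band
tc qdisc add dev eth0 root handle 1: prio bands 4 \
    priomap 2 3 3 3 2 3 1 1 2 2 2 2 2 2 2 2
nl-qdisc-add --dev=eth0 --parent=1:1 --id=10: plug --limit 1048576
nl-qdisc-add --dev=eth0 --parent=1:1 --id=10: --update plug --release-indefinite
tc qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 1000
tc qdisc add dev eth0 parent 1:3 handle 30: pfifo limit 1000
tc qdisc add dev eth0 parent 1:4 handle 40: pfifo limit 1000
# Marked SYNs now go to the plug lane 1:1
tc filter add dev eth0 protocol ip parent 1:0 prio 1 handle 1 fw classid 1:1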

The reason that we decided to go with this solution over something like huptime, hacking file descriptor passing into HAProxy, or dancing between multiple local instances of HAProxy is because we deemed our qdisc solution the lowest risk. Huptime was ruled out quickly as we were unable to get it to function on our machines due to an old libc version, and we were uncertain whether the LD_PRELOAD mechanism would even work for something as complicated as HAProxy. One engineer did implement a proof of concept file descriptor passing patch during a hackathon, but the complexity of the patch and the prospect of maintaining a large fork caused us to abandon that approach; it turns out that doing file descriptor passing properly is really hard. Of the three options, we most seriously considered running multiple HAProxy instances on the same machine and using either NAT, nginx, or another HAProxy instance to switch traffic between them. Ultimately we decided against it because of the number of unknowns in implementation, and the level of maintenance that would be required for the infrastructure.

With our solution, we maintain basically zero infrastructure and trust the Linux kernel and HAProxy to handle the heavy lifting. This trust appears to be well placed as in the months this has been running in production we have observed no issues.

Experimental Setup

To demonstrate that this solution really works, we can fire up an nginx HTTP backend with HAProxy sitting in front, generate some traffic with Apache Benchmark, and see what happens when we restart HAProxy. We can then evaluate a few different solutions this way.

All tests were carried out on a freshly provisioned c3.large AWS machine running Ubuntu Trusty and a 3.13 Linux kernel. HAProxy 1.5.11 was compiled locally with TARGET=linux2628. Nginx was started locally with the default configuration except that it listens on port 8001 and serves a simple “pong” reply instead of the default html. Our compiled HAProxy was started locally with a basic configuration that had a single backend at port 8001 and a corresponding frontend at port 16000.
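
The exact HAProxy and nginx configurations are not included in the post; a minimal sketch that matches the description above (a single backend on port 8001, a frontend on port 16000; the names and timeouts are assumptions) might look like:

# /tmp/haproxy.cfg -- minimal sketch, not the original Yelp configuration
global
    daemon
    maxconn 4096

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend test_fe
    bind :16000
    default_backend test_be

backend test_be
    server local_nginx 127.0.0.1:8001

On the nginx side, a location block containing return 200 'pong'; in a server listening on port 8001 is enough to serve the “pong” reply.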

Just Reload HAProxy

In this experiment, we only restart HAProxy with the ‘-sf’ option, which initiates the fast reload process. This is a pretty unrealistic test because we are restarting HAProxy every 100ms, but it illustrates the point.

Experiment

# Restart haproxy every 100ms
while [ 1 ]; do
    ./haproxy -f /tmp/haproxy.cfg -p /tmp/haproxy.pid -sf $(cat /tmp/haproxy.pid)
    sleep 0.1
done

Results

$ ab -c 10 -n 10000 169.254.255.254:16000/
Benchmarking 169.254.255.254 (be patient)
...
apr_socket_recv: Connection reset by peer (104)
Total of 3652 requests completed

Socket reset! Restarting HAProxy caused us to fail a request even though our backend was healthy. If we tell Apache Benchmark to continue on receive errors and do more requests:

$ ab -r -c 10 -n 200000 169.254.255.254:16000/
Benchmarking 169.254.255.254 (be patient)
...
Complete requests: 200000
Failed requests: 504
...
50% 2
95% 2
99% 3
100% 15 (longest request)

Only 0.25% of requests failed. This is not too bad, but well above our goal of zero.

Drop SYNs and Let TCP Do the Rest

Now we try the method where we drop SYNs. This method seems to break down completely at a high restart rate, as you end up with connections backing off exponentially, so to get reliable results I could only restart HAProxy every second.

Experiment

# Restart haproxy every second
while [ 1 ]; do
    sudo iptables -I INPUT -p tcp --dport 16000 --syn -j DROP
    sleep 0.2
    ./haproxy -f /tmp/haproxy.cfg -p /tmp/haproxy.pid -sf $(cat /tmp/haproxy.pid)
    sudo iptables -D INPUT -p tcp --dport 16000 --syn -j DROP
    sleep 1
done

Results

$ ab -c 10 -n 200000 169.254.255.254:16000/
Benchmarking 169.254.255.254 (be patient)
...
Complete requests: 200000
Failed requests: 0
...
50% 2
95% 2
99% 6
100% 1002 (longest request)

Figure 2: Iptables Experiment Results

As expected, we drop no requests but incur an additional one second of latency. When request timings are plotted in Figure 2 we see a clear bimodal distribution where any requests that hit the restart take a full second to complete. Less than one percent of the test requests observe the high latency, but that is still enough to be a problem.

Use Our Graceful Restart Method

In this experiment, we restart HAProxy with the ‘-sf’ option and use our queueing strategy to delay incoming SYNs. Just to be sure we are not getting lucky, we do one million requests. In the process of this test we restarted HAProxy over 1500 times.

Experiment

# Restart haproxy every 100ms
while [ 1 ]; do
    sudo nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --buffer &> /dev/null
    ./haproxy -f /tmp/haproxy.cfg -p /tmp/haproxy.pid -sf $(cat /tmp/haproxy.pid)
    sudo nl-qdisc-add --dev=lo --parent=1:4 --id=40: --update plug --release-indefinite &> /dev/null
    sleep 0.100
done

Results

$ ab -c 10 -n 1000000 169.254.255.254:16000/
Benchmarking 169.254.255.254 (be patient)
...
Complete requests: 1000000
Failed requests: 0
...
50% 2
95% 2
99% 8
100% 29 (longest request)

Figure 3: TC Experiment Results

Success! Restarting HAProxy has basically no effect on our traffic, causing only minor delays, as can be seen in Figure 3. Note that this method is heavily dependent on how long HAProxy takes to load its configuration, and because we are running such a reduced configuration, these results are deceptively fast. In our production environment we do observe about a 20ms penalty during HAProxy restarts.

Conclusion

This technique appears to work quite well to achieve our goal of providing a rock-solid service infrastructure for our developers to build on. By delaying SYN packets coming into our HAProxy load balancers that run on each machine, we are able to minimally impact traffic during HAProxy reloads, which allows us to add, remove, and change service backends within our SOA without fear of significantly impacting user traffic.

Acknowledgements

Thanks to Josh Snyder, John Billings and Evan Krall for excellent design and implementation discussions.

Reposted from: https://www.cnblogs.com/davidwang456/p/4451369.html
