ZooKeeper 3.4: A Distributed Coordinator

[ZooKeeper is open-source software under Apache Hadoop that acts as a distributed coordinator. This article is taken from the official ZooKeeper website: http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html]

ZooKeeper Administrator's Guide: A Guide to Deployment and Administration

Deployment

This section contains information about deploying ZooKeeper and covers these topics:

System Requirements
Clustered (Multi-Server) Setup
Single Server and Developer Setup

The first two sections assume you are interested in installing ZooKeeper in a production environment such as a datacenter. The final section covers situations in which you are setting up ZooKeeper on a limited basis - for evaluation, testing, or development - but not in a production environment.

System Requirements

Supported Platforms

GNU/Linux is supported as a development and production platform for both server and client.
Sun Solaris is supported as a development and production platform for both server and client.
FreeBSD is supported as a development and production platform for clients only. Java NIO selector support in the FreeBSD JVM is broken.
Win32 is supported as a development platform only for both server and client.
MacOSX is supported as a development platform only for both server and client.

Required Software

ZooKeeper runs in Java, release 1.6 or greater (JDK 6 or greater). It runs as an ensemble of ZooKeeper servers. Three ZooKeeper servers is the minimum recommended size for an ensemble, and we also recommend that they run on separate machines. At Yahoo!, ZooKeeper is usually deployed on dedicated RHEL boxes, with dual-core processors, 2GB of RAM, and 80GB IDE hard drives.

Clustered (Multi-Server) Setup

For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble are up, the service will be available. Because ZooKeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.

Here are the steps to setting up a server that will be part of an ensemble. These steps should be performed on every host in the ensemble:

Install the Java JDK. You can use the native packaging system for your system, or download the JDK from: http://java.sun.com/javase/downloads/index.jsp
Set the Java heap size. This is very important to avoid swapping, which will seriously degrade ZooKeeper performance. To determine the correct value, use load tests, and make sure you are well below the usage limit that would cause you to swap. Be conservative - use a maximum heap size of 3GB for a 4GB machine.
Install the ZooKeeper Server Package. It can be downloaded from: http://zookeeper.apache.org/releases.html
Create a configuration file. This file can be called anything. Use the following settings as a starting point:
```
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
```
You can find the meanings of these and other configuration settings in the section Configuration Parameters. A word though about a few here: every machine that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. You accomplish this with the series of lines of the form server.id=host:port:port. The parameters host and port are straightforward. You attribute the server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.
The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
If your configuration file is set up, you can start a ZooKeeper server:
```
$ java -cp zookeeper.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.15.jar:conf \
    org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg
```
QuorumPeerMain starts a ZooKeeper server. JMX management beans are also registered, which allows management through a JMX management console. The ZooKeeper JMX document contains details on managing ZooKeeper with JMX. See the script bin/zkServer.sh, which is included in the release, for an example of starting server instances.
Test your deployment by connecting to the hosts:
- In Java, you can run the following command to execute simple operations:
```
$ java -cp zookeeper.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.15.jar:conf:src/java/lib/jline-0.9.94.jar \
    org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181
```
- In C, you can compile either the single threaded client or the multithreaded client in the c subdirectory of the ZooKeeper sources. This compiles the single threaded client:
```
$ make cli_st
```
  And this compiles the multithreaded client:
```
$ make cli_mt
```
  Running either program gives you a shell in which to execute simple file-system-like operations. To connect to ZooKeeper with the multithreaded client, for example, you would run:
```
$ cli_mt 127.0.0.1:2181 
```
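The relationship between the server.id lines and the per-host myid files can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `build_ensemble_config` is hypothetical, the host names and port numbers are the sample values from the configuration above, and ids are simply assigned in list order.

```python
def build_ensemble_config(hosts, data_dir="/var/lib/zookeeper/"):
    """Return (config file text, {host: myid file content}) for an ensemble.

    Hypothetical helper: ids are assigned 1, 2, 3, ... in list order.
    """
    if not 1 <= len(hosts) <= 255:
        raise ValueError("server ids must stay within the range 1..255")
    lines = [
        "tickTime=2000",
        f"dataDir={data_dir}",
        "clientPort=2181",
        "initLimit=5",
        "syncLimit=2",
    ]
    myids = {}
    for server_id, host in enumerate(hosts, start=1):
        # Every server lists every member of the ensemble.
        lines.append(f"server.{server_id}={host}:2888:3888")
        # The myid file holds a single line with only that machine's id.
        myids[host] = f"{server_id}\n"
    return "\n".join(lines) + "\n", myids


config_text, myid_by_host = build_ensemble_config(["zoo1", "zoo2", "zoo3"])
print(config_text)
print(myid_by_host)
```

The same configuration text is written to every host, while each host receives only its own myid content in its data directory.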

Single Server and Developer Setup

If you want to set up ZooKeeper for development purposes, you will probably want to set up a single server instance of ZooKeeper, and then install either the Java or C client-side libraries and bindings on your development machine.

The steps to setting up a single server instance are similar to the above, except the configuration file is simpler. You can find the complete instructions in the Installing and Running ZooKeeper in Single Server Mode section of the ZooKeeper Getting Started Guide.

For information on installing the client side libraries, refer to the Bindings section of the ZooKeeper Programmer's Guide.

Administration

This section contains information about running and maintaining ZooKeeper and covers these topics:

Designing a ZooKeeper Deployment
Provisioning
Things to Consider: ZooKeeper Strengths and Limitations
Administering
Maintenance
Supervision
Monitoring
Logging
Troubleshooting
Configuration Parameters
ZooKeeper Commands: The Four Letter Words
Data File Management
Things to Avoid
Best Practices

Designing a ZooKeeper Deployment

The reliability of ZooKeeper rests on two basic assumptions.

Only a minority of servers in a deployment will fail. Failure in this context means a machine crash, or some error in the network that partitions a server off from the majority.
Deployed machines operate correctly. To operate correctly means to execute code correctly, to have clocks that work properly, and to have storage and network components that perform consistently.

The sections below contain considerations for ZooKeeper administrators to maximize the probability for these assumptions to hold true. Some of these are cross-machine considerations, and others are things you should consider for each and every machine in your deployment.

Cross Machine Requirements

For the ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines.

To achieve the highest probability of tolerating a failure you should try to make machine failures independent. For example, if most of the machines share the same switch, failure of that switch could cause a correlated failure and bring down the service. The same holds true of shared power circuits, cooling systems, etc.
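The majority arithmetic can be checked with a short Python sketch. The function names are illustrative only; the calculation is plain strict-majority counting and does not cover weighted or hierarchical quorums.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers may fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in range(3, 8):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from five to six servers raises the quorum from three to four without raising the number of tolerated failures, which is why even-sized ensembles buy no extra fault tolerance.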

Single Machine Requirements

If ZooKeeper has to contend with other applications for access to resources like storage media, CPU, network, or memory, its performance will suffer markedly. ZooKeeper has strong durability guarantees, which means it uses storage media to log changes before the operation responsible for the change is allowed to complete. You should be aware of this dependency then, and take great care if you want to ensure that ZooKeeper operations aren't held up by your media. Here are some things you can do to minimize that sort of degradation:

ZooKeeper's transaction log must be on a dedicated device. (A dedicated partition is not enough.) ZooKeeper writes the log sequentially, without seeking. Sharing your log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.
Do not put ZooKeeper in a situation that can cause a swap. In order for ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to ZooKeeper. For more on this, see Things to Avoid below.

Provisioning

Things to Consider: ZooKeeper Strengths and Limitations

Administering

Maintenance

Little long term maintenance is required for a ZooKeeper cluster; however, you must be aware of the following:

Ongoing Data Directory Cleanup

The ZooKeeper Data Directory contains files which are a persistent copy of the znodes stored by a particular serving ensemble. These are the snapshot and transactional log files. As changes are made to the znodes, these changes are appended to a transaction log. Occasionally, when a log grows large, a snapshot of the current state of all znodes is written to the filesystem. This snapshot supersedes all previous logs.

A ZooKeeper server will not remove old snapshots and log files when using the default configuration (see autopurge below); this is the responsibility of the operator. Every serving environment is different and therefore the requirements of managing these files may differ from install to install (backup for example).

The PurgeTxnLog utility implements a simple retention policy that administrators can use. The API docs contain details on calling conventions (arguments, etc.).

In the following example the last <count> snapshots and their corresponding logs are retained and the others are deleted. The value of <count> should typically be greater than 3 (although not required, this provides 3 backups in the unlikely event a recent log has become corrupted). This can be run as a cron job on the ZooKeeper server machines to clean up the logs daily.

 java -cp zookeeper.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.15.jar:conf org.apache.zookeeper.server.PurgeTxnLog <dataDir> <snapDir> -n <count>

Automatic purging of the snapshots and corresponding transaction logs was introduced in version 3.4.0 and can be enabled via the configuration parameters autopurge.snapRetainCount and autopurge.purgeInterval. For more on this, see Advanced Configuration below.
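The retention idea (keep the newest snapshots plus the logs needed to replay from them) can be sketched in Python. This is a simplified illustration and not the actual PurgeTxnLog algorithm. It assumes files are named `snapshot.<hex zxid>` and `log.<hex zxid>`, and the function name `select_files_to_purge` is hypothetical.

```python
def select_files_to_purge(file_names, count):
    """Split snapshot/log file names into (retained, purged) lists."""
    if count < 1:
        raise ValueError("count must be at least 1")

    def zxid(name):
        # Assumed naming: the suffix after the first dot is a hex transaction id.
        return int(name.split(".", 1)[1], 16)

    snapshots = sorted((n for n in file_names if n.startswith("snapshot.")), key=zxid)
    logs = sorted((n for n in file_names if n.startswith("log.")), key=zxid)

    retained_snaps = snapshots[-count:]
    if not retained_snaps:
        # Nothing to anchor the purge on, so keep everything.
        return sorted(file_names), []

    oldest = zxid(retained_snaps[0])
    # Keep logs starting at or after the oldest retained snapshot, plus the
    # newest log that starts before it, since that log may still hold
    # transactions that are newer than the snapshot.
    older_logs = [n for n in logs if zxid(n) < oldest]
    retained_logs = older_logs[-1:] + [n for n in logs if zxid(n) >= oldest]

    retained = retained_snaps + retained_logs
    purged = [n for n in snapshots + logs if n not in retained]
    return retained, purged


files = ["snapshot.0", "log.1", "snapshot.a", "log.b",
         "snapshot.14", "log.15", "snapshot.1e", "log.1f"]
retained, purged = select_files_to_purge(files, count=3)
print("retained:", retained)
print("purged:", purged)
```

The function only computes which names would go; actually deleting files is left to the caller, which makes a dry run easy before anything is scheduled in cron.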

Debug Log Cleanup (log4j)

See the section on logging in this document. It is expected that you will set up a rolling file appender using the in-built log4j feature. The sample configuration file in the release tar's conf/log4j.properties provides an example of this.

Supervision

You will want to have a supervisory process that manages each of your ZooKeeper server processes (JVM). The ZK server is designed to be "fail fast", meaning that it will shut down (process exit) if an error occurs that it cannot recover from. As a ZooKeeper serving cluster is highly reliable, this means that while the server may go down, the cluster as a whole is still active and serving requests. Additionally, as the cluster is "self healing", the failed server, once restarted, will automatically rejoin the ensemble without any manual interaction.

Having a supervisory process such as daemontools or SMF managing your ZooKeeper server ensures that if the process does exit abnormally it will automatically be restarted and will quickly rejoin the cluster. (Other supervisory tools are also available; these are just two examples, and the choice is up to you.)

Monitoring

The ZooKeeper service can be monitored in one of two primary ways: 1) the command port through the use of 4 letter words and 2) JMX. See the appropriate section for your environment/requirements.
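A minimal Python sketch of a command-port check follows. It assumes the server answers a four-letter command such as `ruok` or `stat` on the client port and then closes the connection, and that the `stat` reply contains a `Mode:` line. The helper names `four_letter_word` and `parse_mode` are illustrative.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command and return the server's full reply as text."""
    if len(command) != 4:
        raise ValueError("expected a four-letter command such as 'ruok' or 'stat'")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(stat_reply):
    """Return the value of the 'Mode:' line in a stat reply, or None if absent."""
    for line in stat_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Against a running server on the sample client port, `four_letter_word("127.0.0.1", 2181, "ruok")` would return the health reply, and `parse_mode(four_letter_word("127.0.0.1", 2181, "stat"))` would report whether the server is a leader or a follower.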

Logging

ZooKeeper uses log4j version 1.2 as its logging infrastructure. The ZooKeeper default log4j.properties file resides in the conf directory. Log4j requires that log4j.properties either be in the working directory (the directory from which ZooKeeper is run) or be accessible from the classpath.

For more information, see Log4j Default Initialization Procedure of the log4j manual.

Troubleshooting

Server not coming up because of file corruption

A server might not be able to read its database and fail to come up because of some file corruption in the transaction logs of the ZooKeeper server. You will see some IOException on loading the ZooKeeper database. In such a case, make sure all the other servers in your ensemble are up and working. Use the "stat" command on the command port to see if they are in good health. After you have verified that all the other servers of the ensemble are up, you can go ahead and clean the database of the corrupt server. Delete all the files in datadir/version-2 and datalogdir/version-2/. Restart the server.

Configuration Parameters

ZooKeeper's behavior is governed by the ZooKeeper configuration file. This file is designed so that the exact same file can be used by all the servers that make up a ZooKeeper service, assuming the disk layouts are the same. If servers use different configuration files, care must be taken to ensure that the list of servers in all of the different configuration files match.

Minimum Configuration

Here are the minimum configuration keywords that must be defined in the configuration file:

clientPort

the port to listen for client connections; that is, the port that clients attempt to connect to.

dataDir

the location where ZooKeeper will store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.

Note

Be careful where you put the transaction log. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance.

tickTime

the length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds. It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be two ticks.

Advanced Configuration

The configuration settings in this section are optional. You can use them to further fine tune the behaviour of your ZooKeeper servers. Some can also be set using Java system properties, generally of the form zookeeper.keyword. The exact system property, when available, is noted below.

dataLogDir

(No Java system property)

This option will direct the machine to write the transaction log to the dataLogDir rather than the dataDir. This allows a dedicated log device to be used, and helps avoid competition between logging and snapshots.

Note

Having a dedicated log device has a large impact on throughput and stable latencies. It is highly recommended to dedicate a log device and set dataLogDir to point to a directory on that device, and then make sure to point dataDir to a directory not residing on that device.

globalOutstandingLimit

(Java system property:zookeeper.globalOutstandingLimit.)
Clients can submit requests faster than ZooKeeper can process them, especially if there are a lot of clients. To prevent ZooKeeper from running out of memory due to queued requests, ZooKeeper will throttle clients so that there are no more than globalOutstandingLimit outstanding requests in the system. The default limit is 1,000.

preAllocSize

(Java system property:zookeeper.preAllocSize)
To avoid seeks ZooKeeper allocates space in the transaction log file in blocks of preAllocSize kilobytes. The default block size is 64M. One reason for changing the size of the blocks is to reduce the block size if snapshots are taken more often. (Also, see snapCount)

snapCount

(Java system property:zookeeper.snapCount)
ZooKeeper logs transactions to a transaction log. After snapCount transactions are written to a log file a snapshot is started and a new transaction log file is created. The default snapCount is 100,000.

traceFile

(Java system property: requestTraceFile)
If this option is defined, requests will be logged to a trace file named traceFile.year.month.day. Use of this option provides useful debugging information, but will impact performance. (Note: The system property has no zookeeper prefix, and the configuration variable name is different from the system property. Yes - it's not consistent, and it's annoying.)

maxClientCnxns

(No Java system property)
Limits the number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. This is used to prevent certain classes of DoS attacks, including file descriptor exhaustion. The default is 60. Setting this to 0 entirely removes the limit on concurrent connections.

clientPortAddress

New in 3.3.0: the address (ipv4, ipv6 or hostname) to listen for client connections; that is, the address that clients attempt to connect to. This is optional; by default we bind in such a way that any connection to the clientPort for any address/interface/nic on the server will be accepted.

minSessionTimeout

(No Java system property)
New in 3.3.0: the minimum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 2 times the tickTime.

maxSessionTimeout

(No Java system property)
New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.
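How these two bounds interact with tickTime can be illustrated with a small Python sketch. It assumes the server simply clamps the client's requested timeout into the allowed range; the function name and parameters are illustrative, not part of any ZooKeeper API.

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000, min_ms=None, max_ms=None):
    """Clamp a requested session timeout into [minSessionTimeout, maxSessionTimeout].

    When the bounds are not configured, they default to 2 and 20 ticks.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


for requested in (1000, 30000, 100000):
    print(requested, "->", negotiate_session_timeout(requested))
```

With the sample tickTime of 2000 ms, the defaults work out to a range of 4000 ms to 40000 ms, so a request for 1000 ms is raised and a request for 100000 ms is lowered.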

fsync.warningthresholdms

(Java system property: fsync.warningthresholdms)
New in 3.3.4: A warning message will be output to the log whenever an fsync in the Transactional Log (WAL) takes longer than this value. The value is specified in milliseconds and defaults to 1000. This value can only be set as a system property.

autopurge.snapRetainCount

(No Java system property)
New in 3.4.0: When enabled, the ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest. Defaults to 3. Minimum value is 3.

autopurge.purgeInterval

(No Java system property)
New in 3.4.0: The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.

Cluster Options

The options in this section are designed for use with an ensemble of servers -- that is, when deploying clusters of servers.

electionAlg

(No Java system property)

Election implementation to use. A value of "0" corresponds to the original UDP-based version, "1" corresponds to the non-authenticated UDP-based version of fast leader election, "2" corresponds to the authenticated UDP-based version of fast leader election, and "3" corresponds to the TCP-based version of fast leader election. Currently, algorithm 3 is the default.

Note

The implementations of leader election 0, 1, and 2 are now deprecated. We have the intention of removing them in the next release, at which point only the FastLeaderElection will be available.

initLimit

(No Java system property)

Amount of time, in ticks (see tickTime), to allow followers to connect and sync to a leader. Increase this value as needed, if the amount of data managed by ZooKeeper is large.

leaderServes

(Java system property: zookeeper.leaderServes)

Leader accepts client connections. Default value is "yes". The leader machine coordinates updates. For higher update throughput at the slight expense of read throughput, the leader can be configured to not accept clients and focus on coordination. The default for this option is yes, which means that a leader will accept client connections.

Note

Turning on leader selection is highly recommended when you have more than three ZooKeeper servers in an ensemble.

server.x=[hostname]:nnnnn[:nnnnn], etc

(No Java system property)

Servers making up the ZooKeeper ensemble. When the server starts up, it determines which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII, and it should match x in server.x in the left hand side of this setting. The list of servers that make up ZooKeeper servers that is used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. There are two port numbers nnnnn. The first is used by followers to connect to the leader, and the second is for leader election. The leader election port is only necessary if electionAlg is 1, 2, or 3 (default). If electionAlg is 0, then the second port is not necessary. If you want to test multiple servers on a single machine, then different ports can be used for each server.
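As a rough illustration of how a server.x line decomposes and how myid is matched against it, here is a Python sketch. It is not ZooKeeper's own parser; the helper names `parse_servers` and `check_myid` are hypothetical, and error handling is kept to a minimum.

```python
def parse_servers(config_text):
    """Return {server id: (host, quorum port, election port or None)}."""
    servers = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line.startswith("server."):
            continue  # skip every other configuration keyword
        key, value = line.split("=", 1)
        server_id = int(key[len("server."):])
        parts = value.strip().split(":")
        host = parts[0]
        quorum_port = int(parts[1])
        # The second port is only present when leader election needs it.
        election_port = int(parts[2]) if len(parts) > 2 else None
        servers[server_id] = (host, quorum_port, election_port)
    return servers


def check_myid(myid_text, servers):
    """Return the id from a myid file, or raise if no server.x line matches it."""
    server_id = int(myid_text.strip())
    if server_id not in servers:
        raise ValueError(f"myid {server_id} has no matching server.{server_id} line")
    return server_id


sample = "clientPort=2181\nserver.1=zoo1:2888:3888\nserver.2=zoo2:2888:3888\n"
servers = parse_servers(sample)
print(servers)
print(check_myid("1\n", servers))
```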

syncLimit

(No Java system property)

Amount of time, in ticks (see tickTime), to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped.

group.x=nnnnn[:nnnnn]

(No Java system property)

Enables a hierarchical quorum construction. "x" is a group identifier and the numbers following the "=" sign correspond to server identifiers. The right-hand side of the assignment is a colon-separated list of server identifiers. Note that groups must be disjoint and the union of all groups must be the ZooKeeper ensemble. You will find an example at http://zookeeper.apache.org/doc/trunk/zookeeperHierarchicalQuorums.html

weight.x=nnnnn

(No Java system property)

Used along with "group", it assigns a weight to a server when forming quorums. Such a value corresponds to the weight of a server when voting. There are a few parts of ZooKeeper that require voting, such as leader election and the atomic broadcast protocol. By default the weight of a server is 1. If the configuration defines groups, but not weights, then a value of 1 will be assigned to all servers. You will find an example at http://zookeeper.apache.org/doc/trunk/zookeeperHierarchicalQuorums.html
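The way groups and weights combine can be sketched in Python. The rule assumed here is that a set of voters forms a quorum when it holds more than half of the weight in more than half of the groups that have non-zero total weight. That rule is an assumption drawn from the hierarchical quorums documentation, and the function below is an illustration rather than ZooKeeper's implementation.

```python
def is_hierarchical_quorum(voters, groups, weights=None):
    """Check whether `voters` (a set of server ids) forms a hierarchical quorum.

    groups:  {group id: list of server ids}, disjoint, covering the ensemble
    weights: {server id: weight}; servers not listed default to weight 1
    """
    weights = weights or {}

    def weight(server_id):
        return weights.get(server_id, 1)

    # Groups whose total weight is zero do not take part in the decision.
    active = [members for members in groups.values()
              if sum(weight(s) for s in members) > 0]

    groups_with_majority = 0
    for members in active:
        total = sum(weight(s) for s in members)
        voted = sum(weight(s) for s in members if s in voters)
        if voted > total / 2:
            groups_with_majority += 1
    return groups_with_majority > len(active) / 2


groups = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}
print(is_hierarchical_quorum({1, 2, 4, 5}, groups))     # majority in two of three groups
print(is_hierarchical_quorum({1, 2, 3, 4, 7}, groups))  # majority in only one group
```

In the second call five of nine servers vote, a plain majority, yet they do not form a hierarchical quorum because they dominate only one group.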

cnxTimeout

(Java system property: zookeeper.cnxTimeout)

Sets the timeout value for opening connections for leader election notifications. Only applicable if you are using electionAlg 3.

Note

Default value is 5 seconds.

Authentication & Authorization Options

The options in this section allow control over authentication/authorization performed by the service.

zookeeper.DigestAuthenticationProvider.superDigest

(Java system property only: zookeeper.DigestAuthenticationProvider.superDigest)

By default this feature is disabled.
New in 3.2: Enables a ZooKeeper ensemble administrator to access the znode hierarchy as a "super" user. In particular no ACL checking occurs for a user authenticated as super.
org.apache.zookeeper.server.auth.DigestAuthenticationProvider can be used to generate the superDigest; call it with one parameter of "super:<password>". Provide the generated "super:<data>" as the system property value when starting each server of the ensemble.
When authenticating to a ZooKeeper server (from a ZooKeeper client), pass a scheme of "digest" and authdata of "super:<password>". Note that digest auth passes the authdata in plaintext to the server, so it would be prudent to use this authentication method only on localhost (not over the network) or over an encrypted connection.
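As a rough illustration of what the generated value looks like, the sketch below computes a digest the way the stock provider is generally understood to do it: the user name, a colon, then the Base64 encoding of the SHA-1 hash of the whole "user:password" string. Treat this as an assumption to verify against the output of the Java class itself; the function name and the UTF-8 encoding choice are part of the example, not of ZooKeeper.

```python
import base64
import hashlib

def generate_digest(id_password):
    """Return '<user>:<base64(sha1(user:password))>' for a 'user:password' string.

    Assumption: mirrors the SHA-1 + Base64 scheme commonly attributed to
    DigestAuthenticationProvider; check against the Java tool before use.
    """
    user = id_password.split(":", 1)[0]
    sha1 = hashlib.sha1(id_password.encode("utf-8")).digest()
    return user + ":" + base64.b64encode(sha1).decode("ascii")
```

The result of generate_digest("super:<password>") would then be passed to each server as -Dzookeeper.DigestAuthenticationProvider.superDigest=... at startup.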

Experimental Options/Features

New features that are currently considered experimental.

Read Only Mode Server

(Java system property: readonlymode.enabled)

New in 3.4.0: Setting this value to true enables Read Only Mode server support (disabled by default). ROM allows client sessions which requested ROM support to connect to the server even when the server might be partitioned from the quorum. In this mode ROM clients can still read values from the ZK service, but will be unable to write values or see changes from other clients. See ZOOKEEPER-784 for more details.

Unsafe Options

The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does.

forceSync

(Java system property: zookeeper.forceSync)
Requires updates to be synced to media of the transaction log before finishing processing the update. If this option is set to no, ZooKeeper will not require updates to be synced to the media.

jute.maxbuffer

(Java system property: jute.maxbuffer)
This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.

skipACL

(Java system property: zookeeper.skipACL)
Skips ACL checks. This results in a boost in throughput, but opens up full access to the data tree to everyone.

Communication using the Netty framework

New in 3.4: Netty is an NIO based client/server communication framework; it simplifies (over NIO being used directly) many of the complexities of network level communication for Java applications. Additionally the Netty framework has built-in support for encryption (SSL) and authentication (certificates). These are optional features and can be turned on or off individually.

Prior to version 3.4 ZooKeeper always used NIO directly; in versions 3.4 and later Netty is supported as an alternative to NIO. NIO continues to be the default, but Netty based communication can be used in place of NIO by setting the Java system property "zookeeper.serverCnxnFactory" to "org.apache.zookeeper.server.NettyServerCnxnFactory". You have the option of setting this on either the client(s) or server(s); typically you would want to set this on both, but that is at your discretion.

TBD - tuning options for netty - currently there are none that are netty specific but we should add some. Esp around max bound on the number of reader worker threads netty creates.

TBD - how to manage encryption

TBD - how to manage certificates

ZooKeeper Commands: The Four Letter Words

ZooKeeper responds to a small set of commands. Each command is composed of four letters. You issue the commands to ZooKeeper via telnet or nc, at the client port.

Three of the more interesting commands: "stat" gives some general information about the server and connected clients, while "srvr" and "cons" give extended details on server and connections respectively.
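Besides telnet or nc, the same exchange can be done from a script. The following Python sketch sends a four letter word to the client port and reads the reply until the server closes the connection. The function names are made up for this example, and the host and port in the usage note are placeholders.

```python
import socket

def exchange(sock, cmd):
    """Send a four letter word on a connected socket and read until the peer closes."""
    if len(cmd) != 4:
        raise ValueError("expected a four letter command, got %r" % (cmd,))
    sock.sendall(cmd.encode("ascii"))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # the server closes the connection after replying
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def four_letter_word(host, port, cmd, timeout=5.0):
    """Open a TCP connection to the client port and run one command."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, cmd)
```

For example, four_letter_word("127.0.0.1", 2181, "ruok") would be expected to return "imok" against a running server listening on that (assumed) client port.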

conf

New in 3.3.0: Print details about serving configuration.

cons

New in 3.3.0: List full connection/session details for all clients connected to this server. Includes information on numbers of packets received/sent, session id, operation latencies, last operation performed, etc...

crst

New in 3.3.0: Reset connection/session statistics for all connections.

dump

Lists the outstanding sessions and ephemeral nodes. This only works on the leader.

envi

Print details about serving environment

ruok

Tests if server is running in a non-error state. The server will respond with imok if it is running. Otherwise it will not respond at all.
A response of "imok" does not necessarily indicate that the server has joined the quorum, just that the server process is active and bound to the specified client port. Use "stat" for details on state wrt quorum and client connection information.

srst

Reset server statistics.

srvr

New in 3.3.0: Lists full details for the server.

stat

Lists brief details for the server and connected clients.

wchs

New in 3.3.0: Lists brief information on watches for the server.

wchc

New in 3.3.0: Lists detailed information on watches for the server, by session. This outputs a list of sessions(connections) with associated watches (paths). Note, depending on the number of watches this operation may be expensive (ie impact server performance), use it carefully.

wchp

New in 3.3.0: Lists detailed information on watches for the server, by path. This outputs a list of paths (znodes) with associated sessions. Note, depending on the number of watches this operation may be expensive (ie impact server performance), use it carefully.

mntr

New in 3.4.0: Outputs a list of variables that could be used for monitoring the health of the cluster.

$ echo mntr | nc localhost 2185
zk_version  3.4.0
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 70
zk_packets_sent 69
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count   4
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    27
zk_followers    4                   - only exposed by the Leader
zk_synced_followers 4               - only exposed by the Leader
zk_pending_syncs    0               - only exposed by the Leader
zk_open_file_descriptor_count 23    - only available on Unix platforms
zk_max_file_descriptor_count 1024   - only available on Unix platforms

The output is compatible with java properties format and the content may change over time (new keys added). Your scripts should expect changes.
ATTENTION: Some of the keys are platform specific and some of the keys are only exported by the Leader.
The output contains multiple lines with the following format:

key \t value
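Since each line is a key and a value separated by a tab, a monitoring script can turn the mntr output into a dictionary with very little code. The sketch below is one possible way to do that; the function name and the sample text are illustrative, and keys that are missing (for example, leader-only keys on a follower) are simply absent from the result.

```python
def parse_mntr(text):
    """Parse 'key<TAB>value' lines into a dict, converting integer values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        stats[key] = int(value) if value.lstrip("-").isdigit() else value
    return stats

# Shortened sample in the same format as the output shown above.
SAMPLE_MNTR = (
    "zk_version\t3.4.0\n"
    "zk_avg_latency\t0\n"
    "zk_outstanding_requests\t0\n"
    "zk_server_state\tleader\n"
    "zk_followers\t4\n"
)
```

Scripts should read optional keys with stats.get("zk_followers") rather than indexing directly, because the set of keys depends on the platform and on whether the server is the leader.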

Here's an example of the ruok command:

$ echo ruok | nc 127.0.0.1 5111
imok

Data File Management

ZooKeeper stores its data in a data directory and its transaction log in a transaction log directory. By default these two directories are the same. The server can (and should) be configured to store the transaction log files in a separate directory from the data files. Throughput increases and latency decreases when transaction logs reside on a dedicated log device.

The Data Directory

This directory has two files in it:

myid - contains a single integer in human readable ASCII text that represents the server id.
snapshot.<zxid> - holds the fuzzy snapshot of a data tree.

Each ZooKeeper server has a unique id. This id is used in two places: the myid file and the configuration file. The myid file identifies the server that corresponds to the given data directory. The configuration file lists the contact information for each server identified by its server id. When a ZooKeeper server instance starts, it reads its id from the myid file and then, using that id, reads from the configuration file, looking up the port on which it should listen.

The snapshot files stored in the data directory are fuzzy snapshots in the sense that during the time the ZooKeeper server is taking the snapshot, updates are occurring to the data tree. The suffix of the snapshot file names is the zxid, the ZooKeeper transaction id, of the last committed transaction at the start of the snapshot. Thus, the snapshot includes a subset of the updates to the data tree that occurred while the snapshot was in process. The snapshot, then, may not correspond to any data tree that actually existed, and for this reason we refer to it as a fuzzy snapshot. Still, ZooKeeper can recover using this snapshot because it takes advantage of the idempotent nature of its updates. By replaying the transaction log against fuzzy snapshots ZooKeeper gets the state of the system at the end of the log.
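The recovery argument can be illustrated with a toy model. In the Python sketch below, the data tree is a plain dictionary and every transaction is idempotent (it sets a path to a value or deletes a path), so applying it to a snapshot that already contains its effect changes nothing. This is a deliberately simplified model with made-up names, not ZooKeeper's actual transaction format.

```python
def apply_txn(tree, txn):
    """Apply one idempotent transaction (zxid, op, path, value) to the tree."""
    _zxid, op, path, value = txn
    if op == "set":
        tree[path] = value
    elif op == "delete":
        tree.pop(path, None)  # deleting a missing path is a no-op

def replay(snapshot, log, from_zxid):
    """Rebuild state from a snapshot plus all logged transactions after from_zxid."""
    tree = dict(snapshot)
    for txn in log:
        if txn[0] > from_zxid:
            apply_txn(tree, txn)
    return tree

LOG = [
    (1, "set", "/a", "1"),
    (2, "set", "/b", "1"),
    (3, "set", "/a", "2"),
    (4, "delete", "/b", None),
    (5, "set", "/c", "1"),
]

# Snapshot started when zxid 2 was the last committed transaction. While it was
# being written, txns 3 and 4 were applied: "/a" was serialized before txn 3,
# but "/b" was already deleted by txn 4. This tree never existed at any zxid.
FUZZY_SNAPSHOT = {"/a": "1"}
```

Replaying the log from zxid 2 on top of FUZZY_SNAPSHOT gives the same tree as replaying the entire log from an empty tree, which is the property the paragraph above relies on.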

The Log Directory

The Log Directory contains the ZooKeeper transaction logs. Before any update takes place, ZooKeeper ensures that the transaction that represents the update is written to non-volatile storage. A new log file is started each time a snapshot is begun. The log file's suffix is the first zxid written to that log.

File Management

The format of snapshot and log files does not change between standalone ZooKeeper servers and different configurations of replicated ZooKeeper servers. Therefore, you can pull these files from a running replicated ZooKeeper server to a development machine with a stand-alone ZooKeeper server for troubleshooting.

Using older log and snapshot files, you can look at the previous state of ZooKeeper servers and even restore that state. The LogFormatter class allows an administrator to look at the transactions in a log.

The ZooKeeper server creates snapshot and log files, but never deletes them. The retention policy of the data and log files is implemented outside of the ZooKeeper server. The server itself only needs the latest complete fuzzy snapshot and the log files from the start of that snapshot. See the maintenance section in this document for more details on setting a retention policy and maintenance of ZooKeeper storage.
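As an illustration of what an external retention policy has to decide, the Python sketch below picks the files that could be removed while keeping the newest snapshots and every log that may still be needed to replay on top of them. It assumes file names of the form snapshot.<zxid> and log.<zxid> with a hexadecimal zxid suffix, and the function name and the keep parameter are invented for this example; it only computes a list and deletes nothing.

```python
def files_to_purge(names, keep=3):
    """Return file names that are safe to remove under a keep-N-snapshots policy."""
    def by_zxid(prefix):
        # Sort files with the given prefix by their hexadecimal zxid suffix.
        return sorted(
            (int(n.split(".", 1)[1], 16), n) for n in names if n.startswith(prefix)
        )

    snaps = by_zxid("snapshot.")
    logs = by_zxid("log.")
    if len(snaps) <= keep:
        return []  # nothing old enough to remove

    oldest_kept_zxid = snaps[-keep][0]
    purge = [name for _zxid, name in snaps[:-keep]]

    # A log that starts at or before the oldest kept snapshot may still hold
    # later transactions, so conservatively keep the newest such log.
    older_logs = [(z, n) for z, n in logs if z <= oldest_kept_zxid]
    purge += [name for _zxid, name in older_logs[:-1]]
    return sorted(purge)
```

A real deployment would more likely rely on the tooling covered in the maintenance section than on a hand-written script; the sketch is only meant to show which files the server no longer needs.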

Things to Avoid

Here are some common problems you can avoid by configuring ZooKeeper correctly:

inconsistent lists of servers

The list of ZooKeeper servers used by the clients must match the list of ZooKeeper servers that each ZooKeeper server has. Things work okay if the client list is a subset of the real list, but things will really act strange if clients have a list of ZooKeeper servers that are in different ZooKeeper clusters. Also, the server lists in each Zookeeper server configuration file should be consistent with one another.

incorrect placement of transaction log

The most performance critical part of ZooKeeper is the transaction log. ZooKeeper syncs transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely affect performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it should mitigate it.

incorrect Java heap size

You should take special care to set your Java max heap size correctly. In particular, you should not create a situation in which ZooKeeper swaps to disk. The disk is death to ZooKeeper. Everything is ordered, so if processing one request swaps to disk, all other queued requests will probably do the same. DON'T SWAP.
Be conservative in your estimates: if you have 4G of RAM, do not set the Java max heap size to 6G or even 4G. For example, it is more likely you would use a 3G heap for a 4G machine, as the operating system and the cache also need memory. The best and only recommended practice for estimating the heap size your system needs is to run load tests, and then make sure you are well below the usage limit that would cause the system to swap.

Best Practices

For best results, take note of the following list of good ZooKeeper practices:

For multi-tenant installations, see the section detailing ZooKeeper "chroot" support; this can be very useful when deploying many applications/services interfacing to a single ZooKeeper cluster.
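In client connect strings, the chroot is written as a path suffix after the host list, so each application can be rooted at its own subtree. The Python sketch below splits such a string into its server list and chroot path; the host names and the /app/a path are placeholders, the function name is invented, and the default port of 2181 for entries without a port is an assumption of this example.

```python
def split_connect_string(connect, default_port=2181):
    """Split 'host:port,host:port/chroot' into ([(host, port), ...], chroot_or_None)."""
    slash = connect.find("/")
    if slash == -1:
        hosts, chroot = connect, None
    else:
        hosts, chroot = connect[:slash], connect[slash:]
    servers = []
    for entry in hosts.split(","):
        host, _sep, port = entry.strip().partition(":")
        servers.append((host, int(port) if port else default_port))
    return servers, chroot
```

With a connect string such as "zoo1:2181,zoo2:2181,zoo3:2181/app/a", every path the client uses would be interpreted relative to /app/a, which keeps tenants from stepping on each other's znodes.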
