03. Shard allocation and cluster-level routing settings

Contents

1. Overview
2. Cluster-level shard allocation settings
- 1. Shard allocation settings
  - 1. cluster.routing.allocation.enable
  - 2. cluster.routing.allocation.node_concurrent_incoming_recoveries
  - 3. cluster.routing.allocation.node_concurrent_outgoing_recoveries
  - 4. cluster.routing.allocation.node_concurrent_recoveries
  - 5. cluster.routing.allocation.node_initial_primaries_recoveries
  - 6. cluster.routing.allocation.same_shard.host
- 2. Shard rebalancing settings
  - 1. cluster.routing.rebalance.enable
  - 2. cluster.routing.allocation.allow_rebalance
  - 3. cluster.routing.allocation.cluster_concurrent_rebalance
- 3. Shard balancing factors
  - 1. cluster.routing.allocation.balance.shard
  - 2. cluster.routing.allocation.balance.index
  - 3. cluster.routing.allocation.balance.threshold
- 4. How allocation and rebalancing differ and relate
3. Disk-based shard allocation
- 1. cluster.routing.allocation.disk.threshold_enabled
- 2. cluster.routing.allocation.disk.watermark.low
- 3. cluster.routing.allocation.disk.watermark.high
- 4. cluster.routing.allocation.disk.watermark.flood_stage
- 5. cluster.info.update.interval
- 6. cluster.routing.allocation.disk.include_relocations
- 7. A usage example
4. Making allocation aware of nodes through attribute settings
- 1. Enabling allocation awareness
- 2. What is forced awareness?
5. Cluster-level shard allocation filtering
- 1. include
- 2. require
- 3. exclude
- 4. Wildcards are also supported

1. Overview

These notes cover how the master manages shards: the master decides which node each shard is allocated to, and when to move shards between nodes in order to rebalance the cluster.

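2. Cluster-level shard allocation settings

Shard allocation is the process of creating a shard on a particular node. It happens during initial recovery, replica allocation, rebalancing, and when nodes are added or removed.

1. Shard allocation settings

The following settings control shard allocation and recovery.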

1. cluster.routing.allocation.enable

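Enables or disables allocation for specific kinds of shards:

all : (default) allows allocation for all kinds of shards
primaries : allows allocation only for primary shards
new_primaries : allows allocation only for primary shards of new indices
none : no shard allocation of any kind is allowed

This setting does not affect the recovery of local primary shards when a node restarts. A restarted node that holds a copy of an unassigned primary shard recovers that primary immediately, provided its allocation ID matches one of the active allocation IDs recorded in the cluster state.

For what an allocation ID is, see the linked reference.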
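The four values can be read as a small decision rule. The sketch below is a toy model in Python, not Elasticsearch source code; the function name `allocation_permitted` and its two boolean parameters are made up for illustration, and it deliberately ignores the restart exception for local primaries described above.

```python
def allocation_permitted(enable: str, is_primary: bool, index_is_new: bool) -> bool:
    """Toy model of cluster.routing.allocation.enable (hypothetical helper)."""
    if enable == "all":
        return True
    if enable == "primaries":
        return is_primary
    if enable == "new_primaries":
        # Only primaries that belong to a newly created index.
        return is_primary and index_is_new
    if enable == "none":
        return False
    raise ValueError(f"unknown value for allocation.enable: {enable!r}")


# A replica of an existing index is blocked under "primaries".
print(allocation_permitted("primaries", is_primary=False, index_is_new=False))
```

Under `new_primaries`, a primary of an existing index is rejected as well, which is the difference from `primaries`.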

2. cluster.routing.allocation.node_concurrent_incoming_recoveries

Sets how many shards on a node may receive incoming recovery data at the same time.

These are usually replica shards that recover by receiving data from their primary.
During a relocation, the receiving shard on this node may also be a primary.
The default is 2.

3. cluster.routing.allocation.node_concurrent_outgoing_recoveries

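The counterpart of the previous setting: it controls how many shards on a node may serve outgoing recovery data at the same time.

These are usually primary shards that send data to replicas so the replicas can recover.
During a relocation, the sending shard on this node may also be a replica.

The default is 2.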

4. cluster.routing.allocation.node_concurrent_recoveries

A shortcut that combines the two settings above: it sets both cluster.routing.allocation.node_concurrent_incoming_recoveries and cluster.routing.allocation.node_concurrent_outgoing_recoveries to the same value.
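How the shortcut relates to the two specific settings can be sketched as follows. This is a hedged illustration: the helper `effective_recovery_limits` is hypothetical, and it assumes that an explicitly set incoming or outgoing value takes precedence over the shortcut.

```python
PREFIX = "cluster.routing.allocation."
DEFAULT_CONCURRENT_RECOVERIES = 2  # default quoted in the text above


def effective_recovery_limits(settings: dict) -> tuple:
    """Return (incoming, outgoing) limits for one node (hypothetical helper)."""
    shared = int(settings.get(PREFIX + "node_concurrent_recoveries",
                              DEFAULT_CONCURRENT_RECOVERIES))
    # Assumption: a specific setting, when present, wins over the shortcut.
    incoming = int(settings.get(PREFIX + "node_concurrent_incoming_recoveries", shared))
    outgoing = int(settings.get(PREFIX + "node_concurrent_outgoing_recoveries", shared))
    return incoming, outgoing


print(effective_recovery_limits({PREFIX + "node_concurrent_recoveries": 4}))
```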

5. cluster.routing.allocation.node_initial_primaries_recoveries

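Replica shards are usually recovered over the network from the primary, whereas an unassigned primary shard can only be recovered from local data once a node that holds a copy of it has restarted. Because this recovery reads from local disk and is fast, the limit is set somewhat higher (the default is 4) so that more unassigned primaries can be recovered in parallel.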

6. cluster.routing.allocation.same_shard.host

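Enables a check that prevents multiple copies of the same shard from being allocated on a single host, so that Elasticsearch copes better when a host fails. The check matters when several nodes run on the same host: if that host goes down, every copy of a shard may be lost with it. Normally the nodes of one cluster do not share a host, but note that with virtual machines, nodes that sit on different VMs may still end up on the same physical server. That should be avoided as far as possible, because it carries the same risk of data loss.
The default is false, meaning the check is disabled.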

2. Shard rebalancing settings

The following dynamic settings control how shards are rebalanced across the cluster.

1. cluster.routing.rebalance.enable

Enables or disables rebalancing for specific kinds of shards:

all - (default) Allows shard balancing for all kinds of shards.
primaries - Allows shard balancing only for primary shards.
replicas - Allows shard balancing only for replica shards.
none - No shard balancing of any kind is allowed for any indices.

2. cluster.routing.allocation.allow_rebalance

Specifies when rebalancing is allowed to start:

always - rebalancing is always allowed
indices_primaries_active - only once all primary shards in the cluster have been allocated
indices_all_active - (default) only once all shards (primaries and replicas) in the cluster have been allocated

3. cluster.routing.allocation.cluster_concurrent_rebalance

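Controls how many shard rebalancing operations may run concurrently across the cluster. The default is 2.
Note that this limit applies only to shard movements caused by imbalance. It does not limit shard movements forced by allocation filtering or forced awareness.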

3. Shard balancing factors

The following three settings are used together to determine where to place each shard. The cluster is balanced when no allowed rebalancing operation can bring the weight of any node closer to the weight of any other node by more than the balance.threshold.

1. cluster.routing.allocation.balance.shard

Defines the weight factor for the total number of shards allocated on a node (float). Defaults to 0.45f. Raising it increases the tendency to equalize the number of shards across all nodes in the cluster.

2. cluster.routing.allocation.balance.index

Defines the weight factor for the number of shards per index allocated on a specific node (float). Defaults to 0.55f. Raising it increases the tendency to spread the shards of each index evenly across all nodes in the cluster.

3. cluster.routing.allocation.balance.threshold

The threshold that triggers rebalancing: the minimal optimization value of operations that should be performed (non-negative float). Defaults to 1.0f. Raising it makes the cluster less aggressive about optimizing the shard balance, so rebalancing is triggered less easily.
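To make the interplay of the three factors concrete, here is a simplified sketch of a node weight calculation. It is modeled loosely on the weight function used by older Elasticsearch releases and should be read as an assumption, not as the exact algorithm of any particular version; the function names and the example shard counts are invented for illustration.

```python
def node_weight(node_shards, avg_shards, node_index_shards, avg_index_shards,
                shard_factor=0.45, index_factor=0.55):
    """Weight of one node with respect to one index (simplified sketch)."""
    total = shard_factor + index_factor
    theta_shard = shard_factor / total   # share given to the total shard count
    theta_index = index_factor / total   # share given to the per-index shard count
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards))


def worth_rebalancing(weight_a, weight_b, threshold=1.0):
    """A move is only considered if the weight gap exceeds the threshold."""
    return abs(weight_a - weight_b) > threshold


# Two nodes, cluster average 8 shards per node and 2 shards of the index per node.
heavy = node_weight(node_shards=10, avg_shards=8, node_index_shards=3, avg_index_shards=2)
light = node_weight(node_shards=6, avg_shards=8, node_index_shards=1, avg_index_shards=2)
print(heavy, light, worth_rebalancing(heavy, light))
```

With the defaults, the heavy node weighs about 1.45 and the light node about -1.45, so the gap of roughly 2.9 exceeds the threshold of 1.0. Raising the threshold to 3.0 would leave this pair alone.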

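4. How allocation and rebalancing differ and relate

This section looks again at the relationship between allocation and rebalancing, using these two settings:

cluster.routing.allocation.enable
cluster.routing.rebalance.enable

Allocation is about assigning a shard to a node, whatever the reason. When a node fails, its now unassigned shards have to be assigned again. A manual relocation has to allocate a new shard copy on the target node, and so does a rebalance.
Suppose a node suddenly fails (and the data on it is wiped), leaving some shards unassigned. If cluster.routing.allocation.enable: none is set, those unassigned shards are not assigned to other nodes even with cluster.routing.rebalance.enable: all, because the underlying allocation operation itself is disabled.

Now suppose the settings are
cluster.routing.rebalance.enable: none
cluster.routing.allocation.enable: all
Then the unassigned shards are assigned to the other nodes. If the failed node (which now holds no data) is restarted after the cluster has turned green, its shard count stays at 0 because rebalancing is off. Only after cluster.routing.rebalance.enable is set back to all are some shards moved to the restarted node.
In short, rebalancing depends on allocation being enabled: with allocation disabled, no rebalancing can happen (and neither can a manual relocation). Allocation additionally governs whether lost shards are assigned again.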
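The dependency between the two switches can be summarized in a few lines. This is a toy model of the scenarios above, restricted to the values `all` and `none`; the function `cluster_behaviour` and its result keys are hypothetical names.

```python
def cluster_behaviour(allocation_enable: str, rebalance_enable: str) -> dict:
    """Toy model of how the two settings interact (only 'all' and 'none')."""
    for value in (allocation_enable, rebalance_enable):
        if value not in ("all", "none"):
            raise ValueError("this sketch only models 'all' and 'none'")
    allocation_on = allocation_enable == "all"
    rebalance_on = rebalance_enable == "all"
    return {
        # Unassigned shards only need allocation to be enabled.
        "unassigned_shards_assigned": allocation_on,
        # Rebalancing allocates a copy on the target node, so it needs both.
        "shards_rebalanced": allocation_on and rebalance_on,
    }


print(cluster_behaviour("none", "all"))
print(cluster_behaviour("all", "none"))
```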

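3. Disk-based shard allocation

Elasticsearch takes the available disk space on a node into account when deciding whether to allocate a new shard to that node, and whether it needs to relocate some shards away from it.
The disk-related settings below are all dynamic: they can be set in elasticsearch.yml or through the cluster settings API.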

1. cluster.routing.allocation.disk.threshold_enabled

Defaults to true. Set it to false to disable the disk allocation decider, in which case disk usage is not considered during shard allocation.

2. cluster.routing.allocation.disk.watermark.low

Controls the low watermark for disk usage. It defaults to 85%, meaning that Elasticsearch will not allocate shards to nodes that have more than 85% disk used. This setting has no effect on the primary shards of newly created indices, but it does prevent their replicas from being allocated.
The value can also be an absolute byte value such as 500mb. Note that a byte value is the free space remaining, not the space already used. Absolute values are handy when disks are large. For example, with 3 TB of disk per node the operating system may need only about 50 GB, yet a percentage reserves far more than that (10% of 3 TB is already 300 GB), which is wasteful. In that case the watermark can simply be set to 50gb.

3. cluster.routing.allocation.disk.watermark.high

Controls the high watermark. It defaults to 90%, meaning that Elasticsearch will attempt to relocate shards away from a node whose disk usage is above 90%. Like the low watermark, it can be set to an absolute byte value (for example 500mb), in which case shards are relocated away from a node that has less than the specified amount of free space.
This setting affects the allocation of all shards, whether previously allocated or belonging to a newly created index.

4. cluster.routing.allocation.disk.watermark.flood_stage

Controls the flood stage watermark. It defaults to 95%. Once at least one disk on a node exceeds it, Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on that node, so only reads and deletes are allowed. This is a last resort to prevent nodes from running out of disk space. In the version described here, the index block must be released manually once enough disk space is available to allow indexing to continue. Elasticsearch 7.4 and later release it automatically when disk usage falls below the high watermark.
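The three percentage watermarks can be pictured as bands of disk usage. The following sketch classifies a node by its used-disk percentage with the default thresholds. The function `disk_stage` and its return labels are illustrative names, and treating each threshold as a strict "greater than" comparison is an assumption based on the wording "more than" and "above".

```python
def disk_stage(used_percent: float, low: float = 85.0,
               high: float = 90.0, flood_stage: float = 95.0) -> str:
    """Classify a node by used disk percentage (simplified sketch)."""
    if used_percent > flood_stage:
        return "flood_stage"  # indices on the node get the read-only block
    if used_percent > high:
        return "high"         # shards are relocated away from the node
    if used_percent > low:
        return "low"          # no new shards are allocated to the node
    return "ok"


for usage in (70.0, 87.5, 92.0, 96.0):
    print(usage, disk_stage(usage))
```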

One point deserves particular attention. The three settings

cluster.routing.allocation.disk.watermark.low
cluster.routing.allocation.disk.watermark.high
cluster.routing.allocation.disk.watermark.flood_stage

must all use the same kind of value. You cannot mix percentages and byte values: either all three are percentages or all three are byte values. This lets Elasticsearch validate that the settings are internally consistent (that is, the low disk threshold is not more than the high disk threshold, and the high disk threshold is not more than the flood stage threshold).
Also keep in mind that a percentage refers to the share of disk space already used, while a byte value refers to the free space remaining.
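The consistency rule can be illustrated with a small validator. This is a sketch of the idea only, not the validation Elasticsearch itself performs: `parse_watermark` and `validate_watermarks` are hypothetical helpers, and the parser handles just values such as `85%` or `100gb`.

```python
import re

_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def parse_watermark(value: str):
    """Return ("percent", used_percent) or ("bytes", free_bytes)."""
    text = value.strip().lower()
    if text.endswith("%"):
        return "percent", float(text[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(b|kb|mb|gb|tb)", text)
    if match is None:
        raise ValueError(f"cannot parse watermark value: {value!r}")
    return "bytes", float(match.group(1)) * _UNITS[match.group(2)]


def validate_watermarks(low: str, high: str, flood_stage: str) -> None:
    """Raise ValueError if the three values are mixed or out of order."""
    kinds, amounts = zip(*(parse_watermark(v) for v in (low, high, flood_stage)))
    if len(set(kinds)) != 1:
        raise ValueError("cannot mix percentage and byte values")
    if kinds[0] == "percent":
        # Percentages describe used space, so they must not decrease.
        ordered = amounts[0] <= amounts[1] <= amounts[2]
    else:
        # Byte values describe free space, so they must not increase.
        ordered = amounts[0] >= amounts[1] >= amounts[2]
    if not ordered:
        raise ValueError("watermarks must be ordered low, high, flood_stage")


validate_watermarks("85%", "90%", "95%")        # passes silently
validate_watermarks("100gb", "50gb", "10gb")    # passes silently
```

Note how the ordering check flips for byte values: because they describe free space, the low watermark carries the largest number.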

5. cluster.info.update.interval

How often Elasticsearch checks disk usage for each node in the cluster. Defaults to 30s.

6. cluster.routing.allocation.disk.include_relocations

Controls whether shards that are currently being relocated to a node are counted when that node's disk usage is computed. Defaults to true.
Taking the sizes of relocating shards into account may mean that a node's disk usage is overestimated: the relocation could be 90% complete, and a recently retrieved disk usage figure would then include the total size of the relocating shard as well as the space already used by the running relocation.

7. A usage example

To set the low watermark to at least 100 gigabytes free, the high watermark to at least 50 gigabytes free and the flood stage watermark to 10 gigabytes free, and to update the information about the cluster every minute:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }
}
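For readers who script such changes, the same request can be prepared with Python's standard library. This is a minimal sketch: the base URL `http://localhost:9200` and the function name `build_watermark_request` are assumptions, no authentication is configured, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_watermark_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) the PUT _cluster/settings request."""
    body = {
        "transient": {
            "cluster.routing.allocation.disk.watermark.low": "100gb",
            "cluster.routing.allocation.disk.watermark.high": "50gb",
            "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
            "cluster.info.update.interval": "1m",
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_watermark_request()
print(request.get_method(), request.full_url)
# Sending it against a reachable cluster would be: urllib.request.urlopen(request)
```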

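4. Making allocation aware of nodes through attribute settings

At first glance this looks much like the index-level filter settings covered in an earlier post, but the two are similar rather than identical.
The settings here apply to the cluster as a whole.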

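1. Enabling allocation awareness

1. Give each node a custom attribute that describes its location. This example uses a rack_id attribute, set either in elasticsearch.yml or on the command line:

node.attr.rack_id: rack_one
./bin/elasticsearch -Enode.attr.rack_id=rack_one

2. Enable the setting in elasticsearch.yml on every master-eligible node:

cluster.routing.allocation.awareness.attributes: rack_id

It can also be set dynamically through the cluster settings API.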

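With this configuration, consider the following sequence:

Start 2 nodes configured with node.attr.rack_id: rack_one.
Create an index with 5 primary shards and 1 replica of each primary.
All 10 shards are now allocated across these two nodes. A primary and its replica still never share a node, but both copies of every shard sit in the same rack, because rack_one is the only rack_id value the cluster knows about.
If you then add two nodes configured with node.attr.rack_id: rack_two, Elasticsearch moves some shards to the new nodes and ensures (if possible) that the primary and replica of the same shard are not on nodes with the same rack_id.
If the nodes with node.attr.rack_id: rack_two go down, Elasticsearch by default allocates the lost shard copies to the rack_one nodes, so all shards end up in rack_one again.
To prevent the primary and replica of a shard from ever being allocated to nodes with the same rack_id, enable forced awareness.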

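2. What is forced awareness?

Forced awareness prevents nodes that share the same attribute value from holding both the primary and the replica of a shard. Nodes with the same attribute value are treated as closely coupled machines that may fail together, so forced awareness lowers the risk of data loss.
Here is how to use it:

cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

This declares zone1 and zone2 as the values of the attribute that forced awareness expects.

Returning to the earlier example:

Start 2 nodes configured with node.attr.zone: zone1.
Create an index with 5 primary shards and 1 replica of each primary.
Only the 5 primaries are allocated to the two nodes. The replicas stay unassigned until a node with node.attr.zone: zone2 joins the cluster.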
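The difference between plain and forced awareness comes down to how many copies of one shard a zone may hold. The sketch below models that with a ceiling division over the number of known zone values. It is an approximation of the behaviour described above, the helper names are hypothetical, and it assumes every zone has enough nodes to host the copies it is allowed.

```python
import math


def max_copies_per_zone(copies: int, zones_present, forced_values=()) -> int:
    """Upper bound on copies of one shard in a single zone (simplified)."""
    zones = set(zones_present) | set(forced_values)
    if not zones:
        raise ValueError("at least one zone value is required")
    return math.ceil(copies / len(zones))


def assignable_copies(copies: int, zones_present, forced_values=()) -> int:
    """How many copies of one shard can be assigned right now."""
    per_zone = max_copies_per_zone(copies, zones_present, forced_values)
    return min(copies, per_zone * len(set(zones_present)))


# One primary plus one replica, only zone1 nodes are running.
print(assignable_copies(2, ["zone1"]))                               # plain awareness
print(assignable_copies(2, ["zone1"], ["zone1", "zone2"]))           # forced awareness
print(assignable_copies(2, ["zone1", "zone2"], ["zone1", "zone2"]))  # zone2 has joined
```

Without forced values, the only known zone may hold both copies. With `zone1,zone2` forced, each zone is capped at one copy, so the replica waits for a zone2 node.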

5. Cluster-level shard allocation filtering

Cluster-level filters are used in much the same way as index-level filters, but they apply to the whole cluster.
They look like this:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1"
  }
}

The attribute can be a custom node attribute or one of the built-in _name, _ip and _host attributes.
The available settings are:

1. include

cluster.routing.allocation.include.{attribute}

Allocate shards to a node whose {attribute} has at least one of the comma-separated values.

2. require

cluster.routing.allocation.require.{attribute}
Only allocate shards to a node whose {attribute} has all of the comma-separated values.

3. exclude

cluster.routing.allocation.exclude.{attribute}
Do not allocate shards to a node whose {attribute} has any of the comma-separated values.

4. Wildcards are also supported

Attribute values may contain wildcards:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "192.168.2.*"
  }
}
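How the three filter types combine, including wildcard values, can be sketched with Python's `fnmatch` module. This is an approximation, not Elasticsearch's matcher: the function `node_allowed` is hypothetical, `fnmatch` accepts more pattern syntax than a plain `*`, and for `require` the sketch demands that every configured attribute matches while accepting any listed pattern within one attribute, which is an assumption.

```python
from fnmatch import fnmatchcase


def _matches(node: dict, attribute: str, patterns: str) -> bool:
    """True if the node's attribute matches any comma-separated pattern."""
    value = node.get(attribute)
    if value is None:
        return False
    return any(fnmatchcase(value, p.strip()) for p in patterns.split(","))


def node_allowed(node: dict, include=None, require=None, exclude=None) -> bool:
    """Decide whether shards may be allocated to the node (simplified)."""
    include, require, exclude = include or {}, require or {}, exclude or {}
    if any(_matches(node, a, p) for a, p in exclude.items()):
        return False  # any exclude match rules the node out
    if not all(_matches(node, a, p) for a, p in require.items()):
        return False  # every required attribute must match
    if include and not any(_matches(node, a, p) for a, p in include.items()):
        return False  # at least one include entry must match
    return True


node = {"_ip": "192.168.2.7", "rack_id": "rack_one"}
print(node_allowed(node, exclude={"_ip": "192.168.2.*"}))
print(node_allowed(node, include={"rack_id": "rack_one,rack_two"}))
```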