

Scalability and resilience: clusters, nodes, and shards

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity, and Elasticsearch automatically distributes your data and query load across all of the available nodes. There is no need to overhaul your application: Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.


How does this work? Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster. As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.
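The two levels of distribution described above (documents to shards, shards to nodes) can be sketched in a few lines. This is a toy model only: CRC32 stands in for Elasticsearch's real routing hash, round-robin stands in for its real shard allocator, and the node names and shard count are made up for illustration.

```python
import zlib

NUM_SHARDS = 6  # hypothetical number of shards in the index


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def assign_shards(num_shards: int, nodes: list[str]) -> dict[int, str]:
    """Spread shards over nodes round-robin (a toy allocator, not the real one)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


def shards_per_node(assignment: dict[int, str]) -> dict[str, int]:
    """Count how many shards each node holds."""
    counts: dict[str, int] = {}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    return counts


# Documents map to shards independently of how many nodes exist.
doc_to_shard = {f"doc-{i}": shard_for(f"doc-{i}", NUM_SHARDS) for i in range(1000)}

# Adding a node changes only where shards live, not which shard owns a document.
two_nodes = assign_shards(NUM_SHARDS, ["node-a", "node-b"])
three_nodes = assign_shards(NUM_SHARDS, ["node-a", "node-b", "node-c"])

print(shards_per_node(two_nodes))
print(shards_per_node(three_nodes))
```

With two nodes each holds three shards; with three nodes each holds two. A real cluster also tries to move as few shards as possible when it rebalances, which this round-robin sketch does not attempt.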


There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.
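To see why replicas protect against hardware failure, here is a small sketch that places every copy of a shard on a different node and then checks which shards survive the loss of one node. The placement rule and node names are assumptions for illustration; Elasticsearch's own allocator is more sophisticated, and when there are too few nodes it leaves replicas unassigned rather than raising an error as this toy does.

```python
def allocate(num_primaries: int, num_replicas: int,
             nodes: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Place each copy of a shard on a different node (toy placement rule)."""
    if num_replicas + 1 > len(nodes):
        # Toy behaviour only: a real cluster would leave the replica unassigned.
        raise ValueError("need at least one node per shard copy")
    layout: dict[str, list[tuple[int, str]]] = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            role = "primary" if copy == 0 else "replica"
            node = nodes[(shard + copy) % len(nodes)]
            layout[node].append((shard, role))
    return layout


def shards_available_after_failure(layout: dict[str, list[tuple[int, str]]],
                                   failed_node: str) -> set[int]:
    """Return the shard numbers that still have at least one copy somewhere."""
    return {
        shard
        for node, copies in layout.items()
        if node != failed_node
        for shard, _role in copies
    }


nodes = ["node-a", "node-b", "node-c"]
with_replicas = allocate(num_primaries=3, num_replicas=1, nodes=nodes)
without_replicas = allocate(num_primaries=3, num_replicas=0, nodes=nodes)

print(shards_available_after_failure(with_replicas, "node-a"))
print(shards_available_after_failure(without_replicas, "node-a"))
```

With one replica per primary, all three shards remain available after any single node is lost. Without replicas, the shard that lived on the failed node is gone.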

The number of primary shards in an index is fixed at the time the index is created, but the number of replica shards can be changed at any time without interrupting indexing or query operations.
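As a rough illustration, the two settings involved are `index.number_of_shards` (set when the index is created) and `index.number_of_replicas` (adjustable later through the index settings endpoint). The sketch below only builds the REST requests and prints them; it does not contact a cluster. The index name `my-index` and the helper function names are hypothetical.

```python
import json


def create_index_request(index: str, primaries: int, replicas: int):
    """Build (method, path, body) for creating an index with explicit shard counts."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": primaries,   # fixed once the index exists
                "number_of_replicas": replicas,  # can be changed later
            }
        }
    }
    return "PUT", f"/{index}", body


def update_replicas_request(index: str, replicas: int):
    """Build (method, path, body) for changing the replica count of a live index."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index}/_settings", body


# Hypothetical index: 3 primaries with 1 replica each, later raised to 2 replicas.
for method, path, body in (
    create_index_request("my-index", primaries=3, replicas=1),
    update_replicas_request("my-index", replicas=2),
):
    print(method, path)
    print(json.dumps(body, indent=2))
```

The printed bodies can be sent with any HTTP client or pasted into the Kibana console.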


It depends




There are a number of performance considerations and trade-offs with respect to shard size and the number of primary shards configured for an index. The more shards, the more overhead there is simply in maintaining those indices. The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.


Querying lots of small shards makes the processing per shard faster, but more queries means more overhead, so querying a smaller number of larger shards might be faster. In short…it depends.


  • Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.

  • Avoid the gazillion shards problem. The number of shards a node can hold is proportional to the available heap space. As a general rule, the number of shards per GB of heap space should be less than 20.
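The two rules of thumb above translate into simple arithmetic. The sketch below is a back-of-the-envelope check, not a sizing tool: the data volume, node count, and heap size are invented numbers, and it assumes shards are spread evenly across nodes.

```python
import math

MAX_SHARDS_PER_GB_HEAP = 20  # rule of thumb quoted in the text above


def primaries_needed(total_data_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))


def check_plan(total_data_gb: float, target_shard_gb: float, replicas: int,
               nodes: int, heap_gb_per_node: float) -> dict:
    """Estimate shards per node and compare against the heap-based guideline."""
    primaries = primaries_needed(total_data_gb, target_shard_gb)
    total_shards = primaries * (1 + replicas)        # primaries plus their copies
    per_node = math.ceil(total_shards / nodes)       # assumes an even spread
    limit = int(heap_gb_per_node * MAX_SHARDS_PER_GB_HEAP)
    return {
        "primaries": primaries,
        "total_shards": total_shards,
        "shards_per_node": per_node,
        "limit_per_node": limit,
        "within_limit": per_node < limit,            # "less than 20 per GB of heap"
    }


# Hypothetical cluster: 1200 GB of data, 30 GB shards, 1 replica, 4 nodes, 8 GB heap each.
plan = check_plan(total_data_gb=1200, target_shard_gb=30, replicas=1,
                  nodes=4, heap_gb_per_node=8)
print(plan)
```

For these made-up numbers the plan needs 40 primaries (80 shards in total), which works out to 20 shards per node against a guideline ceiling of 160.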


The best way to determine the optimal configuration for your use case is through testing with your own data and queries.


In case of disaster


For performance reasons, the nodes within a cluster need to be on the same network. Balancing shards in a cluster across nodes in different data centers simply takes too long. But high-availability architectures demand that you avoid putting all of your eggs in one basket. In the event of a major outage in one location, servers in another location need to be able to take over. Seamlessly. The answer? Cross-cluster replication (CCR).


CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.


Cross-cluster replication is active-passive. The index on the primary cluster is the active leader index and handles all write requests. Indices replicated to secondary clusters are read-only followers.
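As a sketch of what setting up a follower looks like, the request below targets the CCR follow endpoint on the secondary cluster and names the leader index to replicate. It only builds and prints the request. The index names and the remote cluster alias are hypothetical, and the alias is assumed to have already been registered as a remote cluster on the secondary cluster.

```python
import json


def follow_request(follower_index: str, remote_cluster: str, leader_index: str):
    """Build (method, path, body) asking the local cluster to follow a leader index."""
    body = {
        "remote_cluster": remote_cluster,  # alias of the primary (leader) cluster
        "leader_index": leader_index,      # index on the primary cluster to replicate
    }
    return "PUT", f"/{follower_index}/_ccr/follow", body


# Hypothetical names: follow "orders" from the cluster registered as "primary-dc".
method, path, body = follow_request(
    follower_index="orders-copy",
    remote_cluster="primary-dc",
    leader_index="orders",
)
print(method, path)
print(json.dumps(body, indent=2))
```

Once created, the follower index is read-only; writes continue to go to the leader index on the primary cluster.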


Care and feeding


As with any enterprise system, you need tools to secure, manage, and monitor your Elasticsearch clusters. Security, monitoring, and administrative features that are integrated into Elasticsearch enable you to use Kibana as a control center for managing a cluster. Features like data rollups and index lifecycle management help you intelligently manage your data over time.
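To make index lifecycle management a little more concrete, here is a minimal sketch of a lifecycle policy body that rolls over to a new index when a primary shard reaches a size threshold or the index reaches a certain age, and deletes old indices later. The policy name and all thresholds are illustrative assumptions, and the `max_primary_shard_size` rollover condition assumes a reasonably recent Elasticsearch version. The code only builds and prints the request.

```python
import json


def ilm_policy_request(policy_name: str, max_shard_size: str = "40gb",
                       rollover_age: str = "30d", delete_after: str = "90d"):
    """Build (method, path, body) for a simple rollover-then-delete lifecycle policy."""
    body = {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": max_shard_size,
                            "max_age": rollover_age,
                        }
                    }
                },
                "delete": {
                    "min_age": delete_after,   # age measured from rollover
                    "actions": {"delete": {}},
                },
            }
        }
    }
    return "PUT", f"/_ilm/policy/{policy_name}", body


# Hypothetical policy name; thresholds are examples, not recommendations.
method, path, body = ilm_policy_request("logs-policy")
print(method, path)
print(json.dumps(body, indent=2))
```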


