第1.7章 elasticsearch线上问题集1

我的es数据规模为5433万，这个时候频繁执行查询、写入的操作，发现python执行有一些异常，记录下来，看看有没有解决的办法
我的脚本

from elasticsearch import Elasticsearch
from elasticsearch import helpers
body = {"query": {"range": {"date": {"gte": pre_time,"lte": end_time}}}}results = helpers.scan(client=es,query=body,scroll="5m",index=sub_index,doc_type='my_type',timeout="10m")print('开始遍历索引')sources = set()for result in results:

1 elasticsearch.exceptions.NotFoundError: NotFoundError(404, ‘search_phase_execution_exception’, ‘No search context found for id [27563069]’)
查看Elasticsearch SearchContextMissingException during ‘scan & scroll’ query with Spring Data Elasticsearch
跟我的问题很像，也就是上下文丢失，那么什么时候上下文丢失了，也就是超时的时候。也就是说解决办法就是将timeout设置足够长。

This usually happens if your search context is not alive anymore.

查看官网资料Keeping the search context alive，可以看到scroll的时间应该足够一批数据返回就可以。
于是我将我的程序了一下，将scroll从原来的5m调整为10m，将原来的timeout10m调整为15m.然后再次执行程序观察，发现Scroll request has only succeeded on 1 shards out of 5也同步解决,也是意外的收获。

results = helpers.scan(client=es,query=body,scroll="10m",index=sub_index,doc_type='my_type',timeout="15m")

The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see Time unitsedit) does not need to be long enough to process all data — it just needs to be long enough to process the previous batch of results

Process Process-4:
Traceback (most recent call last):File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrapself.run()File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in runself._target(*self._args, **self._kwargs)File "count_in_total_baidu.py", line 66, in sub_in_totalfor result in results:File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 379, in scan**scroll_kwargs)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrappedreturn func(*args, params=params, **kwargs)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1016, in scrollparams=params, body=body)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/transport.py", line 318, in perform_requeststatus, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 186, in perform_requestself._raise_error(response.status, raw_data)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_errorraise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [27563069]')

2 Scroll request has only succeeded on 1 shards out of 5.

Process Process-2:
Traceback (most recent call last):File "/etc/python/python3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrapself.run()File "/etc/python/python3.6/lib/python3.6/multiprocessing/process.py", line 93, in runself._target(*self._args, **self._kwargs)File "count_in_total.py", line 67, in sub_in_totalfor result in results:File "/etc/python/python3.6/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 394, in scan(resp['_shards']['successful'], resp['_shards']['total'])
elasticsearch.helpers.ScanError: Scroll request has only succeeded on 1 shards out of 5.

找到了一片文章elasticsearch.helpers.ScanError: Scroll request has only succeeded on xx shards,它的原因是当index=''为空时出现此错误，可以index，为什么会为空呢？
3 rejected execution of org.elasticsearch.transport.TransportService

Process Process-1:
Traceback (most recent call last):File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrapself.run()File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in runself._target(*self._args, **self._kwargs)File "count_in_total_baidu.py", line 64, in sub_in_totaltotalEnService.sub_in_total(result.get('_source'))File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 200, in sub_in_totalself.handler(total_record)File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 133, in handlerself.opt_es(es_data)File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 175, in opt_essuccess, msg = helpers.bulk(self.es, self.actions)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 257, in bulkfor ok, item in streaming_bulk(client, actions, *args, **kwargs):File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 192, in streaming_bulkraise_on_error, *args, **kwargs)File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 137, in _process_bulk_chunkraise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
elasticsearch.helpers.BulkIndexError: ('1 document(s) failed to index.', [{'index': {'_index': 'invoice_title_v3', '_type': 'invoice_title', '_id': '3ae80d12abcde7d60f72ffb7fbc4696d', 'status': 429, 'error': {'type': 'es_rejected_execution_exception', 'reason': 'rejected execution of org.elasticsearch.transport.TransportService$7@6048fcf0 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@58695c43[Running, pool size = 4, active threads = 4, queued tasks = 202, completed tasks = 13568507]]'}, 'data':

从ElasticSearch常见错误整理 5.5.x这篇文章，可以看到问题是客户端写es的速度太快了，超过es索引数据的速度。
执行GET _nodes/thread_pool可以看到index的queue_size与异常中queue capacity = 200是一致的。只不过5.x版本没看到size这个参数了。

我按照sticsearch高级配置之（二）----线程池设置执行，size默认是CPU核数的5倍，我是4核CPU，所以只是稍微调大一些

PUT _cluster/settings
{"transient": {"threadpool.index.type": "fixed","threadpool.index.size": 30,"threadpool.index.queue_size": 1000,"threadpool.index.reject_policy": "caller"}
}

提示的异常信息是

{"error": {"root_cause": [{"type": "illegal_argument_exception","reason": "transient setting [threadpool.index.queue_size], not dynamically updateable"}],"type": "illegal_argument_exception","reason": "transient setting [threadpool.index.queue_size], not dynamically updateable"},"status": 400
}

查看Transient setting [threadpool.search.queue_size], not dynamically updateable,可以得知queue_size参数并不能通过api来更改了，需要修改配置文件并且重启机器。
我在中elasticsearch.yml中添加这些配置，但是也会下面的错误，有点崩溃

threadpool.index.type: fixed
threadpool.index.size: 40
threadpool.index.queue_size: 1000
threadpool.index.reject_policy: caller

Suppressed: java.lang.IllegalArgumentException: unknown setting [threadpool.index.size] did you mean any of [thread_pool.index.size, thread_pool.get.size, thread_pool.index.queue_size, thread_pool.listener.size, thread_pool.bulk.size]?

找不到解决办法，只能寻求官方文档Thread Pool
4 Caused by: org.elasticsearch.client.transport.NoNodeAvailableException
初步设想es既然是集群模式，那么及时一个节点停掉，应该也不会影响使用，于是我就在生产环境听了一个，结果造成生产环境业务大面积瘫痪，真是惨痛的教训。为什么es集群容错这么低呢？最让人优先想到的是，通过haproxy做代理。

data:elasticsearch:cluster-name: xx_productcluster-nodes: 192.168.1.1:9300,192.168.1.2:9300,192.168.1.3:9300local: falserepositories:enabled: true

Caused by: org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [{#transport#-1}{a8wHYOwIRjC2sQYilPldrg}{172.19.123.151}{172.19.123.151:9300}, {#transport#-2}{4stEpD9KQSesdbmn2Hldxw}{172.19.123.150}{172.19.123.150:9300}]at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:347)at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:245)at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:59)at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:363)at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408)at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:80)at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:54)at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:62)at com.bwjf.rss.service.impl.CustomerServiceImpl.add(CustomerServiceImpl.java:98)at com.bwjf.rss.kfk.KfkConsumer.processCustomerMessage(KfkConsumer.java:50)at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)at java.lang.reflect.Method.invoke(Method.java:498)at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:180)at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:112)at org.springframework.kafka.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:48)at org.springframework.kafka.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:174)... 8 common frames omitted

3.1 index/delete
也就是说，index/delete操作，线程池的大小取决于可用进程数有关系，max值为1+可用进程数，这么说来es的服务实际上对cpu的核数是有要求的，CPU核数越高，俺么index的速度就会越快。

For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

虽然cpu的使用并不高，但是对核数还是有要求的

3.2 search

For count/search/suggest operations. Thread pool type is fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.

3.3 bulk
脚本success, msg = helpers.bulk(self.es, self.actions)应该用到的是这个，它的默认值与index一样
文章中并没有找到动态修改的办法
设置bulk.size:40，我的cpu核数是8按道理没有问题，但是在实际启动始终却是必须小于9，真的很奇怪

更改elasticsearch.yml中添加配置thread_pool.bulk.queue_size: 2000
更改后，依次按照原
第1.8章 elasticsearch水平扩展重启节点
索引状态变为red，过了一会儿编程了yellow，要是丢了数据，按我就崩溃了。
从下面的监控来看，应该是还在调整中。

这里可以看到迁移的进度

执行命令GET _cat/thread_pool,下图数字的顺序是active, queue and rejected
队列中的数据很容易就到200了，所以设置大一点，应该有效果。
你还可以执行GET _cat/thread_pool/bulk?v&h=id,name,active,rejected,completed，专门查看bulk的情况

跟踪日志,没有发现bulk异常了，问题解决

5 ImportError: cannot import name ‘decodestring’ from ‘base64’
解决方案

pip3 uninstall elasticsearch==7.5.1pip3 install elasticsearch -i http://pypi.douban.com/simple/

  File "D:\appworks\pywork\xddata\es_service.py", line 6, in <module>from elasticsearch import ElasticsearchFile "C:\Python\Python310\lib\site-packages\elasticsearch\__init__.py", line 24, in <module>from .client import ElasticsearchFile "C:\Python\Python310\lib\site-packages\elasticsearch\client\__init__.py", line 5, in <module>from ..transport import TransportFile "C:\Python\Python310\lib\site-packages\elasticsearch\transport.py", line 5, in <module>from .connection import Urllib3HttpConnectionFile "C:\Python\Python310\lib\site-packages\elasticsearch\connection\__init__.py", line 2, in <module>from .http_requests import RequestsHttpConnectionFile "C:\Python\Python310\lib\site-packages\elasticsearch\connection\http_requests.py", line 3, in <module>from base64 import decodestring
ImportError: cannot import name 'decodestring' from 'base64' (C:\Python\Python310\lib\base64.py)Process finished with exit code 1

第1.7章 elasticsearch线上问题集1相关推荐

线上 ELK 集群健康值 red 状态问题排查与解决
线上 ELK 集群健康值 red 状态问题排查与解决参考文章: (1)线上 ELK 集群健康值 red 状态问题排查与解决 (2)https://www.cnblogs.com/haifeiwu/p ...
一篇文章讲透线上应用监控
"线上服务停了,要重启一下"?久经职场做研发的程序员,视线会逐渐转移到线上应用的运行状态.设想一下,如果你在半夜两点正在酣眠美梦时,微信群里突然炸开锅:"服务停了,先重启 ...
JAVA 线上故障排查套路，从 CPU、磁盘、内存、网络到GC 一条龙！
点击上方蓝色"方志朋",选择"设为星标" 回复"666"获取独家整理的学习资料! 线上故障主要会包括cpu.磁盘.内存以及网络问题,而大多数 ...
JAVA 线上故障排查完整套路，从 CPU、磁盘、内存、网络、GC 一条龙！
点击上方"方志朋",选择"设为星标" 回复"666"获取新整理的面试文章作者:fredal https://fredal.xin/java ...
JAVA 线上故障排查完整套路！牛掰！
点击上方"方志朋",选择"设为星标" 回复"666"获取新整理的面试文章来源丨8rr.co/kV3R 线上故障主要会包括 CPU.磁盘.内 ...
JAVA 线上故障排查指南！
来源:https://fredal.xin/java-error-check 线上故障主要会包括cpu.磁盘.内存以及网络问题,而大多数故障可能会包含不止一个层面的问题,所以进行排查时候尽量四个方面依 ...
史上最全的高可用服务系统线上问题排查工具单（一）
来自:云时代架构上一篇文章保证高可用Java服务化系统高效运行的必备工具箱介绍了笔者在互联网公司里线上应急和技术攻关过程中积累的应用层脚本和Java虚拟机命令,这些脚本和命令在发现问题和定位问题的过 ...
掘地三尺，我挖出了这次线上事故的元凶：Java内存泄漏！
来源:http://www.importnew.com/29591.html 一个凄凉的午夜一定是网络的问题一定是 GC 的问题一定是内存泄漏真有这么糟糕吗? 还是不行那现在是怎么回事? 修 ...
从 CPU、磁盘、内存、网络、GC 一条龙！JAVA 线上故障排查完整套路
线上故障主要会包括cpu.磁盘.内存以及网络问题,而大多数故障可能会包含不止一个层面的问题,所以进行排查时候尽量四个方面依次排查一遍.同时例如jstack.jmap等工具也是不囿于一个方面的问题的,基 ...

第1.7章 elasticsearch线上问题集1

第1.7章 elasticsearch线上问题集1相关推荐

最新文章

热门文章