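My Elasticsearch index holds about 54.33 million documents. Under frequent query and write load, my Python scripts started raising exceptions. I am recording them here, along with whatever fixes I could find.
My script: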

from elasticsearch import Elasticsearch
from elasticsearch import helpers
body = {"query": {"range": {"date": {"gte": pre_time, "lte": end_time}}}}
results = helpers.scan(client=es, query=body, scroll="5m", index=sub_index, doc_type='my_type', timeout="10m")
print('Start iterating over the index')
sources = set()
for result in results:
    ...  # loop body not shown in the original

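1 elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [27563069]')
The question "Elasticsearch SearchContextMissingException during 'scan & scroll' query with Spring Data Elasticsearch" describes something very close to my problem: the search context was lost.
When does a search context get lost? When its keep-alive runs out. So the fix is to make the scroll keep-alive (and the request timeout) long enough.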

This usually happens if your search context is not alive anymore.

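The official documentation, in the section "Keeping the search context alive", says the scroll time only needs to be long enough for one batch of results to be returned and processed.
So I changed my program: scroll went from 5m to 10m, and timeout from 10m to 15m. After rerunning and watching it, the error "Scroll request has only succeeded on 1 shards out of 5" was gone as well, which was an unexpected bonus.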

results = helpers.scan(client=es,query=body,scroll="10m",index=sub_index,doc_type='my_type',timeout="15m")
The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see Time units) does not need to be long enough to process all data; it just needs to be long enough to process the previous batch of results.
Process Process-4:
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "count_in_total_baidu.py", line 66, in sub_in_total
    for result in results:
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 379, in scan
    **scroll_kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1016, in scroll
    params=params, body=body)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 186, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [27563069]')
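
A minimal sketch of what the keep-alive means in practice, assuming a client object that exposes search, scroll and clear_scroll the way elasticsearch-py does. Every scroll call passes the scroll value again, which renews the context for one more batch. The name iter_scroll and its parameters are illustrative, not part of the library.

```python
def iter_scroll(es, index, body, scroll="10m", size=1000):
    """Yield hits batch by batch, renewing the search context on every request."""
    resp = es.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit
            # Passing `scroll` again resets the keep-alive, so it only has to
            # cover the time spent processing the batch just yielded.
            resp = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the context early instead of waiting for it to expire.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

If the work done per hit is slow, a smaller size keeps each batch inside the keep-alive window without raising scroll any further.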

2 Scroll request has only succeeded on 1 shards out of 5.

Process Process-2:
Traceback (most recent call last):
  File "/etc/python/python3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/etc/python/python3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "count_in_total.py", line 67, in sub_in_total
    for result in results:
  File "/etc/python/python3.6/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 394, in scan
    (resp['_shards']['successful'], resp['_shards']['total'])
elasticsearch.helpers.ScanError: Scroll request has only succeeded on 1 shards out of 5.

I found an article, "elasticsearch.helpers.ScanError: Scroll request has only succeeded on xx shards", which says this error occurs when index='' is empty. But why would the index be empty in my case?
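
The last frame of the traceback shows helpers.scan comparing resp['_shards']['successful'] with resp['_shards']['total']. Here is a rough sketch of the same comparison on a plain response dict, handy for logging how many shards dropped out before the error is raised. The function names are hypothetical, and the response shape is assumed from the traceback.

```python
def shard_failures(resp):
    """Return how many shards did not answer successfully for one response."""
    shards = resp.get("_shards", {})
    return shards.get("total", 0) - shards.get("successful", 0)


def check_scroll_response(resp):
    """Raise if the response covers only part of the shards."""
    missing = shard_failures(resp)
    if missing > 0:
        shards = resp["_shards"]
        raise RuntimeError(
            "Scroll request has only succeeded on %d shards out of %d."
            % (shards["successful"], shards["total"])
        )
    return resp
```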
3 rejected execution of org.elasticsearch.transport.TransportService

Process Process-1:
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/python3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "count_in_total_baidu.py", line 64, in sub_in_total
    totalEnService.sub_in_total(result.get('_source'))
  File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 200, in sub_in_total
    self.handler(total_record)
  File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 133, in handler
    self.opt_es(es_data)
  File "/usr/local/python3/lib/python3.6/site-packages/en_plugin/service/en_service.py", line 175, in opt_es
    success, msg = helpers.bulk(self.es, self.actions)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 257, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 192, in streaming_bulk
    raise_on_error, *args, **kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 137, in _process_bulk_chunk
    raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
elasticsearch.helpers.BulkIndexError: ('1 document(s) failed to index.', [{'index': {'_index': 'invoice_title_v3', '_type': 'invoice_title', '_id': '3ae80d12abcde7d60f72ffb7fbc4696d', 'status': 429, 'error': {'type': 'es_rejected_execution_exception', 'reason': 'rejected execution of org.elasticsearch.transport.TransportService$7@6048fcf0 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@58695c43[Running, pool size = 4, active threads = 4, queued tasks = 202, completed tasks = 13568507]]'}, 'data':

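According to the article "ElasticSearch常见错误整理 5.5.x" (a roundup of common Elasticsearch 5.5.x errors), the cause is that the client writes to ES faster than ES can index the data.
Running GET _nodes/thread_pool shows that the queue_size of the index pool matches the queue capacity = 200 in the exception. In 5.x, however, I no longer see a size parameter.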
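
Since a 429 rejection means the queue is full rather than that the document is bad, one client-side option is to slow down and resend only what was rejected. Below is a minimal sketch of exponential backoff. Note that send is a hypothetical callable standing in for whatever wraps the bulk call; it is assumed to return the actions that came back with status 429.

```python
import time


def bulk_with_backoff(send, actions, max_retries=5, initial_backoff=1.0,
                      max_backoff=60.0, sleep=time.sleep):
    """Send actions, retrying only those rejected with HTTP 429.

    `send(actions)` is a hypothetical callable that returns the list of
    actions the cluster rejected with status 429 (empty list on success).
    """
    pending = list(actions)
    delay = initial_backoff
    for attempt in range(max_retries + 1):
        pending = send(pending)
        if not pending:
            return []
        if attempt < max_retries:
            # Give the bulk queue time to drain before trying again.
            sleep(delay)
            delay = min(delay * 2, max_backoff)
    return pending  # still rejected after all retries
```

The returned list is whatever could not be written after the last attempt, so the caller can log it or park it for later instead of losing it.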

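I followed the article "elasticsearch高级配置之(二)----线程池设置" (advanced configuration, part 2: thread pool settings). According to it, size defaults to 5 times the number of CPU cores. I have a 4-core CPU, so I only raised the values a little.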

PUT _cluster/settings
{
  "transient": {
    "threadpool.index.type": "fixed",
    "threadpool.index.size": 30,
    "threadpool.index.queue_size": 1000,
    "threadpool.index.reject_policy": "caller"
  }
}

The error returned was:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "transient setting [threadpool.index.queue_size], not dynamically updateable"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "transient setting [threadpool.index.queue_size], not dynamically updateable"
  },
  "status": 400
}

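The thread "Transient setting [threadpool.search.queue_size], not dynamically updateable" explains that queue_size can no longer be changed through the API. You have to edit the configuration file and restart.
I added the settings below to elasticsearch.yml, but that produced the error shown after them, which was rather demoralizing.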

threadpool.index.type: fixed
threadpool.index.size: 40
threadpool.index.queue_size: 1000
threadpool.index.reject_policy: caller
Suppressed: java.lang.IllegalArgumentException: unknown setting [threadpool.index.size] did you mean any of [thread_pool.index.size, thread_pool.get.size, thread_pool.index.queue_size, thread_pool.listener.size, thread_pool.bulk.size]?

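I could not find a fix elsewhere, so I turned to the official Thread Pool documentation. The error message itself already hints at the cause: the setting prefix is now thread_pool, not threadpool.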
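
A small sketch of the rename the error message suggests, applied to a dict of settings. The helper name is hypothetical. It only rewrites the prefix; whether each renamed setting is still accepted by a given ES version is a separate question to check against the Thread Pool documentation.

```python
def rename_threadpool_keys(settings):
    """Return a copy of `settings` with `threadpool.` keys renamed to `thread_pool.`."""
    old, new = "threadpool.", "thread_pool."
    return {
        (new + key[len(old):] if key.startswith(old) else key): value
        for key, value in settings.items()
    }
```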
4 Caused by: org.elasticsearch.client.transport.NoNodeAvailableException
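My initial assumption was that, since ES runs as a cluster, stopping a single node should not affect availability. So I stopped one node in production, and a large part of the production business went down. It was a painful lesson. Why is the cluster's fault tolerance so low? The first remedy that comes to mind is to put haproxy in front of it as a proxy.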

data:
  elasticsearch:
    cluster-name: xx_product
    cluster-nodes: 192.168.1.1:9300,192.168.1.2:9300,192.168.1.3:9300
    local: false
    repositories:
      enabled: true
Caused by: org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [{#transport#-1}{a8wHYOwIRjC2sQYilPldrg}{172.19.123.151}{172.19.123.151:9300}, {#transport#-2}{4stEpD9KQSesdbmn2Hldxw}{172.19.123.150}{172.19.123.150:9300}]
	at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:347)
	at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:245)
	at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:59)
	at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:363)
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408)
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:80)
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:54)
	at org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:62)
	at com.bwjf.rss.service.impl.CustomerServiceImpl.add(CustomerServiceImpl.java:98)
	at com.bwjf.rss.kfk.KfkConsumer.processCustomerMessage(KfkConsumer.java:50)
	at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:180)
	at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:112)
	at org.springframework.kafka.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:48)
	at org.springframework.kafka.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:174)
	... 8 common frames omitted

3.1 index/delete
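In other words, for index/delete operations the thread pool size depends on the number of available processors, with a maximum of 1 + the number of available processors. So the ES service does place demands on the CPU core count: the more cores, the faster indexing can go.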

For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

Even though CPU utilization is not high, the core count still matters.

3.2 search

For count/search/suggest operations. Thread pool type is fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
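
The two formulas quoted in 3.1 and 3.2 can be worked out for a given core count. This is just the arithmetic from the quoted text; the function names are mine.

```python
def index_pool_size(processors):
    """Default and maximum size of the index/delete pool, per the quoted text."""
    return {"size": processors, "max_size": 1 + processors, "queue_size": 200}


def search_pool_size(processors):
    """Default size of the search pool, per the quoted text."""
    return {"size": int((processors * 3) / 2) + 1, "queue_size": 1000}


for cores in (4, 8):
    print(cores, index_pool_size(cores), search_pool_size(cores))
```

For 8 processors this gives an index pool of 8 with a ceiling of 9, and a search pool of 13.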

3.3 bulk
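The call success, msg = helpers.bulk(self.es, self.actions) in my script should be served by this pool. Its defaults are the same as for index.
The documentation does not describe any way to change it dynamically.
I set bulk.size: 40. With 8 CPU cores I expected that to be fine, yet at startup the value is required to be less than 9, which seemed very odd. That limit does line up with the "1 + # of available processors" maximum quoted in 3.1, if the same cap applies to bulk.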

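I then added thread_pool.bulk.queue_size: 2000 to elasticsearch.yml.
After the change, I restarted the nodes one at a time, following chapter 1.8, "elasticsearch水平扩展" (horizontal scaling).
The index status turned red and after a while became yellow. If data had been lost, I would have been devastated.
Judging from the monitoring, the cluster was still rebalancing.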

The relocation progress can be seen here.
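
After the outage in section 4 and the red/yellow scare here, a pre-check before touching a node seems worthwhile. Below is a minimal sketch that takes the JSON from GET _cluster/health as a dict. The function name and the rule itself (green and no shards moving) are my own conservative assumption, not an official recommendation.

```python
def safe_to_stop_node(health):
    """Decide from a `GET _cluster/health` response whether a restart looks safe."""
    busy = (
        health.get("relocating_shards", 0)
        + health.get("initializing_shards", 0)
        + health.get("unassigned_shards", 0)
    )
    # Only proceed when every shard copy is assigned and nothing is in flight.
    return health.get("status") == "green" and busy == 0
```

Running the check again after each node comes back, and waiting until it returns True, avoids taking a second node down while replicas are still being rebuilt.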

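Run GET _cat/thread_pool. The numbers in its output are in the order active, queue and rejected.
The queue fills up to 200 very easily, so setting it larger should help.
You can also run GET _cat/thread_pool/bulk?v&h=id,name,active,rejected,completed to look at the bulk pool specifically.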
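
To watch the rejected count from a script instead of by eye, the text output of the _cat call can be parsed. This sketch assumes the ?v form, where the first row is a header, and values without embedded spaces. The helper names are illustrative.

```python
def parse_cat_table(text):
    """Parse `_cat` output produced with `?v` (header row first) into dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def rejected_total(text):
    """Sum the `rejected` column across all rows."""
    return sum(int(row["rejected"]) for row in parse_cat_table(text))
```

A rejected total that keeps growing between two polls means the client is still outrunning the bulk queue.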

Following the logs afterwards, I saw no more bulk exceptions. Problem solved.

5 ImportError: cannot import name 'decodestring' from 'base64'
Solution:

pip3 uninstall elasticsearch==7.5.1
pip3 install elasticsearch -i http://pypi.douban.com/simple/
  File "D:\appworks\pywork\xddata\es_service.py", line 6, in <module>
    from elasticsearch import Elasticsearch
  File "C:\Python\Python310\lib\site-packages\elasticsearch\__init__.py", line 24, in <module>
    from .client import Elasticsearch
  File "C:\Python\Python310\lib\site-packages\elasticsearch\client\__init__.py", line 5, in <module>
    from ..transport import Transport
  File "C:\Python\Python310\lib\site-packages\elasticsearch\transport.py", line 5, in <module>
    from .connection import Urllib3HttpConnection
  File "C:\Python\Python310\lib\site-packages\elasticsearch\connection\__init__.py", line 2, in <module>
    from .http_requests import RequestsHttpConnection
  File "C:\Python\Python310\lib\site-packages\elasticsearch\connection\http_requests.py", line 3, in <module>
    from base64 import decodestring
ImportError: cannot import name 'decodestring' from 'base64' (C:\Python\Python310\lib\base64.py)

Process finished with exit code 1
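
The background, as far as I can tell: base64.decodestring was a deprecated alias that Python 3.9 removed, and the traceback shows the installed client still importing it under Python 3.10. The surviving name is base64.decodebytes. A short sketch of a lookup that works on both old and new interpreters:

```python
import base64

# `decodestring` was a deprecated alias that Python 3.9 removed;
# `decodebytes` is the name that remains.
decode = getattr(base64, "decodestring", None) or base64.decodebytes

print(decode(b"aGVsbG8="))  # b'hello'
```

Reinstalling the client presumably helps because the newer release no longer imports the removed name.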
