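Note: because the test data consists entirely of repeated values, this approach does not generalize! The article is of limited value!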

Excerpted from: https://segmentfault.com/a/1190000002695169

Doc values store repeated content in compressed form. Given a simple mapping like this:

    mappings = {
        'testdata': {
            '_source': {'enabled': False},
            '_all': {'enabled': False},
            'properties': {
                'name': {
                    'type': 'string',
                    'index': 'no',
                    'store': False,
                    'dynamic': 'strict',
                    'fielddata': {'format': 'doc_values'}
                }
            }
        }
    }

Insert 1 million rows of randomly chosen repeated values:

    import random

    words = ['hello', 'world', 'there', 'here']

    def read_test_data_in_batches():
        batch = []
        for i in range(10000 * 100):
            if i % 50000 == 0:
                print(i)  # progress indicator
            if len(batch) > 10000:
                yield batch
                batch = []
            batch.append({
                '_index': 'wentao-test-doc-values',
                '_type': 'testdata',
                '_source': {'name': random.choice(words)}
            })
        print(i)
        yield batch  # flush the final partial batch

The disk usage is:

size: 28.5Mi (28.5Mi)
docs: 1,000,000 (1,000,000)

Make each word longer and insert 1 million rows again:

    import random

    # Each word is now 100 times longer than before.
    words = ['hello' * 100, 'world' * 100, 'there' * 100, 'here' * 100]

    def read_test_data_in_batches():
        batch = []
        for i in range(10000 * 100):
            if i % 50000 == 0:
                print(i)  # progress indicator
            if len(batch) > 10000:
                yield batch
                batch = []
            batch.append({
                '_index': 'wentao-test-doc-values',
                '_type': 'testdata',
                '_source': {'name': random.choice(words)}
            })
        print(i)
        yield batch  # flush the final partial batch

Disk usage goes down rather than up:

size: 14.4Mi (14.4Mi)
docs: 1,000,000 (1,000,000)
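
To put that number in perspective, here is a rough back-of-the-envelope comparison between the expected raw size of the inserted strings and the reported index size. It is only a sketch, assuming one byte per character (the words are ASCII) and a uniform random choice among the four words; the helper name `expected_raw_bytes` is made up for illustration.

```python
MIB = 1024 * 1024
ROWS = 10000 * 100  # same row count as the experiment


def expected_raw_bytes(words, rows=ROWS):
    """Expected total bytes of the raw strings under a uniform random choice."""
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len * rows


short_words = ['hello', 'world', 'there', 'here']
long_words = [w * 100 for w in short_words]

short_raw = expected_raw_bytes(short_words)
long_raw = expected_raw_bytes(long_words)

# The on-disk figures are the ones reported in the text, not computed here.
print(f"short words: ~{short_raw / MIB:.1f} MiB raw, 28.5 MiB reported on disk")
print(f"long words:  ~{long_raw / MIB:.1f} MiB raw, 14.4 MiB reported on disk")
```

Under these assumptions the long-word run writes several hundred MiB of raw string data yet occupies 14.4 MiB, which is only possible if the repeated strings are not stored once per row.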

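This shows that Lucene compresses these strings when it stores them in columnar form under the hood. In a commercial columnar database, even an optimization this small would be advertised at length as "dictionary encoding" and the like.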
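
The general idea of dictionary encoding can be sketched in a few lines. This is a toy illustration, not Lucene's actual on-disk format: each distinct string is stored once, and every row keeps only a small integer ordinal pointing into that list. The function names `dictionary_encode` and `estimated_bytes` are hypothetical, and the size estimate ignores all real-world overhead.

```python
import math
import random


def dictionary_encode(values):
    """Return (sorted unique terms, per-row ordinals into that list)."""
    terms = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(terms)}
    return terms, [ordinal_of[v] for v in values]


def estimated_bytes(terms, row_count):
    """Rough size: each term stored once plus bit-packed ordinals."""
    bits_per_ordinal = max(1, math.ceil(math.log2(len(terms))))
    dictionary_bytes = sum(len(t.encode('utf-8')) for t in terms)
    ordinal_bytes = math.ceil(row_count * bits_per_ordinal / 8)
    return dictionary_bytes + ordinal_bytes


words = ['hello' * 100, 'world' * 100, 'there' * 100, 'here' * 100]
column = [random.choice(words) for _ in range(100000)]

terms, ordinals = dictionary_encode(column)
decoded = [terms[o] for o in ordinals]  # decoding is a plain list lookup
print(len(terms), estimated_bytes(terms, len(column)))
```

With four distinct values, each row needs only 2 bits in this scheme, so the length of the individual strings barely affects the total: it is paid once per distinct term rather than once per row.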

Nested Document

Experiments show that packing many small documents into one large document as nested documents reduces storage space. Change the earlier mapping to this:

    mappings = {
        'testdata': {
            '_source': {'enabled': False},
            '_all': {'enabled': False},
            'properties': {
                'children': {
                    'type': 'nested',
                    'properties': {
                        'name': {
                            'type': 'string',
                            'index': 'no',
                            'store': False,
                            'dynamic': 'strict',
                            'fielddata': {'format': 'doc_values'}
                        }
                    }
                }
            }
        }
    }

Again insert 1 million rows, but pack every thousand rows into one large document:

    import random

    words = ['hello', 'world', 'there', 'here']

    def read_test_data_in_batches():
        batch = []
        for i in range(10000 * 100):
            if i % 50000 == 0:
                print(i)  # progress indicator
            if len(batch) > 1000:
                # Wrap the accumulated rows as children of one parent document.
                yield [{
                    '_index': 'wentao-test-doc-values2',
                    '_type': 'testdata',
                    '_source': {'children': batch}
                }]
                batch = []
            batch.append({'name': random.choice(words)})
        print(i)
        # Flush the remaining rows as the last parent document.
        yield [{
            '_index': 'wentao-test-doc-values2',
            '_type': 'testdata',
            '_source': {'children': batch}
        }]

The disk usage is:

size: 2.47Mi (2.47Mi)
docs: 1,001,000 (1,001,000)

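The document count did not shrink, yet the index occupies only 2.47 MiB on disk. This presumably benefits from Lucene's internal storage optimizations for nested documents.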
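
The reported count of 1,001,000 is consistent with every nested child being counted as its own document alongside its parent. As a sanity check, the sketch below replays only the batching condition from the generator (no Elasticsearch involved) and counts how many parent documents it would yield; the function name `count_parent_documents` is made up for illustration.

```python
def count_parent_documents(total_rows, flush_threshold):
    """Replay the batching loop and count how many parent documents it yields."""
    parents = 0
    batch_size = 0
    for _ in range(total_rows):
        if batch_size > flush_threshold:
            parents += 1      # corresponds to the yield inside the loop
            batch_size = 0
        batch_size += 1       # corresponds to batch.append(...)
    parents += 1              # corresponds to the final yield after the loop
    return parents


total_rows = 10000 * 100
parents = count_parent_documents(total_rows, 1000)
print(parents, total_rows + parents)
```

Because the condition is `len(batch) > 1000`, each full parent actually carries 1,001 children, and the children plus the parents add up to the document count shown above.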

Reposted from: https://www.cnblogs.com/bonelee/p/6269604.html
