Elasticsearch official site: https://www.elastic.co/

Basic Concepts

There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.

Near Realtime (NRT)

Elasticsearch is a near-realtime search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start up a number of nodes on your network, and assuming they can discover each other, they will all automatically form and join a single cluster named elasticsearch.

In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

In a single cluster, you can define as many indexes as you want.

Type

Deprecated in 6.0.0.

See Removal of mapping types

A type used to be a logical category/partition of your index to allow you to store different types of documents in the same index, e.g. one type for users, another type for blog posts. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types for more.

Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation) which is a ubiquitous internet data interchange format.

Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:

  • It allows you to horizontally split/scale your content volume
  • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and are transparent to you as the user.

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:

  • It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  • It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
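
The rule that a replica never shares a node with its primary can be illustrated with a toy allocator. This is only a sketch: the function name `allocate`, the round-robin placement, and the shard labels (`p0`, `r0`, ...) are assumptions made for illustration and do not reflect how Elasticsearch actually balances shards.

```python
from typing import Dict, List, Tuple


def allocate(num_primaries: int, num_replicas: int,
             nodes: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Toy allocator (hypothetical): returns (node -> shard copies, unassigned copies)."""
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    unassigned: List[str] = []
    for shard in range(num_primaries):
        # Place the primary round-robin across the nodes.
        primary_node = nodes[shard % len(nodes)]
        assignment[primary_node].append(f"p{shard}")
        used = {primary_node}
        for _ in range(num_replicas):
            # A replica may only go to a node holding no copy of this shard.
            candidates = [n for n in nodes if n not in used]
            if not candidates:
                unassigned.append(f"r{shard}")
                continue
            # Prefer the least-loaded eligible node.
            target = min(candidates, key=lambda n: len(assignment[n]))
            assignment[target].append(f"r{shard}")
            used.add(target)
    return assignment, unassigned


two_nodes, left_over = allocate(5, 1, ["node-1", "node-2"])
print(two_nodes, left_over)

single_node, left_over_single = allocate(5, 1, ["node-1"])
print(single_node, left_over_single)
```

With a single node, every replica in this sketch stays unassigned, because the only available node already holds the primary.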

To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).

The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may also change the number of replicas dynamically at any time. You can change the number of shards for an existing index using the _shrink and _split APIs; however, this is not a trivial task, and planning for the correct number of shards up front is the better approach.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica (this is the default for the versions described here; from 7.0 onward the default is 1 primary shard). This means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
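
The arithmetic behind the total of 10 shards can be written out as a small helper. The function name `total_shards` is hypothetical and exists only for this illustration.

```python
def total_shards(primaries: int, replicas: int) -> int:
    """Total shard copies for one index: each primary plus its replica copies."""
    return primaries * (1 + replicas)


# 5 primaries with 1 replica each, as in the paragraph above.
print(total_shards(5, 1))  # 10
# With zero replicas only the primaries remain.
print(total_shards(5, 0))  # 5
```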

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
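
As a rough check of that limit, the sketch below recomputes the number and derives the smallest number of primary shards that could hold a given document count. The helper name `min_primary_shards` is an assumption for illustration, and the result is only a lower bound based on document count; real sizing also depends on data volume and query load.

```python
INTEGER_MAX_VALUE = 2**31 - 1                 # Java's Integer.MAX_VALUE
MAX_DOCS_PER_SHARD = INTEGER_MAX_VALUE - 128  # the limit quoted above


def min_primary_shards(total_docs: int) -> int:
    """Smallest primary shard count whose combined document limit covers total_docs."""
    if total_docs < 0:
        raise ValueError("total_docs must be non-negative")
    # Integer ceiling division; an index always has at least one primary shard.
    return max(1, -(-total_docs // MAX_DOCS_PER_SHARD))


print(MAX_DOCS_PER_SHARD)                  # 2147483519
print(min_primary_shards(10_000_000_000))  # 5
```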

What are mapping types?

Since the first release of Elasticsearch, each document has been stored in a single index and assigned a single mapping type. A mapping type was used to represent the type of document or entity being indexed, for instance a twitter index might have a user type and a tweet type.

Each mapping type could have its own fields, so the user type might have a full_name field, a user_name field, and an email field, while the tweet type could have a content field, a tweeted_at field and, like the user type, a user_name field.

Each document had a _type meta-field containing the type name, and searches could be limited to one or more types by specifying the type name(s) in the URL:

GET twitter/user,tweet/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

The _type field was combined with the document’s _id to generate a _uid field, so documents of different types with the same _id could exist in a single index.
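
A minimal sketch of why this allowed the same _id under different types: if the unique key is built from both values, the keys differ even when the ids match. The `#` separator and the `make_uid` helper are assumptions for illustration, not a statement of the exact internal format.

```python
def make_uid(doc_type: str, doc_id: str) -> str:
    """Combine type and id into one key (the '#' separator is an assumption)."""
    return f"{doc_type}#{doc_id}"


# A plain dict stands in for the index, keyed by the combined uid.
index = {}
index[make_uid("user", "1")] = {"user_name": "kimchy"}
index[make_uid("tweet", "1")] = {"content": "Types are going away"}

# Both documents use _id "1" but occupy separate entries.
print(sorted(index))  # ['tweet#1', 'user#1']
```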

Mapping types were also used to establish a parent-child relationship between documents, so documents of type question could be parents to documents of type answer.

Summary: before types were removed, a type was roughly analogous to a table in a database, and documents in different types could share the same _id.

Why are mapping types being removed?

Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.

This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.

In an Elasticsearch index, fields that have the same name in different mapping types are backed by the same Lucene field internally. In other words, using the example above, the user_name field in the user type is stored in exactly the same field as the user_name field in the tweet type, and both user_name fields must have the same mapping (definition) in both types.

This can lead to frustration when, for example, you want deleted to be a date field in one type and a boolean field in another type in the same index.
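
The conflict can be modeled with a toy index in which all types share one table of field definitions. The class `ToyIndex` and the exception `MappingConflict` are hypothetical names; the sketch only mimics the shared-field rule described above, not Elasticsearch's real mapping code.

```python
class MappingConflict(Exception):
    """Raised when a field is redefined with a different datatype."""


class ToyIndex:
    def __init__(self) -> None:
        self.fields = {}  # field name -> datatype, shared by every type
        self.types = {}   # type name -> set of field names it uses

    def put_mapping(self, type_name: str, properties: dict) -> None:
        # Validate first so a rejected request leaves the index unchanged.
        for field, datatype in properties.items():
            existing = self.fields.get(field)
            if existing is not None and existing != datatype:
                raise MappingConflict(
                    f"field '{field}' is already mapped as '{existing}', "
                    f"cannot map it as '{datatype}' in type '{type_name}'"
                )
        self.fields.update(properties)
        self.types.setdefault(type_name, set()).update(properties)


twitter = ToyIndex()
twitter.put_mapping("user", {"user_name": "keyword", "deleted": "date"})
twitter.put_mapping("tweet", {"user_name": "keyword"})  # same definition: accepted

try:
    twitter.put_mapping("tweet", {"deleted": "boolean"})
except MappingConflict as err:
    print("rejected:", err)
```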

On top of that, storing different entities that have few or no fields in common in the same index leads to sparse data and interferes with Lucene’s ability to compress documents efficiently.

For these reasons, we have decided to remove the concept of mapping types from Elasticsearch.

Alternatives to mapping types

Index per document type

The first alternative is to have an index per document type. Instead of storing tweets and users in a single twitter index, you could store tweets in the tweets index and users in the users index. Indices are completely independent of each other, so there will be no conflict of field types between indices.

This approach has two benefits:

  • Data is more likely to be dense and so benefit from compression techniques used in Lucene.
  • The term statistics used for scoring in full text search are more likely to be accurate because all documents in the same index represent a single entity.

Each index can be sized appropriately for the number of documents it will contain: you can use a smaller number of primary shards for users and a larger number of primary shards for tweets.

Custom type field

Of course, there is a limit to how many primary shards can exist in a cluster so you may not want to waste an entire shard for a collection of only a few thousand documents. In this case, you can implement your own custom type field which will work in a similar way to the old _type.

Let’s take the user/tweet example above. Originally, the workflow would have looked something like this:

PUT twitter
{
  "mappings": {
    "user": {
      "properties": {
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" }
      }
    },
    "tweet": {
      "properties": {
        "content": { "type": "text" },
        "user_name": { "type": "keyword" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

PUT twitter/user/kimchy
{
  "name": "Shay Banon",
  "user_name": "kimchy",
  "email": "shay@kimchy.com"
}

PUT twitter/tweet/1
{
  "user_name": "kimchy",
  "tweeted_at": "2017-10-24T09:00:00Z",
  "content": "Types are going away"
}

GET twitter/tweet/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

You could achieve the same thing by adding a custom type field as follows:

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "type": { "type": "keyword" },
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" },
        "content": { "type": "text" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

PUT twitter/_doc/user-kimchy
{
  "type": "user",
  "name": "Shay Banon",
  "user_name": "kimchy",
  "email": "shay@kimchy.com"
}

PUT twitter/_doc/tweet-1
{
  "type": "tweet",
  "user_name": "kimchy",
  "tweeted_at": "2017-10-24T09:00:00Z",
  "content": "Types are going away"
}

GET twitter/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "user_name": "kimchy"
        }
      },
      "filter": {
        "match": {
          "type": "tweet"
        }
      }
    }
  }
}

The explicit type field takes the place of the implicit _type field.
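
If you build such requests from application code, the pattern might be wrapped in two small helpers. This is a sketch under assumptions: `custom_id` and `typed_search_body` are hypothetical names, and the code only constructs the JSON body; it does not talk to a cluster.

```python
import json


def custom_id(doc_type: str, original_id: str) -> str:
    """Prefix the id with the type, as in user-kimchy and tweet-1."""
    return f"{doc_type}-{original_id}"


def typed_search_body(doc_type: str, user_name: str) -> dict:
    """Search body that filters on the explicit type field instead of the URL."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"user_name": user_name}},
                "filter": {"match": {"type": doc_type}},
            }
        }
    }


print(custom_id("tweet", "1"))
print(json.dumps(typed_search_body("tweet", "kimchy"), indent=2))
```

The printed body could then be sent with GET twitter/_search.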

Parent/Child without mapping types

Previously, a parent-child relationship was represented by making one mapping type the parent, and one or more other mapping types the children. Without types, we can no longer use this syntax. The parent-child feature will continue to function as before, except that the way of expressing the relationship between documents has been changed to use the new join field.
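
For orientation, the request bodies for a join field might be built as follows. This is a sketch under assumptions: the field name `relation` and the helper names are hypothetical, and the mapping is nested under `_doc` to match the 6.x-style requests used elsewhere in this document. The code only builds JSON; it does not contact a cluster.

```python
import json

JOIN_FIELD = "relation"  # hypothetical field name chosen for this sketch


def join_mapping(parent: str, child: str) -> dict:
    """Index-creation body declaring one parent/child relation via a join field."""
    return {
        "mappings": {
            "_doc": {
                "properties": {
                    JOIN_FIELD: {"type": "join", "relations": {parent: child}}
                }
            }
        }
    }


def parent_doc(source: dict, relation_name: str) -> dict:
    """Document body for the parent side of the relation."""
    return {**source, JOIN_FIELD: {"name": relation_name}}


def child_doc(source: dict, relation_name: str, parent_id: str) -> dict:
    """Document body for the child side, pointing at its parent's id."""
    return {**source, JOIN_FIELD: {"name": relation_name, "parent": parent_id}}


print(json.dumps(join_mapping("question", "answer"), indent=2))
print(json.dumps(parent_doc({"text": "What is a join field?"}, "question")))
print(json.dumps(child_doc({"text": "It replaces typed parent/child."}, "answer", "1")))
```

A child document also has to be indexed with its routing value set to the parent's id, so that parent and child land on the same shard.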

Schedule for removal of mapping types

This is a big change for our users, so we have tried to make it as painless as possible. The change will roll out as follows:

Elasticsearch 5.6.0

  • Setting index.mapping.single_type: true on an index will enable the single-type-per-index behaviour which will be enforced in 6.0.
  • The join field replacement for parent-child is available on indices created in 5.6.

Elasticsearch 6.x

  • Indices created in 5.x will continue to function in 6.x as they did in 5.x.
  • Indices created in 6.x only allow a single-type per index. Any name can be used for the type, but there can be only one. The preferred type name is _doc, so that index APIs have the same path as they will have in 7.0: PUT {index}/_doc/{id} and POST {index}/_doc
  • The _type name can no longer be combined with the _id to form the _uid field. The _uid field has become an alias for the _id field.
  • New indices no longer support the old-style of parent/child and should use the join field instead.
  • The _default_ mapping type is deprecated.
  • In 6.7, the index creation, index template, and mapping APIs support a query string parameter (include_type_name) which indicates whether requests and responses should include a type name. It defaults to true, and should be set to an explicit value to prepare to upgrade to 7.0. Not setting include_type_name will result in a deprecation warning. Indices which don’t have an explicit type will use the dummy type name _doc.

Elasticsearch 7.x

  • Specifying types in requests is deprecated. For instance, indexing a document no longer requires a document type. The new index APIs are PUT {index}/_doc/{id} in case of explicit ids and POST {index}/_doc for auto-generated ids.
  • The include_type_name parameter in the index creation, index template, and mapping APIs will default to false. Setting the parameter at all will result in a deprecation warning.
  • The _default_ mapping type is removed.

Elasticsearch 8.x

  • Specifying types in requests is no longer supported.
  • The include_type_name parameter is removed.
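
The typeless index endpoints named in the schedule can be captured in a tiny helper, which may be handy when porting client code. The function name `index_request` is hypothetical; it only returns the HTTP method and path.

```python
from typing import Optional, Tuple


def index_request(index: str, doc_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (method, path) for indexing a document without a custom type name."""
    if doc_id is None:
        # Auto-generated id: POST {index}/_doc
        return ("POST", f"{index}/_doc")
    # Explicit id: PUT {index}/_doc/{id}
    return ("PUT", f"{index}/_doc/{doc_id}")


print(index_request("twitter", "tweet-1"))  # ('PUT', 'twitter/_doc/tweet-1')
print(index_request("twitter"))             # ('POST', 'twitter/_doc')
```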

Migrating multi-type indices to single-type

The Reindex API can be used to convert multi-type indices to single-type indices. The following examples can be used in Elasticsearch 5.6 or Elasticsearch 6.x. In 6.x, there is no need to specify index.mapping.single_type as that is the default.

Index per document type

This first example splits our twitter index into a tweets index and a users index:

PUT users
{
  "settings": {
    "index.mapping.single_type": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" }
      }
    }
  }
}

PUT tweets
{
  "settings": {
    "index.mapping.single_type": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": { "type": "text" },
        "user_name": { "type": "keyword" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "user"
  },
  "dest": {
    "index": "users"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet"
  },
  "dest": {
    "index": "tweets"
  }
}
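
The effect of the two reindex calls can be pictured with an in-memory model. This is a sketch under assumptions: documents are plain dicts with `_type`, `_id` and `_source` keys, and `split_by_type` is a hypothetical helper that models only how documents are routed to a destination index by type, not the Reindex API itself.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_type(docs: List[dict], dest_for_type: Dict[str, str]) -> Dict[str, List[dict]]:
    """Group documents into destination indices according to their old _type."""
    result = defaultdict(list)
    for doc in docs:
        dest = dest_for_type.get(doc["_type"])
        if dest is None:
            continue  # no reindex request selects this type
        result[dest].append({"_id": doc["_id"], "_source": doc["_source"]})
    return dict(result)


twitter = [
    {"_type": "user", "_id": "kimchy", "_source": {"user_name": "kimchy"}},
    {"_type": "tweet", "_id": "1", "_source": {"content": "Types are going away"}},
]
print(split_by_type(twitter, {"user": "users", "tweet": "tweets"}))
```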

Custom type field

This next example adds a custom type field and sets it to the value of the original _type. It also adds the type to the _id in case there are any documents of different types which have conflicting IDs:

PUT new_twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "type": { "type": "keyword" },
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" },
        "content": { "type": "text" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  },
  "script": {
    "source": """
      ctx._source.type = ctx._type;
      ctx._id = ctx._type + '-' + ctx._id;
      ctx._type = '_doc';
    """
  }
}
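
What the script does to each document can be mirrored in plain Python. This is a sketch under assumptions: the dict layout and the `migrate` helper are hypothetical stand-ins for the `ctx` object, used only to show the three assignments and why the prefixed ids no longer collide.

```python
def migrate(doc: dict) -> dict:
    """Apply the same three steps as the reindex script to one document."""
    source = dict(doc["_source"])
    source["type"] = doc["_type"]             # ctx._source.type = ctx._type
    new_id = f"{doc['_type']}-{doc['_id']}"   # ctx._id = ctx._type + '-' + ctx._id
    return {"_type": "_doc", "_id": new_id, "_source": source}  # ctx._type = '_doc'


# Two documents of different types that share the _id "1".
old_docs = [
    {"_type": "user", "_id": "1", "_source": {"user_name": "kimchy"}},
    {"_type": "tweet", "_id": "1", "_source": {"content": "Types are going away"}},
]
new_docs = [migrate(d) for d in old_docs]
print([d["_id"] for d in new_docs])  # ['user-1', 'tweet-1']
```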
