How to Work Optimally with Relational Databases

Relational databases handle data smoothly, whether working with small volumes or processing millions of rows. We will look at how to use relational databases according to our needs, and get the most out of them.

MySQL has been a popular choice for small to large enterprise companies because of its ability to scale. Similarly, PostgreSQL has also seen a rise in popularity.

According to the Stack Overflow survey 2018, MySQL is the most popular database among all users.

The examples described below use InnoDB as the MySQL storage engine. They are not limited to MySQL and are also relevant to other relational databases like PostgreSQL. All the benchmarking was done on a machine with 8GB RAM and an i5 2.7 GHz processor.

Let’s get started with the basics of how relational databases store data.

Understanding Relational Databases

Storage

MySQL is a relational database where all data is represented in terms of tuples, grouped into relations. A tuple is represented by its attributes.

Let’s say we have an application where people can loan books. We will need to store all the book lending transactions. In order to store them, we have designed a simple relational table with the following command:

> CREATE TABLE book_transactions (
    id INTEGER NOT NULL AUTO_INCREMENT,
    book_id INTEGER,
    borrower_id INTEGER,
    lender_id INTEGER,
    return_date DATE,
    PRIMARY KEY (id)
  );

The table looks like:

book_transactions
------------------------------------------------
id  borrower_id  lender_id  book_id  return_date

Here id is the primary key, and borrower_id, lender_id, and book_id are the foreign keys. After we launch our application, a few transactions are recorded:

book_transactions
------------------------------------------------
id  borrower_id  lender_id  book_id  return_date
------------------------------------------------
1   1            1          1        2018-01-13
2   2            3          2        2018-01-13
3   1            2          1        2018-01-13

Fetching the data

We have a dashboard page for each user where they can see the transactions for the books they have rented. So let’s fetch the book transactions for a user:

> SELECT * FROM book_transactions WHERE borrower_id = 1;

book_transactions
------------------------------------------------
id  borrower_id  lender_id  book_id  return_date
------------------------------------------------
1   1            1          1        2018-01-13
3   1            2          1        2018-01-13

This scans the relation sequentially and gives us the data for the user. It seems very fast, as there is very little data in our relation. To see the exact query execution time, turn on profiling by executing the following command:

> set profiling=1;

Once the profiling is set, run the query again and use the following command to look into the execution time:

> show profiles;

This will return the duration of the query we executed.

Query_ID | Duration   | Query
1        | 0.00254000 | SELECT * FROM book_transactions ...

The execution time looks very good.

Slowly, the book_transactions table starts to fill up with data, as there are a lot of transactions going on.

The problem

This increases the number of tuples in our relation. With this, fetching the book transactions for a user starts to take more time, because MySQL needs to go through all the tuples to find the result.

To insert a lot of data into this table, I wrote the following stored procedure:

DELIMITER //
CREATE PROCEDURE InsertALot()
BEGIN
  DECLARE i INT DEFAULT 1;
  WHILE (i <= 100000) DO
    INSERT INTO book_transactions (borrower_id, lender_id, book_id, return_date)
    VALUES ((FLOOR(1 + RAND() * 60)), (FLOOR(1 + RAND() * 60)), (FLOOR(1 + RAND() * 60)), CURDATE());
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

* It took around 7 minutes to insert 1.5 million rows

This inserts 100,000 random records in our table book_transactions. After running this, the profiler shows a slight increase in the runtime:

Query_ID | Duration   | Query
1        | 0.07151000 | SELECT * FROM book_transactions ...

Let’s add some more data by running the above procedure a few more times and see what happens. As more and more data is added, the duration of the query increases. With 1.5 million rows in the table, the response time for the same query is now noticeably higher.

Query_ID | Duration   | Query
1        | 0.36795200 | SELECT * FROM book_transactions ...

This is just a simple query involving an integer field.

With more compound queries, ORDER BY queries, and COUNT queries, the execution time gets even worse.

This does not seem to be a long time for a single query, but when we have thousands or even millions of queries running every minute, this makes a big difference.

There will be a lot more wait time, and this will hamper the overall performance of the application. The execution time for the same query increased from 2ms to 370ms.

Getting the speed back

Index

MySQL and other databases provide indexing, a data structure that helps to retrieve data faster.

There are different types of indexing in MySQL:

  • Primary Key — The index added to the primary key. By default, primary keys are always indexed. It also ensures that two rows do not have the same primary key value.

  • Unique — A unique key index ensures that no two rows in a relation have the same value. Multiple NULL values can be stored with a unique index.

  • Index — An index added to any field other than the primary key.

  • Full Text — A full-text index helps with queries against character-based data.

There are mainly two ways an index is stored:

Hash — This is mostly used for exact matching (=), and does not work with comparisons (≥, ≤).

B-Tree — This is the most common way in which the above-mentioned index types are stored.

MySQL uses a B-Tree as its default indexing format. The data is stored in a balanced tree, which makes retrieval of data fast.

The data organization done by the B-tree helps to skip the full table scan across all tuples in our relation.

Say the B-Tree holds a total of 16 nodes and we need to find the number 6. We only need a total of 3 comparisons to get to it. This helps improve the performance of search.
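
To make the intuition concrete, here is a small sketch (in Python, not MySQL internals) of how a tree-organized, sorted structure narrows the search at each step instead of scanning every tuple; the 16-key list mirrors the 16-node example above:

```python
def tree_like_search(sorted_keys, target):
    """Binary search over sorted keys, counting comparisons.

    This mimics how a B-Tree narrows the search range at each level
    instead of scanning every tuple sequentially.
    """
    comparisons = 0
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        comparisons += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid, comparisons
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

keys = list(range(1, 17))        # 16 keys, like the 16-node example
idx, steps = tree_like_search(keys, 6)
print(idx, steps)                # finds 6 in 3 comparisons, not 16
```

A sequential scan would need up to 16 comparisons here; the tree-style search needs at most ⌈log2(16)⌉ + 1.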

So to improve the performance of queries on our book_transactions relation, let’s add an index on the field lender_id.

> CREATE INDEX lenderId ON book_transactions(lender_id);
----------------------------------------------------
* Adding this index took around 6.18 sec

The above command adds an index on the lender_id field. Let’s look at how this affects performance for the 1.5 million rows we have by running a query filtered on that field.

> SELECT * FROM book_transactions WHERE lender_id = 1;

Query_ID | Duration   | Query
1        | 0.00787600 | SELECT * FROM book_transactions ...

Woohoo! We are back now.

It is as fast as it used to be when there were only 3 records in our relation. With the right index added, we can see a dramatic improvement in performance.

Composite and single index

The index we added was a single-field index. An index can also be added across multiple fields, making it a composite index.

If our query involved multiple fields, a composite index would have helped us. We can add a composite index with the following command:

> CREATE INDEX lenderReturnDate ON book_transactions(lender_id, return_date);
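
As a quick sanity check, the sketch below uses SQLite from Python (not MySQL, but the planner behaves similarly for this case) to confirm that a query filtering on both fields can be served by the composite index; the schema mirrors our book_transactions table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE book_transactions (
                  id INTEGER PRIMARY KEY,
                  book_id INTEGER, borrower_id INTEGER,
                  lender_id INTEGER, return_date TEXT)""")
conn.execute("""CREATE INDEX lenderReturnDate
                ON book_transactions(lender_id, return_date)""")

# EXPLAIN QUERY PLAN is SQLite's analogue of MySQL's EXPLAIN.
plan = conn.execute("""EXPLAIN QUERY PLAN
                       SELECT * FROM book_transactions
                       WHERE lender_id = 1 AND return_date = '2018-01-13'""").fetchall()
print(plan[0][-1])  # the plan detail mentions the lenderReturnDate index
```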

Other Usage of Indices

Querying is not the only use of indices. They can be used for the ORDER BY clause as well. Let’s order the records with respect to lender_id.

> SELECT * FROM book_transactions ORDER BY lender_id;
1517185 rows in set (4.08 sec)

4.08 sec, that’s a lot! So what went wrong? We have our index in place. Let’s dig into how the query is being executed with the help of the EXPLAIN clause.

Using Explain

We can add an explain clause to see how the query will be executed in our current dataset.

> EXPLAIN SELECT * FROM book_transactions ORDER BY lender_id;

The output of this is as shown below:

type: ALL, possible_keys: NULL, key: NULL, key_len: NULL, rows: 1517185

EXPLAIN returns various fields. Let’s look at the output above and find out the problem.

rows: The total number of rows that will be scanned.

filtered: The percentage of rows that will be scanned to get the data.

type: Indicates whether an index is being used. ALL means no index is used.

possible_keys, key, and key_len are all NULL, which means that no index is being used.

So why is the query not using the index?

This is because we have SELECT * in our query, which means we are selecting all the fields from our relation.

The index only has information about fields that are indexed, and not about other fields. This means MySQL will need to go to the main table to fetch data again.

So how should we write the query?

Select only the required fields

To remove the need to go back to the main table, we need to select only values that are present in the index. So let’s change the query:

> SELECT lender_id FROM book_transactions ORDER BY lender_id;

This will return the result in 0.46 seconds, which is way faster. But there is still room for improvement.

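
The same effect can be sketched with SQLite from Python (the plan wording differs from MySQL, but the idea is identical): when the query only selects the indexed column, the planner answers it from the index alone, which SQLite calls a covering index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE book_transactions (
                  id INTEGER PRIMARY KEY, lender_id INTEGER)""")
conn.execute("CREATE INDEX lenderId ON book_transactions(lender_id)")

# Only the indexed column is selected, so no lookup into the main
# table is needed; the plan reports a covering index.
plan = conn.execute("""EXPLAIN QUERY PLAN
                       SELECT lender_id FROM book_transactions
                       ORDER BY lender_id""").fetchall()
print(plan[0][-1])
```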
As this query still runs over all 1.5 million records we have, it takes a little more time because it needs to load the data into memory.

Use LIMIT

We might not need all 1.5 million rows at the same time. So instead of fetching all the data, using LIMIT and fetching the data in batches is a better way to go.

> SELECT lender_id FROM book_transactions ORDER BY lender_id LIMIT 1000;

With a limit in place, the response time improves drastically, and the query executes in 0.0025 seconds. Now we can fetch the next batch with OFFSET.

> SELECT lender_id FROM book_transactions ORDER BY lender_id LIMIT 1000 OFFSET 1000;

This will fetch the next batch of 1000 rows. By increasing the offset, we can page through all the data. But there is a ‘gotcha’: as the offset increases, the performance of the query decreases.

This is because MySQL has to scan through all the skipped rows to reach the offset point. So it is better not to use a high offset.
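
A common workaround (sketched here with SQLite from Python; the SQL is the same in MySQL) is ‘keyset’ pagination: instead of skipping rows with OFFSET, remember the last id seen and filter on it, so the primary-key index seeks directly to the start of each page and the cost does not grow with page depth:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_transactions (id INTEGER PRIMARY KEY, lender_id INTEGER)")
conn.executemany("INSERT INTO book_transactions (lender_id) VALUES (?)",
                 [(i % 60,) for i in range(5000)])

def fetch_page(conn, after_id=0, page_size=1000):
    # The WHERE clause lets the index jump to the page start,
    # instead of scanning and discarding `after_id` rows.
    return conn.execute(
        "SELECT id FROM book_transactions WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size)).fetchall()

page1 = fetch_page(conn)
page2 = fetch_page(conn, after_id=page1[-1][0])
print(page1[0][0], page1[-1][0], page2[0][0])  # 1 1000 1001
```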

What about COUNT queries?

InnoDB engine has an ability to write concurrently. This makes it highly scalable and improves the throughput per second.

But this comes at a cost: InnoDB cannot keep a cached counter of the number of records in a table. So the count has to be done in real time by scanning through all the filtered data, which makes COUNT queries slow.

So for large data sets, it is recommended to maintain summarized count data in the application logic.
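
One common way to do this is a ‘counter cache’: keep a running total in a small summary table, updated in the same transaction as the insert, so reads never need a full COUNT(*) scan. A minimal sketch in Python with SQLite (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_transactions (id INTEGER PRIMARY KEY, lender_id INTEGER)")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('book_transactions', 0)")

def record_transaction(conn, lender_id):
    # Update the row and the counter in one transaction so they stay in sync.
    with conn:
        conn.execute("INSERT INTO book_transactions (lender_id) VALUES (?)", (lender_id,))
        conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'book_transactions'")

for lender in (1, 2, 1):
    record_transaction(conn, lender)

print(conn.execute("SELECT value FROM counters WHERE name='book_transactions'").fetchone()[0])  # 3
```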

Why not add an index to all fields?

Adding an index helps improve performance a lot, but it also comes with a cost and should be used judiciously. Adding indexes to more fields has the following issues:

  • It needs a lot of memory, and a bigger machine
  • When we delete a record, there is a re-index (CPU intensive and slower deletes)
  • When we insert anything, there is a re-index (CPU intensive and slower inserts)
  • An update does not do a full re-index, so updates are faster and more CPU efficient

We are now clear that adding an index helps a lot. But for fast performance we cannot select all the fields, only the ones that are indexed.

So how can we select all the attributes and still get fast performance?

Partitioning

An index only contains the fields it is built on. It holds no data for the fields that are not part of the index.

So, as we said earlier, MySQL needs to look back at the main table to get the data for other fields. This can slow down the execution time.

The way we can solve this is by using partitioning.

Partitioning is a technique in which MySQL splits a table’s data into multiple tables, but still manages it as one.

While doing any kind of operation on the table, we need to specify which partition to use. With the data broken down, MySQL has a smaller data set to query. Figuring out the right partitioning scheme according to the needs is key for high performance.

But if we are still using the same machine, will it scale?

Sharding

With a huge data set, storing all your data on the same machine can be troublesome.

A specific partition can be hot and need more querying, while others need less. So one will affect the other, and they cannot scale separately.

Let’s say the most recent three months of data are the most used, whereas older data are used less. Perhaps the recent data are mostly updated or created, whereas the old data are mostly only read.

To resolve this issue, we can move the partition holding the most recent three months to another machine. Sharding is a way in which we divide a big data set into smaller chunks and move them to separate RDBMS instances. In other words, sharding can also be called ‘horizontal partitioning’.
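
At the application level, sharding needs a routing rule that maps each row to a shard. A minimal sketch (the shard names and the modulo rule are illustrative; production systems often prefer consistent hashing so that adding a shard moves less data):

```python
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(lender_id: int) -> str:
    # Stable modulo routing: the same lender always lands on the same shard,
    # so all of that lender's transactions can be queried from one database.
    return SHARDS[lender_id % len(SHARDS)]

print(shard_for(1), shard_for(5), shard_for(6))  # db-shard-1 db-shard-2 db-shard-0
```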

Relational databases have the ability to scale as the application grows. Finding the right indexes and tuning the infrastructure according to need is necessary.



Also Posted on Milap Neupane Blog: How to work Optimally with relational Databases

Translated from: https://www.freecodecamp.org/news/how-to-work-optimally-with-relational-databases-627073f82d56/
