CMU 15-445/645 Lab3-Query Execution

0.Preface

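Lab 3 page: https://15445.courses.cs.cmu.edu/fall2020/project3/
This post summarizes the background knowledge needed for Lab 3 and the approach to each task. The code is not published; leave a comment if you have questions.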

1.Task #1 - SYSTEM CATALOG

The database maintains an internal catalog that records the table and index information for the whole database.
With the catalog, we can go from the name or id of a table or index to a pointer to that table or index, along with its metadata.
Implement Create and Get for tables and indexes in src/include/catalog/catalog.h:
CreateTable(Transaction *txn, const std::string &table_name, const Schema &schema)
GetTable(const std::string &table_name)
GetTable(table_oid_t table_oid)

CreateIndex(txn, index_name, table_name, schema, key_schema, key_attrs, keysize)
GetIndex(const std::string &index_name, const std::string &table_name)
GetIndex(index_oid_t index_oid)
GetTableIndexes(const std::string &table_name)

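The first task is not especially hard, but a few points need attention:
1) Take care to maintain the several hash tables that the catalog uses for lookups.
2) The IndexInfo constructor takes a unique_ptr to an Index, so create a BPlusTreeIndex with new and pass it in.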
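
The hash-table bookkeeping from point 1 can be sketched as follows. This is a simplified stand-in and not the BusTub Catalog: ToyCatalog, ToyTableInfo, and the string used as a schema are hypothetical names chosen for illustration. It only shows the idea of one map that owns the metadata by oid and a second map that resolves names to oids.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

using table_oid_t = uint32_t;

// Hypothetical, simplified stand-in for the table metadata struct.
struct ToyTableInfo {
  std::string schema;  // placeholder for a real Schema object
  std::string name;
  table_oid_t oid;
};

// Toy catalog: tables_ owns the metadata by oid, names_ resolves names to oids.
class ToyCatalog {
 public:
  ToyTableInfo *CreateTable(const std::string &name, const std::string &schema) {
    if (names_.count(name) != 0) {
      throw std::invalid_argument("table already exists: " + name);
    }
    table_oid_t oid = next_oid_++;
    auto info = std::make_unique<ToyTableInfo>(ToyTableInfo{schema, name, oid});
    ToyTableInfo *raw = info.get();
    // Both maps must be updated together, otherwise lookups go out of sync.
    tables_.emplace(oid, std::move(info));
    names_.emplace(name, oid);
    assert(tables_.size() == names_.size());
    return raw;
  }

  ToyTableInfo *GetTable(const std::string &name) const {
    auto it = names_.find(name);
    if (it == names_.end()) {
      throw std::out_of_range("no such table: " + name);
    }
    return GetTable(it->second);
  }

  ToyTableInfo *GetTable(table_oid_t oid) const {
    auto it = tables_.find(oid);
    if (it == tables_.end()) {
      throw std::out_of_range("no such table oid");
    }
    return it->second.get();
  }

 private:
  std::unordered_map<table_oid_t, std::unique_ptr<ToyTableInfo>> tables_;
  std::unordered_map<std::string, table_oid_t> names_;
  table_oid_t next_oid_{0};
};
```

Returning a raw pointer while the map keeps the unique_ptr means the catalog stays the single owner of the metadata.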

2.Task #2 - EXECUTORS

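Overall the lab is not especially difficult, but you need a clear picture of the execution logic and of how the classes relate to one another.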

2.1 Executor execution flow

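It starts with the Execute() function in execution_engine.h:
First parameter (const AbstractPlanNode *plan): the plan node that the executor corresponds to.
Second parameter (std::vector<Tuple> *result_set): where the results are stored.
Third parameter (Transaction *txn): the transaction.
Fourth parameter (ExecutorContext *exec_ctx): the context of the current execution, which records the buffer pool manager (bpm), log manager, lock manager, catalog, and transaction manager. The most important of these is the catalog, which holds the tables, indexes, and so on.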

// execution_engine.h
bool Execute(const AbstractPlanNode *plan, std::vector<Tuple> *result_set, Transaction *txn,
             ExecutorContext *exec_ctx) {
  // construct executor
  auto executor = ExecutorFactory::CreateExecutor(exec_ctx, plan);

  // prepare
  executor->Init();

  // execute
  try {
    Tuple tuple;
    RID rid;
    while (executor->Next(&tuple, &rid)) {
      if (result_set != nullptr && tuple.IsAllocated()) {
        result_set->push_back(tuple);
      }
    }
  } catch (Exception &e) {
    // TODO(student): handle exceptions
  }

  return true;
}

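Next comes the factory, ExecutorFactory::CreateExecutor(). Based on the type of the plan node passed in, it uses dynamic_cast to convert the node to the matching concrete plan node type (a base-class pointer to a derived-class pointer) and calls the constructor of the corresponding executor. Execute() then calls the executor's Init method once and calls Next repeatedly: when Next returns true, the result is stored in result_set and Next is called again; when Next returns false, execution ends. The remaining tasks therefore come down to implementing Init and Next for each executor.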

// executor_factory.cpp
std::unique_ptr<AbstractExecutor> ExecutorFactory::CreateExecutor(ExecutorContext *exec_ctx,
                                                                  const AbstractPlanNode *plan) {
  switch (plan->GetType()) {
    // Create a new sequential scan executor.
    case PlanType::SeqScan: {
      return std::make_unique<SeqScanExecutor>(exec_ctx, dynamic_cast<const SeqScanPlanNode *>(plan));
    }
    ...
}
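
To see the factory and the Init/Next pull loop working together outside BusTub, here is a minimal self-contained sketch. All names (ToyPlan, ToyValuesPlan, ToyLimitPlan, ToyExecutor, CreateToyExecutor, RunToyPlan) are hypothetical, and tuples are reduced to plain ints. It is an illustration of the pattern under those assumptions, not the lab's code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

enum class ToyPlanType { Values, Limit };

// Hypothetical plan nodes: a base class plus two concrete kinds.
struct ToyPlan {
  virtual ~ToyPlan() = default;
  virtual ToyPlanType GetType() const = 0;
};
struct ToyValuesPlan : ToyPlan {
  std::vector<int> values;
  explicit ToyValuesPlan(std::vector<int> v) : values(std::move(v)) {}
  ToyPlanType GetType() const override { return ToyPlanType::Values; }
};
struct ToyLimitPlan : ToyPlan {
  const ToyPlan *child;
  size_t limit;
  ToyLimitPlan(const ToyPlan *c, size_t n) : child(c), limit(n) {}
  ToyPlanType GetType() const override { return ToyPlanType::Limit; }
};

// Every executor exposes the same Init/Next interface.
struct ToyExecutor {
  virtual ~ToyExecutor() = default;
  virtual void Init() = 0;
  virtual bool Next(int *out) = 0;
};

std::unique_ptr<ToyExecutor> CreateToyExecutor(const ToyPlan *plan);

struct ToyValuesExecutor : ToyExecutor {
  const ToyValuesPlan *plan;
  size_t pos{0};
  explicit ToyValuesExecutor(const ToyValuesPlan *p) : plan(p) {}
  void Init() override { pos = 0; }
  bool Next(int *out) override {
    if (pos >= plan->values.size()) return false;
    *out = plan->values[pos++];
    return true;
  }
};

struct ToyLimitExecutor : ToyExecutor {
  const ToyLimitPlan *plan;
  std::unique_ptr<ToyExecutor> child;
  size_t emitted{0};
  explicit ToyLimitExecutor(const ToyLimitPlan *p) : plan(p), child(CreateToyExecutor(p->child)) {}
  void Init() override {
    child->Init();
    emitted = 0;
  }
  bool Next(int *out) override {
    if (emitted >= plan->limit || !child->Next(out)) return false;
    ++emitted;
    return true;
  }
};

// Factory: switch on the plan type, then downcast the base-class pointer.
std::unique_ptr<ToyExecutor> CreateToyExecutor(const ToyPlan *plan) {
  assert(plan != nullptr);
  switch (plan->GetType()) {
    case ToyPlanType::Values:
      return std::make_unique<ToyValuesExecutor>(dynamic_cast<const ToyValuesPlan *>(plan));
    case ToyPlanType::Limit:
      return std::make_unique<ToyLimitExecutor>(dynamic_cast<const ToyLimitPlan *>(plan));
  }
  throw std::logic_error("unknown plan type");
}

// Driver loop in the style of Execute(): build, Init() once, pull with Next() until false.
std::vector<int> RunToyPlan(const ToyPlan *plan) {
  auto executor = CreateToyExecutor(plan);
  executor->Init();
  std::vector<int> result;
  int value = 0;
  while (executor->Next(&value)) {
    result.push_back(value);
  }
  return result;
}
```

The limit executor never sees the table directly; it only pulls from its child, which is how executors compose into a tree.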

AbstractPlanNode
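This is the parent class of all plan nodes. There is a matching enum class, PlanType, that lists every possible plan node type. AbstractPlanNode has only two member variables. One is output_schema_: when Next returns a tuple (if it needs to return one), the output schema determines which columns of the tuple are emitted, much like the column list of a SELECT. The other is the vector children_, which holds const pointers to all of the node's children.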

/** PlanType represents the types of plans that we have in our system. */
enum class PlanType { SeqScan, IndexScan, Insert, Update, Delete, Aggregation, Limit, NestedLoopJoin, NestedIndexJoin };

2.2 Relationships between the classes

exec_ctx: [bpm, log manager, lock manager, catalog, txn manager]
catalog: [tables, indexes]
tables: [id, table_metadata]
table_metadata: [schema (tables, indexes, foreign keys, and so on), name, table_ (the table heap, a linked list of pages), id]

indexes: [id, index_info]
index_info: [schema, name, index_, id, table_name, key_size]

2.3 SEQUENTIAL SCAN

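The sequential scan plan node inherits from the abstract plan node and adds two private members. One is predicate_, which filters out tuples that do not satisfy the condition; for example, WHERE id < 5 is a predicate. The other is table_oid_, which identifies the table to scan.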

  const AbstractExpression *predicate_;
  /** The table whose tuples should be scanned. */
  table_oid_t table_oid_;

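In the Execute() flow from 2.1, we first construct the executor, then call Init(), and finally call Next() to fetch tuples.
So the table metadata is obtained when the executor is constructed, and the table iterator is obtained in Init(). The table iterator supports a sequential scan of the table and points to the first tuple that has not been visited yet.
In Next(), we first advance the iterator to the first tuple that satisfies the predicate, then output a new tuple.
1) When looking for the first qualifying tuple, note that GetPredicate() returns a pointer to the base class AbstractExpression. Calling the virtual function Evaluate() through that pointer dispatches to the Evaluate() of the concrete subclass and returns a Value. Calling GetAs<bool>() on that Value tells you whether the tuple qualifies. If it does not, move on to the next tuple.
2) When building the new tuple, note that the tuple returned by the table iterator cannot be used directly as the result, because the plan's OutputSchema may be only a projection of that tuple.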
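
The two points above (filter with the predicate, then project to the output schema) can be illustrated with a small stand-alone sketch. ToyRow and ToySeqScan are hypothetical names; a std::function stands in for the AbstractExpression predicate and a list of column indexes stands in for the OutputSchema. This is an assumption-laden illustration, not the BusTub SeqScanExecutor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using ToyRow = std::vector<int>;  // one tuple: a list of column values

class ToySeqScan {
 public:
  // predicate may be empty (no WHERE clause); out_cols lists the table columns to emit.
  ToySeqScan(const std::vector<ToyRow> *table, std::function<bool(const ToyRow &)> predicate,
             std::vector<size_t> out_cols)
      : table_(table), predicate_(std::move(predicate)), out_cols_(std::move(out_cols)) {
    assert(table_ != nullptr);
  }

  // Position the iterator at the first row of the table.
  void Init() { iter_ = table_->begin(); }

  // Skip rows that fail the predicate, then emit a projection of the first match.
  bool Next(ToyRow *out) {
    while (iter_ != table_->end()) {
      const ToyRow &row = *iter_++;
      if (predicate_ && !predicate_(row)) {
        continue;
      }
      out->clear();
      for (size_t col : out_cols_) {
        out->push_back(row.at(col));
      }
      return true;
    }
    return false;
  }

 private:
  const std::vector<ToyRow> *table_;
  std::function<bool(const ToyRow &)> predicate_;
  std::vector<size_t> out_cols_;
  std::vector<ToyRow>::const_iterator iter_;
};
```

Because the iterator is a member, each call to Next resumes where the previous call stopped, which is what lets the parent pull one tuple at a time.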

We can work through the test cases in executor_test one at a time.

2.4 INDEX SCANS

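IndexScan follows the same logic as SeqScan; the difference is that the TableIterator is replaced with an IndexIterator. The table name (table_name_) can be obtained from the index_info.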
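
As a rough illustration of scanning through an index instead of the table, the sketch below uses a std::map as a stand-in for the B+ tree index and a vector position as a stand-in for a RID. ToyRow, ToyRid, and ToyIndexScan are hypothetical names, and predicate and projection handling are left out to keep the focus on the iterator swap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using ToyRow = std::vector<int>;
using ToyRid = size_t;  // stand-in for a RID: the row's slot in the table vector

class ToyIndexScan {
 public:
  ToyIndexScan(const std::vector<ToyRow> *table, const std::map<int, ToyRid> *index)
      : table_(table), index_(index) {
    assert(table_ != nullptr && index_ != nullptr);
  }

  // Start at the smallest key in the index.
  void Init() { iter_ = index_->begin(); }

  // The index yields (key, rid) pairs in key order; each row is fetched through its rid.
  bool Next(ToyRow *out, ToyRid *rid) {
    if (iter_ == index_->end()) {
      return false;
    }
    *rid = iter_->second;
    assert(*rid < table_->size());
    *out = (*table_)[*rid];
    ++iter_;
    return true;
  }

 private:
  const std::vector<ToyRow> *table_;
  const std::map<int, ToyRid> *index_;
  std::map<int, ToyRid>::const_iterator iter_;
};
```

Rows come back in key order rather than in storage order, since the iteration is driven by the index and the table is only used for lookups by RID.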

2.5 INSERT

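Insert adds tuples to a table.
Note that we first have to check which kind of InsertPlanNode this is, to determine whether the tuples to insert come from a child.
Example: INSERT INTO empty_table2
(SELECT colA, colB FROM test_1 WHERE colA > 500)
[the child corresponds to the SELECT]
Each tuple that is obtained also has to be inserted into every index on the table by calling index_->InsertEntry.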

// true if the values to insert are embedded directly in the plan;
// false if a child plan provides the tuples
plan_->IsRawInsert()
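
The raw-versus-child branch and the index maintenance can be sketched as below. This is a simplified model, not the BusTub InsertExecutor: ToyTable, ToyInsertPlan, and RunToyInsert are hypothetical names, the child executor is reduced to a callback that fills in the next row, and a single std::multimap on column 0 stands in for all the indexes on the table.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

using ToyRow = std::vector<int>;

// Toy table with one index on column 0 (key -> slot in rows).
struct ToyTable {
  std::vector<ToyRow> rows;
  std::multimap<int, size_t> index_on_col0;

  // Append the row, then mirror it into the index so both stay in sync.
  size_t Insert(const ToyRow &row) {
    assert(!row.empty());
    rows.push_back(row);
    size_t rid = rows.size() - 1;
    index_on_col0.emplace(row[0], rid);
    return rid;
  }
};

// Hypothetical insert plan: either carries the values itself or relies on a child.
struct ToyInsertPlan {
  bool is_raw;
  std::vector<ToyRow> raw_values;
};

// child_next plays the role of the child executor's Next(); it is unused for raw inserts.
size_t RunToyInsert(const ToyInsertPlan &plan, const std::function<bool(ToyRow *)> &child_next,
                    ToyTable *target) {
  size_t inserted = 0;
  if (plan.is_raw) {
    for (const ToyRow &row : plan.raw_values) {
      target->Insert(row);
      ++inserted;
    }
    return inserted;
  }
  ToyRow row;
  while (child_next(&row)) {
    target->Insert(row);
    ++inserted;
  }
  return inserted;
}
```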

2.6 Update

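Update modifies tuples in the specified table.
1) Get the next tuple to update from child_executor_.
2) Call GenerateUpdatedTuple to produce the new tuple.
3) Call the TableHeap's UpdateTuple to update the tuple in the table, then in every index delete the entry for the old tuple and insert the entry for the new one. Finally, return the tuple's RID.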
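
The three steps can be sketched on a toy table as follows. ToyTable, GenerateUpdatedRow, DeleteIndexEntry, and UpdateRow are hypothetical names for this illustration only; the update is assumed to be "add a delta to one column", and one std::multimap on column 0 stands in for the table's indexes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using ToyRow = std::vector<int>;

struct ToyTable {
  std::vector<ToyRow> rows;
  std::multimap<int, size_t> index_on_col0;  // key -> slot in rows
};

// Hypothetical counterpart of GenerateUpdatedTuple: add delta to column col.
ToyRow GenerateUpdatedRow(const ToyRow &old_row, size_t col, int delta) {
  ToyRow updated = old_row;
  updated.at(col) += delta;
  return updated;
}

// Remove exactly the (key, rid) entry from the index, leaving other rows with the same key.
void DeleteIndexEntry(std::multimap<int, size_t> *index, int key, size_t rid) {
  auto range = index->equal_range(key);
  for (auto it = range.first; it != range.second; ++it) {
    if (it->second == rid) {
      index->erase(it);
      return;
    }
  }
}

// Update the row in place, then replace its index entry; returns the row's rid.
size_t UpdateRow(ToyTable *table, size_t rid, size_t col, int delta) {
  assert(rid < table->rows.size());
  ToyRow old_row = table->rows[rid];
  ToyRow new_row = GenerateUpdatedRow(old_row, col, delta);
  table->rows[rid] = new_row;
  DeleteIndexEntry(&table->index_on_col0, old_row[0], rid);
  table->index_on_col0.emplace(new_row[0], rid);
  return rid;
}
```

The old row is copied before the in-place update because its key is still needed to find the stale index entry.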

2.7 Delete

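Delete removes tuples from the specified table.
1) Get the next tuple to delete from child_executor_.
2) Mark the tuple with the table's MarkDelete, then update the txn write set. (MarkDelete makes the tuple invisible without physically deleting it. The tuple is actually deleted only when the transaction commits. That way, if the transaction aborts before committing, rollback only needs to clear the mark.)
3) Update the indexes.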
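
The mark-then-commit idea from step 2 can be shown with a small model. ToySlot, ToyTxn, and ToyHeap are hypothetical names, visibility is reduced to a single flag check, and index updates are omitted. It is only meant to show why abort is cheap under this scheme, not how BusTub's TableHeap or TransactionManager are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToySlot {
  int value;
  bool marked;   // logically deleted by a transaction that has not committed yet
  bool removed;  // physically deleted after commit
};

struct ToyTxn {
  std::vector<size_t> write_set;  // rids this transaction marked for deletion
};

class ToyHeap {
 public:
  size_t Insert(int value) {
    slots_.push_back(ToySlot{value, false, false});
    return slots_.size() - 1;
  }

  // Hide the tuple and remember it in the write set; nothing is deleted yet.
  bool MarkDelete(size_t rid, ToyTxn *txn) {
    assert(rid < slots_.size());
    ToySlot &slot = slots_[rid];
    if (slot.marked || slot.removed) {
      return false;
    }
    slot.marked = true;
    txn->write_set.push_back(rid);
    return true;
  }

  // Commit: the marked tuples are now really deleted.
  void Commit(ToyTxn *txn) {
    for (size_t rid : txn->write_set) {
      slots_[rid].marked = false;
      slots_[rid].removed = true;
    }
    txn->write_set.clear();
  }

  // Abort: rollback only has to clear the marks.
  void Abort(ToyTxn *txn) {
    for (size_t rid : txn->write_set) {
      slots_[rid].marked = false;
    }
    txn->write_set.clear();
  }

  bool IsVisible(size_t rid) const {
    assert(rid < slots_.size());
    return !slots_[rid].marked && !slots_[rid].removed;
  }

 private:
  std::vector<ToySlot> slots_;
};
```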

2.8 Join

Example:

SELECT test_1.colA, test_1.colB, test_2.col1, test_2.col3
FROM test_1 JOIN test_2 ON test_1.colA = test_2.col1 AND test_1.colA < 50

2.8.1 Nested Loop Join
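A Nested Loop Join iterates over both tables: for each tuple of the outer table, it scans the inner table and checks for tuples that satisfy the join condition.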
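
A minimal sketch of the double loop, assuming rows are plain vectors of ints and the join condition is passed in as a callback (ToyRow and NestedLoopJoin are hypothetical names). For brevity it materializes the whole result at once, whereas an executor in the lab would hand out one joined tuple per Next() call.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using ToyRow = std::vector<int>;

// For every outer row, scan the whole inner table and keep the pairs that satisfy the predicate.
std::vector<ToyRow> NestedLoopJoin(const std::vector<ToyRow> &outer, const std::vector<ToyRow> &inner,
                                   const std::function<bool(const ToyRow &, const ToyRow &)> &predicate) {
  assert(predicate != nullptr);
  std::vector<ToyRow> result;
  for (const ToyRow &left : outer) {
    for (const ToyRow &right : inner) {
      if (predicate(left, right)) {
        // The joined row is the outer columns followed by the inner columns.
        ToyRow joined = left;
        joined.insert(joined.end(), right.begin(), right.end());
        result.push_back(std::move(joined));
      }
    }
  }
  return result;
}
```

With the condition from the example query, the callback would compare column 0 of both rows and also require the outer value to be below 50.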

2.8.2 Index Nested Loop Join
An Index Nested Loop Join uses an index on the inner table to find the tuples that satisfy the JOIN condition.
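
The difference from the plain nested loop is that the inner scan becomes an index probe. The sketch below assumes an equality join on one outer column and uses a std::multimap as a stand-in for the inner table's index; ToyRow and IndexNestedLoopJoin are hypothetical names, and the result is again materialized in one go for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using ToyRow = std::vector<int>;

// inner_index maps a join key to the slot of a matching row in inner.
std::vector<ToyRow> IndexNestedLoopJoin(const std::vector<ToyRow> &outer, size_t outer_key_col,
                                        const std::vector<ToyRow> &inner,
                                        const std::multimap<int, size_t> &inner_index) {
  std::vector<ToyRow> result;
  for (const ToyRow &left : outer) {
    // Probe the index with the outer key instead of scanning the whole inner table.
    auto range = inner_index.equal_range(left.at(outer_key_col));
    for (auto it = range.first; it != range.second; ++it) {
      assert(it->second < inner.size());
      const ToyRow &right = inner[it->second];
      ToyRow joined = left;
      joined.insert(joined.end(), right.begin(), right.end());
      result.push_back(std::move(joined));
    }
  }
  return result;
}
```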

2.9 Aggregation

Example:

SELECT COUNT(colA), SUM(colA), min(colA), max(colA) from test_1;

This can be implemented directly with the SimpleAggregationHashTable that the course already provides.
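
To show what a hash-based aggregation does conceptually, here is a stand-alone sketch that keeps running COUNT, SUM, MIN, and MAX values per group in a std::unordered_map. It does not use or reproduce the SimpleAggregationHashTable API; ToyRow, ToyAggregates, and HashAggregate are hypothetical names. The example query has no GROUP BY, which corresponds to every row landing in the same group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <unordered_map>
#include <vector>

using ToyRow = std::vector<int>;

// Running aggregate values for one group.
struct ToyAggregates {
  long long count{0};
  long long sum{0};
  int min{std::numeric_limits<int>::max()};
  int max{std::numeric_limits<int>::min()};
};

// group_col selects the grouping column; agg_col is the column being aggregated.
std::unordered_map<int, ToyAggregates> HashAggregate(const std::vector<ToyRow> &rows, size_t group_col,
                                                     size_t agg_col) {
  std::unordered_map<int, ToyAggregates> table;
  for (const ToyRow &row : rows) {
    assert(group_col < row.size() && agg_col < row.size());
    // operator[] creates the group with initial values on first use.
    ToyAggregates &agg = table[row[group_col]];
    int v = row[agg_col];
    agg.count += 1;
    agg.sum += v;
    agg.min = std::min(agg.min, v);
    agg.max = std::max(agg.max, v);
  }
  return table;
}
```

All input rows have to be consumed before any group is final, so an aggregation executor typically builds the table in Init() and iterates over it in Next().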