The complete code for this project is in C2j-Compiler

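Preface

In the previous post we successfully built the finite state automaton, but it still has two problems:

It cannot handle shift/reduce conflicts
It has too many state nodes, which makes the automaton large and inefficient

This post solves both problems.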

Shift/reduce conflicts

Consider one node from the example in the previous post:

e -> t .
t -> t . * f

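The automaton reaches this node from state node 0 on input t. Once here, it cannot tell whether to reduce by production 1 (e -> t .) or to shift by production 2 (t -> t . * f). This situation is called a shift/reduce conflict.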

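SLR(1) grammars

The earlier discussion of LL(1) parsing introduced the FOLLOW set: for a given non-terminal, the set of all terminals that can appear immediately after that non-terminal in some derivation of the grammar. We call this the FOLLOW set of the non-terminal.

See the index of earlier posts.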

FOLLOW(s) = {EOI}
FOLLOW(e) = {EOI, ), +}
FOLLOW(t) = {EOI, ), +, *}
FOLLOW(f) = {EOI, ), +, *}

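In other words, if the current input symbol belongs to FOLLOW(e), we can reduce by the first production (e -> t .).

If every node with a shift/reduce conflict in the constructed automaton can be resolved by this rule, the grammar is called an SLR(1) grammar.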
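The rule above can be sketched as a small lookup. This is only an illustration, not code from C2j-Compiler: the class SlrDecision, the decide method and the string-based symbols are made up for the example, and the FOLLOW sets are the ones listed above (with ")" standing for the closing parenthesis terminal).

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: resolve the conflict in the node {e -> t . , t -> t . * f}
// the way an SLR(1) parser would, by consulting FOLLOW sets.
class SlrDecision {
    // FOLLOW sets taken from the text; only the two needed here.
    static final Map<String, Set<String>> FOLLOW = Map.of(
        "e", Set.of("EOI", ")", "+"),
        "t", Set.of("EOI", ")", "+", "*"));

    /**
     * @param reduceLhs   left-hand side of the completed item (dot at the end)
     * @param shiftSymbol symbol right after the dot in the other item
     * @param input       current input token
     */
    static String decide(String reduceLhs, String shiftSymbol, String input) {
        boolean canShift = shiftSymbol.equals(input);
        boolean canReduce = FOLLOW.get(reduceLhs).contains(input);
        if (canShift && canReduce) return "conflict"; // FOLLOW sets are not enough
        if (canShift) return "shift";
        if (canReduce) return "reduce";
        return "error";
    }

    public static void main(String[] args) {
        System.out.println(decide("e", "*", "*")); // prints "shift"
        System.out.println(decide("e", "*", "+")); // prints "reduce"
    }
}
```

For this node the two choices never overlap, because * is not in FOLLOW(e). The "conflict" branch is the case the next section deals with.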

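LR(1) grammars

However, if the current input symbol both belongs to the FOLLOW set of the first production's left-hand side and is the symbol to the right of the . in the second production, the shift/reduce conflict cannot be resolved this way.

When deciding from an input symbol whether a reduce is allowed, we only need to check whether, after the reduce, the current input symbol can legally follow the non-terminal just produced. That is, for every path along which the reduce returns to an earlier node, we collect the terminals that can follow the non-terminal at that node.

The set of symbols that can legally follow the non-terminal in this particular context is called the lookahead set. It is a subset of the FOLLOW set.

Before giving the algorithm for the lookahead set, two concepts need to be made clear: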

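First Set

For a given non-terminal, the set of all terminals that can appear at the leftmost position after a series of derivations is called the FIRST set of that non-terminal.

nullable

A non-terminal that can derive the empty string is called a nullable non-terminal.

The nullable flag was already assigned during initialization in SyntaxProductionInit.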
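The project sets the nullable flag during initialization, and that code is not shown here. As a rough sketch of how such a flag could be computed, the following runs a fixed-point loop over a grammar given as plain strings. NullableSketch and the toy grammar are assumptions made for this example only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: compute the nullable non-terminals by iterating to a fixed point.
class NullableSketch {
    /** grammar maps each non-terminal to its right-hand sides; an empty list is an epsilon production. */
    static Set<String> nullable(Map<String, List<List<String>>> grammar) {
        Set<String> result = new TreeSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> entry : grammar.entrySet()) {
                if (result.contains(entry.getKey())) {
                    continue;
                }
                for (List<String> rhs : entry.getValue()) {
                    // True for an empty right-hand side, or when every symbol is already
                    // known to be nullable. Terminals are never in result, so they block it.
                    if (result.containsAll(rhs)) {
                        result.add(entry.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> grammar = Map.of(
            "a", List.of(List.of("b", "c"), List.of("x")),
            "b", List.of(List.<String>of()),               // b -> epsilon
            "c", List.of(List.of("b"), List.of("y")));
        System.out.println(nullable(grammar));
    }
}
```

In the toy grammar, b is nullable directly, c becomes nullable through c -> b, and a through a -> b c.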

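Building the First Set

As described above, resolving shift/reduce conflicts requires a lookahead set, and building the lookahead set in turn requires the First Set.

First Set construction algorithm

If A is a terminal, then FIRST(A) = {A}.
For a production of the form
s -> A a
where s is a non-terminal, A is a terminal, and a is a sequence of zero or more terminals or non-terminals, A belongs to FIRST(s).
For a production of the form
s -> b a
where s and b are non-terminals and b is not nullable, every element of FIRST(b) belongs to FIRST(s).
For a production of the form
s -> a1 a2 … an b
where a1, a2 … an are nullable non-terminals and b is either a non-nullable non-terminal or a terminal,
FIRST(s) contains the union of FIRST(a1) … FIRST(an) and FIRST(b).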
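The four rules can be tried out in isolation with a small string-based version. This is a sketch under assumptions, not the project's FirstSetBuilder: the class FirstSetSketch, the map-based grammar representation and the expression grammar in main are chosen for illustration, and the nullable set is passed in rather than computed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: FIRST sets by repeated passes until nothing changes.
class FirstSetSketch {
    /** Any symbol that is not a key of grammar is treated as a terminal. */
    static Map<String, Set<String>> firstSets(Map<String, List<List<String>>> grammar,
                                              Set<String> nullable) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nonTerminal : grammar.keySet()) {
            first.put(nonTerminal, new TreeSet<>());
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> entry : grammar.entrySet()) {
                Set<String> target = first.get(entry.getKey());
                for (List<String> rhs : entry.getValue()) {
                    for (String symbol : rhs) {
                        if (!grammar.containsKey(symbol)) {
                            // Rule 2: a leading terminal goes straight into FIRST(s).
                            changed |= target.add(symbol);
                            break;
                        }
                        // Rules 3 and 4: take FIRST of the non-terminal (copied, in case
                        // symbol is the left-hand side itself) ...
                        changed |= target.addAll(new ArrayList<>(first.get(symbol)));
                        // ... and continue to the next symbol only if this one is nullable.
                        if (!nullable.contains(symbol)) {
                            break;
                        }
                    }
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> grammar = Map.of(
            "e", List.of(List.of("e", "+", "t"), List.of("t")),
            "t", List.of(List.of("t", "*", "f"), List.of("f")),
            "f", List.of(List.of("(", "e", ")"), List.of("NUM")));
        System.out.println(firstSets(grammar, Set.<String>of()).get("e")); // prints "[(, NUM]"
    }
}
```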

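The FirstSetBuilder class

The First Set construction is implemented in the FirstSetBuilder class.

The code simply implements the rules above.

This is where symbolMap and symbolArray, the two data structures filled in during SyntaxProductionInit initialization, finally come into use.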

public void buildFirstSets() {
    // Repeat full passes until a pass adds nothing new to any first set.
    while (runFirstSetPass) {
        runFirstSetPass = false;
        Iterator<Symbols> it = symbolArray.iterator();
        while (it.hasNext()) {
            Symbols symbol = it.next();
            addSymbolFirstSet(symbol);
        }
    }
    ConsoleDebugColor.outlnPurple("First sets :");
    debugPrintAllFirstSet();
    ConsoleDebugColor.outlnPurple("First sets end");
}

private void addSymbolFirstSet(Symbols symbol) {
    // Rule 1: the first set of a terminal is the terminal itself.
    if (Token.isTerminal(symbol.value)) {
        if (!symbol.firstSet.contains(symbol.value)) {
            symbol.firstSet.add(symbol.value);
        }
        return ;
    }
    ArrayList<int[]> productions = symbol.productions;
    for (int[] rightSize : productions) {
        if (rightSize.length == 0) {
            continue;
        }
        if (Token.isTerminal(rightSize[0]) && !symbol.firstSet.contains(rightSize[0])) {
            // Rule 2: the right-hand side starts with a terminal.
            runFirstSetPass = true;
            symbol.firstSet.add(rightSize[0]);
        } else if (!Token.isTerminal(rightSize[0])) {
            // Rules 3 and 4: merge the first sets of the leading symbols,
            // moving on to the next symbol only while the current one is nullable.
            int pos = 0;
            Symbols curSymbol;
            do {
                curSymbol = symbolMap.get(rightSize[pos]);
                if (!symbol.firstSet.containsAll(curSymbol.firstSet)) {
                    runFirstSetPass = true;
                    for (int j = 0; j < curSymbol.firstSet.size(); j++) {
                        if (!symbol.firstSet.contains(curSymbol.firstSet.get(j))) {
                            symbol.firstSet.add(curSymbol.firstSet.get(j));
                        }
                    }
                }
                pos++;
            } while (pos < rightSize.length && curSymbol.isNullable);
        }
    }
}

The LookAhead Set algorithm

[S -> a .r B, C]
r -> r1

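r is a non-terminal; a and B are sequences of zero or more terminals or non-terminals.

When the automaton is in the node containing r -> r1 and performs a reduce, it returns to the node containing the item [S -> a .r B, C]. For the reduce to be correct, the current input symbol must therefore belong to FIRST(B).

So the lookahead set of production 2 (r -> r1) is FIRST(B). If B is empty, the lookahead set of production 2 is C. If B is nullable, the lookahead set of production 2 is FIRST(B) ∪ C.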
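The three cases (B non-nullable, B empty, B nullable) collapse into one loop. The sketch below shows that loop on plain strings. It is an illustration under assumptions rather than the project's code: LookAheadSketch, its parameters and the sample FIRST sets are hypothetical, and a symbol without an entry in the first map is treated as a terminal.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: lookahead for the items added when closing [S -> a . r B, C].
class LookAheadSketch {
    /**
     * @param beta     the symbols B that follow r in the enclosing item
     * @param c        the lookahead set C of the enclosing item
     * @param first    FIRST sets of the non-terminals
     * @param nullable the nullable non-terminals
     */
    static Set<String> lookAhead(List<String> beta, Set<String> c,
                                 Map<String, Set<String>> first, Set<String> nullable) {
        Set<String> result = new TreeSet<>();
        for (String symbol : beta) {
            // A terminal's FIRST set is just the terminal itself.
            result.addAll(first.getOrDefault(symbol, Set.of(symbol)));
            if (!nullable.contains(symbol)) {
                return result; // B cannot vanish, so C plays no part
            }
        }
        // B is empty or made up of nullable symbols only: C can follow r as well.
        result.addAll(c);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> first = Map.of("t", Set.of("(", "NUM"), "b", Set.of("y"));
        Set<String> nullable = Set.of("b");
        System.out.println(lookAhead(List.of("+", "t"), Set.of("EOI"), first, nullable)); // prints "[+]"
        System.out.println(lookAhead(List.of("b"), Set.of("EOI"), first, nullable));      // prints "[EOI, y]"
    }
}
```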

computeFirstSetOfBetaAndc

The lookahead set is computed by a method on each production:

public ArrayList<Integer> computeFirstSetOfBetaAndc() {
    // beta: every symbol after the one that directly follows the dot
    ArrayList<Integer> set = new ArrayList<>();
    for (int i = dotPos + 1; i < right.size(); i++) {
        set.add(right.get(i));
    }
    ProductionManager manager = ProductionManager.getInstance();
    ArrayList<Integer> firstSet = new ArrayList<>();
    if (set.size() > 0) {
        for (int i = 0; i < set.size(); i++) {
            ArrayList<Integer> symbolFirstSet = manager.getFirstSetBuilder().getFirstSet(set.get(i));
            for (int s : symbolFirstSet) {
                if (!firstSet.contains(s)) {
                    firstSet.add(s);
                }
            }
            if (!manager.getFirstSetBuilder().isSymbolNullable(set.get(i))) {
                break;
            }
            if (i == set.size() - 1) {
                // every symbol of beta is nullable, so C belongs to the result too
                firstSet.addAll(this.lookAhead);
            }
        }
    } else {
        // beta is empty: the lookahead set is C itself
        firstSet.addAll(this.lookAhead);
    }
    return firstSet;
}

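Now that the lookahead set can be computed, every production in a node must carry its lookahead set when the closure is computed, so that the parse table can be built later.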

private void makeClosure() {
    ConsoleDebugColor.outlnPurple("==== state begin make closure sets ====");
    Stack<Production> productionStack = new Stack<>();
    for (Production production : productions) {
        productionStack.push(production);
    }
    while (!productionStack.isEmpty()) {
        Production production = productionStack.pop();
        ConsoleDebugColor.outlnPurple("production on top of stack is : ");
        production.debugPrint();
        production.debugPrintBeta();
        if (Token.isTerminal(production.getDotSymbol())) {
            ConsoleDebugColor.outlnPurple("Symbol after dot is not non-terminal, ignore and process next item");
            continue;
        }
        int symbol = production.getDotSymbol();
        ArrayList<Production> closures = productionManager.getProduction(symbol);
        ArrayList<Integer> lookAhead = production.computeFirstSetOfBetaAndc();
        Iterator<Production> it = closures.iterator();
        while (it.hasNext()) {
            Production oldProduct = it.next();
            Production newProduct = oldProduct.cloneSelf();
            newProduct.addLookAheadSet(lookAhead);
            if (!closureSet.contains(newProduct)) {
                closureSet.add(newProduct);
                productionStack.push(newProduct);
                removeRedundantProduction(newProduct);
            } else {
                ConsoleDebugColor.outlnPurple("the production is already exist!");
            }
        }
    }
    debugPrintClosure();
    ConsoleDebugColor.outlnPurple("==== make closure sets end ====");
}

removeRedundantProduction removes redundant productions. For example:

1. [t -> . t * f, {* EOI}]
2. [t -> . t * f, {EOI}]

Here production 1 covers production 2, so production 2 can be dropped.
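The isCover check itself is not shown in this post. One plausible reading, given the example above, is "same production and dot position, with a strictly larger lookahead set". The sketch below encodes that reading. ItemCoverSketch, Item, covers and removeCovered are hypothetical names for this example and may differ from what the project actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a "cover" test between two LR(1) items.
class ItemCoverSketch {
    static final class Item {
        final String lhs;
        final List<String> rhs;
        final int dot;
        final Set<String> lookAhead;

        Item(String lhs, List<String> rhs, int dot, Set<String> lookAhead) {
            this.lhs = lhs;
            this.rhs = rhs;
            this.dot = dot;
            this.lookAhead = lookAhead;
        }

        /** Same production and dot position, ignoring the lookahead set. */
        boolean sameCore(Item other) {
            return lhs.equals(other.lhs) && rhs.equals(other.rhs) && dot == other.dot;
        }

        /** Strict superset of lookaheads, so an item never covers itself. */
        boolean covers(Item other) {
            return sameCore(other)
                && lookAhead.containsAll(other.lookAhead)
                && lookAhead.size() > other.lookAhead.size();
        }
    }

    /** Drops every item in closure that the newly added item covers. */
    static void removeCovered(List<Item> closure, Item added) {
        closure.removeIf(added::covers);
    }

    public static void main(String[] args) {
        Item wide = new Item("t", List.of("t", "*", "f"), 0, Set.of("*", "EOI"));
        Item narrow = new Item("t", List.of("t", "*", "f"), 0, Set.of("EOI"));
        List<Item> closure = new ArrayList<>(List.of(narrow, wide));
        removeCovered(closure, wide);
        System.out.println(closure.size()); // prints "1"
    }
}
```

The strictness matters in a loop like removeRedundantProduction: the new item is already in the closure set when the scan runs, so a non-strict test would remove the item that was just added.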

private void removeRedundantProduction(Production product) {
    boolean removeHappended = true;
    while (removeHappended) {
        removeHappended = false;
        Iterator it = closureSet.iterator();
        while (it.hasNext()) {
            Production item = (Production) it.next();
            if (product.isCover(item)) {
                removeHappended = true;
                closureSet.remove(item);
                break;
            }
        }
    }
}

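Compressing the finite state automaton

At this point the lookahead sets are computed and the parse table can be built correctly. One problem remains: the automaton has too many nodes, which hurts efficiency, so the next task is to compress it.

With the LR(1) automaton built so far, the productions for C yield more than 600 state nodes. The reason is that, when state nodes are constructed, two productions that are identical except for their lookahead sets are treated as different productions.

Here is the overridden equals for state nodes: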

@Override
public boolean equals(Object obj) {
    return checkProductionEqual(obj, false);
}

public boolean checkProductionEqual(Object obj, boolean isPartial) {
    ProductionsStateNode node = (ProductionsStateNode) obj;
    if (node.productions.size() != this.productions.size()) {
        return false;
    }
    int equalCount = 0;
    for (int i = 0; i < node.productions.size(); i++) {
        for (int j = 0; j < this.productions.size(); j++) {
            if (!isPartial) {
                // full comparison through Production.equals
                if (node.productions.get(i).equals(this.productions.get(j))) {
                    equalCount++;
                    break;
                }
            } else {
                // partial comparison through Production.productionEquals
                if (node.productions.get(i).productionEquals(this.productions.get(j))) {
                    equalCount++;
                    break;
                }
            }
        }
    }
    return equalCount == node.productions.size();
}

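Nodes whose productions are the same but whose lookahead sets differ can therefore be merged, which reduces the number of nodes.

The natural place to merge similar nodes is where nodes and the transitions between them are added: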

public void addTransition(ProductionsStateNode from, ProductionsStateNode to, int on) {
    /* Compress the finite state machine nodes */
    if (isTransitionTableCompressed) {
        from = getAndMergeSimilarStates(from);
        to = getAndMergeSimilarStates(to);
    }
    HashMap<Integer, ProductionsStateNode> map = transitionMap.get(from);
    if (map == null) {
        map = new HashMap<>();
    }
    map.put(on, to);
    transitionMap.put(from, map);
}

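The logic of getAndMergeSimilarStates is simple: walk through all current nodes, find a similar one, and merge the node with the larger number into the one with the smaller number.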

private ProductionsStateNode getAndMergeSimilarStates(ProductionsStateNode node) {
    Iterator<ProductionsStateNode> it = stateList.iterator();
    ProductionsStateNode currentNode = null, returnNode = node;
    while (it.hasNext()) {
        currentNode = it.next();
        if (!currentNode.equals(node) && currentNode.checkProductionEqual(node, true)) {
            if (currentNode.stateNum < node.stateNum) {
                currentNode.stateMerge(node);
                returnNode = currentNode;
            } else {
                node.stateMerge(currentNode);
                returnNode = node;
            }
            break;
        }
    }
    if (!compressedStateList.contains(returnNode)) {
        compressedStateList.add(returnNode);
    }
    return returnNode;
}

public void stateMerge(ProductionsStateNode node) {
    if (!this.productions.contains(node.productions)) {
        for (int i = 0; i < node.productions.size(); i++) {
            if (!this.productions.contains(node.productions.get(i))
                    && !mergedProduction.contains(node.productions.get(i))) {
                mergedProduction.add(node.productions.get(i));
            }
        }
    }
}
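To see the effect of merging on a small scale, here is a self-contained sketch that compresses a list of states by their cores. It is not the project's implementation. The project records the extra items in mergedProduction, while this sketch takes the textbook route of unioning the lookahead sets of items that share a core. StateMergeSketch, the "core string to lookahead set" representation and the sample states are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: merge states that differ only in their lookahead sets.
class StateMergeSketch {
    /** A state maps each item core, e.g. "e -> t .", to that item's lookahead set. */
    static boolean sameCore(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        return a.keySet().equals(b.keySet());
    }

    /** Unions the lookahead sets item by item; both states must have the same cores. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (String core : a.keySet()) {
            Set<String> lookAhead = new TreeSet<>(a.get(core));
            lookAhead.addAll(b.get(core));
            merged.put(core, lookAhead);
        }
        return merged;
    }

    /** Later states are folded into the earliest state with the same cores. */
    static List<Map<String, Set<String>>> compress(List<Map<String, Set<String>>> states) {
        List<Map<String, Set<String>>> result = new ArrayList<>();
        for (Map<String, Set<String>> state : states) {
            boolean mergedIn = false;
            for (int i = 0; i < result.size(); i++) {
                if (sameCore(result.get(i), state)) {
                    result.set(i, merge(result.get(i), state));
                    mergedIn = true;
                    break;
                }
            }
            if (!mergedIn) {
                result.add(state);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> top = Map.of(
            "e -> t .", Set.of("EOI", "+"), "t -> t . * f", Set.of("EOI", "+", "*"));
        Map<String, Set<String>> num = Map.of("f -> NUM .", Set.of("EOI"));
        Map<String, Set<String>> nested = Map.of(
            "e -> t .", Set.of(")", "+"), "t -> t . * f", Set.of(")", "+", "*"));
        System.out.println(compress(List.of(top, num, nested)).size()); // prints "2"
    }
}
```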

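Summary

This post probably contains more code than any of the five so far, but the main points are:

Resolving shift/reduce conflicts

The key is to build lookahead sets, that is, to determine whether the current input symbol can legally follow the non-terminal produced by the reduce
Compressing the finite state automaton
Nodes whose productions are the same but whose lookahead sets differ are merged

The next post is shorter: it builds the actual parse table and the table-driven parser, which completes the parsing stage.

My other blog, hosted on GitHub: https://dejavudwh.cn/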

Reposted from: https://www.cnblogs.com/secoding/p/11369177.html
