## 编程语言开发札记

### 2013/12/24 平安夜

1. 虚拟机规范 第四章，及第六章。 （其实虚拟机类似真实机器，所以目标首先是像 运用汇编一样运用字节码指令，实现简单的输出功能）
2. 程序语法定义 （使用ANTLR 生成词法和语法解析器，语法以精简为主，现阶段只实现 if, for, 和变量）
3. 抽象语法树生成
4. 解析语法树，生成字节码 （可以直接生成到内存， 也可以生成到 class 文件）

1. 先了解编译语言涉及到的知识
2. 看看如何生成字节码
3. ANTLR4 语法熟悉

### 2013/12/25 圣诞

local a,t,i     1: LOADNIL 0 2 0
a=a+i           2: ADD 0 0 2
a=a+1           3: ADD 0 0 250 ; 1
a=t[i]          4: GETTABLE 0 1 2


local a,t,i     1: PUSHNIL 3
a=a+i           2: GETLOCAL 0 ; a
3: GETLOCAL 2 ; i
5: SETLOCAL 0 ; a
a=a+1           6: GETLOCAL 0 ; a
8: SETLOCAL 0 ; a
a=t[i]          9: GETLOCAL 1 ; t
10: GETINDEXED 2 ; i
11: SETLOCAL 0 ; a


a language is just a set of valid sentences

### 2013/12/27

《Language Implementation Patterns》确实适合新手看，概念结合代码，容易消化。

LL(n)

• 第一个L 的意思是 ： read the input from left to right
• 第二个L 的意思是 : descend into parse tree children from left to right
• (n) 的意思是: 表示 lookahead token 的数量， 就是往前多看n个token

#### 一些语法和代码之间的关系

1. (alt1 | alt2 | … | altn)

同时，这种可选结构对应的程序就是 if-else, 如果是LL(1)还可以使用switch-case

if ( ?lookahead-predicts-alt1 ? ) { ?match-alt1 ? }
else if ( ?lookahead-predicts-alt2 ? ) { ?match-alt2 ? }
...
else if ( ?lookahead-predicts-altN ? ) { ?match-altN ? }
else ?throw-exception ?

2. (T)?

程序结构也就对应为

if ( <lookahead-is-T ) { match(T);}

3. (T)+

这种情况下使用 do-while 循环

do {
<code-matching-alternatives>

4. (T)*

零个或多个，果断 while 循环

while ( lookahead-predicts-an-alt-of-subrule> ) {
<code-matching-alternatives>
}


### 2013/12/28

expression 通常会产生一个值。可以包含变量，操作符，以及方法调用。

Mathematics a collection of symbols that jointly express a quantity : the expression for the circumference of a circle is 2πr.

statement

In computer programming a statement can be thought of as the smallest standalone element of an imperative programming language. A program is formed by a sequence of one or more statements. A statement will have internal components (e.g., expressions).

statement 通常包含了 expression

expression -> An expression is a construct made up of variables, operators, and method invocations, which are constructed according to the syntax of the language, that evaluates to a single value

statements -> Statements are roughly equivalent to sentences in natural languages. A statement forms a complete unit of execution. The following types of expressions can be made into a statement by terminating the expression with a semicolon (;). 有些表达式本身可以作为语句(statement), 这些表达式称之为 expression statements. 下面几类就是

• Assignment expressions
• Any use of ++ or —
• Method invocations
• Object creation expressions

// assignment statement
aValue = 8933.234;
// increment statement
aValue++;
// method invocation statement
System.out.println("Hello World!");
// object creation statement
Bicycle myBike = new Bicycle();


// declaration statement
double aValue = 8933.234;


block -> A block is a group of zero or more statements between balanced braces and can be used anywhere a single statement is allowed.

statements:
| ---  expression statements
| ---  declaration statements
| ---  control flow statements
| ---  if-else statements
| ---  switch statements
| ---  while statements
| ---  do-while statements
| ---  for statements
| ---  branching statements
| --- break statements
| --- continue statements
| --- return statements


LL(1) Recursive-Descent Parser

This pattern analyzes the syntactic structure of the token sequence of a
phrase using a single lookahead token

expr: ID '++'
| ID '--'


expr: ID ('++'|'--')


### 2013/12/30

Calc Parser 做好了，之前思路不对，总是纠结于如何生成Statement，不过现在不纠结了，直接按照文法的样子进行匹配就可以了。从这样看的话确实编译前端的内容真是没什么，看来大头在后端。

LL(k) Parser 顾名思义，通过分析后面的 k 个Token来进行语法分析。

The strength of a recursive-descent parser depends entirely on the

LL(k) Parser需要在 LL(1) Parser 的基础上对 lookahead token 做调整。将原来的token改造为数组，数组长度为k。consume() 的时候使用 % 操作保证下标不超过 k。

if(LA(1) == TokenType.xxx && LA(2) == TokenType.yyy) {
...
}


### 2014/1/1

• Backtracking Parser.

we use backtracking parsers only when we need to distinguish between similar language constructs.

• Memorizing Parser
• Predicated Parser

The parsers we’re working with in this book recognize context-free languages. A context-free language is a language whose constructs don’t
depend on the presence of other constructs.

#### Backtraking Parser

Syntactic predicates and speculative parsing are extremely useful when
parsing phrases that look the same from the left edge

Backtraking Parser: is to speculatively attempt the alternatives in order until we
find one that matches

Upon success, the parser rewinds the input and
parses the alternative normally (we’ll see why we parse it twice when
we discuss actions). Upon failing to match an alternative, the parser
rewinds the input and tries the next one. If the parser can’t find any
matching alternative, it throws a “no viable alternative” exception

#### Memoizing Parser

This pattern records partial parsing results during backtracking to
guarantee linear parsing performance, at the cost of a small amount of
memory.

#### Predicated Parser

This pattern augments any top-down parser with arbitrary boolean expressions that help make parsing decisions.
These boolean expressions are called semantic predicates and specify
the semantic applicability of an alternative. Predicates that evaluate to
false effectively “turn off” a parser decision path. From a grammar point
of view, false predicates make alternatives invisible.

### 2014/1/2

• Parse Tree.

Parse trees record how a parser recognizes an input sentence. The interior nodes are rule names, and the leaves are tokens. Although parse trees are less suitable than ASTs for most language applications, parsers can create them automatically.

• Homogeneous AST.

all the nodes have the same type, we say that they are homogeneous. With a single node type, there can be no specialty fields to reference child subtrees. Nodes track children with lists of child pointers.

• Normalized Heterogeneous AST.

Trees with a multitude of node types are called heterogeneous trees. Normalized heterogeneous trees use a normalized list of children like homogeneous trees.

• Irregular Heterogeneous AST.

When we refer to an AST as heterogeneous, we also assume that the nodes have irregular children. Instead of a normalized child list, the nodes have named fields, one per child.

#### 编译中的树

? Dense: No unnecessary nodes
? Convenient: Easy to walk
? Meaningful: Emphasize operators, operands, and the relationship
between them rather than artifacts from the grammar

The first two points imply that it should be easy and fast to identify patterns in the tree. The last point implies that the tree structure should be insensitive to changes in the grammar (at least those unrelated to language syntax)

#### AST 如何实现操作符之间的优先级

To encode “x happens before y,” we make x lower than y in the tree.

Operators with higher precedence appear lower in the AST. For example, in expression 3+4*5, the operator has higher precedence than +. The subtree is a subtree of the + operator node. If we alter the precedence of the operators in the expression with parentheses, (3+4)*5, the + subtree becomes a child of the * node

#### Create AST for nonexecutable language statements

Representing Pseudo-operations in ASTs

Enforcing Tree Structure with the Type System

### 2014/1/4

ANTLR4 年初的时候研究过一段时间，但是后来被别的事冲掉了。现在只能重新学一遍，recall 一下，果然学东西应该一次性学完，不然成本相当高( ▼-▼ )

IR

• Parser Trees

Pros: Parser generators can automatically build these for us.
Cons: Parse trees are full of noise (unnecessary nodes). They are sensitive to changes in  the grammar unrelated to syntax.

• AST Trees

• Homogeneous AST

Pros: Homogeneous trees are very simple.
Cons: It’s cumbersome to annotate AST nodes because the single node type has the union of all needed fields. There is no way to add methods specific to a particular kind of node.


简而言之，Homogeneous AST 的结点将所有的类型混合到一起，这意味着所有的实质上不同的节点拥有相同的结构模型，所以也就无法对特别的节点施加独特的操作。

• Normalized Heterogeneous AST

Pros: It’s easy to add operator or operand-specific data and methods.
Cons: Large grammars like Java’s need about 200 class definitions to be fully heterogeneous.


Normalized Heterogeneous AST 相当于添加了类型层次，对于不同的类型都为其建立对应的class，这样的话如果语法复杂，那么class将会急剧增长。

• Irregular Heterogeneous AST

Pros: It’s easy to add operator- or operand-specific data and methods. Building tree-walking methods for a small set of heterogeneous nodes is quick and easy.
Cons: As with Normalized Heterogeneous AST, there are lots of AST classes to read and write. Having irregular children makes building external visitors difficult.


#### Parse Tree

Parse trees record the sequence of rules a parser applies as well as the tokens it matches. Parse trees describe sentence structure by grouping input symbols into subtrees.

But parse trees are extremely inconvenient to walk and transform. Parse trees are full of noise because of all the interior rule nodes.

Nonetheless, parse trees are still very useful as intermediate representations for some tools and applications. For example, development environments use parse trees to good effect for syntax highlighting and error-checking. 尽管前面说了这么多Parser 不好的地方，但是Parse Tree在 语法高亮和 错误检查等方面上市可以物尽其用的。因为我们可以对语句中的一部分内容进行操作。

#### AST

The key idea behind an AST is the operator-operand tree structure, not the node data type. (AST 关注操作符和操作数) An AST contains the essence of the input token stream and the relationships between operator and operand tokens. We don’t need to use the type system of our implementation language to distinguish between nodes. Nodes in any AST derive from tokens,
so we can use the token type to identify nodes.

#### Homogeneous AST

A homogeneous tree implements an abstract syntax tree (AST) using a
single node data type and a normalized child list representation

public class AST {

private Token token;
private List<AST> children;

public AST() {

}

public AST(Token token) {
this.token = token;
}

if (children == null) {
children = new ArrayList<AST>();
}
}

public boolean isNil() {
}

public String toString() {
}

public String toStringTree() {
if (children == null || children.size() == 0) {
return this.toString();
}
StringBuilder treeToString = new StringBuilder();
if (!isNil()) {
treeToString.append("( ");
treeToString.append(toString());
treeToString.append(" ");
}

for (int i = 0; i < children.size(); i++) {
AST subTree = children.get(i);
if (i > 0) {
treeToString.append(" ");
}
treeToString.append(subTree.toStringTree());
}

if (!isNil()) {

treeToString.append(" )");
}
return treeToString.toString();
}

public static void main(String[] args) {
Token one = new Token("1", TokenType.NUM);
Token two = new Token("2", TokenType.NUM);

System.out.println("1 + 2: " + root.toStringTree());

AST list= new AST();

System.out.println("list: " + list.toStringTree());

}
}


#### Normalized Heterogeneous AST

This pattern implements an abstract syntax tree (AST) using more than a single node data type but with a normalized child list representation.

This pattern makes the most sense when we need to store node-specific
data and plan on using External Tree Visitor. The normalized child list makes it much easier to build external visitors.

#### Irregular Heterogeneous AST

This pattern implements an abstract syntax tree (AST) using more than a single node data type and with an irregular child list representation.

In the implementation of its child pointers. Instead of a uniform list of children, each node data type has specific (named) child fields. In this sense, the child pointers are irregular. In some cases, named fields lead to more readable code. For example, methods can refer to left and right instead of, say, children[0] and children[1].

class AddNode extends ExprNode {
ExprNode left, right;
.....
}


class AddNode extends ExprNode {
List<AST> children;
...
}


### 2014/1/7

#### Tree walking

• whether we have the source codefor our tree nodes,
• whether the trees have normalized children,
• whether the trees are homogeneous or heterogeneous,
• whether we need to rewrite trees while walking, and
• even in which order we need to walk the nodes.

• Embedded Heterogeneous Tree Walker.

Heterogeneous AST node classes define walking methods that execute appropriate actions and walk any children. Tree walking code is distributed across potentially hundreds of class files.

也就是将遍历树中节点时处理逻辑写在各节点里

• External Tree Visitor.

This pattern encapsulates tree walking code (for both homogeneous and heterogeneous ASTs) into a single class definition. It allows us to alter tree-walking behavior without altering AST node definitions. Both WALKING TREES AND VISITATION ORDER 117 the visitor and embedded walker pattern are straightforward but tedious to implement manually.

使用外部visitor 遍历Tree

• Tree Grammar.

A tree grammar describes the structure of valid ASTs. Just as we can automate parser construction from grammars, we can automate tree visitor construction from tree grammars. Tree grammars work on homogeneous or heterogeneous trees because they rely on node token types rather than node types (like AddNode). Tree grammars explicitly dictate node visitation order like the embedded walker and visitor patterns.

• Tree Pattern Matcher.

Instead of specifying an entire tree grammar, this pattern lets us focus on just those subtrees we care about. That’s useful because different phases of an application care about different parts of the tree. A tree pattern matcher also decouples the order in which we apply tree patterns from the tree patterns themselves.

用了Pattern就表示我们可能只关注全集中的一部分情况, 当然如果Pattern足够宽泛的话是能涵盖全集的

#### Walking Trees and Visitation Order

Visit

visiting a tree, we mean executing some actions on the nodes of a tree. The order in which we traverse the nodes is important because that affects the order in which we execute the actions.

• 前序遍历 (+ 1 2)，先访问父节点再访问子节点
• 中序遍历 (1 + 2)，先访问子节点然后父节点，最后是剩下右子节点
• 后续遍历 (1 2 +)，先遍历完子节点，然后才是访问父节点

Walk

Visiting a node means to execute an action somewhere between discovering and finishing that node. But, that same discovery sequence can generate three different tree traversals (node visitation sequences). It all depends on where we put actions in walk( ).

Difference between Tree Walking and Visiting

// Page 120

Embedded Walker VS External Tree Visitor

Embedded Walker 需要在节点中添加动作，这样可也能会很明了的知道在遍历这个节点的时候做了些什么事，但是如果想基于树提供不同的一套动作，那是不是就得在源码里把所有节点都改一遍了？ 长于观察，短于扩展

External Tree Visitor 将对树的遍历操作放到专门的visitor中，这样方便对同一个树提供不同的访问算法，但是正好和 Embedded 相反，访问逻辑放在节点外一定程度上增加了模糊性，多绕几次就晕了。 长于扩展, 短于观察

#### Walking Trees and Rewriting Trees 使用的Patterns

Waling Tree 和 Rewriting Tree 的目的都是对希望对树上的节点进行操作，所以也就有对所有节点的操作，也就有对部分节点的操作，因此需要Grammer或者Pattern来指定需要操作的节点的类型

• Embedded Heterogeneous Tree Walker

Embedding walking methods in tree nodes is the simplest mechanism for building a tree walker

• External Tree Visitor

Visitors are useful for collecting information or doing some simple interpretation, such as expression evaluation. They are not well suited to tree pattern matching applications. Visitors simply walk the tree node by node.

Visitor 只能遍历树中节点，但是并不能完成 tree的 pattern match

• Tree Grammar

Tree grammars specify the structure of entire trees. They are most effective when we need to execute actions in many or most rules. For example, tree-walking code generators typically need to emit code for every subtree.

• Tree Pattern Matcher

To operate only on a specific subset of an AST structures, we can use a tree pattern matcher.

#### External Tree Visitor

Visitors combine tree walking and action execution code outside the AST node definitions.we can change the functionality of the tree walker without having to change the AST class definitions and can even switch visitors on the fly. An external visitor can walk either heterogeneous or homogeneous AST nodes.

1. 根据节点的类型来构建，这种是比较传统的Visitor Pattern
2. 根据Token的类型来构建

Visitor Pattern: relies on a “double-dispatch” method within each AST node. The double-dispatch method redirects visit( ) calls on a node to an appropriate method in a visitor servicing that node type. The visitor is like a set of callback methods.

for a Node, use a visitor  ---->  Node.visit(visitor) --- dispatch to ---> visitor.visit(Node)


/
**
A generic heterogeneous tree node used in our vector math trees
*
/
public abstract class VecMathNode extends HeteroAST {
public VecMathNode() {;}
public VecMathNode(Token t) { this.token = t; }
public abstract void visit(VecMathVisitor visitor); // dispatcher
}

public void visit(VecMathVisitor visitor) { visitor.visit(this); }

Visitor的定义

public interface VecMathVisitor {
void visit(AssignNode n);
void visit(PrintNode n);
void visit(StatListNode n);
void visit(VarNode n);
void visit(DotProductNode n);
void visit(IntNode n);
void visit(MultNode n);
void visit(VectorNode n);
}


we can distinguish between tokens using the token type, we can also distinguish between AST nodes using the token type. By switching on the token type rather than the AST node type, we can avoid the visit() method in each AST node

public void print(VecMathNode n) {
switch ( n.token.type ) {
case Token.ID : print((VarNode)n); break;
case Token.ASSIGN : print((AssignNode)n); break;
case Token.PRINT : print((PrintNode)n); break;
case Token.MULT : print((MultNode)n); break;
case Token.DOT : print((DotProductNode)n); break;
case Token.INT : print((IntNode)n); break;
case Token.VEC : print((VectorNode)n); break;
case Token.STAT_LIST : print((StatListNode)n); break;
default :
// catch unhandled node types
throw new UnsupportedOperationException("Node " +
n.getClass().getName()+ " not handled" );
}
}


#### Tree Grammar

Tree grammars are a terse and formal way of building an external visitor. Visitors generated from tree grammars are usually called tree parsers because they are the two-dimensional analog of conventional parsers.

Tree grammars do not care about the implementation language
classes used to represent AST nodes (they work with both homogeneous and heterogeneous AST nodes). Instead, they rely on token type differences between AST node token payloads to distinguish different kinds of nodes

Tree grammars 同样是可以借助生成工具来定义的

#### Tree Pattern Matcher

This pattern walks trees, triggering actions or tree rewrites as it encounters tree patterns of interest. The process of matching and rewriting trees is formally called term rewriting.

Tree Pattern Match 和 Tree Grammar 的不同之处：

1. We have to specify patterns only for the subtrees we care about.
2. We don’t need to direct the tree walk.

A tree pattern matcher is analogous to text rewriting tools such as awk, sed, and perl.

A tree grammar needs a pattern for each subtree. The grammar rules include traversal instructions just like a hand-built visitor

// TODO: Rewriting Examples

### 2014/1/13

Semantic analysis is just a fancy way to say we’re “sniffing a program to see whether it makes sense.” The term semantics is synonymous with “meaning.”

computer language sentences can define and reference symbols in code blocks. At the end of the code block, the symbols go out of scope

### 2014/1/14

• Symbol Table for Monolithic Scope

All symbols exist within a single scope (set of symbols). Simple property files and early BASIC are the best examples

• Symbol Table for Nested Scopes

There are multiple scopes and scopes can nest inside other scopes. C is a typical language that has nested scopes. Scopes start and end at the start and end of language structures such as functions.

#### 如何组织程序中出现的符号

class T { ... }; // define class T
T f() { ... } // define function f returning type T
int x; // define variable x of type int


Type c = new ClassSymbol("T" ); // define class
MethodSymbol m = new MethodSymbol("f" , c); // define method
Type intType = new BuiltInTypeSymbol("int" ); // define int type
VariableSymbol v = new VariableSymbol("x" , intType); // define var x


Symbol的属性:

• Name: Symbols are usually identifiers like x, f, and T, but they can be operators too.
• Category: That is, what kind of thing the symbol is. Is it a class, method, variable, label, and so on. 例如需要判断只有在 method 的类型上才能进行方法调用
• Type: To validate operation x+y, for example, we need to know the
types of x and y. 动态语言（如Python)在运行时才知道变量类型，静态语言如 Java则是在编译的时候就知道变量类型

#### Symbol Table for Monolithic Scope

This pattern builds a symbol table for a language with a single, flat scope. This pattern is suitable for simple programming languages (without functions), configuration files, small graphics languages, and other small DSLs.

#### Symbol Table for Nested Scopes

This pattern tracks symbols and builds a scope tree for languages with multiple, possibly nested scopes.

Programming language functions are a good example of nested scopes.

int i = 9;
float f(int x, float y)
{
float i;
{ float z = x+y; i = z; }
{ float z = i+1; i = z; }
return i;
}
void g()
{
f(i,2);
}


Methods have two scopes actually: one for the parameters (the MethodSymbol itself) and one for its local variables (a LocalScope whose parent is the MethodSymbol).

### 2014/1/15

// TODO: Antlr4 实现代码

### 2014/1/17

descent: refers to the fact that parsing begins at the root of a parse tree and proceeds toward the leaves(tokens)

### 2014/1/18

Antlr4 概念中提到了 Listener 和 Visitor， 那么这两个到底有什么区别呢？首先从我自己的理解来讲，区别在于，Listener 是事件驱动，而 Visitor是类型驱动。因为Listener通常会和某个动作挂钩，比如进入节点,退出节点 等，而Visitor的话通常会涉及到不同类型的节点。

Listener 通常适合完成一些翻译成另外一种语言或者解释执行

Vistor 通常可以用来在Tree walker的过程中，可以完成表达式求值等过程

It’s easy to build these kinds of extractors or translators
using a listener.