Introduction to Language Design and Implementation

Resources

website: https://bobzhang.github.io/courses/

slides: https://bobzhang.github.io/courses/slides1.pdf

video part 1: https://www.bilibili.com/video/BV1ee4y1e7CU

video part 2: https://www.bilibili.com/video/BV1Ug41187Zh

ReScript

ref: https://bobzhang.github.io/courses/rescript_setup.html

安装 node.js 然后克隆模板仓库使用 npm 构建、运行:

git clone https://github.com/rescript-lang/rescript-project-template rescript-project
cd rescript-project
npm install
npm run res:build
node src/Demo.bs.js  # 执行 demo

vim 插件:

Plug 'rescript-lang/vim-rescript'

Compilation Phases

Lexing & Parsing

这个部分的目标是将字符串转换成 abstract tree。一般分两个阶段:

  1. Tokenization. Split the string into tokens by using regex.
  2. Parsing. 现代语言中到 parsing 基本都与语义无关了。这里的反例是 C++,C++ 太杂糅了。

There are lots of typical compiler frontend tools supporting, such as Lex, Yacc, Bison, etc.

这里还有一条路径,string -> concrete syntax -> abstract tree,一套 abstract syntax 可能有多个对应的 concrete syntax。

Sematic Analysis

Language Specific Lowering, Optimizations

Linearization & Optimizations

经过多层 IR 的转译之后,因为 CPU 是流水线架构,最终需要得到 linear IR(平台无关)。

Platform Specific Code

Tiny Language 0

Formalization

terms: $e ::= \operatorname{Cst}(i) \mid \operatorname{Add}(e_1, e_2) \mid \operatorname{Mul}(e_1, e_2)$
values: $v ::= i \in \operatorname{Int}$

The evaluation rules(形式化语义):

$$ \frac{}{\operatorname{Cst}(i) \Downarrow i} \operatorname{E-const} $$
$$ \frac{e_1 \Downarrow v_1\ \ \ \ \ \ \ e_2 \Downarrow v_2}{\operatorname{Add}(e_1, e_2) \Downarrow (v_1 + v_2)} \operatorname{E-add} $$
$$ \frac{e_1 \Downarrow v_1\ \ \ \ \ \ \ e_2 \Downarrow v_2}{\operatorname{Mul}(e_1, e_2) \Downarrow (v_1 * v_2)} \operatorname{E-mul} $$

For example, $\operatorname{Add}(\operatorname{Cst}(1), \operatorname{Cst}(2)) \Downarrow 3$’s proof tree:

\[\frac{\frac{}{\operatorname{Cst}(1) \Downarrow 1} \operatorname{E-Const}\ \ \ \ \ \ \frac{}{\operatorname{Cst}(2) \Downarrow 2} \operatorname{E-const}}{\operatorname{Add} (\operatorname{Cst}(1), \operatorname{Cst}(2)) \Downarrow 3} \operatorname{E-add}\]

Parser & Interpreter Codes(非形式化语义)

type rec expr =
    | Cst (int)
    | Add (expr, expr)
    | Mul (expr, expr)

let rec eval = (expr : expr) => {
    switch expr {
    | Cst (i) => i
    | Add (a, b) => eval (a) + eval (b)
    | Mul (a, b) => eval (a) * eval (b)
    }
}

但是这里有个问题:就比如:

| Add (a, b) => eval (a) + eval (b)

这里的递归用到了宿主语言的 stack,这里需要将它 lowering 到低级语言从而不用到宿主语言的 stack。

Lowering to a Stack Machine and Interpret

type instr = Cst (int) | Add | Mul  // no recursive
type instrs = list <instr>
type operand = int
type stack = list <operand>

let rec eval = (instrs: instrs, stk: stack) => {
    switch (instrs, stk) {
    | (list{Cst (i), ...rest}, _) =>
        eval(rest, list{i, ...stk})
    | (list{Add, ...rest}, list{a, b, ...stk}) =>
        eval(rest, list{a + b, ...stk})
    | (list{Mul, ...rest}, list{a, b, ...stk}) =>
        eval(rest, list{a * b, ...stk})
    | _ => assert false
    }
}

下面给出该虚拟机的形式化语义:

这个虚拟机由两个部分(数据结构)组成:

Code and stack:

code: $c ::= \epsilon \mid i; c$
stack: $s ::= \epsilon \mid v :: s$

Transition of the machine:

$(\operatorname{Cst}(i); c, s) \rightarrow (c, i :: s)$ (I-Cst)
$(\operatorname{Add}; c, n_2 :: n_1 :: s) \rightarrow (c, (n_1 + n_2) :: s)$ (I-Add)
$(\operatorname{Mul}; c, n_2 :: n_1 :: s) \rightarrow (c, (n_1 \times n_2) :: s)$ (I-Mul)

经过很多步的状态转移后,stack 中只剩一个元素,代码计算完成,得到 v in stack 就是结果:

\[\frac{(c, \epsilon) \rightarrow ^* (\epsilon, v :: \epsilon)}{c \downarrow v}\]

编译对应的数学形式化表示:

\[[\![ \operatorname{Cst}(i) ]\!] = \operatorname{Cst}(i)\] \[[\![ \operatorname{Add}(e_1, e_2) ]\!] = [\![ e_1 ]\!]; [\![ e_2 ]\!]; \operatorname{Add}\] \[[\![ \operatorname{Mul}(e_1, e_2) ]\!] = [\![ e_1 ]\!]; [\![ e_2 ]\!]; \operatorname{Mul}\]

关于如何证明编译过程的正确,利用到栈平衡的性质,使用数学归纳法。具体需要用到 Coq。

正确的编译实现会使下面的式子成立:

\[e \Downarrow v \Longleftrightarrow [\![ e ]\!] \downarrow v\]

前者被叫作 big step operational semantics(大步语义),后者被叫作 small step operational semantics(小步语义)。

Homework

将编译使用非形式化语言实现,即将大步语义转换为小步语义:

type rec expr =
    | Cst (int)
    | Add (expr, expr)
    | Mul (expr, expr)

type instr = Cst (int) | Add | Mul
type instrs = list<instr>
type operand = int
type stack = list<operand>

let rec compile = (expr: expr): list<instr> => {
    switch expr {
    | Cst(i) => list{ Cst(i) }
    | Add(e1, e2) => List.append(List.append(compile(e1), compile(e2)), list{Add})
    | Mul(e1, e2) => List.append(List.append(compile(e1), compile(e2)), list{Mul})
    }
}