始于
分类:PL
Tags: [ PL ]
程序语言理论和实现 Week 0
Introduction to Language Design and Implementation
Resources
website: https://bobzhang.github.io/courses/
slides: https://bobzhang.github.io/courses/slides1.pdf
video part 1: https://www.bilibili.com/video/BV1ee4y1e7CU
video part 2: https://www.bilibili.com/video/BV1Ug41187Zh
ReScript
ref: https://bobzhang.github.io/courses/rescript_setup.html
安装 node.js 然后克隆模板仓库使用 npm 构建、运行:
git clone https://github.com/rescript-lang/rescript-project-template rescript-project
cd rescript-project
npm install
npm run res:build
node src/Demo.bs.js # 执行 demo
vim 插件:
Plug 'rescript-lang/vim-rescript'
Compilation Phases
Lexing & Parsing
这个部分的目标是将字符串转换成 abstract tree。一般分两个阶段:
- Tokenization. Split the string into tokens by using regex.
- Parsing. 现代语言中到 parsing 基本都与语义无关了。这里的反例是 C++,C++ 太杂糅了。
There are lots of typical compiler frontend tools supporting, such as Lex, Yacc, Bison, etc.
这里还有一条路径,string -> concrete syntax -> abstract tree,一套 abstract syntax 可能有多个对应的 concrete syntax。
Sematic Analysis
- Build the symbol table, resolve variables
- Type checking & inference
- 保证每个操作的类型正确
- 用户没有标注类型需要推断(infer, not induce)
- Typeclass/Implictis resolving
- Check other safety/security problems
- 比如 Rust 的 lifetime
- Type soundness:没有运行时类型错误(这个在 nju 静态分析里有说到?)
Language Specific Lowering, Optimizations
- Class/Module/Objects/Typeclass desugaring
- Pattern match desugaring
- Closure conversion
- Language specific optimizations
- IR,一层层的 IR
Linearization & Optimizations
经过多层 IR 的转译之后,因为 CPU 是流水线架构,最终需要得到 linear IR(平台无关)。
- Language & platform agnostics
- Optimizations
- Constant folding, propagation(常数传播),CSE(common subexpression elimination),parital evaluation
- loop 的优化
- 尾递归优化
- intra-procedural, inter-procedural opt。这里说到全局优化是函数内部的优化,全局环境中的优化是 LTO(link time optimization)
- IR simplified:三地址码,LLVM IR 等
Platform Specific Code
- Instruction selection
- Register allocation. 这个很重要。 这里存在一个 hierarchy,要尽可能将 heap 上的内存转到 stack,将 stack 转到 registers。
- 指令调度和机器相关优化。这方面影响最大的主要是一些数值算法,比如神经网络。
Tiny Language 0
Formalization
terms: | $e ::= \operatorname{Cst}(i) \mid \operatorname{Add}(e_1, e_2) \mid \operatorname{Mul}(e_1, e_2)$ |
values: | $v ::= i \in \operatorname{Int}$ |
The evaluation rules(形式化语义):
For example, $\operatorname{Add}(\operatorname{Cst}(1), \operatorname{Cst}(2)) \Downarrow 3$’s proof tree:
\[\frac{\frac{}{\operatorname{Cst}(1) \Downarrow 1} \operatorname{E-Const}\ \ \ \ \ \ \frac{}{\operatorname{Cst}(2) \Downarrow 2} \operatorname{E-const}}{\operatorname{Add} (\operatorname{Cst}(1), \operatorname{Cst}(2)) \Downarrow 3} \operatorname{E-add}\]Parser & Interpreter Codes(非形式化语义)
type rec expr =
| Cst (int)
| Add (expr, expr)
| Mul (expr, expr)
let rec eval = (expr : expr) => {
switch expr {
| Cst (i) => i
| Add (a, b) => eval (a) + eval (b)
| Mul (a, b) => eval (a) * eval (b)
}
}
但是这里有个问题:就比如:
| Add (a, b) => eval (a) + eval (b)
这里的递归用到了宿主语言的 stack,这里需要将它 lowering 到低级语言从而不用到宿主语言的 stack。
Lowering to a Stack Machine and Interpret
type instr = Cst (int) | Add | Mul // no recursive
type instrs = list <instr>
type operand = int
type stack = list <operand>
let rec eval = (instrs: instrs, stk: stack) => {
switch (instrs, stk) {
| (list{Cst (i), ...rest}, _) =>
eval(rest, list{i, ...stk})
| (list{Add, ...rest}, list{a, b, ...stk}) =>
eval(rest, list{a + b, ...stk})
| (list{Mul, ...rest}, list{a, b, ...stk}) =>
eval(rest, list{a * b, ...stk})
| _ => assert false
}
}
下面给出该虚拟机的形式化语义:
这个虚拟机由两个部分(数据结构)组成:
- 指令集
c
。原文是 a code pointerc
giving the next instruction to execute. - Stack
s
.- push
v
ons
: $s \rightarrow v :: s$ - pop
v
offs
: $v :: s \rightarrow s$
- push
Code and stack:
code: | $c ::= \epsilon \mid i; c$ |
stack: | $s ::= \epsilon \mid v :: s$ |
Transition of the machine:
$(\operatorname{Cst}(i); c, s) \rightarrow (c, i :: s)$ | (I-Cst) |
---|---|
$(\operatorname{Add}; c, n_2 :: n_1 :: s) \rightarrow (c, (n_1 + n_2) :: s)$ | (I-Add) |
$(\operatorname{Mul}; c, n_2 :: n_1 :: s) \rightarrow (c, (n_1 \times n_2) :: s)$ | (I-Mul) |
经过很多步的状态转移后,stack 中只剩一个元素,代码计算完成,得到 v
in stack 就是结果:
编译对应的数学形式化表示:
\[[\![ \operatorname{Cst}(i) ]\!] = \operatorname{Cst}(i)\] \[[\![ \operatorname{Add}(e_1, e_2) ]\!] = [\![ e_1 ]\!]; [\![ e_2 ]\!]; \operatorname{Add}\] \[[\![ \operatorname{Mul}(e_1, e_2) ]\!] = [\![ e_1 ]\!]; [\![ e_2 ]\!]; \operatorname{Mul}\]关于如何证明编译过程的正确,利用到栈平衡的性质,使用数学归纳法。具体需要用到 Coq。
正确的编译实现会使下面的式子成立:
\[e \Downarrow v \Longleftrightarrow [\![ e ]\!] \downarrow v\]前者被叫作 big step operational semantics(大步语义),后者被叫作 small step operational semantics(小步语义)。
Homework
将编译使用非形式化语言实现,即将大步语义转换为小步语义:
type rec expr =
| Cst (int)
| Add (expr, expr)
| Mul (expr, expr)
type instr = Cst (int) | Add | Mul
type instrs = list<instr>
type operand = int
type stack = list<operand>
let rec compile = (expr: expr): list<instr> => {
switch expr {
| Cst(i) => list{ Cst(i) }
| Add(e1, e2) => List.append(List.append(compile(e1), compile(e2)), list{Add})
| Mul(e1, e2) => List.append(List.append(compile(e1), compile(e2)), list{Mul})
}
}