Cflat Compiler Project
This project is a comprehensive implementation of a five-stage compiler for Cflat, a custom-designed programming language. The compiler is built from the ground up, starting with lexical analysis and progressing through parsing, intermediate representation, optimization, and final code generation for the x86-64 architecture.
Stage 1: Lexical Analysis (Lexer)
The first stage of the compiler is the lexer, which is responsible for scanning the raw Cflat source code and converting it into a stream of tokens. This process involves:
- Tokenization: The lexer uses a Non-deterministic Finite Automaton (NFA) to recognize and tokenize various language constructs, including keywords, identifiers, numbers, and operators.
- Ambiguity Resolution: To handle ambiguous token sequences, the lexer implements the "maximal munch" principle, which ensures that the longest possible lexeme is matched. A priority system is also in place to resolve conflicts between different token types.
- Whitespace and Comment Handling: The lexer is designed to correctly handle and skip whitespace and both C-style (/* ... */) and C++-style (// ...) comments.
Stage 2: Parsing and Validation
The second stage takes the token stream from the lexer and builds an Abstract Syntax Tree (AST), which is a hierarchical representation of the source code's structure. This stage also includes a validation phase to check for semantic errors. Key features of this stage include:
- LL(1) Recursive Descent Parser: A recursive descent parser, based on a provided LL(1) grammar, is used to construct the AST.
- Syntax Error Handling: If the input code is syntactically incorrect, the parser is designed to halt and report an error, indicating the token at which the error was detected.
- Type Checking: For syntactically correct programs, the parser traverses the generated AST to perform type checking, identifying and reporting any type-related errors. A special ANY type is used to allow the type checker to continue its analysis even after an error is found.
Stage 3: Lowering to Intermediate Representation (LIR)
Once the AST is successfully built and validated, it's "lowered" into a Low-level Intermediate Representation (LIR). This LIR is a more machine-friendly representation of the program, which makes the subsequent optimization and code generation stages simpler and more modular. The lowering process is implemented based on the algorithms discussed in class lectures.
Stage 4: Code Generation for x86-64
In this stage, the LIR is translated into x86-64 assembly code, the native machine language for modern 64-bit processors. This stage involves:
- Instruction Selection: Mapping LIR instructions to their corresponding x86-64 assembly instructions.
- Register Allocation: Managing the use of processor registers to store variables and intermediate results efficiently.
- Stack Management: Generating code to manage the call stack, including function prologues and epilogues.
- Runtime Support: The generated code links against a small C runtime (160-runtime.c) to handle tasks like printing to the console and memory allocation.
Stage 5: LIR Optimization
The final stage of the compiler is focused on optimizing the LIR code to improve its performance. The key optimization implemented is integer constant propagation, which is achieved through Data Flow Analysis (DFA). This process involves:
- Data Flow Analysis: The compiler analyzes the flow of data through the program to determine where variables hold constant values. A worklist-based DFA algorithm is used for this analysis.
- Constant Folding and Propagation: Arithmetic and comparison operations with constant operands are computed at compile time, and the results are propagated through the code.
- Dead Code Elimination: By statically determining the outcome of branches with constant conditions, the compiler can identify and remove unreachable basic blocks from the program, resulting in smaller and more efficient code.