Since its inception two years ago, the EndBASIC interpreter has been using an AST-based execution engine. And during all this time, people have mocked the language for not allowing 10 GOTO 10. Well, fear not: the upcoming 0.10 release has full support for GOTO and GOSUB, features that were made possible by moving to a bytecode-based interpreter. Let’s take a peek at what the problems were and how I addressed them.

A blog on operating systems, programming languages, testing, build systems, my own software projects and even personal productivity. Specifics include FreeBSD, Linux, Rust, Bazel and EndBASIC.

0 subscribers

Before: AST-based execution

An Abstract Syntax Tree (AST) represents the structure of the program as a connected acyclic undirected graph—aka… a tree. Each node represents a construct that appears in the program text and each edge indicates how those constructs are nested. ASTs are the output of a language parser.

ASTs are useful to efficiently walk over a program’s source code at a high level without having to worry about lexing or parsing, and they are useful to implement things like compilers: for each construct in the AST, such as an arithmetic operation or an if statement, the compiler can inspect the various parts of the construct and “directly” emit machine code for them.

But ASTs can also be used to execute a program. Here, take a look at what the main execution loop in EndBASIC used to look like:

async fn exec(&mut self, stmts: &[Statement]) -> Result<()> {
    for stmt in stmts {
        match stmt {
            Statement::Assignment(vref, expr) => {
                self.assign(vref, expr).await?
            }

            Statement::BuiltinCall(name, args) => {
                let cmd = match self.symbols.get(&VarRef::new(name, VarType::Auto))? {
                    Some(Symbol::Command(cmd)) => cmd.clone(),
                    Some(_) => return new_syntax_error(format!("{} is not a command", name)),
                    None => return new_syntax_error(format!("Unknown builtin {}", name)),
                };
                cmd.exec(args, self)
                    .await
                    .map_err(|e| Error::from_call_error(cmd.metadata(), e))?;
            }

            Statement::If(branches) => {
                self.do_if(branches).await?;
            }

            Statement::While(condition, body) => {
                self.do_while(condition, body).await?;
            }

            // ... other statement types elided ...
        }
    }
}

In the above code snippet, note how walking the AST (represented in stmts as a sequence of Statement values) for execution is easy: there is a loop that “sees” every high-level construct and can then simulate what should be done for each of them.

To add some more color to the above, here is how an IF statement’s execution looked like. Note that we explicitly test for each conditional branch that appears in the program and, once we find the branch that is true (if any) we recursively call into the exec function to evaluate its contents and then terminate execution:

async fn do_if(&mut self, branches: &[(Expr, Vec<Statement>)]) -> Result<()> {
    for (expr, stmts) in branches {
        match expr.eval(&mut self.symbols).await? {
            Value::Boolean(true) => {
                self.exec(stmts).await?;
                break;  // Stop executing branches.
            }
            Value::Boolean(false) => (),
            _ => return new_syntax_error("IF/ELSEIF require a boolean condition"),
        };
    }
    Ok(())
}

Similarly, here is how WHILE loop execution looked like. Note again that we have a native loop that iterates over the body of the interpreted loop and exits when the condition is false:

async fn do_while(&mut self, condition: &Expr, body: &[Statement]) -> Result<()> {
    loop {
        match condition.eval(&mut self.symbols).await? {
            Value::Boolean(true) => self.exec(body).await?,
            Value::Boolean(false) => break,
            _ => return new_syntax_error("WHILE requires a boolean condition"),
        }
    }
    Ok(())
}

Looks pretty simple and reasonable, right?

In-between: Hacking GOTO support

Indeed it does. The AST-based executor presented above is simple. But it comes with significant problems—the most salient one being that implementing the much desired GOTO is very difficult.

Consider an EndBASIC program like this:

IF 1 < 4 THEN
    WHILE TRUE
        GOTO @exit
    WEND
END IF
@exit: PRINT "hi"

If we parse the above into an AST and try to execute it with the algorithm presented earlier, we end up with a native call stack that looks like this:

    exec <-- @exit is visible at this level.
    | -> do_if
    |    | -> exec
    |    |    | -> do_while
    |    |    |    | -> exec
    |    |    |    |    | -> do_goto <-- How do we jump to @exit?

Which is a problem. There is no way the do_goto native function can jump to @exit right away. The only thing we can do is unwind the native stack until we reach the nesting level where @exit is defined and continue execution there. And because we can do this and adding GOTO as a feature was not negotiable at this point, I implemented this exact solution as a first cut in commit 8ef803e. Here are the crux of the changes to the AST-based execution loop:

async fn exec(&mut self, stmts: &[Statement]) -> Result<()> {
    let mut i = 0;
    while i < stmts.len() || self.pending_goto.is_some() {
        // If there is a "pending GOTO", see if we can find the target label
        // at this level by scanning over all statements.
        if let Some(target) = self.pending_goto.as_ref() {
            i = 0;
            while i < stmts.len() {
                if let Statement::Label(label) = &stmts[i] {
                    if target == label {
                        i += 1;
                        self.pending_goto = None;
                        break;
                    }
                }
                i += 1;
            }

            // The label wasn't found here.  Unwind this stack frame and the
            // calling exec() will try this again.
            if i == stmts.len() {
                return Ok(());
            }
        }

        // Now execute statements at this level as before ...
        match &stmts[i] {
            Statement::Goto(label) => {
                self.pending_goto = Some(label.clone());
                // Stack unwinding will happen at the next iteration, where we
                // will first try to find the target at the current level.
            }

            Statement::Label(label) => {
                // Nothing to do (other than sanity checks).  The labels as
                // targets were handled above.
            }

            // ... other statement types elided ...
        }

        i += 1;
    }
}

Does the job. Unfortunately, this is not pretty because the pending_goto looks like a clutch, and this is not efficient because what should be a simple instruction jump (a thing that microprocessors can do just fine) becomes a convoluted stack unwinding process plus a target lookup at every level.

What’s worse is that the above isn’t feature-complete either because we are only unwinding the stack (going up the call chain) but never go down. Think about this other program:

IF 1 < 4 THEN
    WHILE TRUE
        GOTO @other
    WEND
ELSE
    @other: PRINT "hi"
END IF

This snippet differs from the above in that the @other target of the GOTO cannot be found by simply unwinding the stack: the executor would need to unwind the stack, yes, but it would also need to go into every possible branch of the program to look for the target, which would either be even more inefficient or it would require a lot of extra bookkeeping to track where the targets live.

And those are not the only problems. What if, instead of a GOTO, we want to implement a GOSUB—a call to an arbitrary location that can later return execution to where the call happened? “Impossible.”

After: compiler plus bytecode execution

All of the problems above go away if we “flatten” the AST by replacing all nesting with (conditional) jumps and then execute the code with a program counter and a call stack. If that sounds like writing a compiler and a virtual machine, it’s because this is precisely what this means. But it needn’t be as complicated as it sounds. In fact, I procrastinated on this for over a year and first implemented the hack above because I feared the change but, in the end, coming up with a prototype was a matter of a couple of hours.

Let’s go back to our “complex” code statement from earlier. This EndBASIC program:

IF 1 < 4 THEN
    WHILE TRUE
        GOTO @other
    WEND
ELSE
    @other: PRINT "hi"
END IF

… is equivalent to this after flattening the AST:

IF 1 < 4 THEN: GOTO @if_else: END IF

@while_start:
IF NOT TRUE: GOTO @while_end: END IF
GOTO @other
GOTO @while_start
@while_end:
GOTO @if_end

@if_else:
@other: PRINT "hi"

@if_end:

All nesting is gone! The only thing we are left with are conditional jumps (the IF one-liners), unconditional jumps, and their corresponding targets. We have essentially gone through a compilation phase. And if we replace the string labels with line numbers (akin to addresses), just like an assembler would do, we are left with this:

1 IF 1 < 4 THEN: GOTO 6: END IF
2 IF NOT TRUE: GOTO 6: END IF
3 GOTO 6
4 GOTO 2
5 GOTO 7
6 PRINT "hi"

… which looks like an awful lot like native machine code, doesn’t it?

And this is what commit c0cb9a3 did. In this change, I added a rudimentary version of a compiler that flattens the AST and a trivial bytecode-based virtual machine that executes every instruction. The execution loop thus becomes:

let mut pc = 0;
while pc < instrs.len() {
    let instr = &instrs[pc];
    match instr {
        Instruction::Assignment(span) => {
            self.assign(span).await?;
            pc += 1;
        }

        Instruction::BuiltinCall(span) => {
            self.call_builtin(span).await?;
            pc += 1;
        }

        Instruction::Jump(span) => {
            pc = span.addr;
        }

        Instruction::JumpIfNotTrue(span) => {
            match span.cond.eval(&mut self.symbols).await? {
                Value::Boolean(true) => pc += 1,
                Value::Boolean(false) => pc = span.addr,
                _ => {
                    return new_syntax_error(span.cond.start_pos(), span.error_msg);
                }
            }
        }

        // ... other instruction types elided ...
    }
}

And with this, you can imagine that implementing GOTO becomes trivial—as does implementing GOSUB and ON ERROR, which are other features that are already available at HEAD.

Now, the above is far from being true bytecode: for example, the instruction types that are currently defined are still too high level. Something like the BuiltinCall instruction type does an incredible amount of work because this is evaluating (in native code) the various arguments to the call and calling into the native function. To turn this into true bytecode, the arguments themselves—which are arbitrary expressions—would need to be evaluated via bytecode instructions with a supporting stack or register bank. And to be real bytecode, the instructions should have a compact binary representation.

Having a true bytecode and a memory map sounds super-fun to build. It may even be necessary when adding support for custom-defined functions. And it would pave the way to having PEEK and POKE. But we didn’t need any of this to solve the GOTO problem. So, for now, this is all that EndBASIC 0.10 will ship with.

Stay tuned for the forthcoming release announcement!

From AST to bytecode execution in EndBASIC

Before: AST-based execution

In-between: Hacking GOTO support

After: compiler plus bytecode execution

Featured software

Featured posts