Happy new year and welcome to 2023!

If you have followed me over the past… three years already? (wow), you know I have been working on an implementation of a BASIC interpreter called EndBASIC. The parser in the 0.1 release was very rudimentary and only supported scalar variables and assignments; IF, WHILE and FOR control flow primitives; and the ability to invoke built-in commands such as PRINT and INPUT. There was nothing else in the language, and I neglected to improve it because I favored building the scaffolding around the interpreter—features like a web UI and a cloud service seemed much more fun to build at the time.

Unfortunately, due to the massive limitations of the core language, it was almost impossible to write anything of substance in it. EndBASIC made it to the front page of popular news sites a couple of times, and this limitation was a common criticism. So, for a few months before the holidays, I chipped away at the core of the interpreter and made major strides in what the language offers, culminating in the massive 0.10 release. To summarize, this new release brought features like GOTO and GOSUB, DO and SELECT CASE, and integer and double interop, to name a few.

In the context of this post, however, I want to focus on the makings of the parser. In particular, I want to analyze various difficulties I encountered while implementing these improvements to the interpreter. The reason I want to look into these is that, on various occasions, I got a glimpse of the design choices that the original language designers must have faced when coming up with BASIC in the 1960s, given the limitations of the machines at the time. This has been fascinating and enlightening.

But also unpleasant. During my implementation work, I have come across many oddities in the language that make it irregular and thus difficult to parse into an AST. All of these oddities have been “fixable”, but they make it clear that BASIC didn’t originate with ASTs in mind. As far as I can tell, BASIC was designed for line-by-line interpretation.

EndBASIC self-inflicted pain

Before diving into the difficulties I have encountered so far in parsing BASIC, let me start by outlining some core design principles behind EndBASIC:

  1. Core vs. standard library split: The core language, which involves parsing, compiling into bytecode (as of 0.10), and executing the results, is supposed to be completely independent from the “standard library”. This is to ensure that the language remains “pure” and that nothing in the virtual machine can escape to do I/O unless explicitly allowed to do so by hooking individual standard library components into the interpreter. The consequence is that all built-in commands and functions, including PRINT and INPUT, cannot be part of the core.

  2. AST representation: The standard library, which provides all common commands and functions, is where the fun lives—but remember, it is detached from the core. Anything in the standard library must be representable in the AST without command-specific special cases. As we will see later, this is… tricky, because BASIC commands have some of the most inconsistent structure I’ve ever seen.

  3. Hand-made recursive descent parser: I have intentionally handcrafted the lexer and parser because this whole project started just as a “how hard could it be” kind of situation. I also wanted to keep dependencies to a minimum to make the core (not the standard library) tiny and embeddable into other processes, so I had to get a good grasp of everything involved in the process. This has worked OK so far, but it’s not trivial, as we will see.

  4. Context-free parsing: The parser should not know anything about the meaning of symbols, which is probably the most problematic constraint of all.

Once again, note that these are all self-imposed constraints. As far as I can tell, none of them existed at the time BASIC was designed, and forgoing them would make implementing the parser much simpler.

The reason these constraints exist is that I dove into implementing the EndBASIC parser with only a rudimentary knowledge of BASIC. I had good memories of using BASIC more than 30 years ago but no mastery of the language, so I’ve been hitting new corner cases along the way and have had to retrofit support for them into the parser.

At the end of this post, I’ll briefly discuss how I think these design constraints should change in order to evolve the language even further, but first, let’s look into the current parsing difficulties, which is the meat of this post.

Oddities and difficulties

Argument passing

The first difficulty in implementing the parser was representing commands in the AST and, in particular, their arguments. In any sane and modern language, arguments are separated from each other in one way—and just one way. If you have a function print and arguments bar and baz, you do things like print bar baz or print(bar, baz), but the separators are all equal and have no special meaning.

Things are not as simple in BASIC. You see, BASIC has two (ha ha, stay tuned) argument separators, and their meaning depends on the command being invoked. Here, take a look at the madness that PRINT is:

' Print two arguments with a long separator in-between them.
PRINT 3, 4

' Print two arguments with a short separator in-between them.
PRINT 3; 4

' Mix and match separators, and even leave some empty fields.
PRINT 3, 4; 5, , 6

' Oh, and you can even have a trailing separator.
PRINT 3,

Tricky? Sure, but now look at INPUT, which also supports the two separators but with completely different meanings:

' Read a number with a "? " default prompt.
INPUT n
INPUT ; n

' Read a number without a prompt.
INPUT , n

' Read a string with a prompt followed by "? ".
INPUT "What's your name"; t$

' Read a string with a prompt with nothing appended to it.
INPUT "Enter some words: ", t$

For a long time, support for these two separators was all that I had implemented in the language, but then… I stumbled upon the NAME command in QuickBASIC, which is used to rename files. NAME has this marvelous syntax:

NAME "old name" AS "new name"

Uh. There are not two but three argument separators, and which ones are valid where, and what each one means, depends on the command being invoked. Things seem to get worse when you start looking at how to open files and represent their file handles, but I won’t get there for now because I haven’t even tried to implement this yet.

Array assignments

The next difficulty arose when trying to handle array assignments, a feature that appeared in the 0.6 release. Array references in BASIC use parentheses, not square brackets as is customary in most other languages. For example:

DIM vector(10) AS INTEGER
vector(2) = 12345
PRINT vector(2)

DIM matrix(10, 10) AS INTEGER
matrix(4, 5) = 987
PRINT matrix(4, 5)

This is fine: who cares whether accessing an array is done with vector[2] or vector(2)? Except… remember the constraint that says that the EndBASIC parser must not know anything about commands? Consider what happens when you try to parse this:

vector(2) = 12 ' Array assignment.
PRINT(2)       ' Call to PRINT with an argument wrapped in parentheses.

Here is the parser’s flow. The parser first sees vector or PRINT, which are symbol references it knows nothing about. Then the opening parenthesis comes. What are we trying to parse? Will we encounter an equal sign later and conclude that this is an array assignment, or will there be no equal sign, meaning that this is a command invocation with a single parenthesized argument? You can’t tell until you have consumed the closing parenthesis and checked whether an equal sign follows.

Obviously, this disambiguation was implemented in the parser as I just described, but it is not trivial. This is another case of self-inflicted pain that results from keeping the parser unaware of the symbols that have been defined so far. Otherwise, if we knew that vector is an array, we could immediately conclude that we are facing an array assignment as soon as we see it specified as the first token of a statement; and we could do similarly for PRINT if we knew that it is a command.
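
To make the flow above concrete, here is a minimal Rust sketch of how a hand-rolled recursive descent parser can defer the decision until after the closing parenthesis. All names here (Token, Statement, parse_symbol_statement) are invented for illustration, the arguments are simplified to integer literals, and this is not the actual EndBASIC parser:

#[derive(Debug, PartialEq)]
enum Token { LParen, RParen, Equal, Integer(i32), Comma, Eol }

#[derive(Debug)]
enum Statement {
    ArrayAssignment { name: String, subscripts: Vec<i32>, value: i32 },
    CommandCall { name: String, args: Vec<i32> },
}

// Parses a statement that starts with `name(` and decides what it is only
// after the closing parenthesis, based on whether an `=` follows.
fn parse_symbol_statement(
    name: String,
    tokens: &mut std::iter::Peekable<std::vec::IntoIter<Token>>,
) -> Statement {
    assert_eq!(tokens.next(), Some(Token::LParen));

    // Buffer whatever is inside the parentheses; at this point we cannot yet
    // tell whether these are array subscripts or command arguments.
    let mut inner = vec![];
    loop {
        match tokens.next() {
            Some(Token::Integer(i)) => inner.push(i),
            Some(Token::Comma) => (),
            Some(Token::RParen) => break,
            other => panic!("unexpected token {:?}", other),
        }
    }

    // Only now can we disambiguate: an `=` means this was an array assignment.
    if let Some(Token::Equal) = tokens.peek() {
        tokens.next();
        match tokens.next() {
            Some(Token::Integer(value)) => {
                Statement::ArrayAssignment { name, subscripts: inner, value }
            }
            other => panic!("expected a value, got {:?}", other),
        }
    } else {
        Statement::CommandCall { name, args: inner }
    }
}

fn main() {
    // vector(2) = 12
    let tokens = vec![
        Token::LParen, Token::Integer(2), Token::RParen,
        Token::Equal, Token::Integer(12), Token::Eol,
    ];
    println!("{:?}", parse_symbol_statement("vector".to_owned(), &mut tokens.into_iter().peekable()));

    // PRINT(2)
    let tokens = vec![Token::LParen, Token::Integer(2), Token::RParen, Token::Eol];
    println!("{:?}", parse_symbol_statement("PRINT".to_owned(), &mut tokens.into_iter().peekable()));
}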

Labels

Labels did not exist in the original versions of BASIC, but they did appear in later implementations. One that brought labels is QuickBASIC, where a label is defined by appending a colon character to an identifier:

label: INPUT foo

Simple, right? Well, as it turns out… the colon character is also the statement delimiter in BASIC and can be used to separate different statements in a single line.

With that extra bit of knowledge, what does this mean?

PRINT: INPUT foo

Are we trying to define a label called PRINT, or are we calling the PRINT command without arguments? This is impossible to tell without knowing upfront that PRINT is a command, but that goes against the design constraints of the EndBASIC core language.

This is an ambiguity that cannot be resolved if I stick to those design decisions. I could obviously damage the AST by representing argless commands and maybe-labels in a type like Statement::LabelOrArglessCommandCall, but that sounds terrible. So, for now, EndBASIC requires labels to start with the @ sign; unorthodox, but it does the job given that I’m not targeting compatibility with other BASIC implementations.
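
For illustration, here is a tiny Rust sketch of why a distinct prefix removes the ambiguity at the lexer level: a label becomes its own token kind, so the parser never has to guess whether PRINT: introduces a label or a command. The lexer and token names below are made up for this example and do not reflect the actual EndBASIC implementation:

#[derive(Debug)]
enum Token {
    Label(String),   // @identifier
    Symbol(String),  // anything else: variable, command, function...
    StatementSep,    // the `:` statement delimiter
    Eol,
}

fn lex_line(input: &str) -> Vec<Token> {
    let mut tokens = vec![];
    let mut chars = input.chars().peekable();
    while let Some(ch) = chars.next() {
        match ch {
            // A label gets its own token kind the moment we see the `@`.
            '@' => tokens.push(Token::Label(take_word(&mut chars, String::new()))),
            ':' => tokens.push(Token::StatementSep),
            c if c.is_alphabetic() => {
                tokens.push(Token::Symbol(take_word(&mut chars, String::from(c))))
            }
            _ => (), // Skip whitespace and everything else for brevity.
        }
    }
    tokens.push(Token::Eol);
    tokens
}

// Consumes the remaining alphanumeric characters of a word.
fn take_word(chars: &mut std::iter::Peekable<std::str::Chars<'_>>, mut word: String) -> String {
    while let Some(c) = chars.peek().filter(|c| c.is_alphanumeric()) {
        word.push(*c);
        chars.next();
    }
    word
}

fn main() {
    println!("{:?}", lex_line("@retry"));       // Unambiguously a label.
    println!("{:?}", lex_line("PRINT: INPUT")); // Unambiguously two statements.
}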

END statements vs. control flow

The next difficulty appeared when trying to parse the END statement. You see, END can be used like this:

' Terminate program execution.
END

' Terminate program execution with code 1 (a later extension).
END 1

And it can also be used like this:

IF TRUE THEN
    END
END IF

Note that END has two very different meanings depending on the token that comes after it, if any.

I suspect that this is why other control flow primitives don’t use END to check for their termination: WHILE uses WEND, DO uses LOOP, and FOR uses NEXT. Having different keywords as block terminators makes it much easier to perform parsing because, for example, when you are parsing the contents of a WHILE loop, you can simply say “parse statements until you find WEND” and call it a day.

Of course I wondered why END IF was not written as a single keyword as well, such as FI (I see you, POSIX shell), because it breaks this theory and makes parsing unnecessarily hard. Looking at the original version of BASIC, though, reveals the answer: there was no END IF. The language only supported single-line IF statements, which made this point moot.
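
For what it’s worth, once you accept that END has two meanings, disambiguating it only needs one token of lookahead. Here is a hedged Rust sketch with invented names; this is not the real EndBASIC parser:

#[derive(Debug, PartialEq)]
enum Token { End, If, Integer(i32), Eol }

#[derive(Debug)]
enum EndKind {
    EndProgram(Option<i32>), // Terminate execution, optionally with an exit code.
    EndIf,                   // Close the innermost IF block.
}

fn parse_end(tokens: &mut std::iter::Peekable<std::slice::Iter<'_, Token>>) -> EndKind {
    assert_eq!(tokens.next(), Some(&Token::End));
    match tokens.peek() {
        Some(Token::If) => {
            tokens.next();
            EndKind::EndIf
        }
        Some(Token::Integer(code)) => {
            tokens.next();
            EndKind::EndProgram(Some(*code))
        }
        _ => EndKind::EndProgram(None),
    }
}

fn main() {
    for line in [
        vec![Token::End, Token::Eol],                    // END
        vec![Token::End, Token::Integer(1), Token::Eol], // END 1
        vec![Token::End, Token::If, Token::Eol],         // END IF
    ] {
        println!("{:?}", parse_end(&mut line.iter().peekable()));
    }
}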

References vs. values

Let’s switch topics to a slightly different difficulty: having to represent variable values and variable references as different elements in the AST. Consider the following code:

x = "A string"
PRINT LEN(x)

In this code snippet, the LEN function receives x as an argument, but LEN doesn’t care about x: it cares about the value of x, which is "A string".

However, consider this other code snippet:

DIM x(3)
PRINT UBOUND(x)

In this case, the UBOUND function receives x as an argument, and UBOUND does care about x itself, not its value, to inspect its size and return the array’s upper bound. (Because… what is the “value” of x without a subscript, anyway? An address? A copy of the array?)

The problem is… how can these be represented in the AST if we don’t know what each function needs or what each variable’s type is? Remember that the parser currently does not have this information. Up until now, the AST has just recorded that function calls take expressions as arguments, and one possible expression type is a symbol reference (Expr::Symbol(VarRef)). Both LEN and UBOUND receive unevaluated expressions, and then the implementation of LEN evaluates the expression to get its value, whereas the implementation of UBOUND does not evaluate the expression and instead uses the argument to look up the symbol’s definition.

This has worked OK so far but poses new problems when trying to compile expression evaluation into bytecode, which is something that EndBASIC 0.10 notably lacks but that I need to implement for “proper” compilation. When facing the x argument in the function calls above, the compiler cannot tell whether it should emit code to load the value of x or code to provide a reference to x. Resolving this would require making the parser aware of the valid functions and their signatures, or the compiler aware of variable types.
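
To illustrate the problem, here is a small Rust sketch of what such a compiler faces. The type names are invented (Expr::Symbol loosely mirrors the Expr::Symbol(VarRef) node mentioned above), and the point is that LEN(x) and UBOUND(x) produce structurally identical ASTs and therefore identical instructions, even though UBOUND would need a reference:

#[derive(Debug)]
enum Expr {
    Symbol(String),          // A bare reference to a variable or array.
    Call(String, Vec<Expr>), // A function call and its arguments.
}

#[derive(Debug)]
enum Instruction {
    LoadValue(String),           // Push the *value* of a variable.
    LoadRef(String),             // Push a *reference* to a symbol.
    CallFunction(String, usize), // Call a function with N arguments from the stack.
}

fn compile_expr(expr: &Expr, output: &mut Vec<Instruction>) {
    match expr {
        Expr::Symbol(name) => {
            // Problem: should this be LoadValue (what LEN wants) or LoadRef
            // (what UBOUND wants)? The AST alone cannot tell us.
            output.push(Instruction::LoadValue(name.clone()));
        }
        Expr::Call(name, args) => {
            for arg in args {
                compile_expr(arg, output);
            }
            output.push(Instruction::CallFunction(name.clone(), args.len()));
        }
    }
}

fn main() {
    // LEN(x) and UBOUND(x) parse to structurally identical ASTs...
    let len_call = Expr::Call("LEN".to_owned(), vec![Expr::Symbol("x".to_owned())]);
    let ubound_call = Expr::Call("UBOUND".to_owned(), vec![Expr::Symbol("x".to_owned())]);

    // ... and therefore get identical bytecode, which is wrong for UBOUND.
    for call in [&len_call, &ubound_call] {
        let mut output = vec![];
        compile_expr(call, &mut output);
        println!("{:?}", output);
    }
}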

Exponents and square root

Switching topics again, another oddity comes from the lack of orthogonality in a recent feature I implemented. Exponents are expressed as an infix operator (8 ^ 2) whereas square roots are expressed as a function (SQR(8)).

This does not pose a difficulty in parsing, but it clashes with the desire to keep the core parser and the standard library separate: why should exponents be in the core while SQR is in the standard library? This is inconsistent but cannot be easily resolved in the current implementation.

LET vs. not LET

And finally, LET. For as long as I can remember using BASIC myself, which is as far back as the 1980s with Locomotive BASIC 1.1, variable assignment has always been expressed as:

var = 1234

But the original BASIC also has a LET keyword to write variable assignments as:

LET var = 1234

The question is: why? Why have a LET keyword to assign variables at all when it can be omitted? This is the final telltale sign that gives insight into how the original parsing must have been done.

Welcome to the 1960s

From the looks of all of this, it is fairly reasonable to assume that original BASIC parsers read code line by line, did simple prefix matching to determine what to do on a single line, and then passed the remainder of the line to a routine to handle it.

Something like this:

pc = 10
while True:
    line = get_line(pc)
    if not line:
        break

    if line.startswith('PRINT'):
        do_print(line[len('PRINT'):])
    elif line.startswith('INPUT'):
        do_input(line[len('INPUT'):])
    elif line.startswith('NAME'):
        do_name(line[len('NAME'):])
    else:
        raise Exception('Unknown command')

    pc += 10

This makes a lot of sense considering the severe limitations of the computers that first ran BASIC when it was invented at Dartmouth in 1964: they had limited computing power, very limited memory, and the code for the interpreter must have been written directly in assembly.

With this perspective, it’s “obvious” why each command can behave differently: it is up to each command’s implementation to interpret the remainder of the program line as it sees fit. There is no AST. No abstraction over the program. No compilation. Just a loop that reads a line at a time and decides what to do about it right away.

In a modern implementation like EndBASIC, however, trying to represent this variety of commands in a generic form, while maintaining the separation between the core language and the standard library, is difficult. What I ended up doing is capturing the arguments as a list of expression/separator pairs and passing those to the command handler. This works, but it is ugly: each command has to implement its own argument parsing in a fragile way, and it makes compilation to bytecode difficult.
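
As a rough idea of what those expression/separator pairs can look like, here is a Rust sketch. The types below are illustrative only (literal values stand in for full expressions); they are not EndBASIC’s actual types:

#[derive(Debug)]
enum Value {
    Integer(i32),
    Text(String),
    Missing, // An empty field, as in PRINT 3, , 4.
}

#[derive(Debug)]
enum ArgSep {
    End,   // No separator: this was the last argument.
    Long,  // `,`
    Short, // `;`
    As,    // The AS keyword, as used by NAME.
}

fn main() {
    // PRINT 3, 4; 5, , 6 could be captured as the following pairs and handed
    // verbatim to PRINT's handler, which then decides what each separator means.
    let print_args = vec![
        (Value::Integer(3), ArgSep::Long),
        (Value::Integer(4), ArgSep::Short),
        (Value::Integer(5), ArgSep::Long),
        (Value::Missing, ArgSep::Long),
        (Value::Integer(6), ArgSep::End),
    ];

    // NAME "old name" AS "new name" fits the same scheme thanks to the AS separator.
    let name_args = vec![
        (Value::Text("old name".to_owned()), ArgSep::As),
        (Value::Text("new name".to_owned()), ArgSep::End),
    ];

    for (value, sep) in print_args.iter().chain(name_args.iter()) {
        println!("{:?} followed by {:?}", value, sep);
    }
}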

Potential fixes for EndBASIC

So, any lessons learned? What am I going to do about all of this?

I think keeping the EndBASIC core language vs. standard library separation is a worthy goal. But I also think that keeping the language interpreter completely unaware of symbols is a self-imposed limitation that makes some things hard (distinguishing between array assignments and command calls) and makes other things impossible (having labels without @ prefixes). This should change.

The language parser will probably have to gain a mechanism to register symbols upfront so that the standard library can register its commands and functions as desired. And the language parser and/or the newly-added compiler will also have to dynamically update this table as it parses code to recognize newly-defined variables, arrays, functions, and commands.
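
Here is a minimal Rust sketch of what such a symbols table could look like. The names are hypothetical; the point is that the standard library feeds it upfront, the parser updates it as it goes, and the parser consults it to disambiguate:

use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum SymbolKind {
    Variable,
    Array,
    Command,
    Function,
}

#[derive(Default)]
struct SymbolsTable {
    kinds: HashMap<String, SymbolKind>,
}

impl SymbolsTable {
    // Called by the standard library before parsing, and by the parser itself
    // whenever it sees a new definition (DIM, assignments, FUNCTION, ...).
    fn register(&mut self, name: &str, kind: SymbolKind) {
        self.kinds.insert(name.to_uppercase(), kind);
    }

    // Called by the parser to disambiguate, e.g., `vector(2) = 12` from
    // `PRINT(2)`, or a label definition from an argument-less command call.
    fn kind_of(&self, name: &str) -> Option<SymbolKind> {
        self.kinds.get(&name.to_uppercase()).copied()
    }
}

fn main() {
    let mut symbols = SymbolsTable::default();

    // The standard library registers its callables upfront...
    symbols.register("PRINT", SymbolKind::Command);
    symbols.register("LEN", SymbolKind::Function);

    // ... and the parser adds definitions as it encounters them.
    symbols.register("vector", SymbolKind::Array); // After DIM vector(10) AS INTEGER.
    symbols.register("x", SymbolKind::Variable);   // After x = "A string".

    assert_eq!(symbols.kind_of("print"), Some(SymbolKind::Command));
    assert_eq!(symbols.kind_of("VECTOR"), Some(SymbolKind::Array));
    assert_eq!(symbols.kind_of("undefined"), None);
}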

The other thing that will need to change is making the parser aware of the signatures of commands and functions. This is necessary to fix the compilation problem that arises in the LEN vs. UBOUND case, and also to support passing variables by value or by reference to custom procedures and functions (as QuickBASIC supports). The parser or the compiler needs to know that LEN receives an evaluated value whereas UBOUND receives an array reference in order to emit the right bytecode instructions.
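
Continuing with invented names, the signature metadata could be as simple as a per-parameter mode flag, which is all the compiler needs to choose between a "load value" and a "load reference" instruction:

#[derive(Debug)]
enum ParamMode {
    ByValue,     // The callable wants the evaluated value (LEN).
    ByReference, // The callable wants the symbol itself (UBOUND).
}

struct Signature {
    name: &'static str,
    params: Vec<ParamMode>,
}

fn main() {
    let signatures = vec![
        Signature { name: "LEN", params: vec![ParamMode::ByValue] },
        Signature { name: "UBOUND", params: vec![ParamMode::ByReference] },
    ];

    // With such a table, compiling LEN(x) vs. UBOUND(x) is no longer ambiguous:
    // the compiler looks up the callee and picks the right load instruction
    // (value or reference) for each argument.
    for sig in &signatures {
        println!("{} takes {:?}", sig.name, sig.params);
    }
}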

Finally, to fix the exponents vs. square roots inconsistency, operators should become function calls provided by the standard library. You could imagine the standard library asking the core to register an infix operator called ^ that is executed via a hidden EXP function, much like many modern languages do. This is, however, a minor annoyance compared to all the others, and I’m not sure it is worth fixing.

And that’s it. This was a very long post full of disconnected thoughts. I’m not sure if you have gotten anything useful out of it, but if you have feedback on the content or corrections on historical facts (which I’m sure are not accurate), please let me know!