Joyful Annotated ASTs
This post is a deep dive into the Abstract Syntax Tree (AST) representation used by the Praxis compiler. This includes a novel Haskell technique, that I have called introspection, which provides a way to easily manipulate mutually recursive structures in ways which would otherwise be extremely cumbersome. It assumes familiarity with Haskell, including Functor, Applicative, and Traversable typeclasses.
The Praxis AST
Let’s dive straight in to a trimmed-down version of the AST representation used by the Praxis compiler:
For the moment let’s ignore all the mentions of Annotated
, which we’ll look at in detail separately. It’s enough to know that Annotated a
means a
with some extra information attached.
It’s not important to understand all of the details, but looking at this representation should afford some insight to the basic structure of the Praxis language. We can see that a program (Program
) is a list of declarations (Decl
), and that a declaration is either a block of recursive declarations (DeclRec
) or a variable (DeclVar
) which has a name, an optional type signature (QType
), and a defining expression (Exp
). There are two flavours of types, Type
for monomorphic types and QType
for polymorphic types (the ‘Q’ standing for quantified). Finally, patterns (Pat
) appear in certain expressions, such as Case
expressions.
To make it easier to talk about, I’ll refer to these types collectively as terms, so that Exp
for example is one category of term. It’s worth pausing here for a moment and appreciating just how intertwined the categories are. We’ll see later how this makes even simple operations quite challenging.
Annotations
Now, let’s take a look at Annotated
:
A few things are worth explaining here:
Tag
is simply a glorified pair with lensestag
andvalue
to the first and second components (respectively).Annotated a
is ana
with a special type of tag,(Source, Maybe (Annotation a))
. There are two utility lensessource
andannotation
to the tag components.Source
(definition omitted) is a range of line and column numbers indicating the location of a substring in the source code. This is used purely for error messages.Annotation a
is a type family which allows us to annotate different categories of terms with different types.
Although every term has a Source
, only certain categories (Exp
, Pat
, and Type
) have an Annotation
. The annotation is wrapped in a Maybe
since it may not be present depending on the stage of compilation1. For example, Exp
and Pat
will only have their Type
annotations set after the type inference stage.
It’s worth mentioning that this approach is somewhat similar to Trees that grow2, but here our annotations only need depend on the category, not the individual constructors within a category. This is a major simplification that has proved to be sufficient for the Praxis compiler.
The Manipulation Problem
Let’s look at one sort of manipulation we want to do to our terms: substitutions. To provide a motivated example, we’ll look at substitutions in the context of type inference.
The goal of type inference is to set the type annotation of every expression and pattern in the program. This happens in two stages:
-
The first stage (constraint generation) traverses the program tree and annotates every expression with a type, however it is allowed to use type unification variables (α, β, …) which each represent a concrete but as-yet unknown type. At the same time constraints are generated involving these unification variables, such as
α ~ β
,α ~ (γ -> Bool)
, andIntegral γ
. -
The second stage (constraint solving) attempts to resolve the unification variables by deduction from the set of generated constraints. When a unification variable is resolved, e.g.
β := (Int -> Bool)
, the unification variable (β
) is replaced by the resolved type (Int -> Bool
) throughout the program, and in all the generated constraints. Type inference is completed when all unification variables have been resolved.
Let’s consider how we would apply a substitution α := t
throughout a program. In essence we need to find all the types within the program, and for each of them apply the substitution. So, let’s start with a function to apply the substitution to a type:
In other words, for a substitution α := t
, subTyUniInType
recursively descends through the type we want to apply the substitution to, leaving the annotation a
as-is at each level, and replacing every occurence of TyUni α
with t
.
So far this may not seem too bad, but now let’s write the substitution for programs proper. To do this we will keep in mind that types can appear in two places3: the annotation of an expression and the annotation of a pattern.
Now what if I told you that in addition to type inference, there is also kind inference, which operates in a similar way but using kind unification variables. For substitutions of these we’ll need a similar set of functions subKindUniInKind
, subKindUniInProgram
, subKindUniInDecl
, subKindUniInExp
, subKindUniInPat
, and subKindUniInType
.
So our desire to implement just one simple manipulation results in a quadratic explosion of tedious category-specific functions. This many functions is not only a pain to write, it’s also exceedingly easy to forget about a recursive case or forget to substite through an annotation. Suffice it to say this does not spark joy.
You may want to pause here and think if there are any alternatives that would let us solve our substitution requirement more easily. In the following sections we’ll see that this can be done, and it can be done in a generic way which lets us perform countless other manipulations in just a handful of lines.
Introspection
The key idea is we need some way to abstract over the category of a term. Then, instead of needing functions for every category (i.e. from Exp
to Exp
, and from Pat
to Pat
, and so on), we can have one single function from term to term.
One simple solution is to have a Term
type which is a union over all the categories, say:
This works, but we can do better. A weakness of this approach is that we lose the ability to express at the type level that a transformation preserves the category.
Instead, the approach used in the Praxis compiler is a Term
typeclass. Then a category-preserving transformation is typed as Term a => a -> a
. To implement the Term
typeclass, we leverage GADTs:
This construction lets us perform case distinctions on the category of an arbitrary term. To see what this looks like practice, let’s compare it to the Term
type approach for lifting a function over Exp
to a function over terms:
There are a few subtleties with the typeclass solution:
- The
a
inwitness :: I a
refers to the type ofx
, thanks to the ScopedTypeVariables extension. - When the
IExp
pattern matches, Haskell deduces thata ~ Exp
sinceIExp :: I Exp
. This means we get to usex :: Exp
in the bodyf x
.
As alluded to earlier, here the typeclass approach gives us a type-checked guarantee that the category is preserved.
Traversals
An addition to the above, we require terms to implement traverse-like semantics. This will let us encapsulate the recursive structure of terms, allowing us to then succinctly write manipulations like substitutions.
We’ve added two functions:
recurseTerm
applies an applicative action to every subterm (of an unannotated term).recurseAnnotation
applies an applicative action to every subterm in an annotation. This takes an additionalI a
argument which is needed to unambiguously determinea
(sinceAnnotation
is non-injective).
These can be combined into a single recurse
function which performs both actions:
Although the types of recurseTerm
and recurseAnnotation
may look quite scary, the implementations are quite straightforward in practice, following a similar pattern to a Traversable
:
Manipulations Revisited
Finally, we can leverage the completed Term
typeclass to write a generic substitution, using Identity
as the applicative:
For the original problem of wanting to substitute a type unification variable, we can call sub
as follows:
We can write other manipulations too, for example summing over a monoidal action using Const
as the applicative:
We can use this to check if a term contains any type unification variables, which tells us whether or not type inference is completed:
That’s all there is to know about the Praxis AST. May your trees by joyful :)
-
Instead of an optional annotation, early versions of the compiler included an additional
Stage
enum in theAnnotation
family. However this was removed as it had marginal benefit and made introspection significantly more complex. ↩ -
Types may also be present in
DeclVar
, but this is ignored for simplicity (and since the type is a user-provided type signature it won’t contain any unification variables in practice). ↩