Thursday, April 27, 2017

A Consistent Type System Foundation

The type system is the cornerstone of a programming language. It is the essential enabler for how a program's sought behavior can be expressed and it is the chief controller preventing behavior that is unintended.

Different languages emphasize these two roles to varying degrees. Sometimes one is given clear priority over the other. Some dynamically types languages take the extreme approach of not having the type system control anything at all, which can make for a very easy and low-friction programming experience for smaller and simpler tasks. But until the code is actually executed you never know exactly what is going to happen.

Haskell, the pure functional language, has the strictest type system I've experienced and it takes a bit of getting used to. It is mathematically and semantically consistent and it really does go a long way in helping ensure the program will be correct once it passes the type checker and compiles. However I'm not sure you could say it is easy to work with...

Tuplex is an imperative language, facilitating object-oriented, data-oriented, and functional programming styles. While a goal is to be as easy as possible to learn, read, and write, it should also facilitate high-performance implementation, as well as memory safety in multi-threaded programs.

Easiness means the type system shall be consistent in syntax and semantics. The basic type syntax shall be easy to learn for reasonably experienced programmers. The manners in how basic types are recombined into advanced ones shall appear straight-forward and predictable.

Performance means low-level data structure concepts shall be reflected among the basic types.

Memory-safety means there shall be strict rules governing references, uninitialized memory, and sharing of data between threads. (Note that strict rules does not necessarily mean poor ease of use.)

Common Type Hierarchy

The type system is akin to a pure OO type system, with some distinctions. Types are not referred to as "classes". The Tuple types and their instances are analogous to what is called classes and objects in other OO-languages.

All values in Tuplex are referred to as objects. Some objects are elementary which means they are not usually subdivided into smaller data parts - like integers, booleans, and references. Other objects are composite which means they are composed of a number of smaller objects.
Note: Tuplex objects are not "boxed". Neither the elementary types nor the composite ones have a "header" or other construct with a memory footprint adjacent to the object value.
Like all OO-type systems, a type derives the characteristics of its parent, and usually adds some of its own (e.g. new fields and methods). This way all the types are organized into a hierarchy and the strengths of OO within polymorphism and behavior reusability can be utilized.

Any

Being part of the common type hierarchy, all types ultimately derive from the global root type at the top, named Any. All types thus derive the characteristics of Any. They aren't very many. Currently planned are methods such as hashcode(), equals(), possibly string().

Type Classes

Any is not derived from directly in user code. At the level below Any we instead find the built-in root types for each type class:

Elementary - scalar numbers like Int and Float, and Bool.

Ref - reference to an object of a certain type somewhere in memory.

Array - zero or more objects of the same type serially arranged in memory.

Tuple - zero or more objects of varying types grouped together in memory.

Interface - zero object memory footprint, but defines methods that the object will support.

Function - a callable code block with a defined function signature (argument and return types).
Note: Function is often short-hand for any callable including lambdas, static methods, instance methods, and free functions. The share a uniform syntax and handling.
The purpose of each type class is to represent and efficiently implement a specific low-level behavior. Elementary types represent the conventional single-register values for numeric and boolean arithmetic.  Arrays represent - arrays, the mainstay of efficient computing. Functions represent executable machine code and interfaces represent a set of methods that shall be provided by all implementors of that interface.

All the type classes are basic building blocks in the sense that none of them could be represented in terms of the other ones. As such the type classes constitute a fully reduced set of type building blocks. (To be picky, there is one - conceptual - exception if performance were ignored: An array is a special case of a tuple, where all elements are of the same type.)

Additional type classes may be added in the future: Union as a type-safe sum type, and Vector to provide effective built-in support for SIMD computing.

Object Types

Interfaces are zero object footprint - they do not define any data content for their objects. Functions define code, not data. Interfaces and functions are thus not object types.

The other type classes - Elementary, Ref, Array, and Tuple - are the object type classes.

Single Inheritance

Tuplex is single-inheritance in that each type derives from a single base type. In addition to the base type a type may implement a number of interfaces.
User-defined interfaces have Interface as their main base type.
This means that each type belongs to a specific type class.

Generic Types

Generics, a.k.a. parameterized types, and templates in C++, are a core part of the type system. Ref and Array themselves are actually generic types, these are their built-in declarations:

public Ref<T> : Any { ... }

public Array<E, L : UInt> : Any { ... }

Generics is a complex subject and their implementation has been the most difficult part of the compiler. They will be better covered in a post of their own!

Thursday, March 30, 2017

The Compilation Passes

How many parser / compiler passes should be considered "many"?

The Tuplex compiler has four. Or five. Or seven... What do we count? You could count the lexical and syntax parsing as two separate passes. Or just as two parts of a single one since they combine into a single linear pass over the source text.

The semantic analysis has four distinct steps. But they don't all traverse the full program so do they count as separate passes?

Lexical Analysis

The lexical analysis scans the source text, removes comments and white space, and transforms it into a sequence of tokens, e.g. '+' or 'variable_name'.

The lexical scanner of Tuplex is written using Flex.

Syntax Analysis

The syntax analysis processes the token sequence and matches it to the grammar rules. Each matched grammar rule typically produces a node in the Abstract Syntax Tree (AST).

The syntax parser is written using Bison.

The grammar of Tuplex is a context-free, deterministic LR grammar.

Semantic Analysis

The semantic analysis is where it gets really interesting! This is where the meaning of the program is processed. The design and internal model of the semantic analyzer is unique for each language, although it will have conceptual similarities with compilers of other high-level languages in the same genre.

An important part of the vision for Tuplex is that all the core constructs be first-class citizens, general, and orthogonal. This puts a lot of complexity on the semantic analyzer. It's been remodeled and refactored a number of times, mostly driven by of the demands of the type resolution system.

Since Tuplex supports declaring types and fields in any order, the semantic analysis needs to be divided into two separate passes.

1 - The Declaration Pass

The declaration pass traverses the entire AST from start to finish and builds the namespaces and their symbol tables. It essentially maps each and every AST node to a scope in the namespace hierarchy, along with registering all declaration attributes and and flags together with the declaration names.

2 - The Resolution Pass

The resolution pass traverses the entire AST a second time. Since the declaration pass has completed before this, all the names are known and can be cross-referenced (resolved). The compiler can now create the internal representation of all the program entities - types, fields and functions.

This is where three major semantic operations are performed:

  • AST transformations that depend on knowledge of the types / fields being used - e.g. inlining and special semantics for certain constructs like references and constructors.
  • Type derivations - setting up the internal representation of all the types defined in the program.
  • Type specializations - reinterpreting (copying and re-analyzing) of the AST sub-trees of the generic (parameterized) types that are being specialized with concrete type arguments.

2b - Deferred Type Resolutions

Certain parameterized type definitions are semantically recursive, even while being legal and well defined. Consider the definition of the Enumerable interface:

public interface Enumerable< E derives Enumerable<E> >

The constraint on the E parameter refers to the type being declared by this statement, and in addition the constraint refers to the parameter itself. To resolve such definitions the compiler works with a deferred ("lazy") resolution model. At the point where the type is defined a proxy is created for it, and put on the type resolution queue to be fully instantiated and laid out later.

2c - Data Type Layout

When all types are resolved they are mapped to static data types. The static data types are the distinct data types that will actually be used in runtime. They get assigned a distinct type id, a v-table entry, and tuples get their instance data laid out.


Code Generation

This pass invokes the LLVM API to generate LLVM IR for the program.


Theoretically the compiler could probably be condensed into two start-to-end passes over the source, with some lazy evaluation in the mix. However this would be very difficult to work with, even if the language was completely defined to begin with! These passes have evolved over the compiler development process and do make for effective separations of concern for different parts of the compilation.