Chapter 7. Under the Hood

Table of Contents
Memory Management
Performance

This chapter discusses what goes on underneath the hood of SML/NJ and CML. First I will spend a little time discussing how memory is used in the SML/NJ run-time. Then I will examine the performance of some test programs. My goal is to give you a feel for the performance of the SML/NJ system in comparison with the C language.

Memory Management

This section describes the design of the SML/NJ heap system. It is based on a multi-generational copying garbage collector (GC).

Garbage Collection Basics

A copying garbage collector works by having two memory spaces, a "from" space and a "to" space. Heap objects are allocated in the "from" space until it is full. Then the objects that are still live are copied to the "to" space and the two spaces are swapped. After the swap the "from" space holds all of the live objects and the "to" space is empty again.

Figure 7-1 illustrates these steps. Memory is allocated from the top of the "from" space advancing downwards. The arrow marked "next" is the position for new objects. The arrow is advanced down by the size of the object. When it reaches the bottom the "from" space is full.

The grey regions are live objects. A live object is any object that can be reached by following pointers starting from any of several root objects. All other objects are garbage to be removed. As the GC visits each live object it copies it to the "to" space. Then the "to" space is relabelled the "from" space and the "to" space becomes empty again. The "next" arrow is reset to the end of the copied objects ready for new objects to be allocated.

Figure 7-1. Steps in Copying Collection.
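The steps above can be modelled in a few lines of SML. This is only a toy sketch under simplifying assumptions, not the SML/NJ collector: an object is a list of fields, a pointer is an index into a space, "next" counts whole objects upwards rather than words downwards, and the names field, collect and forward are made up for the illustration.

```sml
(* A field is either plain data or a pointer (an index into the space). *)
datatype field = Data of int | Ptr of int
type obj = field list

(* Copy everything reachable from the roots out of the "from" space.
   Returns the new space and the relocated roots. *)
fun collect (from : obj vector, roots : int list) : obj vector * int list =
  let
    val n = Vector.length from
    val forward = Array.array (n, ~1)        (* ~1 means "not copied yet" *)
    val toSpace = Array.array (n, [] : obj)
    val next = ref 0                         (* next free slot in "to" space *)

    (* Copy object i once and remember where it went. *)
    fun copy i =
      case Array.sub (forward, i) of
        ~1 => let val j = !next
              in
                Array.update (forward, i, j);
                Array.update (toSpace, j, Vector.sub (from, i));
                next := j + 1;
                j
              end
      | j => j

    fun fix (Ptr i) = Ptr (copy i)
      | fix d = d

    val newRoots = map copy roots

    (* Scan the copied objects, copying whatever they point to. *)
    fun scan k =
      if k < !next then
        (Array.update (toSpace, k, map fix (Array.sub (toSpace, k)));
         scan (k + 1))
      else ()

    val () = scan 0
  in
    (Vector.tabulate (!next, fn k => Array.sub (toSpace, k)), newRoots)
  end
```

Calling collect on a four-object space where only objects 0 and 2 are reachable from the root returns a two-object space. The two dead objects are never looked at, which is the point of the first reason given below.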

It might seem that copying the live objects would make the GC slow. But actually it's quite fast compared to other kinds of collectors. The reasons for this are:

  • Only live objects are visited while tracing the pointers and only live objects are copied. The fraction of the "from" space that is live can be quite small, perhaps around 10%, for functional languages like SML which allocate many transient objects. Since the dead objects are never visited the cost of deleting them is zero.

  • The allocation is very fast. It only takes a few machine instructions to compare the "next" pointer with the bottom of the "from" space and advance it by the size of the object.

  • After a collection the live objects have been coalesced into one memory region. This reduces the number of virtual memory pages occupied by the heap which can help with the program's performance.
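The allocation check in the second point can be sketched as follows. This is a toy model that assumes addresses are plain integers and that the space runs from top down to bottom; the names bottom, top, next and alloc are hypothetical and are not taken from the run-time.

```sml
(* Toy model of allocation in the "from" space. *)
val bottom = 0           (* lowest address of the space *)
val top = 1024           (* allocation starts here and moves down *)
val next = ref top

(* Returns SOME address for the new object, or NONE when the space is
   full and a collection would be needed. *)
fun alloc (size : int) : int option =
  if !next - size < bottom then NONE
  else (next := !next - size; SOME (!next))
```

The whole operation is one comparison and one subtraction, which is why it can compete with pushing a frame onto a stack.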

When the cost of the copying is amortized over all of the objects that were allocated in the "from" space, the cost per object is very low. It's low enough that SML/NJ does not use a separate stack for the activation records of called functions (which contain the local variables). Instead everything is allocated in the heap and the speed is competitive with stack allocation. (See [Appel1] for a detailed analysis of the costs.) Compare this with C where you are taught that allocating objects in the heap is much slower than allocating on the stack.

Allocating activation records in the heap makes the implementation of continuations very easy and fast which in turn makes CML efficient. In effect the heap contains the stacks of each of the threads. Thread switching is fast and the GC will clean up when they terminate.

You might be worried that the copying collector wastes memory since only half of the heap space, the "from" space, is used for allocation. But no physical memory needs to be allocated to the "to" space until the copying starts and it can be removed again when the spaces are swapped. The peak amount of memory used is the size of the "from" space plus the size of the live objects (as they fill the "to" space).

Multi-Generational Garbage Collection

Even though the copying of live objects in the basic copying collector is not that slow, as explained above, it can still be improved upon. SML/NJ actually uses a multi-generational copying collector (MGGC).

The idea is that most objects are either transient and die soon or else they are long-lived. An MGGC attempts to identify the long-lived objects and copy them less often. The GC has multiple heaps called generations. A new object is allocated into the first generation. If it persists for some number of collection cycles then it is promoted into the second generation, and so on through the older generations. Older generations are scanned less often than younger ones. For example there might be only one scan of the second generation for every 10 scans of the first generation. This reduces the number of times that long-lived objects are copied at the cost of delaying their eventual collection and increasing the peak memory usage.

SML/NJ version 110.0.7 uses 5 generations. Each generation is a copying GC with a "from" space and a "to" space. Each older generation is scanned 5 times less often than the previous one. Persistent objects slowly migrate to the oldest generation. A "minor" collection just scans the first generation. A "major" collection scans the older generations and looks for opportunities to promote objects to the next older generation.
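To get a feel for how rarely the older generations are touched, here is a small sketch of such a schedule. It assumes a fixed rule where each generation is scanned on every fifth scan of the one before it; the real collector may decide differently, and the function name oldestScanned is made up for the illustration.

```sml
(* Given the number n of the current minor collection (counting from 1),
   return the oldest generation that would be scanned, assuming each
   generation is scanned once per `ratio` scans of the previous one. *)
fun oldestScanned (ratio, ngens) n =
  let
    fun go (g, period) =
      if g < ngens andalso n mod (period * ratio) = 0
      then go (g + 1, period * ratio)
      else g
  in
    go (1, 1)
  end
```

With a ratio of 5 and 5 generations, collection number 5 reaches generation 2, number 25 reaches generation 3, and only every 625th collection reaches generation 5.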

The SML/NJ GC has other optimisations too. Each generation is actually divided into arenas that group together objects according to their kind: records, list cells, strings and arrays. There is a separate area for "big" objects which are never copied. Currently the only big objects are those containing compiled code.

On most Unix systems the memory for the heap spaces is allocated using the mmap system call. The C malloc function continues to work separately for interfacing with the standard C library.

Run-Time Arguments for the Garbage Collector

The SML/NJ run-time takes the following arguments for the garbage collector.

@SMLalloc=<size>

This sets the size of the area where new objects are allocated, in generation 0. The size can have a scale of K or M appended. The default is 256K bytes. Increasing this will improve the performance for programs requiring lots of memory. You will need to experiment to find the best value.

@SMLngens=<int>

This sets the number of generations. The default is 5. You cannot set more than 14. Increasing the number of generations should reduce the amount of copying at the cost of consuming more memory. You probably don't need to change this.

@SMLvmcache=<int>

When the "from" space is emptied its memory can either be returned to the operating system or kept by the run-time. This argument sets the number of generations for which the memory is kept. The default value is 2, meaning that the "from" space memory for the first 2 generations is not returned to the operating system after the copying is done. This avoids the overhead of frequently freeing and reallocating the memory. You probably don't need to change this.

Heap Object Layout

In this section I describe the layout of the different kinds of heap objects: records, list cells, strings and arrays. I won't include complete details, just the gist of it so that you can get an idea of the memory usage for SML types.

The biggest influence on the object layout is the need for the GC to be able to find all of the pointers in an object without having the details of the SML type that the object represents. This is achieved through two features of the layout:

  • every object is preceded by a descriptor word that contains some type information for the whole object;

  • every word in the object can be identified from a descriptor or its contents as being either data or a pointer.

The contents of strings and numeric values are known to be data just from the descriptor. In a record each field is a single 32 bit word. The pointers in the record fields are distinguished by examining the low-order 2 bits of each word. The possible combinations are:

Table 7-1. The Low-Order Bits of a Record Field.

B1   B0   Description
0    0    The field is a pointer with 32-bit alignment.
1    0    The field contains an object descriptor.
x    1    The field contains a data value in the upper 31 bits, for example the Int.int type.

So all data values in a record field must occupy at most 31 bits. Anything larger must be in a separate object on the heap pointed to from the first record. The first case is called an unboxed value and the second is called boxed.
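The test on the low-order bits can be written out as a short SML function. This is a sketch that models a record field as a Word32.word; the datatype kind and the function classify are names invented for the illustration.

```sml
datatype kind = Pointer | Descriptor | Unboxed

(* Classify a 32-bit word by its low-order 2 bits, following Table 7-1. *)
fun classify (w : Word32.word) : kind =
  if Word32.andb (w, 0w1) = 0w1 then Unboxed          (* B0 = 1 *)
  else if Word32.andb (w, 0w2) = 0w2 then Descriptor  (* B1 B0 = 1 0 *)
  else Pointer                                        (* B1 B0 = 0 0 *)
```

This is all that the GC needs to know about a field. It never has to consult the SML type of the record.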

The SML type Int.int is a 31 bit integer that is stored shifted left by 1 bit with the lower bit containing a 1 as shown in Table 7-1. You might think that it would be expensive to manipulate these integers since the machine code would have to shift the integer right when extracting it from the word and shift it left to store it again. But most of these shift operations can be avoided. No shifting is required to copy or compare the integers. Addition and subtraction only require that one of the words have its bit 0 cleared before proceeding. This is easy to arrange at no cost when one of the operands is a constant. The remaining operations including multiplication and division are relatively rare.
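The arithmetic on the shifted representation can be checked with a few lines of SML. This sketch models the machine word with Word32.word; the function names are made up and this is not how the compiler itself is written.

```sml
(* Store an integer shifted left by 1 with bit 0 set, and get it back. *)
fun tag (n : int) : Word32.word =
  Word32.orb (Word32.<< (Word32.fromInt n, 0w1), 0w1)

fun untag (w : Word32.word) : int =
  Word32.toIntX (Word32.~>> (w, 0w1))       (* arithmetic shift keeps the sign *)

(* Clear bit 0 of one operand; no shifting is needed. *)
fun clearTag (w : Word32.word) : Word32.word =
  Word32.andb (w, Word32.notb 0w1)

fun addTagged (a, b) = Word32.+ (a, clearTag b)
fun subTagged (a, b) = Word32.- (a, clearTag b)

(* Adding the constant 1: the cleared, shifted constant is just 2. *)
fun incrTagged a = Word32.+ (a, 0w2)
```

For example, 2 is stored as the word 5 and 3 as the word 7. Adding 5 to 7 with its low bit cleared gives 11, which is the stored form of 5.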

Since pointers are always word-aligned their low 2 bits are always zero so this fits the scheme at no extra cost.

The pointer to an object points to the first word after the descriptor. Descriptors are distinguished from all other words by their low 2 bits so that you can have a pointer into the middle of an object. The GC can always scan backwards from the pointer to find the descriptor at the top. The next 4 bits, at positions 2-5, contain a tag that indicates if the object is a record, string, array, list pair, floating point (double precision) or other kind of object.

Some objects, such as records, strings and arrays have a built-in length. This is stored in the remaining 26 bits of the descriptor word. The memory usage of a string is rounded up to a multiple of 4 bytes. This includes a terminal NUL character for compatibility with C. The length does not count the descriptor.
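The packing of a descriptor word can be sketched in SML as follows. The field positions come from the description above, but the function names are hypothetical and the tag number used in any call is a made-up value, since I am not giving the real tag numbers here. The stringBytes function assumes the rounding rule just described.

```sml
structure W = Word32

(* Build a descriptor: length in the upper 26 bits, a 4-bit tag in
   bits 2-5, and the pattern 1 0 in the low 2 bits. *)
fun mkDesc (tag : W.word, len : W.word) : W.word =
  W.orb (W.<< (len, 0w6), W.orb (W.<< (tag, 0w2), 0w2))

fun descTag (d : W.word) : W.word = W.andb (W.>> (d, 0w2), 0wxF)
fun descLen (d : W.word) : W.word = W.>> (d, 0w6)

(* Bytes occupied by the characters of a string of n characters: one more
   for the terminal NUL, rounded up to a multiple of 4. The descriptor
   word is extra. *)
fun stringBytes (n : int) : int = ((n + 1 + 3) div 4) * 4
```

So the string "abc" fits in one word plus its descriptor while "abcd" needs two words plus its descriptor.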

Figure 7-2 shows the layout of the objects corresponding to this SML record value:

val x = {a = 2, b = 3:Int32.int, c = "abc", d = 3.14159}

Figure 7-2. The Layout of a Record.

The 32 bit integer is stored boxed as a byte vector, similar to a string. Real numbers are stored as 64 bit double precision floating point (and the length field is unused).

So the expression

Array.array(10, 1): Int32.int Array.array

will allocate 11 words for the array (a descriptor plus 10 fields) and 2 words for each of the 10 boxed elements, for a total of 31 words. The size would be only 11 words if the element type was Int.int.
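The word counts can be written as a small calculation. This is only a sketch of the arithmetic, assuming that every element is a separate boxed value made of a descriptor and one data word; the names are invented for the illustration.

```sml
val boxedInt32Words = 2                   (* descriptor + one 32-bit data word *)

fun arrayWords n = 1 + n                  (* descriptor + n one-word fields *)

(* An array of n boxed Int32.int values. *)
fun int32ArrayWords n = arrayWords n + n * boxedInt32Words
```

For 10 elements this gives 31 words for the boxed array against 11 words for an array of Int.int.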

List cells are similar to records with two words for the head and tail and a descriptor. The empty list is represented by a zero pointer. The SML option type is similar. The NONE value is represented by a zero pointer while (SOME a) is represented by a record of length one containing the value. Datatypes are also like records with an extra discriminant field for the constructor. I don't have any more details on their representation.