Memory bugs, which are essentially mistakes in the management of heap memory, are caused by a number of factors and can occur in any program that is being written, enhanced or maintained. The fact that memory bugs can be introduced at any time is part of what makes memory debugging challenging. This is especially true for codes that are written collaboratively or maintained over a long period of time, where assumptions about memory management can change or go uncommunicated.
Memory bugs can also lurk in a code base for long periods of time. This can happen because they are generally invisible to the compiler and are often not immediately fatal. During development and prototyping, the bug may simply result in the program using up a few more bytes of memory, something the developer is unlikely to even notice at first. The memory bug then suddenly emerges as an issue when the program is put into production, ported to a new architecture, scaled up to a larger problem size, or when code is adapted and reused from one program to another.
These challenging memory bugs often manifest in one of several ways: as a crash that always happens, as a crash that happens only sometimes (instability) or simply as incorrect results. Furthermore, they are difficult to track down with commonly used development tools and techniques, such as printf() and traditional source-code debuggers, which are not specifically designed to solve memory problems.
Adding parallelism to this mix makes things even harder. A parallel application is typically written only when the problem itself is large, so these programs must be able to crunch a significant amount of data. While the special-purpose high-performance computing systems used to run them have many terabytes of memory and data storage available, they frequently have less memory per node than even a desktop development system. This leaves parallel programs squeezed between two effects: a significant amount of data to crunch and not a lot of space to work in. The result is that these programs need to be extraordinarily careful with memory.
The stakes of troubleshooting memory bugs on large parallel systems are high: a single memory error can crash, invalidate the results of, or slow to a crawl a parallel job that consumes many thousands of CPU (central processing unit) hours.
Classifying Memory Errors
Programs typically make use of several different categories of memory that are managed in different ways. These include stack memory, heap memory, shared memory, thread private memory and static or global memory.
Of these categories, programmers are required to pay special attention to memory that is allocated out of the heap, because heap memory is managed explicitly by the program rather than implicitly at compile time or run time. There are a number of ways that a program can fail to make proper use of dynamically allocated heap memory, as outlined below.
Malloc Errors

Malloc errors occur when a program passes an invalid value to one of the operations in the C heap manager API (application programming interface). This can happen, for example, if the value of a pointer (the address of a block) is copied into another pointer and, at a later time, both pointers are passed to free(). The second free() is incorrect because the specified pointer no longer corresponds to an allocated block, and the behavior of the program after such an operation is undefined.
Dangling Pointers

A pointer is said to be dangling when it references memory that has already been deallocated. Any access through a dangling pointer, whether a read or a write, can lead to undefined behavior. Programs with dangling-pointer bugs may appear to function without any obvious errors, sometimes for significant amounts of time, as long as the memory the dangling pointer refers to happens not to be recycled into a new allocation while it is being accessed.
Memory Bounds Violations
Individual memory allocations that are returned by malloc() represent discrete blocks of memory with defined sizes. In this case, any access to memory immediately before the lowest address in the block or immediately after the highest address in the block results in undefined behavior. If the allocated block is being used as an array, this error could, for example, be the result of a classical “off by one” array bounds error.
Uninitialized Memory Reads

Reading memory before it has been initialized is a common error. Some languages assign default values to uninitialized global memory, and many compilers can identify when local variables are read before being initialized. What is more difficult, and generally possible only at runtime, is detecting when memory accessed through a pointer is read before being initialized. Dynamic memory is particularly affected, since it is always accessed through a pointer and, in most cases, the content of memory obtained from the memory manager is undefined.
Memory Leaks

Memory leaks occur when a program finishes using a block of memory and discards all references to the block but fails to call free() to release it back to the heap manager for reuse. The result is that the program can neither make use of the memory nor reallocate it for a new purpose.
The impact of leaks depends on the nature of the application. In some cases, the effects are very minor; in others, where the rate of leakage is high enough or the runtime of the program is long enough, leaks can significantly change the memory behavior and the performance characteristics of the program.
For long-running applications or applications where memory is limited, even a small leakage rate can have serious cumulative and adverse effects. This makes leaks even more problematic since they often linger in otherwise well-understood codes. It can be a challenging task to manage dynamic memory in complex applications in order to ensure that allocations are released exactly once so that malloc and leak errors do not occur.
Using a Memory Debugger to Make Parallel Development More Efficient
In order to successfully find and fix these complex memory errors, developers should employ a memory debugging tool that is specifically designed to identify and resolve memory bugs. An effective memory debugger provides users with the ability to compare memory statistics, look at heap status, detect memory leaks and detect heap bounds violations. Detailed information about individual processes, as well as high-level memory usage statistics across all of the processes that make up a large parallel application, should also be available.
Using a Memory Debugger to Compare Memory Statistics
Many parallel and distributed applications have known or expected behaviors in regards to memory usage. They may be structured so that all of the nodes should allocate the same amount of memory, or they may be structured so that memory usage should depend in some way on the MPI_COMM_WORLD rank of the process. If such a pattern is expected or if the user wishes to examine the set of processes to look for patterns, overall memory usage statistics should be viewed in a graphical form (line, bar and pie charts for example) for one, all, or an arbitrary subset of the processes that make up the debugging session.
The user may drive the program to a new point in execution and then update the view to look for changes. If any processes look out of line, the user will likely want to look more closely at the detailed status of the heap memory for the outlier processes.
Using a Memory Debugger to Look at Heap Status
At any point where a process has been stopped, developers should be able to obtain a view of the heap through a graphical display. An effective display should paint a picture of the heap memory in the selected process, giving the user a way to see the composition of the program’s heap memory at a glance.
Using a Memory Debugger to Detect Leaks
Leak detection can be done at any point in a program's execution. As discussed, leaks occur when the program ceases to use a block of memory without calling free(). An advanced memory debugger performs leak detection by checking whether the program still retains a reference to each allocated block.
Heap memory leak detection can be done by driving the program to a location where the memory behavior should be well defined, such as right after initialization, at the transition between two phases of the program, at a specified iteration in programs that are structured around iterations, or by simply halting the processes of a running parallel application. The user should then be able to perform a leak detection analysis on the program based on its state at that point.
A block of memory to which the program no longer stores a reference anywhere is flagged as a leak, because without a reference the program is highly unlikely ever to call free() on that specific block. It should also be possible to represent the leaked blocks graphically in the context of other, non-leaked allocations. This helps the developer recognize patterns that could point to the original cause of the leak.
Using a Memory Debugger to Detect Heap Bounds Violations
Heap bounds violations occur when an error in the program logic causes the program to write beyond the ends of a block of memory allocated on the heap. The malloc API makes no guarantee about the relative spacing or alignment of memory blocks returned in separate memory allocations — or about what the memory before or after any given block may be used for. Consequently, the result of reads and writes before the beginning of a block of memory, or after the end of the block of memory, is undefined.
Blocks are often contiguous with other blocks of program data. Therefore, if the program writes past the end of an array, it usually overwrites the contents of some other, unrelated allocation. If the program is re-run and the same error occurs, the ordering of allocations may differ and the overwriting may land in a different array. This leads to extremely frustrating "racy" bugs that manifest differently from run to run, sometimes crashing the program, sometimes producing bad data, and sometimes altering memory in a way that turns out to be completely harmless.
An effective memory debugger will provide a mechanism that sets aside a small amount of memory before and after each heap block as it is allocated. Since this memory, called a "guard block," is not part of the allocation, the program should never read or write that location. Guard blocks should be initialized with a known pattern and later checked for changes; any change means the program wrote past the bounds of the allocation.
Tough to Pin Down
Memory bugs are often a source of great frustration for developers because they can be introduced at any time and are caused by a number of different factors. They can also lurk in a code base for long periods of time and manifest themselves in a variety of ways.
This makes memory debugging a challenging task, especially in parallel and distributed programs that process significant amounts of data within tight memory constraints. Commonly used development tools and techniques are not specifically designed to solve memory problems and can make finding and fixing memory bugs even more complex. Developers should employ an effective memory debugging tool, specifically designed to identify and resolve memory bugs in parallel and distributed applications, to make memory debugging more efficient and to develop higher-quality applications more quickly.
Chris Gottbrath is product manager for the TotalView Debugger and MemoryScape product lines at TotalView Technologies. His work is focused on making it easier for programmers, scientists and engineers to solve even the most complex bugs and get “back to work.”