CSI: Crash Scene Investigation
Volume Number: 25
Issue Number: 12
Column Tag: Debugging
Examining crashes to catch the culprit
by David Garcea
The 911 Call
You've written an application. Great! However, every application will crash eventually. Every crash is a mystery waiting to be solved, and the deduction and experimentation required to solve it can make you feel like Gil Grissom, but first you must know it happened. When an application crashes, the Crash Reporter will gather information about it and send it to Apple. This works great for Apple's products, but it doesn't help third-party developers. You will have to do some coding to redirect this information to you. There are several packages that you can use to accomplish this, or you can do it all yourself. Here are a few of the free, ready-made options:
- Smart Crash Reports by Unsanity is an InputManager-based enhancement for Apple's Crash Reporter application, which causes the crash report to be posted to a CGI on either your web server, or on Unsanity's server, as well as sending it to Apple. http://www.smartcrashreports.com
- HDCrashReporter resides solely inside your application, but requires the user to relaunch your application after a crash, in order to email the report to you. The source code is available under the GNU Lesser General Public License, so you can customize it. http://www.profcast.com/developers/HDCrashReporter.php
- ILCrashReporter is a framework that contains a custom CrashReporter application. When you start your application, you launch the CrashReporter and it watches your process for unexpected termination. When a crash occurs, the crash report and console log are emailed to you. The source code for both the framework and application are available. http://www.infinite-loop.dk/developer
If you choose one of the above options, you may still want to expand on the information they gather. To avoid delays caused by going back to ask for more information, get everything you might need all at once. In addition to the crash report, you should acquire a system profile to determine which environments are susceptible to the crash, and therefore how many users are likely to be affected by it. You should also get the console log, which may hold only a single line that pertains to your program, but that line often pinpoints the problem. These files can be obtained automatically, so that all the user must do is permit the information to be sent to you. The less they have to do, the more likely they are to report the crash. Consider providing a way for the user to describe the incident as well. They witnessed it, so they may know something crucial to reproducing it. You will also want to note the name and contact information of the reporter, so that you can ask for more information if necessary, and get confirmation once you have fixed it.
The Witness Interview
Eyewitness statements are even more unreliable in the technology industry than they are in criminal investigations. While a witness statement could provide the clues that you need to solve the problem, it could also contain misused terminology, specious assertions, and misleading statements. Always examine what the witness said, and stay open to it as a possibility, but do not assume any of it is accurate.
The problem with witness statements is the difficulty inherent in describing in words what is seen on the screen. To circumvent this, ask the reporter to take a screenshot of your application the moment before they reproduce the crash. You will notice the details that the reporter did not think to mention. At Telestream, we take this further and ask our Quality Assurance department to use ScreenFlow to record a video of the steps leading up to a crash or bug, which is better than a single picture, as it shows you the state of the program at each step. You could use the free demo of ScreenFlow (http://www.telestream.net/screen-flow), or Jing (http://www.jingproject.com) to do the same.
The Victim's Wallet
Once you have all of the documentation, examine it, starting with the crash report, which contains sections describing the process, the report, the crash, the threads, the registers, and the binaries.
The Process section contains identifying information about the process that crashed, including the name, process identification number, how the process was launched, and the executable path for the process.
Ensure that the identifier matches one of yours. If it does not, then the crash is out of your jurisdiction and you cannot fix it. Inform the reporter to send it to the appropriate party.
After the process name is a number in brackets. This is the process identification (PID) number that was assigned to the process when it started. Every process is given a number, starting with zero, and incrementing for each new process that is launched. If this number is high, you know that a lot of processes have been run since the last time the computer booted up, which suggests that the computer has been running for a while without a restart.
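If you process incoming reports automatically, the PID can be pulled out of the Process line with a few lines of C. The following is only a sketch: it assumes the line has the form "Process: ExceptionDemo [1234]", and the function name parse_process_pid and the process name ExceptionDemo are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extracts the number inside the last pair of square brackets on a
   crash report "Process:" line (assumed format: "Process: Name [1234]").
   Returns 1 and stores the PID in *pid on success, 0 otherwise. */
int parse_process_pid(const char *process_line, long *pid)
{
    assert(process_line != NULL && pid != NULL);

    /* Use the last '[' so a bracket inside the process name is skipped. */
    const char *bracket = strrchr(process_line, '[');
    long value;
    char closing;

    if (bracket == NULL)
        return 0;
    if (sscanf(bracket, "[%ld%c", &value, &closing) != 2 || closing != ']')
        return 0;

    *pid = value;
    return 1;
}
```

Calling parse_process_pid with the line "Process:         ExceptionDemo [1234]" stores 1234 in the output parameter. A line with no bracketed number is rejected rather than guessed at.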
Next is the path to the executable for this process. If this is a location that you did not expect, investigate how your program behaves when run from this location. The user may have been running from a locked disk image, or in a directory where they did not have proper permissions, both of which could cause problems if your code is not designed to handle these situations.
The version number of your product is next, and it is essential in correlating the crash to a specific version of your code. The standard Mac version number scheme contains a major version, a minor version, and a bug fix number. This scheme lacks one essential feature. It does not provide a unique identifier for each and every build. Consider using a scheme that consists of the major version, minor version, bug fix number, and build number, thereby assigning a single unique identifier to each and every build. You can then correlate this number to the date and time that a build was made, and then retrieve the exact version of every source file used to make that build from your source control management system. This will save time that might otherwise be lost by trying to reproduce a crash with the wrong source code.
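One way to represent such a four-part version in code is sketched below. The BuildVersion type, its field names, and the dotted "major.minor.bugfix.build" text format are assumptions made for this example, not a standard. Adapt them to whatever your build system stamps into the product.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical four-part version: major.minor.bugfix.build */
typedef struct {
    unsigned major_version;
    unsigned minor_version;
    unsigned bug_fix;
    unsigned build;          /* unique per build, never reused */
} BuildVersion;

/* Writes "major.minor.bugfix.build" into buf.
   Returns the length snprintf reports for the formatted text. */
int build_version_format(const BuildVersion *v, char *buf, size_t size)
{
    assert(v != NULL && buf != NULL);
    return snprintf(buf, size, "%u.%u.%u.%u",
                    v->major_version, v->minor_version,
                    v->bug_fix, v->build);
}

/* Parses "major.minor.bugfix.build". Returns 1 on success, and 0 if
   any part is missing or extra characters follow the build number. */
int build_version_parse(const char *text, BuildVersion *out)
{
    char trailing;
    assert(text != NULL && out != NULL);
    return sscanf(text, "%u.%u.%u.%u%c",
                  &out->major_version, &out->minor_version,
                  &out->bug_fix, &out->build, &trailing) == 4;
}
```

With the build number parsed out of a crash report, you can look up the matching build date and source revision in your own records.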
The code type specifies whether the PowerPC or Intel code inside your universal binary was the one that crashed. If you have code written for a specific architecture, such as AltiVec or SSE, this will tell you which was executed.
The parent process tells you how your application or plug-in was launched. For applications, this is typically launchd. If your product was launched by another process, it may have been in an environment or workflow that you had not anticipated.
The Crime Scene
The report section includes the date and time that the crash occurred, which can be used to correlate the crash with the console log, as most of its entries are time-stamped. You can then focus on the log entries immediately prior to the time of the crash.
The version of the operating system is important for reproducing issues, as they may be specific to a certain version of Mac OS X. If it is a new version of Mac OS X that was released after this version of your software, or if it is a very old version of Mac OS X, this could signal an incompatibility.
Lastly, the report version describes the format of the crash report, for use by automated analysis programs.
The Cause of Death
Crashes are caused by exceptions. The crash section describes the exception that caused the crash using two identifiers: the exception type, which is the category for the exception; and the exception code, which is the specific identifier. The most common exception types are EXC_ARITHMETIC, EXC_BAD_INSTRUCTION, and EXC_BAD_ACCESS. The line for the exception code may also include the offending address or value that caused the exception. The last item is the number of the thread that was executing when the exception was encountered.
The EXC_ARITHMETIC exception type covers any arithmetic that is considered illegal, such as dividing by zero (EXC_I386_DIV). Mathematically, the result of a division by zero is undefined. Intel processors are strict when it comes to dividing by zero, and they will not allow it. PowerPC processors were more forgiving, albeit mathematically incorrect. Instead of causing a crash, they returned zero as the result.
Listing 1: ExceptionController.m
Divide By Zero
The following demonstrates causing an EXC_ARITHMETIC/EXC_I386_DIV (divide by zero) exception. Note that the compiler will warn you if it sees that you are trying to do a divide operation with a literal constant of zero as the divisor. However, it will not catch situations where a variable with a value of zero is used as the divisor.
int divisor = 0;
// This line will cause the exception on an Intel processor.
int result = 128 / divisor;
// Modulus operations use division, so they can
// also cause this exception.
result = 128 % divisor;
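Because the compiler cannot catch a zero held in a variable, the check has to happen at run time. The function below is a minimal sketch of one way to guard a division. The name safe_divide and its return convention are invented for this example. It also rejects INT_MIN divided by -1, since that quotient does not fit in an int and x86 processors raise the same divide fault for it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Divides numerator by divisor without risking a hardware divide fault.
   Returns 1 and stores the quotient in *result on success.
   Returns 0 and leaves *result untouched when the division is unsafe. */
int safe_divide(int numerator, int divisor, int *result)
{
    assert(result != NULL);

    if (divisor == 0)
        return 0;                       /* would be EXC_I386_DIV on Intel */
    if (numerator == INT_MIN && divisor == -1)
        return 0;                       /* quotient overflows an int */

    *result = numerator / divisor;
    return 1;
}
```

The caller decides what a failed division means for the program, such as skipping the calculation or reporting an error, instead of letting the processor decide by crashing.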
The EXC_BAD_INSTRUCTION exception type means that the processor was given an instruction that it does not understand. This usually means that your code has corrupted the instruction pointer, which is a register that points to the memory location that holds the next instruction to execute. When that pointer is corrupted it points to some other part of memory and the processor tries to interpret that memory as an instruction, when it was intended to be something else.
In order to prevent a problem in one program from crashing other programs, or even the entire system, Mac OS X uses protected memory. Every process is given a virtual address space, which is divided into segments. Each segment has permissions that specify whether you can read from it, write to it, or execute it. When you allocate memory, it is mapped from the physical address that it resides on to the virtual address that is given to your program. The EXC_BAD_ACCESS exception type means that your program attempted to access memory that either was not mapped (KERN_INVALID_ADDRESS), or that it was not allowed to access (KERN_PROTECTION_FAILURE) because of the permissions on that segment. To examine the virtual memory maps for your application, pass the PID of your application to the vmmap command line tool.
Listing 2: ExceptionController.m
Kernel Invalid Address
The following demonstrates causing an EXC_BAD_ACCESS/KERN_INVALID_ADDRESS exception.
// On 32-Bit systems, each process can have up to 4GB of
// memory. Here, we try to write to the very last byte,
// which is neither likely to be mapped already, nor
// mapped by us via allocation. While you aren't likely
// to explicitly do this in an application, if you try to
// write to a pointer that has been corrupted, you may
// end up doing just this.
memcpy( (void *)0xFFFFFFFF, "d", 1);
Listing 3: ExceptionController.m
Kernel Protection Failure
The following demonstrates causing an EXC_BAD_ACCESS/KERN_PROTECTION_FAILURE exception.
// Trying to write to a NULL pointer will cause this
// exception, as memory address zero resides in a
// virtual memory segment called "__PAGEZERO", which
// does not allow write access.
long *badPointer = NULL;
// This line will cause the crash.
*badPointer = 0xFEEDFACE;
Now that you know how to cause these bugs, you will be better prepared to find them and fix them.
The body of the crash report is the threads section. This shows the call stack for every thread in your program when your process crashed. Each entry in the call stack contains a number defining its position in the stack, a universal type identifier, the address of the function, the function name, and the offset to the instruction that caused the crash. The first line is the function that the thread was in at the time of the crash. The identifier tells you what binary contains that function. If the identifier in the first line of the call stack for the crashed thread is not the identifier for your application, you can check the Binary Images Description portion of the crash log for more information on that binary. We will cover more on that section later.
Threads can either be actively executing, or blocked, waiting to execute. Determining which threads were active at the time of the crash, and which were blocked, will allow you to ferret out the potential culprits. Consider any thread that was blocked to have an alibi. Any thread whose current function name contains one of the following words was most likely blocked: wait, delay, semaphore, mutex, or sleep.
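The keyword check is easy to automate when you have many threads to sift through. The sketch below applies the five words listed above to a function name, ignoring case. The function names is_likely_blocked and contains_ignoring_case are invented for this example, and the result is only a heuristic: a blocked thread whose top function uses none of these words will be missed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if needle occurs anywhere in haystack, ignoring case. */
static int contains_ignoring_case(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);
    for (; *haystack != '\0'; haystack++) {
        size_t i = 0;
        while (i < n && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Returns 1 if the function name at the top of a thread's call stack
   contains one of the words that suggest the thread was blocked. */
int is_likely_blocked(const char *function_name)
{
    static const char *const keywords[] = {
        "wait", "delay", "semaphore", "mutex", "sleep"
    };
    assert(function_name != NULL);

    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (contains_ignoring_case(function_name, keywords[i]))
            return 1;
    return 0;
}
```

Threads for which this returns 0 are the ones to question first.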
If you notice a thread with one or more function names repeated, particularly if the call stack is very deep, the exception might be caused by runaway recursion. Recursion is when a function calls itself, either directly or indirectly. This can be quite a useful technique, particularly when dealing with hierarchical data, but if left unchecked, the recursion could keep going until it uses up all of the available stack space, which will cause a crash. Recursion can also happen unintentionally, for instance, if you were to call [self display] in the drawing routine of a custom view.
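One defensive option for recursive code that walks hierarchical data is to carry a depth limit along with each call. The following sketch counts the nodes of a tree and gives up once the limit is reached. The Node type, the count_nodes function, and the idea of returning -1 for "too deep" are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node: each node links to its first child and to
   its next sibling. */
typedef struct Node {
    struct Node *first_child;
    struct Node *next_sibling;
} Node;

/* Counts the nodes in the tree rooted at node, descending at most
   max_depth levels. Returns -1 if the tree is deeper than that, which
   may indicate a cycle or corrupted data rather than a real hierarchy. */
long count_nodes(const Node *node, int max_depth)
{
    assert(max_depth >= 0);

    if (node == NULL)
        return 0;
    if (max_depth == 0)
        return -1;                      /* too deep: stop recursing */

    long total = 1;
    for (const Node *child = node->first_child; child != NULL;
         child = child->next_sibling) {
        long below = count_nodes(child, max_depth - 1);
        if (below < 0)
            return -1;                  /* pass the failure upward */
        total += below;
    }
    return total;
}
```

A node that is accidentally linked in as its own child makes an unguarded version recurse until the stack runs out. With the guard, the same data produces an error value that you can log and investigate.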
If you see question marks in place of the binary identifiers, you could have a stack corruption. These can be difficult to solve because the application will continue to run after the memory has been corrupted, crashing instead in code that is executed much later. If you suspect you are dealing with a stack corruption, try turning on stack canaries in Xcode by adding the -fstack-protector (or -fstack-protector-all) flag to the "Other C Flags" setting for your project. Stack canaries work like a canary in a coal mine, as an early warning system. When stack canaries are on, the integrity of the stack is checked when you return from a function. If the stack has been corrupted, an error message is printed to the console to help you find the problem.
Multithreading problems are also difficult to track down because the crashes may only happen a small percentage of the time, and the offending code might not be in the thread that crashed. Your best resource for these types of issues is collecting multiple crash logs and comparing them. If you find the same two threads are always in similar locations when the crash occurs, try checking for unsynchronized access to shared resources, which is usually the culprit. Check your semaphores and mutexes to see if there is a case you might have left vulnerable to simultaneous access.
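For comparison, here is a minimal sketch of what synchronized access to a shared resource can look like with POSIX threads. Several threads increment one counter, and every increment happens while a mutex is held. The SharedCounter type, the function names, and the thread and iteration counts are all invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { THREAD_COUNT = 4, INCREMENTS_PER_THREAD = 100000 };

/* The shared resource together with the mutex that protects it. */
typedef struct {
    pthread_mutex_t lock;
    long value;
} SharedCounter;

static void *increment_worker(void *context)
{
    SharedCounter *counter = context;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter->lock);     /* one thread at a time */
        counter->value++;
        pthread_mutex_unlock(&counter->lock);
    }
    return NULL;
}

/* Runs the workers and returns the final count, or -1 if the mutex or
   a thread could not be created. */
long run_synchronized_counter(void)
{
    SharedCounter counter;
    pthread_t threads[THREAD_COUNT];
    int started = 0;
    int failed = 0;

    counter.value = 0;
    if (pthread_mutex_init(&counter.lock, NULL) != 0)
        return -1;

    while (started < THREAD_COUNT) {
        if (pthread_create(&threads[started], NULL,
                           increment_worker, &counter) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&counter.lock);

    /* With the lock held around every increment, none can be lost. */
    assert(counter.value == (long)started * INCREMENTS_PER_THREAD);
    return failed ? -1 : counter.value;
}
```

If you remove the lock and unlock calls, the final count can come up short on some runs and not on others, which is the kind of intermittent behavior described above.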
After the threads section, you will find a table listing the registers and their values at the time of the crash. The x86 processor architecture designates eight registers for general purposes, six segment registers for memory management, a flags register to describe or control the results of operations, and the instruction pointer, which holds the address of the next instruction to execute. Not all x86 registers are listed in the crash report, but the ones that are can provide clues as to the cause of the crash.
x86 General Purpose Registers Listed In Crash Reports
Register (32 Bit / 64 Bit)   Purpose
EAX / RAX                    Accumulator
EBX / RBX                    Base
ECX / RCX                    Counter
EDX / RDX                    Data
EDI / RDI                    Destination Index
ESI / RSI                    Source Index
EBP / RBP                    Base Pointer
ESP / RSP                    Stack Pointer
While the general-purpose registers can be used for anything, most have certain tasks that they are optimized for. The accumulator is where most arithmetic calculations are performed. The base register has no specialized purpose. The counter register is designed for use as the index in loops. The data register is for storing data used in the calculations occurring in the accumulator. The destination index is for use as a pointer to the current location in a write operation. Similarly, the source index is for use as a pointer in a read operation. The base pointer points to the base of the current stack frame, and the stack pointer points to the top of the stack.
x86 Other Registers Listed In Crash Reports
Register    Purpose
SS          Stack Segment
EIP / RIP   Instruction Pointer
CS          Code Segment
DS          Data Segment
ES          Extra Segment
FS          F (Extra) Segment
GS          G (Extra) Segment
CR2         Control Register 2
The remaining registers have dedicated purposes. The segment registers are for supporting memory protection via segmentation. However, paging is now the preferred method of memory protection, so most of these registers are set to the same value. The F and G segments may store data specific to a thread. The flags register is used to control the results of operations, and to store information about those results, such as if the result overflowed the register. CR2 contains the offending address when a page fault occurs.
The Known Associates
The Binary Images Description section of the crash report has a list of all of the binaries involved in running your application, including the frameworks, plug-ins, and dynamically-linked libraries. There is one line per binary, and each entry contains the memory address span, the identifier, the version, and the file path it was loaded from. This list is usually long, even for the most trivial applications. If your application uses plug-ins, look for them here to see what versions were present.
If you are having trouble finding the cause of your crash, it is worth taking a few minutes to review this list. Look for anything that is unusual, meaning any entry whose identifier is neither yours nor Apple's (i.e. com.apple.whatever). When you find one that you do not recognize, look it up online. If it seems like it could interfere, try uninstalling it and see if the problem disappears.
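When the list is long, a small filter can do the first pass for you. The sketch below flags any identifier that starts with neither "com.apple." nor your own prefix. The function name is_third_party_binary and the "com.example." prefix are placeholders. Note that this simple test will also flag entries that have no bundle-style identifier at all, so treat its output as a list of things to look at, not a list of suspects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if text begins with prefix. */
static int has_prefix(const char *text, const char *prefix)
{
    return strncmp(text, prefix, strlen(prefix)) == 0;
}

/* Returns 1 when identifier belongs to neither Apple ("com.apple.")
   nor the developer's own prefix, and 0 otherwise. */
int is_third_party_binary(const char *identifier, const char *own_prefix)
{
    assert(identifier != NULL && own_prefix != NULL);

    if (has_prefix(identifier, "com.apple."))
        return 0;
    if (has_prefix(identifier, own_prefix))
        return 0;
    return 1;
}
```

Run each identifier from the Binary Images Description section through the function, and look up the ones it flags.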
The Modus Operandi
Some bugs cause immediate crashes, such as the EXC_I386_DIV exception. Others start causing problems that will eventually lead to a crash. If the crash happens at the same line of code every time it is executed, it is probably going to be easy to fix. If, however, the crash only happens occasionally, or at different lines of code, then it is a delayed crash, and will be tougher. To fix these issues, you will have to backtrack and find the initial problem.
To tackle delayed crashes, there are several techniques you can use to narrow down the problem. Law enforcement has a better chance of catching a serial killer each time he commits a new murder because each incident provides the investigators with more information. Similarly, the more instances of the crash you have to examine, the easier it will be to solve. Collect documentation for multiple instances of the crash and compare them, using your favorite diff tool. The information that is the same may be the conditions that are required to cause the crash, which are hints to the cause. You can also use this technique to exclude unrelated crash reports, if they are dramatically different than all of the others that you have collected on the issue.
Your goal now is to be able to reliably reproduce the crash. If you cannot, you will never be able to verify that it was fixed, so you might as well drop it in the cold case drawer.
The next step is reducing the time it takes to reproduce the crash to under a few minutes. If the crash takes an hour to occur, it will take an eternity to investigate it.
If appropriate, try stressing the program by reducing the resources it has available, such as RAM, virtual memory, disk space, network bandwidth, etc. The crash might require one of those resources reaching a critically low point. By reducing those resources from the beginning, such as by launching a lot of applications, filling up disk space with large files, or starting extremely large file transfers, you can induce the required conditions without the usual wait.
Try examining what you think the program is doing around the time of the crash. If it is working on a certain part of a large file, then try making that part of the file into the beginning of the file, either by moving it, or by cutting out everything preceding it. If it is in the last stage of a multistage process, then try disabling the prior stages.
Once your crash is easily reproducible, you will need to narrow down the problem. Try taking out easily removable items such as plug-ins and frameworks. Next try commenting out half the suspected code, in a way that leaves the other half still compiling and usable. If the problem persists, then you know the bug is in the uncommented half. Then try commenting out half of that code, and continue with this technique until the bug becomes evident.
Another useful technique is regression, which requires that you use a source control management (SCM) product, such as CVS, Subversion, or Perforce. It will also be useful to have unique version numbers for every build of your product, as we discussed earlier. Try going back through previous builds of your product until you find one where the problem did not occur. Then, using your SCM tool, find the changes that were made between the unaffected build and the affected build. Those changes are likely to contain the bug.
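If you have many builds between the last known good one and the first known bad one, you do not need to try them all. Halving the range each time finds the first affected build in a handful of tests. The sketch below assumes builds are numbered sequentially and that once the bug appears it stays in every later build. The names first_bad_build and simulated_crash_check are invented, and the simulated check stands in for the manual step of installing a build and trying to reproduce the crash.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if the given build reproduces the crash. */
typedef int (*CrashPredicate)(int build_number);

/* Finds the first build that crashes, given a build known to be good
   and a later build known to be bad. Assumes every build after the
   first bad one is also bad. */
int first_bad_build(int good_build, int bad_build, CrashPredicate crashes)
{
    assert(crashes != NULL && good_build < bad_build);

    while (bad_build - good_build > 1) {
        int middle = good_build + (bad_build - good_build) / 2;
        if (crashes(middle))
            bad_build = middle;     /* bug was introduced at or before middle */
        else
            good_build = middle;    /* bug was introduced after middle */
    }
    return bad_build;
}

/* Stand-in for testing a real build: pretends the bug first shipped
   in the made-up build 1047. */
int simulated_crash_check(int build_number)
{
    return build_number >= 1047;
}
```

With 100 builds between the good and bad endpoints, this needs about seven tests instead of up to a hundred. The changes that went into the build it returns are the ones to review in your SCM tool.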
The Suspect Lineup
You should now be able to turn the problem on or off at will, making the crash occur or not occur.
Now that you have found the fix, you might be thinking that you are done, but there are often many possible fixes for any problem. Do you really want to use whichever one happened to be the first that you found? Take the time to think of a few other possible ways to fix the problem. Then consider the benefits and drawbacks of each. Consider the time it takes to implement, the maintainability of the code, the scope of the changes, and the likelihood that the fix will cause more problems. Now you can pick the best fix and implement it.
You are almost done. Document the bug and the fix in your code, so that neither you, nor the other members of your team inadvertently reintroduce the problem. Document it in your source code management system as well, so that you know when you fixed it, both in regards to time and to versioning. And finally, document it in your release notes so that your users know that this is the update that will fix the problem they are encountering. Do not be so ashamed of the crash that you omit it from the release notes. All applications crash, even Apple's. The fact that you found it, fixed it fast, and made the fix available to your users quickly and honestly, is something to be proud of. Keeping excellent records like this will help prevent the problem from reappearing, and will also provide you with valuable resources for tracking down your next crash.
David Garcea is the Engineering Program Manager for Macintosh Desktop Products at the U.S. headquarters of Telestream, Inc., makers of Flip4Mac WMV, Episode, Drive-in, Pipeline, and ScreenFlow. Drawing on over twelve years of experience making Mac applications and plug-ins, he leads the team of engineers that makes it possible for you to make and watch WMV content on your Mac, load all of your DVDs onto your laptop, or capture, transcode, and edit a multi-camera concert in Times Square. He has a bachelor's degree in computer science from the State University of New York Empire State College, and lives in Northern California. You can reach him at firstname.lastname@example.org.