Stack Trace Analysis for Large Scale Debugging
Authors
Arnold, Dorian
Ahn, Dong H.
de Supinski, Bronis R.
Lee, Gregory
Miller, Barton P.
Schulz, Martin
Type
Technical Report
Publisher
University of Wisconsin-Madison Department of Computer Sciences
Abstract
There are few runtime tools for modestly sized computing systems with 10^3 processors, and above this scale they work poorly. We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce the problem exploration space from thousands of processes to a few by sampling application stack traces to form process equivalence classes, that is, groups of processes exhibiting similar behavior. In typical parallel computations, large numbers of processes exhibit a small number of different behavior classes, manifested as common patterns in their stack traces. The problem space is thus reduced to representatives of these common behavior classes, on which we can use full-featured debuggers for root cause analysis.
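The idea of process equivalence classes can be illustrated with a minimal sketch. It is not STAT's implementation: the function names, the trace representation (a tuple of function names, outermost frame first), and the sample data are all assumptions made for illustration. Processes with identical stack traces are grouped together, and one representative per group is selected for closer inspection.

```python
from collections import defaultdict


def equivalence_classes(traces):
    """Group process ranks by identical stack trace.

    `traces` maps a rank (int) to its stack trace, a sequence of
    function names ordered from outermost to innermost frame.
    """
    classes = defaultdict(list)
    for rank, frames in traces.items():
        classes[tuple(frames)].append(rank)
    return {trace: sorted(ranks) for trace, ranks in classes.items()}


def representatives(classes):
    """Pick the lowest rank of each class as its representative."""
    return sorted(ranks[0] for ranks in classes.values())


# Hypothetical sample: 1,024 ranks, one of which is blocked elsewhere.
traces = {rank: ("main", "solve", "MPI_Barrier") for rank in range(1024)}
traces[17] = ("main", "solve", "exchange_halo", "MPI_Recv")

classes = equivalence_classes(traces)
reps = representatives(classes)
```

In this made-up scenario the 1,024 processes collapse into two classes, so a full-featured debugger would need to attach to only two representative processes rather than to all of them.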
STAT scalably collects stack traces over a sampling period to assemble a profile of the application's behavior. STAT routines process the trace samples to form a call graph prefix tree that depicts the program's behavior across its process space and over time. The prefix tree encodes common behaviors among the various stack samples, distinguishing classes of behavior from which representatives can be targeted for deeper analysis. STAT leverages MRNet, an infrastructure for tool control and data analysis, to overcome scalability barriers encountered by heavyweight debuggers.
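A call graph prefix tree of this kind might be sketched as follows. This is an illustrative assumption, not STAT's data structure or MRNet's API: the `Node` class, its methods, and the sample traces are hypothetical. Each node records which ranks were seen passing through it, and `merge` combines two partial trees in the way an interior node of a tree-shaped reduction network could combine results from its children.

```python
class Node:
    """One function in a call graph prefix tree (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        self.ranks = set()   # ranks whose sampled traces pass through here
        self.children = {}   # callee name -> Node

    def add_trace(self, rank, frames):
        """Insert one stack sample (outermost frame first) for `rank`."""
        node = self
        for name in frames:
            node = node.children.setdefault(name, Node(name))
            node.ranks.add(rank)

    def merge(self, other):
        """Fold another partial tree into this one."""
        self.ranks |= other.ranks
        for name, child in other.children.items():
            self.children.setdefault(name, Node(name)).merge(child)

    def leaves(self, path=()):
        """Yield (call path, sorted ranks) for every leaf of the tree."""
        if not self.children:
            yield path, sorted(self.ranks)
        for child in self.children.values():
            yield from child.leaves(path + (child.name,))


# Two hypothetical partial trees, e.g. built from different groups of ranks.
left, right = Node("root"), Node("root")
for rank in range(0, 4):
    left.add_trace(rank, ["main", "solve", "MPI_Barrier"])
for rank in range(4, 8):
    right.add_trace(rank, ["main", "solve", "MPI_Barrier"])
# A later sample of rank 7 finds it in a different call path.
right.add_trace(7, ["main", "solve", "exchange_halo", "MPI_Recv"])

left.merge(right)
result = dict(left.leaves())
```

Because samples taken at different times are inserted into the same tree, rank 7 appears under both leaf paths here. Common prefixes such as `main` and `solve` are stored once, which is what keeps the merged structure small when most processes behave alike. Note that this sketch reports classes only at leaves; a trace that ends at an interior node would need extra bookkeeping.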
We present STAT's design and an evaluation that shows STAT gathers informative process traces from thousands of processes with sub-second latencies, a significant improvement over existing tools. Our case studies of production codes verify that STAT supports the quick identification of errors that were previously difficult to locate.
Citation
TR1584