Statistically Debugging Massively-Parallel Applications

dc.contributor.author: de Supinski, Bronis R.
dc.contributor.author: Liblit, Ben
dc.contributor.author: Ravitch, Tristan
dc.date.accessioned: 2013-03-19T13:58:26Z
dc.date.available: 2013-03-19T13:58:26Z
dc.date.issued: 2013-02-18
dc.description.abstract: Statistical debugging identifies program behaviors that are highly correlated with failures. Traditionally, this approach has been applied to desktop software on which it is effective in identifying the causes that underlie several difficult classes of bugs including: memory corruption, non-deterministic bugs, and bugs with multiple temporally-distant triggers. The domain of scientific computing offers a new target for this type of debugging. Scientific code is run at massive scales offering massive quantities of statistical feedback data. Data collection can scale well because it requires no communication between compute nodes. Unfortunately, existing statistical debugging techniques impose run-time overhead that is unsuitable for computationally-intensive code despite being modest and acceptable in desktop software. Additionally, the normal communication that occurs between nodes in parallel jobs violates a key assumption of statistical independence in existing statistical models. We report on our experience bringing statistical debugging to the domain of scientific computing. We present techniques to reduce the run-time overhead of the required instrumentation by up to 25% over prior work, along with challenges related to data collection. We also discuss case studies looking at real bugs in ParaDiS and BOUT++, as well as some manually-seeded bugs. We demonstrate that the loss of statistical independence between runs is not a problem in practice. [en]
dc.identifier.citation: TR1786 [en]
dc.identifier.uri: http://digital.library.wisc.edu/1793/65136
dc.subject: statistical debugging [en]
dc.subject: dynamic analysis [en]
dc.subject: statistical methods [en]
dc.subject: debugging [en]
dc.title: Statistically Debugging Massively-Parallel Applications [en]
dc.type: Technical Report [en]

Files

Original bundle

Name: TR1786.pdf
Size: 232.38 KB
Format: Adobe Portable Document Format
Description: Tech Report
