Sparrow

Hakjoo Oh, Kihong Heo, Wonchan Lee, Woosuk Lee, and Kwangkeun Yi
Programming Research Lab, ROSAEC Center, Seoul National University

Introduction

Sparrow is a state-of-the-art static analyzer that aims to verify the absence of fatal bugs in C source. Sparrow is designed by Abstract Interpretation and the analysis is sound in design. Sparrow adopts a number of well-founded static analysis techniques for scalability, precision, and user convenience:

Scalability: Sparrow has been engineered on top of our general sparse analysis framework.
Precision: Sparrow uses various precision-improvement techniques, such as selective context-sensitivity, selective relational anlaysis, etc.
Convenience: Sparrow supports alarm clustering, analysis time estimation, etc.

Sparrow's design and engineering efforts used have been published in the following papers.

Papers

Selective Context-Sensitivity Guided by Impact Pre-Analysis. PLDI 2014.
Hakjoo Oh, Wonchan Lee, Kihong Heo, Hongseok Yang, and Kwangkeun Yi. [pdf]
- abstract: We present a method for selectively applying context-sensitivity during interprocedural program analysis. Our method applies context-sensitivity only when and where doing so is likely to improve the precision that matters for resolving given queries. The idea is to use a pre-analysis to estimate the impact of context-sensitivity on the main analysis’s precision, and to use this information to find out when and where the main analysis should turn on or off its context-sensitivity. We formalize this approach and prove that the analysis always benefits from the pre-analysis-guided context-sensitivity. We implemented this selective method for an existing industrial-strength interval analyzer for full C. The method reduced the number of (false) alarms by 24.4%, while increasing the analysis cost by 27.8% on average. The use of the selective method is not limited to context-sensitivity. We demonstrate this generality by following the same principle and developing a selective relational analysis. Our experiments show that the method cost-effectively improves the precision in the relational analysis as well.
Design and Implementation of Sparse Global Analyses for C-like Languages. PLDI 2012.
Hakjoo Oh, Kihong Heo, Wonchan Lee, Woosuk Lee, and Kwangkeun Yi. [slides] [pdf] [bib]
- abstract: In this article we present a general method for achieving global static analyzers that are precise, sound, yet also scalable. Our method generalizes the sparse analysis techniques on top of the abstract interpretation framework to support relational as well as non-relational semantics properties for C-like languages. We first use the abstract interpretation framework to have a global static analyzer whose scalability is unattended. Upon this underlying sound static analyzer, we add our generalized sparse analysis techniques to improve its scalability while preserving the precision of the underlying analysis. Our framework determines what to prove to guarantee that the resulting sparse version should preserve the precision of the underlying analyzer. We formally present our framework; we present that existing sparse analyses are all restricted instances of our framework; we show more semantically elaborate design examples of sparse non-relational and relational static analyses; we present their implementation results that scale to analyze up to one million lines of C programs. We also show a set of implementation techniques that turn out to be critical to economically support the sparse analysis process.
Access-based Abstract Memory Localization in Static Analysis. SCP 2013.
Hakjoo Oh and Kwangkeun Yi. [pdf] [bib]
- abstract: On-the-fly localization of abstract memory states is vital for economical abstract interpretation of imperative programs. Such localization is sometimes called ‘‘abstract garbage collection’’ or ‘‘framing’’. In this article we present a new memory localization technique that is more effective than the conventional reachability-based approach. Our technique is based on a key observation that collecting the reachable memory parts is too conservative and the accessed parts are usually tiny subsets of the reachable part. Our technique first estimates, by an efficient pre-analysis, which parts of input states will be accessed during the analysis of each code block. Then the main analysis uses the access-set results to trim the memory entries before analyzing code blocks. In experiments with an industrial-strength global C static analyzer, the technique is applied right before analyzing each procedure’s body and reduces the average analysis time and memory by 92.1% and 71.2%, respectively, without sacrificing the analysis precision. In addition, we present three extensions of access-based localization: (1) we generalize the idea and apply the localization more frequently such as at loop bodies and basic blocks as well as procedure bodies, additionally reducing analysis time by an average of 31.8%; (2) we present a technique to mitigate a performance problem of localization in handling recursive procedures, and show that this extension improves the average analysis time by 42%; (3) we show how to incorporate the access-based localization into relational numeric analyses.
Sound Non-Statistical Clustering of Static Analysis Alarms . VMCAI 2012.
Woosuk Lee, Wonchan Lee, and Kwangkeun Yi. [slides] [pdf] [bib]
- abstract: We present a sound method for clustering alarms from static analyzers. Our method clusters alarms by discovering sound dependencies between them such that if the dominant alarm of a cluster turns out to be false (respectively true) then it is assured that all others in the same cluster are also false (respectively true). We have implemented our clustering algorithm on top of a realistic buffer-overflow analyzer and proved that our method has the effect of reducing 54% of alarm reports. Our framework is applicable to any abstract interpretation-based static analysis and orthogonal to abstraction refinements and statistical ranking schemes.
Access-based Localization with Bypassing. APLAS 2011.
Hakjoo Oh and Kwangkeun Yi. [slides] [pdf] [bib]
- abstract: We present an extension of access-based localization technique to mitigate a substantial inefficiency in handling procedure calls. Recently, access-based localization was proposed as an effective way of tightly localizing abstract memories. However, it has a limitation in handling procedure calls: the localized input memory for a procedure contains not only memory locations accessed by the procedure but also those accessed by transitively called procedures. The weakness is especially exacerbated in the presence of recursive call cycles, which is common in analysis of realistic programs. In this paper, we present a technique, called bypassing, that mitigates the problem. Our technique localizes input memory states only with memory locations that the procedure directly accesses. Those parts not involved in analysis of the procedure are bypassed to transitively called procedures. In experiments with an industrial-strength global C static analyzer, the technique reduces the average analysis time by 42%. In particular, the technique is especially effective for programs that extensively use recursion: it saves analysis time by 77% on average.
Access Analysis-based Tight Localization of Abstract Memories. VMCAI 2011.
Hakjoo Oh, Lucas Brutschy, and Kwangkeun Yi. [slides] [pdf] [bib]
- abstract: On-the-fly localization of abstract memory states is vital for economical abstract interpretation of imperative programs. Such localization is sometimes called “abstract garbage collection” or “framing”. In this article we present a new memory localization technique that is more effective than the conventional reachability-based approach. Our technique is based on a key observation that collecting the reachable memory parts is too conservative and the accessed parts are usually tiny subsets of the reachable. Our technique first estimates, by an efficient preanalysis, the set of locations that will be accessed during the analysis of each code block. Then the main analysis uses the access-set results to trim the memory entries before analyzing code blocks. In experiments with an industrial-strength global C static analyzer, the technique is applied right before analyzing each procedure’s body and reduces the average analysis time and memory by 92.1% and 71.2%, respectively, without sacrificing the analysis precision. Localizing more frequently such as at loop bodies and basic blocks as well as procedure bodies, the generalized localization additionally reduces analysis time by an average of 31.8%.
An Algorithmic Mitigation of Large Spurious Interprocedural Cycles in Static Analysis. SPE 2010.
Hakjoo Oh and Kwangkeun Yi. [pdf] [bib]
- abstract: We present a simple algorithmic extension of the approximate call-strings approach to mitigate substantial performance degradation caused by spurious interprocedural cycles. Spurious interprocedural cycles are, in a realistic setting, key reasons for why approximate call-return semantics in both context-sensitive and -insensitive static analysis can make the analysis much slower than expected. In the approximate call-strings-based context-sensitive static analysis, because the number of distinguished contexts is finite, multiple call-contexts are inevitably joined at the entry of a procedure and the output at the exit is propagated to multiple return-sites. We found that these multiple returns frequently create a single large cycle (we call it “butterfly cycle”) covering almost all parts of the program and such a spurious cycle makes analyses very slow and inaccurate. Our simple algorithmic technique (within the fixpoint iteration algorithm) identifies and prunes these spurious interprocedural flows. The technique’s effectiveness is proven by experiments with a realistic C analyzer to reduce the analysis time by 7%-96%. Since the technique is algorithmic, it can be easily applicable to existing analyses without changing the underlying abstract semantics, it is orthogonal to the underlying abstract semantics’ context-sensitivity, and its correctness is obvious.
Large Spurious Cycles in Global Static Analysis and Its Algorithmic Mitigation. APLAS 2009.
Hakjoo Oh. [slides] [pdf] [bib]
- abstract: We present a simple algorithmic extension of the classical call-strings approach to mitigate substantial performance degradation caused by spurious interprocedural cycles. Spurious interprocedural cycles are, in a realistic setting, key reasons for why approximate call-return semantics in both context-sensitive and -insensitive static analysis can make the analysis much slower than expected. In the traditional call-strings-based context-sensitive static analysis, because the number of distinguished contexts must be finite, multiple call contexts are inevitably joined at the entry of a procedure and the output at the exit is propagated to multiple return-sites. We found that these multiple returns frequently create a single large cycle (we call it “butterfly cycle”) covering almost all parts of the program and such a spurious cycle makes analyses very slow and inaccurate. Our simple algorithmic technique (within the fixpoint iteration algorithm) identifies and prunes these spurious interprocedural flows. The technique’s effectiveness is proven by experiments with a realistic C analyzer to reduce the analysis time by 7%-96%. Since the technique is algorithmic, it can be easily applicable to existing analyses without changing the underlying abstract semantics, it is orthogonal to the underlying abstract semantics’ context-sensitivity, and its correctness is obvious.
An Empirical Study on Classification Methods for Alarms from a Bug-Finding Static C Analyzer. IPL 2007.
Kwangkeun Yi, Hosik Choi, Jaehwang Kim, and Yongdai Kim. [pdf] [bib]
- abstract: A key application for static analysis is automatic bug-finding. Given the program source, a static analyzer computes an approximation of dynamic program states occurring at each program point, and reports possible bugs by examining the approximate states. From such static bug-finding analysis, false alarms are inevitable. Because static analysis is done at compile-time, exact computation of the program’s run-time states is impossible. Hence some approximation must be involved, so that the detected bugs can contain some false positives. Methodologies such as the abstract interpretation framework [6–8] counsel us to design a correct (conservative) static analyzer. The correctness criterion exacerbates the false alarm problem, because whenever in doubt the analysis must err on the pessimistic side.
Taming False Alarms from a Domain-Unaware C Analyzer by a Bayesian Statistical Post Analysis. SAS 2005.
Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. [pdf] [bib]
- abstract: We present our experience of combining, in a realistic setting, a static analyzer with a statistical analysis. This combination is in order to reduce the inevitable false alarms from a domain-unaware static analyzer. Our analyzer named Airac(Array Index Range Analyzer for C) collects all the true buffer-overrun points in ANSI C programs. The soundness is maintained, and the analysis’ cost-accuracy improvement is achieved by techniques that static analysis community has long accumulated. For still inevitable false alarms (e.g. Airac raised 970 buffer overrun alarms in commercial C programs of 5.3 million lines and 737 among the 970 alarms were false), which are always apt for particular C programs, we use a statistical post analysis. The statistical analysis, given the analysis results (alarms), sifts out probable false alarms and prioritizes true alarms. It estimates the probability of each alarm being true. The probabilities are used in two ways: 1) only the alarms that have true-alarm probabilities higher than a threshold are reported to the user; 2) the alarms are sorted by the probability before reporting, so that the user can check highly probable errors first. In our experiments with Linux kernel sources, if we set the risk of missing true error is about 3 times greater than false alarming, 74.83% of false alarms could be filtered; only 15.17% of false alarms were mixed up until the user observes 50% of the true alarms.

Performance

Sparrow's performance (both in precision and scalability) is being actively improved. Thanks to our general sparse analysis framework, the current Sparrow is 175x more scalable in terms of lines of code and is 1500x faster than the initial version.

Contact

For general questions regarding Sparrow, please send email to kwang _at_ ropas.snu.ac.kr.

Acknowledgement

This work was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Education, Science and Technology(MEST) / National Research Foundation of Korea(NRF) (Grant 2012-0000468) and the Brain Korea 21 Project, School of Electrical Engineering and Computer Science, Seoul National University.