

27.1 Analyzer Internals

27.1.1 Overview

The analyzer implementation works on the gimple-SSA representation. (I chose this in the hopes of making it easy to work with LTO to do whole-program analysis).

The implementation is read-only: it doesn’t attempt to change anything, just emit warnings.

First, we build a supergraph which combines the callgraph and all of the CFGs into a single directed graph, with both interprocedural and intraprocedural edges. The nodes and edges in the supergraph are called “supernodes” and “superedges”, and often referred to in code as snodes and sedges. Basic blocks in the CFGs are split at interprocedural calls, so there can be more than one supernode per basic block. Most statements will be in just one supernode, but a call statement can appear in two supernodes: at the end of one for the call, and again at the start of another for the return.
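To illustrate the splitting described above, here is a hypothetical C function with comments marking where the analyzer would divide a single basic block into two supernodes at a call (the function names and boundaries are illustrative, not taken from GCC's sources):

```c
/* Illustration only: comments mark where the supergraph would split
   this single basic block around the call.  */
static int
helper (int x)
{
  return x + 1;
}

int
example (int a)
{
  int b = a * 2;      /* supernode 1: statements up to the call...  */
  int c = helper (b); /* ...the call statement ends supernode 1; the
                         returned value is handled at the start of
                         supernode 2, in the same basic block.  */
  return b + c;       /* supernode 2 runs to the end of the block.  */
}
```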

The supergraph can be seen using -fdump-analyzer-supergraph.

We then build an analysis_plan which walks the callgraph to determine which calls might be suitable for being summarized (rather than fully explored) and thus in what order to explore the functions.

Next is the heart of the analyzer: we use a worklist to explore state within the supergraph, building an "exploded graph". Nodes in the exploded graph correspond to <point, state> pairs, as in "Precise Interprocedural Dataflow Analysis via Graph Reachability" (Thomas Reps, Susan Horwitz and Mooly Sagiv).

We reuse nodes for <point, state> pairs we’ve already seen, and avoid tracking state too closely, so that (hopefully) we rapidly converge on a final exploded graph, and terminate the analysis. We also bail out if the number of exploded <end-of-basic-block, state> nodes gets larger than a particular multiple of the total number of basic blocks (to ensure termination in the face of pathological state-explosion cases, or bugs). We also stop exploring a point once we hit a limit of states for that point.
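The bail-out condition above can be sketched as a simple bound check. This is not GCC's actual code; the factor name is modeled loosely on the `analyzer-bb-explosion-factor` parameter, and the value here is an assumption for illustration:

```c
#include <stdbool.h>

/* Illustrative sketch of the termination guard described above:
   give up once the number of exploded <end-of-basic-block, state>
   nodes exceeds a fixed multiple of the number of basic blocks.
   The factor value is an assumption, not GCC's default.  */
#define BB_EXPLOSION_FACTOR 5

static bool
within_limits (int num_exploded_bb_nodes, int num_basic_blocks)
{
  return num_exploded_bb_nodes <= BB_EXPLOSION_FACTOR * num_basic_blocks;
}
```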

We can identify problems directly when processing a <point, state> instance. For example, if we’re finding the successors of

   <point: before-stmt: "free (ptr);",
    state: {"ptr": freed}>

then we can detect a double-free of "ptr". We can then emit a path to reach the problem by finding the simplest route through the graph.
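A minimal reproducer of this situation (illustrative; the function name is made up) is a function that frees the same pointer twice. Compiling it with `-fanalyzer` should report a double-free of `ptr`, along with the path of events that reaches the second call:

```c
#include <stdlib.h>

/* Compiling with -fanalyzer should diagnose a double-free of 'ptr'
   at the second call, which is reached with state {"ptr": freed}.  */
void
calls_free_twice (void *ptr)
{
  free (ptr);  /* state becomes {"ptr": freed} */
  free (ptr);  /* reached with "ptr" already freed: double-free */
}
```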

Program points in the analysis are much more fine-grained than in the CFG and supergraph, with points (and thus potentially exploded nodes) for various events, including before individual statements. By default the exploded graph merges multiple consecutive statements in a supernode into one exploded edge to minimize the size of the exploded graph. This can be suppressed via -fanalyzer-fine-grained. The fine-grained approach seems to make things simpler and more debuggable than other approaches I tried, in that each point is responsible for one thing.

Program points in the analysis also have a "call string" identifying the stack of callsites below them, so that paths in the exploded graph correspond to interprocedurally valid paths: we always return to the correct call site, propagating state information accordingly. We avoid infinite recursion by stopping the analysis if a callsite appears more than analyzer-max-recursion-depth in a callstring (defaulting to 2).
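The recursion bound on call strings can be sketched as follows. This is not GCC's implementation; the struct and function names are invented, and only the limit's default of 2 comes from the text above:

```c
/* Sketch of the "call string" idea: a bounded stack of call-site
   ids.  Names are illustrative, not GCC's.  */
#define MAX_RECURSION_DEPTH 2  /* analyzer-max-recursion-depth default */

struct call_string
{
  int sites[32]; /* call-site ids, innermost last */
  int len;
};

/* Count how often SITE already appears in the call string.  */
static int
count_occurrences (const struct call_string *cs, int site)
{
  int n = 0;
  for (int i = 0; i < cs->len; i++)
    if (cs->sites[i] == site)
      n++;
  return n;
}

/* Analysis stops descending into a call once its site already
   appears MAX_RECURSION_DEPTH times in the call string.  */
static int
can_descend (const struct call_string *cs, int site)
{
  return count_occurrences (cs, site) < MAX_RECURSION_DEPTH;
}
```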

27.1.2 Graphs

Nodes and edges in the exploded graph are called “exploded nodes” and “exploded edges” and often referred to in the code as enodes and eedges (especially when distinguishing them from the snodes and sedges in the supergraph).

Each graph numbers its nodes, giving unique identifiers - supernodes are referred to throughout dumps in the form ‘SN: index’ and exploded nodes in the form ‘EN: index’ (e.g. ‘SN: 2’ and ‘EN: 29’).

The supergraph can be seen using -fdump-analyzer-supergraph.

The exploded graph can be seen using -fdump-analyzer-exploded-graph and other dump options. Exploded nodes are color-coded in the .dot output based on state-machine states to make it easier to see state changes at a glance.

27.1.3 State Tracking

There’s a tension between the precision of state tracking and the scalability of the analysis.

For example, in general, given this CFG:

      A
     / \
    B   C
     \ /
      D
     / \
    E   F
     \ /
      G

we want to prevent differences in state-tracking in B and C from leading to blow-up. If we don’t prevent state blowup, we end up with exponential growth of the exploded graph like this:

           1:A
          /   \
         /     \
        /       \
      2:B       3:C
       |         |
      4:D       5:D        (2 exploded nodes for D)
     /   \     /   \
   6:E   7:F 8:E   9:F
    |     |   |     |
   10:G 11:G 12:G  13:G    (4 exploded nodes for G)

Similar issues arise with loops.
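The growth in the diagram is geometric: each diamond doubles the number of distinct <point, state> pairs reaching its join, so a chain of k diamonds yields 2^k exploded nodes at the final join when nothing is merged (2 at D and 4 at G above). A one-line sketch of that count:

```c
/* Without state merging, each diamond doubles the number of distinct
   <point, state> pairs reaching its join: k chained diamonds give
   2^k exploded nodes at the final join (2 at D, 4 at G above).  */
static long
unmerged_nodes_at_join (int num_diamonds)
{
  return 1L << num_diamonds;
}
```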

To prevent this, we follow various approaches:

  1. state pruning, which tries to discard state that won’t be relevant later on within the function. This can be disabled via -fno-analyzer-state-purge.
  2. state merging. We can try to find the commonality between two program_state instances to make a third, simpler program_state. We have two strategies here:
    1. the worklist keeps new nodes for the same program_point together, and tries to merge them before processing, and thus before they have successors. Hence, in the above, the two nodes for D (4 and 5) reach the front of the worklist together, and we create a node for D with the merger of the incoming states.
    2. try merging with the state of existing enodes for the program_point (which may have already been explored). There will be duplication, but only one set of duplication; subsequent duplicates are more likely to hit the cache. In particular, (hopefully) all merger chains are finite, and so we guarantee termination. This is intended to help with loops: we ought to explore the first iteration, and then have a "subsequent iterations" exploration, which uses a state merged from that of the first, to be more abstract.

    We avoid merging pairs of states that have state-machine differences, as these are the kinds of differences that are likely to be most interesting. So, for example, given:

          if (condition)
            ptr = malloc (size);
          else
            ptr = local_buf;
    
          .... do things with 'ptr'
    
          if (condition)
            free (ptr);
    
          ...etc
    

    then we end up with an exploded graph that looks like this:

                       if (condition)
                         / T      \ F
                ---------          ----------
               /                             \
          ptr = malloc (size)             ptr = local_buf
              |                               |
          copy of                         copy of
            "do things with 'ptr'"          "do things with 'ptr'"
          with ptr: heap-allocated        with ptr: stack-allocated
              |                               |
          if (condition)                  if (condition)
              | known to be T                 | known to be F
          free (ptr);                         |
               \                             /
                -----------------------------
                             | ('ptr' is pruned, so states can be merged)
                            etc
    
    

    where some duplication has occurred, but only for the places where the different paths are worth exploring separately.

    Merging can be disabled via -fno-analyzer-state-merge.
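The pairwise merging described above can be sketched as follows. This is an illustrative model, not GCC's program_state code: per-variable values that agree are kept, and values that differ are widened to "unknown", a weaker fact shared by both incoming states:

```c
/* Illustrative sketch of state merging (not GCC's actual code):
   where two states agree on a variable's value it is kept; where
   they differ, the merged state falls back to "unknown".  */
enum value { VAL_UNKNOWN, VAL_NULL, VAL_NONNULL, VAL_FREED };

#define NUM_VARS 4

struct state
{
  enum value vals[NUM_VARS];
};

static struct state
merge_states (const struct state *a, const struct state *b)
{
  struct state m;
  for (int i = 0; i < NUM_VARS; i++)
    m.vals[i] = (a->vals[i] == b->vals[i]) ? a->vals[i] : VAL_UNKNOWN;
  return m;
}
```

In the real analyzer, a pair of states with state-machine differences (e.g. freed vs. non-freed, as in the malloc example above) would be kept apart rather than merged, since those differences are the interesting ones.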

27.1.4 Region Model

Part of the state stored at an exploded_node is a region_model. This is an implementation of the region-based ternary model described in "A Memory Model for Static Analysis of C Programs" (Zhongxing Xu, Ted Kremenek, and Jian Zhang).

A region_model encapsulates a representation of the state of memory, with a tree of region instances, along with their associated values. The representation is graph-like because values can be pointers to regions. It also stores a constraint_manager, capturing relationships between the values.

Because each node in the exploded_graph has a region_model, and each of the latter is graph-like, the exploded_graph is in some ways a graph of graphs.

Here’s an example of printing a region_model, showing the ASCII-art used to visualize the region hierarchy (colorized when printing to stderr):

(gdb) call debug (*this)
r0: {kind: 'root', parent: null, sval: null}
|-stack: r1: {kind: 'stack', parent: r0, sval: sv1}
|  |: sval: sv1: {poisoned: uninit}
|  |-frame for 'test': r2: {kind: 'frame', parent: r1, sval: null, map: {'ptr_3': r3}, function: 'test', depth: 0}
|  |  `-'ptr_3': r3: {kind: 'map', parent: r2, sval: sv3, type: 'void *', map: {}}
|  |    |: sval: sv3: {type: 'void *', unknown}
|  |    |: type: 'void *'
|  `-frame for 'calls_malloc': r4: {kind: 'frame', parent: r1, sval: null, map: {'result_3': r7, '_4': r8, '<anonymous>': r5}, function: 'calls_malloc', depth: 1}
|    |-'<anonymous>': r5: {kind: 'map', parent: r4, sval: sv4, type: 'void *', map: {}}
|    |  |: sval: sv4: {type: 'void *', &r6}
|    |  |: type: 'void *'
|    |-'result_3': r7: {kind: 'map', parent: r4, sval: sv4, type: 'void *', map: {}}
|    |  |: sval: sv4: {type: 'void *', &r6}
|    |  |: type: 'void *'
|    `-'_4': r8: {kind: 'map', parent: r4, sval: sv4, type: 'void *', map: {}}
|      |: sval: sv4: {type: 'void *', &r6}
|      |: type: 'void *'
`-heap: r9: {kind: 'heap', parent: r0, sval: sv2}
  |: sval: sv2: {poisoned: uninit}
  `-r6: {kind: 'symbolic', parent: r9, sval: null, map: {}}
svalues:
  sv0: {type: 'size_t', '1024'}
  sv1: {poisoned: uninit}
  sv2: {poisoned: uninit}
  sv3: {type: 'void *', unknown}
  sv4: {type: 'void *', &r6}
constraint manager:
  equiv classes:
    ec0: {sv0 == '1024'}
    ec1: {sv4}
  constraints:

This is the state at the point of returning from calls_malloc back to test in the following:

void *
calls_malloc (void)
{
  void *result = malloc (1024);
  return result;
}

void test (void)
{
  void *ptr = calls_malloc ();
  /* etc.  */
}

The “root” region (“r0”) has a “stack” child (“r1”), with two children: a frame for test (“r2”), and a frame for calls_malloc (“r4”). These frame regions have child regions for storing their local variables. For example, the region for the return value and various other regions within the “calls_malloc” frame all have value “sv4”, a pointer to a heap-allocated region “r6”. Within the parent frame, ptr_3 has value “sv3”, an unknown void *.

27.1.5 Analyzer Paths

We need to explain to the user what the problem is, and to persuade them that there really is a problem. Hence having a diagnostic_path isn’t just an incidental detail of the analyzer; it’s required.

Paths ought to be interprocedurally valid and feasible.

Without state-merging, all paths in the exploded graph are feasible (in terms of constraints being satisfied). With state-merging, paths in the exploded graph can be infeasible.

We collate warnings and only emit them for the simplest path. For example, for a bug in a utility function with lots of routes to calling it, we only emit the simplest path (which could be intraprocedural, if the bug can be reproduced without a caller). We apply a check that each duplicate warning’s shortest path is feasible, rejecting any warnings for which the shortest path is infeasible (which could lead to false negatives).
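The de-duplication policy can be sketched as follows. This is an illustrative model, not GCC's code: among the candidate paths reaching a diagnostic, pick the shortest, and emit the warning only if that shortest path is feasible:

```c
#include <stddef.h>

/* Illustrative sketch: keep the shortest path to a diagnostic, and
   reject the diagnostic (possibly a false negative) if that shortest
   path is infeasible.  */
struct path
{
  int length;
  int feasible;
};

/* Returns the index of the shortest path if it is feasible,
   or -1 if the diagnostic should be rejected.  */
static int
choose_path (const struct path *paths, size_t n)
{
  int best = -1;
  for (size_t i = 0; i < n; i++)
    if (best < 0 || paths[i].length < paths[best].length)
      best = (int) i;
  if (best >= 0 && !paths[best].feasible)
    return -1;
  return best;
}
```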

We use the shortest feasible exploded_path through the exploded_graph (a list of exploded_edge *) to build a diagnostic_path (a list of events for the diagnostic subsystem) - specifically a checker_path.

Having built the checker_path, we prune it to try to eliminate events that aren’t relevant, to minimize how much the user has to read.

After pruning, we notify each event in the path of its ID and record the IDs of interesting events, allowing events to refer to other events in their descriptions. The pending_diagnostic class has various vfuncs to support emitting more precise descriptions.

At this point we can emit the diagnostic.

27.1.6 Limitations

Some ideas for other checkers

