Static analysis of Python extension modules using GCC

David Malcolm, Red Hat

dmalcolm@redhat.com

PyCon US 2012

Prerequisites

I'm going to assume basic familiarity with Python, and with either C or C++

Hopefully you've used, debugged, or written a Python extension module in C (perhaps via SWIG or Cython)

Straw poll?

Overview

Code: https://fedorahosted.org/gcc-python-plugin/

GNU General Public License v3 (or later)

C extension code

The Problem

How many mistakes can you see in the following C code?

PyObject *
buggy_list_of_random_ints(PyObject *self,
                          PyObject *args)
{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* continued on the next slide... */
          

The Problem (continued)

{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* ...continued from previous slide */
    list = PyList_New(0);

    for (i = 0; i < count; i++) {
        item = PyLong_FromLong(random());
        PyList_Append(list, item);
    }

    return list;
}

What could possibly go wrong?

Mismatching types to PyArg_ParseTuple

{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
        return NULL;
    }

If sizeof(long) > sizeof(int), only part of "count" will be written to.

Total lack of error-checking

PyList_New() can fail if there's not enough memory, returning NULL:

list = PyList_New(0); // BUG!

for (i = 0; i < count; i++) {
    item = PyLong_FromLong(random());
    PyList_Append(list, item); //BUG!
}
          

and PyList_Append(NULL, item) will then segfault the interpreter

(even if "item" is also NULL)

Total lack of error-checking: part 2

PyLong_FromLong() can fail if there's not enough memory, returning NULL:

item = PyLong_FromLong(random());  //BUG!
PyList_Append(list, item);
          

though in this case PyList_Append(list, NULL) will not segfault the interpreter.

(Does anyone know all of the rules for which CPython extension function arguments are safe to pass NULL to, and which aren't? More to come...)

Total lack of error-checking: part 3

PyList_Append() can fail if there's not enough memory, returning -1:

    PyList_Append(list, item);

If this happens, the length of the list will be less than what the caller was expecting.

Reference leaks

item = PyLong_FromLong(random());
PyList_Append(list, item);

PyLong_FromLong() returns "a new reference" (if it succeeds)

PyList_Append() doesn't steal the reference, it adds a new one (if it succeeds).

Whether or not PyList_Append() succeeds or fails, "item" still owns a reference on the result of PyLong_FromLong (if non-NULL).

The code needs a Py_DECREF(item)

(as well as all the error checking)

Why are reference leaks bad?

The random ints will become immortal, and will not be cleaned up until the process exits, using up memory unnecessarily.

This will foul up CPython's arena allocators: entire 256KB chunks of RAM can be forced to be kept in-use just to keep a leaked object alive.

But this is a compiled language!

I got sick of tracking down references leaks by hand.

How can we automatically detect these bugs?

There are hundreds of packages within, say, Fedora 17 that link against libpython, written in a mixture of C, C++ (plus inline assembler, and even some Fortran)

Image source: http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

Automated code analysis

Automated code analysis (continued)

Automated code analysis (continued)

Aha! Embed Python scripting into GCC

Embedding CPython within GCC

GCC 4.5 gained the ability to have plugins

(My plugin actually requires GCC >= 4.6 for other reasons)

I quickly got a "hello world" GCC plugin working:

            int
            plugin_init (/* args omitted */)
            {
                /* plugin boilerplate omitted */
                Py_Initialize();
                PyRun_SimpleString("from time import time,ctime\n"
                "print 'Today is',ctime(time())\n");
                Py_Finalize();
            }
          

(this is the example from http://docs.python.org/extending/embedding.html )

It worked

            $ gcc -fplugin=python.so test.c
            Today is Wed Mar 20 23:10:15 2011
          

Generalize it to run arbitrary Python scripts:

$ ./gcc-with-python script.py test.c

Getting at GCC's internal representation

Wrapping GCC's types

Wrapping a million-line, 25-year-old body of C code in Python

GCC's internals, as seen by the Python plugin

Our buggy example code is processed by the C frontend

Converted to a form called "GIMPLE":

Simple expressions, mostly just 3 operands at a time.

The GIMPLE is divided into a control flow graph of basic blocks in which any branching logic is done between basic blocks.

I am grossly oversimplifying things

(Etymology: the GCC version of SIMPLE IL, from the McCAT compiler project at McGill University)

Our first new GCC compilation pass

Here's a Python script to run inside the plugin to print the structure:

import gcc

# We can create custom compilation pass by subclassing e.g. gcc.GimplePass:
class DumpGimple(gcc.GimplePass):
  # passes have execute() methods:
  def execute(self, fun):

    # see next slide
          
class DumpGimple(gcc.GimplePass):
  # passes have execute() methods:
  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)

    for bb in fun.cfg.basic_blocks:

      # we now have a gcc.BasicBlock
      print('BLOCK %i' % bb.index)

      # see next slide...
          
  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)
    for bb in fun.cfg.basic_blocks:
      print('BLOCK %i' % bb.index)

      # (from previous slide)

      if bb.gimple:
        for stmt in bb.gimple:
          # we now have a gcc.Gimple
          print(' %r' % stmt)
          print('   %s' % stmt)

      for edge in bb.succs:
        print(' edge to block %i'
              % edge.dest.index)

  # see next slide
          
class DumpGimple(gcc.GimplePass):
  def execute(self, fun):
      # (body of function omitted)

# Finally, we have to instantiate our compilation pass, and register it
# relative to one of GCC's existing passes:
ps = DumpGimple(name='dump-gimple')
ps.register_after('cfg')
          

Don't call your pass "pass" - it's a reserved word!

We can now run our script like this:

$ ./gcc-with-python \
         dump-gimple.py \
           -I/usr/include/python2.7 \
         bug.c
          

(bug.c is Python extension code, hence we need the -I compilation flag here so gcc can find <Python.h>)

Our buggy example C code

PyObject *
buggy_list_of_random_ints(PyObject *self,
                          PyObject *args)
{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* etc */
          

Here's the output

FUNCTION: make_a_list_of_random_ints_badly
BLOCK 0
  edge to block 2
BLOCK 1
BLOCK 2
 gcc.GimpleCall()
   D.12024 = PyArg_ParseTuple (args, "i", &count);
 gcc.GimpleCond()
   if (D.12024 == 0)
  edge to block 3
  edge to block 4
BLOCK 3
 gcc.GimpleAssign()
   D.12027 = 0B;
  edge to block 8
        
BLOCK 4
 gcc.GimpleCall()
   list = PyList_New (0);
 gcc.GimpleAssign()
   i = 0;
  edge to block 6
BLOCK 5
 gcc.GimpleCall()
   D.12028 = random ();
 gcc.GimpleCall()
   item = PyLong_FromLong (D.12028);
 gcc.GimpleCall()
   PyList_Append (list, item);
 gcc.GimpleAssign()
   i = i + 1;
  edge to block 6
        
BLOCK 6
 gcc.GimpleAssign()
   count.0 = count;
 gcc.GimpleCond()
   if (i < count.0)
  edge to block 5
  edge to block 7
BLOCK 7
 gcc.GimpleAssign()
   D.12027 = list;
  edge to block 8
BLOCK 8
 gcc.GimpleReturn()
   return D.12027;
  edge to block 1
          

Too much text!

We're walking over GCC's internal representation, but it's hard to see what's going on

So I wrote a "show-gimple.py" script to visualize the structure, using GraphViz:

            $ ./gcc-with-python \
                     examples/show-gimple.py \
                       -I/usr/include/python2.7 \
                     bug.c
          

Many optimization passes later...

Various other internal representations that aren't relevant for our purposes today

What have we achieved?

We can now create new GCC warnings using Python, and perform other code analysis

Examples (by other people):

cpychecker: a static analyzer for CPython extension modules

Code: https://fedorahosted.org/gcc-python-plugin/

Structure of the checker

Interpreter for GIMPLE (simple form of abstract interpretation)

See libcpychecker/absinterp.py in the plugin's source tree

We generate a set of all possible traces through the function

(but give up when it gets too complicated)

A trace is a recording of a list of states, with transitions between them.

Structure of the checker (continued)

Transitions can have descriptive information (e.g. "when PyList_Append() fails")

A state has:

(this lets us handle the semantics of pointers)

AbstractValue subclasses

The checker "knows" how to perform operations on combinations of these: addition, subtraction, comparisons, etc

e.g. most operations on a DeallocatedMemory() instance will be detected as an error

Facets

States are pluggable: extra domain-specific information ("facets") can be added.

For now, the only plugin is a "cpython" one which adds:

See libcpychecker/refcounts.py in the plugin's source tree

Example API implementation

Given a C function e.g. PyMapping_Size, the cpython facet can provide an implementation which describes how to get from an input state to a list of all possible transitions:

def impl_PyMapping_Size(self, stmt, v_o):
    fnmeta = FnMeta(name='PyMapping_Size',
                    docurl='http://docs.python.org/c-api/mapping.html#PyMapping_Size',
                    prototype='Py_ssize_t PyMapping_Size(PyObject *o)',
                    defined_in='Objects/abstract.c',
                    notes='Can cope with NULL (sets exception)')
    # (continued on next slide)
        
# (continued from last slide)

    t_success = \
       self.state.mktrans_assignment(stmt.lhs,
         UnknownValue.make(stmt.lhs.type,
                           stmt.loc),
         fnmeta.desc_when_call_succeeds())
    t_failure = \
       self.state.mktrans_assignment(stmt.lhs,
         ConcreteValue(stmt.lhs.type,
                       stmt.loc,
                       -1),
         fnmeta.desc_when_call_fails())
    t_failure.dest.cpython.set_exception(
         'PyExc_TypeError',
         stmt.loc)
    return [t_success, t_failure]
          

We can also describe how an API can fail:

def impl_PyList_Append(self, stmt,
                       v_op, v_newitem):
    fnmeta = FnMeta(name='PyList_Append', )
                    # etc
    self.state.raise_any_null_ptr_func_arg(
       stmt, 0, v_op,
       why=invokes_Py_TYPE_via_macro(
              fnmeta, 'PyList_Check'))

    # It handles newitem being NULL:
    if v_newitem.is_null_ptr():
        s_failure = \
           self.state.mkstate_concrete_return_of(
              stmt, -1)
        s_failure.cpython.bad_internal_call(stmt.loc)
        return [Transition(self.state,
                           s_failure,
                           'returning -1 from %s() due to NULL item' % fnmeta.name)]
          

Trying it out on our buggy example code

It emits new compilation warnings:

            $ ./gcc-with-cpychecker \
                  -I/usr/include/python2.7/ \
                 bug.c
            bug.c: In function ‘make_a_list_of_random_ints_badly’:
            bug.c:10:26: warning: Mismatching type in call to PyArg_ParseTuple with format code "i" [enabled by default]
            argument 3 ("&count") had type
              "long int *" (pointing to 64 bits)
            but was expecting
              "int *" (pointing to 32 bits)
            for format code "i"
          

It detects the reference leak of "item"

            bug.c:22:1: warning: ob_refcnt of '*item' is 1 too high [enabled by default]
            bug.c:22:1: note: was expecting final ob_refcnt to be N + 1 (for some unknown N)
            bug.c:22:1: note: due to object being referenced by: PyListObject.ob_item[0]
            bug.c:22:1: note: but final ob_refcnt is N + 2
            bug.c:17:14: note: PyLongObject allocated at:   item = PyLong_FromLong(random());
          

(many lines of output describing the path through the function)

HTML reports

The above information is too verbose, so the checker also emits the same information in HTML form:

            bug.c:6:1: note: graphical error report for
            function 'make_a_list_of_random_ints_badly' written out to 'bug.c.make_a_list_of_random_ints_badly-refcount-errors.html'
          

Reference-counting model

For every object and possible trace it tracks all references that it knows about, and assumes that there is some unknown number of other references N that it isn't tracking.

It knows the lower limit of N, and can use this when analyzing Py_DECREF() invocations.

Lack of error-checking

It also detects the lack of error-checking for the PyList_New() call...

Trying it on real-world code

I used the checker on all of the rpms in Fedora 17 that link against libpython2.7

(etc)

(etc; allocate new ref on "hypstr_obj")
int rpmfdFromPyObject(PyObject *obj,
                      rpmfdObject **fdop)
{
    rpmfdObject *fdo = NULL;

    /* (logic to create "fdo" omitted,
        to keep it all on one slide) */

    if (Ferror(fdo->fd)) {
        Py_DECREF(fdo);
        PyErr_SetString(
              PyExc_IOError,
              Fstrerror(fdo->fd));
        return 0;
    }
    *fdop = fdo;
    return 1;
}
          
rpmfd-py.c:34:17: warning: reading from deallocated memory at rpmfd-py.c:34: memory deallocated at rpmfd-py.c:32
rpmfd-py.c:31:8: note: taking True path at:
     if (Ferror(fdo->fd)) {
rpmfd-py.c:32:2: note: reaching:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: when taking False path at:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: calling tp_dealloc on new ref from call to PyObject_Call allocated at rpmfd-py.c:26 at:
         Py_DECREF(fdo);
rpmfd-py.c:35:17: note: reaching:
             Fstrerror(fdo->fd));
          

Limitations

Summary

Getting involved

Please try the checker on your extension code!

Please report any false positives or tracebacks you run into

Help build out the coverage of the CPython API

Ideas for other GCC hacks using Python?

Open Space here at PyCon: 7pm tonight

Extra slides

Py_DECREF() as seen by the compiler

It's working on the preprocessed source

After preprocessing, a Py_DECREF(op) looks like this:

          do { if ( --((PyObject*)(op))->ob_refcnt != 0) ; else ( (*(((PyObject*)((PyObject *)(op)))->ob_type)->tp_dealloc)((PyObject *)((PyObject *)(op)))); } while (0);
          
or, simplified (removing casts):
          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }
          

Py_DECREF() continued

          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }
          

All non-NULL (PyObject*) values are assigned an ob_refcnt value of type RefcountValue

A RefcountValue instance "knows" how to be incremented, decremented and compared against 0 (amongst other operations)

All non-NULL (PyObject*) have their ob_type set to a sane PyTypeObject. The tp_dealloc is set to an custom AbstractValue representing a sane function pointer that marks the object as a "poisoned" DeallocatedMemory value

For the compiler geeks...

The GIMPLE is converted to Static Single Assignment form, which we can also see from Python

            $ ./gcc-with-python \
                     examples/show-ssa.py \
                       -I/usr/include/python2.7 \
                     bug.c
          

Many optimization passes later...

The GIMPLE-SSA form is converted to another representation Register Transfer Language, which is close to machine code

I've only barely wrapped this part of GCC in Python. Patches welcome!

Many further optimization passes later, this gets converted to assembler, and then machine code