Static analysis of Python extension modules using GCC

David Malcolm, Red Hat

dmalcolm@redhat.com

PyCon US 2012

Prerequisites

I'm going to assume basic familiarity with Python, and with either C or C++

Hopefully you've used, debugged, or written a Python extension module in C (perhaps via SWIG or Cython)

Straw poll?

Overview

Typical bugs in CPython extension modules
Hacking on the insides of GCC using Python
Adding compilation warnings to GCC for CPython extension modules
Large-scale usage of the analyzer
Q & A

Code: https://fedorahosted.org/gcc-python-plugin/

GNU General Public License v3 (or later)

C extension code

great for wrapping arbitrary C and C++ libraries
historically, used for speed also
but very easy to get things wrong...

The Problem

How many mistakes can you see in the following C code?

PyObject *
buggy_list_of_random_ints(PyObject *self,
                          PyObject *args)
{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* continued on the next slide... */

The Problem (continued)

{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* ...continued from previous slide */
    list = PyList_New(0);

    for (i = 0; i < count; i++) {
        item = PyLong_FromLong(random());
        PyList_Append(list, item);
    }

    return list;
}

What could possibly go wrong?

Mismatching types to PyArg_ParseTuple

{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
        return NULL;
    }

If sizeof(long) > sizeof(int), only part of "count" will be written to.

Total lack of error-checking

PyList_New() can fail if there's not enough memory, returning NULL:

list = PyList_New(0); // BUG!

for (i = 0; i < count; i++) {
    item = PyLong_FromLong(random());
    PyList_Append(list, item); //BUG!
}

and PyList_Append(NULL, item) will then segfault the interpreter

(even if "item" is also NULL)

Total lack of error-checking: part 2

PyLong_FromLong() can fail if there's not enough memory, returning NULL:

item = PyLong_FromLong(random());  //BUG!
PyList_Append(list, item);

though in this case PyList_Append(list, NULL) will not segfault the interpreter.

(Does anyone know all of the rules for which CPython extension function arguments are safe to pass NULL to, and which aren't? More to come...)

Total lack of error-checking: part 3

PyList_Append() can fail if there's not enough memory, returning -1:

    PyList_Append(list, item);

If this happens, the length of the list will be less than what the caller was expecting.

Reference leaks

item = PyLong_FromLong(random());
PyList_Append(list, item);

PyLong_FromLong() returns "a new reference" (if it succeeds)

PyList_Append() doesn't steal the reference, it adds a new one (if it succeeds).

Whether or not PyList_Append() succeeds or fails, "item" still owns a reference on the result of PyLong_FromLong (if non-NULL).

The code needs a Py_DECREF(item)

(as well as all the error checking)

Why are reference leaks bad?

The random ints will become immortal, and will not be cleaned up until the process exits, using up memory unnecessarily.

This will foul up CPython's arena allocators: entire 256KB chunks of RAM can be forced to be kept in-use just to keep a leaked object alive.

But this is a compiled language!

I got sick of tracking down references leaks by hand.

How can we automatically detect these bugs?

There are hundreds of packages within, say, Fedora 17 that link against libpython, written in a mixture of C, C++ (plus inline assembler, and even some Fortran)

Image source: http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

Automated code analysis

http://clang-analyzer.llvm.org/
- handles C and Objective-C only
- written in C++, with a Perl driver
https://sparse.wiki.kernel.org/
- handles C only
- written for the Linux kernel, in C
http://cil.sourceforge.net/
- handles C only; written in oCaml
- See e.g. http://berrange.com/posts/2009/05/15/static-analysis-to-validate-mutex-locking-in-libvirt-using-ocaml-cil/

Automated code analysis (continued)

http://coccinelle.lip6.fr/
- See my experiment from November 2009 on PyArg_ParseTuple: http://dmalcolm.livejournal.com/3689.html
- allows autogeneration of patches: i.e. fixing bugs, not just finding them
- handles C only; written in oCaml
- offers some limited Python scriptability
using a Python library to parse C
- e.g. http://code.google.com/p/pycparser/
- e.g. https://launchpad.net/pyclibrary
- seem to handle C only
- have to build all further logic on top of them (e.g. sizes of data types?)
- what about all the GCC extensions we're using?

Automated code analysis (continued)

proprietary tools
- (non-starter for Fedora: everything built using Open Source/Free Software)
GCC plugins
- all our code is already going through GCC
- ...but would need to be written in C

Aha! Embed Python scripting into GCC

use the same compiler we're already using
create the ability to write GCC plugins in Python
then write what we need in Python
Python scripting has been great for gdb

Embedding CPython within GCC

GCC 4.5 gained the ability to have plugins

(My plugin actually requires GCC >= 4.6 for other reasons)

I quickly got a "hello world" GCC plugin working:

            int
            plugin_init (/* args omitted */)
            {
                /* plugin boilerplate omitted */
                Py_Initialize();
                PyRun_SimpleString("from time import time,ctime\n"
                "print 'Today is',ctime(time())\n");
                Py_Finalize();
            }

(this is the example from http://docs.python.org/extending/embedding.html )

It worked

            $ gcc -fplugin=python.so test.c
            Today is Wed Mar 20 23:10:15 2011

Generalize it to run arbitrary Python scripts:

$ ./gcc-with-python script.py test.c

Getting at GCC's internal representation

This doesn't do anything useful until we can get at GCC's data structures from the Python code

Wrapping GCC's types

Wrapping a million-line, 25-year-old body of C code in Python

I originally tried Cython
But GCC is too "baroque"
- several different inheritance hierarchies
- heavy use of the C preprocessor
- access macros for everything
- all types are littered with macros expressing garbage-collection metadata
So I wrote my own C code-generator (about 500 lines of Python)

GCC's internals, as seen by the Python plugin

Our buggy example code is processed by the C frontend

Converted to a form called "GIMPLE":

Simple expressions, mostly just 3 operands at a time.

The GIMPLE is divided into a control flow graph of basic blocks in which any branching logic is done between basic blocks.

I am grossly oversimplifying things

(Etymology: the GCC version of SIMPLE IL, from the McCAT compiler project at McGill University)

Our first new GCC compilation pass

Here's a Python script to run inside the plugin to print the structure:

import gcc

# We can create custom compilation pass by subclassing e.g. gcc.GimplePass:
class DumpGimple(gcc.GimplePass):
  # passes have execute() methods:
  def execute(self, fun):

    # see next slide

class DumpGimple(gcc.GimplePass):
  # passes have execute() methods:
  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)

    for bb in fun.cfg.basic_blocks:

      # we now have a gcc.BasicBlock
      print('BLOCK %i' % bb.index)

      # see next slide...

  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)
    for bb in fun.cfg.basic_blocks:
      print('BLOCK %i' % bb.index)

      # (from previous slide)

      if bb.gimple:
        for stmt in bb.gimple:
          # we now have a gcc.Gimple
          print(' %r' % stmt)
          print('   %s' % stmt)

      for edge in bb.succs:
        print(' edge to block %i'
              % edge.dest.index)

  # see next slide

class DumpGimple(gcc.GimplePass):
  def execute(self, fun):
      # (body of function omitted)

# Finally, we have to instantiate our compilation pass, and register it
# relative to one of GCC's existing passes:
ps = DumpGimple(name='dump-gimple')
ps.register_after('cfg')

Don't call your pass "pass" - it's a reserved word!

We can now run our script like this:

$ ./gcc-with-python \
         dump-gimple.py \
           -I/usr/include/python2.7 \
         bug.c

(bug.c is Python extension code, hence we need the -I compilation flag here so gcc can find <Python.h>)

Our buggy example C code

PyObject *
buggy_list_of_random_ints(PyObject *self,
                          PyObject *args)
{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* etc */

Here's the output

FUNCTION: make_a_list_of_random_ints_badly
BLOCK 0
  edge to block 2
BLOCK 1
BLOCK 2
 gcc.GimpleCall()
   D.12024 = PyArg_ParseTuple (args, "i", &count);
 gcc.GimpleCond()
   if (D.12024 == 0)
  edge to block 3
  edge to block 4
BLOCK 3
 gcc.GimpleAssign()
   D.12027 = 0B;
  edge to block 8

BLOCK 4
 gcc.GimpleCall()
   list = PyList_New (0);
 gcc.GimpleAssign()
   i = 0;
  edge to block 6
BLOCK 5
 gcc.GimpleCall()
   D.12028 = random ();
 gcc.GimpleCall()
   item = PyLong_FromLong (D.12028);
 gcc.GimpleCall()
   PyList_Append (list, item);
 gcc.GimpleAssign()
   i = i + 1;
  edge to block 6

BLOCK 6
 gcc.GimpleAssign()
   count.0 = count;
 gcc.GimpleCond()
   if (i < count.0)
  edge to block 5
  edge to block 7
BLOCK 7
 gcc.GimpleAssign()
   D.12027 = list;
  edge to block 8
BLOCK 8
 gcc.GimpleReturn()
   return D.12027;
  edge to block 1

Too much text!

We're walking over GCC's internal representation, but it's hard to see what's going on

So I wrote a "show-gimple.py" script to visualize the structure, using GraphViz:

            $ ./gcc-with-python \
                     examples/show-gimple.py \
                       -I/usr/include/python2.7 \
                     bug.c

Many optimization passes later...

Various other internal representations that aren't relevant for our purposes today

GIMPLE-SSA
Register Transfer Language
Assembler
Machine code

What have we achieved?

We can now create new GCC warnings using Python, and perform other code analysis

Examples (by other people):

two different custom warnings created for GDB
http://sourceware.org/ml/gdb-patches/2011-06/msg00408.html

http://sourceware.org/ml/gdb/2011-11/msg00002.html
an analysis script written for LibreOffice:
https://bugs.freedesktop.org/show_bug.cgi?id=43460
porting Emacs between two different Lisp dialects, using Python to port the relevant C code to Lisp
http://tromey.com/blog/?p=709

http://tromey.com/blog/?p=751

cpychecker: a static analyzer for CPython extension modules

gcc-with-cpychecker
The real work is within libcpychecker in the source tree
About 10k lines of Python code, much of which is specific to analyzing CPython

Code: https://fedorahosted.org/gcc-python-plugin/

Structure of the checker

Interpreter for GIMPLE (simple form of abstract interpretation)

See libcpychecker/absinterp.py in the plugin's source tree

We generate a set of all possible traces through the function

(but give up when it gets too complicated)

A trace is a recording of a list of states, with transitions between them.

Structure of the checker (continued)

Transitions can have descriptive information (e.g. "when PyList_Append() fails")

A state has:

location
mapping from variables to storage region
mapping from storage region to AbstractValue instances

(this lets us handle the semantics of pointers)

AbstractValue subclasses

UnknownValue(): we don't know anything about this value
ConcreteValue(42): a specific, known value
WithinRange(): e.g. between 0 and 0x7fffffff
PointerToRegion(): a non-NULL pointer to the given storage region
"poisoned" values: DeallocatedMemory(), UninitializedData() etc

The checker "knows" how to perform operations on combinations of these: addition, subtraction, comparisons, etc

e.g. most operations on a DeallocatedMemory() instance will be detected as an error

Facets

States are pluggable: extra domain-specific information ("facets") can be added.

For now, the only plugin is a "cpython" one which adds:

exception state (the PyObject* for the thread-local exception instance, or NULL)

GIL state

a RefcountValue() subclass of AbstractValue, implementing ob_refcnt behaviors

detailed behavior information on much of the CPython API

See libcpychecker/refcounts.py in the plugin's source tree

Example API implementation

Given a C function e.g. PyMapping_Size, the cpython facet can provide an implementation which describes how to get from an input state to a list of all possible transitions:

def impl_PyMapping_Size(self, stmt, v_o):
    fnmeta = FnMeta(name='PyMapping_Size',
                    docurl='http://docs.python.org/c-api/mapping.html#PyMapping_Size',
                    prototype='Py_ssize_t PyMapping_Size(PyObject *o)',
                    defined_in='Objects/abstract.c',
                    notes='Can cope with NULL (sets exception)')
    # (continued on next slide)

# (continued from last slide)

    t_success = \
       self.state.mktrans_assignment(stmt.lhs,
         UnknownValue.make(stmt.lhs.type,
                           stmt.loc),
         fnmeta.desc_when_call_succeeds())
    t_failure = \
       self.state.mktrans_assignment(stmt.lhs,
         ConcreteValue(stmt.lhs.type,
                       stmt.loc,
                       -1),
         fnmeta.desc_when_call_fails())
    t_failure.dest.cpython.set_exception(
         'PyExc_TypeError',
         stmt.loc)
    return [t_success, t_failure]

We can also describe how an API can fail:

def impl_PyList_Append(self, stmt,
                       v_op, v_newitem):
    fnmeta = FnMeta(name='PyList_Append', )
                    # etc
    self.state.raise_any_null_ptr_func_arg(
       stmt, 0, v_op,
       why=invokes_Py_TYPE_via_macro(
              fnmeta, 'PyList_Check'))

    # It handles newitem being NULL:
    if v_newitem.is_null_ptr():
        s_failure = \
           self.state.mkstate_concrete_return_of(
              stmt, -1)
        s_failure.cpython.bad_internal_call(stmt.loc)
        return [Transition(self.state,
                           s_failure,
                           'returning -1 from %s() due to NULL item' % fnmeta.name)]

Trying it out on our buggy example code

It emits new compilation warnings:

            $ ./gcc-with-cpychecker \
                  -I/usr/include/python2.7/ \
                 bug.c
            bug.c: In function ‘make_a_list_of_random_ints_badly’:
            bug.c:10:26: warning: Mismatching type in call to PyArg_ParseTuple with format code "i" [enabled by default]
            argument 3 ("&count") had type
              "long int *" (pointing to 64 bits)
            but was expecting
              "int *" (pointing to 32 bits)
            for format code "i"

It detects the reference leak of "item"

            bug.c:22:1: warning: ob_refcnt of '*item' is 1 too high [enabled by default]
            bug.c:22:1: note: was expecting final ob_refcnt to be N + 1 (for some unknown N)
            bug.c:22:1: note: due to object being referenced by: PyListObject.ob_item[0]
            bug.c:22:1: note: but final ob_refcnt is N + 2
            bug.c:17:14: note: PyLongObject allocated at:   item = PyLong_FromLong(random());

(many lines of output describing the path through the function)

HTML reports

The above information is too verbose, so the checker also emits the same information in HTML form:

            bug.c:6:1: note: graphical error report for
            function 'make_a_list_of_random_ints_badly' written out to 'bug.c.make_a_list_of_random_ints_badly-refcount-errors.html'

Reference-counting model

For every object and possible trace it tracks all references that it knows about, and assumes that there is some unknown number of other references N that it isn't tracking.

It knows the lower limit of N, and can use this when analyzing Py_DECREF() invocations.

Lack of error-checking

It also detects the lack of error-checking for the PyList_New() call...

Trying it on real-world code

I used the checker on all of the rpms in Fedora 17 that link against libpython2.7

I've filed 54 bug reports so far where the checker found genuine problems (14%)
7 of these have now been patched
50 packages not requiring a bug to be filed (13%)
78 packages waiting on a fix for C++ support (21%)
17 packages waiting on better SWIG support (4%)
172 packages requiring other followup work (sigh) (46%)

(etc)

(etc; allocate new ref on "hypstr_obj")

int rpmfdFromPyObject(PyObject *obj,
                      rpmfdObject **fdop)
{
    rpmfdObject *fdo = NULL;

    /* (logic to create "fdo" omitted,
        to keep it all on one slide) */

    if (Ferror(fdo->fd)) {
        Py_DECREF(fdo);
        PyErr_SetString(
              PyExc_IOError,
              Fstrerror(fdo->fd));
        return 0;
    }
    *fdop = fdo;
    return 1;
}

rpmfd-py.c:34:17: warning: reading from deallocated memory at rpmfd-py.c:34: memory deallocated at rpmfd-py.c:32
rpmfd-py.c:31:8: note: taking True path at:
     if (Ferror(fdo->fd)) {
rpmfd-py.c:32:2: note: reaching:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: when taking False path at:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: calling tp_dealloc on new ref from call to PyObject_Call allocated at rpmfd-py.c:26 at:
         Py_DECREF(fdo);
rpmfd-py.c:35:17: note: reaching:
             Fstrerror(fdo->fd));

Limitations

It only complains about crashers and reference-count issues
It doesn't complain about the lack of error checking for PyLong_FromLong: this failure case can't crash the interpreter.
It only follows 0 and 1 times through any loop
Generated code (Cython and SWIG) tends to show false positives (I'm working on fixing this)

Summary

Bugs typically occurring in CPython extension modules
GCC plugin that embeds Python inside the compiler
The gcc-with-cpychecker static analyzer
Status of using the checker on real-world code
We're hiring!

http://www.redhat.com/about/work/

Getting involved

Please try the checker on your extension code!

Please report any false positives or tracebacks you run into

Grab it from git
https://fedorahosted.org/gcc-python-plugin
"make" to build the plugin and run the selftest suite

Building a distutils project using the checker:

                $ export CC=/path/to/plugin/gcc-with-cpychecker
                $ python setup.py build

Building a "Makefile"-based project using the checker:

                $ make CC=/path/to/plugin/gcc-with-cpychecker

Help build out the coverage of the CPython API

Ideas for other GCC hacks using Python?

Open Space here at PyCon: 7pm tonight

Extra slides

Py_DECREF() as seen by the compiler

It's working on the preprocessed source

After preprocessing, a Py_DECREF(op) looks like this:

          do { if ( --((PyObject*)(op))->ob_refcnt != 0) ; else ( (*(((PyObject*)((PyObject *)(op)))->ob_type)->tp_dealloc)((PyObject *)((PyObject *)(op)))); } while (0);

or, simplified (removing casts):

          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }

Py_DECREF() continued

          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }

All non-NULL (PyObject*) values are assigned an ob_refcnt value of type RefcountValue

A RefcountValue instance "knows" how to be incremented, decremented and compared against 0 (amongst other operations)

All non-NULL (PyObject*) have their ob_type set to a sane PyTypeObject. The tp_dealloc is set to an custom AbstractValue representing a sane function pointer that marks the object as a "poisoned" DeallocatedMemory value

For the compiler geeks...

The GIMPLE is converted to Static Single Assignment form, which we can also see from Python

            $ ./gcc-with-python \
                     examples/show-ssa.py \
                       -I/usr/include/python2.7 \
                     bug.c

Many optimization passes later...

The GIMPLE-SSA form is converted to another representation Register Transfer Language, which is close to machine code

I've only barely wrapped this part of GCC in Python. Patches welcome!

Many further optimization passes later, this gets converted to assembler, and then machine code