Static analysis of Python extension modules using GCC

David Malcolm, Red Hat

dmalcolm@redhat.com

PyCon US 2012

Prerequisites

I'm going to assume basic familiarity with Python, and with either C or C++

Hopefully you've used, debugged, or written a Python extension module in C (perhaps via SWIG or Cython)

Straw poll?

Overview

Typical bugs in CPython extension modules
Hacking on the insides of GCC using Python
Adding compilation warnings to GCC for CPython extension modules
Q & A

Code: https://fedorahosted.org/gcc-python-plugin/

GNU General Public License v3 (or later)

C extension code

great for wrapping arbitrary C and C++ libraries
historically, used for speed also
but very easy to get things wrong...

The Problem

How many mistakes can you see in the following C code?

PyObject *
buggy_list_of_random_ints(PyObject *self,
                          PyObject *args)
{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* continued on the next slide... */

The Problem (continued)

{
    PyObject *list, *item;
    long count, i;

    if (!PyArg_ParseTuple(args, "i",
                          &count)) {
         return NULL;
    }
    /* ...continued from previous slide */
    list = PyList_New(0);

    for (i = 0; i < count; i++) {
        item = PyLong_FromLong(random());
        PyList_Append(list, item);
    }

    return list;
}

What could possibly go wrong?

Mismatching types to PyArg_ParseTuple

long count, i;

if (!PyArg_ParseTuple(args, "i", &count)) {
     return NULL;
}

If sizeof(long) > sizeof(int), only part of "count" will be written to.

Total lack of error-checking

PyList_New() can fail if there's not enough memory, returning NULL:

list = PyList_New(0); // BUG!

for (i = 0; i < count; i++) {
    item = PyLong_FromLong(random());
    PyList_Append(list, item); //BUG!
}

and PyList_Append(NULL, item) will then segfault the interpreter

(even if "item" is also NULL)

Total lack of error-checking: part 2

PyLong_FromLong() can fail if there's not enough memory, returning NULL:

item = PyLong_FromLong(random());  //BUG!
PyList_Append(list, item);

though in this case PyList_Append(list, NULL) will not segfault the interpreter.

(Does anyone know all of the rules for which CPython extension function arguments are safe to pass NULL to, and which aren't? More to come...)

Total lack of error-checking: part 3

PyList_Append() can fail if there's not enough memory, returning -1:

PyList_Append(list, item);

If this happens, the length of the list will be less than what the caller was expecting.

Reference leaks

item = PyLong_FromLong(random());
PyList_Append(list, item);

PyLong_FromLong() returns "a new reference" (if it succeeds)

PyList_Append() doesn't steal the reference, it adds a new one (if it succeeds).

Whether or not PyList_Append() succeeds or fails, "item" still owns a reference on the result of PyLong_FromLong (if non-NULL).

The code needs a Py_DECREF(item)

(as well as all the error checking)

Why are reference leaks bad?

The random ints will become immortal, and will not be cleaned up until the process exits, using up memory unnecessarily.

This will foul up CPython's arena allocators: entire 256KB chunks of RAM can be forced to be kept in-use just to keep a leaked object alive.

But this is a compiled language!

I got sick of tracking down references leaks by hand.

How can we automatically detect these bugs?

There are hundreds of packages within, say, Fedora 17 that link against libpython, written in a mixture of C, C++ (plus inline assembler, and even some Fortran)

Image source: http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html

Approaches to code analysis

LLVM's clang-analyzer
Sparse
CIL
Coccinelle
using a Python library to parse C
proprietary tools
GCC plugins

LLVM

LLVM's clang-analyzer

http://clang-analyzer.llvm.org/

written in C++, with a Perl driver
handles C and Objective-C only

Sparse

https://sparse.wiki.kernel.org/

written for the Linux kernel, in C
handles C only

CIL

http://cil.sourceforge.net/

See e.g. the work we did to detect errors in libvirt:

http://berrange.com/posts/2009/05/15/static-analysis-to-validate-mutex-locking-in-libvirt-using-ocaml-cil/

written in oCaml
handles C only

Coccinelle

http://coccinelle.lip6.fr/

See my experiment from November 2009 on PyArg_ParseTuple

http://dmalcolm.livejournal.com/3689.html

allows autogeneration of patches: i.e. fixing bugs, not just finding them
written in oCaml
offers some limited Python scriptability
handles C only

Using a Python library to parse C

e.g. http://code.google.com/p/pycparser/
e.g. https://launchpad.net/pyclibrary
seem to handle C only
have to build all further logic on top of them (e.g. sizes of data types?)
what about all the GCC extensions we're using?

Proprietary tools

(non-starter for Fedora: everything built using Open Source/Free Software)

GCC plugins

all our code is already going through GCC
no need to have to deal with C++, Perl, or OCaml
...but would need to be written in C

Aha! Embed Python scripting into GCC

use the same compiler we're already using
create the ability to write GCC plugins in Python
then write what we need in Python
Python scripting has been great for gdb

Embedding CPython within GCC

GCC 4.5 gained the ability to have plugins

(My plugin actually requires GCC >= 4.6 for other reasons)

I quickly got a "hello world" GCC plugin working:

            int
            plugin_init (/* args omitted */)
            {
                /* plugin boilerplate omitted */
                Py_Initialize();
                PyRun_SimpleString("from time import time,ctime\n"
                "print 'Today is',ctime(time())\n");
                Py_Finalize();
            }

(this is the example from http://docs.python.org/extending/embedding.html )

It worked

            $ gcc -fplugin=python.so test.c
            Today is Wed Mar 20 23:10:15 2011

Generalize it to run arbitrary Python scripts:

$ ./gcc-with-python script.py test.c

Getting at GCC's internal representation

This doesn't do anything useful until we can get at GCC's data structures from the Python code

Wrapping GCC's types

Wrapping a million-line, 25-year-old body of C code in Python

I originally tried Cython
But GCC is too "baroque"
- several different inheritance hierarchies
- heavy use of the C preprocessor
- access macros for everything
- all types are littered with macros expressing garbage-collection metadata
So I wrote my own C code-generator (about 500 lines of Python)

GCC's internals, as seen by the Python plugin

Our buggy example code is processed by the C frontend

Converted to a form called "GIMPLE":

Simple expressions, mostly just 3 operands at a time.

The GIMPLE is divided into a control flow graph of basic blocks in which any branching logic is done between basic blocks.

I am grossly oversimplifying things

(Etymology: the GCC version of SIMPLE IL, from the McCAT compiler project at McGill University)

Our first new GCC compilation pass

Here's a Python script to run inside the plugin to print the structure:

import gcc

# We can create custom compilation pass by subclassing e.g. gcc.GimplePass:
class DumpGimple(gcc.GimplePass):
  # passes have execute() methods:
  def execute(self, fun):

    # see next slide

  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)

    for bb in fun.cfg.basic_blocks:

      # we now have a gcc.BasicBlock
      print('BLOCK %i' % bb.index)

      # see next slide...

  def execute(self, fun):
    print('FUNCTION: %s' % fun.decl.name)
    for bb in fun.cfg.basic_blocks:
      print('BLOCK %i' % bb.index)

      # (from previous slide)

      if bb.gimple:
        for stmt in bb.gimple:
          # we now have a gcc.Gimple
          print(' %r' % stmt)
          print('   %s' % stmt)

      for edge in bb.succs:
        print(' edge to block %i'
              % edge.dest.index)

  # see next slide

class DumpGimple(gcc.GimplePass):
  def execute(self, fun):
      # (body of function omitted)

# Finally, we have to instantiate our compilation pass, and register it
# relative to one of GCC's existing passes:
ps = DumpGimple(name='dump-gimple')
ps.register_after('cfg')

Don't call your pass "pass" - it's a reserved word!

We can now run our script like this:

$ ./gcc-with-python \
         dump-gimple.py \
           -I/usr/include/python2.7 \
         bug.c

(bug.c is Python extension code, hence we need the -I compilation flag here so gcc can find <Python.h>)

Here's the output

FUNCTION: make_a_list_of_random_ints_badly
BLOCK 0
  edge to block 2
BLOCK 1
BLOCK 2
 gcc.GimpleCall()
   D.12024 = PyArg_ParseTuple (args, "i", &count);
 gcc.GimpleCond()
   if (D.12024 == 0)
  edge to block 3
  edge to block 4
BLOCK 3
 gcc.GimpleAssign()
   D.12027 = 0B;
  edge to block 8

BLOCK 4
 gcc.GimpleCall()
   list = PyList_New (0);
 gcc.GimpleAssign()
   i = 0;
  edge to block 6
BLOCK 5
 gcc.GimpleCall()
   D.12028 = random ();
 gcc.GimpleCall()
   item = PyLong_FromLong (D.12028);
 gcc.GimpleCall()
   PyList_Append (list, item);
 gcc.GimpleAssign()
   i = i + 1;
  edge to block 6

BLOCK 6
 gcc.GimpleAssign()
   count.0 = count;
 gcc.GimpleCond()
   if (i < count.0)
  edge to block 5
  edge to block 7
BLOCK 7
 gcc.GimpleAssign()
   D.12027 = list;
  edge to block 8
BLOCK 8
 gcc.GimpleReturn()
   return D.12027;
  edge to block 1

Too much text!

We're walking over GCC's internal representation, but it's hard to see what's going on

So I wrote a "show-gimple.py" script to visualize the structure, using GraphViz:

            $ ./gcc-with-python \
                     examples/show-gimple.py \
                       -I/usr/include/python2.7 \
                     bug.c

Many optimization passes later...

Various other internal representations that aren't relevant for our purposes today

GIMPLE-SSA
Register Transfer Language
Assembler
Machine code

What have we achieved?

We can now create new GCC warnings using Python, and perform other code analysis

Examples (by other people):

two different custom warnings created for GDB
http://sourceware.org/ml/gdb-patches/2011-06/msg00408.html

http://sourceware.org/ml/gdb/2011-11/msg00002.html
an analysis script written for LibreOffice:
https://bugs.freedesktop.org/show_bug.cgi?id=43460
porting Emacs between two different Lisp dialects, using Python to port the relevant C code to Lisp
http://tromey.com/blog/?p=709

http://tromey.com/blog/?p=751

cpychecker: a static analyzer for CPython extension modules

gcc-with-cpychecker
The real work is within libcpychecker in the source tree
About 10k lines of Python code, much of which is specific to analyzing CPython

Code: https://fedorahosted.org/gcc-python-plugin/

Structure of the checker

Interpreter for GIMPLE (simple form of abstract interpretation)

See libcpychecker/absinterp.py in the plugin's source tree

We generate a set of all possible traces through the function

(but give up when it gets too complicated)

A trace is a recording of a list of states, with transitions between them.

Structure of the checker (continued)

Transitions can have descriptive information (e.g. "when PyList_Append() fails")

A state has:

location
mapping from variables to storage region
mapping from storage region to AbstractValue instances

(this lets us handle the semantics of pointers)

AbstractValue subclasses

UnknownValue(): we don't know anything about this value
ConcreteValue(42): a specific, known value
WithinRange(): e.g. between 0 and 0x7fffffff
PointerToRegion(): a non-NULL pointer to the given storage region
"poisoned" values: DeallocatedMemory(), UninitializedData() etc

The checker "knows" how to perform operations on combinations of these: addition, subtraction, comparisons, etc

e.g. most operations on a DeallocatedMemory() instance will be detected as an error

Facets

States are pluggable: extra domain-specific information ("facets") can be added.

For now, the only plugin is a "cpython" one which adds:

exception state (the PyObject* for the thread-local exception instance, or NULL)

GIL state

a RefcountValue() subclass of AbstractValue, implementing ob_refcnt behaviors

detailed behavior information on much of the CPython API

See libcpychecker/refcounts.py in the plugin's source tree

Example API implementation

Given a C function e.g. PyMapping_Size, the cpython facet can provide an implementation which describes how to get from an input state to a list of all possible transitions:

def impl_PyMapping_Size(self, stmt, v_o):
    fnmeta = FnMeta(name='PyMapping_Size',
                    docurl='http://docs.python.org/c-api/mapping.html#PyMapping_Size',
                    prototype='Py_ssize_t PyMapping_Size(PyObject *o)',
                    defined_in='Objects/abstract.c',
                    notes='Can cope with NULL (sets exception)')
    # (continued on next slide)

# (continued from last slide)

    t_success = \
       self.state.mktrans_assignment(stmt.lhs,
         UnknownValue.make(stmt.lhs.type,
                           stmt.loc),
         fnmeta.desc_when_call_succeeds())
    t_failure = \
       self.state.mktrans_assignment(stmt.lhs,
         ConcreteValue(stmt.lhs.type,
                       stmt.loc,
                       -1),
         fnmeta.desc_when_call_fails())
    t_failure.dest.cpython.set_exception(
         'PyExc_TypeError',
         stmt.loc)
    return [t_success, t_failure]

We can also describe how an API can fail:

def impl_PyList_Append(self, stmt,
                       v_op, v_newitem):
    fnmeta = FnMeta(name='PyList_Append', )
                    # etc
    self.state.raise_any_null_ptr_func_arg(
       stmt, 0, v_op,
       why=invokes_Py_TYPE_via_macro(
              fnmeta, 'PyList_Check'))

    # It handles newitem being NULL:
    if v_newitem.is_null_ptr():
        s_failure = \
           self.state.mkstate_concrete_return_of(
              stmt, -1)
        s_failure.cpython.bad_internal_call(stmt.loc)
        return [Transition(self.state,
                           s_failure,
                           'returning -1 from %s() due to NULL item' % fnmeta.name)]

Trying it out on our buggy example code

It emits new compilation warnings:

            $ ./gcc-with-cpychecker \
                  -I/usr/include/python2.7/ \
                 bug.c
            bug.c: In function ‘make_a_list_of_random_ints_badly’:
            bug.c:10:26: warning: Mismatching type in call to PyArg_ParseTuple with format code "i" [enabled by default]
            argument 3 ("&count") had type
              "long int *" (pointing to 64 bits)
            but was expecting
              "int *" (pointing to 32 bits)
            for format code "i"

It detects the reference leak of "item"

            bug.c:22:1: warning: ob_refcnt of '*item' is 1 too high [enabled by default]
            bug.c:22:1: note: was expecting final ob_refcnt to be N + 1 (for some unknown N)
            bug.c:22:1: note: due to object being referenced by: PyListObject.ob_item[0]
            bug.c:22:1: note: but final ob_refcnt is N + 2
            bug.c:17:14: note: PyLongObject allocated at:   item = PyLong_FromLong(random());

Reference-counting model

For every object and possible trace it tracks all reference that it knows about, and assumes that there is some unknown number of other references N that it isn't tracking.

It knows the lower limit of N, and can use this when analyzing Py_DECREF() invocations.

Paths through the code

The reference count warnings emits a verbose description of the path followed through the code that leads to the error:

            bug.c:10:26: note: when PyArg_ParseTuple() succeeds at:
                if (!PyArg_ParseTuple(args, "i", &count)) {
            bug.c:10:8: note: taking False path at:
                if (!PyArg_ParseTuple(args, "i", &count)) {
            bug.c:14:10: note: reaching:
                list = PyList_New(0);

Paths through the code (continued)

            bug.c:14:10: note: when PyList_New() succeeds at:
                list = PyList_New(0);
            bug.c:16:5: note: when considering range:  1 <= value <= 0x7fffffff at:
                for (i = 0; i < count; i++) {
            bug.c:16:5: note: taking True path at:
                for (i = 0; i < count; i++) {
            bug.c:17:31: note: reaching:
                    item = PyLong_FromLong(random());

Actual references vs ob_refcnt

The checker emits a message when either changes:

            bug.c:17:14: note: when PyLong_FromLong() succeeds at:
                    item = PyLong_FromLong(random());
            bug.c:17:14: note: ob_refcnt is now refs: 1 + N where N >= 0
            bug.c:18:22: note: when PyList_Append() succeeds at:
                    PyList_Append(list, item);
            bug.c:18:22: note: ob_refcnt is now refs: 2 + N where N >= 0
            bug.c:18:22: note: '*item' is now referenced by 1 non-stack value(s): PyListObject.ob_item[0]

Paths through the code (continued)

The checker continues following the execution of the code, reaching the top of the loop:

            bug.c:16:5: note: when considering value == (int)1 from bug.c:10 at:
                for (i = 0; i < count; i++) {
            bug.c:16:5: note: taking False path at:
                for (i = 0; i < count; i++) {
            bug.c:21:5: note: reaching:
                return list;
            bug.c:22:1: note: returning

Deduplication

The checker attempts to follow every path through the function.

In this case it detects two similar traces, and only reports the warning for the first such trace:

            bug.c:22:1: note: found 1 similar trace(s) to this

HTML reports

The above information is too verbose, so the checker also emits the same information in HTML form:

            bug.c:6:1: note: graphical error report for
            function 'make_a_list_of_random_ints_badly' written out to 'bug.c.make_a_list_of_random_ints_badly-refcount-errors.html'

Lack of error-checking

It also detects the lack of error-checking for the PyList_New() call:

            bug.c:18:22: warning: calling PyList_Append with NULL as argument 1 (list) at bug.c:18 [enabled by default]
            bug.c:10:26: note: when PyArg_ParseTuple() succeeds at:
                if (!PyArg_ParseTuple(args, "i", &count)) {
            bug.c:10:8: note: taking False path at:
                if (!PyArg_ParseTuple(args, "i", &count)) {
            bug.c:14:10: note: reaching:
                list = PyList_New(0);
            bug.c:14:10: note: when PyList_New() fails at:
                list = PyList_New(0);

Why does it crash?

            bug.c:18:22: note: PyList_Append() invokes Py_TYPE() on the pointer via the PyList_Check() macro, thus accessing (NULL)->ob_type
            bug.c:18:22: note: found 1 similar trace(s) to this

A real bug found by the tool

int rpmfdFromPyObject(PyObject *obj,
                      rpmfdObject **fdop)
{
    rpmfdObject *fdo = NULL;

    /* (logic to create "fdo" omitted,
        to keep it all on one slide) */

    if (Ferror(fdo->fd)) {
        Py_DECREF(fdo);
        PyErr_SetString(
              PyExc_IOError,
              Fstrerror(fdo->fd));
        return 0;
    }
    *fdop = fdo;
    return 1;
}

rpmfd-py.c:34:17: warning: reading from deallocated memory at rpmfd-py.c:34: memory deallocated at rpmfd-py.c:32
rpmfd-py.c:31:8: note: taking True path at:
     if (Ferror(fdo->fd)) {
rpmfd-py.c:32:2: note: reaching:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: when taking False path at:
         Py_DECREF(fdo);
rpmfd-py.c:32:2: note: calling tp_dealloc on new ref from call to PyObject_Call allocated at rpmfd-py.c:26 at:
         Py_DECREF(fdo);
rpmfd-py.c:35:17: note: reaching:
             Fstrerror(fdo->fd));

How does this work?

It's working on the preprocessed source

After preprocessing, a Py_DECREF(op) looks like this:

          do { if ( --((PyObject*)(op))->ob_refcnt != 0) ; else ( (*(((PyObject*)((PyObject *)(op)))->ob_type)->tp_dealloc)((PyObject *)((PyObject *)(op)))); } while (0);

or, simplified (removing casts):

          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }

How does this work? (continued)

          if ( --op->ob_refcnt == 0) {
             op->ob_type->tp_dealloc(op);
          }

All non-NULL (PyObject*) values are assigned an ob_refcnt value of type RefcountValue

A RefcountValue instance "knows" how to be incremented, decremented and compared against 0 (amongst other operations)

All non-NULL (PyObject*) have their ob_type set to a sane PyTypeObject. The tp_dealloc is set to an custom AbstractValue representing a sane function pointer that marks the object as a "poisoned" DeallocatedMemory value

Limitations

It only complains about crashers and reference-count issues
It doesn't complain about the lack of error checking for PyLong_FromLong: this failure case can't crash the interpreter.
It only follows 0 and 1 times through any loop
Generated code (Cython and SWIG) tends to show false positives (I'm working on fixing this)

Status

I'm working on automatically running the checker against all Python extension code within Fedora 17

I've found and reported bugs against 27 projects so far... (~340 to go...); several are now fixed

Getting involved

Please try the checker on your extension code

Please report any false positives or tracebacks you run into

Grab it from git
https://fedorahosted.org/gcc-python-plugin
"make" to build the plugin and run the selftest suite

Building a distutils using the checker:

                $ export CC=/path/to/plugin/gcc-with-cpychecker
                $ python setup.py build

Building a "Makefile"-based project using the checker:

                $ make CC=/path/to/plugin/gcc-with-cpychecker

Help build out the coverage of the CPython API

Summary

Bugs typically occurring in CPython extension modules
GCC plugin that embeds Python inside the compiler
The gcc-with-cpychecker static analyzer

Q & A

We're hiring!

http://www.redhat.com/about/work/

Extra slides

For the compiler geeks...

The GIMPLE is converted to Static Single Assignment form, which we can also see from Python

            $ ./gcc-with-python \
                     examples/show-ssa.py \
                       -I/usr/include/python2.7 \
                     bug.c

Many optimization passes later...

The GIMPLE-SSA form is converted to another representation Register Transfer Language, which is close to machine code

I've only barely wrapped this part of GCC in Python. Patches welcome!

Many further optimization passes later, this gets converted to assembler, and then machine code