Just-In-Time compilation using GCC (libgccjit.so)

GNU Tools Cauldron 2014

David Malcolm <dmalcolm@redhat.com>

What is libgccjit.so?

See http://gcc.gnu.org/wiki/JIT

Prebuilt RPM packages available for Fedora and RHEL:

https://gcc.gnu.org/ml/jit/2014-q2/msg00009.html

Why?

JIT compilation use cases:

Why a dedicated API for JIT?

Design decisions

What the API looks like

A C header file; currently with:

  • 72 function prototypes
  • 12 opaque types
  • 8 enums

State and lifetime management

All state is hung off of context objects:

typedef struct gcc_jit_context gcc_jit_context;

extern gcc_jit_context *
gcc_jit_context_acquire (void);

extern void
gcc_jit_context_release (gcc_jit_context *ctxt);

Simple memory management for client code

Everything that's created from a context is cleaned up when the context is released.
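
A minimal sketch of the lifecycle (assuming the usual #include <libgccjit.h>; error checking omitted):

gcc_jit_context *ctxt = gcc_jit_context_acquire ();

/* ... populate the context with types, functions and blocks ... */

gcc_jit_result *result = gcc_jit_context_compile (ctxt);

/* Everything hung off the context can now go away: */
gcc_jit_context_release (ctxt);

/* ... look up and call code via gcc_jit_result_get_code () ... */

gcc_jit_result_release (result);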

Entities within a context

extern const char *
gcc_jit_object_get_debug_string (gcc_jit_object *obj);
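
Every entity can be upcast to a gcc_jit_object; a small sketch of printing one (assuming the int_type from the "Types" slide below):

gcc_jit_object *obj = gcc_jit_type_as_object (int_type);
printf ("obj: %s\n", gcc_jit_object_get_debug_string (obj));
/* The string is owned by the context; no need to free it.  */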

Entities within a context (2)

[diagram: entities within a context]

Source-Code Locations

Optional, but useful to end-users

/* Use this to create locations: */
extern gcc_jit_location *
gcc_jit_context_new_location (gcc_jit_context *ctxt,
                              const char *filename,
                              int line,
                              int column);

/* Need to turn on generation of debuginfo: */
gcc_jit_context_set_bool_option (
  ctxt, GCC_JIT_BOOL_OPTION_DEBUGINFO, 1);
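
The location can then be passed wherever the API takes a gcc_jit_location * (the NULL arguments in the later snippets); a sketch, where "block" and "rvalue" stand for entities built as in the later examples:

gcc_jit_location *loc =
  gcc_jit_context_new_location (ctxt, "main.cc", 43, 0);

/* Pass it where the later snippets pass NULL, e.g.: */
gcc_jit_block_end_with_return (block, loc, rvalue);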

Source-Code Locations (2)

We can use this to single-step through the generated machine code, e.g. machine code generated from bytecode:

(gdb) break fibonacci
(gdb) run
Breakpoint 1, fibonacci (input=8) at main.cc:43
43      DUP,
(gdb) next
47      PUSH_INT_CONST, 2,
(gdb) next
51      BINARY_INT_COMPARE_LT,
(gdb) next
55      JUMP_ABS_IF_TRUE, 17,
(gdb) next
59      DUP,
(gdb) next
63      PUSH_INT_CONST,  1,
(gdb) next
67      BINARY_INT_SUBTRACT,

Types

Access to simple C types:

gcc_jit_type *int_type =
   gcc_jit_context_get_type (ctxt, GCC_JIT_TYPE_INT);

gcc_jit_type *double_type =
   gcc_jit_context_get_type (ctxt, GCC_JIT_TYPE_DOUBLE);

/* etc */
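
From these, derived types can be built up; a sketch (the "point" struct here is purely illustrative):

gcc_jit_type *int_ptr_type =
  gcc_jit_type_get_pointer (int_type);

/* An illustrative struct with two int fields: */
gcc_jit_field *field_x =
  gcc_jit_context_new_field (ctxt, NULL, int_type, "x");
gcc_jit_field *field_y =
  gcc_jit_context_new_field (ctxt, NULL, int_type, "y");
gcc_jit_field *fields[2] = {field_x, field_y};
gcc_jit_struct *point_type =
  gcc_jit_context_new_struct_type (ctxt, NULL, "point", 2, fields);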

Types (2)

One-time setup vs per-compile state

A common pattern:

  1. one-time setup:

    The client code maps its own API into the JIT world:

    • create gcc_jit_type instances representing the structs and other types of interest
    • similar for globals, functions, etc
  2. repeatedly, as each method becomes "hot", reuse the state from (1) to compile that method to machine code

Seen e.g. in GNU Octave's JIT compiler.

One-time setup vs per-compile state (2)

How to handle this?

If we do it all in one context, we'll have a slow leak due to all of the per-method state never going away.

One-time setup vs per-compile state (3)

Solution: nested contexts:

extern gcc_jit_context *
gcc_jit_context_new_child_context (gcc_jit_context *parent_ctxt);
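
A sketch of the intended pattern (the names here are illustrative, not from the slides):

/* One-time setup, in a long-lived parent context: */
gcc_jit_context *parent = gcc_jit_context_acquire ();
/* ... create the shared types, declarations of runtime helpers, etc ... */

/* Each time a method becomes hot: */
gcc_jit_context *child = gcc_jit_context_new_child_context (parent);
/* ... build the method's code in "child", reusing entities from "parent" ... */
gcc_jit_result *result = gcc_jit_context_compile (child);
gcc_jit_context_release (child);  /* the per-method state goes away */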

One-time setup vs per-compile state (4)

Functions

How to generate the equivalent of:

const char *
test_string_literal (void)
{
   return "hello world";
}

Functions (2)

gcc_jit_type *const_char_ptr_type =
  gcc_jit_context_get_type (ctxt, GCC_JIT_TYPE_CONST_CHAR_PTR);

/* Build the test_fn.  */
gcc_jit_function *test_fn =
  gcc_jit_context_new_function (ctxt, NULL,
                                GCC_JIT_FUNCTION_EXPORTED,
                                const_char_ptr_type,
                                "test_string_literal",
                                0, NULL,
                                0);
gcc_jit_block *block = gcc_jit_function_new_block (test_fn, NULL);

gcc_jit_block_end_with_return (
  block, NULL,
  gcc_jit_context_new_string_literal (ctxt, "hello world"));
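
A sketch of compiling the context and calling the generated function (error checking omitted):

gcc_jit_result *result = gcc_jit_context_compile (ctxt);

typedef const char *(*fn_type) (void);
fn_type fn =
  (fn_type) gcc_jit_result_get_code (result, "test_string_literal");
printf ("%s\n", fn ());  /* prints "hello world" */

gcc_jit_result_release (result);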

Functions (3)

Example of a conditional:

/* if (i >= n) */
gcc_jit_block_end_with_conditional (
  loop_cond, NULL,
  gcc_jit_context_new_comparison (
     ctxt, NULL,
     GCC_JIT_COMPARISON_GE,
     gcc_jit_lvalue_as_rvalue (i),
     gcc_jit_param_as_rvalue (n)),
  after_loop,
  loop_body);
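
The snippet assumes the usual loop skeleton of three blocks; a sketch of creating them, where "func" stands for the function being built:

gcc_jit_block *loop_cond = gcc_jit_function_new_block (func, "loop_cond");
gcc_jit_block *loop_body = gcc_jit_function_new_block (func, "loop_body");
gcc_jit_block *after_loop = gcc_jit_function_new_block (func, "after_loop");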

Functions (4)

/* sum += i * i */
gcc_jit_block_add_assignment_op (
  loop_body, NULL,
  sum, /* lvalue */
  GCC_JIT_BINARY_OP_PLUS,
  gcc_jit_context_new_binary_op ( /* rvalue */
     ctxt, NULL,
     GCC_JIT_BINARY_OP_MULT, the_type,
     gcc_jit_lvalue_as_rvalue (i),
     gcc_jit_lvalue_as_rvalue (i)));
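
A sketch of finishing the loop body: increment i and jump back to the condition:

/* i++ */
gcc_jit_block_add_assignment_op (
  loop_body, NULL,
  i,
  GCC_JIT_BINARY_OP_PLUS,
  gcc_jit_context_one (ctxt, the_type));

/* goto loop_cond */
gcc_jit_block_end_with_jump (loop_body, NULL, loop_cond);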

Comments as a first-class entity

extern void
gcc_jit_block_add_comment (gcc_jit_block *block,
                           gcc_jit_location *loc,
                           const char *text);

Very useful for debugging

e.g.

gcc_jit_block_add_comment (b_entry, NULL,
                           "for i in 0 to (ARRAY_SIZE - 1):");

Internally they are implemented as dummy labels.

Shouldn't affect optimization.

Visible in dumps of initial tree and of gimple.
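
A sketch of enabling those dumps via the API's boolean options:

gcc_jit_context_set_bool_option (
  ctxt, GCC_JIT_BOOL_OPTION_DUMP_INITIAL_TREE, 1);
gcc_jit_context_set_bool_option (
  ctxt, GCC_JIT_BOOL_OPTION_DUMP_INITIAL_GIMPLE, 1);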

Error-handling

Inspired by OpenGL:

  • record errors
  • fail if an error has occurred
  • fail gracefully when called after an error

Client code only has to check for errors once.

extern const char *
gcc_jit_context_get_first_error (gcc_jit_context *ctxt);
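
A sketch of the resulting usage pattern (check once, after compiling):

gcc_jit_result *result = gcc_jit_context_compile (ctxt);

const char *first_error = gcc_jit_context_get_first_error (ctxt);
if (first_error)
  {
    fprintf (stderr, "libgccjit error: %s\n", first_error);
    /* API calls made after an error become no-ops, so there is no
       need to check every call.  */
  }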

What the API doesn't do

etc

The C++ API

Methods, and (optionally) operator overloading:

struct quadratic
{
  double a;
  double b;
  double c;
  double discriminant;
};

gccjit::rvalue q_a = param_q.dereference_field (field_a);
gccjit::rvalue q_b = param_q.dereference_field (field_b);
gccjit::rvalue q_c = param_q.dereference_field (field_c);

gccjit::rvalue four =
  ctxt.new_rvalue (double_type, 4);

The C++ API (2)

gccjit::block block = calc_discriminant.new_block ();
block.add_comment ("(b^2 - 4ac)");

block.add_assignment (
  /* q->discriminant =...  */
  param_q.dereference_field (field_discriminant),
  /* (q->b * q->b) - (4 * q->a * q->c) */
  (q_b * q_b) - (four * q_a * q_c));
block.end_with_return ();

Python bindings

See https://github.com/davidmalcolm/pygccjit:

# Create parameter "i":
param_i = ctxt.new_param(int_type, b'i')
# Create the function:
fn = ctxt.new_function(gccjit.FunctionKind.EXPORTED,
                       int_type,
                       b"square",
                       [param_i])

Python bindings (2)

# Create a basic block within the function:
block = fn.new_block(b'entry')

# This basic block is relatively simple:
block.end_with_return(
    ctxt.new_binary_op(gccjit.BinaryOp.MULT,
                       int_type,
                       param_i, param_i))

# Having populated the context, compile it.
jit_result = ctxt.compile()

# This is what you get back from ctxt.compile():
assert isinstance(jit_result, gccjit.Result)

"Coconut": a JIT compiler for Python

https://github.com/davidmalcolm/coconut

(not to be confused with "Unladen Swallow")

Compiles CPython bytecode to machine code

Uses the Python bindings to libgccjit

"Coconut": a JIT compiler for Python (2)

def f(a, b):
  return a * b

"Coconut": a JIT compiler for Python (3)

One basic block:

[image: CFG of the bytecode]

"Coconut": a JIT compiler for Python (4)

31 basic blocks:

[image: CFG of the generated IR]

"Coconut": a JIT compiler for Python (5)

Status: an experiment:

Bindings for other languages?

Yes please!

Implementation Details

How it originally worked

The original way it worked:

[diagram: how it originally worked]

How it now works

[diagram: how it now works]

State removal: the clean way vs the hack

gcc::context::context ()
{
  m_dumps = new gcc::dump_manager ();
  m_passes = new gcc::pass_manager (this);
}

State removal: the clean way vs the hack (2)

Add a big mutex and...

/* For those that want to, this function aims to clean up enough
   state that you can call toplev::main again. */
void
toplev::finalize (void)
{
  cgraph_c_finalize ();
  cgraphbuild_c_finalize ();
  cgraphunit_c_finalize ();
  dwarf2out_c_finalize ();
  /* etc */
}

State removal: the clean way vs the hack (3)

void cgraph_c_finalize (void)
{
  x_cgraph_nodes_queue = NULL;
  cgraph_n_nodes = 0;
  cgraph_max_uid = 0;
  cgraph_edge_max_uid = 0;
  cgraph_global_info_ready = false;
  cgraph_state = CGRAPH_STATE_PARSING;
  cgraph_function_flags_ready = false;
  /* etc */
}

State removal: the clean way vs the hack (4)

The testsuite for JIT now runs at the equivalent of -O3

(with each test running in-process 5 times, to shake out state issues)
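
In API terms, the equivalent of -O3 is (a sketch):

gcc_jit_context_set_int_option (
  ctxt, GCC_JIT_INT_OPTION_OPTIMIZATION_LEVEL, 3);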

Assembler as a shared library?

Currently the library writes out assembler and invokes the driver as a subprocess to assemble and link it into a shared object, which is then dlopen'd.

Those subprocess invocations show up as a significant part of the profile.

I would prefer to do this all in-process.

Are there shared libraries for these stages in our toolchain?

OK if I factor out the spec language from the gcc harness?

Summary

Next steps

Questions and Discussion

Thanks for listening!