Optimization records

Showing the user what GCC's optimization passes are doing

David Malcolm <dmalcolm@redhat.com>

GNU Tools Cauldron 2018

https://dmalcolm.fedorapeople.org/presentations/cauldron-2018/

Getting information from the optimizer

How do advanced users ask for more information on what GCC's optimizers are doing?

e.g.

"Why isn't this loop being vectorized?"
"Did this function get inlined? Why? / Why not?"

etc

What this talk is about

GCC's current solution for this
Problems with the current solution
Improvements I've made for GCC 9
Improvements I want to make
Other ideas

Measure!

Need to figure out what's slow first
But: we have a disconnect between profiling and our optimization dumps
Optimization dumps only talk about source file/line/column
What if a function gets inlined in many different places?
What about templates that are instantiated many times?

How does the user get actionable information?

"How do I make my code faster?"
- "What stopped the loop getting vectorized?"
  - "Can I tweak it?"
"Is there some option I can turn on that's likely to help?"
How can we (GCC devs) improve a particular optimization pass?

The two existing approaches

GCC <= 8:

-fdump-tree-all -fdump-ipa-all -fdump-rtl-all

examine foo.c.SOMETHING

where "SOMETHING" is undocumented, and changes from revision to revision of the compiler (e.g. "foo.c.029t.einline")

no easy way to parse (both for humans and scripts)

what is important, and what isn't?

e.g. "only tell me about the hot loops"

The two existing approaches (2)

-fopt-info
- e.g. -fopt-info-all, -fopt-info-vec-missed
- can see multiple passes at once
- granularity is still per-TU

Example

#include <vector>

std::size_t
f(std::vector<std::vector<float>> const & v)
{
  std::size_t ret = 0;
  for (auto const & w: v)
    ret += w.size();
  return ret;
}

(from Freenode's #gcc; thanks Nightstrike)

Does it vectorize?

gcc 8, with -fopt-info-all, at -O3

Analyzing loop at demo.cc:7
demo.cc:7:24: note: ===== analyze_loop_nest =====
demo.cc:7:24: note: === vect_analyze_loop_form ===
demo.cc:7:24: note: === get_loop_niters ===
demo.cc:7:24: note: Symbolic number of iterations is (((((unsigned long) _10 - (unsigned long) _9) - 24) /[ex] 8) * 768614336404564651 & 2305843009213693951) + 1
demo.cc:7:24: note: === vect_analyze_data_refs ===
demo.cc:7:24: note: got vectype for stmt: _7 = MEM[(float * *)SR.8_23];
vector(2) long unsigned int
demo.cc:7:24: note: got vectype for stmt: _8 = MEM[(float * *)SR.8_23 + 8B];
vector(2) long unsigned int
demo.cc:7:24: note: === vect_analyze_scalar_cycles ===
demo.cc:7:24: note: Analyze phi: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: Access function of PHI: {0, +, _16}_1
demo.cc:7:24: note: step: _16,  init: 0
demo.cc:7:24: note: step unknown.
demo.cc:7:24: note: Analyze phi: SR.8_23 = PHI <_9(5), _12(6)>
demo.cc:7:24: note: Access function of PHI: {_9, +, 24}_1
demo.cc:7:24: note: step: 24,  init: _9
demo.cc:7:24: note: Detected induction.

Does it vectorize? (2)

demo.cc:7:24: note: Analyze phi: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: detected reduction: ret_6 = _16 + ret_17;
demo.cc:7:24: note: Detected reduction.
demo.cc:7:24: note: === vect_pattern_recog ===
demo.cc:7:24: note: vect_is_simple_use: operand _16
demo.cc:7:24: note: def_stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: vect_is_simple_use: operand _16
demo.cc:7:24: note: def_stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: vect_is_simple_use: operand _15
demo.cc:7:24: note: def_stmt: _15 = _14 /[ex] 4;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: vect_is_simple_use: operand _16
demo.cc:7:24: note: def_stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: === vect_analyze_data_ref_accesses ===
demo.cc:7:24: note: Detected interleaving load MEM[(float * *)SR.8_23] and MEM[(float * *)SR.8_23 + 8B]
demo.cc:7:24: note: Detected interleaving load of size 3 starting with _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: There is a gap of 1 elements after the group

Does it vectorize? (3)

demo.cc:7:24: note: === vect_mark_stmts_to_be_vectorized ===
demo.cc:7:24: note: init: phi relevant? ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: init: phi relevant? SR.8_23 = PHI <_9(5), _12(6)>
demo.cc:7:24: note: init: stmt relevant? _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: init: stmt relevant? _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: init: stmt relevant? _14 = _8 - _7;
demo.cc:7:24: note: init: stmt relevant? _15 = _14 /[ex] 4;
demo.cc:7:24: note: init: stmt relevant? _16 = (long unsigned int) _15;
demo.cc:7:24: note: init: stmt relevant? ret_6 = _16 + ret_17;
demo.cc:7:24: note: vec_stmt_relevant_p: used out of loop.
demo.cc:7:24: note: vect_is_simple_use: operand _16
demo.cc:7:24: note: def_stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: vec_stmt_relevant_p: stmt live but not relevant.
demo.cc:7:24: note: mark relevant 1, live 1: ret_6 = _16 + ret_17;
demo.cc:7:24: note: init: stmt relevant? _12 = SR.8_23 + 24;
demo.cc:7:24: note: init: stmt relevant? if (_10 != _12)
demo.cc:7:24: note: worklist: examine stmt: ret_6 = _16 + ret_17;
demo.cc:7:24: note: vect_is_simple_use: operand _16
demo.cc:7:24: note: def_stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: type of def: internal

Does it vectorize? (4)

demo.cc:7:24: note: mark relevant 1, live 0: _16 = (long unsigned int) _15;
demo.cc:7:24: note: vect_is_simple_use: operand ret_17
demo.cc:7:24: note: def_stmt: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: type of def: reduction
demo.cc:7:24: note: mark relevant 1, live 0: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: worklist: examine stmt: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: vect_is_simple_use: operand 0
demo.cc:7:24: note: vect_is_simple_use: operand ret_6
demo.cc:7:24: note: def_stmt: ret_6 = _16 + ret_17;
demo.cc:7:24: note: type of def: reduction
demo.cc:7:24: note: reduc-stmt defining reduc-phi in the same nest.
demo.cc:7:24: note: worklist: examine stmt: _16 = (long unsigned int) _15;
demo.cc:7:24: note: vect_is_simple_use: operand _15
demo.cc:7:24: note: def_stmt: _15 = _14 /[ex] 4;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: mark relevant 1, live 0: _15 = _14 /[ex] 4;
demo.cc:7:24: note: worklist: examine stmt: _15 = _14 /[ex] 4;
demo.cc:7:24: note: vect_is_simple_use: operand _14
demo.cc:7:24: note: def_stmt: _14 = _8 - _7;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: mark relevant 1, live 0: _14 = _8 - _7;

Does it vectorize? (5)

demo.cc:7:24: note: worklist: examine stmt: _14 = _8 - _7;
demo.cc:7:24: note: vect_is_simple_use: operand _8
demo.cc:7:24: note: def_stmt: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: mark relevant 1, live 0: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: vect_is_simple_use: operand _7
demo.cc:7:24: note: def_stmt: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: mark relevant 1, live 0: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: worklist: examine stmt: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: worklist: examine stmt: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: === vect_analyze_data_ref_dependences ===
demo.cc:7:24: note: === vect_determine_vectorization_factor ===
demo.cc:7:24: note: ==> examining phi: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: get vectype for scalar type:  size_t
demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining phi: SR.8_23 = PHI <_9(5), _12(6)>
demo.cc:7:24: note: ==> examining statement: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: get vectype for scalar type:  float *
demo.cc:7:24: note: vectype: vector(2) long unsigned int

Does it vectorize? (6)

demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: get vectype for scalar type:  float *
demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: _14 = _8 - _7;
demo.cc:7:24: note: get vectype for scalar type:  long int
demo.cc:7:24: note: vectype: vector(2) long int
demo.cc:7:24: note: get vectype for scalar type:  long int
demo.cc:7:24: note: vectype: vector(2) long int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: _15 = _14 /[ex] 4;
demo.cc:7:24: note: get vectype for scalar type:  long int
demo.cc:7:24: note: vectype: vector(2) long int
demo.cc:7:24: note: get vectype for scalar type:  long int
demo.cc:7:24: note: vectype: vector(2) long int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: _16 = (long unsigned int) _15;
demo.cc:7:24: note: get vectype for scalar type:  long unsigned int
demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: get vectype for scalar type:  long unsigned int

Does it vectorize? (7)

demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: ret_6 = _16 + ret_17;
demo.cc:7:24: note: get vectype for scalar type:  size_t
demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: get vectype for scalar type:  size_t
demo.cc:7:24: note: vectype: vector(2) long unsigned int
demo.cc:7:24: note: nunits = 2
demo.cc:7:24: note: ==> examining statement: _12 = SR.8_23 + 24;
demo.cc:7:24: note: skip.
demo.cc:7:24: note: ==> examining statement: if (_10 != _12)
demo.cc:7:24: note: skip.
demo.cc:7:24: note: vectorization factor = 2
demo.cc:7:24: note: === vect_analyze_slp ===
demo.cc:7:24: note: === vect_make_slp_decision ===
demo.cc:7:24: note: === vect_analyze_data_refs_alignment ===
demo.cc:7:24: note: recording new base alignment for _9
demo.cc:7:24: note:   alignment:    8
demo.cc:7:24: note:   misalignment: 0
demo.cc:7:24: note:   based on:     _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: vect_compute_data_ref_alignment:

Does it vectorize? (8)

demo.cc:7:24: note: can't force alignment of ref: MEM[(float * *)SR.8_23]
demo.cc:7:24: note: vect_compute_data_ref_alignment:
demo.cc:7:24: note: can't force alignment of ref: MEM[(float * *)SR.8_23 + 8B]
demo.cc:7:24: note: === vect_prune_runtime_alias_test_list ===
demo.cc:7:24: note: === vect_enhance_data_refs_alignment ===
demo.cc:7:24: note: vector alignment may not be reachable
demo.cc:7:24: note: vect_can_advance_ivs_p:
demo.cc:7:24: note: Analyze phi: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: reduc or virtual phi. skip.
demo.cc:7:24: note: Analyze phi: SR.8_23 = PHI <_9(5), _12(6)>
demo.cc:7:24: note: Vectorizing an unaligned access.
demo.cc:7:24: note: === vect_analyze_loop_operations ===
demo.cc:7:24: note: examining phi: ret_17 = PHI <0(5), ret_6(6)>
demo.cc:7:24: note: examining phi: SR.8_23 = PHI <_9(5), _12(6)>
demo.cc:7:24: note: ==> examining statement: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: vect_is_simple_use: operand MEM[(float * *)SR.8_23]
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: vect_is_simple_use: operand MEM[(float * *)SR.8_23]
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.

Does it vectorize? (9)

demo.cc:7:24: note: no array mode for V2DI[3]
demo.cc:7:24: note: Data access with gaps requires scalar epilogue loop
demo.cc:7:24: note: can't use a fully-masked loop because the target doesn't have the appropriate masked load or store.
demo.cc:7:24: note: vect_model_load_cost: strided group_size = 3 .
demo.cc:7:24: note: vect_model_load_cost: unaligned supported by hardware.
demo.cc:7:24: note: vect_model_load_cost: inside_cost = 36, prologue_cost = 0 .
demo.cc:7:24: note: ==> examining statement: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: vect_is_simple_use: operand MEM[(float * *)SR.8_23 + 8B]
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: vect_is_simple_use: operand MEM[(float * *)SR.8_23 + 8B]
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: no array mode for V2DI[3]
demo.cc:7:24: note: Data access with gaps requires scalar epilogue loop
demo.cc:7:24: note: vect_model_load_cost: unaligned supported by hardware.
demo.cc:7:24: note: vect_model_load_cost: inside_cost = 12, prologue_cost = 0 .
demo.cc:7:24: note: ==> examining statement: _14 = _8 - _7;
demo.cc:7:24: note: vect_is_simple_use: operand _8
demo.cc:7:24: note: def_stmt: _8 = MEM[(float * *)SR.8_23 + 8B];
demo.cc:7:24: note: type of def: internal

Does it vectorize? (10)

demo.cc:7:24: note: vect_is_simple_use: operand _7
demo.cc:7:24: note: def_stmt: _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: === vectorizable_operation ===
demo.cc:7:24: note: vect_model_simple_cost: inside_cost = 4, prologue_cost = 0 .
demo.cc:7:24: note: ==> examining statement: _15 = _14 /[ex] 4;
demo.cc:7:24: note: vect_is_simple_use: operand _14
demo.cc:7:24: note: def_stmt: _14 = _8 - _7;
demo.cc:7:24: note: type of def: internal
demo.cc:7:24: note: vect_is_simple_use: operand 4
demo.cc:7:24: note: op not supported by target.
demo.cc:7:24: note: not vectorized: relevant stmt not supported: _15 = _14 /[ex] 4;
demo.cc:7:24: note: bad operation or unsupported loop bound.
demo.cc:4:1: note: vectorized 0 loops in function.
demo.cc:4:1: note: ===vect_slp_analyze_bb===
demo.cc:7:24: note: === vect_analyze_data_refs ===
demo.cc:7:24: note: got vectype for stmt: _9 = MEM[(struct vector * *)v_4(D)];
vector(2) long unsigned int
demo.cc:7:24: note: got vectype for stmt: _10 = MEM[(struct vector * *)v_4(D) + 8B];
vector(2) long unsigned int
demo.cc:7:24: note: === vect_analyze_data_ref_accesses ===

Does it vectorize? (11)

demo.cc:7:24: note: Detected interleaving load MEM[(struct vector * *)v_4(D)] and MEM[(struct vector * *)v_4(D) + 8B]
demo.cc:7:24: note: Detected interleaving load of size 2 starting with _9 = MEM[(struct vector * *)v_4(D)];
demo.cc:7:24: note: not vectorized: no grouped stores in basic block.
demo.cc:7:24: note: ===vect_slp_analyze_bb===
demo.cc:7:24: note: ===vect_slp_analyze_bb===
demo.cc:7:24: note: === vect_analyze_data_refs ===
demo.cc:7:24: note: got vectype for stmt: _7 = MEM[(float * *)SR.8_23];
vector(2) long unsigned int
demo.cc:7:24: note: got vectype for stmt: _8 = MEM[(float * *)SR.8_23 + 8B];
vector(2) long unsigned int
demo.cc:7:24: note: === vect_analyze_data_ref_accesses ===
demo.cc:7:24: note: Detected interleaving load MEM[(float * *)SR.8_23] and MEM[(float * *)SR.8_23 + 8B]
demo.cc:7:24: note: Detected interleaving load of size 2 starting with _7 = MEM[(float * *)SR.8_23];
demo.cc:7:24: note: not vectorized: no grouped stores in basic block.
demo.cc:7:24: note: ===vect_slp_analyze_bb===
demo.cc:7:24: note: ===vect_slp_analyze_bb===
demo.cc:7:24: note: ===vect_slp_analyze_bb===
demo.cc:9:10: note: === vect_analyze_data_refs ===
demo.cc:9:10: note: not vectorized: not enough data-refs in basic block.

Does it vectorize? (12)

The pertinent information was two slides ago.

Does it vectorize? (13)

It's easier to see with -fopt-info-missed:

demo.cc:7:24: note: step unknown.
demo.cc:7:24: note: vector alignment may not be reachable
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: no array mode for V2DI[3]
demo.cc:7:24: note: Data access with gaps requires scalar epilogue loop
demo.cc:7:24: note: can't use a fully-masked loop because the target doesn't have the appropriate masked load or store.
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.
demo.cc:7:24: note: not ssa-name.
demo.cc:7:24: note: use not simple.

Does it vectorize? (14)

demo.cc:7:24: note: no array mode for V2DI[3]
demo.cc:7:24: note: Data access with gaps requires scalar epilogue loop
demo.cc:7:24: note: op not supported by target.
demo.cc:7:24: note: not vectorized: relevant stmt not supported:  _15 = _14 /[ex] 4;
demo.cc:7:24: note: bad operation or unsupported loop bound.
demo.cc:7:24: note: not vectorized: no grouped stores in basic block.
demo.cc:7:24: note: not vectorized: no grouped stores in basic block.
demo.cc:9:10: note: not vectorized: not enough data-refs in basic block.

i.e.:

demo.cc:7:24: note: not vectorized: relevant stmt not supported:  _15 = _14 /[ex] 4;

Does it vectorize? (15)

demo.cc:7:24: note: not vectorized: relevant stmt not supported:  _15 = _14 /[ex] 4;

So we know that the failure is due to a (then) unsupported tree code (fixed 2018-07-18 as of r262854).
But that doesn't tell us the location of the problematic statement.

It's using the location of the loop for (almost) everything:

demo.cc:7:24:
  for (auto const & w: v)
                        ^

Two kinds of improvement

Better output format for the messages we have
Better messages

What I've done so far for GCC 9

`-fsave-optimization-record`

New in GCC 9
machine-readable output format
writes a demo.cc.opt-record.json file

Example of JSON output

A tuple.

First, some metadata:

[
    {
        "format": "1",
        "generator": {
            "version": "9.0.0 20180829 (experimental)",
            "name": "GNU C++14",
            "pkgversion": "(GCC) ",
            "target": "x86_64-pc-linux-gnu"
        }
    },

Example of JSON output (2)

Then all of the passes (so they can be referred back to):

  [
    {
        "num": -1,
        "type": "gimple",
        "name": "*warn_unused_result",
        "id": "0x469e830",
        "optgroups": []
    },
    "[...etc...]",
    {
        "num": -1,
        "type": "rtl",
        "name": "*clean_state",
        "id": "0x46bccc0",
        "optgroups": []
    }
]

Example of JSON output (3)

Then the dump messages, a list of objects like this:

{
    "kind": "note",
    "count": {
        "quality": "guessed_local",
        "value": 9.5563e+08
    },
    "location": {
        "line": 7,
        "file": "demo.cc",
        "column": 24
    },
    "pass": "0x46b8ae0",
    "impl_location": {
        "line": 4367,
        "file": "../../src/gcc/tree-vect-data-refs.c",
        "function": "vect_analyze_data_refs"
    },

Example of JSON output (4)

"function": "_Z1fRKSt6vectorIS_IfSaIfEESaIS1_EE",
"inlining_chain": [
    {
        "fndecl": "std::size_t f(const std::vector<std::vector<float> >&)"
    }
]

Example of JSON output (5)

The text of the message itself is "marked up" with metadata:

 "message": [
    "got vectype for stmt: ",
    {
        "location": {
            "line": 8,
            "file": "demo.cc",
            "column": 18
        },
        "stmt": "_8 = MEM[(float * *)SR.16_23 + 8B];\n"
    },
    {
        "expr": "vector(2) long unsigned int"
    },
    "\n"
],

Example of JSON output (6)

so that e.g. an HTML presentation might be:

<div class="message">got vectype for stmt:
  <div class="stmt">
    <a href="demo.cc#line-8">_8 = MEM[(float * *)SR.16_23 + 8B];</a>
  </div>
  <div class="expr">vector(2) long unsigned int</div>
</div>

Similarly, an IDE could make use of this in other ways.
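
As a rough sketch of how a consumer might turn the marked-up message into HTML like the above: the types and function names here (message_item, html_escape, render_message) are hypothetical and not part of GCC, and the sketch assumes the JSON has already been parsed into an in-memory list of items.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of one element of a "message" array.
struct message_item
{
  enum kind_t { TEXT, STMT, EXPR } kind;
  std::string text;   // the literal text, statement, or expression
  std::string file;   // only meaningful for STMT items
  int line;           // only meaningful for STMT items
};

// Escape the characters that are special in HTML.
std::string
html_escape (const std::string &s)
{
  std::string out;
  for (char c : s)
    switch (c)
      {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      default: out += c; break;
      }
  return out;
}

// Render one message as a <div class="message"> element.
std::string
render_message (const std::vector<message_item> &items)
{
  std::string out = "<div class=\"message\">";
  for (const message_item &item : items)
    switch (item.kind)
      {
      case message_item::TEXT:
        out += html_escape (item.text);
        break;
      case message_item::STMT:
        // Statements carry a source location, so link back to it.
        out += "<div class=\"stmt\"><a href=\"" + html_escape (item.file)
               + "#line-" + std::to_string (item.line) + "\">"
               + html_escape (item.text) + "</a></div>";
        break;
      case message_item::EXPR:
        out += "<div class=\"expr\">" + html_escape (item.text) + "</div>";
        break;
      }
  out += "</div>";
  return out;
}
```

The point of the markup is that the consumer never has to re-parse the message text: each statement or expression arrives as its own item, with its own location.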

Example of HTML report from the JSON output

https://dmalcolm.fedorapeople.org/gcc/2018-05-16/preso-example-inlined/

Status of JSON output

it's in trunk, and works, mostly...
- some bugs (interaction with inliner tracking, LTO support)
- not much automated test coverage
- not yet "tested in anger"

Dump API changes

Previously:

extern void dump_printf_loc (dump_flags_t, source_location,
                             const char *, ...)
  ATTRIBUTE_PRINTF_3;

GCC 9:

extern void dump_printf_loc (dump_flags_t, const dump_location_t &,
                             const char *, ...)
  ATTRIBUTE_GCC_DUMP_PRINTF (3, 0);

dump_location_t

The dump API now takes a dump_location_t, rather than a source_location (aka location_t).

dump_location_t

contains dump_user_location_t:

source information plus profile count

and dump_impl_location_t:

the emission location in gcc source (__FILE__, __LINE__ and function name).

dump_location_t (2)

dump_location_t can be created from gimple * and from rtx_insn *.

Hence, rather than:

dump_printf_loc (MSG_NOTE, gimple_location (stmt),
                 "This statement cannot be analyzed for "
                 "gridification\n");

we write:

dump_printf_loc (MSG_NOTE, stmt,
                 "This statement cannot be analyzed for "
                 "gridification\n");

dump_location_t (3)

Rather than just:

user-code.c:20:5: This statement cannot be analyzed for gridification

we also now have:

profile count of the statement in question (allowing for prioritization, and filtering of optimization dumps for unimportant code)

emission location metadata:

file == "gcc/omp-grid.c", line == 407, function == "grid_inner_loop_gridifiable_p"
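
One way to capture the emission location without changing every call site is to use default arguments that are evaluated where the call is made. The following is a simplified sketch of that idea, not GCC's actual class definitions: the struct names and example_emission_site are hypothetical, and it relies on the __builtin_FILE, __builtin_LINE and __builtin_FUNCTION extensions provided by GCC and Clang.

```cpp
#include <string>

// Where in the *user's* code the message applies, plus how hot it is.
struct user_location
{
  std::string file;
  int line;
  int column;
  double profile_count;
};

// Where in the *compiler's* source the message was emitted.
struct impl_location
{
  // The defaults are evaluated at the call site, so constructing an
  // impl_location with no arguments captures the caller's position.
  impl_location (const char *file = __builtin_FILE (),
                 int line = __builtin_LINE (),
                 const char *function = __builtin_FUNCTION ())
  : file (file), line (line), function (function) {}

  const char *file;
  int line;
  const char *function;
};

// Simplified stand-in for the combined location.
struct dump_location
{
  dump_location (const user_location &user,
                 const impl_location &impl = impl_location ())
  : user (user), impl (impl) {}

  user_location user;
  impl_location impl;
};

// Hypothetical pass code: the emission site is recorded automatically.
dump_location
example_emission_site ()
{
  user_location where = {"user-code.c", 20, 5, 1000.0};
  return dump_location (where);
}
```

Because the caller passes only the user-facing location, existing dump calls can pick up the extra metadata without being rewritten.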

ATTRIBUTE_GCC_DUMP_PRINTF

GCC <= 8: dump_printf and dump_printf_loc were printf under the covers
GCC 9: they use pretty_printer, and support format codes appropriate to the middle-end.

ATTRIBUTE_GCC_DUMP_PRINTF (2)

%E (gimple *)

Equivalent to: dump_gimple_expr (MSG_*, TDF_SLIM, stmt, 0)

%G (gimple *)

Equivalent to: dump_gimple_stmt (MSG_*, TDF_SLIM, stmt, 0)

%T (tree)

Equivalent to: dump_generic_expr (MSG_*, TDF_SLIM, arg)

All of this is supported in the JSON output (with the "markup" seen earlier).

ATTRIBUTE_GCC_DUMP_PRINTF (3)

Hence it becomes possible to convert e.g.:

if (dump_enabled_p ())
  {
    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                     "not vectorized: different sized vector "
                     "types in statement, ");
    dump_generic_expr (MSG_MISSED_OPTIMIZATION, TDF_SLIM, vectype);
    dump_printf (MSG_MISSED_OPTIMIZATION, " and ");
    dump_generic_expr (MSG_MISSED_OPTIMIZATION, TDF_SLIM, nunits_vectype);
    dump_printf (MSG_MISSED_OPTIMIZATION, "\n");
  }

ATTRIBUTE_GCC_DUMP_PRINTF (4)

into:

if (dump_enabled_p ())
  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                   "not vectorized: different sized vector "
                   "types in statement, %T and %T\n",
                   vectype, nunits_vectype);

Question: should I go and clean up all dumps in our source tree to use the new format code?

ATTRIBUTE_GCC_DUMP_PRINTF (5)

if (dump_enabled_p ())
  {
    dump_printf_loc (MSG_NOTE, vect_location,
                     "\touter base_address: ");
    dump_generic_expr (MSG_NOTE, TDF_SLIM,
                       STMT_VINFO_DR_BASE_ADDRESS (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter offset from base address: ");
    dump_generic_expr (MSG_NOTE, TDF_SLIM,
                       STMT_VINFO_DR_OFFSET (stmt_info));
    dump_printf (MSG_NOTE,
                 "\n\touter constant offset from base address: ");
    dump_generic_expr (MSG_NOTE, TDF_SLIM,
                       STMT_VINFO_DR_INIT (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter step: ");
    dump_generic_expr (MSG_NOTE, TDF_SLIM,
                       STMT_VINFO_DR_STEP (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter base alignment: %d\n",
                 STMT_VINFO_DR_BASE_ALIGNMENT (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter base misalignment: %d\n",
                 STMT_VINFO_DR_BASE_MISALIGNMENT (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter offset alignment: %d\n",
                 STMT_VINFO_DR_OFFSET_ALIGNMENT (stmt_info));
    dump_printf (MSG_NOTE, "\n\touter step alignment: %d\n",
                 STMT_VINFO_DR_STEP_ALIGNMENT (stmt_info));
  }

ATTRIBUTE_GCC_DUMP_PRINTF (6)

if (dump_enabled_p ())
  dump_printf_loc (MSG_NOTE, vect_location,
                   "\touter base_address: %T"
                   "\n\touter offset from base address: %T"
                   "\n\touter constant offset from base address: %T"
                   "\n\touter step: %T"
                   "\n\touter base alignment: %d\n"
                   "\n\touter base misalignment: %d\n"
                   "\n\touter offset alignment: %d\n"
                   "\n\touter step alignment: %d\n",
                   STMT_VINFO_DR_BASE_ADDRESS (stmt_info),
                   STMT_VINFO_DR_OFFSET (stmt_info),
                   STMT_VINFO_DR_INIT (stmt_info),
                   STMT_VINFO_DR_STEP (stmt_info),
                   STMT_VINFO_DR_BASE_ALIGNMENT (stmt_info),
                   STMT_VINFO_DR_BASE_MISALIGNMENT (stmt_info),
                   STMT_VINFO_DR_OFFSET_ALIGNMENT (stmt_info),
                   STMT_VINFO_DR_STEP_ALIGNMENT (stmt_info));

`AUTO_DUMP_SCOPE` and `DUMP_VECT_SCOPE`

Replace all the:

if (dump_enabled_p ())
  dump_printf_loc (MSG_NOTE, vect_location,
                   "=== vect_analyze_data_ref_accesses ===\n");

with just:

DUMP_VECT_SCOPE ("vect_analyze_data_ref_accesses");

This captures that this note is expressing a frame somewhere on the call stack during the optimization.

`AUTO_DUMP_SCOPE` and `DUMP_VECT_SCOPE` (2)

Textual output now indents them:

demo.cc:7:24: note: === analyze_loop_nest ===
demo.cc:7:24: note:  === vect_analyze_loop_form ===
demo.cc:7:24: note:   === get_loop_niters ===
demo.cc:7:24: note:  Symbolic number of iterations is (((((unsigned long) _10 - (unsigned long) _9) - 24) /[ex] 8) * 768614336404564651 & 2305843009213693951) + 1

The JSON output nests all of these (and the notes within them), expressing the hierarchy.
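
The scope idea can be illustrated with a small RAII guard. This is only a sketch under simplifying assumptions, not the GCC implementation: dump_context, auto_scope, inner_analysis and outer_analysis are hypothetical names, and output is collected into a vector rather than written to a dump file.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the dump machinery: collects lines of output.
struct dump_context
{
  int depth = 0;
  std::vector<std::string> lines;

  // Emit TEXT, indented by one space per enclosing scope.
  void note (const std::string &text)
  {
    lines.push_back (std::string (depth, ' ') + text);
  }
};

// RAII guard: emits "=== NAME ===" on entry and indents everything
// emitted until the guard goes out of scope.
class auto_scope
{
public:
  auto_scope (dump_context &ctxt, const std::string &name)
  : m_ctxt (ctxt)
  {
    m_ctxt.note ("=== " + name + " ===");
    m_ctxt.depth++;
  }
  ~auto_scope () { m_ctxt.depth--; }

  auto_scope (const auto_scope &) = delete;
  auto_scope &operator= (const auto_scope &) = delete;

private:
  dump_context &m_ctxt;
};

void
inner_analysis (dump_context &ctxt)
{
  auto_scope scope (ctxt, "inner_analysis");
  ctxt.note ("message from inner_analysis");
}

void
outer_analysis (dump_context &ctxt)
{
  auto_scope scope (ctxt, "outer_analysis");
  inner_analysis (ctxt);
}
```

Since the guard's destructor restores the depth on every exit path, early returns inside an analysis function cannot leave the indentation out of step.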

Future work

We have about 2 months left for feature-development on GCC 9

What should the user experience be?

"GCC can't vectorize <LOOP> because of <STMT>"

what command-line options does the user provide?
what output do they see?

What should the user experience be? (2)

What's important to the user?

Presumably:

which loop?
which statement, exactly, is stopping vectorization?

Current UX - filtering

Currently the user can filter on:
- what kind of pass ("ipa", "loop", "inline", "omp", "vec", plus "optall")
  
  e.g. -fopt-info-vec or somesuch ("tell me about vectorization")
- what kind of message ("optimized", "missed", "note", "all")
Or look at everything in one pass (for every function in the TU) e.g. -fdump-tree-vect

Future UX Ideas - filtering

Things the user can't filter on yet:
- just a particular function or range of source code (e.g. via a #pragma ? via an attribute?)
- based on hotness ("only tell me about code with a profile count above $THRESHOLD")
Interaction with JSON output?
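
Until such filtering exists in the compiler, a consumer of the JSON output could do it after the fact. A minimal sketch, assuming the records have already been parsed into a flat structure (opt_record and hottest_first are hypothetical names, not part of GCC):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical flattened form of one record from the JSON output.
struct opt_record
{
  std::string file;
  int line;
  double profile_count;   // the "value" field of "count" in the JSON
  std::string message;
};

// Keep only the records hotter than THRESHOLD, hottest first.
std::vector<opt_record>
hottest_first (const std::vector<opt_record> &records, double threshold)
{
  std::vector<opt_record> hot;
  for (const opt_record &rec : records)
    if (rec.profile_count > threshold)
      hot.push_back (rec);

  // stable_sort keeps the original emission order among equally hot records.
  std::stable_sort (hot.begin (), hot.end (),
                    [] (const opt_record &a, const opt_record &b)
                    { return a.profile_count > b.profile_count; });
  return hot;
}
```

This only works because each record carries its own profile count; the textual dumps have nothing comparable to filter on.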

Current UX - output

(recall the two pages of "note" lines emitted at the loop's location)

Future UX Ideas - output

How about

<LOOP-LOCATION>: couldn't vectorize this loop
<PROBLEM-LOCATION>: because of <REASON>

Future UX Ideas - output (2)

Use other prefixes than just "note" for everything
- "missed-optimization" vs "optimization" ?

Rather than:

demo.cc:7:24: note: couldn't vectorize loop
stl_vector.h:870:50: note: the reason

Maybe:

demo.cc:7:24: missed-optimization: couldn't vectorize loop
stl_vector.h:870:50: note: the reason

(may require some DejaGnu tweaks)

Future UX Ideas - output (3)

put the messages through the diagnostic subsystem
- show source code
- show inlining chain / inclusion chain etc
- maybe show metadata:
  - "which pass?"
  - hotness

Future UX Ideas - output (4)

demo.cc: In function ‘std::size_t f(const std::vector<std::vector<float> >&)’:
demo.cc:7:24: missed-optimization: couldn't vectorize loop due to...
7 |   for (auto const & w: v)
  |                        ^
In file included from ../x86_64-pc-linux-gnu/libstdc++-v3/include/vector:64,
                 from demo.cc:1:
In function ‘std::vector<_Tp, _Alloc>::size_type std::vector<_Tp, _Alloc>::size() const
[with _Tp = float; _Alloc = std::allocator<float>]’,
    inlined from ‘std::size_t f(const std::vector<std::vector<float> >&)’
    at demo.cc:8:18:
../x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_vector.h:870:50:
note:   not vectorized: relevant stmt not supported: _15 = _14 /[ex] 4;
870 |       { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
    |                          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~

Future UX Ideas - output (5)

Idea for implementing the above:

provide a way to filter out most of the output
- everything apart from:
  - <LOOP-LOCATION>: "couldn't vectorize loop"
  - <REASON-LOCATION>: "$REASON"
- everything else counts as "details"
- maybe filter out "details" by default, so that users by default get readable output
  - GCC developers (and DejaGnu tests) can opt-in to get the full details

Future UX Ideas - output (6)

Implementation ideas:

explicitly mark the dump calls, adding a new flag(s) to the dump_* API?
- e.g. add MSG_DETAILS and add to lots of dump_ calls
  - ugh (lots of churn)
- e.g. add MSG_PRIORITY and add to a few dump_ calls
implicit: treat the messages that are in nested scopes as being "details", and filter them out implicitly
- or add an auto_hide_dump_messages class, or somesuch
- minimal patching required, some "magic"

`opt_result` and `opt_problem`

I attempted to manually fix all those uses of vect_location
- ...but did I catch all of them?
- is there a better way?

`opt_result` and `opt_problem` (2)

a way to "bubble up" information about an optimization problem from deep in the call stack up to the top of a pass
a special wrapper around bool
- identifies the places that need failure-handling via the C++ type system
  - both to the compiler, and to human readers
- if dump_enabled_p, has a "reason" string as well as the false bool - "why did it fail?"

`opt_result` and `opt_problem` (3)

Rather than:

if (!check_something ())
  {
    if (dump_enabled_p ())
      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                       "foo is unsupported.\n");
    return false;
  }
[...lots more checks...]

// All checks passed:
return true;

`opt_result` and `opt_problem` (4)

we (optionally) capture the cause of the failure via:

if (!check_something ())
  return opt_result::failure_at (stmt, "foo is unsupported");

[...lots more checks...]

// All checks passed:
return opt_result::success ();

(this motivated the ATTRIBUTE_GCC_DUMP_PRINTF change above)

`opt_result` and `opt_problem` (5)

return opt_result::success ();

is effectively the same as:

return true;

but documents our intentions.

`opt_result` and `opt_problem` (6)

return opt_result::failure_at (stmt, "foo is unsupported");

when !dump_enabled_p, this is almost the same as:

return false;

Fixing all those "problem locations" naturally falls out of fixing the type issues needed to get it to compile.
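
To make the "wrapper around bool" idea concrete, here is a much-simplified sketch. It is not the proposed GCC code: the names result, problem and dumping_enabled are hypothetical stand-ins for opt_result, opt_problem and dump_enabled_p (), and the location is reduced to a file and line.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for dump_enabled_p ().
bool dumping_enabled = false;

// Why and where an optimization failed (a simplified "opt_problem").
struct problem
{
  std::string file;
  int line;
  std::string reason;
};

// A bool that can also carry a problem (a simplified "opt_result").
class result
{
public:
  static result success () { return result (true, nullptr); }

  static result failure_at (const std::string &file, int line,
                            const std::string &reason)
  {
    // Only pay for building the problem when someone will read it.
    std::shared_ptr<const problem> p;
    if (dumping_enabled)
      p = std::make_shared<problem> (problem {file, line, reason});
    return result (false, p);
  }

  // Usable wherever the old code tested a bool.
  explicit operator bool () const { return m_ok; }
  const problem *get_problem () const { return m_problem.get (); }

private:
  result (bool ok, std::shared_ptr<const problem> p)
  : m_ok (ok), m_problem (p) {}

  bool m_ok;
  std::shared_ptr<const problem> m_problem;
};

// Deep in the call stack: report the failure at the statement's location.
result
check_stmt (bool supported)
{
  if (!supported)
    return result::failure_at ("stl_vector.h", 870,
                               "relevant stmt not supported");
  return result::success ();
}

// Higher up: the problem travels up with the failure.
result
analyze_loop (bool stmt_supported)
{
  result res = check_stmt (stmt_supported);
  if (!res)
    return res;
  return result::success ();
}
```

Changing a function's return type from bool to such a wrapper makes every "return false;" a compile error, which is what forces each failure site to be revisited.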

`opt_result` and `opt_problem` (7)

e.g. the specific failure case from our example was:

if (!ok)
 {
   if (dump_enabled_p ())
     {
       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                        "not vectorized: relevant stmt not ");
       dump_printf (MSG_MISSED_OPTIMIZATION, "supported: ");
       dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM,
                         stmt_info->stmt, 0);
     }
   return false;
 }

(note the use of vect_location)

`opt_result` and `opt_problem` (8)

and this becomes:

if (!ok)
  return opt_result::failure_at (stmt_info->stmt,
                                 "not vectorized:"
                                 " relevant stmt not supported: %G",
                                 stmt_info->stmt);

(note the use of stmt_info->stmt)

`opt_result` and `opt_problem` (9)

Status: am working on this; I hope to get it into gcc 9.

fixes a lot of the issues discussed so far
TODO: figure out that "details" vs "non-details" issue

"Why wasn't a pass run?"

"Was $PASS run? Why not?"
"Does $PASS get run at -O2?"

End-users seem to have a lot of difficulty with this.

Non-trivial interaction of:

command-line options

opt_pass::gate virtual functions, and

the default_options_table (in opts.c)

"Why wasn't a pass run?" (2)

Idea: can we tell the user e.g.:

note: optimization pass 'vect' was skipped for function 'foo'
note: 'vect' pass is enabled via '-ftree-loop-vectorize', or at '-O3' and above

(maybe as part of -fopt-info-vec-missed ?)

"Why wasn't a pass run?" (3)

Possible implementation idea (not prototyped yet):

convert opt_pass::gate to return an opt_result rather than just a bool
hence, when dumping is enabled, we can tell the user why a pass wasn't run on a given function.
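
A minimal sketch of that idea, written outside GCC. Everything here is hypothetical (`toy_options`, `toy_gate_result`, `toy_pass`, `toy_vect_pass`, `describe_skip`); the real `opt_pass::gate` takes a function and returns `bool`, and the real option handling is more involved. The sketch only shows how a gate that returns a reason lets the caller build the notes from the previous slide.

```cpp
#include <cassert>
#include <string>

// Hypothetical option state; assumed to be set by -ftree-loop-vectorize
// or implied by -O3.
struct toy_options
{
  bool tree_loop_vectorize;
};

// Hypothetical gate result: a bool plus a reason for the user.
struct toy_gate_result
{
  bool enabled;
  std::string reason;   // empty when the pass is enabled
  explicit operator bool () const { return enabled; }
};

// Hypothetical pass interface whose gate explains itself.
class toy_pass
{
public:
  virtual ~toy_pass () {}
  virtual const char *name () const = 0;
  virtual toy_gate_result gate (const toy_options &opts) const = 0;
};

class toy_vect_pass : public toy_pass
{
public:
  const char *name () const override { return "vect"; }

  toy_gate_result gate (const toy_options &opts) const override
  {
    if (!opts.tree_loop_vectorize)
      return {false, "'vect' pass is enabled via '-ftree-loop-vectorize',"
                     " or at '-O3' and above"};
    return {true, ""};
  }
};

// Returns the notes to show for FN_NAME, or "" if the pass would run.
static std::string
describe_skip (const toy_pass &pass, const toy_options &opts,
               const std::string &fn_name)
{
  toy_gate_result r = pass.gate (opts);
  if (r)
    return "";
  return std::string ("note: optimization pass '") + pass.name ()
         + "' was skipped for function '" + fn_name + "'\nnote: " + r.reason;
}
```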

UX idea: Rich vectorization hints

Work-in-progress/prototype.

A much more verbose idea: "actionable" reports for the end-user.

Rich optimization hints can contain a mixture of:
- text
- diagrams
- highlighted source locations/ranges,
- proposed patches
- etc.

UX idea: Rich vectorization hints (2)

can be printed to stderr (diagrams become ASCII art)
can be saved as part of the JSON optimization record
- can be prioritized by code hotness
- browsed in an IDE, etc
- diagrams become HTML/SVG?
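
As a sketch of what "prioritized by code hotness" could mean for a tool that reads the saved hints: sort them so the hottest locations come first. The `toy_hint` struct and its fields are assumptions made for illustration, not the layout of the JSON optimization record.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory form of one hint.
struct toy_hint
{
  std::string id;      // e.g. "gcc.vect.success"
  std::string where;   // e.g. "vect-test.c:6:3"
  double hotness;      // hypothetical profile count for the location
};

// Hottest first; stable, so hints of equal hotness keep their original order.
static void
sort_by_hotness (std::vector<toy_hint> &hints)
{
  std::stable_sort (hints.begin (), hints.end (),
                    [] (const toy_hint &x, const toy_hint &y)
                    { return x.hotness > y.hotness; });
}
```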

UX idea: Rich vectorization hints (3)

void
my_example (int n, int *a, int *b, int *c)
{
  int i;

  for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
  }
}

UX idea: Rich vectorization hints (4)

==[Loop vectorized]=================================================

I was able to vectorize this loop, using SIMD instructions to reduce
the number of iterations by a factor of 4.

../../src/vect-test.c:6:3:
   for (i=0; i<n; i++) {
   ^~~

                                   |
                           +--------------+
             +-------------|run-time tests|-----------+
             |             +--------------+           |
+-------------------------+                +--------------------+
|vectorized loop          |                |scalar loop         |
|  iteration count: n / 4 |                |  iteration count: n|
+-------------------------+                |                    |
             |                             |                    |
+-------------------------+                |                    |
|epilogue                 |                |                    |
|  iteration count: [0..3]|                |                    |
+-------------------------+                +--------------------+
             |                                        |
             +---------------------+------------------+
                                   |
------------------------------------------------[gcc.vect.success]--
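
The diagram can be read as the following hand-written equivalent. This is a sketch under stated assumptions, not what GCC emits: `no_overlap` and `my_example_by_hand` are hypothetical names, the run-time test shown is one plausible overlap check, and the inner 4-lane loop stands in for a single SIMD operation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical run-time test: true if [p, p+n) and [q, q+n) don't overlap.
static bool
no_overlap (const int *p, const int *q, int n)
{
  std::uintptr_t a = reinterpret_cast<std::uintptr_t> (p);
  std::uintptr_t b = reinterpret_cast<std::uintptr_t> (q);
  std::uintptr_t bytes = static_cast<std::uintptr_t> (n) * sizeof (int);
  return a + bytes <= b || b + bytes <= a;
}

static void
my_example_by_hand (int n, int *a, int *b, int *c)
{
  if (n > 0 && no_overlap (a, b, n) && no_overlap (a, c, n))
    {
      int i = 0;

      // "Vectorized" loop: n / 4 iterations, 4 elements each.
      for (; i + 4 <= n; i += 4)
        for (int lane = 0; lane < 4; lane++)
          a[i + lane] = b[i + lane] + c[i + lane];

      // Epilogue: the final 0..3 elements.
      for (; i < n; i++)
        a[i] = b[i] + c[i];
    }
  else
    {
      // Scalar loop: n iterations.
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
}
```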

UX idea: Rich vectorization hints (6)

==[Run-time aliasing check]=========================================

Problem:
I couldn't prove that these data references don't alias, so I had to
add a run-time test, falling back to a scalar loop for when they do.

Details:
(1) This read/write pair could alias:

../../src/vect-test.c:7:13:
     a[i] = b[i] + c[i];
     ~^~~   ~^~~

(2) This read/write pair could alias:

../../src/vect-test.c:7:20:
     a[i] = b[i] + c[i];
     ~^~~          ~^~~

UX idea: Rich vectorization hints (7)

Suggestion:
If you know that the buffers cannot overlap in memory, marking them
with restrict will allow me to assume it when optimizing this loop,
and eliminate the run-time test.

--- ../../src/vect-test.c
+++ ../../src/vect-test.c
@@ -1,5 +1,5 @@
 void
-my_example (int n, int *a, int *b, int *c)
+my_example (int n, int * restrict a, int * restrict b, int * restrict c)
 {
   int i;

---------------------[gcc.vect.loop-requires-versioning-for-alias]--
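
For reference, the function after applying the suggested patch, spelled for C++. The patch above uses C99's `restrict`; C++ has no such keyword, so this sketch assumes the GNU extension `__restrict__`, which GCC and Clang accept. The qualifier is a promise made by the caller: passing buffers that do overlap is undefined behavior.

```cpp
#include <cassert>

// __restrict__ is the GNU C++ spelling of C99's restrict (an extension,
// not standard C++).
void
my_example (int n, int *__restrict__ a, int *__restrict__ b,
            int *__restrict__ c)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}
```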

UX idea: Rich vectorization hints (8)

==[Epilogue required for peeling]===================================

Problem:
I couldn't prove that the number of iterations is a multiple of 4,
so I had to add an "epilogue" to cover the final 0-3 iterations.

Details:
FIXME: add a source code highlight or other visualization here?

Suggestion:
FIXME: add a suggestion here?
---------------------[gcc.vect.loop-requires-epilogue-for-peeling]--

UX idea: Rich vectorization hints (9)

Status:

a hackish prototype
lots of work remains; perhaps too much to get done for gcc 9

Summary

state of optimization dumps in gcc 8
changes so far in gcc 9
- -fsave-optimization-record
proposed changes for gcc 9
- filtering vectorization messages to just the most important, by default
- showing better locations for problematic constructs
- using the diagnostics subsystem?
- show why a pass wasn't run?
rich vectorization hints??

Questions and Discussion

Thanks for listening!

(thanks to Red Hat for funding this work)

URL for these slides: https://dmalcolm.fedorapeople.org/presentations/cauldron-2018/

Source code for these slides: https://github.com/davidmalcolm/2018-cauldron-talk
