From 5bf498ac2f7ff6e4926a4c95ba2df873a511425d Mon Sep 17 00:00:00 2001 From: David Malcolm Date: Tue, 22 May 2018 11:31:53 -0400 Subject: [PATCH 03/24] Add cover-letter.txt --- cover-letter.txt | 501 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 501 insertions(+) create mode 100644 cover-letter.txt diff --git a/cover-letter.txt b/cover-letter.txt new file mode 100644 index 0000000..bcc80ba --- /dev/null +++ b/cover-letter.txt @@ -0,0 +1,501 @@ +Subject: RFC: Prototype of compiler-assisted performance analysis + +I want to provide more "actionable" information on how GCC optimizes +code, both for us (GCC developers), and for advanced end-users. + +For us, I want to make it easier to get a sense of how an optimization +can be improved (bug-fixing). +For end-users, I want to make it easier to figure out which command-line +flags to try, or how to rework their code. + +In particular, given a workload (or benchmark), how does the user figure +out: + * "what did the optimizers do to the hottest code?" + * "what *didn't* they do, and why not?" + * "how can I make the code faster?" (either by editing the code, or the + gcc options) +and as GCC developers: + * "how can we make GCC do a better job on this code?" + + +The current situation +===================== + +I believe we currently have four ways of emitting information on what the +optimizers are doing: + +(a) -fopt-info and its variants (e.g. "-fopt-info-all"). + + Consider this "-fopt-info" code (from "vect_analyze_loop_form"): + + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Symbolic number of iterations is "); + dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations); + dump_printf (MSG_NOTE, "\n"); + } + + This emits text to the destination(s) of "-fopt-info", e.g.: + + foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) + + With "-fopt-info-all" this would be emitted to stderr. + If "details" or "note" was added when enabling the dumpfile + (e.g. via "-fdump-tree-all-details"), it would also be added to the + dumpfile for the current pass. + + Passes are tagged as being members of zero or more optimization groups + ("ipa", "loop", "inline", "omp", "vec"), and these can be used to filter + which messages are emitted. For example, the "increase_alignment" IPA + pass is tagged with both "loop" and "vec". + +(b) other messages emitted to dumpfiles. We have numerous other messages + that are written just to the dumpfiles, and don't go through -fopt-info. + Consider e.g. this from loop-unroll.c:unroll_loop_constant_iterations: + + if (dump_file) + fprintf (dump_file, + ";; Unrolled loop %d times, constant # of iterations %i insns\n", + max_unroll, num_loop_insns (loop)); + +(c) IR dumps emitted to dumpfiles. execute_one_pass and + execute_one_ipa_transform_pass have this code: + + if (dump_file) + do_per_function (execute_function_dump, pass); + + which captures a textual dump of the current state of each function. + +(d) dot-format dumps emitted to the ".dot" dumpfiles (via the "-graph") suffix + e.g. "-fdump-tree-all-graph". + + +Problems with the current situation +=================================== + +-fopt-info: + +* There's no integration with profile data: the report has a "wall of text" + feel, where every site in the code is treated as of having equal + importance. + +* -fopt-info emits lines of the form: + + foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) + + This is purely capturing the source location of the deepest point of the + inlining chain: if there are multiple places where a function has been + inlined, we'll get multiple reports, with no way to distinguish them. + +* If a program attempts to parse this data, it can get source location + plus a message, but there's no structure to the messages. For example, + we can't extract the source location of the expressions that are referred + to in the messages. + +* There are various places where we emit MSG_NOTES that look like sections + and subsections, e.g.: + + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "=== vect_analyze_data_ref_dependences ===\n"); + + This appears to be intended to express the start of a hierarchical + section within the dump, but there's nothing to express the end of + the section, and they're current printed in a flat way, again + contributing to a "wall of text" feel. In my prototype I've + experimented with capturing the nesting, and highlighting it using + indentation. + +dumpfiles: + +* These are per-pass-instance, capturing every function at each pass. + There's no easy way to see how a particular function "evolved" through + the passes. + +* Some messages are written to dumpfiles, but don't go through -fopt-info. + The division seems somewhat arbitrary. + +All of these options are marked as "GCC Developer Options". + + +Aside: LLVM's experience with remarks +===================================== + +LLVM added "remarks" as a kind of diagnostic for tracking optimization +information. It has some similarities to our -fopt-info. + +However, they ran into issues with them. There's an excellent talk +given by Adam Nemet at the November 2016 LLVM Developers' Meeting +on "Compiler-assisted Performance Analysis": + https://www.youtube.com/watch?v=qq0q1hfzidg + http://llvm.org/devmtg/2016-11/Slides/Nemet-Compiler-assistedPerformanceAnalysis.pdf + (part of http://www.llvm.org/devmtg/2016-11/ ) + +He ran into the "wall of text" issue: too much data, where you +can't see what's important. + +His fix was to capture the data in a file format (as "optimization records") +and to write a HTML tool for visualizing them, showing the optimizations +prioritized by code hotness. + +This seems like a great idea, so I've been experimenting with it for GCC. + + +Prototype of a GCC-based solution +================================= + +I've spent the last three weeks experimenting with this for GCC. + +I'm assuming the following constraints: + +* Preserve existing "-fopt-info" overall (whilst allowing for minor changes + to formatting *within* individual messages). + +* Preserve existing dumpfile output, as it's what GCC contributors are + used to. + +* No extra time/memory cost for the default "disabled" case. I assume it's + acceptable to spend a little more time/memory when the user opts-in by + enabling this instrumentation. + +I experimented with a few approaches, and the current prototype is +something of a hybrid. The prototype aims: + +* to capture "optimization records", emitting them in a format + that retains all pertinent information that can be easily parsed. + This separates the data from its presentation. + +* to associate the optimization records with code hotness, + and to provide a report that prioritizes the records, starting with + the hottest code. + +* other metadata to capture include(s): + + * which pass is emitting this record (and data about the pass) + + * the inlining chain (rather than just the "deepest" source location) + + * as well as source location within the user's code, I'm capturing + source location within *GCC* for where the optimization happens: + for any message you can immediately go to the implementation + +I've implemented an HTML report generator, inspired by Adam Nemet's LLVM +work, using the existing places where we're emitting "-fopt-info" +information. + +Differences from Adam Nemet's work: + +* I've added hierarchical records, so that there can be a nesting + structure of optimization notes (otherwise there's still too much + to decipher). + +* capture of GCC source location + +* LLVM is using YAML for some reason; I used JSON. Given that I'm + capturing some different things, I didn't attempt to use the same + file format as LLVM. + +Example 1 +********* + +Consider this "-fopt-info" code (from "vect_analyze_loop_form"):: + + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, + "Symbolic number of iterations is "); + dump_generic_expr (MSG_NOTE, TDF_DETAILS, number_of_iterations); + dump_printf (MSG_NOTE, "\n"); + } + +This purely emits text to any destination(s) of "-fopt-info", e.g.: + + foo.c:15:3: note: Symbolic number of iterations is (unsigned int) n_9(D) + +e.g. to stderr, and/or the dumpfile for the current pass. + +This can be rewritten for optimization records as: + + if (optinfo_enabled_p ()) + { + OPTINFO_VECT_NOTE + << "symbolic number of iterations is " + << details (number_of_iterations); + } + +It can then be emitted in up to three ways: + +(a) as before via "-fopt-info":: + + foo.c:15:3: note: symbolic number of iterations is (unsigned int) n_9(D) + + (albeit with capitalization changes), to the "-fopt-info" destination(s). + +(b) It can be emitted as a "remark" through the diagnostics subsystem, + showing the extra metadata:: + + foo.c:15:3: remark: symbolic number of iterations is ‘(unsigned int) + n_9(D)’ [pass=vect] [count(guessed_local)=955630224] + + getting the standard diagnostics benefits: + + * announcing "In function 'blah':" etc when this changes. + + * printing the corresponding source lines (unless + "-fno-diagnostics-show-caret"), and + + * colorizing the various parts of the message if stderr is at a tty. + +(c) It is emitted in JSON form through "-fsave-optimization-record": + + { + "kind": "note", + "count": { + "quality": "guessed_local", + "value": 9.5563e+08 + }, + "location": { + "line": 15, + "file": "foo.c", + "column": 3 + }, + "pass": "vect", + "message": [ + "symbolic number of iterations is ", + { + "expr": "(unsigned int) n_8(D)" + } + ], + "impl_location": { + "line": 1674, + "file": "../../src/gcc/tree-vect-loop.c", + "function": "vect_analyze_loop_form" + }, + "function": "compute_sum" + }, + + Note how this captures metadata such as profile count, function name, + pass, etc. + +I've not been a fan of the "<<" C++ idiom, but it seems to +work well here for capturing the data we need (but do we need to +internationalize the messages?) + +Example 2 +********* + +Consider this "-fopt-info" code (from "vect_create_data_ref_ptr"): + + if (dump_enabled_p ()) + { + tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr)); + dump_printf_loc (MSG_NOTE, vect_location, + "create %s-pointer variable to type: ", + get_tree_code_name (TREE_CODE (aggr_type))); + dump_generic_expr (MSG_NOTE, TDF_SLIM, aggr_type); + if (TREE_CODE (dr_base_type) == ARRAY_TYPE) + dump_printf (MSG_NOTE, " vectorizing an array ref: "); + else if (TREE_CODE (dr_base_type) == VECTOR_TYPE) + dump_printf (MSG_NOTE, " vectorizing a vector ref: "); + else if (TREE_CODE (dr_base_type) == RECORD_TYPE) + dump_printf (MSG_NOTE, " vectorizing a record based array ref: "); + else + dump_printf (MSG_NOTE, " vectorizing a pointer ref: "); + dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_BASE_OBJECT (dr)); + dump_printf (MSG_NOTE, "\n"); + } + +where the information is built up piecewise, with conditional logic. + +This purely emits text to any destination(s) of "-fopt-info", e.g.: + + foo.c:12:5: note: create vector_type-pointer variable to type: + vector(4) int vectorizing a pointer ref: *arr_9(D) + +e.g. to stderr, and/or the dumpfile for the current pass. + +This can be rewritten by exposing the "pending_optinfo": + + if (optinfo_enabled_p ()) + { + tree dr_base_type = TREE_TYPE (DR_BASE_OBJECT (dr)); + pending_optinfo info = OPTINFO_VECT_NOTE + << optinfo_printf ("create %s-pointer variable to type: ", + get_tree_code_name (TREE_CODE (aggr_type))) + << slim (aggr_type); + if (TREE_CODE (dr_base_type) == ARRAY_TYPE) + info << " vectorizing an array ref: "; + else if (TREE_CODE (dr_base_type) == VECTOR_TYPE) + info << " vectorizing a vector ref: "; + else if (TREE_CODE (dr_base_type) == RECORD_TYPE) + info << " vectorizing a record based array ref: "; + else + info << " vectorizing a pointer ref: "; + info << slim (DR_BASE_OBJECT (dr)); + } + +The "pending_optinfo" is emitted when it goes out of scope. As before, it +can be emitted in up to three ways: + +(a) as before via "-fopt-info": + + foo.c:12:5: note: create vector_type-pointer variable to type: + vector(4) int vectorizing a pointer ref: *arr_9(D) + + (albeit with capitalization changes), to the "-fopt-info" destination(s). + +(b) It can be emitted as a "remark" through the diagnostics subsystem: + + foo.c:15:3: remark: create vector_type-pointer variable to type: + ‘vector(4) int’ vectorizing a pointer ref: ‘*arr_9(D)’ [pass=vect] + [count(guessed_local)=860067200] + +(c) It is emitted in JSON form through "-fsave-optimization-record": + + { + "kind": "note", + "count": { + "quality": "guessed_local", + "value": 8.60067e+08 + }, + "location": { + "line": 15, + "file": "foo.c", + "column": 3 + }, + "pass": "vect", + "message": [ + "create vector_type-pointer variable to type: ", + { + "expr": "vector(4) int" + }, + " vectorizing a pointer ref: ", + { + "location": { + "line": 16, + "file": "foo.c", + "column": 5 + }, + "expr": "*arr_9(D)" + } + ], + "impl_location": { + "line": 4702, + "file": "../../src/gcc/tree-vect-data-refs.c", + "function": "vect_create_data_ref_ptr" + }, + "function": "compute_sum" + }, + + Note how the message isn't just text: here it has captured the + source location of the pointer ref. + +I've written an HTML generator (in Python 3) + https://github.com/davidmalcolm/gcc-opt-viewer +which takes a directory full of these json files and emits a directory +of HTML files, with an index.html showing the optimizations, prioritized +from hottest code downwards. + +An example of HTML output can be seen here: + +* https://dmalcolm.fedorapeople.org/gcc/2018-05-18/pgo-demo-test/pgo-demo-test/ + +based on this source: + +* https://github.com/davidmalcolm/pgo-demo + +The HTML shows the optimizations inline next to the source code it affects. +It uses a little JavaScript to add expand/collapse toggles when the amount +of text for an optimization gets too long. You can see the hierarchical +structure of the notes via indentation (e.g. for the vectorization example). +An alternative presentation could add an explicit tree structure with +more expand/collapse toggles - this is just a prototype. + +I've ported many messages from the existing API to the "optinfo" API, but +I deliberately didn't make significant changes to messages. But there's +probably scope for improvements here. For example, in the +"presentation-not-inlined.c" example at the link above, the vect pass emits: + + === analyzing loop === + === analyze_loop_nest === + === vect_analyze_loop_form === + === get_loop_niters === + symbolic number of iterations is (unsigned int) n_9(D) + not vectorized: loop contains function calls or data references that cannot be analyzed + +which is useful, but more useful would be to know *which* function calls +or data references couldn't be analyzed. So that would be something +to tackle in followups (this post is more about discussing the overall +direction). + +Caveats +======= + +This is still just a rough prototype (I'm looking for feedback, especially +from loop, vectorization, and IPA maintainers). + +Some known issues: + +* I haven't yet attempted to bootstrap this; no ChangeLog yet + +* there are lots of FIXMEs + +* the JSON format ought to be versioned + +* "remarks" are a bit half-baked in this patch kit, it's just on/off. + It probably should support the optimization group tags from + -fopt-info. LLVM has a -f[no-]diagnostics-show-hotness; etc. + +* I'm using a little C++11 (std::shared_ptr), which would need to be + ported to C++98. + +Known unknowns +============== + +* Should the viewer generate static HTML, or serve dynamic HTML? + When I tried this on non-trivial examples, the index.html + became unwieldy, so pagination would have helped a lot. That said, + there could be multiple different viewers, a benefit of separating + the data out from its presentation. + +* Should there be a way to pre-filter the optimization records before + they're written out to JSON? Or is it OK to dump everything, and + make filtering be the viewer's responsibility? (currently the patch + kit dumps everything) + +* Should the JSON format interact with the other uses of dumpfiles + (both free-form messages, and capturing the IR)? This would allow + a viewer e.g. to show each state of a given function as it goes + through the optimizers. I've been experimenting with this, but it's + not in the patch kit as is. + +* Formatting convention mismatches: + + * diagnostics have lowercase start, and no trailing period + + * -fopt-info dumps have uppercase start, and often have a trailing period + + * but some of them have multiple sentences + +* Is there a natural tension here between logs (which are timed-ordered + events) vs source markup (source-ordered)? Consider e.g. a + collection of related messages from one subroutine. + +* Should optimization info be internationalized/localized? Currently + "-fopt-info" isn't. I say let's *not* i18n it. + +* My prototype respects TDF_DETAILS for the existing "-fopt-info" output, + but given that it appears to have been applied haphazardly, and that we + can filter in other ways, my prototype ignores it for the other output + formats. It could be added. + +* Which JSON library to use for output? Note that we only need output, so + that for the final version we can directly write the JSON, rather than the + current approach which, inefficiently, requires first building a DOM-like + tree in memory. + +Thanks for reading this far. + +Thoughts? +Dave -- 1.8.5.3