'\" t .\" Man page generated from reStructuredText. . . .nr rst2man-indent-level 0 . .de1 rstReportMargin \\$1 \\n[an-margin] level \\n[rst2man-indent-level] level margin: \\n[rst2man-indent\\n[rst2man-indent-level]] - \\n[rst2man-indent0] \\n[rst2man-indent1] \\n[rst2man-indent2] .. .de1 INDENT .\" .rstReportMargin pre: . RS \\$1 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin] . nr rst2man-indent-level +1 .\" .rstReportMargin post: .. .de UNINDENT . RE .\" indent \\n[an-margin] .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] .nr rst2man-indent-level -1 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] .in \\n[rst2man-indent\\n[rst2man-indent-level]]u .. .TH "POLLY" "1" "May 01, 2024" "17" "Polly" .SH NAME polly \- Polly .SH RELEASE NOTES .sp In Polly 17, the following important changes have been incorporated. .INDENT 0.0 .IP \(bu 2 Support for \-polly\-vectorizer=polly has been removed. Polly\(aqs internal vectorizer is not well maintained and is known to not work in some cases, such as region ScopStmts. Unlike LLVM\(aqs LoopVectorize pass, it also does not have a target\-dependent cost heuristic, and we recommend using LoopVectorize instead of \-polly\-vectorizer=polly. .sp In the future we hope that Polly can collaborate better with LoopVectorize, such as Polly marking a loop as safe to vectorize with a specific SIMD width, instead of replicating its functionality. .IP \(bu 2 Polly\-ACC has been removed. .UNINDENT .SH THE ARCHITECTURE .sp Polly is a loop optimizer for LLVM. Starting from LLVM\-IR, it detects and extracts interesting loop kernels. For each kernel, a mathematical model is derived that precisely describes the individual computations and memory accesses in the kernel. Within Polly, a variety of analyses and code transformations are performed on this mathematical model. After all optimizations have been derived and applied, optimized LLVM\-IR is regenerated and inserted into the LLVM\-IR module.
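To illustrate what such a kernel looks like, here is a minimal, hypothetical C example (not taken from the Polly sources): a loop nest whose bounds and array subscripts are affine expressions of the induction variables. This affine structure is what allows Polly to build an exact mathematical model of the computations and memory accesses.

```c
#include <stddef.h>

/* Hypothetical illustration (not from the Polly sources): a loop
 * nest whose bounds and subscripts are affine expressions of the
 * induction variables -- the kind of kernel Polly can model. */
void average_transposed(size_t n, float A[n][n], float B[n][n])
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            /* The accesses A[i][j] and B[j][i] are affine, so the
             * exact memory behavior of every iteration is known. */
            A[i][j] = 0.5f * (A[i][j] + B[j][i]);
}
```

Because every access is affine, transformations such as interchange or tiling of this nest can be proven correct from the model alone.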
[image] .SS Polly in the LLVM pass pipeline .sp The standard LLVM pass pipeline as it is used in \-O1/\-O2/\-O3 mode of clang/opt consists of a sequence of passes that can be grouped into different conceptual phases. The first phase, which we here call \fBCanonicalization\fP, is a scalar canonicalization phase that contains passes like \-mem2reg, \-instcombine, \-cfgsimplify, or early loop unrolling. Its goal is to remove and simplify the given IR as much as possible, focusing mostly on scalar optimizations. The second phase consists of three conceptual groups that are executed in the so\-called \fBInliner cycle\fP\&. This is again a set of \fBScalar Simplification\fP passes, a set of \fBSimple Loop Optimizations\fP, and the \fBInliner\fP itself. Even though these passes make up the majority of the LLVM pass pipeline, their primary goal is still canonicalization without losing semantic information that complicates later analysis. As part of the inliner cycle, the LLVM inliner step\-by\-step tries to inline functions, runs canonicalization passes to exploit newly exposed simplification opportunities, and then tries to inline the further simplified functions. Some simple loop optimizations are executed as part of the inliner cycle. Even though they perform some optimizations, their primary goal is still the simplification of the program code. Loop invariant code motion is one such optimization that, besides being beneficial for program performance, also allows us to move computation out of loops and in the best case enables us to eliminate certain loops completely. Only after the inliner cycle has finished, a last \fBTarget Specialization\fP phase is run, where IR complexity is deliberately increased to take advantage of target\-specific features that maximize the execution performance on the device we target.
One of the principal optimizations in this phase is vectorization, but also target\-specific loop unrolling, or some loop transformations (e.g., distribution) that expose more vectorization opportunities. [image] .sp Polly can conceptually be run at three different positions in the pass pipeline: as an early optimizer before the standard LLVM pass pipeline, as a late optimizer as part of the target specialization sequence, and theoretically also with the loop optimizations in the inliner cycle. We only discuss the first two options, as running Polly in the inliner cycle is likely to disturb the inliner and is consequently not a good idea. [image] .sp Running Polly early, before the standard pass pipeline, has the benefit that the LLVM\-IR processed by Polly is still very close to the original input code. Hence, it is less likely that transformations applied by LLVM change the IR in ways not easily understandable for the programmer. As a result, user feedback is likely better and it is less likely that kernels that in C seem a perfect fit for Polly have been transformed such that Polly cannot handle them anymore. On the other hand, code that requires inlining to be optimized won\(aqt benefit if Polly is scheduled at this position. The additional set of canonicalization passes required will result in a small but general compile\-time increase and some random run\-time performance changes due to slightly different IR being passed through the optimizers. To force Polly to run early in the pass pipeline use the option \fI\-polly\-position=early\fP (the default today). [image] .sp Running Polly right before the vectorizer has the benefit that the full inlining cycle has been run and as a result even heavily templated C++ code could theoretically benefit from Polly (more work is necessary to make Polly really effective here). As the IR that is passed to Polly has already been canonicalized, there is also no need to run additional canonicalization passes.
Overall compile time is barely affected by Polly, as detection of loop kernels is generally very fast and the actual optimization and cleanup passes are only run on functions that contain loop kernels worth optimizing. However, due to the many optimizations that LLVM runs before Polly, the IR that reaches Polly often has additional scalar dependences that make Polly a lot less effective. To force Polly to run before the vectorizer in the pass pipeline use the option \fI\-polly\-position=before\-vectorizer\fP\&. This position is not yet the default for Polly, but work is underway to make it effective even in the presence of scalar dependences. After this work has been completed, Polly will likely use this position by default. [image] .SH USING POLLY WITH CLANG .sp This documentation discusses how Polly can be used in Clang to automatically optimize C/C++ code during compilation. .sp \fBWARNING:\fP .INDENT 0.0 .INDENT 3.5 clang, LLVM, and Polly need to be in sync (compiled from the same revision). .UNINDENT .UNINDENT .SS Make Polly available from Clang .sp Polly is available through clang, opt, and bugpoint if Polly was checked out into tools/polly before compilation. No further configuration is needed. .SS Optimizing with Polly .sp Optimizing with Polly is as easy as adding \-O3 \-mllvm \-polly to your compiler flags (Polly is not available unless optimizations are enabled, such as \-O1, \-O2, or \-O3; optimizing for size with \-Os or \-Oz is not recommended). .INDENT 0.0 .INDENT 3.5 .sp .EX clang \-O3 \-mllvm \-polly file.c .EE .UNINDENT .UNINDENT .SS Automatic OpenMP code generation .sp To automatically detect parallel loops and generate OpenMP code for them, you also need to add \-mllvm \-polly\-parallel \-lgomp to your CFLAGS.
.INDENT 0.0 .INDENT 3.5 .sp .EX clang \-O3 \-mllvm \-polly \-mllvm \-polly\-parallel \-lgomp file.c .EE .UNINDENT .UNINDENT .SS Switching the OpenMP backend .sp The following command\-line switch allows choosing Polly\(aqs OpenMP backend: .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .TP .B \-polly\-omp\-backend[=BACKEND] choose the OpenMP backend; BACKEND can be \(aqGNU\(aq (the default) or \(aqLLVM\(aq; .UNINDENT .UNINDENT .UNINDENT .sp The OpenMP backends can be further influenced using the following command\-line switches: .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .TP .B \-polly\-num\-threads[=NUM] set the number of threads to use; NUM may be any positive integer (default: 0, which equals automatic/OMP runtime); .TP .B \-polly\-scheduling[=SCHED] set the OpenMP scheduling type; SCHED can be \(aqstatic\(aq, \(aqdynamic\(aq, \(aqguided\(aq or \(aqruntime\(aq (the default); .TP .B \-polly\-scheduling\-chunksize[=CHUNK] set the chunksize (for the selected scheduling type); CHUNK may be any strictly positive integer (otherwise it will default to 1); .UNINDENT .UNINDENT .UNINDENT .sp Note that at the time of writing, the GNU backend may only use the \fIpolly\-num\-threads\fP and \fIpolly\-scheduling\fP switches, where the latter also has to be set to \(dqruntime\(dq. .sp Example: use the alternative backend with dynamic scheduling, four threads, and a chunksize of one (additional switches): .INDENT 0.0 .INDENT 3.5 .sp .EX \-mllvm \-polly\-omp\-backend=LLVM \-mllvm \-polly\-num\-threads=4 \-mllvm \-polly\-scheduling=dynamic \-mllvm \-polly\-scheduling\-chunksize=1 .EE .UNINDENT .UNINDENT .SS Automatic Vector code generation .sp Automatic vector code generation can be enabled by adding \-mllvm \-polly\-vectorizer=stripmine to your CFLAGS. .INDENT 0.0 .INDENT 3.5 .sp .EX clang \-O3 \-mllvm \-polly \-mllvm \-polly\-vectorizer=stripmine file.c .EE .UNINDENT .UNINDENT .SS Isolate the Polly passes .sp Polly\(aqs analysis and transformation passes are run alongside many other passes of the pass manager\(aqs pipeline.
Some of the passes that run before Polly are essential for it to work, for instance loop canonicalization. Therefore Polly is unable to optimize code straight out of clang\(aqs \-O0 output. .sp To get the LLVM\-IR that Polly sees in the optimization pipeline, use the command: .INDENT 0.0 .INDENT 3.5 .sp .EX clang file.c \-c \-O3 \-mllvm \-polly \-mllvm \-polly\-dump\-before\-file=before\-polly.ll .EE .UNINDENT .UNINDENT .sp This writes a file \(aqbefore\-polly.ll\(aq containing the LLVM\-IR as passed to Polly, after SSA transformation, loop canonicalization, inlining and other passes. .sp Thereafter, any Polly pass can be run over \(aqbefore\-polly.ll\(aq using the \(aqopt\(aq tool. To find out which Polly passes are active in the standard pipeline, see the output of .INDENT 0.0 .INDENT 3.5 .sp .EX clang file.c \-c \-O3 \-mllvm \-polly \-mllvm \-debug\-pass=Arguments .EE .UNINDENT .UNINDENT .sp Polly\(aqs passes are those between \(aq\-polly\-detect\(aq and \(aq\-polly\-codegen\(aq. Analysis passes can be omitted. At the time of this writing, the default Polly pass pipeline is: .INDENT 0.0 .INDENT 3.5 .sp .EX opt before\-polly.ll \-polly\-simplify \-polly\-optree \-polly\-delicm \-polly\-simplify \-polly\-prune\-unprofitable \-polly\-opt\-isl \-polly\-codegen .EE .UNINDENT .UNINDENT .sp Note that this uses LLVM\(aqs old/legacy pass manager. .sp For completeness, here are some other methods that generate IR suitable for processing with Polly from C/C++/Objective C source code. The previous method is the recommended one. .sp The following generates unoptimized LLVM\-IR (\(aq\-O0\(aq, which is the default) and runs the canonicalizing passes on it (\(aq\-polly\-canonicalize\(aq). This does /not/ include all the passes that run before Polly in the default pass pipeline. The \(aq\-disable\-O0\-optnone\(aq option is required because otherwise clang adds an \(aqoptnone\(aq attribute to all functions such that they are skipped by most optimization passes.
This is meant to stop LTO builds from optimizing these functions in the linking phase anyway. .INDENT 0.0 .INDENT 3.5 .sp .EX clang file.c \-c \-O0 \-Xclang \-disable\-O0\-optnone \-emit\-llvm \-S \-o \- | opt \-polly\-canonicalize \-S .EE .UNINDENT .UNINDENT .sp The option \(aq\-disable\-llvm\-passes\(aq disables all LLVM passes, even those that run at \-O0. Passing \-O1 (or any optimization level other than \-O0) prevents the \(aqoptnone\(aq attribute from being added. .INDENT 0.0 .INDENT 3.5 .sp .EX clang file.c \-c \-O1 \-Xclang \-disable\-llvm\-passes \-emit\-llvm \-S \-o \- | opt \-polly\-canonicalize \-S .EE .UNINDENT .UNINDENT .sp As another alternative, Polly can be pushed to the front of the pass pipeline, and then its output dumped. This implicitly runs the \(aq\-polly\-canonicalize\(aq passes. .INDENT 0.0 .INDENT 3.5 .sp .EX clang file.c \-c \-O3 \-mllvm \-polly \-mllvm \-polly\-position=early \-mllvm \-polly\-dump\-before\-file=before\-polly.ll .EE .UNINDENT .UNINDENT .SS Further options .sp Polly supports further options that are mainly useful for the development or the analysis of Polly. The relevant options can be added to clang by appending \-mllvm \-option\-name to the CFLAGS or the clang command line. .SS Limit Polly to a single function .sp To limit the execution of Polly to a single function, use the option \-polly\-only\-func=functionname. .SS Disable LLVM\-IR generation .sp Polly normally regenerates LLVM\-IR from the polyhedral representation. To see only the effects of the preparing transformations, and to disable Polly\(aqs code generation, add the option \-polly\-no\-codegen. .SS Graphical view of the SCoPs .sp Polly can use graphviz to show the SCoPs it detects in a program. The relevant options are \-polly\-show, \-polly\-show\-only, \-polly\-dot and \-polly\-dot\-only. The \(aqshow\(aq options automatically run dotty or another graphviz viewer to show the SCoPs graphically.
The \(aqdot\(aq options store for each function a dot file that highlights the detected SCoPs. If \(aqonly\(aq is appended at the end of the option, the basic blocks are shown without the statements they contain. .SS Change/Disable the Optimizer .sp By default, Polly uses the isl scheduling optimizer. The isl optimizer optimizes for data\-locality and parallelism using the Pluto algorithm. To disable the optimizer entirely use the option \-polly\-optimizer=none. .SS Disable tiling in the optimizer .sp By default both optimizers perform tiling, if possible. In case this is not wanted, the option \-polly\-tiling=false can be used to disable it. (This option disables tiling for both optimizers). .SS Import / Export .sp The flags \-polly\-import and \-polly\-export allow the export and reimport of the polyhedral representation. By exporting, modifying, and reimporting the polyhedral representation, externally calculated transformations can be applied. This enables external optimizers or the manual optimization of specific SCoPs. .SS Viewing Polly Diagnostics with opt\-viewer .sp The flag \-fsave\-optimization\-record will generate .opt.yaml files when compiling your program. These YAML files contain information about each emitted remark. Ensure that you have Python 2.7 with the PyYaml and Pygments Python packages. To run opt\-viewer: .INDENT 0.0 .INDENT 3.5 .sp .EX llvm/tools/opt\-viewer/opt\-viewer.py \-source\-dir /path/to/program/src/ \e /path/to/program/src/foo.opt.yaml \e /path/to/program/src/bar.opt.yaml \e \-o ./output .EE .UNINDENT .UNINDENT .sp Include all YAML files (use *.opt.yaml when specifying which YAML files to view) to view all diagnostics from your program in opt\-viewer. Compile with \X'tty: link https://clang.llvm.org/docs/UsersManual.html#profiling-with-instrumentation'\fI\%PGO\fP\X'tty: link' to view hotness information in opt\-viewer. The resulting HTML files can be viewed in a web browser.
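For orientation, a single remark record in an .opt.yaml file looks roughly like the following. This is a simplified, hand-written sketch: real records carry further fields such as DebugLoc and an Args list, and the Name value shown here is hypothetical.

```yaml
# Simplified sketch of one optimization-record document
# (field values here are hypothetical examples).
--- !Missed
Pass:     polly-detect
Name:     RejectedLoop
Function: main
```

Each document is tagged with the remark kind (e.g. !Passed, !Missed, !Analysis), which opt-viewer uses to color and group the diagnostics.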
.SH HOW TO MANUALLY USE THE INDIVIDUAL PIECES OF POLLY .SS Execute the individual Polly passes manually .sp This example presents the individual passes that are involved when optimizing code with Polly. We show how to execute them individually and explain for each which analysis is performed or what transformation is applied. In this example the polyhedral transformation is user\-provided, to show how much performance improvement can be expected from an optimal automatic optimizer. .SS 1. \fBCreate LLVM\-IR from the C code\fP .INDENT 0.0 .INDENT 3.5 Polly works on LLVM\-IR. Hence it is necessary to translate the source files into LLVM\-IR. If more than one file should be optimized, the files can be combined into a single file with llvm\-link. .INDENT 0.0 .INDENT 3.5 .sp .EX clang \-S \-emit\-llvm matmul.c \-Xclang \-disable\-O0\-optnone \-o matmul.ll .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 2. \fBPrepare the LLVM\-IR for Polly\fP .INDENT 0.0 .INDENT 3.5 Polly is only able to work with code that matches a canonical form. To translate the LLVM\-IR into this form we use a set of canonicalization passes. They are scheduled by using \(aq\-polly\-canonicalize\(aq. .INDENT 0.0 .INDENT 3.5 .sp .EX opt \-S \-polly\-canonicalize matmul.ll \-o matmul.preopt.ll .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 3. \fBShow the SCoPs detected by Polly (optional)\fP .INDENT 0.0 .INDENT 3.5 To understand if Polly was able to detect SCoPs, we print the structure of the detected SCoPs. In our example two SCoPs are detected: one in \(aqinit_array\(aq, the other in \(aqmain\(aq.
.INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-ast \-analyze matmul.preopt.ll \-polly\-process\-unprofitable \-polly\-use\-llvm\-names .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX :: isl ast :: init_array :: %for.cond1.preheader\-\-\-%for.end19 if (1) for (int c0 = 0; c0 <= 1535; c0 += 1) for (int c1 = 0; c1 <= 1535; c1 += 1) Stmt_for_body3(c0, c1); else { /* original code */ } :: isl ast :: main :: %for.cond1.preheader\-\-\-%for.end30 if (1) for (int c0 = 0; c0 <= 1535; c0 += 1) for (int c1 = 0; c1 <= 1535; c1 += 1) { Stmt_for_body3(c0, c1); for (int c2 = 0; c2 <= 1535; c2 += 1) Stmt_for_body8(c0, c1, c2); } else { /* original code */ } .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 4. \fBHighlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)\fP .INDENT 0.0 .INDENT 3.5 Polly can use graphviz to graphically show a CFG in which the detected SCoPs are highlighted. It can also create \(aq.dot\(aq files that can be translated by the \(aqdot\(aq utility into various graphic formats. 
.INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-polly\-use\-llvm\-names \-basic\-aa \-view\-scops \-disable\-output matmul.preopt.ll $ opt \-polly\-use\-llvm\-names \-basic\-aa \-view\-scops\-only \-disable\-output matmul.preopt.ll .EE .UNINDENT .UNINDENT .sp The output for the different functions: .INDENT 0.0 .IP \(bu 2 view\-scops : \X'tty: link http://polly.llvm.org/experiments/matmul/scops.main.dot.png'\fI\%main\fP\X'tty: link', \X'tty: link http://polly.llvm.org/experiments/matmul/scops.init_array.dot.png'\fI\%init_array\fP\X'tty: link', \X'tty: link http://polly.llvm.org/experiments/matmul/scops.print_array.dot.png'\fI\%print_array\fP\X'tty: link' .IP \(bu 2 view\-scops\-only : \X'tty: link http://polly.llvm.org/experiments/matmul/scopsonly.main.dot.png'\fI\%main\-scopsonly\fP\X'tty: link', \X'tty: link http://polly.llvm.org/experiments/matmul/scopsonly.init_array.dot.png'\fI\%init_array\-scopsonly\fP\X'tty: link', \X'tty: link http://polly.llvm.org/experiments/matmul/scopsonly.print_array.dot.png'\fI\%print_array\-scopsonly\fP\X'tty: link' .UNINDENT .UNINDENT .UNINDENT .SS 5. 
\fBView the polyhedral representation of the SCoPs\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-polly\-use\-llvm\-names \-basic\-aa \-polly\-scops \-analyze matmul.preopt.ll \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...]Printing analysis \(aqPolly \- Create polyhedral description of Scops\(aq for region: \(aqfor.cond1.preheader => for.end19\(aq in function \(aqinit_array\(aq: Function: init_array Region: %for.cond1.preheader\-\-\-%for.end19 Max Loop Depth: 2 Invariant Accesses: { } Context: { : } Assumed Context: { : } Invalid Context: { : 1 = 0 } Arrays { float MemRef_A[*][1536]; // Element size 4 float MemRef_B[*][1536]; // Element size 4 } Arrays (Bounds as pw_affs) { float MemRef_A[*][ { [] \-> [(1536)] } ]; // Element size 4 float MemRef_B[*][ { [] \-> [(1536)] } ]; // Element size 4 } Alias Groups (0): n/a Statements { Stmt_for_body3 Domain := { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; Schedule := { Stmt_for_body3[i0, i1] \-> [i0, i1] }; MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body3[i0, i1] \-> MemRef_A[i0, i1] }; MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body3[i0, i1] \-> MemRef_B[i0, i1] }; } [...]Printing analysis \(aqPolly \- Create polyhedral description of Scops\(aq for region: \(aqfor.cond1.preheader => for.end30\(aq in function \(aqmain\(aq: Function: main Region: %for.cond1.preheader\-\-\-%for.end30 Max Loop Depth: 3 Invariant Accesses: { } Context: { : } Assumed Context: { : } Invalid Context: { : 1 = 0 } Arrays { float MemRef_C[*][1536]; // Element size 4 float MemRef_A[*][1536]; // Element size 4 float MemRef_B[*][1536]; // Element size 4 } Arrays (Bounds as pw_affs) { float MemRef_C[*][ { [] \-> [(1536)] } ]; // Element size 4 float MemRef_A[*][ { [] \-> [(1536)] } ]; // Element size 4 float MemRef_B[*][ { [] \-> [(1536)] } ]; // Element size 4 } Alias Groups (0): n/a Statements { Stmt_for_body3 Domain := { 
Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 }; Schedule := { Stmt_for_body3[i0, i1] \-> [i0, i1, 0, 0] }; MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body3[i0, i1] \-> MemRef_C[i0, i1] }; Stmt_for_body8 Domain := { Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1535 }; Schedule := { Stmt_for_body8[i0, i1, i2] \-> [i0, i1, 1, i2] }; ReadAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body8[i0, i1, i2] \-> MemRef_C[i0, i1] }; ReadAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body8[i0, i1, i2] \-> MemRef_A[i0, i2] }; ReadAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body8[i0, i1, i2] \-> MemRef_B[i2, i1] }; MustWriteAccess := [Reduction Type: NONE] [Scalar: 0] { Stmt_for_body8[i0, i1, i2] \-> MemRef_C[i0, i1] }; } .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 6. \fBShow the dependences for the SCoPs\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names \-polly\-dependences \-analyze matmul.preopt.ll \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...]Printing analysis \(aqPolly \- Calculate dependences\(aq for region: \(aqfor.cond1.preheader => for.end19\(aq in function \(aqinit_array\(aq: RAW dependences: { } WAR dependences: { } WAW dependences: { } Reduction dependences: n/a Transitive closure of reduction dependences: { } [...]Printing analysis \(aqPolly \- Calculate dependences\(aq for region: \(aqfor.cond1.preheader => for.end30\(aq in function \(aqmain\(aq: RAW dependences: { Stmt_for_body3[i0, i1] \-> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] \-> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } WAR dependences: { } WAW dependences: { Stmt_for_body3[i0, i1] \-> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] \-> Stmt_for_body8[i0, i1, 1 + 
i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 } Reduction dependences: n/a Transitive closure of reduction dependences: { } .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 7. \fBExport jscop files\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names \-polly\-export\-jscop matmul.preopt.ll \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...]Writing JScop \(aq%for.cond1.preheader\-\-\-%for.end19\(aq in function \(aqinit_array\(aq to \(aq./init_array___%for.cond1.preheader\-\-\-%for.end19.jscop\(aq. Writing JScop \(aq%for.cond1.preheader\-\-\-%for.end30\(aq in function \(aqmain\(aq to \(aq./main___%for.cond1.preheader\-\-\-%for.end30.jscop\(aq. .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 8. \fBImport the changed jscop files and print the updated SCoP structure (optional)\fP .INDENT 0.0 .INDENT 3.5 Polly can reimport jscop files, in which the schedules of the statements are changed. These changed schedules are used to describe transformations. It is possible to import different jscop files by providing the postfix of the jscop file that is imported. .sp We apply three different transformations on the SCoP in the main function. The jscop files describing these transformations are hand written (and available in docs/experiments/matmul). .sp \fBNo Polly\fP .sp As a baseline we do not call any Polly code generation, but only apply the normal \-O3 optimizations. .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names matmul.preopt.ll \-polly\-import\-jscop \-polly\-ast \-analyze \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...] :: isl ast :: main :: %for.cond1.preheader\-\-\-%for.end30 if (1) for (int c0 = 0; c0 <= 1535; c0 += 1) for (int c1 = 0; c1 <= 1535; c1 += 1) { Stmt_for_body3(c0, c1); for (int c3 = 0; c3 <= 1535; c3 += 1) Stmt_for_body8(c0, c1, c3); } else { /* original code */ } [...] 
.EE .UNINDENT .UNINDENT .sp \fBLoop Interchange (and Fission to allow the interchange)\fP .sp We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_for_body8. .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names matmul.preopt.ll \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged \-polly\-ast \-analyze \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...] :: isl ast :: main :: %for.cond1.preheader\-\-\-%for.end30 if (1) { for (int c1 = 0; c1 <= 1535; c1 += 1) for (int c2 = 0; c2 <= 1535; c2 += 1) Stmt_for_body3(c1, c2); for (int c1 = 0; c1 <= 1535; c1 += 1) for (int c2 = 0; c2 <= 1535; c2 += 1) for (int c3 = 0; c3 <= 1535; c3 += 1) Stmt_for_body8(c1, c3, c2); } else { /* original code */ } [...] .EE .UNINDENT .UNINDENT .sp \fBInterchange + Tiling\fP .sp In addition to the interchange, we now tile the second loop nest. .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names matmul.preopt.ll \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged+tiled \-polly\-ast \-analyze \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...] :: isl ast :: main :: %for.cond1.preheader\-\-\-%for.end30 if (1) { for (int c1 = 0; c1 <= 1535; c1 += 1) for (int c2 = 0; c2 <= 1535; c2 += 1) Stmt_for_body3(c1, c2); for (int c1 = 0; c1 <= 1535; c1 += 64) for (int c2 = 0; c2 <= 1535; c2 += 64) for (int c3 = 0; c3 <= 1535; c3 += 64) for (int c4 = c1; c4 <= c1 + 63; c4 += 1) for (int c5 = c3; c5 <= c3 + 63; c5 += 1) for (int c6 = c2; c6 <= c2 + 63; c6 += 1) Stmt_for_body8(c4, c6, c5); } else { /* original code */ } [...] .EE .UNINDENT .UNINDENT .sp \fBInterchange + Tiling + Strip\-mining to prepare vectorization\fP .sp To later allow vectorization, we create a so\-called trivially parallelizable loop. It is innermost, parallel, and has only four iterations. It can be replaced by 4\-element SIMD instructions.
.INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-basic\-aa \-polly\-use\-llvm\-names matmul.preopt.ll \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged+tiled+vector \-polly\-ast \-analyze \-polly\-process\-unprofitable .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX [...] :: isl ast :: main :: %for.cond1.preheader\-\-\-%for.end30 if (1) { for (int c1 = 0; c1 <= 1535; c1 += 1) for (int c2 = 0; c2 <= 1535; c2 += 1) Stmt_for_body3(c1, c2); for (int c1 = 0; c1 <= 1535; c1 += 64) for (int c2 = 0; c2 <= 1535; c2 += 64) for (int c3 = 0; c3 <= 1535; c3 += 64) for (int c4 = c1; c4 <= c1 + 63; c4 += 1) for (int c5 = c3; c5 <= c3 + 63; c5 += 1) for (int c6 = c2; c6 <= c2 + 63; c6 += 4) for (int c7 = c6; c7 <= c6 + 3; c7 += 1) Stmt_for_body8(c4, c7, c5); } else { /* original code */ } [...] .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 9. \fBCodegenerate the SCoPs\fP .INDENT 0.0 .INDENT 3.5 This generates new code for the SCoPs detected by Polly. If \-polly\-import\-jscop is present, transformations specified in the imported jscop files will be applied. .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-S matmul.preopt.ll | opt \-S \-O3 \-o matmul.normalopt.ll .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-S matmul.preopt.ll \-basic\-aa \-polly\-use\-llvm\-names \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged \-polly\-codegen \-polly\-process\-unprofitable | opt \-S \-O3 \-o matmul.polly.interchanged.ll .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end19\(aq in function \(aqinit_array\(aq from \(aq./init_array___%for.cond1.preheader\-\-\-%for.end19.jscop.interchanged\(aq. File could not be read: No such file or directory Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end30\(aq in function \(aqmain\(aq from \(aq./main___%for.cond1.preheader\-\-\-%for.end30.jscop.interchanged\(aq.
.EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-S matmul.preopt.ll \-basic\-aa \-polly\-use\-llvm\-names \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged+tiled \-polly\-codegen \-polly\-process\-unprofitable | opt \-S \-O3 \-o matmul.polly.interchanged+tiled.ll .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end19\(aq in function \(aqinit_array\(aq from \(aq./init_array___%for.cond1.preheader\-\-\-%for.end19.jscop.interchanged+tiled\(aq. File could not be read: No such file or directory Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end30\(aq in function \(aqmain\(aq from \(aq./main___%for.cond1.preheader\-\-\-%for.end30.jscop.interchanged+tiled\(aq. .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-S matmul.preopt.ll \-basic\-aa \-polly\-use\-llvm\-names \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged+tiled+vector \-polly\-codegen \-polly\-vectorizer=polly \-polly\-process\-unprofitable | opt \-S \-O3 \-o matmul.polly.interchanged+tiled+vector.ll .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end19\(aq in function \(aqinit_array\(aq from \(aq./init_array___%for.cond1.preheader\-\-\-%for.end19.jscop.interchanged+tiled+vector\(aq. File could not be read: No such file or directory Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end30\(aq in function \(aqmain\(aq from \(aq./main___%for.cond1.preheader\-\-\-%for.end30.jscop.interchanged+tiled+vector\(aq. 
.EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX $ opt \-S matmul.preopt.ll \-basic\-aa \-polly\-use\-llvm\-names \-polly\-import\-jscop \-polly\-import\-jscop\-postfix=interchanged+tiled+vector \-polly\-codegen \-polly\-vectorizer=polly \-polly\-parallel \-polly\-process\-unprofitable | opt \-S \-O3 \-o matmul.polly.interchanged+tiled+vector+openmp.ll .EE .UNINDENT .UNINDENT .INDENT 0.0 .INDENT 3.5 .sp .EX Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end19\(aq in function \(aqinit_array\(aq from \(aq./init_array___%for.cond1.preheader\-\-\-%for.end19.jscop.interchanged+tiled+vector\(aq. File could not be read: No such file or directory Reading JScop \(aq%for.cond1.preheader\-\-\-%for.end30\(aq in function \(aqmain\(aq from \(aq./main___%for.cond1.preheader\-\-\-%for.end30.jscop.interchanged+tiled+vector\(aq. .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 10. \fBCreate the executables\fP .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .INDENT 3.5 .sp .EX $ llc matmul.normalopt.ll \-o matmul.normalopt.s \-relocation\-model=pic $ gcc matmul.normalopt.s \-o matmul.normalopt.exe $ llc matmul.polly.interchanged.ll \-o matmul.polly.interchanged.s \-relocation\-model=pic $ gcc matmul.polly.interchanged.s \-o matmul.polly.interchanged.exe $ llc matmul.polly.interchanged+tiled.ll \-o matmul.polly.interchanged+tiled.s \-relocation\-model=pic $ gcc matmul.polly.interchanged+tiled.s \-o matmul.polly.interchanged+tiled.exe $ llc matmul.polly.interchanged+tiled+vector.ll \-o matmul.polly.interchanged+tiled+vector.s \-relocation\-model=pic $ gcc matmul.polly.interchanged+tiled+vector.s \-o matmul.polly.interchanged+tiled+vector.exe $ llc matmul.polly.interchanged+tiled+vector+openmp.ll \-o matmul.polly.interchanged+tiled+vector+openmp.s \-relocation\-model=pic $ gcc matmul.polly.interchanged+tiled+vector+openmp.s \-lgomp \-o matmul.polly.interchanged+tiled+vector+openmp.exe .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS 11.
\fBCompare the runtime of the executables\fP .INDENT 0.0 .INDENT 3.5 By comparing the runtimes of the different executables we see that a simple loop interchange gives the largest performance boost here. In this case, however, adding vectorization and using OpenMP degrade the performance. .INDENT 0.0 .INDENT 3.5 .sp .EX $ time ./matmul.normalopt.exe real 0m11.295s user 0m11.288s sys 0m0.004s $ time ./matmul.polly.interchanged.exe real 0m0.988s user 0m0.980s sys 0m0.008s $ time ./matmul.polly.interchanged+tiled.exe real 0m0.830s user 0m0.816s sys 0m0.012s $ time ./matmul.polly.interchanged+tiled+vector.exe real 0m5.430s user 0m5.424s sys 0m0.004s $ time ./matmul.polly.interchanged+tiled+vector+openmp.exe real 0m3.184s user 0m11.972s sys 0m0.036s .EE .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SH TIPS AND TRICKS ON USING AND CONTRIBUTING TO POLLY .SS Committing to Polly trunk .INDENT 0.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \X'tty: link https://stackoverflow.com/questions/190431/is-git-svn-dcommit-after-merging-in-git-dangerous'\fI\%General reference to git\-svn workflow\fP\X'tty: link' .UNINDENT .UNINDENT .UNINDENT .SS Using bugpoint to track down errors in large files .INDENT 0.0 .INDENT 3.5 If you know the \fBopt\fP invocation and have a large \fB\&.ll\fP file that causes an error, \fBbugpoint\fP allows you to reduce the test case to a manageable size.
.sp The general calling pattern is: .INDENT 0.0 .IP \(bu 2 \fB$ bugpoint <file.ll> <crashing pass> \-opt\-args <remaining opt arguments>\fP .UNINDENT .sp An example invocation is: .INDENT 0.0 .IP \(bu 2 \fB$ bugpoint crash.ll \-polly\-codegen \-opt\-args \-polly\-canonicalize \-polly\-process\-unprofitable\fP .UNINDENT .sp For more documentation on bugpoint, \X'tty: link https://llvm.org/docs/Bugpoint.html'\fI\%visit the LLVM manual\fP\X'tty: link'\&. .UNINDENT .UNINDENT .SS Understanding which pass makes a particular change .INDENT 0.0 .INDENT 3.5 If you know that something like \fIopt \-O3 \-polly\fP makes a change, but you wish to isolate which pass is responsible, the steps are as follows: .INDENT 0.0 .IP \(bu 2 \fB$ bugpoint \-O3 file.ll \-opt\-args \-polly\fP will allow bugpoint to track down the pass which causes the crash. .UNINDENT .sp To do this manually: .INDENT 0.0 .IP \(bu 2 Run \fB$ opt \-O3 \-polly \-debug\-pass=Arguments\fP; the \fB\-debug\-pass=Arguments\fP option lists all passes that are run by default. .IP \(bu 2 Bisect down to the pass that introduces the change. .UNINDENT .UNINDENT .UNINDENT .SS Debugging regressions introduced at some unknown earlier point .sp In case of a regression in performance or correctness (e.g., an earlier version of Polly behaved as expected and a later version does not), bisecting over the version history is the standard approach to identify the commit that introduced the regression. .sp LLVM has a single repository that contains all projects. It can be cloned at: \X'tty: link https://github.com/llvm/llvm-project'\fI\%https://github.com/llvm/llvm\-project\fP\X'tty: link'\&. How to bisect on a git repository is explained here: \X'tty: link https://www.metaltoad.com/blog/beginners-guide-git-bisect-process-elimination'\fI\%https://www.metaltoad.com/blog/beginners\-guide\-git\-bisect\-process\-elimination\fP\X'tty: link'\&.
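.sp Both the pass\-level bisection described above and bisection over the commit history rely on the same binary search. The following is a minimal sketch, not part of Polly itself; the pass names and the \fBreproduces()\fP predicate are hypothetical placeholders, and in practice the predicate would run \fBopt\fP with the given prefix of passes (or check out the given commit) and compare the result:

```python
# Generic bisection helper: given an ordered list of items (passes or
# commits) and a monotone predicate `reproduces` that is False for every
# prefix before the culprit and True from the culprit onward, find the
# first item whose inclusion reproduces the problem.
def first_bad(items, reproduces):
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(items[:mid + 1]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return items[lo]

# Hypothetical example: pretend "-licm" is the pass that introduces
# the change, so the predicate fires once "-licm" is in the prefix.
passes = ["-mem2reg", "-instcombine", "-licm", "-gvn", "-simplifycfg"]
print(first_bad(passes, lambda prefix: "-licm" in prefix))  # prints -licm
```

\fBgit bisect run\fP automates exactly this search over commits, with the reproducer script playing the role of the predicate.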
The bisect process can also be automated as explained here: \X'tty: link https://www.metaltoad.com/blog/mechanizing-git-bisect-bug-hunting-lazy'\fI\%https://www.metaltoad.com/blog/mechanizing\-git\-bisect\-bug\-hunting\-lazy\fP\X'tty: link'\&. An LLVM\-specific run script is available here: \X'tty: link https://gist.github.com/dcci/891cd98d80b1b95352a407d80914f7cf'\fI\%https://gist.github.com/dcci/891cd98d80b1b95352a407d80914f7cf\fP\X'tty: link'\&. .SH PERFORMANCE .SS High\-Performance Generalized Matrix Multiplication .sp Polly automatically detects and optimizes generalized matrix multiplication, the computation C ← α ⊗ C ⊕ β ⊗ A ⊗ B, where A, B, and C are three appropriately sized matrices, the ⊕ and ⊗ operations originate from the corresponding matrix semiring, and α and β are constants with β not equal to zero. This allows Polly to generate a highly optimized form, structured similarly to the expert implementations of GEMM found in GotoBLAS and its successors. The performance evaluation of GEMM is shown in the following figure. .INDENT 0.0 .INDENT 3.5 [image] .UNINDENT .UNINDENT .SS Compile Time Impact of Polly .sp Clang+LLVM+Polly are compiled using Clang on an Intel(R) Core(TM) i7\-7700 based system. The build is performed in two configurations, with and without Polly enabled, in order to measure its compile\-time impact. .sp The following versions are used: .INDENT 0.0 .IP \(bu 2 Polly (git hash 0db98a4837b6f233063307bb9184374175401922) .IP \(bu 2 Clang (git hash 3e1d04a92b51ed36163995c96c31a0e4bbb1561d) .IP \(bu 2 LLVM (git hash 0265ec7ebad69a47f5c899d95295b5eb41aba68e) .UNINDENT .sp \X'tty: link https://ninja-build.org/'\fI\%ninja\fP\X'tty: link' is used as the build system. .sp For both configurations the whole compilation was performed five times. The compile times in seconds are shown in the following table. .TS box center; l|l.
T{ Polly Disabled T} T{ Polly Enabled T} _ T{ 964 T} T{ 977 T} _ T{ 964 T} T{ 980 T} _ T{ 967 T} T{ 981 T} _ T{ 967 T} T{ 981 T} _ T{ 968 T} T{ 982 T} .TE .sp The median compile time without Polly enabled is 967 seconds and with Polly enabled it is 981 seconds. The overhead is 1.4%. .INDENT 0.0 .IP \(bu 2 \X'tty: link http://polly.llvm.org/documentation/passes.html'\fI\%A list of Polly passes\fP\X'tty: link' .UNINDENT .SH AUTHOR unknown .SH COPYRIGHT 2010-2024, The Polly Team .\" Generated by docutils manpage writer. .