Loop Nest Splitting for WCET Optimization

Especially loops and if-statements are an inherent source of unpredictability and loss of precision for WCET analysis. This is caused by the difficulty to obtain safe and tight worst-case estimates of an application's high-level control flow. In addition, assembly-level control flow redirections reduce predictability even more due to complex processor pipelines and branch prediction units.

Loop Nest Splitting on the one hand achieves a significantly more homogeneous control flow structure. On the other hand, the highly precise analyses of loop nest splitting enable to generate very accurate high-level flow facts for loops and if-statements, having a very beneficial impact on WCET analysis.

Flow Fact Generation during Loop Nest Splitting

Many embedded real-time multimedia applications are data-dominated using large amounts of data memory. Typically, they consist of deeply nested for-loops. The main algorithm is usually located in the innermost loop. Often, such an algorithm treats particular parts of its data specifically, e.g., an image border requires other manipulations than its center. This boundary checking is implemented using if-statements in the innermost loop. A typical code fragment (an MPEG4 full search motion estimation routine) is depicted below:

  for ( x = 0; x < 36; x++ ) { x1 = 4 * x;
    for ( y = 0; y < 49; y++ ) { y1 = 4 * y;                /* y-loop */
      for ( k = 0; k < 9; k++ ) { x2 = x1 + k - 4;
        for ( l = 0; l < 9; l++ ) { y2 = y1 + l - 4;
          for ( i = 0; i < 4; i++ ) { x3 = x1 + i; x4 = x2 + i;
            for ( j = 0; j < 4; j++ ) { y3 = y1 + j; y4 = y2 + j;
              if ( x3 < 0 || 35 < x3 || y3 < 0 || 48 < y3 )
                then_block_1; else else_block_1;
              if ( x4 < 0 || 35 < x4 || y4 < 0 || 48 < y4 )
                then_block_2; else else_block_2; }}}}}}

This code has several properties making it sub-optimal w.r.t. worst- and average-case execution time (ACET). First, the if-statements lead to a very irregular control flow. Any jump instruction in a machine program causes a control hazard for pipelined processors. This means that the pipeline needs to be stalled for some instruction cycles, so as to prevent the execution of incorrectly prefetched instructions. WCET analysis needs to estimate whether a jump is taken or not. The worst-case influence of this decision on pipeline and branch prediction needs to be taken into account. Since it is difficult to predict these control flow modifications accurately, resulting WCETs tend to become imprecise the more irregular the control flow is.

In addition, the way how conditions of an if-statement are expressed also has a negative impact on WCET. If conditions are connected using the logical and / or operators of ANSI-C, they are evaluated lazily. For example, expression e2 of the condition e1 || e2 is not evaluated if e1 already evaluates to true. Hence, each occurrence of || and && implies hidden control flow modifications with a negative influence on WCET. This unpredictability due to if-statements becomes even more severe if the if-statements are located in deeply nested loops as depicted above. Here, WCET analysis multiplies the overestimated data of the if-statements with the possibly also overestimated number of loop iterations, leading to even more imprecise WCET estimates.

Considering the code shown above, loop nest splitting is able to detect that

  • the conditions x3 < 0 and y3 < 0 are never true, and
  • both if-statements are true for x ≥ 10 or y ≥ 14.

Using this information, the original loop nest above can be rewritten to minimize the total number of executed if-statements as depicted below. Here, a new if-statement (the splitting-if) is inserted in the y-loop testing the condition x >= 10 || y >= 14. The else-part of this new if-statement is an exact copy of the body of the original y-loop. Since all if-statements are fulfilled when the splitting-if is true, the then-part consists of the body of the y-loop without any if-statements and associated else-blocks.

  for ( x = 0; x < 36; x++ ) { x1 = 4 * x;
    for ( y = 0; y < 49; y++ )
      if ( x >= 10 || y >= 14 )                       /* Splitting-If */
        for ( ; y < 49; y++ )
          for ( k = 0; k < 9; k++ )
            ...                                       /* l- & i-loop omitted */
              for ( j = 0; j < 4; j++ ) {
                then_block_1; then_block_2; }
      else { y1 = 4 * y;
        for ( k = 0; k < 9; k++ ) { x2 = x1 + k - 4;
          ...                                         /* l- & i-loop omitted */
            for ( j = 0; j < 4; j++ ) { y3 = y1 + j; y4 = y2 + j;
              if ( 0 || 35 < x3 || 0 || 48 < y3 )
                then_block_1; else else_block_1;
              if ( x4 < 0 || 35 < x4 || y4 < 0 || 48 < y4 )
                then_block_2; else else_block_2; }}}}}}

This example shows that loop nest splitting is able to generate a very homogeneous control flow in the hot-spots on an application. During loop nest splitting, polytopes model the loop currently under analysis by defining affine constraints for the loops' lower and upper bounds and for all surrounding loops. In such a polytope, each integral point represents one execution of the loop body for one actual assignment of values to the loops' index variables. Counting the number of integral points of these polytopes leads to the exact number of executions of the loop body. For this purpose, so-called Ehrhart polynomials are applied to the polytopes.

The number of points returned by the Ehrhart polynomials is used to annotate the source code after loop nest splitting with flow facts for WCET analysis, which exactly specify the executions of a loop compared to code lying outside the outermost loop of a loop nest.

In addition, precise annotations for the splitting-if created by loop nest splitting are provided to the WCET analyzer. These flow facts also rely on the application of Ehrhart polynomials to the polytope modeling the conditions of the splitting-if.

Results

The above figure shows the effect of loop nest splitting on the benchmark's WCETs and ACETs for both the ARM and THUMB ISAs. It shows the values for the optimized benchmarks as a percentage of the unoptimized versions denoted as 100%.

Loop nest splitting reduces both ACET and WCET significantly. Concerning ACET, improvements of 6.4% (QSDPCM) - 54.8% (ME) were measured for the ARM ISA. Similarly, ACET is reduced by 11.5% (QSDPCM) - 59.4% (ME) for the THUMB ISA. On average for all benchmarks, ACET is reduced by 25.0% (ARM) - 30.1% (THUMB). This clearly shows that generating homogeneous control flow in loop nests increases average-case performance since a large amount of code found in innermost loops before our optimization is eliminated.

The figure also shows that loop nest splitting reduces WCETs by a similar order of magnitude. Here, gains reach from 4.4% (QSDPCM) - 86.5% (ME) for the ARM ISA. For the THUMB ISA, WCET reductions by 9.6% (QSDPCM) - 89.0% (ME) were reported by aiT. On average over all benchmarks, the WCET reductions after loop nest splitting are significantly larger than the corresponding ACET reductions. In terms of WCET, average improvements by 34.0% (ARM) - 36.3% (THUMB) were returned by the WCET analyzer.