WCET-Aware Procedure Cloning
Procedure Cloning (also known as Function Specialization) is a well-known compiler optimization exploiting functions that are often called with constant values as arguments. If the caller invokes a callee with some constant arguments, the callee can be cloned and the constant parameters can be removed from the list of parameters and instead be imported into the callee itself. This page describes the impact of procedure cloning on the WCETs of embedded real-time applications.
Impact of Procedure Cloning on WCET Estimation
The C source codes of typical embedded real-time applications often contain loops whose number of iterations depends on a parameter p of the function f containing this loop. In addition, such a function f can be called from various places within the code of the entire application with different values for p. These program structures have a very negative influence on the WCET computed by a timing analyzer.
This is due to the flow fact specification of such a data-dependent loop. Since the loop's function f is called from many places with possibly different arguments, the effective number of loop iterations within f can vary considerably, depending on the context of how and with which parameters f is called. However, the flow facts for such a loop must cover all these different contexts in which the loop may be executed to result in safe WCET estimates. Hence, the lower bound of such flow facts must represent the global minimum of iterations executed by such a loop over all contexts in which f is called, equivalently the same holds for the upper bound. Since such flow facts for data-dependent loops do not consider possible different execution contexts of a function f, the flow facts are safe but lead to a highly overestimated WCET.
WCET-aware procedure cloning is performed within the WCC compiler if a caller invokes a callee f containing data-dependent loops, and if the iterations of such loops depend on a parameter p being a constant in the function call. Cloning results in a specialized version f' of f now having constant loop bounds with respect to p. The data-dependence of the number of loop iterations is resolved by cloning so that highly precise flow facts for such data-dependent loops in f' can be provided. Hence, procedure cloning can be seen as a way to express different calling contexts at the source-code level, thus enabling a high-precision WCET analysis of these specialized functions.
Consider the following ANSI-C code snippet before procedure cloning:
int f( int *x, int n, int p ) {
_Pragma( "loopbound min 2 max 2000" )
for ( i = 0; i < n; i++ ) {
x[i] = p * x[i];
if ( i == 10 ) { ... }
}
return x[n];
}
int main() {
... f( y, 2000, 5 ) ...
... f( z, 2, 5 ) ...
return f( a, 2, 5 );
}
As can be seen, f contains a loop depending on f's parameter n. Additionally, f is called three times within main, two times with n equal to 2 and once with n equal to 2000. In order to obtain safe WCET estimates for this code snippet, the data-dependent loop in f must be annotated with a minimum of 2 iterations and a maximum of 2000 iterations, obviously. Since this single loop is annotated now with a maximum iteration count of 2000, a static timing analyzer has to assume 2000 iterations in the worst case for each and every call of f, even for those ones with n equal to two, thus resulting in huge overestimations.
WCET-aware procedure cloning transforms the above code snippet into the following code:
int f( int *x, int n, int p ) {
_Pragma( "loopbound min 2000 max 2000" )
for ( i = 0; i < n; i++ ) {
x[i] = p * x[i];
if ( i == 10 ) { ... }
}
return x[n];
}
int f_2_5( int *x ) {
_Pragma( "loopbound min 2 max 2" )
for ( i = 0; i < 2; i++ ) {
x[i] = 5 * x[i];
if ( i == 10 ) { ... }
}
return x[2];
}
int main() {
... f( y, 2000, 5 ) ...
... f_2_5( z ) ...
return f_2_5( a );
}
After procedure cloning, a specialized version f_2_5 of f for the values 2 and 5 for the parameters n and p is created. The loop inside f_2_5 is no longer data-dependent. As a consequence, precise flow fact annotations specifying that the loop iterates exactly twice are created. This way, procedure cloning within WCC helps to produce high-quality flow facts for static WCET analysis and thus improving the tightness of computed WCET estimates significantly.
Procedure cloning as performed within the WCC compiler is WCET-aware since it is applied only to those functions that enable a more precise WCET analysis of loops. This way, the code size increase resulting from the generated function clones is kept small while achieving a maximal WCET reduction.
Results
The above figure shows the impact of procedure cloning on the WCET estimates of three complex real-life benchmarks, with 100% corresponding to the WCET estimation of the original code without procedure cloning applied. Results are provided for three different instruction set architectures (ISAs): the TriCore ISA, the ARM7 ARM instruction set, and the ARM7 THUMB ISA.
The WCET for the EPIC benchmark decreased by 94.6% for the TriCore processor. Similarly remarkable improvements were achieved for the ARM processor, namely 95.7% for the ARM mode and 95.7% for the THUMB mode. This is due to the code structure containing a large number of nested loops. EPIC contains a filter function consisting of 32 partially data-dependent loops that is highly appropriate for cloning. Due to this code structure, cloning has a dramatic impact on WCET estimates since exact flow facts for all loops of this filter function representing the hot spot of EPIC are provided for static WCET analysis.
The second benchmark MPEG2 contains two functions that were optimized by procedure cloning. The first function implements the Full Search algorithm to detect the motion of macro-blocks. Within this function, another procedure is called computing the distance between these blocks. Cloning results in a transformed code that has a dedicated version of the Full Search implementation for each block size. The loop bounds in the nested function can again be defined more precisely. For the TriCore ISA, the WCET after procedure cloning is reduced to 70.1% compared to the unoptimized code. Similar improvements were gained for the ARM processor: the worst-case execution time was reduced to 66.8% and 66.6% for the ARM and THUMB modes, respectively.
The GSM benchmark contains one function implementing the short term residual signal filter. This function is called with strongly varying constants defining the number of iterations for its loop. Due to the code's improved analyzability after procedure cloning, the loops can be annotated with exact flow facts as a consequence. This has a positive effect on the estimated WCETs. Reductions from 12.7% (ARM mode) up to 31% (THUMB mode) compared to the WCET of the non-optimized code were achieved.