Consider the following code:
void copy( int const* restrict in, int* restrict out, int count )
{
    /* count is the number of four-int groups to copy, and must be > 0 */
    do
    {
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
    }
    while ( --count );
}
A compiler for a RISC machine could output code equivalent to this:
void copy( int const* restrict in, int* restrict out, int count )
{
    do
    {
        int t0 = in [0];
        int t1 = in [1];
        int t2 = in [2];
        int t3 = in [3];
        in += 4;
        out [0] = t0;
        out [1] = t1;
        out [2] = t2;
        out [3] = t3;
        out += 4;
    }
    while ( --count );
}
The compiler groups the loads together because loaded data is often not available until several clock cycles after the load issues; scheduling the loads back to back lets their latencies overlap. The pointer updates are likewise deferred until after the loads, which further improves performance on some architectures for the same reason. Duff's device jumps to intermediate points inside the loop body, and those extra entry points prevent the compiler from performing this kind of instruction reordering.
The 'restrict' qualifier tells the compiler that the source and destination regions don't overlap, which is what allows it to reorder the loads and stores ('restrict' was standardized in C99; many pre-C99 and C++ compilers offer __restrict or __restrict__ as extensions). Without it, the compiler couldn't legally change the order, because a later load could depend on an earlier store (when out == in + 1, for example).
However, all of the above is only meant to give insight into the issue; the only thing that really matters is how the code performs in actual use, and that must be determined by timing it as a black box.