Duff's Device
abcminiuser:
Hi guys,
I'm new to Rockbox development. I'm used to embedded C work, developing for the AVR platform. However, all my projects have lacked the enormous scope of Rockbox, and I'm unfamiliar with the macros, APIs, etc. used in the project.
However, I started poking through some of the files and found the memcpy.c routines in the \common\ directory.
The aligned copying section could be easily optimized down to a simple Duff's Device unrolled loop, reducing code size (I believe) and perhaps even speeding up the copy slightly. For comparison:
--- Code: ---
if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
{
    aligned_dst = (long*)dst;
    aligned_src = (long*)src;

    /* Copy 4X long words at a time if possible. */
    while (len >= BIGBLOCKSIZE)
    {
        *aligned_dst++ = *aligned_src++;
        *aligned_dst++ = *aligned_src++;
        *aligned_dst++ = *aligned_src++;
        *aligned_dst++ = *aligned_src++;
        len -= (unsigned int)BIGBLOCKSIZE;
    }

    /* Copy one long word at a time if possible. */
    while (len >= LITTLEBLOCKSIZE)
    {
        *aligned_dst++ = *aligned_src++;
        len -= LITTLEBLOCKSIZE;
    }

    /* Pick up any residual with a byte copier. */
    dst = (char*)aligned_dst;
    src = (char*)aligned_src;
}
--- End code ---
Becomes:
--- Code: ---
if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
{
    aligned_dst = (long*)dst;
    aligned_src = (long*)src;

    /* "Duff's device" block copy method. len is a byte count, so work in
       whole long words; TOO_SMALL() is assumed to guarantee words > 0. */
    unsigned int words = len / LITTLEBLOCKSIZE;
    unsigned int lenblocks = (words + 3) / 4; /* round up for the partial pass */
    len -= words * LITTLEBLOCKSIZE;           /* bytes left for the byte copier */
    switch (words % 4)
    {
        case 0: do { *aligned_dst++ = *aligned_src++;
        case 3:      *aligned_dst++ = *aligned_src++;
        case 2:      *aligned_dst++ = *aligned_src++;
        case 1:      *aligned_dst++ = *aligned_src++;
                } while (--lenblocks);
    }

    /* Pick up any residual with a byte copier. */
    dst = (char*)aligned_dst;
    src = (char*)aligned_src;
}
--- End code ---
I have yet to try this, but is there some obvious reason against it that I'm missing? Again, I'm completely new to Rockbox development, and I've never worked on a cross-platform project before.
Can someone shed some light on this please?
Cheers!
- Dean
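For anyone who wants to try the idea outside the Rockbox tree, here is a minimal standalone sketch of a word-wise Duff's device copy. The function name duff_copy_words is hypothetical and not part of Rockbox; the count is given in long words rather than bytes, so the switch selects on words % 4 and the number of passes is rounded up to cover the partial first pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Rockbox code): copy `words` long words
   from src to dst using Duff's device, unrolled four times. */
static void duff_copy_words(long *dst, const long *src, size_t words)
{
    size_t passes;

    if (words == 0)
        return; /* the do/while below would otherwise run once */

    assert(dst != NULL && src != NULL);

    passes = (words + 3) / 4; /* round up: the first pass may be partial */
    switch (words % 4)
    {
        case 0: do { *dst++ = *src++;
        case 3:      *dst++ = *src++;
        case 2:      *dst++ = *src++;
        case 1:      *dst++ = *src++;
                } while (--passes);
    }
}
```

With words = 5, for example, the switch jumps to case 1, copies one word, and the loop then runs one full pass of four.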
dan_a:
Hi Dean,
The IRC channel is the best place to discuss in-depth technical topics like this. We have optimised versions of memcpy in assembler for some targets, but this might be helpful. I'll test it at some point.
blargg:
Consider the following code:
--- Code: ---
void copy( int const* restrict in, int* restrict out, int count )
{
    do
    {
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
    }
    while ( --count );
}
--- End code ---
A compiler for a RISC machine could output code equivalent to this:
--- Code: ---
void copy( int const* restrict in, int* restrict out, int count )
{
    do
    {
        int t0 = in [0];
        int t1 = in [1];
        int t2 = in [2];
        int t3 = in [3];
        in += 4;
        out [0] = t0;
        out [1] = t1;
        out [2] = t2;
        out [3] = t3;
        out += 4;
    }
    while ( --count );
}
--- End code ---
The compiler moves the loads together because the data is often not available until a few clocks later. The update of the pointers is also deferred above, further increasing performance on some architectures for the same reason as the moved loads. Duff's device introduces jumps to intermediate points in the loop, preventing the above instruction reordering optimizations.
The 'restrict' qualifier tells the compiler that the source and destination regions don't overlap, which is what allows it to reorder the loads and stores (restrict was added in C99; some older compilers accept __restrict or __restrict__ instead). Without it, the compiler couldn't legally change the order, as subsequent loads could depend on previous stores (in the case where out == in + 1, for example).
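To make the overlap case concrete, here is a small sketch with two hand-written versions of a four-element copy. The function names are made up for illustration, and neither uses restrict, so calling them on overlapping regions is well defined. One interleaves loads and stores; the other does all loads first, as in the reordered code above.

```c
/* Loads and stores interleaved, in source order. */
static void copy4_in_order(const int *in, int *out)
{
    out[0] = in[0];
    out[1] = in[1];
    out[2] = in[2];
    out[3] = in[3];
}

/* All loads first, then all stores: the order a compiler may
   only choose when it knows the regions cannot overlap. */
static void copy4_loads_first(const int *in, int *out)
{
    int t0 = in[0];
    int t1 = in[1];
    int t2 = in[2];
    int t3 = in[3];
    out[0] = t0;
    out[1] = t1;
    out[2] = t2;
    out[3] = t3;
}
```

Starting from the array {1, 2, 3, 4, 5} and calling each function with out pointing one element past in, the in-order version leaves {1, 1, 1, 1, 1} because every load sees the previous store, while the loads-first version leaves {1, 1, 2, 3, 4}. The two orders are only interchangeable when the regions don't overlap, which is the guarantee restrict gives the compiler.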
However, all of the above is just to give insight into the issue; the only thing that really matters is how code performs in actual use, which must be determined by timing the code as a black box module.