Rockbox Technical Forums


Author Topic: Duff's Device  (Read 2357 times)

Offline abcminiuser

  • Member
  • *
  • Posts: 3
Duff's Device
« on: December 22, 2006, 09:15:06 AM »
Hi guys,

I'm new to Rockbox development. I'm used to embedded C work on the AVR platform. However, none of my projects has had the enormous scope of Rockbox, and I'm unfamiliar with the macros, APIs, etc. used in the project.

However, I started poking through some of the files and found the memcpy.c routines in the \common\ directory.

The aligned copying section could be easily optimized down to a simple Duff's Device unrolled loop, reducing code size (I believe) and perhaps even speeding up the copy slightly. For comparison:

Code:
if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len -= (unsigned int)BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

Becomes:

Code:
if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

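      /* "Duff's device" block copy method. len counts bytes, so work in
         long words. The TOO_SMALL() check above guarantees at least one
         word (assuming it rejects len < BIGBLOCKSIZE), so lenblocks
         starts above 0. */
      size_t words = len / LITTLEBLOCKSIZE;
      size_t lenblocks = (words + 3) / 4;  /* passes, incl. the partial first one */
      len %= LITTLEBLOCKSIZE;              /* bytes left for the byte copier */
      switch (words % 4)
      {
        case 0: do { *aligned_dst++ = *aligned_src++;
        case 3:      *aligned_dst++ = *aligned_src++;
        case 2:      *aligned_dst++ = *aligned_src++;
        case 1:      *aligned_dst++ = *aligned_src++;
                } while (--lenblocks);
      }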

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

I have yet to try this, but is there some obvious reason against it that I'm missing? Again, I'm completely new to Rockbox development, and I've never worked on a cross-platform project before.

Can someone shed some light on this please?

Cheers!
- Dean

Offline dan_a

  • Developer
  • Member
  • *
  • Posts: 85
  • MD1CLV
Re: Duff's Device
« Reply #1 on: December 22, 2006, 09:57:48 AM »
Hi Dean,
The IRC channel is the best place to discuss in-depth technical things like this.  We have optimised versions of memcpy in assembler for some targets, but this might be helpful.  I'll test it at some point.
iPod 3G
iPod 4G Mono
Sansa E250
Sansa Clip

Offline blargg

  • Member
  • *
  • Posts: 2
Re: Duff's Device
« Reply #2 on: January 22, 2007, 06:35:22 PM »
Consider the following code:

Code:
void copy( int const* restrict in, int* restrict out, int count )
{
    do
    {
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
        *out++ = *in++;
    }
    while ( --count );
}

A compiler for a RISC machine could output code equivalent to this:

Code:
void copy( int const* restrict in, int* restrict out, int count )
{
    do
    {
        int t0 = in [0];
        int t1 = in [1];
        int t2 = in [2];
        int t3 = in [3];
        in += 4;
       
        out [0] = t0;
        out [1] = t1;
        out [2] = t2;
        out [3] = t3;
        out += 4;
    }
    while ( --count );
}

The compiler groups the loads together because loaded data is often not available until a few clocks after the load is issued; issuing all four loads first lets those delays overlap. The pointer updates are also deferred until after the loads and stores, which helps on some architectures for the same reason. Duff's device introduces jumps to intermediate points in the loop, which prevents these instruction reordering optimizations.
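To make the "jumps to intermediate points" concrete, here is a minimal sketch of the same 4x copy written as Duff's device. The name copy_duff and its words parameter are illustrative assumptions, not Rockbox code, and it takes a word count rather than a block count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Rockbox code: copies `words` ints (words > 0)
   using Duff's device with a 4x unrolled body. */
void copy_duff( int const* restrict in, int* restrict out, size_t words )
{
    size_t passes = (words + 3) / 4;   /* loop passes, counting a partial first one */

    assert( words > 0 );               /* the do-while body always runs at least once */

    switch ( words % 4 )
    {
        case 0: do { *out++ = *in++;   /* entry for a full first pass */
        case 3:      *out++ = *in++;   /* entry when 3 words are left over */
        case 2:      *out++ = *in++;   /* entry when 2 words are left over */
        case 1:      *out++ = *in++;   /* entry when 1 word is left over */
                } while ( --passes );
    }
}
```

Each case label is a separate way into the middle of the loop body, so the four copies no longer form one straight-line block that is always entered at the top.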

The 'restrict' tells the compiler that the source and destination regions don't overlap, which is what allows it to re-order the loads and stores (restrict was added in C99; some older compilers offer __restrict or __restrict__ instead). Without this, the compiler couldn't legally change the order, as subsequent loads could depend on previous stores (in the case where out = in + 1, for example).
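The out = in + 1 case can be checked directly. The sketch below uses two hypothetical helpers (copy4_interleaved and copy4_loads_first, named here for illustration) that deliberately omit restrict, so calling them on overlapping regions is well defined and the two orderings can be compared.

```c
/* Loads and stores interleaved in source order.
   No restrict here, so overlapping regions are allowed. */
void copy4_interleaved( int const* in, int* out )
{
    out [0] = in [0];
    out [1] = in [1];
    out [2] = in [2];
    out [3] = in [3];
}

/* All four loads first, then all four stores, as in the reordered version. */
void copy4_loads_first( int const* in, int* out )
{
    int t0 = in [0];
    int t1 = in [1];
    int t2 = in [2];
    int t3 = in [3];

    out [0] = t0;
    out [1] = t1;
    out [2] = t2;
    out [3] = t3;
}
```

Starting from {1, 2, 3, 4, 5} with out = in + 1, the interleaved version leaves {1, 1, 1, 1, 1} because each load sees the previous store, while the loads-first version leaves {1, 1, 2, 3, 4}. The results differ, so the compiler may only switch between the two orderings when it knows the regions don't overlap.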

However, all of the above is just to give insight into the issue; the only thing that really matters is how code performs in actual use, which must be determined by timing the code as a black box module.
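A rough sketch of such a black-box timing harness follows, assuming a hosted build where the standard clock() is available; on a real target you would more likely read a hardware tick counter. The copy_fn type, copy_simple and time_copy are hypothetical names introduced here for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine under test. */
typedef void (*copy_fn)( int const* in, int* out, size_t words );

/* Plain reference copier, included only so the sketch is complete. */
void copy_simple( int const* in, int* out, size_t words )
{
    while ( words-- )
        *out++ = *in++;
}

/* Returns the CPU seconds spent in `reps` calls of `fn`,
   treating the routine as a black box. */
double time_copy( copy_fn fn, int const* in, int* out, size_t words, int reps )
{
    clock_t start = clock();
    int i;

    for ( i = 0; i < reps; i++ )
        fn( in, out, words );

    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Timing each candidate with the same buffers and repeat count, for example time_copy( copy_simple, src, dst, n, 100000 ), gives figures that can be compared directly. An optimizing compiler may still inline or elide calls it can see through, so it is worth checking the generated code before trusting the numbers.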