ARM asm memcpy, code review requested

TP Diffenbach:

--- Quote from: Bagder on July 03, 2006, 04:58:40 PM ---I would be interested in knowing how your version compares to the one in the Linux kernel...

--- End quote ---

Well, your interest will be fulfilled. For entirely unaligned dst and src, the kernel version kicks my ass (of course, mine doesn't even attempt to do anything but byte copy). For src and dst aligned the same but not word aligned, it's a wash. For word aligned, mine does slightly better for lengths less than about 300 bytes, and the kernel does better for larger lengths.
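
The three alignment cases compared above can be told apart from the low two bits of each pointer. A minimal sketch in C, assuming 32-bit words; classify_copy and the enum names are hypothetical and come from neither Rockbox nor the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the three cases discussed in the post. */
enum copy_case {
    COPY_WORD_ALIGNED,  /* src and dst both sit on a 4-byte boundary */
    COPY_SAME_OFFSET,   /* same misalignment: a short byte prologue aligns both */
    COPY_MIXED_OFFSET   /* different misalignment: no prologue can align both */
};

static enum copy_case classify_copy(const void *dst, const void *src)
{
    uintptr_t d = (uintptr_t)dst & 3u;  /* byte offset within a 32-bit word */
    uintptr_t s = (uintptr_t)src & 3u;

    if (d == 0 && s == 0)
        return COPY_WORD_ALIGNED;
    if (d == s)
        return COPY_SAME_OFFSET;
    return COPY_MIXED_OFFSET;
}
```

Only the last case forces a choice between a plain byte copy and the shift-and-merge word copy that gives the kernel version its lead.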

(The kernel version is copying at most 8 words per instruction, mine at most 4; to get the additional 4 registers to copy to/from, it has to push/pop those registers to/from the stack on function entry/exit. So the kernel version's additional overhead costs it at shorter copy lengths, and pays off at longer lengths. Since I have to save my four registers to the stack anyway, I should have just paid the marginal cost.)
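
The 4-word versus 8-word trade-off can be sketched in C. This shows the loop shape only and is neither of the actual assembly routines: copy_words_by4 and copy_words_by8 are hypothetical names, and a C compiler is not guaranteed to turn the blocks into ldm/stm pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy in blocks of 4 words, then finish the tail one word at a time. */
static void copy_words_by4(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words >= 4) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        src += 4; dst += 4; words -= 4;
    }
    while (words > 0) {
        *dst++ = *src++;
        --words;
    }
}

/* Same idea with 8 words in flight: half as many loop iterations,
   but twice as many temporaries to hold the block. */
static void copy_words_by8(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words >= 8) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint32_t e = src[4], f = src[5], g = src[6], h = src[7];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;
        src += 8; dst += 8; words -= 8;
    }
    while (words > 0) {
        *dst++ = *src++;
        --words;
    }
}
```

In assembly the extra temporaries are real registers that have to be saved and restored, which is the fixed per-call cost described above; it is paid once however long the copy is.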

But my advantage is very slight and only at shorter copy lengths, while the kernel's advantage for differently aligned src and dst is much larger.

Most importantly, for 32-byte-aligned copies, the kernel's speed is indistinguishable from mine. (And the iPod video's LCD width just happens to be 320 pixels.) So clearly, we should go with the kernel's version, and pick up the unaligned advantage.
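
The remark about the 320-pixel width is simple arithmetic: 320 = 10 x 32, so a full row is a whole number of 32-byte blocks at any whole number of bytes per pixel. A small sketch, reading "32 byte aligned" as "length is a multiple of 32 bytes", which is an assumption; row_is_multiple_of_32 is a hypothetical helper:

```c
#include <assert.h>

/* Nonzero if a row of `width` pixels at `bytes_per_pixel` bytes each
   occupies a whole number of 32-byte blocks. */
static int row_is_multiple_of_32(unsigned width, unsigned bytes_per_pixel)
{
    return (width * bytes_per_pixel) % 32u == 0;
}
```

A 320-pixel row passes at 1, 2 or 4 bytes per pixel, whereas a 176-pixel row at 1 byte per pixel does not (176 = 5 x 32 + 16).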

Here's a picture:


Looking closely at the graph and its rhythms/patterns, you can discern the outline of the algorithms and techniques being used.

saratoga:
How are you generating the timing data?  Do we have timers in Rockbox for this sort of thing or are you using an emulator?

TP Diffenbach:

--- Quote from: saratoga on July 04, 2006, 03:54:58 PM ---How are you generating the timing data?  Do we have timers in Rockbox for this sort of thing or are you using an emulator?

--- End quote ---

I'm using the microsecond timer on the iPod, calling this from logfdump:
static void memcpy_metrics( void ) {
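    unsigned long src[ 5000 ] ;   /* source buffer; copies also land in here, one word in */
    unsigned long dst[ 10 ] ;     /* unused: the destination is src + 1 */

    /* byte offsets for src (jn) and dst (kn), paired by index j:
       both aligned, dst off by one, both off by one */
    const int jn[] = { 0, 0, 1 } ;
    const int kn[] = { 0, 1, 1 } ;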

    int i = 0 ;
   
    for( ; i < 4097 ; ++i )
        src[ i ] = 0x01234567 + i ;
       
    for( i = 0 ; i < 4097 ; i+=8 ) {
        int j = 0 ;
        for( ; j < 1 ; ++j ) {
            unsigned char* sc = (unsigned char*) src ;
            sc += jn[ j ] ;
            int k = 0 ;
            for( ; k < 1 ; ++k ) {
                unsigned char* dc = (unsigned char*) ( src + 1 ) ;
                dc += kn[ j ] ;
                int m ;
                register long s = USEC_TIMER ;
                for( m = 0 ; m < 100 ; ++m )
                    memcpy( dc, sc, i ) ;
                register long e = USEC_TIMER ;
                for( m = 0 ; m < 100 ; ++m )
                    memcpy2( dc, sc, i ) ;
                register long e2 = USEC_TIMER ;
               
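                /* dst offset, src offset, length in bytes, then usec for 100 calls of each version */
                logf( "m %d,%d,%d,%d,%d", kn[ j ], jn[ j ], i, (int)( e-s ), (int)( e2-e ) ) ;
                if( 0 && memcmp( dc, sc, i ) ) {
                    logf( "ERROR in memcpy, at %d,%d,%d", i, j, k ) ;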
                    return ;
                }
            }
        }
    }
}


And yes, I know memcpy's behavior is undefined for overlapping ranges; when I went to 4096 words, I didn't want to strain the stack with a second buffer.
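
One way to avoid both the overlap and the stack pressure is to give the source and destination their own static buffers. A minimal sketch, not taken from the Rockbox tree: check_copy, copy_fn, MAX_COPY and the buffer names are hypothetical, and it checks a single copy rather than timing it (on the target, the USEC_TIMER reads would wrap the call to copy).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COPY 4096   /* largest length exercised, in bytes */

/* Static storage keeps the buffers off the stack; the extra 4 bytes
   leave room for byte offsets 0..3 at the start of each buffer. */
static unsigned char src_buf[MAX_COPY + 4];
static unsigned char dst_buf[MAX_COPY + 4];

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Runs `copy` once for the given offsets and length, then verifies it.
   Returns 0 on success, -1 if an argument is out of range or the bytes differ. */
static int check_copy(copy_fn copy, size_t dst_off, size_t src_off, size_t len)
{
    size_t i;

    if (dst_off > 3 || src_off > 3 || len > MAX_COPY)
        return -1;

    for (i = 0; i < sizeof src_buf; ++i) {
        src_buf[i] = (unsigned char)(i * 7u + 1u);  /* recognizable pattern */
        dst_buf[i] = 0;
    }

    copy(dst_buf + dst_off, src_buf + src_off, len);
    return memcmp(dst_buf + dst_off, src_buf + src_off, len) == 0 ? 0 : -1;
}
```

With separate buffers a memcmp check like the one disabled in the harness becomes meaningful, for example check_copy(memcpy, 1, 0, 300) for an aligned source and a destination one byte off.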
