Rockbox Technical Forums

Rockbox Development => Starting Development and Compiling => Topic started by: TP Diffenbach on July 03, 2006, 01:22:06 PM

Title: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 03, 2006, 01:22:06 PM
I've written a version of memcpy in ARM assembly for the ipod builds.

It's NOT based on the version in the linux ARM kernel source (but after writing it, I found it's similar to the kernel's memset, which we do use). It uses load/store multiple, but not preload or bursting or anything cool like that.

As this is my first foray into ARM assembly, I'd like some fresh eyes to give it a code review. Once it's been reviewed, cleaned up, and possibly improved, I'll release it under the GPL.

Attached is the source file and a set of timings (in MS excel format). (You'll have to remove a trailing ".txt" from both, the forum code doesn't allow attachments of "type" .S or .xls.)

In general, it runs in about one microsecond more than half the time the C memcpy takes to run, for word-aligned dst and src.

For non-word aligned dst and srcs, the C version falls back on byte-wise copying. The asm memcpy can do fast copy for unaligned dst and src, so long as dst and src both have the SAME (mis-)alignment. For these cases, the asm memcpy takes about a tenth of the time as the C memcpy, with the ratio improving as more bytes are copied. For differently aligned dst and src, the asm also falls back on byte-wise copying.

(Of course, callers really shouldn't be doing big non-word aligned copies.)

For certain cases, the asm memcpy takes ~ one microsecond longer than the C version, in particular for word-aligned copies of lengths 1, 2, 5, 16, 17, 20, 26, 30, and 40 bytes.


[attachment deleted by admin, too old]
Title: Re: ARM asm memcpy, code review requested
Post by: ryran on July 03, 2006, 02:13:20 PM
Whoa. I wish I had taken up coding when I was younger...
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 03, 2006, 03:58:20 PM
Ok, I've improved it a bit.

Now the C version only does better for word-aligned on 16, 20, or 22 bytes, and on 1 byte memcpy (!) for aligned or not.

The biggest loss is still at 16 bytes word-aligned. 100 iterations of C memcopy take 197 microseconds; the asm version takes 227 microseconds for those 100 iterations.

That comes down to a difference of three-tenths of a microsecond, or 3 ten-millionths of a second per call.


Alignment   
Bytes copied   
Microseconds Elapsed   
 For 100 iterations of C  memcpy   
 For 100 iterations of Asm memcpy
0   16   197   227
0   20   220   247
0   22   287   294
0   01   110   114
1   01   110   114

At word-aligned 52 bytes, we gain a microsecond; at 64 we saved 1.5 microseconds; at 128 almost 4 (3.98) and we're taking only 0.58 the time.

At 256 bytes, we're taking 0.51 the time of the C memcpy. We'll never get better than 0.49 the time of the C version.

But that means we'll spend perhaps 17 <i>milli</i>seconds copying (most of) the screen buffer rather than 34 milliseconds.

Title: Re: ARM asm memcpy, code review requested
Post by: Llorean on July 03, 2006, 04:51:13 PM
Just as a note, I seem to recall someone saying that you *have* to be word aligned on iPods. But then, I could be misremembering.
Title: Re: ARM asm memcpy, code review requested
Post by: Bagder on July 03, 2006, 04:58:40 PM
I would be interested in knowing how your version compares to the one in the Linux kernel...
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 03, 2006, 05:19:44 PM
Just as a note, I seem to recall someone saying that you *have* to be word aligned on iPods. But then, I could be misremembering.

Yes, for the standard Load and Store instructions (and as a RISC architecture, all memory accesses are by loading to, and storing from, a register).

But there are separate instructions for loading/storing bytes, which don't have to be word-aligned. And an instruction for halfwords, but that may only be for the Thumb instruction set (thumb instructions only take up 16 bits per instruction, instead of 32); when I tried those, I got a "data error", so I went back to just using bytes and words.




I would be interested in knowing how your version compares to the one in the Linux kernel...

Yeah. I looked at that, and frankly, I had trouble puzzling it out. It uses a bunch of macros which make the assembly unreadable. It's possible to tell gnu-as to output the macros expanded (though it took a while to discover this), but it leaves the output full of junk. I got tired of messing with it, and I don't like to submit code I don't understand.

In part, writing my own was to gain a better understanding of ARM assembler.

I can tell you that the kernel version loads/stores 8 words at a time (mine does 4 at a time) and directly changes the program counter (which I'm too gutless to do yet).

At some point, I'll put it in to see how it does. I suspect it'll do better than mine (even without the preloading).
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 04, 2006, 04:30:35 AM
Some notes: my asm code tries to be "clever" in a way that the linux kernel memset does not.

In both my memcpy and the kernel's memset, bytes are processed in a loop that dispossesses of 64 bytes at a time.

After that, the count % 64 remaining bytes are processed, using masking.

The kernel memset works like this:
/*
 * No need to correct the count; we're only testing bits from now on
 */
        tst     r2, #8              @ is bit 3 (2^3 = 8 ) set?
        stmnedb r0!, {r1, r3}       @ if so, store two words
        tst     r2, #4              @ is bit 2 (2^2=4) set?
        strne   r1, [r0, #-4]!      @ if so store a word




My code tries to be more "clever":
    ands r3, r2, #12         @ test bits 3 AND 2 at the same time (8|4), save into r3
    beq extra_bytes          @ if both are unset, jump ahead
    cmp r3, #8               @ compare r3 with 8
    ldrne r5, [r1], #4       @ if r3 !=8 then it's 4 or 12, so load and store one word
    strne r5, [r0], #4
    ldmhsia r1!, {r5-r6}     @ if r3 >= 8, it's 8 or 12, so load and store two words
    stmhsia r0!, {r5-r6}   



Often in programming, trying to be excessively "clever" works worse, so I tested a revised version on my memcpy, in which I followed memset's pattern. The result? (As I hoped and expected) my version is significantly faster in most cases, and particularly for good lengths, lengths that are powers of two and more likely to be used.

Why is this? Is the kernel code less efficient that it should be?

No. The kernel code is memset, so it only has to store, not load and store. In ARM assembly, each instruction can be conditional; if the condition is not met, the instruction isn't executed and there's only a one-cycle time cost.

Branching (jumping, a "goto") costs about three cycles, so if you can do something using conditional instructions in three instructions or less, that's as or more efficient than branching. But if you'd need more than three lines of conditionals, it's quicker to branch.

Since memset doesn't need the load lines, each test and store takes two instructions, and branching would be more costly. But since memcpy requires the load as well as the store, each binary digit costs (at least) three instructions.

By preceding every two bit tests with a branch that can skip them, we can skip five instructions for a cost of four cycles ( ands + branch).

For cases where we can't skip, we're paying the same cost: one test per bit, plus whatever the copying costs.
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 04, 2006, 08:56:19 AM
A further improvement: by adding a couple of redundant lines and increasing the time cost for unaligned and non-word aligned copies, I've brought down the cost for aligned copies.

The aligned cost goes down by about 13 microseconds per 100 calls, while the same-align-but-not-word-aligned cost goes up by about 7, and in a few cases as much as 24 microseconds per 100 calls.

For word-aligned copies, the asm version is now slower than the C version only for copy length of 0, 1, 2, 3, 16, and 20; for 16 and 20 the difference is 16 and 14 microseconds per 100 calls. (And copy lengths of 0-3 really shouldn't be happening.)

For all aligned but not word aligned copies longer than length 3, the asm remains significantly faster than the C version, plateauing at 0.09 (about one-tenth of) the C speed.
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 04, 2006, 09:52:57 AM
Timings, for copying 512 bytes.

Column1: dst alignment (word boundary + n bytes)
Column2: src alignment (word boundary + n bytes)
Column3: bytes copied
Column4: microseconds for C
Column5: microseconds for asm


3   3   512   18892   1527
2   3   512   18891   15501
1   3   512   18891   15484
0   3   512   18891   15501
3   2   512   18891   15484
2   2   512   18891   1537
1   2   512   18892   15501
0   2   512   18890   15484
3   1   512   18874   15501
2   1   512   18891   15484
1   1   512   18890   1511
0   1   512   18874   15501
3   0   512   18891   15484
2   0   512   18874   15501
1   0   512   18892   15501
0   0   512   3090    1430


I've also now tested the copy , from 0 to 1299 bytes, with (our standard) memcmp, without any errors. so the asm memcpy seems to work correctly. As it goes over about 512 copy length, the best speed of the asm is about 0.45 of the best C speed.

And as you can see above, non-word aligned but same aligned copies are much faster.

We should still compare this to the kernel's memcpy, which is no doubt faster on large copy lengths. But at this point I'm inclined to release this as a patch, and deal with the kernel memcpy later.
Title: Re: ARM asm memcpy, code review requested
Post by: Bagder on July 04, 2006, 02:03:22 PM
Sounds like a fine approach to me!
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 04, 2006, 02:57:44 PM
I would be interested in knowing how your version compares to the one in the Linux kernel...

Well, your interest will be fulfilled. For entirely unaligned dst and src, the kernel version kicks my ass (of course, mine doesn't even attempt to do anything but byte copy). For src and dst aligned the same but not word aligned, it's a wash. For word aligned, mine does slightly better for lengths less than about 300 bytes, and the kernel does better for larger lengths.

(The kernel version is copying at most 8 words per instruction, mine at most 4; to get the additional 4 registers to copy to/from, it has to push/pop those registers to/from the stack on function entry/exit. So the kernel version's additional overhead costs it at shorter copy lengths, and pays off at longer lengths. Since I have to save my four registers to the stack anyway, I should have just paid the marginal cost.)

But my better is very very slight and only at shorter copy lengths, and the kernel's better is much better for differently aligned src and dst.

Most importantly, for 32 byte aligned copies, the kernel's speed is indistinguishable from mine. (And the ipod video's lcd width just happens to be 320 pixels.) So clearly, we should go with the kernel's version, and pick up the unaligned advantage.

Here's a picture:
(http://img166.imageshack.us/img166/7324/asmmemcpycomparison1hp.th.jpg) (http://img166.imageshack.us/my.php?image=asmmemcpycomparison1hp.jpg)

Looking closely at the graph and its rhythms/patterns, you can discern the outline of the algorithms and techniques being used.
Title: Re: ARM asm memcpy, code review requested
Post by: saratoga on July 04, 2006, 03:54:58 PM
How are you generating the timing data?  Do we have timers in Rockbox for this sort of thing or are you using an emulator?
Title: Re: ARM asm memcpy, code review requested
Post by: TP Diffenbach on July 04, 2006, 04:03:39 PM
How are you generating the timing data?  Do we have timers in Rockbox for this sort of thing or are you using an emulator?

I'm using the microsecond timer on the ipod, calling this from logfdump:
static void memcpy_metrics( void ) {
    unsigned long src[ 5000 ] ;
    unsigned long dst[ 10 ] ;

    const int jn[] = { 0, 0, 1 } ;
    const int kn[] = { 0, 1, 1 } ;

    int i = 0 ;
   
    for( ; i < 4097 ; ++i )
        src[ i ] = 0x01234567 + i ;
       
    for( i = 0 ; i < 4097 ; i+=8 ) {
        int j = 0 ;
        for( ; j < 1 ; ++j ) {
            unsigned char* sc = (unsigned char*) src ;
            sc += jn[ j ] ;
            int k = 0 ;
            for( ; k < 1 ; ++k ) {
                unsigned char* dc = (unsigned char*) ( src + 1 ) ;
                dc += kn[ j ] ;
                int m ;
                register long s = USEC_TIMER ;
                for( m = 0 ; m < 100 ; ++m )
                    memcpy( dc, sc, i ) ;
                register long e = USEC_TIMER ;
                for( m = 0 ; m < 100 ; ++m )
                    memcpy2( dc, sc, i ) ;
                register long e2 = USEC_TIMER ;
               
                logf( "m %d,%d,%d,%d,%d", kn[ j ], jn[ j ], i, e-s, e2-e ) ;
                if( 0 && memcmp( dc, sc, i ) ) {
                    logf( "ERROR in memcpy, at %d,%d,%d,%d", i, j, k ) ;
                    return ;
                }
            }
        }
    }
}


And yes, I know overlapping ranges are undefined for memcpy; when I went to 4096 words, I didn't want to strain the stack.