
ARM asm memcpy, code review requested


TP Diffenbach:

--- Quote from: Llorean on July 03, 2006, 04:51:13 PM ---Just as a note, I seem to recall someone saying that you *have* to be word aligned on iPods. But then, I could be misremembering.

--- End quote ---

Yes, for the standard load and store instructions (and since ARM is a RISC architecture, all memory accesses go through registers: you load into one and store from one).

But there are separate instructions for loading and storing bytes, which don't have to be word-aligned. There are halfword instructions too (in the ARM instruction set from ARMv4 on, not just in Thumb, whose instructions are 16 bits each instead of 32), but halfwords still require halfword alignment; when I tried those, I got a data abort, so I went back to just using bytes and words.
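
For instance (GNU as syntax; the registers here are just illustrative):

    ldr  r3, [r1], #4    @ word load: r1 must be word-aligned
    str  r3, [r0], #4    @ word store: r0 must be word-aligned
    ldrb r3, [r1], #1    @ byte load: any address will do
    strb r3, [r0], #1    @ byte store: any address will do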





--- Quote from: Bagder on July 03, 2006, 04:58:40 PM ---I would be interested in knowing how your version compares to the one in the Linux kernel...
--- End quote ---

Yeah. I looked at that and, frankly, I had trouble puzzling it out. It uses a bunch of macros that make the assembly unreadable. It's possible to tell GNU as to output the macros expanded (though it took a while to discover this), but that leaves the output full of junk. I got tired of messing with it, and I don't like to submit code I don't understand.

In part, writing my own was to gain a better understanding of ARM assembler.

I can tell you that the kernel version loads/stores 8 words at a time (mine does 4 at a time) and directly changes the program counter (which I'm too gutless to do yet).

At some point, I'll put it in to see how it does. I suspect it'll do better than mine (even without the preloading).
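
For the curious, the shape of such an inner loop is roughly this (a sketch, not the kernel's actual code; the real thing also preloads, preserves the callee-saved registers, and returns by loading straight into pc):

copy_loop:
    subs    r2, r2, #32        @ 8 words = 32 bytes per iteration; sets flags
    ldmhsia r1!, {r3-r10}      @ if count was >= 32, load 8 words from src
    stmhsia r0!, {r3-r10}      @ ... and store them to dst
    bhs     copy_loop          @ repeat while 32 or more bytes remain
    add     r2, r2, #32        @ undo the last subtract, leaving the tail count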

TP Diffenbach:
Some notes: my asm code tries to be "clever" in a way that the Linux kernel memset does not.

In both my memcpy and the kernel's memset, the bulk of the bytes are processed in a loop that disposes of 64 bytes at a time.

After that, the count % 64 remaining bytes are handled by testing individual bits of the count.

The kernel memset works like this:
/*
 * No need to correct the count; we're only testing bits from now on
 */
        tst     r2, #8              @ is bit 3 (2^3 = 8) set?
        stmnedb r0!, {r1, r3}       @ if so, store two words
        tst     r2, #4              @ is bit 2 (2^2 = 4) set?
        strne   r1, [r0, #-4]!      @ if so, store a word




My code tries to be more "clever":
    ands r3, r2, #12         @ test bits 3 and 2 at once (8|4), result into r3
    beq extra_bytes          @ if both are clear, jump ahead
    cmp r3, #8               @ compare r3 with 8
    ldrne r5, [r1], #4       @ if r3 != 8, it's 4 or 12, so copy one word
    strne r5, [r0], #4
    ldmhsia r1!, {r5-r6}     @ if r3 >= 8, it's 8 or 12, so copy two words
    stmhsia r0!, {r5-r6}



Often in programming, trying to be excessively "clever" backfires, so I tested a revised version of my memcpy that followed memset's pattern instead. The result? As I hoped and expected, my version is significantly faster in most cases, particularly for "good" lengths: powers of two, which are the lengths most likely to be used in practice.

Why is this? Is the kernel code less efficient than it should be?

No. The kernel code is memset, so it only has to store, not load and store. In ARM assembly, each instruction can be made conditional; if the condition isn't met, the instruction isn't executed and costs only one cycle.

Branching (jumping, a "goto") costs about three cycles, so if you can do something in three or fewer conditional instructions, that's at least as efficient as branching around it. But if you'd need more than three conditional instructions, it's quicker to branch.

Since memset doesn't need the load lines, each bit test plus store takes two instructions, and branching would be more costly. But since memcpy requires a load as well as a store, each bit of the count costs (at least) three instructions.
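
To make that concrete, here's the same one-byte tail handled both ways (a sketch; the label is mine):

    @ conditional form: two extra instructions, no branch
    tst    r2, #1           @ is bit 0 set (one odd byte left)?
    ldrneb r3, [r1], #1     @ if so, load it...
    strneb r3, [r0], #1     @ ...and store it

    @ branching form: the taken beq costs about three cycles
    tst    r2, #1
    beq    byte_done
    ldrb   r3, [r1], #1
    strb   r3, [r0], #1
byte_done: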

By preceding each pair of bit tests with a branch that can skip them, we can skip five instructions at a cost of four cycles (the ands plus the taken branch).

For cases where we can't skip, we pay the same cost as before: one test per bit, plus whatever the copying itself costs.

TP Diffenbach:
A further improvement: by adding a couple of redundant lines, and accepting a higher cost for unaligned and non-word-aligned copies, I've brought down the cost for word-aligned copies.

The word-aligned cost goes down by about 13 microseconds per 100 calls, while the same-alignment-but-not-word-aligned cost goes up by about 7 microseconds per 100 calls, and in a few cases by as much as 24.

For word-aligned copies, the asm version is now slower than the C version only for copy lengths of 0, 1, 2, 3, 16, and 20; for 16 and 20 the differences are 16 and 14 microseconds per 100 calls. (And copy lengths of 0-3 really shouldn't be happening anyway.)

For all same-aligned but not word-aligned copies longer than 3 bytes, the asm remains significantly faster than the C version, plateauing at about 0.09 of (roughly a tenth of) the C version's time.
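
The kind of change that buys this trade-off is an up-front fast-path test, something like this sketch (the label is invented, and this isn't the literal patch):

    orr  r3, r0, r1         @ fold together the low bits of dst and src
    tst  r3, #3             @ are both pointers word-aligned?
    beq  word_aligned_copy  @ if so, skip the alignment preamble entirely

The word-aligned path skips the whole preamble; every other path pays for the extra test, which is where those extra microseconds go.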

TP Diffenbach:
Timings, for copying 512 bytes.

Alignments are offsets from a word boundary (word boundary + n bytes); times are in microseconds.

dst+n   src+n   bytes     C      asm
  3       3      512    18892    1527
  2       3      512    18891   15501
  1       3      512    18891   15484
  0       3      512    18891   15501
  3       2      512    18891   15484
  2       2      512    18891    1537
  1       2      512    18892   15501
  0       2      512    18890   15484
  3       1      512    18874   15501
  2       1      512    18891   15484
  1       1      512    18890    1511
  0       1      512    18874   15501
  3       0      512    18891   15484
  2       0      512    18874   15501
  1       0      512    18892   15501
  0       0      512     3090    1430


I've also now tested the copy, at lengths from 0 to 1299 bytes, verifying each result with (our standard) memcmp, without any errors; so the asm memcpy seems to work correctly. Once the copy length goes over about 512 bytes, the best asm time is about 0.45 of the best C time.

And as you can see above, copies that are non-word-aligned but share the same alignment are much faster.

We should still compare this to the kernel's memcpy, which is no doubt faster on large copy lengths. But at this point I'm inclined to release this as a patch, and deal with the kernel memcpy later.

Bagder:
Sounds like a fine approach to me!
