ARM asm memcpy, code review requested

TP Diffenbach:
I've written a version of memcpy in ARM assembly for the ipod builds.

It's NOT based on the version in the linux ARM kernel source (but after writing it, I found it's similar to the kernel's memset, which we do use). It uses load/store multiple, but not preload or bursting or anything cool like that.

As this is my first foray into ARM assembly, I'd like some fresh eyes to give it a code review. Once it's been reviewed, cleaned up, and possibly improved, I'll release it under the GPL.

Attached are the source file and a set of timings (in MS Excel format). (You'll have to remove the trailing ".txt" from both; the forum software doesn't allow attachments of "type" .S or .xls.)

In general, for word-aligned dst and src, it takes about half the time the C memcpy takes, plus roughly one microsecond.

For non-word-aligned dst and src, the C version falls back on byte-wise copying. The asm memcpy can do a fast copy for unaligned dst and src, so long as dst and src both have the SAME (mis-)alignment. For these cases, the asm memcpy takes about a tenth of the time the C memcpy takes, with the ratio improving as more bytes are copied. For differently aligned dst and src, the asm version also falls back on byte-wise copying.
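A minimal C sketch of the alignment logic described above may make it easier to review. This is not the attached assembly: the name `sketch_memcpy` is hypothetical, a 4-byte word is assumed, and the word loop stands in for the load/store multiple instructions the asm version uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the strategy, not the attached .S file. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path only when dst and src share the same offset within a word. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3u) == 0) {
        /* Copy leading bytes until both pointers reach a word boundary. */
        while (n > 0 && ((uintptr_t)d & 3u) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Copy whole words. A fixed 4-byte memcpy is the portable way to
           express a word load/store in C; compilers typically lower it to
           a single load and store. The asm would use ldm/stm here. */
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Trailing bytes, or the whole copy when the alignments differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

With `dst + 1` and `src + 1` taken from word-aligned buffers, the sketch copies three bytes, then whole words, then the tail. With `dst + 2` and `src + 1` it goes straight to the byte loop.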

(Of course, callers really shouldn't be doing big non-word aligned copies.)

For certain cases, the asm memcpy takes about one microsecond longer than the C version, in particular for word-aligned copies of lengths 1, 2, 5, 16, 17, 20, 26, 30, and 40 bytes.

[attachment deleted by admin, too old]

Whoa. I wish I had taken up coding when I was younger...

TP Diffenbach:
Ok, I've improved it a bit.

Now the C version is only faster for word-aligned copies of 16, 20, or 22 bytes, and for 1-byte copies (!), aligned or not.

The biggest loss is still at 16 bytes word-aligned. 100 iterations of the C memcpy take 197 microseconds; the asm version takes 227 microseconds for those 100 iterations.

That comes down to a difference of three-tenths of a microsecond, or 3 ten-millionths of a second per call.
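The per-call figure is just the difference of the two 100-iteration totals divided by the iteration count. A small sketch of that arithmetic (the helper name `per_call_diff_us` is hypothetical):

```c
/* Per-call time difference in microseconds, given the total elapsed
   microseconds for a batch of calls to each version. A positive result
   means the asm version is slower than the C version. */
double per_call_diff_us(double c_total_us, double asm_total_us, int iterations)
{
    return (asm_total_us - c_total_us) / iterations;
}
```

For the 16-byte case, `per_call_diff_us(197, 227, 100)` works out to (227 - 197) / 100 = 0.3 microseconds per call.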

Microseconds elapsed for 100 iterations:

Alignment offset   Bytes copied   C memcpy   Asm memcpy
0                  16             197        227
0                  20             220        247
0                  22             287        294
0                  1              110        114
1                  1              110        114

At word-aligned 52 bytes, we gain a microsecond; at 64 we save 1.5 microseconds; at 128 we save almost 4 (3.98) and take only 0.58 the time.

At 256 bytes, we're taking 0.51 the time of the C memcpy. We'll never get better than 0.49 the time of the C version.

But that means we'll spend perhaps 17 *milli*seconds copying (most of) the screen buffer rather than 34 milliseconds.

Just as a note, I seem to recall someone saying that you *have* to be word aligned on iPods. But then, I could be misremembering.

I would be interested in knowing how your version compares to the one in the Linux kernel...

