I had already achieved a memcpy that is faster than the uclinux memcpy for copy lengths greater than about 56 bytes (my version 8). Now I have achieved a memcpy (my version 14) that's as fast as or faster than uclinux's for all word-aligned copy lengths, and for all copy lengths with same-aligned dst and src, except copy lengths between 4 and 8 bytes (and even then, the difference is less than a millionth of a second per call). ("As fast or faster": I'm ignoring a few copy lengths where uclinux is a few microseconds per 100 calls faster. See if you can find them on the graphs.)
This is achieved by eschewing some of the more clever code in favor of a brute-force approach, and by having duplicate code for copying 16 and 32 bytes, selected by whether the total copy length is less than 48 bytes or not. Longer copies pay a set-up overhead that is amortized over the whole copy, which pays off as a shallower slope on the graphs; shorter copies pay no overhead at all (there, the overhead would swamp the savings from it).
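To make the duplicated-code idea concrete, here is a minimal sketch in C. It is not my version 14, just an illustration of the shape: the name sketch_memcpy_aligned and the COPY16 macro are made up for this example, and it assumes both buffers are word-aligned and can be accessed as 32-bit words. The 48-byte threshold and the 16- and 32-byte block sizes come from the description above; everything else is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy 4 words (16 bytes) and advance both pointers. */
#define COPY16(d, s) do { \
    (d)[0] = (s)[0]; (d)[1] = (s)[1]; \
    (d)[2] = (s)[2]; (d)[3] = (s)[3]; \
    (d) += 4; (s) += 4; } while (0)

/* Hypothetical helper: copies n bytes between word-aligned buffers.
   Assumes dst and src do not overlap. */
static void *sketch_memcpy_aligned(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;

    if (n >= 48) {
        /* Long path: the loop set-up cost is amortized over 32-byte blocks. */
        size_t blocks = n / 32;
        n %= 32;
        while (blocks--) {
            COPY16(d, s);
            COPY16(d, s);
        }
        if (n >= 16) { COPY16(d, s); n -= 16; }
    } else {
        /* Short path: duplicated straight-line copies, no loop set-up. */
        if (n >= 32) { COPY16(d, s); COPY16(d, s); n -= 32; }
        if (n >= 16) { COPY16(d, s); n -= 16; }
    }

    /* Tail: remaining whole words, then any remaining bytes. */
    while (n >= 4) { *d++ = *s++; n -= 4; }
    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    while (n--) *db++ = *sb++;

    return dst;
}
```

The 16- and 32-byte copies appear twice on purpose: the short path never touches the loop counter, so small copies skip that cost entirely, while the long path spends a few instructions up front to run a tight block loop.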
For mis-aligned copy lengths, my memcpy is essentially the same speed as uclinux's (faster or slower depending on copy length) for copy lengths under about 64 bytes, and faster for longer copy lengths.
My memcpy also bests the linux kernel's memcpy, and the C version we currently use in rockbox for ipod.
Take a look at more pretty pictures here: http://diffenbach.org/rockbox/memcpy/memcpy.comparison.ckue.0.256.html