Hello,
After reading MAP's article about fast copies using an unrolled ldi loop, and its example that relies on self-modifying code, I decided to make a version for the MSXgl C library that also works from ROM.
Here is the function:
void Mem_FastCopy(const void* src, void* dest, u16 size)
{
    src;  // HL
    dest; // DE
    size; // SP+2

    __asm
        // Get parameters
        ld      iy, #2
        add     iy, sp
        ld      c, 0(iy)
        ld      b, 1(iy)
        // Handle size that is not multiples of 16
        ld      a, c                    // 5 cc
        and     #0x0F                   // 8 cc
        jp      z, mem_fastcopy_loop    // 11 cc - total 24 cc (break-even at 6 loops)
        neg                             // 10 cc
        add     #16                     // 8 cc
        add     a                       // 5 cc
        exx                             // 5 cc
        ld      b, #0                   // 8 cc
        ld      c, a                    // 5 cc
        ld      hl, #mem_fastcopy_loop  // 11 cc
        add     hl, bc                  // 12 cc
        push    hl                      // 12 cc
        exx                             // 5 cc
        pop     iy                      // 16 cc
        jp      (iy)                    // 10 cc - total 131 cc (break-even at 31 loops)
        // Fast LDIR (with 16x unrolled LDI)
    mem_fastcopy_loop:
        .rept 16
        ldi                             // 18 cc
        .endm
        jp      pe, mem_fastcopy_loop   // 11 cc (0.6875 cc per ldi)
    __endasm;
}
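To make the entry-point arithmetic easier to follow, here is a rough portable C model of what the assembly above does. It is only a sketch for reasoning on a PC, not MSXgl code: the names entry_offset and fastcopy_model are made up for this illustration. It assumes each ldi opcode is 2 bytes, which is why the assembly doubles the value with "add a". As with ldir, a size of 0 is not a valid input (BC would wrap around and 64 KB would be copied).

```c
#include <stdint.h>

/* Byte offset into the 16x LDI block where execution starts.
   Hypothetical helper mirroring: and #0x0F / neg / add #16 / add a. */
uint8_t entry_offset(uint16_t size)
{
    uint8_t rem = (uint8_t)(size & 0x0F);  /* size modulo 16 */
    if (rem == 0)
        return 0;                          /* jp z: start at the first LDI */
    return (uint8_t)((16 - rem) * 2);      /* skip (16 - rem) two-byte LDIs */
}

/* Portable model of the copy itself. The first pass through the
   unrolled block is entered part-way so that only (size % 16) bytes
   are copied; every later pass copies a full 16 bytes.
   size must be non-zero, as in the assembly version. */
void fastcopy_model(const uint8_t *src, uint8_t *dest, uint16_t size)
{
    uint8_t  skip = (uint8_t)(entry_offset(size) / 2); /* LDIs skipped on pass 1 */
    uint16_t bc   = size;                              /* plays the role of BC */

    do {
        for (uint8_t i = skip; i < 16; i++) {          /* the unrolled LDIs */
            *dest++ = *src++;
            bc--;
        }
        skip = 0;                                      /* later passes are full */
    } while (bc != 0);                                 /* jp pe: loop while BC != 0 */
}
```

For example, entry_offset(17) gives 30, so the jump lands on the last ldi of the block: one byte is copied, then the "jp pe" loops back for the remaining 16.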
This function converges to a speed gain of 18.75% compared to a classic ldir loop (we gain 4.31 cc per byte, counting one ldi as one loop).
Break-even is reached at 6 loops when the size is a multiple of 16, or at 31 loops otherwise.
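For anyone who wants to check these figures, the arithmetic can be sketched in a few lines of C. The 18 cc (ldi) and 11 cc (jp) values come from the code comments; the 23 cc per byte for ldir is my assumption for MSX timing (it is the value implied by the 18.75% and 4.31 cc figures). The function names are only for this illustration.

```c
#define LDIR_CC 23.0   /* cc per byte for ldir (assumed MSX timing) */
#define LDI_CC  18.0   /* cc per ldi, from the code comments */
#define JP_CC   11.0   /* cc for the jp pe at the end of each block */
#define UNROLL  16     /* number of ldi per block */

/* Average cost of one byte in the unrolled loop: 18 + 11/16 = 18.6875 cc. */
double unrolled_cc_per_byte(void)
{
    return LDI_CC + JP_CC / UNROLL;
}

/* Cycles saved per byte versus ldir: 23 - 18.6875 = 4.3125 cc. */
double gain_per_byte(void)
{
    return LDIR_CC - unrolled_cc_per_byte();
}

/* Relative gain in percent: 4.3125 / 23 = 18.75%. */
double gain_percent(void)
{
    return 100.0 * gain_per_byte() / LDIR_CC;
}

/* Smallest number of bytes whose accumulated gain covers the setup cost. */
unsigned break_even_bytes(double setup_cc)
{
    double   gain = gain_per_byte();
    unsigned n    = (unsigned)(setup_cc / gain);
    if (n * gain < setup_cc)
        n++;                               /* round up without math.h */
    return n;
}
```

With the setup costs from the code comments, break_even_bytes(24.0) gives 6 and break_even_bytes(131.0) gives 31. Note that the parameter fetch through iy is not counted in either figure.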
I'm not very good at assembly, so it can probably be made even faster (at least the initialization part).
So I'm interested in your proposals.
Of course, one solution would be to say that this function only copies sizes that are multiples of 16, but that's not the point here (even if it's a good point ^^).