Fast copy using unrolled ldi loop

Page 2/2
1 |

By bore

Master (142)

06-07-2022, 14:20

If you don't want to use __z88dk_callee it could be good to keep in mind that

				; cycles	size
        ld      iy, #2		; 16		4
        add     iy, sp		; 17		2
        ld      c, 0(iy)	; 21		3
        ld      b, 1(iy)	; 21 => 75	3 => 12

can be replaced with

				; cycles	size
	push	hl		; 12		1
	ld	hl, #4		; 11		3
	add	hl, sp		; 12		1
	ld	c, (hl)		; 8		1
	inc	hl		; 7		1
	ld	b, (hl)		; 8		1
	pop	hl		; 11 => 69	1 => 9

Or you can use Bengalack's alternative without __z88dk_callee, as long as you push bc back:

				; cycles	size
	pop	iy		; 16		2
	pop	bc		; 11		1
	push	bc		; 12 => 39	1 => 4

You will still need to exit with jp (iy) instead of ret here unless you push back iy too.
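
The column totals of the three listings above can be re-added with a short C sketch. The per-instruction figures are copied from the listings; the `Instr` type and the `sum_cost` and `check_totals` helpers are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one instruction's cost as quoted in the listings. */
typedef struct { unsigned cycles; unsigned size; } Instr;

/* Sum cycles and bytes over a sequence of instructions. */
static Instr sum_cost(const Instr *seq, size_t n)
{
    Instr total = { 0u, 0u };
    for (size_t i = 0; i < n; i++) {
        total.cycles += seq[i].cycles;
        total.size   += seq[i].size;
    }
    return total;
}

/* ld iy,#2 / add iy,sp / ld c,0(iy) / ld b,1(iy) */
static const Instr via_iy[]  = { {16, 4}, {17, 2}, {21, 3}, {21, 3} };
/* push hl / ld hl,#4 / add hl,sp / ld c,(hl) / inc hl / ld b,(hl) / pop hl */
static const Instr via_hl[]  = { {12, 1}, {11, 3}, {12, 1}, {8, 1}, {7, 1}, {8, 1}, {11, 1} };
/* pop iy / pop bc / push bc */
static const Instr via_pop[] = { {16, 2}, {11, 1}, {12, 1} };

/* The asserts hold if the totals quoted in the listings are right. */
static void check_totals(void)
{
    Instr a = sum_cost(via_iy,  sizeof via_iy  / sizeof via_iy[0]);
    Instr b = sum_cost(via_hl,  sizeof via_hl  / sizeof via_hl[0]);
    Instr c = sum_cost(via_pop, sizeof via_pop / sizeof via_pop[0]);
    assert(a.cycles == 75 && a.size == 12);
    assert(b.cycles == 69 && b.size == 9);
    assert(c.cycles == 39 && c.size == 4);
}
```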

By Bengalack

Hero (594)

06-07-2022, 20:24

Grauw wrote:

Surely the ret to jump indirectly will still work? You push the jump address right before the ret.

But of course :) It depends on where this ends up in the end, and only aoineko will know. I was only thinking that if you already have the return address in iy, it makes sense to use jp (iy). It is "only" 10 cycles. Doing push + ret instead is 17+11 = 28.

By aoineko

Champion (489)

06-07-2022, 22:36

Here is the "final" version for the record:

void Mem_FastCopy(const void* src, void* dest, u16 size) __naked
{
    src;    // HL
    dest;   // DE
    size;   // SP+2
    __asm
        // Get parameters
        pop     iy                          // 16 cc (return address)
        pop     bc                          // 11 cc (retrieve size)
    mem_fastcopy_setup:
        // Setup fast LDIR loop
        xor     a                           //  5 cc
        sub     c                           //  5 cc
        and     #15                         //  8 cc
        jp      z, mem_fastcopy_loop        // 11 cc - total 29 cc (break-even at 16 loops)
        add     a                           //  5 cc
        exx                                 //  5 cc
        add     a, #mem_fastcopy_loop       //  8 cc
        ld      l, a                        //  5 cc
        ld      a, #0                       //  8 cc
        adc     a, #mem_fastcopy_loop >> 8  //  8 cc
        ld      h, a                        //  5 cc
        push    hl                          // 12 cc
        exx                                 //  5 cc
        ret                                 // 11 cc - total 101 cc (break-even at 25 loops)
    mem_fastcopy_loop:
        // Fast LDIR (with 16x unrolled LDI)
        .rept 16
        ldi                                 // 18 cc
        .endm
        jp      pe, mem_fastcopy_loop       // 11 cc (0.6875 cc per ldi)
    mem_fastcopy_end:
        jp      (iy)                        // 10 cc
    __endasm;
}
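
To check the entry-offset logic on a PC, the routine can be modelled in portable C. This is only a sketch of how the assembly appears to behave, not the library's code: `entry_skip` and `model_fastcopy` are hypothetical names, and the model rejects a size of 0 because BC would wrap to 0xFFFF on the first LDI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of LDI slots skipped on entry: mirrors xor a / sub c / and #15. */
static unsigned entry_skip(uint16_t size)
{
    return (0u - (size & 0xFFu)) & 15u;
}

/* Hypothetical C model of the routine; returns how many LDIs were executed. */
static size_t model_fastcopy(const uint8_t *src, uint8_t *dest, uint16_t size)
{
    uint16_t bc = size;
    size_t ldi_count = 0;
    unsigned slot = entry_skip(size);   /* the asm jumps to loop + 2 * slot (LDI is 2 bytes) */

    assert(size != 0);                  /* BC = 0 would wrap to 0xFFFF on the first LDI */

    for (;;) {
        for (; slot < 16u; slot++) {    /* the remaining LDIs of this pass */
            *dest++ = *src++;           /* ldi: (DE) <- (HL), HL++, DE++ */
            bc--;                       /*      BC--, P/V = (BC != 0)    */
            ldi_count++;
        }
        if (bc == 0)                    /* jp pe falls through once BC reached 0 */
            break;
        slot = 0;                       /* jp pe, mem_fastcopy_loop */
    }
    return ldi_count;
}
```

Because the first pass executes exactly `size % 16` LDIs (or 16 when the size is a multiple of 16), BC is always a multiple of 16 at the start of every later pass, so it reaches 0 exactly at the `jp pe` check.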

And here is a speed comparison (measured after the HL, DE and BC registers are set up, to count only the pure assembler part):

loop count => gain in %

 16 => +9.6% (break-even for multiple of 16)
 25 => +0.2% (break-even for non-multiple of 16)
 30 => +3.4%
 32 => +14.2% (multiple of 16)
 40 => +7.2%
 48 => +15.7% (multiple of 16)
 50 => +9.5%
100 => +14.2%
128 => +17.6% (multiple of 16)
500 => +17.8%
512 => +18.5% (multiple of 16)
 ∞  => +18.7%
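
The gains can be approximated with a small cycle model in C. The LDI (18 cc), jp (11 cc) and setup (29 / 101 cc) figures come from the listing; the LDIR cost (23 cc per repeated byte, 18 cc for the last one) is an assumption, and all function names are hypothetical. With these assumptions the multiples of 16 come out within rounding of the table. The model charges a whole `jp pe` per pass, so other sizes come out slightly lower (for 25 bytes it gives 573 cc against 570 cc for LDIR).

```c
#include <assert.h>

/* Assumed MSX timings in cc: LDI, JP and the setup costs are from the listing;
   the LDIR figures (23 repeating, 18 on the last byte) are an assumption. */
enum { CC_LDI = 18, CC_JP = 11, CC_LDIR_REPEAT = 23, CC_LDIR_LAST = 18,
       CC_SETUP_ALIGNED = 29, CC_SETUP_UNALIGNED = 101 };

/* Cost of a plain LDIR copying n bytes (n >= 1). */
static long ldir_cycles(long n)
{
    assert(n >= 1);
    return CC_LDIR_REPEAT * (n - 1) + CC_LDIR_LAST;
}

/* Cost of the setup plus the unrolled loop for n bytes (n >= 1). */
static long fast_cycles(long n)
{
    assert(n >= 1);
    long passes = (n + 15) / 16;  /* one jp pe per pass through the block */
    long setup = (n % 16 == 0) ? CC_SETUP_ALIGNED : CC_SETUP_UNALIGNED;
    return setup + CC_LDI * n + CC_JP * passes;
}

/* Gain of the fast version over LDIR, in percent of the LDIR cost. */
static double gain_percent(long n)
{
    return 100.0 * (double)(ldir_cycles(n) - fast_cycles(n)) / (double)ldir_cycles(n);
}
```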

Thanks to all!

By Bengalack

Hero (594)

07-07-2022, 00:26

I think you need __z88dk_callee in addition to __naked. If the above works, it is because the caller code has stored the original SP value.

By gdx

Enlighted (5600)

07-07-2022, 02:33

aoineko wrote:
Metalion wrote:

In your calculation of the break-even, you forgot the //Get Parameters part.
It adds 75 t-states to the total.

This is also used for my "normal" ldir version, so I only counted the extra code between the 2 versions.

Yes, but it's better to include it because parameter entry is longer for the fast version.

By aoineko

Champion (489)

07-07-2022, 09:00

Bengalack wrote:

I think you need __z88dk_callee in addition to __naked. If the above works, it is because the caller code has stored the original SP value.

It's a little bit off-topic, but __sdcccall(1) (the new default calling convention) already acts like __z88dk_callee here. The stack adjustment is done by the function, not the caller.

The documentation is not clear on that subject, but it's what I see in all my tests:
« If __z88dk_callee is not used, after the call, the stack parameters are cleaned up by the caller, with the following exceptions: functions that do not have variable arguments and return void or a type of at most 16 bits, or have both a first parameter of type float and a return value of type float. »

I added __z88dk_callee for peace of mind. :)

By Bengalack

Hero (594)

07-07-2022, 09:24

Great - that was news to me and very good to know. Thanks! If utilised, this can speed things up A LOT! I've been replacing the old "retrieve-from-the-stack-dance" (ld iy,#2 add iy,sp, etc, etc) with sets of pops in several places. So much faster.

By Bengalack

Hero (594)

07-07-2022, 09:32

Bengalack wrote:

By doing this, you need to keep the jp (iy), and not replace by ret as Grauw

Grauw is right. I didn't look carefully enough - it is perfectly fine to use ret in this case.

By Prodatron

Paragon (1812)

07-07-2022, 17:31

Replace
jp z, mem_fastcopy_loop
with
jr z, mem_fastcopy_loop
and you will gain another 5 cc in most cases :P

By aoineko

Champion (489)

07-07-2022, 20:34

But then I lose 2 cc for values that are a multiple of 16, don't I?
I like to have this "multiple of 16" optimization.
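
The jp/jr trade-off can be written down as a small C sketch. The jp figure (11 cc) is the one used in the listing; the jr figures (13 cc taken, 8 cc not taken) are an assumption based on standard Z80 timings plus one M1 wait state, so the exact saving in the not-taken case depends on that assumption. `jr_saving` is a hypothetical name.

```c
#include <assert.h>

/* Assumed MSX timings in cc (standard Z80 T-states plus one M1 wait state). */
enum { CC_JP_COND = 11, CC_JR_TAKEN = 13, CC_JR_NOT_TAKEN = 8 };

/* Cycles saved (positive) or lost (negative) by using jr z instead of jp z. */
static int jr_saving(unsigned size)
{
    assert(size != 0);
    int taken = (size % 16u == 0u);  /* taken when no entry offset is needed */
    return CC_JP_COND - (taken ? CC_JR_TAKEN : CC_JR_NOT_TAKEN);
}
```

Under these assumptions jr costs 2 cc more for multiples of 16 and saves a few cc for every other size, so the choice depends on which sizes are most common in practice.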
