# Fastest possible multiplication routine?

Page 5/7
1 | 2 | 3 | 4 | | 6 | 7

I need full 8 bit or full 16 bit. No mixed versions needed. Actually, I need the remainder as well, but it doesn't have to be from the same routine as the unsigned version gives. In z80 asm. Although I'm not so sure that r800 will make any difference.

All I can think off is that old school rule devision is multiplying with the inverse... For a 16 bit value this would imply a 64kb table to quickly convert a number X to 1/X and subsequently use the the above singed multiplication routine. Possibly only for use on r800.

I did use muluw for some fixed devision (unsigned) some time ago... let me look it up...

found it!

;---------------------------------------------------
; 16 bit - div 12 ;mod 12
;
; valid value range 0 - 12288
;
; in: hl    uit: hl = div 12 , e = mod 12
;
; verandert: a,bc,de,hl

divmod12

ld bc,5461   ; 65536 / 12

muluw hl,bc

push de      ; >>> bewaar div 12
/*
ld a,21      ; afronding fout correctie grote getallen
ld h,a
*/
ld bc,12     ; rest x 12 = mod

muluw hl,bc

pop hl       ; <<<

ret

Untested, but basically it is GosthwriterP's code on 8 bits

; fastdiv
; input: E=dividend, C=divisor
; output: E=quotient, H=rest/divisor*256

fastdiv:
xor	a
ld	d,a
ld	b,a
ld	hl,HelpTable-1
ld	c,(hl)
inc	hl
ld	b,(hl)
ex	de,hl
muluw hl,bc
ret

HelpTable: 		; dw	2^16/i = 1..255
dw 65536
dw 32768
dw 21845
dw 16384
dw 13107
dw 10923
dw 9362
dw 8192
dw 7282
dw 6554
dw 5958
dw 5461
dw 5041
dw 4681
dw 4369
dw 4096
dw 3855
dw 3641
dw 3449
dw 3277
dw 3121
dw 2979
dw 2849
dw 2731
dw 2621
dw 2521
dw 2427
dw 2341
dw 2260
dw 2185
dw 2114
dw 2048
dw 1986
dw 1928
dw 1872
dw 1820
dw 1771
dw 1725
dw 1680
dw 1638
dw 1598
dw 1560
dw 1524
dw 1489
dw 1456
dw 1425
dw 1394
dw 1365
dw 1337
dw 1311
dw 1285
dw 1260
dw 1237
dw 1214
dw 1192
dw 1170
dw 1150
dw 1130
dw 1111
dw 1092
dw 1074
dw 1057
dw 1040
dw 1024
dw 1008
dw 993
dw 978
dw 964
dw 950
dw 936
dw 923
dw 910
dw 898
dw 886
dw 874
dw 862
dw 851
dw 840
dw 830
dw 819
dw 809
dw 799
dw 790
dw 780
dw 771
dw 762
dw 753
dw 745
dw 736
dw 728
dw 720
dw 712
dw 705
dw 697
dw 690
dw 683
dw 676
dw 669
dw 662
dw 655
dw 649
dw 643
dw 636
dw 630
dw 624
dw 618
dw 612
dw 607
dw 601
dw 596
dw 590
dw 585
dw 580
dw 575
dw 570
dw 565
dw 560
dw 555
dw 551
dw 546
dw 542
dw 537
dw 533
dw 529
dw 524
dw 520
dw 516
dw 512
dw 508
dw 504
dw 500
dw 496
dw 493
dw 489
dw 485
dw 482
dw 478
dw 475
dw 471
dw 468
dw 465
dw 462
dw 458
dw 455
dw 452
dw 449
dw 446
dw 443
dw 440
dw 437
dw 434
dw 431
dw 428
dw 426
dw 423
dw 420
dw 417
dw 415
dw 412
dw 410
dw 407
dw 405
dw 402
dw 400
dw 397
dw 395
dw 392
dw 390
dw 388
dw 386
dw 383
dw 381
dw 379
dw 377
dw 374
dw 372
dw 370
dw 368
dw 366
dw 364
dw 362
dw 360
dw 358
dw 356
dw 354
dw 352
dw 350
dw 349
dw 347
dw 345
dw 343
dw 341
dw 340
dw 338
dw 336
dw 334
dw 333
dw 331
dw 329
dw 328
dw 326
dw 324
dw 323
dw 321
dw 320
dw 318
dw 317
dw 315
dw 314
dw 312
dw 311
dw 309
dw 308
dw 306
dw 305
dw 303
dw 302
dw 301
dw 299
dw 298
dw 297
dw 295
dw 294
dw 293
dw 291
dw 290
dw 289
dw 287
dw 286
dw 285
dw 284
dw 282
dw 281
dw 280
dw 279
dw 278
dw 277
dw 275
dw 274
dw 273
dw 272
dw 271
dw 270
dw 269
dw 267
dw 266
dw 265
dw 264
dw 263
dw 262
dw 261
dw 260
dw 259
dw 258
dw 257

Thinking to the signed version (for z80)

Signed and unsigned 16 bit divisions (and modulus % operator)

;	16 bit divide and modulus routines

;	called with dividend in hl and divisor in de

;	returns with result in hl.

;	adiv (amod) is signed divide (modulus), ldiv (lmod) is unsigned

psect	text,class=CODE

amod:
ex	de,hl		;put modulus in hl
ret

lmod:
call	ldiv
ex	de,hl
ret

ldiv:
xor	a
push	af
ex	de,hl
jr	dv1

ld	a,h
xor	d		;set sign flag for quotient
ld	a,h		;get sign of dividend
push	af
call	negif
ex	de,hl
call	negif
dv1:	ld	b,1
ld	a,h
or	l
jr	nz,dv8
pop	af
ret

dv8:	push	hl
jr	c,dv2
ld	a,d
cp	h
jr	c,dv2
jp	nz,dv6
ld	a,e
cp	l
jr	c,dv2
dv6:	pop	af
inc	b
jp	dv8

dv2:	pop	hl
ex	de,hl
push	hl
ld	hl,0
ex	(sp),hl

dv4:	ld	a,h
cp	d
jr	c,dv3
jp	nz,dv5
ld	a,l
cp	e
jr	c,dv3

dv5:	sbc	hl,de
dv3:	ex	(sp),hl
ccf
srl	d
rr	e
ex	(sp),hl
djnz	dv4
pop	de
ex	de,hl
pop	af
call	m,negat
ex	de,hl
or	a			;test remainder sign bit
call	m,negat
ex	de,hl
ret

negif:	bit	7,h
ret	z
negat:	ld	b,h
ld	c,l
ld	hl,0
or	a
sbc	hl,bc
ret

PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)

Untested, but basically it is GosthwriterP's code on 8 bits

Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).

If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):

; input:    [A]=dividend, [L]=divisor
; output:   [H]=quotient
; modifies: [B][L]
; remark:   The calculated quotient can be off-by-one!!!

approx_udiv8:
ld h,LUT_recip8 / 256 ; high byte of the table
ld b,(hl)             ; B = 256 / L
mulub a,b             ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors)
ret

.align 256            ; alternative: ds (256 - (\$ and 255)) and 255
LUT_recip8:
[ table with 256 entries generated by the formula
'256 / i'   or   '(256 / i) + 1'
both formulas result in off-by-one errors ]

Untested, but basically it is GosthwriterP's code on 8 bits

Unfortunately it doesn't work .. it gives off-by-one errors for many inputs. For example: 3 / 3 will result in 0 instead of 1 (3 * 21845 = 0x0000ffff, upper 16-bits are zero). But if you add 1 to all the entries in your lookup table the result will be correct (because you use 16-bit multiplication).

If off-by-one errors are acceptable for your application, there's a much faster routine possible, using 8-bit multiplication. I also optimized the LUT by aligning it on a 256-byte boundary (a similar optimization can be done on your original routine. Actually your routine has a bug: each entry in your table is 2 bytes long but you forgot to multiply the index by 2):

; input:    [A]=dividend, [L]=divisor
; output:   [H]=quotient
; modifies: [B][L]
; remark:   The calculated quotient can be off-by-one!!!

approx_udiv8:
ld h,LUT_recip8 / 256 ; high byte of the table
ld b,(hl)             ; B = 256 / L
mulub a,b             ; H = (A * (256 / L)) / 256 ~= A / L (with rounding errors)
ret

.align 256            ; alternative: ds (256 - (\$ and 255)) and 255
LUT_recip8:
[ table with 256 entries generated by the formula
'256 / i'   or   '(256 / i) + 1'
both formulas result in off-by-one errors ]

I did not test the table agains the rounding errors. I confirm also the DW issue an extra add hl,bc is needed. Thanks! !
Anyway, even fixing both problems, your solution using 8 bit data is much more efficient and should give the very same results than mine using less space and time.

First of all sorry for the long post. I hope some of you will find it interesting nevertheless.

PS
Taken from sources of the Hi-Tech crosscompiler for windows (v7.8)

Ugh, that's very ugly code ;-) There has to be a better way...

I started from this routine I found via google (I made some very minor changes). It does 8-bit by 8-bit unsigned division.

; -- unsigned 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [H] = divisor
; output: [L] = quotient, [A] = remainder
;         ([C],[DE],[H] unchanged, [B] = 0)
; cycles Z80:  395-435 (If I didn't make any mistakes!!!)
;        R800:  87-95
; size: 14 bytes
; Z80 R800
udiv8:	xor  a      ;  5    1
ld   b,8    ;  8    2
loop:	sla  l      ; 10    2
rla         ;  5    1
cp   h      ;  5    1
jr   c,next ; 13/8 3/2
sub  h      ;  5    1
inc  l      ;  5    1
next:   djnz loop   ; 14/9 3/2
ret         ; 11    5

After some tweaking I got this. It has one instruction less in the inner loop. But the setup code before the loop is now more expensive, so the total speed gain isn't that much. However negating the divisor is also something that needs to happen in a signed division routine. So I hope this setup cost will partly disappear in a signed division routine. I didn't try to write such routine yet. Note that the worst and best-case execution times are now the same.

; -- unsigned 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [A] = divisor
; output: [L] = quotient, [A] = remainder
;         ([C],[DE] unchanged, [B] = 0)
; cycles Z80:  420 (If I didn't make any mistakes!!!)
;        R800:  92
; size: 18 bytes
; Z80 R800
udiv8:	neg          ; 10    2
ld   h,a     ;  5    1     ; h = -divisor
xor  a       ;  5    1
ld   b,8     ;  8    2
rl   l       ; 10    2
loop:	rla          ;  5    1
add  h       ;  5    1     ; a += -divisor
jr   c,next  ; 13/8 3/2
sub  h       ;  5    1     ; after this instr C-flag is always clear
next:	rl   l       ; 10    2
djnz loop    ; 14/9 3/2
ret          ; 11    5

If you really care about speed (and less about code size), you can unroll the loop. This has the extra advantage of not using the B register.

; -- 8-bit / 8-bit = 8-bit  (8-bit remainder)
; input:  [L] = dividend, [A] = divisor
; output: [L] = quotient, [A] = remainder
;         ([BC],[DE] unchanged)
; cycles Z80:  305 (If I didn't make any mistakes!!!)
;        R800:  67
; size: 63 bytes
; Z80 R800
udiv8:	neg          ; 10    2
ld   h,a     ;  5    1
xor  a       ;  5    1
rl   l       ; 10    2
rla          ;  5    1
jr   c,next1 ; 13/8 3/2
sub  h       ;  5    1
next1:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next2 ; 13/8 3/2
sub  h       ;  5    1
next2:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next3 ; 13/8 3/2
sub  h       ;  5    1
next3:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next4 ; 13/8 3/2
sub  h       ;  5    1
next4:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next5 ; 13/8 3/2
sub  h       ;  5    1
next5:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next6 ; 13/8 3/2
sub  h       ;  5    1
next6:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next7 ; 13/8 3/2
sub  h       ;  5    1
next7:	rl   l       ; 10    2
rla          ;  5    1
jr   c,next8 ; 13/8 3/2
sub  h       ;  5    1
next8:	rl   l       ; 10    2
ret          ; 11    5

I also wrote a 16-bit version.

; -- 16-bit / 16-bit = 16-bit  (16-bit remainder)
; input:  [DE] = dividend, [BC] = divisor
; output: [DE] = quotient, [HL] = remainder
;         ([BC] = unchanged, [A] = 0)
; cycles Z80:  1438-1470 (If I didn't make any mistakes!!!)
;        R800:  265-281  (Can be tweaked by changing jp->jr
;                         but that's worse for Z80.)
; size: 27 bytes
; Z80 R800
udiv16:	ld   hl,0     ; 11    3
ld   a,16     ;  8    2
loop:	sla  e        ; 10    2
rl   d        ; 10    2
adc  hl,hl    ; 17    2    after this instr C-flag=0
sbc  hl,bc    ; 17    2
jr   nc,next1 ; 13/8 3/2   z80 and r800: jr here is faster than jp
next0:	add  hl,bc    ; 12    1
dec  a        ;  5    1
jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
ret           ; 11    5
next1:  inc  e        ;  5    1    before this instr e.0 was 0
dec  a        ;  5    1
jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
ret           ; 11    5

A similar tweak as for the 8-bit version can be done here. Though in this case it not a clear win: the best-case execution time got faster, but worst-case got slower (at least on Z80, on R800 it is a clear win). The inner loop has the same worst-case speed, but the setup code is more expensive. When this code is extended to signed division I again expect that the setup-cost will be reduced compared to the version above. This version is also easier to unroll if you need extra speed (unrolled version not shown).

; -- 16-bit / 16-bit = 16-bit  (16-bit remainder)
; input:  [BC] = dividend, [DE] = divisor
; output: [BC] = quotient, [HL] = remainder
;         ([A] = 0)
; cycles Z80:  1334-1526 (If I didn't make any mistakes!!!)
;        R800:  258-274  (Can be tweaked by changing jp->jr
;                         but that's worse for Z80.)
; size: 30 bytes
; Z80 R800
udiv16:	xor  a        ;  5    1    C-flag = 0
ld   h,a      ;  5    1
ld   l,a      ;  5    1    HL = 0
sbc  hl,de    ; 17    2
ex   de,hl    ;  5    1    DE = -divisor
ld   h,a      ;  5    1
ld   l,a      ;  5    1    HL = 0
ld   a,16     ;  8    2
rl   c        ; 10    2
rl   b        ; 10    2
loop:	adc  hl,hl    ; 17    2
add  hl,de    ; 12    1    HL += -divisor
jr   c,next   ; 13/8 3/2   z80 and r800: jr here is faster than jp
sbc  hl,de    ; 17    2    after this instr C-flag is clear
next	rl   c        ; 10    2
rl   b        ; 10    2
dec  a        ;  5    1
jp   nz,loop  ; 11   4/3   z80: jp is faster  r800: jr is faster
ret           ; 11    5

BTW: when comparing the cycle count of Z80 to R800, remember that Z80 runs at 3.5MHz while R800 runs at 7MHz (so 2 R800 cycles take the same amount of time as 1 Z80 cycle).

If you don't care about code size you could even use loop up tables instead of loops !

well done, even if on r800 the 1/x table is by far faster

Page 5/7
1 | 2 | 3 | 4 | | 6 | 7