Autor
| Fastest possible multiplication routine?
|
Prodatron msx master Mensajes: 1113 | Publicado: Mayo 05 2007, 01:34   |
Hi,
today I stumbled over a routine I developed for my old 3D demos and which I uploaded to the CPCWiki some time ago:
http://www.cpcwiki.com/index.php/Programming:Ultrafast_Multiplication
As this is 100% Z80 and not platform related it maybe interesting for you.
AFAIK it's the fastest way of multiplication which is possible on Z80 for this precision (16bit x 8/7bit).
I wonder, if it could be even done faster, and how much faster the R800 does with its inbuilt MUL-commands?
CU,
Prodatron |
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 02:08   |
The code is a bit hard to read. How ultra fast is it?
|
|
Prodatron msx master Mensajes: 1113 | Publicado: Mayo 05 2007, 02:18   |
Sometimes I like to use ":" and put multiple commands in one line 
The whole thing uses (more or less) ONE routine for EACH rigth-side value, that's it. The speed depends on the rigth-side value (8/7 bit) as different routines are selected then. |
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 02:20   |
Do you have some cpu cycle numbers? worst case/best case/ average...
Actually I guess I can do that myself. Its not that hard...
|
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 02:33   |
By doing shifts on DE (or BC) and accumulate on HL, you'll be faster in some cases
MUL073 for example can be implemented like this (input args may need to be modified to fit your code):
MUL073:
ld h,d
ld l,e
sla e
rl d
sla e
rl d
add hl,de
sla e
rl d
sla e
rl d
add hl,de
ld a,h
ret
Saves 26 cycles if I did the math correct. |
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 02:46   |
Also, why do you use:
PUSH BC
RET
isntead of
JP (HL)
|
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 02:52   |
I did a similar routine not too long ago. Don't have it on this computer but the I think I did the table lookup like this:
Mul:
ld h,MulTab
add hl,hl
ld b,(hl)
inc l
ld h,(hl)
ld h,b
jp (hl)
Saves another 20 cycles |
|
nikodr msx addict Mensajes: 491 | Publicado: Mayo 05 2007, 04:35   |
Anyone with a real turbo-r to tell us about the r800 mul routines?Actually i am interested to buy a turbo-r by the way.How much would a good price be? 200-250 euros?
|
|
nikodr msx addict Mensajes: 491 | Publicado: Mayo 05 2007, 04:35   |
Sorry for double post  |
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 07:00   |
I think some of the cases can be optimized quite a bit. here are some examples.
These particular cases are maybe 10-30% faster. But I think most of the cases in your routine are close to optimal. But if say 20% of the cases can be made significantly faster, I think it will be noticeable on the overall performance.
MUL007:
ld d,h
ld e,l
add hl,hl
add hl,de
add hl,hl
add hl,de
ld a,h
ret
MUL011:
ld d,h
ld e,l
add hl,hl
add hl,hl
add hl,de
add hl,hl
add hl,de
ld a,h
ret
MUL063:
ld d,h
ld e,l
rr h
rr l
rr h
rr l
xor a
ld h,l
ld l,a
sbc hl,de
ld a,h
ret
MUL064:
rr h
rr l
rr h
ld a,l
rrca
ret
MUL065:
ld d,h
ld e,l
rr h
rr l
rr h
rr l
ld h,l
ld l,0
add hl,de
ld a,h
ret
MUL073:
ld d,h
ld e,l
sla e
rl d
sla e
rl d
add hl,de
sla e
rl d
sla e
rl d
add hl,de
ld a,h
ret
MUL127:
ld d,h
ld e,l
rr h
rr l
xor a
ld h,l
ld l,a
sbc hl,de
ld a,h
ret
|
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 08:08   |
Some of the versions I put here may be wrong. I'll try to find the real implementation and post it here
|
|
arnold_m online msx lover Mensajes: 86 | Publicado: Mayo 05 2007, 09:13   |
Quote:
| By doing shifts on DE (or BC) and accumulate on HL, you'll be faster in some cases
MUL073 for example can be implemented like this (input args may need to be modified to fit your code):
MUL073:
ld h,d
ld l,e
sla e
rl d
sla e
rl d
add hl,de
sla e
rl d
sla e
rl d
add hl,de
ld a,h
ret
Saves 26 cycles if I did the math correct.
|
When I do the maths i find that this is a MUL021 routine.
My suggestion:
MUL073:
ld e,l
ld d,h
add hl,hl
add hl,hl
add hl,hl
ex de,hl
add hl,de
ex de,hl
add hl,hl
add hl,hl
add hl,hl
add hl,de
ld a,h
ret
The instructions sla e and rl d are relatively expensive two byte instructions.
I count 118+14 cycles for my code above. Prodatron's routine has the same cycle count.
The code of dvik costs 138+32 cycles after correction.
Using
jp (hl)
instead of
push bc
ret[
is possible, but then you need to put the
ex de,hl
in each of the individual multiplication routines, making the fall-through pattern unusable, so you get severe code-bloat. |
|
GhostwriterP msx addict Mensajes: 320 | Publicado: Mayo 05 2007, 09:14   |
Well, this seems like a good example for a library.  |
|
dvik msx master Mensajes: 1376 | Publicado: Mayo 05 2007, 09:32   |
Yeah the MUL73 is not correct. I think I remembered wrong. The shifts are quite nice in some cases even though this was a bad example.
The jp (hl) does indeed make the fallthroughs not work but it saves like 10-15% cpu time. With other changes I think the average execution time could be decreased by at least 20%.
The cost with one optimized for speed is of course bigger code size. This implementation is rather big as well but in apps with a lot of multiplications the extra code size may be worth it.
|
|
Prodatron msx master Mensajes: 1113 | Publicado: Mayo 05 2007, 11:05   |
Hey cool, thanx a lot for this feedback! I will update the article with your optimized routines.
Regarding jp (hl) / push bc:ret - yes, it's like both of you say. So I shouldn't call the "routine" in the article "fastest possible" anymore, but there should be the existing one, which saves some memory, and the really fastest one, which has one completely own sub-routine for every value.
|
|
|
|
|