By doing shifts on DE (or BC) and accumulate on HL, you'll be faster in some cases
MUL073 for example can be implemented like this (input args may need to be modified to fit your code):
MUL073: ld h,d ld l,e sla e rl d sla e rl d add hl,de sla e rl d sla e rl d add hl,de ld a,h ret
Saves 26 cycles if I did the math correct.
When I do the maths i find that this is a MUL021 routine.
My suggestion:
MUL073: ld e,l ld d,h add hl,hl add hl,hl add hl,hl ex de,hl add hl,de ex de,hl add hl,hl add hl,hl add hl,hl add hl,de ld a,h ret
The instructions
sla e
and
rl d
are relatively expensive two byte instructions.
I count 118+14 cycles for my code above. Prodatron's routine has the same cycle count.
The code of dvik costs 138+32 cycles after correction.
Using
jp (hl)
instead of
push bc ret[
is possible, but then you need to put the
ex de,hl
in each of the individual multiplication routines, making the fall-through pattern unusable, so you get severe code-bloat.
Well, this seems like a good example for a library.
Yeah the MUL73 is not correct. I think I remembered wrong. The shifts are quite nice in some cases even though this was a bad example.
The jp (hl) does indeed make the fallthroughs not work but it saves like 10-15% cpu time. With other changes I think the average execution time could be decreased by at least 20%.
The cost with one optimized for speed is of course bigger code size. This implementation is rather big as well but in apps with a lot of multiplications the extra code size may be worth it.
Hey cool, thanx a lot for this feedback! I will update the article with your optimized routines.
Regarding jp (hl) / push bc:ret - yes, it's like both of you say. So I shouldn't call the "routine" in the article "fastest possible" anymore, but there should be the existing one, which saves some memory, and the really fastest one, which has one completely own sub-routine for every value.
I'll send you the version I did earlier. Its not complete but I think I can make it complete based on your version. I started doing a similar routine but didn't finish all cases. I tested numbers 0 But if you want really fast and have a lot of memory you can do a single table lookup. I was actually considering that for the app I was writing. Would be interesting to know what multiplication routine (if any) alone coder used in his speccy port of wolfenstein.
Ok, I am looking forward to it
Regarding the lookup table: Maybe I missunderstood the way, you want to do it, but wouldn't it have a size of 256*65536 bytes?
I use only lookup-tables for the division-part: X will be converted into 1/X, so I can realize formulars like A=X*Y/Z without divisons but two multiplications.
Btw, the bit-shifting instructions may add some optimization. The problem with them is, that you will loose the possibility to have 16bit results. Ok, if you don't need it, it's better to use them... Not easy to have ONE best routine
Very interesting multiplication routine can be found also here:
http://www.mccw.hetlab.tk/92/Multiplication/en.html
how much faster the R800 does with its inbuilt MUL-commands?
MULUB takes 14 cycles
MULUW takes 36 cycles
More information: http://www.msxtop.msxall.com/Docs/Z80R800GuideBrochure.pdf
Not to mention that those add hl,hl's and such are also much faster on r800.