Fastest possible multiplication routine?

Page 2/7
1 | | 3 | 4 | 5 | 6 | 7

Par arnold_m

Master (173)

Portrait de arnold_m

05-05-2007, 09:13

By doing shifts on DE (or BC) and accumulate on HL, you'll be faster in some cases

MUL073 for example can be implemented like this (input args may need to be modified to fit your code):

MUL073:
        ld      h,d
        ld      l,e
        sla     e
        rl      d
        sla     e
        rl      d
        add     hl,de
        sla     e
        rl      d
        sla     e
        rl      d
        add     hl,de 
        ld      a,h
        ret

Saves 26 cycles if I did the math correct.

When I do the maths i find that this is a MUL021 routine.
My suggestion:

MUL073:
      ld e,l
      ld d,h
      add hl,hl
      add hl,hl
      add hl,hl
      ex de,hl
      add hl,de
      ex de,hl
      add hl,hl
      add hl,hl
      add hl,hl
      add hl,de
      ld a,h
      ret

The instructions

sla e

and

rl d

are relatively expensive two byte instructions.
I count 118+14 cycles for my code above. Prodatron's routine has the same cycle count.
The code of dvik costs 138+32 cycles after correction.

Using

   jp (hl)

instead of

   push bc
   ret[

is possible, but then you need to put the

   ex de,hl

in each of the individual multiplication routines, making the fall-through pattern unusable, so you get severe code-bloat.

Par GhostwriterP

Paladin (683)

Portrait de GhostwriterP

05-05-2007, 09:14

Well, this seems like a good example for a library.Tongue

Par dvik

Prophet (2200)

Portrait de dvik

05-05-2007, 09:32

Yeah the MUL73 is not correct. I think I remembered wrong. The shifts are quite nice in some cases even though this was a bad example.

The jp (hl) does indeed make the fallthroughs not work but it saves like 10-15% cpu time. With other changes I think the average execution time could be decreased by at least 20%.

The cost with one optimized for speed is of course bigger code size. This implementation is rather big as well but in apps with a lot of multiplications the extra code size may be worth it.

Par Prodatron

Paragon (1843)

Portrait de Prodatron

05-05-2007, 11:05

Hey cool, thanx a lot for this feedback! I will update the article with your optimized routines.
Regarding jp (hl) / push bc:ret - yes, it's like both of you say. So I shouldn't call the "routine" in the article "fastest possible" anymore, but there should be the existing one, which saves some memory, and the really fastest one, which has one completely own sub-routine for every value.

Par dvik

Prophet (2200)

Portrait de dvik

05-05-2007, 11:15

I'll send you the version I did earlier. Its not complete but I think I can make it complete based on your version. I started doing a similar routine but didn't finish all cases. I tested numbers 0

But if you want really fast and have a lot of memory you can do a single table lookup. I was actually considering that for the app I was writing.

Would be interesting to know what multiplication routine (if any) alone coder used in his speccy port of wolfenstein.

Par Prodatron

Paragon (1843)

Portrait de Prodatron

05-05-2007, 11:39

Ok, I am looking forward to it Smile
Regarding the lookup table: Maybe I missunderstood the way, you want to do it, but wouldn't it have a size of 256*65536 bytes?
I use only lookup-tables for the division-part: X will be converted into 1/X, so I can realize formulars like A=X*Y/Z without divisons but two multiplications.
Btw, the bit-shifting instructions may add some optimization. The problem with them is, that you will loose the possibility to have 16bit results. Ok, if you don't need it, it's better to use them... Not easy to have ONE best routine Wink

Par AuroraMSX

Paragon (1902)

Portrait de AuroraMSX

05-05-2007, 12:23

Have a look at the MAP

Par NYYRIKKI

Enlighted (6067)

Portrait de NYYRIKKI

05-05-2007, 12:37

Very interesting multiplication routine can be found also here:
http://www.mccw.hetlab.tk/92/Multiplication/en.html

Par NYYRIKKI

Enlighted (6067)

Portrait de NYYRIKKI

05-05-2007, 12:53

how much faster the R800 does with its inbuilt MUL-commands?
MULUB takes 14 cycles
MULUW takes 36 cycles

More information: http://www.msxtop.msxall.com/Docs/Z80R800GuideBrochure.pdf

Par GhostwriterP

Paladin (683)

Portrait de GhostwriterP

05-05-2007, 12:55

Not to mention that those add hl,hl's and such are also much faster on r800.

Page 2/7
1 | | 3 | 4 | 5 | 6 | 7