Do you know how is the C flag after muluw_hl_bc ?
maybe I could also skip the and a 's if I were sure C us reset....
Unfortunately this is not possible. Also from the openMSX source code:
Verified on real R800: YHXN flags are unchanged SV flags are reset Z flag is set when result is zero C flag is set when result doesn't fit in 16-bit
In other words C-flag is set when register DE is non-zero.
BTW what do you think about point 2)?
Do you know how is the C flag after muluw_hl_bc ?
maybe I could also skip the and a 's if I were sure C us reset....
Unfortunately this is not possible. Also from the openMSX source code:
Verified on real R800: YHXN flags are unchanged SV flags are reset Z flag is set when result is zero C flag is set when result doesn't fit in 16-bit
In other words C-flag is set when register DE is non-zero.
good to know in any case, thanks
BTW what do you think about point 2)?
I canot figure out your proposal. How do you mean to change the two sections?
Same, about point 4), what is the alternative?
I canot figure out your proposal. How do you mean to change the two sections?
Currently for the case bc_neg_hl_pos you have
push hl muluw_hl_bc ; result in DEHL ex de,hl ; result in HLDE pop bc and a sbc hl,bc ; hl*bc + -1*hl<<16 ret
I think the following is a bit faster: it replaces the expensive PUSH/POP instructions with more but cheaper instructions. If I counted correctly it should be 6 cycles faster.
ld d,b ld e,c ex de,hl ld b,d ld c,e muluw_hl_bc ; result in DEHL ex de,hl ; result in HLDE and a sbc hl,bc ; hl*bc + -1*hl<<16 ret
I think that a generic multiplication routine is not the best answer in ASM.
It is far better to find the fastest routine that performs a specific multiplication (or a family of those).
It is quite rare to have the need for a generic multiplication routine in ASM, apart from real time generated graphics using trigonometrics (3D, ...).
But it is a very good mental exercise, nonetheless
Keep up the optimization
Good, in this way I can merge two branches
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ; Signed int 16 bit x 16 bit -> 32 bit multiplication for r800 ; Called with 1st arg in DE, 2nd arg in BC. Returns with ; HLDE = DE*BC global _imul _imul: ex de,hl ; HL*BC bit 7,b jr nz,bc_is_neg bit 7,h jr nz,hl_is_neg bc_hl_are_pos: ; HL>=0 and BC>=0 muluw_hl_bc ; result in DEHL ex de,hl ; result in HLDE ret bc_is_neg: bit 7,h jr nz,bc_hl_are_neg ; HL>=0 and BC<0 ex_bc_hl ; swap bc and hl hl_is_neg: ; HL<0 and BC>=0 muluw_hl_bc ; result in DEHL ex de,hl ; result in HLDE and a sbc hl,bc ; hl*bc + -1*bc<<16 ret bc_hl_are_neg: ; HL<0 and BC<0 push hl muluw_hl_bc ; result in DEHL ex de,hl ; result in HLDE and a sbc hl,bc ; hl*bc + -1*hl<<16 pop bc and a sbc hl,bc ; hl*bc + -1*hl<<16 + -1*bc<<16 ret
where:
muluw_hl_bc macro db 0xed,0xc3 endm ex_bc_hl macro ld d,b ld e,c ex de,hl ld b,d ld c,e endm
Thanks!
I think that a generic multiplication routine is not the best answer in ASM.
It is far better to find the fastest routine that performs a specific multiplication (or a family of those).
It is quite rare to have the need for a generic multiplication routine in ASM, apart from real time generated graphics using trigonometrics (3D, ...).
But it is a very good mental exercise, nonetheless
Keep up the optimization
This is the case of maze3d.The code is used in this game:
https://sites.google.com/site/testmsx/msx2-doom
The latest M3D.BIN works as SCC rom under a TR
Same, about point 4), what is the alternative?
Point 4 is very much a detail.
There are 4 possible control flows through your routine (the 2 inputs can each be either positive or negative). Each of these flows takes a certain amount of cycles to execute. The fastest path is for both inputs positive. The slowest is probably when both are negative (I didn't actually check). Point 4) is not about making the (average) execution time faster, but about making the difference between the fastest and slowest path smaller.
On R800 a conditional JR or JP instruction that does not actually jump takes less cycles to execute than one that does jump (on Z80 this is only the case for JR instructions). You can use this to arrange your code so that the slowest subroutine (both inputs negative) is reached without actually jumping, while to reach the fastest subroutine (both positive) you have to execute 2 actually taken jump instructions.
Funny how this subject resurfaces at this point in time. I was just messing with it this weekend. Yes, it is for some unknown secret project
However, I have the multiplication side covered. I only needed 8bit x 8bit = 8bit (clipped result) and 16bit x 16bit = 16bit. Which works both for signed and unsigned numbers.
However, the same cannot be said for divisions. The net is full of good examples of 8 and 16 bit unsigned divisions, but none for signed. The only thought I have had so far is to do the same as proposed earlier: xor the two high bits and and make the numbers positive, do the division, and make it negative if the xorred bit is one.
If anybody has some faster implementation for signed divisions, I'm all ears. Note that they will end up in something which will be released including source.
small update at
https://sites.google.com/site/testmsx/msx2-doom
using the code from this post
run M3D.BIN in maze3d_20110315.rar
on a openmsx TR as scc rom
Funny how this subject resurfaces at this point in time. I was just messing with it this weekend. Yes, it is for some unknown secret project
However, I have the multiplication side covered. I only needed 8bit x 8bit = 8bit (clipped result) and 16bit x 16bit = 16bit. Which works both for signed and unsigned numbers.
However, the same cannot be said for divisions. The net is full of good examples of 8 and 16 bit unsigned divisions, but none for signed. The only thought I have had so far is to do the same as proposed earlier: xor the two high bits and and make the numbers positive, do the division, and make it negative if the xorred bit is one.
If anybody has some faster implementation for signed divisions, I'm all ears. Note that they will end up in something which will be released including source.
Do you need signed division 16 bit vs 16 bit, 16 bit vs 8 bit or 8bit vs 8 bit ?
z80 or r800?