Quote:
|
you forget that calls are costly. inline code, no.
the inline directive in 'C' is specifically here to address this.
So, you loose many of the speed gain you've got in the hand coded z80 library of z88dk.
|
No, I have not forgotten :-) Inline code for modern compilers are done for a totally different reason -- calls are expensive because they cause cache misses and flushing of the instruction pipeline. Inlining avoids this problem entirely. A much, much less important reason for inlining is to give the compiler more options to optimize the code by exposing the innards of a particular function to the surrounding calling context. But the real reason is the cache miss thing as the gains from avoiding that are 1000:1 in comparison to the optimization reason.
In small machines such as the z80, with only 64k of memory space, inlining gains very little in performance increase (the opportunity to optimize does present itself but as I am about to show, the compilers generate crap code no matter how much opportunity you give them) and carries a large penalty in terms of size of program you can compile into 64k. Imagine a simple <= compare for ints. Inlining may cost you 12 bytes or so, but a call costs 3 bytes. Multiply that comparison by the 100 or so usages seen in a medium to large size program and your inlining has cost you 9*100 = 900 bytes of memory. And that's just one item!!
To see better the quality of code generated by the C compilers, I hand-coded a brief segment of the original C program in this thread:
for (i = 0; i <= SIZE; i++)
{
if (flags[i]) /* found a prime */
{
prime = i + i + 3; /* twice index + 3 */
for (k = i + prime; k <= SIZE; k += prime)
flags[k] = FALSE; /* kill all multiples */
count++; /* primes found */
}
}
After five minutes of effort, I came up with this hand-coded version:
ld bc,0 ; bc = i
.foriloop
ld a,SIZE/256 ; while bc = i <= SIZE
cp b
jr c, endfori
jp nz, contfori
ld a,SIZE%256
cp c
jr c, endfori
.contfori
ld hl,flags
add hl,bc
ld a,(hl)
or a
jr z, loopfori
ld hl,3 ; de = prime = 2*i + 3
add hl,bc
add hl,bc
ex de,hl
ld l,c ; hl = k = i + prime
ld h,b
add hl,de
push bc
ld bc,FLAGS
.forkloop
ld a,SIZE/256 ; while hl = k <= SIZE
cp h
jr c, endfork
jp nz, contfork
ld a,SIZE%256
cp l
jr c, endfork
.contfork
push hl
add hl,bc
ld (hl),0 ; flags[k] = FALSE
pop hl
add hl,de ; hl = k += prime
jp forkloop
.endfork
inc ix ; count++
pop bc ; bc = i
.loopfori
inc bc
jp foriloop
.endfori
And here is the output for the first compiler tested:
?0015:
LD DE,0
?0020:
LD HL,8190
OR 128
SBC HL,DE
JP PO,?0032
XOR H
?0032:
JP M,?0019
?0021:
LD HL,flags
ADD HL,DE
LD A,(HL)
OR A
JR Z,?0024
?0023:
LD L,E
LD H,D
INC HL
ADD HL,HL
INC HL
PUSH HL
EXX
POP BC
EXX
ADD HL,DE
PUSH HL
POP IY
?0026:
PUSH IY
POP BC
LD HL,8190
OR 128
SBC HL,BC
JP PO,?0033
XOR H
?0033:
JP M,?0025
?0027:
LD HL,flags
PUSH IY
POP BC
ADD HL,BC
LD (HL),0
EXX
PUSH BC
EXX
POP BC
ADD IY,BC
JR ?0026
?0025:
EXX
INC DE
EXX
?0024:
INC DE
JR ?0020
?0019:
INC (IX-2)
JP NZ,?0012
INC (IX-1)
JP ?0012
I'm interested in comparing the innermost loop contents:
for (k = i + prime; k <= SIZE; k += prime)
flags[k] = FALSE; /* kill all multiples */
Hand coded:
.forkloop
ld a,SIZE/256 ; while hl = k <= SIZE
cp h
jr c, endfork
jp nz, contfork
ld a,SIZE%256
cp l
jr c, endfork
.contfork
push hl
add hl,bc
ld (hl),0 ; flags[k] = FALSE
pop hl
add hl,de ; hl = k += prime
jp forkloop
Compiler:
?0026:
PUSH IY
POP BC
LD HL,8190
OR 128
SBC HL,BC
JP PO,?0033
XOR H
?0033:
JP M,?0025
?0027:
LD HL,flags
PUSH IY
POP BC
ADD HL,BC
LD (HL),0
EXX
PUSH BC
EXX
POP BC
ADD IY,BC
JR ?0026
The part that sets the flag to false in the hand-coded version (contfork to the jump) is 63 cycles. The same part in the compiled version (?0027 to the JR) is 112 cycles. Now think of how many times this portion is executed -- 7 to 8 thousand times. The hand-coded version, just from this portion of the program, will be 350,000 to 400,000 cycles faster than the compiler version.
Now consider the same when thinking of the difference between a hand-coded set of z80 libraries versus the C compiled set of libraries. Yes, loops will be run for less than several thousand times, as in this case, but you will be saving 10s to hundreds to thousands of cycles per library call depending on what library function it is.
Quote:
|
Plus, those routines forces the use of specifical conventions such as pushing on stack or saving data on registers before the calls to the hand code. this causes another overhead, tipically.
|
*All* C compilers must collect parameters and push them on the stack before calling any function, library or not. z88dk has a special calling convention where if there is only one parameter, HL can be used for the parameter instead of the stack. I believe SDCC and hitech have similar arrangements. However z88dk also has a CALLEE linkage convention for library functions with more than one parameter, explanation follows.
A normal C function call looks like:
myfunc(int a, int b, int c);
ld hl,_a ; collected somehow
push hl
ld hl,_b
push hl
ld hl,_c
push hl
call myfunc
pop bc ; clean up stack
pop bc
pop bc
Some of the compilers will use IX as a frame pointer, and do something like "ld sp,ix" to cleanup the stack, but using index registers for stack frame is *usually* slower than just pushing/popping.
Inside the C function you must collect parameters and keep stack in the same state before exiting:
myfunc:
pop af
pop bc ; bc = _c
pop de ; bc = _b
pop hl ; hl = _a
push hl ; now restore stack
push de
push bc
push af
; do stuff
ret
Notice all the waste: three pops to restore the stack after the call, four pushes inside the func to keep the stack in a known state. Per function call, this adds 73 cycles to execution time. Per function call this adds 3 bytes to the program size -- this adds up fast!. And this is a best case scenario since no compiled C function is going to be able to collect params off the stack once and intelligently use the registers so that they only need to be collected the one time. Some C compilers will resort to use of IX or IY -- really slow alternatives.
However, if a function can be declared CALLEE, meaning the called function is responsible for cleaning up the stack, then the caller doesn't have to. Then we have this case:
... ; push params on stack
call myfunc
... ; no stack cleanup
myfunc:
pop hl
pop bc
pop de
ex (sp),hl
; do stuff
ret
We save 3 bytes per call and 4 bytes inside the function. Our execution time is reduced by those 73 cycles mentioned above. This is the CALLEE convention in z88dk. I know that a few other compilers have special call linkages but I don't believe they have this one.
About concern for pushing params on stack for common things like integer comparison, etc, this doesn't happen -- the compiler keeps current parameters in registers and simply inserts CALLs to perform the comparisons. The CALL costs 3 bytes, an inlined comparison costs arund 12. Do that a few hundred times and think about your program size :-) The CALL is more expensive (but I assert the hand-optimized code will be better inside the CALL though probably not enough to surpass the savings from inlining for such a simple example) but the amount of time spent in the C code should be less than the time spent in library code. And as they say -- spend your time optimizng the 10% of the code executed 90% of the time if you want to see real speed up, not the other way around.