c compiler comparison (Development, MSX Forums) - MSX Resource Center
                       
Development - c compiler comparison


Alcoholics_Anonymous
msx friend
Posts: 10
Posted: April 03 2007, 23:34
Quote:

Really curious, I was expecting better scores from z88dk...after all seems a compiler specifically oriented to Zilog cpus.



I come at it from a different angle. z88dk is built on small-C, which does not attempt to be a fully modern optimizing compiler like, e.g., sdcc. That's not the same as saying optimization is not attempted -- it is -- but the small-C architecture means fewer opportunities to do it. However, I will claim that it makes for smaller code on medium to large projects, and this is because of its small-C nature: it calls hand-coded asm subroutines to do everything, whereas sdcc and others will inline a lot of code. At first glance one might think, yeah, and that makes things faster, but it is my view that all z80 code generated by these compilers is poor to average and pales in comparison with hand-coded asm by any experienced z80 programmer. For this reason I prefer z88dk's approach of providing libraries of hand-coded z80 and using the compiler to generate code for C that is only acting as glue, with the compiled code small rather than fast. The speed gains are in the hand-coded libraries, where most execution time is spent.

There is also a large difference in the level of library support between the compilers. z88dk has by far the most library support; most of it is written in z80 asm, and within the next year all of it will be hand-coded z80. For comparison, sdcc has maybe a handful of hand-coded z80 asm subroutines, mainly for the basic multiply, divide, etc. operations used by the compiler, whereas z88dk has hand-coded asm for those things but also for strings, stdlib, malloc, the floating point lib, etc. This is the difference between a compiler specifically for the z80 and a general purpose one :-)

Compare some of the library source code:

sdcc:
http://sdcc.svn.sourceforge.net/viewvc/sdcc/trunk/sdcc/device/lib/

z88dk string:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/strings/

z88dk stdlib:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/stdlib/

z88dk malloc:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/malloc/

Among the above you will find many lib routines not implemented in other z80 C compilers and z88dk has some unique things as well:

z88dk abstract data types:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/adt/

z88dk IM2 mode:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/im2/

z88dk software sprite engine SP1:
http://z88dk.cvs.sourceforge.net/z88dk/z88dk/libsrc/sprites/software/sp1/

You will also notice that z88dk places no restrictions on register usage in library routines and there is special function call linkage for library routines written in asm that is more efficient (in memory and speed) than the usual C stack frame.

I do not want to dissuade anyone from using other compilers; I just wanted to present z88dk in a more favourable light here. I was in your shoes several years ago, wondering which compiler to look at, and eventually chose z88dk. I liked it so much, I joined the company so to speak :-D

However, there is no question that sdcc, hitech, IAR, etc. generate more optimized compiled C code and that z88dk can still be improved, but I maintain that I want small code out of the compiler and to gain the speed difference in the libraries, where most time is spent :-)

PingPong
msx master
Posts: 1069
Posted: April 04 2007, 08:16
Quote:


For this reason I prefer z88dk's approach of providing libraries of hand-coded z80 and using the compiler to generate code for C that is only acting as glue, with the compiled code small rather than fast. The speed gains are in the hand-coded libraries where most execution time is spent.



You forget that calls are costly; inline code is not.

The inline directive in C is there specifically to address this.

So you lose much of the speed gain you got from the hand-coded z80 library of z88dk.

Plus, those routines force the use of specific conventions, such as pushing data on the stack or saving it in registers before calling the hand-written code. This typically causes further overhead.
Alcoholics_Anonymous
msx friend
Posts: 10
Posted: April 04 2007, 09:59
Quote:


You forget that calls are costly; inline code is not.
The inline directive in C is there specifically to address this.
So you lose much of the speed gain you got from the hand-coded z80 library of z88dk.



No, I have not forgotten :-) Inlining in modern compilers is done for a totally different reason -- calls are expensive because they cause cache misses and flushing of the instruction pipeline. Inlining avoids this problem entirely. A much, much less important reason for inlining is to give the compiler more options to optimize the code by exposing the innards of a particular function to the surrounding calling context. But the real reason is the cache miss thing, as the gains from avoiding that are 1000:1 in comparison to the optimization reason.

In small machines such as the z80, with only 64k of memory space, inlining gains very little performance (the opportunity to optimize does present itself but, as I am about to show, the compilers generate crap code no matter how much opportunity you give them) and carries a large penalty in terms of the size of program you can fit into 64k. Imagine a simple <= compare for ints. Inlining may cost you 12 bytes or so, but a call costs 3 bytes. Multiply that by the 100 or so usages seen in a medium to large size program and your inlining has cost you 9*100 = 900 bytes of memory. And that's just one item!!
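The size argument above can be written down as a small C sketch. The 12-byte and 3-byte figures are the rough numbers from the post, not measurements, and `helper_bytes` (the one-time size of the shared comparison routine) is a hypothetical parameter added here so the cost of the helper itself is not ignored.

```c
/* Extra bytes spent when an operation is expanded inline at every use
   instead of being replaced by a CALL to one shared helper routine. */
long inline_extra_bytes(long uses, long inline_bytes, long call_bytes)
{
    return uses * (inline_bytes - call_bytes);
}

/* Net saving of the CALL approach once the helper routine itself,
   which exists only once in the program, is paid for.
   helper_bytes is an assumed figure supplied by the caller. */
long net_bytes_saved(long uses, long inline_bytes, long call_bytes,
                     long helper_bytes)
{
    return inline_extra_bytes(uses, inline_bytes, call_bytes) - helper_bytes;
}
```

With 100 uses, 12 bytes inline and 3 bytes per call, `inline_extra_bytes` reproduces the 900 bytes quoted above; the helper's own size barely dents that figure.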

To see better the quality of code generated by the C compilers, I hand-coded a brief segment of the original C program in this thread:

    for (i = 0; i <= SIZE; i++)
    {
      if (flags[i])     /* found a prime */
      {
        prime = i + i + 3;  /* twice index + 3 */
        for (k = i + prime; k <= SIZE; k += prime)
          flags[k] = FALSE; /* kill all multiples */
        count++;        /* primes found */
      }
    }
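For anyone who wants to compile this segment on its own, here is a minimal self-contained sketch in C. `SIZE` is taken to be 8190, the constant visible in the compiler output further down; the flag initialisation and the function name `sieve_once` are assumptions added so the fragment forms a complete unit.

```c
#define SIZE  8190
#define TRUE  1
#define FALSE 0

static char flags[SIZE + 1];

/* One pass of the sieve; returns the number of primes found.
   Index i stands for the odd number 2*i + 3. */
int sieve_once(void)
{
    int i, k, prime, count = 0;

    for (i = 0; i <= SIZE; i++)      /* assumed setup: all candidates live */
        flags[i] = TRUE;

    for (i = 0; i <= SIZE; i++) {
        if (flags[i]) {              /* found a prime */
            prime = i + i + 3;       /* twice index + 3 */
            for (k = i + prime; k <= SIZE; k += prime)
                flags[k] = FALSE;    /* kill all multiples */
            count++;                 /* primes found */
        }
    }
    return count;
}
```

Calling `sieve_once()` from a `main` and printing the result is enough to compare builds from different compilers.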


After five minutes of effort, I came up with this hand-coded version:

   ld bc,0                     ; bc = i

.foriloop

   ld a,SIZE/256               ; while bc = i <= SIZE
   cp b
   jr c, endfori
   jp nz, contfori
   
   ld a,SIZE%256
   cp c
   jr c, endfori

.contfori  

   ld hl,flags
   add hl,bc
   ld a,(hl)
   or a
   jr z, loopfori

   ld hl,3                     ; de = prime = 2*i + 3
   add hl,bc
   add hl,bc
   ex de,hl
   
   ld l,c                      ; hl = k = i + prime
   ld h,b
   add hl,de
   
   push bc
   ld bc,flags                 ; bc = base address of flags[]
   
.forkloop

   ld a,SIZE/256               ; while hl = k <= SIZE
   cp h
   jr c, endfork
   jp nz, contfork
   
   ld a,SIZE%256
   cp l
   jr c, endfork

.contfork

   push hl
   add hl,bc
   ld (hl),0                   ; flags[k] = FALSE
   pop hl
   
   add hl,de                   ; hl = k += prime
   jp forkloop

.endfork

   inc ix                      ; count++
   pop bc                      ; bc = i

.loopfori
   
   inc bc
   jp foriloop

.endfori


And here is the output for the first compiler tested:

?0015:
	LD	DE,0
?0020:
	LD	HL,8190
	OR	128
	SBC	HL,DE
	JP	PO,?0032
	XOR	H
?0032:
	JP	M,?0019
?0021:
	LD	HL,flags
	ADD	HL,DE
	LD	A,(HL)
	OR	A
	JR	Z,?0024
?0023:
	LD	L,E
	LD	H,D
	INC	HL
	ADD	HL,HL
	INC	HL
	PUSH	HL
	EXX
	POP	BC
	EXX
	ADD	HL,DE
	PUSH	HL
	POP	IY
?0026:
	PUSH	IY
	POP	BC
	LD	HL,8190
	OR	128
	SBC	HL,BC
	JP	PO,?0033
	XOR	H
?0033:
	JP	M,?0025
?0027:
	LD	HL,flags
	PUSH	IY
	POP	BC
	ADD	HL,BC
	LD	(HL),0
	EXX
	PUSH	BC
	EXX
	POP	BC
	ADD	IY,BC
	JR	?0026
?0025:
	EXX
	INC	DE
	EXX
?0024:
	INC	DE
	JR	?0020
?0019:
	INC	(IX-2)
	JP	NZ,?0012
	INC	(IX-1)
	JP	?0012


I'm interested in comparing the innermost loop contents:

        for (k = i + prime; k <= SIZE; k += prime)
          flags[k] = FALSE; /* kill all multiples */


Hand coded:

.forkloop

   ld a,SIZE/256               ; while hl = k <= SIZE
   cp h
   jr c, endfork
   jp nz, contfork
   
   ld a,SIZE%256
   cp l
   jr c, endfork

.contfork

   push hl
   add hl,bc
   ld (hl),0                   ; flags[k] = FALSE
   pop hl
   
   add hl,de                   ; hl = k += prime
   jp forkloop


Compiler:

?0026:
	PUSH	IY
	POP	BC
	LD	HL,8190
	OR	128
	SBC	HL,BC
	JP	PO,?0033
	XOR	H
?0033:
	JP	M,?0025
?0027:
	LD	HL,flags
	PUSH	IY
	POP	BC
	ADD	HL,BC
	LD	(HL),0
	EXX
	PUSH	BC
	EXX
	POP	BC
	ADD	IY,BC
	JR	?0026


The part that sets the flag to false in the hand-coded version (contfork to the jump) takes 63 cycles. The same part in the compiled version (?0027 to the JR) takes 112 cycles, a difference of 49 cycles per iteration. Now think of how many times this portion is executed -- 7 to 8 thousand times. The hand-coded version, just from this portion of the program, will be 350,000 to 400,000 cycles faster than the compiler version.
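The cycle figures can be tallied, and the iteration count measured rather than estimated, with a short C sketch. The per-instruction values are the standard Z80 T-state timings without the extra M1 wait state an MSX adds, which is the assumption that matches the 63 and 112 quoted above. `SIZE` is assumed to be 8190 and all function names are illustrative.

```c
#define SIZE 8190

/* Hand-coded body:
   push hl; add hl,bc; ld (hl),0; pop hl; add hl,de; jp forkloop */
static const int hand_tstates[] = { 11, 11, 10, 10, 11, 10 };

/* Compiled body:
   LD HL,flags; PUSH IY; POP BC; ADD HL,BC; LD (HL),0;
   EXX; PUSH BC; EXX; POP BC; ADD IY,BC; JR ?0026 */
static const int compiled_tstates[] = { 10, 15, 10, 11, 10, 4, 11, 4, 10, 15, 12 };

static int sum(const int *t, int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++)
        total += t[i];
    return total;
}

int hand_coded_cycles(void)
{
    return sum(hand_tstates,
               (int)(sizeof hand_tstates / sizeof hand_tstates[0]));
}

int compiled_cycles(void)
{
    return sum(compiled_tstates,
               (int)(sizeof compiled_tstates / sizeof compiled_tstates[0]));
}

/* Runs one pass of the sieve and counts how often the body of the
   inner loop (flags[k] = FALSE) is executed. */
long inner_loop_iterations(void)
{
    static char flags[SIZE + 1];
    long iterations = 0;
    int i, k, prime;

    for (i = 0; i <= SIZE; i++)
        flags[i] = 1;

    for (i = 0; i <= SIZE; i++) {
        if (flags[i]) {
            prime = i + i + 3;
            for (k = i + prime; k <= SIZE; k += prime) {
                flags[k] = 0;
                iterations++;
            }
        }
    }
    return iterations;
}

/* Cycles saved in one pass by the hand-coded inner-loop body alone. */
long cycles_saved_per_pass(void)
{
    return inner_loop_iterations()
         * (long)(compiled_cycles() - hand_coded_cycles());
}
```

Printing `inner_loop_iterations()` and `cycles_saved_per_pass()` gives the actual figures for one pass, so the estimate can be checked directly.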

Now apply the same reasoning to the difference between a hand-coded set of z80 libraries and a C-compiled set of libraries. Yes, library loops will usually run fewer than the several thousand iterations seen in this case, but you will be saving tens to hundreds to thousands of cycles per library call, depending on which library function it is.

Quote:


Plus, those routines force the use of specific conventions, such as pushing data on the stack or saving it in registers before calling the hand-written code. This typically causes further overhead.



*All* C compilers must collect parameters and push them on the stack before calling any function, library or not. z88dk has a special calling convention where if there is only one parameter, HL can be used for the parameter instead of the stack. I believe SDCC and hitech have similar arrangements. However z88dk also has a CALLEE linkage convention for library functions with more than one parameter, explanation follows.

A normal C function call looks like:

myfunc(int a, int b, int c);

ld hl,_a   ; collected somehow
push hl
ld hl,_b
push hl
ld hl,_c
push hl
call myfunc
pop bc    ; clean up stack
pop bc
pop bc


Some of the compilers will use IX as a frame pointer and do something like "ld sp,ix" to clean up the stack, but using index registers for the stack frame is *usually* slower than just pushing/popping.

Inside the C function you must collect parameters and keep stack in the same state before exiting:

myfunc:

   pop af
   pop bc   ; bc = _c
   pop de   ; de = _b
   pop hl   ; hl = _a
   push hl   ; now restore stack
   push de
   push bc
   push af

   ; do stuff

   ret


Notice all the waste: three pops to restore the stack after the call, four pushes inside the func to keep the stack in a known state. Per function call, this adds roughly 73 cycles to execution time and 3 bytes to the program size -- this adds up fast! And this is a best-case scenario, since no compiled C function is going to be able to collect its params off the stack once and use the registers intelligently enough that they never need to be collected again. Some C compilers will resort to IX or IY -- really slow alternatives.

However, if a function can be declared CALLEE, meaning the called function is responsible for cleaning up the stack, then the caller doesn't have to. Then we have this case:

...   ; push params on stack
call myfunc
...   ; no stack cleanup

myfunc:

   pop hl
   pop bc
   pop de
   ex (sp),hl

   ; do stuff

   ret


We save 3 bytes per call and 4 bytes inside the function. Our execution time is reduced by most of those 73 or so cycles mentioned above (ex (sp),hl is a little slower than the pop it replaces). This is the CALLEE convention in z88dk. I know that a few other compilers have special call linkages but I don't believe they have this one.
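To make the comparison concrete, the stack-shuffling instructions in the two listings above can be tallied in a small C sketch. It assumes the standard Z80 timings (POP 10, PUSH 11, EX (SP),HL 19 T-states, each a one-byte instruction) and counts only the instructions that differ between the two linkages; the pushes of the parameters, the CALL and the RET are common to both. The struct and function names are illustrative.

```c
/* Standard Z80 timings in T-states; each instruction here is 1 byte. */
enum { T_POP = 10, T_PUSH = 11, T_EX_SP_HL = 19 };

struct overhead {
    int tstates;   /* cycles spent shuffling the stack */
    int bytes;     /* code bytes used by those instructions */
};

/* Usual C linkage with three parameters:
   callee: pop af, pop bc, pop de, pop hl, then four pushes to restore;
   caller: three pops after the call returns. */
struct overhead caller_cleanup_overhead(void)
{
    struct overhead o;
    o.tstates = 4 * T_POP + 4 * T_PUSH + 3 * T_POP;
    o.bytes   = 4 + 4 + 3;
    return o;
}

/* CALLEE linkage with three parameters:
   callee: pop hl, pop bc, pop de, ex (sp),hl;
   caller: nothing after the call. */
struct overhead callee_overhead(void)
{
    struct overhead o;
    o.tstates = 3 * T_POP + T_EX_SP_HL;
    o.bytes   = 3 + 1;
    return o;
}
```

Subtracting the two results gives the per-call saving. The byte difference matches the 3 + 4 bytes stated above, while the cycle difference comes out a little below the raw cost of the removed pops and pushes because `ex (sp),hl` is slower than a plain pop.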

As for the concern about pushing params on the stack for common things like integer comparison, this doesn't happen -- the compiler keeps current parameters in registers and simply inserts CALLs to perform the comparisons. The CALL costs 3 bytes; an inlined comparison costs around 12. Do that a few hundred times and think about your program size :-) The CALL is more expensive in time (I assert the hand-optimized code inside the CALL will be better, though probably not enough to beat the savings from inlining in such a simple example), but the amount of time spent in the C code should be less than the time spent in library code. And as they say -- spend your time optimizing the 10% of the code executed 90% of the time if you want to see real speed-up, not the other way around.


AuroraMSX

msx master
Posts: 1277
Posted: April 06 2007, 12:23
Quote:

*All* C compilers must collect parameters and push them on the stack before calling any function, library or not.


And this is where e.g. TurboPascal (yeah, I know, Pascal is not C) has the advantage. By default, TP doesn't use the stack for transferring parameters but stores them at fixed addresses: every single parameter of every single function has its own spot in RAM. And in a single-processor, non-threaded environment without the need for recursion, this is a much faster solution than using the stack, albeit at the cost of RAM usage.

There is a compiler directive in TP to explicitly make the compiler use the stack for parameter transfer, so that implementing recursive functions is actually possible: {$A-} if memory serves me well...


[D-Tail]

msx guru
Posts: 3026
Posted: April 06 2007, 19:05
Using the stack to store function parameters is generally a bad idea imho. It's often used, and really, it makes the life of the compiler coder much easier. But I don't accept 'ease of coding' as an excuse for a large hit to speed... Programming in a higher-level language already comes with a speed penalty!
ARTRAG
msx master
Posts: 1802
Posted: April 06 2007, 19:14
I remember that some modern C cross-compilers for the z80 (IAR, maybe, but I must check) allow the use of static RAM areas for parameters when an appropriate option is enabled.
I also remember that the compiler was able to reuse the same static RAM area for more than one function, provided that the functions execute at different times, without calling each other directly or indirectly.
Naturally, with this option active you cannot use recursion, but usually this isn't a real loss.

PingPong
msx master
Posts: 1069
Posted: April 06 2007, 20:52
And you lose the recursion facility...
Alcoholics_Anonymous
msx friend
Posts: 10
Posted: April 10 2007, 09:57
Yeah, I do not like the idea of using static RAM for function parameters. First, as everyone notes, you lose reentrancy, so this no longer conforms to the C standard and programs may no longer work. Second, the amount of RAM required gets HUGE with any reasonably sized program. And third, if you try to do what was suggested and overlap the static areas reserved by functions that do not execute at the same time, this becomes a difficult problem once function pointers are involved.

Anyway you *can* do exactly what you want by using global variables to pass function parameters, with similar pitfalls.
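A minimal C sketch of that pitfall, assuming a hypothetical factorial routine whose single parameter lives in one fixed RAM slot (`arg_n`) instead of on the stack. All names are made up for illustration.

```c
/* Ordinary version: each activation gets its own n on the stack. */
unsigned fact_stack(unsigned n)
{
    return n <= 1 ? 1 : n * fact_stack(n - 1);
}

/* Fixed RAM slot standing in for the parameter of fact_static(). */
static unsigned arg_n;

/* Same algorithm, but the parameter is passed through arg_n.
   Evaluating n * fact(n - 1) means storing n - 1 into the slot,
   calling, and then reading n back from the slot, which the nested
   calls have overwritten by then. */
unsigned fact_static(void)
{
    unsigned sub;

    if (arg_n <= 1)
        return 1;
    arg_n = arg_n - 1;       /* "pass" n - 1 to the recursive call */
    sub = fact_static();
    return arg_n * sub;      /* arg_n is no longer this call's n */
}

/* Convenience wrapper: load the slot, then call. */
unsigned fact_via_static_slot(unsigned n)
{
    arg_n = n;
    return fact_static();
}
```

`fact_stack(4)` yields 24, while `fact_via_static_slot(4)` does not, because every level of the recursion shares the one slot. Non-recursive calls work fine, which is why the scheme is tempting on a small machine.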

The difficulty with the z80 (and other 8-bit micros) is that there are very few registers available and they are not orthogonal, so it's very difficult to pass parameters in registers. The other difficulty is that C functions must also be callable through function pointers -- how is a C compiler to know which registers should be loaded up with what contents when all it knows is the parameter list for a given function pointer call? Is it optimal for all functions with three parameters to be called with HL = 1st param, DE = 2nd param, BC = 3rd param? If all three params are the result of mathematical formulas and since HL is the only 16-bit accumulator that must be involved in all such computations, isn't there a significant overhead in juggling register contents to suit the HL,DE,BC list when compared to just pushing results on the stack as they are computed and passing the params that way into the function?
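The function pointer problem can be seen in a few lines of C. In this sketch (names are illustrative) the call inside `apply` is compiled knowing only the pointer type, so whatever convention is used to hand over `a` and `b` has to be the same for every function that could ever be stored in the table.

```c
/* Any function taking two ints and returning an int. */
typedef int (*binop)(int, int);

int op_add(int a, int b) { return a + b; }
int op_sub(int a, int b) { return a - b; }
int op_mul(int a, int b) { return a * b; }

static const binop ops[] = { op_add, op_sub, op_mul };

int apply(int which, int a, int b)
{
    /* The target is chosen at run time; the compiler cannot tailor
       the parameter passing to the particular callee here. */
    return ops[which](a, b);
}
```

A per-function register assignment would only be safe for functions whose address is never taken, which is one reason special linkages tend to be reserved for library routines called directly by name.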

For special cases, I agree, tell the compiler what to do. These are the special call linkages that all compilers support. However, this can normally only be done with functions that are coded in assembler (ie a library routine or similar) since the C code generator requires parameters to be available on the stack in the usual way.

PingPong
msx master
Posts: 1069
Posted: April 10 2007, 13:32
Quote:


For special cases, I agree, tell the compiler what to do. These are the special call linkages that all compilers support. However, this can normally only be done with functions that are coded in assembler (ie a library routine or similar) since the C code generator requires parameters to be available on the stack in the usual way.



To Alcoholics_Anonymous:

Please check this link; it would be nice to know how the source there performs on z88dk.
The source is standard C, with only small modifications needed:
http://www.msx.org/forumtopicl7228.html
 
 







(c) 1994 - 2009 MSX Resource Center Foundation. MSX is a registered trademark of MSX Licensing Corporation