Compiler Speed Comparison

Confronted with the extreme increase in build times with newer gcc versions on an iMac with an 800 MHz G4 (7450), I started to test both the build performance of the compilers and the runtime performance of the code they produce. When I talked about that, folks immediately asked what flags I use, and once again I was told about the wonderful -Os flag: faster compile time, smaller code, and code that is not much slower than -O3, or even faster.

I have a hard time really believing that: if it were universally true, the gcc folks would deprecate -O3 altogether. Searching the net on this topic led me to a forum where someone remarked that a little program (a bubblesort of a not-too-small array) performs better with -Os. I thought I'd put that to the test on the variety of platforms I have access to. This will eventually also include the Compaq C compiler for Alpha as well as Intel's.

The beginning was made by my Thinkpad X31 with its 1.4 GHz Pentium M. The OS is a current SourceMage GNU/Linux with gcc-4.0.3 as the main compiler and gcc-3.4 additionally installed; every system library is compiled with 4.0.3.
The compiler optimizes for the Pentium M per default, so the additional flags I tested were -O, -O2, -O3 and -Os.

Bubblesort

The program really is the bare bubblesort that any CS student learns in the first year, after having forgotten the lesson about it from school some years before, just with the creation of the initial unsorted array of random numbers added (strictly speaking, the code below is a selection sort, but I'll stick with the name from the forum post):

#include <stdlib.h>

/* for every position i, find the minimum of the remaining
   elements and swap it into place */
void bubsort(int *t, int taille){
   int i,j,ind_min;
   long tmp,min;
   for(i=0;i<taille;i++)
   {
     min=t[i]; ind_min=i;
     for(j=i;j<taille;j++)
     {
       if(t[j]<min)
       {
         min=t[j]; ind_min=j;
       }
     }
     tmp = t[ind_min]; t[ind_min]=t[i]; t[i]=tmp;
   }
}

int main()
{
   int i,j;
   int *t;
   srandom(0); /* fixed seed: every binary sorts the same sequence of numbers */
   for(j=0;j<1;j++) /* outer loop kept for possible repetitions; runs once here */
   {
     /* allocate and fill an array of 100000 random integers */
     t = (int*)malloc(100000*sizeof(long));
     for(i=0;i<100000;i++)
       t[i] = random();
     bubsort(t,100000);
     free(t);
   }
   return(0);
}

Measuring build time for that doesn't really make sense, so I just looked at the runtime.

compiler    opt flag   real    user    sys
gcc-4.0.3   -O         21.6s   21.6s   0.0s
gcc-4.0.3   -O2        18.0s   18.0s   0.0s
gcc-4.0.3   -O3        18.0s   18.0s   0.0s
gcc-4.0.3   -Os        14.4s   14.4s   0.0s
gcc-3.4.6   -O3        18.0s   18.0s   0.0s
gcc-3.4.6   -Os        14.4s   14.4s   0.0s

That first glimpse seems to be the proof for the almighty -Os: it is indeed way faster here. Also, you cannot really tell one compiler from the other on that simple program.

mpg123 svn rev 523

I also tested real-world C performance using the well-known mpg123 MPEG audio decoder. I built the generic CPU variant to really measure the compilers' code generation and not the good work of the people who hacked the assembler routines.

Build time

The time for the build, including the configure step:

CFLAGS=.. CC=.. ./configure --with-optimization=0 --with-cpu=generic && make

compiler    opt flag   real     user    sys
gcc-3.4.6   -O3        13.1s    10.6s   1.5s
gcc-3.4.6   -Os        11.4s    9.0s    1.4s
gcc-4.0.3   -O3        12.1s    9.6s    1.5s
gcc-4.0.3   -Os        10.5s    8.1s    1.4s

Observe that gcc-4.0.3 compiles the code faster, and that -Os decreases compile time for both compilers.

Runtime

   cat > script.sh
   #!/bin/bash

   file=$1
   out=$(basename $file)

   for i in mpg123*
   do
     echo $i
     # run each binary once untimed to warm the caches, then time the second run
     ./$i -t "$file" 2>/dev/null 1>/dev/null
     time ./$i -t "$file" 2>/dev/null 1>/dev/null
   done

   bash script.sh /home/charly_lownoise_\&_mental_theo/1996_on_air/10-are_we_doin_it.mp3

(Well, for testing speed, one should pick a fast track, right? ;-)

compiler    opt flag   real     user    sys
gcc-3.4.6   -O3        2.36s    2.33s   0.03s
gcc-3.4.6   -Os        2.62s    2.61s   0.02s
gcc-4.0.3   -O3        2.34s    2.31s   0.03s
gcc-4.0.3   -Os        2.51s    2.49s   0.02s

It is no big difference, but gcc-4.0.3 is consistently faster, with a hint of its -Os output having improved a bit more. I'll be careful not to read too much into this very short test, but what you can tell is that -Os is not faster than -O3 here; it is slower, and that difference is more certain than the two hundredths of a second between the -O3 builds of the two compilers.

pep-tracer svn rev 1016

Now to something bigger:

This is the software suite I am writing for my thesis in physics. It computes trajectories of passive particles in a given time-dependent 3D wind field. The central program, ensemble_run, mainly does floating-point work: coordinate transformations involving trigonometric functions, a 4-step Runge-Kutta integration, and some memory/search work to find the wind velocity values in the data grid.
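
To give an idea of what the compilers have to chew on here: a single 4-step Runge-Kutta step advances a particle position using the local wind velocity as its time derivative. The following is only a minimal sketch of that scheme, not the actual pep-tracer code; the wind_at() function is a made-up analytic stand-in for the real grid search and interpolation:

#include <stdio.h>

/* a 3D vector for particle positions and wind velocities */
typedef struct { double x, y, z; } vec3;

/* r = a + s*b */
static vec3 vadd_scaled(vec3 a, double s, vec3 b)
{
   vec3 r;
   r.x = a.x + s * b.x;
   r.y = a.y + s * b.y;
   r.z = a.z + s * b.z;
   return r;
}

/* made-up analytic wind field; the real program searches and interpolates
   the velocity in a time-dependent gridded data set */
static vec3 wind_at(vec3 p, double t)
{
   vec3 v;
   v.x = 10.0 + 0.001 * t;
   v.y = 0.0005 * p.z;
   v.z = 0.01;
   return v;
}

/* one classical 4-step Runge-Kutta step: advance position p over dt,
   with dp/dt given by the wind velocity at (p,t) */
static vec3 rk4_step(vec3 p, double t, double dt)
{
   vec3 k1 = wind_at(p, t);
   vec3 k2 = wind_at(vadd_scaled(p, 0.5 * dt, k1), t + 0.5 * dt);
   vec3 k3 = wind_at(vadd_scaled(p, 0.5 * dt, k2), t + 0.5 * dt);
   vec3 k4 = wind_at(vadd_scaled(p, dt, k3), t + dt);
   /* p + dt/6 * (k1 + 2*k2 + 2*k3 + k4) */
   vec3 sum = vadd_scaled(vadd_scaled(k1, 2.0, k2), 1.0, vadd_scaled(k4, 2.0, k3));
   return vadd_scaled(p, dt / 6.0, sum);
}

int main()
{
   vec3 p = { 0.0, 0.0, 1000.0 }; /* arbitrary start position */
   double t;
   for(t = 0.0; t < 3600.0; t += 60.0) /* one hour in 60 s steps */
     p = rk4_step(p, t, 60.0);
   printf("final position: %g %g %g\n", p.x, p.y, p.z);
   return(0);
}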

Build time

The build is driven by one pre-created Makefile; a simple make command builds the whole software suite (not only ensemble_run). That's 121 files in all, with around 17000 lines of C++ code.
So, these are the compile times for all of it together:

CXX=... CXXFLAGS=... make

compiler    opt flag   real     user    sys
gcc-3.4.6   -O3        46.3s    42.1s   2.3s
gcc-3.4.6   -Os        40.7s    38.6s   2.0s
gcc-4.0.3   -O3        35.4s    33.2s   2.1s
gcc-4.0.3   -Os        33.6s    31.5s   2.1s

The newer gcc clearly takes the lead here, with the -Os compile again being faster for both compilers.

Runtime

The really important part is the performance of the resulting binaries:

for i in ensemble_run.gcc*
do
  ./$i rk4 testdata/$i testdata/start_d150.nc 50000 0 1 0.1 100 20 1974/wind/echogisp_testwind_197401_press_after.nc
  echo this was $i
done

The program itself computes a performance index in the form of the average number of numerical integration steps per second, called single steps per second, or ss/s for short. Higher is better, obviously.
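
Just to make the metric concrete, this is only a minimal sketch of how such a figure can be obtained, not the actual pep-tracer code (which of course counts its real integration steps): count the single steps and divide by the seconds spent on them.

#include <stdio.h>
#include <time.h>

int main()
{
   long steps = 0;
   volatile double x = 0.0; /* keeps the dummy work from being optimized away */
   clock_t start;
   double seconds;

   start = clock();
   while(steps < 100000000L) /* stand-in for the real integration loop */
   {
     x += 1e-9 * (double)steps;
     steps++;
   }
   seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
   if(seconds > 0.0)
     printf("%.1f ss/s\n", (double)steps / seconds);
   return(0);
}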

compiler    opt flag   speed in ss/s
gcc-3.4.6   -O3        7795.71
gcc-3.4.6   -Os        7009.2
gcc-4.0.3   -O3        11356.2
gcc-4.0.3   -Os        10640.1

That leaves gcc-3.4 far behind: it delivers only 69% of the performance (7795.71 / 11356.2 ≈ 0.69)! There must have been a lot to improve for C++ code... But it could also be that simply the optimization for this particular CPU improved that much (I suspect the situation on PPC is not that glorious).

Again, -Os didn't do its magic on performance when compared to -O3.

Conclusion

Against my expectation, gcc-4.0.3 proved to be superior, or at least not worse, both in build time and in the runtime of the generated binaries. For this x86 box, it seems to be a clear win. I'll have to see how it looks on the iMac, though.

The -Os flag didn't show its miracle of producing both smaller and faster code, at least not for the real-life applications. The bubblesort showed a significant improvement with -Os, but that seems to be true only because it is a very small program. Here is the size reduction of the dynamically linked binaries, after strip --strip-unneeded, together with the change in runtime performance:

program                   compiler    -O3 size / bytes   -Os size / bytes   saved   runtime performance
bubblesort                gcc-3.4.6   3252               3092               5%      +29%
bubblesort                gcc-4.0.3   3236               3108               4%      +29%
mpg123                    gcc-3.4.6   139324             113820             18%     -10%
mpg123                    gcc-4.0.3   140412             112412             20%     -7%
pep_tracer/ensemble_run   gcc-3.4.6   273672             265032             3%      -10%
pep_tracer/ensemble_run   gcc-4.0.3   273224             265052             3%      -6%
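
To be clear about the columns: "saved" is the relative size reduction going from -O3 to -Os, and "runtime performance" compares the speed of the -Os build against the -O3 build (positive meaning -Os runs faster). For mpg123 built with gcc-3.4.6, for example:

   saved               = (139324 - 113820) / 139324 ≈ 18%
   runtime performance = 2.36s / 2.62s - 1 ≈ -10%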

Well, the binaries are always smaller with -Os, but not always significantly so. The synthetic bubblesort shows an impressive 29% speedup, but the picture with the real-world applications is totally different: there you gain from 3% up to 20% in size reduction for a performance hit of 6% to 10%. You cannot really make a general decision from that, though.

The 20% reduced size of mpg123 may well make up for the 6% performance hit for you, especially when you just want to listen to the music in real time, not decode it as fast as possible.

But for the numerical trajectory integration, I really do not care about 3% less binary size when I can get 6% to 10% more speed. More speed means more data produced with the limited CPU time on a shared box / cluster.