[fpc-pascal]Performance: Free Pascal vs GNU Pascal

Discussion:

Mark Emerson

2002-12-29 04:49:21 UTC

How fast is the code generated by FPC versus GPC? Have there been any
benchmarks comparing FPC with GPC....especially on Windows...in other
words how well are these compilers optimizing?

I haven't written Pascal code in years...but the Borland compilers made
code that "screamed"....on those old platforms. While I realize a
comparison might not be possible, how does FPC (and GPC) compare to the
old Borland style compilers?

Also, does anyone have a feeling of how the current Pascal optimizations
compare to the C compiler optimizations for various platforms? In other
words, is there any merit to the claim I'ver heard that the fastest
compiled code comes from C compilers?

My questions are loosely structured, and perhaps naive...but nonetheless
important as I make a fundamental decision regarding my development
environment. Any light on the subject would be appreciated.

MarkLDE

M***@Wisa.be

2002-12-29 09:44:05 UTC

Permalink

Post by Mark Emerson
How fast is the code generated by FPC versus GPC? Have there been any
benchmarks comparing FPC with GPC....especially on Windows...in other
words how well are these compilers optimizing?
I haven't written Pascal code in years...but the Borland compilers made
code that "screamed"....on those old platforms. While I realize a
comparison might not be possible, how does FPC (and GPC) compare to the
old Borland style compilers?

Look at
http://dada.perl.it/shootout/

Such statements are generally unusable and not verifiable. There exist some
benchmarks, but the quality of the algorithms is usually not very good; a C
programmer will not produce optimal pascal algorithms and vice versa.
You will see this if you look at some of the source files on the above site.

For example, the array access function test from the above site:

uses SysUtils, Classes;

var
n, i, k, last : longint;
X, Y : TList;
begin
if ParamCount = 0 then
n := 1
else
n := StrToInt(ParamStr(1));
if n < 1 then n := 1;
last := n - 1;
X := TList.Create;
X.Capacity := n;
For i := 0 To last do
X.Add( Pointer(i+1) );
Y := TList.Create;
Y.Capacity := n;
For i := 0 To last do
Y.Add( Pointer(0) );
For k := 0 To 999 do
begin
For i := last downto 0 do
begin
Y.Items[i] := Pointer(longint(Y.Items[i]) +longint(X.Items[i]));
end;
end;
Writeln (IntToStr(longint(Y.Items[0])), ' ', IntToStr(longint(Y.Items[last])));
end.

As you can see, a TList is used; which is VERY slow.

And the C version:

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[]) {
int n = ((argc == 2) ? atoi(argv[1]) : 1);
int i, k, *x, *y;

x = (int *) calloc(n, sizeof(int));
y = (int *) calloc(n, sizeof(int));

for (i = 0; i < n; i++) {
x[i] = i + 1;
}
for (k=0; k<1000; k++) {
for (i = n-1; i >= 0; i--) {
y[i] += x[i];
}
}

fprintf(stdout, "%d %d\n", y[0], y[n-1]);

free(x);
free(y);

return(0);
}

Obviously, these algorithms are not comparable at all.

A more 'correct' pascal algorithm would be

uses sysutils;

var
n, i, k, last : longint;
X, Y : PLongint;

begin
if ParamCount = 0 then
n := 1
else
n := StrToInt(ParamStr(1));
if n < 1 then n := 1;
last := n - 1;
GetMem(X,N*SizeOf(Longint));
GetMem(Y,N*SizeOf(Longint));
For i := 0 To last do
X[i]:=I+1;
For k := 0 To 999 do
For i:=last downto 0 do
Y[i]:=Y[i]+X[i];
Writeln (Y[0], ' ',Y[last]);
FreeMem(X);
FreeMem(Y);
end.

This code is about 3 times faster than the first algorithm.

If you need some really really fast code, you could even include assembler in your
code in critical parts.

Just to say that comparison is difficult.

Michael.

Mark Emerson

2002-12-29 15:39:23 UTC

Permalink

Post by M***@Wisa.be
Look at
http://dada.perl.it/shootout/

Thank you for this VALUABLE reference. Let me ask (if I may)...on
matrix multiplication, which appears to be a pretty clean, simple
algorithm, why do you suppose Microsoft C (vc, and also vc++) is more
than three times faster than FPC?

Mark Emerson

Peter Vreman

2002-12-29 16:07:33 UTC

Permalink

Post by Mark Emerson

Post by M***@Wisa.be
Look at
http://dada.perl.it/shootout/

The times 0.05 and 0.16 seconds also include startup code initialization.
You need to run the tests for at least 10 seconds to get representive
values.

I guess that the optimizer of VC uses more register variables. Also CSE
and out-of-loop moving of 'static' values are not implemented in FPC.

Florian Klaempfl

2002-12-29 16:12:46 UTC

Permalink

Post by M***@Wisa.be
Look at
http://dada.perl.it/shootout/
Thank you for this VALUABLE reference. Let me ask (if I may)...on
matrix multiplication, which appears to be a pretty clean, simple
algorithm, why do you suppose Microsoft C (vc, and also vc++) is more
than three times faster than FPC?
The times 0.05 and 0.16 seconds also include startup code initialization.
You need to run the tests for at least 10 seconds to get representive
values.
I guess that the optimizer of VC uses more register variables. Also CSE
and out-of-loop moving of 'static' values are not implemented in FPC.

No :), mainly the test is unfair :(

Marco van de Voort

2002-12-29 18:22:24 UTC

Permalink

Post by Florian Klaempfl

Post by Peter Vreman
I guess that the optimizer of VC uses more register variables. Also CSE
and out-of-loop moving of 'static' values are not implemented in FPC.

No :), mainly the test is unfair :(

The shootout is pretty bad test. (the algorithms are not the same, the
C (and other more popular languages) versions are almost all wizard
code, with compiler and architecture specific optimizations and the
Pascal versions don't even pass first grade criteria)

But I think focus on benchmarks is stupid. In practice, benchmarks
are hardly representative of real applications (see e.g. client side
Java with all its problems to scale an application, even though they
can perform quite good in simple arithmetic benchmarks)

Florian Klaempfl

2002-12-29 16:08:57 UTC

Permalink

Because it's unfair :( :

Delphi test code:

-------------------------------

program matrix;

const
SIZE = 30;

type TMatrix = array[0..SIZE-1, 0..SIZE-1] of integer;

procedure mkmatrix( rows, cols : integer; var mx : TMatrix);
var
R, C, count : integer;
begin
Dec(rows); Dec(cols);
count:=1;
for R:=0 to rows do
for C:=0 to cols do begin
mx[R, C] := count;
Inc(count);
end;
end;

procedure mmult(rows, cols: integer; const m1, m2: TMatrix; var mm: TMatrix);

[...]

var NUM, code, i: integer;
M1, M2, MM: TMatrix;
begin
NUM:=1;
if ParamCount=1 then Val(ParamStr(1),NUM,code);

mkmatrix(SIZE, SIZE, M1);
mkmatrix(SIZE, SIZE, M2);
for i:=0 to NUM do
mmult(size, size, M1, M2, MM);
WriteLn( MM[0, 0],' ',MM[2, 3],' ',MM[3, 2],' ',MM[4, 4]);
end.

-------------------------------

FPC test code:

program matrix;
uses SysUtils;

const
size = 30;

type tMatrix = array[0..size, 0..size] of longint;

procedure mkmatrix( rows, cols : integer; var mx : tMatrix);
var
R, C : integer;
count : longint;
begin
Dec(rows);
Dec(cols);
count := 1;
for R := 0 to rows do
begin
for C := 0 to cols do
begin
mx[R, C] := count;
Inc(count);
end;
end;
End;

procedure mmult(rows, cols : integer; m1, m2 : tMatrix; var mm : tMatrix );

[...]

var NUM, I : integer;
M1, M2, MM : tMatrix;

begin
if ParamCount = 0 then
NUM := 1
else
NUM := StrToInt(ParamStr(1));

if NUM < 1 then NUM := 1;

mkmatrix(size, size, M1);
mkmatrix(size, size, M2);

for I := 0 To NUM do
begin
mmult(size, size, M1, M2, MM);
end;
WriteLn( IntToStr(MM[0, 0]) + ' ' + IntToStr(MM[2, 3]) + ' ' +
IntToStr(MM[3, 2]) + ' ' + IntToStr(MM[4, 4]));
end.

-------------------------------

Note the matrix sizes and the calling conventions: passing by const is much
faster than parameter passing by value. Further, test with float, integer
matrix operations are very unlikely :)

Marco van de Voort

2002-12-29 13:31:10 UTC

Permalink

The main problem is what you want to optimize. Ordinary application
code, or tightly looped algoritms, e.g. numerical code?

GPC might be a bit better on purely numerical stuff. But this will only
appear in benchmarks, and maybe for some kinds of scientific
calculations. (iow, if you plan to do nothing than matrix operations,
and speed is already a major issue, GPC is an option)

Ordinary code performance also strongly depends on the speed of
the runtime, specially if you start using strings, exceptions etc.

Post by Mark Emerson
I haven't written Pascal code in years...but the Borland compilers made
code that "screamed"....on those old platforms. While I realize a
comparison might not be possible, how does FPC (and GPC) compare to the
old Borland style compilers?

Both are faster, since BP is not very optimizing. Keep in mind
though (specially with the go32v2 dos platform for both compiler)
that binary startup times for dos+ 32-bits extender or windows are
slower than 16-bits dos. This may be perceived as "slower" but it is
an OS problem.
(it is e.g. gone under Unix, while still all the code is generated by roughly the
same compiler)

There are two approaches to this question:

1. The compiler writers approach:

This is not true. There has been extensive discussion about this in
comp.lang.pascal.ansi-iso, with some people developping compilers
for a living (e.g. from Compaq).

Both languages are as fast when a certain minimal amount of
optimization is implemented in the compiler. Before that, C is
generally faster, simply because the average C programmer writes
more low level code.

After that, the quality is pretty much linear with the time spent on the
optimizer, regardless of being C or Pascal. (Pascal is actually
easier to develop an optimizer for, because it has less rules of
course)

2. The programmers approach.

The average Pascal code _is_ slower. But that is a problem (or
more likely, a deliberate design decision) of the writer of the
average code, not the language.

E.g. while in C people spell out e.g. string operations in terms of
pointers and chars (like you could in FPC/BP with pchar's and
strings unit), but the average programmer will use BP shortstrings or
Delphi ansistrings (FPC Delphi modes), which have a penalty of a
between 0 and 10,20 percents for some operations. (on equivalent
string code, not applicationwide). But Pascal strings work a lot
easier, and are less bug-prone.

If you want examples of this, (performance/ease of programming
tradeoffs) look e.g. in the FPC RTL. That is a large library where a
lot of work was done by good programmers, and it shows. This is
generally not harder than writing equivalent C code.

IMHO that is the strong side of the Borlandlike Pascal compilers.
One can achieve C speeds when necessary (if the compiler is good
enough), while using nicer language options for less speed
dependant parts.

Post by Mark Emerson
My questions are loosely structured, and perhaps naive...but nonetheless
important as I make a fundamental decision regarding my development
environment. Any light on the subject would be appreciated.

Free Pascal is good enough for nearly everything, performancewise.

If speed is a _real_ problem, the maximally ten percent overall speed
of recompiling your app with GPC or rewriting it in GCC, won't save
you, and you can't probably hack it with free compilers anyway, and
you'll enter the realm of specialised commercial compilers. (and for
scientific applications, that will be neither Pascal or C, but Fortran
most likely)

IOW choose the one that matches your criteria the best
(based on things like IDE, debugging, language features, source
code availability, development speed etc)

I know the ten percent speed difference sounds conservative, but,
contrary to benchmarks, it is my experience for larger applications.
(which also hold a large percentage of relatively unoptimizable
code).

--
For the GPC/FPC tradeoff, some people like GPC, some (more :-))
like FPC , so the advise is the same here as I give in newsgroups:

Play with both, and/or try to find some source code that is like the
application you want to make, and see how easy you can get it to
run with either. See if it matters performancewise (I doubt it)

For the rest: If you want to start using Delphi features in the future,
choose FPC. (and I promise, you will, they are nice :-)

Florian Klaempfl

2002-12-29 16:11:23 UTC

Permalink

You mean TP? You can't compare it because TP is 16 Bit and FPC and GPC 32 Bit
but on modern processors the code generated by FPC or GPC is lightyears ;) faster

Post by Mark Emerson
Also, does anyone have a feeling of how the current Pascal optimizations
compare to the C compiler optimizations for various platforms? In other
words, is there any merit to the claim I'ver heard that the fastest
compiled code comes from C compilers?
My questions are loosely structured, and perhaps naive...but nonetheless
important as I make a fundamental decision regarding my development
environment. Any light on the subject would be appreciated.

FPC is optimized for large programs with complex statements and heavy heap
usage like the compiler itself. OTOH, on modern processors the instruction
selection of the compiler isn't that important. Much more important is that
the programmer takes care of the hierachical memory structure and the
alignment requirements.

M***@Wisa.be

2002-12-29 16:46:36 UTC

Permalink

Post by Mark Emerson

Post by M***@Wisa.be
Look at
http://dada.perl.it/shootout/

I don't know. I do know that I've been able to speed up some of the other
algorithms by as much as 3 to 4 times; Others do not lend themselves to much
improvement. I haven't seen the C implementation, so I don't know how they
do it in C, maybe I could get a hint to make it work faster in FPC.

I think that this one should be answered by the compiler boys, as I'm not
very familiar with the way the code is generated.

Michael.

Stefan Ziegenbalg

2002-12-30 13:43:47 UTC

Permalink

Post by M***@Wisa.be
I don't know. I do know that I've been able to speed up some of the
other algorithms by as much as 3 to 4 times; Others do not lend
themselves to much improvement. I haven't seen the C implementation,
so I don't know how they do it in C, maybe I could get a hint to make
it work faster in FPC.

On relevant problems the matrices didn't fit into the cache. (If they
does, it is unsignificant whether multiplication takes 10ms or 100ms).

Let me ask how you multiply. I hope you do not perform

for i:=0 to n-1 do
for j:=0 to n-1 do
for k:=0 to n-1 do
a(i,j):=a(i,j) + b(k,j)*c(i,k)
| |
_column_ access row access

These BLAS 1 are optimal for [RC]ISC _without_ cache or if the matrices
fit into the cache.

BLAS2 routines looks like:
for i:=0 to n-1 do
for j:=0 to n-1 do
for k:=0 to n-1 do
a(i,k):=a(k,i) + b(j,k)*c(i,j)
| |
row access row access

They are optimal for vector computers.

BLAS3 routines are optimal for [RC]ISC _with_ cache. The matrix is
splitted into submatrixes, which fit into the cache. For example
[ A1 A2 ]
A = [ ] and so on
[ A3 A4 ]

A1 = B1*C1 + B2*C3
A2 = B1*C2 + B2*C4
A3 = B3*C1 + B4*C3
A4 = B3*C2 + B4*C4

The multiplikations Bi*Ci are BLAS 1 routines.

An other optimazion is unrolling. But it depends form the CPU (amount of
registers) and the compiler. Probably it doesn't help verry much with
fpc and IA32.

Here is on source for both:
Stephan Seidl, Code Crumpling: A Straight Technique to Improve Loop
Performance on Cache Based Systems, see
http://rcswww.urz.tu-dresden.de/~seidl/publications/

I have some C source code from the author, but unfortunately only on
paper. (This code is verry awful and long, because of unrolling, contact
me privately if you want an copy)

Literature about BLAS you find in the Internet (google).

A third way is performing the Strassen-Algorithm. By tricky substitution
and splitting you need less multiplications but (much) more additions.
The total computational costs are of the order O(n^(2.81))=O(n^(log_2
7)). The disadvantage is the reduced accuracy (because of the
additions). But not with exact calculation like integer matrices (as in
the example of Florian Klaempfl).

Regards Stefan

--
mailto:***@mailbox.tu-dresden.de
http://www.sziegenbalg.de
http://www.simage.org