A reproducible comparison between GNU MPFR and machine double-precision

1. Reproducible Experimental Setup
2. Experimental Results From Arnaud Legrand
3. References

Several authors claim that GNU MPFR [1] is \(x\) times slower than double-precision floating-point numbers, for various values of \(x\), without any way for the reader to reproduce their claim. For example in [2], Joris van der Hoeven writes “the MPFR library for arbitrary precision and IEEE-style standardized floating-point arithmetic is typically about a factor 100 slower than double precision machine arithmetic”. Such a claim typically: (i) does not say which version of MPFR was used (and which version of GMP, since MPFR being based on GMP, its efficiency also depends on GMP); (ii) does not detail the environment used (processor, compiler, operating system); (iii) does not explain which application was used for the comparison. Therefore it cannot be reproduced by the reader, which could thus have no confidence in the claimed factor of 100. In this short note we provide reproducible figures that can be checked by the reader.

1 Reproducible Experimental Setup

We use the programs in appendix to multiply two \(1000 × 1000\) matrices. The matrix \(A\) has coefficients \(1/(i + j + 1)\) for \(0 ≤ i, j < 1000\), and matrix \(b\) has coefficients \(1/(ij + 1)\). Both programs print the time for the matrix product (not counting the time to initialize the matrix), and the sum of coefficients of the product matrix (used as a simple checksum between both programs).

We used MFPR version 3.1.5, configured with GMP 6.1.2 (both are the latest releases as of the date of this document).

We used as test processor gcc12.fsffrance.org, which is a machine from the GCC Compile Farm, a set of machines available for developers of free software. The compiler used was GCC 4.5.1, which is installed in /opt/cfarm/release/4.5.1 on this machine, with optimization level -O3. Both GMP and MPFR were also compiled with this compiler, and the GMP and MPFR libraries were linked statically with the application programs (given in appendix).

2 Experimental Results From Arnaud Legrand

2.1 Code

The program (a.c) using the C double-precision type is the following. It takes as command-line argument the matrix dimension.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h> 

static int cputime()
{
  struct rusage rus;
  getrusage(0, &rus);
  return rus.ru_utime.tv_sec * 1000 + rus.ru_utime.tv_usec / 1000;
}

int main(int argc, char *argv[])
{
  double **a;
  double **b;
  double **c;
  double t = 0.0;
  int i, j, k, st;
  int N = atoi(argv[1]);
  st = cputime();
  a = malloc(N * sizeof(double *));
  b = malloc(N * sizeof(double *));
  c = malloc(N * sizeof(double *));
  for (i = 0; i < N; i++) {
    a[i] = malloc(N * sizeof(double));
    b[i] = malloc(N * sizeof(double));
    c[i] = malloc(N * sizeof(double));
    for (j = 0; j < N; j++) {
      a[i][j] = 1.0 / (1.0 + i + j);
      b[i][j] = 1.0 / (1.0 + i * j);
    }
  }
  st = cputime();
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      c[i][j] = 0.0;
  for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
      for (j = 0; j < N; j++)
        c[i][j] += a[i][k] * b[k][j];
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      t += c[i][j];
  printf("matrix product took %dms\n", cputime() - st);
  printf("t=%f\n", t);
  for (i = 0; i < N; i++) {
    free(a[i]);
    free(b[i]);
    free(c[i]);
  }
  free(a);
  free(b);
  free(c);
  return 0;
}

The program (d.c) using GNU MPFR is the following. It takes as command-line argument the matrix dimension and the MPFR precision (in bits).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h> 
#include <mpfr.h>

static int cputime()
{
  struct rusage rus;
  getrusage(0, &rus);
  return rus.ru_utime.tv_sec * 1000 + rus.ru_utime.tv_usec / 1000;
}

int main(int argc, char *argv[])
{
  mpfr_t **a;
  mpfr_t **b;
  mpfr_t **c;
  mpfr_t s;
  double t = 0.0;
  int i, j, k, st;
  int N = atoi(argv[1]);
  int prec = atoi(argv[2]);
  printf("MPFR library: %-12s\nMPFR header: %s (based on %d.%d.%d)\n",
         mpfr_get_version(), MPFR_VERSION_STRING, MPFR_VERSION_MAJOR,
         MPFR_VERSION_MINOR, MPFR_VERSION_PATCHLEVEL);
  st = cputime();
  a = malloc(N * sizeof(mpfr_t *));
  b = malloc(N * sizeof(mpfr_t *));
  c = malloc(N * sizeof(mpfr_t *));
  mpfr_init2(s, prec);
  for (i = 0; i < N; i++) {
    a[i] = malloc(N * sizeof(mpfr_t));
    b[i] = malloc(N * sizeof(mpfr_t));
    c[i] = malloc(N * sizeof(mpfr_t));
    for (j = 0; j < N; j++) {
      mpfr_init2(a[i][j], prec);
      mpfr_init2(b[i][j], prec);
      mpfr_init2(c[i][j], prec);
      mpfr_set_ui(a[i][j], 1, MPFR_RNDN);
      mpfr_div_ui(a[i][j], a[i][j], i + j + 1, MPFR_RNDN);
      mpfr_set_ui(b[i][j], 1, MPFR_RNDN);
      mpfr_div_ui(b[i][j], b[i][j], i * j + 1, MPFR_RNDN);
    }
  }
  st = cputime();
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      mpfr_set_ui(c[i][j], 0, MPFR_RNDN);
  for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
      for (j = 0; j < N; j++) {
        mpfr_mul(s, a[i][k], b[k][j], MPFR_RNDN);
        mpfr_add(c[i][j], c[i][j], s, MPFR_RNDN);
      }
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      t += mpfr_get_d(c[i][j], MPFR_RNDN);
  printf("matrix product took %dms\n", cputime() - st);
  printf("t=%f\n", t);
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      mpfr_clear(a[i][j]);
      mpfr_clear(b[i][j]);
      mpfr_clear(c[i][j]);
    }
    free(a[i]);
    free(b[i]);
    free(c[i]);
  }
  mpfr_clear(s);
  free(a);
  free(b);
  free(c);
  return 0;
}

2.2 Setup

Name of the machine and OS version:

Linux sama 4.2.0-1-amd64 #1 SMP Debian 4.2.6-1 (2015-11-10) x86_64 GNU/Linux

CPU/architecture information:

cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
stepping	: 9
microcode	: 0x15
cpu MHz		: 2165.617
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs		:
bogomips	: 5182.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
stepping	: 9
microcode	: 0x15
cpu MHz		: 3140.515
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs		:
bogomips	: 5182.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
stepping	: 9
microcode	: 0x15
cpu MHz		: 2860.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs		:
bogomips	: 5182.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
stepping	: 9
microcode	: 0x15
cpu MHz		: 2813.585
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs		:
bogomips	: 5182.68
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

Compiler version

gcc --version

gcc (Debian 5.3.1-6) 5.3.1 20160114
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Libpmfr version:

apt-cache show libmpfr-dev

Package: libmpfr-dev
Source: mpfr4
Version: 3.1.5-1
Installed-Size: 1029
Maintainer: Debian GCC Maintainers <debian-gcc@lists.debian.org>
Architecture: amd64
Replaces: libgmp3-dev (<< 4.1.4-3)
Depends: libgmp-dev, libmpfr4 (= 3.1.5-1)
Suggests: libmpfr-doc
Breaks: libgmp3-dev (<< 4.1.4-3)
Description-en: multiple precision floating-point computation developers tools
 This development package provides the header files and the symbolic
 links to allow compilation and linking of programs that use the libraries
 provided in the libmpfr4 package.
 .
 MPFR provides a library for multiple-precision floating-point computation
 with correct rounding.  The computation is both efficient and has a
 well-defined semantics. It copies the good ideas from the
 ANSI/IEEE-754 standard for double-precision floating-point arithmetic
 (53-bit mantissa).
Description-md5: a2580b68a7c6f1fcadeefc6b17102b32
Multi-Arch: same
Homepage: http://www.mpfr.org/
Tag: devel::lang:c, devel::library, implemented-in::c, role::devel-lib,
 suite::gnu
Section: libdevel
Priority: optional
Filename: pool/main/m/mpfr4/libmpfr-dev_3.1.5-1_amd64.deb
Size: 207200
MD5sum: e5c7872461f263e27312c9ef4f4218b9
SHA256: 279970e210c7db4e2550f5a3b7abb2674d01e9f0afd2a4857f1589a6947e0cbd

2.3 A first measurement

cd /tmp/
gcc -O3 a.c -o a
./a 1000

matrix product took 680ms
t=9062.368470

cd /tmp/
gcc -O3 d.c -o d -lmpfr
./d 1000 53

MPFR library: 3.1.5
MPFR header: 3.1.5 (based on 3.1.5)
matrix product took 74460ms
t=9062.368470

Et donc, chez moi, le ratio est plutôt de

74460/844

[1] 88.22275

2.4 A second measurement

Ceci étant dit, si je reexécute ces deux codes:

cd /tmp/
gcc -O3 a.c -o a
./a 1000

matrix product took 676ms
t=9062.368470

cd /tmp/
gcc -O3 d.c -o d -lmpfr
./d 1000 53

MPFR library: 3.1.5
MPFR header: 3.1.5 (based on 3.1.5)
matrix product took 68732ms
t=9062.368470

J'obtiens une valeur assez différente qui me donnerait cette fois ci un ratio de

68732/676

[1] 101.6746

c'est à dire "plus proche" de ce qui est annoncé dans [2] mais c'est un coup de chance, j'aurais tout aussi bien pu obtenir 120 ! Bref, c'est pas le même setup que vous mais statistiquement parlant, il doit aussi y avoir quelque chose à faire là, non ?

3 References

[1] Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., and Zimmermann, P. MPFR: A multiple-precision binary floating- point library with correct rounding. ACM Trans. Math. Softw. 33, 2 (2007), article 13.

[2] van der Hoeven, J. Multiple precision floating-point arithmetic on SIMD processors. In Proceedings of Arith’24 (2017), IEEE, pp. 2–9.

Entered on [2017-09-01 Fri 17:12]