Gaussian 98 Rev A11.3 Benchmark with Intel Fortran 7.1

Authors:
Jen-Shiang K. Yu, Jenn-Kang Hwang, Chuan Yi Tang and Chin-Hui Yu
Bioinformatics Center, Department of Biological Science and Technology,
National Chiao Tung University, Hsinchu 300, TAIWAN.

Introduction

This benchmark update (with G98 rev. A11.3) was performed with the following revisions of Fortran compilers and numerical libraries on new platforms, including IA32, IA64, and AMD64 in 32-bit mode (AMD64-32): PGI Fortran compiler version 5.0 (IA32 and AMD64-32) and Intel Fortran compiler version 7.1, build 20030307Z (IA32, IA64 and AMD64-32); ATLAS 3.4.1 (IA32), ATLAS 3.5.7 (AMD64-32), Intel MKL 6.0 (IA32, IA64 and AMD64-32), Kazushige Goto's threaded GOTO 0.6 (IA32), threaded GOTO 0.7 (IA64), and AMD Core Math Library (ACML 1.0, AMD64-32). The test results on the IBM P690 system are listed as a reference. All of the benchmarks followed a strategy similar to that of our previous publication. In addition to single-CPU tests, we ran multiple copies of an identical job concurrently (not parallel processing!) on the same machine to quantify the performance impact of multitasking. The outcome provides information about how well the memory bus architecture sustains a multitasking computational load.

For further details, please refer to our publication in J. Chem. Inf. Comput. Sci., 44, 635-642 (2004).
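
The multitasking measurement described in the introduction can be reduced to one number per test job: the relative increase in CPU time when identical copies run concurrently, compared with a single copy. The following is a minimal Python sketch of that arithmetic only. The function name and the timings are hypothetical and are not taken from the tables or from the publication.

```python
def concurrency_slowdown(single_minutes, concurrent_minutes):
    """Fractional CPU-time increase of a job when copies run concurrently.

    0.0 means no penalty; 0.25 means each copy needed 25% more CPU time
    than the same job running alone on the machine.
    """
    if single_minutes <= 0:
        raise ValueError("single_minutes must be positive")
    return concurrent_minutes / single_minutes - 1.0


if __name__ == "__main__":
    # Hypothetical CPU times in minutes (illustration only, not measured data).
    single = 40.0      # one copy of the job running alone
    duplicate = 46.0   # the same job while a second identical copy runs
    print(f"slowdown: {concurrency_slowdown(single, duplicate):.1%}")
```

A value close to zero would indicate that the memory subsystem serves both copies without contention, while a large value would point to a memory-bus bottleneck.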

Tables

Table 1. Hardware Specifications and the Software Configurations of the Tested Platforms in Detail.
Table X. Description of GAUSSIAN 98 Test Jobs.
Table 2. CPU Time Consumption (in Minutes) of Each Test Job by the Alpha500 Machine and Intel Xeon Systems with CL2.5.
Table 3. CPU Time Consumption (in Minutes) of Each Test Job by the Intel Xeon Systems with CL2 as well as the IA64 System.
Table 4. CPU Time Consumption (in Minutes) of Each Test Job by the AMD Opteron System and IBM P690.
Table 5. Performance Correlation between the SpecFP2000 Benchmark and GAUSSIAN 98 Results.
Table 6. The CPU Time Consumption (in Minutes) of Each Test Job Concurrently Executed in Duplicate in the E7505 System.
Table 7. The CPU Time Consumption (in Minutes) of Each Test Job Concurrently Executed in Duplicate in the zx6000 and K8-32 Systems.
Table 8. Throughput Correlation between the SpecFP2000rate Benchmark and GAUSSIAN 98 Results.

Conclusion

The revisions of both Fortran compilers (PGI 3.3 to 5.0 and Intel Fortran 6.0 to 7.1) deliver a performance advantage of about 3%.
For 32-bit executables, Intel Fortran accelerates processors with the SSE2 instruction set equally well regardless of the CPU manufacturer (IA32 or AMD64-32), and generates better-performing binary code.
For IA32 systems, the improvements from the optimized numerical libraries (ATLAS, GOTO, and MKL) are nearly identical, with differences of less than 2% on the system with the Intel E7505 chipset. For the AMD64 architecture running 32-bit applications under a 64-bit Linux OS, ifc can tune binaries as if for a Pentium 4 clone and invariably accelerates double-precision FP operations. Significant speed variations among the numerical libraries are observed on the AMD64 platform.
Adjusting the CAS latency to CL2 on the E7505 system yields an additional speedup of 5% compared to the default setting of CL2.5.
The IA64 and AMD64 machines perform multiple concurrent computations more efficiently than the IA32 architecture, probably owing to their larger memory bandwidth.

Technical Notes

The following compilation experiences may be useful to scientists who would like to tune the performance of number-crunching codes with different C and Fortran compilers and numerical libraries. However, the resulting executables must be tested carefully to make sure that they give correct answers. Furthermore, static linking of the target binaries is strongly recommended with ifc 7.1, as it prevents the executables from referencing a mix of shared libraries from different compiler revisions, which may cause trouble at run time. Although the binaries occupy more disk space, static linking keeps things clearer, especially when multiple revisions of the Intel or PGI compilers are installed on the system. With Intel compiler 8.0, dynamic linking against libguide.so is the default, and the release notes mention performance issues associated with linking libguide dynamically.

The authors are NOT responsible for any numerical errors, data loss or system damage resulting from the suggestions that follow.

For 32-bit Linux distributions that incorporate the new Native POSIX Thread Library (NPTL), such as RedHat 9.0, an undefined reference to "__ctype_b" may appear when linking with ifc 7.1 and MKL; it can be resolved with the "-i_dynamic" option at the linking stage. When linking against the GOTO library, the options "-lpthread -lsvml" help to resolve other undefined references.

Generating 32-bit binaries on the AMD64 system requires several special compiling and linking options, since 64-bit executables are produced by default on a 64-bit Linux system. The architectural tuning option of the PGI compiler should be set to "-tp k8-32", while the "-m32" option is required to specify 32-bit compilation with the GNU compilers. Furthermore, the "-melf_i386" option is needed at the linking stage to produce executables in the 32-bit ELF format (ELF being the native Linux binary format). If the linking is done through a compiler driver (gcc, g77, ifc, etc.) rather than by ld directly, this option must be passed to the linker as "-Wl,-melf_i386". Note that mixing 32-bit and 64-bit object files in one link is not allowed.
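
Because 32-bit and 64-bit object files cannot be mixed, it can help to check the object files before linking. The ELF format records the word size in the fifth byte of the file header (EI_CLASS: 1 for 32-bit, 2 for 64-bit). The following Python sketch reads that byte; the function names are hypothetical helpers written for illustration, and the check applies to individual ".o" files, not to ".a" archives, which use a different container format.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}  # values of the EI_CLASS byte


def elf_class(header):
    """Return "32-bit" or "64-bit" given the first bytes of an ELF file."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class = header[4]
    if ei_class not in ELF_CLASS_NAMES:
        raise ValueError(f"unknown EI_CLASS value: {ei_class}")
    return ELF_CLASS_NAMES[ei_class]


def elf_class_of_file(path):
    """Read only the five header bytes needed to classify the file."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


def mixed_classes(paths):
    """Return {path: class} if the files disagree on word size, else {}."""
    classes = {path: elf_class_of_file(path) for path in paths}
    return classes if len(set(classes.values())) > 1 else {}


if __name__ == "__main__":
    # In-memory demonstration with hand-made header prefixes.
    print(elf_class(b"\x7fELF\x01"))  # header prefix of a 32-bit object
    print(elf_class(b"\x7fELF\x02"))  # header prefix of a 64-bit object
```

Calling mixed_classes() on the list of object files for a link step returns an empty dictionary when all of them share one word size.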

On the AMD64 system, undefined references to "e_wsfe", "s_wsfe" and "do_fio" are reported at the linking stage when ifc is used in combination with the ACML (v1.0) gnu32 library. The errors can be cleared by additionally linking the object file of GOTO's xerbla.f, recompiled by ifc with the "-c" option. On the other hand, to link binaries successfully against the ACML pgi32 library, pgf90 should be used instead of pgf77 to eliminate various undefined references, since the pgi32 version of ACML is built with Fortran 90 rather than Fortran 77.

Using ifc to generate 32-bit executables for AMD64 simply requires the "-Wl,-melf_i386" option at the linking stage. The optimization options "-tpp7 -axW" activate SSE2 support for double-precision FP acceleration, since ifc is able to treat the AMD64 hardware as a Pentium 4 compatible derivative when performing 32-bit compilation.
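
The options quoted in these notes can be collected in one place so that every link step on the 64-bit host uses a consistent set. The following Python sketch only assembles a command line and does not run any compiler. The option strings are those given in the notes above; the table layout, the function name, the file names and the use of "-o" to name the output are assumptions made for illustration.

```python
import shlex

# Options for producing 32-bit executables on a 64-bit AMD64 Linux host,
# as quoted in the notes above (per compiler driver).
FLAGS_32BIT = {
    "gcc": ["-m32", "-Wl,-melf_i386"],
    "g77": ["-m32", "-Wl,-melf_i386"],
    "ifc": ["-tpp7", "-axW", "-Wl,-melf_i386"],
}


def link_command(compiler, objects, output):
    """Return the argument list for a 32-bit link step (hypothetical helper)."""
    if compiler not in FLAGS_32BIT:
        raise ValueError(f"no 32-bit flag set recorded for {compiler!r}")
    # "-o" names the output file; assumed here as the usual driver convention.
    return [compiler, *FLAGS_32BIT[compiler], "-o", output, *objects]


if __name__ == "__main__":
    # File names are placeholders, not files from the Gaussian 98 build.
    command = link_command("ifc", ["main.o", "util.o"], "test.exe")
    print(shlex.join(command))
```

Keeping the flags in a single table makes it less likely that one object file is built without the 32-bit options and later rejected by the linker.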