An Example of Benchmark Obsolescence: 023.eqntott

One of the Reasons Why We Need to Update Our Benchmarks

by Reinhold Weicker
Siemens Nixdorf
Munich, Germany

Published December, 1995; see disclaimer.

Observers of the SPEC CPU benchmarks probably have noticed that in the transition from SPEC92 to SPEC95, some old benchmarks were carried over to the new suite whereas some other benchmarks were not. (Even those that were carried over have been given new, larger input data sets and, consequently, new identifying numbers.) What reasoning was used in this decision process? Benchmark selection within SPEC is a long process, and there is undoubtedly and unavoidably some subjectivity in the judgment about the merits of the benchmark candidates. This is particularly true of benchmark programs that have appeared in previous versions of the benchmark suite. However, there is often also agreement that a particular benchmark has become obsolete or extremely susceptible to new technologies and techniques; in this case it will not be included in the new suite.

Benchmark 023.eqntott, which was carried over from the '89 suite into the '92 suite but is no longer in SPEC95, is such a case. Its problems are related to the fact that this benchmark spends between 79 percent and 85 percent of its execution time (older measurements on Unisys and MIPS systems -- percentages may vary on other systems) in a very small subroutine, and that this subroutine has some peculiarities. Here is the code of this subroutine, function cmppt() in module pterm_ops.c:

/* from compilation unit pterm_ops.c: */
/* 01 */        #include "x.h"
/* 02 */        extern int ninputs, noutputs;
/* 03 */
/* 04 */        int cmppt (a, b)
/* 05 */        PTERM *a[], *b[];
/* 06 */        {
/* 07 */                register int i, aa, bb;
/* 08 */                for (i = 0; i < ninputs; i++) {
/* 09 */                        aa = a[0]->ptand[i];
/* 10 */                        bb = b[0]->ptand[i];
/* 11 */                        if (aa == 2)
/* 12 */                                aa = 0;
/* 13 */                        if (bb == 2)
/* 14 */                                bb = 0;
/* 15 */                        if (aa != bb) {
/* 16 */                                if (aa < bb) {
/* 17 */                                        return (-1);
/* 18 */                                }
/* 19 */                                else {
/* 20 */                                        return (1);
/* 21 */                                }
/* 22 */                        }
/* 23 */                }
/* 24 */                return (0);
/* 25 */        }

/* relevant parts of the header file x.h: */

typedef short BIT;
typedef struct Pterm {
  BIT *ptand;                /* AND-plane connections */
  BIT *ptor;                  /* OR-plane connections */
  struct  Pterm *next;       /* link to next product term */
  long  andhash;     /* hash of input connection values */
  short index;                /* number of 1's in ptand */
  short cv;                           /* "covered" flag */
} PTERM;
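
For readers who want to compile and step through the routine in isolation, the following is a minimal sketch in prototype-style C. It inlines the declarations from x.h and, as an assumption made only for this sketch, defines ninputs locally instead of declaring it extern. The helper compare_rows() is hypothetical and not part of the benchmark; it merely wraps two raw arrays so that cmppt() can be called.

```c
typedef short BIT;

typedef struct Pterm {
    BIT *ptand;             /* AND-plane connections */
    BIT *ptor;              /* OR-plane connections */
    struct Pterm *next;     /* link to next product term */
    long andhash;           /* hash of input connection values */
    short index;            /* number of 1's in ptand */
    short cv;               /* "covered" flag */
} PTERM;

/* Assumption for this sketch: defined here, whereas the benchmark
   module only declares it extern. */
int ninputs;

/* Same logic as the listing above, written with a prototype. */
int cmppt(PTERM *a[], PTERM *b[])
{
    int i, aa, bb;

    for (i = 0; i < ninputs; i++) {
        aa = a[0]->ptand[i];
        bb = b[0]->ptand[i];
        if (aa == 2)            /* the value 2 is compared as if it were 0 */
            aa = 0;
        if (bb == 2)
            bb = 0;
        if (aa != bb)
            return (aa < bb) ? -1 : 1;
    }
    return 0;
}

/* Hypothetical helper: compares two raw arrays of n entries via cmppt(). */
int compare_rows(BIT *x, BIT *y, int n)
{
    PTERM pa = {0}, pb = {0};
    PTERM *a[1], *b[1];

    pa.ptand = x;
    pb.ptand = y;
    a[0] = &pa;
    b[0] = &pb;
    ninputs = n;
    return cmppt(a, b);
}
```

Calling compare_rows() on two arrays that differ only where one holds 0 and the other 2 yields 0, because both values are compared as 0.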

An initial analysis of this routine reveals:

The type definitions for PTERM and BIT show that the data type of the relevant array is short (16 bits), whereas most current CPU chips are optimized for the manipulation of 32-bit quantities;
Almost all of the subroutine's time is spent in a small loop (lines 09 to 22);
Within this loop, memory accesses (Load instructions) will be the dominating factor;
Apart from memory access, executions of the if statements take most of the time. Remember that in modern CPUs, conditional branches can break the pipeline and therefore tend to be expensive.

When the benchmark was adopted for the SPEC92 suite, its high code locality was known but it appeared that no optimization would be found that could trivialize it. At the time the benchmark was introduced (1989) optimizing this code seemed to be a difficult task. However, if a program becomes a benchmark, compiler authors can become very creative. The following list (not exhaustive) includes some of the optimizations that have been employed:

Several compilation systems perform loop unrolling, i.e. they duplicate the code of the loop body, generating larger basic blocks that can be more easily optimized by other optimization techniques. This is a fairly common optimization that tends to benefit most (though not all, due to code size effects) programs with loops.
In some compilation systems, the conditions that are checked by the if statements are transformed to a logically equivalent form that can be compiled into more efficient code. One particular optimization that has been used is insertion of the statement
if (aa == bb) continue;
which optimizes loop termination.
In some compilation systems, the Load instructions are optimized over several iterations of the loop. Instead of loading one 16-bit short item at a time, the compiler generates Load instructions for 32-bit or (where available) 64-bit words, storing them in registers and using them in subsequent iterations through the loop. If implemented properly, this is a legitimate optimization. However, it benefits only programs with such data type properties, and it certainly benefits 023.eqntott more than most other programs. In this sense, it is similar to the case of the old '89 benchmark 030.matrix300 where the benchmark showed greater benefit from a particular optimization than most applications would.
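
The last two transformations can be imitated by hand at the source level, which makes their effect easier to see. The sketch below is an illustration under stated assumptions, not the output of any particular compiler. It works on raw BIT arrays rather than on PTERM structures, and the function names and the helper norm() are hypothetical. The placement of the early continue (directly after the loads) is one possible choice; the text above does not say where compilers insert it. The wide-load version also adds a whole-word equality test, which goes beyond the register reuse described above, and on a mismatch it simply re-reads the two elements, whereas a compiler would extract them from the register. It assumes that short is 16 bits wide.

```c
#include <stdint.h>
#include <string.h>

typedef short BIT;

/* Compile-time check of the assumption that BIT occupies 16 bits. */
typedef char bit_is_16_bits[(sizeof(BIT) == 2) ? 1 : -1];

/* Map the value 2 to 0, as lines 11 to 14 of cmppt() do. */
static int norm(int v) { return v == 2 ? 0 : v; }

/* Baseline: same comparison as cmppt(), on raw arrays. */
int cmp_baseline(const BIT *a, const BIT *b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int aa = norm(a[i]), bb = norm(b[i]);
        if (aa != bb)
            return aa < bb ? -1 : 1;
    }
    return 0;
}

/* Early "continue": equal raw values remain equal after the mapping,
   so the remaining tests can be skipped for that element. */
int cmp_early_continue(const BIT *a, const BIT *b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int aa = a[i], bb = b[i];
        if (aa == bb)
            continue;
        aa = norm(aa);
        bb = norm(bb);
        if (aa != bb)
            return aa < bb ? -1 : 1;
    }
    return 0;
}

/* Wide loads: fetch two 16-bit elements with one 32-bit load.
   memcpy avoids alignment and aliasing problems; compilers
   typically turn it into a single load instruction. */
int cmp_wide_load(const BIT *a, const BIT *b, int n)
{
    int i = 0, k;
    for (; i + 1 < n; i += 2) {
        uint32_t wa, wb;
        memcpy(&wa, &a[i], sizeof wa);
        memcpy(&wb, &b[i], sizeof wb);
        if (wa == wb)
            continue;               /* both elements are identical */
        for (k = i; k < i + 2; k++) {
            int aa = norm(a[k]), bb = norm(b[k]);
            if (aa != bb)
                return aa < bb ? -1 : 1;
        }
    }
    if (i < n) {                    /* leftover element when n is odd */
        int aa = norm(a[i]), bb = norm(b[i]);
        if (aa != bb)
            return aa < bb ? -1 : 1;
    }
    return 0;
}
```

All three functions are meant to return the same result for the same input. Whether either variant is actually faster depends on the compiler and the target machine, and neither is of much use to programs that do not loop over arrays of short items in this way.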

The run rules state that "Use of software features (in preprocessors, compilers, etc.) which invoke, generate or use software designed specifically for any of the SPEC benchmark releases is not allowed." One cannot say of any optimization discussed above that it is, by itself, benchmark-specific or not. Some observers might try to draw conclusions from the SPECratios of the individual benchmarks ("Benchmark 023.eqntott has, for this system, an unusually high SPECratio, compared with other benchmarks"). Or they might even look at the generated assembly code ("This code sequence looks like hand-optimized code"). While such observations can hint at problematic practices, they are not proof that something incorrect has been done. They can still result from legal and generally useful optimizations that are encouraged by SPEC.

Rather, the following criteria can be used:

Does the optimization generate correct code not only for the benchmark but also for all other programs? Optimizations that are targeted too narrowly at a particular benchmark may "overshoot the mark" and apply a particular optimizing transformation even in cases where, for other programs, it leads to incorrect code. This cannot be accepted since the generation of correct code is the foremost goal of every compilation system -- performance should never be achieved at the expense of correctness.
Are other programs that are similar to the benchmark (but not identical) optimized in a similar way? While pattern matching, to some degree, is a normal ingredient of every optimizer (it recognizes a common coding idiom and substitutes more efficient operations), a pattern matching that just recognizes SPEC benchmark code and fetches optimal code from some repository is not consistent with SPEC's goal of fair benchmarking. Testing for such cases is a difficult process. It usually involves designing a test program that is "almost the benchmark" but not identical. The code compiled for this test program is then analyzed for correctness and similarity to or distance from the code compiled for the original benchmark. However, this technology is still more of an art. Recently, some such test programs for eqntott have been published by Christopher Glaeser (Nullstone Corp.), see http://www.nullstone.com/eqntott/eqntott.html.
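
As an illustration of what an "almost the benchmark" test program might look like, here is a minimal sketch. It is hypothetical and is not taken from the Nullstone tests mentioned above. The loop has the same shape as cmppt(), but the special value is 3 instead of 2 and the data are passed as raw arrays. A deliberately differently written reference routine is used to check the results for all short inputs. All function names and the constant LEN are assumptions made for this sketch.

```c
typedef short BIT;

/* Near-miss variant: same loop shape as cmppt(), but the special
   value is 3 rather than 2 and the arrays are passed directly. */
int cmp_variant(const BIT *a, const BIT *b, int n)
{
    int i, aa, bb;
    for (i = 0; i < n; i++) {
        aa = a[i];
        bb = b[i];
        if (aa == 3)
            aa = 0;
        if (bb == 3)
            bb = 0;
        if (aa != bb) {
            if (aa < bb)
                return -1;
            else
                return 1;
        }
    }
    return 0;
}

/* Reference, written differently on purpose so that it is unlikely
   to match the same pattern. Valid for element values 0..3 only,
   where v % 3 maps 3 to 0 and leaves 0, 1 and 2 unchanged. */
int cmp_reference(const BIT *a, const BIT *b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int d = (a[i] % 3) - (b[i] % 3);
        if (d != 0)
            return d < 0 ? -1 : 1;
    }
    return 0;
}

#define LEN 3   /* row length used for the exhaustive check */

/* Expand a number into LEN base-4 digits, one per array element. */
static void decode(int code, BIT *row)
{
    int i;
    for (i = 0; i < LEN; i++) {
        row[i] = (BIT)(code % 4);
        code /= 4;
    }
}

/* Compare both routines on every pair of rows over the values 0..3
   and return the number of disagreements. */
int count_mismatches(void)
{
    BIT a[LEN], b[LEN];
    int ca, cb, i, total = 1, bad = 0;

    for (i = 0; i < LEN; i++)
        total *= 4;
    for (ca = 0; ca < total; ca++) {
        for (cb = 0; cb < total; cb++) {
            decode(ca, a);
            decode(cb, b);
            if (cmp_variant(a, b, LEN) != cmp_reference(a, b, LEN))
                bad++;
        }
    }
    return bad;
}
```

A correct compilation system should make count_mismatches() return 0 at every optimization level. This covers only the correctness half of the test. Judging how similar the generated code is to the code produced for the original benchmark still requires inspecting the compiler output by hand.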

These issues have been discussed within SPEC. There have been differences of opinion about whether SPEC as a group should make a judgement on the legality of any particular transformation or optimization. There is, however, unanimous agreement that it is undesirable if benchmarks, by high code locality or by other features, create incentives for code transformations or optimizations that accelerate a benchmark much more than other programs. It is true that real-world programs do contain "hot spots", small program parts where the program spends a large part of its time. The lesson that SPEC has learned is that programs dominated by a single hot spot, however useful they may be otherwise, are just not suitable as benchmark programs. The other lesson is that it makes sense to change benchmarks from time to time, thus decreasing the incentive for special-case optimizations that do not benefit other programs.

The programs of the new benchmark suite SPEC95 have been selected with this experience in mind. They are not perfect but certainly better than the old ones. In addition, the run rules have been extended to state more clearly which code transformations are allowed and which are not allowed. SPEC also has explicitly reserved the option to drop a benchmark from the suite if this benchmark has been compromised. If the new and better benchmarks of SPEC95 had not been on the horizon, SPEC might have done this for benchmark 023.eqntott. Now that the new benchmarks are available, SPEC encourages everyone to move over to the new CPU benchmarks as soon as possible.

Reinhold Weicker is the SPEC Representative for Siemens Nixdorf and the Vice Chairman of the SPEC Open Systems Steering Committee.

Standard Performance Evaluation Corporation
