An Explanation of the SPECweb96 Benchmark

Abstract
Introduction
Workload
Internals
Limitations
Future Work
Conclusion

Abstract

SPEC has released SPECweb96, a standardized benchmark for comparing web server performance. The benchmark is designed to provide comparable measures of how well systems can handle HTTP GET requests. SPEC based the workload upon analysis of server logs from web sites ranging from a small personal server up through some of the Net's most popular servers. Built upon the framework of the popular SPEC SFS/LADDIS benchmark, SPECweb96 can coordinate the driving of HTTP protocol requests from one or more client systems.

Introduction

SPEC's goal is to provide a common measure of basic web services. By bringing SPEC's usual rigorous standardization of the benchmark and workload, along with our usual complete disclosure rules, we expect that SPECweb96 will make the performance claims being made about available web servers more useful.

SPECweb96 is SPEC's first step in providing answers about web server performance. SPEC envisions future releases that will provide more comprehensive coverage of web server features such as CGI, security, and advanced HTTP protocols. At this time SPEC believes that providing an initial basic workload will help alleviate the confusion of performance claims based on an assortment of benchmarks and their uncontrolled derivations.

Workload

The workload for SPECweb96 is based upon analysis of logs from several servers. During the investigation we had access to logs from NCSA's site (made popular by Mosaic), the home pages for Hewlett-Packard and HAL Computers, and even a small site supporting several comic strip artists. We then compared the findings from these logs against summary data provided by Netscape and CommerceNet.

These logs showed remarkable similarity in the relationships between file sizes and their frequency of access. A fair number of requests were to quite small files: small graphical elements, short text files, and so on. Most of the accesses were to files several KB in size: mostly HTML files and their main graphics. Then, as file size increased, frequency trailed off: a reasonable number of requests were for HTML files and other documents and pictures between 10 KB and 100 KB in size. Finally, there were only occasional accesses to larger documents and multimedia files larger than 100 KB. In short, the common activity was browsing home pages and indices before finally selecting only a few large files to download.

After reviewing all this data, we settled on a workload mix built from files in four classes: files less than 1 KB account for 35% of all requests, files between 1 KB and 10 KB account for 50%, files between 10 KB and 100 KB account for 14%, and files between 100 KB and 1 MB account for the remaining 1%.

There are 9 discrete sizes within each class (e.g. 1 KB, 2 KB, on up to 9KB, then 10 KB, 20 KB, through 90KB, etc.). However, accesses within a class are not evenly distributed; they are allocated using a Poisson distribution centered around the midpoint within the class. The resulting access pattern mimics the behavior where some files (such as "index.html") are more popular than the rest, and some files (such as "mydog.gif") are rarely requested.
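The class mix and the within-class skew can be combined into a single table of request probabilities. The Python sketch below is illustrative only. It assumes that file k in a class is k times the class's base size, and that the "Poisson distribution centered around the midpoint" means a Poisson probability function with mean 5, evaluated at k = 1..9 and renormalized. The text does not give the exact parameters, and all function names here are hypothetical.

```python
import math

# Share of all requests that fall in each size class (from the text).
CLASS_MIX = {0: 0.35, 1: 0.50, 2: 0.14, 3: 0.01}

# ASSUMPTION: file k in a class is k * base bytes, for k = 1..9.
CLASS_BASE = {0: 102.4, 1: 1024, 2: 10240, 3: 102400}


def poisson_weights(mean=5.0, n=9):
    """Poisson probabilities at k = 1..n, renormalized to sum to 1.

    ASSUMPTION: mean=5 stands in for "centered around the midpoint";
    the benchmark's real parameters may differ.
    """
    raw = [math.exp(-mean) * mean**k / math.factorial(k) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def request_probabilities():
    """Probability of requesting each (class, file index) pair."""
    within = poisson_weights()
    return {
        (c, k): share * within[k - 1]
        for c, share in CLASS_MIX.items()
        for k in range(1, 10)
    }


def mean_transfer_bytes():
    """Expected file size per request under the assumptions above."""
    probs = request_probabilities()
    return sum(p * k * CLASS_BASE[c] for (c, k), p in probs.items())


if __name__ == "__main__":
    for (c, k), p in sorted(request_probabilities().items()):
        print(f"class {c} file {k}: {p:.4f}")
    print(f"mean bytes per request: {mean_transfer_bytes():.0f}")
```

Under these assumptions the middle files of each class receive most of that class's requests, while the first and ninth files are requested rarely.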

        TABLE 1
    Sizes (in bytes) of Files in Each Class

  Class 0   Class 1   Class 2   Class 3
      102     1,024    10,240   102,400
      204     2,048    20,480   204,800
        .         .         .         .
        .         .         .         .
        .         .         .         .
      922     9,216    92,160   921,600

Finally, it was decided that the total size of the file set should scale with the expected throughput of the server. This is not to say that the size of web sites grows larger as they become more popular, but rather, that the expectations for a high-end server are much greater than for a smaller server. In particular, it should not be unreasonable to assume that two smaller systems might be replaced by one that has twice the performance rating; however, this only holds if the larger system can also handle the files from both, not just the higher request rate. Recognizing that there is likely to be overlap and that file space probably does not grow linearly, SPEC chose to have the file set size grow slowly: the file set size will double as the expected throughput quadruples.

The resulting workload can be thought of as the behavior of a system supporting the "home pages" for a number of "members". There is a set of directories, one per member, with 36 files each: a complete set of nine files for each of the four classes. Requests are spread evenly across all applicable directories. The number of directories is set by the above-mentioned scaling function, which can be stated as: sqrt( throughput / 5 ) * 10, with the result truncated to a whole number of directories, as in Table 2.

      TABLE 2
  Number of Directories (And Resulting Disk Space)
      Based Upon the Target Throughput ("Ops")

  Ops:    1 Dirs:    4  Size:  22 MB
  Ops:    2 Dirs:    6  Size:  31 MB
  Ops:    5 Dirs:   10  Size:  49 MB
  Ops:   10 Dirs:   14  Size:  69 MB
  Ops:   20 Dirs:   20  Size:  98 MB
  Ops:   50 Dirs:   31  Size: 154 MB
  Ops:  100 Dirs:   44  Size: 218 MB
  Ops:  200 Dirs:   63  Size: 309 MB
  Ops:  500 Dirs:  100  Size: 488 MB
  Ops: 1000 Dirs:  141  Size: 690 MB
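The scaling function can be checked against Table 2 with a short Python sketch. It assumes that each directory holds nine files per class at k times the class base size, that "MB" in the table means 2^20 bytes, and that the Size column is computed from the untruncated directory count. These are inferences that happen to match the table's figures, not documented facts about the benchmark, and the function names are hypothetical.

```python
import math

MIB = 2**20  # ASSUMPTION: "MB" in Table 2 means 2**20 bytes

# ASSUMPTION: nine files per class, file k is k * base bytes (k = 1..9).
BYTES_PER_DIR = sum(
    k * base for base in (102.4, 1024, 10240, 102400) for k in range(1, 10)
)


def directories(ops):
    """Directory count: sqrt(throughput / 5) * 10, truncated."""
    return int(math.sqrt(ops / 5) * 10)


def fileset_mib(ops):
    """Approximate file set size, using the untruncated directory count."""
    return round(math.sqrt(ops / 5) * 10 * BYTES_PER_DIR / MIB)


if __name__ == "__main__":
    for ops in (1, 2, 5, 10, 20, 50, 100, 200, 500, 1000):
        print(f"Ops: {ops:5d} Dirs: {directories(ops):4d}  Size: {fileset_mib(ops):3d} MB")
```

Because the directory count grows with the square root of the target throughput, quadrupling the throughput doubles the file set, which is the slow growth described above.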

Internals

SPECweb96 works by having one or more client (or "driver") systems generate a load of HTTP GET requests against a "System Under Test" (SUT). SPEC provides the code that runs on these drivers; the choice and implementation of the HTTP server on the SUT are up to the benchmarker.

SPECweb96 is implemented in ANSI C and Perl. The benchmark driver and workload generation are written in C; the top-level tool environment, which parses configuration files and produces reports, is implemented in Perl 5. The hope is that these languages are portable enough that there should be little difficulty in driving this benchmark from a suitably configured UNIX or NT system.

The C code for SPECweb96 was adapted from SPEC's other networking benchmark, LADDIS (also known as SPEC SFS). LADDIS already provided a framework of coordinating processes generating a networking workload. SPEC removed the NFS-related code and replaced it with code to generate HTTP protocol requests.

The LADDIS process structure involves several processes. The "prime" client process coordinates all of the test activity. On each driver, a "client" process is responsible for generating the workload. This "client" starts a configurable number of child processes (or threads in the NT version) which actually generate the workload, leaving the original "client" to do coordination and processing of results on that driver. Also, in the UNIX version, there is a "sync daemon" on each system that is used to pass messages (via RPC) about the test states between the "prime" and all of the working clients.

The real meat of the benchmark is in the workload generation within the "client" subprocesses (or threads). Each of these processes generates an independent stream of HTTP requests, pausing in between requests so that on average it generates the specified number of requests per second. Each child has a separate random deck of operation classes, initialized to match the percentages in the benchmark's defined operation mix. Each child then works its way through the deck, using a Poisson distribution to select a file from the appropriate class, and then selecting at random a directory from which to fetch.
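The per-child request stream described above can be sketched in Python as follows. This is an illustration of the idea, not the benchmark's C code. The deck size of 100 cards, reshuffling after each pass, the Poisson mean of 5, and the simple pacing rule are all assumptions, and every name is hypothetical.

```python
import math
import random

# Percent of requests per class (from the benchmark's defined mix).
CLASS_MIX = {0: 35, 1: 50, 2: 14, 3: 1}


def build_deck(rng):
    """One card per percentage point, shuffled. ASSUMPTION: deck of 100."""
    deck = [c for c, pct in CLASS_MIX.items() for _ in range(pct)]
    rng.shuffle(deck)
    return deck


def poisson_weights(mean=5.0, n=9):
    """Unnormalized Poisson weights for file indices 1..n (assumed mean)."""
    return [math.exp(-mean) * mean**k / math.factorial(k) for k in range(1, n + 1)]


def request_stream(num_dirs, rng):
    """Yield (directory, file_class, file_index) tuples indefinitely."""
    weights = poisson_weights()
    while True:
        # ASSUMPTION: a fresh shuffled deck is dealt once the old one runs out.
        for file_class in build_deck(rng):
            file_index = rng.choices(range(1, 10), weights=weights)[0]
            directory = rng.randrange(num_dirs)  # uniform over directories
            yield directory, file_class, file_index


def pause_after(requests_per_second, elapsed):
    """Seconds to sleep so one request cycle averages 1 / rate seconds."""
    return max(0.0, 1.0 / requests_per_second - elapsed)


if __name__ == "__main__":
    rng = random.Random(1996)
    stream = request_stream(num_dirs=44, rng=rng)
    for _ in range(5):
        print(next(stream))
```

Dealing from a deck, rather than drawing each class independently, guarantees that every pass through the deck matches the defined mix exactly.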

Users control the benchmark by changing values in an "rc" (run control) file. The "manager" Perl script reads this file, and then starts up all the necessary processes, setting the arguments according to the specifications in the rc file. Virtually everything about the benchmark run can be controlled from the rc file. There are items in the rc file ranging from the number of children per driving system down to describing the cache configuration on the SUT.

The other job of the "manager" script is to produce output files that describe the run and the results. Outputs are available in simple ASCII, HTML tables, PostScript, and a "raw" form that is a text-based dump of all of the variables. All of the outputs include results from the run (including whether the run is compliant with all the rules), as well as a description of the test configuration (SUT and drivers). Thus, each output file is intended to be a full disclosure of the benchmark experiment, with enough information to decide whether a result is of interest or not and how to go about duplicating the experiment.

Limitations

SPECweb96 is a good test of a server's ability to serve basic GET requests. It is not, however, a panacea for all web performance questions. First of all, SPECweb96 focuses solely on GET performance and cannot answer questions about a server's ability to handle other types of requests.

Furthermore, a single benchmark cannot answer all potential questions. A standardized workload, by definition, does not necessarily represent any one individual's workload. As always, the universal caveat of "your mileage may vary" holds. SPECweb96 does provide one known, well-defined point in the web server performance space. How well it represents another workload depends upon how closely it matches the characteristics of that workload.

One particular component of "real-world" web performance that is very difficult to benchmark is the noisy and error-prone nature of large networks. Any benchmark environment is likely to be built around a small isolated LAN inside a test lab; any other configuration leads to inconsistent and unstable results. In the "real world" (especially the greater Internet, but even within most "intranets"), however, gateways, bridges, hubs and long cable lengths lead to widely varying packet transmission times and to corruption and loss of packets. In addition, different TCP/IP parameters will be negotiated depending upon the network types used in the intermediate hops. Unfortunately, including this "real-world" network behavior takes specialized equipment and would be nearly impossible to do in a standardized benchmark environment. Thus, while SPECweb96 can tell a lot about how a web server can handle a workload, it cannot say very much about how that server reaches its customers.

Future Work

Work is ongoing to figure out how to develop a standardized benchmark for many of the more interesting web features. Of course, the web develops very rapidly and standardization efforts are well known for being slow in coming. Thus, owing to the nature of the organization, SPEC is never likely to be out on the cutting edge of web technology. Cutting-edge ideas must first settle into well-recognized practices before they can be agreed upon for a standard test.

The simplest feature missing from SPECweb96 is Keep-Alive, where the HTTP server software keeps the TCP/IP connection open across more than one request. This feature was introduced during the time that SPECweb96 was being developed. Keep-Alive promises huge performance improvements, because the overhead of creating and closing connections can be higher than the cost of transferring the desired data. This makes a great deal of sense where it is common to fetch multiple files at once (e.g. a "page" and its associated graphics). However, this leaves a potential for abuse in benchmark environments where a small number of clients download thousands of files in a very short period of time. Furthermore, at the time SPECweb96 was being solidified, there were still no clear answers about how client-side and proxy caching would affect how often multiple simultaneous requests would make it all the way back to the server. Thus, SPEC chose to err on the side of caution and not allow the use of Keep-Alive until we develop a proper workload for it.
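To make the per-request connection cost concrete, the following Python sketch fetches one file over a fresh TCP connection and reads until the server closes it, which is the pattern that applies when Keep-Alive is not used. It is not the SPECweb96 driver (which is written in C). The HTTP/1.0 request line, the Host header, and the function names are assumptions chosen for illustration.

```python
import socket


def build_get_request(path, host):
    """Build a minimal HTTP/1.0 GET request (no Keep-Alive header)."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def fetch_once(host, port, path, timeout=10.0):
    """Open a new connection, send one GET, and read until the server closes.

    Returns the number of bytes received (headers plus body). Every call
    pays the full cost of connection setup and teardown.
    """
    received = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(path, host))
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # server closed the connection: response complete
                break
            received += len(chunk)
    return received
```

A driver built this way opens and closes one connection per file, so a "page" with several graphics costs several connection setups. That is the overhead Keep-Alive is meant to remove.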

The most obvious feature to be addressed in future releases of SPECweb is dynamic pages. SPECweb96 does not attempt to address the performance of any CGI interface. Again, at the time that SPEC was obtaining server logs, dynamic pages were still an occasional feature rather than a common practice. Even today, the uses for dynamic material are changing rapidly: where it had been used just for search results and other application output, now it is commonly used to tailor basically static material for the idiosyncrasies of a variety of different browsers. SPEC is tracking the uses of dynamic pages with the intent of including them in at least one part of the workload for the next release of SPECweb.

One of the most requested additions to SPECweb is support for measuring the cost of authorization and encryption. As with the other issues, not much agreement exists over what is typical use of this feature. What percentage of requests, for example, should be encrypted? Also, benchmarking again opens opportunities for unnatural performance behavior; this time the issue is the latency to reach a certificate agent. In the real world, these agents would be many miles away, but in a benchmark lab the certificate agent would be sitting right next to the server. Unfortunately, SPEC does not expect these issues to be settled in time for us to be able to cover security as part of the workload in the next release. We shall be watching the marketplace and when some security solution emerges then we shall work to incorporate that into our tests.

A final set of questions revolves around what other Internet services should be included. Most of the better web sites offer at least some search capability, for example. Should this be included in the benchmark? If searching is included, what kind of search (text, regular expressions, etc.)? What kind of search engine (WAIS, AltaVista, Excite, grep, etc.)? How large a database to search over? Then, one also has to consider other services such as FTP or a mail-list server. Of course, if any other servers are included, then firewalls and caching HTTP proxies are items that have to be considered. The difficulty with all of this will be arriving at an agreed standard out of the myriad of possibilities.

Conclusion

SPECweb96 represents a reasonable, standardized benchmark of basic WWW server performance. A simple test, SPECweb96 only exercises HTTP GET performance, but it does exercise that central service well and fairly. In this market, where vendors are trading performance claims with unclear comparability of their measurements, SPECweb96 allows users to make fair comparisons between results obtained from disparate sources. Building upon the successes of SPECweb96, more complete benchmarks will become available in the not-too-distant future.