SPEC ACCEL Benchmark Description

Benchmark Name

116.histo

Benchmark Author

Nady Obeid, John A. Stratton

Benchmark Program General Category

Silicon Wafer Verification

Benchmark Description

The Histogram Parboil benchmark is a straightforward histogramming operation that accumulates the number of occurrences of each output value in the input data set. The output histogram is a two-dimensional matrix of char-type bins that saturate at 255. The Parboil input sets, exemplary of a particular application setting in silicon wafer verification, define the optimizations appropriate for the benchmark. The dimensions of the histogram (256 wide x 8192 high) are very large, yet the input set follows a roughly Gaussian distribution centered in the output histogram. Recognizing this high concentration of contributions to the histogram's central region (referred to as the "eye"), the benchmark optimizations focus mainly on improving the throughput of contributions to this area. Before performing the histogramming, the optimized implementations for scratchpad-memory architectures run a kernel that determines the size of the eye by sampling the input data. Architectures with an implicit cache can forgo such analysis, since the hardware cache will automatically prioritize the heavily accessed region wherever it may be.
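The core operation described above can be sketched in serial C as follows. This is a minimal illustration, not the benchmark's actual code: the function and constant names are invented here, and the input is assumed to be already flattened into linear bin indexes.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of the basic histogramming step: each input value
 * is a bin index, and each 8-bit bin count saturates at 255. The names
 * histo_serial, HIST_W, and HIST_H are hypothetical, not from the
 * benchmark source. */
enum { HIST_W = 256, HIST_H = 8192 };

static void histo_serial(const unsigned int *input, size_t n,
                         uint8_t bins[HIST_W * HIST_H])
{
    for (size_t i = 0; i < n; i++) {
        unsigned int idx = input[i];   /* linear bin index of this sample */
        if (bins[idx] < UINT8_MAX)     /* saturate the count at 255 */
            bins[idx]++;
    }
}
```

In the parallel OpenCL implementations, many work-items perform these increments concurrently, which is what makes contention on the heavily hit "eye" bins the dominant cost.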

Two primary transformations reduce the number of contentious updates. First, the eye region is privatized to batches of work-groups, with the ideal number of batches determined experimentally for a given platform. The input is partitioned among the batches, each of which creates a partial private histogram, later reduced into the final output. These partial histograms can be generated in parallel, so all batches are launched together in one parallel kernel invocation.
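The privatization scheme can be sketched as follows, serialized for clarity (in the benchmark the batches run concurrently in one kernel launch). The names and the small bin/batch counts are illustrative assumptions, not taken from the benchmark source.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of eye privatization: the input is split among
 * NBATCH batches (a platform tuning parameter in the real benchmark),
 * each batch fills a private partial histogram, and the partials are
 * then reduced into the final output, saturating at 255. */
enum { NBINS = 16, NBATCH = 4 };

static void histo_batched(const unsigned int *input, size_t n,
                          uint8_t out[NBINS])
{
    uint8_t partial[NBATCH][NBINS] = {{0}};

    /* Phase 1: each batch histograms its contiguous share of the input
     * into a private partial histogram (contention-free per batch). */
    for (int b = 0; b < NBATCH; b++) {
        size_t lo = n * b / NBATCH, hi = n * (b + 1) / NBATCH;
        for (size_t i = lo; i < hi; i++)
            if (partial[b][input[i]] < UINT8_MAX)
                partial[b][input[i]]++;
    }

    /* Phase 2: reduce the partial histograms into the final output,
     * saturating the summed count at 255. */
    for (int k = 0; k < NBINS; k++) {
        unsigned int sum = 0;
        for (int b = 0; b < NBATCH; b++)
            sum += partial[b][k];
        out[k] = sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
    }
}
```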

Second, each batch is further decomposed into several thread blocks (work-groups), each assigned an exclusive region of the output in the batch's partial histogram. Each block in a batch handles the largest number of rows of the output histogram that still fits into its local memory, but all blocks gather their input from the region assigned to their batch. Because local memory is limited, each block can compute only a small portion of the histogram locally.

Therefore, if w is the maximum number of histogram rows that can be stored in local memory, and W is the number of rows in the eye, it takes W/w blocks in each batch to generate the partial histogram from the batch's given portion of the input set. The input for the batch is read redundantly by each constituent block, which only processes the input falling within its tile. The relevant input is scattered into the local memory tile, a reasonably efficient operation. One designated block in the batch is also responsible for handling the rare inputs that fall outside the eye region, by atomically incrementing the appropriate bins in global memory. When the blocks in a batch complete, their local tiles are juxtaposed with each other and with the non-eye histogram regions in global memory, composing the entire partial histogram for the batch. The partial histograms are then reduced into the final output.
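The tiling step for one batch can be sketched as below, again serialized for clarity (the real kernel runs the blocks concurrently and keeps each tile in OpenCL local memory). The toy sizes W = 32 eye rows and w = 8 tile rows, and all names, are assumptions for illustration only.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of per-block tiling within one batch: each of the
 * W/w "blocks" scans the whole batch input redundantly, scatters only
 * the values whose row falls inside its own w-row tile, and the tiles
 * are then laid side by side to form the batch's partial eye histogram.
 * (Non-eye inputs, handled by a designated block via global atomics in
 * the benchmark, are omitted here.) */
enum { HIST_COLS = 256, EYE_ROWS = 32, TILE_ROWS = 8 }; /* W = 32, w = 8 */

static void histo_tiled_batch(const unsigned int *input, size_t n,
                              uint8_t partial[EYE_ROWS * HIST_COLS])
{
    int nblocks = EYE_ROWS / TILE_ROWS;            /* W/w blocks */
    for (int blk = 0; blk < nblocks; blk++) {
        uint8_t tile[TILE_ROWS * HIST_COLS] = {0}; /* "local memory" tile */
        unsigned int row_lo = blk * TILE_ROWS;
        unsigned int row_hi = row_lo + TILE_ROWS;

        for (size_t i = 0; i < n; i++) {           /* redundant input read */
            unsigned int row = input[i] / HIST_COLS;
            if (row >= row_lo && row < row_hi) {   /* inside this tile? */
                unsigned int off = input[i] - row_lo * HIST_COLS;
                if (tile[off] < UINT8_MAX)         /* saturating scatter */
                    tile[off]++;
            }
        }
        /* Juxtapose the finished tile into the batch's partial histogram. */
        for (int j = 0; j < TILE_ROWS * HIST_COLS; j++)
            partial[blk * TILE_ROWS * HIST_COLS + j] = tile[j];
    }
}
```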

Overall, the histogram benchmark demonstrates the high cost of random atomic updates to a large dataset. The global atomic update penalty can sometimes outweigh a fixed-factor cost of redundantly reading input data.

Input Description

116.histo's input consists of a "differential image", which can be treated simply as an array of integers used as histogram bin indexes.

Output Description

116.histo outputs the resulting histogram as an 8-bit RGB bitmap, with the histogram bin count in the red channel for visualization.

The output file histo.out contains detailed timing information about the run. It also shows which device was selected along with which devices were available to OpenCL.

Programming Language

C, OpenCL

Threading Model

OpenCL

Known Portability Issues

The input file is stored in little-endian binary format.


Last updated: $Date: 2015-03-02 15:15:22 -0500 (Mon, 02 Mar 2015) $