
What makes a good computing system performance benchmark?

Separating the myths from reality.

Ross Cunniff and Chandra Sakthivel

How do you determine your computing system’s performance? Benchmarks seem like a good way to get hard, unbiased data, but that’s not entirely true, because not all benchmarks are created equal. Hardware vendors that create their own benchmarks can skew the results in favor of their own systems by establishing parameters in which their offerings perform exceptionally well. There are, however, ways to identify whether a more level playing field was used, and when it was, benchmarking yields valuable results for both users and independent software vendors (ISVs).


Computing system performance benchmarks are a way for hardware vendors to gauge the power of their solutions and measure how they perform against the competition. They also allow buyers to compare products and purchase the right ones for their needs. However, not all benchmarks are created equal. Organizations that create or use these benchmarks need to recognize some common myths about performance benchmarking as a discipline, and understand what it takes to create benchmarks that truly enable vendors to understand their market position while helping companies and individuals make reliable, sensible purchasing decisions.

Let’s start by examining the two most common computing system performance benchmarking myths.

Myth 1: A hardware vendor’s homegrown benchmark is a powerful marketing tool that gives it tribal bragging rights.

Vendors may act as if this is true, but it does a great disservice to themselves, their customers, and the independent software vendors (ISVs) that rely on the hardware vendor’s systems. Hardware vendors will naturally build their benchmarks around the performance metrics that show their products in the best light.

However, users’ real-world experiences rarely match the marketing hype. Many use cases won’t benefit at all from the hyped metrics, and even when they do, the benefit is often far less than the expectation the metric created. One important reason is the sheer variety of hardware configurations and OS builds and versions in the field: configuration plays a critical role, and different use cases and scenarios require different core counts, memory channels, bandwidth, and cache sizes to achieve optimal performance.

So choosing the wrong hardware for a use case based on a marketing-driven performance benchmark—or just obtaining less value than expected—invariably leads to disappointment. What follows is often a blame game.

From the users’ perspective, they just spent all this money on fancy new hardware, so any performance shortfall must clearly be the fault of the software vendor. Only after significant frustration on the part of both users and the ISVs does the blame fall where it belongs—on the hardware vendor’s hype. This, in turn, often leads to bad press, reputational damage, and lost sales. Performance benchmarks that are not mapped to real-world use cases will always cause more harm than benefit.

Myth 2: “Performance” can be defined by a single metric or small number of metrics that all vendors and buyers will find useful for their use cases.

System vendors often tout CPU/GPU frequency, core count, and RAM volume and speed as the primary indicators of the value buyers will receive from their computing systems. But for most users, “value” depends entirely on the use case, not the hardware specifications.

Take a relatively simple example—drawing a highlight box around a 3D object. If only one object at a time may be highlighted, the application might be better off using a very high-speed sequential-processing CPU to draw the box. However, if the use case requires highlighting thousands or millions of such objects—especially if their selection criterion is based on some property of the object—then the application will benefit significantly from GPU-based processing that delivers greater throughput, although possibly with greater latency than the simple CPU box highlight. It’s the same task, selection and highlighting, but with very different hardware requirements.
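
To make the contrast concrete, here is a minimal Python sketch of our own (not from any SPEC suite): it times the same “select objects by a property” step written once in a sequential, per-object style and once in a bulk, data-parallel style, with NumPy standing in as a rough proxy for GPU-like throughput processing. The object property and the threshold are invented for illustration.

```python
# Minimal sketch: the same selection task, expressed two ways.
# The "temperature" property and the 0.999 threshold are hypothetical.
import time
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.random(2_000_000)  # one scalar property per 3D object

def select_sequential(props, threshold):
    """Latency-oriented style: fine when only one object is in play."""
    selected = []
    for i, value in enumerate(props):
        if value > threshold:
            selected.append(i)
    return selected

def select_vectorized(props, threshold):
    """Throughput-oriented style: evaluates the whole array at once,
    the shape of work a GPU (or SIMD-heavy CPU) handles well."""
    return np.nonzero(props > threshold)[0]

for fn in (select_sequential, select_vectorized):
    start = time.perf_counter()
    hits = fn(temperature, 0.999)
    print(f"{fn.__name__}: {len(hits)} objects in {time.perf_counter() - start:.3f} s")
```

On most machines the vectorized version wins by an order of magnitude or more on large inputs, while for a single object the difference is irrelevant—precisely the use-case dependence described above.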

This is why buyers can’t automatically assume an overall benefit from a limited number of metrics cited by a vendor’s own benchmark, and why use-case-dependent measurement and end-to-end system performance measurement really matter for real-world scenarios.

Elements of good performance benchmarking—what buyers really need

Following are the key elements of a good performance benchmark:

  • Vendor agnostic—The benchmark is not produced by a vendor for the purpose of hyping the vendor’s strengths.
  • Unbiased—Independent of who creates the benchmark, it does not inadvertently favor one vendor’s approach by running a limited set of tests that highlight one vendor’s strengths.
  • Use-case-dependent—The benchmark measures real-world scenarios, preferably based on real application workloads, instead of relying on a highly optimized scenario that users would never encounter.
  • End-to-end system performance measurement—The benchmark attempts to cover as many aspects of the entire computing platform as possible, going beyond the CPU or GPU to include storage capacity and speed, data transfer speeds across the platform, etc. (A rough sketch of this idea follows this list.)
  • Modular, scalable, and extensible—Use cases and hardware are evolving rapidly, so the benchmark needs to be able to adapt and evolve over time.
  • Transparency—Users need access to information on how the benchmark was developed in order to gauge its applicability to their own use cases.
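
As a rough illustration of the end-to-end item above, the Python sketch below (a hypothetical micro-benchmark of our own, not an actual SPEC workload) pushes the same payload through a compute stage, a memory-copy stage, and a storage stage, and reports one timing per stage instead of a single headline number.

```python
# Minimal sketch of a multi-stage, whole-system measurement.
# Payload size and stage choices are arbitrary illustrations.
import hashlib
import os
import tempfile
import time

def timed(label, fn):
    """Run fn once and print how long it took."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")

payload = os.urandom(64 * 1024 * 1024)  # 64 MiB of data to push around

# Compute stage: hash the payload (CPU bound).
timed("cpu_hash", lambda: hashlib.sha256(payload).hexdigest())

# Memory stage: copy the payload into a new buffer (bandwidth bound).
timed("memory_copy", lambda: bytearray(payload))

# Storage stage: write the payload out, force it to disk, then read it back.
with tempfile.NamedTemporaryFile() as f:
    timed("disk_write", lambda: (f.write(payload), f.flush(), os.fsync(f.fileno())))
    timed("disk_read", lambda: (f.seek(0), f.read()))
```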

To design such benchmarks, it is useful to keep in mind the following best practices:

  • Work with the widest possible variety of users, ISVs, and hardware vendors to understand which workloads need to be measured, how best to measure them fairly, and what aspects of the hardware and software will impact performance. For example, a transcode application benchmark could report its metric as the total time taken to complete the transcode or as the per-frame throughput in frames per second (fps); a minimal reporting sketch appears after this list.
  • Distill this information into a set of metrics that, if improved, would help the widest range of users. For example, provide metrics that give users detailed insight into how the system behaves when running multiple processes versus multiple tasks within a single process.
  • Don’t rely on a single metric. Users work with systems (e.g., workstations), not individual pieces of hardware (e.g., the CPU). They need to understand how the system will handle their particular workloads, so they need to be able to measure the processes relevant to them. An end-to-end task for a particular use case relies on system or platform performance, including CPU, GPU, memory, disk, and accelerator components. Each system works as a producer-consumer or client-server pipeline, and the performance of each component depends on the others.
  • Pick the right benchmark for measurement based on real use cases. The benchmark should include real-world workloads so there will be no ambiguity when comparing benchmarking results to real-world scenarios.
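
As a minimal sketch of the transcode-metric example above, the Python snippet below reports both the total completion time and the per-frame throughput for a single run; run_transcode and the frame count are placeholders standing in for a real workload.

```python
# Minimal sketch: report two complementary metrics for one (hypothetical) transcode run.
import time

def report_transcode_metrics(run_transcode, frame_count):
    start = time.perf_counter()
    run_transcode()  # placeholder for the actual transcode workload
    elapsed = time.perf_counter() - start
    return {
        "total_seconds": elapsed,                    # "how long until my job finishes"
        "frames_per_second": frame_count / elapsed,  # throughput, for comparing systems
    }

# Example with a stand-in workload that just sleeps for two seconds.
print(report_transcode_metrics(lambda: time.sleep(2.0), frame_count=1_800))
```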

Conclusion

While users benefit the most from unbiased, comprehensive computing system benchmarks, the value for vendors should not be overlooked. Such benchmarks help vendors see where they stand relative to the competition, so they can focus their efforts where they will do the most competitive good. Further, when these benchmarks are based on real application workloads, they enable vendors to work directly with ISVs to improve how their hardware handles specific applications and to detect possible underlying issues. Most important, because these benchmarks enable customers to make the right decisions for their needs, vendors can achieve better customer satisfaction ratings and improve their brand’s reputation.

Still, it is essential for users, and the industry, to understand the limitations of these benchmarks. No benchmark can account for every use case on every type of hardware. Variations in individual components can produce different results even when a system has the same hardware configuration as published benchmark results, and OS versions or updates can also affect performance. With hardware, software, and use cases evolving rapidly, benchmarks may serve only as a foundation for additional research. For example, a benchmark may serve as a baseline from which users can explore the potential benefits that a piece of specialized hardware offered by an individual vendor—such as for AI inferencing and training, or video encoding—may have on their particular use case.

Ross Cunniff is the chair of the Standard Performance Evaluation Corporation’s (SPEC) Graphics Performance Characterization committee. He has more than 35 years of experience in the tech industry, including 25 years with Nvidia, where he serves as a systems software development manager.

Chandra Sakthivel is the chair of SPEC’s Workstation Performance Characterization Committee. He has 22 years of experience in the tech industry and holds more than 21 patents in AI/ML and graphics. He is a workload performance optimization architect at Intel focused on various performance and KPI metrics.