Inside Benchmarking: Hardware Testing in the 21st Century

Because of his background in assembly-level programming and low-level C coding on machines with slow CPUs and very little RAM, David Wren, PassMark Software’s founder, had a strong interest in coding efficiency and computer performance. It was from this interest that PassMark’s benchmark application, PerformanceTest, was born in 1998. At that time, many of the available performance-benchmarking utilities were narrowly focused and were only suitable for use by other programmers. Many of the utilities, for example, were distributed as source code and required compilation into an executable form by the user before they could be used.

With its broad focus and ease of use, PerformanceTest quickly became popular on the Internet and has since become one of the industry-standard benchmarking suites. PC technology changes quickly, however, so the software used to benchmark PCs must keep pace. As much as we would have liked to include every possible test in our benchmarking software from day one, common sense told us to focus first on CPU and disk tests. Our challenge, then, was not just keeping pace with new hardware but also broadening the range of components that could be tested.

Just about every PC component has undergone significant changes in the last few years, so it has been important to release new benchmarking software to take these changes into account. For example, CPU manufacturers have been in a twofold race to boost raw clock speed and to increase the number of useful instructions executed per clock cycle. Recently, Intel has been focused on achieving maximum clock speed, whereas AMD has been more focused on getting more processing done per cycle.
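The trade-off between the two approaches follows from a simple relationship: instruction throughput is the product of instructions per cycle (IPC) and clock frequency. The figures below are hypothetical round numbers chosen only to show the arithmetic; they do not describe any actual Intel or AMD part.

```latex
% Instruction throughput as the product of IPC and clock frequency
\text{instructions per second} = \text{IPC} \times f_{\text{clock}}

% Hypothetical design A: higher clock, fewer instructions per cycle
1.0 \times 3.0 \times 10^{9}\,\text{Hz} = 3.0 \times 10^{9}\ \text{instructions/s}

% Hypothetical design B: lower clock, more instructions per cycle
1.5 \times 2.0 \times 10^{9}\,\text{Hz} = 3.0 \times 10^{9}\ \text{instructions/s}
```

Under these assumed numbers, the two designs deliver the same throughput despite a 1 GHz difference in clock speed, which is why clock speed alone is a poor proxy for performance.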

In this race, several technologies have been introduced or updated for the PC, such as multiple physical CPUs, hyperthreading, on-chip caching, instruction pipelining and even developments like MMX and 3DNow. To benchmark these new technologies effectively, benchmark developers have had to build everything from mundane components that can simply detect and report these new technologies to completely new benchmark tests that better represent broader, more efficient uses of a PC.
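As a rough illustration of the "mundane" end of that range, the sketch below reports how many logical processors the system exposes, using only the C++ standard library. It is not PerformanceTest's code, and the function names are made up for this example. Note that a logical processor count alone cannot distinguish multiple physical CPUs from hyperthreading; telling them apart requires platform-specific queries that are not shown here.

```cpp
#include <string>
#include <thread>

// Hypothetical helper: number of logical processors the runtime can see.
// The standard allows this to be 0 when the value cannot be determined.
inline unsigned logical_processor_count() {
    return std::thread::hardware_concurrency();
}

// Hypothetical helper: one-line, human-readable report for a benchmark log.
inline std::string describe_processors() {
    const unsigned n = logical_processor_count();
    if (n == 0) {
        return "logical processors: unknown";
    }
    return "logical processors: " + std::to_string(n);
}
```

A benchmark might print the result of `describe_processors()` alongside its scores so that results from differently configured machines are not compared blindly.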

Reality and Perception

Sometimes, benchmarking new technologies can produce results that initially defy explanation. For example, the first version of PerformanceTest that tested for Intel’s MMX technology gave very poor, inconsistent results. In some cases, performance was only 50 percent of what was expected. The fault turned out to be a compiler bug that resulted in the poor alignment of 64-bit MMX variables in memory — thus highlighting the importance of a good compiler and correct variable alignment.
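The alignment requirement can be made explicit in modern C++ rather than left to the compiler's defaults. The following is a minimal sketch, not the original PerformanceTest code: `Packed64` and `is_aligned` are hypothetical names, and `alignas` is a C++11 feature that was not available to compilers of that era.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 64-bit packed value, the size of one MMX register.
// alignas(8) asks the compiler to place every instance on an 8-byte boundary.
struct alignas(8) Packed64 {
    std::uint8_t lanes[8];
};

// Fail the build, rather than the benchmark, if the layout is not as expected.
static_assert(sizeof(Packed64) == 8, "Packed64 must be exactly 64 bits");
static_assert(alignof(Packed64) == 8, "Packed64 must be 8-byte aligned");

// Returns true when the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A runtime check such as `is_aligned(&value, 8)` before a timed loop is one way to catch a misaligned buffer before it silently distorts the results.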

When you get down and dirty with the details of a PC, such as the BIOS settings, it can be surprising how often a test that works on 999 PCs fails on the next one. To add a little spice, the variation among PCs on the market poses a significant challenge to developing and applying a set of tests that yields a consistent and meaningful benchmark. We have attempted to tame this beast through research and by incorporating significant feedback from our customers.

Today, when you follow the benchmarking guidelines established by benchmark developers, you can achieve useful benchmarking results for just about any PC. Inconsistent results, whether across multiple test runs on one PC or across seemingly identical systems, might signal a need for a more serious techie to track down the cause.
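One way to make "inconsistent" concrete is to compute the coefficient of variation (sample standard deviation divided by the mean) over repeated runs of the same test. The sketch below assumes scores are positive numbers; the function names and the 3 percent threshold are arbitrary choices for illustration, not a published guideline.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficient of variation of repeated benchmark scores:
// sample standard deviation divided by the mean. Needs at least two runs.
inline double coefficient_of_variation(const std::vector<double>& scores) {
    assert(scores.size() >= 2);
    double sum = 0.0;
    for (double s : scores) {
        sum += s;
    }
    const double mean = sum / static_cast<double>(scores.size());

    double squared_deviation = 0.0;
    for (double s : scores) {
        squared_deviation += (s - mean) * (s - mean);
    }
    const double stddev =
        std::sqrt(squared_deviation / static_cast<double>(scores.size() - 1));
    return stddev / mean;
}

// Hypothetical rule of thumb: treat runs as consistent when they vary by no
// more than max_cv. The 0.03 default is an assumption for this example.
inline bool is_consistent(const std::vector<double>& scores,
                          double max_cv = 0.03) {
    return coefficient_of_variation(scores) <= max_cv;
}
```

For example, scores of 99, 100 and 101 pass the check, while scores of 90, 100 and 110 fail it and would justify a closer look at the machine.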

Subtle differences such as disk cluster size, disk fragmentation, device driver versions and hidden background tasks, together with more obvious differences like the file-system type, such as NTFS or FAT, all combine to make the benchmark interpretation more complex. This battle of reality and perception is continuously fought during each of our development cycles.

Synthetic Versus Application

There has been quite a bit of debate lately about synthetic versus application benchmarking — that is, running a series of simulated benchmark tests versus running a real-world application like Microsoft Word as a test. By most people’s reckoning, our benchmarking applications sit in the synthetic-benchmarking realm.

The methodology used by synthetic benchmarks has advantages and disadvantages. A synthetic benchmark typically provides a broader measure, taking into account a large range of possible PC uses, while an application benchmark is narrower, reflecting only a single application. Synthetic benchmarks also tend to be more usable by the mass market and available at lower cost to the user. And they cover many of the same kinds of system functions as application benchmarks, such as rendering and interacting with the Windows graphical user interface, but in a controlled manner.
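To show what "controlled" means in practice, here is a minimal sketch of a synthetic CPU test: a fixed, repeatable integer workload timed with a monotonic clock. It is an illustration only and not how any particular benchmarking product measures CPU speed. The function name is hypothetical, and the multiplier and increment are the well-known 64-bit linear congruential generator constants, used here simply as a workload the compiler cannot skip.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical synthetic integer test: runs a fixed chain of dependent
// multiply-add operations and returns iterations completed per second.
// Returns 0.0 if no iterations are requested or no time could be measured.
inline double integer_ops_per_second(std::uint64_t iterations) {
    if (iterations == 0) {
        return 0.0;
    }
    volatile std::uint64_t sink = 0;  // keeps the result observable
    std::uint64_t x = 1;

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Unsigned arithmetic wraps on overflow, which is well defined.
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    sink = x;
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    return seconds > 0.0 ? static_cast<double>(iterations) / seconds : 0.0;
}
```

Because the workload is identical on every run and every machine, differences in the returned rate reflect the system rather than the test, which is the property an application benchmark cannot easily guarantee.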

The choice of benchmarking method really depends on what you want to achieve. If, for example, you have a system that will be used in a controlled environment, such as 100 PCs in a call center that runs three defined apps, then application benchmarking is probably the better fit. If the system might run an unknown variety of applications, as a home PC does, then broad synthetic benchmarking is better suited to the job.

Hardware Interfacing

There are many different ways in which a benchmarking application can access the hardware it is running on to test its speed. Over the years, additional layers of logic have been added to operating systems, and new runtime environments have been created to take advantage of new technologies. Benchmark programmers typically have had to choose which languages — Java, C, C++ and Delphi, for example — and which application programming interfaces they use. Often, these APIs sit on top of each other to form layers, and the programmer must choose which layer to use as well.

For example, in the past, if you were developing a piece of software and you wanted to achieve good performance on existing hardware, you had to write different software for each graphics card and sound card on the market. But today, Microsoft’s DirectX API has largely removed the necessity to interact directly with the video hardware, and there is hardly a Windows graphics-intensive application on the market that doesn’t use DirectX now. So instead of interacting directly with the hardware, most benchmarking applications test the hardware through this API.

PerformanceTest is written in C++ and assembly language and uses the standard Microsoft Win32 API, including DirectX wherever possible. Using this API gives better compatibility across a wide variety of systems. For example, the same hard disk benchmark works on SCSI, IDE, S-ATA and USB-connected disks. Also, because the Win32 API is used directly or indirectly by all Windows applications, PerformanceTest measures the speed that a user application would actually see, including the effect of the device drivers supplied by the hardware manufacturers.
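The idea of testing through an operating system level interface, rather than talking to the device directly, can be sketched with portable C++ file streams. This is a simplified stand-in and not PerformanceTest's implementation: it uses the standard library instead of Win32 calls, the function name and 64 KB default block size are assumptions, and it measures buffered writes, so operating system caching will inflate the figure compared with the raw device speed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch: writes total_bytes to a scratch file at `path` in
// block_bytes chunks and returns the throughput in megabytes per second.
// Returns -1.0 on invalid arguments, I/O failure, or an unmeasurable time.
// The scratch file is deleted before returning.
inline double write_throughput_mb_per_s(const std::string& path,
                                        std::size_t total_bytes,
                                        std::size_t block_bytes = 64 * 1024) {
    if (total_bytes == 0 || block_bytes == 0) {
        return -1.0;
    }
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return -1.0;
    }
    const std::vector<char> block(block_bytes, 'x');

    const auto start = std::chrono::steady_clock::now();
    std::size_t written = 0;
    while (written < total_bytes) {
        const std::size_t n = std::min(block_bytes, total_bytes - written);
        out.write(block.data(), static_cast<std::streamsize>(n));
        written += n;
    }
    out.flush();
    const auto stop = std::chrono::steady_clock::now();

    const bool ok = static_cast<bool>(out);
    out.close();
    std::remove(path.c_str());  // clean up the scratch file

    const double seconds = std::chrono::duration<double>(stop - start).count();
    if (!ok || seconds <= 0.0) {
        return -1.0;
    }
    return (static_cast<double>(total_bytes) / (1024.0 * 1024.0)) / seconds;
}
```

Because the code only names a file path, the same function runs unchanged whether that path lives on a SCSI, IDE, S-ATA or USB disk, which mirrors the compatibility argument made above.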

User Community

Developers are great at developing, but sometimes lousy at developing products that people want to use and will trust. Because we are independent from the suppliers of PC hardware and software, we believe we are in a good position to incorporate a large amount of diverse, independent customer feedback as it relates to changing existing features and implementing new features. USB benchmarking is one example.

USB ports have become ubiquitous for connecting peripherals. More recently, USB 2.0 has been used to connect high-speed devices, hard drives and removable media drives. As people became more dependent on the speed of these external devices, they started requesting a USB benchmarking product. After a little investigation, it became obvious that several factors can degrade USB performance, including the type of host controller, device drivers, operating system version, connection point to the motherboard and cable length.

In addition, many people simply don’t know whether they have a USB 1.x or USB 2.0 port. After several requests last year, we became enthusiastic about developing a USB 2.0 benchmarking product, which involved hardware development, firmware development and benchmarking software.

Allowing the user community to compare benchmarks is also critical. We encourage our users to submit benchmark results, and we maintain an online Baseline Database of PC benchmarks so users can search for benchmarks of PCs with similar specifications to compare their performance.

Looking to the future, we have just launched a major new version of PerformanceTest, which provides improved benchmarking for multiple physical CPUs, hyperthreading, 2D video rendering and multithreaded hard disk testing. We will also continue to prioritize and incorporate features requested by customers. We have had quite a bit of interest in benchmarking 64-bit CPUs against 32-bit CPUs. This should keep us busy until the next major change in the computing industry, which probably happened yesterday.

Ian Robinson is the Technical Director at PassMark Software. He is responsible for hardware and software development in the area of benchmarking and hardware reliability testing.
