The Myth of Benchmarks

In the early 1990s, in the days of Windows 3.11 and before the Intel Pentium, I started looking for a desktop computer system. I was trying to compare 386DX- and 486SX-based computers, and one of the ways to tell them apart in the Windows environment was to use benchmarks.

The simpler benchmarks just timed how quickly one or two system components completed a task. The more sophisticated benchmarks exercised more of the computer; some of the productivity benchmarks involved completing a series of automated tasks using Microsoft Office.

Device Speed

Most of the time, the computers based around the 486SX processor showed up as quicker than those based around the 386DX, but I remember trying a number of computers in a showroom and not seeing a real-world difference for the things I actually needed to do with the software I needed to run.

We have seen a similar benchmark evolution in the Android universe: there are system benchmarks that some websites use to compare different devices, and the results they provide are very much open to interpretation. I too am guilty of using a very crude benchmark on some of the devices that I test.

I’m going to cite a comparison between two real-world devices: the Nexus 4 and the HTC Desire C. In my review of the HTC, I highlighted that the handset has a slow processor and can feel sluggish in use. The example I gave was opening a large document, where the HTC took a dozen seconds compared with just one for the Nexus 4. This test shows us that the Desire C is considerably slower than the Nexus when opening documents.
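
My crude benchmark boils down to nothing more sophisticated than timing a single task. A minimal sketch of the idea in Kotlin, assuming a hypothetical large-document.txt test file:

```kotlin
import java.io.File
import kotlin.system.measureTimeMillis

// A crude "document open" benchmark: time how long a single task takes.
// The file name is a placeholder for a suitably large test document.
fun main() {
    val document = File("large-document.txt")
    val elapsed = measureTimeMillis {
        val text = document.readText()
        println("Loaded ${text.length} characters")
    }
    println("Opening the document took $elapsed ms")
}
```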

How relevant is this to day-to-day usage? Suppose I am opening a file to edit during a twenty-five-minute bus trip. Those extra eleven seconds amount to less than 1% of the 1,500-second journey, which is practically irrelevant. If I am editing a document, I am likely to spend more time pondering the changes than actually editing the text, so the extra time taken to load the document matters even less.

Our perception of how responsive the device is when we are using it is quite important, although some people are more tolerant of a sluggish handset than others.

In the late summer of 2013, benchmarking Android devices made it into the news when AnandTech found Samsung cheating by optimizing its devices for certain benchmark applications. The key point is that the optimization only produced between a 5% and 10% improvement in the benchmark scores. Since then, various benchmarking applications have delisted devices from their rankings because of this kind of optimization.

Samsung isn’t alone in using benchmark optimization techniques; it turns out that most manufacturers are at it. The techniques mostly consist of making the device run at full speed whilst a benchmark application is running. This sort of optimization happens across a lot of industries, and the best parallel I can draw is in motorsport, where an engine might be set up to produce more power for qualifying compared with a less powerful setup for the race.
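
In practice, this often amounted to checking the foreground application’s package name against a list of known benchmarks and lifting the CPU’s clock limits when one matched. A hypothetical sketch of the idea, assuming root access; the package names are illustrative, and the sysfs path is the standard Linux CPU-frequency governor node:

```kotlin
import java.io.File

// Hypothetical benchmark-detection hack. If the foreground application is a
// known benchmark, switch the CPU governor to "performance" so the device
// runs at full clock speed for the duration of the test; otherwise fall
// back to the usual power-saving governor. Writing to this sysfs node
// requires root, and the benchmark package names here are illustrative.
val knownBenchmarks = setOf(
    "com.antutu.ABenchMark",
    "com.example.somebenchmark",
)

const val GOVERNOR_NODE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

fun onForegroundAppChanged(packageName: String) {
    val governor = if (packageName in knownBenchmarks) "performance" else "interactive"
    runCatching { File(GOVERNOR_NODE).writeText(governor) }
}
```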

Just as the qualifying time of a racing car has no real-world relevance to the time it takes you to commute to your office, so the benchmark score has no real-world relevance to the time it takes you to retrieve your smartphone from your pocket, unlock it, read an email and reply.

No, the biggest difference between the Desire C and Nexus 4 when it comes to handling documents is the screen size and resolution. The bigger the screen and the higher the resolution, the easier it is to work with documents.

I should add that manufacturers also perform other optimizations to their software for real-world use rather than for benchmarking purposes. These can include modifying the animation speed so that applications appear to launch quicker, or modifying the Java runtime to shorten application loading times. Manufacturers can also deliberately keep their Android skins lightweight and fast, such as HTC’s efforts with Sense 4 and later. These optimizations make a real-world difference to how the device feels to use but are not reflected in the benchmark scores from hardware tests.
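
The animation tweak, for example, is just a system setting. A small sketch of how a privileged app might shorten it, assuming the WRITE_SECURE_SETTINGS permission (normally reserved for system software):

```kotlin
import android.content.Context
import android.provider.Settings

// Halve the window animation scale so that window transitions finish sooner
// and application launches feel snappier, even though nothing loads faster.
// Writing this setting requires the WRITE_SECURE_SETTINGS permission.
fun shortenAnimations(context: Context) {
    val resolver = context.contentResolver
    val current = Settings.Global.getFloat(
        resolver, Settings.Global.WINDOW_ANIMATION_SCALE, 1.0f
    )
    Settings.Global.putFloat(
        resolver, Settings.Global.WINDOW_ANIMATION_SCALE, current / 2
    )
}
```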

And finally, I never did decide between a 386DX- and a 486SX-based Windows box. Instead, I bought a Commodore Amiga A1200.
