Benchmark (computing)
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. The term ''benchmark'' is also commonly used to refer to elaborately designed benchmarking programs themselves. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, for example the floating-point operation performance of a CPU, but the technique is also applicable to software. Software benchmarks are run against, for example, compilers or database management systems (DBMS). Benchmarks provide a method of comparing the performance of various subsystems across different chip and system architectures.


Purpose

As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that allowed comparison of different architectures. For example, Pentium 4 processors generally operated at a higher clock frequency than Athlon XP or PowerPC processors, which did not necessarily translate to more computational power; a processor with a slower clock frequency might perform as well as or even better than a processor operating at a higher frequency. See BogoMips and the megahertz myth.

Benchmarks are designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this with specially created programs that impose the workload on the component. Application benchmarks run real-world programs on the system. While application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks are useful for testing individual components, such as a hard disk or networking device.

Benchmarks are particularly important in CPU design, giving processor architects the ability to measure and make tradeoffs in microarchitectural decisions. For example, if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application. Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance. Prior to 2000, computer and microprocessor architects used SPEC for this purpose, although SPEC's Unix-based benchmarks were quite lengthy and thus unwieldy to use intact.

Computer manufacturers are known to configure their systems to give unrealistically high performance on benchmark tests that is not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace the operation with a faster mathematically equivalent one. However, such a transformation was rarely useful outside the benchmark until the mid-1990s, when RISC and VLIW architectures emphasized the importance of compiler technology as it related to performance. Benchmarks are now regularly used by compiler companies to improve not only their own benchmark scores, but real application performance.

CPUs with many execution units (such as superscalar, VLIW, or reconfigurable computing CPUs) typically have slower clock rates than a sequential CPU with one or two execution units when built from transistors that are just as fast. Nevertheless, CPUs with many execution units often complete real-world and benchmark tasks in less time than the supposedly faster high-clock-rate CPU.

Given the large number of benchmarks available, a manufacturer can usually find at least one benchmark that shows its system outperforming another system; the other system can be shown to excel with a different benchmark. Manufacturers commonly report only those benchmarks (or aspects of benchmarks) that show their products in the best light. They have also been known to misrepresent the significance of benchmarks, again to show their products in the best possible light. Taken together, these practices are called ''bench-marketing''.

Ideally, benchmarks should only substitute for real applications if the application is unavailable, or too difficult or costly to port to a specific processor or computer system. If performance is critical, the only benchmark that matters is the target environment's application suite.
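The synthetic-versus-application distinction above can be made concrete with a minimal sketch. The snippet below times a tight loop of floating-point multiply-adds, the essence of a synthetic FLOPS-style test; a real benchmark would use compiled code, and in an interpreted language the figure mostly measures interpreter overhead, so it is illustrative only:

```python
import time

def synthetic_flops(n=1_000_000):
    """Time a tight loop of floating-point multiply-adds (a synthetic workload)."""
    x = 1.0000001
    acc = 0.0
    start = time.perf_counter()
    for _ in range(n):
        acc = acc * x + 1.0  # one multiply and one add per iteration
    elapsed = time.perf_counter() - start
    return (2 * n) / elapsed  # rough operations-per-second figure

rate = synthetic_flops()
print(f"{rate:.3e} flop/s (interpreter overhead dominates; illustrative only)")
```

An application benchmark, by contrast, would run an entire real program (a compiler build, a database workload) and time that instead.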


Functionality

Features of benchmarking software may include recording and exporting the course of performance to a spreadsheet file, visualization such as drawing line graphs or color-coded tiles, and pausing the process so it can be resumed without starting over. Software can also have features specific to its purpose; for example, disk benchmarking software may be able to restrict measurement to a specified range of the disk rather than the full disk, measure random-access read speed and latency, offer a "quick scan" feature which estimates speed from samples of specified intervals and sizes, and allow specifying a data block size, meaning the number of bytes requested per read request.
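A minimal sketch of the block-size feature described above: read a file in fixed-size blocks and report throughput. This is an illustrative assumption of how such a measurement might look, not a production disk benchmark; real tools bypass the operating system's page cache, which this sketch does not, so the number it prints is inflated by caching:

```python
import os
import tempfile
import time

def sequential_read_speed(path, block_size=1024 * 1024):
    """Read a file in fixed-size blocks and return throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6  # OS caching inflates this; illustrative only

# Create a small scratch file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))  # 4 MiB of random data
speed = sequential_read_speed(tmp.name, block_size=64 * 1024)
print(f"{speed:.1f} MB/s at 64 KiB block size")
os.unlink(tmp.name)
```

Varying `block_size` in such a loop is exactly how a disk benchmark exposes the effect of request size on throughput.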


Challenges

Benchmarking is not easy and often involves several iterative rounds in order to arrive at predictable, useful conclusions. Interpretation of benchmarking data is also extraordinarily difficult. Here is a partial list of common challenges:

* Vendors tend to tune their products specifically for industry-standard benchmarks. Norton SysInfo (SI) is particularly easy to tune for, since it is mainly biased toward the speed of multiple operations. Use extreme caution in interpreting such results.
* Some vendors have been accused of "cheating" at benchmarks: doing things that give much higher benchmark numbers but make things worse on the actual likely workload.
* Many benchmarks focus entirely on the speed of computational performance, neglecting other important features of a computer system, such as:
** Qualities of service aside from raw performance. Examples of unmeasured qualities of service include security, availability, reliability, execution integrity, serviceability, and scalability (especially the ability to quickly and nondisruptively add or reallocate capacity). There are often real trade-offs among these qualities of service, and all are important in business computing. Transaction Processing Performance Council benchmark specifications partially address these concerns by specifying ACID property tests, database scalability rules, and service level requirements.
** Total cost of ownership, which benchmarks in general do not measure. Transaction Processing Performance Council benchmark specifications partially address this concern by requiring that a price/performance metric be reported in addition to a raw performance metric, using a simplified TCO formula. However, the costs are necessarily only partial, and vendors have been known to price specifically (and only) for the benchmark, designing a highly specific "benchmark special" configuration with an artificially low price. Even a tiny deviation from the benchmark package then results in a much higher real-world price.
** Facilities burden (space, power, and cooling). When more power is used, a portable system has a shorter battery life and requires recharging more often, and a server that consumes more power or space may not fit within existing data center resource constraints, including cooling limitations. There are real trade-offs, as most semiconductors require more power to switch faster. See also performance per watt.
** Code density, which in some embedded systems, where memory is a significant cost, can significantly reduce costs.
* Vendor benchmarks tend to ignore requirements for development, test, and disaster recovery computing capacity. Vendors prefer to report only what is narrowly required for production capacity, to make the initial acquisition price seem as low as possible.
* Benchmarks have trouble adapting to widely distributed servers, particularly those with extra sensitivity to network topologies. The emergence of grid computing, in particular, complicates benchmarking, since some workloads are "grid friendly" while others are not.
* Users can have very different perceptions of performance than benchmarks may suggest. In particular, users appreciate predictability: servers that always meet or exceed service level agreements. Benchmarks tend to emphasize mean scores (an IT perspective), rather than maximum worst-case response times (a real-time computing perspective) or low standard deviations (a user perspective).
* Many server architectures degrade dramatically at high (near 100%) levels of usage, "falling off a cliff", and benchmarks should (but often do not) take that factor into account. Vendors, in particular, tend to publish server benchmarks at about 80% continuous usage, an unrealistic situation, and do not document what happens to the overall system when demand spikes beyond that level.
* Many benchmarks focus on one application, or even one application tier, to the exclusion of other applications. Most data centers now implement virtualization extensively for a variety of reasons, and benchmarking is still catching up to that reality, in which multiple applications and application tiers run concurrently on consolidated servers.
* There are few (if any) high-quality benchmarks that help measure the performance of batch computing, especially high-volume concurrent batch and online computing. Batch computing tends to be much more focused on the predictability of completing long-running tasks correctly before deadlines, such as end of month or end of fiscal year. Many important core business processes, such as billing, are batch-oriented and probably always will be.
* Benchmarking institutions often disregard or do not follow basic scientific method. This includes, but is not limited to: small sample size, lack of variable control, and the limited repeatability of results.
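The point above about mean scores versus worst-case response times can be made concrete with a small sketch. The two latency distributions below are invented for illustration: they have similar means, yet one would badly violate a service level agreement while the other would not.

```python
import statistics

def latency_report(samples_ms):
    """Summarize latencies three ways: mean (IT view), worst case
    (real-time view), and spread (user-predictability view)."""
    ordered = sorted(samples_ms)
    return {
        "mean": statistics.mean(ordered),
        "worst": ordered[-1],
        "stdev": statistics.pstdev(ordered),
    }

# Two hypothetical servers with similar mean latency but very different tails.
steady = latency_report([10.0] * 99 + [12.0])        # mean 10.02 ms
spiky = latency_report([5.0] * 90 + [60.0] * 10)     # mean 10.50 ms
print("steady:", steady)  # low worst case, low spread
print("spiky: ", spiky)   # similar mean, far worse tail
```

A benchmark that reports only the mean would rank these two servers as near-equals; one that also reports worst case and standard deviation would not.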


Benchmarking Principles

There are seven vital characteristics for benchmarks. These key properties are:
# Relevance: benchmarks should measure relatively vital features.
# Representativeness: benchmark performance metrics should be broadly accepted by industry and academia.
# Equity: all systems should be fairly compared.
# Repeatability: benchmark results can be verified.
# Cost-effectiveness: benchmark tests are economical.
# Scalability: benchmark tests should work across systems possessing a range of resources from low to high.
# Transparency: benchmark metrics should be easy to understand.
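Repeatability in particular can be checked mechanically: run the same test several times and report the spread as well as the mean, so that an independent party can tell whether a single result is reproducible. A minimal sketch (the workload here is an arbitrary stand-in):

```python
import statistics
import time

def repeat_benchmark(fn, runs=5):
    """Run fn several times; return (mean, stdev) of the wall-clock times,
    so variability is reported alongside the headline number."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

mean_s, stdev_s = repeat_benchmark(lambda: sum(range(100_000)))
print(f"mean {mean_s * 1e3:.2f} ms, stdev {stdev_s * 1e3:.2f} ms over 5 runs")
```

A large standard deviation relative to the mean is a warning sign that a single published score should not be trusted.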


Types of benchmark

# Real program
#* word processing software
#* tool software of CAD
#* user's application software (i.e., MIS)
#* video games
#* compilers building a large project, for example the Chromium browser or the Linux kernel
# Component benchmark / microbenchmark
#* core routine consisting of a relatively small and specific piece of code
#* measures performance of a computer's basic components
#* may be used for automatic detection of a computer's hardware parameters, such as number of registers, cache size, and memory latency
# Kernel
#* contains key codes
#* normally abstracted from an actual program
#* popular kernels: the Livermore loops and the LINPACK benchmark (basic linear algebra subroutines written in FORTRAN)
#* results are represented in Mflop/s
# Synthetic benchmark
#* procedure for programming a synthetic benchmark:
#** take statistics of all types of operations from many application programs
#** get the proportion of each operation
#** write a program based on the proportions above
#* examples: Whetstone and Dhrystone, the first general-purpose industry-standard computer benchmarks; they do not necessarily obtain high scores on modern pipelined computers
# I/O benchmarks
# Database benchmarks
#* measure the throughput and response times of database management systems (DBMS)
# Parallel benchmarks
#* used on machines with multiple cores and/or processors, or systems consisting of multiple machines
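The difference between a microbenchmark (one isolated basic operation) and a kernel benchmark (a small routine extracted from a real program) can be sketched with the standard library's `timeit` module. The inner-product loop below stands in for a LINPACK-style kernel; both numbers are illustrative, not comparable to published scores:

```python
import timeit

# Microbenchmark: isolate a single basic operation.
int_add = timeit.timeit("a + b", setup="a, b = 3, 4", number=1_000_000)

# Kernel benchmark: time a small extracted routine (here, an inner product).
dot_kernel = timeit.timeit(
    "sum(x * y for x, y in zip(u, v))",
    setup="u = [1.0] * 256; v = [2.0] * 256",
    number=1_000,
)

print(f"integer add: {int_add:.3f} s per 1e6 ops")
print(f"256-element dot product: {dot_kernel:.3f} s per 1e3 calls")
```

Real kernels such as the Livermore loops follow the same pattern at a larger scale, timing extracted numerical routines rather than single operations.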


Common benchmarks


Industry standard (audited and verifiable)

* Business Applications Performance Corporation (BAPCo)
* Embedded Microprocessor Benchmark Consortium (EEMBC)
* Standard Performance Evaluation Corporation (SPEC), in particular their SPECint and SPECfp
* Transaction Processing Performance Council (TPC): DBMS benchmarks


Open source benchmarks

* AIM Multiuser Benchmark – composed of a list of tests that can be mixed to create a 'load mix' simulating a specific computer function on any UNIX-type OS
* Bonnie++ – filesystem and hard drive benchmark
* BRL-CAD – cross-platform architecture-agnostic benchmark suite based on multithreaded ray tracing performance; baselined against a VAX-11/780; used since 1984 for evaluating relative CPU performance, compiler differences, optimization levels, coherency, architecture differences, and operating system differences
* Collective Knowledge – customizable, cross-platform framework to crowdsource benchmarking and optimization of user workloads (such as deep learning) across hardware provided by volunteers
* Coremark – embedded computing benchmark
* DEISA Benchmark Suite – scientific HPC applications benchmark
* Dhrystone – integer arithmetic performance, often reported in DMIPS (Dhrystone millions of instructions per second)
* DiskSpd – command-line tool for storage benchmarking that generates a variety of requests against computer files, partitions, or storage devices
* Fhourstones – an integer benchmark
* HINT – designed to measure overall CPU and memory performance
* Iometer – I/O subsystem measurement and characterization tool for single and clustered systems
* IOzone – filesystem benchmark
* LINPACK benchmarks – traditionally used to measure FLOPS
* Livermore loops
* NAS parallel benchmarks
* NBench – synthetic benchmark suite measuring performance of integer arithmetic, memory operations, and floating-point arithmetic
* PAL – a benchmark for realtime physics engines
* PerfKit Benchmarker – a set of benchmarks to measure and compare cloud offerings
* Phoronix Test Suite – open-source cross-platform benchmarking suite for Linux, OpenSolaris, FreeBSD, macOS, and Windows; includes a number of the other benchmarks on this page to simplify execution
* POV-Ray – 3D render
* Tak (function) – a simple benchmark used to test recursion performance
* TATP Benchmark – Telecommunication Application Transaction Processing benchmark
* TPoX – an XML transaction processing benchmark for XML databases
* VUP (VAX unit of performance) – also called VAX MIPS
* Whetstone – floating-point arithmetic performance, often reported in millions of Whetstone instructions per second (MWIPS)


Microsoft Windows benchmarks

* BAPCo: MobileMark, SYSmark, WebMark
* CrystalDiskMark
* Futuremark: 3DMark, PCMark
* Heaven Benchmark
* PiFast
* Superposition Benchmark
* Super PI
* SuperPrime
* Valley Benchmark
* Whetstone
* Windows System Assessment Tool, included with Windows Vista and later releases, providing an index for consumers to rate their systems easily
* Worldbench (discontinued)


Others

* AnTuTu – commonly used on phones and ARM-based devices
* Geekbench – a cross-platform benchmark for Windows, Linux, macOS, iOS, and Android
* iCOMP – the Intel comparative microprocessor performance, published by Intel
* Khornerstone
* Performance Rating – modeling scheme used by AMD and Cyrix to reflect relative performance, usually compared to competing products
* SunSpider – a browser speed test
* VMmark – a virtualization benchmark suite


See also

* Benchmarking (business perspective)
* Figure of merit
* Lossless compression benchmarks
* Performance Counter Monitor
* Test suite – a collection of test cases intended to show that a software program has some specified set of behaviors

