In the past, low-power processors and high-performance processors were clearly distinct. Arm, for example, which originated at Acorn, was built around low power consumption from the start; in the early years of the ARM instruction set, it was hardly considered a competitor to high-performance instruction sets such as x86 or POWER. To most people, these were simply two different industries.
In the low-power camp, even when Apple announced at the 2013 iPhone launch that the A7 chip (Cyclone core) was a ‘desktop-class architecture’, almost no one in the consumer market took it seriously. A mobile processor based on the Arm instruction set labeled ‘desktop-class’? It sounded absurd. It was not until Apple made the M1 a reality, and Arm simultaneously began to gain traction in the data center market, that more people realized: Arm, which started from low power consumption, could also deliver high performance.
On the other side, in high-performance computing, early HPC (High Performance Computing) systems and other truly high-performance devices paid little attention to energy consumption; it was simply a matter of drawing more electricity. Much like desktop computers, whose power draw was never going to rival an air conditioner's, only laptops, which occasionally had to run on battery, worried about power consumption.
However, starting in 2009, more and more literature provided shocking evidence of how much power high-performance systems consume. Some at the time compared HPC power consumption to the output of nuclear power plants, and saving power in supercomputers and cutting electricity costs in data centers became priorities. Apple claimed that the Mac Studio reduces energy consumption by 1,000 kWh per year compared with a high-end desktop; Nvidia stated that if all AI, HPC, and data analytics workloads worldwide ran on Nvidia GPU servers, the annual power savings would be equivalent to the consumption of 2 million cars.
An instruction set that began as low power is now moving toward high performance, while high-performance devices have begun to pursue energy efficiency. Yet historical baggage remains: some instruction sets are still seen as low-power and therefore unsuited to high performance, while others are seen as built for high performance and therefore incapable of low power consumption.
Even after Arm proved it could achieve high performance, many still believe x86 cannot achieve low power consumption. The belief is not surprising: Intel's earlier attempts to break into the mobile market were unsuccessful; on laptop PCs, the battery life of x86 CPUs cannot, for now, match Arm contenders such as Apple and Qualcomm; and the earlier failure of IBM's POWER in laptop processors further reinforced the perception that power consumption is strongly tied to the instruction set.
Is it really so?
The Debate on x86’s Inability to Achieve Low Power Consumption
First, let’s discuss what an instruction set (ISA, Instruction Set Architecture) is. An instruction set naturally refers to a set of instructions. So what are instructions? Operations like addition and subtraction are instructions; 1+1 is a specific instruction. The importance of an instruction set lies fundamentally in the fact that it is an ‘abstraction layer.’
A processor is a type of hardware chip that contains a large number of transistors; these transistors form logic gates, which create functional modules, and these modules combine to form execution units. Various control and execution units together constitute a complete processor microarchitecture, ultimately enabling operations and complex calculations. The question here is, how does the processor, as hardware, communicate with software, or how does software inform the processor of what it needs to do?
Source: Intel
As the ‘abstraction layer’ between software and hardware, the instruction set comes to the forefront. An instruction set is the set of instructions that defines what operations the hardware can execute. It is like a dictionary that mediates communication between software and hardware. On the hardware side, a CPU design must adhere to the instruction set; on the software side, compilers translate code written in high-level languages (such as C or Java) into the machine-code instructions the ISA defines.
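To make the ‘dictionary’ analogy concrete, here is a minimal sketch: the same C source is translated into different machine instructions depending on the target ISA. The assembly shown in the comments is illustrative only; actual output depends on the compiler and optimization flags.

```c
/* A minimal sketch of the ISA as an abstraction layer: identical C source,
   different machine code per target ISA. */
int add(int a, int b) {
    return a + b;
    /* x86-64 (e.g. gcc -O2, System V ABI):
           lea eax, [rdi+rsi]     ; or: mov eax, edi / add eax, esi
           ret
       AArch64:
           add w0, w0, w1
           ret
       Same source, same semantics, different "dictionaries". */
}
```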
Thus, instruction sets like Arm, x86, and RISC-V essentially define the specifications for their processors and ecosystems: if you want to work with us, follow our rules. The x86 camp primarily consists of Intel and AMD; the Arm camp includes Apple, Qualcomm, and Ampere Computing, among others. Given the complexity involved, we are unlikely to dissect the impact of instruction sets on power consumption in a ‘white box’ manner. We can, however, examine the mainstream arguments behind the belief that x86 cannot achieve low power consumption.
The first argument is the age-old debate between CISC (Complex Instruction Set Computing) and RISC (Reduced Instruction Set Computing). x86 is regarded as the classic representative of CISC, while Arm belongs to the RISC family. The idea that CISC suits high performance while RISC excels at low power consumption is long-standing.
The second common argument is that Arm uses fixed-length instructions, while x86 uses variable-length instructions (i.e., instruction lengths are not fixed). Variable-length instructions mean that the CPU must first determine how long an instruction is, complicating the instruction decoding process. As decoding is a crucial step in processor operation, this gives x86 instruction set processors a natural disadvantage in terms of power consumption.
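As a toy illustration of why decoders care about this (not how any real decoder works), the sketch below contrasts finding instruction boundaries in a fixed-length stream with finding them in a variable-length one. The helper instr_length() is a hypothetical stand-in for real x86 length decoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for real x86 length decoding, which must parse prefixes, opcode,
   ModRM/SIB and immediates; returning 1 just keeps the sketch self-contained. */
static size_t instr_length(const uint8_t *p) { (void)p; return 1; }

/* Fixed 4-byte instructions: every boundary is known up front, so many
   decode lanes can start in parallel. */
void boundaries_fixed(size_t n_bytes, size_t *out) {
    for (size_t i = 0; (i + 1) * 4 <= n_bytes; i++)
        out[i] = i * 4;
}

/* Variable-length instructions: boundary i+1 is unknown until instruction i
   has been (at least partially) decoded, a serial dependency that wide
   decoders must work around. */
size_t boundaries_variable(const uint8_t *code, size_t n_bytes, size_t *out) {
    size_t count = 0;
    for (size_t off = 0; off < n_bytes; count++) {
        out[count] = off;
        off += instr_length(code + off);
    }
    return count;
}
```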
Another argument is that x86, as a large instruction set ecosystem with a long history, inevitably drags along the burden of backward compatibility, becoming bloated and inefficient, and hence unable to achieve low power consumption.
The Ancient Legend of RISC
When RISC (Reduced Instruction Set Computing) was first proposed by David Patterson in the 1980s, the initial idea was to simplify the instruction set so that most instructions could execute within a single clock cycle, allowing shorter cycle times and simpler designs, which made CPUs easier, cheaper, and faster to build. The core idea was that, with a given transistor budget, it is better to achieve higher pipeline efficiency by executing many small instructions than to spend resources on a plethora of large, complex ones.
CISC versus RISC has been debated so many times that raising it again feels stale. Nevertheless, it is central to the claimed strong correlation between instruction sets and power consumption. Chips and Cheese published an article last year asking whether Arm or x86 is truly better suited to low power, and it, too, began with the CISC-versus-RISC debate.
Over the course of their history, CISC and RISC have been converging. A frequently cited example is that Intel introduced μops (micro-operations) starting with the Pentium Pro, where the processor breaks a complex instruction into several μops for subsequent execution. The classic illustration: CISC supports adding a number directly to a value in memory, whereas RISC splits this into three steps: load the value from memory, add the number to it, then store the result back to memory. In fact, a modern CISC processor internally decomposes that single instruction into roughly the same three micro-operations.
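Sketched in C with illustrative assembly in the comments (actual compiler output and μop splits vary by toolchain and microarchitecture), the contrast looks roughly like this:

```c
/* Illustrative only; real compiler output and internal μop splits vary. */
void add_to_memory(int *p, int x) {
    *p += x;
    /* x86-64: one CISC-style instruction that both reads and writes memory:
           add dword ptr [rdi], esi
       AArch64: the RISC-style three-step sequence:
           ldr w8, [x0]        ; load the value from memory
           add w8, w8, w1      ; add
           str w8, [x0]        ; store the result back
       Internally, a modern x86 core typically splits that single `add`
       into comparable load / add / store micro-operations anyway. */
}
```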
Perhaps x86's drift toward certain RISC characteristics began even earlier. Conversely, RISC instruction sets have absorbed technologies from the x86 world over decades of development, blurring the boundary between the two. And if CISC and RISC ever really mapped onto high performance versus low power, then IBM's POWER, a RISC instruction set that once shone in the HPC field, shattered that myth long ago.
Still, the CISC myth raises a question: does x86's reliance on μops mean it needs an extra step, extra transistors, and extra power to perform the μop decomposition? This has been one of the main arguments against CISC.
Arm’s Decoding Overhead is Not Small Either
According to Chips and Cheese, modern Arm instruction set CPUs also need to perform μop decomposition operations. Moreover, for many Arm processors (like Marvell’s ThunderX chip), various adjustments in the μop stage are an essential part of architectural updates.
Another typical example is Fugaku, the supercomputer that has drawn much attention in recent years. Fujitsu's A64FX chip, which powers it, is also based on the Arm instruction set, and according to its microarchitecture manual the A64FX likewise decodes large instructions into multiple μops. μops, then, are hardly a ‘patent’ of CISC or x86.
Furthermore, instructions in Arm's SVE (Scalable Vector Extension) and SVE2 extensions are also broken down into large numbers of μops. On the A64FX, for example, the FADDA instruction is split into 63 μops, some with latencies of up to 9 cycles; the chip also supports loading a value from memory, adding to it, and storing the result back in a single instruction (LDADD), which is split into 4 μops. At this point, arguing about CISC versus RISC becomes meaningless.
Another important point: if decode overhead on the Arm architecture were truly so low, Arm would not have introduced μop caches into its microarchitectures. These caches exist precisely to avoid repeated instruction fetch and decode, something x86 processors had already been doing for some time.
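As a rough sketch of the idea (not any vendor's actual design; the sizes, names, and encodings below are invented), a μop cache can be pictured as decoded micro-ops stored keyed by fetch address, so that a hit bypasses fetch and decode entirely:

```c
#include <stdbool.h>
#include <stdint.h>

#define UOP_CACHE_ENTRIES 1536   /* capacity purely illustrative */

typedef struct {
    bool     valid;
    uint64_t fetch_pc;     /* tag: address the entry was decoded from */
    uint32_t uops[4];      /* already-decoded micro-ops (encoding invented) */
    int      uop_count;
} UopCacheEntry;

static UopCacheEntry uop_cache[UOP_CACHE_ENTRIES];

/* On a hit, the pipeline can replay the stored micro-ops and skip the fetch
   and decode stages (and their energy cost); on a miss it falls back to the
   normal fetch/decode path and fills the entry. */
bool uop_cache_lookup(uint64_t pc, const UopCacheEntry **out) {
    const UopCacheEntry *e = &uop_cache[(pc >> 2) % UOP_CACHE_ENTRIES];
    if (e->valid && e->fetch_pc == pc) {
        *out = e;
        return true;
    }
    return false;
}
```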
The Cortex-A77 microarchitecture includes an op cache with 1.5k entries. It is said that Arm invested considerable effort into debugging this design, spending at least six months just on that. Subsequent microarchitectures like Cortex-A78, A710, and larger Cortex-X1/X2 also incorporate this design.
Samsung's in-house Exynos M5 microarchitecture (used in the Exynos 990 chip that powered the Galaxy S20 series) also introduced a μop cache as one of the paths supplying μops to the later pipeline stages, saving power on instruction fetch and decode. Clearly, using CISC versus RISC to argue that RISC has an inherent edge in the fetch-and-decode stage, and therefore that x86 cannot achieve low power consumption, simply does not hold up today.
The Actual Impact of Variable-Length Instructions on x86
Another argument is that CISC's variable-length instructions make decoding inherently less efficient: because instruction lengths are not fixed, the decoder must first work out how long each instruction is, which costs transistors and power.
In fact, the industry has run plenty of studies and experiments on this. Chips and Cheese once disabled the op cache (forcing instruction fetch and decode to happen every time) and found that AMD's Zen 2 core consumed 4-10% more power at the core level and 0.5-6% more at the package level. In theory, if the decode stage alone were isolated, its share of the power would be smaller still.
Moreover, Chips and Cheese emphasized that their tests touched only the L1 cache and did not involve L2, L3, or main memory; had more of the memory hierarchy been exercised, the decode stage's share of power would look even more negligible. In some tests with the op cache disabled, measured power actually decreased, suggesting that decode power was drowned out by other parts of the processor core.
This shows that, once the op cache is enabled, variable-length instructions have minimal impact on decode. More systematic research likewise shows that, for Intel's x86 processors, decode has never been an obstacle to building energy-efficient parts; earlier architectures such as Haswell and Ivy Bridge were already validated in this regard. That said, pursuing truly ‘low power’ may still require digging into every component of the power budget, since even a small share can matter.
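As an aside on methodology: core- and package-level figures like these typically come from hardware energy counters. On Linux, for example, such counters are commonly exposed through the powercap (RAPL) interface. Below is a minimal sketch assuming a typical sysfs path, which varies by system and is not necessarily how Chips and Cheese took their measurements.

```c
#include <stdio.h>
#include <unistd.h>

/* Read a cumulative energy counter (in microjoules) from sysfs. */
static long long read_energy_uj(const char *path) {
    long long uj = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void) {
    /* Package-0 RAPL zone; exact path and zone numbering vary by system,
       and the counter eventually wraps around. Illustrative only. */
    const char *zone = "/sys/class/powercap/intel-rapl:0/energy_uj";
    long long e0 = read_energy_uj(zone);
    sleep(1);                     /* run the workload of interest here */
    long long e1 = read_energy_uj(zone);
    if (e0 >= 0 && e1 >= e0)
        printf("average package power: %.3f W\n", (double)(e1 - e0) / 1e6);
    return 0;
}
```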
The Impact of Ecosystem
When discussing instruction sets, it also involves the issue of ‘extended instruction sets.’ Extended instruction sets can be understood as adding new words to the dictionary — new instructions are designed to execute specific operations more efficiently, utilizing newly emerged processing units within the microarchitecture to enhance the efficiency of certain tasks. Intel’s x86 processors include SSE/AVX, while Arm processors include Neon/SVE, both of which belong to extended instruction sets.
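To make concrete what such an extension buys, here is a hedged sketch using x86 AVX intrinsics from immintrin.h: one 256-bit operation handles eight single-precision floats per iteration where scalar code handles one. The function names are illustrative.

```c
#include <immintrin.h>

/* Scalar version: one float per loop iteration. */
void add_arrays_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* AVX version: eight floats per iteration using 256-bit registers.
   Requires an AVX-capable CPU and compiling with -mavx. A compiler may
   auto-vectorize the scalar loop the same way, but only if the software is
   built (or dispatched) for the newer instructions, which is exactly the
   ecosystem question discussed next. */
void add_arrays_avx(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)            /* scalar tail */
        c[i] = a[i] + b[i];
}
```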
However, this is often an ecosystem question: once new instructions are added, do developers actually use them right away? If not, the new hardware sits idle and efficiency does not improve. PC benchmarking media typically use Cinebench to score processor performance. Why do we always pick the latest version of Cinebench rather than sticking with the outdated Cinebench R15?
Extended instruction sets Intel has added over the years. Source: Intel
The reason is that Cinebench R15 does not use x86's AVX/FMA/AVX2 instructions, so it cannot reflect the gains of newer processors. Even the latest Cinebench R23 still does not use Intel's AVX-512. In truth, few applications genuinely optimize for AVX-512, or even the earlier AVX2, especially in the PC consumer software market.
A particularly interesting example: Apple's M1 Ultra scored lower than Intel's Core i9 in Cinebench R23. Many Apple users argued that Cinebench R23's Arm Neon code path was ‘translated’ from x86 SSE, severely hurting the M1 chip's score. Whether or not that claim is accurate, it at least illustrates that the maturity of the software ecosystem within an instruction set camp has a significant influence on both performance and power efficiency.
More specialized instructions seem to have been the main theme of CPU (and other processor) development in data centers over the past two years, such as the AMX instructions in Intel's Sapphire Rapids, aimed primarily at AI computation. Such additions make efficiency comparisons between different processors even harder.
Setting instruction sets aside and looking a level higher: Apple's M1 Max/Ultra GPU has such abundant resources that it delivers explosive performance and energy efficiency in well-optimized applications. But because Apple's GPU ecosystem is relatively immature, most applications running on the M1's GPU struggle to reach that efficiency, often failing to use even half of its capability. A data scientist from New Zealand recently published an article on Apple's GPU, arguing that its TBDR architecture and 32MB TLB configuration mean most applications fail to fully exploit the architecture's features.
The link to instruction sets here is tenuous, but it is clear that the maturity of the ecosystem, at every level, has to be factored into any discussion of efficiency.
The Key Is Not the Instruction Set Itself
Last year, Jim Keller said in an interview with AnandTech that “debating instruction sets is a tragic matter”; that “a significant part of the core execution work involves just six instructions: load, store, add, subtract, compare, branch”; and that “the instruction set itself really doesn’t matter that much.” He also touched on variable-length instructions, arguing that they are not a significant problem, especially with good instruction-prediction mechanisms in place.
What truly determines processor performance, power, and energy efficiency are implementation-level microarchitecture questions: how the front end feeds the back end, cache design, branch prediction, data prefetching, and so on. Jim Keller pointed out in particular that the key limiter of computer performance today is predictability, covering instruction/branch prediction and data locality.
Over a decade ago, there were already studies introducing low-power processor cores into HPC systems, driven by the uncontrollable power consumption of HPC. During this time, many studies related to the correlation between instruction sets and power consumption emerged. For example, the 2014 paper titled “Energy Efficiency Evaluation of Multi-level Parallelism on Low Power Processors” focused on assessing the performance and power consumption of Intel Atom processors (Bonnell core) versus Arm Cortex-A9 cores. In the tests, Bonnell outperformed Cortex-A9 in both performance and energy efficiency at that time. Of course, this is just an isolated example.
Earlier research titled “Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures” explicitly stated that the differences in power consumption and performance between Arm and x86 processors primarily stem from differences in design goals, and that the instruction set itself has never been a crucial deciding factor. “The differences in instruction sets may influence the final chip design implementation, but modern microarchitecture technology has nearly eliminated these differences; there is no instruction set that is fundamentally more efficient.”
Author: Huang Yefeng
Original from EET Electronic Engineering Magazine