2. ARM Architecture and Processors
ARM does not manufacture silicon devices. Instead, ARM creates microprocessor designs that are licensed to semiconductor companies and original equipment manufacturers (OEMs), who integrate them into system-on-chip (SoC) devices.
To ensure compatibility between different implementations, ARM defines architectural specifications that dictate how compliant products should behave. Processors implementing the ARM architecture must conform to specific versions of the architecture.
The ARM architecture supports implementations across a very wide range of performance points. Its simplicity allows for very small implementations, enabling very low power consumption.
The Cortex-A series processors covered in this book conform to the ARMv7-A architecture. Multiple processors can have different internal implementations and microarchitectures, with different cycle timings and clock speeds, and still conform to the same version of the architecture. Each executes the ARM instruction set defined for that version and is tested through the ARM validation system.
2.1 Architectural profiles
ARM regularly releases new versions of the architecture. These new versions add new features or modify existing behaviors. Such changes are almost always backward compatible, meaning that user code running on older versions of the architecture will continue to run normally on the new versions. Of course, code written to take advantage of new features will not run on older processors lacking those features.
Some system features and behaviors are left implementation-defined across all versions of the architecture. For example, the architecture does not define cache sizes or the cycle timing of individual instructions. These are determined by the individual core and SoC.
Each architecture version can also define optional extensions. These extensions may be implemented in specific implementations of processors. For example, in the ARMv7 architecture, advanced SIMD (NEON) technology is provided as an optional extension.
The ARMv7 architecture also introduces the concept of “profiles.” These profiles are variations of the architecture used to describe processors aimed at different markets and uses.
These profiles include:
- A (Application profile): The application profile defines an architecture aimed at high-performance processors that support a virtual memory system with memory management units (MMUs), enabling them to run fully functional operating systems. It supports both ARM and Thumb instruction sets. ARMv7-A is the application profile implemented by all Cortex-A series processors, as well as by processors developed by companies that have licensed the ARM architecture. As of early 2014, fewer than 3 billion Cortex-A series chips had been shipped.
- R (Real-time profile): The real-time profile defines an architecture aimed at systems that require deterministic timing and low interrupt latency. It does not support virtual memory systems but can use simple memory protection units (MPUs) to protect memory regions.
- M (Microcontroller profile): The microcontroller profile defines an architecture aimed at low-cost systems where low-latency interrupt handling is very important. It uses a different exception handling model from the other profiles and supports only a variant of the Thumb instruction set.
2.2 Architecture history and extensions
The ARM architecture changed relatively little between the first test silicon in the mid-1980s and the first ARM6 and ARM7 devices in the early 1990s. In the first version of the architecture, the ARM1 implemented most of the load, store, and arithmetic operations, along with the exception model and register set. Version 2 added multiply and multiply-accumulate instructions and support for coprocessors, along with other innovations. These earliest processors supported only a 26-bit address space. Version 3 separated the program counter and program status register and introduced new modes to support a 32-bit address space. Version 4 added support for halfword load and store operations, as well as an additional kernel-level privilege mode.
The ARMv4T architecture introduced the Thumb (16-bit) instruction set, implemented by the ARM7TDMI® and ARM9TDMI® processors, which have shipped in the billions.
The ARMv5TE architecture improved support for digital signal processing (DSP) type operations and saturated arithmetic, and enhanced the interoperability of ARM and Thumb.
The ARMv6 architecture included several enhancements, such as support for unaligned memory access, significant changes to the memory architecture, multiprocessor support, and SIMD operations on bytes or halfwords within the 32-bit registers. It also provided several optional extensions, notably Thumb-2 and the security extensions (TrustZone). Thumb-2 extended Thumb into a mixed-length (16-bit and 32-bit) instruction set.
The ARMv7-A architecture made the Thumb-2 extension mandatory and added advanced SIMD extensions (NEON).
For a number of years, ARM used a sequential numbering system for its processors: ARM7 was followed by ARM8, and then by ARM9. Various numbers and letters were appended to the base series to indicate different variants. For example, in the ARM7TDMI processor, T stands for Thumb, D for Debug, M for a fast multiplier, and I for EmbeddedICE.
For the ARMv7 architecture, ARM Limited adopted the brand name Cortex for its processors and used supplementary letters to indicate one of the three profiles (A, R, or M) that the processor supports. The following diagram shows how the different versions of the architecture correspond to different processor implementations. This diagram is not comprehensive and does not include all architecture versions or processor implementations.
In the following diagram, we show the evolution of the architecture over time, illustrating the new features added with each new version. Almost all architectural changes are backward compatible, meaning that software written for the ARMv4T architecture can still run on ARMv7 processors.
2.2.1 DSP multiply-accumulate and saturated arithmetic instructions
These instructions, denoted by the letter E, were added in the ARMv5TE architecture to improve the capabilities of digital signal processing and multimedia software. They provide many variations on signed multiply-accumulate, saturated add and subtract, and count leading zeros, and they are present in all later versions of the architecture. In many cases, this makes it possible to remove a simple standalone DSP from the system.
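The behavior of the saturating and count-leading-zeros operations described above can be modeled in portable C. This is a minimal sketch rather than ARM's definition of the instructions: the function names `sat_add32`, `sat_mac16`, and `clz32` are hypothetical, and whether a compiler maps them onto the ARMv5TE instructions depends on the toolchain and its options.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating signed 32-bit add: clamps to the int32_t range instead of
 * wrapping around on overflow. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Signed 16 x 16 multiply, accumulated into a saturating 32-bit total.
 * (Illustrative helper; it combines a multiply with a saturating add.) */
static int32_t sat_mac16(int32_t acc, int16_t x, int16_t y)
{
    int32_t product = (int32_t)x * (int32_t)y; /* always fits in 32 bits */
    return sat_add32(acc, product);
}

/* Count leading zeros of a 32-bit value; an input of 0 yields 32. */
static unsigned clz32(uint32_t value)
{
    unsigned count = 0;
    if (value == 0) return 32;
    while ((value & 0x80000000u) == 0) {
        value <<= 1;
        count++;
    }
    return count;
}
```

With saturation, an accumulator that reaches the top of its range stays there instead of flipping sign, which is usually the less harmful failure mode for audio and video samples.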
2.2.2 Jazelle
Jazelle-DBX (Direct Bytecode Execution) was introduced in ARMv5TEJ to accelerate Java performance and save energy. The increased memory availability and improvements in the Just-In-Time (JIT) compiler later diminished its value in application processors. As a result, many ARMv7-A processors do not implement this hardware acceleration.
Jazelle-DBX is best suited to providing high-performance Java in systems with very limited memory (for example, feature phones or low-cost embedded applications). In current systems, it is primarily used for backward compatibility.
2.2.3 Thumb Execution Environment (ThumbEE)
ThumbEE was introduced in ARMv7-A and is sometimes referred to as Jazelle-RCT (Runtime Compilation Target). It makes some minor changes to the Thumb instruction set, making it a better target for generating code at runtime in controlled environments, such as managed languages like Java, Dalvik, C#, Python, or Perl.
ThumbEE is intended for use by Just-In-Time (JIT) or Ahead-Of-Time (AOT) compilers, in which case it can reduce the size of the code that needs to be recompiled. ARM no longer recommends the use of ThumbEE.
2.2.4 Thumb
The ARMv7 architecture includes two main instruction sets: the ARM instruction set and the Thumb instruction set. Much of the available functionality is identical in both. The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. Each Thumb instruction is 16 bits long and has a corresponding 32-bit ARM instruction that has the same effect. The primary reason for using Thumb code is to improve code density, that is, to reduce code size. Because it is denser, Thumb code tends to cache better than equivalent ARM code and can reduce the amount of memory required. You can still use the ARM instruction set for specific code sections that require the highest performance.
2.2.5 Thumb-2
Despite rumors to the contrary, there is no separate Thumb-2 instruction set. Thumb-2 technology was introduced in ARMv6T2 and is required in ARMv7. It extends the original 16-bit Thumb instruction set to include 32-bit instructions. The 32-bit Thumb instructions included in ARMv6T2 allow Thumb code to achieve performance similar to ARM code, with code density better than that of pure 16-bit Thumb code.
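Because 16-bit and 32-bit encodings are mixed in one instruction stream, a decoder has to tell from the first halfword how long an instruction is. The sketch below shows one way to express that check in C. It assumes the Thumb-2 encoding rule that a first halfword whose top five bits are 0b11101, 0b11110, or 0b11111 begins a 32-bit instruction; the function name `thumb_insn_length` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return the length in bytes (2 or 4) of a Thumb instruction, given the
 * first halfword fetched from the instruction stream. */
static unsigned thumb_insn_length(uint16_t first_halfword)
{
    unsigned top5 = (unsigned)first_halfword >> 11;   /* bits [15:11] */

    /* 0b11101, 0b11110 and 0b11111 mark the first half of a 32-bit
     * encoding; every other value is a complete 16-bit instruction. */
    if (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu)
        return 4u;
    return 2u;
}
```

For example, `0x4770` (`BX LR`) is reported as a 2-byte instruction, while `0xF000` (the first halfword of a `BL`) is reported as the start of a 4-byte instruction.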
2.2.6 Security Extensions (TrustZone)
The optional security extension, known as TrustZone, was introduced in ARMv6K and has been implemented in all ARM Cortex-A processors. TrustZone provides a separate secure world to isolate sensitive code and data from the normal world that contains the operating system and applications. Thus, software in the secure world is designed to provide secure services to applications in the normal (non-secure) world.
2.2.7 VFP
Before ARMv7, the VFP extension was known as the Vector Floating Point architecture and was used for vector operations. VFP is an extension that implements single-precision and, optionally, double-precision floating-point arithmetic compliant with the ANSI/IEEE Standard for Floating-Point Arithmetic (IEEE 754).
2.2.8 Advanced SIMD (NEON)
ARM NEON technology provides an implementation of the advanced SIMD instruction set and has a separate register file (shared with VFP). Some implementations provide a separate NEON pipeline backend. It supports 8, 16, 32, and 64-bit integers, as well as single-precision (32-bit) floating-point data, which can be operated on as vectors in 64-bit and 128-bit registers.
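To make the idea of operating on a register "as a vector" concrete, the following portable C sketch models one 128-bit register as a set of equally sized lanes and performs a lane-wise add in a loop. It is only an illustration of the data layout: the type `q_reg_model` and the functions `add_u8x16` and `lanes_per_register` are hypothetical, and real NEON code would use intrinsics or assembly so that all lanes are processed by a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit register viewed at each element width. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit lanes  */
    uint16_t u16[8];   /* eight 16-bit lanes   */
    uint32_t u32[4];   /* four 32-bit lanes    */
    uint64_t u64[2];   /* two 64-bit lanes     */
    float    f32[4];   /* four single-precision lanes */
} q_reg_model;

/* Number of lanes for a given register width and element width, in bits. */
static unsigned lanes_per_register(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}

/* Lane-wise add of sixteen 8-bit lanes. Each lane wraps on its own;
 * no carry crosses from one lane into the next. */
static q_reg_model add_u8x16(q_reg_model a, q_reg_model b)
{
    q_reg_model result;
    for (int lane = 0; lane < 16; lane++)
        result.u8[lane] = (uint8_t)(a.u8[lane] + b.u8[lane]);
    return result;
}
```

The same 128 bits therefore hold sixteen 8-bit values or four 32-bit values, and narrower elements mean more work done per operation.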
2.2.9 Coprocessors
The ARM architecture supports the use of coprocessors to extend the instruction set and enhance the functionality of the ARM processor. Sixteen coprocessors are available, numbered 0 to 15. In early ARM cores, dedicated hardware interfaces were provided to allow the connection of external coprocessors. In Cortex-A series processors, only internal coprocessors are supported. These include CP15 (system control) for cache and MMU, CP14 for debugging, and CP10 and CP11 for NEON and VFP operations.
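Software reaches these coprocessors through register-transfer instructions that carry the coprocessor number as a field in the instruction word. As an illustration, the sketch below assembles the 32-bit ARM encoding of an `MRC` (coprocessor-to-register) instruction. The field layout used is the ARMv7-A `MRC` encoding with the "always" condition; the function name `encode_mrc` is hypothetical, and the code only builds the instruction word, it does not execute it.

```c
#include <assert.h>
#include <stdint.h>

/* Build the ARM encoding of:
 *   MRC p<coproc>, <opc1>, R<rt>, c<crn>, c<crm>, <opc2>
 * using the "always" (AL) condition code. */
static uint32_t encode_mrc(unsigned coproc, unsigned opc1, unsigned rt,
                           unsigned crn, unsigned crm, unsigned opc2)
{
    return (0xEu << 28)                /* condition: always              */
         | (0xEu << 24)                /* coprocessor register transfer  */
         | ((opc1   & 0x7u) << 21)     /* coprocessor-specific opcode 1  */
         | (1u << 20)                  /* 1 = MRC (read), 0 = MCR (write) */
         | ((crn    & 0xFu) << 16)     /* coprocessor register CRn       */
         | ((rt     & 0xFu) << 12)     /* ARM destination register       */
         | ((coproc & 0xFu) << 8)      /* coprocessor number, 0 to 15    */
         | ((opc2   & 0x7u) << 5)      /* coprocessor-specific opcode 2  */
         | (1u << 4)
         | (crm     & 0xFu);           /* coprocessor register CRm       */
}
```

For instance, `encode_mrc(15, 0, 0, 0, 0, 0)` yields `0xEE100F10`, the encoding of `MRC p15, 0, r0, c0, c0, 0`, which reads the CP15 main ID register into r0. The 4-bit `coproc` field is why exactly sixteen coprocessors can be addressed.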
2.2.10 Large Physical Address Extension (LPAE)
LPAE is an optional feature in the v7-A architecture, currently supported by Cortex-A7, Cortex-A12, and Cortex-A15 processors. It allows 32-bit processors that could only handle a maximum of 4GB address space to access up to 1TB of address space by translating 32-bit virtual memory addresses to 40-bit physical memory addresses. For more information, see the related content.
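The 4GB and 1TB figures follow directly from the address widths. A small C sketch makes the arithmetic explicit; the names `addressable_bytes`, `GIB`, and `TIB` are introduced here purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)   /* bytes in one gigabyte (2^30) */
#define TIB (UINT64_C(1) << 40)   /* bytes in one terabyte (2^40) */

/* Number of distinct byte addresses for a given address width.
 * Valid for address_bits below 64. */
static uint64_t addressable_bytes(unsigned address_bits)
{
    return UINT64_C(1) << address_bits;
}
```

A 32-bit address reaches `addressable_bytes(32)`, which equals `4 * GIB`, while a 40-bit physical address reaches `addressable_bytes(40)`, which equals `TIB`, a 256-fold increase. The virtual address seen by any one process remains 32 bits wide.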
2.2.11 Virtualization
The ARM virtualization extension is an optional extension of the ARMv7-A architecture profile. The extension supports the use of a hypervisor, a virtual machine monitor, to isolate one operating system from another. When implemented in single-core or multi-core systems, the virtualization extension supports running multiple virtual machines on a single cluster.
2.2.12 big.LITTLE
big.LITTLE processing was introduced in ARMv7, designed to balance the requirements of high performance and power efficiency. big.LITTLE uses high-performance clusters, such as Cortex-A15 processors, combined with energy-efficient clusters, such as Cortex-A7 processors.
The “big” cluster can be used to handle heavy workloads, while the “LITTLE” cluster can manage the majority of workloads in mobile devices.
2.3 Processor properties
In this section, we will explore some ARM processors and identify which processors implement which architecture versions.
2.4 Cortex-A series processors
In this section, we will take a closer look at each processor that implements the ARMv7-A architecture. A summary description is provided for each case, and for more detailed information about each processor, please refer to the related content.
2.4.1 The Cortex-A5 processor
The Cortex-A5 processor is the smallest ARM multicore application processor. Devices based on this processor are typically lower in cost, able to provide Internet services to a wide range of devices from low-cost entry-level smartphones and smart mobile devices to embedded, consumer, and industrial devices.
The Cortex-A5 processor features:
- Full application compatibility with other Cortex-A series processors.
- Support for multiprocessing, with scalable and energy-efficient performance.
- Optional floating-point or NEON unit for media and signal processing.
- A high-performance memory system with cache and memory management units.
- A high-value migration path from older ARM processors.
2.4.2 The Cortex-A7 processor
The ARM Cortex-A7 processor is the most energy-efficient application processor developed by ARM and further solidifies ARM’s leadership in low power for entry-level smartphones, tablets, and other high-end mobile devices.
The Cortex-A7 processor features:
- Same architecture and feature set as the Cortex-A15 processor, supporting big.LITTLE configurations.
- Using 28nm process technology, with an area of less than 0.5mm².
- Full application compatibility with all Cortex-A series processors.
- Closely coupled low-latency second-level cache (up to 4MB).
- Support for floating-point units.
- Support for NEON technology for multimedia and SIMD processing.
2.4.3 The Cortex-A8 processor
The ARM Cortex-A8 processor can scale from 600MHz to over 1GHz. The Cortex-A8 processor meets the requirements of power-optimized mobile devices running at less than 300mW, as well as the demands of performance-optimized consumer applications (requiring 2000 Dhrystone MIPS). It is used in a variety of devices, including Samsung’s S5PC100, Texas Instruments’ OMAP3530, and Freescale’s i.MX515. Whether for high-end feature phones, netbooks, digital TVs, printers, or in-car infotainment systems, the Cortex-A8 processor provides a proven high-performance solution, with millions shipped annually.
The Cortex-A8 processor features:
- Frequency range from 600MHz to over 1GHz.
- High-performance superscalar architecture.
- Support for NEON technology for multimedia and SIMD processing.
- Compatibility with older ARM processors.
2.4.4 The Cortex-A9 processor
The ARM Cortex-A9 processor is an efficient and popular high-performance choice for low-power or thermally constrained and cost-sensitive devices.
Currently, the Cortex-A9 processor is widely used in smartphones, digital TVs, consumer, and enterprise applications. Compared to the Cortex-A8 processor, the Cortex-A9 processor’s performance improves by more than 50%. The Cortex-A9 processor can be configured with up to four cores to provide peak performance when needed. Its configurability and flexibility make the Cortex-A9 processor suitable for various markets and applications.
Devices containing the Cortex-A9 processor include NVIDIA’s dual-core Tegra-2, ST’s SPEAr1300, and TI’s OMAP4 platform.
The Cortex-A9 processor features:
- Support for out-of-order speculative execution pipelines.
- 16, 32, or 64KB four-way set associative L1 cache.
- Support for floating-point units.
- Support for NEON technology for multimedia and SIMD processing.
- Provides hard macro implementation solutions optimized for speed or power consumption.
2.4.5 The Cortex-A12 processor
The Cortex-A12 processor is a high-performance mid-range mobile processing solution designed for mobile applications, such as for smartphones and tablet devices. The Cortex-A12 processor is the successor to the successful Cortex-A9 processor, optimized for mainstream mobile power environments to deliver the best performance and efficiency.
The high performance and high-end feature set of the Cortex-A12 processor make it suitable for a variety of use cases. Mid-range devices can leverage the success of high-end devices and continue to drive growth in the fastest-growing segments of the mobile market.
Architecturally, the Cortex-A12 processor is based on the latest ARMv7-A architecture and has similar extended features to the Cortex-A15 processor.
The Cortex-A12 processor features:
- 40-bit large physical address extension (LPAE), addressing up to 1TB of RAM.
- Full application compatibility with all Cortex-A series processors.
- Support for NEON technology for multimedia and SIMD processing.
- Support for virtualization and TrustZone security technology.
2.4.6 The Cortex-A15 processor
The ARM Cortex-A15 processor is designed to provide unprecedented flexibility and processing power. This processor employs advanced power reduction technologies and is suitable for a wide range of markets, from mobile computing and high-end digital home to servers and wireless infrastructure.
The Cortex-A15 MPCore processor has full application compatibility with all other Cortex-A series processors.
The Cortex-A15 processor features:
- Highly scalable, with performance up to 2.5GHz.
- Full application compatibility with all Cortex-A series processors.
- Support for out-of-order superscalar processing.
- Closely coupled low-latency second-level cache (up to 4MB).
- Support for floating-point units.
- Support for NEON technology for multimedia and SIMD processing.
- Can be implemented as a quad-core hard macro.
2.5 Key architectural points of ARM Cortex-A series processors
Key points shared by Cortex-A series devices include:
- 32-bit RISC cores with 16 × 32-bit visible registers and mode-based register banking.
- Modified Harvard architecture (independent, concurrent access to instructions and data).
- Load/store architecture.
- Thumb-2 technology as standard.
- Support for VFP and NEON.
- Backward compatibility with code from previous ARM processors.
- 4GB virtual address space and at least 4GB physical address space.
- Hardware translation table walking for virtual to physical address translation.
- Virtual page sizes of 4KB, 64KB, 1MB, and 16MB, with cacheability attributes and access permissions set on a per-page basis.
- Support for big-endian and little-endian data access.
- Support for unaligned access with basic load/store instructions.
- Support for symmetric multiprocessing (SMP) on MPCore™ variants, which are multicore versions of Cortex-A series processors, providing full data coherency at the L1 cache level. Automatic propagation of cache and translation lookaside buffer (TLB) maintenance operations enables efficient SMP operation.
- Physically indexed, physically tagged (PIPT) data caches.
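As a small illustration of the page sizes listed above, the sketch below splits a 32-bit virtual address into the base address of its page and the offset within that page. The enumerator and function names are hypothetical; the only facts relied on are that each page size is a power of two (4KB = 2^12, 64KB = 2^16, 1MB = 2^20, 16MB = 2^24) and that pages are aligned to their own size.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of each supported page size (names are illustrative). */
enum {
    PAGE_SHIFT_4KB  = 12,
    PAGE_SHIFT_64KB = 16,
    PAGE_SHIFT_1MB  = 20,
    PAGE_SHIFT_16MB = 24
};

/* Mask covering the offset bits within a page of the given size. */
static uint32_t page_offset_mask(unsigned page_shift)
{
    return (UINT32_C(1) << page_shift) - 1u;
}

/* Address of the first byte of the page containing va. */
static uint32_t page_base(uint32_t va, unsigned page_shift)
{
    return va & ~page_offset_mask(page_shift);
}

/* Byte offset of va within its page. */
static uint32_t page_offset(uint32_t va, unsigned page_shift)
{
    return va & page_offset_mask(page_shift);
}
```

For the address `0x12345678`, a 4KB page gives base `0x12345000` and offset `0x678`, while a 1MB page gives base `0x12300000` and offset `0x45678`. Larger pages leave fewer address bits to translate, at the cost of coarser attributes and permissions.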