Ampere’s new AmpereOne CPU packs up to 192 cores and an entirely new microarchitecture.
Ampere announced its AmpereOne processors for cloud datacenters this week, the industry’s first general-purpose CPUs with up to 192 cores that can also be utilised for AI inference.
The new chips consume more power than their predecessors, Ampere Altra (which will remain in Ampere’s stable for the foreseeable future), but the company claims that, despite the higher power draw, its processors with up to 192 cores deliver higher computational density than AMD and Intel CPUs. Some of its performance claims, however, are debatable.
192 Custom Cloud Native Cores
Ampere’s AmpereOne processors have 136–192 cores (compared to 32–128 cores for Ampere Altra) that operate at up to 3.0 GHz. They are based on the company’s custom cores implementing the Armv8.6+ instruction set architecture and feature two 128-bit vector units that support FP16, BF16, INT16, and INT8 formats. Additionally, each core has 2MB of 8-way set-associative L2 cache, and the SoC adds a 64MB system-level cache on top of the L1 and L2 caches. The new CPUs are rated for 200W to 350W, up from 40W to 180W for Ampere Altra, depending on the particular SKU.
The company asserts that its new cores have been further optimised for cloud and AI workloads and deliver ‘power-efficient’ instructions per clock (IPC) gains, which most likely means higher IPC (compared to Arm’s Neoverse N1 used for Altra) without a discernible increase in power consumption and die area. Speaking of die area, Ampere does not provide any information on it but does state that AmpereOne is produced using a TSMC 5nm-class manufacturing process.
Although Ampere does not reveal all of the details of its AmpereOne core, it does say that it includes a highly accurate L1 data prefetcher (which reduces latency, ensures that the CPU spends less time waiting for data, and reduces system power consumption by minimising memory accesses), refined branch misprediction recovery (the sooner the CPU detects and recovers from a mispredicted branch, the lower the latency and the less power wasted), and sophisticated memory disambiguation.
While the list of AmpereOne core architectural upgrades appears short on paper, these enhancements can greatly increase performance, and they required extensive study (e.g., which factors slow down a cloud datacenter CPU the most?) as well as a lot of engineering effort to put into action.
I/O and Advanced Security
Because the AmpereOne SoC is designed for cloud datacenters, it offers extensive I/O: eight DDR5 channels supporting up to 16 modules and up to 8TB of memory per socket, as well as 128 lanes of PCIe Gen5 with 32 controllers and x4 bifurcation.
Reliability, availability, serviceability (RAS), and security features are also required in datacenters. To that end, the SoC fully supports, among other things, ECC memory, single-key memory encryption, memory tagging, secure virtualization, and layered virtualization. AmpereOne also contains a number of security features, such as crypto and entropy accelerators, speculative side-channel attack mitigations, and ROP/JOP attack mitigations.
Curious Benchmark Results
Without question, Ampere’s AmpereOne SoC is an outstanding piece of silicon intended to tackle cloud workloads and sporting the industry’s first 192 general-purpose cores. Ampere, on the other hand, employs some quite unusual benchmark results to support its claims.
Ampere’s key advantage is the compute density of AmpereOne. A 42U 16.5kW rack filled with 1S machines based on the 192-core AmpereOne SoC can support up to 7926 virtual machines, while a rack powered by AMD’s 96-core EPYC 9654 ‘Genoa’ CPUs can handle 2496 VMs and a rack powered by Intel’s 56-core Xeon Scalable 8480+ ‘Sapphire Rapids’ CPUs can handle 1680 VMs. Within a 16.5kW power budget, this comparison makes a lot of sense.
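The density claim is essentially power arithmetic: within a fixed rack budget, a lower-power CPU lets you fit more servers, and each AmpereOne server carries twice the cores of a Genoa server. A back-of-envelope sketch (the 150W per-server platform overhead is my illustrative assumption, not a figure published by Ampere):

```python
# Rough maths behind a 16.5kW rack comparison. The 150W platform
# overhead (memory, storage, NIC, fans) is an illustrative assumption.

RACK_BUDGET_W = 16_500

def servers_per_rack(cpu_power_w: float, overhead_w: float = 150.0) -> int:
    """1S servers that fit within the rack's power budget."""
    return int(RACK_BUDGET_W // (cpu_power_w + overhead_w))

# AmpereOne at its 350W ceiling vs AMD's EPYC 9654 at its 360W default TDP
ampere = servers_per_rack(350)                    # 33 servers
epyc = servers_per_rack(360)                      # 32 servers
print(ampere * 192, "AmpereOne cores per rack")   # 6336
print(epyc * 96, "EPYC cores per rack")           # 3072
```

Even with nearly identical CPU power ratings, the core count roughly doubles per rack, which is the effect Ampere’s VM figures lean on.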
However, 42U rack power density is increasing, and hyperscalers such as AWS, Google, and Microsoft are prepared for this, especially for their performance-demanding applications. According to an Uptime Institute 2020 survey, 16% of firms deployed conventional 42U racks with rack power density ranging from 20kW to above 50kW. As AMD’s latest and previous-generation CPUs increased their TDPs relative to their predecessors, the number of typical installations with 20kW racks has climbed, not declined.
To show the advantages of its 160-core AmpereOne-based system with 512GB of memory running generative AI (Stable Diffusion) and AI recommenders (DLRM), Ampere compares it against systems based on AMD’s 96-core EPYC 9654 CPU with 256GB of memory (meaning the latter ran in an eight-channel mode, not the 12-channel mode that Genoa supports). The Ampere-based machines generated over 2X as many queries per second for AI recommendations and 2.3X as many frames per second for generative AI.
In this scenario, however, Ampere’s systems crunched data at FP16 precision whereas the AMD-based machines computed at FP32, which is not an apples-to-apples comparison. Additionally, a lot of FP16 workloads are executed on GPUs rather than CPUs, and massively parallel GPUs frequently deliver outstanding performance for workloads such as generative AI and AI recommenders.
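The problem with mixing precisions in a benchmark is easy to demonstrate. A generic NumPy illustration (not a reproduction of Ampere’s tests) of why FP16 and FP32 numbers are not directly comparable:

```python
import numpy as np

# Same data, half the bytes: FP16 halves memory footprint and bandwidth,
# and a 128-bit vector unit fits 8 FP16 lanes vs 4 FP32 lanes, so raw
# throughput figures across precisions measure format as much as silicon.
acts32 = np.ones(1_000_000, dtype=np.float32)
acts16 = acts32.astype(np.float16)
print(acts32.nbytes, acts16.nbytes)   # 4000000 2000000

# The trade-off is accuracy: small contributions vanish in FP16
x32 = np.float32(1.0) + np.float32(1e-4)
x16 = np.float16(1.0) + np.float16(1e-4)
print(x32 == np.float32(1.0))   # False -- FP32 resolves the update
print(x16 == np.float16(1.0))   # True  -- FP16 rounds it away
```

A machine running at FP16 starts with half the memory traffic and twice the vector lanes of an FP32 run, so a 2X result says as much about the chosen precision as about the CPU itself.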
Summary
The AmpereOne general-purpose CPUs from Ampere are the first of their kind in the industry and have up to 192 cores. Strong I/O capabilities, cutting-edge security measures, and instructions-per-clock (IPC) gains are further strengths of these CPUs. They can also handle AI tasks at INT16, INT8, BF16, and FP16 precision.
But when it comes to benchmark results, the company chose to use some dubious methods to support its claims, which casts some doubt on its accomplishments. Having said that, it will be particularly fascinating to see the results of independent tests of AmpereOne-based servers.