Memory-System Design, Computer-System Design, and Design Consulting

Memory-System Design

The memory system in any computer is very complex, composed of several interacting subsystems that are themselves quite complex: DRAMs, disks, caches, and their controllers and buses, each available in many possible configurations. The memory system is where all important information is stored; as such, it is critical that the memory system be robust, efficient, and responsive. Modern computers are limited by the memory system to a significant degree: for instance, the performance of supercomputers from Cray and IBM is entirely determined by the memory system (Cray observes that a 20% increase in sustained memory bandwidth yields a 20% increase in system performance, whereas significant increases in processor speed have only a negligible impact on system performance).

The complexity of modern memory subsystems, and of the memory systems built from them, has recently become a significant problem. To achieve higher performance, modern subsystems incorporate intelligence in the form of transaction scheduling and local caching of data, and a wide range of configuration parameters is available for design and implementation at all levels of the system. We have shown that the interactions between these intelligent subsystems and design parameters are both complex and, in many cases, extremely detrimental; for instance, we have identified "systemic" behaviors that are non-intuitive, are caused by unexpected interactions between otherwise independent subsystems, and are responsible for order-of-magnitude shifts in performance and cost. Such behaviors are detectable only through extremely detailed modeling of the entire system--a holistic approach to design.

Our work presents a framework in which to study memory systems in a holistic fashion: today it is necessary to consider the circuit-level ramifications of system-level design decisions and, equally, the system-level ramifications of circuit-level design decisions. Considering the system at this level of detail allows one to improve both the cost and the performance of the system dramatically, and it helps ensure that a chosen design will not fall prey to overlooked details, such as system-level assumptions that violate circuit-level requirements.

Our recent research investigates different facets of the memory system to a significant level of detail. We have developed a highly detailed model of the memory controller and DRAM subsystem; a full-system model that incorporates detailed performance and power models of caches, DRAMs, and disks and captures both application and operating-system activity; circuit-level models of popular low-power SRAM designs to study leakage characteristics, power/performance trade-offs, and noise susceptibility; a lightweight simulator that captures and measures all memory activity out to several trillion instructions, without sampling; and a set of data-mining applications that significantly stresses the memory system. We have studied the memory system from many different perspectives, including the circuit level (e.g., power studies of pipelined nanometer caches at the 90nm, 65nm, 45nm, and 32nm technology nodes; the noise susceptibility of low-power SRAMs; etc.), the device-architecture level (e.g., DDR3 and DDR4 device-level parameter studies done on behalf of JEDEC 42.3), and the system level (e.g., power and performance studies of the upcoming Fully Buffered DIMM architecture; memory scheduling for power-limited high-performance DDR3 devices; application behavior beyond 10 billion instructions; characterization of bioinformatics workloads; etc.). Our ultimate goal is to significantly transform the way memory-system design is done.
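
To give a flavor of the level of detail involved, the sketch below (an illustration only, not our DRAMsim model) treats a single DRAM bank as a small state machine gated by a handful of timing parameters; the parameter values and the one-command-per-cycle simplification are assumptions made for the example.

    /* Minimal single-bank DRAM timing sketch (illustrative only; not DRAMsim).
       Timing parameters are in controller clock cycles and are assumed values. */
    #include <stdio.h>

    enum { tRCD = 10, tRP = 10, tCAS = 10 };  /* ACT->READ, PRE->ACT, READ->data */

    typedef struct {
        int  open_row;      /* -1 means the bank is precharged (no open row)      */
        long ready_cycle;   /* earliest cycle at which the next command may issue */
    } Bank;

    /* Issue a read of 'row' at cycle 'now'; return the cycle the data arrives. */
    static long bank_read(Bank *b, int row, long now)
    {
        long t = (now > b->ready_cycle) ? now : b->ready_cycle;

        if (b->open_row != row) {             /* row miss                          */
            if (b->open_row != -1)
                t += tRP;                     /* precharge the currently open row  */
            t += tRCD;                        /* activate the requested row        */
            b->open_row = row;
        }
        b->ready_cycle = t + 1;               /* simplification: ignores tRAS etc. */
        return t + tCAS;                      /* column-access latency to data     */
    }

    int main(void)
    {
        Bank b = { -1, 0 };
        printf("data at cycle %ld\n", bank_read(&b, 3, 0));   /* bank precharged */
        printf("data at cycle %ld\n", bank_read(&b, 3, 40));  /* row-buffer hit  */
        printf("data at cycle %ld\n", bank_read(&b, 7, 80));  /* row conflict    */
        return 0;
    }

A real controller model tracks many more constraints (tRAS, tFAW, refresh, and bus contention across ranks), and it is precisely among such constraints that the systemic interactions described above arise.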

Computer-System Design

The present course for next-generation large-scale systems does not scale. Power per node is on the order of 100W (a conservative number), and future installations are expected to have hundreds of thousands to millions of nodes. With electricity costing $1M per megawatt-year, such an installation will cost tens of millions of dollars per year, at a minimum, just for the electricity to compute, never mind the cost of cooling.
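
The arithmetic behind that claim is simple; the sketch below works it out using the figures above, with node counts that are assumptions chosen only to bracket the "hundreds of thousands to millions" range.

    /* Back-of-the-envelope electricity cost, using the figures above
       (100 W per node, $1M per megawatt-year). Node counts are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double watts_per_node      = 100.0;
        const double dollars_per_MW_year = 1.0e6;
        const long   node_counts[]       = { 100000, 500000, 1000000 };

        for (size_t i = 0; i < sizeof node_counts / sizeof node_counts[0]; i++) {
            double megawatts = node_counts[i] * watts_per_node / 1.0e6;
            double dollars   = megawatts * dollars_per_MW_year;
            printf("%7ld nodes: %5.1f MW -> $%.0fM per year, compute only\n",
                   node_counts[i], megawatts, dollars / 1.0e6);
        }
        return 0;
    }

At 100,000 nodes the electricity bill is already roughly $10M per year; at a million nodes it approaches $100M.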

In addition, anecdotal evidence suggests that existing architectures, designed for general-purpose computing, are not being used efficiently. Many supercomputer codes are vector-based and were written for architectures designed from the ground up for vector arithmetic. The reality is that vector computing is a poor match for general-purpose memory systems, which cannot sustain multiple simultaneous accesses per cycle. A related point: a recent Sandia study shows a surprisingly large proportion of code doing pointer arithmetic--work that vector codes generate in abundance, that vector architectures perform in hardware, and that general-purpose computers force software to do explicitly. Developers ask for more control over the memory system (e.g., to be able to specify which items should be where in the cache, and when)--something difficult to do with hardware-managed caches but standard in embedded architectures.
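
As a concrete (and hypothetical) illustration of the point about address arithmetic, consider the indexed-gather kernel below: on a scalar, cache-based machine, every element costs an index load, an address computation, and a dependent load in addition to the one useful multiply, whereas a vector machine's gather hardware generates those addresses itself, and a software-managed memory would let the programmer stage the gathered operands explicitly.

    /* Illustration (ours, not from the Sandia study): in an indexed-gather kernel,
       a scalar machine performs explicit address arithmetic for every element;
       a vector machine's gather/strided-load hardware generates those addresses
       itself, leaving only the "real" arithmetic. */
    #include <stdio.h>
    #include <stddef.h>

    /* y[i] = a[idx[i]] * x[i]: one useful multiply per element, plus an
       index load, an address computation, and a dependent load per element. */
    static void gather_scale(double *y, const double *a, const int *idx,
                             const double *x, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            size_t j = (size_t)idx[i];    /* address arithmetic done in software */
            y[i] = a[j] * x[i];
        }
    }

    int main(void)
    {
        double a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, x[4] = { 10, 10, 10, 10 }, y[4];
        int idx[4]  = { 7, 0, 3, 5 };
        gather_scale(y, a, idx, x, 4);
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);   /* 80 10 40 60 */
        return 0;
    }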

One way to look at large-scale installations (supercomputers, and most enterprise-computing systems as well) is that they are the world's highest-performance embedded systems. Like embedded systems and unlike typical general-purpose systems, these installations tend to run the same software 24x7. Like embedded systems and unlike typical general-purpose systems, their users will go to great lengths to optimize their software and often write their own operating systems for the hardware. And, as in embedded systems, efficiency is (or has now become) a key concern.

It is worth exploring the use of embedded-systems architectures for future large-scale installations. Note that Lawrence Berkeley is currently doing exactly this, using a custom embedded CPU, and is finding it an attractive solution. Embedded architectures typically dissipate on the order of 1W per node--up to two orders of magnitude less than general-purpose systems. Embedded DSPs have memory systems that are designed for vector computing (they resemble Cray machines more than anything else) and that are intended for software control. Modern DSPs can sustain more than 10 GFlops per node at a 1W design point (e.g., the TI C6000) or nearly 100 GFlops per node at a 10W design point (e.g., the ClearSpeed CSX700). If a DSP-based supercomputer can run modern codes at least as well as supercomputers based on general-purpose architectures, the energy savings could easily translate to fifty million dollars saved per year per installation. Our goal is to show that this is indeed the case.
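
The sketch below works out the claimed savings, again at $1M per megawatt-year; the 500,000-node installation size is an assumption, and the 1W and 10W figures are the DSP design points cited above.

    /* Rough annual energy-cost comparison, at $1M per megawatt-year.
       The 500,000-node installation size is an assumption; the per-node
       power figures are the design points quoted above. */
    #include <stdio.h>

    static double annual_cost(long nodes, double watts_per_node)
    {
        double megawatts = nodes * watts_per_node / 1.0e6;
        return megawatts * 1.0e6;                /* dollars per year */
    }

    int main(void)
    {
        const long nodes = 500000;
        double gp    = annual_cost(nodes, 100.0);   /* general-purpose node  */
        double dsp10 = annual_cost(nodes,  10.0);   /* 10 W DSP design point */
        double dsp1  = annual_cost(nodes,   1.0);   /*  1 W DSP design point */

        printf("general-purpose: $%.1fM/yr\n", gp / 1.0e6);
        printf("10 W DSP nodes:  $%.1fM/yr  (saves $%.1fM/yr)\n",
               dsp10 / 1.0e6, (gp - dsp10) / 1.0e6);
        printf(" 1 W DSP nodes:  $%.1fM/yr  (saves $%.1fM/yr)\n",
               dsp1 / 1.0e6, (gp - dsp1) / 1.0e6);
        return 0;
    }

At the 1W design point the savings approach $50M per year, consistent with the figure quoted above.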

We have reason to believe that a DSP-based cluster would perform at least as well as a cluster of general-purpose processors and would most certainly perform better on vector-based codes. The thesis of a recent graduate student [Smith 2006] investigated the use of DSP-based clusters for a moderately high-performance problem: real-time processing of telescope images in an embedded system designed to control the mirror array of NASA's James Webb Space Telescope.

Design Consulting & Artifacts

Past and present members of our research team have done extraordinary things in the area of memory systems. We built Micron's cycle-accurate performance model of the Hybrid Memory Cube while it was still in development; we designed the DRAM controller for Cray's Black Widow supercomputer and improved system-level performance by 20%; we helped define commercial follow-ons to the Fully Buffered DIMM architecture; we have built test chips for government labs; we solved a data-movement problem in a commercial ASIC for digital video cameras whose tape-out was months behind schedule; and we have modeled everything from system-level to die-level commercial designs.

Call us if you have questions about your designs or design decisions.

In addition, our research group has developed and released two public-domain packages, one of which has become very widely used: see www.ece.umd.edu/DRAMsim for the DRAM-system simulator, now used within Intel, IBM, Cray, and numerous academic research groups, and www.ece.umd.edu/BioBench for a benchmark suite of bioinformatics applications.

Our group has also developed a full-system simulator that boots Linux and performs a full performance and power characterization of the entire memory system, from cache to DRAM to disk, capturing all application and operating-system activity; the performance modeling has been validated against hardware (see the Ph.D. thesis by Nuengwong Tuaycharoen at www.ece.umd.edu/~blj). Our group has also built one of the fastest cache-simulation tools in the world, which measures cache activity out into the tens of billions of instructions, as well as a simulation framework for modeling power and heat in highly integrated systems-on-chip.
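
For illustration, the core of such a cache-simulation tool is little more than a tag check per memory reference, which is why it can be driven at very high speed. The sketch below is an illustrative toy, not our simulator; its geometry (a 64KB direct-mapped cache with 64-byte lines) and its synthetic trace are assumptions.

    /* Minimal direct-mapped cache hit/miss counter (illustrative sketch only;
       not the group's simulator). Cache geometry values are assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6                       /* 64-byte lines              */
    #define SET_BITS  10                      /* 1024 sets => 64 KB cache   */
    #define NUM_SETS  (1u << SET_BITS)

    static uint64_t tags[NUM_SETS];
    static int      valid[NUM_SETS];
    static uint64_t hits, misses;

    static void cache_access(uint64_t addr)
    {
        uint64_t set = (addr >> LINE_BITS) & (NUM_SETS - 1);
        uint64_t tag = addr >> (LINE_BITS + SET_BITS);

        if (valid[set] && tags[set] == tag) {
            hits++;
        } else {                              /* miss: fill the line */
            misses++;
            valid[set] = 1;
            tags[set]  = tag;
        }
    }

    int main(void)
    {
        for (uint64_t a = 0; a < (1u << 20); a += 8)   /* synthetic streaming trace */
            cache_access(a);
        printf("hits=%llu misses=%llu\n",
               (unsigned long long)hits, (unsigned long long)misses);
        return 0;
    }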