Memory-System Design, Computer-System Design, and Design Consulting
The memory system in any computer is complex, comprising several interacting subsystems that are themselves quite complex;
these subsystems include DRAMs, disks, caches, and their controllers and buses, each available in many possible configurations.
The memory system is where all important information is stored; as such, it is critical that the memory system be robust,
efficient, and responsive. Modern computers are limited by the memory system to a significant degree: for instance, the
performance of supercomputers from Cray and IBM is largely determined by the memory system (Cray observes that
a 20% increase in sustained memory bandwidth yields a 20% increase in system performance,
whereas significant increases in processor speed have only a negligible impact on system performance).
The complexity of modern memory subsystems and the resulting memory system has recently become a significant problem. To
achieve higher performance, modern subsystems incorporate intelligence in the form of transaction scheduling and local caching
of data, and a wide range of configuration parameters is available for design and implementation at all levels of the system.
We have shown that the interactions between these intelligent subsystems and design parameters are both complex and extremely
detrimental; for instance, we have identified "systemic" behaviors that are non-intuitive, caused by unexpected interactions
between otherwise independent subsystems, and responsible for order-of-magnitude shifts in performance and cost. Such
behaviors are detectable only through extremely detailed modeling of the system, that is, through a holistic approach to design.
Our work presents a framework in which to study memory systems in a holistic fashion; today, it has become necessary to
consider circuit-level ramifications of system-level design decisions and, conversely, to consider system-level ramifications of
circuit-level design decisions. Considering the system at such a level of detail allows one to dramatically improve both the
cost and the performance of the system, and it also guarantees that a chosen design will not fall prey to overlooked details,
such as system level assumptions that violate circuit-level requirements.
Our recent research investigates different facets of the memory system to a significant level of detail. We have developed a highly
detailed model of the memory controller and DRAM subsystem; a full-system model that incorporates detailed performance and
power models of caches, DRAMs, and disks and captures both application and operating-system activity; circuit-level models of
popular low-power SRAM designs to study leakage characteristics, power/performance trade-offs, and noise susceptibility; a
lightweight simulator that captures and measures all memory activity out to several trillion instructions, without sampling;
and a set of data-mining applications that significantly stresses the memory system. We have studied the memory system from
many different perspectives, including the circuit level (e.g., power studies of pipelined nanometer caches at the 90nm, 65nm,
45nm and 32nm technology nodes; the noise susceptibility of low-power SRAMs; etc.), the device-architecture level (e.g., DDR3
and DDR4 device-level parameter studies done on behalf of JEDEC 42.3), and the system level (e.g., power and performance
studies of the upcoming Fully Buffered DIMM architecture; memory scheduling for power-limited high-performance DDR3 devices;
application behavior beyond 10 billion instructions; characterization of bioinformatics workloads; etc.). It is our ultimate
goal to transform significantly the way memory-system design is done.
The present course for next-generation large-scale systems does not scale. Power per node is on the order of 100W (a
conservative number); future installations are expected to have hundreds of thousands to millions of nodes. With electricity
costing $1M per megawatt-year, such installations will cost tens of millions of dollars per year, at a minimum, just for the
electricity to compute, never mind the cost of cooling.
In addition, anecdotal evidence suggests that existing architectures, designed for general-purpose computing, are not being
used efficiently. Many supercomputer codes are vector-based and were written for architectures designed from the ground up
for vector arithmetic. The reality is that vector computing is a poor match to general-purpose memory systems, which cannot
sustain multiple simultaneous accesses per cycle. A related point: a recent Sandia study shows a surprising proportion of
code devoted to pointer arithmetic, an operation on which vector computing relies heavily and which vector architectures
perform in hardware, whereas general-purpose computers force software to do it explicitly. Developers ask for more control
over the memory system (e.g., the ability to specify what items should reside where in the cache, and when)--something
difficult to do in hardware-managed caches but standard in embedded architectures.
One way to look at large-scale installations (supercomputers, and most enterprise-computing systems as well) is that they are
the world's highest-performance embedded systems. Like embedded systems and unlike typical general-purpose
systems, these installations tend to run the same software 24x7. Like embedded systems and unlike typical general-purpose
systems, users of these installations will go to great lengths to optimize their software and often write their own operating
systems for the hardware. And, as in embedded systems, efficiency is now a key concern.
It is worth exploring the use of embedded-systems architectures for future large-scale installations. Note that Lawrence
Berkeley is currently doing exactly this, using a custom embedded CPU, and is finding it an attractive solution. Embedded
architectures typically dissipate on the order of 1W per node--up to two orders of magnitude less than
general-purpose systems. Embedded DSPs have memory systems that are designed for vector computing (they resemble Cray
machines more than anything else) and that are intended for software control. Modern DSPs can sustain more than 10 GFlops
per node at a 1W design point (e.g., the TI C6000) or nearly 100 GFlops per node at a 10W design point (e.g., the ClearSpeed
CSX700). If a DSP-based supercomputer can run modern codes at least as well as supercomputers based on general-purpose
architectures, the energy savings could easily translate to fifty million dollars saved per year per installation. Our goal
is to show that this is indeed the case.
We have reason to believe that a DSP-based cluster would perform at least as well as a cluster of general-purpose processors
and would almost certainly perform better on vector-based codes. A recent graduate thesis [Smith 2006] investigated
the use of DSP-based clusters for a moderately high-performance problem: real-time processing of telescope images in an
embedded system designed to control the mirror array of NASA's James Webb Space Telescope.
Design Consulting & Artifacts
Past and present members of our research team have done extraordinary things in the area of memory systems.
We performed Micron's cycle-accurate performance modeling of the Hybrid Memory Cube while it was in development;
we designed the DRAM controller for Cray's Black Widow supercomputer, improving system-level
performance by 20%; we helped define commercial follow-ons to the Fully Buffered DIMM architecture;
we have built test chips for government labs;
we solved a data-movement problem in a commercial ASIC targeting digital video cameras whose tape-out
was months behind schedule; and we have modeled everything from system-level to die-level commercial designs.
Call us if you have questions about your designs or design decisions.
In addition, our research group has developed and released two public-domain tools, one of which has
become very widely used: DRAMsim, a DRAM-system simulator now used within Intel, IBM, Cray, and
numerous academic research groups (see www.ece.umd.edu/DRAMsim), and BioBench, a benchmark
suite of bioinformatics applications (see www.ece.umd.edu/BioBench).
Our group has also developed a full-system simulator that boots Linux and performs a full performance and power
characterization of the entire memory system from cache to DRAM to disk, capturing all
application and operating system activity; the performance modeling has been hardware validated
(see Ph.D. thesis by Nuengwong Tuaycharoen at www.ece.umd.edu/~blj).
Our group has built one of the fastest cache-simulation
tools in the world, which measures cache activity out into the tens of billions of
instructions, and has built a simulation framework for modeling power and heat in
highly integrated systems-on-chip.