High Speed Circuit Design Based on a Hybrid of Conventional and Wave Pipelining
MetadataShow full item record
The increasing capabilities of multimedia appliances demand arithmetic circuits with higher speed and reasonable power dissipation. A common technique to attain those goals is synchronous pipelining, which increases the throughput of a circuit at the expense of longer latency, and it is therefore suitable where throughput takes priority over latency.
Two synchronous pipelining approaches, conventional pipelining and wave pipelining, are commonly employed. Conventional pipelining uses registers to divide the circuit into shorter paths and synchronize among sub-blocks, while wave pipelining uses the delay of combinational elements to perform those tasks. As wave pipelining does not introduce additional registers, in principle, it can attain a higher throughput and lower power consumption. However, its throughput is limited by delay variations, while delay balancing often leads to increased power dissipation.
This dissertation proposes a hybrid pipelining method called HyPipe, which divides the circuit into sub-blocks using conventional pipelining, and applies wave pipelining to each sub-block. Each sub-block is derived from a single base circuit, leading to a better delay balance and greater throughput than with heterogeneous circuits. Another requirement for wave pipelining to achieve high speed is short signal rise and fall times. Since CMOS wide-NAND and wide-NOR gates exhibit long rise and fall times and large delay variations, they should be decomposed. We show that the straightforward decomposition using alternating levels of NAND and NOR gates results in large delay variations. Therefore, we propose a new decomposition method using only one gate type. Our method reduces delay variations by up to 39%, and it is appropriate for wave pipelining based on standard-cells or sea-of-gates.
We laid out a 4x4 HyPipe multiplier as a proof of concept and performed a post-layout SPICE simulation. The multiplier achieves a throughput of 4.17 billion multiplications per second or a clock period of 2.52 four-load inverter delays, which is almost twice the speed of any existing multiplier in the open literature. When the supply voltage is reduced to 1.2 V from 1.8 V, its power consumption is reduced from 76.2 mW to 18.2 mW while performing 2.33 billion multiplications per second.
- Doctoral Dissertations