Chips

Multiplexers, adders, MACs, RTL-to-GDS, and the SRAM vs. register file trade-off.

Gates and arithmetic

  • Selection — a ternary operator for two inputs.

    3-to-1 MUX as AND/OR tree on select signals
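As a minimal sketch of the two-input case described on this card, the selection can be written with only AND, OR, and NOT on the select signal. The names `mux2` and `mux2_word` are hypothetical and chosen for illustration; bits are modelled as Python integers 0 and 1.

```python
def mux2(sel: int, a: int, b: int) -> int:
    """2-to-1 mux on single bits: behaves like `a if sel else b`."""
    not_sel = sel ^ 1                  # NOT on a single bit
    return (sel & a) | (not_sel & b)   # AND each input with its enable, then OR


def mux2_word(sel: int, a: int, b: int, width: int = 8) -> int:
    """The same mux replicated once per bit of a `width`-bit word."""
    out = 0
    for i in range(width):
        bit = mux2(sel, (a >> i) & 1, (b >> i) & 1)
        out |= bit << i
    return out
```

A wider mux (more inputs) would follow the same pattern: decode the select signals into one enable per input, AND each input with its enable, and OR the results.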

  • A full adder considers the carry-in; a half adder doesn't.

  • A horizontal chain of 4 full adders. Each takes a pair of bits plus the previous adder's carry-out as its carry-in, and emits a sum bit plus a carry-out that feeds the next adder.

    4-bit ripple-carry adder
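The chain described above can be sketched in software, one full adder per bit position. This is an illustrative model only, not a hardware description; the function names are hypothetical, and building the full adder from two half adders is one common construction among several.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two bits with no carry-in. Returns (sum, carry_out)."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Add two bits plus a carry-in. Returns (sum, carry_out)."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2


def ripple_carry_add(x: int, y: int, width: int = 4, cin: int = 0) -> tuple[int, int]:
    """Chain `width` full adders; each carry-out feeds the next carry-in."""
    carry = cin
    total = 0
    for i in range(width):  # bit 0 (least significant) first
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry     # `carry` is the carry-out of the last adder
```

For example, `ripple_carry_add(0b1010, 0b0110)` returns `(0, 1)`: 10 + 6 = 16 does not fit in 4 bits, so the sum bits are all zero and the final carry-out is set.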

  • Consider the simpler 4×4 case: $1010 \cdot 1110 + z$ (with $z$ the accumulate).

    This breaks down into

    $1 \cdot (1110 \ll 3) + 0 \cdot (1110 \ll 2) + 1 \cdot (1110 \ll 1) + 0 \cdot 1110 + z.$

    Each of those is in turn a mux:

    $(1\,?\,1110 \ll 3 : 0) + (0\,?\,1110 \ll 2 : 0) + (1\,?\,1110 \ll 1 : 0) + (0\,?\,1110 : 0) + z.$

    So you need as many adds and muxes as there are bits in the first operand.

    Both the adders and the muxes have to be twice as wide as the widest operand, since an 8-bit × 8-bit product can be 16 bits wide.

    $(8 \cdot 0.3 \cdot 16) + (8 \cdot 1.0 \cdot 16) = 8 \cdot 1.3 \cdot 16 = 166.4.$
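The mux-then-add decomposition on this card can be sketched as a shift-and-add loop. The function name `mac` and the choice to truncate the result to twice the operand width are assumptions made for illustration; a real datapath might keep extra accumulator bits instead.

```python
def mac(a: int, b: int, z: int, width: int = 4) -> int:
    """Multiply-accumulate a * b + z using one mux and one add per bit of `a`."""
    out_mask = (1 << (2 * width)) - 1   # result is kept 2 * width bits wide
    acc = z & out_mask
    for i in range(width):
        a_bit = (a >> i) & 1
        partial = (b << i) if a_bit else 0   # mux: shifted b, or zero
        acc = (acc + partial) & out_mask     # add into the running sum
    return acc


print(mac(0b1010, 0b1110, 0))  # 10 * 14 = 140
```

The loop runs `width` times, which matches the card's count of one add and one mux per bit of the first operand.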

  • For two length-$N$ vectors, run $N$ sequential FMAs: $\text{acc} \leftarrow a_i \cdot b_i + \text{acc}$ for $i = 0, \dots, N - 1$. The accumulator ends up holding $\sum_i a_i b_i$.

    Chained FMAs computing a dot product
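A minimal sketch of the chained-FMA loop, using integers so the arithmetic is exact. The helper `fma` here is a hypothetical stand-in for the hardware unit: a floating-point hardware FMA rounds once, whereas `a * b + acc` on Python floats rounds twice, so this models the data flow rather than the rounding behaviour.

```python
def fma(a: int, b: int, acc: int) -> int:
    """Stand-in for one fused multiply-add: returns a * b + acc."""
    return a * b + acc


def dot(a: list[int], b: list[int]) -> int:
    """Dot product as N sequential FMAs feeding one accumulator."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0
    for a_i, b_i in zip(a, b):
        acc = fma(a_i, b_i, acc)   # each step depends on the previous acc
    return acc


print(dot([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

Because each FMA consumes the accumulator produced by the previous one, the steps form a serial dependency chain of length N.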

RTL to GDS, clock speed, memory

    1. Logic synthesis: compile RTL → technology-independent graph of universal gates (NAND-ish).
    2. Technology mapping: collapse the graph by pattern-matching the larger patterns of TSMC's standard cell library onto it ("regex for trees") → technology-dependent netlist.
    3. Placement: embed cells in 2D minimizing wire length (spring / cost-function optimization).
    4. Routing: connect placed cells with wires on stacked copper layers (Manhattan-only).
    5. GDS-out: lower placement + routing to polygons (doping, etch, metal) → mask data for TSMC.
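
As a rough illustration of the placement step, the cost being minimized can be modelled as total Manhattan wire length over two-pin nets. This is a simplified sketch under stated assumptions: the cell names, coordinates, and nets below are made up, and real placers use richer cost functions (multi-pin nets, congestion, timing).

```python
Point = tuple[int, int]


def manhattan(p: Point, q: Point) -> int:
    """Wire length between two points when wires run only along x or y."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_wire_length(placement: dict[str, Point], nets: list[tuple[str, str]]) -> int:
    """Sum of Manhattan distances over all two-pin nets."""
    return sum(manhattan(placement[a], placement[b]) for a, b in nets)


# Hypothetical three-cell design: A drives B, B drives C.
nets = [("A", "B"), ("B", "C")]
spread_out = {"A": (0, 0), "B": (3, 0), "C": (0, 1)}
in_a_row = {"A": (0, 0), "B": (1, 0), "C": (2, 0)}

print(total_wire_length(spread_out, nets))  # 3 + 4 = 7
print(total_wire_length(in_a_row, nets))    # 1 + 1 = 2
```

A placer would search over cell positions to drive this number down, subject to cells not overlapping.
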
  • Clock speed is set by the critical path: the longest delay (wire plus logic) between any two registers. Two chips on the same process node can have different layouts, and therefore different critical paths and clock speeds.
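
To make the critical-path idea concrete, here is a minimal sketch that finds the longest register-to-register delay in a small acyclic netlist and converts it to a maximum clock frequency. The graph, node names, and delays are invented for illustration, and the model ignores setup time, clock skew, and other margins that a real timing analysis would include.

```python
# node -> list of (next_node, delay_ps); each delay lumps wire and logic cost.
Graph = dict[str, list[tuple[str, int]]]

EXAMPLE: Graph = {
    "reg_a": [("and1", 100)],
    "reg_b": [("and1", 100), ("xor1", 150)],
    "and1": [("or1", 200)],
    "xor1": [("or1", 300)],
    "or1": [("reg_out", 250)],
}


def critical_path_ps(graph: Graph) -> int:
    """Longest total delay along any path in an acyclic graph."""
    memo: dict[str, int] = {}

    def longest_from(node: str) -> int:
        if node not in memo:
            memo[node] = max(
                (delay + longest_from(nxt) for nxt, delay in graph.get(node, [])),
                default=0,  # nodes with no successors (capturing registers)
            )
        return memo[node]

    return max(longest_from(node) for node in graph)


def max_clock_ghz(graph: Graph) -> float:
    """Clock period must cover the critical path: f = 1 / t_critical."""
    return 1000.0 / critical_path_ps(graph)


print(critical_path_ps(EXAMPLE))  # reg_b -> xor1 -> or1 -> reg_out = 700
```

In this example a 700 ps critical path caps the clock at roughly 1.43 GHz; shortening the `xor1` wire or logic would raise it until the other path (550 ps) becomes the limit.
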

  • SRAM is much smaller per bit but rigid (1R xor 1W per cycle); register files are bigger per bit but flexible (cheap to add ports, no minimum size).

  • Because of how it's implemented: an SRAM is a hand-designed macro with exactly one decoder + one set of bit lines wired into the grid.

    SRAM array layout: one decoder, shared word/bit lines