Additional pictures show the heart of the project, i.e. the multiplier array; both the schematic and layout are shown.
The top of the hierarchy is the circuit called "full_mult", which contains 3 main subsystems:
in_reg16_neg_buf - There are two instances of this block. It contains the 16-bit registers that hold the incoming X and Y data, the two 16-bit negators, and the buffers that drive the Xn and Yn lines across all the AND subproduct functions of the multiplier (actually 2-input NOR gates).
out_neg_reg32 - the output negator and storage register.
mult_array - The array multiplier, which performs the unsigned multiplication.
full_mult also contains the logic that determines whether the 32-bit result requires negation, and activates that negation when needed.
full_mult is the top of the circuit and was used to generate a netlist for simulation; see the simulation section later in this document.
Now we go into more detail on the array multiplier, starting with a picture of it. The multiplier implements a simple sum of subproducts without much optimization (no carry-save, no carry-lookahead circuits). The layout is a fairly tight circuit, implemented using only 3 layers of metal.
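As a reference for what the array computes, here is a minimal Python sketch of the same sum-of-subproducts scheme. This is a behavioral model only, not the project's tooling; in the actual hardware each subproduct row comes from the undersized NOR gates and is summed by the ripple-carry adder chain.

```python
def array_multiply(x, y, n=16):
    """Unsigned n-bit multiply as a plain sum of shifted partial products,
    mirroring the array structure (no carry-save, no carry-lookahead)."""
    assert 0 <= x < (1 << n) and 0 <= y < (1 << n)
    result = 0
    for i in range(n):
        if (y >> i) & 1:          # row i of AND subproducts (NOR-based in hardware)
            result += x << i      # each row is accumulated by a ripple-carry chain
    return result

print(array_multiply(0x00FF, 0xFF00))   # 255 * 65280 = 16646400
```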
The following shows the layout of the array multiplier. A lot of time was spent trying to get this layout LVS clean, but the array fell short: even though all the submodules were LVS clean, LVS found errors in the middle of the array, which at this point remains a mystery requiring more troubleshooting. Power and ground are shared by mirrored cells along the horizontal axis to create a tighter layout.
The array's size is 219u x 100.93u. The stepped size of each basic cell of the array (i.e. a full adder and an undersized NOR gate) is 13.3u by 6.66u.
Next we see the array in schematic form. A variety of Virtuoso schematic features were tried for this schematic; a brute-force drawing showing all the cells turned out to be the easiest to use.
Here we see a close-up of the same schematic, showing the detailed names. Throughout this project we worked at finding approaches that minimized manual tasks that scale with the size of the array. Using bussed schemes in Virtuoso made many tasks a lot simpler; that is, using buses in the schematics, and iterating pin placement in both the schematics and the layout, simplified the work considerably.
mult_arr Schematic up-close
The mult_quad_cell was built first to resolve the interface between cells. "Edit-in-Place" was used to good advantage to make a tight design using only 3 metal layers for this cell, leaving plenty of metal layers for power routing and other uses.
The quad cell is essentially a repetition of the very simple basic_mult_cell, consisting of a full adder and a small NOR gate, which is also mirrored around the gnd line.
The entire array gets summarized by this symbol in the full_mult top level diagram, taking advantage of bus notation.
The earlier symbol is actually a simplification of the symbol below, which illustrates how the pins were actually placed in the layout. We tried to do this in a manner that avoided manual placement of pins (e.g. as if we were doing a 54-bit multiplier). Because power and ground are shared, it was not possible to get an even span for sequential pins, so we divided the pins into even and odd groups. This diagram shows how patchcords can be used cleverly in Virtuoso to merge multiple buses and get around this problem; the symbol given previously corresponds to this schematic.
We are using 3 different buffers in this design for scaling up fan-out. The table below gives the load (in microns of W) that a buffer with a given number of stages can drive:

Stage   W of load
-----   ---------
  1         3.24
  2         9.72
  3        29.16
  4        87.48
  5       262.44
  6       787.32
  7      2361.96
  8      7085.88
  9     21257.64
 10     63772.92
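The table follows a fan-out-of-3 progression from a 1.08u first-stage input width; a short Python sketch (assumptions: fan-out of 3, base W of 1.08u, both taken from this writeup) reproduces it and gives the stage counts used for the three buffers below:

```python
import math

# Stage k of a fan-out-of-3 buffer chain with a 1.08u first-stage input
# can drive 1.08 * 3**k microns of gate width.
BASE_W, FANOUT = 1.08, 3

def stages_needed(load_w):
    """Smallest stage count whose driveable load covers load_w (microns of W)."""
    return max(1, math.ceil(math.log(load_w / BASE_W, FANOUT)))

for k in range(1, 11):
    print(k, round(BASE_W * FANOUT**k, 2))   # matches the table above

print(stages_needed(16 * (1.08 + 0.27)))     # NOR2 inputs  -> 3 stages
print(stages_needed(138.24))                 # 64 FF clocks -> 5 stages
print(stages_needed(17.28))                  # mux selects  -> 3 stages
```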
These buffers are used to drive the NOR2 gates that make up each subproduct. These NOR gates were made undersized to present smaller loads. Each input thus consists of a W of 1.08u for the P device and a W of 0.270u for the N device, i.e. a total of 1.35u. There are 16 of these driven by each _Xn and _Yn signal, that is, 21.6u worth of width. Using the table above, this indicates a scale-up of about 3 stages.
These buffers drive all the registers. The system currently has to drive 64 master-slave FFs. We are using high-leakage FFs with two half-transmission gates per latch (a very low load). Each FF puts a total W load of 1.08 * 2, or 2.16u, on its clock, so 64 * 2.16 => 138.24u of W. This translates to 4 or 5 stages; with more time we could experiment to optimize this, but we will use 5 stages for the time being. We could let the NOR gate be the first stage, but for simplicity's sake we ignore it and start from the first inverter. This could be optimized somewhat.
These buffers each drive the pos or neg select line of 16 muxes. Each mux consists of a PMOS and an NMOS device, for a total W of 1.08u; driving 16 of them is thus a total W of 17.28u, so we choose 3 stages for these buffers. Below is a picture of the schematic for the 5-stage clock buffer.
We use instance iteration to make the schematic appear simple. Here we have a 16-plex symbol for the inverting d_ibuf; we only have to change the d_ibuf element to change all of them.
data_buf
We implemented the half-adders for the negator by tying one leg of a full adder to gnd, in order to save time. Implementing real half-adders is an optimization that could eventually be done, but it wasn't high on the priority list for this toy project.
Since the adder cannot be iterated because of the carry chain, we implemented it as a 4-bit adder instantiated 4 times hierarchically.
half_adder 16-bit array
As can be seen, all that was necessary was to tie one of the legs to ground.
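A quick behavioral check (a sketch, not project code) confirms that grounding one full-adder input yields exactly the half-adder truth table. Here the carry-in is the grounded leg; the actual schematic may ground a different input, with the same effect:

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def half_adder(a, b):
    """Half adder obtained by tying one full-adder leg to ground."""
    return full_adder(a, b, 0)

# With one leg grounded, the truth table reduces to the half-adder's:
for a in (0, 1):
    for b in (0, 1):
        assert half_adder(a, b) == (a ^ b, a & b)
print("half-adder truth table matches")
```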
For the registers we used a pass-gate implementation. We don't expect this to be the best choice, but it creates a very small FF, and it was fun to play with an alternate approach. The main drawbacks of this approach are, of course, that it is slower and leakier.
Pass-gate M/S FF
simple MS FF
The 2-way 1-bit mux shown here is of course a very useful, flexible circuit. It was used to implement the negator, and also to implement an XOR gate that helps determine whether the output needs to be negated. We bring select and select-bar out to the interface in case both signals already exist.
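The mux-as-XOR trick mentioned above can be sketched behaviorally (a toy model, not the schematic itself): selecting between one operand and its complement with the other operand realizes XOR.

```python
def mux2(sel, d0, d1):
    """Two-way one-bit mux: passes d0 when sel is 0, d1 when sel is 1."""
    return d1 if sel else d0

def xor_from_mux(a, b):
    """XOR built from the same 2-way mux: a selects between b and its complement."""
    return mux2(a, b, b ^ 1)

# Exhaustive check of the 1-bit truth table:
for a in (0, 1):
    for b in (0, 1):
        assert xor_from_mux(a, b) == a ^ b
print("mux-based XOR matches")
```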
Here we show how we turn this into a 16-bit 2-way Mux with an iterated symbol.
Note that many of the schematic elements were implemented using Virtuoso's iterated-symbols feature, which simplifies the schematics considerably. This feature also provides a clever scheme for connecting an array of modules to buses: a single wire going to a bus or array of pins connects to ALL the pins of the bus or array (and vice versa); otherwise a bus must be the same size as the number of instances of the iterated object times the number of pins of the port in that object. Typically this is a 1-bit port of an iterated object, so the bus width corresponds to the number of objects iterated. Arrays of 16 NOR gates, 16 muxes, or 16 buffers can thus be described using a single device: the notation U0<15:0> indicates 16 instances of that device, and the interconnect strategy described above does the rest. Other Virtuoso features, like patchcords, allow two aliases for the same bus; I used these at times to help create the schematic.

For complex elements like the multiplier array, where you try to implement a 2-D structure with somewhat complex interconnect, the approach of using iterated instances and patchcords seems to break down. I tried for a while to implement the array this way and ended up very frustrated after a lot of effort, so I used the brute-force schematic approach instead, which has the benefit of graphically showing what is going on, but can create a very big schematic depending on the size of the multiplier. For a very large multiplier we might want a hierarchical schematic to keep it readable. There may be bugs in the iterated/bus-connect approach for very complex cases.
Clock speed with the non-backannotated design, for the patterns given: 146.7 MHz.
Period of clock: 6.8164 ns.
Power with enable set: 1.952460 mW.
Power with enable not set: 1.379626e-04 mW.
Projected area of finished project:
27817 sq. u = 27817 / 1000000 = 0.028 mm^2
So as a figure of merit for this design we can calculate the following:
power * area * delay^2 = 1.9524 * 0.028 * 6.82^2 => 2.54 mW*mm^2*ns^2
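For convenience, the same arithmetic can be reproduced from the unrounded numbers above (the small difference in the figure of merit comes only from rounding):

```python
period_ns = 6.8164          # best passing clock period
power_mw  = 1.952460        # active power, enable set
area_mm2  = 27817 / 1e6     # projected layout area in mm^2

freq_mhz = 1e3 / period_ns
fom = power_mw * area_mm2 * period_ns**2      # mW * mm^2 * ns^2
print(round(freq_mhz, 1), round(fom, 2))      # 146.7 2.52 (2.54 with the rounded figures)
```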
The clock speed/delay was extracted, as explained below, from a set of test vectors run as described in the next section. Based on the PASS/NO-PASS test, a binary search is performed to find the best clock period at which the test still passes, to a resolution of 25 ps. Note that the clock found is not necessarily the worst case: to guarantee that, we would need to find the worst possible test vector and then rerun the timing extraction shown.
The above system identifies the worst-case test vector of the ones given. The binary search is seeded with two known initial boundary times (a failing time and a passing time). After 6 iterations, the following vector out of the given set was shown to be the bottleneck; the failure and success results are taken from the program's log:
LAST BAD RESULT:
Vec 5: 0000000011111111 1111111100000000 11111111111111100000000100000000
Should have been:
Vec 5: 0000000011111111 1111111100000000 11111111111111110000000100000000
LAST GOOD CLOCK PERIOD: 6.81640625 ns,
LAST BAD CLOCK PERIOD: 6.8046875 ns,
This identifies an operational clock period for this estimate of the worst-case path.
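The search loop described above can be sketched as follows. This is a hedged reconstruction, not the actual program: the `passes` callback stands in for running the netlist simulation and diffing the outputs, and the toy lambda below pretends the true critical period is 6.81 ns.

```python
def find_min_period(passes, t_fail, t_pass, resolution_ps=25):
    """Binary-search the smallest passing clock period between a known
    failing period t_fail and a known passing period t_pass (both in ns)."""
    while (t_pass - t_fail) * 1000 > resolution_ps:
        mid = (t_pass + t_fail) / 2
        if passes(mid):          # stand-in for: simulate netlist, compare outputs
            t_pass = mid
        else:
            t_fail = mid
    return t_pass

# Toy stand-in for the simulator with a hypothetical 6.81 ns critical period:
print(find_min_period(lambda t: t >= 6.81, 6.0, 8.0))
```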
This program was written to generate simulation code for multipliers of arbitrary size. Next we will implement similar code for a Booth multiplier, which is clearly a better-performing design.
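A sketch of what such a vector generator might look like, in the binary format of the log excerpt above. This is hypothetical illustration code, not the actual program used; the function name, seed, and vector count are all made up here.

```python
import random

def gen_vectors(n_bits=16, count=8, seed=0):
    """Generate signed test vectors (X, Y, expected product) for an
    n_bits x n_bits multiplier, as two's-complement binary strings."""
    rng = random.Random(seed)
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1

    def bits(v, w):
        return format(v & ((1 << w) - 1), f'0{w}b')   # w-bit two's complement

    vecs = []
    for _ in range(count):
        x, y = rng.randint(lo, hi), rng.randint(lo, hi)
        vecs.append((bits(x, n_bits), bits(y, n_bits), bits(x * y, 2 * n_bits)))
    return vecs

for x, y, p in gen_vectors():
    print(x, y, p)
```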
Overall UCSC's CMPE222 was a great class.