登录查看更多内容

Using incomplete sub-multipliers for building multipliers on FPGAs

Martin Kumm

Professor at Hochschule Fulda

发布日期: 2024年5月6日

This article describes the core idea of our recent paper "Small Logic-based Multipliers with Incomplete Sub-Multipliers for FPGAs" from Andreas B?ttcher and me. It was accepted for presentation at ARITH 2024 (https://lnkd.in/eMxQHvCZ), a preprint is available on arXiv (https://lnkd.in/eKk8tbpX)

Performing multiplications in a computer follows similar ideas like performing multiplications by hand using paper & pencil. We all learned in school how to perform large multiplications by only knowing the multiplications of a single digit (which we all had to memorize). These results are properly shifted and added to compute the final product.

Machines multiply numbers in a very similar way; partial products are computed which are shifted and added. In the simplest form, a partial product can be a 1x1 bit operation that can be computed using a simple AND gate. The look-up tables of FPGAs can realize Boolean functions of typically 5 to 6 inputs, so a single AND gate is not efficient to implement; the LUT can do much more than that. One efficient way used before is to perform a 3x2 multiplication. It requires 5 inputs and the result ranges from 0 to 7x3=21 (using the maximum input each). Hence, the result requires 5 bits in total, so we need five 5-input LUTs to compute the product. As each of the 6-input LUTs of current FPGAs can realize two 5-input LUTs, we need three 6-input LUTs in total. Note that one of the six possible 5-input LUTs is not usable anymore.

From the view of previous work in multiplier tiling, this multiplier can be seen as a 3x2 tile as on the left of the figure above. From that point of view, the maximum result of the multiplication can be alternatively seen as the sum of the corresponding bit weights in each tile position: 2^0+2^1+2^1+2^2+2^2+2^3=21.

Now, if we remove the position with weight 2^3, the maximum result is 2^0+2^1+2^1+2^2+2^2=13, which can be represented using 4 bits. Hence, only four 5-input LUTs are required that perfectly map to two 6-input LUTs in total. No waste of another 5-input LUT and more efficient.