We describe a heuristic scheduling approach for optimizing floating-point pipelines subject to input port constraints. The objective of our technique is to maximize functional unit reuse while minimizing the following performance metrics in the generated circuit: (1) maximum multiplexer fanin, (2) datapath fanout, (3) number of multiplexers, and (4) number of registers. For a set of systems biology markup language (SBML) benchmark expressions, we compare the resource usages given by our method to those given by a branch-and-bound enumeration of all valid schedules. Compared with the enumeration results, our heuristic requires on average 33.4% less multiplexer bits and 32.9% less register bits than the worse case, while only requiring 14% more multiplexer bits and 4.5% more register bits than the optimal case. We also compare our results against those given by the state-of-art high-level synthesis tool Xilinx AutoESL. For the most complex of our benchmark expressions, our synthesis technique requires 20% less FPGA slices than AutoESL. 1. Introduction Over the past twenty years there has been a wide range of research in the field of high-level synthesis for pipelined datapaths [1–3]. High-level synthesis methods can often be guided by user-specified constraints that result in throughput and area tradeoffs for the generated circuits. For example, designers may wish to generate circuits with as few resources as possible at the expense of increased multiplexer fan-in that generally results in lower maximum clock frequency. On the other hand, designers may want the generated circuits to achieve maximum clock speed without regard to the resultant resource requirement. While most HLS tools consider these types of constraints, few incorporate input port-constraints when optimizing resource usage. In other words, many HLS tools do not allow the designer to explicitly constrain the synthesis process for memory bandwidth. As an example, consider the following arithmetic expression, the phylogenetic likelihood function (PLF), which is widely used in likelihood-based phylogenetic tools and is shown in [4] Written in another way, the PLF for a single symbol can be computed using the following line of high-level code: This expression requires sixteen floating-point inputs. To achieve maximum throughput when implemented as a pipeline, all sixteen inputs must be read and fed into the pipeline in each clock cycle. Since there are no dependencies in this expression, this would allow one output value to be produced every clock cycle after the pipeline fills. However, this
References
[1]
P. Coussy and A. Morawiec, High-Level Synthesis from Algorithm to Digital Circuit, Springer, 2008.
[2]
D. Gajski, N. Dutt, and S. Lin, High Level Synthesis: Introduction to Chip and System Design, Kluwer Academic, 1992.
[3]
P. Michel, U. Lauther, and P. Duzy, The Synthesis Approach to Digital System Design, Kluwer Academic, 1992.
[4]
S. Zierke and J. D. Bakos, “FPGA acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods,” BMC Bioinformatics, vol. 11, article 184, 2010.
[5]
N. Park and A. C. Parker, “SEHWA: a program for synthesis of pipelines,” in Proceedings of the 23rd Design Automation Conference, pp. 454–460, July 1986.
[6]
P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASIC's,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 8, no. 6, pp. 661–679, 1989.
[7]
C. T. Hwang, J. H. Lee, and Y. C. Hsu, “A formal approach to the scheduling problem in high level synthesis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 10, no. 4, pp. 464–475, 1991.
[8]
I. C. Park and C. M. Kyung, “Fast and near optimal scheduling in automatic data path synthesis,” in Proceedings of the 28th ACM/IEEE Design Automation Conference (DAC '91), pp. 680–685, June 1991.
[9]
W. Sun, M. J. Wirthlin, and S. Neuendorffer, “FPGA pipeline synthesis design exploration using module selection and resource sharing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 254–265, 2007.
[10]
R. Scrofano, L. Zhuo, and V. K. Prasanna, “Area-efficient evaluation of a class of arithmetic expressions using deeply pipelined floating-point cores1,” in Proceedings of the 5th International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA '05), pp. 119–128, June 2005.
[11]
T. Ishimori, H. Yamada, Y. Shibata, et al., “Pipeline scheduling with input port constraints for an FPGA-Based biochemical simulator,” in Reconfigurable Computing: Architectures, Tools and Applications, vol. 5453 of Lecture Notes in Computer Science, pp. 368–373, Springer.
[12]
Y. Hu, A. Ghouse, and B. S. Carlson, “Lower bounds on the iteration time and the number of resources for functional pipelined data flow graphs,” in Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD '93), pp. 21–24, October 1993.
[13]
M. Narasimhan and J. Ramanujam, “A fast approach to computing exact solutions to the Resource-Constrained Scheduling problem,” ACM Transactions on Design Automation of Electronic Systems, vol. 6, no. 4, pp. 490–500, 2001.
[14]
Xilinx Floating Operator Data Sheet, Xilinx Inc, 2009.
[15]
F. Kurdahi and A. Parker, “REAL: a program for register allocation,” in Proceedings of the 24th ACM/IEEE Design Automation Conference (DAC '87), pp. 210–215, June 1987.
[16]
C. Y. Huang, Y. S. Chen, Y. L. Lin, and Y. C. Hsu, “Data path allocation based on bipartite weighted matching,” in Proceedings of the 27th ACM/IEEE Design Automation Conference, pp. 499–504, June 1990.
[17]
D. Chen, J. Gong, and Y. Fan, “Low-power high-level synthesis for FPGA architectures,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '03), pp. 134–139, Seoul, Korea, August 2003.
[18]
D. Chen and J. Cong, “Register binding and port assignment for multiplexer optimization,” in Proceedings of the ASP-DAC Asia and South Pacific Design Automation Conference, pp. 68–73, January 2004.