Adaptive Computing System (ACS) Domain for Implementing DSP Algorithms in Reconfigurable Hardware
John C. Zaino, BAE Systems

Over the past three years, a team at BAE SYSTEMS has developed an integrated algorithm analysis and mapping environment for translating a dataflow representation of a signal processing algorithm into an Adaptive Computing System (ACS) consisting of field programmable gate array (FPGAs) devices. This environment allows designers to transform signal processing algorithms into FPGA-based hardware an order of magnitude faster than is currently possible. Our approach has been to focus on three areas of capability critical to the success of adaptive computing: algorithm analysis, algorithm mapping, and smart generators. These capabilities take advantage of the special characteristics of signal processing algorithms to reduce the time to field the ACS implementation, and are being implemented as extensions to the Ptolemy design environment ( developed at the University of California, Berkeley.

Algorithm implementation for ACS requires careful consideration of the appropriate signal representation and the costs of operations. The algorithm analysis capabilities being developed on this program reduce the effort required to find good ACS implementation choices for a signal processing algorithm. The environment provides algorithm designers with information about operation counts, including adds, multiplies, and memory accesses, and with analyses of quantization effects related to ACS implementations. For many DSP problems, reduced precision arithmetic will maintain acceptable system performance. A mapping of an algorithm to an FPGA architecture will be successful if the designer can limit wordlength growth without sacrificing algorithm performance. Wordlength reduction introduces noise into the data stream, so the designer must balance the need for an efficient implementation with output quality. Designing optimal wordlength combinations for dataflow graphs can be very difficult, due to the exponential number of wordlength combinations that must be considered. Naive random sampling approaches are ineffective for large flowgraphs, because most of the samples do not correspond to feasible or useful designs.

In our work, we use a Markov chain Monte Carlo (MCMC) sampling approach [1,2] to finding good wordlength combinations for DSP flowgraphs. We define a feasible region of wordlengths that consists of designs near the Pareto-optimal cost/quality boundary which also satisfy any feasibility constraints. The MCMC sampler is used to generate uniform samples from this feasible region by performing a random walk on a specially constructed Markov chain; this approach solves a problem that can be very difficult using only naive sampling approaches. Additionally, the MCMC method offers extreme simplicity in the software coding of the algorithm, in contrast with dynamic programming approaches. With these techniques and capabilities, an algorithm designer can quickly determine the appropriate number of bits for signal representations at all points in the design and quantify the performance of various implementation choices.

Signal processing algorithm mapping for ACS also involves assigning functions to different processing elements. On this program we have developed mapping techniques for ACS tailored to signal processing. These capabilities include performance analysis, multi-device partitioning assistance, and automatic scheduling. The scheduling and partitioning functions use the coarse-grain nature of signal processing dataflow graphs to help optimize partitioning for ACS. Our tools automatically generate the VHDL code needed to translate datapaths across different FPGA devices.

Current methods for logic generation for ACS are either built around libraries of functions or around general-purpose logic synthesis. As part of this effort, we implemented "smart generators" that are extensions of the concept of a parameterized library. These generators are tailored to signal processing functions and will include rules that capture specific implementation techniques and trade-offs. For example, a smart generator for a complex multiplier is able to trade between a three-multiplier implementation and a four-multiplier implementation according to area and latency constraints. Additionally, we have provided mechanisms to automatically generate both hardware and software interfaces for the resultant ACS. Our initial target system was a Xilinx XC4062XL-based Wildforce board from Annapolis Micro Systems Recently, we have been augmenting our tools to support a Xilinx Virtex XCV1000-based Wildstar board, also from Annapolis Micro Systems.

The algorithm analysis, mapping, and logic generation capabilities have been developed as extensions to Ptolemy. Ptolemy provides a well documented, object-oriented, open software architecture with implementations in C++ and Java. Our extensions to Ptolemy have been captured in a new ACS domain that separates the interface specification from implementation for each signal processing functional block. The algorithms of interest to this project are represented by dataflow graphs comprised of these functional blocks, following a synchronous dataflow model of computation. We have used a Corona/Core architecture, where each block has a common interface known as the Corona, and one or more implementations, known as the Cores. A retargeting mechanism allows the users to change Cores and hence implementation, which moves the dataflow graph between various simulation models (floating point, fixed point) and implementations (C code, VHDL code). Within the VHDL code generation, we support both the Xilinx 4K and Virtex family of devices. These ACS tools have been used to automatically implement several applications. The first was a Winograd Discrete Fourier Transform (DFT) as part of a channelized frequency shift-keyed (FSK) receiver in ACS. The Winograd algorithmic structure has the minimum number of multiplications for any DFT approach, and is thus ideal for FPGA implementation. The tools have also been used to develop an FPGA implementation of a high speed linear FM detector. We used our tools to also implement a FFT-based complex correlator, requiring the use of two 16-point complex FFT cores in series with a cornerturn to obtain a 256-point spectrum. Most recently, we utilized our tools to accelerate a high resolution radar template-matching application. In all cases, our ACS tools were used to simulate the algorithm, select appropriate fixed point representations, and generate the VHDL implementations. The final FPGA designs were obtained by synthesizing the VHDL and performing place and route with commercial tools. The ACS domain is part of Ptolemy Classic and includes these ACS capabilities and demonstrations.

[1] P. D. Fiore. "A Custom Computing Framework for Orientation and Photogrammetry". MIT Ph.D. Thesis, June 2000.
[2] I. Beichl and F. Sullivan. "The Metropolis algorithm". Computing in Science and Engineering, pages 65-69, January 2000.

Acknowledgements: This work was supported by the Defense Advanced Projects Research Agency (DARPA) and the United States Air Force Research Laboratory (AFRL) under Contract No. F33615-97-C-1174.