Monthly R&D Status Report

Date: August 15, 1996
Title: "SYSTEM-LEVEL DESIGN METHODOLOGY FOR EMBEDDED SIGNAL PROCESSORS"
Contract Number: F33615-93-C-1317
Principal Investigator: Edward A. Lee
Organization: University of California at Berkeley

1. Tasks Performed

Typical of a summer month, the activity level in this reporting period was very high. We have made significant progress on Tycho (the design visualization framework), code generation for real-time systems, mixing control and dataflow, and parallel scheduling. In addition, we have formulated detailed proposals for a design environment for heterogeneous real-time systems built around the Mercury RACEway. A large portion of this report is devoted to that design.

2. Significant Accomplishments

2.1 Interim Tycho Release

We have completed release 0.1 of Tycho and placed it on the web site. Because this release has minimal impact on Ptolemy, we are not widely announcing its availability. The release allows us, however, to work more closely with certain external partners, and creates a milepost for our own internal development effort. The release notes are included below as an appendix.

Tycho is a "syntax manager" for Ptolemy, and will eventually replace the current user interface with an extensible, object-oriented design visualization environment. Tycho can be used and developed entirely independently of Ptolemy. It runs under standard distributions of Itcl 2.0 and 2.1, and we have made prebuilt itcl2.1 binaries for HPUX10.01, Solaris2.4, and SunOS4.1.3 available on the web site.

2.2 Real-Time Processing

Sunil Bhave and John Reekie demonstrated a portable real-time audio capability in Ptolemy built in the CGC domain. The application was a ten-band stereo audio equalizer operating at a 44.1 kHz sample rate in real time on an UltraSparc processor. Bill Chen demonstrated code generation for the visual instruction set (VIS) of the UltraSparc processor.
Luis Gutierrez is wrapping up the code generation domain for the Texas Instruments C50. Yu Kee Lim has collected C libraries that were scattered and duplicated in different parts of Ptolemy.

2.3 Mixing Control and Dataflow

Bilung Lee and Brian Evans installed Bilung's FSM (finite state machine) domain in the development tree for Ptolemy. This domain provides a preliminary finite-state machine capability that can be hierarchically combined with other Ptolemy domains. Currently it works only with simulation domains, not code generation domains. It uses an editor in Tycho, which is also in preliminary form. Further information about this domain is available at http://ptolemy.eecs.berkeley.edu/~bilung/FSM/fsm.html.

2.4 Retargeting

Jose' Pino and Tom Parks (of Lincoln Labs) have proposed a mechanism for defining an interface to a Ptolemy block that is shared among several implementations. Thus, for instance, implementations in different domains, which may correspond to different target languages, can be assured of having a common set of parameters, state variables, and ports. This should facilitate retargeting a design from one implementation to another. For example, the interface definition

    interface {
        name { Add }
        desc { Output the sum of the inputs. }
        inmulti {
            name { input }
            type { float }
        }
        output {
            name { output }
            type { float }
        }
    }

would be implemented using the syntax:

    star {
        implements { Add }
        domain { SDF }
        go { ... }
    }

2.5 Scheduling

Praveen Murthy has completed a new parallel scheduler in Ptolemy that works on acyclic SDF (synchronous dataflow) graphs and constructs single appearance schedules that are optimized for buffer memory consumption. The algorithms and heuristics it implements are all described in our book, "Software Synthesis from Dataflow Graphs" (Kluwer, 1996). It invokes both the RPMC and APGAN heuristics and picks the better schedule.
The scheduler makes extensive use of the Cluster hierarchy, which until now was used only in Jose' Pino's hierarchical scheduler.

2.6 Design Visualization

We have developed in Tycho a viewer and editor for forest data structures. A forest is a collection of trees. The first application of this display shows the call tree of a Tcl program together with profile information, including the amount of real time, CPU time, and number of calls associated with each function. This utility uses the TclX Tcl extension to perform the profiling, translates the results into a Tycho tree data structure, and displays the result, color coded by CPU time.

Farhana Sheikh has developed a "visualization finite-state machine" architecture that can be used to support display and editing of design information such as schedules, performance metrics, architectural views, etc. The first prototype implements a Gantt chart for visualizing parallel schedules. The class hierarchy is illustrated below:

    Displayer --contains-- View --requires-- VSM --interprets-- Information
                            |                 |
                        SlateView             |
                            |                 |
                        GanttView -requires- GanttVSM

Capitalized items are Tycho classes. The GanttVSM (VSM stands for Visualization State Machine) reads in information in textual form (file or string) and saves it in an internal format that captures the information state. The View can query the VSM for any key parameters required to render the information in either graphical or textual form. Because the VSM saves the information state, a schedule could easily be modified graphically and written out to a .sched file. Constraints on the schedule modifications would be imposed by Ptolemy, which typically contains a representation of the precedence graph that was used to generate the schedule. In this paradigm, multiple VSMs could be connected to one another, editing interrelated data or enabling multiple views of the same data.
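The visualization-state-machine pattern described in section 2.6 can be sketched in a few lines. The following is a hypothetical Python sketch of the idea, not Tycho's actual Itcl code; the "processor task start end" schedule format and the class interface are assumptions for illustration.

```python
# Sketch of the VSM idea: a GanttVSM-like object parses a textual
# schedule, keeps it as internal state, and answers queries that a
# View (e.g. a GanttView) would make to render it. Format and names
# are hypothetical, not Tycho's actual interfaces.

class GanttVSM:
    """Holds schedule state parsed from a textual .sched-style string."""
    def __init__(self, text):
        self.tasks = []  # (processor, task, start, end)
        for line in text.strip().splitlines():
            proc, task, start, end = line.split()
            self.tasks.append((int(proc), task, float(start), float(end)))

    def processors(self):
        """Processors a View needs rows for."""
        return sorted({p for p, _, _, _ in self.tasks})

    def span(self):
        """Makespan of the schedule (latest finish time)."""
        return max(end for _, _, _, end in self.tasks)

    def bars(self, proc):
        """Bars a GanttView would draw for one processor."""
        return [(t, s, e) for p, t, s, e in self.tasks if p == proc]

schedule = """
0 fft   0.0 2.0
1 fir   0.0 1.5
0 sum   2.0 3.0
"""
vsm = GanttVSM(schedule)
print(vsm.processors())   # [0, 1]
print(vsm.span())         # 3.0
print(vsm.bars(0))
```

Because the state lives in the VSM rather than in the drawing code, the same object could serve a graphical Gantt view and a textual view, or be written back out to a .sched file after editing, as the section describes.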
2.7 Code Generation for the Mercury RACEway

Ron Galicia has outlined the design of a tool for mapping a control and dataflow representation of a hard real-time signal processing application onto a Mercury RACEway multicomputing system. Low-level programming details would be hidden from the programmer, thereby shifting the design focus to performance issues. Graphical visualization and manipulation capabilities will enable study of architectural tradeoffs and optimizations. Tasks can be scheduled and partitioned among the processors either manually or automatically. The tool will handle most of the details involved in generating multiprocessor code, downloading the code to the target system, and initializing the system to run. Additional capabilities of the tool will allow the programmer to extend the default routine library and to interface with other hardware synthesis and codesign tools. The tool will be implemented as a code generation target in Ptolemy.

Complex computer vision and image processing (CVIP) applications are actually a collection of algorithms or tasks that can be classified into low-, medium-, and high-level categories of control complexity [1][2]. Control complexity often runs counter to the computational demands of the task. Low-level tasks are typically computationally expensive with no control activity and are therefore very easily described by dataflow models, such as synchronous dataflow (SDF) [3]. Representative low-level tasks include edge or corner extraction and correlation operations. Low-level tasks are sometimes called front-end tasks. In contrast, high-level tasks emphasize control instead of computation and impose conditions on the state of the CVIP application. High-level interactions within the application may involve recognition, hypothesis testing, decision making, or runtime reconfiguration and adaptation of the whole system itself.
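The claim that low-level tasks are easily described by SDF rests on the fact that fixed production and consumption rates let a static schedule be computed at compile time. A minimal sketch of solving the SDF balance equations [3] for the repetition vector follows; it is written in Python rather than Ptolemy's C++, and the three-actor graph is an illustrative assumption, not an example from this report.

```python
# Sketch of the SDF property that makes low-level tasks statically
# schedulable: fixed token rates per firing let a repetition vector
# be solved at compile time from the balance equations.
from fractions import Fraction
from math import lcm

# arcs: (producer, tokens produced per firing, consumer, tokens consumed)
arcs = [("src", 2, "filt", 3), ("filt", 1, "sink", 2)]

# Solve q[p] * produced == q[c] * consumed by propagating from one actor.
q = {"src": Fraction(1)}
changed = True
while changed:
    changed = False
    for p, out, c, inn in arcs:
        if p in q and c not in q:
            q[c] = q[p] * out / inn
            changed = True
        elif c in q and p not in q:
            q[p] = q[c] * inn / out
            changed = True

# Scale to the smallest integer repetition vector.
denom = lcm(*(f.denominator for f in q.values()))
reps = {a: int(f * denom) for a, f in q.items()}
print(reps)   # {'src': 3, 'filt': 2, 'sink': 1}
```

With these repetitions every arc is balanced (src produces 3*2 = 6 tokens, filt consumes 2*3 = 6), so a finite schedule with bounded buffers exists; no such compile-time analysis is possible once data-dependent control enters, which is why the high-level tasks need a different model.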
Several methods for modeling system state have been proposed, such as finite state machines (FSM) and Statecharts [4]. Decision, reconfiguration, and adaptation in signal processing have been addressed by projects such as IPUS [5] and TraX [6]. Medium-level tasks naturally combine control and dataflow. An example of a medium-level task is a low-level correlator that selects a template out of a finite set according to some simple criteria.

If the goal is a real-time implementation of this complete system, then it would be natural to map the low-level tasks onto hardware implementations such as ASICs or FPGAs because of their higher computational speed. High-level control and decision are more memory-intensive than computation-intensive and are therefore more suited to software implementations on microprocessors [7] than on dedicated hardware. This suggests the need to map the application to a heterogeneous target that allows interactions between the low-, medium-, and high-level tasks. Therefore, there is also a need for partitioning and synthesis tools within a heterogeneous framework based on control and dataflow to bridge these types of algorithms and architectures.

A code generation tool is outlined for mapping the application onto a multicomputer target. Although the emphasis of code generation would be on the high- and medium-level tasks, code generation libraries will also exist at the low level for prototyping purposes. The tool will provide an interface to hardware synthesis and codesign tools, giving the programmer the option to map the low-level tasks onto dedicated hardware instead of processor code. The chosen code generation target for the software portions of the application is the Mercury RACEway [8], a scalable, distributed-memory embedded multicomputer commonly used for implementing time-critical signal processing tasks. It consists of processor (or compute) nodes with local memory that are connected by crossbar switches.
The code generation tool will deal with the partitioning of the application's algorithms and data into processes for assignment onto the RACEway compute nodes, and allow runtime visualization of process reassignments, births, and deaths caused by the application's control specification. Scheduling of the processes will be performance-based [9], taking into account execution and communication times and ensuring that hard real-time constraints are met, repartitioning and rescheduling when necessary. Communications configurations will depend upon the partitioning of the tasks onto the RACEway. The tool will encapsulate other communications details, code generation, code downloading, and target initialization, leaving the application programmer to focus on design issues instead of implementation mechanics. Additionally, it will be possible to extend the default routine library from customized simulation code, either by interfacing with the customized code on the host or by cross-compiling the customized code onto the target. This tool will be implemented within Ptolemy, where several different models of computation exist in a heterogeneous framework.

PARTITIONING AND SCHEDULING

The programmer will be able to choose between automatic and manual partitioning of the application algorithm and data among the processing nodes. During runtime a graphical interface will allow the user to manually assign and manipulate processes and see an animation of their control-driven reassignments. The automated partitioning feature of the tool will strive to maximize concurrency, minimize interprocessor communications, and use the least number of processors. The most important goal, satisfying the hard real-time constraint, will be achieved by increasing concurrency to reduce total computation time. However, increasing concurrency by dividing a task among more parallel processors may increase interprocessor communications and contribute to delays.
Therefore the tool will have to select a partition such that the interprocessor communication costs do not cause the system to violate the real-time constraint. Lastly, automated partitioning will strive to use the least number of processors while satisfying the other constraints.

Data parallelism and algorithmic parallelism will be exploited by the tool to meet the hard real-time constraint. Data parallelism is possible because the RACEway is a distributed-memory system: each compute node has its own local memory. Data parallelism is common in typical low-level image processing operations such as convolution and filtering [10]; it allows decomposition of the image into subimages that can be operated on independently. A parallel algorithm can be decomposed and assigned to different processors for concurrent execution, gaining speedup through task-level pipelining [11].

Information about processor execution times and communication delays is needed to evaluate and, in the case of the automated algorithms, find the partitions and schedules that fulfill the hard real-time constraint and the minimum processor requirement. Processor and communication models will be formulated to allow quick and accurate estimates of these time costs. Repartitioning and rescheduling will be invoked when a service comes or goes, as in a multiple target-tracking system [12], or when a control condition causes the system to reconfigure itself.

During runtime there will be communications between processors within the multicomputer environment, between the multicomputer and the host system, and between the multicomputer and hardware synthesis or codesign tools. Interprocessor communications and process synchronization are details beyond the concern of the applications programmer working on a large application. These details are especially hard to track when processors are added or removed and resources have to be reassigned.
The tool will encapsulate these complexities by adding an abstraction layer between the programmer and the hardware. This takes care of the bookkeeping and frees the applications programmer to explore tradeoffs and evaluate implementation changes more effectively and quickly. Using the graphical interface, the programmer will be able to take inventory of the processes on the RACEway, as they come and go due to runtime system state changes and high-level control, and view or manually change their assignments among the compute nodes. As the reassignments are made, the communications layer will automatically reconfigure routes to reflect the changes.

The tool will find and implement the most efficient method of interprocessor communication for the size of the multicomputer system. Increasing the number of compute nodes might introduce communication bottlenecks at some points, thereby requiring special rerouting considerations. Also, a shared-memory method [13] of communication becomes prohibitive for a very large number of compute nodes. A message-passing model of communication might be the solution but would be harder to schedule than the shared-memory model. This will be investigated.

The developer may use a graphical interface on the host to watch and manipulate the partitioning and scheduling of the application among the target processors. Likewise, during runtime there may be a user control panel making restricted changes to the state of the system. Initially this will also be implemented as a prototypical graphical interface on the host system. Most of the mechanics of these communications between the graphical interfaces on the host and target system will be hidden from the user and programmer by several layers of abstraction. Host-target communication that needs to be completely hidden from the user will exist if parts of the application, either control or dataflow, reside on the host. This capability will also be provided by the tool.
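The partitioning and scheduling goals described above (account for both execution and communication times, then check the result against a hard real-time constraint) can be illustrated with a toy list scheduler. The task graph, the cost numbers, and the greedy heuristic below are assumptions for illustration only, not the algorithm the proposed tool will use.

```python
# Toy performance-based list scheduler: greedily assign tasks to compute
# nodes, charging a fixed communication delay when a predecessor sits on
# a different node, then check the makespan against a hard deadline.
# All costs and the heuristic itself are illustrative assumptions.

COMM_DELAY = 0.5   # assumed cost of one inter-node transfer

def list_schedule(tasks, preds, nprocs, deadline):
    """tasks: {name: exec_time}; preds: {name: [predecessors]}.
    Assumes the tasks dict is given in topological order."""
    finish, where = {}, {}
    free = [0.0] * nprocs            # time each node becomes idle
    for t in tasks:
        best = None
        for p in range(nprocs):      # try every node, keep earliest finish
            ready = max([free[p]] + [
                finish[d] + (0 if where[d] == p else COMM_DELAY)
                for d in preds.get(t, [])])
            done = ready + tasks[t]
            if best is None or done < best[0]:
                best = (done, p)
        done, p = best
        finish[t], where[t] = done, p
        free[p] = done
    makespan = max(finish.values())
    return where, makespan, makespan <= deadline

# A small image-processing-flavored task graph.
tasks = {"grab": 1.0, "edge": 2.0, "corr": 2.0, "track": 1.0}
preds = {"edge": ["grab"], "corr": ["grab"], "track": ["edge", "corr"]}
print(list_schedule(tasks, preds, nprocs=2, deadline=5.0))
```

Running "edge" and "corr" on different nodes buys concurrency at the price of two communication delays; the scheduler accepts that trade here because the resulting makespan (4.5) still meets the 5.0 deadline, which is exactly the tension between concurrency and communication cost that the text describes.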
Likewise, the details of interactions between the code generation tool and other tools, such as hardware synthesis and codesign environments, will be hidden by a communications layer. There may be a modified form of extended partitioning [7] that takes into account a multicomputer target, such as the Mercury RACE system, as one of the targets from which it can choose.

GENERATING AND RUNNING CODE

Once partitions, schedules, and communications interfaces have been determined, target and host code are synthesized. Target code is either stitched together from hand-optimized libraries or compiled from its simulation domain code to provide a standalone body of code. Hand-optimized libraries still outperform code generated by compilers for embedded processors [14]. However, it is easier to compile customized simulation domain code into target code. An alternative is to use the proposed host-target communications interface for runtime interactions between the RACEway and the customized simulation domain code residing on the host, thus eliminating the need to cross-compile code to the Mercury target. This would be reasonable for low-computation and low-communication tasks such as high-level system control. In this situation the host would have to be included in the partitioning and scheduling steps. Once standalone target code is created, the tool will automatically download it to the target and then initialize both the host and target systems for runtime operation.

RELATED WORK

HaRTS [9] combines the control and dataflow of a hard real-time application on the same graph, and automatically partitions, schedules, and generates multiprocessor code. However, HaRTS's graphical language cannot describe data-driven control. Control in HaRTS is suited more for scheduling and defining task precedence than for adapting states and reconfiguration. Furthermore, timing control is explicit and not hidden: actions are driven by external timers rather than the appearance of data on an arc.
Completion times are explicit in a HaRTS description of an application. This may be too much detail for verifying the functional aspects of a complex application. It is easier to deal with a complex application at its most abstract control and dataflow description, without superfluous timing and scheduling details. The proposed tool will be able to handle timing and scheduling automatically, allowing the programmer to focus on the necessary control and dataflow aspects of the design.

Another drawback of HaRTS is its lack of heterogeneity. One cannot interface the application graph in HaRTS with other models of computation. This limits HaRTS's ability to interact with a diversity of tasks such as hardware simulation and other tools using different design styles. Ptolemy already supports several models of computation (e.g. FSM or SDF), or domains, and defines interactions between them.

SIERA [15] is a framework for creating and combining heterogeneous hardware and software components. It hides complexity from the designer by taking care of tedious system details and emphasizes modularity, reusability, uniformity, and abstraction. The goals of SIERA are very similar to the goals of our proposed code generation tool in Ptolemy. The differences are that the target for our tool will be suited more for DSP applications than for robotics, that it will not be restricted to hierarchical bus-based architectures, and that it will use a communications model, different from SIERA's process network model, that is suited more for high-throughput applications.

Some work [16] has been done in mapping a graphical Ptolemy description onto Mercury hardware directed by performance estimates. So far this work has been restricted to the SDF model of computation only. Currently, work is being done to automatically minimize the number of processors while meeting hard real-time constraints. Work [17] is also being done in our group to combine well-known models of computation for control (e.g. FSM) and dataflow (e.g.
SDF) where design styles are different.

2.8 Improvements to Ptolemy

We have completed the version 0.6 Programmer's Manual and released it on the web in postscript, PDF, and HTML forms. There are three new chapters in the 0.6 Programmer's Manual:

    6. Using the Cluster Class for Scheduling
    10. PN Domain
    15. Adding New Domains

The other chapters have been updated to reflect changes between Ptolemy0.5.x and Ptolemy0.6. Printed versions of the manual may be obtained from ILP:

    EECS/ERL Industrial Liaison Program Office
    Software Distribution Office
    205 Cory Hall # 1770
    University of California at Berkeley
    Berkeley, CA 94720-1770
    phone: (510)643-6687
    fax: (510)643-6694
    email: software@eecs.berkeley.edu
    www: http://www.eecs.berkeley.edu/ILP/

At this point, it looks like the Ptolemy 0.5.2 Kernel Manual will not be updated for Ptolemy 0.6, but we will continue to make the 0.5.2 Kernel Manual available.

Stephen Edwards has tuned the C++ documentation generation system CCE so that it now generates useful HTML documentation for the Ptolemy kernel. We are considering using CCE to generate part of the Kernel Manual in the next Ptolemy release.

Michael Kurz of Silicon Automation Systems uncovered a bug in the Ptolemy loop scheduler. Praveen Murthy diagnosed the cause and Soonhoi Ha (of Seoul National University) proposed a fix.

Brian Evans has made changes to CGCTarget so that compile options, link options, and lists of dependent files are returned by new methods. The methods that return the compile and link options will optionally expand environment variables.

Luis Gutierrez and Christopher Hylands are planning to release a tar overlay of the C50 domain on August 20th.

2.9 Improvements to Tycho

The file system interface was simplified and consolidated in the File class so that derived classes could be written more easily.
A hierarchy of new kernel classes implements a set of basic data structures: Graph, DirectedAcyclicGraph, and Forest (a set of Trees). The clipboard mechanism was generalized so that it could be shared between textual and graphical editors. The documentation generation system tydoc has been tuned to the point that it now generates useful HTML documentation for Tycho classes.

We have sped up the Sun HTML rendering code (written in Tcl) by a factor of four. This makes it usable for moderate-size files on reasonably fast workstations, but further improvement is necessary to make it usable on slower workstations. Tycho0.1 now runs under itcl2.1, which includes tcl7.5/tk4.1. Christopher Hylands installed TclX7.5.3-a1 and Blt2.0 and sent installation bug reports back to the developers.

Christopher Hylands researched NT and started discussion about a department-wide NT effort. We are planning on obtaining two NT machines with an eye towards porting Tycho to NT. The Tycho port to the MacOS is underway; we are waiting for a compiler.

2.10 Personnel Changes

Brian Evans has joined the faculty of the University of Texas at Austin. We expect to continue to collaborate with him.

3. Publications and Presentations

3.1 Publications Accepted for Publication

[1] W.-T. Chang, S. Ha, and E. A. Lee, "Heterogeneous Simulation -- Mixing Discrete-Event Models with Dataflow," accepted to the RASSP special issue of the Journal on VLSI Signal Processing, July 1996.

[2] S. Sriram and E. A. Lee, "Determining the Order of Processor Transactions in Statically Scheduled Multiprocessors," accepted by the Journal on VLSI Signal Processing, July 1996.

[3] P. K. Murthy, S. S. Bhattacharyya, and E. A. Lee, "Joint Minimization of Code and Data for Synchronous Dataflow Programs," accepted by the Formal Methods in System Design journal.

3.2 Publications Submitted for Publication

[4] T. Miyazaki and E. A. Lee, "Code Generation by Using Integer-Controlled Dataflow Graphs," submitted to ICASSP '97.
3.3 Presentations

* "Synchronous Dataflow Design Methodologies for Embedded DSP," presented at HP EEsof, Westlake Village, CA, August 8, 1996.
  http://ptolemy.eecs.berkeley.edu/~pino/papers/thesis/sdfDesignMethod.pdf

* "System-Level Design for Embedded Signal Processors," presented at HP EEsof, Westlake Village, CA, August 8, 1996.
  http://ptolemy.eecs.berkeley.edu/~pino/papers/thesis/ptolemyOverview.pdf

Appendix: Release Notes for Tycho 0.1

Tycho is an object-oriented syntax manager with an underlying heterogeneous technical rationale. It provides a number of editors and graphical widgets in an extensible, reusable framework. The editors for textual syntaxes are modeled after emacs in the sense that emacs key bindings are used when possible. Editors for visual syntaxes will be more diverse. The system documentation is integrated, using a hypertext system compatible with the World Wide Web. Tycho has been designed primarily for use with the Ptolemy system, a heterogeneous design environment from U.C. Berkeley, but it is also useful on its own.

Version 0.1 is the first public release of Tycho as a stand-alone system. It runs under vanilla Itcl 2.0 and 2.1 with no changes to the executable. It also runs under the Ptolemy system, a design environment distributed by the same group that developed Tycho. The editor capabilities are currently limited to textual syntaxes; although we have considerable work in progress on graphical editors, that code is not ready for release yet. Tycho0.1 can be obtained from http://ptolemy.eecs.berkeley.edu/tycho/Tycho.html.

Here is a summary of the capabilities:

Tycho Kernel
------------

+ A family of dialog windows, including basic message windows (both modal and non-modal), yes-no queries, queries for text entry, and queries with radio buttons.
+ A list browser and derived file browser and index browser. These browsers support pattern-based search.
+ An Emacs-like text editing widget that significantly extends the capabilities of the Tk text widget, including a sophisticated search capability, undo/redo, text filling, and carefully designed menus and key bindings.
+ A shell-like console for interacting with Tcl with a thorough history mechanism fashioned after the Unix tcsh.
+ An integrated documentation system based on HTML with automatic index generation and an index search mechanism.
+ Support for hyperlinks from any Tycho editor or display to any other Tycho editor or display.
+ A context-sensitive spelling checker (for example, it checks the spelling only in comments when editing code).
+ Interfaces to SCCS and RCS revision control systems.
+ A font management system that includes an interactive dialog for selecting fonts.
+ Error handling with a stack display.
+ Auto-save and crash recovery (although the crash recovery will only work with modifications to the executable, currently only available with the Ptolemy distribution).
+ Some elementary data structures: Stack, CircularList, Graph, DirectedAcyclicGraph, Forest.

Editors and Shells
------------------

In addition to the basic emacs-like text editor described above and the Tcl shell, a number of additional editors and shells have been designed. The editors provide capabilities similar to those of emacs modes, although the potential goes beyond what emacs can do because of the future inclusion of graphical elements. The editors and shells included in release 0.1 are:

+ C and C++ editors. These color and fill comments and handle indentation.
+ HTML editor. This editor parses HTML commands, supports hyperlinks, and provides a command to check the validity of hyperlinks.
+ Itcl and Tcl editors. These editors color and fill comments, handle indentation, and color class, procedure, and method names. They also support "evaluate" commands, where either a selected region or the entire file can be evaluated.
+ Makefile editor.
  This editor colors comments, variable definitions, variable references, rules, and various directives.
+ Ptlang editor. This is an editor for designing blocks in Ptolemy.
+ Matlab and Mathematica consoles. These will only appear if the appropriate Tcl extensions for interaction with Matlab and Mathematica have been installed.

tydoc - The Tycho documentation system
--------------------------------------

Tycho0.1 also includes tydoc, a script that converts itcl to html.

REFERENCES

[1] J. N. Patel, A. A. Khokhar, and L. H. Jamieson, "Implementation of parallel image processing algorithms in the Cloner environment," in VLSI Signal Processing VII (J. Rabaey, P. M. Chau, and J. Eldon, eds.), pp. 83-92, The Institute of Electrical and Electronics Engineers, 1994.
[2] C. C. Weems, Jr., "The second generation image understanding architecture and beyond," in Computer Architectures for Machine Perception, pp. 276-285, 1993.
[3] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proceedings of the IEEE, vol. 75, pp. 1235-1245, September 1987.
[4] D. Harel, "Statecharts: A visual formalism for complex systems," Sci. Comput. Program., vol. 8, pp. 231-274, 1987.
[5] V. R. Lesser, S. H. Nawab, and F. I. Klassner, "IPUS: An architecture for the integrated processing and understanding of signals," Artificial Intelligence, vol. 77, no. 1, pp. 129-171, 1995.
[6] A. F. Bobick and R. C. Bolles, "The representation space paradigm of concurrent evolving object descriptions," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, no. 2, pp. 146-156, 1992.
[7] A. P. Kalavade, System-Level Codesign of Mixed Hardware-Software Systems. Memorandum No. UCB/ERL M95/88, University of California at Berkeley, 1995.
[8] Mercury Computer Systems, 199 Riverneck Road, Chelmsford, MA 01824-2820, U.S.A.
[9] J. Zhu, T. G. Lewis, L. Zhao, W. Jackson, and R. L. Wilson, "HaRTS: Performance-based design of distributed hard real-time software," J. Systems Software, vol. 32, no. 2, pp. 143-156, 1996.
[10] V. Chaudhary and J. K. Aggarwal, "Parallelism in computer vision: A review," in Parallel Algorithms for Machine Intelligence and Vision (V. Kumar, P. S. Gopalakrishnan, and L. N. Kanal, eds.), pp. 271-309, Springer-Verlag, 1990.
[11] A. C. Downton, R. W. S. Tregidgo, and A. Cuhadar, "Top-down structured parallelisation of embedded image processing applications," IEE Proc.-Vis. Image Signal Process., vol. 141, no. 6, pp. 271-309, 1994.
[12] K. R. Pattipati, T. Kurien, R.-T. Lee, and P. B. Luh, "On mapping a tracking algorithm onto parallel processors," IEEE Transactions on Aerospace and Electronic Systems, vol. 26, no. 5, pp. 774-791, 1990.
[13] S. Sriram, Minimizing Communication and Synchronization Overhead in Multiprocessors for Digital Signal Processing. Memorandum No. UCB/ERL M95/90, University of California at Berkeley, 1995.
[14] G. Araujo, S. Devadas, K. Keutzer, S. Liao, S. Malik, S. Tjiang, A. Sudarsanam, and A. Wang, "Challenges in code generation for embedded processors," in Code Generation for Embedded Processors (P. Marwedel and G. Goossens, eds.), Kluwer Academic Publishers, 1995.
[15] M. B. Srivastava and R. W. Brodersen, "SIERA: A unified framework for rapid-prototyping of system-level hardware and software," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 676-693, June 1995.
[16] E. K. Pauer and J. B. Prime, "An architectural trade capability using the Ptolemy kernel," in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1252-1255, May 1996.
[17] W.-T. Chang, A. Kalavade, and E. A. Lee, "Effective heterogeneous design and cosimulation," in NATO Advanced Study Institute Workshop on Hardware/Software Codesign, 1995.