Given a Target describing the topology and characteristics of the single- or multiple-processor system for which code is to be generated, it is the responsibility of the Scheduler object to perform some or all of the following functions:
- Determine which processor each Block is executed on (for multiprocessor systems).
In the first approach (loopingLevel = DEF), which is the default SDF scheduler, we conceptually construct the acyclic precedence graph (APG) corresponding to the system and generate a schedule that is consistent with that precedence graph. Note that the precedence graph is not physically constructed. There are many possible schedules for all but the most trivial graphs; the schedule chosen takes resource costs, such as the necessity of flushing registers and the amount of buffering required, into account. The target then generates code by executing the actors in the sequence defined by this schedule. This is a quick and efficient approach when the SDF graph does not have large sample-rate changes. If there are large sample-rate changes, the size of the generated code can be huge, because the code block for an actor will occur many times (if the number of repetitions for the actor is greater than one); in this case, it is better to use some form of loop scheduling.
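As an illustration of this flat scheduling strategy (a sketch only, not the actual Ptolemy implementation; the function names and edge representation are invented here), one can solve the SDF balance equations to get each actor's repetition count and then fire runnable actors until every actor has fired the required number of times:

```python
from fractions import Fraction
from math import lcm

# Sketch only: each edge is a (src, dst, produced, consumed) tuple.
def repetitions(edges):
    """Solve the SDF balance equations for a connected graph."""
    rate = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, p, c in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c   # rate[src]*p == rate[dst]*c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}

def flat_schedule(edges, reps):
    """Fire runnable actors one at a time, consistent with the precedences."""
    tokens = [0] * len(edges)
    remaining = dict(reps)
    schedule = []
    while any(remaining.values()):
        for actor, left in remaining.items():
            runnable = left > 0 and all(
                tokens[i] >= c
                for i, (_, d, _, c) in enumerate(edges) if d == actor)
            if runnable:
                for i, (s, d, p, c) in enumerate(edges):
                    if s == actor:
                        tokens[i] += p
                    if d == actor:
                        tokens[i] -= c
                remaining[actor] -= 1
                schedule.append(actor)
                break
        else:
            raise RuntimeError("graph is deadlocked")
    return schedule

edges = [("A", "B", 5, 1)]        # A produces 5 samples, B consumes 1
reps = repetitions(edges)         # {"A": 1, "B": 5}
print("".join(flat_schedule(edges, reps)))   # ABBBBB
```

Note how the actor B appears five times in the flat schedule; this is exactly the code duplication that loop scheduling avoids.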
We call the second approach Joe's scheduler. In this approach (loopingLevel = CLUST), actors that have the same sample rate are merged (wherever this will not cause deadlock) and loops are introduced to match the sample rates. The result is a hierarchical clustering; within each cluster, the techniques described above can be used to generate a schedule. The code then contains nested loop constructs together with sequences of code from the actors.
Since the second approach is a heuristic solution, there are cases where some looping possibilities go undetected. By setting loopingLevel to SJS, we can choose the third approach, called SJS (Shuvra-Joe-Soonhoi) scheduling after the inventors' first names [Bha94]. After performing Joe's scheduling at the front end, it attacks the remaining graph with an algorithm that is guaranteed to find the maximum amount of looping available in the graph.
A fourth approach, obtained by setting loopingLevel to ACYLOOP, uses a scheduler that generates single appearance schedules optimized for buffer memory usage. This scheduler was developed by Praveen Murthy and Shuvra Bhattacharyya [Mur96][Bha96]. It only tackles acyclic SDF graphs; if it finds that the universe is not acyclic, it automatically resets the loopingLevel target parameter to SJS. For a given SDF graph, there can be many different single appearance schedules. These are all optimally compact in terms of schedule length (or program memory, in inline code generation). However, they will in general require differing amounts of buffer memory; the difference between the buffer memory requirement of an arbitrary single appearance schedule and that of one optimized for buffer memory usage can be dramatic. In code generation, it is essential that memory consumption be minimal, especially when generating code for embedded DSP processors, since these chips have very limited amounts of on-chip memory. Note that acyclic SDF graphs always have single appearance schedules; hence, this scheduler will always give single appearance schedules. If the file target parameter is set, then a summary of internal scheduling steps will be written to that file. Essentially, two different heuristics are used by the ACYLOOP scheduler, called APGAN and RPMC, and the better of the two is selected. The generated file will contain the schedule produced by each algorithm, the resulting buffer memory requirement, and a lower bound (called BMLB) on the buffer memory requirement over all possible single appearance schedules.
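The effect of schedule choice on buffer memory can be made concrete with a small simulation (a sketch with invented names and rates; this is not the APGAN or RPMC algorithm itself, only a buffer-count check). For the chain A -3-> B -1-> C, where C consumes 2 tokens and the repetitions are A=2, B=6, C=3, two single appearance schedules need different amounts of buffering:

```python
# Sketch only: measure the peak buffering of a looped (single appearance)
# schedule by simulating token counts on each edge.
def max_buffering(schedule, edges):
    """schedule: list of actor-name leaves or (count, body) nested loops."""
    tokens = [0] * len(edges)
    peak = [0] * len(edges)

    def run(item):
        if isinstance(item, str):            # leaf: fire the actor once
            for i, (s, d, p, c) in enumerate(edges):
                if d == item:
                    tokens[i] -= c
                    assert tokens[i] >= 0, "schedule is not admissible"
            for i, (s, d, p, c) in enumerate(edges):
                if s == item:
                    tokens[i] += p
                    peak[i] = max(peak[i], tokens[i])
        else:                                # loop: (count, body)
            count, body = item
            for _ in range(count):
                for sub in body:
                    run(sub)

    run((1, schedule))
    return sum(peak)

edges = [("A", "B", 3, 1), ("B", "C", 1, 2)]
flat_sas = [(2, ["A"]), (6, ["B"]), (3, ["C"])]     # (2A)(6B)(3C)
nested_sas = [(2, ["A", (3, ["B"])]), (3, ["C"])]   # (2(A(3B)))(3C)
print(max_buffering(flat_sas, edges))    # 12 tokens of buffering
print(max_buffering(nested_sas, edges))  # 9 tokens of buffering
```

Both schedules are single appearance schedules (each actor's code block appears once), yet the nested one needs 25% less buffer memory; this is the gap the ACYLOOP scheduler tries to close.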
If the second, third, or fourth approach is taken, the code size is drastically reduced when there are large sample rate changes in the application. On the other hand, we sacrifice some efficient buffer management schemes. For example, suppose that star A produces 5 samples to star B, which consumes 1 sample at a time. If we take the first approach, we schedule this graph as ABBBBB and assign a buffer of size 5 between stars A and B. Since each invocation of star B knows the exact location in the allocated buffer from which to read its sample, each B invocation can read the sample directly from the buffer. If we choose the second, third, or fourth approach, the scheduling result will be A5(B). Since the body of star B is included inside a loop of factor 5, we must use indirect addressing for star B to read a sample from the buffer. Therefore, we need an additional buffer pointer for star B (memory overhead) and one more level of memory access (runtime overhead) for indirect addressing.
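The addressing trade-off can be sketched with a toy code generator (all names invented; the emitted "code" is schematic C, not what Ptolemy's targets actually produce):

```python
# Toy sketch: emit schematic inline code for the flat schedule ABBBBB
# versus the looped schedule A5(B).
def gen_flat(n=5):
    lines = [f"A(&buf[0]);        /* A writes {n} samples */"]
    for i in range(n):
        # each B invocation knows its exact buffer location at compile time
        lines.append(f"B(buf[{i}]);")
    return "\n".join(lines)

def gen_looped(n=5):
    # the loop body is emitted only once; B reads through a runtime pointer
    return (f"A(&buf[0]);        /* A writes {n} samples */\n"
            "ptr = &buf[0];     /* extra buffer pointer: memory overhead */\n"
            f"for (i = 0; i < {n}; i++)\n"
            "    B(*ptr++);     /* indirect access: runtime overhead */")

print(gen_flat())
print(gen_looped())
```

The flat version repeats B's code block five times with direct addressing; the looped version emits it once but pays for a buffer pointer and an extra memory access per sample.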
The first step in multiprocessor scheduling, or parallel scheduling, is to translate a given SDF graph to an acyclic precedence expanded graph (APEG). The APEG describes the dependency between invocations of blocks in the SDF graph during execution of one iteration. Refer to the SDF domain documentation for the meaning of one iteration. Hence, a block in a multirate SDF graph may correspond to several APEG nodes. Parallel schedulers schedule the APEG nodes onto processors.
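The APEG expansion can be sketched as follows (a hypothetical illustration, not Ptolemy's implementation): invocation k of the source produces tokens [k*p, k*p+p), invocation j of the destination consumes tokens [j*c, j*c+c), and a dependency exists wherever those token ranges overlap.

```python
# Sketch only: expand each SDF edge (src, dst, produced, consumed) into
# APEG dependencies between individual invocations.
def apeg_deps(edges, reps):
    deps = set()
    for src, dst, p, c in edges:
        for k in range(reps[src]):          # k-th firing of src
            for j in range(reps[dst]):      # j-th firing of dst
                # token ranges [k*p, (k+1)*p) and [j*c, (j+1)*c) overlap?
                if k * p < (j + 1) * c and j * c < (k + 1) * p:
                    deps.add((f"{src}{k + 1}", f"{dst}{j + 1}"))
    return deps

# A produces 2 tokens, B consumes 1: repetitions A=1, B=2, so the single
# APEG node for A feeds both APEG nodes for B.
print(sorted(apeg_deps([("A", "B", 2, 1)], {"A": 1, "B": 2})))
# [('A1', 'B1'), ('A1', 'B2')]
```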
We have implemented three scheduling techniques that map SDF graphs onto multiple processors with various interconnection topologies: Hu's level-based list scheduling, Sih's dynamic level scheduling [Sih91], and Sih's declustering scheduling [Sih91]. The target architecture is described by its Target object, derived from CGMultiTarget. The Target class provides the scheduler with the necessary information on interprocessor communication to enable both scheduling and code synthesis.
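As an illustration of the list-scheduling style used here (a sketch of Hu's level-based approach, with invented names; IPC cost is ignored, as the HU option does), the priority of an APEG node is its level, i.e. the longest execution-time path from that node to a sink:

```python
# Sketch only: Hu's level-based list scheduling onto nprocs processors.
def hu_schedule(nodes, deps, exec_time, nprocs):
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for a, b in deps:
        succ[a].append(b)
        pred[b].append(a)

    level = {}
    def lv(n):                      # longest path from n to any sink
        if n not in level:
            level[n] = exec_time[n] + max((lv(s) for s in succ[n]), default=0)
        return level[n]
    for n in nodes:
        lv(n)

    proc_free = [0] * nprocs        # time each processor becomes idle
    finish, assignment = {}, {}
    unscheduled = set(nodes)
    while unscheduled:
        ready = [n for n in unscheduled if all(q in finish for q in pred[n])]
        n = max(ready, key=lambda x: level[x])              # highest level first
        p = min(range(nprocs), key=lambda i: proc_free[i])  # earliest-free proc
        start = max(proc_free[p], max((finish[q] for q in pred[n]), default=0))
        finish[n] = start + exec_time[n]
        proc_free[p] = finish[n]
        assignment[n] = p
        unscheduled.remove(n)
    return assignment, finish

# Diamond-shaped APEG: A feeds B1 and B2, which both feed C; unit times.
nodes = ["A", "B1", "B2", "C"]
deps = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C")]
times = {n: 1 for n in nodes}
assignment, finish = hu_schedule(nodes, deps, times, nprocs=2)
print(assignment, max(finish.values()))   # makespan 3 on two processors
```

Sih's dynamic level scheduler refines this basic scheme by folding estimated IPC delays into the priority function.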
CGMultiTarget has a parameter, schedName, that allows the user to select the type of schedule. Currently, there are five different scheduling options:
DL: If schedName is set to DL, Sih's dynamic level scheduler is selected, which accounts for IPC overhead during scheduling.
HU: Hu's level scheduler is selected, which ignores the IPC overhead.
DC: Sih's declustering scheduler can be selected by setting DC. The declustering algorithm is advantageous only when the list scheduling algorithms show poor performance, judged from the scheduling result, because it is more expensive than the list schedulers.
HIER(DC): If we want to use Pino's hierarchical scheduler, we set schedName to HIER(DC). The default top-level scheduling option is the DL scheduler; to use another scheduler, HU or DC should be specified within the parentheses.
CGDDF: If schedName is set to CGDDF, Ha's dynamic construct scheduler is selected. To use this scheduler, Ptolemy should be recompiled with special flags, or an executable that includes the CGDDF domain should be used.
There are other target parameters that direct the scheduling procedure. If the parameter manualAssignment is set to YES, then the default parallel scheduler does not perform star assignment. Instead, it checks the processor assignment of all stars (set using the procId state of CG and derived stars). By default, the procId state is set to -1, which is an illegal assignment, since the child targets are numbered from 0. If any star other than a Fork star has an illegal procId state, an error is generated saying that manual scheduling has failed. Otherwise, we invoke a list scheduler that determines the order of execution of blocks on each processor, based on the manual assignment. We do not support the case where a block might require more than one processor. The manualAssignment option automatically sets the oneStarOneProc state, to be discussed next.
If there are sample rate changes, a star in the program graph may be invoked multiple times in each iteration. These invocations may be assigned to multiple processors by default. We can prevent this by setting the oneStarOneProc state to YES. Then, all invocations of a star are assigned to the same processor, regardless of whether they are parallelizable or not. The advantage of doing this is the simplicity of code generation, since we do not need to splice in Spread/Collect stars, which will be discussed later. Also, it provides another possible scheduling option: adjustSchedule; this is described below. The main disadvantage of setting oneStarOneProc to YES is the performance loss from not exploiting parallelism. This loss is most severe with Sih's declustering algorithm; therefore, that algorithm is not recommended with this option.
In this paragraph, we describe a future scheduling option which this release does not yet support. Once automatic scheduling (with the oneStarOneProc option set) is performed, the processor assignment of each star is determined. After examining the assignment, the user may want to override the scheduling decision manually. This can be done by setting the adjustSchedule parameter. If that parameter is set, then after the automatic scheduling is performed, the procId state of each star is automatically updated with the assigned processor. The programmer can then override the scheduling decision by setting that state. The adjustSchedule parameter cannot be set to YES before a scheduling decision has been made previously. Again, this option is not supported in this release.
Different scheduling options result in different assignments of APEG nodes. Regardless of which scheduling options are chosen, the final stage of the scheduling is to decide the execution order of stars including send/receive stars. This is done by a simple list scheduling algorithm in each child target. The final scheduling results are displayed on a Gantt chart. The multiple-processor scheduler produces a list of single processor schedules, giving them to the child targets. The schedules include send/receive stars for interprocessor communication. The child targets take their schedules and generate code.
To produce code for the child targets, we create a sub-galaxy for each child target, consisting of the stars scheduled on that target plus, if necessary, some extra stars to be discussed below. A child target follows the same steps to generate code as a single-processor target, except that the schedule is not computed again, since the scheduling result is inherited from the parent target.
The send/receive stars spliced into a sub-galaxy are created by the corresponding methods (createSend() and createReceive(), as mentioned earlier) in the parent multi-target.
Once the generated code is loaded, the processors run autonomously. The synchronization protocol between processors is hardwired into the "send" and "receive" stars. One common approach in shared-memory architectures is the use of semaphores. A typical synchronization protocol is to have the send star set a semaphore when it completes the data transfer, and have the receive star read the data and reset the semaphore. The receive star will not read the data if the semaphore has not been set; similarly, the send star will not write data if the semaphore has not been reset. In a message-passing architecture, the send star may form a message header to specify the source and destination processors. In this case, the receive star would decode the message by examining the message header.
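The handshake described above can be sketched with two semaphores (a minimal illustration using Python threads; a real target would emit equivalent logic in the generated C or assembly for each processor):

```python
import threading

# Sketch only: one-slot channel with the send/receive handshake described
# in the text. "full" is set by send and reset by receive; "empty" keeps
# the send star from overwriting unread data.
class Channel:
    def __init__(self):
        self.full = threading.Semaphore(0)
        self.empty = threading.Semaphore(1)
        self.slot = None

    def send(self, value):        # "send" star
        self.empty.acquire()      # wait until the previous token was consumed
        self.slot = value
        self.full.release()       # signal: data is ready

    def receive(self):            # "receive" star
        self.full.acquire()       # wait until data is ready
        value = self.slot
        self.empty.release()      # signal: the slot may be reused
        return value

ch = Channel()
out = []
t = threading.Thread(target=lambda: [out.append(ch.receive()) for _ in range(3)])
t.start()
for v in [1, 2, 3]:
    ch.send(v)
t.join()
print(out)   # [1, 2, 3]
```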
To properly support arbitrary data types, the send star should have an ANYTYPE input, and the receive star should have an ANYTYPE output. The resolved type for each of these ports can be obtained using the PortHole::resolvedType method. For a preliminary version of the communication stars, you can use a fixed datatype such as FLOAT. Send/receive stars that are declared to support ANYTYPE but fail to support a particular datatype should display an appropriate error message using the Error::abortRun method. Finally, each of these stars must call PortHole::numXfer to determine the size of the block of data that needs to be transferred upon each invocation.
On the other hand, we may need to "collect" tokens from multiple sources into a single input porthole. Suppose we reverse the connections in the above example: star B produces one token and star A consumes two tokens. We have to insert a "collect" star at the input porthole of star A to collect tokens from star B and from a "receive" star that receives a token from processor 2.
The "spread" and "collect" stars are automatically inserted by the scheduler and are invisible to the user. Moreover, these stars cannot be scheduled; they are added to sub-galaxies only for the allocation of memory and other resources before generating code. The "spread" and "collect" stars themselves do not require extra memory, since in most cases we can overlay memory buffers. For example, in the first example, a buffer of size 2 is assigned to the output of star A. Star B obtains the information it needs to fetch a token from the first location of the buffer via the "spread" star, while the "send" star knows that it will fetch a token from the second location. Thus, the buffers for the outputs of the "spread" star are overlaid with the output buffer of star A.
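The overlaying idea can be sketched as follows (invented names; the slice below stands in for the pointer arithmetic of the generated code): the "spread" star does not copy tokens, it merely hands each reader a fixed offset into the producer's buffer.

```python
# Sketch only: a "spread" star that overlays its output buffers on the
# producer's buffer by assigning each reader a fixed offset.
class SpreadStar:
    def __init__(self, shared_buffer, consumers):
        self.buf = shared_buffer          # star A's output buffer; no copy made
        self.offsets, pos = {}, 0
        for name, count in consumers:     # (reader name, tokens per iteration)
            self.offsets[name] = (pos, count)
            pos += count

    def read(self, name):
        # each reader fetches directly from the shared buffer at its offset
        start, count = self.offsets[name]
        return self.buf[start:start + count]

buf = [0, 0]                              # buffer of size 2 at star A's output
spread = SpreadStar(buf, [("B", 1), ("send", 1)])
buf[0], buf[1] = 10, 20                   # star A fires, writing two tokens
print(spread.read("B"), spread.read("send"))   # [10] [20]
```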
If there are delays on a connection that must pass through "spread" or "collect" stars, or if past tokens are used on it, we need to copy data explicitly. In that case, extra memory is needed for these stars, and the user will see the "spread/collect" stars in the generated code.
Spread/Collect stars have only been implemented in the CGC domain so far.