Archive for December, 2007


Monday, December 31st, 2007

frequently executed paths, and then proposed a very-long-instruction-word (VLIW) architecture [Fisher 1983] that could expose the microoperations directly to user programs, using the compiler to schedule. Aiken and Nicolau [1988] were among the first to point out that a single loop iteration need not be scheduled in isolation, and presented an algorithm for optimal (ignoring resource constraints) parallelization of loops. Many variations of the multiprocessor scheduling problem are NP-complete [Garey and Johnson 1979; Ullman 1975]. The iterative modulo scheduling algorithm [Rau 1994] gets good results in practice. In the absence of resource constraints, it is equivalent to the Bellman-Ford shortest-path algorithm [Ford and Fulkerson 1962]. Optimal schedules can be obtained (in principle) by expressing the constraints as an integer linear program [Govindarajan et al. 1996], but integer-linear-program solvers can take exponential time (the problem is NP-complete), and the register-allocation constraint is still difficult to express in linear inequalities.

Ball and Larus [1993] describe and measure the static branch-prediction heuristics shown in Section 20.3. Young and Smith [1994] show a profile-based static branch-prediction algorithm that does better than optimal static prediction; the apparent contradiction in this statement is explained by the fact that their algorithm replicates some basic blocks, so that a branch that's 80% taken (with a 20% misprediction rate) might become two different branches, one almost-always taken and one almost-always not taken.

EXERCISES

20.1 Schedule the following loop using the Aiken-Nicolau algorithm:

    for i ← 1 to N
        a ← X[i-2]
        b ← Y[i-1]
        c ← a × b
        d ← U[i]
        e ← X[i-1]
        f ← d + e
        g ← d × c
        h: X[i] ← g
        j: Y[i] ← f

a. Label all the scalar variables with subscripts i and i-1. Hint: In this loop there are no loop-carried scalar-variable dependences, so none of the subscripts will be i-1.
b. Perform scalar replacement on uses of X[] and Y[]. Hint: Now you will have subscripts of i-1 and i-2.
c. Perform copy propagation to eliminate variables a, b, e.
d. Draw a data-dependence graph of statements c, d, f, g, h, j; label intraiteration edges with 0 and loop-carried edges with 1 or 2, depending on the number of iterations


Sunday, December 30th, 2007

cases is to give the heuristics a priority order and use the first heuristic in the order that applies (the order in which they are listed above is a reasonable prioritization, based on empirical measurements). Another approach is to index a table by every possible subset of conditions that might apply, and decide (based on empirical measurements) what to do for each subset.

SHOULD THE COMPILER PREDICT BRANCHES?

Perfect static prediction results in a dynamic mispredict rate of about 9% (for C programs) or 6% (for Fortran programs). The "perfect" mispredict rate is not zero because any given branch does not go in the same direction more than 91% of the time, on average. If a branch did go the same direction 100% of the time, there would be little need for it! Fortran programs tend to have more predictable branches because more of the branches are loop branches, and the loops have longer iteration counts.

Profile-based prediction, in which a program is compiled with extra instructions to count the number of times each branch is taken, executed on sample data, and recompiled with prediction based on the counts, approaches the accuracy of perfect static prediction. Prediction based on the heuristics described above results in a dynamic mispredict rate of about 20% (for C programs), or about half as good as perfect (or profile-based) static prediction.

A typical hardware-based branch-prediction scheme uses two bits for every branch in the instruction cache, recording how the branch went the last two times it executed. This leads to misprediction rates of about 11% (for C programs), which is about as good as profile-based prediction.

A mispredict rate of 10% can result in very many stalled instructions: if each mispredict stalls 11 instruction slots, as described in the example on page 456, and there is one mispredict every 10 branches, and one-sixth of all instructions are branches, then 18% of the processor's time is spent waiting for mispredicted instruction-fetches.
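The 18% figure can be reproduced with a few lines of arithmetic. The sketch below is only an illustration; the function name `stall_fraction` is made up here, and it assumes the figure counts stalled slots relative to the instructions actually executed (11 stalled slots for every 60 instructions).

```python
def stall_fraction(stall_slots_per_mispredict, mispredict_rate, branch_fraction):
    """Stalled issue slots per executed instruction.

    Assumption: the stall cost is measured against the instructions that
    are actually executed, which is one reading of the figure in the text.
    """
    return stall_slots_per_mispredict * mispredict_rate * branch_fraction


# 11 slots per mispredict, 1 mispredict per 10 branches, 1 branch per 6 instructions
fraction = stall_fraction(11, 0.10, 1 / 6)
print(f"{fraction:.1%}")
```

Halving the mispredict rate halves the result, which is why the prediction accuracy matters so much on a wide-issue machine.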
Therefore it will be necessary to do better, using some combination of hardware and software techniques. Relying on heuristics that mispredict 20% of the branches is better than no predictions at all, but will not suffice in the long run.

FURTHER READING

Hennessy and Patterson [1996] explain the design and implementation of high-performance machines, instruction-level parallelism, pipeline structure, functional units, caches, out-of-order execution, register renaming, branch prediction, and many other computer-architecture issues, with comparisons of compiler versus run-time-hardware techniques for optimization. Kane and Heinrich [1992] describe the pipeline constraints of the MIPS R4000 computer, from which Figures 20.1 and 20.2 are adapted.

CISC computers of the 1970s implemented complex instructions sequentially using an internal microcode that could do several operations simultaneously; it was not possible for the compiler to interleave parts of several macroinstructions for increased parallelism. Fisher [1981] developed an automatic scheduling algorithm for microcode, using the idea of trace scheduling to optimize


Sunday, December 30th, 2007

Some machines solve this problem by fetching the instructions immediately following the branch; then if the branch is not taken, these fetched-and-decoded instructions can be used immediately. Only if the branch is taken are there stalled instruction slots. Other machines assume the branch will be taken, and begin fetching the instructions at the target address; then if the branch falls through, there is a stall. Some machines even fetch from both addresses simultaneously, though this requires a very complex interface between processor and instruction-cache.

Modern machines rely on branch prediction to make the right guess about which instructions to fetch. The branch prediction can be static (the compiler predicts which way the branch is likely to go and places its prediction in the branch instruction itself) or dynamic (the hardware remembers, for each recently executed branch, which way it went last time, and predicts that it will go the same way).

STATIC BRANCH PREDICTION

The compiler can communicate predictions to the hardware by a 1-bit field of the branch instruction that encodes the predicted direction. To save this bit, or for compatibility with old instruction sets, some machines use a rule such as "backward branches are assumed to be taken, forward branches are assumed to be not-taken." The rationale for the first part of this rule is that backward branches are (often) loop branches, and a loop is more likely to continue than to exit. The rationale for the second part of the rule is that it's often useful to have predicted-not-taken branches for exceptional conditions; if all branches are predicted taken, we could reverse the sense of the condition to make the exceptional case "fall through" and the normal case take the branch, but this leads to worse instruction-cache performance, as discussed in Section 21.2.
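The backward-taken/forward-not-taken rule and the "same way as last time" dynamic scheme can be compared on a toy branch trace. This is a minimal sketch, not a model of any real machine: the names `predict_by_direction`, `LastOutcomePredictor`, and `count_misses`, the addresses, and the trace are all invented for illustration.

```python
def predict_by_direction(branch_addr, target_addr):
    """Static rule: backward branches are predicted taken, forward ones not taken."""
    return target_addr <= branch_addr


class LastOutcomePredictor:
    """Dynamic scheme: predict that a branch goes the way it went last time."""

    def __init__(self, default=True):
        self.last = {}          # branch address -> last observed outcome
        self.default = default  # guess used for a branch never seen before

    def predict(self, addr):
        return self.last.get(addr, self.default)

    def update(self, addr, taken):
        self.last[addr] = taken


def count_misses(predict, update, addr, outcomes):
    """Run one branch through a sequence of outcomes and count mispredictions."""
    misses = 0
    for taken in outcomes:
        if predict(addr) != taken:
            misses += 1
        update(addr, taken)
    return misses


# A loop-closing branch at 0x120 jumping back to 0x100; the loop runs twice,
# four iterations each time (taken three times, then falls through).
trace = [True, True, True, False] * 2

static_guess = predict_by_direction(0x120, 0x100)
static_misses = count_misses(lambda a: static_guess, lambda a, t: None, 0x120, trace)

dyn = LastOutcomePredictor()
dynamic_misses = count_misses(dyn.predict, dyn.update, 0x120, trace)

print(static_misses, dynamic_misses)
```

On this trace the static rule misses only the two loop exits, while the last-outcome scheme also misses the first iteration after each exit.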
When generating code for machines that use forward/backward branch direction as the prediction mechanism, the compiler can order the basic blocks of the program so that the predicted-taken branches go to lower addresses.

Several simple heuristics help predict the direction of a branch. Some of these heuristics make intuitive sense, but all have been validated empirically:

Pointer: If a loop performs an equality comparison on pointers (p = null or p = q), then predict the condition as false.
Call: A branch is less likely to go to the successor that dominates a procedure call (many conditional calls are to handle exceptional situations).
Return: A branch is less likely to go to a successor that dominates a return-from-procedure.
Loop: A branch is more likely to go to the successor (if any) that is the header of the loop containing the branch.
Loop: A branch is more likely to go to the successor (if any) that is a loop preheader, if it does not postdominate the branch. This catches the results of the optimization described in Figure 18.7, where the iteration count is more likely to be > 0 than = 0. (B postdominates A if any path from A to program-exit must go through B; see Section 19.5.)
Guard: If some value r is used as an operand of the branch (as part of the conditional test), then a branch is more likely to go to a successor in which r is live and which does not postdominate the branch.

There are some branches to which more than one of the heuristics apply. A simple approach in such


Sunday, December 30th, 2007

Such machines first appeared in 1967 (the IBM 360/91), but did not become common until the mid-1990s. Now it appears that most high-performance processors are being designed with dynamic (run-time) scheduling. These machines have several advantages and disadvantages, and it is not yet clear whether static (compile-time) scheduling or out-of-order execution will become standard.

Advantages of static scheduling

Out-of-order execution uses expensive hardware resources and tends to increase the chip's cycle time and wattage.
The static scheduler can schedule earlier the instructions whose future data-dependence path is longest; a real-time scheduler cannot know the length of the data-dependence path leading from an instruction (see Exercise 20.3).
The scheduling problem is NP-complete, so compilers (which have no real-time constraint on their scheduling algorithms) should in principle be able to find better schedules.

Advantages of dynamic scheduling

Some aspects of the schedule are unpredictable at compile time, such as cache misses, and can be better scheduled when their actual latency is known (see Figure 21.5).
Highly pipelined schedules tend to use many registers; typical machines have only 32 register names in a five-bit instruction field, but out-of-order execution with run-time register renaming can use hundreds of actual registers with a few static names (see the Further Reading section).
Optimal static scheduling depends on knowing the precise pipeline state that will be reached by the hardware, which is sometimes difficult to determine in practice.
Finally, dynamic scheduling does not require that the program be recompiled (i.e., rescheduled) for each different implementation of the same instruction set.

20.3 BRANCH PREDICTION

In many floating-point programs, such as Program 20.4a, the basic blocks are long, the instructions are long-latency floating-point operations, and the branches are very predictable for-loop exit conditions.
In such programs the problem, as described in the previous sections, is to schedule the long-latency instructions. But in many programs (such as compilers, operating systems, window systems, and word processors) the basic blocks are short, the instructions are quick integer operations, and the branches are harder to predict. Here the main problem is fetching the instructions fast enough to be able to decode and execute them.

Figure 20.11 illustrates the pipeline stages of a COMPARE, BRANCH, and ADD instruction. Until the BRANCH has executed, the instruction-fetch of the successor instruction cannot be performed because the address to fetch is unknown.

Figure 20.11: Dependence of ADD's instruction-fetch on result of BRANCH.

Suppose a superscalar machine can issue four instructions at once. Then, in waiting three cycles after the BRANCH is fetched before the ADD can be fetched, 11 instruction-issue slots are wasted (3 × 4 minus the slot that the BRANCH occupies).
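The slot count generalizes directly to other issue widths and fetch delays. A minimal sketch of the arithmetic, with a function name (`wasted_issue_slots`) chosen here purely for illustration:

```python
def wasted_issue_slots(issue_width, stall_cycles):
    """Issue slots lost while the fetch of the branch successor is delayed.

    One of the issue_width * stall_cycles slots is occupied by the BRANCH
    itself, so it is not counted as wasted.
    """
    return issue_width * stall_cycles - 1


print(wasted_issue_slots(4, 3))  # the four-issue, three-cycle case from the text
```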


Saturday, December 29th, 2007

Checking for resource conflicts is done with a resource reservation table, an array of length Δ. The resources used by an instruction at time t can be entered in the array at position t mod Δ; adding and removing resource-usage from the table, and checking for conflicts, can be done in constant time.

This algorithm is not guaranteed to find an optimal schedule in any sense. There may be an optimal, register-allocable schedule with initiation interval Δ, and the algorithm may fail to find any schedule with time Δ, or it may find a schedule for which register-allocation fails. The only consolation is that it is reported to work very well in practice. The operation of the algorithm on our example is shown in Figure 20.10.

Figure 20.10: Iterative modulo scheduling applied to Program 20.4b. Graph 20.5a is the data-dependence graph; Δmin = 5 (see page 451); H = [c, d, e, a, b, f, j, g, h].

OTHER CONTROL FLOW

We have shown scheduling algorithms for simple straight-line loop bodies. What if the loop contains internal control flow, such as a tree of if-then-else statements? One approach is to compute both branches of the loop, and then use a conditional move instruction (provided on many high-performance machines) to produce the right result. For example, the loop at left can be rewritten into the loop at right, using a conditional move:

    for i ← 1 to N               for i ← 1 to N
        x ← M[i]                     x ← M[i]
        if x > 0                     u′ ← z × x
            u ← z × x                u ← A[i]
        else u ← A[i]                if x > 0 move u ← u′
        s ← s + u                    s ← s + u

The resulting loop body is now straight-line code that can be scheduled easily. But if the two sides of the if differ greatly in size, and the frequently executed branch is the small one, then executing both sides in every iteration will be slower than optimal. Or if one branch of the if has a side effect, it must not be executed unless the condition is true.
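The equivalence of the two loop forms can be checked with a small sketch. Python has no conditional-move instruction, so the conditional expression below only models its semantics; the function names and sample data are invented for illustration, and `z * x` simply stands for whatever arithmetic the loop body performs.

```python
def sum_with_branch(M, A, z):
    """Original loop: only one side of the if is evaluated per iteration."""
    s = 0
    for i in range(len(M)):
        x = M[i]
        if x > 0:
            u = z * x
        else:
            u = A[i]
        s = s + u
    return s


def sum_if_converted(M, A, z):
    """Rewritten loop: both sides are computed, then one result is selected."""
    s = 0
    for i in range(len(M)):
        x = M[i]
        u_prime = z * x                # computed unconditionally
        u = A[i]                       # computed unconditionally
        u = u_prime if x > 0 else u    # models "if x > 0 move u <- u'"
        s = s + u
    return s


M = [3, -1, 0, 2]
A = [10, 20, 30, 40]
print(sum_with_branch(M, A, 5), sum_if_converted(M, A, 5))
```

Both sides here are free of side effects, which is what makes the rewrite legal; a side that stored to memory or could fault would have to stay guarded.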
To solve this problem we use trace scheduling: We pick some frequently executed straight-line path through the branches of control flow, schedule this path efficiently, and suffer some amount of inefficiency at those times where we must jump into or out of the trace. See Section 8.2 and also the Further Reading section of this chapter.

SHOULD THE COMPILER SCHEDULE INSTRUCTIONS?

Many machines have hardware that does dynamic instruction rescheduling at run time. These machines do out-of-order execution, meaning that there may be several decoded instructions in a buffer, and whichever instruction's operands are available can execute next, even if other instructions that appeared earlier in the program code are still awaiting operands or resources.


Saturday, December 29th, 2007

c × Δ. In this case, h is placed without regard to resource constraints, in the earliest time slot that obeys dependence constraints (with respect to already-placed predecessors), and is later than any previous attempt to place h.

ALGORITHM 20.9: Iterative modulo scheduling.

    for Δ ← Δmin to ∞
        Budget ← n × 3
        for i ← 1 to n
            LastTime[i] ← 0
            SchedTime[i] ← none
        while Budget > 0 and there are any unscheduled instructions
            Budget ← Budget - 1
            let h be the highest-priority unscheduled instruction
            tmin ← 0
            for each predecessor p of h
                if SchedTime[p] ≠ none
                    tmin ← max(tmin, SchedTime[p] + Delay(p, h))
            for t ← tmin to tmin + Δ - 1
                if SchedTime[h] = none
                    if h can be scheduled without resource conflicts
                        SchedTime[h] ← t
            if SchedTime[h] = none
                SchedTime[h] ← max(tmin, 1 + LastTime[h])
            LastTime[h] ← SchedTime[h]
            for each successor s of h
                if SchedTime[s] ≠ none
                    if SchedTime[h] + Delay(h, s) > SchedTime[s]
                        SchedTime[s] ← none
            while the current schedule has resource conflicts
                let s be some instruction (other than h) involved in a resource conflict
                SchedTime[s] ← none
        if all instructions are scheduled
            RegisterAllocate()
            if register allocation succeeded without spilling
                return and report a successfully scheduled loop.

    Delay(h, s) =
        Given a dependence edge h_i → s_{i+k}, so that s uses the value of h from the kth previous iteration (where k = 0 means that s uses the current iteration's value of h);
        Given that the latency of the instruction that computes h is l cycles;
        return l - kΔ.

Once h is placed, other instructions are removed to make the subset schedule S legal again: any successors of h that now don't obey data-dependence constraints, or any instructions that have resource conflicts with h.

This placement-and-removal could iterate forever, but most of the time either it finds a solution quickly or there is no solution, for a given Δ. To cut the algorithm off if it does not find a quick solution, a Budget of c × n schedule placements is allowed (for c = 3 or some similar number), after which this value of Δ is abandoned and the next one is tried.

When a def-use edge associated with variable j becomes longer than Δ cycles, it becomes necessary to have more than one copy of j, with MOVE instructions copying the different-iteration versions in bucket-brigade style. This is illustrated in Figure 20.8 for variables a, b, f, j, but we will not show an explicit algorithm for inserting the moves.
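Algorithm 20.9 can be sketched in Python under some simplifying assumptions that are ours, not the text's: every instruction occupies one functional unit of a single class for exactly one cycle, the register-allocation step is omitted, and the search stops at a caller-supplied `delta_max` instead of running to infinity. The edge list is an assumed reconstruction of the dependences of Program 20.4b, which this excerpt does not reproduce; all function and variable names are illustrative.

```python
from collections import defaultdict


def modulo_schedule(order, unit_of, capacity, edges, delta_min, delta_max):
    """order: instructions in priority order; edges: (pred, succ, latency, k)."""
    preds, succs = defaultdict(list), defaultdict(list)
    for p, s, lat, k in edges:
        preds[s].append((p, lat, k))
        succs[p].append((s, lat, k))

    for delta in range(delta_min, delta_max + 1):
        budget = 3 * len(order)
        last = {i: 0 for i in order}
        sched = {i: None for i in order}
        table = defaultdict(list)  # reservation table: (unit, t mod delta) -> users

        def unplace(i):
            table[(unit_of[i], sched[i] % delta)].remove(i)
            sched[i] = None

        while budget > 0 and any(sched[i] is None for i in order):
            budget -= 1
            h = next(i for i in order if sched[i] is None)
            tmin = 0
            for p, lat, k in preds[h]:
                if sched[p] is not None:
                    tmin = max(tmin, sched[p] + lat - k * delta)  # Delay(p, h)
            u = unit_of[h]
            free = [t for t in range(tmin, tmin + delta)
                    if len(table[(u, t % delta)]) < capacity[u]]
            # No free slot in delta consecutive cycles: force the placement.
            t = free[0] if free else max(tmin, 1 + last[h])
            sched[h] = last[h] = t
            slot = table[(u, t % delta)]
            slot.append(h)
            for s, lat, k in succs[h]:  # evict successors now placed too early
                if s != h and sched[s] is not None and t + lat - k * delta > sched[s]:
                    unplace(s)
            while len(slot) > capacity[u]:  # evict resource conflicts
                unplace(slot[0])            # slot[0] is never h (appended last)
        if all(sched[i] is not None for i in order):
            return delta, sched
    return None


ORDER = list("cdeabfjgh")
UNIT_OF = {**{i: "arith" for i in "abcde"}, **{i: "mem" for i in "fghj"}}
CAPACITY = {"arith": 1, "mem": 1}
EDGES = [("j", "a", 1, 1), ("b", "a", 1, 1), ("a", "b", 1, 0), ("f", "b", 1, 1),
         ("e", "c", 1, 1), ("j", "c", 1, 1), ("f", "d", 1, 1), ("c", "d", 1, 0),
         ("b", "e", 1, 0), ("d", "e", 1, 0), ("b", "g", 1, 0), ("d", "h", 1, 0)]

delta, sched = modulo_schedule(ORDER, UNIT_OF, CAPACITY, EDGES, 5, 20)
print(delta, sched)
```

The `table` dictionary plays the role of the resource reservation table: it is indexed by time modulo the initiation interval, so a conflict at time t is also a conflict at every time congruent to t.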


Friday, December 28th, 2007

described in this section first schedules, then does register allocation and hopes for the best.

FINDING THE MINIMUM INITIATION INTERVAL

Modulo scheduling begins by finding a lower bound for the number of cycles Δ in the pipelined loop body:

Resource estimator: For any kind of functional unit, such as a multiplier or a memory-fetch unit, we can see how many cycles such units will be used by the corresponding instructions (e.g., multiply or load, respectively) in the loop body. This, divided by the number of that kind of functional unit provided by the hardware, gives a lower bound for Δ. For example, if there are 6 multiply instructions that each use a multiplier for 3 cycles, and there are two multipliers, then Δ ≥ 6 × 3/2 = 9.
Data-dependence estimator: For any data-dependence cycle in the data-dependence graph, where some value x_i depends on a chain of other calculations that depends on x_{i-1}, the total latency of the chain gives a lower bound for Δ.

Let Δmin be the maximum of these estimators.

Let us calculate Δmin for Program 20.4b. For simplicity, we assume that one ⊕-arithmetic instruction and one load/store can be issued at a time, and every instruction finishes in one cycle; and we will not consider the scheduling of i ← i + 1 or the conditional branch. Then the arithmetic resource estimator is 5 ⊕-instructions in the loop body divided by 1 issuable arithmetic instruction per cycle, or Δ ≥ 5. The load/store resource estimator is 4 load/store instructions in the loop body divided by 1 issuable memory operation per cycle, or Δ ≥ 4. The data-dependence estimator comes from the cycle c_i → d_i → e_i → c_{i+1} in Graph 20.5a, whose length gives Δ ≥ 3.

Next, we prioritize the instructions of the loop body by some heuristic that decides which instructions to consider first.
For example, instructions that are in critical data-dependence cycles, or instructions that use a lot of scarce resources, should be placed in the schedule first, and then other instructions can be filled in around them. Let H1, ..., Hn be the instructions of the loop body, in (heuristic) priority order. In our example, we could use H = [c, d, e, a, b, f, j, g, h], putting early the instructions that are in the critical recurrence cycle or that use the arithmetic functional unit (since the resource estimators for this loop tell us that arithmetic is in more demand than load/stores).

The scheduling algorithm maintains a set S of scheduled instructions, each scheduled for a particular time t. The value of SchedTime[h] = none if h ∉ S, otherwise SchedTime[h] is the currently scheduled time for h. The members of S obey all resource and data-dependence constraints.

Each iteration of Algorithm 20.9 places the highest-priority unscheduled instruction h into S, as follows:

1. In the earliest time slot (if there is one) that obeys all dependence constraints with respect to already-placed predecessors of h, and respects all resource constraints.
2. But if there is no slot in Δ consecutive cycles that obeys resource constraints, then there can never be such a slot, because the functional units available at time t are the same as those at t +
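The lower-bound calculation from this section (the resource and data-dependence estimators) can be sketched as follows. The helper names are hypothetical, and rounding the resource bound up to a whole cycle is an assumption made here because an initiation interval is a whole number of cycles.

```python
from math import ceil


def resource_bound(uses, cycles_per_use, units):
    """Cycles of demand on one kind of functional unit, divided by how many exist."""
    return ceil(uses * cycles_per_use / units)


def minimum_initiation_interval(resource_bounds, recurrence_latencies):
    """The largest of the resource and data-dependence estimators."""
    return max(list(resource_bounds) + list(recurrence_latencies))


# The multiplier example: 6 multiplies, 3 cycles each, 2 multipliers.
multiplier_bound = resource_bound(6, 3, 2)

# The worked example: 5 arithmetic and 4 load/store instructions, one unit of
# each kind, single-cycle instructions, and one recurrence of total latency 3.
arith_bound = resource_bound(5, 1, 1)
mem_bound = resource_bound(4, 1, 1)
delta_min = minimum_initiation_interval([arith_bound, mem_bound], [3])

print(multiplier_bound, arith_bound, mem_bound, delta_min)
```

Because the arithmetic bound dominates, it is reasonable for the priority order to favor the arithmetic instructions, as the choice of H above does.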


Friday, December 28th, 2007

MODULO SCHEDULING

Iterative modulo scheduling is a practical, though not optimal, algorithm for resource-bounded loop scheduling. The idea is to use iterative backtracking to find a good schedule that obeys the functional-unit and data-dependence constraints, and then perform register allocation. The algorithm tries to place all the instructions of the loop body in a schedule of Δ cycles, assuming that there will also be a prologue and epilogue of the kind used by the Aiken-Nicolau algorithm. The algorithm tries increasing values of Δ until it reaches a value for which it can make a schedule.

A key idea of modulo scheduling is that if an instruction violates functional-unit constraints at time t, then it will not fit at time t + Δ, or at any time t′ where t ≡ t′ modulo Δ.

Suppose, for example, we are trying to schedule Program 20.4b with Δ = 3 on a machine that can perform only one load instruction at a time. The following loop-body schedule is illegal, with two different loads at cycle 1:

We can move f_i from cycle 1 of the loop to cycle 0, or cycle 2:

Either one avoids the resource conflict. We could move f_i even earlier, to cycle -1, where (in effect) we are computing f_{i+1}, or even later, to cycle 3, where we are computing f_{i-1}:

But with Δ = 3 we can never solve the resource conflict by moving f_i from cycle 1 to cycle 4 (or to cycle -2), because 1 ≡ 4 modulo 3; the calculation of f would still conflict with the calculation of j:

Effects on register allocation

Consider the calculation of d ← f ⊕ c, which occurs at cycle 0 of the schedule in Figure 20.7. If we place the calculation of d in a later cycle, then the data-dependence edges from the definitions of f and c to this instruction would lengthen, and the data-dependence edges from this instruction to the use of d in W[i] ← d would shrink.
If a data-dependence edge shrinks to less than zero cycles, then a data-dependence constraint has been violated; this can be solved by also moving the calculations that use d to a later cycle. Conversely, if a data-dependence edge grows many cycles long, then we must carry several "versions" of a value around the loop (as we carry f, f′, f″ around the loop of Figure 20.8), and this means that we are using more temporaries, so that register allocation may fail. In fact, an optimal loop-scheduling algorithm should consider register allocation simultaneously with scheduling; but it is not clear whether optimal algorithms are practical, and the iterated modulo scheduling algorithm


Thursday, December 27th, 2007

Figure 20.8: Pipelined schedule, with move instructions. This loop is optimally scheduled, assuming the machine can execute eight instructions at a time, including four simultaneous loads and stores.

Multicycle instructions

Although we have illustrated an example where each instruction takes exactly one cycle, the algorithm is easily extensible to the situation where some instructions take multiple cycles.

20.2 RESOURCE-BOUNDED LOOP PIPELINING

A real machine can issue only a limited number of instructions at a time, and has only a limited number of load/store units, adders, and multipliers. To be practically useful, a scheduling algorithm must take account of resource constraints. The input to the scheduling algorithm must be in three parts:

1. A program to be scheduled;
2. A description of what resources each instruction uses in each of its pipeline stages (similar to Figure 20.1);
3. A description of the resources available on the machine (how many of each kind of functional unit, how many instructions may be issued at once, restrictions on what kinds of instructions may be issued simultaneously, and so on).

Resource-bounded scheduling is NP-complete, meaning that there is unlikely to be an efficient optimal algorithm. As usual in this situation, we use an approximation algorithm that does reasonably well in "typical" cases.
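The second and third inputs in the list above could be represented roughly as follows. This is only one possible encoding, with invented names (`InstrKind`, `Machine`, `can_issue_together`) and an invented two-unit machine; it checks a single group of instructions issued in the same cycle and ignores overlap with instructions issued in earlier cycles.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrKind:
    name: str
    stage_resources: tuple  # resource used in each pipeline stage, in order


@dataclass(frozen=True)
class Machine:
    issue_width: int  # how many instructions may be issued at once
    units: dict       # resource name -> number of such functional units


def can_issue_together(machine, kinds):
    """True if the group fits the issue width and no stage oversubscribes a unit."""
    if len(kinds) > machine.issue_width:
        return False
    depth = max((len(k.stage_resources) for k in kinds), default=0)
    for stage in range(depth):
        demand = Counter(k.stage_resources[stage]
                         for k in kinds if stage < len(k.stage_resources))
        if any(n > machine.units.get(r, 0) for r, n in demand.items()):
            return False
    return True


machine = Machine(issue_width=2, units={"alu": 1, "mem": 1})
add = InstrKind("add", ("alu",))
load = InstrKind("load", ("mem", "mem"))  # hypothetical two-stage memory access

print(can_issue_together(machine, [add, load]))  # different units
print(can_issue_together(machine, [add, add]))   # two adds compete for one ALU
```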


Thursday, December 27th, 2007

See the Further Reading section for reference to proofs. But to see why the loop is optimal, consider that the data-dependence DAG of the unrolled loop has some path of length P to the last instruction to execute, and the scheduled loop executes that instruction at time P.

The result, for our example, is shown in Table 20.6b. Now we can find a repeating pattern of three cycles (since three is the slope of the steepest group). In this case, the pattern does not begin until cycle 8; it is shown in a box. This will constitute the body of the scheduled loop. Irregularly scheduled instructions before the loop body constitute a prologue, and instructions after it constitute the epilogue.

Now we can generate the multiple-instruction-issue program for this loop, as shown in Figure 20.7. However, the variables still have subscripts in this "program": The variable j_{i+1} is live at the same time as j_i. To encode this program in instructions, we need to put in MOVE instructions between the different variables, as shown in Figure 20.8.

Figure 20.7: Pipelined schedule. Assignments in each row happen simultaneously; each right-hand side refers to the value before the assignment. The loop exit test i < N + 1 has been "moved past" three increments of i, so appears as i < N - 2.