Intel ARCHITECTURE IA-32 manual


A good user manual

The rules should oblige the seller to give the purchaser operating instructions for the Intel ARCHITECTURE IA-32 along with the item. Missing instructions or false information given to the customer constitute grounds for a complaint on the basis of nonconformity of the goods with the contract. In accordance with the law, a customer can receive instructions in non-paper form; graphic and electronic manuals, as well as instructional videos, have recently become widespread. A necessary precondition is that the instructions be unambiguous and legible.

What is an instruction?

The term originates from the Latin word „instructio”, which means organizing. An instruction manual for the Intel ARCHITECTURE IA-32 therefore contains a description of a process. Its purpose is to teach and to ease start-up, use of the item, or performance of certain activities. An instruction manual is a compilation of information about an item or a service; it is a guide.

Unfortunately, only a few customers take the time to read the instructions for the Intel ARCHITECTURE IA-32. A good user manual introduces a number of additional functions of the purchased item and also helps to avoid most defects.

What should a perfect user manual contain?

First and foremost, a user manual for the Intel ARCHITECTURE IA-32 should contain:
- information concerning the technical data of the Intel ARCHITECTURE IA-32
- name of the manufacturer and year of construction of the Intel ARCHITECTURE IA-32 item
- rules of operation, control and maintenance of the Intel ARCHITECTURE IA-32 item
- safety signs and mark certificates which confirm compatibility with appropriate standards

Why don't we read the manuals?

Usually it results from a lack of time and from confidence that the functions of a purchased item are already understood. Unfortunately, connecting and starting up the Intel ARCHITECTURE IA-32 is not enough. A manual contains guidance on the individual functions, safety rules, maintenance methods (including which means should be used), possible defects of the Intel ARCHITECTURE IA-32, and ways to resolve problems. If the answer still cannot be found, the manual directs the reader to Intel service. Animated manuals and instructional videos have recently become popular among customers. These kinds of user manuals are effective; they help ensure that a customer becomes familiar with the whole material and does not skip complicated technical information about the Intel ARCHITECTURE IA-32.

Why should one read the manuals?

The manuals are where we mostly find the details concerning the construction and capabilities of the Intel ARCHITECTURE IA-32 item and its use with the respective accessories, as well as information concerning all its functions and facilities.

After a successful purchase, one should find a moment to get to know every part of the manual. Manuals are now carefully prepared and translated so that they can be fully understood by their users. The manuals serve as an informational aid.

Table of contents for the manual

  • Page 1

    IA-32 Intel® Architecture Optimization Reference Manual Order Number: 248966-013US April 2006[...]

  • Page 2

    ii INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS [...]

  • Page 3

    iii Contents Introduction Chapter 1 IA-32 Intel® Architecture Processor Family Overview SIMD Technology ... 1-2 Summary of SIMD Technologies ...[...]

  • Page 4

    iv Out-of-Order Core ... 1-30 In-Order Retirement ... 1-31 Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors ...[...]

  • Page 5

    v Branch Prediction ... 2-15 Eliminating Branches ... 2-15 Spin-Wait and Idle Loops ...[...]

  • Page 6

    vi Floating-Point Stalls ... 2-72 x87 Floating-point Operations with Integer Operands ... 2-72 x87 Floating-point Comparison Instructions ...[...]

  • Page 7

    vii Considerations for Code Conversion to SIMD Programming ... 3-8 Identifying Hot Spots ... 3-10 Determine If Code Benefits by Conversion to SIMD Execution ...[...]

  • Page 8

    viii Packed Shuffle Word for 64-bit Registers ... 4-18 Packed Shuffle Word for 128-bit Registers ... 4-19 Unpacking/interleaving 64-bit Data in 128-bit Registers ...[...]

  • Page 9

    ix Data Alignment ... 5-4 Data Arrangement ... 5-4 Vertical versus Horizontal Computation ...[...]

  • Page 10

    x Hardware Prefetch ... 6-19 Example of Effective Latency Reduction with H/W Prefetch ... 6-20 Example of Latency Hiding with S/W Prefetch Instruction ...[...]

  • Page 11

    xi Key Practices of System Bus Optimization ... 7-17 Key Practices of Memory Optimization ... 7-17 Key Practices of Front-end Optimization ...[...]

  • Page 12

    xii Sign Extension to Full 64-Bits ... 8-3 Alternate Coding Rules for 64-Bit Mode ... 8-4 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic ...[...]

  • Page 13

    xiii Time-based Sampling ... A-9 Event-based Sampling ... A-10 Workload Characterization ...[...]

  • Page 14

    xiv Using Performance Metrics with Hyper-Threading Technology ... B-50 Using Performance Events of Intel Core Solo and Intel Core Duo processors ... B-56 Understanding the Results in a Performance Counter ... B-56 Ratio Interpretation [...]

  • Page 15

    xv Examples Example 2-1 Assembly Code with an Unpredictable Branch ... 2-17 Example 2-2 Code Optimization to Eliminate Branches ... 2-17 Example 2-3 Eliminating Branch with CMOV Instruction ... 2-18 Example 2-4 Use of pause Instruction ...[...]

  • Page 16

    xvi Example 3-4 Identification of SSE2 with cpuid ... 3-5 Example 3-5 Identification of SSE2 by the OS ... 3-6 Example 3-6 Identification of SSE3 with cpuid ... 3-7 Example 3-7 Identification[...]

  • Page 17

    xvii Example 4-20 Clipping to an Arbitrary Signed Range [high, low] ... 4-27 Example 4-21 Simplified Clipping to an Arbitrary Signed Range ... 4-28 Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low] ... 4-29 Example 4-23 Complex Multiply by a Constant ...[...]

  • Page 18

    xviii Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation ... 6-50 Example 7-1 Serial Execution of Producer and Consumer Work Items ... 7-9 Example 7-2 Basic Structure of Implementing Producer Consumer Threads ... 7-11 Example 7-3 Thread Function for an Interlaced Producer Consumer Model ... 7-13 Example 7[...]

  • Page 19

    xix Figures Figure 1-1 Typical SIMD Operations ... 1-3 Figure 1-2 SIMD Instruction Register Usage ... 1-4 Figure 1-3 The Intel NetBurst Microarchitecture ... 1-10 Figure 1[...]

  • Page 20

    xx Figure 6-2 Memory Access Latency and Execution Without Prefetch ... 6-23 Figure 6-3 Memory Access Latency and Execution With Prefetch ... 6-23 Figure 6-4 Prefetch and Loop Unrolling ... 6-29 Figure 6-5 Memory Access Latency and Execution With Prefetch[...]

  • Page 21

    xxi Tables Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters ... 1-20 Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and Intel® Core™ Duo Processors ... 1-30 Table 1-2 Trigger Threshold and CPUID Signatures for IA-32 Processor F[...]

  • Page 22

    xxii Table C-5 Streaming SIMD Extension 64-bit Integer Instructions ... C-14 Table C-7 IA-32 x87 Floating-point Instructions ... C-16 Table C-8 IA-32 General Purpose Instructions ... C-17[...]

  • Page 23

    xxiii Introduction The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of the current generation of IA-32 Intel architecture family of processors. The optimizations described in this manual apply to IA-32 processors based on the Intel® NetBurst[...]

  • Page 24

    IA-32 Intel® Architecture Optimization xxiv target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. Tuning Your Application Tuning an application for high performance on any IA-32 processor requires understanding and basic skills in: • IA-32 architecture • C and Assembly language • the hot-spot regi[...]

  • Page 25

    Introduction xxv The manual consists of the following parts: Introduction. Defines the purpose and outlines the contents of this manual. Chapter 1: IA-32 Intel® Architecture Processor Family Overview. Describes the features relevant to software optimization of the current generation of IA-32 Intel architecture processors, including the arc[...]

  • Page 26

    IA-32 Intel® Architecture Optimization xxvi Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling. Use these when targeting multiprocessor (MP) systems or MP systems using IA-32 processors that support Hyper-Threading T[...]

  • Page 27

    Introduction xxvii Related Documentation For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following documents: • Intel® C++ Compiler User’s Guide • Intel® Fortran Compiler User’s Guide • VTune Performance Analyzer online help • Int[...]

  • Page 28

    IA-32 Intel® Architecture Optimization xxviii Notational Conventions This manual uses the following conventions: This type style Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example. The text appears in lowercase unless uppercase is significant. THIS TYPE STYLE Indic[...]

  • Page 29

    1-1 1 IA-32 Intel® Architecture Processor Family Overview This chapter gives an overview of the features relevant to software optimization for the current generations of IA-32 processors, including: Intel® Core™ Solo, Intel® Core™ Duo, Intel® Pentium® 4, Intel® Xeon®, Intel® Pentium® M, and IA-32 processors with mu[...]

  • Page 30

    IA-32 Intel® Architecture Optimization 1-2 Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor. SIMD Technology SIMD computations (see Figure 1-1) were introduced in the IA-32 architecture with MMX t[...]

  • Page 31

    IA-32 Intel® Architecture Processor Family Overview 1-3 each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are sorted as a set of four packed data elements. The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming S[...]
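
    As a rough illustration of the packed operation described in the preview above, the following C sketch models one SIMD operation applied to four pairs of elements. The type `packed4` and the function `packed_add` are hypothetical names chosen for this example, and the choice of addition as the operation is an assumption. The scalar loop only models the result; real code would rely on MMX/SSE instructions or compiler intrinsics to perform the four operations in parallel.

```c
#include <stddef.h>

/* Hypothetical model of a register holding four packed
   single-precision elements (X1..X4 or Y1..Y4). */
typedef struct {
    float e[4];
} packed4;

/* Applies the same operation (here: addition) to each corresponding
   pair of elements and returns the four results as one packed value.
   A SIMD instruction would compute the four sums in parallel; this
   loop models only the result, not the parallel execution. */
packed4 packed_add(packed4 x, packed4 y)
{
    packed4 r;
    for (size_t i = 0; i < 4; i++) {
        r.e[i] = x.e[i] + y.e[i];
    }
    return r;
}
```

    Calling `packed_add` with X = (1, 2, 3, 4) and Y = (10, 20, 30, 40) yields (11, 22, 33, 44), one result per element pair.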

  • Page 32

    IA-32 Intel® Architecture Optimization 1-4 SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics: • inherently parallel • recurring memory access patterns • localized recurring operations performed on the data • data-independe[...]

  • Page 33

    IA-32 Intel® Architecture Processor Family Overview 1-5 SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance. For more on SSE, SSE2, SSE3 and MMX technologies, see: IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Chapter 9, “Pr[...]

  • Page 34

    IA-32 Intel® Architecture Optimization 1-6 SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding. Streaming SIMD Extensions 2 Streaming SIMD extensions 2 add the following: • 128-bit data type with two packed double-precision floating-point operands • 128-bit data types for SIMD integ[...]

  • Page 35

    IA-32 Intel® Architecture Processor Family Overview 1-7 Intel® Extended Memory 64 Technology (Intel® EM64T) Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating [...]

  • Page 36

    IA-32 Intel® Architecture Optimization 1-8 Intel NetBurst® Microarchitecture The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture. This section describes the f[...]

  • Page 37

    IA-32 Intel® Architecture Processor Family Overview 1-9 • to operate at high clock rates and to scale to higher performance and clock rates in the future Design advances of the Intel NetBurst microarchitecture include: • a deeply pipelined design that allows for high clock rates (with different parts of the chip running at different clock [...]

  • Page 38

    IA-32 Intel® Architecture Optimization 1-10 The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple µops per cycle. The retirement section ensures that the results of execution are processed according to origina[...]

  • Page 39

    IA-32 Intel® Architecture Processor Family Overview 1-11 The Front End The front end of the Intel NetBurst microarchitecture consists of two parts: • fetch/decode unit • execution trace cache It performs the following functions: • prefetches IA-32 instructions that are likely to be executed • fetches required instructions that have no[...]

  • Page 40

    IA-32 Intel® Architecture Optimization 1-12 The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible. Branch targets are fetched from the execution trace cache if they are cached, oth[...]

  • Page 41

    IA-32 Intel® Architecture Processor Family Overview 1-13 correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively. When a µop completes and writes its result to the dest[...]

  • Page 42

    IA-32 Intel® Architecture Optimization 1-14 • a mechanism fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the adjacent cache line within an 128-byte sector that contains the data needed due to a cache line miss, this is also referred to as adjacent cache line prefetch (2) a software controlled mechani[...]

  • Page 43

    IA-32 Intel® Architecture Processor Family Overview 1-15 Branch Prediction Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty that is incurred in the absence of correct prediction. For [...]

  • Page 44

    IA-32 Intel® Architecture Optimization 1-16 To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2). Branch Target Buffer. Once branch history is available, the Penti[...]

  • Page 45

    IA-32 Intel® Architecture Processor Family Overview 1-17 Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see “Store Forwarding” in this chapter). If a load is predicted to be dependent on a store, it [...]

  • Page 46

    IA-32 Intel® Architecture Optimization 1-18 execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate. All µops executed out of the microcode ROM in[...]

  • Page 47

    IA-32 Intel® Architecture Processor Family Overview 1-19 Caches The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture. The Intel Xeon processor MP and selected Pentium and Intel Xeon processors may a[...]

  • Page 48

    IA-32 Intel® Architecture Optimization 1-20 Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm. Table 1-1 provides parameters for all cache levels for Pentium and Intel Xeon Processors with[...]

  • Page 49

    IA-32 Intel® Architecture Processor Family Overview 1-21 back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as bus ratio. For example, one bus cycle for a 100 MHz bus is e[...]
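
    The bus ratio defined in the preview above can be turned into a small worked calculation. The C sketch below is illustrative only: the function names are hypothetical, and the clock speeds mentioned afterwards are assumed values, not figures taken from the manual. It also assumes the processor clock is an integer multiple of the bus clock, since it uses integer division.

```c
/* Bus ratio: processor clock speed divided by bus clock speed.
   Both arguments are in MHz. Integer division is used, so the
   processor clock is assumed to be a multiple of the bus clock. */
unsigned bus_ratio(unsigned processor_mhz, unsigned bus_mhz)
{
    return processor_mhz / bus_mhz;
}

/* Converts a latency given in bus cycles into processor cycles
   by multiplying it by the bus ratio. */
unsigned bus_to_processor_cycles(unsigned bus_cycles,
                                 unsigned processor_mhz,
                                 unsigned bus_mhz)
{
    return bus_cycles * bus_ratio(processor_mhz, bus_mhz);
}
```

    For a hypothetical 2000 MHz processor on a 100 MHz bus, the bus ratio is 20, so a memory access of 6 to 12 bus cycles corresponds to 120 to 240 processor cycles.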

  • Page 50

    IA-32 Intel® Architecture Optimization 1-22 • avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels Situations that are less likely to benefit from software prefetch are: • for cases that are already bandwidth bound, prefetching tends to i[...]

  • Page 51

    IA-32 Intel® Architecture Processor Family Overview 1-23 Hardware prefetching for Pentium 4 processor has the following characteristics: • works with existing applications • does not require extensive study of prefetch instructions • requires regular access patterns • avoids instruction and issue port bandwidth overhead • has a start-u[...]

  • Page 52

    IA-32 Intel® Architecture Optimization 1-24 Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first to favor greater proportions of smaller-stride data accesses in the workload; before attempting to provide hints to the processor by employing software prefetch instructions. Loads and Stores The[...]
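
    To make "smaller-stride data accesses" concrete, the C sketch below sums the same matrix in two traversal orders. The matrix dimensions and function names are assumptions made for this example. Both functions return the same sum; they differ only in the distance between consecutive memory accesses. No timing is claimed here, since the effect on hardware prefetch depends on the processor and the workload.

```c
#include <stddef.h>

#define ROWS 64   /* assumed dimensions, for illustration only */
#define COLS 64

/* Row-by-row traversal of a matrix stored row-major in a flat array:
   consecutive accesses touch adjacent elements (stride of 1 element). */
long sum_row_major(const int *m)
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += m[r * COLS + c];
    return sum;
}

/* Column-by-column traversal of the same array: consecutive accesses
   are COLS elements apart, a much larger stride over the same data. */
long sum_col_major(const int *m)
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += m[r * COLS + c];
    return sum;
}
```

    Restructuring a loop nest from the second form to the first is one way to increase the proportion of small-stride accesses without adding any prefetch instructions.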

  • Page 53

    IA-32 Intel® Architecture Processor Family Overview 1-25 Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready. Writes t[...]

  • Page 54

    IA-32 Intel® Architecture Optimization 1-26 Intel® Pentium® M Processor Microarchitecture Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchitecture contains three sections: • in-order issue front end • out-of-order superscalar execution core • in-order retirement unit Intel Pentium M [...]

  • Page 55

    IA-32 Intel® Architecture Processor Family Overview 1-27 The Intel Pentium M processor microarchitecture is designed for lower power consumption. There are other specific areas of the Pentium M processor microarchitecture that differ from the Intel NetBurst microarchitecture. They are described next. A block diagram of the Intel Pentium M proce[...]

  • Page 56

    IA-32 Intel® Architecture Optimization 1-28 The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallelism. It also provides a 32KB instruction cache that stores un-decoded binary instructions. The instruction prefetcher fetches instructions in a linear fashion from memory if the target instruc[...]

  • Page 57

    IA-32 Intel® Architecture Processor Family Overview 1-29 • Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instruction can be fused into a single µops. The following categories of fused µops have been implemented in the Pentium M processor: - “Store address” and “store data” micro-ops are fus[...]

  • Page 58

    IA-32 Intel® Architecture Optimization 1-30 Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters. Out-of-Order Core The processor core dynamically executes µops independent of program order. The core is designed to facilitate parallel ex[...]

  • Page 59

    IA-32 Intel® Architecture Processor Family Overview 1-31 In-Order Retirement The retirement unit in the Pentium M processor buffers completed µops is the reorder buffer (ROB). The ROB updates the architectural state in order. Up to three µops may be retired per cycle. Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Pro[...]

  • Page 60

    IA-32 Intel® Architecture Optimization 1-32 • Power-optimized bus The system bus is optimized for power efficiency; increased bus speed supports 667 MHz. • Data Prefetch Intel Core Solo and Intel Core Duo processors implement improved hardware prefetch mechanisms: one mechanism can look ahead and prefetch data into L1 from L2. These proces[...]

  • Page 61

    IA-32 Intel® Architecture Processor Family Overview 1-33 Data Prefetching Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to prefetch data from memory to the second-level cache. There are two techniques: one mechanism activates after the data access pattern experiences two cache-reference misses within a trigger-dista[...]

  • Page 62

    IA-32 Intel® Architecture Optimization 1-34 The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system a[...]

  • Page 63

    IA-32 Intel® Architecture Processor Family Overview 1-35 In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor. This minimizes the die area cost of implementing HT Technology while still achieving performance gains for multithreaded app[...]

  • Page 64

    IA-32 Intel® Architecture Optimization 1-36 Processor Resources and Hyper-Threading Technology The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor. This section describes how resources are shared, partitio[...]

  • Page 65

    IA-32 Intel® Architecture Processor Family Overview 1-37 For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress. In general, the buffers for stagi[...]

  • Page 66

    IA-32 Intel® Architecture Optimization 1-38 Microarchitecture Pipeline and Hyper-Threading Technology This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline. Although instructions originating from two programs or tw[...]

  • Page 67

    IA-32 Intel® Architecture Processor Family Overview 1-39 Execution Core The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors. The execution core and memory hierarchy is[...]

  • Page 68

    IA-32 Intel® Architecture Optimization 1-40 Pentium Processor Extreme Edition provide four logical processors in a physical package that has two execution cores. Each core provides two logical processors sharing an execution core and a cache hierarchy. The Intel Core Duo processor provides two logical processors in a physical package. Each [...]

  • Page 69

    IA-32 Intel® Architecture Processor Family Overview 1-41 Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor (block diagram; labels: System Bus, Architectural State, Execution Engine, Local APIC, Bus Interface) Pentium D Processo[...]

  • Page 70

    IA-32 Intel® Architecture Optimization 1-42 Microarchitecture Pipeline and Multi-Core Processors In general, each core in a multi-core processor resembles a single-core processor implementation of the underlying microarchitecture. The implementation of the cache hierarchy in a dual-core or multi-core processor may be the same or different fr[...]

  • Page 71

    IA-32 Intel® Architecture Processor Family Overview 1-43 that the cache line that contains the memory location is owned by the first-level data cache of the initiating core (that is, the line is in exclusive or modified state). Then the processor looks for the cache line in the cache and memory sub-systems. The look-ups for the locality of load[...]

  • Page 72

    IA-32 Intel® Architecture Optimization 1-44 when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses. For store operation, reading for own[...]

  • Page 73

    2-1 2 General Optimization Guidelines This chapter discusses general optimization techniques that can improve the performance of applications running on the Intel Pentium 4, Intel Xeon, Pentium M processors, as well as on dual-core processors. These techniques take advantage of the microarchitectural features of the generation of IA-32 proces[...]

  • Page 74

    The following sections describe practices, tools, coding rules and recommendations associated with these factors that will aid in optimizing the performance on IA-32 processors. Tuning to Prevent Known Coding Pitfalls To produce program code that takes advantage of the Intel NetBurst microarchitectu[...]

  • Page 75

    General Optimization Guidelines 2 2-3 * Streaming SIMD Extensions (SSE) ** Streaming SIMD Extensions 2 (SSE2) General Practices and Coding Guidelines This section discusses guidelines derived from the performance factors listed in the “Tuning to Achieve Optimum Performance” section. It also highlights practices that use performance tools. T[...]

  • Page 76

    Use Available Performance Tools • Current-generation compiler, such as the Intel C++ Compiler: — Set this compiler to produce code for the target processor implementation — Use the compiler switches for optimization and/or profile-guided optimization. These features are summarized in the ?[...]

  • Page 77

    Optimize Branch Predictability • Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken. • Avoid mixing near calls, far calls and returns. • Avoid implementing a call by pus[...]

  • Page 78

    • Minimize use of global variables and pointers. • Use the const modifier; use the static modifier for global variables. • Use new cacheability instructions and memory-ordering behavior. Optimize Floating-point Performance • Avoid exceeding representable ranges during computation, since hand[...]

  • Page 79

    • Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies). • Use the lea instruction and the full range of addressing modes to do address calculation. • Some types of stores use more µops than others; try to [...]

  • Page 80

    • Avoid the use of conditionals. • Keep induction (loop) variable expressions simple. • Avoid using pointers; try to replace pointers with arrays and indices. Coding Rules, Suggestions and Tuning Hints This chapter includes rules, suggestions and hints. They are maintained in separately-num[...]

  • Page 81

    Performance Tools Intel offers several tools that can facilitate optimizing your application’s performance. Intel® C++ Compiler Use the Intel C++ Compiler following the recommendations described here. The Intel Compiler’s advanced optimization features provide good performance without the need to ha[...]

  • Page 82

    General Compiler Recommendations A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general case. However, if particular performance problems are noted with the compiled code, some compilers (like the Intel C++ and Fo[...]

  • Page 83

    The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload characteristics, including: • retirement throughput of instruction execution as an indication of the degree of extractable instruction-level parallelism in the workload, • data traffic locality as[...]

  • Page 84

    Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. The practice has no real impact on processors based on the Intel NetBurst microarchitecture. • Dependencies for partial register writes incur large penalties when using the Pentium M p[...]

  • Page 85

    • On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache. On Pentium M processors, the code size limit is governed by the instruction cache. • There may be a penalty when instructions with immediates requiring more than 16-bit signed representation ar[...]

  • Page 86

    Transparent Cache-Parameter Strategy If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, this function leaf will report detailed cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across current and fut[...]
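
The cache parameters reported by leaf 4 can be turned into a cache size with a few shifts and masks. The following is a minimal sketch, assuming the documented leaf 4 register layout (ways, partitions and line size in EBX, number of sets in ECX, each stored as the value minus one). The names cache_params, decode_leaf4 and cache_size_bytes are hypothetical, and executing CPUID itself is compiler-specific and not shown.

```c
#include <stdint.h>

/* Hypothetical container for the fields decoded from CPUID leaf 4. */
typedef struct {
    uint32_t ways;        /* EBX[31:22] + 1 */
    uint32_t partitions;  /* EBX[21:12] + 1 */
    uint32_t line_size;   /* EBX[11:0]  + 1, in bytes */
    uint32_t sets;        /* ECX        + 1 */
} cache_params;

/* Decode the EBX and ECX values returned for one cache level.
 * Every field is encoded as "value minus one". */
cache_params decode_leaf4(uint32_t ebx, uint32_t ecx)
{
    cache_params p;
    p.ways       = ((ebx >> 22) & 0x3FFu) + 1u;
    p.partitions = ((ebx >> 12) & 0x3FFu) + 1u;
    p.line_size  = (ebx & 0xFFFu) + 1u;
    p.sets       = ecx + 1u;
    return p;
}

/* Cache size = ways * partitions * line size * sets. */
uint64_t cache_size_bytes(cache_params p)
{
    return (uint64_t)p.ways * p.partitions * p.line_size * p.sets;
}
```

Deriving sizes this way, rather than hard-coding them per processor model, is what keeps the code forward-compatible.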

  • Page 87

    Branch Prediction Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly. Optimizations that help branch prediction are: • Keep code and data on separate pages (a ver[...]

  • Page 88

    Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches. For the Pentium M processor, every branch counts; even correctly predicted branches have a negative effect on the amount of useful code delivered to the pr[...]

  • Page 89

    See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B, ebx is set to one. Then ebx is decreased and “and-ed” with the difference of the constant values. This sets ebx to either zero or the difference of the values. By adding CONST2 back to ebx, t[...]
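
The sequence described above (set on condition, decrement, mask, add) can be written out in C to check the arithmetic. This is a sketch of the same idea, not the manual's Example 2-2 listing; the function name select_lt and its parameters are hypothetical, and unsigned arithmetic is used so that the wrap-around is well defined.

```c
#include <stdint.h>

/* Branch-free equivalent of: (a < b) ? c1 : c2 */
int32_t select_lt(int32_t a, int32_t b, int32_t c1, int32_t c2)
{
    uint32_t mask = (uint32_t)(a >= b);          /* 1 if a >= b, else 0 (the setge step) */
    mask -= 1u;                                  /* 0 if a >= b, else 0xFFFFFFFF */
    uint32_t diff = (uint32_t)c1 - (uint32_t)c2; /* difference of the constants */
    /* mask & diff is either 0 or (c1 - c2); adding c2 yields c2 or c1. */
    return (int32_t)((mask & diff) + (uint32_t)c2);
}
```

Whether this beats a well-predicted branch depends on how predictable the condition is; the transformation pays off mainly when the branch is hard to predict.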

  • Page 90

    The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction. Spin-Wait and Idle Loops The Pentium[...]

  • Page 91

    Static Prediction Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm. The Pentium 4, Pentium M, Intel Core Solo and Intel Core Duo processors have similar static prediction algorithms: • Predict unconditional branches to[...]

  • Page 92

    Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a co[...]

  • Page 93

    Examples 2-6 and 2-7 provide basic rules for a static prediction algorithm. In Example 2-6, the backward branch (JC Begin) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction wi[...]

  • Page 94

    Inlining, Calls and Returns The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more [...]

  • Page 95

    Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache. Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession; con[...]

  • Page 96

    Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, it looks like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery. Also, the data immediately following indirect branches[...]

  • Page 97

    indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history. The purpose of this rule is to reduce the total number of misp[...]
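
At source level, the “peeling” idea can be sketched by testing for the most common target before falling back to the indirect call. This is an illustrative C sketch only, not the manual's Example 2-9; handle_common, handle_rare and dispatch are hypothetical names, and which target is “common” would have to come from profiling.

```c
typedef int (*handler_fn)(int);

int handle_common(int x) { return x + 1; }  /* hypothetical frequent target */
int handle_rare(int x)   { return x * 2; }  /* hypothetical infrequent target */

int dispatch(handler_fn fn, int x)
{
    /* Peeled case: a conditional branch guards a direct call, so the
     * common target no longer goes through the indirect branch. */
    if (fn == handle_common)
        return handle_common(x);

    /* All remaining targets still use the indirect call. */
    return fn(x);
}
```

The result is the same as calling fn(x) directly; only the mix of branch types the predictor sees changes.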

  • Page 98

    best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9. Loop Unrolling The benefits of unrolling loops are: • Unrolling amortizes the branch overhead, since it eliminates branches and [...]

  • Page 99

    • The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known,[...]

  • Page 100

    In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop you can make both assignments each iteration, removing one branch in the loop body. Compiler Support for Branch Prediction Compilers can generate code[...]
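
The unrolling described above can be sketched in C as follows. The array and function names (fill_rolled, fill_unrolled) are hypothetical and the listing is a reconstruction of the idea, not the manual's own example.

```c
#define N 100

/* Original form: one conditional branch inside every iteration. */
void fill_rolled(int a[N], int x, int y)
{
    for (int i = 0; i < N; i++) {
        if (i % 2 == 0)
            a[i] = x;   /* even-numbered element */
        else
            a[i] = y;   /* odd-numbered element */
    }
}

/* Unrolled by two: both assignments happen in each iteration, so the
 * branch in the loop body disappears and the trip count is halved. */
void fill_unrolled(int a[N], int x, int y)
{
    for (int i = 0; i < N; i += 2) {
        a[i]     = x;
        a[i + 1] = y;
    }
}
```

The unrolled version relies on N being even; an odd trip count would need a short remainder step after the loop.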

  • Page 101

    Memory Accesses This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are: • align data, paying attention to data layout and stack alignment • enable store forwarding • place code and data on separate pages • enhance data locality • use pr[...]

  • Page 102

    Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. For best performance, align data as follows: • Align 8-bit data at any ad[...]
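
In C source, the 16-byte alignment asked for by this rule can be requested explicitly. The sketch below assumes a C11 compiler (alignas from <stdalign.h>); older compilers need their own alignment extensions. The names vec_data and is_aligned are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

/* Data intended for 128-bit vector loads and stores: request 16-byte alignment. */
alignas(16) float vec_data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

/* Returns nonzero if p lies on a multiple of boundary (in bytes). */
int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}
```

A check such as is_aligned(ptr, 16) is a cheap debug-build guard before handing a buffer to code that uses aligned vector accesses.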

  • Page 103

    Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not executing out of the trace cache. Alignment of code can be an issue for the Pentium M processor, and alignment of branch targets wi[...]

  • Page 104

    Store Forwarding The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency. There are two kinds of requirements for[...]

  • Page 105

    If a variable is known not to change between when it is stored and when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load. A[...]

  • Page 106

    The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help programmers satisfy size and alignment restrictions for store forwarding follow. Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the[...]

  • Page 107

    A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait. Assembly/Compiler Coding Rule 20. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the [...]

  • Page 108

    Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler-generated code. Sometimes a compiler generates code similar to that shown in Example 2-14 to handle a byte spilled to the stack and convert the byte to an integer value. Example 2-15 offers two alternatives to avoid[...]

  • Page 109

    When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the movement of 64 bits at a time, floating point instructions should not be used fo[...]

  • Page 110

    Store-forwarding Restriction on Data Availability The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available. This delay causes some execution resources to be used unnec[...]

  • Page 111

    An example of a loop-carried dependence chain is shown in Example 2-17. Data Layout Optimizations User/Source Coding Rule 2. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary. If the operands are pack[...]
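
The effect of padding and member order can be inspected with offsetof and sizeof. The sketch below uses two hypothetical structures; C compilers normally insert the padding needed for natural alignment themselves, and ordering members from largest to smallest (a common practice, shown here as an assumption rather than as the manual's wording) usually reduces how much padding is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Members in arbitrary order: the compiler pads before b and d so that
 * each one still starts on its natural boundary. */
struct unsorted {
    uint8_t  a;
    uint32_t b;
    uint8_t  c;
    uint16_t d;
    uint8_t  e;
};

/* Same members, largest first: every member is naturally aligned with
 * little or no internal padding. */
struct sorted {
    uint32_t b;
    uint16_t d;
    uint8_t  a;
    uint8_t  c;
    uint8_t  e;
};
```

On typical IA-32 ABIs the first layout occupies 16 bytes and the second 12, but the exact sizes are ABI-dependent and worth confirming with sizeof on the target compiler.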

  • Page 112

    Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia). These reference and use data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less efficient utilization of system memory b[...]

  • Page 113

    However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, then the Pentium 4 processor prefetches data from struct_of_array, even if the elements of the structure are accessed together. When the elements of the structure are not accessed with equal fr[...]
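
The two layouts being compared can be sketched as follows. Only the name struct_of_array comes from the text above; array_of_struct, the field names, NUM_ELEMS and the two sum functions are hypothetical and chosen for illustration.

```c
#define NUM_ELEMS 8

/* Array of structures: the fields of one element are adjacent in memory. */
struct elem { int a, b, c, d; };
struct elem array_of_struct[NUM_ELEMS];

/* Structure of arrays: all values of one field are contiguous. */
struct {
    int a[NUM_ELEMS];
    int b[NUM_ELEMS];
    int c[NUM_ELEMS];
    int d[NUM_ELEMS];
} struct_of_array;

/* Sweeping field a touches every fourth int in the AoS layout... */
long sum_a_aos(void)
{
    long sum = 0;
    for (int i = 0; i < NUM_ELEMS; i++)
        sum += array_of_struct[i].a;
    return sum;
}

/* ...but consecutive ints in the SoA layout, so each fetched cache line
 * is fully used when only this field is needed. */
long sum_a_soa(void)
{
    long sum = 0;
    for (int i = 0; i < NUM_ELEMS; i++)
        sum += struct_of_array.a[i];
    return sum;
}
```

Which layout is preferable depends on whether the fields of an element are usually accessed together or one field at a time.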

  • Page 114

    non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize up to eight concurrent streams. See Chapter 6 for more information on the hardware prefetcher. Memory coherence is maintained on 64-byte cache lines on the Pentium 4, Intel Xeon and Pent[...]

  • Page 115

    If for some reason it is not possible to align the stack for 64 bits, the routine should access the parameter and save it into a register or known aligned storage, thus incurring the penalty only once. Capacity Limits and Aliasing in Caches There are cases where addresses with a given stride will compete for [...]

  • Page 116

    Capacity Limits in Set-Associative Caches Capacity limits may occur if the number of outstanding memory references that are mapped to the same set in each way of a given cache exceeds the number of ways of that cache. The conditions that apply to the first-level data cache and second level cache are[...]
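
Whether two addresses compete for the same set follows from the cache geometry. The sketch below assumes a conventional set-associative cache indexed by address, with a power-of-two line size and set count; cache_set_index and same_set are hypothetical helper names, and the geometry must be supplied by the caller.

```c
#include <stdint.h>

/* Index of the set that addr maps to, for a cache with the given line
 * size (bytes) and number of sets. */
uint32_t cache_set_index(uintptr_t addr, uint32_t line_size, uint32_t num_sets)
{
    return (uint32_t)((addr / line_size) % num_sets);
}

/* Nonzero if both addresses map to the same set and therefore compete
 * for the same group of ways. */
int same_set(uintptr_t a, uintptr_t b, uint32_t line_size, uint32_t num_sets)
{
    return cache_set_index(a, line_size, num_sets) ==
           cache_set_index(b, line_size, num_sets);
}
```

For instance, with 64-byte lines and 32 sets (an illustrative geometry), addresses that are a multiple of 32 × 64 = 2048 bytes apart land in the same set, so more such references than the cache has ways will evict one another.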

  • Page 117

    Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are: • 16K for code – there can only be one of these in the trace cache at a time. If two traces whose starting addresses are 16K apart are in the s[...]

  • Page 118

    Aliasing Cases in the Pentium M Processor Pentium M, Intel Core Solo and Intel Core Duo processors have the following aliasing case: • Store forwarding - If there has been a store to an address followed by a load to the same address within a short time window, the load will not proceed until[...]

  • Page 119

    Mixing Code and Data The Pentium 4 processor’s aggressive prefetching and pre-decoding of instructions has two related effects: • Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty. Avoid self-modifying code. [...]

  • Page 120

    and cross-modifying code (when more than one processor in a multi-processor system is writing to a code page) should be avoided when high performance is desired. Software should avoid writing to a code page in the same 1 KB subpage that is being executed or fetching code in the same 2 KB subpage o[...]

  • Page 121

    write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC. Assembly/Compiler Coding Rule 28. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct [...]
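
One way to keep an inner loop within four output streams is to split it into several loops (loop fission). The sketch below is an illustration under that assumption; the array shape and the names write_fused and write_split are hypothetical.

```c
#define LEN 64
#define STREAMS 6

/* Fused form: every iteration writes to six distinct arrays, more than
 * the four write-combining buffers guaranteed to be available. */
void write_fused(int out[STREAMS][LEN])
{
    for (int i = 0; i < LEN; i++) {
        out[0][i] = i;
        out[1][i] = i + 1;
        out[2][i] = i + 2;
        out[3][i] = i + 3;
        out[4][i] = i + 4;
        out[5][i] = i + 5;
    }
}

/* After fission: two loops, each writing to three arrays per iteration. */
void write_split(int out[STREAMS][LEN])
{
    for (int i = 0; i < LEN; i++) {
        out[0][i] = i;
        out[1][i] = i + 1;
        out[2][i] = i + 2;
    }
    for (int i = 0; i < LEN; i++) {
        out[3][i] = i + 3;
        out[4][i] = i + 4;
        out[5][i] = i + 5;
    }
}
```

Both versions produce identical data; the split form trades a second pass over the index range for fewer simultaneous store streams.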

  • Page 122

    be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the Intel Architecture Software Developer’s Manual. Locality Enhancement Locality enhancement can reduce data traffic originating from an outer-level sub-system in the cache/memory hiera[...]

  • Page 123

    Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching. This can also take several forms: • Transformation of a sparsely populated multi-dimensional array into a one-dimension array such that memory references occur[...]

  • Page 124

    Minimizing Bus Latency The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec of throughput at a 200 MHz scalable bus clock rate. (See MSR_EBC_FREQUENCY_ID register.) The peak bus bandwidth is even higher with higher bus clock rates. Each bus transaction includes th[...]

  • Page 125

    User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the hardwar[...]

  • Page 126

    Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions Example 2-22 Non-temporal Stores and Partial Bus Write Transactions #define STRIDESIZE 256 Lea ecx, p64byte_Aligned Mov edx, ARRAY_LEN Xor eax, eax slloop: movntps XMMWORD ptr [ecx + eax], xmm0 movntps XMMWORD ptr [ecx + eax+16], x[...]

  • Page 127

    Prefetching The Pentium 4 processor has three prefetching mechanisms: • hardware instruction prefetcher • software prefetch for data • hardware prefetch for cache lines of data or instructions. Hardware Instruction Fetching The hardware instruction fetcher reads instructions, 32 bytes at a time, into[...]

  • Page 128

    access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions. The hardware prefetcher is best for small-stride data access patterns in either direction with cache-miss stride not far from 64 bytes. This [...]

  • Page 129

    • new cache line flush instruction • new memory fencing instructions For a detailed description of using cacheability instructions, see Chapter 6. Code Alignment Because the trace cache (TC) removes the decoding stage from the pipeline for frequently executed code, optimizing code alignment for decoding [...]

  • Page 130

    Guidelines for Optimizing Floating-point Code User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches. Follow this procedure to investigate the performance of your floating-point application: • Understand h[...]

  • Page 131

    to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See “Floating-point Modes”. User/Source Coding Rule 13. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or [...]

  • Page 132

    desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions and the Streaming SIMD Extensions 2 instructions. Floating-point Modes and Exceptions When working with floating-point numbers, high-speed microprocessors frequen[...]

  • Page 133

    executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions. Dealing with floating-point exceptions in x87 FPU code E[...]

  • Page 134

    Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE 754 compliant modes are provided to speed situations where underflows and denormalized inputs are frequent: FTZ mod[...]
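
Both modes are controlled by bits in the MXCSR register: flush-to-zero (FTZ) is bit 15 and denormals-are-zero (DAZ) is bit 6. The sketch below only computes the new register value; the macro and function names are hypothetical, and reading or writing MXCSR itself (for example with the _mm_getcsr and _mm_setcsr intrinsics from <xmmintrin.h>) is left out to keep the example portable.

```c
#include <stdint.h>

#define MXCSR_DAZ (1u << 6)    /* denormals-are-zero: treat denormal inputs as zero */
#define MXCSR_FTZ (1u << 15)   /* flush-to-zero: return zero on underflow */

/* Returns the MXCSR value with both non-IEEE modes enabled, leaving all
 * other control and status bits untouched. */
uint32_t mxcsr_enable_ftz_daz(uint32_t mxcsr)
{
    return mxcsr | MXCSR_FTZ | MXCSR_DAZ;
}
```

Starting from the power-on default of 0x1F80, the result is 0x9FC0. DAZ is not implemented on every SSE-capable processor, so code that sets it should first confirm support.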

  • Page 135

    FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, FLDCW is improved over previous generations. Specifically, the optimization for FLDCW allows programmers to alternate between two constant values efficiently. For the FLDCW op[...]

  • Page 136

    Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word. Changes for more than two values (each value being a combination of the following bits: precision, rounding and infinity control, and the rest of bits in FCW) lead to [...]

  • Page 137

    If there is more than one change to rounding, precision and infinity bits and the rounding mode is not important to the result, use the algorithm in Example 2-23 to avoid synchronization issues, the overhead of the fldcw instruction and having to change the rounding mode. The provided example suffers from a[...]

  • Page 138

    Example 2-23 Algorithm to Avoid Changing the Rounding Mode _fto132proc lea ecx,[esp-8] sub esp,16 ; allocate frame and ecx,-8 ; align pointer on boundary of 8 fld st(0) ; duplicate FPU stack top fistp qword ptr[ecx] fild qword ptr[ecx] mov edx,[ecx+4]; high dword of integer mov eax,[ecx] ; low dword [...]

  • Page 139

    Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits. Prec[...]

  • Page 140

    Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode. Improving Parallelism and the Use of FXCH The x87 instruction set relies on the floating point stack for one of its operands. If the dependence graph is a tree, which means each intermedi[...]

  • Page 141

    This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order execution precludes the need for using fxch to move instructions for very short distances. x87 vs. Scalar SIMD Floating-point Trade-offs There are a number of differences between x87 floati[...]

  • Page 142

    • Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions. On the Pentium 4 processor, the floating-point register stack may be used simultaneously with XMM registers. The same hardware is used for both kinds of instructions, but the added name space [...]

  • Page 143

    Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code. When working with scalar SSE/SSE2 code, pay attention to the need for clearing the content of unused slots in an xmm register and the associated performance impact. For example, loading data from memory with mov[...]

  • Page 144

    Floating-Point Stalls Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of Pentium II and the subsequent processors, stalls will not necessarily occur on an instruction or µop basis. However, if an instruction has a very long latency such as [...]

  • Page 145

    Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2. Instruction Selection This section explains how to generate optimal assembly code. The listed optimizations have been shown to contribute to the overall performance [...]

  • Page 146

    Complex Instructions Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead. Complex instructions m[...]

  • Page 147

    Use of the inc and dec Instructions The inc and dec instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an a[...]

  • Page 148

    CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Table 2-2. Integer Divide Typically, an integer divide is preceded by a cwd or cdq instruction. Depending on the operand size, divide i[...]

  • Page 149

    (model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. However, this does mean that there may be false dependencies between any references to partial registers. Example 2-24 demonstrates a series of false and real dependencies caused by referencing [...]

  • Page 150

    Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register. Assembly/Compiler Coding Rule 44. (ML impact, L generality) Use simple instructions that are less than eight bytes in length. Assembly/Compiler Coding Rule 45. (M impact, MH general[...]

  • Page 151

    less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional µop is a critical problem, movsx can sometimes be used as an alternative. Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the fol[...]

  • Page 152

    Prefixes and Instruction Decoding An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the length of an instruction that the decoder must recognize. In some situations, using a length-changing prefix (LCP) causes extra delay in decoding the instruction. The prefixes that change[...]

  • Page 153

    • Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte of the instruction happens to be aligned on byte 14 of an instruction fetch line. The performance delay in this case is approximately twice that of the other two situations. Assembly/Compiler[...]

  • Page 154

    String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less or[...]
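
The decomposition of a byte count into doubleword operations plus a short byte tail can be sketched in C. This illustrates the counting only and does not replace a tuned library routine or the rep stosd/stosb sequences themselves; fill_bytes is a hypothetical name, and memcpy is used for the 4-byte stores so that unaligned destinations remain well defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n bytes at dst with value, using 4-byte stores for the bulk and
 * single-byte stores for the remainder. */
void fill_bytes(void *dst, uint8_t value, size_t n)
{
    uint8_t *p = dst;
    uint32_t pattern = value * 0x01010101u;  /* value replicated into 4 bytes */
    size_t dwords = n / 4;                   /* doubleword-granularity part */
    size_t tail   = n % 4;                   /* leftover bytes, at most 3 */

    for (size_t i = 0; i < dwords; i++) {
        memcpy(p, &pattern, 4);
        p += 4;
    }
    for (size_t i = 0; i < tail; i++)
        *p++ = value;
}
```

For a count of 11, for example, this performs two doubleword stores followed by three byte stores.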

  • Page 155

    • Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: movntq, movntdq) can minimize the effect of flushing the cache. The threshold to s[...]

  • Page 156

    improve address alignment, a small piece of prolog code using movsb/stosb with count less than 4 can be used to peel off the non-aligned data moves before starting to use movsd/stosd. • For cases where N is less than half the size of last level cache, throughput consideration may favor either: (a[...]

  • Page 157

    Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignment, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel Compilers. Table 2-5 Using REP STOSD with Arbit[...]

  • Page 158

    In some situations, the byte count of the data to operate on is known by the context (versus from a parameter passed from a call). One can take a simpler approach than those required for a general-purpose library routine. For example, if the byte count is also small, using rep movsb/stosb with count less[...]

  • Page 159

    Clearing Registers The Pentium 4 processor provides special support for xor, sub, or pxor operations when executed within the same register. This recognizes that clearing a register does not depend on the old value of the register. The xorps and xorpd instructions do not have this special support. They cannot [...]

  • Page 160

    IA-32 Intel® Architecture Optimization 2-88 Using a test instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register can also help prevent a partial flag register stall. Assembly/Compiler Coding Rule 52. (ML impact, M generality) Use the test instruction instead of and when the resul[...]

  • Page 161

    General Optimization Guidelines 2 2-89 Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops for movapd use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the depe[...]

  • Page 162

    IA-32 Intel® Architecture Optimization 2-90 Prolog Sequences Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP. This optimization does not apply in the following cases: a routine is cal[...]

  • Page 163

    General Optimization Guidelines 2 2-91 Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing more difficult. On the Pentium 4 processor, the sequence of loading a value from memory into a register and adding the results in a register to memory is faster than the alternate seque[...]

  • Page 164

    IA-32 Intel® Architecture Optimization 2-92 Spill Scheduling The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor memory subsystem. A spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too many live values to fit in registers. Consider the code in[...]

  • Page 165

    General Optimization Guidelines 2 2-93 Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required. Scheduling Rules for the Pentium M Processor Decoder The Pentium M processor has three decoders, but the decoding rules to supply micro-ops at high bandwidth are less stringent than those of the Pen[...]

  • Page 166

    IA-32 Intel® Architecture Optimization 2-94 Data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating-point data elements in Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-[...]

  • Page 167

    General Optimization Guidelines 2 2-95 User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches. User/Source Coding Rule 20. (M impact, ML generality) Keep induction (loop) variable expressions simple. Miscellaneous This section explain[...]

  • Page 168

    IA-32 Intel® Architecture Optimization 2-96 The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opp[...]

  • Page 169

    General Optimization Guidelines 2 2-97 User/Source Coding Rules User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch into a tree where one or more in[...]

  • Page 170

    IA-32 Intel® Architecture Optimization 2-98 User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access patterns that result in higher concentrations of cache miss patterns with cache miss strides that are significantly smaller than half of the h[...]

  • Page 171

    General Optimization Guidelines 2 2-99 look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions and the [...]

  • Page 172

    IA-32 Intel® Architecture Optimization 2-100 order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch. 2-16 Assembly/Compiler Coding R[...]

  • Page 173

    General Optimization Guidelines 2 2-101 Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. 2-22 Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk. 2-22 Assembly/Compiler Coding Rule 12. (M impact, MH generality[...]

  • Page 174

    IA-32 Intel® Architecture Optimization 2-102 Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. 2-34 Assembly/Compiler Coding Rule 19. (H impact, M generality) The data of a load which is forwarded from a[...]

  • Page 175

    General Optimization Guidelines 2 2-103 first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set. Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KB. 2-46 Assembly/Compiler Coding Rule 26. (M [...]

  • Page 176

    IA-32 Intel® Architecture Optimization 2-104 Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infini[...]

  • Page 177

    General Optimization Guidelines 2 2-105 Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags. 2-73 Assembly/Co[...]

  • Page 178

    IA-32 Intel® Architecture Optimization 2-106 instead of a cmp of the register to zero, this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register. 2-79 Assembly/Compiler Coding Rule 51. (ML impact, [...]

  • Page 179

    General Optimization Guidelines 2 2-107 Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and the destination operand is in a register, attempt a strategy that initially loads the memory operand to a register followed by a register to register ALU o[...]

  • Page 180

    IA-32 Intel® Architecture Optimization 2-108 Tuning Suggestions Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions. The only condition where this is likely to happen is following an indirect branch that is not resident in the trace cache. If a performance problem is cle[...]

  • Page 181

    3-1 3 Coding for SIMD Architectures Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology. In addition, Streaming SIMD Extensions 3 (SSE3) were introduced with the Pentium 4 processor supporting Hyper-Threading Technology at[...]

  • Page 182

    IA-32 Intel® Architecture Optimization 3-2 Checking for Processor Support of SIMD Technologies This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3. SIMD technology can be included in your application in three ways: 1. Check for the SIMD technology during installation. If the desired SIMD[...]

  • Page 183

    Coding for SIMD Architectures 3 3-3 For more information on cpuid, see Intel® Processor Identification with CPUID Instruction, order number 241618. Checking for Streaming SIMD Extensions Support Checking for support of Streaming SIMD Extensions (SSE) on your processor is like checking for MMX technology. However, you must also check w[...]
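
The processor-side check can be sketched as a decode of the feature flags that cpuid returns for function 1. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the EDX and ECX values are assumed to have been read already (the cpuid call itself is compiler-specific and omitted), and the bit positions are the architecturally documented ones for MMX, SSE, SSE2, and SSE3.

```c
#include <stdint.h>

/* Feature-flag bits reported by cpuid with EAX = 1. */
#define CPUID_EDX_MMX  (1u << 23)
#define CPUID_EDX_SSE  (1u << 25)
#define CPUID_EDX_SSE2 (1u << 26)
#define CPUID_ECX_SSE3 (1u << 0)

/* Hypothetical helpers: decode register values already obtained from cpuid. */
static int cpu_has_mmx(uint32_t edx)  { return (edx & CPUID_EDX_MMX)  != 0; }
static int cpu_has_sse(uint32_t edx)  { return (edx & CPUID_EDX_SSE)  != 0; }
static int cpu_has_sse2(uint32_t edx) { return (edx & CPUID_EDX_SSE2) != 0; }
static int cpu_has_sse3(uint32_t ecx) { return (ecx & CPUID_ECX_SSE3) != 0; }
```

A set bit only shows that the processor implements the extension; as the text notes, operating system support has to be checked separately.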

  • Page 184

    IA-32 Intel® Architecture Optimization 3-4 To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if one occurs. Catching the exception in a simple try/except clause (using structured exception handling in C++) and checking whether the exception code is an invalid opcode will give you the an[...]

  • Page 185

    Coding for SIMD Architectures 3 3-5 Checking for Streaming SIMD Extensions 2 Support Checking for support of SSE2 is like checking for SSE support. You must also check whether your operating system (OS) supports SSE. The OS requirements for SSE2 support are the same as the requirements for SSE. To check whether your system supports SSE2, fo[...]

  • Page 186

    IA-32 Intel® Architecture Optimization 3-6 Checking for Streaming SIMD Extensions 3 Support SSE3 includes 13 instructions, 11 of which are suited for SIMD or x87 style programming. Checking for support of these SSE3 instructions is similar to checking for SSE support. You must also check whether your operating system (OS) supports SSE. The [...]

  • Page 187

    Coding for SIMD Architectures 3 3-7 Example 3-6 Identification of SSE3 with cpuid SSE3 requires the same support from the operating system as SSE. To find out whether the operating system supports SSE3 (FISTTP and 10 of the SIMD instructions in SSE3), execute an SSE3 instruction and trap for an exception if one occurs. Catching the excepti[...]

  • Page 188

    IA-32 Intel® Architecture Optimization 3-8 Example 3-7 Identification of SSE3 by the OS Considerations for Code Conversion to SIMD Programming The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning. But before implementing them, you need answers to the following questions: 1. Will the current cod[...]

  • Page 189

    Coding for SIMD Architectures 3 3-9 Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart; recoverable node labels: Identify hot spots in code; Code benefits from SIMD; Integer or floating-point?; Why FP? Range or precision; Can convert to integer?; If possible, re-arrange data for SIMD efficiency; STOP) [...]

  • Page 190

    IA-32 Intel® Architecture Optimization 3-10 To use any of the SIMD technologies optimally, you must evaluate the following situations in your code: • fragments that are computationally intensive • fragments that are executed often enough to have an impact on performance • fragments with little data-dependent control flow • fragmen[...]

  • Page 191

    Coding for SIMD Architectures 3 3-11 specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics and functions in the Intel® Performance Library Suite. Because VTune analyzer is designed specifically for all of the Intel architecture (IA)-based processors, including the Pentium 4[...]

  • Page 192

    IA-32 Intel® Architecture Optimization 3-12 costly application processing time. However, these routines have potential for increased performance when you convert them to use one of the SIMD technologies. Once you identify your opportunities for using a SIMD technology, you must evaluate what should be done to determine whether the current al[...]

  • Page 193

    Coding for SIMD Architectures 3 3-13 Coding Methodologies Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements. Programming directly in assembly language for a target platform may produce the required performance gain; however, assembly code is not portab[...]

  • Page 194

    IA-32 Intel® Architecture Optimization 3-14 The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSE2, SSE, and MMX technology. As a basis for the usage mo[...]

  • Page 195

    Coding for SIMD Architectures 3 3-15 Assembly Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler or assembler recognizes the new instructions and registers, then directly generates the corresponding code. This model offers the opportunity for attaining gr[...]

  • Page 196

    IA-32 Intel® Architecture Optimization 3-16 SIMD Extensions 2 integer SIMD and __m128d is used for double precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly, while allowing the compiler to perform register allocation and instruction scheduling where possible. These intrin[...]

  • Page 197

    Coding for SIMD Architectures 3 3-17 The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage restrictions: • Use intrinsic data types only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (for example[...]

  • Page 198

    IA-32 Intel® Architecture Optimization 3-18 Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=” operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer. Note how mu[...]

  • Page 199

    Coding for SIMD Architectures 3 3-19 The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction with the compiler is needed to fully enable this. Example 3-12 shows the code for automatic vectorization for the simple four-iteration loop (from Example 3-8). Compile this code using t[...]

  • Page 200

    IA-32 Intel® Architecture Optimization 3-20 Stack and Data Alignment To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is a lot slower than code with aligned accesses. Alignment and Contiguity of Data Ac[...]

  • Page 201

    Coding for SIMD Architectures 3 3-21 By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), all following elements will also be aligned. The sample declaration follows: typedef struct { short x,y,z; char a; char pad; } Point; Point pt[N]; Using Arrays to Make Data Contiguous[...]
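
The effect of the padding byte can be checked directly. The sketch below repeats the sample declaration and adds two hypothetical helpers, `point_size` and `point_stride`, that report the structure size and the distance between consecutive array elements. It assumes a platform where short is 2 bytes; the value chosen for N is illustrative only.

```c
#include <stddef.h>

#define N 16  /* illustrative array length, not taken from the manual */

typedef struct {
    short x, y, z;
    char  a;
    char  pad;   /* padding byte that rounds the structure up to 8 bytes */
} Point;

Point pt[N];

/* Size of one element in bytes. */
static size_t point_size(void)
{
    return sizeof(Point);
}

/* Distance in bytes between consecutive elements of pt. */
static size_t point_stride(void)
{
    return (size_t)((char *)&pt[1] - (char *)&pt[0]);
}
```

With a stride of 8 bytes, every element starts on an 8-byte boundary whenever pt[0] does.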

  • Page 202

    IA-32 Intel® Architecture Optimization 3-22 Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned. For an example of how to avoid the misalignment problem[...]

  • Page 203

    Coding for SIMD Architectures 3 3-23 • Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame. • The __m128* parameters need to be aligned to 16-byte boundaries, possibly creating “holes” (due to padding) in the argument block. These new conventions presented in th[...]

  • Page 204

    IA-32 Intel® Architecture Optimization 3-24 Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When the data is accessed frequently, this can provide a significant performance improvement. Data Alignment for 128-bit data Data must be 16-byte aligned when loading to or storing from t[...]

  • Page 205

    Coding for SIMD Architectures 3 3-25 The __declspec(align(16)) specifications can be placed before data declarations to force 16-byte alignment. This is particularly useful for local or global data declarations that are assigned to 128-bit data types. The syntax for it is __declspec(align(integer-constant)) where the integer-constant is an in[...]
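
For readers without a compiler that accepts __declspec(align(16)), the same request can be written with the standard C11 `_Alignas` specifier. This is a swapped-in, portable counterpart and not the syntax the manual describes; `coeffs` and `is_aligned16` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Standard C11 request for 16-byte alignment of a 128-bit data block. */
static _Alignas(16) float coeffs[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

/* Returns 1 when p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check such as `is_aligned16(coeffs)` can guard code paths that rely on aligned 128-bit loads and stores.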

  • Page 206

    IA-32 Intel® Architecture Optimization 3-26 In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows: struct __declspec(align(16)) my_m128 { float f[4]; }; But, if the data in such a class is going to be used with the Streaming SIMD Extensions or Streaming SIMD Extensions [...]

  • Page 207

    Coding for SIMD Architectures 3 3-27 Improving Memory Utilization Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and MMX technology intrinsics. The methods for improving memory performance involve working with the following: • Data structure layout • Strip-mining for vectorization and memory utilizat[...]

  • Page 208

    IA-32 Intel® Architecture Optimization 3-28 There are two options for computing data in AoS format: perform operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically. See Example 3-16 for code samples of each option based on a dot-product computation. Example 3-15 SoA Data Structure typedef s[...]
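
The two layouts and the swizzle between them can be sketched in scalar C. This is not the manual's Example 3-15 or 3-16: the type names, the element count NUM, and both functions are hypothetical, and the dot product is taken against a single fixed vector purely for illustration.

```c
#include <stddef.h>

#define NUM 8  /* illustrative element count */

/* Array of structures (AoS): the x, y, z of one vertex are adjacent. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of arrays (SoA): all x values are adjacent, then all y, then all z. */
typedef struct { float x[NUM], y[NUM], z[NUM]; } VerticesSoA;

/* Swizzle: rearrange AoS data into SoA form. */
static void aos_to_soa(const VertexAoS in[NUM], VerticesSoA *out)
{
    for (size_t i = 0; i < NUM; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Dot product of every vertex with (fx, fy, fz). Each array is read with
   unit stride, which is the access pattern SIMD loads favor. */
static void dot_soa(const VerticesSoA *v, float fx, float fy, float fz,
                    float out[NUM])
{
    for (size_t i = 0; i < NUM; i++)
        out[i] = v->x[i] * fx + v->y[i] * fy + v->z[i] * fz;
}
```

In the SoA loop, four consecutive iterations touch four consecutive floats in each array, so they map naturally onto one 128-bit operation per array.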

  • Page 209

    Coding for SIMD Architectures 3 3-29 Performing SIMD operations on the original AoS format can require more calculations and some of the operations do not take advantage of all of the SIMD elements available. Therefore, this option is generally less efficient. The recommended way for computing data in AoS format is to swizzle each set of eleme[...]

  • Page 210

    IA-32 Intel® Architecture Optimization 3-30 but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best as there is no runtime overhead. As mentioned earlier, the SoA arrangement allows more efficient use of the paralleli[...]

  • Page 211

    Coding for SIMD Architectures 3 3-31 Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays x, y, and z in Example 3-15 would require three separate data streams. This can require the use of more prefetches, additional address generation calculations, as well as having a gr[...]

  • Page 212

    IA-32 Intel® Architecture Optimization 3-32 Strip Mining Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of the generation of code when each vector operati[...]

  • Page 213

    Coding for SIMD Architectures 3 3-33 The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of array v[Num] is larger than the cache, then the coordinates for v[i] that were ca[...]

  • Page 214

    IA-32 Intel® Architecture Optimization 3-34 In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy. By doing this, a given element v[i] brought into the cache by Transform(v[i]) will still be in the cache when we perfo[...]
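
A strip-mined loop of the kind described here can be sketched as follows. This is not the manual's Example 3-19: the `Vertex` type and the bodies of `transform` and `lighting` are placeholders invented for illustration, and only the loop structure is the point.

```c
#include <stddef.h>

/* Placeholder vertex type and per-element routines (hypothetical). */
typedef struct { float x, y, z, lit; } Vertex;

static void transform(Vertex *v) { v->x *= 2.0f; v->y *= 2.0f; v->z *= 2.0f; }
static void lighting(Vertex *v)  { v->lit = v->x + v->y + v->z; }

/* Process the array one strip at a time so that the elements written by
   transform are still cached when lighting reads them.
   strip_size must be greater than zero. */
static void process_strip_mined(Vertex *v, size_t num, size_t strip_size)
{
    for (size_t i = 0; i < num; i += strip_size) {
        size_t end = (i + strip_size < num) ? i + strip_size : num;
        for (size_t j = i; j < end; j++)
            transform(&v[j]);
        for (size_t j = i; j < end; j++)
            lighting(&v[j]);
    }
}
```

The result is the same as running transform over the whole array and then lighting over the whole array; only the order of memory accesses changes.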

  • Page 215

    Coding for SIMD Architectures 3 3-35 For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array A, that is, A[2, 0:MAX-1], is large enough, by the time the second iteration starts, each access to array B will always generate a cache miss. For instance, on the first iteration, [...]

  • Page 216

    IA-32 Intel® Architecture Optimization 3-36 This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_size is selected as the loop blocking factor. Suppose that block_size is 8, then the blocked chunk of each array will be eight cache lines (32 bytes each). In the first iteration of the inner lo[...]
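
Loop blocking can be sketched on a loop that walks A by rows and B by columns, which is the access pattern the surrounding pages discuss. The array size MAX, the blocking factor BLOCK_SIZE, and the function name are assumptions chosen for illustration; the sketch requires BLOCK_SIZE to divide MAX.

```c
#include <stddef.h>

#define MAX 16         /* illustrative array dimension */
#define BLOCK_SIZE 8   /* blocking factor; must divide MAX in this sketch */

/* Blocked form of: A[i][j] = A[i][j] + B[j][i] for all i, j.
   The two outer loops select a BLOCK_SIZE x BLOCK_SIZE tile; the two
   inner loops finish all work on that tile before moving on, so the
   lines of B fetched for the tile are reused while still cached. */
static void add_transposed_blocked(float A[MAX][MAX], float B[MAX][MAX])
{
    for (size_t ii = 0; ii < MAX; ii += BLOCK_SIZE)
        for (size_t jj = 0; jj < MAX; jj += BLOCK_SIZE)
            for (size_t i = ii; i < ii + BLOCK_SIZE; i++)
                for (size_t j = jj; j < jj + BLOCK_SIZE; j++)
                    A[i][j] = A[i][j] + B[j][i];
}
```

Every (i, j) pair is still visited exactly once, so the result matches the unblocked loop.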

  • Page 217

    Coding for SIMD Architectures 3 3-37 As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses. In addition to improving the cache/memory performance, this optimization technique als[...]

  • Page 218

    IA-32 Intel® Architecture Optimization 3-38 Note that this can be applied to both SIMD integer and SIMD floating-point code. If there are multiple consumers of an instance of a register, group the consumers together as closely as possible. However, the consumers should not be scheduled near the producer. SIMD Optimizations and Microarchite[...]

  • Page 219

    Coding for SIMD Architectures 3 3-39 Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instructions consisting of two micro-ops over those with more than two micro-ops. Tuning the Final Application The best way to tune your application once it is functioning correctly is to use a profile[...]

  • Page 220

    IA-32 Intel® Architecture Optimization 3-40[...]

  • Page 221

    4-1 4 Optimizing for SIMD Integer Applications The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture of Pentium 4, Intel Xeon, and Pentium M processors. The guidelines for using these instructions in addition to the guidelines described in Chapte[...]

  • Page 222

    IA-32 Intel® Architecture Optimization 4-2 For planning considerations of using the new SIMD integer instructions, refer to “Checking for Streaming SIMD Extensions 2 Support” in Chapter 3. General Rules on SIMD Integer Code The overall rules and suggestions are as follows: • Do not intermix 64-bit SIMD integer instructions with x87 floati[...]

  • Page 223

    Optimizing for SIMD Integer Applications 4 4-3 Using SIMD Integer with x87 Floating-point All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply. Instructions which use the MMX registers cannot be freely intermixed w[...]

  • Page 224

    IA-32 Intel® Architecture Optimization 4-4 Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations. The emms instruction ensures a clean transition between using operations on the MMX registers and using operations on the x87 floating-point stack. On t[...]

  • Page 225

    Optimizing for SIMD Integer Applications 4 4-5 • Don’t empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost with no benefit. • Group Instructions: Try to partition regions that use x87 FP instructions from those that use 64-bit SIMD integer instructions. This eliminates needing an emms instru[...]

  • Page 226

    IA-32 Intel® Architecture Optimization 4-6 Data Alignment Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines. Referencing unaligned 128-bit SIMD integer data will [...]

  • Page 227

    Optimizing for SIMD Integer Applications 4 4-7 Signed Unpack Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown above except that the psrad instruction (packed shift right arithmetic) is used to effectively sign extend the values. Example 4-3 assumes the source is a packed-word (16-bit) d[...]

  • Page 228

    IA-32 Intel® Architecture Optimization 4-8 Interleaved Pack with Saturation The pack instructions pack two values into the destination register in a predetermined order. Specifically, the packssdw instruction packs two signed doublewords from the source operand and two signed doublewords from the destination operand into four signed words in [...]

  • Page 229

    Optimizing for SIMD Integer Applications 4 4-9 Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation. The two signed doublewords are used as source operands and the result is interleaved signed words. The pack instructions can be performed with or without saturation as need[...]

  • Page 230

    IA-32 Intel® Architecture Optimization 4-10 The pack instructions always assume that the source operands are signed numbers. The result in the destination register is always defined by the pack instruction that performs the operation. For example, the packssdw instruction packs each of the two signed 32-bit values of the two sources into fou[...]
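
The signed saturation performed by packssdw can be modeled in scalar C. This is a behavioral sketch of the 64-bit form with hypothetical function names, not code from the manual: the two doublewords of the destination supply the low two result words and the two doublewords of the source supply the high two.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value into the signed 16-bit range. */
static int16_t saturate_s32_to_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of the 64-bit packssdw: out[0..1] come from dst,
   out[2..3] come from src, each with signed saturation. */
static void packssdw_model(const int32_t dst[2], const int32_t src[2],
                           int16_t out[4])
{
    out[0] = saturate_s32_to_s16(dst[0]);
    out[1] = saturate_s32_to_s16(dst[1]);
    out[2] = saturate_s32_to_s16(src[0]);
    out[3] = saturate_s32_to_s16(src[1]);
}
```

Values already inside the 16-bit range pass through unchanged; out-of-range values stick at the nearest limit instead of wrapping.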

  • Page 231

    Optimizing for SIMD Integer Applications 4 4-11 Non-Interleaved Unpack The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register. The following example merges the two operands into the destination registers without interleaving. For example, take two adjacent ele[...]

  • Page 232

    IA-32 Intel® Architecture Optimization 4-12 The other destination register will contain the opposite combination illustrated in Figure 4-4. Code in Example 4-6 unpacks two packed-word sources in a non-interleaved way. The goal is to use the instruction which unpacks doublewords to a quadword, instead of using the instruction which unpacks [...]

  • Page 233

    Optimizing for SIMD Integer Applications 4 4-13 Extract Word The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register; see Figure 4-5 and Example 4-7. Example 4-6 Unpacking Two Packed-word Sources in a Non-[...]

  • Page 234

    IA-32 Intel® Architecture Optimization 4-14 Insert Word The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in the MMX technology destination register at a position defined by the two least significant bits of the immediate constant. Insertion is done in such a way that the three o[...]

  • Page 235

    Optimizing for SIMD Integer Applications 4 4-15 If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register. See the “Clearing Registers” section in Chapter 2. Figure 4-6 pinsrw Inst[...]

  • Page 236

    IA-32 Intel® Architecture Optimization 4-16 Move Byte Mask to Integer The pmovmskb instruction returns a bit mask formed from the most significant bits of each byte of its source operand. When used with the 64-bit MMX registers, this produces an 8-bit mask, zeroing out the upper 24 bits in the destination register. When used with the 128-bit X[...]
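
The mask construction for the 64-bit case can be modeled in scalar C. The function name `pmovmskb_model` is hypothetical; the sketch takes the 64-bit source as a `uint64_t` with byte 0 in the least significant position.

```c
#include <stdint.h>

/* Scalar model of pmovmskb on a 64-bit source: bit i of the result is
   the most significant bit of byte i; bits 31..8 of the result are zero. */
static uint32_t pmovmskb_model(uint64_t src)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t msb = (uint32_t)((src >> (8 * i + 7)) & 1u);
        mask |= msb << i;
    }
    return mask;
}
```

A typical use is to follow a packed compare with this operation, turning eight per-byte results into one integer that ordinary branch or bit-scan code can test.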

  • Page 237

    Optimizing for SIMD Integer Applications 4 4-17 Figure 4-7 pmovmskb Instruction Example (diagram of the 64-bit MM source register and the 32-bit R32 destination register) Example 4-10 pmovmskb Instruction Code ; Input: ; source value ; Output: ; 32-bit register containing the byte mask in the lower ; eight bits ; movq mm0, [edi] pmovmskb eax, mm0[...]

  • Page 238

    IA-32 Intel® Architecture Optimization 4-18 Packed Shuffle Word for 64-bit Registers The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register and a 64-bit memory location. Bits 1 and 0 of the immediate value encode the source for[...]
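
The selection rule can be modeled in scalar C: each two-bit field of the immediate picks the source word for one destination word. `pshufw_model` is a hypothetical name, and the sketch assumes dst and src do not overlap.

```c
#include <stdint.h>

/* Scalar model of the 64-bit word shuffle: bits 2i+1..2i of imm8 select
   which of the four source words is copied to destination word i. */
static void pshufw_model(const uint16_t src[4], uint8_t imm8, uint16_t dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm8 >> (2 * i)) & 3];
}
```

An immediate of 0x1B (binary 00 01 10 11) reverses the four words, and 0x00 broadcasts word 0 to every position.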

  • Page 239

    Optimizing for SIMD Integer Applications 4 4-19 Packed Shuffle Word for 128-bit Registers The pshuflw/pshufhw instruction performs a full shuffle of any source word field within the low/high 64 bits to any result word field in the low/high 64 bits, using an 8-bit immediate operand; the other high/low 64 bits are passed through from the sour[...]

  • Page 240

    IA-32 Intel® Architecture Optimization 4-20 Unpacking/interleaving 64-bit Data in 128-bit Registers The punpcklqdq/punpckhqdq instructions interleave the low/high-order 64-bits of the source operand and the low/high-order 64-bits of the destination operand and write them to the destination register. The high/low-order 64-bits of the sourc[...]

  • Page 241

    Optimizing for SIMD Integer Applications 4 4-21 Data Movement There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers. The movq2dq instruction moves the 64-bit integer data from an MMX register (source) to a 128-bit destination register. The high-order 64 bits of the desti[...]

  • Page 242

    IA-32 Intel® Architecture Optimization 4-22 pxor MM0, MM0 pcmpeq MM1, MM1 psubb MM0, MM1 [psubw MM0, MM1] (psubd MM0, MM1) ; three instructions above generate ; the constant 1 in every ; packed-byte [or packed-word] ; (or packed-dword) field pcmpeq MM1, MM1 psrlw MM1, 16-n (psrld MM1, 32-n) ; two instructions above generate ; the signed constan[...]

  • Page 243

    Optimizing for SIMD Integer Applications 4 4-23 Building Blocks This section describes instructions and algorithms which implement common code building blocks efficiently. Absolute Difference of Unsigned Numbers Example 4-16 computes the absolute difference of two unsigned numbers. It assumes an unsigned packed-byte data type. Here, we make us[...]
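
One common way to obtain an unsigned absolute difference without a compare is to subtract with unsigned saturation in both directions and OR the results, since at most one of the two differences is nonzero. The scalar sketch below models that idea for a single byte; it is an illustration of the technique with hypothetical names, not a transcription of Example 4-16.

```c
#include <stdint.h>

/* Unsigned saturating subtract: a - b, clamped at 0. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b| for unsigned bytes: one of the two saturated differences is
   zero, so OR-ing them yields the absolute difference. */
static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```

Applied to packed data, the same three operations handle every byte lane at once.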

  • Page 244

    IA-32 Intel® Architecture Optimization 4-24 Absolute Difference of Signed Numbers Example 4-17 computes the absolute difference of two signed numbers. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maximum values, and packed words of the minimum values. Then the minimum values are[...]

  • Page 245

    Optimizing for SIMD Integer Applications 4 4-25 Absolute Value Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be the operands. movq MM2, MM0 ; make a copy of source1 (A) pcmpgtw MM0, MM1 ; create mask of ; source1>source2 (A>B) movq MM4, MM2 ; make another copy of A pxor MM2, MM1 ; create the int[...]

  • Page 246

    IA-32 Intel® Architecture Optimization 4-26 Clipping to an Arbitrary Range [high, low] This section explains how to clip a value to a range [high, low]. Specifically, if the value is less than low or greater than high, then clip to low or high, respectively. This technique uses the packed-add and packed-subtract instructions with saturat[...]

  • Page 247

    Highly Efficient Clipping
    For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used. For clipping unsigned bytes to an arbitrary range, the pmaxub and pminub instructions may be used. Example 4-19 shows how to clip signed words to an arbitrary range; the code for[...]
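
A minimal sketch of clipping signed words with pmaxsw and pminsw, written with SSE2 intrinsics on eight words at a time. The function name and parameters are hypothetical, and the code assumes low <= high.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clip 8 signed words to the range low..high (assumes low <= high). */
void clip_8xi16(const int16_t *in, int16_t *out, int16_t low, int16_t high)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i lo = _mm_set1_epi16(low);   /* broadcast lower bound */
    __m128i hi = _mm_set1_epi16(high);  /* broadcast upper bound */
    v = _mm_max_epi16(v, lo);           /* pmaxsw: raise values below low */
    v = _mm_min_epi16(v, hi);           /* pminsw: lower values above high */
    _mm_storeu_si128((__m128i *)out, v);
}
```

For unsigned bytes the same shape would apply with `_mm_max_epu8` and `_mm_min_epu8`.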

  • Page 248

    The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data and places the data within the signed range. Conversion to unsigned data is required for correct results when (high - low) < 0x8000. If [...]

  • Page 249

    packed-subtract instructions with unsigned saturation, thus this technique can only be used on packed-byte and packed-word data types. The example illustrates the operation on word values.
    Packed Max/Min of Signed Word and Unsigned Byte
    Signed Word
    The pmaxsw instruction returns the maximum bet[...]

  • Page 250

    Unsigned Byte
    The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location. The pminub instruction returns the minimum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory[...]

  • Page 251

    The subtraction operation presented above is an absolute difference, that is, t = abs(x-y). The byte values are stored in temporary space, all values are summed together, and the result is written into the lower word of the destination register.
    Packed Average (Byte/Word)
    The pavgb and pavgw in[...]
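
The two operations described here can be modelled in plain scalar C, which may help when checking SIMD results. This is a reference sketch only: the function names are hypothetical, the sum-of-absolute-differences model covers one 64-bit (eight-byte) operand pair, and the average model follows the rounded form (x + y + 1) >> 1 that pavgb computes per byte.

```c
#include <stdint.h>

/* Scalar model of the sum of absolute differences over 8 unsigned bytes:
   t = abs(x - y) per byte, then all eight values are summed into a word. */
uint16_t sad_8xu8(const uint8_t x[8], const uint8_t y[8])
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int d = (int)x[i] - (int)y[i];
        sum = (uint16_t)(sum + (d < 0 ? -d : d));
    }
    return sum;
}

/* Scalar model of one pavgb lane: average of two unsigned bytes,
   rounded up, computed without overflow in a wider type. */
uint8_t avg_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(((unsigned)x + (unsigned)y + 1u) >> 1);
}
```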

  • Page 252

    The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words.
    Complex Multiply by a Constant
    Complex multiplication is an operation which requires four multiplications and two additions. This is exactly how the pmaddwd instruction operates. In [...]
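
To make the "four multiplications and two additions" concrete, here is a hedged sketch using the SSE2 intrinsic for pmaddwd. The operand layout (data duplicated as dr, di, dr, di and coefficients arranged as cr, -ci, ci, cr) and the function name are assumptions chosen for illustration; they may differ from the layout in the original example.

```c
#include <emmintrin.h>  /* SSE2: _mm_madd_epi16 is the intrinsic form of pmaddwd */
#include <stdint.h>

/* (dr + i*di) * (cr + i*ci) using a single pmaddwd.
   Results are 32-bit: *pr = dr*cr - di*ci, *pi = dr*ci + di*cr.
   Assumes ci != -32768 so that -ci is representable. */
void cmul_by_const(int16_t dr, int16_t di, int16_t cr, int16_t ci,
                   int32_t *pr, int32_t *pi)
{
    /* _mm_set_epi16 lists lanes from highest to lowest, so in lane order:
       data = [dr,  di, dr, di, 0, 0, 0, 0]
       coef = [cr, -ci, ci, cr, 0, 0, 0, 0] */
    __m128i data = _mm_set_epi16(0, 0, 0, 0, di, dr, di, dr);
    __m128i coef = _mm_set_epi16(0, 0, 0, 0, cr, ci, (short)-ci, cr);
    __m128i prod = _mm_madd_epi16(data, coef);  /* pairwise multiply-add */
    *pr = _mm_cvtsi128_si32(prod);                     /* low doubleword */
    *pi = _mm_cvtsi128_si32(_mm_srli_si128(prod, 4));  /* next doubleword */
}
```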

  • Page 253

    Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
    Packed 32*32 Multiply
    The PMULUDQ instruction performs an unsigned multiply on the lower pair of double-word operands within each 64-bit[...]
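
A small sketch of the 128-bit form of PMULUDQ through its SSE2 intrinsic. The function name is hypothetical; the point is only that each 32-bit operand sits in the low doubleword of a 64-bit lane and the full 64-bit product is produced.

```c
#include <emmintrin.h>  /* SSE2: _mm_mul_epu32 is the intrinsic form of pmuludq */
#include <stdint.h>

/* Two unsigned 32x32 -> 64-bit multiplies in one instruction. */
void mul_2xu32(const uint32_t a[2], const uint32_t b[2], uint64_t out[2])
{
    /* Lane order (low to high): [a0, 0, a1, 0]; only lanes 0 and 2 are used. */
    __m128i va = _mm_set_epi32(0, (int)a[1], 0, (int)a[0]);
    __m128i vb = _mm_set_epi32(0, (int)b[1], 0, (int)b[0]);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}
```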

  • Page 254

    Memory Optimizations
    You can improve memory accesses using the following techniques:
    • Avoiding partial memory accesses
    • Increasing the bandwidth of memory fills and video fills
    • Prefetching data with Streaming SIMD Extensions (see Chapter 6, “Optimizing Cache Usage”).
    The MMX registers a[...]

  • Page 255

    Partial Memory Accesses
    Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall in this case as shown in Example 4-24. The movq must wait for the stores to write memory before it can access all the data[...]

  • Page 256

    Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem) as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.[...]

  • Page 257

    These transformations, in general, increase the number of instructions required to perform the desired operation. For Pentium II, Pentium III, and Pentium 4 processors, the benefit of avoiding forwarding problems outweighs the performance penalty due to the increased number of instructions, making t[...]

  • Page 258

    SSE3 provides an instruction LDDQU for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load i[...]

  • Page 259

    Increasing Bandwidth of Memory Fills and Video Fills
    It is beneficial to understand how memory is accessed and filled. A memory-to-memory fill (for example a memory-to-video fill) is defined as a 64-byte (cache line) load from memory which is immediately stored back to memory (such as a video frame bu[...]

  • Page 260

    same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (access to the same page as [...]

  • Page 261

    aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions. The general guidelines on the alignment of memory operands are:
    - The greatest performance gains can be achieved when all memory streams are 16-byte aligned.
    - Reasonable performance gains are p[...]

  • Page 262

    Packed SSE2 Integer versus MMX Instructions
    In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions on Intel Core Solo and Intel Core Duo processors. This is because:
    • Improved decoder bandwidth and more efficient uop flows relative to the Pentium M processor [...]

  • Page 263

    5 Optimizing for SIMD Floating-point Applications
    This chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). This chapter also provides examples that illu[...]

  • Page 264

    • Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations.
    • Use the reciprocal instructions followed by iteration for increased accuracy. These instructions yield reduced accuracy but execute much faster. Note the following: ?[...]

  • Page 265

    • Is the data arranged for efficient utilization of the SIMD floating-point registers?
    • Is this application targeted for processors without SIMD floating-point instructions?
    For more details, see the section on “Considerations for Code Conversion to SIMD Programming” in Chapter 3. U[...]

  • Page 266

    When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, all of the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 2 and Chapter 3 should be observed.
    Data Alignment
    SIM[...]

  • Page 267

    For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS). To fully utilize the SIMD registers in such applications, a n[...]

  • Page 268

    simultaneously referred to as an xyz data representation, see the diagram below) are computed in parallel, and the array is updated one vertex at a time. When data structures are organized for the horizontal computation model, sometimes the availability of homogeneous arithmetic operations in SSE and SS[...]

  • Page 269

    To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data shown in Table 5-1. Organizin[...]

  • Page 270

    Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions.
    Figure 5-2 Dot Product Operation
    Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation
    mulps ; x*x', y*y', z*z'
    mova[...]

  • Page 271

    Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions. For the most efficient use of the four component-wide registers, reorganizing the data into the SoA format yields increased throughput and hence much better perform[...]
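
The "4 results for 5 instructions" count can be seen in a short SSE intrinsics sketch: three multiplies and two adds produce four dot products at once when the data is in SoA form. The `VerticesSoA` type and the function name are hypothetical, introduced only for this illustration; loads, stores and the broadcasts of the fixed vector are not counted among the five arithmetic operations.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical SoA block: the x, y and z components of four vertices. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} VerticesSoA;

/* Dot product of four vertices with one fixed vector f = (fx, fy, fz).
   out[k] = x[k]*fx + y[k]*fy + z[k]*fz for k = 0..3. */
void dot4_soa(const VerticesSoA *v, const float f[3], float out[4])
{
    __m128 x = _mm_mul_ps(_mm_loadu_ps(v->x), _mm_set1_ps(f[0]));  /* mulps */
    __m128 y = _mm_mul_ps(_mm_loadu_ps(v->y), _mm_set1_ps(f[1]));  /* mulps */
    __m128 z = _mm_mul_ps(_mm_loadu_ps(v->z), _mm_set1_ps(f[2]));  /* mulps */
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(x, y), z));           /* 2 x addps */
}
```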

  • Page 272

    To gather data from 4 different memory locations on the fly, follow these steps:
    1. Identify the first half of the 128-bit memory location.
    2. Group the different halves together using the movlps and movhps to form an xyxy layout in two registers.
    3. From the 4 attached halves, get the xxxx by using one [...]

  • Page 273

    y1 x1
    movhps xmm7, [ecx+16] // xmm7 = y2 x2 y1 x1
    movlps xmm0, [ecx+32] // xmm0 = -- -- y3 x3
    movhps xmm0, [ecx+48] // xmm0 = y4 x4 y3 x3
    movaps xmm6, xmm7 // xmm6 = y2 x2 y1 x1 (copy of xmm7)
    shufps xmm7, xmm0, 0x88 // xmm7 = x1 x2 x3 x4 => X
    shufps xmm6, xmm0, 0xDD // xmm6 = y1 y2 y3 y4 => Y
    movlps x[...]

  • Page 274

    Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE.
    Example 5-4 Swizzling Data Using Intrinsics
    //Intrinsics version of data swizzle
    void swizzle_intrin (Vertex_aos *in, Vertex_soa *out, int stride)
    {
    __m128 x, y, z, w;
    __m128 tmp;
    x = _m[...]

  • Page 275

    Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the instruction that generates xmm0 has completed. In the worst case, this c[...]

  • Page 276

    Data Deswizzling
    In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xxxx, yyyy, zzzz are rearranged and stored in memory as xyz. To do this we can use the unpcklps / unpckhps instructions to regenerate the xyxy layout and then store each half (xy) into its c[...]
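
The first stage of deswizzling, regenerating the xyxy layout from xxxx and yyyy, can be sketched with the SSE intrinsics for unpcklps and unpckhps. The function name and the flat output array are assumptions for illustration; a full deswizzle would then store each xy half into its own vertex and handle z separately.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Interleave four x values and four y values into x0 y0 x1 y1 x2 y2 x3 y3. */
void interleave_xy(const float x[4], const float y[4], float out[8])
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out,     _mm_unpacklo_ps(vx, vy));  /* unpcklps: x0 y0 x1 y1 */
    _mm_storeu_ps(out + 4, _mm_unpackhi_ps(vx, vy));  /* unpckhps: x2 y2 x3 y3 */
}
```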

  • Page 277

    You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts. In lighting, for example, data comes as rrrr gggg bbbb aaaa, and you must deswizzle them into rgba before converting into integers.[...]

  • Page 278

    // Start deswizzling here
    movaps xmm7, xmm4 // xmm7= a1 a2 a3 a4
    movhlps xmm7, xmm3 // xmm7= b3 b4 a3 a4
    movaps xmm6, xmm2 // xmm6= g1 g2 g3 g4
    movlhps xmm3, xmm4 // xmm3= b1 b2 a1 a2
    movhlps xmm2, xmm1 // xmm2= r3 r4 g3 g4
    movlhps xmm1, xmm6 // xmm1= r1 r2 g1 g2
    movaps xmm6, xmm2 // xmm6= r3 r4 g3 g4
    [...]

  • Page 279

    Using MMX Technology Code for Copy or Shuffling Functions
    If there are some parts in the code that are mainly copying, shuffling, or doing logical manipulations that do not require use of SSE code, consider performing these actions with MMX technology code. For example, if texture dat[...]

  • Page 280

    Example 5-8 illustrates how to use MMX technology code for copying or shuffling.
    Horizontal ADD Using SSE
    Although vertical computations use the SIMD performance better than horizontal computations do, in some cases, the code must use a horizontal operation. The movlhps / movhlps and shuffle can be [...]

  • Page 281

    Figure 5-3 Horizontal Add Using movhlps/movlhps
    Example 5-9 Horizontal Add Using movhlps/movlhps
    void horiz_add(Vertex_soa *in, float *out) {
    __asm {
    mov ecx, in // load structure addresses
    mov edx, out
    movaps xmm0, [ecx] // load A1 A2 A3 A4 => xmm0
    movaps xmm1, [ecx+16] // load B1 B2 B3 [...]

  • Page 282

    // START HORIZONTAL ADD
    movaps xmm5, xmm0 // xmm5= A1,A2,A3,A4
    movlhps xmm5, xmm1 // xmm5= A1,A2,B1,B2
    movhlps xmm1, xmm0 // xmm1= A3,A4,B3,B4
    addps xmm5, xmm1 // xmm5= A1+A3,A2+A4,B1+B3,B2+B4
    movaps xmm4, xmm2
    movlhps xmm2, xmm3 // xmm2= C1,C2,D1,D2
    movhlps xmm3, xmm4 // xmm3= C3,C4,D3,D4
    addps xmm3, [...]

  • Page 283

    Use of cvttps2pi/cvttss2si Instructions
    The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding[...]
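
The difference between the truncating conversion and the MXCSR-controlled conversion can be sketched with the scalar SSE intrinsics. The function names are hypothetical, and the second function's result assumes MXCSR is left at its default round-to-nearest setting.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* cvttss2si: always truncates toward zero, regardless of MXCSR. */
int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* cvtss2si: rounds according to the rounding mode currently in MXCSR
   (round-to-nearest-even by default). */
int round_to_int_mxcsr(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}
```

With this split, code that needs C-style truncation can call the first form and leave the MXCSR rounding control untouched.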

  • Page 284

    avoided since there is a penalty associated with writing this register; typically, through the use of the cvttps2pi and cvttss2si instructions, the rounding control in MXCSR can always be set to round-nearest.
    Flush-to-Zero and Denormals-are-Zero Modes
    The flush-to-zero (FTZ) and denormals-are-z[...]
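
As a hedged sketch of how an application might switch on flush-to-zero, the code below uses the FTZ macros from `xmmintrin.h`. The helper names are hypothetical. Only FTZ is shown; the denormals-are-zero mode is controlled by a separate MXCSR bit that is not touched here. Because FTZ changes numerical results for very small values, saving and restoring the previous mode is a cautious default.

```c
#include <xmmintrin.h>  /* _MM_GET_FLUSH_ZERO_MODE, _MM_SET_FLUSH_ZERO_MODE */

/* Turn on flush-to-zero and return the previous mode so it can be restored. */
unsigned int enable_ftz(void)
{
    unsigned int old_mode = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return old_mode;
}

/* Restore a mode previously returned by enable_ftz. */
void restore_ftz(unsigned int old_mode)
{
    _MM_SET_FLUSH_ZERO_MODE(old_mode);
}
```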

  • Page 285

    SSE3 and Complex Arithmetics
    The flexibility of SSE3 in dealing with AOS-type of data structure can be demonstrated by the example of multiplication and division of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary part. This nat[...]

  • Page 286

    instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers. In both of these examples, the complex numbers are stored in arrays of structures. The MOVSLDUP, MOVSHDUP and the asymmetric ADDSUBPS [...]

  • Page 287

    Example 5-12 Division of Two Pairs of Single-precision Complex Numbers
    // Division of (ak + i bk) / (ck + i dk)
    movshdup xmm0, Src1 ; load imaginary parts into the
                        ; destination, b1, b1, b0, b0
    movaps xmm1, src2 ; load the 2nd pair of complex values,
                      ; i.e. d1, c1, d0, c0
    mulps xmm0, xmm1 ; t[...]

  • Page 288

    SSE3 and Horizontal Computation
    Sometimes the AOS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model. SSE3 offers several instructions that are capable of horizontal ari[...]

  • Page 289

    SIMD Optimizations and Microarchitectures
    Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst® microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.[...]

  • Page 290

    When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives. On the other hand, tasks requiring double-precision complex arithmetics may perform better using scalar SSE2 instructions on[...]

  • Page 291

    6 Optimizing Cache Usage
    Over the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace. The resulting disparity has made it important to tune applications in one of two ways: either (a) a majority of the data accesses are fulfilled from processor caches, or (b) effectively masking[...]

  • Page 292

    • Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses techniques for implementing memory optimizations using the above instructions.
    • Using deterministic cache parameters to manage cache hierarchy.
    General Prefetch Coding Guidelines
    The foll[...]

  • Page 293

    • Facilitate compiler optimization:
      - Minimize use of global variables and pointers
      - Minimize use of complex control flow
      - Use the const modifier, avoid register modifier
      - Choose data types carefully (see below) and avoid type casting.
    • Use cache blocking techniques (for example, strip mining):
      - [...]

  • Page 294

    • Optimize software prefetch scheduling distance:
      - Far ahead enough to allow interim computation to overlap memory access time.
      - Near enough that the prefetched data is not replaced from the data cache.
    • Use software prefetch concatenation:
      - Arrange prefetches to avoid unnecessary prefet[...]

  • Page 295

    3. Follows only one stream per 4K page (load or store)
    4. Can prefetch up to 8 simultaneous independent streams from eight different 4K regions
    5. Does not prefetch across 4K boundary; note that this is independent of paging modes
    6. Fetches data into second/third-level cache
    7. Does not prefetch UC or WC memory type[...]

  • Page 296

    Data reference patterns can be classified as follows:
    Temporal: data will be used again soon
    Spatial: data will be used in adjacent locations, for example, same cache line
    Non-temporal: data which is referenced once and not reused in the immediate future; for example, some multimedia data types, such as t[...]

  • Page 297

    The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance. The prefetch instructions merely provide a hint to the hardware, and they will not generate exceptions or faults except for a few special cases (see the “Prefetch and Load Instructions?[...]

  • Page 298

    The Prefetch Instructions – Pentium 4 Processor Implementation
    Streaming SIMD Extensions include four flavors of prefetch instructions, one non-temporal, and three temporal. They correspond to two types of operations, temporal and non-temporal. The non-temporal instruction is
    prefetchnta: Fetch the d[...]
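
A minimal sketch of issuing the non-temporal prefetch from C through the `_mm_prefetch` intrinsic. The function name and the `PREFETCH_AHEAD` distance are assumptions for illustration; the distance is untuned and would need to be chosen per implementation, as the surrounding text stresses. The prefetch is only a hint, so the function returns the same sum with or without it.

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

#define PREFETCH_AHEAD 64  /* elements ahead; an assumed, untuned distance */

/* Sum an array that is read once, hinting upcoming data with prefetchnta. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per 16 floats (64 bytes) rather than per element. */
        if ((i % 16) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_NTA);
        sum += data[i];
    }
    return sum;
}
```

The temporal variants would use `_MM_HINT_T0`, `_MM_HINT_T1` or `_MM_HINT_T2` in place of `_MM_HINT_NTA`.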

  • Page 299

    Currently, the prefetch instruction provides a greater performance gain than preloading because it:
    • has no destination register, it only updates cache lines.
    • does not stall the normal instruction retirement.
    • does not affect the functional behavior of the program.
    • has no cache line split accesses.
    • [...]

  • Page 300

    The Non-temporal Store Instructions
    This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section. In Streaming SIMD Extensions, the movntps, movntpd, movntq, movntdq, movnti, maskmovq and maskmovdqu instructions are streaming, non-t[...]
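
One of the listed streaming stores, movntdq, is available as the SSE2 intrinsic `_mm_stream_si128`. The sketch below fills a buffer with it and then issues sfence so the weakly-ordered stores are made visible before later stores. The function name is hypothetical, and the code assumes a 16-byte-aligned destination and a length that is a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi8 */
#include <stddef.h>
#include <stdint.h>

/* Fill n_bytes at dst with value using non-temporal stores.
   Assumes dst is 16-byte aligned and n_bytes is a multiple of 16. */
void stream_fill(void *dst, uint8_t value, size_t n_bytes)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n_bytes / 16; i++)
        _mm_stream_si128(p + i, v);  /* movntdq */
    _mm_sfence();                    /* order the streaming stores */
}
```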

  • Page 301

    • Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.
    Streaming stores allow cross-aliasing of memory types for a given memory region. For instance, a region may be mapped as write-back (WB) via the page attribute tables (PAT) or memory type range register[...]

  • Page 302

    evicting data from all processor caches). The Pentium M processor implements a combination of both approaches. If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache. If the streaming store hits a line present in [...]

  • Page 303

    possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.
    Streaming Store Usage Models
    The two primary usage domains for streaming store are coherent requests and non-coherent requests.
    Coherent Requests
    Coherent requests are [...]

  • Page 304

    In case the region is not mapped as WC, the streaming store might update in-place in the cache and a subsequent sfence would not result in the data being written to system memory. Explicitly mapping the region as WC in this case ensures that any data read from this region will not be placed in the process[...]

  • Page 305

    The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) instructions store data from a register to the location specified by the edi register. The most significant bit in each byte of the second mask register is used to selectively write [...]

  • Page 306

    The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a result, the sfence instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data. The sfence instruction provides a perform[...]

  • Page 307

    The clflush Instruction
    The cache line associated with the linear address specified by the value of byte address is invalidated from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the coherence domain. If, at any level of the cache hierarchy, the line is [...]
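
A hedged sketch of flushing a memory range with the `_mm_clflush` intrinsic. The function name and the 64-byte `CACHE_LINE` constant are assumptions for illustration; real code would query the line size rather than hard-code it. The start address is rounded down to a line boundary so that a range beginning mid-line is still fully covered, and the data remains valid in memory after the flush.

```c
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Flush every cache line that overlaps the n bytes starting at p. */
void flush_range(const void *p, size_t n)
{
    uintptr_t start = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)p + n;
    for (uintptr_t addr = start; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);  /* clflush */
    _mm_mfence();  /* order the flushes with surrounding memory accesses */
}
```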

  • Page 308

    Memory Optimization Using Prefetch
    The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch.
    Software-controlled Prefetch
    The software-controlled prefetch is enabled using the four prefetch instructions introduced with Streaming SIM[...]

  • Page 309

    Hardware Prefetch
    The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. The automatic hardware prefetcher will attempt to prefetch two cache lines ahead of the prefetch stream. This feature is introduced with the Pentium 4 processor. The characteristics [...]

  • Page 310

    • May consume extra system bandwidth if the application’s memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic).
    • Effectiveness with existing applications depends on the proportions[...]

  • Page 311

    Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
    register char ** p;
    char *next;
    // Populating pArray for circular pointer
    // chasing with constant access stride
    // p = (char **)*p; loads a value pointing to next load
    p = (char **)&pArray;
    for (i = 0; i < aperture; i += stri[...]

  • Page 312

    Example of Latency Hiding with S/W Prefetch Instruction
    Achieving the highest level of memory optimization using prefetch instructions requires an understanding of the microarchitecture and system architecture of a given machine. This section translates the key architectural implications into severa[...]

  • Page 313

    execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.
    Figure 6-2 Memory Access Latency and Execution Without Prefetch
    Figure 6-3 Memory [...]

  • Page 314

    The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling the prefetch instructions. As shown in Figure 6-3, prefetch instructions are issued two vertex iterations ahead. This assumes that only one vertex gets processed in one it[...]

  • Page 315

    • Balance single-pass versus multi-pass execution
    • Resolve memory bank conflict issues
    • Resolve cache management issues
    The subsequent sections discuss all the above items.
    Software Prefetch Scheduling Distance
    Determining the ideal prefetch placement in the code depends on many architectural parameters, inclu[...]

  • Page 316

    lines of data per iteration. The PSD would need to be increased/decreased if more/fewer than two cache lines are used per iteration.
    Software Prefetch Concatenation
    Maximum performance can be achieved when the execution pipeline is at maximum throughput, without incurring any memory latency penalties. T[...]

  • Page 317

    This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized. For nested loops, memory [...]

  • Page 318

    Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch address for data used in the following iteration, the performance lo[...]

  • Page 319

    Minimize Number of Software Prefetches
    Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clocks and memory bandwidth. Excessive prefetching may lead to performance penalties because of issue penalties in the front-end of the machine and/or[...]

  • Page 320

    Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage[...]

  • Page 321

    Figure 6-5 Memory Access Latency and Execution With Prefetch (panel: 2 load streams, 1 store stream; X axis: computations per loop; Y axis: effective loop latency; secondary Y axis: % of bus utilized) [...]

  • Page 322

    Mix Software Prefetch with Computation Instructions
    It may seem convenient to cluster all of the prefetch instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation. In order to achieve best possible performance, prefetch instructions must be i[...]

  • Page 323

    Example 6-6 Spread Prefetch Instructions
    NOTE. To avoid instruction execution stalls due to the over-utilization of the resource, prefetch instructions must be interspersed with computational instructions.
    top_loop:
    prefetchnta [ebx+128]
    prefetchnta [ebx+1128]
    prefetchnta [ebx+2128]
    prefetchnta [ebx+31[...]

  • Page 324

    Software Prefetch and Cache Blocking Techniques
    Cache blocking techniques, such as strip-mining, are used to improve temporal locality, and thereby cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory. When two-dimensional arrays are used in programs, loop b[...]

  • Page 325

    In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation. In the temporally non-adjacent scenario, data used in pass m is displaced by pass (m+1), requiring data re-fetch into the first level cache and [...]

  • Page 326

    Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios. For Pentium 4 processors, the left scenario shows a graphical implementation of using prefetchnta to prefetch data into selected ways of the second-level cache only (SM1 denot[...]

  • Page 327

    In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use prefetcht0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in second-level cache, which reduces memor[...]

  • Page 328

    Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop. Now consider the c[...]

  • Page 329

    Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:
    • Do strip-mining: partition loops so that the dataset fits into second-level cache.
    • Use prefetchnta if the data is only used once or the dataset fits into 32K (one way of second-lev[...]
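
The strip-mining step can be sketched in plain C: instead of running each pass over the whole array, both passes run over one strip at a time so the second pass finds the strip still in cache. The function name, the two placeholder passes (a scale standing in for "transform" and an offset standing in for "lighting") and the `STRIP` size are all assumptions for illustration; the strip size would be chosen to fit the target cache.

```c
#include <stddef.h>

#define STRIP 1024  /* elements per strip; assumed to fit the target cache */

/* Two-pass computation, strip-mined so that both passes touch a strip
   before moving on to the next one. */
void two_pass_strip_mined(float *v, size_t n, float scale, float offset)
{
    for (size_t base = 0; base < n; base += STRIP) {
        size_t end = (base + STRIP < n) ? base + STRIP : n;
        for (size_t i = base; i < end; i++)  /* pass 1 on this strip */
            v[i] *= scale;
        for (size_t i = base; i < end; i++)  /* pass 2 reuses cached strip */
            v[i] += offset;
    }
}
```

The result is identical to running the two passes back to back over the full array; only the memory access order changes.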

  • Page 330

    happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits and Aliasing in Caches” in Chapter 2) will exacerbate the likelihood of cache evictions. Example 6-9(b) shows applying the techniques of tiling with optimal selection of tile size and tile[...]

  • Page 331

    references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.
    Single-pass versus Multi-pass Execution
    An algorithm can use single- or multi-pass execution defined as follows:
    • Single-pass, or unlayered execution passes a single [...]

  • Page 332

    selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one. Single-pass execution can be better suited to applications which limit the number of features that may b[...]

  • Page 333

    Optimizing Cache Usage 6 6-43 The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time. In contrast, for a single-pass approach, bandwidth- li[...]

  • Page 334

    IA-32 Intel® Architecture Optimization 6-44 a line burst transaction. To achieve the best possible performance, it is recommended to align data along the cache line boundary and write them consecutively in a cache line size while using non-temporal stores. If the consecutive writes are prohibitive due to programming constraints, then software [...]

  • Page 335

    Optimizing Cache Usage 6 6-45 The following examples of using prefetching instructions in the operation of a video encoder and decoder, as well as in a simple 8-byte memory copy, illustrate the performance gain from using the prefetching instructions for efficient cache management. Video Encoder In a video encoder example, some of the data used during the[...]

  • Page 336

    IA-32 Intel® Architecture Optimization 6-46 Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch. Conclusions from Video Encoder and Decoder Implementation These two examples indicate that by usi[...]

  • Page 337

    Optimizing Cache Usage 6 6-47 The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations: • alignment of data • proper layout of pages in memory • cache size • interaction of the translation lookaside buffer (TLB) with memory accesses • combining prefetch and streaming-store instructions. T[...]

  • Page 338

    IA-32 Intel® Architecture Optimization 6-48 Using the 8-byte Streaming Stores and Software Prefetch Example 6-11 presents the copy algorithm that uses second level cache. The algorithm performs the following steps: 1. Uses blocking technique to transfer 8-byte data from memory into second-level cache using the _mm_prefetch intrinsic, 128 by[...]
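The overall shape of such a blocked copy (prefetch a block, then copy it) might look like the sketch below. It is not Example 6-11: the function name, `BLOCK_BYTES`, and the `PREFETCH_HINT` macro are assumptions made for illustration, the compiler builtin `__builtin_prefetch` stands in for `_mm_prefetch`, and an ordinary `memcpy` stands in for the streaming stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with low temporal locality; only a hint to the CPU. */
#define PREFETCH_HINT(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_HINT(p) ((void)(p))
#endif

/* Illustrative sizes: 128-byte prefetch stride, 4-KByte blocks. */
enum { LINE_BYTES = 128, BLOCK_BYTES = 4096 };

/* Copies nbytes from src to dst one block at a time. Each block is
   first touched with prefetch hints in its own loop, then copied in a
   second step, which keeps the read and write phases separate. */
static void blocked_copy(void *dst, const void *src, size_t nbytes)
{
    assert((dst != 0 && src != 0) || nbytes == 0);
    const unsigned char *s = src;
    unsigned char *d = dst;
    for (size_t base = 0; base < nbytes; base += BLOCK_BYTES) {
        size_t len = (nbytes - base < BLOCK_BYTES) ? nbytes - base
                                                   : BLOCK_BYTES;
        for (size_t off = 0; off < len; off += LINE_BYTES)
            PREFETCH_HINT(s + base + off);      /* phase 1: prefetch */
        memcpy(d + base, s + base, len);        /* phase 2: copy     */
    }
}
```

Whether this split helps on a given machine depends on the hardware prefetcher and cache sizes, so any real use would need measurement.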

  • Page 339

    Optimizing Cache Usage 6 6-49 In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize the number of transitions between reading and writing data. This significantly improves [...]

  • Page 340

    IA-32 Intel® Architecture Optimization 6-50 The instruction, temp = a[kk+CACHESIZE], is used to ensure the page table entry for array a is entered in the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction. Hence, the prefetching starts from kk+4 in t[...]

  • Page 341

    Optimizing Cache Usage 6 6-51 prefetch_loop: movaps xmm0, [esi+ecx] movaps xmm0, [esi+ecx+64] add ecx,128 cmp ecx,BLOCK_SIZE jne prefetch_loop xor ecx,ecx align 16 cpy_loop: movdqa xmm0,[esi+ecx] movdqa xmm1,[esi+ecx+16] movdqa xmm2,[esi+ecx+32] movdqa xmm3,[esi+ecx+48] movdqa xmm4,[esi+ecx+64] movdqa xmm5,[esi+ecx+16+64] movdqa xmm6,[esi+ecx+32+64[...]

  • Page 342

    IA-32 Intel® Architecture Optimization 6-52 Performance Comparisons of Memory Copy Routines The throughput of a large-region memory copy routine depends on several factors: • coding techniques that implement the memory copy task • characteristics of the system bus (speed, peak bandwidth, overhead in read/write transaction protocols) •[...]

  • Page 343

    Optimizing Cache Usage 6 6-53 The baseline for performance comparison is the throughput (bytes/sec) of an 8-MByte region memory copy on a first-generation Pentium M processor (CPUID signature 0x69n) with a 400-MHz system bus using a byte-sequential technique similar to that shown in Example 6-10. The degree of improvement relative to the performance ba[...]

  • Page 344

    IA-32 Intel® Architecture Optimization 6-54 query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting from 0) in the ECX register. The list of parameters is shown in Table 6-3. The deterministic cache parameter leaf provides a means to implement software with a degree of forward compati[...]
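As a sketch of how the values returned by this leaf can be turned into a cache size, the helper below decodes register contents that the caller has already obtained from CPUID leaf 4 for one index. The function name is hypothetical. The field layout used (line size, partitions, and ways in EBX, each stored minus one; number of sets minus one in ECX; cache type in EAX bits 4:0, with 0 meaning no more caches) follows the CPUID description in the Software Developer's Manual.

```c
#include <assert.h>
#include <stdint.h>

/* Computes the size in bytes of the cache described by one CPUID
   leaf 4 sub-leaf. The caller passes the raw EAX, EBX and ECX values;
   this helper does not execute CPUID itself. */
static uint64_t cache_size_bytes(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    /* Cache type 0 marks the end of the enumeration; callers should
       stop iterating before asking for a size. */
    assert((eax & 0x1Fu) != 0);

    uint64_t line_size  = (ebx & 0xFFFu) + 1u;          /* EBX[11:0]  */
    uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1u;  /* EBX[21:12] */
    uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1u;  /* EBX[31:22] */
    uint64_t sets       = (uint64_t)ecx + 1u;           /* ECX[31:0]  */

    return ways * partitions * line_size * sets;
}
```

A caller would typically loop over index values 0, 1, 2, ... in ECX, stop at the first sub-leaf whose cache type is 0, and use the computed sizes instead of hard-coded constants when choosing block sizes.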

  • Page 345

    Optimizing Cache Usage 6 6-55 • Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A). • Determine cache hierarchy topology in a platform using multi-core processors (See Example 7-13). • Manage threads and processor affinities. • Determin[...]

  • Page 346

    IA-32 Intel® Architecture Optimization 6-56 platform, software can extract information on the number and the identities of each logical processor sharing that cache level and is made available to application by the OS. This is discussed in detail in “Using Shared Execution Resources in a Processor Core” in Chapter 7 and Example 7-13. Deter[...]

  • Page 347

    7-1 7 Multi-Core and Hyper-Threading Technology This chapter describes software optimization techniques for multithreaded applications running in an environment using either multiprocessor (MP) systems or processors with hardware-based multi-threading support. Multiprocessor systems are systems with two or more sockets, each mated with a phy[...]

  • Page 348

    IA-32 Intel® Architecture Optimization 7-2 cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter covers guidelines that apply to either situation. This chapter covers • Performance characteristics and usage models, • Programming models for multithreaded applications, • Softw[...]

  • Page 349

    Multi-Core and Hyper-Threading Technology 7 7-3 Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law. The bar in Figure 7-1 represents an individual task unit or the collective workload of an entire application. In general, the speed-up of running multiple threads on an MP system with N p[...]
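The relationship can be made concrete with a small calculation. The sketch below uses the standard form of Amdahl's law, speed-up = 1 / ((1 - P) + P/N + O), where P is the fraction of the work that can run in parallel, N is the number of processors, and O is threading overhead expressed as a fraction of the single-threaded run time. The function and parameter names are illustrative assumptions.

```c
#include <assert.h>

/* Predicted speed-up of a workload on n processors when a fraction p
   of it is parallelizable and threading adds a fixed overhead
   (given as a fraction of the original single-threaded run time). */
static double amdahl_speedup(double p, unsigned n, double overhead)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n > 0);
    assert(overhead >= 0.0);
    return 1.0 / ((1.0 - p) + p / (double)n + overhead);
}
```

With half of the work parallelized and no overhead, two processors give `amdahl_speedup(0.5, 2, 0.0)`, about 1.33, and the speed-up can never exceed 2 however many processors are added; a nonzero overhead lowers it further.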

  • Page 350

    IA-32 Intel® Architecture Optimization 7-4 When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor. If the control flow of a multi[...]

  • Page 351

    Multi-Core and Hyper-Threading Technology 7 7-5 terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized. For development purposes, several popular operating systems (for example Microsoft Windows* XP Professional and Home,[...]

  • Page 352

    IA-32 Intel® Architecture Optimization 7-6 When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself. An application that uses lengthy spin loops for intra-proces[...]

  • Page 353

    Multi-Core and Hyper-Threading Technology 7 7-7 Parallel Programming Models Two common programming models for transforming independent task requirements into application threads are: • domain decomposition • functional decomposition Domain Decomposition Usually large compute-intensive tasks use data sets that can be divided into a number [...]

  • Page 354

    IA-32 Intel® Architecture Optimization 7-8 Functional Decomposition Applications usually process a wide variety of tasks with diverse functions and many unrelated data sets. For example, a video codec needs several different processing functions. These include DCT, motion estimation and color conversion. Using a functional threading model, ap[...]

  • Page 355

    Multi-Core and Hyper-Threading Technology 7 7-9 overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low. This can be done by ensuring the producer and consumer threads have comparable time constants for completing each increm [...]

  • Page 356

    IA-32 Intel® Architecture Optimization 7-10 Producer-Consumer Threading Models Figure 7-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads. The horizontal direction represents time. Each block represents a task unit, processing the buffer assigned to a thread. The gap between each task represents syn[...]

  • Page 357

    Multi-Core and Hyper-Threading Technology 7 7-11 It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize bus traffic and be effective on multi-core processors without shared second-level cache. In this interlaced variation of the producer-consumer model, each scheduling quantum of an applicati[...]

  • Page 358

    IA-32 Intel® Architecture Optimization 7-12 corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic. The sc[...]

  • Page 359

    Multi-Core and Hyper-Threading Technology 7 7-13 Example 7-3 Thread Function for an Interlaced Producer Consumer Model // master thread starts the first iteration, the other thread must wait // one iteration void producer_consumer_thread(int master) { int mode = 1 - master; // track which thread and its designated buffer index unsigned int it[...]

  • Page 360

    IA-32 Intel® Architecture Optimization 7-14 Tools for Creating Multithreaded Applications Programming directly to a multithreading application programming interface (API) is not the only method for creating multithreaded applications. New tools such as the Intel® Compiler have become available with capabilities that make the challenge of c[...]

  • Page 361

    Multi-Core and Hyper-Threading Technology 7 7-15 Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform serial applications into parallel applications, programmers must identify specific portions of the application code that contain parallelism and add compiler directives. Intel Compiler 6.0 suppor[...]

  • Page 362

    IA-32 Intel® Architecture Optimization 7-16 Optimization Guidelines This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance): • thread synchronization • bus utilization • memory optimization • front end optimization • execution resource optimization Practices [...]

  • Page 363

    Multi-Core and Hyper-Threading Technology 7 7-17 • Place each synchronization variable alone, separated by 128 bytes or in a separate cache line. See “Thread Synchronization” for more details. Key Practices of System Bus Optimization Managing bus traffic can significantly impact the overall performance of multithreaded software and MP [...]

  • Page 364

    IA-32 Intel® Architecture Optimization 7-18 • Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KB or 1 MB (prevents unnecessary cache line evictions) when targeting IA-32 processors supporting Hyper-Threading Technology. • Add a per-instance stack offset when[...]

  • Page 365

    Multi-Core and Hyper-Threading Technology 7 7-19 • For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package. See “Using Thread Affinities to Manage Shared Platform Resources” for more details. Generali[...]

  • Page 366

    IA-32 Intel® Architecture Optimization 7-20 The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization. Intel Thread Profiler can be used to profile the execution timeline of each thread and detect situations where performance is impacted by frequent occurrenc[...]

  • Page 367

    Multi-Core and Hyper-Threading Technology 7 7-21 the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in the Introduction chapter). • When choosing between different primitives to implement a synchronization construct, using Intel Thread Checker and Thread Profiler can be very useful in[...]

  • Page 368

    IA-32 Intel® Architecture Optimization 7-22 Synchronization for Short Periods The frequency and duration that a thread needs to synchronize with other threads depends on application characteristics. When a synchronization loop needs very fast response, applications may use a spin-wait loop. A spin-wait loop is typically used when one thread nee[...]

  • Page 369

    Multi-Core and Hyper-Threading Technology 7 7-23 the processor must guarantee no violations of memory order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs the processor a severe penalty that impacts all threads. This penalty occurs on the Pentium M processor, the Intel Core Solo and Intel Cor[...]

  • Page 370

    IA-32 Intel® Architecture Optimization 7-24 Example 7-4 Spin-wait Loop and PAUSE Instructions (a) An un-optimized spin-wait loop experiences performance penalty when exiting the loop. It consumes execution resources without contributing computational work. do { // This loop can run faster than the speed of memory access, // other worker threa[...]
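A spin-wait loop that issues PAUSE on every iteration might be written in C11 along the following lines. This is a sketch rather than the manual's Example 7-4: the function name and the `CPU_RELAX` macro are assumptions, and the macro relies on the GCC/Clang builtin `__builtin_ia32_pause` on x86 targets, falling back to a no-op elsewhere.

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__)
/* Emits the PAUSE instruction (GCC and Clang builtin). */
#define CPU_RELAX() __builtin_ia32_pause()
#else
/* Assumed fallback for other targets: no hint available here. */
#define CPU_RELAX() ((void)0)
#endif

/* Spins until *flag becomes nonzero, executing PAUSE between polls.
   Returns how many times the loop body ran, which is useful for
   instrumentation; a worker thread is expected to set the flag. */
static unsigned long spin_until_nonzero(atomic_int *flag)
{
    assert(flag != 0);
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        CPU_RELAX();
        ++spins;
    }
    return spins;
}
```

If the wait may be long, a bounded spin followed by a blocking OS call is usually the better choice, since the spinning thread still occupies a logical processor.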

  • Page 371

    Multi-Core and Hyper-Threading Technology 7 7-25 User/Source Coding Rule 21. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance. On IA-32 processors that use the Intel NetBurst microarchitecture core, the penalty of exiting from[...]

  • Page 372

    IA-32 Intel® Architecture Optimization 7-26 To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads. Instead of allowing multiple threads to compete for a given lock, no mor[...]

  • Page 373

    Multi-Core and Hyper-Threading Technology 7 7-27 If an application thread must remain idle for a long time, the application should use a thread blocking API or other method to release the idle processor. The techniques discussed here apply to traditional MP systems, but they have an even higher impact on IA-32 processors that support Hyper-T[...]

  • Page 374

    IA-32 Intel® Architecture Optimization 7-28 Avoid Coding Pitfalls in Thread Synchronization Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor. No single technique[...]

  • Page 375

    Multi-Core and Hyper-Threading Technology 7 7-29 In general, OS function calls should be used with care when synchronizing threads. When using OS-supported thread synchronization objects (critical section, mutex, or semaphore), preference should be given to the OS service that has the least synchronization overhead, such as a critical secti[...]

  • Page 376

    IA-32 Intel® Architecture Optimization 7-30 Prevent Sharing of Modified Data and False-Sharing On an Intel Core Duo processor, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core. This will caus[...]

  • Page 377

    Multi-Core and Hyper-Threading Technology 7 7-31 User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors). When a common block of parameters is pass[...]

  • Page 378

    IA-32 Intel® Architecture Optimization 7-32 • Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads. Another technique to enforce alignment of synchronization variables and to avoid a cach[...]
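One way to keep synchronization variables from sharing a cache line or 128-byte sector is to align and pad each one, as in the C11 sketch below. The type name `padded_lock`, the array `locks`, and the constant `SYNC_SPACING` are hypothetical; the 128-byte spacing follows the guideline quoted earlier in this chapter.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Spacing between synchronization variables, in bytes. */
enum { SYNC_SPACING = 128 };

/* A lock word that occupies a full 128-byte region by itself, so a
   write to one lock never invalidates the line or sector that holds
   a neighbouring lock. */
struct padded_lock {
    alignas(SYNC_SPACING) int value;
    char pad[SYNC_SPACING - sizeof(int)];
};

/* The layout is checked at compile time. */
static_assert(sizeof(struct padded_lock) == SYNC_SPACING,
              "each lock must fill exactly one 128-byte region");

/* Two locks used by different threads now sit 128 bytes apart. */
static struct padded_lock locks[2];
```

The cost is memory: every lock consumes 128 bytes, so this layout suits a small number of hot synchronization variables rather than per-element locks in a large array.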

  • Page 379

    Multi-Core and Hyper-Threading Technology 7 7-33 • In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen. • Provide classes such that only one thread writes to each object fi[...]

  • Page 380

    IA-32 Intel® Architecture Optimization 7-34 Conserve Bus Bandwidth In a multi-threading environment, bus bandwidth may be shared by memory traffic originated from multiple bus agents (These agents can be several logical processors and/or several processor cores). Preserving the bus bandwidth can improve processor scaling performance. Also, eff[...]

  • Page 381

    Multi-Core and Hyper-Threading Technology 7 7-35 reads. An approximate working guideline for software to operate below bus saturation is to check if bus read queue depth is significantly below 5. Some MP platforms may have a chipset that provides two buses, with each bus servicing one or more physical processors. The guidelines for conserving [...]

  • Page 382

    IA-32 Intel® Architecture Optimization 7-36 Avoid Excessive Software Prefetches Pentium 4 and Intel Xeon Processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns. In most situations, the hardware prefetcher is likely to reduce system memory la[...]

  • Page 383

    Multi-Core and Hyper-Threading Technology 7 7-37 latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions. The average latency of back-to-back bus reads is likely to be lower than the average latency of scattered reads interspersed with other bus t[...]

  • Page 384

    IA-32 Intel® Architecture Optimization 7-38 Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing with WB store traffic. To implement software write-combining, uncacheable writes to memory with the WC attribute are wri[...]

  • Page 385

    Multi-Core and Hyper-Threading Technology 7 7-39 block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package. Typically, some cache lines are needed to access data that are not part of the source or destination buffers used in cache blockin[...]
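The division described here is simple arithmetic; a sketch with a hypothetical helper name is shown below. The result is best read as an upper bound on the per-thread block size, since the passage notes that some cache lines are also needed for data outside the blocked buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound, in bytes, on the loop-blocking working set of one
   thread: the target cache size divided evenly among the logical
   processors that share a physical processor package. */
static size_t blocking_budget_bytes(size_t cache_bytes,
                                    unsigned logical_per_package)
{
    assert(logical_per_package > 0);
    return cache_bytes / logical_per_package;
}
```

For example, a 512-KByte target cache shared by two logical processors leaves at most 256 KBytes per thread, and a practical block size would be chosen somewhat below that.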

  • Page 386

    IA-32 Intel® Architecture Optimization 7-40 User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus. One technique to minimize sharing of data is to copy data to local stack variables if it is to be accessed repeatedly over an extended p[...]

  • Page 387

    Multi-Core and Hyper-Threading Technology 7 7-41 Example 7-8 shows the batched implementation of the producer and consumer thread functions. Example 7-8 Batched Implementation of the Producer Consumer Threads void producer_thread() { int iter_num = workamount - batchsize; int mode1; for (mode1=0; mode1 < batchsize; mode1++) { produce(buf[...]

  • Page 388

    IA-32 Intel® Architecture Optimization 7-42 Eliminate 64-KByte Aliased Data Accesses The 64 KB aliasing condition is discussed in Chapter 2. Memory accesses that satisfy the 64 KB aliasing condition can cause excessive evictions of the first-level data cache. Eliminating 64-KB-aliased data accesses originating from each thread helps improve fre[...]

  • Page 389

    Multi-Core and Hyper-Threading Technology 7 7-43 Preventing Excessive Evictions in First-Level Data Cache Cached data in a first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses. While two logical processors in the same physical pro[...]

  • Page 390

    IA-32 Intel® Architecture Optimization 7-44 Per-thread Stack Offset To prevent private stack accesses in concurrent threads from thrashing the first-level data cache, an application can use a per-thread stack offset for each of its threads. The size of these offsets should be multiples of a common base offset. The optimum choice of this[...]

  • Page 391

    Multi-Core and Hyper-Threading Technology 7 7-45 Example 7-9 Adding an Offset to the Stack Pointer of Three Threads void Func_thread_entry(DWORD *pArg) {DWORD StackOffset = *pArg; DWORD var1; // The local variable at this scope may not benefit DWORD var2; // from the adjustment of the stack pointer that ensues. // Call runtime library routi[...]

  • Page 392

    IA-32 Intel® Architecture Optimization 7-46 Per-instance Stack Offset Each instance of an application runs in its own linear address space; but the address layout of data for stack segments is identical for both instances. When the instances are running in lock step, stack accesses are likely to cause excessive evictions of cache lines [...]

  • Page 393

    Multi-Core and Hyper-Threading Technology 7 7-47 However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package. To establish a suitable stack offset for two instances of the same application run[...]

  • Page 394

    IA-32 Intel® Architecture Optimization 7-48 Front-end Optimization In the Intel NetBurst microarchitecture family of processors, the instructions are decoded into micro-ops (μops) and sequences of μops (called traces) are stored in the Execution Trace Cache. The Trace Cache is the primary sub-system in the front end of the processor that [...]

  • Page 395

    Multi-Core and Hyper-Threading Technology 7 7-49 On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine. Optimization for Code Size When the Trace Cache is continuously and repeatedly delivering μop trac[...]

  • Page 396

    IA-32 Intel® Architecture Optimization 7-50 initial APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more details) associated with a logical processor. The three levels are: • physical processor package. A PACKAGE_ID label can be used to distinguish different physical packages within[...]

  • Page 397

    Multi-Core and Hyper-Threading Technology 7 7-51 Affinity masks can be used to optimize shared multi-threading resources. Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor // The BIOS and/or OS may limit the number of logical processors // available to applications after system boot. // The below algorithm will [...]
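The idea of splitting an initial APIC ID into three levels of identifiers can be illustrated with plain bit-field extraction. The sketch below is not Example 7-11: the struct, the function name, and the caller-supplied field widths `smt_bits` and `core_bits` are assumptions (real code derives the widths from CPUID), and the IDs are shifted down to small integers here purely for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three topology levels. */
struct topo_ids {
    uint32_t smt_id;      /* logical processor within a core */
    uint32_t core_id;     /* core within a physical package  */
    uint32_t package_id;  /* physical processor package      */
};

/* Splits an APIC ID laid out as [package | core | smt], where the
   lowest smt_bits select the logical processor and the next
   core_bits select the core. */
static struct topo_ids split_apic_id(uint32_t apic_id,
                                     unsigned smt_bits,
                                     unsigned core_bits)
{
    assert(smt_bits + core_bits < 32);
    struct topo_ids t;
    uint32_t smt_mask  = (1u << smt_bits) - 1u;
    uint32_t core_mask = (1u << core_bits) - 1u;
    t.smt_id     = apic_id & smt_mask;
    t.core_id    = (apic_id >> smt_bits) & core_mask;
    t.package_id = apic_id >> (smt_bits + core_bits);
    return t;
}
```

With one SMT bit and one core bit, APIC ID 0x0D (binary 1101) decodes to package 3, core 0, logical processor 1. Logical processors whose package and core IDs match share a core, which is the information needed when building affinity masks.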

  • Page 398

    IA-32 Intel® Architecture Optimization 7-52 Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to: • Scheduling two domain-decomposition threads to use separate cores or physical packages in order to avoid contention of execution resources in the same core • Scheduling two functional-decompos[...]

  • Page 399

    Multi-Core and Hyper-Threading Technology 7 7-53 first to the primary logical processor of each processor core. This example is also optimized to the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores. User/Source Coding Rule 39. (M impact, L generalit[...]

  • Page 400

    IA-32 Intel® Architecture Optimization 7-54 Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First AFFINITYMASK LuT[64]; // A Look up table to retrieve the affinity // mask we want to use from the thread // scheduling sequence index. int index =0; // Index to scheduling sequence. j = 0; / [...]

  • Page 401

    Multi-Core and Hyper-Threading Technology 7 7-55 Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache // Logical processors sharing the same cache can be determined by bucketing // the logical processors with a mask, the width of the mask is determined // from the maximum number of logical proce[...]

  • Page 402

    IA-32 Intel® Architecture Optimization 7-56 PackageID[ProcessorNum] = PACKAGE_ID; CoreID[ProcessorNum] = CORE_ID; SmtID[ProcessorNum] = SMT_ID; CacheID[ProcessorNum] = CACHE_ID; // Only the target cache is stored in this example ProcessorNum++; } ThreadAffinityMask <<= 1; } NumStartedLPs = ProcessorNum; CacheIDBucket is an array of uniq[...]

  • Page 403

    Multi-Core and Hyper-Threading Technology 7 7-57 for (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) { ProcessorMask <<= 1; for (i = 0; i < CacheNum; i++) { // We may be comparing bit-fields of logical processors // residing in a different modular boundary of the cache // topology, the code below assumes symmetry [...]

  • Page 404

    IA-32 Intel® Architecture Optimization 7-58 Optimization of Other Shared Resources Resource optimization in multi-threaded applications depends on the cache topology and execution resources associated within the hierarchy of processor topology. Processor topology and an algorithm for software to identify the processor topology are discussed i[...]

  • Page 405

    Multi-Core and Hyper-Threading Technology 7 7-59 seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput should also benefit multi-threading performance. Tuning Suggestion 4. (H Impact, M Generality) Optimize multithreaded applications to achieve optimal processor scaling with respect to the numb [...]

  • Page 406

    IA-32 Intel® Architecture Optimization 7-60 throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the throughput of a logical processor. When a single thread is executing and all on-chip execution resources are available to it, non-halted CPI can indicate the unused execution bandwidth [...]

  • Page 407

    Multi-Core and Hyper-Threading Technology 7 7-61 Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency. User/Source Coding Rule 40. (M impact, L generality) If a single thread consumes hal [...]

  • Page 408

    IA-32 Intel® Architecture Optimization 7-62 Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting Hyper-Threading Technology, the writes of both threads count toward the limit of four write-combining buffers. For exampl[...]

  • Page 409

    8-1 8 64-bit Mode Coding Guidelines Introduction This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines should be considered as an addendum to the coding guidelines described in Chapters 2 through 7. Software that runs in either compatibility mode or legacy non-64-bit modes should follow[...]

  • Page 410

    IA-32 Intel® Architecture Optimization 8-2 This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers r8-r15, the REX prefix is required. Using the 32-bit form there does not reduce code size. Assembly/Compiler Coding rule Use the 32-bit versions of instr[...]

  • Page 411

    64-bit Mode Coding Guidelines 8 8-3 If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result. If the compiler or assembly programmer cannot determine that the result will be less than 64 bits, then a multiply that p[...]

  • Page 412

    IA-32 Intel® Architecture Optimization 8-4 Can be replaced with: movsx r8, r9w ;If bits 63:8 do not need to be ;preserved. movsx r8, r10b ;If bits 63:8 do not need to ;be preserved. In the above example, the moves to r8w and r8b both require a merge to preserve the rest of the bits in the register. There is an implicit real dependency on r8 b[...]

  • Page 413

    64-bit Mode Coding Guidelines 8 8-5 IMUL RAX, RCX The 64-bit version above is more efficient than using the following 32-bit version: MOV EAX, DWORD PTR[X] MOV ECX, DWORD PTR[Y] IMUL ECX In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register. Assembly/Compiler Coding[...]

  • Page 414

    IA-32 Intel® Architecture Optimization 8-6 Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible The CVTSI2SS and CVTSI2SD instructions convert a signed integer in a general-purpose register or memory location to a single-precision or double-precision floating-point value. The signed integer can be either 32-bits or 64-bits. The 32-bit [...]

  • Page 415

    9-1 9 Power Optimization for Mobile Usages Overview Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering this benefit. Mobile applications require software optimization that considers both performance and power consumption. This chapter provides background on power saving techniques in mobile[...]

  • Page 416

    IA-32 Intel® Architecture Optimization 9-2 Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction of active power and static power consumption. These include: • Enhanced Intel SpeedStep® Technology enables operating system (OS) to program a processor to transition to lower frequency [...]

  • Page 417

    Power Optimization for Mobile Usages 9 9-3 to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below: 1. Demand is high and the processor works at its highest possible frequency (P0). 2. Demand decreases, which the OS recognizes after some delay; the OS se[...]

  • Page 418

    IA-32 Intel® Architecture Optimization 9-4 ACPI C-States When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle. For example, the processor could be waiting on an application time-out set by a Sleep() function, waiting for a web server response, or waiting for a us[...]

  • Page 419

    Power Optimization for Mobile Usages 9 9-5 The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption. They also require more time to wake up (higher exit latency). C-state types are described below: • C0: The processor is active and performing computations and executing i[...]

  • Page 420

    IA-32 Intel® Architecture Optimization 9-6 Figure 9-3 Application of C-states to Idle Time Consider that a processor is in lowest frequency (LFM, low frequency mode) and utilization is low. During the first time slice window (Figure 9-3 shows an example that uses 100 ms time slice for C-state decisions), processor utilization is low and the [...]

  • Page 421

    Power Optimization for Mobile Usages 9 9-7 • In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power. The processor reduces voltage to the minimum level required to safely maintain processor context. Although exiting from a deep C4 state [...]

  • Page 422

    IA-32 Intel® Architecture Optimization 9-8 Adjust Performance to Meet Quality of Features When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both. Implementing such options in an application increases the processor idle time. Proces[...]

  • Page 423

    Power Optimization for Mobile Usages 9 9-9 • GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application can use this API to ensure that the system is running the best power scheme. Avoid Using Spin Loops Spin loops are used to wait for short intervals of time or for synchronization. The main advantage of [...]

  • Page 424

    IA-32 Intel® Architecture Optimization 9-10 workload (usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance). Optimizing an application starts with having efficient algorithms and then improving them using Intel software development tools, such as Intel® VTune[...]

  • Page 425

    Power Optimization for Mobile Usages 9 9-11 disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning. Handling Sleep State Transitions In some cases, transitioning to a sleep state may harm an application. For example, suppose an application is in the middle of usin[...]

  • Page 426

    IA-32 Intel® Architecture Optimization 9-12 Using Enhanced Intel SpeedStep® Technology Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.[...]

  • Page 427

    Power Optimization for Mobile Usages 9 9-13 The same application can be written in such a way that work units are divided into smaller granularity, with the scheduling of each work unit and Sleep() occurring at more frequent intervals (e.g. 100 ms) to deliver the same QOS (operating at full performance 50% of the time). In this scenario, the OS observe[...]

  • Page 428

    IA-32 Intel® Architecture Optimization 9-14 An additional positive effect of continuously operating at a lower frequency is that frequent changes in power draw (from low to high in our case) and battery current eventually harm the battery. They accelerate its deterioration. When the lowest possible operating point (highest P-state) is reached[...]

  • Page 429

    Power Optimization for Mobile Usages 9 9-15 Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable amount of power. The following guidelines can help applications take advantage of Intel® Enhanced Deeper Sleep: • Avoid setting higher interrupt rates. Shorter periods between inter[...]

  • Page 430

    IA-32 Intel® Architecture Optimization 9-16 thread enables the physical processor to operate at lower frequency relative to a single-threaded version. This in turn enables the processor to operate at a lower voltage, saving battery life. Note that the OS views each logical processor or core in a physical processor as a separate entity and com[...]

  • Page 431

    Power Optimization for Mobile Usages 9 9-17 demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multi-core unaware P-state coordination, resulting in a performance anomaly. See Figure 9-5: Software applications have a couple of choices to prevent this from happening: • Thread affinity[...]

  • Page 432

    IA-32 Intel® Architecture Optimization 9-18 processor to enter the lowest possible C-state type (lower-numbered C state has less power saving). For example, if Core 1 meets the requirement to be in ACPI C1 and Core 2 meets requirement for ACPI C3, multi-core-unaware OS coordination takes the physical processor to ACPI C1. See Figure 9-6. 2. Ena[...]

  • Page 433

    Power Optimization for Mobile Usages 9 9-19 imbalance can be accomplished using performance monitoring events. Intel Core Duo processor provides an event for this purpose. The event (Serial_Execution_Cycle) increments under the following conditions: — The core is actively executing code in C0 state, — The second core in the physical proces[...]

  • Page 434

    IA-32 Intel® Architecture Optimization 9-20[...]

  • Page 435

    A-1 A Application Performance Tools Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code. The following perf[...]

  • Page 436

    IA-32 Intel® Architecture Optimization A-2 • Intel Performance Libraries The Intel Performance Library family consists of a set of software libraries optimized for Intel architecture processors. The library family includes the following: — Intel® Math Kernel Library (MKL) — Intel® Integrated Performance Primitives (IPP) • In[...]

  • Page 437

    Application Performance Tools A A-3 family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism are all supported by the Intel compilers and can significantly aid the performance of an application. The most general optimization options are -O1 , -O2 and -O3 . Each of them ena[...]

  • Page 438

    IA-32 Intel® Architecture Optimization A-4 default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any Intel architecture 32-bit processor, but will be optimized specifically for the targeted processor. Automatic Processor Dispatch Support ( -Qx[extensions] and -Qax[extensions] ) The -Qx[ext[...]

  • Page 439

    Application Performance Tools A A-5 Vectorizer Switch Options The Intel C++ and Fortran Compiler can vectorize your code using the vectorizer switch options. The options that enable the vectorizer are the -Qx[M,K,W,B,P] and -Qax[M,K,W,B,P] described above. The compiler provides a number of other vectorizer switch options that allow you to cont[...]

  • Page 440

    IA-32 Intel® Architecture Optimization A-6 Multithreading with OpenMP* Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP compiler directives, library functions and environment variables. OpenMP directives are activated by the compiler switch -Qopenmp . The available directives are described in the Compiler [...]

  • Page 441

    Application Performance Tools A A-7 The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. For complete details on all of the code optimization options, refer to the Intel® C++ Compiler User’s Guide. Interprocedural and Profile-Guided Optimizations The following are two methods to i[...]

  • Page 442

    IA-32 Intel® Architecture Optimization A-8 When you use PGO, consider the following guidelines: • Minimize the changes to your program after instrumented execution and before feedback compilation. During feedback compilation, the compiler ignores dynamic information for functions modified after that information was generated. • Repeat the in[...]

  • Page 443

    Application Performance Tools A A-9 Sampling Sampling allows you to profile all active software on your system, including operating system, device driver, and application software. It works by occasionally interrupting the processor and collecting the instruction address, process ID, and thread ID. After the sampling activity completes, the VT [...]

  • Page 444

    IA-32 Intel® Architecture Optimization A-10 Figure A-1 provides an example of a hotspots report by location. Event-based Sampling Event-based sampling (EBS) can be used to provide detailed information on the behavior of the microprocessor as it executes software. Some of the events that can be used to trigger sampling include clockticks, cache [...]

  • Page 445

    Application Performance Tools A A-11 different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 and Intel Xeon processor depends on the events selected. Event-based samples are collected after a specific number of processor events have occurred. The samples can then be attributed to the diff[...]

  • Page 446

    IA-32 Intel® Architecture Optimization A-12 duration of read traffic compared to the duration of the workload is significantly less than unity, it indicates the dominant data locality of the workload is cache access traffic. Average Bus Queue Depth: Using the default configuration of the processor event “Bus Reads Underway from the Processo[...]

  • Page 447

    Application Performance Tools A A-13 stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload is to compare the ratio between bus read transactions and the number of DTLB pagewalks due to read traffic, under the condition of disabling the hardware prefetch while measuring bus traff[...]

  • Page 448

    IA-32 Intel® Architecture Optimization A-14 The Call Graph View depicts the caller/callee relationships. Each thread in the application is the root of a call tree. Each node (box) in the call tree represents a function. Each edge (line with an arrow) connecting two nodes represents the call from the parent to the child function. If the mous[...]

  • Page 449

    Application Performance Tools A A-15 (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel Math Kernel Library (MKL) and the Intel Integrated Performance Primitives (IPP). • The Intel Math Kernel Library for Linux* and Windows*: MKL is composed of highly optimized mathematical f[...]

  • Page 450

    IA-32 Intel® Architecture Optimization A-16 • Performance: Highly-optimized routines with a C interface that give Assembly-level performance in a C/C++ development environment (MKL also supports a Fortran interface). • Platform tuned: Processor-specific optimizations that yield the best performance for each Intel processor. • Compatib[...]

  • Page 451

    Application Performance Tools A A-17 developed with the Intel Performance Libraries benefit from new architectural features of future generations of Intel processors simply by relinking the application with upgraded versions of the libraries. Enhanced Debugger (EDB) The Enhanced Debugger (EDB) enables you to debug C++, Fortran or mixed languag[...]

  • Page 452

    IA-32 Intel® Architecture Optimization A-18 The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes your program and automatically locates threading errors. As your program runs, the Intel Thread Checker monitors memory accesses and other events and automatically detects situations which cou[...]

  • Page 453

    Application Performance Tools A A-19 Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected data race conditions that occurred during threaded execution. The target operands (a synchronization variable shared by more than one thread) of the race condition, the type of data operation on[...]

  • Page 454

    IA-32 Intel® Architecture Optimization A-20 Intel® Software College The Intel® Software College is a valuable resource for classes on Streaming SIMD Extensions 2 (SSE2), Threading and the IA-32 Intel Architecture. For online training on how to use the SSE2 and Hyper-Threading Technology, refer to the IA-32 Architecture Training - Online [...]

  • Page 455

    B-1 B Using Performance Monitoring Events Performance monitoring events provide facilities to characterize the interaction between programmed sequences of instructions and different microarchitectural sub-systems. Performance monitoring events are described in Chapter 18 and Appendix A of the IA-32 Intel® Architecture Software Developer [...]

  • Page 456

    IA-32 Intel® Architecture Optimization B-2 The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hyper-Threading Technology; see the Using Performance Metrics with Hyper-Threading Technology section. Pentium 4 Processor-Specific Terminology Bogus, Non-bogus, Retire Branch mispredictions [...]

  • Page 457

    Using Performance Monitoring Events B B-3 Replay In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively schedules μops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, μops must be[...]

  • Page 458

    IA-32 Intel® Architecture Optimization B-4 miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired) will increment only once for that μop. Counting Clocks The count of cycles, also known as clock ticks, forms a fundamental basis for measuring how long a program takes to execute, a[...]

  • Page 459

    Using Performance Monitoring Events B B-5 The first two metrics use performance counters, and thus can be used to cause interrupt upon overflow for sampling. They may also be useful for those cases where it is easier for a tool to read a performance counter instead of the time stamp counter. The timestamp counter is accessed via an instruction, RD[...]

  • Page 460

    IA-32 Intel® Architecture Optimization B-6 Non-Sleep Clockticks The performance monitoring counters can also be configured to count clocks whenever the performance monitoring hardware is not powered-down. To count “non-sleep clockticks” with a performance-monitoring counter, do the following: • Select any one of the 18 counters. • Sel[...]

  • Page 461

    Using Performance Monitoring Events B B-7 that logical processor is not halted (it may include some portion of the clock cycles for that logical processor to complete a transition into a halted state). A physical processor that supports Hyper-Threading Technology enters into a power-saving state if all logical processors are halted. “Non-sleep[...]

  • Page 462

    IA-32 Intel® Architecture Optimization B-8 Microarchitecture Notes Trace Cache Events The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data. As with most microarchitectural elements, trace cache performance is[...]

  • Page 463

    Using Performance Monitoring Events B B-9 There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ. A two-way SMP configuration is illustrated. 1st-level cache misses and writebacks (also called core references) result in references to the 2nd-[...]

  • Page 464

    IA-32 Intel® Architecture Optimization B-10 Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus (diagram labels: Chip Set; System Memory; and, for each of two processors, 1st Level Data Cache, Unified 2nd Level Cache, 3rd Level Cache, BSQ, FSB_IOQ)[...]

  • Page 465

    Using Performance Monitoring Events B B-11 Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partials, e.g., uncacheable and write combining reads, uncacheable, write-through and write-protect writes, and all I/O. Writeback locks, streaming stores and write combining stores may be full line or [...]
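A minimal sketch of how the nominal 64-byte size can be used when reading event counts, assuming (hypothetically) that every counted core reference is a full line. Because partials are smaller, the result is only an upper bound. The function name `core_reference_bytes_upper_bound` is illustrative and not part of any Intel tool.

```c
#include <stdint.h>

/* Nominal size of a core reference: one 1st-level cache line. */
#define CORE_REFERENCE_BYTES 64u

/* Upper-bound estimate of the bytes moved by `references` core references,
   under the stated assumption that each one is a full 64-byte line.
   Partials are smaller, so real traffic may be lower. */
uint64_t core_reference_bytes_upper_bound(uint64_t references)
{
    return references * CORE_REFERENCE_BYTES;
}
```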

  • Page 466

    IA-32 Intel® Architecture Optimization B-12 • IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses Writebacks (dirty evictions) • BSQ_cache_reference: 64 bytes • BSQ_allocation: 64 bytes • BSQ_active_entries: 64 bytes • IOQ_allocation, IOQ_active_entries: 64 bytes The count of IOQ[...]

  • Page 467

    Using Performance Monitoring Events B B-13 transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often this happens. It is less likely to occur for applications with poor locality of writes to the 3rd-level cache, and of course cannot happen when no 3rd-level cache is present. Usage Notes for Speci[...]

  • Page 468

    IA-32 Intel® Architecture Optimization B-14 Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write misses. Programmatic writes that miss must get the rest of the cache line and merge the new data. Such a request is called a read for ownership (RFO). To the “BSQ_cache_reference” hardw[...]

  • Page 469

    Using Performance Monitoring Events B B-15 Usage Notes on Bus Activities A number of performance metrics in Table B-1 are based on IOQ_active_entries and BSQ_active_entries. The next three paragraphs provide information on various bus transaction underway metrics. These metrics nominally measure the end-to-end latency of transactions entering t[...]

  • Page 470

    IA-32 Intel® Architecture Optimization B-16 accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads that frequently thrash (miss or prefetch a lot into) the 2nd-level cache but hit in the 3rd-level cache. This effect may be less of a factor for workloads that miss all on-chip caches, since [...]

  • Page 471

    Using Performance Monitoring Events B B-17 an expression built up from other metrics; for example, IPC is derived from two single-event metrics. • Column 2 provides a description of the metric in column 1. Please refer to the previous section, “Pentium 4 Processor-Specific Terminology” for various terms that are specific to the Pentium 4 pr[...]
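To illustrate what a metric "built up from other metrics" looks like, here is a minimal sketch of IPC (instructions per cycle) computed from two raw counts. It assumes the two inputs are an instructions-retired count and a clocktick count gathered over the same interval. The function name `ipc` and the zero-clocktick guard are illustrative choices, not something the excerpt specifies.

```c
#include <stdint.h>

/* Derived metric: instructions per cycle, built from two single-event
   counts. Returns 0.0 when no clockticks were counted, to avoid dividing
   by zero (an illustrative convention, not one defined by the manual). */
double ipc(uint64_t instructions_retired, uint64_t clockticks)
{
    if (clockticks == 0)
        return 0.0;
    return (double)instructions_retired / (double)clockticks;
}
```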

  • Page 472

    IA-32 Intel® Architecture Optimization B-18 Table B-1 Pentium 4 Processor Performance Metrics Metric Description Event Name or Metric Expression Event Mask Value Required General Metrics Non-Sleep Clockticks The number of clockticks while a processor is not in any sleep modes. See explanation on how to count clocks in section “Coun[...]

  • Page 473

    Using Performance Monitoring Events B B-19 Speculative Uops Retired Number of uops retired (include both instructions executed to completion and speculatively executed in the path of branch mispredictions). uops_retired NBOGUS|BOGUS Branching Metrics Branches Retired All branch instructions executed to completion Branch_retired MMTM | MMNM | M[...]

  • Page 474

    IA-32 Intel® Architecture Optimization B-20 Mispredicted returns The number of mispredicted returns including all causes. retired_mispred_branch_type RETURN All conditionals The number of branches that are conditional jumps (may overcount if the branch is from build mode or there is a machine clear near the branch) retired_branch_type CONDIT[...]

  • Page 475

    Using Performance Monitoring Events B B-21 TC Flushes Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to get the number of flushes.) TC_misc FLUSH Logical Processor 0 Deliver Mode The number of cycles that the trace and delivery engine (TDE) is delivering traces associated with logical processor 0, reg[...]
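The TC Flushes note above amounts to a one-line correction of the raw count. A minimal sketch follows, with a hypothetical function name `tc_flushes`; the raw value is assumed to be the TC_misc count collected with the FLUSH mask.

```c
#include <stdint.h>

/* The counter increments twice per trace cache flush, so halve the raw
   TC_misc (FLUSH) count to get the number of flushes. */
uint64_t tc_flushes(uint64_t raw_tc_misc_flush_count)
{
    return raw_tc_misc_flush_count / 2;
}
```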

  • Page 476

    IA-32 Intel® Architecture Optimization B-22 Logical Processor 1 Deliver Mode The number of cycles that the trace and delivery engine (TDE) is delivering traces associated with logical processor 1, regardless of the operating modes of the TDE for traces associated with logical processor 0. This metric is applicable only if a physical processor[...]

  • Page 477

    Using Performance Monitoring Events B B-23 Logical Processor 0 Build Mode The number of cycles that the trace and delivery engine (TDE) is building traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1. If a physical processor supports only one logical processor, [...]

  • Page 478

    IA-32 Intel® Architecture Optimization B-24 Trace Cache Misses The number of times that significant delays occurred in order to decode instructions and build a trace because of a TC miss. BPU_fetch_request TCMISS TC to ROM Transfers Twice the number of times that the ROM microcode is accessed to decode complex IA-32 instructions instead[...]

  • Page 479

    Using Performance Monitoring Events B B-25 Memory Metrics Page Walk DTLB All Misses The number of page walk requests due to DTLB misses from either load or store. page_walk_type DTMISS 1st-Level Cache Load Misses Retired The number of retired μops that experienced 1st-Level cache load misses. This stat is often used in a per-instructio[...]

  • Page 480

    IA-32 Intel® Architecture Optimization B-26 64K Aliasing Conflicts 1 The number of 64K aliasing conflicts. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The performance penalty resulting from 64K-aliasing conflict can vary from being unnoticeable to considerable. Some implementations of the Penti[...]

  • Page 481

    Using Performance Monitoring Events B B-27 MOB Load Replays The number of replayed loads related to the Memory Order Buffer (MOB). This metric counts only the case where the store-forwarding data is not an aligned subset of the stored data. MOB_load_replay PARTIAL_DATA, UNALGN_ADDR 2nd-Level Cache Read Misses 2 The number of 2nd-level cach[...]

  • Page 482

    IA-32 Intel® Architecture Optimization B-28 2nd-Level Cache Reads Hit Shared The number of 2nd-level cache read references (loads and RFOs) that hit the cache line in shared state. Beware of granularity differences. BSQ_cache_reference RD_2ndL_HITS 2nd-Level Cache Reads Hit Modified The number of 2nd-level cache read references (loads and [...]

  • Page 483

    Using Performance Monitoring Events B B-29 3rd-Level Cache Reads Hit Modified The number of 3rd-level cache read references (loads and RFOs) that hit the cache line in modified state. Beware of granularity differences. BSQ_cache_reference RD_3rdL_HITM 3rd-Level Cache Reads Hit Exclusive The number of 3rd-level cache read references (loads an[...]

  • Page 484

    IA-32 Intel® Architecture Optimization B-30 All WCB Evictions The number of times a WC buffer eviction occurred due to any causes (This can be used to distinguish 64K aliasing cases that contribute more significantly to performance penalty, e.g., stores that are 64K aliased. A high count of this metric when there is no significant contribu[...]

  • Page 485

    Using Performance Monitoring Events B B-31 Bus Metrics Bus Accesses from the Processor The number of all bus transactions that were allocated in the IO Queue from this processor. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value[...]

  • Page 486

    IA-32 Intel® Architecture Optimization B-32 Prefetch Ratio Fraction of all bus transactions (including retires) that were for HW or SW prefetching. (Bus Accesses – Nonprefetch Bus Accesses)/ (Bus Accesses) FSB Data Ready The number of front-side bus clocks that the bus is transmitting data driven by this processor (includes full reads|w[...]
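The Prefetch Ratio expression above can be evaluated directly from the two raw counts. The sketch below is illustrative only: the function name `prefetch_ratio` is hypothetical, and returning 0.0 for an empty or inconsistent sample is an assumption made here, not behavior described in the manual.

```c
#include <stdint.h>

/* Prefetch Ratio = (Bus Accesses - Nonprefetch Bus Accesses) / Bus Accesses.
   Returns 0.0 if there were no bus accesses, or if the non-prefetch count
   exceeds the total (which would indicate mismatched samples). */
double prefetch_ratio(uint64_t bus_accesses, uint64_t nonprefetch_bus_accesses)
{
    if (bus_accesses == 0 || nonprefetch_bus_accesses > bus_accesses)
        return 0.0;
    return (double)(bus_accesses - nonprefetch_bus_accesses)
         / (double)bus_accesses;
}
```

For instance, 1000 bus accesses of which 750 are non-prefetch gives a ratio of 0.25.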

  • Page 487

    Using Performance Monitoring Events B B-33 Writes from the Processor The number of all write transactions on the bus that were allocated in IO Queue from this processor (excludes RFOs). Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model fiel[...]

  • Page 488

    IA-32 Intel® Architecture Optimization B-34 All WC from the Processor The number of Write Combining memory transactions on the bus that originated from this processor. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 an[...]

  • Page 489

    Using Performance Monitoring Events B B-35 Bus Accesses from All Agents The number of all bus transactions that were allocated in the IO Queue by all agents. Beware of granularity issues with this event. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field value of 2 and model value [...]

  • Page 490

    IA-32 Intel® Architecture Optimization B-36 Bus Reads Underway from the processor 7 This is an accrued sum of the durations of all read (includes RFOs) transactions by this processor. Divide by “Reads from the Processor” to get bus read request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processo[...]
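The division described above (accrued durations divided by the transaction count) can be sketched as follows. The function name `avg_bus_read_latency` is hypothetical, and the unit of the result is whatever clock the "underway" metric accrues in, which this excerpt does not state.

```c
#include <stdint.h>

/* Average bus read request latency: the accrued sum of read-transaction
   durations ("Bus Reads Underway from the processor") divided by the
   number of reads ("Reads from the Processor"). Returns 0.0 when no reads
   were counted (an illustrative convention). */
double avg_bus_read_latency(uint64_t bus_reads_underway,
                            uint64_t reads_from_processor)
{
    if (reads_from_processor == 0)
        return 0.0;
    return (double)bus_reads_underway / (double)reads_from_processor;
}
```

The same pattern applies to the other "Underway" metrics, each divided by its matching transaction count.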

  • Page 491

    Using Performance Monitoring Events B B-37 All UC Underway from the processor 7 This is an accrued sum of the durations of all UC transactions by this processor. Divide by “All UC from the processor” to get UC request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between CPUID model field [...]

  • Page 492

    IA-32 Intel® Architecture Optimization B-38 Bus Writes Underway from the processor 7 This is an accrued sum of the durations of all write transactions by this processor. Divide by “Writes from the Processor” to get bus write request latency. Also beware of different recipes in mask bits for Pentium 4 and Intel Xeon processors between[...]

  • Page 493

    Using Performance Monitoring Events B B-39 Write WC Full (BSQ) The number of write (but neither writeback nor RFO) transactions to WC-type memory. BSQ_allocation 1. REQ_TYPE1 | REQ_LEN0 | REQ_LEN1 | MEM_TYPE0 | REQ_DEM_TYPE 2. Enable edge filtering 6 in the CCCR. Write WC Partial (BSQ) The number of partial write transactions to WC-[...]

  • Page 494

    IA-32 Intel® Architecture Optimization B-40 Reads Non-prefetch Full (BSQ) The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event. BSQ_allocation 1. REQ_LEN0 | REQ_LEN1 | MEM_TYPE1 | MEM_TYPE2 | REQ_CACHE_TYPE | REQ_DEM_TYPE 2. Enable edge filtering 6 in the CCCR. Re[...]

  • Page 495

    Using Performance Monitoring Events B B-41 UC Write Partial (BSQ) The number of UC write transactions. Beware of granularity issues between BSQ and FSB IOQ events. BSQ_allocation 1. REQ_TYPE0 | REQ_LEN0 | REQ_SPLIT_TYPE | REQ_ORD_TYPE | REQ_DEM_TYPE 2. Enable edge filtering 6 in the CCCR. IO Reads Chunk (BSQ) The number of 8-byte aligned IO po[...]

  • Page 496

    IA-32 Intel® Architecture Optimization B-42 WB Writes Full Underway (BSQ) 8 This is an accrued sum of the durations of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency. User note: Beware of effects of writebacks from 2nd-level cache that are quickly satis[...]

  • Page 497

    Using Performance Monitoring Events B B-43 Write WC Partial Underway (BSQ) 8 This is an accrued sum of the durations of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. User note: Allocated entries of WC partials that originate from DWord operands are not included. B[...]

  • Page 498

    IA-32 Intel® Architecture Optimization B-44 SSE Input Assists The number of occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts. SSE_input_assist ALL Packed SP Retired 3 Non-bogus packed single-precision instructions retired. Execution_ev[...]

  • Page 499

    Using Performance Monitoring Events B B-45 1. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting performance penalty can vary from unnoticeable to considerable. Some implementations of the Pentium 4 processor family can incur significant penalties from loads that alias to preceding s[...]

  • Page 500

    IA-32 Intel® Architecture Optimization B-46 4. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single μop. However, transcendental and some x87 instructions decode into several μops; in these limited cases, the metrics will count the number of μops that are actually tagged. 5. This metric[...]

  • Page 501

    Using Performance Monitoring Events B B-47 Table B-2 Metrics That Utilize Replay Tagging Mechanism Replay Metric Tags 1 Bit field to set: IA32_PEBS_ENABLE Bit field to set: MSR_PEBS_MATRIX_VERT Additional MSR See Event Mask Parameter for Replay_event 1stL_cache_load_miss_retired Bit 0, BIT 24, BIT 25 Bit 0 None NBOGUS 2ndL_cache_load_[...]

  • Page 502

    IA-32 Intel® Architecture Optimization B-48 Tags for front_end_event Table B-3 provides a list of the tags that are used by various metrics derived from the front_end_event . The event names referenced in column 2 can be found from the Pentium 4 processor performance monitoring events. Tags for execution_event Table B-4 provides a list [...]

  • Page 503

    Using Performance Monitoring Events B B-49 Table B-4 Metrics That Utilize the Execution Tagging Mechanism Execution Metric Tags Upstream ESCR Tag Value in Upstream ESCR See Event Mask Parameter for Execution_event Packed_SP_retired Set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop. 1 NBOGUS0 Scalar_S[...]

  • Page 504

    IA-32 Intel® Architecture Optimization B-50 Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3) Using Performance Metrics with Hyper-Threading Technology On Intel Xeon processors that support Hyper-Threading Technology, the performance metrics listed in Table B-1 may be qualified to associate the counts with a [...]

  • Page 505

    Using Performance Monitoring Events B B-51 The performance metrics listed in Table B-1 fall into three categories: • Logical processor specific and supporting parallel counting. • Logical processor specific but constrained by ESCR limitations. • Logical processor independent and not supporting parallel counting. Table B-5 lists performan[...]

  • Page 506

    IA-32 Intel® Architecture Optimization B-52 Branching Metrics: Branches Retired, Tagged Mispredicted Branches Retired, Mispredicted Branches Retired, All returns, All indirect branches, All calls, All conditionals, Mispredicted returns, Mispredicted indirect branches, Mispredicted calls, Mispredicted conditionals. TC and Front End Metrics: Trace Cache[...]

  • Page 507

    Using Performance Monitoring Events B-53 Memory Metrics: Split Load Replays¹, Split Store Replays¹, MOB Load Replays¹, 64k Aliasing Conflicts, 1st-Level Cache Load Misses Retired, 2nd-Level Cache Load Misses Retired, DTLB Load Misses Retired, Split Loads Retired¹, Split Stores Retired¹, MOB Load Replays Retired, Loads Retired, Stores Retired, DTLB St[...]

  • Page 508

    IA-32 Intel® Architecture Optimization B-54 Bus Metrics: Bus Accesses from the Processor¹, Non-prefetch Bus Accesses from the Processor¹, Reads from the Processor¹, Writes from the Processor¹, Reads Non-prefetch from the Processor¹, All WC from the Processor¹, All UC from the Processor¹, Bus Accesses from All Agents¹, Bus Accesses Underway from[...]

  • Page 509

    Using Performance Monitoring Events B-55 Characterization Metrics: x87 Input Assists, x87 Output Assists, Machine Clear Count, Memory Order Machine Clear, Self-Modifying Code Clear, Scalar DP Retired, Scalar SP Retired, Packed DP Retired, Packed SP Retired, 128-bit MMX Instructions Retired, 64-bit MMX Instructions Retired, x87 Instructions Retired, St[...]

  • Page 510

    IA-32 Intel® Architecture Optimization B-56 Using Performance Events of Intel Core Solo and Intel Core Duo Processors There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see Table A-9 of the IA-32 Intel® Architecture Software Developer's Manual, Volume 3B). Understanding [...]

  • Page 511

    Using Performance Monitoring Events B-57 There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH. Some events detect microarchitec[...]

  • Page 512

    IA-32 Intel® Architecture Optimization B-58 • Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, only measurements collected in the same run yield meaningful ratio values. Notes on Selected Events This section provides event-specific notes for interpreting performance events listed in Ta[...]

  • Page 513

    Using Performance Monitoring Events B-59 • Serial_Execution_Cycles, event number 3CH, unit mask 02H This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted. • L1_Pref_Req, event number 4FH, unit mask 00H This event counts the number of times the D[...]

  • Page 514

    IA-32 Intel® Architecture Optimization B-60[...]

  • Page 515

    C-1 C IA-32 Instruction Latency and Throughput This appendix contains tables of the latency, throughput and execution units that are associated with more-commonly-used IA-32 instructions¹. The instruction timing data varies within the IA-32 family of processors. Only data specific to the Intel Pentium 4, Intel Xeon processors and Intel Pent[...]

  • Page 516

    IA-32 Intel® Architecture Optimization C-2 Overview The current generation of the IA-32 family of processors uses out-of-order execution with dynamic scheduling and buffering to tolerate poor instruction selection and scheduling that may occur in legacy code. It can reorder μops to cover latency delays and to avoid resource conflicts. In some ca[...]

  • Page 517

    IA-32 Instruction Latency and Throughput C-3 While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in an expected priority order, though which item contributes most to performance will vary by application. • Maximize the flow of μops into the execution[...]

  • Page 518

    IA-32 Intel® Architecture Optimization C-4 Definitions The IA-32 instruction performance data are listed in several tables. The tables contain the following information: Instruction Name: The assembly mnemonic of each instruction. Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the [...]

  • Page 519

    IA-32 Instruction Latency and Throughput C-5 accurately predict realistic performance of actual code sequences based on adding instruction latency data. • The instruction latency data are useful when tuning a dependency chain. However, dependency chains limit the out-of-order core's ability to execute micro-ops in parallel. The instructi[...]
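
The contrast between a dependency chain and independent operations can be illustrated with a first-order estimate. The sketch below is not a performance predictor (the appendix itself cautions against summing latencies for real code); it only shows why a chain is bounded by latency while independent work is bounded mostly by throughput. The function names are hypothetical, and the model assumes a single instruction type with no other resource conflicts.

```c
#include <assert.h>

/* Rough cycle estimate for n operations where each one consumes the
   result of the previous one: every operation waits out the full latency. */
unsigned chain_cycles(unsigned n, unsigned latency)
{
    return n * latency;
}

/* Rough cycle estimate for n independent operations of the same kind:
   a new one can issue every `throughput` cycles, and the last one still
   needs `latency` cycles to complete. */
unsigned independent_cycles(unsigned n, unsigned latency, unsigned throughput)
{
    assert(n > 0);
    return (n - 1) * throughput + latency;
}
```

For example, with a latency of 4 and a throughput of 2 (the figures Table C-4 appears to list for ADDPS on CPUID 0F2n), eight dependent operations are estimated at 32 cycles and eight independent ones at 18.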

  • Page 520

    IA-32 Intel® Architecture Optimization C-6 Latency and Throughput with Register Operands IA-32 instruction latency and throughput data are presented in Table C-2 through Table C-8. The tables include the Streaming SIMD Extension 3, Streaming SIMD Extension 2, Streaming SIMD Extension, MMX technology and most of the commonly used IA-32 instruct[...]

  • Page 521

    IA-32 Instruction Latency and Throughput C-7 Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions Instruction Latency¹ Throughput Execution Unit² CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n CVTDQ2PS³ xmm, xmm 55 2 2 FP_ADD CVTPS2DQ³ xmm, xmm 55 3+1 2 2 2 FP_ADD CVTTPS2DQ³ xmm, xmm 55 3+1 2 2 2 FP_ADD MOVD [...]

  • Page 522

    IA-32 Intel® Architecture Optimization C-8 PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm 2 2 1 2 2 1 MMX_ALU PEXTRW r32, xmm, imm8 7 7 3 2 2 2 MMX_SHFT, FP_MISC PINSRW xmm, r32, imm8 4 4 1+1 2 2 2 MMX_SHFT, MMX_MISC PMADDWD xmm, xmm 9 8 3+1 2 2 2 FP_MUL PMAX xmm, xmm 2 2 2 2 MMX_ALU PMIN xmm, xmm 2 2 2 2 MMX_ALU PMOVMSKB³ r32, xmm 77 2 2 FP_MIS[...]

  • Page 523

    IA-32 Instruction Latency and Throughput C-9 PSUBB/PSUBW/PSUBD xmm, xmm 2 2 1 2 2 1 MMX_ALU PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm 2 2 1 2 2 1 MMX_ALU PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, xmm 4 4 1+1 2 2 2 MMX_SHFT PUNPCKHQDQ xmm, xmm 4 4 1+1 2 2 2 MMX_SHFT PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ xmm, xmm 2 2 2 2 2 2 MMX_SHFT PUNPCKLQDQ³ xmm, xmm 44 1[...]

  • Page 524

    IA-32 Intel® Architecture Optimization C-10 COMISD xmm, xmm 7 6 1 2 2 1 FP_ADD, FP_MISC CVTDQ2PD xmm, xmm 8 8 4+1 3 3 4 FP_ADD, MMX_SHFT CVTPD2PI mm, xmm 12 11 5 3 3 3 FP_ADD, MMX_SHFT, MMX_ALU CVTPD2DQ xmm, xmm 10 9 5 2 2 3 FP_ADD, MMX_SHFT CVTPD2PS³ xmm, xmm 11 10 2 2 FP_ADD, MMX_SHFT CVTPI2PD xmm, mm 12 11 4+1 2 4 4 FP_ADD, MMX_SHFT [...]

  • Page 525

    IA-32 Instruction Latency and Throughput C-11 DIVPD xmm, xmm 70 69 32+31 70 69 62 FP_DIV DIVSD xmm, xmm 39 38 32 39 38 31 FP_DIV MAXPD xmm, xmm 5 4 4 2 2 2 FP_ADD MAXSD xmm, xmm 5 4 3 2 2 1 FP_ADD MINPD xmm, xmm 5 4 4 2 2 2 FP_ADD MINSD xmm, xmm 5 4 3 2 2 1 FP_ADD MOVAPD xmm, xmm 6 6 1 1 FP_MOVE MOVMSKPD r32, xmm 6 6 2 2 FP_MISC MOVSD xmm,[...]

  • Page 526

    IA-32 Intel® Architecture Optimization C-12 Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions Instruction Latency¹ Throughput Execution Unit² CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n ADDPS xmm, xmm 5 4 4 2 2 2 FP_ADD ADDSS xmm, xmm 5 4 3 2 2 1 FP_ADD ANDNPS³ xmm, xmm 44 22 22 MMX_ALU ANDPS³ xmm, xmm 4[...]

  • Page 527

    IA-32 Instruction Latency and Throughput C-13 MOVLHPS³ xmm, xmm 44 2 2 MMX_SHFT MOVMSKPS r32, xmm 6 6 2 2 FP_MISC MOVSS xmm, xmm 4 4 2 2 MMX_SHFT MOVUPS xmm, xmm 6 6 1 1 FP_MOVE MULPS xmm, xmm 7 6 4+1 2 2 2 FP_MUL MULSS xmm, xmm 7 6 2 2 FP_MUL ORPS³ xmm, xmm 44 22 22 MMX_ALU RCPPS³ xmm, xmm 66 24 42 MMX_MISC RCPSS³ xm[...]

  • Page 528

    IA-32 Intel® Architecture Optimization C-14 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions Instruction Latency¹ Throughput Execution Unit CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n PAVGB/PAVGW mm, mm 2 2 1 1 MMX_ALU PEXTRW r32, mm, imm8 7 7 2 2 2 1 MMX_SHFT, FP_MISC PINSRW mm, r32, imm8 4 4 1 1 1 1 MMX_SHFT, MMX_MI[...]

  • Page 529

    IA-32 Instruction Latency and Throughput C-15 PCMPGTB/PCMPGTD/PCMPGTW mm, mm 22 1 1 MMX_ALU PMADDWD³ mm, mm 98 1 1 FP_MUL PMULHW/PMULLW³ mm, mm 98 1 1 FP_MUL POR mm, mm 2 2 1 1 MMX_ALU PSLLQ/PSLLW/PSLLD mm, mm/imm8 22 1 1 MMX_SHFT PSRAW/PSRAD mm, mm/imm8 22 1 1 MMX_SHFT PSRLQ/PSRLW/PSRLD mm, mm/imm8 22 1 1 [...]

  • Page 530

    IA-32 Intel® Architecture Optimization C-16 Table C-7 IA-32 x87 Floating-point Instructions Instruction Latency¹ Throughput Execution Unit² CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n FABS 3 2 1 1 FP_MISC FADD 6 5 1 1 FP_ADD FSUB 6 5 1 1 FP_ADD FMUL 8 7 2 2 FP_MUL FCOM 3 2 1 1 FP_MISC FCHS 3 2 1 1 FP_MISC FDIV Single Precision 30 2[...]

  • Page 531

    IA-32 Instruction Latency and Throughput C-17 FSCALE⁴ 60 7 FRNDINT⁴ 30 11 FXCH⁵ 0 1 FP_MOVE FLDZ⁶ 0 FINCSTP/FDECSTP⁶ 0 See “Table Footnotes” Table C-8 IA-32 General Purpose Instructions Instruction Latency¹ Throughput Execution Unit² CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n ADC/SBB reg, reg 8 8 3 3 ADC/SBB reg, imm 8 6 2 2 [...]

  • Page 532

    IA-32 Intel® Architecture Optimization C-18 Jcc⁷ Not Applicable 0.5 ALU LOOP 8 1.5 ALU MOV 1 0.5 0.5 0.5 ALU MOVSB/MOVSW 1 0.5 0.5 0.5 ALU MOVZB/MOVZW 1 0.5 0.5 0.5 ALU NEG/NOT/NOP 1 0.5 0.5 0.5 ALU POP r32 1.5 1 MEM_LOAD, ALU PUSH 1.5 1 MEM_STORE, ALU RCL/RCR reg, 1 8 64 1 1 ROL/ROR 1 4 0.5 1 RET 8 1 MEM_LOAD, ALU SAHF 1 0.5 0.5 0[...]

  • Page 533

    IA-32 Instruction Latency and Throughput C-19 Table Footnotes The following footnotes refer to all tables in this appendix. 1. Latency figures for many of the complex instructions (> 4 μops) are conservative, worst-case estimates. Actual performance of these instructions by the out-of-order core execution un[...]

  • Page 534

    IA-32 Intel® Architecture Optimization C-20 4. Latency and throughput of transcendental instructions can vary substantially in a dynamic execution environment. Only an approximate value or a range of values is given for these instructions. 5. The FXCH instruction has 0 latency in code sequences. However, it is limited to an issue rate of one[...]

  • Page 535

    IA-32 Instruction Latency and Throughput C-21 For the sake of simplicity, all data being requested is assumed to reside in the first-level data cache (cache hit). In general, IA-32 instructions with load operations that execute in the integer ALU units require two more clock cycles than the corresponding register-to-register flavor of the same[...]

  • Page 536

    IA-32 Intel® Architecture Optimization C-22[...]

  • Page 537

    D-1 D Stack Alignment This appendix details the alignment of the stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2. Stack Frames This section describes the stack alignment conventions for both esp-based (normal) and ebp-based (debug) stack frames. A stack frame is a contiguous block of memory allocated to a func[...]

  • Page 538

    IA-32 Intel® Architecture Optimization D-2 alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte aligned (they will now be 16-byte aligned). For variables allocated in the stack frame, the compiler cannot guarantee the base of the variable is aligned unless it also ensures that the stack fra[...]

  • Page 539

    Stack Alignment D-3 As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller. Using call graph profiling of the VTune analyzer, calls to the normal (unaligned) entry point can be optimized into calls to the (alternate) aligned entry point when the stack can be pr[...]

  • Page 540

    Stack Alignment D-4 Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate aligned entry point. Aligned esp-Based Stack Frames This section discusses data and parameter alignment and the declspec(align) extended attribute, which can be used to request alignment i[...]

  • Page 541

    Stack Alignment D-5

    Example D-1 Aligned esp-Based Stack Frames

        void _cdecl foo (int k)
        {
            int j;
        foo:                        // See Note A
            push ebx
            mov  ebx, esp
            sub  esp, 0x00000008
            and  esp, 0xfffffff0
            add  esp, 0x00000008
            jmp  common
        foo.aligned:
            push ebx
            mov  ebx, esp
        common:                     // See Note B
            push edx
            sub  esp, 20
            j = k;
            mov  edx, [ebx + 8]
            mov  [esp + 16], edx
            foo(5);
            mov  [e[...]
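
The three-instruction adjustment at the unaligned entry point foo (sub, and, add) can be checked with ordinary integer arithmetic. The following is a minimal sketch, assuming a 32-bit stack pointer modeled as a plain integer; the function name realign_esp is hypothetical and not part of the manual. It shows that the sequence leaves esp congruent to 8 modulo 16 (matching the "esp is (8 mod 16) after add" annotation in Example D-2) and only ever moves it downward, by at most 15 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Models the prologue adjustment at the unaligned entry point.
   Assumes esp >= 8 so the subtraction does not wrap around. */
uint32_t realign_esp(uint32_t esp)
{
    assert(esp >= 8u);
    esp -= 0x00000008u;        /* sub esp, 0x00000008 */
    esp &= 0xfffffff0u;        /* and esp, 0xfffffff0 : round down to 16 */
    esp += 0x00000008u;        /* add esp, 0x00000008 */
    assert(esp % 16u == 8u);   /* same alignment as after push at foo.aligned */
    return esp;
}
```

An esp that is already 8 mod 16 is left unchanged, so both entry points reach the common label with the same alignment.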

  • Page 542

    Stack Alignment D-6 Aligned ebp-Based Stack Frames In ebp-based frames, padding is also inserted immediately before the return address. However, this frame is slightly unusual in that the return address may actually reside in two different places in the stack. This occurs whenever padding must be added and exception handling is in effect for[...]

  • Page 543

    Stack Alignment D-7

    Example D-2 Aligned ebp-Based Stack Frames

        void _stdcall foo (int k)
        {
            int j;
        foo:
            push ebx
            mov  ebx, esp
            sub  esp, 0x00000008
            and  esp, 0xfffffff0
            add  esp, 0x00000008    // esp is (8 mod 16) after add
            jmp  common
        foo.aligned:
            push ebx                // esp is (8 mod 16) after push
            mov  ebx, esp
        common:
            push ebp                // this slot will be used for
                                    // du[...]

  • Page 544

    Stack Alignment D-8

                                    // the goal is to make esp and ebp
                                    // (0 mod 16) here
            j = k;
            mov  edx, [ebx + 8]     // k is (0 mod 16) if caller aligned
                                    // its stack
            mov  [ebp - 16], edx    // j is (0 mod 16)
            foo(5);
            add  esp, -4            // normal call sequence to
                                    // unaligned entry
            mov  [esp], 5
            call foo                // for stdcall, callee
                                    // cleans up stack
            foo.aligned(5);
            add  esp, -16           /[...]

  • Page 545

    Stack Alignment D-9 Stack Frame Optimizations The Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up and used. These optimizations are as follows: • If a procedure is defined to leave the stack frame 16-byte-aligned and it calls another procedure that requires 16-byte alignment, then the cal[...]

  • Page 546

    IA-32 Intel® Architecture Optimization D-10 Inlined Assembly and ebx When using aligned frames, the ebx register generally should not be modified in inlined assembly blocks since ebx is used to keep track of the argument block. Programmers may modify ebx only if they do not need to access the arguments and provided they save ebx and restore i[...]

  • Page 547

    E-1 E Mathematics of Prefetch Scheduling Distance This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model allowing you to deduce a simplified equation which you can use for determining the prefetch scheduling distance (PSD) for your application. For your convenience, the first section presents this [...]

  • Page 548

    IA-32 Intel® Architecture Optimization E-2 Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heuristic equation assuming that parameters have the values as indicated: where 60 corresponds to Nlookup, 25 to Nxfer, and 1.5 to CPI. The values of the parameters in the equation can be deri[...]
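
The heuristic equation itself does not survive in this excerpt. As a rough sketch only, the code below assumes it has the form psd = ceil((Nlookup + Nxfer * (Npref + Nst)) / (CPI * Ninst)), where Npref and Nst (the number of cache lines prefetched and stored per iteration) are assumptions not visible in the text above. The function name psd_heuristic and its parameter names are hypothetical.

```c
#include <assert.h>

/* Assumed form of the simplified PSD equation:
   ceil((Nlookup + Nxfer * (Npref + Nst)) / (CPI * Ninst)).
   All inputs are expected to be non-negative. */
int psd_heuristic(double n_lookup, double n_xfer,
                  int n_pref, int n_st, double cpi, int n_inst)
{
    assert(cpi > 0.0 && n_inst > 0);
    double clocks_to_cover = n_lookup + n_xfer * (double)(n_pref + n_st);
    double clocks_per_iter = cpi * (double)n_inst;
    double ratio = clocks_to_cover / clocks_per_iter;
    int psd = (int)ratio;        /* truncates toward zero */
    if ((double)psd < ratio) {
        psd += 1;                /* round up to a whole iteration */
    }
    return psd;
}
```

With the values named in the text (Nlookup = 60, Nxfer = 25, CPI = 1.5), one prefetched line, no stores, and a 20-instruction loop body, this form gives ceil(85 / 30) = 3 iterations.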

  • Page 549

    Mathematics of Prefetch Scheduling Distance E-3 Tb: data transfer latency, which is equal to number of lines per iteration * line burst latency. Note that the potential effects of µop reordering are not factored into the estimations discussed. Examine Example E-1 that uses the prefetchnta instruction with a prefetch scheduling distance of 3, tha[...]

  • Page 550

    IA-32 Intel® Architecture Optimization E-4 Memory access plays a pivotal role in prefetch scheduling. For a better understanding of the memory subsystem, consider the Streaming SIMD Extensions and Streaming SIMD Extensions 2 memory pipeline depicted in Figure E-1. Assume that three cache lines are accessed per iteration and four chunks of data are retu[...]

  • Page 551

    Mathematics of Prefetch Scheduling Distance E-5 Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio, memory manufacturer and memory controller (chipset). The dynamic variants include the memory page open/miss occasions, memory access sequences, different memory types, and[...]

  • Page 552

    IA-32 Intel® Architecture Optimization E-6 No Preloading or Prefetch The traditional programming approach does not perform data preloading or prefetch. It is sequential in nature and will experience stalls because the memory is unable to provide the data immediately when the execution pipeline requires it. Examine Figure E-2. As you can see [...]

  • Page 553

    Mathematics of Prefetch Scheduling Distance E-7 The iteration latency is approximately equal to the computation latency plus the memory leadoff latency (which includes cache miss latency, chipset latency, bus arbitration, and so on) plus the data transfer latency, where transfer latency = number of lines per iteration * line burst latency. This m[...]
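
The no-prefetch relationship stated above can be written out directly. This is a minimal sketch with hypothetical function names; Tc, Tl and Tb follow the appendix's notation for computation, leadoff and data transfer latency, and all values are in clocks.

```c
#include <assert.h>

/* Tb = number of lines per iteration * line burst latency */
int transfer_latency(int lines_per_iteration, int line_burst_latency)
{
    assert(lines_per_iteration >= 0 && line_burst_latency >= 0);
    return lines_per_iteration * line_burst_latency;
}

/* Without preloading or prefetch nothing overlaps, so the per-iteration
   latency is approximately Tc + Tl + Tb. */
int iteration_latency_no_prefetch(int tc, int tl,
                                  int lines_per_iteration,
                                  int line_burst_latency)
{
    return tc + tl + transfer_latency(lines_per_iteration, line_burst_latency);
}
```

As an illustration with made-up numbers, Tc = 10, Tl = 18 and two lines at a burst latency of 4 clocks each give an iteration latency of about 36 clocks.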

  • Page 554

    IA-32 Intel® Architecture Optimization E-8 The following formula shows the relationship among the parameters: It can be seen from this relationship that the iteration latency is equal to the computation latency, which means the memory accesses are executed in the background and their latencies are completely hidden. Compute Bound (Case: Tl + Tb [...]

  • Page 555

    Mathematics of Prefetch Scheduling Distance E-9 For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched for iteration i will be consumed in iteration i+2. Figure E-4 represents the case when the leadoff latency plus data transfer latency is greater than the compute latency, which is greater than[...]

  • Page 556

    IA-32 Intel® Architecture Optimization E-10 Memory Throughput Bound (Case: Tb >= Tc) When the application or loop is memory throughput bound, there is no way to hide the memory latency. Under such circumstances, the burst latency is always greater than the compute latency. Examine Figure E-5. The following relationship calculates the pre[...]

  • Page 557

    Mathematics of Prefetch Scheduling Distance E-11 memory to you cannot do much about it. Typically, copying data from one space to another, for example a graphics driver moving data from writeback memory to write-combining memory, belongs to this category, where the performance advantage from prefetch instructions will be marginal. Example [...]

  • Page 558

    IA-32 Intel® Architecture Optimization E-12 Now for the case Tl = 18, Tb = 8 (2 cache lines are needed per iteration), examine the following graph. Consider the graph of accesses per iteration in example 1, Figure E-6. The prefetch scheduling distance is a step function of Tc, the computation latency. The steady state iteration latency (il [...]
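
The step-function behavior can be explored numerically. The per-case formulas are truncated in this excerpt, so the sketch below assumes the following model: psd = 1 when Tc >= Tl + Tb; psd = ceil((Tl + Tb) / Tc) when Tl + Tb > Tc > Tb; psd = ceil((Tl + Tb) / Tb) when Tb >= Tc; and a steady-state iteration latency il equal to the larger of Tc and Tb. The type and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int psd;  /* prefetch scheduling distance, in iterations */
    int il;   /* assumed steady-state iteration latency, in clocks */
} prefetch_plan;

/* Integer ceiling of a / b for positive operands. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Assumed three-case model; tc, tl and tb are clocks and must be positive. */
prefetch_plan plan_prefetch(int tc, int tl, int tb)
{
    assert(tc > 0 && tl > 0 && tb > 0);
    prefetch_plan p;
    if (tc >= tl + tb) {            /* compute bound, one iteration ahead suffices */
        p.psd = 1;
        p.il = tc;
    } else if (tc > tb) {           /* compute bound, Tl + Tb > Tc > Tb */
        p.psd = ceil_div(tl + tb, tc);
        p.il = tc;
    } else {                        /* memory throughput bound, Tb >= Tc */
        p.psd = ceil_div(tl + tb, tb);
        p.il = tb;
    }
    return p;
}
```

For Tl = 18 and Tb = 8, this model yields psd = 1 at Tc = 26 or more, psd = 2 at Tc = 13, psd = 3 at Tc = 10, and psd = 4 with il pinned at 8 once Tc drops to 8 or below.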

  • Page 559

    Mathematics of Prefetch Scheduling Distance E-13 In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are allowed at a time in the Pentium III and Pentium 4 processors. Hence a transaction bubble or gap, Tg (a gap due to the idle bus of imperfect front-side bus pipelining), will be observed on FSB activ[...]

  • Page 560

    IA-32 Intel® Architecture Optimization E-14[...]

  • Page 561

    Index-1 Index 64-bit mode default operand size, 8-1 introduction, 8-1 legacy instructions, 8-1 multiplication notes, 8-2 register usage, 8-2, 8-4 sign-extension, 8-3 software prefetch, 8-6 using CVTSI2SS & CVTSI2SD, 8-6 A absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute value, 4-25 accesses per[...]

  • Page 562

    IA-32 Intel® Architecture Optimization Index-2 coding methodologies, 3-13 coding techniques, 3-12 absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute value, 4-25 clipping to an arbitrary signed range, 4-26 clipping to an arbitrary unsigned range, 4-28 generating constants, 4-21 interleaved pack with[...]

  • Page 563

    Index Index-3 floating-point stalls, 2-72 flow dependency, E-7 flush to zero, 5-22 FXCH instruction, 2-70 G general optimization techniques, 2-1 branch prediction, 2-15 static prediction, 2-19 generating constants, 4-21 H horizontal computations, 5-18 hotspots, 3-10 Hyper-Threading Technology, 7-1 avoid excessive software prefetches, 7-36 cache[...]

  • Page 564

    IA-32 Intel® Architecture Optimization Index-4 L large load stalls, 2-37 latency, 2-72, 6-5 lea instruction, 2-74 loading and storing to and from the same DRAM page, 4-39 loop blocking, 3-34 loop unrolling, 2-26 loop unrolling option, A-5, A-6 M memory bank conflicts, 6-3 memory O=optimization U=using P=prefetch, 6-18 memory operands, 2-71 memor[...]

  • Page 565

    Index Index-5 O optimizing cache utilization cache management, 6-44 examples, 6-15 non-temporal store instructions, 6-10 prefetch and load, 6-9 prefetch Instructions, 6-8 prefetching, 6-7 SFENCE instruction, 6-15, 6-16 streaming, non-temporal stores, 6-10 optimizing floating-point applications copying, shuffling, 5-17 data arrangement, 5-4 dat[...]

  • Page 566

    IA-32 Intel® Architecture Optimization Index-6 R reciprocal instructions, 5-2 rounding control option, A-6 S sampling event-based, A-10 Self-modifying code, 2-47 SFENCE Instruction, 6-15, 6-16 signed unpack, 4-7 SIMD integer code, 4-2 SIMD-floating-point code, 5-1 simplified 3D geometry pipeline, 6-22 simplified clipping to an arbitrary signed r[...]
