AMD x86 manual

A good instruction manual

The law requires the seller to provide the buyer with the AMD x86 instruction manual along with the product. A missing manual, or incorrect information given to the consumer, is grounds for a complaint that the product does not conform to the contract. The law permits supplying the manual in a form other than paper, which has recently become quite common: manufacturers provide a graphical manual, an electronic version of the AMD x86 manual, or instructional videos for users. The condition is that it be legible and understandable.

What is an instruction manual?

The name comes from the Latin word “instructio”, meaning to arrange. An AMD x86 manual therefore describes the steps to follow. The purpose of a manual is to teach, and to make it easier to start up or use a device or to carry out specific tasks. An instruction manual is also a source of information about an object or a service; it is a guide.

Unfortunately, few users take the time to read AMD x86 manuals. A good manual, however, not only reveals additional features of the purchased device but also helps avoid most failures.

So what should the perfect instruction manual contain?

Above all, an AMD x86 instruction manual should contain:
- information about the technical specifications of the AMD x86 device
- the manufacturer's name and the year of manufacture of the AMD x86 device
- conditions of use, configuration, and maintenance of the AMD x86 device
- safety marks and certificates confirming compliance with the relevant standards

Why don't we read instruction manuals?

Usually because of a lack of time and of confidence about specific features of the purchased devices. Unfortunately, connecting and switching on AMD x86 is not enough. An instruction manual always contains a number of pointers on specific features, safety rules, maintenance advice (including which products to use), possible AMD x86 faults, and ways to solve problems that may occur during use. Finally, a manual gives the AMD technical service details in case the proposed solutions do not work. Manuals in the form of engaging animations or videos are currently popular, and they reach users much better than a booklet does. This kind of manual helps the user watch the whole video without skipping the specifications and complicated technical descriptions of AMD x86, as tends to happen with a paper version.

Why is it worth reading instruction manuals?

Above all, they contain the answers about the construction and capabilities of the AMD x86 device, the use of particular accessories, and a range of information that lets you take full advantage of its functions and conveniences.

After a successful purchase of equipment or a device, it is worth taking a moment to become familiar with each part of the AMD x86 manual. Manuals are now prepared and translated with care, so that they are not only understandable to users but also fulfill their basic function of informing and helping.

Index of manual pages

  • Page 1

    AMD Athlon™ Processor x86 Code Optimization Guide[...]

  • Page 2

    Trademarks
    AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. MMX is a trademark and Pentium is a registere[...]

  • Page 3

    22007E/0 - November 1999
    AMD Athlon™ Processor x86 Code Optimization
    Contents
    Revision History ... xv
    1 Introduction ... 1
    About this Document ... 1
    AMD Athlon™ Processor F[...]

  • Page 4

    Switch Statement Usage ... 21
    Optimize Switch Statements ... 21
    Use Prototypes for All Functions ... [...]

  • Page 5

    Use 8-Bit Sign-Extended Displacements ... 39
    Code Padding Using Neutral Code Fillers ... 39
    Recommendations for the AMD Athlon Processor ... 40
    Recommendations f[...]

  • Page 6

    7 Scheduling Optimizations ... 67
    Schedule Instructions According to their Latency ... 67
    Unrolling Loops ... 67
    Complete Loop Unrolling ... [...]

  • Page 7

    Signed Derivation for Algorithm, Multiplier, and Shift Factor ... 95
    9 Floating-Point Optimizations ... 97
    Ensure All FPU Data is Aligned ... [...]

  • Page 8

    Fast Conversion of Signed Words to Floating-Point ... 113
    Use MMX PXOR to Negate 3DNow! Data ... 113
    Use MMX PCMP Instead of 3DNow! PFCMP ... 114
    Use MMX Instruct[...]

  • Page 9

    Integer Execution Unit ... 135
    Floating-Point Scheduler ... 136
    Floating-Point Execution Unit ... 137
    Loa[...]

  • Page 10

    PerfCtr[3:0] MSRs (MSR Addresses C001_0004h to C001_0007h) ... 167
    Starting and Stopping the Performance-Monitoring Counters ... 168
    Event and Time-Stamp[...]

  • Page 11

    List of Figures
    Figure 1. AMD Athlon™ Processor Block Diagram ... 131
    Figure 2. Integer Execution Pipeline ... 135
    Figure 3. Floating-Point Unit Block Diagram ... [...]

  • Page 12

    List of Figures[...]

  • Page 13

    List of Tables
    Table 1. Latency of Repeated String Instructions ... 84
    Table 2. Integer Pipeline Operation Types ... 149
    Table 3. Integer Decode Types ... [...]

  • Page 14

    Table 29. VectorPath Integer Instructions ... 231
    Table 30. VectorPath MMX Instructions ... 234
    Table 31. VectorPath MMX Extensions ... 234[...]

  • Page 15

    Revision History
    Date: Nov. 1999; Rev: E; Description: Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14. Added the optimization, “Use Array Style Instead of Pointer Style Cod[...]

  • Page 16

    Revision History[...]

  • Page 17

    1 Introduction
    The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new lev[...]

  • Page 18

    previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:
    Chapter 1: Introduction. Outlines the material covered in this document. Summ[...]

  • Page 19

    Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
    Appendix C: Implementation of Write Combining. Describes the algorithm used by the [...]

  • Page 20

    AMD Athlon™ Processor Microarchitecture Summary
    The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software. A brief summ[...]

  • Page 21

    AMD Athlon execution core to achieve and sustain maximum performance. As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the [...]

  • Page 22

    The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. Ho [...]

  • Page 23

    2 Top Optimizations
    This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. [...]

  • Page 24

    ■ Avoid Placing Code and Data in the Same 64-Byte Cache Line
    Optimization Star
    The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent [...]

  • Page 25

    anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:
    Prefetch Length = 200 (DS / C)
    ■ Round up to the nearest cache line.
    ■ DS i[...]
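The formula in the preview above can be sketched in C. The preview is cut off before it defines DS and C, so this sketch assumes DS is the data stride in bytes per loop iteration and C is the number of cycles per loop iteration; the function name `prefetch_distance` is hypothetical. The 64-byte cache-line size is the figure the guide gives for the AMD Athlon processor.

```c
#include <stddef.h>

/* Cache-line size of the AMD Athlon processor as stated in the guide. */
#define CACHE_LINE_BYTES 64u

/* Sketch of "Prefetch Length = 200 (DS / C)", rounded up to a cache line.
 * ASSUMPTION: ds_bytes is the data stride per loop iteration in bytes and
 * cycles_per_iter is the cycle count of one iteration (must be non-zero). */
size_t prefetch_distance(size_t ds_bytes, size_t cycles_per_iter)
{
    /* 200 * DS / C, rounded up to a whole byte count */
    size_t raw = (200u * ds_bytes + cycles_per_iter - 1u) / cycles_per_iter;

    /* round up to the next multiple of the cache-line size */
    return (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

Under these assumptions, a loop that consumes 24 bytes every 20 cycles gives 240 bytes, which rounds up to 256.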

  • Page 26

    Avoid Load-Execute Floating-Point Instructions with Integer Operands
    Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operand[...]

  • Page 27

    Avoid Placing Code and Data in the Same 64-Byte Cache Line
    Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byt [...]

  • Page 28

    Group II Optimizations - Secondary Optimizations[...]

  • Page 29

    3 C Source Level Optimizations
    This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance.
    Ensure Fl[...]

  • Page 30

    Consider the Sign of Integer Operands
    In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, [...]

  • Page 31

    Example (Avoid):
    int i;           ====>  MOV EAX, i
                            CDQ
    i = i / 4;              AND EDX, 3
                            ADD EAX, EDX
                            SAR EAX, 2
                            MOV i, EAX
    Example (Preferred):
    unsigned int i;  ====>  SHR i, 2
    i = i / 4;
    In summary: Use unsigned types for:
    ■ Divisio[...]
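The preview above contrasts signed and unsigned division by a power of two. A minimal C sketch of the two cases follows; the function names are hypothetical, and the code a given compiler emits will vary.

```c
/* Signed division must round toward zero, so for negative inputs the
 * compiler has to add a correction before shifting (the CDQ/AND/ADD/SAR
 * sequence in the preview). */
int quarter_signed(int i)
{
    return i / 4;
}

/* Unsigned division by 4 needs no correction and can become one shift. */
unsigned int quarter_unsigned(unsigned int i)
{
    return i / 4;
}
```

The correction matters only for negative values: `quarter_signed(-7)` is -1 because C rounds the quotient toward zero, whereas a plain arithmetic shift would not.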

  • Page 32

    Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source c[...]

  • Page 33

    *res++ = dp;            /* write transformed z */
    dp  = vv->x * *m++;
    dp += vv->y * *m++;
    dp += vv->z * *m++;
    dp += vv->w * *m++;
    *res++ = dp;            /* write transformed w */
    ++vv;                   /* next input vertex */
    m -= 16;                /* reset to s[...]
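The fragment above is the pointer-style version of a vertex transform. As a rough illustration of the array-style alternative named in the section title, here is a sketch; the `VERTEX` layout, the function name, and the parameter list are assumptions, since the preview shows only the `x`, `y`, `z`, `w` members and a 16-element matrix.

```c
#include <stddef.h>

/* Hypothetical vertex layout; only the member names appear in the preview. */
typedef struct { float x, y, z, w; } VERTEX;

/* Array-style transform: output component j of vertex i is row j of the
 * 4x4 matrix m dotted with the vertex. Explicit indices replace the
 * *m++ / *res++ pointer walking of the pointer-style version. */
void xform_array(float *res, const float m[4][4], const VERTEX *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < 4; j++) {
            res[4 * i + j] = v[i].x * m[j][0] + v[i].y * m[j][1]
                           + v[i].z * m[j][2] + v[i].w * m[j][3];
        }
    }
}
```

With explicit indices, each output depends only on `i` and `j`, which tends to make the access pattern easier for a compiler to analyze than a chain of pointer increments.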

  • Page 34

    Completely Unroll Small Loops
    Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop b[...]
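To make the advice concrete, here is a small sketch of a three-iteration loop and its completely unrolled form. The function names are hypothetical and the example is not taken from the guide.

```c
/* Rolled form: three iterations, each paying for loop-control overhead. */
void scale3_rolled(float *v, float k)
{
    for (int i = 0; i < 3; i++)
        v[i] *= k;
}

/* Completely unrolled form: the loop control is removed and the body is
 * replicated once per iteration. */
void scale3_unrolled(float *v, float k)
{
    v[0] *= k;
    v[1] *= k;
    v[2] *= k;
}
```

Both functions compute the same result; many compilers perform this transformation themselves at higher optimization levels, so the manual form is mainly useful when that cannot be relied on.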

  • Page 35

    code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. I[...]
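One common way to restructure such code is to carry the value in a local variable instead of reloading it from memory. The sketch below assumes a running-sum loop as the example; the function names are hypothetical, and the two versions are equivalent only if `x` and `y` do not overlap.

```c
/* Each iteration loads x[k-1], which the previous iteration just stored:
 * a store-to-load dependency through memory. */
void running_sum_avoid(double *x, const double *y, int n)
{
    for (int k = 1; k < n; k++)
        x[k] = x[k - 1] + y[k];
}

/* The running value lives in the local t, so nothing that was just
 * stored has to be loaded again. Assumes x and y do not alias. */
void running_sum_preferred(double *x, const double *y, int n)
{
    if (n < 1)
        return;
    double t = x[0];
    for (int k = 1; k < n; k++) {
        t += y[k];
        x[k] = t;
    }
}
```

The aliasing assumption is exactly the kind of language rule the preview alludes to: the compiler cannot always prove it, so the programmer states it by writing the code this way.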

  • Page 36

    Consider Expression Order in Compound Branch Conditions
    Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && a[...]
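The preview breaks off before giving the ordering rule, so the following is only an illustration of why order can matter under C's short-circuit evaluation: with `&&`, testing the condition that is usually false first skips the second test most of the time. The predicates and the counter are hypothetical.

```c
static unsigned long evals;   /* counts predicate evaluations */

/* Rarely true: only multiples of 100 pass. */
static int is_multiple_of_100(int v) { evals++; return v % 100 == 0; }

/* Almost always true for the inputs used here. */
static int is_non_negative(int v) { evals++; return v >= 0; }

/* Runs the compound condition over 0..n-1 and returns how many predicate
 * evaluations were needed; *hits receives the number of matches. */
unsigned long count_evals(int n, int rare_first, int *hits)
{
    evals = 0;
    *hits = 0;
    for (int v = 0; v < n; v++) {
        int ok = rare_first
            ? (is_multiple_of_100(v) && is_non_negative(v))
            : (is_non_negative(v) && is_multiple_of_100(v));
        if (ok)
            (*hits)++;
    }
    return evals;
}
```

For `n = 1000` both orders find the same 10 matches, but the rare-first order evaluates 1010 predicates while the common-first order evaluates 2000.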

  • Page 37

    Switch Statement Usage
    Optimize Switch Statements
    Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a swit[...]

  • Page 38

    Use Const Type Qualifier
    Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available t[...]

  • Page 39

    Generalization for Multiple Constant Control Code
    To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this [...]

  • Page 40

    case combine( 1, 1 ):
        for( i ... ) {
            DoWork1( i );
            DoWork3( i );
        }
        break;
    default:
        break;
    }
    The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total[...]
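The fragment above shows one case of a hoisted switch. A fuller sketch of the idea follows. The `combine` macro definition, `DoWork0`, `DoWork2`, the `out` array, and the way the two flags select the work functions are all assumptions made for illustration; only `combine( 1, 1 )`, `DoWork1`, and `DoWork3` appear in the preview.

```c
#define WORK_N 8
static int out[WORK_N];                    /* hypothetical work target */

static void DoWork0(int i) { out[i] += 1; }
static void DoWork1(int i) { out[i] += 10; }
static void DoWork2(int i) { out[i] += 100; }
static void DoWork3(int i) { out[i] += 1000; }

void reset_out(void) { for (int i = 0; i < WORK_N; i++) out[i] = 0; }

/* ASSUMED helper: packs two 0/1 flags into a single switch constant. */
#define combine(c1, c2) (((c1) << 1) + (c2))

/* Loop-invariant flags tested on every iteration. */
void run_in_loop(int c0, int c1)
{
    for (int i = 0; i < WORK_N; i++) {
        if (c0) DoWork1(i); else DoWork0(i);
        if (c1) DoWork3(i); else DoWork2(i);
    }
}

/* Flags tested once; each case holds a branch-free copy of the loop. */
void run_hoisted(int c0, int c1)
{
    int i;
    switch (combine(c0 != 0, c1 != 0)) {
    case combine(0, 0):
        for (i = 0; i < WORK_N; i++) { DoWork0(i); DoWork2(i); }
        break;
    case combine(1, 0):
        for (i = 0; i < WORK_N; i++) { DoWork1(i); DoWork2(i); }
        break;
    case combine(0, 1):
        for (i = 0; i < WORK_N; i++) { DoWork0(i); DoWork3(i); }
        break;
    case combine(1, 1):
        for (i = 0; i < WORK_N; i++) { DoWork1(i); DoWork3(i); }
        break;
    default:
        break;
    }
}
```

The cost is code size: two flags need four loop copies, and each extra flag doubles that, which is the up-front work the preview mentions.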

  • Page 41

    which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
    Dynamic Memory Allocation Consideration
    Dynamic memory allocation (‘malloc’ in C language) should always retu[...]

  • Page 42

    lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.
    Example 1 (Avoid):
    double a[100],sum;
    int i;
    sum = 0.0f;
    for (i=0; i<100; i++)[...]
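The preferred version is cut off in the preview. One widely used restructuring of such a summation loop, shown here as an assumption rather than as the guide's own code, is to keep several independent partial sums so that consecutive additions do not wait on each other. The function name is hypothetical.

```c
/* Sums n doubles using four independent accumulators.
 * Reordering floating-point additions can change the result in the
 * least significant bits, as the preview text warns. */
double sum_four_accumulators(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

A C compiler is generally not allowed to make this change on its own under strict floating-point rules, because it alters the order of the additions.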

  • Page 43

    Example 1
    Avoid:
    double a,b,c,d,e,f;
    e = b*c/d;
    f = b/d*a;
    Preferred:
    double a,b,c,d,e,f,t;
    t = b/d;
    e = c*t;
    f = a*t;
    Example 2
    Avoid:
    double a,b,c,e,f;
    e = a/c;
    f = b/c;
    Preferred:
    double a,b,c,e,f,t;
    t = 1/c;
    e = a*t;
    f[...]

  • Page 44

    Pad by Multiple of Largest Base Type Size
    Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other[...]

  • Page 45

    quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared. The following example demonstr[...]

  • Page 46

    necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The[...]

  • Page 47

    Avoid Unnecessary Integer Division
    Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multip[...]
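The sentence above is cut off, so the following is an assumption about the kind of rewrite it has in mind: two chained divisions replaced by one multiplication and one division. The function names are hypothetical, and the rewrite is valid only when `j * k` does not overflow.

```c
/* Two integer divisions. */
unsigned int div_twice(unsigned int i, unsigned int j, unsigned int k)
{
    return i / j / k;
}

/* One multiplication and one division.
 * ASSUMPTION: j * k fits in unsigned int; otherwise the results differ. */
unsigned int div_once(unsigned int i, unsigned int j, unsigned int k)
{
    return i / (j * k);
}
```

For unsigned operands the two forms agree whenever the product fits, for example 1000 / 7 / 3 and 1000 / 21 both give 47.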

  • Page 48

    Example 1 (Avoid):
    //assumes pointers are different and q!=r
    void isqrt ( unsigned long a,
                 unsigned long *q,
                 unsigned long *r)
    {
        *q = a;
        if (a > 0)
        {
            while (*q > (*r = a / *q))
            {
                *q = (*q + [...]
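The "Avoid" listing is truncated, so the sketch below is a guess at the preferred shape rather than the guide's code: the pointed-to values are held in locals during the loop and written back once at the end. The averaging step and the final remainder computation are assumptions that complete the visible fragment in the usual Newton-iteration way; the name `isqrt_locals` is hypothetical.

```c
/* Integer square root with remainder: on return *q = floor(sqrt(a)) and
 * *r = a - (*q) * (*q). Local copies avoid repeated loads and stores
 * through q and r, which the compiler cannot remove if it must assume
 * the pointers might alias. */
void isqrt_locals(unsigned long a, unsigned long *q, unsigned long *r)
{
    unsigned long qq = a;   /* local copy of *q */
    unsigned long rr;       /* local copy of *r */

    if (a > 0) {
        while (qq > (rr = a / qq)) {
            qq = (qq + rr) >> 1;   /* ASSUMED continuation: average step */
        }
    }
    rr = a - qq * qq;

    *q = qq;                /* single write-back at the end */
    *r = rr;
}
```

Calling it with `a = 10` yields a quotient of 3 and a remainder of 1.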

  • Page 49

    4 Instruction Decoding Optimizations
    This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance.
    Overview
    The AMD Athlon pro[...]

  • Page 50

    Select DirectPath Over VectorPath Instructions
    Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the[...]

  • Page 51

    Use Load-Execute Floating-Point Instructions with Floating-Point Operands
    When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increa[...]

  • Page 52

    Example 1 (Avoid):
    FLD   QWORD PTR [foo]
    FIMUL DWORD PTR [bar]
    FIADD DWORD PTR [baz]
    Example 2 (Preferred):
    FILD  DWORD PTR [bar]
    FILD  DWORD PTR [baz]
    FLD   QWORD PTR [foo]
    FMULP ST(2), ST
    FADDP ST(1), ST
    Align Branch Targets i[...]

  • Page 53

    Example 2 (Preferred):
    05 78 56 34 12   add eax, 12345678h   ;uses single byte opcode form
    83 C3 FB         add ebx, -5          ;uses 8-bit sign extended immediate
    74 05            jz  $label1          ;uses 1-byte opcode, 8-bit immediate
    Avoid Partial Registe[...]

  • Page 54

    Replace Certain SHLD Instructions with Alternative Code
    Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA. The alternative code has lower latency and requir[...]

  • Page 55

    Use 8-Bit Sign-Extended Displacements
    Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative ef[...]

  • Page 56

    Recommendations for the AMD Athlon™ Processor
    For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF[...]

  • Page 57

    Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
    On x86 processors other than the AMD Athlon processor (including the AMD-K6 family of processors), the REP prefix and especially m [...]

  • Page 58

    NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
    NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]
    NOP3_ESI TEXTEQU <DB 08Dh,024h,024h> ;lea esi, [esi]
    NOP3_EDI TEXTEQU <DB 08Dh,034h,026h> ;lea ed[...]

  • Page 59

    ;lea edi,[edi+00000000]
    NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
    ;lea ebp,[ebp+00000000]
    NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0>
    ;lea eax,[eax*1+00000000]
    NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>
    ;lea ebx[...]

  • Page 60

    Code Padding Using Neutral Code Fillers[...]

  • Page 61

    5 Cache and Memory Optimizations
    This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of imp[...]

  • Page 62

    Align Data Where Possible
    In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example:
    ■ QWORD accesses are aligned if th[...]
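Natural alignment, as the term is normally used, means the address is a multiple of the access size. A minimal sketch of that check follows; the function name is hypothetical, and the size is assumed to be a power of two, matching the statement in the preview.

```c
#include <stdint.h>

/* Returns non-zero when addr is a multiple of size.
 * ASSUMPTION: size is a non-zero power of two (1, 2, 4, 8, ...). */
int is_naturally_aligned(uintptr_t addr, uintptr_t size)
{
    return (addr & (size - 1u)) == 0;
}
```

Address 12, for instance, is naturally aligned for a 4-byte access but not for an 8-byte access. For a real pointer `p`, the call would be `is_naturally_aligned((uintptr_t)p, sizeof *p)`.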

  • Page 63

    PREFETCH/W versus PREFETCHNTA/T0/T1/T2
    The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. To maintain compatibility with the 25 million AMD-[...]

  • Page 64

    MOV ECX, (-LARGE_NUM)     ;used biased index
    MOV EAX, OFFSET array_a   ;get address of array_a
    MOV EDX, OFFSET array_b   ;get address of array_b
    MOV ECX, OFFSET array_c   ;get address of array_c
    $loop:
    PREFETCHW [EAX+196]       ;two cac[...]

  • Page 65

    The following optimization rules were applied to this example.
    ■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoi[...]

  • Page 66

    Take Advantage of Write Combining
    Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggres[...]

  • Page 67

    Store-to-Load Forwarding Restrictions
    Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store arc[...]

  • Page 68

    Narrow-to-Wide Store-Buffer Data Forwarding Restriction
    If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction:
    ■ The operand size of the store data is sma[...]

  • Page 69

    Example 5 (Preferred):
    MOVD [foo], MM1        ;store lower half
    PUNPCKHDQ MM1, MM1     ;get upper half into lower half
    MOVD [foo+4], MM1      ;store lower half
    ...
    ADD EAX, [foo]         ;fine
    ADD EDX, [foo+4]       ;fine
    Misaligned Store-Buffer Data Forward[...]

  • Page 70

    One Supported Store-to-Load Forwarding Case
    There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWOR[...]

  • Page 71

    Example (Preferred):
    Prolog:
    PUSH EBP
    MOV EBP, ESP
    SUB ESP, SIZE_OF_LOCALS   ;size of local variables
    AND ESP, -8
    ;push registers that need to be preserved
    Epilog:
    ;pop register that needed to be preserved
    MOV ESP,[...]

  • Page 72

    Example:
    struct {
        char a[5];
        long k;
        double x;
    } baz;
    The structure components should be allocated (lowest to highest address) as follows:
    x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
    See “C Langua[...]
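At the C source level, a comparable effect can be approximated by declaring members from largest base type to smallest. The sketch below is an illustration under that assumption, not the guide's own listing: the struct tag and helper are hypothetical, and within the array the element order is fixed by the language (`a[0]` sits at the lowest address), unlike the allocation order listed above.

```c
#include <stddef.h>

/* Members declared largest base type first, so each member starts on a
 * boundary suitable for its type with no padding between members on
 * typical ABIs; any padding collects after the char array. */
struct baz_sorted {
    double x;
    long   k;
    char   a[5];
};

/* Bytes of padding the compiler added to the struct. */
size_t baz_sorted_padding(void)
{
    return sizeof(struct baz_sorted)
         - (sizeof(double) + sizeof(long) + sizeof(char[5]));
}
```

The exact padding depends on the platform's sizes and alignment rules, so `baz_sorted_padding()` may differ between 32-bit and 64-bit targets.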

  • Page 73

    6 Branch Optimizations
    While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules tha[...]

  • Page 74

    AMD Athlon™ Processor Specific Code
    Example 1 - Signed integer ABS function (X = labs(X)):
    MOV ECX, [X]     ;load value
    MOV EBX, ECX     ;save value
    NEG ECX          ;-value
    CMOVS ECX, EBX   ;if -value is negative, select value
    MO[...]

  • Page 75

    Example 6 - Increment Ring Buffer Offset:
    //C Code
    char buf[BUFSIZE];
    int a;
    if (a < (BUFSIZE-1)) {
        a++;
    } else {
        a = 0;
    }
    ;-------------
    ;Assembly Code
    MOV EAX, [a]           ; old offset
    CMP EAX, (BUFSIZE-1)   ; a < (BUFSIZE-1) ? CF : NC
    INC [...]
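The assembly listing is cut off, but the same branch-free idea can be sketched in C with a mask. This is an illustration of the technique, not a translation of the guide's assembly; the function name is hypothetical and the mask trick assumes two's-complement integers.

```c
/* Branch-free ring-buffer increment: returns a + 1 while a is below
 * bufsize - 1, and 0 when the offset has to wrap.
 * ASSUMPTION: two's-complement int, so -(1) has all bits set. */
int next_offset(int a, int bufsize)
{
    int mask = -(a < bufsize - 1);   /* all ones if no wrap, else zero */
    return (a + 1) & mask;
}
```

With a buffer size of 4, offsets 0, 1, and 2 advance by one, and offset 3 wraps to 0. Whether this beats the plain `if`/`else` depends on the compiler, which may already emit a conditional move for the branching form.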

  • Page 76

    Replace Branches with Computation in 3DNow!™ Code
    Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar[...]

  • Page 77

    Example 2 (Preferred):
    ; r = (x < y) ? a : b
    ;
    ; in:  mm0 a
    ;      mm1 b
    ;      mm2 x
    ;      mm3 y
    ; out: mm1 r
    PCMPGTD MM3, MM2   ; y > x ? 0xffffffff : 0
    PAND MM1, MM3      ; y > x ? b : 0
    PANDN MM3, MM0     ; y > x ? 0 : a [...]

  • Page 78

    Example 2:
    C code:
    float x,z;
    z = abs(x);
    if (z >= 1) {
        z = 1/z;
    }
    3DNow! code:
    ;in:  MM0 = x
    ;out: MM0 = z
    MOVQ MM5, mabs    ;0x7fffffff
    PAND MM0, MM5     ;z=abs(x)
    PFRCP MM2, MM0    ;1/z approx
    MOVQ MM1, MM0     ;save z
    PFRC[...]

  • Page 79

    Example 4:
    C code:
    #define PI 3.14159265358979323
    float x,z,r,res;
    /* 0 <= r <= PI/4 */
    z = abs(x);
    if (z < 1) {
        res = r;
    } else {
        res = PI/2-r;
    }
    3DNow! code:
    ;in:  MM0 = x
    ;     MM1 = r
    ;out: MM1 = res
    MOVQ MM[...]

  • Page 80

    Example 5:
    C code:
    #define PI 3.14159265358979323
    float x,y,xa,ya,r,res;
    int xs,df;
    xs = x < 0 ? 1 : 0;
    xa = fabs(x);
    ya = fabs(y);
    df = (xa < ya);
    if (xs && df) {
      res = PI/2 + r;
    } else if (xs) {
      res[...]

  • Page 81

    Avoid the Loop Instruction
    The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
    Example 1 (Avoid):
    LOOP LABEL
    Example 2 (Preferred):
    DEC ECX
    JNZ LABEL
    Avoid Far Con[...]

  • Page 82

    Avoid Recursive Functions
    Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is when the function call [...]
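As an illustration of converting end recursion to iteration, the sketch below uses Euclid's greatest common divisor, an example chosen here and not taken from the guide. The recursive version makes its self-call as the very last action, so the call can be replaced by updating the arguments and looping.

```c
/* End-recursive form: the recursive call is the last operation performed. */
unsigned gcd_recursive(unsigned a, unsigned b)
{
    return b == 0u ? a : gcd_recursive(b, a % b);
}

/* Iterative form: the tail call becomes an argument update plus a loop,
   so no return addresses accumulate. */
unsigned gcd_iterative(unsigned a, unsigned b)
{
    while (b != 0u) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```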

  • Page 83

    7 Scheduling Optimizations
    This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.
    Schedule Instructions According to their Latency
    The AM[...]

  • Page 84

    unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.[...]

  • Page 85

    Without Loop Unrolling:
    MOV ECX, MAX_LENGTH
    MOV EAX, OFFSET A
    MOV EBX, OFFSET B
    $add_loop:
    FLD QWORD PTR [EAX]
    FADD QWORD PTR [EBX]
    FSTP QWORD PTR [EAX]
    ADD EAX, 8
    ADD EBX, 8
    DEC ECX
    JNZ $add_loop
    The loop consists of seven instructions. The AMD At[...]

  • Page 86

    no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.
    Deriving Loop Control For Partially Unrolled Loops
    A frequently used loop construct is a counting loop. In a typic[...]
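A partially unrolled counting loop needs loop control that handles trip counts that are not a multiple of the unroll factor. The C sketch below assumes an unroll factor of two and an element-wise add of two double arrays, mirroring the FLD/FADD/FSTP loop shown earlier; the function name `add_arrays_unrolled` is hypothetical.

```c
#include <stddef.h>

/* a[i] += b[i] for i in [0, n), unrolled by two.
   The main loop handles pairs; a single trailing element is handled after it. */
void add_arrays_unrolled(double *a, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
    if (i < n) {
        a[i] += b[i];  /* leftover element when n is odd */
    }
}
```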

  • Page 87

    Use Function Inlining
    Overview
    Make use of the AMD Athlon processor's large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, whic[...]

  • Page 88

    Always Inline Functions if Called from One Site
    A function should always be inlined if it can be established that it is called from just one site in the code. For the C language, determination of this characteristic is made [...]

  • Page 89

    Example 1 (Avoid):
    ADD EBX, ECX  ;inst 1
    MOV EAX, DWORD PTR [10h]  ;inst 2 (fast address calc.)
    MOV ECX, DWORD PTR [EAX+EBX]  ;inst 3 (slow address calc.)
    MOV EDX, DWORD PTR [24h]  ;this load is stalled from
      ; accessing data cache due
      ; to long laten[...]

  • Page 90

    Example 1 (Avoid):
    int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
    for (i=0; i < MAXSIZE; i++) {
      c[i] = a[i] + b[i];
    }
    MOV ECX, MAXSIZE  ;initialize loop counter
    XOR ESI, ESI  ;initialize offset into array a
    XOR EDI, EDI  ;initializ[...]

  • Page 91

    variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as arguments of a function) biasing the base add[...]
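The biased-base-address idea can be sketched in C: each base pointer is moved to the end of its array once, and a single index runs from a negative value up to zero, so the index doubles as the loop counter. The function name `add_biased` and the use of `ptrdiff_t` are choices made for this sketch.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] for n elements using one induction variable.
   The pointers are biased to one past the end, and idx counts from -n up to 0. */
void add_biased(const int *a, const int *b, int *c, ptrdiff_t n)
{
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;
    for (ptrdiff_t idx = -n; idx != 0; idx++) {
        c_end[idx] = a_end[idx] + b_end[idx];
    }
}
```

Whether a compiler keeps this shape in the generated code is not guaranteed; the sketch only shows the addressing scheme.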

  • Page 92

    76 Push Memory Data Carefully AMD Athlon™ Processor x86 Code Optimization 22007E/0 - November 1999[...]

  • Page 93

    8 Integer Optimizations
    This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.
    Replace Divides with Multiplies
    Replace inte[...]

  • Page 94

    Signed Division Utility
    In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user en[...]

  • Page 95

    Example 1:
    ;In: EAX = dividend
    ;Out: EDX = quotient
    XOR EDX, EDX  ;0
    CMP EAX, d  ;CF = (dividend < divisor) ? 1 : 0
    SBB EDX, -1  ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
    In cases where the dividend does not need to be pre[...]

  • Page 96

    ;algorithm 1
    MOV EAX, m
    MOV EDX, dividend
    MOV ECX, EDX
    IMUL EDX
    ADD EDX, ECX
    SHR ECX, 31
    SAR EDX, s
    ADD EDX, ECX  ;quotient in EDX
    Derivation for a, m, s
    The derivation for the algorithm (a), multiplier (m), and shift count (s), [...]

  • Page 97

    Remainder of Signed Integer 2^n or -(2^n)
    ;IN: EAX = dividend
    ;OUT: EAX = remainder
    CDQ  ;Sign extend into EDX
    AND EDX, (2^n-1)  ;Mask correction (abs(divisor)-1)
    ADD EAX, EDX  ;Apply pre-correction
    AND EAX, (2^n[...]

  • Page 98

    by 11:
    LEA REG2, [REG1*8+REG1]  ;3 cycles
    ADD REG1, REG1
    ADD REG1, REG2
    by 12:
    SHL REG1, 2
    LEA REG1, [REG1*2+REG1]  ;3 cycles
    by 13:
    LEA REG2, [REG1*2+REG1]  ;3 cycles
    SHL REG1, 4
    SUB REG1, REG2
    by 14:
    LEA REG2, [REG1*[...]

  • Page 99

    by 26:
    use IMUL
    by 27:
    LEA REG2, [REG1*4+REG1]  ;3 cycles
    SHL REG1, 5
    SUB REG1, REG2
    by 28:
    MOV REG2, REG1  ;3 cycles
    SHL REG1, 3
    SUB REG1, REG2
    SHL REG1, 2
    by 29:
    LEA REG2, [REG1*2+REG1]  ;3 cycles
    SHL REG1, 5
    SUB REG1, RE[...]

  • Page 100

    In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
    Repeated String Instruction Usage
    Latency of Repeated String Instructi[...]

  • Page 101

    Ensure DF=0 (UP)
    Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, so[...]

  • Page 102

    Use XOR Instruction to Clear Integer Registers
    To clear an integer register to all 0s, use "XOR reg, reg". The AMD Athlon processor is able to avoid the false read dependency on the XOR instructi[...]

  • Page 103

    Example 4 (Left shift):
    ;shift operand in EDX:EAX left, shift count in ECX (count
    ; applied modulo 64)
    SHLD EDX, EAX, CL  ;first apply shift count
    SHL EAX, CL  ; mod 32 to EDX:EAX
    TEST ECX, 32  ;need to shift by another 32?
    JZ $lshi[...]
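The two-step structure of the left shift (shift by the count modulo 32, then move the low half into the high half if bit 5 of the count is set) can be modelled in C on two 32-bit halves. This is a sketch for checking the logic, not a replacement for the assembly; `shl64_halves` is a hypothetical name, and the extra test for a zero count exists only because shifting a 32-bit value by 32 is undefined in C.

```c
#include <stdint.h>

/* Shifts the 64-bit value hi:lo left by count (applied modulo 64). */
uint64_t shl64_halves(uint32_t hi, uint32_t lo, unsigned count)
{
    unsigned c = count & 31u;
    if (c != 0u) {                       /* SHLD / SHL with count mod 32 */
        hi = (hi << c) | (lo >> (32u - c));
        lo <<= c;
    }
    if (count & 32u) {                   /* TEST ECX, 32: shift by another 32 */
        hi = lo;
        lo = 0u;
    }
    return ((uint64_t)hi << 32) | lo;
}
```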

  • Page 104

    Example 7 (Division):
    ;_ulldiv divides two unsigned 64-bit integers, and returns
    ; the quotient.
    ;
    ;INPUT: [ESP+8]:[ESP+4] dividend
    ;       [ESP+16]:[ESP+12] divisor
    ;
    ;OUTPUT: EDX:EAX quotient of division
    ;
    ;DESTROYS: EAX,ECX,EDX,EFlags[...]

  • Page 105

    MOV ECX, EAX  ;save quotient
    IMUL EDI, EAX  ;quotient * divisor hi-word
      ; (low only)
    MUL DWORD PTR [ESP+20]  ;quotient * divisor lo-word
    ADD EDX, EDI  ;EDX:EAX = quotient * divisor
    SUB EBX, EAX  ;dividend_lo - (quot.*divisor)_lo
    MOV EA[...]

  • Page 106

    $r_two_divs:
    MOV ECX, EAX  ;save dividend_lo in ECX
    MOV EAX, EDX  ;get dividend_hi
    XOR EDX, EDX  ;zero extend it into EDX:EAX
    DIV EBX  ;EAX = quotient_hi, EDX = intermediate
      ; remainder
    MOV EAX, ECX  ;EAX = dividend_lo
    DIV EBX  ;EAX = qu[...]

  • Page 107

    Efficient Implementation of Population Count Function
    Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a [...]

  • Page 108

    Step 3
    For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed: y[...]

  • Page 109

    ADD EAX, EDX  ;x = (w & 0x33333333) + ((w >> 2) &
      ; 0x33333333)
    MOV EDX, EAX  ;x
    SHR EAX, 4  ;x >> 4
    ADD EAX, EDX  ;x + (x >> 4)
    AND EAX, 00F0F0F0Fh  ;y = (x + (x >> 4) & 0[...]
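The intermediate values named in the assembly comments (w, x, y) map directly onto a C version of the population count. The sketch below follows those comments; the first step (w from v) and the final multiply-and-shift are the standard completion of this technique and are assumed here because the surrounding pages are truncated. The name `popcount32` is hypothetical.

```c
#include <stdint.h>

/* Counts the set bits in a 32-bit word by summing progressively wider fields. */
unsigned popcount32(uint32_t v)
{
    uint32_t w = v - ((v >> 1) & 0x55555555u);                 /* 2-bit field sums */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u); /* 4-bit field sums */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums */
    return (y * 0x01010101u) >> 24;                            /* add the four bytes */
}
```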

  • Page 110

    ;algorithm 1
    MOV EDX, dividend
    MOV EAX, m
    MUL EDX
    ADD EAX, m
    ADC EDX, 0
    SHR EDX, s  ;EDX=quotient
    */
    typedef unsigned __int64 U64;
    typedef unsigned long U32;
    U32 d, l, s, m, a, r;
    U64 m_low, m_high, j, k;
    U32 log2 [...]
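The unsigned algorithm-1 sequence above computes (m * dividend + m) >> (32 + s). A small C model makes that concrete for one divisor. The constants below are assumptions chosen for this sketch, not output of the guide's derivation code: for a divisor of 7, m = floor(2^34 / 7) = 0x92492492 and s = 2. The function name `udiv7` is hypothetical.

```c
#include <stdint.h>

/* Unsigned division by the constant 7 in the algorithm-1 shape:
   quotient = (m * dividend + m) >> (32 + s).
   m and s are illustrative values for this divisor only. */
uint32_t udiv7(uint32_t dividend)
{
    const uint64_t m = 0x92492492u;        /* assumed multiplier, floor(2^34 / 7) */
    const unsigned s = 2;                  /* assumed shift count */
    uint64_t product = m * dividend + m;   /* MUL, then ADD/ADC of m */
    return (uint32_t)(product >> (32 + s));
}
```

The 64-bit intermediate cannot overflow, since m * (dividend + 1) stays below 2^64 for any 32-bit dividend.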

  • Page 111

    /* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: "Integer Multiplication and Division on the HP Precision Architecture". IEEE Transactions on Computers, Vol 37, No. 8, August 198[...]

  • Page 112

    ;algorithm 1
    MOV EAX, m
    MOV EDX, dividend
    MOV ECX, EDX
    IMUL EDX
    ADD EDX, ECX
    SHR ECX, 31
    SAR EDX, s
    ADD EDX, ECX  ; quotient in EDX
    */
    typedef unsigned __int64 U64;
    typedef unsigned long U32;
    U32 log2 (U32 i)
    {
      U32[...]

  • Page 113

    9 Floating-Point Optimizations
    This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
    Ensure All FPU Data is Aligned
    As di[...]

  • Page 114

    Use FFREEP Macro to Pop One Register from the FPU Stack
    In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data. A[...]

  • Page 115

    These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor. When FSTSW cannot be avoided (for example,[...]

  • Page 116

    Minimize Floating-Point-to-Integer Conversions
    C++, C, and Fortran define floating-point-to-integer conversions as truncating. This creates a problem because the active rounding mode in an application i[...]

  • Page 117

    FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For appl[...]

  • Page 118

    Example 3 (Potentially faster):
    MOV ECX, DWORD PTR[X+4]  ;get upper 32 bits of double
    XOR EDX, EDX  ;i = 0
    MOV EAX, ECX  ;save sign bit
    AND ECX, 07FF00000h  ;isolate exponent field
    CMP ECX, 03FF00000h  ;if abs(x) < 1.0 [...]

  • Page 119

    Floating-Point Subexpression Elimination
    There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a s[...]

  • Page 120

    If an "argument out of range" is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > [...]

  • Page 121

    Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.
    Take Ad[...]

  • Page 122

    106 Take Advantage of the FSINCOS Instruction AMD Athlon™ Processor x86 Code Optimization 22007E/0 - November 1999[...]

  • Page 123

    10 3DNow!™ and MMX™ Optimizations
    This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be foun[...]

  • Page 124

    FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no sw[...]

  • Page 125

    Pipelined Pair of 24-Bit Precision Divides
    This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within prec[...]

  • Page 126

    Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
    3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square[...]

  • Page 127

    Newton-Raphson Reciprocal Square Root
    The general Newton-Raphson reciprocal square root recurrence is:
    $Z_{i+1} = \frac{1}{2} \cdot Z_i \cdot (3 - b \cdot Z_i^2)$
    To reduce the number of i[...]
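The recurrence can be tried out in plain C to see how quickly it converges. The sketch below is only a numerical illustration in double precision; it does not model the 3DNow! reciprocal square root instructions, and the function name and the choice of starting estimate are assumptions.

```c
/* Refines an estimate z of 1/sqrt(b) with the Newton-Raphson recurrence
   z_next = 0.5 * z * (3 - b * z * z), applied the given number of times. */
double rsqrt_refine(double b, double z, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        z = 0.5 * z * (3.0 - b * z * z);
    }
    return z;
}
```

For b = 4 and a rough starting estimate of 0.4, repeated application approaches 0.5, with the error roughly squaring at each step once the estimate is close.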

  • Page 128

    Example:
    PXOR MM2, MM2  ; 0 | 0
    MOVD MM0, [ab]  ; 0 0 | b a
    MOVD MM1, [cd]  ; 0 0 | d c
    PUNPCKLWD MM0, MM2  ; 0 b | 0 a
    PUNPCKLWD MM1, MM2  ; 0 d | 0 c
    PMADDWD MM0, MM1  ; b*d | a*c
    3DNow!™ and MMX™ Intra-Operand Swa[...]

  • Page 129

    Fast Conversion of Signed Words to Floating-Point
    In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers. The following two e[...]

  • Page 130

    cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four, therefore, in the worst case, the PXOR and PMUL instructions are the[...]

  • Page 131

    Use MMX™ Instructions for Block Copies and Block Fills
    For moving or filling small blocks of data (e.g., less than 512 bytes) between cacheable memory areas, the REP MOVS and REP STOS families[...]

  • Page 132

    $xfer:
    movq mm0, [eax]
    add edx, 64
    movq mm1, [eax+8]
    add eax, 64
    movq mm2, [eax-48]
    movq [edx-64], mm0
    movq mm0, [eax-40]
    movq [edx-56], mm1
    movq mm1, [eax-32]
    movq [edx-48], mm2
    movq mm2, [eax-24]
    movq [edx[...]

  • Page 133

    AMD Athlon™ Processor Specific Code
    The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the f[...]

  • Page 134

    /* block fill (destination QWORD aligned) */
    __asm {
      mov edx, [dst_ptr]
      mov ecx, [blk_size]
      shr ecx, 6
      movq mm0, [fill_data]
      align 16
    $fill_nc:
      movntq [edx], mm0
      movntq [edx+8], mm0
      movntq [edx+16], mm0
      mov[...]

  • Page 135

    Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
    To set all the bits in an MMX register to one, use:
    PCMPEQD MMreg, MMreg
    Note that PCMPEQD MMreg, MMreg is dependent on previous wri[...]

  • Page 136

    /* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res". Each vertex consists of four floats. T[...]

  • Page 137

    $$xform:
    ADD EBX, 16  ;res++
    MOVQ MM0, QWORD PTR [EDX]  ;v->y | v->x
    MOVQ MM1, QWORD PTR [EDX+8]  ;v->w | v->z
    ADD EDX, 16  ;v++
    MOVQ MM2, MM0  ;v->y | v->x
    MOVQ MM3, QWORD PTR [EAX+M00]  ;m[0][1] | m[0][0]
    PUNPCKLDQ MM0, [...]

  • Page 138

    Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
    Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do [...]

  • Page 139

    ;;
    ;; DESTROYS MM0,MM1,MM2,MM3,MM4
    PXOR MM0, MM0  ; 0 | 0
    MOVQ MM1, MM6  ; w | z
    MOVQ MM4, MM5  ; y | x
    PUNPCKHDQ MM1, MM1  ; w | w
    MOVQ MM3, MM6  ; w | z
    MOVQ MM2, MM5  ; y | x
    PFSUBR MM3, MM0  ; -w | -z
    PFSUBR MM2, MM[...]

  • Page 140

    Example 1 (Avoid):
    MOV ESI, DWORD PTR Src_MB
    MOV EDI, DWORD PTR Dst_MB
    MOV EDX, DWORD PTR SrcStride
    MOV EBX, DWORD PTR DstStride
    MOVQ MM7, QWORD PTR [ConstFEFE]
    MOVQ MM6, QWORD PTR [Const0101]
    MOV ECX, 16
    L1:
    MOVQ MM0, [...]

  • Page 141

    The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock:
    Example 2 (Preferred):
    MOV EAX, DWORD PTR Src_MB
    MOV EDI, DWORD PTR Dst_M[...]

  • Page 142

    Complex Number Arithmetic
    Complex numbers have a "real" part and an "imaginary" part. Multiplying complex numbers (ex. 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and [...]
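For reference, the product of two complex numbers follows (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The scalar C sketch below spells that out; the `cplx` type and `cplx_mul` function are hypothetical names for this illustration, and a 3DNow! version would compute the same four products on packed single-precision data.

```c
/* Minimal complex type with single-precision parts (hypothetical). */
typedef struct {
    float re;
    float im;
} cplx;

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
cplx cplx_mul(cplx p, cplx q)
{
    cplx r;
    r.re = p.re * q.re - p.im * q.im;
    r.im = p.re * q.im + p.im * q.re;
    return r;
}
```

For example, (3 + 4i)(1 + 2i) evaluates to -5 + 10i.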

  • Page 143

    11 General x86 Optimization Guidelines
    This chapter describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6® processor, AMD Athlon™ processor, and Pentium® fami[...]

  • Page 144

    Dependencies
    Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
    Register Operands
    Maintain frequently used values in registers rather than i[...]

  • Page 145

    Appendix A AMD Athlon™ Processor Microarchitecture
    Introduction
    When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation. The term architecture r[...]

  • Page 146

    AMD Athlon™ Processor Microarchitecture
    The innovative AMD Athlon processor microarchitecture approach implements the x86 instruction set by processing simpler operations (OPs) instead of complex x86 instruct[...]

  • Page 147

    Figure 1. AMD Athlon™ Processor Block Diagram
    Instruction Cache
    The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache. The L1 instruction cache is [...]

  • Page 148

    replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure. The first-level TLB is fully [...]

  • Page 149

    return stack. Subsequent RETs pop a predicted return address off the top of the stack.
    Early Decoding
    The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs. A MacroOP [...]

  • Page 150

    Instruction Control Unit
    The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources: the centralized in-flight reorder buffer, the integer [...]

  • Page 151

    Integer Scheduler
    The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six[...]

  • Page 152

    Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calcu[...]

  • Page 153

    Floating-Point Execution Unit
    The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations [...]

  • Page 154

    Load-Store Unit (LSU)
    The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface [...]

  • Page 155

    L2 Cache Controller
    The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8-Mbytes of industry-standard SRAMs. There are full on-chip tags [...]

  • Page 156

    140 AMD Athlon™ Processor Microarchitecture AMD Athlon™ Processor x86 Code Optimization 22007E/0 - November 1999[...]

  • Page 157

    Appendix B Pipeline and Execution Unit Resources Overview
    The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations. The integer pip[...]

  • Page 158

    Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
    The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware. The less common instructions, which require microcode ass[...]

  • Page 159

    Cycle 1 – FETCH
    The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.
    Cycle 2 – SCAN
    SCAN determines the start and end pointers of instr[...]

  • Page 160

    operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.
    Integer Pipeline Stages
    The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, [...]

  • Page 161

    Cycle 7 – SCHED
    In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules [...]

  • Page 162

    Floating-Point Pipeline Stages
    The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructi[...]

  • Page 163

    Cycle 7 – STKREN
    The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.
    Cycle 8 – REGREN
    The register re[...]

  • Page 164

    Execution Unit Resources
    Terminology
    The execution units operate with two types of register values: operands and results. There are three operand types and two result types, which are described in this section. Oper[...]

  • Page 165

    Integer Pipeline Operations
    Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type. As shown in Table 2, the MOV instruction early decodes in the DirectPat[...]

  • Page 166

    Floating-Point Pipeline Operations
    Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types. As shown in Table 4, the FADD register-to-regist[...]

  • Page 167

    Load/Store Pipeline Operations
    The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations. For example, consider the following code sample:
    MOV AX, [EBX]  ;1 load MacroOP
    PUSH [...]

  • Page 168

    152 Execution Unit Resources. Code Sample Analysis. The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.[...]

  • Page 169

    Execution Unit Resources 153. Table 7. Sample 1 – Integer Register Operations. Columns: Instruction Number, Instruction, Decode Pipe, Decode Type, Clocks 1 to 8. Rows: 1 IMUL EAX, ECX 0 VP D I M M M M; 2 INC ESI 0 DP D I E; 3 MOV EDI, 0x07F4 1 DP D I E; 4 AD [...]

  • Page 170

    154 Execution Unit Resources. Table 8. Sample 2 – Integer Register and Memory Load Operations. Columns: Instruction Number, Instruction, Decode Pipe, Decode Type, Clocks 1 to 12. Rows: 1 DEC EDX 0 DP D I E; 2 MOV EDI, [ECX] 1 DP D I &/S A $; 3 SUB EAX, [[...]

  • Page 171

    Introduction 155. Appendix C: Implementation of Write Combining. Introduction. This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family. The AMD Athlon processor supports the memory type and range regis[...]

  • Page 172

    156 Write-Combining Definitions and Abbreviations. This appendix uses the following definitions and abbreviations: ■ UC: Uncacheable memory type ■ WC: Write-combining memory type ■ WT: Writethrough [...]

  • Page 173

    Write-Combining Operations 157. signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six. 2. In addition, the presence of the MTRRs is indicated by bit 12 and the pr [...]
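
The two bit-field checks described here can be sketched in Python as plain integer arithmetic. This is only an illustration: the function names are hypothetical, the values would in practice come from the CPUID instruction, and the excerpt is truncated before it says which register carries bit 12, so the second function simply takes a generic feature-flag word.

```python
ATHLON_FAMILY = 6  # family code the text gives for the AMD Athlon processor


def family_code(eax_signature: int) -> int:
    """Return bits 11-8 of the CPUID signature (the instruction family code)."""
    return (eax_signature >> 8) & 0xF


def has_mtrr(feature_flags: int) -> bool:
    """Return True when bit 12 of the feature-flag word is set."""
    return bool((feature_flags >> 12) & 1)
```

For example, a signature value of 0x0621 (an arbitrary sample) yields a family code of 6.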

  • Page 174

    158 Write-Combining Operations. Table 9. Write Combining Completion Events (columns: Event, Comment). Non-WB write outside of current buffer: The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining.[...]

  • Page 175

    Write-Combining Operations 159. Sending Write-Buffer Data to the System. Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists t[...]
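
As a rough illustration of how a 64-byte buffer groups writes, the following Python sketch splits a sequence of write addresses into runs that share one 64-byte block. It models only the "write to a different block closes combining" event and a single buffer; the other completion events in Table 9 are ignored, and all names are hypothetical.

```python
BUFFER_BYTES = 64  # write-buffer size stated in the text


def block_address(addr: int) -> int:
    """Return the 64-byte-aligned base address of the block containing addr."""
    return addr & ~(BUFFER_BYTES - 1)


def combine_runs(write_addrs):
    """Group consecutive write addresses that fall in the same 64-byte block."""
    runs = []
    for addr in write_addrs:
        base = block_address(addr)
        if runs and runs[-1][0] == base:
            runs[-1][1].append(addr)
        else:
            # A write to a different block closes the previous run.
            runs.append((base, [addr]))
    return runs
```

Under this model, interleaving writes to two blocks produces many short runs, whereas writing one block completely before moving to the next produces one run per block.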

  • Page 176

    160 Write-Combining Operations[...]

  • Page 177

    Overview 161. Appendix D: Performance-Monitoring Counters. This chapter describes how to use the AMD Athlon™ processor performance-monitoring counters. Overview. The AMD Athlon processor provides four 48-bit performance counters, which allow four types [...]

  • Page 178

    162 Performance Counter Usage. These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are l[...]

  • Page 179

    Performance Counter Usage 163. Unit Mask Field (Bits 8–15): These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Tab[...]
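
A small Python sketch of packing an event number and a unit mask into one register value may make the field layout concrete. The unit mask position (bits 8 to 15) comes from the text; placing the event select field in bits 0 to 7 is an assumption here, since this excerpt does not show that field's position, and the function names are hypothetical. The remaining control bits of the register are left at zero.

```python
def perf_evt_sel(event: int, unit_mask: int = 0) -> int:
    """Pack an event number and unit mask into a PerfEvtSel-style value."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event must fit in 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit_mask must fit in 8 bits")
    # Assumed layout: event select in bits 7-0, unit mask in bits 15-8.
    return (unit_mask << 8) | event


def unit_mask_of(value: int) -> int:
    """Extract the unit mask field (bits 15-8)."""
    return (value >> 8) & 0xFF
```

For instance, event 65h with unit mask 0100_0000b packs to 4065h under this layout.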

  • Page 180

    164 Performance Counter Usage. greater than or equal to the counter mask. Otherwise, if this field is zero, then the counter increments by the total number of events. Table 11. Performance-Monitoring Counters. Columns: Event Number, Source Unit, Notes / Unit Mask (bits 1 [...]

  • Page 181

    Performance Counter Usage 165. 65h, BU: 1xxx_xxxxb = reserved; x1xx_xxxxb = WB; xx1x_xxxxb = WP; xxx1_xxxxb = WT; bits 11–10 = reserved; xxxx_xx1xb = WC; xxxx_xxx1b = UC. System requests with the selected type. 73h, BU: bits 15–11 = reserved; xxxx_x1xxb = L2 (L2 hit and no DC h[...]

  • Page 182

    166 Performance Counter Usage. 7Ah, BU: Cycles that at least one fill request waited to use the L2; 80h, PC: Instruction cache fetches; 81h, PC: Instruction cache misses; 82h, PC: Instruction cache refills from L2; 83h, PC: Instruction cache refills from system; 84h, PC: L1 ITLB m[...]

  • Page 183

    Performance Counter Usage 167. PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h): The performance-counter MSRs contain the event or duration counts for the selected events being counted. The RDPMC instruction can be used by programs or pr[...]
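
The MSR address ranges quoted for PerfEvtSel[3:0] (C001_0000h to C001_0003h) and PerfCtr[3:0] (C001_0004h to C001_0007h) can be captured in a short Python helper. The constant and function names are hypothetical; the sketch only computes addresses and does not read or write any MSR.

```python
PERF_EVT_SEL_BASE = 0xC001_0000  # PerfEvtSel0, per the text
PERF_CTR_BASE = 0xC001_0004      # PerfCtr0, per the text
NUM_COUNTERS = 4


def _check(n: int) -> None:
    if not 0 <= n < NUM_COUNTERS:
        raise ValueError("counter index must be 0..3")


def perf_evt_sel_msr(n: int) -> int:
    """MSR address of PerfEvtSel[n]."""
    _check(n)
    return PERF_EVT_SEL_BASE + n


def perf_ctr_msr(n: int) -> int:
    """MSR address of PerfCtr[n]."""
    _check(n)
    return PERF_CTR_BASE + n
```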

  • Page 184

    168 Event and Time-Stamp Monitoring Software. allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are usef[...]
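
One plausible use of a negative starting count is to make a 48-bit counter wrap after a chosen number of events. The excerpt is truncated before it states the purpose of negative values, so treat that motivation as an assumption; the Python sketch below only shows the two's-complement arithmetic, with hypothetical names.

```python
COUNTER_BITS = 48  # counter width stated in the text
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preset_for_overflow_after(n_events: int) -> int:
    """Return the 48-bit two's-complement encoding of -n_events."""
    if not 0 < n_events <= (1 << (COUNTER_BITS - 1)):
        raise ValueError("n_events must be in 1..2**47")
    return (-n_events) & COUNTER_MASK


def advance(counter: int, events: int) -> int:
    """Model the counter incrementing by a number of events, wrapping at 48 bits."""
    return (counter + events) & COUNTER_MASK
```

Starting from the preset for 1000 events, advancing by 1000 returns the modeled counter to zero, which is the wrap point.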

  • Page 185

    Monitoring Counter Overflow 169. The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them, and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. [...]

  • Page 186

    170 Monitoring Counter Overflow. An event monitor application utility or another application program can read the collected performance information of the profiled application.[...]

  • Page 187

    Introduction 171. Appendix E: Programming the MTRR and PAT. Introduction. The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includ[...]

  • Page 188

    172 Memory Type Range Register (MTRR) Mechanism. There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K or 64K segment within the first 1 Mbyte of memory, ther[...]

  • Page 189

    Memory Type Range Register (MTRR) Mechanism 173. Figure 12. MTRR Mapping of Physical Memory. [Figure: physical address space from 0 to FFFFFFFFh; 8 fixed ranges (64 Kbytes each, 512 Kbytes in total); 16 fixed ranges (16 Kbytes each, 256 Kbytes in total); 64 fixed ranges (4 Kbytes each, 256 Kbytes in total); address labels 80000h, C0[...]
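
The fixed-range layout in Figure 12 can be turned into a small lookup, sketched here in Python. The range counts and sizes come from the figure; the ordering (64-Kbyte ranges first, then 16-Kbyte, then 4-Kbyte) and the C0000h boundary are inferred from the 80000h label plus the stated sizes, so treat them as assumptions. The names are hypothetical.

```python
# (start, end, segment size): 8 x 64 Kbytes, 16 x 16 Kbytes, 64 x 4 Kbytes
FIXED_REGIONS = [
    (0x00000, 0x80000, 64 * 1024),
    (0x80000, 0xC0000, 16 * 1024),
    (0xC0000, 0x100000, 4 * 1024),
]


def fixed_range_index(addr: int):
    """Return the 0-based fixed-range number covering addr, or None above 1 Mbyte."""
    index = 0
    for start, end, size in FIXED_REGIONS:
        if start <= addr < end:
            return index + (addr - start) // size
        index += (end - start) // size
    return None
```

The three regions together cover exactly the first 1 Mbyte with 8 + 16 + 64 = 88 segments; any address at or above 100000h falls through to the variable ranges.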

  • Page 190

    174 Memory Type Range Register (MTRR) Mechanism. Memory Types. Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in T[...]

  • Page 191

    Memory Type Range Register (MTRR) Mechanism 175. MTRR Default Type Register Format. The MTRR default type register is defined as follows. Figure 14. MTRR Default Type Register Format. E: MTRRs are enabled when set. All MTRRs (both fixed and variable range) [...]

  • Page 192

    176 Memory Type Range Register (MTRR) Mechanism. Note that if two or more variable memory ranges match, then the interactions are defined as follows: 1. If the memory types are identical, then that memory type is used. 2. If one or more of the memory type[...]

  • Page 193

    Page Attribute Table (PAT) 177. not affected by this issue; only the variable range (and MTRR DefType) registers are affected. Page Attribute Table (PAT). The Page Attribute Table (PAT) is an extension of the page table entry format, which allows t[...]

  • Page 194

    178 Page Attribute Table (PAT). Accessing the PAT. A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is d[...]
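
Forming the 3-bit index from the three page-table bits can be sketched in Python as follows. The bit order (PATi as the most significant bit, then PCD, then PWT) follows the order in which the text lists them and is otherwise an assumption; the function name is hypothetical. A 3-bit index takes values 0 through 7.

```python
def pat_index(pat_i: int, pcd: int, pwt: int) -> int:
    """Combine the PATi, PCD, and PWT bits into a 3-bit PAT field index."""
    for bit in (pat_i, pcd, pwt):
        if bit not in (0, 1):
            raise ValueError("each input must be 0 or 1")
    return (pat_i << 2) | (pcd << 1) | pwt
```

For example, PATi = 1, PCD = 0, PWT = 1 selects field 5 under this ordering.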

  • Page 195

    Page Attribute Table (PAT) 179. Table 15. Effective Memory Type Based on PAT and MTRRs. Columns: PAT Memory Type / MTRR Memory Type / Effective Memory Type. Rows: UC- / WB, WT, WP, WC / UC-Page; UC- / UC / UC-MTRR; WC / x / WC; WT / WB, WT / WT; WT / UC / UC; WT / WC / CD; WT / WP / CD; WP / WB, WP / WP; WP / UC / UC-MTRR; WP / WC, WT / CD[...]

  • Page 196

    180 Page Attribute Table (PAT). Table 16. Final Output Memory Types. Input Memory Type columns: RdMem, WrMem, Effective MType, forceCD (note 5), AMD-751. Output Memory Type columns: RdMem, WrMem, MemType. Last column: Note. Rows: ●● UC - ●● UC 1; ●● CD - ●● CD 1; ●● WC - ●● WC 1; ●[...]

  • Page 197

    Page Attribute Table (PAT) 181. ●● CD - ●● CD; ●● WC - ●● WC; ●● WT - ●● WT; ●● WP - ●● WP; ●● WB - ●● WT 4; ●● - ●● ● CD 2. Notes: 1. WP is not functional for RdMem/WrMem. 2. ForceCD must cause the MTRR memory type to be[...]

  • Page 198

    182 Page Attribute Table (PAT). MTRR Fixed-Range Register Format. The memory types for the memory segments covered by each of the MTRR fixed-range registers are defined in Table 17 (see also “Standard MTRR Types and Properties” on page 176).[...]

  • Page 199

    Page Attribute Table (PAT) 183. Variable-Range MTRRs. A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap. The upper two va[...]

  • Page 200

    184 Page Attribute Table (PAT). Figure 17. MTRRphysMaskn Register Format. Note: A software attempt to write to reserved bits will generate a general protection exception. Physical Mask: Specifies a 24-bit mask to determine the range of the region defined in t[...]
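
How a base/mask pair describes a region can be sketched in Python. The excerpt only states that the mask is 24 bits wide; the comparison rule used below (an address matches when its masked page number equals the masked base) and the choice of physical address bits 35 to 12 follow the general MTRR convention and are assumptions here, as are the function names.

```python
PAGE_SHIFT = 12  # assumed: base and mask fields describe 4-Kbyte granules
MASK_BITS = 24   # mask width stated in the text
FIELD = (1 << MASK_BITS) - 1


def mask_for_size(size_bytes: int) -> int:
    """Return the 24-bit mask for a power-of-two region of at least 4 Kbytes."""
    if size_bytes < (1 << PAGE_SHIFT) or size_bytes & (size_bytes - 1):
        raise ValueError("size must be a power of two and at least 4096")
    return ~((size_bytes >> PAGE_SHIFT) - 1) & FIELD


def in_variable_range(addr: int, phys_base: int, phys_mask: int) -> bool:
    """Check whether addr falls in the region described by the base/mask fields."""
    page = (addr >> PAGE_SHIFT) & FIELD
    return (page & phys_mask) == (phys_base & phys_mask & FIELD)
```

As a usage example, a 32-Mbyte region starting at E000_0000h uses a base field of E0000h and a mask of FFE000h under these assumptions.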

  • Page 201

    Page Attribute Table (PAT) 185. MTRR MSR Format. This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits. Table 18. MTRR-Related Model-Specific Register (MSR) Map. Register[...]

  • Page 202

    186 Page Attribute Table (PAT)[...]

  • Page 203

    Instruction Dispatch and Execution Resources 187. Appendix F: Instruction Dispatch and Execution Resources. This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations.[...]

  • Page 204

    188 Instruction Dispatch and Execution Resources. ■ disp16/32: 16-bit or 32-bit displacement value ■ disp32/48: 32-bit or 48-bit displacement value ■ eXX: register width depending on the operand size ■ mem32real: 32-bit floating-point value[...]

  • Page 205

    Instruction Dispatch and Execution Resources 189. ADC mreg8, reg8 10h 11-xxx-xxx DirectPath; ADC mem8, reg8 10h mm-xxx-xxx DirectPath; ADC mreg16/32, reg16/32 11h 11-xxx-xxx DirectPath; ADC mem16/32, reg16/32 11h mm-xxx-xxx DirectPath; ADC reg8, mreg8 12h 11[...]
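
The ModR/M patterns in these rows (for example 11-xxx-xxx and mm-xxx-xxx) can be checked against a concrete byte with a short Python sketch. It assumes the usual x86 field split (mod in bits 7 to 6, reg in bits 5 to 3, r/m in bits 2 to 0) and reads "mm" as any mod value other than 11, meaning a memory operand; the notation list in this excerpt is truncated, so that reading is an assumption. The function names are hypothetical.

```python
def split_modrm(byte: int):
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("ModR/M must be a single byte")
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def matches(pattern: str, byte: int) -> bool:
    """Test a byte against a table pattern such as '11-010-xxx' or 'mm-xxx-xxx'."""
    bits = format(byte, "08b")
    pat = pattern.replace("-", "")
    if len(pat) != 8:
        raise ValueError("pattern must describe 8 bits")
    if pat.startswith("mm"):
        if bits[:2] == "11":
            return False  # mod = 11 is the register form, not a memory form
        pat = "xx" + pat[2:]
    return all(p in ("x", b) for p, b in zip(pat, bits))
```

For example, the byte C1h has mod = 11 and therefore matches the register-form pattern 11-xxx-xxx but not mm-xxx-xxx.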

  • Page 206

    190 Instruction Dispatch and Execution Resources. AND mem8, reg8 20h mm-xxx-xxx DirectPath; AND mreg16/32, reg16/32 21h 11-xxx-xxx DirectPath; AND mem16/32, reg16/32 21h mm-xxx-xxx DirectPath; AND reg8, mreg8 22h 11-xxx-xxx DirectPath; AND reg8, mem8 22h mm[...]

  • Page 207

    Instruction Dispatch and Execution Resources 191. BT mem16/32, imm8 0Fh BAh mm-100-xxx DirectPath; BTC mreg16/32, reg16/32 0Fh BBh 11-xxx-xxx VectorPath; BTC mem16/32, reg16/32 0Fh BBh mm-xxx-xxx VectorPath; BTC mreg16/32, imm8 0Fh BAh 11-111-xxx Vector[...]

  • Page 208

    192 Instruction Dispatch and Execution Resources. CMOVE/CMOVZ reg16/32, reg16/32 0Fh 44h 11-xxx-xxx DirectPath; CMOVE/CMOVZ reg16/32, mem16/32 0Fh 44h mm-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/32, reg16/32 0Fh 4Fh 11-xxx-xxx DirectPath; CMOVG/CMOVNLE reg16/3[...]

  • Page 209

    Instruction Dispatch and Execution Resources 193. CMP EAX, imm16/32 3Dh DirectPath; CMP mreg8, imm8 80h 11-111-xxx DirectPath; CMP mem8, imm8 80h mm-111-xxx DirectPath; CMP mreg16/32, imm16/32 81h 11-111-xxx DirectPath; CMP mem16/32, imm16/32 81h mm-11[...]

  • Page 210

    194 Instruction Dispatch and Execution Resources. DIV EAX, mreg16/32 F7h 11-110-xxx VectorPath; DIV EAX, mem16/32 F7h mm-110-xxx VectorPath; ENTER C8 VectorPath; IDIV mreg8 F6h 11-111-xxx VectorPath; IDIV mem8 F6h mm-111-xxx VectorPath; IDIV EAX, mreg16[...]

  • Page 211

    Instruction Dispatch and Execution Resources 195. INC mreg8 FEh 11-000-xxx DirectPath; INC mem8 FEh mm-000-xxx DirectPath; INC mreg16/32 FFh 11-000-xxx DirectPath; INC mem16/32 FFh mm-000-xxx DirectPath; INVD 0Fh 08h VectorPath; INVLPG 0Fh 01h mm-111-xxx VectorPat[...]

  • Page 212

    196 Instruction Dispatch and Execution Resources. JP/JPE near disp16/32 0Fh 8Ah DirectPath; JNP/JPO near disp16/32 0Fh 8Bh DirectPath; JL/JNGE near disp16/32 0Fh 8Ch DirectPath; JNL/JGE near disp16/32 0Fh 8Dh DirectPath; JLE/JNG near disp16/32 0Fh 8Eh DirectPa[...]

  • Page 213

    Instruction Dispatch and Execution Resources 197. LOOPE/LOOPZ disp8 E1h VectorPath; LOOPNE/LOOPNZ disp8 E0h VectorPath; LSL reg16/32, mreg16/32 0Fh 03h 11-xxx-xxx VectorPath; LSL reg16/32, mem16/32 0Fh 03h mm-xxx-xxx VectorPath; LSS reg16/32, mem32/48 0Fh[...]

  • Page 214

    198 Instruction Dispatch and Execution Resources. MOV EDX, imm16/32 BAh DirectPath; MOV EBX, imm16/32 BBh DirectPath; MOV ESP, imm16/32 BCh DirectPath; MOV EBP, imm16/32 BDh DirectPath; MOV ESI, imm16/32 BEh DirectPath; MOV EDI, imm16/32 BFh DirectPath; MOV mre[...]

  • Page 215

    Instruction Dispatch and Execution Resources 199. NOT mem8 F6h mm-010-xx DirectPath; NOT mreg16/32 F7h 11-010-xxx DirectPath; NOT mem16/32 F7h mm-010-xx DirectPath; OR mreg8, reg8 08h 11-xxx-xxx DirectPath; OR mem8, reg8 08h mm-xxx-xxx DirectPath; OR mreg16/32, [...]

  • Page 216

    200 Instruction Dispatch and Execution Resources. POP EBX 5Bh VectorPath; POP ESP 5Ch VectorPath; POP EBP 5Dh VectorPath; POP ESI 5Eh VectorPath; POP EDI 5Fh VectorPath; POP mreg16/32 8Fh 11-000-xxx VectorPath; POP mem16/32 8Fh mm-000-xxx VectorPath; POPA/POPA[...]

  • Page 217

    Instruction Dispatch and Execution Resources 201. RCL mreg8, 1 D0h 11-010-xxx DirectPath; RCL mem8, 1 D0h mm-010-xxx DirectPath; RCL mreg16/32, 1 D1h 11-010-xxx DirectPath; RCL mem16/32, 1 D1h mm-010-xxx DirectPath; RCL mreg8, CL D2h 11-010-xxx Di[...]

  • Page 218

    202 Instruction Dispatch and Execution Resources. ROL mreg16/32, 1 D1h 11-000-xxx DirectPath; ROL mem16/32, 1 D1h mm-000-xxx DirectPath; ROL mreg8, CL D2h 11-000-xxx DirectPath; ROL mem8, CL D2h mm-000-xxx DirectPath; ROL mreg16/32, CL D3h 11-000-xxx DirectPath; ROL me[...]

  • Page 219

    Instruction Dispatch and Execution Resources 203. SBB mreg16/32, reg16/32 19h 11-xxx-xxx DirectPath; SBB mem16/32, reg16/32 19h mm-xxx-xxx DirectPath; SBB reg8, mreg8 1Ah 11-xxx-xxx DirectPath; SBB reg8, mem8 1Ah mm-xxx-xxx DirectPath; SBB reg16/32, mreg1 [...]

  • Page 220

    204 Instruction Dispatch and Execution Resources. SETS mreg8 0Fh 98h 11-xxx-xxx DirectPath; SETS mem8 0Fh 98h mm-xxx-xxx DirectPath; SETNS mreg8 0Fh 99h 11-xxx-xxx DirectPath; SETNS mem8 0Fh 99h mm-xxx-xxx DirectPath; SETP/SETPE mreg8 0Fh 9Ah 11-xxx-xxx Direc[...]

  • Page 221

    Instruction Dispatch and Execution Resources 205. SHR mem16/32, imm8 C1h mm-101-xxx DirectPath; SHR mreg8, 1 D0h 11-101-xxx DirectPath; SHR mem8, 1 D0h mm-101-xxx DirectPath; SHR mreg16/32, 1 D1h 11-101-xxx DirectPath; SHR mem16/32, 1 D1h mm-101-xxx Dire[...]

  • Page 222

    206 Instruction Dispatch and Execution Resources. SUB reg8, mreg8 2Ah 11-xxx-xxx DirectPath; SUB reg8, mem8 2Ah mm-xxx-xxx DirectPath; SUB reg16/32, mreg16/32 2Bh 11-xxx-xxx DirectPath; SUB reg16/32, mem16/32 2Bh mm-xxx-xxx DirectPath; SUB AL, imm8 [...]

  • Page 223

    Instruction Dispatch and Execution Resources 207. XADD mreg8, reg8 0Fh C0h 11-100-xxx VectorPath; XADD mem8, reg8 0Fh C0h mm-100-xxx VectorPath; XADD mreg16/32, reg16/32 0Fh C1h 11-101-xxx VectorPath; XADD mem16/32, reg16/32 0Fh C1h mm-101-xxx VectorP[...]

  • Page 224

    208 Instruction Dispatch and Execution Resources. Table 20. MMX™ Instructions. Columns: Instruction Mnemonic, Prefix Byte(s), First Byte, ModR/M Byte, Decode Type, FPU Pipe(s), Notes. Rows: EMMS 0Fh 77h DirectPath FADD/FMUL/FSTORE; MOVD mmreg, reg32 0Fh 6Eh 11-xxx-x[...]

  • Page 225

    Instruction Dispatch and Execution Resources 209. PANDN mmreg1, mmreg2 0Fh DFh 11-xxx-xxx DirectPath FADD/FMUL; PANDN mmreg, mem64 0Fh DFh mm-xxx-xxx DirectPath FADD/FMUL; PCMPEQB mmreg1, mmreg2 0Fh 74h 11-xxx-xxx DirectPath FADD/FMUL; PCMPEQB mmreg,[...]

  • Page 226

    210 Instruction Dispatch and Execution Resources. PSRAW mmreg1, mmreg2 0Fh E1h 11-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, mem64 0Fh E1h mm-xxx-xxx DirectPath FADD/FMUL; PSRAW mmreg, imm8 0Fh 71h 11-100-xxx DirectPath FADD/FMUL; PSRAD mmreg1, mmreg[...]

  • Page 227

    Instruction Dispatch and Execution Resources 211. PUNPCKHDQ mmreg1, mmreg2 0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMUL; PUNPCKHDQ mmreg, mem64 0Fh 6Ah mm-xxx-xxx DirectPath FADD/FMUL; PUNPCKHWD mmreg1, mmreg2 0Fh 69h 11-xxx-xxx DirectPath FADD/FMUL; P[...]

  • Page 228

    212 Instruction Dispatch and Execution Resources. PMINSW mmreg, mem64 0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg1, mmreg2 0Fh DAh 11-xxx-xxx DirectPath FADD/FMUL; PMINUB mmreg, mem64 0Fh DAh mm-xxx-xxx DirectPath FADD/FMUL; PMOVMSKB reg[...]

  • Page 229

    Instruction Dispatch and Execution Resources 213. FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath; FCMOVE ST(0), ST(i) DAh C8-CFh VectorPath; FCMOVBE ST(0), ST(i) DAh D0-D7h VectorPath; FCMOVU ST(0), ST(i) DAh D8-DFh VectorPath; FCMOVNB ST(0), ST(i) DBh C0-C7h VectorPa[...]

  • Page 230

    214 Instruction Dispatch and Execution Resources. FIADD [mem32int] DAh mm-000-xxx VectorPath; FIADD [mem16int] DEh mm-000-xxx VectorPath; FICOM [mem32int] DAh mm-010-xxx VectorPath; FICOM [mem16int] DEh mm-010-xxx VectorPath; FICOMP [mem32int] DAh mm-011[...]

  • Page 231

    Instruction Dispatch and Execution Resources 215. FLDCW [mem16] D9h mm-101-xxx VectorPath; FLDENV [mem14byte] D9h mm-100-xxx VectorPath; FLDENV [mem28byte] D9h mm-100-xxx VectorPath; FLDL2E D9h EAh DirectPath FSTORE; FLDL2T D9h E9h DirectPath FSTORE; FLDLG[...]

  • Page 232

    216 Instruction Dispatch and Execution Resources. FSTCW [mem16] D9h mm-111-xxx VectorPath; FSTENV [mem14byte] D9h mm-110-xxx VectorPath; FSTENV [mem28byte] D9h mm-110-xxx VectorPath; FSTP [mem32real] D9h mm-011-xxx DirectPath FADD/F[...]

  • Page 233

    Instruction Dispatch and Execution Resources 217. Table 23. 3DNow!™ Instructions. Columns: Instruction Mnemonic, Prefix Byte(s), imm8, ModR/M Byte, Decode Type, FPU Pipe(s), Note. Rows: FEMMS 0Fh 0Eh DirectPath FADD/FMUL/FSTORE 2; PAVGUSB mmreg1, mmreg2 0Fh, 0Fh BF[...]

  • Page 234

    218 Instruction Dispatch and Execution Resources. PFRSQRT mmreg, mem64 0Fh, 0Fh 97h mm-xxx-xxx DirectPath FMUL; PFSUB mmreg1, mmreg2 0Fh, 0Fh 9Ah 11-xxx-xxx DirectPath FADD; PFSUB mmreg, mem64 0Fh, 0Fh 9Ah mm-xxx-xxx DirectPath FADD; PFSUBR mmreg1, mmr[...]

  • Page 235

    Select DirectPath Over VectorPath Instructions 219. Appendix G: DirectPath versus VectorPath Instructions. Select DirectPath Over VectorPath Instructions. Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimiz [...]

  • Page 236

    220 DirectPath Instructions. Table 25. DirectPath Integer Instructions. Instruction Mnemonic: ADC mreg8, reg8; ADC mem8, reg8; ADC mreg16/32, reg16/32; ADC mem16/32, reg16/32; ADC reg8, mreg8; ADC reg8, mem8; ADC reg16/32, mreg16/32; ADC reg16/32, mem16/32; ADC AL, imm[...]

  • Page 237

    DirectPath Instructions 221. CMOVBE/CMOVNA reg16/32, reg16/32; CMOVBE/CMOVNA reg16/32, mem16/32; CMOVE/CMOVZ reg16/32, reg16/32; CMOVE/CMOVZ reg16/32, mem16/32; CMOVG/CMOVNLE reg16/32, reg16/32; CMOVG/CMOVNLE reg16/32, mem16/32; CMOVGE/CMOVNL reg16/32, reg[...]

  • Page 238

    222 DirectPath Instructions. JNO short disp8; JB/JNAE short disp8; JNB/JAE short disp8; JZ/JE short disp8; JNZ/JNE short disp8; JBE/JNA short disp8; JNBE/JA short disp8; JS short disp8; JNS short disp8; JP/JPE short disp8; JNP/JPO short disp8; JL/JNGE short dis[...]

  • Page 239

    DirectPath Instructions 223. MOV mem16/32, imm16/32; MOVSX reg16/32, mreg8; MOVSX reg16/32, mem8; MOVSX reg32, mreg16; MOVSX reg32, mem16; MOVZX reg16/32, mreg8; MOVZX reg16/32, mem8; MOVZX reg32, mreg16; MOVZX reg32, mem16; NEG mreg8; NEG mem8; NEG mreg16/32; NEG mem1 [...]

  • Page 240

    224 DirectPath Instructions. ROL mreg8, CL; ROL mem8, CL; ROL mreg16/32, CL; ROL mem16/32, CL; ROR mreg8, imm8; ROR mem8, imm8; ROR mreg16/32, imm8; ROR mem16/32, imm8; ROR mreg8, 1; ROR mem8, 1; ROR mreg16/32, 1; ROR mem16/32, 1; ROR mreg8, CL; ROR mem8, CL; ROR mreg16/32, CL[...]

  • Page 241

    DirectPath Instructions 225. SETL/SETNGE mreg8; SETL/SETNGE mem8; SETGE/SETNL mreg8; SETGE/SETNL mem8; SETLE/SETNG mreg8; SETLE/SETNG mem8; SETG/SETNLE mreg8; SETG/SETNLE mem8; SHL/SAL mreg8, imm8; SHL/SAL mem8, imm8; SHL/SAL mreg16/32, imm8; SHL/SAL mem1 [...]

  • Page 242

    226 DirectPath Instructions. XOR reg16/32, mem16/32; XOR AL, imm8; XOR EAX, imm16/32; XOR mreg8, imm8; XOR mem8, imm8; XOR mreg16/32, imm16/32; XOR mem16/32, imm16/32; XOR mreg16/32, imm8 (sign extended); XOR mem16/32, imm8 (sign extended). Table 25. DirectPath [...]

  • Page 243

    DirectPath Instructions 227. Table 26. DirectPath MMX™ Instructions. Instruction Mnemonic: EMMS; MOVD mmreg, mem32; MOVD mem32, mmreg; MOVQ mmreg1, mmreg2; MOVQ mmreg, mem64; MOVQ mmreg2, mmreg1; MOVQ mem64, mmreg; PACKSSDW mmreg1, mmreg2; PACKSSDW mmreg, mem64; PACK[...]

  • Page 244

    228 DirectPath Instructions. PSRLD mmreg, imm8; PSRLQ mmreg1, mmreg2; PSRLQ mmreg, mem64; PSRLQ mmreg, imm8; PSRLW mmreg1, mmreg2; PSRLW mmreg, mem64; PSRLW mmreg, imm8; PSUBB mmreg1, mmreg2; PSUBB mmreg, mem64; PSUBD mmreg1, mmreg2; PSUBD mmreg[...]

  • Page 245

    DirectPath Instructions 229. Table 28. DirectPath Floating-Point Instructions. Instruction Mnemonic: FABS; FADD ST, ST(i); FADD [mem32real]; FADD ST(i), ST; FADD [mem64real]; FADDP ST(i), ST; FCHS; FCOM ST(i); FCOMP ST(i); FCOM [mem32real]; FCOM [mem64real]; FCOMP [mem32[...]

  • Page 246

    230 DirectPath Instructions. FSUB ST(i), ST; FSUBP ST, ST(i); FSUBR [mem32real]; FSUBR [mem64real]; FSUBR ST, ST(i); FSUBR ST(i), ST; FSUBRP ST(i), ST; FTST; FUCOM; FUCOMP; FUCOMPP; FWAIT; FXCH. Table 28. DirectPath Floating-Point Instructions. Instruction Mnemonic [...]

  • Page 247

    VectorPath Instructions 231. VectorPath Instructions. The following tables contain VectorPath instructions, which should be avoided in the AMD Athlon processor: ■ Table 29, “VectorPath Integer Instructions,” on page 231 ■ Table 30 [...]

  • Page 248

    232 VectorPath Instructions. DIV EAX, mem16/32; ENTER; IDIV mreg8; IDIV mem8; IDIV EAX, mreg16/32; IDIV EAX, mem16/32; IMUL reg16/32, imm16/32; IMUL reg16/32, mreg16/32, imm16/32; IMUL reg16/32, mem16/32, imm16/32; IMUL reg16/32, imm8 (sign ext[...]

  • Page 249

    VectorPath Instructions 233. MUL EAX, mem32; OUT imm8, AL; OUT imm8, AX; OUT imm8, EAX; OUT DX, AL; OUT DX, AX; OUT DX, EAX; POP ES; POP SS; POP DS; POP FS; POP GS; POP EAX; POP ECX; POP EDX; POP EBX; POP ESP; POP EBP; POP ESI; POP EDI; POP mreg16/32; POP mem16/32; POPA/POP[...]

  • Page 250

    234 VectorPath Instructions. STI; STOSB mem8, AL; STOSW mem16, AX; STOSD mem32, EAX; STR mreg16; STR mem16; SYSCALL; SYSENTER; SYSEXIT; SYSRET; VERR mreg16; VERR mem16; VERW mreg16; VERW mem16; WBINVD; WRMSR; XADD mreg8, reg8; XADD mem8, reg8; XADD mreg16/32, reg[...]

  • Page 251

    VectorPath Instructions 235. Table 32. VectorPath Floating-Point Instructions. Instruction Mnemonic: F2XM1; FBLD [mem80]; FBSTP [mem80]; FCLEX; FCMOVB ST(0), ST(i); FCMOVE ST(0), ST(i); FCMOVBE ST(0), ST(i); FCMOVU ST(0), ST(i); FCMOVNB ST(0), ST(i); FCMOVNE ST(0), ST(i); FC[...]

  • Page 252

    236 VectorPath Instructions[...]

  • Page 253

    Index 237. Index. Numerics: 3DNow!™ Instructions, 10, 107; 3DNow! and MMX™ Intra-Operand Swapping, 112; Clipping, 122; Fast[...]

  • Page 254

    238 Index. Instruction Cache, 131; Control Unit, 134; Decoding [...]

  • Page 255

    Index 239. T: TBYTE Variables, 55; Trigonometric Instructions, 103. V: VectorPath Decoder [...]

  • Page 256

    240 Index[...]