AMD x86 User Manual
- Browse the manual online or download it
- 256 pages
- 2.99 MB
Similar user manuals
- Computer Hardware: AMD Geode LX CS5536, 8 pages, 1.52 MB
- Computer Hardware: AMD 790GX, 63 pages, 2.03 MB
- Computer Hardware: AMD SC3200, 428 pages, 3.26 MB
- Impact Driver: AMD GA-7VASFS-FS, 52 pages, 3.55 MB
- Video Card: AMD FireMV 2260, 46 pages
- Computer Hardware: AMD AN9 32X, 76 pages, 3.42 MB
- Computer Hardware: AMD SB750, 63 pages, 2.03 MB
- Computer Hardware: AMD GA-K8N51GMF, 88 pages, 7 MB
A proper user manual
Regulations oblige the seller to hand over the AMD x86 user manual to the buyer together with the goods. A missing manual, or incorrect information given to the consumer, is grounds for a complaint that the device does not conform to the contract. The law permits supplying the manual in a form other than paper, which has recently become very common: a graphical or electronic manual for the AMD x86, or instructional videos for users. The condition is that its form is legible and understandable.
What is a user manual?
The word comes from the Latin "instructio", meaning to arrange. Accordingly, the AMD x86 manual describes procedures step by step. The purpose of a manual is to instruct and to simplify starting and using the device or carrying out specific tasks. A manual is a collection of information about an item or service; a guide.
Unfortunately, few users take the time to read the AMD x86 user manual. A good manual not only reveals a range of additional functions of the purchased device, it also helps avoid many mistakes.
So what should an ideal user manual contain?
Above all, the AMD x86 user manual should contain:
- Information about the technical data of the AMD x86
- The name of the manufacturer and the year of manufacture of the AMD x86
- Principles of operation, adjustment, and maintenance of the AMD x86
- Safety marks and certificates confirming compliance with the relevant standards
Why don't we read user manuals?
The reasons are a lack of time and overconfidence about the functions of the purchased devices. Unfortunately, connecting and starting the AMD x86 is not enough. A manual contains a range of guidance on specific functions, safety rules, maintenance methods (even which cleaning agents to use), possible faults of the AMD x86, and ways to solve problems that may arise during use. Finally, the manual lists the contact number for AMD service in case the suggested solutions do not work. Manuals in the form of engaging animations or video guides are currently gaining popularity, as they appeal to users more than a brochure does. This kind of manual is intended to ensure that the user watches the whole video instead of skipping the detailed and complicated technical descriptions of the AMD x86, as happens with paper manuals.
Why should one read user manuals?
In the user manual we find, above all, answers about the construction and capabilities of the AMD x86, the use of particular accessories, and a range of information that lets us use all of its functions and conveniences.
After a successful purchase, one should take some time to get to know every part of the AMD x86 manual. Today manuals are carefully prepared or translated so that they are not only understandable to users but also fulfil their basic purpose of helping and informing.
Table of contents of the user manual
-
Page 1
AMD Athlon™ Processor x86 Code Optimization Guide[...]
-
Page 2
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. MMX is a trademark and Pentium is a registere[...]
-
Page 3
Contents iii 22007E/0 — November 1999 AMD Athlon™ Processor x86 Code Optimization
Contents
Revision History . . . xv
1 Introduction 1
About this Document . . . 1
AMD Athlon™ Processor F[...]
-
Page 4
Switch Statement Usage . . . 21
Optimize Switch Statements . . . 21
Use Prototypes for All Functions . . . [...]
-
Page 5
Use 8-Bit Sign-Extended Displacements . . . 39
Code Padding Using Neutral Code Fillers . . . 39
Recommendations for the AMD Athlon Processor . . . 40
Recommendations f[...]
-
Page 6
7 Scheduling Optimizations 67
Schedule Instructions According to their Latency . . . 67
Unrolling Loops . . . 67
Complete Loop Unrolling . . .[...]
-
Page 7
Signed Derivation for Algorithm, Multiplier, and Shift Factor . . . 95
9 Floating-Point Optimizations 97
Ensure All FPU Data is Aligned . . .[...]
-
Page 8
Fast Conversion of Signed Words to Floating-Point . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP . . . 114
Use MMX Instruct[...]
-
Page 9
Integer Execution Unit . . . 135
Floating-Point Scheduler . . . 136
Floating-Point Execution Unit . . . 137
Loa[...]
-
Page 10
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h) . . . 167
Starting and Stopping the Performance-Monitoring Counters . . . 168
Event and Time-Stamp[...]
-
Page 11
List of Figures
Figure 1. AMD Athlon™ Processor Block Diagram . . . 131
Figure 2. Integer Execution Pipeline . . . 135
Figure 3. Floating-Point Unit Block Diagram . . . [...]
-
Page 12
xii List of Figures[...]
-
Page 13
List of Tables
Table 1. Latency of Repeated String Instructions . . . 84
Table 2. Integer Pipeline Operation Types . . . 149
Table 3. Integer Decode Types . . . [...]
-
Page 14
Table 29. VectorPath Integer Instructions . . . 231
Table 30. VectorPath MMX Instructions . . . 234
Table 31. VectorPath MMX Extensions . . . 234[...]
-
Page 15
Revision History
Date Rev Description
Nov. 1999 E Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14. Added the optimization, “Use Array Style Instead of Pointer Style Cod[...]
-
Page 16
xvi Revision History[...]
-
Page 17
1 Introduction
The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new lev[...]
-
Page 18
previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this document. Summ[...]
-
Page 19
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C: Implementation of Write Combining. Describes the algorithm used by the [...]
-
Page 20
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software. A brief summ[...]
-
Page 21
AMD Athlon execution core to achieve and sustain maximum performance. As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the [...]
-
Page 22
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. Ho[...]
-
Page 23
2 Top Optimizations
This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. [...]
-
Page 24
■ Avoid Placing Code and Data in the Same 64-Byte Cache Line
Optimization Star
The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent [...]
-
Page 25
anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS/C)
■ Round up to the nearest cache line.
■ DS i[...]
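The formula above can be turned into a small helper. This is a minimal sketch, not the guide's own code: the preview cuts off before DS and C are defined, so the sketch assumes DS is the data stride in bytes per loop iteration and C is the number of cycles per loop iteration. The function name, the parameter names, and the 64-byte line constant are illustrative.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64u  /* assumed cache-line size for this sketch */

/* Prefetch distance in bytes: 200 * (DS / C), rounded up to a whole
 * cache line. ds_bytes and cycles_per_iter are the assumed meanings
 * of DS and C; they are not confirmed by the truncated text. */
unsigned prefetch_distance(unsigned ds_bytes, unsigned cycles_per_iter)
{
    unsigned raw, lines;

    assert(cycles_per_iter > 0u);
    /* Multiply first so integer division does not lose precision,
     * then take the ceiling of the quotient. */
    raw = (200u * ds_bytes + cycles_per_iter - 1u) / cycles_per_iter;
    /* Round up to the next multiple of the cache-line size. */
    lines = (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES;
    return lines * CACHE_LINE_BYTES;
}
```

Under these assumptions, a loop that consumes 8 bytes every 10 cycles yields a raw distance of 160 bytes, which rounds up to three cache lines.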
-
Page 26
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands. The floating-point load-execute instructions with integer operand[...]
-
Page 27
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors. Code and data should not be shared in the same 64-byt[...]
-
Page 28
12 Group II Optimizations - Secondary Optimizations[...]
-
Page 29
3 C Source Level Optimizations
This chapter details C programming practices for optimizing code for the AMD Athlon™ processor. Guidelines are listed in order of importance.
Ensure Fl[...]
-
Page 30
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, [...]
-
Page 31
Example (Avoid):
int i;            ====> MOV EAX, i
i = i / 4;              CDQ
                        AND EDX, 3
                        ADD EAX, EDX
                        SAR EAX, 2
                        MOV i, EAX
Example (Preferred):
unsigned int i;   ====> SHR i, 2
i = i / 4;
In summary: Use unsigned types for:
■ Divisio[...]
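The difference between the two listings comes from C's rounding rule: signed division truncates toward zero, so a plain arithmetic shift is not enough for negative values and the compiler has to add a fix-up, whereas unsigned division by a power of two is exactly a logical shift. A minimal sketch of that behaviour (the function names are illustrative):

```c
#include <assert.h>  /* only needed if you add checks around these helpers */

/* Signed: -7 / 4 must give -1 (truncation toward zero), so the
 * generated code needs the extra CDQ/AND/ADD adjustment. */
int quarter_signed(int i)
{
    return i / 4;
}

/* Unsigned: division by 4 and a right shift by 2 always agree,
 * so a single shift instruction suffices. */
unsigned int quarter_unsigned(unsigned int i)
{
    return i / 4;
}

unsigned int quarter_shift(unsigned int i)
{
    return i >> 2;
}
```

If a quantity can never be negative, declaring it unsigned lets the compiler pick the cheaper form without any change to the arithmetic in the source.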
-
Page 32
Note that source code transformations will interact with a compiler's code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source c[...]
-
Page 33
*res++ = dp; /* write transformed z */
dp = vv->x * *m++;
dp += vv->y * *m++;
dp += vv->z * *m++;
dp += vv->w * *m++;
*res++ = dp; /* write transformed w */
++vv; /* next input vertex */
m -= 16; /* reset to s[...]
-
Page 34
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor's large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop b[...]
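As an illustration of complete unrolling, here is a minimal sketch using a hypothetical four-element dot product (the function names are not from the guide). The unrolled version has no loop counter, no compare, and no branch.

```c
#include <assert.h>  /* only needed if you add checks around these helpers */

/* Rolled form: four iterations, each paying for loop control. */
float dot4_loop(const float *a, const float *b)
{
    float sum = 0.0f;
    int i;

    for (i = 0; i < 4; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Completely unrolled form: the body is replicated four times and the
 * loop control disappears. The additions happen in the same order. */
float dot4_unrolled(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
```

Many compilers perform this transformation on their own at higher optimization levels, so it is worth inspecting the generated code before unrolling by hand.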
-
Page 35
code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. I[...]
-
Page 36
Consider Expression Order in Compound Branch Conditions
Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && a[...]
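C evaluates the operands of && and || from left to right and stops as soon as the result is known, so operand order decides which tests actually run. The sketch below uses two hypothetical predicates and a counter to make that visible; it illustrates the language rule and is not code from the guide.

```c
#include <assert.h>  /* only needed if you add checks around these helpers */

static int costly_calls = 0;  /* counts how often the expensive test runs */

/* Cheap test that fails for many inputs in this sketch. */
static int is_candidate(int n)
{
    return n > 0;
}

/* Stand-in for an expensive test. */
static int passes_costly_check(int n)
{
    costly_calls++;
    return n % 7 == 0;
}

/* With the cheap, frequently false test first, && short-circuits and
 * the expensive test is skipped for every rejected input. */
int accept(int n)
{
    return is_candidate(n) && passes_costly_check(n);
}
```

The same reasoning applies to ||, where evaluation stops at the first operand that is true.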
-
Page 37
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a swit[...]
-
Page 38
Use Const Type Qualifier
Use the “const” type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available t[...]
-
Page 39
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this [...]
-
Page 40
case combine( 1, 1 ):
    for( i ... ) {
        DoWork1( i );
        DoWork3( i );
    }
    break;
default:
    break;
}
The trick here is that there is some up-front work involved in generating all the combinations for the switch constant and the total[...]
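The fragment above is the tail of a switch that selects one specialised loop per combination of loop-invariant flags. A self-contained sketch of the same idea follows. The COMBINE macro, the two work functions, and the flag names are hypothetical stand-ins, since the preview does not show how the guide defines them.

```c
#include <assert.h>

/* Hypothetical encoding of two 0/1 flags into one switch constant. */
#define COMBINE(c1, c2) (((c1) << 1) + (c2))

static void work_a(int *out, int i) { out[i] += 1; }
static void work_b(int *out, int i) { out[i] += 10; }

/* The flags do not change inside the loop, so they are tested once,
 * outside it, and each case runs a loop body free of flag checks. */
void run_hoisted(int *out, int n, int flag_a, int flag_b)
{
    int i;

    assert(flag_a == 0 || flag_a == 1);
    assert(flag_b == 0 || flag_b == 1);

    switch (COMBINE(flag_a, flag_b)) {
    case COMBINE(1, 1):
        for (i = 0; i < n; i++) { work_a(out, i); work_b(out, i); }
        break;
    case COMBINE(1, 0):
        for (i = 0; i < n; i++) { work_a(out, i); }
        break;
    case COMBINE(0, 1):
        for (i = 0; i < n; i++) { work_b(out, i); }
        break;
    default:
        break;  /* neither flag set: nothing to do */
    }
}
```

The price is code size: the number of specialised loops grows with the number of flag combinations.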
-
Page 41
which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
Dynamic Memory Allocation Consideration
Dynamic memory allocation (‘malloc’ in C language) should always retu[...]
-
Page 42
lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.
Example 1 (Avoid):
double a[100],sum;
int i;
sum = 0.0f;
for (i=0; i<100; i++)[...]
-
Page 43
Example 1
Avoid:
double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;
Preferred:
double a,b,c,d,e,f,t;
t = b/d;
e = c*t;
f = a*t;
Example 2
Avoid:
double a,b,c,e,f;
e = a/c;
f = b/c;
Preferred:
double a,b,c,e,f,t;
t = 1/c;
e = a*t;
f[...]
-
Page 44
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other[...]
-
Page 45
quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared. The following example demonstr[...]
-
Page 46
necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations. The[...]
-
Page 47
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multip[...]
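The preview cuts off before the guide's own example, so the following is only one common way to remove a division: two successive divisions can be merged into a single division by a product. This is a sketch under the stated assumption that the divisors are nonzero and their product does not overflow; the function names are illustrative.

```c
#include <assert.h>

/* Two divisions. */
unsigned divide_twice(unsigned i, unsigned j, unsigned k)
{
    assert(j != 0u && k != 0u);
    return i / j / k;
}

/* One multiplication and one division. For unsigned operands the
 * result is identical as long as j * k does not overflow. */
unsigned divide_once(unsigned i, unsigned j, unsigned k)
{
    assert(j != 0u && k != 0u);
    return i / (j * k);
}
```

The overflow condition is the caller's responsibility; when the divisors can be large, the two-division form remains the safe one.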
-
Page 48
Example 1 (Avoid):
//assumes pointers are different and q!=r
void isqrt ( unsigned long a, unsigned long *q, unsigned long *r)
{
    *q = a;
    if (a > 0) {
        while (*q > (*r = a / *q)) {
            *q = (*q + [...]
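In the listing above every use of *q and *r is a memory access, because the compiler generally cannot prove that the two pointers refer to different objects. A minimal sketch of the alternative keeps the working values in local variables and writes them back once at the end. The name isqrt_local and the loop body are this sketch's own (a standard Newton iteration), since the original listing is truncated.

```c
#include <assert.h>  /* only needed if you add checks around this helper */

/* Integer square root with remainder: on return *q is floor(sqrt(a))
 * and *r is a - (*q) * (*q). The locals qq and rr can live in
 * registers for the whole loop. */
void isqrt_local(unsigned long a, unsigned long *q, unsigned long *r)
{
    unsigned long qq = a;  /* local copy of the quotient estimate */
    unsigned long rr = 0;  /* local scratch for a / qq */

    if (a > 0) {
        while (qq > (rr = a / qq)) {
            qq = (qq + rr) >> 1;  /* average the estimate and a / estimate */
        }
    }
    rr = a - qq * qq;

    /* Single write-back through the pointers. */
    *q = qq;
    *r = rr;
}
```

For a = 10 the estimates run 10, 5, 3, and the function stores q = 3 and r = 1.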
-
Page 49
4 Instruction Decoding Optimizations
This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor. Guidelines are listed in order of importance.
Overview
The AMD Athlon pro[...]
-
Page 50
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the[...]
-
Page 51
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increa[...]
-
Page 52
Example 1 (Avoid):
FLD QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD PTR [baz]
Example 2 (Preferred):
FILD DWORD PTR [bar]
FILD DWORD PTR [baz]
FLD QWORD PTR [foo]
FMULP ST(2), ST
FADDP ST(1), ST
Align Branch Targets i[...]
-
Page 53
Example 2 (Preferred):
05 78 56 34 12    add eax, 12345678h  ;uses single byte opcode form
83 C3 FB          add ebx, -5         ;uses 8-bit sign extended immediate
74 05             jz $label1          ;uses 1-byte opcode, 8-bit immediate
Avoid Partial Registe[...]
-
Page 54
Replace Certain SHLD Instructions with Alternative Code
Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA. The alternative code has lower latency and requir[...]
-
Page 55
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative ef[...]
-
Page 56
Recommendations for the AMD Athlon™ Processor
For code that is optimized specifically for the AMD Athlon processor, the optimal code fillers are NOP instructions (opcode 0x90) with up to two REP prefixes (0xF[...]
-
Page 57
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blended Code
On x86 processors other than the AMD Athlon processor (including the AMD-K6 family of processors), the REP prefix and especially m[...]
-
Page 58
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]
NOP3_ESI TEXTEQU <DB 08Dh,024h,024h> ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,034h,026h> ;lea ed[...]
-
Page 59
;lea edi,[edi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
;lea ebp,[ebp+00000000]
NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0>
;lea eax,[eax*1+00000000]
NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>
;lea ebx[...]
-
Page 60
44 Code Padding Using Neutral Code Fillers[...]
-
Page 61
5 Cache and Memory Optimizations
This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor. Guidelines are listed in order of imp[...]
-
Page 62
Align Data Where Possible
In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example:
■ QWORD accesses are aligned if th[...]
-
Page 63
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. To maintain compatibility with the 25 million AMD-[...]
-
Page 64
MOV ECX, (-LARGE_NUM)    ;used biased index
MOV EAX, OFFSET array_a  ;get address of array_a
MOV EDX, OFFSET array_b  ;get address of array_b
MOV ECX, OFFSET array_c  ;get address of array_c
$loop:
PREFETCHW [EAX+196]      ;two cac[...]
-
Page 65
The following optimization rules were applied to this example.
■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoi[...]
-
Page 66
Take Advantage of Write Combining
Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggres[...]
-
Page 67
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store arc[...]
-
Page 68
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction:
■ The operand size of the store data is sma[...]
-
Page 69
Example 5 (Preferred):
MOVD [foo], MM1       ;store lower half
PUNPCKHDQ MM1, MM1    ;get upper half into lower half
MOVD [foo+4], MM1     ;store lower half
...
ADD EAX, [foo]        ;fine
ADD EDX, [foo+4]      ;fine
Misaligned Store-Buffer Data Forward[...]
-
Page 70
One Supported Store-to-Load Forwarding Case
There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWOR[...]
-
Page 71
Example (Preferred):
Prolog:
PUSH EBP
MOV EBP, ESP
SUB ESP, SIZE_OF_LOCALS ;size of local variables
AND ESP, -8
;push registers that need to be preserved
Epilog:
;pop register that needed to be preserved
MOV ESP,[...]
-
Page 72
Example:
struct {
    char a[5];
    long k;
    double x;
} baz;
The structure components should be allocated (lowest to highest address) as follows:
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
See “C Langua[...]
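The allocation order described above can be expressed directly in C by declaring the members from largest to smallest base type and adding the padding explicitly. This is a sketch under stated assumptions: the struct name is hypothetical, int32_t stands in for the 32-bit long of the original example, and double is assumed to be 8 bytes.

```c
#include <assert.h>  /* only needed if you add checks around this type */
#include <stddef.h>
#include <stdint.h>

/* Members sorted by base type size, largest first, then padded so the
 * total size is a multiple of the largest base type (8 bytes). */
struct baz_sorted {
    double  x;       /* offset 0, 8 bytes */
    int32_t k;       /* offset 8, 4 bytes; assumed 32-bit like `long` in the example */
    char    a[5];    /* offset 12, 5 bytes */
    char    pad[7];  /* 8 + 4 + 5 = 17 bytes, padded to 24 */
};
```

With this layout every element of an array of struct baz_sorted keeps x on an 8-byte boundary, provided the array itself starts on one.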
-
Page 73
6 Branch Optimizations
While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules tha[...]
-
Page 74
AMD Athlon™ Processor Specific Code
Example 1 - Signed integer ABS function (X = labs(X)):
MOV ECX, [X]      ;load value
MOV EBX, ECX      ;save value
NEG ECX           ;-value
CMOVS ECX, EBX    ;if -value is negative, select value
MO[...]
-
Page 75
Example 6 - Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;
if (a < (BUFSIZE-1)) {
    a++;
} else {
    a = 0;
}
;-------------
;Assembly Code
MOV EAX, [a]           ; old offset
CMP EAX, (BUFSIZE-1)   ; a < (BUFSIZE-1) ? CF : NC
INC [...]
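The assembly listing is cut off, but the idea of replacing the branch with arithmetic can also be shown at the C level. A minimal sketch, assuming the offset is always within 0 to BUFSIZE-1 and using a hypothetical BUFSIZE of 16: the comparison result is stretched into an all-ones or all-zeros mask, which then keeps or clears the incremented value.

```c
#include <assert.h>

#define BUFSIZE 16  /* hypothetical buffer size for this sketch */

/* Branching form, as in the C code of the example. */
int next_offset_branchy(int a)
{
    if (a < (BUFSIZE - 1)) {
        a++;
    } else {
        a = 0;
    }
    return a;
}

/* Branch-free form: the mask is 0xFFFFFFFF when a < BUFSIZE-1 and 0
 * otherwise, so the AND either keeps a + 1 or forces the result to 0. */
int next_offset_branchless(int a)
{
    unsigned mask;

    assert(a >= 0 && a < BUFSIZE);
    mask = 0u - (unsigned)(a < (BUFSIZE - 1));
    return (int)(((unsigned)a + 1u) & mask);
}
```

Whether the branch-free form wins depends on how predictable the wrap-around is; for a large buffer the branch is almost always predicted correctly.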
-
Page 76
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar[...]
-
Page 77
Example 2 (Preferred):
; r = (x < y) ? a : b
;
; in:  mm0 a
;      mm1 b
;      mm2 x
;      mm3 y
; out: mm1 r
PCMPGTD MM3, MM2   ; y > x ? 0xffffffff : 0
PAND MM1, MM3      ; y > x ? b : 0
PANDN MM3, MM0     ; y > x ? 0 : a
[...]
-
Page 78
Example 2:
C code:
float x,z;
z = abs(x);
if (z >= 1) {
    z = 1/z;
}
3DNow! code:
;in: MM0 = x
;out: MM0 = z
MOVQ MM5, mabs    ;0x7fffffff
PAND MM0, MM5     ;z=abs(x)
PFRCP MM2, MM0    ;1/z approx
MOVQ MM1, MM0     ;save z
PFRC[...]
-
Page 79
Example 4:
C code:
#define PI 3.14159265358979323
float x,z,r,res;
/* 0 <= r <= PI/4 */
z = abs(x);
if (z < 1) {
    res = r;
}
else {
    res = PI/2-r;
}
3DNow! code:
;in: MM0 = x
;    MM1 = r
;out: MM1 = res
MOVQ MM[...]
-
Page 80
Example 5:
C code:
#define PI 3.14159265358979323
float x,y,xa,ya,r,res;
int xs,df;
xs = x < 0 ? 1 : 0;
xa = fabs(x);
ya = fabs(y);
df = (xa < ya);
if (xs && df) {
    res = PI/2 + r;
}
else if (xs) {
    res[...]
-
Page 81
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
LOOP LABEL
Example 2 (Preferred):
DEC ECX
JNZ LABEL
Avoid Far Con[...]
-
Page 82
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is when the function call [...]
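To illustrate the conversion, here is a minimal sketch using a hypothetical linked-list search (the struct and function names are this sketch's own, not the guide's). In the recursive form the call to itself is the last thing the function does, which is what makes the rewrite into a loop mechanical.

```c
#include <assert.h>  /* only needed if you add checks around these helpers */
#include <stddef.h>

struct node {
    int key;
    struct node *next;
};

/* Avoid: one call and one return per list element. */
const struct node *find_recursive(const struct node *n, int key)
{
    if (n == NULL || n->key == key) {
        return n;
    }
    return find_recursive(n->next, key);  /* call is the final action */
}

/* Preferred: the same search as a loop, with no calls at all. */
const struct node *find_iterative(const struct node *n, int key)
{
    while (n != NULL && n->key != key) {
        n = n->next;
    }
    return n;
}
```

Some compilers perform this tail-call elimination themselves when optimizing, but writing the loop directly does not depend on that.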
-
Seite 83
Schedule Instructions According to their Latency 67
7 Scheduling Optimizations
This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency
The AM[...]
-
Seite 84
68 Unrolling Loops
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times. In addition, completely unrolling a loop increases scheduling opportunities.[...]
-
Seite 85
Unrolling Loops 69
Without Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, OFFSET A
MOV EBX, OFFSET B
$add_loop:
FLD QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD EAX, 8
ADD EBX, 8
DEC ECX
JNZ $add_loop
The loop consists of seven instructions. The AMD At[...]
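The same array-add loop can be sketched in C in rolled form and unrolled by two. This is an illustrative sketch, not the guide's unrolled listing; the function names and the choice to peel one element for odd lengths are assumptions made here.

```c
#include <stddef.h>

/* Rolled form: loop overhead (increment, compare, branch) is paid per element. */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by two: the same overhead is shared by two adds per iteration. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;

    if (n & 1) {            /* odd length: handle one element up front */
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) { /* remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```

Peeling the odd element first keeps the main loop free of a per-iteration bounds check on the second add.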
-
Seite 86
70 Unrolling Loops
no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typic[...]
-
Seite 87
Use Function Inlining 71
Use Function Inlining
Overview
Make use of the AMD Athlon processor's large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead. Consider the cost of possible increased register usage, whic[...]
-
Seite 88
72 Avoid Address Generation Interlocks
Always Inline Functions if Called from One Site
A function should always be inlined if it can be established that it is called from just one site in the code. For the C language, determination of this characteristic is made [...]
-
Seite 89
Use MOVZX and MOVSX 73
Example 1 (Avoid):
ADD EBX, ECX ;inst 1
MOV EAX, DWORD PTR [10h] ;inst 2 (fast address calc.)
MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.)
MOV EDX, DWORD PTR [24h] ;this load is stalled from
; accessing data cache due
; to long laten[...]
-
Seite 90
74 Minimize Pointer Arithmetic in Loops
Example 1 (Avoid):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
  c[i] = a[i] + b[i];
}
MOV ECX, MAXSIZE ;initialize loop counter
XOR ESI, ESI ;initialize offset into array a
XOR EDI, EDI ;initializ[...]
-
Seite 91
Push Memory Data Carefully 75
variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.g., when the base addresses are passed as arguments of a function) biasing the base add[...]
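A C-level sketch of this idea follows: the base addresses are biased to the end of each array and a single index runs from a negative value up to zero, so the index doubles as the loop counter. The function name add_arrays_biased and its signature are hypothetical; whether a given compiler emits the intended addressing is not guaranteed.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] for n elements, using one index that counts
   from -n up to 0 against biased base pointers. */
void add_arrays_biased(int *c, const int *a, const int *b, ptrdiff_t n)
{
    /* Bias each base address to one past the last element. */
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;

    /* The loop ends when the index reaches zero, so no separate
       counter or end-of-array compare is needed. */
    for (ptrdiff_t i = -n; i != 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```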
-
Seite 92
76 Push Memory Data Carefully [...]
-
Seite 93
Replace Divides with Multiplies 77
8 Integer Optimizations
This chapter describes ways to improve integer performance through optimized programming techniques. The guidelines are listed in order of importance.
Replace Divides with Multiplies
Replace inte[...]
-
Seite 94
78 Replace Divides with Multiplies
Signed Division Utility
In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user en[...]
-
Seite 95
Replace Divides with Multiplies 79
Example 1:
;In: EAX = dividend
;Out: EDX = quotient
XOR EDX, EDX ;0
CMP EAX, d ;CF = (dividend < divisor) ? 1 : 0
SBB EDX, -1 ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
In cases where the dividend does not need to be pre[...]
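The compare-based quotient above can be sketched in C. The sketch assumes an unsigned 32-bit divisor d of at least 2^31, the only case in which the quotient is limited to 0 or 1; the function name is hypothetical.

```c
#include <stdint.h>

/* Unsigned quotient for a divisor d >= 2^31 (assumption of this sketch):
   the result is 1 when dividend >= d and 0 otherwise, so a compare
   replaces the divide. */
uint32_t udiv_large_divisor(uint32_t dividend, uint32_t d)
{
    return (uint32_t)(dividend >= d);
}
```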
-
Seite 96
80 Replace Divides with Multiplies
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ;quotient in EDX
Derivation for a, m, s
The derivation for the algorithm (a), multiplier (m), and shift count (s), [...]
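The instruction sequence of algorithm 1 can be followed step by step in C. The sketch below is for the divisor 7 only; the multiplier 0x92492493 and shift count 2 are commonly published constants for that divisor and are an assumption here, not values taken from this text. It also assumes that right-shifting a negative signed value is an arithmetic shift, as it is on x86 compilers.

```c
#include <stdint.h>

/* Signed division by 7 using multiply, add, and shifts (sketch).
   m and s are assumed constants for d = 7. */
int32_t sdiv7_by_multiply(int32_t dividend)
{
    const int32_t m = (int32_t)0x92492493u; /* assumed multiplier for d = 7 */
    const int s = 2;                        /* assumed shift count for d = 7 */

    /* IMUL EDX: signed 32x32 -> 64-bit product, keep the high half. */
    int32_t hi = (int32_t)(((int64_t)m * dividend) >> 32);

    /* ADD EDX, ECX: add the dividend to the high half. */
    hi += dividend;

    /* SHR ECX, 31: 1 for a negative dividend, 0 otherwise. */
    int32_t sign = (int32_t)((uint32_t)dividend >> 31);

    /* SAR EDX, s then ADD EDX, ECX: shift and round toward zero. */
    return (hi >> s) + sign;
}
```

Adding the sign bit at the end turns the floor result of the arithmetic shift into the truncating result that C and x86 IDIV produce.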
-
Seite 97
Use Alternative Code When Multiplying by a Constant 81
Remainder of Signed Integer 2^n or -(2^n)
;IN: EAX = dividend
;OUT: EAX = remainder
CDQ ;Sign extend into EDX
AND EDX, (2^n-1) ;Mask correction (abs(divisor)-1)
ADD EAX, EDX ;Apply pre-correction
AND EAX, (2^n[...]
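A C sketch of the same pre-correction idea is shown below. The listing above is cut off after the second AND, so the final subtraction of the correction is an assumption of this sketch; the function name is hypothetical, and two's complement integers are assumed.

```c
#include <stdint.h>

/* Remainder of a signed 32-bit dividend by 2^n (1 <= n <= 30).
   The result takes the sign of the dividend, like C's % operator. */
int32_t srem_pow2(int32_t dividend, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);            /* 2^n - 1 */
    int32_t sign = -(int32_t)((uint32_t)dividend >> 31); /* CDQ: -1 if negative, else 0 */
    int32_t corr = sign & mask;                          /* AND EDX, (2^n-1) */
    int32_t r = (dividend + corr) & mask;                /* ADD EAX, EDX; AND EAX, (2^n-1) */

    return r - corr;  /* assumed final step: remove the correction again */
}
```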
-
Seite 98
82 Use Alternative Code When Multiplying by a Constant
by 11: LEA REG2, [REG1*8+REG1] ;3 cycles
ADD REG1, REG1
ADD REG1, REG2
by 12: SHL REG1, 2
LEA REG1, [REG1*2+REG1] ;3 cycles
by 13: LEA REG2, [REG1*2+REG1] ;3 cycles
SHL REG1, 4
SUB REG1, REG2
by 14: LEA REG2, [REG1*[...]
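The decompositions for 11, 12, and 13 can be checked with a small C sketch that mirrors each sequence with shifts, adds, and subtracts. The function names are hypothetical; unsigned arithmetic is used so that overflow wraps as it does in the registers.

```c
#include <stdint.h>

/* by 11: 9*x (LEA) plus 2*x (ADD REG1, REG1) */
uint32_t mul_by_11(uint32_t x)
{
    uint32_t reg2 = (x << 3) + x;   /* LEA REG2, [REG1*8+REG1] */
    uint32_t reg1 = x + x;          /* ADD REG1, REG1 */
    return reg1 + reg2;             /* ADD REG1, REG2 */
}

/* by 12: 4*x (SHL) times 3 (LEA) */
uint32_t mul_by_12(uint32_t x)
{
    uint32_t reg1 = x << 2;         /* SHL REG1, 2 */
    return (reg1 << 1) + reg1;      /* LEA REG1, [REG1*2+REG1] */
}

/* by 13: 16*x (SHL) minus 3*x (LEA) */
uint32_t mul_by_13(uint32_t x)
{
    uint32_t reg2 = (x << 1) + x;   /* LEA REG2, [REG1*2+REG1] */
    uint32_t reg1 = x << 4;         /* SHL REG1, 4 */
    return reg1 - reg2;             /* SUB REG1, REG2 */
}
```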
-
Seite 99
Use MMX™ Instructions for Integer-Only Work 83
by 26: use IMUL
by 27: LEA REG2, [REG1*4+REG1] ;3 cycles
SHL REG1, 5
SUB REG1, REG2
by 28: MOV REG2, REG1 ;3 cycles
SHL REG1, 3
SUB REG1, REG2
SHL REG1, 2
by 29: LEA REG2, [REG1*2+REG1] ;3 cycles
SHL REG1, 5
SUB REG1, RE[...]
-
Seite 100
84 Repeated String Instruction Usage
In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
Repeated String Instruction Usage
Latency of Repeated String Instructi[...]
-
Seite 101
Repeated String Instruction Usage 85
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS (for example, so[...]
-
Seite 102
86 Use XOR Instruction to Clear Integer Registers
Use XOR Instruction to Clear Integer Registers
To clear an integer register to all 0s, use "XOR reg, reg". The AMD Athlon processor is able to avoid the false read dependency on the XOR instructi[...]
-
Seite 103
Efficient 64-Bit Integer Arithmetic 87
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count
; applied modulo 64)
SHLD EDX, EAX, CL ;first apply shift count
SHL EAX, CL ; mod 32 to EDX:EAX
TEST ECX, 32 ;need to shift by another 32?
JZ $lshi[...]
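The two-stage shift can be sketched in C on a pair of 32-bit halves. The listing above is cut off at the conditional jump, so the handling of the extra shift by 32 (moving the low half into the high half and clearing the low half) is an assumption of this sketch; the function name is hypothetical.

```c
#include <stdint.h>

/* Shift the 64-bit value hi:lo left by count (applied modulo 64),
   using only 32-bit operations on the halves. */
uint64_t shl64_from_halves(uint32_t hi, uint32_t lo, unsigned count)
{
    unsigned c = count & 31;    /* x86 32-bit shifts use the count mod 32 */

    if (c != 0) {               /* guard: a 32-bit shift by 32 is undefined in C */
        hi = (hi << c) | (lo >> (32 - c));  /* SHLD EDX, EAX, CL */
        lo <<= c;                           /* SHL EAX, CL */
    }
    if (count & 32) {           /* TEST ECX, 32: shift by another 32? */
        hi = lo;                /* assumed: low half moves to the high half */
        lo = 0;                 /* assumed: low half is cleared */
    }
    return ((uint64_t)hi << 32) | lo;
}
```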
-
Seite 104
88 Efficient 64-Bit Integer Arithmetic
Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.
;
;INPUT: [ESP+8]:[ESP+4] dividend
; [ESP+16]:[ESP+12] divisor
;
;OUTPUT: EDX:EAX quotient of division
;
;DESTROYS: EAX,ECX,EDX,EFlags[...]
-
Seite 105
Efficient 64-Bit Integer Arithmetic 89
MOV ECX, EAX ;save quotient
IMUL EDI, EAX ;quotient * divisor hi-word
; (low only)
MUL DWORD PTR [ESP+20] ;quotient * divisor lo-word
ADD EDX, EDI ;EDX:EAX = quotient * divisor
SUB EBX, EAX ;dividend_lo - (quot.*divisor)_lo
MOV EA[...]
-
Seite 106
90 Efficient 64-Bit Integer Arithmetic
$r_two_divs:
MOV ECX, EAX ;save dividend_lo in ECX
MOV EAX, EDX ;get dividend_hi
XOR EDX, EDX ;zero extend it into EDX:EAX
DIV EBX ;EAX = quotient_hi, EDX = intermediate
; remainder
MOV EAX, ECX ;EAX = dividend_lo
DIV EBX ;EAX = qu[...]
-
Seite 107
Efficient Implementation of Population Count Function 91
Efficient Implementation of Population Count Function
Population count is an operation that determines the number of set bits in a bit string. For example, this can be used to determine the cardinality of a [...]
-
Seite 108
92 Efficient Implementation of Population Count Function
Step 3
For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed:
y[...]
-
Seite 109
Derivation of Multiplier Used for Integer Division by Constants 93
ADD EAX, EDX ;x = (w & 0x33333333) + ((w >> 2) &
; 0x33333333)
MOV EDX, EAX ;x
SHR EAX, 4 ;x >> 4
ADD EAX, EDX ;x + (x >> 4)
AND EAX, 00F0F0F0Fh ;y = (x + (x >> 4) & 0[...]
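The field-wise summation in the comments above can be written out in C. The listing on this page shows only the middle steps, so the first step (forming w from v) and the last step (summing the four bytes with a multiply) follow the standard form of this technique and are assumptions of the sketch; the function name is hypothetical.

```c
#include <stdint.h>

/* Population count of a 32-bit value by summing ever wider bit fields. */
unsigned popcount32(uint32_t v)
{
    /* assumed first step: each 2-bit field holds the count of its two bits */
    uint32_t w = v - ((v >> 1) & 0x55555555u);

    /* each 4-bit field holds the count of its four bits */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);

    /* each byte holds the count of its eight bits */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;

    /* assumed last step: the multiply adds the four bytes into the top byte */
    return (y * 0x01010101u) >> 24;
}
```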
-
Seite 110
94 Derivation of Multiplier Used for Integer Division by
;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s ;EDX=quotient
*/
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 d, l, s, m, a, r;
U64 m_low, m_high, j, k;
U32 log2 [...]
-
Seite 111
Derivation of Multiplier Used for Integer Division by Constants 95
/* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: "Integer Multiplication and Division on the HP Precision Architecture". IEEE Transactions on Computers, Vol 37, No. 8, August 198[...]
-
Seite 112
96 Derivation of Multiplier Used for Integer Division by
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX, 31
SAR EDX, s
ADD EDX, ECX ; quotient in EDX
*/
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 log2 (U32 i)
{
U32[...]
-
Seite 113
Ensure All FPU Data is Aligned 97
9 Floating-Point Optimizations
This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU). Guidelines are listed in order of importance.
Ensure All FPU Data is Aligned
As di[...]
-
Seite 114
98 Use FFREEP Macro to Pop One Register from the FPU
Use FFREEP Macro to Pop One Register from the FPU Stack
In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data. A[...]
-
Seite 115
Use the FXCH Instruction Rather than FST/FLD Pairs 99
These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor. When FSTSW cannot be avoided (for example,[...]
-
Seite 116
100 Minimize Floating-Point-to-Integer Conversions
Minimize Floating-Point-to-Integer Conversions
C++, C, and Fortran define floating-point-to-integer conversions as truncating. This creates a problem because the active rounding mode in an application i[...]
-
Seite 117
Minimize Floating-Point-to-Integer Conversions 101
FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it. For appl[...]
-
Seite 118
102 Minimize Floating-Point-to-Integer Conversions
Example 3 (Potentially faster):
MOV ECX, DWORD PTR[X+4] ;get upper 32 bits of double
XOR EDX, EDX ;i = 0
MOV EAX, ECX ;save sign bit
AND ECX, 07FF00000h ;isolate exponent field
CMP ECX, 03FF00000h ;if abs(x) < 1.0 [...]
-
Seite 119
Floating-Point Subexpression Elimination 103
Floating-Point Subexpression Elimination
There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries. In the cases where two instructions share a s[...]
-
Seite 120
104 Check Argument Range of Trigonometric Instructions
If an "argument out of range" is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again. While an argument > [...]
-
Seite 121
Take Advantage of the FSINCOS Instruction 105
Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.
Take Ad[...]
-
Seite 122
106 Take Advantage of the FSINCOS Instruction [...]
-
Seite 123
Use 3DNow!™ Instructions 107
10 3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor. Guidelines are listed in order of importance. 3DNow! porting guidelines can be foun[...]
-
Seite 124
108 Use 3DNow!™ Instructions for Fast Division
FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction. 3DNow! and MMX instructions are designed to be used concurrently with no sw[...]
-
Seite 125
Use 3DNow!™ Instructions for Fast Division 109
Pipelined Pair of 24-Bit Precision Divides
This divide operation executes with a total latency of 21 cycles, assuming that the program hides the latency of the first MOVD/MOVQ instructions within prec[...]
-
Seite 126
110 Use 3DNow!™ Instructions for Fast Square Root and
Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
3DNow! instructions can be used to compute a very fast, highly accurate square root and reciprocal square[...]
-
Seite 127
Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel 111
Newton-Raphson Reciprocal Square Root
The general Newton-Raphson reciprocal square root recurrence is:
$Z_{i+1} = \frac{1}{2} \cdot Z_i \cdot (3 - b \cdot Z_i^2)$
To reduce the number of i[...]
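The recurrence can be tried out directly in C. In this sketch the starting estimate z0 is supplied by the caller and the iteration count is a free parameter; both are assumptions made for illustration, as is the function name. The guide's own choice of initial approximation is not reproduced here.

```c
/* Newton-Raphson refinement of an estimate of 1/sqrt(b):
   z <- 1/2 * z * (3 - b * z^2), repeated 'iterations' times. */
double rsqrt_newton(double b, double z0, int iterations)
{
    double z = z0;

    for (int i = 0; i < iterations; i++)
        z = 0.5 * z * (3.0 - b * z * z);
    return z;
}
```

Starting from z0 = 0.4 for b = 4, a single step gives about 0.472, and a few more steps settle close to 0.5. Each step roughly squares the error, which is why a good initial estimate needs very few iterations.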
-
Seite 128
112 3DNow!™ and MMX™ Intra-Operand Swapping
Example:
PXOR MM2, MM2 ; 0 | 0
MOVD MM0, [ab] ; 0 0 | b a
MOVD MM1, [cd] ; 0 0 | d c
PUNPCKLWD MM0, MM2 ; 0 b | 0 a
PUNPCKLWD MM1, MM2 ; 0 d | 0 c
PMADDWD MM0, MM1 ; b*d | a*c
3DNow!™ and MMX™ Intra-Operand Swa[...]
-
Seite 129
Fast Conversion of Signed Words to Floating-Point 113
Fast Conversion of Signed Words to Floating-Point
In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers. The following two e[...]
-
Seite 130
114 Use MMX™ PCMP Instead of 3DNow!™ PFCMP
cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four, therefore, in the worst case, the PXOR and PFMUL instructions are the[...]
-
Seite 131
Use MMX™ Instructions for Block Copies and Block Fills 115
Use MMX™ Instructions for Block Copies and Block Fills
For moving or filling small blocks of data (e.g., less than 512 bytes) between cacheable memory areas, the REP MOVS and REP STOS families[...]
-
Seite 132
116 Use MMX™ Instructions for Block Copies and Block Fills
$xfer:
movq mm0, [eax]
add edx, 64
movq mm1, [eax+8]
add eax, 64
movq mm2, [eax-48]
movq [edx-64], mm0
movq mm0, [eax-40]
movq [edx-56], mm1
movq mm1, [eax-32]
movq [edx-48], mm2
movq mm2, [eax-24]
movq [edx[...]
-
Seite 133
Use MMX™ Instructions for Block Copies and Block Fills 117
AMD Athlon™ Processor Specific Code
The following example code, written for the inline assembler of Microsoft Visual C, is suitable for moving/filling a quadword aligned block of data in the f[...]
-
Seite 134
118 Use MMX™ PXOR to Clear All Bits in an MMX™ Register
/* block fill (destination QWORD aligned) */
__asm {
mov edx, [dst_ptr]
mov ecx, [blk_size]
shr ecx, 6
movq mm0, [fill_data]
align 16
$fill_nc:
movntq [edx], mm0
movntq [edx+8], mm0
movntq [edx+16], mm0
mov[...]
-
Seite 135
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register 119
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
To set all the bits in an MMX register to one, use:
PCMPEQD MMreg, MMreg
Note that PCMPEQD MMreg, MMreg is dependent on previous wri[...]
-
Seite 136
120 Optimized Matrix Multiplication
/* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res". Each vertex consists of four floats. T[...]
-
Seite 137
Optimized Matrix Multiplication 121
$$xform:
ADD EBX, 16 ;res++
MOVQ MM0, QWORD PTR [EDX] ;v->y | v->x
MOVQ MM1, QWORD PTR [EDX+8] ;v->w | v->z
ADD EDX, 16 ;v++
MOVQ MM2, MM0 ;v->y | v->x
MOVQ MM3, QWORD PTR [EAX+M00] ;m[0][1] | m[0][0]
PUNPCKLDQ MM0, [...]
-
Seite 138
122 Efficient 3D-Clipping Code Computation Using
Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances, this activity is split into two parts which do [...]
-
Seite 139
Use 3DNow!™ PAVGUSB for MPEG-2 Motion Compensation 123
;;
;; DESTROYS MM0,MM1,MM2,MM3,MM4
PXOR MM0, MM0 ; 0 | 0
MOVQ MM1, MM6 ; w | z
MOVQ MM4, MM5 ; y | x
PUNPCKHDQ MM1, MM1 ; w | w
MOVQ MM3, MM6 ; w | z
MOVQ MM2, MM5 ; y | x
PFSUBR MM3, MM0 ; -w | -z
PFSUBR MM2, MM[...]
-
Seite 140
124 Use 3DNow!™ PAVGUSB for MPEG-2 Motion
Example 1 (Avoid):
MOV ESI, DWORD PTR Src_MB
MOV EDI, DWORD PTR Dst_MB
MOV EDX, DWORD PTR SrcStride
MOV EBX, DWORD PTR DstStride
MOVQ MM7, QWORD PTR [ConstFEFE]
MOVQ MM6, QWORD PTR [Const0101]
MOV ECX, 16
L1:
MOVQ MM0, [...]
-
Seite 141
Stream of Packed Unsigned Bytes 125
The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock:
Example 2 (Preferred):
MOV EAX, DWORD PTR Src_MB
MOV EDI, DWORD PTR Dst_M[...]
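What a rounded average of eight packed unsigned bytes computes can be sketched in portable C on a 64-bit word. The sketch uses the mask constants 0xFE and 0x01 per byte; that this is the formula used by the longer instruction sequence is an assumption here, and the function name is hypothetical.

```c
#include <stdint.h>

/* Rounded average (a + b + 1) >> 1 of eight packed unsigned bytes,
   computed without letting carries cross byte boundaries. */
uint64_t avg_round_u8x8(uint64_t a, uint64_t b)
{
    const uint64_t fe  = 0xFEFEFEFEFEFEFEFEull; /* clears bit 0 of each byte */
    const uint64_t one = 0x0101010101010101ull; /* bit 0 of each byte */

    /* Halve each byte separately, then add 1 where either low bit was set. */
    return ((a & fe) >> 1) + ((b & fe) >> 1) + ((a | b) & one);
}
```

Per byte the largest possible sum is 127 + 127 + 1 = 255, so no byte overflows into its neighbour. A single packed-average instruction replaces all of these steps.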
-
Seite 142
126 Complex Number Arithmetic
Complex Number Arithmetic
Complex numbers have a "real" part and an "imaginary" part. Multiplying complex numbers (ex. 3 + 4i) is an integral part of many algorithms such as Discrete Fourier Transform (DFT) and [...]
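As a scalar reference for the operation being optimized, complex multiplication follows (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The C sketch below uses a hypothetical struct and function name; it shows the arithmetic only, not the packed implementation.

```c
/* A complex number stored as a pair of floats (hypothetical layout). */
typedef struct {
    float re;   /* real part */
    float im;   /* imaginary part */
} complex_f;

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
complex_f complex_mul(complex_f p, complex_f q)
{
    complex_f r;

    r.re = p.re * q.re - p.im * q.im;
    r.im = p.re * q.im + p.im * q.re;
    return r;
}
```

For example, (3 + 4i)(1 + 2i) gives -5 + 10i. The four multiplies and two add/subtract operations are what a packed implementation has to arrange across register halves.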
-
Seite 143
Short Forms 127
11 General x86 Optimization Guidelines
This chapter describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6® processor, AMD Athlon™ processor, and Pentium® fami[...]
-
Seite 144
128 Dependencies
Dependencies
Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
Register Operands
Maintain frequently used values in registers rather than i[...]
-
Seite 145
Introduction 129
Appendix A AMD Athlon™ Processor Microarchitecture
Introduction
When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation. The term architecture r[...]
-
Seite 146
130 AMD Athlon™ Processor Microarchitecture
AMD Athlon™ Processor Microarchitecture
The innovative AMD Athlon processor microarchitecture approach implements the x86 instruction set by processing simpler operations (OPs) instead of complex x86 instruct[...]
-
Seite 147
AMD Athlon™ Processor Microarchitecture 131
Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache
The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache. The L1 instruction cache is [...]
-
Seite 148
132 AMD Athlon™ Processor Microarchitecture
replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure. The first-level TLB is fully [...]
-
Seite 149
AMD Athlon™ Processor Microarchitecture 133
return stack. Subsequent RETs pop a predicted return address off the top of the stack.
Early Decoding
The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs. A MacroOP [...]
-
Seite 150
134 AMD Athlon™ Processor Microarchitecture
Instruction Control Unit
The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources: the centralized in-flight reorder buffer, the integer [...]
-
Seite 151
AMD Athlon™ Processor Microarchitecture 135
Integer Scheduler
The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes. The reservation stations are six[...]
-
Seite 152
136 AMD Athlon™ Processor Microarchitecture
Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions. The AGUs calcu[...]
-
Seite 153
AMD Athlon™ Processor Microarchitecture 137
Floating-Point Execution Unit
The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations [...]
-
Seite 154
138 AMD Athlon™ Processor Microarchitecture
Load-Store Unit (LSU)
The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory. The 44-entry LSU provides a data interface [...]
-
Seite 155
AMD Athlon™ Processor Microarchitecture 139
L2 Cache Controller
The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8-Mbytes of industry-standard SRAMs. There are full on-chip tags [...]
-
Seite 156
140 AMD Athlon™ Processor Microarchitecture [...]
-
Seite 157
Fetch and Decode Pipeline Stages 141
Appendix B Pipeline and Execution Unit Resources Overview
The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations. The integer pip[...]
-
Seite 158
142 Fetch and Decode Pipeline Stages
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware. The less common instructions, which require microcode ass[...]
-
Seite 159
Fetch and Decode Pipeline Stages 143
Cycle 1 – FETCH
The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.
Cycle 2 – SCAN
SCAN determines the start and end pointers of instr[...]
-
Seite 160
144 Integer Pipeline Stages
operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.
Integer Pipeline Stages
The integer execution pipeline consists of four or more stages for scheduling and execution and, if necessary, [...]
-
Seite 161
Integer Pipeline Stages 145
Cycle 7 – SCHED
In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus. When all operands are received, SCHED schedules [...]
-
Seite 162
146 Floating-Point Pipeline Stages
Floating-Point Pipeline Stages
The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path. The FPU handles all register operations for x87 instructi[...]
-
Seite 163
Floating-Point Pipeline Stages 147
Cycle 7 – STKREN
The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.
Cycle 8 – REGREN
The register re[...]
-
Seite 164
148 Execution Unit Resources
Execution Unit Resources
Terminology
The execution units operate with two types of register values: operands and results. There are three operand types and two result types, which are described in this section.
Oper[...]
-
Seite 165
Execution Unit Resources 149
Integer Pipeline Operations
Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type. As shown in Table 2, the MOV instruction early decodes in the DirectPat[...]
-
Seite 166
150 Execution Unit Resources
Floating-Point Pipeline Operations
Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types. As shown in Table 4, the FADD register-to-regist[...]
-
Seite 167
Execution Unit Resources 151
Load/Store Pipeline Operations
The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations. For example, consider the following code sample:
MOV AX, [EBX] ;1 load MacroOP
PUSH [...]
-
Seite 168
152 Execution Unit Resources
Code Sample Analysis
The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.[...]
-
Seite 169
Execution Unit Resources 153
Table 7. Sample 1 – Integer Register Operations
Instruction Number | Instruction | Decode Pipe | Decode Type | Clocks (1-8)
1 | IMUL EAX, ECX | 0 | VP | D I M M M M
2 | INC ESI | 0 | DP | D I E
3 | MOV EDI, 0x07F4 | 1 | DP | D I E
4 | AD[...]
-
Seite 170
154 Execution Unit Resources
Table 8. Sample 2 – Integer Register and Memory Load Operations
Instruction Number | Instruction | Decode Pipe | Decode Type | Clocks (1-12)
1 | DEC EDX | 0 | DP | D I E
2 | MOV EDI, [ECX] | 1 | DP | D I &/S A $
3 | SUB EAX, [[...]
-
Seite 171
Introduction 155
Appendix C Implementation of Write Combining
Introduction
This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family. The AMD Athlon processor supports the memory type and range regis[...]
-
Seite 172
156 Write-Combining Definitions and Abbreviations
Write-Combining Definitions and Abbreviations
This appendix uses the following definitions and abbreviations:
■ UC - Uncacheable memory type
■ WC - Write-combining memory type
■ WT - Writethrough [...]
-
Seite 173
Write-Combining Operations 157
signature in register EAX, where EAX[11-8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six.
2. In addition, the presence of the MTRRs is indicated by bit 12 and the pr[...]
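Extracting the family code from a signature value is a shift and a mask. The C sketch below takes the EAX value as a plain argument and does not execute CPUID itself; the function name and the sample value 0x00000621 used in the note are hypothetical.

```c
#include <stdint.h>

/* Return the instruction family code held in bits 11-8 of a
   processor signature (the EAX value described in the text). */
unsigned cpuid_family(uint32_t signature_eax)
{
    return (signature_eax >> 8) & 0xFu;
}
```

For a hypothetical signature of 0x00000621 the function returns 6, the family code the text gives for the AMD Athlon processor.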
-
Page 174
Table 9. Write Combining Completion Events
Event: Non-WB write outside of current buffer
Comment: The first non-WB write to a different cache block address closes combining for previous writes. WB writes do not affect write combining.[...]
-
Page 175
Sending Write-Buffer Data to the System
Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists t[...]
-
Page 177
Appendix D
Performance-Monitoring Counters
This chapter describes how to use the AMD Athlon™ processor performance monitoring counters.
Overview
The AMD Athlon processor provides four 48-bit performance counters, which allows four types [...]
-
Page 178
These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h. The PerfCtr[3:0] registers are l[...]
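As a hedged illustration of the MSR layout, the following C sketch maps a counter index to its MSR address. The PerfEvtSel base comes from the text above; the PerfCtr base is taken from the "PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)" heading later in this appendix. The macro and function names are hypothetical, and the sketch does not execute RDMSR or WRMSR, which require privileged code.

```c
#include <stdint.h>

#define PERF_EVT_SEL_BASE  0xC0010000u  /* PerfEvtSel0 */
#define PERF_CTR_BASE      0xC0010004u  /* PerfCtr0 */
#define PERF_COUNTER_COUNT 4u           /* four counters, indices 0..3 */

/* Return the MSR address of PerfEvtSel[n], or 0 if n is out of range. */
static uint32_t perf_evt_sel_msr(unsigned n)
{
    return (n < PERF_COUNTER_COUNT) ? PERF_EVT_SEL_BASE + n : 0u;
}

/* Return the MSR address of PerfCtr[n], or 0 if n is out of range. */
static uint32_t perf_ctr_msr(unsigned n)
{
    return (n < PERF_COUNTER_COUNT) ? PERF_CTR_BASE + n : 0u;
}
```

Returning 0 for an invalid index is a design choice of this sketch only; a kernel driver would more likely reject the request before computing an address.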
-
Page 179
Unit Mask Field (Bits 8–15)
These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states. See Tab[...]
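A minimal sketch of how the unit mask field combines with the event select field in a PerfEvtSel value. The unit mask position (bits 8–15) is from the text; placing the event select in bits 0–7 is an assumption based on the usual x86 layout, and the function names are hypothetical. The remaining control bits of the register are deliberately left out.

```c
#include <stdint.h>

#define EVT_SELECT_SHIFT 0u  /* assumed: event select occupies bits 0-7 */
#define UNIT_MASK_SHIFT  8u  /* from the text: unit mask occupies bits 8-15 */

/* Combine an event number and a unit mask into the low 16 bits
 * of a PerfEvtSel value. */
static uint32_t perf_evt_sel_pack(uint8_t event, uint8_t unit_mask)
{
    return ((uint32_t)event << EVT_SELECT_SHIFT) |
           ((uint32_t)unit_mask << UNIT_MASK_SHIFT);
}

/* Recover the unit mask from a PerfEvtSel value. */
static uint8_t perf_evt_sel_unit_mask(uint32_t value)
{
    return (uint8_t)((value >> UNIT_MASK_SHIFT) & 0xFFu);
}
```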
-
Page 180
greater than or equal to the counter mask. Otherwise if this field is zero, then the counter increments by the total number of events.
Table 11. Performance-Monitoring Counters
Columns: Event Number, Source Unit, Notes / Unit Mask (bits 1[...]
-
Page 181
65h BU 1xxx_xxxxb = reserved; x1xx_xxxxb = WB; xx1x_xxxxb = WP; xxx1_xxxxb = WT; bits 11–10 = reserved; xxxx_xx1xb = WC; xxxx_xxx1b = UC. System requests with the selected type
73h BU bits 15–11 = reserved; xxxx_x1xxb = L2 (L2 hit and no DC h[...]
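The unit mask patterns listed for event 65h can be written as C constants. This is a sketch under two assumptions: the enumerator names are hypothetical, and "bits 11–10" is read as register bits, which correspond to unit mask bits 3–2 because the unit mask starts at register bit 8.

```c
#include <stdint.h>
#include <stdbool.h>

/* Unit mask bits for event 65h (system requests with the selected type). */
enum {
    EVT65_UC       = 0x01,  /* xxxx_xxx1b */
    EVT65_WC       = 0x02,  /* xxxx_xx1xb */
    EVT65_WT       = 0x10,  /* xxx1_xxxxb */
    EVT65_WP       = 0x20,  /* xx1x_xxxxb */
    EVT65_WB       = 0x40,  /* x1xx_xxxxb */
    EVT65_RESERVED = 0x8C   /* 1xxx_xxxxb plus mask bits 3-2 (assumed) */
};

/* A mask is usable only if it sets no reserved bit. */
static bool evt65_mask_valid(uint8_t unit_mask)
{
    return (unit_mask & EVT65_RESERVED) == 0;
}
```

Several memory types can be selected at once by OR-ing the constants, for example `EVT65_WC | EVT65_UC` to count both write-combining and uncacheable requests.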
-
Page 182
7Ah BU Cycles that at least one fill request waited to use the L2
80h PC Instruction cache fetches
81h PC Instruction cache misses
82h PC Instruction cache refills from L2
83h PC Instruction cache refills from system
84h PC L1 ITLB m[...]
-
Page 183
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)
The performance-counter MSRs contain the event or duration counts for the selected events being counted. The RDPMC instruction can be used by programs or p[...]
-
Page 184
allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47. Negative values are usef[...]
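The following C sketch shows the arithmetic behind loading a negative starting value into a 48-bit counter. It assumes, since the sentence above is cut off, that the purpose is to make the counter wrap to zero after a chosen number of events. The names are hypothetical and the code only models the counter value; it does not write any MSR.

```c
#include <stdint.h>

#define PERF_CTR_BITS 48
#define PERF_CTR_MASK ((UINT64_C(1) << PERF_CTR_BITS) - 1)

/* 48-bit two's-complement encoding of -events: the counter wraps to zero
 * after exactly `events` increments. */
static uint64_t perf_ctr_preload(uint64_t events)
{
    return (UINT64_C(0) - events) & PERF_CTR_MASK;
}

/* Number of increments left before a raw 48-bit counter value wraps. */
static uint64_t perf_ctr_events_to_overflow(uint64_t raw)
{
    return (UINT64_C(0) - raw) & PERF_CTR_MASK;
}
```

For example, `perf_ctr_preload(1000)` yields FFFF_FFFF_FC18h, and adding 1000 to that value modulo 2^48 gives zero.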
-
Page 185
The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them and initializes the counter MSRs (PerfCtr[3:0]) to starting counts. [...]
-
Page 186
An event monitor application utility or another application program can read the collected performance information of the profiled application.[...]
-
Page 187
Appendix E
Programming the MTRR and PAT
Introduction
The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions. The processor also includ[...]
-
Page 188
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type. For each 4K, 16K or 64K segment within the first 1 Mbyte of memory, ther[...]
-
Page 189
Figure 12. MTRR Mapping of Physical Memory
[Figure: physical memory map from 0 to FFFFFFFFh. Labels: 512 Kbytes, 8 fixed ranges (64 Kbytes each); 256 Kbytes, 16 fixed ranges (16 Kbytes each); 256 Kbytes, 64 fixed ranges (4 Kbytes each). Address markers: 80000h, C0[...]]
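The fixed-range layout in Figure 12 can be sketched as a lookup from a physical address to a fixed range. The flat numbering from 0 to 87 is a hypothetical convention chosen for this example, not a register encoding from the guide, and the boundaries assume the three groups are laid out back to back in the first 1 Mbyte (8 x 64 Kbytes, then 16 x 16 Kbytes, then 64 x 4 Kbytes).

```c
#include <stdint.h>

/* Map a physical address below 1 Mbyte to a flat fixed-range index:
 *   0..7   : 64-Kbyte ranges covering 00000h-7FFFFh
 *   8..23  : 16-Kbyte ranges covering 80000h-BFFFFh
 *   24..87 : 4-Kbyte ranges covering C0000h-FFFFFh
 * Returns -1 for addresses handled by the variable-range MTRRs. */
static int mtrr_fixed_range_index(uint32_t phys_addr)
{
    if (phys_addr < 0x80000u)
        return (int)(phys_addr >> 16);
    if (phys_addr < 0xC0000u)
        return 8 + (int)((phys_addr - 0x80000u) >> 14);
    if (phys_addr < 0x100000u)
        return 24 + (int)((phys_addr - 0xC0000u) >> 12);
    return -1;
}
```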
-
Page 190
Memory Types
Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in T[...]
-
Page 191
MTRR Default Type Register Format. The MTRR default type register is defined as follows.
Figure 14. MTRR Default Type Register Format
E: MTRRs are enabled when set. All MTRRs (both fixed and variable range) [...]
-
Page 192
Note that if two or more variable memory ranges match then the interactions are defined as follows:
1. If the memory types are identical, then that memory type is used.
2. If one or more of the memory type[...]
-
Page 193
not affected by this issue, only the variable range (and MTRR DefType) registers are affected.
Page Attribute Table (PAT)
The Page Attribute Table (PAT) is an extension of the page table entry format, which allows t[...]
-
Page 194
Accessing the PAT
A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page (PATi is d[...]
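A minimal sketch of forming the 3-bit PAT index from the three page table entry bits. It assumes the bits are combined in the order listed in the text, with PATi as the most significant bit and PWT as the least significant; the function name is hypothetical, and extracting the three bits from an actual page table entry is left to the caller.

```c
/* Build the 3-bit PAT index from the PATi, PCD, and PWT bits.
 * Assumed bit order: PATi = bit 2, PCD = bit 1, PWT = bit 0.
 * Each argument is masked to a single bit. */
static unsigned pat_index(unsigned pat_i, unsigned pcd, unsigned pwt)
{
    return ((pat_i & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}
```

With three bits the index ranges from 0 to 7, so it can address every field of the PAT register.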
-
Page 195
Table 15. Effective Memory Type Based on PAT and MTRRs
Columns: PAT Memory Type | MTRR Memory Type | Effective Memory Type
UC- | WB, WT, WP, WC | UC-Page
UC- | UC | UC-MTRR
WC | x | WC
WT | WB, WT | WT
WT | UC | UC
WT | WC | CD
WT | WP | CD
WP | WB, WP | WP
WP | UC | UC-MTRR
WP | WC, WT | CD[...]
-
Page 196
Table 16. Final Output Memory Types
Column groups: Input Memory Type, Output Memory Type, Note
Subheadings: RdMem, WrMem, Effective MType, forceCD (note 5), AMD-751, RdMem, WrMem, MemType
●● UC - ●● UC 1
●● CD - ●● CD 1
●● WC - ●● WC 1
●[...]
-
Page 197
●● CD - ●● CD
●● WC - ●● WC
●● WT - ●● WT
●● WP - ●● WP
●● WB - ●● WT 4
●● - ●●● CD 2
Notes:
1. WP is not functional for RdMem/WrMem.
2. ForceCD must cause the MTRR memory type to be[...]
-
Page 198
MTRR Fixed-Range Register Format
The memory types for the memory segments covered by each of the MTRR fixed-range registers are defined in Table 17 (also see “Standard MTRR Types and Properties” on page 176).[...]
-
Page 199
Variable-Range MTRRs
A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap. The upper two va[...]
-
Page 200
Figure 17. MTRRphysMaskn Register Format
Note: A software attempt to write to reserved bits will generate a general protection exception.
Physical Mask: Specifies a 24-bit mask to determine the range of the region defined in t[...]
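The role of the 24-bit mask can be sketched in C. This is an illustration under stated assumptions rather than the guide's own procedure: the base and mask fields are assumed to hold physical address bits 35–12, and an address is assumed to fall in the region when its masked page number equals the masked base, which is the conventional x86 MTRR matching rule. All names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define MTRR_PAGE_SHIFT 12u
#define MTRR_FIELD_MASK 0xFFFFFFu  /* 24-bit base and mask fields */

/* Mask field for a naturally aligned, power-of-two region size (>= 4 Kbytes). */
static uint32_t mtrr_mask_for_size(uint64_t size_bytes)
{
    return (uint32_t)((~(size_bytes - 1)) >> MTRR_PAGE_SHIFT) & MTRR_FIELD_MASK;
}

/* True when phys_addr lies in the region described by base and mask fields. */
static bool mtrr_range_matches(uint64_t phys_addr,
                               uint32_t base_field, uint32_t mask_field)
{
    uint32_t page = (uint32_t)(phys_addr >> MTRR_PAGE_SHIFT) & MTRR_FIELD_MASK;
    return (page & mask_field) == (base_field & mask_field);
}
```

For a 256-Mbyte region at E000_0000h, the base field is E0000h and the mask field is FF0000h; address E800_0000h matches and F000_0000h does not.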
-
Page 201
MTRR MSR Format
This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits.
Table 18. MTRR-Related Model-Specific Register (MSR) Map
Register[...]
-
Page 203
Appendix F
Instruction Dispatch and Execution Resources
This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations.[...]
-
Page 204
■ disp16/32 - 16-bit or 32-bit displacement value
■ disp32/48 - 32-bit or 48-bit displacement value
■ eXX - register width depending on the operand size
■ mem32real - 32-bit floating-point value[...]
-
Page 205
ADC mreg8, reg8 10h 11-xxx-xxx DirectPath
ADC mem8, reg8 10h mm-xxx-xxx DirectPath
ADC mreg16/32, reg16/32 11h 11-xxx-xxx DirectPath
ADC mem16/32, reg16/32 11h mm-xxx-xxx DirectPath
ADC reg8, mreg8 12h 11[...]
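The ModR/M patterns in these tables (for example 11-xxx-xxx and mm-xxx-xxx) follow the standard x86 split of the byte into a 2-bit mod field, a 3-bit reg field, and a 3-bit r/m field. The C sketch below decodes that split. The type and function names are hypothetical, and reading 11 as the register-operand form and mm as any memory-operand form is an interpretation of the table notation.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    unsigned mod;  /* bits 7-6: 3 selects a register operand */
    unsigned reg;  /* bits 5-3: register or opcode extension */
    unsigned rm;   /* bits 2-0: register or memory base */
} modrm_fields;

/* Split a ModR/M byte into its three fields. */
static modrm_fields modrm_decode(uint8_t byte)
{
    modrm_fields f;
    f.mod = (byte >> 6) & 0x3u;
    f.reg = (byte >> 3) & 0x7u;
    f.rm  = byte & 0x7u;
    return f;
}

/* True for the "11-xxx-xxx" rows, false for the "mm-xxx-xxx" rows. */
static bool modrm_is_register_form(uint8_t byte)
{
    return ((byte >> 6) & 0x3u) == 3u;
}
```

For rows such as 11-010-xxx, the middle field is an opcode extension, so the same first byte can encode different instructions depending on `reg`.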
-
Page 206
AND mem8, reg8 20h mm-xxx-xxx DirectPath
AND mreg16/32, reg16/32 21h 11-xxx-xxx DirectPath
AND mem16/32, reg16/32 21h mm-xxx-xxx DirectPath
AND reg8, mreg8 22h 11-xxx-xxx DirectPath
AND reg8, mem8 22h mm[...]
-
Page 207
BT mem16/32, imm8 0Fh BAh mm-100-xxx DirectPath
BTC mreg16/32, reg16/32 0Fh BBh 11-xxx-xxx VectorPath
BTC mem16/32, reg16/32 0Fh BBh mm-xxx-xxx VectorPath
BTC mreg16/32, imm8 0Fh BAh 11-111-xxx Vector[...]
-
Page 208
CMOVE/CMOVZ reg16/32, reg16/32 0Fh 44h 11-xxx-xxx DirectPath
CMOVE/CMOVZ reg16/32, mem16/32 0Fh 44h mm-xxx-xxx DirectPath
CMOVG/CMOVNLE reg16/32, reg16/32 0Fh 4Fh 11-xxx-xxx DirectPath
CMOVG/CMOVNLE reg16/3[...]
-
Page 209
CMP EAX, imm16/32 3Dh DirectPath
CMP mreg8, imm8 80h 11-111-xxx DirectPath
CMP mem8, imm8 80h mm-111-xxx DirectPath
CMP mreg16/32, imm16/32 81h 11-111-xxx DirectPath
CMP mem16/32, imm16/32 81h mm-11[...]
-
Page 210
DIV EAX, mreg16/32 F7h 11-110-xxx VectorPath
DIV EAX, mem16/32 F7h mm-110-xxx VectorPath
ENTER C8 VectorPath
IDIV mreg8 F6h 11-111-xxx VectorPath
IDIV mem8 F6h mm-111-xxx VectorPath
IDIV EAX, mreg16[...]
-
Page 211
INC mreg8 FEh 11-000-xxx DirectPath
INC mem8 FEh mm-000-xxx DirectPath
INC mreg16/32 FFh 11-000-xxx DirectPath
INC mem16/32 FFh mm-000-xxx DirectPath
INVD 0Fh 08h VectorPath
INVLPG 0Fh 01h mm-111-xxx VectorPat[...]
-
Page 212
JP/JPE near disp16/32 0Fh 8Ah DirectPath
JNP/JPO near disp16/32 0Fh 8Bh DirectPath
JL/JNGE near disp16/32 0Fh 8Ch DirectPath
JNL/JGE near disp16/32 0Fh 8Dh DirectPath
JLE/JNG near disp16/32 0Fh 8Eh DirectPa[...]
-
Page 213
LOOPE/LOOPZ disp8 E1h VectorPath
LOOPNE/LOOPNZ disp8 E0h VectorPath
LSL reg16/32, mreg16/32 0Fh 03h 11-xxx-xxx VectorPath
LSL reg16/32, mem16/32 0Fh 03h mm-xxx-xxx VectorPath
LSS reg16/32, mem32/48 0Fh[...]
-
Page 214
MOV EDX, imm16/32 BAh DirectPath
MOV EBX, imm16/32 BBh DirectPath
MOV ESP, imm16/32 BCh DirectPath
MOV EBP, imm16/32 BDh DirectPath
MOV ESI, imm16/32 BEh DirectPath
MOV EDI, imm16/32 BFh DirectPath
MOV mre[...]
-
Page 215
NOT mem8 F6h mm-010-xxx DirectPath
NOT mreg16/32 F7h 11-010-xxx DirectPath
NOT mem16/32 F7h mm-010-xxx DirectPath
OR mreg8, reg8 08h 11-xxx-xxx DirectPath
OR mem8, reg8 08h mm-xxx-xxx DirectPath
OR mreg16/32, [...]
-
Page 216
POP EBX 5Bh VectorPath
POP ESP 5Ch VectorPath
POP EBP 5Dh VectorPath
POP ESI 5Eh VectorPath
POP EDI 5Fh VectorPath
POP mreg16/32 8Fh 11-000-xxx VectorPath
POP mem16/32 8Fh mm-000-xxx VectorPath
POPA/POPA[...]
-
Page 217
RCL mreg8, 1 D0h 11-010-xxx DirectPath
RCL mem8, 1 D0h mm-010-xxx DirectPath
RCL mreg16/32, 1 D1h 11-010-xxx DirectPath
RCL mem16/32, 1 D1h mm-010-xxx DirectPath
RCL mreg8, CL D2h 11-010-xxx Di[...]
-
Page 218
ROL mreg16/32, 1 D1h 11-000-xxx DirectPath
ROL mem16/32, 1 D1h mm-000-xxx DirectPath
ROL mreg8, CL D2h 11-000-xxx DirectPath
ROL mem8, CL D2h mm-000-xxx DirectPath
ROL mreg16/32, CL D3h 11-000-xxx DirectPath
ROL me[...]
-
Page 219
SBB mreg16/32, reg16/32 19h 11-xxx-xxx DirectPath
SBB mem16/32, reg16/32 19h mm-xxx-xxx DirectPath
SBB reg8, mreg8 1Ah 11-xxx-xxx DirectPath
SBB reg8, mem8 1Ah mm-xxx-xxx DirectPath
SBB reg16/32, mreg1[...]
-
Page 220
SETS mreg8 0Fh 98h 11-xxx-xxx DirectPath
SETS mem8 0Fh 98h mm-xxx-xxx DirectPath
SETNS mreg8 0Fh 99h 11-xxx-xxx DirectPath
SETNS mem8 0Fh 99h mm-xxx-xxx DirectPath
SETP/SETPE mreg8 0Fh 9Ah 11-xxx-xxx Direc[...]
-
Page 221
SHR mem16/32, imm8 C1h mm-101-xxx DirectPath
SHR mreg8, 1 D0h 11-101-xxx DirectPath
SHR mem8, 1 D0h mm-101-xxx DirectPath
SHR mreg16/32, 1 D1h 11-101-xxx DirectPath
SHR mem16/32, 1 D1h mm-101-xxx Dire[...]
-
Page 222
SUB reg8, mreg8 2Ah 11-xxx-xxx DirectPath
SUB reg8, mem8 2Ah mm-xxx-xxx DirectPath
SUB reg16/32, mreg16/32 2Bh 11-xxx-xxx DirectPath
SUB reg16/32, mem16/32 2Bh mm-xxx-xxx DirectPath
SUB AL, imm8 [...]
-
Page 223
XADD mreg8, reg8 0Fh C0h 11-100-xxx VectorPath
XADD mem8, reg8 0Fh C0h mm-100-xxx VectorPath
XADD mreg16/32, reg16/32 0Fh C1h 11-101-xxx VectorPath
XADD mem16/32, reg16/32 0Fh C1h mm-101-xxx VectorP[...]
-
Page 224
Table 20. MMX™ Instructions
Columns: Instruction Mnemonic, Prefix Byte(s), First Byte, ModR/M Byte, Decode Type, FPU Pipe(s), Notes
EMMS 0Fh 77h DirectPath FADD/FMUL/FSTORE
MOVD mmreg, reg32 0Fh 6Eh 11-xxx-x[...]
-
Page 225
PANDN mmreg1, mmreg2 0Fh DFh 11-xxx-xxx DirectPath FADD/FMUL
PANDN mmreg, mem64 0Fh DFh mm-xxx-xxx DirectPath FADD/FMUL
PCMPEQB mmreg1, mmreg2 0Fh 74h 11-xxx-xxx DirectPath FADD/FMUL
PCMPEQB mmreg,[...]
-
Page 226
PSRAW mmreg1, mmreg2 0Fh E1h 11-xxx-xxx DirectPath FADD/FMUL
PSRAW mmreg, mem64 0Fh E1h mm-xxx-xxx DirectPath FADD/FMUL
PSRAW mmreg, imm8 0Fh 71h 11-100-xxx DirectPath FADD/FMUL
PSRAD mmreg1, mmreg[...]
-
Page 227
PUNPCKHDQ mmreg1, mmreg2 0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMUL
PUNPCKHDQ mmreg, mem64 0Fh 6Ah mm-xxx-xxx DirectPath FADD/FMUL
PUNPCKHWD mmreg1, mmreg2 0Fh 69h 11-xxx-xxx DirectPath FADD/FMUL
P[...]
-
Page 228
PMINSW mmreg, mem64 0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL
PMINUB mmreg1, mmreg2 0Fh DAh 11-xxx-xxx DirectPath FADD/FMUL
PMINUB mmreg, mem64 0Fh DAh mm-xxx-xxx DirectPath FADD/FMUL
PMOVMSKB reg[...]
-
Page 229
FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath
FCMOVE ST(0), ST(i) DAh C8-CFh VectorPath
FCMOVBE ST(0), ST(i) DAh D0-D7h VectorPath
FCMOVU ST(0), ST(i) DAh D8-DFh VectorPath
FCMOVNB ST(0), ST(i) DBh C0-C7h VectorPa[...]
-
Page 230
FIADD [mem32int] DAh mm-000-xxx VectorPath
FIADD [mem16int] DEh mm-000-xxx VectorPath
FICOM [mem32int] DAh mm-010-xxx VectorPath
FICOM [mem16int] DEh mm-010-xxx VectorPath
FICOMP [mem32int] DAh mm-011[...]
-
Page 231
FLDCW [mem16] D9h mm-101-xxx VectorPath
FLDENV [mem14byte] D9h mm-100-xxx VectorPath
FLDENV [mem28byte] D9h mm-100-xxx VectorPath
FLDL2E D9h EAh DirectPath FSTORE
FLDL2T D9h E9h DirectPath FSTORE
FLDLG[...]
-
Page 232
FSTCW [mem16] D9h mm-111-xxx VectorPath
FSTENV [mem14byte] D9h mm-110-xxx VectorPath
FSTENV [mem28byte] D9h mm-110-xxx VectorPath
FSTP [mem32real] D9h mm-011-xxx DirectPath FADD/F[...]
-
Page 233
Table 23. 3DNow!™ Instructions
Columns: Instruction Mnemonic, Prefix Byte(s), imm8, ModR/M Byte, Decode Type, FPU Pipe(s), Note
FEMMS 0Fh 0Eh DirectPath FADD/FMUL/FSTORE 2
PAVGUSB mmreg1, mmreg2 0Fh, 0Fh BF[...]
-
Page 234
PFRSQRT mmreg, mem64 0Fh, 0Fh 97h mm-xxx-xxx DirectPath FMUL
PFSUB mmreg1, mmreg2 0Fh, 0Fh 9Ah 11-xxx-xxx DirectPath FADD
PFSUB mmreg, mem64 0Fh, 0Fh 9Ah mm-xxx-xxx DirectPath FADD
PFSUBR mmreg1, mmr[...]
-
Page 235
Appendix G
DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimiz[...]
-
Page 236
Table 25. DirectPath Integer Instructions
Instruction Mnemonic
ADC mreg8, reg8
ADC mem8, reg8
ADC mreg16/32, reg16/32
ADC mem16/32, reg16/32
ADC reg8, mreg8
ADC reg8, mem8
ADC reg16/32, mreg16/32
ADC reg16/32, mem16/32
ADC AL, imm[...]
-
Page 237
CMOVBE/CMOVNA reg16/32, reg16/32
CMOVBE/CMOVNA reg16/32, mem16/32
CMOVE/CMOVZ reg16/32, reg16/32
CMOVE/CMOVZ reg16/32, mem16/32
CMOVG/CMOVNLE reg16/32, reg16/32
CMOVG/CMOVNLE reg16/32, mem16/32
CMOVGE/CMOVNL reg16/32, reg[...]
-
Page 238
JNO short disp8
JB/JNAE short disp8
JNB/JAE short disp8
JZ/JE short disp8
JNZ/JNE short disp8
JBE/JNA short disp8
JNBE/JA short disp8
JS short disp8
JNS short disp8
JP/JPE short disp8
JNP/JPO short disp8
JL/JNGE short dis[...]
-
Page 239
MOV mem16/32, imm16/32
MOVSX reg16/32, mreg8
MOVSX reg16/32, mem8
MOVSX reg32, mreg16
MOVSX reg32, mem16
MOVZX reg16/32, mreg8
MOVZX reg16/32, mem8
MOVZX reg32, mreg16
MOVZX reg32, mem16
NEG mreg8
NEG mem8
NEG mreg16/32
NEG mem1[...]
-
Page 240
ROL mreg8, CL
ROL mem8, CL
ROL mreg16/32, CL
ROL mem16/32, CL
ROR mreg8, imm8
ROR mem8, imm8
ROR mreg16/32, imm8
ROR mem16/32, imm8
ROR mreg8, 1
ROR mem8, 1
ROR mreg16/32, 1
ROR mem16/32, 1
ROR mreg8, CL
ROR mem8, CL
ROR mreg16/32, CL[...]
-
Page 241
SETL/SETNGE mreg8
SETL/SETNGE mem8
SETGE/SETNL mreg8
SETGE/SETNL mem8
SETLE/SETNG mreg8
SETLE/SETNG mem8
SETG/SETNLE mreg8
SETG/SETNLE mem8
SHL/SAL mreg8, imm8
SHL/SAL mem8, imm8
SHL/SAL mreg16/32, imm8
SHL/SAL mem1[...]
-
Page 242
XOR reg16/32, mem16/32
XOR AL, imm8
XOR EAX, imm16/32
XOR mreg8, imm8
XOR mem8, imm8
XOR mreg16/32, imm16/32
XOR mem16/32, imm16/32
XOR mreg16/32, imm8 (sign extended)
XOR mem16/32, imm8 (sign extended)
Table 25. DirectP[...]
-
Page 243
Table 26. DirectPath MMX™ Instructions
Instruction Mnemonic
EMMS
MOVD mmreg, mem32
MOVD mem32, mmreg
MOVQ mmreg1, mmreg2
MOVQ mmreg, mem64
MOVQ mmreg2, mmreg1
MOVQ mem64, mmreg
PACKSSDW mmreg1, mmreg2
PACKSSDW mmreg, mem64
PACK[...]
-
Page 244
PSRLD mmreg, imm8
PSRLQ mmreg1, mmreg2
PSRLQ mmreg, mem64
PSRLQ mmreg, imm8
PSRLW mmreg1, mmreg2
PSRLW mmreg, mem64
PSRLW mmreg, imm8
PSUBB mmreg1, mmreg2
PSUBB mmreg, mem64
PSUBD mmreg1, mmreg2
PSUBD mmreg[...]
-
Page 245
Table 28. DirectPath Floating-Point Instructions
Instruction Mnemonic
FABS
FADD ST, ST(i)
FADD [mem32real]
FADD ST(i), ST
FADD [mem64real]
FADDP ST(i), ST
FCHS
FCOM ST(i)
FCOMP ST(i)
FCOM [mem32real]
FCOM [mem64real]
FCOMP [mem32[...]
-
Page 246
FSUB ST(i), ST
FSUBP ST, ST(i)
FSUBR [mem32real]
FSUBR [mem64real]
FSUBR ST, ST(i)
FSUBR ST(i), ST
FSUBRP ST(i), ST
FTST
FUCOM
FUCOMP
FUCOMPP
FWAIT
FXCH
Table 28. DirectPath Floating-Point Instructions
Instruction Mnem[...]
-
Page 247
VectorPath Instructions
The following tables contain VectorPath instructions, which should be avoided in the AMD Athlon processor:
■ Table 29, “VectorPath Integer Instructions,” on page 231
■ Table 30[...]
-
Page 248
DIV EAX, mem16/32
ENTER
IDIV mreg8
IDIV mem8
IDIV EAX, mreg16/32
IDIV EAX, mem16/32
IMUL reg16/32, imm16/32
IMUL reg16/32, mreg16/32, imm16/32
IMUL reg16/32, mem16/32, imm16/32
IMUL reg16/32, imm8 (sign ext[...]
-
Page 249
MUL EAX, mem32
OUT imm8, AL
OUT imm8, AX
OUT imm8, EAX
OUT DX, AL
OUT DX, AX
OUT DX, EAX
POP ES
POP SS
POP DS
POP FS
POP GS
POP EAX
POP ECX
POP EDX
POP EBX
POP ESP
POP EBP
POP ESI
POP EDI
POP mreg16/32
POP mem16/32
POPA/POP[...]
-
Page 250
STI
STOSB mem8, AL
STOSW mem16, AX
STOSD mem32, EAX
STR mreg16
STR mem16
SYSCALL
SYSENTER
SYSEXIT
SYSRET
VERR mreg16
VERR mem16
VERW mreg16
VERW mem16
WBINVD
WRMSR
XADD mreg8, reg8
XADD mem8, reg8
XADD mreg16/32, reg[...]
-
Page 251
Table 32. VectorPath Floating-Point Instructions
Instruction Mnemonic
F2XM1
FBLD [mem80]
FBSTP [mem80]
FCLEX
FCMOVB ST(0), ST(i)
FCMOVE ST(0), ST(i)
FCMOVBE ST(0), ST(i)
FCMOVU ST(0), ST(i)
FCMOVNB ST(0), ST(i)
FCMOVNE ST(0), ST(i)
FC[...]
-
Page 253
Index
Numerics
3DNow!™ Instructions ..... 10, 107
3DNow! and MMX™ Intra-Operand Swapping ..... 112
Clipping ..... 122
Fast[...]
-
Page 254
Instruction
Cache ..... 131
Control Unit ..... 134
Decoding .....[...]
-
Page 255
T
TBYTE Variables ..... 55
Trigonometric Instructions ..... 103
V
VectorPath Decoder .....[...]