1. Optimizing software in C++ An optimization guide for Windows, Linux and Mac platforms By Agner Fog. Copenhagen University College of Engineering.
10The Intel compiler has a number of important optimization features: Very good support for vector operations using the single-instruction-multiple-
100Use a compiler that supports automatic vectorization, such as Gnu, Intel or PathScale. Use appropriate compiler options to enable the desired ins
101load the structure y into an XMM register, add the constant vector (1.,2.,3.,4.), and store the result in x. The #pragma vector always tells the c
102Intrinsic functions are primitive operations in the sense that each intrinsic function call is translated to just one or a few machine instruction
103 a = _mm_or_si128(a, mask); // Store the result vector in eight consecutive elements in aa: _mm_store_si128(ToVectorAddress(aa[i]
104_mm_empty() after the 64-bit vector operations and before any floating point code. The 128-bit vectors do not have this problem. Rewriting exampl
105 // Redefine class for vector of eight 16-bit integers: class Int16Vector8 : public Is16vec8 { public: // Default constructor: Int16Vecto
106 d.load(dd + i); // Load eight elements from dd[i] a = select(b > 0, c + 2, d + g); // Select c+2 or d+g a.store(aa
107Intel compiler does an excellent job of optimizing this code, and so does the Gnu compiler. The functions, operators and constructors are all inli
108 for (int i = 0; i < size; i++) { aa[i] = (bb[i] + cc[i]) / ff[i]; } } Example 11.6a is not vectorized automatically by the c
109 // Roll out loop by four to fit the four-element vectors: for (int i = 0; i < size; i += 4) { b.load(bb + i); // Load fo
11Codeplay VectorC A commercial compiler for 32-bit Windows. Integrates into the Microsoft Visual Studio IDE. Has not been updated since 2004. Can do
110} void AddAndDivide (short int *aa, short int *bb, short int *cc, float *ff) { // Define temporary vectors: Int16Vector8 a, b, c;
111 // Example 11.7 #include <emmintrin.h> // Define class complexd for vector of two doubles. // The low element represents the real part an
112static inline complexd operator - (complexd const & a, complexd const & b) { return _mm_sub_pd(a, b);} // complex * complex: (a.re*b.
113static inline complexd operator - (complexd const & a) { static const union { // (signbit,signbit) int i[4]; __m128d v
11411.3 Mathematical functions for vectors There are various function libraries for computing mathematical functions such as logarithms, exponential
115The necessary function prototypes and description of the different math libraries can be found in: Mike Stoner: "Integrating Fast Math Librar
116Code example Compiler used Time per element AMD64 Time per element Pentium 4 Time per element Pentium M Time per element Core 2 11.8a Microsoft 5
117 11.6 Conclusion Vectorized code often contains a lot of extra instructions for converting the data to the right format and getting them into the
118want to make one version that takes advantage of the SSE2 instruction set and another version that is compatible with old microprocessors. Using t
119 ... InnerFunctionSSE2(); } } else { // Use legacy version to support old systems for (i = 0; i < 1000
12time in library functions then it may not be necessary to optimize anything else than finding the most efficient library and economize the library
120You may want to compile an application for a 64-bit operating system in order to get the advantages mentioned on page 6. Unfortunately, it is not
121• Make two or more versions of the most critical part of the code and compile them separately with the appropriate instruction set specified. Ins
122 __intel_cpu_indicator = 0x800; break; case 6: case 7: // Supplementary-SSE3 supported __intel_cpu_indicator = 0x1000;
123 13 Specific optimization tips 13.1 Use lookup tables Reading a value from a table of constants is very fast if the table is cached. Usually it t
124Storing something in static memory can cause caching problems because static data are likely to be scattered around at different memory addresses.
125reason is, I guess, that compiler makers assume that floating point comparisons are more predictable than integer comparisons. The solution a = 1.
126A possible negative value of i will appear as a large positive number when i is interpreted as an unsigned integer and this will trigger the error
127 Sunday = 1, Monday = 2, Tuesday = 4, Wednesday = 8, Thursday = 0x10, Friday = 0x20, Saturday = 0x40 }; Weekdays Day; if (Day & (Tuesday
128Here, the address of matrix[j][0] is calculated internally as (int)&matrix[0][0] + j * (columns * sizeof(float)). Now, the factor to multiply
129• Integer division by a constant is faster than division by a variable • Integer division by a constant is faster if the constant is a power of
13memcpy 16kB aligned operands Intel Core 2 0.12 0.18 0.12 0.11 0.18 0.18 0.18 0.11 memcpy 16kB unaligned op. Intel Core 2 0.63 0.75 0.18 0.11 1.21
130 // Example 13.13b int list[300]; int i; for (i = 0; i < 300; i += 3) { list[i] = 0; list[i+1] = 1; list[i+2] = 2; } The loop unr
131// Example 13.16a double y, a1, a2, b1, b2; y = a1/b1 + a2/b2; Here we can eliminate one division by making a common denominator: // Example 1
13213.8 Conversions between floating point numbers and integers Conversion from floating point to integer According to the standards for the C++ lang
133Conversion from integer to floating point Conversion of integers to floating point is faster than from floating point to integer. The conversion t
134()631638322)1(−−⋅+⋅⋅−= fractiononevaluelongdoubleexponentsign. The value is zero if all bits except the sign bit are zero. Zero can be represente
135float a, b; if (*(unsigned int*)&a * 2 > *(unsigned int*)&b * 2) { // abs(a) > abs(b) } The multiplication by 2 in example 13.28
136There is a penalty for accessing part of a variable when the full variable is accessed shortly afterwards because the CPU becomes confused about t
137Alternatively, you can get ReadTSC as a library function available in www.agner.org/optimize/asmlib.zip. // Example 14.1 #include <intrin.h>
138the clock counts should be multiplied by the clock period and by the number of times CriticalFunction is called in a typical application to calcul
14specified on a command line or an input file. The output goes to the console or to an output file. A console mode program is fast, compact, and sim
14015 Overview of compiler options Table 15.1. Command line options relevant to optimization MS compiler Windows Gnu compilerLinux Intel compiler
141 Table 15.2. Compiler directives and keywords relevant to optimization MS compiler Windows Gnu compiler Linux Intel compiler Windows Intel comp
142platform Linux platform n.a. __unix__ __linux__ __unix__ __linux__ x86 platform _M_IX86 _M_IX86 x86-64 platform _M_IX86 and _WIN64 _M_X64 _M_
14316 Literature Other manuals by Agner Fog The present manual is number one in a series of five manuals. See page 3 for a list of titles. Literatur
144Theoretical textbook about input/output devices, ergonomics, cognition, psychology. W. M. Newman & M. G. Lamming: "Interactive System
15A good way to prevent such errors is to replace arrays by well-tested container classes. The standard template library (STL) is a useful source of
163.2 Use a profiler to find hot spots Before you start to optimize anything, you have to identify the critical parts of the program. In some program
17Function addresses are obscured in optimized programs. The profiler identifies any hot spots in the program by their address and attempts to transl
18arithmetic operations. The most common time-consumers are discussed in the following sections. 3.3 Program installation The time it takes to insta
19The advantages of using static linking rather than dynamic linking are: Static linking includes only the part of the library that is actually need
26.18 Runtime type identification (RTTI)... 50 6.19 Inheritance...
20this leads to a considerable extra overhead. There are two reasons for this overhead. The first reason is that the 32-bit instruction set has no su
21thread if there is other work that the processor can do while waiting for disk operations to finish. 3.7 System database It can take several secon
22cycles if it is not cached. See page 25 about data storage and page 81 about memory caching. 3.11 Context switches A context switch is a switch be
23the loop counter with its limit, etc. In most cases, you can assume that these integer operations do not add to the total computation time. 4 Per
24Copy protection. Some copy protection schemes are based on hacks that violate or circumvent operating system standards. Such schemes are frequent s
25Performance Primitives" library contains many functions for audio and video processing, signal processing, data compression and cryptography (
26The static data area is usually divided into three parts: one for constants that are never modified by the program, one for initialized variables t
27Floating point variables use a different kind of registers. There are eight floating point registers available in 32-bit operating systems and sixt
28 A class member variable with the static modifier will be stored in static memory and will have one and only one instance. Non-static members of th
29 It is recommended to use the default integer size in cases where the size doesn't matter, such as simple variables, loop counters, etc. In la
316 Literature... 143 1 Introducti
30will delay the availability of x for approximately two clock cycles. Obviously, the initial value of i must be adjusted if you change pre-increment
31The calculation of expressions where operands have mixed precision require precision conversion instructions which can be quite time-consuming (see
326.4 Enums An enum is simply an integer in disguise. Enums are exactly as efficient as integers. 6.5 Booleans The order of Boolean operands The ope
33This is typically implemented by the compiler in the following way: bool a, b, c, d; if (a != 0) { if (b != 0) { c = 1; } else {
34You cannot replace a && b with a & b if b is an expression that should not be evaluated if a is false. Likewise, you cannot replace a |
35References are useful for copy constructors and overloaded operators. Function parameters that are declared as constant references accept expressi
36Pentium M processor may be able to predict the target if the changes of the function pointer follows a simple regular pattern, while Pentium 4 and
37 More examples of container classes are given in www.agner.org/optimize/cppexamples.zip. An array using the above template class is declared by sp
38this is not an issue because an optimizing compiler can see that the rows are accessed consecutively and can calculate the address of each row by a
39Floating point precision conversion Conversions between float, double and long double take no extra time when the floating point register stack is
4development process. These requirements are often conflicting with the requirements of optimizing the software for speed or size. Today, it is not
40Pointer type conversion A pointer can be converted to a pointer of a different type. Likewise, a pointer can be converted to an integer, or an inte
41Dynamic cast The dynamic_cast operator is used for converting a pointer to one class to a pointer to another class. It makes a runtime check that t
42 A switch statements is a kind of branch that can go more than two ways. Switch statements are most efficient if the case labels follow a sequence
43thousand times then the loop control branch is mispredicted only one time in thousand so the misprediction penalty is only a negligible contributio
44The loop control condition The most efficient loop control condition is a simple integer counter. A microprocessor with out-of-order capabilities (
45// Example 6.32a const int size = 1000; int i; float a[size], b[size]; // set a to zero for (i = 0; i < size; i++) a[i] = 0.0; // copy a to b f
46Transfer large objects by reference If an object (class or structure) is transferred to a function as a parameter, then the entire object is copied
47Floating point parameters are not affected by __fastcall. The implicit 'this' pointer in member functions is also treated like a paramete
48and member functions is not expensive. You may use an object oriented programming style if it is good for the logical structure and clarity of the
49Structure and class objects can often be made smaller by reordering the data members. If the class has at least one virtual member functions then t
5 This manual is based on the standard PC platform with an Intel or AMD processor and a Windows, Linux or BSD operating system running in 32-bit or 6
506.17 Virtual member functions Virtual functions are used for implementing polymorphic classes. Each instance of a polymorphic class has a pointer t
516.20 Constructors and destructors A constructor is implemented as a member function which returns a reference to the object. The allocation of memo
52 struct { int a:4; int b:2; int c:2; }; char abc; }; Bitfield x; int A, B, C; x.abc = A | (B << 4) | (C << 6)
53 template <int m> int MultiplyBy (int x) { return x * m;} int a, b; a = Multiply(10,8); b = MultiplyBy<8>(10); a and b will both g
54 public: virtual void Disp() { cout << 1; } }; class C2 : public CHello { public: virtual void Disp() { cout <&
55 void Disp() { cout << 2; } }; void test () { CChild1 Object1; CChild2 Object2; CChild1 * p1; p1 = &Object1; p1-
56threads with different priorities then the user might experience unacceptably long response times to keyboard and mouse inputs when the program is
57 } catch (...) { ... } } The function F1 is supposed to call the destructor for the object x when it returns. But what if an exception
58 __try { // Main loop for calculations: for ( ; i < arraysize; i++) { // Overflow may occur in multiplicatio
59Exception handling is not necessary when no attempt is made to recover from errors. If you just want the program to issue an error message and stop
6Some systems have a graphics processing unit, usually on a graphics card. Such units can be used as coprocessors to take care of some of the heavy g
607 Optimizations in the compiler 7.1 How compilers optimize Modern compilers can do a lot of modifications to the code in order to improve performan
61 float a, b; a = parabola (2.0f); b = a + 1.0f; The compiler may replace this by // Example 7.3b a = 5.0f; b = 6.0f; Constant folding and consta
62 The maximum number of floating point register variables is eight in 32-bit systems and sixteen in 64-bit systems. Some compilers have difficulties
63else { y = cos(x); } z = y + 1.; Eliminate jumps Jumps can be avoided by copying the code that it jumps to. Example: // Example 7.9a int SomeF
64 } else { return a - 1; } } The compiler may reduce this to: // Example 7.11b int SomeFunction (int a, bool b) { if (b) {
65Induction variables An expression that is a linear function of a loop counter can be calculated by adding a constant to the previous value. Example
66The compilers I have studied do not make induction variables for floating point expressions or more complex integer expressions. See page 73 for an
67This may be useful if the subexpression c+b can be reused elsewhere. In this example, we are using 8-bit integers which range from -128 to +127. An
68Without optimization, the compiler needs to look up in a virtual table to see whether the call p->f() goes to C0::f or C1::f. But an optimizing
69((a*x+b)*x+c)*x+d x*x*x*x*x*x*x*x = ((x2) 2) 2 - - x - - - - - - a+a+a+a = a*4 x - x x - - - - x -(-a) = a x - x x x x x x - a-(-b) = a+b x - x
7The SSE2 instruction set is supported on all 64-bit CPU's and operating systems. The 64 bit instruction set supports self-relative addressing
70(a&&b) || (!a&&c) || (b&&c) = a ? b : c x - - x - - - - - (a&&b) || (a&&b&&c) = a&&
71Boolean XMM (vector) reductions: ~(~a) = a - n.a. - - - - n.a. n.a. - (a&b)|(a&c) = a&(b|c) - n.a. - - - - n.a. n.a. -
72 Most compilers have an option for assuming no pointer aliasing (/Oa). The easiest way to overcome the obstacle of possible pointer aliasing is to
73#define pure_function __attribute__((const)) #else #define pure_function #endif double Func1(double) pure_function ; double Func2(double x) {
74The calculation of this polynomial can be done with just two additions by the use of two induction variables: // Example 7.23b. Calculate polynomi
75 Avoid long dependency chains, especially loop-carried dependency chains with long latencies. 7.5 Compiler optimization options All C++ compilers
76 You may choose a newer instruction set when compatibility with old CPU's is not needed. Even better, you may make multiple versions of the mo
77int List[ArraySize]; ... for (int i = 0; i < ArraySize; i++) List[i]++; Here, the compiler can replace all occurrences of ArraySize by the valu
78Assume no pointer aliasing. __declspec(noalias) or __restrict or #pragma optimize("a",on). Specifies that pointer aliasing does not occur
79 mov ecx, DWORD PTR [esp+8] ; ecx = a xor eax, eax ; eax = i = 0 mov edx, DWORD PTR [esp
8Several modern programming languages use an intermediate code (byte code). The source code is compiled into an intermediate code, which is the code
80compiler has not noticed that i can never be negative so that we don't have to care about the sign bit. We can tell it this by making i an uns
81Furthermore, this solution is using one register less so that it doesn't have to push and pop ebx. 8 Optimizing memory access 8.1 Caching of
82this can cause contentions in the code cache. The subsequent sections describe various ways to avoid these problems. More details about how caches
83in which they are used. Such variables and objects will be stored on the stack, which is very likely to be in the level-1 cache. The different kind
84 // Example 8.2b void F3(bool y) { union { int a[1000]; float b[1000]; }; if (y) { F1(a); } else { F2(b
85Arrays that are too large for the stack can be allocated dynamically. The advantages of dynamic memory allocation are: Gives a more clear program
86step by step. In most systems, you cannot increase the size of a memory block that has already been allocated. If the final size cannot be predicte
87 DynamicArray[i] = WhateverFunction(i); // ... } } } Obviously, a function should never return any pointer or reference
88themselves. But implementing a matrix in STL as a vector of vectors, as is often seen, is certainly not the most efficient solution. Fortunately,
89• Is searching needed before all objects have been added? If search facilities are needed, and new objects can be added at any time, then the solu
9An important disadvantage of C++ relates to security. There are no checks for array bounds violation, integer overflow, and invalid pointers. The ab
90const int NUMROWS = 100, NUMCOLUMNS = 100; int matrix[NUMROWS][NUMCOLUMNS]; int row, column; for (row = 0; row < NUMROWS; row++) for (column
91the rows, not the columns. Every fourth of these cache lines belong to the same set in the cache. When we reach element number 16 in column 28, the
92 // Define size of squares: const int TILESIZE = 8; // SIZE must be divisible by TILESIZE // Loop r1 and c1 for all squares:
93Table 8.3. Cache control instructions. There are other cache control instructions than the ones mentioned in table 8.3, such as flush and fence in
94const int SIZE = 512; // number of rows and columns in matrix // function to transpose and copy matrix void TransposeCopy(double a[SIZE][SIZE], dou
95It is important to distinguish between coarse-grained parallelism and fine-grained parallelism when deciding whether it is advantageous to do thing
96A CPU with multiple cores or a system with multiple CPU's is very useful for running multiple tasks in parallel. A CPU-intensive program with
97float list[size], sum1 = 0, sum2 = 0; int i; for (i = 0; i < size; i += 2) { sum1 += list[i]; sum2 += list[i+1];} sum1 += sum2; If the m
98The loop branch should be predicted. This is no problem if the repeat count is large or constant. If the loop count is small and changing then the
99The use of vector operations is more advantageous the smaller the data elements are. For example, it takes the same time to add two vectors of four
Comentarios a estos manuales