Performance Engineering Of Software Systems
课程主页: link
Lec3. Bit Hacks
交换两个整数
int x = 1; int y = 2;
x = x ^ y;
y = x ^ y;
x = x ^ y;
求两个数的最小值,减少分支,但是未必能比O3优化好
int x = 1; int y = 2;
int min_value = y ^ ((x ^ y) & -(x < y)); // x < y 被隐式转成0或1
加法取模,限定 \(0 \le x < n\),\(0 \le y < n\),计算\((x + y) \ mod \ n\)
z = x + y;
result = z - (n & -(z >= n));
进位到最近的2的幂
uint64_t n;
--n; // 处理n已经是2的幂的情况
n |= n >> 1; // 整个过程就是把最高位的1一直铺到后面的位置
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n |= n >> 32;
++n;
取最低位的1
r = x & (-x);
求bit为1的数量,但是可能不管如何都不如内置函数__builtin_popcount快
for (int r = 0; x != 0; r++) {
x &= x - 1; // x消去最末尾的1
}
Lec4 Assembly Language and Computer Architecture
x86通用寄存器有多个名字,代表不同bits
%rax 8byte, %eax 低位4bytes, %ax 低位2个byte %al 最低位1byte %ah 最低位第二个byte
本课程和clang,objdump, perf一样使用AT&T语法。op A B 格式中B存放结果。
Floating
- SSE/AVX support single precision and double precision scalar floating-point arithmetic
- x87 instructions support single-, double-, and extended-precision scalar floating-point arithmetic
编译器一般偏好使用 SSE 指令,因为使用更简单.
SSE instructions use two-letter suffixes to encode the data type.
- ss One single-precision floating-point value (float, ss 的第一个 s 是single)
- sd One double-precision floating-point value (double)
- ps Vector of single-precision floating-point values (ps 的 p 是packed)
- pd Vector of double-precision floating-point values
Vector
现代处理器一般都有 vector 硬件支持SIMD
- Modern SSE instruction sets support vector operations on integer, single-precision, and doubleprecision floating-point values.
- AVX instructions support vector operations on single-precision, and double-precision floatingpoint values.
- AVX2 instructions add integer-vector operations to the AVX instruction set.
- AVX-512 (AVX3) instructions increase the register length to 512 bits and provide new vector operations
SSE 指令使用128 bit XMM寄存器,一次最多两个操作数,AVX可以使用256-bit YMM寄存器,且一次最多有3个操作数
SSE | AVX/AVX2 | |
---|---|---|
Floating-point | addpd | vaddpd (开头的v表示avx) |
Integer | paddq (开头的p表示integer) | vpaddq |
OVERVIEW OF COMPUTER ARCHITECTURE
教学一般讲解5-stage流水线,实际上Intel可能有14-19 pipeline stages