Performance Engineering Of Software Systems

课程主页: link

Lec3. Bit Hacks

交换两个整数

int x = 1; int y = 2;

x = x ^ y;
y = x ^ y;
x = x ^ y;

求两个数的最小值,减少分支,但是未必能比O3优化好

int x = 1; int y = 2;

int min_value = y ^ ((x ^ y) & -(x < y)); // x < y 被隐式转成0或1

加法取模,限定 \(0 \le x < n\),\(0 \le y < n\),计算\((x + y) \ mod \ n\)

z = x + y;
result = z - (n & -(z >= n));

进位到最近的2的幂

uint64_t n;
--n;  // 处理n已经是2的幂的情况
n |= n >> 1;   // 整个过程就是把最高位的1一直铺到后面的位置
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n |= n >> 32;
++n;

取最低位的1

r = x & (-x);

求bit为1的数量,但是可能不管如何都不如内置函数__builtin_popcount快

for (int r = 0; x != 0; r++) {
	x &= x - 1; // x消去最末尾的1
}

Lec4 Assembly Language and Computer Architecture

x86通用寄存器有多个名字,代表不同bits

%rax 8byte, %eax 低位4bytes, %ax 低位2个byte %al 最低位1byte %ah 最低位第二个byte

image-20231103164224246

本课程和clang,objdump, perf一样使用AT&T语法。op A B 格式中B存放结果。

image-20231103161416979

Floating

  • SSE/AVX support single precision and double precision scalar floating-point arithmetic
  • x87 instructions support single-, double-, and extended-precision scalar floating-point arithmetic

编译器一般偏好使用 SSE 指令,因为使用更简单.

SSE instructions use two-letter suffixes to encode the data type.

  • ss One single-precision floating-point value (float, ss 的第一个 s 是single)
  • sd One double-precision floating-point value (double)
  • ps Vector of single-precision floating-point values (ps 的 p 是packed)
  • pd Vector of double-precision floating-point values

Vector

现代处理器一般都有 vector 硬件支持SIMD

  • Modern SSE instruction sets support vector operations on integer, single-precision, and doubleprecision floating-point values.
  • AVX instructions support vector operations on single-precision, and double-precision floatingpoint values.
  • AVX2 instructions add integer-vector operations to the AVX instruction set.
  • AVX-512 (AVX3) instructions increase the register length to 512 bits and provide new vector operations

SSE 指令使用128 bit XMM寄存器,一次最多两个操作数,AVX可以使用256-bit YMM寄存器,且一次最多有3个操作数

  SSE AVX/AVX2
Floating-point addpd vaddpd (开头的v表示avx)
Integer paddq (开头的p表示integer) vpaddq

OVERVIEW OF COMPUTER ARCHITECTURE

教学一般讲解5-stage流水线,实际上Intel可能有14-19 pipeline stages

Lec5 C to Assembly