性能优化篇（1）：几种简单的访存优化手段

V2AS问路

性能优化篇（1）：几种简单的访存优化手段

阅读原文时间：2021年04月22日阅读：2

========================

Author:stormQ

Sunday, 10. November 2019 11:36AM

减少不必要的内存引用

示例：

void poor(const int *src, int n, int *dest)
{
    for (int i = 0; i < n; ++i)
    {
        *dest += src[i];
    }
}

void better(const int *src, int n, int *dest)
{
    int sum = *dest;
    for (int i = 0; i < n; ++i)
    {
        sum += src[i];
    }
    *dest = sum;
}

执行时间统计：

启动程序方式

第一次执行耗时(us)

第二次执行耗时(us)

第三次执行耗时(us)

第四次执行耗时(us)

第五次执行耗时(us)

./main_Og

poor:176463
better:67489
poor:172714
better:69189
poor:170618
better:67232
poor:176130
better:68082
poor:170447
better:69484

valgrind --tool=cachegrind ./main_Og

poor:2403126
better:1540885
poor:2401457
better:1543162
poor:2404605
better:1541964
poor:2410699
better:1546153
poor:2409636
better:1543493

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2.5 倍左右。

从汇编的角度分析：

; With -Og optimization
; Dump of assembler code for function poor(int const*, int, int*):
   0x000000000040080c <+0>:    mov w3, #0x0                    // #0
   0x0000000000400810 <+4>:    cmp w3, w1
   0x0000000000400814 <+8>:    b.ge    0x400830 <poor(int const*, int, int*)+36>
   0x0000000000400818 <+12>:    ldr w5, [x2]
   0x000000000040081c <+16>:    ldr w4, [x0,w3,sxtw #2]
   0x0000000000400820 <+20>:    add w4, w5, w4
   0x0000000000400824 <+24>:    str w4, [x2]
   0x0000000000400828 <+28>:    add w3, w3, #0x1
   0x000000000040082c <+32>:    b   0x400810 <poor(int const*, int, int*)+4>
   0x0000000000400830 <+36>:    ret

; With -Og optimization
; Dump of assembler code for function better(int const*, int, int*):
   0x0000000000400834 <+0>:    ldr w4, [x2]
   0x0000000000400838 <+4>:    mov w3, #0x0                    // #0
   0x000000000040083c <+8>:    cmp w3, w1
   0x0000000000400840 <+12>:    b.ge    0x400854 <better(int const*, int, int*)+32>
   0x0000000000400844 <+16>:    ldr w5, [x0,w3,sxtw #2]
   0x0000000000400848 <+20>:    add w4, w4, w5
   0x000000000040084c <+24>:    add w3, w3, #0x1
   0x0000000000400850 <+28>:    b   0x40083c <better(int const*, int, int*)+8>
   0x0000000000400854 <+32>:    str w4, [x2]
   0x0000000000400858 <+36>:    ret

从汇编代码中可以看出：

使用-Og优化选项编译时，poor() 函数的 for 循环的一次迭代中：读内存操作数量为2，写内存操作数量为1。
使用-Og优化选项编译时，better() 函数的 for 循环的一次迭代中：读内存操作数量为1，写内存操作数量为0。better() 函数的内存读写数量较少，因此执行速度更快。

从 cache 性能的角度分析：

--------------------------------------------------------------------------------
I1 cache:         16384 B, 64 B, 4-way associative
D1 cache:         16384 B, 64 B, 4-way associative
LL cache:         262144 B, 64 B, 8-way associative
Command:          ./b_Og
Data file:        cachegrind.out.18226
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
           Ir  I1mr ILmr          Dr       D1mr       DLmr          Dw      D1mw      DLmw 
--------------------------------------------------------------------------------
1,993,889,018 1,004  864 315,036,517 13,121,792 13,115,591 209,899,117 6,555,946 6,555,006  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr          Dr      D1mr      DLmr          Dw      D1mw      DLmw  file:function
--------------------------------------------------------------------------------
838,860,804    0    0 209,715,200 6,553,601 6,553,601 104,857,600         0         0  /home/b.cpp:poor(int const*, int, int*)
629,145,606    0    0 104,857,601 6,553,601 6,553,601           1         1         1  /home/b.cpp:better(int const*, int, int*)
524,288,004    1    1           0         0         0 104,857,600 6,553,600 6,553,600  /home/b.cpp:init(int*, int)

输出结果解析：

poor() 函数的读内存操作的数量为 209,715,200（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
poor() 函数的写内存操作的数量为 104,857,600（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 0（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 0（DLmw 列）。
better() 函数的读内存操作的数量为 104,857,601（Dr 列），读内存操作在 L1 级缓存中不命中的数量为 6,553,601（D1mr 列），读内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 6,553,601（DLmr 列）。
better() 函数的写内存操作的数量为 1（Dw 列），写内存操作在 L1 级缓存中不命中的数量为 1（D1mw 列），写内存操作在 LL 级缓存（最后一级缓存）中不命中的数量为 1（DLmw 列）。

从上述数据中可以看出：1）poor() 函数的读内存操作的数量为 better() 函数的两倍；2）poor() 函数的写内存操作的数量比 better() 函数多 104,857,599（104,857,599 = 104,857,600 - 1）。这也验证了better() 函数执行速度较快的原因。

按顺序存储同时要访问的数据

示例：

void poor(const int *src, int *dest, int n)
{
    for (int i = 0; i < n; ++i)
    {
        // 同时要访问的数据（src[i]、dest[i]）在两个数组中，即同时要访问的数据不是连续存储的
        dest[i] = src[i];
    }
}

struct Vec2
{
    Vec2()
    {
        static long long count = 0;
        a = count++;
    }
    int a;
    int b;
};

void better(struct Vec2 *data, int n)
{
    for (int i = 0; i < n; ++i)
    {
        // 同时要访问的数据（data[i].a、data[i].b）存储在同一个结构体中，即同时要访问的数据是连续存储的
        data[i].b = data[i].a;
    }
}

执行时间统计：

启动程序方式

第一次执行耗时(us)

第二次执行耗时(us)

第三次执行耗时(us)

第四次执行耗时(us)

第五次执行耗时(us)

./main_Og

poor:8591
better:2714
poor:4657
better:1936
poor:7424
better:3436
poor:8976
better:1937
poor:4866
better:1931

valgrind --tool=cachegrind ./main_Og

poor:60573
better:52037
poor:59379
better:51119
poor:60963
better:51360
poor:59742
better:51622
poor:64674
better:52330

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 2 倍左右。

统计 cache 使用情况：

--------------------------------------------------------------------------------
         Ir I1mr ILmr          Dr      D1mr      DLmr          Dw      D1mw      DLmw 

         .    .    .         .       .       .         .       .       .  void poor(const int *src, int *dest, int n)
         .    .    .         .       .       .         .       .       .  {
 8,294,404    0    0         1       1       1         0       0       0      for (int i = 0; i < n; ++i)
         .    .    .         .       .       .         .       .       .      {
 6,220,800    0    0 2,073,600 129,601 129,601 2,073,600 129,601 129,601          dest[i] = src[i];
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .  }
         .    .    .         .       .       .         .       .       .  
         .    .    .         .       .       .         .       .       .  struct Vec2
         .    .    .         .       .       .         .       .       .  {
         .    .    .         .       .       .         .       .       .      Vec2()
         .    .    .         .       .       .         .       .       .      {
         .    .    .         .       .       .         .       .       .          static long long count = 0;
 8,294,400    0    0 2,073,600       1       1 4,147,200 259,200 259,200          a = count++;
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .      int a;
         .    .    .         .       .       .         .       .       .      int b;
         .    .    .         .       .       .         .       .       .  };
         .    .    .         .       .       .         .       .       .  
         .    .    .         .       .       .         .       .       .  void better(struct Vec2 *data, int n)
         .    .    .         .       .       .         .       .       .  {
 8,294,404    0    0         1       1       1         0       0       0      for (int i = 0; i < n; ++i)
         .    .    .         .       .       .         .       .       .      {
 8,294,400    0    0 2,073,600 259,201 259,201 2,073,600       0       0          data[i].b = data[i].a;
         .    .    .         .       .       .         .       .       .      }
         .    .    .         .       .       .         .       .       .  }

输出结果解析：

poor() 和 better() 函数的内存读操作的数量是相同的，而 better() 函数的 D1mr、DLmr 比 poor() 函数分别多 129600 次、129600 次。
poor() 和 better() 函数的内存写操作的数量是相同的，而 poor() 函数的 D1mw、DLmw 比 better() 函数分别多 129601 次、129601 次。
从内存读和写操作的不命中总数量来看，两者是相同的。但为什么 better() 函数的执行速度比 poor() 函数快 2 倍？

按顺序访问数据

按数据在内存中排列先后顺序进行访问（读或写）时，通常会降低 cache 的缺失率，即减少访问内存的次数，从而执行速度更快。比如：C/C++ 中的多维数组是以行主序（存储完一行后再存储下一行）存储在内存中的，那么在循环中访问完一行后再访问下一行的方式更高效。Fortran 中的多维数组是以列主序（存储完一列后再存储下一列）存储在内存中的，那么在循环中访问完一列后再访问下一列的方式更高效。

示例：访问二维整型数组

// 按列访问数组元素
long long poor(const int *src, int rows, int cols)
{
    long long sum = 0;
    for (int i = 0; i < cols; ++i)
    {
        for (int j = 0; j < rows; ++j)
        {
            sum += *(src + j * cols + i);
        }
    }
    return sum;
}

// 按行访问数组元素
long long better(const int *src, int rows, int cols)
{
    long long sum = 0;
    for (int i = 0; i < rows; ++i)
    {
        for (int j = 0; j < cols; ++j)
        {
            sum += *(src + i * cols + j);
        }
    }
    return sum;
}

执行时间统计：

启动程序方式

第一次执行耗时(us)

第二次执行耗时(us)

第三次执行耗时(us)

第四次执行耗时(us)

第五次执行耗时(us)

./main_Og

poor:12575
better:2479
poor:12661
better:2240
poor:12687
better:2313
poor:12387
better:2241
poor:12493
better:2230

valgrind --tool=cachegrind ./main_Og

poor:101736
better:54882
poor:99056
better:56708
poor:96098
better:57031
poor:97853
better:57691
poor:95651
better:57502

从统计结果中可以看出，better() 函数的执行速度比 poor() 函数快 5 倍左右。

统计 cache 使用情况：

--------------------------------------------------------------------------------
       Ir I1mr ILmr        Dr   D1mr   DLmr        Dw   D1mw  DLmw 

         .    .    .         .         .      .         .       .       .  long long poor(const int *src, int rows, int cols)
         .    .    .         .         .      .         .       .       .  {
         2    0    0         0         0      0         0       0       0      long long sum = 0;
     4,323    0    0         0         0      0         0       0       0      for (int i = 0; i < cols; ++i)
         .    .    .         .         .      .         .       .       .      {
 8,296,560    1    1         0         0      0         0       0       0          for (int j = 0; j < rows; ++j)
         .    .    .         .         .      .         .       .       .          {
14,515,200    0    0 2,073,600 2,073,600 76,630         0       0       0              sum += *(src + j * cols + i);
         .    .    .         .         .      .         .       .       .          }
         .    .    .         .         .      .         .       .       .      }
         .    .    .         .         .      .         .       .       .      return sum;
         1    0    0         1         1      1         0       0       0  }
         .    .    .         .         .      .         .       .       .  
         .    .    .         .         .      .         .       .       .  long long better(const int *src, int rows, int cols)
         .    .    .         .         .      .         .       .       .  {
         2    0    0         0         0      0         0       0       0      long long sum = 0;
     7,683    0    0         0         0      0         0       0       0      for (int i = 0; i < rows; ++i)
         .    .    .         .         .      .         .       .       .      {
 8,298,240    0    0         0         0      0         0       0       0          for (int j = 0; j < cols; ++j)
         .    .    .         .         .      .         .       .       .          {
14,515,200    0    0 2,073,600   129,601 74,916         0       0       0              sum += *(src + i * cols + j);
         .    .    .         .         .      .         .       .       .          }
         .    .    .         .         .      .         .       .       .      }
         .    .    .         .         .      .         .       .       .      return sum;
         1    0    0         1         1      1         0       0       0  }

输出结果解析：

poor() 和 better() 函数的读内存操作的数量是相同的。
poor() 和 better() 函数在 L1 级缓存的读数据操作不命中数量差距很大，前者的读数据操作不命中数量为后者的 16 倍。这正是 better() 函数执行速度快于 poor() 函数的原因。

如果你觉得本文对你有所帮助，欢迎关注公众号，支持一下！

手机扫一扫

移动阅读更方便

你可能感兴趣的文章

Go语言-Slice详解