Performance Optimization: Measuring Function Performance and Profiling Tools

Problem

Measure the performance of the following accumulation program.

#include <cstdint>
#include <vector>
#include <random>
#include <limits>

using namespace std;

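// Sums all elements of arr; this is the function whose performance is measured.
int64_t Accumulate(const vector<int64_t>& arr)
{
    int64_t sum = 0;
    // size_t matches the type of arr.size() and avoids a signed/unsigned comparison.
    for (size_t i = 0; i < arr.size(); i++) {
        sum += arr[i];
    }
    return sum;
}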

// Returns n random values drawn uniformly from the int32_t range.
vector<int64_t> RandGenerator(int n)
{
    random_device rd;
    mt19937 gen(rd());
    uniform_int_distribution<> distrib(numeric_limits<int32_t>::min(), numeric_limits<int32_t>::max());

    vector<int64_t> ans(n);
    for (size_t i = 0; i < ans.size(); i++) {
        ans[i] = distrib(gen);
    }
    return ans;
}

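int main(int argc, char* argv[]) {

    // 1'000'000'000 int64_t elements occupy about 8 GB of RAM;
    // reduce the count on machines with less memory.
    auto arr = RandGenerator(1'000'000'000);

    auto sum = Accumulate(arr);

    return 0;
}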

Program run time

The most direct metric is how long the program takes to run, which can be measured with high_resolution_clock from the C++ standard library header <chrono>.

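// In addition to the headers above, this version needs
// <chrono>, <format> (C++20) and <iostream>.
int main(int argc, char *argv[]) {

    auto arr = RandGenerator(1'000'000'000);

    // Time only the call to Accumulate, not the data generation.
    auto start = chrono::high_resolution_clock::now();
    auto sum = Accumulate(arr);
    auto end = chrono::high_resolution_clock::now();

    auto diff_ns =
        chrono::duration_cast<chrono::nanoseconds>(end - start).count();

    // The format string already ends with '\n', so no extra endl is needed.
    cout << format("sum: {}, diff: {} ns\n", sum, diff_ns);

    return 0;
}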
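A single timed run like the one above is sensitive to noise (cold caches, page faults, scheduling). The following is a minimal sketch, not part of the original program, that repeats the measurement and keeps the shortest time. The helper name MeasureBestNs and the repeat count are hypothetical, and it swaps in steady_clock, which the standard guarantees to be monotonic, in place of high_resolution_clock.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Same summation as in the article, repeated here so the sketch is self-contained.
int64_t Accumulate(const std::vector<int64_t>& arr)
{
    int64_t sum = 0;
    for (std::size_t i = 0; i < arr.size(); i++) {
        sum += arr[i];
    }
    return sum;
}

// Hypothetical helper: runs fn `repeats` times and returns the shortest
// observed duration in nanoseconds.
template <typename Fn>
int64_t MeasureBestNs(Fn&& fn, int repeats)
{
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    int64_t best = std::numeric_limits<int64_t>::max();
    for (int r = 0; r < repeats; r++) {
        auto start = Clock::now();
        fn();
        auto end = Clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        best = std::min<int64_t>(best, ns);
    }
    return best;
}
```

A caller would pass a lambda that stores the result somewhere observable, for example into a volatile variable, as in MeasureBestNs([&] { sink = Accumulate(arr); }, 5), so that the optimizer cannot discard the summation.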

With MSVC on Windows, high_resolution_clock is an alias for steady_clock, whose implementation wraps QueryPerformanceCounter ¹.

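With GCC on Linux, high_resolution_clock is an alias for system_clock, which is implemented with clock_gettime(CLOCK_REALTIME) ². Because system_clock is not guaranteed to be monotonic, steady_clock is usually the safer choice for measuring intervals there.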
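Since the clock behind high_resolution_clock differs between toolchains, it can help to check the properties of a clock on the target platform. The sketch below is an illustration only: DescribeClock is a hypothetical helper name, and the tick period it reports is the nominal period of the clock type, not a measured resolution.

```cpp
#include <chrono>
#include <string>

// Hypothetical helper: reports whether a clock type is steady and its
// nominal tick period in nanoseconds.
template <typename Clock>
std::string DescribeClock(const char* name)
{
    using Period = typename Clock::period;
    double tick_ns = 1e9 * Period::num / Period::den;
    return std::string(name) +
           ": is_steady=" + (Clock::is_steady ? "true" : "false") +
           ", tick=" + std::to_string(tick_ns) + " ns";
}
```

Calling DescribeClock<std::chrono::high_resolution_clock>("high_resolution_clock") and the same for steady_clock and system_clock shows which of them can be used safely for interval timing on a given compiler.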

Intel VTune Profiler

Intel® VTune™ Profiler is Intel's official high-performance, cross-platform (Windows/Linux/macOS) performance analysis tool.

Installation

Download page: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html . Download the offline installer and install it; once installation finishes, run VTune as administrator.

Preparing the program for analysis

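Compile the program with debug symbols while keeping optimizations on: with MSVC, build the RelWithDebInfo configuration (a CMake build type); with GCC, pass -g -O2.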
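At -O2 the compiler may inline a small function such as Accumulate into its caller, in which case the profiler attributes its time to main. The following sketch, which is an addition and not part of the original program, keeps the function as a separate frame. The macro name NOINLINE and the function name AccumulateNoInline are hypothetical; __declspec(noinline) and __attribute__((noinline)) are the MSVC and GCC/Clang spellings respectively.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical macro: selects the compiler-specific "do not inline" attribute.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Same summation as Accumulate, but never inlined, so a profiler can report
// it as its own entry in the hotspot list.
NOINLINE int64_t AccumulateNoInline(const std::vector<int64_t>& arr)
{
    int64_t sum = 0;
    for (std::size_t i = 0; i < arr.size(); i++) {
        sum += arr[i];
    }
    return sum;
}
```

Forcing a function out of line changes the generated code slightly, so this is best used only while profiling.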

Project configuration

Open Intel VTune Profiler as administrator, create a new Project, and click Configure Analysis.

Under WHERE select Local Host, under WHAT select Launch Application, and in the HOW pane on the right select Hotspots.

Hotspot analysis

When the run finishes, the Profiler presents the analysis results in several views:

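Summary: an overview, mainly listing the functions with the highest CPU usage
Bottom-up: a bottom-up view, including the call chain of each function
Caller/Callee: CPU time of callers and callees
Top-down Tree: a top-down view of the cost of each function call
Flame Graph: the flame graph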

Double-clicking a function in the results opens its source code and assembly.

Microarchitecture analysis

In HOW, select Microarchitecture Exploration. After the run, the following microarchitecture-related information is collected:

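Summary: overall statistics for the microarchitecture metrics
Bottom-up: microarchitecture metrics attributed to each function
Event Count: the number of events attributed to each function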

Perf

Hotspot analysis

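Use the perf record command to sample events, for example perf record -g ./your_program (where ./your_program is a placeholder for the binary under test). The default event is cycles (CPU cycles); use -e to select other events, and add -g to record call stacks. When the program finishes, a perf.data file is written.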

Use the perf report command to view the generated perf.data file.

Use perf annotate to view the assembly along with the percentage of samples attributed to each instruction.

Generating a flame graph

Download the flame graph tools from https://github.com/brendangregg/FlameGraph
Use perf script to process the generated perf.data. Note that call stacks are only present if perf record was run with -g.
perf script -i perf.data > perf.unfold

Use stackcollapse-perf.pl to fold the stacks in perf.unfold

./stackcollapse-perf.pl perf.unfold > perf.folded

Use flamegraph.pl to generate the flame graph

./flamegraph.pl perf.folded > perf.svg

Event counting

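Use the perf stat command to count the events that occur while the program runs; the -e option specifies the list of events to count, for example perf stat -e cycles,instructions ./your_program (again with ./your_program as a placeholder).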