Profiling Node.js Applications

There are many third-party tools available for profiling Node.js applications but, in many cases, the easiest option is to use the Node.js built-in profiler. The built-in profiler uses the profiler inside V8, which samples the stack at regular intervals during program execution. It records the results of these samples, along with important optimization events such as JIT compiles, as a series of ticks:

code-creation,LazyCompile,0,0x2d5000a337a0,396,"bp native array.js:1153:16",0x289f644df68,~
code-creation,LazyCompile,0,0x2d5000a33940,716,"hasOwnProperty native v8natives.js:198:30",0x289f64438d0,~
code-creation,LazyCompile,0,0x2d5000a33c20,284,"ToName native runtime.js:549:16",0x289f643bb28,~
code-creation,Stub,2,0x2d5000a33d40,182,"DoubleToIStub"
code-creation,Stub,2,0x2d5000a33e00,507,"NumberToStringStub"

In the past, you needed the V8 source code to be able to interpret the ticks. Luckily, tools have been introduced since Node.js 4.4.0 that facilitate the consumption of this information without separately building V8 from source. Let's see how the built-in profiler can help provide insight into application performance.

To illustrate the use of the tick profiler, we will work with a simple Express application. Our application will have two handlers, one for adding new users to our system:

app.get('/newUser', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[!@#$%^&*]/g, '');

  if (!username || !password || users[username]) {
    return res.sendStatus(400);
  }

  const salt = crypto.randomBytes(128).toString('base64');
  const hash = crypto.pbkdf2Sync(password, salt, 10000, 512, 'sha512');

  users[username] = { salt, hash };

  res.sendStatus(200);
});

and another for validating user authentication attempts:

app.get('/auth', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[!@#$%^&*]/g, '');

  if (!username || !password || !users[username]) {
    return res.sendStatus(400);
  }

  const { salt, hash } = users[username];
  const encryptHash = crypto.pbkdf2Sync(password, salt, 10000, 512, 'sha512');

  if (crypto.timingSafeEqual(hash, encryptHash)) {
    res.sendStatus(200);
  } else {
    res.sendStatus(401);
  }
});

Please note that these are NOT recommended handlers for authenticating users in your Node.js applications; they are used purely for illustration. In general, you should not be designing your own cryptographic authentication mechanisms. It is much better to use existing, proven authentication solutions.

Now assume that we've deployed our application and users are complaining about high latency on requests. We can easily run the app with the built-in profiler:

NODE_ENV=production node --prof app.js

and put some load on the server using ab (ApacheBench):

curl -X GET "http://localhost:8080/newUser?username=matt&password=password"
ab -k -c 20 -n 250 "http://localhost:8080/auth?username=matt&password=password"

and get an ab output of:

Concurrency Level:      20
Time taken for tests:   46.932 seconds
Complete requests:      250
Failed requests:        0
Keep-Alive requests:    250
Total transferred:      50250 bytes
HTML transferred:       500 bytes
Requests per second:    5.33 [#/sec] (mean)
Time per request:       3754.556 [ms] (mean)
Time per request:       187.728 [ms] (mean, across all concurrent requests)
Transfer rate:          1.05 [Kbytes/sec] received

...

Percentage of the requests served within a certain time (ms)
  50%   3755
  66%   3804
  75%   3818
  80%   3825
  90%   3845
  95%   3858
  98%   3874
  99%   3875
 100%   4225 (longest request)

From this output, we see that we're only managing to serve about 5 requests per second and that the average request takes just under 4 seconds round trip. In a real-world example, we could be doing lots of work in many functions on behalf of a user request, but even in our simple example, time could be lost compiling regular expressions, generating random salts, generating unique hashes from user passwords, or inside the Express framework itself.

Since we ran our application using the --prof option, a tick file was generated in the same directory as your local run of the application. It should have the form isolate-0xnnnnnnnnnnnn-v8.log (where n is a digit).

In order to make sense of this file, we need to use the tick processor bundled with the Node.js binary. To run the processor, use the --prof-process flag:

node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt

Opening processed.txt in your favorite text editor will give you a few different types of information. The file is broken up into sections, which are again broken up by language. First, we look at the summary section and see:

 [Summary]:
   ticks  total  nonlib   name
     79    0.2%    0.2%  JavaScript
  36703   97.2%   99.2%  C++
      7    0.0%    0.0%  GC
    767    2.0%          Shared libraries
    215    0.6%          Unaccounted

This tells us that 97% of all samples gathered occurred in C++ code and that, when viewing other sections of the processed output, we should pay most attention to work being done in C++ (as opposed to JavaScript). With this in mind, we next find the [C++] section, which contains information about which C++ functions are taking the most CPU time, and see:

 [C++]:
   ticks  total  nonlib   name
  19557   51.8%   52.9%  node::crypto::PBKDF2(v8::FunctionCallbackInfo<v8::Value> const&)
   4510   11.9%   12.2%  _sha1_block_data_order
   3165    8.4%    8.6%  _malloc_zone_malloc

We see that the top 3 entries account for 72.1% of the CPU time taken by the program. From this output, we immediately see that at least 51.8% of CPU time is taken up by a function called PBKDF2, which corresponds to our hash generation from a user's password. However, it may not be immediately obvious how the lower two entries factor into our application (or if it is, we will pretend otherwise for the sake of example). To better understand the relationship between these functions, we will next look at the [Bottom up (heavy) profile] section, which provides information about the primary callers of each function. Examining this section, we find:
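As a sanity check, the percentage columns can be reproduced from the raw tick counts in the [Summary] and [C++] sections. This is a quick back-of-the-envelope computation, with the grand total inferred from the rows shown (GC ticks are not re-added, as they are a subset of the other rows):

```javascript
// Raw tick counts from the [Summary] section above.
const js = 79;
const cpp = 36703;
const shared = 767; // shared libraries
const unaccounted = 215;

// "total" percentages divide by every sampled tick; "nonlib" percentages
// exclude ticks that landed in shared libraries.
const total = js + cpp + shared + unaccounted; // 37764
const nonlib = total - shared; // 36997

// Round to one decimal place, like the tick processor's output.
const pct = (n, d) => Math.round((n / d) * 1000) / 10;

console.log(pct(cpp, total)); // → 97.2, matching the C++ summary row
console.log(pct(cpp, nonlib)); // → 99.2
console.log(pct(19557, total)); // → 51.8, matching the PBKDF2 row
```

Reproducing these ratios confirms how to read the two columns: `total` is relative to every sample, while `nonlib` ignores time spent inside shared libraries.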

   ticks parent  name
  19557   51.8%  node::crypto::PBKDF2(v8::FunctionCallbackInfo<v8::Value> const&)
  19557  100.0%    v8::internal::Builtins::~Builtins()
  19557  100.0%      LazyCompile: ~pbkdf2 crypto.js:557:16

   4510   11.9%  _sha1_block_data_order
   4510  100.0%    LazyCompile: *pbkdf2 crypto.js:557:16
   4510  100.0%      LazyCompile: *exports.pbkdf2Sync crypto.js:552:30

   3165    8.4%  _malloc_zone_malloc
   3161   99.9%    LazyCompile: *pbkdf2 crypto.js:557:16
   3161  100.0%      LazyCompile: *exports.pbkdf2Sync crypto.js:552:30

Parsing this section takes a little more work than the raw tick counts above. Within each of the "call stacks" above, the percentage in the parent column tells you the percentage of samples for which the function in the row above was called by the function in the current row. For example, in the middle "call stack" for _sha1_block_data_order, we see that it occurred in 11.9% of samples, which we knew from the raw counts above. However, here we can also tell that it was always called by the pbkdf2 function inside the Node.js crypto module. Similarly, _malloc_zone_malloc was called almost exclusively by the same pbkdf2 function. Thus, using the information in this view, we can tell that our hash computation from the user's password accounts not only for the 51.8% from above but also for all CPU time in the top 3 most sampled functions, since the calls to _sha1_block_data_order and _malloc_zone_malloc were made on behalf of the pbkdf2 function.

At this point, it is very clear that the password-based hash generation should be the target of our optimization. Thankfully, you've fully internalized the benefits of asynchronous programming and you realize that the work to generate a hash from the user's password is being done in a synchronous way, tying up the event loop. This prevents us from working on other incoming requests while computing a hash.

To remedy this issue, you make a small modification to the above handlers to use the asynchronous version of the pbkdf2 function:

app.get('/auth', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[!@#$%^&*]/g, '');

  if (!username || !password || !users[username]) {
    return res.sendStatus(400);
  }

  crypto.pbkdf2(
    password,
    users[username].salt,
    10000,
    512,
    'sha512',
    (err, hash) => {
      // Propagate crypto errors instead of letting the callback throw
      // on an undefined hash.
      if (err) {
        return res.sendStatus(500);
      }
      // Use a constant-time comparison, as in the synchronous handler.
      if (crypto.timingSafeEqual(users[username].hash, hash)) {
        res.sendStatus(200);
      } else {
        res.sendStatus(401);
      }
    }
  );
});

A new run of the ab benchmark above with the asynchronous version of your app yields:

Concurrency Level:      20
Time taken for tests:   12.846 seconds
Complete requests:      250
Failed requests:        0
Keep-Alive requests:    250
Total transferred:      50250 bytes
HTML transferred:       500 bytes
Requests per second:    19.46 [#/sec] (mean)
Time per request:       1027.689 [ms] (mean)
Time per request:       51.384 [ms] (mean, across all concurrent requests)
Transfer rate:          3.82 [Kbytes/sec] received

...

Percentage of the requests served within a certain time (ms)
  50%   1018
  66%   1035
  75%   1041
  80%   1043
  90%   1049
  95%   1063
  98%   1070
  99%   1071
 100%   1079 (longest request)

Yay! Your app is now serving about 20 requests per second, roughly 4 times more than it was with the synchronous hash generation. Additionally, the average latency is down from the 4 seconds before to just over 1 second.

Hopefully, through the performance investigation of this (admittedly contrived) example, you've seen how the V8 tick processor can help you gain a better understanding of the performance of your Node.js applications.

You may also find how to create a flame graph helpful.