Tracing garbage collection

This guide will go through the fundamentals of garbage collection traces.

By the end of this guide, you'll be able to:

  • Enable traces in your Node.js application

  • Interpret traces

  • Identify potential memory issues in your Node.js application

There's a lot to learn about how the garbage collector works, but if you learn one thing, it's that when the GC is running, your code is not.

You may want to know how often and how long the garbage collection runs, and what the outcome is.

Setup

For the purpose of this guide, we'll use the following script:

// script.mjs

import os from 'node:os';

let len = 1_000_000;
const entries = new Set();

function addEntry() {
  const entry = {
    timestamp: Date.now(),
    memory: os.freemem(),
    totalMemory: os.totalmem(),
    uptime: os.uptime(),
  };

  entries.add(entry);
}

function summary() {
  console.log(`Total: ${entries.size} entries`);
}

// execution
(() => {
  while (len > 0) {
    addEntry();
    process.stdout.write(`~~> ${len} entries to record\r`);
    len--;
  }

  summary();
})();

Even if the leak is evident here, finding the source of a leak could be cumbersome in the context of a real-world application.

Running with garbage collection traces

You can see traces for garbage collection in the console output of your process using the --trace-gc flag.

$ node --trace-gc script.mjs

Note: you can find the source code of this exercise in the Node.js Diagnostics repository.

It should output something like:

[39067:0x158008000]     2297 ms: Scavenge 117.5 (135.8) -> 102.2 (135.8) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2375 ms: Scavenge 120.0 (138.3) -> 104.7 (138.3) MB, 0.9 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2453 ms: Scavenge 122.4 (140.8) -> 107.1 (140.8) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2531 ms: Scavenge 124.9 (143.3) -> 109.6 (143.3) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2610 ms: Scavenge 127.1 (145.5) -> 111.8 (145.5) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2688 ms: Scavenge 129.6 (148.0) -> 114.2 (148.0) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2766 ms: Scavenge 132.0 (150.5) -> 116.7 (150.5) MB, 1.1 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
Total: 1000000 entries

Hard to read? Maybe we should review a few concepts and explain the output of the --trace-gc flag.

Examining a trace with --trace-gc

The --trace-gc (or --trace_gc, either is fine) flag outputs all garbage collection events to the console. The composition of each line can be described as follows:

[13973:0x110008000]       44 ms: Scavenge 2.4 (3.2) -> 2.0 (4.2) MB, 0.5 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure

Token: Interpretation
13973: the PID of the running process
0x110008000: the isolate (an instance of a JS heap)
44 ms: the time since the process started, in milliseconds
Scavenge: the type / phase of the GC
2.4: heap used before the GC, in MB
(3.2): total heap before the GC, in MB
2.0: heap used after the GC, in MB
(4.2): total heap after the GC, in MB
0.5 / 0.0 ms (average mu = 1.000, current mu = 1.000): the time spent in the GC, in milliseconds
allocation failure: the reason for the GC

We'll only focus on two events here:

  • Scavenge

  • Mark-sweep

The heap is divided into spaces. Amongst these, we have a space called the "new" space and another one called the "old" space.

👉 In reality, the heap structure is a bit different, but we'll stick to a simpler version for this article. If you want more details, we encourage you to look at Peter Marshall's talk about Orinoco.

Scavenge

Scavenge is the name of the algorithm that performs garbage collection in the new space. The new space is where objects are created; it is designed to be small and fast to garbage-collect.

Let's imagine a Scavenge scenario:

  • we allocated A, B, C & D.

    | A | B | C | D | <unallocated> |
    
  • we want to allocate E

  • not enough space, the memory is exhausted

  • then, a (garbage) collection is triggered

  • dead objects are collected

  • living objects will stay

  • assuming B and D were dead

    | A | C | <unallocated> |
    
  • now we can allocate E

    | A | C | E | <unallocated> |
    

V8 will promote objects that have not been garbage collected after two Scavenge operations to the old space.

👉 Full Scavenge scenario

Mark-sweep

Mark-sweep is used to collect objects from the old space. The old space is where objects that survived the new space live.

This algorithm is composed of two phases:

  • Mark: marks objects that are still alive as black and the others as white.

  • Sweep: scans for white objects and converts them to free space.

👉 In fact, the Mark and Sweep steps are a bit more elaborate. Please read this document for more details.

--trace-gc in action

Memory leak

Now, if you quickly go back to the previous terminal window, you will see many Mark-sweep events in the console. We also see that the amount of memory collected after each event is insignificant.

Now that we are experts in garbage collection, what can we deduce?

We probably have a memory leak! But how can we be sure of that? (Reminder: it is pretty apparent in this example, but what about a real-world application?)

But how can we spot the context?

How to get the context of bad allocations

  1. Suppose we observe that the old space is continuously increasing.

  2. Reduce --max-old-space-size such that the total heap is closer to the limit.

  3. Run the program until you hit an out-of-memory error.

  4. The produced log shows the failing context.

  5. If it hits OOM, increase the heap size by ~10% and repeat a few times. If the same pattern is observed, it indicates a memory leak.

  6. If there is no OOM, then freeze the heap size to that value: a packed heap reduces memory footprint and computation latency.

For example, try to run script.mjs with the following command:

node --trace-gc --max-old-space-size=50 script.mjs

You should experience an OOM:

[...]
<--- Last few GCs --->
[40928:0x148008000]      509 ms: Mark-sweep 46.8 (65.8) -> 40.6 (77.3) MB, 6.4 / 0.0 ms  (+ 1.4 ms in 11 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 24 ms) (average mu = 0.977, current mu = 0.977) finalize incrementa
[40928:0x148008000]      768 ms: Mark-sweep 56.3 (77.3) -> 47.1 (83.0) MB, 35.9 / 0.0 ms  (average mu = 0.927, current mu = 0.861) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory [...]

Now, try it with 100mb:

node --trace-gc --max-old-space-size=100 script.mjs

You should experience something similar; the only difference is that the last GC trace will contain a bigger heap size.

<--- Last few GCs --->
[40977:0x128008000]     2066 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 46.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 47 ms) (average mu = 0.154, current mu = 0.155) allocati
[40977:0x128008000]     2123 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 47.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 48 ms) (average mu = 0.165, current mu = 0.175) allocati

Note: in the context of a real-world application, it could be cumbersome to find the leaked object in the code. A heap snapshot could help you find it. Visit the guide dedicated to heap snapshots.
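
As a quick, hypothetical illustration (assuming Node.js 11.13 or later, where v8.writeHeapSnapshot() is available), you can also capture a snapshot from inside the process:

// write-snapshot.mjs - a minimal sketch, not part of the original exercise
import v8 from 'node:v8';

// Writes a .heapsnapshot file and returns its filename;
// the file can then be loaded into Chrome DevTools for inspection.
const file = v8.writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);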

Slowness

How do you assert whether too many garbage collections are happening or causing an overhead? (A small measurement sketch follows the list below.)

  1. Review the trace data, precisely the time between consecutive collections.

  2. Review the trace data, specifically the time spent in GC.

  3. If the time between two GCs is less than the time spent in GC, the application is severely starving.

  4. If the time between two GCs and the time spent in GC are both very high, the application can probably use a smaller heap.

  5. If the time between two GCs is much greater than the time spent in GC, the application is relatively healthy.
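
To make these rules of thumb concrete, here is a minimal sketch (not part of the original guide; it relies on the performance hooks covered later in this article) that reports the share of wall-clock time spent in GC. The 5-second reporting interval is an arbitrary choice:

// gc-overhead.mjs - a hypothetical measurement sketch
import { PerformanceObserver, performance } from 'node:perf_hooks';

let gcTime = 0; // accumulated milliseconds spent in GC
const start = performance.now();

// Add up the duration of every GC event
const obs = new PerformanceObserver(list => {
  for (const entry of list.getEntries()) {
    gcTime += entry.duration;
  }
});
obs.observe({ entryTypes: ['gc'] });

// Periodically report the fraction of elapsed time spent in GC;
// a ratio approaching 100% means the application is starving.
setInterval(() => {
  const elapsed = performance.now() - start;
  console.log(`GC overhead: ${((gcTime / elapsed) * 100).toFixed(2)}%`);
}, 5_000).unref();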

Fix the leak

Now let's fix the leak. Instead of using an object to store our entries, we could use a file.

Let's modify our script a bit:

// script-fix.mjs
import os from 'node:os';
import fs from 'node:fs/promises';

let len = 1_000_000;
const fileName = `entries-${Date.now()}`;

async function addEntry() {
  const entry = {
    timestamp: Date.now(),
    memory: os.freemem(),
    totalMemory: os.totalmem(),
    uptime: os.uptime(),
  };
  await fs.appendFile(fileName, JSON.stringify(entry) + '\n');
}

async function summary() {
  const stats = await fs.lstat(fileName);
  console.log(`File size ${stats.size} bytes`);
}

// execution
(async () => {
  await fs.writeFile(fileName, '----START---\n');
  while (len > 0) {
    await addEntry();
    process.stdout.write(`~~> ${len} entries to record\r`);
    len--;
  }

  await summary();
})();

Using a Set to store data is not a bad practice at all; you should just be mindful of the memory footprint of your program.

Note: you can find the source code of this exercise in the Node.js Diagnostics repository.

Now, let's execute this script.

node --trace-gc script-fix.mjs

You should observe two things:

  • Mark-sweep events appear less frequently

  • the memory footprint doesn't exceed 25MB, versus more than 130MB with the first script.

This makes a lot of sense, as the new version puts less pressure on memory than the first one.

Takeaway: What do you think about improving this script? You have probably noticed that the new version of the script is slow. What if we used a Set again and only wrote its content to a file when the memory reaches a specific size?

The getHeapStatistics API could help you here (one possible approach is sketched below).
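
Here is one possible sketch of that idea (an illustration, not the guide's reference solution): keep buffering entries in a Set and flush them to the file only once the used heap crosses a threshold. The 64 MB limit and the file layout below are arbitrary choices.

// script-batch.mjs - a hypothetical improvement of script-fix.mjs
import os from 'node:os';
import fs from 'node:fs/promises';
import v8 from 'node:v8';

let len = 1_000_000;
const fileName = `entries-${Date.now()}`;
const entries = new Set();
const HEAP_THRESHOLD = 64 * 1024 * 1024; // flush once the used heap exceeds ~64 MB (arbitrary)

async function flush() {
  if (entries.size === 0) return;
  const lines = [...entries].map(entry => JSON.stringify(entry)).join('\n');
  await fs.appendFile(fileName, lines + '\n');
  entries.clear();
}

async function addEntry() {
  entries.add({
    timestamp: Date.now(),
    memory: os.freemem(),
    totalMemory: os.totalmem(),
    uptime: os.uptime(),
  });

  // Only touch the file system when memory pressure builds up
  if (v8.getHeapStatistics().used_heap_size > HEAP_THRESHOLD) {
    await flush();
  }
}

// execution
(async () => {
  await fs.writeFile(fileName, '----START---\n');
  while (len > 0) {
    await addEntry();
    len--;
  }

  await flush();
})();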

Bonus: Trace garbage collection programmatically

Using the v8 module

You might want to avoid getting traces for the entire lifetime of your process. In that case, set the flag from within the process: the v8 module exposes an API to set flags on the fly.

import v8 from 'node:v8';

// enabling trace-gc
v8.setFlagsFromString('--trace-gc');

// disabling trace-gc
v8.setFlagsFromString('--notrace-gc');
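
For example, a minimal sketch (assuming you only care about a one-minute window; this is not part of the original guide) could enable the traces at startup and switch them off later:

// trace-gc-window.mjs - a hypothetical sketch
import v8 from 'node:v8';

// enable GC traces now...
v8.setFlagsFromString('--trace-gc');

// ...and turn them off again after 60 seconds
setTimeout(() => {
  v8.setFlagsFromString('--notrace-gc');
}, 60_000);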

Using performance hooks

In Node.js, you can use performance hooks to trace garbage collection.

const { PerformanceObserver } = require('node:perf_hooks');

// Create a performance observer
const obs = new PerformanceObserver(list => {
  const entry = list.getEntries()[0];
  /*
  The entry is an instance of PerformanceEntry containing
  metrics of a single garbage collection event.
  For example:
  PerformanceEntry {
    name: 'gc',
    entryType: 'gc',
    startTime: 2820.567669,
    duration: 1.315709,
    kind: 1
  }
  */
});

// Subscribe to notifications of GCs
obs.observe({ entryTypes: ['gc'] });

// Stop subscription
obs.disconnect();

Examining a trace with performance hooks

You can get GC statistics as a PerformanceEntry from the callback of a PerformanceObserver.

For example:

PerformanceEntry {
  name: 'gc',
  entryType: 'gc',
  startTime: 2820.567669,
  duration: 1.315709,
  kind: 1
}

Property: Interpretation
name: the name of the performance entry.
entryType: the type of the performance entry.
startTime: the high-resolution millisecond timestamp marking the starting time of the performance entry.
duration: the total number of milliseconds elapsed for this entry.
kind: the type of garbage collection operation that occurred.
flags: additional information about the GC.
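
As an illustration (a sketch, not an official recipe), the NODE_PERFORMANCE_GC_* constants exported by node:perf_hooks can be used to turn the numeric kind into a readable label; note that depending on your Node.js version, the value may be exposed as entry.detail.kind instead of entry.kind:

// gc-kind.mjs - a hypothetical sketch
import { PerformanceObserver, constants } from 'node:perf_hooks';

const kinds = {
  [constants.NODE_PERFORMANCE_GC_MINOR]: 'minor (Scavenge)',
  [constants.NODE_PERFORMANCE_GC_MAJOR]: 'major (Mark-sweep)',
  [constants.NODE_PERFORMANCE_GC_INCREMENTAL]: 'incremental marking',
  [constants.NODE_PERFORMANCE_GC_WEAKCB]: 'weak callbacks',
};

const obs = new PerformanceObserver(list => {
  for (const entry of list.getEntries()) {
    // Newer Node.js versions expose the GC type on entry.detail.kind
    const kind = entry.detail?.kind ?? entry.kind;
    console.log(`${kinds[kind] ?? kind}: ${entry.duration.toFixed(2)} ms`);
  }
});
obs.observe({ entryTypes: ['gc'] });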

For more information, you can refer to the documentation about performance hooks.