Don't Block the Event Loop (or the Worker Pool)

Should you read this guide?

If you're writing anything more complicated than a brief command-line script, reading this should help you write higher-performance, more-secure applications.

This document is written with Node.js servers in mind, but the concepts apply to complex Node.js applications as well. Where OS-specific details vary, this document is Linux-centric.

Summary

Node.js runs JavaScript code in the Event Loop (initialization and callbacks), and offers a Worker Pool to handle expensive tasks like file I/O. Node.js scales well, sometimes better than more heavyweight approaches like Apache. The secret to the scalability of Node.js is that it uses a small number of threads to handle many clients. If Node.js can make do with fewer threads, then it can spend more of your system's time and memory working on clients rather than on paying space and time overheads for threads (memory, context-switching). But because Node.js has only a few threads, you must structure your application to use them wisely.

Here's a good rule of thumb for keeping your Node.js server speedy: Node.js is fast when the work associated with each client at any given time is "small".

This applies to callbacks on the Event Loop and tasks on the Worker Pool.

Why should I avoid blocking the Event Loop and the Worker Pool?

Node.js uses a small number of threads to handle many clients. In Node.js there are two types of threads: one Event Loop (aka the main loop, main thread, event thread, etc.), and a pool of k Workers in a Worker Pool (aka the threadpool).

If a thread is taking a long time to execute a callback (Event Loop) or a task (Worker), we call it "blocked". While a thread is blocked working on behalf of one client, it cannot handle requests from any other clients. This provides two motivations for blocking neither the Event Loop nor the Worker Pool:

  1. Performance: If you regularly perform heavyweight activity on either type of thread, the throughput (requests/second) of your server will suffer.

  2. Security: If it is possible that for certain input one of your threads might block, a malicious client could submit this "evil input", make your threads block, and keep them from working on other clients. This would be a Denial of Service attack.

A quick review of Node

Node.js uses the Event-Driven Architecture: it has an Event Loop for orchestration and a Worker Pool for expensive tasks.

What code runs on the Event Loop?

When they begin, Node.js applications first complete an initialization phase, require'ing modules and registering callbacks for events. Node.js applications then enter the Event Loop, responding to incoming client requests by executing the appropriate callback. This callback executes synchronously, and may register asynchronous requests to continue processing after it completes. The callbacks for these asynchronous requests will also be executed on the Event Loop.

The Event Loop will also fulfill the non-blocking asynchronous requests made by its callbacks, e.g., network I/O.

In summary, the Event Loop executes the JavaScript callbacks registered for events, and is also responsible for fulfilling non-blocking asynchronous requests like network I/O.
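This synchronous-then-asynchronous flow can be sketched in a few lines (a minimal illustration; setImmediate stands in for any asynchronous request a callback might register):

```javascript
const order = [];

order.push('callback start');

// Register an asynchronous request. Its callback will also run on the
// Event Loop, but only after the current callback has completed.
setImmediate(() => order.push('async callback'));

order.push('callback end');

// The current callback runs to completion first:
console.log(order); // ['callback start', 'callback end']
```

The async callback only fires on a later turn of the Event Loop, after the synchronous code above has finished.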

What code runs on the Worker Pool?

The Worker Pool of Node.js is implemented in libuv (docs), which exposes a general task submission API.

Node.js uses the Worker Pool to handle "expensive" tasks. This includes I/O for which an operating system does not provide a non-blocking version, as well as particularly CPU-intensive tasks.

These are the Node.js module APIs that make use of this Worker Pool:

  1. I/O-intensive

    1. DNS: dns.lookup(), dns.lookupService().

    2. File System: All file system APIs except fs.FSWatcher() and those that are explicitly synchronous use libuv's threadpool.

  2. CPU-intensive

    1. Crypto: crypto.pbkdf2(), crypto.scrypt(), crypto.randomBytes(), crypto.randomFill(), crypto.generateKeyPair().

    2. Zlib: All zlib APIs except those that are explicitly synchronous use libuv's threadpool.

In many Node.js applications, these APIs are the only sources of tasks for the Worker Pool. Applications and modules that use a C++ add-on can submit other tasks to the Worker Pool.

For the sake of completeness, we note that when you call one of these APIs from a callback on the Event Loop, the Event Loop pays some minor setup costs as it enters the Node.js C++ bindings for that API and submits a task to the Worker Pool. These costs are negligible compared to the overall cost of the task, which is why the Event Loop is offloading it. When submitting one of these tasks to the Worker Pool, Node.js provides a pointer to the corresponding C++ function in the Node.js C++ bindings.

How does Node.js decide what code to run next?

Abstractly, the Event Loop and the Worker Pool maintain queues for pending events and pending tasks, respectively.

In truth, the Event Loop does not actually maintain a queue. Instead, it has a collection of file descriptors that it asks the operating system to monitor, using a mechanism like epoll (Linux), kqueue (OSX), event ports (Solaris), or IOCP (Windows). These file descriptors correspond to network sockets, any files it is watching, and so on. When the operating system says that one of these file descriptors is ready, the Event Loop translates it to the appropriate event and invokes the callback(s) associated with that event. You can learn more about this process here.

In contrast, the Worker Pool uses a real queue whose entries are tasks to be processed. A Worker pops a task from this queue and works on it, and when finished the Worker raises an "At least one task is finished" event for the Event Loop.

What does this mean for application design?

In a one-thread-per-client system like Apache, each pending client is assigned its own thread. If a thread handling one client blocks, the operating system will interrupt it and give another client a turn. The operating system thus ensures that clients that require a small amount of work are not penalized by clients that require more work.

Because Node.js handles many clients with few threads, if a thread blocks handling one client's request, then pending client requests may not get a turn until the thread finishes its callback or task. The fair treatment of clients is thus the responsibility of your application. This means that you shouldn't do too much work for any client in any single callback or task.

This is part of why Node.js can scale well, but it also means that you are responsible for ensuring fair scheduling. The next sections talk about how to ensure fair scheduling for the Event Loop and for the Worker Pool.

Don't block the Event Loop

The Event Loop notices each new client connection and orchestrates the generation of a response. All incoming requests and outgoing responses pass through the Event Loop. This means that if the Event Loop spends too long at any point, all current and new clients will not get a turn.

You should make sure you never block the Event Loop. In other words, each of your JavaScript callbacks should complete quickly. This of course also applies to your await's, your Promise.then's, and so on.

A good way to ensure this is to reason about the "computational complexity" of your callbacks. If your callback takes a constant number of steps no matter what its arguments are, then you'll always give every pending client a fair turn. If your callback takes a different number of steps depending on its arguments, then you should think about how long the arguments might be.

Example 1: A constant-time callback.

app.get('/constant-time', (req, res) => {
  res.sendStatus(200);
});

Example 2: An O(n) callback. This callback will run quickly for small n and more slowly for large n.

app.get('/countToN', (req, res) => {
  let n = req.query.n;

  // n iterations before giving someone else a turn
  for (let i = 0; i < n; i++) {
    console.log(`Iter ${i}`);
  }

  res.sendStatus(200);
});

Example 3: An O(n^2) callback. This callback will still run quickly for small n, but for large n it will run much more slowly than the previous O(n) example.

app.get('/countToN2', (req, res) => {
  let n = req.query.n;

  // n^2 iterations before giving someone else a turn
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      console.log(`Iter ${i}.${j}`);
    }
  }

  res.sendStatus(200);
});

How careful should you be?

Node.js uses the Google V8 engine for JavaScript, which is quite fast for many common operations. Exceptions to this rule are regexps and JSON operations, discussed below.

However, for complex tasks you should consider bounding the input and rejecting inputs that are too long. That way, even if your callback has large complexity, by bounding the input you ensure the callback cannot take more than the worst-case time on the longest acceptable input. You can then evaluate the worst-case cost of this callback and determine whether its running time is acceptable in your context.
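As a sketch of this input-bounding idea (MAX_INPUT_LENGTH and the O(n^2) loop body are illustrative, not from this guide):

```javascript
const MAX_INPUT_LENGTH = 1000; // illustrative bound; tune for your context

function boundedCount(input) {
  // Reject over-long input before doing any expensive work.
  if (input.length > MAX_INPUT_LENGTH) {
    return { ok: false, reason: 'input too long' };
  }
  // O(n^2) work, but n is now capped at MAX_INPUT_LENGTH,
  // so the worst-case cost of this callback is known in advance.
  let count = 0;
  for (let i = 0; i < input.length; i++) {
    for (let j = 0; j < input.length; j++) count++;
  }
  return { ok: true, count };
}

console.log(boundedCount('abc')); // { ok: true, count: 9 }
```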

Blocking the Event Loop: REDOS

One common way to block the Event Loop disastrously is by using a "vulnerable" regular expression.

Avoiding vulnerable regular expressions

A regular expression (regexp) matches an input string against a pattern. We usually think of a regexp match as requiring a single pass through the input string --- O(n) time where n is the length of the input string. In many cases, a single pass is indeed all it takes. Unfortunately, in some cases the regexp match might require an exponential number of trips through the input string --- O(2^n) time. An exponential number of trips means that if the engine requires x trips to determine a match, it will need 2*x trips if we add only one more character to the input string. Since the number of trips is linearly related to the time required, the effect of this evaluation will be to block the Event Loop.

A vulnerable regular expression is one on which your regular expression engine might take exponential time, exposing you to REDOS on "evil input". Whether or not your regular expression pattern is vulnerable (i.e. the regexp engine might take exponential time on it) is actually a difficult question to answer, and varies depending on whether you're using Perl, Python, Ruby, Java, JavaScript, etc., but here are some rules of thumb that apply across all of these languages:

  1. Avoid nested quantifiers like (a+)*. V8's regexp engine can handle some of these quickly, but others are vulnerable.

  2. Avoid OR's with overlapping clauses, like (a|a)*. Again, these are sometimes-fast.

  3. Avoid using backreferences, like (a.*) \1. No regexp engine can guarantee evaluating these in linear time.

  4. If you're doing a simple string match, use indexOf or the local equivalent. It will be cheaper and will never take more than O(n).
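Rule 4 in action (a sketch; the substring check is illustrative):

```javascript
const input = '/a/b/c';

// Both find the substring, but indexOf is guaranteed linear in the
// input length, with no backtracking engine involved:
const withRegexp = /\/b\//.test(input);
const withIndexOf = input.indexOf('/b/') !== -1;

console.log(withRegexp, withIndexOf); // true true
```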

If you aren't sure whether your regular expression is vulnerable, remember that Node.js generally doesn't have trouble reporting a match even for a vulnerable regexp and a long input string. The exponential behavior is triggered when there is a mismatch but Node.js can't be certain until it tries many paths through the input string.

A REDOS example

Here is an example vulnerable regexp exposing its server to REDOS:

app.get('/redos-me', (req, res) => {
  let filePath = req.query.filePath;

  // REDOS
  if (filePath.match(/(\/.+)+$/)) {
    console.log('valid path');
  } else {
    console.log('invalid path');
  }

  res.sendStatus(200);
});

The vulnerable regexp in this example is a (bad!) way to check for a valid path on Linux. It matches strings that are a sequence of "/"-delimited names, like "/a/b/c". It is dangerous because it violates rule 1: it has a doubly-nested quantifier.

If a client queries with filePath ///.../\n (100 /'s followed by a newline character that the regexp's "." won't match), then the Event Loop will take effectively forever, blocking the Event Loop. This client's REDOS attack causes all other clients not to get a turn until the regexp match finishes.

For this reason, you should be leery of using complex regular expressions to validate user input.

Anti-REDOS Resources

There are some tools to check your regexps for safety. However, no such tool will catch all vulnerable regexps.

Another approach is to use a different regexp engine. You could use the node-re2 module, which uses Google's blazing-fast RE2 regexp engine. But be warned, RE2 is not 100% compatible with V8's regexps, so check for regressions if you swap in the node-re2 module to handle your regexps. And particularly complicated regexps are not supported by node-re2.

If you're trying to match something "obvious", like a URL or a file path, find an example in a regexp library or use an npm module, e.g. ip-regex.

Blocking the Event Loop: Node.js core modules

Several Node.js core modules have synchronous expensive APIs, including those for encryption, compression, file system access, and child processes.

These APIs are expensive, because they involve significant computation (encryption, compression), require I/O (file I/O), or potentially both (child process). These APIs are intended for scripting convenience, but are not intended for use in the server context. If you execute them on the Event Loop, they will take far longer to complete than a typical JavaScript instruction, blocking the Event Loop.

In a server, you should not use the following synchronous APIs from these modules:

  • Encryption:

    • crypto.randomBytes (synchronous version)

    • crypto.randomFillSync

    • crypto.pbkdf2Sync

    • You should also be careful about providing large input to the encryption and decryption routines.

  • Compression:

    • zlib.inflateSync

    • zlib.deflateSync

  • File system:

    • Do not use the synchronous file system APIs. For example, if the file you access is in a distributed file system like NFS, access times can vary widely.

  • Child process:

    • child_process.spawnSync

    • child_process.execSync

    • child_process.execFileSync

This list is reasonably complete as of Node.js v9.

Blocking the Event Loop: JSON DOS

JSON.parse and JSON.stringify are other potentially expensive operations. While these are O(n) in the length of the input, for large n they can take surprisingly long.

If your server manipulates JSON objects, particularly those from a client, you should be cautious about the size of the objects or strings you work with on the Event Loop.

Example: JSON blocking. We create an object obj of size 2^21 and JSON.stringify it, run indexOf on the string, and then JSON.parse it. The JSON.stringify'd string is 50MB. It takes 0.7 seconds to stringify the object, 0.03 seconds to indexOf on the 50MB string, and 1.3 seconds to parse the string.

let obj = { a: 1 };
let niter = 20;

let before, str, pos, res, took;

for (let i = 0; i < niter; i++) {
  obj = { obj1: obj, obj2: obj }; // Doubles in size each iter
}

before = process.hrtime();
str = JSON.stringify(obj);
took = process.hrtime(before);
console.log('JSON.stringify took ' + took);

before = process.hrtime();
pos = str.indexOf('nomatch');
took = process.hrtime(before);
console.log('Pure indexof took ' + took);

before = process.hrtime();
res = JSON.parse(str);
took = process.hrtime(before);
console.log('JSON.parse took ' + took);

There are npm modules that offer asynchronous JSON APIs. See for example:

  • JSONStream, which has stream APIs.

  • Big-Friendly JSON, which has stream APIs as well as asynchronous versions of the standard JSON APIs using the partitioning-on-the-Event-Loop paradigm outlined below.

Complex calculations without blocking the Event Loop

Suppose you want to do complex calculations in JavaScript without blocking the Event Loop. You have two options: partitioning or offloading.

Partitioning

You could partition your calculations so that each runs on the Event Loop but regularly yields (gives turns to) other pending events. In JavaScript it's easy to save the state of an ongoing task in a closure, as shown in example 2 below.

For a simple example, suppose you want to compute the average of the numbers 1 to n.

Example 1: Un-partitioned average, costs O(n)

let sum = 0;
for (let i = 1; i <= n; i++) sum += i;
let avg = sum / n;
console.log('avg: ' + avg);

Example 2: Partitioned average, each of the n asynchronous steps costs O(1).

function asyncAvg(n, avgCB) {
  // Save ongoing sum in JS closure.
  let sum = 0;
  function help(i, cb) {
    sum += i;
    if (i == n) {
      cb(sum);
      return;
    }

    // "Asynchronous recursion".
    // Schedule next operation asynchronously.
    setImmediate(help.bind(null, i + 1, cb));
  }

  // Start the helper, with CB to call avgCB.
  help(1, function (sum) {
    let avg = sum / n;
    avgCB(avg);
  });
}

asyncAvg(n, function (avg) {
  console.log('avg of 1-n: ' + avg);
});

You can apply this principle to array iterations and so forth.
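A partitioned array iteration might look like this (a sketch following the same setImmediate pattern; asyncForEach and processOne are illustrative names, not a standard API):

```javascript
// Process one element per turn of the Event Loop, so other pending
// callbacks get a turn between elements.
function asyncForEach(arr, processOne, done) {
  function helper(i) {
    if (i === arr.length) {
      done();
      return;
    }
    processOne(arr[i]);
    // Yield before the next element.
    setImmediate(() => helper(i + 1));
  }
  helper(0);
}

const seen = [];
asyncForEach([1, 2, 3], (x) => seen.push(x * x), () => {
  console.log(seen); // [1, 4, 9]
});
```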

Offloading

If you need to do something more complex, partitioning is not a good option. This is because partitioning uses only the Event Loop, and you won't benefit from the multiple cores almost certainly available on your machine. Remember, the Event Loop should orchestrate client requests, not fulfill them itself. For a complicated task, move the work off of the Event Loop onto a Worker Pool.

How to offload

You have two options for a destination Worker Pool to which to offload work.

  1. You can use the built-in Node.js Worker Pool by developing a C++ addon. On older versions of Node, build your C++ addon using NAN, and on newer versions use N-API. node-webworker-threads offers a JavaScript-only way to access the Node.js Worker Pool.

  2. You can create and manage your own Worker Pool dedicated to computation rather than the Node.js I/O-themed Worker Pool. The most straightforward way to do this is to use Child Process or Cluster.

You should not simply create a Child Process for every client. You can receive client requests more quickly than you can create and manage children, and your server might become a fork bomb.

Downside of offloading

The downside of the offloading approach is that it incurs overhead in the form of communication costs. Only the Event Loop is allowed to see the "namespace" (JavaScript state) of your application. From a Worker, you cannot manipulate a JavaScript object in the Event Loop's namespace. Instead, you have to serialize and deserialize any objects you wish to share. Then the Worker can operate on its own copy of these object(s) and return the modified object (or a "patch") to the Event Loop.

For serialization concerns, see the section on JSON DOS.

Some suggestions for offloading

You may wish to distinguish between CPU-intensive and I/O-intensive tasks because they have markedly different characteristics.

A CPU-intensive task only makes progress when its Worker is scheduled, and the Worker must be scheduled onto one of your machine's logical cores. If you have 4 logical cores and 5 Workers, one of these Workers cannot make progress. As a result, you are paying overhead (memory and scheduling costs) for this Worker and getting no return for it.

I/O-intensive tasks involve querying an external service provider (DNS, file system, etc.) and waiting for its response. While a Worker with an I/O-intensive task is waiting for its response, it has nothing else to do and can be de-scheduled by the operating system, giving another Worker a chance to submit their request. Thus, I/O-intensive tasks will be making progress even while the associated thread is not running. External service providers like databases and file systems have been highly optimized to handle many pending requests concurrently. For example, a file system will examine a large set of pending write and read requests to merge conflicting updates and to retrieve files in an optimal order.

If you rely on only one Worker Pool, e.g. the Node.js Worker Pool, then the differing characteristics of CPU-bound and I/O-bound work may harm your application's performance.

For this reason, you might wish to maintain a separate Computation Worker Pool.

Offloading: conclusions

For simple tasks, like iterating over the elements of an arbitrarily long array, partitioning might be a good option. If your computation is more complex, offloading is a better approach: the communication costs, i.e. the overhead of passing serialized objects between the Event Loop and the Worker Pool, are offset by the benefit of using multiple cores.

However, if your server relies heavily on complex calculations, you should think about whether Node.js is really a good fit. Node.js excels for I/O-bound work, but for expensive computation it might not be the best option.

If you take the offloading approach, see the section on not blocking the Worker Pool.

Don't block the Worker Pool

Node.js has a Worker Pool composed of k Workers. If you are using the Offloading paradigm discussed above, you might have a separate Computational Worker Pool, to which the same principles apply. In either case, let us assume that k is much smaller than the number of clients you might be handling concurrently. This is in keeping with the "one thread for many clients" philosophy of Node.js, the secret to its scalability.

As discussed above, each Worker completes its current Task before proceeding to the next one on the Worker Pool queue.

Now, there will be variation in the cost of the Tasks required to handle your clients' requests. Some Tasks can be completed quickly (e.g. reading short or cached files, or producing a small number of random bytes), and others will take longer (e.g. reading larger or uncached files, or generating more random bytes). Your goal should be to minimize the variation in Task times, and you should use Task partitioning to accomplish this.

Minimizing the variation in Task times

If a Worker's current Task is much more expensive than other Tasks, then it will be unavailable to work on other pending Tasks. In other words, each relatively long Task effectively decreases the size of the Worker Pool by one until it is completed. This is undesirable because, up to a point, the more Workers in the Worker Pool, the greater the Worker Pool throughput (tasks/second) and thus the greater the server throughput (client requests/second). One client with a relatively expensive Task will decrease the throughput of the Worker Pool, in turn decreasing the throughput of the server.

To avoid this, you should try to minimize variation in the length of Tasks you submit to the Worker Pool. While it is appropriate to treat the external systems accessed by your I/O requests (DB, FS, etc.) as black boxes, you should be aware of the relative cost of these I/O requests, and should avoid submitting requests you can expect to be particularly long.

Two examples should illustrate the possible variation in task times.

Variation example: Long-running file system reads

Suppose your server must read files in order to handle some client requests. After consulting the Node.js File system APIs, you opted to use fs.readFile() for simplicity. However, fs.readFile() is (currently) not partitioned: it submits a single fs.read() Task spanning the entire file. If you read shorter files for some users and longer files for others, fs.readFile() may introduce significant variation in Task lengths, to the detriment of Worker Pool throughput.

For a worst-case scenario, suppose an attacker can convince your server to read an arbitrary file (this is a directory traversal vulnerability). If your server is running Linux, the attacker can name an extremely slow file: /dev/random. For all practical purposes, /dev/random is infinitely slow, and every Worker asked to read from /dev/random will never finish that Task. An attacker then submits k requests, one for each Worker, and no other client requests that use the Worker Pool will make progress.

变体示例:长时间运行的加密操作

¥Variation example: Long-running crypto operations

假设你的服务器使用 crypto.randomBytes() 生成加密安全的随机字节。crypto.randomBytes() 未分区:它会创建一个 randomBytes() 任务来生成你请求的字节数。如果你为某些用户创建较少的字节,而为其他用户创建较多的字节,则 crypto.randomBytes() 是任务长度变化的另一个来源。

¥Suppose your server generates cryptographically secure random bytes using crypto.randomBytes(). crypto.randomBytes() is not partitioned: it creates a single randomBytes() Task to generate as many bytes as you requested. If you create fewer bytes for some users and more bytes for others, crypto.randomBytes() is another source of variation in Task lengths.

任务分区

¥Task partitioning

具有可变时间成本的任务可能会损害工作池的吞吐量。为了最大限度地减少任务时间的变化,你应尽可能将每个任务划分为成本相当的子任务。当每个子任务完成时,它应该提交下一个子任务,当最后一个子任务完成时,它应该通知提交者。

¥Tasks with variable time costs can harm the throughput of the Worker Pool. To minimize variation in Task times, as far as possible you should partition each Task into comparable-cost sub-Tasks. When each sub-Task completes it should submit the next sub-Task, and when the final sub-Task completes it should notify the submitter.

继续上面的 fs.readFile() 示例:你应该改用 fs.read()(手动分区)或 ReadStream(自动分区)。

¥To continue the fs.readFile() example, you should instead use fs.read() (manual partitioning) or ReadStream (automatically partitioned).

同样的原则适用于 CPU 绑定任务;asyncAvg 示例可能不适合事件循环,但它非常适合工作池。

¥The same principle applies to CPU-bound tasks; the asyncAvg example might be inappropriate for the Event Loop, but it is well suited to the Worker Pool.

当你将任务划分为子任务时,较短的任务会扩展为少量子任务,而较长的任务会扩展为大量子任务。在较长任务的每个子任务之间,分配给它的 Worker 可以处理另一个较短任务的子任务,从而提高 Worker Pool 的整体任务吞吐量。

¥When you partition a Task into sub-Tasks, shorter Tasks expand into a small number of sub-Tasks, and longer Tasks expand into a larger number of sub-Tasks. Between each sub-Task of a longer Task, the Worker to which it was assigned can work on a sub-Task from another, shorter, Task, thus improving the overall Task throughput of the Worker Pool.

请注意,完成的子任务数不是衡量工作池吞吐量的有用指标。相反,请关注已完成的任务数。

¥Note that the number of sub-Tasks completed is not a useful metric for the throughput of the Worker Pool. Instead, concern yourself with the number of Tasks completed.

避免任务分区

¥Avoiding Task partitioning

回想一下,任务分区的目的是尽量减少任务时间的变化。如果你可以区分较短的任务和较长的任务(例如,对数组求和与对数组进行排序),则可以为每类任务创建一个工作池。将较短的任务和较长的任务路由到单独的工作池是最小化任务时间变化的另一种方法。

¥Recall that the purpose of Task partitioning is to minimize the variation in Task times. If you can distinguish between shorter Tasks and longer Tasks (e.g. summing an array vs. sorting an array), you could create one Worker Pool for each class of Task. Routing shorter Tasks and longer Tasks to separate Worker Pools is another way to minimize Task time variation.

支持这种方法的理由是:任务分区本身会产生开销(创建工作池任务表示和操作工作池队列的成本),避免分区可以省去额外往返工作池的成本,还可以防止你在划分任务时犯错误。

¥In favor of this approach, partitioning Tasks incurs overhead (the costs of creating a Worker Pool Task representation and of manipulating the Worker Pool queue), and avoiding partitioning saves you the costs of additional trips to the Worker Pool. It also keeps you from making mistakes in partitioning your Tasks.

这种方法的缺点是所有这些 Worker Pools 中的 Worker 都会产生空间和时间开销,并且会相互竞争 CPU 时间。请记住,每个 CPU 绑定任务仅在计划时才会取得进展。因此,你应该在仔细分析后才考虑这种方法。

¥The downside of this approach is that Workers in all of these Worker Pools will incur space and time overheads and will compete with each other for CPU time. Remember that each CPU-bound Task makes progress only while it is scheduled. As a result, you should only consider this approach after careful analysis.

工作池:结论

¥Worker Pool: conclusions

无论你只使用 Node.js 工作池还是维护单独的工作池,你都应该优化池的任务吞吐量。

¥Whether you use only the Node.js Worker Pool or maintain separate Worker Pool(s), you should optimize the Task throughput of your Pool(s).

为此,请使用任务分区将任务时间的变化最小化。

¥To do this, minimize the variation in Task times by using Task partitioning.

npm 模块的风险

¥The risks of npm modules

虽然 Node.js 核心模块为各种应用提供了构建块,但有时还需要更多的东西。Node.js 开发者从 npm 生态系统 中受益匪浅,数十万个模块提供的功能可加速你的开发过程。

¥While the Node.js core modules offer building blocks for a wide variety of applications, sometimes something more is needed. Node.js developers benefit tremendously from the npm ecosystem, with hundreds of thousands of modules offering functionality to accelerate your development process.

但是请记住,这些模块中的大多数都是由第三方开发者编写的,并且通常仅以尽力而为的保证发布。使用 npm 模块的开发者应该关注两件事,尽管后者经常被遗忘。

¥Remember, however, that the majority of these modules are written by third-party developers and are generally released with only best-effort guarantees. A developer using an npm module should be concerned about two things, though the latter is frequently forgotten.

  1. 它是否遵守其 API?

    ¥Does it honor its APIs?

  2. 它的 API 是否可能阻塞事件循环或某个 Worker?许多模块并不努力标明其 API 的成本,这对社区不利。

    ¥Might its APIs block the Event Loop or a Worker? Many modules make no effort to indicate the cost of their APIs, to the detriment of the community.

对于简单的 API,你可以估算 API 的成本;字符串操作的成本并不难理解。但在很多情况下,API 的成本并不清楚。

¥For simple APIs you can estimate the cost of the APIs; the cost of string manipulation isn't hard to fathom. But in many cases it's unclear how much an API might cost.

如果你正在调用可能执行昂贵操作的 API,请仔细检查成本。要求开发者记录它,或者自己检查源代码(并提交记录成本的 PR)。

¥If you are calling an API that might do something expensive, double-check the cost. Ask the developers to document it, or examine the source code yourself (and submit a PR documenting the cost).

请记住,即使 API 是异步的,你也不知道它在每个分区中会在 Worker 或事件循环上花费多少时间。例如,假设在上面给出的 asyncAvg 示例中,对辅助函数的每次调用都对一半的数字求和,而不是其中一个。那么这个函数仍然是异步的,但每个分区的成本将是 O(n) 而不是 O(1),这使得它在 n 为任意值时的使用远不那么安全。

¥Remember, even if the API is asynchronous, you don't know how much time it might spend on a Worker or on the Event Loop in each of its partitions. For example, suppose in the asyncAvg example given above, each call to the helper function summed half of the numbers rather than one of them. Then this function would still be asynchronous, but the cost of each partition would be O(n), not O(1), making it much less safe to use for arbitrary values of n.
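作为示意(asyncAvgHalves 是为说明而虚构的变体,并非指南前文的原始代码):下面的函数在每个分区中对剩余数字的一半求和。它仍然通过 setImmediate() 让出事件循环,但仅第一个分区就要做 O(n) 的工作。

¥As a sketch (asyncAvgHalves is a variant invented here for illustration, not the guide's original code): the function below sums half of the remaining numbers in each partition. It still yields to the Event Loop via setImmediate(), but its first partition alone does O(n) work:

```js
// Variant of the guide's asyncAvg: each partition sums half of the
// remaining numbers 1..n, so a partition costs O(n), not O(1).
// Being "asynchronous" alone does not make it safe for arbitrary n.
function asyncAvgHalves(n, avgCB) {
  function helper(i, sum) {
    if (i >= n) {
      avgCB(sum / n);
      return;
    }
    // Sum half of the remaining numbers in one synchronous burst:
    // O(n - i) work on the Event Loop before yielding.
    const end = Math.min(n, i + Math.ceil((n - i) / 2));
    for (let j = i + 1; j <= end; j++) {
      sum += j;
    }
    setImmediate(helper.bind(null, end, sum));
  }
  helper(0, 0);
}
```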

结论

¥Conclusion

Node.js 有两种类型的线程:一个事件循环和 k 工作器。事件循环负责 JavaScript 回调和非阻塞 I/O,而 Worker 执行与完成异步请求的 C++ 代码相对应的任务,包括阻塞 I/O 和 CPU 密集型工作。两种类型的线程一次只能处理一个活动。如果任何回调或任务需要很长时间,则运行它的线程将被阻塞。如果你的应用进行阻塞回调或任务,这可能会导致吞吐量(客户端/秒)下降,最坏的情况是完全拒绝服务。

¥Node.js has two types of threads: one Event Loop and k Workers. The Event Loop is responsible for JavaScript callbacks and non-blocking I/O, and a Worker executes tasks corresponding to C++ code that completes an asynchronous request, including blocking I/O and CPU-intensive work. Both types of threads work on no more than one activity at a time. If any callback or task takes a long time, the thread running it becomes blocked. If your application makes blocking callbacks or tasks, this can lead to degraded throughput (clients/second) at best, and complete denial of service at worst.

要编写高吞吐量、更能抵御 DoS 的 Web 服务器,你必须确保无论输入是良性的还是恶意的,你的事件循环和工作器都不会被阻塞。

¥To write a high-throughput, more DoS-proof web server, you must ensure that on benign and on malicious input, neither your Event Loop nor your Workers will block.