缓冲区与字符编码


当在 Buffer 和字符串之间进行转换时,可以指定字符编码。 如果未指定字符编码,则默认使用 UTF-8。

const buf = Buffer.from('hello world', 'utf8');

console.log(buf.toString('hex'));
// 打印: 68656c6c6f20776f726c64
console.log(buf.toString('base64'));
// 打印: aGVsbG8gd29ybGQ=

console.log(Buffer.from('fhqwhgads', 'utf8'));
// 打印: <Buffer 66 68 71 77 68 67 61 64 73>
console.log(Buffer.from('fhqwhgads', 'utf16le'));
// 打印: <Buffer 66 00 68 00 71 00 77 00 68 00 67 00 61 00 64 00 73 00>

Node.js 目前支持的字符编码如下:

  • 'utf8': 多字节编码的 Unicode 字符。 许多网页和其他文档格式使用 UTF-8。 这是默认的字符编码。 当将 Buffer 解码为不完全包含有效 UTF-8 数据的字符串时,则 Unicode 替换字符 U+FFFD � 将用于表示这些错误。

  • 'utf16le': 多字节编码的 Unicode 字符。 与 'utf8' 不同,字符串中的每个字符都将使用 2 或 4 个字节进行编码。 Node.js 仅支持 UTF-16小端序变体。

  • 'latin1': Latin-1 代表 ISO-8859-1。 此字符编码仅支持 U+0000U+00FF 的 Unicode 字符。 每个字符都使用单个字节进行编码。 不符合该范围的字符将被截断并映射到该范围内的字符。

使用以上编码之一将 Buffer 转换为字符串称为解码,将字符串转换为 Buffer 称为编码。

Node.js 还支持以下二进制转文本的编码。 对于二进制转文本的编码,命名约定是相反的:将 Buffer 转换为字符串通常称为编码,将字符串转换为 Buffer 通常称为解码。

  • 'base64': Base64 编码。 当从字符串创建 Buffer 时,此编码还将正确接受 RFC 4648,第 5 节中指定的 "URL 和文件名安全字母表"。 base64 编码的字符串中包含的空白字符(例如空格、制表符和换行符)会被忽略。

  • 'base64url': base64url 编码如 RFC 4648 第 5 节中指定。 当从字符串创建 Buffer 时,此编码也将正确接受常规的 base64 编码的字符串。 当将 Buffer 编码为字符串时,此编码将忽略填充。

  • 'hex': 将每个字节编码为两个十六进制字符。 当解码仅包含有效十六进制字符的字符串时,可能会发生数据截断。 请参阅下面的示例。

还支持以下旧版字符编码:

  • 'ascii': 仅适用于 7 位 ASCII 数据。 当将字符串编码为 Buffer 时,这等效于使用 'latin1'。 当将 Buffer 解码为字符串时,使用此编码将在解码为 'latin1' 之前额外取消设置每个字节的最高位。 通常,没有理由使用此编码,因为在编码或解码纯 ASCII 文本时,'utf8'(或者,如果已知数据始终是纯 ASCII,则为 'latin1')将是更好的选择。 它仅用于旧版兼容性。

  • 'binary': 'latin1' 的别名。 有关此主题的更多背景信息,请参阅二进制字符串。 此编码的名称很容易让人误解,因为这里列出的所有编码都在字符串和二进制数据之间进行转换。 对于字符串和 Buffer 之间的转换,通常 'utf-8' 是正确的选择。

  • 'ucs2': 'utf16le' 的别名。 UCS-2 过去指的是 UTF-16 的一种变体,它不支持代码点大于 U+FFFF 的字符。 在 Node.js 中,始终支持这些代码点。

Buffer.from('1ag', 'hex');
// 打印 <Buffer 1a>,当遇到第一个非十六进制值 ('g') 时,则截断数据。

Buffer.from('1a7g', 'hex');
// 打印 <Buffer 1a>,当数据以一位数 ('7') 结尾时,则截断数据。

Buffer.from('1634', 'hex');
// 打印 <Buffer 16 34>,表现所有数据。

现代 Web 浏览器遵循 WHATWG 编码标准,其将 'latin1''ISO-8859-1' 别名为 'win-1252'。 这意味着在执行 http.get() 之类的操作时,如果返回的字符集是 WHATWG 规范中列出的字符集之一,则服务器实际上可能返回 'win-1252' 编码的数据,使用 'latin1' 编码可能会错误地解码字符。

When converting between Buffers and strings, a character encoding may be specified. If no character encoding is specified, UTF-8 will be used as the default.

const buf = Buffer.from('hello world', 'utf8');

console.log(buf.toString('hex'));
// Prints: 68656c6c6f20776f726c64
console.log(buf.toString('base64'));
// Prints: aGVsbG8gd29ybGQ=

console.log(Buffer.from('fhqwhgads', 'utf8'));
// Prints: <Buffer 66 68 71 77 68 67 61 64 73>
console.log(Buffer.from('fhqwhgads', 'utf16le'));
// Prints: <Buffer 66 00 68 00 71 00 77 00 68 00 67 00 61 00 64 00 73 00>

The character encodings currently supported by Node.js are the following:

  • 'utf8': Multi-byte encoded Unicode characters. Many web pages and other document formats use UTF-8. This is the default character encoding. When decoding a Buffer into a string that does not exclusively contain valid UTF-8 data, the Unicode replacement character U+FFFD � will be used to represent those errors.

  • 'utf16le': Multi-byte encoded Unicode characters. Unlike 'utf8', each character in the string will be encoded using either 2 or 4 bytes. Node.js only supports the little-endian variant of UTF-16.

  • 'latin1': Latin-1 stands for ISO-8859-1. This character encoding only supports the Unicode characters from U+0000 to U+00FF. Each character is encoded using a single byte. Characters that do not fit into that range are truncated and will be mapped to characters in that range.

Converting a Buffer into a string using one of the above is referred to as decoding, and converting a string into a Buffer is referred to as encoding.

Node.js also supports the following binary-to-text encodings. For binary-to-text encodings, the naming convention is reversed: Converting a Buffer into a string is typically referred to as encoding, and converting a string into a Buffer as decoding.

  • 'base64': Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC 4648, Section 5. Whitespace characters such as spaces, tabs, and new lines contained within the base64-encoded string are ignored.

  • 'base64url': base64url encoding as specified in RFC 4648, Section 5. When creating a Buffer from a string, this encoding will also correctly accept regular base64-encoded strings. When encoding a Buffer to a string, this encoding will omit padding.

  • 'hex': Encode each byte as two hexadecimal characters. Data truncation may occur when decoding strings that do exclusively contain valid hexadecimal characters. See below for an example.

The following legacy character encodings are also supported:

  • 'ascii': For 7-bit ASCII data only. When encoding a string into a Buffer, this is equivalent to using 'latin1'. When decoding a Buffer into a string, using this encoding will additionally unset the highest bit of each byte before decoding as 'latin1'. Generally, there should be no reason to use this encoding, as 'utf8' (or, if the data is known to always be ASCII-only, 'latin1') will be a better choice when encoding or decoding ASCII-only text. It is only provided for legacy compatibility.

  • 'binary': Alias for 'latin1'. See binary strings for more background on this topic. The name of this encoding can be very misleading, as all of the encodings listed here convert between strings and binary data. For converting between strings and Buffers, typically 'utf-8' is the right choice.

  • 'ucs2': Alias of 'utf16le'. UCS-2 used to refer to a variant of UTF-16 that did not support characters that had code points larger than U+FFFF. In Node.js, these code points are always supported.

Buffer.from('1ag', 'hex');
// Prints <Buffer 1a>, data truncated when first non-hexadecimal value
// ('g') encountered.

Buffer.from('1a7g', 'hex');
// Prints <Buffer 1a>, data truncated when data ends in single digit ('7').

Buffer.from('1634', 'hex');
// Prints <Buffer 16 34>, all data represented.

Modern Web browsers follow the WHATWG Encoding Standard which aliases both 'latin1' and 'ISO-8859-1' to 'win-1252'. This means that while doing something like http.get(), if the returned charset is one of those listed in the WHATWG specification it is possible that the server actually returned 'win-1252'-encoded data, and using 'latin1' encoding may incorrectly decode the characters.