Buffer 与字符编码


当字符串数据被存储入 Buffer 实例或从 Buffer 实例中被提取时,可以指定一个字符编码。

const buf = Buffer.from('hello world', 'ascii');

console.log(buf.toString('hex'));
// 打印: 68656c6c6f20776f726c64
console.log(buf.toString('base64'));
// 打印: aGVsbG8gd29ybGQ=

console.log(Buffer.from('fhqwhgads', 'ascii'));
// 打印: <Buffer 66 68 71 77 68 67 61 64 73>
console.log(Buffer.from('fhqwhgads', 'utf16le'));
// 打印: <Buffer 66 00 68 00 71 00 77 00 68 00 67 00 61 00 64 00 73 00>

Node.js 当前支持的字符编码有:

  • 'ascii' - 仅适用于 7 位 ASCII 数据。此编码速度很快,如果设置则会剥离高位。

  • 'utf8' - 多字节编码的 Unicode 字符。许多网页和其他文档格式都使用 UTF-8。

  • 'utf16le' - 2 或 4 个字节,小端序编码的 Unicode 字符。支持代理对(U+10000 至 U+10FFFF)。

  • 'ucs2' - 'utf16le' 的别名。

  • 'base64' - Base64 编码。当从字符串创建 Buffer 时,此编码也会正确地接受 RFC4648 第 5 节中指定的 “URL 和文件名安全字母”。

  • 'latin1' - 一种将 Buffer 编码成单字节编码字符串的方法(由 RFC1345 中的 IANA 定义,第 63 页,作为 Latin-1 的补充块和 C0/C1 控制码)。

  • 'binary' - 'latin1' 的别名。

  • 'hex' - 将每个字节编码成两个十六进制的字符。

现代的 Web 浏览器遵循 WHATWG 编码标准,将 'latin1''ISO-8859-1' 别名为 'win-1252'。 这意味着当执行 http.get() 之类的操作时,如果返回的字符集是 WHATWG 规范中列出的字符集之一,则服务器可能实际返回 'win-1252' 编码的数据,而使用 'latin1' 编码可能错误地解码字符。

When string data is stored in or extracted out of a Buffer instance, a character encoding may be specified.

const buf = Buffer.from('hello world', 'ascii');

console.log(buf.toString('hex'));
// Prints: 68656c6c6f20776f726c64
console.log(buf.toString('base64'));
// Prints: aGVsbG8gd29ybGQ=

console.log(Buffer.from('fhqwhgads', 'ascii'));
// Prints: <Buffer 66 68 71 77 68 67 61 64 73>
console.log(Buffer.from('fhqwhgads', 'utf16le'));
// Prints: <Buffer 66 00 68 00 71 00 77 00 68 00 67 00 61 00 64 00 73 00>

The character encodings currently supported by Node.js include:

  • 'ascii' - For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.

  • 'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.

  • 'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.

  • 'ucs2' - Alias of 'utf16le'.

  • 'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.

  • 'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).

  • 'binary' - Alias for 'latin1'.

  • 'hex' - Encode each byte as two hexadecimal characters.

Modern Web browsers follow the WHATWG Encoding Standard which aliases both 'latin1' and 'ISO-8859-1' to 'win-1252'. This means that while doing something like http.get(), if the returned charset is one of those listed in the WHATWG specification it is possible that the server actually returned 'win-1252'-encoded data, and using 'latin1' encoding may incorrectly decode the characters.