WHATWG API


WHATWG 网址标准使用比旧版 API 使用的方法更具选择性和细粒度的方法来选择编码字符。

WHATWG 算法定义了四个“百分比编码集”,用于描述必须进行百分比编码的字符范围:

  • C0 控制百分比编码集,包括 U+0000 到 U+001F(含)范围内的代码点和所有大于 U+007E 的代码点。

  • 片段百分比编码集,包括 C0 控制百分比编码集和代码点 U+0020、U+0022、U+003C、U+003E 和 U+0060。

  • 路径百分比编码集,包括 C0 控制百分比编码集和代码点 U+0020、U+0022、U+0023、U+003C、U+003E、U+003F、U+0060、U +007B 和 U+007D。

  • userinfo 编码集,包括路径百分比编码集和代码点 U+002F、U+003A、U+003B、U+003D、U+0040、U+005B、U+005C、U+005D、 U+005E 和 U+007C。

userinfo 百分比编码集专门用于网址中编码的用户名和密码。 路径百分比编码集用于大多数网址的路径。 片段百分比编码集用于网址片段。 除了所有其他情况外,C0 控制百分比编码集用于某些特定条件下的主机和路径。

当主机名中出现非 ASCII 字符时,主机名将使用 Punycode 算法进行编码。 但是请注意,主机名可能包含 Punycode 编码和百分比编码的字符:

const myURL = new URL('https://%CF%80.example.com/foo');
console.log(myURL.href);
// 打印 https://xn--1xa.example.com/foo
console.log(myURL.origin);
// 打印 https://xn--1xa.example.com

The WHATWG URL Standard uses a more selective and fine grained approach to selecting encoded characters than that used by the Legacy API.

The WHATWG algorithm defines four "percent-encode sets" that describe ranges of characters that must be percent-encoded:

  • The C0 control percent-encode set includes code points in range U+0000 to U+001F (inclusive) and all code points greater than U+007E.

  • The fragment percent-encode set includes the C0 control percent-encode set and code points U+0020, U+0022, U+003C, U+003E, and U+0060.

  • The path percent-encode set includes the C0 control percent-encode set and code points U+0020, U+0022, U+0023, U+003C, U+003E, U+003F, U+0060, U+007B, and U+007D.

  • The userinfo encode set includes the path percent-encode set and code points U+002F, U+003A, U+003B, U+003D, U+0040, U+005B, U+005C, U+005D, U+005E, and U+007C.

The userinfo percent-encode set is used exclusively for username and passwords encoded within the URL. The path percent-encode set is used for the path of most URLs. The fragment percent-encode set is used for URL fragments. The C0 control percent-encode set is used for host and path under certain specific conditions, in addition to all other cases.

When non-ASCII characters appear within a host name, the host name is encoded using the Punycode algorithm. Note, however, that a host name may contain both Punycode encoded and percent-encoded characters:

const myURL = new URL('https://%CF%80.example.com/foo');
console.log(myURL.href);
// Prints https://xn--1xa.example.com/foo
console.log(myURL.origin);
// Prints https://xn--1xa.example.com