How to Work with Different Filesystems

Node.js exposes many features of the filesystem. But not all filesystems are alike. The following are suggested best practices to keep your code simple and safe when working with different filesystems.

Filesystem Behavior

Before you can work with a filesystem, you need to know how it behaves. Different filesystems behave differently and have more or fewer features than others: case sensitivity, case insensitivity, case preservation, Unicode form preservation, timestamp resolution, extended attributes, inodes, Unix permissions, alternate data streams, etc.

Be wary of inferring filesystem behavior from process.platform. For example, do not assume that because your program is running on Darwin you are therefore working on a case-insensitive filesystem (HFS+), as the user may be using a case-sensitive filesystem (HFSX). Similarly, do not assume that because your program is running on Linux you are therefore working on a filesystem which supports Unix permissions and inodes, as you may be on a particular external drive, USB or network drive which does not.

The operating system may not make it easy to infer filesystem behavior, but all is not lost. Instead of keeping a list of every known filesystem and behavior (which is always going to be incomplete), you can probe the filesystem to see how it actually behaves. The presence or absence of certain features which are easy to probe is often enough to infer the behavior of other features which are more difficult to probe.
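
As an illustration of probing, here is a minimal sketch that checks whether a given directory sits on a case-sensitive filesystem. The helper name isCaseSensitive and the probe file names are hypothetical, and the sketch assumes the program is allowed to create and delete a temporary directory at that location.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of Node.js): probes the filesystem that
// holds `dir` instead of guessing from process.platform.
function isCaseSensitive(dir) {
  // Probe inside a fresh temporary directory so no user file is touched.
  const probeDir = fs.mkdtempSync(path.join(dir, 'fs-probe-'));
  try {
    fs.writeFileSync(path.join(probeDir, 'probe-abc'), '');
    // On a case-insensitive filesystem the uppercase spelling resolves
    // to the file that was just created.
    return !fs.existsSync(path.join(probeDir, 'PROBE-ABC'));
  } finally {
    fs.rmSync(probeDir, { recursive: true, force: true });
  }
}

console.log(isCaseSensitive(os.tmpdir()));
```

Because different filesystems may be mounted at different paths, a probe like this only describes the directory it was given, not the whole machine.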

Remember that some users may have different filesystems mounted at various paths in the working tree.

Avoid a Lowest Common Denominator Approach

You might be tempted to make your program act like a lowest common denominator filesystem, by normalizing all filenames to uppercase, normalizing all filenames to NFC Unicode form, and normalizing all file timestamps to, say, 1-second resolution. This would be the lowest common denominator approach.

Do not do this. You would only be able to interact safely with a filesystem which has the exact same lowest common denominator characteristics in every respect. You would be unable to work with more advanced filesystems in the way that users expect, and you would run into filename or timestamp collisions. You would almost certainly lose and corrupt user data through a series of complicated dependent events, and you would create bugs that would be difficult if not impossible to solve.

What happens when you later need to support a filesystem that only has 2-second or 24-hour timestamp resolution? What happens when the Unicode standard advances to include a slightly different normalization algorithm (as has happened in the past)?

A lowest common denominator approach would tend to try to create a portable program by using only "portable" system calls. This leads to programs that are leaky and not in fact portable.

Adopt a Superset Approach

Make the best use of each platform you support by adopting a superset approach. For example, a portable backup program should sync btimes (the created time of a file or folder) correctly between Windows systems, and should not destroy or alter btimes, even though btimes are not supported on Linux systems. The same portable backup program should sync Unix permissions correctly between Linux systems, and should not destroy or alter Unix permissions, even though Unix permissions are not supported on Windows systems.

Handle different filesystems by making your program act like a more advanced filesystem. Support a superset of all possible features: case-sensitivity, case-preservation, Unicode form sensitivity, Unicode form preservation, Unix permissions, high-resolution nanosecond timestamps, extended attributes, etc.

Once you have case-preservation in your program, you can always implement case-insensitivity if you need to interact with a case-insensitive filesystem. But if you forgo case-preservation in your program, you cannot interact safely with a case-preserving filesystem. The same is true for Unicode form preservation and timestamp resolution preservation.

If a filesystem provides you with a filename in a mix of lowercase and uppercase, then keep the filename in the exact case given. If a filesystem provides you with a filename in mixed Unicode form or NFC or NFD (or NFKC or NFKD), then keep the filename in the exact byte sequence given. If a filesystem provides you with a millisecond timestamp, then keep the timestamp in millisecond resolution.

When you work with a lesser filesystem, you can always downsample appropriately, with comparison functions as required by the behavior of the filesystem on which your program is running. If you know that the filesystem does not support Unix permissions, then you should not expect to read the same Unix permissions you write. If you know that the filesystem does not preserve case, then you should be prepared to see ABC in a directory listing when your program creates abc. But if you know that the filesystem does preserve case, then you should consider ABC to be a different filename from abc when detecting file renames, or whenever the filesystem is case-sensitive.

Case Preservation

You may create a directory called test/abc and be surprised to see sometimes that fs.readdir('test') returns ['ABC']. This is not a bug in Node.js. Node.js returns the filename as the filesystem stores it, and not all filesystems support case-preservation. Some filesystems convert all filenames to uppercase (or lowercase).
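
Case preservation can be probed in the same spirit: create an entry with a mixed-case name and read the directory back. The following is only a sketch; the helper name preservesCase and the probe file name are hypothetical, and it assumes the program may create a temporary directory under the path being probed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: reports whether the filesystem holding `dir`
// hands back a filename in exactly the case it was created with.
function preservesCase(dir) {
  const probeDir = fs.mkdtempSync(path.join(dir, 'fs-probe-'));
  try {
    const name = 'MixedCase-Probe.txt';
    fs.writeFileSync(path.join(probeDir, name), '');
    // readdirSync returns names as the filesystem stores them.
    return fs.readdirSync(probeDir).includes(name);
  } finally {
    fs.rmSync(probeDir, { recursive: true, force: true });
  }
}

console.log(preservesCase(os.tmpdir()));
```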

Unicode Form Preservation

Case preservation and Unicode form preservation are similar concepts. To understand why Unicode form should be preserved, make sure that you first understand why case should be preserved. Unicode form preservation is just as simple when understood correctly.

Unicode can encode the same characters using several different byte sequences. Several strings may look the same, but have different byte sequences. When working with UTF-8 strings, be careful that your expectations are in line with how Unicode works. Just as you would not expect all UTF-8 characters to encode to a single byte, you should not expect several UTF-8 strings that look the same to the human eye to have the same byte representation. This may be an expectation that you can have of ASCII, but not of UTF-8.

You may create a directory called test/café (NFC Unicode form with the 5-byte UTF-8 sequence <63 61 66 c3 a9> and string.length === 4) and be surprised to see sometimes that fs.readdir('test') returns ['café'] (NFD Unicode form with the 6-byte UTF-8 sequence <63 61 66 65 cc 81> and string.length === 5). This is not a bug in Node.js. Node.js returns the filename as the filesystem stores it, and not all filesystems support Unicode form preservation.
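
The two spellings of café can be inspected directly, without touching the filesystem. This short sketch builds both forms from escape sequences (the variable names are only illustrative) and prints their UTF-8 bytes. Note that string.length counts UTF-16 code units, so it reports 4 and 5 here, while the UTF-8 byte lengths are 5 and 6.

```javascript
// The same visible text in two Unicode forms.
const nfc = 'caf\u00e9';  // precomposed U+00E9
const nfd = 'cafe\u0301'; // 'e' followed by combining acute accent U+0301

console.log(Buffer.from(nfc, 'utf8').toString('hex')); // 636166c3a9
console.log(Buffer.from(nfd, 'utf8').toString('hex')); // 63616665cc81

console.log(Buffer.byteLength(nfc, 'utf8'), nfc.length); // 5 4
console.log(Buffer.byteLength(nfd, 'utf8'), nfd.length); // 6 5

// The strings differ as data, even though they render identically.
console.log(nfc === nfd); // false
```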

HFS+, for example, will normalize all filenames to a form almost always the same as NFD form. Do not expect HFS+ to behave the same as NTFS or EXT4 and vice versa. Do not try to change data permanently through normalization as a leaky abstraction to paper over Unicode differences between filesystems. This would create problems without solving any. Rather, preserve Unicode form and use normalization as a comparison function only.

Unicode Form Insensitivity

Unicode form insensitivity and Unicode form preservation are two different filesystem behaviors often mistaken for each other. Just as case-insensitivity has sometimes been incorrectly implemented by permanently normalizing filenames to uppercase when storing and transmitting filenames, so Unicode form insensitivity has sometimes been incorrectly implemented by permanently normalizing filenames to a certain Unicode form (NFD in the case of HFS+) when storing and transmitting filenames. It is possible and much better to implement Unicode form insensitivity without sacrificing Unicode form preservation, by using Unicode normalization for comparison only.

Comparing Different Unicode Forms

Node.js provides string.normalize('NFC') and string.normalize('NFD'), which you can use to normalize a string to either NFC or NFD. You should never store the output from this function but only use it as part of a comparison function to test whether two strings would look the same to the user.

You can use string1.normalize('NFC') === string2.normalize('NFC') or string1.normalize('NFD') === string2.normalize('NFD') as your comparison function. Which form you use does not matter, as long as both sides use the same one.

Normalization is fast but you may want to use a cache as input to your comparison function to avoid normalizing the same string many times over. If the string is not present in the cache then normalize it and cache it. Be careful not to store or persist the cache; use it only as a cache.
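
A cached comparison function along these lines might look like the sketch below. The names normalizedCache, normalizeCached and looksTheSame are hypothetical, the cache is an in-memory Map that is never written to disk, and the sketch assumes a Node.js build whose normalize() actually performs normalization.

```javascript
// In-memory only: maps an original filename to its NFC form.
// The normalized value is used for comparison and never stored as data.
const normalizedCache = new Map();

function normalizeCached(name) {
  let normalized = normalizedCache.get(name);
  if (normalized === undefined) {
    normalized = name.normalize('NFC');
    normalizedCache.set(name, normalized);
  }
  return normalized;
}

// Unicode form insensitive comparison; the inputs are left untouched.
function looksTheSame(a, b) {
  return normalizeCached(a) === normalizeCached(b);
}

console.log(looksTheSame('caf\u00e9', 'cafe\u0301')); // true
console.log(looksTheSame('abc', 'ABC'));              // false
```

The original strings stay available to the caller in their original form, so whatever is written back to the filesystem keeps the byte sequence it arrived with.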

Note that using normalize() requires that your version of Node.js include ICU (otherwise normalize() will just return the original string). If you download the latest version of Node.js from the website then it will include ICU.

Timestamp Resolution

You may set the mtime (the modified time) of a file to 1444291759414 (millisecond resolution) and be surprised to see sometimes that fs.stat returns the new mtime as 1444291759000 (1-second resolution) or 1444291758000 (2-second resolution). This is not a bug in Node.js. Node.js returns the timestamp as the filesystem stores it, and not all filesystems support nanosecond, millisecond or 1-second timestamp resolution. Some filesystems even have very coarse resolution for the atime timestamp in particular, e.g. 24 hours for some FAT filesystems.
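
To see what a particular filesystem does with such a timestamp, you can write it and read it back. The sketch below is one way to do that; the helper name probeMtime is hypothetical, and it assumes the program may create a temporary file under the directory being probed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: sets a millisecond-precision mtime on a probe file
// and reports what the filesystem holding `dir` actually stored.
function probeMtime(dir) {
  const probeDir = fs.mkdtempSync(path.join(dir, 'fs-probe-'));
  try {
    const file = path.join(probeDir, 'probe');
    fs.writeFileSync(file, '');
    const wantedMs = 1444291759414;
    const wanted = new Date(wantedMs);
    fs.utimesSync(file, wanted, wanted); // sets both atime and mtime
    const storedMs = fs.statSync(file).mtimeMs;
    return { wantedMs, storedMs };
  } finally {
    fs.rmSync(probeDir, { recursive: true, force: true });
  }
}

const { wantedMs, storedMs } = probeMtime(os.tmpdir());
console.log('wanted:', wantedMs, 'stored:', storedMs);
```

A difference of several hundred milliseconds or more between the two values suggests a coarse-resolution filesystem. A sub-millisecond difference can simply be floating-point rounding and should not be read as lost resolution.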

Do Not Corrupt Filenames and Timestamps Through Normalization

Filenames and timestamps are user data. Just as you would never automatically rewrite user file data to uppercase the data or normalize CRLF to LF line-endings, so you should never change, interfere with or corrupt filenames or timestamps through case / Unicode form / timestamp normalization. Normalization should only ever be used for comparison, never for altering data.

Normalization is effectively a lossy hash code. You can use it to test for certain kinds of equivalence (e.g. do several strings look the same even though they have different byte sequences) but you can never use it as a substitute for the actual data. Your program should pass on filename and timestamp data as is.

Your program can create new data in NFC (or in any combination of Unicode forms it prefers) or with a lowercase or uppercase filename, or with a 2-second resolution timestamp, but your program should not corrupt existing user data by imposing case / Unicode form / timestamp normalization. Rather, adopt a superset approach and preserve case, Unicode form and timestamp resolution in your program. That way, you will be able to interact safely with filesystems which do the same.

Use Normalization Comparison Functions Appropriately

Make sure that you use case / Unicode form / timestamp comparison functions appropriately. Do not use a case-insensitive filename comparison function if you are working on a case-sensitive filesystem. Do not use a Unicode form insensitive comparison function if you are working on a Unicode form sensitive filesystem (e.g. NTFS and most Linux filesystems which preserve both NFC and NFD or mixed Unicode forms). Do not compare timestamps at 2-second resolution if you are working on a nanosecond timestamp resolution filesystem.
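
One way to keep comparisons matched to the filesystem is to build them from the behavior you have detected. The sketch below assumes you already know (for instance from probing) whether the filesystem is case-sensitive and Unicode form sensitive, and what its timestamp resolution is. The names makeNameComparator and sameTimestamp are hypothetical, and the toUpperCase() fold is only an approximation of a real filesystem's case table.

```javascript
// Builds a filename equality function from detected filesystem behavior.
// Normalization and case folding are applied to local copies only.
function makeNameComparator({ caseSensitive, unicodeFormSensitive }) {
  const fold = (name) => {
    let key = name;
    if (!unicodeFormSensitive) key = key.normalize('NFC');
    if (!caseSensitive) key = key.toUpperCase();
    return key;
  };
  return (a, b) => fold(a) === fold(b);
}

// Compares two millisecond timestamps at the filesystem's own resolution.
function sameTimestamp(aMs, bMs, resolutionMs) {
  return Math.floor(aMs / resolutionMs) === Math.floor(bMs / resolutionMs);
}

const strict = makeNameComparator({ caseSensitive: true, unicodeFormSensitive: true });
const loose = makeNameComparator({ caseSensitive: false, unicodeFormSensitive: false });

console.log(strict('abc', 'ABC'));            // false
console.log(loose('abc', 'ABC'));             // true
console.log(loose('caf\u00e9', 'CAFE\u0301')); // true

console.log(sameTimestamp(1444291759414, 1444291758000, 2000)); // true
console.log(sameTimestamp(1444291759414, 1444291759000, 1));    // false
```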

Be Prepared for Slight Differences in Comparison Functions

Be careful that your comparison functions match those of the filesystem (or probe the filesystem if possible to see how it would actually compare). Case-insensitivity, for example, is more complex than a simple toLowerCase() comparison. In fact, toUpperCase() is usually better than toLowerCase() (since it handles certain foreign language characters differently). But better still would be to probe the filesystem, since every filesystem has its own case comparison table baked in.
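
The two folds really do disagree on some inputs, which is easy to check in plain JavaScript. The helper names upperEq and lowerEq below are only illustrative; the point of the sketch is that neither fold is guaranteed to match a given filesystem's own table.

```javascript
const upperEq = (a, b) => a.toUpperCase() === b.toUpperCase();
const lowerEq = (a, b) => a.toLowerCase() === b.toLowerCase();

// German sharp s (U+00DF) uppercases to 'SS', but 'SS' lowercases to 'ss'.
console.log(upperEq('stra\u00dfe', 'STRASSE')); // true
console.log(lowerEq('stra\u00dfe', 'STRASSE')); // false

// The Kelvin sign (U+212A) lowercases to 'k' but has no uppercase mapping.
console.log(upperEq('\u212a', 'K')); // false
console.log(lowerEq('\u212a', 'K')); // true
```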

As an example, Apple's HFS+ normalizes filenames to NFD form but this NFD form is actually an older version of the current NFD form and may sometimes be slightly different from the latest Unicode standard's NFD form. Do not expect HFS+ NFD to be exactly the same as Unicode NFD all the time.