Seek freedom and become captive of your desires, seek discipline and find your liberty. - Frank Herbert, Dune
"Negative freedom is freedom from constraint, that is, permission to do things; Positive freedom is empowerment, that is, ability to do things... Negative and positive freedoms, it might seem, are two different descriptions of the same thing. No! Life is not so simple. There is reason to think that constraints (prohibitions, if you like) can actually help people to do things better. Constraints can enhance ability; in other words, less negative freedom can mean more positive freedom." - Angus Sibley, "Two Kinds of Freedom"
Traditionally, Unix/Linux/POSIX filenames can be almost any sequence of bytes, and their meaning is unassigned. The only real rules are that "/" is always the directory separator, and that filenames can't contain byte 0 (because this byte is the string terminator). Although this is flexible, it creates many unnecessary problems. In particular, this lack of limitations makes it unnecessarily difficult to write correct programs (enabling many security flaws), makes it impossible to consistently and accurately display filenames, and confuses users.
This article will try to convince you that adding some limitations on legal Unix/Linux/POSIX filenames would be an improvement. Many programs already presume these limitations, the POSIX standard already permits such limitations, and many Unix/Linux filesystems already embed such limitations - so it'd be better to make these (reasonable) assumptions true in the first place. This article will discuss, in particular, the problems of control characters in filenames, leading dashes in filenames, the lack of a standard encoding scheme (vs. UTF-8), and special metacharacters in filenames. Spaces in filenames are probably hopeless in general, but resolving some of the other issues will simplify their handling too. This article will then briefly discuss some methods for solving this long-term, though that's not easy - if I've convinced you that this needs improving, I'd like your help figuring out how to do it!
Imagine that you don't know Unix/Linux/POSIX (I presume you really do), and that you're trying to do some simple things with it. For example, let's try to print out the contents of all files in the current directory, sending them to a file in the parent directory:
cat * > ../collection
The list doesn't include "hidden" items (filenames beginning with "."), but often that's what you want anyway, so that's not unreasonable. The problem is that although this seems to work, filenames could begin with "-" (e.g., "-n"). So if there's a file named "-n", and you're using GNU cat, all of a sudden your output will be numbered! Oops; that means on every command we have to disable option processing; for most commands that means using "--" everywhere, except not all commands support "--" (ugh!). Many people know that prefixing the filename or glob with "./" can resolve this, but not all programs do such prefixing, so when you're writing a new program you can't guarantee that the caller has prefixed it for you.
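For example, either of these avoids the option-injection (where supported):
cat -- * > ../collection   # "--" ends option processing, but not all commands support it
cat ./* > ../collection    # "./" prefix: a file "-n" becomes "./-n", no longer option-like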
The "cat *" command will complain if there are directories; in theory, we could just replace the "*" with something that computes the list of file names (which will also include the hidden files):
cat -- `find . -type f` > ../collection
Whups, that fails too! Filenames can include spaces, which causes splitting. Advanced users can fiddle with IFS, but a simpler approach is to use a "while" loop or xargs:
( find . -type f | while read -r filename ; do cat "$filename" ; done ) > ../collection
# or:
( find . -type f | xargs -d "\n" cat ) > ../collection
Whups, these don't work either - these create a list of filenames separated by newlines, but filenames can include newlines too! This is hard to handle portably; if you use GNU find and xargs, you can use GNU extensions to separate filenames with \0:
( find . -type f -print0 | xargs -0 cat ) > ../collection
But this convention is supported by only a few tools, they tend to be non-standard (non-portable) extensions, the option names to use this convention are jarringly inconsistent (GNU has sort -z, find -print0, and xargs -0), and this format is difficult to view and modify (because so few tools support it). Which is silly; processing lines of text files is well-supported, and filenames are commonly-handled, but you can't simply have filenames separated by newlines?!?
Ugh - lots of annoying problems, caused not because we don't have enough flexibility, but because we have too much. You can't use newline and tab to delimit filenames, because filenames could include them! BashFAQ discusses handling newlines, as do many other documents about shell scripts.
In a well-designed system, simple things should be simple, and the "obvious easy" way to do something should be the right way. I call this goal "no sharp edges" - to use an analogy, if you're designing a wrench, don't put razor blades on the handles. The current POSIX filesystem fails this test - it does have sharp edges. Because it's hard to do things the "right" way, many Unix/Linux programs simply assume that "filenames are reasonable", even though the system doesn't guarantee that this is true. This leads to programs with errors that aren't immediately obvious. In some cases, these errors can even be security vulnerabilities (see CWE 78, CWE 73, and CWE 116, all of which are in the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors, as well as the "Secure Programming..." text on filenames). It would be better if the system actually did guarantee that filenames were reasonable; then already-written programs would be correct.
The problem is so bad that there are programs like detox and Glindra that try to fix "bad" filenames. But the real problem is that they were allowed in the first place - cleaning them later is a second-best approach.
Lots of programs fail to handle "bad" filenames, such as filenames with newlines in them, because it's harder to write programs that handle such filenames correctly. In several cases, developers have specifically stated that there's no point in supporting such filenames at all!
Failure to handle "bad" filenames can lead to mysterious failures and even security problems... but only if they can happen at all. If "bad" filenames can't occur, the problems they cause go away too!
The POSIX standard defines what a "portable filename" is; this definition implies that many filenames are not portable and thus do not need to be supported by POSIX systems. The POSIX.1-2008 specification is simultaneously released as both The Open Group's Base Specifications Issue 7 and IEEE Std 1003.1(TM)-2008; I'll emphasize the Open Group's version, since it is available at no charge via the Internet (good job!!). Its "base definitions" document section 4.7 ("Filename Portability") says:
For a filename to be portable across implementations conforming to POSIX.1-2008, it shall consist only of the portable filename character set as defined in Portable Filename Character Set. Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments.
I then examined the Portable Filename Character Set, defined in 3.276 ("Portable Filename Character Set"); this turns out to be just A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen>. So it's perfectly okay for a POSIX system to reject a non-portable filename.
Indeed, existing POSIX systems already reject some filenames. A common reason is that many POSIX systems mount local or remote filesystems that have additional rules, e.g., filesystems designed for Microsoft Windows. Wikipedia's entry on Filenames reports on these rules in more detail. For example, the Microsoft Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) in filenames, so any such filenames can't be shared with Windows users, and they're not supposed to be stored on their filesystems. I wrote some code and found that the Linux msdos module (which supports one of the Windows filesystems) already rejects some "bad" filenames, returning the EINVAL error code instead.
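For instance, here's a quick way to see this kind of rejection yourself (a sketch; the mount point /mnt/dos is hypothetical):
touch '/mnt/dos/bad"name'
# touch: cannot touch '/mnt/dos/bad"name': Invalid argument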
So some application developers already assume that filenames aren't "unreasonable", the existing standard (POSIX) already permits operating systems to reject certain kinds of filenames, and existing POSIX systems already reject certain filenames in some circumstances. In that case, what kinds of limitations could we add to filenames that would help users and software developers?
First: Why the heck are the ASCII control characters (bytes 1 through 31) permitted in filenames? There's no advantage to keeping these as legal characters, and the problems are legion: they can't be reasonably displayed, many are troublesome to enter (especially in GUIs!), and they cause nothing but nasty side-effects. They also cause portability problems, since filesystems for Microsoft Windows can't contain them anyway.
One of the nastiest permitted control characters is the newline character. Many programs work a line-at-a-time, with a filename as the content or part of the content; this is great, except it fails when a newline can be in the filename. Many programs simply ignore the problem, and presume that there are no newlines in filenames. But this creates a subtle bug, possibly even a vulnerability - better to make the no-newline assumption true in the first place! I know of no program that legitimately requires the ability to insert newlines in a filename. Indeed, it's not hard to find comments like "ban newlines in filenames". GNU's "find" and "xargs" make it possible to work around this by inserting byte 0 between each filename... but few other programs support this convention (even "ls" normally doesn't). Using byte 0 as the separator is a pain to use anyway; who wants to read the intermediate output of this?
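It's trivial to create such a problem file and watch line-oriented tools miscount (assuming typical ls behavior when its output is piped):
touch "$(printf 'one\ntwo')"   # ONE file, with an embedded newline in its name
ls | wc -l                     # reports 2 "files" (in an otherwise-empty directory)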
The "tab" character is another control character that makes no sense; if tabs are never in filenames, then it's an great character to use as a "column separator" for multi-column data output - especially since many programs already use this convention. But the tab character isn't safe to use (easily) if it can be part of a filename.
Some control characters, particularly the escape (ESC) character, can cause all sorts of display problems, including security problems. Terminals (like xterm, gnome-terminal, the Linux console, etc.) implement control sequences. Most software developers don't understand that merely displaying filenames can cause security problems if they can contain control characters. The GNU ls program tries to protect users from this effect by default (see the -N option), but many people display filenames without the filtering done by ls - and the problem returns. H. D. Moore's "Terminal Emulator Security Issues" (2003) summarizes some of the security issues; modern terminal emulators try to disable the most dangerous ones, but they can still cause trouble. A filename with embedded control characters can (when displayed) cause function keys to be renamed, set X atoms, change displays in misleading ways, and so on. To counter this, some programs (such as find and ls) modify control characters when displaying filenames - making it even harder to correctly handle files with such names.
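For example, this creates a filename embedding an xterm title-change sequence (the payload here is illustrative and harmless, but it makes the point):
touch "$(printf 'innocent\033]0;gotcha\007')"
# any program that later prints this name unfiltered sends the escape
# sequence straight to the user's terminal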
In any case, filenames with control characters aren't portable. POSIX.1-2008 doesn't include control characters in the "portable filename character set", implying that such filenames aren't portable per the POSIX standard. Wikipedia's entry on Filenames notes that the Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F), so any such filenames can't be shared with Windows users, and they're not supposed to be stored on their filesystems.
In contrast, if control characters are forbidden, then you can safely use control characters like TAB and NEWLINE as filename separators, and the risk of displaying unfiltered control characters from this source goes away. As noted above, software developers make these assumptions anyway; it'd be great if it was safe to do so.
The "leading dash" problem is an ancient problem in Unix/Linux/POSIX. This is another example of the general problem that there's interaction between overly-flexible filenames with other system components (particularly option flags and shell scripts).
The Unix-haters handbook page 27 (PDF page 67) notes problems these decisions cause: "By convention, programs accept their options as their first argument, usually preceded by a dash (–)... Finally, Unix filenames can contain most characters, including nonprinting ones. This is flaw #3. These architectural choices interact badly. The shell lists files alphabetically when expanding “*”, and the dash (-) comes first in the lexicographic caste system. Therefore, filenames that begin with a dash (-) appear first when “*” is used. These filenames become options to the invoked program, yielding unpredictable, surprising, and dangerous behavior... [e.g., "rm *" will expand filenames beginning with dash, and use those as options to rm]... We’ve known several people who have made a typo while renaming a file that resulted in a filename that began with a dash: "% mv file1 -file2" Now just try to name it back... Doesn’t it seem a little crazy that a filename beginning with a hyphen, especially when that dash is the result of a wildcard match, is treated as an option list?" Indeed, people repeatedly ask how to ignore leading dashes in filenames - yes, you can prepend "./", but why do you need to know this at all?
The list of problems that "leading dash filenames" creates is seemingly endless. You can't safely run "cat *", because there might be a file with a leading dash; if there's a file named "-n", then suddenly all the output is numbered if you use GNU cat. Not all programs support the "--" convention, so you can't simply say "precede all command lists with --", and in any case, people forget to do this in real life. You could prefix the name or glob with "./"; that's a good solution, but people don't know or often forget to do this. The result: You're almost guaranteed to have programs that break (or are vulnerable) when filenames beginning with dash are created.
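For instance, recovering from the "mv file1 -file2" typo above requires knowing one of these tricks:
mv -- -file2 file2   # works when the command honors "--"
mv ./-file2 file2    # works everywhere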
POSIX.1-2008's "base definitions" document section 4.7 ("Filename Portability") specifically says "Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments". So filenames with leading hyphens are already specifically identified as non-portable in the POSIX standard.
There's no reason that a filesystem must permit filenames to begin with a dash. If such filenames were forbidden, then writing safe shell scripts would be much simpler - if a parameter begins with a "-", then it's an option and there is no other possibility.
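Here's a minimal sketch of how simple argument classification would become under that rule:
for arg in "$@" ; do
  case "$arg" in
    -*) echo "option:   $arg" ;;   # must be an option; no filename can look like this
    *)  echo "filename: $arg" ;;
  esac
done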
With today's march towards globalization, computers must support the sharing of information using many different languages. Given that, it's crazy that there's no standard encoding for filenames across all Unix/Linux/POSIX systems. In the early days of Unix, everyone assumed that filenames could only be English text, but that hasn't been true for a long time. Yet because you can't know the character encoding of a given filename, in theory you can't display filenames at all today. Why? Because you don't know how to translate the bytes of a filename into displayable characters (!). This is true for GUIs, and even for the command line. Yet you must be able to display filenames, so you need to make some determination... and it will sometimes be wrong.
The usual approach is to use environment variables to guess what the filename encoding is. But as soon as you start working with other people (say, by receiving a tarball or sharing a filesystem), the single-environment-variable approach fails. That's because it assumes the entire filesystem uses the same encoding, but once there's sharing, different parts of the filesystem can (and often do) use different encoding systems. There's currently no guarantee that two people share the same convention. Should you interpret the bytes in a filename as UTF-8? ISO-8859-1? One of the other ISO-8859-* encodings? KOI8-* (for Cyrillic)? EUC-JP or Shift-JIS (both popular in Japan)? In short, this is too flexible! The Austin Group even had a discussion about this in 2009. This failure to standardize the encoding leads to confusion, which can lead to mistakes and even vulnerabilities.
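Today each user's "guess" typically comes from locale settings, for example:
locale | grep -E '^(LANG|LC_CTYPE)'   # e.g., LANG=en_US.UTF-8
# ...but nothing guarantees that the names on disk actually use that encoding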
Yet this flexibility is actually not flexible enough, because the current filesystem requirements don't permit arbitrary encodings. If you want to store arbitrary international text, you need to use Unicode/ISO-10646 and pick one of its encodings - but its other common encodings (UTF-16 and UTF-32) must be able to store byte 0, and those don't work because you can't use byte 0 in a filename. It's also not flexible in another way: There's no mechanism to find out what encoding is used on a given filesystem. If one person uses ISO-8859-1, there's no obvious way to find out about it. In theory, you could store the encoding system with the filename, and then use multiple system calls to find out what encoding was used for each name... but really, who needs that kind of complexity?!?
If you want to store arbitrary language characters using today's Unix/Linux/POSIX filesystem, the only widely-used answer that "simply works" for all languages is UTF-8. Any other approach would require nonstandard additions like storing some sort of "character encoding" value with the filesystem, which would then require user programs to examine and use this encoding value. Users and software developers don't need more complexity - they want less. If people simply agreed that "all filenames will be sent in/out of the kernel in UTF-8 format", then all programs would work correctly; programs could simply retrieve a filename and print it, knowing that the filename is in UTF-8. Plan 9 already did this, and showed that you could do this on a POSIX-like system. Indeed, UTF-8 was developed by Unix luminaries Ken Thompson and Rob Pike specifically to support arbitrary language characters on Unix-like systems.
Some filesystems store filenames in other formats, but all of them have mount options to translate in/out of UTF-8 for userspace. In fact, some filesystems require a specific encoding on-disk for filenames, but to do this correctly, the kernel has to know which encoding is being used for the data sent in and out (e.g., with iocharset). But not all filesystems can do this conversion, and how do you find out which options are used where?!? Again, the simple answer is "use UTF-8 everywhere".
There's also another reason to use UTF-8 in filenames: Normalization. Some symbols have more than one Unicode representation (e.g., a character might be followed by accent 1 then accent 2, or by accent 2 then accent 1). They'd look the same, but would be considered different when compared byte-for-byte, and there's more than one normalization system (Linux programs and the W3C normally use NFC, but Darwin and MacOS X normally use NFD). If you have a filename in a non-Unicode encoding, then it's ambiguous how you "should" translate it to Unicode, making simple questions like "is this file already there" tricky. But if you store the name as UTF-8 encoded Unicode, then there's no trouble; you can just use the filename with whatever normalization convention was used when the file was created (presuming that the on-disk representation also uses some Unicode encoding).
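Here's a small demonstration of the normalization trap (the byte values are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by combining U+0301):
touch "$(printf 'caf\303\251')" "$(printf 'cafe\314\201')"   # NFC vs. NFD "café"
ls | wc -l   # 2 files that display identically (in an otherwise-empty directory)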
Samba's developer has identified yet another reason - case handling. Efficiently implementing Windows' filesystem semantics, where uppercase and lowercase are considered identical, requires that you be able to know what is "uppercase" and what is "lowercase". This is only practical if you know what the filename encoding is in the first place. Again, a single encoding system for all filenames, from the application point of view, is almost required to make this efficient.
Converting from one encoding to another can be a pain; systems need to support graceful upgrades to UTF-8. Thankfully, there's a program named "convmv" that can help; it was designed to be "very handy when one wants to switch over from old 8-bit locales to UTF-8 locales".
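For example (assuming the old names are ISO-8859-1; convmv only previews the renames unless told otherwise):
convmv -f iso-8859-1 -t utf-8 -r mydir            # dry run: show what would change
convmv -f iso-8859-1 -t utf-8 -r --notest mydir   # actually rename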
Again, let's look at POSIX.1-2008 ("POSIX.1-2008 is simultaneously IEEE Std 1003.1(TM)-2008 and The Open Group Technical Standard Base Specifications, Issue 7"). Its "Portable Filename Character Set" (defined in 3.276) is only A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen>. Note that this is a very restrictive list; few international speakers would accept this limited list, since it would mean they must only use English filenames. That's ridiculous; most computer users don't even know English. So why is this standard so restrictive? Because there's no standard encoding; since you don't know if a filename is UTF-8 or something else, there's no way to portably share filenames with non-English characters. If we did agree that UTF-8 encoding is used, the set of portable characters could include all languages. In other words, the lack of a standard creates arbitrary and unreasonable limitations.
Linux distributions are already moving towards storing filenames in UTF-8, for this very reason. Fedora's packaging guidelines require that "filenames that contain non-ASCII characters must be encoded as UTF-8. Since there's no way to note which encoding the filename is in, using the same encoding for all filenames is the best way to ensure users can read the filenames properly." OpenSuSE 9.1 has already switched to using UTF-8 as the default system character set ("lang_LANG.UTF-8"). Ubuntu recommends using UTF-8, saying "A good rule is to choose utf-8 locales", and provides a UTF-8 migration tool as part of its UTF-8 by default feature.
The major POSIX GUI suites, GNOME and KDE, seem to be moving this way too.
This is definitely a longer-term approach. Systems have to support UTF-8 as well as the many older encodings, giving people time to switch to UTF-8. To use "UTF-8 everywhere", all tools need to be updated to support UTF-8. Years ago, this was a big problem, but as of 2009 this is mostly a solved problem, and I think the trajectory is very clear.
However, if everyone moves to UTF-8 filenames, there's an interesting problem: Certain byte sequences are illegal in UTF-8. When getting filenames, you don't want to have to keep checking for such invalid sequences. If the kernel enforces these restrictions, ensuring that only valid UTF-8 filenames are allowed, then there's no problem.
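Until then, a careful program has to validate names itself; here's one portable sketch using iconv (assuming $name holds a filename):
if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ; then
  : # valid UTF-8, safe to process
else
  echo "skipping non-UTF-8 filename" >&2
fi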
The filesystem should require that filenames meet some standard, not because of some evil need to control people, but simply so that the names can always be displayed correctly at a later time. The lack of standards makes things harder for users, not easier. Yet the filesystem doesn't force filenames to be UTF-8, so it can easily contain garbage.
If filenames could not contain shell metacharacters, then a number of security problems would go away, and it'd be a lot easier for users to enter filenames. Often, shell programs are flattened into single long strings, and although filenames are supposed to be escaped if they have unusual characters, it's not at all unusual for a program to fail to escape something correctly. If filenames never had characters that needed to be escaped, there'd be one less operation that could fail.
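Here's a tiny illustration of how a metacharacter-laden filename becomes command injection (the filename and payload are hypothetical):
file='$(touch owned)'          # a perfectly legal filename today
sh -c "cat $file"              # DANGER: the embedded command actually runs
sh -c 'cat -- "$1"' sh "$file" # safe: the name is passed as data, not code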
I doubt these limitations could be agreed upon across all POSIX systems, but it'd be nice if administrators could configure specific systems to prevent such filenames on higher-value systems. This requires that software developers not use such metacharacters in filenames, but these aren't part of the POSIX portable filename set anyway.
I'll grant that this is less important; with the steps above, a lot of programs and statements like "cat *" just work correctly. But funny characters cause trouble for command-line users, because they need to quote them when typing in commands... and they often forget to do so. A useful starting-point list is "*?:[]"<>|(){}&'!" (this is Glindra's "safe" list with ampersand, single-quote, and bang added). This list is probably a little extreme, but let's try it and see (I think colon is no big deal by itself on most Unix/Linux systems, but it does cause trouble with Windows and MacOS systems). Note that <, >, and & are on the list; this eliminates many HTML/XML problems! I'd need to go through a complete analysis of all characters for a final list; for security, you want to enumerate exactly what is permissible and disallow everything else, though the rule can be expressed either way as long as you've considered all possible cases.
In fact, for portability's sake, you already don't want to create filenames with weird characters either. MacOS and Windows XP also forbid certain characters/names. Some MacOS filesystems and interfaces forbid ":" in a name (it's the directory separator). Microsoft Windows won't let you begin filenames with a space or dot, and it also restricts these characters:
: * ? " < > |Also, in Windows, \ and / are both interpreted as directory name separators, and according to that page there are some issues with ".", "[", "]", ";", "=", and ",".
For more info, see Wikipedia's entry on Filenames. Windows' NTFS rules are actually complicated:
Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) and characters " * : < > ? \ / |. Although NTFS allows each path component (directory or filename) to be 255 characters long and paths up to about 32767 characters long, the Windows kernel only supports paths up to 259 characters long. Additionally, Windows forbids the use of the MS-DOS device names AUX, CLOCK$, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, CON, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, NUL and PRN, as well as these names with any extension (for example, AUX.txt), except when using Long UNC paths (ex. \\.\C:\nul.txt or \\?\D:\aux\con). (In fact, CLOCK$ may be used if an extension is provided.) These restrictions only apply to Windows - Linux, for example, allows use of " * : < > ? \ / | even in NTFS.
Note here that Microsoft Windows makes a number of terrible mistakes with its filesystem. It has very arbitrary limits on particular filenames; merely having a file named "com1.txt" can cause problems on a Windows system! The point is not to duplicate the mistakes of other systems! Instead, the goal is to have simple rules that make it easy to avoid common mistakes. We need something that is neither "everything is permissible" nor "capricious, hard-to-follow rules".
It'd be easier and cleaner to write fully-correct shell scripts if filenames couldn't include any kind of whitespace. There's no reason anyone needs tab or newline in filenames, as noted above, so that leaves us with the space character. Indeed, there are a lot of existing Unix/Linux programs that presume there are no space characters in filenames (many RPM spec files make this assumption; this can be enforced in their constrained environment). Unfortunately, a lot of people do have filenames with embedded spaces. In theory, "_" would suffice instead of space, but I suspect that users won't go along. Indeed, you essentially cannot handle typical Windows and MacOS filenames without handling filenames with an embedded space, so while this would make things easier, this is a lost cause.
However, this is no disaster. Many "obvious" shell programs actually work correctly, even in the presence of spaces. For example, "cat *" will work correctly, even if some filenames have spaces, with bash, zsh, ksh, and even busybox's shell.
More importantly, once newlines and tabs cannot happen in filenames, programs can safely use newlines and tabs as delimiters between filenames. Having safe delimiters makes spaces in filenames much easier to handle. In particular, shell programs can then safely do what many already do: Use programs like 'find' to create a list of filenames (one per line), and then process the filenames a line at a time.
Shell script writers can use an additional trick if newlines and tabs can't be in filenames: set the IFS variable to just tab and newline (removing "space", which is usually the first character in IFS). Doing this means that space is no longer an input separator - only tabs and newlines separate input values. This makes lists of filenames much easier to deal with: programs can produce filenames separated by newlines (or tabs), and scripts can process them directly, even if the names have embedded spaces. One annoying problem is actually setting IFS; it's awkward to set IFS to just tab and newline using only standard POSIX shell capabilities. Many shell implementations support the $'...' extension, including bash, ksh (korn shell), and zsh (Z shell); in those shells you can just do IFS=$'\t\n'. As the korn shell documentation says, the purpose of $'...' is to 'solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all "..." strings handle ANSI-C escapes, but that would not be backwards compatible.' It might be useful to try to get $'...' into the POSIX standard, specifically to simplify this case. Once IFS is reset like this, filenames with spaces become much simpler to handle, as sketched below.
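A sketch of both approaches:
IFS=$'\t\n'   # bash, ksh, zsh: IFS is now just tab and newline

# portable POSIX sh workaround: command substitution strips trailing
# newlines, so append a sentinel character and strip it afterwards:
IFS=$(printf '\t\nX') ; IFS=${IFS%X}

# now newline-separated lists split correctly even when names contain
# spaces (disable globbing too, to be thorough):
set -f
for f in $(find . -type f) ; do cat "$f" ; done
set +f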
In sum: It'd be far better if filenames were more limited so that they would be safer and easier to use. This would eliminate a whole class of errors and vulnerabilities in programs that "look correct" but subtly fail when unusual filenames are created (possibly by attackers).
In general, kernels should emphasize mechanism, not policy. But in these cases, as noted above, the problem is so bad that there are programs like detox and Glindra to fix bad filenames. What's worse, there's no method for enforcing policy - there's often no easy way to enforce "you may not create filenames with this pattern". There's not even a standard way to report when such a policy is being enforced, or of telling people which filenames are fine. In the longer term, systems could auto-rename "bad" file names if they appear on pre-existing filesystems. This is difficult; it's better to prevent bad filenames in the first place.
So what steps could be taken to clean this up slowly, over time, without causing undue burdens to anyone?
This won't happen overnight; many programs will still have to handle "bad" filenames as this transition occurs. But we can start making bad filenames impossible now, so that future software developers won't have to deal with them.
What is "bad", though? Even if they aren't universal, it'd be useful to have a common list so that software developers could avoid creating "non-portable" filenames. Some restrictions are easier to convince people of than others; administrators of a locked-down system might be interested in a longer list of rules. Here are possible rules, in order of importance (I'd do the first two right away, the third as consensus can be achieved, and the later ones would probably only apply to individual systems):
In particular, ensuring that filenames had no control characters, no leading dashes, and used UTF-8 encoding would make a lot of software development simpler. This is a long-term effort, but the journey of a thousand miles starts with the first step.
Feel free to see my home page at http://www.dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.
(C) Copyright 2009 David A. Wheeler.