Why does Linux use getdents() on directories instead of read()?

问题

I was skimming through K&R C and I noticed that to read the entries in a directories, they used:

while (read(dp->fd, (char *) &dirbuf, sizeof(dirbuf)) == sizeof(dirbuf))
    /* code */

Where dirbuf was a system-specific directory structure, and dp->fd a valid file descriptor. On my system, dirbuf would have been a struct linux_dirent. Note that a struct linux_dirent has a flexible array member for the entry name, but let us assume, for the sake of simplicity, that it doesn't. (Dealing with the flexible array member in this scenario would only require a little extra boilerplate code).

Linux, however, doesn't support this construct. When using read() to try reading directory entries as above, read() returns -1 and errno is set to EISDIR.

Instead, Linux dedicates a system call specifcally for reading directories, namely the getdents() system call. However, I've noticed that it works in pretty much the same way as above.

while (syscall(SYS_getdents, fd, &dirbuf, sizeof(dirbuf)) != -1)
    /* code */

What was the rational behind this? There seems to be little/no benefit compared to using read() as done in K&R.

回答1:

getdents will return struct linux_dirent. It will do this for any underlying type of filesystem. The "on disk" format could be completely different, known only to the given filesystem driver, so a simple userspace read call could not work. That is, getdents may convert from the native format to fill the linux_dirent.

couldn't the same thing be said about reading bytes from a file with read()? The on disk format of the data within a file isn't necessary uniform across filesystems or even contiguous on disk - thus, reading a series of bytes from disk would again be something I expect to be delegated to the file system driver.

The discontiguous file data in handled by the VFS ["virtual filesystem"] layer. Regardless of how a FS chooses to organize the block list for a file (e.g. ext4 uses "inodes": "index" or "information" nodes. these use an "ISAM" ("index sequential access method") organization. But, an MS/DOS FS can have a completely different organization).

Each FS driver registers a table of VFS function callbacks when it's started. For a given operation (e.g. open/close/read/write/seek), there is corresponding entry in the table.

The VFS layer (i.e. from the userspace syscall) will "call down" into the FS driver and the FS driver will perform the operation, doing whatever it deems necessary to fulfill the request.

I assume that the FS driver would know about the location of the data inside a regular file on disk - even if the data was fragmented.

Yes. For example, if the read request is to read the first three blocks from the file (e.g. 0,1,2), the FS will look up the indexing information for the file and get a list of physical blocks to read (e.g. 1000000,200,37) from the disk surface. This is all handled transparently in the FS driver.

The userspace program will simply see its buffer filled up with the correct data, without regard to how complex the FS indexing and block fetch had to be.

Perhaps it is [loosely] more proper to refer to this as transferring inode data as there are inodes for files (i.e. an inode has the indexing information to "scatter/gather" the FS blocks for the file). But, the FS driver also uses this internally to read from a directory. That is, each directory has an inode to keep track of the indexing information for that directory.

So, to an FS driver, a directory is much like a flat file that has specially formatted information. These are the directory "entries". This is what getdents returns. This "sits on top of" the inode indexing layer.

Directory entries can be variable length [based on the length of the filename]. So, the on disk format would be (call it "Type A"):

static part|variable length name
static part|variable length name
...

But ... some FSes organize themselves differently (call it "Type B"):

<static1>,<static2>...
<variable1>,<variable2>,...

So, the type A organization might be read atomically by a userspace read(2) call, the type B would have difficulty. So, the getdents VFS call handles this.

couldn't the VFS also present a "linux_dirent" view of a directory like the VFS presents a "flat view" of a file?

That is what getdents is for.

Then again, I'm assuming that a FS driver knows the type of each file and thus could return a linux_dirent when read() is called on a directory rather than a series of bytes.

getdents did not always exist. When dirents were fixed size and there was only one FS format, the readdir(3) call probably did read(2) underneath and got a series of bytes [which is only what read(2) provides]. Actually, IIRC, in the beginning there was only readdir(2) and getdents and readdir(3) did not exist.

But, what do you do if the read(2) is "short" (e.g. two bytes too small)? How do you communicate that to the app?

My question is more like since the FS driver can determine whether a file is a directory or a regular file (and I'm assuming it can), and since it has to intercept all read() calls eventually, why isn't read() on a directory implemented as reading the linux_dirent?

read on a dir isn't intercepted and converted to getdents because the OS is minimalist. It expects you to know the difference and make the appropriate syscall.

You do open(2) for files or dirs [opendir(3) is wrapper and does open(2) underneath]. You can read/write/seek for file and seek/getdents for dirs.

But ... doing read for returns EISDIR. [Side note: I had forgotten this in my original comments]. In the simple "flat data" model it provides, there isn't a way to convey/control all that getdents can/does.

So, rather than allow an inferior way to get partial/wrong info, it's simpler for the kernel and an app developer to go through the getdents interface.

Further, getdents does things atomically. If you're reading directory entries in a given program, there may be other programs that are creating and deleting files in that directory or renaming them--right in the middle of your getdents sequence.

getdents will present an atomic view. Either a file exists or it doesn't. It's been renamed or it hasn't. So, you don't get a "half modified" view, regardless of how much "turmoil" is happening around you. When you ask getdents for 20 entries, you'll get them [or 10 if there's only that much].

Side note: A useful trick is to "overspecify" the count. That is, tell getdents you want 50,000 entries [you must provide the space]. You'll usually get back something like 100 or so. But, now, what you've got is an atomic snapshot in time for the full directory. I sometimes do this instead of looping with a count of 1--YMMV. You still have to protect against immediate disappearance but at least you can see it (i.e. a subsequent file open fails)

So, you always get "whole" entries and no entry for a just deleted file. That is not to say that the file is still there, merely that it was there at the time of the getdents. Another process may instantly erase it, but not in the middle of the getdents

If read(2) were allowed, you'd have to guess at how much data to read and wouldn't know which entries were fully formed on in a partial state. If the FS had the type B organization above, a single read could not atomically get the static portion and variable portion in a single step.

It would be philosophically incorrect to slow down read(2) to do what getdents does.

getdents, unlink, creat, rmdir, and rename (etc.) operations are interlocked and serialized to prevent any inconsistencies [not to mention FS corruption or leaked/lost FS blocks]. In other words, these syscalls all "know about each other".

If pgmA renames "x" to "z" and pgmB renames "y" to "z", they don't collide. One goes first and another second but no FS blocks are ever lost/leaked. getdents gets the whole view (be it "x y", "y z", "x z" or "z"), but it will never see "x y z" simultaneously.

回答2:

In K&R (actually, Unix up through SVr2 at least, perhaps SVr3), directory entries were 16 bytes, using 2 bytes for the inode and 14 bytes for filenames.

Using read made sense, because the directory entries on the disk were all the same size. 16 bytes (a power of 2) also made sense, because it did not require hardware multiply to compute offsets. (I recall someone telling me around 1978 that the Unix disk driver used floating point and was slow... but that's second hand, although amusing).

Later improvements to directories allowed longer names, which meant that the sizes differed (there being no point to making huge entries all the same as the largest possible name). A newer interface was provided, readdir.

Linux provides a lower-level interface. According to its manual page:

These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.

As illustrated in your example, getdents is a system call, useful for implementing readdir. The manner in which readdir is implemented is unspecified. There is no particular reason why the early readdir (from about 30 years ago) could not have been implemented as a library function using read and malloc and similar functions to manage the long filenames read from the directory.

Moving features into the kernel was done (probably) in this case to improve performance. Because getdents reads several directory entries at a time (unlike readdir), that may reduce the overhead of reading all of the entries for a small directory (by reducing the number of system calls).

回答3:

Your suspicion is correct: It would make more sense to have the read system call work on directories and return some standardized data, rather than have the separate getdents system call. getdents is superfluous and reduces the uniformity of the interface. The other answers assert that "read" as an interface would be inferior in some way to "getdents". They are incorrect. As you can observe, the arguments and return value of "read" and "getdents" are identical; just "read" only works on non-directories and "getdents" only works on directories. "getdents" could easily be folded into "read" to get a single uniform syscall.

The reason this is not the case is historical. Originally, "read" worked on directories, but returned the actual raw directory entry in the filesystem. This was complex to parse, so the getdirents call was added in addition to read, to provide a filesystem-independent view of directory entries. Eventually, "read" on directories was turned off. "read" on directories just as well could have been made to behave identically to getdirents instead of being turned off. It just wasn't, possibly because it seemed duplicative.

In Linux, in particular, "read" has returned an error when reading directories for so long that it's almost certain that some program is relying on this behavior. So, backwards compatibility demands that "read" on Linux will never work on directories.

来源：https://stackoverflow.com/questions/36144807/why-does-linux-use-getdents-on-directories-instead-of-read

标签

Linux

unix

architecture

filesystem-access