Message ID | 20211221164004.119663-1-shr@fb.com (mailing list archive) |
---|---|
Headers | show |
Series | io_uring: add getdents64 support | expand |
On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote: > > This series adds support for getdents64 in liburing. The intent is to > provide a more complete I/O interface for io_uring. Ack, this series looks much more natural to me now. I think some of the callers of "iterate_dir()" could probably be cleaned up with the added argument, but for this series I prefer that mindless approach of just doing "&(arg1)->f_pos" as the third argument that is clearly a no-op. So the end result is perhaps not as beautiful as it could be, but I think the patch series DTRT. Linus
On Tue, 21 Dec 2021 08:40:01 -0800, Stefan Roesch wrote: > This series adds support for getdents64 in liburing. The intent is to > provide a more complete I/O interface for io_uring. > > Patch 1: fs: add offset parameter to iterate_dir function. > This adds an offset parameter to the iterate_dir() > function. The new parameter allows the caller to specify > the offset to use. > > [...] Applied, thanks! [1/3] fs: add offset parameter to iterate_dir function commit: 1533c1b579e1e866465fd9d04c8a5ebb1e25ba28 [2/3] fs: split off vfs_getdents function of getdents64 syscall commit: 54d460de2423434e7aa9f7fcb9656230de53b85e [3/3] io_uring: add support for getdents64 commit: b4518682080d3a1cdd6ea45a54ff6772b8b2797a Best regards,
On Tue, Dec 21, 2021 at 08:40:01AM -0800, Stefan Roesch wrote: > This series adds support for getdents64 in liburing. The intent is to > provide a more complete I/O interface for io_uring. > > Patch 1: fs: add offset parameter to iterate_dir function. > This adds an offset parameter to the iterate_dir() > function. The new parameter allows the caller to specify > the offset to use. > > Patch 2: fs: split off vfs_getdents function from getdents64 system call > This splits of the iterate_dir part of the syscall in its own > dedicated function. This allows to call the function directly from > io_uring. > > Patch 3: io_uring: add support for getdents64 > Adds the functions to io_uring to support getdents64. > > There is also a patch series for the changes to liburing. This includes > a new test. The patch series is called "liburing: add getdents support." > > The following tests have been performed: > - new liburing getdents test program has been run > - xfstests have been run > - both tests have been repeated with the kernel memory leak checker > and no leaks have been reported. AFAICS, it completely breaks the "is the position known to be valid" logics in a lot of ->iterate_dir() instances. For a reasonably simple example see e.g. ext2_readdir(): bool need_revalidate = !inode_eq_iversion(inode, file->f_version); ..... if (unlikely(need_revalidate)) { if (offset) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); ctx->pos = (n<<PAGE_SHIFT) + offset; } file->f_version = inode_query_iversion(inode); need_revalidate = false; } and that, combined with * directory modifications bumping iversion * lseek explicitly cleaing ->f_version on location changes and resulting position going back into ->f_pos, *BEFORE* we unlock the damn file. makes sure that we call ext2_validate_entry() for any untrusted position. Your code breaks the living hell out of that. First of all, the offset is completely arbitrary and no amount of struct file-based checks is going to be relevant. Furthermore, this "we'd normalized the position, the valid one will go into ->f_pos and do so before the caller does fdput_pos(), so mark the file as known-good-position" logics also break - ->f_pos is *NOT* locked and new positition, however valid it might be, is not going to be put there. How could that possibly work? I'm not saying that the current variant is particularly nice, but the need to avoid revalidation of position on each getdents(2) call is real, and this revalidation is not cheap. Furthermore, how would that work on e.g. shmem/tmpfs/ramfs/etc.? There the validation is not an issue, simply because the position is not used at all - a per-file cursor is, and it's moved by dcache_dir_lseek() and dcache_readdir(). Folks, readdir from arbitrary position had been a source of pain since mid-80s. A plenty of PITA all around, and now you introduce another API that shoves things into the same orifices without even a benefit of going through lseek(2), or any of the exclusion warranties regarding the file position?
On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote: > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote: > > > > This series adds support for getdents64 in liburing. The intent is to > > provide a more complete I/O interface for io_uring. > > Ack, this series looks much more natural to me now. > > I think some of the callers of "iterate_dir()" could probably be > cleaned up with the added argument, but for this series I prefer that > mindless approach of just doing "&(arg1)->f_pos" as the third argument > that is clearly a no-op. > > So the end result is perhaps not as beautiful as it could be, but I > think the patch series DTRT. It really does not. Think what happens if you pass e.g. an odd position to that on e.g. ext2/3/4. Or just use it on tmpfs, for that matter.
On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote: > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote: > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote: > > > > > > This series adds support for getdents64 in liburing. The intent is to > > > provide a more complete I/O interface for io_uring. > > > > Ack, this series looks much more natural to me now. > > > > I think some of the callers of "iterate_dir()" could probably be > > cleaned up with the added argument, but for this series I prefer that > > mindless approach of just doing "&(arg1)->f_pos" as the third argument > > that is clearly a no-op. > > > > So the end result is perhaps not as beautiful as it could be, but I > > think the patch series DTRT. > > It really does not. Think what happens if you pass e.g. an odd position > to that on e.g. ext2/3/4. Or just use it on tmpfs, for that matter. [A bit of a braindump follows; my apologies for reminding of some unpleasant parts of history] The real problem here is the userland ABI along the lines of pread for directories. There's a good reason why we (as well as everybody else, AFAIK) do not have pgetdents(2). Handling of IO positions for directories had been causing trouble ever since the directory layouts had grown support for long names. Originally a directory had been an array of fixed-sized entries; back then ls(1) simply used read(2). No special logics for handling offsets, other than "each entry is 16 bytes, so you want to read a multiple of 16 starting from offset that is a multiple of 16". As soon as FFS had introduced support for names longer than 14 characters, the things got more complicated - there's no predicting if given position is an entry boundary. Worse, what used to have been an entry boundary might very well come to point to the middle of a name - all it takes is unlink()+creat() done since the time the position used to be OK. Details are filesystem-dependent; e.g. for original FFS all multiples of 512 are valid offsets, and each entry has its length stored in bytes 4 and 5, so one needs to read the relevant 512 bytes of contents and walk the list of entries until they get to (or past) the position that needs to be validated. For ext2 it's a bit more unpleasant, since the chunk size is not 512 bytes - it's a block size, i.e. might easily by 4Kb. For more complex layouts it gets even nastier. Having to do that kind of validation on each directory entry access would be too costly. That's somewhat mitigated since the old readdir(2) is no longer used, and we return multiple entries per syscall (getdents(2)). With the exclusion between directory reads and modifications, that allows to limit validation to the first entry returned on that syscall. It's still too costly, though. The next part of mitigation is to use the fact that valid position will remain valid until somebody modifies a directory, so we don't need to validate if directory hadn't been changed since the last time getdents(2) had gotten to this position. Of course, explicit change of position since that last getdents(2) means that we can't skip validation. Another fun part is synthetic filesystems like tmpfs - there we don't have any persistent directory contents we could use as source of offsets. All we have is a cyclic list of dentries; memory address of dentry is obviously not a candidate - trying to validate _that_ would be beyond unpleasant. So we use the position in the list. To avoid the O(directory size) cost of walking the list to the position we are asked to read from, we insert a cursor into the list and have directory reads and seeks move it as needed. File position is not authoritative there - the cursor is. They can get out of sync - if you read almost to the end of directory, unlink the first file, use lseek(fd, SEEK_CUR, 0) to find position, then lseek to 0 and back to the position reported, you'll end up one entry further than you would without those lseeks. Userland is generally OK with that (and we are within the POSIX warranties). However, if that kind of "out of sync" can happen without any directory modifications involved, we have trouble. Directories are not well-suited for random access. Not since mid-80s. It's possible, but not cheap and there's a lot of non-obvious corner cases where directory modifications are involved. Some of those must work, or the userland will break. Telling which ones those are is not trivial. pgetdents(2) would be a bad idea by itself; making it asynchronous is the stuff of nightmares. Please, reconsider that ABI - AFAICS, there's no way to do it cheaply and it will take modifying each ->iterate_dir() just to avoid breaking the existing uses of those.
On Sat, Jan 1, 2022 at 8:59 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote: > > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote: > > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote: > > > > > > > > This series adds support for getdents64 in liburing. The intent is to > > > > provide a more complete I/O interface for io_uring. > > > > > > Ack, this series looks much more natural to me now. > > > > > > I think some of the callers of "iterate_dir()" could probably be > > > cleaned up with the added argument, but for this series I prefer that > > > mindless approach of just doing "&(arg1)->f_pos" as the third argument > > > that is clearly a no-op. > > > > > > So the end result is perhaps not as beautiful as it could be, but I > > > think the patch series DTRT. > > > > It really does not. Think what happens if you pass e.g. an odd position > > to that on e.g. ext2/3/4. Or just use it on tmpfs, for that matter. > > [A bit of a braindump follows; my apologies for reminding of some > unpleasant parts of history] > > The real problem here is the userland ABI along the lines of pread for > directories. There's a good reason why we (as well as everybody else, > AFAIK) do not have pgetdents(2). > > Handling of IO positions for directories had been causing trouble > ever since the directory layouts had grown support for long names. > Originally a directory had been an array of fixed-sized entries; back > then ls(1) simply used read(2). No special logics for handling offsets, > other than "each entry is 16 bytes, so you want to read a multiple of > 16 starting from offset that is a multiple of 16". > > As soon as FFS had introduced support for names longer than 14 characters, > the things got more complicated - there's no predicting if given position > is an entry boundary. Worse, what used to have been an entry boundary > might very well come to point to the middle of a name - all it takes is > unlink()+creat() done since the time the position used to be OK. > > Details are filesystem-dependent; e.g. for original FFS all multiples of > 512 are valid offsets, and each entry has its length stored in bytes 4 > and 5, so one needs to read the relevant 512 bytes of contents and walk > the list of entries until they get to (or past) the position that needs > to be validated. For ext2 it's a bit more unpleasant, since the chunk > size is not 512 bytes - it's a block size, i.e. might easily by 4Kb. > For more complex layouts it gets even nastier. > > Having to do that kind of validation on each directory entry access would > be too costly. That's somewhat mitigated since the old readdir(2) is no > longer used, and we return multiple entries per syscall (getdents(2)). > With the exclusion between directory reads and modifications, that allows > to limit validation to the first entry returned on that syscall. > > It's still too costly, though. The next part of mitigation is to use > the fact that valid position will remain valid until somebody modifies a > directory, so we don't need to validate if directory hadn't been changed > since the last time getdents(2) had gotten to this position. Of course, > explicit change of position since that last getdents(2) means that we > can't skip validation. And for this validation caching to work properly, AFAIU you need to hold the file->f_pos_lock (or have exclusive access to the "struct file"), which happens in the normal getdents() path via fdget_pos(). This guarantees that the readdir handler has exclusive access to the file's ->f_version, which has to stay in sync with the position. By the way, this makes me wonder: If I open() an ext4 directory and then concurrently modify the directory, call readdir() on it, and issue IORING_OP_READV SQEs with ->off == -1, can that cause ext4 to think that filesystem corruption has occurred? io_uring has some dodgy code that seems to be reading and writing file->f_pos without holding the file->f_pos_lock. And even if the file doesn't have an f_op->read or f_op->read_iter handler, I think you might be able to read ->f_pos of an ext4 directory and write the value back later, unless I'm missing a check somewhere? io_prep_rw() grabs file->f_pos; then later, io_read() calls io_iter_do_read() (which will fail with -EINVAL), and then the error path goes through kiocb_done(), which writes the position back to req->file->f_pos. So I think the following race might work: First, open an FD referencing a directory on an ext4 filesystem. Read part of the directory's entries from the FD with getdents(). Then modify the directory such that the FD's ->f_version is now stale. Then do the race: task A: start a IORING_OP_READV op on the FD and let it grab the stale offset from file->f_pos task B: run getdents() on the FD. after this, ->f_version is up-to-date and ->f_pos points to the start of a dentry task A: continue handling the IORING_OP_READV: io_iter_do_read() fails, kiocb_done() overwrites the valid ->f_pos with the stale value while leaving ->f_version up-to-date Afterwards, the file might have an ->f_pos pointing in the middle of a dentry while ->f_version indicates that it is valid (meaning it's supposed to point to the start of a dentry). If you then do another getdents() call on the FD, I think you might get garbage back and/or make the kernel think that the filesystem is corrupted (depending on what's stored at that offset)?
On 1/2/22 11:03 PM, Jann Horn wrote: > io_uring has some dodgy code that seems to be reading and writing > file->f_pos without holding the file->f_pos_lock. And even if the file > doesn't have an f_op->read or f_op->read_iter handler, I think you > might be able to read ->f_pos of an ext4 directory and write the value > back later, unless I'm missing a check somewhere? I posted an RFC to hold f_pos_lock across those operations before the break: https://lore.kernel.org/io-uring/8a9e55bf-3195-5282-2907-41b2f2b23cc8@kernel.dk/ picking it up this week and flushing it out, hopefully.
On Sun, Jan 2, 2022 at 11:04 PM Jann Horn <jannh@google.com> wrote: > > And for this validation caching to work properly, AFAIU you need to > hold the file->f_pos_lock (or have exclusive access to the "struct > file"), which happens in the normal getdents() path via fdget_pos(). > This guarantees that the readdir handler has exclusive access to the > file's ->f_version, which has to stay in sync with the position. Yes. So the whole 'preaddir()' model was wrong, and thanks to Al for noticing. It turns out that you cannot pass in a different 'pos' than f_pos, because we have that very tight coupling between the 'struct file' and readdir(). It's not just about f_pos and f_version, either - Al pointed out the virtual filesystems, which use a special dentry cursor to traverse the child dentries for readdir, and that one uses 'file->private_data'. So the directory position isn't really about some simple passed-in pos, it has locking rules, it has validation, and it has actual secondary data in the file pointer. Linus
On Mon, Jan 03, 2022 at 08:03:51AM +0100, Jann Horn wrote: > io_prep_rw() grabs file->f_pos; then later, io_read() calls > io_iter_do_read() (which will fail with -EINVAL), and then the error > path goes through kiocb_done(), which writes the position back to > req->file->f_pos. So I think the following race might work: Why does it touch ->f_pos on failure, anyway? It's a bug, plain and simple; note that read(2) and write(2) are explicitly requested to leave IO position alone if they return an error. See e.g. fs/read_write.c:ksys_read() - ret = vfs_read(f.file, buf, count, ppos); if (ret >= 0 && ppos) f.file->f_pos = pos; fdput_pos(f); Position update happens only on success (and only for non-stream files, at that). No matter how special io-uring is (it's not covered by POSIX, for obvious reasons), this is simply wrong, directories or no directories.
Stefan Roesch wrote on Tue, Dec 21, 2021 at 08:40:01AM -0800: > This series adds support for getdents64 in liburing. The intent is to > provide a more complete I/O interface for io_uring. [reminder: This series was dropped a year ago after Al Viro rightly pointed out that we can't just pass an arbitrary offset to iterate_dir as offset validation is costly and not always possible at all] I'm digging an old topic here because I was looking at trying io_uring on a toy project (specifically, exporting infos out of /sys/fs/cgroup recursively), but was partly held back by the lack of getdents or equivalent interface for the crawler part -- and existing bricks like fts or nftw predate openat interfaces and just didn't appeal to me, but in general just mixing in io_uring and synchronous getdents sounded a bit of a pain. Given it's been over a year I guess it's not such a much needed feature, but when you're centering your program loop around the ring having the occasional getdents/readdir call is a bit cumbersome. This is probably a naive idea, but would it make sense discussing just making it fit the current getdents API: Given the problem seems to be revalidating the offset, since the main usecase is probably to just go through a directory, perhaps the file's offset could be updated just like the real call and offset validation be skipped if the parameter is equal to the file's offset? Giving a different offset would be equivalent to lseek + getdents and update the position as well, so e.g. rewinding the directory with a seek 0 would work if an application needs to check a directory's content multiple times. Heck, seek to any offset other than 0 could just be refused for any sane usage, it just doesn't make sense to use anything else in practice; that'd make it a defacto "dumb getdents + rewinddir". The API could be made simpler to use/clear about expectations by making the last parameter "bool rewind_first" instead of an offset (or I guess a flag with just a single valid bit if we were to really implement this) This isn't very io_uring-like, but it'd allow for an io_uring-centric program to also handle some directory processing, and wouldn't expose anything that's not currently already possible to do (as long as each processed op brings in its own context, the only "race" I can see in iterate_dir is that file->f_pos can go back to a previously also valid position if some iterations are faster than others, assuming it's an atomic access, and that'd probably warrant a READ/WRITE_ONCE or something.. but that's not a new problem, user threads can already hammer getdents in parallel on a single fd if they want useless results) What do you think? I'm just asking with a toy in mind so it's not like I have any strong opinion, but I'd be happy to rework Stefan's patches if it looks like it might make sense. (And if the consensus is that this will make hell break loose I'll just forget about it before sinking more time in it, just catching up was fun enough!) Cheers,