[v7,0/3] io_uring: add getdents64 support

Message ID	20211221164004.119663-1-shr@fb.com (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Stefan Roesch <shr@fb.com> To: <io-uring@vger.kernel.org>, <linux-fsdevel@vger.kernel.org> CC: <torvalds@linux-foundation.org>, <shr@fb.com> Subject: [PATCH v7 0/3] io_uring: add getdents64 support Date: Tue, 21 Dec 2021 08:40:01 -0800 Message-ID: <20211221164004.119663-1-shr@fb.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Precedence: bulk
Series	io_uring: add getdents64 support \| expand [v7,0/3] io_uring: add getdents64 support [v7,1/3] fs: add offset parameter to iterate_dir function [v7,2/3] fs: split off vfs_getdents function of getdents64 syscall [v7,3/3] io_uring: add support for getdents64

Stefan Roesch Dec. 21, 2021, 4:40 p.m. UTC

This series adds support for getdents64 in liburing. The intent is to
provide a more complete I/O interface for io_uring.

Patch 1: fs: add offset parameter to iterate_dir function.
  This adds an offset parameter to the iterate_dir()
  function. The new parameter allows the caller to specify
  the offset to use.

Patch 2: fs: split off vfs_getdents function from getdents64 system call
  This splits of the iterate_dir part of the syscall in its own
  dedicated function. This allows to call the function directly from
  io_uring.

Patch 3: io_uring: add support for getdents64
  Adds the functions to io_uring to support getdents64.

There is also a patch series for the changes to liburing. This includes
a new test. The patch series is called "liburing: add getdents support."

The following tests have been performed:
- new liburing getdents test program has been run
- xfstests have been run
- both tests have been repeated with the kernel memory leak checker
  and no leaks have been reported.


V7: - add loff_t *parameter to iterate_dir function
    - remove do_iterate_dir function
    - change callers of iterate_dir function
v6: - rebased patch series
v5: - remove old patch (v4 contained a patch file from v3)
V4: - silence compiler warnings
V3: - add do_iterate_dir() function to Patch 1
    - make iterate_dir() function call do_iterate_dir()
      This has the advantage that the function signature of iterate_dir
      does not change
V2: - updated the iterate_dir calls in fs/ksmbd, fs/ecryptfs and arch/alpha with
      the additional parameter.


Stefan Roesch (3):
  fs: add offset parameter to iterate_dir function
  fs: split off vfs_getdents function of getdents64 syscall
  io_uring: add support for getdents64

 arch/alpha/kernel/osf_sys.c   |  2 +-
 fs/ecryptfs/file.c            |  2 +-
 fs/exportfs/expfs.c           |  2 +-
 fs/internal.h                 |  8 +++++
 fs/io_uring.c                 | 52 +++++++++++++++++++++++++++++
 fs/ksmbd/smb2pdu.c            |  3 +-
 fs/ksmbd/vfs.c                |  4 +--
 fs/nfsd/nfs4recover.c         |  2 +-
 fs/nfsd/vfs.c                 |  2 +-
 fs/overlayfs/readdir.c        |  6 ++--
 fs/readdir.c                  | 62 +++++++++++++++++++++++++----------
 include/linux/fs.h            |  2 +-
 include/uapi/linux/io_uring.h |  1 +
 13 files changed, 119 insertions(+), 29 deletions(-)


base-commit: d09358c3d161dcea8f02eae1281bc996819cc769

Linus Torvalds Dec. 21, 2021, 5:15 p.m. UTC | #1

On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote:
>
> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.

Ack, this series looks much more natural to me now.

I think some of the callers of "iterate_dir()" could probably be
cleaned up with the added argument, but for this series I prefer that
mindless approach of just doing "&(arg1)->f_pos" as the third argument
that is clearly a no-op.

So the end result is perhaps not as beautiful as it could be, but I
think the patch series DTRT.

            Linus

Jens Axboe Dec. 21, 2021, 7:17 p.m. UTC | #2

On Tue, 21 Dec 2021 08:40:01 -0800, Stefan Roesch wrote:
> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.
> 
> Patch 1: fs: add offset parameter to iterate_dir function.
>   This adds an offset parameter to the iterate_dir()
>   function. The new parameter allows the caller to specify
>   the offset to use.
> 
> [...]

Applied, thanks!

[1/3] fs: add offset parameter to iterate_dir function
      commit: 1533c1b579e1e866465fd9d04c8a5ebb1e25ba28
[2/3] fs: split off vfs_getdents function of getdents64 syscall
      commit: 54d460de2423434e7aa9f7fcb9656230de53b85e
[3/3] io_uring: add support for getdents64
      commit: b4518682080d3a1cdd6ea45a54ff6772b8b2797a

Best regards,

Al Viro Dec. 31, 2021, 11:14 p.m. UTC | #3

On Tue, Dec 21, 2021 at 08:40:01AM -0800, Stefan Roesch wrote:
> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.
> 
> Patch 1: fs: add offset parameter to iterate_dir function.
>   This adds an offset parameter to the iterate_dir()
>   function. The new parameter allows the caller to specify
>   the offset to use.
> 
> Patch 2: fs: split off vfs_getdents function from getdents64 system call
>   This splits of the iterate_dir part of the syscall in its own
>   dedicated function. This allows to call the function directly from
>   io_uring.
> 
> Patch 3: io_uring: add support for getdents64
>   Adds the functions to io_uring to support getdents64.
> 
> There is also a patch series for the changes to liburing. This includes
> a new test. The patch series is called "liburing: add getdents support."
> 
> The following tests have been performed:
> - new liburing getdents test program has been run
> - xfstests have been run
> - both tests have been repeated with the kernel memory leak checker
>   and no leaks have been reported.

AFAICS, it completely breaks the "is the position known to be valid"
logics in a lot of ->iterate_dir() instances.  For a reasonably simple
example see e.g. ext2_readdir():

        bool need_revalidate = !inode_eq_iversion(inode, file->f_version);

.....
                if (unlikely(need_revalidate)) {
                        if (offset) {
                                offset = ext2_validate_entry(kaddr, offset, chunk_mask);
                                ctx->pos = (n<<PAGE_SHIFT) + offset;
                        }
                        file->f_version = inode_query_iversion(inode);
                        need_revalidate = false;
                }
and that, combined with
	* directory modifications bumping iversion
	* lseek explicitly cleaing ->f_version on location changes
and resulting position going back into ->f_pos, *BEFORE* we unlock the damn
file.

makes sure that we call ext2_validate_entry() for any untrusted position.

Your code breaks the living hell out of that.  First of all, the offset is
completely arbitrary and no amount of struct file-based checks is going to
be relevant.  Furthermore, this "we'd normalized the position, the valid
one will go into ->f_pos and do so before the caller does fdput_pos(), so
mark the file as known-good-position" logics also break - ->f_pos is *NOT*
locked and new positition, however valid it might be, is not going to
be put there.

How could that possibly work?  I'm not saying that the current variant is
particularly nice, but the need to avoid revalidation of position on each
getdents(2) call is real, and this revalidation is not cheap.

Furthermore, how would that work on e.g. shmem/tmpfs/ramfs/etc.?
There the validation is not an issue, simply because the position is
not used at all - a per-file cursor is, and it's moved by dcache_dir_lseek()
and dcache_readdir().

Folks, readdir from arbitrary position had been a source of pain since
mid-80s.  A plenty of PITA all around, and now you introduce another
API that shoves things into the same orifices without even a benefit of
going through lseek(2), or any of the exclusion warranties regarding
the file position?

Al Viro Dec. 31, 2021, 11:15 p.m. UTC | #4

On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote:
> >
> > This series adds support for getdents64 in liburing. The intent is to
> > provide a more complete I/O interface for io_uring.
> 
> Ack, this series looks much more natural to me now.
> 
> I think some of the callers of "iterate_dir()" could probably be
> cleaned up with the added argument, but for this series I prefer that
> mindless approach of just doing "&(arg1)->f_pos" as the third argument
> that is clearly a no-op.
> 
> So the end result is perhaps not as beautiful as it could be, but I
> think the patch series DTRT.

It really does not.  Think what happens if you pass e.g. an odd position
to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.

Al Viro Jan. 1, 2022, 7:59 p.m. UTC | #5

On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote:
> On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote:
> > >
> > > This series adds support for getdents64 in liburing. The intent is to
> > > provide a more complete I/O interface for io_uring.
> > 
> > Ack, this series looks much more natural to me now.
> > 
> > I think some of the callers of "iterate_dir()" could probably be
> > cleaned up with the added argument, but for this series I prefer that
> > mindless approach of just doing "&(arg1)->f_pos" as the third argument
> > that is clearly a no-op.
> > 
> > So the end result is perhaps not as beautiful as it could be, but I
> > think the patch series DTRT.
> 
> It really does not.  Think what happens if you pass e.g. an odd position
> to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.

[A bit of a braindump follows; my apologies for reminding of some
unpleasant parts of history]

The real problem here is the userland ABI along the lines of pread for
directories.  There's a good reason why we (as well as everybody else,
AFAIK) do not have pgetdents(2).

Handling of IO positions for directories had been causing trouble
ever since the directory layouts had grown support for long names.
Originally a directory had been an array of fixed-sized entries; back
then ls(1) simply used read(2).  No special logics for handling offsets,
other than "each entry is 16 bytes, so you want to read a multiple of
16 starting from offset that is a multiple of 16".

As soon as FFS had introduced support for names longer than 14 characters,
the things got more complicated - there's no predicting if given position
is an entry boundary.  Worse, what used to have been an entry boundary
might very well come to point to the middle of a name - all it takes is
unlink()+creat() done since the time the position used to be OK.

Details are filesystem-dependent; e.g. for original FFS all multiples of
512 are valid offsets, and each entry has its length stored in bytes 4
and 5, so one needs to read the relevant 512 bytes of contents and walk
the list of entries until they get to (or past) the position that needs
to be validated.  For ext2 it's a bit more unpleasant, since the chunk
size is not 512 bytes - it's a block size, i.e. might easily by 4Kb.
For more complex layouts it gets even nastier.

Having to do that kind of validation on each directory entry access would
be too costly.  That's somewhat mitigated since the old readdir(2) is no
longer used, and we return multiple entries per syscall (getdents(2)).
With the exclusion between directory reads and modifications, that allows
to limit validation to the first entry returned on that syscall.

It's still too costly, though.  The next part of mitigation is to use
the fact that valid position will remain valid until somebody modifies a
directory, so we don't need to validate if directory hadn't been changed
since the last time getdents(2) had gotten to this position.  Of course,
explicit change of position since that last getdents(2) means that we
can't skip validation.

Another fun part is synthetic filesystems like tmpfs - there we don't
have any persistent directory contents we could use as source of offsets.
All we have is a cyclic list of dentries; memory address of dentry is
obviously not a candidate - trying to validate _that_ would be beyond
unpleasant.  So we use the position in the list.  To avoid the O(directory
size) cost of walking the list to the position we are asked to read from,
we insert a cursor into the list and have directory reads and seeks move
it as needed.  File position is not authoritative there - the cursor is.
They can get out of sync - if you read almost to the end of directory,
unlink the first file, use lseek(fd, SEEK_CUR, 0) to find position, then
lseek to 0 and back to the position reported, you'll end up one entry
further than you would without those lseeks.  Userland is generally OK
with that (and we are within the POSIX warranties).  However, if that
kind of "out of sync" can happen without any directory modifications
involved, we have trouble.

Directories are not well-suited for random access.  Not since mid-80s.
It's possible, but not cheap and there's a lot of non-obvious corner cases
where directory modifications are involved.  Some of those must work,
or the userland will break.  Telling which ones those are is not trivial.

pgetdents(2) would be a bad idea by itself; making it asynchronous is
the stuff of nightmares.  Please, reconsider that ABI - AFAICS, there's
no way to do it cheaply and it will take modifying each ->iterate_dir()
just to avoid breaking the existing uses of those.

Jann Horn Jan. 3, 2022, 7:03 a.m. UTC | #6

On Sat, Jan 1, 2022 at 8:59 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote:
> > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <shr@fb.com> wrote:
> > > >
> > > > This series adds support for getdents64 in liburing. The intent is to
> > > > provide a more complete I/O interface for io_uring.
> > >
> > > Ack, this series looks much more natural to me now.
> > >
> > > I think some of the callers of "iterate_dir()" could probably be
> > > cleaned up with the added argument, but for this series I prefer that
> > > mindless approach of just doing "&(arg1)->f_pos" as the third argument
> > > that is clearly a no-op.
> > >
> > > So the end result is perhaps not as beautiful as it could be, but I
> > > think the patch series DTRT.
> >
> > It really does not.  Think what happens if you pass e.g. an odd position
> > to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.
>
> [A bit of a braindump follows; my apologies for reminding of some
> unpleasant parts of history]
>
> The real problem here is the userland ABI along the lines of pread for
> directories.  There's a good reason why we (as well as everybody else,
> AFAIK) do not have pgetdents(2).
>
> Handling of IO positions for directories had been causing trouble
> ever since the directory layouts had grown support for long names.
> Originally a directory had been an array of fixed-sized entries; back
> then ls(1) simply used read(2).  No special logics for handling offsets,
> other than "each entry is 16 bytes, so you want to read a multiple of
> 16 starting from offset that is a multiple of 16".
>
> As soon as FFS had introduced support for names longer than 14 characters,
> the things got more complicated - there's no predicting if given position
> is an entry boundary.  Worse, what used to have been an entry boundary
> might very well come to point to the middle of a name - all it takes is
> unlink()+creat() done since the time the position used to be OK.
>
> Details are filesystem-dependent; e.g. for original FFS all multiples of
> 512 are valid offsets, and each entry has its length stored in bytes 4
> and 5, so one needs to read the relevant 512 bytes of contents and walk
> the list of entries until they get to (or past) the position that needs
> to be validated.  For ext2 it's a bit more unpleasant, since the chunk
> size is not 512 bytes - it's a block size, i.e. might easily by 4Kb.
> For more complex layouts it gets even nastier.
>
> Having to do that kind of validation on each directory entry access would
> be too costly.  That's somewhat mitigated since the old readdir(2) is no
> longer used, and we return multiple entries per syscall (getdents(2)).
> With the exclusion between directory reads and modifications, that allows
> to limit validation to the first entry returned on that syscall.
>
> It's still too costly, though.  The next part of mitigation is to use
> the fact that valid position will remain valid until somebody modifies a
> directory, so we don't need to validate if directory hadn't been changed
> since the last time getdents(2) had gotten to this position.  Of course,
> explicit change of position since that last getdents(2) means that we
> can't skip validation.

And for this validation caching to work properly, AFAIU you need to
hold the file->f_pos_lock (or have exclusive access to the "struct
file"), which happens in the normal getdents() path via fdget_pos().
This guarantees that the readdir handler has exclusive access to the
file's ->f_version, which has to stay in sync with the position.

By the way, this makes me wonder: If I open() an ext4 directory and
then concurrently modify the directory, call readdir() on it, and
issue IORING_OP_READV SQEs with ->off == -1, can that cause ext4 to
think that filesystem corruption has occurred?

io_uring has some dodgy code that seems to be reading and writing
file->f_pos without holding the file->f_pos_lock. And even if the file
doesn't have an f_op->read or f_op->read_iter handler, I think you
might be able to read ->f_pos of an ext4 directory and write the value
back later, unless I'm missing a check somewhere?

io_prep_rw() grabs file->f_pos; then later, io_read() calls
io_iter_do_read() (which will fail with -EINVAL), and then the error
path goes through kiocb_done(), which writes the position back to
req->file->f_pos. So I think the following race might work:

First, open an FD referencing a directory on an ext4 filesystem. Read
part of the directory's entries from the FD with getdents(). Then
modify the directory such that the FD's ->f_version is now stale. Then
do the race:

task A: start a IORING_OP_READV op on the FD and let it grab the stale
offset from file->f_pos
task B: run getdents() on the FD. after this, ->f_version is
up-to-date and ->f_pos points to the start of a dentry
task A: continue handling the IORING_OP_READV: io_iter_do_read()
fails, kiocb_done() overwrites the valid ->f_pos with the stale value
while leaving ->f_version up-to-date

Afterwards, the file might have an ->f_pos pointing in the middle of a
dentry while ->f_version indicates that it is valid (meaning it's
supposed to point to the start of a dentry). If you then do another
getdents() call on the FD, I think you might get garbage back and/or
make the kernel think that the filesystem is corrupted (depending on
what's stored at that offset)?

Jens Axboe Jan. 3, 2022, 3 p.m. UTC | #7

On 1/2/22 11:03 PM, Jann Horn wrote:
> io_uring has some dodgy code that seems to be reading and writing
> file->f_pos without holding the file->f_pos_lock. And even if the file
> doesn't have an f_op->read or f_op->read_iter handler, I think you
> might be able to read ->f_pos of an ext4 directory and write the value
> back later, unless I'm missing a check somewhere?

I posted an RFC to hold f_pos_lock across those operations before the
break:

https://lore.kernel.org/io-uring/8a9e55bf-3195-5282-2907-41b2f2b23cc8@kernel.dk/

picking it up this week and flushing it out, hopefully.

Linus Torvalds Jan. 3, 2022, 6:55 p.m. UTC | #8

On Sun, Jan 2, 2022 at 11:04 PM Jann Horn <jannh@google.com> wrote:
>
> And for this validation caching to work properly, AFAIU you need to
> hold the file->f_pos_lock (or have exclusive access to the "struct
> file"), which happens in the normal getdents() path via fdget_pos().
> This guarantees that the readdir handler has exclusive access to the
> file's ->f_version, which has to stay in sync with the position.

Yes.

So the whole 'preaddir()' model was wrong, and thanks to Al for noticing.

It turns out that you cannot pass in a different 'pos' than f_pos,
because we have that very tight coupling between the 'struct file' and
readdir().

It's not just about f_pos and f_version, either - Al pointed out the
virtual filesystems, which use a special dentry cursor to traverse the
child dentries for readdir, and that one uses 'file->private_data'.

So the directory position isn't really about some simple passed-in
pos, it has locking rules, it has validation, and it has actual
secondary data in the file pointer.

                  Linus

Al Viro Jan. 3, 2022, 9:12 p.m. UTC | #9

On Mon, Jan 03, 2022 at 08:03:51AM +0100, Jann Horn wrote:

> io_prep_rw() grabs file->f_pos; then later, io_read() calls
> io_iter_do_read() (which will fail with -EINVAL), and then the error
> path goes through kiocb_done(), which writes the position back to
> req->file->f_pos. So I think the following race might work:

Why does it touch ->f_pos on failure, anyway?  It's a bug, plain and
simple; note that read(2) and write(2) are explicitly requested to
leave IO position alone if they return an error.  See e.g.
fs/read_write.c:ksys_read() -
                ret = vfs_read(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
Position update happens only on success (and only for non-stream
files, at that).

No matter how special io-uring is (it's not covered by POSIX, for
obvious reasons), this is simply wrong, directories or no directories.

Dominique Martinet April 16, 2023, 10:06 p.m. UTC | #10

Stefan Roesch wrote on Tue, Dec 21, 2021 at 08:40:01AM -0800:
> This series adds support for getdents64 in liburing. The intent is to
> provide a more complete I/O interface for io_uring.

[reminder: This series was dropped a year ago after Al Viro rightly
pointed out that we can't just pass an arbitrary offset to iterate_dir
as offset validation is costly and not always possible at all]

I'm digging an old topic here because I was looking at trying io_uring
on a toy project (specifically, exporting infos out of /sys/fs/cgroup
recursively), but was partly held back by the lack of getdents or
equivalent interface for the crawler part -- and existing bricks like
fts or nftw predate openat interfaces and just didn't appeal to me, but
in general just mixing in io_uring and synchronous getdents sounded a
bit of a pain.

Given it's been over a year I guess it's not such a much needed feature,
but when you're centering your program loop around the ring having the
occasional getdents/readdir call is a bit cumbersome.


This is probably a naive idea, but would it make sense discussing just
making it fit the current getdents API:
Given the problem seems to be revalidating the offset, since the main
usecase is probably to just go through a directory, perhaps the file's
offset could be updated just like the real call and offset validation be
skipped if the parameter is equal to the file's offset?
Giving a different offset would be equivalent to lseek + getdents and
update the position as well, so e.g. rewinding the directory with a seek
0 would work if an application needs to check a directory's content
multiple times.

Heck, seek to any offset other than 0 could just be refused for any sane
usage, it just doesn't make sense to use anything else in practice;
that'd make it a defacto "dumb getdents + rewinddir".
The API could be made simpler to use/clear about expectations by making
the last parameter "bool rewind_first" instead of an offset (or I guess
a flag with just a single valid bit if we were to really implement this)


This isn't very io_uring-like, but it'd allow for an io_uring-centric
program to also handle some directory processing, and wouldn't expose
anything that's not currently already possible to do (as long as each
processed op brings in its own context, the only "race" I can see in
iterate_dir is that file->f_pos can go back to a previously also valid
position if some iterations are faster than others, assuming it's an
atomic access, and that'd probably warrant a READ/WRITE_ONCE or
something.. but that's not a new problem, user threads can already
hammer getdents in parallel on a single fd if they want useless results)


What do you think?


I'm just asking with a toy in mind so it's not like I have any strong
opinion, but I'd be happy to rework Stefan's patches if it looks like it
might make sense.
(And if the consensus is that this will make hell break loose I'll just
forget about it before sinking more time in it, just catching up was fun
enough!)

Cheers,

[v7,0/3] io_uring: add getdents64 support

Message

Comments