
fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE

Message ID 20240803225054.GY5334@ZenIV (mailing list archive)
State New
Series: fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE

Commit Message

Al Viro Aug. 3, 2024, 10:50 p.m. UTC
[in vfs.git#fixes]

copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
the rest with zeroes.  What it actually does is copy enough words
(BITS_TO_LONGS(count/BITS_PER_LONG) of them), then memset the rest.
That works fine *if* all bits past the cutoff point are
clear.  Otherwise we risk inheriting garbage from the last word
copied.

For most of the callers that is true - expand_fdtable() has
count equal to old->max_fds, so there are no open descriptors
past count, let alone fully occupied words in ->open_fds[],
which is what the bits in ->full_fds_bits[] correspond to.

The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all
opened descriptors below max_fds.  In the common case (copying on
fork()) max_fds is ~0U, so all opened descriptors will be below
it and we are fine, by the same reasons why the call in expand_fdtable()
is safe.

Unfortunately, there is a case where max_fds is less than that
and where we might, indeed, end up with junk in ->full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
	* descriptor table being currently shared
	* 'to' being above the current capacity of descriptor table
	* 'from' being just under some chunk of opened descriptors.
In that case we end up with observably wrong behaviour - e.g. spawn
a child with CLONE_FILES, get all descriptors in range 0..127 open,
then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
up with descriptor #128, despite #64 being observably not open.

The best way to fix that is in dup_fd() - there we both can easily check
whether we might need to fix anything up and see which word needs the
upper bits cleared.
    
Reproducer follows:

#define _GNU_SOURCE
#include <linux/close_range.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdbool.h>
#include <sys/mman.h>
    
void is_open(int fd)
{
	printf("#%d is %s\n", fd,
		fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
}
    
int child(void *unused)
{
	while(1) {
	}
	return 0;
}
    
int main(void)
{
	char *stack;
	pid_t pid;

	stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL);
	if (pid == -1) {
		perror("clone");
		return -1;
	}
	for (int i = 2; i < 128; i++)
	    dup2(0, i);
	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);

	is_open(64);
	printf("dup(0) => %d, expected 64\n", dup(0));

	kill(pid, 9);
	return 0;
}

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Al Viro Aug. 3, 2024, 11:06 p.m. UTC | #1
On Sat, Aug 03, 2024 at 11:50:54PM +0100, Al Viro wrote:

> The best way to fix that is in dup_fd() - there we both can easily check
> whether we might need to fix anything up and see which word needs the
> upper bits cleared.

PS: we could try to handle that in copy_fd_bitmaps() instead, but there
it's harder to tell whether we need to bother in the first place, so that
would add overhead on some fairly hot paths.
Linus Torvalds Aug. 3, 2024, 11:51 p.m. UTC | #2
On Sat, 3 Aug 2024 at 15:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +               unsigned int n = open_files / BITS_PER_LONG;
> +               if (n % BITS_PER_LONG) {
> +                       unsigned long mask = BITMAP_LAST_WORD_MASK(n);
> +                       new_fdt->full_fds_bits[n / BITS_PER_LONG] &= mask;
> +               }

Ouch. I hate this.

And yes, clearly this fixes a bug, and clearly we don't maintain the
full_fds_bits[] array properly, but I still hate it.

The fact that we had this bug in the first place shows that our
full_fds_bits code is too subtle, and this makes it worse.

I really would prefer to do this in copy_fd_bitmaps() instead.

Also, your comment says

                 ... nothing of that sort happens to open_fds and
                * close_on_exec, since there the part that needs to be copied
                * will always fall on a word boundary.

but it doesn't really explain *why* those things are on a word boundary.

So I think part of the commentary should be that we always make sure
that the normal bitmaps are full words (pointing out the

       ALIGN(min(count, max_fds), BITS_PER_LONG)

in sane_fdtable_size), and then explaining that the reason the
'full_fds_bits' bitmap is special is that it is not a bitmap of file
descriptors, it's a bitmap of 'words of file descriptor bitmaps'.

So yes, the full_fds_bits bitmap is very special indeed, and yes, we
got this all wrong, but I feel like the bug is in copy_fd_bitmaps(),
and the fixup should be there too...

In fact, I think the full_fds_bits copy in copy_fd_bitmaps() should
just be entirely rewritten. Even the initial copy shouldn't be done
with some byte-wise memcpy/memset, since those are all 'unsigned long'
arrays, and the size is aligned. So it should be done on words, but
whatever.

And the full_fds_bits case should use our actual bitmap functions.

So instead of

        cpy = BITBIT_SIZE(count);
        set = BITBIT_SIZE(nfdt->max_fds) - cpy;
        memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
        memset((char *)nfdt->full_fds_bits + cpy, 0, set);

I think the code could (and should) do just

        cpy = count / BITS_PER_LONG;
        max = nfdt->max_fds / BITS_PER_LONG;

        bitmap_copy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
        bitmap_clear(nfdt->full_fds_bits, count, max - cpy);

or something like that.

Doesn't that seem more straightforward?

Of course, it's also entirely possible that I have just confused
myself, and the above is buggy garbage. I tried to think this through,
but I really tried to think more along the lines of "how can we make
this legible so that the code is actually understandable". Because I
think your patch almost certainly fixes the bug, but it sure isn't
easy to understand.

           Linus
Linus Torvalds Aug. 4, 2024, 12:05 a.m. UTC | #3
On Sat, 3 Aug 2024 at 16:51, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In fact, I think the full_fds_bits copy in copy_fd_bitmaps() should
> just be entirely rewritten. Even the initial copy shouldn't be done
> with some byte-wise memcpy/memset, since those are all 'unsigned long'
> arrays, and the size is aligned. So it should be done on words, but
> whatever.
>
> And the full_fds_bits case should use our actual bitmap functions.

Something ENTIRELY UNTESTED like this, IOW.

Am I just completely off my rocker? You decide.

                 Linus
Al Viro Aug. 4, 2024, 12:34 a.m. UTC | #4
On Sat, Aug 03, 2024 at 04:51:44PM -0700, Linus Torvalds wrote:

> I think the code could (and should) do just
> 
>         cpy = count / BITS_PER_LONG;
>         max = nfdt->max_fds / BITS_PER_LONG;
> 
>         bitmap_copy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
>         bitmap_clear(nfdt->full_fds_bits, count, max - cpy);
> 
> or something like that.
> 
> Doesn't that seem more straightforward?

Considered that; unfortunately, bitmap_clear() is... suboptimal,
to put it mildly.  And on the expand_fdtable() pathway (as well
as normal fork()) we might have to clear quite a bit.

> Of course, it's also entirely possible that I have just confused
> myself, and the above is buggy garbage. I tried to think this through,
> but I really tried to think more along the lines of "how can we make
> this legible so that the code is actually understandable". Because I
> think your patch almost certainly fixes the bug, but it sure isn't
> easy to understand.

Bitmap handling in there certainly needs to be cleaned up; no arguments
here.  If anything, we might want something like

bitmap_copy_and_extend(unsigned long *to, const unsigned long *from,
			unsigned int count, unsigned int size)
{
	unsigned int copy = BITS_TO_LONGS(count);

	memcpy(to, from, copy * sizeof(long));
	if (count % BITS_PER_LONG)
		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
}

and use it for all three bitmaps.  The compiler should be able to spot the
CSEs here - let it decide whether keeping them is cheaper than reevaluating...
Would you hate
static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
                            unsigned int count, unsigned int size)
{
	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
				count, size);
	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
				count, size);
	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
				count / BITS_PER_LONG, size / BITS_PER_LONG);
}
with callers passing nfdt->max_fds as the 4th argument?

Said that, the posted variant might be better wrt backporting - it has
no overhead on the common codepaths; something like the variant above
(and I think that use of bitmap_clear() will be worse) would need
profiling.  So it might make sense to use that for backports, with
cleanups done on top of it.  Up to you...

FWIW, I'm crawling through fs/file.c at the moment (in part - to replace
Documentation/filesystems/files.rst with something that wouldn't be
years out of date), and there's a growing patch series around that
area.  This thing got caught while trying to document copy_fd_bitmaps()
and convince myself that it works correctly...
Linus Torvalds Aug. 4, 2024, 3:42 a.m. UTC | #5
On Sat, 3 Aug 2024 at 17:34, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Bitmap handling in there certainly needs to be cleaned up; no arguments
> here.  If anything, we might want something like
>
> bitmap_copy_and_extend(unsigned long *to, const unsigned long *from,
>                         unsigned int count, unsigned int size)
> {
>         unsigned int copy = BITS_TO_LONGS(count);
>
>         memcpy(to, from, copy * sizeof(long));
>         if (count % BITS_PER_LONG)
>                 to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
>         memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
> }
>
> and use it for all of three bitmaps.

That certainly works for me.

If you want to specialize it a bit more, you might do a separate
routine for the "known to be word aligned" case.

I extended on my previous patch. I'm not sure it's worth it, and it's
still ENTIRELY UNTESTED, but you get the idea..

But I'm ok with your generic bitmap version too.

              Linus
Al Viro Aug. 4, 2024, 3:47 a.m. UTC | #6
On Sun, Aug 04, 2024 at 01:34:05AM +0100, Al Viro wrote:

> static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
>                             unsigned int count, unsigned int size)
> {
> 	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> 				count, size);
> 	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
> 				count, size);
> 	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
> 				count / BITS_PER_LONG, size / BITS_PER_LONG);
> }

One variant would be

#define fdt_words(fdt)	((fdt)->max_fds / BITS_PER_LONG) /* longs in fdt->open_fds[] */

static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
                            unsigned int copy_words)
{
	unsigned int nwords = fdt_words(nfdt);

	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
			copy_words, nwords);
}
with callers being
        copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
and
        copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);

resp.  That would compile into pure memcpy()+memset() for the main bitmaps
and at least as good as bitmap_copy()+bitmap_clear() for full_fds_bits...
Al Viro Aug. 4, 2024, 4:17 a.m. UTC | #7
BTW, the current description I've got next to struct fdtable definition:

/*
 * fdtable's capacity is stored in ->max_fds; it is set when instance is
 * created and never changed after that.  It is always a positive multiple
 * of BITS_PER_LONG (i.e. no less than BITS_PER_LONG).  It can't exceed
 * INT_MAX - userland often stores descriptors as int, treating anything
 * negative as an error, so the numbers past INT_MAX are definitely
 * not usable as descriptors.
 *
 * if fdt->max_fds is equal to N, fdt can hold descriptors in range 0..N-1.
 *
 * ->fd points to an N-element array of file pointers.  If descriptor #k is
 * open, ->fd[k] points to the file associated with it; otherwise it's NULL.
 *
 * both ->open_fds and ->close_on_exec point to N-bit bitmaps.
 * Since N is a multiple of BITS_PER_LONG, those are arrays of unsigned long
 * with N / BITS_PER_LONG elements and all bits are in use.
 * Bit #k in ->open_fds is set iff descriptor #k has been claimed.
 * In that case the corresponding bit in ->close_on_exec determines whether
 * it should be closed on execve().  If a descriptor is *not* claimed, the
 * value of the corresponding bit in ->close_on_exec is undetermined.
 *
 * ->full_fds_bits points to an N/BITS_PER_LONG-bit bitmap, i.e. to an array of
 * unsigned long with BITS_TO_LONGS(N / BITS_PER_LONG) elements.  Note that
 * N is *not* promised to be a multiple of BITS_PER_LONG^2, so there may be
 * unused bits in the last word.  Its contents are a function of ->open_fds -
 * bit #k is set in it iff all bits stored in ->open_fds[k] are set (i.e. if
 * ->open_fds[k] == ~0ul).  All unused bits in the last word must be clear.
 * That bitmap is used for cheaper search for unclaimed descriptors.
 *
 * Note that all of those are pointers - arrays themselves are elsewhere.
 * The whole thing is reached from a files_struct instance; the reason
 * we do not keep those pointers in files_struct itself is that on resize
 * we want to flip all of them at once *and* we want the readers to be
 * able to work locklessly.  This is a device for switching all of those
 * in one store operation.  Freeing the replaced instance is RCU-delayed;
 * ->rcu is used to do that.
 */
Linus Torvalds Aug. 4, 2024, 3:18 p.m. UTC | #8
On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
>                         copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);

Ok, thinking about it, I like this interface, since it looks like the
compiler would see that it's in BITS_PER_LONG chunks if we inline it
and decide it's important.

So make it so.

               Linus
Al Viro Aug. 4, 2024, 9:13 p.m. UTC | #9
On Sun, Aug 04, 2024 at 08:18:14AM -0700, Linus Torvalds wrote:
> On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> >                         copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> 
> Ok, thinking about it, I like this interface, since it looks like the
> compiler would see that it's in BITS_PER_LONG chunks if we inline it
> and decide it's important.
> 
> So make it so.

FWIW, what I'm testing right now is the patch below; it does fix the reproducer
and it seems to be having no problems with xfstests so far.  Needs profiling;
again, no visible slowdowns on xfstests, but...


diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..655338effe9c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
 #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
 
+#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
 /*
  * Copy 'count' fd bits from the old table to the new table and clear the extra
  * space if any.  This does not copy the file pointers.  Called with the files
  * spinlock held for write.
  */
-static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
+			    unsigned int copy_words)
 {
-	unsigned int cpy, set;
-
-	cpy = count / BITS_PER_BYTE;
-	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)nfdt->open_fds + cpy, 0, set);
-	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
-	memset((char *)nfdt->close_on_exec + cpy, 0, set);
-
-	cpy = BITBIT_SIZE(count);
-	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
-	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
-	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
+	unsigned int nwords = fdt_words(nfdt);
+
+	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
+			copy_words, nwords);
 }
 
 /*
@@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
 }
 
 /*
@@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		open_files = sane_fdtable_size(old_fdt, max_fds);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..d3b66d77df7a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
 		dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
 }
 
+static inline void bitmap_copy_and_extend(unsigned long *to,
+					  const unsigned long *from,
+					  unsigned int count, unsigned int size)
+{
+	unsigned int copy = BITS_TO_LONGS(count);
+
+	memcpy(to, from, copy * sizeof(long));
+	if (count % BITS_PER_LONG)
+		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
+	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
+}
+
 /*
  * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
  * machines the order of hi and lo parts of numbers match the bitmap structure.
Christian Brauner Aug. 5, 2024, 7:22 a.m. UTC | #10
> Reproducer follows:
> 
> #define _GNU_SOURCE
> #include <linux/close_range.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/mman.h>
>     
> void is_open(int fd)
> {
> 	printf("#%d is %s\n", fd,
> 		fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
> }
>     
> int child(void *unused)
> {
> 	while(1) {
> 	}
> 	return 0;
> }
>     
> int main(void)
> {
> 	char *stack;
> 	pid_t pid;
> 
> 	stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
> 		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
> 	if (stack == MAP_FAILED) {
> 		perror("mmap");
> 		return -1;
> 	}
> 
> 	pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL);
> 	if (pid == -1) {
> 		perror("clone");
> 		return -1;
> 	}
> 	for (int i = 2; i < 128; i++)
> 	    dup2(0, i);
> 	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);
> 
> 	is_open(64);
> 	printf("dup(0) => %d, expected 64\n", dup(0));
> 
> 	kill(pid, 9);
> 	return 0;

Could you please add that reproducer to
tools/testing/selftests/core/close_range_test.c

TEST(close_range_bitmap_corruption)
{
}

Really, it doesn't have to be pretty but these repros in there really
have been helpful finding such corruptions when run with a proper k*san
config.
Christian Brauner Aug. 5, 2024, 9:48 a.m. UTC | #11
On Sat, Aug 03, 2024 at 11:50:54PM GMT, Al Viro wrote:
> [in vfs.git#fixes]
> 
> copy_fd_bitmaps(new, old, count) is expected to copy the first
> count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
> the rest with zeroes.  What it does is copying enough words
> (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
> That works fine, *if* all bits past the cutoff point are
> clear.  Otherwise we are risking garbage from the last word
> we'd copied.
> 
> For most of the callers that is true - expand_fdtable() has
> count equal to old->max_fds, so there's no open descriptors
> past count, let alone fully occupied words in ->open_fds[],
> which is what bits in ->full_fds_bits[] correspond to.
> 
> The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
> which is the smallest multiple of BITS_PER_LONG that covers all
> opened descriptors below max_fds.  In the common case (copying on
> fork()) max_fds is ~0U, so all opened descriptors will be below
> it and we are fine, by the same reasons why the call in expand_fdtable()
> is safe.
> 
> Unfortunately, there is a case where max_fds is less than that
> and where we might, indeed, end up with junk in ->full_fds_bits[] -
> close_range(from, to, CLOSE_RANGE_UNSHARE) with
> 	* descriptor table being currently shared
> 	* 'to' being above the current capacity of descriptor table
> 	* 'from' being just under some chunk of opened descriptors.
> In that case we end up with observably wrong behaviour - e.g. spawn
> a child with CLONE_FILES, get all descriptors in range 0..127 open,
> then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
> up with descriptor #128, despite #64 being observably not open.
> 
> The best way to fix that is in dup_fd() - there we both can easily check
> whether we might need to fix anything up and see which word needs the
> upper bits cleared.
>     
> Reproducer follows:
> 
> #define _GNU_SOURCE
> #include <linux/close_range.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/mman.h>
>     
> void is_open(int fd)
> {
> 	printf("#%d is %s\n", fd,
> 		fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
> }
>     
> int child(void *unused)
> {
> 	while(1) {
> 	}
> 	return 0;
> }
>     
> int main(void)
> {
> 	char *stack;
> 	pid_t pid;
> 
> 	stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
> 		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
> 	if (stack == MAP_FAILED) {
> 		perror("mmap");
> 		return -1;
> 	}
> 
> 	pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL);
> 	if (pid == -1) {
> 		perror("clone");
> 		return -1;
> 	}
> 	for (int i = 2; i < 128; i++)
> 	    dup2(0, i);
> 	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);
> 
> 	is_open(64);
> 	printf("dup(0) => %d, expected 64\n", dup(0));
> 
> 	kill(pid, 9);
> 	return 0;
> }
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Hm, this should probably also get a fixes tag for
4b3b3b3b3b3b ("vfs: clear remainder of 'full_fds_bits' in dup_fd()")
(and possibly for the CLOSE_RANGE_UNSHARE addition).
Al Viro Aug. 5, 2024, 6:54 p.m. UTC | #12
On Mon, Aug 05, 2024 at 09:22:08AM +0200, Christian Brauner wrote:

> Really, it doesn't have to be pretty but these repros in there really
> have been helpful finding such corruptions when run with a proper k*san
> config.

See below; so far it survives beating (and close_range_test passes with
the patch, while failing the last test on mainline).  BTW, EXPECT_...
alone is sufficient for it to whine, but it doesn't actually fail the
test, so don't we need exit(EXIT_FAILURE) on (at least some of) the
previous tests?  Hadn't played with the kselftest before, so...

diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..655338effe9c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
 #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
 
+#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
 /*
  * Copy 'count' fd bits from the old table to the new table and clear the extra
  * space if any.  This does not copy the file pointers.  Called with the files
  * spinlock held for write.
  */
-static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
+			    unsigned int copy_words)
 {
-	unsigned int cpy, set;
-
-	cpy = count / BITS_PER_BYTE;
-	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)nfdt->open_fds + cpy, 0, set);
-	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
-	memset((char *)nfdt->close_on_exec + cpy, 0, set);
-
-	cpy = BITBIT_SIZE(count);
-	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
-	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
-	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
+	unsigned int nwords = fdt_words(nfdt);
+
+	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
+			copy_words, nwords);
 }
 
 /*
@@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
 }
 
 /*
@@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		open_files = sane_fdtable_size(old_fdt, max_fds);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..d3b66d77df7a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
 		dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
 }
 
+static inline void bitmap_copy_and_extend(unsigned long *to,
+					  const unsigned long *from,
+					  unsigned int count, unsigned int size)
+{
+	unsigned int copy = BITS_TO_LONGS(count);
+
+	memcpy(to, from, copy * sizeof(long));
+	if (count % BITS_PER_LONG)
+		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
+	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
+}
+
 /*
  * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
  * machines the order of hi and lo parts of numbers match the bitmap structure.
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index 991c473e3859..12b4eb9d0434 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -589,4 +589,39 @@ TEST(close_range_cloexec_unshare_syzbot)
 	EXPECT_EQ(close(fd3), 0);
 }
 
+TEST(close_range_bitmap_corruption)
+{
+	pid_t pid;
+	int status;
+	struct __clone_args args = {
+		.flags = CLONE_FILES,
+		.exit_signal = SIGCHLD,
+	};
+
+	/* get the first 128 descriptors open */
+	for (int i = 2; i < 128; i++)
+		EXPECT_GE(dup2(0, i), 0);
+
+	/* get descriptor table shared */
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* unshare and truncate descriptor table down to 64 */
+		if (sys_close_range(64, ~0U, CLOSE_RANGE_UNSHARE))
+			exit(EXIT_FAILURE);
+
+		ASSERT_EQ(fcntl(64, F_GETFD), -1);
+		/* ... and verify that the range 64..127 is not
+		   stuck "fully used" according to secondary bitmap */
+		EXPECT_EQ(dup(0), 64)
+			exit(EXIT_FAILURE);
+		exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 TEST_HARNESS_MAIN
Al Viro Aug. 5, 2024, 11:44 p.m. UTC | #13
On Sun, Aug 04, 2024 at 10:13:22PM +0100, Al Viro wrote:
> On Sun, Aug 04, 2024 at 08:18:14AM -0700, Linus Torvalds wrote:
> > On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > >         bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> > >                         copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> > 
> > Ok, thinking about it, I like this interface, since it looks like the
> > compiler would see that it's in BITS_PER_LONG chunks if we inline it
> > and decide it's important.
> > 
> > So make it so.
> 
> FWIW, what I'm testing right now is the patch below; it does fix the reproducer
> and it seems to be having no problems with xfstests so far.  Needs profiling;
> again, no visible slowdowns on xfstests, but...

Seems to work, no visible slowdowns.  Pushed into #fixes.

BTW, speaking of fdtable, the longer I look at it, the more I'm tempted to
try and kill it off completely.

Look:
	->fdt->max_fds is non-decreasing
	->fdt->full_fds_bits only accessed under ->files_lock
	->fdt->open_fds is sometimes used with rcu_read_lock() alone,
but such users are very few - it's
	bitmap_weight(fdt->open_fds, fdt->max_fds);
in proc_readfd_count() (stat on /proc/*/fd) and max_select_fds();
the latter is checking if anything in the fd_set_bits passed to
select(2) is not opened, in addition to looking for the maximal
descriptor passed to select(2) in that.  Then we do fdget() on
all of them during the main loop of select().  Sure, we want
-EBADF if it hadn't been opened at all and we want EPOLLNVAL
on subsequent passes, but that could've been handled easily
in the main loop itself.  As for the bitmap_weight(), I seriously
suspect that keeping the count (all updates are under ->files_lock)
might be cheap enough, and IIRC we already had cgroup (or was it bpf?)
folks trying to get that count in really scary ways...
	->fdt->close_on_exec has two places accessing it
with rcu_read_lock() alone (F_GETFD and alloc_bprm()).  In both
cases we have already done fdget() on the same descriptor.

So... do we really need that indirect?  The problem would be
seeing ->max_fds update before that of the array pointers.
Smells like that should be doable with barrier(s) - and not
many, at that...  

Note that if we coallocate the file references' array and bitmaps,
we could e.g. slap rcu_head over the beginning of ->open_fds
bitmap after the sucker's gone from files_struct and do RCU-delayed
freeing on that.  Am I missing something obvious here?
Linus Torvalds Aug. 6, 2024, 12:04 a.m. UTC | #14
On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> So... do we really need that indirect?  The problem would be
> seeing ->max_fds update before that of the array pointers.

The reason is simply so that we can have one single allocation.

In fact, quite often, it's zero allocations when you can use the
'struct fdtable fdtab' that is embedded in 'struct files_struct'. But
the 'struct fdtable' was a convenient way to allocate all those
bitmaps _and_ the 'struct file *' array all together.

That said, I agree that we could just then put the pointers to said
single allocation right into 'struct files_struct'. So the actual
indirection after the allocation is pointless.

And yes, I think it's entirely a historical artifact of how that thing
grew to be. Long long long ago there was no secondary allocation at
all, and MAX_OPEN was fixed at 20.

Because dammit, twenty open files is all you ever really need, and all
the bitmaps fit comfortably in one single word even on 32-bit
machines.

Those were the days.

           Linus
Al Viro Aug. 6, 2024, 1:02 a.m. UTC | #15
On Mon, Aug 05, 2024 at 05:04:05PM -0700, Linus Torvalds wrote:
> On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > So... do we really need that indirect?  The problem would be
> > seeing ->max_fds update before that of the array pointers.
> 
> The reason is simply so that we can have one single allocation.
> 
> In fact, quite often, it's zero allocations when you can use the
> 'struct fdtable fdtab' that is embedded in 'struct files_struct'.

More to the point, we use arrays embedded into files_struct.

>But
> the 'struct fdtable' was a convenient way to allocate all those
> bitmaps _and_ the 'struct file *' array all together.

I don't think so - IIRC, it was introduced when we added RCU'd
file lookup.  Let me check...  Yep; badf16621c1f "[PATCH] files:
break up files struct", with RCU support as rationale.  Followed
by ab2af1f50050 "[PATCH] files: files struct with RCU".

Before those commits ->max_fds and ->fd used to live in
files_struct and fget() (OK, fcheck_files()) had been taking
->files_lock, so that concurrent expand_files() would not
screw us over.

The problem was not just the need to delay freeing old ->fd
array; that could be dealt with easily enough.  Think what
would've happened if fcheck_files() ended up fetching
new value of ->max_fds and old value of ->fd, which pointed
to the pre-expansion array.  The indirection allowed updating
both in one store.

The thing is, ->max_fds for successive ->fdt is monotonically
increasing.  So a lockless reader seeing the old size is
fine with the new table - we just need to prevent the opposite.

Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand
(all under ->files_lock, etc.) paired with
smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side
be enough, or do we need an explicit smp_wmb/smp_rmb in there?

> And yes, I think it's entirely a historical artifact of how that thing
> grew to be. Long long long ago there was no secondary allocation at
> all, and MAX_OPEN was fixed at 20.

Yes - but that had changed way before 2005 when those patches went in.
Separate allocation of bitmaps is 2.3.12pre7 and separate allocation of
->fd is even older - 2.1.90pre1.  NR_OPEN had been 1024 at that point;
earlier history is
	0.97: 20 -> 32
	0.98.4: 32 -> 256
	2.1.26: 256 -> 1024
Christian Brauner Aug. 6, 2024, 8:41 a.m. UTC | #16
On Tue, Aug 06, 2024 at 02:02:17AM GMT, Al Viro wrote:
> On Mon, Aug 05, 2024 at 05:04:05PM -0700, Linus Torvalds wrote:
> > On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > So... do we really need that indirect?  The problem would be
> > > seeing ->max_fds update before that of the array pointers.
> > 
> > The reason is simply so that we can have one single allocation.
> > 
> > In fact, quite often, it's zero allocations when you can use the
> > 'struct fdtable fdtab' that is embedded in 'struct files_struct'.
> 
> More to the point, we use arrays embedded into files_struct.
> 
> >But
> > the 'struct fdtable' was a convenient way to allocate all those
> > bitmaps _and_ the 'struct file *' array all together.
> 
> I don't think so - IIRC, it was introduced when we added RCU'd
> file lookup.  Let me check...  Yep; badf16621c1f "[PATCH] files:
> break up files struct", with RCU support as rationale.  Followed
> by ab2af1f50050 "[PATCH] files: files struct with RCU".
> 
> Before those commits ->max_fds and ->fd used to live in
> files_struct and fget() (OK, fcheck_files()) had been taking
> ->files_lock, so that concurrent expand_files() would not
> screw us over.
> 
> The problem was not just the need to delay freeing old ->fd
> array; that could be dealt with easily enough.  Think what
> would've happened if fcheck_files() ended up fetching
> new value of ->max_fds and old value of ->fd, which pointed
> to the pre-expansion array.  The indirection allowed updating
> both in one store.
> 
> The thing is, ->max_fds for successive ->fdt is monotonically
> increasing.  So a lockless reader seeing the old size is
> fine with the new table - we just need to prevent the opposite.
> 
> Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand
> (all under ->files_lock, etc.) paired with
> smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side
> be enough, or do we need an explicit smp_wmb/smp_rmb in there?

Afair, smp_load_acquire() would be a barrier for both later loads and
stores and smp_store_release() would be a barrier for both earlier loads
and stores.

Iiuc, here we only care about ordering stores to ->fd and max_fds on the
write side and about ordering loads of max_fds and ->fd on the reader
side. The reader doesn't actually write anything.

In other words, we want to make ->fd visible before max_fds on the write
side and we want to load max_fds after ->fd.

So I think smp_wmb() and smp_rmb() would be sufficient. I also find it
clearer in this case.
Christian Brauner Aug. 6, 2024, 9:11 a.m. UTC | #17
On Mon, Aug 05, 2024 at 07:54:29PM GMT, Al Viro wrote:
> On Mon, Aug 05, 2024 at 09:22:08AM +0200, Christian Brauner wrote:
> 
> > Really, it doesn't have to be pretty but these repros in there really
> > have been helpful finding such corruptions when run with a proper k*san
> > config.
> 
> See below; so far it survives beating (and close_range_test passes with
> the patch, while failing the last test on mainline).  BTW, EXPECT_...
> alone is sufficient for it to whine, but it doesn't actually fail the
> test, so don't we need exit(EXIT_FAILURE) on (at least some of) the
> previous tests?  Hadn't played with the kselftest before, so...

The selftest infrastructure is a bit weird and I'm no expert myself.
Any failure in EXPECT_*() will mark the test as failed but will let
the whole test keep running.
Using ASSERT_*() instead of EXPECT_*() will also mark the test as failed
but will additionally stop the test immediately.

So no matter whether EXPECT_*() or ASSERT_*() is used, the end result
should be that the test run fails:

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index 991c473e3859..3f7257487b85 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -37,6 +37,8 @@ TEST(core_close_range)
        int i, ret;
        int open_fds[101];

+       EXPECT_NE(0, 0);
+
        for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
                int fd;


> ./close_range_test
TAP version 13
1..7
# Starting 7 tests from 1 test cases.
#  RUN           global.core_close_range ...
# close_range_test.c:40:core_close_range:Expected 0 (0) != 0 (0)
# core_close_range: Test failed
#          FAIL  global.core_close_range
not ok 1 global.core_close_range
#  RUN           global.close_range_unshare ...
#            OK  global.close_range_unshare
ok 2 global.close_range_unshare
#  RUN           global.close_range_unshare_capped ...
#            OK  global.close_range_unshare_capped
ok 3 global.close_range_unshare_capped
#  RUN           global.close_range_cloexec ...
#            OK  global.close_range_cloexec
ok 4 global.close_range_cloexec
#  RUN           global.close_range_cloexec_unshare ...
#            OK  global.close_range_cloexec_unshare
ok 5 global.close_range_cloexec_unshare
#  RUN           global.close_range_cloexec_syzbot ...
#            OK  global.close_range_cloexec_syzbot
ok 6 global.close_range_cloexec_syzbot
#  RUN           global.close_range_cloexec_unshare_syzbot ...
#            OK  global.close_range_cloexec_unshare_syzbot
ok 7 global.close_range_cloexec_unshare_syzbot
# FAILED: 6 / 7 tests passed.
# Totals: pass:6 fail:1 xfail:0 xpass:0 skip:0 error:0

> 
> diff --git a/fs/file.c b/fs/file.c
> index a11e59b5d602..655338effe9c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
>  #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
>  #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
>  
> +#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
>  /*
>   * Copy 'count' fd bits from the old table to the new table and clear the extra
>   * space if any.  This does not copy the file pointers.  Called with the files
>   * spinlock held for write.
>   */
> -static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> -			    unsigned int count)
> +static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> +			    unsigned int copy_words)
>  {
> -	unsigned int cpy, set;
> -
> -	cpy = count / BITS_PER_BYTE;
> -	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
> -	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
> -	memset((char *)nfdt->open_fds + cpy, 0, set);
> -	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
> -	memset((char *)nfdt->close_on_exec + cpy, 0, set);
> -
> -	cpy = BITBIT_SIZE(count);
> -	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
> -	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
> -	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
> +	unsigned int nwords = fdt_words(nfdt);
> +
> +	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> +			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> +	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
> +			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> +	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
> +			copy_words, nwords);
>  }
>  
>  /*
> @@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
>  	memcpy(nfdt->fd, ofdt->fd, cpy);
>  	memset((char *)nfdt->fd + cpy, 0, set);
>  
> -	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
> +	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
>  }
>  
>  /*
> @@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
>  		open_files = sane_fdtable_size(old_fdt, max_fds);
>  	}
>  
> -	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
> +	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
>  
>  	old_fds = old_fdt->fd;
>  	new_fds = new_fdt->fd;
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 8c4768c44a01..d3b66d77df7a 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
>  		dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
>  }
>  
> +static inline void bitmap_copy_and_extend(unsigned long *to,
> +					  const unsigned long *from,
> +					  unsigned int count, unsigned int size)
> +{
> +	unsigned int copy = BITS_TO_LONGS(count);
> +
> +	memcpy(to, from, copy * sizeof(long));
> +	if (count % BITS_PER_LONG)
> +		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
> +	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
> +}
> +
>  /*
>   * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
>   * machines the order of hi and lo parts of numbers match the bitmap structure.
> diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
> index 991c473e3859..12b4eb9d0434 100644
> --- a/tools/testing/selftests/core/close_range_test.c
> +++ b/tools/testing/selftests/core/close_range_test.c
> @@ -589,4 +589,39 @@ TEST(close_range_cloexec_unshare_syzbot)
>  	EXPECT_EQ(close(fd3), 0);
>  }
>  
> +TEST(close_range_bitmap_corruption)
> +{
> +	pid_t pid;
> +	int status;
> +	struct __clone_args args = {
> +		.flags = CLONE_FILES,
> +		.exit_signal = SIGCHLD,
> +	};
> +
> +	/* get the first 128 descriptors open */
> +	for (int i = 2; i < 128; i++)
> +		EXPECT_GE(dup2(0, i), 0);
> +
> +	/* get descriptor table shared */
> +	pid = sys_clone3(&args, sizeof(args));
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		/* unshare and truncate descriptor table down to 64 */
> +		if (sys_close_range(64, ~0U, CLOSE_RANGE_UNSHARE))
> +			exit(EXIT_FAILURE);
> +
> +		ASSERT_EQ(fcntl(64, F_GETFD), -1);
> +		/* ... and verify that the range 64..127 is not
> +		   stuck "fully used" according to secondary bitmap */
> +		EXPECT_EQ(dup(0), 64)
> +			exit(EXIT_FAILURE);
> +		exit(EXIT_SUCCESS);
> +	}
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
>  TEST_HARNESS_MAIN
Al Viro Aug. 6, 2024, 4:32 p.m. UTC | #18
On Tue, Aug 06, 2024 at 10:41:59AM +0200, Christian Brauner wrote:

> > Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand
> > (all under ->files_lock, etc.) paired with
> > smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side
> > be enough, or do we need an explicit smp_wmb/smp_rmb in there?
> 
> Afair, smp_load_acquire() would be a barrier for both later loads and
> stores and smp_store_release() would be a barrier for both earlier loads
> and stores.
> 
> Iiuc, here we only care about ordering stores to ->fd and max_fds on the
> write side and about ordering loads of max_fds and ->fd on the reader
> side. The reader doesn't actually write anything.
> 
> In other words, we want to make ->fd visible before max_fds on the write
> side and we want to load max_fds after ->fd.
> 
> So I think smp_wmb() and smp_rmb() would be sufficient. I also find it
> clearer in this case.

It's not the question of sufficiency; it's whether anything cheaper can be
had.
Linus Torvalds Aug. 6, 2024, 5:01 p.m. UTC | #19
On Tue, 6 Aug 2024 at 09:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > So I think smp_wmb() and smp_rmb() would be sufficient. I also find it
> > clearer in this case.
>
> It's not the question of sufficiency; it's whether anything cheaper can be
> had.

I'd avoid smp_rmb(), since that can be very expensive, but
"smp_load_acquire()" is about as cheap as it comes. As is
"smp_store_release()" and "smp_wmb()".

So if it's mostly about just ordering with max_fds, I'd suggest just
doing the store of the bigger max_fds value with a smp_store_release()
(which guarantees that any previous stores of the bitmap pointers etc
have been done first and are visible), and then before accessing the
arrays, do the load of the max_fds thing with a "smp_load_acquire()",
and it should all be good.

But the array pointers themselves also need the same
smp_store_release() when updating, just to guarantee that the newly
copied contents are ordered. Which implies having to read those too
with smp_load_acquire(), regardless of the max_fds value read.

But on x86, smp_store_release/smp_load_acquire is completely free. And
on other architectures it's generally one of the cheapest possible
ordering constraints. So I doubt it's a performance issue even outside
of x86, and avoiding a double indirection is probably a win.

            Linus
diff mbox series

Patch

diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..7f0ab8636d7c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -380,6 +380,20 @@  struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	}
 
 	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	if (unlikely(max_fds != NR_OPEN_MAX)) {
+		/*
+		 * if there had been opened descriptors past open_files,
+		 * the last copied word in full_fds_bits might have picked
+		 * stray bits; nothing of that sort happens to open_fds and
+		 * close_on_exec, since there the part that needs to be copied
+		 * will always fall on a word boundary.
+		 */
+		unsigned int n = open_files / BITS_PER_LONG;
+		if (n % BITS_PER_LONG) {
+			unsigned long mask = BITMAP_LAST_WORD_MASK(n);
+			new_fdt->full_fds_bits[n / BITS_PER_LONG] &= mask;
+		}
+	}
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;