| Message ID | 20240803225054.GY5334@ZenIV (mailing list archive) |
|---|---|
| State | New |
| Series | fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE |
On Sat, Aug 03, 2024 at 11:50:54PM +0100, Al Viro wrote:

> The best way to fix that is in dup_fd() - there we both can easily check
> whether we might need to fix anything up and see which word needs the
> upper bits cleared.

PS: we could try and handle that in copy_fd_bitmaps() instead, but there it's harder to tell whether we need to bother in the first place, so that adds overhead on fairly hot paths.
On Sat, 3 Aug 2024 at 15:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +	unsigned int n = open_files / BITS_PER_LONG;
> +	if (n % BITS_PER_LONG) {
> +		unsigned long mask = BITMAP_LAST_WORD_MASK(n);
> +		new_fdt->full_fds_bits[n / BITS_PER_LONG] &= mask;
> +	}

Ouch. I hate this.

And yes, clearly this fixes a bug, and clearly we don't maintain the full_fds_bits[] array properly, but I still hate it.

The fact that we had this bug in the first place shows that our full_fds_bits code is too subtle, and this makes it worse.

I really would prefer to do this in copy_fd_bitmaps() instead.

Also, your comment says

	... nothing of that sort happens to open_fds and
	* close_on_exec, since there the part that needs to be copied
	* will always fall on a word boundary.

but it doesn't really explain *why* those things are on a word boundary.

So I think part of the commentary should be that we always make sure that the normal bitmaps are full words (pointing out the ALIGN(min(count, max_fds), BITS_PER_LONG) in sane_fdtable_size), and then explaining that the reason the 'full_fds_bits' bitmap is special is that it is not a bitmap of file descriptors, it's a bitmap of 'words of file descriptor bitmaps'.

So yes, the full_fds_bits bitmap is very special indeed, and yes, we got this all wrong, but I feel like the bug is in copy_fd_bitmaps(), and the fixup should be there too...

In fact, I think the full_fds_bits copy in copy_fd_bitmaps() should just be entirely rewritten. Even the initial copy shouldn't be done with some byte-wise memcpy/memset, since those are all 'unsigned long' arrays, and the size is aligned. So it should be done on words, but whatever.

And the full_fds_bits case should use our actual bitmap functions.
So instead of

	cpy = BITBIT_SIZE(count);
	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
	memset((char *)nfdt->full_fds_bits + cpy, 0, set);

I think the code could (and should) do just

	cpy = count / BITS_PER_LONG;
	max = nfdt->max_fds / BITS_PER_LONG;

	bitmap_copy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
	bitmap_clear(nfdt->full_fds_bits, count, max - cpy);

or something like that. Doesn't that seem more straightforward?

Of course, it's also entirely possible that I have just confused myself, and the above is buggy garbage. I tried to think this through, but I really tried to think more along the lines of "how can we make this legible so that the code is actually understandable". Because I think your patch almost certainly fixes the bug, but it sure isn't easy to understand.

            Linus
On Sat, 3 Aug 2024 at 16:51, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> In fact, I think the full_fds_bits copy in copy_fd_bitmaps() should
> just be entirely rewritten. Even the initial copy shouldn't be done
> with some byte-wise memcpy/memset, since those are all 'unsigned long'
> arrays, and the size is aligned. So it should be done on words, but
> whatever.
>
> And the full_fds_bits case should use our actual bitmap functions.

Something ENTIRELY UNTESTED like this, IOW. Am I just completely off my rocker? You decide.

            Linus
On Sat, Aug 03, 2024 at 04:51:44PM -0700, Linus Torvalds wrote:

> I think the code could (and should) do just
>
> 	cpy = count / BITS_PER_LONG;
> 	max = nfdt->max_fds / BITS_PER_LONG;
>
> 	bitmap_copy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
> 	bitmap_clear(nfdt->full_fds_bits, count, max - cpy);
>
> or something like that.
>
> Doesn't that seem more straightforward?

Considered that; unfortunately, bitmap_clear() is... suboptimal, to put it mildly. And on the expand_fdtable() pathway (as well as normal fork()) we might have to clear quite a bit.

> Of course, it's also entirely possible that I have just confused
> myself, and the above is buggy garbage. I tried to think this through,
> but I really tried to think more along the lines of "how can we make
> this legible so that the code is actually understandable". Because I
> think your patch almost certainly fixes the bug, but it sure isn't
> easy to understand.

Bitmap handling in there certainly needs to be cleaned up; no arguments here. If anything, we might want something like

	bitmap_copy_and_extend(unsigned long *to, const unsigned long *from,
			       unsigned int count, unsigned int size)
	{
		unsigned int copy = BITS_TO_LONGS(count);

		memcpy(to, from, copy * sizeof(long));
		if (count % BITS_PER_LONG)
			to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
		memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
	}

and use it for all three bitmaps. Compiler should be able to spot CSEs here - let it decide if keeping them is cheaper than reevaluating...

Would you hate

	static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
				    unsigned int count, unsigned int size)
	{
		bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
				count, size);
		bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
				count, size);
		bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
				count / BITS_PER_LONG, size / BITS_PER_LONG);
	}

with callers passing nfdt->size as the 4th argument?
That said, the posted variant might be better wrt backporting - it has no overhead on the common codepaths; something like the variant above (and I think that use of bitmap_clear() will be worse) would need profiling. So it might make sense to use that for backports, with cleanups done on top of it. Up to you...

FWIW, I'm crawling through fs/file.c at the moment (in part - to replace Documentation/filesystems/files.rst with something that wouldn't be years out of date), and there's a growing patch series around that area. This thing got caught while trying to document copy_fd_bitmaps() and convince myself that it works correctly...
On Sat, 3 Aug 2024 at 17:34, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Bitmap handling in there certainly needs to be cleaned up; no arguments
> here. If anything, we might want something like
>
> 	bitmap_copy_and_extend(unsigned long *to, const unsigned long *from,
> 			       unsigned int count, unsigned int size)
> 	{
> 		unsigned int copy = BITS_TO_LONGS(count);
>
> 		memcpy(to, from, copy * sizeof(long));
> 		if (count % BITS_PER_LONG)
> 			to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
> 		memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
> 	}
>
> and use it for all three bitmaps.

That certainly works for me. If you want to specialize it a bit more, you might do a separate routine for the "known to be word aligned" case. I extended on my previous patch. I'm not sure it's worth it, and it's still ENTIRELY UNTESTED, but you get the idea..

But I'm ok with your generic bitmap version too.

            Linus
On Sun, Aug 04, 2024 at 01:34:05AM +0100, Al Viro wrote:

> static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> 		unsigned int count, unsigned int size)
> {
> 	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> 			count, size);
> 	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
> 			count, size);
> 	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
> 			count / BITS_PER_LONG, size / BITS_PER_LONG);
> }

One variant would be

	#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) /* longs in fdt->open_fds[] */

	static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
					   unsigned int copy_words)
	{
		unsigned int nwords = fdt_words(nfdt);

		bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
				copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
		bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
				copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
		bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
				copy_words, nwords);
	}

with callers being

	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
and
	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);

resp. That would compile into pure memcpy()+memset() for the main bitmaps and at least as good as bitmap_copy()+bitmap_clear() for full_fds_bits...
BTW, the current description I've got next to struct fdtable definition:

/*
 * fdtable's capacity is stored in ->max_fds; it is set when instance is
 * created and never changed after that. It is always a positive multiple
 * of BITS_PER_LONG (i.e. no less than BITS_PER_LONG). It can't exceed
 * INT_MAX - userland often stores descriptors as int, treating anything
 * negative as an error, so the numbers past INT_MAX are definitely
 * not usable as descriptors.
 *
 * If fdt->max_fds is equal to N, fdt can hold descriptors in range 0..N-1.
 *
 * ->fd points to an N-element array of file pointers. If descriptor #k is
 * open, ->fd[k] points to the file associated with it; otherwise it's NULL.
 *
 * Both ->open_fds and ->close_on_exec point to N-bit bitmaps.
 * Since N is a multiple of BITS_PER_LONG, those are arrays of unsigned long
 * with N / BITS_PER_LONG elements and all bits are in use.
 * Bit #k in ->open_fds is set iff descriptor #k has been claimed.
 * In that case the corresponding bit in ->close_on_exec determines whether
 * it should be closed on execve(). If a descriptor is *not* claimed, the
 * value of the corresponding bit in ->close_on_exec is undetermined.
 *
 * ->full_fds_bits points to an N/BITS_PER_LONG-bit bitmap, i.e. to an array of
 * unsigned long with BITS_TO_LONGS(N / BITS_PER_LONG) elements. Note that
 * N is *not* promised to be a multiple of BITS_PER_LONG^2, so there may be
 * unused bits in the last word. Its contents are a function of ->open_fds -
 * bit #k is set in it iff all bits stored in ->open_fds[k] are set (i.e. if
 * ->open_fds[k] == ~0ul). All unused bits in the last word must be clear.
 * That bitmap is used for a cheaper search for unclaimed descriptors.
 *
 * Note that all of those are pointers - the arrays themselves are elsewhere.
 * The whole thing is reached from a files_struct instance; the reason
 * we do not keep those pointers in files_struct itself is that on resize
 * we want to flip all of them at once *and* we want the readers to be
 * able to work locklessly. This is a device for switching all of those
 * in one store operation. Freeing the replaced instance is RCU-delayed;
 * ->rcu is used to do that.
 */
On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> 	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> 			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);

Ok, thinking about it, I like this interface, since it looks like the compiler would see that it's in BITS_PER_LONG chunks if we inline it and decide it's important.

So make it so.

            Linus
On Sun, Aug 04, 2024 at 08:18:14AM -0700, Linus Torvalds wrote:
> On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > 	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> > 			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
>
> Ok, thinking about it, I like this interface, since it looks like the
> compiler would see that it's in BITS_PER_LONG chunks if we inline it
> and decide it's important.
>
> So make it so.

FWIW, what I'm testing right now is the patch below; it does fix the reproducer and it seems to be having no problems with xfstests so far. Needs profiling; again, no visible slowdowns on xfstests, but...

diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..655338effe9c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
 #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
 
+#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
 /*
  * Copy 'count' fd bits from the old table to the new table and clear the extra
  * space if any. This does not copy the file pointers. Called with the files
  * spinlock held for write.
  */
-static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
+				   unsigned int copy_words)
 {
-	unsigned int cpy, set;
-
-	cpy = count / BITS_PER_BYTE;
-	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)nfdt->open_fds + cpy, 0, set);
-	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
-	memset((char *)nfdt->close_on_exec + cpy, 0, set);
-
-	cpy = BITBIT_SIZE(count);
-	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
-	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
-	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
+	unsigned int nwords = fdt_words(nfdt);
+
+	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
+			copy_words, nwords);
 }
 
 /*
@@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
 }
 
 /*
@@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		open_files = sane_fdtable_size(old_fdt, max_fds);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..d3b66d77df7a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
 	dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
 }
 
+static inline void bitmap_copy_and_extend(unsigned long *to,
+					  const unsigned long *from,
+					  unsigned int count, unsigned int size)
+{
+	unsigned int copy = BITS_TO_LONGS(count);
+
+	memcpy(to, from, copy * sizeof(long));
+	if (count % BITS_PER_LONG)
+		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
+	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
+}
+
 /*
  * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
  * machines the order of hi and lo parts of numbers match the bitmap structure.
> Reproducer follows:
>
> #define _GNU_SOURCE
> #include <linux/close_range.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/mman.h>
>
> void is_open(int fd)
> {
> 	printf("#%d is %s\n", fd,
> 		fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
> }
>
> int child(void *unused)
> {
> 	while(1) {
> 	}
> 	return 0;
> }
>
> int main(void)
> {
> 	char *stack;
> 	pid_t pid;
>
> 	stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
> 		MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
> 	if (stack == MAP_FAILED) {
> 		perror("mmap");
> 		return -1;
> 	}
>
> 	pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL);
> 	if (pid == -1) {
> 		perror("clone");
> 		return -1;
> 	}
> 	for (int i = 2; i < 128; i++)
> 		dup2(0, i);
> 	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);
>
> 	is_open(64);
> 	printf("dup(0) => %d, expected 64\n", dup(0));
>
> 	kill(pid, 9);
> 	return 0;

Could you please add that reproducer to tools/testing/selftests/core/close_range_test.c

	TEST(close_range_bitmap_corruption)
	{
	}

Really, it doesn't have to be pretty but these repros in there really have been helpful finding such corruptions when run with a proper k*san config.
On Sat, Aug 03, 2024 at 11:50:54PM GMT, Al Viro wrote:
> [in vfs.git#fixes]
>
> copy_fd_bitmaps(new, old, count) is expected to copy the first
> count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
> the rest with zeroes. What it does is copying enough words
> (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
> That works fine, *if* all bits past the cutoff point are
> clear. Otherwise we are risking garbage from the last word
> we'd copied.
>
> For most of the callers that is true - expand_fdtable() has
> count equal to old->max_fds, so there's no open descriptors
> past count, let alone fully occupied words in ->open_fds[],
> which is what bits in ->full_fds_bits[] correspond to.
>
> The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
> which is the smallest multiple of BITS_PER_LONG that covers all
> opened descriptors below max_fds. In the common case (copying on
> fork()) max_fds is ~0U, so all opened descriptors will be below
> it and we are fine, by the same reasons why the call in expand_fdtable()
> is safe.
>
> Unfortunately, there is a case where max_fds is less than that
> and where we might, indeed, end up with junk in ->full_fds_bits[] -
> close_range(from, to, CLOSE_RANGE_UNSHARE) with
> 	* descriptor table being currently shared
> 	* 'to' being above the current capacity of descriptor table
> 	* 'from' being just under some chunk of opened descriptors.
> In that case we end up with observably wrong behaviour - e.g. spawn
> a child with CLONE_FILES, get all descriptors in range 0..127 open,
> then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
> up with descriptor #128, despite #64 being observably not open.
>
> The best way to fix that is in dup_fd() - there we both can easily check
> whether we might need to fix anything up and see which word needs the
> upper bits cleared.
>
> Reproducer follows:
>
> #define _GNU_SOURCE
> #include <linux/close_range.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <sched.h>
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/mman.h>
>
> void is_open(int fd)
> {
> 	printf("#%d is %s\n", fd,
> 		fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
> }
>
> int child(void *unused)
> {
> 	while(1) {
> 	}
> 	return 0;
> }
>
> int main(void)
> {
> 	char *stack;
> 	pid_t pid;
>
> 	stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE,
> 		MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
> 	if (stack == MAP_FAILED) {
> 		perror("mmap");
> 		return -1;
> 	}
>
> 	pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL);
> 	if (pid == -1) {
> 		perror("clone");
> 		return -1;
> 	}
> 	for (int i = 2; i < 128; i++)
> 		dup2(0, i);
> 	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);
>
> 	is_open(64);
> 	printf("dup(0) => %d, expected 64\n", dup(0));
>
> 	kill(pid, 9);
> 	return 0;
> }
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Hm, this should probably also get a fixes tag for 4b3b3b3b3b3b ("vfs: clear remainder of 'full_fds_bits' in dup_fd()") (and possibly for the CLOSE_RANGE_UNSHARE addition).
On Mon, Aug 05, 2024 at 09:22:08AM +0200, Christian Brauner wrote:

> Really, it doesn't have to be pretty but these repros in there really
> have been helpful finding such corruptions when run with a proper k*san
> config.

See below; so far it survives beating (and close_range_test passes with the patch, while failing the last test on mainline). BTW, EXPECT_... alone is sufficient for it to whine, but it doesn't actually fail the test, so don't we need exit(EXIT_FAILURE) on (at least some of) the previous tests? Hadn't played with the kselftest before, so...

diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..655338effe9c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
 #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
 
+#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
 /*
  * Copy 'count' fd bits from the old table to the new table and clear the extra
  * space if any. This does not copy the file pointers. Called with the files
  * spinlock held for write.
  */
-static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
-			    unsigned int count)
+static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
+				   unsigned int copy_words)
 {
-	unsigned int cpy, set;
-
-	cpy = count / BITS_PER_BYTE;
-	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
-	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
-	memset((char *)nfdt->open_fds + cpy, 0, set);
-	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
-	memset((char *)nfdt->close_on_exec + cpy, 0, set);
-
-	cpy = BITBIT_SIZE(count);
-	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
-	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
-	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
+	unsigned int nwords = fdt_words(nfdt);
+
+	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
+			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
+	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
+			copy_words, nwords);
 }
 
 /*
@@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memcpy(nfdt->fd, ofdt->fd, cpy);
 	memset((char *)nfdt->fd + cpy, 0, set);
 
-	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
 }
 
 /*
@@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 		open_files = sane_fdtable_size(old_fdt, max_fds);
 	}
 
-	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 8c4768c44a01..d3b66d77df7a 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
 	dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
 }
 
+static inline void bitmap_copy_and_extend(unsigned long *to,
+					  const unsigned long *from,
+					  unsigned int count, unsigned int size)
+{
+	unsigned int copy = BITS_TO_LONGS(count);
+
+	memcpy(to, from, copy * sizeof(long));
+	if (count % BITS_PER_LONG)
+		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
+	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
+}
+
 /*
  * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
  * machines the order of hi and lo parts of numbers match the bitmap structure.
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index 991c473e3859..12b4eb9d0434 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -589,4 +589,39 @@ TEST(close_range_cloexec_unshare_syzbot)
 	EXPECT_EQ(close(fd3), 0);
 }
 
+TEST(close_range_bitmap_corruption)
+{
+	pid_t pid;
+	int status;
+	struct __clone_args args = {
+		.flags = CLONE_FILES,
+		.exit_signal = SIGCHLD,
+	};
+
+	/* get the first 128 descriptors open */
+	for (int i = 2; i < 128; i++)
+		EXPECT_GE(dup2(0, i), 0);
+
+	/* get descriptor table shared */
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		/* unshare and truncate descriptor table down to 64 */
+		if (sys_close_range(64, ~0U, CLOSE_RANGE_UNSHARE))
+			exit(EXIT_FAILURE);
+
+		ASSERT_EQ(fcntl(64, F_GETFD), -1);
+		/* ... and verify that the range 64..127 is not
+		   stuck "fully used" according to secondary bitmap */
+		EXPECT_EQ(dup(0), 64)
+			exit(EXIT_FAILURE);
+		exit(EXIT_SUCCESS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 TEST_HARNESS_MAIN
On Sun, Aug 04, 2024 at 10:13:22PM +0100, Al Viro wrote:
> On Sun, Aug 04, 2024 at 08:18:14AM -0700, Linus Torvalds wrote:
> > On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > 	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> > > 			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> >
> > Ok, thinking about it, I like this interface, since it looks like the
> > compiler would see that it's in BITS_PER_LONG chunks if we inline it
> > and decide it's important.
> >
> > So make it so.
>
> FWIW, what I'm testing right now is the patch below; it does fix the reproducer
> and it seems to be having no problems with xfstests so far. Needs profiling;
> again, no visible slowdowns on xfstests, but...

Seems to work, no visible slowdowns. Pushed into #fixes.

BTW, speaking of fdtable, the longer I look at it, the more I'm tempted to try and kill it off completely. Look:

->fdt->max_fds is non-decreasing

->fdt->full_fds_bits only accessed under ->files_lock

->fdt->open_fds is sometimes used with rcu_read_lock() alone, but such users are very few - it's bitmap_weight(fdt->open_fds, fdt->max_fds); in proc_readfd_count() (stat on /proc/*/fd) and max_select_fds(); the latter is checking if anything in the fd_set_bits passed to select(2) is not opened, in addition to looking for the maximal descriptor passed to select(2) in that. Then we do fdget() on all of them during the main loop of select(). Sure, we want -EBADF if it hadn't been opened at all and we want EPOLLNVAL on subsequent passes, but that could've been handled easily in the main loop itself. As for the bitmap_weight(), I seriously suspect that keeping the count (all updates are under ->files_lock) might be cheap enough, and IIRC we already had cgroup (or was it bpf?) folks trying to get that count in really scary ways...

->fdt->close_on_exec has two places accessing it with rcu_read_lock() alone (F_GETFD and alloc_bprm()). In both cases we have already done fdget() on the same descriptor.

So... do we really need that indirect? The problem would be seeing ->max_fds update before that of the array pointers. Smells like that should be doable with barrier(s) - and not many, at that... Note that if we co-allocate the file references' array and bitmaps, we could e.g. slap rcu_head over the beginning of ->open_fds bitmap after the sucker's gone from files_struct and do RCU-delayed freeing on that.

Am I missing something obvious here?
On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> So... do we really need that indirect? The problem would be
> seeing ->max_fds update before that of the array pointers.

The reason is simply so that we can have one single allocation.

In fact, quite often, it's zero allocations when you can use the 'struct fdtable fdtab' that is embedded in 'struct files_struct'. But the 'struct fdtable' was a convenient way to allocate all those bitmaps _and_ the 'struct file *' array all together.

That said, I agree that we could just then put the pointers to said single allocation right into 'struct files_struct'. So the actual indirection after the allocation is pointless.

And yes, I think it's entirely a historical artifact of how that thing grew to be. Long long long ago there was no secondary allocation at all, and MAX_OPEN was fixed at 20. Because dammit, twenty open files is all you ever really need, and all the bitmaps fit comfortably in one single word even on 32-bit machines.

Those were the days.

            Linus
On Mon, Aug 05, 2024 at 05:04:05PM -0700, Linus Torvalds wrote:
> On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > So... do we really need that indirect? The problem would be
> > seeing ->max_fds update before that of the array pointers.
>
> The reason is simply so that we can have one single allocation.
>
> In fact, quite often, it's zero allocations when you can use the
> 'struct fdtable fdtab' that is embedded in 'struct files_struct'.

More to the point, we use arrays embedded into files_struct.

> But
> the 'struct fdtable' was a convenient way to allocate all those
> bitmaps _and_ the 'struct file *' array all together.

I don't think so - IIRC, it was introduced when we added RCU'd file lookup. Let me check... Yep; badf16621c1f "[PATCH] files: break up files struct", with RCU support as rationale. Followed by ab2af1f50050 "[PATCH] files: files struct with RCU".

Before those commits ->max_fds and ->fd used to live in files_struct and fget() (OK, fcheck_files()) had been taking ->files_lock, so that concurrent expand_files() would not screw us over.

The problem was not just the need to delay freeing old ->fd array; that could be dealt with easily enough. Think what would've happened if fcheck_files() ended up fetching new value of ->max_fds and old value of ->fd, which pointed to pre-expansion array. Indirection allowed to update both in one store.

The thing is, ->max_fds for successive ->fdt is monotonically increasing. So a lockless reader seeing the old size is fine with the new table - we just need to prevent the opposite.

Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand (all under ->files_lock, etc.) paired with smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side be enough, or do we need an explicit smp_wmb/smp_rmb in there?

> And yes, I think it's entirely a historical artifact of how that thing
> grew to be. Long long long ago there was no secondary allocation at
> all, and MAX_OPEN was fixed at 20.

Yes - but that had changed way before 2005 when those patches went in. Separate allocation of bitmaps is 2.3.12pre7 and separate allocation of ->fd is even older - 2.1.90pre1. NR_OPEN had been 1024 at that point; earlier history is

	0.97:	20 -> 32
	0.98.4:	32 -> 256
	2.1.26:	256 -> 1024
On Tue, Aug 06, 2024 at 02:02:17AM GMT, Al Viro wrote:
> On Mon, Aug 05, 2024 at 05:04:05PM -0700, Linus Torvalds wrote:
> > On Mon, 5 Aug 2024 at 16:45, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > So... do we really need that indirect? The problem would be
> > > seeing ->max_fds update before that of the array pointers.
> >
> > The reason is simply so that we can have one single allocation.
> >
> > In fact, quite often, it's zero allocations when you can use the
> > 'struct fdtable fdtab' that is embedded in 'struct files_struct'.
>
> More to the point, we use arrays embedded into files_struct.
>
> > But
> > the 'struct fdtable' was a convenient way to allocate all those
> > bitmaps _and_ the 'struct file *' array all together.
>
> I don't think so - IIRC, it was introduced when we added RCU'd
> file lookup. Let me check... Yep; badf16621c1f "[PATCH] files:
> break up files struct", with RCU support as rationale. Followed
> by ab2af1f50050 "[PATCH] files: files struct with RCU".
>
> Before those commits ->max_fds and ->fd used to live in
> files_struct and fget() (OK, fcheck_files()) had been taking
> ->files_lock, so that concurrent expand_files() would not
> screw us over.
>
> The problem was not just the need to delay freeing old ->fd
> array; that could be dealt with easily enough. Think what
> would've happened if fcheck_files() ended up fetching
> new value of ->max_fds and old value of ->fd, which pointed
> to pre-expansion array. Indirection allowed to update
> both in one store.
>
> The thing is, ->max_fds for successive ->fdt is monotonously
> increasing. So a lockless reader seeing the old size is
> fine with the new table - we just need to prevent the opposite.
>
> Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand
> (all under ->files_lock, etc.) paired with
> smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side
> be enough, or do we need an explicit smp_wmb/smp_rmb in there?

Afair, smp_load_acquire() would be a barrier for both later loads and stores and smp_store_release() would be a barrier for both earlier loads and stores. Iiuc, here we only care about ordering stores to ->fd and max_fds on the write side and about ordering loads of max_fds and ->fd on the reader side. The reader doesn't actually write anything.

In other words, we want to make ->fd visible before max_fds on the write side and we want to load max_fds after ->fd.

So I think smp_wmb() and smp_rmb() would be sufficient. I also find it clearer in this case.
On Mon, Aug 05, 2024 at 07:54:29PM GMT, Al Viro wrote:
> On Mon, Aug 05, 2024 at 09:22:08AM +0200, Christian Brauner wrote:
>
> > Really, it doesn't have to be pretty but these repros in there really
> > have been helpful finding such corruptions when run with a proper k*san
> > config.
>
> See below; so far it survives beating (and close_range_test passes with
> the patch, while failing the last test on mainline).  BTW, EXPECT_...
> alone is sufficient for it to whine, but it doesn't actually fail the
> test, so don't we need exit(EXIT_FAILURE) on (at least some of) the
> previous tests?  Hadn't played with the kselftest before, so...

The selftest infrastructure is a bit weird and I'm no expert myself.
Any failure in EXPECT_*() will cause the test to fail, but the rest of
the test will keep running. Using ASSERT_*() instead of EXPECT_*() will
also cause the test to fail, but it stops the test immediately. So no
matter whether it's EXPECT_*() or ASSERT_*(), the end result should be
that the test run fails:

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index 991c473e3859..3f7257487b85 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -37,6 +37,8 @@ TEST(core_close_range)
 	int i, ret;
 	int open_fds[101];
 
+	EXPECT_NE(0, 0);
+
 	for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
 		int fd;

> ./close_range_test
TAP version 13
1..7
# Starting 7 tests from 1 test cases.
#  RUN           global.core_close_range ...
# close_range_test.c:40:core_close_range:Expected 0 (0) != 0 (0)
# core_close_range: Test failed
#          FAIL  global.core_close_range
not ok 1 global.core_close_range
#  RUN           global.close_range_unshare ...
#            OK  global.close_range_unshare
ok 2 global.close_range_unshare
#  RUN           global.close_range_unshare_capped ...
#            OK  global.close_range_unshare_capped
ok 3 global.close_range_unshare_capped
#  RUN           global.close_range_cloexec ...
#            OK  global.close_range_cloexec
ok 4 global.close_range_cloexec
#  RUN           global.close_range_cloexec_unshare ...
#            OK  global.close_range_cloexec_unshare
ok 5 global.close_range_cloexec_unshare
#  RUN           global.close_range_cloexec_syzbot ...
#            OK  global.close_range_cloexec_syzbot
ok 6 global.close_range_cloexec_syzbot
#  RUN           global.close_range_cloexec_unshare_syzbot ...
#            OK  global.close_range_cloexec_unshare_syzbot
ok 7 global.close_range_cloexec_unshare_syzbot
# FAILED: 6 / 7 tests passed.
# Totals: pass:6 fail:1 xfail:0 xpass:0 skip:0 error:0

>
> diff --git a/fs/file.c b/fs/file.c
> index a11e59b5d602..655338effe9c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -46,27 +46,23 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
>  #define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
>  #define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
>  
> +#define fdt_words(fdt) ((fdt)->max_fds / BITS_PER_LONG) // words in ->open_fds
>  /*
>   * Copy 'count' fd bits from the old table to the new table and clear the extra
>   * space if any.  This does not copy the file pointers.  Called with the files
>   * spinlock held for write.
>   */
> -static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> -			    unsigned int count)
> +static inline void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> +				   unsigned int copy_words)
>  {
> -	unsigned int cpy, set;
> -
> -	cpy = count / BITS_PER_BYTE;
> -	set = (nfdt->max_fds - count) / BITS_PER_BYTE;
> -	memcpy(nfdt->open_fds, ofdt->open_fds, cpy);
> -	memset((char *)nfdt->open_fds + cpy, 0, set);
> -	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
> -	memset((char *)nfdt->close_on_exec + cpy, 0, set);
> -
> -	cpy = BITBIT_SIZE(count);
> -	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
> -	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
> -	memset((char *)nfdt->full_fds_bits + cpy, 0, set);
> +	unsigned int nwords = fdt_words(nfdt);
> +
> +	bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds,
> +			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> +	bitmap_copy_and_extend(nfdt->close_on_exec, ofdt->close_on_exec,
> +			copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG);
> +	bitmap_copy_and_extend(nfdt->full_fds_bits, ofdt->full_fds_bits,
> +			copy_words, nwords);
>  }
>  
>  /*
> @@ -84,7 +80,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
>  	memcpy(nfdt->fd, ofdt->fd, cpy);
>  	memset((char *)nfdt->fd + cpy, 0, set);
>  
> -	copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
> +	copy_fd_bitmaps(nfdt, ofdt, fdt_words(ofdt));
>  }
>  
>  /*
> @@ -379,7 +375,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
>  		open_files = sane_fdtable_size(old_fdt, max_fds);
>  	}
>  
> -	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
> +	copy_fd_bitmaps(new_fdt, old_fdt, open_files / BITS_PER_LONG);
>  
>  	old_fds = old_fdt->fd;
>  	new_fds = new_fdt->fd;
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 8c4768c44a01..d3b66d77df7a 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -270,6 +270,18 @@ static inline void bitmap_copy_clear_tail(unsigned long *dst,
>  	dst[nbits / BITS_PER_LONG] &= BITMAP_LAST_WORD_MASK(nbits);
>  }
>  
> +static inline void bitmap_copy_and_extend(unsigned long *to,
> +					  const unsigned long *from,
> +					  unsigned int count, unsigned int size)
> +{
> +	unsigned int copy = BITS_TO_LONGS(count);
> +
> +	memcpy(to, from, copy * sizeof(long));
> +	if (count % BITS_PER_LONG)
> +		to[copy - 1] &= BITMAP_LAST_WORD_MASK(count);
> +	memset(to + copy, 0, bitmap_size(size) - copy * sizeof(long));
> +}
> +
>  /*
>   * On 32-bit systems bitmaps are represented as u32 arrays internally. On LE64
>   * machines the order of hi and lo parts of numbers match the bitmap structure.
> diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
> index 991c473e3859..12b4eb9d0434 100644
> --- a/tools/testing/selftests/core/close_range_test.c
> +++ b/tools/testing/selftests/core/close_range_test.c
> @@ -589,4 +589,39 @@ TEST(close_range_cloexec_unshare_syzbot)
>  	EXPECT_EQ(close(fd3), 0);
>  }
>  
> +TEST(close_range_bitmap_corruption)
> +{
> +	pid_t pid;
> +	int status;
> +	struct __clone_args args = {
> +		.flags = CLONE_FILES,
> +		.exit_signal = SIGCHLD,
> +	};
> +
> +	/* get the first 128 descriptors open */
> +	for (int i = 2; i < 128; i++)
> +		EXPECT_GE(dup2(0, i), 0);
> +
> +	/* get descriptor table shared */
> +	pid = sys_clone3(&args, sizeof(args));
> +	ASSERT_GE(pid, 0);
> +
> +	if (pid == 0) {
> +		/* unshare and truncate descriptor table down to 64 */
> +		if (sys_close_range(64, ~0U, CLOSE_RANGE_UNSHARE))
> +			exit(EXIT_FAILURE);
> +
> +		ASSERT_EQ(fcntl(64, F_GETFD), -1);
> +		/* ... and verify that the range 64..127 is not
> +		   stuck "fully used" according to secondary bitmap */
> +		EXPECT_EQ(dup(0), 64)
> +			exit(EXIT_FAILURE);
> +		exit(EXIT_SUCCESS);
> +	}
> +
> +	EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
>  TEST_HARNESS_MAIN
On Tue, Aug 06, 2024 at 10:41:59AM +0200, Christian Brauner wrote:
> > Would rcu_assign_pointer of pointers + smp_store_release of max_fds on expand
> > (all under ->files_lock, etc.) paired with
> > smp_load_acquire of max_fds + rcu_dereference of ->fd on file lookup side
> > be enough, or do we need an explicit smp_wmb/smp_rmb in there?
>
> Afair, smp_load_acquire() would be a barrier for both later loads and
> stores and smp_store_release() would be a barrier for both earlier loads
> and stores.
>
> Iiuc, here we only care about ordering stores to ->fd and max_fds on the
> write side and about ordering loads of max_fds and ->fd on the reader
> side. The reader doesn't actually write anything.
>
> In other words, we want to make ->fd visible before max_fds on the write
> side and we want to load max_fds after ->fd.
>
> So I think smp_wmb() and smp_rmb() would be sufficient. I also find it
> clearer in this case.

It's not the question of sufficiency; it's whether anything cheaper can
be had.
On Tue, 6 Aug 2024 at 09:32, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> > So I think smp_wmb() and smp_rmb() would be sufficient. I also find it
> > clearer in this case.
>
> It's not the question of sufficiency; it's whether anything cheaper can be
> had.

I'd avoid smp_rmb(), since that can be very expensive, but
"smp_load_acquire()" is about as cheap as it comes. As is
"smp_store_release()" and "smp_wmb()".

So if it's mostly about just ordering with max_fds, I'd suggest just
doing the store of the bigger max_fds value with a smp_store_release()
(which guarantees that any previous stores of the bitmap pointers etc
have been done first and are visible), and then before accessing the
arrays, do the load of the max_fds thing with a "smp_load_acquire()",
and it should all be good.

But the array pointers themselves also need the same
smp_store_release() when updating, just to guarantee that the newly
copied contents are ordered. Which implies having to read those too
with smp_load_acquire(), regardless of the max_fds value read.

But on x86, smp_store_release/smp_load_acquire is completely free. And
on other architectures it's generally one of the cheapest possible
ordering constraints. So I doubt it's a performance issue even outside
of x86, and avoiding a double indirection is probably a win.

              Linus
diff --git a/fs/file.c b/fs/file.c
index a11e59b5d602..7f0ab8636d7c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -380,6 +380,20 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	}
 
 	copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+	if (unlikely(max_fds != NR_OPEN_MAX)) {
+		/*
+		 * if there had been opened descriptors past open_files,
+		 * the last copied word in full_fds_bits might have picked
+		 * stray bits; nothing of that sort happens to open_fds and
+		 * close_on_exec, since there the part that needs to be copied
+		 * will always fall on a word boundary.
+		 */
+		unsigned int n = open_files / BITS_PER_LONG;
+		if (n % BITS_PER_LONG) {
+			unsigned long mask = BITMAP_LAST_WORD_MASK(n);
+			new_fdt->full_fds_bits[n / BITS_PER_LONG] &= mask;
+		}
+	}
 
 	old_fds = old_fdt->fd;
 	new_fds = new_fdt->fd;
[in vfs.git#fixes]

copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest
with zeroes. What it does is copying enough words
(BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works
fine, *if* all bits past the cutoff point are clear. Otherwise we are
risking garbage from the last word we'd copied.

For most of the callers that is true - expand_fdtable() has count equal
to old->max_fds, so there's no open descriptors past count, let alone
fully occupied words in ->open_fds[], which is what bits in
->full_fds_bits[] correspond to.

The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all opened
descriptors below max_fds. In the common case (copying on fork())
max_fds is ~0U, so all opened descriptors will be below it and we are
fine, by the same reasons why the call in expand_fdtable() is safe.

Unfortunately, there is a case where max_fds is less than that and where
we might, indeed, end up with junk in ->full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
	* descriptor table being currently shared
	* 'to' being above the current capacity of descriptor table
	* 'from' being just under some chunk of opened descriptors.

In that case we end up with observably wrong behaviour - e.g. spawn a
child with CLONE_FILES, get all descriptors in range 0..127 open, then
close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up
with descriptor #128, despite #64 being observably not open.

The best way to fix that is in dup_fd() - there we both can easily check
whether we might need to fix anything up and see which word needs the
upper bits cleared.
Reproducer follows:

#define _GNU_SOURCE
#include <linux/close_range.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdbool.h>
#include <sys/mman.h>

void is_open(int fd)
{
	printf("#%d is %s\n", fd,
	       fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened");
}

int child(void *unused)
{
	while (1) { }
	return 0;
}

int main(void)
{
	char *stack;
	pid_t pid;

	stack = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	pid = clone(child, stack + 1024 * 1024, CLONE_FILES | SIGCHLD, NULL);
	if (pid == -1) {
		perror("clone");
		return -1;
	}

	for (int i = 2; i < 128; i++)
		dup2(0, i);

	close_range(64, ~0U, CLOSE_RANGE_UNSHARE);

	is_open(64);
	printf("dup(0) => %d, expected 64\n", dup(0));

	kill(pid, 9);
	return 0;
}

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---