Message ID | 20241019210037.146825-1-kent.overstreet@linux.dev (mailing list archive)
---|---
State | New
Series | mm: Drop INT_MAX limit from kvmalloc()
On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote: > A user with a 75 TB filesystem reported the following journal replay > error: > https://github.com/koverstreet/bcachefs/issues/769 > > In journal replay we have to sort and dedup all the keys from the > journal, which means we need a large contiguous allocation. Given that > the user has 128GB of ram, the 2GB limit on allocation size has become > far too small. > > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Andrew Morton <akpm@linux-foundation.org> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > --- > mm/util.c | 6 ------ > 1 file changed, 6 deletions(-) > > diff --git a/mm/util.c b/mm/util.c > index 4f1275023eb7..c60df7723096 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > if (!gfpflags_allow_blocking(flags)) > return NULL; > > - /* Don't even allow crazy sizes */ > - if (unlikely(size > INT_MAX)) { > - WARN_ON_ONCE(!(flags & __GFP_NOWARN)); > - return NULL; > - } > - Err, and not replace it with _any_ limit? That seems very unwise. > /* > * kvmalloc() can always use VM_ALLOW_HUGE_VMAP, > * since the callers already cannot assume anything > -- > 2.45.2 >
On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote: > On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote: > > A user with a 75 TB filesystem reported the following journal replay > > error: > > https://github.com/koverstreet/bcachefs/issues/769 > > > > In journal replay we have to sort and dedup all the keys from the > > journal, which means we need a large contiguous allocation. Given that > > the user has 128GB of ram, the 2GB limit on allocation size has become > > far too small. > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > --- > > mm/util.c | 6 ------ > > 1 file changed, 6 deletions(-) > > > > diff --git a/mm/util.c b/mm/util.c > > index 4f1275023eb7..c60df7723096 100644 > > --- a/mm/util.c > > +++ b/mm/util.c > > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > > if (!gfpflags_allow_blocking(flags)) > > return NULL; > > > > - /* Don't even allow crazy sizes */ > > - if (unlikely(size > INT_MAX)) { > > - WARN_ON_ONCE(!(flags & __GFP_NOWARN)); > > - return NULL; > > - } > > - > > Err, and not replace it with _any_ limit? That seems very unwise. large allocations will go to either the page allocator or vmalloc, and they have their own limits. although I should have a look at that, and make sure we're not triggering the > MAX_ORDER warning in the page allocator unnecessarily when we could just call vmalloc().
+cc Linus, vmalloc reviewers On Sun, Oct 20, 2024 at 09:00:07AM -0400, Kent Overstreet wrote: > On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote: > > On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote: > > > A user with a 75 TB filesystem reported the following journal replay > > > error: > > > https://github.com/koverstreet/bcachefs/issues/769 > > > > > > In journal replay we have to sort and dedup all the keys from the > > > journal, which means we need a large contiguous allocation. Given that > > > the user has 128GB of ram, the 2GB limit on allocation size has become > > > far too small. > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > > --- > > > mm/util.c | 6 ------ > > > 1 file changed, 6 deletions(-) > > > > > > diff --git a/mm/util.c b/mm/util.c > > > index 4f1275023eb7..c60df7723096 100644 > > > --- a/mm/util.c > > > +++ b/mm/util.c > > > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > > > if (!gfpflags_allow_blocking(flags)) > > > return NULL; > > > > > > - /* Don't even allow crazy sizes */ > > > - if (unlikely(size > INT_MAX)) { > > > - WARN_ON_ONCE(!(flags & __GFP_NOWARN)); > > > - return NULL; > > > - } > > > - > > > > Err, and not replace it with _any_ limit? That seems very unwise. > > large allocations will go to either the page allocator or vmalloc, and > they have their own limits. Ah actually I misread it here, I see the allocation gets immediately sent off to __kmalloc_node_noprof() and thus that'll apply its own limits before doing this check prior to the vmalloc call. We actually do have a basic check in __vmalloc_node_range_noprof() that prevents _totally_ insane requests, checking that size >> PAGE_SHIFT <= totalram_pages(), so we shouldn't get anything too stupid here (I am thinking especially of ptr + size overflow type situations). But Linus explicitly introduced this INT_MAX check in commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls"), presumably for a reason, so have cc'd him here in case he has an objection to this which amounts to a revert of that patch. Assuming Linus doesn't object, I don't see how this is really doing anything different than just invoking __vmalloc_node_range_noprof() direct which we do in quite a few places anyway? I guess let's wait and see what he says or if Vlastimil/the vmalloc reviewers have any thoughts on this. But looks sane to me otherwise. > > although I should have a look at that, and make sure we're not > triggering the > MAX_ORDER warning in the page allocator unnecessarily when we could just call vmalloc(). OK I guess that would be a check prior to invoking __kmalloc_node_noprof() -> ... -> __alloc_pages_noprof() and WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp) ? Sounds sensible.
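To make the allocation flow under discussion concrete, here is a rough sketch of the kvmalloc() path - a simplified illustration, not the actual mm/util.c code; kvmalloc_sketch() is an invented name and the flag handling is abbreviated:

/*
 * Simplified sketch of the kvmalloc() path being discussed - not the
 * real mm/util.c code.  The kmalloc() attempt is bounded by the slab /
 * page allocator limits (KMALLOC_MAX_SIZE); past that we fall through
 * to vmalloc(), whose own sanity check is roughly that the request
 * fits in total RAM.
 */
static void *kvmalloc_sketch(size_t size, gfp_t flags)
{
	void *p;

	/* Try the kmalloc/page-allocator path first */
	p = kmalloc(size, flags | __GFP_NOWARN | __GFP_NORETRY);
	if (p)
		return p;

	/* vmalloc() may block, so non-blocking callers stop here */
	if (!gfpflags_allow_blocking(flags))
		return NULL;

	/*
	 * The check removed by this patch sat here:
	 *
	 *	if (unlikely(size > INT_MAX))
	 *		return NULL;
	 *
	 * Without it, the remaining cap is vmalloc()'s check that
	 * (size >> PAGE_SHIFT) does not exceed totalram_pages().
	 */
	return __vmalloc(size, flags);
}

The point both sides circle around is that removing the INT_MAX check does not remove every limit - the kmalloc and vmalloc layers keep their own - it only removes the one that was common to both paths.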
On Sun, Oct 20, 2024 at 05:44:37PM +0100, Lorenzo Stoakes wrote: > +cc Linus, vmalloc reviewers > > On Sun, Oct 20, 2024 at 09:00:07AM -0400, Kent Overstreet wrote: > > On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote: > > > On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote: > > > > A user with a 75 TB filesystem reported the following journal replay > > > > error: > > > > https://github.com/koverstreet/bcachefs/issues/769 > > > > > > > > In journal replay we have to sort and dedup all the keys from the > > > > journal, which means we need a large contiguous allocation. Given that > > > > the user has 128GB of ram, the 2GB limit on allocation size has become > > > > far too small. > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > > > --- > > > > mm/util.c | 6 ------ > > > > 1 file changed, 6 deletions(-) > > > > > > > > diff --git a/mm/util.c b/mm/util.c > > > > index 4f1275023eb7..c60df7723096 100644 > > > > --- a/mm/util.c > > > > +++ b/mm/util.c > > > > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > > > > if (!gfpflags_allow_blocking(flags)) > > > > return NULL; > > > > > > > > - /* Don't even allow crazy sizes */ > > > > - if (unlikely(size > INT_MAX)) { > > > > - WARN_ON_ONCE(!(flags & __GFP_NOWARN)); > > > > - return NULL; > > > > - } > > > > - > > > > > > Err, and not replace it with _any_ limit? That seems very unwise. > > > > large allocations will go to either the page allocator or vmalloc, and > > they have their own limits. > > Ah actually I misread it here, I see the allocation gets immediately sent > off to __kmalloc_node_noprof() and thus that'll apply its own limits before > doing this check prior to the vmalloc call. > > We actually do have a basic check in __vmalloc_node_range_noprof() that > prevents _totally_ insane requests, checking that size >> PAGE_SHIFT <= > totalram_pages(), so we shouldn't get anything too stupid here (I am > thinking especially of ptr + size overflow type situations). Yep, exactly. > But Linus explicitly introduced this INT_MAX check in commit 7661809d493b > ("mm: don't allow oversized kvmalloc() calls"), presumably for a reason, so > have cc'd him here in case he has an objection to this which amounts to a > revert of that patch. Ah, thanks for adding him. Yeah, the concern historically has been that integer truncation bugs are really nasty and entirely too common, and we have little in the way of tooling that can catch that. Unfortunately that's still the case today. Kees has been doing work on catching integer overflow, but I don't think we have anything coming yet for catching integer truncation. But given that vmalloc() already supports > INT_MAX requests, and memory sizes keep growing so 2GB is getting pretty small - I think it's time, this is going to come up in other places sooner or later.
On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > But given that vmalloc() already supports > INT_MAX requests, and memory > sizes keep growing so 2GB is getting pretty small - I think it's time, > this is going to come up in other places sooner or later. No. If you need 2GB+ memory for filesystem operations, you fix your code. 2GB is more than you *can* allocate on 32-bit platforms, and even on 64-bit ones it is often quite a lot. So even aside from the *fact* of integer overflows, that check isn't going away. Linus
On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > But given that vmalloc() already supports > INT_MAX requests, and memory > > sizes keep growing so 2GB is getting pretty small - I think it's time, > > this is going to come up in other places sooner or later. > > No. > > If you need 2GB+ memory for filesystem operations, you fix your code. This is for journal replay, where we've got a big array of keys and we need to sort them. The keys have to fit in memory (and had to fit in memory previously, for them to be dirty in the journal); this array is just _references_ to those keys. > 2GB is more than you *can* allocate on 32-bit platforms, and even on > 64-bit ones it is often quite a lot. This came up on a machine with 128GB of ram.
On Sun, 20 Oct 2024 at 11:53, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > This came up on a machine with 128GB of ram. So? Your argument is "we're using a ton of memory, so we can use a ton more". Not a great argument. How about you limit the amount of memory you use in the first place instead? Linus
On Sun, 20 Oct 2024 at 12:09, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > How about you limit the amount of memory you use in the first place instead? .. and to clarify: we're not making other parts of the kernel less robust because *you* are doing something stupid and odd. Linus
On Sun, Oct 20, 2024 at 12:09:02PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 11:53, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > This came up on a machine with 128GB of ram. > > So? > > Your argument is "we're using a ton of memory, so we can use a ton more". Datasets are going to grow as machines grow, yes. Otherwise, what are we building them for? > How about you limit the amount of memory you use in the first place instead? That'd be an artificial cap on performance.
On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 12:09, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > How about you limit the amount of memory you use in the first place instead? > > .. and to clarify: we're not making other parts of the kernel less > robust because *you* are doing something stupid and odd. Except, vmalloc() already behaves this way - so it seems to me you already have. I've already added a stupid workaround to the darray code to switch to calling vmalloc() directly, when necessary; this patch was a courtesy because if bcachefs is hitting this limit no doubt other things will be soon as well.
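The darray workaround itself is not shown in the thread; a minimal sketch of what such a fallback could look like is below (bch2_big_kvmalloc() is a hypothetical name, not an actual bcachefs function):

/*
 * Hypothetical helper, for illustration only: stay on kvmalloc() while
 * the request is within its historical INT_MAX cap, and go straight to
 * vmalloc() beyond that.
 */
static void *bch2_big_kvmalloc(size_t size, gfp_t gfp)
{
	if (size <= INT_MAX)
		return kvmalloc(size, gfp);

	return __vmalloc(size, gfp);
}

/*
 * kvfree() already handles both kmalloc and vmalloc memory, so freeing
 * doesn't need a matching wrapper.
 */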
On 10/20/24 20:53, Kent Overstreet wrote: > On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote: >> On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote: >> > >> > But given that vmalloc() already supports > INT_MAX requests, and memory >> > sizes keep growing so 2GB is getting pretty small - I think it's time, >> > this is going to come up in other places sooner or later. >> >> No. >> >> If you need 2GB+ memory for filesystem operations, you fix your code. > > This is for journal replay, where we've got a big array of keys and we > need to sort them. > > The keys have to fit in memory (and had to fit in memory previously, for > them to be dirty in the journal); What if the disk is moved to a smaller system, should the fs still mount there? (I don't mean such a small system that it can't vmalloc() 2GB specifically, but in principle...) > this array is just _references_ to > those keys. > >> 2GB is more than you *can* allocate on 32-bit platforms, and even on >> 64-bit ones it is often quite a lot. > > This came up on a machine with 128GB of ram.
On Sun, Oct 20, 2024 at 09:53:06PM +0200, Vlastimil Babka wrote: > On 10/20/24 20:53, Kent Overstreet wrote: > > On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote: > >> On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote: > >> > > >> > But given that vmalloc() already supports > INT_MAX requests, and memory > >> > sizes keep growing so 2GB is getting pretty small - I think it's time, > >> > this is going to come up in other places sooner or later. > >> > >> No. > >> > >> If you need 2GB+ memory for filesystem operations, you fix your code. > > > > This is for journal replay, where we've got a big array of keys and we > > need to sort them. > > > > The keys have to fit in memory (and had to fit in memory previously, for > > them to be dirty in the journal); > > What if the disk is moved to a smaller system, should the fs still mount > there? (I don't mean such a small system that it can't vmalloc() 2GB > specifically, but in principle...) You'll have to do journal replay on the bigger system. Once you've done that, it'll work just fine on the smaller system. (Now, trying to work with a 75TB filesystem on a small machine is going to be really painful if you ever need to fsck. That's just an inherently hard problem, but we've got fsck scalability/performance improvements in the works). But journal replay does inherently require the whole contents of the journal to fit in memory - we have to do the sort + dedup so that we can overlay the contents of the journal over the btree until journal replay is finished so that we can get a consistent view of the filesystem, which we need so that we can run the allocator, and go read-write, which we need in order to do journal replay. Fun bootstrap problems.
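For readers unfamiliar with this part of bcachefs, here is a toy version of the sort-and-dedup step Kent describes - illustrative only, not the actual journal replay code; struct jkey_ref and its fields are invented for the example:

/*
 * One flat array of references to every key found in the journal,
 * sorted by position with newer journal entries sorting last, then
 * deduplicated so only the newest version of each key survives.
 * The array of references is what needs the single large allocation.
 */
struct jkey_ref {
	u64	pos;		/* stand-in for (btree id, key position) */
	u64	journal_seq;	/* newer entries win */
	void	*k;		/* the key itself, still in the journal buffer */
};

static int jkey_ref_cmp(const void *_l, const void *_r)
{
	const struct jkey_ref *l = _l, *r = _r;

	if (l->pos != r->pos)
		return l->pos < r->pos ? -1 : 1;
	return l->journal_seq < r->journal_seq ? -1 :
	       l->journal_seq > r->journal_seq ? 1 : 0;
}

static size_t journal_keys_sort_and_dedup(struct jkey_ref *refs, size_t nr)
{
	size_t src, dst = 0;

	sort(refs, nr, sizeof(*refs), jkey_ref_cmp, NULL);

	for (src = 0; src < nr; src++) {
		if (dst && refs[dst - 1].pos == refs[src].pos)
			refs[dst - 1] = refs[src];	/* newer version replaces older */
		else
			refs[dst++] = refs[src];
	}

	return dst;	/* number of keys left after dedup */
}

It is this array of references, not the keys themselves, that grew past the 2GB limit in the report above.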
On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 12:09, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > How about you limit the amount of memory you use in the first place instead? > > .. and to clarify: we're not making other parts of the kernel less > robust because *you* are doing something stupid and odd. But back to the actual topic at hand, the security angle - while we don't have ubsan style checks for integer truncation (the real concern), we do have syzbot + kasan, which does cover this pretty thoroughly. And the INT_MAX check wouldn't catch truncation anyways - it'd only catch integer _underflow_, but allocation size calculations pretty much as a rule never use subtractions, so I don't think this check was ever worth much to begin with.
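To make the two bug classes in this exchange concrete, here are hypothetical examples (not taken from any real driver) of what the old INT_MAX check does and does not catch:

/*
 * (a) What the check catches: a size_t subtraction that wraps to a
 *     huge value.  With the INT_MAX cap, kvmalloc() returns NULL
 *     instead of attempting (or worse, succeeding at) an enormous
 *     allocation.
 */
void *case_underflow(size_t len, size_t hdr_len)
{
	return kvmalloc(len - hdr_len, GFP_KERNEL);	/* huge if len < hdr_len */
}

/*
 * (b) What the check does not catch: the size is truncated into a
 *     32-bit variable, so the allocation is tiny and sails under any
 *     INT_MAX check - later writes based on the untruncated count then
 *     run off the end of the buffer.
 */
struct item { u64 a, b; };

void *case_truncation(u64 nr_items)
{
	u32 bytes = nr_items * sizeof(struct item);	/* truncated to 32 bits */

	return kvmalloc(bytes, GFP_KERNEL);
}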
On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > And the INT_MAX check wouldn't catch truncation anyways - it'd only > catch integer _underflow_, but allocation size calculations pretty much > as a rule never use subtractions, so I don't think this check was ever > worth much to begin with. It fixed a real security issue. Enough said, and you're just making shit up to make excuses. Also, you might want to start look at latency numbers in addition to throughput. If your journal replay needs an *index* that is 2G in size, you may have other issues. Your journal size is insane, and your "artificial cap on performance" had better come with numbers. Why do you keep on being the person who creates all these pointless arguments? Not just with me, btw. Linus
On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > And the INT_MAX check wouldn't catch truncation anyways - it'd only > > catch integer _underflow_, but allocation size calculations pretty much > > as a rule never use subtractions, so I don't think this check was ever > > worth much to begin with. > > It fixed a real security issue. Which you quite conveniently aren't naming. > Enough said, and you're just making shit up to make excuses. > > Also, you might want to start look at latency numbers in addition to > throughput. If your journal replay needs an *index* that is 2G in > size, you may have other issues. Latency for journal replay? No, journal replay is only something happens at mount after an unclean shutdown. We can afford to take some time there, and journal replay performance hasn't been a concern. > Your journal size is insane, and your "artificial cap on performance" > had better come with numbers. I'm not going to run custom benchmarks just for a silly argument, sorry. But on a fileserver with 128 GB of ram and a 75 TB filesystem (yes, that's likely a dedicated fileserver), we can quite easily justify a btree node cache of perhaps 10GB, and on random update workloads the journal does need to be that big - otherwise our btree node write size goes down and throughput suffers. > Why do you keep on being the person who creates all these pointless > arguments? Not just with me, btw. That's only going to get the biggest eyeroll ever.
On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > Latency for journal replay? No, latency for the journaling itself. You're the one who claimed that a 2G cap on just the *index* to the journal would be an "artificial cap on performance" when I suggested just limiting the amount of memory you use on the journaling. Other filesystems happily limit the amount of dirty data because of latency concerns. And yes, it shows in benchmarks, where the difference between having huge amounts of data pending in memory and actually writing it back in a timely manner can be a noticeable performance penalty. It's still a good idea. Linus
On Sun, 20 Oct 2024 at 13:54, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > Latency for journal replay? > > No, latency for the journaling itself. Side note: latency of the journal replay can actually be quite critical indeed for any "five nines" operation, and big journals are not necessarily a good idea for that reason. There's a very real reason many places don't use filesystems that do fsck any more. But yes, it's a different issue from the runtime performance / latency side. Linus
On Sun, Oct 20, 2024 at 01:54:17PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > Latency for journal replay? > > No, latency for the journaling itself. > > You're the one who claimed that a 2G cap on just the *index* to the > journal would be an "artificial cap on performance" when I suggested > just limiting the amount of memory you use on the journaling. That's the same as limiting the amount of dirty metadata. Any filesystem wouldn't want an artificial cap on dirty metadata, because so long as it's dirty you can buffer up more updates, but for bcachefs it's a bit different. Btree nodes are log structured: this was a dumb idea that turned out to be utterly brilliant, because it enabled eytzinger search trees, and because pure b-trees and pure compacting data structures (the leveldb lineage) have weaknesses. Pure b-trees (with a more typical 4k node size) mean you have no opportunity to compact random updates, so writing them out to disk is inefficient. The compacting data structures solve that at the cost of terrible multithreaded update performance, and the resort overhead also in practice becomes quite costly. So, log structured btree nodes mean we can _reasonably_ efficiently write out just the 4k/8k/whatever out of a 256k btree node - i.e. we can spray random updates across an enormous btree and serialize them with decent efficiency, and because the compacting only happens within a btree node it doesn't destroy multithreaded update performance, the resorts are much cheaper (they fit in cache), and we get to use eytzinger search trees for most lookups. But: it's still much better if we can do btree node writes that are as big as possible, for all the obvious reasons - and journal size is a major factor on real world performance. All the performance testing I do where we're filling up a filesystem with mostly metadata, the first thing the tests do is increase the journal size... > Other filesystems happily limit the amount of dirty data because of > latency concerns. And yes, it shows in benchmarks, where the > difference between having huge amounts of data pending in memory and > actually writing it back in a timely manner can be a noticeable > performance penalty. Well, ext3 historically had major architectural issues with journal latency - I never really studied that but from what I gather there were lots of issues around dependent write ordering. bcachefs doesn't have any of that - broadly speaking, dirty stuff can just be written out, and e.g. memory reclaim might flush dirty btree nodes before the journal does. (The one exception being interior nodes that have pointers to btree nodes that have just been created but not yet written). The only real latency concern we have with the journal is if it fills up entirely. That will cause latency spikes, because journal reclaim has to free up entire buckets at a time. They shouldn't be massive ones, because of the lack of dependent write issues (journal reclaim itself can run full tilt and won't be getting blocked, and it starts running full tilt once the journal is more than half full), and with the default journal size I doubt it'll happen much in real world usage.
But it is something people looking at performance should watch for, and we've got time stats in sysfs for this: /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_low_on_space Also worth watching is /sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_max_in_flight This one tracks when we can't open a new journal entry because we already have the max (4) in flight, closing or writing - it'll happen if something is doing constant fsyncs, forcing us to write much smaller journal entries than we'd like. That was coming up in conversation the other day, and I might be doing another XFS-like thing to address it if it starts becoming an issue in real world usage. If latency due to the journal filling up starts being an issue for people, the first thing to do will be to just resize the journal (we can do that online, and I might make that happen automatically at some point) - and if for some reason that's not an option we'll need to add more intelligent throttling, so that the latencies happen bit by bit instead of all at once. But the next thing I need to do in that area has nothing to with throughput: a common complaint has been that we trickle out writes when the system is idle, and that behaviour dates from the days when bcache was designed for servers optimized for throughput and smoothing out bursty workloads. Nowadays, a lot of people are running it on their desktops or laptops, and we need to be able to do a "rush to idle" type thing. I have a design doc for that I wrote fully a year ago and haven't gotten to yet... https://github.com/koverstreet/bcachefs/commit/98a70eef3d46397a085069531fc503dae20d63fb
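Eytzinger search trees do a lot of work in the message above; a minimal sketch of the idea follows - a generic BFS-layout lower-bound search, an illustration of the technique rather than bcachefs's actual eytzinger.h:

/*
 * Eytzinger ("BFS order") layout: the sorted keys of a node are stored
 * with the root at index 1 and the children of index i at 2*i and
 * 2*i + 1.  A lookup then follows a predictable, cache-friendly chain
 * of indices instead of hopping around a large sorted array.
 *
 * Returns the 1-based index of the first element >= search, or 0 if
 * every element is smaller.
 */
static size_t eytzinger1_lower_bound(const u64 *tree, size_t nr, u64 search)
{
	size_t i = 1;

	while (i <= nr)
		i = 2 * i + (tree[i] < search);

	/*
	 * The descent fell off the bottom of the tree.  The trailing
	 * 1 bits of i are the final "went right" steps; shifting them
	 * off, plus the one "went left" bit below them, recovers the
	 * last node where we went left, which is the answer.
	 */
	i >>= __builtin_ctzll(i + 1) + 1;

	return i;
}

This is the lookup structure Kent credits log-structured btree nodes with enabling.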
On Sun, 20 Oct 2024 at 14:29, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > That's the same as limiting the amount of dirty metadata. Exactly. We limit dirty data for a reason. That was what I said, and that was my point. Excessive dirty data is a huge latency concern. It's a latency concern not just for journal replay, it's a latency concern for anybody who then does a "sync" or whatever. Linus
On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 13:54, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > > Latency for journal replay? > > > > No, latency for the journaling itself. > > Side note: latency of the journal replay can actually be quite > critical indeed for any "five nines" operation, and big journals are > not necessarily a good idea for that reason. > > There's a very real reason many places don't use filesystems that do > fsck any more. I need to ask one of the guys with a huge filesystem (if you're listening and have numbers, please chime in), but I don't think journal replay is bad compared to system boot time. At this point it would be completely trivial to do journal replay in the background, after the filesystem is mounted: all we need to do prior to mount is read the journal and sort+dedup the keys; replaying all the updates is the expensive part - but like I mentioned, the btree API transparently overlays the journal keys until journal replay is finished, and this was necessary for solving various bootstrap issues. So if someone complains, I'll flip that on and we'll start testing it. Fsck is the real concern, yes, and there's lots to be done there. I have the majority of the work completed for online fsck, but that isn't enough - because if fsck takes a week to complete and it takes most of system capacity while it's running, that's not acceptable either (and that would be the case today if you tried bcachefs on a petabyte filesystem). So for that, we need to make as many of the consistency checks and repairs that fsck does into things we can do whenever other operations are touching that metadata (and this is mainly what I mean when I say self healing), and we need to either reduce our dependency on passes that go "walk everything and check references", or add ways to shard them (and only check parts of the filesystem that are suspected to have damage). Checking extent backpointers is the big offender, and fortunately that's the easiest one to fix.
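A toy illustration of the "transparent overlay" mentioned above - not the actual bcachefs iterator code; the types and names here are invented:

/*
 * Until replay finishes, a lookup consults the flat sorted array of
 * journal keys first; a hit there overrides whatever the (not yet
 * caught up) btree currently contains.
 */
struct journal_key {
	u64	pos;	/* stand-in for (btree id, key position) */
	void	*k;
};

static struct journal_key *journal_keys_find(struct journal_key *keys,
					     size_t nr, u64 pos)
{
	size_t l = 0, r = nr;

	while (l < r) {			/* binary search, first entry >= pos */
		size_t m = l + (r - l) / 2;

		if (keys[m].pos < pos)
			l = m + 1;
		else
			r = m;
	}

	return l < nr && keys[l].pos == pos ? &keys[l] : NULL;
}

static void *btree_lookup_with_overlay(struct journal_key *keys, size_t nr,
				       u64 pos, void *btree_value)
{
	struct journal_key *j = journal_keys_find(keys, nr, pos);

	return j ? j->k : btree_value;	/* journal version wins until replayed */
}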
On Sun, Oct 20, 2024 at 02:30:43PM -0700, Linus Torvalds wrote: > On Sun, 20 Oct 2024 at 14:29, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > That's the same as limiting the amount of dirty metadata. > > Exactly. > > We limit dirty data for a reason. That was what I said, and that was my point. > > Excessive dirty data is a huge latency concern. It's a latency concern > not just for journal replay, it's a latency concern for anybody who > then does a "sync" or whatever. And my counterpoint is that if you've got a huge filesystem, limiting dirty metadata means that you force it to be written out as a whole bunch of random tiny writes, instead of allowing them to be batched up. That's horrible for overall throughput.
On 10/20/24 9:29 PM, Kent Overstreet wrote: > On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote: >> On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote: >>> >>> And the INT_MAX check wouldn't catch truncation anyways - it'd only >>> catch integer _underflow_, but allocation size calculations pretty much >>> as a rule never use subtractions, so I don't think this check was ever >>> worth much to begin with. >> >> It fixed a real security issue. > > Which you quite conveniently aren't naming. > >> Enough said, and you're just making shit up to make excuses. >> >> Also, you might want to start look at latency numbers in addition to >> throughput. If your journal replay needs an *index* that is 2G in >> size, you may have other issues. > > Latency for journal replay? > > No, journal replay is only something happens at mount after an unclean > shutdown. We can afford to take some time there, and journal replay > performance hasn't been a concern. Then why are you arguing about there being an "artificial cap on performance", if you can "afford to take some time there"? Am I missing something? - Joshie
On Sun, Oct 20, 2024 at 10:51:10PM +0100, Joshua Ashton wrote: > > > On 10/20/24 9:29 PM, Kent Overstreet wrote: > > On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote: > > > On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > > > > And the INT_MAX check wouldn't catch truncation anyways - it'd only > > > > catch integer _underflow_, but allocation size calculations pretty much > > > > as a rule never use subtractions, so I don't think this check was ever > > > > worth much to begin with. > > > > > > It fixed a real security issue. > > > > Which you quite conveniently aren't naming. > > > > > Enough said, and you're just making shit up to make excuses. > > > > > > Also, you might want to start look at latency numbers in addition to > > > throughput. If your journal replay needs an *index* that is 2G in > > > size, you may have other issues. > > > > Latency for journal replay? > > > > No, journal replay is only something happens at mount after an unclean > > shutdown. We can afford to take some time there, and journal replay > > performance hasn't been a concern. > > Then why are you arguing about there being an "artificial cap on > performance", if you can "afford to take some time there"? > > Am I missing something? The journal keys have to exist as a flat sorted array, and that array has to contain _all_ the (sorted, deduped) keys that were in the journal.
On 20/10/2024 at 22:29, Kent Overstreet wrote: > On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote: > >> Enough said, and you're just making shit up to make excuses. >> >> Also, you might want to start look at latency numbers in addition to >> throughput. If your journal replay needs an *index* that is 2G in >> size, you may have other issues. > Latency for journal replay? > > No, journal replay is only something happens at mount after an unclean > shutdown. We can afford to take some time there, and journal replay > performance hasn't been a concern. > >> Your journal size is insane, and your "artificial cap on performance" >> had better come with numbers. > I'm not going to run custom benchmarks just for a silly argument, sorry. > > But on a fileserver with 128 GB of ram and a 75 TB filesystem (yes, > that's likely a dedicated fileserver), we can quite easily justify a > btree node cache of perhaps 10GB, and on random update workloads the > journal does need to be that big - otherwise our btree node write size > goes down and throughput suffers. > Is this idea based on "the user only has 1 FS per device"? I assume I have this setup (and it probably is, looks like mine). I have 3 bcachefs filesystems each taking 10% of RAM. So, I end up with a memory load of 30% dedicated to bcachefs caching. If I read your argument, you say "I want a large btree node cache, because that's making the fs more efficient". No doubts about that. VFS buffering may already save you a lot of lookups you're actually building the btree node cache for. Theoretically, there's a large difference in how they work, but in practice, what files will it look up mostly? Probably the few ones you already have in your vfs buffer. The added value of keeping a large "metadata" cache seems doubtful. I have my doubts about trading 15G of buffer for 15G of btree node cache: You lose the opportunity to share those 15G of ram between all filesystems. On the other hand, when you perform many different file lookups, it will shine with everything it has. Maybe some tuning parameter could help here? It would at least limit the "insane" required journal size. Janpieter Sollie
On 21/10/2024 at 10:46, Janpieter Sollie wrote: > On 20/10/2024 at 22:29, Kent Overstreet wrote: >>> >>> >> >> I'm not going to run custom benchmarks for a silly argument, sorry. >> But on a fileserver with 128 GB of ram and a 75 TB filesystem >> (yes, that's likely a dedicated fileserver), >> we can quite easily justify a btree node cache of perhaps 10GB, >> and on random update workloads the journal does need to be that big - >> otherwise our btree node write size goes down and throughput suffers. >> > Is this idea based on "the user only has 1 FS per device"? > I assume I have this setup (and it probably is, looks like mine). > I have 3 bcachefs filesystems each taking 10% of RAM. > So, I end up with a memory load of 30% dedicated to bcachefs caching. > If I read your argument, you say "I want a large btree node cache, > because that's making the fs more efficient". No doubts about that. > > VFS buffering may already save you a lot of lookups you're actually > building the btree node cache for. > Theoretically, there's a large difference in how they work, > but in practice, what files will it look up mostly? > Probably the few ones you already have in your vfs buffer. > The added value of keeping a large "metadata" cache seems doubtful. > > I have my doubts about trading 15G of buffer for 15G of btree node cache: > You lose the opportunity to share those 15G of ram between all filesystems. > On the other hand, when you perform many different file lookups, > it will shine with everything it has. > > Maybe some tuning parameter could help here? > It would at least limit the "insane" required journal size. > > Janpieter Sollie And things quickly grow out of hand here: a bcachefs report on fs usage: blablabla (other disks) A device dedicated to metadata:

SSDM (device 6):        sdg1             rw
                               data    buckets    fragmented
  free:               39254491136     149744
  sb:                     3149824         13         258048
  journal:              312475648       1192
  btree:                428605440       1635
  user:                         0          0
  cached:                       0          0
  parity:                       0          0
  stripe:                       0          0
  need_gc_gens:                 0          0
  need_discard:                 0          0
  unstriped:                    0          0
  capacity:           39998980096     152584

Oops ... the journal size is more than 70% of the fs data! Janpieter Sollie
On Sun, Oct 20, 2024 at 03:16:11PM -0400, Kent Overstreet wrote: > On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote: > > On Sun, 20 Oct 2024 at 12:09, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > How about you limit the amount of memory you use in the first place instead? > > > > .. and to clarify: we're not making other parts of the kernel less > > robust because *you* are doing something stupid and odd. > > Except, vmalloc() already behaves this way - so it seems to me you > already have. > > I've already added a stupid workaround to the darray code to switch to > calling vmalloc() directly, when necessary; this patch was a courtesy > because if bcachefs is hitting this limit no doubt other things will be > soon as well. > I was thinking of limiting vmalloc() to INT_MAX sizes to prevent "big" allocations, i.e. applying the same limitation that kvmalloc() has. vmalloc() is bound by totalram_pages(), which is a lot. But that would break bcachefs, as I see it. -- Uladzislau Rezki
On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote: > There's a very real reason many places don't use filesystems that do > fsck any more. So fsck has been on my mind a lot lately - seems like all I'm working on lately is fsck related things - incidentally, "filesystems that don't need fsck" is a myth. The reason is that filesystems have mutable global state, and that state has to be consistent for the filesystem to work correctly: at a minimum, global usage counters that have to be correct for -ENOSPC to work, and allocation/free space maps that have to be correct to not double allocate. The only way out of this is to do a pure logging filesystem, i.e. nilfs, and there's a reason those never took off - compacting overhead is too high, they don't scale with real world usage (and you still have to give up on posix -ENOSPC, not that that's any real loss). Even in distributed land, they may get away without a traditional precise fsck, but they get away with that by leaving the heavy lifting (precise allocation information) to something like a traditional local filesystem that does have that. And they still need, at a minimum, a global GC operation - but GC is just a toy version of fsck to the filesystem developer; it'll have the same algorithmic complexity as traditional fsck but without having to be precise. (Incidentally, the main check allocations fsck pass in bcachefs is directly descended from the runtime GC code in bcache, even if barely recognizable now). And a filesystem needs to be able to cope with extreme damage to be considered fit for purpose - we need to degrade gracefully if there's corruption, not tell the user "oops, your filesystem is inaccessible" if something got scribbled over. I consider it flatly unacceptable to not be able to recover a filesystem if there's data on it. If you blew away the superblock and all the backup superblocks by running mkfs, /that's/ pretty much unrecoverable because there's too much in the superblock we really need, but literally anything else we should be able to recover from - and automatically is the goal. So there's a lot of interesting challenges in fsck.

- Scaling: fsck is pretty much the limiting factor on filesystem scalability. If it wasn't for fsck, bcachefs would probably scale up to an exabyte fairly trivially. Making fsck scale to exabyte range filesystems is going to take a _lot_ of clever sharding and clever algorithms.

- Continuing to run gracefully in the presence of damage wherever possible, instead of forcing fsck to be run right away. If allocation info is corrupt in the wrong ways such that we might double allocate, that's a problem, or if interior btree nodes are toast that requires expensive repair, but we should be able to continue running with most other types of corruption. That hasn't been the priority while in development - in development, we want to fail fast and noisily so that bugs are reported and the filesystem is left in a state where we can see what happened - but this is an area I'm starting to work on now.

- Making sure that fsck never makes things worse: you really don't want fsck to ever delete anything, this could be absolutely tragic in the event of any sort of transient error (or bug).
We've still got a bit of work to do here with pointers to indirect extents, and there's probably some other cases that need to be looked at - I think XFS is ahead of bcachefs here, I know Darrick has a notion of "tainted" metadata, the idea being that if a pointer to an indirect extent points to a missing extent, we don't delete it, we just flag it as tainted: don't log more fsck errors, just return -EIO when reading from it; then if we're able to recover the indirect extent later we can just clear the tainted flag. We've got some fun tricks for getting back online as quickly as possible even in the event of catastrophic damage. If alloc info is suspect, we can do very quick pass that walks all pointers and just marks a "bucket is currently allocated, don't use" bitmap and defer repairing or rebuilding the actual alloc info until we're online, in the background. And if interior btree nodes are toast and we need to scan (which shouldn't ever happen, but users are users and hardware is hardware, and I haven't done btrfs dup style replication because you can't trust SSDs to lay writes out on different erase units) there's a bitmap in the superblock of ranges that have btree nodes so the scan pass on a modern filesystem shouldn't take too long.
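A sketch of the quick "don't reuse these buckets" pass described above - illustrative only, not the actual bcachefs repair code; the structure and helpers are invented for the example:

/*
 * The fast path only needs one bit per bucket ("in use, don't allocate
 * from this"); the precise allocation info can be rebuilt later, in the
 * background, while the filesystem is already mounted.
 */
struct bucket_in_use_map {
	unsigned long	*bits;		/* one bit per bucket */
	u64		bucket_size;	/* bucket size, in sectors */
};

/* Called once per extent pointer while walking the btree */
static void mark_pointer_bucket(struct bucket_in_use_map *map, u64 sector)
{
	__set_bit(sector / map->bucket_size, map->bits);
}

/* The allocator consults the bitmap until the real alloc info is rebuilt */
static bool bucket_may_allocate(struct bucket_in_use_map *map, u64 bucket)
{
	return !test_bit(bucket, map->bits);
}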
diff --git a/mm/util.c b/mm/util.c
index 4f1275023eb7..c60df7723096 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 	if (!gfpflags_allow_blocking(flags))
 		return NULL;
 
-	/* Don't even allow crazy sizes */
-	if (unlikely(size > INT_MAX)) {
-		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
-		return NULL;
-	}
-
 	/*
 	 * kvmalloc() can always use VM_ALLOW_HUGE_VMAP,
 	 * since the callers already cannot assume anything
A user with a 75 TB filesystem reported the following journal replay error:
https://github.com/koverstreet/bcachefs/issues/769

In journal replay we have to sort and dedup all the keys from the journal, which means we need a large contiguous allocation. Given that the user has 128GB of ram, the 2GB limit on allocation size has become far too small.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 mm/util.c | 6 ------
 1 file changed, 6 deletions(-)