
mm: Drop INT_MAX limit from kvmalloc()

Message ID 20241019210037.146825-1-kent.overstreet@linux.dev (mailing list archive)
State New
Series mm: Drop INT_MAX limit from kvmalloc()

Commit Message

Kent Overstreet Oct. 19, 2024, 9 p.m. UTC
A user with a 75 TB filesystem reported the following journal replay
error:
https://github.com/koverstreet/bcachefs/issues/769

In journal replay we have to sort and dedup all the keys from the
journal, which means we need a large contiguous allocation. Given that
the user has 128GB of ram, the 2GB limit on allocation size has become
far too small.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 mm/util.c | 6 ------
 1 file changed, 6 deletions(-)

Comments

Lorenzo Stoakes Oct. 20, 2024, 11:45 a.m. UTC | #1
On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote:
> A user with a 75 TB filesystem reported the following journal replay
> error:
> https://github.com/koverstreet/bcachefs/issues/769
>
> In journal replay we have to sort and dedup all the keys from the
> journal, which means we need a large contiguous allocation. Given that
> the user has 128GB of ram, the 2GB limit on allocation size has become
> far too small.
>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  mm/util.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 4f1275023eb7..c60df7723096 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  	if (!gfpflags_allow_blocking(flags))
>  		return NULL;
>
> -	/* Don't even allow crazy sizes */
> -	if (unlikely(size > INT_MAX)) {
> -		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> -		return NULL;
> -	}
> -

Err, and not replace it with _any_ limit? That seems very unwise.


>  	/*
>  	 * kvmalloc() can always use VM_ALLOW_HUGE_VMAP,
>  	 * since the callers already cannot assume anything
> --
> 2.45.2
>
Kent Overstreet Oct. 20, 2024, 1 p.m. UTC | #2
On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote:
> On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote:
> > A user with a 75 TB filesystem reported the following journal replay
> > error:
> > https://github.com/koverstreet/bcachefs/issues/769
> >
> > In journal replay we have to sort and dedup all the keys from the
> > journal, which means we need a large contiguous allocation. Given that
> > the user has 128GB of ram, the 2GB limit on allocation size has become
> > far too small.
> >
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > ---
> >  mm/util.c | 6 ------
> >  1 file changed, 6 deletions(-)
> >
> > diff --git a/mm/util.c b/mm/util.c
> > index 4f1275023eb7..c60df7723096 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >  	if (!gfpflags_allow_blocking(flags))
> >  		return NULL;
> >
> > -	/* Don't even allow crazy sizes */
> > -	if (unlikely(size > INT_MAX)) {
> > -		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > -		return NULL;
> > -	}
> > -
> 
> Err, and not replace it with _any_ limit? That seems very unwise.

large allocations will go to either the page allocator or vmalloc, and
they have their own limits.

although I should have a look at that, and make sure we're not
triggering the > MAX_ORDER warning in the page allocator unnecessarily
when we could just call vmalloc().
Lorenzo Stoakes Oct. 20, 2024, 4:44 p.m. UTC | #3
+cc Linus, vmalloc reviewers

On Sun, Oct 20, 2024 at 09:00:07AM -0400, Kent Overstreet wrote:
> On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote:
> > On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote:
> > > A user with a 75 TB filesystem reported the following journal replay
> > > error:
> > > https://github.com/koverstreet/bcachefs/issues/769
> > >
> > > In journal replay we have to sort and dedup all the keys from the
> > > journal, which means we need a large contiguous allocation. Given that
> > > the user has 128GB of ram, the 2GB limit on allocation size has become
> > > far too small.
> > >
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > ---
> > >  mm/util.c | 6 ------
> > >  1 file changed, 6 deletions(-)
> > >
> > > diff --git a/mm/util.c b/mm/util.c
> > > index 4f1275023eb7..c60df7723096 100644
> > > --- a/mm/util.c
> > > +++ b/mm/util.c
> > > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> > >  	if (!gfpflags_allow_blocking(flags))
> > >  		return NULL;
> > >
> > > -	/* Don't even allow crazy sizes */
> > > -	if (unlikely(size > INT_MAX)) {
> > > -		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > > -		return NULL;
> > > -	}
> > > -
> >
> > Err, and not replace it with _any_ limit? That seems very unwise.
>
> large allocations will go to either the page allocator or vmalloc, and
> they have their own limits.

Ah actually I misread it here, I see the allocation gets immediately sent
off to __kmalloc_node_noprof() and thus that'll apply its own limits before
doing this check prior to the vmalloc call.

We actually do have a basic check in __vmalloc_node_range_noprof() that
prevents _totally_ insane requests, checking that size >> PAGE_SHIFT <=
totalram_pages(), so we shouldn't get anything too stupid here (I am
thinking especially of ptr + size overflow type situations).

But Linus explicitly introduced this INT_MAX check in commit 7661809d493b
("mm: don't allow oversized kvmalloc() calls"), presumably for a reason, so
have cc'd him here in case he has an objection to this which amounts to a
revert of that patch.

Assuming Linus doesn't object, I don't see how this is really doing
anything different than just invoking __vmalloc_node_range_noprof() direct
which we do in quite a few places anyway?

I guess let's wait and see what he says or if Vlastimil/the vmalloc
reviewers have any thoughts on this.

But looks sane to me otherwise.

>
> although I should have a look at that, and make sure we're not
> triggering the > MAX_ORDER warning in the page allocator unnecessarily
> when we could just call vmalloc().

OK I guess that would be a check prior to invoking __kmalloc_node_noprof()
-> ... -> __alloc_pages_noprof() and WARN_ON_ONCE_GFP(order >
MAX_PAGE_ORDER, gfp) ?

Sounds sensible.
Kent Overstreet Oct. 20, 2024, 5:03 p.m. UTC | #4
On Sun, Oct 20, 2024 at 05:44:37PM +0100, Lorenzo Stoakes wrote:
> +cc Linus, vmalloc reviewers
> 
> On Sun, Oct 20, 2024 at 09:00:07AM -0400, Kent Overstreet wrote:
> > On Sun, Oct 20, 2024 at 12:45:33PM +0100, Lorenzo Stoakes wrote:
> > > On Sat, Oct 19, 2024 at 05:00:37PM -0400, Kent Overstreet wrote:
> > > > A user with a 75 TB filesystem reported the following journal replay
> > > > error:
> > > > https://github.com/koverstreet/bcachefs/issues/769
> > > >
> > > > In journal replay we have to sort and dedup all the keys from the
> > > > journal, which means we need a large contiguous allocation. Given that
> > > > the user has 128GB of ram, the 2GB limit on allocation size has become
> > > > far too small.
> > > >
> > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > > ---
> > > >  mm/util.c | 6 ------
> > > >  1 file changed, 6 deletions(-)
> > > >
> > > > diff --git a/mm/util.c b/mm/util.c
> > > > index 4f1275023eb7..c60df7723096 100644
> > > > --- a/mm/util.c
> > > > +++ b/mm/util.c
> > > > @@ -665,12 +665,6 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> > > >  	if (!gfpflags_allow_blocking(flags))
> > > >  		return NULL;
> > > >
> > > > -	/* Don't even allow crazy sizes */
> > > > -	if (unlikely(size > INT_MAX)) {
> > > > -		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > > > -		return NULL;
> > > > -	}
> > > > -
> > >
> > > Err, and not replace it with _any_ limit? That seems very unwise.
> >
> > large allocations will go to either the page allocator or vmalloc, and
> > they have their own limits.
> 
> Ah actually I misread it here, I see the allocation gets immediately sent
> off to __kmalloc_node_noprof() and thus that'll apply its own limits before
> doing this check prior to the vmalloc call.
> 
> We actually do have a basic check in __vmalloc_node_range_noprof() that
> prevents _totally_ insane requests, checking that size >> PAGE_SHIFT <=
> totalram_pages(), so we shouldn't get anything too stupid here (I am
> thinking especially of ptr + size overflow type situations).

Yep, exactly.

> But Linus explicitly introduced this INT_MAX check in commit 7661809d493b
> ("mm: don't allow oversized kvmalloc() calls"), presumably for a reason, so
> have cc'd him here in case he has an objection to this which amounts to a
> revert of that patch.

Ah, thanks for adding him.

Yeah, the concern historically has been that integer truncation bugs are
really nasty and entirely too common, and we have little in the way of
tooling that can catch that.

Unfortunately that's still the case today. Kees has been doing work on
catching integer overflow, but I don't think we have anything coming yet
for catching integer truncation.

But given that vmalloc() already supports > INT_MAX requests, and memory
sizes keep growing so 2GB is getting pretty small - I think it's time,
this is going to come up in other places sooner or later.
Linus Torvalds Oct. 20, 2024, 6:46 p.m. UTC | #5
On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> But given that vmalloc() already supports > INT_MAX requests, and memory
> sizes keep growing so 2GB is getting pretty small - I think it's time,
> this is going to come up in other places sooner or later.

No.

If you need 2GB+ memory for filesystem operations, you fix your code.

2GB is more than you *can* allocate on 32-bit platforms, and even on
64-bit ones it is often quite a lot.

So even aside from the *fact* of integer overflows, that check isn't going away.

             Linus
Kent Overstreet Oct. 20, 2024, 6:53 p.m. UTC | #6
On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > But given that vmalloc() already supports > INT_MAX requests, and memory
> > sizes keep growing so 2GB is getting pretty small - I think it's time,
> > this is going to come up in other places sooner or later.
> 
> No.
> 
> If you need 2GB+ memory for filesystem operations, you fix your code.

This is for journal replay, where we've got a big array of keys and we
need to sort them.

The keys have to fit in memory (and had to fit in memory previously, for
them to be dirty in the journal); this array is just _references_ to
those keys.

> 2GB is more than you *can* allocate on 32-bit platforms, and even on
> 64-bit ones it is often quite a lot.

This came up on a machine with 128GB of ram.
Linus Torvalds Oct. 20, 2024, 7:09 p.m. UTC | #7
On Sun, 20 Oct 2024 at 11:53, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> This came up on a machine with 128GB of ram.

So?

Your argument is "we're using a ton of memory, so we can use a ton more".

Not a great argument.

How about you limit the amount of memory you use in the first place instead?

                Linus
Linus Torvalds Oct. 20, 2024, 7:09 p.m. UTC | #8
On Sun, 20 Oct 2024 at 12:09, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> How about you limit the amount of memory you use in the first place instead?

.. and to clarify: we're not making other parts of the kernel less
robust because *you* are doing something stupid and odd.

              Linus
Kent Overstreet Oct. 20, 2024, 7:10 p.m. UTC | #9
On Sun, Oct 20, 2024 at 12:09:02PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 11:53, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > This came up on a machine with 128GB of ram.
> 
> So?
> 
> Your argument is "we're using a ton of memory, so we can use a ton more".

Datasets are going to grow as machines grow, yes. Otherwise, what are we
building them for?

> How about you limit the amount of memory you use in the first place instead?

That'd be an artificial cap on performance.
Kent Overstreet Oct. 20, 2024, 7:16 p.m. UTC | #10
On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 12:09, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > How about you limit the amount of memory you use in the first place instead?
> 
> .. and to clarify: we're not making other parts of the kernel less
> robust because *you* are doing something stupid and odd.

Except, vmalloc() already behaves this way - so it seems to me you
already have.

I've already added a stupid workaround to the darray code to switch to
calling vmalloc() directly, when necessary; this patch was a courtesy
because if bcachefs is hitting this limit no doubt other things will be
soon as well.
Vlastimil Babka Oct. 20, 2024, 7:53 p.m. UTC | #11
On 10/20/24 20:53, Kent Overstreet wrote:
> On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote:
>> On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>> >
>> > But given that vmalloc() already supports > INT_MAX requests, and memory
>> > sizes keep growing so 2GB is getting pretty small - I think it's time,
>> > this is going to come up in other places sooner or later.
>> 
>> No.
>> 
>> If you need 2GB+ memory for filesystem operations, you fix your code.
> 
> This is for journal replay, where we've got a big array of keys and we
> need to sort them.
> 
> The keys have to fit in memory (and had to fit in memory previously, for
> them to be dirty in the journal);

What if the disk is moved to a smaller system, should the fs still mount
there? (I don't mean such a small system that it can't vmalloc() 2GB
specifically, but in principle...)

> this array is just _references_ to
> those keys.
> 
>> 2GB is more than you *can* allocate on 32-bit platforms, and even on
>> 64-bit ones it is often quite a lot.
> 
> This came up on a machine with 128GB of ram.
Kent Overstreet Oct. 20, 2024, 8:08 p.m. UTC | #12
On Sun, Oct 20, 2024 at 09:53:06PM +0200, Vlastimil Babka wrote:
> On 10/20/24 20:53, Kent Overstreet wrote:
> > On Sun, Oct 20, 2024 at 11:46:11AM -0700, Linus Torvalds wrote:
> >> On Sun, 20 Oct 2024 at 10:04, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >> >
> >> > But given that vmalloc() already supports > INT_MAX requests, and memory
> >> > sizes keep growing so 2GB is getting pretty small - I think it's time,
> >> > this is going to come up in other places sooner or later.
> >> 
> >> No.
> >> 
> >> If you need 2GB+ memory for filesystem operations, you fix your code.
> > 
> > This is for journal replay, where we've got a big array of keys and we
> > need to sort them.
> > 
> > The keys have to fit in memory (and had to fit in memory previously, for
> > them to be dirty in the journal);
> 
> What if the disk is moved to a smaller system, should the fs still mount
> there? (I don't mean such a small system that it can't vmalloc() 2GB
> specifically, but in principle...)

You'll have to do journal replay on the bigger system. Once you've done
that, it'll work just fine on the smaller system.

(Now, trying to work with a 75TB filesystem on a small machine is going
to be really painful if you ever need to fsck. That's just an inherently
hard problem, but we've got fsck scalability/performance improvements in
the works).

But journal replay does inherently require the whole contents of the
journal to fit in memory - we have to do the sort + dedup so that we can
overlay the contents of the journal over the btree until journal replay
is finished so that we can get a consistent view of the filesystem,
which we need so that we can run the allocator, and go read-write, which
we need in order to do journal replay.

Fun bootstrap problems.
Kent Overstreet Oct. 20, 2024, 8:10 p.m. UTC | #13
On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 12:09, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > How about you limit the amount of memory you use in the first place instead?
> 
> .. and to clarify: we're not making other parts of the kernel less
> robust because *you* are doing something stupid and odd.

But back to the actual topic at hand, the security angle - while we
don't have ubsan style checks for integer truncation (the real concern),
we do have syzbot + KASAN, which does cover this pretty thoroughly.

And the INT_MAX check wouldn't catch truncation anyways - it'd only
catch integer _underflow_, but allocation size calculations pretty much
as a rule never use subtractions, so I don't think this check was ever
worth much to begin with.
Linus Torvalds Oct. 20, 2024, 8:19 p.m. UTC | #14
On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> And the INT_MAX check wouldn't catch truncation anyways - it'd only
> catch integer _underflow_, but allocation size calculations pretty much
> as a rule never use subtractions, so I don't think this check was ever
> worth much to begin with.

It fixed a real security issue.

Enough said, and you're just making shit up to make excuses.

Also, you might want to start look at latency numbers in addition to
throughput. If your journal replay needs an *index* that is 2G in
size, you may have other issues.

Your journal size is insane, and your "artificial cap on performance"
had better come with numbers.

Why do you keep on being the person who creates all these pointless
arguments? Not just with me, btw.

                 Linus
Kent Overstreet Oct. 20, 2024, 8:29 p.m. UTC | #15
On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > And the INT_MAX check wouldn't catch truncation anyways - it'd only
> > catch integer _underflow_, but allocation size calculations pretty much
> > as a rule never use subtractions, so I don't think this check was ever
> > worth much to begin with.
> 
> It fixed a real security issue.

Which you quite conveniently aren't naming.

> Enough said, and you're just making shit up to make excuses.
> 
> Also, you might want to start look at latency numbers in addition to
> throughput. If your journal replay needs an *index* that is 2G in
> size, you may have other issues.

Latency for journal replay?

No, journal replay is something that only happens at mount after an unclean
shutdown. We can afford to take some time there, and journal replay
performance hasn't been a concern.

> Your journal size is insane, and your "artificial cap on performance"
> had better come with numbers.

I'm not going to run custom benchmarks just for a silly argument, sorry.

But on a fileserver with 128 GB of ram and a 75 TB filesystem (yes,
that's likely a dedicated fileserver), we can quite easily justify a
btree node cache of perhaps 10GB, and on random update workloads the
journal does need to be that big - otherwise our btree node write size
goes down and throughput suffers.

> Why do you keep on being the person who creates all these pointless
> arguments? Not just with me, btw.

That's only going to get the biggest eyeroll ever.
Linus Torvalds Oct. 20, 2024, 8:54 p.m. UTC | #16
On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Latency for journal replay?

No, latency for the journaling itself.

You're the one who claimed that a 2G cap on just the *index* to the
journal would be an "artificial cap on performance" when I suggested
just limiting the amount of memory you use on the journaling.

Other filesystems happily limit the amount of dirty data because of
latency concerns. And yes, it shows in benchmarks, where the
difference between having huge amounts of data pending in memory and
actually writing it back in a timely manner can be a noticeable
performance penalty.

It's still a good idea.

             Linus
Linus Torvalds Oct. 20, 2024, 9:21 p.m. UTC | #17
On Sun, 20 Oct 2024 at 13:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Latency for journal replay?
>
> No, latency for the journaling itself.

Side note: latency of the journal replay can actually be quite
critical indeed for any "five nines" operation, and big journals are
not necessarily a good idea for that reason.

There's a very real reason many places don't use filesystems that do
fsck any more.

But yes, it's a different issue from the runtime performance / latency side.

             Linus
Kent Overstreet Oct. 20, 2024, 9:29 p.m. UTC | #18
On Sun, Oct 20, 2024 at 01:54:17PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Latency for journal replay?
> 
> No, latency for the journaling itself.
> 
> You're the one who claimed that a 2G cap on just the *index* to the
> journal would be an "artificial cap on performance" when I suggested
> just limiting the amount of memory you use on the journaling.

That's the same as limiting the amount of dirty metadata.

No filesystem would want an artificial cap on dirty metadata, because so
long as it's dirty you can buffer up more updates - but for bcachefs
it's a bit different.

Btree nodes are log structured: this was a dumb idea that turned out to
be utterly brilliant, because it enabled eytzinger search trees, and
because pure b-trees and pure compacting data structures (the leveldb
lineage) have weaknesses. Pure b-trees (with a more typical 4k node size)
mean you have no opportunity to compact random updates, so writing them
out to disk is inefficient. The compacting data structures solve that at
the cost of terrible multithreaded update performance, and the resort
overhead also in practice becomes quite costly.

So, log structured btree nodes mean we can _reasonably_ efficiently
write out just the 4k/8k/whatever out of a 256k btree node - i.e. we can
spray random updates across an enormous btree and serialize them with
decent efficiency, and because the compacting only happens within a
btree node it doesn't destroy multithreaded update performance, the
resorts are much cheaper (they fit in cache), and we get to use
eytzinger search trees for most lookups.

But: it's still much better if we can do btree node writes that are as
big as possible, for all the obvious reasons - and journal size is a
major factor on real world performance. All the performance testing I do
where we're filling up a filesystem with mostly metadata, the first
thing the tests do is increase the journal size...

> Other filesystems happily limit the amount of dirty data because of
> latency concerns. And yes, it shows in benchmarks, where the
> difference between having huge amounts of data pending in memory and
> actually writing it back in a timely manner can be a noticeable
> performance penalty.

Well, ext3 historically had major architectural issues with journal
latency - I never really studied that but from what I gather there were
lots of issues around dependent write ordering. bcachefs doesn't have
any of that - broadly speaking, dirty stuff can just be written out, and
i.e. memory reclaim might flush dirty btree nodes before the journal
does.

(The one exception being interior nodes that have pointers to btree
nodes that have just been created but not yet written).

The only real latency concern we have with the journal is if it fills
up entirely. That will cause latency spikes, because journal reclaim has
to free up entire buckets at a time. They shouldn't be massive ones,
because of the lack of dependent write issues (journal reclaim itself
can run full tilt and won't be getting blocked, and it starts running
full tilt once the journal is more than half full), and with the default
journal size I doubt it'll happen much in real world usage.

But it is something people looking at performance should watch for, and
we've got time stats in sysfs for this:
/sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_low_on_space

Also worth watching is 
/sys/fs/bcachefs/<uuid>/time_stats/blocked_journal_max_in_flight

This one tracks when we can't open a new journal entry because we
already have the max (4) in flight, closing or writing - it'll happen if
something is doing constant fsyncs, forcing us to write much smaller
journal entries than we'd like. That was coming up in conversation the
other day, and I might be doing another XFS-like thing to address it if
it starts becoming an issue in real world usage.

If latency due to the journal filling up starts being an issue for
people, the first thing to do will be to just resize the journal (we can
do that online, and I might make that happen automatically at some
point) - and if for some reason that's not an option we'll need to add
more intelligent throttling, so that the latencies happen bit by bit
instead of all at once.

But the next thing I need to do in that area has nothing to with
throughput: a common complaint has been that we trickle out writes when
the system is idle, and that behaviour dates from the days when bcache
was designed for servers optimized for throughput and smoothing out
bursty workloads. Nowadays, a lot of people are running it on their
desktops or laptops, and we need to be able to do a "rush to idle" type
thing.

I have a design doc for that I wrote fully a year ago and haven't gotten
to yet... https://github.com/koverstreet/bcachefs/commit/98a70eef3d46397a085069531fc503dae20d63fb
Linus Torvalds Oct. 20, 2024, 9:30 p.m. UTC | #19
On Sun, 20 Oct 2024 at 14:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> That's the same as limiting the amount of dirty metadata.

Exactly.

We limit dirty data for a reason. That was what I said, and that was my point.

Excessive dirty data is a huge latency concern. It's a latency concern
not just for journal replay, it's a latency concern for anybody who
then does a "sync" or whatever.

               Linus
Kent Overstreet Oct. 20, 2024, 9:40 p.m. UTC | #20
On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 13:54, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, 20 Oct 2024 at 13:30, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > Latency for journal replay?
> >
> > No, latency for the journaling itself.
> 
> Side note: latency of the journal replay can actually be quite
> critical indeed for any "five nines" operation, and big journals are
> not necessarily a good idea for that reason.
> 
> There's a very real reason many places don't use filesystems that do
> fsck any more.

I need to ask one of the guys with a huge filesystem (if you're
listening and have numbers, please chime in), but I don't think journal
replay is bad compared to system boot time.

At this point it would be completely trivial to do journal replay in the
background, after the filesystem is mounted: all we need to do prior to
mount is read the journal and sort+dedup the keys, replaying all the
updates is the expensive part - but like I mentioned the btree API
transparently overlays the journal keys until journal replay is
finished, and this was necessary for solving various bootstrap issues.
So if someone complains, I'll flip that on and we'll start testing it.

Fsck is the real concern, yes, and there's lots to be done there. I have
the majority of the work completed for online fsck, but that isn't
enough - because if fsck takes a week to complete and it takes most of
system capacity while it's running, that's not acceptable either (and
that would be the case today if you tried bcachefs on a petabyte
filesystem).

So for that, we need to turn as many of the consistency checks and
repairs that fsck does into things we can do whenever other operations
are touching that metadata (this is mainly what I mean when I say self
healing), and we need to either reduce our dependency on passes that go
"walk everything and check references", or add ways to shard them (and
only check parts of the filesystem that are suspected to have damage).
Checking extent backpointers is the big offender, and fortunately
that's the easiest one to fix.
Kent Overstreet Oct. 20, 2024, 9:42 p.m. UTC | #21
On Sun, Oct 20, 2024 at 02:30:43PM -0700, Linus Torvalds wrote:
> On Sun, 20 Oct 2024 at 14:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > That's the same as limiting the amount of dirty metadata.
> 
> Exactly.
> 
> We limit dirty data for a reason. That was what I said, and that was my point.
> 
> Excessive dirty data is a huge latency concern. It's a latency concern
> not just for journal replay, it's a latency concern for anybody who
> then does a "sync" or whatever.

And my counterpoint is that if you've got a huge filesystem, limiting
dirty metadata means that you force it to be written out as a whole
bunch of random tiny writes, instead of allowing them to be batched up.
That's horrible for overall throughput.
Joshua Ashton Oct. 20, 2024, 9:51 p.m. UTC | #22
On 10/20/24 9:29 PM, Kent Overstreet wrote:
> On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote:
>> On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>>>
>>> And the INT_MAX check wouldn't catch truncation anyways - it'd only
>>> catch integer _underflow_, but allocation size calculations pretty much
>>> as a rule never use subtractions, so I don't think this check was ever
>>> worth much to begin with.
>>
>> It fixed a real security issue.
> 
> Which you quite conveniently aren't naming.
> 
>> Enough said, and you're just making shit up to make excuses.
>>
>> Also, you might want to start look at latency numbers in addition to
>> throughput. If your journal replay needs an *index* that is 2G in
>> size, you may have other issues.
> 
> Latency for journal replay?
> 
> No, journal replay is only something happens at mount after an unclean
> shutdown. We can afford to take some time there, and journal replay
> performance hasn't been a concern.

Then why are you arguing about there being an "artificial cap on 
performance", if you can "afford to take some time there"?

Am I missing something?

- Joshie 
Kent Overstreet Oct. 20, 2024, 9:57 p.m. UTC | #23
On Sun, Oct 20, 2024 at 10:51:10PM +0100, Joshua Ashton wrote:
> 
> 
> On 10/20/24 9:29 PM, Kent Overstreet wrote:
> > On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote:
> > > On Sun, 20 Oct 2024 at 13:10, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > 
> > > > And the INT_MAX check wouldn't catch truncation anyways - it'd only
> > > > catch integer _underflow_, but allocation size calculations pretty much
> > > > as a rule never use subtractions, so I don't think this check was ever
> > > > worth much to begin with.
> > > 
> > > It fixed a real security issue.
> > 
> > Which you quite conveniently aren't naming.
> > 
> > > Enough said, and you're just making shit up to make excuses.
> > > 
> > > Also, you might want to start look at latency numbers in addition to
> > > throughput. If your journal replay needs an *index* that is 2G in
> > > size, you may have other issues.
> > 
> > Latency for journal replay?
> > 
> > No, journal replay is only something happens at mount after an unclean
> > shutdown. We can afford to take some time there, and journal replay
> > performance hasn't been a concern.
> 
> Then why are you arguing about there being an "artificial cap on
> performance", if you can "afford to take some time there"?
> 
> Am I missing something?

The journal keys have to exist as a flat sorted array, and it has to
contain _all_ the (sorted, deduped) keys that were in the journal.
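For illustration, that sort-and-dedup step can be sketched in userspace C.
The `journal_key` layout and the function names here are invented for the
sketch, not the real bcachefs structures; the point is that every key ends
up in one contiguous array, which is why the allocation scales with journal
size:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bcachefs journal key: not the real layout. */
struct journal_key {
	uint64_t	k;	/* key */
	uint64_t	seq;	/* journal sequence number; newer wins */
};

static int key_cmp(const void *l, const void *r)
{
	const struct journal_key *a = l, *b = r;

	if (a->k != b->k)
		return a->k < b->k ? -1 : 1;
	/* Sort newer entries first so dedup keeps the latest version: */
	return a->seq > b->seq ? -1 : a->seq < b->seq;
}

/* Sort keys and drop older duplicates in place; returns the new count. */
static size_t sort_and_dedup(struct journal_key *keys, size_t nr)
{
	size_t out = 0;

	if (!nr)
		return 0;

	qsort(keys, nr, sizeof(*keys), key_cmp);

	for (size_t i = 1; i < nr; i++)
		if (keys[i].k != keys[out].k)
			keys[++out] = keys[i];
	return out + 1;
}
```

Since the dedup has to see all versions of a key side by side, the array
has to be allocated up front for the whole journal, hence the single large
kvmalloc().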
Janpieter Sollie Oct. 21, 2024, 8:46 a.m. UTC | #24
Op 20/10/2024 om 22:29 schreef Kent Overstreet:
> On Sun, Oct 20, 2024 at 01:19:42PM -0700, Linus Torvalds wrote:
>
>> Enough said, and you're just making shit up to make excuses.
>>
>> Also, you might want to start looking at latency numbers in addition to
>> throughput. If your journal replay needs an *index* that is 2G in
>> size, you may have other issues.
> Latency for journal replay?
>
> No, journal replay is only something that happens at mount after an unclean
> shutdown. We can afford to take some time there, and journal replay
> performance hasn't been a concern.
>
>> Your journal size is insane, and your "artificial cap on performance"
>> had better come with numbers.
> I'm not going to run custom benchmarks just for a silly argument, sorry.
>
> But on a fileserver with 128 GB of ram and a 75 TB filesystem (yes,
> that's likely a dedicated fileserver), we can quite easily justify a
> btree node cache of perhaps 10GB, and on random update workloads the
> journal does need to be that big - otherwise our btree node write size
> goes down and throughput suffers.
>
Is this idea based on "the user only has 1 FS per device"?
Assume I have this setup (and I probably do; it looks like mine).
I have 3 bcachefs filesystems, each taking 10% of RAM.
So I end up with a memory load of 30% dedicated to bcachefs caching.
If I read your argument, you say "I want a large btree node cache,
because that's making the fs more efficient".  No doubts about that.

VFS buffering may already save you a lot of lookups you're actually
building the btree node cache for.
Theoretically, there's a large difference in how they work,
but in practice, what files will it look up mostly?
Probably the few you already have in your VFS buffer.
The added value of keeping a large "metadata" cache seems doubtful.

I have my doubts about trading 15G of buffer for 15G of btree node cache:
You lose the opportunity to share those 15G ram between all filesystems.
On the other hand, when you perform many different file lookups,
it will shine with everything it has.

Maybe some tuning parameter could help here?
It would at least limit the "insane" required journal size.

Janpieter Sollie
Janpieter Sollie Oct. 21, 2024, 9:22 a.m. UTC | #25
Op 21/10/2024 om 10:46 schreef Janpieter Sollie:
> Op 20/10/2024 om 22:29 schreef Kent Overstreet:
>>>
>>>
>>
>> I'm not going to run custom benchmarks just for a silly argument, sorry.
>> But on a fileserver with 128 GB of ram and a 75 TB filesystem
>> (yes, that's likely a dedicated fileserver),
>> we can quite easily justify a btree node cache of perhaps 10GB,
>> and on random update workloads the journal does need to be that big -
>> otherwise our btree node write size goes down and throughput suffers.
>>
> Is this idea based on "the user only has 1 FS per device"?
> Assume I have this setup (and I probably do; it looks like mine).
> I have 3 bcachefs filesystems, each taking 10% of RAM.
> So I end up with a memory load of 30% dedicated to bcachefs caching.
> If I read your argument, you say "I want a large btree node cache,
> because that's making the fs more efficient".  No doubts about that.
>
> VFS buffering may already save you a lot of lookups you're actually
> building the btree node cache for.
> Theoretically, there's a large difference in how they work,
> but in practice, what files will it look up mostly?
> Probably the few you already have in your VFS buffer.
> The added value of keeping a large "metadata" cache seems doubtful.
>
> I have my doubts about trading 15G of buffer for 15G of btree node cache:
> You lose the opportunity to share those 15G ram between all filesystems.
> On the other hand, when you perform many different file lookups,
> it will shine with everything it has.
>
> Maybe some tuning parameter could help here?
> It would at least limit the "insane" required journal size.
>
> Janpieter Sollie
And things quickly grow out of hand here:

a bcachefs report on fs usage:

blablabla (other disks)

A device dedicated to metadata:

SSDM (device 6):                sdg1              rw
                                data         buckets    fragmented
  free:                  39254491136          149744
  sb:                        3149824              13        258048
  journal:                 312475648            1192
  btree:                   428605440            1635
  user:                            0               0
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               0
  unstriped:                       0               0
  capacity:              39998980096          152584

Oops ... the journal is more than 70% the size of the actual btree data!

Janpieter Sollie
Uladzislau Rezki Oct. 21, 2024, 4:15 p.m. UTC | #26
On Sun, Oct 20, 2024 at 03:16:11PM -0400, Kent Overstreet wrote:
> On Sun, Oct 20, 2024 at 12:09:58PM -0700, Linus Torvalds wrote:
> > On Sun, 20 Oct 2024 at 12:09, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > How about you limit the amount of memory you use in the first place instead?
> > 
> > .. and to clarify: we're not making other parts of the kernel less
> > robust because *you* are doing something stupid and odd.
> 
> Except, vmalloc() already behaves this way - so it seems to me you
> already have.
> 
> I've already added a stupid workaround to the darray code to switch to
> calling vmalloc() directly, when necessary; this patch was a courtesy
> because if bcachefs is hitting this limit no doubt other things will be
> soon as well.
>
I was thinking of preventing "big" allocations by limiting vmalloc() to
INT_MAX as well, i.e. applying the same limitation that kvmalloc() has.
vmalloc() is bound by totalram_pages(), which is far larger.

But it would break bcachefs, as I see it.
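For illustration, the guard being discussed can be sketched in userspace,
with malloc() standing in for the kernel's kmalloc()/vmalloc() pair and
bounded_alloc() as an invented name (there is no gfp_t or warning here):

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the kvmalloc() size guard: sizes above INT_MAX
 * are rejected up front, which is the check the patch removes.
 */
static void *bounded_alloc(size_t size)
{
	/* "Don't even allow crazy sizes" */
	if (size > (size_t)INT_MAX)
		return NULL;

	return malloc(size);
}
```

Note that a size computed from a negative int sign-extends to a huge
size_t, so a guard like this also catches that class of overflow bug, at
the cost of refusing legitimately large allocations.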

--
Uladzislau Rezki
Kent Overstreet Oct. 27, 2024, 7:58 p.m. UTC | #27
On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> There's a very real reason many places don't use filesystems that do
> fsck any more.

So fsck has been on my mind a lot lately - seems like all I'm working on
is fsck-related things - incidentally, "filesystems that don't
need fsck" is a myth.

The reason is that filesystems have mutable global state, and that state
has to be consistent for the filesystem to work correctly: at a minimum,
global usage counters that have to be correct for -ENOSPC to work, and
allocation/free space maps that have to be correct to not double
allocate.

The only way out of this is to do a pure logging filesystem, i.e. nilfs,
and there's a reason those never took off - compacting overhead is too
high, they don't scale with real world usage (and you still have to give
up on posix -ENOSPC, not that that's any real loss).

Even in distributed land, they may get away without a traditional
precise fsck, but they get away with that by leaving the heavy lifting
(precise allocation information) to something like a traditional local
filesystem that does have that. And they still need, at a minimum, a
global GC operation - but GC is just a toy version of fsck to the
filesystem developer; it'll have the same algorithmic complexity as
traditional fsck but without having to be precise.

(Incidentally, the main check allocations fsck pass in bcachefs is
directly descended from the runtime GC code in bcache, even if barely
recognizable now).

And a filesystem needs to be able to cope with extreme damage to be
considered fit for purpose - we need to degrade gracefully if there's
corruption, not tell the user "oops, your filesystem is inaccessible" if
something got scribbled over. I consider it flatly unacceptable to not
be able to recover a filesystem if there's data on it. If you blew away
the superblock and all the backup superblocks by running mkfs, /that's/
pretty much unrecoverable because there's too much in the superblock we
really need, but literally anything else we should be able to recover
from - and automatically is the goal.

So there are a lot of interesting challenges in fsck.

- Scaling: fsck is pretty much the limiting factor on filesystem
  scalability. If it wasn't for fsck, bcachefs would probably scale up to
  an exabyte fairly trivially. Making fsck scale to exabyte range
  filesystems is going to take a _lot_ of clever sharding and clever
  algorithms.

- Continuing to run gracefully in the presence of damage wherever
  possible, instead of forcing fsck to be run right away.

  If allocation info is corrupt in the wrong ways such that we might
  double allocate, that's a problem, or if interior btree nodes are
  toast that requires expensive repair, but we should be able to
  continue running with most other types of corruption. That hasn't been
  the priority while in development - in development, we want to fail
  fast and noisily so that bugs are reported and the filesystem is left
  in a state where we can see what happened - but this is an area I'm
  starting to work on now.

- Making sure that fsck never makes things worse

  You really don't want fsck to ever delete anything, this could be
  absolutely tragic in the event of any sort of transient error (or
  bug). We've still got a bit of work to do here with pointers to
  indirect extents, and there's probably some other cases that need to
  be looked at - I think XFS is ahead of bcachefs here, I know
  Darrick has a notion of "tainted" metadata, the idea being that if a
  pointer to an indirect extent points to a missing extent, we don't
  delete it, we just flag it as tainted: don't log more fsck errors,
  just return -EIO when reading from it; then if we're able to recover
  the indirect extent later we can just clear the tainted flag.
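  The tainted-metadata idea can be sketched like this - the flag name,
  struct layout, and error path are assumptions for illustration, not the
  actual XFS or bcachefs internals:

  ```c
  #include <errno.h>
  #include <stdint.h>

  #define EXTENT_TAINTED	(1u << 0)

  /* Hypothetical indirect extent record for this sketch. */
  struct indirect_extent {
  	uint32_t	flags;
  	uint32_t	nr_bytes;
  };

  /*
  * Instead of deleting a pointer to a missing extent, flag it tainted:
  * reads fail with -EIO and no further fsck errors are logged.
  */
  static int read_extent(struct indirect_extent *e)
  {
  	if (e->flags & EXTENT_TAINTED)
  		return -EIO;
  	return e->nr_bytes;	/* pretend-read: return the size */
  }

  /* If the extent is recovered later, just clear the flag: */
  static void extent_recovered(struct indirect_extent *e)
  {
  	e->flags &= ~EXTENT_TAINTED;
  }
  ```

  The appeal of this design is that repair stays reversible: nothing is
  deleted, so a transient error never turns into permanent data loss.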

We've got some fun tricks for getting back online as quickly as possible
even in the event of catastrophic damage. If alloc info is suspect, we
can do a very quick pass that walks all pointers and just marks a "bucket
is currently allocated, don't use" bitmap and defer repairing or
rebuilding the actual alloc info until we're online, in the background.
And if interior btree nodes are toast and we need to scan (which
shouldn't ever happen, but users are users and hardware is hardware, and
I haven't done btrfs dup style replication because you can't trust SSDs
to lay writes out on different erase units) there's a bitmap in the
superblock of ranges that have btree nodes so the scan pass on a modern
filesystem shouldn't take too long.
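The quick-recovery pass described above can be sketched as follows; the
types and names (extent_ptr, alloc_bitmap, quick_mark_pass()) are invented
for illustration, not the real bcachefs structures:

```c
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))

/* Hypothetical extent pointer: just a bucket index for this sketch. */
struct extent_ptr {
	uint64_t	bucket;
};

/* One bit per bucket: "currently allocated, don't use". */
struct alloc_bitmap {
	unsigned long	*bits;
	uint64_t	nr_buckets;
};

static int alloc_bitmap_init(struct alloc_bitmap *b, uint64_t nr_buckets)
{
	size_t longs = (nr_buckets + BITS_PER_LONG - 1) / BITS_PER_LONG;

	b->bits = calloc(longs, sizeof(unsigned long));
	b->nr_buckets = nr_buckets;
	return b->bits ? 0 : -1;
}

static void mark_bucket_allocated(struct alloc_bitmap *b, uint64_t bucket)
{
	b->bits[bucket / BITS_PER_LONG] |= 1UL << (bucket % BITS_PER_LONG);
}

static int bucket_is_allocated(const struct alloc_bitmap *b, uint64_t bucket)
{
	return (b->bits[bucket / BITS_PER_LONG] >>
		(bucket % BITS_PER_LONG)) & 1;
}

/*
 * Quick pass: walk every pointer and mark the referenced bucket so the
 * allocator won't reuse it; rebuilding the precise alloc info is then
 * deferred until the filesystem is online, in the background.
 */
static void quick_mark_pass(struct alloc_bitmap *b,
			    const struct extent_ptr *ptrs, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		mark_bucket_allocated(b, ptrs[i].bucket);
}
```

The bitmap is deliberately imprecise - it only answers "might this bucket
be in use?" - which is what makes the pass cheap enough to run before
going online.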
diff mbox series

Patch

diff --git a/mm/util.c b/mm/util.c
index 4f1275023eb7..c60df7723096 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -665,12 +665,6 @@  void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 	if (!gfpflags_allow_blocking(flags))
 		return NULL;
 
-	/* Don't even allow crazy sizes */
-	if (unlikely(size > INT_MAX)) {
-		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
-		return NULL;
-	}
-
 	/*
 	 * kvmalloc() can always use VM_ALLOW_HUGE_VMAP,
 	 * since the callers already cannot assume anything