[GIT,PULL] bcachefs fixes for 6.11-rc5

Message ID sctzes5z3s2zoadzldrpw3yfycauc4kpcsbpidjkrew5hkz7yf@eejp6nunfpin (mailing list archive)
State New

Pull-request

git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-08-23

Message

Kent Overstreet Aug. 23, 2024, 6:54 p.m. UTC
Hi Linus, big one this time...

no more bug reports related to the disk accounting rewrite, things are
looking good over here as far as regressions go

The following changes since commit 0e49d3ff12501adaafaf6fdb19699f021d1eda1c:

  bcachefs: Fix locking in __bch2_trans_mark_dev_sb() (2024-08-16 20:45:15 -0400)

are available in the Git repository at:

  git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-08-23

for you to fetch changes up to 4d8ead60ffd937a73b50f42f6bd776b6a7919dde:

  bcachefs: key cache can now allocate from pending (2024-08-22 11:51:55 -0400)

----------------------------------------------------------------
bcachefs fixes for 6.11-rc5

Lots of little fixes and two big items, which were originally slated for
the 6.12 merge window but turned out to be pretty important.

The little stuff includes assorted syzbot fixes and some upgrade fixes
for old (pre 1.0) filesystems, and a fix for moving data off a device
that was switched to durability=0 after data had been written to it.

The big items are:
- rhashtable conversion for VFS inodes cache
  This was slated for the 6.12 merge window, but a deadlock was
  uncovered in __wait_on_freeing_inode(); bcachefs inverts the usual
  locking between the VFS inode cache and on disk (btree) locking, with
  some advantages and some extra trickiness. The result was that we were
  waiting (via __wait_on_freeing_inode()) on the evict -> clear_inode()
  path with btree locks held, which was a rare deadlock but undoubtedly
  also one of the sources of the SRCU warnings.

- new data structure for managing freelists in btree key cache
  This eliminates the btree key cache lock, and associated lock
  contention. User feedback is that this resolves the main source of the
  SRCU warnings we've been seeing - which means that on some of the big
  multithreaded workloads people are running the lock contention was
  really bad (threads piling up and causing O(n^2) wait times), if it
  was able to trigger a 10 second delay warning.

On the test reported by
https://lore.kernel.org/linux-bcachefs/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@mail.gmail.com/

We're now 4x faster than xfs on creatrees, roughly even on walktrees,
with consistent run to run performance; dominant factor in profiles is
lru lock contention.

----------------------------------------------------------------
Kent Overstreet (34):
      bcachefs: Reallocate table when we're increasing size
      bcachefs: fix field-spanning write warning
      bcachefs: Fix incorrect gfp flags
      bcachefs: Extra debug for data move path
      bcachefs: bch2_data_update_init() cleanup
      bcachefs: Fix "trying to move an extent, but nr_replicas=0"
      bcachefs: setting bcachefs_effective.* xattrs is a noop
      bcachefs: Fix failure to relock in btree_node_get()
      bcachefs: Fix bch2_trigger_alloc assert
      bcachefs: Fix bch2_bucket_gens_init()
      bcachefs: fix time_stats_to_text()
      bcachefs: fix missing bch2_err_str()
      bcachefs: unlock_long() before resort in journal replay
      bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
      bcachefs: fix failure to relock in btree_node_fill()
      bcachefs: Fix locking in bch2_ioc_setlabel()
      bcachefs: Fix replay_now_at() assert
      bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
      fs/super.c: improve get_tree() error message
      bcachefs: Fix warning in bch2_fs_journal_stop()
      bcachefs: Fix compat issue with old alloc_v4 keys
      bcachefs: Fix refcounting in discard path
      bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
      bcachefs: add missing inode_walker_exit()
      bcachefs: don't use rht_bucket() in btree_key_cache_scan()
      inode: make __iget() a static inline
      bcachefs: switch to rhashtable for vfs inodes hash
      bcachefs: Fix deadlock in __wait_on_freeing_inode()
      bcachefs: journal_entry_replicas_not_marked is now autofix
      lib/generic-radix-tree.c: genradix_ptr_inlined()
      lib/generic-radix-tree.c: add preallocation
      bcachefs: rcu_pending
      bcachefs: Rip out freelists from btree key cache
      bcachefs: key cache can now allocate from pending

Yuesong Li (1):
      bcachefs: Fix double assignment in check_dirent_to_subvol()

 fs/bcachefs/Makefile                      |   1 +
 fs/bcachefs/acl.c                         |   2 +-
 fs/bcachefs/alloc_background.c            |  66 ++--
 fs/bcachefs/alloc_background_format.h     |   1 +
 fs/bcachefs/bcachefs.h                    |   1 +
 fs/bcachefs/btree_cache.c                 |  25 ++
 fs/bcachefs/btree_cache.h                 |   2 +
 fs/bcachefs/btree_iter.h                  |   9 +
 fs/bcachefs/btree_key_cache.c             | 426 ++++++---------------
 fs/bcachefs/btree_key_cache_types.h       |  18 +-
 fs/bcachefs/btree_types.h                 |   3 +-
 fs/bcachefs/btree_update_interior.c       |  46 +--
 fs/bcachefs/buckets_waiting_for_journal.c |   4 +-
 fs/bcachefs/data_update.c                 | 209 ++++++-----
 fs/bcachefs/extents.c                     |   2 +
 fs/bcachefs/fs-io-buffered.c              |   4 +-
 fs/bcachefs/fs-io-direct.c                |   2 +-
 fs/bcachefs/fs-io.c                       |   6 +-
 fs/bcachefs/fs-ioctl.c                    |   7 +-
 fs/bcachefs/fs.c                          | 234 ++++++++----
 fs/bcachefs/fs.h                          |  18 +-
 fs/bcachefs/fsck.c                        |   6 +-
 fs/bcachefs/inode.c                       |   2 +-
 fs/bcachefs/journal.c                     |   2 +-
 fs/bcachefs/journal_sb.c                  |  15 +
 fs/bcachefs/rcu_pending.c                 | 603 ++++++++++++++++++++++++++++++
 fs/bcachefs/rcu_pending.h                 |  25 ++
 fs/bcachefs/recovery.c                    |   9 +-
 fs/bcachefs/replicas.c                    |   3 +-
 fs/bcachefs/sb-errors_format.h            |   2 +-
 fs/bcachefs/subvolume_types.h             |   3 +-
 fs/bcachefs/super.c                       |   2 +
 fs/bcachefs/util.c                        |   1 -
 fs/bcachefs/xattr.c                       |  14 +-
 fs/inode.c                                |   8 -
 fs/super.c                                |   4 +-
 include/linux/fs.h                        |   9 +-
 include/linux/generic-radix-tree.h        | 106 +++++-
 lib/generic-radix-tree.c                  |  80 +---
 39 files changed, 1309 insertions(+), 671 deletions(-)
 create mode 100644 fs/bcachefs/rcu_pending.c
 create mode 100644 fs/bcachefs/rcu_pending.h

Comments

Linus Torvalds Aug. 24, 2024, 1:23 a.m. UTC | #1
On Sat, 24 Aug 2024 at 02:54, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Hi Linus, big one this time...

Yeah, no, enough is enough. The last pull was already big.

This is too big, it touches non-bcachefs stuff, and it's not even
remotely some kind of regression.

At some point "fix something" just turns into development, and this is
that point.

Nobody sane uses bcachefs and expects it to be stable, so every single
user is an experimental site.

The bcachefs patches have become these kinds of "lots of development
during the release cycles rather than before it", to the point where
I'm starting to regret merging bcachefs.

If bcachefs can't work sanely within the normal upstream kernel
release schedule, maybe it shouldn't *be* in the normal upstream
kernel.

This is getting beyond ridiculous.

               Linus
Kent Overstreet Aug. 24, 2024, 2:13 a.m. UTC | #2
On Sat, Aug 24, 2024 at 09:23:00AM GMT, Linus Torvalds wrote:
> On Sat, 24 Aug 2024 at 02:54, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Hi Linus, big one this time...
> 
> Yeah, no, enough is enough. The last pull was already big.
> 
> This is too big, it touches non-bcachefs stuff, and it's not even
> remotely some kind of regression.
> 
> At some point "fix something" just turns into development, and this is
> that point.
> 
> Nobody sane uses bcachefs and expects it to be stable, so every single
> user is an experimental site.

Eh?

Universal consensus has been that bcachefs is _definitely_ more
trustworthy than btrfs, in terms of "will this filesystem ever go
unrecoverable or lose my data" - I've seen many reports of people who've
put it through the same situations where btrfs falls.

I've even seen people compare bcachefs's robustness in positive terms
vs. /xfs/; and that's the result of a *hell* of a lot of work with the
#1 goal of having a robust filesystem that _never_ loses data.

Syzbot dashboard bears this out as well, bcachefs is starting to look
better than btrfs there as well...

(Peanut gallery: Please don't rush out and switch to bcachefs just yet.
I still have a backlog of bugs and issues - some of them serious, as in
your filesystem will go emergency read only - and I don't want people
getting bit. There's still a ton to do; I'm not taking EXPERIMENTAL off
until at least the fuzz testing for on disk corruption is in play).

Look, I've been doing this for a long time, I've had people running my
code in production for a long time, and I'm working with my users on a
daily basis to address issues. I don't throw code over the wall; I do
everything I can to support it and make sure it's working well. 

And - the "srcu held for 10+s warnings" really were bad, there are going
to be a long tail of those that need to be fixed - to get to the rest,
we need the primary causes fixed first.

And when I ship code, I'm _always_ weighing "how much do we want this"
vs. "risk of regression/risk in general" - I'm not just throwing out
whatever I feel like.

Look, this is the filesystem you're all going to want to be running in -
knock on wood - just a year or two, because I'm working to make it
more robust and reliable than xfs and ext4 (and yes, it will be) with
_end to end data integrity_.

We need this. There's still tons of people with "btrfs just crapped
itself and now I'm fucked" horror stories, and running a non
checksumming filesystem is like buying non ECC ram. I've got users with
100+ TB filesystems who trust my code, and I haven't lost anyone's
filesystem who was patient and willing to work with me.

But I've got to get this done, and right now that does mean moving fast
and grinding through a lot of issues.

(again for the peanut gallery: _please_ do not rush to install it yet
unless you are willing and able to report issues, I'll say when the bugs
have been worked through and the hardening is done).
Linus Torvalds Aug. 24, 2024, 2:25 a.m. UTC | #3
On Sat, 24 Aug 2024 at 10:14, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Sat, Aug 24, 2024 at 09:23:00AM GMT, Linus Torvalds wrote:
> > On Sat, 24 Aug 2024 at 02:54, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > >
> > > Hi Linus, big one this time...
> >
> > Yeah, no, enough is enough. The last pull was already big.
> >
> > This is too big, it touches non-bcachefs stuff, and it's not even
> > remotely some kind of regression.
> >
> > At some point "fix something" just turns into development, and this is
> > that point.
> >
> > Nobody sane uses bcachefs and expects it to be stable, so every single
> > user is an experimental site.
>
> Eh?
>
> Universal consensus has been that bcachefs is _definitely_ more
> trustworthy than btrfs,

I'll believe that when there are major distros that use it and you
have lots of varied use.

But it doesn't even change the issue: you aren't fixing a regression,
you are doing new development to fix some old problem, and now you
are literally editing non-bcachefs files too.

Enough is enough.

                   Linus
Kent Overstreet Aug. 24, 2024, 2:33 a.m. UTC | #4
On Sat, Aug 24, 2024 at 10:25:02AM GMT, Linus Torvalds wrote:
> On Sat, 24 Aug 2024 at 10:14, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > On Sat, Aug 24, 2024 at 09:23:00AM GMT, Linus Torvalds wrote:
> > > On Sat, 24 Aug 2024 at 02:54, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > >
> > > > Hi Linus, big one this time...
> > >
> > > Yeah, no, enough is enough. The last pull was already big.
> > >
> > > This is too big, it touches non-bcachefs stuff, and it's not even
> > > remotely some kind of regression.
> > >
> > > At some point "fix something" just turns into development, and this is
> > > that point.
> > >
> > > Nobody sane uses bcachefs and expects it to be stable, so every single
> > > user is an experimental site.
> >
> > Eh?
> >
> > Universal consensus has been that bcachefs is _definitely_ more
> > trustworthy than btrfs,
> 
> I'll believe that when there are major distros that use it and you
> have lots of varied use.

Oh, I'm waiting for that hammer to drop too.

But: all the data we've got so far is that it really is shaping up to be
that solid, there's clearly been big upticks in users as it went
upstream, as distros have been rolling it out, and the uptick in bug
reports hasn't been there.

> But it doesn't even change the issue: you aren't fixing a regression,
> you are doing new development to fix some old problem, and now you
> are literally editing non-bcachefs files too.

What is to be gained by holding back fixes, if we've got every reason to
believe that the fixes are solid?

And yes, these _are_ solid, the rhashtable stuff was done months ago
(minus the deadlock fix, that's more recent), and the rcu_pending stuff
was mostly done months ago as well, and _heavily_ tested (including
using it as replacement backend for kvfree_rcu, which is the eventual
goal there).

And the genradix code is code that I also wrote and maintain, and those
are simple patches.
Linus Torvalds Aug. 24, 2024, 2:35 a.m. UTC | #5
On Sat, 24 Aug 2024 at 10:33, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> What is to be gained by holding back fixes, if we've got every reason to
> believe that the fixes are solid?

What is to be gained by having release rules and a stable development
environment? I wonder.

            Linus
Linus Torvalds Aug. 24, 2024, 2:40 a.m. UTC | #6
On Sat, 24 Aug 2024 at 10:35, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What is to be gained by having release rules and a stable development
> environment? I wonder.

But seriously - thinking that "I changed a thousand lines, there's no
way that introduces new bugs" is the kind of thinking that I DO NOT
WANT TO HEAR from a maintainer.

What planet ARE you from? Stop being obtuse.

           Linus
Kent Overstreet Aug. 24, 2024, 2:47 a.m. UTC | #7
On Sat, Aug 24, 2024 at 10:35:38AM GMT, Linus Torvalds wrote:
> On Sat, 24 Aug 2024 at 10:33, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > What is to be gained by holding back fixes, if we've got every reason to
> > believe that the fixes are solid?
> 
> What is to be gained by having release rules and a stable development
> environment? I wonder.

Sure, which is why I'm not sending you anything here that isn't a fix
for a real issue.

(Ok, technically a few of those, the "missing trans_relock()" fixes are
theoretical, but if they are real then they're bad).
Linus Torvalds Aug. 24, 2024, 2:57 a.m. UTC | #8
On Sat, 24 Aug 2024 at 10:48, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Sure, which is why I'm not sending you anything here that isn't a fix
> for a real issue.

Kent, bugs happen.

The number of bugs that happen in "bug fixes" is in fact quite high.
You should see the stable tree discussions when people get heated
about the regressions introduced by fixes.

This is, for example, why stable has the rule of fixes being small
(which does get violated, but it is at least a goal: "It cannot be
bigger than 100 lines, with context"), because small fixes are easier
to think about and hopefully they have fewer problems of their own.

It's also why my "development happens before the merge window" rule exists.

If you have to do development to fix an old problem, it's for the next
merge window. Exactly because new bugs happen. We want _stability_.

The fixes after the merge window are supposed to be fixes for
regressions, not "oh, I noticed a long-standing problem, and now I'm
fixing that".

But obviously the same kind of logic as for stable trees apply: if
it's a small obvious fix that would be stable material *anyway*, then
there is no reason to wait for the next release and then just put it
in the stable pile.

So I do end up taking small fixes, because at that point it is indeed
a "it wouldn't help to wait" situation.

But your pull requests haven't been "small fixes". And I admit, I've
let it slide. You never saw the last pull request, when I sighed, did
a "git fetch", and went through every commit just to see. And then did
the pull for real.

This time I did the same. And came to the conclusion that no, this was
not a series of small fixes any more.

             Linus
Kent Overstreet Aug. 24, 2024, 2:59 a.m. UTC | #9
On Sat, Aug 24, 2024 at 10:40:33AM GMT, Linus Torvalds wrote:
> On Sat, 24 Aug 2024 at 10:35, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > What is to be gained by having release rules and a stable development
> > environment? I wonder.
> 
> But seriously - thinking that "I changed a thousand lines, there's no
> way that introduces new bugs" is the kind of thinking that I DO NOT
> WANT TO HEAR from a maintainer.
> 
> What planet ARE you from? Stop being obtuse.

Heh.

No, I can't write 1000 lines of bug free code (I think when I was
younger I pulled it off a few times...).

But I do have really good automated testing (I put everything through
lockdep, kasan, ubsan, and other variants now), and a bunch of testers
willing to run my git branches on their crazy (and huge) filesystems.

And enough experience to know when code is likely to be solid and when I
should hold back on it.

Are you seeing a ton of crazy last minute fixes for regressions in my
pull requests? No, there's a few fixes for recent regressions here and
there, but nothing that would cause major regrets. The worst in terms of
needing last minute fixes was the member info btree bitmap stuff, and
the superblock downgrade section... but those we did legitimately need.
Kent Overstreet Aug. 24, 2024, 3:10 a.m. UTC | #10
On Sat, Aug 24, 2024 at 10:57:55AM GMT, Linus Torvalds wrote:
> On Sat, 24 Aug 2024 at 10:48, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Sure, which is why I'm not sending you anything here that isn't a fix
> > for a real issue.
> 
> Kent, bugs happen.

I _know_.

Look, filesystem development is as high stakes as it gets. Normal kernel
development, you fuck up - you crash the machine, you lose some work,
you reboot, people are annoyed but generally it's ok.

In filesystem land, you can corrupt data and not find out about it until
weeks later, or _worse_. I've got stories to give people literal
nightmares. Hell, that stuff has fueled my own nightmares for years. You
know how much grey my beard has now?

Which is why I have spent many years of my life building a codebase and
development process where I can work productively, and where I can not just
catch but recover from pretty much any fuckup imaginable.

Because peace of mind is priceless...
Carl E. Thompson Aug. 24, 2024, 4:22 a.m. UTC | #11
Kent, I'm not a kernel developer I'm just a user that is impressed with bcachefs, uses it on his personal systems, and eagerly waits for new features. I am one of the users who's been using bcachefs for years and has never lost any data using it.

However I am going to be blunt: as someone who designs and builds Linux-based storage servers (well, I used to) as part of their job I would never, ever consider using bcachefs professionally as it is now and the way it appears to be developed currently. It is simply too much changed too fast without any separation between what is currently stable and working for customers and new development. Your work is excellent but **process** is equally and sometimes even more important. Some of the other hats I've worn professionally include as a lead C/C++ developer and as a product release manager so I've learned from very painful experience that large projects absolutely **must** have strict rules for process. I'm sure you realize that. Linus is not being a jerk about this. Just a couple of months ago Linus had to tell you the exact same thing he's telling you again here. And that wasn't the first time. Is your plan to just continue to break the rules and do whatever the heck you want until Linus stops bothering you? I don't think that's a good plan.


Since I'm already being blunt I'm going to be even more blunt: you have a serious problem working with others. In the past and in this thread I've read where you seem to imply that other kernel developers are gatekeeping and resist some of your ideas because you've created something that (in your opinion) is already better in some ways than some of the things they've created. But from where I'm sitting the problems you've experienced are 90% because of **you**. You're an adult and you need to understand that about yourself so you can do something about it.


I get that I've way overstepped my bounds here. If the kernel developers wish to ban me from the kernel lists I understand.

Carl


> On 2024-08-23 7:59 PM PDT Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
>  
> On Sat, Aug 24, 2024 at 10:40:33AM GMT, Linus Torvalds wrote:
> > On Sat, 24 Aug 2024 at 10:35, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > What is to be gained by having release rules and a stable development
> > > environment? I wonder.
> > 
> > But seriously - thinking that "I changed a thousand lines, there's no
> > way that introduces new bugs" is the kind of thinking that I DO NOT
> > WANT TO HEAR from a maintainer.
> > 
> > What planet ARE you from? Stop being obtuse.
> 
> Heh.
> 
> No, I can't write 1000 lines of bug free code (I think when I was
> younger I pulled it off a few times...).
> 
> But I do have really good automated testing (I put everything through
> lockdep, kasan, ubsan, and other variants now), and a bunch of testers
> willing to run my git branches on their crazy (and huge) filesystems.
> 
> And enough experience to know when code is likely to be solid and when I
> should hold back on it.
> 
> Are you seeing a ton of crazy last minute fixes for regressions in my
> pull requests? No, there's a few fixes for recent regressions here and
> there, but nothing that would cause major regrets. The worst in terms of
> needing last minute fixes was the member info btree bitmap stuff, and
> the superblock downgrade section... but those we did legitimately need.
Kent Overstreet Aug. 24, 2024, 11:48 a.m. UTC | #12
On Fri, Aug 23, 2024 at 09:22:55PM GMT, Carl E. Thompson wrote:
> Kent, I'm not a kernel developer I'm just a user that is impressed with bcachefs, uses it on his personal systems, and eagerly waits for new features. I am one of the users who's been using bcachefs for years and has never lost any data using it.
> 
> However I am going to be blunt: as someone who designs and builds Linux-based storage servers (well, I used to) as part of their job I would never, ever consider using bcachefs professionally as it is now and the way it appears to be developed currently. It is simply too much changed too fast without any separation between what is currently stable and working for customers and new development. Your work is excellent but **process** is equally and sometimes even more important. Some of the other hats I've worn professionally include as a lead C/C++ developer and as a product release manager so I've learned from very painful experience that large projects absolutely **must** have strict rules for process. I'm sure you realize that. Linus is not being a jerk about this. Just a couple of months ago Linus had to tell you the exact same thing he's telling you again here. And that wasn't the first time. Is your plan to just continue to break the rules and do whatever the heck you want until

You guys are freaked out because I'm moving quickly and you don't have
visibility into my own internal process, that's all.

I've got a test cluster, a community testing my code before I send it
to Linus, and a codebase that I own and know like the back of my hand
that's stuffed with assertions. And, the changes in question are
algorithmically fairly simple and things that I have excellent test
coverage for. These are all factors that let me say, with confidence,
that there really aren't any bugs in this pull request.

Look, there will always be a natural tension between "strict rules and
processes" vs. "weighing the situations and using your judgement". There
isn't a right or wrong answer as to where on the spectrum we should be,
we just all have to use our brains.

No one is being jerks here, Linus and I are just sitting in different
places with different perspectives. He has a responsibility as someone
managing a huge project to enforce rules as he sees best, while I have a
responsibility to support users with working code, and to do that to the
best of my abilities.