Message ID | 20240318-vfs-fixes-e0e7e114b1d1@brauner (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [GIT,PULL] vfs fixes | expand |
The pull request you sent on Mon, 18 Mar 2024 13:19:54 +0100:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.9-rc1.fixes
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0a7b0acecea273c8816f4f5b0e189989470404cf
Thank you!
On Mon, 18 Mar 2024 at 05:20, Christian Brauner <brauner@kernel.org> wrote: > > * Take a passive reference on the superblock when opening a block device > so the holder is available to concurrent callers from the block layer. So I've pulled this, but I have to admit that I hate it. The bdev "holder" logic is an abomination. And "struct blk_holder_ops" is horrendous. Afaik, we have exactly two cases of "struct blk_holder_ops" in the whole kernel, and you edited one of them. And the other one is in bcachefs, and is a completely empty one with no actual ops, so I think that one shouldn't exist. In other words, we have only *one* actual set of "holder ops". That makes me suspicious in the first place. Now, let's then look at that new "holder->put_holder" use. It has _one_ single user too, which is bd_end_claim(), which is called from one place, which is bdev_release(). Which in turn is called from exactly one place, which is blkdev_release(). Which is the release function for def_blk_fops. Which is called from __fput() on the last release of the file. Fine, fine, fine. So let's chase down *who* actually uses that single "blk_holder_ops". And it turns out that it's used in three places: fs/super.c, fs/ext4/super.c, and fs/xfs/xfs_super.c. So in those three cases, it would be absolutely *wrong* if the 'holder' was anything but the super-block (because that's what the new get/put functions require for any of this to work. This all smells horribly bad to me. The code looks and acts like it is some generic interface, but in reality it really isn't. Yes, bcachefs seems to make up some random holder (it's a one-byte kmalloc that isn't actually used), and a random holder op structure (it's empty, as mentioned), but none of this makes any sense at all. I get the feeling that the "get/put" operations should just be done in the three places that currently use that 'fs_holder_ops'. IOW, isn't the 'get()' always basically paired with the mounting? And the 'put()' would probably be best done iin kill_block_super()? I don't know. Maybe I missed something really important, but this smells like a specific case that simply shouldn't have gotten this kind of "generic infrastructure" solution. Linus
On Mon, 18 Mar 2024 at 12:14, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > IOW, isn't the 'get()' always basically paired with the mounting? And > the 'put()' would probably be best done iin kill_block_super()? .. or alternative handwavy approach: The fundamental _reason_ for the ->get/put seems to be to make the 'holder' lifetime be at least as long as the 'struct file' it is associated with. No? So how about we just make the 'holder' always *be* a 'struct file *'? That (a) gets rid of the typeless 'void *' part (b) is already what it is for normal cases (ie O_EXCL file opens). wouldn't it be lovely if we just made the rule be that 'holder' *is* the file pointer, and got rid of a lot of typeless WTF code? Again, this comment (and the previous email) is more based on "this does not feel right to me" than anything else. That code just makes my skin itch. I can't say it's _wrong_, but it just FeelsWrongToMe(tm). Linus
On Mon, Mar 18, 2024 at 12:41:49PM -0700, Linus Torvalds wrote: > On Mon, 18 Mar 2024 at 12:14, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > IOW, isn't the 'get()' always basically paired with the mounting? And > > the 'put()' would probably be best done iin kill_block_super()? > > .. or alternative handwavy approach: > > The fundamental _reason_ for the ->get/put seems to be to make the > 'holder' lifetime be at least as long as the 'struct file' it is > associated with. No? > > So how about we just make the 'holder' always *be* a 'struct file *'? That > > (a) gets rid of the typeless 'void *' part > > (b) is already what it is for normal cases (ie O_EXCL file opens). > > wouldn't it be lovely if we just made the rule be that 'holder' *is* > the file pointer, and got rid of a lot of typeless WTF code? > > Again, this comment (and the previous email) is more based on "this > does not feel right to me" than anything else. > > That code just makes my skin itch. I can't say it's _wrong_, but it > just FeelsWrongToMe(tm). So, initially I think the holder ops were intended to be generic by Christoph but I agree that it's probably not needed. I just didn't massage that code yet. Now on my todo for this cycle!
> > Again, this comment (and the previous email) is more based on "this > > does not feel right to me" than anything else. > > > > That code just makes my skin itch. I can't say it's _wrong_, but it > > just FeelsWrongToMe(tm). > > So, initially I think the holder ops were intended to be generic by > Christoph but I agree that it's probably not needed. I just didn't > massage that code yet. Now on my todo for this cycle! So, the block holder ops will gain additional implementers in the block layer that will implement their own separate ops. So I trust the block layer with this. The holder is used to determine whether a block device can be reopened. So both for internal (mounting, log device initialization) or userspace opens we compare the holders of the block device. We do have allowed for quite some time to open the same block device exclusively with different flags. So there are multiple files open to the same block device and the holder is used as proof that it can be reopened. So always using the file as the holder would still mean that we have to compare file->private_data to determine whether the block device can be reopened. So it won't get us as much as we'd want. The reason for the holder to remain valid is that the block layer does have ioctl operations such as removal of a device in the case of nbd, suspend and resume used in stuff like cryptsetup. In all such cases we go from arbitrary block device to arbitrary holder and then inform them about the operation calling the appropriate callback. So we would still have to guarantee the validity of the holder in file->private_data. There are also two internal codepaths where the block device is temporarly marked as being in the process of being claimed. This will cause actual openers to wait until bd_holder is really set or aborted but not fail the actual open. This has traditionally been the case in the loop code and during user initiated and internally triggered partition scanning. That could be reworked but would be pretty ugly. We'll continue considering additional cleanups and latest next merge window I'll give you a detailed write up what happened.