Message ID: 20241024025142.4082218-1-david@fromorbit.com (mailing list archive)
Series: xfs: sparse inodes overlap end of filesystem
On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> We have had a large number of recent reports about cloud filesystems
> with "corrupt" inode records. They are all the same, and feature a
> filesystem that has been grown from a small size to a larger size
> (to 30G or 50G). In all cases, they have a very small runt AG at the
> end of the filesystem. In the case of the 30GB filesystems, this is
> 1031 blocks long.
>
> These filesystems start issuing corruption warnings when trying to
> allocate an inode in a sparse cluster at block 1024 of the runt AG.
> At this point, there shouldn't be a sparse inode cluster because
> there isn't space to fit an entire inode chunk (8 blocks) at block
> 1024. i.e. it is only 7 blocks from the end of the AG.
>
> Hence the first bug is that we allowed allocation of a sparse inode
> cluster in this location when it should not have occurred. The first
> patch in the series addresses this.
>
> However, there is actually nothing corrupt in the on-disk sparse
> inode record or inode cluster at agbno 1024. It is a 32 inode
> cluster, which means it is 4 blocks in length, so sits entirely
> within the AG and every inode in the record is addressable and
> accessible. The only thing we can't do is make the sparse inode
> record whole - the inode allocation code cannot allocate another 4
> blocks that span beyond the end of the AG. Hence this inode record
> and cluster remain sparse until all the inodes in it are freed and
> the cluster removed from disk.
>
> The second bug is that we don't consider inodes beyond inode cluster
> alignment at the end of an AG as being valid. When sparse inode
> alignment is in use, we set the in-memory inode cluster alignment to
> match the inode chunk alignment, and so the maximum valid inode
> number is inode chunk aligned, not inode cluster aligned. Hence when
> we have an inode cluster at the end of the AG - so the max inode
> number is cluster aligned - we reject that entire cluster as being
> invalid.
>
> As stated above, there is nothing corrupt about the sparse inode
> cluster at the end of the AG, it just doesn't match an arbitrary
> alignment validation restriction for inodes at the end of the AG.
> Given we have production filesystems out there with sparse inode
> clusters allocated with cluster alignment at the end of the AG, we
> need to consider these inodes as valid and not error out with a
> corruption report. The second patch addresses this.
>
> The third issue I found is that we never validate the
> sb->sb_spino_align value when we mount the filesystem. It could have
> any value and we just blindly use it when calculating inode
> allocation geometry. The third patch adds sb->sb_spino_align range
> validation to the superblock verifier.
>
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
>
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
>
> Thoughts?

This all sounds reasonable on its own if the corruption is essentially
artificial and there is a path for code simplification, etc. That said,
I think there's a potential counter argument for skipping patch 1
though. A couple of things come to mind:

1. When this corrupted inode chunk allocation does occur, is it
possible to actually allocate an inode out of it, or does the error
checking logic prevent that? My sense was the latter, but I could be
wrong. This generally indicates whether user data is impacted or not
if repair resolves by tossing the chunk.

2. Would we recommend a user upgrade to a new kernel with a corruption
present that causes inode allocation failure? My .02: under no
circumstances would I run a distro/package upgrade on a filesystem in
that state before running repair, nor would I recommend that to anyone
else.

The caveat to this is that even after a repair, there's no guarantee
an upgrade wouldn't go and realloc the same bad chunk and end up right
back in the same state, and thus fail just the same. For that reason,
I'm not sure we can achieve a reliable workaround via a kernel change
on its own.

I'm wondering if this requires something on the repair side that
either recommends growing the fs by a few blocks, or perhaps if it
finds this "unaligned runt" situation, actively does something to
prevent it. For example, I assume allocating the last handful of
blocks out of the runt AG would prevent the problem. Of course that
technically creates another corruption by leaking blocks, but as long
as repair knows to keep it in place so long as the fs geometry is
susceptible, perhaps that would work..?

Hmm.. if those blocks are free then maybe a better option would be to
just truncate the last few blocks off the runt AG (i.e. effectively
reduce the fs size by the size of the sparse chunk allocation), then
the fs could be made consistent and functional. Hm?

Brian
On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
[...]
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...

Yeah, I think we can only adjust the codebase to allow chunks that
cross EOAG as long as the clusters don't.

> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.

Do xfs_repair/scrub trip over these sparse chunks that cross EOAG,
or are they ok?

--D
On Thu, Oct 24, 2024 at 09:20:39AM -0400, Brian Foster wrote:
> On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
[...]
> This all sounds reasonable on its own if the corruption is essentially
> artificial and there is a path for code simplification, etc.
> That said, I think there's a potential counter argument for skipping
> patch 1 though. A couple of things come to mind:
>
> 1. When this corrupted inode chunk allocation does occur, is it
> possible to actually allocate an inode out of it, or does the error
> checking logic prevent that?

The error checking during finobt lookup prevents inode allocation.
i.e. it fails after finobt lookup via the path:

  xfs_dialloc_ag_finobt_newino()
    xfs_inobt_get_rec()
      xfs_inobt_check_irec()
        if (!xfs_verify_agino(pag, irec->ir_startino))
                return __this_address;

before any modifications are made. Hence the transaction is clean
when cancelled, and the error is propagated cleanly back out to
userspace.

> 2. Would we recommend a user upgrade to a new kernel with a corruption
> present that causes inode allocation failure?

I'm not making any sort of recommendations on how downstream distros
handle system recovery from this issue. System recovery once the
problem has manifested is a separate problem - what we are concerned
about here is modifying the kernel code to:

a) prevent the issue from happening again; and
b) ensure that existing filesystems with this latent issue on disk
   don't throw corruption errors in future.

How we get that kernel onto downstream distro systems is largely a
distro support issue. Downstream distros can:

- run the gauntlet and upgrade in place, relying on the transactional
  upgrade behaviour of the package manager to handle rollbacks when
  file creates fail during installation; or
- grow the cloud block device and filesystem to put the inode cluster
  wholly within the bounds of the AG and so pass the alignment checks
  without any kernel modifications, then do the kernel upgrade; or
- live patch the running kernel to allow the sparse inode cluster to
  be considered valid (as per the second patch in this series) so it
  won't ever fail whilst installing the kernel upgrade; or
- do some xfs_db magic to remove the finobt record that is
  problematic and leak the sparse inode cluster so we never try to
  allocate inodes from it, then do the kernel upgrade and run
  xfs_repair; or
- do some xfs_db magic to remove the finobt record and shrink the
  filesystem down a few blocks to inode chunk align it; or
- something else...

IOWs, there are lots of ways that the downstream distros can
mitigate the problem sufficiently to install an updated kernel that
won't have the problem ever again.

> My .02: under no circumstances would I run a distro/package upgrade
> on a filesystem in that state before running repair, nor would I
> recommend that to anyone else.

Keep in mind that disallowing distro/package managers to run in this
situation also rules out shipping a new xfsprogs package to address
the issue. i.e. if the distro won't allow kernel package upgrades,
then they won't allow xfsprogs package upgrades, either.

There aren't any filesystem level problems with allowing operation
to continue on the filesystem with a sparse inode chunk like this in
the runt AG. Yes, file create will fail with an error every so
often, but there aren't any issues beyond that. The "corruption"
can't propagate, it won't shut down the filesystem, and errors are
returned to userspace when it is hit. There is no danger to other
filesystem metadata or user data from this issue.

Hence I don't see any issues with simply installing new packages and
rebooting to make the problem go away...

> The caveat to this is that even after a repair, there's no guarantee
> an upgrade wouldn't go and realloc the same bad chunk and end up
> right back in the same state, and thus fail just the same.

Sure, but this can be said about every single sparse inode enabled
filesystem that has an unaligned end of the runt AG, whether the
problem has manifested or not. There are going to be *lots* of
filesystems out there with this potential problem just waiting to be
tripped over.

i.e. the scope of this latent issue has the potential to affect a
very large number of filesystems. Hence saying that we can't upgrade
-anything- on <some large subset> of sparse inode enabled filesystems
because they *might* fail with this problem doesn't make a whole lot
of sense to me....

> For example, I assume allocating the last handful of blocks out of
> the runt AG would prevent the problem. Of course that technically
> creates another corruption by leaking blocks, but as long as repair
> knows to keep it in place so long as the fs geometry is susceptible,
> perhaps that would work..?

I think that would become an implicit change of on-disk format.
scrub would need to know about it, as would online repair. growfs
will need to handle the case explicitly (is the trailing runt space
on this filesystem leaked or not?), and so on. I don't see this as a
viable solution.

-Dave.
On Thu, Oct 24, 2024 at 09:38:04AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
> > I'd prefer that we take the latter path: ignore the first patch.
> > This results in more flexible behaviour, allows existing filesystems
> > with this issue to work without needing xfs_repair to fix them, and
> > we get to remove complexity from the code.
>
> Do xfs_repair/scrub trip over these sparse chunks that cross EOAG,
> or are they ok?

Repair trips over them and removes them (because they fail alignment
checks). I haven't looked to see if it needs anything more than to
consider the alignment of these sparse chunks as being ok. I suspect
the same for scrub - as long as the alignment checks out, it shouldn't
need to fail the cluster...

-Dave.
On Fri, Oct 25, 2024 at 11:48:15AM +1100, Dave Chinner wrote:
> On Thu, Oct 24, 2024 at 09:20:39AM -0400, Brian Foster wrote:
> > On Thu, Oct 24, 2024 at 01:51:02PM +1100, Dave Chinner wrote:
[...]
> > > I'd prefer that we take the latter path: ignore the first patch.
> > > This results in more flexible behaviour, allows existing filesystems
> > > with this issue to work without needing xfs_repair to fix them, and
> > > we get to remove complexity from the code.
> > >
> > > Thoughts?
> >
> > This all sounds reasonable on its own if the corruption is essentially
> > artificial and there is a path for code simplification, etc. That
> > said, I think there's a potential counter argument for skipping
> > patch 1 though. A couple of things come to mind:
> >
> > 1. When this corrupted inode chunk allocation does occur, is it
> > possible to actually allocate an inode out of it, or does the error
> > checking logic prevent that?
>
> The error checking during finobt lookup prevents inode allocation.
> i.e. it fails after finobt lookup via the path:
>
>   xfs_dialloc_ag_finobt_newino()
>     xfs_inobt_get_rec()
>       xfs_inobt_check_irec()
>         if (!xfs_verify_agino(pag, irec->ir_startino))
>                 return __this_address;
>
> before any modifications are made. Hence the transaction is clean
> when cancelled, and the error is propagated cleanly back out to
> userspace.

Ok, so data is not at risk. That is good to know.

> > 2. Would we recommend a user upgrade to a new kernel with a corruption
> > present that causes inode allocation failure?
>
> I'm not making any sort of recommendations on how downstream distros
> handle system recovery from this issue.
[...]
> Downstream distros can:
>
> - run the gauntlet and upgrade in place, relying on the transactional
>   upgrade behaviour of the package manager to handle rollbacks when
>   file creates fail during installation; or
> - grow the cloud block device and filesystem to put the inode cluster
>   wholly within the bounds of the AG and so pass the alignment checks
>   without any kernel modifications, then do the kernel upgrade; or
> - live patch the running kernel to allow the sparse inode cluster to
>   be considered valid (as per the second patch in this series) so it
>   won't ever fail whilst installing the kernel upgrade; or
> - do some xfs_db magic to remove the finobt record that is
>   problematic and leak the sparse inode cluster so we never try to
>   allocate inodes from it, then do the kernel upgrade and run
>   xfs_repair; or
> - do some xfs_db magic to remove the finobt record and shrink the
>   filesystem down a few blocks to inode chunk align it; or
> - something else...
>
> IOWs, there are lots of ways that the downstream distros can
> mitigate the problem sufficiently to install an updated kernel that
> won't have the problem ever again.

IOOW, we agree that something might be necessary on the userspace side
to fully support users/distros out of this problem. Given that the
last couple of items on the list don't share some of the constraints
or external dependencies of the others, and pretty much restate the
couple of ideas proposed in my previous mail, they seem like the most
broadly useful options to me.

> > My .02: under no circumstances would I run a distro/package upgrade
> > on a filesystem in that state before running repair, nor would I
> > recommend that to anyone else.
>
> Keep in mind that disallowing distro/package managers to run in this
> situation also rules out shipping a new xfsprogs package to address
> the issue. i.e. if the distro won't allow kernel package upgrades,
> then they won't allow xfsprogs package upgrades, either.

I don't know why you would expect otherwise. I assume this would
require a boot/recovery image or emergency boot mode with a binary
external to the filesystem that needs repair.

> There aren't any filesystem level problems with allowing operation
> to continue on the filesystem with a sparse inode chunk like this in
> the runt AG. Yes, file create will fail with an error every so
> often, but there aren't any issues beyond that. The "corruption"
> can't propagate, it won't shut down the filesystem, and errors are
> returned to userspace when it is hit. There is no danger to other
> filesystem metadata or user data from this issue.
>
> Hence I don't see any issues with simply installing new packages and
> rebooting to make the problem go away...

Nor do I, so long as the upgrade doesn't fail and cancel out on inode
allocation failures.

> > The caveat to this is that even after a repair, there's no guarantee
> > an upgrade wouldn't go and realloc the same bad chunk and end up
> > right back in the same state, and thus fail just the same.
>
> Sure, but this can be said about every single sparse inode enabled
> filesystem that has an unaligned end of the runt AG, whether the
> problem has manifested or not. There are going to be *lots* of
> filesystems out there with this potential problem just waiting to be
> tripped over.
>
> i.e. the scope of this latent issue has the potential to affect a
> very large number of filesystems. Hence saying that we can't upgrade
> -anything- on <some large subset> of sparse inode enabled filesystems
> because they *might* fail with this problem doesn't make a whole lot
> of sense to me....

That doesn't make sense to me either. I'd just upgrade in that case.
If the problem hasn't manifested already, it seems unlikely to occur
before the fixed kernel is installed.

> > For example, I assume allocating the last handful of blocks out of
> > the runt AG would prevent the problem. Of course that technically
> > creates another corruption by leaking blocks, but as long as repair
> > knows to keep it in place so long as the fs geometry is susceptible,
> > perhaps that would work..?
>
> I think that would become an implicit change of on-disk format.
> scrub would need to know about it, as would online repair. growfs
> will need to handle the case explicitly (is the trailing runt space
> on this filesystem leaked or not?), and so on. I don't see this as a
> viable solution.

Then just make it stateless and leak the blocks as a transient means
to facilitate upgrade and recovery, as suggested above.

Brian

> -Dave.
On 10/23/24 9:51 PM, Dave Chinner wrote:
> There is one question that needs to be resolved in this patchset: if
> we take patch 2 to allow sparse inodes at the end of the AG, why
> would we need the change in patch 1? Indeed, at this point I have to
> ask why we even need the min/max agbno guidelines to the inode chunk
> allocation as we end up allowing any aligned location in the AG to
> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> unnecessary and now we can remove a bunch of code (min/max_agbno
> constraints) from the allocator paths...
>
> I'd prefer that we take the latter path: ignore the first patch.
> This results in more flexible behaviour, allows existing filesystems
> with this issue to work without needing xfs_repair to fix them, and
> we get to remove complexity from the code.
>
> Thoughts?

For some reason I'm struggling to grasp some of the details here, so
maybe I can just throw out a "what I think should happen" type
response.

A concern is that older xfs_repair binaries will continue to see
inodes in this region as corrupt, and throw them away, IIUC - even if
the kernel is updated to handle them properly.

Older xfs_repair could be encountered on rescue CDs/images, maybe even
in initramfs environments, by virt hosts managing guest filesystems,
etc.

So it seems to me that it would be worth it to prevent any new inode
allocations in this region going forward, even if we *can* make it
work, so that we won't continue to generate what looks like corruption
to older userspace.

That might not be the most "pure" upstream approach, but as a
practical matter I think it might be a better outcome for users and
support orgs... even if distros update kernels & userspace together,
that does not necessarily prevent older userspace from encountering a
filesystem with inodes in this range and trashing them.

Thanks,
-Eric
On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
> On 10/23/24 9:51 PM, Dave Chinner wrote:
[...]
> So it seems to me that it would be worth it to prevent any new inode
> allocations in this region going forward, even if we *can* make it
> work, so that we won't continue to generate what looks like corruption
> to older userspace.
>
> That might not be the most "pure" upstream approach, but as a
> practical matter I think it might be a better outcome for users and
> support orgs... even if distros update kernels & userspace together,
> that does not necessarily prevent older userspace from encountering a
> filesystem with inodes in this range and trashing them.
> I'm inclined to agree with Eric here as preventing the sparse inodes to be allocated at the edge of the runt AG sounds the most reasonable approach to me. It just seems to me yet another corner case to deal with for very little benefit, i.e to enable a few extra inodes, on a FS that seems to be in life support regarding space for new inodes, whether it's a distro kernel or upstream kernel. It kind of seem risky to me, to allow users to run a new kernel, allocate inodes there, fill those inodes with data, just to run a not yet ready xfs_repair, and discard everything there. Just seems like a possible data loss vector. Unless - and I'm not sure how reasonable it is -, we first release a new xfsprogs, preventing xfs_repair to rip off those inodes, and later update the kernel. But this will end up on users hitting a -EFSCORRUPTED every attempt to allocate inodes from the FS edge. How feasible would be to first prevent inodes to be allocated at the runt AG's edge, let it sink for a while, and once we have a fixed xfs_repair for some time, we then enable inode allocation on the edge, giving enough time for users to have a newer xfs_repair? Again, I'm not sure it it does make sense at all, hopefully it does. Carlos
On 10/31/24 6:44 AM, Carlos Maiolino wrote:
> On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
>> On 10/23/24 9:51 PM, Dave Chinner wrote:
>>> There is one question that needs to be resolved in this patchset: if
>>> we take patch 2 to allow sparse inodes at the end of the AG, why
>>> would we need the change in patch 1? Indeed, at this point I have to
>>> ask why we even need the min/max agbno guidelines to the inode chunk
>>> allocation as we end up allowing any aligned location in the AG to
>>> be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
>>> unnecessary and now we can remove a bunch of code (min/max_agbno
>>> constraints) from the allocator paths...
>>>
>>> I'd prefer that we take the latter path: ignore the first patch.
>>> This results in more flexible behaviour, allows existing filesystems
>>> with this issue to work without needing xfs_repair to fix them, and
>>> we get to remove complexity from the code.
>>>
>>> Thoughts?
>>
>> For some reason I'm struggling to grasp some of the details here, so
>> maybe I can just throw out a "what I think should happen" type response.
>>
>> A concern is that older xfs_repair binaries will continue to see
>> inodes in this region as corrupt and throw them away, IIUC - even
>> if the kernel is updated to handle them properly.
>>
>> Older xfs_repair could be encountered on rescue CDs/images, maybe
>> even in initramfs environments, by virt hosts managing guest
>> filesystems, etc.
>>
>> So it seems to me that it would be worth it to prevent any new inode
>> allocations in this region going forward, even if we *can* make it
>> work, so that we won't continue to generate what looks like corruption
>> to older userspace.
>>
>> That might not be the most "pure" upstream approach, but as a
>> practical matter I think it might be a better outcome for users and
>> support orgs... even if distros update kernels & userspace together,
>> that does not necessarily prevent older userspace from encountering a
>> filesystem with inodes in this range and trashing them.
>
> I'm inclined to agree with Eric here, as preventing sparse inodes from
> being allocated at the edge of the runt AG sounds like the most
> reasonable approach to me.
>
> It just seems to me yet another corner case to deal with for very
> little benefit, i.e. to enable a few extra inodes, on a FS that seems
> to be on life support regarding space for new inodes, whether it's a
> distro kernel or an upstream kernel.
>
> It kind of seems risky to me to allow users to run a new kernel,
> allocate inodes there, fill those inodes with data, only to then run a
> not-yet-fixed xfs_repair and have it discard everything there. It just
> seems like a possible data loss vector.
>
> Unless - and I'm not sure how reasonable it is - we first release a
> new xfsprogs, preventing xfs_repair from ripping out those inodes, and
> later update the kernel. But this will end up with users hitting
> -EFSCORRUPTED on every attempt to allocate inodes from the FS edge.
>
> How feasible would it be to first prevent inodes from being allocated
> at the runt AG's edge, let it sink in for a while, and once a fixed
> xfs_repair has been out for some time, then enable inode allocation on
> the edge, giving users enough time to get a newer xfs_repair?
>
> Again, I'm not sure if it makes sense at all; hopefully it does.

I think Dave agrees with all this too, and I may have simply
misunderstood the proposal. Paraphrasing a side convo w/ Dave, it
seems to make sense to have 3 steps for the fix:

1) Stop allowing inode allocations in these blocks (kernel)
2) Treat already-allocated inodes in these blocks as valid (kernel+userspace)
3) Re-enable inode allocations to these blocks (kernel)

Distros can pick up 1) and 2), and skip 3) if that seems prudent, to
avoid problems with older userspace.

I guess I still worry a little about upstream/non-distro use after
applying 3), but it's the technically correct path forward.

-Eric
On Thu, Oct 31, 2024 at 12:44:12PM +0100, Carlos Maiolino wrote:
> On Tue, Oct 29, 2024 at 11:14:18AM -0500, Eric Sandeen wrote:
> > On 10/23/24 9:51 PM, Dave Chinner wrote:
> > > There is one question that needs to be resolved in this patchset: if
> > > we take patch 2 to allow sparse inodes at the end of the AG, why
> > > would we need the change in patch 1? Indeed, at this point I have to
> > > ask why we even need the min/max agbno guidelines to the inode chunk
> > > allocation as we end up allowing any aligned location in the AG to
> > > be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
> > > unnecessary and now we can remove a bunch of code (min/max_agbno
> > > constraints) from the allocator paths...
> > >
> > > I'd prefer that we take the latter path: ignore the first patch.
> > > This results in more flexible behaviour, allows existing filesystems
> > > with this issue to work without needing xfs_repair to fix them, and
> > > we get to remove complexity from the code.
> > >
> > > Thoughts?
> >
> > For some reason I'm struggling to grasp some of the details here, so
> > maybe I can just throw out a "what I think should happen" type response.
> >
> > A concern is that older xfs_repair binaries will continue to see
> > inodes in this region as corrupt and throw them away, IIUC - even
> > if the kernel is updated to handle them properly.
> >
> > Older xfs_repair could be encountered on rescue CDs/images, maybe
> > even in initramfs environments, by virt hosts managing guest
> > filesystems, etc.
> >
> > So it seems to me that it would be worth it to prevent any new inode
> > allocations in this region going forward, even if we *can* make it
> > work, so that we won't continue to generate what looks like corruption
> > to older userspace.
> >
> > That might not be the most "pure" upstream approach, but as a
> > practical matter I think it might be a better outcome for users and
> > support orgs... even if distros update kernels & userspace together,
> > that does not necessarily prevent older userspace from encountering a
> > filesystem with inodes in this range and trashing them.
>
> I'm inclined to agree with Eric here, as preventing sparse inodes from
> being allocated at the edge of the runt AG sounds like the most
> reasonable approach to me.
>
> It just seems to me yet another corner case to deal with for very
> little benefit, i.e. to enable a few extra inodes, on a FS that seems
> to be on life support regarding space for new inodes, whether it's a
> distro kernel or an upstream kernel.
>
> It kind of seems risky to me to allow users to run a new kernel,
> allocate inodes there, fill those inodes with data, only to then run a
> not-yet-fixed xfs_repair and have it discard everything there. It just
> seems like a possible data loss vector.

I agree. I think we have to fix the fsck/validation code to allow
inode clusters that go right up to EOAG on a runt AG, because current
kernels write out filesystems that way. I also think we have to take
the other patch that prevents the inode allocator from creating a
chunk that crosses EOAG, so that unpatched xfs_repairs won't trip over
newer filesystems.

> Unless - and I'm not sure how reasonable it is - we first release a
> new xfsprogs, preventing xfs_repair from ripping out those inodes, and
> later update the kernel. But this will end up with users hitting
> -EFSCORRUPTED on every attempt to allocate inodes from the FS edge.
>
> How feasible would it be to first prevent inodes from being allocated
> at the runt AG's edge, let it sink in for a while, and once a fixed
> xfs_repair has been out for some time, then enable inode allocation on
> the edge, giving users enough time to get a newer xfs_repair?
>
> Again, I'm not sure if it makes sense at all; hopefully it does.

I don't think we can ever re-enable inode allocation on the edge, no
matter how much time goes by.

--D
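The position Darrick converges on pairs two complementary checks: the
allocator rejects any new chunk that would cross EOAG, while validation
accepts already-allocated clusters that run right up to it. A rough
sketch of the two rules with illustrative helper names (not the actual
XFS functions):

```python
# Two complementary policies for the runt-AG edge, as discussed in the
# thread. Helper names and the plain-integer model are hypothetical;
# the real checks live in the XFS inode allocator and verifiers.

FULL_CHUNK_BLOCKS = 8  # a full inode chunk always spans 8 blocks

def allocator_allows_chunk(agbno, ag_blocks):
    # New allocations: the whole chunk must fit inside the AG, so newer
    # kernels stop producing records that older xfs_repair binaries
    # would flag as corrupt and throw away.
    return agbno + FULL_CHUNK_BLOCKS <= ag_blocks

def verifier_accepts_cluster(agbno, cluster_blocks, ag_blocks):
    # Existing records: valid as long as the cluster itself sits inside
    # the AG, even though the full chunk could never have fit there.
    return agbno + cluster_blocks <= ag_blocks

# The record from the bug reports: a 4-block sparse cluster at
# agbno 1024 in a 1031-block runt AG.
assert not allocator_allows_chunk(1024, 1031)   # never create new ones
assert verifier_accepts_cluster(1024, 4, 1031)  # but honour existing ones
```

The asymmetry is the point: the verifier rule is strictly looser than the
allocator rule, so existing filesystems keep working while no new
edge-of-AG records are ever written.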