Message ID | 20230915183848.1018717-1-kernel@pankajraghav.com (mailing list archive) |
---|---|
Series | Enable block size > page size in XFS |
On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> Only XFS was enabled and tested as a part of this series as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
> that doesn't depend on buffer-heads and supports larger block sizes
> already should be able to leverage this effort to also support LBS,
> bs > ps.

I think you should choose whether you're going to use 'bs > ps' or LBS
and stick to it.  They're both pretty inscrutable and using both
interchangeably is worse.

But I think filesystems which use buffer_heads should be fine to support
bs > ps.  The problems with the buffer cache are really when you try to
support small block sizes and large folio sizes (eg arrays of bhs on
the stack).  Supporting bs == folio_size shouldn't be a problem.
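The point about small block sizes and large folios can be made concrete with a little arithmetic. The following is only an illustrative sketch, not kernel code: the helper names are hypothetical, and it assumes one buffer head per filesystem block within a folio, with an on-stack array holding one pointer per block.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): number of filesystem blocks,
 * and therefore buffer heads, needed to cover one folio.
 */
static inline size_t blocks_per_folio(size_t folio_size, size_t block_size)
{
	assert(block_size != 0 && folio_size % block_size == 0);
	return folio_size / block_size;
}

/*
 * Hypothetical helper: bytes of stack consumed by an on-stack array
 * holding one pointer per block in the folio.
 */
static inline size_t bh_array_stack_bytes(size_t folio_size, size_t block_size)
{
	return blocks_per_folio(folio_size, block_size) * sizeof(void *);
}
```

With bs == folio_size the count is 1, so the array is a single pointer. With a 2 MiB folio and 512-byte blocks (numbers chosen purely for illustration) it is 4096 pointers, which is the kind of on-stack array the reply above is warning about.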
On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There have been efforts over the last 16 years to enable Large
> Block Sizes (LBS), that is block sizes in filesystems where bs > page
> size [1] [2]. Through these efforts we have learned that one of the
> main blockers to supporting bs > ps in filesystems has been a way to
> allocate pages that are at least the filesystem block size on the page
> cache where bs > ps [3]. Another blocker was changed in filesystems due to
> buffer-heads. Thanks to these previous efforts, the surgery by Matthew
> Wilcox in the page cache for adopting xarray's multi-index support, and
> iomap support, supporting bs > ps in XFS is possible with only a few
> lines of change to XFS. Most of the changes are to the page cache, to add minimum
> order folio support for the target block size on the filesystem.
>
> A new motivation for LBS today is to support high-capacity (large amount
> of Terabytes) QLC SSDs where the internal Indirection Unit (IU) is
> typically greater than 4k [4] to help reduce DRAM and so in turn cost
> and space. In practice this then allows different architectures to use a
> base page size of 4k while still enabling support for block sizes
> aligned to the larger IUs by relying on high order folios on the page
> cache when needed. It also enables taking advantage of these same
> drives' support for larger atomics than 4k with buffered IO support in
> Linux. As described this year at LSFMM, supporting large atomics greater
> than 4k enables databases to remove the need to rely on their own
> journaling, so they can disable double buffered writes [5], which is a
> feature different cloud providers are already innovating and enabling
> customers for through custom storage solutions.
>
> This series still needs some polishing and fixing some crashes, but it is
> mainly targeted to get initial feedback from the community and enable initial
> experimentation, hence the RFC. It's being posted now given the results from
> our testing are proving much better than expected and we hope to
> polish this up together with the community. After all, this has been a 16
> year old effort and none of this could have been possible without that effort.
>
> Implementation:
>
> This series only adds the notion of a minimum order of a folio in the
> page cache that was initially proposed by Willy. The minimum folio order
> requirement is set during inode creation. The minimum order will
> typically correspond to the filesystem block size. The page cache will
> in turn respect the minimum folio order requirement while allocating a
> folio. This series mainly changes the page cache's filemap, readahead, and
> truncation code to allocate and align the folios to the minimum order set for the
> filesystem's inode's respective address space mapping.
>
> Only XFS was enabled and tested as a part of this series as it has
> supported block sizes up to 64k and sector sizes up to 32k for years.
> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
> that doesn't depend on buffer-heads and supports larger block sizes
> already should be able to leverage this effort to also support LBS,
> bs > ps.
>
> This also paves the way for supporting block devices where their logical
> block size > page size in the future by leveraging iomap's address space
> operation added to the block device cache by Christoph Hellwig [6]. We
> have work to enable support for this, enabling LBAs > 4k on NVMe, and
> at the same time allow coexistence with buffer-heads on the same block
> device, so as to allow a drive to
> switch between filesystems which may depend on buffer-heads or need the
> iomap address space operations for the block device cache. Patches for
> this will be posted shortly after this patch series.

Do you have a git tree branch that I can pull this from
somewhere?

As it is, I'd really prefer stuff that adds significant XFS
functionality that we need to test to be based on a current Linus
TOT kernel so that we can test it without being impacted by all
the random unrelated breakages that regularly happen in linux-next
kernels....

-Dave.
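The cover letter says the minimum folio order "will typically correspond to the filesystem block size" and that the page cache aligns folios to that order. A minimal sketch of that relationship follows. The function names are hypothetical and this is not the code from the series; it assumes power-of-two block and page sizes and simply picks the smallest order whose folio covers one block.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch series): smallest order such
 * that a folio of (page_size << order) bytes covers one filesystem block.
 * Block sizes at or below the page size need no minimum, so order is 0.
 */
static inline unsigned int min_folio_order(size_t block_size, size_t page_size)
{
	unsigned int order = 0;

	/* Both sizes are assumed to be non-zero powers of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	while ((page_size << order) < block_size)
		order++;
	return order;
}

/*
 * Hypothetical helper: round a page cache index down to the first index
 * of the minimum-order folio that contains it.
 */
static inline unsigned long align_index_down(unsigned long index,
					     unsigned int order)
{
	return index & ~((1UL << order) - 1);
}
```

Under these assumptions, a 16k block size on a 4k page size system gives order 2 (four pages per folio), and a 64k block size gives order 4 (sixteen pages per folio). A lookup at page index 7 with order 2 would be aligned down to index 4.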
On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > [...]
>
> Do you have a git tree branch that I can pull this from
> somewhere?
>
> As it is, I'd really prefer stuff that adds significant XFS
> functionality that we need to test to be based on a current Linus
> TOT kernel so that we can test it without being impacted by all
> the random unrelated breakages that regularly happen in linux-next
> kernels....

That's understandable! I just rebased onto Linus' tree, this only
has the bs > ps support on 4k sector size:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev

I just did a cursory build / boot / fsx with 16k block size / 4k sector size
test with this tree only. I haven't run fstests on it.

Just a heads up, using 512 byte sector size will fail for now, it's a
regression we have to fix. Likewise using block sizes 1k, 2k will also
regress on fsx right now. These are regressions we are aware of but
haven't had time yet to bisect / fix.

  Luis
On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > >
> > > [...]
> >
> > Do you have a git tree branch that I can pull this from
> > somewhere?
> >
> > As it is, I'd really prefer stuff that adds significant XFS
> > functionality that we need to test to be based on a current Linus
> > TOT kernel so that we can test it without being impacted by all
> > the random unrelated breakages that regularly happen in linux-next
> > kernels....
>
> That's understandable! I just rebased onto Linus' tree, this only
> has the bs > ps support on 4k sector size:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>
> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
> test with this tree only. I haven't run fstests on it.

W/ 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops)
and then generic/095 hung with a crash in iomap_readpage_iter()
during readahead.

Looks like a null folio was passed to ifs_alloc(), which implies the
iomap_readpage_ctx didn't have a folio attached to it. Something
isn't working properly in the readahead code, which would also
explain the quick fsx failure...

> Just a heads up, using 512 byte sector size will fail for now, it's a
> regression we have to fix. Likewise using block sizes 1k, 2k will also
> regress on fsx right now. These are regressions we are aware of but
> haven't had time yet to bisect / fix.

I'm betting that the recently added sub-folio dirty tracking code
got broken by this patchset....

Cheers,

Dave.
>>>
>>> As it is, I'd really prefer stuff that adds significant XFS
>>> functionality that we need to test to be based on a current Linus
>>> TOT kernel so that we can test it without being impacted by all
>>> the random unrelated breakages that regularly happen in linux-next
>>> kernels....
>>
>> That's understandable! I just rebased onto Linus' tree, this only
>> has the bs > ps support on 4k sector size:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>

I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
sync with Luis offline regarding that.

>
>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>> test with this tree only. I haven't run fstests on it.
>
> W/ 64k block size, generic/042 fails (maybe just a test block size
> thing), generic/091 fails (data corruption on read after ~70 ops)
> and then generic/095 hung with a crash in iomap_readpage_iter()
> during readahead.
>
> Looks like a null folio was passed to ifs_alloc(), which implies the
> iomap_readpage_ctx didn't have a folio attached to it. Something
> isn't working properly in the readahead code, which would also
> explain the quick fsx failure...
>

Yeah, I have noticed this as well. This is the main crash scenario I am noticing
when I am running xfstests, and hopefully we will be able to fix it soon.

In general, we have had better results with 16k block size than 64k block size. I still don't
know why, but the ifs_alloc crash happens in generic/451 with 16k block size.

>> Just a heads up, using 512 byte sector size will fail for now, it's a
>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>> regress on fsx right now. These are regressions we are aware of but
>> haven't had time yet to bisect / fix.
>
> I'm betting that the recently added sub-folio dirty tracking code
> got broken by this patchset....
>

Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
tracking code on a system which has a page size greater than the block size? Or are there
some tests that can already test this? CCing Ritesh as well.

> Cheers,
>
> Dave.
On 2023-09-15 20:50, Matthew Wilcox wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
>> Only XFS was enabled and tested as a part of this series as it has
>> supported block sizes up to 64k and sector sizes up to 32k for years.
>> The only thing missing was the page cache magic to enable bs > ps. However any filesystem
>> that doesn't depend on buffer-heads and supports larger block sizes
>> already should be able to leverage this effort to also support LBS,
>> bs > ps.
>
> I think you should choose whether you're going to use 'bs > ps' or LBS
> and stick to it.  They're both pretty inscrutable and using both
> interchangeably is worse.
>

Got it! I will probably stick to Large block size and explain what it means at the start of the patchset.

> But I think filesystems which use buffer_heads should be fine to support
> bs > ps.  The problems with the buffer cache are really when you try to
> support small block sizes and large folio sizes (eg arrays of bhs on
> the stack).  Supporting bs == folio_size shouldn't be a problem.
>

I remember some patches from you trying to avoid the stack limitation while working with bh.

Thanks for the clarification!
Pankaj Raghav <p.raghav@samsung.com> writes:

>>>>
>>>> As it is, I'd really prefer stuff that adds significant XFS
>>>> functionality that we need to test to be based on a current Linus
>>>> TOT kernel so that we can test it without being impacted by all
>>>> the random unrelated breakages that regularly happen in linux-next
>>>> kernels....
>>>
>>> That's understandable! I just rebased onto Linus' tree, this only
>>> has the bs > ps support on 4k sector size:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev
>>
>
> I think this tree doesn't have some of the last minute changes I did before I sent the RFC. I will
> sync with Luis offline regarding that.
>
>>
>>> I just did a cursory build / boot / fsx with 16k block size / 4k sector size
>>> test with this tree only. I haven't run fstests on it.
>>
>> W/ 64k block size, generic/042 fails (maybe just a test block size
>> thing), generic/091 fails (data corruption on read after ~70 ops)
>> and then generic/095 hung with a crash in iomap_readpage_iter()
>> during readahead.
>>
>> Looks like a null folio was passed to ifs_alloc(), which implies the
>> iomap_readpage_ctx didn't have a folio attached to it. Something
>> isn't working properly in the readahead code, which would also
>> explain the quick fsx failure...
>>
>
> Yeah, I have noticed this as well. This is the main crash scenario I am noticing
> when I am running xfstests, and hopefully we will be able to fix it soon.
>
> In general, we have had better results with 16k block size than 64k block size. I still don't
> know why, but the ifs_alloc crash happens in generic/451 with 16k block size.
>
>
>>> Just a heads up, using 512 byte sector size will fail for now, it's a
>>> regression we have to fix. Likewise using block sizes 1k, 2k will also
>>> regress on fsx right now. These are regressions we are aware of but
>>> haven't had time yet to bisect / fix.
>>
>> I'm betting that the recently added sub-folio dirty tracking code
>> got broken by this patchset....
>>
>
> Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> tracking code on a system which has a page size greater than the block size? Or is there
> some tests that can already test this? CCing Ritesh as well.
>

Sorry, I haven't looked into this series yet. I will spend some time
reading it. Will also give the fstests a spin.

But to answer your question on how to test the sub-folio dirty tracking
code [1] [2] with XFS: just use blocksize < pagesize in the mkfs options
and run fstests. There are a number of tests which check data
correctness for various types of writes.

1. Test 1k blocksize on a 4k pagesize machine (as long as bs < ps).
2. Test 4k blocksize on a 64k pagesize machine, if you have one
   (as long as bs < ps).
3. Or also enable large folio support and test bs < ps.
   (With large folios, the system starts instantiating large folios > 4k on
   a 4k pagesize machine, so blocksize automatically becomes smaller than
   folio size.)

You will need CONFIG_TRANSPARENT_HUGEPAGE to be enabled along with
willy's series which enables large folios in the buffered write path [3].
(This is already in Linux 6.6-rc1.)

<snip>
/*
 * Large folio support currently depends on THP.  These dependencies are
 * being worked on but are not yet fixed.
 */
static inline bool mapping_large_folio_support(struct address_space *mapping)
{
	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}

<links>
[1]: https://lore.kernel.org/linux-xfs/20230725122932.144426-1-ritesh.list@gmail.com/
[2]: https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=4ce02c67972211be488408c275c8fbf19faf29b3
[3]: https://lore.kernel.org/all/ZLVrEkVU2YCneoXR@casper.infradead.org/

Hope this helps!

-ritesh
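To illustrate why bs < ps (or bs < folio size) is the configuration that exercises sub-folio dirty tracking, here is a deliberately simplified model. It is not the iomap implementation: the struct and function names are hypothetical, and it assumes at most 64 blocks per folio so that one 64-bit word can hold one dirty bit per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-folio state: one dirty bit per filesystem block. */
struct folio_dirty_state {
	unsigned int nr_blocks;	/* blocks covered by this folio */
	uint64_t dirty;		/* bit i set means block i is dirty */
};

static inline void fds_init(struct folio_dirty_state *s,
			    size_t folio_size, size_t block_size)
{
	assert(block_size != 0 && folio_size % block_size == 0);
	assert(folio_size / block_size <= 64);	/* limit of this sketch */
	s->nr_blocks = (unsigned int)(folio_size / block_size);
	s->dirty = 0;
}

/* Mark every block touched by the byte range [off, off + len) dirty. */
static inline void fds_mark_dirty(struct folio_dirty_state *s,
				  size_t block_size, size_t off, size_t len)
{
	size_t first, last, i;

	assert(len > 0);
	first = off / block_size;
	last = (off + len - 1) / block_size;
	assert(last < s->nr_blocks);

	for (i = first; i <= last; i++)
		s->dirty |= (uint64_t)1 << i;
}

/* Count how many blocks would need writeback. */
static inline unsigned int fds_count_dirty(const struct folio_dirty_state *s)
{
	unsigned int n = 0;
	uint64_t v = s->dirty;

	while (v) {
		n += (unsigned int)(v & 1);
		v >>= 1;
	}
	return n;
}
```

In this model a 4k folio with 1k blocks has four bits, and a 100-byte write at offset 1500 dirties only block 1. When the block size equals the folio size there is a single bit, so the per-block state carries no more information than a per-folio dirty flag, which is why those configurations do not stress this code path.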
On Tue, Sep 19, 2023 at 05:26:44PM +0530, Ritesh Harjani wrote:
> Pankaj Raghav <p.raghav@samsung.com> writes:
>
> [...]
>
> >>> Just a heads up, using 512 byte sector size will fail for now, it's a
> >>> regression we have to fix. Likewise using block sizes 1k, 2k will also
> >>> regress on fsx right now. These are regressions we are aware of but
> >>> haven't had time yet to bisect / fix.
> >>
> >> I'm betting that the recently added sub-folio dirty tracking code
> >> got broken by this patchset....
> >>
> >
> > Hmm, this crossed my mind as well. I am assuming I can really test the sub-folio dirty
> > tracking code on a system which has a page size greater than the block size? Or is there
> > some tests that can already test this? CCing Ritesh as well.
> >
>
> Sorry I haven't yet looked into this series yet. I will spend sometime
> reading it. Will also give a spin to run the fstests.

Ritesh,

You can save yourself time by not testing the patch series with fstests
for block sizes below ps, as we are already aware that a patch in the
series breaks this. We just wanted to get the patch series out early for
review given the progress.

There's probably one patch which regresses this; if each patch regresses
this, that's a bigger issue :P

  Luis
On Mon, Sep 18, 2023 at 02:29:22PM +0200, Pankaj Raghav wrote:
> I think this tree doesn't have some of the last minute changes I did
> before I sent the RFC. I will sync with Luis offline regarding that.

OK, we sorted out the small changes, and the posted patch series is now
rebased onto Linus' v6.6-rc2 and available here, for those that want more
stability than the wild wild linux-next:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus-nobdev

If you wanna muck with the coexistence stuff, which you will need if you
want to actually use an LBS device, that is this patch series and then
the coexistence stuff:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus

Given this is a fresh rebase, I started running fsx on the nobdev branch,
which only has this series, and managed to get fsx ops up to over 1 million
for:

512 sector size:

  * 16k block size
  * 32k block size
  * 64k block size

4k sector size:

  * 16k block size
  * 32k block size
  * 64k block size

It's at least enough of a cursory test to git push it. I haven't yet
tested the second branch I pushed, but it applied without any changes
so it should be good (usual famous last words).

  Luis
On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> >
> > I haven't tested yet the second branch I pushed though but it applied without any changes
> > so it should be good (usual famous last words).
>
> I have run some preliminary tests on that branch as well above using fsx
> with larger LBA formats running them all on the *same* system at the
> same time. Kernel is happy.
>
> root@linus ~ # uname -r
> 6.6.0-rc2-large-block-linus+
>
> root@linus ~ # mount | grep mnt
> /dev/nvme17n1 on /mnt-16k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme13n1 on /mnt-32k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme11n1 on /mnt-64k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme18n1 on /mnt-32k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme14n1 on /mnt-64k-32ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme7n1 on /mnt-64k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme4n1 on /mnt-32k-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme3n1 on /mnt-16k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme9n1 on /mnt-64k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme8n1 on /mnt-32k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme6n1 on /mnt-16k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme5n1 on /mnt-4k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme1n1 on /mnt-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>
> root@linus ~ # ps -ef| grep fsx
> root  45601  45172 44 04:02 pts/3  00:20:26 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k/foo
> root  46207  45658 39 04:04 pts/5  00:17:18 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-16ks/foo
> root  46792  46289 35 04:06 pts/7  00:14:36 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-16ks/foo
> root  47293  46899 39 04:08 pts/9  00:15:30 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k/foo
> root  47921  47338 34 04:10 pts/11 00:12:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-32ks/foo
> root  48898  48484 32 04:14 pts/13 00:10:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-512b/foo
> root  49313  48939 35 04:15 pts/15 00:11:38 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-512/foo
> root  49729  49429 40 04:17 pts/17 00:12:27 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-512b/foo
> root  50085  49794 33 04:18 pts/19 00:09:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-4ks/foo
> root  50449  50130 36 04:19 pts/21 00:10:28 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-4ks/foo
> root  50844  50517 41 04:20 pts/23 00:11:22 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-4ks/foo
> root  51135  50893 52 04:21 pts/25 00:13:57 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-4k/foo
> root  52061  51193 49 04:25 pts/27 00:11:21 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-512/foo
> root  57668  52131  0 04:48 pts/29 00:00:00 grep fsx

So I just pulled this, built it and ran generic/091 as the very
first test on this:

# ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"
.....
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=32768 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=2613, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
....
Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_64k generic/091
SECTION       -- xfs_64k
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test3 6.6.0-rc2-large-block-linus-dgc+ #1906 SMP PREEMPT_DYNAMIC Thu Sep 21 15:19:47 AEST 2023
MKFS_OPTIONS  -- -f -m rmapbt=1 -b size=64k /dev/pmem1
MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

generic/091 10s ... [failed, exit status 1]- output mismatch (see /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad)
    --- tests/generic/091.out	2022-12-21 15:53:25.467044754 +1100
    +++ /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad	2023-09-21 15:47:48.222559248 +1000
    @@ -1,7 +1,113 @@
     QA output created by 091
     fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
    ...
    (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/091.out /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad' to see the entire diff)
Failures: generic/091
Failed 1 of 1 tests
Xunit report: /home/dave/src/xfstests-dev/results//xfs_64k/result.xml

SECTION -- xfs_64k
=========================
Failures: generic/091
Failed 1 of 1 tests

real 0m4.214s
user 0m0.972s
sys 0m3.603s
#

For all these assertions about how none of your testing is finding
bugs in this code, it's taken me *4 seconds* of test runtime to find
the first failure.

And, well, it's the same failure as I reported for the previous
version of this code:

# cat /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad
/home/dave/src/xfstests-dev/ltp/fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/test/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support exchange range, disabling!
fallocating to largest ever: 0x79f06
READ BAD DATA: offset = 0x18000, size = 0xf000, fname = /mnt/test/junk
OFFSET  GOOD    BAD     RANGE
0x21000 0x0000  0x9008  0x0
operation# (mod 256) for the bad data may be 144
0x21001 0x0000  0x0810  0x1
operation# (mod 256) for the bad data may be 16
0x21002 0x0000  0x1000  0x2
operation# (mod 256) for the bad data may be 16
0x21005 0x0000  0x8e00  0x3
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21007 0x0000  0x82ff  0x4
operation# (mod 256) for the bad data may be 255
0x21008 0x0000  0xffff  0x5
operation# (mod 256) for the bad data may be 255
0x21009 0x0000  0xffff  0x6
operation# (mod 256) for the bad data may be 255
0x2100a 0x0000  0xffff  0x7
operation# (mod 256) for the bad data may be 255
0x2100b 0x0000  0xff00  0x8
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21010 0x0000  0x700b  0x9
operation# (mod 256) for the bad data may be 112
0x21011 0x0000  0x0b10  0xa
operation# (mod 256) for the bad data may be 16
0x21012 0x0000  0x1000  0xb
operation# (mod 256) for the bad data may be 16
0x21014 0x0000  0x038e  0xc
operation# (mod 256) for the bad data may be 3
0x21015 0x0000  0x8e00  0xd
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21017 0x0000  0x82ff  0xe
operation# (mod 256) for the bad data may be 255
0x21018 0x0000  0xffff  0xf
operation# (mod 256) for the bad data may be 255
LOG DUMP (69 total operations):
1( 1 mod 256): FALLOC 0x6ba10 thru 0x79f06 (0xe4f6 bytes) EXTENDING
2( 2 mod 256): SKIPPED (no operation)
3( 3 mod 256): SKIPPED (no operation)
4( 4 mod 256): TRUNCATE DOWN from 0x79f06 to 0x51800
5( 5 mod 256): SKIPPED (no operation)
6( 6 mod 256): READ 0x1b000 thru 0x21fff (0x7000 bytes)
7( 7 mod 256): PUNCH 0x2ce7a thru 0x39b9e (0xcd25 bytes)
8( 8 mod 256): PUNCH 0x29238 thru 0x29f57 (0xd20 bytes)
9( 9 mod 256): COPY 0x3000 thru 0x9fff (0x7000 bytes) to 0x40400 thru 0x473ff
10( 10 mod 256): READ 0x16000 thru 0x21fff (0xc000 bytes)
11( 11 mod 256): FALLOC 0x4a42b thru 0x4b8f7 (0x14cc bytes) INTERIOR
12( 12 mod 256): TRUNCATE DOWN from 0x51800 to 0x15c00 ******WWWW
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): READ 0xb000 thru 0x14fff (0xa000 bytes)
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): SKIPPED (no operation)
17( 17 mod 256): SKIPPED (no operation)
18( 18 mod 256): READ 0x3000 thru 0x11fff (0xf000 bytes)
19( 19 mod 256): FALLOC 0x69b94 thru 0x6c922 (0x2d8e bytes) EXTENDING
20( 20 mod 256): SKIPPED (no operation)
21( 21 mod 256): SKIPPED (no operation)
22( 22 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
23( 23 mod 256): SKIPPED (no operation)
24( 24 mod 256): SKIPPED (no operation)
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): ZERO 0x1fba0 thru 0x2c568 (0xc9c9 bytes) ******ZZZZ
27( 27 mod 256): READ 0x4f000 thru 0x50fff (0x2000 bytes)
28( 28 mod 256): READ 0x39000 thru 0x3afff (0x2000 bytes)
29( 29 mod 256): WRITE 0x40200 thru 0x4cdff (0xcc00 bytes)
30( 30 mod 256): SKIPPED (no operation)
31( 31 mod 256): WRITE 0x47e00 thru 0x547ff (0xca00 bytes)
32( 32 mod 256): SKIPPED (no operation)
33( 33 mod 256): READ 0x28000 thru 0x29fff (0x2000 bytes)
34( 34 mod 256): SKIPPED (no operation)
35( 35 mod 256): READ 0x69000 thru 0x6bfff (0x3000 bytes)
36( 36 mod 256): READ 0x16000 thru 0x20fff (0xb000 bytes)
37( 37 mod 256): ZERO 0x45150 thru 0x47e9c (0x2d4d bytes)
38( 38 mod 256): SKIPPED (no operation)
39( 39 mod 256): SKIPPED (no operation)
40( 40 mod 256): COPY 0x10000 thru 0x11fff (0x2000 bytes) to 0x22a00 thru 0x249ff
41( 41 mod 256): WRITE 0x29000 thru 0x2efff (0x6000 bytes)
42( 42 mod 256): ZERO 0x59c7 thru 0x13eee (0xe528 bytes)
43( 43 mod 256): FALLOC 0x1fdbf thru 0x2e694 (0xe8d5 bytes) INTERIOR ******FFFF
44( 44 mod 256): SKIPPED (no operation)
45( 45 mod 256): ZERO 0x740f5 thru 0x7a11f (0x602b bytes)
46( 46 mod 256): SKIPPED (no operation)
47( 47 mod 256): WRITE 0x14200 thru 0x1e3ff (0xa200 bytes)
48( 48 mod 256): READ 0x69000 thru 0x6bfff (0x3000 bytes)
49( 49 mod 256): TRUNCATE DOWN from 0x6c922 to 0x16a00 ******WWWW
50( 50 mod 256): WRITE 0x15000 thru 0x163ff (0x1400 bytes)
51( 51 mod 256): PUNCH 0x3b5e thru 0xa2c1 (0x6764 bytes)
52( 52 mod 256): SKIPPED (no operation)
53( 53 mod 256): SKIPPED (no operation)
54( 54 mod 256): WRITE 0x34a00 thru 0x3fdff (0xb400 bytes) HOLE ***WWWW
55( 55 mod 256): WRITE 0x38000 thru 0x397ff (0x1800 bytes)
56( 56 mod 256): PUNCH 0x7922 thru 0x115f0 (0x9ccf bytes)
57( 57 mod 256): SKIPPED (no operation)
58( 58 mod 256): SKIPPED (no operation)
59( 59 mod 256): SKIPPED (no operation)
60( 60 mod 256): FALLOC 0x300a8 thru 0x331d0 (0x3128 bytes) INTERIOR
61( 61 mod 256): ZERO 0x3799c thru 0x39245 (0x18aa bytes)
62( 62 mod 256): ZERO 0x62fc3 thru 0x6b630 (0x866e bytes)
63( 63 mod 256): SKIPPED (no operation)
64( 64 mod 256): ZERO 0x6110a thru 0x61dad (0xca4 bytes)
65( 65 mod 256): FALLOC 0x1d8ca thru 0x20876 (0x2fac bytes) INTERIOR
66( 66 mod 256): COPY 0x65000 thru 0x68fff (0x4000 bytes) to 0x22400 thru 0x263ff
67( 67 mod 256): SKIPPED (no operation)
68( 68 mod 256): WRITE 0x36a00 thru 0x415ff (0xac00 bytes)
69( 69 mod 256): READ 0x18000 thru 0x26fff (0xf000 bytes) ***RRRR***
Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Guess what? The fsx parameters being used mean it is testing things you
aren't.

Yes, the '-Z -R -W' mean it is using direct IO for reads and writes, and
mmap() is disabled. The other parameters indicate that it is using 4k
aligned reads and 512 byte aligned writes and truncates.

There is a reason there are multiple different fsx tests in fstests;
they all exercise different sets of IO behaviours and alignments,
and they exercise the IO paths differently.

So there's clearly something wrong here - it's likely that the
filesystem IO alignment parameters pulled from the underlying block
device (4k physical, 512 byte logical sector sizes) are improperly
interpreted. i.e. for a filesystem with a sector size of 4kB,
direct IO with an alignment of 512 bytes should be rejected......

-Dave.
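As a rough illustration of the alignment rule described above, here is a minimal Python sketch. The helper name `dio_aligned` is hypothetical, and this is not how XFS or the block layer implements the check; it only assumes that a direct IO request is acceptable when both its offset and its length are multiples of the filesystem sector size. The sample request is operation 29 from the log dump (WRITE 0x40200 thru 0x4cdff, 0xcc00 bytes).

```python
def dio_aligned(offset: int, length: int, sector_size: int) -> bool:
    """Hypothetical check: is a direct IO request aligned to sector_size?

    Assumes sector_size is a power of two, so alignment can be tested
    with a bit mask instead of a modulo.
    """
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector_size must be a positive power of two")
    mask = sector_size - 1
    return (offset & mask) == 0 and (length & mask) == 0


# Operation 29 from the fsx log: 512 byte aligned write.
write_offset, write_length = 0x40200, 0xCC00

# Aligned for a 512 byte sector size, misaligned for a 4kB sector size.
print(dio_aligned(write_offset, write_length, 512))
print(dio_aligned(write_offset, write_length, 4096))
```

Under these assumptions the same write is fine for a 512 byte sector size but would have to be rejected on a filesystem whose sector size is 4kB.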
On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> > On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> > >
> > > I haven't tested yet the second branch I pushed though but it applied without any changes
> > > so it should be good (usual famous last words).
> >
> > I have run some preliminary tests on that branch as well above using fsx
> > with larger LBA formats running them all on the *same* system at the
> > same time. Kernel is happy.

<-- snip -->

> So I just pulled this, built it and run generic/091 as the very
> first test on this:
>
> # ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"

The cover letter for this patch series acknowledged failures in fstests.
For kdevops now, we borrow the same last linux-next baseline:

git grep "generic/091" workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_1024.txt:generic/091 # possible regression
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_16k.txt:generic/091
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_32k.txt:generic/091
workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_64k_4ks.txt:generic/091

So well, we already know this fails.

> For all these assertions about how none of your testing is finding
> bugs in this code, It's taken me *4 seconds* of test runtime to find
> the first failure.

Because you know what to look for and this is not yet perfect.

> And, well, it's the same failure as I reported for the previous
> version of this code:

And we haven't made *any* new changes to the patch series, so no
surprise either.

> Guess what? The fsx parameters being used means it is testing things you
> aren't.

I actually found quite a few issues with -W. And it was useful.

> Yes, the '-Z -R -W' mean it is using direct IO for reads and writes,
> mmap() is disabled. Other parameters indicate that using 4k aligned reads and
> 512 byte aligned writes and truncates.

Thanks! This will help for sure!

> There is a reason there are multiple different fsx tests in fstests;

You made it clear, and I documented the goal to ensure we get to the
point where we pass all of those:

https://kernelnewbies.org/KernelProjects/large-block-size#fsx

> they all exercise different sets of IO behaviours and alignments,
> and they exercise the IO paths differently.
>
> So there's clearly something wrong here - it's likely that the
> filesystem IO alignment parameters pulled from the underlying block
> device (4k physical, 512 byte logical sector sizes) are improperly
> interpreted. i.e. for a filesystem with a sector size of 4kB,
> direct IO with an alignment of 512 bytes should be rejected......

So yes, this is not yet complete.

But now let's step back; I want you to realize where we started and why
we decided to post. I in particular was suggesting we post now, instead
of waiting for us to resolve *it all*.

When we first started this work we simply thought it was impossible.
Unless of course you are Matthew and you believed hard in your work.
The progress, which you don't see, is that steps towards fixing fsx
issues have been logarithmic. Days, weeks, months before decent
progress, but the progress was steady...

So getting to where we are today shows that this is actually not
impossible, that Matthew did the right thing with the right data
structure, and that the changes to the page cache with the multi-index
stuff seem to be usable for LBS as well.

At this point, from a logarithmic perspective, we have made huge
progress, and I don't think it will stop. It gives us confidence that
Matthew was right and that LBS is indeed possible with the multi-index
stuff.

It's not about whether this can crash. Yes, we know, it can crash. It's
about how many different ways, and how many fixes are left. Because
clearly the multi-index stuff is working well.

The code feedback so far on this patch series has mostly been "I don't
think this patch is needed" or "perhaps this way is better", and that's
the kind of feedback we're looking for. Because *each* new patch adds a
huge milestone. And it seems the progress has been logarithmic.

It is exactly why this series went out with a few patches which ... we
felt safer with them than without. For instance the batch delete: I am
still suspicious about us not needing it, as Hannes' patches also seem
to rely on similar rounding on the wait stuff, and it seems to bring
back memories of issues found on permissions.

But anyway, the point is that this is clearly not ready. But try to
think of progress here as logarithmic, and any *dent* we make on the
page cache to fix the last corner cases will be huge, not small.

If you want to try, you can see for yourself: what's the next fix? :)
And if found, was it logarithmic?

How do we polish this? That's the goal of this patch series.

Luis
On Thu, Sep 21, 2023 at 12:18:13AM -0700, Luis Chamberlain wrote:
> When we first started this work we simply thought it was impossible.
*not possible*
Luis
On Thu, Sep 21, 2023 at 12:18:13AM -0700, Luis Chamberlain wrote:
> On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> > On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> > > On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> > > >
> > > > I haven't tested yet the second branch I pushed though but it applied without any changes
> > > > so it should be good (usual famous last words).
> > >
> > > I have run some preliminary tests on that branch as well above using fsx
> > > with larger LBA formats running them all on the *same* system at the
> > > same time. Kernel is happy.
>
> <-- snip -->
>
> > So I just pulled this, built it and run generic/091 as the very
> > first test on this:
> >
> > # ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"
>
> The cover letter for this patch series acknowledged failures in fstests.

But this is a new update, which you said fixed various issues, and
you posted this in direct response to the bug report I gave you.

> For kdevops now, we borrow the same last linux-next baseline:
>
> git grep "generic/091" workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_1024.txt:generic/091 # possible regression
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_16k.txt:generic/091
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_32k.txt:generic/091
> workflows/fstests/expunges/6.6.0-rc2-large-block-linus-nobdev/xfs/unassigned/xfs_reflink_64k_4ks.txt:generic/091
>
> So well, we already know this fails.

*cough*

-You- know it already fails. And you are expecting people who try
the code to somehow know that you've explicitly ignored this fsx
failure, especially after all your words to tell us how much fsx
testing it has passed?

And that's kinda my point - you're effusing about how much fsx
testing this has passed, yet it still fails after just a handful of
ops in generic/091. The dissonance could break windows...

----

Fundamentally, when it comes to data integrity, it is important to
exercise as much of the operational application space as quickly as
possible, as it is that breadth of variation in operations that
flushes out more bugs and helps stabilise the code faster.

Why do you think we talk about the massive test matrix most
filesystems have and how long it takes to iterate so much? It's
because iterating that complex test matrix is how we find all the
whacky, weird bugs in the code.

Concentrating on a single test configuration and running it over and
over again won't find bugs in code it doesn't exercise no matter how
long it is run for. Running such a setup in an automated environment
doesn't mean you get better code coverage, it just means you cover
the same narrow set of corner cases faster and more times. If it
works once, it should work a million times. Iterating it a billion
more times doesn't tell us anything additional, either.

Put simply: performing deep, homogenous testing on code that has
known data corruption bugs outside the narrow scope of the test case
is not telling us anything useful about the overall state of the
code.

Indeed, turning off failing tests that are critical to validating
the correct operation of the code you are modifying is bad practice.
For code changes like this, all fsx testing in fstests should pass
before you post anything for review - even for an RFC. There is no
point reviewing code that doesn't work properly, nor wasting
people's time by encouraging them to test it when it's clear to you
that it's going to fail in various important ways.

Hence I think your testing is focussing on the wrong things and I
suspect that you've misunderstood what the statements of "we'll need
billions of fsx ops to test this code" that various people have made
really meant.

You've elevated running billions of fsx ops to your primary "it
works" gating condition, at the expense of making sure all the other
parts of the filesystem still work correctly. The reality is that
the returns from fsx diminish as the number of ops goes up. Once
you've run the first hundred million fsx ops for a given operations
set, the chance that the next 100M ops will find a new problem is
-greatly- reduced.

The vast majority of problems will be found in the first 10M ops
that are run in any given fsx operation, and few bugs are found
beyond the 100M mark. Yes, we occasionally find one up in the
billions, but that's rare and most definitely not something to focus
on when still developing RFC level code.

Different fsx configurations change the operation set that is run -
mixing DIO reads with buffered writes, turning mmap on and off,
using AIO or io_uring rather than synchronous IO, etc. These all
exercise different code paths and corner cases and have vastly
different code interactions, and that is what we need to cover when
developing new code.

IOWs, we need coverage of the *entire operation space*, not just the
same narrow set of operations run billions of times. A wide focus
requires billions of ops to cover because it requires lots of
different application configurations to be run.

In contrast, there are only three fs configurations that matter: bs
< PS, bs == PS and bs > PS. For example, 16kB, 32kB and 64kB
filesystem configs exercise exactly the same code paths in exactly
the same way (e.g. they all have non-zero minimum folio orders and
only differ by what that order is). Hence running the same test
application configs on these different filesystem configurations
does not actually improve code coverage of the testing at all.
Testing all of them only increases the resources required to test a
change, it does not improve the quality of coverage of the testing
being performed at all....

Hence I'd strongly suggest that, for the next posting of these
changes, you focus on making fstests pass without turning off any
failing tests, and that fsx is run with a wide variety of
configurations (e.g. modify all the fstests cases to run for a
configurable number of ops (e.g. via SOAK_DURATION)). We just don't
care at this point about finding that 1 in 10^15 ops bug because
it's code in development; what we actually care about is that
-everything- works correctly for the vast majority of use cases....

-Dave.
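To make the "breadth over depth" point concrete, here is a minimal Python sketch that enumerates a small matrix of fsx configurations instead of repeating one. It only builds argument lists and does not run anything. The function name `fsx_matrix` and its parameters are hypothetical; the only fsx flags used (-N, -r, -t, -w, -Z, -R, -W) are the ones visible in the generic/091 command line quoted earlier in this thread, with -Z read as "direct IO" and -R/-W as "mmap reads/writes disabled" per the description above. AIO and io_uring variants are deliberately left out because their flags do not appear in the thread.

```python
from itertools import product


def fsx_matrix(numops, path, alignments):
    """Build fsx argument lists covering direct/buffered IO, mmap on/off,
    and each (read boundary, write/truncate boundary) pair.

    Hypothetical helper: it returns command lines, it does not execute them.
    """
    commands = []
    for direct_io, mmap_enabled, (read_bdy, write_bdy) in product(
        (False, True), (False, True), alignments
    ):
        args = [
            "fsx",
            "-N", str(numops),
            "-r", str(read_bdy),
            "-t", str(write_bdy),
            "-w", str(write_bdy),
        ]
        if direct_io:
            args.append("-Z")          # direct IO for reads and writes
        if not mmap_enabled:
            args += ["-R", "-W"]       # disable mapped reads and writes
        args.append(path)
        commands.append(args)
    return commands


# One alignment pair taken from the failing run: 4k reads, 512 byte writes.
for cmd in fsx_matrix(10000, "/mnt/test/junk", [(4096, 512)]):
    print(" ".join(cmd))
```

Each extra alignment pair or IO mode multiplies the number of distinct configurations, which is the kind of coverage the message above argues for, as opposed to more iterations of a single command line.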
On Thu, Sep 21, 2023 at 04:03:56PM +1000, Dave Chinner wrote:
> So there's clearly something wrong here - it's likely that the
> filesystem IO alignment parameters pulled from the underlying block
> device (4k physical, 512 byte logical sector sizes) are improperly
> interpreted. i.e. for a filesystem with a sector size of 4kB,
> direct IO with an alignment of 512 bytes should be rejected......

I wonder if it's something in the truncation code that's splitting
folios that ought not to be split. Does this test possibly keep folios
in cache that maybe get invalidated?

truncate_inode_partial_folio() is the one I'm most concerned about,
but I'm also severely jetlagged.
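One way to picture the truncation concern is to work out which block-sized range of the file a partial truncate lands in. The Python sketch below is illustrative arithmetic only, not kernel code: `straddled_block` is a hypothetical helper, and it assumes the page cache holds one naturally aligned folio per filesystem block (64kB here). The sample value is operation 12 from the fsx log above (TRUNCATE DOWN from 0x51800 to 0x15c00).

```python
def straddled_block(new_size: int, block_size: int):
    """Return (start, end) of the block containing the new EOF, or None
    if the new size falls exactly on a block boundary.

    Hypothetical helper; assumes block_size is a power of two.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")
    start = new_size & ~(block_size - 1)
    if start == new_size:
        return None  # truncate is block aligned, no partial block
    return start, start + block_size


# Operation 12: TRUNCATE DOWN to 0x15c00 on a 64kB block size filesystem.
hit = straddled_block(0x15C00, 0x10000)
print([hex(v) for v in hit])
```

Under these assumptions the new EOF falls inside the block covering 0x10000 to 0x1ffff, so that is the folio a partial truncate would have to zero in place; splitting it into smaller folios would go below the filesystem block size.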
From: Pankaj Raghav <p.raghav@samsung.com>

There have been efforts over the last 16 years to enable Large
Block Sizes (LBS), that is block sizes in filesystems where bs > page
size [1] [2]. Through these efforts we have learned that one of the
main blockers to supporting bs > ps in filesystems has been a way to
allocate pages that are at least the filesystem block size on the page
cache where bs > ps [3]. Another blocker was the changes required in
filesystems due to buffer-heads. Thanks to these previous efforts, the
surgery by Matthew Wilcox in the page cache for adopting xarray's
multi-index support, and iomap support, supporting bs > ps in XFS is
possible with only a few lines of change to XFS. Most of the changes
are to the page cache, to support a minimum folio order matching the
target block size of the filesystem.

A new motivation for LBS today is to support high-capacity (large amount
of Terabytes) QLC SSDs where the internal Indirection Unit (IU) is
typically greater than 4k [4] to help reduce DRAM and so in turn cost
and space. In practice this then allows different architectures to use a
base page size of 4k while still enabling support for block sizes
aligned to the larger IUs by relying on high order folios on the page
cache when needed. It also makes it possible to take advantage of these
same drives' support for atomics larger than 4k with buffered IO support
in Linux. As described this year at LSFMM, supporting large atomics
greater than 4k enables databases to remove the need to rely on their
own journaling, so they can disable double buffered writes [5], which is
a feature different cloud providers are already innovating on and
enabling for customers through custom storage solutions.

This series still needs some polishing and fixes for some crashes, but
it is mainly targeted at getting initial feedback from the community and
enabling initial experimentation, hence the RFC. It's being posted now
given that our testing is showing much better results than expected, and
we hope to polish this up together with the community. After all, this
has been a 16 year old effort and none of this would have been possible
without that effort.

Implementation:

This series only adds the notion of a minimum order of a folio in the
page cache that was initially proposed by Willy. The minimum folio order
requirement is set during inode creation. The minimum order will
typically correspond to the filesystem block size. The page cache will
in turn respect the minimum folio order requirement while allocating a
folio. This series mainly changes the page cache's filemap, readahead,
and truncation code to allocate and align the folios to the minimum
order set for the filesystem's inode's respective address space mapping.

Only XFS was enabled and tested as a part of this series as it has
supported block sizes up to 64k and sector sizes up to 32k for years.
The only thing missing was the page cache magic to enable bs > ps.
However any filesystem that doesn't depend on buffer-heads and already
supports larger block sizes should be able to leverage this effort to
also support LBS, bs > ps.

This also paves the way for supporting block devices where the logical
block size > page size in the future by leveraging iomap's address space
operation added to the block device cache by Christoph Hellwig [6]. We
have work in progress to enable this (LBAs > 4k on NVMe) while allowing
coexistence with buffer-heads on the same block device, so that a drive
can switch between filesystems which depend on buffer-heads and
filesystems which need the iomap address space operations for the block
device cache. Patches for this will be posted shortly after this patch
series.

Testing:

The test results show this isn't so scary.
Only a few regressions so far on xfs where CRCs are disabled on block
sizes smaller than 4k, and some generic tests crashing the system for
bs > 4k. The crashes are at most a handful at this point. This series
has been cleaned up 3 times now after we passed our first billion fsx
ops on different block sizes. Not surprisingly there are a few test bugs
for the bs > ps world.

We first established a baseline against linux-next across 14 different
XFS test profiles as maintained in kdevops [7]:

xfs_crc
xfs_reflink
xfs_reflink_normapbt
xfs_reflink_1024
xfs_reflink_2k
xfs_reflink_4k
xfs_nocrc
xfs_nocrc_512
xfs_nocrc_1k
xfs_nocrc_2k
xfs_nocrc_4k
xfs_logdev
xfs_rtdev
xfs_rtlogdev

We first established a high confidence baseline for linux-next and have
kept following that to ensure we don't regress it. The majority of
regressions are fsx ops on no CRC block sizes of 512 and 2k, and we plan
to fix that, but welcome others at this point to jump in and
collaborate.

The list of known possible regressions can be seen on kdevops with
git grep:

git grep regression workflows/fstests/expunges/6.6.0-rc1-large-block-20230914/ | awk -F"unassigned/" '{print $2}'
xfs_nocrc_2k.txt:generic/075 # possible regression
xfs_nocrc_2k.txt:generic/112 # possible regression
xfs_nocrc_2k.txt:generic/127 # possible regression
xfs_nocrc_2k.txt:generic/231 # possible regression
xfs_nocrc_2k.txt:generic/263 # possible regression
xfs_nocrc_2k.txt:generic/469 # possible regression
xfs_nocrc_512.txt:generic/075 # possible regression
xfs_nocrc_512.txt:generic/112 # possible regression
xfs_nocrc_512.txt:generic/127 # possible regression
xfs_nocrc_512.txt:generic/231 # possible regression
xfs_nocrc_512.txt:generic/263 # possible regression
xfs_nocrc_512.txt:generic/469 # possible regression
xfs_reflink_1024.txt:generic/457 # possible regression crash https://gist.github.com/mcgrof/f182b250a9d091f77dc85782a83224b3
xfs_rtdev.txt:generic/333 # might crash might be a regression, takes forever...
Billions of fsx ops pass with 16k, and so far hundreds of millions of
fsx ops have also passed against 32k and 64k with a 4k sector size.

To verify larger IOs are used we have been using Daniel Gomez's lbs-ctl
tool which uses eBPF to verify different IO counts on the block layer.
That tool will soon be published.

For more details please refer to the kernel newbies page on LBS [8].

[1] https://lwn.net/Articles/231793/
[2] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/
[3] https://lore.kernel.org/linux-mm/20230308075952.GU2825702@dread.disaster.area/
[4] https://cdrdv2-public.intel.com/605724/Achieving_Optimal_Perf_IU_SSDs-338395-003US.pdf
[5] https://lwn.net/Articles/932900/
[6] https://lore.kernel.org/lkml/20230801172201.1923299-2-hch@lst.de/T/
[7] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config
[8] https://kernelnewbies.org/KernelProjects/large-block-size

--
Regards,
Pankaj
Luis

Dave Chinner (1):
  xfs: expose block size in stat

Luis Chamberlain (12):
  filemap: set the order of the index in page_cache_delete_batch()
  filemap: align index to mapping_min_order in filemap_range_has_page()
  mm: call xas_set_order() in replace_page_cache_folio()
  filemap: align the index to mapping_min_order in __filemap_add_folio()
  filemap: align the index to mapping_min_order in filemap_get_folios_tag()
  filemap: align the index to mapping_min_order in filemap_get_pages()
  readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  readahead: add folio with at least mapping_min_order in page_cache_ra_order
  readahead: set the minimum ra size in get_(init|next)_ra
  readahead: align ra start and size to mapping_min_order in ondemand_ra()
  truncate: align index to mapping_min_order
  mm: round down folio split requirements

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (9):
  pagemap: use mapping_min_order in fgf_set_order()
  filemap: add folio with at least mapping_min_order in __filemap_get_folio
  filemap: use mapping_min_order while allocating folios
  filemap: align the index to mapping_min_order in do_[a]sync_mmap_readahead
  filemap: align index to mapping_min_order in filemap_fault()
  readahead: allocate folios with mapping_min_order in ra_unbounded()
  readahead: align with mapping_min_order in force_page_cache_ra()
  xfs: enable block size larger than page size support
  xfs: set minimum order folio for page cache based on blocksize

 fs/iomap/buffered-io.c  |  2 +-
 fs/xfs/xfs_icache.c     |  8 +++-
 fs/xfs/xfs_iops.c       |  4 +-
 fs/xfs/xfs_mount.c      |  9 ++++-
 fs/xfs/xfs_super.c      |  7 +---
 include/linux/pagemap.h | 87 ++++++++++++++++++++++++++++++-----------
 mm/filemap.c            | 87 +++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c        | 14 +++++--
 mm/readahead.c          | 86 ++++++++++++++++++++++++++++++++++------
 mm/truncate.c           | 34 +++++++++++-----
 10 files changed, 263 insertions(+), 75 deletions(-)

base-commit: e143016b56ecb0fcda5bb6026b0a25fe55274f56
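The Implementation section of the cover letter says the minimum folio order typically corresponds to the filesystem block size, and many of the patch titles above talk about aligning an index to mapping_min_order. The arithmetic behind both can be sketched in Python as below. This is an illustration under stated assumptions, not the kernel code: the function names `min_folio_order` and `align_index` are hypothetical, a 4kB base page size is assumed by default, and block and page sizes are assumed to be powers of two.

```python
def min_folio_order(block_size: int, page_size: int = 4096) -> int:
    """Smallest folio order whose size covers one filesystem block.

    Hypothetical helper: order n means a folio of 2**n pages, so for
    power-of-two sizes the order is log2(block_size / page_size),
    and 0 when the block is no larger than a page.
    """
    for size in (block_size, page_size):
        if size <= 0 or size & (size - 1):
            raise ValueError("sizes must be positive powers of two")
    if block_size <= page_size:
        return 0
    return (block_size // page_size).bit_length() - 1


def align_index(index: int, order: int) -> int:
    """Round a page cache index down to a multiple of 2**order pages."""
    return index & ~((1 << order) - 1)


# 64kB blocks on 4kB pages need order-4 folios (16 pages each).
order = min_folio_order(65536)
print(order)

# Page index 0x21 (file offset 0x21000) belongs to the folio starting
# at page index 0x20 (file offset 0x20000).
print(hex(align_index(0x21, order)))
```

With these assumptions, 16kB, 32kB and 64kB block sizes map to orders 2, 3 and 4 on 4kB pages, and any block size at or below the page size maps to order 0.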