Message ID | 20190225040904.5557-1-ming.lei@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | xfs: allocate sector sized IO buffer via page_frag_alloc | expand |
On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > XFS uses kmalloc() to allocate sector sized IO buffer. .... > Use page_frag_alloc() to allocate the sector sized buffer, then the > above issue can be fixed because offset_in_page of allocated buffer > is always sector aligned. Didn't we already reject this approach because page frags cannot be reused and that pages allocated to the frag pool are pinned in memory until all fragments allocated on the page have been freed? i.e. when we consider 64k page machines and 4k block sizes (i.e. default config), every single metadata allocation is a sub-page allocation and so will use this new page frag mechanism. IOWs, it will result in fragmenting memory severely and typical memory reclaim not being able to fix it because the metadata that pins each page is largely unreclaimable... Cheers, Dave.
On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote: > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > > XFS uses kmalloc() to allocate sector sized IO buffer. > .... > > Use page_frag_alloc() to allocate the sector sized buffer, then the > > above issue can be fixed because offset_in_page of allocated buffer > > is always sector aligned. > > Didn't we already reject this approach because page frags cannot be I remembered there is this kind of issue mentioned, but just not found the details, so post out the patch for restarting the discussion. > reused and that pages allocated to the frag pool are pinned in > memory until all fragments allocated on the page have been freed? Yes, that is one problem. But if one page is consumed, sooner or later, all fragments will be freed, then the page becomes available again. > > i.e. when we consider 64k page machines and 4k block sizes (i.e. > default config), every single metadata allocation is a sub-page > allocation and so will use this new page frag mechanism. IOWs, it > will result in fragmenting memory severely and typical memory > reclaim not being able to fix it because the metadata that pins each > page is largely unreclaimable... It can be an issue in case of IO timeout & retry. Thanks, Ming
On Mon, Feb 25, 2019 at 04:46:25PM +0800, Ming Lei wrote: > On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote: > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > > > XFS uses kmalloc() to allocate sector sized IO buffer. > > .... > > > Use page_frag_alloc() to allocate the sector sized buffer, then the > > > above issue can be fixed because offset_in_page of allocated buffer > > > is always sector aligned. > > > > Didn't we already reject this approach because page frags cannot be > > I remembered there is this kind of issue mentioned, but just not found > the details, so post out the patch for restarting the discussion. > > > reused and that pages allocated to the frag pool are pinned in > > memory until all fragments allocated on the page have been freed? > > Yes, that is one problem. But if one page is consumed, sooner or later, > all fragments will be freed, then the page becomes available again. > > > > > i.e. when we consider 64k page machines and 4k block sizes (i.e. > > default config), every single metadata allocation is a sub-page > > allocation and so will use this new page frag mechanism. IOWs, it > > will result in fragmenting memory severely and typical memory > > reclaim not being able to fix it because the metadata that pins each > > page is largely unreclaimable... > > It can be an issue in case of IO timeout & retry. The worst case is still not worse than allocating single page for sub-page IO, which should be used on other file systems under the same situation, I guess. thanks, Ming
On 2/25/19 5:36 AM, Dave Chinner wrote: > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: >> XFS uses kmalloc() to allocate sector sized IO buffer. > .... >> Use page_frag_alloc() to allocate the sector sized buffer, then the >> above issue can be fixed because offset_in_page of allocated buffer >> is always sector aligned. > > Didn't we already reject this approach because page frags cannot be > reused and that pages allocated to the frag pool are pinned in > memory until all fragments allocated on the page have been freed? I don't know if you did, but it's certainly true., Also I don't think there's any specified alignment guarantee for page_frag_alloc(). What about kmem_cache_create() with align parameter? That *should* be guaranteed regardless of whatever debugging is enabled - if not, I would consider it a bug. > i.e. when we consider 64k page machines and 4k block sizes (i.e. > default config), every single metadata allocation is a sub-page > allocation and so will use this new page frag mechanism. IOWs, it > will result in fragmenting memory severely and typical memory > reclaim not being able to fix it because the metadata that pins each > page is largely unreclaimable... > > Cheers, > > Dave. >
On Mon, Feb 25, 2019 at 04:46:25PM +0800, Ming Lei wrote: > On Mon, Feb 25, 2019 at 03:36:48PM +1100, Dave Chinner wrote: > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > > > XFS uses kmalloc() to allocate sector sized IO buffer. > > .... > > > Use page_frag_alloc() to allocate the sector sized buffer, then the > > > above issue can be fixed because offset_in_page of allocated buffer > > > is always sector aligned. > > > > Didn't we already reject this approach because page frags cannot be > > I remembered there is this kind of issue mentioned, but just not found > the details, so post out the patch for restarting the discussion. As previously discussed, the only solution that fits all use cases we have to support are a slab caches that do not break object alignment when slab debug options are turned on. > > reused and that pages allocated to the frag pool are pinned in > > memory until all fragments allocated on the page have been freed? > > Yes, that is one problem. But if one page is consumed, sooner or later, > all fragments will be freed, then the page becomes available again. Ah, no, your assumption about how metadata caching in XFS works is flawed. Some metadata ends up being cached for the life of the filesystem because it is so frequently referenced it never gets reclaimed. AG headers, btree root blocks, etc. And the XFS metadata cache hangs on to such metadata even under extreme memory pressure because if we reclaim it then any filesystem operation will need to reallocate that memory to clean dirty pages and that is the very last thing we want to do under extreme memory pressure conditions. If allocation cannot reuse holes in pages (i.e. works as a proper slab cache) then we are going to blow out the amount of memory that the XFS metadata cache uses very badly on filesystems where block size != page size. > > i.e. when we consider 64k page machines and 4k block sizes (i.e. > > default config), every single metadata allocation is a sub-page > > allocation and so will use this new page frag mechanism. IOWs, it > > will result in fragmenting memory severely and typical memory > > reclaim not being able to fix it because the metadata that pins each > > page is largely unreclaimable... > > It can be an issue in case of IO timeout & retry. This makes no sense to me. Exactly how does filesystem memory allocation affect IO timeouts and any retries the filesystem might issue? Cheers, Dave.
On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote: > On 2/25/19 5:36 AM, Dave Chinner wrote: > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > >> XFS uses kmalloc() to allocate sector sized IO buffer. > > .... > >> Use page_frag_alloc() to allocate the sector sized buffer, then the > >> above issue can be fixed because offset_in_page of allocated buffer > >> is always sector aligned. > > > > Didn't we already reject this approach because page frags cannot be > > reused and that pages allocated to the frag pool are pinned in > > memory until all fragments allocated on the page have been freed? > > I don't know if you did, but it's certainly true., Also I don't think > there's any specified alignment guarantee for page_frag_alloc(). We did, and the alignment guarantee would have come from all fragments having an aligned size. > What about kmem_cache_create() with align parameter? That *should* be > guaranteed regardless of whatever debugging is enabled - if not, I would > consider it a bug. Yup, that's pretty much what was decided. The sticking point was whether is should be block layer infrastructure (because the actual memory buffer alignment is a block/device driver requirement not visible to the filesystem) or whether "sector size alignement is good enough for everyone". Cheers, Dave.
On Tue, Feb 26, 2019 at 07:26:30AM +1100, Dave Chinner wrote: > On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote: > > On 2/25/19 5:36 AM, Dave Chinner wrote: > > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > > >> XFS uses kmalloc() to allocate sector sized IO buffer. > > > .... > > >> Use page_frag_alloc() to allocate the sector sized buffer, then the > > >> above issue can be fixed because offset_in_page of allocated buffer > > >> is always sector aligned. > > > > > > Didn't we already reject this approach because page frags cannot be > > > reused and that pages allocated to the frag pool are pinned in > > > memory until all fragments allocated on the page have been freed? > > > > I don't know if you did, but it's certainly true., Also I don't think > > there's any specified alignment guarantee for page_frag_alloc(). > > We did, and the alignment guarantee would have come from all > fragments having an aligned size. > > > What about kmem_cache_create() with align parameter? That *should* be > > guaranteed regardless of whatever debugging is enabled - if not, I would > > consider it a bug. > > Yup, that's pretty much what was decided. The sticking point was > whether is should be block layer infrastructure (because the actual > memory buffer alignment is a block/device driver requirement not > visible to the filesystem) or whether "sector size alignement is > good enough for everyone". OK, looks I miss the long life time of meta data caching, then let's discuss the slab approach. Looks one single slab cache doesn't work, given the size may be 512 * N (1 <= N < PAGE_SIZE/512), that is basically what I posted the first time. https://marc.info/?t=153986884900007&r=1&w=2 https://marc.info/?t=153986885100001&r=1&w=2 Or what is the exact size of sub-page IO in xfs most of time? For example, if 99% times falls in 512 byte allocation, maybe it is enough to just maintain one 512byte slab. Thanks, Ming
On Tue, Feb 26, 2019 at 10:22:50AM +0800, Ming Lei wrote: > On Tue, Feb 26, 2019 at 07:26:30AM +1100, Dave Chinner wrote: > > On Mon, Feb 25, 2019 at 02:15:59PM +0100, Vlastimil Babka wrote: > > > On 2/25/19 5:36 AM, Dave Chinner wrote: > > > > On Mon, Feb 25, 2019 at 12:09:04PM +0800, Ming Lei wrote: > > > >> XFS uses kmalloc() to allocate sector sized IO buffer. > > > > .... > > > >> Use page_frag_alloc() to allocate the sector sized buffer, then the > > > >> above issue can be fixed because offset_in_page of allocated buffer > > > >> is always sector aligned. > > > > > > > > Didn't we already reject this approach because page frags cannot be > > > > reused and that pages allocated to the frag pool are pinned in > > > > memory until all fragments allocated on the page have been freed? > > > > > > I don't know if you did, but it's certainly true., Also I don't think > > > there's any specified alignment guarantee for page_frag_alloc(). > > > > We did, and the alignment guarantee would have come from all > > fragments having an aligned size. > > > > > What about kmem_cache_create() with align parameter? That *should* be > > > guaranteed regardless of whatever debugging is enabled - if not, I would > > > consider it a bug. > > > > Yup, that's pretty much what was decided. The sticking point was > > whether is should be block layer infrastructure (because the actual > > memory buffer alignment is a block/device driver requirement not > > visible to the filesystem) or whether "sector size alignement is > > good enough for everyone". > > OK, looks I miss the long life time of meta data caching, then let's > discuss the slab approach. > > Looks one single slab cache doesn't work, given the size may be 512 * N > (1 <= N < PAGE_SIZE/512), that is basically what I posted the first > time. > > https://marc.info/?t=153986884900007&r=1&w=2 > https://marc.info/?t=153986885100001&r=1&w=2 > > Or what is the exact size of sub-page IO in xfs most of time? For Determined by mkfs parameters. Any power of 2 between 512 bytes and 64kB needs to be supported. e.g: # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... will have metadata that is sector sized (512 bytes), filesystem block sized (1k), directory block sized (8k) and inode cluster sized (32k), and will use all of them in large quantities. > example, if 99% times falls in 512 byte allocation, maybe it is enough > to just maintain one 512byte slab. It is not. On a 64k page size machine, we use sub page slabs for metadata blocks of 2^N bytes where 9 <= N <= 15.. Cheers, Dave.
On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > Or what is the exact size of sub-page IO in xfs most of time? For > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > 64kB needs to be supported. e.g: > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > will have metadata that is sector sized (512 bytes), filesystem > block sized (1k), directory block sized (8k) and inode cluster sized > (32k), and will use all of them in large quantities. If XFS is going to use each of these in large quantities, then it doesn't seem unreasonable for XFS to create a slab for each type of metadata?
On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > Or what is the exact size of sub-page IO in xfs most of time? For > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > > 64kB needs to be supported. e.g: > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > > will have metadata that is sector sized (512 bytes), filesystem > > block sized (1k), directory block sized (8k) and inode cluster sized > > (32k), and will use all of them in large quantities. > > If XFS is going to use each of these in large quantities, then it doesn't > seem unreasonable for XFS to create a slab for each type of metadata? Well, that is the question, isn't it? How many other filesystems will want to make similar "don't use entire pages just for 4k of metadata" optimisations as 64k page size machines become more common? There are others that have the same "use slab for sector aligned IO" which will fall foul of the same problem that has been reported for XFS.... If nobody else cares/wants it, then it can be XFS only. But it's only fair we address the "will it be useful to others" question first..... -Dave.
On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > > Or what is the exact size of sub-page IO in xfs most of time? For > > > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > > > 64kB needs to be supported. e.g: > > > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > > > > will have metadata that is sector sized (512 bytes), filesystem > > > block sized (1k), directory block sized (8k) and inode cluster sized > > > (32k), and will use all of them in large quantities. > > > > If XFS is going to use each of these in large quantities, then it doesn't > > seem unreasonable for XFS to create a slab for each type of metadata? > > > Well, that is the question, isn't it? How many other filesystems > will want to make similar "don't use entire pages just for 4k of > metadata" optimisations as 64k page size machines become more > common? There are others that have the same "use slab for sector > aligned IO" which will fall foul of the same problem that has been > reported for XFS.... > > If nobody else cares/wants it, then it can be XFS only. But it's > only fair we address the "will it be useful to others" question > first..... This kind of slab cache should have been global, just like interface of kmalloc(size). However, the alignment requirement depends on block device's block size, then it becomes hard to implement as genera interface, for example: block size: 512, 1024, 2048, 4096 slab size: 512*N, 0 < N < PAGE_SIZE/512 For 4k page size, 28(7*4) slabs need to be created, and 64k page size needs to create 127*4 slabs. But, specific file system may only use some of them, and it depends on meta data size. Thanks, Ming
On 2/26/19 10:33 AM, Ming Lei wrote: > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: >>>>> Or what is the exact size of sub-page IO in xfs most of time? For >>>> >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and >>>> 64kB needs to be supported. e.g: >>>> >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... >>>> >>>> will have metadata that is sector sized (512 bytes), filesystem >>>> block sized (1k), directory block sized (8k) and inode cluster sized >>>> (32k), and will use all of them in large quantities. >>> >>> If XFS is going to use each of these in large quantities, then it doesn't >>> seem unreasonable for XFS to create a slab for each type of metadata? >> >> >> Well, that is the question, isn't it? How many other filesystems >> will want to make similar "don't use entire pages just for 4k of >> metadata" optimisations as 64k page size machines become more >> common? There are others that have the same "use slab for sector >> aligned IO" which will fall foul of the same problem that has been >> reported for XFS.... >> >> If nobody else cares/wants it, then it can be XFS only. But it's >> only fair we address the "will it be useful to others" question >> first..... > > This kind of slab cache should have been global, just like interface of > kmalloc(size). > > However, the alignment requirement depends on block device's block size, > then it becomes hard to implement as genera interface, for example: > > block size: 512, 1024, 2048, 4096 > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > needs to create 127*4 slabs. > Where does the '*4' multiplier come from? So I wonder how hard would it actually be (+CC slab maintainers) to just guarantee generic kmalloc() alignment for power-of-two sizes. If we can do that for kmem_cache_create() then the code should be already there. AFAIK the alignment happens anyway (albeit not guaranteed) in the non-debug cases, and if guaranteeing alignment for certain debugging configurations (that need some space before the object) means larger memory overhead, then the cost should still be bearable since its non-standard configuration where the point is to catch bug and not have peak performance? > But, specific file system may only use some of them, and it depends > on meta data size. > > Thanks, > Ming >
On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 2/26/19 10:33 AM, Ming Lei wrote: > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > >>>>> Or what is the exact size of sub-page IO in xfs most of time? For > >>>> > >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and > >>>> 64kB needs to be supported. e.g: > >>>> > >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > >>>> > >>>> will have metadata that is sector sized (512 bytes), filesystem > >>>> block sized (1k), directory block sized (8k) and inode cluster sized > >>>> (32k), and will use all of them in large quantities. > >>> > >>> If XFS is going to use each of these in large quantities, then it doesn't > >>> seem unreasonable for XFS to create a slab for each type of metadata? > >> > >> > >> Well, that is the question, isn't it? How many other filesystems > >> will want to make similar "don't use entire pages just for 4k of > >> metadata" optimisations as 64k page size machines become more > >> common? There are others that have the same "use slab for sector > >> aligned IO" which will fall foul of the same problem that has been > >> reported for XFS.... > >> > >> If nobody else cares/wants it, then it can be XFS only. But it's > >> only fair we address the "will it be useful to others" question > >> first..... > > > > This kind of slab cache should have been global, just like interface of > > kmalloc(size). > > > > However, the alignment requirement depends on block device's block size, > > then it becomes hard to implement as genera interface, for example: > > > > block size: 512, 1024, 2048, 4096 > > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > > needs to create 127*4 slabs. > > > > Where does the '*4' multiplier come from? The buffer needs to be device block size aligned for dio, and now the block size can be 512, 1024, 2048 and 4096. Thanks, Ming Lei
On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote: > On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 2/26/19 10:33 AM, Ming Lei wrote: > > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > > >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > >>>>> Or what is the exact size of sub-page IO in xfs most of time? For > > >>>> > > >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and > > >>>> 64kB needs to be supported. e.g: > > >>>> > > >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > >>>> > > >>>> will have metadata that is sector sized (512 bytes), filesystem > > >>>> block sized (1k), directory block sized (8k) and inode cluster sized > > >>>> (32k), and will use all of them in large quantities. > > >>> > > >>> If XFS is going to use each of these in large quantities, then it doesn't > > >>> seem unreasonable for XFS to create a slab for each type of metadata? > > >> > > >> > > >> Well, that is the question, isn't it? How many other filesystems > > >> will want to make similar "don't use entire pages just for 4k of > > >> metadata" optimisations as 64k page size machines become more > > >> common? There are others that have the same "use slab for sector > > >> aligned IO" which will fall foul of the same problem that has been > > >> reported for XFS.... > > >> > > >> If nobody else cares/wants it, then it can be XFS only. But it's > > >> only fair we address the "will it be useful to others" question > > >> first..... > > > > > > This kind of slab cache should have been global, just like interface of > > > kmalloc(size). > > > > > > However, the alignment requirement depends on block device's block size, > > > then it becomes hard to implement as genera interface, for example: > > > > > > block size: 512, 1024, 2048, 4096 > > > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > > > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > > > needs to create 127*4 slabs. > > > > > > > Where does the '*4' multiplier come from? > > The buffer needs to be device block size aligned for dio, and now the block > size can be 512, 1024, 2048 and 4096. Why does the block size make a difference? This requirement is due to some storage devices having shoddy DMA controllers. Are you saying there are devices which can't even do 512-byte aligned I/O?
On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote: > > On Tue, Feb 26, 2019 at 6:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > On 2/26/19 10:33 AM, Ming Lei wrote: > > > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > > > >> On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > > >>> On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > >>>>> Or what is the exact size of sub-page IO in xfs most of time? For > > > >>>> > > > >>>> Determined by mkfs parameters. Any power of 2 between 512 bytes and > > > >>>> 64kB needs to be supported. e.g: > > > >>>> > > > >>>> # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > >>>> > > > >>>> will have metadata that is sector sized (512 bytes), filesystem > > > >>>> block sized (1k), directory block sized (8k) and inode cluster sized > > > >>>> (32k), and will use all of them in large quantities. > > > >>> > > > >>> If XFS is going to use each of these in large quantities, then it doesn't > > > >>> seem unreasonable for XFS to create a slab for each type of metadata? > > > >> > > > >> > > > >> Well, that is the question, isn't it? How many other filesystems > > > >> will want to make similar "don't use entire pages just for 4k of > > > >> metadata" optimisations as 64k page size machines become more > > > >> common? There are others that have the same "use slab for sector > > > >> aligned IO" which will fall foul of the same problem that has been > > > >> reported for XFS.... > > > >> > > > >> If nobody else cares/wants it, then it can be XFS only. But it's > > > >> only fair we address the "will it be useful to others" question > > > >> first..... > > > > > > > > This kind of slab cache should have been global, just like interface of > > > > kmalloc(size). > > > > > > > > However, the alignment requirement depends on block device's block size, > > > > then it becomes hard to implement as genera interface, for example: > > > > > > > > block size: 512, 1024, 2048, 4096 > > > > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > > > > > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > > > > needs to create 127*4 slabs. > > > > > > > > > > Where does the '*4' multiplier come from? > > > > The buffer needs to be device block size aligned for dio, and now the block > > size can be 512, 1024, 2048 and 4096. > > Why does the block size make a difference? This requirement is due to > some storage devices having shoddy DMA controllers. Are you saying there > are devices which can't even do 512-byte aligned I/O? Direct IO requires that, see do_blockdev_direct_IO(). This issue can be triggered when running xfs over loop/dio. We could fallback to buffered IO under this situation, but not sure it is the only case. Thanks, Ming
On Tue, Feb 26, 2019 at 08:35:46PM +0800, Ming Lei wrote: > On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote: > > > The buffer needs to be device block size aligned for dio, and now the block > > > size can be 512, 1024, 2048 and 4096. > > > > Why does the block size make a difference? This requirement is due to > > some storage devices having shoddy DMA controllers. Are you saying there > > are devices which can't even do 512-byte aligned I/O? > > Direct IO requires that, see do_blockdev_direct_IO(). > > This issue can be triggered when running xfs over loop/dio. We could > fallback to buffered IO under this situation, but not sure it is the > only case. Wait, we're imposing a ridiculous amount of complexity on XFS for no reason at all? We should just change this to 512-byte alignment. Tying it to the blocksize of the device never made any sense.
On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 08:35:46PM +0800, Ming Lei wrote: > > On Tue, Feb 26, 2019 at 04:12:09AM -0800, Matthew Wilcox wrote: > > > On Tue, Feb 26, 2019 at 07:12:49PM +0800, Ming Lei wrote: > > > > The buffer needs to be device block size aligned for dio, and now the block > > > > size can be 512, 1024, 2048 and 4096. > > > > > > Why does the block size make a difference? This requirement is due to > > > some storage devices having shoddy DMA controllers. Are you saying there > > > are devices which can't even do 512-byte aligned I/O? > > > > Direct IO requires that, see do_blockdev_direct_IO(). > > > > This issue can be triggered when running xfs over loop/dio. We could > > fallback to buffered IO under this situation, but not sure it is the > > only case. > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > reason at all? We should just change this to 512-byte alignment. Tying > it to the blocksize of the device never made any sense. OK, that is fine since we can fallback to buffered IO for loop in case of unaligned dio. Then something like the following patch should work for all fs, could anyone comment on this approach? -- diff --git a/block/blk-lib.c b/block/blk-lib.c index 5f2c429d4378..76f09f23a410 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -405,3 +405,44 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return ret; } EXPORT_SYMBOL(blkdev_issue_zeroout); + +static struct kmem_cache *sector_buf_slabs[(PAGE_SIZE >> 9) - 1]; + +void *blk_alloc_sec_buf(unsigned size, gfp_t flags) +{ + int idx; + + size = round_up(size, 512); + if (size >= PAGE_SIZE) + return NULL; + + idx = (size >> 9) - 1; + if (!sector_buf_slabs[idx]) + return NULL; + return kmem_cache_alloc(sector_buf_slabs[idx], flags); +} +EXPORT_SYMBOL_GPL(blk_alloc_sec_buf); + +void blk_free_sec_buf(void *buf, int size) +{ + size = round_up(size, 512); + if (size >= PAGE_SIZE) + return; + + return kmem_cache_free(sector_buf_slabs[(size >> 9) - 1], buf); +} +EXPORT_SYMBOL_GPL(blk_free_sec_buf); + +void __init blk_sector_buf_init(void) +{ + unsigned size; + + for (size = 512; size < PAGE_SIZE; size += 512) { + char name[16]; + int idx = (size >> 9) - 1; + + snprintf(name, 16, "blk_sec_buf-%u", size); + sector_buf_slabs[idx] = kmem_cache_create(name, size, 512, + SLAB_PANIC, NULL); + } +} diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index faed9d9eb84c..a4117e526715 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1657,6 +1657,9 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *); extern int bdev_write_page(struct block_device *, sector_t, struct page *, struct writeback_control *); +extern void *blk_alloc_sec_buf(unsigned size, gfp_t flags); +extern void blk_free_sec_buf(void *buf, int size); + #ifdef CONFIG_BLK_DEV_ZONED bool blk_req_needs_zone_write_lock(struct request *rq); void __blk_req_zone_write_lock(struct request *rq); @@ -1755,6 +1758,15 @@ static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask, return 0; } +static inline void *blk_alloc_sec_buf(unsigned size, gfp_t flags) +{ + return NULL; +} + +static inline void blk_free_sec_buf(void *buf, int size) +{ +} + #endif /* CONFIG_BLOCK */ static inline void blk_wake_io_task(struct task_struct *waiter) Thanks, Ming
On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote: > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > > reason at all? We should just change this to 512-byte alignment. Tying > > it to the blocksize of the device never made any sense. > > OK, that is fine since we can fallback to buffered IO for loop in case of > unaligned dio. > > Then something like the following patch should work for all fs, could > anyone comment on this approach? That's not even close to what I meant. diff --git a/fs/direct-io.c b/fs/direct-io.c index ec2fb6fe6d37..dee1fc47a7fc 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, struct dio_submit sdio = { 0, }; struct buffer_head map_bh = { 0, }; struct blk_plug plug; - unsigned long align = offset | iov_iter_alignment(iter); /* * Avoid references to bdev if not absolutely needed to give * the early prefetch in the caller enough time. */ - if (align & blocksize_mask) { + if (iov_iter_alignment(iter) & 511) + goto out; + + if (offset & blocksize_mask) { if (bdev) blkbits = blksize_bits(bdev_logical_block_size(bdev)); blocksize_mask = (1 << blkbits) - 1; - if (align & blocksize_mask) + if (offset & blocksize_mask) goto out; }
On Mon, 25 Feb 2019, Vlastimil Babka wrote: > What about kmem_cache_create() with align parameter? That *should* be > guaranteed regardless of whatever debugging is enabled - if not, I would > consider it a bug. It definitely guarantees that. What would be the point of the alignment parameter otherwise?
On Tue, 26 Feb 2019, Ming Lei wrote: > Then something like the following patch should work for all fs, could > anyone comment on this approach? Note that various subsystems have similar implementations. Have a look at drivers/dma/dmaengine.c struct dmaengine_unmap_pool { struct kmem_cache *cache; const char *name; mempool_t *pool; size_t size; }; #define __UNMAP_POOL(x) { .size = x, .name = "dmaengine-unmap-" __stringify(x) } static struct dmaengine_unmap_pool unmap_pool[] = { __UNMAP_POOL(2), #if IS_ENABLED(CONFIG_DMA_ENGINE_RAID) __UNMAP_POOL(16), __UNMAP_POOL(128), __UNMAP_POOL(256), #endif }; Or drivers/md/dm-bufio.c: struct dm_bufio_client { struct mutex lock; struct list_head lru[LIST_SIZE]; unsigned long n_buffers[LIST_SIZE]; struct block_device *bdev; unsigned block_size; s8 sectors_per_block_bits; void (*alloc_callback)(struct dm_buffer *); void (*write_callback)(struct dm_buffer *); struct kmem_cache *slab_buffer; struct kmem_cache *slab_cache; struct dm_io_client *dm_io; struct list_head reserved_buffers; unsigned need_reserved_buffers; unsigned minimum_buffers; struct rb_root buffer_tree; wait_queue_head_t free_buffer_wait; sector_t start; int async_write_error; struct list_head client_list; struct shrinker shrinker; };
On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote: > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > > > reason at all? We should just change this to 512-byte alignment. Tying > > > it to the blocksize of the device never made any sense. > > > > OK, that is fine since we can fallback to buffered IO for loop in case of > > unaligned dio. > > > > Then something like the following patch should work for all fs, could > > anyone comment on this approach? > > That's not even close to what I meant. > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index ec2fb6fe6d37..dee1fc47a7fc 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, Wait a minute, are you all saying that /directio/ is broken on XFS too?? XFS doesn't use blockdev_direct_IO anymore. I thought we were talking about alignment of XFS metadata buffers (xfs_buf.c), which is a very different topic. As I understand the problem, in non-debug mode the slab caches give xfs_buf chunks of memory that are aligned well enough to work, but in debug mode the slabs allocate slightly more bytes to carry debug information which pushes the returned address up slightly, thus breaking the alignment requirements. So why can't we just move the debug info to the end of the object? If our 512 byte allocation turns into a (512 + a few more) bytes we'll end up using 1024 bytes on the allocation regardless, so it shouldn't matter to put the debug info at offset 512. If the reason is fear that kernel code will scribble off the end of the object, then return (*obj + 512). Maybe you all have already covered this, though? --D > struct dio_submit sdio = { 0, }; > struct buffer_head map_bh = { 0, }; > struct blk_plug plug; > - unsigned long align = offset | iov_iter_alignment(iter); > > /* > * Avoid references to bdev if not absolutely needed to give > * the early prefetch in the caller enough time. > */ > > - if (align & blocksize_mask) { > + if (iov_iter_alignment(iter) & 511) > + goto out; > + > + if (offset & blocksize_mask) { > if (bdev) > blkbits = blksize_bits(bdev_logical_block_size(bdev)); > blocksize_mask = (1 << blkbits) - 1; > - if (align & blocksize_mask) > + if (offset & blocksize_mask) > goto out; > } >
On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote: > On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote: > > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > > > > reason at all? We should just change this to 512-byte alignment. Tying > > > > it to the blocksize of the device never made any sense. > > > > > > OK, that is fine since we can fallback to buffered IO for loop in case of > > > unaligned dio. > > > > > > Then something like the following patch should work for all fs, could > > > anyone comment on this approach? > > > > That's not even close to what I meant. > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index ec2fb6fe6d37..dee1fc47a7fc 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > Wait a minute, are you all saying that /directio/ is broken on XFS too?? > XFS doesn't use blockdev_direct_IO anymore. > > I thought we were talking about alignment of XFS metadata buffers > (xfs_buf.c), which is a very different topic. > > As I understand the problem, in non-debug mode the slab caches give > xfs_buf chunks of memory that are aligned well enough to work, but in > debug mode the slabs allocate slightly more bytes to carry debug > information which pushes the returned address up slightly, thus breaking > the alignment requirements. > > So why can't we just move the debug info to the end of the object? If > our 512 byte allocation turns into a (512 + a few more) bytes we'll end > up using 1024 bytes on the allocation regardless, so it shouldn't matter > to put the debug info at offset 512. If the reason is fear that kernel > code will scribble off the end of the object, then return (*obj + 512). > Maybe you all have already covered this, though? I don't know _what_ Ming Lei is saying. I thought the problem was with slab redzones, which need to be before and after each object, but apparently the problem is with KASAN as well.
On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote: > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > > > Or what is the exact size of sub-page IO in xfs most of time? For > > > > > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > > > > 64kB needs to be supported. e.g: > > > > > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > > > > > > will have metadata that is sector sized (512 bytes), filesystem > > > > block sized (1k), directory block sized (8k) and inode cluster sized > > > > (32k), and will use all of them in large quantities. > > > > > > If XFS is going to use each of these in large quantities, then it doesn't > > > seem unreasonable for XFS to create a slab for each type of metadata? > > > > > > Well, that is the question, isn't it? How many other filesystems > > will want to make similar "don't use entire pages just for 4k of > > metadata" optimisations as 64k page size machines become more > > common? There are others that have the same "use slab for sector > > aligned IO" which will fall foul of the same problem that has been > > reported for XFS.... > > > > If nobody else cares/wants it, then it can be XFS only. But it's > > only fair we address the "will it be useful to others" question > > first..... > > This kind of slab cache should have been global, just like interface of > kmalloc(size). > > However, the alignment requirement depends on block device's block size, > then it becomes hard to implement as genera interface, for example: > > block size: 512, 1024, 2048, 4096 > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > needs to create 127*4 slabs. IDGI. Where's the 7/127 come from? We only require sector alignment at most, so as long as each slab object is aligned to it's size, we only need one slab for each block size. Cheers, Dave.
On Tue, Feb 26, 2019 at 08:19:12AM -0800, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote: > > On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote: > > > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote: > > > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > > > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > > > > > reason at all? We should just change this to 512-byte alignment. Tying > > > > > it to the blocksize of the device never made any sense. > > > > > > > > OK, that is fine since we can fallback to buffered IO for loop in case of > > > > unaligned dio. > > > > > > > > Then something like the following patch should work for all fs, could > > > > anyone comment on this approach? > > > > > > That's not even close to what I meant. > > > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > > index ec2fb6fe6d37..dee1fc47a7fc 100644 > > > --- a/fs/direct-io.c > > > +++ b/fs/direct-io.c > > > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > > > Wait a minute, are you all saying that /directio/ is broken on XFS too?? > > XFS doesn't use blockdev_direct_IO anymore. > > > > I thought we were talking about alignment of XFS metadata buffers > > (xfs_buf.c), which is a very different topic. > > > > As I understand the problem, in non-debug mode the slab caches give > > xfs_buf chunks of memory that are aligned well enough to work, but in > > debug mode the slabs allocate slightly more bytes to carry debug > > information which pushes the returned address up slightly, thus breaking > > the alignment requirements. > > > > So why can't we just move the debug info to the end of the object? If > > our 512 byte allocation turns into a (512 + a few more) bytes we'll end > > up using 1024 bytes on the allocation regardless, so it shouldn't matter > > to put the debug info at offset 512. If the reason is fear that kernel > > code will scribble off the end of the object, then return (*obj + 512). > > Maybe you all have already covered this, though? > > I don't know _what_ Ming Lei is saying. I thought the problem was > with slab redzones, which need to be before and after each object, > but apparently the problem is with KASAN as well. I have mentioned several times that it is triggered on xfs over loop/dio, however it may be addressed by falling back to buffered IO in case unaligned buffer. Please see lo_rw_aio(). Thanks, Ming
On Wed, Feb 27, 2019 at 07:45:50AM +1100, Dave Chinner wrote: > On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote: > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > > > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > > > > Or what is the exact size of sub-page IO in xfs most of time? For > > > > > > > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > > > > > 64kB needs to be supported. e.g: > > > > > > > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > > > > > > > > will have metadata that is sector sized (512 bytes), filesystem > > > > > block sized (1k), directory block sized (8k) and inode cluster sized > > > > > (32k), and will use all of them in large quantities. > > > > > > > > If XFS is going to use each of these in large quantities, then it doesn't > > > > seem unreasonable for XFS to create a slab for each type of metadata? > > > > > > > > > Well, that is the question, isn't it? How many other filesystems > > > will want to make similar "don't use entire pages just for 4k of > > > metadata" optimisations as 64k page size machines become more > > > common? There are others that have the same "use slab for sector > > > aligned IO" which will fall foul of the same problem that has been > > > reported for XFS.... > > > > > > If nobody else cares/wants it, then it can be XFS only. But it's > > > only fair we address the "will it be useful to others" question > > > first..... > > > > This kind of slab cache should have been global, just like interface of > > kmalloc(size). > > > > However, the alignment requirement depends on block device's block size, > > then it becomes hard to implement as genera interface, for example: > > > > block size: 512, 1024, 2048, 4096 > > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > > needs to create 127*4 slabs. > > IDGI. Where's the 7/127 come from? > > We only require sector alignment at most, so as long as each slab > object is aligned to it's size, we only need one slab for each block > size. Each slab has fixed size, I remembered that you mentioned that the meta data size can be 512 * N (1 <= N <= PAGE_SIZE / 512). https://marc.info/?l=linux-fsdevel&m=155115014513355&w=2 Thanks, Ming
On Wed, Feb 27, 2019 at 09:50:55AM +0800, Ming Lei wrote: > On Wed, Feb 27, 2019 at 07:45:50AM +1100, Dave Chinner wrote: > > On Tue, Feb 26, 2019 at 05:33:04PM +0800, Ming Lei wrote: > > > On Tue, Feb 26, 2019 at 03:58:26PM +1100, Dave Chinner wrote: > > > > On Mon, Feb 25, 2019 at 07:27:37PM -0800, Matthew Wilcox wrote: > > > > > On Tue, Feb 26, 2019 at 02:02:14PM +1100, Dave Chinner wrote: > > > > > > > Or what is the exact size of sub-page IO in xfs most of time? For > > > > > > > > > > > > Determined by mkfs parameters. Any power of 2 between 512 bytes and > > > > > > 64kB needs to be supported. e.g: > > > > > > > > > > > > # mkfs.xfs -s size=512 -b size=1k -i size=2k -n size=8k .... > > > > > > > > > > > > will have metadata that is sector sized (512 bytes), filesystem > > > > > > block sized (1k), directory block sized (8k) and inode cluster sized > > > > > > (32k), and will use all of them in large quantities. > > > > > > > > > > If XFS is going to use each of these in large quantities, then it doesn't > > > > > seem unreasonable for XFS to create a slab for each type of metadata? > > > > > > > > > > > > Well, that is the question, isn't it? How many other filesystems > > > > will want to make similar "don't use entire pages just for 4k of > > > > metadata" optimisations as 64k page size machines become more > > > > common? There are others that have the same "use slab for sector > > > > aligned IO" which will fall foul of the same problem that has been > > > > reported for XFS.... > > > > > > > > If nobody else cares/wants it, then it can be XFS only. But it's > > > > only fair we address the "will it be useful to others" question > > > > first..... > > > > > > This kind of slab cache should have been global, just like interface of > > > kmalloc(size). > > > > > > However, the alignment requirement depends on block device's block size, > > > then it becomes hard to implement as genera interface, for example: > > > > > > block size: 512, 1024, 2048, 4096 > > > slab size: 512*N, 0 < N < PAGE_SIZE/512 > > > > > > For 4k page size, 28(7*4) slabs need to be created, and 64k page size > > > needs to create 127*4 slabs. > > > > IDGI. Where's the 7/127 come from? > > > > We only require sector alignment at most, so as long as each slab > > object is aligned to it's size, we only need one slab for each block > > size. > > Each slab has fixed size, I remembered that you mentioned that the meta > data size can be 512 * N (1 <= N <= PAGE_SIZE / 512). > > https://marc.info/?l=linux-fsdevel&m=155115014513355&w=2 nggggh. *That* *is* *not* *what* *I* *said*. That is *what you said* and I said that was wrong and the actual sizes needed were: dgc> It is not. On a 64k page size machine, we use sub page slabs for dgc> metadata blocks of 2^N bytes where 9 <= N <= 15.. Please do the maths. I shoul dnot have to do it for you. Also, please don't confusing sector-in-LBA alignment with /memory buffer/ alignment. i.e. you're assuming that these 2kB IOs at different sector offsets like the following: 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 +-------+-------+-------+-------+-------+-------+-------+-------+ +ooooooooooooooooooooooooooooooo+ +ooooooooooooooooooooooooooooooo+ +ooooooooooooooooooooooooooooooo+ +ooooooooooooooooooooooooooooooo+ +ooooooooooooooooooooooooooooooo+ require a memory buffer alignment that matches the sector-in-LBA alignment. This is wrong - the memory alignment constraint is usually a hardware DMA engine limitation and has nothing to do with the LBA the IO will place/retreive the data in/from. i.e. An aligned 2kB slab will only allocate 2kB slab objects like so: 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 +-------+-------+-------+-------+-------+-------+-------+-------+ +ooooooooooooooooooooooooooooooo+ +ooooooooooooooooooooooooooooooo+ and these memory buffers will always have 512 byte alignment. This meets the memory alignemnt requirements of all hardware (which is 512 byte alignment or smaller), and so we only need one slab per size. Cheers, Dave.
On 2/26/19 5:19 PM, Matthew Wilcox wrote: > On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote: >> On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote: >> Wait a minute, are you all saying that /directio/ is broken on XFS too?? >> XFS doesn't use blockdev_direct_IO anymore. >> >> I thought we were talking about alignment of XFS metadata buffers >> (xfs_buf.c), which is a very different topic. >> >> As I understand the problem, in non-debug mode the slab caches give >> xfs_buf chunks of memory that are aligned well enough to work, but in >> debug mode the slabs allocate slightly more bytes to carry debug >> information which pushes the returned address up slightly, thus breaking >> the alignment requirements. >> >> So why can't we just move the debug info to the end of the object? If >> our 512 byte allocation turns into a (512 + a few more) bytes we'll end >> up using 1024 bytes on the allocation regardless, so it shouldn't matter >> to put the debug info at offset 512. If the reason is fear that kernel >> code will scribble off the end of the object, then return (*obj + 512). >> Maybe you all have already covered this, though? > > I don't know _what_ Ming Lei is saying. I thought the problem was > with slab redzones, which need to be before and after each object, > but apparently the problem is with KASAN as well. That's what I thought as well. But if we can solve it for caches created by kmem_cache_create(..., align, ...) then IMHO we could guarantee natural alignment for power-of-two kmalloc caches as well.
On Tue, Feb 26, 2019 at 08:14:33AM -0800, Darrick J. Wong wrote: > On Tue, Feb 26, 2019 at 06:04:40AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 26, 2019 at 09:42:48PM +0800, Ming Lei wrote: > > > On Tue, Feb 26, 2019 at 05:02:30AM -0800, Matthew Wilcox wrote: > > > > Wait, we're imposing a ridiculous amount of complexity on XFS for no > > > > reason at all? We should just change this to 512-byte alignment. Tying > > > > it to the blocksize of the device never made any sense. > > > > > > OK, that is fine since we can fallback to buffered IO for loop in case of > > > unaligned dio. > > > > > > Then something like the following patch should work for all fs, could > > > anyone comment on this approach? > > > > That's not even close to what I meant. > > > > diff --git a/fs/direct-io.c b/fs/direct-io.c > > index ec2fb6fe6d37..dee1fc47a7fc 100644 > > --- a/fs/direct-io.c > > +++ b/fs/direct-io.c > > @@ -1185,18 +1185,20 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > Wait a minute, are you all saying that /directio/ is broken on XFS too?? > XFS doesn't use blockdev_direct_IO anymore. No, loop devices w/ direct IO is a complete red herring. It's the same issue - the upper filesystem is sending down unaligned bios to the loop device, which is then just passing them on to the underlying filesystem via DIO, and the DIO code in the lower filesystem saying "that user memory buffer ain't aligned to my logical sector size" and rejecting it. Actually, in XFS's case, it doesn't care about memory buffer alignment - it's the iomap code that is rejecting it when mapping the memory buffer to a bio in iomap_dio_bio_actor(): unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev)); ..... unsigned int align = iov_iter_alignment(dio->submit.iter); ..... if ((pos | length | align) & ((1 << blkbits) - 1)) return -EINVAL; IOWs, if the memory buffer is not aligned to the logical block size of the underlying device (which defaults to 512 bytes) then it will be rejected by the lower filesystem... > I thought we were talking about alignment of XFS metadata buffers > (xfs_buf.c), which is a very different topic. Yup, it's the same problem, just a different symptom. Fix the unaligned buffer problem, we fix the loop dev w/ direct-io problem, too. FWIW, this "filesystem image sector size mismatch" problem has been around for many, many years - the same issue that occurs with loop devices when you try to mount a 512 byte sector image on a hard 4k sector host filesystem/storage device using direct IO in the loop device. This isn't a new thing at all - if you want to use direct IO to manipulate filesystem images, you actually need to know what you are doing.... Cheers, Dave.
On Wed, Feb 27, 2019 at 08:07:31AM +0100, Vlastimil Babka wrote: > > I don't know _what_ Ming Lei is saying. I thought the problem was > > with slab redzones, which need to be before and after each object, > > but apparently the problem is with KASAN as well. > > That's what I thought as well. But if we can solve it for caches created > by kmem_cache_create(..., align, ...) then IMHO we could guarantee > natural alignment for power-of-two kmalloc caches as well. Yes, having a version of kmalloc that guarantees power of two alignment would be extremely helpful and avoid a lot of pointless boilerplate code.
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 4f5f2ff3f70f..92b8cdf5e51c 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -340,12 +340,27 @@ xfs_buf_free( __free_page(page); } } else if (bp->b_flags & _XBF_KMEM) - kmem_free(bp->b_addr); + page_frag_free(bp->b_addr); _xfs_buf_free_pages(bp); xfs_buf_free_maps(bp); kmem_zone_free(xfs_buf_zone, bp); } +static DEFINE_PER_CPU(struct page_frag_cache, xfs_frag_cache); + +static void *xfs_alloc_frag(int size) +{ + struct page_frag_cache *nc; + void *data; + + preempt_disable(); + nc = this_cpu_ptr(&xfs_frag_cache); + data = page_frag_alloc(nc, size, GFP_ATOMIC); + preempt_enable(); + + return data; +} + /* * Allocates all the pages for buffer in question and builds it's page list. */ @@ -368,7 +383,7 @@ xfs_buf_allocate_memory( */ size = BBTOB(bp->b_length); if (size < PAGE_SIZE) { - bp->b_addr = kmem_alloc(size, KM_NOFS); + bp->b_addr = xfs_alloc_frag(size); if (!bp->b_addr) { /* low memory - use alloc_page loop instead */ goto use_alloc_page; @@ -377,7 +392,7 @@ xfs_buf_allocate_memory( if (((unsigned long)(bp->b_addr + size - 1) & PAGE_MASK) != ((unsigned long)bp->b_addr & PAGE_MASK)) { /* b_addr spans two pages - use alloc_page instead */ - kmem_free(bp->b_addr); + page_frag_free(bp->b_addr); bp->b_addr = NULL; goto use_alloc_page; }
XFS uses kmalloc() to allocate sector sized IO buffer. Turns out buffer allocated via kmalloc(sector sized) can't be guaranteed to be 512 byte aligned, and actually slab only provides ARCH_KMALLOC_MINALIGN alignment, even though it is observed that the sector size allocation is often 512 byte aligned. When KASAN or other memory debug options are enabled, the allocated buffer becomes not aliged with 512 byte any more. This unalgined IO buffer causes at least two issues: 1) some storage controller requires IO buffer to be 512 byte aligned, and data corruption is observed 2) loop/dio requires the IO buffer to be logical block size aligned, and loop's default logcial block size is 512 byte, then one xfs image can't be mounted via loop/dio any more. Use page_frag_alloc() to allocate the sector sized buffer, then the above issue can be fixed because offset_in_page of allocated buffer is always sector aligned. Not see any regression with this patch on xfstests. Cc: Jens Axboe <axboe@kernel.dk> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Christopher Lameter <cl@linux.com> Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org> Cc: linux-mm@kvack.org Cc: linux-block@vger.kernel.org Link: https://marc.info/?t=153734857500004&r=1&w=2 Signed-off-by: Ming Lei <ming.lei@redhat.com> --- fs/xfs/xfs_buf.c | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-)