| Message ID | 20230612203910.724378-7-willy@infradead.org |
|---|---|
| State | Under Review |
| Series | Create large folios in iomap buffered write path |
On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREAT path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
.....
> -	/* Init accessed so avoid atomic mark_page_accessed later */
> -	if (fgp_flags & FGP_ACCESSED)
> -		__folio_set_referenced(folio);
> +	if (!mapping_large_folio_support(mapping))
> +		order = 0;
> +	if (order > MAX_PAGECACHE_ORDER)
> +		order = MAX_PAGECACHE_ORDER;
> +	/* If we're not aligned, allocate a smaller folio */
> +	if (index & ((1UL << order) - 1))
> +		order = __ffs(index);

If I read this right, if we pass in an unaligned index, we won't get
the size of the folio we ask for?

e.g. if we want an order-4 folio (64kB) because we have a 64kB block
size in the filesystem, then we have to pass in an index that is
order-4 aligned, yes?

I ask this, because the later iomap code that asks for large folios
only passes in "pos >> PAGE_SHIFT", so it looks to me like it won't
allocate large folios for anything other than large-folio-aligned
writes, even if we need them.

What am I missing?

-Dave.
On Tue, Jun 13, 2023 at 08:49:05AM +1000, Dave Chinner wrote:
> On Mon, Jun 12, 2023 at 09:39:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > Allow callers of __filemap_get_folio() to specify a preferred folio
> > order in the FGP flags.  This is only honoured in the FGP_CREAT path;
> > if there is already a folio in the page cache that covers the index,
> > we will return it, no matter what its order is.  No create-around is
> > attempted; we will only create folios which start at the specified index.
> > Unmodified callers will continue to allocate order 0 folios.
> .....
> > -	/* Init accessed so avoid atomic mark_page_accessed later */
> > -	if (fgp_flags & FGP_ACCESSED)
> > -		__folio_set_referenced(folio);
> > +	if (!mapping_large_folio_support(mapping))
> > +		order = 0;
> > +	if (order > MAX_PAGECACHE_ORDER)
> > +		order = MAX_PAGECACHE_ORDER;
> > +	/* If we're not aligned, allocate a smaller folio */
> > +	if (index & ((1UL << order) - 1))
> > +		order = __ffs(index);
>
> If I read this right, if we pass in an unaligned index, we won't get
> the size of the folio we ask for?

Right.  That's implied by (but perhaps not obvious from) the changelog.
Folios are always naturally aligned in the file, so an order-4 folio
has to start at a multiple of 16.  If the index you pass in is not
a multiple of 16, we can't create an order-4 folio without starting
at an earlier index.

For a 4kB block size filesystem, that's what we want.  Applications
_generally_ don't write backwards, so creating an order-4 folio is just
wasting memory.

> e.g. if we want an order-4 folio (64kB) because we have a 64kB block
> size in the filesystem, then we have to pass in an index that is
> order-4 aligned, yes?
>
> I ask this, because the later iomap code that asks for large folios
> only passes in "pos >> PAGE_SHIFT", so it looks to me like it won't
> allocate large folios for anything other than large-folio-aligned
> writes, even if we need them.
>
> What am I missing?

Perhaps what you're missing is that this isn't trying to solve the
problem of supporting a bs > ps filesystem?  That's also a worthwhile
project, but it's not this project.  In fact, I'd say that project is
almost orthogonal to this one; for this usage we can always fall back
to smaller folios on memory pressure or misalignment.  For a bs > ps
block device, we have to allocate folios at least as large as the
blocksize and cannot fall back to smaller folios.  For a bs > ps
filesystem on a bdev with bs == ps, we can fall back (as your
prototype showed).
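[To make the alignment rule above concrete, here is a minimal user-space
model of the clamping logic quoted in this thread. It is an illustration
only: __builtin_ctzl() stands in for the kernel's __ffs(), and
MAX_PAGECACHE_ORDER is assumed to be 9, its value on x86-64 with 4kB
pages and THP enabled.]

#include <stdio.h>

#define MAX_PAGECACHE_ORDER	9	/* assumed: HPAGE_PMD_ORDER on x86-64/4kB */

/* Model of the order clamping in __filemap_get_folio(); index is in pages. */
static unsigned int clamp_order(unsigned long index, unsigned int order)
{
	if (order > MAX_PAGECACHE_ORDER)
		order = MAX_PAGECACHE_ORDER;
	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __builtin_ctzl(index);	/* == __ffs(index) */
	return order;
}

int main(void)
{
	/* Ask for an order-4 (64kB) folio at various indices. */
	printf("%u\n", clamp_order(16, 4));	/* 4: index 16 is order-4 aligned */
	printf("%u\n", clamp_order(12, 4));	/* 2: 12 allows at most order 2 */
	printf("%u\n", clamp_order(3, 4));	/* 0: an odd index forces order 0 */
	return 0;
}

[The allocation loop in the patch additionally demotes order 1 to
order 0 and falls back to smaller orders if a large allocation fails.]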
On Tue, Jun 13, 2023 at 01:42:51AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 08:49:05AM +1000, Dave Chinner wrote:
> > If I read this right, if we pass in an unaligned index, we won't get
> > the size of the folio we ask for?
>
> Right.  That's implied by (but perhaps not obvious from) the changelog.
> Folios are always naturally aligned in the file, so an order-4 folio
> has to start at a multiple of 16.  If the index you pass in is not
> a multiple of 16, we can't create an order-4 folio without starting
> at an earlier index.
>
> For a 4kB block size filesystem, that's what we want.  Applications
> _generally_ don't write backwards, so creating an order-4 folio is just
> wasting memory.
.....
> Perhaps what you're missing is that this isn't trying to solve the
> problem of supporting a bs > ps filesystem?

No, that's not what I'm asking about. I know there's other changes
needed to enforce minimum folio size/alignment for bs > ps.

What I'm asking about is when someone does a 16kB write at offset
12kB, they won't get a large folio allocated at all, right? Even
though the write is large enough to enable it?

Indeed, if we do a 1MB write at offset 4kB, we'll get 4kB folios at
4kB, 8kB and 12kB (because we can't do order-1 folios), then order-2
at 16kB, order-3 at 32kB, and so on until we hit offset 1MB, where we
will do an order-0 folio allocation again (because the remaining
length is 4kB). The next 1MB write will then follow the same pattern,
right?

I think this ends up being sub-optimal and fairly non-obvious
behaviour from the iomap side of the fence, which is clearly asking
for high-order folios to be allocated. i.e. a small amount of
allocate-around to naturally align large folios when the page cache
is otherwise empty would make a big difference to the efficiency of
non-large-folio-aligned sequential writes...

Cheers,
Dave.
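[The pattern described above is easy to reproduce. This stand-alone
sketch simulates it under stated assumptions: 4kB pages, an empty page
cache, no memory pressure, and the caller suggesting a folio big enough
to cover the rest of the write, as the iomap patches later in this
series do. MAX_PAGECACHE_ORDER is again assumed to be 9.]

#include <stdio.h>

#define PAGE_SHIFT		12
#define MAX_PAGECACHE_ORDER	9	/* assumed x86-64/4kB value */

int main(void)
{
	unsigned long pos = 4096;		/* 1MB write at offset 4kB */
	unsigned long end = pos + (1UL << 20);

	while (pos < end) {
		unsigned long index = pos >> PAGE_SHIFT;
		unsigned long remaining = (end - pos) >> PAGE_SHIFT;
		unsigned int order = 0;

		/* Suggested order: the largest folio no bigger than the
		 * rest of the write, i.e. floor(log2(remaining pages)). */
		while ((2UL << order) <= remaining)
			order++;

		/* Clamping rules from the patch. */
		if (order > MAX_PAGECACHE_ORDER)
			order = MAX_PAGECACHE_ORDER;
		if (index & ((1UL << order) - 1))
			order = __builtin_ctzl(index);	/* __ffs(index) */
		if (order == 1)		/* no order-1 folios in the page cache */
			order = 0;

		printf("index %3lu: order %u (%lukB folio)\n",
		       index, order, 4UL << order);
		pos += 1UL << (order + PAGE_SHIFT);
	}
	return 0;
}

[Running this prints order 0 at indices 1, 2 and 3, then orders 2
through 7 doubling up to a 512kB folio at index 128, and finally order
0 again at index 256: exactly the sawtooth pattern described above.]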
On Tue, Jun 13, 2023 at 11:30:13AM +1000, Dave Chinner wrote:
> On Tue, Jun 13, 2023 at 01:42:51AM +0100, Matthew Wilcox wrote:
> > Perhaps what you're missing is that this isn't trying to solve the
> > problem of supporting a bs > ps filesystem?
>
> No, that's not what I'm asking about. I know there's other changes
> needed to enforce minimum folio size/alignment for bs > ps.

OK.  Bringing up the 64kB block size filesystem confused me.

> What I'm asking about is when someone does a 16kB write at offset
> 12kB, they won't get a large folio allocated at all, right? Even
> though the write is large enough to enable it?

Right.

> Indeed, if we do a 1MB write at offset 4kB, we'll get 4kB folios at
> 4kB, 8kB and 12kB (because we can't do order-1 folios), then order-2
> at 16kB, order-3 at 32kB, and so on until we hit offset 1MB, where we
> will do an order-0 folio allocation again (because the remaining
> length is 4kB). The next 1MB write will then follow the same pattern,
> right?

Yes.  Assuming we get another write ...

> I think this ends up being sub-optimal and fairly non-obvious
> behaviour from the iomap side of the fence, which is clearly asking
> for high-order folios to be allocated. i.e. a small amount of
> allocate-around to naturally align large folios when the page cache
> is otherwise empty would make a big difference to the efficiency of
> non-large-folio-aligned sequential writes...

At this point we're arguing about what I/O pattern to optimise for.
I'm going for a "do no harm" approach where we only allocate exactly
as much memory as we did before.  You're advocating for a
higher-risk/higher-reward approach.  I'd prefer the low-risk approach
for now; we can change it later!

I'd like to see some amount of per-fd write history (as we have per-fd
readahead history) to decide whether to allocate large folios ahead of
the current write position.  As with readahead, I'd like to see that
even doing single-byte writes can result in the allocation of large
folios, as long as the app has done enough of them.
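[Nothing like the per-fd write history above exists yet; purely to
illustrate the shape of the idea, here is a hypothetical sketch.
struct file_wa_state and write_order_hint() are invented names, loosely
modelled on the per-fd readahead state kept in struct file_ra_state.]

/*
 * Hypothetical sketch only - not part of this series or the kernel.
 * Track where the last buffered write ended, and let a long run of
 * sequential writes (even 1-byte ones) earn progressively larger
 * folio orders, analogous to how readahead windows grow.
 */
struct file_wa_state {			/* imagined sibling of file->f_ra */
	loff_t		prev_write_end;	/* end of the previous write */
	unsigned int	seq_count;	/* consecutive sequential writes */
};

static unsigned int write_order_hint(struct file_wa_state *wa,
		loff_t pos, size_t len)
{
	if (pos == wa->prev_write_end)
		wa->seq_count++;
	else
		wa->seq_count = 0;
	wa->prev_write_end = pos + len;

	/* Grow the suggested order as the sequential streak lengthens. */
	return min_t(unsigned int, wa->seq_count / 4, MAX_PAGECACHE_ORDER);
}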
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
(and I agree with keeping it simple and the folios aligned in the
file and not too optimistically allocated for now)
On Tue, Jun 13, 2023 at 03:00:14AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 11:30:13AM +1000, Dave Chinner wrote:
> > Indeed, if we do a 1MB write at offset 4kB, we'll get 4kB folios at
> > 4kB, 8kB and 12kB (because we can't do order-1 folios), then order-2
> > at 16kB, order-3 at 32kB, and so on until we hit offset 1MB, where we
> > will do an order-0 folio allocation again (because the remaining
> > length is 4kB). The next 1MB write will then follow the same pattern,
> > right?
>
> Yes.  Assuming we get another write ...
>
> > I think this ends up being sub-optimal and fairly non-obvious
> > behaviour from the iomap side of the fence, which is clearly asking
> > for high-order folios to be allocated. i.e. a small amount of
> > allocate-around to naturally align large folios when the page cache
> > is otherwise empty would make a big difference to the efficiency of
> > non-large-folio-aligned sequential writes...
>
> At this point we're arguing about what I/O pattern to optimise for.
> I'm going for a "do no harm" approach where we only allocate exactly
> as much memory as we did before.  You're advocating for a
> higher-risk/higher-reward approach.

Not really - I'm just trying to understand the behaviour the change
will result in, compared to what would be considered optimal, as it's
not clearly spelled out in either the code or the commit messages.

If I hadn't looked at the code closely and had just seen a trace with
this sort of behaviour (i.e. I understood large folios were in use,
but not exactly how they worked), I'd be very surprised to see a weird
repeated pattern of varying folio sizes. I'd probably think it was a
bug in the implementation....

> I'd prefer the low-risk approach for now; we can change it later!

That's fine by me - just document the limitations and expected
behaviour in the code rather than expect people to have to discover
this behaviour for themselves.

> I'd like to see some amount of per-fd write history (as we have per-fd
> readahead history) to decide whether to allocate large folios ahead of
> the current write position.  As with readahead, I'd like to see that
> even doing single-byte writes can result in the allocation of large
> folios, as long as the app has done enough of them.

*nod*

We already have some hints in the iomaps that can tell you this sort
of thing. e.g. if ->iomap_begin() returns a delalloc iomap that
extends beyond the current write, we're performing a sequence of
multiple sequential writes.....

Cheers,
Dave.
On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> On Tue, Jun 13, 2023 at 03:00:14AM +0100, Matthew Wilcox wrote:
.....
> > At this point we're arguing about what I/O pattern to optimise for.
> > I'm going for a "do no harm" approach where we only allocate exactly
> > as much memory as we did before.  You're advocating for a
> > higher-risk/higher-reward approach.
>
> Not really - I'm just trying to understand the behaviour the change
> will result in, compared to what would be considered optimal, as it's
> not clearly spelled out in either the code or the commit messages.

I suppose it depends what you think we're optimising for.  If it's
minimum memory consumption, then the current patchset is optimal ;-)
If it's minimum number of folios allocated for a particular set of
writes, then your proposal makes sense.  I do think we should end up
doing something along the lines of your sketch; it just doesn't need
to be now.

> > I'd like to see some amount of per-fd write history (as we have per-fd
> > readahead history) to decide whether to allocate large folios ahead of
> > the current write position.  As with readahead, I'd like to see that
> > even doing single-byte writes can result in the allocation of large
> > folios, as long as the app has done enough of them.
>
> *nod*
>
> We already have some hints in the iomaps that can tell you this sort
> of thing. e.g. if ->iomap_begin() returns a delalloc iomap that
> extends beyond the current write, we're performing a sequence of
> multiple sequential writes.....

Well, if this is something the FS is already tracking, then we can
either try to lift that functionality into the page cache, or just
take advantage of the FS knowledge.  In iomap_write_begin(), we have
access to the srcmap and the iomap, and we can pass in something other
than the length of the write as the hint to __iomap_get_folio() for
the size of the folio we would like.

I should probably clean up the kernel-doc for iomap_get_folio() ...

- * @len:	length of write
+ * @len:	Suggested size of folio to create.
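[As a sketch of that last idea - hypothetical code, not part of this
series: iomap_folio_size_hint() is an invented helper, though
iomap->offset, iomap->length and IOMAP_DELALLOC are the real fields and
extent type - the hint eventually encoded with fgf_set_order() could be
derived from the extent rather than from the write length:]

/*
 * Hypothetical: if ->iomap_begin() handed back a delalloc extent that
 * extends past the current write, the filesystem expects more
 * sequential writes, so suggest a folio sized to the extent (capped
 * at PMD size) instead of sized to this write.
 */
static size_t iomap_folio_size_hint(const struct iomap *iomap, loff_t pos,
		size_t len)
{
	u64 extent_left = iomap->offset + iomap->length - pos;

	if (iomap->type == IOMAP_DELALLOC && extent_left > len)
		return min_t(u64, extent_left, SZ_2M);
	return len;
}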
On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> If I hadn't looked at the code closely and had just seen a trace with
> this sort of behaviour (i.e. I understood large folios were in use,
> but not exactly how they worked), I'd be very surprised to see a
> weird repeated pattern of varying folio sizes. I'd probably think it
> was a bug in the implementation....
>
> > I'd prefer the low-risk approach for now; we can change it later!
>
> That's fine by me - just document the limitations and expected
> behaviour in the code rather than expect people to have to discover
> this behaviour for themselves.

How about this?

+++ b/include/linux/pagemap.h
@@ -548,6 +548,17 @@ typedef unsigned int __bitwise fgf_t;
 
 #define FGP_WRITEBEGIN	(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+/**
+ * fgf_set_order - Encode a length in the fgf_t flags.
+ * @size: The suggested size of the folio to create.
+ *
+ * The caller of __filemap_get_folio() can use this to suggest a preferred
+ * size for the folio that is created.  If there is already a folio at
+ * the index, it will be returned, no matter what its size.  If a folio
+ * is freshly created, it may be of a different size than requested
+ * due to alignment constraints, memory pressure, or the presence of
+ * other folios at nearby indices.
+ */
 static inline fgf_t fgf_set_order(size_t size)
 {
 	unsigned int shift = ilog2(size);
On Fri, Jun 16, 2023 at 06:45:43PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 13, 2023 at 05:54:35PM +1000, Dave Chinner wrote:
> > That's fine by me - just document the limitations and expected
> > behaviour in the code rather than expect people to have to discover
> > this behaviour for themselves.
>
> How about this?
>
> +/**
> + * fgf_set_order - Encode a length in the fgf_t flags.
> + * @size: The suggested size of the folio to create.
> + *
> + * The caller of __filemap_get_folio() can use this to suggest a preferred
> + * size for the folio that is created.  If there is already a folio at
> + * the index, it will be returned, no matter what its size.  If a folio
> + * is freshly created, it may be of a different size than requested
> + * due to alignment constraints, memory pressure, or the presence of
> + * other folios at nearby indices.
> + */
>  static inline fgf_t fgf_set_order(size_t size)
>  {
>  	unsigned int shift = ilog2(size);

Yup, I'm happy with that - "preferred size" is a good way of describing
the best-effort attempt it makes....

Cheers,
Dave.
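[For reference, the caller side of fgf_set_order() stays a one-liner.
This is roughly the shape the iomap side of this series takes - a
sketch, not the verbatim patch:]

struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
{
	/* Encode the write length as a preferred folio order. */
	fgf_t fgf = FGP_WRITEBEGIN | fgf_set_order(len);

	if (iter->flags & IOMAP_NOWAIT)
		fgf |= FGP_NOWAIT;

	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
			fgf, mapping_gfp_mask(iter->inode->i_mapping));
}

[Any folio already present at the index is returned as-is; the order
encoded in fgf only influences freshly created folios, per the
kernel-doc above.]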
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 993242f0c1e1..b2ed80f91e5b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -531,9 +544,19 @@ typedef unsigned int __bitwise fgf_t;
 #define FGP_NOWAIT		((__force fgf_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
 #define FGP_STABLE		((__force fgf_t)0x00000080)
+#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN	(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline fgf_t fgf_set_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping,
 		pgoff_t index, fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 42353b82ebf6..bd66398ae072 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGF_GET_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1946,26 +1948,40 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (!mapping_large_folio_support(mapping))
+			order = 0;
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
Allow callers of __filemap_get_folio() to specify a preferred folio
order in the FGP flags.  This is only honoured in the FGP_CREAT path;
if there is already a folio in the page cache that covers the index,
we will return it, no matter what its order is.  No create-around is
attempted; we will only create folios which start at the specified
index.  Unmodified callers will continue to allocate order 0 folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 23 ++++++++++++++++++++++
 mm/filemap.c            | 42 ++++++++++++++++++++++++++++-------------
 mm/readahead.c          | 13 -------------
 3 files changed, 52 insertions(+), 26 deletions(-)