Message ID | 164447147257.23354.2801426518649016278.stgit@noble.brown (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Remove remaining parts of congestion tracking code. | expand |
Hi Neil! On Thu 10-02-22 16:37:52, NeilBrown wrote: > Add some "big-picture" documentation for read-ahead and polish the code > to make it fit this documentation. > > The meaning of ->async_size is clarified to match its name. > i.e. Any request to ->readahead() has a sync part and an async part. > The caller will wait for the sync pages to complete, but will not wait > for the async pages. The first async page is still marked PG_readahead So I don't think this is how the code was meant. My understanding of readahead comes from a comment: /* * On-demand readahead design. * .... in mm/readahead.c. The ra->size is how many pages should be read. ra->async_size is the "lookahead size" meaning that we should place a marker (PageReadahead) at "ra->size - ra->async_size" to trigger next readahead. > > - in try_context_readahead(), the async_sync is set correctly rather > than being set to 1. Prior to Commit 2cad40180197 ("readahead: make > context readahead more conservative") it was set to ra->size which > is not correct (that implies no sync component). As this was too > high and caused problems it was reduced to 1, again incorrect but less > problematic. The setting provided with this patch does not restore > those problems, and is now not arbitrary. I agree the 1 there looks strange as it effectively discards all the logic handling the lookahead size. I agree with the tweak there but I would do this behavioral change as a separate commit since it can have performance implications. > - The calculation of ->async_size in the initial_readahead section of > ondemand_readahead() now makes sense - it is zero if the chosen > size does not exceed the requested size. This means that we will not > set the PG_readahead flag in this case, but as the requested size > has not been satisfied we can expect a subsequent read ahead request > any way. So I agree that setting of ->async_size to ->size in initial_readahead section does not make great sence but if you look a bit below into readit section, you will notice the ->async_size is overwritten there to something meaninful. So I think the code actually does something sensible, maybe it could be written in a more readable way. > Note that the current function names page_cache_sync_ra() and > page_cache_async_ra() are misleading. All ra request are partly sync > and partly async, so either part can be empty. The meaning of these names IMO is: page_cache_sync_ra() - tell readahead that we currently need a page ractl->_index and would prefer req_count pages fetched ahead. page_cache_async_ra() - called when we hit the lookahead marker to give opportunity to readahead code to prefetch more pages. > A page_cache_sync_ra() request will usually set ->async_size non-zero, > implying it is not all synchronous. > When a non-zero req_count is passed to page_cache_async_ra(), the > implication is that some prefix of the request is synchronous, though > the calculation made there is incorrect - I haven't tried to fix it. > > Signed-off-by: NeilBrown <neilb@suse.de> Honza > --- > Documentation/core-api/mm-api.rst | 19 ++++++- > Documentation/filesystems/vfs.rst | 16 ++++-- > include/linux/fs.h | 9 +++ > mm/readahead.c | 103 ++++++++++++++++++++++++++++++++++++- > 4 files changed, 135 insertions(+), 12 deletions(-) > > diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst > index 395835f9289f..f5b2f92822c8 100644 > --- a/Documentation/core-api/mm-api.rst > +++ b/Documentation/core-api/mm-api.rst > @@ -58,15 +58,30 @@ Virtually Contiguous Mappings > File Mapping and Page Cache > =========================== > > -.. kernel-doc:: mm/readahead.c > - :export: > +Filemap > +------- > > .. kernel-doc:: mm/filemap.c > :export: > > +Readahead > +--------- > + > +.. kernel-doc:: mm/readahead.c > + :doc: Readahead Overview > + > +.. kernel-doc:: mm/readahead.c > + :export: > + > +Writeback > +--------- > + > .. kernel-doc:: mm/page-writeback.c > :export: > > +Truncate > +-------- > + > .. kernel-doc:: mm/truncate.c > :export: > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index bf5c48066fac..b4a0baa46dcc 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -806,12 +806,16 @@ cache in your filesystem. The following members are defined: > object. The pages are consecutive in the page cache and are > locked. The implementation should decrement the page refcount > after starting I/O on each page. Usually the page will be > - unlocked by the I/O completion handler. If the filesystem decides > - to stop attempting I/O before reaching the end of the readahead > - window, it can simply return. The caller will decrement the page > - refcount and unlock the remaining pages for you. Set PageUptodate > - if the I/O completes successfully. Setting PageError on any page > - will be ignored; simply unlock the page if an I/O error occurs. > + unlocked by the I/O completion handler. The set of pages are > + divided into some sync pages followed by some async pages, > + rac->ra->async_size gives the number of async pages. The > + filesystem should attempt to read all sync pages but may decide > + to stop once it reaches the async pages. If it does decide to > + stop attempting I/O, it can simply return. The caller will > + remove the remaining pages from the address space, unlock them > + and decrement the page refcount. Set PageUptodate if the I/O > + completes successfully. Setting PageError on any page will be > + ignored; simply unlock the page if an I/O error occurs. > > ``readpages`` > called by the VM to read pages associated with the address_space > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e2d892b201b0..8b5c486bd4a2 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -930,10 +930,15 @@ struct fown_struct { > * struct file_ra_state - Track a file's readahead state. > * @start: Where the most recent readahead started. > * @size: Number of pages read in the most recent readahead. > - * @async_size: Start next readahead when this many pages are left. > - * @ra_pages: Maximum size of a readahead request. > + * @async_size: Numer of pages that were/are not needed immediately > + * and so were/are genuinely "ahead". Start next readahead when > + * the first of these pages is accessed. > + * @ra_pages: Maximum size of a readahead request, copied from the bdi. > * @mmap_miss: How many mmap accesses missed in the page cache. > * @prev_pos: The last byte in the most recent read request. > + * > + * When this structure is passed to ->readahead(), the "most recent" > + * readahead means the current readahead. > */ > struct file_ra_state { > pgoff_t start; > diff --git a/mm/readahead.c b/mm/readahead.c > index cf0dcf89eb69..c44b2957f59f 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -8,6 +8,105 @@ > * Initial version. > */ > > +/** > + * DOC: Readahead Overview > + * > + * Readahead is used to read content into the page cache before it is > + * explicitly requested by the application. Readahead only ever > + * attempts to read pages that are not yet in the page cache. If a > + * page is present but not up-to-date, readahead will not try to read > + * it. In that case a simple ->readpage() will be requested. > + * > + * Readahead is triggered when an application read request (whether a > + * systemcall or a page fault) finds that the requested page is not in > + * the page cache, or that it is in the page cache and has the > + * %PG_readahead flag set. This flag indicates that the page was loaded > + * as part of a previous read-ahead request and now that it has been > + * accessed, it is time for the next read-ahead. > + * > + * Each readahead request is partly synchronous read, and partly async > + * read-ahead. This is reflected in the struct file_ra_state which > + * contains ->size being to total number of pages, and ->async_size > + * which is the number of pages in the async section. The first page in > + * this async section will have %PG_readahead set as a trigger for a > + * subsequent read ahead. Once a series of sequential reads has been > + * established, there should be no need for a synchronous component and > + * all read ahead request will be fully asynchronous. > + * > + * When either of the triggers causes a readahead, three numbers need to > + * be determined: the start of the region, the size of the region, and > + * the size of the async tail. > + * > + * The start of the region is simply the first page address at or after > + * the accessed address, which is not currently populated in the page > + * cache. This is found with a simple search in the page cache. > + * > + * The size of the async tail is determined by subtracting the size that > + * was explicitly requested from the determined request size, unless > + * this would be less than zero - then zero is used. NOTE THIS > + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED > + * PAGE. > + * > + * The size of the region is normally determined from the size of the > + * previous readahead which loaded the preceding pages. This may be > + * discovered from the struct file_ra_state for simple sequential reads, > + * or from examining the state of the page cache when multiple > + * sequential reads are interleaved. Specifically: where the readahead > + * was triggered by the %PG_readahead flag, the size of the previous > + * readahead is assumed to be the number of pages from the triggering > + * page to the start of the new readahead. In these cases, the size of > + * the previous readahead is scaled, often doubled, for the new > + * readahead, though see get_next_ra_size() for details. > + * > + * If the size of the previous read cannot be determined, the number of > + * preceding pages in the page cache is used to estimate the size of > + * a previous read. This estimate could easily be misled by random > + * reads being coincidentally adjacent, so it is ignored unless it is > + * larger than the current request, and it is not scaled up, unless it > + * is at the start of file. > + * > + * In general read ahead is accelerated at the start of the file, as > + * reads from there are often sequential. There are other minor > + * adjustments to the read ahead size in various special cases and these > + * are best discovered by reading the code. > + * > + * The above calculation determines the readahead, to which any requested > + * read size may be added. > + * > + * Readahead requests are sent to the filesystem using the ->readahead() > + * address space operation, for which mpage_readahead() is a canonical > + * implementation. ->readahead() should normally initiate reads on all > + * pages, but may fail to read any or all pages without causing an IO > + * error. The page cache reading code will issue a ->readpage() request > + * for any page which ->readahead() does not provided, and only an error > + * from this will be final. > + * > + * ->readahead() will generally call readahead_page() repeatedly to get > + * each page from those prepared for read ahead. It may fail to read a > + * page by: > + * > + * * not calling readahead_page() sufficiently many times, effectively > + * ignoring some pages, as might be appropriate if the path to > + * storage is congested. > + * > + * * failing to actually submit a read request for a given page, > + * possibly due to insufficient resources, or > + * > + * * getting an error during subsequent processing of a request. > + * > + * In the last two cases, the page should be unlocked to indicate that > + * the read attempt has failed. In the first case the page will be > + * unlocked by the caller. > + * > + * Those pages not in the final ``async_size`` of the request should be > + * considered to be important and ->readahead() should not fail them due > + * to congestion or temporary resource unavailability, but should wait > + * for necessary resources (e.g. memory or indexing information) to > + * become available. Pages in the final ``async_size`` may be > + * considered less urgent and failure to read them is more acceptable. > + * They will eventually be read individually using ->readpage(). > + */ > + > #include <linux/kernel.h> > #include <linux/dax.h> > #include <linux/gfp.h> > @@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping, > > ra->start = index; > ra->size = min(size + req_size, max); > - ra->async_size = 1; > + ra->async_size = ra->size - req_size; > > return 1; > } > @@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl, > initial_readahead: > ra->start = index; > ra->size = get_init_ra_size(req_size, max_pages); > - ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size; > + ra->async_size = ra->size > req_size ? ra->size - req_size : 0; > > readit: > /* > >
On Thu, 10 Feb 2022, Jan Kara wrote: > Hi Neil! > > On Thu 10-02-22 16:37:52, NeilBrown wrote: > > Add some "big-picture" documentation for read-ahead and polish the code > > to make it fit this documentation. > > > > The meaning of ->async_size is clarified to match its name. > > i.e. Any request to ->readahead() has a sync part and an async part. > > The caller will wait for the sync pages to complete, but will not wait > > for the async pages. The first async page is still marked PG_readahead Thanks for the review! > > So I don't think this is how the code was meant. My understanding of > readahead comes from a comment: I can't be sure what was "meant" but what I described is very close to what the code actually does. > > /* > * On-demand readahead design. > * > .... > > in mm/readahead.c. The ra->size is how many pages should be read. > ra->async_size is the "lookahead size" meaning that we should place a > marker (PageReadahead) at "ra->size - ra->async_size" to trigger next > readahead. This description of PageReadahead and ->async_size focuses on *what* happens, not *why*. Importantly it doesn't help answer the question "What should I set ->async_size to?" The implication in the code is that when we sequentially access a page that was read-ahead (read before it was explicitly requested), we trigger more read ahead. So ->async_size should refer to that part of the readahead request which was not explicitly requested. With that understanding, it becomes possible to audit all the places that ->async_size are set and to see if they make sense. > > > > > - in try_context_readahead(), the async_sync is set correctly rather > > than being set to 1. Prior to Commit 2cad40180197 ("readahead: make > > context readahead more conservative") it was set to ra->size which > > is not correct (that implies no sync component). As this was too > > high and caused problems it was reduced to 1, again incorrect but less > > problematic. The setting provided with this patch does not restore > > those problems, and is now not arbitrary. > > I agree the 1 there looks strange as it effectively discards all the logic > handling the lookahead size. I agree with the tweak there but I would do > this behavioral change as a separate commit since it can have performance > implications. > > > - The calculation of ->async_size in the initial_readahead section of > > ondemand_readahead() now makes sense - it is zero if the chosen > > size does not exceed the requested size. This means that we will not > > set the PG_readahead flag in this case, but as the requested size > > has not been satisfied we can expect a subsequent read ahead request > > any way. > > So I agree that setting of ->async_size to ->size in initial_readahead > section does not make great sence but if you look a bit below into readit > section, you will notice the ->async_size is overwritten there to something > meaninful. So I think the code actually does something sensible, maybe it > could be written in a more readable way. I'm certainly focusing on making the code look sensible and be consistent with the documentation, rather than fixing actual faults in behaviour. Code that makes sense is easier to maintain. I came very close to removing that code after readit: but I agree it needs a separate patch and needs more thought. It looks like a bandaid that addressed some specific problem which was probably caused by one of the size fields being set "wrongly" earlier. > > > Note that the current function names page_cache_sync_ra() and > > page_cache_async_ra() are misleading. All ra request are partly sync > > and partly async, so either part can be empty. > > The meaning of these names IMO is: > page_cache_sync_ra() - tell readahead that we currently need a page > ractl->_index and would prefer req_count pages fetched ahead. I don't think that is what req_count means. req_count is the number of pages that are needed *now* to satisfy the current read request. page_cache_sync_ra() has the job of determining how many more pages (if any) to read-ahead to satisfy future requests. Sometimes it reads another req_count - sometimes not. > > page_cache_async_ra() - called when we hit the lookahead marker to give > opportunity to readahead code to prefetch more pages. Yes, but page_cache_async_ra() is given a req_count which, as above, is the number of pages needed to satisfy *this* request. That wouldn't make sense if it was a pure future-readahead request. In practice, the word "sync" is used to mean "page was missing" and "async" here means "PG_readahead was found". But that isn't what those words usually mean. They both call ondemand_readahead() passing False or True respectively to hit_readahead_marker - which makes that meaning clear in the code... but it still isn't clear in the name. > > > A page_cache_sync_ra() request will usually set ->async_size non-zero, > > implying it is not all synchronous. > > When a non-zero req_count is passed to page_cache_async_ra(), the > > implication is that some prefix of the request is synchronous, though > > the calculation made there is incorrect - I haven't tried to fix it. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > Honza Thanks, NeilBrown
On Fri 11-02-22 10:35:17, NeilBrown wrote: > On Thu, 10 Feb 2022, Jan Kara wrote: > > Hi Neil! > > > > On Thu 10-02-22 16:37:52, NeilBrown wrote: > > > Add some "big-picture" documentation for read-ahead and polish the code > > > to make it fit this documentation. > > > > > > The meaning of ->async_size is clarified to match its name. > > > i.e. Any request to ->readahead() has a sync part and an async part. > > > The caller will wait for the sync pages to complete, but will not wait > > > for the async pages. The first async page is still marked PG_readahead > > Thanks for the review! > > > > > So I don't think this is how the code was meant. My understanding of > > readahead comes from a comment: > > I can't be sure what was "meant" but what I described is very close to > what the code actually does. > > > > > /* > > * On-demand readahead design. > > * > > .... > > > > in mm/readahead.c. The ra->size is how many pages should be read. > > ra->async_size is the "lookahead size" meaning that we should place a > > marker (PageReadahead) at "ra->size - ra->async_size" to trigger next > > readahead. > > This description of PageReadahead and ->async_size focuses on *what* > happens, not *why*. Importantly it doesn't help answer the question "What > should I set ->async_size to?" Sorry for delayed reply. I was on vacation and then catching up with stuff. I know you have submitted another version of the patches but not much has changed in this regard so I figured it might be better to continue discussion here. So my answer to "What should I set ->async_size to?" is: Ideally so that it takes application to process data between ->async_size and ->size as long as it takes the disk to load the next chunk of data into the page cache. This is explained in the comment: * To overlap application thinking time and disk I/O time, we do * `readahead pipelining': Do not wait until the application consumed all * readahead pages and stalled on the missing page at readahead_index; * Instead, submit an asynchronous readahead I/O as soon as there are * only async_size pages left in the readahead window. Normally async_size * will be equal to size, for maximum pipelining. Now because things such as think time or time to read pages is difficult to estimate, we just end up triggering next readahead as soon as we are at least a bit confident application is going to use the pages. But I don't think there was ever any intent to have "sync" and "async" parts of the request or that ->size - ->async_size is what must be read. Any function in the readahead code is free to return without doing anything regardless of passed parameters and the caller needs to deal with that, ->size is just a hint to the filesystem how much we expect to be useful to read... > The implication in the code is that when we sequentially access a page > that was read-ahead (read before it was explicitly requested), we trigger > more read ahead. So ->async_size should refer to that part of the > readahead request which was not explicitly requested. With that > understanding, it becomes possible to audit all the places that > ->async_size are set and to see if they make sense. I don't think this "implication" was ever intended :) But it may have happened that some guesses how big ->async_size should be have ended like that because of the impracticality of the original definition of how large ->async_size should be. In fact I kind of like what you suggest ->async_size should be - it is possible to actually implement that unlike the original definition - but it is more of "let's redesign how readahead size is chosen" than "let's document how readahead size is chosen" :). > > > Note that the current function names page_cache_sync_ra() and > > > page_cache_async_ra() are misleading. All ra request are partly sync > > > and partly async, so either part can be empty. > > > > The meaning of these names IMO is: > > page_cache_sync_ra() - tell readahead that we currently need a page > > ractl->_index and would prefer req_count pages fetched ahead. > > I don't think that is what req_count means. req_count is the number of > pages that are needed *now* to satisfy the current read request. > page_cache_sync_ra() has the job of determining how many more pages (if > any) to read-ahead to satisfy future requests. Sometimes it reads > another req_count - sometimes not. So this is certainly true for page_cache_sync_readahead() call in filemap_get_pages() but the call of page_cache_sync_ra() from do_sync_mmap_readahead() does not quite follow what you say - we need only one page there but request more. > > page_cache_async_ra() - called when we hit the lookahead marker to give > > opportunity to readahead code to prefetch more pages. > > Yes, but page_cache_async_ra() is given a req_count which, as above, is > the number of pages needed to satisfy *this* request. That wouldn't > make sense if it was a pure future-readahead request. Again, usage of req_count in page_cache_async_ra() is not always the number of pages immediately needed. > In practice, the word "sync" is used to mean "page was missing" and > "async" here means "PG_readahead was found". But that isn't what those > words usually mean. > > They both call ondemand_readahead() passing False or True respectively > to hit_readahead_marker - which makes that meaning clear in the code... > but it still isn't clear in the name. I agree the naming is somewhat confusing :) Honza
On Fri, 25 Feb 2022, Jan Kara wrote: > On Fri 11-02-22 10:35:17, NeilBrown wrote: > > On Thu, 10 Feb 2022, Jan Kara wrote: > > > Hi Neil! > > > > > > On Thu 10-02-22 16:37:52, NeilBrown wrote: > > > > Add some "big-picture" documentation for read-ahead and polish the code > > > > to make it fit this documentation. > > > > > > > > The meaning of ->async_size is clarified to match its name. > > > > i.e. Any request to ->readahead() has a sync part and an async part. > > > > The caller will wait for the sync pages to complete, but will not wait > > > > for the async pages. The first async page is still marked PG_readahead > > > > Thanks for the review! > > > > > > > > So I don't think this is how the code was meant. My understanding of > > > readahead comes from a comment: > > > > I can't be sure what was "meant" but what I described is very close to > > what the code actually does. > > > > > > > > /* > > > * On-demand readahead design. > > > * > > > .... > > > > > > in mm/readahead.c. The ra->size is how many pages should be read. > > > ra->async_size is the "lookahead size" meaning that we should place a > > > marker (PageReadahead) at "ra->size - ra->async_size" to trigger next > > > readahead. > > > > This description of PageReadahead and ->async_size focuses on *what* > > happens, not *why*. Importantly it doesn't help answer the question "What > > should I set ->async_size to?" > > Sorry for delayed reply. I was on vacation and then catching up with stuff. > I know you have submitted another version of the patches but not much has > changed in this regard so I figured it might be better to continue > discussion here. > > So my answer to "What should I set ->async_size to?" is: Ideally so that it > takes application to process data between ->async_size and ->size as long > as it takes the disk to load the next chunk of data into the page cache. > This is explained in the comment: > > * To overlap application thinking time and disk I/O time, we do > * `readahead pipelining': Do not wait until the application consumed all > * readahead pages and stalled on the missing page at readahead_index; > * Instead, submit an asynchronous readahead I/O as soon as there are > * only async_size pages left in the readahead window. Normally async_size > * will be equal to size, for maximum pipelining. > > Now because things such as think time or time to read pages is difficult to > estimate, we just end up triggering next readahead as soon as we are at least > a bit confident application is going to use the pages. But I don't think > there was ever any intent to have "sync" and "async" parts of the request > or that ->size - ->async_size is what must be read. Any function in the > readahead code is free to return without doing anything regardless of > passed parameters and the caller needs to deal with that, ->size is just a > hint to the filesystem how much we expect to be useful to read... I don't think it is only "difficult" to estimate. It is impossible and there is no evidence in the code of any attempt to estimate. So I think you are reading more into that comment than is there. Certainly we want to overlap "think" and "read". The issue I think is not timing but size - how much to read. At one extreme you can maximum overlap by requesting the whole file when the first byte is read. There are obvious problems with that approach. As you say, we need to be "at least a bit confident that the application is going to use the pages". The way I read the comment and the code, that is the primary issue: How can we be confident that the data will be used? Obviously we cannot, but we can manage the risk. The read ahead calculations are all about risk management. We limit the potential waste by only pre-reading a fixed multiple of the number of pages that have been explicitly requested - that multiple is 1. So if 32 pages have been read, it is now safe to read-ahead an extra 32 (or maybe it's 1.5... anyway it is a simple function of how much has been requested). The "when to read it" question is based on finding a balance between achieving maximum confidence that the pattern continues to be a sequential read (so don't read too early) and loading the pages so they'll be ready when needed (so don't read too late). There is no "obviously right" answer - but we can clearly see the solution that the code chooses in *almost* all circumstances. It starts a new opportunistic read when the first page from the last opportunistic read is first accessed. I think that pattern is clear enough that it is worth documenting, and I can see no justification for diverging from that pattern. You express a doubt that 'there was ever any intent to have "sync" and "async" parts of the request' and you may well be correct. But there is nonetheless an observed reality that some of the pages passed to ->readhead() (or ->readpages()) are wanted immediately, and some are not. And those that are not are the number placed in ->async_size. When writing documentation the intent of the author is of some interest, but the behaviour of the code is paramount. > > > The implication in the code is that when we sequentially access a page > > that was read-ahead (read before it was explicitly requested), we trigger > > more read ahead. So ->async_size should refer to that part of the > > readahead request which was not explicitly requested. With that > > understanding, it becomes possible to audit all the places that > > ->async_size are set and to see if they make sense. > > I don't think this "implication" was ever intended :) But it may have > happened that some guesses how big ->async_size should be have ended like > that because of the impracticality of the original definition of how large > ->async_size should be. > > In fact I kind of like what you suggest ->async_size should be - it is > possible to actually implement that unlike the original definition - but it > is more of "let's redesign how readahead size is chosen" than "let's > document how readahead size is chosen" :). I don't think I'm changing the design - I'm describing the design in a possibly novel way. I accept that "number of pages that aren't needed immediately" and "where to set the PG_readahead" flag are not intrinsically related and that maybe I have been unduly swayed by the field name "async_size". I certainly want the filesystem to know which pages are explisitly request, and which are being read opportunistically, so it can handle them differently. The filesystem already sees "async_size" so it is easy to use that. Maybe I should have added a new field which is the actual async size, can kept it separate from the currently misnamed "async_size". But it didn't seem like a useful exercise. > > > > > Note that the current function names page_cache_sync_ra() and > > > > page_cache_async_ra() are misleading. All ra request are partly sync > > > > and partly async, so either part can be empty. > > > > > > The meaning of these names IMO is: > > > page_cache_sync_ra() - tell readahead that we currently need a page > > > ractl->_index and would prefer req_count pages fetched ahead. > > > > I don't think that is what req_count means. req_count is the number of > > pages that are needed *now* to satisfy the current read request. > > page_cache_sync_ra() has the job of determining how many more pages (if > > any) to read-ahead to satisfy future requests. Sometimes it reads > > another req_count - sometimes not. > > So this is certainly true for page_cache_sync_readahead() call in > filemap_get_pages() but the call of page_cache_sync_ra() from > do_sync_mmap_readahead() does not quite follow what you say - we need only > one page there but request more. Yes... and no. That code in do_sync_mmap_readahead() runs if VM_SEQ_READ is set. That means MADV_SEQUENTIAL has been passed with madvise(). So the application has explicitly said "I will access this mapping sequentially" which effectively mean "if I access any bytes, you can assume I'll want to access a great many subsequent bytes". So the "request" here can be reasonably seen as asking for "as much as possible". So while this is read-ahead, it is quite different from the heuristic read-ahead that we've been discussing so far. This isn't "might be needed, so don't bother if it seems too hard", this is "will be needed, so treat it like any other read request". > > > > page_cache_async_ra() - called when we hit the lookahead marker to give > > > opportunity to readahead code to prefetch more pages. > > > > Yes, but page_cache_async_ra() is given a req_count which, as above, is > > the number of pages needed to satisfy *this* request. That wouldn't > > make sense if it was a pure future-readahead request. > > Again, usage of req_count in page_cache_async_ra() is not always the number > of pages immediately needed. I think it is ... or should be. The use in f2fs/dir.c doesn't seem obvious, but I don't know the context. In every other case, the number comes either from an explicit request, or from a confirmed desire (promise?) to read the file sequentially. > > > In practice, the word "sync" is used to mean "page was missing" and > > "async" here means "PG_readahead was found". But that isn't what those > > words usually mean. > > > > They both call ondemand_readahead() passing False or True respectively > > to hit_readahead_marker - which makes that meaning clear in the code... > > but it still isn't clear in the name. > > I agree the naming is somewhat confusing :) Thanks :-) NeilBrown
On Mon, 28 Feb 2022 15:28:39 +1100 "NeilBrown" <neilb@suse.de> wrote: > When writing documentation the intent of the author is of some interest, > but the behaviour of the code is paramount. uh, er, ah, no. The code describes the behaviour of the code. The comments are there to describe things other than the code's behaviour. Things such as the author's intent. Any deviation between the author's intent and the code's behaviour is called a "bug", so it's pretty important to understand authorial intent, no?
On Mon, 28 Feb 2022, Andrew Morton wrote: > On Mon, 28 Feb 2022 15:28:39 +1100 "NeilBrown" <neilb@suse.de> wrote: > > > When writing documentation the intent of the author is of some interest, > > but the behaviour of the code is paramount. > > uh, er, ah, no. The code describes the behaviour of the code. The > comments are there to describe things other than the code's behaviour. > Things such as the author's intent. > > Any deviation between the author's intent and the code's behaviour is > called a "bug", so it's pretty important to understand authorial > intent, no? When the author is writing the documentation - then yes - definitely. When the "author" is several different people over a period of years, then it is not even certain that there is a single unified "intent". The author's intent is less interesting not so much because it is less relevant, but because it is less available. So when writing third-party post-hoc documentation, the focus has to be on the code, though with reference to the intent to whatever extent it is available. Bugs then show up where the actual behaviour turns out to be impossible to document coherently. Thanks, NeilBrown
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index 395835f9289f..f5b2f92822c8 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -58,15 +58,30 @@ Virtually Contiguous Mappings File Mapping and Page Cache =========================== -.. kernel-doc:: mm/readahead.c - :export: +Filemap +------- .. kernel-doc:: mm/filemap.c :export: +Readahead +--------- + +.. kernel-doc:: mm/readahead.c + :doc: Readahead Overview + +.. kernel-doc:: mm/readahead.c + :export: + +Writeback +--------- + .. kernel-doc:: mm/page-writeback.c :export: +Truncate +-------- + .. kernel-doc:: mm/truncate.c :export: diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index bf5c48066fac..b4a0baa46dcc 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -806,12 +806,16 @@ cache in your filesystem. The following members are defined: object. The pages are consecutive in the page cache and are locked. The implementation should decrement the page refcount after starting I/O on each page. Usually the page will be - unlocked by the I/O completion handler. If the filesystem decides - to stop attempting I/O before reaching the end of the readahead - window, it can simply return. The caller will decrement the page - refcount and unlock the remaining pages for you. Set PageUptodate - if the I/O completes successfully. Setting PageError on any page - will be ignored; simply unlock the page if an I/O error occurs. + unlocked by the I/O completion handler. The set of pages are + divided into some sync pages followed by some async pages, + rac->ra->async_size gives the number of async pages. The + filesystem should attempt to read all sync pages but may decide + to stop once it reaches the async pages. If it does decide to + stop attempting I/O, it can simply return. The caller will + remove the remaining pages from the address space, unlock them + and decrement the page refcount. Set PageUptodate if the I/O + completes successfully. Setting PageError on any page will be + ignored; simply unlock the page if an I/O error occurs. ``readpages`` called by the VM to read pages associated with the address_space diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..8b5c486bd4a2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -930,10 +930,15 @@ struct fown_struct { * struct file_ra_state - Track a file's readahead state. * @start: Where the most recent readahead started. * @size: Number of pages read in the most recent readahead. - * @async_size: Start next readahead when this many pages are left. - * @ra_pages: Maximum size of a readahead request. + * @async_size: Numer of pages that were/are not needed immediately + * and so were/are genuinely "ahead". Start next readahead when + * the first of these pages is accessed. + * @ra_pages: Maximum size of a readahead request, copied from the bdi. * @mmap_miss: How many mmap accesses missed in the page cache. * @prev_pos: The last byte in the most recent read request. + * + * When this structure is passed to ->readahead(), the "most recent" + * readahead means the current readahead. */ struct file_ra_state { pgoff_t start; diff --git a/mm/readahead.c b/mm/readahead.c index cf0dcf89eb69..c44b2957f59f 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -8,6 +8,105 @@ * Initial version. */ +/** + * DOC: Readahead Overview + * + * Readahead is used to read content into the page cache before it is + * explicitly requested by the application. Readahead only ever + * attempts to read pages that are not yet in the page cache. If a + * page is present but not up-to-date, readahead will not try to read + * it. In that case a simple ->readpage() will be requested. + * + * Readahead is triggered when an application read request (whether a + * systemcall or a page fault) finds that the requested page is not in + * the page cache, or that it is in the page cache and has the + * %PG_readahead flag set. This flag indicates that the page was loaded + * as part of a previous read-ahead request and now that it has been + * accessed, it is time for the next read-ahead. + * + * Each readahead request is partly synchronous read, and partly async + * read-ahead. This is reflected in the struct file_ra_state which + * contains ->size being to total number of pages, and ->async_size + * which is the number of pages in the async section. The first page in + * this async section will have %PG_readahead set as a trigger for a + * subsequent read ahead. Once a series of sequential reads has been + * established, there should be no need for a synchronous component and + * all read ahead request will be fully asynchronous. + * + * When either of the triggers causes a readahead, three numbers need to + * be determined: the start of the region, the size of the region, and + * the size of the async tail. + * + * The start of the region is simply the first page address at or after + * the accessed address, which is not currently populated in the page + * cache. This is found with a simple search in the page cache. + * + * The size of the async tail is determined by subtracting the size that + * was explicitly requested from the determined request size, unless + * this would be less than zero - then zero is used. NOTE THIS + * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED + * PAGE. + * + * The size of the region is normally determined from the size of the + * previous readahead which loaded the preceding pages. This may be + * discovered from the struct file_ra_state for simple sequential reads, + * or from examining the state of the page cache when multiple + * sequential reads are interleaved. Specifically: where the readahead + * was triggered by the %PG_readahead flag, the size of the previous + * readahead is assumed to be the number of pages from the triggering + * page to the start of the new readahead. In these cases, the size of + * the previous readahead is scaled, often doubled, for the new + * readahead, though see get_next_ra_size() for details. + * + * If the size of the previous read cannot be determined, the number of + * preceding pages in the page cache is used to estimate the size of + * a previous read. This estimate could easily be misled by random + * reads being coincidentally adjacent, so it is ignored unless it is + * larger than the current request, and it is not scaled up, unless it + * is at the start of file. + * + * In general read ahead is accelerated at the start of the file, as + * reads from there are often sequential. There are other minor + * adjustments to the read ahead size in various special cases and these + * are best discovered by reading the code. + * + * The above calculation determines the readahead, to which any requested + * read size may be added. + * + * Readahead requests are sent to the filesystem using the ->readahead() + * address space operation, for which mpage_readahead() is a canonical + * implementation. ->readahead() should normally initiate reads on all + * pages, but may fail to read any or all pages without causing an IO + * error. The page cache reading code will issue a ->readpage() request + * for any page which ->readahead() does not provided, and only an error + * from this will be final. + * + * ->readahead() will generally call readahead_page() repeatedly to get + * each page from those prepared for read ahead. It may fail to read a + * page by: + * + * * not calling readahead_page() sufficiently many times, effectively + * ignoring some pages, as might be appropriate if the path to + * storage is congested. + * + * * failing to actually submit a read request for a given page, + * possibly due to insufficient resources, or + * + * * getting an error during subsequent processing of a request. + * + * In the last two cases, the page should be unlocked to indicate that + * the read attempt has failed. In the first case the page will be + * unlocked by the caller. + * + * Those pages not in the final ``async_size`` of the request should be + * considered to be important and ->readahead() should not fail them due + * to congestion or temporary resource unavailability, but should wait + * for necessary resources (e.g. memory or indexing information) to + * become available. Pages in the final ``async_size`` may be + * considered less urgent and failure to read them is more acceptable. + * They will eventually be read individually using ->readpage(). + */ + #include <linux/kernel.h> #include <linux/dax.h> #include <linux/gfp.h> @@ -426,7 +525,7 @@ static int try_context_readahead(struct address_space *mapping, ra->start = index; ra->size = min(size + req_size, max); - ra->async_size = 1; + ra->async_size = ra->size - req_size; return 1; } @@ -527,7 +626,7 @@ static void ondemand_readahead(struct readahead_control *ractl, initial_readahead: ra->start = index; ra->size = get_init_ra_size(req_size, max_pages); - ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size; + ra->async_size = ra->size > req_size ? ra->size - req_size : 0; readit: /*
Add some "big-picture" documentation for read-ahead and polish the code to make it fit this documentation. The meaning of ->async_size is clarified to match its name. i.e. Any request to ->readahead() has a sync part and an async part. The caller will wait for the sync pages to complete, but will not wait for the async pages. The first async page is still marked PG_readahead - in try_context_readahead(), the async_sync is set correctly rather than being set to 1. Prior to Commit 2cad40180197 ("readahead: make context readahead more conservative") it was set to ra->size which is not correct (that implies no sync component). As this was too high and caused problems it was reduced to 1, again incorrect but less problematic. The setting provided with this patch does not restore those problems, and is now not arbitrary. - The calculation of ->async_size in the initial_readahead section of ondemand_readahead() now makes sense - it is zero if the chosen size does not exceed the requested size. This means that we will not set the PG_readahead flag in this case, but as the requested size has not been satisfied we can expect a subsequent read ahead request any way. Note that the current function names page_cache_sync_ra() and page_cache_async_ra() are misleading. All ra request are partly sync and partly async, so either part can be empty. A page_cache_sync_ra() request will usually set ->async_size non-zero, implying it is not all synchronous. When a non-zero req_count is passed to page_cache_async_ra(), the implication is that some prefix of the request is synchronous, though the calculation made there is incorrect - I haven't tried to fix it. Signed-off-by: NeilBrown <neilb@suse.de> --- Documentation/core-api/mm-api.rst | 19 ++++++- Documentation/filesystems/vfs.rst | 16 ++++-- include/linux/fs.h | 9 +++ mm/readahead.c | 103 ++++++++++++++++++++++++++++++++++++- 4 files changed, 135 insertions(+), 12 deletions(-)