[007/227] mm: document and polish read-ahead code

Message ID	20220322213852.702A4C340F2@smtp.kernel.org (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
Date: Tue, 22 Mar 2022 14:38:51 -0700
To: trond.myklebust@hammerspace.com,philipp.reisner@linbit.com,paolo.valente@linaro.org,miklos@szeredi.hu,lars.ellenberg@linbit.com,konishi.ryusuke@gmail.com,jlayton@kernel.org,jaegeuk@kernel.org,jack@suse.cz,idryomov@gmail.com,fengguang.wu@intel.com,djwong@kernel.org,chao@kernel.org,axboe@kernel.dk,Anna.Schumaker@Netapp.com,neilb@suse.de,akpm@linux-foundation.org,patches@lists.linux.dev,linux-mm@kvack.org,mm-commits@vger.kernel.org,torvalds@linux-foundation.org,akpm@linux-foundation.org
From: Andrew Morton <akpm@linux-foundation.org>
In-Reply-To: <20220322143803.04a5e59a07e48284f196a2f9@linux-foundation.org>
Subject: [patch 007/227] mm: document and polish read-ahead code
Message-Id: <20220322213852.702A4C340F2@smtp.kernel.org>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton March 22, 2022, 9:38 p.m. UTC

From: NeilBrown <neilb@suse.de>
Subject: mm: document and polish read-ahead code

Add some "big-picture" documentation for read-ahead and polish the code to
make it fit this documentation.

The meaning of ->async_size is clarified to match its name: any
request to ->readahead() has a sync part and an async part.  The caller
will wait for the sync pages to complete, but will not wait for the async
pages.  The first async page is still marked PG_readahead.
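
The split described above can be sketched in user-space C.  This is only
an illustration: struct ra_window and the ra_*() helpers are hypothetical
names that mirror the ->start, ->size and ->async_size fields, and the
sketch assumes async_size never exceeds size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space mirror of the fields discussed above. */
struct ra_window {
	unsigned long start;      /* index of the first page in the request */
	unsigned long size;       /* total number of pages in the request */
	unsigned long async_size; /* trailing pages the caller does not wait for */
};

/* Number of leading pages the caller will wait for. */
static unsigned long ra_sync_pages(const struct ra_window *w)
{
	assert(w->async_size <= w->size); /* assumption of this sketch */
	return w->size - w->async_size;
}

/* True if the request has an async part at all. */
static bool ra_has_async(const struct ra_window *w)
{
	return w->async_size > 0;
}

/*
 * Index of the first async page, i.e. the page that would carry the
 * PG_readahead mark.  Only meaningful when ra_has_async() is true.
 */
static unsigned long ra_marker_index(const struct ra_window *w)
{
	return w->start + ra_sync_pages(w);
}
```

For a window starting at page 100 with size 32 and async_size 24, the
caller waits for 8 pages and page 108 would carry the mark.  When
async_size equals size, the sync part is empty and the first page of the
window is the marked one.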

Note that the current function names page_cache_sync_ra() and
page_cache_async_ra() are misleading.  All readahead requests are partly
sync and partly async, though either part can be empty.  A
page_cache_sync_ra() request will usually set ->async_size non-zero,
implying it is not all synchronous.

When a non-zero req_count is passed to page_cache_async_ra(), the
implication is that some prefix of the request is synchronous, though the
calculation made there is incorrect - I haven't tried to fix it.
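
The overview comment added by this patch describes the async tail as the
region size minus the explicitly requested size, floored at zero, and
warns that this is wrong when the region does not start at the accessed
page.  The sketch below shows that rule next to one possible reading of
the caveat.  async_tail_skip_aware() and its skip parameter are
hypothetical and not part of the patch; they only illustrate why the two
numbers can differ.

```c
/*
 * Rule as described in the overview comment: region size minus the
 * explicitly requested size, or zero if that would be negative.
 */
static unsigned long async_tail(unsigned long region_size,
				unsigned long req_count)
{
	return region_size > req_count ? region_size - req_count : 0;
}

/*
 * Hypothetical variant (an assumption, not in the patch): skip is the
 * number of pages between the accessed page and the start of the
 * region, for example because they were already in the page cache.
 * Only req_count - skip requested pages then fall inside the region,
 * so only those are counted as the sync part.
 */
static unsigned long async_tail_skip_aware(unsigned long region_size,
					   unsigned long req_count,
					   unsigned long skip)
{
	unsigned long sync = req_count > skip ? req_count - skip : 0;

	if (sync > region_size)
		sync = region_size;
	return region_size - sync;
}
```

With a 32-page region and an 8-page request, both helpers give 24 when
the region starts at the accessed page.  If the region starts 4 pages
later, the plain rule still gives 24 while the skip-aware variant gives
28, which is the kind of mismatch the caveat appears to point at.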

Link: https://lkml.kernel.org/r/164549983734.9187.11586890887006601405.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/mm-api.rst |   19 ++++-
 Documentation/filesystems/vfs.rst |   16 ++--
 include/linux/fs.h                |    9 +-
 mm/readahead.c                    |   99 ++++++++++++++++++++++++++++
 4 files changed, 133 insertions(+), 10 deletions(-)

--- a/Documentation/core-api/mm-api.rst~mm-document-and-polish-read-ahead-code
+++ a/Documentation/core-api/mm-api.rst
@@ -58,15 +58,30 @@  Virtually Contiguous Mappings
 File Mapping and Page Cache
 ===========================
 
-.. kernel-doc:: mm/readahead.c
-   :export:
+Filemap
+-------
 
 .. kernel-doc:: mm/filemap.c
    :export:
 
+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+   :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+   :export:
+
+Writeback
+---------
+
 .. kernel-doc:: mm/page-writeback.c
    :export:
 
+Truncate
+--------
+
 .. kernel-doc:: mm/truncate.c
    :export:
 
--- a/Documentation/filesystems/vfs.rst~mm-document-and-polish-read-ahead-code
+++ a/Documentation/filesystems/vfs.rst
@@ -806,12 +806,16 @@  cache in your filesystem.  The following
 	object.  The pages are consecutive in the page cache and are
 	locked.  The implementation should decrement the page refcount
 	after starting I/O on each page.  Usually the page will be
-	unlocked by the I/O completion handler.  If the filesystem decides
-	to stop attempting I/O before reaching the end of the readahead
-	window, it can simply return.  The caller will decrement the page
-	refcount and unlock the remaining pages for you.  Set PageUptodate
-	if the I/O completes successfully.  Setting PageError on any page
-	will be ignored; simply unlock the page if an I/O error occurs.
+	unlocked by the I/O completion handler.  The set of pages is
+	divided into some sync pages followed by some async pages;
+	rac->ra->async_size gives the number of async pages.  The
+	filesystem should attempt to read all sync pages but may decide
+	to stop once it reaches the async pages.  If it does decide to
+	stop attempting I/O, it can simply return.  The caller will
+	remove the remaining pages from the address space, unlock them
+	and decrement the page refcount.  Set PageUptodate if the I/O
+	completes successfully.  Setting PageError on any page will be
+	ignored; simply unlock the page if an I/O error occurs.
 
 ``readpages``
 	called by the VM to read pages associated with the address_space
--- a/include/linux/fs.h~mm-document-and-polish-read-ahead-code
+++ a/include/linux/fs.h
@@ -930,10 +930,15 @@  struct fown_struct {
  * struct file_ra_state - Track a file's readahead state.
  * @start: Where the most recent readahead started.
  * @size: Number of pages read in the most recent readahead.
- * @async_size: Start next readahead when this many pages are left.
- * @ra_pages: Maximum size of a readahead request.
+ * @async_size: Number of pages that were/are not needed immediately
+ *      and so were/are genuinely "ahead".  Start next readahead when
+ *      the first of these pages is accessed.
+ * @ra_pages: Maximum size of a readahead request, copied from the bdi.
  * @mmap_miss: How many mmap accesses missed in the page cache.
  * @prev_pos: The last byte in the most recent read request.
+ *
+ * When this structure is passed to ->readahead(), the "most recent"
+ * readahead means the current readahead.
  */
 struct file_ra_state {
 	pgoff_t start;
--- a/mm/readahead.c~mm-document-and-polish-read-ahead-code
+++ a/mm/readahead.c
@@ -8,6 +8,105 @@ 
  *		Initial version.
  */
 
+/**
+ * DOC: Readahead Overview
+ *
+ * Readahead is used to read content into the page cache before it is
+ * explicitly requested by the application.  Readahead only ever
+ * attempts to read pages that are not yet in the page cache.  If a
+ * page is present but not up-to-date, readahead will not try to read
+ * it. In that case a simple ->readpage() will be requested.
+ *
+ * Readahead is triggered when an application read request (whether a
+ * system call or a page fault) finds that the requested page is not in
+ * the page cache, or that it is in the page cache and has the
+ * %PG_readahead flag set.  This flag indicates that the page was loaded
+ * as part of a previous read-ahead request and now that it has been
+ * accessed, it is time for the next read-ahead.
+ *
+ * Each readahead request is partly synchronous read, and partly async
+ * read-ahead.  This is reflected in the struct file_ra_state which
+ * contains ->size being the total number of pages, and ->async_size
+ * which is the number of pages in the async section.  The first page in
+ * this async section will have %PG_readahead set as a trigger for a
+ * subsequent read ahead.  Once a series of sequential reads has been
+ * established, there should be no need for a synchronous component and
+ * all read ahead requests will be fully asynchronous.
+ *
+ * When either of the triggers causes a readahead, three numbers need to
+ * be determined: the start of the region, the size of the region, and
+ * the size of the async tail.
+ *
+ * The start of the region is simply the first page address at or after
+ * the accessed address, which is not currently populated in the page
+ * cache.  This is found with a simple search in the page cache.
+ *
+ * The size of the async tail is determined by subtracting the size that
+ * was explicitly requested from the determined request size, unless
+ * this would be less than zero - then zero is used.  NOTE THIS
+ * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
+ * PAGE.
+ *
+ * The size of the region is normally determined from the size of the
+ * previous readahead which loaded the preceding pages.  This may be
+ * discovered from the struct file_ra_state for simple sequential reads,
+ * or from examining the state of the page cache when multiple
+ * sequential reads are interleaved.  Specifically: where the readahead
+ * was triggered by the %PG_readahead flag, the size of the previous
+ * readahead is assumed to be the number of pages from the triggering
+ * page to the start of the new readahead.  In these cases, the size of
+ * the previous readahead is scaled, often doubled, for the new
+ * readahead, though see get_next_ra_size() for details.
+ *
+ * If the size of the previous read cannot be determined, the number of
+ * preceding pages in the page cache is used to estimate the size of
+ * a previous read.  This estimate could easily be misled by random
+ * reads being coincidentally adjacent, so it is ignored unless it is
+ * larger than the current request, and it is not scaled up, unless it
+ * is at the start of file.
+ *
+ * In general read ahead is accelerated at the start of the file, as
+ * reads from there are often sequential.  There are other minor
+ * adjustments to the read ahead size in various special cases and these
+ * are best discovered by reading the code.
+ *
+ * The above calculation determines the readahead, to which any requested
+ * read size may be added.
+ *
+ * Readahead requests are sent to the filesystem using the ->readahead()
+ * address space operation, for which mpage_readahead() is a canonical
+ * implementation.  ->readahead() should normally initiate reads on all
+ * pages, but may fail to read any or all pages without causing an IO
+ * error.  The page cache reading code will issue a ->readpage() request
+ * for any page which ->readahead() does not provide, and only an error
+ * from this will be final.
+ *
+ * ->readahead() will generally call readahead_page() repeatedly to get
+ * each page from those prepared for read ahead.  It may fail to read a
+ * page by:
+ *
+ * * not calling readahead_page() sufficiently many times, effectively
+ *   ignoring some pages, as might be appropriate if the path to
+ *   storage is congested.
+ *
+ * * failing to actually submit a read request for a given page,
+ *   possibly due to insufficient resources, or
+ *
+ * * getting an error during subsequent processing of a request.
+ *
+ * In the last two cases, the page should be unlocked to indicate that
+ * the read attempt has failed.  In the first case the page will be
+ * unlocked by the caller.
+ *
+ * Those pages not in the final ``async_size`` of the request should be
+ * considered to be important and ->readahead() should not fail them due
+ * to congestion or temporary resource unavailability, but should wait
+ * for necessary resources (e.g.  memory or indexing information) to
+ * become available.  Pages in the final ``async_size`` may be
+ * considered less urgent and failure to read them is more acceptable.
+ * They will eventually be read individually using ->readpage().
+ */
+
 #include <linux/kernel.h>
 #include <linux/dax.h>
 #include <linux/gfp.h>
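
The overview comment says the previous readahead size is "scaled, often
doubled" and defers to get_next_ra_size() for details.  A minimal sketch
of that kind of ramp-up follows.  sketch_next_ra_size() is a hypothetical
name, and the max/16 and max/2 thresholds are illustrative assumptions
that do not come from this patch; get_next_ra_size() in mm/readahead.c
remains the authority.

```c
/*
 * Grow the previous window size cur towards the limit max.
 * Small windows grow quickly, mid-sized windows double, and large
 * windows are capped at max.  Thresholds are assumptions of this
 * sketch.
 */
static unsigned long sketch_next_ra_size(unsigned long cur,
					 unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;	/* far below the limit: quadruple */
	if (cur <= max / 2)
		return 2 * cur;	/* common case: double */
	return max;		/* never exceed the limit */
}
```

With max set to 32 pages, a previous size of 1 grows to 4, a size of 16
doubles to 32, and anything larger than 16 is capped at 32.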
