diff mbox series

[03/11] mm: Protect operations adding pages to page cache with invalidate_lock

Message ID 20210512134631.4053-3-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series fs: Hole punch vs page cache filling races | expand

Commit Message

Jan Kara May 12, 2021, 1:46 p.m. UTC
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:

fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
  truncate_inode_pages_range()
						  <create pages in page
						   cache here>
  <update fs block mapping and free blocks>

Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.

So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /
readahead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/locking.rst | 60 ++++++++++++++++++-------
 fs/inode.c                            |  3 ++
 include/linux/fs.h                    |  6 +++
 mm/filemap.c                          | 65 ++++++++++++++++++++++-----
 mm/readahead.c                        |  2 +
 mm/rmap.c                             | 37 +++++++--------
 mm/truncate.c                         |  2 +-
 7 files changed, 127 insertions(+), 48 deletions(-)

Comments

Matthew Wilcox (Oracle) May 12, 2021, 2:20 p.m. UTC | #1
On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:

> diff --git a/mm/truncate.c b/mm/truncate.c
> index 57a618c4a0d6..93bde2741e0e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
>   * @mapping: mapping to truncate
>   * @lstart: offset from which to truncate
>   *
> - * Called under (and serialised by) inode->i_rwsem.
> + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.

mapping->invalidate_lock, surely?  And could we ask lockdep to assert
this for us instead of just a comment?

>   *
>   * Note: When this function returns, there can be a page in the process of
>   * deletion (inside __delete_from_page_cache()) in the specified range.  Thus
> -- 
> 2.26.2
> 
>
Matthew Wilcox (Oracle) May 12, 2021, 2:40 p.m. UTC | #2
On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
>   truncate_inode_pages_range()
> 						  <create pages in page
> 						   cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.

Remind me (or, rather, add to the documentation) why we have to hold the
invalidate_lock during the call to readpage / readahead, and we don't just
hold it around the call to add_to_page_cache / add_to_page_cache_locked
/ add_to_page_cache_lru ?  I appreciate that ->readpages is still going
to suck, but we're down to just three implementations of ->readpages now
(9p, cifs & nfs).

Also, could I trouble you to run the comments through 'fmt' (or
equivalent)?  It's easier to read if you're not kissing right up on 80
columns.

> +++ b/fs/inode.c
> @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>  	mapping->private_data = NULL;
>  	mapping->writeback_index = 0;
> +	init_rwsem(&mapping->invalidate_lock);
> +	lockdep_set_class(&mapping->invalidate_lock,
> +			  &sb->s_type->invalidate_lock_key);

Why not:

	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock",
			&sb->s_type->invalidate_lock_key);
Darrick J. Wong May 12, 2021, 3:23 p.m. UTC | #3
On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
>   truncate_inode_pages_range()
> 						  <create pages in page
> 						   cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  Documentation/filesystems/locking.rst | 60 ++++++++++++++++++-------
>  fs/inode.c                            |  3 ++
>  include/linux/fs.h                    |  6 +++
>  mm/filemap.c                          | 65 ++++++++++++++++++++++-----
>  mm/readahead.c                        |  2 +
>  mm/rmap.c                             | 37 +++++++--------
>  mm/truncate.c                         |  2 +-
>  7 files changed, 127 insertions(+), 48 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 4ed2b22bd0a8..b73666a3da42 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>  	All except set_page_dirty and freepage may block
>  
> -======================	======================== =========
> -ops			PageLocked(page)	 i_rwsem
> -======================	======================== =========
> +======================	======================== =========	===============
> +ops			PageLocked(page)	 i_rwsem	invalidate_lock
> +======================	======================== =========	===============
>  writepage:		yes, unlocks (see below)
> -readpage:		yes, unlocks
> +readpage:		yes, unlocks				shared
>  writepages:
>  set_page_dirty		no
> -readahead:		yes, unlocks
> -readpages:		no
> +readahead:		yes, unlocks				shared
> +readpages:		no					shared
>  write_begin:		locks the page		 exclusive
>  write_end:		yes, unlocks		 exclusive
>  bmap:
> -invalidatepage:		yes
> +invalidatepage:		yes					exclusive
>  releasepage:		yes
>  freepage:		yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem should exclusively acquire
> +invalidate_lock before invalidating page cache in truncate / hole punch path (and
> +thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,27 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> +and acquires it in shared mode in paths loading pages from disk
> +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> +responsible for taking this lock in its fallocate implementation and generally
> +whenever the page cache contents needs to be invalidated because a block is
> +moving from under a page.
> +
> +->copy_file_range and ->remap_file_range implementations need to serialize
> +against modifications of file data while the operation is running. For blocking
> +changes through write(2) and similar operations inode->i_rwsem can be used. For
> +blocking changes through memory mapping, the filesystem can use
> +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> +implementation.

Question: What is the locking order when acquiring the invalidate_lock
of two different files?  Is it the same as i_rwsem (increasing order of
the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
being hoisted here (increasing order of i_ino)?

The reason I ask is that remap_file_range has to do that, but I don't
see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
calls in xfs_ilock2_io_mmap in this series.

--D

> +
>  dquot_operations
>  ================
>  
> @@ -634,9 +658,9 @@ access:		yes
>  to be faulted in. The filesystem must find and return the page associated
>  with the passed in "pgoff" in the vm_fault structure. If it is possible that
>  the page may be truncated and/or invalidated, then the filesystem must lock
> -the page, then ensure it is not already truncated (the page lock will block
> -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> -locked. The VM will unlock the page.
> +invalidate_lock, then ensure the page is not already truncated (invalidate_lock
> +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
> +page locked. The VM will unlock the page.
>  
>  ->map_pages() is called when VM asks to map easy accessible pages.
>  Filesystem should find and map pages associated with offsets from "start_pgoff"
> @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in
>  "pte" field in vm_fault structure. Pointers to entries for other offsets
>  should be calculated relative to "pte".
>  
> -->page_mkwrite() is called when a previously read-only pte is
> -about to become writeable. The filesystem again must ensure that there are
> -no truncate/invalidate races, and then return with the page locked. If
> -the page has been truncated, the filesystem should not look up a new page
> -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> -will cause the VM to retry the fault.
> +->page_mkwrite() is called when a previously read-only pte is about to become
> +writeable. The filesystem again must ensure that there are no
> +truncate/invalidate races or races with operations such as ->remap_file_range
> +or ->copy_file_range, and then return with the page locked. Usually
> +mapping->invalidate_lock is suitable for proper serialization. If the page has
> +been truncated, the filesystem should not look up a new page like the ->fault()
> +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
> +retry the fault.
>  
>  ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
>  VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> diff --git a/fs/inode.c b/fs/inode.c
> index c93500d84264..63a814367118 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>  	mapping->private_data = NULL;
>  	mapping->writeback_index = 0;
> +	init_rwsem(&mapping->invalidate_lock);
> +	lockdep_set_class(&mapping->invalidate_lock,
> +			  &sb->s_type->invalidate_lock_key);
>  	inode->i_private = NULL;
>  	inode->i_mapping = mapping;
>  	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c3c88fdb9b2a..897238d9f1e0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
>   * struct address_space - Contents of a cacheable, mappable object.
>   * @host: Owner, either the inode or the block_device.
>   * @i_pages: Cached pages.
> + * @invalidate_lock: Guards coherency between page cache contents and
> + *   file offset->disk block mappings in the filesystem during invalidates.
> + *   It is also used to block modification of page cache contents through
> + *   memory mappings.
>   * @gfp_mask: Memory allocation flags to use for allocating pages.
>   * @i_mmap_writable: Number of VM_SHARED mappings.
>   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> @@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
>  struct address_space {
>  	struct inode		*host;
>  	struct xarray		i_pages;
> +	struct rw_semaphore	invalidate_lock;
>  	gfp_t			gfp_mask;
>  	atomic_t		i_mmap_writable;
>  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> @@ -2488,6 +2493,7 @@ struct file_system_type {
>  
>  	struct lock_class_key i_lock_key;
>  	struct lock_class_key i_mutex_key;
> +	struct lock_class_key invalidate_lock_key;
>  	struct lock_class_key i_mutex_dir_key;
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ba1068a1837f..4d9ec4c6cc34 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
>   *        ->i_pages lock
>   *
>   *  ->i_rwsem
> - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> + *    ->invalidate_lock		(acquired by fs in truncate path)
> + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
>   *
>   *  ->mmap_lock
>   *    ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
>   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
>   *
>   *  ->mmap_lock
> - *    ->lock_page		(access_process_vm)
> + *    ->invalidate_lock		(filemap_fault)
> + *      ->lock_page		(filemap_fault, access_process_vm)
>   *
>   *  ->i_rwsem			(generic_perform_write)
>   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> @@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb,
>  {
>  	int error;
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!down_read_trylock(&mapping->invalidate_lock))
> +			return -EAGAIN;
> +	} else {
> +		down_read(&mapping->invalidate_lock);
> +	}
> +
>  	if (!trylock_page(page)) {
> +		error = -EAGAIN;
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
> -			return -EAGAIN;
> +			goto unlock_mapping;
>  		if (!(iocb->ki_flags & IOCB_WAITQ)) {
> +			up_read(&mapping->invalidate_lock);
>  			put_and_wait_on_page_locked(page, TASK_KILLABLE);
>  			return AOP_TRUNCATED_PAGE;
>  		}
>  		error = __lock_page_async(page, iocb->ki_waitq);
>  		if (error)
> -			return error;
> +			goto unlock_mapping;
>  	}
>  
> +	error = AOP_TRUNCATED_PAGE;
>  	if (!page->mapping)
> -		goto truncated;
> +		goto unlock;
>  
>  	error = 0;
>  	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
> @@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb,
>  		goto unlock;
>  
>  	error = filemap_read_page(iocb->ki_filp, mapping, page);
> -	if (error == AOP_TRUNCATED_PAGE)
> -		put_page(page);
> -	return error;
> -truncated:
> -	unlock_page(page);
> -	put_page(page);
> -	return AOP_TRUNCATED_PAGE;
> +	goto unlock_mapping;
>  unlock:
>  	unlock_page(page);
> +unlock_mapping:
> +	up_read(&mapping->invalidate_lock);
> +	if (error == AOP_TRUNCATED_PAGE)
> +		put_page(page);
>  	return error;
>  }
>  
> @@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file,
>  	if (!page)
>  		return -ENOMEM;
>  
> +	/*
> +	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
> +	 * assures we cannot instantiate and bring uptodate new pagecache pages
> +	 * after evicting page cache during truncate and before actually
> +	 * freeing blocks.  Note that we could release invalidate_lock after
> +	 * inserting the page into page cache as the locked page would then be
> +	 * enough to synchronize with hole punching. But there are code paths
> +	 * such as filemap_update_page() filling in partially uptodate pages or
> +	 * ->readpages() that need to hold invalidate_lock while mapping blocks
> +	 * for IO so let's hold the lock here as well to keep locking rules
> +	 * simple.
> +	 */
> +	down_read(&mapping->invalidate_lock);
>  	error = add_to_page_cache_lru(page, mapping, index,
>  			mapping_gfp_constraint(mapping, GFP_KERNEL));
>  	if (error == -EEXIST)
> @@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file,
>  	if (error)
>  		goto error;
>  
> +	up_read(&mapping->invalidate_lock);
>  	pagevec_add(pvec, page);
>  	return 0;
>  error:
> +	up_read(&mapping->invalidate_lock);
>  	put_page(page);
>  	return error;
>  }
> @@ -2988,6 +3013,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
>  		ret = VM_FAULT_MAJOR;
>  		fpin = do_sync_mmap_readahead(vmf);
> +	}
> +
> +	/*
> +	 * See comment in filemap_create_page() why we need invalidate_lock
> +	 */
> +	down_read(&mapping->invalidate_lock);
> +	if (!page) {
>  retry_find:
>  		page = pagecache_get_page(mapping, offset,
>  					  FGP_CREAT|FGP_FOR_MMAP,
> @@ -2995,6 +3027,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  		if (!page) {
>  			if (fpin)
>  				goto out_retry;
> +			up_read(&mapping->invalidate_lock);
>  			return VM_FAULT_OOM;
>  		}
>  	}
> @@ -3035,9 +3068,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	if (unlikely(offset >= max_off)) {
>  		unlock_page(page);
>  		put_page(page);
> +		up_read(&mapping->invalidate_lock);
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> +	up_read(&mapping->invalidate_lock);
>  	vmf->page = page;
>  	return ret | VM_FAULT_LOCKED;
>  
> @@ -3056,6 +3091,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  
>  	if (!error || error == AOP_TRUNCATED_PAGE)
>  		goto retry_find;
> +	up_read(&mapping->invalidate_lock);
>  
>  	return VM_FAULT_SIGBUS;
>  
> @@ -3067,6 +3103,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 */
>  	if (page)
>  		put_page(page);
> +	up_read(&mapping->invalidate_lock);
>  	if (fpin)
>  		fput(fpin);
>  	return ret | VM_FAULT_RETRY;
> @@ -3437,6 +3474,8 @@ static struct page *do_read_cache_page(struct address_space *mapping,
>   *
>   * If the page does not get brought uptodate, return -EIO.
>   *
> + * The function expects mapping->invalidate_lock to be already held.
> + *
>   * Return: up to date page on success, ERR_PTR() on failure.
>   */
>  struct page *read_cache_page(struct address_space *mapping,
> @@ -3460,6 +3499,8 @@ EXPORT_SYMBOL(read_cache_page);
>   *
>   * If the page does not get brought uptodate, return -EIO.
>   *
> + * The function expects mapping->invalidate_lock to be already held.
> + *
>   * Return: up to date page on success, ERR_PTR() on failure.
>   */
>  struct page *read_cache_page_gfp(struct address_space *mapping,
> diff --git a/mm/readahead.c b/mm/readahead.c
> index d589f147f4c2..9785c54107bb 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 */
>  	unsigned int nofs = memalloc_nofs_save();
>  
> +	down_read(&mapping->invalidate_lock);
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> @@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 * will then handle the error.
>  	 */
>  	read_pages(ractl, &page_pool, false);
> +	up_read(&mapping->invalidate_lock);
>  	memalloc_nofs_restore(nofs);
>  }
>  EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a35cbbbded0d..76d33c3b8ae6 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -22,24 +22,25 @@
>   *
>   * inode->i_rwsem	(while writing or truncating, not reading or faulting)
>   *   mm->mmap_lock
> - *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> - *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> - *         mapping->i_mmap_rwsem
> - *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> - *           anon_vma->rwsem
> - *             mm->page_table_lock or pte_lock
> - *               swap_lock (in swap_duplicate, swap_info_get)
> - *                 mmlist_lock (in mmput, drain_mmlist and others)
> - *                 mapping->private_lock (in __set_page_dirty_buffers)
> - *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
> - *                     i_pages lock (widely used)
> - *                       lruvec->lru_lock (in lock_page_lruvec_irq)
> - *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> - *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> - *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> - *                   i_pages lock (widely used, in set_page_dirty,
> - *                             in arch-dependent flush_dcache_mmap_lock,
> - *                             within bdi.wb->list_lock in __sync_single_inode)
> + *     mapping->invalidate_lock (in filemap_fault)
> + *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> + *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> + *           mapping->i_mmap_rwsem
> + *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> + *             anon_vma->rwsem
> + *               mm->page_table_lock or pte_lock
> + *                 swap_lock (in swap_duplicate, swap_info_get)
> + *                   mmlist_lock (in mmput, drain_mmlist and others)
> + *                   mapping->private_lock (in __set_page_dirty_buffers)
> + *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
> + *                       i_pages lock (widely used)
> + *                         lruvec->lru_lock (in lock_page_lruvec_irq)
> + *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> + *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> + *                     sb_lock (within inode_lock in fs/fs-writeback.c)
> + *                     i_pages lock (widely used, in set_page_dirty,
> + *                               in arch-dependent flush_dcache_mmap_lock,
> + *                               within bdi.wb->list_lock in __sync_single_inode)
>   *
>   * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
>   *   ->tasklist_lock
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 57a618c4a0d6..93bde2741e0e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
>   * @mapping: mapping to truncate
>   * @lstart: offset from which to truncate
>   *
> - * Called under (and serialised by) inode->i_rwsem.
> + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
>   *
>   * Note: When this function returns, there can be a page in the process of
>   * deletion (inside __delete_from_page_cache()) in the specified range.  Thus
> -- 
> 2.26.2
>
Jan Kara May 13, 2021, 5:44 p.m. UTC | #4
On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > +->fallocate implementation must be really careful to maintain page cache
> > +consistency when punching holes or performing other operations that invalidate
> > +page cache contents. Usually the filesystem needs to call
> > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > +However the filesystem usually also needs to update its internal (and on disk)
> > +view of file offset -> disk block mapping. Until this update is finished, the
> > +filesystem needs to block page faults and reads from reloading now-stale page
> > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > +and acquires it in shared mode in paths loading pages from disk
> > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > +responsible for taking this lock in its fallocate implementation and generally
> > +whenever the page cache contents needs to be invalidated because a block is
> > +moving from under a page.
> > +
> > +->copy_file_range and ->remap_file_range implementations need to serialize
> > +against modifications of file data while the operation is running. For blocking
> > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > +blocking changes through memory mapping, the filesystem can use
> > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > +implementation.
> 
> Question: What is the locking order when acquiring the invalidate_lock
> of two different files?  Is it the same as i_rwsem (increasing order of
> the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> being hoisted here (increasing order of i_ino)?
> 
> The reason I ask is that remap_file_range has to do that, but I don't
> see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> calls in xfs_ilock2_io_mmap in this series.

Good question. Technically, I don't think there's real need to establish a
single ordering because locks among different filesystems are never going
to be acquired together (effectively each lock type is local per sb and we
are free to define an ordering for each lock type differently). But to
maintain some sanity I guess having the same locking order for doublelock
of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
by-ino ordering? So that we don't have to consider two different orders in
xfs_lock_two_inodes()...

								Honza

> > +
> >  dquot_operations
> >  ================
> >  
> > @@ -634,9 +658,9 @@ access:		yes
> >  to be faulted in. The filesystem must find and return the page associated
> >  with the passed in "pgoff" in the vm_fault structure. If it is possible that
> >  the page may be truncated and/or invalidated, then the filesystem must lock
> > -the page, then ensure it is not already truncated (the page lock will block
> > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > -locked. The VM will unlock the page.
> > +invalidate_lock, then ensure the page is not already truncated (invalidate_lock
> > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
> > +page locked. The VM will unlock the page.
> >  
> >  ->map_pages() is called when VM asks to map easy accessible pages.
> >  Filesystem should find and map pages associated with offsets from "start_pgoff"
> > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in
> >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> >  should be calculated relative to "pte".
> >  
> > -->page_mkwrite() is called when a previously read-only pte is
> > -about to become writeable. The filesystem again must ensure that there are
> > -no truncate/invalidate races, and then return with the page locked. If
> > -the page has been truncated, the filesystem should not look up a new page
> > -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> > -will cause the VM to retry the fault.
> > +->page_mkwrite() is called when a previously read-only pte is about to become
> > +writeable. The filesystem again must ensure that there are no
> > +truncate/invalidate races or races with operations such as ->remap_file_range
> > +or ->copy_file_range, and then return with the page locked. Usually
> > +mapping->invalidate_lock is suitable for proper serialization. If the page has
> > +been truncated, the filesystem should not look up a new page like the ->fault()
> > +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
> > +retry the fault.
> >  
> >  ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> >  VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> > diff --git a/fs/inode.c b/fs/inode.c
> > index c93500d84264..63a814367118 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >  	mapping->private_data = NULL;
> >  	mapping->writeback_index = 0;
> > +	init_rwsem(&mapping->invalidate_lock);
> > +	lockdep_set_class(&mapping->invalidate_lock,
> > +			  &sb->s_type->invalidate_lock_key);
> >  	inode->i_private = NULL;
> >  	inode->i_mapping = mapping;
> >  	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index c3c88fdb9b2a..897238d9f1e0 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> >   * struct address_space - Contents of a cacheable, mappable object.
> >   * @host: Owner, either the inode or the block_device.
> >   * @i_pages: Cached pages.
> > + * @invalidate_lock: Guards coherency between page cache contents and
> > + *   file offset->disk block mappings in the filesystem during invalidates.
> > + *   It is also used to block modification of page cache contents through
> > + *   memory mappings.
> >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> >   * @i_mmap_writable: Number of VM_SHARED mappings.
> >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > @@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> >  struct address_space {
> >  	struct inode		*host;
> >  	struct xarray		i_pages;
> > +	struct rw_semaphore	invalidate_lock;
> >  	gfp_t			gfp_mask;
> >  	atomic_t		i_mmap_writable;
> >  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> > @@ -2488,6 +2493,7 @@ struct file_system_type {
> >  
> >  	struct lock_class_key i_lock_key;
> >  	struct lock_class_key i_mutex_key;
> > +	struct lock_class_key invalidate_lock_key;
> >  	struct lock_class_key i_mutex_dir_key;
> >  };
> >  
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index ba1068a1837f..4d9ec4c6cc34 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -77,7 +77,8 @@
> >   *        ->i_pages lock
> >   *
> >   *  ->i_rwsem
> > - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> > + *    ->invalidate_lock		(acquired by fs in truncate path)
> > + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> >   *
> >   *  ->mmap_lock
> >   *    ->i_mmap_rwsem
> > @@ -85,7 +86,8 @@
> >   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
> >   *
> >   *  ->mmap_lock
> > - *    ->lock_page		(access_process_vm)
> > + *    ->invalidate_lock		(filemap_fault)
> > + *      ->lock_page		(filemap_fault, access_process_vm)
> >   *
> >   *  ->i_rwsem			(generic_perform_write)
> >   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> > @@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb,
> >  {
> >  	int error;
> >  
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!down_read_trylock(&mapping->invalidate_lock))
> > +			return -EAGAIN;
> > +	} else {
> > +		down_read(&mapping->invalidate_lock);
> > +	}
> > +
> >  	if (!trylock_page(page)) {
> > +		error = -EAGAIN;
> >  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
> > -			return -EAGAIN;
> > +			goto unlock_mapping;
> >  		if (!(iocb->ki_flags & IOCB_WAITQ)) {
> > +			up_read(&mapping->invalidate_lock);
> >  			put_and_wait_on_page_locked(page, TASK_KILLABLE);
> >  			return AOP_TRUNCATED_PAGE;
> >  		}
> >  		error = __lock_page_async(page, iocb->ki_waitq);
> >  		if (error)
> > -			return error;
> > +			goto unlock_mapping;
> >  	}
> >  
> > +	error = AOP_TRUNCATED_PAGE;
> >  	if (!page->mapping)
> > -		goto truncated;
> > +		goto unlock;
> >  
> >  	error = 0;
> >  	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
> > @@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb,
> >  		goto unlock;
> >  
> >  	error = filemap_read_page(iocb->ki_filp, mapping, page);
> > -	if (error == AOP_TRUNCATED_PAGE)
> > -		put_page(page);
> > -	return error;
> > -truncated:
> > -	unlock_page(page);
> > -	put_page(page);
> > -	return AOP_TRUNCATED_PAGE;
> > +	goto unlock_mapping;
> >  unlock:
> >  	unlock_page(page);
> > +unlock_mapping:
> > +	up_read(&mapping->invalidate_lock);
> > +	if (error == AOP_TRUNCATED_PAGE)
> > +		put_page(page);
> >  	return error;
> >  }
> >  
> > @@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file,
> >  	if (!page)
> >  		return -ENOMEM;
> >  
> > +	/*
> > +	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
> > +	 * assures we cannot instantiate and bring uptodate new pagecache pages
> > +	 * after evicting page cache during truncate and before actually
> > +	 * freeing blocks.  Note that we could release invalidate_lock after
> > +	 * inserting the page into page cache as the locked page would then be
> > +	 * enough to synchronize with hole punching. But there are code paths
> > +	 * such as filemap_update_page() filling in partially uptodate pages or
> > +	 * ->readpages() that need to hold invalidate_lock while mapping blocks
> > +	 * for IO so let's hold the lock here as well to keep locking rules
> > +	 * simple.
> > +	 */
> > +	down_read(&mapping->invalidate_lock);
> >  	error = add_to_page_cache_lru(page, mapping, index,
> >  			mapping_gfp_constraint(mapping, GFP_KERNEL));
> >  	if (error == -EEXIST)
> > @@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file,
> >  	if (error)
> >  		goto error;
> >  
> > +	up_read(&mapping->invalidate_lock);
> >  	pagevec_add(pvec, page);
> >  	return 0;
> >  error:
> > +	up_read(&mapping->invalidate_lock);
> >  	put_page(page);
> >  	return error;
> >  }
> > @@ -2988,6 +3013,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> >  		ret = VM_FAULT_MAJOR;
> >  		fpin = do_sync_mmap_readahead(vmf);
> > +	}
> > +
> > +	/*
> > +	 * See comment in filemap_create_page() why we need invalidate_lock
> > +	 */
> > +	down_read(&mapping->invalidate_lock);
> > +	if (!page) {
> >  retry_find:
> >  		page = pagecache_get_page(mapping, offset,
> >  					  FGP_CREAT|FGP_FOR_MMAP,
> > @@ -2995,6 +3027,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  		if (!page) {
> >  			if (fpin)
> >  				goto out_retry;
> > +			up_read(&mapping->invalidate_lock);
> >  			return VM_FAULT_OOM;
> >  		}
> >  	}
> > @@ -3035,9 +3068,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  	if (unlikely(offset >= max_off)) {
> >  		unlock_page(page);
> >  		put_page(page);
> > +		up_read(&mapping->invalidate_lock);
> >  		return VM_FAULT_SIGBUS;
> >  	}
> >  
> > +	up_read(&mapping->invalidate_lock);
> >  	vmf->page = page;
> >  	return ret | VM_FAULT_LOCKED;
> >  
> > @@ -3056,6 +3091,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  
> >  	if (!error || error == AOP_TRUNCATED_PAGE)
> >  		goto retry_find;
> > +	up_read(&mapping->invalidate_lock);
> >  
> >  	return VM_FAULT_SIGBUS;
> >  
> > @@ -3067,6 +3103,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> >  	 */
> >  	if (page)
> >  		put_page(page);
> > +	up_read(&mapping->invalidate_lock);
> >  	if (fpin)
> >  		fput(fpin);
> >  	return ret | VM_FAULT_RETRY;
> > @@ -3437,6 +3474,8 @@ static struct page *do_read_cache_page(struct address_space *mapping,
> >   *
> >   * If the page does not get brought uptodate, return -EIO.
> >   *
> > + * The function expects mapping->invalidate_lock to be already held.
> > + *
> >   * Return: up to date page on success, ERR_PTR() on failure.
> >   */
> >  struct page *read_cache_page(struct address_space *mapping,
> > @@ -3460,6 +3499,8 @@ EXPORT_SYMBOL(read_cache_page);
> >   *
> >   * If the page does not get brought uptodate, return -EIO.
> >   *
> > + * The function expects mapping->invalidate_lock to be already held.
> > + *
> >   * Return: up to date page on success, ERR_PTR() on failure.
> >   */
> >  struct page *read_cache_page_gfp(struct address_space *mapping,
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index d589f147f4c2..9785c54107bb 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  	 */
> >  	unsigned int nofs = memalloc_nofs_save();
> >  
> > +	down_read(&mapping->invalidate_lock);
> >  	/*
> >  	 * Preallocate as many pages as we will need.
> >  	 */
> > @@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  	 * will then handle the error.
> >  	 */
> >  	read_pages(ractl, &page_pool, false);
> > +	up_read(&mapping->invalidate_lock);
> >  	memalloc_nofs_restore(nofs);
> >  }
> >  EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index a35cbbbded0d..76d33c3b8ae6 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -22,24 +22,25 @@
> >   *
> >   * inode->i_rwsem	(while writing or truncating, not reading or faulting)
> >   *   mm->mmap_lock
> > - *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > - *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > - *         mapping->i_mmap_rwsem
> > - *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > - *           anon_vma->rwsem
> > - *             mm->page_table_lock or pte_lock
> > - *               swap_lock (in swap_duplicate, swap_info_get)
> > - *                 mmlist_lock (in mmput, drain_mmlist and others)
> > - *                 mapping->private_lock (in __set_page_dirty_buffers)
> > - *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > - *                     i_pages lock (widely used)
> > - *                       lruvec->lru_lock (in lock_page_lruvec_irq)
> > - *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > - *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > - *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> > - *                   i_pages lock (widely used, in set_page_dirty,
> > - *                             in arch-dependent flush_dcache_mmap_lock,
> > - *                             within bdi.wb->list_lock in __sync_single_inode)
> > + *     mapping->invalidate_lock (in filemap_fault)
> > + *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > + *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > + *           mapping->i_mmap_rwsem
> > + *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > + *             anon_vma->rwsem
> > + *               mm->page_table_lock or pte_lock
> > + *                 swap_lock (in swap_duplicate, swap_info_get)
> > + *                   mmlist_lock (in mmput, drain_mmlist and others)
> > + *                   mapping->private_lock (in __set_page_dirty_buffers)
> > + *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > + *                       i_pages lock (widely used)
> > + *                         lruvec->lru_lock (in lock_page_lruvec_irq)
> > + *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > + *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > + *                     sb_lock (within inode_lock in fs/fs-writeback.c)
> > + *                     i_pages lock (widely used, in set_page_dirty,
> > + *                               in arch-dependent flush_dcache_mmap_lock,
> > + *                               within bdi.wb->list_lock in __sync_single_inode)
> >   *
> >   * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
> >   *   ->tasklist_lock
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 57a618c4a0d6..93bde2741e0e 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
> >   * @mapping: mapping to truncate
> >   * @lstart: offset from which to truncate
> >   *
> > - * Called under (and serialised by) inode->i_rwsem.
> > + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
> >   *
> >   * Note: When this function returns, there can be a page in the process of
> >   * deletion (inside __delete_from_page_cache()) in the specified range.  Thus
> > -- 
> > 2.26.2
> >
Jan Kara May 13, 2021, 5:49 p.m. UTC | #5
On Wed 12-05-21 15:20:44, Matthew Wilcox wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> 
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 57a618c4a0d6..93bde2741e0e 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
> >   * @mapping: mapping to truncate
> >   * @lstart: offset from which to truncate
> >   *
> > - * Called under (and serialised by) inode->i_rwsem.
> > + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
> 
> mapping->invalidate_lock, surely?

Right, thanks for noticing. 

> And could we ask lockdep to assert this for us instead of just a comment?

That's the plan but currently it would trip for filesystems unaware of
invalidate_lock. Once all filesystems are converted I plan to transform the
comments into actual asserts. In this series I aim at fixing the data
corruption issues, I plan the cleanups for later...

								Honza
Darrick J. Wong May 13, 2021, 6:52 p.m. UTC | #6
On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > +->fallocate implementation must be really careful to maintain page cache
> > > +consistency when punching holes or performing other operations that invalidate
> > > +page cache contents. Usually the filesystem needs to call
> > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > +However the filesystem usually also needs to update its internal (and on disk)
> > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > +and acquires it in shared mode in paths loading pages from disk
> > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > +responsible for taking this lock in its fallocate implementation and generally
> > > +whenever the page cache contents needs to be invalidated because a block is
> > > +moving from under a page.
> > > +
> > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > +against modifications of file data while the operation is running. For blocking
> > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > +blocking changes through memory mapping, the filesystem can use
> > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > +implementation.
> > 
> > Question: What is the locking order when acquiring the invalidate_lock
> > of two different files?  Is it the same as i_rwsem (increasing order of
> > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > being hoisted here (increasing order of i_ino)?
> > 
> > The reason I ask is that remap_file_range has to do that, but I don't
> > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > calls in xfs_ilock2_io_mmap in this series.
> 
> Good question. Technically, I don't think there's real need to establish a
> single ordering because locks among different filesystems are never going
> to be acquired together (effectively each lock type is local per sb and we
> are free to define an ordering for each lock type differently). But to
> maintain some sanity I guess having the same locking order for doublelock
> of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> by-ino ordering? So that we don't have to consider two different orders in
> xfs_lock_two_inodes()...

I imagine Dave will chime in on this, but I suspect the reason is
hysterical raisins^Wreasons.  It might simply be time to convert all
three XFS inode locks to use the same ordering rules.

--D

> 
> 								Honza
> 
> > > +
> > >  dquot_operations
> > >  ================
> > >  
> > > @@ -634,9 +658,9 @@ access:		yes
> > >  to be faulted in. The filesystem must find and return the page associated
> > >  with the passed in "pgoff" in the vm_fault structure. If it is possible that
> > >  the page may be truncated and/or invalidated, then the filesystem must lock
> > > -the page, then ensure it is not already truncated (the page lock will block
> > > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > > -locked. The VM will unlock the page.
> > > +invalidate_lock, then ensure the page is not already truncated (invalidate_lock
> > > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
> > > +page locked. The VM will unlock the page.
> > >  
> > >  ->map_pages() is called when VM asks to map easy accessible pages.
> > >  Filesystem should find and map pages associated with offsets from "start_pgoff"
> > > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in
> > >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> > >  should be calculated relative to "pte".
> > >  
> > > -->page_mkwrite() is called when a previously read-only pte is
> > > -about to become writeable. The filesystem again must ensure that there are
> > > -no truncate/invalidate races, and then return with the page locked. If
> > > -the page has been truncated, the filesystem should not look up a new page
> > > -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
> > > -will cause the VM to retry the fault.
> > > +->page_mkwrite() is called when a previously read-only pte is about to become
> > > +writeable. The filesystem again must ensure that there are no
> > > +truncate/invalidate races or races with operations such as ->remap_file_range
> > > +or ->copy_file_range, and then return with the page locked. Usually
> > > +mapping->invalidate_lock is suitable for proper serialization. If the page has
> > > +been truncated, the filesystem should not look up a new page like the ->fault()
> > > +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
> > > +retry the fault.
> > >  
> > >  ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
> > >  VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index c93500d84264..63a814367118 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> > >  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> > >  	mapping->private_data = NULL;
> > >  	mapping->writeback_index = 0;
> > > +	init_rwsem(&mapping->invalidate_lock);
> > > +	lockdep_set_class(&mapping->invalidate_lock,
> > > +			  &sb->s_type->invalidate_lock_key);
> > >  	inode->i_private = NULL;
> > >  	inode->i_mapping = mapping;
> > >  	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index c3c88fdb9b2a..897238d9f1e0 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> > >   * struct address_space - Contents of a cacheable, mappable object.
> > >   * @host: Owner, either the inode or the block_device.
> > >   * @i_pages: Cached pages.
> > > + * @invalidate_lock: Guards coherency between page cache contents and
> > > + *   file offset->disk block mappings in the filesystem during invalidates.
> > > + *   It is also used to block modification of page cache contents through
> > > + *   memory mappings.
> > >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> > >   * @i_mmap_writable: Number of VM_SHARED mappings.
> > >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > > @@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
> > >  struct address_space {
> > >  	struct inode		*host;
> > >  	struct xarray		i_pages;
> > > +	struct rw_semaphore	invalidate_lock;
> > >  	gfp_t			gfp_mask;
> > >  	atomic_t		i_mmap_writable;
> > >  #ifdef CONFIG_READ_ONLY_THP_FOR_FS
> > > @@ -2488,6 +2493,7 @@ struct file_system_type {
> > >  
> > >  	struct lock_class_key i_lock_key;
> > >  	struct lock_class_key i_mutex_key;
> > > +	struct lock_class_key invalidate_lock_key;
> > >  	struct lock_class_key i_mutex_dir_key;
> > >  };
> > >  
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index ba1068a1837f..4d9ec4c6cc34 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -77,7 +77,8 @@
> > >   *        ->i_pages lock
> > >   *
> > >   *  ->i_rwsem
> > > - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> > > + *    ->invalidate_lock		(acquired by fs in truncate path)
> > > + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> > >   *
> > >   *  ->mmap_lock
> > >   *    ->i_mmap_rwsem
> > > @@ -85,7 +86,8 @@
> > >   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
> > >   *
> > >   *  ->mmap_lock
> > > - *    ->lock_page		(access_process_vm)
> > > + *    ->invalidate_lock		(filemap_fault)
> > > + *      ->lock_page		(filemap_fault, access_process_vm)
> > >   *
> > >   *  ->i_rwsem			(generic_perform_write)
> > >   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> > > @@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb,
> > >  {
> > >  	int error;
> > >  
> > > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > > +		if (!down_read_trylock(&mapping->invalidate_lock))
> > > +			return -EAGAIN;
> > > +	} else {
> > > +		down_read(&mapping->invalidate_lock);
> > > +	}
> > > +
> > >  	if (!trylock_page(page)) {
> > > +		error = -EAGAIN;
> > >  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
> > > -			return -EAGAIN;
> > > +			goto unlock_mapping;
> > >  		if (!(iocb->ki_flags & IOCB_WAITQ)) {
> > > +			up_read(&mapping->invalidate_lock);
> > >  			put_and_wait_on_page_locked(page, TASK_KILLABLE);
> > >  			return AOP_TRUNCATED_PAGE;
> > >  		}
> > >  		error = __lock_page_async(page, iocb->ki_waitq);
> > >  		if (error)
> > > -			return error;
> > > +			goto unlock_mapping;
> > >  	}
> > >  
> > > +	error = AOP_TRUNCATED_PAGE;
> > >  	if (!page->mapping)
> > > -		goto truncated;
> > > +		goto unlock;
> > >  
> > >  	error = 0;
> > >  	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
> > > @@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb,
> > >  		goto unlock;
> > >  
> > >  	error = filemap_read_page(iocb->ki_filp, mapping, page);
> > > -	if (error == AOP_TRUNCATED_PAGE)
> > > -		put_page(page);
> > > -	return error;
> > > -truncated:
> > > -	unlock_page(page);
> > > -	put_page(page);
> > > -	return AOP_TRUNCATED_PAGE;
> > > +	goto unlock_mapping;
> > >  unlock:
> > >  	unlock_page(page);
> > > +unlock_mapping:
> > > +	up_read(&mapping->invalidate_lock);
> > > +	if (error == AOP_TRUNCATED_PAGE)
> > > +		put_page(page);
> > >  	return error;
> > >  }
> > >  
> > > @@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file,
> > >  	if (!page)
> > >  		return -ENOMEM;
> > >  
> > > +	/*
> > > +	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
> > > +	 * assures we cannot instantiate and bring uptodate new pagecache pages
> > > +	 * after evicting page cache during truncate and before actually
> > > +	 * freeing blocks.  Note that we could release invalidate_lock after
> > > +	 * inserting the page into page cache as the locked page would then be
> > > +	 * enough to synchronize with hole punching. But there are code paths
> > > +	 * such as filemap_update_page() filling in partially uptodate pages or
> > > +	 * ->readpages() that need to hold invalidate_lock while mapping blocks
> > > +	 * for IO so let's hold the lock here as well to keep locking rules
> > > +	 * simple.
> > > +	 */
> > > +	down_read(&mapping->invalidate_lock);
> > >  	error = add_to_page_cache_lru(page, mapping, index,
> > >  			mapping_gfp_constraint(mapping, GFP_KERNEL));
> > >  	if (error == -EEXIST)
> > > @@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file,
> > >  	if (error)
> > >  		goto error;
> > >  
> > > +	up_read(&mapping->invalidate_lock);
> > >  	pagevec_add(pvec, page);
> > >  	return 0;
> > >  error:
> > > +	up_read(&mapping->invalidate_lock);
> > >  	put_page(page);
> > >  	return error;
> > >  }
> > > @@ -2988,6 +3013,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> > >  		ret = VM_FAULT_MAJOR;
> > >  		fpin = do_sync_mmap_readahead(vmf);
> > > +	}
> > > +
> > > +	/*
> > > +	 * See comment in filemap_create_page() why we need invalidate_lock
> > > +	 */
> > > +	down_read(&mapping->invalidate_lock);
> > > +	if (!page) {
> > >  retry_find:
> > >  		page = pagecache_get_page(mapping, offset,
> > >  					  FGP_CREAT|FGP_FOR_MMAP,
> > > @@ -2995,6 +3027,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  		if (!page) {
> > >  			if (fpin)
> > >  				goto out_retry;
> > > +			up_read(&mapping->invalidate_lock);
> > >  			return VM_FAULT_OOM;
> > >  		}
> > >  	}
> > > @@ -3035,9 +3068,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  	if (unlikely(offset >= max_off)) {
> > >  		unlock_page(page);
> > >  		put_page(page);
> > > +		up_read(&mapping->invalidate_lock);
> > >  		return VM_FAULT_SIGBUS;
> > >  	}
> > >  
> > > +	up_read(&mapping->invalidate_lock);
> > >  	vmf->page = page;
> > >  	return ret | VM_FAULT_LOCKED;
> > >  
> > > @@ -3056,6 +3091,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  
> > >  	if (!error || error == AOP_TRUNCATED_PAGE)
> > >  		goto retry_find;
> > > +	up_read(&mapping->invalidate_lock);
> > >  
> > >  	return VM_FAULT_SIGBUS;
> > >  
> > > @@ -3067,6 +3103,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > >  	 */
> > >  	if (page)
> > >  		put_page(page);
> > > +	up_read(&mapping->invalidate_lock);
> > >  	if (fpin)
> > >  		fput(fpin);
> > >  	return ret | VM_FAULT_RETRY;
> > > @@ -3437,6 +3474,8 @@ static struct page *do_read_cache_page(struct address_space *mapping,
> > >   *
> > >   * If the page does not get brought uptodate, return -EIO.
> > >   *
> > > + * The function expects mapping->invalidate_lock to be already held.
> > > + *
> > >   * Return: up to date page on success, ERR_PTR() on failure.
> > >   */
> > >  struct page *read_cache_page(struct address_space *mapping,
> > > @@ -3460,6 +3499,8 @@ EXPORT_SYMBOL(read_cache_page);
> > >   *
> > >   * If the page does not get brought uptodate, return -EIO.
> > >   *
> > > + * The function expects mapping->invalidate_lock to be already held.
> > > + *
> > >   * Return: up to date page on success, ERR_PTR() on failure.
> > >   */
> > >  struct page *read_cache_page_gfp(struct address_space *mapping,
> > > diff --git a/mm/readahead.c b/mm/readahead.c
> > > index d589f147f4c2..9785c54107bb 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 */
> > >  	unsigned int nofs = memalloc_nofs_save();
> > >  
> > > +	down_read(&mapping->invalidate_lock);
> > >  	/*
> > >  	 * Preallocate as many pages as we will need.
> > >  	 */
> > > @@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 * will then handle the error.
> > >  	 */
> > >  	read_pages(ractl, &page_pool, false);
> > > +	up_read(&mapping->invalidate_lock);
> > >  	memalloc_nofs_restore(nofs);
> > >  }
> > >  EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index a35cbbbded0d..76d33c3b8ae6 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -22,24 +22,25 @@
> > >   *
> > >   * inode->i_rwsem	(while writing or truncating, not reading or faulting)
> > >   *   mm->mmap_lock
> > > - *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > > - *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > > - *         mapping->i_mmap_rwsem
> > > - *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > > - *           anon_vma->rwsem
> > > - *             mm->page_table_lock or pte_lock
> > > - *               swap_lock (in swap_duplicate, swap_info_get)
> > > - *                 mmlist_lock (in mmput, drain_mmlist and others)
> > > - *                 mapping->private_lock (in __set_page_dirty_buffers)
> > > - *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > > - *                     i_pages lock (widely used)
> > > - *                       lruvec->lru_lock (in lock_page_lruvec_irq)
> > > - *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > > - *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > > - *                   sb_lock (within inode_lock in fs/fs-writeback.c)
> > > - *                   i_pages lock (widely used, in set_page_dirty,
> > > - *                             in arch-dependent flush_dcache_mmap_lock,
> > > - *                             within bdi.wb->list_lock in __sync_single_inode)
> > > + *     mapping->invalidate_lock (in filemap_fault)
> > > + *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
> > > + *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> > > + *           mapping->i_mmap_rwsem
> > > + *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> > > + *             anon_vma->rwsem
> > > + *               mm->page_table_lock or pte_lock
> > > + *                 swap_lock (in swap_duplicate, swap_info_get)
> > > + *                   mmlist_lock (in mmput, drain_mmlist and others)
> > > + *                   mapping->private_lock (in __set_page_dirty_buffers)
> > > + *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
> > > + *                       i_pages lock (widely used)
> > > + *                         lruvec->lru_lock (in lock_page_lruvec_irq)
> > > + *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
> > > + *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
> > > + *                     sb_lock (within inode_lock in fs/fs-writeback.c)
> > > + *                     i_pages lock (widely used, in set_page_dirty,
> > > + *                               in arch-dependent flush_dcache_mmap_lock,
> > > + *                               within bdi.wb->list_lock in __sync_single_inode)
> > >   *
> > >   * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
> > >   *   ->tasklist_lock
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index 57a618c4a0d6..93bde2741e0e 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
> > >   * @mapping: mapping to truncate
> > >   * @lstart: offset from which to truncate
> > >   *
> > > - * Called under (and serialised by) inode->i_rwsem.
> > > + * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
> > >   *
> > >   * Note: When this function returns, there can be a page in the process of
> > >   * deletion (inside __delete_from_page_cache()) in the specified range.  Thus
> > > -- 
> > > 2.26.2
> > > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara May 13, 2021, 7:01 p.m. UTC | #7
On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > Currently, serializing operations such as page fault, read, or readahead
> > against hole punching is rather difficult. The basic race scheme is
> > like:
> > 
> > fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
> >   truncate_inode_pages_range()
> > 						  <create pages in page
> > 						   cache here>
> >   <update fs block mapping and free blocks>
> > 
> > Now the problem is in this way read / page fault / readahead can
> > instantiate pages in page cache with potentially stale data (if blocks
> > get quickly reused). Avoiding this race is not simple - page locks do
> > not work because we want to make sure there are *no* pages in given
> > range. inode->i_rwsem does not work because page fault happens under
> > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> > the performance for mixed read-write workloads suffer.
> > 
> > So create a new rw_semaphore in the address_space - invalidate_lock -
> > that protects adding of pages to page cache for page faults / reads /
> > readahead.
> 
> Remind me (or, rather, add to the documentation) why we have to hold the
> invalidate_lock during the call to readpage / readahead, and we don't just
> hold it around the call to add_to_page_cache / add_to_page_cache_locked
> / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> to suck, but we're down to just three implementations of ->readpages now
> (9p, cifs & nfs).

There's a comment in filemap_create_page() trying to explain this. We need
to protect against cases like: Filesystem with 1k blocksize, file F has
page at index 0 with uptodate buffer at 0-1k, rest not uptodate. All blocks
underlying page are allocated. Now let read at offset 1k race with hole
punch at offset 1k, length 1k.

read()					hole punch
...
  filemap_read()
    filemap_get_pages()
      - page found in the page cache but !Uptodate
      filemap_update_page()
					  locks everything
					  truncate_inode_pages_range()
					    lock_page(page)
					    do_invalidatepage()
					    unlock_page(page)
        locks page
          filemap_read_page()
            ->readpage()
              block underlying offset 1k
	      still allocated -> map buffer
					  free block under offset 1k
	      submit IO -> corrupted data

If you think I should expand it to explain more details, please tell.
Or maybe I can put more detailed discussion like above into the changelog?

> Also, could I trouble you to run the comments through 'fmt' (or
> equivalent)?  It's easier to read if you're not kissing right up on 80
> columns.

Sure, will do.

> > +++ b/fs/inode.c
> > @@ -190,6 +190,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >  	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> >  	mapping->private_data = NULL;
> >  	mapping->writeback_index = 0;
> > +	init_rwsem(&mapping->invalidate_lock);
> > +	lockdep_set_class(&mapping->invalidate_lock,
> > +			  &sb->s_type->invalidate_lock_key);
> 
> Why not:
> 
> 	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock",
> 			&sb->s_type->invalidate_lock_key);

I replicated what we do for i_rwsem but you're right, this is better.
Updated.

								Honza
Matthew Wilcox (Oracle) May 13, 2021, 7:38 p.m. UTC | #8
On Thu, May 13, 2021 at 09:01:14PM +0200, Jan Kara wrote:
> On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> > Remind me (or, rather, add to the documentation) why we have to hold the
> > invalidate_lock during the call to readpage / readahead, and we don't just
> > hold it around the call to add_to_page_cache / add_to_page_cache_locked
> > / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> > to suck, but we're down to just three implementations of ->readpages now
> > (9p, cifs & nfs).
> 
> There's a comment in filemap_create_page() trying to explain this. We need
> to protect against cases like: Filesystem with 1k blocksize, file F has
> page at index 0 with uptodate buffer at 0-1k, rest not uptodate. All blocks
> underlying page are allocated. Now let read at offset 1k race with hole
> punch at offset 1k, length 1k.
> 
> read()					hole punch
> ...
>   filemap_read()
>     filemap_get_pages()
>       - page found in the page cache but !Uptodate
>       filemap_update_page()
> 					  locks everything
> 					  truncate_inode_pages_range()
> 					    lock_page(page)
> 					    do_invalidatepage()
> 					    unlock_page(page)
>         locks page
>           filemap_read_page()

Ah, this is the partial_start case, which means that page->mapping
is still valid.  But that means that do_invalidatepage() was called
with (offset 1024, length 1024), immediately after we called
zero_user_segment().  So isn't this a bug in the fs do_invalidatepage()?
The range from 1k-2k _is_ uptodate.  It's been zeroed in memory,
and if we were to run after the "free block" below, we'd get that
memory zeroed again.

>             ->readpage()
>               block underlying offset 1k
> 	      still allocated -> map buffer
> 					  free block under offset 1k
> 	      submit IO -> corrupted data
> 
> If you think I should expand it to explain more details, please tell.
> Or maybe I can put more detailed discussion like above into the changelog?

> > Why not:
> > 
> > 	__init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock",
> > 			&sb->s_type->invalidate_lock_key);
> 
> I replicated what we do for i_rwsem but you're right, this is better.
> Updated.

Hmm, there's a few places we should use __init_rwsem() ... something
for my "when bored" pile of work.
Dave Chinner May 13, 2021, 11:19 p.m. UTC | #9
On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > +->fallocate implementation must be really careful to maintain page cache
> > > > +consistency when punching holes or performing other operations that invalidate
> > > > +page cache contents. Usually the filesystem needs to call
> > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > +and acquires it in shared mode in paths loading pages from disk
> > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > +moving from under a page.
> > > > +
> > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > +against modifications of file data while the operation is running. For blocking
> > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > +blocking changes through memory mapping, the filesystem can use
> > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > +implementation.
> > > 
> > > Question: What is the locking order when acquiring the invalidate_lock
> > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > being hoisted here (increasing order of i_ino)?
> > > 
> > > The reason I ask is that remap_file_range has to do that, but I don't
> > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > calls in xfs_ilock2_io_mmap in this series.
> > 
> > Good question. Technically, I don't think there's real need to establish a
> > single ordering because locks among different filesystems are never going
> > to be acquired together (effectively each lock type is local per sb and we
> > are free to define an ordering for each lock type differently). But to
> > maintain some sanity I guess having the same locking order for doublelock
> > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > by-ino ordering? So that we don't have to consider two different orders in
> > xfs_lock_two_inodes()...
> 
> I imagine Dave will chime in on this, but I suspect the reason is
> hysterical raisins^Wreasons.

It's the locking rules that XFS has used pretty much forever.
Locking by inode number always guarantees the same locking order of
two inodes in the same filesystem, regardless of the specific
in-memory instances of the two inodes.

e.g. if we lock based on the inode structure address, in one
instancex, we could get A -> B, then B gets recycled and
reallocated, then we get B -> A as the locking order for the same
two inodes.

That, IMNSHO, is utterly crazy because with non-deterministic inode
lock ordered like this you can't make consistent locking rules for
locking the physical inode cluster buffers underlying the inodes in
the situation where they also need to be locked.

We've been down this path before more than a decade ago when the
powers that be decreed that inode locking order is to be "by
structure address" rather than inode number, because "inode number
is not unique across multiple superblocks".

I'm not sure that there is anywhere that locks multiple inodes
across different superblocks, but here we are again....

> It might simply be time to convert all
> three XFS inode locks to use the same ordering rules.

Careful, there lie dragons along that path because of things like
how the inode cluster buffer operations work - they all assume
ascending inode number traversal within and across inode cluster
buffers and hence we do have locking order constraints based on
inode number...

Cheers,

Dave.
Jan Kara May 14, 2021, 11:07 a.m. UTC | #10
On Thu 13-05-21 20:38:47, Matthew Wilcox wrote:
> On Thu, May 13, 2021 at 09:01:14PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 15:40:21, Matthew Wilcox wrote:
> > > Remind me (or, rather, add to the documentation) why we have to hold the
> > > invalidate_lock during the call to readpage / readahead, and we don't just
> > > hold it around the call to add_to_page_cache / add_to_page_cache_locked
> > > / add_to_page_cache_lru ?  I appreciate that ->readpages is still going
> > > to suck, but we're down to just three implementations of ->readpages now
> > > (9p, cifs & nfs).
> > 
> > There's a comment in filemap_create_page() trying to explain this. We need
> > to protect against cases like: Filesystem with 1k blocksize, file F has
> > page at index 0 with uptodate buffer at 0-1k, rest not uptodate. All blocks
> > underlying page are allocated. Now let read at offset 1k race with hole
> > punch at offset 1k, length 1k.
> > 
> > read()					hole punch
> > ...
> >   filemap_read()
> >     filemap_get_pages()
> >       - page found in the page cache but !Uptodate
> >       filemap_update_page()
> > 					  locks everything
> > 					  truncate_inode_pages_range()
> > 					    lock_page(page)
> > 					    do_invalidatepage()
> > 					    unlock_page(page)
> >         locks page
> >           filemap_read_page()
> 
> Ah, this is the partial_start case, which means that page->mapping
> is still valid.  But that means that do_invalidatepage() was called
> with (offset 1024, length 1024), immediately after we called
> zero_user_segment().  So isn't this a bug in the fs do_invalidatepage()?
> The range from 1k-2k _is_ uptodate.  It's been zeroed in memory,
> and if we were to run after the "free block" below, we'd get that
> memory zeroed again.

Well, yes, do_invalidatepage() could mark zeroed region as uptodate. But I
don't think we want to rely on 'uptodate' not getting spuriously cleared
(which would reopen the problem). Generally the assumption is that there's
no problem clearing (or not setting) uptodate flag of a clean buffer
because the fs can always provide the data again. Similarly, fs is free to
refetch data into clean & uptodate page, if it thinks it's worth it. Now
all these would become correctness issues. So IMHO the fragility is not
worth the shorter lock hold times. That's why I went for the rule that
read-IO submission is still protected by invalidate_lock to make things
simple.

								Honza
Darrick J. Wong May 14, 2021, 4:17 p.m. UTC | #11
On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > +page cache contents. Usually the filesystem needs to call
> > > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > > +moving from under a page.
> > > > > +
> > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > +against modifications of file data while the operation is running. For blocking
> > > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > > +implementation.
> > > > 
> > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > being hoisted here (increasing order of i_ino)?
> > > > 
> > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > calls in xfs_ilock2_io_mmap in this series.
> > > 
> > > Good question. Technically, I don't think there's real need to establish a
> > > single ordering because locks among different filesystems are never going
> > > to be acquired together (effectively each lock type is local per sb and we
> > > are free to define an ordering for each lock type differently). But to
> > > maintain some sanity I guess having the same locking order for doublelock
> > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > by-ino ordering? So that we don't have to consider two different orders in
> > > xfs_lock_two_inodes()...
> > 
> > I imagine Dave will chime in on this, but I suspect the reason is
> > hysterical raisins^Wreasons.
> 
> It's the locking rules that XFS has used pretty much forever.
> Locking by inode number always guarantees the same locking order of
> two inodes in the same filesystem, regardless of the specific
> in-memory instances of the two inodes.
> 
> e.g. if we lock based on the inode structure address, in one
> instancex, we could get A -> B, then B gets recycled and
> reallocated, then we get B -> A as the locking order for the same
> two inodes.
> 
> That, IMNSHO, is utterly crazy because with non-deterministic inode
> lock ordered like this you can't make consistent locking rules for
> locking the physical inode cluster buffers underlying the inodes in
> the situation where they also need to be locked.

<nod> That's protected by the ILOCK, correct?

> We've been down this path before more than a decade ago when the
> powers that be decreed that inode locking order is to be "by
> structure address" rather than inode number, because "inode number
> is not unique across multiple superblocks".
> 
> I'm not sure that there is anywhere that locks multiple inodes
> across different superblocks, but here we are again....

Hm.  Are there situations where one would want to lock multiple
/mappings/ across different superblocks?  The remapping code doesn't
allow cross-super operations, so ... pipes and splice, maybe?  I don't
remember that code well enough to say for sure.

I've been operating under the assumption that as long as one takes all
the same class of lock at the same time (e.g. all the IOLOCKs, then all
the MMAPLOCKs, then all the ILOCKs, like reflink does) that the
incongruency in locking order rules within a class shouldn't be a
problem.

> > It might simply be time to convert all
> > three XFS inode locks to use the same ordering rules.
> 
> Careful, there lie dragons along that path because of things like
> how the inode cluster buffer operations work - they all assume
> ascending inode number traversal within and across inode cluster
> buffers and hence we do have locking order constraints based on
> inode number...

Fair enough, I'll leave the ILOCK alone. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Jan Kara May 17, 2021, 11:21 a.m. UTC | #12
On Fri 14-05-21 09:17:30, Darrick J. Wong wrote:
> On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > We've been down this path before more than a decade ago when the
> > powers that be decreed that inode locking order is to be "by
> > structure address" rather than inode number, because "inode number
> > is not unique across multiple superblocks".
> > 
> > I'm not sure that there is anywhere that locks multiple inodes
> > across different superblocks, but here we are again....
> 
> Hm.  Are there situations where one would want to lock multiple
> /mappings/ across different superblocks?  The remapping code doesn't
> allow cross-super operations, so ... pipes and splice, maybe?  I don't
> remember that code well enough to say for sure.

Splice and friends work one file at a time. I.e., first they fill a pipe
from the file with ->read_iter, then they flush the pipe to the target file
with ->write_iter. So file locking doesn't get coupled there.

> I've been operating under the assumption that as long as one takes all
> the same class of lock at the same time (e.g. all the IOLOCKs, then all
> the MMAPLOCKs, then all the ILOCKs, like reflink does) that the
> incongruency in locking order rules within a class shouldn't be a
> problem.

That's my understanding as well.

> > > It might simply be time to convert all
> > > three XFS inode locks to use the same ordering rules.
> > 
> > Careful, there lie dragons along that path because of things like
> > how the inode cluster buffer operations work - they all assume
> > ascending inode number traversal within and across inode cluster
> > buffers and hence we do have locking order constraints based on
> > inode number...
> 
> Fair enough, I'll leave the ILOCK alone. :)

OK, so should I change the order for invalidate_lock or shall we just leave
that alone as it is not a practical problem AFAICT.

								Honza
Dave Chinner May 18, 2021, 10:36 p.m. UTC | #13
On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote:
> On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > > +page cache contents. Usually the filesystem needs to call
> > > > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > > > +moving from under a page.
> > > > > > +
> > > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > > +against modifications of file data while the operation is running. For blocking
> > > > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > > > +implementation.
> > > > > 
> > > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > > being hoisted here (increasing order of i_ino)?
> > > > > 
> > > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > > calls in xfs_ilock2_io_mmap in this series.
> > > > 
> > > > Good question. Technically, I don't think there's real need to establish a
> > > > single ordering because locks among different filesystems are never going
> > > > to be acquired together (effectively each lock type is local per sb and we
> > > > are free to define an ordering for each lock type differently). But to
> > > > maintain some sanity I guess having the same locking order for doublelock
> > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > > by-ino ordering? So that we don't have to consider two different orders in
> > > > xfs_lock_two_inodes()...
> > > 
> > > I imagine Dave will chime in on this, but I suspect the reason is
> > > hysterical raisins^Wreasons.
> > 
> > It's the locking rules that XFS has used pretty much forever.
> > Locking by inode number always guarantees the same locking order of
> > two inodes in the same filesystem, regardless of the specific
> > in-memory instances of the two inodes.
> > 
> > e.g. if we lock based on the inode structure address, in one
> > instancex, we could get A -> B, then B gets recycled and
> > reallocated, then we get B -> A as the locking order for the same
> > two inodes.
> > 
> > That, IMNSHO, is utterly crazy because with non-deterministic inode
> > lock ordered like this you can't make consistent locking rules for
> > locking the physical inode cluster buffers underlying the inodes in
> > the situation where they also need to be locked.
> 
> <nod> That's protected by the ILOCK, correct?
> 
> > We've been down this path before more than a decade ago when the
> > powers that be decreed that inode locking order is to be "by
> > structure address" rather than inode number, because "inode number
> > is not unique across multiple superblocks".
> > 
> > I'm not sure that there is anywhere that locks multiple inodes
> > across different superblocks, but here we are again....
> 
> Hm.  Are there situations where one would want to lock multiple
> /mappings/ across different superblocks?  The remapping code doesn't
> allow cross-super operations, so ... pipes and splice, maybe?  I don't
> remember that code well enough to say for sure.

Hmmmm. Doing read IO into a buffer that is mmap()d from another
file, and we take a page fault on it inside the read IO path? We're
copying from a page in one mapping and taking a fault in another
mapping and hence taking the invalidate_lock to populate the page
cache for the second mapping...

I haven't looked closely enough at where the invalidate_lock is held
in the read path to determine if this is an issue, but if it is then
it is also a potential deadlock scenario...

Cheers,

Dave.
Jan Kara May 19, 2021, 10:57 a.m. UTC | #14
On Wed 19-05-21 08:36:37, Dave Chinner wrote:
> On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote:
> > On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > > > +page cache contents. Usually the filesystem needs to call
> > > > > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > > > > +moving from under a page.
> > > > > > > +
> > > > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > > > +against modifications of file data while the operation is running. For blocking
> > > > > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > > > > +implementation.
> > > > > > 
> > > > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > > > being hoisted here (increasing order of i_ino)?
> > > > > > 
> > > > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > > > calls in xfs_ilock2_io_mmap in this series.
> > > > > 
> > > > > Good question. Technically, I don't think there's real need to establish a
> > > > > single ordering because locks among different filesystems are never going
> > > > > to be acquired together (effectively each lock type is local per sb and we
> > > > > are free to define an ordering for each lock type differently). But to
> > > > > maintain some sanity I guess having the same locking order for doublelock
> > > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > > > by-ino ordering? So that we don't have to consider two different orders in
> > > > > xfs_lock_two_inodes()...
> > > > 
> > > > I imagine Dave will chime in on this, but I suspect the reason is
> > > > hysterical raisins^Wreasons.
> > > 
> > > It's the locking rules that XFS has used pretty much forever.
> > > Locking by inode number always guarantees the same locking order of
> > > two inodes in the same filesystem, regardless of the specific
> > > in-memory instances of the two inodes.
> > > 
> > > e.g. if we lock based on the inode structure address, in one
> > > instancex, we could get A -> B, then B gets recycled and
> > > reallocated, then we get B -> A as the locking order for the same
> > > two inodes.
> > > 
> > > That, IMNSHO, is utterly crazy because with non-deterministic inode
> > > lock ordered like this you can't make consistent locking rules for
> > > locking the physical inode cluster buffers underlying the inodes in
> > > the situation where they also need to be locked.
> > 
> > <nod> That's protected by the ILOCK, correct?
> > 
> > > We've been down this path before more than a decade ago when the
> > > powers that be decreed that inode locking order is to be "by
> > > structure address" rather than inode number, because "inode number
> > > is not unique across multiple superblocks".
> > > 
> > > I'm not sure that there is anywhere that locks multiple inodes
> > > across different superblocks, but here we are again....
> > 
> > Hm.  Are there situations where one would want to lock multiple
> > /mappings/ across different superblocks?  The remapping code doesn't
> > allow cross-super operations, so ... pipes and splice, maybe?  I don't
> > remember that code well enough to say for sure.
> 
> Hmmmm. Doing read IO into a buffer that is mmap()d from another
> file, and we take a page fault on it inside the read IO path? We're
> copying from a page in one mapping and taking a fault in another
> mapping and hence taking the invalidate_lock to populate the page
> cache for the second mapping...
> 
> I haven't looked closely enough at where the invalidate_lock is held
> in the read path to determine if this is an issue, but if it is then
> it is also a potential deadlock scenario...

I was careful enough to avoid this problem - we first bring pages into
pages cache (under invalidate_lock), then drop invalidate lock, just
keep page refs, and copy page cache content into the buffer (which may grab
invalidate_lock from another mapping as you say).

								Honza
diff mbox series

Patch

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 4ed2b22bd0a8..b73666a3da42 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -271,19 +271,19 @@  prototypes::
 locking rules:
 	All except set_page_dirty and freepage may block
 
-======================	======================== =========
-ops			PageLocked(page)	 i_rwsem
-======================	======================== =========
+======================	======================== =========	===============
+ops			PageLocked(page)	 i_rwsem	invalidate_lock
+======================	======================== =========	===============
 writepage:		yes, unlocks (see below)
-readpage:		yes, unlocks
+readpage:		yes, unlocks				shared
 writepages:
 set_page_dirty		no
-readahead:		yes, unlocks
-readpages:		no
+readahead:		yes, unlocks				shared
+readpages:		no					shared
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
-invalidatepage:		yes
+invalidatepage:		yes					exclusive
 releasepage:		yes
 freepage:		yes
 direct_IO:
@@ -378,7 +378,10 @@  keep it that way and don't breed new callers.
 ->invalidatepage() is called when the filesystem must attempt to drop
 some or all of the buffers from the page when it is being truncated. It
 returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
+block_invalidatepage() instead. The filesystem should exclusively acquire
+invalidate_lock before invalidating page cache in truncate / hole punch path (and
+thus calling into ->invalidatepage) to block races between page cache
+invalidation and page cache filling functions (fault, read, ...).
 
 ->releasepage() is called when the kernel is about to try to drop the
 buffers from the page in preparation for freeing it.  It returns zero to
@@ -573,6 +576,27 @@  in sys_read() and friends.
 the lease within the individual filesystem to record the result of the
 operation
 
+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. VFS provides mapping->invalidate_lock for this
+and acquires it in shared mode in paths loading pages from disk
+(filemap_fault(), filemap_read(), readahead paths). The filesystem is
+responsible for taking this lock in its fallocate implementation and generally
+whenever the page cache contents needs to be invalidated because a block is
+moving from under a page.
+
+->copy_file_range and ->remap_file_range implementations need to serialize
+against modifications of file data while the operation is running. For blocking
+changes through write(2) and similar operations inode->i_rwsem can be used. For
+blocking changes through memory mapping, the filesystem can use
+mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
+implementation.
+
 dquot_operations
 ================
 
@@ -634,9 +658,9 @@  access:		yes
 to be faulted in. The filesystem must find and return the page associated
 with the passed in "pgoff" in the vm_fault structure. If it is possible that
 the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
-subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
-locked. The VM will unlock the page.
+invalidate_lock, then ensure the page is not already truncated (invalidate_lock
+will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
+page locked. The VM will unlock the page.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
@@ -647,12 +671,14 @@  page table entry. Pointer to entry associated with the page is passed in
 "pte" field in vm_fault structure. Pointers to entries for other offsets
 should be calculated relative to "pte".
 
-->page_mkwrite() is called when a previously read-only pte is
-about to become writeable. The filesystem again must ensure that there are
-no truncate/invalidate races, and then return with the page locked. If
-the page has been truncated, the filesystem should not look up a new page
-like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
-will cause the VM to retry the fault.
+->page_mkwrite() is called when a previously read-only pte is about to become
+writeable. The filesystem again must ensure that there are no
+truncate/invalidate races or races with operations such as ->remap_file_range
+or ->copy_file_range, and then return with the page locked. Usually
+mapping->invalidate_lock is suitable for proper serialization. If the page has
+been truncated, the filesystem should not look up a new page like the ->fault()
+handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
+retry the fault.
 
 ->pfn_mkwrite() is the same as page_mkwrite but when the pte is
 VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
diff --git a/fs/inode.c b/fs/inode.c
index c93500d84264..63a814367118 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -190,6 +190,9 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->private_data = NULL;
 	mapping->writeback_index = 0;
+	init_rwsem(&mapping->invalidate_lock);
+	lockdep_set_class(&mapping->invalidate_lock,
+			  &sb->s_type->invalidate_lock_key);
 	inode->i_private = NULL;
 	inode->i_mapping = mapping;
 	INIT_HLIST_HEAD(&inode->i_dentry);	/* buggered by rcu freeing */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..897238d9f1e0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -436,6 +436,10 @@  int pagecache_write_end(struct file *, struct address_space *mapping,
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
  * @i_pages: Cached pages.
+ * @invalidate_lock: Guards coherency between page cache contents and
+ *   file offset->disk block mappings in the filesystem during invalidates.
+ *   It is also used to block modification of page cache contents through
+ *   memory mappings.
  * @gfp_mask: Memory allocation flags to use for allocating pages.
  * @i_mmap_writable: Number of VM_SHARED mappings.
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
@@ -453,6 +457,7 @@  int pagecache_write_end(struct file *, struct address_space *mapping,
 struct address_space {
 	struct inode		*host;
 	struct xarray		i_pages;
+	struct rw_semaphore	invalidate_lock;
 	gfp_t			gfp_mask;
 	atomic_t		i_mmap_writable;
 #ifdef CONFIG_READ_ONLY_THP_FOR_FS
@@ -2488,6 +2493,7 @@  struct file_system_type {
 
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
+	struct lock_class_key invalidate_lock_key;
 	struct lock_class_key i_mutex_dir_key;
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index ba1068a1837f..4d9ec4c6cc34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -77,7 +77,8 @@ 
  *        ->i_pages lock
  *
  *  ->i_rwsem
- *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
+ *    ->invalidate_lock		(acquired by fs in truncate path)
+ *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
  *
  *  ->mmap_lock
  *    ->i_mmap_rwsem
@@ -85,7 +86,8 @@ 
  *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
  *
  *  ->mmap_lock
- *    ->lock_page		(access_process_vm)
+ *    ->invalidate_lock		(filemap_fault)
+ *      ->lock_page		(filemap_fault, access_process_vm)
  *
  *  ->i_rwsem			(generic_perform_write)
  *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
@@ -2368,20 +2370,30 @@  static int filemap_update_page(struct kiocb *iocb,
 {
 	int error;
 
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!down_read_trylock(&mapping->invalidate_lock))
+			return -EAGAIN;
+	} else {
+		down_read(&mapping->invalidate_lock);
+	}
+
 	if (!trylock_page(page)) {
+		error = -EAGAIN;
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
-			return -EAGAIN;
+			goto unlock_mapping;
 		if (!(iocb->ki_flags & IOCB_WAITQ)) {
+			up_read(&mapping->invalidate_lock);
 			put_and_wait_on_page_locked(page, TASK_KILLABLE);
 			return AOP_TRUNCATED_PAGE;
 		}
 		error = __lock_page_async(page, iocb->ki_waitq);
 		if (error)
-			return error;
+			goto unlock_mapping;
 	}
 
+	error = AOP_TRUNCATED_PAGE;
 	if (!page->mapping)
-		goto truncated;
+		goto unlock;
 
 	error = 0;
 	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page))
@@ -2392,15 +2404,13 @@  static int filemap_update_page(struct kiocb *iocb,
 		goto unlock;
 
 	error = filemap_read_page(iocb->ki_filp, mapping, page);
-	if (error == AOP_TRUNCATED_PAGE)
-		put_page(page);
-	return error;
-truncated:
-	unlock_page(page);
-	put_page(page);
-	return AOP_TRUNCATED_PAGE;
+	goto unlock_mapping;
 unlock:
 	unlock_page(page);
+unlock_mapping:
+	up_read(&mapping->invalidate_lock);
+	if (error == AOP_TRUNCATED_PAGE)
+		put_page(page);
 	return error;
 }
 
@@ -2415,6 +2425,19 @@  static int filemap_create_page(struct file *file,
 	if (!page)
 		return -ENOMEM;
 
+	/*
+	 * Protect against truncate / hole punch. Grabbing invalidate_lock here
+	 * assures we cannot instantiate and bring uptodate new pagecache pages
+	 * after evicting page cache during truncate and before actually
+	 * freeing blocks.  Note that we could release invalidate_lock after
+	 * inserting the page into page cache as the locked page would then be
+	 * enough to synchronize with hole punching. But there are code paths
+	 * such as filemap_update_page() filling in partially uptodate pages or
+	 * ->readpages() that need to hold invalidate_lock while mapping blocks
+	 * for IO so let's hold the lock here as well to keep locking rules
+	 * simple.
+	 */
+	down_read(&mapping->invalidate_lock);
 	error = add_to_page_cache_lru(page, mapping, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2426,9 +2449,11 @@  static int filemap_create_page(struct file *file,
 	if (error)
 		goto error;
 
+	up_read(&mapping->invalidate_lock);
 	pagevec_add(pvec, page);
 	return 0;
 error:
+	up_read(&mapping->invalidate_lock);
 	put_page(page);
 	return error;
 }
@@ -2988,6 +3013,13 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 		fpin = do_sync_mmap_readahead(vmf);
+	}
+
+	/*
+	 * See comment in filemap_create_page() why we need invalidate_lock
+	 */
+	down_read(&mapping->invalidate_lock);
+	if (!page) {
 retry_find:
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
@@ -2995,6 +3027,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!page) {
 			if (fpin)
 				goto out_retry;
+			up_read(&mapping->invalidate_lock);
 			return VM_FAULT_OOM;
 		}
 	}
@@ -3035,9 +3068,11 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(offset >= max_off)) {
 		unlock_page(page);
 		put_page(page);
+		up_read(&mapping->invalidate_lock);
 		return VM_FAULT_SIGBUS;
 	}
 
+	up_read(&mapping->invalidate_lock);
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
@@ -3056,6 +3091,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
+	up_read(&mapping->invalidate_lock);
 
 	return VM_FAULT_SIGBUS;
 
@@ -3067,6 +3103,7 @@  vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (page)
 		put_page(page);
+	up_read(&mapping->invalidate_lock);
 	if (fpin)
 		fput(fpin);
 	return ret | VM_FAULT_RETRY;
@@ -3437,6 +3474,8 @@  static struct page *do_read_cache_page(struct address_space *mapping,
  *
  * If the page does not get brought uptodate, return -EIO.
  *
+ * The function expects mapping->invalidate_lock to be already held.
+ *
  * Return: up to date page on success, ERR_PTR() on failure.
  */
 struct page *read_cache_page(struct address_space *mapping,
@@ -3460,6 +3499,8 @@  EXPORT_SYMBOL(read_cache_page);
  *
  * If the page does not get brought uptodate, return -EIO.
  *
+ * The function expects mapping->invalidate_lock to be already held.
+ *
  * Return: up to date page on success, ERR_PTR() on failure.
  */
 struct page *read_cache_page_gfp(struct address_space *mapping,
diff --git a/mm/readahead.c b/mm/readahead.c
index d589f147f4c2..9785c54107bb 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -192,6 +192,7 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 */
 	unsigned int nofs = memalloc_nofs_save();
 
+	down_read(&mapping->invalidate_lock);
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -236,6 +237,7 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 * will then handle the error.
 	 */
 	read_pages(ractl, &page_pool, false);
+	up_read(&mapping->invalidate_lock);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
diff --git a/mm/rmap.c b/mm/rmap.c
index a35cbbbded0d..76d33c3b8ae6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -22,24 +22,25 @@ 
  *
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
- *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
- *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *           anon_vma->rwsem
- *             mm->page_table_lock or pte_lock
- *               swap_lock (in swap_duplicate, swap_info_get)
- *                 mmlist_lock (in mmput, drain_mmlist and others)
- *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
- *                     i_pages lock (widely used)
- *                       lruvec->lru_lock (in lock_page_lruvec_irq)
- *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- *                   sb_lock (within inode_lock in fs/fs-writeback.c)
- *                   i_pages lock (widely used, in set_page_dirty,
- *                             in arch-dependent flush_dcache_mmap_lock,
- *                             within bdi.wb->list_lock in __sync_single_inode)
+ *     mapping->invalidate_lock (in filemap_fault)
+ *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *           mapping->i_mmap_rwsem
+ *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *             anon_vma->rwsem
+ *               mm->page_table_lock or pte_lock
+ *                 swap_lock (in swap_duplicate, swap_info_get)
+ *                   mmlist_lock (in mmput, drain_mmlist and others)
+ *                   mapping->private_lock (in __set_page_dirty_buffers)
+ *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
+ *                       i_pages lock (widely used)
+ *                         lruvec->lru_lock (in lock_page_lruvec_irq)
+ *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ *                     sb_lock (within inode_lock in fs/fs-writeback.c)
+ *                     i_pages lock (widely used, in set_page_dirty,
+ *                               in arch-dependent flush_dcache_mmap_lock,
+ *                               within bdi.wb->list_lock in __sync_single_inode)
  *
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
diff --git a/mm/truncate.c b/mm/truncate.c
index 57a618c4a0d6..93bde2741e0e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -415,7 +415,7 @@  EXPORT_SYMBOL(truncate_inode_pages_range);
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  *
- * Called under (and serialised by) inode->i_rwsem.
+ * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_rwsem.
  *
  * Note: When this function returns, there can be a page in the process of
  * deletion (inside __delete_from_page_cache()) in the specified range.  Thus