[08/10] dax: New fault locking

Message ID	1458566575-28063-9-git-send-email-jack@suse.cz (mailing list archive)
State	Not Applicable
Headers	Return-Path: <linux-nvdimm-bounces@lists.01.org> From: Jan Kara <jack@suse.cz> To: linux-fsdevel@vger.kernel.org Subject: [PATCH 08/10] dax: New fault locking Date: Mon, 21 Mar 2016 14:22:53 +0100 Message-Id: <1458566575-28063-9-git-send-email-jack@suse.cz> In-Reply-To: <1458566575-28063-1-git-send-email-jack@suse.cz> References: <1458566575-28063-1-git-send-email-jack@suse.cz> Precedence: list Cc: Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org, NeilBrown <neilb@suse.com>, "Wilcox, Matthew R" <matthew.r.wilcox@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Commit Message

Jan Kara March 21, 2016, 1:22 p.m. UTC

Currently DAX page fault locking is racy.

CPU0 (write fault)		CPU1 (read fault)

__dax_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
				  get_block(inode, block, &bh, 0)
				    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
				  if (!buffer_mapped(&bh))
				    if (vmf->flags & FAULT_FLAG_WRITE) {
				    } else {
				      dax_load_hole();
				    }
  dax_insert_mapping()

And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear dirty tag in the radix tree. We can always
end up with clean radix tree and dirty data in CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings acting very similarly to page lock and thus
synchronizing properly faults against the same mapping index. The same
lock can later be used to avoid races when clearing radix tree dirty
tag.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/dax.h |   1 +
 mm/truncate.c       |  62 ++++---
 3 files changed, 396 insertions(+), 167 deletions(-)
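The interleaving described in the commit message can be replayed in a toy userspace model. This is only a sketch: the `fault_model` structure, the `model_*` helpers and the three slot states are hypothetical names invented for illustration, and the -EIO condition is a simplification of the type check in dax_radix_entry(), not the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one mapping index; not the kernel's data layout. */
enum slot_state { SLOT_EMPTY, SLOT_HOLE_PAGE, SLOT_DAX_ENTRY };

struct fault_model {
	bool block_mapped;	/* has the filesystem allocated a block? */
	enum slot_state slot;	/* what the radix tree holds at this index */
};

/* Models get_block(..., 0): report the mapping state, never allocate. */
static bool model_get_block(const struct fault_model *m)
{
	assert(m != NULL);
	return m->block_mapped;
}

/* Models get_block(..., 1) on the write-fault path. */
static void model_alloc_block(struct fault_model *m)
{
	m->block_mapped = true;
}

/* Models dax_load_hole(): a read fault that saw no block adds a hole page. */
static void model_load_hole(struct fault_model *m, bool saw_mapped)
{
	if (!saw_mapped && m->slot == SLOT_EMPTY)
		m->slot = SLOT_HOLE_PAGE;
}

/* Models dax_insert_mapping(): a page nobody checked for is unexpected. */
static int model_insert_mapping(struct fault_model *m, bool saw_page)
{
	if (m->slot == SLOT_HOLE_PAGE && !saw_page)
		return -EIO;
	m->slot = SLOT_DAX_ENTRY;
	return 0;
}

/* Replays the unsynchronized interleaving from the commit message. */
static int replay_racy_interleaving(void)
{
	struct fault_model m = { false, SLOT_EMPTY };
	bool cpu0_mapped = model_get_block(&m);	/* CPU0: not mapped */
	bool cpu1_mapped = model_get_block(&m);	/* CPU1: not mapped */
	bool cpu0_saw_page;

	if (!cpu0_mapped)
		model_alloc_block(&m);			/* CPU0 allocates */
	cpu0_saw_page = (m.slot == SLOT_HOLE_PAGE);	/* "if (page) -> no" */
	model_load_hole(&m, cpu1_mapped);		/* CPU1 adds hole page */
	return model_insert_mapping(&m, cpu0_saw_page);	/* CPU0 fails */
}

/* Same two faults, but the write fault finishes before the read fault. */
static int replay_serialized(void)
{
	struct fault_model m = { false, SLOT_EMPTY };
	int err;

	if (!model_get_block(&m))
		model_alloc_block(&m);
	err = model_insert_mapping(&m, m.slot == SLOT_HOLE_PAGE);
	model_load_hole(&m, model_get_block(&m));	/* sees the block */
	return err;
}
```

In this model replay_racy_interleaving() ends with -EIO while replay_serialized() succeeds, which is the behaviour a per-index entry lock is meant to enforce.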

Comments

Ross Zwisler March 29, 2016, 9:57 p.m. UTC | #1

On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> Currently DAX page fault locking is racy.
> 
> CPU0 (write fault)		CPU1 (read fault)
> 
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
> 
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> 
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
> 
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I've got lots of little comments, but from my point of view this seems like
it is looking pretty good.  I agree with the choice to put this in dax.c as
opposed to radix-tree.c or something - this seems very DAX specific for now.

> ---
>  fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
>  include/linux/dax.h |   1 +
>  mm/truncate.c       |  62 ++++---
>  3 files changed, 396 insertions(+), 167 deletions(-)
<>
> +static inline int slot_locked(void **v)
> +{
> +	unsigned long l = *(unsigned long *)v;
> +	return l & DAX_ENTRY_LOCK;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= DAX_ENTRY_LOCK);
> +}
> +
> +static inline void *unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
> +}

For the above three helpers I think we could do with better parameter and
variable naming so it's clearer what's going on.  s/v/slot/ and s/l/entry/ ?

Also, for many of these new functions we need to be holding
mapping->tree_lock - can we quickly document that with comments?
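A minimal userspace sketch of what the renamed helpers might look like, with the locking requirement documented in comments. The value of DAX_ENTRY_LOCK below is an assumption made for illustration (its definition is not part of the quoted hunk), and the sketch uses uintptr_t conversions rather than the pointer casts from the patch so that it compiles cleanly outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define DAX_ENTRY_LOCK ((uintptr_t)0x4)

/*
 * All three helpers expect the caller to hold mapping->tree_lock.
 * That lock is not modelled in this sketch.
 */
static inline int slot_locked(void **slot)
{
	uintptr_t entry;

	assert(slot != NULL);
	entry = (uintptr_t)*slot;
	return (entry & DAX_ENTRY_LOCK) != 0;
}

/* Set the lock bit in the slot and return the new entry value. */
static inline void *lock_slot(void **slot)
{
	uintptr_t entry;

	assert(slot != NULL);
	entry = (uintptr_t)*slot | DAX_ENTRY_LOCK;
	*slot = (void *)entry;
	return *slot;
}

/* Clear the lock bit in the slot and return the new entry value. */
static inline void *unlock_slot(void **slot)
{
	uintptr_t entry;

	assert(slot != NULL);
	entry = (uintptr_t)*slot & ~DAX_ENTRY_LOCK;
	*slot = (void *)entry;
	return *slot;
}
```

With 'slot' naming the location in the tree and 'entry' naming the value stored there, the read-modify-write in each helper is easier to follow than with 'v' and 'l'.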

> +/*
> + * Lookup entry in radix tree, wait for it to become unlocked if it is
> + * exceptional entry and return.
> + *
> + * The function must be called with mapping->tree_lock held.
> + */
> +static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
> +					   pgoff_t index, void ***slotp)
> +{
> +	void *ret, **slot;
> +	struct wait_exceptional_entry_queue wait;

This should probably be named 'ewait' to be consistent with
wake_exceptional_entry_func(), and so we have a different and consistent
naming between our struct wait_exceptional_entry_queue and wait_queue_t
variables.

> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_exceptional_entry_func;
> +	wait.key.root = &mapping->page_tree;
> +	wait.key.index = index;
> +
> +	for (;;) {
> +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +					  &slot);
> +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		    !slot_locked(slot)) {
> +			if (slotp)
> +				*slotp = slot;
> +			return ret;
> +		}
> +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);

Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
zombie?

> +		spin_unlock_irq(&mapping->tree_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wait);
> +		spin_lock_irq(&mapping->tree_lock);
> +	}
> +}
> +
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

What is the benefit to preloading the radix tree?  It looks like we have to
drop the mapping->tree_lock, deal with an error, regrab the lock and then deal
with a possible collision with an entry that was inserted while we didn't hold
the lock.

Can we just try and insert it, then if it fails with -ENOMEM we just do our
normal error path, dropping the tree_lock and returning the error?

> +		if (err)
> +			return ERR_PTR(err);
> +		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
> +		spin_lock_irq(&mapping->tree_lock);
> +		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		radix_tree_preload_end();
> +		if (err) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			/* Someone already created the entry? */
> +			if (err == -EEXIST)
> +				goto restart;
> +			return ERR_PTR(err);
> +		}
> +		/* Good, we have inserted empty locked entry into the tree. */
> +		mapping->nrexceptional++;
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return ret;
> +	}
> +	/* Normal page in radix tree? */
> +	if (!radix_tree_exceptional_entry(ret)) {
> +		struct page *page = ret;
> +
> +		page_cache_get(page);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		lock_page(page);
> +		/* Page got truncated? Retry... */
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto restart;
> +		}
> +		return page;
> +	}
> +	ret = lock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	if (WARN_ON_ONCE(!slot_locked(slot))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}

It may be worth combining these two WARN_ON_ONCE() error cases for brevity,
since they are both insanity conditions.
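One way to picture the combined check is as a single predicate over the entry value. The sketch below is a userspace model only: RADIX_TREE_EXCEPTIONAL_ENTRY is mirrored here with the value 2 used by the radix tree code of this era, DAX_ENTRY_LOCK is an assumed bit, and `entry_ok_to_unlock()` is a hypothetical name that does not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrored for the model; in the kernel this lives in radix-tree.h. */
#define RADIX_TREE_EXCEPTIONAL_ENTRY ((uintptr_t)0x2)
/* Assumed bit value, for illustration only. */
#define DAX_ENTRY_LOCK ((uintptr_t)0x4)

/*
 * Hypothetical predicate: true only for a present, exceptional,
 * locked entry. Anything else is an insanity condition for unlock.
 */
static bool entry_ok_to_unlock(const void *entry)
{
	uintptr_t v = (uintptr_t)entry;

	if (entry == NULL)
		return false;
	if (!(v & RADIX_TREE_EXCEPTIONAL_ENTRY))
		return false;
	return (v & DAX_ENTRY_LOCK) != 0;
}

/* Model of the unlock path with one combined sanity check. */
static int model_unlock_entry(void **slot)
{
	assert(slot != NULL);
	if (!entry_ok_to_unlock(*slot))
		return -1;	/* kernel code would WARN_ON_ONCE() here */
	*slot = (void *)((uintptr_t)*slot & ~DAX_ENTRY_LOCK);
	return 0;
}
```

In the kernel the three tests would sit inside one WARN_ON_ONCE() followed by a single unlock-and-return path.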

> +	unlock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	if (waitqueue_active(wq)) {
> +		struct exceptional_entry_key key;
> +
> +		key.root = &mapping->page_tree;
> +		key.index = index;
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}

The above if() block is repeated 3 times in the next few functions with small
variations (the third argument to __wake_up()).  Perhaps it should be pulled
out into a helper?
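A rough userspace model of such a helper, assuming the only variation between the three call sites is whether one waiter or all waiters are woken. The names `wake_entry_waiters`, `struct waiter` and `struct entry_key` are hypothetical; the array stands in for one hashed wait queue, and the key comparison imitates what wake_exceptional_entry_func() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct exceptional_entry_key. */
struct entry_key {
	const void *root;
	unsigned long index;
};

/* Stand-in for one queued waiter on a shared (hashed) wait queue. */
struct waiter {
	struct entry_key key;
	bool woken;
};

/*
 * Hypothetical helper: wake waiters whose key matches (root, index).
 * With wake_all false, stop after the first match, like passing 1 as
 * the third argument of __wake_up(); with wake_all true, wake every
 * match, like passing 0. Returns the number of waiters woken.
 */
static int wake_entry_waiters(struct waiter *queue, size_t nr,
			      const void *root, unsigned long index,
			      bool wake_all)
{
	int woken = 0;
	size_t i;

	assert(queue != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (queue[i].woken)
			continue;
		/* Unrelated entries can hash to the same queue. */
		if (queue[i].key.root != root || queue[i].key.index != index)
			continue;
		queue[i].woken = true;
		woken++;
		if (!wake_all)
			break;
	}
	return woken;
}
```

Each of the three call sites would then collapse to a single call, with the delete path passing wake_all = true.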

> +static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
> +			       void *entry, sector_t sector, bool dirty,
> +			       gfp_t gfp_mask)

This argument list is getting pretty long, and our one caller gets lots of
these guys out of the VMF.  Perhaps we could just pass in the VMF and extract
the bits ourselves?

>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	pgoff_t pmd_index = DAX_PMD_INDEX(index);
> -	int type, error = 0;
> -	void *entry;
> +	int error = 0;
> +	bool hole_fill = false;
> +	void *ret;

Just a nit, but I find the use of 'ret' a bit confusing, since it's not a
return value that we got from anywhere, it's an entry that we set up, insert
and then return to our caller.  We use 'error' to capture return values from
calls this function makes.  Maybe this would be clearer as "new_entry" or
something?

> -	WARN_ON_ONCE(pmd_entry && !dirty);
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	spin_lock_irq(&mapping->tree_lock);
> -
> -	entry = radix_tree_lookup(page_tree, pmd_index);
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> -		index = pmd_index;
> -		goto dirty;
> +	/* Replacing hole page with block mapping? */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		hole_fill = true;
> +		error = radix_tree_preload(gfp_mask);
> +		if (error)
> +			return ERR_PTR(error);
>  	}
>  
> -	entry = radix_tree_lookup(page_tree, index);
> -	if (entry) {
> -		type = RADIX_DAX_TYPE(entry);
> -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> -					type != RADIX_DAX_PMD)) {
> -			error = -EIO;
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> +		       DAX_ENTRY_LOCK);
> +	if (hole_fill) {
> +		__delete_from_page_cache(entry, NULL);
> +		error = radix_tree_insert(page_tree, index, ret);
> +		if (error) {
> +			ret = ERR_PTR(error);
>  			goto unlock;
>  		}
> +		mapping->nrexceptional++;
> +	} else {
> +		void **slot;
> +		void *ret2;
>  
> -		if (!pmd_entry || type == RADIX_DAX_PMD)
> -			goto dirty;
> -
> -		/*
> -		 * We only insert dirty PMD entries into the radix tree.  This
> -		 * means we don't need to worry about removing a dirty PTE
> -		 * entry and inserting a clean PMD entry, thus reducing the
> -		 * range we would flush with a follow-up fsync/msync call.
> -		 */
> -		radix_tree_delete(&mapping->page_tree, index);
> -		mapping->nrexceptional--;
> -	}
> -
> -	if (sector == NO_SECTOR) {
> -		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> -		 * fault raced against a hole punch operation.  If this
> -		 * happens the pte that was hole punched will have been
> -		 * unmapped and the radix tree entry will have been removed by
> -		 * the time we are called, but the call will still happen.  We
> -		 * will return all the way up to wp_pfn_shared(), where the
> -		 * pte_same() check will fail, eventually causing page fault
> -		 * to be retried by the CPU.
> -		 */
> -		goto unlock;
> +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);

You don't need ret2.  You can just compare 'entry' with '*slot' - see
dax_writeback_one() for an example.

> +		WARN_ON_ONCE(ret2 != entry);
> +		radix_tree_replace_slot(slot, ret);
>  	}
> -
> -	error = radix_tree_insert(page_tree, index,
> -			RADIX_DAX_ENTRY(sector, pmd_entry));
> -	if (error)
> -		goto unlock;
> -
> -	mapping->nrexceptional++;
> - dirty:
>  	if (dirty)
>  		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
>   unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return error;
> +	if (hole_fill)
> +		radix_tree_preload_end();
> +	return ret;
>  }
>  
>  static int dax_writeback_one(struct block_device *bdev,
> @@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> -static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +static int dax_insert_mapping(struct address_space *mapping,
> +			struct buffer_head *bh, void *entry,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	struct address_space *mapping = inode->i_mapping;
>  	struct block_device *bdev = bh->b_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = to_sector(bh, inode),
> +		.sector = to_sector(bh, mapping->host),
>  		.size = bh->b_size,
>  	};
>  	int error;
> +	void *ret;
>  
>  	i_mmap_lock_read(mapping);
>  
> @@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> -	if (error)
> +	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
> +			        vmf->flags & FAULT_FLAG_WRITE,
> +			        vmf->gfp_mask & ~__GFP_HIGHMEM);

The spacing before the parameters to dax_mapping_entry() is messed up & makes
checkpatch grumpy:

ERROR: code indent should use tabs where possible
#488: FILE: fs/dax.c:812:
+^I^I^I        vmf->flags & FAULT_FLAG_WRITE,$

ERROR: code indent should use tabs where possible
#489: FILE: fs/dax.c:813:
+^I^I^I        vmf->gfp_mask & ~__GFP_HIGHMEM);$

There are a few other checkpatch warnings as well that should probably be
addressed.

> +	if (IS_ERR(ret)) {
> +		error = PTR_ERR(ret);
>  		goto out;
> +	}
> +	/* Have we replaced hole page? Unmap and free it. */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +				    PAGE_CACHE_SIZE, 0);
> +		unlock_page(entry);
> +		page_cache_release(entry);
> +	}
> +	entry = ret;
>  
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> -
>   out:
>  	i_mmap_unlock_read(mapping);
> -
> +	put_locked_mapping_entry(mapping, vmf->pgoff, entry);

Hmm....this entry was locked by our parent (__dax_fault()), and is released by
our parent in error cases that go through 'unlock_entry:'.  For symmetry it's
probably better to move this call up to our parent as well.

Jan Kara March 31, 2016, 4:27 p.m. UTC | #2

Thanks for the review, Ross! I have implemented your comments unless I state
otherwise here.

On Tue 29-03-16 15:57:32, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> > +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +
> > +	init_wait(&wait.wait);
> > +	wait.wait.func = wake_exceptional_entry_func;
> > +	wait.key.root = &mapping->page_tree;
> > +	wait.key.index = index;
> > +
> > +	for (;;) {
> > +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> > +					  &slot);
> > +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> > +		    !slot_locked(slot)) {
> > +			if (slotp)
> > +				*slotp = slot;
> > +			return ret;
> > +		}
> > +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
> 
> Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
> zombie?

Well, and do you want to deal with signal handling all the way up? The wait
should be pretty short given the nature of pmem so I didn't see a big point
in bothering with signal handling...

> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		schedule();
> > +		finish_wait(wq, &wait.wait);
> > +		spin_lock_irq(&mapping->tree_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * Find radix tree entry at given index. If it points to a page, return with
> > + * the page locked. If it points to the exceptional entry, return with the
> > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > + * create empty exceptional entry for the index and return with it locked.
> > + *
> > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > + * persistent memory the benefit is doubtful. We can add that later if we can
> > + * show it helps.
> > + */
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > +	void *ret, **slot;
> > +
> > +restart:
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> > +	/* No entry for given index? Make sure radix tree is big enough. */
> > +	if (!ret) {
> > +		int err;
> > +
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		err = radix_tree_preload(
> > +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> 
> What is the benefit to preloading the radix tree?  It looks like we have
> to drop the mapping->tree_lock, deal with an error, regrab the lock and
> then deal with a possible collision with an entry that was inserted while
> we didn't hold the lock.
> 
> Can we just try and insert it, then if it fails with -ENOMEM we just do
> our normal error path, dropping the tree_lock and returning the error?

If we don't preload, the allocations will happen with GFP_ATOMIC. That
should be avoided if possible since atomic allocations are pretty
restricted. So basically all of the pagecache code first allocates the nodes
it may need before acquiring locks and then uses these nodes later, and I
have mirrored that behavior. Note that we take the hit for dropping the lock
only if we really need to allocate a new radix tree node, so about once per
64 new entries. So it is not too bad.
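The "once per 64" figure can be checked with a little arithmetic. The sketch below assumes 64-slot radix tree nodes (a map shift of 6, the usual non-small configuration) and sequential indices, and counts only leaf nodes; `MODEL_MAP_SHIFT` and `leaf_nodes_touched()` are names made up for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Assumed node fan-out: 1 << 6 = 64 slots per radix tree node. */
#define MODEL_MAP_SHIFT 6

/*
 * Number of distinct leaf nodes covering the index range
 * [first, first + count). Each distinct leaf needs one node
 * allocation the first time an index inside it is inserted.
 */
static unsigned long leaf_nodes_touched(unsigned long first,
					unsigned long count)
{
	unsigned long last;

	if (count == 0)
		return 0;
	/* Guard against wrap-around in first + count - 1. */
	assert(first <= ULONG_MAX - (count - 1));
	last = first + count - 1;
	return (last >> MODEL_MAP_SHIFT) - (first >> MODEL_MAP_SHIFT) + 1;
}
```

For example, inserting 640 consecutive entries starting at index 0 touches 10 leaf nodes under these assumptions, so roughly one insertion in 64 needs a fresh node.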

> > -	WARN_ON_ONCE(pmd_entry && !dirty);
> >  	if (dirty)
> >  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >  
> > -	spin_lock_irq(&mapping->tree_lock);
> > -
> > -	entry = radix_tree_lookup(page_tree, pmd_index);
> > -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> > -		index = pmd_index;
> > -		goto dirty;
> > +	/* Replacing hole page with block mapping? */
> > +	if (!radix_tree_exceptional_entry(entry)) {
> > +		hole_fill = true;
> > +		error = radix_tree_preload(gfp_mask);
> > +		if (error)
> > +			return ERR_PTR(error);
> >  	}
> >  
> > -	entry = radix_tree_lookup(page_tree, index);
> > -	if (entry) {
> > -		type = RADIX_DAX_TYPE(entry);
> > -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > -					type != RADIX_DAX_PMD)) {
> > -			error = -EIO;
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > +		       DAX_ENTRY_LOCK);
> > +	if (hole_fill) {
> > +		__delete_from_page_cache(entry, NULL);
> > +		error = radix_tree_insert(page_tree, index, ret);
> > +		if (error) {
> > +			ret = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> > +		mapping->nrexceptional++;
> > +	} else {
> > +		void **slot;
> > +		void *ret2;
> >  
> > -		if (!pmd_entry || type == RADIX_DAX_PMD)
> > -			goto dirty;
> > -
> > -		/*
> > -		 * We only insert dirty PMD entries into the radix tree.  This
> > -		 * means we don't need to worry about removing a dirty PTE
> > -		 * entry and inserting a clean PMD entry, thus reducing the
> > -		 * range we would flush with a follow-up fsync/msync call.
> > -		 */
> > -		radix_tree_delete(&mapping->page_tree, index);
> > -		mapping->nrexceptional--;
> > -	}
> > -
> > -	if (sector == NO_SECTOR) {
> > -		/*
> > -		 * This can happen during correct operation if our pfn_mkwrite
> > -		 * fault raced against a hole punch operation.  If this
> > -		 * happens the pte that was hole punched will have been
> > -		 * unmapped and the radix tree entry will have been removed by
> > -		 * the time we are called, but the call will still happen.  We
> > -		 * will return all the way up to wp_pfn_shared(), where the
> > -		 * pte_same() check will fail, eventually causing page fault
> > -		 * to be retried by the CPU.
> > -		 */
> > -		goto unlock;
> > +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
> 
> You don't need ret2.  You can just compare 'entry' with '*slot' - see
> dax_writeback_one() for an example.

Hum, but if we want to do this cleanly (and get all the lockdep
verification), we should use

radix_tree_deref_slot_protected(slot, &mapping->tree_lock)

instead of *slot. And at that point my fingers hurt so much that I just
create a new variable for caching the result ;). BTW, this has prompted me
to also fix lock_slot, unlock_slot, slot_locked to use proper RCU
primitives for modifying slot contents.

								Honza

diff --git a/fs/dax.c b/fs/dax.c
index 444e9dd079ca..4fcac59b6dcb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -41,6 +41,30 @@ 
 #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
 		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
 
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+	return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+					      pgoff_t index)
+{
+	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+				       DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -306,6 +330,237 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 EXPORT_SYMBOL_GPL(dax_do_io);
 
 /*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+	struct radix_tree_root *root;
+	unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_t wait;
+	struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned mode,
+				       int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->root != ewait->key.root || key->index != ewait->key.index)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+static inline int slot_locked(void **v)
+{
+	unsigned long l = *(unsigned long *)v;
+	return l & DAX_ENTRY_LOCK;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= DAX_ENTRY_LOCK);
+}
+
+static inline void *unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
+					   pgoff_t index, void ***slotp)
+{
+	void *ret, **slot;
+	struct wait_exceptional_entry_queue wait;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	init_wait(&wait.wait);
+	wait.wait.func = wake_exceptional_entry_func;
+	wait.key.root = &mapping->page_tree;
+	wait.key.index = index;
+
+	for (;;) {
+		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+					  &slot);
+		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		    !slot_locked(slot)) {
+			if (slotp)
+				*slotp = slot;
+			return ret;
+		}
+		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mapping->tree_lock);
+		schedule();
+		finish_wait(wq, &wait.wait);
+		spin_lock_irq(&mapping->tree_lock);
+	}
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+
+restart:
+	spin_lock_irq(&mapping->tree_lock);
+	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
+	/* No entry for given index? Make sure radix tree is big enough. */
+	if (!ret) {
+		int err;
+
+		spin_unlock_irq(&mapping->tree_lock);
+		err = radix_tree_preload(
+				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
+		if (err)
+			return ERR_PTR(err);
+		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
+		spin_lock_irq(&mapping->tree_lock);
+		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		radix_tree_preload_end();
+		if (err) {
+			spin_unlock_irq(&mapping->tree_lock);
+			/* Someone already created the entry? */
+			if (err == -EEXIST)
+				goto restart;
+			return ERR_PTR(err);
+		}
+		/* Good, we have inserted empty locked entry into the tree. */
+		mapping->nrexceptional++;
+		spin_unlock_irq(&mapping->tree_lock);
+		return ret;
+	}
+	/* Normal page in radix tree? */
+	if (!radix_tree_exceptional_entry(ret)) {
+		struct page *page = ret;
+
+		page_cache_get(page);
+		spin_unlock_irq(&mapping->tree_lock);
+		lock_page(page);
+		/* Page got truncated? Retry... */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto restart;
+		}
+		return page;
+	}
+	ret = lock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	if (WARN_ON_ONCE(!slot_locked(slot))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	unlock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+static void put_locked_mapping_entry(struct address_space *mapping,
+				     pgoff_t index, void *entry)
+{
+	if (!radix_tree_exceptional_entry(entry)) {
+		unlock_page(entry);
+		page_cache_release(entry);
+	} else {
+		unlock_mapping_entry(mapping, index);
+	}
+}
+
+/*
+ * Called when we are done with radix tree entry we looked up via
+ * lookup_unlocked_mapping_entry() and which we didn't lock in the end.
+ */
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+				       pgoff_t index, void *entry)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	if(!radix_tree_exceptional_entry(entry))
+		return;
+
+	/* We have to wake up next waiter for the radix tree entry lock */
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	void *entry;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	/*
+	 * Caller should make sure radix tree modifications don't race and
+	 * we have seen exceptional entry here before.
+	 */
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+	radix_tree_delete(&mapping->page_tree, index);
+	mapping->nrexceptional--;
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 0, &key);
+	}
+	return 1;
+}
+
+/*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
  * workloads with sparse files.  We allocate a page cache page instead.
@@ -313,16 +568,24 @@  EXPORT_SYMBOL_GPL(dax_do_io);
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, struct page *page,
-							struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+			 struct vm_fault *vmf)
 {
-	struct inode *inode = mapping->host;
-	if (!page)
-		page = find_or_create_page(mapping, vmf->pgoff,
-						GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return VM_FAULT_OOM;
+	struct page *page;
+
+	/* Hole page already exists? Return it...  */
+	if (!radix_tree_exceptional_entry(entry)) {
+		vmf->page = entry;
+		return VM_FAULT_LOCKED;
+	}
 
+	/* This will replace locked radix tree entry with a hole page */
+	page = find_or_create_page(mapping, vmf->pgoff,
+				   vmf->gfp_mask | __GFP_ZERO);
+	if (!page) {
+		put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+		return VM_FAULT_OOM;
+	}
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
 }
@@ -346,77 +609,54 @@  static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
-		sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
+			       void *entry, sector_t sector, bool dirty,
+			       gfp_t gfp_mask)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	pgoff_t pmd_index = DAX_PMD_INDEX(index);
-	int type, error = 0;
-	void *entry;
+	int error = 0;
+	bool hole_fill = false;
+	void *ret;
 
-	WARN_ON_ONCE(pmd_entry && !dirty);
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	spin_lock_irq(&mapping->tree_lock);
-
-	entry = radix_tree_lookup(page_tree, pmd_index);
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
-		index = pmd_index;
-		goto dirty;
+	/* Replacing hole page with block mapping? */
+	if (!radix_tree_exceptional_entry(entry)) {
+		hole_fill = true;
+		error = radix_tree_preload(gfp_mask);
+		if (error)
+			return ERR_PTR(error);
 	}
 
-	entry = radix_tree_lookup(page_tree, index);
-	if (entry) {
-		type = RADIX_DAX_TYPE(entry);
-		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
-					type != RADIX_DAX_PMD)) {
-			error = -EIO;
+	spin_lock_irq(&mapping->tree_lock);
+	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+		       DAX_ENTRY_LOCK);
+	if (hole_fill) {
+		__delete_from_page_cache(entry, NULL);
+		error = radix_tree_insert(page_tree, index, ret);
+		if (error) {
+			ret = ERR_PTR(error);
 			goto unlock;
 		}
+		mapping->nrexceptional++;
+	} else {
+		void **slot;
+		void *ret2;
 
-		if (!pmd_entry || type == RADIX_DAX_PMD)
-			goto dirty;
-
-		/*
-		 * We only insert dirty PMD entries into the radix tree.  This
-		 * means we don't need to worry about removing a dirty PTE
-		 * entry and inserting a clean PMD entry, thus reducing the
-		 * range we would flush with a follow-up fsync/msync call.
-		 */
-		radix_tree_delete(&mapping->page_tree, index);
-		mapping->nrexceptional--;
-	}
-
-	if (sector == NO_SECTOR) {
-		/*
-		 * This can happen during correct operation if our pfn_mkwrite
-		 * fault raced against a hole punch operation.  If this
-		 * happens the pte that was hole punched will have been
-		 * unmapped and the radix tree entry will have been removed by
-		 * the time we are called, but the call will still happen.  We
-		 * will return all the way up to wp_pfn_shared(), where the
-		 * pte_same() check will fail, eventually causing page fault
-		 * to be retried by the CPU.
-		 */
-		goto unlock;
+		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
+		WARN_ON_ONCE(ret2 != entry);
+		radix_tree_replace_slot(slot, ret);
 	}
-
-	error = radix_tree_insert(page_tree, index,
-			RADIX_DAX_ENTRY(sector, pmd_entry));
-	if (error)
-		goto unlock;
-
-	mapping->nrexceptional++;
- dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
  unlock:
 	spin_unlock_irq(&mapping->tree_lock);
-	return error;
+	if (hole_fill)
+		radix_tree_preload_end();
+	return ret;
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -542,17 +782,18 @@  int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+			struct buffer_head *bh, void *entry,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = to_sector(bh, inode),
+		.sector = to_sector(bh, mapping->host),
 		.size = bh->b_size,
 	};
 	int error;
+	void *ret;
 
 	i_mmap_lock_read(mapping);
 
@@ -562,16 +803,26 @@  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
+	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
+			        vmf->flags & FAULT_FLAG_WRITE,
+			        vmf->gfp_mask & ~__GFP_HIGHMEM);
+	if (IS_ERR(ret)) {
+		error = PTR_ERR(ret);
 		goto out;
+	}
+	/* Have we replaced hole page? Unmap and free it. */
+	if (!radix_tree_exceptional_entry(entry)) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+		unlock_page(entry);
+		page_cache_release(entry);
+	}
+	entry = ret;
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
  out:
 	i_mmap_unlock_read(mapping);
-
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
 
@@ -591,7 +842,7 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page;
+	void *entry;
 	struct buffer_head bh;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	unsigned blkbits = inode->i_blkbits;
@@ -600,6 +851,11 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -609,40 +865,17 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
- repeat:
-	page = find_get_page(mapping, vmf->pgoff);
-	if (page) {
-		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
-			page_cache_release(page);
-			return VM_FAULT_RETRY;
-		}
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		error = PTR_ERR(entry);
+		goto out;
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
-
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+		goto unlock_entry;
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -651,30 +884,35 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
-		vmf->page = page;
-		if (!page)
+			goto unlock_entry;
+		if (!radix_tree_exceptional_entry(entry)) {
+			vmf->page = entry;
+		} else {
+			unlock_mapping_entry(mapping, vmf->pgoff);
 			i_mmap_lock_read(mapping);
+			vmf->page = NULL;
+		}
 		return VM_FAULT_LOCKED;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
-
-	if (page) {
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
-		unlock_page(page);
-		page_cache_release(page);
-		page = NULL;
+	if (!buffer_mapped(&bh)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_entry;
+		} else {
+			return dax_load_hole(mapping, entry, vmf);
+		}
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
 	WARN_ON_ONCE(buffer_unwritten(&bh));
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+	error = dax_insert_mapping(mapping, &bh, entry, vma, vmf);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -683,11 +921,8 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
-	if (page) {
-		unlock_page(page);
-		page_cache_release(page);
-	}
+ unlock_entry:
+	put_locked_mapping_entry(mapping, vmf->pgoff, entry);
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -976,23 +1211,18 @@  EXPORT_SYMBOL_GPL(dax_pmd_fault);
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
-	int error;
-
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
-			true);
+	struct address_space *mapping = file->f_mapping;
+	void *entry;
+	pgoff_t index = vmf->pgoff;
 
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM;
-	if (error)
-		return VM_FAULT_SIGBUS;
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+	put_unlocked_mapping_entry(mapping, index, entry);
+out:
+	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd28d824254b..da2416d916e6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,6 +18,7 @@  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index 7598b552ae03..a38d87688012 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@  static void clear_exceptional_entry(struct address_space *mapping,
 	if (shmem_mapping(mapping))
 		return;
 
-	spin_lock_irq(&mapping->tree_lock);
-
 	if (dax_mapping(mapping)) {
-		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
-			mapping->nrexceptional--;
-	} else {
-		/*
-		 * Regular page slots are stabilized by the page lock even
-		 * without the tree itself locked.  These unlocked entries
-		 * need verification under the tree lock.
-		 */
-		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
-					&slot))
-			goto unlock;
-		if (*slot != entry)
-			goto unlock;
-		radix_tree_replace_slot(slot, NULL);
-		mapping->nrexceptional--;
-		if (!node)
-			goto unlock;
-		workingset_node_shadows_dec(node);
-		/*
-		 * Don't track node without shadow entries.
-		 *
-		 * Avoid acquiring the list_lru lock if already untracked.
-		 * The list_empty() test is safe as node->private_list is
-		 * protected by mapping->tree_lock.
-		 */
-		if (!workingset_node_shadows(node) &&
-		    !list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-					&node->private_list);
-		__radix_tree_delete_node(&mapping->page_tree, node);
+		dax_delete_mapping_entry(mapping, index);
+		return;
 	}
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+				&slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrexceptional--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes,
+				&node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
