diff mbox

dax: fix deadlock in __dax_fault

Message ID 1443040800-5460-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State Superseded
Headers show

Commit Message

Ross Zwisler Sept. 23, 2015, 8:40 p.m. UTC
Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
that was causing us to deadlock:

1) enter __dax_fault()
2) page = find_get_page() gives us a page, so skip
	i_mmap_lock_write(mapping)
3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
	passes, enter this block
4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
	i_mmap_unlock_write(mapping);
	return dax_load_hole(mapping, page, vmf);

This causes us to up_write() a semaphore that we weren't holding.

The up_write() on a semaphore we didn't down_write() happens twice in
a row, and then the next time we try and i_mmap_lock_write(), we hang.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
---
 fs/dax.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Dave Chinner Sept. 24, 2015, 2:52 a.m. UTC | #1
On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> that was causing us to deadlock:
> 
> 1) enter __dax_fault()
> 2) page = find_get_page() gives us a page, so skip
> 	i_mmap_lock_write(mapping)
> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> 	passes, enter this block
> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> 	i_mmap_unlock_write(mapping);
> 	return dax_load_hole(mapping, page, vmf);
> 
> This causes us to up_write() a semaphore that we weren't holding.
> 
> The up_write() on a semaphore we didn't down_write() happens twice in
> a row, and then the next time we try and i_mmap_lock_write(), we hang.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7ae6df7..df1b0ac 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  			if (error)
>  				goto unlock;
>  		} else {
> -			i_mmap_unlock_write(mapping);
> +			if (!page)
> +				i_mmap_unlock_write(mapping);
>  			return dax_load_hole(mapping, page, vmf);
>  		}
>  	}

I can't review this properly because I can't work out how this
locking is supposed to work.  Captain, we have a Charlie Foxtrot
situation here:

	page = find_get_page(mapping, vmf->pgoff)
	if (page) {
		....
	} else {
		i_mmap_lock_write(mapping);
	}

So if there's no page in the page cache, we lock the i_mmap_lock.
The we have the case the above patch fixes. Then later:

	if (vmf->cow_page) {
		.....
		if (!page) {
			/* can fall through */
		}
		return VM_FAULT_LOCKED;
	}

Which means __dax_fault() can also return here with the
i_mmap_lock_write() held. There's no documentation to indicate why
this is valid, and only by looking about 4 function calls higher up
the stack can I see that there's some attempt to handle this
*specific return condition* (in do_cow_fault()). That also is
lacking in documentation explaining the circumstances where we might
have the i_mmap_lock_write() held and have to release it. (Not to
mention the beautiful copy-n-waste of the unlock code, either.)

The above code in __dax_fault() is then followed by this gem:

	/* Check we didn't race with a read fault installing a new page */
        if (!page && major)
                page = find_lock_page(mapping, vmf->pgoff);

	if (page) {
		/* mapping invalidation .... */
	}
	.....

	if (!page)
		i_mmap_unlock_write(mapping);

Which means that if we had a race with another read fault, we'll
remove the page from the page cache, insert the new direct mapped
pfn into the mapping, and *then fail to unlock the i_mmap lock*.

Is this supposed to work this way? Or is it another bug?

Another difficult question this change of locking raised that I
can't answer: is it valid to call into the filesystem via getblock()
or complete_unwritten() while holding the i_mmap_rwsem? This puts
filesystem transactions and locks inside the scope of i_mmap_rwsem,
which may have impact on the fact that we already have an inode lock
order dependency w.r.t. i_mmap_rwsem through truncate (and probably
other paths, too).

So, please document the locking model, explain the corner cases and
the intricacies like why *unbalanced, return value conditional
locking* is necessary, and update the charts of lock order
dependencies in places like mm/filemap.c, and then we might have
some idea of how much of a train-wreck this actually is....

Cheers,

Dave.
Boaz Harrosh Sept. 24, 2015, 9:03 a.m. UTC | #2
On 09/24/2015 05:52 AM, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
>> Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
>> that was causing us to deadlock:
>>
>> 1) enter __dax_fault()
>> 2) page = find_get_page() gives us a page, so skip
>> 	i_mmap_lock_write(mapping)
>> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
>> 	passes, enter this block
>> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
>> 	i_mmap_unlock_write(mapping);
>> 	return dax_load_hole(mapping, page, vmf);
>>
>> This causes us to up_write() a semaphore that we weren't holding.
>>
>> The up_write() on a semaphore we didn't down_write() happens twice in
>> a row, and then the next time we try and i_mmap_lock_write(), we hang.
>>
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Reported-by: Dave Chinner <david@fromorbit.com>
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 7ae6df7..df1b0ac 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>>  			if (error)
>>  				goto unlock;
>>  		} else {
>> -			i_mmap_unlock_write(mapping);
>> +			if (!page)
>> +				i_mmap_unlock_write(mapping);
>>  			return dax_load_hole(mapping, page, vmf);
>>  		}
>>  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....
> 

Hi hi

I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still
think it solves nothing and that we've done a really really bad job.

If we *easily* involve the FS in the locking here (Which btw I think XFS
already does), then this all i_mmap_lock can be avoided.

Please remind me again what race it is suppose to avoid? I get confused.

> Cheers,
> Dave.
> 

Thanks
Boaz
Ross Zwisler Sept. 24, 2015, 3:50 p.m. UTC | #3
On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > that was causing us to deadlock:
> > 
> > 1) enter __dax_fault()
> > 2) page = find_get_page() gives us a page, so skip
> > 	i_mmap_lock_write(mapping)
> > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > 	passes, enter this block
> > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > 	i_mmap_unlock_write(mapping);
> > 	return dax_load_hole(mapping, page, vmf);
> > 
> > This causes us to up_write() a semaphore that we weren't holding.
> > 
> > The up_write() on a semaphore we didn't down_write() happens twice in
> > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Dave Chinner <david@fromorbit.com>
> > ---
> >  fs/dax.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7ae6df7..df1b0ac 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  			if (error)
> >  				goto unlock;
> >  		} else {
> > -			i_mmap_unlock_write(mapping);
> > +			if (!page)
> > +				i_mmap_unlock_write(mapping);
> >  			return dax_load_hole(mapping, page, vmf);
> >  		}
> >  	}
> 
> I can't review this properly because I can't work out how this
> locking is supposed to work.  Captain, we have a Charlie Foxtrot
> situation here:
> 
> 	page = find_get_page(mapping, vmf->pgoff)
> 	if (page) {
> 		....
> 	} else {
> 		i_mmap_lock_write(mapping);
> 	}
> 
> So if there's no page in the page cache, we lock the i_mmap_lock.
> The we have the case the above patch fixes. Then later:
> 
> 	if (vmf->cow_page) {
> 		.....
> 		if (!page) {
> 			/* can fall through */
> 		}
> 		return VM_FAULT_LOCKED;
> 	}
> 
> Which means __dax_fault() can also return here with the
> i_mmap_lock_write() held. There's no documentation to indicate why
> this is valid, and only by looking about 4 function calls higher up
> the stack can I see that there's some attempt to handle this
> *specific return condition* (in do_cow_fault()). That also is
> lacking in documentation explaining the circumstances where we might
> have the i_mmap_lock_write() held and have to release it. (Not to
> mention the beautiful copy-n-waste of the unlock code, either.)
> 
> The above code in __dax_fault() is then followed by this gem:
> 
> 	/* Check we didn't race with a read fault installing a new page */
>         if (!page && major)
>                 page = find_lock_page(mapping, vmf->pgoff);
> 
> 	if (page) {
> 		/* mapping invalidation .... */
> 	}
> 	.....
> 
> 	if (!page)
> 		i_mmap_unlock_write(mapping);
> 
> Which means that if we had a race with another read fault, we'll
> remove the page from the page cache, insert the new direct mapped
> pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> 
> Is this supposed to work this way? Or is it another bug?
> 
> Another difficult question this change of locking raised that I
> can't answer: is it valid to call into the filesystem via getblock()
> or complete_unwritten() while holding the i_mmap_rwsem? This puts
> filesystem transactions and locks inside the scope of i_mmap_rwsem,
> which may have impact on the fact that we already have an inode lock
> order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> other paths, too).
> 
> So, please document the locking model, explain the corner cases and
> the intricacies like why *unbalanced, return value conditional
> locking* is necessary, and update the charts of lock order
> dependencies in places like mm/filemap.c, and then we might have
> some idea of how much of a train-wreck this actually is....

Yep, I saw these too, but didn't yet know how to deal with them.  We have at
similar issues with __dax_pmd_fault():

	i_mmap_lock_write(mapping);
	length = get_block(inode, block, &bh, write);
	if (length)
		return VM_FAULT_SIGBUS;

Whoops, we just took the mmap semaphore, and then we have a return that
doesn't release it.  A quick test confirms that this will deadlock the next
fault that tries to take the mmap semaphore.

I agree that we need to give the fault handling code some attention when it
comes to locking, and that we need to have better documentation.  I'll work on
this when I get some time.

In the meantime I still think it's worthwhile to take the short term fix for
the obvious generic/075 deadlock, yea?

- Ross
Dave Chinner Sept. 25, 2015, 2:53 a.m. UTC | #4
On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote:
> On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > > that was causing us to deadlock:
> > > 
> > > 1) enter __dax_fault()
> > > 2) page = find_get_page() gives us a page, so skip
> > > 	i_mmap_lock_write(mapping)
> > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > > 	passes, enter this block
> > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > > 	i_mmap_unlock_write(mapping);
> > > 	return dax_load_hole(mapping, page, vmf);
> > > 
> > > This causes us to up_write() a semaphore that we weren't holding.
> > > 
> > > The up_write() on a semaphore we didn't down_write() happens twice in
> > > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Reported-by: Dave Chinner <david@fromorbit.com>
> > > ---
> > >  fs/dax.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 7ae6df7..df1b0ac 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > >  			if (error)
> > >  				goto unlock;
> > >  		} else {
> > > -			i_mmap_unlock_write(mapping);
> > > +			if (!page)
> > > +				i_mmap_unlock_write(mapping);
> > >  			return dax_load_hole(mapping, page, vmf);
> > >  		}
> > >  	}
> > 
> > I can't review this properly because I can't work out how this
> > locking is supposed to work.  Captain, we have a Charlie Foxtrot
> > situation here:
> > 
> > 	page = find_get_page(mapping, vmf->pgoff)
> > 	if (page) {
> > 		....
> > 	} else {
> > 		i_mmap_lock_write(mapping);
> > 	}
> > 
> > So if there's no page in the page cache, we lock the i_mmap_lock.
> > The we have the case the above patch fixes. Then later:
> > 
> > 	if (vmf->cow_page) {
> > 		.....
> > 		if (!page) {
> > 			/* can fall through */
> > 		}
> > 		return VM_FAULT_LOCKED;
> > 	}
> > 
> > Which means __dax_fault() can also return here with the
> > i_mmap_lock_write() held. There's no documentation to indicate why
> > this is valid, and only by looking about 4 function calls higher up
> > the stack can I see that there's some attempt to handle this
> > *specific return condition* (in do_cow_fault()). That also is
> > lacking in documentation explaining the circumstances where we might
> > have the i_mmap_lock_write() held and have to release it. (Not to
> > mention the beautiful copy-n-waste of the unlock code, either.)
> > 
> > The above code in __dax_fault() is then followed by this gem:
> > 
> > 	/* Check we didn't race with a read fault installing a new page */
> >         if (!page && major)
> >                 page = find_lock_page(mapping, vmf->pgoff);
> > 
> > 	if (page) {
> > 		/* mapping invalidation .... */
> > 	}
> > 	.....
> > 
> > 	if (!page)
> > 		i_mmap_unlock_write(mapping);
> > 
> > Which means that if we had a race with another read fault, we'll
> > remove the page from the page cache, insert the new direct mapped
> > pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> > 
> > Is this supposed to work this way? Or is it another bug?
> > 
> > Another difficult question this change of locking raised that I
> > can't answer: is it valid to call into the filesystem via getblock()
> > or complete_unwritten() while holding the i_mmap_rwsem? This puts
> > filesystem transactions and locks inside the scope of i_mmap_rwsem,
> > which may have impact on the fact that we already have an inode lock
> > order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> > other paths, too).
> > 
> > So, please document the locking model, explain the corner cases and
> > the intricacies like why *unbalanced, return value conditional
> > locking* is necessary, and update the charts of lock order
> > dependencies in places like mm/filemap.c, and then we might have
> > some idea of how much of a train-wreck this actually is....
> 
> Yep, I saw these too, but didn't yet know how to deal with them.  We have at
> similar issues with __dax_pmd_fault():
> 
> 	i_mmap_lock_write(mapping);
> 	length = get_block(inode, block, &bh, write);
> 	if (length)
> 		return VM_FAULT_SIGBUS;
> 
> Whoops, we just took the mmap semaphore, and then we have a return that
> doesn't release it.  A quick test confirms that this will deadlock the next
> fault that tries to take the mmap semaphore.

Ugh. So there's more than one broken commit and code path we are
talking about here.

Oh, even the "fallback" path is broken there - it converts the
unwritten extent to written, even in the case where the underlying
pages weren't zeroed. See this case:

                if (unlikely(!zero_page))
                        goto fallback;

That's a stale data exposure bug, so has security implications....

> I agree that we need to give the fault handling code some attention when it
> comes to locking, and that we need to have better documentation.  I'll work on
> this when I get some time.
> 
> In the meantime I still think it's worthwhile to take the short term fix for
> the obvious generic/075 deadlock, yea?

I think a bunch of reverts are in order -the code is broken, has
stale data exposure problems and we're going to have to rip the code
out anyway because I don't think this is the right way to fix the
underlying problem ("If two threads write-fault on the same hole at
the same time...")

We've already got block allocation serialisation at the filesystem
level, and the issue is the unserialised block zeroing being done by
the dax code. That can be fixed by moving the zeroing into the
filesystem code when it runs "complete_unwritten" and checks whether
the mapping has already been marked as written or not...

I've recently pointed out in a different thread that this is the
solution to whatever that problem was (can't recall which
thread/problem is was now :/ ) and it the same solution here. We
already have the serialisation we need, we just need to move the
block zeroing operation into the appropriate places to make it work
correctly.

Cheers,

Dave.
Ross Zwisler Sept. 25, 2015, 6:23 p.m. UTC | #5
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
<>
> We've already got block allocation serialisation at the filesystem
> level, and the issue is the unserialised block zeroing being done by
> the dax code. That can be fixed by moving the zeroing into the
> filesystem code when it runs "complete_unwritten" and checks whether
> the mapping has already been marked as written or not...
> 
> I've recently pointed out in a different thread that this is the
> solution to whatever that problem was (can't recall which
> thread/problem is was now :/ ) and it the same solution here. We
> already have the serialisation we need, we just need to move the
> block zeroing operation into the appropriate places to make it work
> correctly.

I think perhaps this is the thread that you're remembering:
https://lkml.org/lkml/2015/8/11/731
Dave Chinner Sept. 25, 2015, 11:30 p.m. UTC | #6
On Fri, Sep 25, 2015 at 12:23:34PM -0600, Ross Zwisler wrote:
> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> <>
> > We've already got block allocation serialisation at the filesystem
> > level, and the issue is the unserialised block zeroing being done by
> > the dax code. That can be fixed by moving the zeroing into the
> > filesystem code when it runs "complete_unwritten" and checks whether
> > the mapping has already been marked as written or not...
> > 
> > I've recently pointed out in a different thread that this is the
> > solution to whatever that problem was (can't recall which
> > thread/problem is was now :/ ) and it the same solution here. We
> > already have the serialisation we need, we just need to move the
> > block zeroing operation into the appropriate places to make it work
> > correctly.
> 
> I think perhaps this is the thread that you're remembering:
> https://lkml.org/lkml/2015/8/11/731

Ah, yes, the read vs write fault race condition, whereas this change
was to address write vs write fault races. And neither of which
address the fault vs truncate problem properly, either, which is
what the XFS_MMAP_LOCKING addresses.

So, yeah, pushing the block zeroing down to the filesystem gets rid
of multiple problems that concurrent page faults have with
initialisation without adding any more serialisation or overhead.

Cheers,

Dave.
Ross Zwisler Sept. 26, 2015, 3:17 a.m. UTC | #7
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote:
> > On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote:
> > > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote:
> > > > Fix the deadlock exposed by xfstests generic/075.  Here is the sequence
> > > > that was causing us to deadlock:
> > > > 
> > > > 1) enter __dax_fault()
> > > > 2) page = find_get_page() gives us a page, so skip
> > > > 	i_mmap_lock_write(mapping)
> > > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page)
> > > > 	passes, enter this block
> > > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and
> > > > 	i_mmap_unlock_write(mapping);
> > > > 	return dax_load_hole(mapping, page, vmf);
> > > > 
> > > > This causes us to up_write() a semaphore that we weren't holding.
> > > > 
> > > > The up_write() on a semaphore we didn't down_write() happens twice in
> > > > a row, and then the next time we try and i_mmap_lock_write(), we hang.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Reported-by: Dave Chinner <david@fromorbit.com>
> > > > ---
> > > >  fs/dax.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 7ae6df7..df1b0ac 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > > >  			if (error)
> > > >  				goto unlock;
> > > >  		} else {
> > > > -			i_mmap_unlock_write(mapping);
> > > > +			if (!page)
> > > > +				i_mmap_unlock_write(mapping);
> > > >  			return dax_load_hole(mapping, page, vmf);
> > > >  		}
> > > >  	}
> > > 
> > > I can't review this properly because I can't work out how this
> > > locking is supposed to work.  Captain, we have a Charlie Foxtrot
> > > situation here:
> > > 
> > > 	page = find_get_page(mapping, vmf->pgoff)
> > > 	if (page) {
> > > 		....
> > > 	} else {
> > > 		i_mmap_lock_write(mapping);
> > > 	}
> > > 
> > > So if there's no page in the page cache, we lock the i_mmap_lock.
> > > The we have the case the above patch fixes. Then later:
> > > 
> > > 	if (vmf->cow_page) {
> > > 		.....
> > > 		if (!page) {
> > > 			/* can fall through */
> > > 		}
> > > 		return VM_FAULT_LOCKED;
> > > 	}
> > > 
> > > Which means __dax_fault() can also return here with the
> > > i_mmap_lock_write() held. There's no documentation to indicate why
> > > this is valid, and only by looking about 4 function calls higher up
> > > the stack can I see that there's some attempt to handle this
> > > *specific return condition* (in do_cow_fault()). That also is
> > > lacking in documentation explaining the circumstances where we might
> > > have the i_mmap_lock_write() held and have to release it. (Not to
> > > mention the beautiful copy-n-waste of the unlock code, either.)
> > > 
> > > The above code in __dax_fault() is then followed by this gem:
> > > 
> > > 	/* Check we didn't race with a read fault installing a new page */
> > >         if (!page && major)
> > >                 page = find_lock_page(mapping, vmf->pgoff);
> > > 
> > > 	if (page) {
> > > 		/* mapping invalidation .... */
> > > 	}
> > > 	.....
> > > 
> > > 	if (!page)
> > > 		i_mmap_unlock_write(mapping);
> > > 
> > > Which means that if we had a race with another read fault, we'll
> > > remove the page from the page cache, insert the new direct mapped
> > > pfn into the mapping, and *then fail to unlock the i_mmap lock*.
> > > 
> > > Is this supposed to work this way? Or is it another bug?
> > > 
> > > Another difficult question this change of locking raised that I
> > > can't answer: is it valid to call into the filesystem via getblock()
> > > or complete_unwritten() while holding the i_mmap_rwsem? This puts
> > > filesystem transactions and locks inside the scope of i_mmap_rwsem,
> > > which may have impact on the fact that we already have an inode lock
> > > order dependency w.r.t. i_mmap_rwsem through truncate (and probably
> > > other paths, too).
> > > 
> > > So, please document the locking model, explain the corner cases and
> > > the intricacies like why *unbalanced, return value conditional
> > > locking* is necessary, and update the charts of lock order
> > > dependencies in places like mm/filemap.c, and then we might have
> > > some idea of how much of a train-wreck this actually is....
> > 
> > Yep, I saw these too, but didn't yet know how to deal with them.  We have at
> > similar issues with __dax_pmd_fault():
> > 
> > 	i_mmap_lock_write(mapping);
> > 	length = get_block(inode, block, &bh, write);
> > 	if (length)
> > 		return VM_FAULT_SIGBUS;
> > 
> > Whoops, we just took the mmap semaphore, and then we have a return that
> > doesn't release it.  A quick test confirms that this will deadlock the next
> > fault that tries to take the mmap semaphore.
> 
> Ugh. So there's more than one broken commit and code path we are
> talking about here.
> 
> Oh, even the "fallback" path is broken there - it converts the
> unwritten extent to written, even in the case where the underlying
> pages weren't zeroed. See this case:
> 
>                 if (unlikely(!zero_page))
>                         goto fallback;
> 
> That's a stale data exposure bug, so has security implications....
> 
> > I agree that we need to give the fault handling code some attention when it
> > comes to locking, and that we need to have better documentation.  I'll work on
> > this when I get some time.
> > 
> > In the meantime I still think it's worthwhile to take the short term fix for
> > the obvious generic/075 deadlock, yea?
> 
> I think a bunch of reverts are in order -the code is broken, has
> stale data exposure problems and we're going to have to rip the code
> out anyway because I don't think this is the right way to fix the
> underlying problem ("If two threads write-fault on the same hole at
> the same time...")
> 
> We've already got block allocation serialisation at the filesystem
> level, and the issue is the unserialised block zeroing being done by
> the dax code. That can be fixed by moving the zeroing into the
> filesystem code when it runs "complete_unwritten" and checks whether
> the mapping has already been marked as written or not...
> 
> I've recently pointed out in a different thread that this is the
> solution to whatever that problem was (can't recall which
> thread/problem is was now :/ ) and it the same solution here. We
> already have the serialisation we need, we just need to move the
> block zeroing operation into the appropriate places to make it work
> correctly.

I think that at the end of the day the locking in the two DAX fault paths is
trying to protect us against two things:

1) Simultaneous write-fault on the same hole at the same time. The winner of
   the race will return to userspace and complete their store, only to have
   the loser overwrite their store with zeroes.

   We are currently using mapping->i_mmap_rwsem to protect ourselves from
   this.  Here is the patch that added this protection:

   https://lkml.org/lkml/2015/8/4/870

2) Races between page faults and truncate.  If we have a struct page we
   protect ourselves by locking the page via lock_page_or_retry().  If we
   don't have a struct page we use mapping->i_mmap_rwsem.  This is the source
   of all of the mapping->i_mmap_rwsem locking being based on !page.

   The protection provided by mapping->i_mmap_rwsem against truncate has been
   in place in some form since v4.0.

To get them all written down in one place, here are all the issues and
questions that I currently know about regarding the DAX fault paths as of
v4.3-rc2:

1) In __dax_pmd_fault() the block inside the first "if (buffer_unwritten(&bh)
   || buffer_new(&bh))" test doesn't set kaddr before starting to zero.  This
   was fixed by the following patch which has been accepted into the --mm
   tree:

   https://lkml.org/lkml/2015/9/22/989

2) In __dax_pmd_fault() we have this code:

   	i_mmap_lock_write(mapping);
	length = get_block(inode, block, &bh, write);
	if (length)
		return VM_FAULT_SIGBUS;

   We just took the mapping->i_mmap_rwsem semaphore, and then we have a return
   that doesn't release it.  A quick test confirms that this will deadlock the
   next fault that tries to take the mapping->i_mmap_rwsem semaphore.

3) In __dax_pmd_fault() we convert the unwritten extent to written, even in
   the case where the underlying pages weren't zeroed. See this case:

                if (unlikely(!zero_page))
                        goto fallback;

   That's a stale data exposure bug, so has security implications....

   There are other places where we head through this completion path because
   of an error, and those should similarly only convert the extents as written
   if they have been zeroed.

4) In __dax_fault() we can return through this path:

		} else {
			i_mmap_unlock_write(mapping);
			return dax_load_hole(mapping, page, vmf);
		}

   without having ever taken the mapping->i_mmap_rwsem semaphore.  This means
   we end up releasing a semaphore that we never took, which can lead to
   a deadlock.  This was the original bug report for xfstest generic/075, and
   is currently addressed by the following patch in the -mm tree:

   https://lkml.org/lkml/2015/9/23/607

5) In __dax_fault() we rely on the presence of a page pointer to know whether
   or not we should release mapping->i_mmap_rwsem at the end of the function.
   Unfortunately, we can set the page pointer in the middle of the function
   while the semaphore is held:

	/* Check we didn't race with a read fault installing a new page */
	if (!page && major)
		page = find_lock_page(mapping, vmf->pgoff);

   This will cause us to fail the "if (!page)" test at the end of the
   function, making us fail to release the semaphore on exit.

6) In __dax_fault() in the vmf->cow_page case we can end up exiting with
   status VM_FAULT_LOCKED but with mapping->i_mmap_rwsem held.  This is
   actually correct behavior and is documented in the commit message of the
   patch that added this code:

   commit 2e4cdab0584fa884e0a81c4f45b93ce875c9fcaa
   https://lkml.org/lkml/2015/1/13/635

   This breaks the locking rules used by the rest of the function, though, and
   needs comments.

7) In both __dax_fault() and __dax_pmd_fault() we call the filesystem's
   get_block() and complete_unwritten() functions while holding
   mapping->i_mmap_rwsem.  There is a concern that this could potentially lead
   to a lock inversion, leading to a deadlock.

Here's how I think we can address the above issues (basically "do what Dave
said"):

1) For the simultaneous write-fault issue, move the responsibility for zeroing
   the extents and converting them to written to the filesystem as Dave
   suggests.  This will require us to get the coordination between DAX and
   the filesystem write in each filesystem we support (currentely xfs and
   ext4), but it will take advantage of fine-grained locking already present
   in the filesystem and will let us remove much of our reliance on
   mapping->i_mmap_rwsem.

   This will also address the scalability issues mentioned as future work in
   Matthew's patch that introduced the heavy reliance on
   mapping->i_mmap_rwsem:

   https://lkml.org/lkml/2015/8/4/870

   This will allow us to scale back our usage of mapping->i_mmap_rwsem,
   helping us deal with many of the above issues.

   This will also take care of issue #3 (information leak of non-zeroed blocks
   marked as written) as DAX will no longer be responsible for converting
   newly zeroed blocks to written.

2) For the remaining uses of mapping->i_mmap_rwsem in the fault paths that
   protect us against truncate, address any remaining code issues listed
   above, clearly document in the fault handlers the locking rules and the
   exceptions to those rules (cow_page + VM_FAULT_LOCKED).  Update any locking
   order changes like in mm/filemap.c as necessary.

3) For issue #7 above (lock inversion concerns between i_mmap_rwsem and the
   filesystem locking needed for get_block() and complete_unwritten()), I
   think that the complete_unwritten() calls in DAX will go away since we will
   be allocating, zeroing and converting extents to written all in the
   filesystem.  Similarly, I think that the calls to get_block() will no
   longer happen while the mapping->i_mmap_rwsem is held - this was the case
   as of v4.2, and I think we can probably get bet there with finger grained
   FS locking.

4) Test all changes with xfstests using both xfs & ext4, using lockep.

Did I miss any issues, or does this path not solve one of them somehow?

Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
you guys can provide guidance and code reviews for the XFS and ext4 bits?

Thanks,
- Ross
Dave Chinner Sept. 28, 2015, 12:59 a.m. UTC | #8
On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> > I think a bunch of reverts are in order -the code is broken, has
> > stale data exposure problems and we're going to have to rip the code
> > out anyway because I don't think this is the right way to fix the
> > underlying problem ("If two threads write-fault on the same hole at
> > the same time...")

[...]

> I think that at the end of the day the locking in the two DAX fault paths is
> trying to protect us against two things:
> 
> 1) Simultaneous write-fault on the same hole at the same time.
> 2) Races between page faults and truncate.

1) is fixed by pushing block zeroing into get_blocks()

2) is fixed by the filesystem locking out page faults during
truncate. XFS already does this with it's XFS_MMAPLOCK_[SHARED|EXCL]
locking on page faults and truncate...

> To get them all written down in one place, here are all the issues and
> questions that I currently know about regarding the DAX fault paths as of
> v4.3-rc2:
> 
> 1) In __dax_pmd_fault() the block inside the first "if (buffer_unwritten(&bh)

	[fail to set kaddr, patch in -mm]

> 2) In __dax_pmd_fault() we have this code:

	[fail to release i_mmap_lock_write()]

> 3) In __dax_pmd_fault() we convert the unwritten extent to written
....
> 4) In __dax_fault() we can return through this path:

	[incorrect release of i_mmap_lock_write()]

> 5) In __dax_fault() we rely on the presence of a page pointer to know whether

	[fail to release i_mmap_lock_write() on racing read faults]

> 6) In __dax_fault() in the vmf->cow_page case we can end up exiting with
>    status VM_FAULT_LOCKED but with mapping->i_mmap_rwsem held.  This is

	[undocumented, unbalanced, but handled]

> 7) In both __dax_fault() and __dax_pmd_fault() we call the filesystem's
>    get_block() and complete_unwritten() functions while holding
>    mapping->i_mmap_rwsem.  There is a concern that this could potentially lead
>    to a lock inversion, leading to a deadlock.

	[more investigation needed]

> Here's how I think we can address the above issues (basically "do what Dave
> said"):
> 
> 1) For the simultaneous write-fault issue, move the responsibility for zeroing
>    the extents and converting them to written to the filesystem as Dave

*nod*

> 2) For the remaining uses of mapping->i_mmap_rwsem in the fault paths that
>    protect us against truncate, address any remaining code issues listed
>    above, clearly document in the fault handlers the locking rules and the
>    exceptions to those rules (cow_page + VM_FAULT_LOCKED).  Update any locking
>    order changes like in mm/filemap.c as necessary.

Incorrect. Truncate is a special case of hole punching, and the mm
subsystem has long relied on a nasty hack to serialise page faults
against truncate. That is:

	lock_page(page);
	if (page beyond EOF or mapping is different) {
		/* OMG FAIL FAIL FAIL */
	}

This "page beyond EOF" does not work for operations like hole
punching or other direct filesystem extent manipulation operations
that occur within the current EOF and require mapping invalidation
and serialisation until after the extent manipulation is done.

IOWs, all of these operations require the filesystem to prevent new
page faults from occurring after the initial invalidation and until
the extent manipulation is completed.  This includes truncate - the
"page beyond EOF" hack is not necessary anymore, because truncate
has the same invalidation behaviour w.r.t. punching out mappings
within EOF...

And with DAX, we have no struct page to lock, so there really isn't
anything in the mm subsystem that can be used to ensure that a given
range of a file is not being otherwise manipulated in a manner that
races with a page fault. The i_mmap_rwsem is not a replacement for
the page lock....

In reality, the require DAX page fault vs truncate serialisation is
provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
the fault, mkwrite and filesystem extent manipulation paths. There
is no way this sort of exclusion can be done in the mm/ subsystem as
it simply does not have the context to be able to provide the
necessary serialisation.  Every filesystem that wants to use DAX
needs to provide this external page fault serialisation, and in
doing so will also protect it's hole punch/extent swap/shift
operations under normal operation against racing mmap access....

IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.

> 3) For issue #7 above (lock inversion concerns between i_mmap_rwsem and the
>    filesystem locking needed for get_block() and complete_unwritten()), I
>    think that the complete_unwritten() calls in DAX will go away since we will
>    be allocating, zeroing and converting extents to written all in the
>    filesystem.  Similarly, I think that the calls to get_block() will no
>    longer happen while the mapping->i_mmap_rwsem is held - this was the case
>    as of v4.2, and I think we can probably get bet there with finger grained
>    FS locking.

ext2 already does this zeroing. See the call to dax_clear_blocks()
in ext2_get_blocks. I should be able to do something similar in XFS
easily enough.

> 4) Test all changes with xfstests using both xfs & ext4, using lockep.
> 
> Did I miss any issues, or does this path not solve one of them somehow?
> 
> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> you guys can provide guidance and code reviews for the XFS and ext4 bits?

IMO, it's way too much to get into 4.3. I'd much prefer we revert
the bad changes in 4.3, and then work towards fixing this for the
4.4 merge window. If someone needs this for 4.3, then they can
backport the 4.4 code to 4.3-stable.

The "fast and loose and fix it later" development model does not
work for persistent storage algorithms; DAX is storage - not memory
management - and so we need to treat it as such.

Cheers,

Dave.
Dan Williams Sept. 28, 2015, 12:13 p.m. UTC | #9
On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
[..]
>> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
>> you guys can provide guidance and code reviews for the XFS and ext4 bits?
>
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.
>

If the proposal is to step back and get a running start at these fixes
for 4.4, then it is worth considering what the state of allocating
pages for DAX mappings will be in 4.4.  It's already that case that
allocating struct page for DAX mappings is the only solution on the
horizon for enabling a get_user_pages() solution for persistent
memory.  We of course need to get the page-less DAX path fixed up, but
the near-term path to full functionality and safety is when struct
page is available to enable the typical synchronization mechanics.

That said, it's not clear that saves us any work given the axonram and
dcssblk DAX drivers do not currently allocate struct page, and pmem
optionally allocates struct page.
Dave Chinner Sept. 28, 2015, 9:35 p.m. UTC | #10
On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> [..]
> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
> >
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> >
> 
> If the proposal is to step back and get a running start at these fixes
> for 4.4, then it is worth considering what the state of allocating
> pages for DAX mappings will be in 4.4.

Oh, do tell. I haven't seen any published design, code, etc, and I
certainly haven't planned any time in the 4.4 window to do a
complete audit, rework and test of the XFS DAX code. So if you want
a working DAX implementation in the short term, we need to fix what
we have and not do wholesale changes to infrastructure that put us
back to square 1.

And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
in XFS until I've had time to review, test and fix it so it works
without deadlocking or corrupting data.

> It's already that case that
> allocating struct page for DAX mappings is the only solution on the
> horizon for enabling a get_user_pages() solution for persistent
> memory.  We of course need to get the page-less DAX path fixed up, but
> the near-term path to full functionality and safety is when struct
> page is available to enable the typical synchronization mechanics.

And we do so at the expense of medium to long term complexity and
maintenance. I'm no fan of using struct pages to track terabytes to
petabytes of persistent memory, and I'm even less of a fan of having
to simultaneously support both struct page and pfn based DAX
subsystems...

> That said, it's not clear that saves us any work given the axonram and
> dcssblk DAX drivers do not currently allocate struct page, and pmem
> optionally allocates struct page.

Precisely my point.

Cheers,

Dave.
Ross Zwisler Sept. 28, 2015, 10:40 p.m. UTC | #11
On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
<>
> In reality, the require DAX page fault vs truncate serialisation is
> provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> the fault, mkwrite and filesystem extent manipulation paths. There
> is no way this sort of exclusion can be done in the mm/ subsystem as
> it simply does not have the context to be able to provide the
> necessary serialisation.  Every filesystem that wants to use DAX
> needs to provide this external page fault serialisation, and in
> doing so will also protect it's hole punch/extent swap/shift
> operations under normal operation against racing mmap access....
> 
> IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.

So is it your belief that XFS already has correct locking in place to ensure
that we don't hit these races?  I see XFS taking XFS_MMAPLOCK_SHARED before it
calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
in xfs_filemap_fault().

XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
found the generic hole punching code yet, but I assume it takes
XFS_MMAPLOCK_EXCL as well.

Meaning, is the work that we need to do around extent vs page fault locking
basically adding equivalent locking to ext4 and ext2 and removing the attempts
at locking from dax.c?

> > 4) Test all changes with xfstests using both xfs & ext4, using lockep.
> > 
> > Did I miss any issues, or does this path not solve one of them somehow?
> > 
> > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> 
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.
> 
> The "fast and loose and fix it later" development model does not
> work for persistent storage algorithms; DAX is storage - not memory
> management - and so we need to treat it as such.

Okay.  To get our locking back to v4.2 levels here are the two commits I think
we need to look at:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

The former is the one that introduced the heavy reliance on write locks of the
mmap semaphore and introduced the various deadlocks that we've found.  The
latter moved some of that locking around so we didn't hold a write lock on the
mmap semaphore during unmap_mapping_range().

Does this sound correct to you?

On an unrelated note, while wandering through the XFS code I found the
following lock ordering documented above xfs_ilock():

 * Basic locking order:
 *
 * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
 *
 * mmap_sem locking order:
 *
 * i_iolock -> page lock -> mmap_sem
 * mmap_sem -> i_mmap_lock -> page_lock

I noticed that page_lock and i_mmap_lock are in different places in the
ordering depending on the presence or absence of mmap_sem.  Does this not open
us up to a lock ordering inversion?

Thread 1 (mmap_sem)			Thread 2 (no mmap_sem)
-------------------			----------------------
page_lock
mmap_sem
					i_mmap_lock
					<waiting for page_lock>
<waiting for i_mmap_lock>

Thanks,
- Ross
Dan Williams Sept. 28, 2015, 10:57 p.m. UTC | #12
On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
>> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
>> [..]
>> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
>> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
>> >
>> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
>> > the bad changes in 4.3, and then work towards fixing this for the
>> > 4.4 merge window. If someone needs this for 4.3, then they can
>> > backport the 4.4 code to 4.3-stable.
>> >
>>
>> If the proposal is to step back and get a running start at these fixes
>> for 4.4, then it is worth considering what the state of allocating
>> pages for DAX mappings will be in 4.4.
>
> Oh, do tell. I haven't seen any published design, code, etc,

This is via the devm_memremap_pages() api that went into 4.2 [1] and
my v1 (RFC quality) series using it for dax get_user_pages() [2].

[1]: https://lkml.org/lkml/2015/8/25/841
[2]: https://lkml.org/lkml/2015/9/23/11

> and I certainly haven't planned any time in the 4.4 window to do a
> complete audit, rework and test of the XFS DAX code. So if you want
> a working DAX implementation in the short term, we need to fix what
> we have and not do wholesale changes to infrastructure that put us
> back to square 1.

Yes, as Ross educated me, the current split of what is handled in the
filesystem vs what is handled in __dax_fault() potentially makes the
availability of struct page moot because the locking does not work if
initiated from within fs/dax.c...

> And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
> in XFS until I've had time to review, test and fix it so it works
> without deadlocking or corrupting data.

I'm in violent agreement, to the point where I'm pondering whether
CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
convinced ourselves of all the fixes in 4.4.  It's not clear to me
that we have a stable baseline to which we can revert this "still in
development" implementation, did you have one in mind?

>> It's already that case that
>> allocating struct page for DAX mappings is the only solution on the
>> horizon for enabling a get_user_pages() solution for persistent
>> memory.  We of course need to get the page-less DAX path fixed up, but
>> the near-term path to full functionality and safety is when struct
>> page is available to enable the typical synchronization mechanics.
>
> And we do so at the expense of medium to long term complexity and
> maintenance. I'm no fan of using struct pages to track terabytes to
> petabytes of persistent memory, and I'm even less of a fan of having
> to simultaneously support both struct page and pfn based DAX
> subsystems...

I'm no fan of tracking petabytes of persistent memory with struct
page, but we're in the near term space (hardware technology-wise) of
how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
persistent memory.  A page-less solution to that problem is not on the
horizon as far as I can tell.  In short, I am concerned we are
spending time working around the lack of struct page to get to a
stable page-less solution that is still missing support for the use
cases that are expected to "just work".

I do not think introducing page-back persistent memory sets us back to
square 1.  Instead, given the functionality that is enabled when pages
are present I think it is safe to assume most platforms will arrange
for page backed persistent memory.  If the page-less case is rare to
non-existent then we should design for the page-backed case at least
until the "petabytes of persistent memory" era arrives.  I think we
have plenty of time to get page-less right before it is needed, but we
have to get over the roadblocks that Christoph and I hit even trying
to convert the DMA-API over to be pfn based [3].

[3]: https://lkml.org/lkml/2015/8/12/682
Dave Chinner Sept. 29, 2015, 2:18 a.m. UTC | #13
On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
> On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
> >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
> >> [..]
> >> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> >> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
> >> >
> >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> >> > the bad changes in 4.3, and then work towards fixing this for the
> >> > 4.4 merge window. If someone needs this for 4.3, then they can
> >> > backport the 4.4 code to 4.3-stable.
> >> >
> >>
> >> If the proposal is to step back and get a running start at these fixes
> >> for 4.4, then it is worth considering what the state of allocating
> >> pages for DAX mappings will be in 4.4.
> >
> > Oh, do tell. I haven't seen any published design, code, etc,
> 
> This is via the devm_memremap_pages() api that went into 4.2 [1] and
> my v1 (RFC quality) series using it for dax get_user_pages() [2].
> 
> [1]: https://lkml.org/lkml/2015/8/25/841
> [2]: https://lkml.org/lkml/2015/9/23/11

I'll have a look at some point when I'm not trying to put out fires.

> > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
> > in XFS until I've had time to review, test and fix it so it works
> > without deadlocking or corrupting data.
> 
> I'm in violent agreement, to the point where I'm pondering whether
> CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
> convinced ourselves of all the fixes in 4.4.  It's not clear to me
> that we have a stable baseline to which we can revert this "still in
> development" implementation, did you have one in mind?

XFS warns that DAX is experimental when you mount with that option,
so there is no need to do that:

[  686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[  686.058464] XFS (ram0): Mounting V5 Filesystem
[  686.062857] XFS (ram0): Ending clean mount

> >> It's already that case that
> >> allocating struct page for DAX mappings is the only solution on the
> >> horizon for enabling a get_user_pages() solution for persistent
> >> memory.  We of course need to get the page-less DAX path fixed up, but
> >> the near-term path to full functionality and safety is when struct
> >> page is available to enable the typical synchronization mechanics.
> >
> > And we do so at the expense of medium to long term complexity and
> > maintenance. I'm no fan of using struct pages to track terabytes to
> > petabytes of persistent memory, and I'm even less of a fan of having
> > to simultaneously support both struct page and pfn based DAX
> > subsystems...
> 
> I'm no fan of tracking petabytes of persistent memory with struct
> page, but we're in the near term space (hardware technology-wise) of
> how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
> persistent memory.

Don't think I don't know that - as I said to someone a few hours
ago on IRC:

[29/09/15 07:41] <dchinner> I'm sure they do, but they have a hard requirement to support RDMA from persistent memory
[29/09/15 07:41] <dchinner> and that's what seems to be driving the "we need to use struct pages" design

> A page-less solution to that problem is not on the
> horizon as far as I can tell.  In short, I am concerned we are
> spending time working around the lack of struct page to get to a
> stable page-less solution that is still missing support for the use
> cases that are expected to "just work".

I'm concerned with making what we have work before we go and change
everything. You might want to move really quickly, but without sane
filesystem support you can't ship anything worth a damn. There's all
sorts of issues here, and introducing struct pages doesn't solve all
of them.

Let's concentrate on ensuring the basic operation of DAX is robust
first - get the page fault vs extent manipulations serialised, sane
and scalable before we start changing anything else. If we don't
solve these problems, then nothing else we do will be reliable, and
the problems exist regardless of whether we are using struct pages
or not. Hence these are the critical problems we need to fix before
anything else.

Once we have these issues sorted out, switching between struct page
and pfn should be much simpler because we don't have to worry about
different locking strategies to protect against truncate, racing
page faults, etc.

> I do not think introducing page-back persistent memory sets us back to
> square 1.  Instead, given the functionality that is enabled when pages
> are present I think it is safe to assume most platforms will arrange
> for page backed persistent memory.

Sure, but it will take a little time to get there. Moving fast
doesn't help us here - it only results in stuff we have to revert or
redo in the near future and that means progress is much slower than
it should be. Let's solve the DAX problems in the right order - it
will make things simpler and faster down the road.

Cheers,

Dave
Dave Chinner Sept. 29, 2015, 2:44 a.m. UTC | #14
On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
> <>
> > In reality, the require DAX page fault vs truncate serialisation is
> > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in
> > the fault, mkwrite and filesystem extent manipulation paths. There
> > is no way this sort of exclusion can be done in the mm/ subsystem as
> > it simply does not have the context to be able to provide the
> > necessary serialisation.  Every filesystem that wants to use DAX
> > needs to provide this external page fault serialisation, and in
> > doing so will also protect it's hole punch/extent swap/shift
> > operations under normal operation against racing mmap access....
> > 
> > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem.
> 
> So is it your belief that XFS already has correct locking in place to ensure
> that we don't hit these races?  I see XFS taking XFS_MMAPLOCK_SHARED before it
> calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and
> in xfs_filemap_fault().
> 
> XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't
> found the generic hole punching code yet, but I assume it takes
> XFS_MMAPLOCK_EXCL as well.

There is no generic hole punching. See xfs_file_fallocate(), where
most fallocate() based operations are protected, xfs_ioc_space()
where all the XFS ioctl based extent manipulations are protected,
xfs_swap_extents() where online defrag extent swaps are protected.
And we'll add it to any future operations that directly
manipulate extent mappings. 

> Meaning, is the work that we need to do around extent vs page fault locking
> basically adding equivalent locking to ext4 and ext2 and removing the attempts
> at locking from dax.c?

Yup. I'm not game to touch ext4 locking, though.

> 
> > > 4) Test all changes with xfstests using both xfs & ext4, using lockep.
> > > 
> > > Did I miss any issues, or does this path not solve one of them somehow?
> > > 
> > > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> > 
> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > the bad changes in 4.3, and then work towards fixing this for the
> > 4.4 merge window. If someone needs this for 4.3, then they can
> > backport the 4.4 code to 4.3-stable.
> > 
> > The "fast and loose and fix it later" development model does not
> > work for persistent storage algorithms; DAX is storage - not memory
> > management - and so we need to treat it as such.
> 
> Okay.  To get our locking back to v4.2 levels here are the two commits I think
> we need to look at:
> 
> commit 843172978bb9 ("dax: fix race between simultaneous faults")
> commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

Already testing a kernel with those reverted. My current DAX patch
stack is (bottom is first commit in stack):

f672ae4 xfs: add ->pfn_mkwrite support for DAX
6855c23 xfs: remove DAX complete_unwritten callback
e074bdf Revert "dax: fix race between simultaneous faults"
8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
a2ce6a5 xfs: DAX does not use IO completion callbacks
246c52a xfs: update size during allocation for DAX
9d10e7b xfs: Don't use unwritten extents for DAX
eaef807 xfs: factor out sector mapping.
e7f2d50 xfs: introduce per-inode DAX enablement

BTW, add to the problems that need fixing is that the pfn_mkwrite
code needs to check that the fault is still within EOF, like
__dax_fault does. i.e. the top patch in the series adds this
to xfs_filemap_pfn_mkwrite() instead of using dax_pfn_mkwrite():

....
+       /* check if the faulting page hasn't raced with truncate */
+       xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+       size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       if (vmf->pgoff >= size)
+               ret = VM_FAULT_SIGBUS;
+       xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+       sb_end_pagefault(inode->i_sb);

i.e. dax_pfn_mkwrite() doesn't work correctly when racing with
truncate (i.e. another way that ext2/ext4 are currently broken).

> On an unrelated note, while wandering through the XFS code I found the
> following lock ordering documented above xfs_ilock():
> 
>  * Basic locking order:
>  *
>  * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
>  *
>  * mmap_sem locking order:
>  *
>  * i_iolock -> page lock -> mmap_sem
>  * mmap_sem -> i_mmap_lock -> page_lock
> 
> I noticed that page_lock and i_mmap_lock are in different places in the
> ordering depending on the presence or absence of mmap_sem.  Does this not open
> us up to a lock ordering inversion?

Typo, not picked up in review (note the missing "_").

- * i_iolock -> page lock -> mmap_sem
+ * i_iolock -> page fault -> mmap_sem

Thanks for the heads-up on that.

Cheers,

Dave.
Dan Williams Sept. 29, 2015, 3:08 a.m. UTC | #15
On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
>> On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
>> >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
>> >> [..]
>> >> >> Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
>> >> >> you guys can provide guidance and code reviews for the XFS and ext4 bits?
>> >> >
>> >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
>> >> > the bad changes in 4.3, and then work towards fixing this for the
>> >> > 4.4 merge window. If someone needs this for 4.3, then they can
>> >> > backport the 4.4 code to 4.3-stable.
>> >> >
>> >>
>> >> If the proposal is to step back and get a running start at these fixes
>> >> for 4.4, then it is worth considering what the state of allocating
>> >> pages for DAX mappings will be in 4.4.
>> >
>> > Oh, do tell. I haven't seen any published design, code, etc,
>>
>> This is via the devm_memremap_pages() api that went into 4.2 [1] and
>> my v1 (RFC quality) series using it for dax get_user_pages() [2].
>>
>> [1]: https://lkml.org/lkml/2015/8/25/841
>> [2]: https://lkml.org/lkml/2015/9/23/11
>
> I'll have a look at some point when I'm not trying to put out fires.
>
>> > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
>> > in XFS until I've had time to review, test and fix it so it works
>> > without deadlocking or corrupting data.
>>
>> I'm in violent agreement, to the point where I'm pondering whether
>> CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
>> convinced ourselves of all the fixes in 4.4.  It's not clear to me
>> that we have a stable baseline to which we can revert this "still in
>> development" implementation, did you have one in mind?
>
> XFS warns that DAX is experimental when you mount with that option,
> so there is no need to do that:
>
> [  686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> [  686.058464] XFS (ram0): Mounting V5 Filesystem
> [  686.062857] XFS (ram0): Ending clean mount

Well that is comforting, although a similar warning is missing from
ext4.  I'll send a patch.

>> >> It's already that case that
>> >> allocating struct page for DAX mappings is the only solution on the
>> >> horizon for enabling a get_user_pages() solution for persistent
>> >> memory.  We of course need to get the page-less DAX path fixed up, but
>> >> the near-term path to full functionality and safety is when struct
>> >> page is available to enable the typical synchronization mechanics.
>> >
>> > And we do so at the expense of medium to long term complexity and
>> > maintenance. I'm no fan of using struct pages to track terabytes to
>> > petabytes of persistent memory, and I'm even less of a fan of having
>> > to simultaneously support both struct page and pfn based DAX
>> > subsystems...
>>
>> I'm no fan of tracking petabytes of persistent memory with struct
>> page, but we're in the near term space (hardware technology-wise) of
>> how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
>> persistent memory.
>
> Don't think I don't know that - as I said to someone a few hours
> ago on IRC:
>
> [29/09/15 07:41] <dchinner> I'm sure they do, but they have a hard requirement to support RDMA from persistent memory
> [29/09/15 07:41] <dchinner> and that's what seems to be driving the "we need to use struct pages" design

Fair enough...

>> A page-less solution to that problem is not on the
>> horizon as far as I can tell.  In short, I am concerned we are
>> spending time working around the lack of struct page to get to a
>> stable page-less solution that is still missing support for the use
>> cases that are expected to "just work".
>
> I'm concerned with making what we have work before we go and change
> everything. You might want to move really quickly, but without sane
> filesystem support you can't ship anything worth a damn. There's all
> sorts of issues here, and introducing struct pages doesn't solve all
> of them.
>
> Let's concentrate on ensuring the basic operation of DAX is robust
> first - get the page fault vs extent manipulations serialised, sane
> and scalable before we start changing anything else. If we don't
> solve these problems, then nothing else we do will be reliable, and
> the problems exist regardless of whether we are using struct pages
> or not. Hence these are the critical problems we need to fix before
> anything else.
>
> Once we have these issues sorted out, switching between struct page
> and pfn should be much simpler because we don't have to worry about
> different locking strategies to protect against truncate, racing
> page faults, etc.

It sounds like you have a page-independent/scalable method in mind for
solving the truncate protection problem?  I had always thought that
must require struct page, but if you're happy to carry that solution
in the filesystem you're not going to see resistance from me.

>> I do not think introducing page-back persistent memory sets us back to
>> square 1.  Instead, given the functionality that is enabled when pages
>> are present I think it is safe to assume most platforms will arrange
>> for page backed persistent memory.
>
> Sure, but it will take a little time to get there. Moving fast
> doesn't help us here - it only results in stuff we have to revert or
> redo in the near future and that means progress is much slower than
> it should be. Let's solve the DAX problems in the right order - it
> will make things simpler and faster down the road.

Sounds workable, although this thread is missing an ext4
representative so far.  Hopefully ext4 is equally open to solving
these problems generically without struct page.

Outside of that there's also basic device driver lifetime fixes in the
get_user_pages() series (patches 8-10) that are 4.4 material to stop
the trivial breakage from unbinding the pmem driver regardless of when
we decide to stage the others.

In any event, thanks for the attention and patience, Dave, much appreciated.
Dave Chinner Sept. 29, 2015, 4:19 a.m. UTC | #16
On Mon, Sep 28, 2015 at 08:08:13PM -0700, Dan Williams wrote:
> On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
> > I'm concerned with making what we have work before we go and change
> > everything. You might want to move really quickly, but without sane
> > filesystem support you can't ship anything worth a damn. There's all
> > sorts of issues here, and introducing struct pages doesn't solve all
> > of them.
> >
> > Let's concentrate on ensuring the basic operation of DAX is robust
> > first - get the page fault vs extent manipulations serialised, sane
> > and scalable before we start changing anything else. If we don't
> > solve these problems, then nothing else we do will be reliable, and
> > the problems exist regardless of whether we are using struct pages
> > or not. Hence these are the critical problems we need to fix before
> > anything else.
> >
> > Once we have these issues sorted out, switching between struct page
> > and pfn should be much simpler because we don't have to worry about
> > different locking strategies to protect against truncate, racing
> > page faults, etc.
> 
> It sounds like you have a page-independent/scalable method in mind for
> solving the truncate protection problem?

In mind? It was added to XFS back in 4.1 by commit 653c60b ("xfs:
introduce mmap/truncate lock").

> I had always thought that
> must require struct page, but if you're happy to carry that solution
> in the filesystem you're not going to see resistance from me.

The page based "truncate serialisation" solution in Linux has always
been a horrible, nasty hack. It only works because we always modify
the inode size before invalidating pages in the page cache. Hence we
can check once we have the page lock to see if the page is beyond
EOF. This EOF update hack avoids the need for a filesystem level
lock to guarantee multi-page truncate invalidation atomicity.

This, however, does not work for operations which require atomic
multi-page invalidation within EOF over multiple long running
operations. The page is not beyond EOF, so we have no way of
determining from the page taht we are racing with an invalidation of
the page. This affects operations such as hole punching, extent 
shifting up/down, swapping extents between different inodes, etc.

These *require* the filesystem to be able to lock out page faults
for the duration of the manipulation operation, as the extent
manipulations may not be done atomically from the perspective of
userspace driven page faults. They *are atomic* from the perspective
of the read/write syscall interfaces thanks to the XFS_IOLOCK_*
locking (usually the i_mutex provides this in other filesystems).

The XFS_MMAPLOCK_[SHARED|EXCL] locking was added to provide this
multi-page invalidation  vs page fault serialisation to XFS. It
works completely independently of any given struct page in the file,
and so does not rely on DAX to use struct page to serialise
correctly.  I've solved this problem in XFS because a generic
solution was not happening. Any filesystem that supports hole
punching and other fallocate-based manipulations needs similar
locking to be safe against page fault based /data corruption/ races
in these operations.

FWIW, direct IO also needs to exclude page faults for proper mmap
coherency, but that's much harder problem to solve because direct IO
takes the mmap_sem below the filesystem, and hence we can't hold any
locks over DIO submission that might be required in a page fault
within get_user_pages()....

> >> I do not think introducing page-back persistent memory sets us back to
> >> square 1.  Instead, given the functionality that is enabled when pages
> >> are present I think it is safe to assume most platforms will arrange
> >> for page backed persistent memory.
> >
> > Sure, but it will take a little time to get there. Moving fast
> > doesn't help us here - it only results in stuff we have to revert or
> > redo in the near future and that means progress is much slower than
> > it should be. Let's solve the DAX problems in the right order - it
> > will make things simpler and faster down the road.
> 
> Sounds workable, although this thread is missing an ext4
> representative so far.  Hopefully ext4 is equally open to solving
> these problems generically without struct page.

It appears to me that none of the ext4 developers have shown any
interest in fixing the problems reported with DAX. Some of those
problems existed before DAX came along (e.g. unwritten extent
conversion issues that can cause data corruption) but I've seen no
interest in fixing them, either.  Hence I'm quite happy to drop ext4
DAX support until an ext4 dev steps up and fixes the issues that
need fixing...

Cheers,

Dave.
Dave Chinner Sept. 30, 2015, 1:57 a.m. UTC | #17
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> > > > 4) Test all changes with xfstests using both xfs & ext4, using lockep.
> > > > 
> > > > Did I miss any issues, or does this path not solve one of them somehow?
> > > > 
> > > > Does this sound like a reasonable path forward for v4.3?  Dave, and Jan, can
> > > > you guys can provide guidance and code reviews for the XFS and ext4 bits?
> > > 
> > > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > > the bad changes in 4.3, and then work towards fixing this for the
> > > 4.4 merge window. If someone needs this for 4.3, then they can
> > > backport the 4.4 code to 4.3-stable.
> > > 
> > > The "fast and loose and fix it later" development model does not
> > > work for persistent storage algorithms; DAX is storage - not memory
> > > management - and so we need to treat it as such.
> > 
> > Okay.  To get our locking back to v4.2 levels here are the two commits I think
> > we need to look at:
> > 
> > commit 843172978bb9 ("dax: fix race between simultaneous faults")
> > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
> 
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
> 

And just to indicate why 4.3 is completely unrealistic, let me give
you a summary of this patchset so far:

> f672ae4 xfs: add ->pfn_mkwrite support for DAX

I *think* it works.

> 6855c23 xfs: remove DAX complete_unwritten callback

Gone.

> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks

DAX still needs to use IO completion callbacks for the DIO path, so
needed rewriting. Made 6855c23 redundant.

> 246c52a xfs: update size during allocation for DAX

Fundamentally broken, so removed. DIO passes the actual size from IO
completion, not into block allocation, hence DIO still needs
completion callbacks. DAX page faults can't change
the file size (should segv before we get here), so need to
specifically handle that to avoid leaking ioend structures due to
incorrect detection of EOF updates due to ovreflows...

> 9d10e7b xfs: Don't use unwritten extents for DAX

Exposed a behaviour in DIO and DAX that results in s64 variable
overflow when writing to the block at file offset (2^63 - 1FSB).
Both the DAX and DIO code ask for a mapping at:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096

which gives a size of 0x8000000000000000 (larger than
sb->s_maxbytes!) and results a sign overflow checking if a inode
size update is requireed.  Direct IO avoids this overflow because
the logic checks for unwritten extents first and the IO completion
callback that has the correct size. Removing unwritten extent
allocation from DAX exposed this bug through firing asserts all
through the XFS block mapping and IO completion callbacks....

Fixed the overflow, testing got further and then fsx exposed another
problem similar to the size update issue above. Patch is
fundamentally broken: block zeroing needs to be driven all the way
into the low level allocator implementation to fix the problems fsx
exposed.

> eaef807 xfs: factor out sector mapping.

Probably not going to be used now.

So, basically, I've rewritten most of the patch set once, and I'm
about to fundamentally change it again to address problems the first
two versions have exposed.  Hopefully this will show you the
complexity of what we are dealing with here, and why I said this
needs to go through 4.4?

It should also help explain why I suggested that if ext4 developers
aren't interested in fixing DAX problems then we should just drop
ext4 DAX support? Making this stuff work correctly requires more
than just a cursory knowledge of a filesystem, and nobody actively
working on DAX has the qualifications to make these sorts of changes
to ext4...

Cheers,

Dave.
Ross Zwisler Sept. 30, 2015, 2:04 a.m. UTC | #18
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
<>
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
> 
> f672ae4 xfs: add ->pfn_mkwrite support for DAX
> 6855c23 xfs: remove DAX complete_unwritten callback
> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks
> 246c52a xfs: update size during allocation for DAX
> 9d10e7b xfs: Don't use unwritten extents for DAX
> eaef807 xfs: factor out sector mapping.
> e7f2d50 xfs: introduce per-inode DAX enablement

Dave, would you be willing to share these patches with me, even if they are
just RFC?  I'm working through how to add equivalent support in both ext2 and
ext4, and a conceptual example in XFS would be really helpful.

Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
bits, I certainly wouldn't say no.  :)
Dave Chinner Sept. 30, 2015, 3:22 a.m. UTC | #19
On Tue, Sep 29, 2015 at 08:04:21PM -0600, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> > 
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
> 
> Dave, would you be willing to share these patches with me, even if they are
> just RFC?  I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.

When I have code that works properly and isn't fundamentally broken,
I'll post it. Hopefully within the next day.

> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no.  :)

ext2 already does the block zeroing in allocation, so it only needs
locking help...

Cheers,

Dave.
Jan Kara Oct. 2, 2015, 12:55 p.m. UTC | #20
On Tue 29-09-15 20:04:21, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> > 
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
> 
> Dave, would you be willing to share these patches with me, even if they are
> just RFC?  I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.
> 
> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no.  :)

I'm sorry for being slow but I was on vacation or travelling for
conferences the whole September (still at a conference now ;) so I'm in
mostly just catching up with what's going on... But I can help with making
necessary changes to ext4 to make DAX reliable there.

								Honza
diff mbox

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 7ae6df7..df1b0ac 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -405,7 +405,8 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 			if (error)
 				goto unlock;
 		} else {
-			i_mmap_unlock_write(mapping);
+			if (!page)
+				i_mmap_unlock_write(mapping);
 			return dax_load_hole(mapping, page, vmf);
 		}
 	}