| Message ID | 1443040800-5460-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive) |
|---|---|
| State | Superseded |
On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote: > Fix the deadlock exposed by xfstests generic/075. Here is the sequence > that was causing us to deadlock: > > 1) enter __dax_fault() > 2) page = find_get_page() gives us a page, so skip > i_mmap_lock_write(mapping) > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) > passes, enter this block > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and > i_mmap_unlock_write(mapping); > return dax_load_hole(mapping, page, vmf); > > This causes us to up_write() a semaphore that we weren't holding. > > The up_write() on a semaphore we didn't down_write() happens twice in > a row, and then the next time we try and i_mmap_lock_write(), we hang. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > Reported-by: Dave Chinner <david@fromorbit.com> > --- > fs/dax.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 7ae6df7..df1b0ac 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > if (error) > goto unlock; > } else { > - i_mmap_unlock_write(mapping); > + if (!page) > + i_mmap_unlock_write(mapping); > return dax_load_hole(mapping, page, vmf); > } > } I can't review this properly because I can't work out how this locking is supposed to work. Captain, we have a Charlie Foxtrot situation here: page = find_get_page(mapping, vmf->pgoff) if (page) { .... } else { i_mmap_lock_write(mapping); } So if there's no page in the page cache, we lock the i_mmap_lock. Then we have the case the above patch fixes. Then later: if (vmf->cow_page) { ..... if (!page) { /* can fall through */ } return VM_FAULT_LOCKED; } Which means __dax_fault() can also return here with the i_mmap_lock_write() held. There's no documentation to indicate why this is valid, and only by looking about 4 function calls higher up the stack can I see that there's some attempt to handle this *specific return condition* (in do_cow_fault()). That also is lacking in documentation explaining the circumstances where we might have the i_mmap_lock_write() held and have to release it. (Not to mention the beautiful copy-n-waste of the unlock code, either.) The above code in __dax_fault() is then followed by this gem: /* Check we didn't race with a read fault installing a new page */ if (!page && major) page = find_lock_page(mapping, vmf->pgoff); if (page) { /* mapping invalidation .... */ } ..... if (!page) i_mmap_unlock_write(mapping); Which means that if we had a race with another read fault, we'll remove the page from the page cache, insert the new direct mapped pfn into the mapping, and *then fail to unlock the i_mmap lock*. Is this supposed to work this way? Or is it another bug? Another difficult question this change of locking raised that I can't answer: is it valid to call into the filesystem via getblock() or complete_unwritten() while holding the i_mmap_rwsem? This puts filesystem transactions and locks inside the scope of i_mmap_rwsem, which may have impact on the fact that we already have an inode lock order dependency w.r.t. i_mmap_rwsem through truncate (and probably other paths, too). So, please document the locking model, explain the corner cases and the intricacies like why *unbalanced, return value conditional locking* is necessary, and update the charts of lock order dependencies in places like mm/filemap.c, and then we might have some idea of how much of a train-wreck this actually is.... Cheers, Dave.
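For illustration: the unbalanced, conditional locking Dave is describing reduces to roughly the following condensed sketch of the __dax_fault() control flow (a paraphrase of the v4.3-rc code, not the verbatim source; hole_case and the surrounding comments stand in for the buffer-state logic):

/*
 * Whether i_mmap_rwsem is held at any given point depends on whether a
 * page cache page was found, so every return path must re-derive that
 * state or risk an unbalanced up_write().
 */
struct page *page = find_get_page(mapping, vmf->pgoff);

if (!page)
	i_mmap_lock_write(mapping);	/* held only on this branch */

/* ... get_block(), buffer state checks ... */

if (hole_case) {			/* stand-in for the !mapped/!unwritten test */
	if (!page)			/* the fix: unlock only if we locked */
		i_mmap_unlock_write(mapping);
	return dax_load_hole(mapping, page, vmf);
}

/* a racing read fault can set 'page' here, breaking the invariant ... */
if (!page && major)
	page = find_lock_page(mapping, vmf->pgoff);
/* ... */
if (!page)
	i_mmap_unlock_write(mapping);	/* skipped if 'page' was set above */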
On 09/24/2015 05:52 AM, Dave Chinner wrote: > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote: >> Fix the deadlock exposed by xfstests generic/075. Here is the sequence >> that was causing us to deadlock: >> >> 1) enter __dax_fault() >> 2) page = find_get_page() gives us a page, so skip >> i_mmap_lock_write(mapping) >> 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) >> passes, enter this block >> 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and >> i_mmap_unlock_write(mapping); >> return dax_load_hole(mapping, page, vmf); >> >> This causes us to up_write() a semaphore that we weren't holding. >> >> The up_write() on a semaphore we didn't down_write() happens twice in >> a row, and then the next time we try and i_mmap_lock_write(), we hang. >> >> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> >> Reported-by: Dave Chinner <david@fromorbit.com> >> --- >> fs/dax.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/fs/dax.c b/fs/dax.c >> index 7ae6df7..df1b0ac 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, >> if (error) >> goto unlock; >> } else { >> - i_mmap_unlock_write(mapping); >> + if (!page) >> + i_mmap_unlock_write(mapping); >> return dax_load_hole(mapping, page, vmf); >> } >> } > > I can't review this properly because I can't work out how this > locking is supposed to work. Captain, we have a Charlie Foxtrot > situation here: > > page = find_get_page(mapping, vmf->pgoff) > if (page) { > .... > } else { > i_mmap_lock_write(mapping); > } > > So if there's no page in the page cache, we lock the i_mmap_lock. > The we have the case the above patch fixes. Then later: > > if (vmf->cow_page) { > ..... > if (!page) { > /* can fall through */ > } > return VM_FAULT_LOCKED; > } > > Which means __dax_fault() can also return here with the > i_mmap_lock_write() held. There's no documentation to indicate why > this is valid, and only by looking about 4 function calls higher up > the stack can I see that there's some attempt to handle this > *specific return condition* (in do_cow_fault()). That also is > lacking in documentation explaining the circumstances where we might > have the i_mmap_lock_write() held and have to release it. (Not to > mention the beautiful copy-n-waste of the unlock code, either.) > > The above code in __dax_fault() is then followed by this gem: > > /* Check we didn't race with a read fault installing a new page */ > if (!page && major) > page = find_lock_page(mapping, vmf->pgoff); > > if (page) { > /* mapping invalidation .... */ > } > ..... > > if (!page) > i_mmap_unlock_write(mapping); > > Which means that if we had a race with another read fault, we'll > remove the page from the page cache, insert the new direct mapped > pfn into the mapping, and *then fail to unlock the i_mmap lock*. > > Is this supposed to work this way? Or is it another bug? > > Another difficult question this change of locking raised that I > can't answer: is it valid to call into the filesystem via getblock() > or complete_unwritten() while holding the i_mmap_rwsem? This puts > filesystem transactions and locks inside the scope of i_mmap_rwsem, > which may have impact on the fact that we already have an inode lock > order dependency w.r.t. i_mmap_rwsem through truncate (and probably > other paths, too). 
> > So, please document the locking model, explain the corner cases and > the intricacies like why *unbalanced, return value conditional > locking* is necessary, and update the charts of lock order > dependencies in places like mm/filemap.c, and then we might have > some idea of how much of a train-wreck this actually is.... Hi. I hate this VM_FAULT_LOCKED + !page which means i_mmap_lock. I still think it solves nothing and that we've done a really, really bad job. If we *easily* involve the FS in the locking here (which, btw, I think XFS already does), then all this i_mmap_lock can be avoided. Please remind me again: what race is it supposed to avoid? I get confused. > Cheers, > Dave. Thanks, Boaz
On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote: > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote: > > Fix the deadlock exposed by xfstests generic/075. Here is the sequence > > that was causing us to deadlock: > > > > 1) enter __dax_fault() > > 2) page = find_get_page() gives us a page, so skip > > i_mmap_lock_write(mapping) > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) > > passes, enter this block > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and > > i_mmap_unlock_write(mapping); > > return dax_load_hole(mapping, page, vmf); > > > > This causes us to up_write() a semaphore that we weren't holding. > > > > The up_write() on a semaphore we didn't down_write() happens twice in > > a row, and then the next time we try and i_mmap_lock_write(), we hang. > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > Reported-by: Dave Chinner <david@fromorbit.com> > > --- > > fs/dax.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 7ae6df7..df1b0ac 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > if (error) > > goto unlock; > > } else { > > - i_mmap_unlock_write(mapping); > > + if (!page) > > + i_mmap_unlock_write(mapping); > > return dax_load_hole(mapping, page, vmf); > > } > > } > > I can't review this properly because I can't work out how this > locking is supposed to work. Captain, we have a Charlie Foxtrot > situation here: > > page = find_get_page(mapping, vmf->pgoff) > if (page) { > .... > } else { > i_mmap_lock_write(mapping); > } > > So if there's no page in the page cache, we lock the i_mmap_lock. > The we have the case the above patch fixes. Then later: > > if (vmf->cow_page) { > ..... > if (!page) { > /* can fall through */ > } > return VM_FAULT_LOCKED; > } > > Which means __dax_fault() can also return here with the > i_mmap_lock_write() held. There's no documentation to indicate why > this is valid, and only by looking about 4 function calls higher up > the stack can I see that there's some attempt to handle this > *specific return condition* (in do_cow_fault()). That also is > lacking in documentation explaining the circumstances where we might > have the i_mmap_lock_write() held and have to release it. (Not to > mention the beautiful copy-n-waste of the unlock code, either.) > > The above code in __dax_fault() is then followed by this gem: > > /* Check we didn't race with a read fault installing a new page */ > if (!page && major) > page = find_lock_page(mapping, vmf->pgoff); > > if (page) { > /* mapping invalidation .... */ > } > ..... > > if (!page) > i_mmap_unlock_write(mapping); > > Which means that if we had a race with another read fault, we'll > remove the page from the page cache, insert the new direct mapped > pfn into the mapping, and *then fail to unlock the i_mmap lock*. > > Is this supposed to work this way? Or is it another bug? > > Another difficult question this change of locking raised that I > can't answer: is it valid to call into the filesystem via getblock() > or complete_unwritten() while holding the i_mmap_rwsem? This puts > filesystem transactions and locks inside the scope of i_mmap_rwsem, > which may have impact on the fact that we already have an inode lock > order dependency w.r.t. i_mmap_rwsem through truncate (and probably > other paths, too). 
> > So, please document the locking model, explain the corner cases and > the intricacies like why *unbalanced, return value conditional > locking* is necessary, and update the charts of lock order > dependencies in places like mm/filemap.c, and then we might have > some idea of how much of a train-wreck this actually is.... Yep, I saw these too, but didn't yet know how to deal with them. We have similar issues with __dax_pmd_fault(): i_mmap_lock_write(mapping); length = get_block(inode, block, &bh, write); if (length) return VM_FAULT_SIGBUS; Whoops, we just took the mmap semaphore, and then we have a return that doesn't release it. A quick test confirms that this will deadlock the next fault that tries to take the mmap semaphore. I agree that we need to give the fault handling code some attention when it comes to locking, and that we need to have better documentation. I'll work on this when I get some time. In the meantime I still think it's worthwhile to take the short term fix for the obvious generic/075 deadlock, yea? - Ross
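For illustration, the missing unlock Ross points at in __dax_pmd_fault() would most simply be fixed along these lines (a sketch against the quoted v4.3-rc2 excerpt, not a posted patch):

i_mmap_lock_write(mapping);
length = get_block(inode, block, &bh, write);
if (length) {
	/* previously missing: the early return leaked the rwsem */
	i_mmap_unlock_write(mapping);
	return VM_FAULT_SIGBUS;
}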
On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote: > On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote: > > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote: > > > Fix the deadlock exposed by xfstests generic/075. Here is the sequence > > > that was causing us to deadlock: > > > > > > 1) enter __dax_fault() > > > 2) page = find_get_page() gives us a page, so skip > > > i_mmap_lock_write(mapping) > > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) > > > passes, enter this block > > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and > > > i_mmap_unlock_write(mapping); > > > return dax_load_hole(mapping, page, vmf); > > > > > > This causes us to up_write() a semaphore that we weren't holding. > > > > > > The up_write() on a semaphore we didn't down_write() happens twice in > > > a row, and then the next time we try and i_mmap_lock_write(), we hang. > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > > Reported-by: Dave Chinner <david@fromorbit.com> > > > --- > > > fs/dax.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index 7ae6df7..df1b0ac 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > if (error) > > > goto unlock; > > > } else { > > > - i_mmap_unlock_write(mapping); > > > + if (!page) > > > + i_mmap_unlock_write(mapping); > > > return dax_load_hole(mapping, page, vmf); > > > } > > > } > > > > I can't review this properly because I can't work out how this > > locking is supposed to work. Captain, we have a Charlie Foxtrot > > situation here: > > > > page = find_get_page(mapping, vmf->pgoff) > > if (page) { > > .... > > } else { > > i_mmap_lock_write(mapping); > > } > > > > So if there's no page in the page cache, we lock the i_mmap_lock. > > The we have the case the above patch fixes. Then later: > > > > if (vmf->cow_page) { > > ..... > > if (!page) { > > /* can fall through */ > > } > > return VM_FAULT_LOCKED; > > } > > > > Which means __dax_fault() can also return here with the > > i_mmap_lock_write() held. There's no documentation to indicate why > > this is valid, and only by looking about 4 function calls higher up > > the stack can I see that there's some attempt to handle this > > *specific return condition* (in do_cow_fault()). That also is > > lacking in documentation explaining the circumstances where we might > > have the i_mmap_lock_write() held and have to release it. (Not to > > mention the beautiful copy-n-waste of the unlock code, either.) > > > > The above code in __dax_fault() is then followed by this gem: > > > > /* Check we didn't race with a read fault installing a new page */ > > if (!page && major) > > page = find_lock_page(mapping, vmf->pgoff); > > > > if (page) { > > /* mapping invalidation .... */ > > } > > ..... > > > > if (!page) > > i_mmap_unlock_write(mapping); > > > > Which means that if we had a race with another read fault, we'll > > remove the page from the page cache, insert the new direct mapped > > pfn into the mapping, and *then fail to unlock the i_mmap lock*. > > > > Is this supposed to work this way? Or is it another bug? > > > > Another difficult question this change of locking raised that I > > can't answer: is it valid to call into the filesystem via getblock() > > or complete_unwritten() while holding the i_mmap_rwsem? 
This puts > > filesystem transactions and locks inside the scope of i_mmap_rwsem, > > which may have impact on the fact that we already have an inode lock > > order dependency w.r.t. i_mmap_rwsem through truncate (and probably > > other paths, too). > > > > So, please document the locking model, explain the corner cases and > > the intricacies like why *unbalanced, return value conditional > > locking* is necessary, and update the charts of lock order > > dependencies in places like mm/filemap.c, and then we might have > > some idea of how much of a train-wreck this actually is.... > > Yep, I saw these too, but didn't yet know how to deal with them. We have at > similar issues with __dax_pmd_fault(): > > i_mmap_lock_write(mapping); > length = get_block(inode, block, &bh, write); > if (length) > return VM_FAULT_SIGBUS; > > Whoops, we just took the mmap semaphore, and then we have a return that > doesn't release it. A quick test confirms that this will deadlock the next > fault that tries to take the mmap semaphore. Ugh. So there's more than one broken commit and code path we are talking about here. Oh, even the "fallback" path is broken there - it converts the unwritten extent to written, even in the case where the underlying pages weren't zeroed. See this case: if (unlikely(!zero_page)) goto fallback; That's a stale data exposure bug, so has security implications.... > I agree that we need to give the fault handling code some attention when it > comes to locking, and that we need to have better documentation. I'll work on > this when I get some time. > > In the meantime I still think it's worthwhile to take the short term fix for > the obvious generic/075 deadlock, yea? I think a bunch of reverts are in order - the code is broken, has stale data exposure problems and we're going to have to rip the code out anyway because I don't think this is the right way to fix the underlying problem ("If two threads write-fault on the same hole at the same time...") We've already got block allocation serialisation at the filesystem level, and the issue is the unserialised block zeroing being done by the dax code. That can be fixed by moving the zeroing into the filesystem code when it runs "complete_unwritten" and checks whether the mapping has already been marked as written or not... I've recently pointed out in a different thread that this is the solution to whatever that problem was (can't recall which thread/problem it was now :/ ) and it's the same solution here. We already have the serialisation we need, we just need to move the block zeroing operation into the appropriate places to make it work correctly. Cheers, Dave.
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: <> > We've already got block allocation serialisation at the filesystem > level, and the issue is the unserialised block zeroing being done by > the dax code. That can be fixed by moving the zeroing into the > filesystem code when it runs "complete_unwritten" and checks whether > the mapping has already been marked as written or not... > > I've recently pointed out in a different thread that this is the > solution to whatever that problem was (can't recall which > thread/problem is was now :/ ) and it the same solution here. We > already have the serialisation we need, we just need to move the > block zeroing operation into the appropriate places to make it work > correctly. I think perhaps this is the thread that you're remembering: https://lkml.org/lkml/2015/8/11/731
On Fri, Sep 25, 2015 at 12:23:34PM -0600, Ross Zwisler wrote: > On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: > <> > > We've already got block allocation serialisation at the filesystem > > level, and the issue is the unserialised block zeroing being done by > > the dax code. That can be fixed by moving the zeroing into the > > filesystem code when it runs "complete_unwritten" and checks whether > > the mapping has already been marked as written or not... > > > > I've recently pointed out in a different thread that this is the > > solution to whatever that problem was (can't recall which > > thread/problem is was now :/ ) and it the same solution here. We > > already have the serialisation we need, we just need to move the > > block zeroing operation into the appropriate places to make it work > > correctly. > > I think perhaps this is the thread that you're remembering: > https://lkml.org/lkml/2015/8/11/731 Ah, yes, the read vs write fault race condition, whereas this change was to address write vs write fault races. And neither of them addresses the fault vs truncate problem properly, either, which is what the XFS_MMAPLOCK locking addresses. So, yeah, pushing the block zeroing down to the filesystem gets rid of multiple problems that concurrent page faults have with initialisation without adding any more serialisation or overhead. Cheers, Dave.
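To make the suggested direction concrete, a hypothetical sketch of filesystem-side zeroing in the unwritten-extent completion path; fs_extent_already_written() and fs_convert_unwritten() are invented names, and a real implementation would use each filesystem's own extent-state helpers and locking:

/*
 * Under the fs's own extent serialisation: do nothing if another fault
 * already won the race, otherwise zero first and convert to written
 * last, so a mapped extent is never visible with stale contents.
 */
static int fs_dax_complete_unwritten(struct inode *inode, sector_t block,
				     long size)
{
	if (fs_extent_already_written(inode, block))	/* invented helper */
		return 0;
	dax_clear_blocks(inode, block, size);		/* zero the pmem */
	return fs_convert_unwritten(inode, block, size); /* invented helper */
}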
On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: > On Thu, Sep 24, 2015 at 09:50:29AM -0600, Ross Zwisler wrote: > > On Thu, Sep 24, 2015 at 12:52:25PM +1000, Dave Chinner wrote: > > > On Wed, Sep 23, 2015 at 02:40:00PM -0600, Ross Zwisler wrote: > > > > Fix the deadlock exposed by xfstests generic/075. Here is the sequence > > > > that was causing us to deadlock: > > > > > > > > 1) enter __dax_fault() > > > > 2) page = find_get_page() gives us a page, so skip > > > > i_mmap_lock_write(mapping) > > > > 3) if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) > > > > passes, enter this block > > > > 4) if (vmf->flags & FAULT_FLAG_WRITE) fails, so do the else case and > > > > i_mmap_unlock_write(mapping); > > > > return dax_load_hole(mapping, page, vmf); > > > > > > > > This causes us to up_write() a semaphore that we weren't holding. > > > > > > > > The up_write() on a semaphore we didn't down_write() happens twice in > > > > a row, and then the next time we try and i_mmap_lock_write(), we hang. > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > > > Reported-by: Dave Chinner <david@fromorbit.com> > > > > --- > > > > fs/dax.c | 3 ++- > > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > > index 7ae6df7..df1b0ac 100644 > > > > --- a/fs/dax.c > > > > +++ b/fs/dax.c > > > > @@ -405,7 +405,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, > > > > if (error) > > > > goto unlock; > > > > } else { > > > > - i_mmap_unlock_write(mapping); > > > > + if (!page) > > > > + i_mmap_unlock_write(mapping); > > > > return dax_load_hole(mapping, page, vmf); > > > > } > > > > } > > > > > > I can't review this properly because I can't work out how this > > > locking is supposed to work. Captain, we have a Charlie Foxtrot > > > situation here: > > > > > > page = find_get_page(mapping, vmf->pgoff) > > > if (page) { > > > .... > > > } else { > > > i_mmap_lock_write(mapping); > > > } > > > > > > So if there's no page in the page cache, we lock the i_mmap_lock. > > > The we have the case the above patch fixes. Then later: > > > > > > if (vmf->cow_page) { > > > ..... > > > if (!page) { > > > /* can fall through */ > > > } > > > return VM_FAULT_LOCKED; > > > } > > > > > > Which means __dax_fault() can also return here with the > > > i_mmap_lock_write() held. There's no documentation to indicate why > > > this is valid, and only by looking about 4 function calls higher up > > > the stack can I see that there's some attempt to handle this > > > *specific return condition* (in do_cow_fault()). That also is > > > lacking in documentation explaining the circumstances where we might > > > have the i_mmap_lock_write() held and have to release it. (Not to > > > mention the beautiful copy-n-waste of the unlock code, either.) > > > > > > The above code in __dax_fault() is then followed by this gem: > > > > > > /* Check we didn't race with a read fault installing a new page */ > > > if (!page && major) > > > page = find_lock_page(mapping, vmf->pgoff); > > > > > > if (page) { > > > /* mapping invalidation .... */ > > > } > > > ..... > > > > > > if (!page) > > > i_mmap_unlock_write(mapping); > > > > > > Which means that if we had a race with another read fault, we'll > > > remove the page from the page cache, insert the new direct mapped > > > pfn into the mapping, and *then fail to unlock the i_mmap lock*. > > > > > > Is this supposed to work this way? Or is it another bug? 
> > > > > > Another difficult question this change of locking raised that I > > > can't answer: is it valid to call into the filesystem via getblock() > > > or complete_unwritten() while holding the i_mmap_rwsem? This puts > > > filesystem transactions and locks inside the scope of i_mmap_rwsem, > > > which may have impact on the fact that we already have an inode lock > > > order dependency w.r.t. i_mmap_rwsem through truncate (and probably > > > other paths, too). > > > > > > So, please document the locking model, explain the corner cases and > > > the intricacies like why *unbalanced, return value conditional > > > locking* is necessary, and update the charts of lock order > > > dependencies in places like mm/filemap.c, and then we might have > > > some idea of how much of a train-wreck this actually is.... > > > > Yep, I saw these too, but didn't yet know how to deal with them. We have at > > similar issues with __dax_pmd_fault(): > > > > i_mmap_lock_write(mapping); > > length = get_block(inode, block, &bh, write); > > if (length) > > return VM_FAULT_SIGBUS; > > > > Whoops, we just took the mmap semaphore, and then we have a return that > > doesn't release it. A quick test confirms that this will deadlock the next > > fault that tries to take the mmap semaphore. > > Ugh. So there's more than one broken commit and code path we are > talking about here. > > Oh, even the "fallback" path is broken there - it converts the > unwritten extent to written, even in the case where the underlying > pages weren't zeroed. See this case: > > if (unlikely(!zero_page)) > goto fallback; > > That's a stale data exposure bug, so has security implications.... > > > I agree that we need to give the fault handling code some attention when it > > comes to locking, and that we need to have better documentation. I'll work on > > this when I get some time. > > > > In the meantime I still think it's worthwhile to take the short term fix for > > the obvious generic/075 deadlock, yea? > > I think a bunch of reverts are in order -the code is broken, has > stale data exposure problems and we're going to have to rip the code > out anyway because I don't think this is the right way to fix the > underlying problem ("If two threads write-fault on the same hole at > the same time...") > > We've already got block allocation serialisation at the filesystem > level, and the issue is the unserialised block zeroing being done by > the dax code. That can be fixed by moving the zeroing into the > filesystem code when it runs "complete_unwritten" and checks whether > the mapping has already been marked as written or not... > > I've recently pointed out in a different thread that this is the > solution to whatever that problem was (can't recall which > thread/problem is was now :/ ) and it the same solution here. We > already have the serialisation we need, we just need to move the > block zeroing operation into the appropriate places to make it work > correctly. I think that at the end of the day the locking in the two DAX fault paths is trying to protect us against two things: 1) Simultaneous write-fault on the same hole at the same time. The winner of the race will return to userspace and complete their store, only to have the loser overwrite their store with zeroes. We are currently using mapping->i_mmap_rwsem to protect ourselves from this. Here is the patch that added this protection: https://lkml.org/lkml/2015/8/4/870 2) Races between page faults and truncate. 
If we have a struct page we protect ourselves by locking the page via lock_page_or_retry(). If we don't have a struct page we use mapping->i_mmap_rwsem. This is the source of all of the mapping->i_mmap_rwsem locking being based on !page. The protection provided by mapping->i_mmap_rwsem against truncate has been in place in some form since v4.0. To get them all written down in one place, here are all the issues and questions that I currently know about regarding the DAX fault paths as of v4.3-rc2: 1) In __dax_pmd_fault() the block inside the first "if (buffer_unwritten(&bh) || buffer_new(&bh))" test doesn't set kaddr before starting to zero. This was fixed by the following patch which has been accepted into the -mm tree: https://lkml.org/lkml/2015/9/22/989 2) In __dax_pmd_fault() we have this code: i_mmap_lock_write(mapping); length = get_block(inode, block, &bh, write); if (length) return VM_FAULT_SIGBUS; We just took the mapping->i_mmap_rwsem semaphore, and then we have a return that doesn't release it. A quick test confirms that this will deadlock the next fault that tries to take the mapping->i_mmap_rwsem semaphore. 3) In __dax_pmd_fault() we convert the unwritten extent to written, even in the case where the underlying pages weren't zeroed. See this case: if (unlikely(!zero_page)) goto fallback; That's a stale data exposure bug, so has security implications.... There are other places where we head through this completion path because of an error, and those should similarly only mark the extents as written if they have been zeroed. 4) In __dax_fault() we can return through this path: } else { i_mmap_unlock_write(mapping); return dax_load_hole(mapping, page, vmf); } without having ever taken the mapping->i_mmap_rwsem semaphore. This means we end up releasing a semaphore that we never took, which can lead to a deadlock. This was the original bug report for xfstest generic/075, and is currently addressed by the following patch in the -mm tree: https://lkml.org/lkml/2015/9/23/607 5) In __dax_fault() we rely on the presence of a page pointer to know whether or not we should release mapping->i_mmap_rwsem at the end of the function. Unfortunately, we can set the page pointer in the middle of the function while the semaphore is held: /* Check we didn't race with a read fault installing a new page */ if (!page && major) page = find_lock_page(mapping, vmf->pgoff); This will cause us to fail the "if (!page)" test at the end of the function, making us fail to release the semaphore on exit. 6) In __dax_fault() in the vmf->cow_page case we can end up exiting with status VM_FAULT_LOCKED but with mapping->i_mmap_rwsem held. This is actually correct behavior and is documented in the commit message of the patch that added this code: commit 2e4cdab0584fa884e0a81c4f45b93ce875c9fcaa https://lkml.org/lkml/2015/1/13/635 This breaks the locking rules used by the rest of the function, though, and needs comments. 7) In both __dax_fault() and __dax_pmd_fault() we call the filesystem's get_block() and complete_unwritten() functions while holding mapping->i_mmap_rwsem. There is a concern that this could potentially lead to a lock inversion, leading to a deadlock. Here's how I think we can address the above issues (basically "do what Dave said"): 1) For the simultaneous write-fault issue, move the responsibility for zeroing the extents and converting them to written to the filesystem as Dave suggests. 
This will require us to get the coordination between DAX and the filesystem right in each filesystem we support (currently xfs and ext4), but it will take advantage of fine-grained locking already present in the filesystem and will let us remove much of our reliance on mapping->i_mmap_rwsem. This will also address the scalability issues mentioned as future work in Matthew's patch that introduced the heavy reliance on mapping->i_mmap_rwsem: https://lkml.org/lkml/2015/8/4/870 This will allow us to scale back our usage of mapping->i_mmap_rwsem, helping us deal with many of the above issues. This will also take care of issue #3 (information leak of non-zeroed blocks marked as written) as DAX will no longer be responsible for converting newly zeroed blocks to written. 2) For the remaining uses of mapping->i_mmap_rwsem in the fault paths that protect us against truncate, address any remaining code issues listed above, clearly document in the fault handlers the locking rules and the exceptions to those rules (cow_page + VM_FAULT_LOCKED). Update any locking order changes like in mm/filemap.c as necessary. 3) For issue #7 above (lock inversion concerns between i_mmap_rwsem and the filesystem locking needed for get_block() and complete_unwritten()), I think that the complete_unwritten() calls in DAX will go away since we will be allocating, zeroing and converting extents to written all in the filesystem. Similarly, I think that the calls to get_block() will no longer happen while the mapping->i_mmap_rwsem is held - this was the case as of v4.2, and I think we can probably get back there with finer-grained FS locking. 4) Test all changes with xfstests using both xfs & ext4, using lockdep. Did I miss any issues, or does this path not solve one of them somehow? Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can you guys provide guidance and code reviews for the XFS and ext4 bits? Thanks, - Ross
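One possible shape for a fix to issue #5 in the list above (a sketch, not a posted patch): track the i_mmap_rwsem state in a local flag instead of inferring it from the page pointer, which can change mid-function:

bool unlock_mapping = false;

page = find_get_page(mapping, vmf->pgoff);
if (!page) {
	i_mmap_lock_write(mapping);
	unlock_mapping = true;		/* remember what we actually took */
}

/* ... fault handling; 'page' may be set later by find_lock_page() ... */

if (unlock_mapping)
	i_mmap_unlock_write(mapping);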
On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: > On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: > > I think a bunch of reverts are in order -the code is broken, has > > stale data exposure problems and we're going to have to rip the code > > out anyway because I don't think this is the right way to fix the > > underlying problem ("If two threads write-fault on the same hole at > > the same time...") [...] > I think that at the end of the day the locking in the two DAX fault paths is > trying to protect us against two things: > > 1) Simultaneous write-fault on the same hole at the same time. > 2) Races between page faults and truncate. 1) is fixed by pushing block zeroing into get_blocks() 2) is fixed by the filesystem locking out page faults during truncate. XFS already does this with its XFS_MMAPLOCK_[SHARED|EXCL] locking on page faults and truncate... > To get them all written down in one place, here are all the issues and > questions that I currently know about regarding the DAX fault paths as of > v4.3-rc2: > > 1) In __dax_pmd_fault() the block inside the first "if (buffer_unwritten(&bh) [fail to set kaddr, patch in -mm] > 2) In __dax_pmd_fault() we have this code: [fail to release i_mmap_lock_write()] > 3) In __dax_pmd_fault() we convert the unwritten extent to written .... > 4) In __dax_fault() we can return through this path: [incorrect release of i_mmap_lock_write()] > 5) In __dax_fault() we rely on the presence of a page pointer to know whether [fail to release i_mmap_lock_write() on racing read faults] > 6) In __dax_fault() in the vmf->cow_page case we can end up exiting with > status VM_FAULT_LOCKED but with mapping->i_mmap_rwsem held. This is [undocumented, unbalanced, but handled] > 7) In both __dax_fault() and __dax_pmd_fault() we call the filesystem's > get_block() and complete_unwritten() functions while holding > mapping->i_mmap_rwsem. There is a concern that this could potentially lead > to a lock inversion, leading to a deadlock. [more investigation needed] > Here's how I think we can address the above issues (basically "do what Dave > said"): > > 1) For the simultaneous write-fault issue, move the responsibility for zeroing > the extents and converting them to written to the filesystem as Dave *nod* > 2) For the remaining uses of mapping->i_mmap_rwsem in the fault paths that > protect us against truncate, address any remaining code issues listed > above, clearly document in the fault handlers the locking rules and the > exceptions to those rules (cow_page + VM_FAULT_LOCKED). Update any locking > order changes like in mm/filemap.c as necessary. Incorrect. Truncate is a special case of hole punching, and the mm subsystem has long relied on a nasty hack to serialise page faults against truncate. That is: lock_page(page); if (page beyond EOF or mapping is different) { /* OMG FAIL FAIL FAIL */ } This "page beyond EOF" does not work for operations like hole punching or other direct filesystem extent manipulation operations that occur within the current EOF and require mapping invalidation and serialisation until after the extent manipulation is done. IOWs, all of these operations require the filesystem to prevent new page faults from occurring after the initial invalidation and until the extent manipulation is completed. This includes truncate - the "page beyond EOF" hack is not necessary anymore, because truncate has the same invalidation behaviour w.r.t. punching out mappings within EOF... 
And with DAX, we have no struct page to lock, so there really isn't anything in the mm subsystem that can be used to ensure that a given range of a file is not being otherwise manipulated in a manner that races with a page fault. The i_mmap_rwsem is not a replacement for the page lock.... In reality, the required DAX page fault vs truncate serialisation is provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in the fault, mkwrite and filesystem extent manipulation paths. There is no way this sort of exclusion can be done in the mm/ subsystem as it simply does not have the context to be able to provide the necessary serialisation. Every filesystem that wants to use DAX needs to provide this external page fault serialisation, and in doing so will also protect its hole punch/extent swap/shift operations under normal operation against racing mmap access.... IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem. > 3) For issue #7 above (lock inversion concerns between i_mmap_rwsem and the > filesystem locking needed for get_block() and complete_unwritten()), I > think that the complete_unwritten() calls in DAX will go away since we will > be allocating, zeroing and converting extents to written all in the > filesystem. Similarly, I think that the calls to get_block() will no > longer happen while the mapping->i_mmap_rwsem is held - this was the case > as of v4.2, and I think we can probably get bet there with finger grained > FS locking. ext2 already does this zeroing. See the call to dax_clear_blocks() in ext2_get_blocks. I should be able to do something similar in XFS easily enough. > 4) Test all changes with xfstests using both xfs & ext4, using lockep. > > Did I miss any issues, or does this path not solve one of them somehow? > > Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can > you guys can provide guidance and code reviews for the XFS and ext4 bits? IMO, it's way too much to get into 4.3. I'd much prefer we revert the bad changes in 4.3, and then work towards fixing this for the 4.4 merge window. If someone needs this for 4.3, then they can backport the 4.4 code to 4.3-stable. The "fast and loose and fix it later" development model does not work for persistent storage algorithms; DAX is storage - not memory management - and so we need to treat it as such. Cheers, Dave.
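The ext2 behaviour Dave cites, paraphrased from ext2_get_blocks() in fs/ext2/inode.c around v4.2 (error handling trimmed): newly allocated blocks are zeroed via DAX before they are inserted into the block tree, under ext2's truncate_mutex, so no other thread can fault on an unzeroed block:

if (IS_DAX(inode)) {
	/* block must be initialised before we put it in the tree */
	err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
			       1 << inode->i_blkbits);
	if (err)
		goto cleanup;	/* unlock/cleanup path trimmed in this sketch */
}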
On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: [..] >> Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can >> you guys can provide guidance and code reviews for the XFS and ext4 bits? > > IMO, it's way too much to get into 4.3. I'd much prefer we revert > the bad changes in 4.3, and then work towards fixing this for the > 4.4 merge window. If someone needs this for 4.3, then they can > backport the 4.4 code to 4.3-stable. > If the proposal is to step back and get a running start at these fixes for 4.4, then it is worth considering what the state of allocating pages for DAX mappings will be in 4.4. It's already the case that allocating struct page for DAX mappings is the only solution on the horizon for enabling a get_user_pages() solution for persistent memory. We of course need to get the page-less DAX path fixed up, but the near-term path to full functionality and safety is when struct page is available to enable the typical synchronization mechanics. That said, it's not clear that saves us any work given the axonram and dcssblk DAX drivers do not currently allocate struct page, and pmem optionally allocates struct page.
On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote: > On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: > >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: > [..] > >> Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can > >> you guys can provide guidance and code reviews for the XFS and ext4 bits? > > > > IMO, it's way too much to get into 4.3. I'd much prefer we revert > > the bad changes in 4.3, and then work towards fixing this for the > > 4.4 merge window. If someone needs this for 4.3, then they can > > backport the 4.4 code to 4.3-stable. > > > > If the proposal is to step back and get a running start at these fixes > for 4.4, then it is worth considering what the state of allocating > pages for DAX mappings will be in 4.4. Oh, do tell. I haven't seen any published design, code, etc, and I certainly haven't planned any time in the 4.4 window to do a complete audit, rework and test of the XFS DAX code. So if you want a working DAX implementation in the short term, we need to fix what we have and not do wholesale changes to infrastructure that put us back to square 1. And, quite frankly, I'm not enabling any new DAX behaviour/subsystem in XFS until I've had time to review, test and fix it so it works without deadlocking or corrupting data. > It's already that case that > allocating struct page for DAX mappings is the only solution on the > horizon for enabling a get_user_pages() solution for persistent > memory. We of course need to get the page-less DAX path fixed up, but > the near-term path to full functionality and safety is when struct > page is available to enable the typical synchronization mechanics. And we do so at the expense of medium to long term complexity and maintenance. I'm no fan of using struct pages to track terabytes to petabytes of persistent memory, and I'm even less of a fan of having to simultaneously support both struct page and pfn based DAX subsystems... > That said, it's not clear that saves us any work given the axonram and > dcssblk DAX drivers do not currently allocate struct page, and pmem > optionally allocates struct page. Precisely my point. Cheers, Dave.
On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote: > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: <> > In reality, the require DAX page fault vs truncate serialisation is > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in > the fault, mkwrite and filesystem extent manipulation paths. There > is no way this sort of exclusion can be done in the mm/ subsystem as > it simply does not have the context to be able to provide the > necessary serialisation. Every filesystem that wants to use DAX > needs to provide this external page fault serialisation, and in > doing so will also protect it's hole punch/extent swap/shift > operations under normal operation against racing mmap access.... > > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem. So is it your belief that XFS already has correct locking in place to ensure that we don't hit these races? I see XFS taking XFS_MMAPLOCK_SHARED before it calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and in xfs_filemap_fault(). XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't found the generic hole punching code yet, but I assume it takes XFS_MMAPLOCK_EXCL as well. Meaning, is the work that we need to do around extent vs page fault locking basically adding equivalent locking to ext4 and ext2 and removing the attempts at locking from dax.c? > > 4) Test all changes with xfstests using both xfs & ext4, using lockep. > > > > Did I miss any issues, or does this path not solve one of them somehow? > > > > Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can > > you guys can provide guidance and code reviews for the XFS and ext4 bits? > > IMO, it's way too much to get into 4.3. I'd much prefer we revert > the bad changes in 4.3, and then work towards fixing this for the > 4.4 merge window. If someone needs this for 4.3, then they can > backport the 4.4 code to 4.3-stable. > > The "fast and loose and fix it later" development model does not > work for persistent storage algorithms; DAX is storage - not memory > management - and so we need to treat it as such. Okay. To get our locking back to v4.2 levels here are the two commits I think we need to look at: commit 843172978bb9 ("dax: fix race between simultaneous faults") commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") The former is the one that introduced the heavy reliance on write locks of the mmap semaphore and introduced the various deadlocks that we've found. The latter moved some of that locking around so we didn't hold a write lock on the mmap semaphore during unmap_mapping_range(). Does this sound correct to you? On an unrelated note, while wandering through the XFS code I found the following lock ordering documented above xfs_ilock():

 * Basic locking order:
 *
 * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
 *
 * mmap_sem locking order:
 *
 * i_iolock -> page lock -> mmap_sem
 * mmap_sem -> i_mmap_lock -> page_lock

I noticed that page_lock and i_mmap_lock are in different places in the ordering depending on the presence or absence of mmap_sem. Does this not open us up to a lock ordering inversion?

Thread 1 (mmap_sem)          Thread 2 (no mmap_sem)
-------------------          ----------------------
                             page_lock
mmap_sem
i_mmap_lock
<waiting for page_lock>      <waiting for i_mmap_lock>

Thanks, - Ross
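For reference, the XFS fault-side locking Ross describes looks roughly like this, paraphrased from fs/xfs/xfs_file.c around v4.3 (the DAX write-fault redirection and the completion callback are omitted from this sketch):

/*
 * Faults run under XFS_MMAPLOCK_SHARED; truncate, hole punch and
 * extent swap take XFS_MMAPLOCK_EXCL, so extent manipulations are
 * serialised against page faults at the filesystem level.
 */
STATIC int
xfs_filemap_fault(
	struct vm_area_struct	*vma,
	struct vm_fault		*vmf)
{
	struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
	int			ret;

	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
	if (IS_DAX(VFS_I(ip)))
		ret = __dax_fault(vma, vmf, xfs_get_blocks_direct, NULL);
	else
		ret = filemap_fault(vma, vmf);
	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

	return ret;
}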
On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote: > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote: >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: >> [..] >> >> Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can >> >> you guys can provide guidance and code reviews for the XFS and ext4 bits? >> > >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert >> > the bad changes in 4.3, and then work towards fixing this for the >> > 4.4 merge window. If someone needs this for 4.3, then they can >> > backport the 4.4 code to 4.3-stable. >> > >> >> If the proposal is to step back and get a running start at these fixes >> for 4.4, then it is worth considering what the state of allocating >> pages for DAX mappings will be in 4.4. > > Oh, do tell. I haven't seen any published design, code, etc, This is via the devm_memremap_pages() api that went into 4.2 [1] and my v1 (RFC quality) series using it for dax get_user_pages() [2]. [1]: https://lkml.org/lkml/2015/8/25/841 [2]: https://lkml.org/lkml/2015/9/23/11 > and I certainly haven't planned any time in the 4.4 window to do a > complete audit, rework and test of the XFS DAX code. So if you want > a working DAX implementation in the short term, we need to fix what > we have and not do wholesale changes to infrastructure that put us > back to square 1. Yes, as Ross educated me, the current split of what is handled in the filesystem vs what is handled in __dax_fault() potentially makes the availability of struct page moot because the locking does not work if initiated from within fs/dax.c... > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem > in XFS until I've had time to review, test and fix it so it works > without deadlocking or corrupting data. I'm in violent agreement, to the point where I'm pondering whether CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've convinced ourselves of all the fixes in 4.4. It's not clear to me that we have a stable baseline to which we can revert this "still in development" implementation, did you have one in mind? >> It's already that case that >> allocating struct page for DAX mappings is the only solution on the >> horizon for enabling a get_user_pages() solution for persistent >> memory. We of course need to get the page-less DAX path fixed up, but >> the near-term path to full functionality and safety is when struct >> page is available to enable the typical synchronization mechanics. > > And we do so at the expense of medium to long term complexity and > maintenance. I'm no fan of using struct pages to track terabytes to > petabytes of persistent memory, and I'm even less of a fan of having > to simultaneously support both struct page and pfn based DAX > subsystems... I'm no fan of tracking petabytes of persistent memory with struct page, but we're in the near term space (hardware technology-wise) of how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of persistent memory. A page-less solution to that problem is not on the horizon as far as I can tell. In short, I am concerned we are spending time working around the lack of struct page to get to a stable page-less solution that is still missing support for the use cases that are expected to "just work". 
I do not think introducing page-backed persistent memory sets us back to square 1. Instead, given the functionality that is enabled when pages are present, I think it is safe to assume most platforms will arrange for page-backed persistent memory. If the page-less case is rare to non-existent, then we should design for the page-backed case at least until the "petabytes of persistent memory" era arrives. I think we have plenty of time to get page-less right before it is needed, but we have to get over the roadblocks that Christoph and I hit even trying to convert the DMA-API over to be pfn based [3]. [3]: https://lkml.org/lkml/2015/8/12/682
On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote: > On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote: > >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: > >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote: > >> [..] > >> >> Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can > >> >> you guys can provide guidance and code reviews for the XFS and ext4 bits? > >> > > >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert > >> > the bad changes in 4.3, and then work towards fixing this for the > >> > 4.4 merge window. If someone needs this for 4.3, then they can > >> > backport the 4.4 code to 4.3-stable. > >> > > >> > >> If the proposal is to step back and get a running start at these fixes > >> for 4.4, then it is worth considering what the state of allocating > >> pages for DAX mappings will be in 4.4. > > > > Oh, do tell. I haven't seen any published design, code, etc, > > This is via the devm_memremap_pages() api that went into 4.2 [1] and > my v1 (RFC quality) series using it for dax get_user_pages() [2]. > > [1]: https://lkml.org/lkml/2015/8/25/841 > [2]: https://lkml.org/lkml/2015/9/23/11 I'll have a look at some point when I'm not trying to put out fires. > > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem > > in XFS until I've had time to review, test and fix it so it works > > without deadlocking or corrupting data. > > I'm in violent agreement, to the point where I'm pondering whether > CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've > convinced ourselves of all the fixes in 4.4. It's not clear to me > that we have a stable baseline to which we can revert this "still in > development" implementation, did you have one in mind? XFS warns that DAX is experimental when you mount with that option, so there is no need to do that: [ 686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [ 686.058464] XFS (ram0): Mounting V5 Filesystem [ 686.062857] XFS (ram0): Ending clean mount > >> It's already that case that > >> allocating struct page for DAX mappings is the only solution on the > >> horizon for enabling a get_user_pages() solution for persistent > >> memory. We of course need to get the page-less DAX path fixed up, but > >> the near-term path to full functionality and safety is when struct > >> page is available to enable the typical synchronization mechanics. > > > > And we do so at the expense of medium to long term complexity and > > maintenance. I'm no fan of using struct pages to track terabytes to > > petabytes of persistent memory, and I'm even less of a fan of having > > to simultaneously support both struct page and pfn based DAX > > subsystems... > > I'm no fan of tracking petabytes of persistent memory with struct > page, but we're in the near term space (hardware technology-wise) of > how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of > persistent memory. Don't think I don't know that - as I said to someone a few hours ago on IRC: [29/09/15 07:41] <dchinner> I'm sure they do, but they have a hard requirement to support RDMA from persistent memory [29/09/15 07:41] <dchinner> and that's what seems to be driving the "we need to use struct pages" design > A page-less solution to that problem is not on the > horizon as far as I can tell. 
> In short, I am concerned we are > spending time working around the lack of struct page to get to a > stable page-less solution that is still missing support for the use > cases that are expected to "just work". I'm concerned with making what we have work before we go and change everything. You might want to move really quickly, but without sane filesystem support you can't ship anything worth a damn. There's all sorts of issues here, and introducing struct pages doesn't solve all of them. Let's concentrate on ensuring the basic operation of DAX is robust first - get the page fault vs extent manipulations serialised, sane and scalable before we start changing anything else. If we don't solve these problems, then nothing else we do will be reliable, and the problems exist regardless of whether we are using struct pages or not. Hence these are the critical problems we need to fix before anything else. Once we have these issues sorted out, switching between struct page and pfn should be much simpler because we don't have to worry about different locking strategies to protect against truncate, racing page faults, etc. > I do not think introducing page-back persistent memory sets us back to > square 1. Instead, given the functionality that is enabled when pages > are present I think it is safe to assume most platforms will arrange > for page backed persistent memory. Sure, but it will take a little time to get there. Moving fast doesn't help us here - it only results in stuff we have to revert or redo in the near future and that means progress is much slower than it should be. Let's solve the DAX problems in the right order - it will make things simpler and faster down the road. Cheers, Dave
On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote: > On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote: > > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote: > <> > > In reality, the require DAX page fault vs truncate serialisation is > > provided for XFS by the XFS_MMAPLOCK_* inode locking that is done in > > the fault, mkwrite and filesystem extent manipulation paths. There > > is no way this sort of exclusion can be done in the mm/ subsystem as > > it simply does not have the context to be able to provide the > > necessary serialisation. Every filesystem that wants to use DAX > > needs to provide this external page fault serialisation, and in > > doing so will also protect it's hole punch/extent swap/shift > > operations under normal operation against racing mmap access.... > > > > IOWs, for DAX this needs to be fixed in ext4, not the mm/ subsystem. > > So is it your belief that XFS already has correct locking in place to ensure > that we don't hit these races? I see XFS taking XFS_MMAPLOCK_SHARED before it > calls __dax_fault() in both xfs_filemap_page_mkwrite() (via __dax_mkwrite) and > in xfs_filemap_fault(). > > XFS takes XFS_MMAPLOCK_EXCL before a truncate in xfs_vn_setattr() - I haven't > found the generic hole punching code yet, but I assume it takes > XFS_MMAPLOCK_EXCL as well. There is no generic hole punching. See xfs_file_fallocate(), where most fallocate() based operations are protected, xfs_ioc_space() where all the XFS ioctl based extent manipulations are protected, xfs_swap_extents() where online defrag extent swaps are protected. And we'll add it to any future operations that directly manipulate extent mappings. > Meaning, is the work that we need to do around extent vs page fault locking > basically adding equivalent locking to ext4 and ext2 and removing the attempts > at locking from dax.c? Yup. I'm not game to touch ext4 locking, though. > > > > 4) Test all changes with xfstests using both xfs & ext4, using lockep. > > > > > > Did I miss any issues, or does this path not solve one of them somehow? > > > > > > Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can > > > you guys can provide guidance and code reviews for the XFS and ext4 bits? > > > > IMO, it's way too much to get into 4.3. I'd much prefer we revert > > the bad changes in 4.3, and then work towards fixing this for the > > 4.4 merge window. If someone needs this for 4.3, then they can > > backport the 4.4 code to 4.3-stable. > > > > The "fast and loose and fix it later" development model does not > > work for persistent storage algorithms; DAX is storage - not memory > > management - and so we need to treat it as such. > > Okay. To get our locking back to v4.2 levels here are the two commits I think > we need to look at: > > commit 843172978bb9 ("dax: fix race between simultaneous faults") > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX") Already testing a kernel with those reverted. My current DAX patch stack is (bottom is first commit in stack): f672ae4 xfs: add ->pfn_mkwrite support for DAX 6855c23 xfs: remove DAX complete_unwritten callback e074bdf Revert "dax: fix race between simultaneous faults" 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX" a2ce6a5 xfs: DAX does not use IO completion callbacks 246c52a xfs: update size during allocation for DAX 9d10e7b xfs: Don't use unwritten extents for DAX eaef807 xfs: factor out sector mapping. 
BTW, add to the problems that need fixing: the pfn_mkwrite code needs
to check that the fault is still within EOF, like __dax_fault does.
i.e. the top patch in the series adds this to xfs_filemap_pfn_mkwrite()
instead of using dax_pfn_mkwrite():

	....
+	/* check if the faulting page hasn't raced with truncate */
+	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
+	sb_end_pagefault(inode->i_sb);

i.e. dax_pfn_mkwrite() doesn't work correctly when racing with
truncate (i.e. another way that ext2/ext4 are currently broken).

> On an unrelated note, while wandering through the XFS code I found the
> following lock ordering documented above xfs_ilock():
>
>  * Basic locking order:
>  *
>  * i_iolock -> i_mmap_lock -> page_lock -> i_ilock
>  *
>  * mmap_sem locking order:
>  *
>  * i_iolock -> page lock -> mmap_sem
>  * mmap_sem -> i_mmap_lock -> page_lock
>
> I noticed that page_lock and i_mmap_lock are in different places in the
> ordering depending on the presence or absence of mmap_sem. Does this not open
> us up to a lock ordering inversion?

Typo, not picked up in review (note the missing "_"):

- * i_iolock -> page lock -> mmap_sem
+ * i_iolock -> page fault -> mmap_sem

Thanks for the heads-up on that.

Cheers,

Dave.
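For context, the fragment quoted above presumably sits in a handler
shaped roughly like this - a sketch assuming the surrounding 4.3-era
interfaces (sb_start_pagefault(), file_update_time(), VM_FAULT_NOPAGE
on success); Dave's actual patch may differ in detail:

/*
 * Sketch of a pfn_mkwrite handler with the EOF-vs-truncate check
 * folded in. Illustrative only.
 */
static int xfs_filemap_pfn_mkwrite_sketch(struct vm_area_struct *vma,
					  struct vm_fault *vmf)
{
	struct inode		*inode = file_inode(vma->vm_file);
	struct xfs_inode	*ip = XFS_I(inode);
	int			ret = VM_FAULT_NOPAGE;
	pgoff_t			size;

	sb_start_pagefault(inode->i_sb);
	file_update_time(vma->vm_file);

	/* check if the faulting page hasn't raced with truncate */
	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (vmf->pgoff >= size)
		ret = VM_FAULT_SIGBUS;
	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

	sb_end_pagefault(inode->i_sb);
	return ret;
}

Holding XFS_MMAPLOCK_SHARED across the i_size check is what closes
the race: a concurrent truncate must take the lock exclusive before
it can change the file size and invalidate mappings.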
On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
>> On Mon, Sep 28, 2015 at 2:35 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Sep 28, 2015 at 05:13:50AM -0700, Dan Williams wrote:
>> >> On Sun, Sep 27, 2015 at 5:59 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Fri, Sep 25, 2015 at 09:17:45PM -0600, Ross Zwisler wrote:
>> >> >> On Fri, Sep 25, 2015 at 12:53:57PM +1000, Dave Chinner wrote:
>> [..]
>> >> >> Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can
>> >> >> you guys provide guidance and code reviews for the XFS and ext4 bits?
>> >> >
>> >> > IMO, it's way too much to get into 4.3. I'd much prefer we revert
>> >> > the bad changes in 4.3, and then work towards fixing this for the
>> >> > 4.4 merge window. If someone needs this for 4.3, then they can
>> >> > backport the 4.4 code to 4.3-stable.
>> >> >
>> >>
>> >> If the proposal is to step back and get a running start at these fixes
>> >> for 4.4, then it is worth considering what the state of allocating
>> >> pages for DAX mappings will be in 4.4.
>> >
>> > Oh, do tell. I haven't seen any published design, code, etc,
>>
>> This is via the devm_memremap_pages() api that went into 4.2 [1] and
>> my v1 (RFC quality) series using it for dax get_user_pages() [2].
>>
>> [1]: https://lkml.org/lkml/2015/8/25/841
>> [2]: https://lkml.org/lkml/2015/9/23/11
>
> I'll have a look at some point when I'm not trying to put out fires.
>
>> > And, quite frankly, I'm not enabling any new DAX behaviour/subsystem
>> > in XFS until I've had time to review, test and fix it so it works
>> > without deadlocking or corrupting data.
>>
>> I'm in violent agreement, to the point where I'm pondering whether
>> CONFIG_FS_DAX should just depend on CONFIG_BROKEN in 4.3 until we've
>> convinced ourselves of all the fixes in 4.4. It's not clear to me
>> that we have a stable baseline to which we can revert this "still in
>> development" implementation, did you have one in mind?
>
> XFS warns that DAX is experimental when you mount with that option,
> so there is no need to do that:
>
> [  686.055780] XFS (ram0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> [  686.058464] XFS (ram0): Mounting V5 Filesystem
> [  686.062857] XFS (ram0): Ending clean mount

Well that is comforting, although a similar warning is missing from
ext4. I'll send a patch.

>> >> It's already the case that
>> >> allocating struct page for DAX mappings is the only solution on the
>> >> horizon for enabling a get_user_pages() solution for persistent
>> >> memory. We of course need to get the page-less DAX path fixed up, but
>> >> the near-term path to full functionality and safety is when struct
>> >> page is available to enable the typical synchronization mechanics.
>> >
>> > And we do so at the expense of medium to long term complexity and
>> > maintenance. I'm no fan of using struct pages to track terabytes to
>> > petabytes of persistent memory, and I'm even less of a fan of having
>> > to simultaneously support both struct page and pfn based DAX
>> > subsystems...
>>
>> I'm no fan of tracking petabytes of persistent memory with struct
>> page, but we're in the near term space (hardware technology-wise) of
>> how to enable DMA/RDMA to 100s of gigabytes to a few terabytes of
>> persistent memory.
> Don't think I don't know that - as I said to someone a few hours
> ago on IRC:
>
> [29/09/15 07:41] <dchinner> I'm sure they do, but they have a hard requirement to support RDMA from persistent memory
> [29/09/15 07:41] <dchinner> and that's what seems to be driving the "we need to use struct pages" design

Fair enough...

>> A page-less solution to that problem is not on the
>> horizon as far as I can tell. In short, I am concerned we are
>> spending time working around the lack of struct page to get to a
>> stable page-less solution that is still missing support for the use
>> cases that are expected to "just work".
>
> I'm concerned with making what we have work before we go and change
> everything. You might want to move really quickly, but without sane
> filesystem support you can't ship anything worth a damn. There's all
> sorts of issues here, and introducing struct pages doesn't solve all
> of them.
>
> Let's concentrate on ensuring the basic operation of DAX is robust
> first - get the page fault vs extent manipulations serialised, sane
> and scalable before we start changing anything else. If we don't
> solve these problems, then nothing else we do will be reliable, and
> the problems exist regardless of whether we are using struct pages
> or not. Hence these are the critical problems we need to fix before
> anything else.
>
> Once we have these issues sorted out, switching between struct page
> and pfn should be much simpler because we don't have to worry about
> different locking strategies to protect against truncate, racing
> page faults, etc.

It sounds like you have a page-independent/scalable method in mind
for solving the truncate protection problem? I had always thought
that must require struct page, but if you're happy to carry that
solution in the filesystem you're not going to see resistance from
me.

>> I do not think introducing page-backed persistent memory sets us back to
>> square 1. Instead, given the functionality that is enabled when pages
>> are present I think it is safe to assume most platforms will arrange
>> for page backed persistent memory.
>
> Sure, but it will take a little time to get there. Moving fast
> doesn't help us here - it only results in stuff we have to revert or
> redo in the near future and that means progress is much slower than
> it should be. Let's solve the DAX problems in the right order - it
> will make things simpler and faster down the road.

Sounds workable, although this thread is missing an ext4
representative so far. Hopefully ext4 is equally open to solving
these problems generically, without struct page.

Outside of that, there are also basic device driver lifetime fixes
in the get_user_pages() series (patches 8-10) that are 4.4 material,
to stop the trivial breakage from unbinding the pmem driver,
regardless of when we decide to stage the others.

In any event, thanks for the attention and patience, Dave, much
appreciated.
On Mon, Sep 28, 2015 at 08:08:13PM -0700, Dan Williams wrote:
> On Mon, Sep 28, 2015 at 7:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Sep 28, 2015 at 03:57:29PM -0700, Dan Williams wrote:
> > I'm concerned with making what we have work before we go and change
> > everything. You might want to move really quickly, but without sane
> > filesystem support you can't ship anything worth a damn. There's all
> > sorts of issues here, and introducing struct pages doesn't solve all
> > of them.
> >
> > Let's concentrate on ensuring the basic operation of DAX is robust
> > first - get the page fault vs extent manipulations serialised, sane
> > and scalable before we start changing anything else. If we don't
> > solve these problems, then nothing else we do will be reliable, and
> > the problems exist regardless of whether we are using struct pages
> > or not. Hence these are the critical problems we need to fix before
> > anything else.
> >
> > Once we have these issues sorted out, switching between struct page
> > and pfn should be much simpler because we don't have to worry about
> > different locking strategies to protect against truncate, racing
> > page faults, etc.
>
> It sounds like you have a page-independent/scalable method in mind for
> solving the truncate protection problem?

In mind? It was added to XFS back in 4.1 by commit 653c60b ("xfs:
introduce mmap/truncate lock").

> I had always thought that
> must require struct page, but if you're happy to carry that solution
> in the filesystem you're not going to see resistance from me.

The page based "truncate serialisation" solution in Linux has always
been a horrible, nasty hack. It only works because we always modify
the inode size before invalidating pages in the page cache. Hence we
can check, once we have the page lock, whether the page is beyond
EOF (a sketch of this check follows below). This EOF update hack
avoids the need for a filesystem level lock to guarantee multi-page
truncate invalidation atomicity.

This, however, does not work for operations which require atomic
multi-page invalidation within EOF over multiple long running
operations. The page is not beyond EOF, so we have no way of
determining from the page that we are racing with an invalidation of
the page. This affects operations such as hole punching, extent
shifting up/down, swapping extents between different inodes, etc.

These *require* the filesystem to be able to lock out page faults
for the duration of the manipulation operation, as the extent
manipulations may not be done atomically from the perspective of
userspace driven page faults. They *are* atomic from the perspective
of the read/write syscall interfaces thanks to the XFS_IOLOCK_*
locking (usually i_mutex provides this in other filesystems).

The XFS_MMAPLOCK_[SHARED|EXCL] locking was added to provide this
multi-page invalidation vs page fault serialisation to XFS. It works
completely independently of any given struct page in the file, and
so does not rely on DAX using struct page to serialise correctly.

I've solved this problem in XFS because a generic solution was not
happening. Any filesystem that supports hole punching and other
fallocate-based manipulations needs similar locking to be safe
against page fault based /data corruption/ races in these
operations.
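To make the page-based hack concrete, here is roughly what the
classic truncate race check looks like from a fault path - a generic
sketch under the assumptions above, not any particular filesystem's
code:

/*
 * Sketch of the page-based truncate serialisation hack: truncate
 * updates i_size before invalidating the page cache, so a fault
 * that wins the page lock can detect the race by re-checking the
 * page against EOF. Illustrative only.
 */
static int check_page_vs_truncate(struct inode *inode, struct page *page)
{
	pgoff_t eof;

	lock_page(page);

	/* i_size was already updated if a truncate is in progress */
	eof = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
	if (page->index >= eof || page->mapping != inode->i_mapping) {
		/* lost the race: back out and retry or SIGBUS */
		unlock_page(page);
		return -EAGAIN;
	}

	/* page is within EOF and still attached; returns locked,
	 * caller unlocks when done */
	return 0;
}

As Dave notes, this only catches invalidations that move EOF; a hole
punch invalidates pages *within* EOF, so the check above can never
see the race - hence the need for a real filesystem-level lock.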
FWIW, direct IO also needs to exclude page faults for proper mmap
coherency, but that's a much harder problem to solve because direct
IO takes the mmap_sem below the filesystem, and hence we can't hold
any locks over DIO submission that might be required in a page fault
within get_user_pages()....

> >> I do not think introducing page-backed persistent memory sets us back to
> >> square 1. Instead, given the functionality that is enabled when pages
> >> are present I think it is safe to assume most platforms will arrange
> >> for page backed persistent memory.
> >
> > Sure, but it will take a little time to get there. Moving fast
> > doesn't help us here - it only results in stuff we have to revert or
> > redo in the near future and that means progress is much slower than
> > it should be. Let's solve the DAX problems in the right order - it
> > will make things simpler and faster down the road.
>
> Sounds workable, although this thread is missing an ext4
> representative so far. Hopefully ext4 is equally open to solving
> these problems generically without struct page.

It appears to me that none of the ext4 developers have shown any
interest in fixing the problems reported with DAX. Some of those
problems existed before DAX came along (e.g. unwritten extent
conversion issues that can cause data corruption) but I've seen no
interest in fixing them, either. Hence I'm quite happy to drop ext4
DAX support until an ext4 dev steps up and fixes the issues that
need fixing...

Cheers,

Dave.
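The direct IO vs page fault entanglement mentioned above can be seen
from userspace with a pattern like the following - a hypothetical
illustration (the file name and sizes are arbitrary), where an
O_DIRECT read targets a buffer that is itself an mmap of the same
file, so pinning the buffer via get_user_pages() during DIO
submission faults on the file already under IO:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1 << 20;	/* 1 MiB, assumed block-aligned */
	int fd = open("testfile", O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Destination buffer is a writable mapping of the same file. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * DIO submission pins buf via get_user_pages(), which takes
	 * mmap_sem and may fault - re-entering the fault path for the
	 * very file that is under direct IO.
	 */
	ssize_t n = pread(fd, buf, len, 0);
	printf("read %zd bytes\n", n);

	munmap(buf, len);
	close(fd);
	return 0;
}

Because the fault happens with mmap_sem held, below the filesystem,
any filesystem lock held across submission and also taken in the
fault path would invert the lock order - which is why the exclusion
can't simply be bolted on with the MMAPLOCK.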
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> On Mon, Sep 28, 2015 at 04:40:01PM -0600, Ross Zwisler wrote:
> > > > 4) Test all changes with xfstests using both xfs & ext4, using lockdep.
> > > >
> > > > Did I miss any issues, or does this path not solve one of them somehow?
> > > >
> > > > Does this sound like a reasonable path forward for v4.3? Dave, and Jan, can
> > > > you guys provide guidance and code reviews for the XFS and ext4 bits?
> > >
> > > IMO, it's way too much to get into 4.3. I'd much prefer we revert
> > > the bad changes in 4.3, and then work towards fixing this for the
> > > 4.4 merge window. If someone needs this for 4.3, then they can
> > > backport the 4.4 code to 4.3-stable.
> > >
> > > The "fast and loose and fix it later" development model does not
> > > work for persistent storage algorithms; DAX is storage - not memory
> > > management - and so we need to treat it as such.
> >
> > Okay. To get our locking back to v4.2 levels here are the two commits I think
> > we need to look at:
> >
> > commit 843172978bb9 ("dax: fix race between simultaneous faults")
> > commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
>
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):

And just to indicate why 4.3 is completely unrealistic, let me give
you a summary of this patchset so far:

> f672ae4 xfs: add ->pfn_mkwrite support for DAX

I *think* it works.

> 6855c23 xfs: remove DAX complete_unwritten callback

Gone.

> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks

DAX still needs to use IO completion callbacks for the DIO path, so
this needed rewriting. Made 6855c23 redundant.

> 246c52a xfs: update size during allocation for DAX

Fundamentally broken, so removed. DIO passes the actual size from IO
completion, not into block allocation, hence DIO still needs
completion callbacks. DAX page faults can't change the file size
(they should segv before we get here), so we need to handle that
specifically to avoid leaking ioend structures through incorrect
detection of EOF updates caused by overflows...

> 9d10e7b xfs: Don't use unwritten extents for DAX

Exposed a behaviour in DIO and DAX that results in s64 variable
overflow when writing to the block at file offset (2^63 - 1FSB).
Both the DAX and DIO code ask for a mapping at:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096

which gives a size of 0x8000000000000000 (larger than
sb->s_maxbytes!) and results in a sign overflow when checking whether
an inode size update is required (the arithmetic is illustrated after
this summary). Direct IO avoids this overflow because the logic
checks for unwritten extents first, and it is the IO completion
callback that carries the correct size. Removing unwritten extent
allocation from DAX exposed this bug by firing asserts all through
the XFS block mapping and IO completion callbacks....

Fixed the overflow, testing got further, and then fsx exposed
another problem similar to the size update issue above. The patch is
fundamentally broken: block zeroing needs to be driven all the way
into the low level allocator implementation to fix the problems fsx
exposed.

> eaef807 xfs: factor out sector mapping.

Probably not going to be used now.

So, basically, I've rewritten most of the patch set once, and I'm
about to fundamentally change it again to address problems the first
two versions have exposed.
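The overflow arithmetic referenced above, as a standalone
illustration - the offset and count come from the trace, but the
checks themselves are simplified stand-ins for the real XFS logic:

/*
 * Standalone illustration of the s64 overflow at file offset
 * (2^63 - 1FSB). Values match the xfs_get_blocks_alloc trace above.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	int64_t offset = 0x7ffffffffffff000LL;	/* last legal 4k block */
	int64_t count  = 4096;
	int64_t i_size = 0x7ffffffffffff000LL;

	/*
	 * offset + count == 2^63, which wraps to INT64_MIN. Do the
	 * addition as unsigned to keep this demo free of
	 * signed-overflow undefined behaviour.
	 */
	int64_t end = (int64_t)((uint64_t)offset + (uint64_t)count);

	/* Naive EOF-update check: end is now negative, so the IO
	 * reaching EOF goes undetected. */
	if (end > i_size)
		printf("size update required\n");
	else
		printf("BUG: end=%lld wrapped negative, update missed\n",
		       (long long)end);

	/* Overflow-safe form (assumes offset <= i_size). */
	if (count > i_size - offset)
		printf("safe check: IO extends to/past EOF\n");

	return 0;
}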
Hopefully this will show you the complexity of what we are dealing
with here, and why I said this needs to go through 4.4? It should
also help explain why I suggested that if ext4 developers aren't
interested in fixing DAX problems then we should just drop ext4 DAX
support?

Making this stuff work correctly requires more than just a cursory
knowledge of a filesystem, and nobody actively working on DAX has
the qualifications to make these sorts of changes to ext4...

Cheers,

Dave.
On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
<>
> Already testing a kernel with those reverted. My current DAX patch
> stack is (bottom is first commit in stack):
>
> f672ae4 xfs: add ->pfn_mkwrite support for DAX
> 6855c23 xfs: remove DAX complete_unwritten callback
> e074bdf Revert "dax: fix race between simultaneous faults"
> 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> a2ce6a5 xfs: DAX does not use IO completion callbacks
> 246c52a xfs: update size during allocation for DAX
> 9d10e7b xfs: Don't use unwritten extents for DAX
> eaef807 xfs: factor out sector mapping.
> e7f2d50 xfs: introduce per-inode DAX enablement

Dave, would you be willing to share these patches with me, even if
they are just RFC? I'm working through how to add equivalent support
in both ext2 and ext4, and a conceptual example in XFS would be
really helpful.

Eric and Jan, if you guys have any spare cycles to help with the
ext2/ext4 bits, I certainly wouldn't say no. :)
On Tue, Sep 29, 2015 at 08:04:21PM -0600, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> >
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
>
> Dave, would you be willing to share these patches with me, even if they are
> just RFC? I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.

When I have code that works properly and isn't fundamentally broken,
I'll post it. Hopefully within the next day.

> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no. :)

ext2 already does the block zeroing in allocation, so it only needs
locking help...

Cheers,

Dave.
On Tue 29-09-15 20:04:21, Ross Zwisler wrote:
> On Tue, Sep 29, 2015 at 12:44:58PM +1000, Dave Chinner wrote:
> <>
> > Already testing a kernel with those reverted. My current DAX patch
> > stack is (bottom is first commit in stack):
> >
> > f672ae4 xfs: add ->pfn_mkwrite support for DAX
> > 6855c23 xfs: remove DAX complete_unwritten callback
> > e074bdf Revert "dax: fix race between simultaneous faults"
> > 8ba0157 Revert "mm: take i_mmap_lock in unmap_mapping_range() for DAX"
> > a2ce6a5 xfs: DAX does not use IO completion callbacks
> > 246c52a xfs: update size during allocation for DAX
> > 9d10e7b xfs: Don't use unwritten extents for DAX
> > eaef807 xfs: factor out sector mapping.
> > e7f2d50 xfs: introduce per-inode DAX enablement
>
> Dave, would you be willing to share these patches with me, even if they are
> just RFC? I'm working through how to add equivalent support in both ext2 and
> ext4, and a conceptual example in XFS would be really helpful.
>
> Eric and Jan, if you guys have any spare cycles to help with the ext2/ext4
> bits, I certainly wouldn't say no. :)

I'm sorry for being slow, but I was on vacation or travelling to
conferences for the whole of September (still at a conference now ;)
so I'm mostly just catching up with what's going on... But I can
help with making the necessary changes to ext4 to make DAX reliable
there.

								Honza