Message ID | 1558356671-29599-2-git-send-email-jsimmons@infradead.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | More lustre patches | expand |
On Mon, May 20 2019, James Simmons wrote: > From: Patrick Farrell <pfarrell@whamcloud.com> > > Various error conditions in the fault path can cause us to > not return a page in vm_fault. Check if it's present > before accessing it. I cannot find any error conditions that would leave ->page NULL, but that wouldn't set one of VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED in 'result'. Can someone provide an example? (I have seen crashes with vmf->page being NULL, but they were caused by VM_FAULT_RETRY being #defined to 0 as lustre/llite/llite_internal.h still does on OpenSFS lustre) > > Additionally, it's not valid to return VM_FAULT_NOPAGE for > page faults. The correct return when accessing a page that > does not exist is VM_FAULT_SIGBUS. Correcting this avoids > looping infinitely in the testcase. I agree with that. VM_FAULT_NOPAGE is valid for page_mkwrite - and ll_page_mkwrite() has separate code to choose VM_FAULT_NOPAGE. So the change to to_fault_error() is valid. NeilBrown > > Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> > WC-bug-id: https://jira.whamcloud.com/browse/LU-11403 > Reviewed-on: https://review.whamcloud.com/34247 > Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> > Reviewed-by: Alexander Zarochentsev <c17826@cray.com> > Reviewed-by: Oleg Drokin <green@whamcloud.com> > Signed-off-by: James Simmons <jsimmons@infradead.org> > --- > fs/lustre/llite/llite_mmap.c | 6 ++---- > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c > index 1865db1..c8e57ad 100644 > --- a/fs/lustre/llite/llite_mmap.c > +++ b/fs/lustre/llite/llite_mmap.c > @@ -238,9 +238,6 @@ static inline vm_fault_t to_fault_error(int result) > case 0: > result = VM_FAULT_LOCKED; > break; > - case -EFAULT: > - result = VM_FAULT_NOPAGE; > - break; > case -ENOMEM: > result = VM_FAULT_OOM; > break; > @@ -366,7 +363,8 @@ static vm_fault_t ll_fault(struct vm_fault *vmf) > > restart: > result = __ll_fault(vmf->vma, vmf); > - if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { > + if (vmf->page && > + !(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { > struct page *vmpage = vmf->page; > > /* check if this page has been truncated */ > -- > 1.8.3.1
Neil, If you can’t find any, I imagine they don’t exist, at least in your branch, given that difference you cited. The particular case we had is here: https://jira.whamcloud.com/browse/LU-11403 Which is when the file exists but has no striping info, and hence no data. - Patrick
> > From: Patrick Farrell <pfarrell@whamcloud.com> > > > > Various error conditions in the fault path can cause us to > > not return a page in vm_fault. Check if it's present > > before accessing it. > > I cannot find any error conditions that would leave ->page NULL, > but that wouldn't set one of > VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED > in 'result'. > > Can someone provide an example? Reproducer is here: https://review.whamcloud.com/#/c/34247/7/lustre/tests/mmap_mknod_test.c Just run mmap_mknod_test "file" and it will show this problem. > (I have seen crashes with vmf->page being NULL, but they were caused > by VM_FAULT_RETRY being #defined to 0 as lustre/llite/llite_internal.h > still does on OpenSFS lustre) > > > > > Additionally, it's not valid to return VM_FAULT_NOPAGE for > > page faults. The correct return when accessing a page that > > does not exist is VM_FAULT_SIGBUS. Correcting this avoids > > looping infinitely in the testcase. > > I agree with that. VM_FAULT_NOPAGE is valid for page_mkwrite - and > ll_page_mkwrite() has separate code to choose VM_FAULT_NOPAGE. > So the change to to_fault_error() is valid. > > NeilBrown > > > > > > Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> > > WC-bug-id: https://jira.whamcloud.com/browse/LU-11403 > > Reviewed-on: https://review.whamcloud.com/34247 > > Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> > > Reviewed-by: Alexander Zarochentsev <c17826@cray.com> > > Reviewed-by: Oleg Drokin <green@whamcloud.com> > > Signed-off-by: James Simmons <jsimmons@infradead.org> > > --- > > fs/lustre/llite/llite_mmap.c | 6 ++---- > > 1 file changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c > > index 1865db1..c8e57ad 100644 > > --- a/fs/lustre/llite/llite_mmap.c > > +++ b/fs/lustre/llite/llite_mmap.c > > @@ -238,9 +238,6 @@ static inline vm_fault_t to_fault_error(int result) > > case 0: > > result = VM_FAULT_LOCKED; > > break; > > - case -EFAULT: > > - result = VM_FAULT_NOPAGE; > > - break; > > case -ENOMEM: > > result = VM_FAULT_OOM; > > break; > > @@ -366,7 +363,8 @@ static vm_fault_t ll_fault(struct vm_fault *vmf) > > > > restart: > > result = __ll_fault(vmf->vma, vmf); > > - if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { > > + if (vmf->page && > > + !(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { > > struct page *vmpage = vmf->page; > > > > /* check if this page has been truncated */ > > -- > > 1.8.3.1 >
On Wed, May 22 2019, Patrick Farrell wrote: > Neil, > > If you can’t find any, I imagine they don’t exist, at least in your branch, given that difference you cited. > > The particular case we had is here: > https://jira.whamcloud.com/browse/LU-11403 > > Which is when the file exists but has no striping info, and hence no data. Hi Patrick, Thanks for the reply. Looking at that issue it appears that the root cause is -EFAULT being returned from lov_io_init_empty, which then got returned by ll_fault_io_init and then was converted by to_fault_error into VM_FAULT_NOPAGE. The test if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { in ll_fault() doesn't notice VM_FAULT_NOPAGE (it is not part of VM_FAULT_ERROR), so the 'then' branch is run, which triggers the problem. Just changing to_fault_error() to convert -EFAULT to VM_FAULT_SIGBUS, as you did, cause the 'if' to do "the right thing", as VM_FAULT_SIGBUS is one of the flags in VM_FAULT_ERROR. So the extra test on vmf->page is redundant. If there has been no error, and we don't need to retry, then vmf->page *must* exist. I've changed the commit message to explain this, so what I have now is below. Thanks, NeilBrown From: Patrick Farrell <pfarrell@whamcloud.com> Date: Mon, 20 May 2019 08:50:43 -0400 Subject: [PATCH] lustre: llite: ll_fault fix Error conditions in the fault path such as a fault on a file without stripes (see lov_io_init_emtpy()) can cause us to not return a page in vm_fault, and to report -EFAULT. -EFAULT is currently mapped to VM_FAULT_NOPAGE by to_fault_error(), and ll_fault doesn't recognize this as an error which might leave ->page unset, and tries to dereference vmf->page. VM_FAULT_NOPAGE is *not* a valid status for .fault, only for .page_mkwrite and .pfn_mkwrite. So change to_fault_error() to instead map -EFAULT to VM_FAULT_SIGBUS. This is part of VM_FAULT_ERROR, so ll_fault will *not* dereference vmf->page when that error code is set. Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com> WC-bug-id: https://jira.whamcloud.com/browse/LU-11403 Reviewed-on: https://review.whamcloud.com/34247 Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Alexander Zarochentsev <c17826@cray.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Signed-off-by: James Simmons <jsimmons@infradead.org> Signed-off-by: NeilBrown <neilb@suse.com> --- fs/lustre/llite/llite_mmap.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c index 1865db1237ce..59fb40029306 100644 --- a/fs/lustre/llite/llite_mmap.c +++ b/fs/lustre/llite/llite_mmap.c @@ -238,9 +238,6 @@ static inline vm_fault_t to_fault_error(int result) case 0: result = VM_FAULT_LOCKED; break; - case -EFAULT: - result = VM_FAULT_NOPAGE; - break; case -ENOMEM: result = VM_FAULT_OOM; break;
Neil, Memory has surfaced here. There was a not-documented-where-I-can-find-it report of a null pointer on “page” here that came in around the same time running - I believe - racer (a big messy “do random things from many threads” test). No obvious explanation for how we got the problem but I added the check since it was simple enough. But I’m perfectly comfortable with you removing it and seeing if something occurs again. - Patrick
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c index 1865db1..c8e57ad 100644 --- a/fs/lustre/llite/llite_mmap.c +++ b/fs/lustre/llite/llite_mmap.c @@ -238,9 +238,6 @@ static inline vm_fault_t to_fault_error(int result) case 0: result = VM_FAULT_LOCKED; break; - case -EFAULT: - result = VM_FAULT_NOPAGE; - break; case -ENOMEM: result = VM_FAULT_OOM; break; @@ -366,7 +363,8 @@ static vm_fault_t ll_fault(struct vm_fault *vmf) restart: result = __ll_fault(vmf->vma, vmf); - if (!(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { + if (vmf->page && + !(result & (VM_FAULT_RETRY | VM_FAULT_ERROR | VM_FAULT_LOCKED))) { struct page *vmpage = vmf->page; /* check if this page has been truncated */