Message ID | 20201209092818.30417-1-osalvador@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm,hwpoison: Return -EBUSY when migration fails | expand |
On Wed, Dec 09, 2020 at 10:28:18AM +0100, Oscar Salvador wrote: > Currently, we return -EIO when we fail to migrate the page. > > Migrations' failures are rather transient as they can happen due to > several reasons, e.g: high page refcount bump, mapping->migrate_page > failing etc. > All meaning that at that time the page could not be migrated, but > that has nothing to do with an EIO error. > > Let us return -EBUSY instead, as we do in case we failed to isolate > the page. > > While are it, let us remove the "ret" print as its value does not change. > > Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
On 09.12.20 10:28, Oscar Salvador wrote: > Currently, we return -EIO when we fail to migrate the page. > > Migrations' failures are rather transient as they can happen due to > several reasons, e.g: high page refcount bump, mapping->migrate_page > failing etc. > All meaning that at that time the page could not be migrated, but > that has nothing to do with an EIO error. > > Let us return -EBUSY instead, as we do in case we failed to isolate > the page. > > While are it, let us remove the "ret" print as its value does not change. > > Signed-off-by: Oscar Salvador <osalvador@suse.de> > --- > mm/memory-failure.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 428991e297e2..1942fb83ac64 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page) > pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n", > pfn, msg_page[huge], ret, page->flags, &page->flags); > if (ret > 0) > - ret = -EIO; > + ret = -EBUSY; Do we expect callers to retry immediately? -EAGAIN might make also sense. But -EBUSY is an obvious improvement. Do we have callers relying on this behavior?
On 12/9/20 10:28 AM, Oscar Salvador wrote: > Currently, we return -EIO when we fail to migrate the page. > > Migrations' failures are rather transient as they can happen due to > several reasons, e.g: high page refcount bump, mapping->migrate_page > failing etc. > All meaning that at that time the page could not be migrated, but > that has nothing to do with an EIO error. > > Let us return -EBUSY instead, as we do in case we failed to isolate > the page. > > While are it, let us remove the "ret" print as its value does not change. > > Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM) though so nothing to adjust there. It's meant only for the hwpoison testing suite anyway. > --- > mm/memory-failure.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 428991e297e2..1942fb83ac64 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page) > pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n", > pfn, msg_page[huge], ret, page->flags, &page->flags); > if (ret > 0) > - ret = -EIO; > + ret = -EBUSY; > } > } else { > - pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n", > - pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags); > + pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %lx (%pGp)\n", > + pfn, msg_page[huge], page_count(page), page->flags, &page->flags); > ret = -EBUSY; > } > return ret; >
On Wed, Dec 09, 2020 at 10:59:04AM +0100, David Hildenbrand wrote: > On 09.12.20 10:28, Oscar Salvador wrote: > Do we expect callers to retry immediately? -EAGAIN might make also > sense. But -EBUSY is an obvious improvement. Do we have callers relying > on this behavior? Not really, unless something LTP takes a look at the error code in retries in case EBUSY. Take into account that most of the callers do not even really check the return code (GHES, RAS/CEC, etc.)
On Wed, Dec 09, 2020 at 11:25:31AM +0100, Vlastimil Babka wrote: > On 12/9/20 10:28 AM, Oscar Salvador wrote: > > Currently, we return -EIO when we fail to migrate the page. > > > > Migrations' failures are rather transient as they can happen due to > > several reasons, e.g: high page refcount bump, mapping->migrate_page > > failing etc. > > All meaning that at that time the page could not be migrated, but > > that has nothing to do with an EIO error. > > > > Let us return -EBUSY instead, as we do in case we failed to isolate > > the page. > > > > While are it, let us remove the "ret" print as its value does not change. > > > > Signed-off-by: Oscar Salvador <osalvador@suse.de> > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > > Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't > document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM) > though so nothing to adjust there. It's meant only for the hwpoison testing > suite anyway. Well, not only for hwpoison testing suite. RAS/CEC and GHES also use soft_offline_page by means of memory_failure_queue in case a page cec count goes beyond a certain thereshold, but they do not really check the return code. Only madvise does.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 428991e297e2..1942fb83ac64 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page) pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n", pfn, msg_page[huge], ret, page->flags, &page->flags); if (ret > 0) - ret = -EIO; + ret = -EBUSY; } } else { - pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n", - pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags); + pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %lx (%pGp)\n", + pfn, msg_page[huge], page_count(page), page->flags, &page->flags); ret = -EBUSY; } return ret;
Currently, we return -EIO when we fail to migrate the page. Migrations' failures are rather transient as they can happen due to several reasons, e.g: high page refcount bump, mapping->migrate_page failing etc. All meaning that at that time the page could not be migrated, but that has nothing to do with an EIO error. Let us return -EBUSY instead, as we do in case we failed to isolate the page. While are it, let us remove the "ret" print as its value does not change. Signed-off-by: Oscar Salvador <osalvador@suse.de> --- mm/memory-failure.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)