Message ID | 20181114211704.6381-4-david@redhat.com (mailing list archive)
---|---
State | New, archived
Series | mm/kdump: allow to exclude pages that are logically offline
Hi David,

On 11/14/18 at 10:17pm, David Hildenbrand wrote:
> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
> makedumpfile can directly skip pages that are logically offline and the
> content therefore stale.

It would be good to copy some background info from the cover letter to
the patch description so that we can get a better understanding of why
this is needed now.

BTW, Lianbo is working on documentation of the vmcoreinfo exported
fields. Ccing him so that he is aware of this.

Also cc Boris; although I do not want the doc changes to block this, he
might have a different opinion :)

>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  kernel/crash_core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 933cb3e45b98..093c9f917ed0 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
>  #ifdef CONFIG_HUGETLB_PAGE
>  	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> +#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
> +	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
>  #endif
>
>  	arch_crash_save_vmcoreinfo();
> --
> 2.17.2
>

Thanks
Dave
On 15.11.18 07:19, Dave Young wrote:
> Hi David,
>
> On 11/14/18 at 10:17pm, David Hildenbrand wrote:
>> Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
>> makedumpfile can directly skip pages that are logically offline and the
>> content therefore stale.
>
> It would be good to copy some background info from the cover letter to
> the patch description so that we can get a better understanding of why
> this is needed now.

Yes, will add more detail!

> BTW, Lianbo is working on documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
>
> Also cc Boris; although I do not want the doc changes to block this, he
> might have a different opinion :)

I'll be happy to help update the documentation (or update it myself if
the doc updates go in first).
On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
> It would be good to copy some background info from the cover letter to
> the patch description so that we can get a better understanding of why
> this is needed now.
>
> BTW, Lianbo is working on documentation of the vmcoreinfo exported
> fields. Ccing him so that he is aware of this.
>
> Also cc Boris; although I do not want the doc changes to block this, he
> might have a different opinion :)

Yeah, my initial reaction is that exporting an mm-internal flag to
userspace is a no-no.

What would be better, IMHO, is having a general method of telling the
kdump kernel - maybe ranges of physical addresses - which to skip.

Because the moment there's a set of pages which do not have PG_offline
set but which kdump would still like to skip, this breaks.

But that's the mm guys' call.
On 15.11.18 12:10, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 02:19:23PM +0800, Dave Young wrote:
>> It would be good to copy some background info from the cover letter to
>> the patch description so that we can get a better understanding of why
>> this is needed now.
>>
>> BTW, Lianbo is working on documentation of the vmcoreinfo exported
>> fields. Ccing him so that he is aware of this.
>>
>> Also cc Boris; although I do not want the doc changes to block this, he
>> might have a different opinion :)
>
> Yeah, my initial reaction is that exporting an mm-internal flag to
> userspace is a no-no.

Sorry to say, but that is the current practice, without which
makedumpfile would not be able to work at all (exclude user pages,
exclude page cache, exclude buddy pages). Let's not reinvent the wheel
here. This is how dumping has worked forever. Also see how hwpoisoned
pages are handled.

(Please note that most distributions only support dumping via
makedumpfile, for said reasons.)

> What would be better, IMHO, is having a general method of telling the
> kdump kernel - maybe ranges of physical addresses - which to skip.

And that would have to be updated whenever we change a page flag. But
then the dump kernel would have to be aware of the "struct page"
location and format of some old kernel to be dumped. Let's just not
even discuss going down that path.

> Because the moment there's a set of pages which do not have PG_offline
> set but which kdump would still like to skip, this breaks.

I don't understand your concern. PG_offline is only an optimization for
sections that are online. Offline sections can already be completely
ignored when dumping. I don't see how there should be a "set of pages
which do not have PG_offline". I mean, if they don't have PG_offline
they will simply be dumped as before, which has worked forever. Sorry,
I don't get your point.
On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> Sorry to say, but that is the current practice, without which
> makedumpfile would not be able to work at all (exclude user pages,
> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> here. This is how dumping has worked forever.

Sorry, but "we've always done this in the past" doesn't make it better.

> I don't see how there should be a "set of pages which do not have
> PG_offline".

It doesn't have to be a set of pages. Think of a (mmconfig, perhaps)
region which the kdump kernel should completely skip because poking in
it in the kdump kernel causes all kinds of havoc like machine checks,
etc. We've had, and still have, one issue like that.

But let me clarify my note: I don't want to be discussing with you the
design of makedumpfile and how it should or should not work - that ship
has already sailed. Apparently there are valid reasons to do it this
way.

I was *simply* stating that it feels wrong to export mm flags like
that.

But as I said already, that is the mm guys' call, and looking at how
we're already exporting a bunch of stuff in the vmcoreinfo - including
other mm flags - I guess one more flag doesn't matter anymore.
On 15.11.18 12:52, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
>> Sorry to say, but that is the current practice, without which
>> makedumpfile would not be able to work at all (exclude user pages,
>> exclude page cache, exclude buddy pages). Let's not reinvent the wheel
>> here. This is how dumping has worked forever.
>
> Sorry, but "we've always done this in the past" doesn't make it better.

Just saying that "I'm not the first to do it, don't hit me with a
stick" :)

>> I don't see how there should be a "set of pages which do not have
>> PG_offline".
>
> It doesn't have to be a set of pages. Think of a (mmconfig, perhaps)
> region which the kdump kernel should completely skip because poking in
> it in the kdump kernel causes all kinds of havoc like machine checks,
> etc. We've had, and still have, one issue like that.

Indeed. And we still have that without makedumpfile. I think you are
aware of this, but I'll explain it just for consistency: PG_hwpoison.

At some point we detect a HW error and mark a page as PG_hwpoison.

makedumpfile knows how to treat that flag and can exclude it from the
dump (== not access it). No crash.

kdump itself has no clue about the old "struct page"s. Especially:
a) Where they are located in memory (e.g. SPARSE)
b) What their format is ("where are the flags")
c) What the meaning of the flags is ("what does bit X mean")

In order to know such information, we would have to parse quite some
information inside the kernel in kdump - basically what makedumpfile
does right now. Is this feasible? I don't think so.

So we would need another approach to communicate such information, as
you said. I can't think of any, but if anybody reading this has an
idea, please speak up. I am interested.

The *only* ways we would have right now to handle such scenarios are:

1. While dumping memory and we get a machine check, fake reading a zero
   page instead of crashing.
2. While dumping memory and we get a fault, fake reading a zero page
   instead of crashing.

> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.

Indeed, and the basic design is to export these flags. (Let's say
"unfortunately" - being able to handle such stuff in kdump directly
would be the dream.)

> I was *simply* stating that it feels wrong to export mm flags like
> that.
>
> But as I said already, that is the mm guys' call, and looking at how
> we're already exporting a bunch of stuff in the vmcoreinfo - including
> other mm flags - I guess one more flag doesn't matter anymore.

Fair enough, noted. If you have an idea how to handle this in kdump,
please let me know.
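As a rough illustration of the mechanism David describes above - a
hypothetical sketch, not makedumpfile source, with all struct and field
names invented - a userspace dump filter can combine the values
exported via vmcoreinfo, including the new PAGE_OFFLINE_MAPCOUNT_VALUE,
to decide that a page's content does not need to be read:

/*
 * Hypothetical sketch, NOT makedumpfile code. The dump filter
 * reconstructs the crashed kernel's "struct page" entries from
 * vmcoreinfo (symbol addresses, offsets, sizes) and classifies each
 * page using exported values such as PAGE_BUDDY_MAPCOUNT_VALUE and,
 * with this patch, PAGE_OFFLINE_MAPCOUNT_VALUE.
 */
#include <stdbool.h>
#include <stdint.h>

struct old_page {
	uint64_t flags;		/* page->flags of the crashed kernel  */
	int32_t  mapcount;	/* raw _mapcount / page_type word     */
};

/* Parsed from "NUMBER(name)=value" lines of the vmcoreinfo ELF note. */
struct vmcoreinfo_numbers {
	int64_t page_buddy_mapcount_value;	/* ~PG_buddy              */
	int64_t page_offline_mapcount_value;	/* ~PG_offline (new here) */
	int64_t pg_hwpoison;			/* flag bit number        */
};

static bool page_content_excludable(const struct old_page *p,
				    const struct vmcoreinfo_numbers *n)
{
	if (p->flags & (1ULL << n->pg_hwpoison))
		return true;	/* hwpoisoned: must not be accessed    */
	if (p->mapcount == (int32_t)n->page_buddy_mapcount_value)
		return true;	/* free buddy page: content irrelevant */
	if (p->mapcount == (int32_t)n->page_offline_mapcount_value)
		return true;	/* logically offline: content is stale */
	return false;
}

The real logic has to cope with details such as the kernel's page_type
encoding and compound pages, but the principle is the same: everything
needed to classify a page comes from vmcoreinfo, not from parsing the
old kernel's code.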
On Thu 15-11-18 12:52:13, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> > Sorry to say, but that is the current practice, without which
> > makedumpfile would not be able to work at all (exclude user pages,
> > exclude page cache, exclude buddy pages). Let's not reinvent the
> > wheel here. This is how dumping has worked forever.
>
> Sorry, but "we've always done this in the past" doesn't make it better.
>
> > I don't see how there should be a "set of pages which do not have
> > PG_offline".
>
> It doesn't have to be a set of pages. Think of a (mmconfig, perhaps)
> region which the kdump kernel should completely skip because poking in
> it in the kdump kernel causes all kinds of havoc like machine checks,
> etc. We've had, and still have, one issue like that.
>
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
>
> I was *simply* stating that it feels wrong to export mm flags like
> that.
>
> But as I said already, that is the mm guys' call, and looking at how
> we're already exporting a bunch of stuff in the vmcoreinfo - including
> other mm flags - I guess one more flag doesn't matter anymore.

I am not familiar enough with kexec to judge this particular patch, but
we cannot simply define any range for these pages (same as for
hwpoisoned ones) because they can be almost anywhere in the available
memory range, and there can be countless of them. There is no other way
to rule them out but to check the page state.

I am not really sure what the concern here is, exactly. Kdump is so
closely tied to the specific kernel version that the API exported
specifically for its purpose cannot seriously be considered an ABI.
Kdump has to adapt all the time.
On Thu, Nov 15, 2018 at 01:11:02PM +0100, Michal Hocko wrote:
> I am not familiar enough with kexec to judge this particular patch, but
> we cannot simply define any range for these pages (same as for
> hwpoisoned ones) because they can be almost anywhere in the available
> memory range, and there can be countless of them. There is no other way
> to rule them out but to check the page state.

I guess, especially if it is a monster box with a lot of memory in it.

> I am not really sure what the concern here is, exactly. Kdump is so
> closely tied to the specific kernel version that the API exported
> specifically for its purpose cannot seriously be considered an ABI.
> Kdump has to adapt all the time.

Right... Except when people start ogling vmcoreinfo for other things
and start exporting all kinds of kernel internals in there, my alarm
bells start ringing.

But ok, kdump *is* special and I guess that's fine.
On Thu, Nov 15, 2018 at 01:01:17PM +0100, David Hildenbrand wrote:
> Just saying that "I'm not the first to do it, don't hit me with a
> stick" :)

:-)

> Indeed. And we still have that without makedumpfile. I think you are
> aware of this, but I'll explain it just for consistency: PG_hwpoison.

No, I appreciate an explanation very much! So thanks for that. :)

> At some point we detect a HW error and mark a page as PG_hwpoison.
>
> makedumpfile knows how to treat that flag and can exclude it from the
> dump (== not access it). No crash.
>
> kdump itself has no clue about the old "struct page"s. Especially:
> a) Where they are located in memory (e.g. SPARSE)
> b) What their format is ("where are the flags")
> c) What the meaning of the flags is ("what does bit X mean")
>
> In order to know such information, we would have to parse quite some
> information inside the kernel in kdump - basically what makedumpfile
> does right now. Is this feasible? I don't think so.
>
> So we would need another approach to communicate such information, as
> you said. I can't think of any, but if anybody reading this has an
> idea, please speak up. I am interested.

Yeah, but that ship has sailed. And even if we had a great idea, we'd
have to support kdump before and after the idea. And that would be a
serious mess.

And if you have a huge box with a gazillion piles of memory, and an
alpha particle passes through a bunch of them on its way down to the
earth's core, flipping a bunch of bits while doing so, you need to go
and collect all those regions and update some list which you then need
to shove into the second kernel.

And you probably need to do all that through a piece of memory which is
used for communication between the first and second kernel, and that
list had better fit in there, or you need to realloc. And that piece of
memory's layout needs to be properly defined so that the second kernel
can parse it correctly. And so on...

> The *only* ways we would have right now to handle such scenarios are:
>
> 1. While dumping memory and we get a machine check, fake reading a zero
>    page instead of crashing.
> 2. While dumping memory and we get a fault, fake reading a zero page
>    instead of crashing.

Yap.

> Indeed, and the basic design is to export these flags. (Let's say
> "unfortunately" - being able to handle such stuff in kdump directly
> would be the dream.)

Well, AFAICT, the minimum work you always need to do before starting
the dump is to somehow generate that list of pages or ranges not to
dump. And that work needs to be done by the first or the second kernel,
I'd say.

If the first kernel were to do it, then you'd probably have to have
callbacks on certain operations which go and add ranges or pages to
exclude to a list which is then readily accessible to the second
kernel. Which means that, when you reserve memory for the second
kernel, you'd have to reserve memory for such a list as well. But then
what do you do when that memory fills up...?

So I guess exporting those things in vmcoreinfo is probably the only
thing we *can* do in the end.

Oh well, enough rambling... :)
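Purely to illustrate the kind of first-kernel/second-kernel handover
Boris sketches above - no such mechanism exists, and every name below
is made up - such an exclusion list would have to be a fixed-size
structure living in memory reserved for the crash kernel, which is
exactly where the sizing problem comes from:

/*
 * Hypothetical handover structure; nothing like this exists in the
 * kernel. The first kernel would append "do not touch" ranges here and
 * the capture kernel would parse them. The array must fit into memory
 * reserved up front, so it can fill up.
 */
#include <stdint.h>

#define EXCLUDE_LIST_MAX	1024	/* fixed at reservation time */

struct exclude_range {
	uint64_t start;	/* physical address, inclusive */
	uint64_t end;	/* physical address, exclusive */
};

struct crash_exclude_list {
	uint32_t version;	/* layout agreed on by both kernels */
	uint32_t nr_ranges;	/* <= EXCLUDE_LIST_MAX, else data is dropped */
	struct exclude_range ranges[EXCLUDE_LIST_MAX];
};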
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 933cb3e45b98..093c9f917ed0 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -464,6 +464,8 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
 #ifdef CONFIG_HUGETLB_PAGE
 	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
+#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
+	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
 #endif

 	arch_crash_save_vmcoreinfo();
Let's export PG_offline via PAGE_OFFLINE_MAPCOUNT_VALUE, so
makedumpfile can directly skip pages that are logically offline and the
content therefore stale.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/crash_core.c | 2 ++
 1 file changed, 2 insertions(+)
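For reference, VMCOREINFO_NUMBER() appends a "NUMBER(name)=value" line
to the vmcoreinfo ELF note, so this change adds an entry such as
NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)=<value> that a dump filter can look
up by name. The snippet below is a hypothetical, self-contained sketch
of such a lookup (not makedumpfile code; the sample values are only
illustrative):

/*
 * Illustrative consumer-side sketch, not makedumpfile source. It shows
 * what the export produces and how a tool can fall back gracefully on
 * kernels that do not provide the new value.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long vmcoreinfo_number(const char *note, const char *name,
			      long fallback)
{
	char key[128];
	const char *p;

	snprintf(key, sizeof(key), "NUMBER(%s)=", name);
	p = strstr(note, key);
	if (!p)
		return fallback;	/* kernel does not export this symbol */
	return strtol(p + strlen(key), NULL, 10);
}

int main(void)
{
	/* In reality this text is read from the ELF note in /proc/vmcore. */
	const char note[] =
		"NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-129\n"
		"NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE)=-257\n";

	printf("PAGE_OFFLINE_MAPCOUNT_VALUE = %ld\n",
	       vmcoreinfo_number(note, "PAGE_OFFLINE_MAPCOUNT_VALUE", 0));
	return 0;
}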