
[RFC,v1,0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug

Message ID 20220427042841.678351-1-naoya.horiguchi@linux.dev (mailing list archive)

Message

Naoya Horiguchi April 27, 2022, 4:28 a.m. UTC
Hi,

This patchset addresses some issues in workloads related to hwpoison,
hugetlb, and memory_hotplug.  The memory hotremove problem reported by
Miaohe Lin [1] is addressed in patch 2/4, which depends on the "storing raw
error info" functionality provided by patch 1/4.  Patch 2/4 also provides a
delayed dissolve function.

Patch 3/4 adjusts unpoison to the new semantics of HPageMigratable for
hwpoisoned hugepages, and patch 4/4 fixes the inconsistent counter issue.

[1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@huawei.com/

Please let me know if you have any suggestions and comments.

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (4):
      mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
      mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage
      mm, hwpoison: add parameter unpoison to get_hwpoison_huge_page()
      mm, memory_hotplug: fix inconsistent num_poisoned_pages on memory hotremove

 include/linux/hugetlb.h | 29 +++++++++++++++++++++++++++--
 mm/hugetlb.c            | 24 ++++++++++++++++++++++--
 mm/memory-failure.c     |  8 ++++++--
 mm/memory_hotplug.c     | 23 +++++++++++------------
 mm/page_alloc.c         |  6 ++++++
 5 files changed, 72 insertions(+), 18 deletions(-)

Comments

David Hildenbrand April 27, 2022, 10:48 a.m. UTC | #1
On 27.04.22 06:28, Naoya Horiguchi wrote:
> Hi,
> 
> This patchset addresses some issues on the workload related to hwpoison,
> hugetlb, and memory_hotplug.  The problem in memory hotremove reported by
> Miaohe Lin [1] is mentioned in 2/4.  This patch depends on "storing raw
> error info" functionality provided by 1/4. This patch also provide delayed
> dissolve function too.
> 
> Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for
> hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue.
> 
> [1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@huawei.com/
> 
> Please let me know if you have any suggestions and comments.
> 

Hi,

I already raised some time ago that I don't quite see the value of
allowing memory offlining with poisoned pages.

1) It overcomplicates the offlining code and seems to be partially
   broken
2) It happens rarely (ever?), so do we even care?
3) Once the memory is offline, we can re-online it and lose HWPoison.
   The memory can be happily used.

3) can happen easily if our DIMM consists of multiple memory blocks and
offlining of some memory block fails -> we'll re-online all already
offlined ones. We'll happily reuse previously HWPoisoned pages, which
feels more dangerous to me than just leaving the DIMM around (and
eventually hwpoisoning all pages on it such that it won't get used
anymore?).

So maybe we should just fail offlining once we stumble over a hwpoisoned
page?

Yes, we would disallow removing a semi-broken DIMM from the system that
was onlined MOVABLE. I wonder if we really need that and how often it
happens in real life. Most systems I am aware of don't allow for
replacing individual DIMMs, but only complete NUMA nodes. Hm.
Oscar Salvador April 27, 2022, 12:20 p.m. UTC | #2
On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
> 
> 1) It overcomplicates the offlining code and seems to be partially
>    broken
> 2) It happens rarely (ever?), so do we even care?
> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>    The memory can be happily used.
> 
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).
> 
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?
> 
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

I tend to agree with all you said.
And to be honest, the mechanism of making a semi-broken DIMM healthy
again has always been a mystery to me.

One would think that:

1- you hot-remove the memory
2- you fix/remove it
3- you hotplug memory again

but I am not sure how often this actually happens.

And there is also the issue of losing the hwpoison information
across offline<->online transitions, so the whole thing is unreliable.

And for that to work, we would have to add a bunch of code
to keep track of "offlined" pages that are hwpoisoned, so we
can flag them again once they get onlined, and that means more
room for errors.

So, I would lean towards not allowing memory that contains such
pages to be offlined in the first place, unless that proves to be
a no-go.
HORIGUCHI NAOYA(堀口 直也) April 27, 2022, 12:20 p.m. UTC | #3
On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> On 27.04.22 06:28, Naoya Horiguchi wrote:
> > Hi,
> > 
> > This patchset addresses some issues on the workload related to hwpoison,
> > hugetlb, and memory_hotplug.  The problem in memory hotremove reported by
> > Miaohe Lin [1] is mentioned in 2/4.  This patch depends on "storing raw
> > error info" functionality provided by 1/4. This patch also provide delayed
> > dissolve function too.
> > 
> > Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for
> > hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue.
> > 
> > [1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@huawei.com/
> > 
> > Please let me know if you have any suggestions and comments.
> > 
> 
> Hi,
> 
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisened pages.
> 
> 1) It overcomplicates the offlining code and seems to be partially
>    broken

Yes, the current code has room for improvement.  Maybe the page refcount
handling of hwpoisoned pages is one such area.

> 2) It happens rarely (ever?), so do we even care?

I'm not certain of the rarity.  Some cloud service providers who maintain
lots of servers may care?

> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>    The memory can be happily used.
> 
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me then just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).

I see. This scenario can often happen.

> 
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?

That could be one choice.

Maybe another option is like this: offlining can succeed, but HWPoison flags
are kept across offline-reonline operations.  If the system notices that the
re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
saved HWPoison flags are still valid, so keep using them.  If the re-onlined
blocks are backed by replaced DIMMs/NUMA nodes, then we can clear all HWPoison
flags associated with the replaced physical address range.  This can be done
automatically during re-onlining if there's a way for the kernel to know
whether the DIMMs/NUMA nodes were replaced with new ones.  But if there isn't,
system applications have to check the HW and explicitly reset the HWPoison
flags.
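
For concreteness, the bookkeeping described above might look roughly like the
sketch below.  Everything here is hypothetical (no such record list or helpers
exist in the kernel today); it only illustrates saving poisoned PFNs so that
they survive an offline/online cycle, re-marking them at online time, and
dropping the records instead when the backing DIMM is known to have been
replaced.

/* Hedged sketch only -- none of these names exist in the kernel. */
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct hwpoison_record {
	struct list_head list;
	unsigned long pfn;		/* poisoned page frame number */
};

static LIST_HEAD(hwpoison_saved);
static DEFINE_SPINLOCK(hwpoison_saved_lock);

/* Called when a page is poisoned, so the information outlives the memmap. */
static void hwpoison_save_pfn(unsigned long pfn)
{
	struct hwpoison_record *rec = kmalloc(sizeof(*rec), GFP_ATOMIC);

	if (!rec)
		return;
	rec->pfn = pfn;
	spin_lock(&hwpoison_saved_lock);
	list_add(&rec->list, &hwpoison_saved);
	spin_unlock(&hwpoison_saved_lock);
}

/*
 * Called after onlining [start_pfn, start_pfn + nr_pages): re-mark the saved
 * pages, or drop the records instead if the range is known to be backed by a
 * replaced DIMM (reported by firmware or by a system application).
 */
static void hwpoison_restore_range(unsigned long start_pfn,
				   unsigned long nr_pages, bool dimm_replaced)
{
	struct hwpoison_record *rec, *tmp;

	spin_lock(&hwpoison_saved_lock);
	list_for_each_entry_safe(rec, tmp, &hwpoison_saved, list) {
		if (rec->pfn < start_pfn || rec->pfn >= start_pfn + nr_pages)
			continue;
		if (dimm_replaced) {
			list_del(&rec->list);
			kfree(rec);
		} else {
			SetPageHWPoison(pfn_to_page(rec->pfn));
		}
	}
	spin_unlock(&hwpoison_saved_lock);
}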

> 
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

Yes, maybe most servers do not provide direct physical access to DIMMs.

Thanks,
Naoya Horiguchi
David Hildenbrand April 28, 2022, 8:44 a.m. UTC | #4
>> 2) It happens rarely (ever?), so do we even care?
> 
> I'm not certain of the rarity.  Some cloud service providers who maintain
> lots of servers may care?

About replacing broken DIMMs? I'm not so sure, especially because it
requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
somewhat reliable, and individual DIMMs usually cannot be replaced at all.

> 
>> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>>    The memory can be happily used.
>>
>> 3) can happen easily if our DIMM consists of multiple memory blocks and
>> offlining of some memory block fails -> we'll re-online all already
>> offlined ones. We'll happily reuse previously HWPoisoned pages, which
>> feels more dangerous to me then just leaving the DIMM around (and
>> eventually hwpoisoning all pages on it such that it won't get used
>> anymore?).
> 
> I see. This scenario can often happen.
> 
>>
>> So maybe we should just fail offlining once we stumble over a hwpoisoned
>> page?
> 
> That could be one choice.
> 
> Maybe another is like this: offlining can succeed but HWPoison flags are
> kept over offline-reonline operations.  If the system noticed that the
> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> saved HWPoison flags are still effective, so keep using them.  If the
> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> all HWPoison flags associated with replaced physical address range.  This
> can be done automatically in re-onlining if there's a way for kernel to know
> whether DIMM/NUMA nodes are replaced with new ones.  But if there isn't,
> system applications have to check the HW and explicitly reset the HWPoison
> flags.

Offline memory sections have a stale memmap, so there is no trusting
it. And trying to work around that or adjusting memory onlining code
overcomplicates something we really don't care about supporting.

So if we continue allowing offlining memory blocks with poisoned pages,
we could simply remember that the memory block had any poisoned page
(either per memory section or, maybe better, for the whole memory
block). We can then simply reject/fail memory onlining of these memory
blocks.

So that leaves us with either

1) Fail offlining -> no need to care about reonlining
2) Succeed offlining but fail re-onlining
HORIGUCHI NAOYA(堀口 直也) May 9, 2022, 7:29 a.m. UTC | #5
On Thu, Apr 28, 2022 at 10:44:15AM +0200, David Hildenbrand wrote:
> >> 2) It happens rarely (ever?), so do we even care?
> > 
> > I'm not certain of the rarity.  Some cloud service providers who maintain
> > lots of servers may care?
> 
> About replacing broken DIMMs? I'm not so sure, especially because it
> requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
> somewhat reliable and individual DIMMs can usually not get replaced at all.
> 
> > 
> >> 3) Once the memory is offline, we can re-online it and lost HWPoison.
> >>    The memory can be happily used.
> >>
> >> 3) can happen easily if our DIMM consists of multiple memory blocks and
> >> offlining of some memory block fails -> we'll re-online all already
> >> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> >> feels more dangerous to me then just leaving the DIMM around (and
> >> eventually hwpoisoning all pages on it such that it won't get used
> >> anymore?).
> > 
> > I see. This scenario can often happen.
> > 
> >>
> >> So maybe we should just fail offlining once we stumble over a hwpoisoned
> >> page?
> > 
> > That could be one choice.
> > 
> > Maybe another is like this: offlining can succeed but HWPoison flags are
> > kept over offline-reonline operations.  If the system noticed that the
> > re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> > saved HWPoison flags are still effective, so keep using them.  If the
> > re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> > all HWPoison flags associated with replaced physical address range.  This
> > can be done automatically in re-onlining if there's a way for kernel to know
> > whether DIMM/NUMA nodes are replaced with new ones.  But if there isn't,
> > system applications have to check the HW and explicitly reset the HWPoison
> > flags.
> 
> Offline memory sections have a stale memmap, so there is no trusting on
> that. And trying to work around that or adjusting memory onlining code
> overcomplicates something we really don't care about supporting.

OK, so I'll move forward with reducing the complexity of the hwpoison-specific
code in the memory offlining path.

> 
> So if we continue allowing offlining memory blocks with poisoned pages,
> we could simply remember that that memory block had any posioned page
> (either for the memory section or maybe better for the whole memory
> block). We can then simply reject/fail memory onlining of these memory
> blocks.

It also seems helpful for other contexts (like hugetlb) to know whether
there's any hwpoisoned page in a given physical address range, so I'll
consider this approach.

> 
> So that leaves us with either
> 
> 1) Fail offlining -> no need to care about reonlining
> 2) Succeed offlining but fail re-onlining

Rephrasing in case I misread: in the end, the memory offlining code should not
check hwpoisoned pages at all, and the memory onlining code would do a kind of
range query to find hwpoisoned pages (without depending on the PageHWPoison flag).

Thanks,
Naoya Horiguchi
Miaohe Lin May 9, 2022, 9:04 a.m. UTC | #6
On 2022/5/9 15:29, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Apr 28, 2022 at 10:44:15AM +0200, David Hildenbrand wrote:
>>>> 2) It happens rarely (ever?), so do we even care?
>>>
>>> I'm not certain of the rarity.  Some cloud service providers who maintain
>>> lots of servers may care?
>>
>> About replacing broken DIMMs? I'm not so sure, especially because it
>> requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
>> somewhat reliable and individual DIMMs can usually not get replaced at all.
>>
>>>
>>>> 3) Once the memory is offline, we can re-online it and lost HWPoison.
>>>>    The memory can be happily used.
>>>>
>>>> 3) can happen easily if our DIMM consists of multiple memory blocks and
>>>> offlining of some memory block fails -> we'll re-online all already
>>>> offlined ones. We'll happily reuse previously HWPoisoned pages, which
>>>> feels more dangerous to me then just leaving the DIMM around (and
>>>> eventually hwpoisoning all pages on it such that it won't get used
>>>> anymore?).
>>>
>>> I see. This scenario can often happen.
>>>
>>>>
>>>> So maybe we should just fail offlining once we stumble over a hwpoisoned
>>>> page?
>>>
>>> That could be one choice.
>>>
>>> Maybe another is like this: offlining can succeed but HWPoison flags are
>>> kept over offline-reonline operations.  If the system noticed that the
>>> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
>>> saved HWPoison flags are still effective, so keep using them.  If the
>>> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
>>> all HWPoison flags associated with replaced physical address range.  This
>>> can be done automatically in re-onlining if there's a way for kernel to know
>>> whether DIMM/NUMA nodes are replaced with new ones.  But if there isn't,
>>> system applications have to check the HW and explicitly reset the HWPoison
>>> flags.
>>
>> Offline memory sections have a stale memmap, so there is no trusting on
>> that. And trying to work around that or adjusting memory onlining code
>> overcomplicates something we really don't care about supporting.
> 
> OK, so I'll go forward to reduce complexity in hwpoison specific code in
> memory offlining code.
> 
>>
>> So if we continue allowing offlining memory blocks with poisoned pages,
>> we could simply remember that that memory block had any posioned page
>> (either for the memory section or maybe better for the whole memory
>> block). We can then simply reject/fail memory onlining of these memory
>> blocks.
> 
> It seems to be helpful also for other conext (like hugetlb) to know whether
> there's any hwpoisoned page in a given range of physical address, so I'll
> think of this approach.
> 
>>
>> So that leaves us with either
>>
>> 1) Fail offlining -> no need to care about reonlining

Maybe failing offlining would be a better alternative, as we could get rid of many races
between memory failure and memory offline? But no strong opinion. :)

Thanks!

>> 2) Succeed offlining but fail re-onlining
> 
> Rephrasing in case I misread, memory offlining code should never check
> hwpoisoned pages finally, and memory onlining code would do kind of range
> query to find hwpoisoned pages (without depending on PageHWPoison flag).
> 
> Thanks,
> Naoya Horiguchi
>
Oscar Salvador May 9, 2022, 9:58 a.m. UTC | #7
On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >> So that leaves us with either
> >>
> >> 1) Fail offlining -> no need to care about reonlining
> 
> Maybe fail offlining will be a better alternative as we can get rid of many races
> between memory failure and memory offline? But no strong opinion. :)

If taking care of those races is not a herculean effort, I'd go with
allowing offlining + disallowing re-onlining,
mainly because of memory RAS stuff.

Now, to the re-onlining thing, we'll have to come up with a way to check
whether a section contains hwpoisoned pages, so we do not have to go
and check every single page, as that would be really suboptimal.
Miaohe Lin May 9, 2022, 10:53 a.m. UTC | #8
On 2022/5/9 17:58, Oscar Salvador wrote:
> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>> So that leaves us with either
>>>>
>>>> 1) Fail offlining -> no need to care about reonlining
>>
>> Maybe fail offlining will be a better alternative as we can get rid of many races
>> between memory failure and memory offline? But no strong opinion. :)
> 
> If taking care of those races is not an herculean effort, I'd go with
> allowing offlining + disallow re-onlining.
> Mainly because memory RAS stuff.

This does make sense to me. Thanks. We can try to solve those races if
offlining + disallowing re-onlining is applied. :)

> 
> Now, to the re-onlining thing, we'll have to come up with a way to check
> whether a section contains hwpoisoned pages, so we do not have to go
> and check every single page, as that will be really suboptimal.

Yes, we need a stable and cheap way to do that.

Thanks!

> 
>
David Hildenbrand May 11, 2022, 3:11 p.m. UTC | #9
On 09.05.22 12:53, Miaohe Lin wrote:
> On 2022/5/9 17:58, Oscar Salvador wrote:
>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>> So that leaves us with either
>>>>>
>>>>> 1) Fail offlining -> no need to care about reonlining
>>>
>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>> between memory failure and memory offline? But no strong opinion. :)
>>
>> If taking care of those races is not an herculean effort, I'd go with
>> allowing offlining + disallow re-onlining.
>> Mainly because memory RAS stuff.
> 
> This dose make sense to me. Thanks. We can try to solve those races if
> offlining + disallow re-onlining is applied. :)
> 
>>
>> Now, to the re-onlining thing, we'll have to come up with a way to check
>> whether a section contains hwpoisoned pages, so we do not have to go
>> and check every single page, as that will be really suboptimal.
> 
> Yes, we need a stable and cheap way to do that.

My simplistic approach would be a simple flag/indicator in the memory block devices
that indicates that any page in the memory block was hwpoisoned. It's easy to
check that during memory onlining and fail it.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 084d67fd55cc..3d0ef812e901 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
        struct zone *zone;
        int ret;
 
+       if (mem->hwpoisoned)
+               return -EHWPOISON;
+
        zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
                                  start_pfn, nr_pages);
 


Once the problematic DIMM would actually get unplugged, the memory block devices
would get removed as well. So when hotplugging a new DIMM in the same
location, we could online that memory again.

Another place to store that would be the memory section; we'd then have to check
all underlying sections here.

We're a bit short on flags in the memory section, I think, but sections are easier
to look up from other code eventually than memory block devices.
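
For what it's worth, the marking side that the check above relies on is not
shown; one minimal way to wire it up might be the sketch below.
walk_memory_blocks() and PFN_PHYS() are existing kernel interfaces, but the
'hwpoisoned' member and the call site in the memory_failure() path are
assumptions matching the proposal, not existing code.

#include <linux/memory.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Sketch: set the proposed mem->hwpoisoned indicator for the memory block
 * containing @pfn; this would be called once memory_failure() has marked
 * the page.  'hwpoisoned' is the hypothetical new member discussed above.
 */
static int __mark_block_hwpoisoned(struct memory_block *mem, void *arg)
{
	mem->hwpoisoned = true;
	return 0;
}

static void mark_memory_block_hwpoisoned(unsigned long pfn)
{
	/* walk_memory_blocks() operates on a physical address range. */
	walk_memory_blocks(PFN_PHYS(pfn), PAGE_SIZE, NULL,
			   __mark_block_hwpoisoned);
}

Whether the memory_failure() path or the offlining path is the better place to
set the indicator is exactly the open question in this subthread.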
HORIGUCHI NAOYA(堀口 直也) May 11, 2022, 4:10 p.m. UTC | #10
On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
> On 09.05.22 12:53, Miaohe Lin wrote:
> > On 2022/5/9 17:58, Oscar Salvador wrote:
> >> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >>>>> So that leaves us with either
> >>>>>
> >>>>> 1) Fail offlining -> no need to care about reonlining
> >>>
> >>> Maybe fail offlining will be a better alternative as we can get rid of many races
> >>> between memory failure and memory offline? But no strong opinion. :)
> >>
> >> If taking care of those races is not an herculean effort, I'd go with
> >> allowing offlining + disallow re-onlining.
> >> Mainly because memory RAS stuff.
> > 
> > This dose make sense to me. Thanks. We can try to solve those races if
> > offlining + disallow re-onlining is applied. :)
> > 
> >>
> >> Now, to the re-onlining thing, we'll have to come up with a way to check
> >> whether a section contains hwpoisoned pages, so we do not have to go
> >> and check every single page, as that will be really suboptimal.
> > 
> > Yes, we need a stable and cheap way to do that.
> 
> My simplistic approach would be a simple flag/indicator in the memory block devices
> that indicates that any page in the memory block was hwpoisoned. It's easy to
> check that during memory onlining and fail it.
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 084d67fd55cc..3d0ef812e901 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
>         struct zone *zone;
>         int ret;
>  
> +       if (mem->hwpoisoned)
> +               return -EHWPOISON;
> +
>         zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
>                                   start_pfn, nr_pages);
>  

Thanks for the idea; a simple flag could work if we don't have to consider
unpoison.  If we do need to consider unpoison, we need to notice when the last
hwpoisoned page in the memory block goes away, so mem->hwpoisoned should be a
counter of hwpoisoned pages.
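
To illustrate the counter variant (again only a sketch; 'nr_hwpoison' would be
a hypothetical atomic_long_t member of struct memory_block, and the walker
mirrors the marking sketch after David's diff above):

#include <linux/atomic.h>
#include <linux/memory.h>
#include <linux/pfn.h>

/* Sketch only: adjust the hypothetical per-block hwpoison counter. */
static int __memblk_hwpoison_adjust(struct memory_block *mem, void *arg)
{
	atomic_long_add(*(long *)arg, &mem->nr_hwpoison);
	return 0;
}

static void memblk_hwpoison_adjust(unsigned long pfn, long delta)
{
	walk_memory_blocks(PFN_PHYS(pfn), PAGE_SIZE, &delta,
			   __memblk_hwpoison_adjust);
}

/*
 * memory_failure():   memblk_hwpoison_adjust(pfn, 1);
 * unpoison_memory():  memblk_hwpoison_adjust(pfn, -1);
 * memory_block_online() would then refuse onlining only while
 * atomic_long_read(&mem->nr_hwpoison) != 0.
 */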

> 
> 
> Once the problematic DIMM would actually get unplugged, the memory block devices
> would get removed as well. So when hotplugging a new DIMM in the same
> location, we could online that memory again.

What about the PG_hwpoison flags?  Are the struct pages also freed and
reallocated during the actual DIMM replacement?

Thanks,
Naoya Horiguchi

> 
> Another place to store that would be the memory section, we'd then have to check
> all underlying sections here.
> 
> We're a bit short on flags in the memory section I think, but they are easier to
> lookup from other code eventually then memory block devices.
David Hildenbrand May 11, 2022, 4:22 p.m. UTC | #11
On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
>> On 09.05.22 12:53, Miaohe Lin wrote:
>>> On 2022/5/9 17:58, Oscar Salvador wrote:
>>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>>>> So that leaves us with either
>>>>>>>
>>>>>>> 1) Fail offlining -> no need to care about reonlining
>>>>>
>>>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>>>> between memory failure and memory offline? But no strong opinion. :)
>>>>
>>>> If taking care of those races is not an herculean effort, I'd go with
>>>> allowing offlining + disallow re-onlining.
>>>> Mainly because memory RAS stuff.
>>>
>>> This dose make sense to me. Thanks. We can try to solve those races if
>>> offlining + disallow re-onlining is applied. :)
>>>
>>>>
>>>> Now, to the re-onlining thing, we'll have to come up with a way to check
>>>> whether a section contains hwpoisoned pages, so we do not have to go
>>>> and check every single page, as that will be really suboptimal.
>>>
>>> Yes, we need a stable and cheap way to do that.
>>
>> My simplistic approach would be a simple flag/indicator in the memory block devices
>> that indicates that any page in the memory block was hwpoisoned. It's easy to
>> check that during memory onlining and fail it.
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index 084d67fd55cc..3d0ef812e901 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
>>         struct zone *zone;
>>         int ret;
>>  
>> +       if (mem->hwpoisoned)
>> +               return -EHWPOISON;
>> +
>>         zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
>>                                   start_pfn, nr_pages);
>>  
> 
> Thanks for the idea, a simple flag could work if we don't have to consider
> unpoison.  If we need consider unpoison, we need remember the last hwpoison
> page in the memory block, so mem->hwpoisoned should be the counter of
> hwpoison pages.

Right, but unpoisoning + memory offlining + memory onlining is an even more
extreme use case that we don't have to bother about, I think.

> 
>>
>>
>> Once the problematic DIMM would actually get unplugged, the memory block devices
>> would get removed as well. So when hotplugging a new DIMM in the same
>> location, we could online that memory again.
> 
> What about PG_hwpoison flags?  struct pages are also freed and reallocated
> in the actual DIMM replacement?

Once memory is offline, the memmap is stale and is no longer
trustworthy. It gets reinitialized during memory onlining -- so any
previous PG_hwpoison is overwritten at least there. In some setups, we
even poison the whole memmap via page_init_poison() during memory offlining.

Apart from that, we should be freeing the memmap in all relevant cases
when removing memory. I remember there are a couple of corner cases, but
we don't really have to care about that.
Miaohe Lin May 12, 2022, 3:04 a.m. UTC | #12
On 2022/5/12 0:22, David Hildenbrand wrote:
> On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote:
>> On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
>>> On 09.05.22 12:53, Miaohe Lin wrote:
>>>> On 2022/5/9 17:58, Oscar Salvador wrote:
>>>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>>>>> So that leaves us with either
>>>>>>>>
>>>>>>>> 1) Fail offlining -> no need to care about reonlining
>>>>>>
>>>>>> Maybe fail offlining will be a better alternative as we can get rid of many races
>>>>>> between memory failure and memory offline? But no strong opinion. :)
>>>>>
>>>>> If taking care of those races is not an herculean effort, I'd go with
>>>>> allowing offlining + disallow re-onlining.
>>>>> Mainly because memory RAS stuff.
>>>>
>>>> This dose make sense to me. Thanks. We can try to solve those races if
>>>> offlining + disallow re-onlining is applied. :)
>>>>
>>>>>
>>>>> Now, to the re-onlining thing, we'll have to come up with a way to check
>>>>> whether a section contains hwpoisoned pages, so we do not have to go
>>>>> and check every single page, as that will be really suboptimal.
>>>>
>>>> Yes, we need a stable and cheap way to do that.
>>>
>>> My simplistic approach would be a simple flag/indicator in the memory block devices
>>> that indicates that any page in the memory block was hwpoisoned. It's easy to
>>> check that during memory onlining and fail it.
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 084d67fd55cc..3d0ef812e901 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
>>>         struct zone *zone;
>>>         int ret;
>>>  
>>> +       if (mem->hwpoisoned)
>>> +               return -EHWPOISON;
>>> +
>>>         zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
>>>                                   start_pfn, nr_pages);
>>>  
>>
>> Thanks for the idea, a simple flag could work if we don't have to consider
>> unpoison.  If we need consider unpoison, we need remember the last hwpoison
>> page in the memory block, so mem->hwpoisoned should be the counter of
>> hwpoison pages.
> 
> Right, but unpoisoning+memory offlining+memory onlining is a yet more
> extreme use case we don't have to bother about I think.
> 
>>
>>>
>>>
>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>> would get removed as well. So when hotplugging a new DIMM in the same
>>> location, we could online that memory again.
>>
>> What about PG_hwpoison flags?  struct pages are also freed and reallocated
>> in the actual DIMM replacement?
> 
> Once memory is offline, the memmap is stale and is no longer
> trustworthy. It gets reinitialize during memory onlining -- so any
> previous PG_hwpoison is overridden at least there. In some setups, we
> even poison the whole memmap via page_init_poison() during memory offlining.
> 

I tend to agree with David. The memmap is unreliable after memory is offlined. So preventing memory
re-onlining until the DIMM is actually replaced is a good idea.

Thanks!

> Apart from that, we should be freeing the memmap in all relevant cases
> when removing memory. I remember there are a couple of corner cases, but
> we don't really have to care about that.
>
HORIGUCHI NAOYA(堀口 直也) May 12, 2022, 6:35 a.m. UTC | #13
On Wed, May 11, 2022 at 06:22:41PM +0200, David Hildenbrand wrote:
> On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote:
> >> On 09.05.22 12:53, Miaohe Lin wrote:
> >>> On 2022/5/9 17:58, Oscar Salvador wrote:
> >>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
> >>>>>>> So that leaves us with either
> >>>>>>>
> >>>>>>> 1) Fail offlining -> no need to care about reonlining
> >>>>>
> >>>>> Maybe fail offlining will be a better alternative as we can get rid of many races
> >>>>> between memory failure and memory offline? But no strong opinion. :)
> >>>>
> >>>> If taking care of those races is not an herculean effort, I'd go with
> >>>> allowing offlining + disallow re-onlining.
> >>>> Mainly because memory RAS stuff.
> >>>
> >>> This dose make sense to me. Thanks. We can try to solve those races if
> >>> offlining + disallow re-onlining is applied. :)
> >>>
> >>>>
> >>>> Now, to the re-onlining thing, we'll have to come up with a way to check
> >>>> whether a section contains hwpoisoned pages, so we do not have to go
> >>>> and check every single page, as that will be really suboptimal.
> >>>
> >>> Yes, we need a stable and cheap way to do that.
> >>
> >> My simplistic approach would be a simple flag/indicator in the memory block devices
> >> that indicates that any page in the memory block was hwpoisoned. It's easy to
> >> check that during memory onlining and fail it.
> >>
> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >> index 084d67fd55cc..3d0ef812e901 100644
> >> --- a/drivers/base/memory.c
> >> +++ b/drivers/base/memory.c
> >> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
> >>         struct zone *zone;
> >>         int ret;
> >>  
> >> +       if (mem->hwpoisoned)
> >> +               return -EHWPOISON;
> >> +
> >>         zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
> >>                                   start_pfn, nr_pages);
> >>  
> > 
> > Thanks for the idea, a simple flag could work if we don't have to consider
> > unpoison.  If we need consider unpoison, we need remember the last hwpoison
> > page in the memory block, so mem->hwpoisoned should be the counter of
> > hwpoison pages.
> 
> Right, but unpoisoning+memory offlining+memory onlining is a yet more
> extreme use case we don't have to bother about I think.

OK. Maybe starting with the simple one is fine.

> 
> > 
> >>
> >>
> >> Once the problematic DIMM would actually get unplugged, the memory block devices
> >> would get removed as well. So when hotplugging a new DIMM in the same
> >> location, we could online that memory again.
> > 
> > What about PG_hwpoison flags?  struct pages are also freed and reallocated
> > in the actual DIMM replacement?
> 
> Once memory is offline, the memmap is stale and is no longer
> trustworthy. It gets reinitialize during memory onlining -- so any
> previous PG_hwpoison is overridden at least there. In some setups, we
> even poison the whole memmap via page_init_poison() during memory offlining.
> 
> Apart from that, we should be freeing the memmap in all relevant cases
> when removing memory. I remember there are a couple of corner cases, but
> we don't really have to care about that.

OK, so there seems to be no need to manipulate struct pages for hwpoison in
all relevant cases.

Thanks,
Naoya Horiguchi
David Hildenbrand May 12, 2022, 7:28 a.m. UTC | #14
>>>>
>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>> location, we could online that memory again.
>>>
>>> What about PG_hwpoison flags?  struct pages are also freed and reallocated
>>> in the actual DIMM replacement?
>>
>> Once memory is offline, the memmap is stale and is no longer
>> trustworthy. It gets reinitialize during memory onlining -- so any
>> previous PG_hwpoison is overridden at least there. In some setups, we
>> even poison the whole memmap via page_init_poison() during memory offlining.
>>
>> Apart from that, we should be freeing the memmap in all relevant cases
>> when removing memory. I remember there are a couple of corner cases, but
>> we don't really have to care about that.
> 
> OK, so there seems no need to manipulate struct pages for hwpoison in
> all relevant cases.

Right. When offlining a memory block, all we have to do is remember if
we stumbled over a hwpoisoned page and record that inside the memory
block. Rejecting onlining is then easy.
Miaohe Lin May 12, 2022, 11:13 a.m. UTC | #15
On 2022/5/12 15:28, David Hildenbrand wrote:
>>>>>
>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>>> location, we could online that memory again.
>>>>
>>>> What about PG_hwpoison flags?  struct pages are also freed and reallocated
>>>> in the actual DIMM replacement?
>>>
>>> Once memory is offline, the memmap is stale and is no longer
>>> trustworthy. It gets reinitialize during memory onlining -- so any
>>> previous PG_hwpoison is overridden at least there. In some setups, we
>>> even poison the whole memmap via page_init_poison() during memory offlining.
>>>
>>> Apart from that, we should be freeing the memmap in all relevant cases
>>> when removing memory. I remember there are a couple of corner cases, but
>>> we don't really have to care about that.
>>
>> OK, so there seems no need to manipulate struct pages for hwpoison in
>> all relevant cases.
> 
> Right. When offlining a memory block, all we have to do is remember if
> we stumbled over a hwpoisoned page and rememebr that inside the memory
> block. Rejecting to online is then easy.

BTW: How should we deal with the below race window:

CPU A			CPU B				CPU C
accessing page while holding page refcnt
			memory_failure happened on page
							offline_pages
							  page can be offlined due to page refcnt
							  is ignored when PG_hwpoison is set
can still access page struct...

Any in-use page (with its page refcount incremented) might be offlined while its content, e.g. flags, private, ..., can
still be accessed if the above race happens. Is this possible? Or am I missing something? Any suggestions on how to fix it?
I can't figure out a way yet. :(

Thanks a lot!

>
David Hildenbrand May 12, 2022, 12:59 p.m. UTC | #16
On 12.05.22 13:13, Miaohe Lin wrote:
> On 2022/5/12 15:28, David Hildenbrand wrote:
>>>>>>
>>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>>>> location, we could online that memory again.
>>>>>
>>>>> What about PG_hwpoison flags?  struct pages are also freed and reallocated
>>>>> in the actual DIMM replacement?
>>>>
>>>> Once memory is offline, the memmap is stale and is no longer
>>>> trustworthy. It gets reinitialize during memory onlining -- so any
>>>> previous PG_hwpoison is overridden at least there. In some setups, we
>>>> even poison the whole memmap via page_init_poison() during memory offlining.
>>>>
>>>> Apart from that, we should be freeing the memmap in all relevant cases
>>>> when removing memory. I remember there are a couple of corner cases, but
>>>> we don't really have to care about that.
>>>
>>> OK, so there seems no need to manipulate struct pages for hwpoison in
>>> all relevant cases.
>>
>> Right. When offlining a memory block, all we have to do is remember if
>> we stumbled over a hwpoisoned page and rememebr that inside the memory
>> block. Rejecting to online is then easy.
> 
> BTW: How should we deal with the below race window:
> 
> CPU A			CPU B				CPU C
> accessing page while hold page refcnt
> 			memory_failure happened on page
> 							offline_pages
> 							  page can be offlined due to page refcnt
> 							  is ignored when PG_hwpoison is set
> can still access page struct...
> 
> Any in use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can
> still be accessed if the above race happened. Is this possible? Or am I miss something? Any suggestion to fix it?
> I can't figure out a way yet. :(

I assume you mean that test_pages_isolated() essentially only checks for
PageHWPoison() and doesn't care about the refcount?

That part is very dodgy and it's part of my motivation to question that
whole handling in the first place.


In do_migrate_range(), there is a comment:

"
 HWPoison pages have elevated reference counts so the migration would
 fail on them. It also doesn't make any sense to migrate them in the
 first place. Still try to unmap such a page in case it is still mapped
 (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
 the unmap as the catch all safety net).
"

My assumption would be: if there are any unexpected references on a
hwpoison page, we must fail offlining. Ripping out the page might be
more harmful than just leaving it in place and failing offlining for the
time being.



I am no expert on PageHWPoison(). What guarantees do we have regarding
the page count?

If we succeed in unmapping the page, there shouldn't be any references
from the page tables. We might still have GUP references to such pages,
and it would be fair enough to fail offlining. I remember we try
removing the page from the pagecache etc. to free up these references.
So which additional references do we have that the comment in offlining
code talks about? A single additional one from hwpoison code?

Once we figure that out, we might tweak test_pages_isolated() to also
consider the page count and not rip out random pages that are still
referenced in the system.
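
As a concrete illustration of that tweak (a hedged sketch, not a patch): the
MEMORY_OFFLINE handling of hwpoisoned pages in the isolation check could be
gated on the page having no references beyond the single one that
memory_failure() itself is expected to hold.  The "at most one" threshold is an
assumption, and as Miaohe notes below there are corner cases where PageHWPoison
is set without that extra reference.

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Sketch: only treat a hwpoisoned page as safely isolated if nobody but the
 * hwpoison code still holds a reference (assumed to be at most one);
 * otherwise isolation -- and hence offlining -- fails instead of ripping
 * out a page that is still in use.
 */
static bool hwpoisoned_page_safe_to_skip(struct page *page)
{
	return PageHWPoison(page) && page_count(page) <= 1;
}

__test_page_isolated_in_pageblock() could then use this in place of its bare
PageHWPoison() check for the MEMORY_OFFLINE case.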
Miaohe Lin May 16, 2022, 3:25 a.m. UTC | #17
On 2022/5/12 20:59, David Hildenbrand wrote:
> On 12.05.22 13:13, Miaohe Lin wrote:
>> On 2022/5/12 15:28, David Hildenbrand wrote:
>>>>>>>
>>>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices
>>>>>>> would get removed as well. So when hotplugging a new DIMM in the same
>>>>>>> location, we could online that memory again.
>>>>>>
>>>>>> What about PG_hwpoison flags?  struct pages are also freed and reallocated
>>>>>> in the actual DIMM replacement?
>>>>>
>>>>> Once memory is offline, the memmap is stale and is no longer
>>>>> trustworthy. It gets reinitialize during memory onlining -- so any
>>>>> previous PG_hwpoison is overridden at least there. In some setups, we
>>>>> even poison the whole memmap via page_init_poison() during memory offlining.
>>>>>
>>>>> Apart from that, we should be freeing the memmap in all relevant cases
>>>>> when removing memory. I remember there are a couple of corner cases, but
>>>>> we don't really have to care about that.
>>>>
>>>> OK, so there seems no need to manipulate struct pages for hwpoison in
>>>> all relevant cases.
>>>
>>> Right. When offlining a memory block, all we have to do is remember if
>>> we stumbled over a hwpoisoned page and rememebr that inside the memory
>>> block. Rejecting to online is then easy.
>>
>> BTW: How should we deal with the below race window:
>>
>> CPU A			CPU B				CPU C
>> accessing page while hold page refcnt
>> 			memory_failure happened on page
>> 							offline_pages
>> 							  page can be offlined due to page refcnt
>> 							  is ignored when PG_hwpoison is set
>> can still access page struct...
>>
>> Any in use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can
>> still be accessed if the above race happened. Is this possible? Or am I miss something? Any suggestion to fix it?
>> I can't figure out a way yet. :(
> 
> I assume you mean that test_pages_isolated() essentially only checks for
> PageHWPoison() and doesn't care about the refcount?

Yes, the page refcount is ignored when PG_hwpoison is set.

> 
> That part is very dodgy and it's part of my motivation to question that
> whole handling in the first place.
> 
> 
> In do_migrate_range(), there is a comment:
> 
> "
>  HWPoison pages have elevated reference counts so the migration would
>  fail on them. It also doesn't make any sense to migrate them in the
>  first place. Still try to unmap such a page in case it is still mapped
>  (e.g. current hwpoison implementation doesn't unmap KSM pages but keep
>  the unmap as the catch all safety net).
> "
> 
> My assumption would be: if there are any unexpected references on a
> hwpoison page, we must fail offlining. Ripping out the page might be
> more harmful then just leaving it in place and failing offlining for the
> time being.

I tend to agree with this. :)

> 
> 
> 
> I am no export on PageHWPoison(). Which guarantees do we have regarding
> the page count?
> 
> If we succeed in unmapping the page, there shouldn't be any references
> from the page tables. We might still have GUP references to such pages,
> and it would be fair enough to fail offlining. I remember we try
> removing the page from the pagecache etc. to free up these references.
> So which additional references do we have that the comment in offlining
> code talks about? A single additional one from hwpoison code?

IIRC, memory_failure() holds one extra page reference. This reference will be released
in unpoison_memory().

> 
> Once we figure that out, we might tweak test_pages_isolated() to also
> consider the page count and not rip out random pages that are still
> referenced in the system.
> 

But there are some corner cases where PageHWPoison is set but the page refcount is not increased,
so we can't rely on the page refcount for detection right now. :(

Thanks!