Message ID | 20220427042841.678351-1-naoya.horiguchi@linux.dev
---|---
Series | mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug
On 27.04.22 06:28, Naoya Horiguchi wrote: > Hi, > > This patchset addresses some issues on the workload related to hwpoison, > hugetlb, and memory_hotplug. The problem in memory hotremove reported by > Miaohe Lin [1] is mentioned in 2/4. This patch depends on "storing raw > error info" functionality provided by 1/4. This patch also provide delayed > dissolve function too. > > Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for > hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue. > > [1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@huawei.com/ > > Please let me know if you have any suggestions and comments. > Hi, I raised some time ago already that I don't quite see the value of allowing memory offlining with poisoned pages. 1) It overcomplicates the offlining code and seems to be partially broken 2) It happens rarely (ever?), so do we even care? 3) Once the memory is offline, we can re-online it and lose HWPoison. The memory can be happily used. 3) can happen easily if our DIMM consists of multiple memory blocks and offlining of some memory block fails -> we'll re-online all already offlined ones. We'll happily reuse previously HWPoisoned pages, which feels more dangerous to me than just leaving the DIMM around (and eventually hwpoisoning all pages on it such that it won't get used anymore?). So maybe we should just fail offlining once we stumble over a hwpoisoned page? Yes, we would disallow removing a semi-broken DIMM from the system that was onlined MOVABLE. I wonder if we really need that and how often it happens in real life. Most systems I am aware of don't allow for replacing individual DIMMs, but only complete NUMA nodes. Hm.
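As a rough illustration of the "just fail offlining" option, here is a minimal sketch of a helper that offline_pages() could call before starting isolation; the helper name and call site are assumptions, not code from this thread:

/*
 * Sketch only: refuse to offline a range as soon as a hwpoisoned page is
 * found, instead of trying to skip over it.
 */
static int range_contains_hwpoison(unsigned long start_pfn,
				   unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		/* hugetlb would additionally need a compound_head() check */
		if (page && PageHWPoison(page))
			return -EHWPOISON;
	}
	return 0;
}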
On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote: > I raised some time ago already that I don't quite see the value of > allowing memory offlining with poisened pages. > > 1) It overcomplicates the offlining code and seems to be partially > broken > 2) It happens rarely (ever?), so do we even care? > 3) Once the memory is offline, we can re-online it and lost HWPoison. > The memory can be happily used. > > 3) can happen easily if our DIMM consists of multiple memory blocks and > offlining of some memory block fails -> we'll re-online all already > offlined ones. We'll happily reuse previously HWPoisoned pages, which > feels more dangerous to me then just leaving the DIMM around (and > eventually hwpoisoning all pages on it such that it won't get used > anymore?). > > So maybe we should just fail offlining once we stumble over a hwpoisoned > page? > > Yes, we would disallow removing a semi-broken DIMM from the system that > was onlined MOVABLE. I wonder if we really need that and how often it > happens in real life. Most systems I am aware of don't allow for > replacing individual DIMMs, but only complete NUMA nodes. Hm. I tend to agree with all you said. And to be honest, the mechanism of making a semi-broken DIMM healthy again has always been a mystery to me. One would think that: 1- you hot-remove the memory 2- you fix/remove it 3- you hotplug memory again but I am not sure how often this actually happens. And there is also the thing about losing the hwpoison information between offline<->online transitions, so the whole thing is unreliable. And for that to work, we would have to add a bunch of code to keep track of "offlined" pages that are hwpoisoned, so we can flag them again once they get onlined, and that means more room for errors. So, I would lean towards not allowing to offline memory that contains such pages in the first place, unless that proves to be a no-go.
On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote: > On 27.04.22 06:28, Naoya Horiguchi wrote: > > Hi, > > > > This patchset addresses some issues on the workload related to hwpoison, > > hugetlb, and memory_hotplug. The problem in memory hotremove reported by > > Miaohe Lin [1] is mentioned in 2/4. This patch depends on "storing raw > > error info" functionality provided by 1/4. This patch also provide delayed > > dissolve function too. > > > > Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for > > hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue. > > > > [1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@huawei.com/ > > > > Please let me know if you have any suggestions and comments. > > > > Hi, > > I raised some time ago already that I don't quite see the value of > allowing memory offlining with poisened pages. > > 1) It overcomplicates the offlining code and seems to be partially > broken Yes, the current code has room for improvement. Maybe the page refcount of hwpoisoned pages is one of them. > 2) It happens rarely (ever?), so do we even care? I'm not certain of the rarity. Some cloud service providers who maintain lots of servers may care? > 3) Once the memory is offline, we can re-online it and lost HWPoison. > The memory can be happily used. > > 3) can happen easily if our DIMM consists of multiple memory blocks and > offlining of some memory block fails -> we'll re-online all already > offlined ones. We'll happily reuse previously HWPoisoned pages, which > feels more dangerous to me then just leaving the DIMM around (and > eventually hwpoisoning all pages on it such that it won't get used > anymore?). I see. This scenario can often happen. > > So maybe we should just fail offlining once we stumble over a hwpoisoned > page? That could be one choice. Maybe another is like this: offlining can succeed but the HWPoison flags are kept across offline-reonline operations. If the system notices that the re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the saved HWPoison flags are still effective, so keep using them. If the re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear all HWPoison flags associated with the replaced physical address range. This can be done automatically in re-onlining if there's a way for the kernel to know whether DIMMs/NUMA nodes have been replaced with new ones. But if there isn't, system applications have to check the HW and explicitly reset the HWPoison flags. > > Yes, we would disallow removing a semi-broken DIMM from the system that > was onlined MOVABLE. I wonder if we really need that and how often it > happens in real life. Most systems I am aware of don't allow for > replacing individual DIMMs, but only complete NUMA nodes. Hm. Yes, maybe most servers do not provide direct physical access to DIMMs. Thanks, Naoya Horiguchi
>> 2) It happens rarely (ever?), so do we even care? > > I'm not certain of the rarity. Some cloud service providers who maintain > lots of servers may care? About replacing broken DIMMs? I'm not so sure, especially because it requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be somewhat reliable and individual DIMMs can usually not get replaced at all. > >> 3) Once the memory is offline, we can re-online it and lost HWPoison. >> The memory can be happily used. >> >> 3) can happen easily if our DIMM consists of multiple memory blocks and >> offlining of some memory block fails -> we'll re-online all already >> offlined ones. We'll happily reuse previously HWPoisoned pages, which >> feels more dangerous to me then just leaving the DIMM around (and >> eventually hwpoisoning all pages on it such that it won't get used >> anymore?). > > I see. This scenario can often happen. > >> >> So maybe we should just fail offlining once we stumble over a hwpoisoned >> page? > > That could be one choice. > > Maybe another is like this: offlining can succeed but HWPoison flags are > kept over offline-reonline operations. If the system noticed that the > re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the > saved HWPoison flags are still effective, so keep using them. If the > re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear > all HWPoison flags associated with replaced physical address range. This > can be done automatically in re-onlining if there's a way for kernel to know > whether DIMM/NUMA nodes are replaced with new ones. But if there isn't, > system applications have to check the HW and explicitly reset the HWPoison > flags. Offline memory sections have a stale memmap, so it cannot be trusted. And trying to work around that or adjusting memory onlining code overcomplicates something we really don't care about supporting. So if we continue allowing offlining memory blocks with poisoned pages, we could simply remember that that memory block had a poisoned page (either for the memory section or maybe better for the whole memory block). We can then simply reject/fail memory onlining of these memory blocks. So that leaves us with either 1) Fail offlining -> no need to care about reonlining 2) Succeed offlining but fail re-onlining
On Thu, Apr 28, 2022 at 10:44:15AM +0200, David Hildenbrand wrote: > >> 2) It happens rarely (ever?), so do we even care? > > > > I'm not certain of the rarity. Some cloud service providers who maintain > > lots of servers may care? > > About replacing broken DIMMs? I'm not so sure, especially because it > requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be > somewhat reliable and individual DIMMs can usually not get replaced at all. > > > > >> 3) Once the memory is offline, we can re-online it and lost HWPoison. > >> The memory can be happily used. > >> > >> 3) can happen easily if our DIMM consists of multiple memory blocks and > >> offlining of some memory block fails -> we'll re-online all already > >> offlined ones. We'll happily reuse previously HWPoisoned pages, which > >> feels more dangerous to me then just leaving the DIMM around (and > >> eventually hwpoisoning all pages on it such that it won't get used > >> anymore?). > > > > I see. This scenario can often happen. > > > >> > >> So maybe we should just fail offlining once we stumble over a hwpoisoned > >> page? > > > > That could be one choice. > > > > Maybe another is like this: offlining can succeed but HWPoison flags are > > kept over offline-reonline operations. If the system noticed that the > > re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the > > saved HWPoison flags are still effective, so keep using them. If the > > re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear > > all HWPoison flags associated with replaced physical address range. This > > can be done automatically in re-onlining if there's a way for kernel to know > > whether DIMM/NUMA nodes are replaced with new ones. But if there isn't, > > system applications have to check the HW and explicitly reset the HWPoison > > flags. > > Offline memory sections have a stale memmap, so there is no trusting on > that. And trying to work around that or adjusting memory onlining code > overcomplicates something we really don't care about supporting. OK, so I'll go forward with reducing the complexity of the hwpoison-specific code in the memory offlining code. > > So if we continue allowing offlining memory blocks with poisoned pages, > we could simply remember that that memory block had any posioned page > (either for the memory section or maybe better for the whole memory > block). We can then simply reject/fail memory onlining of these memory > blocks. It also seems helpful for other contexts (like hugetlb) to know whether there's any hwpoisoned page in a given range of physical addresses, so I'll think about this approach. > > So that leaves us with either > > 1) Fail offlining -> no need to care about reonlining > 2) Succeed offlining but fail re-onlining Rephrasing in case I misread: in the end, memory offlining code should not check hwpoisoned pages at all, and memory onlining code would do a kind of range query to find hwpoisoned pages (without depending on the PageHWPoison flag). Thanks, Naoya Horiguchi
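One hedged sketch of what such a range query could look like if the hwpoison information were tracked outside of struct page; the xarray, helper names, and call sites are illustrative assumptions, not an existing kernel interface:

/* #include <linux/xarray.h> */

/* Hypothetical record of hwpoisoned PFNs that is independent of struct page,
 * so it survives memory offline/online. */
static DEFINE_XARRAY(hwpoison_pfn_record);

/* Would be called from memory_failure() after marking the page. */
static int hwpoison_record_pfn(unsigned long pfn)
{
	return xa_err(xa_store(&hwpoison_pfn_record, pfn,
			       xa_mk_value(1), GFP_KERNEL));
}

/* Range query, e.g. for memory onlining: is there any recorded poison in
 * [start_pfn, start_pfn + nr_pages)? */
static bool hwpoison_range_has_poison(unsigned long start_pfn,
				      unsigned long nr_pages)
{
	unsigned long pfn = start_pfn;
	void *entry;

	xa_for_each_range(&hwpoison_pfn_record, pfn, entry,
			  start_pfn, start_pfn + nr_pages - 1)
		return true;

	return false;
}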
On 2022/5/9 15:29, HORIGUCHI NAOYA(堀口 直也) wrote: > On Thu, Apr 28, 2022 at 10:44:15AM +0200, David Hildenbrand wrote: >>>> 2) It happens rarely (ever?), so do we even care? >>> >>> I'm not certain of the rarity. Some cloud service providers who maintain >>> lots of servers may care? >> >> About replacing broken DIMMs? I'm not so sure, especially because it >> requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be >> somewhat reliable and individual DIMMs can usually not get replaced at all. >> >>> >>>> 3) Once the memory is offline, we can re-online it and lost HWPoison. >>>> The memory can be happily used. >>>> >>>> 3) can happen easily if our DIMM consists of multiple memory blocks and >>>> offlining of some memory block fails -> we'll re-online all already >>>> offlined ones. We'll happily reuse previously HWPoisoned pages, which >>>> feels more dangerous to me then just leaving the DIMM around (and >>>> eventually hwpoisoning all pages on it such that it won't get used >>>> anymore?). >>> >>> I see. This scenario can often happen. >>> >>>> >>>> So maybe we should just fail offlining once we stumble over a hwpoisoned >>>> page? >>> >>> That could be one choice. >>> >>> Maybe another is like this: offlining can succeed but HWPoison flags are >>> kept over offline-reonline operations. If the system noticed that the >>> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the >>> saved HWPoison flags are still effective, so keep using them. If the >>> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear >>> all HWPoison flags associated with replaced physical address range. This >>> can be done automatically in re-onlining if there's a way for kernel to know >>> whether DIMM/NUMA nodes are replaced with new ones. But if there isn't, >>> system applications have to check the HW and explicitly reset the HWPoison >>> flags. >> >> Offline memory sections have a stale memmap, so there is no trusting on >> that. And trying to work around that or adjusting memory onlining code >> overcomplicates something we really don't care about supporting. > > OK, so I'll go forward to reduce complexity in hwpoison specific code in > memory offlining code. > >> >> So if we continue allowing offlining memory blocks with poisoned pages, >> we could simply remember that that memory block had any posioned page >> (either for the memory section or maybe better for the whole memory >> block). We can then simply reject/fail memory onlining of these memory >> blocks. > > It seems to be helpful also for other conext (like hugetlb) to know whether > there's any hwpoisoned page in a given range of physical address, so I'll > think of this approach. > >> >> So that leaves us with either >> >> 1) Fail offlining -> no need to care about reonlining Maybe fail offlining will be a better alternative as we can get rid of many races between memory failure and memory offline? But no strong opinion. :) Thanks! >> 2) Succeed offlining but fail re-onlining > > Rephrasing in case I misread, memory offlining code should never check > hwpoisoned pages finally, and memory onlining code would do kind of range > query to find hwpoisoned pages (without depending on PageHWPoison flag). > > Thanks, > Naoya Horiguchi >
On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: > >> So that leaves us with either > >> > >> 1) Fail offlining -> no need to care about reonlining > > Maybe fail offlining will be a better alternative as we can get rid of many races > between memory failure and memory offline? But no strong opinion. :) If taking care of those races is not a herculean effort, I'd go with allowing offlining + disallow re-onlining. Mainly because of memory RAS stuff. Now, to the re-onlining thing, we'll have to come up with a way to check whether a section contains hwpoisoned pages, so we do not have to go and check every single page, as that will be really suboptimal.
On 2022/5/9 17:58, Oscar Salvador wrote: > On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: >>>> So that leaves us with either >>>> >>>> 1) Fail offlining -> no need to care about reonlining >> >> Maybe fail offlining will be a better alternative as we can get rid of many races >> between memory failure and memory offline? But no strong opinion. :) > > If taking care of those races is not an herculean effort, I'd go with > allowing offlining + disallow re-onlining. > Mainly because memory RAS stuff. This does make sense to me. Thanks. We can try to solve those races if offlining + disallow re-onlining is applied. :) > > Now, to the re-onlining thing, we'll have to come up with a way to check > whether a section contains hwpoisoned pages, so we do not have to go > and check every single page, as that will be really suboptimal. Yes, we need a stable and cheap way to do that. Thanks! > >
On 09.05.22 12:53, Miaohe Lin wrote: > On 2022/5/9 17:58, Oscar Salvador wrote: >> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: >>>>> So that leaves us with either >>>>> >>>>> 1) Fail offlining -> no need to care about reonlining >>> >>> Maybe fail offlining will be a better alternative as we can get rid of many races >>> between memory failure and memory offline? But no strong opinion. :) >> >> If taking care of those races is not an herculean effort, I'd go with >> allowing offlining + disallow re-onlining. >> Mainly because memory RAS stuff. > > This dose make sense to me. Thanks. We can try to solve those races if > offlining + disallow re-onlining is applied. :) > >> >> Now, to the re-onlining thing, we'll have to come up with a way to check >> whether a section contains hwpoisoned pages, so we do not have to go >> and check every single page, as that will be really suboptimal. > > Yes, we need a stable and cheap way to do that. My simplistic approach would be a simple flag/indicator in the memory block devices that indicates that any page in the memory block was hwpoisoned. It's easy to check that during memory onlining and fail it.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 084d67fd55cc..3d0ef812e901 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
 	struct zone *zone;
 	int ret;
 
+	if (mem->hwpoisoned)
+		return -EHWPOISON;
+
 	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
 				  start_pfn, nr_pages);

Once the problematic DIMM would actually get unplugged, the memory block devices would get removed as well. So when hotplugging a new DIMM in the same location, we could online that memory again. Another place to store that would be the memory section; we'd then have to check all underlying sections here. We're a bit short on flags in the memory section I think, but they are easier to look up from other code eventually than memory block devices.
On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote: > On 09.05.22 12:53, Miaohe Lin wrote: > > On 2022/5/9 17:58, Oscar Salvador wrote: > >> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: > >>>>> So that leaves us with either > >>>>> > >>>>> 1) Fail offlining -> no need to care about reonlining > >>> > >>> Maybe fail offlining will be a better alternative as we can get rid of many races > >>> between memory failure and memory offline? But no strong opinion. :) > >> > >> If taking care of those races is not an herculean effort, I'd go with > >> allowing offlining + disallow re-onlining. > >> Mainly because memory RAS stuff. > > > > This dose make sense to me. Thanks. We can try to solve those races if > > offlining + disallow re-onlining is applied. :) > > > >> > >> Now, to the re-onlining thing, we'll have to come up with a way to check > >> whether a section contains hwpoisoned pages, so we do not have to go > >> and check every single page, as that will be really suboptimal. > > > > Yes, we need a stable and cheap way to do that. > > My simplistic approach would be a simple flag/indicator in the memory block devices > that indicates that any page in the memory block was hwpoisoned. It's easy to > check that during memory onlining and fail it. > > diff --git a/drivers/base/memory.c b/drivers/base/memory.c > index 084d67fd55cc..3d0ef812e901 100644 > --- a/drivers/base/memory.c > +++ b/drivers/base/memory.c > @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem) > struct zone *zone; > int ret; > > + if (mem->hwpoisoned) > + return -EHWPOISON; > + > zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group, > start_pfn, nr_pages); > Thanks for the idea, a simple flag could work if we don't have to consider unpoison. If we need consider unpoison, we need remember the last hwpoison page in the memory block, so mem->hwpoisoned should be the counter of hwpoison pages. > > > Once the problematic DIMM would actually get unplugged, the memory block devices > would get removed as well. So when hotplugging a new DIMM in the same > location, we could online that memory again. What about PG_hwpoison flags? struct pages are also freed and reallocated in the actual DIMM replacement? Thanks, Naoya Horiguchi > > Another place to store that would be the memory section, we'd then have to check > all underlying sections here. > > We're a bit short on flags in the memory section I think, but they are easier to > lookup from other code eventually then memory block devices.
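For illustration, a minimal sketch of the counter variant mentioned here; the nr_hwpoison field and both helpers are hypothetical, not existing code:

/* Hypothetical: struct memory_block grows an "atomic_t nr_hwpoison" field. */

static void memblk_hwpoison_inc(struct memory_block *mem)
{
	atomic_inc(&mem->nr_hwpoison);		/* from memory_failure() */
}

static void memblk_hwpoison_dec(struct memory_block *mem)
{
	atomic_dec(&mem->nr_hwpoison);		/* from unpoison_memory() */
}

/* The online-time check from the earlier diff would then become:
 *
 *	if (atomic_read(&mem->nr_hwpoison))
 *		return -EHWPOISON;
 */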
On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote: > On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote: >> On 09.05.22 12:53, Miaohe Lin wrote: >>> On 2022/5/9 17:58, Oscar Salvador wrote: >>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: >>>>>>> So that leaves us with either >>>>>>> >>>>>>> 1) Fail offlining -> no need to care about reonlining >>>>> >>>>> Maybe fail offlining will be a better alternative as we can get rid of many races >>>>> between memory failure and memory offline? But no strong opinion. :) >>>> >>>> If taking care of those races is not an herculean effort, I'd go with >>>> allowing offlining + disallow re-onlining. >>>> Mainly because memory RAS stuff. >>> >>> This dose make sense to me. Thanks. We can try to solve those races if >>> offlining + disallow re-onlining is applied. :) >>> >>>> >>>> Now, to the re-onlining thing, we'll have to come up with a way to check >>>> whether a section contains hwpoisoned pages, so we do not have to go >>>> and check every single page, as that will be really suboptimal. >>> >>> Yes, we need a stable and cheap way to do that. >> >> My simplistic approach would be a simple flag/indicator in the memory block devices >> that indicates that any page in the memory block was hwpoisoned. It's easy to >> check that during memory onlining and fail it. >> >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c >> index 084d67fd55cc..3d0ef812e901 100644 >> --- a/drivers/base/memory.c >> +++ b/drivers/base/memory.c >> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem) >> struct zone *zone; >> int ret; >> >> + if (mem->hwpoisoned) >> + return -EHWPOISON; >> + >> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group, >> start_pfn, nr_pages); >> > > Thanks for the idea, a simple flag could work if we don't have to consider > unpoison. If we need consider unpoison, we need remember the last hwpoison > page in the memory block, so mem->hwpoisoned should be the counter of > hwpoison pages. Right, but unpoisoning+memory offlining+memory onlining is a yet more extreme use case we don't have to bother about I think. > >> >> >> Once the problematic DIMM would actually get unplugged, the memory block devices >> would get removed as well. So when hotplugging a new DIMM in the same >> location, we could online that memory again. > > What about PG_hwpoison flags? struct pages are also freed and reallocated > in the actual DIMM replacement? Once memory is offline, the memmap is stale and is no longer trustworthy. It gets reinitialize during memory onlining -- so any previous PG_hwpoison is overridden at least there. In some setups, we even poison the whole memmap via page_init_poison() during memory offlining. Apart from that, we should be freeing the memmap in all relevant cases when removing memory. I remember there are a couple of corner cases, but we don't really have to care about that.
On 2022/5/12 0:22, David Hildenbrand wrote: > On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote: >> On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote: >>> On 09.05.22 12:53, Miaohe Lin wrote: >>>> On 2022/5/9 17:58, Oscar Salvador wrote: >>>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: >>>>>>>> So that leaves us with either >>>>>>>> >>>>>>>> 1) Fail offlining -> no need to care about reonlining >>>>>> >>>>>> Maybe fail offlining will be a better alternative as we can get rid of many races >>>>>> between memory failure and memory offline? But no strong opinion. :) >>>>> >>>>> If taking care of those races is not an herculean effort, I'd go with >>>>> allowing offlining + disallow re-onlining. >>>>> Mainly because memory RAS stuff. >>>> >>>> This dose make sense to me. Thanks. We can try to solve those races if >>>> offlining + disallow re-onlining is applied. :) >>>> >>>>> >>>>> Now, to the re-onlining thing, we'll have to come up with a way to check >>>>> whether a section contains hwpoisoned pages, so we do not have to go >>>>> and check every single page, as that will be really suboptimal. >>>> >>>> Yes, we need a stable and cheap way to do that. >>> >>> My simplistic approach would be a simple flag/indicator in the memory block devices >>> that indicates that any page in the memory block was hwpoisoned. It's easy to >>> check that during memory onlining and fail it. >>> >>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c >>> index 084d67fd55cc..3d0ef812e901 100644 >>> --- a/drivers/base/memory.c >>> +++ b/drivers/base/memory.c >>> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem) >>> struct zone *zone; >>> int ret; >>> >>> + if (mem->hwpoisoned) >>> + return -EHWPOISON; >>> + >>> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group, >>> start_pfn, nr_pages); >>> >> >> Thanks for the idea, a simple flag could work if we don't have to consider >> unpoison. If we need consider unpoison, we need remember the last hwpoison >> page in the memory block, so mem->hwpoisoned should be the counter of >> hwpoison pages. > > Right, but unpoisoning+memory offlining+memory onlining is a yet more > extreme use case we don't have to bother about I think. > >> >>> >>> >>> Once the problematic DIMM would actually get unplugged, the memory block devices >>> would get removed as well. So when hotplugging a new DIMM in the same >>> location, we could online that memory again. >> >> What about PG_hwpoison flags? struct pages are also freed and reallocated >> in the actual DIMM replacement? > > Once memory is offline, the memmap is stale and is no longer > trustworthy. It gets reinitialize during memory onlining -- so any > previous PG_hwpoison is overridden at least there. In some setups, we > even poison the whole memmap via page_init_poison() during memory offlining. > I tend to agree with David. The memmap is unreliable after memory is offline. So preventing memory re-online until a new DIMM replacement is a good idea. Thanks! > Apart from that, we should be freeing the memmap in all relevant cases > when removing memory. I remember there are a couple of corner cases, but > we don't really have to care about that. >
On Wed, May 11, 2022 at 06:22:41PM +0200, David Hildenbrand wrote: > On 11.05.22 18:10, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Wed, May 11, 2022 at 05:11:17PM +0200, David Hildenbrand wrote: > >> On 09.05.22 12:53, Miaohe Lin wrote: > >>> On 2022/5/9 17:58, Oscar Salvador wrote: > >>>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote: > >>>>>>> So that leaves us with either > >>>>>>> > >>>>>>> 1) Fail offlining -> no need to care about reonlining > >>>>> > >>>>> Maybe fail offlining will be a better alternative as we can get rid of many races > >>>>> between memory failure and memory offline? But no strong opinion. :) > >>>> > >>>> If taking care of those races is not an herculean effort, I'd go with > >>>> allowing offlining + disallow re-onlining. > >>>> Mainly because memory RAS stuff. > >>> > >>> This dose make sense to me. Thanks. We can try to solve those races if > >>> offlining + disallow re-onlining is applied. :) > >>> > >>>> > >>>> Now, to the re-onlining thing, we'll have to come up with a way to check > >>>> whether a section contains hwpoisoned pages, so we do not have to go > >>>> and check every single page, as that will be really suboptimal. > >>> > >>> Yes, we need a stable and cheap way to do that. > >> > >> My simplistic approach would be a simple flag/indicator in the memory block devices > >> that indicates that any page in the memory block was hwpoisoned. It's easy to > >> check that during memory onlining and fail it. > >> > >> diff --git a/drivers/base/memory.c b/drivers/base/memory.c > >> index 084d67fd55cc..3d0ef812e901 100644 > >> --- a/drivers/base/memory.c > >> +++ b/drivers/base/memory.c > >> @@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem) > >> struct zone *zone; > >> int ret; > >> > >> + if (mem->hwpoisoned) > >> + return -EHWPOISON; > >> + > >> zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group, > >> start_pfn, nr_pages); > >> > > > > Thanks for the idea, a simple flag could work if we don't have to consider > > unpoison. If we need consider unpoison, we need remember the last hwpoison > > page in the memory block, so mem->hwpoisoned should be the counter of > > hwpoison pages. > > Right, but unpoisoning+memory offlining+memory onlining is a yet more > extreme use case we don't have to bother about I think. OK. Maybe starting with simple one is fine. > > > > >> > >> > >> Once the problematic DIMM would actually get unplugged, the memory block devices > >> would get removed as well. So when hotplugging a new DIMM in the same > >> location, we could online that memory again. > > > > What about PG_hwpoison flags? struct pages are also freed and reallocated > > in the actual DIMM replacement? > > Once memory is offline, the memmap is stale and is no longer > trustworthy. It gets reinitialize during memory onlining -- so any > previous PG_hwpoison is overridden at least there. In some setups, we > even poison the whole memmap via page_init_poison() during memory offlining. > > Apart from that, we should be freeing the memmap in all relevant cases > when removing memory. I remember there are a couple of corner cases, but > we don't really have to care about that. OK, so there seems no need to manipulate struct pages for hwpoison in all relevant cases. Thanks, Naoya Horiguchi
>>>> >>>> Once the problematic DIMM would actually get unplugged, the memory block devices >>>> would get removed as well. So when hotplugging a new DIMM in the same >>>> location, we could online that memory again. >>> >>> What about PG_hwpoison flags? struct pages are also freed and reallocated >>> in the actual DIMM replacement? >> >> Once memory is offline, the memmap is stale and is no longer >> trustworthy. It gets reinitialize during memory onlining -- so any >> previous PG_hwpoison is overridden at least there. In some setups, we >> even poison the whole memmap via page_init_poison() during memory offlining. >> >> Apart from that, we should be freeing the memmap in all relevant cases >> when removing memory. I remember there are a couple of corner cases, but >> we don't really have to care about that. > > OK, so there seems no need to manipulate struct pages for hwpoison in > all relevant cases. Right. When offlining a memory block, all we have to do is remember if we stumbled over a hwpoisoned page and remember that inside the memory block. Rejecting onlining is then easy.
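A sketch of the offline-time half of this idea, reusing the hwpoisoned indicator from the earlier diff; the helper and its call site are assumptions, and in practice it would piggyback on the pfn walk offlining already performs:

/*
 * Sketch only: note in the memory block whether any hwpoisoned page was
 * encountered in the offlined range, so that memory_block_online() can
 * later refuse to online that block again.
 */
static void memblk_note_hwpoison(struct memory_block *mem,
				 unsigned long start_pfn,
				 unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (page && PageHWPoison(page)) {
			mem->hwpoisoned = true;	/* field from the earlier diff */
			break;
		}
	}
}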
On 2022/5/12 15:28, David Hildenbrand wrote: >>>>>> >>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices >>>>>> would get removed as well. So when hotplugging a new DIMM in the same >>>>>> location, we could online that memory again. >>>>> >>>>> What about PG_hwpoison flags? struct pages are also freed and reallocated >>>>> in the actual DIMM replacement? >>>> >>>> Once memory is offline, the memmap is stale and is no longer >>>> trustworthy. It gets reinitialize during memory onlining -- so any >>>> previous PG_hwpoison is overridden at least there. In some setups, we >>>> even poison the whole memmap via page_init_poison() during memory offlining. >>>> >>>> Apart from that, we should be freeing the memmap in all relevant cases >>>> when removing memory. I remember there are a couple of corner cases, but >>>> we don't really have to care about that. >>> >>> OK, so there seems no need to manipulate struct pages for hwpoison in >>> all relevant cases. >> >> Right. When offlining a memory block, all we have to do is remember if >> we stumbled over a hwpoisoned page and rememebr that inside the memory >> block. Rejecting to online is then easy. BTW: How should we deal with the below race window:

CPU A                                     CPU B                              CPU C
accessing page while holding page refcnt
                                          memory_failure happened on page
                                                                             offline_pages
                                                                             page can be offlined because page refcnt
                                                                             is ignored when PG_hwpoison is set
can still access page struct...

Any in-use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can still be accessed if the above race happens. Is this possible? Or am I missing something? Any suggestion to fix it? I can't figure out a way yet. :( Thanks a lot!
On 12.05.22 13:13, Miaohe Lin wrote: > On 2022/5/12 15:28, David Hildenbrand wrote: >>>>>> >>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices >>>>>> would get removed as well. So when hotplugging a new DIMM in the same >>>>>> location, we could online that memory again. >>>>> >>>>> What about PG_hwpoison flags? struct pages are also freed and reallocated >>>>> in the actual DIMM replacement? >>>> >>>> Once memory is offline, the memmap is stale and is no longer >>>> trustworthy. It gets reinitialize during memory onlining -- so any >>>> previous PG_hwpoison is overridden at least there. In some setups, we >>>> even poison the whole memmap via page_init_poison() during memory offlining. >>>> >>>> Apart from that, we should be freeing the memmap in all relevant cases >>>> when removing memory. I remember there are a couple of corner cases, but >>>> we don't really have to care about that. >>> >>> OK, so there seems no need to manipulate struct pages for hwpoison in >>> all relevant cases. >> >> Right. When offlining a memory block, all we have to do is remember if >> we stumbled over a hwpoisoned page and rememebr that inside the memory >> block. Rejecting to online is then easy. > > BTW: How should we deal with the below race window: > > CPU A CPU B CPU C > accessing page while hold page refcnt > memory_failure happened on page > offline_pages > page can be offlined due to page refcnt > is ignored when PG_hwpoison is set > can still access page struct... > > Any in use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can > still be accessed if the above race happened. Is this possible? Or am I miss something? Any suggestion to fix it? > I can't figure out a way yet. :( I assume you mean that test_pages_isolated() essentially only checks for PageHWPoison() and doesn't care about the refcount? That part is very dodgy and it's part of my motivation to question that whole handling in the first place. In do_migrate_range(), there is a comment: " HWPoison pages have elevated reference counts so the migration would fail on them. It also doesn't make any sense to migrate them in the first place. Still try to unmap such a page in case it is still mapped (e.g. current hwpoison implementation doesn't unmap KSM pages but keep the unmap as the catch all safety net). " My assumption would be: if there are any unexpected references on a hwpoison page, we must fail offlining. Ripping out the page might be more harmful than just leaving it in place and failing offlining for the time being. I am no expert on PageHWPoison(). Which guarantees do we have regarding the page count? If we succeed in unmapping the page, there shouldn't be any references from the page tables. We might still have GUP references to such pages, and it would be fair enough to fail offlining. I remember we try removing the page from the pagecache etc. to free up these references. So which additional references do we have that the comment in offlining code talks about? A single additional one from hwpoison code? Once we figure that out, we might tweak test_pages_isolated() to also consider the page count and not rip out random pages that are still referenced in the system.
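As a sketch of the tweak suggested at the end (considering the page count during isolation), assuming memory_failure() holds at most one extra reference; whether that is the right expectation is exactly the open question discussed in the following replies:

/*
 * Sketch only: a predicate the isolation/offlining check could use so that
 * a hwpoisoned page with unexpected extra references is not ripped out.
 */
static bool hwpoisoned_page_safe_to_offline(struct page *page)
{
	if (!PageHWPoison(page))
		return false;

	/* assumption: hwpoison itself holds a single extra reference */
	return page_count(page) <= 1;
}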
On 2022/5/12 20:59, David Hildenbrand wrote: > On 12.05.22 13:13, Miaohe Lin wrote: >> On 2022/5/12 15:28, David Hildenbrand wrote: >>>>>>> >>>>>>> Once the problematic DIMM would actually get unplugged, the memory block devices >>>>>>> would get removed as well. So when hotplugging a new DIMM in the same >>>>>>> location, we could online that memory again. >>>>>> >>>>>> What about PG_hwpoison flags? struct pages are also freed and reallocated >>>>>> in the actual DIMM replacement? >>>>> >>>>> Once memory is offline, the memmap is stale and is no longer >>>>> trustworthy. It gets reinitialize during memory onlining -- so any >>>>> previous PG_hwpoison is overridden at least there. In some setups, we >>>>> even poison the whole memmap via page_init_poison() during memory offlining. >>>>> >>>>> Apart from that, we should be freeing the memmap in all relevant cases >>>>> when removing memory. I remember there are a couple of corner cases, but >>>>> we don't really have to care about that. >>>> >>>> OK, so there seems no need to manipulate struct pages for hwpoison in >>>> all relevant cases. >>> >>> Right. When offlining a memory block, all we have to do is remember if >>> we stumbled over a hwpoisoned page and rememebr that inside the memory >>> block. Rejecting to online is then easy. >> >> BTW: How should we deal with the below race window: >> >> CPU A CPU B CPU C >> accessing page while hold page refcnt >> memory_failure happened on page >> offline_pages >> page can be offlined due to page refcnt >> is ignored when PG_hwpoison is set >> can still access page struct... >> >> Any in use page (with page refcnt incremented) might be offlined while its content, e.g. flags, private ..., can >> still be accessed if the above race happened. Is this possible? Or am I miss something? Any suggestion to fix it? >> I can't figure out a way yet. :( > > I assume you mean that test_pages_isolated() essentially only checks for > PageHWPoison() and doesn't care about the refcount? Yes, page refcount is ignored when PG_HWPoison is set. > > That part is very dodgy and it's part of my motivation to question that > whole handling in the first place. > > > In do_migrate_range(), there is a comment: > > " > HWPoison pages have elevated reference counts so the migration would > fail on them. It also doesn't make any sense to migrate them in the > first place. Still try to unmap such a page in case it is still mapped > (e.g. current hwpoison implementation doesn't unmap KSM pages but keep > the unmap as the catch all safety net). > " > > My assumption would be: if there are any unexpected references on a > hwpoison page, we must fail offlining. Ripping out the page might be > more harmful then just leaving it in place and failing offlining for the > time being. I tend to agree with this. :) > > > > I am no export on PageHWPoison(). Which guarantees do we have regarding > the page count? > > If we succeed in unmapping the page, there shouldn't be any references > from the page tables. We might still have GUP references to such pages, > and it would be fair enough to fail offlining. I remember we try > removing the page from the pagecache etc. to free up these references. > So which additional references do we have that the comment in offlining > code talks about? A single additional one from hwpoison code? IIRC, memory_failure will hold one extra page refcount. This refcount will be released in unpoison_memory. 
> > Once we figure that out, we might tweak test_pages_isolated() to also > consider the page count and not rip out random pages that are still > referenced in the system. > But there are some corner cases where PageHWPoison is set but the page refcnt is not increased. So we can't rely on the page refcount for that right now. :( Thanks!