Message ID | 20210412224320.1747638-1-nao.horiguchi@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | mm,hwpoison: fix sending SIGBUS for Action Required MCE | expand |
On Tue, 13 Apr 2021 07:43:17 +0900 Naoya Horiguchi <nao.horiguchi@gmail.com> wrote: > Hi, > > I wrote this patchset to materialize what I think is the current > allowable solution mentioned by the previous discussion [1]. > I simply borrowed Tony's mutex patch and Aili's return code patch, > then I queued another one to find error virtual address in the best > effort manner. I know that this is not a perfect solution, but > should work for some typical case. > > My simple testing showed this patchset seems to work as intended, > but if you have the related testcases, could you please test and > let me have some feedback? > > Thanks, > Naoya Horiguchi > > [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ > --- > Summary: > > Aili Yao (1): > mm,hwpoison: return -EHWPOISON when page already > > Naoya Horiguchi (1): > mm,hwpoison: add kill_accessing_process() to find error virtual address > > Tony Luck (1): > mm/memory-failure: Use a mutex to avoid memory_failure() races > > arch/x86/kernel/cpu/mce/core.c | 13 +++- > include/linux/swapops.h | 5 ++ > mm/memory-failure.c | 166 ++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 178 insertions(+), 6 deletions(-) Hi Naoya, Thanks for your patch and complete fix for this race issue. I test your patches, mainly it worked as expected, but in some cases it failed, I checked it and find some doubt places, could you help confirm it? 1. there is a compile warning: static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, unsigned long end, struct mm_walk *walk) { struct hwp_walk *hwp = (struct hwp_walk *)walk->private; int ret; ---- here It seems this ret may not be initialized, and some time ret may be error retruned? and for this: static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift, unsigned long poisoned_pfn, struct to_kill *tk) { unsigned long pfn; I think it better to be initialized too. 2. In the function hwpoison_pte_range(): if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) this check seem we should use PMD_SIZE/PAGE_SIZE or some macro like this? 3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK); this seems not exact accurate? 4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift) { if (tk->addr) { --- I am not sure about this check and if it will lead failure. return 1; } In my test, it seems sometimes it will hit this branch, I don't know it's multi entry issue or multi posion issue. when i get to this fail, there is not enough log for this, but i can't reproduce it after that. wolud you help confirm this and if any changes, please post again and I will do the test again. Thansk Aili Yao
On Sat, Apr 17, 2021 at 01:47:51PM +0800, Aili Yao wrote: > On Tue, 13 Apr 2021 07:43:17 +0900 > Naoya Horiguchi <nao.horiguchi@gmail.com> wrote: > > > Hi, > > > > I wrote this patchset to materialize what I think is the current > > allowable solution mentioned by the previous discussion [1]. > > I simply borrowed Tony's mutex patch and Aili's return code patch, > > then I queued another one to find error virtual address in the best > > effort manner. I know that this is not a perfect solution, but > > should work for some typical case. > > > > My simple testing showed this patchset seems to work as intended, > > but if you have the related testcases, could you please test and > > let me have some feedback? > > > > Thanks, > > Naoya Horiguchi > > > > [1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/ > > --- > > Summary: > > > > Aili Yao (1): > > mm,hwpoison: return -EHWPOISON when page already > > > > Naoya Horiguchi (1): > > mm,hwpoison: add kill_accessing_process() to find error virtual address > > > > Tony Luck (1): > > mm/memory-failure: Use a mutex to avoid memory_failure() races > > > > arch/x86/kernel/cpu/mce/core.c | 13 +++- > > include/linux/swapops.h | 5 ++ > > mm/memory-failure.c | 166 ++++++++++++++++++++++++++++++++++++++++- > > 3 files changed, 178 insertions(+), 6 deletions(-) > > Hi Naoya, > > Thanks for your patch and complete fix for this race issue. > > I test your patches, mainly it worked as expected, but in some cases it failed, I checked it > and find some doubt places, could you help confirm it? > > 1. there is a compile warning: > static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr, > unsigned long end, struct mm_walk *walk) > { > struct hwp_walk *hwp = (struct hwp_walk *)walk->private; > int ret; ---- here > > It seems this ret may not be initialized, and some time ret may be error retruned? Yes, I'll initialize with 0. > > and for this: > static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift, > unsigned long poisoned_pfn, struct to_kill *tk) > { > unsigned long pfn; > > I think it better to be initialized too. OK. > > 2. In the function hwpoison_pte_range(): > if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) this check seem we should use PMD_SIZE/PAGE_SIZE or some macro like this? Thanks, that's right. HPAGE_PMD_NR seems to fit here. We also need "#ifdef CONFIG_TRANSPARENT_HUGEPAGE" to use it. > > 3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK); this seems not exact accurate? As operators precedence rule, we evaluate this in the following order: 1. ~ (bitwise NOT), 2. << (bitwise left shift), and 3. & (bitwise AND). So this part are equivalent with ((hwp->pfn << PAGE_SHIFT) & (~PMD_MASK)), which should properly calculate an address offset within a pmd. > > 4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift) > { > if (tk->addr) { --- I am not sure about this check and if it will lead failure. > return 1; > } > In my test, it seems sometimes it will hit this branch, I don't know it's multi entry issue or multi posion issue. > when i get to this fail, there is not enough log for this, but i can't reproduce it after that. As you commented above, this version is buggy, and that might make you see the random failure. Sorry about that. > > wolud you help confirm this and if any changes, please post again and I will do the test again. I'll send the fixed one for patch 3/3 soon later. Thanks, Naoya Horiguchi