Message ID | 20210421005728.1994268-1-nao.horiguchi@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | mm,hwpoison: fix sending SIGBUS for Action Required MCE | expand |
> I updated the series based on previous feedbacks/discussions. Tested-by: Tony Luck <tony.luck@intel.com> -Tony
>> I updated the series based on previous feedbacks/discussions. > > Tested-by: Tony Luck <tony.luck@intel.com> Maybe this series should get a "Cc: stable" tag too? I don't have a specific commit to provide for Fixes:, but it deals with an issue where sometimes processes return from the machine check handler and hit the same machine check again. It's hard[1] to hit in v5.8, but somehow gets progressively easier in newer versions. -Tony [1] Initially I thought it didn't happen at all in v5.8. Then wasted a bunch of time trying to bisect. But my "git bisect good" responses turned out to be randomly assigned when I just hadn't tried enough times to make the problem show up.
On Fri, May 07, 2021 at 03:37:07AM +0000, Luck, Tony wrote: > >> I updated the series based on previous feedbacks/discussions. > > > > Tested-by: Tony Luck <tony.luck@intel.com> > > Maybe this series should get a "Cc: stable" tag too? The second patch (I squashed 2/3 and 3/3 into one in v4) is a little too large (more than 100 lines of change), which doesn't meet stable rule. But the first patch looks suitable to me for stable. > > I don't have a specific commit to provide for Fixes:, but it deals > with an issue where sometimes processes return from the machine > check handler and hit the same machine check again. If we consider that this issue is visible after LMCE is supported, the following commit could be the target of Fixes tag. commit 243d657eaf540db882f73497060da5a4f7d86a90 Author: Ashok Raj <ashok.raj@intel.com> Date: Thu Jun 4 18:55:24 2015 +0200 x86/mce: Handle Local MCE events but according to your test result, any other factor might affect when the issue actually got reproducible. > > It's hard[1] to hit in v5.8, but somehow gets progressively easier in > newer versions. > > -Tony > > [1] Initially I thought it didn't happen at all in v5.8. Then wasted a bunch > of time trying to bisect. But my "git bisect good" responses turned out to > be randomly assigned when I just hadn't tried enough times to make the > problem show up. Thanks for time and effort for testing. - Naoya
On Fri, 7 May 2021 05:24:22 +0000 HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote: > On Fri, May 07, 2021 at 03:37:07AM +0000, Luck, Tony wrote: > > >> I updated the series based on previous feedbacks/discussions. > > > > > > Tested-by: Tony Luck <tony.luck@intel.com> > > > > Maybe this series should get a "Cc: stable" tag too? > > The second patch (I squashed 2/3 and 3/3 into one in v4) is a little > too large (more than 100 lines of change), which doesn't meet stable rule. > But the first patch looks suitable to me for stable. Before cc:stable, would you please do one stress test first? It failed in my server, but I didn't dig into it, maybe the fail is meaningless. Just a small suggestion. Thanks! Aili Yao
> Before cc:stable, would you please do one stress test first? > It failed in my server, but I didn't dig into it, maybe the fail is meaningless. > Just a small suggestion. Upstream plus these three patches passed an overnight 8-hour stress test on four machines here (running a test that's been failing with hung kernels and kernel crashes w/o these patches). What were the symptoms of your failed server? -Tony
On Fri, 7 May 2021 18:02:13 +0000 "Luck, Tony" <tony.luck@intel.com> wrote: > > Before cc:stable, would you please do one stress test first? > > It failed in my server, but I didn't dig into it, maybe the fail is meaningless. > > Just a small suggestion. > > Upstream plus these three patches passed an overnight 8-hour stress test > on four machines here (running a test that's been failing with hung kernels > and kernel crashes w/o these patches). > > What were the symptoms of your failed server? > Sorry, I am not sure if the stress test in my server is same with yours. Usually, I will do one RAS stress test from mce-test/cases/stress before post something. Maybe this test is not general useful or you have tested it and passed or you may think the test is not proper though or you tested and you don't think it's one real issue, then just ignore it please. Thanks! Aili Yao
> Sorry, I am not sure if the stress test in my server is same with yours. It's better if you test different things ... that's how we get better coverage. > Usually, I will do one RAS stress test from mce-test/cases/stress before post something. > Maybe this test is not general useful or you have tested it and passed or you may think > the test is not proper though or you tested and you don't think it's one real issue, > then just ignore it please. I haven't re-run the mce-test cases for a while. The test that was run on these patches involves a multi-threaded application running on all logical CPUs. When poison is injected many of the CPUs can hit the poison close together. So it's a good stress test for the new series. -Tony