Message ID | 20191017142123.24245-1-osalvador@suse.de (mailing list archive) |
---|---|
Series | Hwpoison rework {hard,soft}-offline |

Hello!

We are facing similar problems with hwpoisoned pages on one of our production clusters after a kernel update to stable 4.19. An application that does a lot of memory allocations sometimes receives SIGBUS, with a message in dmesg about a hardware memory corruption fault. In the kernel and mce logs we saw messages about soft offlining pages with correctable errors. Those events always happened before the application was killed. This is not the behavior we expect: we want our application to continue working on a smaller set of available pages in the system.

This issue is difficult to reproduce, but we suppose that the reason for such behavior is that compaction does not check whether a page is poisoned while processing free pages, so valid userspace data gets migrated to bad pages. We wrote a simple test:
- soft offline the first 4 pages in every 64 contiguous pages in ZONE_NORMAL by writing the pfn to /sys/devices/system/memory/soft_offline_page
- force compaction with echo 1 >> /proc/sys/vm/compact_memory

Without this patch series, after these steps bash becomes unusable and every attempt to run any command leads to SIGBUS with a message about a hardware memory corruption fault. After applying this series to our kernel tree we can no longer reproduce such SIGBUSes with our test. On upstream kernel 5.7 this behavior is still reproducible.

So we want to know: why wasn't this patchset merged upstream? Are there any problems with this rework of {soft,hard}-offline handling?

BTW, this patchset should be updated with upstream changes in mm.

Thanks for your replies.

--
Dmitry Yakunin
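
For readers who want to try the two-step test described above, here is a minimal Python sketch. It is not the script used in the report. The function names, the `dry_run` switch, and the choice of 64-page-aligned blocks are assumptions made for illustration, and the PFN range of ZONE_NORMAL has to be supplied by the caller (for example from /proc/zoneinfo). The sketch also assumes that soft_offline_page expects a physical address rather than a raw pfn, so it writes pfn * page size; please check this against the ABI documentation for your kernel.

```python
import os
from typing import Iterator, List

# Paths taken from the report above.
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"
COMPACT_MEMORY = "/proc/sys/vm/compact_memory"


def target_pfns(start_pfn: int, end_pfn: int,
                group: int = 64, count: int = 4) -> Iterator[int]:
    """Yield the first `count` PFNs of every `group`-aligned block
    that starts inside [start_pfn, end_pfn).
    Alignment to `group` is an assumption of this sketch."""
    first_block = -(-start_pfn // group) * group  # round up to a block boundary
    for base in range(first_block, end_pfn, group):
        for pfn in range(base, min(base + count, end_pfn)):
            yield pfn


def soft_offline(pfn: int, page_size: int, dry_run: bool = True) -> str:
    """Soft-offline one page. Returns the string that is (or would be) written.
    Assumption: the sysfs file takes a physical address, not a pfn."""
    addr = hex(pfn * page_size)
    if not dry_run:
        # Needs root; the write may fail for pages the kernel cannot offline.
        with open(SOFT_OFFLINE, "w") as f:
            f.write(addr)
    return addr


def force_compaction(dry_run: bool = True) -> None:
    """Trigger compaction of all zones, like `echo 1 > compact_memory`."""
    if not dry_run:
        with open(COMPACT_MEMORY, "w") as f:
            f.write("1")


def run_test(start_pfn: int, end_pfn: int, dry_run: bool = True) -> List[str]:
    """Soft-offline the selected pages, then force compaction."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    written = [soft_offline(pfn, page_size, dry_run)
               for pfn in target_pfns(start_pfn, end_pfn)]
    force_compaction(dry_run)
    return written
```

With `dry_run=True` (the default) nothing is written, and `run_test(start, end)` only returns the addresses it would offline. Running it with `dry_run=False` requires root, and it should only be done on a disposable test machine, since on an affected kernel the report says the system becomes unusable.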

Hi Dmitry,

On Thu, Jun 11, 2020 at 07:43:19PM +0300, Dmitry Yakunin wrote:
> Hello!
>
> We are faced with similar problems with hwpoisoned pages
> on one of our production clusters after kernel update to stable 4.19.
> Application that does a lot of memory allocations sometimes caught SIGBUS signal
> with message in dmesg about hardware memory corruption fault.
> In kernel and mce logs we saw messages about soft offlining pages with
> correctable errors. Those events always had happened before application
> was killed. This is not the behavior we expect. We want our application to
> continue working on a smaller set of available pages in the system.
>
> This issue is difficult to reproduce, but we suppose that the reason for such
> behavior is that compaction does not check for page poisonness while processing
> free pages, so as a result valid userspace data gets migrated to bad pages.
> We wrote the simple test:
> - soft offline first 4 pages in every 64 continuous pages in ZONE_NORMAL
> through writing pfn to /sys/devices/system/memory/soft_offline_page
> - force compaction by echo 1 >> /proc/sys/vm/compact_memory
> Without this patch series after these steps bash became unusable
> and every attempt to run any command leads to SIGBUS with message about
> hardware memory corruption fault. And after applying this series to our kernel
> tree we cannot reproduce such SIGBUSes by our test. On upstream kernel 5.7
> this behavior is still reproducible.
>
> So, we want to know, why this patchset wasn't merged to the upstream?
> Is there any problems in such rework for {soft,hard}-offline handling?

No technical reason, it's just because I didn't have enough power to push this to be merged. Really sorry about that.

> BTW, this patchset should be updated with upstream changes in mm.

I'm working on this now and still need more testing to confirm, but I hope I'll update and post this for 5.9.

Thanks,
Naoya Horiguchi