From patchwork Fri Mar 11 19:51:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ivan Babrou X-Patchwork-Id: 12778543 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1FFB9C433F5 for ; Fri, 11 Mar 2022 19:52:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 40BE28D0002; Fri, 11 Mar 2022 14:52:00 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 393D98D0001; Fri, 11 Mar 2022 14:52:00 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 20E898D0002; Fri, 11 Mar 2022 14:52:00 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.28]) by kanga.kvack.org (Postfix) with ESMTP id 0C7C68D0001 for ; Fri, 11 Mar 2022 14:52:00 -0500 (EST) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id D36AD12133F for ; Fri, 11 Mar 2022 19:51:59 +0000 (UTC) X-FDA: 79233151158.15.BB6687D Received: from mail-yb1-f176.google.com (mail-yb1-f176.google.com [209.85.219.176]) by imf04.hostedemail.com (Postfix) with ESMTP id 5BE1C4002B for ; Fri, 11 Mar 2022 19:51:59 +0000 (UTC) Received: by mail-yb1-f176.google.com with SMTP id z30so19057163ybi.2 for ; Fri, 11 Mar 2022 11:51:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=mime-version:from:date:message-id:subject:to:cc; bh=h9rjrYiCz3BaVVXnAud09tb1Qq3tgD9CDU+KSXWzbJQ=; b=aW/qTxeupnKAR+3HX3gZb1f61FpmTDACKSqX3JlQVb0GB087BGvEI5LaXoOX18QidW T+mT1HJY3xvn55nwb/igbzLGfV+eaOZgjWd9+4ouvfLobiWK+v14vKCCha+cBqO+bUbj R4ZkBhpxLywkRVOLk00tVbgKjdAvWgEwQ1DIw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:from:date:message-id:subject:to:cc; bh=h9rjrYiCz3BaVVXnAud09tb1Qq3tgD9CDU+KSXWzbJQ=; b=hoSBUXcz3kNiQYGb5L33pB0xqE25MchGGvvzuC8mwQgPrRVcl1w+GzCrwYR568q1tD jkyXbjZ6VsFmQGMipejTbNuy2jrPcc5mFIMAI8OkgW+xT6HCsHcgKj0+yjqnVeOGuoxw XT500VPo0/fv5JnRQc0uxrwg8AaweVU5Nza7WSWQZpydukYe9bEm357SgVCHCovwZttJ +jtL7fY2e4SP1zvsKOA6fSsh2DcGyR53EeKfhBbkipgDIUrgO64CIle/6QSTme5IeBih bGCFcE7TjeGP1IqdvVYRAuK+nBnk1Q2ze9ueTu6jgDe3RCUylm/Wq1EeHkgphIJCgBGX yLTw== X-Gm-Message-State: AOAM531ABkWzz1D8HQSshfMDbSANHKlEtP+KHt2YnP+DwTYhG4ntHuCZ dYj4yJXfz3yTWrmAUrw6FNZUYln/5/NSriVRHTbCi9FMBq1U/Q== X-Google-Smtp-Source: ABdhPJxeFbHnJNNSMG5acTS1B/lHVPkO/75dga5mzDxm5QcTuR/yDQmlrLNsZ6anxOjSzd0LjpmelmbkAWYQ2Cq8tPo= X-Received: by 2002:a25:4109:0:b0:628:7778:fb18 with SMTP id o9-20020a254109000000b006287778fb18mr9347178yba.412.1647028318367; Fri, 11 Mar 2022 11:51:58 -0800 (PST) MIME-Version: 1.0 From: Ivan Babrou Date: Fri, 11 Mar 2022 11:51:47 -0800 Message-ID: Subject: zram corruption due to uninitialized do_swap_page fault To: Linux MM Cc: linux-kernel , Andrew Morton , Minchan Kim , Nitin Gupta , Sergey Senozhatsky , Jens Axboe , linux-block@vger.kernel.org X-Stat-Signature: khod5oqhfaxkbanny5oou84h6b181t5n Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=cloudflare.com header.s=google header.b="aW/qTxeu"; spf=none (imf04.hostedemail.com: domain of ivan@cloudflare.com has no SPF policy when checking 209.85.219.176) smtp.mailfrom=ivan@cloudflare.com; dmarc=pass (policy=reject) header.from=cloudflare.com X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 5BE1C4002B X-HE-Tag: 1647028319-369435 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello, We're looking into using zram, but unfortunately we ran into some corruption issues. We've seen rocksdb complaining about "Corruption: bad entry in block", and we've also seen some coredumps that point at memory being zeroed out. One of our Rust processes coredumps contains a non-null pointer pointing at zero, among other things: * core::ptr::non_null::NonNull {pointer: 0x0} In fact, a whole bunch of memory around this pointer was all zeros. Disabling zram resolves all issues, and we can't reproduce any of these issues with other swap setups. I've tried adding crc32 checksumming for pages that are compressed, but it didn't catch the issue either, even though userspace facing symptoms were present. My crc32 code doesn't touch ZRAM_SAME pages, though. Unfortunately, this isn't trivial to replicate, and I believe that it depends on zram used for swap specifically, not for zram as a block device. Specifically, swap_slot_free_notify looks suspicious. Here's a patch that I have to catch the issue in the act: zram_fill_page(mem, PAGE_SIZE, value); In essence, it warns whenever a page is read from zram that was not previously written to. To make this work, one needs to zero out zram prior to running mkswap on it. I have prepared a GitHub repo with my observations and a reproduction: * https://github.com/bobrik/zram-corruptor I'm able to trigger the following in an aarch64 VM with two threads reading the same memory out of swap: [ 512.651752][ T7285] ------------[ cut here ]------------ [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram] [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3 nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64 virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic pci_host_common unix [last unloaded: zsmalloc] [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406 [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT) [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram] [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram] [ 512.662422][ T7285] sp : ffffffc01018bac0 [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27: ffffff9e4e725280 [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24: 0000000100033e6c [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21: fffffffe7a36d840 [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18: ffffffc010711068 [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15: 0000000000000000 [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12: 0000000000000000 [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 : 0000000001000000 [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 : 051609fe50833de3 [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 : ffffffd305b746af [ 512.668483][ T7285] Call trace: [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram 745969ed35ea0fb382bfd518d6f70e13966e9b52] [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram 745969ed35ea0fb382bfd518d6f70e13966e9b52] [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram 745969ed35ea0fb382bfd518d6f70e13966e9b52] [ 512.670584][ T7285] bdev_read_page+0x74/0xac [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4 [ 512.671243][ T7285] do_swap_page+0x2f4/0x988 [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8 [ 512.672412][ T7285] do_page_fault+0x274/0x428 [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8 [ 512.673083][ T7285] do_mem_abort+0x50/0xc8 [ 512.673293][ T7285] el0_da+0x3c/0x74 [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8 [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]--- [ 512.675359][ T7285] zram: Page 502 read from zram without previous write I can also trace accesses to zram to catch the unfortunate sequence: zram_bvec_write index = 502 [cpu = 3, tid = 7286] zram_free_page index = 502 [cpu = 3, tid = 7286] zram_bvec_read index = 502 [cpu = 3, tid = 7286] zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read With stacks for zram_free_page: zram_bvec_write index = 502 [cpu = 3, tid = 7286] zram_free_page index = 502 [cpu = 3, tid = 7286] zram_free_page+0 $x.97+32 zram_rw_page+180 bdev_write_page+124 __swap_writepage+116 swap_writepage+160 pageout+284 shrink_page_list+2892 shrink_inactive_list+688 shrink_lruvec+360 shrink_node_memcgs+148 shrink_node+860 shrink_zones+368 do_try_to_free_pages+232 try_to_free_mem_cgroup_pages+292 try_charge_memcg+608 zram_bvec_read index = 502 [cpu = 3, tid = 7286] zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free zram_free_page+0 swap_range_free+220 swap_entry_free+244 swapcache_free_entries+152 free_swap_slot+288 __swap_entry_free+216 swap_free+108 do_swap_page+1776 handle_pte_fault+204 handle_mm_fault+644 do_page_fault+628 do_translation_fault+92 do_mem_abort+80 el0_da+60 el0t_64_sync_handler+196 el0t_64_sync+420 zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read The very last read is the same one that triggered the warning from my patch in dmesg. You can see that the slot is freed before reading by swapcache_free_entries. As far as I can see, only zram implements swap_slot_free_notify. Swapping in an uninitialized zram page results in all zeroes copied, which matches the symptoms. The issue doesn't reproduce if I pin both threads to the same CPU. It also doesn't reproduce with a single thread. All of this seems to point at some sort of race condition. I was able to reproduce this on x86_64 bare metal server as well. I'm happy to try out mitigation approaches for this. If my understanding here is incorrect, I'm also happy to try out patches that could help me catch the issue in the wild. Reported-by: Ivan Babrou Tested-by: Ivan Babrou Signed-off-by: Minchan Kim diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 438ce34ee760..fea46a70a3c9 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index, unsigned long value; void *mem; + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME))) + pr_warn("Page %u read from zram without previous write\n", index); + value = handle ? zram_get_element(zram, index) : 0; mem = kmap_atomic(page);