From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, akpm@linux-foundation.org,
    muchun.song@linux.dev, mike.kravetz@oracle.com, leit@meta.com,
    willy@infradead.org
Subject: [PATCH v3 0/3] hugetlbfs: close race between MADV_DONTNEED and page fault
Date: Mon, 25 Sep 2023 16:28:49 -0400
Message-ID: <20230925203030.703439-1-riel@surriel.com>

v3: fix compile error w/ lockdep and test case errors with patch 3
v2: fix the locking bug found with the libhugetlbfs tests.

Malloc libraries, like jemalloc and tcmalloc, make decisions on when
to call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages
sitting around to satisfy a page fault.
However, with hugetlbfs, systems often allocate only the exact number
of huge pages that the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

       CPU 1                            CPU 2

       MADV_DONTNEED
       unmap page
       shoot down TLB entry
                                        page fault
                                        fail to allocate a huge page
                                        killed with SIGBUS
       free page

Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of
the MADV_DONTNEED VMA until the huge pages have actually been freed.

The third patch in the series is more of an RFC. Using the
invalidate_lock instead of the hugetlb_vma_lock greatly simplifies
the code, but at the cost of turning a per-VMA lock into a lock
per backing hugetlbfs file, which could slow things down when
multiple processes are mapping the same hugetlbfs file.
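
To make the interleaving concrete, here is a small userspace model of
the race and of the fix. It is only a sketch, not kernel code: a
threading.Lock stands in for hugetlb_vma_lock (the real lock is a
read/write lock, which this model ignores), a counter stands in for the
huge page pool, and the names HugePagePool and simulate are made up for
illustration. Events force the bad ordering so the outcome is
deterministic.

```python
import threading


class HugePagePool:
    """Toy stand-in for a hugetlbfs pool with a fixed number of pages."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self._lock = threading.Lock()  # protects the counter only

    def alloc(self):
        """Take one page; return False if the pool is empty."""
        with self._lock:
            if self.free_pages == 0:
                return False
            self.free_pages -= 1
            return True

    def free(self):
        """Return one page to the pool."""
        with self._lock:
            self.free_pages += 1


def simulate(hold_lock_until_free):
    """Run one MADV_DONTNEED against one page fault on the same VMA.

    Returns (fault outcome, free pages left in the pool).
    """
    pool = HugePagePool(free_pages=0)  # the only huge page is mapped
    vma_lock = threading.Lock()        # assumption: models hugetlb_vma_lock
    unmapped = threading.Event()       # page unmapped, TLB entry shot down
    fault_done = threading.Event()
    result = {}

    def madvise_dontneed():
        vma_lock.acquire()
        unmapped.set()                 # unmap page + shoot down TLB entry
        if hold_lock_until_free:
            pool.free()                # fixed: free while faults are locked out
            vma_lock.release()
        else:
            vma_lock.release()         # racy: lock dropped before the free
            fault_done.wait()          # force the fault to run in the window
            pool.free()

    def page_fault():
        unmapped.wait()                # fault hits right after the unmap
        with vma_lock:
            result["outcome"] = "ok" if pool.alloc() else "SIGBUS"
        fault_done.set()

    threads = [threading.Thread(target=madvise_dontneed),
               threading.Thread(target=page_fault)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["outcome"], pool.free_pages


if __name__ == "__main__":
    print(simulate(hold_lock_until_free=False))  # ('SIGBUS', 1)
    print(simulate(hold_lock_until_free=True))   # ('ok', 0)
```

With the lock dropped before the page is freed, the fault finds an
empty pool and the model reports SIGBUS even though a page becomes free
a moment later. With the lock held until the free, the fault simply
waits and then succeeds.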