From patchwork Wed Aug 9 11:03:00 2023
X-Patchwork-Submitter: Tetsuo Handa
X-Patchwork-Id: 13347775
Message-ID: <6cc13636-eda6-6a95-6564-db1c9ae76bb6@I-love.SAKURA.ne.jp>
Date: Wed, 9 Aug 2023 20:03:00 +0900
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
From: Tetsuo Handa
To: Michal Hocko, Sebastian Andrzej Siewior, Andrew Morton
Cc: Petr Mladek, linux-mm, LKML, "Luis Claudio R. Goncalves", Boqun Feng,
 Ingo Molnar, John Ogness, Mel Gorman, Peter Zijlstra, Thomas Gleixner,
 Waiman Long, Will Deacon
Subject: [PATCH v2] mm/page_alloc: don't check zonelist_update_seq from
 atomic allocations

Sebastian Andrzej Siewior reported that commit
1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
seqlock") is problematic for the CONFIG_PREEMPT_RT=y case, because PREEMPT_RT
kernels do not allow holding spinlocks with interrupts disabled: PREEMPT_RT
manages priority by making spinlocks sleepable.

Commit 3d36424b3b58 ("mm/page_alloc: fix race condition between
build_all_zonelists and page allocation") was wrong in that it introduced a
spinlock into !__GFP_DIRECT_RECLAIM allocations without recognizing that we
cannot enumerate all possible locking dependencies. As commit 726ccdba1521
("kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan") notes,
!__GFP_DIRECT_RECLAIM allocations might happen with arbitrary locks held. But
the page allocator does not offer a gfp flag that opts out of taking the
zonelist_update_seq seqlock. Under such conditions, the safer approach is to
not take zonelist_update_seq at all when sleeping is not permitted.

This is an updated and optimized version of [1]. This patch eliminates the

  while ((__seq = seqprop_sequence(s)) & 1)
          cpu_relax();

path from zonelist_iter_begin(), which is always called whenever
__alloc_pages_slowpath() runs. There is no need to wait for a zonelist
rebuild to complete, for a rebuild being in flight does not mean the
allocation can never succeed; when the allocation does not fail, this "while"
loop is nothing but a waste of CPU time. And it is very likely that no
rebuild is in flight in the first place. This patch not only avoids the
possibility of deadlock, but also makes zonelist_iter_begin() faster and
simpler. This change is an improvement even without considering the printk()
and lockdep/KCSAN related problems.
Reported-by: Sebastian Andrzej Siewior
Closes: https://lkml.kernel.org/r/20230621104034.HT6QnNkQ@linutronix.de
Link: https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp [1]
Fixes: 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Tetsuo Handa
Nacked-by: Michal Hocko
---
This patch replaces "mm/page_alloc: use write_seqlock_irqsave() instead
write_seqlock() + local_irq_save()." in the linux-next.git tree.

 mm/page_alloc.c | 73 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..5557d9a2ff2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
 
 /*
  * Zonelists may change due to hotplug during allocation. Detect when zonelists
- * have been rebuilt so allocation retries. Reader side does not lock and
- * retries the allocation if zonelist changes. Writer side is protected by the
- * embedded spin_lock.
+ * have been rebuilt so __GFP_DIRECT_RECLAIM allocation retries. Writer side
+ * makes this sequence odd before rebuilding zonelists and makes this sequence
+ * even after rebuilding zonelists. Since writer side disables context switching
+ * when rebuilding zonelists, and !__GFP_DIRECT_RECLAIM allocation will not
+ * retry when zonelists changed, reader side does not need to hold a lock (but
+ * needs to use data_race() annotation), making locking dependency simpler.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		/* See comment above. */
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		/* See comment above. */
+		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
 
-	return seq;
+		/*
+		 * "seq != seq2" indicates that __build_all_zonelists() has
+		 * started or has finished rebuilding zonelists, hence retry.
+		 * "seq == seq2 && (seq2 & 1)" indicates that
+		 * __build_all_zonelists() is still rebuilding zonelists
+		 * with context switching disabled, hence retry.
+		 * "seq == seq2 && !(seq2 & 1)" indicates that
+		 * __build_all_zonelists() did not rebuild zonelists, hence
+		 * no retry.
+		 */
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
+
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4164,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4190,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_disable();
+#endif
+	/* Serialize increments of zonelist_update_seq. */
+	spin_lock_irqsave(&lock, flags);
+	zonelist_update_seq++;
+	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5201,13 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	/* Tell check_retry_zonelist() that we finished rebuilding zonelists. */
+	smp_wmb();
+	zonelist_update_seq++;
+	spin_unlock_irqrestore(&lock, flags);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_enable();
+#endif
 }
 
 static noinline void __init