From patchwork Mon Jul 4 19:05:41 2022
X-Patchwork-Submitter: Dai Ngo
X-Patchwork-Id: 12905793
From: Dai Ngo
To: chuck.lever@oracle.com
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH v2 0/2] NFSD: handling memory shortage problem with Courteous server
Date: Mon, 4 Jul 2022 12:05:41 -0700
Message-Id: <1656961543-25210-1-git-send-email-dai.ngo@oracle.com>
X-Mailer: git-send-email 1.8.3.1
X-Mailing-List: linux-nfs@vger.kernel.org

Currently the idle timeout for a courtesy client is fixed at 1 day.
If lots of courtesy clients remain in the system, they can cause a memory
shortage that affects the operation of other modules in the kernel. This
problem can be observed by running the pynfs nfs4.0 CID5 test in a loop.
Eventually the system runs out of memory; rpc.gssd fails to add a new
watch:

   rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
   No space left on device

and alloc_inode also fails with out of memory:

   Call Trace:
    dump_stack_lvl+0x33/0x42
    dump_header+0x4a/0x1ed
    oom_kill_process+0x80/0x10d
    out_of_memory+0x237/0x25f
    __alloc_pages_slowpath.constprop.0+0x617/0x7b6
    __alloc_pages+0x132/0x1e3
    alloc_slab_page+0x15/0x33
    allocate_slab+0x78/0x1ab
    ? alloc_inode+0x38/0x8d
    ___slab_alloc+0x2af/0x373
    ? alloc_inode+0x38/0x8d
    ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
    ? alloc_inode+0x38/0x8d
    __slab_alloc.constprop.0+0x1c/0x24
    kmem_cache_alloc_lru+0x8c/0x142
    alloc_inode+0x38/0x8d
    iget_locked+0x60/0x126
    kernfs_get_inode+0x18/0x105
    kernfs_iop_lookup+0x6d/0xbc
    __lookup_slow+0xb7/0xf9
    lookup_slow+0x3a/0x52
    walk_component+0x90/0x100
    ? inode_permission+0x87/0x128
    link_path_walk.part.0.constprop.0+0x266/0x2ea
    ? path_init+0x101/0x2f2
    path_lookupat+0x4c/0xfa
    filename_lookup+0x63/0xd7
    ? getname_flags+0x32/0x17a
    ? kmem_cache_alloc+0x11f/0x144
    ? getname_flags+0x16c/0x17a
    user_path_at_empty+0x37/0x4b
    do_readlinkat+0x61/0x102
    __x64_sys_readlinkat+0x18/0x1b
    do_syscall_64+0x57/0x72
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch series addresses the problem as follows:

   . The fixed 1-day idle time limit for courtesy clients is removed.
     A courtesy client is now allowed to remain valid as long as
     available system memory stays above 80%.

   . When available system memory drops below 80%, the laundromat starts
     trimming older courtesy clients. The number of courtesy clients to
     trim is a percentage of the total number of courtesy clients that
     exist in the system, computed from the current percentage of
     available system memory.

   . The percentage of courtesy clients to trim is taken from this
     table:

        ----------------------------------
        | % memory  | % courtesy clients |
        | available |      to trim       |
        ----------------------------------
        |   > 80    |         0          |
        |   > 70    |         10         |
        |   > 60    |         20         |
        |   > 50    |         40         |
        |   > 40    |         60         |
        |   > 30    |         80         |
        |   < 30    |        100         |
        ----------------------------------

   . Due to the overhead of removing client records, each laundromat run
     trims at most 128 clients. This prevents the laundromat from
     spending so long destroying clients that it misses performing its
     other tasks in a timely manner.

   . The laundromat is scheduled to run sooner when more courtesy
     clients need to be destroyed.
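To make the table concrete, here is a rough sketch of the calculation.
This is for illustration only: the helper names (nfsd_mem_avail_pct,
nfsd_cc_trim_pct, nfsd_cc_nr_to_trim) and the exact way the memory
percentage is sampled are illustrative assumptions, not necessarily what
the patches do.

#include <linux/minmax.h>
#include <linux/mm.h>

/* Cap on how many clients one laundromat run may destroy (see above). */
#define NFSD_CC_MAX_TRIM_PER_RUN        128

/* One possible way to sample the percentage of available system memory. */
static unsigned int nfsd_mem_avail_pct(void)
{
        unsigned long total = totalram_pages();

        return total ? 100UL * si_mem_available() / total : 100;
}

/* Map % available memory to % of courtesy clients to trim (table above). */
static unsigned int nfsd_cc_trim_pct(unsigned int mem_avail_pct)
{
        if (mem_avail_pct > 80)
                return 0;
        if (mem_avail_pct > 70)
                return 10;
        if (mem_avail_pct > 60)
                return 20;
        if (mem_avail_pct > 50)
                return 40;
        if (mem_avail_pct > 40)
                return 60;
        if (mem_avail_pct > 30)
                return 80;
        return 100;
}

/* Number of courtesy clients this laundromat run should destroy. */
static unsigned int nfsd_cc_nr_to_trim(unsigned int nr_courtesy)
{
        unsigned int nr;

        nr = nr_courtesy * nfsd_cc_trim_pct(nfsd_mem_avail_pct()) / 100;
        return min_t(unsigned int, nr, NFSD_CC_MAX_TRIM_PER_RUN);
}

The laundromat would then walk the courtesy-client list oldest first,
destroy at most that many clients in one run, and reschedule itself
sooner while that number remains non-zero.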
The shrinker method was evaluated and found unsuitable for this problem,
for the following reasons:

   . Destroying an NFSv4 client in the shrinker context can cause a
     deadlock, since nfsd_file_put calls into the underlying FS code and
     we have no control over what it does there, as seen in this stack
     trace (an illustrative sketch of the shrinker shape follows after
     this list):

      ======================================================
      WARNING: possible circular locking dependency detected
      5.19.0-rc2_sk+ #1 Not tainted
      ------------------------------------------------------
      lck/31847 is trying to acquire lock:
      ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70

      but task is already holding lock:
      ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0

      which lock already depends on the new lock.

      the existing dependency chain (in reverse order) is:

      -> #1 (fs_reclaim){+.+.}-{0:0}:
             fs_reclaim_acquire+0xc0/0x100
             __kmalloc+0x51/0x320
             btrfs_buffered_write+0x2eb/0xd90
             btrfs_do_write_iter+0x6bf/0x11c0
             do_iter_readv_writev+0x2bb/0x5a0
             do_iter_write+0x131/0x630
             nfsd_vfs_write+0x4da/0x1900 [nfsd]
             nfsd4_write+0x2ac/0x760 [nfsd]
             nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
             nfsd_dispatch+0x4ed/0xc10 [nfsd]
             svc_process_common+0xd3f/0x1b00 [sunrpc]
             svc_process+0x361/0x4f0 [sunrpc]
             nfsd+0x2d6/0x570 [nfsd]
             kthread+0x2a1/0x340
             ret_from_fork+0x22/0x30

      -> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
             __lock_acquire+0x318d/0x7830
             lock_acquire+0x1bb/0x500
             down_write+0x82/0x130
             btrfs_inode_lock+0x38/0x70
             btrfs_sync_file+0x280/0x1010
             nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
             nfsd_file_put+0xd4/0x110 [nfsd]
             release_all_access+0x13a/0x220 [nfsd]
             nfs4_free_ol_stateid+0x40/0x90 [nfsd]
             free_ol_stateid_reaplist+0x131/0x210 [nfsd]
             release_openowner+0xf7/0x160 [nfsd]
             __destroy_client+0x3cc/0x740 [nfsd]
             nfsd_cc_lru_scan+0x271/0x410 [nfsd]
             shrink_slab.constprop.0+0x31e/0x7d0
             shrink_node+0x54b/0xe50
             try_to_free_pages+0x394/0xba0
             __alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
             __alloc_pages+0x4d6/0x580
             __handle_mm_fault+0xc25/0x2810
             handle_mm_fault+0x136/0x480
             do_user_addr_fault+0x3d8/0xec0
             exc_page_fault+0x5d/0xc0
             asm_exc_page_fault+0x27/0x30

      other info that might help us debug this:

       Possible unsafe locking scenario:

             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     lock(&sb->s_type->i_mutex_key#16);
                                     lock(fs_reclaim);
        lock(&sb->s_type->i_mutex_key#16);

        *** DEADLOCK ***

   . The shrinker kicks in only when available memory drops really low,
     roughly below 5%. By that time other components in the system have
     already run into trouble with the memory shortage. For example,
     rpc.gssd starts failing to add watches in
     /var/lib/nfs/rpc_pipefs/nfsd4_cb once the memory consumed by these
     watches reaches about 1% of available system memory.

   . Destroying an NFSv4 client has significant overhead, due to the
     upcall to user space to remove the client record, which might
     access a storage device. There is a potential deadlock if the
     storage subsystem needs to allocate memory.
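As noted in the first item above, here is a minimal, hypothetical sketch
of the shrinker shape that was considered; this is not the code that was
evaluated, and the counter and function names are made up. The point is
only that ->scan_objects() is called from direct reclaim, under
fs_reclaim, so any client destruction started there re-enters filesystem
code from reclaim context:

#include <linux/atomic.h>
#include <linux/shrinker.h>

/* Hypothetical counter of courtesy clients, maintained elsewhere. */
static atomic_t nfsd_courtesy_client_count = ATOMIC_INIT(0);

static unsigned long
nfsd_cc_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
{
        return atomic_read(&nfsd_courtesy_client_count);
}

static unsigned long
nfsd_cc_shrinker_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        /*
         * This callback runs in direct-reclaim context, under fs_reclaim.
         * Destroying a client here leads to __destroy_client() ->
         * nfsd_file_put() -> ->fsync() in the exporting filesystem, which
         * is the circular dependency in the lockdep report above.
         */
        return SHRINK_STOP;
}

static struct shrinker nfsd_cc_shrinker = {
        .count_objects  = nfsd_cc_shrinker_count,
        .scan_objects   = nfsd_cc_shrinker_scan,
        .seeks          = DEFAULT_SEEKS,
};

Deferring the destruction to the laundromat avoids running it in reclaim
context, which is essentially what this series does instead.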