From patchwork Wed May 20 06:40:32 2020
X-Patchwork-Submitter: Ritesh Harjani
X-Patchwork-Id: 11559395
From: Ritesh Harjani
To: linux-ext4@vger.kernel.org
V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [PATCHv5 1/5] ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks Date: Wed, 20 May 2020 12:10:32 +0530 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-20_02:2020-05-19,2020-05-20 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 suspectscore=3 bulkscore=0 priorityscore=1501 malwarescore=0 mlxscore=0 mlxlogscore=999 impostorscore=0 spamscore=0 clxscore=1015 cotscore=-2147483648 lowpriorityscore=0 adultscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005200051 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list of every group to discard the group's PA to free up the space if allocation request fails. Consider below race:- Process A Process B 1. allocate blocks 1. Fails block allocation from ext4_mb_regular_allocator() ext4_lock_group() allocated blocks more than ac_o_ex.fe_len ext4_unlock_group() 2. Scans the grp->bb_prealloc_list (under ext4_lock_group()) and find nothing and thus return -ENOSPC. 2. Add the additional blocks to PA list ext4_lock_group() add blocks to grp->bb_prealloc_list ext4_unlock_group() Above race could be avoided if we add those additional blocks to grp->bb_prealloc_list at the same time with block allocation when ext4_lock_group() was still held. With this discard-PA will know if there are actually any blocks which could be freed from the PA Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++++++++----------------- 1 file changed, 62 insertions(+), 35 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 33a69424942c..decc5168d126 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -349,6 +349,7 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap, ext4_group_t group); static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, ext4_group_t group); +static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac); static inline void *mb_correct_addr_and_bit(int *bit, void *addr) { @@ -1701,6 +1702,14 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, sbi->s_mb_last_start = ac->ac_f_ex.fe_start; spin_unlock(&sbi->s_md_lock); } + /* + * As we've just preallocated more space than + * user requested originally, we store allocated + * space in a special descriptor. 
Signed-off-by: Ritesh Harjani
---
 fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 62 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33a69424942c..decc5168d126 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -349,6 +349,7 @@ static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
                                                 ext4_group_t group);
 static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
                                                 ext4_group_t group);
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
@@ -1701,6 +1702,14 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
                 sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
                 spin_unlock(&sbi->s_md_lock);
         }
+        /*
+         * As we've just preallocated more space than
+         * user requested originally, we store allocated
+         * space in a special descriptor.
+         */
+        if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+                ext4_mb_new_preallocation(ac);
+
 }
 
 /*
@@ -1949,7 +1958,7 @@ void ext4_mb_simple_scan_group(struct ext4_allocation_context *ac,
 
         ext4_mb_use_best_found(ac, e4b);
 
-        BUG_ON(ac->ac_b_ex.fe_len != ac->ac_g_ex.fe_len);
+        BUG_ON(ac->ac_f_ex.fe_len != ac->ac_g_ex.fe_len);
 
         if (EXT4_SB(sb)->s_mb_stats)
                 atomic_inc(&EXT4_SB(sb)->s_bal_2orders);
@@ -3675,7 +3684,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
 /*
  * creates new preallocated space for given inode
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 {
         struct super_block *sb = ac->ac_sb;
@@ -3688,10 +3697,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
         BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
         BUG_ON(ac->ac_status != AC_STATUS_FOUND);
         BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+        BUG_ON(ac->ac_pa == NULL);
 
-        pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-        if (pa == NULL)
-                return -ENOMEM;
+        pa = ac->ac_pa;
 
         if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
                 int winl;
@@ -3735,7 +3743,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
         pa->pa_pstart = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
         pa->pa_len = ac->ac_b_ex.fe_len;
         pa->pa_free = pa->pa_len;
-        atomic_set(&pa->pa_count, 1);
         spin_lock_init(&pa->pa_lock);
         INIT_LIST_HEAD(&pa->pa_inode_list);
         INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3755,21 +3762,17 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
         pa->pa_obj_lock = &ei->i_prealloc_lock;
         pa->pa_inode = ac->ac_inode;
 
-        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
         list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
         spin_lock(pa->pa_obj_lock);
         list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
         spin_unlock(pa->pa_obj_lock);
-
-        return 0;
 }
 
 /*
  * creates new preallocated space for locality group inodes belongs to
  */
-static noinline_for_stack int
+static noinline_for_stack void
 ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 {
         struct super_block *sb = ac->ac_sb;
@@ -3781,11 +3784,9 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
         BUG_ON(ac->ac_o_ex.fe_len >= ac->ac_b_ex.fe_len);
         BUG_ON(ac->ac_status != AC_STATUS_FOUND);
         BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
+        BUG_ON(ac->ac_pa == NULL);
 
-        BUG_ON(ext4_pspace_cachep == NULL);
-        pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
-        if (pa == NULL)
-                return -ENOMEM;
+        pa = ac->ac_pa;
 
         /* preallocation can change ac_b_ex, thus we store actually
          * allocated blocks for history */
@@ -3795,7 +3796,6 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
         pa->pa_lstart = pa->pa_pstart;
         pa->pa_len = ac->ac_b_ex.fe_len;
         pa->pa_free = pa->pa_len;
-        atomic_set(&pa->pa_count, 1);
         spin_lock_init(&pa->pa_lock);
         INIT_LIST_HEAD(&pa->pa_inode_list);
         INIT_LIST_HEAD(&pa->pa_group_list);
@@ -3816,26 +3816,20 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
         pa->pa_obj_lock = &lg->lg_prealloc_lock;
         pa->pa_inode = NULL;
 
-        ext4_lock_group(sb, ac->ac_b_ex.fe_group);
         list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
-        ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
 
         /*
          * We will later add the new pa to the right bucket
          * after updating the pa_free in ext4_mb_release_context
          */
-        return 0;
 }
 
-static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
+static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
 {
-        int err;
-
         if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
-                err = ext4_mb_new_group_pa(ac);
+                ext4_mb_new_group_pa(ac);
         else
-                err = ext4_mb_new_inode_pa(ac);
-        return err;
+                ext4_mb_new_inode_pa(ac);
 }
 
 /*
@@ -4150,6 +4144,29 @@ void ext4_discard_preallocations(struct inode *inode)
         }
 }
 
+static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
+{
+        struct ext4_prealloc_space *pa;
+
+        BUG_ON(ext4_pspace_cachep == NULL);
+        pa = kmem_cache_zalloc(ext4_pspace_cachep, GFP_NOFS);
+        if (!pa)
+                return -ENOMEM;
+        atomic_set(&pa->pa_count, 1);
+        ac->ac_pa = pa;
+        return 0;
+}
+
+static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
+{
+        struct ext4_prealloc_space *pa = ac->ac_pa;
+
+        BUG_ON(!pa);
+        ac->ac_pa = NULL;
+        WARN_ON(!atomic_dec_and_test(&pa->pa_count));
+        kmem_cache_free(ext4_pspace_cachep, pa);
+}
+
 #ifdef CONFIG_EXT4_DEBUG
 static inline void ext4_mb_show_pa(struct super_block *sb)
 {
@@ -4606,23 +4623,28 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
         if (!ext4_mb_use_preallocated(ac)) {
                 ac->ac_op = EXT4_MB_HISTORY_ALLOC;
                 ext4_mb_normalize_request(ac, ar);
+
+                *errp = ext4_mb_pa_alloc(ac);
+                if (*errp)
+                        goto errout;
 repeat:
                 /* allocate space in core */
                 *errp = ext4_mb_regular_allocator(ac);
-                if (*errp)
-                        goto discard_and_exit;
-
-                /* as we've just preallocated more space than
-                 * user requested originally, we store allocated
-                 * space in a special descriptor */
-                if (ac->ac_status == AC_STATUS_FOUND &&
-                    ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
-                        *errp = ext4_mb_new_preallocation(ac);
+                /*
+                 * pa allocated above is added to grp->bb_prealloc_list only
+                 * when we were able to allocate some block i.e. when
+                 * ac->ac_status == AC_STATUS_FOUND.
+                 * And error from above mean ac->ac_status != AC_STATUS_FOUND
+                 * So we have to free this pa here itself.
+                 */
                 if (*errp) {
-                discard_and_exit:
+                        ext4_mb_pa_free(ac);
                         ext4_discard_allocated_blocks(ac);
                         goto errout;
                 }
+                if (ac->ac_status == AC_STATUS_FOUND &&
+                    ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
+                        ext4_mb_pa_free(ac);
         }
         if (likely(ac->ac_status == AC_STATUS_FOUND)) {
                 *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
@@ -4637,6 +4659,11 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
                 freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
                 if (freed)
                         goto repeat;
+                /*
+                 * If block allocation fails then the pa allocated above
+                 * needs to be freed here itself.
+                 */
+                ext4_mb_pa_free(ac);
                 *errp = -ENOSPC;
         }

From patchwork Wed May 20 06:40:33 2020
X-Patchwork-Submitter: Ritesh Harjani
X-Patchwork-Id: 11559389
From: Ritesh Harjani
To: linux-ext4@vger.kernel.org
V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [PATCHv5 2/5] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Date: Wed, 20 May 2020 12:10:33 +0530 Message-Id: <1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-20_02:2020-05-19,2020-05-20 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 malwarescore=0 suspectscore=1 mlxlogscore=999 cotscore=-2147483648 spamscore=0 priorityscore=1501 bulkscore=0 adultscore=0 phishscore=0 mlxscore=0 lowpriorityscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005200051 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Implement ext4_mb_discard_preallocations_should_retry() which we will need in later patches to add more logic like check for sequence number match to see if we should retry for block allocation or not. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index decc5168d126..b75408d72773 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4543,6 +4543,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed) return freed; } +static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb, + struct ext4_allocation_context *ac) +{ + int freed; + + freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); + if (freed) + return true; + return false; +} + /* * Main entry point into mballoc to allocate blocks * it tries to use preallocation first, then falls back @@ -4551,7 +4562,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed) ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, struct ext4_allocation_request *ar, int *errp) { - int freed; struct ext4_allocation_context *ac = NULL; struct ext4_sb_info *sbi; struct super_block *sb; @@ -4656,8 +4666,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ar->len = ac->ac_b_ex.fe_len; } } else { - freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); - if (freed) + if (ext4_mb_discard_preallocations_should_retry(sb, ac)) goto repeat; /* * If block allocation fails then the pa allocated above From patchwork Wed May 20 06:40:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11559391 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A2EF0912 for ; Wed, 20 May 2020 06:41:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8DB9B207FB for ; Wed, 20 May 2020 06:41:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726686AbgETGlR (ORCPT ); Wed, 20 May 2020 02:41:17 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:53366 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726631AbgETGlN (ORCPT ); Wed, 20 May 2020 02:41:13 -0400 Received: from pps.filterd 
Signed-off-by: Ritesh Harjani
---
 fs/ext4/mballoc.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index decc5168d126..b75408d72773 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4543,6 +4543,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
         return freed;
 }
 
+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+                        struct ext4_allocation_context *ac)
+{
+        int freed;
+
+        freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+        if (freed)
+                return true;
+        return false;
+}
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4551,7 +4562,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
                                 struct ext4_allocation_request *ar, int *errp)
 {
-        int freed;
         struct ext4_allocation_context *ac = NULL;
         struct ext4_sb_info *sbi;
         struct super_block *sb;
@@ -4656,8 +4666,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
                         ar->len = ac->ac_b_ex.fe_len;
                 }
         } else {
-                freed  = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-                if (freed)
+                if (ext4_mb_discard_preallocations_should_retry(sb, ac))
                         goto repeat;
                 /*
                  * If block allocation fails then the pa allocated above

From patchwork Wed May 20 06:40:34 2020
X-Patchwork-Submitter: Ritesh Harjani
X-Patchwork-Id: 11559391
From: Ritesh Harjani
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, Jan Kara, tytso@mit.edu,
    "Aneesh Kumar K. V", linux-kernel@vger.kernel.org, Ritesh Harjani
Subject: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
Date: Wed, 20 May 2020 12:10:34 +0530
Message-Id: <7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com>

There could be a race in ext4_mb_discard_group_preallocations() where
the 1st thread may be iterating through the group's bb_prealloc_list,
removing all the PAs and adding them to the function's local list head.
Now if the 2nd thread comes in to discard the group preallocations, it
will see that grp->bb_prealloc_list is empty and will return 0.

Consider a case where we have few groups (e.g. just group 0); this may
even return an -ENOSPC error from ext4_mb_new_blocks() (which is where
we call ext4_mb_discard_group_preallocations()). But that is wrong,
since the 2nd thread should have waited for the 1st thread to release
all the PAs and retried the allocation, given that the 1st thread was
going to discard the PAs anyway.

The algorithm using this percpu seq counter is as below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either
   allocate or free these blocks, i.e. while marking those blocks as
   used/free in mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully
   identify that the bb_prealloc_list is not empty and hence proceed to
   discard those PAs inside ext4_mb_discard_group_preallocations().

Now, to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on the local cpu. Only when block allocation fails and no freed blocks
were found do we sample the percpu seq counter for all cpus, using the
function ext4_get_discard_pa_seq_sum(). This happens after making sure
that all the PAs on grp->bb_prealloc_list got freed, or that the list
is empty.

It can well be argued: why not just check grp->bb_free to see whether
there are any free blocks to be allocated? Here are the two concerns
that were discussed:
1. If for some reason the blocks available in the group are not
   appropriate for the allocation logic (say e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result in infinite looping, since grp->bb_free
   is non-zero.
2. Also, before preallocation was clubbed with block allocation under
   the same ext4_lock_group(), there were a lot of races where
   grp->bb_free could not be relied upon.

Due to the above, this patch uses the discard_pa_seq logic to determine
whether we should retry block allocation. Say n threads are trying for
block allocation and none of them could allocate or discard any blocks;
then all n threads will fail the block allocation and return -ENOSPC
(since their sampled seq counters will all match, as no block
allocation/discard was done during that time).
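
Condensed into one place, the protocol reads as below (an illustrative
sketch only; the hunks that follow carry the authoritative code):

    static DEFINE_PER_CPU(u64, discard_pa_seq);

    /* writers: every allocate/free/discard bumps the local counter,
     * e.g. in mb_mark_used() and mb_free_blocks() */
    this_cpu_inc(discard_pa_seq);

    /* slow-path reader: sum the counter over all cpus */
    static inline u64 ext4_get_discard_pa_seq_sum(void)
    {
            int cpu;
            u64 seq = 0;

            for_each_possible_cpu(cpu)
                    seq += per_cpu(discard_pa_seq, cpu);
            return seq;
    }

    /* allocator: cheap local sample up front ... */
    seq = *this_cpu_ptr(&discard_pa_seq);
    /* ... and on failure, retry only if the global sum moved */
    if (ext4_get_discard_pa_seq_sum() != seq)
            goto repeat;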
Signed-off-by: Ritesh Harjani
---
 fs/ext4/mballoc.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b75408d72773..754ff9f65199 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
                                                 ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ *    allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we either allocate
+ *    or free these blocks i.e. while marking those blocks as used/free in
+ *    mb_mark_used()/mb_free_blocks().
+ * 3. We also increment this percpu seq counter when we successfully identify
+ *    that the bb_prealloc_list is not empty and hence proceed for discarding
+ *    of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
+ */
+static DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+        int __cpu;
+        u64 __seq = 0;
+
+        for_each_possible_cpu(__cpu)
+                __seq += per_cpu(discard_pa_seq, __cpu);
+        return __seq;
+}
+
 static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
 {
 #if BITS_PER_LONG == 64
@@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
         mb_check_buddy(e4b);
         mb_free_blocks_double(inode, e4b, first, count);
 
+        this_cpu_inc(discard_pa_seq);
         e4b->bd_info->bb_free += count;
         if (first < e4b->bd_info->bb_first_free)
                 e4b->bd_info->bb_first_free = first;
@@ -1603,6 +1633,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
         mb_check_buddy(e4b);
         mb_mark_used_double(e4b, start, len);
 
+        this_cpu_inc(discard_pa_seq);
         e4b->bd_info->bb_free -= len;
         if (e4b->bd_info->bb_first_free == start)
                 e4b->bd_info->bb_first_free += len;
@@ -3962,6 +3993,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
         INIT_LIST_HEAD(&list);
 repeat:
         ext4_lock_group(sb, group);
+        this_cpu_inc(discard_pa_seq);
         list_for_each_entry_safe(pa, tmp,
                                 &grp->bb_prealloc_list, pa_group_list) {
                 spin_lock(&pa->pa_lock);
@@ -4544,14 +4576,26 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
 }
 
 static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
-                        struct ext4_allocation_context *ac)
+                        struct ext4_allocation_context *ac, u64 *seq)
 {
         int freed;
+        u64 seq_retry = 0;
+        bool ret = false;
 
         freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
-        if (freed)
-                return true;
-        return false;
+        if (freed) {
+                ret = true;
+                goto out_dbg;
+        }
+        seq_retry = ext4_get_discard_pa_seq_sum();
+        if (seq_retry != *seq) {
+                *seq = seq_retry;
+                ret = true;
+        }
+
+out_dbg:
"yes" : "no"); + return ret; } /* @@ -4568,6 +4612,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ext4_fsblk_t block = 0; unsigned int inquota = 0; unsigned int reserv_clstrs = 0; + u64 seq; might_sleep(); sb = ar->inode->i_sb; @@ -4630,6 +4675,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, } ac->ac_op = EXT4_MB_HISTORY_PREALLOC; + seq = *this_cpu_ptr(&discard_pa_seq); if (!ext4_mb_use_preallocated(ac)) { ac->ac_op = EXT4_MB_HISTORY_ALLOC; ext4_mb_normalize_request(ac, ar); @@ -4666,7 +4712,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ar->len = ac->ac_b_ex.fe_len; } } else { - if (ext4_mb_discard_preallocations_should_retry(sb, ac)) + if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)) goto repeat; /* * If block allocation fails then the pa allocated above From patchwork Wed May 20 06:40:35 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11559397 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 862FE912 for ; Wed, 20 May 2020 06:41:34 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7240020884 for ; Wed, 20 May 2020 06:41:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726764AbgETGlc (ORCPT ); Wed, 20 May 2020 02:41:32 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:6478 "EHLO mx0b-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726655AbgETGlQ (ORCPT ); Wed, 20 May 2020 02:41:16 -0400 Received: from pps.filterd (m0127361.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04K6Wx41022666; Wed, 20 May 2020 02:41:09 -0400 Received: from ppma05fra.de.ibm.com (6c.4a.5195.ip4.static.sl-reverse.com [149.81.74.108]) by mx0a-001b2d01.pphosted.com with ESMTP id 312wsjpqjq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 May 2020 02:41:09 -0400 Received: from pps.filterd (ppma05fra.de.ibm.com [127.0.0.1]) by ppma05fra.de.ibm.com (8.16.0.42/8.16.0.42) with SMTP id 04K6f7Q0030066; Wed, 20 May 2020 06:41:07 GMT Received: from b06cxnps4074.portsmouth.uk.ibm.com (d06relay11.portsmouth.uk.ibm.com [9.149.109.196]) by ppma05fra.de.ibm.com with ESMTP id 313x4xh8f4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 May 2020 06:41:06 +0000 Received: from b06wcsmtp001.portsmouth.uk.ibm.com (b06wcsmtp001.portsmouth.uk.ibm.com [9.149.105.160]) by b06cxnps4074.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04K6f4rb41681008 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 May 2020 06:41:04 GMT Received: from b06wcsmtp001.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 2065DA4060; Wed, 20 May 2020 06:41:04 +0000 (GMT) Received: from b06wcsmtp001.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6058CA4054; Wed, 20 May 2020 06:41:02 +0000 (GMT) Received: from localhost.localdomain.com (unknown [9.79.188.115]) by b06wcsmtp001.portsmouth.uk.ibm.com (Postfix) with ESMTP; Wed, 20 May 2020 06:41:02 +0000 (GMT) From: Ritesh Harjani To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Jan Kara , tytso@mit.edu, "Aneesh Kumar K . 
V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [PATCHv5 4/5] ext4: mballoc: Refactor ext4_mb_good_group() Date: Wed, 20 May 2020 12:10:35 +0530 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-20_03:2020-05-19,2020-05-20 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 mlxlogscore=999 lowpriorityscore=0 cotscore=-2147483648 suspectscore=3 clxscore=1015 priorityscore=1501 spamscore=0 malwarescore=0 phishscore=0 bulkscore=0 adultscore=0 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005200056 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org ext4_mb_good_group() definition was changed some time back and now it even initializes the buddy cache (via ext4_mb_init_group()), if in case the EXT4_MB_GRP_NEED_INIT() is true for a group. Note that ext4_mb_init_group() could sleep and so should not be called under a spinlock held. This is fine as of now because ext4_mb_good_group() is called before loading the buddy bitmap without ext4_lock_group() held and again called after loading the bitmap, only this time with ext4_lock_group() held. But still this whole thing is confusing. So this patch refactors out ext4_mb_good_group_nolock() which should be called when without holding ext4_lock_group(). Also in further patches we hold the spinlock (ext4_lock_group()) while doing any calculations which involves grp->bb_free or grp->bb_fragments. Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 78 ++++++++++++++++++++++++++++++----------------- 1 file changed, 50 insertions(+), 28 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 754ff9f65199..c9297c878a90 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2106,15 +2106,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac, } /* - * This is now called BEFORE we load the buddy bitmap. + * This is also called BEFORE we load the buddy bitmap. * Returns either 1 or 0 indicating that the group is either suitable - * for the allocation or not. In addition it can also return negative - * error code when something goes wrong. + * for the allocation or not. 
Signed-off-by: Ritesh Harjani
---
 fs/ext4/mballoc.c | 78 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 754ff9f65199..c9297c878a90 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2106,15 +2106,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
 }
 
 /*
- * This is now called BEFORE we load the buddy bitmap.
+ * This is also called BEFORE we load the buddy bitmap.
  * Returns either 1 or 0 indicating that the group is either suitable
- * for the allocation or not. In addition it can also return negative
- * error code when something goes wrong.
+ * for the allocation or not.
  */
-static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
                                 ext4_group_t group, int cr)
 {
-        unsigned free, fragments;
+        ext4_grpblk_t free, fragments;
         int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
         struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
 
@@ -2122,23 +2121,16 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
         free = grp->bb_free;
         if (free == 0)
-                return 0;
+                return false;
         if (cr <= 2 && free < ac->ac_g_ex.fe_len)
-                return 0;
+                return false;
 
         if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
-                return 0;
-
-        /* We only do this if the grp has never been initialized */
-        if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-                int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
-                if (ret)
-                        return ret;
-        }
+                return false;
 
         fragments = grp->bb_fragments;
         if (fragments == 0)
-                return 0;
+                return false;
 
         switch (cr) {
         case 0:
@@ -2148,31 +2140,63 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
                 if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
                     (flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
                     ((group % flex_size) == 0))
-                        return 0;
+                        return false;
 
                 if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
                     (free / fragments) >= ac->ac_g_ex.fe_len)
-                        return 1;
+                        return true;
 
                 if (grp->bb_largest_free_order < ac->ac_2order)
-                        return 0;
+                        return false;
 
-                return 1;
+                return true;
         case 1:
                 if ((free / fragments) >= ac->ac_g_ex.fe_len)
-                        return 1;
+                        return true;
                 break;
         case 2:
                 if (free >= ac->ac_g_ex.fe_len)
-                        return 1;
+                        return true;
                 break;
         case 3:
-                return 1;
+                return true;
         default:
                 BUG();
         }
 
-        return 0;
+        return false;
+}
+
+/*
+ * This could return negative error code if something goes wrong
+ * during ext4_mb_init_group(). This should not be called with
+ * ext4_lock_group() held.
+ */
+static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+                                     ext4_group_t group, int cr)
+{
+        struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+        ext4_grpblk_t free;
+        int ret = 0;
+
+        free = grp->bb_free;
+        if (free == 0)
+                goto out;
+        if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+                goto out;
+        if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+                goto out;
+
+        /* We only do this if the grp has never been initialized */
+        if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+                ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+                if (ret)
+                        return ret;
+        }
+
+        ret = ext4_mb_good_group(ac, group, cr);
+out:
+        return ret;
 }
 
 static noinline_for_stack int
@@ -2260,7 +2284,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
                                 group = 0;
 
                         /* This now checks without needing the buddy page */
-                        ret = ext4_mb_good_group(ac, group, cr);
+                        ret = ext4_mb_good_group_nolock(ac, group, cr);
                         if (ret <= 0) {
                                 if (!first_err)
                                         first_err = ret;
@@ -2278,11 +2302,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
                          * block group
                          */
                         ret = ext4_mb_good_group(ac, group, cr);
-                        if (ret <= 0) {
+                        if (ret == 0) {
                                 ext4_unlock_group(sb, group);
                                 ext4_mb_unload_buddy(&e4b);
-                                if (!first_err)
-                                        first_err = ret;
                                 continue;
                         }

From patchwork Wed May 20 06:40:36 2020
X-Patchwork-Submitter: Ritesh Harjani
X-Patchwork-Id: 11559393
From: Ritesh Harjani
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, Jan Kara, tytso@mit.edu,
    "Aneesh Kumar K. V", linux-kernel@vger.kernel.org, Ritesh Harjani
Subject: [PATCHv5 5/5] ext4: mballoc: Use lock for checking free blocks while retrying
Date: Wed, 20 May 2020 12:10:36 +0530
Message-Id: <9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com>

Currently, while doing block allocation, grp->bb_free may be getting
modified if discard is happening in parallel. For example, consider a
case where many threads have preallocated a lot of blocks and one
thread is trying to discard all of this group's PAs. Then it could
happen that we see the group's bb_free as zero and fail the allocation,
even though there is sufficient space if we free up all the PAs.

So this patch adds another flag, EXT4_MB_STRICT_CHECK, which will be
set if we were unable to allocate any blocks in the first try (since we
may not have considered blocks about to be discarded from PA lists).
During the retry attempt to allocate blocks we will then use
ext4_lock_group() while checking whether the group is good for
allocation.
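
The strict check amounts to conditional locking around the existing
group checks, roughly as below (a sketch; see the
ext4_mb_good_group_nolock() hunks in the diff that follows):

    bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;

    if (should_lock)
            ext4_lock_group(sb, group);
    free = grp->bb_free;    /* now stable against parallel discard */
    ...
    if (should_lock)
            ext4_unlock_group(sb, group);

The flag is set in ext4_mb_discard_preallocations_should_retry(), so
the first allocation attempt stays lock-free and only the retry pays
the locking cost.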
Signed-off-by: Ritesh Harjani
---
 fs/ext4/ext4.h              |  2 ++
 fs/ext4/mballoc.c           | 13 ++++++++++++-
 include/trace/events/ext4.h |  3 ++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fb37fb3fe689..d185f3bcb9eb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -150,6 +150,8 @@ enum SHIFT_DIRECTION {
 #define EXT4_MB_USE_ROOT_BLOCKS         0x1000
 /* Use blocks from reserved pool */
 #define EXT4_MB_USE_RESERVED            0x2000
+/* Do strict check for free blocks while retrying block allocation */
+#define EXT4_MB_STRICT_CHECK            0x4000
 
 struct ext4_allocation_request {
         /* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c9297c878a90..a9083113a8c0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2176,9 +2176,13 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
                                      ext4_group_t group, int cr)
 {
         struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+        struct super_block *sb = ac->ac_sb;
+        bool should_lock = ac->ac_flags & EXT4_MB_STRICT_CHECK;
         ext4_grpblk_t free;
         int ret = 0;
 
+        if (should_lock)
+                ext4_lock_group(sb, group);
         free = grp->bb_free;
         if (free == 0)
                 goto out;
@@ -2186,6 +2190,8 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
                 goto out;
         if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
                 goto out;
+        if (should_lock)
+                ext4_unlock_group(sb, group);
 
         /* We only do this if the grp has never been initialized */
         if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
@@ -2194,8 +2200,12 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
                         return ret;
         }
 
+        if (should_lock)
+                ext4_lock_group(sb, group);
         ret = ext4_mb_good_group(ac, group, cr);
 out:
+        if (should_lock)
+                ext4_unlock_group(sb, group);
         return ret;
 }
 
@@ -4610,7 +4620,8 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
                 goto out_dbg;
         }
         seq_retry = ext4_get_discard_pa_seq_sum();
-        if (seq_retry != *seq) {
+        if (!(ac->ac_flags & EXT4_MB_STRICT_CHECK) || seq_retry != *seq) {
+                ac->ac_flags |= EXT4_MB_STRICT_CHECK;
                 *seq = seq_retry;
                 ret = true;
         }
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 19c87661eeec..0df9efa80b16 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -35,7 +35,8 @@ struct partial_cluster;
         { EXT4_MB_DELALLOC_RESERVED,    "DELALLOC_RESV" },      \
         { EXT4_MB_STREAM_ALLOC,         "STREAM_ALLOC" },       \
         { EXT4_MB_USE_ROOT_BLOCKS,      "USE_ROOT_BLKS" },      \
-        { EXT4_MB_USE_RESERVED,         "USE_RESV" })
+        { EXT4_MB_USE_RESERVED,         "USE_RESV" },           \
+        { EXT4_MB_STRICT_CHECK,         "STRICT_CHECK" })
 
 #define show_map_flags(flags)   __print_flags(flags, "|",       \
         { EXT4_GET_BLOCKS_CREATE,       "CREATE" },     \