From patchwork Sat Jan 27 01:58:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhang Yi
X-Patchwork-Id: 13533910
From: Zhang Yi
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca,
 jack@suse.cz, ritesh.list@gmail.com, hch@infradead.org, djwong@kernel.org,
 willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com,
 yi.zhang@huaweicloud.com, chengzhihao1@huawei.com, yukuai3@huawei.com,
 wangkefeng.wang@huawei.com
Subject: [RFC PATCH v3 18/26] ext4: implement buffered write iomap path
Date: Sat, 27 Jan 2024 09:58:17 +0800
Message-Id: <20240127015825.1608160-19-yi.zhang@huaweicloud.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20240127015825.1608160-1-yi.zhang@huaweicloud.com>
References: <20240127015825.1608160-1-yi.zhang@huaweicloud.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

From: Zhang Yi

Implement the buffered write iomap path. Use ext4_da_map_blocks() to map
delalloc extents, and add ext4_iomap_get_blocks() to allocate blocks when
delalloc is disabled or free space is about to run out.

Note that we don't want to support the dioread_lock mount option any
more, so drop the ext4_should_dioread_nolock() branch and always allocate
unwritten extents for new blocks. Also make ext4_should_dioread_nolock()
independent of the DIOREAD_NOLOCK mount option, so that it always returns
true for inodes that use the buffered iomap path. Besides, the i_disksize
update is postponed until after writeback.

With this change, blocks are mapped or allocated a batch at a time
instead of one by one, which should bring a considerable performance
gain.

Signed-off-by: Zhang Yi
---
 fs/ext4/ext4.h      |   3 +
 fs/ext4/ext4_jbd2.h |   7 ++
 fs/ext4/file.c      |  19 ++++-
 fs/ext4/inode.c     | 168 ++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 190 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3461cb3ff524..03cdcf3d86a5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2970,6 +2970,7 @@ int ext4_walk_page_buffers(handle_t *handle,
 				     struct buffer_head *bh));
 int do_journal_get_write_access(handle_t *handle, struct inode *inode,
 				struct buffer_head *bh);
+int ext4_nonda_switch(struct super_block *sb);
 #define FALL_BACK_TO_NONDELALLOC 1
 #define CONVERT_INLINE_DATA	 2
 
@@ -3827,6 +3828,8 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 extern const struct iomap_ops ext4_iomap_ops;
 extern const struct iomap_ops ext4_iomap_overwrite_ops;
 extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
 
 static inline int ext4_buffer_uptodate(struct buffer_head *bh)
 {
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..c1194ba8d6f2 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -499,6 +499,13 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
  */
 static inline int ext4_should_dioread_nolock(struct inode *inode)
 {
+	/*
+	 * Always enable dioread_nolock for inodes which use the buffered
+	 * iomap path.
+	 */
+	if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+		return 1;
+
 	if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
 		return 0;
 	if (!S_ISREG(inode->i_mode))
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6aa15dafc677..d15bd6ff1b20 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -282,6 +282,20 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return count;
 }
 
+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+					 struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	const struct iomap_ops *iomap_ops;
+
+	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+		iomap_ops = &ext4_iomap_buffered_da_write_ops;
+	else
+		iomap_ops = &ext4_iomap_buffered_write_ops;
+
+	return iomap_file_buffered_write(iocb, from, iomap_ops);
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 					struct iov_iter *from)
 {
@@ -296,7 +310,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (ret <= 0)
 		goto out;
 
-	ret = generic_perform_write(iocb, from);
+	if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+		ret = ext4_iomap_buffered_write(iocb, from);
+	else
+		ret = generic_perform_write(iocb, from);
 
 out:
 	inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5d542ce13d2a..c48aca637896 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2842,7 +2842,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	return ret;
 }
 
-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
 {
 	s64 free_clusters, dirty_clusters;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3238,6 +3238,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 	return inode->i_state & I_DIRTY_DATASYNC;
 }
 
+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+static const struct iomap_folio_ops ext4_iomap_folio_ops = {
+	.iomap_valid = ext4_iomap_valid,
+};
+
 static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 			   struct ext4_map_blocks *map, loff_t offset,
 			   loff_t length, unsigned int flags)
@@ -3268,6 +3277,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		iomap->flags |= IOMAP_F_MERGED;
 
+	iomap->validity_cookie = READ_ONCE(EXT4_I(inode)->i_es_seq);
+	iomap->folio_ops = &ext4_iomap_folio_ops;
+
 	/*
 	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
 	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3507,11 +3519,42 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
-static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
+static int ext4_iomap_get_blocks(struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	handle_t *handle;
+	int ret, needed_blocks;
+
+	/*
+	 * Reserve one more block for addition to the orphan list in case
+	 * we allocate blocks but the write fails for some reason.
+	 */
+	needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
+	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ret = ext4_map_blocks(handle, inode, map,
+			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+	/*
+	 * Have to stop the journal here since there is a potential
+	 * deadlock caused by a later balance_dirty_pages(): it might wait
+	 * on the dirty pages to be written back, which might start
+	 * another handle and wait for this handle to stop.
+	 */
+	ext4_journal_stop(handle);
+
+	return ret;
+}
+
+#define IOMAP_F_EXT4_DELALLOC		IOMAP_F_PRIVATE
+
+static int __ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
 				loff_t length, unsigned int iomap_flags,
-				struct iomap *iomap, struct iomap *srcmap)
+				struct iomap *iomap, struct iomap *srcmap,
+				bool delalloc)
 {
-	int ret;
+	int ret, retries = 0;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
 
@@ -3521,20 +3564,133 @@ static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
 		return -EINVAL;
 	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
 		return -ERANGE;
-
+retry:
 	/* Calculate the first and last logical blocks respectively. */
 	map.m_lblk = offset >> blkbits;
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+	if (iomap_flags & IOMAP_WRITE) {
+		if (delalloc)
+			ret = ext4_da_map_blocks(inode, &map);
+		else
+			ret = ext4_iomap_get_blocks(inode, &map);
 
-	ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret == -ENOSPC &&
+		    ext4_should_retry_alloc(inode->i_sb, &retries))
+			goto retry;
+	} else {
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+	}
 	if (ret < 0)
 		return ret;
 
 	ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
+	if (delalloc)
+		iomap->flags |= IOMAP_F_EXT4_DELALLOC;
+
+	return 0;
+}
+
+static inline int ext4_iomap_buffered_io_begin(struct inode *inode,
+			loff_t offset, loff_t length, unsigned int flags,
+			struct iomap *iomap, struct iomap *srcmap)
+{
+	return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
+					      iomap, srcmap, false);
+}
+
+static inline int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+			loff_t offset, loff_t length, unsigned int flags,
+			struct iomap *iomap, struct iomap *srcmap)
+{
+	return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
+					      iomap, srcmap, true);
+}
+
+/*
+ * Drop the stale delayed allocation range left by a write failure,
+ * including both start and end blocks. If not, we could leave a range
+ * of delayed extents covered by a clean folio, which could lead to
+ * inaccurate space reservation.
+ */
+static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+				     loff_t length)
+{
+	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+			DIV_ROUND_UP(length, EXT4_BLOCK_SIZE(inode->i_sb)));
 	return 0;
 }
 
+static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
+					 loff_t length, ssize_t written,
+					 unsigned int flags,
+					 struct iomap *iomap)
+{
+	handle_t *handle;
+	loff_t end;
+	int ret = 0, ret2;
+
+	/* delalloc */
+	if (iomap->flags & IOMAP_F_EXT4_DELALLOC) {
+		ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
+			offset, length, written, ext4_iomap_punch_delalloc);
+		if (ret)
+			ext4_warning(inode->i_sb,
+			     "Failed to clean up delalloc for inode %lu, %d",
+			     inode->i_ino, ret);
+		return ret;
+	}
+
+	/* nodelalloc */
+	end = offset + length;
+	if (!(iomap->flags & IOMAP_F_SIZE_CHANGED) && end <= inode->i_size)
+		return 0;
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED) {
+		ext4_update_i_disksize(inode, inode->i_size);
+		ret = ext4_mark_inode_dirty(handle, inode);
+	}
+
+	/*
+	 * If we have allocated more blocks than we copied, we will
+	 * have blocks allocated outside inode->i_size, so truncate
+	 * them.
+	 */
+	if (end > inode->i_size)
+		ext4_orphan_add(handle, inode);
+
+	ret2 = ext4_journal_stop(handle);
+	ret = ret ? : ret2;
+
+	if (end > inode->i_size) {
+		ext4_truncate_failed_write(inode);
+		/*
+		 * If truncate failed early the inode might still be
+		 * on the orphan list; we need to make sure the inode
+		 * is removed from the orphan list in that case.
+		 */
+		if (inode->i_nlink)
+			ext4_orphan_del(NULL, inode);
+	}
+
+	return ret;
+}
+
+
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_io_begin,
+	.iomap_end = ext4_iomap_buffered_write_end,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_da_write_begin,
+	.iomap_end = ext4_iomap_buffered_write_end,
+};
+
 const struct iomap_ops ext4_iomap_buffered_read_ops = {
 	.iomap_begin = ext4_iomap_buffered_io_begin,
 };
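
For readers following the byte-to-block arithmetic, here is a minimal
userspace sketch (not kernel code) of the two range computations that
appear in the patch: the "first and last logical blocks" step in
__ext4_iomap_buffered_io_begin() and the start/length pair passed to
ext4_es_remove_extent() in ext4_iomap_punch_delalloc(). The names
sketch_range, sketch_map_range(), sketch_punch_range() and
SKETCH_MAX_LOGICAL_BLOCK are hypothetical, and the constant's value is
an assumption that should be checked against fs/ext4/ext4.h.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror EXT4_MAX_LOGICAL_BLOCK; verify against fs/ext4/ext4.h. */
#define SKETCH_MAX_LOGICAL_BLOCK 0xFFFFFFFEULL

/* Hypothetical stand-in for the m_lblk/m_len pair of struct ext4_map_blocks. */
struct sketch_range {
	uint64_t lblk;	/* first logical block */
	uint64_t len;	/* number of blocks */
};

/*
 * Same arithmetic as the mapping step in the patch: take the block that
 * holds the first byte and the block that holds the last byte, clamp the
 * latter, and count the blocks in between (inclusive).
 */
static struct sketch_range sketch_map_range(uint64_t offset, uint64_t length,
					    unsigned int blkbits)
{
	struct sketch_range r;
	uint64_t last;

	/* offset + length - 1 is only meaningful for a non-empty range. */
	assert(length > 0);
	/* Assumed precondition: the start block is within the limit. */
	assert((offset >> blkbits) <= SKETCH_MAX_LOGICAL_BLOCK);

	last = (offset + length - 1) >> blkbits;
	if (last > SKETCH_MAX_LOGICAL_BLOCK)
		last = SKETCH_MAX_LOGICAL_BLOCK;

	r.lblk = offset >> blkbits;
	r.len = last - r.lblk + 1;
	return r;
}

/*
 * Same arithmetic as the punch helper in the patch: round the start down
 * to a block and round only the byte length up to whole blocks.
 */
static struct sketch_range sketch_punch_range(uint64_t offset, uint64_t length,
					      unsigned int blkbits)
{
	struct sketch_range r;
	uint64_t blocksize = 1ULL << blkbits;

	r.lblk = offset >> blkbits;
	r.len = (length + blocksize - 1) / blocksize;
	return r;
}
```

With 4 KiB blocks (blkbits = 12), offset 1000 and length 4096, the
mapping computation yields blocks 0 and 1, while the punch computation
yields block 0 only. For block-aligned offsets the two agree. Whether
the punch callback can ever see an unaligned offset depends on how
iomap_file_buffered_write_punch_delalloc() invokes it, which this
sketch does not model.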