From patchwork Thu May 4 14:51:07 2023
X-Patchwork-Submitter: "Ritesh Harjani (IBM)"
X-Patchwork-Id: 13231290
From: "Ritesh Harjani (IBM)"
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, Matthew Wilcox, Dave Chinner,
 Brian Foster, Ojaswin Mujoo, Disha Goel, "Ritesh Harjani (IBM)"
Subject: [RFCv4 1/3] iomap: Allocate iop in ->write_begin() early
Date: Thu, 4 May 2023 20:21:07 +0530
Message-Id: <06959535927b4278c3ec7a49aef798db6139d095.1683208091.git.ritesh.list@gmail.com>
When the folio is uptodate, we currently allocate the iop only at
writeback time (in iomap_writepage_map()). That is fine today, but once
we add support for subpage size dirty bitmap tracking in the iop, it
would cause a performance problem: if the iop is not allocated during
->write_begin(), we can never mark the necessary dirty bits in the
->write_end() call, and we would instead have to mark all the bits
dirty at writeback time. That would reintroduce the same write
amplification and performance problems we have now (without subpage
dirty bitmap tracking in the iop).

However, for writes whose (pos, len) completely overlap the given
folio, there is still no need to allocate an iop during
->write_begin(), so skip those cases.
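As an illustration (not part of the patch), the "write completely
overlaps the folio" condition reduces to a single range check. Below is
a minimal stand-alone, kernel-style sketch of the same predicate;
write_covers_folio is a hypothetical helper with plain arguments
standing in for folio_pos() and folio_size():

/*
 * Hypothetical sketch: a write [pos, pos + len) dirties the whole folio
 * iff it starts at or before the folio and ends at or after it, in
 * which case no sub-folio state tracking structure is needed.
 */
static bool write_covers_folio(loff_t pos, size_t len,
			       loff_t fpos, size_t fsize)
{
	return pos <= fpos && pos + len >= fpos + fsize;
}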
Signed-off-by: Ritesh Harjani (IBM)
---
 fs/iomap/buffered-io.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7e9..e43821bd1ff5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -562,14 +562,31 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
-	if (folio_test_uptodate(folio))
+	/*
+	 * If the write completely overlaps the current folio, then the
+	 * entire folio will be dirtied, so there is no need for
+	 * sub-folio state tracking structures to be attached to this folio.
+	 */
+
+	if (pos <= folio_pos(folio) &&
+	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
-	folio_clear_error(folio);
 
 	iop = iomap_page_create(iter->inode, folio, iter->flags);
+
+	/*
+	 * If we don't have an iop and nr_blocks > 1, then return -EAGAIN
+	 * here even though the folio may be uptodate, to ensure we add
+	 * sub-folio state tracking structures to this folio.
+	 */
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
 
+	if (folio_test_uptodate(folio))
+		return 0;
+	folio_clear_error(folio);
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
From patchwork Thu May 4 14:51:08 2023
X-Patchwork-Submitter: "Ritesh Harjani (IBM)"
X-Patchwork-Id: 13231291
From: "Ritesh Harjani (IBM)"
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, Matthew Wilcox, Dave Chinner,
 Brian Foster, Ojaswin Mujoo, Disha Goel, "Ritesh Harjani (IBM)",
 Dave Chinner
Subject: [RFCv4 2/3] iomap: Change uptodate variable name to state
Date: Thu, 4 May 2023 20:21:08 +0530
Message-Id: <57994bfd33f6b4dd84adb8ea075a1974d6a5e928.1683208091.git.ritesh.list@gmail.com>

This patch renames the struct iomap_page members uptodate &
uptodate_lock to state and state_lock, to better reflect their purpose
for the upcoming patch. It also introduces accessor functions for
updating the uptodate state bits in the iop->state bitmap. This makes
it easier to see which bitmap type a given code path is operating on.
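For context, an illustration that is not part of the patch: the
nrblocks argument carried by every accessor is i_blocks_per_folio(),
i.e. the number of filesystem blocks covered by the folio, which bounds
the bitmap. A hypothetical sketch of that value (blocks_per_folio is an
invented name):

/*
 * Sketch only: the number of uptodate bits needed in iop->state,
 * e.g. a 64k folio with 4k blocks (blkbits = 12) needs 16 bits.
 */
static unsigned int blocks_per_folio(size_t folio_size, unsigned int blkbits)
{
	return folio_size >> blkbits;	/* 65536 >> 12 == 16 */
}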
Reviewed-by: Dave Chinner
Signed-off-by: Ritesh Harjani (IBM)
---
 fs/iomap/buffered-io.c | 65 ++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e43821bd1ff5..b8b23c859ecf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -25,13 +25,13 @@
 
 /*
  * Structure allocated for each folio when block size < folio size
- * to track sub-folio uptodate status and I/O completions.
+ * to track sub-folio uptodate state and I/O completions.
  */
 struct iomap_page {
 	atomic_t		read_bytes_pending;
 	atomic_t		write_bytes_pending;
-	spinlock_t		uptodate_lock;
-	unsigned long		uptodate[];
+	spinlock_t		state_lock;
+	unsigned long		state[];
 };
 
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
@@ -43,6 +43,38 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+/*
+ * Accessor functions for setting/clearing/checking uptodate bits in
+ * iop->state bitmap.
+ * nrblocks is i_blocks_per_folio() which is passed in every
+ * function as the last argument for API consistency.
+ */
+static inline void iop_set_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start, len);
+}
+
+static inline void iop_clear_range_uptodate(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start, len);
+}
+
+static inline bool iop_test_uptodate(struct iomap_page *iop, unsigned int block,
+				unsigned int nrblocks)
+{
+	return test_bit(block, iop->state);
+}
+
+static inline bool iop_uptodate_full(struct iomap_page *iop,
+				unsigned int nrblocks)
+{
+	return bitmap_full(iop->state, nrblocks);
+}
+
 static struct iomap_page *
 iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 {
@@ -58,12 +90,12 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
 		      gfp);
 	if (iop) {
-		spin_lock_init(&iop->uptodate_lock);
+		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
-			bitmap_fill(iop->uptodate, nr_blocks);
+			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -79,7 +111,7 @@ static void iomap_page_release(struct folio *folio)
 		return;
 	WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
 	WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
-	WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
+	WARN_ON_ONCE(iop_uptodate_full(iop, nr_blocks) !=
			folio_test_uptodate(folio));
 	kfree(iop);
 }
@@ -99,6 +131,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
 	unsigned first = poff >> block_bits;
 	unsigned last = (poff + plen - 1) >> block_bits;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	/*
 	 * If the block size is smaller than the page size, we need to check the
@@ -110,7 +143,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* move forward for each leading block marked uptodate */
 		for (i = first; i <= last; i++) {
-			if (!test_bit(i, iop->uptodate))
+			if (!iop_test_uptodate(iop, i, nr_blocks))
 				break;
 			*pos += block_size;
 			poff += block_size;
@@ -120,7 +153,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 
 		/* truncate len if we find any trailing uptodate block(s) */
 		for ( ; i <= last; i++) {
-			if (test_bit(i, iop->uptodate)) {
+			if (iop_test_uptodate(iop, i, nr_blocks)) {
 				plen -= (last - i + 1) * block_size;
 				last = i - 1;
 				break;
@@ -151,12 +184,13 @@ static void iomap_iop_set_range_uptodate(struct folio *folio,
 	unsigned first = off >> inode->i_blkbits;
 	unsigned last = (off + len - 1) >> inode->i_blkbits;
 	unsigned long flags;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
-	spin_lock_irqsave(&iop->uptodate_lock, flags);
-	bitmap_set(iop->uptodate, first, last - first + 1);
-	if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
+	if (iop_uptodate_full(iop, nr_blocks))
 		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
 }
 
 static void iomap_set_range_uptodate(struct folio *folio,
@@ -439,6 +473,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	struct iomap_page *iop = to_iomap_page(folio);
 	struct inode *inode = folio->mapping->host;
 	unsigned first, last, i;
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
 	if (!iop)
 		return false;
@@ -451,7 +486,7 @@ bool iomap_is_partially_uptodate(struct folio *folio, size_t from, size_t count)
 	last = (from + count - 1) >> inode->i_blkbits;
 
 	for (i = first; i <= last; i++)
-		if (!test_bit(i, iop->uptodate))
+		if (!iop_test_uptodate(iop, i, nr_blocks))
 			return false;
 	return true;
 }
@@ -1652,7 +1687,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !test_bit(i, iop->uptodate))
+		if (iop && !iop_test_uptodate(iop, i, nblocks))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
From patchwork Thu May 4 14:51:09 2023
X-Patchwork-Submitter: "Ritesh Harjani (IBM)"
X-Patchwork-Id: 13231292
From: "Ritesh Harjani (IBM)"
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, Matthew Wilcox, Dave Chinner,
 Brian Foster, Ojaswin Mujoo, Disha Goel, "Ritesh Harjani (IBM)",
 Aravinda Herle
Subject: [RFCv4 3/3] iomap: Support subpage size dirty tracking to improve write performance
Date: Thu, 4 May 2023 20:21:09 +0530
Message-Id: <377c30e7b5f2783dd5be12c59ea703d7c72ba004.1683208091.git.ritesh.list@gmail.com>

On 64k pagesize platforms (especially Power and/or aarch64) with a 4k
filesystem blocksize, this patch should improve write performance by
writing back only the dirty subpages rather than the whole folio. It
should also reduce write amplification, since we can now track subpage
dirty status within the state bitmaps; earlier we had to write the
entire 64k page even if only a small part of it (e.g. 4k) was updated.

Performance testing of the below fio workload reveals a ~16x
performance improvement on nvme with XFS (4k blocksize) on Power (64K
pagesize). FIO-reported write bandwidth scores improved from around
~28 MBps to ~452 MBps.

1.
[global]
ioengine=psync
rw=randwrite
overwrite=1
pre_read=1
direct=0
bs=4k
size=1G
dir=./
numjobs=8
fdatasync=1
runtime=60
iodepth=64
group_reporting=1

[fio-run]

2. Our internal performance team also reported that this patch improves
their database workload performance by around ~83% (with XFS on Power).
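To make the bitmap layout in the patch easier to follow, here is an
illustration that is not part of the patch: iop->state is now allocated
with 2 * nr_blocks bits, where the first nr_blocks bits track per-block
uptodate state and the next nr_blocks bits track per-block dirty state,
which is why the iop_*_dirty() helpers below offset by nrblocks. A
sketch of the bit indexing (uptodate_bit/dirty_bit are hypothetical
names):

/* Sketch only: bit positions within the doubled iop->state bitmap. */
static unsigned int uptodate_bit(unsigned int block, unsigned int nr_blocks)
{
	return block;			/* bits [0, nr_blocks) */
}

static unsigned int dirty_bit(unsigned int block, unsigned int nr_blocks)
{
	return nr_blocks + block;	/* bits [nr_blocks, 2 * nr_blocks) */
}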
Reported-by: Aravinda Herle
Reported-by: Brian Foster
Signed-off-by: Ritesh Harjani (IBM)
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 175 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b8b23c859ecf..52c9703ff262 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -34,6 +34,11 @@ struct iomap_page {
 	unsigned long		state[];
 };
 
+enum iop_state {
+	IOP_STATE_UPDATE = 0,
+	IOP_STATE_DIRTY = 1
+};
+
 static inline struct iomap_page *to_iomap_page(struct folio *folio)
 {
 	if (folio_test_private(folio))
@@ -44,8 +49,8 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 static struct bio_set iomap_ioend_bioset;
 
 /*
- * Accessor functions for setting/clearing/checking uptodate bits in
- * iop->state bitmap.
+ * Accessor functions for setting/clearing/checking uptodate and
+ * dirty bits in iop->state bitmap.
  * nrblocks is i_blocks_per_folio() which is passed in every
  * function as the last argument for API consistency.
  */
@@ -75,8 +80,29 @@ static inline bool iop_uptodate_full(struct iomap_page *iop,
 	return bitmap_full(iop->state, nrblocks);
 }
 
+static inline void iop_set_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_set(iop->state, start + nrblocks, len);
+}
+
+static inline void iop_clear_range_dirty(struct iomap_page *iop,
+				unsigned int start, unsigned int len,
+				unsigned int nrblocks)
+{
+	bitmap_clear(iop->state, start + nrblocks, len);
+}
+
+static inline bool iop_test_dirty(struct iomap_page *iop, unsigned int block,
+				unsigned int nrblocks)
+{
+	return test_bit(block + nrblocks, iop->state);
+}
+
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags,
+		bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -90,12 +116,21 @@ iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-filesystem block uptodate
+	 * and the second tracks per-filesystem block dirty
+	 * state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range_uptodate(iop, 0, nr_blocks, nr_blocks);
+		if (is_dirty)
+			iop_set_range_dirty(iop, 0, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -177,29 +212,62 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }
 
-static void iomap_iop_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+static void iomap_iop_set_range(struct folio *folio, struct iomap_page *iop,
+		size_t off, size_t len, enum iop_state state)
 {
 	struct inode *inode = folio->mapping->host;
-	unsigned first = off >> inode->i_blkbits;
-	unsigned last = (off + len - 1) >> inode->i_blkbits;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
 	unsigned long flags;
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
 
-	spin_lock_irqsave(&iop->state_lock, flags);
-	iop_set_range_uptodate(iop, first, last - first + 1, nr_blocks);
-	if (iop_uptodate_full(iop, nr_blocks))
-		folio_mark_uptodate(folio);
-	spin_unlock_irqrestore(&iop->state_lock, flags);
+	switch (state) {
+	case IOP_STATE_UPDATE:
+		if (!iop) {
+			folio_mark_uptodate(folio);
+			return;
+		}
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_set_range_uptodate(iop, first_blk, nr_blks, blks_per_folio);
+		if (iop_uptodate_full(iop, blks_per_folio))
+			folio_mark_uptodate(folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	case IOP_STATE_DIRTY:
+		if (!iop)
+			return;
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_set_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	}
 }
 
-static void iomap_set_range_uptodate(struct folio *folio,
-		struct iomap_page *iop, size_t off, size_t len)
+static void iomap_iop_clear_range(struct folio *folio,
+		struct iomap_page *iop, size_t off, size_t len,
+		enum iop_state state)
 {
-	if (iop)
-		iomap_iop_set_range_uptodate(folio, iop, off, len);
-	else
-		folio_mark_uptodate(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	switch (state) {
+	case IOP_STATE_UPDATE:
+		// Never gets called so not implemented
+		WARN_ON(1);
+		break;
+	case IOP_STATE_DIRTY:
+		if (!iop)
+			return;
+		spin_lock_irqsave(&iop->state_lock, flags);
+		iop_clear_range_dirty(iop, first_blk, nr_blks, blks_per_folio);
+		spin_unlock_irqrestore(&iop->state_lock, flags);
+		break;
+	}
 }
 
 static void iomap_finish_folio_read(struct folio *folio, size_t offset,
@@ -211,7 +279,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 		folio_clear_uptodate(folio);
 		folio_set_error(folio);
 	} else {
-		iomap_set_range_uptodate(folio, iop, offset, len);
+		iomap_iop_set_range(folio, iop, offset, len, IOP_STATE_UPDATE);
 	}
 
 	if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -265,7 +333,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iomap_page_create(iter->inode, folio, iter->flags);
+		iop = iomap_page_create(iter->inode, folio, iter->flags,
+					folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);
 
@@ -273,7 +342,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	memcpy(addr, iomap->inline_data, size);
 	memset(addr + size, 0, PAGE_SIZE - poff - size);
 	kunmap_local(addr);
-	iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
+	iomap_iop_set_range(folio, iop, offset, PAGE_SIZE - poff,
+			    IOP_STATE_UPDATE);
 	return 0;
 }
 
@@ -303,14 +373,15 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
 
 	if (iomap_block_needs_zeroing(iter, pos)) {
 		folio_zero_range(folio, poff, plen);
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range(folio, iop, poff, plen, IOP_STATE_UPDATE);
 		goto done;
 	}
 
@@ -559,6 +630,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	unsigned int nr_blocks = i_blocks_per_folio(mapping->host, folio);
+	struct iomap_page *iop;
+
+	iop = iomap_page_create(mapping->host, folio, 0, false);
+	iomap_iop_set_range(folio, iop, 0,
+			nr_blocks << mapping->host->i_blkbits, IOP_STATE_DIRTY);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -607,7 +690,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
 
-	iop = iomap_page_create(iter->inode, folio, iter->flags);
+	iop = iomap_page_create(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 
 	/*
 	 * If we don't have an iop and nr_blocks > 1 then return -EAGAIN here
@@ -648,7 +732,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (status)
 				return status;
 		}
-		iomap_set_range_uptodate(folio, iop, poff, plen);
+		iomap_iop_set_range(folio, iop, poff, plen, IOP_STATE_UPDATE);
 	} while ((block_start += plen) < block_end);
 
 	return 0;
@@ -771,7 +855,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	 */
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
-	iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
+	iomap_iop_set_range(folio, iop, offset_in_folio(folio, pos), copied,
+			    IOP_STATE_UPDATE);
+	iomap_iop_set_range(folio, iop, offset_in_folio(folio, pos), copied,
+			    IOP_STATE_DIRTY);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -959,6 +1046,12 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio	*folio;
+		size_t first, last;
+		loff_t end;
+		unsigned int i;
+		struct iomap_page *iop;
+		u8 blkbits = inode->i_blkbits;
+		unsigned int nr_blocks;
 
 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -983,6 +1076,26 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 			}
 		}
 
+		/*
+		 * When we have subfolio dirty tracking, there can be
+		 * subblocks within a folio which are marked uptodate
+		 * but not dirty. In that case it is necessary to punch
+		 * out such blocks to avoid leaking delalloc blocks.
+		 */
+		iop = to_iomap_page(folio);
+		if (!iop)
+			goto skip_iop_punch;
+		end = min_t(loff_t, end_byte - 1,
+			(folio_next_index(folio) << PAGE_SHIFT) - 1);
+		first = offset_in_folio(folio, start_byte) >> blkbits;
+		last = offset_in_folio(folio, end) >> blkbits;
+		nr_blocks = i_blocks_per_folio(inode, folio);
+		for (i = first; i <= last; i++) {
+			if (!iop_test_dirty(iop, i, nr_blocks))
+				punch(inode, i << blkbits,
+					     1 << blkbits);
+		}
+skip_iop_punch:
 		/*
 		 * Make sure the next punch start is correctly bound to
 		 * the end of this data range, not the end of the folio.
@@ -1671,7 +1784,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iomap_page_create(inode, folio, 0);
+	struct iomap_page *iop = iomap_page_create(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1687,7 +1800,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_uptodate(iop, i, nblocks))
+		if (iop && !iop_test_dirty(iop, i, nblocks))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1731,6 +1844,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}
 
+	iomap_iop_clear_range(folio, iop, 0, end_pos - folio_pos(folio),
+			      IOP_STATE_DIRTY);
 	folio_start_writeback(folio);
 	folio_unlock(folio);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio = xfs_vm_read_folio,
 	.readahead = xfs_vm_readahead,
 	.writepages = xfs_vm_writepages,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio = zonefs_read_folio,
 	.readahead = zonefs_readahead,
 	.writepages = zonefs_writepages,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.migrate_folio = filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..0c2bee80565c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,