From patchwork Tue Feb 28 14:59:40 2017
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 9595943
Date: Tue, 28 Feb 2017 09:59:40 -0500
From: Brian Foster
To: Christoph Hellwig
Cc: Xiong Zhou, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: LTP write03 writev07 xfs failures
Message-ID: <20170228145940.GA13377@bfoster.bfoster>
References: <20170227042220.syhudcilvyettwnf@XZHOUW.usersys.redhat.com>
	<20170227160901.GB9344@bfoster.bfoster>
	<20170227201334.GC9344@bfoster.bfoster>
	<20170228140455.GB9371@infradead.org>
In-Reply-To: <20170228140455.GB9371@infradead.org>
List-ID: linux-fsdevel@vger.kernel.org

On Tue, Feb 28, 2017 at 06:04:55AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 27, 2017 at 03:13:35PM -0500, Brian Foster wrote:
> > After playing around a bit, I don't think using i_size is the right
> > approach either. It just exacerbates the original problem on buffered
> > writes into sparse files. We can end up leaving around however many
> > delalloc blocks we've allocated.
> >
> > I think we need a way to differentiate preexisting (previously written)
> > delalloc blocks from those allocated and unused by the current write.
> > We might be able to do that by looking at the pagecache, but I think that
> > means looking at the buffer state to make sure we handle sub-page block
> > sizes correctly. I.e., make *_iomap_end_delalloc() punch out all
> > delalloc blocks in the non-written range that are either not page backed
> > or not dirty+delalloc buffer backed. Hm?
>
> That sounds ugly, but right off my mind I see no other way. I'll need
> to take a look at what the old pre-iomap code did there, as I think
> none of these issues happened there.

Heh. I've appended what I'm currently playing around with. It's certainly
uglier, but not terrible IMO (outside of the fact that we have to look at
the buffer_heads). This seems to address the problem, but it's still only
lightly tested...

An entirely different approach would be to differentiate newly allocated
delalloc blocks from "found" delalloc blocks in the iomap_begin() handler,
and then encode that into the iomap such that the iomap_end() handler has
an explicit record of what to punch (a rough sketch of that idea is
appended after the patch).

Personally, I wouldn't mind doing something like the below in the short
term to fix the regression and then incorporating an iomap enhancement to
break the buffer_head dependency.

Brian

---8<---

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 41662fb..5761dc6 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1066,6 +1066,30 @@ xfs_file_iomap_begin(
 	return error;
 }
 
+/* grab the bh for a page offset */
+static struct buffer_head *
+bh_for_pgoff(
+	struct page		*page,
+	loff_t			offset)
+{
+	struct buffer_head	*bh, *head;
+
+	ASSERT(offset < PAGE_SIZE);
+	ASSERT(page_has_buffers(page));
+
+	bh = head = page_buffers(page);
+	do {
+		struct buffer_head *next = bh->b_this_page;
+		if (next == head)
+			break;
+		if (bh_offset(next) > offset)
+			break;
+		bh = next;
+	} while (true);
+
+	return bh;
+}
+
 static int
 xfs_file_iomap_end_delalloc(
 	struct xfs_inode	*ip,
@@ -1074,13 +1098,21 @@ xfs_file_iomap_end_delalloc(
 	ssize_t			written)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	struct address_space	*mapping = VFS_I(ip)->i_mapping;
+	struct page		*page;
+	struct buffer_head	*bh;
 	xfs_fileoff_t		start_fsb;
 	xfs_fileoff_t		end_fsb;
 	int			error = 0;
 
-	/* behave as if the write failed if drop writes is enabled */
-	if (xfs_mp_drop_writes(mp))
+	/*
+	 * Behave as if the write failed if drop writes is enabled. Punch out
+	 * the pagecache to trigger delalloc cleanup.
+	 */
+	if (xfs_mp_drop_writes(mp)) {
 		written = 0;
+		truncate_pagecache_range(VFS_I(ip), offset, offset + length);
+	}
 
 	/*
 	 * start_fsb refers to the first unused block after a short write. If
@@ -1094,22 +1126,29 @@ xfs_file_iomap_end_delalloc(
 	end_fsb = XFS_B_TO_FSB(mp, offset + length);
 
 	/*
-	 * Trim back delalloc blocks if we didn't manage to write the whole
-	 * range reserved.
+	 * We have to clear out any unused delalloc blocks in the event of a
+	 * failed or short write. Otherwise, these blocks linger indefinitely as
+	 * they are not fronted by dirty pagecache.
 	 *
-	 * We don't need to care about racing delalloc as we hold i_mutex
-	 * across the reserve/allocate/unreserve calls. If there are delalloc
-	 * blocks in the range, they are ours.
+	 * To filter out blocks that were successfully written by a previous
+	 * write, walk the unwritten range and only punch out blocks that are
+	 * not backed by dirty+delalloc buffers.
 	 */
-	if (start_fsb < end_fsb) {
-		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
-					 XFS_FSB_TO_B(mp, end_fsb) - 1);
+	for (; start_fsb < end_fsb; start_fsb++) {
+		offset = XFS_FSB_TO_B(mp, start_fsb);
+		page = find_get_page(mapping, offset >> PAGE_SHIFT);
+		if (page) {
+			bh = bh_for_pgoff(page, offset & ~PAGE_MASK);
+			if (buffer_dirty(bh) && buffer_delay(bh)) {
+				put_page(page);
+				continue;
+			}
+			put_page(page);
+		}
 
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
-		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
-						      end_fsb - start_fsb);
+		error = xfs_bmap_punch_delalloc_range(ip, start_fsb, 1);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
 		if (error && !XFS_FORCED_SHUTDOWN(mp)) {
 			xfs_alert(mp, "%s: unable to clean up ino %lld",
 				__func__, ip->i_ino);
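
For reference, here's a minimal sketch of the iomap_begin()/iomap_end()
idea above, assuming a hypothetical IOMAP_F_NEW_DELALLOC flag that
xfs_file_iomap_begin() would set after a successful
xfs_bmapi_reserve_delalloc(). The flag name and the simplified start/end
math are made up for illustration; ->iomap_end() already receives the
struct iomap, so xfs_file_iomap_end() would just need to pass it down.
Completely untested:

/*
 * Hypothetical flag: set by ->iomap_begin() when it allocates fresh
 * delalloc blocks, as opposed to finding preexisting ones left over
 * from an earlier write. Illustrative name/value only.
 */
#define IOMAP_F_NEW_DELALLOC	(1 << 0)

static int
xfs_file_iomap_end_delalloc(
	struct xfs_inode	*ip,
	loff_t			offset,
	loff_t			length,
	ssize_t			written,
	struct iomap		*iomap)
{
	struct xfs_mount	*mp = ip->i_mount;
	xfs_fileoff_t		start_fsb;
	xfs_fileoff_t		end_fsb;
	int			error = 0;

	/*
	 * Preexisting delalloc blocks are fronted by dirty pagecache from
	 * a previous write, so leave them alone. Only blocks this write
	 * allocated itself are candidates for cleanup.
	 */
	if (!(iomap->flags & IOMAP_F_NEW_DELALLOC))
		return 0;

	/* simplified: punch from the first block not covered by the write */
	start_fsb = XFS_B_TO_FSB(mp, offset + written);
	end_fsb = XFS_B_TO_FSB(mp, offset + length);

	if (start_fsb < end_fsb) {
		truncate_pagecache_range(VFS_I(ip), XFS_FSB_TO_B(mp, start_fsb),
					 XFS_FSB_TO_B(mp, end_fsb) - 1);

		xfs_ilock(ip, XFS_ILOCK_EXCL);
		error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
						      end_fsb - start_fsb);
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
	}

	return error;
}

With the flag in place there's no need to inspect page or buffer state at
all: if iomap_begin() didn't allocate anything, iomap_end() has nothing to
punch, no matter how short the write was.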