From patchwork Mon Apr 6 20:18:27 2015
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 6165361
From: Tejun Heo
To: axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, jack@suse.cz, hch@infradead.org,
    hannes@cmpxchg.org, linux-fsdevel@vger.kernel.org, vgoyal@redhat.com,
    lizefan@huawei.com, cgroups@vger.kernel.org, linux-mm@kvack.org,
    mhocko@suse.cz, clm@fb.com, fengguang.wu@intel.com, david@fromorbit.com,
    gthelen@google.com, Tejun Heo
Subject: [PATCH 09/10] writeback: implement foreign cgroup inode bdi_writeback switching
Date: Mon, 6 Apr 2015 16:18:27 -0400
Message-Id: <1428351508-8399-10-git-send-email-tj@kernel.org>
X-Mailer: git-send-email 2.1.0
In-Reply-To: <1428351508-8399-1-git-send-email-tj@kernel.org>
References: <1428351508-8399-1-git-send-email-tj@kernel.org>

As concurrent write sharing of an inode is expected to be very rare and
memcg only tracks page ownership on a first-use basis, severely confining
the usefulness of such sharing, cgroup writeback tracks ownership
per-inode.
While the support for concurrent write sharing of an inode is deemed
unnecessary, an inode being written to by different cgroups at different
points in time is a lot more common, and, more importantly, charging only
by first-use can too readily lead to grossly incorrect behaviors (a single
foreign page can cause gigabytes of writeback to be incorrectly
attributed).

To resolve this issue, cgroup writeback detects the majority dirtier of
an inode and transfers the ownership to it.  The previous patches
implemented the foreign condition detection mechanism and laid the
groundwork.  This patch implements the actual switching.

With the previously implemented [unlocked_]inode_to_wb_and_list_lock()
and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
mapping->tree_lock gives us full exclusion against all wb operations on
the target inode.  inode_switch_wbs_work_fn() grabs all the locks and
transfers the inode atomically along with its RECLAIMABLE and WRITEBACK
stats.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Jan Kara
Cc: Wu Fengguang
Cc: Greg Thelen
---
 fs/fs-writeback.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 633d9d5..74b7a99 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -309,30 +309,112 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	struct inode_switch_wbs_context *isw =
 		container_of(work, struct inode_switch_wbs_context, work);
 	struct inode *inode = isw->inode;
+	struct address_space *mapping = inode->i_mapping;
+	struct bdi_writeback *old_wb = inode->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
+	struct radix_tree_iter iter;
+	bool switched = false;
+	void **slot;
 
 	/*
 	 * By the time control reaches here, RCU grace period has passed
 	 * since I_WB_SWITCH assertion and all wb stat update transactions
 	 * between unlocked_inode_to_wb_begin/end() are guaranteed to be
	 * synchronizing against mapping->tree_lock.
+	 *
+	 * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock
+	 * gives us exclusion against all wb related operations on @inode
+	 * including IO list manipulations and stat updates.
 	 */
+	if (old_wb < new_wb) {
+		spin_lock(&old_wb->list_lock);
+		spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock(&new_wb->list_lock);
+		spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
+	}
 	spin_lock(&inode->i_lock);
+	spin_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * Once I_FREEING is visible under i_lock, the eviction path owns
+	 * the inode and we shouldn't modify ->i_wb_list.
+	 */
+	if (unlikely(inode->i_state & I_FREEING))
+		goto skip_switch;
 
+	/*
+	 * Count and transfer stats.  Note that PAGECACHE_TAG_DIRTY points
+	 * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
+	 * pages actually under writeback.
+	 */
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_DIRTY) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page) && PageDirty(page)) {
+			__dec_wb_stat(old_wb, WB_RECLAIMABLE);
+			__inc_wb_stat(new_wb, WB_RECLAIMABLE);
+		}
+	}
+
+	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+				   PAGECACHE_TAG_WRITEBACK) {
+		struct page *page = radix_tree_deref_slot_protected(slot,
+							&mapping->tree_lock);
+		if (likely(page)) {
+			WARN_ON_ONCE(!PageWriteback(page));
+			__dec_wb_stat(old_wb, WB_WRITEBACK);
+			__inc_wb_stat(new_wb, WB_WRITEBACK);
+		}
+	}
+
+	wb_get(new_wb);
+
+	/*
+	 * Transfer to @new_wb's IO list if necessary.  The specific list
+	 * @inode was on is ignored and the inode is put on ->b_dirty which
+	 * is always correct including from ->b_dirty_time.  The transfer
+	 * preserves @inode->dirtied_when ordering.
+	 */
+	if (!list_empty(&inode->i_wb_list)) {
+		struct inode *pos;
+
+		inode_wb_list_del_locked(inode, old_wb);
+		inode->i_wb = new_wb;
+		list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list)
+			if (time_after_eq(inode->dirtied_when,
+					  pos->dirtied_when))
+				break;
+		inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev);
+	} else {
+		inode->i_wb = new_wb;
+	}
+
+	/* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */
 	inode->i_wb_frn_winner = 0;
 	inode->i_wb_frn_avg_time = 0;
 	inode->i_wb_frn_history = 0;
-
+	switched = true;
+skip_switch:
 	/*
 	 * Paired with load_acquire in unlocked_inode_to_wb_begin() and
 	 * ensures that the new wb is visible if they see !I_WB_SWITCH.
 	 */
 	smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);
 
+	spin_unlock_irq(&mapping->tree_lock);
 	spin_unlock(&inode->i_lock);
+	spin_unlock(&new_wb->list_lock);
+	spin_unlock(&old_wb->list_lock);
 
-	iput(inode);
+	if (switched) {
+		wb_wakeup(new_wb);
+		wb_put(old_wb);
+	}
 	wb_put(new_wb);
+
+	iput(inode);
 	kfree(isw);
 }
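
A note on the double locking above, for readers following along: when two
locks of the same class (old_wb->list_lock and new_wb->list_lock) must be
held at once, acquiring them in a globally consistent order, here by
ascending pointer value, is what keeps two concurrent switches between the
same pair of wbs from deadlocking against each other.  Below is a minimal
userspace sketch of that pattern using pthread mutexes; struct wb,
lock_two_wbs() and unlock_two_wbs() are made-up names for illustration
only, not kernel API.

#include <pthread.h>
#include <stdio.h>

/* stand-in for bdi_writeback; only the lock matters for this sketch */
struct wb {
	pthread_mutex_t list_lock;
};

/*
 * Acquire both list_locks in a consistent global order (by address),
 * mirroring the "if (old_wb < new_wb)" dance in inode_switch_wbs_work_fn().
 * Whoever runs first, the lower-addressed wb is always locked first, so
 * two tasks switching inodes in opposite directions cannot deadlock.
 */
static void lock_two_wbs(struct wb *a, struct wb *b)
{
	if (a < b) {
		pthread_mutex_lock(&a->list_lock);
		pthread_mutex_lock(&b->list_lock);
	} else {
		pthread_mutex_lock(&b->list_lock);
		pthread_mutex_lock(&a->list_lock);
	}
}

static void unlock_two_wbs(struct wb *a, struct wb *b)
{
	pthread_mutex_unlock(&a->list_lock);
	pthread_mutex_unlock(&b->list_lock);
}

int main(void)
{
	struct wb old_wb, new_wb;

	pthread_mutex_init(&old_wb.list_lock, NULL);
	pthread_mutex_init(&new_wb.list_lock, NULL);

	lock_two_wbs(&old_wb, &new_wb);
	/* ... transfer the inode and its stats here ... */
	unlock_two_wbs(&old_wb, &new_wb);

	printf("both list_locks taken and released in a safe order\n");
	return 0;
}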
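
Similarly, the ->b_dirty splice relies on the list staying sorted by
->dirtied_when (most recently dirtied towards the head), so a migrated
inode is walked into position rather than blindly appended.  A rough
standalone illustration of that insertion policy follows; struct dnode and
the dlist helpers are hypothetical stand-ins for the inode and the
->b_dirty list, and a plain integer comparison replaces time_after_eq().

#include <stdio.h>

/* hypothetical stand-in: a dnode plays the role of an inode on ->b_dirty */
struct dnode {
	long dirtied_when;
	struct dnode *prev, *next;
};

/* circular doubly linked list with a sentinel head; newest entries first */
struct dlist {
	struct dnode head;
};

static void dlist_init(struct dlist *l)
{
	l->head.prev = l->head.next = &l->head;
}

static void dnode_insert_before(struct dnode *n, struct dnode *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

/*
 * Mirror of the transfer loop: walk from the newest end, stop at the first
 * entry that is not newer than @n and splice @n in before it.  Falling off
 * the end inserts at the tail, i.e. @n becomes the oldest entry.
 */
static void dlist_insert_sorted(struct dlist *l, struct dnode *n)
{
	struct dnode *pos;

	for (pos = l->head.next; pos != &l->head; pos = pos->next)
		if (n->dirtied_when >= pos->dirtied_when)
			break;
	dnode_insert_before(n, pos);
}

int main(void)
{
	struct dlist dirty;
	struct dnode a = { .dirtied_when = 100 };
	struct dnode b = { .dirtied_when = 300 };
	struct dnode migrated = { .dirtied_when = 200 };
	struct dnode *pos;

	dlist_init(&dirty);
	dlist_insert_sorted(&dirty, &a);
	dlist_insert_sorted(&dirty, &b);
	dlist_insert_sorted(&dirty, &migrated);	/* lands between b and a */

	for (pos = dirty.head.next; pos != &dirty.head; pos = pos->next)
		printf("%ld\n", pos->dirtied_when);	/* prints 300 200 100 */
	return 0;
}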