From patchwork Tue Feb 28 02:49:57 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tahsin Erdogan X-Patchwork-Id: 9594605 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id ECCC1600CB for ; Tue, 28 Feb 2017 03:22:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E0045284F3 for ; Tue, 28 Feb 2017 03:22:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D352828501; Tue, 28 Feb 2017 03:22:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 65104284F3 for ; Tue, 28 Feb 2017 03:22:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751950AbdB1DWW (ORCPT ); Mon, 27 Feb 2017 22:22:22 -0500 Received: from mail-pf0-f178.google.com ([209.85.192.178]:34364 "EHLO mail-pf0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751505AbdB1DWU (ORCPT ); Mon, 27 Feb 2017 22:22:20 -0500 Received: by mail-pf0-f178.google.com with SMTP id p185so12725765pfb.1 for ; Mon, 27 Feb 2017 19:22:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=from:to:cc:subject:date:message-id; bh=tdhrhI/qsvtaJh5/4kWnpQ/TS3WkmeSvewMQdqqFeMs=; b=jmc7zFoiDTY9tJ4QMVE73fo5hD02QXtsYQIJ7lDw8Jgjij2u+m7DQew2ADeFVFGhzC u7kr2p9QZF/uFIPlIEhBINHeZsqO65BRd1SNhw7u71+t2Yle4p16mvAbBum2GuGDZ9cD wOXO3YyFo9Ja579LWXKjLc3xyVGZHz1FxqFfpT5adYVkvDXR3WYF0l2s7K1BeIwHIuha 0t+7O+dknPVqNRJ91fKvtyPzzoL2bxbiVUHOoPG5W2d0Vsvv7FYCdabPlnCGtbpfrfMD +gpXloByK6CLduy/XV9sICnaSwHO/mRaWGytP2Mkazp7GO4JeDNkY8rxuymktU8MtNj4 04fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=tdhrhI/qsvtaJh5/4kWnpQ/TS3WkmeSvewMQdqqFeMs=; b=QFsmezFbZaj9c6J1zSY3z7Hq0DIf7ojOAgeADfGloNaMd/BAGmQvjDmNaNaWnAHkVP QuB7719vxQhkdqyRhYhlLySQTgO6jdoyXf5UScdF3G79bD1ISdxH3tYNm+xLu2JWtf5Q HMIAzFKN0GXWxRqIWDbrapJFr35H9J4y5e14gVIx0fn3IOhulaWfdftzt4ePMfFBHZCx /SbQCUnNwTNtcwhqpFZo9FlsCdWQl21VXVY3UQi6d+aPzwTvI9whCsRimu+vdBdUfNe/ YpaqtWi5UjbhlIlSSIpOuQynug1gsvV5JF1XmOAQDWljx2/qCXDqDJrOtxrygRoIZDoj P34g== X-Gm-Message-State: AMke39kbGnts6rmYmPnrzxZaEaz5DkSU+CGtg55vk9bGV1bu5XkcVV+biIsJkKGtzbUREzEE X-Received: by 10.99.98.6 with SMTP id w6mr7031234pgb.223.1488250230726; Mon, 27 Feb 2017 18:50:30 -0800 (PST) Received: from tahsin1.mtv.corp.google.com ([100.99.140.90]) by smtp.gmail.com with ESMTPSA id u28sm346642pgo.20.2017.02.27.18.50.29 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Mon, 27 Feb 2017 18:50:30 -0800 (PST) From: Tahsin Erdogan To: Tejun Heo , Jens Axboe Cc: linux-block@vger.kernel.org, David Rientjes , linux-kernel@vger.kernel.org, Tahsin Erdogan Subject: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock Date: Mon, 27 Feb 2017 18:49:57 -0800 Message-Id: <20170228024957.4314-1-tahsin@google.com> X-Mailer: git-send-email 2.11.0.483.g087da7b7c-goog Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Do struct blkcg_gq allocations outside the queue spinlock to allow blocking during memory allocations. Suggested-by: Tejun Heo Signed-off-by: Tahsin Erdogan --- block/blk-cgroup.c | 108 ++++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 82 insertions(+), 26 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 295e98c2c8cc..8ec95f333bc8 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -788,6 +788,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, __acquires(rcu) __acquires(disk->queue->queue_lock) { struct gendisk *disk; + struct request_queue *q; struct blkcg_gq *blkg; struct module *owner; unsigned int major, minor; @@ -806,44 +807,99 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, if (!disk) return -ENODEV; if (part) { - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - return -ENODEV; + ret = -ENODEV; + goto fail; + } + + q = disk->queue; + + if (!blkcg_policy_enabled(q, pol)) { + ret = -EOPNOTSUPP; + goto fail; } rcu_read_lock(); - spin_lock_irq(disk->queue->queue_lock); + spin_lock_irq(q->queue_lock); - if (blkcg_policy_enabled(disk->queue, pol)) - blkg = blkg_lookup_create(blkcg, disk->queue); - else - blkg = ERR_PTR(-EOPNOTSUPP); + /* + * This could be the first entry point of blkcg implementation and + * we shouldn't allow anything to go through for a bypassing queue. + */ + if (unlikely(blk_queue_bypass(q))) { + ret = blk_queue_dying(q) ? -ENODEV : -EBUSY; + goto fail_unlock; + } - if (IS_ERR(blkg)) { - ret = PTR_ERR(blkg); + blkg = __blkg_lookup(blkcg, q, true); + if (blkg) + goto success; + + /* + * Create blkgs walking down from blkcg_root to @blkcg, so that all + * non-root blkgs have access to their parents. + */ + while (true) { + struct blkcg *pos = blkcg; + struct blkcg *parent; + struct blkcg_gq *new_blkg; + + parent = blkcg_parent(blkcg); + while (parent && !__blkg_lookup(parent, q, false)) { + pos = parent; + parent = blkcg_parent(parent); + } + + spin_unlock_irq(q->queue_lock); rcu_read_unlock(); - spin_unlock_irq(disk->queue->queue_lock); - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - /* - * If queue was bypassing, we should retry. Do so after a - * short msleep(). It isn't strictly necessary but queue - * can be bypassing for some time and it's always nice to - * avoid busy looping. - */ - if (ret == -EBUSY) { - msleep(10); - ret = restart_syscall(); + + new_blkg = blkg_alloc(pos, q, GFP_KERNEL); + if (unlikely(!new_blkg)) { + ret = -ENOMEM; + goto fail; + } + + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + + /* Lookup again since we dropped the lock for blkg_alloc(). */ + blkg = __blkg_lookup(pos, q, false); + if (blkg) { + blkg_free(new_blkg); + } else { + blkg = blkg_create(pos, q, new_blkg); + if (unlikely(IS_ERR(blkg))) { + ret = PTR_ERR(blkg); + goto fail_unlock; + } } - return ret; - } + if (pos == blkcg) + goto success; + } +success: ctx->disk = disk; ctx->blkg = blkg; ctx->body = body; return 0; + +fail_unlock: + spin_unlock_irq(q->queue_lock); + rcu_read_unlock(); +fail: + owner = disk->fops->owner; + put_disk(disk); + module_put(owner); + /* + * If queue was bypassing, we should retry. Do so after a + * short msleep(). It isn't strictly necessary but queue + * can be bypassing for some time and it's always nice to + * avoid busy looping. + */ + if (ret == -EBUSY) { + msleep(10); + ret = restart_syscall(); + } + return ret; } EXPORT_SYMBOL_GPL(blkg_conf_prep);