From patchwork Thu Jun 29 16:25:36 2017
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 9819517
X-Patchwork-Delegate: snitzer@redhat.com
From: Ming Lei
Date: Fri, 30 Jun 2017 00:25:36 +0800
To: Jens Axboe
In-Reply-To: <7f0a852e-5f90-4c63-9a43-a4180557530c@kernel.dk>
References: <20170628211010.4C8C9124035@b01ledav002.gho.pok.ibm.com> <7f0a852e-5f90-4c63-9a43-a4180557530c@kernel.dk>
Cc: Brian King, linux-block, "open list:DEVICE-MAPPER (LVM)", Alasdair Kergon, Mike Snitzer
Subject: Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
List-Id: device-mapper development

On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>> of atomics from the hot path. When running this on a Power system, to
>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>> 80 fio jobs, I saw IOPS go from 1.5M IO/s to 11.4M IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured?
>>> When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS decrease to 10% of the
>> non-RQF_IO_STAT case on a dual socket ARM64 (each CPU has
>> 96 cores, and dual NUMA nodes) too. The performance could
>> basically be recovered if a per numa-node counter was introduced
>> and used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on the current
>> block tree and post it out. I guess the performance issue might be
>> related to the system's cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>> http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would

Last time, on ARM64, I remember that the IOPS was basically recovered,
but now I don't have such a machine to test on. Could Brian test the
attached patch to see if it works on the big Power machine? The idea is
simple: just make the atomic counter per-node.

> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...

One observation I saw on an arm64 dual socket system before is that
atomic inc/dec on a counter stored in the local NUMA node is much
cheaper than cross-node, which is why I tried the per-node counter. And
wrt. the in-flight atomic counter, both inc and dec should happen on
CPUs belonging to the same NUMA node in the blk-mq case.
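The per-node counter idea described above can be sketched in plain userspace C. This is a simplified model, not the kernel patch: the node count and the cpu-to-node mapping are made-up stand-ins for the kernel's nr_node_ids and cpu_to_node().

```c
/*
 * Userspace sketch of a per-NUMA-node counter: increments and
 * decrements touch only the "local" node's atomic, reads sum all
 * nodes.  NR_NODES and cpu_to_node() here are illustrative stand-ins
 * for the kernel's topology information.
 */
#include <assert.h>
#include <stdatomic.h>

#define NR_NODES 2

struct node_counter {
	atomic_long counter_in_node;
};

static struct node_counter nodes[NR_NODES];

/* stand-in for the kernel's cpu_to_node(): even cpus -> node 0, odd -> node 1 */
static int cpu_to_node(int cpu)
{
	return cpu % NR_NODES;
}

/* hot path: single atomic op on node-local memory */
static void pnode_inc(int cpu)
{
	atomic_fetch_add(&nodes[cpu_to_node(cpu)].counter_in_node, 1);
}

static void pnode_dec(int cpu)
{
	atomic_fetch_sub(&nodes[cpu_to_node(cpu)].counter_in_node, 1);
}

/* slow path: iterate nodes (far fewer than possible CPUs) */
static long pnode_read_all(void)
{
	long sum = 0;

	for (int i = 0; i < NR_NODES; i++)
		sum += atomic_load(&nodes[i].counter_in_node);
	return sum;
}
```

The point of the trade-off is visible in the read side: the summation loop runs over nodes rather than over every possible CPU, which is the reading-cost objection raised against fully percpu counters.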
>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between the cost of incrementing the inflight
> counter, and the cost of looping to read it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be an issue for the non-mq path, and for the blk-mq path,
>> maybe we can use some blk-mq knowledge (tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and it looks a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that: just iterate the tags and
> count them. The fast path only cares about the total count.
>
> Let me try that out real quick.
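The tag-based approach outlined above — deriving the inflight count from the weight of the tag bitmap instead of maintaining a separate counter — can be illustrated with a plain word-array bitmap standing in for the kernel's sbitmap and sbitmap_weight(); the sizes here are arbitrary.

```c
/*
 * Sketch of counting inflight requests from driver tags: each
 * in-flight request owns a tag bit, so the total inflight count is
 * just the bitmap's weight (a popcount).  Nothing extra is maintained
 * in the submit/complete hot path beyond the tag itself.
 */
#include <assert.h>
#include <stdint.h>

#define TAG_WORDS 4			/* 4 x 64 = 256 tags */

static uint64_t tag_map[TAG_WORDS];

/* request dispatch: allocate (set) a tag */
static void tag_set(unsigned int tag)
{
	tag_map[tag / 64] |= 1ULL << (tag % 64);
}

/* request completion: release (clear) the tag */
static void tag_clear(unsigned int tag)
{
	tag_map[tag / 64] &= ~(1ULL << (tag % 64));
}

/* total inflight = weight of the tag bitmap, cf. sbitmap_weight() */
static int tag_weight(void)
{
	int w = 0;

	for (int i = 0; i < TAG_WORDS; i++)
		w += __builtin_popcountll(tag_map[i]);
	return w;
}
```

As the quoted discussion notes, the weight only yields the total; splitting it into separate read and write counts would need a slower walk over the tags to inspect each request's direction.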
>
> --
> Jens Axboe

Thanks,
Ming Lei

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8u %8u\n",
+		pnode_counter_read_all(&p->in_flight[0]),
+		pnode_counter_read_all(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 96bd13e581cd..aac0b2235410 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io)
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
 	part_stat_unlock();
-	atomic_set(&dm_disk(md)->part0.in_flight[rw],
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw],
 		atomic_inc_return(&md->pending[rw]));
 
 	if (unlikely(dm_stats_used(&md->stats)))
@@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io)
 	 * a flush.
 	 */
 	pending = atomic_dec_return(&md->pending[rw]);
-	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+	pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending);
 	pending += atomic_read(&md->pending[rw^0x1]);
 
 	/* nudge anyone waiting on suspend queue */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..40c9bc74a120 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include <linux/pernode_counter.h>
 
 #ifdef CONFIG_BLOCK
@@ -120,7 +121,7 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
+	struct pnode_counter in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *part)
 
 static inline void part_inc_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	pnode_counter_inc(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline void part_dec_in_flight(struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	pnode_counter_dec(&part->in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]);
 }
 
 static inline int part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return pnode_counter_read_all(&part->in_flight[0]) + \
+		pnode_counter_read_all(&part->in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
@@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev,
 				const char *buf, size_t count);
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+static inline int hd_counter_init(struct hd_struct *part)
+{
+	if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL))
+		return -ENOMEM;
+	if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) {
+		pnode_counter_deinit(&part->in_flight[0]);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline void hd_counter_deinit(struct hd_struct *part)
+{
+	pnode_counter_deinit(&part->in_flight[0]);
+	pnode_counter_deinit(&part->in_flight[1]);
+}
+
 static inline int hd_ref_init(struct hd_struct *part)
 {
+	if (hd_counter_init(part))
+		return -ENOMEM;
+
 	if (percpu_ref_init(&part->ref, __delete_partition, 0,
-				GFP_KERNEL))
+				GFP_KERNEL)) {
+		hd_counter_deinit(part);
 		return -ENOMEM;
+	}
 	return 0;
 }
@@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part)
 {
 	free_part_stats(part);
 	free_part_info(part);
+	hd_counter_deinit(part);
 	percpu_ref_exit(&part->ref);
 }
diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counter.h
new file mode 100644
index 000000000000..127639fbc25f
--- /dev/null
+++ b/include/linux/pernode_counter.h
@@ -0,0 +1,118 @@
+#ifndef _LINUX_PERNODE_COUNTER_H
+#define _LINUX_PERNODE_COUNTER_H
+/*
+ * A simple per node atomic counter for use in block io accounting.
+ */
+
+#include
+#include
+#include
+#include
+
+struct node_counter {
+	atomic_t counter_in_node;
+};
+
+struct pnode_counter {
+	struct node_counter * __percpu *counter;
+	struct node_counter **nodes;
+};
+
+static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp)
+{
+	struct node_counter **nodes;
+	int i, j, cpu;
+
+	nodes = kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp);
+	if (!nodes)
+		goto err_nodes;
+
+	for_each_node(i) {
+		nodes[i] = kzalloc_node(sizeof(struct node_counter), gfp, i);
+		if (!nodes[i])
+			goto err_node_counter;
+	}
+
+	pnc->counter = alloc_percpu_gfp(struct node_counter *, gfp);
+	if (!pnc->counter)
+		goto err_node_counter;
+
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(pnc->counter, cpu) = nodes[cpu_to_node(cpu)];
+
+	pnc->nodes = nodes;
+
+	return 0;
+
+ err_node_counter:
+	for (j = 0; j < i; j++)
+		kfree(nodes[j]);
+	kfree(nodes);
+ err_nodes:
+	return -ENOMEM;
+}
+
+static inline void pnode_counter_deinit(struct pnode_counter *pnc)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+		kfree(node);
+		*per_cpu_ptr(pnc->counter, cpu) = NULL;
+	}
+	free_percpu(pnc->counter);
+	kfree(pnc->nodes);
+}
+
+static inline void pnode_counter_inc(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec(struct pnode_counter *pnc)
+{
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cpu)
+{
+	struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+	atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_set(struct pnode_counter *pnc, int val)
+{
+	int i;
+	struct node_counter *node = this_cpu_read(*pnc->counter);
+
+	for_each_node(i)
+		atomic_set(&pnc->nodes[i]->counter_in_node, 0);
+	atomic_set(&node->counter_in_node, val);
+}
+
+static inline long pnode_counter_read_all(struct pnode_counter *pnc)
+{
+	int i;
+	long val = 0;
+
+	for_each_node(i)
+		val += atomic_read(&pnc->nodes[i]->counter_in_node);
+
+	return val;
+}
+
+#endif /* _LINUX_PERNODE_COUNTER_H */
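A quick userspace model of the pnode_counter semantics in the patch above — the percpu indirection is flattened to a plain node array, and NR_NODES and the node ids are illustrative. It shows one subtle point of the API: pnode_counter_set() zeroes every node and then deposits the whole value in the caller's node, so a subsequent pnode_counter_read_all() returns exactly that value.

```c
/*
 * Simplified model of the patch's pnode_counter: one atomic per node,
 * node ids passed explicitly instead of resolved via percpu pointers.
 */
#include <assert.h>
#include <stdatomic.h>

#define NR_NODES 2			/* illustrative */

struct pnode_counter {
	atomic_int node[NR_NODES];
};

static void pnode_counter_inc(struct pnode_counter *c, int nid)
{
	atomic_fetch_add(&c->node[nid], 1);
}

static void pnode_counter_dec(struct pnode_counter *c, int nid)
{
	atomic_fetch_sub(&c->node[nid], 1);
}

/* zero all nodes, then store the full value in the caller's node */
static void pnode_counter_set(struct pnode_counter *c, int nid, int val)
{
	for (int i = 0; i < NR_NODES; i++)
		atomic_store(&c->node[i], 0);
	atomic_store(&c->node[nid], val);
}

/* sum over nodes, matching the patch's pnode_counter_read_all() */
static long pnode_counter_read_all(struct pnode_counter *c)
{
	long sum = 0;

	for (int i = 0; i < NR_NODES; i++)
		sum += atomic_load(&c->node[i]);
	return sum;
}
```

This "zero everywhere, set locally" behaviour is what lets dm's start_io_acct()/end_io_acct() overwrite the counter with an absolute pending count while reads still sum all nodes.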