| Message ID | 20170628211010.4C8C9124035@b01ledav002.gho.pok.ibm.com (mailing list archive) |
|---|---|
| State | Not Applicable, archived |
| Delegated to: | Mike Snitzer |
On 06/28/2017 03:12 PM, Brian King wrote:
> This patch converts the in_flight counter in struct hd_struct from a
> pair of atomics to a pair of percpu counters. This eliminates a couple
> of atomics from the hot path. When running this on a Power system, to
> a single null_blk device with 80 submission queues, irq mode 0, with
> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.

This has been done before, but I've never really liked it. The reason is
that it means that reading the part stat inflight count now has to
iterate over every possible CPU. Did you use partitions in your testing?
How many CPUs were configured? When I last tested this a few years ago
on even a quad core nehalem (which is notoriously shitty for cross-node
latencies), it was a net loss.

I do agree that we should do something about it, and it's one of those
items I've highlighted in talks about blk-mq on pending issues to fix
up. It's just not great as it currently stands, but I don't think per
CPU counters is the right way to fix it, at least not for the inflight
counter.
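For reference, the read side Jens is objecting to is the part_stat_read() macro, shown here roughly as it looked in include/linux/genhd.h at the time (CONFIG_SMP case). Brian's patch routes part_in_flight() through it, so each read of the inflight count walks every possible CPU, once per direction:

/*
 * Roughly the CONFIG_SMP part_stat_read() of that era: every read of a
 * per-cpu disk stat sums the field across all possible CPUs.
 */
#define part_stat_read(part, field)					\
({									\
	typeof((part)->dkstats->field) res = 0;				\
	unsigned int _cpu;						\
	for_each_possible_cpu(_cpu)					\
		res += per_cpu_ptr((part)->dkstats, _cpu)->field;	\
	res;								\
})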
On 06/28/2017 03:12 PM, Brian King wrote:
> -static inline int part_in_flight(struct hd_struct *part)
> +static inline unsigned long part_in_flight(struct hd_struct *part)
>  {
> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);

One obvious improvement would be to not do this twice, but only have to
loop once. Instead of making this an array, make it a structure with a
read and write count.

It still doesn't really fix the issue of someone running on a kernel
with a ton of possible CPUs configured. But it does reduce the overhead
by 50%.
On 06/28/2017 03:54 PM, Jens Axboe wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:
>> -static inline int part_in_flight(struct hd_struct *part)
>> +static inline unsigned long part_in_flight(struct hd_struct *part)
>>  {
>> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
>> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);
>
> One obvious improvement would be to not do this twice, but only have to
> loop once. Instead of making this an array, make it a structure with a
> read and write count.
>
> It still doesn't really fix the issue of someone running on a kernel
> with a ton of possible CPUs configured. But it does reduce the overhead
> by 50%.

Or something as simple as this:

#define part_stat_read_double(part, field1, field2)			\
({									\
	typeof((part)->dkstats->field1) res = 0;			\
	unsigned int _cpu;						\
	for_each_possible_cpu(_cpu) {					\
		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
	}								\
	res;								\
})

static inline unsigned long part_in_flight(struct hd_struct *part)
{
	return part_stat_read_double(part, in_flight[0], in_flight[1]);
}
On 06/28/2017 04:49 PM, Jens Axboe wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:
>> This patch converts the in_flight counter in struct hd_struct from a
>> pair of atomics to a pair of percpu counters. This eliminates a couple
>> of atomics from the hot path. When running this on a Power system, to
>> a single null_blk device with 80 submission queues, irq mode 0, with
>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>
> This has been done before, but I've never really liked it. The reason is
> that it means that reading the part stat inflight count now has to
> iterate over every possible CPU. Did you use partitions in your testing?
> How many CPUs were configured? When I last tested this a few years ago

I did not use partitions. I was running this on a 4 socket Power 8
machine with 5 cores per socket, running with 4 threads per core, so a
total of 80 logical CPUs were usable in Linux.

I was missing the fact that part_round_stats_single calls part_in_flight
and had only noticed the sysfs and procfs users of part_in_flight
previously.

-Brian
On 06/28/2017 04:59 PM, Jens Axboe wrote:
> On 06/28/2017 03:54 PM, Jens Axboe wrote:
>> On 06/28/2017 03:12 PM, Brian King wrote:
>>> -static inline int part_in_flight(struct hd_struct *part)
>>> +static inline unsigned long part_in_flight(struct hd_struct *part)
>>>  {
>>> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
>>> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);
>>
>> One obvious improvement would be to not do this twice, but only have to
>> loop once. Instead of making this an array, make it a structure with a
>> read and write count.
>>
>> It still doesn't really fix the issue of someone running on a kernel
>> with a ton of possible CPUs configured. But it does reduce the overhead
>> by 50%.
>
> Or something as simple as this:
>
> #define part_stat_read_double(part, field1, field2)			\
> ({									\
> 	typeof((part)->dkstats->field1) res = 0;			\
> 	unsigned int _cpu;						\
> 	for_each_possible_cpu(_cpu) {					\
> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
> 	}								\
> 	res;								\
> })
>
> static inline unsigned long part_in_flight(struct hd_struct *part)
> {
> 	return part_stat_read_double(part, in_flight[0], in_flight[1]);
> }
>

I'll give this a try and also see about running some more exhaustive
runs to see if there are any cases where we go backwards in performance.

I'll also run with partitions and see how that impacts this.

Thanks,

Brian
On 06/28/2017 04:07 PM, Brian King wrote:
> On 06/28/2017 04:59 PM, Jens Axboe wrote:
>> On 06/28/2017 03:54 PM, Jens Axboe wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> -static inline int part_in_flight(struct hd_struct *part)
>>>> +static inline unsigned long part_in_flight(struct hd_struct *part)
>>>>  {
>>>> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
>>>> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);
>>>
>>> One obvious improvement would be to not do this twice, but only have to
>>> loop once. Instead of making this an array, make it a structure with a
>>> read and write count.
>>>
>>> It still doesn't really fix the issue of someone running on a kernel
>>> with a ton of possible CPUs configured. But it does reduce the overhead
>>> by 50%.
>>
>> Or something as simple as this:
>>
>> #define part_stat_read_double(part, field1, field2)			\
>> ({									\
>> 	typeof((part)->dkstats->field1) res = 0;			\
>> 	unsigned int _cpu;						\
>> 	for_each_possible_cpu(_cpu) {					\
>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
>> 	}								\
>> 	res;								\
>> })
>>
>> static inline unsigned long part_in_flight(struct hd_struct *part)
>> {
>> 	return part_stat_read_double(part, in_flight[0], in_flight[1]);
>> }
>>
>
> I'll give this a try and also see about running some more exhaustive
> runs to see if there are any cases where we go backwards in performance.
>
> I'll also run with partitions and see how that impacts this.

And do something nuts, like setting NR_CPUS to 512 or whatever. What do
distros ship with?
On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:
>> This patch converts the in_flight counter in struct hd_struct from a
>> pair of atomics to a pair of percpu counters. This eliminates a couple
>> of atomics from the hot path. When running this on a Power system, to
>> a single null_blk device with 80 submission queues, irq mode 0, with
>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>
> This has been done before, but I've never really liked it. The reason is
> that it means that reading the part stat inflight count now has to
> iterate over every possible CPU. Did you use partitions in your testing?
> How many CPUs were configured? When I last tested this a few years ago
> on even a quad core nehalem (which is notoriously shitty for cross-node
> latencies), it was a net loss.

One year ago, I saw null_blk's IOPS can be decreased to 10%
of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
96 cores, and dual numa nodes) too, the performance can be
recovered basically if per numa-node counter is introduced and
used in this case, but the patch was never posted out.
If anyone is interested in that, I can rebase the patch on current
block tree and post out. I guess the performance issue might be
related with system cache coherency implementation more or less.
This issue on ARM64 can be observed with the following userspace
atomic counting test too:

http://kernel.ubuntu.com/~ming/test/cache/

>
> I do agree that we should do something about it, and it's one of those
> items I've highlighted in talks about blk-mq on pending issues to fix
> up. It's just not great as it currently stands, but I don't think per
> CPU counters is the right way to fix it, at least not for the inflight
> counter.

Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe
we can use some blk-mq knowledge(tagset?) to figure out the
'in_flight' counter. I thought about it before, but never got a
perfect solution, and looks it is a bit hard, :-)

Thanks,
Ming Lei

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
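Ming's per-node patch itself is not in the archive; as a minimal sketch of the idea (hypothetical names, not his actual code), the point is to keep one cache-line-aligned pair of counters per NUMA node, so increments and decrements stay node-local and a read only sums nr_node_ids entries instead of every possible CPU:

/*
 * Hypothetical sketch of a per-NUMA-node in-flight counter, illustrating
 * the idea described above; not the unposted patch.
 */
struct node_in_flight {
	atomic_t cnt[2];	/* [0] = READ, [1] = WRITE */
} ____cacheline_aligned_in_smp;

/* one entry per node, e.g. kcalloc(nr_node_ids, sizeof(*nif), GFP_KERNEL) */

static inline void node_in_flight_inc(struct node_in_flight *nif, int rw)
{
	/* increments hit the local node's cache line */
	atomic_inc(&nif[numa_node_id()].cnt[rw]);
}

static inline void node_in_flight_dec(struct node_in_flight *nif, int rw)
{
	atomic_dec(&nif[numa_node_id()].cnt[rw]);
}

static inline unsigned long node_in_flight_sum(struct node_in_flight *nif)
{
	unsigned long sum = 0;
	int node;

	/* readers sum nr_node_ids entries, not all possible CPUs */
	for_each_node(node)
		sum += atomic_read(&nif[node].cnt[0]) +
		       atomic_read(&nif[node].cnt[1]);
	return sum;
}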
On 06/28/2017 05:19 PM, Jens Axboe wrote:
> On 06/28/2017 04:07 PM, Brian King wrote:
>> On 06/28/2017 04:59 PM, Jens Axboe wrote:
>>> On 06/28/2017 03:54 PM, Jens Axboe wrote:
>>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>>> -static inline int part_in_flight(struct hd_struct *part)
>>>>> +static inline unsigned long part_in_flight(struct hd_struct *part)
>>>>>  {
>>>>> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
>>>>> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);
>>>>
>>>> One obvious improvement would be to not do this twice, but only have to
>>>> loop once. Instead of making this an array, make it a structure with a
>>>> read and write count.
>>>>
>>>> It still doesn't really fix the issue of someone running on a kernel
>>>> with a ton of possible CPUs configured. But it does reduce the overhead
>>>> by 50%.
>>>
>>> Or something as simple as this:
>>>
>>> #define part_stat_read_double(part, field1, field2)			\
>>> ({									\
>>> 	typeof((part)->dkstats->field1) res = 0;			\
>>> 	unsigned int _cpu;						\
>>> 	for_each_possible_cpu(_cpu) {					\
>>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
>>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
>>> 	}								\
>>> 	res;								\
>>> })
>>>
>>> static inline unsigned long part_in_flight(struct hd_struct *part)
>>> {
>>> 	return part_stat_read_double(part, in_flight[0], in_flight[1]);
>>> }
>>>
>>
>> I'll give this a try and also see about running some more exhaustive
>> runs to see if there are any cases where we go backwards in performance.
>>
>> I'll also run with partitions and see how that impacts this.
>
> And do something nuts, like setting NR_CPUS to 512 or whatever. What do
> distros ship with?

Both RHEL and SLES set NR_CPUS=2048 for the Power architecture. I can
easily switch the SMT mode of the machine I used for this from 4 to 8 to
have a total of 160 online logical CPUs and see how that affects the
performance. I'll see if I can find a larger machine as well.

Thanks,

Brian
On 06/29/2017 02:40 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 06/28/2017 03:12 PM, Brian King wrote:
>>> This patch converts the in_flight counter in struct hd_struct from a
>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>> of atomics from the hot path. When running this on a Power system, to
>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>
>> This has been done before, but I've never really liked it. The reason is
>> that it means that reading the part stat inflight count now has to
>> iterate over every possible CPU. Did you use partitions in your testing?
>> How many CPUs were configured? When I last tested this a few years ago
>> on even a quad core nehalem (which is notoriously shitty for cross-node
>> latencies), it was a net loss.
>
> One year ago, I saw null_blk's IOPS can be decreased to 10%
> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
> 96 cores, and dual numa nodes) too, the performance can be
> recovered basically if per numa-node counter is introduced and
> used in this case, but the patch was never posted out.
> If anyone is interested in that, I can rebase the patch on current
> block tree and post out. I guess the performance issue might be
> related with system cache coherency implementation more or less.
> This issue on ARM64 can be observed with the following userspace
> atomic counting test too:
>
> http://kernel.ubuntu.com/~ming/test/cache/

How well did the per-node thing work? Doesn't seem to me like it would
go far enough. And per CPU is too much. One potential improvement would
be to change the part_stat_read() to just loop online CPUs, instead of
all possible CPUs. When CPUs go on/offline, use that as the slow path to
ensure the stats are sane. Often there's a huge difference between
NR_CPUS configured and what the system has. As Brian states, RH ships
with 2048, while I doubt a lot of customers actually run that...

Outside of coming up with a more clever data structure that is fully
CPU topology aware, one thing that could work is just having X cache
line separated read/write inflight counters per node, where X is some
suitable value (like 4). That prevents us from having cross node
traffic, and it also keeps the cross cpu traffic fairly low. That should
provide a nice balance between cost of incrementing the inflight
counting, and the cost of looping for reading it.

And that brings me to the next part...

>> I do agree that we should do something about it, and it's one of those
>> items I've highlighted in talks about blk-mq on pending issues to fix
>> up. It's just not great as it currently stands, but I don't think per
>> CPU counters is the right way to fix it, at least not for the inflight
>> counter.
>
> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe
> we can use some blk-mq knowledge(tagset?) to figure out the
> 'in_flight' counter. I thought about it before, but never got a
> perfect solution, and looks it is a bit hard, :-)

The tags are already a bit spread out, so it's worth a shot. That would
remove the need to do anything in the inc/dec path, as the tags already
do that. The inflight count could be easily retrieved with
sbitmap_weight(). The only issue here is that we need separate read and
write counters, and the weight would obviously only get us the total
count. But we can have a slower path for that, just iterate the tags and
count them. The fast path only cares about total count.

Let me try that out real quick.
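As a rough sketch of the fast path Jens describes (a hypothetical helper, not actual kernel code), the whole-disk in-flight count could be derived from the tag maps rather than from a counter touched on every I/O:

/*
 * Sketch only: approximate the whole-disk in-flight count from the
 * blk-mq tag bitmaps.  Assumes the blk-mq internals (struct blk_mq_tags)
 * and sbitmap_weight() are visible to the caller; reserved tags are
 * ignored, and with a shared tag set this counts requests from every
 * queue sharing the tags.
 */
static unsigned int disk_in_flight_estimate(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int sum = 0;
	int i;

	queue_for_each_hw_ctx(q, hctx, i)
		sum += sbitmap_weight(&hctx->tags->bitmap_tags.sb);

	return sum;
}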
On 06/29/2017 09:58 AM, Jens Axboe wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>> of atomics from the hot path. When running this on a Power system, to
>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>> 96 cores, and dual numa nodes) too, the performance can be
>> recovered basically if per numa-node counter is introduced and
>> used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on current
>> block tree and post out. I guess the performance issue might be
>> related with system cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>> http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would
> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...
>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe
>> we can use some blk-mq knowledge(tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and looks it is a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about total count.
>
> Let me try that out real quick.

Well, that only works for whole disk stats, of course... There's no way
around iterating the tags and checking for this to truly work.
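The per-partition path Jens alludes to would then have to walk the busy tags and attribute each started request to its partition, along these lines (a sketch with hypothetical names, assuming the busy_iter_fn/blk_mq_queue_tag_busy_iter() interface as it existed around this time):

/*
 * Sketch: count per-partition in-flight requests by iterating the busy
 * tags instead of maintaining a counter in the I/O hot path.
 */
struct part_inflight {
	struct hd_struct *part;
	unsigned int reads;
	unsigned int writes;
};

static void part_inflight_cb(struct blk_mq_hw_ctx *hctx, struct request *rq,
			     void *priv, bool reserved)
{
	struct part_inflight *pi = priv;

	if (rq->part != pi->part)
		return;
	if (rq_data_dir(rq) == READ)
		pi->reads++;
	else
		pi->writes++;
}

static void part_in_flight_slow(struct request_queue *q,
				struct hd_struct *part,
				unsigned int *reads, unsigned int *writes)
{
	struct part_inflight pi = { .part = part };

	blk_mq_queue_tag_busy_iter(q, part_inflight_cb, &pi);
	*reads = pi.reads;
	*writes = pi.writes;
}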
On 06/29/2017 11:25 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>>> of atomics from the hot path. When running this on a Power system, to
>>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>>>
>>>> This has been done before, but I've never really liked it. The reason is
>>>> that it means that reading the part stat inflight count now has to
>>>> iterate over every possible CPU. Did you use partitions in your testing?
>>>> How many CPUs were configured? When I last tested this a few years ago
>>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>>> latencies), it was a net loss.
>>>
>>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>>> 96 cores, and dual numa nodes) too, the performance can be
>>> recovered basically if per numa-node counter is introduced and
>>> used in this case, but the patch was never posted out.
>>> If anyone is interested in that, I can rebase the patch on current
>>> block tree and post out. I guess the performance issue might be
>>> related with system cache coherency implementation more or less.
>>> This issue on ARM64 can be observed with the following userspace
>>> atomic counting test too:
>>>
>>> http://kernel.ubuntu.com/~ming/test/cache/
>>
>> How well did the per-node thing work? Doesn't seem to me like it would
>
> Last time, on ARM64, I remembered that the IOPS was basically recovered,
> but now I don't have a such machine to test. Could Brian test the attached patch
> to see if it works on big Power machine?
>
> And the idea is simple, just make the atomic counter per-node.

I tried loading the patch and get an oops right away on boot. Haven't
been able to debug anything yet. This is on top of an older kernel, so
not sure if that is the issue or not. I can try upstream and see if I
have different results...

Ubuntu 16.04 . . .

Unable to handle kernel paging request for data at address 0xc00031313a333532
Faulting instruction address: 0xc0000000002552c4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA PowerNV
Modules linked in: dm_round_robin vmx_crypto powernv_rng leds_powernv powernv_op_panel led_class rng_core dm_multipath autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath dm_mirror dm_region_hash dm_log cxlflash bnx2x mdio libcrc32c nvme nvme_core lpfc cxl crc_t10dif crct10dif_generic crct10dif_common
CPU: 9 PID: 3485 Comm: multipathd Not tainted 4.9.10-dirty #7
task: c000000fba0d0000 task.stack: c000000fb8a1c000
NIP: c0000000002552c4 LR: c000000000255274 CTR: 0000000000000000
REGS: c000000fb8a1f350 TRAP: 0300   Not tainted  (4.9.10-dirty)
MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24028848  XER: 00000000
CFAR: c000000000008a60 DAR: c00031313a333532 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001f8980 c000000fb8a1f5d0 c000000000f24300 c000000fc4007c00
GPR04: 00000000024000c0 c0000000002fc0ac 000000000000025d c000000fc5d346e0
GPR08: 0000000fc50d6000 0000000000000000 c00031313a333532 c000000fbc836240
GPR12: 0000000000002200 c00000000ff02400 00003fff79cfebf0 0000000000000000
GPR16: 00000000c138fd03 000001003a12bf70 00003fff79ae75e0 00003fff79d269e8
GPR20: 0000000000000001 00003fff79cfebf0 000001003a393aa0 00003fff79d067b8
GPR24: 0000000000000000 c000000000b04948 000000000000a1ff c0000000002fc0ac
GPR28: 00000000024000c0 0000000000000007 c0000000002fc0ac c000000fc4007c00
NIP [c0000000002552c4] __kmalloc_track_caller+0x94/0x2f0
LR [c000000000255274] __kmalloc_track_caller+0x44/0x2f0
Call Trace:
[c000000fb8a1f5d0] [c000000fb8a1f610] 0xc000000fb8a1f610 (unreliable)
[c000000fb8a1f620] [c0000000001f8980] kstrdup+0x50/0xb0
[c000000fb8a1f660] [c0000000002fc0ac] __kernfs_new_node+0x4c/0x140
[c000000fb8a1f6b0] [c0000000002fd9f4] kernfs_new_node+0x34/0x80
[c000000fb8a1f6e0] [c000000000300708] kernfs_create_link+0x38/0xf0
[c000000fb8a1f720] [c000000000301cb8] sysfs_do_create_link_sd.isra.0+0xa8/0x160
[c000000fb8a1f770] [c00000000054a658] device_add+0x2b8/0x740
[c000000fb8a1f830] [c00000000054ae58] device_create_groups_vargs+0x178/0x190
[c000000fb8a1f890] [c0000000001fcd70] bdi_register+0x80/0x1d0
[c000000fb8a1f8c0] [c0000000001fd244] bdi_register_owner+0x44/0x80
[c000000fb8a1f940] [c000000000453bbc] device_add_disk+0x1cc/0x500
[c000000fb8a1f9f0] [c000000000764dec] dm_create+0x33c/0x5f0
[c000000fb8a1fa90] [c00000000076d9ac] dev_create+0x8c/0x3e0
[c000000fb8a1fb40] [c00000000076d1fc] ctl_ioctl+0x38c/0x580
[c000000fb8a1fd20] [c00000000076d410] dm_ctl_ioctl+0x20/0x30
[c000000fb8a1fd40] [c0000000002799ac] do_vfs_ioctl+0xcc/0x8f0
[c000000fb8a1fde0] [c00000000027a230] SyS_ioctl+0x60/0xc0
[c000000fb8a1fe30] [c00000000000bfe0] system_call+0x38/0xfc
Instruction dump:
39290008 7cc8482a e94d0030 e9230000 7ce95214 7d49502a e9270010 2faa0000
419e007c 2fa90000 419e0074 e93f0022 <7f4a482a> 39200000 88ad023a 992d023a
---[ end trace dcdac2d3f668d033 ]---

>
>> go far enough. And per CPU is too much. One potential improvement would
>> be to change the part_stat_read() to just loop online CPUs, instead of
>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>> ensure the stats are sane. Often there's a huge difference between
>> NR_CPUS configured and what the system has. As Brian states, RH ships
>> with 2048, while I doubt a lot of customers actually run that...
>
> One observation I saw on arm64 dual socket before is that atomic inc/dec on
> counter stored in local numa node is much cheaper than cross-node, that is
> why I tried the per-node counter. And wrt. in-flight atomic counter, both inc
> and dec should happen on CPUs belonging to same numa node in case of
> blk-mq.
>
>> Outside of coming up with a more clever data structure that is fully
>> CPU topology aware, one thing that could work is just having X cache
>> line separated read/write inflight counters per node, where X is some
>> suitable value (like 4). That prevents us from having cross node
>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>> provide a nice balance between cost of incrementing the inflight
>> counting, and the cost of looping for reading it.
>>
>> And that brings me to the next part...
>>
>>>> I do agree that we should do something about it, and it's one of those
>>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>>> up. It's just not great as it currently stands, but I don't think per
>>>> CPU counters is the right way to fix it, at least not for the inflight
>>>> counter.
>>>
>>> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe
>>> we can use some blk-mq knowledge(tagset?) to figure out the
>>> 'in_flight' counter. I thought about it before, but never got a
>>> perfect solution, and looks it is a bit hard, :-)
>>
>> The tags are already a bit spread out, so it's worth a shot. That would
>> remove the need to do anything in the inc/dec path, as the tags already
>> do that. The inflight count could be easily retrieved with
>> sbitmap_weight(). The only issue here is that we need separate read and
>> write counters, and the weight would obviously only get us the total
>> count. But we can have a slower path for that, just iterate the tags and
>> count them. The fast path only cares about total count.
>>
>> Let me try that out real quick.
>>
>> --
>> Jens Axboe
>
> Thanks,
> Ming Lei
diff -puN include/linux/genhd.h~blk_in_flight_atomic_remove include/linux/genhd.h
--- linux-block/include/linux/genhd.h~blk_in_flight_atomic_remove	2017-06-28 16:06:43.037948079 -0500
+++ linux-block-bjking1/include/linux/genhd.h	2017-06-28 16:06:43.064947978 -0500
@@ -87,6 +87,7 @@ struct disk_stats {
 	unsigned long ticks[2];
 	unsigned long io_ticks;
 	unsigned long time_in_queue;
+	unsigned long in_flight[2];
 };
 
 #define PARTITION_META_INFO_VOLNAMELTH	64
@@ -120,7 +121,6 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
 #ifdef CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -362,23 +362,23 @@ static inline void free_part_stats(struc
 #define part_stat_sub(cpu, gendiskp, field, subnd)			\
 	part_stat_add(cpu, gendiskp, field, -subnd)
 
-static inline void part_inc_in_flight(struct hd_struct *part, int rw)
+static inline void part_inc_in_flight(int cpu, struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	part_stat_inc(cpu, part, in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		part_stat_inc(cpu, &part_to_disk(part)->part0, in_flight[rw]);
 }
 
-static inline void part_dec_in_flight(struct hd_struct *part, int rw)
+static inline void part_dec_in_flight(int cpu, struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	part_stat_dec(cpu, part, in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		part_stat_dec(cpu, &part_to_disk(part)->part0, in_flight[rw]);
 }
 
-static inline int part_in_flight(struct hd_struct *part)
+static inline unsigned long part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
diff -puN block/bio.c~blk_in_flight_atomic_remove block/bio.c
--- linux-block/block/bio.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.041948064 -0500
+++ linux-block-bjking1/block/bio.c	2017-06-28 16:06:43.065947974 -0500
@@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsig
 	part_round_stats(cpu, part);
 	part_stat_inc(cpu, part, ios[rw]);
 	part_stat_add(cpu, part, sectors[rw], sectors);
-	part_inc_in_flight(part, rw);
+	part_inc_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }
@@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct
 
 	part_stat_add(cpu, part, ticks[rw], duration);
 	part_round_stats(cpu, part);
-	part_dec_in_flight(part, rw);
+	part_dec_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }
diff -puN block/blk-core.c~blk_in_flight_atomic_remove block/blk-core.c
--- linux-block/block/blk-core.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.045948049 -0500
+++ linux-block-bjking1/block/blk-core.c	2017-06-28 16:06:43.066947970 -0500
@@ -2435,7 +2435,7 @@ void blk_account_io_done(struct request
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
 		part_round_stats(cpu, part);
-		part_dec_in_flight(part, rw);
+		part_dec_in_flight(cpu, part, rw);
 
 		hd_struct_put(part);
 		part_stat_unlock();
@@ -2493,7 +2493,7 @@ void blk_account_io_start(struct request
 			hd_struct_get(part);
 		}
 		part_round_stats(cpu, part);
-		part_inc_in_flight(part, rw);
+		part_inc_in_flight(cpu, part, rw);
 		rq->part = part;
 	}
 
diff -puN block/blk-merge.c~blk_in_flight_atomic_remove block/blk-merge.c
--- linux-block/block/blk-merge.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.048948038 -0500
+++ linux-block-bjking1/block/blk-merge.c	2017-06-28 16:06:43.067947967 -0500
@@ -634,7 +634,7 @@ static void blk_account_io_merge(struct
 		part = req->part;
 
 		part_round_stats(cpu, part);
-		part_dec_in_flight(part, rq_data_dir(req));
+		part_dec_in_flight(cpu, part, rq_data_dir(req));
 
 		hd_struct_put(part);
 		part_stat_unlock();
diff -puN block/genhd.c~blk_in_flight_atomic_remove block/genhd.c
--- linux-block/block/genhd.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.052948023 -0500
+++ linux-block-bjking1/block/genhd.c	2017-06-28 16:06:43.068947963 -0500
@@ -1220,7 +1220,7 @@ static int diskstats_show(struct seq_fil
 		part_round_stats(cpu, hd);
 		part_stat_unlock();
 		seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
-			   "%u %lu %lu %lu %u %u %u %u\n",
+			   "%u %lu %lu %lu %u %lu %u %u\n",
 			   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
 			   disk_name(gp, hd->partno, buf),
 			   part_stat_read(hd, ios[READ]),
diff -puN block/partition-generic.c~blk_in_flight_atomic_remove block/partition-generic.c
--- linux-block/block/partition-generic.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.055948012 -0500
+++ linux-block-bjking1/block/partition-generic.c	2017-06-28 16:06:43.069947959 -0500
@@ -120,7 +120,7 @@ ssize_t part_stat_show(struct device *de
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
-		"%8u %8u %8u"
+		"%8lu %8u %8u"
 		"\n",
 		part_stat_read(p, ios[READ]),
 		part_stat_read(p, merges[READ]),
@@ -140,8 +140,8 @@ ssize_t part_inflight_show(struct device
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8lu %8lu\n", part_stat_read(p, in_flight[0]),
+		part_stat_read(p, in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff -puN drivers/md/dm.c~blk_in_flight_atomic_remove drivers/md/dm.c
--- linux-block/drivers/md/dm.c~blk_in_flight_atomic_remove	2017-06-28 16:06:43.058948000 -0500
+++ linux-block-bjking1/drivers/md/dm.c	2017-06-28 16:06:43.070947955 -0500
@@ -517,9 +517,9 @@ static void start_io_acct(struct dm_io *
 
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
+	part_inc_in_flight(cpu, &dm_disk(md)->part0, rw);
+	atomic_inc(&md->pending[rw]);
 	part_stat_unlock();
-	atomic_set(&dm_disk(md)->part0.in_flight[rw],
-		atomic_inc_return(&md->pending[rw]));
 
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -532,7 +532,7 @@ static void end_io_acct(struct dm_io *io
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->bio;
 	unsigned long duration = jiffies - io->start_time;
-	int pending;
+	int pending, cpu;
 	int rw = bio_data_dir(bio);
 
 	generic_end_io_acct(rw, &dm_disk(md)->part0, io->start_time);
@@ -546,9 +546,11 @@ static void end_io_acct(struct dm_io *io
 	 * After this is decremented the bio must not be touched if it is
 	 * a flush.
 	 */
+	cpu = part_stat_lock();
 	pending = atomic_dec_return(&md->pending[rw]);
-	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+	part_dec_in_flight(cpu, &dm_disk(md)->part0, rw);
 	pending += atomic_read(&md->pending[rw^0x1]);
+	part_stat_unlock();
 
 	/* nudge anyone waiting on suspend queue */
 	if (!pending)
This patch converts the in_flight counter in struct hd_struct from a
pair of atomics to a pair of percpu counters. This eliminates a couple
of atomics from the hot path. When running this on a Power system, to
a single null_blk device with 80 submission queues, irq mode 0, with
80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---
 block/bio.c               |  4 ++--
 block/blk-core.c          |  4 ++--
 block/blk-merge.c         |  2 +-
 block/genhd.c             |  2 +-
 block/partition-generic.c |  6 +++---
 drivers/md/dm.c           | 10 ++++++----
 include/linux/genhd.h     | 18 +++++++++---------
 7 files changed, 24 insertions(+), 22 deletions(-)