Message ID | CAN-5tyEX1Z_3EB3h6=z_S1E=gpTObOrcP5Ub2HmVKBB5RaU1DQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
> Hi folks,
>
> On a multi-core machine, is it expected that we can have parallel
> RPCs
> handled by each of the per-core workqueue?
>
> In testing a read workload, observing via "top" command that a single
> "kworker" thread is running servicing the requests (no parallelism).
> It's more prominent while doing these operations over krb5p mount.
>
> What has been suggested by Bruce is to try this and in my testing I
> see then the read workload spread among all the kworker threads.
>
> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 0cc8383..f80e688 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>   * Create the rpciod thread and wait for it to start.
>   */
>   dprintk("RPC:       creating workqueue rpciod\n");
> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>   if (!wq)
>   goto out_failed;
>   rpciod_workqueue = wq;
>

WQ_UNBOUND turns off concurrency management on the thread pool (See
Documentation/core-api/workqueue.rst. It also means we contend for work
item queuing/dequeuing locks, since the threads which run the work
items are not bound to a CPU.

IOW: This is not a slam-dunk obvious gain.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust <trondmy@primarydata.com> wrote:
> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>> Hi folks,
>>
>> On a multi-core machine, is it expected that we can have parallel
>> RPCs
>> handled by each of the per-core workqueue?
>>
>> In testing a read workload, observing via "top" command that a single
>> "kworker" thread is running servicing the requests (no parallelism).
>> It's more prominent while doing these operations over krb5p mount.
>>
>> What has been suggested by Bruce is to try this and in my testing I
>> see then the read workload spread among all the kworker threads.
>>
>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>>
>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> index 0cc8383..f80e688 100644
>> --- a/net/sunrpc/sched.c
>> +++ b/net/sunrpc/sched.c
>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>   * Create the rpciod thread and wait for it to start.
>>   */
>>   dprintk("RPC: creating workqueue rpciod\n");
>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>   if (!wq)
>>   goto out_failed;
>>   rpciod_workqueue = wq;
>>
>
> WQ_UNBOUND turns off concurrency management on the thread pool (See
> Documentation/core-api/workqueue.rst. It also means we contend for work
> item queuing/dequeuing locks, since the threads which run the work
> items are not bound to a CPU.
>
> IOW: This is not a slam-dunk obvious gain.

I agree, but I think it's worth consideration. I'm waiting to get
(real) performance numbers of improvement (instead of my VM setup) to
help my case. However, a 90% degradation in read performance over
krb5p was reported when 1 CPU is executing all ops.

Is there a different way to make sure that on a multi-processor
machine we can take advantage of all available CPUs? Simple kernel
threads instead of a work queue?

Can/should we have a WQ_UNBOUND work queue for secure mounts and
another queue for other mounts?

While I wouldn't call krb5 load long running, Documentation says that
an example for WQ_UNBOUND is for CPU intensive workloads. And also in
general "work items are not expected to hog a CPU and consume many
cycles". How "many" is too "many"? How many operations are crypto
operations?
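The concern raised in the message above, that read throughput over krb5p suffers when a single CPU executes every operation, can be pictured with a back-of-the-envelope model. The Python sketch below is not kernel code and its numbers are made up for illustration. It assumes throughput is capped either by the network link or by the combined rate at which the participating CPUs can do the per-RPC crypto work. The names `read_throughput`, `WIRE_MB_S` and `CRYPTO_MB_S_PER_CPU` are hypothetical.

```python
def read_throughput(wire_mb_s: float, crypto_mb_s_per_cpu: float, cpus_used: int) -> float:
    """Toy model: throughput is limited by the slower of the wire and the
    aggregate crypto rate of the CPUs that actually service the RPCs."""
    if cpus_used < 1:
        raise ValueError("cpus_used must be at least 1")
    return min(wire_mb_s, crypto_mb_s_per_cpu * cpus_used)


# Hypothetical figures, chosen only to illustrate the shape of the problem.
WIRE_MB_S = 100.0            # what the network link could carry
CRYPTO_MB_S_PER_CPU = 60.0   # krb5p data one CPU could process per second

for cpus in (1, 2, 4):
    rate = read_throughput(WIRE_MB_S, CRYPTO_MB_S_PER_CPU, cpus)
    print(f"{cpus} CPU(s) servicing RPCs: {rate:.1f} MB/s")
```

Under these assumptions a single worker is the bottleneck, and spreading the work over a second CPU is already enough to saturate the link. The model ignores lock contention and cache locality, which are exactly the costs of WQ_UNBOUND that Trond points out.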
> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
> <trondmy@primarydata.com> wrote:
>> [...]
>>
>> WQ_UNBOUND turns off concurrency management on the thread pool (See
>> Documentation/core-api/workqueue.rst. It also means we contend for work
>> item queuing/dequeuing locks, since the threads which run the work
>> items are not bound to a CPU.
>>
>> IOW: This is not a slam-dunk obvious gain.
>
> I agree but I think it's worth consideration. I'm waiting to get
> (real) performance numbers of improvement (instead of my VM setup) to
> help my case. However, it was reported 90% degradation for the read
> performance over krb5p when 1CPU is executing all ops.
>
> Is there a different way to make sure that on a multi-processor
> machine we can take advantage of all available CPUs? Simple kernel
> threads instead of a work queue?

There is a trade-off between spreading the work, and ensuring it
is executed on a CPU close to the I/O and application. IMO UNBOUND
is a good way to do that. UNBOUND will attempt to schedule the
work on the preferred CPU, but allow it to be migrated if that
CPU is busy.

The advantage of this is that when the client workload is CPU
intensive (say, a software build), RPC client work can be scheduled
and run more quickly, which reduces latency.

> Can/should we have an WQ_UNBOUND work queue for secure mounts and
> another queue for other mounts?
>
> While I wouldn't call krb5 load long running, Documentation says that
> an example for WQ_UNBOUND is for CPU intensive workloads. And also in
> general "work items are not expected to hog a CPU and consume many
> cycles". How "many" is too "many". How many operations are crypto
> operations?

--
Chuck Lever
On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
> [...]
>
> There is a trade-off between spreading the work, and ensuring it
> is executed on a CPU close to the I/O and application. IMO UNBOUND
> is a good way to do that. UNBOUND will attempt to schedule the
> work on the preferred CPU, but allow it to be migrated if that
> CPU is busy.
>
> The advantage of this is that when the client workload is CPU
> intensive (say, a software build), RPC client work can be scheduled
> and run more quickly, which reduces latency.

That should no longer be a huge issue, since queue_work() will now
default to the WORK_CPU_UNBOUND flag, which prefers the local CPU, but
will schedule elsewhere if the local CPU is congested.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust <trondmy@primarydata.com> wrote:
> [...]
>
> That should no longer be a huge issue, since queue_work() will now
> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU, but
> will schedule elsewhere if the local CPU is congested.

I don't believe NFS uses workqueue_congested() to somehow schedule the
work elsewhere. Unless the queue is marked UNBOUND, I don't believe
there is any intention of balancing the CPU load.
On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
> <trondmy@primarydata.com> wrote:
> > [...]
> >
> > That should no longer be a huge issue, since queue_work() will now
> > default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
> > but will schedule elsewhere if the local CPU is congested.
>
> I don't believe NFS use workqueue_congested() to somehow schedule the
> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
> there is any intention of balancing the CPU load.

I shouldn't have to test the queue when scheduling with
WORK_CPU_UNBOUND.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>> [...]
>>
>> I don't believe NFS use workqueue_congested() to somehow schedule the
>> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
>> there is any intention of balancing the CPU load.
>
> I shouldn't have to test the queue when scheduling with
> WORK_CPU_UNBOUND.

Comments in the code say that "if CPU dies" the work will be
re-scheduled on another CPU. I think the code requires the queue to be
marked UNBOUND for work to really be scheduled on a different CPU.
That is just my reading of the code, and it matches what is seen with
the krb5 workload.
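The two readings debated in these messages can be made concrete with a toy placement model. This is a user-space Python sketch, not the kernel's workqueue implementation. It assumes, following the reading in the message above, that a work item queued on a per-CPU (bound) workqueue stays on the submitting CPU, while a WQ_UNBOUND workqueue lets the item run on any CPU that is free. `pick_cpu` and `simulate` are hypothetical names, and "first idle CPU" is only a stand-in for whatever the scheduler would really choose.

```python
from typing import List, Sequence


def pick_cpu(submit_cpu: int, unbound: bool, busy: Sequence[bool]) -> int:
    """Return the CPU a work item runs on in this toy model."""
    if not unbound:
        # Bound (per-CPU) queue: the item stays where it was submitted.
        return submit_cpu
    if not busy[submit_cpu]:
        # Unbound queue: still prefer the local CPU when it is free.
        return submit_cpu
    for cpu, is_busy in enumerate(busy):
        if not is_busy:
            return cpu
    # Every CPU is busy: fall back to the submitting CPU.
    return submit_cpu


def simulate(n_items: int, unbound: bool, n_cpus: int) -> List[int]:
    """Queue a burst of n_items from CPU 0 and record where each one lands.

    Each item is assumed to keep its CPU busy for the whole burst, which
    mimics a run of CPU-heavy krb5p work items.
    """
    busy = [False] * n_cpus
    placement = []
    for _ in range(n_items):
        cpu = pick_cpu(0, unbound, busy)
        busy[cpu] = True
        placement.append(cpu)
    return placement


print("bound:  ", simulate(4, unbound=False, n_cpus=4))
print("unbound:", simulate(4, unbound=True, n_cpus=4))
```

In this model the bound queue piles every item onto CPU 0, which resembles the single busy kworker reported at the start of the thread, while the unbound queue spreads the burst across the CPUs. Whether the real kernel behaves this way for a bound queue under congestion is the open question in the exchange above.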
On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> [...]
>
> Comments in the code says that "if CPU dies" it'll be re-scheduled on
> another. I think the code requires to mark the queue UNBOUND to really
> be scheduled on a different CPU. Just my reading of the code and it
> matches what is seen with the krb5 workload.

Trond, what's the path forward here? What about a run-time
configuration that starts rpciod with the UNBOUND options instead?
> On Feb 14, 2018, at 6:13 PM, Mora, Jorge <Jorge.Mora@netapp.com> wrote: > > Hello, > > The patch gives some performance improvement on Kerberos read. > The following results show performance comparisons between unpatched > and patched systems. The html files included as attachments show the > results as line charts. > > - Best read performance improvement when testing with a single dd transfer. > The patched system gives 70% better performance than the unpatched system. > (first set of results) > > - The patched system gives 18% better performance than the unpatched system > when testing with multiple dd transfers. > (second set of results) > > - The write test shows there is no performance hit by the patch. > (third set of results) > > - When testing on a different client having less RAM and fewer number of CPU cores, > there is no performance degradation for Kerberos in the unpatched system. > In this case, the patch does not provide any performance improvement. > (fourth set of results) > > ================================================================================ > Test environment: > > NFS client: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz) > NFS servers: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz) > NFS mount: NFSv3 with sec=(sys or krb5p) > > For tests with a single dd transfer there is of course one NFS server used > and one file being read -- only one transfer was needed to fill up the > network connection. > > For tests with multiple dd transfers, three different NFS server were used > and four different files were used per NFS server for a total of 12 different > files being read (12 different transfers in parallel). > > The patch was applied on top of 4.14.0-rc3 kernel and the NFS servers were > running RHEL 7.4. > > The fourth set of results below show an unpatched system with no Kerberos > degradation (same kernel 4.14.0-rc3) but in contrast with the main client > used for testing this client has only 4 CPU cores and 8GB of RAM. 
> I believe that even though this system has less CPU cores and less RAM,
> the CPU is faster (E31220 @ 3.10GHz vs E5620 @ 2.40GHz) so it is able
> to handle the Kerberos load better and fill up the network connection
> with a single thread than the main client with more CPU cores and more
> memory.

Jorge, thanks for publishing these results.

Can you do a "numactl -H" on your clients and post the output? I suspect
the throughput improvement on the big client is because WQ_UNBOUND
behaves differently on NUMA systems. (Even so, I agree that the proposed
change is valuable).

> ================================================================================
>
> Kerberos Read Performance: 170.15% (patched system over unpatched system)
>
> Client CPU:       Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:        16
> RAM:              32 GB
> NFS version:      3
> Mount points:     1
> dd's per mount:   1
> Total dd's:       1
> Data transferred: 7.81 GB (per run)
> Number of runs:   10
>
> Kerberos Read Performance (unpatched system vs patched system)
>   Transfer rate (unpatched system) avg: 65.88 MB/s, var: 20.28, stddev: 4.50
>   Transfer rate (patched system)   avg: 112.10 MB/s, var: 0.00, stddev: 0.01
>   Performance (patched over unpatched): 170.15%
>
> Unpatched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 111.96 MB/s, var: 0.02, stddev: 0.13
>   Transfer rate (sec=krb5p) avg: 65.88 MB/s, var: 20.28, stddev: 4.50
>   Performance (krb5p over sys): 58.84%
>
> Patched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 111.94 MB/s, var: 0.02, stddev: 0.14
>   Transfer rate (sec=krb5p) avg: 112.10 MB/s, var: 0.00, stddev: 0.01
>   Performance (krb5p over sys): 100.14%
>
> ================================================================================
>
> Kerberos Read Performance: 118.02% (patched system over unpatched system)
>
> Client CPU:       Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:        16
> RAM:              32 GB
> NFS version:      3
> Mount points:     3
> dd's per mount:   4
> Total dd's:       12
> Data transferred: 93.75 GB (per run)
> Number of runs:   10
>
> Kerberos Read Performance (unpatched system vs patched system)
>   Transfer rate (unpatched system) avg: 94.99 MB/s, var: 68.96, stddev: 8.30
>   Transfer rate (patched system)   avg: 112.11 MB/s, var: 0.00, stddev: 0.03
>   Performance (patched over unpatched): 118.02%
>
> Unpatched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 112.21 MB/s, var: 0.00, stddev: 0.00
>   Transfer rate (sec=krb5p) avg: 94.99 MB/s, var: 68.96, stddev: 8.30
>   Performance (krb5p over sys): 84.66%
>
> Patched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 112.20 MB/s, var: 0.00, stddev: 0.00
>   Transfer rate (sec=krb5p) avg: 112.11 MB/s, var: 0.00, stddev: 0.03
>   Performance (krb5p over sys): 99.92%
>
> ================================================================================
>
> Kerberos Write Performance: 101.55% (patched system over unpatched system)
>
> Client CPU:       Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:        16
> RAM:              32 GB
> NFS version:      3
> Mount points:     3
> dd's per mount:   4
> Total dd's:       12
> Data transferred: 93.75 GB (per run)
> Number of runs:   10
>
> Kerberos Write Performance (unpatched system vs patched system)
>   Transfer rate (unpatched system) avg: 103.70 MB/s, var: 110.51, stddev: 10.51
>   Transfer rate (patched system)   avg: 105.31 MB/s, var: 35.04, stddev: 5.92
>   Performance (patched over unpatched): 101.55%
>
> Unpatched System Write Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 109.87 MB/s, var: 10.27, stddev: 3.20
>   Transfer rate (sec=krb5p) avg: 103.70 MB/s, var: 110.51, stddev: 10.51
>   Performance (krb5p over sys): 94.39%
>
> Patched System Write Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 111.03 MB/s, var: 0.58, stddev: 0.76
>   Transfer rate (sec=krb5p) avg: 105.31 MB/s, var: 35.04, stddev: 5.92
>   Performance (krb5p over sys): 94.85%
>
> ================================================================================
>
> Kerberos Read Performance: 99.99% (patched system over unpatched system)
>
> Client CPU:       Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
> CPU cores:        4
> RAM:              8 GB
> NFS version:      3
> Mount points:     1
> dd's per mount:   1
> Total dd's:       1
> Data transferred: 7.81 GB (per run)
> Number of runs:   10
>
> Kerberos Read Performance (unpatched system vs patched system)
>   Transfer rate (unpatched system) avg: 112.02 MB/s, var: 0.04, stddev: 0.21
>   Transfer rate (patched system)   avg: 112.01 MB/s, var: 0.06, stddev: 0.25
>   Performance (patched over unpatched): 99.99%
>
> Unpatched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 111.86 MB/s, var: 0.06, stddev: 0.24
>   Transfer rate (sec=krb5p) avg: 112.02 MB/s, var: 0.04, stddev: 0.21
>   Performance (krb5p over sys): 100.14%
>
> Patched System Read Performance (sys vs krb5p)
>   Transfer rate (sec=sys)   avg: 111.76 MB/s, var: 0.12, stddev: 0.34
>   Transfer rate (sec=krb5p) avg: 112.01 MB/s, var: 0.06, stddev: 0.25
>   Performance (krb5p over sys): 100.22%
>
>
> --Jorge
>
> ________________________________________
> From: linux-nfs-owner@vger.kernel.org <linux-nfs-owner@vger.kernel.org> on behalf of Olga Kornievskaia <aglo@umich.edu>
> Sent: Wednesday, July 19, 2017 11:59 AM
> To: Trond Myklebust
> Cc: linux-nfs@vger.kernel.org; chuck.lever@oracle.com
> Subject: Re: [RFC] fix parallelism for rpc tasks
>
> On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
>> <trondmy@primarydata.com> wrote:
>>> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>>>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>>>> <trondmy@primarydata.com> wrote:
>>>>> On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>>>>>>> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <aglo@umich.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>>>>>>> <trondmy@primarydata.com> wrote:
>>>>>>>> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>>>>>>>> Hi folks,
>>>>>>>>>
>>>>>>>>> On a multi-core machine, is it expected that we can have
>>>>>>>>> parallel RPCs handled by each of the per-core workqueue?
>>>>>>>>>
>>>>>>>>> In testing a read workload, observing via "top" command
>>>>>>>>> that a single "kworker" thread is running servicing the
>>>>>>>>> requests (no parallelism). It's more prominent while doing
>>>>>>>>> these operations over krb5p mount.
>>>>>>>>>
>>>>>>>>> What has been suggested by Bruce is to try this and in my
>>>>>>>>> testing I see then the read workload spread among all the
>>>>>>>>> kworker threads.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>>>>>>>>>
>>>>>>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>>>>>>> index 0cc8383..f80e688 100644
>>>>>>>>> --- a/net/sunrpc/sched.c
>>>>>>>>> +++ b/net/sunrpc/sched.c
>>>>>>>>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>>>>>>>> * Create the rpciod thread and wait for it to start.
>>>>>>>>> */
>>>>>>>>> dprintk("RPC: creating workqueue rpciod\n");
>>>>>>>>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>>>>>>>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>>>>>>>> if (!wq)
>>>>>>>>> goto out_failed;
>>>>>>>>> rpciod_workqueue = wq;
>>>>>>>>>
>>>>>>>>
>>>>>>>> WQ_UNBOUND turns off concurrency management on the thread
>>>>>>>> pool (See Documentation/core-api/workqueue.rst. It also means
>>>>>>>> we contend for work item queuing/dequeuing locks, since the
>>>>>>>> threads which run the work items are not bound to a CPU.
>>>>>>>>
>>>>>>>> IOW: This is not a slam-dunk obvious gain.
>>>>>>>
>>>>>>> I agree but I think it's worth consideration. I'm waiting to
>>>>>>> get (real) performance numbers of improvement (instead of my VM
>>>>>>> setup) to help my case. However, it was reported 90%
>>>>>>> degradation for the read performance over krb5p when 1CPU is
>>>>>>> executing all ops.
>>>>>>>
>>>>>>> Is there a different way to make sure that on a multi-processor
>>>>>>> machine we can take advantage of all available CPUs? Simple
>>>>>>> kernel threads instead of a work queue?
>>>>>>
>>>>>> There is a trade-off between spreading the work, and ensuring it
>>>>>> is executed on a CPU close to the I/O and application. IMO
>>>>>> UNBOUND is a good way to do that. UNBOUND will attempt to
>>>>>> schedule the work on the preferred CPU, but allow it to be
>>>>>> migrated if that CPU is busy.
>>>>>>
>>>>>> The advantage of this is that when the client workload is CPU
>>>>>> intensive (say, a software build), RPC client work can be
>>>>>> scheduled and run more quickly, which reduces latency.
>>>>>>
>>>>>
>>>>> That should no longer be a huge issue, since queue_work() will now
>>>>> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>>>>> but will schedule elsewhere if the local CPU is congested.
>>>>
>>>> I don't believe NFS use workqueue_congested() to somehow schedule the
>>>> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
>>>> there is any intention of balancing the CPU load.
>>>>
>>>
>>> I shouldn't have to test the queue when scheduling with
>>> WORK_CPU_UNBOUND.
>>>
>>
>> Comments in the code says that "if CPU dies" it'll be re-scheduled on
>> another. I think the code requires to mark the queue UNBOUND to really
>> be scheduled on a different CPU. Just my reading of the code and it
>> matches what is seen with the krb5 workload.
>
> Trond, what's the path forward here? What about a run-time
> configuration that starts rpciod with the UNBOUND options instead?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> <dd_read_single.html><dd_read_mult.html><dd_write_mult.html><dd_read_single1.html>

--
Chuck Lever
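For anyone cross-checking the quoted numbers: the "X over Y" percentages in the report look like plain ratios of the mean transfer rates. The snippet below is only a sketch of that arithmetic, not the script that produced the report. The function names `percent_of` and `report` are made up for illustration, and the inputs are the rounded averages from the tables, so the last digit can differ slightly from the report.

```c
#include <stdio.h>

/* Ratio of two mean transfer rates (MB/s), expressed as a percentage.
 * Hypothetical helper; assumes the report computes avg_a / avg_b * 100. */
double percent_of(double a_mbps, double b_mbps)
{
    return 100.0 * a_mbps / b_mbps;
}

/* Print one "A over B" line in roughly the style of the quoted report. */
void report(const char *label, double a_mbps, double b_mbps)
{
    printf("%-40s %.2f%%\n", label, percent_of(a_mbps, b_mbps));
}
```

For example, `report("Performance (patched over unpatched):", 112.11, 94.99)` uses the averages from the 12-transfer krb5p read test and should land close to the 118.02% quoted there.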
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc8383..f80e688 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1095,7 +1095,7 @@ static int rpciod_start(void)
 	 * Create the rpciod thread and wait for it to start.
 	 */
 	dprintk("RPC:       creating workqueue rpciod\n");
-	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
+	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!wq)
 		goto out_failed;
 	rpciod_workqueue = wq;
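The thread also floats a run-time switch instead of making rpciod unconditionally unbound. A minimal userspace sketch of that flag selection follows, under stated assumptions: the two `WQ_*` values are stand-ins that I believe mirror include/linux/workqueue.h but should be checked against your tree, and `rpciod_wq_flags` and its `unbound` argument are hypothetical names, not anything in net/sunrpc/sched.c.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's workqueue flag bits. ASSUMPTION: these
 * values mirror include/linux/workqueue.h; verify against your tree. */
#define WQ_UNBOUND     (1u << 1)
#define WQ_MEM_RECLAIM (1u << 3)

/* Hypothetical helper: choose rpciod's workqueue flags from a run-time
 * switch rather than hard-coding WQ_UNBOUND as the patch does. */
unsigned int rpciod_wq_flags(bool unbound)
{
    /* WQ_MEM_RECLAIM is present in both the old and the patched call. */
    unsigned int flags = WQ_MEM_RECLAIM;

    if (unbound)
        flags |= WQ_UNBOUND;    /* opt in to the unbound worker pools */
    return flags;
}
```

In the kernel the result would presumably be passed as the second argument of alloc_workqueue("rpciod", flags, 0), with the switch exposed through something like a module parameter. That wiring is left out here because it cannot be exercised outside the kernel.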
Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueue?

In testing a read workload, observing via "top" command that a single
"kworker" thread is running servicing the requests (no parallelism).
It's more prominent while doing these operations over krb5p mount.

What has been suggested by Bruce is to try this and in my testing I
see then the read workload spread among all the kworker threads.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>