Message ID | 20220622134028.2013417-1-dylany@fb.com (mailing list archive) |
---|---|
Series | io_uring: tw contention improvments |
On 6/22/22 7:40 AM, Dylan Yudaken wrote:
> Task work currently uses a spin lock to guard task_list and
> task_running. Some use cases such as networking can trigger task_work_add
> from multiple threads all at once, which suffers from contention here.
>
> This can be changed to use a lockless list which seems to have better
> performance. Running the micro benchmark in [1] I see 20% improvement in
> multithreaded task work add. It required removing the priority tw list
> optimisation, however it isn't clear how important that optimisation is.
> Additionally it has fairly easy to break semantics.
>
> Patch 1-2 remove the priority tw list optimisation
> Patch 3-5 add lockless lists for task work
> Patch 6 fixes a bug I noticed in io_uring event tracing
> Patch 7-8 adds tracing for task_work_run

I ran some IRQ driven workloads on this. Basic 512b random read, DIO,
IRQ, and then at queue depths 1-64, doubling every time. Once we get to
QD=8, start doing submit/complete batch of 1/4th of the QD so we ramp up
there too. Results below, first set is 5.19-rc3 + for-5.20/io_uring,
second set is that plus this series.

This is what I ran:

sudo taskset -c 12 t/io_uring -d<QD> -b512 -s<batch> -c<batch> -p0 -F1 -B1 -n1 -D0 -R0 -X1 -R1 -t1 -r5 /dev/nvme0n1

on a gen2 optane drive.

tldr - looks like an improvement there too, and no ill effects seen on
latency.
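The sweep described in the message (queue depth doubling from 1 to 64, batch of 1 below QD=8, then a quarter of the queue depth) can be sketched as a small script. This is only an illustration: the helper names `sweep` and `command` are hypothetical, and the flags are copied verbatim from the command line as posted rather than interpreted.

```python
DEVICE = "/dev/nvme0n1"  # device named in the posted command


def sweep(max_qd=64):
    """Yield (queue_depth, batch) pairs.

    Queue depth doubles from 1 up to max_qd. Batch stays at 1 until
    QD=8, from where it is a quarter of the queue depth.
    """
    qd = 1
    while qd <= max_qd:
        batch = qd // 4 if qd >= 8 else 1
        yield qd, batch
        qd *= 2


def command(qd, batch, cpu=12):
    """Build the command line for one step of the sweep.

    Only -d, -s and -c vary; every other flag is copied unchanged
    from the message.
    """
    args = [
        "sudo", "taskset", "-c", str(cpu), "t/io_uring",
        f"-d{qd}", "-b512", f"-s{batch}", f"-c{batch}",
        "-p0", "-F1", "-B1", "-n1", "-D0", "-R0", "-X1", "-R1",
        "-t1", "-r5", DEVICE,
    ]
    return " ".join(args)


if __name__ == "__main__":
    # Print the commands instead of running them; they need root and
    # issue I/O against a raw NVMe device.
    for qd, batch in sweep():
        print(command(qd, batch))
```

The script only prints the seven command lines, so it is safe to run on a machine without the device.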
5.19-rc3 + for-5.20/io_uring:

QD=1, Batch=1
Maximum IOPS=244K
1509: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3996],  5.0000th=[ 3996], 10.0000th=[ 3996],
     | 20.0000th=[ 4036], 30.0000th=[ 4036], 40.0000th=[ 4036],
     | 50.0000th=[ 4036], 60.0000th=[ 4036], 70.0000th=[ 4036],
     | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4196],
     | 99.0000th=[ 4437], 99.5000th=[ 5421], 99.9000th=[ 7590],
     | 99.9500th=[ 9518], 99.9900th=[32289]

QD=2, Batch=1
Maximum IOPS=483K
1533: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3714],  5.0000th=[ 3755], 10.0000th=[ 3795],
     | 20.0000th=[ 3795], 30.0000th=[ 3835], 40.0000th=[ 3955],
     | 50.0000th=[ 4036], 60.0000th=[ 4076], 70.0000th=[ 4076],
     | 80.0000th=[ 4076], 90.0000th=[ 4116], 95.0000th=[ 4156],
     | 99.0000th=[ 4518], 99.5000th=[ 6144], 99.9000th=[ 7510],
     | 99.9500th=[ 9839], 99.9900th=[32289]

QD=4, Batch=1
Maximum IOPS=907K
1583: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3393],  5.0000th=[ 3514], 10.0000th=[ 3594],
     | 20.0000th=[ 3634], 30.0000th=[ 3795], 40.0000th=[ 3875],
     | 50.0000th=[ 3955], 60.0000th=[ 4076], 70.0000th=[ 4156],
     | 80.0000th=[ 4277], 90.0000th=[ 4397], 95.0000th=[ 4477],
     | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9357],
     | 99.9500th=[11004], 99.9900th=[32289]

QD=8, Batch=2
Maximum IOPS=1688K
1631: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3353],  5.0000th=[ 3554], 10.0000th=[ 3634],
     | 20.0000th=[ 3755], 30.0000th=[ 3875], 40.0000th=[ 4036],
     | 50.0000th=[ 4156], 60.0000th=[ 4277], 70.0000th=[ 4437],
     | 80.0000th=[ 4678], 90.0000th=[ 4839], 95.0000th=[ 5040],
     | 99.0000th=[ 6305], 99.5000th=[ 7028], 99.9000th=[10080],
     | 99.9500th=[15502], 99.9900th=[32932]

QD=16, Batch=4
Maximum IOPS=2613K
1680: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 4397], 10.0000th=[ 4558],
     | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120],
     | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743],
     | 80.0000th=[ 5903], 90.0000th=[ 6305], 95.0000th=[ 6706],
     | 99.0000th=[ 8393], 99.5000th=[ 8955], 99.9000th=[11325],
     | 99.9500th=[31968], 99.9900th=[34217]

QD=32, Batch=8
Maximum IOPS=3573K
1706: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 4919],  5.0000th=[ 5662], 10.0000th=[ 5903],
     | 20.0000th=[ 6144], 30.0000th=[ 6465], 40.0000th=[ 6626],
     | 50.0000th=[ 6867], 60.0000th=[ 7188], 70.0000th=[ 7510],
     | 80.0000th=[ 7992], 90.0000th=[ 8714], 95.0000th=[ 9357],
     | 99.0000th=[11325], 99.5000th=[11967], 99.9000th=[16626],
     | 99.9500th=[34217], 99.9900th=[37108]

QD=64, Batch=16
Maximum IOPS=3953K
1735: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 6626],  5.0000th=[ 7188], 10.0000th=[ 7510],
     | 20.0000th=[ 7992], 30.0000th=[ 8393], 40.0000th=[ 9116],
     | 50.0000th=[10160], 60.0000th=[11164], 70.0000th=[11646],
     | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735],
     | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[34217],
     | 99.9500th=[38072], 99.9900th=[40964]

============

5.19-rc3 + for-5.20/io_uring + this series:

QD=1, Batch=1
Maximum IOPS=246K
909: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 3996], 10.0000th=[ 3996],
     | 20.0000th=[ 3996], 30.0000th=[ 3996], 40.0000th=[ 3996],
     | 50.0000th=[ 3996], 60.0000th=[ 3996], 70.0000th=[ 4036],
     | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116],
     | 99.0000th=[ 4196], 99.5000th=[ 5341], 99.9000th=[ 7590],
     | 99.9500th=[ 9357], 99.9900th=[32289]

QD=2, Batch=1
Maximum IOPS=487K
932: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3714],  5.0000th=[ 3755], 10.0000th=[ 3755],
     | 20.0000th=[ 3755], 30.0000th=[ 3795], 40.0000th=[ 3795],
     | 50.0000th=[ 3996], 60.0000th=[ 4036], 70.0000th=[ 4036],
     | 80.0000th=[ 4036], 90.0000th=[ 4076], 95.0000th=[ 4116],
     | 99.0000th=[ 4437], 99.5000th=[ 6224], 99.9000th=[ 7510],
     | 99.9500th=[ 9598], 99.9900th=[32289]

QD=4, Batch=1
Maximum IOPS=921K
955: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3393],  5.0000th=[ 3433], 10.0000th=[ 3514],
     | 20.0000th=[ 3594], 30.0000th=[ 3674], 40.0000th=[ 3795],
     | 50.0000th=[ 3875], 60.0000th=[ 3996], 70.0000th=[ 4036],
     | 80.0000th=[ 4156], 90.0000th=[ 4317], 95.0000th=[ 4678],
     | 99.0000th=[ 5120], 99.5000th=[ 5903], 99.9000th=[ 9116],
     | 99.9500th=[10522], 99.9900th=[32289]

QD=8, Batch=2
Maximum IOPS=1658K
981: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3313],  5.0000th=[ 3514], 10.0000th=[ 3594],
     | 20.0000th=[ 3714], 30.0000th=[ 3835], 40.0000th=[ 3996],
     | 50.0000th=[ 4116], 60.0000th=[ 4196], 70.0000th=[ 4397],
     | 80.0000th=[ 4598], 90.0000th=[ 4718], 95.0000th=[ 4919],
     | 99.0000th=[ 6385], 99.5000th=[ 6947], 99.9000th=[10000],
     | 99.9500th=[15180], 99.9900th=[32932]

QD=16, Batch=4
Maximum IOPS=2749K
1010: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 3955],  5.0000th=[ 4437], 10.0000th=[ 4558],
     | 20.0000th=[ 4759], 30.0000th=[ 4959], 40.0000th=[ 5120],
     | 50.0000th=[ 5261], 60.0000th=[ 5502], 70.0000th=[ 5743],
     | 80.0000th=[ 5903], 90.0000th=[ 6224], 95.0000th=[ 6626],
     | 99.0000th=[ 8313], 99.5000th=[ 9036], 99.9000th=[11967],
     | 99.9500th=[32289], 99.9900th=[34217]

QD=32, Batch=8
Maximum IOPS=3583K
1050: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 4879],  5.0000th=[ 5582], 10.0000th=[ 5903],
     | 20.0000th=[ 6224], 30.0000th=[ 6465], 40.0000th=[ 6626],
     | 50.0000th=[ 6787], 60.0000th=[ 7028], 70.0000th=[ 7349],
     | 80.0000th=[ 7911], 90.0000th=[ 8634], 95.0000th=[ 9196],
     | 99.0000th=[11164], 99.5000th=[11967], 99.9000th=[16305],
     | 99.9500th=[34217], 99.9900th=[37108]

QD=64, Batch=16
Maximum IOPS=3959K
1081: Latency percentiles:
    percentiles (nsec):
     |  1.0000th=[ 6546],  5.0000th=[ 7108], 10.0000th=[ 7429],
     | 20.0000th=[ 7992], 30.0000th=[ 8313], 40.0000th=[ 8955],
     | 50.0000th=[10000], 60.0000th=[11004], 70.0000th=[11646],
     | 80.0000th=[12128], 90.0000th=[12931], 95.0000th=[13735],
     | 99.0000th=[15984], 99.5000th=[16787], 99.9000th=[33253],
     | 99.9500th=[38072], 99.9900th=[41446]
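
To make the two result sets easier to compare, the following sketch computes the relative change in maximum IOPS per queue depth. The IOPS figures are copied from the results above; the names `BASELINE_KIOPS`, `SERIES_KIOPS` and `pct_change` are hypothetical and exist only for this illustration.

```python
# Maximum IOPS in thousands, keyed by queue depth, copied from the
# two result sets above.
BASELINE_KIOPS = {1: 244, 2: 483, 4: 907, 8: 1688, 16: 2613, 32: 3573, 64: 3953}
SERIES_KIOPS = {1: 246, 2: 487, 4: 921, 8: 1658, 16: 2749, 32: 3583, 64: 3959}


def pct_change(before, after):
    """Return the percentage change from before to after for each queue depth."""
    return {qd: (after[qd] - before[qd]) / before[qd] * 100.0 for qd in before}


if __name__ == "__main__":
    changes = pct_change(BASELINE_KIOPS, SERIES_KIOPS)
    for qd, delta in changes.items():
        print(f"QD={qd:<2} {BASELINE_KIOPS[qd]:>5}K -> {SERIES_KIOPS[qd]:>5}K  {delta:+.2f}%")
```

On these figures the largest change is at QD=16 (about +5.2%), QD=8 comes out about 1.8% lower, and the remaining queue depths move by between roughly +0.1% and +1.5%. Each figure is a single reported maximum, so run-to-run variation is not captured here.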
On Wed, 22 Jun 2022 06:40:20 -0700, Dylan Yudaken wrote:
> Task work currently uses a spin lock to guard task_list and
> task_running. Some use cases such as networking can trigger task_work_add
> from multiple threads all at once, which suffers from contention here.
>
> This can be changed to use a lockless list which seems to have better
> performance. Running the micro benchmark in [1] I see 20% improvement in
> multithreaded task work add. It required removing the priority tw list
> optimisation, however it isn't clear how important that optimisation is.
> Additionally it has fairly easy to break semantics.
>
> [...]

Applied, thanks!

[1/8] io_uring: remove priority tw list optimisation
      commit: bb35381ea1b3980704809f1c13d7831989a9bc97
[2/8] io_uring: remove __io_req_task_work_add
      commit: fbfa4521091037bdfe499501d4c7ed175592ccd4
[3/8] io_uring: lockless task list
      commit: f032372c18b0730f551b8fa0a354ce2e84cfcbb7
[4/8] io_uring: introduce llist helpers
      commit: c0808632a83a7c607a987154372e705353acf4f2
[5/8] io_uring: batch task_work
      commit: 7afb384a25b0ed597defad431dcc83b5f509c98e
[6/8] io_uring: move io_uring_get_opcode out of TP_printk
      commit: 1da6baa4e4c290cebafec3341dbf3cbca21081b7
[7/8] io_uring: add trace event for running task work
      commit: d34b8ba25f0c3503f8766bd595c6d28e01cbbd54
[8/8] io_uring: trace task_work_run
      commit: e57a6f13bec58afe717894ce7fb7e6061c3fc2f4

Best regards,
On 6/22/22 23:21, Jens Axboe wrote:
> On 6/22/22 7:40 AM, Dylan Yudaken wrote:
>> Task work currently uses a spin lock to guard task_list and
>> task_running. Some use cases such as networking can trigger task_work_add
>> from multiple threads all at once, which suffers from contention here.
>>
>> This can be changed to use a lockless list which seems to have better
>> performance. Running the micro benchmark in [1] I see 20% improvement in
>> multithreaded task work add. It required removing the priority tw list
>> optimisation, however it isn't clear how important that optimisation is.
>> Additionally it has fairly easy to break semantics.
>>
>> Patch 1-2 remove the priority tw list optimisation
>> Patch 3-5 add lockless lists for task work
>> Patch 6 fixes a bug I noticed in io_uring event tracing
>> Patch 7-8 adds tracing for task_work_run
>
> I ran some IRQ driven workloads on this. Basic 512b random read, DIO,
> IRQ, and then at queue depths 1-64, doubling every time. Once we get to
> QD=8, start doing submit/complete batch of 1/4th of the QD so we ramp up
> there too. Results below, first set is 5.19-rc3 + for-5.20/io_uring,
> second set is that plus this series.
>
> This is what I ran:
>
> sudo taskset -c 12 t/io_uring -d<QD> -b512 -s<batch> -c<batch> -p0 -F1 -B1 -n1 -D0 -R0 -X1 -R1 -t1 -r5 /dev/nvme0n1
>
> on a gen2 optane drive.
>
> tldr - looks like an improvement there too, and no ill effects seen on
> latency.

Looks so, nice.