
Large latency on blk_queue_enter

Message ID 576f9601-b0de-c636-8195-07e12fe99734@fb.com (mailing list archive)
State New, archived

Commit Message

Jens Axboe May 8, 2017, 3:08 p.m. UTC
On 05/08/2017 09:02 AM, Javier González wrote:
>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>
>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>
>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>
>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>
>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I find an unusual added latency (~20-30ms) on blk_queue_enter when
>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>
>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>> allocation.
>>>>>>>>>
>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>> function in lightnvm.c, which uses nvme_alloc_request. When we send a
>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>> (IN/OUT). I tracked the added latency down to the call to
>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. It seems that the queue
>>>>>>>>> reference counter is not released as it should be through blk_queue_exit
>>>>>>>>> in blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>> nvme_nvm_submit_user_cmd function in lightnvm.c.
>>>>>>>>>
>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>> longer time.
>>>>>>>>
>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>> This counter is initialized in atomic mode, and finally switches to
>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
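
For reference, the setup Ming describes amounts to roughly the following
(paraphrased from the 4.11-era block core, not a verbatim copy):

	/* at queue allocation: the usage counter starts out in atomic mode */
	percpu_ref_init(&q->q_usage_counter, blk_queue_usage_counter_release,
			PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

	/* later, in blk_register_queue(): switch to the per-cpu fast path */
	percpu_ref_switch_to_percpu(&q->q_usage_counter);

Once the switch has happened, percpu_ref_tryget_live() on the counter is a
cheap per-cpu increment unless the queue is being frozen or torn down.
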
>>>>>>>
>>>>>>> Thanks for commenting Ming.
>>>>>>>
>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>> how it should be, right?
>>>>>>
>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>> debugging options, like lockdep or anything like that?
>>>>>
>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>
>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>> the same device (which would end up allocating a new request in
>>>>> parallel, potentially on the same core), the latency spike does not
>>>>> trigger.
>>>>>
>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>> the ioctl to different cores, but it does not help...
>>>>
>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>> I have no idea what is going on here.
>>>
>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>
>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>        sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>> 2. Write 5GB to the test0 block device with a normal fio script
>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>> different channel. A simple command is an erase (erase block 900 on
>>> channel 2, lun 0):
>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>
>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>
>>> I tried to reproduce the ioctl in a more generic way to reach
>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>> is not reliable for this kind of performance testing.
>>>
>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>> reproduce the issue.
>>
>> Just to rule out this being any hardware related delays in processing
>> IO:
>>
>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>   that you can test?
> 
> Yes. I tried with a 4KB read and with a fake command I drop right after
> allocation.
> 
>> 2) What did you use to time the stall being blk_queue_enter()?
>>
> 
> I have some debug code measuring time with ktime_get() in different
> places in the stack, and among other places, around blk_queue_enter(). I
> then use them to measure max latency and expose it through sysfs. I can
> see that the latency peak is recorded in the probe before
> blk_queue_enter() and not in the one after.
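
A probe of that sort can be as small as the sketch below; the wrapper name
timed_queue_enter() and the dbg_max_enter_ns counter are hypothetical names
picked for illustration, not taken from the actual debug code:

	static u64 dbg_max_enter_ns;	/* exposed through sysfs elsewhere */

	static int timed_queue_enter(struct request_queue *q, bool nowait)
	{
		ktime_t start = ktime_get();
		int ret = blk_queue_enter(q, nowait);
		u64 delta = ktime_to_ns(ktime_sub(ktime_get(), start));

		if (delta > dbg_max_enter_ns)	/* track the worst case seen */
			dbg_max_enter_ns = delta;
		return ret;
	}
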
> 
> I also did an experiment, where the normal I/O path allocates the
> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
> read test fails since we reach:
> 	if (nowait)
> 	  return -EBUSY;
> 
> in blk_queue_enter.
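
That check sits in blk_queue_enter(); roughly, in the 4.11-era code
(paraphrased, not quoted verbatim):

	int blk_queue_enter(struct request_queue *q, bool nowait)
	{
		while (true) {
			int ret;

			/* fast path: per-cpu ref, fails while the queue is frozen or dying */
			if (percpu_ref_tryget_live(&q->q_usage_counter))
				return 0;

			if (nowait)
				return -EBUSY;

			/* slow path: sleep until the freeze is released */
			ret = wait_event_interruptible(q->mq_freeze_wq,
					!atomic_read(&q->mq_freeze_depth) ||
					blk_queue_dying(q));
			if (blk_queue_dying(q))
				return -ENODEV;
			if (ret)
				return ret;
		}
	}

With NOWAIT set the allocation fails fast with -EBUSY instead of sleeping on
mq_freeze_wq, which matches what the experiment above observes.
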

OK, that's starting to make more sense; it indicates that there is indeed
something wrong with the refs. Does the below help?

Comments

Jens Axboe May 8, 2017, 3:14 p.m. UTC | #1
On 05/08/2017 09:08 AM, Jens Axboe wrote:
> On 05/08/2017 09:02 AM, Javier González wrote:
>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>
>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>
>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>
>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>
>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>> allocation.
>>>>>>>>>>
>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>
>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>> longer time.
>>>>>>>>>
>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>
>>>>>>>> Thanks for commenting Ming.
>>>>>>>>
>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>> how it should be, right?
>>>>>>>
>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>
>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>
>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>> the same device (which would end up allocating a new request in
>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>> trigger.
>>>>>>
>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>> the ioctl to different cores, but it does not help...
>>>>>
>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>> I have no idea what is going on here.
>>>>
>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>
>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>        sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>> different channel. A simple command is an erase (erase block 900 on
>>>> channel 2, lun 0):
>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>
>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>
>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>> is not reliable for this kind of performance testing.
>>>>
>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>> reproduce the issue.
>>>
>>> Just to rule out this being any hardware related delays in processing
>>> IO:
>>>
>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>   that you can test?
>>
>> Yes. I tried with a 4KB read and with a fake command I drop right after
>> allocation.
>>
>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>
>>
>> I have some debug code measuring time with ktime_get() in different
>> places in the stack, and among other places, around blk_queue_enter(). I
>> use them then to measure max latency and expose it through sysfs. I can
>> see that the latency peak is recorded in the probe before
>> blk_queue_enter() and not in the one after.
>>
>> I also did an experiment, where the normal I/O path allocates the
>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>> read test fails since we reach:
>> 	if (nowait)
>> 	  return -EBUSY;
>>
>> in blk_queue_enter.
> 
> OK, that's starting to make more sense, that indicates that there is indeed
> something wrong with the refs. Does the below help?

No, that can't be right; it does look balanced to begin with.
blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
we return with a request successfully allocated, then we have an extra
ref on it, which is dropped when it is later freed. Something smells
fishy, I'll dig a bit.
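
In outline, the balance described here looks like this (condensed from the
4.11-era blk-mq code, not a verbatim copy):

	struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
					     unsigned int flags)
	{
		struct blk_mq_alloc_data alloc_data = { .flags = flags };
		struct request *rq;
		int ret;

		ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);	/* queue ref #1 */
		if (ret)
			return ERR_PTR(ret);

		/* the allocation takes its own queue ref via blk_queue_enter_live() */
		rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);

		blk_queue_exit(q);					/* drop ref #1 */
		if (!rq)
			return ERR_PTR(-EWOULDBLOCK);

		/* the request's ref is dropped later, in blk_mq_free_request() */
		return rq;
	}
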
Javier González May 8, 2017, 3:22 p.m. UTC | #2
Javier

> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>> 
>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>> 
>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>> 
>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>> 
>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>> allocation.
>>>>>>>>>>> 
>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>> 
>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>> longer time.
>>>>>>>>>> 
>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>> 
>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>> 
>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>> how it should be, right?
>>>>>>>> 
>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>> 
>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>> 
>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>> the same device (which would end up allocating a new request in
>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>> trigger.
>>>>>>> 
>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>> the ioctl to different cores, but it does not help...
>>>>>> 
>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>> I have no idea what is going on here.
>>>>> 
>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>> 
>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>       sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>> channel 2, lun 0):
>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>> 
>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>> 
>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>> is not reliable for this kind of performance testing.
>>>>> 
>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>> reproduce the issue.
>>>> 
>>>> Just to rule out this being any hardware related delays in processing
>>>> IO:
>>>> 
>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>  that you can test?
>>> 
>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>> allocation.
>>> 
>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>> 
>>> I have some debug code measuring time with ktime_get() in different
>>> places in the stack, and among other places, around blk_queue_enter(). I
>>> use them then to measure max latency and expose it through sysfs. I can
>>> see that the latency peak is recorded in the probe before
>>> blk_queue_enter() and not in the one after.
>>> 
>>> I also did an experiment, where the normal I/O path allocates the
>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>> read test fails since we reach:
>>> 	if (nowait)
>>> 	  return -EBUSY;
>>> 
>>> in blk_queue_enter.
>> 
>> OK, that's starting to make more sense, that indicates that there is indeed
>> something wrong with the refs. Does the below help?
> 
> No, that can't be right, it does look balanced to begin with.
> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
> we return with a request succesfully allocated, then we have an extra
> ref on it, which is dropped when it is later freed.

I agree, it seems more like a reference is put too late. I looked into
the places where the reference is put, but it all seems normal. In any
case, I ran it (just to see), and it did not help.

> Something smells fishy, I'll dig a bit.

Thanks! I'll continue looking into it myself; let me know if I can help
with something more specific.

Javier
Jens Axboe May 8, 2017, 3:25 p.m. UTC | #3
On 05/08/2017 09:22 AM, Javier González wrote:
> 
> Javier
> 
>> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>>>
>>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>
>>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>
>>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>>>
>>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>>> allocation.
>>>>>>>>>>>>
>>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>>> longer time.
>>>>>>>>>>>
>>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>>>
>>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>>>
>>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>>> how it should be, right?
>>>>>>>>>
>>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>>>
>>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>>>
>>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>>> the same device (which would end up allocating a new request in
>>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>>> trigger.
>>>>>>>>
>>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>>> the ioctl to different cores, but it does not help...
>>>>>>>
>>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>>> I have no idea what is going on here.
>>>>>>
>>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>>>
>>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>>       sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>>> channel 2, lun 0):
>>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>>>
>>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>>>
>>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>>> is not reliable for this kind of performance testing.
>>>>>>
>>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>>> reproduce the issue.
>>>>>
>>>>> Just to rule out this being any hardware related delays in processing
>>>>> IO:
>>>>>
>>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>>  that you can test?
>>>>
>>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>>> allocation.
>>>>
>>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>>
>>>> I have some debug code measuring time with ktime_get() in different
>>>> places in the stack, and among other places, around blk_queue_enter(). I
>>>> use them then to measure max latency and expose it through sysfs. I can
>>>> see that the latency peak is recorded in the probe before
>>>> blk_queue_enter() and not in the one after.
>>>>
>>>> I also did an experiment, where the normal I/O path allocates the
>>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>>> read test fails since we reach:
>>>> 	if (nowait)
>>>> 	  return -EBUSY;
>>>>
>>>> in blk_queue_enter.
>>>
>>> OK, that's starting to make more sense, that indicates that there is indeed
>>> something wrong with the refs. Does the below help?
>>
>> No, that can't be right, it does look balanced to begin with.
>> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
>> we return with a request succesfully allocated, then we have an extra
>> ref on it, which is dropped when it is later freed.
> 
> I agree, it seems more like a reference is put too late. I looked into
> into the places where the reference is put, but it all seems normal. In
> any case, I run it (just to see), and it did not help.
> 
>> Something smells fishy, I'll dig a bit.
> 
> Thanks! I continue looking into it myself; let me know if I can help
> with something more specific.

What exact kernel are you running? And does the device have a scheduler
attached, or is it set to "none"?
Javier González May 8, 2017, 3:38 p.m. UTC | #4
> On 8 May 2017, at 17.25, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 05/08/2017 09:22 AM, Javier González wrote:
>> Javier
>> 
>>> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>>>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>>>> 
>>>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>> 
>>>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>>>> allocation.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>>>> longer time.
>>>>>>>>>>>> 
>>>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>>>> 
>>>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>>>> how it should be, right?
>>>>>>>>>> 
>>>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>>>> 
>>>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>>>> 
>>>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>>>> the same device (which would end up allocating a new request in
>>>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>>>> trigger.
>>>>>>>>> 
>>>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>>>> the ioctl to different cores, but it does not help...
>>>>>>>> 
>>>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>>>> I have no idea what is going on here.
>>>>>>> 
>>>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>>>> 
>>>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>>>      sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>>>> channel 2, lun 0):
>>>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>>>> 
>>>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>>>> 
>>>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>>>> is not reliable for this kind of performance testing.
>>>>>>> 
>>>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>>>> reproduce the issue.
>>>>>> 
>>>>>> Just to rule out this being any hardware related delays in processing
>>>>>> IO:
>>>>>> 
>>>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>>> that you can test?
>>>>> 
>>>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>>>> allocation.
>>>>> 
>>>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>>> 
>>>>> I have some debug code measuring time with ktime_get() in different
>>>>> places in the stack, and among other places, around blk_queue_enter(). I
>>>>> use them then to measure max latency and expose it through sysfs. I can
>>>>> see that the latency peak is recorded in the probe before
>>>>> blk_queue_enter() and not in the one after.
>>>>> 
>>>>> I also did an experiment, where the normal I/O path allocates the
>>>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>>>> read test fails since we reach:
>>>>> 	if (nowait)
>>>>> 	  return -EBUSY;
>>>>> 
>>>>> in blk_queue_enter.
>>>> 
>>>> OK, that's starting to make more sense, that indicates that there is indeed
>>>> something wrong with the refs. Does the below help?
>>> 
>>> No, that can't be right, it does look balanced to begin with.
>>> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
>>> we return with a request succesfully allocated, then we have an extra
>>> ref on it, which is dropped when it is later freed.
>> 
>> I agree, it seems more like a reference is put too late. I looked into
>> into the places where the reference is put, but it all seems normal. In
>> any case, I run it (just to see), and it did not help.
>> 
>>> Something smells fishy, I'll dig a bit.
>> 
>> Thanks! I continue looking into it myself; let me know if I can help
>> with something more specific.
> 
> What exact kernel are you running? And does the device have a scheduler
> attached, or is it set to "none"?

I can reproduce the issue on 4.11-rc7. I will rebase on top of your
for-4.12/block, but I cannot see any patches that might be related. If
it changes I'll ping you.

I mentioned the problem to Christoph last week and disabling the
schedulers was the first thing he recommended. I measured time around
blk_mq_sched_get_request and for this particular test the choice of
scheduler (including BFQ and kyber) does not seem to have an effect.

Javier
Jens Axboe May 8, 2017, 3:40 p.m. UTC | #5
On 05/08/2017 09:38 AM, Javier González wrote:
>> On 8 May 2017, at 17.25, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 05/08/2017 09:22 AM, Javier González wrote:
>>> Javier
>>>
>>>> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>>>>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>
>>>>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>
>>>>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>>>>> allocation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>>>>> longer time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>>>>>
>>>>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>>>>> how it should be, right?
>>>>>>>>>>>
>>>>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>>>>>
>>>>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>>>>>
>>>>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>>>>> the same device (which would end up allocating a new request in
>>>>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>>>>> trigger.
>>>>>>>>>>
>>>>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>>>>> the ioctl to different cores, but it does not help...
>>>>>>>>>
>>>>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>>>>> I have no idea what is going on here.
>>>>>>>>
>>>>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>>>>>
>>>>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>>>>      sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>>>>> channel 2, lun 0):
>>>>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>>>>>
>>>>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>>>>>
>>>>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>>>>> is not reliable for this kind of performance testing.
>>>>>>>>
>>>>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>>>>> reproduce the issue.
>>>>>>>
>>>>>>> Just to rule out this being any hardware related delays in processing
>>>>>>> IO:
>>>>>>>
>>>>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>>>> that you can test?
>>>>>>
>>>>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>>>>> allocation.
>>>>>>
>>>>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>>>>
>>>>>> I have some debug code measuring time with ktime_get() in different
>>>>>> places in the stack, and among other places, around blk_queue_enter(). I
>>>>>> use them then to measure max latency and expose it through sysfs. I can
>>>>>> see that the latency peak is recorded in the probe before
>>>>>> blk_queue_enter() and not in the one after.
>>>>>>
>>>>>> I also did an experiment, where the normal I/O path allocates the
>>>>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>>>>> read test fails since we reach:
>>>>>> 	if (nowait)
>>>>>> 	  return -EBUSY;
>>>>>>
>>>>>> in blk_queue_enter.
>>>>>
>>>>> OK, that's starting to make more sense, that indicates that there is indeed
>>>>> something wrong with the refs. Does the below help?
>>>>
>>>> No, that can't be right, it does look balanced to begin with.
>>>> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
>>>> we return with a request succesfully allocated, then we have an extra
>>>> ref on it, which is dropped when it is later freed.
>>>
>>> I agree, it seems more like a reference is put too late. I looked into
>>> into the places where the reference is put, but it all seems normal. In
>>> any case, I run it (just to see), and it did not help.
>>>
>>>> Something smells fishy, I'll dig a bit.
>>>
>>> Thanks! I continue looking into it myself; let me know if I can help
>>> with something more specific.
>>
>> What exact kernel are you running? And does the device have a scheduler
>> attached, or is it set to "none"?
> 
> I can reproduce the issue on 4.11-rc7. I will rebase on top of your
> for-4.12/block, but I cannot see any patches that might be related. If
> it changes I'll ping you.

I don't suspect it will do anything for you. I just asked to find out what
base you are on.

> I mentioned the problem to Christoph last week and disabling the
> schedulers was the first thing he recommended. I measured time around
> blk_mq_sched_get_request and for this particular test the choose of
> scheduler (including BFQ and kyber) does not seem to have an effect.

kyber vs none would be the interesting test. Some of the paths are a
little different depending on whether there's a scheduler attached or not, so
it's good to know that we're seeing this in both cases.
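
For the record, the branch that differs is inside blk_mq_sched_get_request();
heavily trimmed from the 4.11-era code, and shown only to illustrate the split:

	struct elevator_queue *e = q->elevator;

	blk_queue_enter_live(q);			/* the request's own queue ref */

	if (e && e->type->ops.mq.get_request)
		rq = e->type->ops.mq.get_request(q, op, data);	/* scheduler attached */
	else
		rq = __blk_mq_alloc_request(data, op);		/* "none" */
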
Javier González May 8, 2017, 3:49 p.m. UTC | #6
> On 8 May 2017, at 17.40, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 05/08/2017 09:38 AM, Javier González wrote:
>>> On 8 May 2017, at 17.25, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 05/08/2017 09:22 AM, Javier González wrote:
>>>> Javier
>>>> 
>>>>> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>> 
>>>>> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>>>>>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>>>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>> 
>>>>>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>>>>>> allocation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>>>>>> longer time.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>>>>>> how it should be, right?
>>>>>>>>>>>> 
>>>>>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>>>>>> 
>>>>>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>>>>>> 
>>>>>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>>>>>> the same device (which would end up allocating a new request in
>>>>>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>>>>>> trigger.
>>>>>>>>>>> 
>>>>>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>>>>>> the ioctl to different cores, but it does not help...
>>>>>>>>>> 
>>>>>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>>>>>> I have no idea what is going on here.
>>>>>>>>> 
>>>>>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>>>>>> 
>>>>>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>>>>>     sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>>>>>> channel 2, lun 0):
>>>>>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>>>>>> 
>>>>>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>>>>>> 
>>>>>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>>>>>> is not reliable for this kind of performance testing.
>>>>>>>>> 
>>>>>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>>>>>> reproduce the issue.
>>>>>>>> 
>>>>>>>> Just to rule out this being any hardware related delays in processing
>>>>>>>> IO:
>>>>>>>> 
>>>>>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>>>>> that you can test?
>>>>>>> 
>>>>>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>>>>>> allocation.
>>>>>>> 
>>>>>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>>>>> 
>>>>>>> I have some debug code measuring time with ktime_get() in different
>>>>>>> places in the stack, and among other places, around blk_queue_enter(). I
>>>>>>> use them then to measure max latency and expose it through sysfs. I can
>>>>>>> see that the latency peak is recorded in the probe before
>>>>>>> blk_queue_enter() and not in the one after.
>>>>>>> 
>>>>>>> I also did an experiment, where the normal I/O path allocates the
>>>>>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>>>>>> read test fails since we reach:
>>>>>>> 	if (nowait)
>>>>>>> 	  return -EBUSY;
>>>>>>> 
>>>>>>> in blk_queue_enter.
>>>>>> 
>>>>>> OK, that's starting to make more sense, that indicates that there is indeed
>>>>>> something wrong with the refs. Does the below help?
>>>>> 
>>>>> No, that can't be right, it does look balanced to begin with.
>>>>> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
>>>>> we return with a request succesfully allocated, then we have an extra
>>>>> ref on it, which is dropped when it is later freed.
>>>> 
>>>> I agree, it seems more like a reference is put too late. I looked into
>>>> into the places where the reference is put, but it all seems normal. In
>>>> any case, I run it (just to see), and it did not help.
>>>> 
>>>>> Something smells fishy, I'll dig a bit.
>>>> 
>>>> Thanks! I continue looking into it myself; let me know if I can help
>>>> with something more specific.
>>> 
>>> What exact kernel are you running? And does the device have a scheduler
>>> attached, or is it set to "none"?
>> 
>> I can reproduce the issue on 4.11-rc7. I will rebase on top of your
>> for-4.12/block, but I cannot see any patches that might be related. If
>> it changes I'll ping you.
> 
> I don't suspect it will do anything for you. I just ask to know what
> base you are on.
> 
>> I mentioned the problem to Christoph last week and disabling the
>> schedulers was the first thing he recommended. I measured time around
>> blk_mq_sched_get_request and for this particular test the choose of
>> scheduler (including BFQ and kyber) does not seem to have an effect.
> 
> kyber vs none would be the interesting test. Some of the paths are a
> little different depending if there's a scheduler attached or not, so
> it's good to know that we're seeing this in both cases.
> 

I just tested on your for-4.12/block with none and kyber and the latency
spike appears in both cases.

Javier
Jens Axboe May 8, 2017, 4:06 p.m. UTC | #7
On 05/08/2017 09:49 AM, Javier González wrote:
>> On 8 May 2017, at 17.40, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 05/08/2017 09:38 AM, Javier González wrote:
>>>> On 8 May 2017, at 17.25, Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 05/08/2017 09:22 AM, Javier González wrote:
>>>>> Javier
>>>>>
>>>>>> On 8 May 2017, at 17.14, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 05/08/2017 09:08 AM, Jens Axboe wrote:
>>>>>>> On 05/08/2017 09:02 AM, Javier González wrote:
>>>>>>>>> On 8 May 2017, at 16.52, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>
>>>>>>>>> On 05/08/2017 08:46 AM, Javier González wrote:
>>>>>>>>>>> On 8 May 2017, at 16.23, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 05/08/2017 08:20 AM, Javier González wrote:
>>>>>>>>>>>>> On 8 May 2017, at 16.13, Jens Axboe <axboe@fb.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/08/2017 07:44 AM, Javier González wrote:
>>>>>>>>>>>>>>> On 8 May 2017, at 14.27, Ming Lei <ming.lei@redhat.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 08, 2017 at 01:54:58PM +0200, Javier González wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I find an unusual added latency(~20-30ms) on blk_queue_enter when
>>>>>>>>>>>>>>>> allocating a request directly from the NVMe driver through
>>>>>>>>>>>>>>>> nvme_alloc_request. I could use some help confirming that this is a bug
>>>>>>>>>>>>>>>> and not an expected side effect due to something else.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can reproduce this latency consistently on LightNVM when mixing I/O
>>>>>>>>>>>>>>>> from pblk and I/O sent through an ioctl using liblightnvm, but I don't
>>>>>>>>>>>>>>>> see anything on the LightNVM side that could impact the request
>>>>>>>>>>>>>>>> allocation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I have a 100% read workload sent from pblk, the max. latency is
>>>>>>>>>>>>>>>> constant throughout several runs at ~80us (which is normal for the media
>>>>>>>>>>>>>>>> we are using at bs=4k, qd=1). All pblk I/Os reach the nvme_nvm_submit_io
>>>>>>>>>>>>>>>> function on lightnvm.c., which uses nvme_alloc_request. When we send a
>>>>>>>>>>>>>>>> command from user space through an ioctl, then the max latency goes up
>>>>>>>>>>>>>>>> to ~20-30ms. This happens independently from the actual command
>>>>>>>>>>>>>>>> (IN/OUT). I tracked down the added latency down to the call
>>>>>>>>>>>>>>>> percpu_ref_tryget_live in blk_queue_enter. Seems that the queue
>>>>>>>>>>>>>>>> reference counter is not released as it should through blk_queue_exit in
>>>>>>>>>>>>>>>> blk_mq_alloc_request. For reference, all ioctl I/Os reach the
>>>>>>>>>>>>>>>> nvme_nvm_submit_user_cmd on lightnvm.c
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you have any idea about why this might happen? I can dig more into
>>>>>>>>>>>>>>>> it, but first I wanted to make sure that I am not missing any obvious
>>>>>>>>>>>>>>>> assumption, which would explain the reference counter to be held for a
>>>>>>>>>>>>>>>> longer time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You need to check if the .q_usage_counter is working at atomic mode.
>>>>>>>>>>>>>>> This counter is initialized as atomic mode, and finally switchs to
>>>>>>>>>>>>>>> percpu mode via percpu_ref_switch_to_percpu() in blk_register_queue().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for commenting Ming.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The .q_usage_counter is not working on atomic mode. The queue is
>>>>>>>>>>>>>> initialized normally through blk_register_queue() and the counter is
>>>>>>>>>>>>>> switched to percpu mode, as you mentioned. As I understand it, this is
>>>>>>>>>>>>>> how it should be, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is how it should be, yes. You're not running with any heavy
>>>>>>>>>>>>> debugging options, like lockdep or anything like that?
>>>>>>>>>>>>
>>>>>>>>>>>> No lockdep, KASAN, kmemleak or any of the other usual suspects.
>>>>>>>>>>>>
>>>>>>>>>>>> What's interesting is that it only happens when one of the I/Os comes
>>>>>>>>>>>> from user space through the ioctl. If I have several pblk instances on
>>>>>>>>>>>> the same device (which would end up allocating a new request in
>>>>>>>>>>>> parallel, potentially on the same core), the latency spike does not
>>>>>>>>>>>> trigger.
>>>>>>>>>>>>
>>>>>>>>>>>> I also tried to bind the read thread and the liblightnvm thread issuing
>>>>>>>>>>>> the ioctl to different cores, but it does not help...
>>>>>>>>>>>
>>>>>>>>>>> How do I reproduce this? Off the top of my head, and looking at the code,
>>>>>>>>>>> I have no idea what is going on here.
>>>>>>>>>>
>>>>>>>>>> Using LightNVM and liblightnvm [1] you can reproduce it by:
>>>>>>>>>>
>>>>>>>>>> 1. Instantiate a pblk instance on the first channel (luns 0 - 7):
>>>>>>>>>>     sudo nvme lnvm create -d nvme0n1 -n test0 -t pblk -b 0 -e 7 -f
>>>>>>>>>> 2. Write 5GB to the test0 block device with a normal fio script
>>>>>>>>>> 3. Read 5GB to verify that latencies are good (max. ~80-90us at bs=4k, qd=1)
>>>>>>>>>> 4. Re-run 3. and in parallel send a command through liblightnvm to a
>>>>>>>>>> different channel. A simple command is an erase (erase block 900 on
>>>>>>>>>> channel 2, lun 0):
>>>>>>>>>> 	sudo nvm_vblk line_erase /dev/nvme0n1 2 2 0 0 900
>>>>>>>>>>
>>>>>>>>>> After 4. you should see a ~25-30ms latency on the read workload.
>>>>>>>>>>
>>>>>>>>>> I tried to reproduce the ioctl in a more generic way to reach
>>>>>>>>>> __nvme_submit_user_cmd(), but SPDK steals the whole device. Also, qemu
>>>>>>>>>> is not reliable for this kind of performance testing.
>>>>>>>>>>
>>>>>>>>>> If you have a suggestion on how I can mix an ioctl with normal block I/O
>>>>>>>>>> read on a standard NVMe device, I'm happy to try it and see if I can
>>>>>>>>>> reproduce the issue.
>>>>>>>>>
>>>>>>>>> Just to rule out this being any hardware related delays in processing
>>>>>>>>> IO:
>>>>>>>>>
>>>>>>>>> 1) Does it reproduce with a simpler command, anything close to a no-op
>>>>>>>>> that you can test?
>>>>>>>>
>>>>>>>> Yes. I tried with a 4KB read and with a fake command I drop right after
>>>>>>>> allocation.
>>>>>>>>
>>>>>>>>> 2) What did you use to time the stall being blk_queue_enter()?
>>>>>>>>
>>>>>>>> I have some debug code measuring time with ktime_get() in different
>>>>>>>> places in the stack, and among other places, around blk_queue_enter(). I
>>>>>>>> use them then to measure max latency and expose it through sysfs. I can
>>>>>>>> see that the latency peak is recorded in the probe before
>>>>>>>> blk_queue_enter() and not in the one after.
>>>>>>>>
>>>>>>>> I also did an experiment, where the normal I/O path allocates the
>>>>>>>> request with BLK_MQ_REQ_NOWAIT. When running the experiment above, the
>>>>>>>> read test fails since we reach:
>>>>>>>> 	if (nowait)
>>>>>>>> 	  return -EBUSY;
>>>>>>>>
>>>>>>>> in blk_queue_enter.
>>>>>>>
>>>>>>> OK, that's starting to make more sense, that indicates that there is indeed
>>>>>>> something wrong with the refs. Does the below help?
>>>>>>
>>>>>> No, that can't be right, it does look balanced to begin with.
>>>>>> blk_mq_alloc_request() always grabs a queue ref, and always drops it. If
>>>>>> we return with a request succesfully allocated, then we have an extra
>>>>>> ref on it, which is dropped when it is later freed.
>>>>>
>>>>> I agree, it seems more like a reference is put too late. I looked into
>>>>> into the places where the reference is put, but it all seems normal. In
>>>>> any case, I run it (just to see), and it did not help.
>>>>>
>>>>>> Something smells fishy, I'll dig a bit.
>>>>>
>>>>> Thanks! I continue looking into it myself; let me know if I can help
>>>>> with something more specific.
>>>>
>>>> What exact kernel are you running? And does the device have a scheduler
>>>> attached, or is it set to "none"?
>>>
>>> I can reproduce the issue on 4.11-rc7. I will rebase on top of your
>>> for-4.12/block, but I cannot see any patches that might be related. If
>>> it changes I'll ping you.
>>
>> I don't suspect it will do anything for you. I just ask to know what
>> base you are on.
>>
>>> I mentioned the problem to Christoph last week and disabling the
>>> schedulers was the first thing he recommended. I measured time around
>>> blk_mq_sched_get_request and for this particular test the choose of
>>> scheduler (including BFQ and kyber) does not seem to have an effect.
>>
>> kyber vs none would be the interesting test. Some of the paths are a
>> little different depending if there's a scheduler attached or not, so
>> it's good to know that we're seeing this in both cases.
>>
> 
> I just tested on your for-4.12/block with none and kyber and the latency
> spike appears in both cases.

OK good. I looked at your reproduction case. Looks like we ultimately
end up submitting IO through nvme_nvm_submit_user_cmd() when you do the
nvm_vblk line_erase, which is basically the same code as
NVME_IOCTL_SUBMIT_IO as far as request alloc, setup, issue, and free go.
So does it reproduce for you as well on a normal NVMe device if you run
an nvme read /dev/nvme0 [...] while running the same read fio job?
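
For reference, a minimal userspace sketch of roughly what an nvme read
boils down to: a single read submitted through NVME_IOCTL_SUBMIT_IO,
which reaches __nvme_submit_user_cmd() in the driver. The device path,
starting LBA and 512-byte LBA size below are assumptions for
illustration; running it in a loop next to the bs=4k, qd=1 fio read job
mimics the mixed workload.

/* Minimal sketch; assumes /dev/nvme0n1 exists and uses 512-byte LBAs. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct nvme_user_io io;
	void *buf;
	int fd;

	/* nvme-cli opens the namespace read-only for reads. */
	fd = open("/dev/nvme0n1", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (posix_memalign(&buf, 4096, 4096)) {
		close(fd);
		return 1;
	}

	memset(&io, 0, sizeof(io));
	io.opcode = 0x02;                  /* nvme_cmd_read */
	io.addr = (uint64_t)(uintptr_t)buf;
	io.slba = 0;                       /* assumed starting LBA */
	io.nblocks = 7;                    /* 0-based: 8 x 512B = 4KB */

	/* Same request alloc/setup/issue/free path as nvme-cli's read. */
	if (ioctl(fd, NVME_IOCTL_SUBMIT_IO, &io) < 0)
		perror("NVME_IOCTL_SUBMIT_IO");

	free(buf);
	close(fd);
	return 0;
}
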
Javier González May 8, 2017, 4:39 p.m. UTC | #8
> On 8 May 2017, at 18.06, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 05/08/2017 09:49 AM, Javier González wrote:
>>> [...]
>> 
>> I just tested on your for-4.12/block with none and kyber and the latency
>> spike appears in both cases.
> 
> OK good. I looked at your reproduction case. Looks like we ultimately
> end up submitting IO through nvme_nvm_submit_user_cmd() when you do the
> nvm_vblk line_erase, which is basically the same code as
> NVME_IOCTL_SUBMIT_IO as far as request alloc, setup, issue, free goes.
> So does it reproduce for you as well on a normal nvme device, if you run
> a nvme read /dev/nvme0 [...] while running the same read fio job?

Ok. I'll try that.

Thanks!

Javier
Javier González May 9, 2017, 10:34 a.m. UTC | #9
> On 8 May 2017, at 18.39, Javier González <jg@lightnvm.io> wrote:
> 
>> On 8 May 2017, at 18.06, Jens Axboe <axboe@kernel.dk> wrote:
>> 
>> [...]
>> 
>> OK good. I looked at your reproduction case. Looks like we ultimately
>> end up submitting IO through nvme_nvm_submit_user_cmd() when you do the
>> nvm_vblk line_erase, which is basically the same code as
>> NVME_IOCTL_SUBMIT_IO as far as request alloc, setup, issue, free goes.
>> So does it reproduce for you as well on a normal nvme device, if you run
>> a nvme read /dev/nvme0 [...] while running the same read fio job?
> 
> Ok. I'll try that.

I cannot reproduce the latency on a normal NVMe drive when mixing I/O
from a fio job and ioctls.

The path is different from the one in pblk, since normal block I/O
goes through generic_make_request(), but both still need to pass
blk_queue_enter(), allocate a request, etc. The only "major" difference
I see is that normal block I/O requests come from get_request()
(which, as far as I understand, takes pre-allocated requests from the
queue's request list), while pblk allocates each request via
nvme_alloc_request().

What puzzles me most is that having different pblk instances issuing
I/O in parallel does not trigger the long tail. Otherwise, I would think
that we are just unlucky and get scheduled out. Still, 20ms...

BTW, in order to rule out NUMA, I tried on a single-socket machine, and
the result is the same.

Javier.
Ming Lei May 9, 2017, 10:58 a.m. UTC | #10
On Tue, May 09, 2017 at 12:34:42PM +0200, Javier González wrote:
> > [...]
> 
> I cannot reproduce the latency on a normal nvme drive when mixing I/O
> from a fio job and ioctls.
> 
> The path is different from the one in pblk, since normal block I/O
> uses the generic_make_request(), but still, they both need to
> blk_queue_enter(), allocate a request, etc. They only "major" difference
> I see is that normal block I/O requests are given by get_request()
> (which as far as I understand takes pre-allocated requests from the
> queue request list), while pblk allocates each request via
> nvme_alloc_request().
> 
> What puzzles me most is that having different pblk instances, issuing
> I/O in parallel does not trigger the long tail. Otherwise, I would think
> that we are just unlucky and get scheduled out. Still, 20ms...
> 
> BTW, in order to discard NUMA, I tried on a single socket machine, and
> same, same.

I suspect the .q_usage_counter is DEAD, and you can check that via
percpu_ref_is_dying(), or just check whether the slow path is reached.

The fast path is percpu_ref_tryget_live() returning directly; the slow
path is reached only if the queue is frozen or dead.

If that is the case, you can add a dump_stack() in blk_freeze_queue_start()
to see where the unusual freezing/unfreezing comes from.

Thanks,
Ming
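
For reference, a hand-written sketch of the kind of check Ming describes
(the helper name and print format are assumptions; percpu_ref_tryget_live(),
percpu_ref_is_dying(), q->q_usage_counter and dump_stack() are the real
interfaces involved). Called just before the request allocation being
timed, it reports when the slow path is about to be taken; a bare
dump_stack() at the top of blk_freeze_queue_start() then shows who
started the freeze.

/* Debug-only sketch, not a patch from this thread. */
#include <linux/blkdev.h>
#include <linux/percpu-refcount.h>
#include <linux/printk.h>

static void debug_q_usage_counter(struct request_queue *q)
{
	/* Fast path: percpu_ref_tryget_live() succeeds immediately. */
	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
		percpu_ref_put(&q->q_usage_counter);
		return;
	}

	/* Slow path: the queue is frozen or dying; this is where a
	 * 20-30ms wait inside blk_queue_enter() would come from. */
	pr_info("%s: slow path, dying=%d\n", __func__,
		percpu_ref_is_dying(&q->q_usage_counter));
	dump_stack();
}
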
Javier González May 9, 2017, 11:21 a.m. UTC | #11
> On 9 May 2017, at 12.58, Ming Lei <ming.lei@redhat.com> wrote:
> 
> [...]
> 
> I suspect the .q_usage_counter is DEAD, and you can check it via
> percpu_ref_is_dying(), or just check if slow path is reached.
> 
> The fast path is that percpu_ref_tryget_live() returns directly,
> and slow path is reached only if queue is freezed or dead.
> 
> If that is true, you can add a dump_stack() in blk_freeze_queue_start()
> to see where the unusual freezing/unfreezing is from.


Thanks for the hint, Ming! You are right. We somehow trigger a partition
re-read:

[  324.010184]  dump_stack+0x63/0x90
[  324.010191]  blk_freeze_queue_start+0x1e/0x50
[  324.010194]  blk_mq_freeze_queue+0x12/0x20
[  324.010199]  __nvme_revalidate_disk+0xa4/0x350
[  324.010203]  nvme_revalidate_disk+0x53/0x90
[  324.010206]  rescan_partitions+0x8d/0x380
[  324.010211]  ? tlb_flush_mmu_free+0x36/0x60
[  324.010218]  ? security_capable+0x48/0x60
[  324.010221]  __blkdev_reread_part+0x65/0x70
[  324.010223]  blkdev_reread_part+0x23/0x40
[  324.010225]  blkdev_ioctl+0x387/0x910
[  324.010229]  ? locks_insert_lock_ctx+0x7e/0xd0
[  324.010235]  block_ioctl+0x3d/0x50
[  324.010239]  do_vfs_ioctl+0xa1/0x5d0
[  324.010242]  ? locks_lock_inode_wait+0x51/0x150
[  324.010247]  ? kmem_cache_alloc+0xd7/0x1b0
[  324.010249]  ? locks_alloc_lock+0x1b/0x70
[  324.010252]  SyS_ioctl+0x79/0x90
[  324.010254]  ? SyS_flock+0x11c/0x180
[  324.010260]  entry_SYSCALL_64_fastpath+0x1e/0xad

I'm checking why this happens now...

Javier
Javier González May 9, 2017, 2:21 p.m. UTC | #12
> On 9 May 2017, at 13.21, Javier González <jg@lightnvm.io> wrote:
> 
>> On 9 May 2017, at 12.58, Ming Lei <ming.lei@redhat.com> wrote:
>> 
>> [...]
>> 
>> I suspect the .q_usage_counter is DEAD, and you can check it via
>> percpu_ref_is_dying(), or just check if slow path is reached.
>> 
>> The fast path is that percpu_ref_tryget_live() returns directly,
>> and slow path is reached only if queue is freezed or dead.
>> 
>> If that is true, you can add a dump_stack() in blk_freeze_queue_start()
>> to see where the unusual freezing/unfreezing is from.
> 
> 
> Thanks for the hint Ming! You are right. We somehow trigger a re-read
> partition:
> 
> [  324.010184]  dump_stack+0x63/0x90
> [  324.010191]  blk_freeze_queue_start+0x1e/0x50
> [  324.010194]  blk_mq_freeze_queue+0x12/0x20
> [  324.010199]  __nvme_revalidate_disk+0xa4/0x350
> [  324.010203]  nvme_revalidate_disk+0x53/0x90
> [  324.010206]  rescan_partitions+0x8d/0x380
> [  324.010211]  ? tlb_flush_mmu_free+0x36/0x60
> [  324.010218]  ? security_capable+0x48/0x60
> [  324.010221]  __blkdev_reread_part+0x65/0x70
> [  324.010223]  blkdev_reread_part+0x23/0x40
> [  324.010225]  blkdev_ioctl+0x387/0x910
> [  324.010229]  ? locks_insert_lock_ctx+0x7e/0xd0
> [  324.010235]  block_ioctl+0x3d/0x50
> [  324.010239]  do_vfs_ioctl+0xa1/0x5d0
> [  324.010242]  ? locks_lock_inode_wait+0x51/0x150
> [  324.010247]  ? kmem_cache_alloc+0xd7/0x1b0
> [  324.010249]  ? locks_alloc_lock+0x1b/0x70
> [  324.010252]  SyS_ioctl+0x79/0x90
> [  324.010254]  ? SyS_flock+0x11c/0x180
> [  324.010260]  entry_SYSCALL_64_fastpath+0x1e/0xad
> 
> I'm checking why this happens now...

So apparently we opened a file descriptor with O_RDWR for sending ioctls
in liblightnvm. Opening for writing triggered a partition re-read (the
blkdev_reread_part() path in the trace above), which ultimately reached
blk_mq_freeze_queue() and accounted for the latencies we observed. For
reference, nvme-cli opens the fd read-only, which is why we could not
reproduce the issue with nvme read.

Thanks Ming and Jens for looking into this and giving good advice. Much
appreciated!

Javier
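
For completeness, a sketch of the user-space takeaway (the function name
is hypothetical and the actual liblightnvm change may look different):
open the device read-only when the descriptor is only used for
ioctl-based command submission, so the partition re-read (and the queue
freeze that comes with it) is never triggered.

/* Sketch only; the real liblightnvm fix may differ. */
#include <fcntl.h>

int nvm_dev_open_readonly(const char *path)
{
	/*
	 * O_RDWR is what led to the partition re-read (and queue freeze)
	 * described above. Ioctl-based command submission does not need
	 * a writable descriptor; nvme-cli also opens read-only.
	 */
	return open(path, O_RDONLY);
}
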
diff mbox

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5d4ce7eb8dbf..df5ee82d28f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -292,10 +292,11 @@  struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 	rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
 
 	blk_mq_put_ctx(alloc_data.ctx);
-	blk_queue_exit(q);
 
-	if (!rq)
+	if (!rq) {
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
+	}
 
 	rq->__data_len = 0;
 	rq->__sector = (sector_t) -1;