mbox series

[RFC,0/3] cpuidle: add poll_source API and virtio vq polling

Message ID 20210713161906.457857-1-stefanha@redhat.com (mailing list archive)
Headers show
Series cpuidle: add poll_source API and virtio vq polling | expand

Message

Stefan Hajnoczi July 13, 2021, 4:19 p.m. UTC
These patches are not polished yet but I would like request feedback on this
approach and share performance results with you.

Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
haltpoll driver is enabled inside a virtual machine. This reduces wakeup
latency for events that occur soon after the vCPU becomes idle.

This patch series extends the cpuidle busy wait loop with the new poll_source
API so drivers can participate in polling. Such polling-aware drivers disable
their device's irq during the busy wait loop to avoid the cost of interrupts.
This reduces latency further than regular cpuidle haltpoll, which still relies
on irqs.

Virtio drivers are modified to use the poll_source API so all virtio device
types get this feature. The following virtio-blk fio benchmark results show the
improvement:

             IOPS (numjobs=4, iodepth=1, 4 virtqueues)
               before   poll_source      io_poll
4k randread    167102  186049 (+11%)  186654 (+11%)
4k randwrite   162204  181214 (+11%)  181850 (+12%)
4k randrw      159520  177071 (+11%)  177928 (+11%)

The comparison against io_poll shows that cpuidle poll_source achieves
equivalent performance to the block layer's io_poll feature (which I
implemented in a separate patch series [1]).

The advantage of poll_source is that applications do not need to explicitly set
the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
would have spent in cpuidle haltpoll anyway.

The current series does not improve virtio-net. I haven't investigated deeply,
but it is possible that NAPI and poll_source do not combine. See the final
patch for a starting point on making the two work together.

I have not tried this on bare metal but it might help there too. The cost of
disabling a device's irq must be less than the savings from avoiding irq
handling for this optimization to make sense.

[1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/

Stefan Hajnoczi (3):
  cpuidle: add poll_source API
  virtio: add poll_source virtqueue polling
  softirq: participate in cpuidle polling

 drivers/cpuidle/Makefile           |   1 +
 drivers/virtio/virtio_pci_common.h |   7 ++
 include/linux/interrupt.h          |   2 +
 include/linux/poll_source.h        |  53 +++++++++++++++
 include/linux/virtio.h             |   2 +
 include/linux/virtio_config.h      |   2 +
 drivers/cpuidle/poll_source.c      | 102 +++++++++++++++++++++++++++++
 drivers/cpuidle/poll_state.c       |   6 ++
 drivers/virtio/virtio.c            |  34 ++++++++++
 drivers/virtio/virtio_pci_common.c |  86 ++++++++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c |   2 +
 kernel/softirq.c                   |  14 ++++
 12 files changed, 311 insertions(+)
 create mode 100644 include/linux/poll_source.h
 create mode 100644 drivers/cpuidle/poll_source.c

Comments

Jason Wang July 21, 2021, 3:29 a.m. UTC | #1
在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> These patches are not polished yet but I would like request feedback on this
> approach and share performance results with you.
>
> Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> latency for events that occur soon after the vCPU becomes idle.
>
> This patch series extends the cpuidle busy wait loop with the new poll_source
> API so drivers can participate in polling. Such polling-aware drivers disable
> their device's irq during the busy wait loop to avoid the cost of interrupts.
> This reduces latency further than regular cpuidle haltpoll, which still relies
> on irqs.
>
> Virtio drivers are modified to use the poll_source API so all virtio device
> types get this feature. The following virtio-blk fio benchmark results show the
> improvement:
>
>               IOPS (numjobs=4, iodepth=1, 4 virtqueues)
>                 before   poll_source      io_poll
> 4k randread    167102  186049 (+11%)  186654 (+11%)
> 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> 4k randrw      159520  177071 (+11%)  177928 (+11%)
>
> The comparison against io_poll shows that cpuidle poll_source achieves
> equivalent performance to the block layer's io_poll feature (which I
> implemented in a separate patch series [1]).
>
> The advantage of poll_source is that applications do not need to explicitly set
> the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> would have spent in cpuidle haltpoll anyway.
>
> The current series does not improve virtio-net. I haven't investigated deeply,
> but it is possible that NAPI and poll_source do not combine. See the final
> patch for a starting point on making the two work together.
>
> I have not tried this on bare metal but it might help there too. The cost of
> disabling a device's irq must be less than the savings from avoiding irq
> handling for this optimization to make sense.
>
> [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/


Hi Stefan:

Some questions:

1) What's the advantages of introducing polling at virtio level instead 
of doing it at each subsystems? Polling in virtio level may only work 
well if all (or most) of the devices are virtio
2) What's the advantages of using cpuidle instead of using a thread (and 
leverage the scheduler)?
3) Any reason it's virtio_pci specific not a general virtio one?

Thanks

(Btw, do we need to cc scheduler guys?)


>
> Stefan Hajnoczi (3):
>    cpuidle: add poll_source API
>    virtio: add poll_source virtqueue polling
>    softirq: participate in cpuidle polling
>
>   drivers/cpuidle/Makefile           |   1 +
>   drivers/virtio/virtio_pci_common.h |   7 ++
>   include/linux/interrupt.h          |   2 +
>   include/linux/poll_source.h        |  53 +++++++++++++++
>   include/linux/virtio.h             |   2 +
>   include/linux/virtio_config.h      |   2 +
>   drivers/cpuidle/poll_source.c      | 102 +++++++++++++++++++++++++++++
>   drivers/cpuidle/poll_state.c       |   6 ++
>   drivers/virtio/virtio.c            |  34 ++++++++++
>   drivers/virtio/virtio_pci_common.c |  86 ++++++++++++++++++++++++
>   drivers/virtio/virtio_pci_modern.c |   2 +
>   kernel/softirq.c                   |  14 ++++
>   12 files changed, 311 insertions(+)
>   create mode 100644 include/linux/poll_source.h
>   create mode 100644 drivers/cpuidle/poll_source.c
>
Stefan Hajnoczi July 21, 2021, 9:41 a.m. UTC | #2
On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> 
> 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > These patches are not polished yet but I would like request feedback on this
> > approach and share performance results with you.
> > 
> > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > latency for events that occur soon after the vCPU becomes idle.
> > 
> > This patch series extends the cpuidle busy wait loop with the new poll_source
> > API so drivers can participate in polling. Such polling-aware drivers disable
> > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > This reduces latency further than regular cpuidle haltpoll, which still relies
> > on irqs.
> > 
> > Virtio drivers are modified to use the poll_source API so all virtio device
> > types get this feature. The following virtio-blk fio benchmark results show the
> > improvement:
> > 
> >               IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> >                 before   poll_source      io_poll
> > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > 
> > The comparison against io_poll shows that cpuidle poll_source achieves
> > equivalent performance to the block layer's io_poll feature (which I
> > implemented in a separate patch series [1]).
> > 
> > The advantage of poll_source is that applications do not need to explicitly set
> > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > would have spent in cpuidle haltpoll anyway.
> > 
> > The current series does not improve virtio-net. I haven't investigated deeply,
> > but it is possible that NAPI and poll_source do not combine. See the final
> > patch for a starting point on making the two work together.
> > 
> > I have not tried this on bare metal but it might help there too. The cost of
> > disabling a device's irq must be less than the savings from avoiding irq
> > handling for this optimization to make sense.
> > 
> > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> 
> 
> Hi Stefan:
> 
> Some questions:
> 
> 1) What's the advantages of introducing polling at virtio level instead of
> doing it at each subsystems? Polling in virtio level may only work well if
> all (or most) of the devices are virtio

I'm not sure I understand the question. cpuidle haltpoll benefits all
devices today, except it incurs interrupt latency. The poll_source API
eliminates the interrupt latency for drivers that can disable device
interrupts cheaply.

This patch adds poll_source to core virtio code so that all virtio
drivers get this feature for free. No driver-specific changes are
needed.

If you mean networking, block layer, etc by "subsystems" then there's
nothing those subsystems can do to help. Whether poll_source can be used
depends on the specific driver, not the subsystem. If you consider
drivers/virtio/ a subsystem, then that's exactly what the patch series
is doing.

> 2) What's the advantages of using cpuidle instead of using a thread (and
> leverage the scheduler)?

In order to combine with the existing cpuidle infrastructure. No new
polling loop is introduced and no additional CPU cycles are spent on
polling.

If cpuidle itself is converted to threads then poll_source would
automatically operate in a thread too, but this patch series doesn't
change how the core cpuidle code works.

> 3) Any reason it's virtio_pci specific not a general virtio one?

Good idea, it is possible to move the virtio_pci changes into virtio.c.

Other transports can't use this feature yet though. Only virtio_pci
supports vq irq affinity. But the code can be generic and if other
transports ever support vq irq affinity they'll get it for free.

> (Btw, do we need to cc scheduler guys?)

I'm not sure. This patch series doesn't change how cpuidle interacts
with the scheduler. The cpuidle maintainers can pull in more people, if
necessary.

Stefan
Jason Wang July 22, 2021, 9:04 a.m. UTC | #3
在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
>> 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
>>> These patches are not polished yet but I would like request feedback on this
>>> approach and share performance results with you.
>>>
>>> Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
>>> haltpoll driver is enabled inside a virtual machine. This reduces wakeup
>>> latency for events that occur soon after the vCPU becomes idle.
>>>
>>> This patch series extends the cpuidle busy wait loop with the new poll_source
>>> API so drivers can participate in polling. Such polling-aware drivers disable
>>> their device's irq during the busy wait loop to avoid the cost of interrupts.
>>> This reduces latency further than regular cpuidle haltpoll, which still relies
>>> on irqs.
>>>
>>> Virtio drivers are modified to use the poll_source API so all virtio device
>>> types get this feature. The following virtio-blk fio benchmark results show the
>>> improvement:
>>>
>>>                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
>>>                  before   poll_source      io_poll
>>> 4k randread    167102  186049 (+11%)  186654 (+11%)
>>> 4k randwrite   162204  181214 (+11%)  181850 (+12%)
>>> 4k randrw      159520  177071 (+11%)  177928 (+11%)
>>>
>>> The comparison against io_poll shows that cpuidle poll_source achieves
>>> equivalent performance to the block layer's io_poll feature (which I
>>> implemented in a separate patch series [1]).
>>>
>>> The advantage of poll_source is that applications do not need to explicitly set
>>> the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
>>> few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
>>> would have spent in cpuidle haltpoll anyway.
>>>
>>> The current series does not improve virtio-net. I haven't investigated deeply,
>>> but it is possible that NAPI and poll_source do not combine. See the final
>>> patch for a starting point on making the two work together.
>>>
>>> I have not tried this on bare metal but it might help there too. The cost of
>>> disabling a device's irq must be less than the savings from avoiding irq
>>> handling for this optimization to make sense.
>>>
>>> [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
>>
>> Hi Stefan:
>>
>> Some questions:
>>
>> 1) What's the advantages of introducing polling at virtio level instead of
>> doing it at each subsystems? Polling in virtio level may only work well if
>> all (or most) of the devices are virtio
> I'm not sure I understand the question. cpuidle haltpoll benefits all
> devices today, except it incurs interrupt latency. The poll_source API
> eliminates the interrupt latency for drivers that can disable device
> interrupts cheaply.
>
> This patch adds poll_source to core virtio code so that all virtio
> drivers get this feature for free. No driver-specific changes are
> needed.
>
> If you mean networking, block layer, etc by "subsystems" then there's
> nothing those subsystems can do to help. Whether poll_source can be used
> depends on the specific driver, not the subsystem. If you consider
> drivers/virtio/ a subsystem, then that's exactly what the patch series
> is doing.


I meant, if we choose to use idle poll, we have some several choices:

1) bus level (e.g the virtio)
2) subsystem level (e.g the networking and block)

I'm not sure which one is better.


>
>> 2) What's the advantages of using cpuidle instead of using a thread (and
>> leverage the scheduler)?
> In order to combine with the existing cpuidle infrastructure. No new
> polling loop is introduced and no additional CPU cycles are spent on
> polling.
>
> If cpuidle itself is converted to threads then poll_source would
> automatically operate in a thread too, but this patch series doesn't
> change how the core cpuidle code works.


So networking subsystem can use NAPI busy polling in the process context 
which means it can be leveraged by the scheduler.

I'm not sure it's a good idea to poll drivers for a specific bus in the 
general cpu idle layer.

Another questions, are those numbers measured by APICV capable machine?

Virtio-net turns on the tx interrupts since 2 years ago. And we don't 
see too much difference when measured with a APICV host.


>
>> 3) Any reason it's virtio_pci specific not a general virtio one?
> Good idea, it is possible to move the virtio_pci changes into virtio.c.
>
> Other transports can't use this feature yet though. Only virtio_pci
> supports vq irq affinity. But the code can be generic and if other
> transports ever support vq irq affinity they'll get it for free.


Yes.

Thanks


>
>> (Btw, do we need to cc scheduler guys?)
> I'm not sure. This patch series doesn't change how cpuidle interacts
> with the scheduler. The cpuidle maintainers can pull in more people, if
> necessary.
>
> Stefan
Stefan Hajnoczi July 26, 2021, 3:17 p.m. UTC | #4
On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> 
> 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > These patches are not polished yet but I would like request feedback on this
> > > > approach and share performance results with you.
> > > > 
> > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > latency for events that occur soon after the vCPU becomes idle.
> > > > 
> > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > on irqs.
> > > > 
> > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > improvement:
> > > > 
> > > >                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > >                  before   poll_source      io_poll
> > > > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > > > 
> > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > equivalent performance to the block layer's io_poll feature (which I
> > > > implemented in a separate patch series [1]).
> > > > 
> > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > would have spent in cpuidle haltpoll anyway.
> > > > 
> > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > patch for a starting point on making the two work together.
> > > > 
> > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > handling for this optimization to make sense.
> > > > 
> > > > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> > > 
> > > Hi Stefan:
> > > 
> > > Some questions:
> > > 
> > > 1) What's the advantages of introducing polling at virtio level instead of
> > > doing it at each subsystems? Polling in virtio level may only work well if
> > > all (or most) of the devices are virtio
> > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > devices today, except it incurs interrupt latency. The poll_source API
> > eliminates the interrupt latency for drivers that can disable device
> > interrupts cheaply.
> > 
> > This patch adds poll_source to core virtio code so that all virtio
> > drivers get this feature for free. No driver-specific changes are
> > needed.
> > 
> > If you mean networking, block layer, etc by "subsystems" then there's
> > nothing those subsystems can do to help. Whether poll_source can be used
> > depends on the specific driver, not the subsystem. If you consider
> > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > is doing.
> 
> 
> I meant, if we choose to use idle poll, we have some several choices:
> 
> 1) bus level (e.g the virtio)
> 2) subsystem level (e.g the networking and block)
> 
> I'm not sure which one is better.

This API is intended to be driver- or bus-level. I don't think
subsystems can do very much since they don't know the hardware
capabilities (cheap interrupt disabling) and in most cases there's no
advantage of plumbing it through subsystems when drivers can call the
API directly.

> > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > leverage the scheduler)?
> > In order to combine with the existing cpuidle infrastructure. No new
> > polling loop is introduced and no additional CPU cycles are spent on
> > polling.
> > 
> > If cpuidle itself is converted to threads then poll_source would
> > automatically operate in a thread too, but this patch series doesn't
> > change how the core cpuidle code works.
> 
> 
> So networking subsystem can use NAPI busy polling in the process context
> which means it can be leveraged by the scheduler.
> 
> I'm not sure it's a good idea to poll drivers for a specific bus in the
> general cpu idle layer.

Why? Maybe because the cpuidle execution environment is a little special?

> Another questions, are those numbers measured by APICV capable machine?

Yes.

> Virtio-net turns on the tx interrupts since 2 years ago. And we don't see
> too much difference when measured with a APICV host.

My understand is NAPI always takes the first interrupt. Polling only
happens on subsequent rounds until there's no more work to do.

There seem to be multiple factors that would influence tx performance
like how full the tx queues are, whether more packets are sent during
NAPI polling, whether you're benchmarking a physical PCIe NIC or a
vhost_net software device, etc.

Regarding APICV and software devices, the benchmark results I posted
show that avoiding the interrupt injection helps even with APICV.

Stefan
Rafael J. Wysocki July 26, 2021, 3:47 p.m. UTC | #5
On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> >
> > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > These patches are not polished yet but I would like request feedback on this
> > > > > approach and share performance results with you.
> > > > >
> > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > >
> > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > on irqs.
> > > > >
> > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > improvement:
> > > > >
> > > > >                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > >                  before   poll_source      io_poll
> > > > > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > > > >
> > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > implemented in a separate patch series [1]).
> > > > >
> > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > would have spent in cpuidle haltpoll anyway.
> > > > >
> > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > patch for a starting point on making the two work together.
> > > > >
> > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > handling for this optimization to make sense.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> > > >
> > > > Hi Stefan:
> > > >
> > > > Some questions:
> > > >
> > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > all (or most) of the devices are virtio
> > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > devices today, except it incurs interrupt latency. The poll_source API
> > > eliminates the interrupt latency for drivers that can disable device
> > > interrupts cheaply.
> > >
> > > This patch adds poll_source to core virtio code so that all virtio
> > > drivers get this feature for free. No driver-specific changes are
> > > needed.
> > >
> > > If you mean networking, block layer, etc by "subsystems" then there's
> > > nothing those subsystems can do to help. Whether poll_source can be used
> > > depends on the specific driver, not the subsystem. If you consider
> > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > is doing.
> >
> >
> > I meant, if we choose to use idle poll, we have some several choices:
> >
> > 1) bus level (e.g the virtio)
> > 2) subsystem level (e.g the networking and block)
> >
> > I'm not sure which one is better.
>
> This API is intended to be driver- or bus-level. I don't think
> subsystems can do very much since they don't know the hardware
> capabilities (cheap interrupt disabling) and in most cases there's no
> advantage of plumbing it through subsystems when drivers can call the
> API directly.
>
> > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > leverage the scheduler)?
> > > In order to combine with the existing cpuidle infrastructure. No new
> > > polling loop is introduced and no additional CPU cycles are spent on
> > > polling.
> > >
> > > If cpuidle itself is converted to threads then poll_source would
> > > automatically operate in a thread too, but this patch series doesn't
> > > change how the core cpuidle code works.
> >
> >
> > So networking subsystem can use NAPI busy polling in the process context
> > which means it can be leveraged by the scheduler.
> >
> > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > general cpu idle layer.
>
> Why? Maybe because the cpuidle execution environment is a little special?

Well, this would be prone to abuse.

The time spent in that driver callback counts as CPU idle time while
it really is the driver running and there is not limit on how much
time the callback can take, while doing costly things in the idle loop
is generally avoided, because on wakeup the CPU needs to be available
to the task needing it as soon as possible.  IOW, the callback
potentially add unbounded latency to the CPU wakeup path.
Stefan Hajnoczi July 26, 2021, 4:01 p.m. UTC | #6
On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > >
> > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > approach and share performance results with you.
> > > > > >
> > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > >
> > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > on irqs.
> > > > > >
> > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > improvement:
> > > > > >
> > > > > >                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > >                  before   poll_source      io_poll
> > > > > > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > > > > >
> > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > implemented in a separate patch series [1]).
> > > > > >
> > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > >
> > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > patch for a starting point on making the two work together.
> > > > > >
> > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > handling for this optimization to make sense.
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> > > > >
> > > > > Hi Stefan:
> > > > >
> > > > > Some questions:
> > > > >
> > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > all (or most) of the devices are virtio
> > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > eliminates the interrupt latency for drivers that can disable device
> > > > interrupts cheaply.
> > > >
> > > > This patch adds poll_source to core virtio code so that all virtio
> > > > drivers get this feature for free. No driver-specific changes are
> > > > needed.
> > > >
> > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > depends on the specific driver, not the subsystem. If you consider
> > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > is doing.
> > >
> > >
> > > I meant, if we choose to use idle poll, we have some several choices:
> > >
> > > 1) bus level (e.g the virtio)
> > > 2) subsystem level (e.g the networking and block)
> > >
> > > I'm not sure which one is better.
> >
> > This API is intended to be driver- or bus-level. I don't think
> > subsystems can do very much since they don't know the hardware
> > capabilities (cheap interrupt disabling) and in most cases there's no
> > advantage of plumbing it through subsystems when drivers can call the
> > API directly.
> >
> > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > leverage the scheduler)?
> > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > polling.
> > > >
> > > > If cpuidle itself is converted to threads then poll_source would
> > > > automatically operate in a thread too, but this patch series doesn't
> > > > change how the core cpuidle code works.
> > >
> > >
> > > So networking subsystem can use NAPI busy polling in the process context
> > > which means it can be leveraged by the scheduler.
> > >
> > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > general cpu idle layer.
> >
> > Why? Maybe because the cpuidle execution environment is a little special?
> 
> Well, this would be prone to abuse.
> 
> The time spent in that driver callback counts as CPU idle time while
> it really is the driver running and there is not limit on how much
> time the callback can take, while doing costly things in the idle loop
> is generally avoided, because on wakeup the CPU needs to be available
> to the task needing it as soon as possible.  IOW, the callback
> potentially add unbounded latency to the CPU wakeup path.

How is this different from driver interrupt handlers running during
cpuidle?

Stefan
Rafael J. Wysocki July 26, 2021, 4:37 p.m. UTC | #7
On Mon, Jul 26, 2021 at 6:04 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> > On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > > >
> > > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > > approach and share performance results with you.
> > > > > > >
> > > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > > >
> > > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > > on irqs.
> > > > > > >
> > > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > > improvement:
> > > > > > >
> > > > > > >                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > >                  before   poll_source      io_poll
> > > > > > > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > > > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > > > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > > > > > >
> > > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > > implemented in a separate patch series [1]).
> > > > > > >
> > > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > > >
> > > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > > patch for a starting point on making the two work together.
> > > > > > >
> > > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > > handling for this optimization to make sense.
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> > > > > >
> > > > > > Hi Stefan:
> > > > > >
> > > > > > Some questions:
> > > > > >
> > > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > > all (or most) of the devices are virtio
> > > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > > eliminates the interrupt latency for drivers that can disable device
> > > > > interrupts cheaply.
> > > > >
> > > > > This patch adds poll_source to core virtio code so that all virtio
> > > > > drivers get this feature for free. No driver-specific changes are
> > > > > needed.
> > > > >
> > > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > > depends on the specific driver, not the subsystem. If you consider
> > > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > > is doing.
> > > >
> > > >
> > > > I meant, if we choose to use idle poll, we have some several choices:
> > > >
> > > > 1) bus level (e.g the virtio)
> > > > 2) subsystem level (e.g the networking and block)
> > > >
> > > > I'm not sure which one is better.
> > >
> > > This API is intended to be driver- or bus-level. I don't think
> > > subsystems can do very much since they don't know the hardware
> > > capabilities (cheap interrupt disabling) and in most cases there's no
> > > advantage of plumbing it through subsystems when drivers can call the
> > > API directly.
> > >
> > > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > > leverage the scheduler)?
> > > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > > polling.
> > > > >
> > > > > If cpuidle itself is converted to threads then poll_source would
> > > > > automatically operate in a thread too, but this patch series doesn't
> > > > > change how the core cpuidle code works.
> > > >
> > > >
> > > > So networking subsystem can use NAPI busy polling in the process context
> > > > which means it can be leveraged by the scheduler.
> > > >
> > > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > > general cpu idle layer.
> > >
> > > Why? Maybe because the cpuidle execution environment is a little special?
> >
> > Well, this would be prone to abuse.
> >
> > The time spent in that driver callback counts as CPU idle time while
> > it really is the driver running and there is not limit on how much
> > time the callback can take, while doing costly things in the idle loop
> > is generally avoided, because on wakeup the CPU needs to be available
> > to the task needing it as soon as possible.  IOW, the callback
> > potentially add unbounded latency to the CPU wakeup path.
>
> How is this different from driver interrupt handlers running during
> cpuidle?

The time spent on handling interrupts does not count as CPU idle time.
Stefan Hajnoczi July 27, 2021, 1:32 p.m. UTC | #8
On Mon, Jul 26, 2021 at 06:37:13PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 26, 2021 at 6:04 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Mon, Jul 26, 2021 at 05:47:19PM +0200, Rafael J. Wysocki wrote:
> > > On Mon, Jul 26, 2021 at 5:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Jul 22, 2021 at 05:04:57PM +0800, Jason Wang wrote:
> > > > >
> > > > > 在 2021/7/21 下午5:41, Stefan Hajnoczi 写道:
> > > > > > On Wed, Jul 21, 2021 at 11:29:55AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/14 上午12:19, Stefan Hajnoczi 写道:
> > > > > > > > These patches are not polished yet but I would like request feedback on this
> > > > > > > > approach and share performance results with you.
> > > > > > > >
> > > > > > > > Idle CPUs tentatively enter a busy wait loop before halting when the cpuidle
> > > > > > > > haltpoll driver is enabled inside a virtual machine. This reduces wakeup
> > > > > > > > latency for events that occur soon after the vCPU becomes idle.
> > > > > > > >
> > > > > > > > This patch series extends the cpuidle busy wait loop with the new poll_source
> > > > > > > > API so drivers can participate in polling. Such polling-aware drivers disable
> > > > > > > > their device's irq during the busy wait loop to avoid the cost of interrupts.
> > > > > > > > This reduces latency further than regular cpuidle haltpoll, which still relies
> > > > > > > > on irqs.
> > > > > > > >
> > > > > > > > Virtio drivers are modified to use the poll_source API so all virtio device
> > > > > > > > types get this feature. The following virtio-blk fio benchmark results show the
> > > > > > > > improvement:
> > > > > > > >
> > > > > > > >                IOPS (numjobs=4, iodepth=1, 4 virtqueues)
> > > > > > > >                  before   poll_source      io_poll
> > > > > > > > 4k randread    167102  186049 (+11%)  186654 (+11%)
> > > > > > > > 4k randwrite   162204  181214 (+11%)  181850 (+12%)
> > > > > > > > 4k randrw      159520  177071 (+11%)  177928 (+11%)
> > > > > > > >
> > > > > > > > The comparison against io_poll shows that cpuidle poll_source achieves
> > > > > > > > equivalent performance to the block layer's io_poll feature (which I
> > > > > > > > implemented in a separate patch series [1]).
> > > > > > > >
> > > > > > > > The advantage of poll_source is that applications do not need to explicitly set
> > > > > > > > the RWF_HIPRI I/O request flag. The poll_source approach is attractive because
> > > > > > > > few applications actually use RWF_HIPRI and it takes advantage of CPU cycles we
> > > > > > > > would have spent in cpuidle haltpoll anyway.
> > > > > > > >
> > > > > > > > The current series does not improve virtio-net. I haven't investigated deeply,
> > > > > > > > but it is possible that NAPI and poll_source do not combine. See the final
> > > > > > > > patch for a starting point on making the two work together.
> > > > > > > >
> > > > > > > > I have not tried this on bare metal but it might help there too. The cost of
> > > > > > > > disabling a device's irq must be less than the savings from avoiding irq
> > > > > > > > handling for this optimization to make sense.
> > > > > > > >
> > > > > > > > [1] https://lore.kernel.org/linux-block/20210520141305.355961-1-stefanha@redhat.com/
> > > > > > >
> > > > > > > Hi Stefan:
> > > > > > >
> > > > > > > Some questions:
> > > > > > >
> > > > > > > 1) What's the advantages of introducing polling at virtio level instead of
> > > > > > > doing it at each subsystems? Polling in virtio level may only work well if
> > > > > > > all (or most) of the devices are virtio
> > > > > > I'm not sure I understand the question. cpuidle haltpoll benefits all
> > > > > > devices today, except it incurs interrupt latency. The poll_source API
> > > > > > eliminates the interrupt latency for drivers that can disable device
> > > > > > interrupts cheaply.
> > > > > >
> > > > > > This patch adds poll_source to core virtio code so that all virtio
> > > > > > drivers get this feature for free. No driver-specific changes are
> > > > > > needed.
> > > > > >
> > > > > > If you mean networking, block layer, etc by "subsystems" then there's
> > > > > > nothing those subsystems can do to help. Whether poll_source can be used
> > > > > > depends on the specific driver, not the subsystem. If you consider
> > > > > > drivers/virtio/ a subsystem, then that's exactly what the patch series
> > > > > > is doing.
> > > > >
> > > > >
> > > > > I meant, if we choose to use idle poll, we have some several choices:
> > > > >
> > > > > 1) bus level (e.g the virtio)
> > > > > 2) subsystem level (e.g the networking and block)
> > > > >
> > > > > I'm not sure which one is better.
> > > >
> > > > This API is intended to be driver- or bus-level. I don't think
> > > > subsystems can do very much since they don't know the hardware
> > > > capabilities (cheap interrupt disabling) and in most cases there's no
> > > > advantage of plumbing it through subsystems when drivers can call the
> > > > API directly.
> > > >
> > > > > > > 2) What's the advantages of using cpuidle instead of using a thread (and
> > > > > > > leverage the scheduler)?
> > > > > > In order to combine with the existing cpuidle infrastructure. No new
> > > > > > polling loop is introduced and no additional CPU cycles are spent on
> > > > > > polling.
> > > > > >
> > > > > > If cpuidle itself is converted to threads then poll_source would
> > > > > > automatically operate in a thread too, but this patch series doesn't
> > > > > > change how the core cpuidle code works.
> > > > >
> > > > >
> > > > > So networking subsystem can use NAPI busy polling in the process context
> > > > > which means it can be leveraged by the scheduler.
> > > > >
> > > > > I'm not sure it's a good idea to poll drivers for a specific bus in the
> > > > > general cpu idle layer.
> > > >
> > > > Why? Maybe because the cpuidle execution environment is a little special?
> > >
> > > Well, this would be prone to abuse.
> > >
> > > The time spent in that driver callback counts as CPU idle time while
> > > it really is the driver running and there is not limit on how much
> > > time the callback can take, while doing costly things in the idle loop
> > > is generally avoided, because on wakeup the CPU needs to be available
> > > to the task needing it as soon as possible.  IOW, the callback
> > > potentially add unbounded latency to the CPU wakeup path.
> >
> > How is this different from driver interrupt handlers running during
> > cpuidle?
> 
> The time spent on handling interrupts does not count as CPU idle time.

Is that taken care of by account_hardirq_enter()?

Drivers using poll_source have two pieces:
1. A small piece of code that polls the device.
2. A larger piece of code that runs when polling succeeds. This is
   basically the irq handler.

Would it be acceptable to run #1 from cpuidle but defer #2 so it's not
accounted as idle time?

Thanks,
Stefan