
[RFC,2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace

Message ID 20201026175325.585623-2-dwmw2@infradead.org (mailing list archive)
State New, archived
Series [RFC,1/2] sched/wait: Add add_wait_queue_priority()

Commit Message

David Woodhouse Oct. 26, 2020, 5:53 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
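For context, the mechanism relies on patch 1 of this series. A sketch of what add_wait_queue_priority() does (based on this RFC; the merged helper may differ in detail): the entry is added at the head of the waitqueue with WQ_FLAG_EXCLUSIVE set, so __wake_up_common() invokes irqfd_wakeup() first, and a nonzero return from an exclusive entry terminates the walk before the ordinary (userspace poll) waiters are reached.

```c
/* Sketch of patch 1's helper, for illustration only.  Adding at the
 * head (rather than the tail, as add_wait_queue_exclusive() does)
 * plus WQ_FLAG_EXCLUSIVE is what lets the irqfd "catch" the event
 * before any non-exclusive userspace poll waiters see it.
 */
void add_wait_queue_priority(struct wait_queue_head *wq_head,
			     struct wait_queue_entry *wq_entry)
{
	unsigned long flags;

	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue(wq_head, wq_entry);	/* head, not tail */
	spin_unlock_irqrestore(&wq_head->lock, flags);
}
```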

Comments

Paolo Bonzini Oct. 27, 2020, 8:01 a.m. UTC | #1
On 26/10/20 18:53, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> As far as I can tell, when we use posted interrupts we silently cut off
> the events from userspace, if it's listening on the same eventfd that
> feeds the irqfd.
> 
> I like that behaviour. Let's do it all the time, even without posted
> interrupts. It makes it much easier to handle IRQ remapping invalidation
> without having to constantly add/remove the fd from the userspace poll
> set. We can just leave userspace polling on it, and the bypass will...
> well... bypass it.

This looks good, though of course it depends on the somewhat hackish
patch 1. However don't you need to read the eventfd as well, since
userspace will never be able to do so?

Paolo

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  virt/kvm/eventfd.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index d6408bb497dc..39443e2f72bf 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  	struct kvm *kvm = irqfd->kvm;
>  	unsigned seq;
>  	int idx;
> +	int ret = 0;
>  
>  	if (flags & EPOLLIN) {
>  		idx = srcu_read_lock(&kvm->irq_srcu);
> @@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  					      false) == -EWOULDBLOCK)
>  			schedule_work(&irqfd->inject);
>  		srcu_read_unlock(&kvm->irq_srcu, idx);
> +		ret = 1;
>  	}
>  
>  	if (flags & EPOLLHUP) {
> @@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
>  	}
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static void
> @@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>  {
>  	struct kvm_kernel_irqfd *irqfd =
>  		container_of(pt, struct kvm_kernel_irqfd, pt);
> -	add_wait_queue(wqh, &irqfd->wait);
> +	add_wait_queue_priority(wqh, &irqfd->wait);
>  }
>  
>  /* Must be called under irqfds.lock */
>
David Woodhouse Oct. 27, 2020, 10:15 a.m. UTC | #2
On Tue, 2020-10-27 at 09:01 +0100, Paolo Bonzini wrote:
> On 26/10/20 18:53, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > As far as I can tell, when we use posted interrupts we silently cut off
> > the events from userspace, if it's listening on the same eventfd that
> > feeds the irqfd.
> > 
> > I like that behaviour. Let's do it all the time, even without posted
> > interrupts. It makes it much easier to handle IRQ remapping invalidation
> > without having to constantly add/remove the fd from the userspace poll
> > set. We can just leave userspace polling on it, and the bypass will...
> > well... bypass it.
> 
> This looks good, though of course it depends on the somewhat hackish
> patch 1.

I thought it was quite neat :)

>  However don't you need to read the eventfd as well, since
> userspace will never be able to do so?

Yes. Although that's a separate cleanup as it was already true before
my patch. Right now, userspace needs to explicitly stop polling on the
VFIO eventfd while it's assigned as KVM IRQFD (to avoid injecting
duplicate interrupts when the kernel isn't using PI and allows events
to leak). So it isn't going to consume the events in that case either.
Nothing's really changed.

The VFIO virqfd is just the same. The count just builds up when the
kernel handles the events, and is eventually cleared by
eventfd_ctx_remove_wait_queue().

In both cases, that actually works fine because in practice the events
are raised by eventfd_signal() in the kernel, and that works even if
the count reaches ULLONG_MAX. It's just that sending further events
from *userspace* would block in that case.

Both of them theoretically want fixing — regardless of the priority
patch.

Since the wq lock is held while the wakeup functions (virqfd_wakeup and
irqfd_wakeup for VFIO and KVM respectively) run, all they really need to
do is call eventfd_ctx_do_read() to consume the events. I'll look at
whether I can find a nicer option than just exporting that.
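A sketch of that follow-up, assuming eventfd_ctx_do_read() gets exported (the "nicer option" question above is exactly about whether to do that directly); the elided injection and EPOLLHUP paths are unchanged from the patch:

```c
/* Hypothetical follow-up: drain the eventfd count from the wakeup
 * callback itself.  The waitqueue lock is held across this callback,
 * which is what makes calling eventfd_ctx_do_read() here safe.
 */
static int
irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
	struct kvm_kernel_irqfd *irqfd =
		container_of(wait, struct kvm_kernel_irqfd, wait);
	__poll_t flags = key_to_poll(key);
	int ret = 0;

	if (flags & EPOLLIN) {
		u64 cnt;

		eventfd_ctx_do_read(irqfd->eventfd, &cnt); /* consume count */
		/* ... inject the interrupt as before ... */
		ret = 1;	/* event consumed; stop the wakeup walk */
	}

	/* ... EPOLLHUP handling unchanged ... */

	return ret;
}
```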

Patch

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d6408bb497dc..39443e2f72bf 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	struct kvm *kvm = irqfd->kvm;
 	unsigned seq;
 	int idx;
+	int ret = 0;
 
 	if (flags & EPOLLIN) {
 		idx = srcu_read_lock(&kvm->irq_srcu);
@@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 					      false) == -EWOULDBLOCK)
 			schedule_work(&irqfd->inject);
 		srcu_read_unlock(&kvm->irq_srcu, idx);
+		ret = 1;
 	}
 
 	if (flags & EPOLLHUP) {
@@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
 	}
 
-	return 0;
+	return ret;
 }
 
 static void
@@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(pt, struct kvm_kernel_irqfd, pt);
-	add_wait_queue(wqh, &irqfd->wait);
+	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
 /* Must be called under irqfds.lock */