Message ID | 4A35C269.7050209@novell.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote: > > > >> Michael S. Tsirkin wrote: > >> > >>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote: > >>> > >>> > >>>> +static void > >>>> +irqfd_disconnect(struct _irqfd *irqfd) > >>>> +{ > >>>> + struct kvm *kvm; > >>>> + > >>>> + mutex_lock(&irqfd->lock); > >>>> + > >>>> + kvm = rcu_dereference(irqfd->kvm); > >>>> + rcu_assign_pointer(irqfd->kvm, NULL); > >>>> + > >>>> + mutex_unlock(&irqfd->lock); > >>>> + > >>>> + if (!kvm) > >>>> + return; > >>>> > >>>> mutex_lock(&kvm->lock); > >>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); > >>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); > >>>> + list_del(&irqfd->list); > >>>> mutex_unlock(&kvm->lock); > >>>> + > >>>> + /* > >>>> + * It is important to not drop the kvm reference until the next grace > >>>> + * period because there might be lockless references in flight up > >>>> + * until then > >>>> + */ > >>>> + synchronize_srcu(&irqfd->srcu); > >>>> + kvm_put_kvm(kvm); > >>>> } > >>>> > >>>> > >>> So irqfd object will persist after kvm goes away, until eventfd is closed? > >>> > >>> > >> Yep, by design. It becomes part of the eventfd and is thus associated > >> with its lifetime. Consider it as if we made our own anon-fd > >> implementation for irqfd and the lifetime looks similar. The difference > >> is that we are reusing eventfd and its interface semantics. > >> > >>> > >>> > >>>> > >>>> static int > >>>> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) > >>>> { > >>>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait); > >>>> + unsigned long flags = (unsigned long)key; > >>>> > >>>> - /* > >>>> - * The wake_up is called with interrupts disabled. Therefore we need > >>>> - * to defer the IRQ injection until later since we need to acquire the > >>>> - * kvm->lock to do so. > >>>> - */ > >>>> - schedule_work(&irqfd->work); > >>>> + if (flags & POLLIN) > >>>> + /* > >>>> + * The POLLIN wake_up is called with interrupts disabled. > >>>> + * Therefore we need to defer the IRQ injection until later > >>>> + * since we need to acquire the kvm->lock to do so. > >>>> + */ > >>>> + schedule_work(&irqfd->inject); > >>>> + > >>>> + if (flags & POLLHUP) { > >>>> + /* > >>>> + * The POLLHUP is called unlocked, so it theoretically should > >>>> + * be safe to remove ourselves from the wqh using the locked > >>>> + * variant of remove_wait_queue() > >>>> + */ > >>>> + remove_wait_queue(irqfd->wqh, &irqfd->wait); > >>>> + flush_work(&irqfd->inject); > >>>> + irqfd_disconnect(irqfd); > >>>> + > >>>> + cleanup_srcu_struct(&irqfd->srcu); > >>>> + kfree(irqfd); > >>>> + } > >>>> > >>>> return 0; > >>>> } > >>>> > >>>> > >>> And it is removed by this function when eventfd is closed. > >>> But what prevents the kvm module from going away, meanwhile? > >>> > >>> > >> Well, we hold a reference to struct kvm until we call > >> irqfd_disconnect(). If kvm closes first, we disconnect and disassociate > >> all references to kvm leaving irqfd->kvm = NULL. Likewise, if irqfd > >> closes first, we disassociate with kvm with the above quoted logic. In > >> either case, we are holding a kvm reference up until that "disconnect" > >> point. Therefore kvm should not be able to disappear before that > >> disconnect, and after that point we do not care. > >> > > > > Yes, we do care. > > > > Here's the scenario in more detail: > > > > - kvm is closed > > - irq disconnect is called > > - kvm is put > > - kvm module is removed: all irqs are disconnected > > - eventfd closes and triggers callback into removed kvm module > > - crash > > > > [ lightbulb turns on] > > Ah, now I see the point you were making. I thought you were talking > about the .text in kvm_set_irq() (which would be protected by my > kvm_get_kvm() reference afaict). But you are actually talking about the > irqfd .text itself. Indeed, you are correct that is this currently a > race. Good catch! > > > > >> If that is not sufficient to prevent kvm.ko from going away in the > >> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I > >> believe everything is actually ok here. > >> > >> -Greg > >> > >> > > > > > > BTW, why can't we remove irqfds in kvm_release? > > > > Well, this would be ideal but we run into that bi-directional reference > thing that we talked about earlier and we both agree is non-trivial to > solve. Solving this locking problem would incidentally also pave the > way for restoring the DEASSIGN feature, so patches welcome! So far the only workable approach that I see is reverting the POLLHUP patch. I agree it looks pretty, but DEASSIGN and closing the races is more important IMO. And locking will definitely become much simpler. > In the meantime, I think we can close the hole you found with the > following patch (build-tested only): > > commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3 > Author: Gregory Haskins <ghaskins@novell.com> > Date: Sun Jun 14 23:37:49 2009 -0400 > > KVM: make irqfd take kvm.ko module reference > > Michael Tsirkin pointed out that we currently have a race between someone > holding an irqfd reference and an rmmod against kvm.ko. This patch closes > that hole by making sure that irqfd holds a kvm.ko reference for its lifetime. > > Found-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: Gregory Haskins <ghaskins@novell.com> > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c > index 2c8028c..67e4eca 100644 > --- a/virt/kvm/eventfd.c > +++ b/virt/kvm/eventfd.c > @@ -29,6 +29,7 @@ > #include <linux/list.h> > #include <linux/eventfd.h> > #include <linux/srcu.h> > +#include <linux/module.h> > > /* > * -------------------------------------------------------------------- > @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int > sync, void > *key) > > cleanup_srcu_struct(&irqfd->srcu); > kfree(irqfd); > + module_put(THIS_MODULE); > } > > return 0; module_put(THIS_MODULE) is always a bug unless you know that someone has a reference to the current module: the module could go away between this call and returning from function. > @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags) > if (ret < 0) > goto fail; > > + __module_get(THIS_MODULE); > kvm_get_kvm(kvm); > > mutex_lock(&kvm->lock); > > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin wrote: > On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: >> >>> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote: >>> >>> >>>> Michael S. Tsirkin wrote: >>>> >>>> >>>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote: >>>>> >>>>> >>>>> >>>>>> +static void >>>>>> +irqfd_disconnect(struct _irqfd *irqfd) >>>>>> +{ >>>>>> + struct kvm *kvm; >>>>>> + >>>>>> + mutex_lock(&irqfd->lock); >>>>>> + >>>>>> + kvm = rcu_dereference(irqfd->kvm); >>>>>> + rcu_assign_pointer(irqfd->kvm, NULL); >>>>>> + >>>>>> + mutex_unlock(&irqfd->lock); >>>>>> + >>>>>> + if (!kvm) >>>>>> + return; >>>>>> >>>>>> mutex_lock(&kvm->lock); >>>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); >>>>>> - kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); >>>>>> + list_del(&irqfd->list); >>>>>> mutex_unlock(&kvm->lock); >>>>>> + >>>>>> + /* >>>>>> + * It is important to not drop the kvm reference until the next grace >>>>>> + * period because there might be lockless references in flight up >>>>>> + * until then >>>>>> + */ >>>>>> + synchronize_srcu(&irqfd->srcu); >>>>>> + kvm_put_kvm(kvm); >>>>>> } >>>>>> >>>>>> >>>>>> >>>>> So irqfd object will persist after kvm goes away, until eventfd is closed? >>>>> >>>>> >>>>> >>>> Yep, by design. It becomes part of the eventfd and is thus associated >>>> with its lifetime. Consider it as if we made our own anon-fd >>>> implementation for irqfd and the lifetime looks similar. The difference >>>> is that we are reusing eventfd and its interface semantics. >>>> >>>> >>>>> >>>>> >>>>> >>>>>> >>>>>> static int >>>>>> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) >>>>>> { >>>>>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait); >>>>>> + unsigned long flags = (unsigned long)key; >>>>>> >>>>>> - /* >>>>>> - * The wake_up is called with interrupts disabled. Therefore we need >>>>>> - * to defer the IRQ injection until later since we need to acquire the >>>>>> - * kvm->lock to do so. >>>>>> - */ >>>>>> - schedule_work(&irqfd->work); >>>>>> + if (flags & POLLIN) >>>>>> + /* >>>>>> + * The POLLIN wake_up is called with interrupts disabled. >>>>>> + * Therefore we need to defer the IRQ injection until later >>>>>> + * since we need to acquire the kvm->lock to do so. >>>>>> + */ >>>>>> + schedule_work(&irqfd->inject); >>>>>> + >>>>>> + if (flags & POLLHUP) { >>>>>> + /* >>>>>> + * The POLLHUP is called unlocked, so it theoretically should >>>>>> + * be safe to remove ourselves from the wqh using the locked >>>>>> + * variant of remove_wait_queue() >>>>>> + */ >>>>>> + remove_wait_queue(irqfd->wqh, &irqfd->wait); >>>>>> + flush_work(&irqfd->inject); >>>>>> + irqfd_disconnect(irqfd); >>>>>> + >>>>>> + cleanup_srcu_struct(&irqfd->srcu); >>>>>> + kfree(irqfd); >>>>>> + } >>>>>> >>>>>> return 0; >>>>>> } >>>>>> >>>>>> >>>>>> >>>>> And it is removed by this function when eventfd is closed. >>>>> But what prevents the kvm module from going away, meanwhile? >>>>> >>>>> >>>>> >>>> Well, we hold a reference to struct kvm until we call >>>> irqfd_disconnect(). If kvm closes first, we disconnect and disassociate >>>> all references to kvm leaving irqfd->kvm = NULL. Likewise, if irqfd >>>> closes first, we disassociate with kvm with the above quoted logic. In >>>> either case, we are holding a kvm reference up until that "disconnect" >>>> point. Therefore kvm should not be able to disappear before that >>>> disconnect, and after that point we do not care. >>>> >>>> >>> Yes, we do care. >>> >>> Here's the scenario in more detail: >>> >>> - kvm is closed >>> - irq disconnect is called >>> - kvm is put >>> - kvm module is removed: all irqs are disconnected >>> - eventfd closes and triggers callback into removed kvm module >>> - crash >>> >>> >> [ lightbulb turns on] >> >> Ah, now I see the point you were making. I thought you were talking >> about the .text in kvm_set_irq() (which would be protected by my >> kvm_get_kvm() reference afaict). But you are actually talking about the >> irqfd .text itself. Indeed, you are correct that is this currently a >> race. Good catch! >> >> >>> >>> >>>> If that is not sufficient to prevent kvm.ko from going away in the >>>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I >>>> believe everything is actually ok here. >>>> >>>> -Greg >>>> >>>> >>>> >>> BTW, why can't we remove irqfds in kvm_release? >>> >>> >> Well, this would be ideal but we run into that bi-directional reference >> thing that we talked about earlier and we both agree is non-trivial to >> solve. Solving this locking problem would incidentally also pave the >> way for restoring the DEASSIGN feature, so patches welcome! >> > > So far the only workable approach that I see is reverting the POLLHUP > patch. I agree it looks pretty, but DEASSIGN and closing the races is > more important IMO. And locking will definitely become much simpler. > > >> In the meantime, I think we can close the hole you found with the >> following patch (build-tested only): >> >> commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3 >> Author: Gregory Haskins <ghaskins@novell.com> >> Date: Sun Jun 14 23:37:49 2009 -0400 >> >> KVM: make irqfd take kvm.ko module reference >> >> Michael Tsirkin pointed out that we currently have a race between someone >> holding an irqfd reference and an rmmod against kvm.ko. This patch closes >> that hole by making sure that irqfd holds a kvm.ko reference for its lifetime. >> >> Found-by: Michael S. Tsirkin <mst@redhat.com> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com> >> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c >> index 2c8028c..67e4eca 100644 >> --- a/virt/kvm/eventfd.c >> +++ b/virt/kvm/eventfd.c >> @@ -29,6 +29,7 @@ >> #include <linux/list.h> >> #include <linux/eventfd.h> >> #include <linux/srcu.h> >> +#include <linux/module.h> >> >> /* >> * -------------------------------------------------------------------- >> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int >> sync, void >> *key) >> >> cleanup_srcu_struct(&irqfd->srcu); >> kfree(irqfd); >> + module_put(THIS_MODULE); >> } >> >> return 0; >> > > module_put(THIS_MODULE) is always a bug unless you know that someone has > a reference to the current module: the module could go away between this > call and returning from function. > Hmm. I understand what you are saying conceptually (i.e. the .text could get yanked before we hit the next line of code, in this case the "return 0"). However, holding a reference when you _know_ someone else holds a reference to me says that one of the references is redundant. In addition, there is certainly plenty of precedence for module_put(THIS_MODULE) all throughout the kernel (including module_put_and_exit()). Are those broken as well? In any case, one of the patches I have in queue to push to Davide for eventfd may provide a good solution to this problem as well, so I will get that polished up today. Thanks Michael, -Greg > >> @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags) >> if (ret < 0) >> goto fail; >> >> + __module_get(THIS_MODULE); >> kvm_get_kvm(kvm); >> >> mutex_lock(&kvm->lock); >> >> >> >> > > >
On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote: > >> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int > >> sync, void > >> *key) > >> > >> cleanup_srcu_struct(&irqfd->srcu); > >> kfree(irqfd); > >> + module_put(THIS_MODULE); > >> } > >> > >> return 0; > >> > > > > module_put(THIS_MODULE) is always a bug unless you know that someone has > > a reference to the current module: the module could go away between this > > call and returning from function. > > > > Hmm. I understand what you are saying conceptually (i.e. the .text > could get yanked before we hit the next line of code, in this case the > "return 0"). However, holding a reference when you _know_ someone else > holds a reference to me says that one of the references is redundant. > In addition, there is certainly plenty of precedence for > module_put(THIS_MODULE) all throughout the kernel (including > module_put_and_exit()). Are those broken as well? Maybe not, but I don't know why. It works fine as long as you don't unload any modules though :) Rusty, could you enlighten us please?
On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote: > On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote: > > Hmm. I understand what you are saying conceptually (i.e. the .text > > could get yanked before we hit the next line of code, in this case the > > "return 0"). However, holding a reference when you _know_ someone else > > holds a reference to me says that one of the references is redundant. > > In addition, there is certainly plenty of precedence for > > module_put(THIS_MODULE) all throughout the kernel (including > > module_put_and_exit()). Are those broken as well? > > Maybe not, but I don't know why. It works fine as long as you don't > unload any modules though :) Rusty, could you enlighten us please? Yep, they're almost all broken. A few have comments indicating that someone else is holding a reference (eg. loopback). But at some point you give up playing whack-a-mole for random drivers. module_put_and_exit() does *not* have this problem, BTW. Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote: > On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote: > > On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote: > > > Hmm. I understand what you are saying conceptually (i.e. the .text > > > could get yanked before we hit the next line of code, in this case the > > > "return 0"). However, holding a reference when you _know_ someone else > > > holds a reference to me says that one of the references is redundant. > > > In addition, there is certainly plenty of precedence for > > > module_put(THIS_MODULE) all throughout the kernel (including > > > module_put_and_exit()). Are those broken as well? > > > > Maybe not, but I don't know why. It works fine as long as you don't > > unload any modules though :) Rusty, could you enlighten us please? > > Yep, they're almost all broken. A few have comments indicating that someone > else is holding a reference (eg. loopback). > > But at some point you give up playing whack-a-mole for random drivers. > > module_put_and_exit() does *not* have this problem, BTW. > > Rusty. I see that, the .text for module_put_and_exit is never modular itself. Thanks, Rusty! BTW, Gregory, this can be used to fix the race in the design: create a thread and let it drop the module reference with module_put_and_exit. Which will work, but I guess at this point we should ask ourselves whether all the hearburn with srcu, threads and module references is better than just asking the user to call and ioctl.
Michael S. Tsirkin wrote: > On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote: > >> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote: >> >>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote: >>> >>>> Hmm. I understand what you are saying conceptually (i.e. the .text >>>> could get yanked before we hit the next line of code, in this case the >>>> "return 0"). However, holding a reference when you _know_ someone else >>>> holds a reference to me says that one of the references is redundant. >>>> In addition, there is certainly plenty of precedence for >>>> module_put(THIS_MODULE) all throughout the kernel (including >>>> module_put_and_exit()). Are those broken as well? >>>> >>> Maybe not, but I don't know why. It works fine as long as you don't >>> unload any modules though :) Rusty, could you enlighten us please? >>> >> Yep, they're almost all broken. A few have comments indicating that someone >> else is holding a reference (eg. loopback). >> >> But at some point you give up playing whack-a-mole for random drivers. >> >> module_put_and_exit() does *not* have this problem, BTW. >> >> Rusty. >> > > I see that, the .text for module_put_and_exit is never modular itself. > Thanks, Rusty! > Ah! That is the trick I wasn't understanding. > BTW, Gregory, this can be used to fix the race in the design: create a > thread and let it drop the module reference with module_put_and_exit. > I had thought of doing something like this initially too, but I think its racy as well. Ultimately, you need to make sure the eventfd callback is completely out before its safe to run, and deferring to a thread would not change this race. The only sane way I can see to do that is to have the caller infrastructure annotate the event somehow (either directly with a module_put(), or indirectly with some kind of state transition that can be tracked with something like synchronize_sched(). > Which will work, but I guess at this point we should ask ourselves > whether all the hearburn with srcu, threads and module references is > better than just asking the user to call and ioctl. > I am starting to agree with you, here. :) Note one thing: the SRCU stuff is mostly orthogonal from the rest of the conversation re: the module_put() races. I only tied it into the current thread because the eventfd_notifier_register() thread gave me a convenient way to hook some other context to do the module_put(). In the long term, the srcu changes are for the can_sleep() stuff. So on that note, lets see if I can convince Davide that the srcu stuff is not so evil before we revert the POLLHUP patches, since the module_put() fix is trivial once that is in place. Thanks Michael (and Rusty), -Greg
On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote: > > > >> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote: > >> > >>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote: > >>> > >>>> Hmm. I understand what you are saying conceptually (i.e. the .text > >>>> could get yanked before we hit the next line of code, in this case the > >>>> "return 0"). However, holding a reference when you _know_ someone else > >>>> holds a reference to me says that one of the references is redundant. > >>>> In addition, there is certainly plenty of precedence for > >>>> module_put(THIS_MODULE) all throughout the kernel (including > >>>> module_put_and_exit()). Are those broken as well? > >>>> > >>> Maybe not, but I don't know why. It works fine as long as you don't > >>> unload any modules though :) Rusty, could you enlighten us please? > >>> > >> Yep, they're almost all broken. A few have comments indicating that someone > >> else is holding a reference (eg. loopback). > >> > >> But at some point you give up playing whack-a-mole for random drivers. > >> > >> module_put_and_exit() does *not* have this problem, BTW. > >> > >> Rusty. > >> > > > > I see that, the .text for module_put_and_exit is never modular itself. > > Thanks, Rusty! > > > > Ah! That is the trick I wasn't understanding. > > BTW, Gregory, this can be used to fix the race in the design: create a > > thread and let it drop the module reference with module_put_and_exit. > > > > I had thought of doing something like this initially too, but I think > its racy as well. Ultimately, you need to make sure the eventfd > callback is completely out before its safe to run, and deferring to a > thread would not change this race. The only sane way I can see to do > that is to have the caller infrastructure annotate the event somehow > (either directly with a module_put(), or indirectly with some kind of > state transition that can be tracked with something like > synchronize_sched(). Here's what one could do: create a thread for each irqfd, and increment module ref count, put that thread to sleep. When done with irqfd, don't delete it and don't decrement module refcount, wake thread instead. thread kills irqfd and calls module_put_and_exit. I don't think it's racy but I won't call it sane either. > > Which will work, but I guess at this point we should ask ourselves > > whether all the hearburn with srcu, threads and module references is > > better than just asking the user to call and ioctl. > > > > I am starting to agree with you, here. :) > > Note one thing: the SRCU stuff is mostly orthogonal from the rest of the > conversation re: the module_put() races. I only tied it into the > current thread because the eventfd_notifier_register() thread gave me a > convenient way to hook some other context to do the module_put(). In > the long term, the srcu changes are for the can_sleep() stuff. So on > that note, lets see if I can convince Davide that the srcu stuff is not > so evil before we revert the POLLHUP patches, since the module_put() fix > is trivial once that is in place. Can this help with DEASSIGN as well? We need it for migration. > Thanks Michael (and Rusty), > -Greg > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin wrote: > On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: >> >> >>> BTW, Gregory, this can be used to fix the race in the design: create a >>> thread and let it drop the module reference with module_put_and_exit. >>> >>> >> I had thought of doing something like this initially too, but I think >> its racy as well. Ultimately, you need to make sure the eventfd >> callback is completely out before its safe to run, and deferring to a >> thread would not change this race. The only sane way I can see to do >> that is to have the caller infrastructure annotate the event somehow >> (either directly with a module_put(), or indirectly with some kind of >> state transition that can be tracked with something like >> synchronize_sched(). >> > > Here's what one could do: create a thread for each irqfd, and increment > module ref count, put that thread to sleep. When done with > irqfd, don't delete it and don't decrement module refcount, wake thread > instead. thread kills irqfd and calls module_put_and_exit. > > I don't think it's racy I believe it is. How would you prevent the thread from doing the module_put_and_exit() before the eventfd callback thread is known to have exited the relevant .text section? All this talk does give me an idea, tho. Ill make a patch. > >>> Which will work, but I guess at this point we should ask ourselves >>> whether all the hearburn with srcu, threads and module references is >>> better than just asking the user to call and ioctl. >>> >>> >> I am starting to agree with you, here. :) >> >> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the >> conversation re: the module_put() races. I only tied it into the >> current thread because the eventfd_notifier_register() thread gave me a >> convenient way to hook some other context to do the module_put(). In >> the long term, the srcu changes are for the can_sleep() stuff. So on >> that note, lets see if I can convince Davide that the srcu stuff is not >> so evil before we revert the POLLHUP patches, since the module_put() fix >> is trivial once that is in place. >> > > Can this help with DEASSIGN as well? We need it for migration. > No, but afaict you do not need this for migration anyway. Migrate the GSI and re-call kvm_irqfd() on the other side. Would the fd even be relevant across a migration anyway? I would think not, but admittedly I know little about how qemu/kvm migration actually works. Regards, -Greg
On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: > > > >> Michael S. Tsirkin wrote: > >> > >> > >>> BTW, Gregory, this can be used to fix the race in the design: create a > >>> thread and let it drop the module reference with module_put_and_exit. > >>> > >>> > >> I had thought of doing something like this initially too, but I think > >> its racy as well. Ultimately, you need to make sure the eventfd > >> callback is completely out before its safe to run, and deferring to a > >> thread would not change this race. The only sane way I can see to do > >> that is to have the caller infrastructure annotate the event somehow > >> (either directly with a module_put(), or indirectly with some kind of > >> state transition that can be tracked with something like > >> synchronize_sched(). > >> > > > > Here's what one could do: create a thread for each irqfd, and increment > > module ref count, put that thread to sleep. When done with > > irqfd, don't delete it and don't decrement module refcount, wake thread > > instead. thread kills irqfd and calls module_put_and_exit. > > > > I don't think it's racy > > > I believe it is. How would you prevent the thread from doing the > module_put_and_exit() before the eventfd callback thread is known to > have exited the relevant .text section? Right. > All this talk does give me an idea, tho. Ill make a patch. OK, but ask yourself whether this bag of tricks is worth it, and whether we'll find another hole later. Let's reserve the trickiness for fast-path, where it's needed, and keep at least the assign/deassign simple. > > > >>> Which will work, but I guess at this point we should ask ourselves > >>> whether all the hearburn with srcu, threads and module references is > >>> better than just asking the user to call and ioctl. > >>> > >>> > >> I am starting to agree with you, here. :) > >> > >> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the > >> conversation re: the module_put() races. I only tied it into the > >> current thread because the eventfd_notifier_register() thread gave me a > >> convenient way to hook some other context to do the module_put(). In > >> the long term, the srcu changes are for the can_sleep() stuff. So on > >> that note, lets see if I can convince Davide that the srcu stuff is not > >> so evil before we revert the POLLHUP patches, since the module_put() fix > >> is trivial once that is in place. > >> > > > > Can this help with DEASSIGN as well? We need it for migration. > > > > No, but afaict you do not need this for migration anyway. Migrate the > GSI and re-call kvm_irqfd() on the other side. Would the fd even be > relevant across a migration anyway? I would think not, but admittedly I > know little about how qemu/kvm migration actually works. Yes but that's not live migration. For live migration, the trick is that you are running locally but send changes to remote guest. For that, we need to put qemu in the middle between the device and the guest, so it can detect activity and update the remote side. And the best way to do that is to take poll eventfd that device assigns and push eventfd that kvm polls. To switch between this setup and the one where kvm polls the ventfd from device directly, you need deassign.
Michael S. Tsirkin wrote: > On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: >> >>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: >>> >>> >>>> Michael S. Tsirkin wrote: >>>> >>>> >>>> >>>>> BTW, Gregory, this can be used to fix the race in the design: create a >>>>> thread and let it drop the module reference with module_put_and_exit. >>>>> >>>>> >>>>> >>>> I had thought of doing something like this initially too, but I think >>>> its racy as well. Ultimately, you need to make sure the eventfd >>>> callback is completely out before its safe to run, and deferring to a >>>> thread would not change this race. The only sane way I can see to do >>>> that is to have the caller infrastructure annotate the event somehow >>>> (either directly with a module_put(), or indirectly with some kind of >>>> state transition that can be tracked with something like >>>> synchronize_sched(). >>>> >>>> >>> Here's what one could do: create a thread for each irqfd, and increment >>> module ref count, put that thread to sleep. When done with >>> irqfd, don't delete it and don't decrement module refcount, wake thread >>> instead. thread kills irqfd and calls module_put_and_exit. >>> >>> I don't think it's racy >>> >> I believe it is. How would you prevent the thread from doing the >> module_put_and_exit() before the eventfd callback thread is known to >> have exited the relevant .text section? >> > > Right. > > >> All this talk does give me an idea, tho. Ill make a patch. >> > > OK, but ask yourself whether this bag of tricks is worth it, and whether > we'll find another hole later. Let's reserve the trickiness for > fast-path, where it's needed, and keep at least the assign/deassign simple. > Understood. OTOH, going back to the model where two steps are needed for close() is ugly too, so I don't want to just give up and revert that fix too easily. At some point we will call it one way or the other, but I am not there quite yet. > >>> >>> >>>>> Which will work, but I guess at this point we should ask ourselves >>>>> whether all the hearburn with srcu, threads and module references is >>>>> better than just asking the user to call and ioctl. >>>>> >>>>> >>>>> >>>> I am starting to agree with you, here. :) >>>> >>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the >>>> conversation re: the module_put() races. I only tied it into the >>>> current thread because the eventfd_notifier_register() thread gave me a >>>> convenient way to hook some other context to do the module_put(). In >>>> the long term, the srcu changes are for the can_sleep() stuff. So on >>>> that note, lets see if I can convince Davide that the srcu stuff is not >>>> so evil before we revert the POLLHUP patches, since the module_put() fix >>>> is trivial once that is in place. >>>> >>>> >>> Can this help with DEASSIGN as well? We need it for migration. >>> >>> >> No, but afaict you do not need this for migration anyway. Migrate the >> GSI and re-call kvm_irqfd() on the other side. Would the fd even be >> relevant across a migration anyway? I would think not, but admittedly I >> know little about how qemu/kvm migration actually works. >> > > Yes but that's not live migration. For live migration, the trick is that > you are running locally but send changes to remote guest. For that, we > need to put qemu in the middle between the device and the guest, so it > can detect activity and update the remote side. > > And the best way to do that is to take poll eventfd that device assigns > and push eventfd that kvm polls. To switch between this setup > and the one where kvm polls the ventfd from device directly, > you need deassign. > So its still not clear why the distinction between deassign-the-gsi-but-leave-the-fd-valid is needed over a simple close(). Can you elaborate? -Greg
On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote: > > > >> Michael S. Tsirkin wrote: > >> > >>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: > >>> > >>> > >>>> Michael S. Tsirkin wrote: > >>>> > >>>> > >>>> > >>>>> BTW, Gregory, this can be used to fix the race in the design: create a > >>>>> thread and let it drop the module reference with module_put_and_exit. > >>>>> > >>>>> > >>>>> > >>>> I had thought of doing something like this initially too, but I think > >>>> its racy as well. Ultimately, you need to make sure the eventfd > >>>> callback is completely out before its safe to run, and deferring to a > >>>> thread would not change this race. The only sane way I can see to do > >>>> that is to have the caller infrastructure annotate the event somehow > >>>> (either directly with a module_put(), or indirectly with some kind of > >>>> state transition that can be tracked with something like > >>>> synchronize_sched(). > >>>> > >>>> > >>> Here's what one could do: create a thread for each irqfd, and increment > >>> module ref count, put that thread to sleep. When done with > >>> irqfd, don't delete it and don't decrement module refcount, wake thread > >>> instead. thread kills irqfd and calls module_put_and_exit. > >>> > >>> I don't think it's racy > >>> > >> I believe it is. How would you prevent the thread from doing the > >> module_put_and_exit() before the eventfd callback thread is known to > >> have exited the relevant .text section? > >> > > > > Right. > > > > > >> All this talk does give me an idea, tho. Ill make a patch. > >> > > > > OK, but ask yourself whether this bag of tricks is worth it, and whether > > we'll find another hole later. Let's reserve the trickiness for > > fast-path, where it's needed, and keep at least the assign/deassign simple. > > > > Understood. OTOH, going back to the model where two steps are needed > for close() is ugly too, so I don't want to just give up and revert that > fix too easily. At some point we will call it one way or the other, but > I am not there quite yet. > > > >>> > >>> > >>>>> Which will work, but I guess at this point we should ask ourselves > >>>>> whether all the hearburn with srcu, threads and module references is > >>>>> better than just asking the user to call and ioctl. > >>>>> > >>>>> > >>>>> > >>>> I am starting to agree with you, here. :) > >>>> > >>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the > >>>> conversation re: the module_put() races. I only tied it into the > >>>> current thread because the eventfd_notifier_register() thread gave me a > >>>> convenient way to hook some other context to do the module_put(). In > >>>> the long term, the srcu changes are for the can_sleep() stuff. So on > >>>> that note, lets see if I can convince Davide that the srcu stuff is not > >>>> so evil before we revert the POLLHUP patches, since the module_put() fix > >>>> is trivial once that is in place. > >>>> > >>>> > >>> Can this help with DEASSIGN as well? We need it for migration. > >>> > >>> > >> No, but afaict you do not need this for migration anyway. Migrate the > >> GSI and re-call kvm_irqfd() on the other side. Would the fd even be > >> relevant across a migration anyway? I would think not, but admittedly I > >> know little about how qemu/kvm migration actually works. > >> > > > > Yes but that's not live migration. For live migration, the trick is that > > you are running locally but send changes to remote guest. For that, we > > need to put qemu in the middle between the device and the guest, so it > > can detect activity and update the remote side. > > > > And the best way to do that is to take poll eventfd that device assigns > > and push eventfd that kvm polls. To switch between this setup > > and the one where kvm polls the ventfd from device directly, > > you need deassign. > > > > So its still not clear why the distinction between > deassign-the-gsi-but-leave-the-fd-valid is needed over a simple > close(). Can you elaborate? The fd needs to be left assigned to the device, so that we can poll the fd and get events, then forward them to kvm.
Michael S. Tsirkin wrote: > On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote: > >> Michael S. Tsirkin wrote: >> >>> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote: >>> >>> >>>> Michael S. Tsirkin wrote: >>>> >>>> >>>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote: >>>>> >>>>> >>>>> >>>>>> Michael S. Tsirkin wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> BTW, Gregory, this can be used to fix the race in the design: create a >>>>>>> thread and let it drop the module reference with module_put_and_exit. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> I had thought of doing something like this initially too, but I think >>>>>> its racy as well. Ultimately, you need to make sure the eventfd >>>>>> callback is completely out before its safe to run, and deferring to a >>>>>> thread would not change this race. The only sane way I can see to do >>>>>> that is to have the caller infrastructure annotate the event somehow >>>>>> (either directly with a module_put(), or indirectly with some kind of >>>>>> state transition that can be tracked with something like >>>>>> synchronize_sched(). >>>>>> >>>>>> >>>>>> >>>>> Here's what one could do: create a thread for each irqfd, and increment >>>>> module ref count, put that thread to sleep. When done with >>>>> irqfd, don't delete it and don't decrement module refcount, wake thread >>>>> instead. thread kills irqfd and calls module_put_and_exit. >>>>> >>>>> I don't think it's racy >>>>> >>>>> >>>> I believe it is. How would you prevent the thread from doing the >>>> module_put_and_exit() before the eventfd callback thread is known to >>>> have exited the relevant .text section? >>>> >>>> >>> Right. >>> >>> >>> >>>> All this talk does give me an idea, tho. Ill make a patch. >>>> >>>> >>> OK, but ask yourself whether this bag of tricks is worth it, and whether >>> we'll find another hole later. Let's reserve the trickiness for >>> fast-path, where it's needed, and keep at least the assign/deassign simple. >>> >>> >> Understood. OTOH, going back to the model where two steps are needed >> for close() is ugly too, so I don't want to just give up and revert that >> fix too easily. At some point we will call it one way or the other, but >> I am not there quite yet. >> >>> >>> >>>>> >>>>> >>>>> >>>>>>> Which will work, but I guess at this point we should ask ourselves >>>>>>> whether all the hearburn with srcu, threads and module references is >>>>>>> better than just asking the user to call and ioctl. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> I am starting to agree with you, here. :) >>>>>> >>>>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the >>>>>> conversation re: the module_put() races. I only tied it into the >>>>>> current thread because the eventfd_notifier_register() thread gave me a >>>>>> convenient way to hook some other context to do the module_put(). In >>>>>> the long term, the srcu changes are for the can_sleep() stuff. So on >>>>>> that note, lets see if I can convince Davide that the srcu stuff is not >>>>>> so evil before we revert the POLLHUP patches, since the module_put() fix >>>>>> is trivial once that is in place. >>>>>> >>>>>> >>>>>> >>>>> Can this help with DEASSIGN as well? We need it for migration. >>>>> >>>>> >>>>> >>>> No, but afaict you do not need this for migration anyway. Migrate the >>>> GSI and re-call kvm_irqfd() on the other side. Would the fd even be >>>> relevant across a migration anyway? I would think not, but admittedly I >>>> know little about how qemu/kvm migration actually works. >>>> >>>> >>> Yes but that's not live migration. For live migration, the trick is that >>> you are running locally but send changes to remote guest. For that, we >>> need to put qemu in the middle between the device and the guest, so it >>> can detect activity and update the remote side. >>> >>> And the best way to do that is to take poll eventfd that device assigns >>> and push eventfd that kvm polls. To switch between this setup >>> and the one where kvm polls the ventfd from device directly, >>> you need deassign. >>> >>> >> So its still not clear why the distinction between >> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple >> close(). Can you elaborate? >> > > > The fd needs to be left assigned to the device, so that we can poll > the fd and get events, then forward them to kvm. > Ah, ok. Now I get what you are trying to do. Well, per the PM I sent you this morning, I figured out the magic to resolve the locking issues. So we should be able to add DEASSIGN logic soon, I hope. -Greg
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 2c8028c..67e4eca 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -29,6 +29,7 @@ #include <linux/list.h> #include <linux/eventfd.h> #include <linux/srcu.h> +#include <linux/module.h> /* * -------------------------------------------------------------------- @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) cleanup_srcu_struct(&irqfd->srcu); kfree(irqfd); + module_put(THIS_MODULE); }