Message ID | 20200211205252.GA10003@ziepe.ca (mailing list archive)
---|---
State | New, archived
Series | [v3] mm/mmu_notifier: prevent unpaired invalidate_start and invalidate_end
On 2/11/20 12:52 PM, Jason Gunthorpe wrote:
> Many users of the mmu_notifier invalidate_range callbacks maintain
> locking/counters/etc on a paired basis and have long expected that
> invalidate_range_start/end() are always paired.
>
> [...]
>
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index ef3973a5d34a94..f3aba7a970f576 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
>
> [...]
>
> + * non-blocking invalidation is problematic as a requirement to block results in
> + * the invalidation being aborted, however due to the use of RCU we have no
> + * reliable way to ensure that every sueessful invalidate_range_start() results

s/sueessful/successful

> + * in a call to invalidate_range_end().
> + *
> + * Thus to support blocking only the last subscription in the list can have
> + * invalidate_range_end() set.
> + */
> +static void
> +mn_hist_add_subscription(struct mmu_notifier_subscriptions *subscriptions,
> +			 struct mmu_notifier *subscription)

We have mn_hlist_xxx in a number of places in mmu_notifier.c. Seems like
this should be named mn_hlist_add_subscription().

> +{
> +	struct mmu_notifier *last = NULL;
> +	struct mmu_notifier *itr;
> +
> +	hlist_for_each_entry(itr, &subscriptions->list, hlist)
> +		last = itr;
> +
> +	if (last && last->ops->invalidate_range_end &&
> +	    subscription->ops->invalidate_range_end) {
> +		subscriptions->no_blocking = true;
> +		pr_warn_once(
> +			"%s (%d) created two mmu_notifier's with invalidate_range_end(): %ps and %ps, non-blocking notifiers disabled\n",

line length?

> +			current->comm, current->pid,
> +			last->ops->invalidate_range_end,
> +			subscription->ops->invalidate_range_end);
> +	}
> +	if (!last || !last->ops->invalidate_range_end)
> +		subscriptions->no_blocking = false;
> +
> +	if (last && subscription->ops->invalidate_range_end)
> +		hlist_add_behind_rcu(&subscription->hlist, &last->hlist);
> +	else
> +		hlist_add_head_rcu(&subscription->hlist, &subscriptions->list);
> +}
> +
> /*
>  * Same as mmu_notifier_register but here the caller must hold the mmap_sem in
>  * write mode. A NULL mn signals the notifier is being registered for itree
> @@ -660,8 +707,8 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
>  		subscription->users = 1;
>
>  		spin_lock(&mm->notifier_subscriptions->lock);
> -		hlist_add_head_rcu(&subscription->hlist,
> -				   &mm->notifier_subscriptions->list);
> +		mn_hist_add_subscription(mm->notifier_subscriptions,
> +					 subscription);
>  		spin_unlock(&mm->notifier_subscriptions->lock);
>  	} else
>  		mm->notifier_subscriptions->has_itree = true;

Other than some nits, looks good to me so you can add:
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
On Tue, Feb 11, 2020 at 01:28:42PM -0800, Ralph Campbell wrote:
> > +/*
> > + * Add a hlist subscription to the list. The list is kept sorted by the
> > + * existence of ops->invalidate_range_end. If there is more than one
> > + * invalidate_range_end in the list then this process can no longer support
> > + * non-blocking invalidation.
> > + *
> > + * non-blocking invalidation is problematic as a requirement to block results in
> > + * the invalidation being aborted, however due to the use of RCU we have no
> > + * reliable way to ensure that every sueessful invalidate_range_start() results
>
> s/sueessful/successful

woops, yes, two spellos, thanks

> > +		pr_warn_once(
> > +			"%s (%d) created two mmu_notifier's with invalidate_range_end(): %ps and %ps, non-blocking notifiers disabled\n",
>
> line length?

Style guide is to keep strings across the 80 cols for grepability

> [...]
>
> Other than some nits, looks good to me so you can add:
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

Great, thanks!

Jason
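[For readers unfamiliar with the rule Jason is citing: the kernel's coding
style (Documentation/process/coding-style.rst) says never to break
user-visible strings such as printk messages, so a message seen in a log can
be grepped back to its source file. A sketch of the two shapes, reusing the
patch's own call site; the second variant is shown only to illustrate what
wrapping at 80 columns would do:

	/* grep-friendly, as in the patch: the format string stays whole
	 * even though it exceeds 80 columns */
	pr_warn_once(
		"%s (%d) created two mmu_notifier's with invalidate_range_end(): %ps and %ps, non-blocking notifiers disabled\n",
		current->comm, current->pid,
		last->ops->invalidate_range_end,
		subscription->ops->invalidate_range_end);

	/* grep-hostile: splitting the string means a search for the full
	 * phrase "non-blocking notifiers disabled" together with the rest
	 * of the message no longer matches any one line of source */
	pr_warn_once("%s (%d) created two mmu_notifier's with "
		     "invalidate_range_end(): %ps and %ps, non-blocking "
		     "notifiers disabled\n",
		     current->comm, current->pid,
		     last->ops->invalidate_range_end,
		     subscription->ops->invalidate_range_end);
]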
On Tue, Feb 11, 2020 at 04:52:52PM -0400, Jason Gunthorpe wrote:
> Many users of the mmu_notifier invalidate_range callbacks maintain
> locking/counters/etc on a paired basis and have long expected that
> invalidate_range_start/end() are always paired.
>
> [...]
>
> I think this also addresses Michal's concern about a 'big hammer' as
> it probably won't ever trigger now.

I'm going to put this in linux-next to see if there are any reports of
the pr_warn failing.

Michal, are you happy with this solution now?

Thanks,
Jason
On Fri, Feb 28, 2020 at 09:50:06AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 11, 2020 at 04:52:52PM -0400, Jason Gunthorpe wrote:
> > Many users of the mmu_notifier invalidate_range callbacks maintain
> > locking/counters/etc on a paired basis and have long expected that
> > invalidate_range_start/end() are always paired.
> >
> > [...]
>
> I'm going to put this in linux-next to see if there are any reports of
> the pr_warn failing.
>
> Michal, are you happy with this solution now?

It's been a month in linux-next now, with no complaints. If there are
no comments I will go ahead to send it in the hmm PR.

Thanks,
Jason
On Tue 24-03-20 16:41:37, Jason Gunthorpe wrote:
> On Fri, Feb 28, 2020 at 09:50:06AM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 11, 2020 at 04:52:52PM -0400, Jason Gunthorpe wrote:
> > > Many users of the mmu_notifier invalidate_range callbacks maintain
> > > locking/counters/etc on a paired basis and have long expected that
> > > invalidate_range_start/end() are always paired.
> > >
> > > [...]
> >
> > I'm going to put this in linux-next to see if there are any reports of
> > the pr_warn failing.
> >
> > Michal, are you happy with this solution now?
>
> It's been a month in linux-next now, with no complaints. If there are
> no comments I will go ahead to send it in the hmm PR.

I will not block this but it still looks like a wrong approach. A more
robust solution would be to allow calling invalidate_range_end even for
the failing invalidate_start.
On Wed, Mar 25, 2020 at 09:01:17AM +0100, Michal Hocko wrote:
> > > I'm going to put this in linux-next to see if there are any reports of
> > > the pr_warn failing.
> > >
> > > Michal, are you happy with this solution now?
> >
> > It's been a month in linux-next now, with no complaints. If there are
> > no comments I will go ahead to send it in the hmm PR.
>
> I will not block this but it still looks like a wrong approach. A more
> robust solution would be to allow calling invalidate_range_end even for
> the failing invalidate_start.

That requires reliably walking a rcu list backwards under rcu. I don't
have a good feeling about that algorithm. Do you know of a solution?

Since we don't actually have any users that care about this any more,
I felt this testable solution was better.

Jason
On Wed 25-03-20 09:14:07, Jason Gunthorpe wrote:
> On Wed, Mar 25, 2020 at 09:01:17AM +0100, Michal Hocko wrote:
> > > > I'm going to put this in linux-next to see if there are any reports of
> > > > the pr_warn failing.
> > > >
> > > > Michal, are you happy with this solution now?
> > >
> > > It's been a month in linux-next now, with no complaints. If there are
> > > no comments I will go ahead to send it in the hmm PR.
> >
> > I will not block this but it still looks like a wrong approach. A more
> > robust solution would be to allow calling invalidate_range_end even for
> > the failing invalidate_start.
>
> That requires reliably walking a rcu list backwards under rcu. I don't
> have a good feeling about that algorithm. Do you know of a solution?

No, not really. I would be pushing for it if I had. Maybe we want a
different data structure than the RCU list. But...

> Since we don't actually have any users that care about this any more,
> I felt this testable solution was better.

... you are right that this is probably the best short term workaround.
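[Background on the backwards-walk problem discussed above: the kernel's hlist
is singly linked in the forward direction only. A minimal sketch mirroring
the definitions in include/linux/types.h; the container_of() recovery step is
shown purely for illustration, it is not an API the kernel offers for this:

struct hlist_node {
	struct hlist_node *next;	/* next element, or NULL at the tail */
	struct hlist_node **pprev;	/* address of the previous element's
					 * 'next' field (or of head->first),
					 * NOT a pointer to the element */
};

struct hlist_head {
	struct hlist_node *first;
};

/*
 * Stepping backwards means guessing the enclosing node from pprev, e.g.
 * container_of(n->pprev, struct hlist_node, next), which is already wrong
 * for the first element (its pprev points into the hlist_head). Worse,
 * under RCU only the forward 'next' pointers are published with the
 * ordering readers rely on, and pprev can be rewritten by a concurrent
 * deletion at any moment. So when a start() callback in the middle of the
 * list returns -EAGAIN, there is no reliable way to walk back and call
 * end() on the subscriptions that were already started.
 */
]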
> On Feb 11, 2020, at 3:52 PM, Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> Many users of the mmu_notifier invalidate_range callbacks maintain
> locking/counters/etc on a paired basis and have long expected that
> invalidate_range_start/end() are always paired.
>
> [...]
>
> Keep the notifier hlist sorted so that notifiers that require
> invalidate_range_end() are always last, and if two are added then disable
> non-blocking invalidation for the mm.
>
> A warning is printed for this case; if in future we determine this never
> happens then we can simply fail during registration when there are
> unsupported combinations of notifiers.

This will generate a warning when running a simple qemu-kvm on arm64,

qemu-kvm (37712) created two mmu_notifier's with invalidate_range_end():
kvm_mmu_notifier_invalidate_range_end and
kvm_mmu_notifier_invalidate_range_end, non-blocking notifiers disabled

> [...]
On Thu, Mar 26, 2020 at 09:06:12AM -0400, Qian Cai wrote:
> > On Feb 11, 2020, at 3:52 PM, Jason Gunthorpe <jgg@mellanox.com> wrote:
> >
> > [...]
> >
> > A warning is printed for this case; if in future we determine this never
> > happens then we can simply fail during registration when there are
> > unsupported combinations of notifiers.
>
> This will generate a warning when running a simple qemu-kvm on arm64,
>
> qemu-kvm (37712) created two mmu_notifier's with
> invalidate_range_end(): kvm_mmu_notifier_invalidate_range_end and
> kvm_mmu_notifier_invalidate_range_end, non-blocking notifiers
> disabled

Thank you Qian, this is very valuable information

It seems this solution will not work if kvm is registering multiple
notifiers. I will take the patch out of linux-next

Jason
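[How a single process ends up with two identical KVM subscriptions: each
KVM_CREATE_VM registers its own mmu_notifier on the caller's mm (kvm_create_vm()
calls kvm_init_mmu_notifier()), so any process holding two VM fds at once trips
the check. QEMU is known to create a short-lived scratch VM while probing host
capabilities; whether that is the exact trigger on Qian's arm64 system is an
assumption. A hypothetical minimal reproducer would look like:

/* Hypothetical reproducer: two VMs in one process mean two mmu_notifiers
 * whose ops both set kvm_mmu_notifier_invalidate_range_end, which is
 * exactly the pr_warn_once() condition added by the patch. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	if (kvm < 0)
		return 1;

	/* VM #1: kvm_create_vm() registers mmu_notifier #1 on this mm */
	int vm1 = ioctl(kvm, KVM_CREATE_VM, 0);

	/* VM #2 in the same process (e.g. a scratch VM used for feature
	 * probing): mmu_notifier #2 on the same mm -> the warning fires
	 * and non-blocking invalidation is disabled for the whole mm */
	int vm2 = ioctl(kvm, KVM_CREATE_VM, 0);

	return !(vm1 >= 0 && vm2 >= 0);
}
]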
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ef3973a5d34a94..f3aba7a970f576 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -37,7 +37,8 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
 struct mmu_notifier_subscriptions {
 	/* all mmu notifiers registered in this mm are queued in this list */
 	struct hlist_head list;
-	bool has_itree;
+	u8 has_itree;
+	u8 no_blocking;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
 	unsigned long invalidate_seq;
@@ -475,6 +476,10 @@ static int mn_hlist_invalidate_range_start(
 	int ret = 0;
 	int id;
 
+	if (unlikely(subscriptions->no_blocking &&
+		     !mmu_notifier_range_blockable(range)))
+		return -EAGAIN;
+
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist) {
 		const struct mmu_notifier_ops *ops = subscription->ops;
@@ -590,6 +595,48 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 	srcu_read_unlock(&srcu, id);
 }
 
+/*
+ * Add a hlist subscription to the list. The list is kept sorted by the
+ * existence of ops->invalidate_range_end. If there is more than one
+ * invalidate_range_end in the list then this process can no longer support
+ * non-blocking invalidation.
+ *
+ * non-blocking invalidation is problematic as a requirement to block results in
+ * the invalidation being aborted, however due to the use of RCU we have no
+ * reliable way to ensure that every successful invalidate_range_start() results
+ * in a call to invalidate_range_end().
+ *
+ * Thus to support blocking only the last subscription in the list can have
+ * invalidate_range_end() set.
+ */
+static void
+mn_hist_add_subscription(struct mmu_notifier_subscriptions *subscriptions,
+			 struct mmu_notifier *subscription)
+{
+	struct mmu_notifier *last = NULL;
+	struct mmu_notifier *itr;
+
+	hlist_for_each_entry(itr, &subscriptions->list, hlist)
+		last = itr;
+
+	if (last && last->ops->invalidate_range_end &&
+	    subscription->ops->invalidate_range_end) {
+		subscriptions->no_blocking = true;
+		pr_warn_once(
+			"%s (%d) created two mmu_notifier's with invalidate_range_end(): %ps and %ps, non-blocking notifiers disabled\n",
+			current->comm, current->pid,
+			last->ops->invalidate_range_end,
+			subscription->ops->invalidate_range_end);
+	}
+	if (!last || !last->ops->invalidate_range_end)
+		subscriptions->no_blocking = false;
+
+	if (last && subscription->ops->invalidate_range_end)
+		hlist_add_behind_rcu(&subscription->hlist, &last->hlist);
+	else
+		hlist_add_head_rcu(&subscription->hlist, &subscriptions->list);
+}
+
 /*
  * Same as mmu_notifier_register but here the caller must hold the mmap_sem in
  * write mode. A NULL mn signals the notifier is being registered for itree
@@ -660,8 +707,8 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 		subscription->users = 1;
 
 		spin_lock(&mm->notifier_subscriptions->lock);
-		hlist_add_head_rcu(&subscription->hlist,
-				   &mm->notifier_subscriptions->list);
+		mn_hist_add_subscription(mm->notifier_subscriptions,
+					 subscription);
 		spin_unlock(&mm->notifier_subscriptions->lock);
 	} else
 		mm->notifier_subscriptions->has_itree = true;
Many users of the mmu_notifier invalidate_range callbacks maintain
locking/counters/etc on a paired basis and have long expected that
invalidate_range_start/end() are always paired.

For instance kvm_mmu_notifier_invalidate_range_end() undoes
kvm->mmu_notifier_count which was incremented during start().

The recent change to add non-blocking notifiers breaks this assumption
when multiple notifiers are present in the list. When EAGAIN is returned
from an invalidate_range_start() then no invalidate_range_end() calls are
made, even if the subscription's start had previously been called.

Unfortunately, due to the RCU list traversal we can't reliably generate a
subset of the linked list representing the notifiers already called to
generate an invalidate_range_end() pairing.

One case works correctly: if only one subscription requires
invalidate_range_end() and it is the last entry in the hlist, then when
invalidate_range_start() returns -EAGAIN there will be nothing to unwind.

Keep the notifier hlist sorted so that notifiers that require
invalidate_range_end() are always last, and if two are added then disable
non-blocking invalidation for the mm.

A warning is printed for this case; if in future we determine this never
happens then we can simply fail during registration when there are
unsupported combinations of notifiers.

Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/mmu_notifier.c | 53 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)

v1: https://lore.kernel.org/linux-mm/20190724152858.GB28493@ziepe.ca/
v2: https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/
 * Abandon attempting to fix it by calling invalidate_range_end() during an
   EAGAIN start
 * Just trivially ban multiple subscriptions
v3:
 * Be more sophisticated: ban multiple subscriptions only if the result is
   a failure. Allows multiple subscriptions without invalidate_range_end
 * Include a printk when this condition is hit (Michal)

At this point the rework Christoph requested during the first posting
is completed and there are now only 3 drivers using
invalidate_range_end():

drivers/misc/mic/scif/scif_dma.c:   .invalidate_range_end = scif_mmu_notifier_invalidate_range_end};
drivers/misc/sgi-gru/grutlbpurge.c: .invalidate_range_end = gru_invalidate_range_end,
virt/kvm/kvm_main.c:                .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,

While I think it is unlikely that any of these drivers will be used in
combination with each other, display a printk in the hope of catching it
if they are.

Someday I expect to just fail the registration on this condition.

I think this also addresses Michal's concern about a 'big hammer' as
it probably won't ever trigger now.

Regards,
Jason
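[To make the pairing hazard concrete outside the kernel, here is a small
userspace model; all names are invented for illustration, and the loop stands
in for the RCU hlist walk in mn_hlist_invalidate_range_start(). With an
end()-requiring subscription anywhere but last, an -EAGAIN part-way through
the walk strands an unpaired start(); with it sorted last, an abort means its
start() never ran:

#include <stdio.h>

#define EAGAIN 11

struct sub {
	const char *name;
	int (*start)(struct sub *);	/* may fail with -EAGAIN */
	void (*end)(struct sub *);	/* NULL if no invalidate_range_end */
};

static int ok_start(struct sub *s)    { printf("%s: start\n", s->name); return 0; }
static int busy_start(struct sub *s)  { printf("%s: start -> -EAGAIN\n", s->name); return -EAGAIN; }
static void paired_end(struct sub *s) { printf("%s: end\n", s->name); }

/* Forward-only walk, like the RCU hlist traversal: on failure we can
 * only stop -- the entries whose start() already ran cannot be revisited. */
static int range_start(struct sub **list, int n)
{
	for (int i = 0; i < n; i++) {
		int ret = list[i]->start(list[i]);
		if (ret)
			return ret;	/* earlier start()s stay unpaired */
	}
	return 0;
}

int main(void)
{
	struct sub a = { "A(end)", ok_start,   paired_end };	/* needs end() */
	struct sub b = { "B",      busy_start, NULL };

	/* Unsorted: A's start() runs, then B aborts -> A never sees end(). */
	struct sub *bad[] = { &a, &b };
	printf("unsorted: ret=%d\n\n", range_start(bad, 2));

	/* The patch's invariant: the end()-requiring subscription is last,
	 * so an abort implies its start() never ran, nothing to unwind. */
	struct sub *good[] = { &b, &a };
	printf("sorted:   ret=%d\n", range_start(good, 2));
	return 0;
}

The same reasoning shows why two end()-requiring subscriptions cannot both be
last, which is the case the patch handles by setting no_blocking.]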