| Message ID | 20160929081818.GE28107@nuc-i3427.alporthouse.com (mailing list archive) |
|---|---|
| State | New, archived |
On Thu, Sep 29, 2016 at 1:18 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
>> On Marvell berlin arm64 platforms, I see the preemptoff tracer report
>> a max 26543 us latency at __purge_vmap_area_lazy. This latency is
>> awfully bad for STB. The ftrace log also shows that __free_vmap_area
>> contributes most of the latency. I noticed that Joel mentioned the same
>> issue[1] on x86 platforms and gave two solutions, but it seems no patch
>> was sent out for this purpose.
>>
>> This patch adopts Joel's first solution, but I use 16MB per core
>> rather than 8MB per core for the number of lazy_max_pages. After this
>> patch, the preemptoff tracer reports a max 6455us latency, reduced to
>> 1/4 of the original result.
>
> My understanding is that
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 91f44e78c516..3f7c6d6969ac 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
>  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                                         int sync, int force_flush)
>  {
> -       static DEFINE_SPINLOCK(purge_lock);
>         struct llist_node *valist;
>         struct vmap_area *va;
>         struct vmap_area *n_va;
> @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>          * should not expect such behaviour. This just simplifies locking for
>          * the case that isn't actually used at the moment anyway.
>          */
> -       if (!sync && !force_flush) {
> -               if (!spin_trylock(&purge_lock))
> -                       return;
> -       } else
> -               spin_lock(&purge_lock);
> -
>         if (sync)
>                 purge_fragmented_blocks_allcpus();
>
> @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>                         __free_vmap_area(va);
>                 spin_unlock(&vmap_area_lock);
>         }
> -       spin_unlock(&purge_lock);
>  }
>
[..]
> should now be safe. That should significantly reduce the preempt-disabled
> section, I think.

I believe that the purge_lock is supposed to prevent concurrent purges
from happening.

Consider the case where another concurrent overflow happens in
alloc_vmap_area() between the spin_unlock and the purge:

        spin_unlock(&vmap_area_lock);
        if (!purged)
                purge_vmap_area_lazy();

Then the two purges would happen at the same time and could subtract
vmap_lazy_nr twice. I had proposed changing it to a mutex in [1]. How do
you feel about that? Let me know your suggestions, thanks. I am also OK
with reducing the lazy_max_pages value.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1603.2/04803.html

Regards,
Joel
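For readers following along, the mutex conversion Joel alludes to would look roughly like the sketch below. This is an illustrative sketch only, not the patch from [1]; the `_sketch` name is invented here. It simply swaps the purge_lock spinlock for a sleeping lock so a second purger waits for the first instead of either skipping the purge or spinning with preemption disabled.

```c
/*
 * Illustrative sketch only (not the posted patch): serialise purges with
 * a mutex instead of the purge_lock spinlock, so a concurrent caller can
 * sleep while another thread finishes the purge.
 */
static DEFINE_MUTEX(purge_mutex);

static void __purge_vmap_area_lazy_sketch(unsigned long *start,
					  unsigned long *end,
					  int sync, int force_flush)
{
	if (!sync && !force_flush) {
		/* Lazy callers just skip if a purge is already running. */
		if (!mutex_trylock(&purge_mutex))
			return;
	} else
		mutex_lock(&purge_mutex);

	/* ... walk the lazy-free list, flush the TLB, free the areas ... */

	mutex_unlock(&purge_mutex);
}
```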
On Sat, Oct 08, 2016 at 08:43:51PM -0700, Joel Fernandes wrote:
> On Thu, Sep 29, 2016 at 1:18 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > On Thu, Sep 29, 2016 at 03:34:11PM +0800, Jisheng Zhang wrote:
> >> On Marvell berlin arm64 platforms, I see the preemptoff tracer report
> >> a max 26543 us latency at __purge_vmap_area_lazy. This latency is
> >> awfully bad for STB. The ftrace log also shows that __free_vmap_area
> >> contributes most of the latency. I noticed that Joel mentioned the same
> >> issue[1] on x86 platforms and gave two solutions, but it seems no patch
> >> was sent out for this purpose.
> >>
> >> This patch adopts Joel's first solution, but I use 16MB per core
> >> rather than 8MB per core for the number of lazy_max_pages. After this
> >> patch, the preemptoff tracer reports a max 6455us latency, reduced to
> >> 1/4 of the original result.
> >
> > My understanding is that
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 91f44e78c516..3f7c6d6969ac 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
> >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                                         int sync, int force_flush)
> >  {
> > -       static DEFINE_SPINLOCK(purge_lock);
> >         struct llist_node *valist;
> >         struct vmap_area *va;
> >         struct vmap_area *n_va;
> > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >          * should not expect such behaviour. This just simplifies locking for
> >          * the case that isn't actually used at the moment anyway.
> >          */
> > -       if (!sync && !force_flush) {
> > -               if (!spin_trylock(&purge_lock))
> > -                       return;
> > -       } else
> > -               spin_lock(&purge_lock);
> > -
> >         if (sync)
> >                 purge_fragmented_blocks_allcpus();
> >
> > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> >                         __free_vmap_area(va);
> >                 spin_unlock(&vmap_area_lock);
> >         }
> > -       spin_unlock(&purge_lock);
> >  }
> >
> [..]
> > should now be safe. That should significantly reduce the preempt-disabled
> > section, I think.
>
> I believe that the purge_lock is supposed to prevent concurrent purges
> from happening.
>
> Consider the case where another concurrent overflow happens in
> alloc_vmap_area() between the spin_unlock and the purge:
>
>         spin_unlock(&vmap_area_lock);
>         if (!purged)
>                 purge_vmap_area_lazy();
>
> Then the two purges would happen at the same time and could subtract
> vmap_lazy_nr twice.

That itself is not the problem, as each instance of
__purge_vmap_area_lazy() operates on its own freelist, and so there will
be no double accounting.

However, removing the lock removes the serialisation, which does mean
that alloc_vmap_area() will not block on another thread conducting the
purge, and so it will try to reallocate before that purge is complete and
the freed area made available. It also means that we are doing the
atomic_sub(vmap_lazy_nr) too early.

That supports making the outer lock a mutex as you suggested. But I think
cond_resched_lock() is better for the vmap_area_lock (just because it
turns out to be an expensive loop and we may want the reschedule).
-Chris
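The cond_resched_lock() idea Chris mentions would apply to the final loop of the quoted function, roughly as below. This is a sketch under the assumption that `valist`, `va` and `n_va` are the thread-local freelist variables from the quoted code; because the `_safe` iterator is used and the list is private to this purge, it is fine for the lock to be dropped and retaken mid-walk.

```c
/*
 * Sketch: add a preemption point to the expensive free loop.
 * cond_resched_lock() drops vmap_area_lock, reschedules if needed,
 * and retakes the lock, so preemption is no longer disabled for the
 * whole walk.
 */
spin_lock(&vmap_area_lock);
llist_for_each_entry_safe(va, n_va, valist, purge_list) {
	__free_vmap_area(va);
	cond_resched_lock(&vmap_area_lock);
}
spin_unlock(&vmap_area_lock);
```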
On Sun, Oct 9, 2016 at 5:42 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
[..]
>> > My understanding is that
>> >
>> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> > index 91f44e78c516..3f7c6d6969ac 100644
>> > --- a/mm/vmalloc.c
>> > +++ b/mm/vmalloc.c
>> > @@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
>> >  static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >                                         int sync, int force_flush)
>> >  {
>> > -       static DEFINE_SPINLOCK(purge_lock);
>> >         struct llist_node *valist;
>> >         struct vmap_area *va;
>> >         struct vmap_area *n_va;
>> > @@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >          * should not expect such behaviour. This just simplifies locking for
>> >          * the case that isn't actually used at the moment anyway.
>> >          */
>> > -       if (!sync && !force_flush) {
>> > -               if (!spin_trylock(&purge_lock))
>> > -                       return;
>> > -       } else
>> > -               spin_lock(&purge_lock);
>> > -
>> >         if (sync)
>> >                 purge_fragmented_blocks_allcpus();
>> >
>> > @@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
>> >                         __free_vmap_area(va);
>> >                 spin_unlock(&vmap_area_lock);
>> >         }
>> > -       spin_unlock(&purge_lock);
>> >  }
>> >
>> [..]
>> > should now be safe. That should significantly reduce the preempt-disabled
>> > section, I think.
>>
>> I believe that the purge_lock is supposed to prevent concurrent purges
>> from happening.
>>
>> Consider the case where another concurrent overflow happens in
>> alloc_vmap_area() between the spin_unlock and the purge:
>>
>>         spin_unlock(&vmap_area_lock);
>>         if (!purged)
>>                 purge_vmap_area_lazy();
>>
>> Then the two purges would happen at the same time and could subtract
>> vmap_lazy_nr twice.
>
> That itself is not the problem, as each instance of
> __purge_vmap_area_lazy() operates on its own freelist, and so there will
> be no double accounting.
>
> However, removing the lock removes the serialisation, which does mean
> that alloc_vmap_area() will not block on another thread conducting the
> purge, and so it will try to reallocate before that purge is complete and
> the freed area made available. It also means that we are doing the
> atomic_sub(vmap_lazy_nr) too early.
>
> That supports making the outer lock a mutex as you suggested. But I think
> cond_resched_lock() is better for the vmap_area_lock (just because it
> turns out to be an expensive loop and we may want the reschedule).
> -Chris

OK, so I'll submit a patch that uses a mutex for purge_lock and
cond_resched_lock() for the vmap_area_lock, as you suggested. I'll also
drop lazy_max_pages to 8MB, as Andi suggested, to reduce the lock hold
time. Let me know if you have any objections.

Thanks,
Joel
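For reference, the lazy_max_pages tuning being discussed bounds how much lazily-freed vmalloc address space may accumulate before a purge is forced. A simplified sketch of an 8MB-per-core budget is shown below; the stock kernel computes this budget with a different (log-scaled) formula and the `_sketch` name is invented, so treat this as an illustration of the knob rather than the actual patch.

```c
#include <linux/cpumask.h>

/*
 * Sketch only: cap the lazily-freed vmalloc space at roughly 8MB per
 * online CPU before forcing a purge.  A smaller budget means more
 * frequent but shorter purges, i.e. shorter preempt-off sections.
 */
static unsigned long lazy_max_pages_sketch(void)
{
	return num_online_cpus() * (8UL * 1024 * 1024 / PAGE_SIZE);
}
```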
On Sun, Oct 09, 2016 at 12:00:31PM -0700, Joel Fernandes wrote:
> OK, so I'll submit a patch that uses a mutex for purge_lock and
> cond_resched_lock() for the vmap_area_lock, as you suggested. I'll also
> drop lazy_max_pages to 8MB, as Andi suggested, to reduce the lock hold
> time. Let me know if you have any objections.

The downside of using a mutex here, though, is that we may be called from
contexts that cannot sleep (alloc_vmap_area), or reschedule for that
matter! If we change the notion of "purged", we can forgo the mutex in
favour of spinning on the direct reclaim path. That just leaves the
complication of whether to use cond_resched_lock() or a lock around the
individual __free_vmap_area().
-Chris
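One way to read Chris's "spin on the direct reclaim path" suggestion is sketched below. The names (vmap_purging, the two helpers) are invented for illustration and this is not a posted patch; the point is that lazy callers bail out when a purge is already in flight, while the overflow path in alloc_vmap_area() busy-waits for it instead of sleeping on a mutex.

```c
/*
 * Hypothetical sketch, not a posted patch: replace the purge mutex with
 * an atomic "a purge is running" flag.
 */
static atomic_t vmap_purging = ATOMIC_INIT(0);

/* Opportunistic (lazy) purge: give up if someone else is purging. */
static void try_purge_vmap_area_lazy_sketch(void)
{
	if (atomic_cmpxchg(&vmap_purging, 0, 1) != 0)
		return;

	/* ... purge the lazy list and flush TLBs ... */

	atomic_set(&vmap_purging, 0);
}

/*
 * Direct-reclaim path (vmap address space exhausted): we need the
 * lazily-freed ranges back before retrying the allocation, so spin
 * until any in-flight purge finishes and then purge ourselves.  No
 * sleeping lock is taken, so this stays usable from callers that a
 * mutex would rule out.
 */
static void purge_vmap_area_direct_sketch(void)
{
	while (atomic_cmpxchg(&vmap_purging, 0, 1) != 0)
		cpu_relax();

	/* ... same purge body as above ... */

	atomic_set(&vmap_purging, 0);
}
```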
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 91f44e78c516..3f7c6d6969ac 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -626,7 +626,6 @@ void set_iounmap_nonlazy(void)
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                                        int sync, int force_flush)
 {
-       static DEFINE_SPINLOCK(purge_lock);
        struct llist_node *valist;
        struct vmap_area *va;
        struct vmap_area *n_va;
@@ -637,12 +636,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
         * should not expect such behaviour. This just simplifies locking for
         * the case that isn't actually used at the moment anyway.
         */
-       if (!sync && !force_flush) {
-               if (!spin_trylock(&purge_lock))
-                       return;
-       } else
-               spin_lock(&purge_lock);
-
        if (sync)
                purge_fragmented_blocks_allcpus();

@@ -667,7 +660,6 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
                        __free_vmap_area(va);
                spin_unlock(&vmap_area_lock);
        }
-       spin_unlock(&purge_lock);
 }

 /*