Message ID | 87y3nyd4pu.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
On Thu, Oct 26, 2017 at 01:26:37PM +1100, NeilBrown wrote:
> 
> The synchronize_rcu() in namespace_unlock() is called every time
> a filesystem is unmounted.  If a great many filesystems are mounted,
> this can cause a noticeable slow-down in, for example, system shutdown.
> 
> The sequence:
>   mkdir -p /tmp/Mtest/{0..5000}
>   time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
>   time umount /tmp/Mtest/*
> 
> on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
> 100 seconds to unmount them.
> 
> Boot the same VM with 1 CPU and it takes 18 seconds to mount the
> tmpfs filesystems, but only 36 to unmount.
> 
> If we change the synchronize_rcu() to synchronize_rcu_expedited()
> the umount time on a 4-cpu VM is 8 seconds to mount and 0.6 to
> unmount.
> 
> I think this 200-fold speed-up is worth the slightly higher system
> impact of using synchronize_rcu_expedited().
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> 
> Cc: to Paul and Josh in case they'll correct me if using _expedited()
> is really bad here.

I suspect that filesystem unmount is pretty rare in production real-time
workloads, which are the ones that might care.  So I would guess that
this is OK.

If the real-time guys ever do want to do filesystem unmounts while their
real-time applications are running, they might modify this so that it can
use synchronize_rcu() instead for real-time builds of the kernel.

But just for completeness, one way to make this work across the board
might be to instead use call_rcu(), with the callback function kicking
off a workqueue handler to do the rest of the unmount.  Of course,
in saying that, I am ignoring any mutexes that you might be holding
across this whole thing, and also ignoring any problems that might arise
when returning to userspace with some portion of the unmount operation
still pending.  (For example, someone unmounting a filesystem and then
immediately remounting that same filesystem.)

                                                        Thanx, Paul

> Thanks,
> NeilBrown
> 
> 
>  fs/namespace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 3b601f115b6c..fce91c447fab 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1420,7 +1420,7 @@ static void namespace_unlock(void)
>  	if (likely(hlist_empty(&head)))
>  		return;
> 
> -	synchronize_rcu();
> +	synchronize_rcu_expedited();
> 
>  	group_pin_kill(&head);
>  }
> -- 
> 2.14.0.rc0.dirty
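For reference, the call_rcu()-plus-workqueue deferral Paul describes usually takes roughly the shape below. This is only a minimal sketch: the structure and function names are hypothetical and nothing here comes from fs/namespace.c or from an actual patch.

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

/* Hypothetical per-unmount state; not an actual fs/namespace.c structure. */
struct deferred_unmount {
	struct rcu_head rcu;
	struct work_struct work;
	/* ... whatever the final cleanup needs ... */
};

static void deferred_unmount_work(struct work_struct *work)
{
	struct deferred_unmount *d =
		container_of(work, struct deferred_unmount, work);

	/*
	 * A grace period has elapsed by the time we get here, so the
	 * rest of the unmount (which may sleep) can run safely.
	 */
	kfree(d);
}

static void deferred_unmount_rcu(struct rcu_head *rcu)
{
	struct deferred_unmount *d =
		container_of(rcu, struct deferred_unmount, rcu);

	/* RCU callbacks run in softirq context, so punt to a workqueue. */
	INIT_WORK(&d->work, deferred_unmount_work);
	schedule_work(&d->work);
}

/* Instead of waiting in synchronize_rcu(), the caller would then do: */
/*	call_rcu(&d->rcu, deferred_unmount_rcu);	*/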
On Thu, Oct 26, 2017 at 05:27:43AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 26, 2017 at 01:26:37PM +1100, NeilBrown wrote:
> > 
> > The synchronize_rcu() in namespace_unlock() is called every time
> > a filesystem is unmounted.  If a great many filesystems are mounted,
> > this can cause a noticeable slow-down in, for example, system shutdown.
> > 
> > The sequence:
> >   mkdir -p /tmp/Mtest/{0..5000}
> >   time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
> >   time umount /tmp/Mtest/*
> > 
> > on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
> > 100 seconds to unmount them.
> > 
> > Boot the same VM with 1 CPU and it takes 18 seconds to mount the
> > tmpfs filesystems, but only 36 to unmount.
> > 
> > If we change the synchronize_rcu() to synchronize_rcu_expedited()
> > the umount time on a 4-cpu VM is 8 seconds to mount and 0.6 to
> > unmount.
> > 
> > I think this 200-fold speed-up is worth the slightly higher system
> > impact of using synchronize_rcu_expedited().
> > 
> > Signed-off-by: NeilBrown <neilb@suse.com>
> > ---
> > 
> > Cc: to Paul and Josh in case they'll correct me if using _expedited()
> > is really bad here.
> 
> I suspect that filesystem unmount is pretty rare in production real-time
> workloads, which are the ones that might care.  So I would guess that
> this is OK.
> 
> If the real-time guys ever do want to do filesystem unmounts while their
> real-time applications are running, they might modify this so that it can
> use synchronize_rcu() instead for real-time builds of the kernel.

Which they can already do using the rcupdate.rcu_normal boot parameter.

                                                        Thanx, Paul

> But just for completeness, one way to make this work across the board
> might be to instead use call_rcu(), with the callback function kicking
> off a workqueue handler to do the rest of the unmount.  Of course,
> in saying that, I am ignoring any mutexes that you might be holding
> across this whole thing, and also ignoring any problems that might arise
> when returning to userspace with some portion of the unmount operation
> still pending.  (For example, someone unmounting a filesystem and then
> immediately remounting that same filesystem.)
> 
>                                                         Thanx, Paul
> 
> > Thanks,
> > NeilBrown
> > 
> > 
> >  fs/namespace.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 3b601f115b6c..fce91c447fab 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1420,7 +1420,7 @@ static void namespace_unlock(void)
> >  	if (likely(hlist_empty(&head)))
> >  		return;
> > 
> > -	synchronize_rcu();
> > +	synchronize_rcu_expedited();
> > 
> >  	group_pin_kill(&head);
> >  }
> > -- 
> > 2.14.0.rc0.dirty
> > 
> 
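For reference, the parameter Paul mentions is a boot-time setting. Assuming the standard rcupdate.* parameter syntax, booting with the line below makes expedited grace-period requests such as synchronize_rcu_expedited() fall back to normal grace periods:

	rcupdate.rcu_normal=1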
On Thu, Oct 26 2017, Paul E. McKenney wrote:

> On Thu, Oct 26, 2017 at 01:26:37PM +1100, NeilBrown wrote:
>> 
>> The synchronize_rcu() in namespace_unlock() is called every time
>> a filesystem is unmounted.  If a great many filesystems are mounted,
>> this can cause a noticeable slow-down in, for example, system shutdown.
>> 
>> The sequence:
>>   mkdir -p /tmp/Mtest/{0..5000}
>>   time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
>>   time umount /tmp/Mtest/*
>> 
>> on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
>> 100 seconds to unmount them.
>> 
>> Boot the same VM with 1 CPU and it takes 18 seconds to mount the
>> tmpfs filesystems, but only 36 to unmount.
>> 
>> If we change the synchronize_rcu() to synchronize_rcu_expedited()
>> the umount time on a 4-cpu VM is 8 seconds to mount and 0.6 to
>> unmount.
>> 
>> I think this 200-fold speed-up is worth the slightly higher system
>> impact of using synchronize_rcu_expedited().
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> 
>> Cc: to Paul and Josh in case they'll correct me if using _expedited()
>> is really bad here.
>
> I suspect that filesystem unmount is pretty rare in production real-time
> workloads, which are the ones that might care.  So I would guess that
> this is OK.
>
> If the real-time guys ever do want to do filesystem unmounts while their
> real-time applications are running, they might modify this so that it can
> use synchronize_rcu() instead for real-time builds of the kernel.

Thanks for the confirmation, Paul.

>
> But just for completeness, one way to make this work across the board
> might be to instead use call_rcu(), with the callback function kicking
> off a workqueue handler to do the rest of the unmount.  Of course,
> in saying that, I am ignoring any mutexes that you might be holding
> across this whole thing, and also ignoring any problems that might arise
> when returning to userspace with some portion of the unmount operation
> still pending.  (For example, someone unmounting a filesystem and then
> immediately remounting that same filesystem.)

I had briefly considered that option, but it doesn't work.
The purpose of this synchronize_rcu() is to wait for any filename lookup
which might be locklessly touching the mountpoint to complete.
It is only after that that the real meat of the unmount happens - the
filesystem is told that the last reference is gone, and it gets to
flush any saved changes out to disk etc.
That stuff really has to happen before the umount syscall returns.

Thanks,
NeilBrown
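For readers less familiar with RCU, the guarantee Neil is relying on is the usual RCU retract-then-wait pattern. Below is a generic, self-contained sketch with made-up names; it is not fs/namespace.c code, where the lockless reader side is the RCU-walk path lookup rather than this toy reader.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct thing {
	int data;
};

static struct thing __rcu *global_thing;

/* Lockless reader, analogous to a path walk touching a mountpoint. */
static int reader(void)
{
	struct thing *t;
	int val = -1;

	rcu_read_lock();
	t = rcu_dereference(global_thing);
	if (t)
		val = t->data;
	rcu_read_unlock();
	return val;
}

/* Updater: unpublish, wait for readers, and only then tear down. */
static void retract_and_free(void)
{
	struct thing *t = rcu_dereference_protected(global_thing, true);

	RCU_INIT_POINTER(global_thing, NULL);
	synchronize_rcu();	/* every reader that could see t has finished */
	kfree(t);		/* the "real meat" of teardown is now safe */
}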
On Fri, Oct 27, 2017 at 11:45:08AM +1100, NeilBrown wrote:
> On Thu, Oct 26 2017, Paul E. McKenney wrote:
> 
> > On Thu, Oct 26, 2017 at 01:26:37PM +1100, NeilBrown wrote:
> >> 
> >> The synchronize_rcu() in namespace_unlock() is called every time
> >> a filesystem is unmounted.  If a great many filesystems are mounted,
> >> this can cause a noticeable slow-down in, for example, system shutdown.
> >> 
> >> The sequence:
> >>   mkdir -p /tmp/Mtest/{0..5000}
> >>   time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
> >>   time umount /tmp/Mtest/*
> >> 
> >> on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
> >> 100 seconds to unmount them.
> >> 
> >> Boot the same VM with 1 CPU and it takes 18 seconds to mount the
> >> tmpfs filesystems, but only 36 to unmount.
> >> 
> >> If we change the synchronize_rcu() to synchronize_rcu_expedited()
> >> the umount time on a 4-cpu VM is 8 seconds to mount and 0.6 to
> >> unmount.
> >> 
> >> I think this 200-fold speed-up is worth the slightly higher system
> >> impact of using synchronize_rcu_expedited().
> >> 
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >> 
> >> Cc: to Paul and Josh in case they'll correct me if using _expedited()
> >> is really bad here.
> > 
> > I suspect that filesystem unmount is pretty rare in production real-time
> > workloads, which are the ones that might care.  So I would guess that
> > this is OK.
> > 
> > If the real-time guys ever do want to do filesystem unmounts while their
> > real-time applications are running, they might modify this so that it can
> > use synchronize_rcu() instead for real-time builds of the kernel.
> 
> Thanks for the confirmation, Paul.
> 
> > 
> > But just for completeness, one way to make this work across the board
> > might be to instead use call_rcu(), with the callback function kicking
> > off a workqueue handler to do the rest of the unmount.  Of course,
> > in saying that, I am ignoring any mutexes that you might be holding
> > across this whole thing, and also ignoring any problems that might arise
> > when returning to userspace with some portion of the unmount operation
> > still pending.  (For example, someone unmounting a filesystem and then
> > immediately remounting that same filesystem.)
> 
> I had briefly considered that option, but it doesn't work.
> The purpose of this synchronize_rcu() is to wait for any filename lookup
> which might be locklessly touching the mountpoint to complete.
> It is only after that that the real meat of the unmount happens - the
> filesystem is told that the last reference is gone, and it gets to
> flush any saved changes out to disk etc.
> That stuff really has to happen before the umount syscall returns.

Hey, I was hoping!  ;-)

                                                        Thanx, Paul
On 10/26/2017 02:27 PM, Paul E. McKenney wrote:
> But just for completeness, one way to make this work across the board
> might be to instead use call_rcu(), with the callback function kicking
> off a workqueue handler to do the rest of the unmount.  Of course,
> in saying that, I am ignoring any mutexes that you might be holding
> across this whole thing, and also ignoring any problems that might arise
> when returning to userspace with some portion of the unmount operation
> still pending.  (For example, someone unmounting a filesystem and then
> immediately remounting that same filesystem.)

You really need to complete all side effects of deallocating a resource
before returning to user space.  Otherwise, it will never be possible
to allocate and deallocate resources in a tight loop, because you either
get spurious failures because too many unaccounted deallocations are
stuck somewhere in the system (and the user can't tell that this is due
to a race), or you get an OOM because the user manages to queue up too
much state.

We already have this problem with RLIMIT_NPROC, where waitpid etc.
return before the process is completely gone.  On some
kernels/configurations, the resulting race is so wide that parallel
make no longer works reliably because it runs into fork failures.

Thanks,
Florian
On Mon, Nov 27, 2017 at 12:27:04PM +0100, Florian Weimer wrote:
> On 10/26/2017 02:27 PM, Paul E. McKenney wrote:
> >But just for completeness, one way to make this work across the board
> >might be to instead use call_rcu(), with the callback function kicking
> >off a workqueue handler to do the rest of the unmount.  Of course,
> >in saying that, I am ignoring any mutexes that you might be holding
> >across this whole thing, and also ignoring any problems that might arise
> >when returning to userspace with some portion of the unmount operation
> >still pending.  (For example, someone unmounting a filesystem and then
> >immediately remounting that same filesystem.)
> 
> You really need to complete all side effects of deallocating a
> resource before returning to user space.  Otherwise, it will never
> be possible to allocate and deallocate resources in a tight loop,
> because you either get spurious failures because too many
> unaccounted deallocations are stuck somewhere in the system (and the
> user can't tell that this is due to a race), or you get an OOM
> because the user manages to queue up too much state.
> 
> We already have this problem with RLIMIT_NPROC, where waitpid etc.
> return before the process is completely gone.  On some
> kernels/configurations, the resulting race is so wide that parallel
> make no longer works reliably because it runs into fork failures.

Or alternatively, use rcu_barrier() occasionally to wait for all
preceding deferred deallocations.  And there are quite a few other
ways to take on this problem.

                                                        Thanx, Paul
On Mon, Nov 27 2017, Paul E. McKenney wrote:

> On Mon, Nov 27, 2017 at 12:27:04PM +0100, Florian Weimer wrote:
>> On 10/26/2017 02:27 PM, Paul E. McKenney wrote:
>> >But just for completeness, one way to make this work across the board
>> >might be to instead use call_rcu(), with the callback function kicking
>> >off a workqueue handler to do the rest of the unmount.  Of course,
>> >in saying that, I am ignoring any mutexes that you might be holding
>> >across this whole thing, and also ignoring any problems that might arise
>> >when returning to userspace with some portion of the unmount operation
>> >still pending.  (For example, someone unmounting a filesystem and then
>> >immediately remounting that same filesystem.)
>>
>> You really need to complete all side effects of deallocating a
>> resource before returning to user space.  Otherwise, it will never
>> be possible to allocate and deallocate resources in a tight loop,
>> because you either get spurious failures because too many
>> unaccounted deallocations are stuck somewhere in the system (and the
>> user can't tell that this is due to a race), or you get an OOM
>> because the user manages to queue up too much state.
>>
>> We already have this problem with RLIMIT_NPROC, where waitpid etc.
>> return before the process is completely gone.  On some
>> kernels/configurations, the resulting race is so wide that parallel
>> make no longer works reliably because it runs into fork failures.
>
> Or alternatively, use rcu_barrier() occasionally to wait for all
> preceding deferred deallocations.  And there are quite a few other
> ways to take on this problem.

So, supposing we could package up everything that has to happen after
the current synchronize_rcu() and put it in a call_rcu() callback, then
instead of calling synchronize_rcu_expedited() at the end of
namespace_unlock(), we could possibly call call_rcu() there and
rcu_barrier() at the start of namespace_lock().....

That would mean a single unmount would have low impact, but it would
still slow down a sequence of 1000 consecutive unmounts.
Maybe we would only need the rcu_barrier() before selected
namespace_lock() calls.  I would need to study the code closely to form
an opinion.

Interesting idea though.
Hopefully the _expedited() patch will be accepted - I haven't had a
"nak" yet...

thanks,
NeilBrown
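A very rough sketch of the shape Neil describes follows. The structure, the gather_unmounted_mounts() helper, and the callback are all hypothetical; the only pre-existing pieces assumed here are namespace_sem (which the real namespace_lock() takes) and the idea of punting sleeping cleanup to a workqueue as in Paul's earlier suggestion. None of this is an actual patch.

/* Hypothetical container for one batch of deferred unmount work. */
struct pending_unmount {
	struct rcu_head rcu;
	struct hlist_head mounts;
};

/* Hypothetical helper that would detach the unmounted mounts into
 * a pending_unmount, as the current code does with its local hlist. */
struct pending_unmount *gather_unmounted_mounts(void);

static void pending_unmount_rcu(struct rcu_head *rcu)
{
	/*
	 * The remaining cleanup (group_pin_kill() and friends) can
	 * sleep, so this callback would punt to a workqueue, as in
	 * the call_rcu()+workqueue sketch earlier in the thread.
	 */
}

static void namespace_unlock(void)
{
	struct pending_unmount *p = gather_unmounted_mounts();

	if (p)
		call_rcu(&p->rcu, pending_unmount_rcu);	/* return without waiting */
}

static void namespace_lock(void)
{
	/*
	 * Wait for any earlier deferred unmounts to fully complete
	 * before starting a new modification of the mount tree.
	 */
	rcu_barrier();
	down_write(&namespace_sem);
}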
diff --git a/fs/namespace.c b/fs/namespace.c
index 3b601f115b6c..fce91c447fab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1420,7 +1420,7 @@ static void namespace_unlock(void)
 	if (likely(hlist_empty(&head)))
 		return;
 
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	group_pin_kill(&head);
 }
The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted.  If a great many filesystems are mounted,
this can cause a noticeable slow-down in, for example, system shutdown.

The sequence:
  mkdir -p /tmp/Mtest/{0..5000}
  time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
  time umount /tmp/Mtest/*

on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.

Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 to unmount.

If we change the synchronize_rcu() to synchronize_rcu_expedited()
the umount time on a 4-cpu VM is 8 seconds to mount and 0.6 to
unmount.

I think this 200-fold speed-up is worth the slightly higher system
impact of using synchronize_rcu_expedited().

Signed-off-by: NeilBrown <neilb@suse.com>
---

Cc: to Paul and Josh in case they'll correct me if using _expedited()
is really bad here.

Thanks,
NeilBrown

 fs/namespace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)