Message ID: 162077975380.14498.11347675368470436331.stgit@web.messagingengine.com (mailing list archive)
Series: kernfs: proposed locking and concurrency improvement
On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote: > There have been a few instances of contention on the kernfs_mutex during > path walks, a case on very large IBM systems seen by myself, a report by > Brice Goglin and followed up by Fox Chen, and I've since seen a couple > of other reports by CoreOS users. > > The common thread is a large number of kernfs path walks leading to > slowness of path walks due to kernfs_mutex contention. > > The problem being that changes to the VFS over some time have increased > its concurrency capabilities to an extent that kernfs's use of a mutex > is no longer appropriate. There's also an issue of walks for non-existent > paths causing contention if there are quite a few of them which is a less > common problem. > > This patch series is relatively straightforward. > > All it does is add the ability to take advantage of VFS negative dentry > caching to avoid needless dentry alloc/free cycles for lookups of paths > that don't exist and change the kernfs_mutex to a read/write semaphore. > > The patch that tried to stay in VFS rcu-walk mode during path walks has > been dropped for two reasons. First, it doesn't actually give very much > improvement and, second, if there's a place where mistakes could go > unnoticed it would be in that path. This makes the patch series simpler > to review and reduces the likelihood of problems going unnoticed and > popping up later. > > The patch to use a revision to identify if a directory has changed has > also been dropped. If the directory has changed the dentry revision > needs to be updated to avoid subsequent rb tree searches and after > changing to use a read/write semaphore the update also requires a lock. > But the d_lock is the only lock available at this point which might > itself be contended. > > Changes since v3: > - remove unneeded indirection when referencing the super block. > - check if inode attribute update is actually needed. > > Changes since v2: > - actually fix the inode attribute update locking. > - drop the patch that tried to stay in rcu-walk mode. > - drop the use a revision to identify if a directory has changed patch. > > Changes since v1: > - fix locking in .permission() and .getattr() by re-factoring the attribute > handling code. > --- > > Ian Kent (5): > kernfs: move revalidate to be near lookup > kernfs: use VFS negative dentry caching > kernfs: switch kernfs to use an rwsem > kernfs: use i_lock to protect concurrent inode updates > kernfs: add kernfs_need_inode_refresh() > > > fs/kernfs/dir.c | 170 ++++++++++++++++++++---------------- > fs/kernfs/file.c | 4 +- > fs/kernfs/inode.c | 45 ++++++++-- > fs/kernfs/kernfs-internal.h | 5 +- > fs/kernfs/mount.c | 12 +-- > fs/kernfs/symlink.c | 4 +- > include/linux/kernfs.h | 2 +- > 7 files changed, 147 insertions(+), 95 deletions(-) > > -- > Ian > Any benchmark numbers that you ran that are better/worse with this patch series? That would be good to know, otherwise you aren't changing functionality here, so why would we take these changes? :) thanks, greg k-h
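As background for readers following the series, below is a minimal sketch of the core change the cover letter describes: converting a global mutex into a read/write semaphore so that read-mostly path walks can proceed concurrently. The identifiers are illustrative assumptions, not lines taken from the actual kernfs patches.

```c
/*
 * Illustrative only -- not the actual kernfs patches.  The series
 * converts the global kernfs_mutex into a read/write semaphore so
 * that concurrent path walks (read-mostly lookups) can run in
 * parallel, while tree modifications still take the lock exclusively.
 */
#include <linux/rwsem.h>

static DECLARE_RWSEM(example_rwsem);	/* was: static DEFINE_MUTEX(example_mutex); */

/* Read-mostly path, e.g. a lookup during a path walk:
 * many walkers may hold the lock at the same time.
 */
static void example_lookup(void)
{
	down_read(&example_rwsem);	/* was: mutex_lock(&example_mutex); */
	/* ... search the node rb-tree, no modifications ... */
	up_read(&example_rwsem);	/* was: mutex_unlock(&example_mutex); */
}

/* Modification path, e.g. adding or removing a node:
 * still fully serialized, as it was with the mutex.
 */
static void example_modify(void)
{
	down_write(&example_rwsem);
	/* ... insert or remove nodes ... */
	up_write(&example_rwsem);
}
```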
On Wed, May 12, 2021 at 2:21 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote: > > There have been a few instances of contention on the kernfs_mutex during > > path walks, a case on very large IBM systems seen by myself, a report by > > Brice Goglin and followed up by Fox Chen, and I've since seen a couple > > of other reports by CoreOS users. > > > > The common thread is a large number of kernfs path walks leading to > > slowness of path walks due to kernfs_mutex contention. > > > > The problem being that changes to the VFS over some time have increased > > it's concurrency capabilities to an extent that kernfs's use of a mutex > > is no longer appropriate. There's also an issue of walks for non-existent > > paths causing contention if there are quite a few of them which is a less > > common problem. > > > > This patch series is relatively straight forward. > > > > All it does is add the ability to take advantage of VFS negative dentry > > caching to avoid needless dentry alloc/free cycles for lookups of paths > > that don't exit and change the kernfs_mutex to a read/write semaphore. > > > > The patch that tried to stay in VFS rcu-walk mode during path walks has > > been dropped for two reasons. First, it doesn't actually give very much > > improvement and, second, if there's a place where mistakes could go > > unnoticed it would be in that path. This makes the patch series simpler > > to review and reduces the likelihood of problems going unnoticed and > > popping up later. > > > > The patch to use a revision to identify if a directory has changed has > > also been dropped. If the directory has changed the dentry revision > > needs to be updated to avoid subsequent rb tree searches and after > > changing to use a read/write semaphore the update also requires a lock. > > But the d_lock is the only lock available at this point which might > > itself be contended. > > > > Changes since v3: > > - remove unneeded indirection when referencing the super block. > > - check if inode attribute update is actually needed. > > > > Changes since v2: > > - actually fix the inode attribute update locking. > > - drop the patch that tried to stay in rcu-walk mode. > > - drop the use a revision to identify if a directory has changed patch. > > > > Changes since v1: > > - fix locking in .permission() and .getattr() by re-factoring the attribute > > handling code. > > --- > > > > Ian Kent (5): > > kernfs: move revalidate to be near lookup > > kernfs: use VFS negative dentry caching > > kernfs: switch kernfs to use an rwsem > > kernfs: use i_lock to protect concurrent inode updates > > kernfs: add kernfs_need_inode_refresh() > > > > > > fs/kernfs/dir.c | 170 ++++++++++++++++++++---------------- > > fs/kernfs/file.c | 4 +- > > fs/kernfs/inode.c | 45 ++++++++-- > > fs/kernfs/kernfs-internal.h | 5 +- > > fs/kernfs/mount.c | 12 +-- > > fs/kernfs/symlink.c | 4 +- > > include/linux/kernfs.h | 2 +- > > 7 files changed, 147 insertions(+), 95 deletions(-) > > > > -- > > Ian > > > > Any benchmark numbers that you ran that are better/worse with this patch > series? That woul dbe good to know, otherwise you aren't changing > functionality here, so why would we take these changes? :) Let me run it on my benchmark and bring back the result to you. > thanks, > > greg k-h thanks, fox
Hi, I ran it on my benchmark (https://github.com/foxhlchen/sysfs_benchmark). machine: aws c5 (Intel Xeon with 96 logical cores) kernel: v5.12 benchmark: create 96 threads and bind them to each core, then run open+read+close on a sysfs file simultaneously 1000 times. result: Without the patchset, an open+read+close operation takes 550-570 us, and perf shows significant time (>40%) spent on mutex_lock. After applying it, it takes 410-440 us for that operation and perf shows only ~4% time on mutex_lock. It's weird, I don't see a huge performance boost compared to v2, even though there is no mutex problem in the perf report. I've put console outputs and perf reports in the attachment for your reference. thanks, fox
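The benchmark itself lives in the linked repository; as a rough, simplified illustration of the access pattern described above (one thread pinned per core, each running a tight open+read+close loop on the same sysfs file), a user-space sketch might look like the following. The target path, thread count, and timing details are assumptions for illustration, not taken from the actual harness.

```c
/*
 * Simplified sketch of the access pattern described above, not the
 * actual sysfs_benchmark code: pin one thread per CPU and run a tight
 * open+read+close loop on a single sysfs file to expose lock
 * contention in kernfs.  Build with: gcc -O2 -pthread bench.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 96
#define NLOOPS   1000
/* Assumed target; any small sysfs attribute would do. */
static const char *path = "/sys/devices/system/cpu/cpu0/topology/core_id";

static void *worker(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;
	char buf[64];
	struct timespec t0, t1;

	/* Bind this thread to its own core, as the report describes. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NLOOPS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			continue;
		read(fd, buf, sizeof(buf));
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
	printf("cpu %ld: %ld us per open+read+close\n", cpu, ns / 1000 / NLOOPS);
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}
```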
On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote: > > Hi, > > I ran it on my benchmark (https://github.com/foxhlchen/sysfs_benchmark). > > machine: aws c5 (Intel Xeon with 96 logical cores) > kernel: v5.12 > benchmark: create 96 threads and bind them to each core then run > open+read+close on a sysfs file simultaneously for 1000 times. > result: > Without the patchset, an open+read+close operation takes 550-570 us, > perf shows significant time(>40%) spending on mutex_lock. > After applying it, it takes 410-440 us for that operation and perf > shows only ~4% time on mutex_lock. > > It's weird, I don't see a huge performance boost compared to v2, even I meant I don't see a huge performance boost here and it's way worse than v2. IIRC, for v2 fastest one only takes 40us > though there is no mutex problem from the perf report. > I've put console outputs and perf reports on the attachment for your reference. > > > thanks, > fox fox
On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote: > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote: > > There have been a few instances of contention on the kernfs_mutex > > during > > path walks, a case on very large IBM systems seen by myself, a > > report by > > Brice Goglin and followed up by Fox Chen, and I've since seen a > > couple > > of other reports by CoreOS users. > > > > The common thread is a large number of kernfs path walks leading to > > slowness of path walks due to kernfs_mutex contention. > > > > The problem being that changes to the VFS over some time have > > increased > > it's concurrency capabilities to an extent that kernfs's use of a > > mutex > > is no longer appropriate. There's also an issue of walks for non- > > existent > > paths causing contention if there are quite a few of them which is > > a less > > common problem. > > > > This patch series is relatively straight forward. > > > > All it does is add the ability to take advantage of VFS negative > > dentry > > caching to avoid needless dentry alloc/free cycles for lookups of > > paths > > that don't exit and change the kernfs_mutex to a read/write > > semaphore. > > > > The patch that tried to stay in VFS rcu-walk mode during path walks > > has > > been dropped for two reasons. First, it doesn't actually give very > > much > > improvement and, second, if there's a place where mistakes could go > > unnoticed it would be in that path. This makes the patch series > > simpler > > to review and reduces the likelihood of problems going unnoticed > > and > > popping up later. > > > > The patch to use a revision to identify if a directory has changed > > has > > also been dropped. If the directory has changed the dentry revision > > needs to be updated to avoid subsequent rb tree searches and after > > changing to use a read/write semaphore the update also requires a > > lock. > > But the d_lock is the only lock available at this point which might > > itself be contended. > > > > Changes since v3: > > - remove unneeded indirection when referencing the super block. > > - check if inode attribute update is actually needed. > > > > Changes since v2: > > - actually fix the inode attribute update locking. > > - drop the patch that tried to stay in rcu-walk mode. > > - drop the use a revision to identify if a directory has changed > > patch. > > > > Changes since v1: > > - fix locking in .permission() and .getattr() by re-factoring the > > attribute > > handling code. > > --- > > > > Ian Kent (5): > > kernfs: move revalidate to be near lookup > > kernfs: use VFS negative dentry caching > > kernfs: switch kernfs to use an rwsem > > kernfs: use i_lock to protect concurrent inode updates > > kernfs: add kernfs_need_inode_refresh() > > > > > > fs/kernfs/dir.c | 170 ++++++++++++++++++++------------ > > ---- > > fs/kernfs/file.c | 4 +- > > fs/kernfs/inode.c | 45 ++++++++-- > > fs/kernfs/kernfs-internal.h | 5 +- > > fs/kernfs/mount.c | 12 +-- > > fs/kernfs/symlink.c | 4 +- > > include/linux/kernfs.h | 2 +- > > 7 files changed, 147 insertions(+), 95 deletions(-) > > > > -- > > Ian > > > > Any benchmark numbers that you ran that are better/worse with this > patch > series? That woul dbe good to know, otherwise you aren't changing > functionality here, so why would we take these changes? :) Hi Greg, I'm sorry, I don't have a benchmark. 
My continued work on this has been driven by the report from Brice Goglin and Fox Chen, and also because I've seen a couple of other reports of kernfs_mutex contention that is resolved by the series. Unfortunately the two reports I've seen fairly recently are on kernels that are about as far away from the upstream kernel as you can get, so they probably aren't useful in making my case. The report I've worked on most recently is on CoreOS/Kubernetes (based on RHEL-8.3) where the machine load goes to around 870 after loading what they call an OpenShift performance profile. I looked at some sysrq dumps and they have a seemingly endless number of processes waiting on the kernfs_mutex. I tried to look at the Kubernetes source but it's written in Go, so a lot of time would need to be spent to work out what's going on; I'm trying to find someone to help with that. All I can say from looking at the Kubernetes source is that it has quite a few sysfs paths in it, so I assume it uses sysfs fairly heavily. The other problem I saw was also on CoreOS/Kubernetes. A vmcore analysis showed kernfs_mutex contention but with a different set of processes, and not as significant as the former problem. So, even though this isn't against the current upstream, there isn't much difference in the kernfs/sysfs source between those two kernels, and given the results of Brice and Fox I fear I'll be seeing more of this as time goes by. I'm fairly confident that the user space applications aren't optimal (although you may have stronger words for it than that), but I was hoping you would agree that it's sensible for the kernel to protect itself to the extent that it can, provided the change is straightforward enough. I have tried to make the patches as simple as possible without losing much of the improvement, to minimize any potential ongoing maintenance burden. So, I'm sorry I can't offer you more incentive to consider the series, but I remain hopeful you will. If there is anything you would like me to follow up on please ask and, if I can, I will do what's requested. Ian
On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote: > > > > Hi, > > > > I ran it on my benchmark ( > > https://github.com/foxhlchen/sysfs_benchmark). > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > kernel: v5.12 > > benchmark: create 96 threads and bind them to each core then run > > open+read+close on a sysfs file simultaneously for 1000 times. > > result: > > Without the patchset, an open+read+close operation takes 550-570 > > us, > > perf shows significant time(>40%) spending on mutex_lock. > > After applying it, it takes 410-440 us for that operation and perf > > shows only ~4% time on mutex_lock. > > > > It's weird, I don't see a huge performance boost compared to v2, > > even > > I meant I don't see a huge performance boost here and it's way worse > than v2. > IIRC, for v2 fastest one only takes 40us Thanks Fox, I'll have a look at those reports but this is puzzling. Perhaps the added overhead of the check if an update is needed is taking more than expected and more than just taking the lock and being done with it. Then there's the v2 series ... I'll see if I can dig out your reports on those too. > > > > though there is no mutex problem from the perf report. > > I've put console outputs and perf reports on the attachment for > > your reference. Yep, thanks. Ian
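For reference, the pattern being weighed here, taking the rwsem shared on the read-mostly path, checking whether an attribute refresh is needed at all, and serializing concurrent updates of the same inode with the per-inode i_lock, might look roughly like the sketch below. This is illustrative only; the helper name and attribute fields are assumptions, not code from the series.

```c
/*
 * Illustrative sketch of the pattern being discussed, not code from
 * the series: take the rwsem shared on the read-mostly path, check
 * whether an attribute refresh is needed at all, and serialize
 * concurrent updates of the same inode with the per-inode i_lock.
 * The helper name and attribute fields here are assumptions.
 */
#include <linux/fs.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>

struct example_attrs {
	kuid_t	uid;
	kgid_t	gid;
};

static DECLARE_RWSEM(example_rwsem);

/* Hypothetical helper: does the inode already match the cached attrs? */
static bool example_need_refresh(const struct inode *inode,
				 const struct example_attrs *attrs)
{
	return !uid_eq(inode->i_uid, attrs->uid) ||
	       !gid_eq(inode->i_gid, attrs->gid);
}

static void example_refresh_inode(struct inode *inode,
				  const struct example_attrs *attrs)
{
	down_read(&example_rwsem);		/* shared: many walkers at once */
	if (example_need_refresh(inode, attrs)) {
		spin_lock(&inode->i_lock);	/* per-inode: serialize updates */
		inode->i_uid = attrs->uid;
		inode->i_gid = attrs->gid;
		spin_unlock(&inode->i_lock);
	}
	up_read(&example_rwsem);
}
```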
On Thu, May 13, 2021 at 09:50:19PM +0800, Ian Kent wrote: > On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote: > > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote: > > > There have been a few instances of contention on the kernfs_mutex > > > during > > > path walks, a case on very large IBM systems seen by myself, a > > > report by > > > Brice Goglin and followed up by Fox Chen, and I've since seen a > > > couple > > > of other reports by CoreOS users. > > > > > > The common thread is a large number of kernfs path walks leading to > > > slowness of path walks due to kernfs_mutex contention. > > > > > > The problem being that changes to the VFS over some time have > > > increased > > > it's concurrency capabilities to an extent that kernfs's use of a > > > mutex > > > is no longer appropriate. There's also an issue of walks for non- > > > existent > > > paths causing contention if there are quite a few of them which is > > > a less > > > common problem. > > > > > > This patch series is relatively straight forward. > > > > > > All it does is add the ability to take advantage of VFS negative > > > dentry > > > caching to avoid needless dentry alloc/free cycles for lookups of > > > paths > > > that don't exit and change the kernfs_mutex to a read/write > > > semaphore. > > > > > > The patch that tried to stay in VFS rcu-walk mode during path walks > > > has > > > been dropped for two reasons. First, it doesn't actually give very > > > much > > > improvement and, second, if there's a place where mistakes could go > > > unnoticed it would be in that path. This makes the patch series > > > simpler > > > to review and reduces the likelihood of problems going unnoticed > > > and > > > popping up later. > > > > > > The patch to use a revision to identify if a directory has changed > > > has > > > also been dropped. If the directory has changed the dentry revision > > > needs to be updated to avoid subsequent rb tree searches and after > > > changing to use a read/write semaphore the update also requires a > > > lock. > > > But the d_lock is the only lock available at this point which might > > > itself be contended. > > > > > > Changes since v3: > > > - remove unneeded indirection when referencing the super block. > > > - check if inode attribute update is actually needed. > > > > > > Changes since v2: > > > - actually fix the inode attribute update locking. > > > - drop the patch that tried to stay in rcu-walk mode. > > > - drop the use a revision to identify if a directory has changed > > > patch. > > > > > > Changes since v1: > > > - fix locking in .permission() and .getattr() by re-factoring the > > > attribute > > > handling code. > > > --- > > > > > > Ian Kent (5): > > > kernfs: move revalidate to be near lookup > > > kernfs: use VFS negative dentry caching > > > kernfs: switch kernfs to use an rwsem > > > kernfs: use i_lock to protect concurrent inode updates > > > kernfs: add kernfs_need_inode_refresh() > > > > > > > > > fs/kernfs/dir.c | 170 ++++++++++++++++++++------------ > > > ---- > > > fs/kernfs/file.c | 4 +- > > > fs/kernfs/inode.c | 45 ++++++++-- > > > fs/kernfs/kernfs-internal.h | 5 +- > > > fs/kernfs/mount.c | 12 +-- > > > fs/kernfs/symlink.c | 4 +- > > > include/linux/kernfs.h | 2 +- > > > 7 files changed, 147 insertions(+), 95 deletions(-) > > > > > > -- > > > Ian > > > > > > > Any benchmark numbers that you ran that are better/worse with this > > patch > > series? 
That woul dbe good to know, otherwise you aren't changing > > functionality here, so why would we take these changes? :) > > Hi Greg, > > I'm sorry, I don't have a benchmark. > > My continued work on this has been driven by the report from > Brice Goglin and Fox Chen, and also because I've seen a couple > of other reports of kernfs_mutex contention that is resolved > by the series. > > Unfortunately the two reports I've seen fairly recently are on > kernels that are about as far away from the upstream kernel > as you can get so probably aren't useful in making my case. > > The report I've worked on most recently is on CoreOS/Kunbernetes > (based on RHEL-8.3) where the machine load goes to around 870 > after loading what they call an OpenShift performance profile. > > I looked at some sysreq dumps and they have a seemingly endless > number of processes waiting on the kernfs_mutex. > > I tried to look at the Kubernetes source but it's written in > go so there would need to be a lot of time spent to work out > what's going on, I'm trying to find someone to help with that. > > All I can say from looking at the Kubernetes source is it has > quite a few sysfs paths in it so I assume it uses sysfs fairly > heavily. > > The other problem I saw was also on CoreOS/Kunernetes. > A vmcore analysis showed kernfs_mutex contention but with a > different set of processes and not as significant as the former > problem. > > So, even though this isn't against the current upstream, there > isn't much difference in the kernfs/sysfs source between those > two kernels and given the results of Brice and Fox I fear I'll > be seeing more of this as time goes by. > > I'm fairly confident that the user space applications aren't > optimal (although you may have stronger words for it than that) > I was hoping you would agree that it's sensible for the kernel > to protect itself to the extent that it can provided the change > is straight forward enough. > > I have tried to make the patches as simple as possible without > loosing much of the improvement to minimize any potential ongoing > maintenance burden. > > So, I'm sorry I can't offer you more incentive to consider the > series, but I remain hopeful you will. At the very least, if you could test the series on those "older" systems and say "booting went from X seconds to Y seconds!". Otherwise, while changes are nice, without a real-world test that this actually made any difference at all, why would we take these changes? thanks, greg k-h
Hi Ian On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote: > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote: > > > > > > Hi, > > > > > > I ran it on my benchmark ( > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > kernel: v5.12 > > > benchmark: create 96 threads and bind them to each core then run > > > open+read+close on a sysfs file simultaneously for 1000 times. > > > result: > > > Without the patchset, an open+read+close operation takes 550-570 > > > us, > > > perf shows significant time(>40%) spending on mutex_lock. > > > After applying it, it takes 410-440 us for that operation and perf > > > shows only ~4% time on mutex_lock. > > > > > > It's weird, I don't see a huge performance boost compared to v2, > > > even > > > > I meant I don't see a huge performance boost here and it's way worse > > than v2. > > IIRC, for v2 fastest one only takes 40us > > Thanks Fox, > > I'll have a look at those reports but this is puzzling. > > Perhaps the added overhead of the check if an update is > needed is taking more than expected and more than just > taking the lock and being done with it. Then there's > the v2 series ... I'll see if I can dig out your reports > on those too. Apologies, I was mistaken, it's compared to V3, not V2. The previous benchmark report is here. https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > > > > > > though there is no mutex problem from the perf report. > > > I've put console outputs and perf reports on the attachment for > > > your reference. > > Yep, thanks. > Ian > thanks, fox
On Thu, 2021-05-13 at 17:19 +0200, Greg Kroah-Hartman wrote: > On Thu, May 13, 2021 at 09:50:19PM +0800, Ian Kent wrote: > > On Wed, 2021-05-12 at 08:21 +0200, Greg Kroah-Hartman wrote: > > > On Wed, May 12, 2021 at 08:38:35AM +0800, Ian Kent wrote: > > > > There have been a few instances of contention on the > > > > kernfs_mutex > > > > during > > > > path walks, a case on very large IBM systems seen by myself, a > > > > report by > > > > Brice Goglin and followed up by Fox Chen, and I've since seen a > > > > couple > > > > of other reports by CoreOS users. > > > > > > > > The common thread is a large number of kernfs path walks > > > > leading to > > > > slowness of path walks due to kernfs_mutex contention. > > > > > > > > The problem being that changes to the VFS over some time have > > > > increased > > > > it's concurrency capabilities to an extent that kernfs's use of > > > > a > > > > mutex > > > > is no longer appropriate. There's also an issue of walks for > > > > non- > > > > existent > > > > paths causing contention if there are quite a few of them which > > > > is > > > > a less > > > > common problem. > > > > > > > > This patch series is relatively straight forward. > > > > > > > > All it does is add the ability to take advantage of VFS > > > > negative > > > > dentry > > > > caching to avoid needless dentry alloc/free cycles for lookups > > > > of > > > > paths > > > > that don't exit and change the kernfs_mutex to a read/write > > > > semaphore. > > > > > > > > The patch that tried to stay in VFS rcu-walk mode during path > > > > walks > > > > has > > > > been dropped for two reasons. First, it doesn't actually give > > > > very > > > > much > > > > improvement and, second, if there's a place where mistakes > > > > could go > > > > unnoticed it would be in that path. This makes the patch series > > > > simpler > > > > to review and reduces the likelihood of problems going > > > > unnoticed > > > > and > > > > popping up later. > > > > > > > > The patch to use a revision to identify if a directory has > > > > changed > > > > has > > > > also been dropped. If the directory has changed the dentry > > > > revision > > > > needs to be updated to avoid subsequent rb tree searches and > > > > after > > > > changing to use a read/write semaphore the update also requires > > > > a > > > > lock. > > > > But the d_lock is the only lock available at this point which > > > > might > > > > itself be contended. > > > > > > > > Changes since v3: > > > > - remove unneeded indirection when referencing the super block. > > > > - check if inode attribute update is actually needed. > > > > > > > > Changes since v2: > > > > - actually fix the inode attribute update locking. > > > > - drop the patch that tried to stay in rcu-walk mode. > > > > - drop the use a revision to identify if a directory has > > > > changed > > > > patch. > > > > > > > > Changes since v1: > > > > - fix locking in .permission() and .getattr() by re-factoring > > > > the > > > > attribute > > > > handling code. 
> > > > --- > > > > > > > > Ian Kent (5): > > > > kernfs: move revalidate to be near lookup > > > > kernfs: use VFS negative dentry caching > > > > kernfs: switch kernfs to use an rwsem > > > > kernfs: use i_lock to protect concurrent inode updates > > > > kernfs: add kernfs_need_inode_refresh() > > > > > > > > > > > > fs/kernfs/dir.c | 170 ++++++++++++++++++++-------- > > > > ---- > > > > ---- > > > > fs/kernfs/file.c | 4 +- > > > > fs/kernfs/inode.c | 45 ++++++++-- > > > > fs/kernfs/kernfs-internal.h | 5 +- > > > > fs/kernfs/mount.c | 12 +-- > > > > fs/kernfs/symlink.c | 4 +- > > > > include/linux/kernfs.h | 2 +- > > > > 7 files changed, 147 insertions(+), 95 deletions(-) > > > > > > > > -- > > > > Ian > > > > > > > > > > Any benchmark numbers that you ran that are better/worse with > > > this > > > patch > > > series? That woul dbe good to know, otherwise you aren't > > > changing > > > functionality here, so why would we take these changes? :) > > > > Hi Greg, > > > > I'm sorry, I don't have a benchmark. > > > > My continued work on this has been driven by the report from > > Brice Goglin and Fox Chen, and also because I've seen a couple > > of other reports of kernfs_mutex contention that is resolved > > by the series. > > > > Unfortunately the two reports I've seen fairly recently are on > > kernels that are about as far away from the upstream kernel > > as you can get so probably aren't useful in making my case. > > > > The report I've worked on most recently is on CoreOS/Kunbernetes > > (based on RHEL-8.3) where the machine load goes to around 870 > > after loading what they call an OpenShift performance profile. > > > > I looked at some sysreq dumps and they have a seemingly endless > > number of processes waiting on the kernfs_mutex. > > > > I tried to look at the Kubernetes source but it's written in > > go so there would need to be a lot of time spent to work out > > what's going on, I'm trying to find someone to help with that. > > > > All I can say from looking at the Kubernetes source is it has > > quite a few sysfs paths in it so I assume it uses sysfs fairly > > heavily. > > > > The other problem I saw was also on CoreOS/Kunernetes. > > A vmcore analysis showed kernfs_mutex contention but with a > > different set of processes and not as significant as the former > > problem. > > > > So, even though this isn't against the current upstream, there > > isn't much difference in the kernfs/sysfs source between those > > two kernels and given the results of Brice and Fox I fear I'll > > be seeing more of this as time goes by. > > > > I'm fairly confident that the user space applications aren't > > optimal (although you may have stronger words for it than that) > > I was hoping you would agree that it's sensible for the kernel > > to protect itself to the extent that it can provided the change > > is straight forward enough. > > > > I have tried to make the patches as simple as possible without > > loosing much of the improvement to minimize any potential ongoing > > maintenance burden. > > > > So, I'm sorry I can't offer you more incentive to consider the > > series, but I remain hopeful you will. > > At the very least, if you could test the series on those "older" > systems > and say "booting went from X seconds to Y seconds!". The last test I did was done on the system showing high load and it went from around 870 to around 3. It completely resolved the reported problem. 
I need to have the current patches re-tested, which can take a while, and I need to look at Fox's results. I'm thinking the additional patch in v4 is probably not needed. Ian
On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > Hi Ian > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote: > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> > > > wrote: > > > > > > > > Hi, > > > > > > > > I ran it on my benchmark ( > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > kernel: v5.12 > > > > benchmark: create 96 threads and bind them to each core then > > > > run > > > > open+read+close on a sysfs file simultaneously for 1000 times. > > > > result: > > > > Without the patchset, an open+read+close operation takes 550- > > > > 570 > > > > us, > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > After applying it, it takes 410-440 us for that operation and > > > > perf > > > > shows only ~4% time on mutex_lock. > > > > > > > > It's weird, I don't see a huge performance boost compared to > > > > v2, > > > > even > > > > > > I meant I don't see a huge performance boost here and it's way > > > worse > > > than v2. > > > IIRC, for v2 fastest one only takes 40us > > > > Thanks Fox, > > > > I'll have a look at those reports but this is puzzling. > > > > Perhaps the added overhead of the check if an update is > > needed is taking more than expected and more than just > > taking the lock and being done with it. Then there's > > the v2 series ... I'll see if I can dig out your reports > > on those too. > > Apologies, I was mistaken, it's compared to V3, not V2. The previous > benchmark report is here. > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ Are all these tests using a single file name in the open/read/close loop? That being the case the per-object inode lock will behave like a mutex and once contention occurs any speed benefits of a spinlock over a mutex (or rwsem) will disappear. In this case changing from a write lock to a read lock in those functions and adding the inode mutex will do nothing but add the overhead of taking the read lock. And similarly adding the update check function also just adds overhead and, as we see, once contention starts it has a cumulative effect that's often not linear. The whole idea of a read lock/per-object spin lock was to reduce the possibility of contention for paths other than the same path while not impacting same path accesses too much for an overall gain. Based on this I'm thinking the update check function is probably not worth keeping, it just adds unnecessary churn and has a negative impact for same file contention access patterns. I think that using multiple paths, at least one per test process (so if you are running 16 processes use at least 16 different files, the same in each process), and selecting one at random for each loop of the open would better simulate real world access patterns. Ian
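As a concrete sketch of that suggestion (the file list and the use of rand_r() are illustrative assumptions, not a prescribed change to the benchmark), the per-thread loop could pick one of several sysfs files at random on each iteration so contention is spread across different kernfs nodes:

```c
/*
 * Sketch only: each iteration opens a randomly chosen file from a
 * per-test set instead of the same file every time, spreading the
 * contention across different kernfs nodes rather than one inode.
 * The paths listed here are assumed example attributes.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static const char *paths[] = {
	"/sys/devices/system/cpu/cpu0/topology/core_id",
	"/sys/devices/system/cpu/cpu1/topology/core_id",
	"/sys/devices/system/cpu/cpu2/topology/core_id",
	"/sys/devices/system/cpu/cpu3/topology/core_id",
};
#define NPATHS (sizeof(paths) / sizeof(paths[0]))

static void one_iteration(unsigned int *seed)
{
	char buf[64];
	/* rand_r() keeps the choice cheap and thread-local. */
	int fd = open(paths[rand_r(seed) % NPATHS], O_RDONLY);

	if (fd >= 0) {
		read(fd, buf, sizeof(buf));
		close(fd);
	}
}
```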
On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote: > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > > Hi Ian > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote: > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > I ran it on my benchmark ( > > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > > kernel: v5.12 > > > > > benchmark: create 96 threads and bind them to each core then > > > > > run > > > > > open+read+close on a sysfs file simultaneously for 1000 times. > > > > > result: > > > > > Without the patchset, an open+read+close operation takes 550- > > > > > 570 > > > > > us, > > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > > After applying it, it takes 410-440 us for that operation and > > > > > perf > > > > > shows only ~4% time on mutex_lock. > > > > > > > > > > It's weird, I don't see a huge performance boost compared to > > > > > v2, > > > > > even > > > > > > > > I meant I don't see a huge performance boost here and it's way > > > > worse > > > > than v2. > > > > IIRC, for v2 fastest one only takes 40us > > > > > > Thanks Fox, > > > > > > I'll have a look at those reports but this is puzzling. > > > > > > Perhaps the added overhead of the check if an update is > > > needed is taking more than expected and more than just > > > taking the lock and being done with it. Then there's > > > the v2 series ... I'll see if I can dig out your reports > > > on those too. > > > > Apologies, I was mistaken, it's compared to V3, not V2. The previous > > benchmark report is here. > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > Are all these tests using a single file name in the open/read/close > loop? Yes, because It's easy to implement yet enough to trigger the mutex_lock. And you are right It's not a real-life pattern, but on the bright side, it proves there is no original mutex_lock problem anymore. :) > That being the case the per-object inode lock will behave like a > mutex and once contention occurs any speed benefits of a spinlock > over a mutex (or rwsem) will disappear. > > In this case changing from a write lock to a read lock in those > functions and adding the inode mutex will do nothing but add the > overhead of taking the read lock. And similarly adding the update > check function also just adds overhead and, as we see, once > contention starts it has a cumulative effect that's often not > linear. > > The whole idea of a read lock/per-object spin lock was to reduce > the possibility of contention for paths other than the same path > while not impacting same path accesses too much for an overall > gain. Based on this I'm thinking the update check function is > probably not worth keeping, it just adds unnecessary churn and > has a negative impact for same file contention access patterns. > > I think that using multiple paths, at least one per test process > (so if you are running 16 processes use at least 16 different > files, the same in each process), and selecting one at random > for each loop of the open would better simulate real world > access patterns. > > > Ian > thanks, fox
On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote: > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote: > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > > > Hi Ian > > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> > > > wrote: > > > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen > > > > > <foxhlchen@gmail.com> > > > > > wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > I ran it on my benchmark ( > > > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > > > kernel: v5.12 > > > > > > benchmark: create 96 threads and bind them to each core > > > > > > then > > > > > > run > > > > > > open+read+close on a sysfs file simultaneously for 1000 > > > > > > times. > > > > > > result: > > > > > > Without the patchset, an open+read+close operation takes > > > > > > 550- > > > > > > 570 > > > > > > us, > > > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > > > After applying it, it takes 410-440 us for that operation > > > > > > and > > > > > > perf > > > > > > shows only ~4% time on mutex_lock. > > > > > > > > > > > > It's weird, I don't see a huge performance boost compared > > > > > > to > > > > > > v2, > > > > > > even > > > > > > > > > > I meant I don't see a huge performance boost here and it's > > > > > way > > > > > worse > > > > > than v2. > > > > > IIRC, for v2 fastest one only takes 40us > > > > > > > > Thanks Fox, > > > > > > > > I'll have a look at those reports but this is puzzling. > > > > > > > > Perhaps the added overhead of the check if an update is > > > > needed is taking more than expected and more than just > > > > taking the lock and being done with it. Then there's > > > > the v2 series ... I'll see if I can dig out your reports > > > > on those too. > > > > > > Apologies, I was mistaken, it's compared to V3, not V2. The > > > previous > > > benchmark report is here. > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > > > Are all these tests using a single file name in the open/read/close > > loop? > > Yes, because It's easy to implement yet enough to trigger the > mutex_lock. > > And you are right It's not a real-life pattern, but on the bright > side, it proves there is no original mutex_lock problem anymore. :) I've been looking at your reports and they are quite interesting. > > > That being the case the per-object inode lock will behave like a > > mutex and once contention occurs any speed benefits of a spinlock > > over a mutex (or rwsem) will disappear. > > > > In this case changing from a write lock to a read lock in those > > functions and adding the inode mutex will do nothing but add the > > overhead of taking the read lock. And similarly adding the update > > check function also just adds overhead and, as we see, once > > contention starts it has a cumulative effect that's often not > > linear. > > > > The whole idea of a read lock/per-object spin lock was to reduce > > the possibility of contention for paths other than the same path > > while not impacting same path accesses too much for an overall > > gain. Based on this I'm thinking the update check function is > > probably not worth keeping, it just adds unnecessary churn and > > has a negative impact for same file contention access patterns. The reports indicate (to me anyway) that the slowdown isn't due to kernfs. 
It looks more like kernfs is now putting pressure on the VFS, mostly on the file table lock, but it looks like there's a mild amount of contention on a few other locks as well now. That's a whole different problem, and those file table handling functions don't appear to have any obvious problems, so they are doing what they have to do and that can't be avoided. That's definitely out of scope for these changes. And, as you'd expect, once any appreciable amount of contention happens our measurements go out the window, certainly with respect to kernfs. It also doesn't change my opinion that checking if an inode attribute update is needed in kernfs isn't useful since, IIUC, that file table lock contention would result even if you were using different paths. So I'll drop that patch from the series. Ian > > > > I think that using multiple paths, at least one per test process > > (so if you are running 16 processes use at least 16 different > > files, the same in each process), and selecting one at random > > for each loop of the open would better simulate real world > > access patterns. > > > > > > Ian > > > > > thanks, > fox
On Mon, May 17, 2021 at 9:32 AM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote: > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote: > > > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > > > > Hi Ian > > > > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> > > > > wrote: > > > > > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen > > > > > > <foxhlchen@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I ran it on my benchmark ( > > > > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > > > > kernel: v5.12 > > > > > > > benchmark: create 96 threads and bind them to each core > > > > > > > then > > > > > > > run > > > > > > > open+read+close on a sysfs file simultaneously for 1000 > > > > > > > times. > > > > > > > result: > > > > > > > Without the patchset, an open+read+close operation takes > > > > > > > 550- > > > > > > > 570 > > > > > > > us, > > > > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > > > > After applying it, it takes 410-440 us for that operation > > > > > > > and > > > > > > > perf > > > > > > > shows only ~4% time on mutex_lock. > > > > > > > > > > > > > > It's weird, I don't see a huge performance boost compared > > > > > > > to > > > > > > > v2, > > > > > > > even > > > > > > > > > > > > I meant I don't see a huge performance boost here and it's > > > > > > way > > > > > > worse > > > > > > than v2. > > > > > > IIRC, for v2 fastest one only takes 40us > > > > > > > > > > Thanks Fox, > > > > > > > > > > I'll have a look at those reports but this is puzzling. > > > > > > > > > > Perhaps the added overhead of the check if an update is > > > > > needed is taking more than expected and more than just > > > > > taking the lock and being done with it. Then there's > > > > > the v2 series ... I'll see if I can dig out your reports > > > > > on those too. > > > > > > > > Apologies, I was mistaken, it's compared to V3, not V2. The > > > > previous > > > > benchmark report is here. > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > > > > > Are all these tests using a single file name in the open/read/close > > > loop? > > > > Yes, because It's easy to implement yet enough to trigger the > > mutex_lock. > > > > And you are right It's not a real-life pattern, but on the bright > > side, it proves there is no original mutex_lock problem anymore. :) > > I've been looking at your reports and they are quite interesting. > > > > > > That being the case the per-object inode lock will behave like a > > > mutex and once contention occurs any speed benefits of a spinlock > > > over a mutex (or rwsem) will disappear. > > > > > > In this case changing from a write lock to a read lock in those > > > functions and adding the inode mutex will do nothing but add the > > > overhead of taking the read lock. And similarly adding the update > > > check function also just adds overhead and, as we see, once > > > contention starts it has a cumulative effect that's often not > > > linear. > > > > > > The whole idea of a read lock/per-object spin lock was to reduce > > > the possibility of contention for paths other than the same path > > > while not impacting same path accesses too much for an overall > > > gain. 
Based on this I'm thinking the update check function is > > > probably not worth keeping, it just adds unnecessary churn and > > > has a negative impact for same file contention access patterns. > > The reports indicate (to me anyway) that the slowdown isn't > due to kernfs. It looks more like kernfs is now putting pressure > on the VFS, mostly on the file table lock but it looks like > there's a mild amount of contention on a few other locks as well > now. That's correct, I ran my benchmark on ext4 and the result was similarly slow. But it shouldn't be like that; as I remember, when I tested it before it was very fast, and you can see the result of V3 was much faster. So I ran this benchmark again on AWS c5a, which also has 96 cores but with AMD CPUs. The result was amazing: the fastest one had a 10x boost (~40us), very similar to the V3 one (see attachment). I guess my previous benchmark of V3 was run on c5a. I can't figure out why it is so slow on Intel's CPUs; I also tried C5.metal, which runs on a physical machine, and the result is still slow (~200us). But anyway, at least it shows this patchset solves the mutex_lock problem and can bring even a 10x boost on some occasions. > That's a whole different problem and those file table handling > functions don't appear to have any obvious problems so they are > doing what they have to do and that can't be avoided. > > That's definitely out of scope for these changes. > > And, as you'd expect, once any appreciable amount of contention > happens our measurements go out the window, certainly with > respect to kernfs. > > It also doesn't change my option that checking if an inode > attribute update is needed in kernfs isn't useful since, IIUC > that file table lock contention would result even if you were > using different paths. > > So I'll drop that patch from the series. > > Ian > > > > > > I think that using multiple paths, at least one per test process > > > (so if you are running 16 processes use at least 16 different > > > files, the same in each process), and selecting one at random > > > for each loop of the open would better simulate real world > > > access patterns. > > > > > > > > > Ian > > > > > > > > > thanks, > > fox > > thanks, fox
On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote: > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote: > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote: > > > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > > > > Hi Ian > > > > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> > > > > wrote: > > > > > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen > > > > > > <foxhlchen@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I ran it on my benchmark ( > > > > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > > > > kernel: v5.12 > > > > > > > benchmark: create 96 threads and bind them to each core > > > > > > > then > > > > > > > run > > > > > > > open+read+close on a sysfs file simultaneously for 1000 > > > > > > > times. > > > > > > > result: > > > > > > > Without the patchset, an open+read+close operation takes > > > > > > > 550- > > > > > > > 570 > > > > > > > us, > > > > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > > > > After applying it, it takes 410-440 us for that operation > > > > > > > and > > > > > > > perf > > > > > > > shows only ~4% time on mutex_lock. > > > > > > > > > > > > > > It's weird, I don't see a huge performance boost compared > > > > > > > to > > > > > > > v2, > > > > > > > even > > > > > > > > > > > > I meant I don't see a huge performance boost here and it's > > > > > > way > > > > > > worse > > > > > > than v2. > > > > > > IIRC, for v2 fastest one only takes 40us > > > > > > > > > > Thanks Fox, > > > > > > > > > > I'll have a look at those reports but this is puzzling. > > > > > > > > > > Perhaps the added overhead of the check if an update is > > > > > needed is taking more than expected and more than just > > > > > taking the lock and being done with it. Then there's > > > > > the v2 series ... I'll see if I can dig out your reports > > > > > on those too. > > > > > > > > Apologies, I was mistaken, it's compared to V3, not V2. The > > > > previous > > > > benchmark report is here. > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > > > > > Are all these tests using a single file name in the > > > open/read/close > > > loop? > > > > Yes, because It's easy to implement yet enough to trigger the > > mutex_lock. > > > > And you are right It's not a real-life pattern, but on the bright > > side, it proves there is no original mutex_lock problem anymore. :) > > I've been looking at your reports and they are quite interesting. > > > > > > That being the case the per-object inode lock will behave like a > > > mutex and once contention occurs any speed benefits of a spinlock > > > over a mutex (or rwsem) will disappear. > > > > > > In this case changing from a write lock to a read lock in those > > > functions and adding the inode mutex will do nothing but add the > > > overhead of taking the read lock. And similarly adding the update > > > check function also just adds overhead and, as we see, once > > > contention starts it has a cumulative effect that's often not > > > linear. > > > > > > The whole idea of a read lock/per-object spin lock was to reduce > > > the possibility of contention for paths other than the same path > > > while not impacting same path accesses too much for an overall > > > gain. 
Based on this I'm thinking the update check function is > > > probably not worth keeping, it just adds unnecessary churn and > > > has a negative impact for same file contention access patterns. > > The reports indicate (to me anyway) that the slowdown isn't > due to kernfs. It looks more like kernfs is now putting pressure > on the VFS, mostly on the file table lock but it looks like > there's a mild amount of contention on a few other locks as well > now. > > That's a whole different problem and those file table handling > functions don't appear to have any obvious problems so they are > doing what they have to do and that can't be avoided. > > That's definitely out of scope for these changes. > > And, as you'd expect, once any appreciable amount of contention > happens our measurements go out the window, certainly with > respect to kernfs. > > It also doesn't change my option that checking if an inode > attribute update is needed in kernfs isn't useful since, IIUC > that file table lock contention would result even if you were > using different paths. > > So I'll drop that patch from the series. It will probably not make any difference, but for completeness and after some thought, I felt I should post what I think is happening with the benchmark. The benchmark application runs a pthreads thread for each CPU and goes into a tight open/read/close loop to demonstrate the contention that can occur on the kernfs mutex as the number of CPUs grows. But that tight open/read/close loop causes contention on the VFS file table because the pthreads threads share the process file table. So the poor performance is actually evidence the last patch is doing what it's meant to do rather than evidence of a regression with the series. The benchmark is putting pressure on the process file table on some hardware configurations, but those critical sections are small and there's really nothing obvious that can be done to improve the file table locking. It is, however, important to remember that the benchmark application access pattern is not a normal access pattern by a long way. So I don't see the need for a new revision of the series with the last patch dropped. If there's a change of heart and the series were to be merged, I'll leave whether to include this last patch to the discretion of the maintainer as the bulk of the improvement comes from the earlier patches anyway. > > Ian > > > > > > I think that using multiple paths, at least one per test process > > > (so if you are running 16 processes use at least 16 different > > > files, the same in each process), and selecting one at random > > > for each loop of the open would better simulate real world > > > access patterns. > > > > > > > > > Ian > > > > > > > > > thanks, > > fox >
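For context on the file-table point: threads created with pthread_create() share their process's file descriptor table, while fork()ed children get independent copies, so a variant of the benchmark that forks worker processes would take that file table contention out of the measurement. A minimal sketch follows, assuming worker() is the same open/read/close loop as in the threaded version.

```c
/*
 * Sketch: run the same open+read+close loop in forked worker processes
 * instead of pthreads.  Each child gets its own file descriptor table,
 * so any remaining contention is in kernfs/VFS path lookup rather than
 * in the shared process file table that the threaded benchmark hits.
 */
#include <sys/wait.h>
#include <unistd.h>

extern void worker(long cpu);	/* assumed: per-worker open/read/close loop */

int run_forked(int nworkers)
{
	for (long i = 0; i < nworkers; i++) {
		pid_t pid = fork();

		if (pid == 0) {		/* child: independent file table */
			worker(i);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)		/* parent: reap all children */
		;
	return 0;
}
```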
On Thu, May 27, 2021 at 09:23:06AM +0800, Ian Kent wrote: > On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote: > > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote: > > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote: > > > > > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote: > > > > > Hi Ian > > > > > > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> > > > > > wrote: > > > > > > > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote: > > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen > > > > > > > <foxhlchen@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I ran it on my benchmark ( > > > > > > > > https://github.com/foxhlchen/sysfs_benchmark). > > > > > > > > > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores) > > > > > > > > kernel: v5.12 > > > > > > > > benchmark: create 96 threads and bind them to each core > > > > > > > > then > > > > > > > > run > > > > > > > > open+read+close on a sysfs file simultaneously for 1000 > > > > > > > > times. > > > > > > > > result: > > > > > > > > Without the patchset, an open+read+close operation takes > > > > > > > > 550- > > > > > > > > 570 > > > > > > > > us, > > > > > > > > perf shows significant time(>40%) spending on mutex_lock. > > > > > > > > After applying it, it takes 410-440 us for that operation > > > > > > > > and > > > > > > > > perf > > > > > > > > shows only ~4% time on mutex_lock. > > > > > > > > > > > > > > > > It's weird, I don't see a huge performance boost compared > > > > > > > > to > > > > > > > > v2, > > > > > > > > even > > > > > > > > > > > > > > I meant I don't see a huge performance boost here and it's > > > > > > > way > > > > > > > worse > > > > > > > than v2. > > > > > > > IIRC, for v2 fastest one only takes 40us > > > > > > > > > > > > Thanks Fox, > > > > > > > > > > > > I'll have a look at those reports but this is puzzling. > > > > > > > > > > > > Perhaps the added overhead of the check if an update is > > > > > > needed is taking more than expected and more than just > > > > > > taking the lock and being done with it. Then there's > > > > > > the v2 series ... I'll see if I can dig out your reports > > > > > > on those too. > > > > > > > > > > Apologies, I was mistaken, it's compared to V3, not V2. The > > > > > previous > > > > > benchmark report is here. > > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/ > > > > > > > > Are all these tests using a single file name in the > > > > open/read/close > > > > loop? > > > > > > Yes, because It's easy to implement yet enough to trigger the > > > mutex_lock. > > > > > > And you are right It's not a real-life pattern, but on the bright > > > side, it proves there is no original mutex_lock problem anymore. :) > > > > I've been looking at your reports and they are quite interesting. > > > > > > > > > That being the case the per-object inode lock will behave like a > > > > mutex and once contention occurs any speed benefits of a spinlock > > > > over a mutex (or rwsem) will disappear. > > > > > > > > In this case changing from a write lock to a read lock in those > > > > functions and adding the inode mutex will do nothing but add the > > > > overhead of taking the read lock. And similarly adding the update > > > > check function also just adds overhead and, as we see, once > > > > contention starts it has a cumulative effect that's often not > > > > linear. 
> > > > > > > > The whole idea of a read lock/per-object spin lock was to reduce > > > > the possibility of contention for paths other than the same path > > > > while not impacting same path accesses too much for an overall > > > > gain. Based on this I'm thinking the update check function is > > > > probably not worth keeping, it just adds unnecessary churn and > > > > has a negative impact for same file contention access patterns. > > > > The reports indicate (to me anyway) that the slowdown isn't > > due to kernfs. It looks more like kernfs is now putting pressure > > on the VFS, mostly on the file table lock but it looks like > > there's a mild amount of contention on a few other locks as well > > now. > > > > That's a whole different problem and those file table handling > > functions don't appear to have any obvious problems so they are > > doing what they have to do and that can't be avoided. > > > > That's definitely out of scope for these changes. > > > > And, as you'd expect, once any appreciable amount of contention > > happens our measurements go out the window, certainly with > > respect to kernfs. > > > > It also doesn't change my option that checking if an inode > > attribute update is needed in kernfs isn't useful since, IIUC > > that file table lock contention would result even if you were > > using different paths. > > > > So I'll drop that patch from the series. > > It will probably not make any difference, but for completeness > and after some thought, I felt I should post what I think is > happening with the benchmark. > > The benchmark application runs a pthreads thread for each CPU > and goes into a tight open/read/close loop to demonstrate the > contention that can occur on the kernfs mutex as the number of > CPUs grows. > > But that tight open/read/close loop causes contention on the VFS > file table because the pthreads threads share the process file > table. > > So the poor performance is actually evidence the last patch is > doing what it's meant to do rather than evidence of a regression > with the series. > > The benchmark is putting pressure on the process file table on > some hardware configurations but those critical sections are small > and there's really nothing obvious that can be done to improve the > file table locking. > > It is however important to remember that the benckmark application > access pattern is not a normal access pattern by a long way. > > So I don't see the need for a new revision of the series with the > last patch dropped. > > If there's a change of heart and the series was to be merged I'll > leave whether to include this last patch to the discretion of the > maintainer as the bulk of the improvement comes from the earlier > patches anyway. Can you please resubmit the series, it is long-gone from my review queue. thanks, greg k-h
On Thu, 2021-05-27 at 08:50 +0200, Greg Kroah-Hartman wrote:
> On Thu, May 27, 2021 at 09:23:06AM +0800, Ian Kent wrote:
> > On Mon, 2021-05-17 at 09:32 +0800, Ian Kent wrote:
> > > On Fri, 2021-05-14 at 10:34 +0800, Fox Chen wrote:
> > > > On Fri, May 14, 2021 at 9:34 AM Ian Kent <raven@themaw.net> wrote:
> > > > > On Thu, 2021-05-13 at 23:37 +0800, Fox Chen wrote:
> > > > > > Hi Ian
> > > > > >
> > > > > > On Thu, May 13, 2021 at 10:10 PM Ian Kent <raven@themaw.net> wrote:
> > > > > > > On Wed, 2021-05-12 at 16:54 +0800, Fox Chen wrote:
> > > > > > > > On Wed, May 12, 2021 at 4:47 PM Fox Chen <foxhlchen@gmail.com> wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I ran it on my benchmark (https://github.com/foxhlchen/sysfs_benchmark).
> > > > > > > > >
> > > > > > > > > machine: aws c5 (Intel Xeon with 96 logical cores)
> > > > > > > > > kernel: v5.12
> > > > > > > > > benchmark: create 96 threads, bind them to each core, then run open+read+close on a sysfs file simultaneously 1000 times.
> > > > > > > > > result: without the patchset an open+read+close operation takes 550-570 us and perf shows significant time (>40%) spent on mutex_lock. After applying it, that operation takes 410-440 us and perf shows only ~4% of the time on mutex_lock.
> > > > > > > > >
> > > > > > > > > It's weird, I don't see a huge performance boost compared to v2, even
> > > > > > > >
> > > > > > > > I meant I don't see a huge performance boost here and it's way worse than v2. IIRC, for v2 the fastest one only takes 40us.
> > > > > > >
> > > > > > > Thanks Fox,
> > > > > > >
> > > > > > > I'll have a look at those reports but this is puzzling.
> > > > > > >
> > > > > > > Perhaps the added overhead of the check for whether an update is needed is taking more than expected, and more than just taking the lock and being done with it. Then there's the v2 series ... I'll see if I can dig out your reports on those too.
> > > > > >
> > > > > > Apologies, I was mistaken, it's compared to V3, not V2. The previous benchmark report is here:
> > > > > > https://lore.kernel.org/linux-fsdevel/CAC2o3DKNc=sL2n8291Dpiyb0bRHaX=nd33ogvO_LkJqpBj-YmA@mail.gmail.com/
> > > > >
> > > > > Are all these tests using a single file name in the open/read/close loop?
> > > >
> > > > Yes, because it's easy to implement yet enough to trigger the mutex_lock.
> > > >
> > > > And you are right, it's not a real-life pattern, but on the bright side it proves there is no original mutex_lock problem anymore. :)
> > >
> > > I've been looking at your reports and they are quite interesting.
> > >
> > > > > That being the case, the per-object inode lock will behave like a mutex, and once contention occurs any speed benefit of a spinlock over a mutex (or rwsem) will disappear.
> > > > >
> > > > > In this case changing from a write lock to a read lock in those functions and adding the inode mutex will do nothing but add the overhead of taking the read lock. Similarly, adding the update check function also just adds overhead and, as we see, once contention starts it has a cumulative effect that's often not linear.
> > > > >
> > > > > The whole idea of a read lock/per-object spin lock was to reduce the possibility of contention for paths other than the same path, while not impacting same-path accesses too much, for an overall gain. Based on this I'm thinking the update check function is probably not worth keeping; it just adds unnecessary churn and has a negative impact for same-file contention access patterns.
> > >
> > > The reports indicate (to me anyway) that the slowdown isn't due to kernfs. It looks more like kernfs is now putting pressure on the VFS, mostly on the file table lock, but it looks like there's a mild amount of contention on a few other locks as well now.
> > >
> > > That's a whole different problem, and those file table handling functions don't appear to have any obvious problems, so they are doing what they have to do and that can't be avoided.
> > >
> > > That's definitely out of scope for these changes.
> > >
> > > And, as you'd expect, once any appreciable amount of contention happens our measurements go out the window, certainly with respect to kernfs.
> > >
> > > It also doesn't change my opinion that checking whether an inode attribute update is needed in kernfs isn't useful since, IIUC, that file table lock contention would result even if you were using different paths.
> > >
> > > So I'll drop that patch from the series.
> >
> > It will probably not make any difference, but for completeness, and after some thought, I felt I should post what I think is happening with the benchmark.
> >
> > The benchmark application runs a pthreads thread for each CPU and goes into a tight open/read/close loop to demonstrate the contention that can occur on the kernfs mutex as the number of CPUs grows.
> >
> > But that tight open/read/close loop causes contention on the VFS file table because the pthreads threads share the process file table.
> >
> > So the poor performance is actually evidence the last patch is doing what it's meant to do rather than evidence of a regression with the series.
> >
> > The benchmark is putting pressure on the process file table on some hardware configurations, but those critical sections are small and there's really nothing obvious that can be done to improve the file table locking.
> >
> > It is however important to remember that the benchmark application access pattern is not a normal access pattern by a long way.
> >
> > So I don't see the need for a new revision of the series with the last patch dropped.
> >
> > If there's a change of heart and the series were to be merged, I'll leave whether to include this last patch to the discretion of the maintainer, as the bulk of the improvement comes from the earlier patches anyway.
>
> Can you please resubmit the series, it is long-gone from my review queue.

Sure, will do as soon as I can.

Thanks
Ian
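The shared process file table referred to above is a property of pthreads: all threads allocate and release descriptors from the one table, so a tight open/close loop in every thread serializes on that table's lock no matter which filesystem is underneath. One way a benchmark could separate the two effects, sketched here only as a possibility and not something proposed in the thread, is to give each worker thread its own descriptor table with unshare(CLONE_FILES) before entering the loop.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Called at the top of each worker thread in a benchmark like the
 * sketch shown earlier: after this, fd allocation and close in the
 * thread no longer go through the descriptor table shared with the
 * other workers. Whether doing this is appropriate depends on what
 * the benchmark is meant to measure.
 */
static int make_fd_table_private(void)
{
        if (unshare(CLONE_FILES) != 0) {
                perror("unshare(CLONE_FILES)");
                return -1;
        }
        return 0;
}

int main(void)
{
        return make_fd_table_private() ? 1 : 0;
}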