Message ID | 159237905950.89469.6559073274338175600.stgit@mickey.themaw.net
Series     | kernfs: proposed locking and concurrency improvement
Hello, Ian.

On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> The series here tries to reduce the locking needed during path walks
> based on the assumption that there are many path walks with a fairly
> large portion of those for non-existent paths, as described above.
>
> That was done by adding kernfs negative dentry caching (non-existent
> paths) to avoid continual alloc/free cycle of dentries and a read/write
> semaphore introduced to increase kernfs concurrency during path walks.
>
> With these changes we still need kernel parameters of udev.children-max=2048
> and systemd.default_timeout_start_sec=300 for the fastest boot times of
> under 5 minutes.

I don't have strong objections to the series but the rationales don't seem
particularly strong. It's solving a suspected problem but only half way. It
isn't clear whether this can be the long term solution for the problem
machine and whether it will benefit anyone else in a meaningful way either.

I think Greg already asked this but how are the 100,000+ memory objects
used? Is that justified in the first place?

Thanks.
On 6/19/20 8:38 AM, Tejun Heo wrote:
> I don't have strong objections to the series but the rationales don't seem
> particularly strong. It's solving a suspected problem but only half way. It
> isn't clear whether this can be the long term solution for the problem
> machine and whether it will benefit anyone else in a meaningful way either.

I don't understand your statement about solving the problem halfway. Could
you elaborate?

> I think Greg already asked this but how are the 100,000+ memory objects
> used? Is that justified in the first place?

They are used for hotplugging and partitioning memory. The size of the
segments (and thus the number of them) is dictated by the underlying
hardware.

Rick
On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> On 6/19/20 8:38 AM, Tejun Heo wrote:
> > I don't have strong objections to the series but the rationales don't seem
> > particularly strong. It's solving a suspected problem but only half way. It
> > isn't clear whether this can be the long term solution for the problem
> > machine and whether it will benefit anyone else in a meaningful way either.
>
> I don't understand your statement about solving the problem halfway. Could
> you elaborate?

Spending 5 minutes during boot creating sysfs objects doesn't seem like a
particularly good solution and I don't know whether anyone else would
experience similar issues. Again, not necessarily against improving the
scalability of kernfs code but the use case seems a bit out there.

> > I think Greg already asked this but how are the 100,000+ memory objects
> > used? Is that justified in the first place?
>
> They are used for hotplugging and partitioning memory. The size of the
> segments (and thus the number of them) is dictated by the underlying
> hardware.

This sounds so bad. There gotta be a better interface for that, right?

Thanks.
On 6/19/20 3:23 PM, Tejun Heo wrote:
> Spending 5 minutes during boot creating sysfs objects doesn't seem like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving the
> scalability of kernfs code but the use case seems a bit out there.

Creating sysfs objects is not a new "solution". We already do it, and it can
take over an hour on a machine with 64TB of memory. We're not *adding* this
... we're *improving* something that has been around for a long time.

> > They are used for hotplugging and partitioning memory. The size of the
> > segments (and thus the number of them) is dictated by the underlying
> > hardware.
>
> This sounds so bad. There gotta be a better interface for that, right?

Again; this is not new. Easily changing the state of devices by writing new
values into kernfs is one of the reasons kernfs exists.

    echo 0 > /sys/devices/system/memory/memory10374/online

and boom - you've taken memory chunk 10374 offline.

These changes are not just a whim. I used lockstat to measure contention
during boot. The addition of 250,000 "devices" in parallel created
tremendous contention on the kernfs_mutex and, it appears, on one of the
directories within it where memory nodes are created. Using a mutex means
that the details of that mutex must bounce around all the cpus ... did I
mention 1500+ cpus? A whole lot of thrash ...

These patches turn that mutex into a rw semaphore, and any thread checking
for the mere existence of something can get by with a read lock. lockstat
showed that about 90% of the mutex contentions were in a read path and only
10% in a write path. Switching to a rw semaphore meant that 90% of the time
there was no waiting and the thread proceeded with its caches intact. Total
time spent waiting on either read or write decreased by 75%, and should
scale much better with subsequent hardware upgrades.
With contention on this particular data structure reduced, we saw thrash on
related control structures decrease markedly too. The contention for the
lockref taken in dput dropped 66% and, likely due to reduced thrash, the
time used waiting for that structure dropped 99%.

Rick
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
>
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path walks
> > based on the assumption that there are many path walks with a fairly
> > large portion of those for non-existent paths, as described above.
> >
> > That was done by adding kernfs negative dentry caching (non-existent
> > paths) to avoid continual alloc/free cycle of dentries and a read/write
> > semaphore introduced to increase kernfs concurrency during path walks.
> >
> > With these changes we still need kernel parameters of udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot times of
> > under 5 minutes.
>
> I don't have strong objections to the series but the rationales don't seem
> particularly strong. It's solving a suspected problem but only half way. It
> isn't clear whether this can be the long term solution for the problem
> machine and whether it will benefit anyone else in a meaningful way either.
>
> I think Greg already asked this but how are the 100,000+ memory objects
> used? Is that justified in the first place?

The problem is real enough, however, whether improvements can be made in
other areas flowing on from the arch specific device creation notifications
is not clear cut. There's no question that there is very high contention
between the VFS and sysfs and that's all the series is trying to improve,
nothing more.

What both you and Greg have raised are good questions but are unfortunately
very difficult to answer. I tried to add some discussion about it, to the
extent that I could, in the cover letter.

Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set of
hardware resources allocated for use by, essentially, a bare metal OS
install).
Whether that's many small partitions for load balanced server farms (or
whatever) or much larger partitions for demanding applications, such as
Oracle systems, is not something that can be known in advance. So the
division into small memory chunks can't change.

The question of sysfs node creation, what uses them and when they are used
is much harder. I'm not able to find that out and I doubt even IBM would
know if their customers use applications that need to consult the sysfs
file system for this information, or when it's needed if it's needed at
all. So I'm stuck on this question. One thing is for sure though, it would
be (at the very least) risky for a vendor to assume they either aren't
needed or aren't needed early during system start up.

OTOH I've looked at what gets invoked on udev notifications (which is the
source of the heavy path walk activity, I admit I need to dig deeper) and
that doesn't appear to be doing anything obviously wrong, so that far
seems ok.

For my part, as long as the series proves to be sound, why not, it does
substantially reduce contention between the VFS and sysfs in the face of
heavy sysfs path walk activity, so I think that stands alone as sufficient
to consider the change worthwhile.

Ian
On Fri, 2020-06-19 at 18:23 -0400, Tejun Heo wrote:
> On Fri, Jun 19, 2020 at 01:41:39PM -0700, Rick Lindsley wrote:
> > On 6/19/20 8:38 AM, Tejun Heo wrote:
> > > I don't have strong objections to the series but the rationales don't
> > > seem particularly strong. It's solving a suspected problem but only
> > > half way. It isn't clear whether this can be the long term solution
> > > for the problem machine and whether it will benefit anyone else in a
> > > meaningful way either.
> >
> > I don't understand your statement about solving the problem halfway.
> > Could you elaborate?
>
> Spending 5 minutes during boot creating sysfs objects doesn't seem like a
> particularly good solution and I don't know whether anyone else would
> experience similar issues. Again, not necessarily against improving the
> scalability of kernfs code but the use case seems a bit out there.
>
> > > I think Greg already asked this but how are the 100,000+ memory
> > > objects used? Is that justified in the first place?
> >
> > They are used for hotplugging and partitioning memory. The size of the
> > segments (and thus the number of them) is dictated by the underlying
> > hardware.
>
> This sounds so bad. There gotta be a better interface for that, right?

I'm still struggling a bit to grasp what you're getting at but ...

Maybe you're talking about the underlying notification system where a
notification is sent for every event. There's nothing new about that
problem and it's becoming increasingly clear that existing kernel
notification sub-systems don't scale well.

Mount handling is a current example, which is one of the areas David
Howells is trying to improve, and that's taken years now to get as far as
it has.

It seems to me that any improvements in the area here would have a
different solution, perhaps something along the lines of multiple
notification merging, increased context carried in notifications, or the
like.
Something like the notification merging to reduce notification volume might
eventually be useful for David's notifications sub-system too (and, I
think, the design of that sub-system could probably accommodate that sort
of change away from the problematic anonymous notification sub-systems we
have now).

But it's taken a long time to get that far with that project, and the case
here would have a far more significant impact on a fairly large number of
sub-systems, both kernel and user space, so all I can hope for with this
discussion is to raise awareness of the need so that it's recognised and
thought about approaches to improving it can happen.

So, while the questions you ask are valid and your concerns real, it's
unrealistic to think there's a simple solution that can be implemented in
short order. Problem awareness is all that can be done now so that
fundamental and probably widespread improvements might be able to be
implemented over time.

But if I misunderstand your thinking on this please elaborate further.

Ian
Hello, Ian.

On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > They are used for hotplugging and partitioning memory. The size of
> > > the segments (and thus the number of them) is dictated by the
> > > underlying hardware.
> >
> > This sounds so bad. There gotta be a better interface for that, right?
>
> I'm still struggling a bit to grasp what you're getting at but ...

I was more trying to say that the sysfs device interface with per-object
directory isn't the right interface for this sort of usage at all. Are
these even real hardware pieces which can be plugged in and out? While
being a discrete piece of hardware isn't a requirement to be a device model
device, the whole thing is designed with such use cases in mind. It
definitely isn't the right design for representing a six digit number of
logical entities.

It should be obvious that representing each consecutive memory range with a
separate directory entry is far from an optimal way of representing
something like this. It's outright silly.

> Maybe you're talking about the underlying notification system where a
> notification is sent for every event.
>
> There's nothing new about that problem and it's becoming increasingly
> clear that existing kernel notification sub-systems don't scale well.
>
> Mount handling is a current example which is one of the areas David
> Howells is trying to improve and that's taken years now to get as far
> as it has.

There sure are scalability issues everywhere that need to be improved but
there also are cases where the issue is the approach itself rather than
scalability, and I'm wondering whether this is the latter.

Thanks.
On Fri, Jun 19, 2020 at 07:44:29PM -0700, Rick Lindsley wrote:
> echo 0 > /sys/devices/system/memory/memory10374/online
>
> and boom - you've taken memory chunk 10374 offline.
>
> These changes are not just a whim. I used lockstat to measure contention
> during boot. The addition of 250,000 "devices" in parallel created
> tremendous contention on the kernfs_mutex and, it appears, on one of the
> directories within it where memory nodes are created. Using a mutex means
> that the details of that mutex must bounce around all the cpus ... did I
> mention 1500+ cpus? A whole lot of thrash ...

I don't know. The above highlights the absurdity of the approach itself to
me. You seem to be aware of it too in writing: 250,000 "devices".

Thanks.
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> Hello, Ian.
>
> On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > They are used for hotplugging and partitioning memory. The size of
> > > > the segments (and thus the number of them) is dictated by the
> > > > underlying hardware.
> > >
> > > This sounds so bad. There gotta be a better interface for that, right?
> >
> > I'm still struggling a bit to grasp what you're getting at but ...
>
> I was more trying to say that the sysfs device interface with per-object
> directory isn't the right interface for this sort of usage at all. Are
> these even real hardware pieces which can be plugged in and out? While
> being a discrete piece of hardware isn't a requirement to be a device
> model device, the whole thing is designed with such use cases in mind. It
> definitely isn't the right design for representing a six digit number of
> logical entities.
>
> It should be obvious that representing each consecutive memory range with
> a separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

I agree. And again, Ian, you are just "kicking the problem down the road"
if we accept these patches. Please fix this up properly so that this
interface is correctly fixed to not do looney things like this.

thanks,

greg k-h
On 6/22/20 10:53 AM, Tejun Heo wrote:
> I don't know. The above highlights the absurdity of the approach itself to
> me. You seem to be aware of it too in writing: 250,000 "devices".

Just because it is absurd doesn't mean it wasn't built that way :)

I agree, and I'm trying to influence the next hardware design. However,
what's already out there is memory units that must be accessed in 256MB
blocks. If you want to remove/add a GB, that's really 4 blocks of memory
you're manipulating, to the hardware. Those blocks have to be registered
and recognized by the kernel for that to work.

Rick
On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> It should be obvious that representing each consecutive memory range with
> a separate directory entry is far from an optimal way of representing
> something like this. It's outright silly.

On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:
> I agree. And again, Ian, you are just "kicking the problem down the road"
> if we accept these patches. Please fix this up properly so that this
> interface is correctly fixed to not do looney things like this.

Given that we cannot change the underlying machine representation of this
hardware, what do you (all, not just you Greg) consider to be "properly"?

Rick
On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > Hello, Ian.
> >
> > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > They are used for hotplugging and partitioning memory. The size
> > > > > of the segments (and thus the number of them) is dictated by the
> > > > > underlying hardware.
> > > >
> > > > This sounds so bad. There gotta be a better interface for that,
> > > > right?
> > >
> > > I'm still struggling a bit to grasp what you're getting at but ...
> >
> > I was more trying to say that the sysfs device interface with
> > per-object directory isn't the right interface for this sort of usage
> > at all. Are these even real hardware pieces which can be plugged in and
> > out? While being a discrete piece of hardware isn't a requirement to be
> > a device model device, the whole thing is designed with such use cases
> > in mind. It definitely isn't the right design for representing a six
> > digit number of logical entities.
> >
> > It should be obvious that representing each consecutive memory range
> > with a separate directory entry is far from an optimal way of
> > representing something like this. It's outright silly.
>
> I agree. And again, Ian, you are just "kicking the problem down the road"
> if we accept these patches. Please fix this up properly so that this
> interface is correctly fixed to not do looney things like this.

Fine, mitigating this problem isn't the end of the story, and you don't
want to accept a change to mitigate it because that could mean no further
discussion on it and no further work toward solving it.

But it seems to me a "proper" solution to this will cross a number of areas
so this isn't just "my" problem and, as you point out, it's likely to
become increasingly problematic over time.
So what are your ideas and recommendations on how to handle hotplug memory
at this granularity for this much RAM (and larger amounts)?

Ian
On Mon, Jun 22, 2020 at 02:27:38PM -0700, Rick Lindsley wrote:
> > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > It should be obvious that representing each consecutive memory range
> > > with a separate directory entry is far from an optimal way of
> > > representing something like this. It's outright silly.
>
> On 6/22/20 11:03 AM, Greg Kroah-Hartman wrote:
> > I agree. And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches. Please fix this up properly so that
> > this interface is correctly fixed to not do looney things like this.
>
> Given that we cannot change the underlying machine representation of
> this hardware, what do you (all, not just you Greg) consider to be
> "properly"?

Change the userspace representation of the hardware then.

Why does userspace care about so many individual blocks, what happens if
you provide them a larger granularity? I can't imagine userspace really
wants to see 20k devices and manage them individually, where is the code
that does that?

What happens if you delay adding the devices until after booting? Userspace
should be event driven and only handle things after it sees the devices
being present, so try delaying and see what happens, to prevent this from
keeping boot from progressing.

thanks,

greg k-h
On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > Hello, Ian.
> > >
> > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > They are used for hotplugging and partitioning memory. The size
> > > > > > of the segments (and thus the number of them) is dictated by
> > > > > > the underlying hardware.
> > > > >
> > > > > This sounds so bad. There gotta be a better interface for that,
> > > > > right?
> > > >
> > > > I'm still struggling a bit to grasp what you're getting at but ...
> > >
> > > I was more trying to say that the sysfs device interface with
> > > per-object directory isn't the right interface for this sort of usage
> > > at all. Are these even real hardware pieces which can be plugged in
> > > and out? While being a discrete piece of hardware isn't a requirement
> > > to be a device model device, the whole thing is designed with such
> > > use cases in mind. It definitely isn't the right design for
> > > representing a six digit number of logical entities.
> > >
> > > It should be obvious that representing each consecutive memory range
> > > with a separate directory entry is far from an optimal way of
> > > representing something like this. It's outright silly.
> >
> > I agree. And again, Ian, you are just "kicking the problem down the
> > road" if we accept these patches. Please fix this up properly so that
> > this interface is correctly fixed to not do looney things like this.
>
> Fine, mitigating this problem isn't the end of the story, and you don't
> want to accept a change to mitigate it because that could mean no further
> discussion on it and no further work toward solving it.
> But it seems to me a "proper" solution to this will cross a number of
> areas so this isn't just "my" problem and, as you point out, it's likely
> to become increasingly problematic over time.
>
> So what are your ideas and recommendations on how to handle hotplug
> memory at this granularity for this much RAM (and larger amounts)?

First off, this is not my platform, and not my problem, so it's funny you
ask me :)

Anyway, as I have said before, my first guesses would be:
  - increase the granularity size of the "memory chunks", reducing the
    number of devices you create.
  - delay creating the devices until way after booting, or do it on a
    totally different path/thread/workqueue/whatever to prevent delay at
    booting

And then there's always:
  - don't create them at all, only do so if userspace asks you to.

You all have the userspace tools/users for this interface and know it best
to know what will work for them. If you don't, then hey, let's just delete
the whole thing and see who screams :)

thanks,

greg k-h
On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > Hello, Ian.
> > > >
> > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > They are used for hotplugging and partitioning memory. The
> > > > > > > size of the segments (and thus the number of them) is
> > > > > > > dictated by the underlying hardware.
> > > > > >
> > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > that, right?
> > > > >
> > > > > I'm still struggling a bit to grasp what you're getting at but
> > > > > ...
> > > >
> > > > I was more trying to say that the sysfs device interface with
> > > > per-object directory isn't the right interface for this sort of
> > > > usage at all. Are these even real hardware pieces which can be
> > > > plugged in and out? While being a discrete piece of hardware isn't
> > > > a requirement to be a device model device, the whole thing is
> > > > designed with such use cases in mind. It definitely isn't the
> > > > right design for representing a six digit number of logical
> > > > entities.
> > > >
> > > > It should be obvious that representing each consecutive memory
> > > > range with a separate directory entry is far from an optimal way
> > > > of representing something like this. It's outright silly.
> > >
> > > I agree. And again, Ian, you are just "kicking the problem down the
> > > road" if we accept these patches.
> >
> > Fine, mitigating this problem isn't the end of the story, and you
> > don't want to accept a change to mitigate it because that could mean
> > no further discussion on it and no further work toward solving it.
> >
> > But it seems to me a "proper" solution to this will cross a number of
> > areas so this isn't just "my" problem and, as you point out, it's
> > likely to become increasingly problematic over time.
> >
> > So what are your ideas and recommendations on how to handle hotplug
> > memory at this granularity for this much RAM (and larger amounts)?
>
> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Sorry, but I don't think it's funny at all.

It's not "my platform" either, I'm just the poor old soul that took this on
because, on the face of it, it's a file system problem as claimed by others
that looked at it and promptly washed their hands of it.

I don't see how asking for your advice is out of order at all.

> Anyway, as I have said before, my first guesses would be:
>   - increase the granularity size of the "memory chunks", reducing the
>     number of devices you create.

Yes, I didn't get that from your initial comments but you've said it a
couple of times recently and I do get it now.

I'll try and find someone appropriate to consult about that and see where
it goes.

>   - delay creating the devices until way after booting, or do it on a
>     totally different path/thread/workqueue/whatever to prevent delay
>     at booting

When you first said this it sounded like an ugly workaround to me. But
perhaps it isn't (I'm not really convinced it is TBH), so it's probably
worth trying to follow up on too.

> And then there's always:
>   - don't create them at all, only do so if userspace asks you to.

At first glance the impression I get from this is that it's an even uglier
workaround than delaying it, but it might actually be the most sensible way
to handle this, as it's been called, silliness.
We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause the
dentry op ->d_automount() to be called on access so, from a path walk
perspective, the dentries could just appear when needed.

The question I'd need to answer is whether the kernfs nodes exist so that
->d_automount() can discover if the node lookup is valid, and I think the
answer might be yes (but we would need to suppress udev notifications for
S_AUTOMOUNT nodes).

The catch will be that this is "not" mounting per-se, so anything I do
would probably be seen as an ugly hack that subverts the VFS automount
support. If I could find a way to reconcile that I could probably do this.

Al, what say you on this?

> You all have the userspace tools/users for this interface and know it
> best to know what will work for them. If you don't, then hey, let's just
> delete the whole thing and see who screams :)

Please, no joking, I'm finding it hard enough to cope with this
disappointment as it is ;)

Ian
On Tue, 2020-06-23 at 16:01 +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > >
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory. The
> > > > > > > > size of the segments (and thus the number of them) is
> > > > > > > > dictated by the underlying hardware.
> > > > > > >
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that, right?
> > > > > >
> > > > > > I'm still struggling a bit to grasp what you're getting at but
> > > > > > ...
> > > > >
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-object directory isn't the right interface for this sort of
> > > > > usage at all. Are these even real hardware pieces which can be
> > > > > plugged in and out? While being a discrete piece of hardware
> > > > > isn't a requirement to be a device model device, the whole thing
> > > > > is designed with such use cases in mind. It definitely isn't the
> > > > > right design for representing a six digit number of logical
> > > > > entities.
> > > > >
> > > > > It should be obvious that representing each consecutive memory
> > > > > range with a separate directory entry is far from an optimal way
> > > > > of representing something like this. It's outright silly.
> > > >
> > > > I agree. And again, Ian, you are just "kicking the problem down
> > > > the road" if we accept these patches. Please fix this up properly
> > > > so that this interface is correctly fixed to not do looney things
> > > > like this.
> > >
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to accept a change to mitigate it because that could mean
> > > no further discussion on it and no further work toward solving it.
> > >
> > > But it seems to me a "proper" solution to this will cross a number
> > > of areas so this isn't just "my" problem and, as you point out, it's
> > > likely to become increasingly problematic over time.
> > >
> > > So what are your ideas and recommendations on how to handle hotplug
> > > memory at this granularity for this much RAM (and larger amounts)?
> >
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
>
> Sorry, but I don't think it's funny at all.
>
> It's not "my platform" either, I'm just the poor old soul that took this
> on because, on the face of it, it's a file system problem as claimed by
> others that looked at it and promptly washed their hands of it.
>
> I don't see how asking for your advice is out of order at all.
>
> > Anyway, as I have said before, my first guesses would be:
> >   - increase the granularity size of the "memory chunks", reducing the
> >     number of devices you create.
>
> Yes, I didn't get that from your initial comments but you've said it a
> couple of times recently and I do get it now.
>
> I'll try and find someone appropriate to consult about that and see
> where it goes.
>
> >   - delay creating the devices until way after booting, or do it on a
> >     totally different path/thread/workqueue/whatever to prevent delay
> >     at booting
>
> When you first said this it sounded like an ugly workaround to me. But
> perhaps it isn't (I'm not really convinced it is TBH), so it's probably
> worth trying to follow up on too.
> > And then there's always:
> >   - don't create them at all, only do so if userspace asks you to.
>
> At first glance the impression I get from this is that it's an even
> uglier workaround than delaying it, but it might actually be the most
> sensible way to handle this, as it's been called, silliness.
>
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause the
> dentry op ->d_automount() to be called on access so, from a path walk
> perspective, the dentries could just appear when needed.
>
> The question I'd need to answer is whether the kernfs nodes exist so
> that ->d_automount() can discover if the node lookup is valid, and I
> think the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).

Or maybe taking the notion of on-demand dentry creation further, "nothing"
more than not triggering udev notifications when nodes are created, and
letting the VFS create dentries on access alone, is all that would be
needed. I'm really not sure how this would work yet ...

Ian
On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:

> First off, this is not my platform, and not my problem, so it's funny
> you ask me :)

Weeeelll, not your platform perhaps but MAINTAINERS does list you
first and Tejun second as maintainers for kernfs. So in that sense,
any patches would need to go thru you. So, your opinions do matter.

> Anyway, as I have said before, my first guesses would be:
>   - increase the granularity size of the "memory chunks", reducing
>     the number of devices you create.

This would mean finding every utility that relies on this behavior.
That may be possible, although not easy, for distro or platform
software, but it's hard to guess what user-related utilities may have
been created by other consumers of those distros or that platform. In
any case, removing an interface without warning is a hanging offense
in many Linux circles.

>   - delay creating the devices until way after booting, or do it
>     on a totally different path/thread/workqueue/whatever to
>     prevent delay at booting

This has been considered, but it again requires a full list of
utilities relying on this interface and determining which of them may
want to run before the devices are "loaded" at boot time. It may be
few, or even zero, but it would be a much more disruptive change in
the boot process than what we are suggesting.

> And then there's always:
>   - don't create them at all, only do so if userspace asks
>     you to.

If they are done in parallel on demand, you'll see the same problem
(load average of 1000+, contention in the same spot). You obviously
won't hold up the boot, of course, but your utility and anything else
running on the machine will take an unexpected pause ... for
somewhere between 30 and 90 minutes. Seems equally unfriendly.

A variant of this, which does have a positive effect, is to observe
that coldplug during initramfs does seem to load up the memory device
tree without incident. We do a second coldplug after we switch roots
and this is the one that runs into timer issues. I have asked "those
that should know" why there is a second coldplug. I can guess but
would prefer to know, to avoid that screaming option. If that second
coldplug is unnecessary for the kernfs memory interfaces to work
correctly, then that is an alternate, and perhaps even better,
solution. (It wouldn't change the fact that kernfs was not built for
speed and this problem remains below the surface to trip up another.)

However, nobody I've found can say that is safe, and I'm not fond of
the 'see who screams' test solution.

> You all have the userspace tools/users for this interface and know it
> best to know what will work for them. If you don't, then hey, let's
> just delete the whole thing and see who screams :)

I guess I'm puzzled by why everyone seems offended by suggesting we
change a mutex to a rw semaphore. In a vacuum, sure, but we have
before and after numbers. Wouldn't the same cavalier logic apply?
Why not change it and see who screams? I haven't heard any criticism
of the patch itself - I'm hearing criticism of the problem.

This problem is not specific to memory devices. As we get larger
systems, we'll see it elsewhere. We do already see a mild form of
this when fibre finds 1000-2000 fibre disks and goes to add them in
parallel. Small memory chunks introduce the problem at a level two
orders of magnitude bigger, but eventually other devices will be
subject to it too. Why not address this now?

    'Doctor, it hurts when I do this'
    'Then don't do that'

Funny as a joke. Less funny as a review comment.

Rick
On Tue, Jun 23, 2020 at 02:33:48AM -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
>
> > First off, this is not my platform, and not my problem, so it's funny
> > you ask me :)
>
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs. So in that sense,
> any patches would need to go thru you. So, your opinions do matter.

Sure, but "help, I'm abusing your code interface, so fix your code
interface and not my caller code" really isn't the best mantra :)

> > Anyway, as I have said before, my first guesses would be:
> >   - increase the granularity size of the "memory chunks", reducing
> >     the number of devices you create.
>
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may have
> been created by other consumers of those distros or that platform. In
> any case, removing an interface without warning is a hanging offense
> in many Linux circles.

I agree, so find out who uses it! You can search all debian tools
easily. You can ask any closed-source setup tools that are on your
platforms if they use it. You can "break it and see if anyone
notices", which is what we do all the time. The "hanging offence" is
"I'm breaking this even if you are using it!".

> >   - delay creating the devices until way after booting, or do it
> >     on a totally different path/thread/workqueue/whatever to
> >     prevent delay at booting
>
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them may
> want to run before the devices are "loaded" at boot time. It may be
> few, or even zero, but it would be a much more disruptive change in
> the boot process than what we are suggesting.

Is that really the case? I strongly suggest you all do some research
here.

Oh, and wrap your email lines please...

> > And then there's always:
> >   - don't create them at all, only do so if userspace asks
> >     you to.
>
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot). You obviously
> won't hold up the boot, of course, but your utility and anything else
> running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes. Seems equally unfriendly.

I agree, but it shouldn't be shutting down the whole box, other stuff
should run just fine, right? Is this platform really that "weak" that
it can't handle this happening in a single thread/cpu?

> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory device
> tree without incident. We do a second coldplug after we switch roots
> and this is the one that runs into timer issues. I have asked "those
> that should know" why there is a second coldplug. I can guess but
> would prefer to know, to avoid that screaming option. If that second
> coldplug is unnecessary for the kernfs memory interfaces to work
> correctly, then that is an alternate, and perhaps even better,
> solution. (It wouldn't change the fact that kernfs was not built for
> speed and this problem remains below the surface to trip up another.)
>
> However, nobody I've found can say that is safe, and I'm not fond of
> the 'see who screams' test solution.
>
> > You all have the userspace tools/users for this interface and know it
> > best to know what will work for them. If you don't, then hey, let's
> > just delete the whole thing and see who screams :)
>
> I guess I'm puzzled by why everyone seems offended by suggesting we
> change a mutex to a rw semaphore. In a vacuum, sure, but we have
> before and after numbers. Wouldn't the same cavalier logic apply?
> Why not change it and see who screams?

I am offended as a number of years ago this same user of kernfs/sysfs
did a lot of work to reduce the number of contentions in kernfs for
this same reason. After that work was done, "all was good". Now
this comes along again, blaming kernfs/sysfs, not the caller.

Memory is only going to get bigger over time, you might want to fix it
this way and then run away. But we have to maintain this for the next
20+ years, and you are not solving the root-problem here. It will
come back again, right?

> I haven't heard any criticism of the patch itself - I'm hearing
> criticism of the problem. This problem is not specific to memory
> devices. As we get larger systems, we'll see it elsewhere. We do
> already see a mild form of this when fibre finds 1000-2000 fibre
> disks and goes to add them in parallel. Small memory chunks introduce
> the problem at a level two orders of magnitude bigger, but eventually
> other devices will be subject to it too. Why not address this now?

1-2k devices are easy to handle, we handle 30k scsi devices today with
no problem at all, and have for 15+ years. We are "lucky" there that
the hardware is slower than kernfs/sysfs so that we are not the
bottleneck at all.

> 'Doctor, it hurts when I do this'
> 'Then don't do that'
>
> Funny as a joke. Less funny as a review comment.

Treat the system as a whole please, don't go for a short-term fix that
we all know is not solving the real problem here.

thanks,

greg k-h
On Tue, Jun 23, 2020 at 04:01:52PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 08:02 +0200, Greg Kroah-Hartman wrote:
> > On Tue, Jun 23, 2020 at 01:09:08PM +0800, Ian Kent wrote:
> > > On Mon, 2020-06-22 at 20:03 +0200, Greg Kroah-Hartman wrote:
> > > > On Mon, Jun 22, 2020 at 01:48:45PM -0400, Tejun Heo wrote:
> > > > > Hello, Ian.
> > > > >
> > > > > On Sun, Jun 21, 2020 at 12:55:33PM +0800, Ian Kent wrote:
> > > > > > > > They are used for hotplugging and partitioning memory.
> > > > > > > > The size of the segments (and thus the number of them)
> > > > > > > > is dictated by the underlying hardware.
> > > > > > >
> > > > > > > This sounds so bad. There gotta be a better interface for
> > > > > > > that, right?
> > > > > >
> > > > > > I'm still struggling a bit to grasp what you're getting at
> > > > > > but ...
> > > > >
> > > > > I was more trying to say that the sysfs device interface with
> > > > > per-object directory isn't the right interface for this sort
> > > > > of usage at all. Are these even real hardware pieces which
> > > > > can be plugged in and out? While being a discrete piece of
> > > > > hardware isn't a requirement to be a device model device, the
> > > > > whole thing is designed with such use cases in mind. It
> > > > > definitely isn't the right design for representing six digit
> > > > > number of logical entities.
> > > > >
> > > > > It should be obvious that representing each consecutive
> > > > > memory range with a separate directory entry is far from an
> > > > > optimal way of representing something like this. It's
> > > > > outright silly.
> > > >
> > > > I agree. And again, Ian, you are just "kicking the problem down
> > > > the road" if we accept these patches. Please fix this up
> > > > properly so that this interface is correctly fixed to not do
> > > > looney things like this.
> > >
> > > Fine, mitigating this problem isn't the end of the story, and you
> > > don't want to accept a change to mitigate it because that could
> > > mean no further discussion on it and no further work toward
> > > solving it.
> > >
> > > But it seems to me a "proper" solution to this will cross a
> > > number of areas so this isn't just "my" problem and, as you point
> > > out, it's likely to become increasingly problematic over time.
> > >
> > > So what are your ideas and recommendations on how to handle
> > > hotplug memory at this granularity for this much RAM (and larger
> > > amounts)?
> >
> > First off, this is not my platform, and not my problem, so it's
> > funny you ask me :)
>
> Sorry, but I don't think it's funny at all.
>
> It's not "my platform" either, I'm just the poor old soul that
> took this on because, on the face of it, it's a file system
> problem as claimed by others that looked at it and promptly
> washed their hands of it.
>
> I don't see how asking for your advice is out of order at all.
>
> > Anyway, as I have said before, my first guesses would be:
> >   - increase the granularity size of the "memory chunks", reducing
> >     the number of devices you create.
>
> Yes, I didn't get that from your initial comments but you've said
> it a couple of times recently and I do get it now.
>
> I'll try and find someone appropriate to consult about that and
> see where it goes.
>
> >   - delay creating the devices until way after booting, or do it
> >     on a totally different path/thread/workqueue/whatever to
> >     prevent delay at booting
>
> When you first said this it sounded like an ugly workaround to me.
> But perhaps it isn't (I'm not really convinced it is TBH), so it's
> probably worth trying to follow up on too.

It's not a workaround, it lets the rest of the system come up and do
useful things while you are still discovering parts of the system that
are not up and running. We do this all the time for lots of
drivers/devices/subsystems, why is memory any different here?

> > And then there's always:
> >   - don't create them at all, only do so if userspace asks
> >     you to.
>
> At first glance the impression I get from this is that it's an even
> uglier workaround than delaying it but it might actually be the most
> sensible way to handle this, as it's been called, silliness.
>
> We do have the inode flag S_AUTOMOUNT that will cause the dcache flag
> DCACHE_NEED_AUTOMOUNT to be set on the dentry and that will cause
> the dentry op ->d_automount() to be called on access so, from a path
> walk perspective, the dentries could just appear when needed.
>
> The question I'd need to answer is do the kernfs nodes exist so
> ->d_automount() can discover if the node lookup is valid, and I think
> the answer might be yes (but we would need to suppress udev
> notifications for S_AUTOMOUNT nodes).
>
> The catch will be that this is "not" mounting per-se, so anything
> I do would probably be seen as an ugly hack that subverts the VFS
> automount support.
>
> If I could find a way to reconcile that I could probably do this.

I am not meaning to do this at the fs layer, but at the device layer.
Why not wait until someone goes "hey, I wonder what my memory layout
is, let's go ask the kernel to probe all of that." Or some other such
"delayed initialization" method.

Don't mess with the fs for this, that's probably the wrong layer for
all of this.

thanks,

greg k-h
On Tue, 2020-06-23 at 02:33 -0700, Rick Lindsley wrote:
> On 6/22/20 11:02 PM, Greg Kroah-Hartman wrote:
>
> > First off, this is not my platform, and not my problem, so it's
> > funny you ask me :)
>
> Weeeelll, not your platform perhaps but MAINTAINERS does list you
> first and Tejun second as maintainers for kernfs. So in that sense,
> any patches would need to go thru you. So, your opinions do matter.
>
> > Anyway, as I have said before, my first guesses would be:
> >   - increase the granularity size of the "memory chunks", reducing
> >     the number of devices you create.
>
> This would mean finding every utility that relies on this behavior.
> That may be possible, although not easy, for distro or platform
> software, but it's hard to guess what user-related utilities may
> have been created by other consumers of those distros or that
> platform. In any case, removing an interface without warning is a
> hanging offense in many Linux circles.
>
> >   - delay creating the devices until way after booting, or do it
> >     on a totally different path/thread/workqueue/whatever to
> >     prevent delay at booting
>
> This has been considered, but it again requires a full list of
> utilities relying on this interface and determining which of them
> may want to run before the devices are "loaded" at boot time. It may
> be few, or even zero, but it would be a much more disruptive change
> in the boot process than what we are suggesting.
>
> > And then there's always:
> >   - don't create them at all, only do so if userspace asks
> >     you to.
>
> If they are done in parallel on demand, you'll see the same problem
> (load average of 1000+, contention in the same spot). You obviously
> won't hold up the boot, of course, but your utility and anything
> else running on the machine will take an unexpected pause ... for
> somewhere between 30 and 90 minutes. Seems equally unfriendly.
>
> A variant of this, which does have a positive effect, is to observe
> that coldplug during initramfs does seem to load up the memory
> device tree without incident. We do a second coldplug after we
> switch roots and this is the one that runs into timer issues. I have
> asked "those that should know" why there is a second coldplug. I can
> guess but would prefer to know, to avoid that screaming option. If
> that second coldplug is unnecessary for the kernfs memory interfaces
> to work correctly, then that is an alternate, and perhaps even
> better, solution. (It wouldn't change the fact that kernfs was not
> built for speed and this problem remains below the surface to trip
> up another.)

We might still need the patches here for that on-demand mechanism to
be feasible.

For example, for an ls of the node directory it should be doable to
enumerate the nodes in readdir without creating dentries, but there's
the inevitable stat() of each path that follows, which would probably
lead to similar contention.

And changing the division of the entries into sub-directories would
inevitably break anything that does actually need to access them.

Ian
On 6/23/20 4:45 AM, Greg Kroah-Hartman wrote:

> Sure, but "help, I'm abusing your code interface, so fix your code
> interface and not my caller code" really isn't the best mantra :)

Well, those are your words, not mine. What we're saying is, "we've
identified an interface that doesn't scale in this situation, but we
have a way to make it scale to all known configurations."

> I am offended as a number of years ago this same user of kernfs/sysfs
> did a lot of work to reduce the number of contentions in kernfs for
> this same reason. After that work was done, "all was good". Now
> this comes along again, blaming kernfs/sysfs, not the caller.

Ok. I don't know about the history, but I can tell you "blame" is not
the word I'd use. As hardware changes, Linux also changes, and over "a
number of years" it's not surprising to me if basic assumptions
changed again and led us back to a place we've been before. That's not
an indictment. It just IS.

> Memory is only going to get bigger over time, you might want to fix it
> this way and then run away. But we have to maintain this for the next
> 20+ years, and you are not solving the root-problem here. It will
> come back again, right?

If hardware vendors insist on dealing with small blocks of memory in
large aggregates, then yes it could. You'll have to trust that I am
also in discussion with hardware architects about how that is a very
bad architecture and it's time to change decades and think bigger.
Separate audience, equally contentious discussion. But the bottom line
is, it's out there already and can't be walked back.

Your response here seems to center on "kernfs was never designed for
that." If so, we're in agreement. We're suggesting a way it can be
extended to be more robust, with no (apparent) side effects. I'd like
to discuss the merits of the patch itself.

Rick
Hello, Rick.

On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > I don't know. The above highlights the absurdity of the approach
> > itself to me. You seem to be aware of it too in writing: 250,000
> > "devices".
>
> Just because it is absurd doesn't mean it wasn't built that way :)
>
> I agree, and I'm trying to influence the next hardware design.
> However,

I'm not saying that the hardware should not segment things into
however many pieces that it wants / needs to. That part is fine.

> what's already out there is memory units that must be accessed in
> 256MB blocks. If you want to remove/add a GB, that's really 4 blocks
> of memory you're manipulating, to the hardware. Those blocks have to
> be registered and recognized by the kernel for that to work.

The problem is fitting that into an interface which wholly doesn't fit
that particular requirement. It's not that difficult to imagine
different ways to represent however many memory slots, right? It'd
take work to make sure that it integrates well with whatever tooling
or use cases but once done this particular problem will be resolved
permanently and the whole thing will look a lot less silly. Wouldn't
that be better?

Thanks.
Thanks, Tejun, appreciate the feedback.

On 6/23/20 4:13 PM, Tejun Heo wrote:

> The problem is fitting that into an interface which wholly doesn't
> fit that particular requirement. It's not that difficult to imagine
> different ways to represent however many memory slots, right?

Perhaps we have different views of how this is showing up. systemd is
the primary instigator of the boot issues. Systemd runs
/usr/lib/systemd/system/systemd-udev-trigger.service which does a udev
trigger, specifically

    /usr/bin/udevadm trigger --type=devices --action=add

as part of its post-initramfs coldplug. It then waits for that to
finish, under the watch of a timer.

So, the kernel itself is reporting these devices to systemd. It gets
that information from talking to the hardware. That means, then, that
the obfuscation must either start in the kernel itself (it lies to
systemd), or start in systemd when it handles the devices it got from
the kernel.

If the kernel lies, then the actual granularity is not available to
any user utilities. Unless you're suggesting a new interface be
created that would allow utilities to determine the "real" memory
addresses available for manipulation. But the changes you describe
cannot be limited to the unknown number of auxiliary utilities.
Having one subsystem lie to another seems like the start of a bad
idea, anyway. When the hardware management console, separate from
Linux, reports a memory error, or adds or deletes memory in a guest
system, it's not going to be manipulating spoofed addresses that are
only a Linux construct.

In contrast, the provided patch fixes the observed problem with no
ripple effect to other subsystems or utilities. Greg had suggested

    Treat the system as a whole please, don't go for a short-term
    fix that we all know is not solving the real problem here.

Your solution affects multiple subsystems; this one affects one.
Which is the whole system approach in terms of risk?

You mentioned you support 30k scsi disks but only because they are
slow so the inefficiencies of kernfs don't show. That doesn't bother
you?

Rick
On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:

> In contrast, the provided patch fixes the observed problem with no
> ripple effect to other subsystems or utilities.

Your patch, as-is, is fine, but to somehow think that this is going to
solve your real problem here is the issue we keep raising. The real
problem is you have way too many devices that somehow need to all get
probed at boot time before you can do anything else.

> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
>
> Your solution affects multiple subsystems; this one affects one.
> Which is the whole system approach in terms of risk? You mentioned
> you support 30k scsi disks but only because they are slow so the
> inefficiencies of kernfs don't show. That doesn't bother you?

Systems with 30k of devices do not have any problems that I know of
because they do not do foolish things like stall booting before they
are all discovered :)

What's the odds that if we take this patch, you all have to come back
in a few years with some other change to the api due to even larger
memory sizes happening? What happens if you boot on a system with
this change and with 10x the memory you currently have? Try
simulating that by creating 10x the number of devices and see what
happens. Does the bottleneck still remain in kernfs or is it
somewhere else?

thanks,

greg k-h
Hello, Rick.

On Wed, Jun 24, 2020 at 02:04:15AM -0700, Rick Lindsley wrote:

> In contrast, the provided patch fixes the observed problem with no
> ripple effect to other subsystems or utilities.
>
> Greg had suggested
>     Treat the system as a whole please, don't go for a short-term
>     fix that we all know is not solving the real problem here.
>
> Your solution affects multiple subsystems; this one affects one.
> Which is the whole system approach in terms of risk? You mentioned
> you support 30k scsi disks but only because they are slow so the
> inefficiencies of kernfs don't show. That doesn't bother you?

I suggest putting honest thoughts into finding a long term solution
instead of these rhetorical retorts. If you really can't see how
ill-suited the current use of the interface and the proposed solution
are, I'm not sure how better to communicate them to you.

Thanks.
On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> Hello, Rick.
>
> On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > I don't know. The above highlights the absurdity of the approach
> > > itself to me. You seem to be aware of it too in writing: 250,000
> > > "devices".
> >
> > Just because it is absurd doesn't mean it wasn't built that way :)
> >
> > I agree, and I'm trying to influence the next hardware design.
> > However,
>
> I'm not saying that the hardware should not segment things into
> however many pieces that it wants / needs to. That part is fine.
>
> > what's already out there is memory units that must be accessed in
> > 256MB blocks. If you want to remove/add a GB, that's really 4
> > blocks of memory you're manipulating, to the hardware. Those
> > blocks have to be registered and recognized by the kernel for that
> > to work.
>
> The problem is fitting that into an interface which wholly doesn't
> fit that particular requirement. It's not that difficult to imagine
> different ways to represent however many memory slots, right? It'd
> take work to make sure that it integrates well with whatever tooling
> or use cases but once done this particular problem will be resolved
> permanently and the whole thing will look a lot less silly. Wouldn't
> that be better?

Well, no, I am finding it difficult to imagine different ways to
represent this, but perhaps that's because I'm blinkered about what a
solution might look like because of my file system focus.

Can "anyone" throw out some ideas with a little more detail than we
have had so far so we can maybe start to formulate an actual plan of
what needs to be done?

Ian
On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > Hello, Rick.
> >
> > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > I don't know. The above highlights the absurdity of the
> > > > approach itself to me. You seem to be aware of it too in
> > > > writing: 250,000 "devices".
> > >
> > > Just because it is absurd doesn't mean it wasn't built that
> > > way :)
> > >
> > > I agree, and I'm trying to influence the next hardware design.
> > > However,
> >
> > I'm not saying that the hardware should not segment things into
> > however many pieces that it wants / needs to. That part is fine.
> >
> > > what's already out there is memory units that must be accessed
> > > in 256MB blocks. If you want to remove/add a GB, that's really 4
> > > blocks of memory you're manipulating, to the hardware. Those
> > > blocks have to be registered and recognized by the kernel for
> > > that to work.
> >
> > The problem is fitting that into an interface which wholly doesn't
> > fit that particular requirement. It's not that difficult to
> > imagine different ways to represent however many memory slots,
> > right? It'd take work to make sure that it integrates well with
> > whatever tooling or use cases but once done this particular
> > problem will be resolved permanently and the whole thing will look
> > a lot less silly. Wouldn't that be better?
>
> Well, no, I am finding it difficult to imagine different ways to
> represent this, but perhaps that's because I'm blinkered about what
> a solution might look like because of my file system focus.
>
> Can "anyone" throw out some ideas with a little more detail than we
> have had so far so we can maybe start to formulate an actual plan of
> what needs to be done?

I think both Tejun and I have provided a number of alternatives for
you all to look into, and yet you all keep saying that those are
impossible for some unknown reason.

It's not up to me to tell you what to do to fix your broken
interfaces, as only you all know who is using this and how to handle
those changes.

It is up to me to say "don't do that!" and to refuse patches that
don't solve the root problem here. I'll review these later on (I have
1500+ patches to review at the moment) as these are a nice
micro-optimization...

And as this conversation seems to just be going in circles, I think
this is going to be my last response to it...

greg k-h
On Thu, 2020-06-25 at 11:43 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jun 25, 2020 at 04:15:19PM +0800, Ian Kent wrote:
> > On Tue, 2020-06-23 at 19:13 -0400, Tejun Heo wrote:
> > > Hello, Rick.
> > >
> > > On Mon, Jun 22, 2020 at 02:22:34PM -0700, Rick Lindsley wrote:
> > > > > I don't know. The above highlights the absurdity of the
> > > > > approach itself to me. You seem to be aware of it too in
> > > > > writing: 250,000 "devices".
> > > >
> > > > Just because it is absurd doesn't mean it wasn't built that
> > > > way :)
> > > >
> > > > I agree, and I'm trying to influence the next hardware design.
> > > > However,
> > >
> > > I'm not saying that the hardware should not segment things into
> > > however many pieces that it wants / needs to. That part is fine.
> > >
> > > > what's already out there is memory units that must be accessed
> > > > in 256MB blocks. If you want to remove/add a GB, that's really
> > > > 4 blocks of memory you're manipulating, to the hardware. Those
> > > > blocks have to be registered and recognized by the kernel for
> > > > that to work.
> > >
> > > The problem is fitting that into an interface which wholly
> > > doesn't fit that particular requirement. It's not that difficult
> > > to imagine different ways to represent however many memory
> > > slots, right? It'd take work to make sure that it integrates
> > > well with whatever tooling or use cases but once done this
> > > particular problem will be resolved permanently and the whole
> > > thing will look a lot less silly. Wouldn't that be better?
> >
> > Well, no, I am finding it difficult to imagine different ways to
> > represent this, but perhaps that's because I'm blinkered about
> > what a solution might look like because of my file system focus.
> >
> > Can "anyone" throw out some ideas with a little more detail than
> > we have had so far so we can maybe start to formulate an actual
> > plan of what needs to be done?
>
> I think both Tejun and I have provided a number of alternatives for
> you all to look into, and yet you all keep saying that those are
> impossible for some unknown reason.

Yes, those comments are a starting point to be sure. And continuing
on that path isn't helping anyone. That's why I'm asking for your
input on what a solution you (and Tejun) would see as adequate.

> It's not up to me to tell you what to do to fix your broken
> interfaces, as only you all know who is using this and how to handle
> those changes.

But it would be useful to go into a little more detail, based on your
own experience, about what you think a suitable solution might be.
That surely needs to be taken into account and used to guide the
direction of our investigation of what we do.

> It is up to me to say "don't do that!" and to refuse patches that
> don't solve the root problem here. I'll review these later on (I
> have 1500+ patches to review at the moment) as these are a nice
> micro-optimization...

Sure, and I get the "I don't want another post-and-run set of patches
that I have to maintain forever that don't fully solve the problem"
view, and any ideas, and perhaps a little more detail, on where we
might go with this would be very much appreciated.

> And as this conversation seems to just be going in circles, I think
> this is going to be my last response to it...

Which is why I'm asking this: I really would like to see this
discussion change course and become useful.

Ian
Hi,

I found this series of patches solves exactly the problem I am trying
to solve.
https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlchen@gmail.com/

The problem is reported by Brice Goglin on thread:
Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
https://lore.kernel.org/lkml/X60dvJoT4fURcnsF@kroah.com/

I independently confirmed it on a 96-core AWS c5.metal server.
Do open+read+close on /sys/devices/system/cpu/cpu15/topology/core_id
1000 times. With a single thread it takes ~2.5 us for each
open+read+close. With one thread per core, 96 threads running
simultaneously take ~540 us for the same operation (without much
variation) -- 200x slower than the single-threaded case.

My benchmark code is here:
https://github.com/foxhlchen/sysfs_benchmark

The problem can only be observed on large machines (>=16 cores).
The more cores you have the slower it can be.

Perf shows that CPUs spend most of the time (>80%) waiting on mutex
locks in kernfs_iop_permission and kernfs_dop_revalidate.

After applying this series, performance gets a huge boost -- from the
fastest at ~30 us to the worst at ~180 us (most of the time now spent
on spin_locks, the delay just stacking up, very similar to the
performance on ext4).

I hope this problem justifies this series of patches. A big mutex in
kernfs is really not nice. Due to this BIG LOCK, there is almost no
concurrency in kernfs; even operations on different files contend
with each other.

As we get more and more cores on normal machines, and because sysfs
provides such important information, this problem should be fixed.
So please reconsider accepting the patches.

For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
mentioned here
(https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
maybe a global rwsem for kn->iattr would be better??

thanks,
fox
On Thu, 2020-12-10 at 16:44 +0000, Fox Chen wrote:
> Hi,
>
> I found this series of patches solves exact the problem I am trying
> to solve.
> https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlchen@gmail.com/

Right.

> The problem is reported by Brice Goglin on thread:
> Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
> https://lore.kernel.org/lkml/X60dvJoT4fURcnsF@kroah.com/
>
> I independently comfirmed that on a 96-core AWS c5.metal server.
> Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id
> 1000 times.
> With a single thread it takes ~2.5 us for each open+read+close.
> With one thread per core, 96 threads running simultaneously takes
> 540 us for each of the same operation (without much variation) --
> 200x slower than the single thread one.

Right, interesting that it's actually a problem on such small system
configurations. I didn't think it would be evident on hardware without
a much larger configuration.

> My Benchmark code is here:
> https://github.com/foxhlchen/sysfs_benchmark
>
> The problem can only be observed in large machines (>=16 cores).
> The more cores you have the slower it can be.
>
> Perf shows that CPUs spend most of the time (>80%) waiting on mutex
> locks in kernfs_iop_permission and kernfs_dop_revalidate.
>
> After applying this, performance gets huge boost -- with the fastest
> one at ~30 us to the worst at ~180 us (most of on spin_locks, the
> delay just stacking up, very similar to the performance on ext4).

That's the problem, isn't it. Unfortunately we don't get large
improvements for nothing so I was constantly thinking, what have I
done here that isn't ok ... and I don't have an answer for that.
The series needs review from others for that but we didn't get that
far.

> I hope this problem can justifies this series of patches. A big
> mutex in kernfs is really not nice.
> Due to this BIG LOCK, concurrency in kernfs is almost NONE, even
> though you do operations on different files, they are contentious.

Well, as much as I don't like to admit it, Greg (and Tejun) do have a
point about the number of sysfs files used when there is a very large
amount of RAM. But IIUC the suggestion of altering the sysfs
representation of this device's memory would introduce all sorts of
problems, it then being different from all other device memory
representations (systemd udev coldplug for example).

But I think you're saying there are also visible improvements
elsewhere too, which is to be expected of course.

Let's not forget that, as the maintainer, Greg has every right to be
reluctant to take changes because he is likely to end up owning and
maintaining the changes. That can lead to considerable overhead and
frustration if the change isn't quite right and it's very hard to be
sure there aren't hidden problems with it.

Fact is that, due to Greg's rejection, there was much more focus by
the reporter on fixing it at the source, but I have no control over
that, I only know that it helped to get things moving.

Given the above, I was considering posting the series again and
asking for it to be re-considered, but since I annoyed Greg so much
the first time around I've been reluctant to do so.

But now is a good time I guess. Greg, please, would you re-consider
possibly accepting these patches? I would really like some actual
review of what I have done from people like yourself and Al. Ha,
after that they might well not be ok anyway!

> As we get more and more cores on normal machines and because sysfs
> provides such important information, this problem should be fix. So
> please reconsider accepting the patches.
>
> For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> mentioned here
> (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> maybe a global rwsem for kn->iattr will be better??
I wasn't sure about that. IIRC a spin lock could be used around the
initial check, with the check repeated at the end; that would
probably have been much faster, but it's much less conservative and a
bit more ugly, so I just went the conservative path since there was
so much change already.

Ian
On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun > > mentioned here > > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/), > > maybe a global > > rwsem for kn->iattr will be better?? > > I wasn't sure about that, IIRC a spin lock could be used around the > initial check and checked again at the end which would probably have > been much faster but much less conservative and a bit more ugly so > I just went the conservative path since there was so much change > already. Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember it. Based on what Tejun said it sounds like that needs work. Ian
On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote:
> On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote:
> > > For the patches, there is a mutex_lock in kn->attr_mutex, as
> > > Tejun mentioned here
> > > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/),
> > > maybe a global rwsem for kn->iattr will be better??
> >
> > I wasn't sure about that, IIRC a spin lock could be used around
> > the initial check and checked again at the end which would
> > probably have been much faster but much less conservative and a
> > bit more ugly so I just went the conservative path since there was
> > so much change already.
>
> Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember
> it.
>
> Based on what Tejun said it sounds like that needs work.

Those attribute handling patches were meant to allow taking the rw
sem read lock instead of the write lock for kernfs_refresh_inode()
updates, with the added locking to protect the inode attributes
update since it's called from the VFS both with and without the inode
lock.

Looking around it looks like kernfs_iattrs() is called from multiple
places without a node database lock at all.

I'm thinking that, to keep my proposed change straightforward and on
topic, I should just leave kernfs_refresh_inode() taking the node db
write lock for now and consider the attributes handling as a separate
change. Once that's done we could reconsider what's needed to use the
node db read lock in kernfs_refresh_inode().

Leaving it out will reduce the effectiveness of the series, but
including it would make this change much more complicated, is
somewhat off-topic, and could hamper the chances of reviewers
spotting problems with it.

Ian
On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as > > > > Tejun > > > > mentioned here > > > > (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/), > > > > maybe a global > > > > rwsem for kn->iattr will be better?? > > > > > > I wasn't sure about that, IIRC a spin lock could be used around the > > > initial check and checked again at the end which would probably > > > have > > > been much faster but much less conservative and a bit more ugly so > > > I just went the conservative path since there was so much change > > > already. > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't remember > > it. > > > > Based on what Tejun said it sounds like that needs work. > > Those attribute handling patches were meant to allow taking the rw > sem read lock instead of the write lock for kernfs_refresh_inode() > updates, with the added locking to protect the inode attributes > update since it's called from the VFS both with and without the > inode lock. Oh, understood. I was asking also because lock on kn->attr_mutex drags concurrent performance. > Looking around it looks like kernfs_iattrs() is called from multiple > places without a node database lock at all. > > I'm thinking that, to keep my proposed change straight forward > and on topic, I should just leave kernfs_refresh_inode() taking > the node db write lock for now and consider the attributes handling > as a separate change. Once that's done we could reconsider what's > needed to use the node db read lock in kernfs_refresh_inode(). You meant taking write lock of kernfs_rwsem for kernfs_refresh_inode()?? It may be a lot slower in my benchmark, let me test it. 
> It will reduce the effectiveness of the series but it would make
> this change much more complicated, and is somewhat off-topic, and
> could hamper the chances of reviewers spotting problem with it.
>
> Ian

thanks,
fox
On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as > > > > > Tejun > > > > > mentioned here > > > > > ( > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > ), > > > > > maybe a global > > > > > rwsem for kn->iattr will be better?? > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used around > > > > the > > > > initial check and checked again at the end which would probably > > > > have > > > > been much faster but much less conservative and a bit more ugly > > > > so > > > > I just went the conservative path since there was so much > > > > change > > > > already. > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > remember > > > it. > > > > > > Based on what Tejun said it sounds like that needs work. > > > > Those attribute handling patches were meant to allow taking the rw > > sem read lock instead of the write lock for kernfs_refresh_inode() > > updates, with the added locking to protect the inode attributes > > update since it's called from the VFS both with and without the > > inode lock. > > Oh, understood. I was asking also because lock on kn->attr_mutex > drags > concurrent performance. > > > Looking around it looks like kernfs_iattrs() is called from > > multiple > > places without a node database lock at all. > > > > I'm thinking that, to keep my proposed change straight forward > > and on topic, I should just leave kernfs_refresh_inode() taking > > the node db write lock for now and consider the attributes handling > > as a separate change. Once that's done we could reconsider what's > > needed to use the node db read lock in kernfs_refresh_inode(). > > You meant taking write lock of kernfs_rwsem for > kernfs_refresh_inode()?? 
> It may be a lot slower in my benchmark, let me test it.

Yes, but make sure the write lock of kernfs_rwsem is being taken, not
the read lock. That's a mistake I made initially.

Still, that attributes handling is, I think, sufficient to warrant a
separate change since it looks like it might need work. The kernfs
node db probably should be kept stable for those attribute updates,
but equally the existence of an instantiated dentry might mitigate
that.

Some people might just know whether it's ok or not but I would like
to check the callers to work out what's going on.

In any case it's academic if GKH isn't willing to consider the series
for review and possible merge.

Ian
On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote: > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> wrote: > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, as > > > > > > Tejun > > > > > > mentioned here > > > > > > ( > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > ), > > > > > > maybe a global > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used around > > > > > the > > > > > initial check and checked again at the end which would probably > > > > > have > > > > > been much faster but much less conservative and a bit more ugly > > > > > so > > > > > I just went the conservative path since there was so much > > > > > change > > > > > already. > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > remember > > > > it. > > > > > > > > Based on what Tejun said it sounds like that needs work. > > > > > > Those attribute handling patches were meant to allow taking the rw > > > sem read lock instead of the write lock for kernfs_refresh_inode() > > > updates, with the added locking to protect the inode attributes > > > update since it's called from the VFS both with and without the > > > inode lock. > > > > Oh, understood. I was asking also because lock on kn->attr_mutex > > drags > > concurrent performance. > > > > > Looking around it looks like kernfs_iattrs() is called from > > > multiple > > > places without a node database lock at all. > > > > > > I'm thinking that, to keep my proposed change straight forward > > > and on topic, I should just leave kernfs_refresh_inode() taking > > > the node db write lock for now and consider the attributes handling > > > as a separate change. 
> > > Once that's done we could reconsider what's needed to use the
> > > node db read lock in kernfs_refresh_inode().
> >
> > You meant taking write lock of kernfs_rwsem for
> > kernfs_refresh_inode()??
> > It may be a lot slower in my benchmark, let me test it.
>
> Yes, but make sure the write lock of kernfs_rwsem is being taken
> not the read lock.
>
> That's a mistake I had initially?
>
> Still, that attributes handling is, I think, sufficient to warrant
> a separate change since it looks like it might need work, the kernfs
> node db probably should be kept stable for those attribute updates
> but equally the existence of an instantiated dentry might mitigate
> the it.
>
> Some people might just know whether it's ok or not but I would like
> to check the callers to work out what's going on.
>
> In any case it's academic if GCH isn't willing to consider the
> series for review and possible merge.

Hi Ian

I removed kn->attr_mutex and changed the read lock to a write lock
for kernfs_refresh_inode:

	down_write(&kernfs_rwsem);
	kernfs_refresh_inode(kn, inode);
	up_write(&kernfs_rwsem);

Unfortunately, changes in this way make things worse; my benchmark
runs 100% slower than upstream sysfs. :(
open+read+close on a sysfs file concurrently took ~1000us.
(Currently, sysfs with the big kernfs_mutex only takes ~500us for one
open+read+close operation concurrently.)

|--45.93%--kernfs_iop_permission
|          |--22.55%--down_write
|          |          --20.69%--rwsem_down_write_slowpath
|          |                    |--8.89%--schedule

perf showed most of the time had been spent on kernfs_iop_permission.

thanks,
fox
On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote: > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> > > > wrote: > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > For the patches, there is a mutex_lock in kn->attr_mutex, > > > > > > > as > > > > > > > Tejun > > > > > > > mentioned here > > > > > > > ( > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > ), > > > > > > > maybe a global > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used > > > > > > around > > > > > > the > > > > > > initial check and checked again at the end which would > > > > > > probably > > > > > > have > > > > > > been much faster but much less conservative and a bit more > > > > > > ugly > > > > > > so > > > > > > I just went the conservative path since there was so much > > > > > > change > > > > > > already. > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > > remember > > > > > it. > > > > > > > > > > Based on what Tejun said it sounds like that needs work. > > > > > > > > Those attribute handling patches were meant to allow taking the > > > > rw > > > > sem read lock instead of the write lock for > > > > kernfs_refresh_inode() > > > > updates, with the added locking to protect the inode attributes > > > > update since it's called from the VFS both with and without the > > > > inode lock. > > > > > > Oh, understood. I was asking also because lock on kn->attr_mutex > > > drags > > > concurrent performance. > > > > > > > Looking around it looks like kernfs_iattrs() is called from > > > > multiple > > > > places without a node database lock at all. 
> > > > > > > > I'm thinking that, to keep my proposed change straight forward > > > > and on topic, I should just leave kernfs_refresh_inode() taking > > > > the node db write lock for now and consider the attributes > > > > handling > > > > as a separate change. Once that's done we could reconsider > > > > what's > > > > needed to use the node db read lock in kernfs_refresh_inode(). > > > > > > You meant taking write lock of kernfs_rwsem for > > > kernfs_refresh_inode()?? > > > It may be a lot slower in my benchmark, let me test it. > > > > Yes, but make sure the write lock of kernfs_rwsem is being taken > > not the read lock. > > > > That's a mistake I had initially? > > > > Still, that attributes handling is, I think, sufficient to warrant > > a separate change since it looks like it might need work, the > > kernfs > > node db probably should be kept stable for those attribute updates > > but equally the existence of an instantiated dentry might mitigate > > the it. > > > > Some people might just know whether it's ok or not but I would like > > to check the callers to work out what's going on. > > > > In any case it's academic if GCH isn't willing to consider the > > series > > for review and possible merge. > > > Hi Ian > > I removed kn->attr_mutex and changed read lock to write lock for > kernfs_refresh_inode > > down_write(&kernfs_rwsem); > kernfs_refresh_inode(kn, inode); > up_write(&kernfs_rwsem); > > > Unfortunate, changes in this way make things worse, my benchmark > runs > 100% slower than upstream sysfs. :( > open+read+close a sysfs file concurrently took 1000us. (Currently, > sysfs with a big mutex kernfs_mutex only takes ~500us > for one open+read+close operation concurrently) Right, so it does need attention nowish. I'll have a look at it in a while, I really need to get a new autofs release out, and there are quite a few changes, and testing is seeing a number of errors, some old, some newly introduced. It's proving difficult. 
> |--45.93%--kernfs_iop_permission
> |          |--22.55%--down_write
> |          |          --20.69%--rwsem_down_write_slowpath
> |          |                    |--8.89%--schedule
>
> perf showed most of the time had been spent on kernfs_iop_permission
>
> thanks,
> fox
On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote: > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> > > > > wrote: > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > >attr_mutex, > > > > > > > > as > > > > > > > > Tejun > > > > > > > > mentioned here > > > > > > > > ( > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > > ), > > > > > > > > maybe a global > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used > > > > > > > around > > > > > > > the > > > > > > > initial check and checked again at the end which would > > > > > > > probably > > > > > > > have > > > > > > > been much faster but much less conservative and a bit > > > > > > > more > > > > > > > ugly > > > > > > > so > > > > > > > I just went the conservative path since there was so much > > > > > > > change > > > > > > > already. > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > > > remember > > > > > > it. > > > > > > > > > > > > Based on what Tejun said it sounds like that needs work. > > > > > > > > > > Those attribute handling patches were meant to allow taking > > > > > the > > > > > rw > > > > > sem read lock instead of the write lock for > > > > > kernfs_refresh_inode() > > > > > updates, with the added locking to protect the inode > > > > > attributes > > > > > update since it's called from the VFS both with and without > > > > > the > > > > > inode lock. > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > >attr_mutex > > > > drags > > > > concurrent performance. 
> > > > > > > > > Looking around it looks like kernfs_iattrs() is called from > > > > > multiple > > > > > places without a node database lock at all. > > > > > > > > > > I'm thinking that, to keep my proposed change straight > > > > > forward > > > > > and on topic, I should just leave kernfs_refresh_inode() > > > > > taking > > > > > the node db write lock for now and consider the attributes > > > > > handling > > > > > as a separate change. Once that's done we could reconsider > > > > > what's > > > > > needed to use the node db read lock in > > > > > kernfs_refresh_inode(). > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > kernfs_refresh_inode()?? > > > > It may be a lot slower in my benchmark, let me test it. > > > > > > Yes, but make sure the write lock of kernfs_rwsem is being taken > > > not the read lock. > > > > > > That's a mistake I had initially? > > > > > > Still, that attributes handling is, I think, sufficient to > > > warrant > > > a separate change since it looks like it might need work, the > > > kernfs > > > node db probably should be kept stable for those attribute > > > updates > > > but equally the existence of an instantiated dentry might > > > mitigate > > > the it. > > > > > > Some people might just know whether it's ok or not but I would > > > like > > > to check the callers to work out what's going on. > > > > > > In any case it's academic if GCH isn't willing to consider the > > > series > > > for review and possible merge. > > > > > Hi Ian > > > > I removed kn->attr_mutex and changed read lock to write lock for > > kernfs_refresh_inode > > > > down_write(&kernfs_rwsem); > > kernfs_refresh_inode(kn, inode); > > up_write(&kernfs_rwsem); > > > > > > Unfortunate, changes in this way make things worse, my benchmark > > runs > > 100% slower than upstream sysfs. :( > > open+read+close a sysfs file concurrently took 1000us. 
> > (Currently, sysfs with a big mutex kernfs_mutex only takes ~500us
> > for one open+read+close operation concurrently)
>
> Right, so it does need attention nowish.
>
> I'll have a look at it in a while, I really need to get a new autofs
> release out, and there are quite a few changes, and testing is
> seeing a number of errors, some old, some newly introduced. It's
> proving difficult.

I've taken a breather from the autofs testing and had a look at this.

I think my original analysis of this was wrong.

Could you try this patch please. I'm not sure how much difference it
will make but, in principle, it's much the same as the previous
approach except it doesn't increase the kernfs node struct size or
mess with the other attribute handling code.

Note, this is not even compile tested.

kernfs: use kernfs read lock in .getattr() and .permission()

From: Ian Kent <raven@themaw.net>

From Documentation/filesystems.rst and (slightly outdated) comments
in fs/attr.c, the inode i_rwsem is used for attribute handling.

This lock satisfies the requirements needed to reduce lock
contention, namely that a per-object lock be used rather than a file
system global lock, with the kernfs node db held stable for read
operations.

In particular it should reduce lock contention seen when calling the
kernfs .permission() method.

The inode methods .getattr() and .permission() do not hold the inode
i_rwsem lock when called as they are usually read operations. Also
the .permission() method checks for rcu-walk mode and returns -ECHILD
to the VFS if it is set. So the i_rwsem lock can be used in
kernfs_iop_getattr() and kernfs_iop_permission() to protect the inode
update done by kernfs_refresh_inode(). Using this lock allows the
kernfs node db write lock in these functions to be changed to a read
lock.
Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index ddaf18198935..568037e9efe9 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
 	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	inode_lock(inode);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
+	inode_unlock(inode);
 
 	generic_fillattr(inode, stat);
 	return 0;
@@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode, int mask)
 
 	kn = inode->i_private;
 
-	down_write(&kernfs_rwsem);
+	inode_lock(inode);
+	down_read(&kernfs_rwsem);
 	kernfs_refresh_inode(kn, inode);
-	up_write(&kernfs_rwsem);
+	up_read(&kernfs_rwsem);
+	inode_unlock(inode);
 
 	return generic_permission(inode, mask);
 }
On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote: > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> wrote: > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net> > > > > > wrote: > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > > >attr_mutex, > > > > > > > > > as > > > > > > > > > Tejun > > > > > > > > > mentioned here > > > > > > > > > ( > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > > > ), > > > > > > > > > maybe a global > > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be used > > > > > > > > around > > > > > > > > the > > > > > > > > initial check and checked again at the end which would > > > > > > > > probably > > > > > > > > have > > > > > > > > been much faster but much less conservative and a bit > > > > > > > > more > > > > > > > > ugly > > > > > > > > so > > > > > > > > I just went the conservative path since there was so much > > > > > > > > change > > > > > > > > already. > > > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH didn't > > > > > > > remember > > > > > > > it. > > > > > > > > > > > > > > Based on what Tejun said it sounds like that needs work. 
> > > > > > > > > > > > Those attribute handling patches were meant to allow taking > > > > > > the > > > > > > rw > > > > > > sem read lock instead of the write lock for > > > > > > kernfs_refresh_inode() > > > > > > updates, with the added locking to protect the inode > > > > > > attributes > > > > > > update since it's called from the VFS both with and without > > > > > > the > > > > > > inode lock. > > > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > > >attr_mutex > > > > > drags > > > > > concurrent performance. > > > > > > > > > > > Looking around it looks like kernfs_iattrs() is called from > > > > > > multiple > > > > > > places without a node database lock at all. > > > > > > > > > > > > I'm thinking that, to keep my proposed change straight > > > > > > forward > > > > > > and on topic, I should just leave kernfs_refresh_inode() > > > > > > taking > > > > > > the node db write lock for now and consider the attributes > > > > > > handling > > > > > > as a separate change. Once that's done we could reconsider > > > > > > what's > > > > > > needed to use the node db read lock in > > > > > > kernfs_refresh_inode(). > > > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > > kernfs_refresh_inode()?? > > > > > It may be a lot slower in my benchmark, let me test it. > > > > > > > > Yes, but make sure the write lock of kernfs_rwsem is being taken > > > > not the read lock. > > > > > > > > That's a mistake I had initially? > > > > > > > > Still, that attributes handling is, I think, sufficient to > > > > warrant > > > > a separate change since it looks like it might need work, the > > > > kernfs > > > > node db probably should be kept stable for those attribute > > > > updates > > > > but equally the existence of an instantiated dentry might > > > > mitigate > > > > the it. 
> > > > > > > > Some people might just know whether it's ok or not but I would > > > > like > > > > to check the callers to work out what's going on. > > > > > > > > In any case it's academic if GCH isn't willing to consider the > > > > series > > > > for review and possible merge. > > > > > > > Hi Ian > > > > > > I removed kn->attr_mutex and changed read lock to write lock for > > > kernfs_refresh_inode > > > > > > down_write(&kernfs_rwsem); > > > kernfs_refresh_inode(kn, inode); > > > up_write(&kernfs_rwsem); > > > > > > > > > Unfortunate, changes in this way make things worse, my benchmark > > > runs > > > 100% slower than upstream sysfs. :( > > > open+read+close a sysfs file concurrently took 1000us. (Currently, > > > sysfs with a big mutex kernfs_mutex only takes ~500us > > > for one open+read+close operation concurrently) > > > > Right, so it does need attention nowish. > > > > I'll have a look at it in a while, I really need to get a new autofs > > release out, and there are quite a few changes, and testing is seeing > > a number of errors, some old, some newly introduced. It's proving > > difficult. > > I've taken a breather for the autofs testing and had a look at this. Thanks. :) > I think my original analysis of this was wrong. > > Could you try this patch please. > I'm not sure how much difference it will make but, in principle, > it's much the same as the previous approach except it doesn't > increase the kernfs node struct size or mess with the other > attribute handling code. > > Note, this is not even compile tested. I failed to apply this patch. 
So based on the original six patches, I manually removed kn->attr_mutex, and added inode_lock/inode_unlock to those two functions, they were like: int kernfs_iop_getattr(const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags) { struct inode *inode = d_inode(path->dentry); struct kernfs_node *kn = inode->i_private; inode_lock(inode); down_read(&kernfs_rwsem); kernfs_refresh_inode(kn, inode); up_read(&kernfs_rwsem); inode_unlock(inode); generic_fillattr(inode, stat); return 0; } int kernfs_iop_permission(struct inode *inode, int mask) { struct kernfs_node *kn; if (mask & MAY_NOT_BLOCK) return -ECHILD; kn = inode->i_private; inode_lock(inode); down_read(&kernfs_rwsem); kernfs_refresh_inode(kn, inode); up_read(&kernfs_rwsem); inode_unlock(inode); return generic_permission(inode, mask); } But I couldn't boot the kernel and there was no error on the screen. I guess it was deadlocked on /sys creation?? :D > kernfs: use kernfs read lock in .getattr() and .permission() > > From: Ian Kent <raven@themaw.net> > > From Documenation/filesystems.rst and (slightly outdated) comments > in fs/attr.c the inode i_rwsem is used for attribute handling. > > This lock satisfies the requirememnts needed to reduce lock contention, > namely a per-object lock needs to be used rather than a file system > global lock with the kernfs node db held stable for read operations. > > In particular it should reduce lock contention seen when calling the > kernfs .permission() method. > > The inode methods .getattr() and .permission() do not hold the inode > i_rwsem lock when called as they are usually read operations. Also > the .permission() method checks for rcu-walk mode and returns -ECHILD > to the VFS if it is set. So the i_rwsem lock can be used in > kernfs_iop_getattr() and kernfs_iop_permission() to protect the inode > update done by kernfs_refresh_inode(). Using this lock allows the > kernfs node db write lock in these functions to be changed to a read > lock. 
> > Signed-off-by: Ian Kent <raven@themaw.net> > --- > fs/kernfs/inode.c | 12 ++++++++---- > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > index ddaf18198935..568037e9efe9 100644 > --- a/fs/kernfs/inode.c > +++ b/fs/kernfs/inode.c > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path *path, struct kstat *stat, > struct inode *inode = d_inode(path->dentry); > struct kernfs_node *kn = inode->i_private; > > - down_write(&kernfs_rwsem); > + inode_lock(inode); > + down_read(&kernfs_rwsem); > kernfs_refresh_inode(kn, inode); > - up_write(&kernfs_rwsem); > + up_read(&kernfs_rwsem); > + inode_unlock(inode); > > generic_fillattr(inode, stat); > return 0; > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode, int mask) > > kn = inode->i_private; > > - down_write(&kernfs_rwsem); > + inode_lock(inode); > + down_read(&kernfs_rwsem); > kernfs_refresh_inode(kn, inode); > - up_write(&kernfs_rwsem); > + up_read(&kernfs_rwsem); > + inode_unlock(inode); > > return generic_permission(inode, mask); > } > thanks, fox
On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote: > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote: > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> > > > > wrote: > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent <raven@themaw.net > > > > > > > > > > > > > wrote: > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > > > > attr_mutex, > > > > > > > > > > as > > > > > > > > > > Tejun > > > > > > > > > > mentioned here > > > > > > > > > > ( > > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > > > > ), > > > > > > > > > > maybe a global > > > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be > > > > > > > > > used > > > > > > > > > around > > > > > > > > > the > > > > > > > > > initial check and checked again at the end which > > > > > > > > > would > > > > > > > > > probably > > > > > > > > > have > > > > > > > > > been much faster but much less conservative and a bit > > > > > > > > > more > > > > > > > > > ugly > > > > > > > > > so > > > > > > > > > I just went the conservative path since there was so > > > > > > > > > much > > > > > > > > > change > > > > > > > > > already. > > > > > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH > > > > > > > > didn't > > > > > > > > remember > > > > > > > > it. > > > > > > > > > > > > > > > > Based on what Tejun said it sounds like that needs > > > > > > > > work. 
> > > > > > > > > > > > > > Those attribute handling patches were meant to allow > > > > > > > taking > > > > > > > the > > > > > > > rw > > > > > > > sem read lock instead of the write lock for > > > > > > > kernfs_refresh_inode() > > > > > > > updates, with the added locking to protect the inode > > > > > > > attributes > > > > > > > update since it's called from the VFS both with and > > > > > > > without > > > > > > > the > > > > > > > inode lock. > > > > > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > > > > attr_mutex > > > > > > drags > > > > > > concurrent performance. > > > > > > > > > > > > > Looking around it looks like kernfs_iattrs() is called > > > > > > > from > > > > > > > multiple > > > > > > > places without a node database lock at all. > > > > > > > > > > > > > > I'm thinking that, to keep my proposed change straight > > > > > > > forward > > > > > > > and on topic, I should just leave kernfs_refresh_inode() > > > > > > > taking > > > > > > > the node db write lock for now and consider the > > > > > > > attributes > > > > > > > handling > > > > > > > as a separate change. Once that's done we could > > > > > > > reconsider > > > > > > > what's > > > > > > > needed to use the node db read lock in > > > > > > > kernfs_refresh_inode(). > > > > > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > > > kernfs_refresh_inode()?? > > > > > > It may be a lot slower in my benchmark, let me test it. > > > > > > > > > > Yes, but make sure the write lock of kernfs_rwsem is being > > > > > taken > > > > > not the read lock. > > > > > > > > > > That's a mistake I had initially? 
> > > > > > > > > > Still, that attributes handling is, I think, sufficient to > > > > > warrant > > > > > a separate change since it looks like it might need work, the > > > > > kernfs > > > > > node db probably should be kept stable for those attribute > > > > > updates > > > > > but equally the existence of an instantiated dentry might > > > > > mitigate > > > > > the it. > > > > > > > > > > Some people might just know whether it's ok or not but I > > > > > would > > > > > like > > > > > to check the callers to work out what's going on. > > > > > > > > > > In any case it's academic if GCH isn't willing to consider > > > > > the > > > > > series > > > > > for review and possible merge. > > > > > > > > > Hi Ian > > > > > > > > I removed kn->attr_mutex and changed read lock to write lock > > > > for > > > > kernfs_refresh_inode > > > > > > > > down_write(&kernfs_rwsem); > > > > kernfs_refresh_inode(kn, inode); > > > > up_write(&kernfs_rwsem); > > > > > > > > > > > > Unfortunate, changes in this way make things worse, my > > > > benchmark > > > > runs > > > > 100% slower than upstream sysfs. :( > > > > open+read+close a sysfs file concurrently took 1000us. > > > > (Currently, > > > > sysfs with a big mutex kernfs_mutex only takes ~500us > > > > for one open+read+close operation concurrently) > > > > > > Right, so it does need attention nowish. > > > > > > I'll have a look at it in a while, I really need to get a new > > > autofs > > > release out, and there are quite a few changes, and testing is > > > seeing > > > a number of errors, some old, some newly introduced. It's proving > > > difficult. > > > > I've taken a breather for the autofs testing and had a look at > > this. > > Thanks. :) > > > I think my original analysis of this was wrong. > > > > Could you try this patch please. 
> > I'm not sure how much difference it will make but, in principle, > > it's much the same as the previous approach except it doesn't > > increase the kernfs node struct size or mess with the other > > attribute handling code. > > > > Note, this is not even compile tested. > > I failed to apply this patch. So based on the original six patches, I > manually removed kn->attr_mutex, and added > inode_lock/inode_unlock to those two functions, they were like: > > int kernfs_iop_getattr(const struct path *path, struct kstat *stat, > u32 request_mask, unsigned int query_flags) > { > struct inode *inode = d_inode(path->dentry); > struct kernfs_node *kn = inode->i_private; > > inode_lock(inode); > down_read(&kernfs_rwsem); > kernfs_refresh_inode(kn, inode); > up_read(&kernfs_rwsem); > inode_unlock(inode); > > generic_fillattr(inode, stat); > return 0; > } > > int kernfs_iop_permission(struct inode *inode, int mask) > { > struct kernfs_node *kn; > > if (mask & MAY_NOT_BLOCK) > return -ECHILD; > > kn = inode->i_private; > > inode_lock(inode); > down_read(&kernfs_rwsem); > kernfs_refresh_inode(kn, inode); > up_read(&kernfs_rwsem); > inode_unlock(inode); > > return generic_permission(inode, mask); > } > > But I couldn't boot the kernel and there was no error on the screen. > I guess it was deadlocked on /sys creation?? :D Right, I guess the locking documentation is out of date. I'm guessing the inode lock is taken somewhere over the .permission() call. If that usage is consistent it's easily fixed; if the usage is inconsistent it's hard to deal with and amounts to a bug. I'll have another look at it. Also, it sounds like I'm working from a more recent series. I had 8 patches, dropped the last three and added the one I posted. If I can work out what's going on I'll post the series for you to check. 
Ian > > > kernfs: use kernfs read lock in .getattr() and .permission() > > > > From: Ian Kent <raven@themaw.net> > > > > From Documenation/filesystems.rst and (slightly outdated) comments > > in fs/attr.c the inode i_rwsem is used for attribute handling. > > > > This lock satisfies the requirememnts needed to reduce lock > > contention, > > namely a per-object lock needs to be used rather than a file system > > global lock with the kernfs node db held stable for read > > operations. > > > > In particular it should reduce lock contention seen when calling > > the > > kernfs .permission() method. > > > > The inode methods .getattr() and .permission() do not hold the > > inode > > i_rwsem lock when called as they are usually read operations. Also > > the .permission() method checks for rcu-walk mode and returns > > -ECHILD > > to the VFS if it is set. So the i_rwsem lock can be used in > > kernfs_iop_getattr() and kernfs_iop_permission() to protect the > > inode > > update done by kernfs_refresh_inode(). Using this lock allows the > > kernfs node db write lock in these functions to be changed to a > > read > > lock. 
> > > > Signed-off-by: Ian Kent <raven@themaw.net> > > --- > > fs/kernfs/inode.c | 12 ++++++++---- > > 1 file changed, 8 insertions(+), 4 deletions(-) > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > index ddaf18198935..568037e9efe9 100644 > > --- a/fs/kernfs/inode.c > > +++ b/fs/kernfs/inode.c > > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path > > *path, struct kstat *stat, > > struct inode *inode = d_inode(path->dentry); > > struct kernfs_node *kn = inode->i_private; > > > > - down_write(&kernfs_rwsem); > > + inode_lock(inode); > > + down_read(&kernfs_rwsem); > > kernfs_refresh_inode(kn, inode); > > - up_write(&kernfs_rwsem); > > + up_read(&kernfs_rwsem); > > + inode_unlock(inode); > > > > generic_fillattr(inode, stat); > > return 0; > > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode *inode, > > int mask) > > > > kn = inode->i_private; > > > > - down_write(&kernfs_rwsem); > > + inode_lock(inode); > > + down_read(&kernfs_rwsem); > > kernfs_refresh_inode(kn, inode); > > - up_write(&kernfs_rwsem); > > + up_read(&kernfs_rwsem); > > + inode_unlock(inode); > > > > return generic_permission(inode, mask); > > } > > > > thanks, > fox
On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote: > On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote: > > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> wrote: > > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> > > > > > wrote: > > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent < > > > > > > > raven@themaw.net > > > > > > > wrote: > > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > > > > > attr_mutex, > > > > > > > > > > > as > > > > > > > > > > > Tejun > > > > > > > > > > > mentioned here > > > > > > > > > > > ( > > > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > > > > > ), > > > > > > > > > > > maybe a global > > > > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could be > > > > > > > > > > used > > > > > > > > > > around > > > > > > > > > > the > > > > > > > > > > initial check and checked again at the end which > > > > > > > > > > would > > > > > > > > > > probably > > > > > > > > > > have > > > > > > > > > > been much faster but much less conservative and a > > > > > > > > > > bit > > > > > > > > > > more > > > > > > > > > > ugly > > > > > > > > > > so > > > > > > > > > > I just went the conservative path since there was > > > > > > > > > > so > > > > > > > > > > much > > > > > > > > > > change > > > > > > > > > > already. > > > > > > > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH > > > > > > > > > didn't > > > > > > > > > remember > > > > > > > > > it. 
> > > > > > > > > > > > > > > > > > Based on what Tejun said it sounds like that needs > > > > > > > > > work. > > > > > > > > > > > > > > > > Those attribute handling patches were meant to allow > > > > > > > > taking > > > > > > > > the > > > > > > > > rw > > > > > > > > sem read lock instead of the write lock for > > > > > > > > kernfs_refresh_inode() > > > > > > > > updates, with the added locking to protect the inode > > > > > > > > attributes > > > > > > > > update since it's called from the VFS both with and > > > > > > > > without > > > > > > > > the > > > > > > > > inode lock. > > > > > > > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > > > > > attr_mutex > > > > > > > drags > > > > > > > concurrent performance. > > > > > > > > > > > > > > > Looking around it looks like kernfs_iattrs() is called > > > > > > > > from > > > > > > > > multiple > > > > > > > > places without a node database lock at all. > > > > > > > > > > > > > > > > I'm thinking that, to keep my proposed change straight > > > > > > > > forward > > > > > > > > and on topic, I should just leave > > > > > > > > kernfs_refresh_inode() > > > > > > > > taking > > > > > > > > the node db write lock for now and consider the > > > > > > > > attributes > > > > > > > > handling > > > > > > > > as a separate change. Once that's done we could > > > > > > > > reconsider > > > > > > > > what's > > > > > > > > needed to use the node db read lock in > > > > > > > > kernfs_refresh_inode(). > > > > > > > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > > > > kernfs_refresh_inode()?? > > > > > > > It may be a lot slower in my benchmark, let me test it. > > > > > > > > > > > > Yes, but make sure the write lock of kernfs_rwsem is being > > > > > > taken > > > > > > not the read lock. > > > > > > > > > > > > That's a mistake I had initially? 
> > > > > > > > > > > > Still, that attributes handling is, I think, sufficient to > > > > > > warrant > > > > > > a separate change since it looks like it might need work, > > > > > > the > > > > > > kernfs > > > > > > node db probably should be kept stable for those attribute > > > > > > updates > > > > > > but equally the existence of an instantiated dentry might > > > > > > mitigate > > > > > > the it. > > > > > > > > > > > > Some people might just know whether it's ok or not but I > > > > > > would > > > > > > like > > > > > > to check the callers to work out what's going on. > > > > > > > > > > > > In any case it's academic if GCH isn't willing to consider > > > > > > the > > > > > > series > > > > > > for review and possible merge. > > > > > > > > > > > Hi Ian > > > > > > > > > > I removed kn->attr_mutex and changed read lock to write lock > > > > > for > > > > > kernfs_refresh_inode > > > > > > > > > > down_write(&kernfs_rwsem); > > > > > kernfs_refresh_inode(kn, inode); > > > > > up_write(&kernfs_rwsem); > > > > > > > > > > > > > > > Unfortunate, changes in this way make things worse, my > > > > > benchmark > > > > > runs > > > > > 100% slower than upstream sysfs. :( > > > > > open+read+close a sysfs file concurrently took 1000us. > > > > > (Currently, > > > > > sysfs with a big mutex kernfs_mutex only takes ~500us > > > > > for one open+read+close operation concurrently) > > > > > > > > Right, so it does need attention nowish. > > > > > > > > I'll have a look at it in a while, I really need to get a new > > > > autofs > > > > release out, and there are quite a few changes, and testing is > > > > seeing > > > > a number of errors, some old, some newly introduced. It's > > > > proving > > > > difficult. > > > > > > I've taken a breather for the autofs testing and had a look at > > > this. > > > > Thanks. :) > > > > > I think my original analysis of this was wrong. > > > > > > Could you try this patch please. 
> > > I'm not sure how much difference it will make but, in principle, > > > it's much the same as the previous approach except it doesn't > > > increase the kernfs node struct size or mess with the other > > > attribute handling code. > > > > > > Note, this is not even compile tested. > > > > I failed to apply this patch. So based on the original six patches, > > I > > manually removed kn->attr_mutex, and added > > inode_lock/inode_unlock to those two functions, they were like: > > > > int kernfs_iop_getattr(const struct path *path, struct kstat *stat, > > u32 request_mask, unsigned int query_flags) > > { > > struct inode *inode = d_inode(path->dentry); > > struct kernfs_node *kn = inode->i_private; > > > > inode_lock(inode); > > down_read(&kernfs_rwsem); > > kernfs_refresh_inode(kn, inode); > > up_read(&kernfs_rwsem); > > inode_unlock(inode); > > > > generic_fillattr(inode, stat); > > return 0; > > } > > > > int kernfs_iop_permission(struct inode *inode, int mask) > > { > > struct kernfs_node *kn; > > > > if (mask & MAY_NOT_BLOCK) > > return -ECHILD; > > > > kn = inode->i_private; > > > > inode_lock(inode); > > down_read(&kernfs_rwsem); > > kernfs_refresh_inode(kn, inode); > > up_read(&kernfs_rwsem); > > inode_unlock(inode); > > > > return generic_permission(inode, mask); > > } > > > > But I couldn't boot the kernel and there was no error on the > > screen. > > I guess it was deadlocked on /sys creation?? :D > > Right, I guess the locking documentation is out of date. I'm guessing > the inode lock is taken somewhere over the .permission() call. If > that > usage is consistent it's easily fixed; if the usage is inconsistent > it's > hard to deal with and amounts to a bug. Yes, it is called with the inode lock held shared on open, held exclusive on open with O_CREAT, and not held at all at the start of path resolution. That can't really be called a VFS bug since .permission() is meant to check permissions, not update the inode. 
This is probably what led to the attr patches I had. If a suitable place to put a local per-object lock can't be found for this, other than in the kernfs_node, then it's a real problem from a contention POV. What could be done is to make the kernfs node attr_mutex a pointer and dynamically allocate it, but even that is too costly a size addition to the kernfs node structure as Tejun has said. Those patches I referred to clearly aren't finished because the eighth one is empty, which followed a patch I have titled "kernfs: make attr_mutex a local kernfs node lock". I obviously gave up on it when the series was rejected. But I'll give it some more thought. Ian > > I'll have another look at it. > > Also, it sounds like I'm working from a more recent series. > > I had 8 patches, dropped the last three and added the one I posted. > If I can work out what's going on I'll post the series for you to > check. > > Ian > > > > kernfs: use kernfs read lock in .getattr() and .permission() > > > > > > From: Ian Kent <raven@themaw.net> > > > > > > From Documenation/filesystems.rst and (slightly outdated) > > > comments > > > in fs/attr.c the inode i_rwsem is used for attribute handling. > > > > > > This lock satisfies the requirememnts needed to reduce lock > > > contention, > > > namely a per-object lock needs to be used rather than a file > > > system > > > global lock with the kernfs node db held stable for read > > > operations. > > > > > > In particular it should reduce lock contention seen when calling > > > the > > > kernfs .permission() method. > > > > > > The inode methods .getattr() and .permission() do not hold the > > > inode > > > i_rwsem lock when called as they are usually read operations. > > > Also > > > the .permission() method checks for rcu-walk mode and returns > > > -ECHILD > > > to the VFS if it is set. 
So the i_rwsem lock can be used in > > > kernfs_iop_getattr() and kernfs_iop_permission() to protect the > > > inode > > > update done by kernfs_refresh_inode(). Using this lock allows the > > > kernfs node db write lock in these functions to be changed to a > > > read > > > lock. > > > > > > Signed-off-by: Ian Kent <raven@themaw.net> > > > --- > > > fs/kernfs/inode.c | 12 ++++++++---- > > > 1 file changed, 8 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > > index ddaf18198935..568037e9efe9 100644 > > > --- a/fs/kernfs/inode.c > > > +++ b/fs/kernfs/inode.c > > > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path > > > *path, struct kstat *stat, > > > struct inode *inode = d_inode(path->dentry); > > > struct kernfs_node *kn = inode->i_private; > > > > > > - down_write(&kernfs_rwsem); > > > + inode_lock(inode); > > > + down_read(&kernfs_rwsem); > > > kernfs_refresh_inode(kn, inode); > > > - up_write(&kernfs_rwsem); > > > + up_read(&kernfs_rwsem); > > > + inode_unlock(inode); > > > > > > generic_fillattr(inode, stat); > > > return 0; > > > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode > > > *inode, > > > int mask) > > > > > > kn = inode->i_private; > > > > > > - down_write(&kernfs_rwsem); > > > + inode_lock(inode); > > > + down_read(&kernfs_rwsem); > > > kernfs_refresh_inode(kn, inode); > > > - up_write(&kernfs_rwsem); > > > + up_read(&kernfs_rwsem); > > > + inode_unlock(inode); > > > > > > return generic_permission(inode, mask); > > > } > > > > > > > thanks, > > fox
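[On the size concern raised above: a lock embedded in struct kernfs_node is paid for by every node in the system, while a lock pointer costs one word plus an allocation made only when attributes are actually touched. A rough userspace illustration of the trade-off; the field names are invented and the real struct kernfs_node has many more members.]

```c
#include <pthread.h>
#include <stdio.h>

/* Two hypothetical layouts for a per-node attribute lock. */
struct kn_embedded_lock {
	void		*parent;
	const char	*name;
	unsigned int	 flags;
	pthread_mutex_t	 attr_mutex;	/* paid for by every node */
};

struct kn_lock_pointer {
	void		*parent;
	const char	*name;
	unsigned int	 flags;
	pthread_mutex_t	*attr_mutex;	/* allocated on first use only */
};

static void print_sizes(void)
{
	printf("embedded lock: %zu bytes, lock pointer: %zu bytes\n",
	       sizeof(struct kn_embedded_lock),
	       sizeof(struct kn_lock_pointer));
}
```

On a typical 64-bit glibc system the embedded pthread_mutex_t adds 40 bytes against 8 for the pointer. Scaled to the 100,000+ memory objects mentioned at the start of the thread, even a few tens of bytes per node amounts to megabytes of mostly idle locks, which matches Tejun's objection to growing the kernfs node structure.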
On Thu, 2020-12-17 at 19:09 +0800, Ian Kent wrote: > On Thu, 2020-12-17 at 18:09 +0800, Ian Kent wrote: > > On Thu, 2020-12-17 at 16:54 +0800, Fox Chen wrote: > > > On Thu, Dec 17, 2020 at 12:46 PM Ian Kent <raven@themaw.net> > > > wrote: > > > > On Tue, 2020-12-15 at 20:59 +0800, Ian Kent wrote: > > > > > On Tue, 2020-12-15 at 16:33 +0800, Fox Chen wrote: > > > > > > On Mon, Dec 14, 2020 at 9:30 PM Ian Kent <raven@themaw.net> > > > > > > wrote: > > > > > > > On Mon, 2020-12-14 at 14:14 +0800, Fox Chen wrote: > > > > > > > > On Sun, Dec 13, 2020 at 11:46 AM Ian Kent < > > > > > > > > raven@themaw.net > > > > > > > > wrote: > > > > > > > > > On Fri, 2020-12-11 at 10:17 +0800, Ian Kent wrote: > > > > > > > > > > On Fri, 2020-12-11 at 10:01 +0800, Ian Kent wrote: > > > > > > > > > > > > For the patches, there is a mutex_lock in kn- > > > > > > > > > > > > > attr_mutex, > > > > > > > > > > > > as > > > > > > > > > > > > Tejun > > > > > > > > > > > > mentioned here > > > > > > > > > > > > ( > > > > > > > > > > > > https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@mtj.duckdns.org/ > > > > > > > > > > > > ), > > > > > > > > > > > > maybe a global > > > > > > > > > > > > rwsem for kn->iattr will be better?? > > > > > > > > > > > > > > > > > > > > > > I wasn't sure about that, IIRC a spin lock could > > > > > > > > > > > be > > > > > > > > > > > used > > > > > > > > > > > around > > > > > > > > > > > the > > > > > > > > > > > initial check and checked again at the end which > > > > > > > > > > > would > > > > > > > > > > > probably > > > > > > > > > > > have > > > > > > > > > > > been much faster but much less conservative and a > > > > > > > > > > > bit > > > > > > > > > > > more > > > > > > > > > > > ugly > > > > > > > > > > > so > > > > > > > > > > > I just went the conservative path since there was > > > > > > > > > > > so > > > > > > > > > > > much > > > > > > > > > > > change > > > > > > > > > > > already. 
> > > > > > > > > > > > > > > > > > > > Sorry, I hadn't looked at Tejun's reply yet and TBH > > > > > > > > > > didn't > > > > > > > > > > remember > > > > > > > > > > it. > > > > > > > > > > > > > > > > > > > > Based on what Tejun said it sounds like that needs > > > > > > > > > > work. > > > > > > > > > > > > > > > > > > Those attribute handling patches were meant to allow > > > > > > > > > taking > > > > > > > > > the > > > > > > > > > rw > > > > > > > > > sem read lock instead of the write lock for > > > > > > > > > kernfs_refresh_inode() > > > > > > > > > updates, with the added locking to protect the inode > > > > > > > > > attributes > > > > > > > > > update since it's called from the VFS both with and > > > > > > > > > without > > > > > > > > > the > > > > > > > > > inode lock. > > > > > > > > > > > > > > > > Oh, understood. I was asking also because lock on kn- > > > > > > > > > attr_mutex > > > > > > > > drags > > > > > > > > concurrent performance. > > > > > > > > > > > > > > > > > Looking around it looks like kernfs_iattrs() is > > > > > > > > > called > > > > > > > > > from > > > > > > > > > multiple > > > > > > > > > places without a node database lock at all. > > > > > > > > > > > > > > > > > > I'm thinking that, to keep my proposed change > > > > > > > > > straight > > > > > > > > > forward > > > > > > > > > and on topic, I should just leave > > > > > > > > > kernfs_refresh_inode() > > > > > > > > > taking > > > > > > > > > the node db write lock for now and consider the > > > > > > > > > attributes > > > > > > > > > handling > > > > > > > > > as a separate change. Once that's done we could > > > > > > > > > reconsider > > > > > > > > > what's > > > > > > > > > needed to use the node db read lock in > > > > > > > > > kernfs_refresh_inode(). > > > > > > > > > > > > > > > > You meant taking write lock of kernfs_rwsem for > > > > > > > > kernfs_refresh_inode()?? > > > > > > > > It may be a lot slower in my benchmark, let me test it. 
> > > > > > > > > > > > > > Yes, but make sure the write lock of kernfs_rwsem is > > > > > > > being > > > > > > > taken > > > > > > > not the read lock. > > > > > > > > > > > > > > That's a mistake I had initially? > > > > > > > > > > > > > > Still, that attributes handling is, I think, sufficient > > > > > > > to > > > > > > > warrant > > > > > > > a separate change since it looks like it might need work, > > > > > > > the > > > > > > > kernfs > > > > > > > node db probably should be kept stable for those > > > > > > > attribute > > > > > > > updates > > > > > > > but equally the existence of an instantiated dentry might > > > > > > > mitigate > > > > > > > the it. > > > > > > > > > > > > > > Some people might just know whether it's ok or not but I > > > > > > > would > > > > > > > like > > > > > > > to check the callers to work out what's going on. > > > > > > > > > > > > > > In any case it's academic if GCH isn't willing to > > > > > > > consider > > > > > > > the > > > > > > > series > > > > > > > for review and possible merge. > > > > > > > > > > > > > Hi Ian > > > > > > > > > > > > I removed kn->attr_mutex and changed read lock to write > > > > > > lock > > > > > > for > > > > > > kernfs_refresh_inode > > > > > > > > > > > > down_write(&kernfs_rwsem); > > > > > > kernfs_refresh_inode(kn, inode); > > > > > > up_write(&kernfs_rwsem); > > > > > > > > > > > > > > > > > > Unfortunate, changes in this way make things worse, my > > > > > > benchmark > > > > > > runs > > > > > > 100% slower than upstream sysfs. :( > > > > > > open+read+close a sysfs file concurrently took 1000us. > > > > > > (Currently, > > > > > > sysfs with a big mutex kernfs_mutex only takes ~500us > > > > > > for one open+read+close operation concurrently) > > > > > > > > > > Right, so it does need attention nowish. 
> > > > > > > > > > I'll have a look at it in a while, I really need to get a new > > > > > autofs > > > > > release out, and there are quite a few changes, and testing > > > > > is > > > > > seeing > > > > > a number of errors, some old, some newly introduced. It's > > > > > proving > > > > > difficult. > > > > > > > > I've taken a breather for the autofs testing and had a look at > > > > this. > > > > > > Thanks. :) > > > > > > > I think my original analysis of this was wrong. > > > > > > > > Could you try this patch please. > > > > I'm not sure how much difference it will make but, in > > > > principle, > > > > it's much the same as the previous approach except it doesn't > > > > increase the kernfs node struct size or mess with the other > > > > attribute handling code. > > > > > > > > Note, this is not even compile tested. > > > > > > I failed to apply this patch. So based on the original six > > > patches, > > > I > > > manually removed kn->attr_mutex, and added > > > inode_lock/inode_unlock to those two functions, they were like: > > > > > > int kernfs_iop_getattr(const struct path *path, struct kstat > > > *stat, > > > u32 request_mask, unsigned int > > > query_flags) > > > { > > > struct inode *inode = d_inode(path->dentry); > > > struct kernfs_node *kn = inode->i_private; > > > > > > inode_lock(inode); > > > down_read(&kernfs_rwsem); > > > kernfs_refresh_inode(kn, inode); > > > up_read(&kernfs_rwsem); > > > inode_unlock(inode); > > > > > > generic_fillattr(inode, stat); > > > return 0; > > > } > > > > > > int kernfs_iop_permission(struct inode *inode, int mask) > > > { > > > struct kernfs_node *kn; > > > > > > if (mask & MAY_NOT_BLOCK) > > > return -ECHILD; > > > > > > kn = inode->i_private; > > > > > > inode_lock(inode); > > > down_read(&kernfs_rwsem); > > > kernfs_refresh_inode(kn, inode); > > > up_read(&kernfs_rwsem); > > > inode_unlock(inode); > > > > > > return generic_permission(inode, mask); > > > } > > > > > > But I couldn't boot the kernel and 
there was no error on the > > > screen. > > > I guess it was deadlocked on /sys creation?? :D > > > > Right, I guess the locking documentation is out of date. I'm > > guessing > > the inode lock is taken somewhere over the .permission() call. If > > that > > usage is consistent it's easily fixed; if the usage is inconsistent > > it's > > hard to deal with and amounts to a bug. > > Yes, it is called with the inode lock held shared on open, held > exclusive on open with O_CREAT, and not held at all at the start > of path resolution. > > That can't really be called a VFS bug since .permission() is > meant to check permissions, not update the inode. > > This is probably what led to the attr patches I had. > > If a suitable place to put a local per-object lock can't be > found for this, other than in the kernfs_node, then it's a > real problem from a contention POV. > > What could be done is to make the kernfs node attr_mutex > a pointer and dynamically allocate it, but even that is too > costly a size addition to the kernfs node structure as > Tejun has said. I guess the question to ask is: is there really a need to call kernfs_refresh_inode() from functions that are usually reading/checking functions? Would it be sufficient to refresh the inode in the write/set operations, in the (if any) places where something like setattr_copy() is not already called? Perhaps GKH or Tejun could comment on this? Ian > > Those patches I referred to clearly aren't finished because > the eighth one is empty, which followed a patch I have titled > "kernfs: make attr_mutex a local kernfs node lock". > > I obviously gave up on it when the series was rejected. > But I'll give it some more thought. > > Ian > > > I'll have another look at it. > > > > Also, it sounds like I'm working from a more recent series. > > > > I had 8 patches, dropped the last three and added the one I posted. > > If I can work out what's going on I'll post the series for you to > > check. 
> > > > Ian > > > > > > kernfs: use kernfs read lock in .getattr() and .permission() > > > > > > > > From: Ian Kent <raven@themaw.net> > > > > > > > > From Documenation/filesystems.rst and (slightly outdated) > > > > comments > > > > in fs/attr.c the inode i_rwsem is used for attribute handling. > > > > > > > > This lock satisfies the requirememnts needed to reduce lock > > > > contention, > > > > namely a per-object lock needs to be used rather than a file > > > > system > > > > global lock with the kernfs node db held stable for read > > > > operations. > > > > > > > > In particular it should reduce lock contention seen when > > > > calling > > > > the > > > > kernfs .permission() method. > > > > > > > > The inode methods .getattr() and .permission() do not hold the > > > > inode > > > > i_rwsem lock when called as they are usually read operations. > > > > Also > > > > the .permission() method checks for rcu-walk mode and returns > > > > -ECHILD > > > > to the VFS if it is set. So the i_rwsem lock can be used in > > > > kernfs_iop_getattr() and kernfs_iop_permission() to protect the > > > > inode > > > > update done by kernfs_refresh_inode(). Using this lock allows > > > > the > > > > kernfs node db write lock in these functions to be changed to a > > > > read > > > > lock. 
> > > > > > > > Signed-off-by: Ian Kent <raven@themaw.net> > > > > --- > > > > fs/kernfs/inode.c | 12 ++++++++---- > > > > 1 file changed, 8 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c > > > > index ddaf18198935..568037e9efe9 100644 > > > > --- a/fs/kernfs/inode.c > > > > +++ b/fs/kernfs/inode.c > > > > @@ -189,9 +189,11 @@ int kernfs_iop_getattr(const struct path > > > > *path, struct kstat *stat, > > > > struct inode *inode = d_inode(path->dentry); > > > > struct kernfs_node *kn = inode->i_private; > > > > > > > > - down_write(&kernfs_rwsem); > > > > + inode_lock(inode); > > > > + down_read(&kernfs_rwsem); > > > > kernfs_refresh_inode(kn, inode); > > > > - up_write(&kernfs_rwsem); > > > > + up_read(&kernfs_rwsem); > > > > + inode_unlock(inode); > > > > > > > > generic_fillattr(inode, stat); > > > > return 0; > > > > @@ -281,9 +283,11 @@ int kernfs_iop_permission(struct inode > > > > *inode, > > > > int mask) > > > > > > > > kn = inode->i_private; > > > > > > > > - down_write(&kernfs_rwsem); > > > > + inode_lock(inode); > > > > + down_read(&kernfs_rwsem); > > > > kernfs_refresh_inode(kn, inode); > > > > - up_write(&kernfs_rwsem); > > > > + up_read(&kernfs_rwsem); > > > > + inode_unlock(inode); > > > > > > > > return generic_permission(inode, mask); > > > > } > > > > > > > > > > thanks, > > > fox
Hello,

On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > What could be done is to make the kernfs node attr_mutex
> > a pointer and dynamically allocate it but even that is too
> > costly a size addition to the kernfs node structure as
> > Tejun has said.
> 
> I guess the question to ask is, is there really a need to
> call kernfs_refresh_inode() from functions that are usually
> reading/checking functions.
> 
> Would it be sufficient to refresh the inode in the write/set
> operations in (if there's any) places where things like
> setattr_copy() is not already called?
> 
> Perhaps GKH or Tejun could comment on this?

My memory is a bit hazy but invalidations on reads is how the sysfs
namespace is implemented, so I don't think there's an easy way around
that. The only thing I can think of is embedding the lock into attrs
and doing an xchg dance when attaching it.

Thanks.
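For readers unfamiliar with the "xchg dance" Tejun mentions, the idea can be sketched in userspace with a C11 compare-exchange standing in for the kernel's cmpxchg(). The struct layout and names below are purely illustrative assumptions, not actual kernfs code:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct iattrs { long ia_uid; };			/* illustrative fields only */
struct node { _Atomic(struct iattrs *) attrs; };	/* stand-in for kernfs_node */

/* Attach attrs on first use without a global lock: allocate, then
 * publish with a compare-exchange.  A loser of the race frees its
 * copy and uses the winner's pointer instead. */
static struct iattrs *node_attach_attrs(struct node *kn)
{
	struct iattrs *attrs = atomic_load(&kn->attrs);
	struct iattrs *expected = NULL;

	if (attrs)
		return attrs;

	attrs = calloc(1, sizeof(*attrs));
	if (!attrs)
		return NULL;

	if (!atomic_compare_exchange_strong(&kn->attrs, &expected, attrs)) {
		free(attrs);		/* someone beat us to it */
		attrs = expected;	/* on failure, expected holds the winner */
	}
	return attrs;
}
```

The point of the pattern is that no lock is needed for the attach itself, only one atomic publish, which is why it avoids growing the kernfs node with a lock field.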
On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote:
> Hello,
> 
> On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > > What could be done is to make the kernfs node attr_mutex
> > > a pointer and dynamically allocate it but even that is too
> > > costly a size addition to the kernfs node structure as
> > > Tejun has said.
> > 
> > I guess the question to ask is, is there really a need to
> > call kernfs_refresh_inode() from functions that are usually
> > reading/checking functions.
> > 
> > Would it be sufficient to refresh the inode in the write/set
> > operations in (if there's any) places where things like
> > setattr_copy() is not already called?
> > 
> > Perhaps GKH or Tejun could comment on this?
> 
> My memory is a bit hazy but invalidations on reads is how sysfs
> namespace is
> implemented, so I don't think there's an easy around that. The only
> thing I
> can think of is embedding the lock into attrs and doing xchg dance
> when
> attaching it.

Sounds like you're saying it would be ok to add a lock to the
attrs structure, am I correct?

Assuming it is then, to keep things simple, use two locks.

One global lock for the allocation and an attrs lock for all the
attrs field updates including the kernfs_refresh_inode() update.

The critical section for the global lock could be reduced and it
could be changed to a spin lock.

In __kernfs_iattrs() we would have something like:

  take the allocation lock
  do the allocated checks
    assign if existing attrs
    release the allocation lock
    return existing if found
  otherwise
    release the allocation lock

  allocate and initialize attrs

  take the allocation lock
  check if someone beat us to it
    free ours and grab the existing attrs
  otherwise
    assign the new attrs
  release the allocation lock
  return attrs

Add a spinlock to the attrs struct and use it everywhere for
field updates.

Am I on the right track or can you see problems with this?

Ian
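The two-lock scheme sketched in the steps above can be illustrated with a compilable userspace fragment. An atomic-flag spinlock stands in for the proposed global allocation lock, and every name here is made up for illustration (this is a sketch of the double-checked allocation protocol, not kernfs code; the per-attrs field-update spinlock is left out):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for the proposed global allocation spinlock. */
static atomic_flag alloc_lock = ATOMIC_FLAG_INIT;

static void alloc_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&alloc_lock))
		;	/* spin */
}

static void alloc_lock_release(void)
{
	atomic_flag_clear(&alloc_lock);
}

struct iattrs { long ia_uid; };		/* illustrative fields only */
struct node { struct iattrs *attrs; };	/* stand-in for kernfs_node */

/* Double-checked allocation with a short global critical section,
 * following the steps listed in the email above. */
static struct iattrs *node_get_attrs(struct node *kn)
{
	struct iattrs *attrs, *new;

	alloc_lock_acquire();
	attrs = kn->attrs;
	alloc_lock_release();
	if (attrs)
		return attrs;	/* existing attrs found */

	/* Allocate outside the lock so the critical section stays small. */
	new = calloc(1, sizeof(*new));
	if (!new)
		return NULL;

	alloc_lock_acquire();
	if (kn->attrs) {		/* someone beat us to it */
		free(new);
		attrs = kn->attrs;
	} else {
		kn->attrs = attrs = new;
	}
	alloc_lock_release();
	return attrs;
}
```

The design point is that the global lock is only held for a pointer test and assignment, never across the allocation, so contention on it should stay low even with many concurrent first-time accesses.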
On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote: > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > Hello, > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > What could be done is to make the kernfs node attr_mutex > > > > a pointer and dynamically allocate it but even that is too > > > > costly a size addition to the kernfs node structure as > > > > Tejun has said. > > > > > > I guess the question to ask is, is there really a need to > > > call kernfs_refresh_inode() from functions that are usually > > > reading/checking functions. > > > > > > Would it be sufficient to refresh the inode in the write/set > > > operations in (if there's any) places where things like > > > setattr_copy() is not already called? > > > > > > Perhaps GKH or Tejun could comment on this? > > > > My memory is a bit hazy but invalidations on reads is how sysfs > > namespace is > > implemented, so I don't think there's an easy around that. The only > > thing I > > can think of is embedding the lock into attrs and doing xchg dance > > when > > attaching it. > > Sounds like your saying it would be ok to add a lock to the > attrs structure, am I correct? > > Assuming it is then, to keep things simple, use two locks. > > One global lock for the allocation and an attrs lock for all the > attrs field updates including the kernfs_refresh_inode() update. > > The critical section for the global lock could be reduced and it > changed to a spin lock. 
> > In __kernfs_iattrs() we would have something like: > > take the allocation lock > do the allocated checks > assign if existing attrs > release the allocation lock > return existing if found > othewise > release the allocation lock > > allocate and initialize attrs > > take the allocation lock > check if someone beat us to it > free and grab exiting attrs > otherwise > assign the new attrs > release the allocation lock > return attrs > > Add a spinlock to the attrs struct and use it everywhere for > field updates. > > Am I on the right track or can you see problems with this? > > Ian > umm, we update the inode in kernfs_refresh_inode, right?? So I guess the problem is how can we protect the inode when kernfs_refresh_inode is called, not the attrs?? thanks, fox
On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote: > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > Hello, > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > What could be done is to make the kernfs node attr_mutex > > > > > a pointer and dynamically allocate it but even that is too > > > > > costly a size addition to the kernfs node structure as > > > > > Tejun has said. > > > > > > > > I guess the question to ask is, is there really a need to > > > > call kernfs_refresh_inode() from functions that are usually > > > > reading/checking functions. > > > > > > > > Would it be sufficient to refresh the inode in the write/set > > > > operations in (if there's any) places where things like > > > > setattr_copy() is not already called? > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > My memory is a bit hazy but invalidations on reads is how sysfs > > > namespace is > > > implemented, so I don't think there's an easy around that. The > > > only > > > thing I > > > can think of is embedding the lock into attrs and doing xchg > > > dance > > > when > > > attaching it. > > > > Sounds like your saying it would be ok to add a lock to the > > attrs structure, am I correct? > > > > Assuming it is then, to keep things simple, use two locks. > > > > One global lock for the allocation and an attrs lock for all the > > attrs field updates including the kernfs_refresh_inode() update. > > > > The critical section for the global lock could be reduced and it > > changed to a spin lock. 
> > > > In __kernfs_iattrs() we would have something like: > > > > take the allocation lock > > do the allocated checks > > assign if existing attrs > > release the allocation lock > > return existing if found > > othewise > > release the allocation lock > > > > allocate and initialize attrs > > > > take the allocation lock > > check if someone beat us to it > > free and grab exiting attrs > > otherwise > > assign the new attrs > > release the allocation lock > > return attrs > > > > Add a spinlock to the attrs struct and use it everywhere for > > field updates. > > > > Am I on the right track or can you see problems with this? > > > > Ian > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I guess > the problem is how can we protect the inode when kernfs_refresh_inode > is called, not the attrs?? But the attrs (which is what's copied from) were protected by the mutex lock (IIUC) so dealing with the inode attributes implies dealing with the kernfs node attrs too. For example in kernfs_iop_setattr() the call to setattr_copy() copies the node attrs to the inode under the same mutex lock. So, if a read lock is used the copy in kernfs_refresh_inode() is no longer protected, it needs to be protected in a different way. Ian
On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> wrote: > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > a pointer and dynamically allocate it but even that is too > > > > > > costly a size addition to the kernfs node structure as > > > > > > Tejun has said. > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > reading/checking functions. > > > > > > > > > > Would it be sufficient to refresh the inode in the write/set > > > > > operations in (if there's any) places where things like > > > > > setattr_copy() is not already called? > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > My memory is a bit hazy but invalidations on reads is how sysfs > > > > namespace is > > > > implemented, so I don't think there's an easy around that. The > > > > only > > > > thing I > > > > can think of is embedding the lock into attrs and doing xchg > > > > dance > > > > when > > > > attaching it. > > > > > > Sounds like your saying it would be ok to add a lock to the > > > attrs structure, am I correct? > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > One global lock for the allocation and an attrs lock for all the > > > attrs field updates including the kernfs_refresh_inode() update. > > > > > > The critical section for the global lock could be reduced and it > > > changed to a spin lock. 
> > > > > > In __kernfs_iattrs() we would have something like: > > > > > > take the allocation lock > > > do the allocated checks > > > assign if existing attrs > > > release the allocation lock > > > return existing if found > > > othewise > > > release the allocation lock > > > > > > allocate and initialize attrs > > > > > > take the allocation lock > > > check if someone beat us to it > > > free and grab exiting attrs > > > otherwise > > > assign the new attrs > > > release the allocation lock > > > return attrs > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > field updates. > > > > > > Am I on the right track or can you see problems with this? > > > > > > Ian > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I guess > > the problem is how can we protect the inode when kernfs_refresh_inode > > is called, not the attrs?? > > But the attrs (which is what's copied from) were protected by the > mutex lock (IIUC) so dealing with the inode attributes implies > dealing with the kernfs node attrs too. > > For example in kernfs_iop_setattr() the call to setattr_copy() copies > the node attrs to the inode under the same mutex lock. So, if a read > lock is used the copy in kernfs_refresh_inode() is no longer protected, > it needs to be protected in a different way. > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for .setattr but no lock for .getattr (misdocumented?? sometimes they have as you've found out)? What does it protect against?? Because .permission does a similar thing here -- updating inode attributes, the goal is to provide the same protection level for .permission as for .setattr, am I right??? thanks, fox
Hello, On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote: > Sounds like your saying it would be ok to add a lock to the > attrs structure, am I correct? Yeah, adding a lock to attrs is a lot less of a problem and it looks like it's gonna have to be either that or hashed locks, which might actually make sense if we're worried about the size of attrs (I don't think we need to). Thanks.
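The hashed-locks alternative Tejun mentions would mean a fixed pool of locks shared by all attrs objects, indexed by a hash of the object address, so nothing at all is added to the attrs size. A rough userspace sketch of the idea follows; the lock type, pool size, shift, and hash constant are all illustrative assumptions:

```c
#include <stdint.h>

#define ATTR_LOCK_BITS  7
#define ATTR_LOCK_COUNT (1u << ATTR_LOCK_BITS)

/* Stand-in lock type; in the kernel this would be a spinlock_t or
 * mutex in a static array. */
struct attr_lock { int token; };

static struct attr_lock attr_locks[ATTR_LOCK_COUNT];

/* Map an object address to one of the pooled locks.  The low bits of
 * a heap/slab address are mostly constant, so shift them away and mix
 * the rest with a multiplicative (Fibonacci-style) hash. */
static struct attr_lock *attr_lock_for(const void *obj)
{
	uintptr_t v = (uintptr_t)obj >> 4;

	return &attr_locks[(uint32_t)(v * 2654435761u) >> (32 - ATTR_LOCK_BITS)];
}
```

The trade-off is occasional false sharing when two unrelated objects hash to the same slot, in exchange for zero per-object lock storage, which is exactly the size concern raised about the kernfs node structure.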
On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> > > > wrote: > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > > a pointer and dynamically allocate it but even that is > > > > > > > too > > > > > > > costly a size addition to the kernfs node structure as > > > > > > > Tejun has said. > > > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > > reading/checking functions. > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > write/set > > > > > > operations in (if there's any) places where things like > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > sysfs > > > > > namespace is > > > > > implemented, so I don't think there's an easy around that. > > > > > The > > > > > only > > > > > thing I > > > > > can think of is embedding the lock into attrs and doing xchg > > > > > dance > > > > > when > > > > > attaching it. > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > attrs structure, am I correct? > > > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > > > One global lock for the allocation and an attrs lock for all > > > > the > > > > attrs field updates including the kernfs_refresh_inode() > > > > update. > > > > > > > > The critical section for the global lock could be reduced and > > > > it > > > > changed to a spin lock. 
> > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > take the allocation lock > > > > do the allocated checks > > > > assign if existing attrs > > > > release the allocation lock > > > > return existing if found > > > > othewise > > > > release the allocation lock > > > > > > > > allocate and initialize attrs > > > > > > > > take the allocation lock > > > > check if someone beat us to it > > > > free and grab exiting attrs > > > > otherwise > > > > assign the new attrs > > > > release the allocation lock > > > > return attrs > > > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > > field updates. > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > Ian > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I > > > guess > > > the problem is how can we protect the inode when > > > kernfs_refresh_inode > > > is called, not the attrs?? > > > > But the attrs (which is what's copied from) were protected by the > > mutex lock (IIUC) so dealing with the inode attributes implies > > dealing with the kernfs node attrs too. > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > copies > > the node attrs to the inode under the same mutex lock. So, if a > > read > > lock is used the copy in kernfs_refresh_inode() is no longer > > protected, > > it needs to be protected in a different way. > > > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for > .setattr but > no lock for .getattr (misdocumented?? sometimes they have as you've > found out)? > What does it protect against?? Because .permission does a similar > thing > here -- updating inode attributes, the goal is to provide the same > protection level > for .permission as for .setattr, am I right??? As far as the documentation goes that's probably my misunderstanding of it. It does happen that the VFS makes assumptions about how call backs are meant to be used. 
Read-like callbacks, such as .getattr() and .permission(), are meant
to behave as read operations, so the VFS is free to take locks, or
not, based on the operation context at hand.

So it's not about the locking for these callbacks per se, it's about
the context in which they are called.

For example, in link_path_walk(), at the beginning of the component
lookup loop (essentially for the containing directory at that point),
may_lookup() is called which leads to a call to .permission() without
any inode lock held at that point.

But file opens (possibly following a path walk to resolve a path)
are different.

For example, do_filp_open() calls path_openat() which leads to a
call to open_last_lookups(), which leads to a call to .permission()
along the way. And in this case there are two contexts, an open()
with create and one without, the former needing the exclusive inode
lock and the latter able to use the shared lock.

So it's about the locking needed for the encompassing operation being
done, not about those functions specifically.

TBH the VFS is very complex and Al has a much, much better
understanding of it than I do, so he would need to be the one to
answer whether it's the file system's responsibility to use these
calls in the way the VFS expects.

My belief is that if a file system needs to use a callback in a way
that conflicts with what the VFS expects, it's the file system's
responsibility to deal with the side effects.

Ian
On Fri, 2020-12-18 at 09:59 -0500, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> > Sounds like your saying it would be ok to add a lock to the
> > attrs structure, am I correct?
> 
> Yeah, adding a lock to attrs is a lot less of a problem and it looks
> like
> it's gonna have to be either that or hashed locks, which might
> actually make
> sense if we're worried about the size of attrs (I don't think we need
> to).

Maybe that isn't needed.

And looking further I see there's a race that kernfs can't do anything
about between kernfs_refresh_inode() and fs/inode.c:update_time().

kernfs could avoid fighting with the VFS to keep the attributes set
to those of the kernfs node by using the inode operation
.update_time() and, if it makes sense, the kernfs node attributes
that it wants to be updated on file system activity could also be
updated here.

I can't find any reason why this shouldn't be done but kernfs is
fairly widely used in other kernel subsystems so what does everyone
think of this patch, updated to set kernfs node attributes that
should be updated of course, see comment in the patch?

kernfs: fix attributes update race

From: Ian Kent <raven@themaw.net>

kernfs uses kernfs_refresh_inode() (called from kernfs_iop_getattr()
and kernfs_iop_permission()) to keep the inode attributes set to the
attributes of the kernfs node.

But there is no way for kernfs to prevent racing with the function
fs/inode.c:update_time().

The better choice is to use the inode operation .update_time() and
just let the VFS use the generic functions for .getattr() and
.permission().
Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c           |   37 ++++++++++++++-----------------------
 fs/kernfs/kernfs-internal.h |    4 +---
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fc2469a20fed..51780329590c 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -24,9 +24,8 @@ static const struct address_space_operations kernfs_aops = {
 };
 
 static const struct inode_operations kernfs_iops = {
-	.permission	= kernfs_iop_permission,
+	.update_time	= kernfs_update_time,
 	.setattr	= kernfs_iop_setattr,
-	.getattr	= kernfs_iop_getattr,
 	.listxattr	= kernfs_iop_listxattr,
 };
 
@@ -183,18 +182,26 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
 		set_nlink(inode, kn->dir.subdirs + 2);
 }
 
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-		       u32 request_mask, unsigned int query_flags)
+int kernfs_update_time(struct inode *inode, struct timespec64 *time, int flags)
 {
-	struct inode *inode = d_inode(path->dentry);
 	struct kernfs_node *kn = inode->i_private;
+	struct kernfs_iattrs *attrs;
 
 	mutex_lock(&kernfs_mutex);
+	attrs = kernfs_iattrs(kn);
+	if (!attrs) {
+		mutex_unlock(&kernfs_mutex);
+		return -ENOMEM;
+	}
+
+	/* Which kernfs node attributes should be updated from
+	 * time?
+	 */
+
 	kernfs_refresh_inode(kn, inode);
 	mutex_unlock(&kernfs_mutex);
 
-	generic_fillattr(inode, stat);
-	return 0;
+	return 0;
 }
 
 static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
@@ -272,22 +279,6 @@ void kernfs_evict_inode(struct inode *inode)
 	kernfs_put(kn);
 }
 
-int kernfs_iop_permission(struct inode *inode, int mask)
-{
-	struct kernfs_node *kn;
-
-	if (mask & MAY_NOT_BLOCK)
-		return -ECHILD;
-
-	kn = inode->i_private;
-
-	mutex_lock(&kernfs_mutex);
-	kernfs_refresh_inode(kn, inode);
-	mutex_unlock(&kernfs_mutex);
-
-	return generic_permission(inode, mask);
-}
-
 int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
 		     void *value, size_t size)
 {
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 7ee97ef59184..98d08b928f93 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -89,10 +89,8 @@ extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
  */
 extern const struct xattr_handler *kernfs_xattr_handlers[];
 void kernfs_evict_inode(struct inode *inode);
-int kernfs_iop_permission(struct inode *inode, int mask);
+int kernfs_update_time(struct inode *inode, struct timespec64 *time, int flags);
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr);
-int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
-		       u32 request_mask, unsigned int query_flags);
 ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
 int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> wrote: > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> > > > > wrote: > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > > Hello, > > > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > > > a pointer and dynamically allocate it but even that is > > > > > > > > too > > > > > > > > costly a size addition to the kernfs node structure as > > > > > > > > Tejun has said. > > > > > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > > > reading/checking functions. > > > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > > write/set > > > > > > > operations in (if there's any) places where things like > > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > > sysfs > > > > > > namespace is > > > > > > implemented, so I don't think there's an easy around that. > > > > > > The > > > > > > only > > > > > > thing I > > > > > > can think of is embedding the lock into attrs and doing xchg > > > > > > dance > > > > > > when > > > > > > attaching it. > > > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > > attrs structure, am I correct? > > > > > > > > > > Assuming it is then, to keep things simple, use two locks. 
> > > > > > > > > > One global lock for the allocation and an attrs lock for all > > > > > the > > > > > attrs field updates including the kernfs_refresh_inode() > > > > > update. > > > > > > > > > > The critical section for the global lock could be reduced and > > > > > it > > > > > changed to a spin lock. > > > > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > > > take the allocation lock > > > > > do the allocated checks > > > > > assign if existing attrs > > > > > release the allocation lock > > > > > return existing if found > > > > > othewise > > > > > release the allocation lock > > > > > > > > > > allocate and initialize attrs > > > > > > > > > > take the allocation lock > > > > > check if someone beat us to it > > > > > free and grab exiting attrs > > > > > otherwise > > > > > assign the new attrs > > > > > release the allocation lock > > > > > return attrs > > > > > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > > > field updates. > > > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > > > Ian > > > > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I > > > > guess > > > > the problem is how can we protect the inode when > > > > kernfs_refresh_inode > > > > is called, not the attrs?? > > > > > > But the attrs (which is what's copied from) were protected by the > > > mutex lock (IIUC) so dealing with the inode attributes implies > > > dealing with the kernfs node attrs too. > > > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > > copies > > > the node attrs to the inode under the same mutex lock. So, if a > > > read > > > lock is used the copy in kernfs_refresh_inode() is no longer > > > protected, > > > it needs to be protected in a different way. > > > > > > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for > > .setattr but > > no lock for .getattr (misdocumented?? 
sometimes they have as you've > > found out)? > > What does it protect against?? Because .permission does a similar > > thing > > here -- updating inode attributes, the goal is to provide the same > > protection level > > for .permission as for .setattr, am I right??? > > As far as the documentation goes that's probably my misunderstanding > of it. > > It does happen that the VFS makes assumptions about how call backs > are meant to be used. > > Read like call backs, like .getattr() and .permission() are meant to > be used, well, like read like functions so the VFS should be ok to > take locks or not based on the operation context at hand. > > So it's not about the locking for these call backs per se, it's about > the context in which they are called. > > For example, in link_path_walk(), at the beginning of the component > lookup loop (essentially for the containing directory at that point), > may_lookup() is called which leads to a call to .permission() without > any inode lock held at that point. > > But file opens (possibly following a path walk to resolve a path) > are different. > > For example, do_filp_open() calls path_openat() which leads to a > call to open_last_lookups(), which leads to a call to .permission() > along the way. And in this case there are two contexts, an open() > create or one without create, the former needing the exclusive inode > lock and the later able to use the shared lock. > > So it's about the locking needed for the encompassing operation that > is being done not about those functions specifically. > > TBH the VFS is very complex and Al has a much, much better > understanding of it than I do so he would need to be the one to answer > whether it's the file systems responsibility to use these calls in the > way the VFS expects. > > My belief is that if a file system needs to use a call back in a way > that's in conflict with what the VFS expects it's the file systems' > responsibility to deal with the side effects. 
Thanks for clarifying.

> Ian

Yeah, it's complex and confusing and it's very hard to spot lock
context by reading VFS code.

I put code like this:

	if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
		if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
			pr_warn("kernfs iop_permission inode WRITE lock is held");
		} else if (lockdep_is_held_type(&inode->i_rwsem, 1)) {
			pr_warn("kernfs iop_permission inode READ lock is held");
		}
	} else {
		pr_warn("kernfs iop_permission inode lock is NOT held");
	}

in both .permission & .getattr. Then I did some open/read/write/close
on /sys; interestingly, all the log output suggests they are in the
WRITE lock context.

And I put dump_stack() in the lock-is-held branch; it prints a lot of
the following context:

[ 481.795445] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.0-dirty #25
[ 481.795446] Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 15.1.5 (47309) 09/26/2020
[ 481.795446] Call Trace:
[ 481.795448]  dump_stack (lib/dump_stack.c:120)
[ 481.795450]  kernfs_iop_permission (fs/kernfs/inode.c:295)
[ 481.795452]  inode_permission.part.0 (fs/namei.c:398 fs/namei.c:463)
[ 481.795454]  may_open (fs/namei.c:2875)
[ 481.795456]  path_openat (fs/namei.c:3250 fs/namei.c:3369)
[ 481.795458]  ? ___sys_sendmsg (net/socket.c:2411)
[ 481.795460]  ? preempt_count_add (./include/linux/ftrace.h:821 kernel/sched/core.c:4166 kernel/sched/core.c:4163 kernel/sched/core.c:4191)
[ 481.795461]  ? sock_poll (net/socket.c:1265)
[ 481.795463]  do_filp_open (fs/namei.c:3396)
[ 481.795466]  ? __check_object_size (mm/usercopy.c:240 mm/usercopy.c:286 mm/usercopy.c:256)
[ 481.795469]  ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:102 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 481.795470]  do_sys_openat2 (fs/open.c:1168)
[ 481.795472]  __x64_sys_openat (fs/open.c:1195)
[ 481.795473]  do_syscall_64 (arch/x86/entry/common.c:46)
[ 481.795475]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127)
[ 481.795476] RIP: 0033:0x7f6b31d69c94

Surprisingly, I didn't see the lock-holding code along the path.

thanks,
fox
Hello, On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote: > And looking further I see there's a race that kernfs can't do anything > about between kernfs_refresh_inode() and fs/inode.c:update_times(). Do kernfs files end up calling into that path tho? Doesn't look like it to me but if so yeah we'd need to override the update_time for kernfs. > +static int kernfs_iop_update_time(struct inode *inode, struct timespec64 *time, int flags) > { > - struct inode *inode = d_inode(path->dentry); > struct kernfs_node *kn = inode->i_private; > + struct kernfs_iattrs *attrs; > > mutex_lock(&kernfs_mutex); > + attrs = kernfs_iattrs(kn); > + if (!attrs) { > + mutex_unlock(&kernfs_mutex); > + return -ENOMEM; > + } > + > + /* Which kernfs node attributes should be updated from > + * time? > + */ > + > kernfs_refresh_inode(kn, inode); > mutex_unlock(&kernfs_mutex); I don't see how this would reflect the changes from kernfs_setattr() into the attached inode. This would actually make the attr updates obviously racy - the userland visible attrs would be stale until the inode gets reclaimed and then when it gets reinstantiated it'd show the latest information. That said, if you wanna take the direction where attr updates are reflected to the associated inode when the change occurs, which makes sense, the right thing to do would be making kernfs_setattr() update the associated inode if existent. Thanks.
On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote: > Hello, > > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote: > > And looking further I see there's a race that kernfs can't do > > anything > > about between kernfs_refresh_inode() and fs/inode.c:update_times(). > > Do kernfs files end up calling into that path tho? Doesn't look like > it to > me but if so yeah we'd need to override the update_time for kernfs. Sorry, the below was very hastily done and not what I would actually propose. The main point of it was the question + /* Which kernfs node attributes should be updated from + * time? + */ but looking at it again this morning I think the node iattr fields that might need to be updated would be atime, ctime and mtime only, maybe not ctime ... not sure. What do you think? Also, if kn->attr == NULL it should fall back to what the VFS currently does. The update_times() function is one of the few places where the VFS updates the inode times. The idea is that the reason kernfs needs to overwrite the inode attributes is to reset what the VFS might have done but if kernfs has this inode operation they won't need to be overwritten since they won't have changed. There may be other places where the attributes (or an attribute) are set by the VFS, I haven't finished checking that yet so my suggestion might not be entirely valid. What I need to do is work out what kernfs node attributes, if any, should be updated by .update_times(). If I go by what kernfs_refresh_inode() does now then that would be none but shouldn't atime at least be updated in the node iattr. 
> > +static int kernfs_iop_update_time(struct inode *inode, struct > > timespec64 *time, int flags) > > { > > - struct inode *inode = d_inode(path->dentry); > > struct kernfs_node *kn = inode->i_private; > > + struct kernfs_iattrs *attrs; > > > > mutex_lock(&kernfs_mutex); > > + attrs = kernfs_iattrs(kn); > > + if (!attrs) { > > + mutex_unlock(&kernfs_mutex); > > + return -ENOMEM; > > + } > > + > > + /* Which kernfs node attributes should be updated from > > + * time? > > + */ > > + > > kernfs_refresh_inode(kn, inode); > > mutex_unlock(&kernfs_mutex); > > I don't see how this would reflect the changes from kernfs_setattr() > into > the attached inode. This would actually make the attr updates > obviously racy > - the userland visible attrs would be stale until the inode gets > reclaimed > and then when it gets reinstantiated it'd show the latest > information. Right, I will have to think about that, but as I say above this isn't really what I would propose. If .update_times() sticks strictly to what kernfs_refresh_inode() does now then it would set the inode attributes from the node iattr only. > > That said, if you wanna take the direction where attr updates are > reflected > to the associated inode when the change occurs, which makes sense, > the right > thing to do would be making kernfs_setattr() update the associated > inode if > existent. Mmm, that's a good point but it looks like the inode isn't available there. Ian
On Sun, 2020-12-20 at 07:52 +0800, Ian Kent wrote:
> On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote:
> > Hello,
> >
> > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> > > And looking further I see there's a race that kernfs can't do
> > > anything about between kernfs_refresh_inode() and
> > > fs/inode.c:update_times().
> >
> > Do kernfs files end up calling into that path tho? Doesn't look
> > like it to me but if so yeah we'd need to override the update_time
> > for kernfs.

You are correct, update_time() will only be called during symlink
following, and only to update atime.

So this isn't sufficient to update the inode attributes to reflect
changes made by things like kernfs_setattr() or when the directory
link count changes ...

Sigh!
On Sun, Dec 20, 2020 at 7:52 AM Ian Kent <raven@themaw.net> wrote: > > On Sat, 2020-12-19 at 11:23 -0500, Tejun Heo wrote: > > Hello, > > > > On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote: > > > And looking further I see there's a race that kernfs can't do > > > anything > > > about between kernfs_refresh_inode() and fs/inode.c:update_times(). > > > > Do kernfs files end up calling into that path tho? Doesn't look like > > it to > > me but if so yeah we'd need to override the update_time for kernfs. > > Sorry, the below was very hastily done and not what I would actually > propose. > > The main point of it was the question > > + /* Which kernfs node attributes should be updated from > + * time? > + */ > > but looking at it again this morning I think the node iattr fields > that might need to be updated would be atime, ctime and mtime only, > maybe not ctime ... not sure. > > What do you think? > > Also, if kn->attr == NULL it should fall back to what the VFS > currently does. > > The update_times() function is one of the few places where the > VFS updates the inode times. > > The idea is that the reason kernfs needs to overwrite the inode > attributes is to reset what the VFS might have done but if kernfs > has this inode operation they won't need to be overwritten since > they won't have changed. > > There may be other places where the attributes (or an attribute) > are set by the VFS, I haven't finished checking that yet so my > suggestion might not be entirely valid. > > What I need to do is work out what kernfs node attributes, if any, > should be updated by .update_times(). If I go by what > kernfs_refresh_inode() does now then that would be none but shouldn't > atime at least be updated in the node iattr. 
> > > > +static int kernfs_iop_update_time(struct inode *inode, struct > > > timespec64 *time, int flags) > > > { > > > - struct inode *inode = d_inode(path->dentry); > > > struct kernfs_node *kn = inode->i_private; > > > + struct kernfs_iattrs *attrs; > > > > > > mutex_lock(&kernfs_mutex); > > > + attrs = kernfs_iattrs(kn); > > > + if (!attrs) { > > > + mutex_unlock(&kernfs_mutex); > > > + return -ENOMEM; > > > + } > > > + > > > + /* Which kernfs node attributes should be updated from > > > + * time? > > > + */ > > > + > > > kernfs_refresh_inode(kn, inode); > > > mutex_unlock(&kernfs_mutex); > > > > I don't see how this would reflect the changes from kernfs_setattr() > > into > > the attached inode. This would actually make the attr updates > > obviously racy > > - the userland visible attrs would be stale until the inode gets > > reclaimed > > and then when it gets reinstantiated it'd show the latest > > information. > > Right, I will have to think about that, but as I say above this > isn't really what I would propose. > > If .update_times() sticks strictly to what kernfs_refresh_inode() > does now then it would set the inode attributes from the node iattr > only. > > > > > That said, if you wanna take the direction where attr updates are > > reflected > > to the associated inode when the change occurs, which makes sense, > > the right > > thing to do would be making kernfs_setattr() update the associated > > inode if > > existent. > > Mmm, that's a good point but it looks like the inode isn't available > there. > Is it possible to embed super block somewhere, then we can call kernfs_get_inode to get inode in kernfs_setattr??? thanks, fox
On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote: > On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@themaw.net> wrote: > > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@themaw.net> > > > wrote: > > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@themaw.net> > > > > > wrote: > > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > > > Hello, > > > > > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > > > What could be done is to make the kernfs node > > > > > > > > > attr_mutex > > > > > > > > > a pointer and dynamically allocate it but even that > > > > > > > > > is > > > > > > > > > too > > > > > > > > > costly a size addition to the kernfs node structure > > > > > > > > > as > > > > > > > > > Tejun has said. > > > > > > > > > > > > > > > > I guess the question to ask is, is there really a need > > > > > > > > to > > > > > > > > call kernfs_refresh_inode() from functions that are > > > > > > > > usually > > > > > > > > reading/checking functions. > > > > > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > > > write/set > > > > > > > > operations in (if there's any) places where things like > > > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > > > sysfs > > > > > > > namespace is > > > > > > > implemented, so I don't think there's an easy around > > > > > > > that. > > > > > > > The > > > > > > > only > > > > > > > thing I > > > > > > > can think of is embedding the lock into attrs and doing > > > > > > > xchg > > > > > > > dance > > > > > > > when > > > > > > > attaching it. 
> > > > > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > > > attrs structure, am I correct? > > > > > > > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > > > > > > > One global lock for the allocation and an attrs lock for > > > > > > all > > > > > > the > > > > > > attrs field updates including the kernfs_refresh_inode() > > > > > > update. > > > > > > > > > > > > The critical section for the global lock could be reduced > > > > > > and > > > > > > it > > > > > > changed to a spin lock. > > > > > > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > > > > > take the allocation lock > > > > > > do the allocated checks > > > > > > assign if existing attrs > > > > > > release the allocation lock > > > > > > return existing if found > > > > > > othewise > > > > > > release the allocation lock > > > > > > > > > > > > allocate and initialize attrs > > > > > > > > > > > > take the allocation lock > > > > > > check if someone beat us to it > > > > > > free and grab exiting attrs > > > > > > otherwise > > > > > > assign the new attrs > > > > > > release the allocation lock > > > > > > return attrs > > > > > > > > > > > > Add a spinlock to the attrs struct and use it everywhere > > > > > > for > > > > > > field updates. > > > > > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > > > > > Ian > > > > > > > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So > > > > > I > > > > > guess > > > > > the problem is how can we protect the inode when > > > > > kernfs_refresh_inode > > > > > is called, not the attrs?? > > > > > > > > But the attrs (which is what's copied from) were protected by > > > > the > > > > mutex lock (IIUC) so dealing with the inode attributes implies > > > > dealing with the kernfs node attrs too. 
> > > > > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > > > copies > > > > the node attrs to the inode under the same mutex lock. So, if a > > > > read > > > > lock is used the copy in kernfs_refresh_inode() is no longer > > > > protected, > > > > it needs to be protected in a different way. > > > > > > > > > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem > > > for > > > .setattr but > > > no lock for .getattr (misdocumented?? sometimes they have as > > > you've > > > found out)? > > > What does it protect against?? Because .permission does a similar > > > thing > > > here -- updating inode attributes, the goal is to provide the > > > same > > > protection level > > > for .permission as for .setattr, am I right??? > > > > As far as the documentation goes that's probably my > > misunderstanding > > of it. > > > > It does happen that the VFS makes assumptions about how call backs > > are meant to be used. > > > > Read like call backs, like .getattr() and .permission() are meant > > to > > be used, well, like read like functions so the VFS should be ok to > > take locks or not based on the operation context at hand. > > > > So it's not about the locking for these call backs per se, it's > > about > > the context in which they are called. > > > > For example, in link_path_walk(), at the beginning of the component > > lookup loop (essentially for the containing directory at that > > point), > > may_lookup() is called which leads to a call to .permission() > > without > > any inode lock held at that point. > > > > But file opens (possibly following a path walk to resolve a path) > > are different. > > > > For example, do_filp_open() calls path_openat() which leads to a > > call to open_last_lookups(), which leads to a call to .permission() > > along the way. 
> > And in this case there are two contexts, an open() with create or
> > one without, the former needing the exclusive inode lock and the
> > latter able to use the shared lock.
> >
> > So it's about the locking needed for the encompassing operation
> > that is being done, not about those functions specifically.
> >
> > TBH the VFS is very complex and Al has a much, much better
> > understanding of it than I do so he would need to be the one to
> > answer whether it's the file system's responsibility to use these
> > calls in the way the VFS expects.
> >
> > My belief is that if a file system needs to use a call back in a
> > way that's in conflict with what the VFS expects it's the file
> > system's responsibility to deal with the side effects.
>
> Thanks for clarifying. Ian.
>
> Yeah, it's complex and confusing and it's very hard to spot lock
> context by reading VFS code.
>
> I put code like this:
>
> 	if (lockdep_is_held_type(&inode->i_rwsem, -1)) {
> 		if (lockdep_is_held_type(&inode->i_rwsem, 0)) {
> 			pr_warn("kernfs iop_permission inode WRITE lock is held");
> 		} else if (lockdep_is_held_type(&inode->i_rwsem, 1)) {
> 			pr_warn("kernfs iop_permission inode READ lock is held");
> 		}
> 	} else {
> 		pr_warn("kernfs iop_permission inode lock is NOT held");
> 	}
>
> in both .permission & .getattr. Then I do some open/read/write/close
> to /sys; interestingly, all log outputs suggest they are in a WRITE
> lock context.

The thing is, in open_last_lookups() called from path_openat() we have:

	if (open_flag & O_CREAT)
		inode_lock(dir->d_inode);
	else
		inode_lock_shared(dir->d_inode);

and from there it's lookup_open() and possibly may_o_create(), which
calls inode_permission() and on to .permission().

So, strictly speaking, it should be taken exclusive because you would
expect may_o_create() to be called only with O_CREAT. But all the
possible cases would need to be checked, and if it is taken shared
anywhere we are in trouble.
Another example: in link_path_walk() we have a call to may_lookup(),
which leads to a call to .permission() without the lock being held.

So there are a bunch of cases to check, and assuming you are the owner
of the lock when it is held is at least risky when concurrency is high
and there's a possibility the lock can be taken shared.

Ian