Message ID | 20190211224437.25267-1-daniel.m.jordan@oracle.com
---|---
Series | use pinned_vm instead of locked_vm to account pinned pages

On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
>
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use. It's based on a similar patch I posted recently[0].
>
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1]. Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus
*separately* from RDMA to Linus, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the
subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages. I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason

On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > Hi,
> >
> > This series converts users that account pinned pages with locked_vm to
> > account with pinned_vm instead, pinned_vm being the correct counter to
> > use. It's based on a similar patch I posted recently[0].
> >
> > The patches are based on rdma/for-next to build on Davidlohr Bueso's
> > recent conversion of pinned_vm to an atomic64_t[1]. Seems to make some
> > sense for these to be routed the same way, despite lack of rdma content?
>
> Oy.. I'd be willing to accumulate a branch with acks to send to Linus
> *separately* from RDMA to Linus, but this is very abnormal.
>
> Better to wait a few weeks for -rc1 and send patches through the
> subsystem trees.

Ok, I can do that. It did seem strange, so I made it a question...

On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
>
> > All five of these places, and probably some of Davidlohr's conversions,
> > probably want to be collapsed into a common helper in the core mm for
> > accounting pinned pages. I tried, and there are several details that
> > likely need discussion, so this can be done as a follow-on.
>
> I've wondered the same..

I'm really thinking this would be a nice way to ensure it gets cleaned up and
does not happen again.

Also, by moving it to the core we could better manage any user visible changes.

From a high level, pinned is a subset of locked, so it seems like we need 2
sets of helpers:

try_increment_locked_vm(...)
decrement_locked_vm(...)

try_increment_pinned_vm(...)
decrement_pinned_vm(...)

Where try_increment_pinned_vm() also increments locked_vm... Of course this
may end up reverting the improvement of Davidlohr Bueso's atomic work... :-(

Furthermore, it would seem better (although I don't know if it's at all
possible) if this were accounted for in core calls which tracked them based
on how the pages are being used, so that drivers can't call
try_increment_locked_vm() and then pin the pages... thus getting the
accounting wrong vs what actually happened.

And then in the end we can go back to locked_vm being the value checked
against RLIMIT_MEMLOCK.

Ira

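To make the proposal above concrete, the sketch below shows roughly what such core-mm helpers could look like. This is illustrative only, not existing kernel API: the function names come from the list in the mail above, locked_vm is assumed to still be an unsigned long protected by mmap_sem, and pinned_vm is assumed to already be the atomic64_t from Davidlohr's conversion. Overflow checks and the CAP_IPC_LOCK override are elided.

```c
#include <linux/mm.h>
#include <linux/sched/signal.h>	/* rlimit() */

/*
 * Hypothetical core-mm helpers sketching the proposal above; not an
 * existing kernel API.
 */
static int try_increment_locked_vm(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	int ret = 0;

	down_write(&mm->mmap_sem);
	if (mm->locked_vm + npages > limit)
		ret = -ENOMEM;
	else
		mm->locked_vm += npages;
	up_write(&mm->mmap_sem);

	return ret;
}

static void decrement_locked_vm(struct mm_struct *mm, unsigned long npages)
{
	down_write(&mm->mmap_sem);
	mm->locked_vm -= npages;
	up_write(&mm->mmap_sem);
}

/*
 * Pinned pages are treated as a subset of locked pages, so charging
 * pinned_vm also charges locked_vm -- this is the part that would undo
 * the benefit of making pinned_vm a lockless atomic64_t.
 */
static int try_increment_pinned_vm(struct mm_struct *mm, unsigned long npages)
{
	int ret = try_increment_locked_vm(mm, npages);

	if (!ret)
		atomic64_add(npages, &mm->pinned_vm);

	return ret;
}

static void decrement_pinned_vm(struct mm_struct *mm, unsigned long npages)
{
	atomic64_sub(npages, &mm->pinned_vm);
	decrement_locked_vm(mm, npages);
}
```

A driver pin path would then call try_increment_pinned_vm() before pinning and decrement_pinned_vm() on teardown, keeping the limit check and the counter updates in one place.
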
On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> >
> > > All five of these places, and probably some of Davidlohr's conversions,
> > > probably want to be collapsed into a common helper in the core mm for
> > > accounting pinned pages. I tried, and there are several details that
> > > likely need discussion, so this can be done as a follow-on.
> >
> > I've wondered the same..
>
> I'm really thinking this would be a nice way to ensure it gets cleaned up and
> does not happen again.
>
> Also, by moving it to the core we could better manage any user visible changes.
>
> From a high level, pinned is a subset of locked, so it seems like we need 2
> sets of helpers:
>
> try_increment_locked_vm(...)
> decrement_locked_vm(...)
>
> try_increment_pinned_vm(...)
> decrement_pinned_vm(...)
>
> Where try_increment_pinned_vm() also increments locked_vm... Of course this
> may end up reverting the improvement of Davidlohr Bueso's atomic work... :-(
>
> Furthermore, it would seem better (although I don't know if it's at all
> possible) if this were accounted for in core calls which tracked them based
> on how the pages are being used, so that drivers can't call
> try_increment_locked_vm() and then pin the pages... thus getting the
> accounting wrong vs what actually happened.
>
> And then in the end we can go back to locked_vm being the value checked
> against RLIMIT_MEMLOCK.

Someone would need to understand the bug that was fixed by splitting them.

I think it had to do with double accounting pinned and mlocked pages and
thus delivering a lower than expected limit to userspace.

vfio has this bug, RDMA does not. RDMA has a bug where it can overallocate
locked memory, vfio doesn't.

Really unclear how to fix this. The pinned/locked split with two buckets
may be the right way.

Jason

On Wed, Feb 13, 2019 at 11:00:06PM -0700, Jason Gunthorpe wrote:
> On Wed, Feb 13, 2019 at 05:53:14PM -0800, Ira Weiny wrote:
> > On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> > > On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > >
> > > > All five of these places, and probably some of Davidlohr's conversions,
> > > > probably want to be collapsed into a common helper in the core mm for
> > > > accounting pinned pages. I tried, and there are several details that
> > > > likely need discussion, so this can be done as a follow-on.
> > >
> > > I've wondered the same..
> >
> > I'm really thinking this would be a nice way to ensure it gets cleaned up and
> > does not happen again.
> >
> > Also, by moving it to the core we could better manage any user visible changes.
> >
> > From a high level, pinned is a subset of locked, so it seems like we need 2
> > sets of helpers:
> >
> > try_increment_locked_vm(...)
> > decrement_locked_vm(...)
> >
> > try_increment_pinned_vm(...)
> > decrement_pinned_vm(...)
> >
> > Where try_increment_pinned_vm() also increments locked_vm... Of course this
> > may end up reverting the improvement of Davidlohr Bueso's atomic work... :-(
> >
> > Furthermore, it would seem better (although I don't know if it's at all
> > possible) if this were accounted for in core calls which tracked them based
> > on how the pages are being used, so that drivers can't call
> > try_increment_locked_vm() and then pin the pages... thus getting the
> > accounting wrong vs what actually happened.
> >
> > And then in the end we can go back to locked_vm being the value checked
> > against RLIMIT_MEMLOCK.
>
> Someone would need to understand the bug that was fixed by splitting
> them.

My suggestion above assumes that splitting them is required/correct. To be
fair I've not dug into whether this is true or not, but I trust Christopher.

What I have found is this commit:

bc3e53f682d9 mm: distinguish between mlocked and pinned pages

I think that commit introduced the bug (for IB), which at the time may have
been "ok" because many users of IB at the time were HPC/MPI users, and I
don't think MPI does a lot of _separate_ mlock operations, so the count of
locked_vm was probably negligible. Alternatively, the clusters I've worked
on in the past set RLIMIT_MEMLOCK to 'unlimited' on their compute nodes
whilst running MPI applications... :-/

I think what Christopher did was probably ok for the internal tracking, but
we _should_ have had something which summed the 2 for RLIMIT_MEMLOCK
checking at that time to be 100% correct? Christopher, do you remember why
you did not do that?

[1] http://lkml.kernel.org/r/20130524140114.GK23650@twins.programming.kicks-ass.net

> I think it had to do with double accounting pinned and mlocked pages
> and thus delivering a lower than expected limit to userspace.
>
> vfio has this bug, RDMA does not. RDMA has a bug where it can
> overallocate locked memory, vfio doesn't.

Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

I think the problem is that if the user calls mlock on a large range then
both vfio and RDMA could potentially overallocate even with this fix. This
was your initial email to Daniel, I think... And Alex's concern.

> Really unclear how to fix this. The pinned/locked split with two
> buckets may be the right way.

Are you suggesting that we have 2 user limits?

Ira

> Jason

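For reference, the "sum the two for the RLIMIT_MEMLOCK check" idea might look something like the sketch below. This is hypothetical, not existing kernel code, and it still charges pages that are both mlocked and pinned twice, which is exactly the double-accounting problem under discussion.

```c
#include <linux/mm.h>
#include <linux/capability.h>
#include <linux/sched/signal.h>	/* rlimit() */

/*
 * Hypothetical: enforce RLIMIT_MEMLOCK against the sum of the two
 * counters instead of against each one separately.
 */
static bool memlock_limit_ok(struct mm_struct *mm, unsigned long new_pages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	u64 total = mm->locked_vm + atomic64_read(&mm->pinned_vm) + new_pages;

	return total <= limit || capable(CAP_IPC_LOCK);
}
```
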
On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:
> > I think it had to do with double accounting pinned and mlocked pages
> > and thus delivering a lower than expected limit to userspace.
> >
> > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > overallocate locked memory, vfio doesn't.
>
> Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?

Yes

> I think the problem is that if the user calls mlock on a large range then
> both vfio and RDMA could potentially overallocate even with this fix. This
> was your initial email to Daniel, I think... And Alex's concern.

Here are the possibilities:

 - mlock and pin on the same pages - RDMA respects the limit, VFIO halves it.
 - mlock and pin on different pages - RDMA doubles the limit, VFIO respects it.
 - VFIO and RDMA in the same process - the limit is halved or doubled,
   depending.

IMHO we should make VFIO & RDMA the same, and then decide what to do
about case #2.

> > Really unclear how to fix this. The pinned/locked split with two
> > buckets may be the right way.
>
> Are you suggesting that we have 2 user limits?

This is what RDMA has done since CL's patch.

It is very hard to fix as you need to track how many pages are mlocked
*AND* pinned.

Jason

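To put illustrative numbers on those cases (the figures are made up for the example): with RLIMIT_MEMLOCK at 64 MB, mlocking 40 MB and then RDMA-pinning the same 40 MB gives locked_vm = 40 MB and pinned_vm = 40 MB, so RDMA's check against pinned_vm alone passes and the limit is respected, while VFIO, which charges its pins to locked_vm, would see 80 MB and fail, effectively halving the usable limit. If the mlocked and pinned ranges are disjoint (40 MB each), RDMA lets the process tie down 80 MB against a 64 MB limit, doubling it, while VFIO's single counter correctly refuses.
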
On Thu, Feb 14, 2019 at 01:12:31PM -0700, Jason Gunthorpe wrote:
> On Thu, Feb 14, 2019 at 11:33:53AM -0800, Ira Weiny wrote:
>
> > > I think it had to do with double accounting pinned and mlocked pages
> > > and thus delivering a lower than expected limit to userspace.
> > >
> > > vfio has this bug, RDMA does not. RDMA has a bug where it can
> > > overallocate locked memory, vfio doesn't.
> >
> > Wouldn't vfio also be able to overallocate if the user had RDMA pinned pages?
>
> Yes
>
> > I think the problem is that if the user calls mlock on a large range then
> > both vfio and RDMA could potentially overallocate even with this fix. This
> > was your initial email to Daniel, I think... And Alex's concern.
>
> Here are the possibilities:
>
>  - mlock and pin on the same pages - RDMA respects the limit, VFIO halves it.
>  - mlock and pin on different pages - RDMA doubles the limit, VFIO respects it.
>  - VFIO and RDMA in the same process - the limit is halved or doubled,
>    depending.
>
> IMHO we should make VFIO & RDMA the same, and then decide what to do
> about case #2.

I'm not against that. Sorry if I came across that way. For this series I
agree we should make it consistent.

> > > Really unclear how to fix this. The pinned/locked split with two
> > > buckets may be the right way.
> >
> > Are you suggesting that we have 2 user limits?
>
> This is what RDMA has done since CL's patch.

I don't understand? What is the other _user_ limit (other than
RLIMIT_MEMLOCK)?

> It is very hard to fix as you need to track how many pages are mlocked
> *AND* pinned.

Understood. :-/

Ira

> Jason

On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:
> > > > Really unclear how to fix this. The pinned/locked split with two
> > > > buckets may be the right way.
> > >
> > > Are you suggesting that we have 2 user limits?
> >
> > This is what RDMA has done since CL's patch.
>
> I don't understand? What is the other _user_ limit (other than
> RLIMIT_MEMLOCK)?

With today's implementation RLIMIT_MEMLOCK covers two user limits: the
total number of pinned pages and the total number of mlocked pages. The
two are different buckets and are not summed.

Jason

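Roughly, the two independent checks against the same rlimit look like the sketch below. This is a simplified paraphrase of the pin-side check in the RDMA umem code and the mlock-side check in mm/mlock.c; locking, the CAP_IPC_LOCK override and error paths are elided, so treat the details as approximate rather than as the real kernel code.

```c
#include <linux/mm.h>
#include <linux/sched/signal.h>	/* rlimit() */

/* Pin side (roughly what the RDMA umem code does): checks pinned_vm only. */
static int charge_pinned(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (atomic64_add_return(npages, &mm->pinned_vm) > limit) {
		atomic64_sub(npages, &mm->pinned_vm);
		return -ENOMEM;
	}
	return 0;
}

/* mlock side (roughly what do_mlock() does): checks locked_vm only. */
static int charge_locked(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (mm->locked_vm + npages > limit)
		return -ENOMEM;
	mm->locked_vm += npages;
	return 0;
}
```

Nothing ever compares the sum of the two counters against the limit, so each bucket can grow to the full RLIMIT_MEMLOCK on its own.
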
On Thu, 14 Feb 2019, Jason Gunthorpe wrote:
> On Thu, Feb 14, 2019 at 01:46:51PM -0800, Ira Weiny wrote:
>
> > > > > Really unclear how to fix this. The pinned/locked split with two
> > > > > buckets may be the right way.
> > > >
> > > > Are you suggesting that we have 2 user limits?
> > >
> > > This is what RDMA has done since CL's patch.
> >
> > I don't understand? What is the other _user_ limit (other than
> > RLIMIT_MEMLOCK)?
>
> With today's implementation RLIMIT_MEMLOCK covers two user limits: the
> total number of pinned pages and the total number of mlocked pages. The
> two are different buckets and are not summed.

Applications were failing at some point because they were effectively summed
up. If you mlocked/pinned a dataset of more than half the memory of a system
then things would get really weird. Also there is the possibility of even
more duplication because pages can be pinned by multiple kernel subsystems,
so you could get more than a doubling of the number.

The sane thing was to account them separately so that mlocking and pinning
worked without apps failing, and then wait for another genius to figure out
how to improve the situation by getting the pinned page mess under control.

It is not even advisable to check pinned pages against any limit because
pages can be pinned by multiple subsystems.

The main problem here is that we only have a refcount to indicate pinning
and no way to clearly distinguish long term from short term pins. In order
to really fix this issue we would need a list of subsystems that have taken
long term pins on a page. But doing so would waste a lot of memory and cause
a significant performance regression.

And the discussions here seem to be meandering around these issues. Nothing
really convinces me that we have a clean solution at hand.
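
As a back-of-the-envelope illustration of that cost (assuming 4 KiB pages and 64-bit pointers): even a minimal two-pointer list head attached to every struct page would add 16 bytes per 4 KiB of RAM, roughly 0.4% of all physical memory before counting the per-pin entries themselves, and maintaining such a list would put a lock or RCU update on every long-term get/put of a page, which is where the performance regression would come from.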