Message ID: 20211120045011.3074840-1-almasrymina@google.com
Series: Deterministic charging of shared memory
On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > 1. One complication to address is the behavior when the target memcg > hits its memory.max limit because of remote charging. In this case the > oom-killer will be invoked, but the oom-killer may not find anything > to kill in the target memcg being charged. Thera are a number of considerations > in this case: > > 1. It's not great to kill the allocating process since the allocating process > is not running in the memcg under oom, and killing it will not free memory > in the memcg under oom. > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > somehow. If not, the process will forever loop the pagefault in the upstream > kernel. > > In this case, I propose simply failing the remote charge and returning an ENOSPC > to the caller. This will cause will cause the process executing the remote > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > path. This will be documented behavior of remote charging, and this feature is > opt-in. Users can: > - Not opt-into the feature if they want. > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > abort if they desire. > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > operation without executing the remote charge if possible. Why is ENOSPC the right error instead of ENOMEM?
On Fri, Nov 19, 2021 at 9:01 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > 1. One complication to address is the behavior when the target memcg > > hits its memory.max limit because of remote charging. In this case the > > oom-killer will be invoked, but the oom-killer may not find anything > > to kill in the target memcg being charged. Thera are a number of considerations > > in this case: > > > > 1. It's not great to kill the allocating process since the allocating process > > is not running in the memcg under oom, and killing it will not free memory > > in the memcg under oom. > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > somehow. If not, the process will forever loop the pagefault in the upstream > > kernel. > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > to the caller. This will cause will cause the process executing the remote > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > path. This will be documented behavior of remote charging, and this feature is > > opt-in. Users can: > > - Not opt-into the feature if they want. > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > abort if they desire. > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > operation without executing the remote charge if possible. > > Why is ENOSPC the right error instead of ENOMEM? Returning ENOMEM from mem_cgroup_charge_mapping() will cause the application to get ENOMEM from non-pagefault paths (which is perfectly fine), and get stuck in a loop trying to resolve the pagefault in the pagefault path (less fine). The logic is here: https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1432 ENOMEM gets bubbled up there as VM_FAULT_OOM, and on remote charges the behavior I see is that the kernel loops the pagefault forever until memory is freed in the remote memcg, which may never happen. ENOSPC gets bubbled up there as VM_FAULT_SIGBUS and sends a SIGBUS to the allocating process. The conjecture here is that it's preferable to send a SIGBUS to the allocating process rather than have it stuck in a loop trying to resolve a pagefault.
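As a rough illustration of the error mapping described above: the fault handler converts the error returned by the charge path into a vm_fault_t. The sketch below is a simplified rendering of that mapping, modeled on the vmf_error() helper in include/linux/mm.h; it is not the exact upstream code, but it shows why -ENOMEM leads to a retried (and potentially looping) fault while -ENOSPC leads to SIGBUS.

    #include <linux/mm.h>

    /*
     * Simplified sketch of how a charge failure surfaces from the fault
     * path; modeled on vmf_error(), not the exact upstream code.
     */
    static vm_fault_t charge_err_to_fault(int err)
    {
    	if (err == -ENOMEM)
    		return VM_FAULT_OOM;	/* fault gets retried; can loop while
    					 * the remote memcg sits at its limit */
    	return VM_FAULT_SIGBUS;		/* e.g. -ENOSPC: a SIGBUS is delivered
    					 * to the faulting task */
    }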
On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > Problem: > Currently shared memory is charged to the memcg of the allocating > process. This makes memory usage of processes accessing shared memory > a bit unpredictable since whichever process accesses the memory first > will get charged. We have a number of use cases where our userspace > would like deterministic charging of shared memory: > > 1. System services allocating memory for client jobs: > We have services (namely a network access service[1]) that provide > functionality for clients running on the machine and allocate memory > to carry out these services. The memory usage of these services > depends on the number of jobs running on the machine and the nature of > the requests made to the service, which makes the memory usage of > these services hard to predict and thus hard to limit via memory.max. > These system services would like a way to allocate memory and instruct > the kernel to charge this memory to the client’s memcg. > > 2. Shared filesystem between subtasks of a large job > Our infrastructure has large meta jobs such as kubernetes which spawn > multiple subtasks which share a tmpfs mount. These jobs and its > subtasks use that tmpfs mount for various purposes such as data > sharing or persistent data between the subtask restarts. In kubernetes > terminology, the meta job is similar to pods and subtasks are > containers under pods. We want the shared memory to be > deterministically charged to the kubernetes's pod and independent to > the lifetime of containers under the pod. > > 3. Shared libraries and language runtimes shared between independent jobs. > We’d like to optimize memory usage on the machine by sharing libraries > and language runtimes of many of the processes running on our machines > in separate memcgs. This produces a side effect that one job may be > unlucky to be the first to access many of the libraries and may get > oom killed as all the cached files get charged to it. > > Design: > My rough proposal to solve this problem is to simply add a > ‘memcg=/path/to/memcg’ mount option for filesystems: > directing all the memory of the file system to be ‘remote charged’ to > cgroup provided by that memcg= option. > > Caveats: > > 1. One complication to address is the behavior when the target memcg > hits its memory.max limit because of remote charging. In this case the > oom-killer will be invoked, but the oom-killer may not find anything > to kill in the target memcg being charged. Thera are a number of considerations > in this case: > > 1. It's not great to kill the allocating process since the allocating process > is not running in the memcg under oom, and killing it will not free memory > in the memcg under oom. > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > somehow. If not, the process will forever loop the pagefault in the upstream > kernel. > > In this case, I propose simply failing the remote charge and returning an ENOSPC > to the caller. This will cause will cause the process executing the remote > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > path. This will be documented behavior of remote charging, and this feature is > opt-in. Users can: > - Not opt-into the feature if they want. > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > abort if they desire. 
> - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > operation without executing the remote charge if possible. > > 2. Only processes allowed the enter cgroup at mount time can mount a > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > process with write access to this mount point will be able to charge memory to > <cgroup>. This is largely a non-issue because in configurations where there is > untrusted code running on the machine, mount point access needs to be > restricted to the intended users only regardless of whether the mount point > memory is deterministly charged or not. I'm not a fan of this. It uses filesystem mounts to create shareable resource domains outside of the cgroup hierarchy, which has all the downsides you listed, and more: 1. You need a filesystem interface in the first place, and a new ad-hoc channel and permission model to coordinate with the cgroup tree, which isn't great. All filesystems you want to share data on need to be converted. 2. It doesn't extend to non-filesystem sources of shared data, such as memfds, ipc shm etc. 3. It requires unintuitive configuration for what should be basic shared accounting semantics. Per default you still get the old 'first touch' semantics, but to get sharing you need to reconfigure the filesystems? 4. If a task needs to work with a hierarchy of data sharing domains - system-wide, group of jobs, job - it must interact with a hierarchy of filesystem mounts. This is a pain to setup and may require task awareness. Moving data around, working with different mount points. Also, no shared and private data accounting within the same file. 5. It reintroduces cgroup1 semantics of tasks and resouces, which are entangled, sitting in disjunct domains. OOM killing is one quirk of that, but there are others you haven't touched on. Who is charged for the CPU cycles of reclaim in the out-of-band domain? Who is charged for the paging IO? How is resource pressure accounted and attributed? Soon you need cpu= and io= as well. My take on this is that it might work for your rather specific usecase, but it doesn't strike me as a general-purpose feature suitable for upstream. If we want sharing semantics for memory, I think we need a more generic implementation with a cleaner interface. Here is one idea: Have you considered reparenting pages that are accessed by multiple cgroups to the first common ancestor of those groups? Essentially, whenever there is a memory access (minor fault, buffered IO) to a page that doesn't belong to the accessing task's cgroup, you find the common ancestor between that task and the owning cgroup, and move the page there. With a tree like this: root - job group - job `- job `- job group - job `- job all pages accessed inside that tree will propagate to the highest level at which they are shared - which is the same level where you'd also set shared policies, like a job group memory limit or io weight. E.g. libc pages would (likely) bubble to the root, persistent tmpfs pages would bubble to the respective job group, private data would stay within each job. No further user configuration necessary. Although you still *can* use mount namespacing etc. to prohibit undesired sharing between cgroups. The actual user-visible accounting change would be quite small, and arguably much more intuitive. 
Remember that accounting is recursive, meaning that a job page today also shows up in the counters of job group and root. This would not change. The only thing that IS weird today is that when two jobs share a page, it will arbitrarily show up in one job's counter but not in the other's. That would change: it would no longer show up as either, since it's not private to either; it would just be a job group (and up) page. This would be a generic implementation of resource sharing semantics: independent of data source and filesystems, contained inside the cgroup interface, and reusing the existing hierarchies of accounting and control domains to also represent levels of common property. Thoughts?
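To make the reparenting idea above concrete, here is a rough sketch of what a recharge on foreign access could look like. This is purely illustrative: recharge_page() does not exist upstream, and a real implementation would have to deal with locking, kernel memory, and the page/folio conversion. Only page_memcg(), mem_cgroup_is_descendant(), and parent_mem_cgroup() are existing helpers.

    #include <linux/mm.h>
    #include <linux/memcontrol.h>

    /*
     * Illustrative sketch of "move the page to the first common ancestor"
     * on access from a foreign cgroup. recharge_page() is hypothetical.
     */
    static void recharge_to_common_ancestor(struct page *page,
    					    struct mem_cgroup *accessor)
    {
    	struct mem_cgroup *owner = page_memcg(page);
    	struct mem_cgroup *common = owner;

    	if (!owner || !accessor || owner == accessor)
    		return;

    	/* Walk up from the current owner until we reach a cgroup that
    	 * also contains the accessing task's cgroup. */
    	while (common && !mem_cgroup_is_descendant(accessor, common))
    		common = parent_mem_cgroup(common);

    	if (common && common != owner)
    		recharge_page(page, owner, common);	/* hypothetical helper */
    }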
On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > Problem: > > Currently shared memory is charged to the memcg of the allocating > > process. This makes memory usage of processes accessing shared memory > > a bit unpredictable since whichever process accesses the memory first > > will get charged. We have a number of use cases where our userspace > > would like deterministic charging of shared memory: > > > > 1. System services allocating memory for client jobs: > > We have services (namely a network access service[1]) that provide > > functionality for clients running on the machine and allocate memory > > to carry out these services. The memory usage of these services > > depends on the number of jobs running on the machine and the nature of > > the requests made to the service, which makes the memory usage of > > these services hard to predict and thus hard to limit via memory.max. > > These system services would like a way to allocate memory and instruct > > the kernel to charge this memory to the client’s memcg. > > > > 2. Shared filesystem between subtasks of a large job > > Our infrastructure has large meta jobs such as kubernetes which spawn > > multiple subtasks which share a tmpfs mount. These jobs and its > > subtasks use that tmpfs mount for various purposes such as data > > sharing or persistent data between the subtask restarts. In kubernetes > > terminology, the meta job is similar to pods and subtasks are > > containers under pods. We want the shared memory to be > > deterministically charged to the kubernetes's pod and independent to > > the lifetime of containers under the pod. > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > We’d like to optimize memory usage on the machine by sharing libraries > > and language runtimes of many of the processes running on our machines > > in separate memcgs. This produces a side effect that one job may be > > unlucky to be the first to access many of the libraries and may get > > oom killed as all the cached files get charged to it. > > > > Design: > > My rough proposal to solve this problem is to simply add a > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > directing all the memory of the file system to be ‘remote charged’ to > > cgroup provided by that memcg= option. > > > > Caveats: > > > > 1. One complication to address is the behavior when the target memcg > > hits its memory.max limit because of remote charging. In this case the > > oom-killer will be invoked, but the oom-killer may not find anything > > to kill in the target memcg being charged. Thera are a number of considerations > > in this case: > > > > 1. It's not great to kill the allocating process since the allocating process > > is not running in the memcg under oom, and killing it will not free memory > > in the memcg under oom. > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > somehow. If not, the process will forever loop the pagefault in the upstream > > kernel. > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > to the caller. This will cause will cause the process executing the remote > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > path. This will be documented behavior of remote charging, and this feature is > > opt-in. Users can: > > - Not opt-into the feature if they want. 
> > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > abort if they desire. > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > operation without executing the remote charge if possible. > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > process with write access to this mount point will be able to charge memory to > > <cgroup>. This is largely a non-issue because in configurations where there is > > untrusted code running on the machine, mount point access needs to be > > restricted to the intended users only regardless of whether the mount point > > memory is deterministly charged or not. > > I'm not a fan of this. It uses filesystem mounts to create shareable > resource domains outside of the cgroup hierarchy, which has all the > downsides you listed, and more: > > 1. You need a filesystem interface in the first place, and a new > ad-hoc channel and permission model to coordinate with the cgroup > tree, which isn't great. All filesystems you want to share data on > need to be converted. > My understanding is that this problem exists today with tmpfs-shared memory, regardless of memcg= support or not. I.e. for processes to share memory via tmpfs the sys admin needs to limit access to the mount point to the processes regardless of which cgroup[s] the processes are in for the machine to be properly configured, or risk unintended data access and a security violation. So existing tmpfs shared memory would/should already have these permissions in place, and (I'm hoping) we can piggy back or that and provide deterministic charging. > 2. It doesn't extend to non-filesystem sources of shared data, such as > memfds, ipc shm etc. > I was hoping - if possible - to extend similar APIs/semantics to other shared memory sources, although to be honest I'll concede I haven't thoroughly thought of how the implementation would look like. > 3. It requires unintuitive configuration for what should be basic > shared accounting semantics. Per default you still get the old > 'first touch' semantics, but to get sharing you need to reconfigure > the filesystems? > Yes, this is indeed an explicit option that needs to be configured by the sys admin. I'm not so sure about changing the default in the kernel and potentially breaking existing accounting like you mention below. I think the kernel also automagically trying to figure out the proper memcg to deterministically charge has its own issues (comments on the proposal below). > 4. If a task needs to work with a hierarchy of data sharing domains - > system-wide, group of jobs, job - it must interact with a hierarchy > of filesystem mounts. This is a pain to setup and may require task > awareness. Moving data around, working with different mount points. > Also, no shared and private data accounting within the same file. > Again, my impression/feeling here is that this is a generic problem with tmpfs shared memory, and maybe shared memory in general, which folks find very useful already despite the existing shortcomings. Today AFAIK we don't have interfaces to say 'this is shared memory and it's shared between processes in cgroups A, B, and C'. 
Instead we say this is shared memory and the tmpfs access permissions or visibility decree who can access the shared memory (and the permissions are oblivious to cgroups) and the memory charging is first touch based and not deterministic. > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > entangled, sitting in disjunct domains. OOM killing is one quirk of > that, but there are others you haven't touched on. Who is charged > for the CPU cycles of reclaim in the out-of-band domain? Who is > charged for the paging IO? How is resource pressure accounted and > attributed? Soon you need cpu= and io= as well. > I think the allocating task is charged for cpu and io resources and I'm not sure I see a compelling reason to change that. I think the distinction is that memory is shared but charged to the one faulting it which is maybe not really fair or can be deterministically predicted by the sys admin setting limits on the various cgroups. I don't see that logic extending to cpu, but perhaps to io maybe. > My take on this is that it might work for your rather specific > usecase, but it doesn't strike me as a general-purpose feature > suitable for upstream. > > > If we want sharing semantics for memory, I think we need a more > generic implementation with a cleaner interface. > My issue here is that AFAICT in the upstream kernel there is no way to deterministically charge the shared memory other than preallocation which doesn't work so well on overcommitted systems and requires changes in the individual tasks that are allocating the shared memory. I'm definitely on board with any proposal that achieves what we want, although there are issues with the specific proposal you mentioned. (and thanks for reviewing and suggesting alternatives!) > Here is one idea: > > Have you considered reparenting pages that are accessed by multiple > cgroups to the first common ancestor of those groups? > > Essentially, whenever there is a memory access (minor fault, buffered > IO) to a page that doesn't belong to the accessing task's cgroup, you > find the common ancestor between that task and the owning cgroup, and > move the page there. > > With a tree like this: > > root - job group - job > `- job > `- job group - job > `- job > > all pages accessed inside that tree will propagate to the highest > level at which they are shared - which is the same level where you'd > also set shared policies, like a job group memory limit or io weight. > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > pages would bubble to the respective job group, private data would > stay within each job. > > No further user configuration necessary. Although you still *can* use > mount namespacing etc. to prohibit undesired sharing between cgroups. > > The actual user-visible accounting change would be quite small, and > arguably much more intuitive. Remember that accounting is recursive, > meaning that a job page today also shows up in the counters of job > group and root. This would not change. The only thing that IS weird > today is that when two jobs share a page, it will arbitrarily show up > in one job's counter but not in the other's. That would change: it > would no longer show up as either, since it's not private to either; > it would just be a job group (and up) page. 
> > This would be a generic implementation of resource sharing semantics: > independent of data source and filesystems, contained inside the > cgroup interface, and reusing the existing hierarchies of accounting > and control domains to also represent levels of common property. > > Thoughts? Two issues I see here: 1. This is a default change, somewhat likely to break existing accounting. 2. I think we're trying to make the charging deterministic, and this makes it even harder to predict a priori where memory is charged: (a) memory is initially charged to the allocating task, which forces the sys admin to over-provision cgroups that access shared memory, because what if they pre-allocate the shared memory and get charged for all of it? (b) The shared memory will only land "where it's supposed to land" if the sys admin has correctly set the permissions of the shared memory (tmpfs file system permissions/visibility for example). If the mount access is incorrectly configured and accessed by a bad actor, the memory will likely be reparented to root, which is likely worse than causing ENOSPC/SIGBUS in the current proposal. Hence, it's really an implicit requirement for the shared memory permissions to be correct for this to work, in which case memcg= seems better to me since it doesn't suffer from issue (a). I'm loosely aware of past conversations with Shakeel where it was recommended to charge the first common ancestor, mainly to side-step issues with the oom-killer not finding anything to kill. IMO I quite like the memcg= approach because you can: 1. memcg=<first common ancestor cgroup>, and not deal with potential SIGBUS/ENOSPC 2. memcg=<remote cgroup>, and deal with potential SIGBUS/ENOSPC. And the user has the flexibility to decide. But regardless of the proposal, I see it as an existing/orthogonal problem that the shared memory permissions be 'correct', and AFAICT existing shared memory permission models are completely oblivious to cgroups, so there is work for the sys admin to do anyway to make sure that only the intended processes are able to access the shared memory.
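For comparison, the charge-target selection under the memcg= proposal is conceptually just a per-superblock override of the default "charge the allocating task" rule. A hypothetical sketch follows; sb_bound_memcg() is made up for illustration and is not the actual patchset's API, while get_mem_cgroup_from_mm() is an existing helper.

    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/memcontrol.h>

    /*
     * Hypothetical sketch of charge-target selection for a filesystem
     * mounted with memcg=<cgroup>. sb_bound_memcg() is made up.
     */
    static struct mem_cgroup *shmem_charge_target(struct super_block *sb)
    {
    	struct mem_cgroup *memcg = sb_bound_memcg(sb);	/* hypothetical */

    	if (memcg)
    		return memcg;	/* deterministic: charge the memcg bound at mount time */

    	/* default "first touch" behavior: charge the allocating task's memcg */
    	return get_mem_cgroup_from_mm(current->mm);
    }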
On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote: > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > Problem: > > Currently shared memory is charged to the memcg of the allocating > > process. This makes memory usage of processes accessing shared memory > > a bit unpredictable since whichever process accesses the memory first > > will get charged. We have a number of use cases where our userspace > > would like deterministic charging of shared memory: > > > > 1. System services allocating memory for client jobs: > > We have services (namely a network access service[1]) that provide > > functionality for clients running on the machine and allocate memory > > to carry out these services. The memory usage of these services > > depends on the number of jobs running on the machine and the nature of > > the requests made to the service, which makes the memory usage of > > these services hard to predict and thus hard to limit via memory.max. > > These system services would like a way to allocate memory and instruct > > the kernel to charge this memory to the client’s memcg. > > > > 2. Shared filesystem between subtasks of a large job > > Our infrastructure has large meta jobs such as kubernetes which spawn > > multiple subtasks which share a tmpfs mount. These jobs and its > > subtasks use that tmpfs mount for various purposes such as data > > sharing or persistent data between the subtask restarts. In kubernetes > > terminology, the meta job is similar to pods and subtasks are > > containers under pods. We want the shared memory to be > > deterministically charged to the kubernetes's pod and independent to > > the lifetime of containers under the pod. > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > We’d like to optimize memory usage on the machine by sharing libraries > > and language runtimes of many of the processes running on our machines > > in separate memcgs. This produces a side effect that one job may be > > unlucky to be the first to access many of the libraries and may get > > oom killed as all the cached files get charged to it. > > > > Design: > > My rough proposal to solve this problem is to simply add a > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > directing all the memory of the file system to be ‘remote charged’ to > > cgroup provided by that memcg= option. > > > > Caveats: > > > > 1. One complication to address is the behavior when the target memcg > > hits its memory.max limit because of remote charging. In this case the > > oom-killer will be invoked, but the oom-killer may not find anything > > to kill in the target memcg being charged. Thera are a number of considerations > > in this case: > > > > 1. It's not great to kill the allocating process since the allocating process > > is not running in the memcg under oom, and killing it will not free memory > > in the memcg under oom. > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > somehow. If not, the process will forever loop the pagefault in the upstream > > kernel. > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > to the caller. This will cause will cause the process executing the remote > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > path. This will be documented behavior of remote charging, and this feature is > > opt-in. Users can: > > - Not opt-into the feature if they want. 
> > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > abort if they desire. > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > operation without executing the remote charge if possible. > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > process with write access to this mount point will be able to charge memory to > > <cgroup>. This is largely a non-issue because in configurations where there is > > untrusted code running on the machine, mount point access needs to be > > restricted to the intended users only regardless of whether the mount point > > memory is deterministly charged or not. > > I'm not a fan of this. It uses filesystem mounts to create shareable > resource domains outside of the cgroup hierarchy, which has all the > downsides you listed, and more: > > 1. You need a filesystem interface in the first place, and a new > ad-hoc channel and permission model to coordinate with the cgroup > tree, which isn't great. All filesystems you want to share data on > need to be converted. > > 2. It doesn't extend to non-filesystem sources of shared data, such as > memfds, ipc shm etc. > > 3. It requires unintuitive configuration for what should be basic > shared accounting semantics. Per default you still get the old > 'first touch' semantics, but to get sharing you need to reconfigure > the filesystems? > > 4. If a task needs to work with a hierarchy of data sharing domains - > system-wide, group of jobs, job - it must interact with a hierarchy > of filesystem mounts. This is a pain to setup and may require task > awareness. Moving data around, working with different mount points. > Also, no shared and private data accounting within the same file. > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > entangled, sitting in disjunct domains. OOM killing is one quirk of > that, but there are others you haven't touched on. Who is charged > for the CPU cycles of reclaim in the out-of-band domain? Who is > charged for the paging IO? How is resource pressure accounted and > attributed? Soon you need cpu= and io= as well. > > My take on this is that it might work for your rather specific > usecase, but it doesn't strike me as a general-purpose feature > suitable for upstream. > > > If we want sharing semantics for memory, I think we need a more > generic implementation with a cleaner interface. > > Here is one idea: > > Have you considered reparenting pages that are accessed by multiple > cgroups to the first common ancestor of those groups? > > Essentially, whenever there is a memory access (minor fault, buffered > IO) to a page that doesn't belong to the accessing task's cgroup, you > find the common ancestor between that task and the owning cgroup, and > move the page there. > > With a tree like this: > > root - job group - job > `- job > `- job group - job > `- job > > all pages accessed inside that tree will propagate to the highest > level at which they are shared - which is the same level where you'd > also set shared policies, like a job group memory limit or io weight. > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > pages would bubble to the respective job group, private data would > stay within each job. > > No further user configuration necessary. 
Although you still *can* use > mount namespacing etc. to prohibit undesired sharing between cgroups. > > The actual user-visible accounting change would be quite small, and > arguably much more intuitive. Remember that accounting is recursive, > meaning that a job page today also shows up in the counters of job > group and root. This would not change. The only thing that IS weird > today is that when two jobs share a page, it will arbitrarily show up > in one job's counter but not in the other's. That would change: it > would no longer show up as either, since it's not private to either; > it would just be a job group (and up) page. In general I like the idea, but I think the user-visible change will be quite large, almost "cgroup v3"-large. Here are some problems: 1) Anything shared between e.g. system.slice and user.slice now belongs to the root cgroup and is completely unaccounted/unlimited, e.g. all pagecache belonging to shared libraries. 2) It's concerning in security terms. If I understand the idea correctly, read-only access will allow moving charges to an upper level, potentially crossing memory.max limits. It doesn't sound safe. 3) It brings a non-trivial amount of memory to non-leaf cgroups. To some extent it returns us to the cgroup v1 world and a question of competition between resources consumed by a cgroup directly and through child cgroups. It's not like the problem doesn't exist now, but it's less pronounced. If, say, >50% of system.slice's memory belongs to system.slice directly, then we will likely need separate non-recursive counters, limits, protections, etc. 4) Imagine a production server and a system administrator logging in via ssh (and being put into user.slice) and running a big grep... It screws up all memory accounting until the next reboot. Not a completely impossible scenario. That said, I agree with Johannes and I'm also not a big fan of this patchset. I agree that the problem exists and that the patchset provides a solution, but it doesn't look nice (or generic enough) and creates a lot of questions and corner cases. Btw, won't (an optional) disabling of memcg accounting for a tmpfs solve your problem? It will be less invasive and will not require any oom changes.
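Roman's alternative at the end - optionally disabling memcg accounting for a tmpfs - would conceptually reduce to skipping the charge in the shmem charge path. A hypothetical sketch, with the "noaccount" mount flag made up for illustration and the mem_cgroup_charge() signature only approximate for kernels of that era:

    #include <linux/shmem_fs.h>
    #include <linux/memcontrol.h>

    /*
     * Hypothetical per-mount opt-out of memcg accounting for tmpfs.
     * sbinfo->noaccount is a made-up flag; mem_cgroup_charge() signature
     * is approximate.
     */
    static int shmem_maybe_charge(struct page *page, struct shmem_sb_info *sbinfo,
    				  struct mm_struct *mm, gfp_t gfp)
    {
    	if (sbinfo->noaccount)		/* hypothetical mount flag */
    		return 0;		/* leave the memory unaccounted/unlimited */

    	return mem_cgroup_charge(page, mm, gfp);
    }

As Mina notes below, such unaccounted memory sidesteps the OOM question but is invisible to all limits, which is what raises the DoS concern.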
On Mon, Nov 22, 2021 at 3:09 PM Roman Gushchin <guro@fb.com> wrote: > > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote: > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > > Problem: > > > Currently shared memory is charged to the memcg of the allocating > > > process. This makes memory usage of processes accessing shared memory > > > a bit unpredictable since whichever process accesses the memory first > > > will get charged. We have a number of use cases where our userspace > > > would like deterministic charging of shared memory: > > > > > > 1. System services allocating memory for client jobs: > > > We have services (namely a network access service[1]) that provide > > > functionality for clients running on the machine and allocate memory > > > to carry out these services. The memory usage of these services > > > depends on the number of jobs running on the machine and the nature of > > > the requests made to the service, which makes the memory usage of > > > these services hard to predict and thus hard to limit via memory.max. > > > These system services would like a way to allocate memory and instruct > > > the kernel to charge this memory to the client’s memcg. > > > > > > 2. Shared filesystem between subtasks of a large job > > > Our infrastructure has large meta jobs such as kubernetes which spawn > > > multiple subtasks which share a tmpfs mount. These jobs and its > > > subtasks use that tmpfs mount for various purposes such as data > > > sharing or persistent data between the subtask restarts. In kubernetes > > > terminology, the meta job is similar to pods and subtasks are > > > containers under pods. We want the shared memory to be > > > deterministically charged to the kubernetes's pod and independent to > > > the lifetime of containers under the pod. > > > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > > We’d like to optimize memory usage on the machine by sharing libraries > > > and language runtimes of many of the processes running on our machines > > > in separate memcgs. This produces a side effect that one job may be > > > unlucky to be the first to access many of the libraries and may get > > > oom killed as all the cached files get charged to it. > > > > > > Design: > > > My rough proposal to solve this problem is to simply add a > > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > > directing all the memory of the file system to be ‘remote charged’ to > > > cgroup provided by that memcg= option. > > > > > > Caveats: > > > > > > 1. One complication to address is the behavior when the target memcg > > > hits its memory.max limit because of remote charging. In this case the > > > oom-killer will be invoked, but the oom-killer may not find anything > > > to kill in the target memcg being charged. Thera are a number of considerations > > > in this case: > > > > > > 1. It's not great to kill the allocating process since the allocating process > > > is not running in the memcg under oom, and killing it will not free memory > > > in the memcg under oom. > > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > > somehow. If not, the process will forever loop the pagefault in the upstream > > > kernel. > > > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > > to the caller. 
This will cause will cause the process executing the remote > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > > path. This will be documented behavior of remote charging, and this feature is > > > opt-in. Users can: > > > - Not opt-into the feature if they want. > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > > abort if they desire. > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > > operation without executing the remote charge if possible. > > > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > > process with write access to this mount point will be able to charge memory to > > > <cgroup>. This is largely a non-issue because in configurations where there is > > > untrusted code running on the machine, mount point access needs to be > > > restricted to the intended users only regardless of whether the mount point > > > memory is deterministly charged or not. > > > > I'm not a fan of this. It uses filesystem mounts to create shareable > > resource domains outside of the cgroup hierarchy, which has all the > > downsides you listed, and more: > > > > 1. You need a filesystem interface in the first place, and a new > > ad-hoc channel and permission model to coordinate with the cgroup > > tree, which isn't great. All filesystems you want to share data on > > need to be converted. > > > > 2. It doesn't extend to non-filesystem sources of shared data, such as > > memfds, ipc shm etc. > > > > 3. It requires unintuitive configuration for what should be basic > > shared accounting semantics. Per default you still get the old > > 'first touch' semantics, but to get sharing you need to reconfigure > > the filesystems? > > > > 4. If a task needs to work with a hierarchy of data sharing domains - > > system-wide, group of jobs, job - it must interact with a hierarchy > > of filesystem mounts. This is a pain to setup and may require task > > awareness. Moving data around, working with different mount points. > > Also, no shared and private data accounting within the same file. > > > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > > entangled, sitting in disjunct domains. OOM killing is one quirk of > > that, but there are others you haven't touched on. Who is charged > > for the CPU cycles of reclaim in the out-of-band domain? Who is > > charged for the paging IO? How is resource pressure accounted and > > attributed? Soon you need cpu= and io= as well. > > > > My take on this is that it might work for your rather specific > > usecase, but it doesn't strike me as a general-purpose feature > > suitable for upstream. > > > > > > If we want sharing semantics for memory, I think we need a more > > generic implementation with a cleaner interface. > > > > Here is one idea: > > > > Have you considered reparenting pages that are accessed by multiple > > cgroups to the first common ancestor of those groups? > > > > Essentially, whenever there is a memory access (minor fault, buffered > > IO) to a page that doesn't belong to the accessing task's cgroup, you > > find the common ancestor between that task and the owning cgroup, and > > move the page there. 
> > > > With a tree like this: > > > > root - job group - job > > `- job > > `- job group - job > > `- job > > > > all pages accessed inside that tree will propagate to the highest > > level at which they are shared - which is the same level where you'd > > also set shared policies, like a job group memory limit or io weight. > > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > > pages would bubble to the respective job group, private data would > > stay within each job. > > > > No further user configuration necessary. Although you still *can* use > > mount namespacing etc. to prohibit undesired sharing between cgroups. > > > > The actual user-visible accounting change would be quite small, and > > arguably much more intuitive. Remember that accounting is recursive, > > meaning that a job page today also shows up in the counters of job > > group and root. This would not change. The only thing that IS weird > > today is that when two jobs share a page, it will arbitrarily show up > > in one job's counter but not in the other's. That would change: it > > would no longer show up as either, since it's not private to either; > > it would just be a job group (and up) page. > > In general I like the idea, but I think the user-visible change will be quite > large, almost "cgroup v3"-large. Here are some problems: > 1) Anything shared between e.g. system.slice and user.slice now belongs > to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache > belonging to shared libraries. > 2) It's concerning in security terms. If I understand the idea correctly, a > read-only access will allow to move charges to an upper level, potentially > crossing memory.max limits. It doesn't sound safe. > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent > it returns us to the cgroup v1 world and a question of competition between > resources consumed by a cgroup directly and through children cgroups. Not > like the problem doesn't exist now, but it's less pronounced. > If say >50% of system.slice's memory will belong to system.slice directly, > then we likely will need separate non-recursive counters, limits, protections, > etc. > 4) Imagine a production server and a system administrator entering using ssh > (and being put into user.slice) and running a big grep... It screws up all > memory accounting until a next reboot. Not a completely impossible scenario. > > That said, I agree with Johannes and I'm also not a big fan of this patchset. > > I agree that the problem exist and that the patchset provides a solution, but > it doesn't look nice (and generic enough) and creates a lot of questions and > corner cases. > Thanks as always for your review and I definitely welcome any suggestions for how to solve this. I surmise from your response and Johannes's that we're looking here for a solution that involves no configuration from the sysadmin, where the kernel automatically figures out where is the best place for the shared memory to get charged and there are little to no corner cases to handle. I honestly can't think of one at this moment. I was thinking some opt-in deterministic charging with some configuration from the sysadmin and reasonable edge case handling could make sense. > Btw, won't (an optional) disabling of memcg accounting for a tmpfs solve your > problem? It will be less invasive and will not require any oom changes. I think it will solve use case #1, but I don't see it solving use cases #2 and #3. 
To be completely honest it sounds a bit hacky to me. There were concerns on this patchset that the sysadmin needs to rely on ad-hoc mount write permissions to reliably use the memcg= feature, but disabling tmpfs accounting is in the same boat and seems even more dangerous (as in, mistakenly granting write access to such a tmpfs mount to a bad actor can reliably DoS the entire machine).
On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote: > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote: > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > > Problem: > > > Currently shared memory is charged to the memcg of the allocating > > > process. This makes memory usage of processes accessing shared memory > > > a bit unpredictable since whichever process accesses the memory first > > > will get charged. We have a number of use cases where our userspace > > > would like deterministic charging of shared memory: > > > > > > 1. System services allocating memory for client jobs: > > > We have services (namely a network access service[1]) that provide > > > functionality for clients running on the machine and allocate memory > > > to carry out these services. The memory usage of these services > > > depends on the number of jobs running on the machine and the nature of > > > the requests made to the service, which makes the memory usage of > > > these services hard to predict and thus hard to limit via memory.max. > > > These system services would like a way to allocate memory and instruct > > > the kernel to charge this memory to the client’s memcg. > > > > > > 2. Shared filesystem between subtasks of a large job > > > Our infrastructure has large meta jobs such as kubernetes which spawn > > > multiple subtasks which share a tmpfs mount. These jobs and its > > > subtasks use that tmpfs mount for various purposes such as data > > > sharing or persistent data between the subtask restarts. In kubernetes > > > terminology, the meta job is similar to pods and subtasks are > > > containers under pods. We want the shared memory to be > > > deterministically charged to the kubernetes's pod and independent to > > > the lifetime of containers under the pod. > > > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > > We’d like to optimize memory usage on the machine by sharing libraries > > > and language runtimes of many of the processes running on our machines > > > in separate memcgs. This produces a side effect that one job may be > > > unlucky to be the first to access many of the libraries and may get > > > oom killed as all the cached files get charged to it. > > > > > > Design: > > > My rough proposal to solve this problem is to simply add a > > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > > directing all the memory of the file system to be ‘remote charged’ to > > > cgroup provided by that memcg= option. > > > > > > Caveats: > > > > > > 1. One complication to address is the behavior when the target memcg > > > hits its memory.max limit because of remote charging. In this case the > > > oom-killer will be invoked, but the oom-killer may not find anything > > > to kill in the target memcg being charged. Thera are a number of considerations > > > in this case: > > > > > > 1. It's not great to kill the allocating process since the allocating process > > > is not running in the memcg under oom, and killing it will not free memory > > > in the memcg under oom. > > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > > somehow. If not, the process will forever loop the pagefault in the upstream > > > kernel. > > > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > > to the caller. This will cause will cause the process executing the remote > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > > path. 
This will be documented behavior of remote charging, and this feature is > > > opt-in. Users can: > > > - Not opt-into the feature if they want. > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > > abort if they desire. > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > > operation without executing the remote charge if possible. > > > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > > process with write access to this mount point will be able to charge memory to > > > <cgroup>. This is largely a non-issue because in configurations where there is > > > untrusted code running on the machine, mount point access needs to be > > > restricted to the intended users only regardless of whether the mount point > > > memory is deterministly charged or not. > > > > I'm not a fan of this. It uses filesystem mounts to create shareable > > resource domains outside of the cgroup hierarchy, which has all the > > downsides you listed, and more: > > > > 1. You need a filesystem interface in the first place, and a new > > ad-hoc channel and permission model to coordinate with the cgroup > > tree, which isn't great. All filesystems you want to share data on > > need to be converted. > > > > 2. It doesn't extend to non-filesystem sources of shared data, such as > > memfds, ipc shm etc. > > > > 3. It requires unintuitive configuration for what should be basic > > shared accounting semantics. Per default you still get the old > > 'first touch' semantics, but to get sharing you need to reconfigure > > the filesystems? > > > > 4. If a task needs to work with a hierarchy of data sharing domains - > > system-wide, group of jobs, job - it must interact with a hierarchy > > of filesystem mounts. This is a pain to setup and may require task > > awareness. Moving data around, working with different mount points. > > Also, no shared and private data accounting within the same file. > > > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > > entangled, sitting in disjunct domains. OOM killing is one quirk of > > that, but there are others you haven't touched on. Who is charged > > for the CPU cycles of reclaim in the out-of-band domain? Who is > > charged for the paging IO? How is resource pressure accounted and > > attributed? Soon you need cpu= and io= as well. > > > > My take on this is that it might work for your rather specific > > usecase, but it doesn't strike me as a general-purpose feature > > suitable for upstream. > > > > > > If we want sharing semantics for memory, I think we need a more > > generic implementation with a cleaner interface. > > > > Here is one idea: > > > > Have you considered reparenting pages that are accessed by multiple > > cgroups to the first common ancestor of those groups? > > > > Essentially, whenever there is a memory access (minor fault, buffered > > IO) to a page that doesn't belong to the accessing task's cgroup, you > > find the common ancestor between that task and the owning cgroup, and > > move the page there. 
> > > > With a tree like this: > > > > root - job group - job > > `- job > > `- job group - job > > `- job > > > > all pages accessed inside that tree will propagate to the highest > > level at which they are shared - which is the same level where you'd > > also set shared policies, like a job group memory limit or io weight. > > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > > pages would bubble to the respective job group, private data would > > stay within each job. > > > > No further user configuration necessary. Although you still *can* use > > mount namespacing etc. to prohibit undesired sharing between cgroups. > > > > The actual user-visible accounting change would be quite small, and > > arguably much more intuitive. Remember that accounting is recursive, > > meaning that a job page today also shows up in the counters of job > > group and root. This would not change. The only thing that IS weird > > today is that when two jobs share a page, it will arbitrarily show up > > in one job's counter but not in the other's. That would change: it > > would no longer show up as either, since it's not private to either; > > it would just be a job group (and up) page. These are great questions. > In general I like the idea, but I think the user-visible change will be quite > large, almost "cgroup v3"-large. I wouldn't quite say cgroup3 :-) But it would definitely require a new mount option for cgroupfs. > Here are some problems: > 1) Anything shared between e.g. system.slice and user.slice now belongs > to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache > belonging to shared libraries. Correct, but arguably that's a good thing: Right now, even though the libraries are used by both, they'll be held by one group. This can cause two priority inversions: hipri references don't prevent the shared page from thrashing inside a lowpri group, which could subject the hipri group to reclaim pressure and waiting for slow refaults of the lowpri groups; if the lowpri group is the hotter user of this page, this could sustain. Or the page ends up in the hipri group, and the lowpri group pins it there even when the hipri group is done with it, thus stealing its capacity. Yes, a libc page used by everybody in the system would end up in the root cgroup. But arguably that makes much more sense than having it show up as exclusive memory of system.slice/systemd-udevd.service. And certainly we don't want a universally shared page be subjected to the local resource pressure of one lowpri user of it. Recognizing the shared property and propagating it to the common domain - the level at which priorities are equal between them - would make the accounting clearer and solve both these inversions. > 2) It's concerning in security terms. If I understand the idea correctly, a > read-only access will allow to move charges to an upper level, potentially > crossing memory.max limits. It doesn't sound safe. Hm. The mechanism is slightly different, but escaping memory.max happens today as well: shared memory is already not subject to the memory.max of (n-1)/n cgroups that touch it. So before, you can escape containment to whatever other cgroup is using the page. After, you can escape to the common domain. It's difficult for me to say one is clearly worse than the other. You can conceive of realistic scenarios where both are equally problematic. 
Practically, they appear to require the same solution: if the environment isn't to be trusted, namespacing and limiting access to shared data is necessary to avoid cgroups escaping containment or DoSing other groups. > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent > it returns us to the cgroup v1 world and a question of competition between > resources consumed by a cgroup directly and through children cgroups. Not > like the problem doesn't exist now, but it's less pronounced. > If say >50% of system.slice's memory will belong to system.slice directly, > then we likely will need separate non-recursive counters, limits, protections, > etc. I actually do see numbers like this in practice. Temporary system.slice units allocate cache, then their cgroups get deleted and the cache is reused by the next instances. Quite often, system.slice has much more memory than its subgroups combined. So in a way, we have what I'm proposing if the sharing happens with dead cgroups. Sharing with live cgroups wouldn't necessarily create a bigger demand for new counters than what we have now. I think the cgroup1 issue was slightly different: in cgroup1 we allowed *tasks* to live in non-leaf groups, and so users wanted to control the *private* memory of said tasks with policies that were *different* from the shared policies applied to the leaves. This wouldn't be the same here. Tasks are still only inside leafs, and there is no "private" memory inside a non-leaf group. It's shared among the children, and so subject to policies shared by all children. > 4) Imagine a production server and a system administrator entering using ssh > (and being put into user.slice) and running a big grep... It screws up all > memory accounting until a next reboot. Not a completely impossible scenario. This can also happen with the first-touch model, though. The second you touch private data of some workload, the memory might escape it. It's not as pronounced with a first-touch policy - although proactive reclaim makes this worse. But I'm not sure you can call it a new concern in the proposed model: you already have to be careful with the data you touch and bring into memory from your current cgroup. Again, I think this is where mount namespaces come in. You're not necessarily supposed to see private data of workloads from the outside and access it accidentally. It's common practice to ssh directly into containers to muck with them and their memory, at which point you'll be in the appropriate cgroup and permission context, too. However, I do agree with Mina and you: this is a significant change in behavior, and a cgroupfs mount option would certainly be warranted.
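Both replies agree that this behavior change would need to be opt-in via a cgroupfs mount option. As a rough illustration of how such gating might look, following the pattern of existing cgroup2 mount flags like CGRP_ROOT_MEMORY_RECURSIVE_PROT; the flag name below is made up:

    #include <linux/cgroup.h>

    /*
     * Hypothetical gate for the shared-page reparenting behavior behind a
     * cgroup2 mount option. CGRP_ROOT_MEMORY_SHARE_ANCESTOR is made up;
     * cgrp_dfl_root and its flags field are existing kernel objects.
     */
    static bool memcg_share_reparenting_enabled(void)
    {
    	return cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_SHARE_ANCESTOR;
    }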
On Tue, Nov 23, 2021 at 12:21 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote: > > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote: > > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > > > Problem: > > > > Currently shared memory is charged to the memcg of the allocating > > > > process. This makes memory usage of processes accessing shared memory > > > > a bit unpredictable since whichever process accesses the memory first > > > > will get charged. We have a number of use cases where our userspace > > > > would like deterministic charging of shared memory: > > > > > > > > 1. System services allocating memory for client jobs: > > > > We have services (namely a network access service[1]) that provide > > > > functionality for clients running on the machine and allocate memory > > > > to carry out these services. The memory usage of these services > > > > depends on the number of jobs running on the machine and the nature of > > > > the requests made to the service, which makes the memory usage of > > > > these services hard to predict and thus hard to limit via memory.max. > > > > These system services would like a way to allocate memory and instruct > > > > the kernel to charge this memory to the client’s memcg. > > > > > > > > 2. Shared filesystem between subtasks of a large job > > > > Our infrastructure has large meta jobs such as kubernetes which spawn > > > > multiple subtasks which share a tmpfs mount. These jobs and its > > > > subtasks use that tmpfs mount for various purposes such as data > > > > sharing or persistent data between the subtask restarts. In kubernetes > > > > terminology, the meta job is similar to pods and subtasks are > > > > containers under pods. We want the shared memory to be > > > > deterministically charged to the kubernetes's pod and independent to > > > > the lifetime of containers under the pod. > > > > > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > > > We’d like to optimize memory usage on the machine by sharing libraries > > > > and language runtimes of many of the processes running on our machines > > > > in separate memcgs. This produces a side effect that one job may be > > > > unlucky to be the first to access many of the libraries and may get > > > > oom killed as all the cached files get charged to it. > > > > > > > > Design: > > > > My rough proposal to solve this problem is to simply add a > > > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > > > directing all the memory of the file system to be ‘remote charged’ to > > > > cgroup provided by that memcg= option. > > > > > > > > Caveats: > > > > > > > > 1. One complication to address is the behavior when the target memcg > > > > hits its memory.max limit because of remote charging. In this case the > > > > oom-killer will be invoked, but the oom-killer may not find anything > > > > to kill in the target memcg being charged. Thera are a number of considerations > > > > in this case: > > > > > > > > 1. It's not great to kill the allocating process since the allocating process > > > > is not running in the memcg under oom, and killing it will not free memory > > > > in the memcg under oom. > > > > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault > > > > somehow. If not, the process will forever loop the pagefault in the upstream > > > > kernel. 
> > > > > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > > > to the caller. This will cause will cause the process executing the remote > > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > > > path. This will be documented behavior of remote charging, and this feature is > > > > opt-in. Users can: > > > > - Not opt-into the feature if they want. > > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > > > abort if they desire. > > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > > > operation without executing the remote charge if possible. > > > > > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > > > process with write access to this mount point will be able to charge memory to > > > > <cgroup>. This is largely a non-issue because in configurations where there is > > > > untrusted code running on the machine, mount point access needs to be > > > > restricted to the intended users only regardless of whether the mount point > > > > memory is deterministly charged or not. > > > > > > I'm not a fan of this. It uses filesystem mounts to create shareable > > > resource domains outside of the cgroup hierarchy, which has all the > > > downsides you listed, and more: > > > > > > 1. You need a filesystem interface in the first place, and a new > > > ad-hoc channel and permission model to coordinate with the cgroup > > > tree, which isn't great. All filesystems you want to share data on > > > need to be converted. > > > > > > 2. It doesn't extend to non-filesystem sources of shared data, such as > > > memfds, ipc shm etc. > > > > > > 3. It requires unintuitive configuration for what should be basic > > > shared accounting semantics. Per default you still get the old > > > 'first touch' semantics, but to get sharing you need to reconfigure > > > the filesystems? > > > > > > 4. If a task needs to work with a hierarchy of data sharing domains - > > > system-wide, group of jobs, job - it must interact with a hierarchy > > > of filesystem mounts. This is a pain to setup and may require task > > > awareness. Moving data around, working with different mount points. > > > Also, no shared and private data accounting within the same file. > > > > > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > > > entangled, sitting in disjunct domains. OOM killing is one quirk of > > > that, but there are others you haven't touched on. Who is charged > > > for the CPU cycles of reclaim in the out-of-band domain? Who is > > > charged for the paging IO? How is resource pressure accounted and > > > attributed? Soon you need cpu= and io= as well. > > > > > > My take on this is that it might work for your rather specific > > > usecase, but it doesn't strike me as a general-purpose feature > > > suitable for upstream. > > > > > > > > > If we want sharing semantics for memory, I think we need a more > > > generic implementation with a cleaner interface. > > > > > > Here is one idea: > > > > > > Have you considered reparenting pages that are accessed by multiple > > > cgroups to the first common ancestor of those groups? 
> > > > > > Essentially, whenever there is a memory access (minor fault, buffered > > > IO) to a page that doesn't belong to the accessing task's cgroup, you > > > find the common ancestor between that task and the owning cgroup, and > > > move the page there. > > > > > > With a tree like this: > > > > > > root - job group - job > > > `- job > > > `- job group - job > > > `- job > > > > > > all pages accessed inside that tree will propagate to the highest > > > level at which they are shared - which is the same level where you'd > > > also set shared policies, like a job group memory limit or io weight. > > > > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > > > pages would bubble to the respective job group, private data would > > > stay within each job. > > > > > > No further user configuration necessary. Although you still *can* use > > > mount namespacing etc. to prohibit undesired sharing between cgroups. > > > > > > The actual user-visible accounting change would be quite small, and > > > arguably much more intuitive. Remember that accounting is recursive, > > > meaning that a job page today also shows up in the counters of job > > > group and root. This would not change. The only thing that IS weird > > > today is that when two jobs share a page, it will arbitrarily show up > > > in one job's counter but not in the other's. That would change: it > > > would no longer show up as either, since it's not private to either; > > > it would just be a job group (and up) page. > > These are great questions. > > > In general I like the idea, but I think the user-visible change will be quite > > large, almost "cgroup v3"-large. > > I wouldn't quite say cgroup3 :-) But it would definitely require a new > mount option for cgroupfs. > > > Here are some problems: > > 1) Anything shared between e.g. system.slice and user.slice now belongs > > to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache > > belonging to shared libraries. > > Correct, but arguably that's a good thing: > > Right now, even though the libraries are used by both, they'll be held > by one group. This can cause two priority inversions: hipri references > don't prevent the shared page from thrashing inside a lowpri group, > which could subject the hipri group to reclaim pressure and waiting > for slow refaults of the lowpri groups; if the lowpri group is the > hotter user of this page, this could sustain. Or the page ends up in > the hipri group, and the lowpri group pins it there even when the > hipri group is done with it, thus stealing its capacity. > > Yes, a libc page used by everybody in the system would end up in the > root cgroup. But arguably that makes much more sense than having it > show up as exclusive memory of system.slice/systemd-udevd.service. > And certainly we don't want a universally shared page be subjected to > the local resource pressure of one lowpri user of it. > > Recognizing the shared property and propagating it to the common > domain - the level at which priorities are equal between them - would > make the accounting clearer and solve both these inversions. > > > 2) It's concerning in security terms. If I understand the idea correctly, a > > read-only access will allow to move charges to an upper level, potentially > > crossing memory.max limits. It doesn't sound safe. > > Hm. The mechanism is slightly different, but escaping memory.max > happens today as well: shared memory is already not subject to the > memory.max of (n-1)/n cgroups that touch it. 
> > So before, you can escape containment to whatever other cgroup is > using the page. After, you can escape to the common domain. It's > difficult for me to say one is clearly worse than the other. You can > conceive of realistic scenarios where both are equally problematic. > > Practically, they appear to require the same solution: if the > environment isn't to be trusted, namespacing and limiting access to > shared data is necessary to avoid cgroups escaping containment or > DoSing other groups. > > > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent > > it returns us to the cgroup v1 world and a question of competition between > > resources consumed by a cgroup directly and through children cgroups. Not > > like the problem doesn't exist now, but it's less pronounced. > > If say >50% of system.slice's memory will belong to system.slice directly, > > then we likely will need separate non-recursive counters, limits, protections, > > etc. > > I actually do see numbers like this in practice. Temporary > system.slice units allocate cache, then their cgroups get deleted and > the cache is reused by the next instances. Quite often, system.slice > has much more memory than its subgroups combined. > > So in a way, we have what I'm proposing if the sharing happens with > dead cgroups. Sharing with live cgroups wouldn't necessarily create a > bigger demand for new counters than what we have now. > > I think the cgroup1 issue was slightly different: in cgroup1 we > allowed *tasks* to live in non-leaf groups, and so users wanted to > control the *private* memory of said tasks with policies that were > *different* from the shared policies applied to the leaves. > > This wouldn't be the same here. Tasks are still only inside leafs, and > there is no "private" memory inside a non-leaf group. It's shared > among the children, and so subject to policies shared by all children. > > > 4) Imagine a production server and a system administrator entering using ssh > > (and being put into user.slice) and running a big grep... It screws up all > > memory accounting until a next reboot. Not a completely impossible scenario. > > This can also happen with the first-touch model, though. The second > you touch private data of some workload, the memory might escape it. > > It's not as pronounced with a first-touch policy - although proactive > reclaim makes this worse. But I'm not sure you can call it a new > concern in the proposed model: you already have to be careful with the > data you touch and bring into memory from your current cgroup. > > Again, I think this is where mount namespaces come in. You're not > necessarily supposed to see private data of workloads from the outside > and access it accidentally. It's common practice to ssh directly into > containers to muck with them and their memory, at which point you'll > be in the appropriate cgroup and permission context, too. > > However, I do agree with Mina and you: this is a significant change in > behavior, and a cgroupfs mount option would certainly be warranted. I don't mean to be a nag here but I have trouble seeing pages being re-accounted on minor faults working for us, and that might be fine, but I'm expecting if it doesn't really work for us it likely won't work for the next person trying to use this. The issue is that the fact that the memory is initially accounted to the allocating process forces the sysadmin to overprovision the cgroup limit anyway so that the tasks don't oom if tasks are pre-allocating memory. 
The memory usage of a task accessing shared memory stays very unpredictable, because it has to wait for another task in another cgroup to touch the shared memory before that memory is re-accounted away from its cgroup. I have a couple of (admittedly probably controversial) suggestions: 1. A memcg flag, say memory.charge_for_shared_memory. When we allocate shared memory, we charge it to the first ancestor memcg that has memory.charge_for_shared_memory==true. 2. On the creation of shared memory, we somehow declare that this memory belongs to <cgroup>. Only descendants of <cgroup> are able to touch the shared memory, and the shared memory is charged to <cgroup>.
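As an illustration of suggestion (1) above, and only that, here is a userspace sketch of the implied ancestor walk: starting at the task's own cgroup, climb the hierarchy until a cgroup with the proposed memory.charge_for_shared_memory knob set to 1 is found, and treat it as the charge target. The knob does not exist and the starting path is made up; in the kernel this would be a walk over mem_cgroup parents rather than cgroupfs directories.

#include <stdio.h>
#include <string.h>

/* Returns 1 if <cg>/memory.charge_for_shared_memory reads as "1".
 * The knob is hypothetical, so on a real system this returns 0. */
static int charges_shared(const char *cg)
{
	char path[4096];
	char buf[8] = "0";
	FILE *f;

	snprintf(path, sizeof(path),
		 "%s/memory.charge_for_shared_memory", cg);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '0';
	fclose(f);
	return buf[0] == '1';
}

/* Trim path components until an ancestor with the flag is found,
 * falling back to the cgroup root. */
static void find_charge_target(char *cg)
{
	while (!charges_shared(cg)) {
		char *slash = strrchr(cg, '/');

		if (!slash || !strcmp(cg, "/sys/fs/cgroup"))
			break;
		*slash = '\0';
	}
	printf("charge shared memory to: %s\n", cg);
}

int main(void)
{
	char cg[4096] = "/sys/fs/cgroup/job-group/job";

	find_charge_target(cg);
	return 0;
}

Since the knob does not exist, running this simply walks up to /sys/fs/cgroup; the interesting questions the suggestion raises are who gets to set such a flag and how it interacts with memory.max on the flagged ancestor.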
On Tue, Nov 23, 2021 at 01:19:47PM -0800, Mina Almasry wrote: > On Tue, Nov 23, 2021 at 12:21 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Mon, Nov 22, 2021 at 03:09:26PM -0800, Roman Gushchin wrote: > > > On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote: > > > > On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote: > > > > > Problem: > > > > > Currently shared memory is charged to the memcg of the allocating > > > > > process. This makes memory usage of processes accessing shared memory > > > > > a bit unpredictable since whichever process accesses the memory first > > > > > will get charged. We have a number of use cases where our userspace > > > > > would like deterministic charging of shared memory: > > > > > > > > > > 1. System services allocating memory for client jobs: > > > > > We have services (namely a network access service[1]) that provide > > > > > functionality for clients running on the machine and allocate memory > > > > > to carry out these services. The memory usage of these services > > > > > depends on the number of jobs running on the machine and the nature of > > > > > the requests made to the service, which makes the memory usage of > > > > > these services hard to predict and thus hard to limit via memory.max. > > > > > These system services would like a way to allocate memory and instruct > > > > > the kernel to charge this memory to the client’s memcg. > > > > > > > > > > 2. Shared filesystem between subtasks of a large job > > > > > Our infrastructure has large meta jobs such as kubernetes which spawn > > > > > multiple subtasks which share a tmpfs mount. These jobs and its > > > > > subtasks use that tmpfs mount for various purposes such as data > > > > > sharing or persistent data between the subtask restarts. In kubernetes > > > > > terminology, the meta job is similar to pods and subtasks are > > > > > containers under pods. We want the shared memory to be > > > > > deterministically charged to the kubernetes's pod and independent to > > > > > the lifetime of containers under the pod. > > > > > > > > > > 3. Shared libraries and language runtimes shared between independent jobs. > > > > > We’d like to optimize memory usage on the machine by sharing libraries > > > > > and language runtimes of many of the processes running on our machines > > > > > in separate memcgs. This produces a side effect that one job may be > > > > > unlucky to be the first to access many of the libraries and may get > > > > > oom killed as all the cached files get charged to it. > > > > > > > > > > Design: > > > > > My rough proposal to solve this problem is to simply add a > > > > > ‘memcg=/path/to/memcg’ mount option for filesystems: > > > > > directing all the memory of the file system to be ‘remote charged’ to > > > > > cgroup provided by that memcg= option. > > > > > > > > > > Caveats: > > > > > > > > > > 1. One complication to address is the behavior when the target memcg > > > > > hits its memory.max limit because of remote charging. In this case the > > > > > oom-killer will be invoked, but the oom-killer may not find anything > > > > > to kill in the target memcg being charged. Thera are a number of considerations > > > > > in this case: > > > > > > > > > > 1. It's not great to kill the allocating process since the allocating process > > > > > is not running in the memcg under oom, and killing it will not free memory > > > > > in the memcg under oom. > > > > > 2. 
Pagefaults may hit the memcg limit, and we need to handle the pagefault > > > > > somehow. If not, the process will forever loop the pagefault in the upstream > > > > > kernel. > > > > > > > > > > In this case, I propose simply failing the remote charge and returning an ENOSPC > > > > > to the caller. This will cause will cause the process executing the remote > > > > > charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault > > > > > path. This will be documented behavior of remote charging, and this feature is > > > > > opt-in. Users can: > > > > > - Not opt-into the feature if they want. > > > > > - Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and > > > > > abort if they desire. > > > > > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their > > > > > operation without executing the remote charge if possible. > > > > > > > > > > 2. Only processes allowed the enter cgroup at mount time can mount a > > > > > tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups > > > > > on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any > > > > > process with write access to this mount point will be able to charge memory to > > > > > <cgroup>. This is largely a non-issue because in configurations where there is > > > > > untrusted code running on the machine, mount point access needs to be > > > > > restricted to the intended users only regardless of whether the mount point > > > > > memory is deterministly charged or not. > > > > > > > > I'm not a fan of this. It uses filesystem mounts to create shareable > > > > resource domains outside of the cgroup hierarchy, which has all the > > > > downsides you listed, and more: > > > > > > > > 1. You need a filesystem interface in the first place, and a new > > > > ad-hoc channel and permission model to coordinate with the cgroup > > > > tree, which isn't great. All filesystems you want to share data on > > > > need to be converted. > > > > > > > > 2. It doesn't extend to non-filesystem sources of shared data, such as > > > > memfds, ipc shm etc. > > > > > > > > 3. It requires unintuitive configuration for what should be basic > > > > shared accounting semantics. Per default you still get the old > > > > 'first touch' semantics, but to get sharing you need to reconfigure > > > > the filesystems? > > > > > > > > 4. If a task needs to work with a hierarchy of data sharing domains - > > > > system-wide, group of jobs, job - it must interact with a hierarchy > > > > of filesystem mounts. This is a pain to setup and may require task > > > > awareness. Moving data around, working with different mount points. > > > > Also, no shared and private data accounting within the same file. > > > > > > > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > > > > entangled, sitting in disjunct domains. OOM killing is one quirk of > > > > that, but there are others you haven't touched on. Who is charged > > > > for the CPU cycles of reclaim in the out-of-band domain? Who is > > > > charged for the paging IO? How is resource pressure accounted and > > > > attributed? Soon you need cpu= and io= as well. > > > > > > > > My take on this is that it might work for your rather specific > > > > usecase, but it doesn't strike me as a general-purpose feature > > > > suitable for upstream. > > > > > > > > > > > > If we want sharing semantics for memory, I think we need a more > > > > generic implementation with a cleaner interface. 
> > > > > > > > Here is one idea: > > > > > > > > Have you considered reparenting pages that are accessed by multiple > > > > cgroups to the first common ancestor of those groups? > > > > > > > > Essentially, whenever there is a memory access (minor fault, buffered > > > > IO) to a page that doesn't belong to the accessing task's cgroup, you > > > > find the common ancestor between that task and the owning cgroup, and > > > > move the page there. > > > > > > > > With a tree like this: > > > > > > > > root - job group - job > > > > `- job > > > > `- job group - job > > > > `- job > > > > > > > > all pages accessed inside that tree will propagate to the highest > > > > level at which they are shared - which is the same level where you'd > > > > also set shared policies, like a job group memory limit or io weight. > > > > > > > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > > > > pages would bubble to the respective job group, private data would > > > > stay within each job. > > > > > > > > No further user configuration necessary. Although you still *can* use > > > > mount namespacing etc. to prohibit undesired sharing between cgroups. > > > > > > > > The actual user-visible accounting change would be quite small, and > > > > arguably much more intuitive. Remember that accounting is recursive, > > > > meaning that a job page today also shows up in the counters of job > > > > group and root. This would not change. The only thing that IS weird > > > > today is that when two jobs share a page, it will arbitrarily show up > > > > in one job's counter but not in the other's. That would change: it > > > > would no longer show up as either, since it's not private to either; > > > > it would just be a job group (and up) page. > > > > These are great questions. > > > > > In general I like the idea, but I think the user-visible change will be quite > > > large, almost "cgroup v3"-large. > > > > I wouldn't quite say cgroup3 :-) But it would definitely require a new > > mount option for cgroupfs. > > > > > Here are some problems: > > > 1) Anything shared between e.g. system.slice and user.slice now belongs > > > to the root cgroup and is completely unaccounted/unlimited. E.g. all pagecache > > > belonging to shared libraries. > > > > Correct, but arguably that's a good thing: > > > > Right now, even though the libraries are used by both, they'll be held > > by one group. This can cause two priority inversions: hipri references > > don't prevent the shared page from thrashing inside a lowpri group, > > which could subject the hipri group to reclaim pressure and waiting > > for slow refaults of the lowpri groups; if the lowpri group is the > > hotter user of this page, this could sustain. Or the page ends up in > > the hipri group, and the lowpri group pins it there even when the > > hipri group is done with it, thus stealing its capacity. > > > > Yes, a libc page used by everybody in the system would end up in the > > root cgroup. But arguably that makes much more sense than having it > > show up as exclusive memory of system.slice/systemd-udevd.service. > > And certainly we don't want a universally shared page be subjected to > > the local resource pressure of one lowpri user of it. > > > > Recognizing the shared property and propagating it to the common > > domain - the level at which priorities are equal between them - would > > make the accounting clearer and solve both these inversions. > > > > > 2) It's concerning in security terms. 
If I understand the idea correctly, a > > > read-only access will allow to move charges to an upper level, potentially > > > crossing memory.max limits. It doesn't sound safe. > > > > Hm. The mechanism is slightly different, but escaping memory.max > > happens today as well: shared memory is already not subject to the > > memory.max of (n-1)/n cgroups that touch it. > > > > So before, you can escape containment to whatever other cgroup is > > using the page. After, you can escape to the common domain. It's > > difficult for me to say one is clearly worse than the other. You can > > conceive of realistic scenarios where both are equally problematic. > > > > Practically, they appear to require the same solution: if the > > environment isn't to be trusted, namespacing and limiting access to > > shared data is necessary to avoid cgroups escaping containment or > > DoSing other groups. > > > > > 3) It brings a non-trivial amount of memory to non-leave cgroups. To some extent > > > it returns us to the cgroup v1 world and a question of competition between > > > resources consumed by a cgroup directly and through children cgroups. Not > > > like the problem doesn't exist now, but it's less pronounced. > > > If say >50% of system.slice's memory will belong to system.slice directly, > > > then we likely will need separate non-recursive counters, limits, protections, > > > etc. > > > > I actually do see numbers like this in practice. Temporary > > system.slice units allocate cache, then their cgroups get deleted and > > the cache is reused by the next instances. Quite often, system.slice > > has much more memory than its subgroups combined. > > > > So in a way, we have what I'm proposing if the sharing happens with > > dead cgroups. Sharing with live cgroups wouldn't necessarily create a > > bigger demand for new counters than what we have now. > > > > I think the cgroup1 issue was slightly different: in cgroup1 we > > allowed *tasks* to live in non-leaf groups, and so users wanted to > > control the *private* memory of said tasks with policies that were > > *different* from the shared policies applied to the leaves. > > > > This wouldn't be the same here. Tasks are still only inside leafs, and > > there is no "private" memory inside a non-leaf group. It's shared > > among the children, and so subject to policies shared by all children. > > > > > 4) Imagine a production server and a system administrator entering using ssh > > > (and being put into user.slice) and running a big grep... It screws up all > > > memory accounting until a next reboot. Not a completely impossible scenario. > > > > This can also happen with the first-touch model, though. The second > > you touch private data of some workload, the memory might escape it. > > > > It's not as pronounced with a first-touch policy - although proactive > > reclaim makes this worse. But I'm not sure you can call it a new > > concern in the proposed model: you already have to be careful with the > > data you touch and bring into memory from your current cgroup. > > > > Again, I think this is where mount namespaces come in. You're not > > necessarily supposed to see private data of workloads from the outside > > and access it accidentally. It's common practice to ssh directly into > > containers to muck with them and their memory, at which point you'll > > be in the appropriate cgroup and permission context, too. 
> > > > However, I do agree with Mina and you: this is a significant change in > > behavior, and a cgroupfs mount option would certainly be warranted. > > I don't mean to be a nag here but I have trouble seeing pages being > re-accounted on minor faults working for us, and that might be fine, > but I'm expecting if it doesn't really work for us it likely won't > work for the next person trying to use this. Yes, I agree, the performance impact might be non-trivial. I think we discussed something similar in the past in the context of re-charging pages belonging to a deleted cgroup. And the consensus was that we'd need to add hooks into many places to check whether a page belongs to a dying (or other-than-current) cgroup and it might be not cheap. > > The issue is that the fact that the memory is initially accounted to > the allocating process forces the sysadmin to overprovision the cgroup > limit anyway so that the tasks don't oom if tasks are pre-allocating > memory. The memory usage of a task accessing shared memory stays very > unpredictable because it's waiting on another task in another cgroup > to touch the shared memory for the shared memory to be unaccounted to > its cgroup. > > I have a couple of (admittingly probably controversial) suggestions: > 1. memcg flag, say memory.charge_for_shared_memory. When we allocate > shared memory, we charge it to the first ancestor memcg that has > memory.charge_for_shared_memory==true. I think the problem here is that we try really hard to avoid any per-memory-type knobs, and this is another one. > 2. Somehow on the creation of shared memory, we somehow declare that > this memory belongs to <cgroup>. Only descendants of <cgroup> are able > to touch the shared memory and the shared memory is charged to > <cgroup>. This sounds like a mount namespace. Thanks!
On Mon 22-11-21 14:04:04, Johannes Weiner wrote: [...] > I'm not a fan of this. It uses filesystem mounts to create shareable > resource domains outside of the cgroup hierarchy, which has all the > downsides you listed, and more: > > 1. You need a filesystem interface in the first place, and a new > ad-hoc channel and permission model to coordinate with the cgroup > tree, which isn't great. All filesystems you want to share data on > need to be converted. > > 2. It doesn't extend to non-filesystem sources of shared data, such as > memfds, ipc shm etc. > > 3. It requires unintuitive configuration for what should be basic > shared accounting semantics. Per default you still get the old > 'first touch' semantics, but to get sharing you need to reconfigure > the filesystems? > > 4. If a task needs to work with a hierarchy of data sharing domains - > system-wide, group of jobs, job - it must interact with a hierarchy > of filesystem mounts. This is a pain to setup and may require task > awareness. Moving data around, working with different mount points. > Also, no shared and private data accounting within the same file. > > 5. It reintroduces cgroup1 semantics of tasks and resouces, which are > entangled, sitting in disjunct domains. OOM killing is one quirk of > that, but there are others you haven't touched on. Who is charged > for the CPU cycles of reclaim in the out-of-band domain? Who is > charged for the paging IO? How is resource pressure accounted and > attributed? Soon you need cpu= and io= as well. > > My take on this is that it might work for your rather specific > usecase, but it doesn't strike me as a general-purpose feature > suitable for upstream. I just want to reiterate that this resonates with my concerns expressed earlier and thanks for expressing them in a much better structured and comprehensive way, Johannes. [btw. a non-technical comment. For features like this it is better to not rush into newer versions posting until there is at least some agreement for the feature. Otherwise we have fragments of the discussion spread over several email threads] > If we want sharing semantics for memory, I think we need a more > generic implementation with a cleaner interface. > > Here is one idea: > > Have you considered reparenting pages that are accessed by multiple > cgroups to the first common ancestor of those groups? > > Essentially, whenever there is a memory access (minor fault, buffered > IO) to a page that doesn't belong to the accessing task's cgroup, you > find the common ancestor between that task and the owning cgroup, and > move the page there. > > With a tree like this: > > root - job group - job > `- job > `- job group - job > `- job > > all pages accessed inside that tree will propagate to the highest > level at which they are shared - which is the same level where you'd > also set shared policies, like a job group memory limit or io weight. > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > pages would bubble to the respective job group, private data would > stay within each job. > > No further user configuration necessary. Although you still *can* use > mount namespacing etc. to prohibit undesired sharing between cgroups. > > The actual user-visible accounting change would be quite small, and > arguably much more intuitive. Remember that accounting is recursive, > meaning that a job page today also shows up in the counters of job > group and root. This would not change. 
The only thing that IS weird > today is that when two jobs share a page, it will arbitrarily show up > in one job's counter but not in the other's. That would change: it > would no longer show up as either, since it's not private to either; > it would just be a job group (and up) page. > > This would be a generic implementation of resource sharing semantics: > independent of data source and filesystems, contained inside the > cgroup interface, and reusing the existing hierarchies of accounting > and control domains to also represent levels of common property. > > Thoughts? This is an interesting concept. I am not sure how expensive and intrusive (code-wise) this would get, but that is more of an implementation detail. Another option would be to provide a syscall to claim a shared resource. This would require the cooperation of the application, but it would establish a clear responsibility model.
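Purely as a sketch of the "claim a shared resource" idea above: no such syscall exists, and claim_shared_memory() below is an invented stub that only documents the intended arguments (a mapping plus a target cgroup fd, with a made-up cgroup path), so the example compiles and runs without doing any actual charging.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Invented placeholder for a would-be kernel interface: charge
 * [addr, addr + len) to the cgroup identified by cgroup_fd. */
static int claim_shared_memory(void *addr, size_t len, int cgroup_fd)
{
	(void)addr;
	(void)len;
	(void)cgroup_fd;
	return 0;
}

int main(void)
{
	size_t len = 1 << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	int cgfd;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* The cooperating application names the cgroup that should pay. */
	cgfd = open("/sys/fs/cgroup/job-group", O_RDONLY | O_DIRECTORY);
	if (cgfd < 0) {
		perror("open cgroup");
		munmap(buf, len);
		return 1;
	}
	if (claim_shared_memory(buf, len, cgfd))
		perror("claim_shared_memory");
	close(cgfd);
	munmap(buf, len);
	return 0;
}

Whether such an interface should take a mapping, a file descriptor, or a cgroup path, and how it would handle the oom corner cases discussed earlier in the thread, is exactly what would need to be worked out.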
Hi Johannes, On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > [...] > Here is one idea: > > Have you considered reparenting pages that are accessed by multiple > cgroups to the first common ancestor of those groups? > > Essentially, whenever there is a memory access (minor fault, buffered > IO) to a page that doesn't belong to the accessing task's cgroup, you > find the common ancestor between that task and the owning cgroup, and > move the page there. > > With a tree like this: > > root - job group - job > `- job > `- job group - job > `- job > > all pages accessed inside that tree will propagate to the highest > level at which they are shared - which is the same level where you'd > also set shared policies, like a job group memory limit or io weight. > > E.g. libc pages would (likely) bubble to the root, persistent tmpfs > pages would bubble to the respective job group, private data would > stay within each job. > > No further user configuration necessary. Although you still *can* use > mount namespacing etc. to prohibit undesired sharing between cgroups. > > The actual user-visible accounting change would be quite small, and > arguably much more intuitive. Remember that accounting is recursive, > meaning that a job page today also shows up in the counters of job > group and root. This would not change. The only thing that IS weird > today is that when two jobs share a page, it will arbitrarily show up > in one job's counter but not in the other's. That would change: it > would no longer show up as either, since it's not private to either; > it would just be a job group (and up) page. > > This would be a generic implementation of resource sharing semantics: > independent of data source and filesystems, contained inside the > cgroup interface, and reusing the existing hierarchies of accounting > and control domains to also represent levels of common property. > > Thoughts? Before commenting on your proposal, I would like to clarify that the use-cases given are not specific to us but are more general. Though I think you are arguing that the implementation is not general purpose, which I kind of agree with. Let me take a stab again at describing these use-cases, which I think can be partitioned based on the relationship between the entities sharing/accessing the memory. (Sorry for repeating these because I think we should keep them in mind while discussing the possible solutions). 1) Mutually trusted entities sharing memory for collaborative work. One example is a file-system shared between sub-tasks of a meta-job. (Mina's second use-case). 2) Independent entities sharing memory to reduce cost. Examples include shared libraries, packages or tool chains. (Mina's third use-case). 3) One entity observing or monitoring another entity. Examples include gdb, ptrace, uprobes, VM or process migration and checkpointing. 4) Server-Client relationship. (Mina's first use-case). Let me put (3) out of the way first, as these operations have special interfaces and the target entity is a process (not a cgroup). Remote charging works for these and no new oom corner cases are introduced. For (1) and (2), I think your proposal aligns pretty well with them, but one important property that we are very adamant about is still missing, i.e. 'deterministic charging'. To explain with an example, suppose two instances of the same job are running on two different systems.
On one system, the first instance shares a library with an unrelated job, while on the other the second instance uses that library alone. The owner will see different memory usage for the two instances, which can mess with their resource planning. However, I think this can be solved very easily with an opt-in add-on. The node controller knows upfront which libraries/packages can be shared between the jobs, and is responsible for creating the cgroup hierarchy (at least the top level) for the jobs. It can create a common ancestor for all such jobs and tell the kernel that if any descendant accesses these libraries, the memory should be charged to this specific ancestor. If someone outside this sub-hierarchy accesses the memory, follow the proposal, i.e. charge the common ancestor. With this specific opt-in add-on, all job owners will see more consistent usage for their jobs. [I am putting this out as a brainstorming discussion.] Regarding (4), for our use-case, the server wants the cost of the memory needed to serve a client to be paid by the corresponding client. Please note that the memory is not necessarily accessed by the client. Now we can argue that this use-case can be served similarly to (3), i.e. through a special interface/syscall. I think that would be challenging, particularly because the lifetime of a client 'process' is independent of the memory needed to serve that client. Another way is to disable the accounting of the specific memory needed to serve the clients (I think Roman suggested a similar notion of disabling accounting for a tmpfs). Any other ideas? thanks, Shakeel
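A small illustration of the charge-target rule described above, under the assumption that a "designated ancestor" has been configured for the shared libraries: accesses from inside that subtree are charged to the designated ancestor, and accesses from outside fall back to the first common ancestor, as in the reparenting proposal. Cgroups are modelled as plain path strings purely for demonstration; none of this is an existing interface.

#include <stdio.h>
#include <string.h>

/* Longest common path prefix of a and b that ends on a component
 * boundary, i.e. the first common ancestor cgroup. */
static void common_ancestor(const char *a, const char *b,
			    char *out, size_t n)
{
	size_t i = 0, last = 0;

	while (a[i] && a[i] == b[i]) {
		if (a[i] == '/')
			last = i;
		i++;
	}
	if (!a[i] && (!b[i] || b[i] == '/'))
		last = i;	/* a itself is an ancestor of b (or equal) */
	else if (!b[i] && a[i] == '/')
		last = i;	/* b itself is an ancestor of a */
	snprintf(out, n, "%.*s", (int)(last ? last : 1), a);
}

/* Designated ancestor for its descendants, common ancestor otherwise. */
static void charge_target(const char *designated, const char *accessor,
			  char *out, size_t n)
{
	size_t dlen = strlen(designated);

	if (!strncmp(accessor, designated, dlen) &&
	    (accessor[dlen] == '/' || accessor[dlen] == '\0'))
		snprintf(out, n, "%s", designated);
	else
		common_ancestor(designated, accessor, out, n);
}

int main(void)
{
	char out[256];

	charge_target("/shared-libs", "/shared-libs/jobs/job1",
		      out, sizeof(out));
	printf("inside subtree  -> charge %s\n", out);
	charge_target("/shared-libs", "/system.slice/foo",
		      out, sizeof(out));
	printf("outside subtree -> charge %s\n", out);
	return 0;
}

With the two example calls, a job inside the /shared-libs subtree is charged to /shared-libs, while an accessor under /system.slice falls back to the root, which is the behavior the add-on is meant to guarantee for the participating jobs.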