Message ID: 20200428161355.6377-1-schatzberg.dan@gmail.com (mailing list archive)
Series: Charge loop device i/o to issuing cgroup
On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> The loop device runs all i/o to the backing file on a separate kworker
> thread which results in all i/o being charged to the root cgroup. This
> allows a loop device to be used to trivially bypass resource limits
> and other policy. This patch series fixes this gap in accounting.

How is this specific to the loop device? Isn't every block device
that offloads work to a kthread or single worker thread susceptible
to the same "exploit"?

Or is the problem simply that the loop worker thread is simply not
taking the IO's associated cgroup and submitting the IO with that
cgroup associated with it? That seems kinda simple to fix....

> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device.

And that's where all the complexity and serialisation comes from,
right?

So, again: how is this unique to the loop device? Other block
devices also offload IO to kthreads to do blocking work and IO
submission to lower layers. Hence this seems to me like a generic
"block device does IO submission from different task" issue that
should be handled by generic infrastructure and not need to be
reimplemented multiple times in every block device driver that
offloads work to other threads...

> This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
>
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.

NACK!

The IO that is disallowed should fail with ENOMEM or some similar
error, not trigger an OOM kill that shoots some innocent bystander
in the head. That's worse than using BUG() to report errors...

Cheers,

Dave.
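For readers following along, the "take the IO's associated cgroup and submit
with it" idea is roughly what the series ends up doing: capture the issuing
task's cgroup when the request is queued, and have the worker assume it while
doing the backing-file I/O. Below is a minimal sketch of that association,
assuming the kthread_associate_blkcg(), get_mem_cgroup_from_mm() and
memalloc_use_memcg() helpers present in kernels of this era
(memalloc_use_memcg() was later replaced by set_active_memcg()). The struct
and function names are illustrative, not the actual drivers/block/loop.c
code, and reference counting / NULL checks are omitted for brevity.

#include <linux/bio.h>
#include <linux/blk-cgroup.h>
#include <linux/kthread.h>
#include <linux/memcontrol.h>
#include <linux/sched/mm.h>

/* Illustrative only - not the actual loop driver structures. */
struct loop_cmd_ctx {
        struct cgroup_subsys_state *blkcg_css;  /* captured at submit time */
        struct mem_cgroup *memcg;               /* owner for memory charging */
};

/* In the submitting task's context: remember who issued this request. */
static void loop_capture_issuer(struct loop_cmd_ctx *ctx, struct bio *bio)
{
        ctx->blkcg_css = bio_blkcg(bio) ? &bio_blkcg(bio)->css : NULL;
        ctx->memcg = get_mem_cgroup_from_mm(current->mm);
}

/* In the worker thread: impersonate the issuer for the duration of the I/O. */
static void loop_worker_do_io(struct loop_cmd_ctx *ctx)
{
        kthread_associate_blkcg(ctx->blkcg_css);  /* IO controller sees the issuer */
        memalloc_use_memcg(ctx->memcg);           /* pages get charged to the issuer */

        /* ... vfs_iter_write()/vfs_iter_read() on the backing file ... */

        memalloc_unuse_memcg();
        kthread_associate_blkcg(NULL);
}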
On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > This patch series does some
> > minor modification to the loop driver so that each cgroup can make
> > forward progress independently to avoid this inversion.
> >
> > With this patch series applied, the above script triggers OOM kills
> > when writing through the loop device as expected.
>
> NACK!
>
> The IO that is disallowed should fail with ENOMEM or some similar
> error, not trigger an OOM kill that shoots some innocent bystander
> in the head. That's worse than using BUG() to report errors...

Did you actually read the script?

It's OOMing because it's creating 256M worth of tmpfs pages inside a
64M cgroup. It's not killing an innocent bystander, it's killing in
the cgroup that is allocating all that memory - after Dan makes sure
that memory is accounted to its rightful owner.

As opposed to before this series, where all this memory isn't
accounted properly and goes to the root cgroup - where, ironically, it
could cause OOM and kill an actually innocent bystander.
On Wed 29-04-20 07:47:34, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > The loop device runs all i/o to the backing file on a separate kworker
> > thread which results in all i/o being charged to the root cgroup. This
> > allows a loop device to be used to trivially bypass resource limits
> > and other policy. This patch series fixes this gap in accounting.
>
> How is this specific to the loop device? Isn't every block device
> that offloads work to a kthread or single worker thread susceptible
> to the same "exploit"?
>
> Or is the problem simply that the loop worker thread is simply not
> taking the IO's associated cgroup and submitting the IO with that
> cgroup associated with it? That seems kinda simple to fix....
>
> > Naively charging cgroups could result in priority inversions through
> > the single kworker thread in the case where multiple cgroups are
> > reading/writing to the same loop device.
>
> And that's where all the complexity and serialisation comes from,
> right?
>
> So, again: how is this unique to the loop device? Other block
> devices also offload IO to kthreads to do blocking work and IO
> submission to lower layers. Hence this seems to me like a generic
> "block device does IO submission from different task" issue that
> should be handled by generic infrastructure and not need to be
> reimplemented multiple times in every block device driver that
> offloads work to other threads...

Yeah, I was thinking about the same when reading the patch series
description. We already have some cgroup workarounds for btrfs kthreads if
I remember correctly, we have cgroup handling for flush workers, now we are
adding cgroup handling for loopback device workers, and soon I'd expect
someone comes with a need for DM/MD worker processes and IMHO it's getting
out of hand because the complexity spreads through the kernel with every
subsystem coming with slightly different solution to the problem and also
the number of kthreads gets multiplied by the number of cgroups. So I
agree some generic solution how to approach IO throttling of kthreads /
workers would be desirable.

OTOH I don't have a great idea how the generic infrastructure should look
like...

								Honza
On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > The loop device runs all i/o to the backing file on a separate kworker
> > thread which results in all i/o being charged to the root cgroup. This
> > allows a loop device to be used to trivially bypass resource limits
> > and other policy. This patch series fixes this gap in accounting.
>
> How is this specific to the loop device? Isn't every block device
> that offloads work to a kthread or single worker thread susceptible
> to the same "exploit"?

I believe this is fairly loop device specific. The issue is that the
loop driver issues I/O by re-entering the VFS layer (ending up in
tmpfs, as in my example, or going back through the block layer).
Normally, I/O through the VFS layer is accounted for and controlled
(e.g. you can OOM if writing to tmpfs, or get throttled by the I/O
controller) but the loop device completely side-steps the accounting.

> Or is the problem simply that the loop worker thread is simply not
> taking the IO's associated cgroup and submitting the IO with that
> cgroup associated with it? That seems kinda simple to fix....
>
> > Naively charging cgroups could result in priority inversions through
> > the single kworker thread in the case where multiple cgroups are
> > reading/writing to the same loop device.
>
> And that's where all the complexity and serialisation comes from,
> right?
>
> So, again: how is this unique to the loop device? Other block
> devices also offload IO to kthreads to do blocking work and IO
> submission to lower layers. Hence this seems to me like a generic
> "block device does IO submission from different task" issue that
> should be handled by generic infrastructure and not need to be
> reimplemented multiple times in every block device driver that
> offloads work to other threads...

I'm not familiar with other block device drivers that behave like
this. Could you point me at a few?

> > This patch series does some
> > minor modification to the loop driver so that each cgroup can make
> > forward progress independently to avoid this inversion.
> >
> > With this patch series applied, the above script triggers OOM kills
> > when writing through the loop device as expected.
>
> NACK!
>
> The IO that is disallowed should fail with ENOMEM or some similar
> error, not trigger an OOM kill that shoots some innocent bystander
> in the head. That's worse than using BUG() to report errors...

The OOM behavior is due to the cgroup limit. It mirrors the behavior
one sees when writing to a tmpfs that is larger than the available
memory.
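To make the re-entry Dan describes concrete, the loop worker's write path has
roughly the following shape: it calls back into the VFS on the backing file,
so the page cache or tmpfs pages allocated to satisfy the write are charged
to whatever cgroup the worker runs in (today, the root cgroup), not to the
task that wrote to /dev/loopN. This is a simplified sketch for illustration,
not the actual drivers/block/loop.c code.

#include <linux/bvec.h>
#include <linux/fs.h>
#include <linux/uio.h>

/*
 * Simplified sketch of the loop write path: the worker thread re-enters
 * the VFS on the backing file.  Any pages allocated here are accounted
 * to the cgroup of *this* thread - a root-cgroup kworker - rather than
 * to the original issuer of the block IO.
 */
static int loop_write_sketch(struct file *backing_file, struct bio_vec *bvec,
                             loff_t *pos)
{
        struct iov_iter i;
        ssize_t bw;

        iov_iter_bvec(&i, WRITE, bvec, 1, bvec->bv_len);
        bw = vfs_iter_write(backing_file, &i, pos, 0);

        return bw == bvec->bv_len ? 0 : -EIO;
}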
Hello,

On Wed, Apr 29, 2020 at 12:25:40PM +0200, Jan Kara wrote:
> Yeah, I was thinking about the same when reading the patch series
> description. We already have some cgroup workarounds for btrfs kthreads if
> I remember correctly, we have cgroup handling for flush workers, now we are
> adding cgroup handling for loopback device workers, and soon I'd expect
> someone comes with a need for DM/MD worker processes and IMHO it's getting
> out of hand because the complexity spreads through the kernel with every
> subsystem coming with slightly different solution to the problem and also
> the number of kthreads gets multiplied by the number of cgroups. So I
> agree some generic solution how to approach IO throttling of kthreads /
> workers would be desirable.
>
> OTOH I don't have a great idea how the generic infrastructure should look
> like...

I don't really see a way around that. The only generic solution would be
letting all IOs through as root and handling everything through backcharging,
which we already can do as backcharging is already in use to handle metadata
updates which can't be controlled directly. However, doing that for all IOs
would make the control quality a lot worse as all control would be based on
first incurring deficit and then trying to punish the issuer after the fact.

The infrastructure work done to make IO control work for btrfs is generic
and the changes needed on the btrfs side were pretty small. Most of the work
was identifying non-regular IO pathways (bouncing through different kthreads
and whatnot) and making sure they're annotating IO ownership and the needed
mechanism correctly. The biggest challenge probably is ensuring that the
filesystem doesn't add ordering dependency between separate data IOs, which
is a nice property to have with or without cgroup support.

That leaves the nesting drivers, loop and md/dm. Given that they sit in the
middle of the IO stack and proxy a lot of its roles, they'll have to be
updated to be transparent in terms of cgroup ownership if IO control is
gonna work through them. Maybe we can have a common infra shared between
loop, dm and md but they aren't many and may also be sufficiently
different. idk

Thanks.
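The "annotating IO ownership" Tejun mentions amounts to tagging each bio with
the cgroup on whose behalf it is issued, so the IO controllers see the real
owner even when the bio is submitted from a kthread. A hedged sketch using
the bio association helper available in this era of the block layer; the
wrapper function itself is hypothetical.

#include <linux/bio.h>
#include <linux/blk-cgroup.h>

/*
 * Hypothetical helper: tag a bio that a kthread is about to submit with
 * the cgroup (css) of the task that originally caused the IO, so that
 * blk-iocost/blk-iolatency/etc. see the real owner rather than root.
 */
static void tag_bio_with_owner(struct bio *bio,
                               struct cgroup_subsys_state *owner_css)
{
        /* Associate the bio with the owner's blkcg for IO accounting. */
        bio_associate_blkg_from_css(bio, owner_css);
}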
On Wed 29-04-20 10:22:30, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 29, 2020 at 12:25:40PM +0200, Jan Kara wrote:
> > Yeah, I was thinking about the same when reading the patch series
> > description. We already have some cgroup workarounds for btrfs kthreads if
> > I remember correctly, we have cgroup handling for flush workers, now we are
> > adding cgroup handling for loopback device workers, and soon I'd expect
> > someone comes with a need for DM/MD worker processes and IMHO it's getting
> > out of hand because the complexity spreads through the kernel with every
> > subsystem coming with slightly different solution to the problem and also
> > the number of kthreads gets multiplied by the number of cgroups. So I
> > agree some generic solution how to approach IO throttling of kthreads /
> > workers would be desirable.
> >
> > OTOH I don't have a great idea how the generic infrastructure should look
> > like...
>
> I don't really see a way around that. The only generic solution would be
> letting all IOs through as root and handling everything through backcharging,
> which we already can do as backcharging is already in use to handle metadata
> updates which can't be controlled directly. However, doing that for all IOs
> would make the control quality a lot worse as all control would be based on
> first incurring deficit and then trying to punish the issuer after the fact.

Yeah, it will probably be somewhat worse but OTOH given we'd track the IO
balance per cgroup, there will be a deficit only when a cgroup is starting,
so it could be bearable. I'm more concerned that for some IO controllers
(e.g. blk-iolatency or the work-conserving controllers) it is not obvious
how to sensibly estimate a cost to charge to a cgroup, since these
controllers are more about giving priority to the IO of one cgroup in the
presence of IO from another cgroup rather than enforcing some hard
throughput limit or something like that.

> The infrastructure work done to make IO control work for btrfs is generic
> and the changes needed on the btrfs side were pretty small. Most of the work
> was identifying non-regular IO pathways (bouncing through different kthreads
> and whatnot) and making sure they're annotating IO ownership and the needed
> mechanism correctly. The biggest challenge probably is ensuring that the
> filesystem doesn't add ordering dependency between separate data IOs, which
> is a nice property to have with or without cgroup support.
>
> That leaves the nesting drivers, loop and md/dm. Given that they sit in the
> middle of the IO stack and proxy a lot of its roles, they'll have to be
> updated to be transparent in terms of cgroup ownership if IO control is
> gonna work through them. Maybe we can have a common infra shared between
> loop, dm and md but they aren't many and may also be sufficiently
> different. idk

Yeah, as I said, I don't really have a better alternative :-|

								Honza
On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > This patch series does some
> > > minor modification to the loop driver so that each cgroup can make
> > > forward progress independently to avoid this inversion.
> > >
> > > With this patch series applied, the above script triggers OOM kills
> > > when writing through the loop device as expected.
> >
> > NACK!
> >
> > The IO that is disallowed should fail with ENOMEM or some similar
> > error, not trigger an OOM kill that shoots some innocent bystander
> > in the head. That's worse than using BUG() to report errors...
>
> Did you actually read the script?

Of course I did. You specifically mean this bit:

echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;

And with this patch set the dd gets OOM killed because the
/tmp/backing_file usage accounted to the cgroup goes over 64MB and
so tmpfs OOM kills the IO...

As I said: that's very broken behaviour from a storage stack
perspective.

> It's OOMing because it's creating 256M worth of tmpfs pages inside a
> 64M cgroup. It's not killing an innocent bystander, it's killing in
> the cgroup that is allocating all that memory - after Dan makes sure
> that memory is accounted to its rightful owner.

What this example does is turn /tmp into thinly provisioned storage
for $CGROUP via a restricted quota. It has a device size of 512MB,
but only has physical storage capacity of 64MB, as constrained by
the cgroup memory limit. You're dealing with management of -storage
devices- and -IO error reporting- here, not memory management.

When thin provisioned storage runs out of space - for whatever
reason - and it cannot resolve the issue by blocking, then it *must*
return ENOSPC to the IO submitter to tell it the IO has failed. This
is no different to if we set a quota on /tmp/backing_file and it is
exhausted at 64MB of data written - we fail the IO with ENOSPC or
EDQUOT, depending on which quota we used.

IOWs, we have solid expectations on how block devices report
unsolvable resource shortages during IO because those errors have to
be handled correctly by whatever is submitting the IO. We do not use
the OOM-killer to report or resolve ENOSPC errors.

Indeed, once you've killed the dd, that CGROUP still consumes 64MB
of tmpfs space in /tmp/backing_file, in which case any further IO to
$LOOP_DEV is also going to OOM kill. These are horrible semantics
for reporting errors to userspace.

And if the OOM-killer actually frees the 64MB of data written to
/tmp/backing_file through the $LOOP_DEV, then you're actually
corrupting the storage and ensuring that nobody can read the data
that was written to $LOOP_DEV.

So now let's put a filesystem on $LOOP_DEV in the above example, and
write out data to the filesystem which does IO to $LOOP_DEV. Just by
chance, the filesystem does some metadata writes in the context of
the user process doing the writes (because journalling, etc) and
that metadata IO is what pushes the loop device over the cgroup's
memory limit.

You kill the user application even though it wasn't directly
responsible for going over the 64MB limit of space in $LOOP_DEV.
What happens now to the filesystem's metadata IO? Did $LOOP_DEV
return an error, or after the OOM kill did the IO succeed? What
happens if all the IO being generated from the user application is
metadata and that starts failing - killing the user application
doesn't help anything - the filesystem IO is failing and that's
where the errors need to be reported.

And if the answer is "metadata IO isn't accounted to the $CGROUP"
then what happens when the tmpfs actually runs out of its 512MB of
space because of all the metadata the filesystem wrote (e.g. create
lots of zero length files)? That's an ENOSPC error, and we'll get
that from $LOOP_DEV just fine.

So why should the same error - running out of tmpfs space vs running
out of CGROUP quota space on tmpfs - be handled differently? Either
they are both ENOSPC IO errors, or they are both OOM kill vectors
because tmpfs space has run out...

See the problem here? We report storage resource shortages
(whatever the cause) by IO errors, not by killing userspace
processes. Userspace may be able to handle the IO error sanely; it
has no warning or choice when you use OOM kill to report ENOSPC
errors...

> As opposed to before this series, where all this memory isn't
> accounted properly and goes to the root cgroup - where, ironically, it
> could cause OOM and kill an actually innocent bystander.

Johannes, I didn't question the need for the functionality. I
questioned the implementation and pointed out fundamental problems
it has as well as the architectural questions raised by needing
special kthread-based handling for correct accounting of
cgroup-aware IO.

It's a really bad look to go shoot the messenger when it's clear you
haven't understood the message that was delivered.

-Dave.
On Wed, Apr 29, 2020 at 12:25:40PM +0200, Jan Kara wrote:
> On Wed 29-04-20 07:47:34, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > The loop device runs all i/o to the backing file on a separate kworker
> > > thread which results in all i/o being charged to the root cgroup. This
> > > allows a loop device to be used to trivially bypass resource limits
> > > and other policy. This patch series fixes this gap in accounting.
> >
> > How is this specific to the loop device? Isn't every block device
> > that offloads work to a kthread or single worker thread susceptible
> > to the same "exploit"?
> >
> > Or is the problem simply that the loop worker thread is simply not
> > taking the IO's associated cgroup and submitting the IO with that
> > cgroup associated with it? That seems kinda simple to fix....
> >
> > > Naively charging cgroups could result in priority inversions through
> > > the single kworker thread in the case where multiple cgroups are
> > > reading/writing to the same loop device.
> >
> > And that's where all the complexity and serialisation comes from,
> > right?
> >
> > So, again: how is this unique to the loop device? Other block
> > devices also offload IO to kthreads to do blocking work and IO
> > submission to lower layers. Hence this seems to me like a generic
> > "block device does IO submission from different task" issue that
> > should be handled by generic infrastructure and not need to be
> > reimplemented multiple times in every block device driver that
> > offloads work to other threads...
>
> Yeah, I was thinking about the same when reading the patch series
> description. We already have some cgroup workarounds for btrfs kthreads if
> I remember correctly, we have cgroup handling for flush workers, now we are
> adding cgroup handling for loopback device workers, and soon I'd expect
> someone comes with a need for DM/MD worker processes and IMHO it's getting
> out of hand because the complexity spreads through the kernel with every
> subsystem coming with slightly different solution to the problem and also
> the number of kthreads gets multiplied by the number of cgroups. So I
> agree some generic solution how to approach IO throttling of kthreads /
> workers would be desirable.

Yup, that's pretty much what I was thinking: it's yet another
special snowflake for cgroup-aware IO....

> OTOH I don't have a great idea how the generic infrastructure should look
> like...

I haven't given it any thought - it's not something I have any
bandwidth to spend time on. I'll happily review a unified
generic cgroup-aware kthread-based IO dispatch mechanism, but I
don't have the time to design and implement that myself....

OTOH, I will make time to stop people screwing up filesystems and
block devices with questionable complexity and unique, storage
device dependent userspace visible error behaviour. This sort of
change is objectively worse for users than not supporting the
functionality in the first place.

Cheers,

Dave.
On Mon, May 4, 2020 at 11:30 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > > This patch series does some
> > > > minor modification to the loop driver so that each cgroup can make
> > > > forward progress independently to avoid this inversion.
> > > >
> > > > With this patch series applied, the above script triggers OOM kills
> > > > when writing through the loop device as expected.
> > >
> > > NACK!
> > >
> > > The IO that is disallowed should fail with ENOMEM or some similar
> > > error, not trigger an OOM kill that shoots some innocent bystander
> > > in the head. That's worse than using BUG() to report errors...
> >
> > Did you actually read the script?
>

Before responding, I want to make it clear that the OOM behavior which
you are objecting to is already present in the kernel and is independent
of this patch series. This patch series is only connecting the charging
links which were missing for the loop device.

> Of course I did. You specifically mean this bit:
>
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
>
> And with this patch set the dd gets OOM killed because the
> /tmp/backing_file usage accounted to the cgroup goes over 64MB and
> so tmpfs OOM kills the IO...
>

A few queries to better understand your objection:

Let's forget about the cgroup for a second. Let's suppose I am running
this script on a system/VM having 64 MiB. In your opinion what should
happen?

Next let's add swap to the 64 MiB system/VM/cgroup and re-run the
script, what should be the correct behavior?

Next replace the tmpfs with any other disk-backed file system and
re-run the script in a 64 MiB system/VM/cgroup, what should be the
correct behavior?

> As I said: that's very broken behaviour from a storage stack
> perspective.
>
> > It's OOMing because it's creating 256M worth of tmpfs pages inside a
> > 64M cgroup. It's not killing an innocent bystander, it's killing in
> > the cgroup that is allocating all that memory - after Dan makes sure
> > that memory is accounted to its rightful owner.
>
> What this example does is turn /tmp into thinly provisioned storage
> for $CGROUP via a restricted quota. It has a device size of 512MB,
> but only has physical storage capacity of 64MB, as constrained by
> the cgroup memory limit. You're dealing with management of -storage
> devices- and -IO error reporting- here, not memory management.
>
> When thin provisioned storage runs out of space - for whatever
> reason - and it cannot resolve the issue by blocking, then it *must*
> return ENOSPC to the IO submitter to tell it the IO has failed. This
> is no different to if we set a quota on /tmp/backing_file and it is
> exhausted at 64MB of data written - we fail the IO with ENOSPC or
> EDQUOT, depending on which quota we used.
>
> IOWs, we have solid expectations on how block devices report
> unsolvable resource shortages during IO because those errors have to
> be handled correctly by whatever is submitting the IO. We do not use
> the OOM-killer to report or resolve ENOSPC errors.
>
> Indeed, once you've killed the dd, that CGROUP still consumes 64MB
> of tmpfs space in /tmp/backing_file, in which case any further IO to
> $LOOP_DEV is also going to OOM kill. These are horrible semantics
> for reporting errors to userspace.
>
> And if the OOM-killer actually frees the 64MB of data written to
> /tmp/backing_file through the $LOOP_DEV, then you're actually
> corrupting the storage and ensuring that nobody can read the data
> that was written to $LOOP_DEV.
>
> So now let's put a filesystem on $LOOP_DEV in the above example, and
> write out data to the filesystem which does IO to $LOOP_DEV. Just by
> chance, the filesystem does some metadata writes in the context of
> the user process doing the writes (because journalling, etc) and
> that metadata IO is what pushes the loop device over the cgroup's
> memory limit.
>
> You kill the user application even though it wasn't directly
> responsible for going over the 64MB limit of space in $LOOP_DEV.
> What happens now to the filesystem's metadata IO? Did $LOOP_DEV
> return an error, or after the OOM kill did the IO succeed? What
> happens if all the IO being generated from the user application is
> metadata and that starts failing - killing the user application
> doesn't help anything - the filesystem IO is failing and that's
> where the errors need to be reported.
>
> And if the answer is "metadata IO isn't accounted to the $CGROUP"
> then what happens when the tmpfs actually runs out of its 512MB of
> space because of all the metadata the filesystem wrote (e.g. create
> lots of zero length files)? That's an ENOSPC error, and we'll get
> that from $LOOP_DEV just fine.
>
> So why should the same error - running out of tmpfs space vs running
> out of CGROUP quota space on tmpfs - be handled differently? Either
> they are both ENOSPC IO errors, or they are both OOM kill vectors
> because tmpfs space has run out...
>
> See the problem here? We report storage resource shortages
> (whatever the cause) by IO errors, not by killing userspace
> processes. Userspace may be able to handle the IO error sanely; it
> has no warning or choice when you use OOM kill to report ENOSPC
> errors...
>
> > As opposed to before this series, where all this memory isn't
> > accounted properly and goes to the root cgroup - where, ironically, it
> > could cause OOM and kill an actually innocent bystander.
>
> Johannes, I didn't question the need for the functionality. I
> questioned the implementation and pointed out fundamental problems
> it has as well as the architectural questions raised by needing
> special kthread-based handling for correct accounting of
> cgroup-aware IO.
>
> It's a really bad look to go shoot the messenger when it's clear you
> haven't understood the message that was delivered.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Tue, May 05, 2020 at 04:29:46PM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > > This patch series does some
> > > > minor modification to the loop driver so that each cgroup can make
> > > > forward progress independently to avoid this inversion.
> > > >
> > > > With this patch series applied, the above script triggers OOM kills
> > > > when writing through the loop device as expected.
> > >
> > > NACK!
> > >
> > > The IO that is disallowed should fail with ENOMEM or some similar
> > > error, not trigger an OOM kill that shoots some innocent bystander
> > > in the head. That's worse than using BUG() to report errors...
> >
> > Did you actually read the script?
>
> Of course I did. You specifically mean this bit:
>
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
>
> And with this patch set the dd gets OOM killed because the
> /tmp/backing_file usage accounted to the cgroup goes over 64MB and
> so tmpfs OOM kills the IO...
>
> As I said: that's very broken behaviour from a storage stack
> perspective.
>
> > It's OOMing because it's creating 256M worth of tmpfs pages inside a
> > 64M cgroup. It's not killing an innocent bystander, it's killing in
> > the cgroup that is allocating all that memory - after Dan makes sure
> > that memory is accounted to its rightful owner.
>
> What this example does is turn /tmp into thinly provisioned storage
> for $CGROUP via a restricted quota. It has a device size of 512MB,
> but only has physical storage capacity of 64MB, as constrained by
> the cgroup memory limit. You're dealing with management of -storage
> devices- and -IO error reporting- here, not memory management.
>
> When thin provisioned storage runs out of space - for whatever
> reason - and it cannot resolve the issue by blocking, then it *must*
> return ENOSPC to the IO submitter to tell it the IO has failed. This
> is no different to if we set a quota on /tmp/backing_file and it is
> exhausted at 64MB of data written - we fail the IO with ENOSPC or
> EDQUOT, depending on which quota we used.
>
> IOWs, we have solid expectations on how block devices report
> unsolvable resource shortages during IO because those errors have to
> be handled correctly by whatever is submitting the IO. We do not use
> the OOM-killer to report or resolve ENOSPC errors.
>
> Indeed, once you've killed the dd, that CGROUP still consumes 64MB
> of tmpfs space in /tmp/backing_file, in which case any further IO to
> $LOOP_DEV is also going to OOM kill. These are horrible semantics
> for reporting errors to userspace.
>
> And if the OOM-killer actually frees the 64MB of data written to
> /tmp/backing_file through the $LOOP_DEV, then you're actually
> corrupting the storage and ensuring that nobody can read the data
> that was written to $LOOP_DEV.

Right, but that's just tmpfs. It doesn't have much to do with the
loop device or its semantics as a block device. (Although I don't
think most users really see loop as a true block device, but rather
as a namespacing tool that for better or worse happens to be
implemented at the block layer). But remove the loop device
indirection and the tmpfs semantics would be exactly the same.

tmpfs returns -ENOSPC when you run out of filesystem quota, but when
it tries to allocate memory and can't, it'll invoke the OOM killer as
a means to reclaim memory. And when that fails, it'll return -ENOMEM.

Dan's patches don't change the block device semantics of loop. They
just ensure that the chain of causality between writer and memory
allocation isn't broken. In fact, it barely has anything to do with
loop itself. If loop were to do its redirections synchronously and in
the context of the process that is making requests, we wouldn't have
this problem.

The generic problem is that of one process performing work on behalf
of another process with side-effects relevant to the originator. The
generic solution is to have the worker impersonate the process that
created the work in all the various aspects that matter. Like
io_uring and various other kthreads and workers doing use_mm() when
the address space of the process creating the work matters.

> So now let's put a filesystem on $LOOP_DEV in the above example, and
> write out data to the filesystem which does IO to $LOOP_DEV. Just by
> chance, the filesystem does some metadata writes in the context of
> the user process doing the writes (because journalling, etc) and
> that metadata IO is what pushes the loop device over the cgroup's
> memory limit.
>
> You kill the user application even though it wasn't directly
> responsible for going over the 64MB limit of space in $LOOP_DEV.
> What happens now to the filesystem's metadata IO? Did $LOOP_DEV
> return an error, or after the OOM kill did the IO succeed? What
> happens if all the IO being generated from the user application is
> metadata and that starts failing - killing the user application
> doesn't help anything - the filesystem IO is failing and that's
> where the errors need to be reported.
>
> And if the answer is "metadata IO isn't accounted to the $CGROUP"
> then what happens when the tmpfs actually runs out of its 512MB of
> space because of all the metadata the filesystem wrote (e.g. create
> lots of zero length files)? That's an ENOSPC error, and we'll get
> that from $LOOP_DEV just fine.

Well, what happens today if you write to a loop mount backed by
tmpfs, but the machine is *physically* out of memory?

None of these questions are new in the context of this patch set. The
cgroup annotations don't inject anything that isn't already
happening. When you use the loop device on a tmpfs backing today,
logically speaking you're charging the root cgroup. That may not have
a user-set limit, but it's limited by physical RAM.

With or without cgroup annotation, tmpfs needs to allocate memory,
and that can fail. The function that charges to a specific cgroup is
just a few lines below the function that physically allocates the
page. Both can invoke the OOM killer for slightly different reasons
that aren't really relevant to the loop device on top of it.

> So why should the same error - running out of tmpfs space vs running
> out of CGROUP quota space on tmpfs - be handled differently? Either
> they are both ENOSPC IO errors, or they are both OOM kill vectors
> because tmpfs space has run out...

Because charging memory has allocation semantics, and tmpfs already
defines what those are.

> > As opposed to before this series, where all this memory isn't
> > accounted properly and goes to the root cgroup - where, ironically, it
> > could cause OOM and kill an actually innocent bystander.
>
> Johannes, I didn't question the need for the functionality. I
> questioned the implementation and pointed out fundamental problems
> it has as well as the architectural questions raised by needing
> special kthread-based handling for correct accounting of
> cgroup-aware IO.
>
> It's a really bad look to go shoot the messenger when it's clear you
> haven't understood the message that was delivered.

Do I need to point out the irony here? ;)

Maybe let's focus on the technical questions and misunderstandings
first before throwing NAKs around.
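As a concrete illustration of the impersonation pattern Johannes refers to,
here is a minimal sketch of the use_mm() idiom used by io_uring-style offload
around this time. The work item structure is hypothetical, and
use_mm()/unuse_mm() were later renamed kthread_use_mm()/kthread_unuse_mm().

#include <linux/mmu_context.h>
#include <linux/sched/mm.h>
#include <linux/workqueue.h>

/*
 * Hypothetical work item: the submitting task's mm is captured (with a
 * reference taken via mmget()) so a kernel worker can act on its behalf.
 */
struct offloaded_work {
        struct work_struct work;
        struct mm_struct *mm;           /* originator's address space */
};

static void offloaded_work_fn(struct work_struct *work)
{
        struct offloaded_work *ow = container_of(work, struct offloaded_work, work);

        use_mm(ow->mm);         /* user copies now resolve in the originator's mm */

        /* ... perform the deferred work on behalf of the originator ... */

        unuse_mm(ow->mm);
        mmput(ow->mm);          /* drop the reference taken at submission time */
}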
Hello, Dave.

On Tue, May 05, 2020 at 04:41:14PM +1000, Dave Chinner wrote:
> > OTOH I don't have a great idea how the generic infrastructure should look
> > like...
>
> I haven't given it any thought - it's not something I have any
> bandwidth to spend time on. I'll happily review a unified
> generic cgroup-aware kthread-based IO dispatch mechanism, but I
> don't have the time to design and implement that myself....
>
> OTOH, I will make time to stop people screwing up filesystems and
> block devices with questionable complexity and unique, storage
> device dependent userspace visible error behaviour. This sort of
> change is objectively worse for users than not supporting the
> functionality in the first place.

That probably is too strong a position to hold without spending at least
some thought on a subject, whatever the subject may be, and it doesn't
seem like your understanding of the userspace implications is accurate.

I don't necessarily disagree that it'd be nice to have a common
infrastructure and there may be some part which can actually be factored
out. However, there isn't gonna be a magic bullet which magically makes
every IO thing in the kernel cgroup aware automatically. Please consider
the following.

* Avoiding IO priority inversions requires splitting IO channels according
  to cgroups and working around (e.g. with backcharging) when they can't
  be. It's a substantial feature which may require substantial changes.
  Each IO subsystem has different constraints and existing structures and
  many of them would require their own solutions. It's not different from
  different filesystems needing their own solutions for similar problems.

* Because different filesystems and IO stacking layers already have their
  own internal infrastructure, the right way to add cgroup support is
  adapting to and modifying the existing infrastructure rather than trying
  to restructure them to use the same cgroup mechanism, which I don't
  think would be possible in many cases.

* Among the three IO stacking / redirecting mechanisms - md/dm, loop and
  fuse - the requirements and what's possible vary quite a bit. md/dm
  definitely need to support full-on IO channel splitting cgroup support.
  loop can go either way, but given existing uses, full splitting makes
  sense. fuse, as it currently stands, can't support that because the
  priority inversions extend all the way to userspace and the kernel API
  isn't built for that. If it wants to support cgroup containment, each
  instance would have to be assigned to a cgroup.

Between dm/md and loop, it's maybe possible that some of the sub-threading
code can be reused, but I don't see a point in blocking loop updates given
that it clearly fixes userspace-visible malfunctions, is not that much
code, and how the shared code should look is unclear yet. We'll be able to
answer the sharing question when we actually get to dm/md conversion.

Thanks.
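The "full-on IO channel splitting" Tejun describes is essentially what the
loop series implements: one queue and worker per issuing cgroup, so a
throttled cgroup only stalls its own IO rather than every user of the device.
A rough sketch of that shape follows; all structure and function names are
hypothetical and simplified, not taken from the actual implementation.

#include <linux/cgroup.h>
#include <linux/list.h>
#include <linux/workqueue.h>

/*
 * Per-cgroup IO channel: each issuing cgroup gets its own command list
 * and its own worker, so one cgroup being throttled or reclaiming does
 * not block the IO of the others sharing the device.
 */
struct loop_cgroup_worker {
        struct cgroup_subsys_state *blkcg_css;  /* which cgroup this worker serves */
        struct list_head cmd_list;              /* commands queued by that cgroup */
        struct work_struct work;                /* runs the commands as that cgroup */
        struct list_head node;                  /* linked off the device's worker list */
};

/* Queue a command on the worker owned by the command's issuing cgroup. */
static void queue_on_owner_worker(struct list_head *workers,
                                  struct cgroup_subsys_state *css,
                                  struct list_head *cmd)
{
        struct loop_cgroup_worker *w;

        list_for_each_entry(w, workers, node) {
                if (w->blkcg_css == css) {
                        list_add_tail(cmd, &w->cmd_list);
                        schedule_work(&w->work);
                        return;
                }
        }
        /* ... otherwise allocate a new worker for this cgroup ... */
}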
Seems like discussion on this patch series has died down. There's been
a concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible in the future, but it isn't clear to me
what this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those are addressed.

Jens, you've expressed interest in seeing this series go through the
block tree so I'm interested in your perspective here. Barring any
concrete implementation bugs, would you be okay merging this version?
On Tue, May 12, 2020 at 09:25:21AM -0400, Dan Schatzberg wrote:
> Seems like discussion on this patch series has died down. There's been
> a concern raised that we could generalize infrastructure across loop,
> md, etc. This may be possible in the future, but it isn't clear to me
> what this would look like. I'm inclined to fix the existing issue with
> loop devices now (this is a problem we hit at FB) and address
> consolidation with other cases if and when those are addressed.
>
> Jens, you've expressed interest in seeing this series go through the
> block tree so I'm interested in your perspective here. Barring any
> concrete implementation bugs, would you be okay merging this version?

Independent of any higher-level issues, you need to sort out the
spinlock mess I pointed out.
On Tue, May 12, 2020 at 06:35:45AM -0700, Christoph Hellwig wrote:
> On Tue, May 12, 2020 at 09:25:21AM -0400, Dan Schatzberg wrote:
> > Seems like discussion on this patch series has died down. There's been
> > a concern raised that we could generalize infrastructure across loop,
> > md, etc. This may be possible in the future, but it isn't clear to me
> > what this would look like. I'm inclined to fix the existing issue with
> > loop devices now (this is a problem we hit at FB) and address
> > consolidation with other cases if and when those are addressed.
> >
> > Jens, you've expressed interest in seeing this series go through the
> > block tree so I'm interested in your perspective here. Barring any
> > concrete implementation bugs, would you be okay merging this version?
>
> Independent of any higher-level issues, you need to sort out the
> spinlock mess I pointed out.

Will do - I'll split out the lock-use refactor into a separate
patch. Do you have particular concerns about re-using the existing
spinlock? Its existing use is not contended so I didn't see any harm
in extending its use. I'll add this justification to the commit
message as well, but I'm tempted to leave the re-use as is instead of
creating a new lock.
On Tue, May 26, 2020 at 10:28:03AM -0400, Dan Schatzberg wrote:
> Will do - I'll split out the lock-use refactor into a separate
> patch. Do you have particular concerns about re-using the existing
> spinlock? Its existing use is not contended so I didn't see any harm
> in extending its use. I'll add this justification to the commit
> message as well, but I'm tempted to leave the re-use as is instead of
> creating a new lock.

Please don't share a lock for entirely separate critical sections that
are used from different contexts.
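Christoph's point translates into something like the following: give the new
worker bookkeeping its own lock rather than piggybacking on an existing lock
that protects unrelated state and may be taken from other contexts. The
structure and field names here are hypothetical, for illustration only.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical sketch: a dedicated lock for the per-cgroup worker list. */
struct loop_device_sketch {
        spinlock_t lo_lock;             /* existing lock, guards device state */
        spinlock_t worker_lock;         /* new: guards only the worker list below */
        struct list_head worker_list;
};

static void add_worker(struct loop_device_sketch *lo, struct list_head *worker)
{
        spin_lock_irq(&lo->worker_lock);        /* never nests with lo_lock users */
        list_add_tail(worker, &lo->worker_list);
        spin_unlock_irq(&lo->worker_lock);
}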