Message ID: 1434146254-26220-4-git-send-email-tj@kernel.org (mailing list archive)
State: New, archived
On Fri, Jun 12, 2015 at 04:57:34PM -0500, Tejun Heo wrote:
> Update Documentation/cgroups/blkio-controller.txt to reflect the
> recently added cgroup writeback support.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--

Hi Tejun,

This looks good to me. Thanks.

IIRC, I had run into issues with two fsyncs running in two cgroups. One
cgroup had a really small limit and the other was unlimited. At that
point the conclusion was that multiple transactions could not make
progress at the same time, so the slower cgroup had blocked the
unlimited cgroup's process from opening a transaction (as IO from the
slower group was stuck inside the throttling layer).

For some reason, in my limited testing I have not noticed it with your
branch. Maybe things have changed since, or I am just hazy on the
details. I will do some more testing.

Thanks
Vivek

>  1 file changed, 78 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index cd556b9..68b6a6a 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
>  IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
>  on individual groups and throughput should improve.
>  
> -What works
> -==========
> -- Currently only sync IO queues are support. All the buffered writes are
> -  still system wide and not per group. Hence we will not see service
> -  differentiation between buffered writes between groups.
> +Writeback
> +=========
> +
> +Page cache is dirtied through buffered writes and shared mmaps and
> +written asynchronously to the backing filesystem by the writeback
> +mechanism. Writeback sits between the memory and IO domains and
> +regulates the proportion of dirty memory by balancing dirtying and
> +write IOs.
> +
> +On traditional cgroup hierarchies, relationships between different
> +controllers cannot be established making it impossible for writeback
> +to operate accounting for cgroup resource restrictions and all
> +writeback IOs are attributed to the root cgroup.
> +
> +If both the blkio and memory controllers are used on the v2 hierarchy
> +and the filesystem supports cgroup writeback, writeback operations
> +correctly follow the resource restrictions imposed by both memory and
> +blkio controllers.
> +
> +Writeback examines both system-wide and per-cgroup dirty memory status
> +and enforces the more restrictive of the two. Also, writeback control
> +parameters which are absolute values - vm.dirty_bytes and
> +vm.dirty_background_bytes - are distributed across cgroups according
> +to their current writeback bandwidth.
> +
> +There's a peculiarity stemming from the discrepancy in ownership
> +granularity between memory controller and writeback. While memory
> +controller tracks ownership per page, writeback operates on inode
> +basis. cgroup writeback bridges the gap by tracking ownership by
> +inode but migrating ownership if too many foreign pages, pages which
> +don't match the current inode ownership, have been encountered while
> +writing back the inode.
> +
> +This is a conscious design choice as writeback operations are
> +inherently tied to inodes making strictly following page ownership
> +complicated and inefficient. The only use case which suffers from
> +this compromise is multiple cgroups concurrently dirtying disjoint
> +regions of the same inode, which is an unlikely use case and decided
> +to be unsupported. Note that as memory controller assigns page
> +ownership on the first use and doesn't update it until the page is
> +released, even if cgroup writeback strictly follows page ownership,
> +multiple cgroups dirtying overlapping areas wouldn't work as expected.
> +In general, write-sharing an inode across multiple cgroups is not well
> +supported.
> +
> +Filesystem support for cgroup writeback
> +---------------------------------------
> +
> +A filesystem can make writeback IOs cgroup-aware by updating
> +address_space_operations->writepage[s]() to annotate bio's using the
> +following two functions.
> +
> +* wbc_init_bio(@wbc, @bio)
> +
> +  Should be called for each bio carrying writeback data and associates
> +  the bio with the inode's owner cgroup. Can be called anytime
> +  between bio allocation and submission.
> +
> +* wbc_account_io(@wbc, @page, @bytes)
> +
> +  Should be called for each data segment being written out. While
> +  this function doesn't care exactly when it's called during the
> +  writeback session, it's the easiest and most natural to call it as
> +  data segments are added to a bio.
> +
> +With writeback bio's annotated, cgroup support can be enabled per
> +super_block by setting MS_CGROUPWB in ->s_flags. This allows for
> +selective disabling of cgroup writeback support which is helpful when
> +certain filesystem features, e.g. journaled data mode, are
> +incompatible.
> +
> +wbc_init_bio() binds the specified bio to its cgroup. Depending on
> +the configuration, the bio may be executed at a lower priority and if
> +the writeback session is holding shared resources, e.g. a journal
> +entry, may lead to priority inversion. There is no one easy solution
> +for the problem. Filesystems can try to work around specific problem
> +cases by skipping wbc_init_bio() or using bio_associate_blkcg()
> +directly.
> --
> 2.4.2
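To make the two helpers quoted above concrete, a minimal sketch of a
cgroup-aware ->writepage() might look roughly like the following. The
filesystem name, the trivial block mapping, and the omitted completion
handling are placeholders rather than anything taken from the patch:

	#include <linux/bio.h>
	#include <linux/fs.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	/* hypothetical ->writepage(); "myfs" and the 1:1 block mapping are made up */
	static int myfs_writepage(struct page *page, struct writeback_control *wbc)
	{
		struct inode *inode = page->mapping->host;
		struct bio *bio = bio_alloc(GFP_NOFS, 1);

		bio->bi_bdev = inode->i_sb->s_bdev;
		bio->bi_iter.bi_sector = page->index << (PAGE_SHIFT - 9);
		/* a real implementation would also set bi_end_io to end page writeback */

		/* associate the bio with the inode's owner cgroup */
		wbc_init_bio(wbc, bio);

		bio_add_page(bio, page, PAGE_SIZE, 0);
		/* account this data segment against the owner's writeback bandwidth */
		wbc_account_io(wbc, page, PAGE_SIZE);

		set_page_writeback(page);
		unlock_page(page);
		submit_bio(WRITE, bio);
		return 0;
	}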
Hey, Vivek.

On Mon, Jun 15, 2015 at 01:28:23PM -0400, Vivek Goyal wrote:
> IIRC, I had run into issues with two fsyncs running in two cgroups. One
> cgroup had a really small limit and the other was unlimited. At that
> point the conclusion was that multiple transactions could not make
> progress at the same time, so the slower cgroup had blocked the
> unlimited cgroup's process from opening a transaction (as IO from the
> slower group was stuck inside the throttling layer).
>
> For some reason, in my limited testing I have not noticed it with your
> branch. Maybe things have changed since, or I am just hazy on the
> details. I will do some more testing.

On ext2, there's nothing interlocking each other. My understanding of
ext4 is pretty limited, but as long as the journal head doesn't wrap
around and get blocked on the slow one, it should be fine, so for most
use cases this shouldn't be a problem.

Thanks.
On Mon, Jun 15, 2015 at 02:23:45PM -0400, Tejun Heo wrote:
>
> On ext2, there's nothing interlocking each other. My understanding of
> ext4 is pretty limited, but as long as the journal head doesn't wrap
> around and get blocked on the slow one, it should be fine, so for most
> use cases this shouldn't be a problem.

The writes to the journal in ext3/ext4 are done from the jbd/jbd2 kernel
thread, so writes to the journal shouldn't be a problem.

In data=ordered mode, inodes that have blocks which were allocated during
the current transaction do have to have their data blocks written out,
and this is done by the jbd/jbd2 thread using filemap_fdatawait(). If
this gets throttled because the blocks were originally dirtied by some
cgroup that didn't have much disk time quota, then all file system
activities will get stalled until the ordered mode writeback completes.
That means any high priority cgroup trying to execute a system call that
mutates file system state will block until the commit has gotten past
the initial setup stage, and so other system activity could sputter to a
halt --- at which point the commit will be allowed to complete, all of
the calls to ext4_journal_start() will unblock, and the system will come
back to life. :-)

Because ext3 doesn't have delayed allocation, it will do orders of
magnitude more data=ordered block flushing, so this problem will be far
worse with ext3 compared to ext4.

So if there is some way we can signal to any cgroup that might be
throttling writeback or disk I/O that the jbd/jbd2 process should be
considered privileged, that would be good since it would allow us to
avoid a potential priority inversion problem.

						- Ted
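To make the stall Ted describes concrete, the data=ordered wait in the
commit path looks roughly like this simplified sketch, loosely modelled
on jbd2's journal_finish_inode_data_buffers(); locking and error
handling are elided and the field names are approximations:

	#include <linux/fs.h>
	#include <linux/jbd2.h>

	/* simplified: walk the transaction's ordered-data inodes and wait on each */
	static int wait_on_ordered_data(transaction_t *commit_transaction)
	{
		struct jbd2_inode *jinode;
		int err, ret = 0;

		list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
			/*
			 * If these pages were dirtied by a heavily throttled cgroup,
			 * this wait can take a very long time, and every task blocked
			 * in ext4_journal_start() ends up waiting behind it.
			 */
			err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
			if (err && !ret)
				ret = err;
		}
		return ret;
	}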
Hello, Ted.

On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> So if there is some way we can signal to any cgroup that might be
> throttling writeback or disk I/O that the jbd/jbd2 process should be
> considered privileged, that would be good since it would allow us to
> avoid a potential priority inversion problem.

I see. In the long term, I think we might need to come up with a way to
overcharge a slower cgroup to avoid blocking faster ones for cases where
some IOs are depended upon by more than one cgroup. That'd take quite a
bit of work on the blkcg side. Will think more about it.

Thanks.
On Tue, Jun 16, 2015 at 05:54:36PM -0400, Tejun Heo wrote:
> Hello, Ted.
>
> On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> > So if there is some way we can signal to any cgroup that might be
> > throttling writeback or disk I/O that the jbd/jbd2 process should be
> > considered privileged, that would be good since it would allow us to
> > avoid a potential priority inversion problem.
>
> I see. In the long term, I think we might need to come up with a way to
> overcharge a slower cgroup to avoid blocking faster ones for cases where
> some IOs are depended upon by more than one cgroup. That'd take quite a
> bit of work on the blkcg side. Will think more about it.

Hmm, while we're at it, there's another priority inversion that can be
painful. If a block directory has been pushed out of memory (possibly
because it was initially accessed by a cgroup with a very tiny amount of
memory allocated to it) and a process in a cgroup tries to do a lookup
in that directory, it will issue the read with such tightly constrained
disk time that it might take minutes for the read to complete. The
problem is that the VFS has locked the directory's i_mutex *before*
calling ext4_lookup().

If a high priority process then tries to read the same directory, or in
fact perform any VFS operation which requires taking the directory's
i_mutex first, including renaming the directory, the high priority
process will end up blocking until the read is completed --- which can
be minutes if the low priority process has a tiny amount of disk time
allocated to it.

There is a related problem where if a read for a particular block is
issued with a very low amount of disk time, and that same block is
required by a high priority process, we can also get hit with a very
similar priority inversion problem.

To date the answer has always been, "Doctor, Doctor, it hurts when I do
that...." The only way I can think of to fix the directory mutex
problem is by returning an error code to the VFS layer which instructs
it to unlock the directory, and then have it wait on some wait channel
so it ends up calling the lookup again after the directory block has
been read into memory (and we can hope that, due to a tight memory
cgroup, the block doesn't end up getting ejected from memory right
away).

As another solution for another part of the problem, if a high priority
process attempts a read and the I/O is already queued up, but it's at
the back of the queue because it was originally posted by a low priority
cgroup, the rest of the fix would be to elevate the priority of said I/O
request and then re-sort the queue.

As far as the filemap_fdatawait() call is concerned, if it's being
called by fsync() run by a low priority process, or from the writeback
thread, then it can certainly take place at a low priority. But if the
filemap_fdatawait() is being done on behalf of a high priority process,
such as a jbd/jbd2 thread, then there needs to be a way to set a flag in
the wbc structure indicating that the writes should be submitted as if
they were issued from the kernel thread, and not based on who originally
dirtied the pages.

It's going to be a number of point solutions, which is a bit ugly, but I
think that is much more likely to be successful than trying to
implement, say, a generalized priority inheritance scheme for block I/O
requests and related locks. :-)

						- Ted
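The wbc flag Ted asks for does not exist at this point; purely as a
hypothetical sketch, it could look something like the following, with
the writeback path skipping cgroup association when the flag is set so
the IO is effectively charged to the root cgroup:

	#include <linux/bio.h>
	#include <linux/writeback.h>

	/* hypothetical wrapper; struct writeback_control has no such flag today */
	struct writeback_control_ext {
		struct writeback_control wbc;
		unsigned int for_journal:1;	/* submit as if from the kernel thread */
	};

	static void myfs_attach_bio_owner(struct writeback_control_ext *wbce,
					  struct bio *bio)
	{
		if (wbce->for_journal)
			return;			/* leave unassociated: charged to root */
		wbc_init_bio(&wbce->wbc, bio);	/* normal case: charge the page owner */
	}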
Hi,

On Tue, Jun 16, 2015 at 11:15:40PM -0400, Theodore Ts'o wrote:
> Hmm, while we're at it, there's another priority inversion that can be
> painful. If a block directory has been pushed out of memory (possibly
> because it was initially accessed by a cgroup with a very tiny amount
> of memory allocated to it) and a process in a cgroup tries

At scale, this is self-correcting to a certain extent in that if the
inode is actually something shared across cgroups, it'll most likely end
up in a cgroup which has enough resources to keep it in memory. This
doesn't prevent one-off hiccups, but it at least shouldn't develop into
a systematic and chronic issue.

> to do a lookup in that directory, it will issue the read with such
> tightly constrained disk time that it might take minutes for the read
> to complete. The problem is that the VFS has locked the directory's
> i_mutex *before* calling ext4_lookup().
>
> If a high priority process then tries to read the same directory, or in
> fact perform any VFS operation which requires taking the directory's
> i_mutex first, including renaming the directory, the high priority
> process will end up blocking until the read is completed --- which can
> be minutes if the low priority process has a tiny amount of disk time
> allocated to it.
>
> There is a related problem where if a read for a particular block is
> issued with a very low amount of disk time, and that same block is
> required by a high priority process, we can also get hit with a very
> similar priority inversion problem.
>
> To date the answer has always been, "Doctor, Doctor, it hurts when I do
> that...." The only way I can think of to fix the directory mutex

In a lot of use cases, the directories accessed by different cgroups are
fairly segregated, so this hopefully shouldn't happen too often, but
yeah, it can be painful in sharing cases.

> problem is by returning an error code to the VFS layer which instructs
> it to unlock the directory, and then have it wait on some wait channel
> so it ends up calling the lookup again after the directory block has
> been read into memory (and we can hope that, due to a tight memory
> cgroup, the block doesn't end up getting ejected from memory right
> away).
>
> As another solution for another part of the problem, if a high priority
> process attempts a read and the I/O is already queued up, but it's at
> the back of the queue because it was originally posted by a low
> priority cgroup, the rest of the fix would be to elevate the priority
> of said I/O request and then re-sort the queue.
>
> As far as the filemap_fdatawait() call is concerned, if it's being
> called by fsync() run by a low priority process, or from the writeback
> thread, then it can certainly take place at a low priority. But if the
> filemap_fdatawait() is being done on behalf of a high priority process,
> such as a jbd/jbd2 thread, then there needs to be a way to set a flag
> in the wbc structure indicating that the writes should be submitted as
> if they were issued from the kernel thread, and not based on who
> originally dirtied the pages.

Hmmm... so, overriding things *before* a bio is issued shouldn't be too
difficult, and as long as these sorts of operations aren't prevalent we
might be able to get away with just charging them against root,
especially if it's to avoid getting blocked on the journal, which we
already consider a shared overhead that is charged to root.

If this becomes large enough to require exacting charges, it'll be more
complex, but still way better than trying to raise the priority of a bio
which has already been issued, which is likely to be excruciatingly
painful if possible at all.

> It's going to be a number of point solutions, which is a bit ugly, but
> I think that is much more likely to be successful than trying to
> implement, say, a generalized priority inheritance scheme for block I/O
> requests and related locks. :-)

I agree that a generalized priority inheritance mechanism would be a
massive overkill. I think as long as we can avoid boosting bio's which
have already been issued, things should be relatively sane. Hopefully,
we'd be able to figure out solutions for the worst offenders within
these constraints.

Thanks.
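A possible shape of the "charge it to root" override discussed here,
using the bio_associate_blkcg() helper mentioned in the patch and
assuming the root blkcg css (blkcg_root_css) is visible to the caller;
whether this is the right interface is exactly what is being debated:

	#include <linux/bio.h>
	#include <linux/blk-cgroup.h>

	/* charge a dependency-carrying bio (e.g. journal-related IO) to the root cgroup */
	static void submit_bio_as_root(int rw, struct bio *bio)
	{
		bio_associate_blkcg(bio, blkcg_root_css);
		submit_bio(rw, bio);
	}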
On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
>
> Hmmm... so, overriding things *before* a bio is issued shouldn't be too
> difficult, and as long as these sorts of operations aren't prevalent we
> might be able to get away with just charging them against root,
> especially if it's to avoid getting blocked on the journal, which we
> already consider a shared overhead that is charged to root. If this
> becomes large enough to require exacting charges, it'll be more
> complex, but still way better than trying to raise the priority of a
> bio which has already been issued, which is likely to be excruciatingly
> painful if possible at all.

Yeah, just charging the overhead to root seems good enough. I could
imagine charging it to whatever cgroup the jbd/jbd2 thread belongs to,
which in turn would be the cgroup of the process that mounted the file
system. The only problem with that is that if a low-priority process is
allowed to mount a file system, and the file system gets traversed by a
high priority process, the high priority process will get impacted. So
maybe it's better to just say that it always gets charged to the root
cgroup.

						- Ted
Hey, Ted.

On Wed, Jun 17, 2015 at 05:48:52PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
> >
> > Hmmm... so, overriding things *before* a bio is issued shouldn't be
> > too difficult, and as long as these sorts of operations aren't
> > prevalent we might be able to get away with just charging them
> > against root, especially if it's to avoid getting blocked on the
> > journal, which we already consider a shared overhead that is charged
> > to root. If this becomes large enough to require exacting charges,
> > it'll be more complex, but still way better than trying to raise the
> > priority of a bio which has already been issued, which is likely to
> > be excruciatingly painful if possible at all.
>
> Yeah, just charging the overhead to root seems good enough.

I think the easiest way to achieve this bypass would be making jbd mark
the inode while waiting in fdatawait so that the writeback path can skip
attaching the writeback bios for that inode. This isn't perfect, but it
should be able to work around stalls from priority inversion to a
certain extent.

However, I can't come up with a workload to test it. AFAICS, the
fdatawait stall path in jbd2 is journal_finish_inode_data_buffers(), but
the path doesn't trigger reliably with a mixed load of an overwriting
dd, a bunch of file creations and chmods, and different cgroups stay
pretty well isolated. Can you please suggest a workload for testing the
fdatawait path?

Thanks.
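A rough sketch of the bypass idea; the inode flag and the hook below are
hypothetical and only meant to show the shape of the mechanism, with
locking on the writeback side elided:

	#include <linux/fs.h>
	#include <linux/writeback.h>

	#define I_JBD_WAITING	(1 << 30)	/* hypothetical i_state flag */

	/* jbd2 side: mark the inode before waiting on its ordered data */
	static int jbd2_wait_ordered_data(struct inode *inode)
	{
		int err;

		spin_lock(&inode->i_lock);
		inode->i_state |= I_JBD_WAITING;
		spin_unlock(&inode->i_lock);

		err = filemap_fdatawait(inode->i_mapping);

		spin_lock(&inode->i_lock);
		inode->i_state &= ~I_JBD_WAITING;
		spin_unlock(&inode->i_lock);
		return err;
	}

	/* writeback side: skip cgroup association while jbd2 is waiting */
	static void myfs_maybe_init_bio(struct inode *inode,
					struct writeback_control *wbc,
					struct bio *bio)
	{
		if (!(inode->i_state & I_JBD_WAITING))
			wbc_init_bio(wbc, bio);	/* otherwise the bio stays in root */
	}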
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two. Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback. While memory
+controller tracks ownership per page, writeback operates on inode
+basis. cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient. The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported. Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup. Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out. While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting MS_CGROUPWB in ->s_flags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 5 deletions(-)
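For reference, the per-superblock opt-in described in the patch amounts
to a one-line change in a filesystem's mount path. The filesystem below
is hypothetical, and a real implementation would skip the flag when an
incompatible feature such as journalled data mode is in use:

	#include <linux/fs.h>

	static int myfs_fill_super(struct super_block *sb, void *data, int silent)
	{
		/* normal superblock setup (root inode, ops, etc.) elided */

		/* opt this superblock in to cgroup-aware writeback */
		sb->s_flags |= MS_CGROUPWB;

		return 0;
	}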