Message ID: 20160302175656.GA59991@redhat.com (mailing list archive)
State: New, archived
Thanks for the patch. I will likely have time to test this sometime next week. But just to be sure - the expected behavior would be that processes writing to dm-based devices would experience the fair-share scheduling of CFQ (provided that the physical devices that back those DM devices use CFQ), correct? -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 02 2016 at 1:03P -0500, Nikolay Borisov <kernel@kyup.com> wrote: > Thanks for the patch. I will likely have time to test this sometime next week. > But just to be sure - the expected behavior would be that processes > writing to dm-based devices would experience the fair-share > scheduling of CFQ (provided that the physical devices that back those > DM devices use CFQ), correct? Yes, that is the goal.
On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote: > Thanks for the patch. I will likely have time to test this sometime next week. > But just to be sure - the expected behavior would be that processes > writing to dm-based devices would experience the fair-share > scheduling of CFQ (provided that the physical devices that back those > DM devices use CFQ), correct? Nikolay, I am not sure how well it will work with CFQ on the underlying device. It will get the cgroup information right for buffered writes. But the cgroup information for reads and direct writes will come from the submitter's context, and if the dm layer gets in between, then often the submitter will be a worker thread and the IO will be attributed to that worker's cgroup (the root cgroup). Give it a try anyway. Thanks Vivek
On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote: > On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Wed, Mar 02, 2016 at 08:03:10PM +0200, Nikolay Borisov wrote: > > > Thanks for the patch. I will likely have time to test this sometime next > > week. > > > But just to be sure - the expected behavior would be that processes > > > writing to dm-based devices would experience the fair-share > > > scheduling of CFQ (provided that the physical devices that back those > > > DM devices use CFQ), correct? > > > > Nikolay, > > > > I am not sure how well it will work with CFQ on the underlying device. It will > > get the cgroup information right for buffered writes. But the cgroup information > > > Right, what's your definition of buffered writes? Writes which go through the page cache. > My mental model is that > when a process submits a write request to a dm device, the bio is going to > be put on a device workqueue which would then be serviced by a background > worker thread and the submitter notified later. Do you refer to this whole > gamut of operations as buffered writes? No, once the bio is submitted to the dm device it could be a buffered write or a direct write. > > for reads and direct writes will come from the submitter's context and if the dm > > layer gets in between, then often the submitter might be a worker > > thread and the IO will be attributed to that worker's cgroup (root cgroup). > > > Be that as it may, provided that the worker thread is in the 'correct' > cgroup, then the appropriate bandwidth policies should apply, no? The worker thread will most likely be in the root cgroup. So if a worker thread is submitting the bio, it will be attributed to the root cgroup. We had a similar issue with IO priority, and it did not work reliably with CFQ on the underlying device when dm devices were sitting on top. If we really want to give it a try, I guess we will have to put the submitter's cgroup info in the bio early, at the time of bio creation, for all kinds of IO.
Not sure if it is worth the effort. For the case of IO throttling, I think you should put the throttling rules on the dm device itself. That means as long as the filesystem supports cgroups, you should be getting the right cgroup information for all kinds of IO and throttling should work just fine. Thanks Vivek
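For reference, Vivek's suggestion of putting throttling rules directly on the dm device maps onto the cgroup v1 blkio interface roughly as follows. This is a sketch: the cgroup name `limited` and the 253:0 major:minor pair are illustrative assumptions, and the privileged sysfs writes are left commented out.

```shell
# Cap writes to a dm device at 10 MB/s with the v1 blkio throttle knobs.
# "limited" and the 253:0 device numbers are examples -- find yours with:
#   stat -c '%t:%T' /dev/dm-0   (prints hex major:minor)
BPS=$((10 * 1024 * 1024))
echo "write cap: $BPS bytes/s for device 253:0"
# The actual setup requires root and a mounted cgroup v1 blkio hierarchy:
# mkdir -p /sys/fs/cgroup/blkio/limited
# echo "253:0 $BPS" > /sys/fs/cgroup/blkio/limited/blkio.throttle.write_bps_device
# echo "$$"         > /sys/fs/cgroup/blkio/limited/cgroup.procs
```

Because throttling is enforced at the dm device's own request queue, it does not depend on the cgroup tags surviving the clone down to the underlying device, which is why it already works where CFQ weights do not.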
On 03/02/2016 02:10 PM, Vivek Goyal wrote: > On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote: > We had similar issue with IO priority and it did not work reliably with > CFQ on underlying device when dm devices were sitting on top. > > If we really want to give it a try, I guess we will have to put cgroup > info of submitter early in bio at the time of bio creation even for all > kind of IO. Not sure if it is worth the effort. As it stands, imagine that you have a hypervisor node running many VMs (or containers), each of which is assigned a separate logical volume (possibly thin-provisioned) as its rootfs. Ideally we want the disk accesses by those VMs to be "fair" relative to each other, and we want to guarantee a certain amount of bandwidth for the host as well. Without this sort of feature, how can we accomplish that? > For the case of IO throttling, I think you should put throttling rules on > the dm device itself. That means as long as filesystem supports the > cgroups, you should be getting right cgroup information for all kind of > IO and throttling should work just fine. IO throttling isn't all that useful, since it requires you to know in advance what your IO rate is. And it doesn't adjust nicely as the number of competing entities changes the way that weight-based schemes do. Chris
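Chris's point about weights adapting to the number of competitors can be seen with back-of-the-envelope arithmetic. The numbers and the `share` helper below are mine, purely for illustration; this is not a kernel interface.

```shell
# Weight-proportional division: each *active* group gets
# my_weight / sum(active weights) of the disk's bandwidth.
share() {  # share <my_weight> <sum_of_active_weights> <disk_mbs>
	echo $(( $1 * $3 / $2 ))
}
# A 120 MB/s disk, host weight 1000, two VMs at weight 500 each:
echo "host with two VMs active: $(share 1000 2000 120) MB/s"
# One VM goes idle; the same weights automatically yield a larger share:
echo "host with one VM active:  $(share 1000 1500 120) MB/s"
# A fixed throttle (say 60 MB/s on the host) would leave the idle VM's
# bandwidth unused -- which is exactly Chris's objection to throttling.
```

The host's share grows from 60 to 80 MB/s with no configuration change, whereas a static throttle would have to be retuned.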
On Wed, Mar 02, 2016 at 10:19:38PM +0200, Nikolay Borisov wrote: > On Wednesday, March 2, 2016, Vivek Goyal <vgoyal@redhat.com> wrote: > > [snip: earlier exchange quoted in full above] > > > > For the case of IO throttling, I think you should put throttling rules on > > the dm device itself. That means as long as the filesystem supports > > cgroups, you should be getting the right cgroup information for all kinds of > > IO and throttling should work just fine. > > Throttling does work even now, but the use case I had in mind was > proportional distribution of IO. Imagine 50 or so dm devices, hosting > IO-intensive workloads. In this situation, I'd be interested in each of them > getting proportional IO based on the weights set in the blkcg controller > for each workload's respective cgroup. I see what you are trying to do. Carry the cgroup information from the top to the bottom of the IO stack for all kinds of IO. I guess we also need to call bio_associate_current() when dm accepts a bio from the submitter. Thanks Vivek
On Wed, Mar 02, 2016 at 02:34:50PM -0600, Chris Friesen wrote: > On 03/02/2016 02:10 PM, Vivek Goyal wrote: > >On Wed, Mar 02, 2016 at 09:59:13PM +0200, Nikolay Borisov wrote: > > >We had similar issue with IO priority and it did not work reliably with > >CFQ on underlying device when dm devices were sitting on top. > > > >If we really want to give it a try, I guess we will have to put cgroup > >info of submitter early in bio at the time of bio creation even for all > >kind of IO. Not sure if it is worth the effort. > > As it stands, imagine that you have a hypervisor node running many VMs (or > containers), each of which is assigned a separate logical volume (possibly > thin-provisioned) as its rootfs. > > Ideally we want the disk accesses by those VMs to be "fair" relative to each > other, and we want to guarantee a certain amount of bandwidth for the host > as well. > > Without this sort of feature, how can we accomplish that? As of now, you can't. I will try adding bio_associate_current() and see if that, along with Mike's patches, gets you what you are looking for. On a side note, have you tried using CFQ's proportional logic with multiple VMs? Say, partition the disk, pass each partition to a VM/container and do the IO. My main concern is that by default each cgroup can add significant idling overhead and kill the overall throughput of the disk (especially for random IO, or if a cgroup does not have enough IO to keep the disk busy). One can disable group idling, but that kills service differentiation for most workloads. So I was curious to know whether CFQ's proportional bandwidth division is helping you in real life (without dm, of course). > > >For the case of IO throttling, I think you should put throttling rules on > >the dm device itself. That means as long as filesystem supports the > >cgroups, you should be getting right cgroup information for all kind of > >IO and throttling should work just fine. > > IO throttling isn't all that useful, since it requires you to know in > advance what your IO rate is. And it doesn't adjust nicely as the number of > competing entities changes the way that weight-based schemes do. Agreed that absolute limits are less useful compared to the dynamic limits provided by weights. This is more useful for a scenario where a cloud provider does not want to provide disk bandwidth a user has not paid for (even if disk bandwidth is available). Thanks Vivek
On 03/02/2016 07:56 PM, Mike Snitzer wrote: > On Wed, Mar 02 2016 at 11:06P -0500, > Tejun Heo <tj@kernel.org> wrote: > >> Hello, >> >> On Thu, Feb 25, 2016 at 09:53:14AM -0500, Mike Snitzer wrote: >>> Right, LVM created devices are bio-based DM devices in the kernel. >>> bio-based block devices do _not_ have an IO scheduler. Their underlying >>> request-based device does. >> >> dm devices are not the actual resource source, so I don't think it'd >> work too well to put io controllers on them (can't really do things >> like proportional control without owning the queue). >> >>> I'm not well-versed on the top-level cgroup interface and how it maps to >>> associated resources that are established in the kernel. But it could >>> be that the configuration of blkio cgroup against a bio-based LVM device >>> needs to be passed through to the underlying request-based device >>> (e.g. /dev/sda4 in Chris's case)? >>> >>> I'm also wondering whether the latest cgroup work that Tejun has just >>> finished (afaik to support buffered IO in the IO controller) will afford >>> us a more meaningful reason to work to make cgroups' blkio controller >>> actually work with bio-based devices like LVM's DM devices? >>> >>> I'm very much open to advice on how to proceed with investigating this >>> integration work. Tejun, Vivek, anyone else: if you have advice on next >>> steps for DM on this front _please_ yell, thanks! >> >> I think the only thing necessary is dm transferring bio cgroup tags to >> the bio's that it ends up passing down the stack. Please take a look >> at fs/btrfs/extent_io.c::btrfs_bio_clone() for an example. We >> probably should introduce a wrapper for this so that each site doesn't >> need to ifdef it. >> >> Thanks. > > OK, I think this should do it. Nikolay and/or others can you test this > patch using blkio cgroups controller with LVM devices and report back? 
> > From: Mike Snitzer <snitzer@redhat.com> > Date: Wed, 2 Mar 2016 12:37:39 -0500 > Subject: [PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg() > > Move btrfs_bio_clone()'s support for transferring a source bio's cgroup > tags to a clone into both bio_clone_bioset() and __bio_clone_fast(). > The former is used by btrfs (MD and blk-core also use it via bio_split). > The latter is used by both DM and bcache. > > This should enable the blkio cgroups controller to work with all > stacking bio-based block devices. > > Reported-by: Nikolay Borisov <kernel@kyup.com> > Suggested-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > --- > block/bio.c | 10 ++++++++++ > fs/btrfs/extent_io.c | 6 ------ > 2 files changed, 10 insertions(+), 6 deletions(-) So I had a chance to test this; here is what I got when running 2 containers using LVM-thin for their root devices, with your patch applied: When the 2 containers use the same blkio.weight value (500) I get the following from running dd simultaneously in the 2 containers: [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 165.171 s, 19.0 MB/s [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 166.165 s, 18.9 MB/s Also, iostat showed the 2 volumes using almost the same amount of IO (around 20 MB/s r/w). I then increased the weight for c1501 to 1000, i.e. twice the bandwidth of c1500, so I would expect its dd to complete twice as fast: [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 150.892 s, 20.8 MB/s [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 oflag=direct 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 157.167 s, 20.0 MB/s Now repeating the same tests, but this time going through the page cache (echo 3 > /proc/sys/vm/drop_caches was executed before each test run): With equal weights (500): [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 114.923 s, 27.4 MB/s [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 120.245 s, 26.2 MB/s With c1501's weight (1000) equal to twice that of c1500: [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=3000 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 99.0181 s, 31.8 MB/s [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 122.872 s, 25.6 MB/s I'd say that for buffered IO your patch does indeed make a difference, and this sort of aligns with what Vivek said about the patch working for buffered writes but not for direct ones. I will now proceed to test with his patch applied for the case of direct writes. Hope this helps.
On 03/14/2016 05:08 PM, Nikolay Borisov wrote: > [snip: the full test report quoted above] And another test which makes it obvious that your patch works: [root@c1501 ~]# dd if=test.img of=test2.img bs=1M count=6000 6000+0 records in 6000+0 records out 6291456000 bytes (6.3 GB) copied, 210.466 s, 29.9 MB/s [root@c1500 ~]# dd if=test.img of=test2.img bs=1M count=3000 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 201.118 s, 15.6 MB/s So a file that is twice the size of another one (6 vs 3 GB) is copied in almost the same amount of time, i.e. with 2x the bandwidth.
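As a sanity check on the 2:1 result reported above, the throughputs can be recomputed from dd's raw byte counts and timings. The `tput_mbs` helper name is mine; the numbers are the ones from this thread.

```shell
# MB/s the way dd reports it: bytes / seconds / 10^6, one decimal place.
tput_mbs() {  # tput_mbs <bytes> <seconds>
	awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1e6 }'
}
# c1501 (weight 1000): 6291456000 bytes in 210.466 s
# c1500 (weight  500): 3145728000 bytes in 201.118 s
echo "c1501: $(tput_mbs 6291456000 210.466) MB/s"
echo "c1500: $(tput_mbs 3145728000 201.118) MB/s"
```

This reproduces dd's own 29.9 and 15.6 MB/s figures, a throughput ratio close to the 2:1 weight ratio configured in the blkio controller.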
On Mon, Mar 14 2016 at 11:31am -0400, Nikolay Borisov <n.borisov@siteground.com> wrote: > [snip: full quote of the test results above] > So a file that is twice the size of another one (6 vs 3 GB) is copied in > almost the same amount of time with 2x the bandwidth. Great. Jens, can you pick up the patch in question ("[PATCH] block: transfer source bio's cgroup tags to clone via bio_associate_blkcg()") that I posted in this thread?
> > Great. > > Jens, can you pick up the patch in question ("[PATCH] block: transfer > source bio's cgroup tags to clone via bio_associate_blkcg()") that I > posted in this thread? And what about Vivek's patch of associating the source bio with the io context of the process issuing the io? Would that help in the DIO case?
diff --git a/block/bio.c b/block/bio.c
index cf75915..25812be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -584,6 +584,11 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_rw = bio_src->bi_rw;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+
+#ifdef CONFIG_BLK_CGROUP
+	if (bio_src->bi_css)
+		bio_associate_blkcg(bio, bio_src->bi_css);
+#endif
 }
 
 EXPORT_SYMBOL(__bio_clone_fast);
@@ -689,6 +694,11 @@ integrity_clone:
 		}
 	}
 
+#ifdef CONFIG_BLK_CGROUP
+	if (bio_src->bi_css)
+		bio_associate_blkcg(bio, bio_src->bi_css);
+#endif
+
 	return bio;
 }
 EXPORT_SYMBOL(bio_clone_bioset);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 392592d..8abc330 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2691,12 +2691,6 @@ struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 		btrfs_bio->csum = NULL;
 		btrfs_bio->csum_allocated = NULL;
 		btrfs_bio->end_io = NULL;
-
-#ifdef CONFIG_BLK_CGROUP
-	/* FIXME, put this into bio_clone_bioset */
-	if (bio->bi_css)
-		bio_associate_blkcg(new, bio->bi_css);
-#endif
 	}
 	return new;
 }