Message ID | 20130322201151.GB5357@blackbox.djwong.org (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote: > The new writethrough strategy for dm-cache issues a bio to the origin device, > remaps the bio to the cache device, and issues the bio to the cache device. > However, the block layer modifies bi_sector and bi_size, so we need to preserve > these or else nothing gets written to the cache (bi_size == 0). This fixes the > problem where someone writes a block through the cache, but a subsequent reread > (from the cache) returns old contents. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- [...] This is not the correct way to submit a change to stable. See Documentation/stable_kernel_rules.txt Ben.
On Fri, Mar 22, 2013 at 08:21:56PM +0000, Ben Hutchings wrote: > On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote: > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > remaps the bio to the cache device, and issues the bio to the cache device. > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > problem where someone writes a block through the cache, but a subsequent reread > > (from the cache) returns old contents. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > [...] > > This is not the correct way to submit a change to stable. See > Documentation/stable_kernel_rules.txt Frankly, I'm not sure why agk sent the 3.9-fixes pull request to -stable. 7 of the 10 commits are for dm-cache, and dm-cache is a new feature for 3.9. I probably could have dropped -stable from the cc: list when I started complaining about bugs. :) Sorry about the noise. --D > > Ben. > > -- > Ben Hutchings > Make three consecutive correct guesses and you will be considered an expert. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22 2013 at 4:50pm -0400, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Fri, Mar 22, 2013 at 08:21:56PM +0000, Ben Hutchings wrote: > > On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote: > > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > > remaps the bio to the cache device, and issues the bio to the cache device. > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > > problem where someone writes a block through the cache, but a subsequent reread > > > (from the cache) returns old contents. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > [...] > > > > This is not the correct way to submit a change to stable. See > > Documentation/stable_kernel_rules.txt > > Frankly, I'm not sure why agk sent the 3.9-fixes pull request to -stable. 7 of > the 10 commits are for dm-cache, and dm-cache is a new feature for 3.9. I > probably could have dropped -stable from the cc: list when I started > complaining about bugs. :) Alasdair's pull mail was sent to stable because commit f046f89a99c ("dm thin: fix discard corruption") cc'd stable and was included in the pull request. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22 2013 at 4:11pm -0400, Darrick J. Wong <darrick.wong@oracle.com> wrote: > The new writethrough strategy for dm-cache issues a bio to the origin device, > remaps the bio to the cache device, and issues the bio to the cache device. > However, the block layer modifies bi_sector and bi_size, so we need to preserve > these or else nothing gets written to the cache (bi_size == 0). This fixes the > problem where someone writes a block through the cache, but a subsequent reread > (from the cache) returns old contents. Your writethrough blkid test results are certainly strange. But I'm not aware of where the block layer would modify bi_size and bi_sector; please elaborate. I cannot reproduce your original report. I developed 'test_writethrough_ext4_uuids_match', apologies for the ruby code: def test_writethrough_ext4_uuids_match size = meg(10) # wipe the origin to ensure we don't accidentally have the same # data on it. with_standard_linear(:data_size => size) do |origin| wipe_device(origin) end uuid = "deadbeef-cafe-dead-beef-cafedeadbeef" # format the cache device with a specific uuid with_standard_cache(:format => true, :io_mode => :writethrough, :data_size => size) do |cache| fs = FS::file_system(:ext4, cache) fs.format(:uuid => uuid) FS::assert_fs_uuid(uuid, cache) end # origin should have the same uuid as the cache with_standard_linear(:data_size => size) do |origin| FS::assert_fs_uuid(uuid, origin) end end This test was committed to the 'devel' branch of my thinp-test-suite tree: git://github.com/snitm/thinp-test-suite.git Also the existing 'test_writethrough' test works fine. So for now: Nacked-by: Mike Snitzer <snitzer@redhat.com> -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote: > On Fri, Mar 22 2013 at 4:11pm -0400, > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > remaps the bio to the cache device, and issues the bio to the cache device. > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > problem where someone writes a block through the cache, but a subsequent reread > > (from the cache) returns old contents. > > Your writethrough blkid test results are certainly strange. But I'm not > aware of where the block layer would modify bi_size and bi_sector; > please elaborate. > > I cannot reproduce your original report. I developed > 'test_writethrough_ext4_uuids_match', apologies for the ruby code: Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in Linus' tree (post 3.9-rc3). This is the full output of dmsetup table: moocache-blocks: 0 1039360 linear 8:16 9088 moocache-metadata: 0 8704 linear 8:16 384 moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768 253:0 -> moocache-metadata and 253:1 -> moocache-blocks. I'm curious what your setup is... --D > > def test_writethrough_ext4_uuids_match > size = meg(10) > > # wipe the origin to ensure we don't accidentally have the same > # data on it. > with_standard_linear(:data_size => size) do |origin| > wipe_device(origin) > end > > uuid = "deadbeef-cafe-dead-beef-cafedeadbeef" > > # format the cache device with a specific uuid > with_standard_cache(:format => true, > :io_mode => :writethrough, > :data_size => size) do |cache| > fs = FS::file_system(:ext4, cache) > fs.format(:uuid => uuid) > FS::assert_fs_uuid(uuid, cache) > end > > # origin should have the same uuid as the cache > with_standard_linear(:data_size => size) do |origin| > FS::assert_fs_uuid(uuid, origin) > end > end > > This test was committed to the 'devel' branch of my thinp-test-suite > tree: git://github.com/snitm/thinp-test-suite.git > > Also the existing 'test_writethrough' test works fine. > > So for now: > > Nacked-by: Mike Snitzer <snitzer@redhat.com> > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22, 2013 at 01:11:51PM -0700, Darrick J. Wong wrote: > The new writethrough strategy for dm-cache issues a bio to the origin device, > remaps the bio to the cache device, and issues the bio to the cache device. > However, the block layer modifies bi_sector and bi_size, so we need to preserve > these or else nothing gets written to the cache (bi_size == 0). This fixes the > problem where someone writes a block through the cache, but a subsequent reread > (from the cache) returns old contents. If this needs doing here, please use dm_bio_record() and dm_bio_restore(). E.g. Look at how dm-raid1.c does something similar. Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22 2013 at 7:16pm -0400, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote: > > On Fri, Mar 22 2013 at 4:11pm -0400, > > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > > remaps the bio to the cache device, and issues the bio to the cache device. > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > > problem where someone writes a block through the cache, but a subsequent reread > > > (from the cache) returns old contents. > > > > Your writethrough blkid test results are certainly strange. But I'm not > > aware of where the block layer would modify bi_size and bi_sector; > > please elaborate. > > > > I cannot reproduce your original report. I developed > > 'test_writethrough_ext4_uuids_match', apologies for the ruby code: > > Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in > Linus' tree (post 3.9-rc3). This is the full output of dmsetup table: > > moocache-blocks: 0 1039360 linear 8:16 9088 > moocache-metadata: 0 8704 linear 8:16 384 > moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768 > > 253:0 -> moocache-metadata and 253:1 -> moocache-blocks. > > I'm curious what your setup is... Here are the tables: test-dev-238267: 0 8192 linear /dev/stec/metadata 0 test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192 test-dev-655144: 0 20480 linear /dev/spindle/data 0 0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0 And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the same thresholds you're using, full status output: 0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512 So the big difference is the thinp-test-suite uses intermediate linear DM layers above the slower sd device (spindle/data) -- whereas in your setup the origin device is direct to sd (8:0). I'll re-run with the origin directly on sd in the morning and will report back. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote: > So the big difference is the thinp-test-suite uses intermediate linear > DM layers above the slower sd device (spindle/data) -- whereas in your > setup the origin device is direct to sd (8:0). I would think the difference is because DM issues (edited) copies of incoming bios to the layers underneath, so changes don't propagate back up. Darrick: if you retest with a dm linear device, does the problem go away? Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote: > On Fri, Mar 22 2013 at 7:16pm -0400, > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote: > > > On Fri, Mar 22 2013 at 4:11pm -0400, > > > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > > > remaps the bio to the cache device, and issues the bio to the cache device. > > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > > > problem where someone writes a block through the cache, but a subsequent reread > > > > (from the cache) returns old contents. > > > > > > Your writethrough blkid test results are certainly strange. But I'm not > > > aware of where the block layer would modify bi_size and bi_sector; > > > please elaborate. > > > > > > I cannot reproduce your original report. I developed > > > 'test_writethrough_ext4_uuids_match', apologies for the ruby code: > > > > Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in > > Linus' tree (post 3.9-rc3). This is the full output of dmsetup table: > > > > moocache-blocks: 0 1039360 linear 8:16 9088 > > moocache-metadata: 0 8704 linear 8:16 384 > > moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768 > > > > 253:0 -> moocache-metadata and 253:1 -> moocache-blocks. > > > > I'm curious what your setup is... > > Here are the tables: > test-dev-238267: 0 8192 linear /dev/stec/metadata 0 > test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192 > test-dev-655144: 0 20480 linear /dev/spindle/data 0 > 0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0 > > And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the > same thresholds you're using, full status output: > 0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512 > > So the big difference is the thinp-test-suite uses intermediate linear > DM layers above the slower sd device (spindle/data) -- whereas in your > setup the origin device is direct to sd (8:0). > > I'll re-run with the origin directly on sd in the morning and will > report back. Interesting ... if I set up this: # echo "0 67108864 linear /dev/sda 0" | dmsetup create origin And then repeat the test, but using /dev/mapper/origin as the origin instead of /dev/sda, the problem goes away. --D -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Sat, Mar 23 2013 at 1:15am -0400, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote: > > On Fri, Mar 22 2013 at 7:16pm -0400, > > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote: > > > > On Fri, Mar 22 2013 at 4:11pm -0400, > > > > Darrick J. Wong <darrick.wong@oracle.com> wrote: > > > > > > > > > The new writethrough strategy for dm-cache issues a bio to the origin device, > > > > > remaps the bio to the cache device, and issues the bio to the cache device. > > > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve > > > > > these or else nothing gets written to the cache (bi_size == 0). This fixes the > > > > > problem where someone writes a block through the cache, but a subsequent reread > > > > > (from the cache) returns old contents. > > > > > > > > Your writethrough blkid test results are certainly strange. But I'm not > > > > aware of where the block layer would modify bi_size and bi_sector; > > > > please elaborate. > > > > > > > > I cannot reproduce your original report. I developed > > > > 'test_writethrough_ext4_uuids_match', apologies for the ruby code: > > > > > > Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in > > > Linus' tree (post 3.9-rc3). This is the full output of dmsetup table: > > > > > > moocache-blocks: 0 1039360 linear 8:16 9088 > > > moocache-metadata: 0 8704 linear 8:16 384 > > > moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768 > > > > > > 253:0 -> moocache-metadata and 253:1 -> moocache-blocks. > > > > > > I'm curious what your setup is... > > > > Here are the tables: > > test-dev-238267: 0 8192 linear /dev/stec/metadata 0 > > test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192 > > test-dev-655144: 0 20480 linear /dev/spindle/data 0 > > 0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0 > > > > And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the > > same thresholds you're using, full status output: > > 0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512 > > > > So the big difference is the thinp-test-suite uses intermediate linear > > DM layers above the slower sd device (spindle/data) -- whereas in your > > setup the origin device is direct to sd (8:0). > > > > I'll re-run with the origin directly on sd in the morning and will > > report back. > > Interesting ... if I set up this: > > # echo "0 67108864 linear /dev/sda 0" | dmsetup create origin > > And then repeat the test, but using /dev/mapper/origin as the origin instead > of /dev/sda, the problem goes away. Using the extra dm-linear layer is implicitly leveraging the DM core's bio cloning to restore the original bio that was sent to the linear target. But even after having changed my test to use /dev/sdb for the origin device I cannot reproduce the problem you've reported. Do you have any further details on how/why the bios are being altered? Are you reliably hitting partial completions within the origin's driver? If so how? Having looked at this for a bit it seems pretty clear writethrough_endio is missing partial completion handling, e.g.: if (!bio_flagged(bio, BIO_UPTODATE) && !err) err = -EIO; But I haven't yet come to terms with what the partial completion handling implementation needs to be for the writethrough support. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Mar 22, 2013 at 10:15:49PM -0700, Darrick J. Wong wrote: > Interesting ... if I set up this: > # echo "0 67108864 linear /dev/sda 0" | dmsetup create origin > And then repeat the test, but using /dev/mapper/origin as the origin instead > of /dev/sda, the problem goes away. So that looks like an easy and correct workaround for now. Alasdair -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Hi Darrick, On Fri, Mar 22, 2013 at 01:11:51PM -0700, Darrick J. Wong wrote: > The new writethrough strategy for dm-cache issues a bio to the origin device, > remaps the bio to the cache device, and issues the bio to the cache device. > However, the block layer modifies bi_sector and bi_size, so we need to preserve > these or else nothing gets written to the cache (bi_size == 0). This fixes the > problem where someone writes a block through the cache, but a subsequent reread > (from the cache) returns old contents. Thanks for diagnosing this, as you may have gathered none of the hardware we're testing on does this. I've pushed a patch to dm-devel that hopefully will make it's way upstream very quickly. I used dm_bio_record/restore, since there are many more fields that underlying drivers make tweak in the bio. - Joe -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c index 66120bd..0db0ad2 100644 --- a/drivers/md/dm-cache-target.c +++ b/drivers/md/dm-cache-target.c @@ -205,6 +205,8 @@ struct per_bio_data { struct cache *cache; dm_cblock_t cblock; bio_end_io_t *saved_bi_end_io; + unsigned int bi_size; + sector_t bi_sector; }; struct dm_cache_migration { @@ -643,6 +645,8 @@ static void writethrough_endio(struct bio *bio, int err) return; } + bio->bi_sector = pb->bi_sector; + bio->bi_size = pb->bi_size; remap_to_cache(pb->cache, bio, pb->cblock); /* @@ -667,6 +671,12 @@ static void remap_to_origin_then_cache(struct cache *cache, struct bio *bio, pb->cache = cache; pb->cblock = cblock; pb->saved_bi_end_io = bio->bi_end_io; + /* + * The block layer modifies bi_size and bi_sector, so we must save + * them for when we re-issue the bio against the cache device. + */ + pb->bi_size = bio->bi_size; + pb->bi_sector = bio->bi_sector; bio->bi_end_io = writethrough_endio; remap_to_origin_clear_discard(pb->cache, bio, oblock);
The new writethrough strategy for dm-cache issues a bio to the origin device, remaps the bio to the cache device, and issues the bio to the cache device. However, the block layer modifies bi_sector and bi_size, so we need to preserve these or else nothing gets written to the cache (bi_size == 0). This fixes the problem where someone writes a block through the cache, but a subsequent reread (from the cache) returns old contents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- drivers/md/dm-cache-target.c | 10 ++++++++++ 1 file changed, 10 insertions(+) -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel