diff mbox

dm: dm-cache fails to write the cache device in writethrough mode

Message ID 20130322201151.GB5357@blackbox.djwong.org (mailing list archive)
State Deferred, archived
Headers show

Commit Message

Darrick J. Wong March 22, 2013, 8:11 p.m. UTC
The new writethrough strategy for dm-cache issues a bio to the origin device,
remaps the bio to the cache device, and issues the bio to the cache device.
However, the block layer modifies bi_sector and bi_size, so we need to preserve
these or else nothing gets written to the cache (bi_size == 0).  This fixes the
problem where someone writes a block through the cache, but a subsequent reread
(from the cache) returns old contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 drivers/md/dm-cache-target.c |   10 ++++++++++
 1 file changed, 10 insertions(+)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Ben Hutchings March 22, 2013, 8:21 p.m. UTC | #1
On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote:
> The new writethrough strategy for dm-cache issues a bio to the origin device,
> remaps the bio to the cache device, and issues the bio to the cache device.
> However, the block layer modifies bi_sector and bi_size, so we need to preserve
> these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> problem where someone writes a block through the cache, but a subsequent reread
> (from the cache) returns old contents.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
[...]

This is not the correct way to submit a change to stable.  See
Documentation/stable_kernel_rules.txt

Ben.
Darrick J. Wong March 22, 2013, 8:50 p.m. UTC | #2
On Fri, Mar 22, 2013 at 08:21:56PM +0000, Ben Hutchings wrote:
> On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote:
> > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > remaps the bio to the cache device, and issues the bio to the cache device.
> > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > problem where someone writes a block through the cache, but a subsequent reread
> > (from the cache) returns old contents.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> [...]
> 
> This is not the correct way to submit a change to stable.  See
> Documentation/stable_kernel_rules.txt

Frankly, I'm not sure why agk sent the 3.9-fixes pull request to -stable.  7 of
the 10 commits are for dm-cache, and dm-cache is a new feature for 3.9.  I
probably could have dropped -stable from the cc: list when I started
complaining about bugs.  :)

Sorry about the noise.

--D
> 
> Ben.
> 
> -- 
> Ben Hutchings
> Make three consecutive correct guesses and you will be considered an expert.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 22, 2013, 9:35 p.m. UTC | #3
On Fri, Mar 22 2013 at  4:50pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Fri, Mar 22, 2013 at 08:21:56PM +0000, Ben Hutchings wrote:
> > On Fri, 2013-03-22 at 13:11 -0700, Darrick J. Wong wrote:
> > > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > > remaps the bio to the cache device, and issues the bio to the cache device.
> > > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > > problem where someone writes a block through the cache, but a subsequent reread
> > > (from the cache) returns old contents.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > [...]
> > 
> > This is not the correct way to submit a change to stable.  See
> > Documentation/stable_kernel_rules.txt
> 
> Frankly, I'm not sure why agk sent the 3.9-fixes pull request to -stable.  7 of
> the 10 commits are for dm-cache, and dm-cache is a new feature for 3.9.  I
> probably could have dropped -stable from the cc: list when I started
> complaining about bugs.  :)

Alasdair's pull mail was sent to stable because commit f046f89a99c ("dm
thin: fix discard corruption") cc'd stable and was included in the pull
request.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 22, 2013, 10:34 p.m. UTC | #4
On Fri, Mar 22 2013 at  4:11pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> The new writethrough strategy for dm-cache issues a bio to the origin device,
> remaps the bio to the cache device, and issues the bio to the cache device.
> However, the block layer modifies bi_sector and bi_size, so we need to preserve
> these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> problem where someone writes a block through the cache, but a subsequent reread
> (from the cache) returns old contents.

Your writethrough blkid test results are certainly strange.  But I'm not
aware of where the block layer would modify bi_size and bi_sector;
please elaborate.

I cannot reproduce your original report.  I developed
'test_writethrough_ext4_uuids_match', apologies for the ruby code:

  def test_writethrough_ext4_uuids_match
    size = meg(10)
  
    # wipe the origin to ensure we don't accidentally have the same
    # data on it.
    with_standard_linear(:data_size => size) do |origin|
      wipe_device(origin)
    end

    uuid = "deadbeef-cafe-dead-beef-cafedeadbeef"

    # format the cache device with a specific uuid
    with_standard_cache(:format => true,
                        :io_mode => :writethrough,
                        :data_size => size) do |cache|
      fs = FS::file_system(:ext4, cache)
      fs.format(:uuid => uuid)
      FS::assert_fs_uuid(uuid, cache)
    end

    # origin should have the same uuid as the cache
    with_standard_linear(:data_size => size) do |origin|
      FS::assert_fs_uuid(uuid, origin)
    end
  end

This test was committed to the 'devel' branch of my thinp-test-suite
tree: git://github.com/snitm/thinp-test-suite.git

Also the existing 'test_writethrough' test works fine.

So for now:

Nacked-by: Mike Snitzer <snitzer@redhat.com>

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Darrick J. Wong March 22, 2013, 11:16 p.m. UTC | #5
On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote:
> On Fri, Mar 22 2013 at  4:11pm -0400,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > remaps the bio to the cache device, and issues the bio to the cache device.
> > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > problem where someone writes a block through the cache, but a subsequent reread
> > (from the cache) returns old contents.
> 
> Your writethrough blkid test results are certainly strange.  But I'm not
> aware of where the block layer would modify bi_size and bi_sector;
> please elaborate.
> 
> I cannot reproduce your original report.  I developed
> 'test_writethrough_ext4_uuids_match', apologies for the ruby code:

Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in
Linus' tree (post 3.9-rc3).  This is the full output of dmsetup table:

moocache-blocks: 0 1039360 linear 8:16 9088
moocache-metadata: 0 8704 linear 8:16 384
moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768

253:0 -> moocache-metadata and 253:1 -> moocache-blocks.

I'm curious what your setup is...

--D

> 
>   def test_writethrough_ext4_uuids_match
>     size = meg(10)
>   
>     # wipe the origin to ensure we don't accidentally have the same
>     # data on it.
>     with_standard_linear(:data_size => size) do |origin|
>       wipe_device(origin)
>     end
> 
>     uuid = "deadbeef-cafe-dead-beef-cafedeadbeef"
> 
>     # format the cache device with a specific uuid
>     with_standard_cache(:format => true,
>                         :io_mode => :writethrough,
>                         :data_size => size) do |cache|
>       fs = FS::file_system(:ext4, cache)
>       fs.format(:uuid => uuid)
>       FS::assert_fs_uuid(uuid, cache)
>     end
> 
>     # origin should have the same uuid as the cache
>     with_standard_linear(:data_size => size) do |origin|
>       FS::assert_fs_uuid(uuid, origin)
>     end
>   end
> 
> This test was committed to the 'devel' branch of my thinp-test-suite
> tree: git://github.com/snitm/thinp-test-suite.git
> 
> Also the existing 'test_writethrough' test works fine.
> 
> So for now:
> 
> Nacked-by: Mike Snitzer <snitzer@redhat.com>
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Alasdair G Kergon March 23, 2013, 3 a.m. UTC | #6
On Fri, Mar 22, 2013 at 01:11:51PM -0700, Darrick J. Wong wrote:
> The new writethrough strategy for dm-cache issues a bio to the origin device,
> remaps the bio to the cache device, and issues the bio to the cache device.
> However, the block layer modifies bi_sector and bi_size, so we need to preserve
> these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> problem where someone writes a block through the cache, but a subsequent reread
> (from the cache) returns old contents.

If this needs doing here, please use dm_bio_record() and dm_bio_restore().
E.g. Look at how dm-raid1.c does something similar.
 
Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 23, 2013, 3:27 a.m. UTC | #7
On Fri, Mar 22 2013 at  7:16pm -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote:
> > On Fri, Mar 22 2013 at  4:11pm -0400,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > > remaps the bio to the cache device, and issues the bio to the cache device.
> > > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > > problem where someone writes a block through the cache, but a subsequent reread
> > > (from the cache) returns old contents.
> > 
> > Your writethrough blkid test results are certainly strange.  But I'm not
> > aware of where the block layer would modify bi_size and bi_sector;
> > please elaborate.
> > 
> > I cannot reproduce your original report.  I developed
> > 'test_writethrough_ext4_uuids_match', apologies for the ruby code:
> 
> Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in
> Linus' tree (post 3.9-rc3).  This is the full output of dmsetup table:
> 
> moocache-blocks: 0 1039360 linear 8:16 9088
> moocache-metadata: 0 8704 linear 8:16 384
> moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768
> 
> 253:0 -> moocache-metadata and 253:1 -> moocache-blocks.
> 
> I'm curious what your setup is...

Here are the tables:
test-dev-238267: 0 8192 linear /dev/stec/metadata 0
test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192
test-dev-655144: 0 20480 linear /dev/spindle/data 0
0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0

And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the
same thresholds you're using, full status output:
0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512

So the big difference is the thinp-test-suite uses intermediate linear
DM layers above the slower sd device (spindle/data) -- whereas in your
setup the origin device is direct to sd (8:0).

I'll re-run with the origin directly on sd in the morning and will
report back.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Alasdair G Kergon March 23, 2013, 3:48 a.m. UTC | #8
On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote:
> So the big difference is the thinp-test-suite uses intermediate linear
> DM layers above the slower sd device (spindle/data) -- whereas in your
> setup the origin device is direct to sd (8:0).
 
I would think the difference is because DM issues (edited) copies of
incoming bios to the layers underneath, so changes don't propagate back
up.

Darrick: if you retest with a dm linear device, does the problem go
away?

Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Darrick J. Wong March 23, 2013, 5:15 a.m. UTC | #9
On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote:
> On Fri, Mar 22 2013 at  7:16pm -0400,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote:
> > > On Fri, Mar 22 2013 at  4:11pm -0400,
> > > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > 
> > > > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > > > remaps the bio to the cache device, and issues the bio to the cache device.
> > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > > > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > > > problem where someone writes a block through the cache, but a subsequent reread
> > > > (from the cache) returns old contents.
> > > 
> > > Your writethrough blkid test results are certainly strange.  But I'm not
> > > aware of where the block layer would modify bi_size and bi_sector;
> > > please elaborate.
> > > 
> > > I cannot reproduce your original report.  I developed
> > > 'test_writethrough_ext4_uuids_match', apologies for the ruby code:
> > 
> > Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in
> > Linus' tree (post 3.9-rc3).  This is the full output of dmsetup table:
> > 
> > moocache-blocks: 0 1039360 linear 8:16 9088
> > moocache-metadata: 0 8704 linear 8:16 384
> > moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768
> > 
> > 253:0 -> moocache-metadata and 253:1 -> moocache-blocks.
> > 
> > I'm curious what your setup is...
> 
> Here are the tables:
> test-dev-238267: 0 8192 linear /dev/stec/metadata 0
> test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192
> test-dev-655144: 0 20480 linear /dev/spindle/data 0
> 0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0
> 
> And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the
> same thresholds you're using, full status output:
> 0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512
> 
> So the big difference is the thinp-test-suite uses intermediate linear
> DM layers above the slower sd device (spindle/data) -- whereas in your
> setup the origin device is direct to sd (8:0).
> 
> I'll re-run with the origin directly on sd in the morning and will
> report back.

Interesting ... if I set up this:

# echo "0 67108864 linear /dev/sda 0" | dmsetup create origin

And then repeat the test, but using /dev/mapper/origin as the origin instead
of /dev/sda, the problem goes away.

--D

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer March 23, 2013, 9:08 p.m. UTC | #10
On Sat, Mar 23 2013 at  1:15am -0400,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Fri, Mar 22, 2013 at 11:27:16PM -0400, Mike Snitzer wrote:
> > On Fri, Mar 22 2013 at  7:16pm -0400,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > On Fri, Mar 22, 2013 at 06:34:28PM -0400, Mike Snitzer wrote:
> > > > On Fri, Mar 22 2013 at  4:11pm -0400,
> > > > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > 
> > > > > The new writethrough strategy for dm-cache issues a bio to the origin device,
> > > > > remaps the bio to the cache device, and issues the bio to the cache device.
> > > > > However, the block layer modifies bi_sector and bi_size, so we need to preserve
> > > > > these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> > > > > problem where someone writes a block through the cache, but a subsequent reread
> > > > > (from the cache) returns old contents.
> > > > 
> > > > Your writethrough blkid test results are certainly strange.  But I'm not
> > > > aware of where the block layer would modify bi_size and bi_sector;
> > > > please elaborate.
> > > > 
> > > > I cannot reproduce your original report.  I developed
> > > > 'test_writethrough_ext4_uuids_match', apologies for the ruby code:
> > > 
> > > Hmm... I'm building my kernels off 0a7e453103b9718d357688b83bb968ee108cc874 in
> > > Linus' tree (post 3.9-rc3).  This is the full output of dmsetup table:
> > > 
> > > moocache-blocks: 0 1039360 linear 8:16 9088
> > > moocache-metadata: 0 8704 linear 8:16 384
> > > moocache: 0 67108864 cache 253:0 253:1 8:0 512 1 writethrough default 4 random_threshold 4 sequential_threshold 32768
> > > 
> > > 253:0 -> moocache-metadata and 253:1 -> moocache-blocks.
> > > 
> > > I'm curious what your setup is...
> > 
> > Here are the tables:
> > test-dev-238267: 0 8192 linear /dev/stec/metadata 0
> > test-dev-255913: 0 2097152 linear /dev/stec/metadata 8192
> > test-dev-655144: 0 20480 linear /dev/spindle/data 0
> > 0 20480 cache /dev/mapper/test-dev-238267 /dev/mapper/test-dev-255913 /dev/mapper/test-dev-655144 512 1 writethrough default 0
> > 
> > And I tweaked 'test_writethrough_ext4_uuids_match' to make sure to use the
> > same thresholds you're using, full status output:
> > 0 20480 cache 15/1024 0 19 0 0 0 0 0 0 1 writethrough 2 migration_threshold 32768 4 random_threshold 4 sequential_threshold 512
> > 
> > So the big difference is the thinp-test-suite uses intermediate linear
> > DM layers above the slower sd device (spindle/data) -- whereas in your
> > setup the origin device is direct to sd (8:0).
> > 
> > I'll re-run with the origin directly on sd in the morning and will
> > report back.
> 
> Interesting ... if I set up this:
> 
> # echo "0 67108864 linear /dev/sda 0" | dmsetup create origin
> 
> And then repeat the test, but using /dev/mapper/origin as the origin instead
> of /dev/sda, the problem goes away.

Using the extra dm-linear layer is implicitly leveraging the DM core's
bio cloning to restore the original bio that was sent to the linear
target.

But even after having changed my test to use /dev/sdb for the origin
device I cannot reproduce the problem you've reported.  Do you have any
further details on how/why the bios are being altered?  Are you
reliably hitting partial completions within the origin's driver?  If so
how?

Having looked at this for a bit it seems pretty clear writethrough_endio
is missing partial completion handling, e.g.:
   if (!bio_flagged(bio, BIO_UPTODATE) && !err)
      err = -EIO;

But I haven't yet come to terms with what the partial completion
handling implementation needs to be for the writethrough support.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Alasdair G Kergon March 23, 2013, 9:13 p.m. UTC | #11
On Fri, Mar 22, 2013 at 10:15:49PM -0700, Darrick J. Wong wrote:
> Interesting ... if I set up this:
> # echo "0 67108864 linear /dev/sda 0" | dmsetup create origin
> And then repeat the test, but using /dev/mapper/origin as the origin instead
> of /dev/sda, the problem goes away.
 
So that looks like an easy and correct workaround for now.

Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Joe Thornber March 25, 2013, 10:28 a.m. UTC | #12
Hi Darrick,

On Fri, Mar 22, 2013 at 01:11:51PM -0700, Darrick J. Wong wrote:
> The new writethrough strategy for dm-cache issues a bio to the origin device,
> remaps the bio to the cache device, and issues the bio to the cache device.
> However, the block layer modifies bi_sector and bi_size, so we need to preserve
> these or else nothing gets written to the cache (bi_size == 0).  This fixes the
> problem where someone writes a block through the cache, but a subsequent reread
> (from the cache) returns old contents.

Thanks for diagnosing this, as you may have gathered none of the
hardware we're testing on does this.  I've pushed a patch to dm-devel
that hopefully will make it's way upstream very quickly.  I used
dm_bio_record/restore, since there are many more fields that
underlying drivers make tweak in the bio.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 66120bd..0db0ad2 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -205,6 +205,8 @@  struct per_bio_data {
 	struct cache *cache;
 	dm_cblock_t cblock;
 	bio_end_io_t *saved_bi_end_io;
+	unsigned int bi_size;
+	sector_t bi_sector;
 };
 
 struct dm_cache_migration {
@@ -643,6 +645,8 @@  static void writethrough_endio(struct bio *bio, int err)
 		return;
 	}
 
+	bio->bi_sector = pb->bi_sector;
+	bio->bi_size = pb->bi_size;
 	remap_to_cache(pb->cache, bio, pb->cblock);
 
 	/*
@@ -667,6 +671,12 @@  static void remap_to_origin_then_cache(struct cache *cache, struct bio *bio,
 	pb->cache = cache;
 	pb->cblock = cblock;
 	pb->saved_bi_end_io = bio->bi_end_io;
+	/*
+	 * The block layer modifies bi_size and bi_sector, so we must save
+	 * them for when we re-issue the bio against the cache device.
+	 */
+	pb->bi_size = bio->bi_size;
+	pb->bi_sector = bio->bi_sector;
 	bio->bi_end_io = writethrough_endio;
 
 	remap_to_origin_clear_discard(pb->cache, bio, oblock);