Message ID: 20170830152326.vil3fhsrecp2ccql@destiny (mailing list archive)
State: New, archived
On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Wed, Aug 30, 2017 at 06:04:26PM +0300, Amir Goldstein wrote:
>> Sorry for the noise on the xfs list, I meant to CC fsdevel.
>>
>> On Wed, Aug 30, 2017 at 5:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > Hi all,
>> >
>> > This is the 2nd revision of the crash consistency patch set.
>> > The main thing that changed since v1 is my confidence in the failures
>> > reported by the test, along with some more debugging options for
>> > running the test tools.
>> >
>> > I've collected these patches, which have been sitting in Josef Bacik's
>> > tree for a few years, and kicked them a bit into shape.
>> > The dm-log-writes target has been merged in kernel v4.1, see:
>> > https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/log-writes.txt
>> >
>> > For this posting, I kept the random seeds constant for the test.
>> > I set these constant seeds after running with a random seed for a little
>> > while and getting failure reports. With the current values in the test
>> > I was able to reproduce failures with high probability on xfs, ext4 and btrfs.
>> > The probability of reproducing the failure is higher on a spinning disk.
>> >
>
> I'd rather we make it as evil as possible. As long as we're printing out the
> seed that was used in the output, we can go in and manually change the test
> to use the same seed over and over again if we need to debug a problem.

Yeah, that's what I did, but then I found values that reproduce a problem,
so maybe it's worth holding on to these values until the bugs are fixed
upstream, and then keeping them as regression tests.

Anyway, I can keep these presets commented out, or run the test twice,
once with presets and once with a random seed, whatever Eryu decides.

>
>> > There is an outstanding problem with the test - when I run it with
>> > kvm-xfstests, the test halts and I get a soft lockup of log_writes_kthread.
>> > I suppose it's a bug in dm-log-writes with some kernel config or with virtio.
>> > I wasn't able to determine the reason and have little time to debug this.
>> >
>> > Since dm-log-writes is in the upstream kernel anyway, I don't think a bug
>> > in dm-log-writes for a certain config is a reason to block this xfstest
>> > from being merged.
>> > Anyway, I would be glad if someone could take a look at the soft lockup
>> > issue. Josef?
>> >
>
> Yeah, can you give this a try and see if the soft lockup goes away?
>

It does go away. Thanks!
Now something's wrong with the log.
It gets corrupted in most of the test runs, something like this:

replaying 17624@158946: sector 8651296, size 4096, flags 0
replaying 17625@158955: sector 0, size 0, flags 0
replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
Error allocating buffer 103079215104 entry 17626

I'll look into it.

Amir.
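For readers coming to the thread cold: the "replaying N@M: sector ..., size ..., flags ..."
lines above are the replay tool walking the write log entry by entry, which is why a
corrupted entry shows up as an absurd size that cannot be allocated. A rough sketch of the
workflow, paraphrasing the log-writes.txt document linked in the cover letter; the device
variables, the mark names and the exact replay-log flags here are illustrative, not taken
from the test script:

# Sketch paraphrasing Documentation/device-mapper/log-writes.txt;
# $TEST_DEV, $LOGWRITES_DEV and the mark names are placeholders.
TABLE="0 $(blockdev --getsz $TEST_DEV) log-writes $TEST_DEV $LOGWRITES_DEV"
dmsetup create log --table "$TABLE"

mkfs.xfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs          # named checkpoint in the write log
mount /dev/mapper/log /mnt/test
# ... run fsx or any other workload against /mnt/test ...
dmsetup message log 0 mark test_end      # checkpoint after the workload
umount /mnt/test
dmsetup remove log

# Replay the logged writes onto $TEST_DEV, entry by entry; the verbose
# output is where the "replaying N@M: ..." lines quoted above come from.
replay-log -v --log $LOGWRITES_DEV --replay $TEST_DEV --end-mark test_end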
On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
> It does go away. Thanks!
> Now something's wrong with the log.
> It gets corrupted in most of the test runs, something like this:
>
> replaying 17624@158946: sector 8651296, size 4096, flags 0
> replaying 17625@158955: sector 0, size 0, flags 0
> replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
> Error allocating buffer 103079215104 entry 17626
>
> I'll look into it

Oh, are the devices 4k sector size devices? I fucked up 4k sector size support; I
sent some patches to fix it but they haven't been integrated yet, I'll poke those
again. They are in my dm-log-writes-fixes branch in my btrfs-next tree on
kernel.org.

Thanks,

Josef
On Wed, Aug 30, 2017 at 9:55 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
...
>> replaying 17624@158946: sector 8651296, size 4096, flags 0
>> replaying 17625@158955: sector 0, size 0, flags 0
>> replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
>> Error allocating buffer 103079215104 entry 17626
>>
>> I'll look into it
>
> Oh, are the devices 4k sector size devices? I fucked up 4k sector size support; I
> sent some patches to fix it but they haven't been integrated yet, I'll poke those
> again. They are in my dm-log-writes-fixes branch in my btrfs-next tree on
> kernel.org. Thanks,
>

No, they are just virtio devices in kvm backed by my SSD LV, on which the same
test works just fine when not run inside kvm.
On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
> > I'd rather we make it as evil as possible. As long as we're printing out the
> > seed that was used in the output, we can go in and manually change the test
> > to use the same seed over and over again if we need to debug a problem.
>
> Yeah, that's what I did, but then I found values that reproduce a problem,
> so maybe it's worth holding on to these values until the bugs are fixed
> upstream, and then keeping them as regression tests.
>
> Anyway, I can keep these presets commented out, or run the test twice,
> once with presets and once with a random seed, whatever Eryu decides.

My thought at first glance is to use a random seed; if a specific seed
reproduces something, maybe another targeted regression test can be added,
as you did for that ext4 corruption?

...
> Now something's wrong with the log.
> It gets corrupted in most of the test runs, something like this:
> ...
> Error allocating buffer 103079215104 entry 17626
>
> I'll look into it
>
> Amir.

The first 6 patches are all preparation work and seem fine, so I'll probably
push them out this week. But I may need more time to look into all these
log-writes dm target and fsx changes.

But since there still seem to be problems not sorted out (e.g. this
log-writes bug), I'd prefer, when they get merged, removing the auto group
for now until things settle down a bit.

Thanks,
Eryu
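For context, group membership in xfstests at the time was declared in a per-directory
group file, so keeping the merged test out of the auto group is a one-line matter. A
sketch follows; the test number and the other group names are placeholders:

# tests/generic/group (sketch; NNN and the group names are placeholders)
NNN log replay         # left out of 'auto' for now, so './check -g auto' skips it
# later, once dm-log-writes settles down:
# NNN auto log replay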
On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
> On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
...
>> Yeah, that's what I did, but then I found values that reproduce a problem,
>> so maybe it's worth holding on to these values until the bugs are fixed
>> upstream, and then keeping them as regression tests.
>>
>> Anyway, I can keep these presets commented out, or run the test twice,
>> once with presets and once with a random seed, whatever Eryu decides.
>
> My thought at first glance is to use a random seed; if a specific seed
> reproduces something, maybe another targeted regression test can be added,
> as you did for that ext4 corruption?
>

Sure.

Speaking of the ext4 corruption, I did not re-post that test with this series
because it's quite an ugly black box test. I figured that if the ext4 guys
took a look and understood the problem, they could write a more intelligent
test. OTOH, maybe it's better than nothing?

BTW, Josef, did/could you write a more intelligent test to catch the extent
crc bug that you fixed? If not, was it easy to reproduce with the provided
seed presets? And without them? I am asking to understand whether a regression
test for that bug is in order beyond random-seed fsx.

BTW2, the xfs bug I found is reproduced with reasonable likelihood with any
random seed. By using the provided presets, I was able to reduce the test run
time and the debug cycle considerably. I used NUM_FILES=2; NUM_OPS=31 to
reproduce at > 50% probability within seconds. So this bug doesn't require a
specialized regression test.

...
> The first 6 patches are all preparation work and seem fine, so I'll probably
> push them out this week. But I may need more time to look into all these
> log-writes dm target and fsx changes.
>
> But since there still seem to be problems not sorted out (e.g. this
> log-writes bug), I'd prefer, when they get merged, removing the auto group
> for now until things settle down a bit.
>

Good idea.

Anyway, I would be happy to see these tests used by N > 1 testers for a start.
If some version is merged so people can start pointing this big gun at their
file systems, I imagine more interesting bugs will surface.

Amir.
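To make the "presets vs. random seed" trade-off concrete, here is a sketch of how the
knobs could be wired in the test script. NUM_FILES and NUM_OPS are the names used above,
while the FSX_SEED variable and the seed handling shown are assumptions for illustration,
not the actual test code:

# Sketch only -- not the actual test script.
NUM_FILES=2                 # fewer parallel fsx files => much shorter runs
NUM_OPS=31                  # few ops per file => the xfs failure reproduces within seconds
# Default to a fresh random seed, but let a debugger pin a known-bad value:
seed=${FSX_SEED:-$RANDOM}
echo "fsx seed: $seed" >> $seqres.full   # always log the seed so a failure can be replayed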
On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
...
> The first 6 patches are all preparation work and seem fine, so I'll probably
> push them out this week. But I may need more time to look into all these
> log-writes dm target and fsx changes.
>
> But since there still seem to be problems not sorted out (e.g. this
> log-writes bug), I'd prefer, when they get merged, removing the auto group
> for now until things settle down a bit.
>

I don't object to removing the auto group, but keep in mind that this test is
opt-in anyway, because it requires defining LOGWRITES_DEV
(well, it SHOULD require it, I actually forgot to check for it...).

For now, it seems that the problem observed with kvm-xfstests is specific to
the kvm-qemu aio=threads configuration, so you shouldn't have any problems
trying out the test on a non-kvm setup.

Amir.
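The missing guard could look roughly like the following, using existing xfstests
conventions; the exact helper names the final series ends up with may differ:

# local.config on the tester's side (opt-in):
export LOGWRITES_DEV=/dev/sdX     # spare block device dedicated to the write log

# In the test preamble (sketch):
[ -z "$LOGWRITES_DEV" ] && _notrun "this test requires \$LOGWRITES_DEV to be set"
_require_block_device $LOGWRITES_DEV
_require_dm_target log-writes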
On Fri, Sep 01, 2017 at 10:29:38AM +0300, Amir Goldstein wrote:
> On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
> ...
> > But since there still seem to be problems not sorted out (e.g. this
> > log-writes bug), I'd prefer, when they get merged, removing the auto group
> > for now until things settle down a bit.
> >
>
> I don't object to removing the auto group, but keep in mind that this test is
> opt-in anyway, because it requires defining LOGWRITES_DEV

That's a good point.

> (well, it SHOULD require it, I actually forgot to check for it...)
>
> For now, it seems that the problem observed with kvm-xfstests is specific to
> the kvm-qemu aio=threads configuration, so you shouldn't have any problems
> trying out the test on a non-kvm setup.

Thanks for the heads-up! I'll run this test, look into the code closely and
see what the best option is.

Thanks,
Eryu
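For anyone trying to reproduce the kvm-only soft lockup, the relevant knob is the aio
mode of the guest's virtio disk. A minimal qemu sketch; the image path, memory size and
the rest of the command line are placeholders:

# The configuration the lockup was reported against: virtio disk, thread-pool AIO.
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/dev/vg/test_lv,format=raw,if=virtio,cache=none,aio=threads
# Compare against Linux native AIO (needs cache=none) to confirm the problem
# really is specific to aio=threads:
#   ... -drive file=/dev/vg/test_lv,format=raw,if=virtio,cache=none,aio=native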
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index a1da0eb..b900758 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -345,6 +345,7 @@ static int log_writes_kthread(void *arg)
 		struct pending_block *block = NULL;
 		int ret;
 
+		cond_resched();
 		spin_lock_irq(&lc->blocks_lock);
 		if (!list_empty(&lc->logging_blocks)) {
 			block = list_first_entry(&lc->logging_blocks,