Message ID: 20170830152326.vil3fhsrecp2ccql@destiny (mailing list archive)
State: New, archived
On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Wed, Aug 30, 2017 at 06:04:26PM +0300, Amir Goldstein wrote:
>> Sorry for the noise on the xfs list, I meant to CC fsdevel.
>>
>> On Wed, Aug 30, 2017 at 5:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> > Hi all,
>> >
>> > This is the 2nd revision of the crash consistency patch set.
>> > The main thing that changed since v1 is my confidence in the failures
>> > reported by the test, along with some more debugging options for
>> > running the test tools.
>> >
>> > I've collected these patches, which have been sitting in Josef Bacik's
>> > tree for a few years, and kicked them a bit into shape.
>> > The dm-log-writes target has been merged in kernel v4.1, see:
>> > https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/log-writes.txt
>> >
>> > For this posting, I kept the random seeds constant for the test.
>> > I set these constant seeds after running with a random seed for a little
>> > while and getting failure reports. With the current values in the test
>> > I was able to reproduce failures with high probability on xfs, ext4 and btrfs.
>> > The probability of reproducing the failure is higher on a spinning disk.
>> >
>
> I'd rather we make it as evil as possible. As long as we're printing out the
> seed that was used in the output, we can go in and manually change the test
> to use the same seed over and over again if we need to debug a problem.

Yeah, that's what I did, but then I found values that reproduce a problem,
so maybe it's worth holding on to these values until the bugs are fixed
upstream, and then keeping them as regression tests.

Anyway, I can keep these presets commented out, or run the test twice,
once with presets and once with a random seed, whatever Eryu decides.

>
>> > There is an outstanding problem with the test - when I run it with
>> > kvm-xfstests, the test halts and I get a soft lockup of log_writes_kthread.
>> > I suppose it's a bug in dm-log-writes with some kernel config or with virtio.
>> > I wasn't able to determine the reason and have little time to debug this.
>> >
>> > Since dm-log-writes is in the upstream kernel anyway, I don't think a bug
>> > in dm-log-writes for a certain config is a reason to block this xfstest
>> > from being merged.
>> > Anyway, I would be glad if someone could take a look at the soft lockup
>> > issue. Josef?
>> >
>
> Yeah, can you give this a try and see if the soft lockup goes away?
>

It does go away. Thanks!
Now something's wrong with the log.
It gets corrupted in most of the test runs, something like this:

replaying 17624@158946: sector 8651296, size 4096, flags 0
replaying 17625@158955: sector 0, size 0, flags 0
replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
Error allocating buffer 103079215104 entry 17626

I'll look into it.

Amir.
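For readers coming to the thread cold: the "replaying N@M: sector ..., size ..., flags ..."
lines above are the replay tool walking the write log entry by entry, which is why a
corrupted entry shows up as an absurd size that cannot be allocated. A rough sketch of the
workflow, paraphrasing the log-writes.txt document linked in the cover letter; the device
variables, the mark names and the exact replay-log flags here are illustrative, not taken
from the test script:

# Sketch paraphrasing Documentation/device-mapper/log-writes.txt;
# $TEST_DEV, $LOGWRITES_DEV and the mark names are placeholders.
TABLE="0 $(blockdev --getsz $TEST_DEV) log-writes $TEST_DEV $LOGWRITES_DEV"
dmsetup create log --table "$TABLE"

mkfs.xfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs          # named checkpoint in the write log
mount /dev/mapper/log /mnt/test
# ... run fsx or any other workload against /mnt/test ...
dmsetup message log 0 mark test_end      # checkpoint after the workload
umount /mnt/test
dmsetup remove log

# Replay the logged writes onto $TEST_DEV, entry by entry; the verbose
# output is where the "replaying N@M: ..." lines quoted above come from.
replay-log -v --log $LOGWRITES_DEV --replay $TEST_DEV --end-mark test_end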
On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
> It does go away. Thanks!
> Now something's wrong with the log.
> It gets corrupted in most of the test runs, something like this:
>
> replaying 17624@158946: sector 8651296, size 4096, flags 0
> replaying 17625@158955: sector 0, size 0, flags 0
> replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
> Error allocating buffer 103079215104 entry 17626
>
> I'll look into it

Oh, are the devices 4k sector size devices? I fucked up 4k sector size support; I
sent some patches to fix it but they haven't been integrated yet, I'll poke those
again. They are in my dm-log-writes-fixes branch in my btrfs-next tree on
kernel.org.

Thanks,

Josef
On Wed, Aug 30, 2017 at 9:55 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
...
>> replaying 17624@158946: sector 8651296, size 4096, flags 0
>> replaying 17625@158955: sector 0, size 0, flags 0
>> replaying 17626@158956: sector 72057596591815616, size 103079215104, flags 0
>> Error allocating buffer 103079215104 entry 17626
>>
>> I'll look into it
>
> Oh, are the devices 4k sector size devices? I fucked up 4k sector size support; I
> sent some patches to fix it but they haven't been integrated yet, I'll poke those
> again. They are in my dm-log-writes-fixes branch in my btrfs-next tree on
> kernel.org. Thanks,
>

No, they are just virtio devices in kvm backed by my SSD LV, on which the same
test works just fine when not run inside kvm.
On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
> On Wed, Aug 30, 2017 at 6:23 PM, Josef Bacik <josef@toxicpanda.com> wrote:
...
> > I'd rather we make it as evil as possible. As long as we're printing out the
> > seed that was used in the output, we can go in and manually change the test
> > to use the same seed over and over again if we need to debug a problem.
>
> Yeah, that's what I did, but then I found values that reproduce a problem,
> so maybe it's worth holding on to these values until the bugs are fixed
> upstream, and then keeping them as regression tests.
>
> Anyway, I can keep these presets commented out, or run the test twice,
> once with presets and once with a random seed, whatever Eryu decides.

My thought at first glance is to use a random seed; if a specific seed
reproduces something, maybe another targeted regression test can be added,
as you did for that ext4 corruption?

...
> Now something's wrong with the log.
> It gets corrupted in most of the test runs, something like this:
> ...
> Error allocating buffer 103079215104 entry 17626
>
> I'll look into it
>
> Amir.

The first 6 patches are all preparation work and seem fine, so I'll probably
push them out this week. But I may need more time to look into all these
log-writes dm target and fsx changes.

But since there still seem to be problems not sorted out (e.g. this
log-writes bug), I'd prefer, when they get merged, removing the auto group
for now until things settle down a bit.

Thanks,
Eryu
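For context, group membership in xfstests at the time was declared in a per-directory
group file, so keeping the merged test out of the auto group is a one-line matter. A
sketch follows; the test number and the other group names are placeholders:

# tests/generic/group (sketch; NNN and the group names are placeholders)
NNN log replay         # left out of 'auto' for now, so './check -g auto' skips it
# later, once dm-log-writes settles down:
# NNN auto log replay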
On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
> On Wed, Aug 30, 2017 at 09:39:39PM +0300, Amir Goldstein wrote:
...
>> Yeah, that's what I did, but then I found values that reproduce a problem,
>> so maybe it's worth holding on to these values until the bugs are fixed
>> upstream, and then keeping them as regression tests.
>>
>> Anyway, I can keep these presets commented out, or run the test twice,
>> once with presets and once with a random seed, whatever Eryu decides.
>
> My thought at first glance is to use a random seed; if a specific seed
> reproduces something, maybe another targeted regression test can be added,
> as you did for that ext4 corruption?
>

Sure.

Speaking of the ext4 corruption, I did not re-post that test with this series
because it's quite an ugly black box test. I figured that if the ext4 guys
took a look and understood the problem, they could write a more intelligent
test. OTOH, maybe it's better than nothing?

BTW, Josef, did/could you write a more intelligent test to catch the extent
crc bug that you fixed? If not, was it easy to reproduce with the provided
seed presets? And without them? I am asking to understand whether a regression
test for that bug is in order beyond random-seed fsx.

BTW2, the xfs bug I found is reproduced with reasonable likelihood with any
random seed. By using the provided presets, I was able to reduce the test run
time and the debug cycle considerably. I used NUM_FILES=2; NUM_OPS=31 to
reproduce at > 50% probability within seconds. So this bug doesn't require a
specialized regression test.

...
> The first 6 patches are all preparation work and seem fine, so I'll probably
> push them out this week. But I may need more time to look into all these
> log-writes dm target and fsx changes.
>
> But since there still seem to be problems not sorted out (e.g. this
> log-writes bug), I'd prefer, when they get merged, removing the auto group
> for now until things settle down a bit.
>

Good idea.

Anyway, I would be happy to see these tests used by N > 1 testers for a start.
If some version is merged so people can start pointing this big gun at their
file systems, I imagine more interesting bugs will surface.

Amir.
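To make the "presets vs. random seed" trade-off concrete, here is a sketch of how the
knobs could be wired in the test script. NUM_FILES and NUM_OPS are the names used above,
while the FSX_SEED variable and the seed handling shown are assumptions for illustration,
not the actual test code:

# Sketch only -- not the actual test script.
NUM_FILES=2                 # fewer parallel fsx files => much shorter runs
NUM_OPS=31                  # few ops per file => the xfs failure reproduces within seconds
# Default to a fresh random seed, but let a debugger pin a known-bad value:
seed=${FSX_SEED:-$RANDOM}
echo "fsx seed: $seed" >> $seqres.full   # always log the seed so a failure can be replayed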
On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
...
> The first 6 patches are all preparation work and seem fine, so I'll probably
> push them out this week. But I may need more time to look into all these
> log-writes dm target and fsx changes.
>
> But since there still seem to be problems not sorted out (e.g. this
> log-writes bug), I'd prefer, when they get merged, removing the auto group
> for now until things settle down a bit.
>

I don't object to removing the auto group, but keep in mind that this test is
opt-in anyway, because it requires defining LOGWRITES_DEV
(well, it SHOULD require it, I actually forgot to check for it...).

For now, it seems that the problem observed with kvm-xfstests is specific to
the kvm-qemu aio=threads configuration, so you shouldn't have any problems
trying out the test on a non-kvm setup.

Amir.
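The missing guard could look roughly like the following, using existing xfstests
conventions; the exact helper names the final series ends up with may differ:

# local.config on the tester's side (opt-in):
export LOGWRITES_DEV=/dev/sdX     # spare block device dedicated to the write log

# In the test preamble (sketch):
[ -z "$LOGWRITES_DEV" ] && _notrun "this test requires \$LOGWRITES_DEV to be set"
_require_block_device $LOGWRITES_DEV
_require_dm_target log-writes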
On Fri, Sep 01, 2017 at 10:29:38AM +0300, Amir Goldstein wrote:
> On Thu, Aug 31, 2017 at 6:38 AM, Eryu Guan <eguan@redhat.com> wrote:
> ...
> > But since there still seem to be problems not sorted out (e.g. this
> > log-writes bug), I'd prefer, when they get merged, removing the auto group
> > for now until things settle down a bit.
> >
>
> I don't object to removing the auto group, but keep in mind that this test is
> opt-in anyway, because it requires defining LOGWRITES_DEV

That's a good point.

> (well, it SHOULD require it, I actually forgot to check for it...)
>
> For now, it seems that the problem observed with kvm-xfstests is specific to
> the kvm-qemu aio=threads configuration, so you shouldn't have any problems
> trying out the test on a non-kvm setup.

Thanks for the heads-up! I'll run this test, look into the code closely and
see what the best option is.

Thanks,
Eryu
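For anyone trying to reproduce the kvm-only soft lockup, the relevant knob is the aio
mode of the guest's virtio disk. A minimal qemu sketch; the image path, memory size and
the rest of the command line are placeholders:

# The configuration the lockup was reported against: virtio disk, thread-pool AIO.
qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=/dev/vg/test_lv,format=raw,if=virtio,cache=none,aio=threads
# Compare against Linux native AIO (needs cache=none) to confirm the problem
# really is specific to aio=threads:
#   ... -drive file=/dev/vg/test_lv,format=raw,if=virtio,cache=none,aio=native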
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index a1da0eb..b900758 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -345,6 +345,7 @@ static int log_writes_kthread(void *arg)
 		struct pending_block *block = NULL;
 		int ret;
 
+		cond_resched();
 		spin_lock_irq(&lc->blocks_lock);
 		if (!list_empty(&lc->logging_blocks)) {
 			block = list_first_entry(&lc->logging_blocks,