Message ID: 20161121085016.7148-1-quwenruo@cn.fujitsu.com (mailing list archive)
State: New, archived
Hi Qu,

I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
the corrupted disks, ran "btrfs scrub start ...", then checked the disks).

However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
1) the system detects that the file is corrupted (good :) )
2) the system returns the correct file content (good :) )
3) the data on the platter is still wrong (no good :( )

Enclosed is the script which reproduces the problem. Note that if I
corrupt the data, a line like the following appears twice in dmesg:

[ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
[ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815

If I corrupt the parity, the system of course neither detects the
corruption nor tries to correct it; but this is the expected behavior.

BR
G.Baroncelli

On 2016-11-21 09:50, Qu Wenruo wrote:
> In the following situation, scrub will calculate a wrong parity and
> overwrite the correct one:
>
> RAID5 full stripe:
>
> Before
> |     Dev 1     |     Dev 2     |     Dev 3     |
> | Data stripe 1 | Data stripe 2 | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)  |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> ...
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 64K
>
> After scrubbing dev3 only:
>
> |     Dev 1     |     Dev 2     |     Dev 3     |
> | Data stripe 1 | Data stripe 2 | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good) |     0xcdcd    |  0xcdcd (Bad) |
> --------------------------------------------------- 4K
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> ...
> |     0xcdcd    |     0xcdcd    |     0x0000    |
> --------------------------------------------------- 64K
>
> The call trace of such corruption is as follows:
>
> scrub_bio_end_io_worker() gets called for each extent read out
> |- scrub_block_complete()
>    |- Data extent csum mismatch
>       |- scrub_handle_errored_block()
>          |- scrub_recheck_block()
>             |- scrub_submit_raid56_bio_wait()
>                |- raid56_parity_recover()
>
> Now we have an rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
>
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data is stolen from "good_rbio" and stored into
>    |        rbio->stripe_pages[].
>    |        Now rbio->bio_pages[] holds the bad data read from disk.
>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed call to scrub_parity_work)
>
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculates parity using the *BAD* pages in
>             rbio->bio_pages[], so the good parity is overwritten
>             with a *BAD* one.
>
> The fix is to introduce 2 new members, bad_ondisk_a/b, in struct
> btrfs_raid_bio, to tell the scrub code to use the correct data pages
> to recalculate parity.
>
> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> Thanks to the above hell of delayed function calls and the damn
> convoluted code logic, such a bug is quite hard to trace.
>
> The kernel scrub is already multi-threaded; why make such meaningless
> delayed function calls again and again?
>
> What's wrong with single-threaded scrub?
> We can do things stripe by stripe for raid56, which is easy and
> straightforward; the only delayed thing is waking up the waiter:
>
> lock_full_stripe()
> if (!is_parity_stripe()) {
>         prepare_data_stripe_bios()
>         submit_and_wait_bios()
>         if (check_csum() == 0)
>                 goto out;
> }
> prepare_full_stripe_bios()
> submit_and_wait_bios()
>
> recover_raid56_stripes()
> prepare_full_stripe_write_bios()
> submit_and_wait_bios()
>
> out:
> unlock_full_stripe()
>
> We really need to rework the whole damn scrub code.
>
> Also, we need to enhance btrfs-progs to detect scrub problems (my
> submitted offline scrub is good enough for such usage), plus tools to
> corrupt extents reliably, so this can go into xfstests test cases.
>
> RAID56 scrub code is neither tested nor well designed.
> ---
>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d016d4a..87e3565 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>  	/* second bad stripe (for raid6 use) */
>  	int failb;
>
> +	/*
> +	 * For steal_rbio, we can steal the recovered correct pages,
> +	 * but in finish_parity_scrub() we would still use the bad
> +	 * on-disk pages to calculate parity.
> +	 * Use these members to tell finish_parity_scrub() to use
> +	 * the correct pages.
> +	 */
> +	int bad_ondisk_a;
> +	int bad_ondisk_b;
> +
>  	int scrubp;
>  	/*
>  	 * number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>  		return;
>
> +	/* Record recovered stripe number */
> +	if (src->faila != -1)
> +		dest->bad_ondisk_a = src->faila;
> +	if (src->failb != -1)
> +		dest->bad_ondisk_b = src->failb;
> +
>  	for (i = 0; i < dest->nr_pages; i++) {
>  		s = src->stripe_pages[i];
>  		if (!s || !PageUptodate(s)) {
> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>  	rbio->stripe_npages = stripe_npages;
>  	rbio->faila = -1;
>  	rbio->failb = -1;
> +	rbio->bad_ondisk_a = -1;
> +	rbio->bad_ondisk_b = -1;
>  	atomic_set(&rbio->refs, 1);
>  	atomic_set(&rbio->error, 0);
>  	atomic_set(&rbio->stripes_pending, 0);
> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use the stolen recovered pages rather than
> +			 * the bad on-disk pages.
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}
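[The tables above follow directly from how RAID5 parity is computed:
parity is the byte-wise XOR of the data stripes. Recomputing it with the
stale 0x0000 page gives 0x0000 ^ 0xcdcd = 0xcdcd (the bad parity in the
"after" table) instead of the correct 0xcdcd ^ 0xcdcd = 0x0000. Below is
a self-contained userspace sketch of that recomputation, for
illustration only -- it is not the kernel code.]

#include <stddef.h>
#include <stdint.h>

/*
 * RAID5 parity over nr_data stripes is the byte-wise XOR of all of
 * them, so a single stale input page poisons every parity byte where
 * that page differs from the real data.
 */
static void raid5_xor_parity(uint8_t *parity, uint8_t *const data[],
			     size_t nr_data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t x = 0;

		for (size_t d = 0; d < nr_data; d++)
			x ^= data[d][i];
		parity[i] = x;
	}
}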
At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>
> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
> 1) the system detects that the file is corrupted (good :) )
> 2) the system returns the correct file content (good :) )
> 3) the data on the platter is still wrong (no good :( )

Do you mean, reading the corrupted data won't repair it?

IIRC that's the designed behavior.

For RAID5/6 read there are several different modes, like READ_REBUILD or
SCRUB_PARITY.

I'm not sure about write, but for read it won't write the corrected data
back.

So it's the designed behavior, if I'm not missing something.

Thanks,
Qu

[...]
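[For reference: the modes Qu mentions are tracked per-rbio as an
operation type. In the 4.x-era fs/btrfs/raid56.h the enum looked roughly
like the sketch below; BTRFS_RBIO_READ_REBUILD is named explicitly later
in this thread, while the other identifiers are reconstructed from
memory and should be treated as approximate.]

enum btrfs_rbio_ops {
	BTRFS_RBIO_WRITE,        /* normal full/partial stripe write */
	BTRFS_RBIO_READ_REBUILD, /* rebuild lost data for a read; no write-back */
	BTRFS_RBIO_PARITY_SCRUB, /* scrub: verify and rewrite parity */
};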
On 2016-11-22 01:28, Qu Wenruo wrote:
>
> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>>
>> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
>> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>>
>> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>> 1) the system detects that the file is corrupted (good :) )
>> 2) the system returns the correct file content (good :) )
>> 3) the data on the platter is still wrong (no good :( )
>
> Do you mean, reading the corrupted data won't repair it?
>
> IIRC that's the designed behavior.

:O

You are right... I was unaware of that....

So you can add a "Tested-by: Goffredo Baroncelli <kreijack@inwind.it>".

BR
G.Baroncelli

> For RAID5/6 read there are several different modes, like READ_REBUILD or
> SCRUB_PARITY.
>
> I'm not sure about write, but for read it won't write the corrected data
> back.
>
> So it's the designed behavior, if I'm not missing something.
>
> Thanks,
> Qu

[...]
On 11/21/2016 03:50 AM, Qu Wenruo wrote:
> In the following situation, scrub will calculate a wrong parity and
> overwrite the correct one:

[...]

> RAID56 scrub code is neither tested nor well designed.

Great description, thanks for tracking this down.

> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use the stolen recovered pages rather than
> +			 * the bad on-disk pages.
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}

We're changing which pages we kmap() but not which ones we kunmap(). Can
you please update the kunmap loop to use this pointers array? Also, it
looks like this kmap is never unmapped:

	pointers[stripe++] = kmap(q_page);

-chris
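[One way to satisfy both of Chris's points is to record which page each
slot actually mapped, so the unmap loop cannot diverge from the map
loop. A minimal sketch -- this is not the patch that was eventually
merged, and "mapped_pages" plus its bound are invented here for
illustration:]

	struct page *mapped_pages[MAX_RBIO_STRIPES];	/* hypothetical bound */
	void *pointers[MAX_RBIO_STRIPES];
	int stripe;

	for (stripe = 0; stripe < nr_data; stripe++) {
		struct page *p;

		if (stripe == rbio->bad_ondisk_a ||
		    stripe == rbio->bad_ondisk_b)
			p = rbio_stripe_page(rbio, stripe, pagenr);
		else
			p = page_in_rbio(rbio, stripe, pagenr, 0);
		mapped_pages[stripe] = p;	/* remember the mapped page */
		pointers[stripe] = kmap(p);
	}

	/* ... recompute and compare parity using pointers[] ... */

	for (stripe = 0; stripe < nr_data; stripe++)
		kunmap(mapped_pages[stripe]);	/* unmap the same pages we mapped */

[The same bookkeeping extends naturally to the RAID6 q_page mapping that
Chris points out is never unmapped.]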
At 11/23/2016 02:58 AM, Chris Mason wrote:
> On 11/21/2016 03:50 AM, Qu Wenruo wrote:
>> In the following situation, scrub will calculate a wrong parity and
>> overwrite the correct one:

[...]

> Great description, thanks for tracking this down.
>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct
>> btrfs_raid_bio *rbio,

[...]

> We're changing which pages we kmap() but not which ones we kunmap(). Can
> you please update the kunmap loop to use this pointers array? Also, it
> looks like this kmap is never unmapped.

Oh, I forgot that.
I'll update it soon.

This reminds me: is there any kernel debug option to trace such unmapped
pages?

Thanks,
Qu

> 	pointers[stripe++] = kmap(q_page);
>
> -chris
On Tue, Nov 22, 2016 at 07:02:13PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-22 01:28, Qu Wenruo wrote:
>> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>>> Hi Qu,
>>>
>>> I tested this successfully for RAID5 when doing a scrub (i.e. I mounted
>>> the corrupted disks, ran "btrfs scrub start ...", then checked the disks).
>>>
>>> However, if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>>> 1) the system detects that the file is corrupted (good :) )
>>> 2) the system returns the correct file content (good :) )
>>> 3) the data on the platter is still wrong (no good :( )
>>
>> Do you mean, reading the corrupted data won't repair it?
>>
>> IIRC that's the designed behavior.
>
> :O
>
> You are right... I was unaware of that....

This is correct.

Ordinary reads shouldn't touch corrupt data; they should only read
around it. Scrubs in read-write mode should write corrected data over
the corrupt data. Read-only scrubs can only report errors without
correcting them.

Rewriting corrupt data outside of scrub (i.e. on every read) is a
bad idea. Consider what happens if a RAM controller gets too hot:
checksums start failing randomly, but the data on disk is still OK.
If we tried to fix the bad data on every read, we'd probably just
trash the filesystem in some cases.

This risk mitigation measure does rely on admins taking a machine in
this state down immediately, and also somehow knowing not to start a
scrub while their RAM is failing... which is kind of an annoying
requirement for the admin.

[...]
On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> This risk mitigation measure does rely on admins taking a machine in
> this state down immediately, and also somehow knowing not to start a
> scrub while their RAM is failing... which is kind of an annoying
> requirement for the admin.

Attempting to detect whether RAM is bad when a scrub starts is both time
consuming and not very reliable, right?
On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
>>
>> This risk mitigation measure does rely on admins taking a machine in
>> this state down immediately, and also somehow knowing not to start a
>> scrub while their RAM is failing... which is kind of an annoying
>> requirement for the admin.
>
> Attempting to detect whether RAM is bad when a scrub starts is both time
> consuming and not very reliable, right?

RAM, like all hardware, could fail at any time, and a scrub could
already be running when it happens. This is annoying, but also a fact
of life that admins have to deal with.

Testing RAM before scrub starts is no more beneficial than testing RAM
at random intervals--but if you are testing RAM at random intervals,
why not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic
sanity test to make sure they're coming from the storage layer and not
somewhere closer to the CPU. If all errors come from one device, there
are clear log messages showing SCSI device errors, and the SMART log
matches the other data, then RAM is probably not the root cause of the
failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting,
there's smoke pouring out of the back of the machine, all the disks are
full of csum failures, and the BIOS welcome message has spelling errors
that weren't there before, I would *not* start a scrub. More like turn
the machine off, take it apart, test all the pieces separately, and
only do a scrub after everything above the storage layer had been
replaced or recertified. I certainly wouldn't want the filesystem to
try to fix the csum failures it finds in such situations.
On 2016-11-25 05:31, Zygo Blaxell wrote:
>>> Do you mean, reading the corrupted data won't repair it?
>>>
>>> IIRC that's the designed behavior.
>>
>> :O
>>
>> You are right... I was unaware of that....
>
> This is correct.
>
> Ordinary reads shouldn't touch corrupt data; they should only read
> around it. Scrubs in read-write mode should write corrected data over
> the corrupt data. Read-only scrubs can only report errors without
> correcting them.
>
> Rewriting corrupt data outside of scrub (i.e. on every read) is a
> bad idea. Consider what happens if a RAM controller gets too hot:
> checksums start failing randomly, but the data on disk is still OK.
> If we tried to fix the bad data on every read, we'd probably just
> trash the filesystem in some cases.

I can't agree. If the filesystem is mounted read-only this behavior may
be correct, but in other cases I don't see any reason not to correct
wrong data even in the read case. If your RAM is unreliable, you have
big problems anyway.

The likelihood that the data contained on a disk is corrupted is higher
than the likelihood that the RAM is bad.

BTW, Btrfs in RAID1 mode corrects the data even in the read case. So I
am still convinced that it is the RAID5/6 behavior that is "strange".

BR
G.Baroncelli
On 11/22/2016 07:26 PM, Qu Wenruo wrote:
>> We're changing which pages we kmap() but not which ones we kunmap(). Can
>> you please update the kunmap loop to use this pointers array? Also, it
>> looks like this kmap is never unmapped.
>
> Oh, I forgot that.
> I'll update it soon.

Thanks!

> This reminds me: is there any kernel debug option to trace such unmapped
> pages?

I don't think so, which is surprising. It explodes so quickly on 32-bit
machines that it's easiest to boot it in a 32-bit qemu.

-chris
On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-25 05:31, Zygo Blaxell wrote:

[...]

> I can't agree. If the filesystem is mounted read-only this behavior may
> be correct, but in other cases I don't see any reason not to correct
> wrong data even in the read case. If your RAM is unreliable, you have
> big problems anyway.

If you don't like RAM corruption, pick any other failure mode. Laptops
have to deal with things like vibration and temperature extremes, which
produce the same results (spurious csum failures and IO errors under
conditions where writing will only destroy data that would otherwise be
recoverable).

> The likelihood that the data contained on a disk is corrupted is higher
> than the likelihood that the RAM is bad.
>
> BTW, Btrfs in RAID1 mode corrects the data even in the read case.

Have you tested this? I think you'll find that it doesn't.

> I am still convinced that it is the RAID5/6 behavior that is "strange".
On 2016-11-26 19:54, Zygo Blaxell wrote:
> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>> On 2016-11-25 05:31, Zygo Blaxell wrote:
[...]
>>
>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>
> Have you tested this? I think you'll find that it doesn't.

Yes, I tested it, and it does the rebuild automatically.
I corrupted a disk of the mirror, then read the related file. The log says:

[   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
                                        ^^^^^^^^^^^^^^^^^^^^

IIRC, in the case of RAID5/6 the last line is missing. In both cases the
data returned is good, but in RAID1 the data is also corrected on the
disk.

Where did you read that the data is not rebuilt automatically?

In fact I was surprised that RAID5/6 behaves differently....
On Sun, Nov 27, 2016 at 12:16:34AM +0100, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>>
>> Have you tested this? I think you'll find that it doesn't.
>
> Yes, I tested it, and it does the rebuild automatically.

[...]

> Where did you read that the data is not rebuilt automatically?

Experience? I have real disk failures all the time. Errors on RAID1
arrays persist until scrubbed.

No, wait... _transid_ errors always persist until scrubbed. csum
failures are rewritten in repair_io_failure. There is a comment earlier
in repair_io_failure noting that rewrite in RAID56 is not supported yet.

> In fact I was surprised that RAID5/6 behaves differently....

The difference is surprising no matter which strategy you believe is
correct. ;)
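[For reference, the guard Zygo mentions: as best I recall, the 4.x-era
repair_io_failure() in fs/btrfs/extent_io.c bails out for RAID56 roughly
like this -- treat the exact code as approximate:]

	/* we can't repair anything in raid56 yet */
	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
		return 0;

[That is why csum failures on RAID1 get rewritten on read, while RAID5/6
errors persist until a read-write scrub.]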
At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>> BTW, Btrfs in RAID1 mode corrects the data even in the read case.
>>
>> Have you tested this? I think you'll find that it doesn't.
>
> Yes, I tested it, and it does the rebuild automatically.

[...]

> In fact I was surprised that RAID5/6 behaves differently....

Yes, I also tried that and confirmed that RAID1 recovers corrupted data
at *READ* time.

The main difference between RAID1 and RAID56 seems to be the complexity.

For RAID56 we have different read/write behavior: for read, we use the
flag BTRFS_RBIO_READ_REBUILD, which will only rebuild data but not write
it back to disk.
And I'm a little concerned about the race between a read-time fix and a
write.

I assume it's possible to change the behavior to follow RAID1, but I'd
like to do it in the following steps:
1) Fix known RAID56 bugs.
   With the v3 patch and the previous 2 patches, it seems OK now.
2) Full fstests test case, with all possible corruption combinations
   (WIP).
3) Rework the current RAID56 code to a cleaner and more readable state
   (long term).
4) Add support to fix things at read time.

So the behavior change is not something we will see in the short term.

Thanks,
Qu
On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:
> I can't agree. If the filesystem is mounted read-only this behavior may
> be correct, but in other cases I don't see any reason not to correct
> wrong data even in the read case. If your RAM is unreliable, you have
> big problems anyway.

I'd agree with that - more or less.

If the memory is broken, then even without repairing (on read), a
filesystem that is written to will likely be further corrupted.

I think for safety it's best to repair as early as possible (and thus
on read, when damage is detected), as further blocks/devices may fail
before a scrub (with repair) is eventually run manually.

However, there may be some workloads under which such auto-repair is
undesirable, as it may cost performance, and safety may be less
important than that.

Thus I think there should be a mount option that lets users control
whether repair should happen on normal reads or not... and this should
IMO be independent of whether the fs was mounted ro or rw.

I'd say the default should go for data safety (i.e. repair as soon as
possible).

Cheers,
Chris.
28.11.2016 06:37, Christoph Anton Mitterer wrote:
> On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:

[...]

> Thus I think there should be a mount option that lets users control
> whether repair should happen on normal reads or not... and this should
> IMO be independent of whether the fs was mounted ro or rw.

If you allow any write to the filesystem before resuming from
hibernation, you risk a corrupted filesystem. I strongly believe that
"ro" must be really read-only - having a separate option to control it
would require updating every tool that generates an initramfs, and even
then you cannot avoid using an older initramfs with a newer kernel.

> I'd say the default should go for data safety (i.e. repair as soon as
> possible).

Safety means exactly the opposite for me - do not modify data when
explicitly requested not to. Having a filesystem in a half-read-only
state would be too error prone.
On Mon, 2016-11-28 at 06:53 +0300, Andrei Borzenkov wrote:
> If you allow any write to the filesystem before resuming from
> hibernation, you risk a corrupted filesystem. I strongly believe that
> "ro" must be really read-only

You're aware that "ro" already doesn't mean "no changes to the block
device" on most modern filesystems (including btrfs)?

> - having a separate option to control it would require updating every
> tool that generates an initramfs, and even then you cannot avoid using
> an older initramfs with a newer kernel.

What would the initramfs have to do with it? Isn't the whole repairing
thing completely transparent (unless perhaps you do some low-level
forensics, or access the device directly rather than through the
filesystem driver)?

> Safety means exactly the opposite for me - do not modify data when
> explicitly requested not to.

Depends on what you mean by data...
If you mean "the files as exported in the file hierarchy"... then there
won't be any modifications by repair (after all, it just repairs and
gives back the correct data, or fails).

Cheers,
Chris.
On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> I think for safety it's best to repair as early as possible (and thus
> on read, when damage is detected), as further blocks/devices may fail
> before a scrub (with repair) is eventually run manually.
>
> However, there may be some workloads under which such auto-repair is
> undesirable, as it may cost performance, and safety may be less
> important than that.

I am assuming that a corruption is a quite rare event. So it could
happen occasionally that a page is corrupted and the system corrects
it; this shouldn't have an impact on the workloads.

BR
G.Baroncelli
On 2016-11-28 01:40, Qu Wenruo wrote:
> At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:

[...]

> I assume it's possible to change the behavior to follow RAID1, but I'd
> like to do it in the following steps:
> 1) Fix known RAID56 bugs.
>    With the v3 patch and the previous 2 patches, it seems OK now.
> 2) Full fstests test case, with all possible corruption combinations
>    (WIP).
> 3) Rework the current RAID56 code to a cleaner and more readable state
>    (long term).
> 4) Add support to fix things at read time.
>
> So the behavior change is not something we will see in the short term.

+1

I understand that the status of the RAID5/6 code is bad enough that we
need to correct all the more critical bugs first, and then increase the
tests to prevent regressions.

On point 3 I don't know the code well enough to say anything; the code
is very complex. I see point 4 as the least urgent.

Let me make a request: I would like to know your opinion about my email
"RFC: raid with a variable stripe size", which started a little thread.
I am asking this because you now have your hands on this code: is my
suggestion (use different block groups with different stripe sizes to
avoid RMW cycles), or Zygo's one (don't fill a stripe if you don't need
to, to avoid RMW cycles), difficult to implement?

BR
G.Baroncelli
On Mon, 2016-11-28 at 19:32 +0100, Goffredo Baroncelli wrote: > I am assuming that a corruption is a quite rare event. So it could > occasionally happen that a page is corrupted and the system corrects > it. This shouldn't have an impact on the workloads. Probably, but it still makes sense to make it configurable, especially as an option like this would be needed to make a btrfs truly non-writing to the device again... The ro mount option just says that the files/permissions/etc. don't change - not that no internals change - so it would IMO be an abuse if ro were used to disable repairs. And some people may simply want that - even if it's just to rule out the possibility of corruptions on a ro fs in case of bad memory ;) Cheers, Chris.
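P.S.: wiring such a knob up would presumably be trivial once a read-time repair path exists. A rough sketch, assuming a hypothetical NO_AUTO_REPAIR mount option (the option does not exist today; btrfs_test_opt() is the existing macro for testing mount options, though its exact signature has changed between kernel versions):

	/* In the (hypothetical) read-time repair path: the good copy
	 * was already reconstructed in memory; only the write-back
	 * is made optional. */
	if (btrfs_test_opt(fs_info, NO_AUTO_REPAIR))
		return 0;	/* reader still gets the good data */
	queue_repair_writeback(fs_info, logical, len);	/* hypothetical */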
On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:
> My understanding is that the RAID5/6 code is in bad enough shape
Just some random thought:
If the code for raid56 is really as bad as it's often claimed (I
haven't read it, to be honest)... could it perhaps make sense to
consider starting it from scratch? And/or merging it with a more
generic approach that allows n-way-parity RAIDs (I think such a patch
was posted here some year(s) ago).
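
For what it's worth, the parity math itself would not be the hard part
of an n-way approach. Here's a standalone user-space illustration (my
own sketch, not taken from any posted patch) of the textbook
Vandermonde construction over GF(2^8): npar = 1 degenerates to raid5's
XOR parity and npar = 2 to raid6's P/Q.

	#include <stdint.h>

	/* Multiply in GF(2^8) with the raid6 polynomial
	 * x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
	static uint8_t gf_mul(uint8_t a, uint8_t b)
	{
		uint8_t p = 0;

		while (b) {
			if (b & 1)
				p ^= a;
			a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
			b >>= 1;
		}
		return p;
	}

	static uint8_t gf_pow(uint8_t a, unsigned int e)
	{
		uint8_t r = 1;

		while (e--)
			r = gf_mul(r, a);
		return r;
	}

	/*
	 * data[i] = byte of data stripe i at a given offset; fills
	 * parity[0..npar-1].  Parity j uses coefficients (2^j)^i, so
	 * j = 0 is plain XOR (raid5 P) and j = 1 is the raid6 Q
	 * syndrome.
	 */
	static void calc_parity(const uint8_t *data, int ndata,
				uint8_t *parity, int npar)
	{
		for (int j = 0; j < npar; j++) {
			uint8_t coeff = gf_pow(2, j);
			uint8_t acc = 0, c = 1;

			for (int i = 0; i < ndata; i++) {
				acc ^= gf_mul(c, data[i]);
				c = gf_mul(c, coeff);
			}
			parity[j] = acc;
		}
	}

(Caveat: beyond two parities this naive power-of-2 construction is not
guaranteed to remain MDS; a real implementation would use a properly
conditioned coding matrix. The hard part would be the on-disk format,
allocator, and recovery/scrub plumbing, not this loop.)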
Cheers,
Chris.
On 2016-11-28 14:01, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote: >> My understanding is that the RAID5/6 code is in bad enough shape > Just some random thought: > > If the code for raid56 is really as bad as it's often claimed (I > haven't read it, to be honest)... could it perhaps make sense to > consider starting it from scratch? And/or merging it with a more > generic approach that allows n-way-parity RAIDs (I think such a patch > was posted here some year(s) ago). Such a suggestion for higher-order parity support was made some time ago (at least a year and a half, I believe, probably more). It was stated at the time that it was planned after n-way replication and raid5/6. Personally, I feel that sort of road-map is misguided, since all three are functionally interdependent, as the original proposal suggested, and I'm also of the opinion that the raid5/6 code probably should be redone from scratch (I wasn't the one who wrote it, and can't contribute much more than some attempts at testing as a third party, so I obviously can't make that decision myself). Doing so would likely make it much easier to implement higher-order parity (and arbitrary striping/replication, which is something I'm _very_ interested in). The existing code isn't bad in a stylistic or even traditional coding sense, but it makes a significant number of poor choices on the high-level side of things (all the issues with parity, for example). If we just want working raid5/6, then working on the existing implementation is fine. If we want support for arbitrary combinations of striping/replication/parity (which would be a killer feature, and something BTRFS could actually say nothing else has), then rewriting from scratch is probably easier because of some of the low-level design choices made in the raid5/6 code. Part of the problem, though, is that there are more admins and support-focused people than coders involved. In my case, I very much do not have the expertise to work on kernel code beyond tweaking constants and twiddling default values (I'm technically a programmer, but 90% of the work I do is sysops-type stuff written in a mix of sh, Python, and about 7 different data serialization formats).
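P.S.: to make "arbitrary combinations" concrete, the parameterization itself is tiny - something like the following purely hypothetical, kernel-style descriptor (nothing like this exists in the current on-disk format; existing profiles just fall out as special cases):

	struct raid_profile {
		u16 ncopies;	/* replication: 1 = single, 2 = raid1-like */
		u16 nstripes;	/* data stripes per full stripe */
		u16 nparity;	/* 0 = none, 1 = raid5-like, 2 = raid6-like */
	};

	/* raid10-like: { .ncopies = 2, .nstripes = N, .nparity = 0 } */
	/* raid6-like:  { .ncopies = 1, .nstripes = N, .nparity = 2 } */

The descriptor is the easy 1%; the allocator, balance, and recovery paths that would have to honor every combination are the other 99%, which is why I can't do much beyond suggesting it.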
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote: > On 2016-11-28 04:37, Christoph Anton Mitterer wrote: > > I think for safety it's best to repair as early as possible (and thus > > on read when damage is detected), as further blocks/devices may fail > > till eventually a scrub (with repair) would be run manually. > > > > However, there may be some workloads under which such auto-repair is > > undesirable as it may cost performance and safety may be less important > > than that. > > I am assuming that a corruption is a quite rare event. So it could > occasionally happen that a page is corrupted and the system corrects > it. This shouldn't have an impact on the workloads. Depends heavily on the specifics of the failure case. If a drive's embedded controller RAM fails, you get corruption on the majority of reads from a single disk, and most writes will be corrupted (even if they were not before). If there's a transient failure due to environmental issues (e.g. short-term high-amplitude vibration or overheating) then writes may pause for mechanical retry loops. If there is bitrot in SSDs (particularly in the address translation tables) it looks like a wall of random noise that only ends when the disk goes offline. You can get combinations of these (e.g. RAM failures caused by transient overheating) where the drive's behavior changes over time. When in doubt, don't write.
On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > If a drive's > embedded controller RAM fails, you get corruption on the majority of > reads from a single disk, and most writes will be corrupted (even if > they > were not before). Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite a number of disks for nearly 10 years now, I have never stumbled on such a case of breakage so far... Actually most cases are as simple as an HDD failing to work, and this being properly signalled to the controller. > If there's a transient failure due to environmental > issues (e.g. short-term high-amplitude vibration or overheating) then > writes may pause for mechanical retry loops. If there is bitrot in > SSDs > (particularly in the address translation tables) it looks like a wall > of random noise that only ends when the disk goes offline. You can > get > combinations of these (e.g. RAM failures caused by transient > overheating) > where the drive's behavior changes over time. > > When in doubt, don't write. Sorry, but these cases, like any memory issues (be it main memory or the HDD controller), would also kick in on any normal writes. So there's no point in protecting against this on the storage side... Either never write at all... or have good backups for these rare cases. Cheers, Chris.
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > > If a drive's > > embedded controller RAM fails, you get corruption on the majority of > > reads from a single disk, and most writes will be corrupted (even if > > they > > were not before). > > Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite > a number of disks for nearly 10 years now, I have never stumbled on > such a case of breakage so far... In data centers you won't see breakages that are common on desktop and laptop drives. Laptops in particular sometimes (often?) go to places that are much less friendly to hardware. All my NAS and enterprise drives in server racks and data centers just wake up one morning stone dead or with a few well-behaved bad sectors, with none of this drama. Boring! > Actually most cases are as simple as an HDD failing to work, and this > being properly signalled to the controller. > > If there's a transient failure due to environmental > > issues (e.g. short-term high-amplitude vibration or overheating) then > > writes may pause for mechanical retry loops. If there is bitrot in > > SSDs > > (particularly in the address translation tables) it looks like a wall > > of random noise that only ends when the disk goes offline. You can > > get > > combinations of these (e.g. RAM failures caused by transient > > overheating) > > where the drive's behavior changes over time. > > > > When in doubt, don't write. > > Sorry, but these cases, like any memory issues (be it main memory or > the HDD controller), would also kick in on any normal writes. Yes, but in a RAID1 context there will be another disk with a good copy (or if main RAM is failing, the entire filesystem will be toast no matter what you do). > So there's no point in protecting against this on the storage side... > > Either never write at all... or have good backups for these rare cases. > > Cheers, > Chris.
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote: > On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote: > > If a drive's embedded controller RAM fails, you get corruption on the > > majority of reads from a single disk, and most writes will be corrupted > > (even if they were not before). > > Administering a multi-PiB Tier-2 for the LHC Computing Grid with quite > a number of disks for nearly 10 years now, I have never stumbled on > such a case of breakage so far... > > Actually most cases are as simple as an HDD failing to work, and this > being properly signalled to the controller. I administer no real storage at this time, and have only 16 disks (plus a few disk-likes) to my name right now. Yet in a span of ~2 months I've seen three cases of silent data corruption: * a RasPi I used for DNS recursor/DHCP/aiccu started mangling some writes, with no notification that something was amiss. With ext4 being a silentdatalossfs, there was no clue it was a disk (ok, SD) problem at all, making it really "fun" to debug. It happens on multiple SD cards, thus it's the machine that's at fault. * an HDD had some link resets and silent data corruption, diagnosed as a bad SATA cable; the disk has worked fine since (obviously after extensive tests). * an HDD that has link resets and silent data corruption (apparently write-time only(?)), Marduk knows why. It happens with multiple cables and on two machines, putting the blame somewhere on the disk. Thus, the assumption that the controller will be notified about read errors is quite invalid. In the above cases, if recovery was possible, it'd be beneficial to rewrite a good copy of the data. Meow!
On Tue, 2016-11-29 at 08:35 +0100, Adam Borowski wrote: > I administer no real storage at this time, and have only 16 disks > (plus a few > disk-likes) to my name right now. Yet in a span of ~2 months I've seen > three > cases of silent data corruption I didn't mean to say we'd have no silent data corruption. OTOH, we cannot have had many of them, since the storage management software itself has another layer of checksumming. What I meant to say is that we likely never had a scenario like the one described by Zygo, where e.g. broken HDD controller memory would cause corruptions (which would then be bad in the case of auto-RAID-repair-on-read). Because IMO such memory issues would be noticed rather soon. Cheers, Chris.