
A review of dm-writeboost

Message ID 525BAB32.5050901@gmail.com (mailing list archive)
State RFC, archived
Delegated to: Mike Snitzer

Commit Message

Akira Hayakawa Oct. 14, 2013, 8:28 a.m. UTC
Hi, DM Guys

I believe I have finished the tasks needed
to address Mikulas's review comments,
so let me update the progress report.

The code is now updated in my GitHub repo.
Check out the develop branch to get
the latest source code.

Compilation Status
------------------
First, the compilation status.
Mikulas advised me to compile the module in
a 32-bit environment, and I did.
With all the kernels listed below,
writeboost compiles without any errors or warnings.

For 64 bit
3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1

For 32 bit
3.2.0-4-686-pae (Debian 7.1.0-i386)


Block up on error
-----------------
The most annoying part of this update
was deciding how to handle I/O errors.
For memory allocation errors,
writeboost now uses a mempool to avoid
the problem Mikulas described in his last comments,
but handling I/O errors gracefully
while the system is running is very difficult.

My answer: all the daemons stop when an I/O error
(-EIO returned) happens in any part of this module.
They wait on a wait queue (blockup_wait_queue)
and reactivate when the sysadmin sets the `blockup`
variable back to 0 through the message interface.
While `blockup` is 1, all incoming I/O
is returned to the upper layer as -EIO.
A RETRY macro is introduced
which wraps I/O submission and
retries the submission if the error is -ENOMEM but
sets `blockup` to 1 and sleeps if the error is -EIO.
-EIO is more serious than -ENOMEM because
it may mean the storage was damaged by some accidental problem
that we have no control over in the device-mapper layer
(e.g. the storage controller went crazy).
Blocking up all I/O is meant to minimize the
probable damage.
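For what it's worth, the intended behavior of the macro can be sketched in plain C. This is only a userspace illustration of the semantics just described, not the actual kernel macro; `submit` is a stand-in for the wrapped I/O call:

```c
#include <errno.h>

/* Hypothetical userspace sketch of the RETRY semantics:
 * retry the submission on -ENOMEM (a transient allocation failure),
 * but set `blockup` and give up on -EIO, which may mean the storage
 * itself is in trouble. */
static int blockup = 0;

static int retry_io(int (*submit)(void))
{
        for (;;) {
                int r = submit();
                if (r == 0)
                        return 0;        /* success */
                if (r == -ENOMEM)
                        continue;        /* transient: retry (sleep elided) */
                if (r == -EIO) {
                        blockup = 1;     /* block all I/O until the admin clears it */
                        return -EIO;
                }
                return r;                /* any other error: propagate */
        }
}
```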

But, XFS stalls ...
-------------------
For testing,
I manually set `blockup` to 1
while a Ruby compile was in progress
on XFS on a writeboost device.
As soon as I did,
XFS started dumping error messages
like "metadata I/O error: ... ("xlog_iodone") error ..."
and after a few seconds it started dumping messages
like "BUG: soft lockup - CPU#3 stuck for 22s!".
The system stalled and no longer accepted keyboard input.

I think this behavior is caused by
the device always returning -EIO after the
variable is set to 1.
But why does XFS stall on an I/O error?
It should just suspend and start returning
errors to the upper layer, as writeboost now does.
As Mikulas said, an I/O error is often
due to a connection failure, which is usually recoverable.
Stalling the kernel requires a reboot
after recovery, whereas writeboost
can recover simply by setting `blockup` back to 0.
Is there a reason for this design, or
is there an option to keep XFS from stalling on I/O error?

Thanks,
Akira

----------------------------------------------

Followed by
changes to Driver/* and Documentation


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Mikulas Patocka Oct. 16, 2013, 12:01 a.m. UTC | #1
On Mon, 14 Oct 2013, Akira Hayakawa wrote:

> Hi, DM Guys
> 
> I believe I have finished the tasks needed
> to address Mikulas's review comments,
> so let me update the progress report.
> 
> The code is now updated in my GitHub repo.
> Check out the develop branch to get
> the latest source code.
> 
> Compilation Status
> ------------------
> First, the compilation status.
> Mikulas advised me to compile the module in
> a 32-bit environment, and I did.
> With all the kernels listed below,
> writeboost compiles without any errors or warnings.
> 
> For 64 bit
> 3.2.36, 3.4.25, 3.5.7, 3.6.9, 3.7.5, 3.8.5, 3.9.8, 3.10.5, 3.11.4 and 3.12-rc1
> 
> For 32 bit
> 3.2.0-4-686-pae (Debian 7.1.0-i386)
> 
> 
> Block up on error
> -----------------
> The most annoying part of this update
> was deciding how to handle I/O errors.
> For memory allocation errors,
> writeboost now uses a mempool to avoid
> the problem Mikulas described in his last comments,
> but handling I/O errors gracefully
> while the system is running is very difficult.
> 
> My answer: all the daemons stop when an I/O error
> (-EIO returned) happens in any part of this module.
> They wait on a wait queue (blockup_wait_queue)
> and reactivate when the sysadmin sets the `blockup`
> variable back to 0 through the message interface.
> While `blockup` is 1, all incoming I/O
> is returned to the upper layer as -EIO.
> A RETRY macro is introduced
> which wraps I/O submission and
> retries the submission if the error is -ENOMEM but

I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as 
a hard error.

> sets `blockup` to 1 and sleeps if the error is -EIO.
> -EIO is more serious than -ENOMEM because
> it may mean the storage was damaged by some accidental problem
> that we have no control over in the device-mapper layer
> (e.g. the storage controller went crazy).
> Blocking up all I/O is meant to minimize the
> probable damage.
> 
> But, XFS stalls ...
> -------------------
> For testing,
> I manually set `blockup` to 1
> while a Ruby compile was in progress
> on XFS on a writeboost device.
> As soon as I did,
> XFS started dumping error messages
> like "metadata I/O error: ... ("xlog_iodone") error ..."
> and after a few seconds it started dumping messages
> like "BUG: soft lockup - CPU#3 stuck for 22s!".
> The system stalled and no longer accepted keyboard input.
> 
> I think this behavior is caused by
> the device always returning -EIO after the
> variable is set to 1.
> But why does XFS stall on an I/O error?

Because it is bloated and buggy. We have bug 924301 for XFS crash on I/O 
error...

> It should just suspend and start returning
> errors to the upper layer, as writeboost now does.
> As Mikulas said, an I/O error is often
> due to a connection failure, which is usually recoverable.
> Stalling the kernel requires a reboot
> after recovery, whereas writeboost
> can recover simply by setting `blockup` back to 0.
> Is there a reason for this design, or
> is there an option to keep XFS from stalling on I/O error?
> 
> Thanks,
> Akira

Blocking I/O until the admin flips a specific variable isn't very
reliable.

Think of this case - your driver detects an I/O error and blocks all I/Os.
The admin tries to log in. The login process needs memory. To fulfill this
memory need, the login process writes out some dirty pages. Those writes
are blocked by your driver - as a result, the admin is not able to log
in and flip the switch to unblock I/Os.

Blocking I/O indefinitely isn't good because any system activity 
(including typing commands into shell) may wait on this I/O.

Mikulas

Akira Hayakawa Oct. 16, 2013, 12:53 a.m. UTC | #2
Mikulas,

> I/Os shouldn't be returned with -ENOMEM. If they are, you can treat it as 
> a hard error.
It seems that blkdev_issue_discard returns -ENOMEM
when bio_alloc fails, for example.
My idea for handling a returned -ENOMEM is to wait
a second and retry, by which time the allocation may succeed.

> Blocking I/O until the admin flips a specific variable isn't very
> reliable.
> 
> Think of this case - your driver detects an I/O error and blocks all I/Os.
> The admin tries to log in. The login process needs memory. To fulfill this
> memory need, the login process writes out some dirty pages. Those writes
> are blocked by your driver - as a result, the admin is not able to log
> in and flip the switch to unblock I/Os.
> 
> Blocking I/O indefinitely isn't good because any system activity 
> (including typing commands into shell) may wait on this I/O.
I understand the problem. But what should I do then?
Since writeboost is cache software,
it loses consistency if we simply bypass the cache
when returning an I/O error.
Panicking in that case also seems inappropriate (although
inaccessible storage will eventually halt the whole system anyway;
if so, panicking might be an acceptable solution).

My idea is based on your past comment:
> If you can't handle a specific I/O request failure gracefully, you should 
> mark the driver as dead, don't do any more I/Os to the disk or cache 
> device and return -EIO on all incoming requests.
> 
> Always think that I/O failures can happen because of connection problems, 
> not data corruption problems - for example, a disk cable can go loose, a 
> network may lose connectivity, etc. In these cases, it is best to stop 
> doing any I/O at all and let the user resolve the situation.
1) On failure, mark the driver dead - set `blockup` to 1 in my case -
   and return -EIO on all incoming requests. Yes.
2) Then wait for the user to resolve the situation - returning -EIO until
   the admin sets `blockup` back to 0 after checking, in my case. Yes.

Did you mean we should not provide any way to recover the system,
because the admin may not be able to reach the switch?
Should the writeboost module autonomously check whether
the failed device has recovered?
Retrying the I/O submission and detecting that the device has
recovered when an I/O succeeds is one solution, and I have implemented it.
I/O retry doesn't destroy any consistency in writeboost;
at worst it eventually stops accepting writes for lack of RAM
buffers, which can be reused only after a successful I/O to the cache device.
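The retry-until-recovery idea, sketched as a userspace illustration under my reading of the behavior (not the module's actual code; `submit` is a hypothetical stand-in for the wrapped I/O submission):

```c
#include <errno.h>

/* Hypothetical sketch of autonomous recovery: while blocked up, keep
 * retrying the submission; the first I/O that succeeds proves the
 * device has recovered, so clear `blockup` and carry on. */
static int blockup = 0;

static int retry_until_recovered(int (*submit)(void))
{
        for (;;) {
                if (submit() == 0) {
                        blockup = 0;    /* device recovered */
                        return 0;
                }
                blockup = 1;            /* still failing: stay blocked up */
                /* the real daemon would sleep here before retrying */
        }
}
```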

Akira

Dave Chinner Oct. 16, 2013, 6:07 a.m. UTC | #3
[cc xfs@oss.sgi.com]

On Tue, Oct 15, 2013 at 08:01:45PM -0400, Mikulas Patocka wrote:
> On Mon, 14 Oct 2013, Akira Hayakawa wrote:
> > But, XFS stalls ...
> > -------------------
> > For testing,
> > I manually set `blockup` to 1
> > while a Ruby compile was in progress
> > on XFS on a writeboost device.
> > As soon as I did,
> > XFS started dumping error messages
> > like "metadata I/O error: ... ("xlog_iodone") error ..."
> > and after a few seconds it started dumping messages
> > like "BUG: soft lockup - CPU#3 stuck for 22s!".
> > The system stalled and no longer accepted keyboard input.
> > 
> > I think this behavior is caused by
> > the device always returning -EIO after the
> > variable is set to 1.
> > But why does XFS stall on an I/O error?
> 
> Because it is bloated and buggy.

How did I know you'd take that cheap shot, Mikulas? You are so
predictable...

> We have bug 924301 for XFS crash on I/O 
> error...

Which is a problem with memory corruption after filling a dm
snapshot volume to 100%: shortly after XFS has shut down, the
kernel panics from memory corruption. Can't be reproduced without
filling the dm-snapshot volume to 100%, can't be reproduced with any
other filesystem. Crashes are also occurring randomly in printk and
the worker thread infrastructure. Memory and list poisoning clearly
indicates worker thread lists have freed objects on them. There are
lockdep messages from the DM snapshot code, etc.

There's actually very little pointing at XFS problems other than the
first hang that was reported, where XFS was stuck in a tight loop due
to memory corruption.  It reminds me of a very similar bug report
and triage we went through last week:

http://oss.sgi.com/pipermail/xfs/2013-October/030681.html

Further analysis and bisects pointed to the zram driver being buggy,
not XFS:

http://oss.sgi.com/pipermail/xfs/2013-October/030707.html

XFS has a history of exposing bugs in block device drivers that no
other filesystem exposes, so when a new block device driver gets
tested with XFS and we start seeing memory-corruption symptoms, it's
a fair bet that XFS is not the cause....

Just sayin'.

---

Akira, can you please post the entire set of messages you are
getting when XFS showing problems? That way I can try to confirm
whether it's a regression in XFS or something else.

Cheers,

Dave.
Akira Hayakawa Oct. 16, 2013, 10:34 a.m. UTC | #4
Dave

> Akira, can you please post the entire set of messages you are
> getting when XFS showing problems? That way I can try to confirm
> whether it's a regression in XFS or something else.

Environment:
- The kernel version is 3.12-rc1
- The debuggee is a KVM virtual machine equipped with 8 vcpus.
- writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
  You can clone it from https://github.com/akiradeveloper/dm-writeboost

Test:
1. Make a writeboost device with a 3MB cache device and a 3GB backing store
   with the default options (segment size order 7; a 2MB RAM buffer is allocated).
2. Start the testing/1 script (compiling Ruby, then running make test).
3. Set the blockup variable to 1 via the message interface a few seconds later.
   The writeboost device then starts returning -EIO on all incoming requests.
   I suspect this behavior causes the problem.

In some cases, XFS doesn't collapse after blockup is set to 1.
When I set the variable to 1 about 10 or 20 seconds later,
it didn't collapse: the compile simply stopped, and
after I set the variable back to 0, the compile resumed.
That XFS collapses in some cases (badly shutting down the filesystem, as seen below)
but doesn't collapse in others suggests to me that
the former case hits a very obscure corner-case bug.

The entire set of messages from the virsh console is shown below.
The writeboost-related lines are all benign;
the daemons are just stopping because the blockup variable is 1.

[  146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
[  146.285825] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
[  146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
[  146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[  147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[  172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[  200.052005] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  200.436005] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  206.484005] INFO: rcu_sched self-detected stall on CPU { 0}  (t=15000 jiffies g=1797 c=1796 q=3022)
[  232.052007] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  232.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  260.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  260.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]
[  288.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
[  288.436004] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

Akira

Dave Chinner Oct. 16, 2013, 11:01 a.m. UTC | #5
On Wed, Oct 16, 2013 at 07:34:38PM +0900, Akira Hayakawa wrote:
> Dave
> 
> > Akira, can you please post the entire set of messages you are
> > getting when XFS showing problems? That way I can try to confirm
> > whether it's a regression in XFS or something else.
> 
> Environment:
> - The kernel version is 3.12-rc1
> - The debuggee is a KVM virtual machine equipped with 8 vcpus.
> - writeboost version is commit 236732eb84684e8473353812acb3302232e1eab0
>   You can clone it from https://github.com/akiradeveloper/dm-writeboost
> 
> Test:
> 1. Make a writeboost device with a 3MB cache device and a 3GB backing store
>    with the default options (segment size order 7; a 2MB RAM buffer is allocated).
> 2. Start the testing/1 script (compiling Ruby, then running make test).
> 3. Set the blockup variable to 1 via the message interface a few seconds later.
>    The writeboost device then starts returning -EIO on all incoming requests.
>    I suspect this behavior causes the problem.
> 
> In some cases, XFS doesn't collapse after blockup is set to 1.
> When I set the variable to 1 about 10 or 20 seconds later,
> it didn't collapse: the compile simply stopped, and
> after I set the variable back to 0, the compile resumed.
> That XFS collapses in some cases (badly shutting down the filesystem, as seen below)
> but doesn't collapse in others suggests to me that
> the former case hits a very obscure corner-case bug.

XFS shuts down because you've returned EIO to a log IO. That's a
fatal error. If you do the same to an ext4 journal write, it will do
the equivalent of shut down (e.g. complain and turn read-only).

> The entire set of messages from the virsh console is shown below.
> The writeboost-related lines are all benign;
> the daemons are just stopping because the blockup variable is 1.
> 
> [  146.284626] XFS (dm-3): metadata I/O error: block 0x300d91 ("xlog_iodone") error 5 numblks 64
> [  146.285825] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
> [  146.286699] XFS (dm-3): Please umount the filesystem and rectify the problem(s)

What happened before this? Please attach the *full* log.

> [  146.560036] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [  147.244036] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [  172.052006] BUG: soft lockup - CPU#0 stuck for 23s! [script:3170]
> [  172.436003] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:57]

These should be emitting a stack trace. Can you turn up the logging
level you are using so that they emit a full stack trace? The
messages are useless without the stack dump....

Also, 23 seconds before this timestamp is 149s, about 3s after the
XFS filesystem shut down, so it's not clear that the XFS shutdown is
related to the soft lockup yet. That's what we need the stack traces
for...

> [  180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.

What's with the 35s delay between these writeboost messages? Have
you only done a partial shutdown of the block device and it takes
this length of time for it to completely block IO?

Cheers,

Dave.
Akira Hayakawa Oct. 16, 2013, 12:17 p.m. UTC | #6
Dave

> XFS shuts down because you've returned EIO to a log IO. That's a
> fatal error. If you do the same to an ext4 journal write, it will do
> the equivalent of shut down (e.g. complain and turn read-only).
Do you mean a block device should never return -EIO
if it doesn't want XFS to suddenly shut down?
As Mikulas said, connection failure is often the cause of
I/O errors from the underlying devices.

Is the fact that ext4 and XFS both decide to shut down on
erroneous journal writes due to an inherent limitation of journaling,
or just an implementation compromise? This is just out of curiosity.

>> [  180.560040] device-mapper: writeboost: err@recorder_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
>> [  180.561179] device-mapper: writeboost: err@sync_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> 
> What's with the 35s delay between these writeboost messages? Have
> you only done a partial shutdown of the block device and it takes
> this length of time for it to completely block IO?
Strange.
These daemons should stop within a few seconds under the current configuration.
Yes, partial in a sense: not all the daemons stop immediately, but
to the client the logical device appears blocked up,
returning -EIO on every I/O. I don't think this behavior harms the
upper layer.

Currently, sync_proc looks like this.
It sleeps for a few seconds, wakes up,
hits wait_on_blockup() to dump that message, and halts itself.
recorder_proc is implemented in the same way.

int sync_proc(void *data)
{
        int r;
        struct wb_cache *cache = data;
        struct wb_device *wb = cache->wb;
        unsigned long intvl;

        while (!kthread_should_stop()) {

                /* Halts here, dumping the message above, while blockup is 1 */
                wait_on_blockup();

                /* sec -> ms */
                intvl = ACCESS_ONCE(cache->sync_interval) * 1000;

                if (!intvl) {
                        /* Syncing disabled; re-check the setting every second */
                        schedule_timeout_interruptible(msecs_to_jiffies(1000));
                        continue;
                }

                flush_current_buffer(cache);

                RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));

                schedule_timeout_interruptible(msecs_to_jiffies(intvl));
        }
        return 0;
}

Perhaps XFS shuts down, goes crazy, and prevents the kthread from waking up?

> These should be emitting a stack trace. Can you turn up the logging
> level you are using so that they emit a full stack trace? The
> messages are useless without the stack dump....
I turned the log level up to 7. Here is one instance.

Connected to domain Hercules
Escape character is ^]
[   54.683482] device-mapper: writeboost: err@audit_cache_device() superblock header: magic number invalid
[   54.809262] bio: create slab <bio-2> at 2
[   68.812800] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[   68.825016] XFS (dm-3): Mounting Filesystem
[   68.847027] XFS (dm-3): Ending clean mount
[   72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
[   72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[   72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
[   73.894429] Buffer I/O error on device dm-3, logical block 98354
[   73.895824] lost page write due to I/O error on dm-3
[   73.897042] Buffer I/O error on device dm-3, logical block 98355
[   73.897209] Buffer I/O error on device dm-3, logical block 196641
[   73.897210] lost page write due to I/O error on dm-3
[   73.897263] Buffer I/O error on device dm-3, logical block 196688
[   73.897264] lost page write due to I/O error on dm-3
[   73.897266] Buffer I/O error on device dm-3, logical block 196689
[   73.897267] lost page write due to I/O error on dm-3
[   73.897268] Buffer I/O error on device dm-3, logical block 196690
[   73.897269] lost page write due to I/O error on dm-3
[   73.897270] Buffer I/O error on device dm-3, logical block 196691
[   73.897271] lost page write due to I/O error on dm-3
[   73.897272] Buffer I/O error on device dm-3, logical block 196692
[   73.897273] lost page write due to I/O error on dm-3
[   73.897307] Buffer I/O error on device dm-3, logical block 294955
[   73.897308] lost page write due to I/O error on dm-3
[   73.897335] Buffer I/O error on device dm-3, logical block 294956
[   73.897335] lost page write due to I/O error on dm-3
[   73.914261] lost page write due to I/O error on dm-3
[   73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
[   74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
[   74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.047556] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
[   74.049893] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
[   74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") error 5 numblks 64
[   74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.055435] XFS (dm-3): metadata I/O error: block 0x300cff ("xlog_iodone") error 5 numblks 64
[   74.057162] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.059402] XFS (dm-3): metadata I/O error: block 0x300d3f ("xlog_iodone") error 5 numblks 64
[   74.061136] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.063561] XFS (dm-3): metadata I/O error: block 0x300d7f ("xlog_iodone") error 5 numblks 64
[   74.065667] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.068215] XFS (dm-3): metadata I/O error: block 0x300dbf ("xlog_iodone") error 5 numblks 64
[   74.069920] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[   74.072325] XFS (dm-3): metadata I/O error: block 0x300dff ("xlog_iodone") error 5 numblks 64
[   74.074118] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
[  100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
[  100.052005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.052005] CPU: 0 PID: 215 Comm: kworker/0:1H Tainted: G           O 3.12.0-rc1 #8
[  100.052005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  100.052005] task: ffff880216eb67b0 ti: ffff8801f9c7c000 task.ti: ffff8801f9c7c000
[  100.052005] RIP: 0010:[<ffffffff8108146a>]  [<ffffffff8108146a>] do_raw_spin_lock+0x1d/0x23
[  100.052005] RSP: 0018:ffff8801f9c7dde0  EFLAGS: 00000206
[  100.052005] RAX: 000000000116011c RBX: ffff880216eb6818 RCX: 0000000000000001
[  100.052005] RDX: 0000000000000116 RSI: ffff8801fa389218 RDI: ffff880205c69e80
[  100.052005] RBP: ffff880205c69e40 R08: ffff88021fc12ad8 R09: 0000000000000001
[  100.052005] R10: 0000000000000001 R11: ffff88002f36e3c0 R12: ffffffff810605a8
[  100.052005] R13: ffff880216eb6818 R14: ffff88021fc12ef0 R15: ffff880216eb6818
[  100.052005] FS:  0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[  100.052005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  100.052005] CR2: ffffe8ffffc00520 CR3: 000000002ef57000 CR4: 00000000000006f0
[  100.052005] Stack:
[  100.052005]  ffffffffa03aa836 ffff88002f36e418 ffff88021fc12ac0 ffffe8ffffc00400
[  100.052005]  ffff88002f36e3c0 ffff88002f36e460 ffffffffa03a99c2 ffff88002f36e3c0
[  100.052005]  ffffffffa03a9bd2 ffff8801faf95bc0 ffff88002f36e460 ffff88021fc12ac0
[  100.052005] Call Trace:
[  100.052005]  [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[  100.052005]  [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[  100.052005]  [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[  100.052005]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[  100.052005]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[  100.052005]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[  100.052005]  [<ffffffff81050641>] ? kthread+0x81/0x89
[  100.052005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.052005]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[  100.052005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.052005] Code: 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 66 39 d0 74 04 f3 90 <eb> f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff c2 89 c2 0f 45 d1 0f 
[  100.244006] BUG: soft lockup - CPU#2 stuck for 22s! [xfsaild/dm-3:3167]
[  100.244006] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-3 Tainted: G           O 3.12.0-rc1 #8
[  100.244006] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.244006] task: ffff88002fef2100 ti: ffff88002b384000 task.ti: ffff88002b384000
[  100.244006] RIP: 0010:[<ffffffff8108146a>]  [<ffffffff8108146a>] do_raw_spin_lock+0x1d/0x23
[  100.244006] RSP: 0018:ffff88002b385b00  EFLAGS: 00000206
[  100.244006] RAX: 000000000117011c RBX: ffffffff8105e684 RCX: 0000000000000002
[  100.244006] RDX: 0000000000000117 RSI: ffff8802166d7200 RDI: ffff880205c69e80
[  100.244006] RBP: ffff8801fa2e2670 R08: 0000000000000005 R09: 0000000000000000
[  100.244006] R10: 000000000000a944 R11: 0000000000000000 R12: ffff88002f181674
[  100.244006] R13: 0000000000000001 R14: 0000000000000001 R15: ffff88002fef2100
[  100.244006] FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[  100.244006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  100.244006] CR2: ffffe8ffffc80400 CR3: 00000001fae7c000 CR4: 00000000000006e0
[  100.244006] Stack:
[  100.244006]  ffffffffa0372e3d ffff88021682a830 ffffffffa036050e 0000000000012e80
[  100.244006]  ffff88002b385fd8 ffff88002b385fd8 000000022fef2100 0000000000300e3f
[  100.244006]  ffff88002f071500 0000000000000292 ffffffff81050e94 0000000000000000
[  100.244006] Call Trace:
[  100.244006]  [<ffffffffa0372e3d>] ? xfs_trans_committed_bulk+0x2f/0x19d [xfs]
[  100.244006]  [<ffffffffa036050e>] ? _xfs_buf_ioapply+0x271/0x29c [xfs]
[  100.244006]  [<ffffffff81050e94>] ? remove_wait_queue+0xe/0x48
[  100.244006]  [<ffffffffa03a5b85>] ? xlog_wait+0x62/0x6b [xfs]
[  100.244006]  [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
[  100.244006]  [<ffffffffa03a78e6>] ? xlog_state_get_iclog_space+0x5a/0x1fb [xfs]
[  100.244006]  [<ffffffff810fa1c3>] ? __cache_free.isra.46+0x178/0x187
[  100.244006]  [<ffffffffa03a8e0b>] ? xlog_cil_committed+0x2f/0xe6 [xfs]
[  100.244006]  [<ffffffffa03a926c>] ? xlog_cil_push+0x2f6/0x311 [xfs]
[  100.244006]  [<ffffffff81058701>] ? mmdrop+0xd/0x1c
[  100.244006]  [<ffffffffa03a9780>] ? xlog_cil_force_lsn+0x71/0xdd [xfs]
[  100.244006]  [<ffffffffa03a8162>] ? _xfs_log_force+0x55/0x1a0 [xfs]
[  100.244006]  [<ffffffffa03a82cc>] ? xfs_log_force+0x1f/0x4e [xfs]
[  100.244006]  [<ffffffffa03aba15>] ? xfsaild+0x144/0x4cd [xfs]
[  100.244006]  [<ffffffff810592dc>] ? finish_task_switch+0x7f/0xaa
[  100.244006]  [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
[  100.244006]  [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
[  100.244006]  [<ffffffff81050641>] ? kthread+0x81/0x89
[  100.244006]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.244006]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[  100.244006]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.244006] Code: 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 66 39 d0 74 04 f3 90 <eb> f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff c2 89 c2 0f 45 d1 0f 
[  100.340005] BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1H:207]
[  100.340005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.340005] CPU: 3 PID: 207 Comm: kworker/3:1H Tainted: G           O 3.12.0-rc1 #8
[  100.340005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.340005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  100.340005] task: ffff8801f9cc5870 ti: ffff8801f9c06000 task.ti: ffff8801f9c06000
[  100.340005] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[  100.340005] RSP: 0018:ffff8801f9c07de0  EFLAGS: 00000202
[  100.340005] RAX: 000000000118011c RBX: ffffe8ffffcc0400 RCX: 0000000000000001
[  100.340005] RDX: 0000000000000118 RSI: ffff8801fa389218 RDI: ffff880205c69e80
[  100.340005] RBP: ffff880205c69e40 R08: ffff88021fcd2ad8 R09: 0000000000000001
[  100.340005] R10: 0000000000000001 R11: ffff88002f36e540 R12: ffff88002f36e674
[  100.340005] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  100.340005] FS:  0000000000000000(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
[  100.340005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  100.340005] CR2: 00007f15636c7e50 CR3: 000000000160b000 CR4: 00000000000006e0
[  100.340005] Stack:
[  100.340005]  ffffffffa03aa836 ffff88002f36e598 ffff88021fcd2ac0 ffffe8ffffcc0400
[  100.340005]  ffff88002f36e540 ffff88002f36e5e0 ffffffffa03a99c2 ffff88002f36e540
[  100.340005]  ffffffffa03a9bd2 ffff8801f9d16f40 ffff88002f36e5e0 ffff88021fcd2ac0
[  100.340005] Call Trace:
[  100.340005]  [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[  100.340005]  [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[  100.340005]  [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[  100.340005]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[  100.340005]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[  100.340005]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[  100.340005]  [<ffffffff81050641>] ? kthread+0x81/0x89
[  100.340005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.340005]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[  100.340005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.340005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff 
[  100.436010] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:537]
[  100.436010] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.436010] CPU: 4 PID: 537 Comm: kworker/4:2 Tainted: G           O 3.12.0-rc1 #8
[  100.436010] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.436010] Workqueue: xfs-reclaim/dm-3 xfs_reclaim_worker [xfs]
[  100.436010] task: ffff8801fade6040 ti: ffff88002edaa000 task.ti: ffff88002edaa000
[  100.436010] RIP: 0010:[<ffffffff81081460>]  [<ffffffff81081460>] do_raw_spin_lock+0x13/0x23
[  100.436010] RSP: 0018:ffff88002edabc20  EFLAGS: 00000202
[  100.436010] RAX: 00000000011b011c RBX: ffff88002f693870 RCX: 0000000000000000
[  100.436010] RDX: 000000000000011b RSI: 0000000000000000 RDI: ffff880205c69e80
[  100.436010] RBP: ffff88002a15c000 R08: 0000000000000000 R09: 0000000000000006
[  100.436010] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  100.436010] R13: ffff880216e79ec0 R14: ffff88002edabc24 R15: 0000000000012e80
[  100.436010] FS:  0000000000000000(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[  100.436010] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  100.436010] CR2: 00007f7e9b394f30 CR3: 000000000160b000 CR4: 00000000000006e0
[  100.436010] Stack:
[  100.436010]  ffffffffa03ab5f3 0000000000000001 ffff8801fa383710 ffff88002a15c000
[  100.436010]  0000000000000445 0000000000000002 ffff8801fa3836c0 0000000000000002
[  100.436010]  ffffffffa03667a3 0000000000000444 0000000000000000 ffff8801fa3836c0
[  100.436010] Call Trace:
[  100.436010]  [<ffffffffa03ab5f3>] ? xfs_iflush_abort+0x35/0x9a [xfs]
[  100.436010]  [<ffffffffa03667a3>] ? xfs_reclaim_inode+0x85/0x246 [xfs]
[  100.436010]  [<ffffffffa0366aab>] ? xfs_reclaim_inodes_ag+0x147/0x1fc [xfs]
[  100.436010]  [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
[  100.436010]  [<ffffffff81056927>] ? __wake_up_common+0x42/0x78
[  100.436010]  [<ffffffff810d5a18>] ? fold_diff+0x22/0x2e
[  100.436010]  [<ffffffff810408d8>] ? lock_timer_base.isra.35+0x23/0x48
[  100.436010]  [<ffffffff81040750>] ? internal_add_timer+0xd/0x28
[  100.436010]  [<ffffffff8104125b>] ? __mod_timer+0xfa/0x10c
[  100.436010]  [<ffffffffa0367382>] ? xfs_reclaim_inodes+0x16/0x1b [xfs]
[  100.436010]  [<ffffffffa036739c>] ? xfs_reclaim_worker+0x15/0x1e [xfs]
[  100.436010]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[  100.436010]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[  100.436010]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[  100.436010]  [<ffffffff81050641>] ? kthread+0x81/0x89
[  100.436010]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.436010]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[  100.436010]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.436010] Code: 31 c0 48 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c <66> 8b 07 66 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 
[  100.628005] BUG: soft lockup - CPU#6 stuck for 22s! [script:3151]
[  100.628005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.628005] CPU: 6 PID: 3151 Comm: script Tainted: G           O 3.12.0-rc1 #8
[  100.628005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.628005] task: ffff88002fc3d870 ti: ffff88002fda4000 task.ti: ffff88002fda4000
[  100.628005] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[  100.628005] RSP: 0018:ffff88002fda5d10  EFLAGS: 00000202
[  100.628005] RAX: 00000000011a011c RBX: ffffffff81119902 RCX: 00000000000004e2
[  100.628005] RDX: 000000000000011a RSI: ffff88002f64e200 RDI: ffff880205c69e80
[  100.628005] RBP: ffff88002fda5e58 R08: ffffffffa03bec40 R09: 0000000000000000
[  100.628005] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002fda5cf0
[  100.628005] R13: 0000000000000001 R14: ffff8801facfea10 R15: 0000000000000000
[  100.628005] FS:  00007f599f046700(0000) GS:ffff88021fd80000(0000) knlGS:0000000000000000
[  100.628005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  100.628005] CR2: 000000000234c698 CR3: 000000002f62e000 CR4: 00000000000006e0
[  100.628005] Stack:
[  100.628005]  ffffffffa03abe25 ffff8801facfe800 ffff88002fda5e58 ffff88002ffcb000
[  100.628005]  ffffffffa03673bf ffff88002fda5d38 000000002fda5d38 0000000000001036
[  100.628005]  ffffffff8110875e 0000000000000000 0000000000000000 ffff88002ffcb360
[  100.628005] Call Trace:
[  100.628005]  [<ffffffffa03abe25>] ? xfs_ail_push_all+0x13/0x4f [xfs]
[  100.628005]  [<ffffffffa03673bf>] ? xfs_reclaim_inodes_nr+0x1a/0x34 [xfs]
[  100.628005]  [<ffffffff8110875e>] ? super_cache_scan+0x121/0x13e
[  100.628005]  [<ffffffff810cdb7a>] ? shrink_slab+0x1e3/0x2f9
[  100.628005]  [<ffffffff81119526>] ? iput+0x34/0x13d
[  100.628005]  [<ffffffff81148d14>] ? do_coredump+0xbc3/0xbc3
[  100.628005]  [<ffffffff81148e3f>] ? drop_caches_sysctl_handler+0x65/0x76
[  100.628005]  [<ffffffff81157b7c>] ? proc_sys_call_handler+0x98/0xbf
[  100.628005]  [<ffffffff81105eca>] ? vfs_write+0x9e/0x104
[  100.628005]  [<ffffffff811061c1>] ? SyS_write+0x51/0x85
[  100.628005]  [<ffffffff8137f592>] ? system_call_fastpath+0x16/0x1b
[  100.628005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff 
[  100.724004] BUG: soft lockup - CPU#7 stuck for 22s! [kworker/7:1H:211]
[  100.724005] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm psmouse microcode pcspkr serio_raw processor i2c_piix4 i2c_core evdev joydev virtio_balloon snd_page_alloc snd_timer snd soundcore thermal_sys button ext4 crc16 jbd2 mbcache dm_mod hid_generic usbhid hid sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore scsi_mod virtio_pci virtio_ring usb_common virtio floppy
[  100.724005] CPU: 7 PID: 211 Comm: kworker/7:1H Tainted: G           O 3.12.0-rc1 #8
[  100.724005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  100.724005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  100.724005] task: ffff88002f7de830 ti: ffff8801fa100000 task.ti: ffff8801fa100000
[  100.724005] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
[  100.724005] RSP: 0018:ffff8801fa101de0  EFLAGS: 00000206
[  100.724005] RAX: 000000000119011c RBX: ffff8801fa7ad0c0 RCX: 0000000000000001
[  100.724005] RDX: 0000000000000119 RSI: ffff8801fac4a398 RDI: ffff880205c69e80
[  100.724005] RBP: ffff880205c69e40 R08: ffff88021fdd2ad8 R09: ffff88002f181500
[  100.724005] R10: ffff88002f181500 R11: ffff88002f2670c0 R12: ffff8801fa7ad0c0
[  100.724005] R13: ffff88021658a360 R14: ffffffffa03c3e75 R15: ffffe8ffffdc0400
[  100.724005] FS:  0000000000000000(0000) GS:ffff88021fdc0000(0000) knlGS:0000000000000000
[  100.724005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  100.724005] CR2: 000000000131b724 CR3: 00000001fae7c000 CR4: 00000000000006e0
[  100.724005] Stack:
[  100.724005]  ffffffffa03aa836 ffff88002f267118 ffff88021fdd2ac0 ffffe8ffffdc0400
[  100.724005]  ffff88002f2670c0 ffff88002f267160 ffffffffa03a99c2 ffff88002f2670c0
[  100.724005]  ffffffffa03a9bd2 ffff880216eb42c0 ffff88002f267160 ffff88021fdd2ac0
[  100.724005] Call Trace:
[  100.724005]  [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
[  100.724005]  [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
[  100.724005]  [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
[  100.724005]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
[  100.724005]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
[  100.724005]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
[  100.724005]  [<ffffffff81050641>] ? kthread+0x81/0x89
[  100.724005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.724005]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
[  100.724005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
[  100.724005] Code: 81 ff 98 a2 37 81 72 0c 31 c0 48 81 ff 18 a6 37 81 0f 92 c0 c3 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0c 66 8b 07 <66> 39 d0 74 04 f3 90 eb f4 c3 90 90 90 83 c8 ff 0f b7 ca 66 ff 

Akira

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Dave Chinner Oct. 16, 2013, 9:42 p.m. UTC | #7
On Wed, Oct 16, 2013 at 09:17:40PM +0900, Akira Hayakawa wrote:
> Dave
> 
> > XFS shuts down because you've returned EIO to a log IO. That's a
> > fatal error. If you do the same to an ext4 journal write, it will do
> > the equivalent of shut down (e.g. complain and turn read-only).
> You mean a block device should not return -EIO at all if
> it doesn't want XFS to suddenly shut down?

Yes. EIO means an IO error has occurred. That causes failure paths
to be triggered in the upper layers.

I really don't understand what you are trying to achieve with this
"blockup" thing. If something goes wrong with the device, then you
*cannot recover* by sending EIO to any new IOs and then continuing
on at a later time as though nothing has happened. The moment a
filesystem gets EIO from a metadata write, it is likely to be
corrupted and if you continue onwards after that you simply
propagate the corruption.

> As Mikulas said, connection failure is often the cause of
> I/O errors from the underlying devices.

Connection failure is *rarely* the cause of IO errors, except in
environments where SANs are in use. Even then multipathing makes
fatal connection failure a rare occurrence. Broken hardware is a
much more common cause of problems at the storage layers.

> Is the fact that both ext4 and XFS decide to shut down on
> erroneous journal writes due to a limitation of journal writes,
> or just a compromise in implementation? This is just for my curiosity.

A failed, unrecoverable journal write violates the filesystem
consistency model of any journalling filesystem. Operations must be
stopped and the hardware and filesystem must be repaired, otherwise
loss of data will occur.

i.e. You're telling the filesystem that it's had a fatal IO error by
returning EIO, and the filesystems are treating it as though they've
seen a fatal IO error.


Simple rule: Don't complete IOs with EIO if you haven't had a fatal
IO error.

>         struct wb_cache *cache = data;
>         struct wb_device *wb = cache->wb;
>         unsigned long intvl;
> 
>         while (!kthread_should_stop()) {
> 
>                 wait_on_blockup();

Ugh. You should be using workqueue with timed work for this.

BTW, you're missing the handling needed by these kernel threads
for suspend-to-disk/ram....

> [   68.825016] XFS (dm-3): Mounting Filesystem
> [   68.847027] XFS (dm-3): Ending clean mount
> [   72.100112] device-mapper: writeboost: err@dm_safe_io_internal() system is blocked up on I/O error. set blockup to 0 after checkup.
> [   72.109702] device-mapper: writeboost: err@migrate_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [   72.812097] device-mapper: writeboost: err@modulator_proc() system is blocked up on I/O error. set blockup to 0 after checkup.
> [   73.894429] Buffer I/O error on device dm-3, logical block 98354
> [   73.895824] lost page write due to I/O error on dm-3

Data IO has been lost due to EIOs. You've corrupted user files when
this error is emitted.

...

> [   73.930022] XFS (dm-3): metadata I/O error: block 0x40 ("xfs_buf_iodone_callbacks") error 5 numblks 16
> [   74.036759] XFS (dm-3): metadata I/O error: block 0x300c7f ("xlog_iodone") error 5 numblks 64
> [   74.043456] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417
> [   74.047556] XFS (dm-3): Log I/O Error Detected.  Shutting down filesystem
> [   74.049893] XFS (dm-3): Please umount the filesystem and rectify the problem(s)
> [   74.051467] XFS (dm-3): metadata I/O error: block 0x300cbf ("xlog_iodone") error 5 numblks 64
> [   74.053190] XFS (dm-3): xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa03a6417

And these are all the log buffers containing uncommitted changes
being aborted due to EIO. The filesystem state in memory now doesn't
match the state on disk, and so it's effectively corrupt and shuts
down.

....

> [  100.052005] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1H:215]
...
> [  100.052005] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [  100.052005] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
> [  100.052005] Call Trace:
> [  100.052005]  [<ffffffffa03aa836>] ? xfs_buf_iodone+0x1b/0x49 [xfs]
> [  100.052005]  [<ffffffffa03a99c2>] ? xfs_buf_do_callbacks+0x22/0x30 [xfs]
> [  100.052005]  [<ffffffffa03a9bd2>] ? xfs_buf_iodone_callbacks+0x16b/0x1c4 [xfs]
> [  100.052005]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
> [  100.052005]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
> [  100.052005]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
> [  100.052005]  [<ffffffff81050641>] ? kthread+0x81/0x89
> [  100.052005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [  100.052005]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [  100.052005]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

You need to compile your kernel with framepointers enabled so we get
reliable stack traces. I think it's stuck on a spinlock in
xfs_buf_iodone, which would imply the AIL lock.

.....

> [  100.244006] CPU: 2 PID: 3167 Comm: xfsaild/dm-3 Tainted: G           O 3.12.0-rc1 #8

FWIW, you should probably be testing the lastest Linus kernel
(3.12-rc5, IIRC) rather than -rc1....

....

> [  100.244006] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [  100.244006] Call Trace:
> [  100.244006]  [<ffffffffa0372e3d>] ? xfs_trans_committed_bulk+0x2f/0x19d [xfs]
> [  100.244006]  [<ffffffffa036050e>] ? _xfs_buf_ioapply+0x271/0x29c [xfs]
> [  100.244006]  [<ffffffff81050e94>] ? remove_wait_queue+0xe/0x48
> [  100.244006]  [<ffffffffa03a5b85>] ? xlog_wait+0x62/0x6b [xfs]
> [  100.244006]  [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
> [  100.244006]  [<ffffffffa03a78e6>] ? xlog_state_get_iclog_space+0x5a/0x1fb [xfs]
> [  100.244006]  [<ffffffff810fa1c3>] ? __cache_free.isra.46+0x178/0x187
> [  100.244006]  [<ffffffffa03a8e0b>] ? xlog_cil_committed+0x2f/0xe6 [xfs]
> [  100.244006]  [<ffffffffa03a926c>] ? xlog_cil_push+0x2f6/0x311 [xfs]
> [  100.244006]  [<ffffffff81058701>] ? mmdrop+0xd/0x1c
> [  100.244006]  [<ffffffffa03a9780>] ? xlog_cil_force_lsn+0x71/0xdd [xfs]
> [  100.244006]  [<ffffffffa03a8162>] ? _xfs_log_force+0x55/0x1a0 [xfs]
> [  100.244006]  [<ffffffffa03a82cc>] ? xfs_log_force+0x1f/0x4e [xfs]
> [  100.244006]  [<ffffffffa03aba15>] ? xfsaild+0x144/0x4cd [xfs]
> [  100.244006]  [<ffffffff810592dc>] ? finish_task_switch+0x7f/0xaa
> [  100.244006]  [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
> [  100.244006]  [<ffffffffa03ab8d1>] ? xfs_trans_ail_cursor_first+0x76/0x76 [xfs]
> [  100.244006]  [<ffffffff81050641>] ? kthread+0x81/0x89
> [  100.244006]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [  100.244006]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [  100.244006]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

It's stuck on a spin lock, but I don't know what function it's
in because the stack trace is indeterminate (i.e. need frame
pointers enabled). It might be the AIL lock (as it's the xfsaild),
but I can't tell.

> [  100.436010] BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:2:537]
...
> [  100.436010] Workqueue: xfs-reclaim/dm-3 xfs_reclaim_worker [xfs]
> [  100.436010] RIP: 0010:[<ffffffff81081460>]  [<ffffffff81081460>] do_raw_spin_lock+0x13/0x23
> [  100.436010] Call Trace:
> [  100.436010]  [<ffffffffa03ab5f3>] ? xfs_iflush_abort+0x35/0x9a [xfs]
> [  100.436010]  [<ffffffffa03667a3>] ? xfs_reclaim_inode+0x85/0x246 [xfs]
> [  100.436010]  [<ffffffffa0366aab>] ? xfs_reclaim_inodes_ag+0x147/0x1fc [xfs]
> [  100.436010]  [<ffffffff8105bbfb>] ? try_to_wake_up+0x190/0x190
> [  100.436010]  [<ffffffff81056927>] ? __wake_up_common+0x42/0x78
> [  100.436010]  [<ffffffff810d5a18>] ? fold_diff+0x22/0x2e
> [  100.436010]  [<ffffffff810408d8>] ? lock_timer_base.isra.35+0x23/0x48
> [  100.436010]  [<ffffffff81040750>] ? internal_add_timer+0xd/0x28
> [  100.436010]  [<ffffffff8104125b>] ? __mod_timer+0xfa/0x10c
> [  100.436010]  [<ffffffffa0367382>] ? xfs_reclaim_inodes+0x16/0x1b [xfs]
> [  100.436010]  [<ffffffffa036739c>] ? xfs_reclaim_worker+0x15/0x1e [xfs]
> [  100.436010]  [<ffffffff8104b7ab>] ? process_one_work+0x191/0x294
> [  100.436010]  [<ffffffff8104bc5d>] ? worker_thread+0x121/0x1e7
> [  100.436010]  [<ffffffff8104bb3c>] ? rescuer_thread+0x269/0x269
> [  100.436010]  [<ffffffff81050641>] ? kthread+0x81/0x89
> [  100.436010]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d
> [  100.436010]  [<ffffffff8137f4ec>] ? ret_from_fork+0x7c/0xb0
> [  100.436010]  [<ffffffff810505c0>] ? __kthread_parkme+0x5d/0x5d

Also stuck on a spin lock, but again it is not obvious what function
it is in and hence what spinlock is affected. xfs_iflush_abort()
does take the AIL lock, so that might be it.

> [  100.628005] BUG: soft lockup - CPU#6 stuck for 22s! [script:3151]
> [  100.628005] RIP: 0010:[<ffffffff81081463>]  [<ffffffff81081463>] do_raw_spin_lock+0x16/0x23
> [  100.628005] Call Trace:
> [  100.628005]  [<ffffffffa03abe25>] ? xfs_ail_push_all+0x13/0x4f [xfs]
> [  100.628005]  [<ffffffffa03673bf>] ? xfs_reclaim_inodes_nr+0x1a/0x34 [xfs]
> [  100.628005]  [<ffffffff8110875e>] ? super_cache_scan+0x121/0x13e
> [  100.628005]  [<ffffffff810cdb7a>] ? shrink_slab+0x1e3/0x2f9
> [  100.628005]  [<ffffffff81119526>] ? iput+0x34/0x13d
> [  100.628005]  [<ffffffff81148d14>] ? do_coredump+0xbc3/0xbc3
> [  100.628005]  [<ffffffff81148e3f>] ? drop_caches_sysctl_handler+0x65/0x76
> [  100.628005]  [<ffffffff81157b7c>] ? proc_sys_call_handler+0x98/0xbf
> [  100.628005]  [<ffffffff81105eca>] ? vfs_write+0x9e/0x104
> [  100.628005]  [<ffffffff811061c1>] ? SyS_write+0x51/0x85
> [  100.628005]  [<ffffffff8137f592>] ? system_call_fastpath+0x16/0x1b

That can only be stuck on the AIL spin lock.

So, I've just audited all the uses of the AIL lock, and I cannot
find an unbalanced user of the AIL lock. If we've leaked the spin
lock, it's not an obvious or easy to trigger bug. Can you turn on
lockdep as well as CONFIG_XFS_DEBUG and see what warnings that
throws?

Cheers,

Dave.
Akira Hayakawa Oct. 19, 2013, 10:59 a.m. UTC | #8
Dave,

# -EIO returned corrupts XFS
I turned on
lockdep, frame pointers and XFS debug,
and also moved from 3.12.0-rc1 to 3.12.0-rc5.

With these changes,
the problem we discussed in the previous mails
*never* reproduces.

However, if I turn off only lockdep,
it hangs up when I set blockup to 1 and then back to 0.
An underlying device that once went dead and then revives
may confuse the filesystem, but
this looks unrelated to locking.
This problem may be reproducible using dm-flakey.

This behavior is not reproducible with ext4.

--------------------------------------------
root@Lyle:/home/akira# virsh console Hercules
Connected to domain Hercules
Escape character is ^]
[  143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
[  143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
[  143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
[  143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
[  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa0430f9c
[  143.045335] XFS: Log I/O Error Detected.  Shutting down filesystem
[  143.045335] XFS: Please umount the filesystem and rectify the problem(s)
[  143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
[  143.045392] ------------[ cut here ]------------
[  143.045393] kernel BUG at fs/xfs/xfs_message.c:108!
[  143.045396] invalid opcode: 0000 [#1] SMP 
[  143.045429] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci virtio_ring virtio
[  143.045433] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G           O 3.12.0-rc5 #11
[  143.045434] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
[  143.045466] task: ffff88002fe7e0c0 ti: ffff88002ff66000 task.ti: ffff88002ff66000
[  143.045493] RIP: 0010:[<ffffffffa03e72f3>]  [<ffffffffa03e72f3>] assfail+0x1d/0x1f [xfs]
[  143.045494] RSP: 0000:ffff88002ff67d98  EFLAGS: 00010292
[  143.045495] RAX: 000000000000005e RBX: ffff88020de3c280 RCX: 0000000000000c8a
[  143.045496] RDX: 0000000000000000 RSI: ffffffff819c3f2c RDI: ffffffff818356e0
[  143.045497] RBP: ffff88002ff67d98 R08: 0000000000000000 R09: ffffffff819c3ecc
[  143.045498] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801fa583d28
[  143.045499] R13: ffff8801fa583c00 R14: 0000000000000002 R15: 0000000000000000
[  143.045501] FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[  143.045502] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  143.045505] CR2: 00007f6e7680db67 CR3: 000000000180b000 CR4: 00000000000006e0
[  143.045511] Stack:
[  143.045514]  ffff88002ff67dc8 ffffffffa0430ed1 ffff8801fb124e40 0000000000000002
[  143.045517]  ffff88020de3c280 ffffe8ffffc80500 ffff88002ff67df8 ffffffffa0431018
[  143.045519]  ffff8801fb124ee0 ffff8801fb124ee0 ffff8801fb124e40 ffff88021fc92a00
[  143.045520] Call Trace:
[  143.045558]  [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[  143.045590]  [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
[  143.045609]  [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[  143.045627]  [<ffffffff8104f21a>] process_one_work+0x18b/0x297
[  143.045631]  [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
[  143.045634]  [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
[  143.045638]  [<ffffffff81054507>] kthread+0x88/0x90
[  143.045641]  [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[  143.045646]  [<ffffffff813a707c>] ret_from_fork+0x7c/0xb0
[  143.045649]  [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[  143.045670] Code: 48 c7 c7 6b 26 45 a0 e8 a4 2a c5 e0 5d c3 55 48 89 f1 41 89 d0 48 c7 c6 80 26 45 a0 48 89 fa 31 c0 48 89 e5 31 ff e8 aa fc ff ff <0f> 0b 55 48 63 f6 49 89 f9 41 b8 01 00 00 00 b9 10 00 00 00 ba 
[  143.045694] RIP  [<ffffffffa03e72f3>] assfail+0x1d/0x1f [xfs]
[  143.045695]  RSP <ffff88002ff67d98>
[  143.045697] ---[ end trace e4f1a3c306353178 ]---
[  143.045741] BUG: unable to handle kernel paging request at ffffffffffffffd8
[  143.045745] IP: [<ffffffff810548fd>] kthread_data+0xb/0x11
[  143.045747] PGD 180c067 PUD 180e067 PMD 0 
[  143.045749] Oops: 0000 [#2] SMP 
[  143.045774] Modules linked in: xfs crc32c libcrc32c dm_writeboost(O) fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop snd_hda_intel snd_hda_codec snd_hwdep snd_pcm joydev hid_generic usbhid processor snd_page_alloc i2c_piix4 i2c_core snd_timer snd soundcore psmouse serio_raw evdev virtio_balloon pcspkr hid thermal_sys microcode button ext4 crc16 jbd2 mbcache dm_mod sg sr_mod cdrom ata_generic virtio_blk virtio_net floppy uhci_hcd ehci_hcd ata_piix usbcore libata usb_common scsi_mod virtio_pci virtio_ring virtio
[  143.045777] CPU: 2 PID: 216 Comm: kworker/2:1H Tainted: G      D    O 3.12.0-rc5 #11
[  143.045778] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[  143.045795] task: ffff88002fe7e0c0 ti: ffff88002ff66000 task.ti: ffff88002ff66000
[  143.045798] RIP: 0010:[<ffffffff810548fd>]  [<ffffffff810548fd>] kthread_data+0xb/0x11
[  143.045802] RSP: 0000:ffff88002ff67a38  EFLAGS: 00010002
[  143.045804] RAX: 0000000000000000 RBX: ffff88021fc92e80 RCX: ffff88021fc92ef0
[  143.045805] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88002fe7e0c0
[  143.045805] RBP: ffff88002ff67a38 R08: ffffffff819b6640 R09: 000000000000bfcf
[  143.045806] R10: dead000000200200 R11: 0000000000000000 R12: 0000000000000002
[  143.045807] R13: 0000000000000001 R14: 0000000000000002 R15: ffff88002fe7e0b0
[  143.045809] FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[  143.045810] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  143.045813] CR2: 0000000000000028 CR3: 000000000180b000 CR4: 00000000000006e0
[  143.045818] Stack:
[  143.045821]  ffff88002ff67a58 ffffffff8104faa5 ffff88021fc92e80 ffff88002fe7e408
[  143.045824]  ffff88002ff67ad8 ffffffff813a0860 0000000000000000 ffff88002f53b6e0
[  143.045826]  ffff88002f53b6f0 ffff88002fe7e0c0 ffff88002ff67fd8 ffff88002ff67fd8
[  143.045827] Call Trace:
[  143.045831]  [<ffffffff8104faa5>] wq_worker_sleeping+0xf/0x87
[  143.045836]  [<ffffffff813a0860>] __schedule+0x142/0x548
[  143.045839]  [<ffffffff813a0f4d>] schedule+0x60/0x62
[  143.045842]  [<ffffffff8103c014>] do_exit+0x8f8/0x90f
[  143.045846]  [<ffffffff8139b740>] ? printk+0x48/0x4a
[  143.045848]  [<ffffffff813a2b8d>] oops_end+0xa9/0xb1
[  143.045854]  [<ffffffff81004e5c>] die+0x55/0x61
[  143.045858]  [<ffffffff813a25e0>] do_trap+0x69/0x135
[  143.045861]  [<ffffffff813a4ec8>] ? __atomic_notifier_call_chain+0xd/0xf
[  143.045865]  [<ffffffff81002ae1>] do_invalid_op+0x93/0x9c
[  143.045889]  [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[  143.045893]  [<ffffffff8139b740>] ? printk+0x48/0x4a
[  143.045896]  [<ffffffff813a8608>] invalid_op+0x18/0x20
[  143.045919]  [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[  143.045940]  [<ffffffffa03e72f3>] ? assfail+0x1d/0x1f [xfs]
[  143.045973]  [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
[  143.046003]  [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
[  143.046023]  [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
[  143.046027]  [<ffffffff8104f21a>] process_one_work+0x18b/0x297
[  143.046030]  [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
[  143.046033]  [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
[  143.046036]  [<ffffffff81054507>] kthread+0x88/0x90
[  143.046039]  [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[  143.046042]  [<ffffffff813a707c>] ret_from_fork+0x7c/0xb0
[  143.046045]  [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60
[  143.046066] Code: 65 48 8b 04 25 80 b9 00 00 48 8b 80 f0 02 00 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 48 8b 87 f0 02 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 55 ba 08 00 00 00 48 89 e5 48 83 ec 10 48 8b 
[  143.046069] RIP  [<ffffffff810548fd>] kthread_data+0xb/0x11
[  143.046069]  RSP <ffff88002ff67a38>
[  143.046070] CR2: ffffffffffffffd8
[  143.046071] ---[ end trace e4f1a3c306353179 ]---
[  143.046072] Fixing recursive fault but reboot is needed!
--------------------------------------------



# Design in failure
> Yes. EIO means an IO error has occurred. That causes failure paths
> to be triggered in the upper layers.
> 
> I really don't understand what you are trying to achieve with this
> "blockup" thing. If something goes wrong with the device, then you
> *cannot recover* by sending EIO to any new IOs and then continuing
> on at a later time as though nothing has happened. The moment a
> filesystem gets EIO from a metadata write, it is likely to be
> corrupted and if you continue onwards after that you simply
> propagate the corruption.
> 
>> As Mikulas said, connection failure is often the cause of
>> I/O errors from the underlying devices.
> 
> Connection failure is *rarely* the cause of IO errors, except in
> environments where SANs are in use. Even then multipathing makes
> fatal connection failure a rare occurrence. Broken hardware is a
> much more common cause of problems at the storage layers.
Yes. Connection errors happen in SANs due to loose cabling,
and multipathing may resolve that problem;
hardware problems can likewise be mitigated by RAID.

My idea:
- If something problematic happens in the underlying devices,
  the writeboost device returns -EIO on any request and
  stops all the daemons.

- I will not remove `blockup`.
  Yes, if the problem is in hardware it is very difficult
  to recover.
  However, leaving a chance to recover the device is
  at least better than simply shutting it down,
  as long as it doesn't pollute the code too much.
  So I will keep this option.

- As Mikulas pointed out, the current design of asking the admin
  to turn off the `blockup` variable once the system has been checked
  is not reliable,
  since the admin may not be able to log in to the system due
  to memory starvation.
  A more sophisticated way to recover the device would be
  something like autonomously checking whether the underlying
  devices have come back. But I think this is
  too much, for the reasons listed below:
  - redundancy techniques are already sophisticated, and -EIO being returned
    means something so serious that it definitely *can not* be recovered from.
  - The example where the admin can't log in to the system is extreme.
    The admin should be logged in beforehand to begin with.
BTW, what do you mean by the word "fatal"?
> Simple rule: Don't complete IOs with EIO if you haven't had a fatal
> IO error.

There are two choices for writeboost on I/O failure:
1) Return -EIO.
2) Simply block all the incoming processes.
writeboost doesn't get corrupted under either of these two
ways of handling the error. Maybe this is because of the
nature of log-structured caching.
My question is: "which behavior of the
underlying device is better for the upper-layer software?"
A pro of 2) is that the upper layer may continue its operation
without restarting from a shutdown.
A con of 2) is that blocking the incoming processes means all the
write barrier requests that were sent are not completed for a long time,
and the upper layer can't tell whether the device is dead or not.
I guess you would like to simply propagate the error to the parent.
But I want to find the better way for the upper layer,
particularly for filesystems.

# Misc
> Ugh. You should be using workqueue with timed work for this.
How would that look? I don't see the implementation.
All I want to do is repeat a task at every interval,
and as far as I know the current code is the simplest way.
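
For reference, the pattern Dave is most likely suggesting is a self-rearming delayed_work, roughly like this (a kernel-side sketch, not compilable standalone; the daemon name is illustrative):

```c
static struct delayed_work modulator_work;

static void modulator_fn(struct work_struct *ws)
{
	/* ... the periodic work the daemon currently does in its loop ... */

	/* re-arm for the next interval instead of sleeping in a kthread */
	schedule_delayed_work(&modulator_work, msecs_to_jiffies(1000));
}

/* at construction time */
INIT_DELAYED_WORK(&modulator_work, modulator_fn);
schedule_delayed_work(&modulator_work, msecs_to_jiffies(1000));

/* at destruction time */
cancel_delayed_work_sync(&modulator_work);
```

Queuing on `system_freezable_wq` with `queue_delayed_work()` instead of the default workqueue would also park the work across suspend-to-disk/ram, which touches on the next point.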

> BTW, you're missing the handling needed by these kernel threads
> for suspend-to-disk/ram....
I haven't implemented suspend/resume yet.
Maybe I should just stop all the daemons, and no
further code would be needed for the kernel threads.

Akira

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Dave Chinner Oct. 21, 2013, 1:31 a.m. UTC | #9
On Sat, Oct 19, 2013 at 07:59:55PM +0900, Akira Hayakawa wrote:
> Dave,
> 
> # -EIO returned corrupts XFS
> I turned up
> lockdep, frame pointer, xfs debug
> and also changed to 3.12.0-rc5 from rc1.
> 
> What's changed is that
> the problem we discussed in previous mails
> *never* reproduce.
> 
> However, if I turn off the lockdep only
> it hangs up by setting blockup to 1 and then to 0 after that.

Which indicates that the corruption is likely to be caused by a race
condition, and that adding lockdep slows things down enough to
change the timing and hence not hit the race condition...

e.g. the use after free that causes the memory corruption now occurs
after the critical point where XFS can get stuck on it.

> The underlying device once became dead revives
> may confuse the filesystem but
> this looks not related to locking things.
> This problem may be producible using dm-flakey.

I use dm-flakey quite a bit, and I haven't seen this....

> This behavior is not reproducible with ext4.
> 
> --------------------------------------------
> root@Lyle:/home/akira# virsh console Hercules
> Connected to domain Hercules
> Escape character is ^]
> [  143.042980] device-mapper: writeboost: info@modulator_proc() reactivated after blockup
> [  143.042999] device-mapper: writeboost: info@dm_safe_io_internal() reactivated after blockup
> [  143.043006] device-mapper: writeboost: info@migrate_proc() reactivated after blockup
> [  143.045328] XFS: metadata I/O error: block 0x300d0f ("xlog_iodone") error 5 numblks 64
> [  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa0430f9c
> [  143.045335] XFS: Log I/O Error Detected.  Shutting down filesystem
> [  143.045335] XFS: Please umount the filesystem and rectify the problem(s)
> [  143.045338] XFS: Assertion failed: atomic_read(&iclog->ic_refcnt) == 0, file: fs/xfs/xfs_log.c, line: 2751
.....
> [  143.045465] Workqueue: xfslogd xfs_buf_iodone_work [xfs]
.....
> [  143.045558]  [<ffffffffa0430ed1>] xlog_state_done_syncing+0x6d/0xe4 [xfs]
> [  143.045590]  [<ffffffffa0431018>] xlog_iodone+0xd0/0xd9 [xfs]
> [  143.045609]  [<ffffffffa03d834f>] xfs_buf_iodone_work+0x57/0xa2 [xfs]
> [  143.045627]  [<ffffffff8104f21a>] process_one_work+0x18b/0x297
> [  143.045631]  [<ffffffff8104f6e6>] worker_thread+0x12e/0x1fb
> [  143.045634]  [<ffffffff8104f5b8>] ? rescuer_thread+0x268/0x268
> [  143.045638]  [<ffffffff81054507>] kthread+0x88/0x90
> [  143.045641]  [<ffffffff8105447f>] ? __kthread_parkme+0x60/0x60

So that has got through xlog_iodone() and has found that the
reference count of the iclog was not zero when it went to run the
log IO completion callbacks.

We asserted that the reference count was zero before issuing the IO,
and we only ever increment the reference count when the iclog is in
an active state. We cannot increment the reference count while the
log is under IO because the state of the iclog is "syncing", not
"active".

Once the log is shut down, the iclog state goes to
XLOG_STATE_IOERROR and never goes out of that state. The assert just
prior to the one that tripped above indicates that we are either in
an ACTIVE state or IOERROR:

        ASSERT(iclog->ic_state == XLOG_STATE_SYNCING ||
               iclog->ic_state == XLOG_STATE_IOERROR);
>>>>>>  ASSERT(atomic_read(&iclog->ic_refcnt) == 0);

It was the second assert that failed, and hence it's clear that
in either state the reference count cannot be incremented until the
IO is fully completed and it has transitioned back to the active state.

> [  143.045333] XFS: xfs_do_force_shutdown(0x2) called from line 1161 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa0430f9c

This indicates that the shutdown was called from xlog_iodone() as a
result of an IO error on the iclog buffer, and the call to
xlog_state_done_syncing() happens 5 lines of code later, which
immediately assert fails with a corrupt iclog state.

Because we shutdown with SHUTDOWN_LOG_IO_ERROR set, we call into
xfs_log_force_umount() with logerror = true. This first call ends up
setting the log flag XLOG_IO_ERROR, then calling
xlog_state_ioerror() which sets ic_state on all iclogs to
XLOG_STATE_IOERROR.

The shutdown then runs xlog_state_do_callback() which aborts the
completions on all the iclogs that have callbacks attached, and
wakes anyone waiting on log space or log forces so they can get
errors returned to them.

We now return to xlog_iodone() in a shutdown state, with all the
iclog buffers having ic_state = XLOG_STATE_IOERROR. So, there is
absolutely no opportunity for anyone to take a new reference to the
iclog in the 10 *microseconds* between the IO error being detected,
the shutdown being processed and the call to
xlog_state_done_syncing() where the assert fails. At this point, I
cannot see how this can occur in the XFS code.

Indeed, I can trigger this error path easily under heavy load just
using the godown utility in xfstests:

[1044535.986040] XFS (vdc): Mounting Filesystem
[1044536.040630] XFS (vdc): Ending clean mount
[1044536.059299] XFS: Clearing xfsstats
# src/xfstests-dev/src/godown -v /mnt/scratch
Opening "/mnt/scratch"
Calling XFS_IOC_GOINGDOWN
[1044553.342767] XFS (vdc): metadata I/O error: block 0x19000c2e98 ("xlog_iodone") error 5 numblks 512
[1044553.345993] XFS (vdc): xfs_do_force_shutdown(0x2) called from line 1171 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff814e61ad
#[1044554.136965] XFS (vdc): xfs_log_force: error 5 returned.
[1044554.154892] XFS: Clearing xfsstats
[1044566.108484] XFS (vdc): xfs_log_force: error 5 returned.
[1044596.188515] XFS (vdc): xfs_log_force: error 5 returned.
[1044626.268166] XFS (vdc): xfs_log_force: error 5 returned.
[1044656.348146] XFS (vdc): xfs_log_force: error 5 returned.

IOWs, this looks like something is corrupting the state of the log
*before* the xlog_iodone() callback is being run. i.e. it is in a
valid state before IO dispatch and it's in a corrupt state on IO
completion despite XFS having *never touched that state* during the
IO.

Indeed, this is a different issue to the one you posted
earlier, which was the AIL lock (which is in a different XFS
structure) had not been released. Again, I couldn't see how that
could occur, and we're now seeing a pattern of random structures
being corrupted. Hmmm, looking at pahole:

atomic_t                   ic_refcnt;            /*   192     4 */

spinlock_t                 xa_lock;              /*    64     2 */

Both the items that were corrupted are on the first word of a
cacheline. That further points to memory corruption of some kind...

IOWs, the more I look, the more this looks like memory corruption is
the cause of the problems. The only unusual thing that is happening
is an error handling path in a brand new block device driver is
being run immediately before the memory corruption causes problems,
and that tends to indicate that this corruption is not caused by
XFS...

> My Idea:
> - If something problematic happens in underlying devices
>   writeboost device returns -EIO on any requests and
>   stop all the daemons
....
> - I will not remove `blockup`.
>   Yes. If the problem is in hardware it is very difficult
>   to recover.
>   However, leaving a chance for recovering the device is
>   at least better than just shutdown the device
>   if it doesn't pollute the code so much.
>   So, I will leave this option.

It doesn't matter whether the underlying hardware might be able to
recover - once you've sent EIOs during IO completion, that
data/metadata is considered *unrecoverable* and so the filesystem
will end up inconsistent or the *user will lose data* as a result.

IOWs, your idea is flawed, will not work and will result in
data/filesystem loss when used. You cannot call this a "recovery
mechanism" because it simply isn't.

> BTW, what do you mean by the word "fatal"?

"He suffered a fatal gunshot wound to the head."

i.e. the person is *dead* and life is *unrecoverable*.

That's what you are doing here - returning EIOs to IOs in progress
is effectively shooting the filesystem (and user data) in the
head.....

Cheers,

Dave.
Prakash Prasad Oct. 21, 2013, 2 a.m. UTC | #10
Can anyone please let me know where I can get the staging-area driver code and its test cases?

I understand that I need to download the
staging-area device driver code. In the source tree there would be a TODO file. Accordingly, we have to fix the issues and send the fixes,
identifying the owner as well.

But how are we supposed to test this - where are the test cases, and what is the process to execute them?

Regards,
Prakash




Prakash Prasad Oct. 21, 2013, 2:03 a.m. UTC | #11
Please let me know where we can get the staging-area driver code and its test suite.
Regards,
Prakash




diff mbox

Patch

diff --git a/Driver/dm-writeboost-daemon.c b/Driver/dm-writeboost-daemon.c
index cb8febe..7a7f353 100644
--- a/Driver/dm-writeboost-daemon.c
+++ b/Driver/dm-writeboost-daemon.c
@@ -12,9 +12,11 @@ 
 
 int flush_proc(void *data)
 {
+	int r;
 	unsigned long flags;
 
 	struct wb_cache *cache = data;
+	struct wb_device *wb = cache->wb;
 
 	while (true) {
 		struct flush_job *job;
@@ -22,13 +24,13 @@  int flush_proc(void *data)
 		struct dm_io_request io_req;
 		struct dm_io_region region;
 
+		wait_on_blockup();
+
 		spin_lock_irqsave(&cache->flush_queue_lock, flags);
 		while (list_empty(&cache->flush_queue)) {
 			spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
-			wait_event_interruptible_timeout(
-				cache->flush_wait_queue,
-				(!list_empty(&cache->flush_queue)),
-				msecs_to_jiffies(100));
+
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));
 
 			/*
 			 * flush daemon can exit
@@ -49,6 +51,8 @@  int flush_proc(void *data)
 		list_del(&job->flush_queue);
 		spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
 
+		smp_rmb();
+
 		seg = job->seg;
 
 		io_req = (struct dm_io_request) {
@@ -65,9 +69,9 @@  int flush_proc(void *data)
 			.count = (seg->length + 1) << 3,
 		};
 
-		dm_safe_io_retry(&io_req, 1, &region, false);
+		RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));
 
-		cache->last_flushed_segment_id = seg->global_id;
+		atomic64_set(&cache->last_flushed_segment_id, seg->global_id);
 
 		complete_all(&seg->flush_done);
 
@@ -78,15 +82,15 @@  int flush_proc(void *data)
 		 */
 		if (!bio_list_empty(&job->barrier_ios)) {
 			struct bio *bio;
-			blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+			RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));
 			while ((bio = bio_list_pop(&job->barrier_ios)))
 				bio_endio(bio, 0);
 
 			mod_timer(&cache->barrier_deadline_timer,
-				  msecs_to_jiffies(cache->barrier_deadline_ms));
+				  msecs_to_jiffies(ACCESS_ONCE(cache->barrier_deadline_ms)));
 		}
 
-		kfree(job);
+		mempool_free(job, cache->flush_job_pool);
 	}
 	return 0;
 }
@@ -101,7 +105,7 @@  void queue_barrier_io(struct wb_cache *cache, struct bio *bio)
 
 	if (!timer_pending(&cache->barrier_deadline_timer))
 		mod_timer(&cache->barrier_deadline_timer,
-			  msecs_to_jiffies(cache->barrier_deadline_ms));
+			  msecs_to_jiffies(ACCESS_ONCE(cache->barrier_deadline_ms)));
 }
 
 void barrier_deadline_proc(unsigned long data)
@@ -147,6 +151,7 @@  static void migrate_endio(unsigned long error, void *context)
 static void submit_migrate_io(struct wb_cache *cache,
 			      struct segment_header *seg, size_t k)
 {
+	int r;
 	u8 i, j;
 	size_t a = cache->nr_caches_inseg * k;
 	void *p = cache->migrate_buffer + (cache->nr_caches_inseg << 12) * k;
@@ -184,7 +189,7 @@  static void submit_migrate_io(struct wb_cache *cache,
 				.sector = mb->sector,
 				.count = (1 << 3),
 			};
-			dm_safe_io_retry(&io_req_w, 1, &region_w, false);
+			RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
 		} else {
 			for (j = 0; j < 8; j++) {
 				bool b = dirty_bits & (1 << j);
@@ -205,8 +210,7 @@  static void submit_migrate_io(struct wb_cache *cache,
 					.sector = mb->sector + j,
 					.count = 1,
 				};
-				dm_safe_io_retry(
-					&io_req_w, 1, &region_w, false);
+				RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, false));
 			}
 		}
 	}
@@ -216,7 +220,9 @@  static void memorize_dirty_state(struct wb_cache *cache,
 				 struct segment_header *seg, size_t k,
 				 size_t *migrate_io_count)
 {
+	int r;
 	u8 i, j;
+	struct wb_device *wb = cache->wb;
 	size_t a = cache->nr_caches_inseg * k;
 	void *p = cache->migrate_buffer + (cache->nr_caches_inseg << 12) * k;
 	struct metablock *mb;
@@ -233,7 +239,7 @@  static void memorize_dirty_state(struct wb_cache *cache,
 		.sector = seg->start_sector + (1 << 3),
 		.count = seg->length << 3,
 	};
-	dm_safe_io_retry(&io_req_r, 1, &region_r, false);
+	RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, false));
 
 	/*
 	 * We take snapshot of the dirtiness in the segments.
@@ -281,6 +287,8 @@  static void cleanup_segment(struct wb_cache *cache, struct segment_header *seg)
 
 static void migrate_linked_segments(struct wb_cache *cache)
 {
+	struct wb_device *wb = cache->wb;
+	int r;
 	struct segment_header *seg;
 	size_t k, migrate_io_count = 0;
 
@@ -336,7 +344,7 @@  migrate_write:
 	 * on this issue by always
 	 * migrating those data persistently.
 	 */
-	blkdev_issue_flush(cache->wb->device->bdev, GFP_NOIO, NULL);
+	RETRY(blkdev_issue_flush(cache->wb->device->bdev, GFP_NOIO, NULL));
 
 	/*
 	 * Discarding the migrated regions
@@ -351,43 +359,46 @@  migrate_write:
 	 * will craze the cache.
 	 */
 	list_for_each_entry(seg, &cache->migrate_list, migrate_list) {
-		blkdev_issue_discard(cache->device->bdev,
-				     seg->start_sector + (1 << 3),
-				     seg->length << 3,
-				     GFP_NOIO, 0);
+		RETRY(blkdev_issue_discard(cache->device->bdev,
+					   seg->start_sector + (1 << 3),
+					   seg->length << 3,
+					   GFP_NOIO, 0));
 	}
 }
 
 int migrate_proc(void *data)
 {
 	struct wb_cache *cache = data;
+	struct wb_device *wb = cache->wb;
 
 	while (!kthread_should_stop()) {
 		bool allow_migrate;
-		size_t i, nr_mig_candidates, nr_mig, nr_max_batch;
+		u32 i, nr_mig_candidates, nr_mig, nr_max_batch;
 		struct segment_header *seg, *tmp;
 
+		wait_on_blockup();
+
 		/*
 		 * If urge_migrate is true
 		 * Migration should be immediate.
 		 */
-		allow_migrate = cache->urge_migrate ||
-				cache->allow_migrate;
+		allow_migrate = ACCESS_ONCE(cache->urge_migrate) ||
+				ACCESS_ONCE(cache->allow_migrate);
 
 		if (!allow_migrate) {
 			schedule_timeout_interruptible(msecs_to_jiffies(1000));
 			continue;
 		}
 
-		nr_mig_candidates = cache->last_flushed_segment_id -
-				    cache->last_migrated_segment_id;
+		nr_mig_candidates = atomic64_read(&cache->last_flushed_segment_id) -
+				    atomic64_read(&cache->last_migrated_segment_id);
 
 		if (!nr_mig_candidates) {
 			schedule_timeout_interruptible(msecs_to_jiffies(1000));
 			continue;
 		}
 
-		nr_max_batch = cache->nr_max_batched_migration;
+		nr_max_batch = ACCESS_ONCE(cache->nr_max_batched_migration);
 		if (cache->nr_cur_batched_migration != nr_max_batch) {
 			/*
 			 * Request buffer for nr_max_batch size.
@@ -411,10 +422,17 @@  int migrate_proc(void *data)
 		for (i = 1; i <= nr_mig; i++) {
 			seg = get_segment_header_by_id(
 					cache,
-					cache->last_migrated_segment_id + i);
+					atomic64_read(&cache->last_migrated_segment_id) + i);
 			list_add_tail(&seg->migrate_list, &cache->migrate_list);
 		}
 
+		/*
+		 * We insert a write barrier here
+		 * to make sure that the migrate list
+		 * is complete.
+		 */
+		smp_wmb();
+
 		migrate_linked_segments(cache);
 
 		/*
@@ -422,7 +440,7 @@  int migrate_proc(void *data)
 		 * Only line of code changes
 		 * last_migrate_segment_id during runtime.
 		 */
-		cache->last_migrated_segment_id += nr_mig;
+		atomic64_add(nr_mig, &cache->last_migrated_segment_id);
 
 		list_for_each_entry_safe(seg, tmp,
 					 &cache->migrate_list,
@@ -449,6 +467,7 @@  void wait_for_migration(struct wb_cache *cache, u64 id)
 	 * immediately.
 	 */
 	cache->urge_migrate = true;
+	wake_up_process(cache->migrate_daemon);
 	wait_for_completion(&seg->migrate_done);
 	cache->urge_migrate = false;
 }
@@ -466,14 +485,16 @@  int modulator_proc(void *data)
 
 	while (!kthread_should_stop()) {
 
+		wait_on_blockup();
+
 		new = jiffies_to_msecs(part_stat_read(hd, io_ticks));
 
-		if (!cache->enable_migration_modulator)
+		if (!ACCESS_ONCE(cache->enable_migration_modulator))
 			goto modulator_update;
 
-		util = (100 * (new - old)) / 1000;
+		util = div_u64(100 * (new - old), 1000);
 
-		if (util < wb->migrate_threshold)
+		if (util < ACCESS_ONCE(wb->migrate_threshold))
 			cache->allow_migrate = true;
 		else
 			cache->allow_migrate = false;
@@ -490,15 +511,17 @@  modulator_update:
 
 static void update_superblock_record(struct wb_cache *cache)
 {
+	int r;
+	struct wb_device *wb = cache->wb;
 	struct superblock_record_device o;
 	void *buf;
 	struct dm_io_request io_req;
 	struct dm_io_region region;
 
 	o.last_migrated_segment_id =
-		cpu_to_le64(cache->last_migrated_segment_id);
+		cpu_to_le64(atomic64_read(&cache->last_migrated_segment_id));
 
-	buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO | __GFP_ZERO);
+	buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO | __GFP_ZERO);
 	memcpy(buf, &o, sizeof(o));
 
 	io_req = (struct dm_io_request) {
@@ -513,18 +536,22 @@  static void update_superblock_record(struct wb_cache *cache)
 		.sector = (1 << 11) - 1,
 		.count = 1,
 	};
-	dm_safe_io_retry(&io_req, 1, &region, true);
-	kfree(buf);
+	RETRY(dm_safe_io(&io_req, 1, &region, NULL, false));
+	mempool_free(buf, cache->buf_1_pool);
 }
 
 int recorder_proc(void *data)
 {
 	struct wb_cache *cache = data;
+	struct wb_device *wb = cache->wb;
 	unsigned long intvl;
 
 	while (!kthread_should_stop()) {
+
+		wait_on_blockup();
+
 		/* sec -> ms */
-		intvl = cache->update_record_interval * 1000;
+		intvl = ACCESS_ONCE(cache->update_record_interval) * 1000;
 
 		if (!intvl) {
 			schedule_timeout_interruptible(msecs_to_jiffies(1000));
@@ -542,12 +569,17 @@  int recorder_proc(void *data)
 
 int sync_proc(void *data)
 {
+	int r;
 	struct wb_cache *cache = data;
+	struct wb_device *wb = cache->wb;
 	unsigned long intvl;
 
 	while (!kthread_should_stop()) {
+
+		wait_on_blockup();
+
 		/* sec -> ms */
-		intvl = cache->sync_interval * 1000;
+		intvl = ACCESS_ONCE(cache->sync_interval) * 1000;
 
 		if (!intvl) {
 			schedule_timeout_interruptible(msecs_to_jiffies(1000));
@@ -555,7 +587,8 @@  int sync_proc(void *data)
 		}
 
 		flush_current_buffer(cache);
-		blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL);
+
+		RETRY(blkdev_issue_flush(cache->device->bdev, GFP_NOIO, NULL));
 
 		schedule_timeout_interruptible(msecs_to_jiffies(intvl));
 	}
diff --git a/Driver/dm-writeboost-metadata.c b/Driver/dm-writeboost-metadata.c
index 77ffb28..a6bd584 100644
--- a/Driver/dm-writeboost-metadata.c
+++ b/Driver/dm-writeboost-metadata.c
@@ -16,29 +16,31 @@  struct part {
 
 struct bigarray {
 	struct part *parts;
-	size_t nr_elems;
-	size_t elemsize;
+	u64 nr_elems;
+	u32 elemsize;
 };
 
 #define ALLOC_SIZE (1 << 16)
-static size_t nr_elems_in_part(struct bigarray *arr)
+static u32 nr_elems_in_part(struct bigarray *arr)
 {
-	return ALLOC_SIZE / arr->elemsize;
+	return div_u64(ALLOC_SIZE, arr->elemsize);
 };
 
-static size_t nr_parts(struct bigarray *arr)
+static u64 nr_parts(struct bigarray *arr)
 {
-	return dm_div_up(arr->nr_elems, nr_elems_in_part(arr));
+	u64 a = arr->nr_elems;
+	u32 b = nr_elems_in_part(arr);
+	return div_u64(a + b - 1, b);
 }
 
-static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
+static struct bigarray *make_bigarray(u32 elemsize, u64 nr_elems)
 {
-	size_t i, j;
+	u64 i, j;
 	struct part *part;
 
 	struct bigarray *arr = kmalloc(sizeof(*arr), GFP_KERNEL);
 	if (!arr) {
-		WBERR();
+		WBERR("failed to alloc arr");
 		return NULL;
 	}
 
@@ -46,7 +48,7 @@  static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
 	arr->nr_elems = nr_elems;
 	arr->parts = kmalloc(sizeof(struct part) * nr_parts(arr), GFP_KERNEL);
 	if (!arr->parts) {
-		WBERR();
+		WBERR("failed to alloc parts");
 		goto bad_alloc_parts;
 	}
 
@@ -54,7 +56,7 @@  static struct bigarray *make_bigarray(size_t elemsize, size_t nr_elems)
 		part = arr->parts + i;
 		part->memory = kmalloc(ALLOC_SIZE, GFP_KERNEL);
 		if (!part->memory) {
-			WBERR();
+			WBERR("failed to alloc part memory");
 			for (j = 0; j < i; j++) {
 				part = arr->parts + j;
 				kfree(part->memory);
@@ -82,11 +84,11 @@  static void kill_bigarray(struct bigarray *arr)
 	kfree(arr);
 }
 
-static void *bigarray_at(struct bigarray *arr, size_t i)
+static void *bigarray_at(struct bigarray *arr, u64 i)
 {
-	size_t n = nr_elems_in_part(arr);
-	size_t j = i / n;
-	size_t k = i % n;
+	u32 n = nr_elems_in_part(arr);
+	u32 k;
+	u64 j = div_u64_rem(i, n, &k);
 	struct part *part = arr->parts + j;
 	return part->memory + (arr->elemsize * k);
 }
@@ -104,18 +106,18 @@  static void *bigarray_at(struct bigarray *arr, size_t i)
 /*
  * Get the in-core metablock of the given index.
  */
-static struct metablock *mb_at(struct wb_cache *cache, cache_nr idx)
+static struct metablock *mb_at(struct wb_cache *cache, u32 idx)
 {
-	u64 seg_idx = idx / cache->nr_caches_inseg;
+	u32 idx_inseg;
+	u32 seg_idx = div_u64_rem(idx, cache->nr_caches_inseg, &idx_inseg);
 	struct segment_header *seg =
 		bigarray_at(cache->segment_header_array, seg_idx);
-	cache_nr idx_inseg = idx % cache->nr_caches_inseg;
 	return seg->mb_array + idx_inseg;
 }
 
 static void mb_array_empty_init(struct wb_cache *cache)
 {
-	size_t i;
+	u32 i;
 	for (i = 0; i < cache->nr_caches; i++) {
 		struct metablock *mb = mb_at(cache, i);
 		INIT_HLIST_NODE(&mb->ht_list);
@@ -126,34 +128,35 @@  static void mb_array_empty_init(struct wb_cache *cache)
 }
 
 static sector_t calc_segment_header_start(struct wb_cache *cache,
-					  u64 segment_idx)
+					  u32 segment_idx)
 {
 	return (1 << 11) + (1 << cache->segment_size_order) * (segment_idx);
 }
 
 static u32 calc_segment_lap(struct wb_cache *cache, u64 segment_id)
 {
-	u32 a = (segment_id - 1) / cache->nr_segments;
+	u64 a = div_u64(segment_id - 1, cache->nr_segments);
 	return a + 1;
 };
 
-static u64 calc_nr_segments(struct dm_dev *dev, struct wb_cache *cache)
+static u32 calc_nr_segments(struct dm_dev *dev, struct wb_cache *cache)
 {
 	sector_t devsize = dm_devsize(dev);
-	return (devsize - (1 << 11)) / (1 << cache->segment_size_order);
+	return div_u64(devsize - (1 << 11), 1 << cache->segment_size_order);
 }
 
 sector_t calc_mb_start_sector(struct wb_cache *cache,
 			      struct segment_header *seg,
-			      cache_nr mb_idx)
+			      u32 mb_idx)
 {
-	size_t k = 1 + (mb_idx % cache->nr_caches_inseg);
-	return seg->start_sector + (k << 3);
+	u32 idx;
+	div_u64_rem(mb_idx, cache->nr_caches_inseg, &idx);
+	return seg->start_sector + ((1 + idx) << 3);
 }
 
-bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
+bool is_on_buffer(struct wb_cache *cache, u32 mb_idx)
 {
-	cache_nr start = cache->current_seg->start_idx;
+	u32 start = cache->current_seg->start_idx;
 	if (mb_idx < start)
 		return false;
 
@@ -170,15 +173,14 @@  bool is_on_buffer(struct wb_cache *cache, cache_nr mb_idx)
 struct segment_header *get_segment_header_by_id(struct wb_cache *cache,
 						u64 segment_id)
 {
-	struct segment_header *r =
-		bigarray_at(cache->segment_header_array,
-		       (segment_id - 1) % cache->nr_segments);
-	return r;
+	u32 idx;
+	div_u64_rem(segment_id - 1, cache->nr_segments, &idx);
+	return bigarray_at(cache->segment_header_array, idx);
 }
 
 static int __must_check init_segment_header_array(struct wb_cache *cache)
 {
-	u64 segment_idx, nr_segments = cache->nr_segments;
+	u32 segment_idx, nr_segments = cache->nr_segments;
 	cache->segment_header_array =
 		make_bigarray(sizeof_segment_header(cache), nr_segments);
 	if (!cache->segment_header_array) {
@@ -225,9 +227,8 @@  static void free_segment_header_array(struct wb_cache *cache)
  */
 static int __must_check ht_empty_init(struct wb_cache *cache)
 {
-	cache_nr idx;
-	size_t i;
-	size_t nr_heads;
+	u32 idx;
+	size_t i, nr_heads;
 	struct bigarray *arr;
 
 	cache->htsize = cache->nr_caches;
@@ -266,7 +267,9 @@  static void free_ht(struct wb_cache *cache)
 
 struct ht_head *ht_get_head(struct wb_cache *cache, struct lookup_key *key)
 {
-	return bigarray_at(cache->htable, key->sector % cache->htsize);
+	u32 idx;
+	div_u64_rem(key->sector, cache->htsize, &idx);
+	return bigarray_at(cache->htable, idx);
 }
 
 static bool mb_hit(struct metablock *mb, struct lookup_key *key)
@@ -328,12 +331,14 @@  void discard_caches_inseg(struct wb_cache *cache, struct segment_header *seg)
 
 /*----------------------------------------------------------------*/
 
-static int read_superblock_header(struct superblock_header_device *sup,
+static int read_superblock_header(struct wb_cache *cache,
+				  struct superblock_header_device *sup,
 				  struct dm_dev *dev)
 {
 	int r = 0;
 	struct dm_io_request io_req_sup;
 	struct dm_io_region region_sup;
+	struct wb_device *wb = cache->wb;
 
 	void *buf = kmalloc(1 << SECTOR_SHIFT, GFP_KERNEL);
 	if (!buf) {
@@ -358,7 +363,7 @@  static int read_superblock_header(struct superblock_header_device *sup,
 	kfree(buf);
 
 	if (r) {
-		WBERR();
+		WBERR("io failed in reading superblock header");
 		return r;
 	}
 
@@ -378,7 +383,7 @@  int __must_check audit_cache_device(struct dm_dev *dev, struct wb_cache *cache,
 {
 	int r = 0;
 	struct superblock_header_device sup;
-	r = read_superblock_header(&sup, dev);
+	r = read_superblock_header(cache, &sup, dev);
 	if (r) {
 		WBERR("read superblock header failed");
 		return r;
@@ -407,6 +412,7 @@  int __must_check audit_cache_device(struct dm_dev *dev, struct wb_cache *cache,
 static int format_superblock_header(struct dm_dev *dev, struct wb_cache *cache)
 {
 	int r = 0;
+	struct wb_device *wb = cache->wb;
 	struct dm_io_request io_req_sup;
 	struct dm_io_region region_sup;
 
@@ -465,7 +471,8 @@  static void format_segmd_endio(unsigned long error, void *__context)
  */
 int __must_check format_cache_device(struct dm_dev *dev, struct wb_cache *cache)
 {
-	u64 i, nr_segments = calc_nr_segments(dev, cache);
+	u32 i, nr_segments = calc_nr_segments(dev, cache);
+	struct wb_device *wb = cache->wb;
 	struct format_segmd_context context;
 	struct dm_io_request io_req_sup;
 	struct dm_io_region region_sup;
@@ -569,6 +576,7 @@  read_superblock_record(struct superblock_record_device *record,
 		       struct wb_cache *cache)
 {
 	int r = 0;
+	struct wb_device *wb = cache->wb;
 	struct dm_io_request io_req;
 	struct dm_io_region region;
 
@@ -590,7 +598,7 @@  read_superblock_record(struct superblock_record_device *record,
 		.sector = (1 << 11) - 1,
 		.count = 1,
 	};
-	r = dm_safe_io(&io_req, 1, &region, NULL, true);
+	r = dm_safe_io(&io_req, 1, &region, NULL, false);
 
 	kfree(buf);
 
@@ -606,9 +614,10 @@  read_superblock_record(struct superblock_record_device *record,
 
 static int __must_check
 read_segment_header_device(struct segment_header_device *dest,
-			   struct wb_cache *cache, size_t segment_idx)
+			   struct wb_cache *cache, u32 segment_idx)
 {
 	int r = 0;
+	struct wb_device *wb = cache->wb;
 	struct dm_io_request io_req;
 	struct dm_io_region region;
 	void *buf = kmalloc(1 << 12, GFP_KERNEL);
@@ -651,15 +660,16 @@  void prepare_segment_header_device(struct segment_header_device *dest,
 				   struct wb_cache *cache,
 				   struct segment_header *src)
 {
-	cache_nr i;
 	u8 left, right;
+	u32 i, tmp32;
 
 	dest->global_id = cpu_to_le64(src->global_id);
 	dest->length = src->length;
 	dest->lap = cpu_to_le32(calc_segment_lap(cache, src->global_id));
 
 	left = src->length - 1;
-	right = (cache->cursor) % cache->nr_caches_inseg;
+	div_u64_rem(cache->cursor, cache->nr_caches_inseg, &tmp32);
+	right = tmp32;
 	BUG_ON(left != right);
 
 	for (i = 0; i < src->length; i++) {
@@ -679,7 +689,7 @@  void prepare_segment_header_device(struct segment_header_device *dest,
 static void update_by_segment_header_device(struct wb_cache *cache,
 					    struct segment_header_device *src)
 {
-	cache_nr i;
+	u32 i;
 	struct segment_header *seg =
 		get_segment_header_by_id(cache, src->global_id);
 	seg->length = src->length;
@@ -739,10 +749,9 @@  static int __must_check recover_cache(struct wb_cache *cache)
 	int r = 0;
 	struct segment_header_device *header;
 	struct segment_header *seg;
-	u64 i, j,
-	    max_id, oldest_id, last_flushed_id, init_segment_id,
-	    oldest_idx, nr_segments = cache->nr_segments,
+	u64 max_id, oldest_id, last_flushed_id, init_segment_id,
 	    header_id, record_id;
+	u32 i, j, oldest_idx, nr_segments = cache->nr_segments;
 
 	struct superblock_record_device uninitialized_var(record);
 	r = read_superblock_record(&record, cache);
@@ -815,7 +824,7 @@  static int __must_check recover_cache(struct wb_cache *cache)
 	 *      last_flushed  init_seg  migrated  last_migrated  flushed
 	 */
 	for (i = oldest_idx; i < (nr_segments + oldest_idx); i++) {
-		j = i % nr_segments;
+		div_u64_rem(i, nr_segments, &j);
 		r = read_segment_header_device(header, cache, j);
 		if (r) {
 			WBERR();
@@ -871,14 +880,15 @@  setup_init_segment:
 	seg->global_id = init_segment_id;
 	atomic_set(&seg->nr_inflight_ios, 0);
 
-	cache->last_flushed_segment_id = seg->global_id - 1;
+	atomic64_set(&cache->last_flushed_segment_id,
+		     seg->global_id - 1);
 
-	cache->last_migrated_segment_id =
-		cache->last_flushed_segment_id > cache->nr_segments ?
-		cache->last_flushed_segment_id - cache->nr_segments : 0;
+	atomic64_set(&cache->last_migrated_segment_id,
+		atomic64_read(&cache->last_flushed_segment_id) > cache->nr_segments ?
+		atomic64_read(&cache->last_flushed_segment_id) - cache->nr_segments : 0);
 
-	if (record_id > cache->last_migrated_segment_id)
-		cache->last_migrated_segment_id = record_id;
+	if (record_id > atomic64_read(&cache->last_migrated_segment_id))
+		atomic64_set(&cache->last_migrated_segment_id, record_id);
 
 	wait_for_migration(cache, seg->global_id);
 
@@ -903,9 +913,14 @@  static int __must_check init_rambuf_pool(struct wb_cache *cache)
 	size_t i, j;
 	struct rambuffer *rambuf;
 
-	/* tmp var to avoid 80 cols */
-	size_t nr = (RAMBUF_POOL_ALLOCATED * 1000000) /
-		    (1 << (cache->segment_size_order + SECTOR_SHIFT));
+	u32 nr = div_u64(cache->rambuf_pool_amount * 1000,
+			 1 << (cache->segment_size_order + SECTOR_SHIFT));
+
+	if (!nr) {
+		WBERR("at least one rambuf must be allocated");
+		return -EINVAL;
+	}
+
 	cache->nr_rambuf_pool = nr;
 	cache->rambuf_pool = kmalloc(sizeof(struct rambuffer) * nr,
 				     GFP_KERNEL);
@@ -1024,24 +1039,44 @@  int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
 	/*
 	 * (i) Harmless Initializations
 	 */
+	cache->buf_1_pool = mempool_create_kmalloc_pool(16, 1 << SECTOR_SHIFT);
+	if (!cache->buf_1_pool) {
+		r = -ENOMEM;
+		WBERR("couldn't alloc 1 sector pool");
+		goto bad_buf_1_pool;
+	}
+	cache->buf_8_pool = mempool_create_kmalloc_pool(16, 8 << SECTOR_SHIFT);
+	if (!cache->buf_8_pool) {
+		r = -ENOMEM;
+		WBERR("couldn't alloc 8 sector pool");
+		goto bad_buf_8_pool;
+	}
+
 	r = init_rambuf_pool(cache);
 	if (r) {
-		WBERR();
+		WBERR("couldn't alloc rambuf pool");
 		goto bad_init_rambuf_pool;
 	}
+	cache->flush_job_pool = mempool_create_kmalloc_pool(cache->nr_rambuf_pool,
+							    sizeof(struct flush_job));
+	if (!cache->flush_job_pool) {
+		r = -ENOMEM;
+		WBERR("couldn't alloc flush job pool");
+		goto bad_flush_job_pool;
+	}
 
 	/* Select arbitrary one as the initial rambuffer. */
 	cache->current_rambuf = cache->rambuf_pool + 0;
 
 	r = init_segment_header_array(cache);
 	if (r) {
-		WBERR();
+		WBERR("couldn't alloc segment header array");
 		goto bad_alloc_segment_header_array;
 	}
 
 	r = ht_empty_init(cache);
 	if (r) {
-		WBERR();
+		WBERR("couldn't alloc hashtable");
 		goto bad_alloc_ht;
 	}
 
@@ -1077,7 +1112,7 @@  int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
 
 	r = recover_cache(cache);
 	if (r) {
-		WBERR();
+		WBERR("recovering cache metadata failed");
 		goto bad_recover;
 	}
 
@@ -1091,7 +1126,6 @@  int __must_check resume_cache(struct wb_cache *cache, struct dm_dev *dev)
 	/* Flush Daemon */
 	spin_lock_init(&cache->flush_queue_lock);
 	INIT_LIST_HEAD(&cache->flush_queue);
-	init_waitqueue_head(&cache->flush_wait_queue);
 	CREATE_DAEMON(flush);
 
 	/* Deferred ACK for barrier writes */
@@ -1140,9 +1174,14 @@  bad_alloc_migrate_buffer:
 bad_alloc_ht:
 	free_segment_header_array(cache);
 bad_alloc_segment_header_array:
+	mempool_destroy(cache->flush_job_pool);
+bad_flush_job_pool:
 	free_rambuf_pool(cache);
 bad_init_rambuf_pool:
-	kfree(cache);
+	mempool_destroy(cache->buf_8_pool);
+bad_buf_8_pool:
+	mempool_destroy(cache->buf_1_pool);
+bad_buf_1_pool:
 	return r;
 }
 
diff --git a/Driver/dm-writeboost-metadata.h b/Driver/dm-writeboost-metadata.h
index 2e59041..709dfda 100644
--- a/Driver/dm-writeboost-metadata.h
+++ b/Driver/dm-writeboost-metadata.h
@@ -12,8 +12,8 @@ 
 struct segment_header *get_segment_header_by_id(struct wb_cache *,
 						u64 segment_id);
 sector_t calc_mb_start_sector(struct wb_cache *,
-			      struct segment_header *, cache_nr mb_idx);
-bool is_on_buffer(struct wb_cache *, cache_nr mb_idx);
+			      struct segment_header *, u32 mb_idx);
+bool is_on_buffer(struct wb_cache *, u32 mb_idx);
 
 /*----------------------------------------------------------------*/
 
diff --git a/Driver/dm-writeboost-target.c b/Driver/dm-writeboost-target.c
index 4b5b7aa..8e40f15 100644
--- a/Driver/dm-writeboost-target.c
+++ b/Driver/dm-writeboost-target.c
@@ -13,23 +13,6 @@ 
 
 /*----------------------------------------------------------------*/
 
-void *do_kmalloc_retry(size_t size, gfp_t flags, const char *caller)
-{
-	size_t count = 0;
-	void *p;
-
-retry_alloc:
-	p = kmalloc(size, flags);
-	if (!p) {
-		count++;
-		WBWARN("%s() allocation failed size:%lu, count:%lu",
-		       caller, size, count);
-		schedule_timeout_interruptible(msecs_to_jiffies(1));
-		goto retry_alloc;
-	}
-	return p;
-}
-
 struct safe_io {
 	struct work_struct work;
 	int err;
@@ -52,6 +35,7 @@  static void safe_io_proc(struct work_struct *work)
  * @thread run this operation in other thread to avoid deadlock.
  */
 int dm_safe_io_internal(
+		struct wb_device *wb,
 		struct dm_io_request *io_req,
 		unsigned num_regions, struct dm_io_region *regions,
 		unsigned long *err_bits, bool thread, const char *caller)
@@ -68,6 +52,11 @@  int dm_safe_io_internal(
 
 		INIT_WORK_ONSTACK(&io.work, safe_io_proc);
 
+		/*
+		 * Don't go on submitting I/O;
+		 * this minimizes the risk of corrupting the data.
+		 */
+		wait_on_blockup();
 		queue_work(safe_io_wq, &io.work);
 		flush_work(&io.work);
 
@@ -75,6 +64,7 @@  int dm_safe_io_internal(
 		if (err_bits)
 			*err_bits = io.err_bits;
 	} else {
+		wait_on_blockup();
 		err = dm_io(io_req, num_regions, regions, err_bits);
 	}
 
@@ -87,45 +77,15 @@  int dm_safe_io_internal(
 			eb = (~(unsigned long)0);
 		else
 			eb = *err_bits;
-		WBERR("%s() io error err(%d, %lu), rw(%d), sector(%lu), dev(%u:%u)",
+		WBERR("%s() I/O error err(%d, %lu), rw(%d), sector(%llu), dev(%u:%u)",
 		      caller, err, eb,
-		      io_req->bi_rw, regions->sector,
+		      io_req->bi_rw, (unsigned long long) regions->sector,
 		      MAJOR(dev), MINOR(dev));
 	}
 
 	return err;
 }
 
-void dm_safe_io_retry_internal(
-		struct dm_io_request *io_req,
-		unsigned num_regions, struct dm_io_region *regions,
-		bool thread, const char *caller)
-{
-	int err, count = 0;
-	unsigned long err_bits;
-	dev_t dev;
-
-retry_io:
-	err_bits = 0;
-	err = dm_safe_io_internal(io_req, num_regions, regions, &err_bits,
-				  thread, caller);
-
-	dev = regions->bdev->bd_dev;
-	if (err || err_bits) {
-		count++;
-		WBWARN("%s() io error count(%d)", caller, count);
-		schedule_timeout_interruptible(msecs_to_jiffies(1000));
-		goto retry_io;
-	}
-
-	if (count) {
-		WBWARN("%s() recover from io error rw(%d), sector(%lu), dev(%u:%u)",
-		       caller,
-		       io_req->bi_rw, regions->sector,
-		       MAJOR(dev), MINOR(dev));
-	}
-}
-
 sector_t dm_devsize(struct dm_dev *dev)
 {
 	return i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT;
@@ -165,12 +125,13 @@  static void queue_flushing(struct wb_cache *cache)
 	bool empty;
 	struct rambuffer *next_rambuf;
 	size_t n1 = 0, n2 = 0;
+	u32 tmp32;
 	u64 next_id;
 
 	while (atomic_read(&current_seg->nr_inflight_ios)) {
 		n1++;
 		if (n1 == 100)
-			WBWARN();
+			WBWARN("inflight ios remained for current seg");
 		schedule_timeout_interruptible(msecs_to_jiffies(1));
 	}
 
@@ -180,7 +141,7 @@  static void queue_flushing(struct wb_cache *cache)
 	INIT_COMPLETION(current_seg->migrate_done);
 	INIT_COMPLETION(current_seg->flush_done);
 
-	job = kmalloc_retry(sizeof(*job), GFP_NOIO);
+	job = mempool_alloc(cache->flush_job_pool, GFP_NOIO);
 	INIT_LIST_HEAD(&job->flush_queue);
 	job->seg = current_seg;
 	job->rambuf = cache->current_rambuf;
@@ -189,12 +150,21 @@  static void queue_flushing(struct wb_cache *cache)
 	bio_list_merge(&job->barrier_ios, &cache->barrier_ios);
 	bio_list_init(&cache->barrier_ios);
 
+	/*
+	 * Queuing an incomplete flush job
+	 * would let the flush daemon go wild.
+	 * We put a write barrier here to make sure
+	 * that the job is completely initialized.
+	 */
+	smp_wmb();
+
 	spin_lock_irqsave(&cache->flush_queue_lock, flags);
 	empty = list_empty(&cache->flush_queue);
 	list_add_tail(&job->flush_queue, &cache->flush_queue);
 	spin_unlock_irqrestore(&cache->flush_queue_lock, flags);
+
 	if (empty)
-		wake_up_interruptible(&cache->flush_wait_queue);
+		wake_up_process(cache->flush_daemon);
 
 	next_id = current_seg->global_id + 1;
 	new_seg = get_segment_header_by_id(cache, next_id);
@@ -203,7 +173,7 @@  static void queue_flushing(struct wb_cache *cache)
 	while (atomic_read(&new_seg->nr_inflight_ios)) {
 		n2++;
 		if (n2 == 100)
-			WBWARN();
+			WBWARN("inflight ios remained for new seg");
 		schedule_timeout_interruptible(msecs_to_jiffies(1));
 	}
 
@@ -217,7 +187,8 @@  static void queue_flushing(struct wb_cache *cache)
 	cache->cursor = current_seg->start_idx + (cache->nr_caches_inseg - 1);
 	new_seg->length = 0;
 
-	next_rambuf = cache->rambuf_pool + (next_id % cache->nr_rambuf_pool);
+	div_u64_rem(next_id, cache->nr_rambuf_pool, &tmp32);
+	next_rambuf = cache->rambuf_pool + tmp32;
 	wait_for_completion(&next_rambuf->done);
 	INIT_COMPLETION(next_rambuf->done);
 
@@ -255,12 +226,14 @@  static void queue_current_buffer(struct wb_cache *cache)
 void flush_current_buffer(struct wb_cache *cache)
 {
 	struct segment_header *old_seg;
+	u32 tmp32;
 
 	mutex_lock(&cache->io_lock);
 	old_seg = cache->current_seg;
 
 	queue_current_buffer(cache);
-	cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+	div_u64_rem(cache->cursor + 1, cache->nr_caches, &tmp32);
+	cache->cursor = tmp32;
 	cache->current_seg->length = 1;
 	mutex_unlock(&cache->io_lock);
 
@@ -345,13 +318,14 @@  static void clear_stat(struct wb_cache *cache)
 static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 		       struct metablock *mb, u8 dirty_bits, bool thread)
 {
+	int r;
 	struct wb_device *wb = cache->wb;
 
 	if (!dirty_bits)
 		return;
 
 	if (dirty_bits == 255) {
-		void *buf = kmalloc_retry(1 << 12, GFP_NOIO);
+		void *buf = mempool_alloc(cache->buf_8_pool, GFP_NOIO);
 		struct dm_io_request io_req_r, io_req_w;
 		struct dm_io_region region_r, region_w;
 
@@ -367,8 +341,7 @@  static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 			.sector = calc_mb_start_sector(cache, seg, mb->idx),
 			.count = (1 << 3),
 		};
-
-		dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+		RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));
 
 		io_req_w = (struct dm_io_request) {
 			.client = wb_io_client,
@@ -382,11 +355,11 @@  static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 			.sector = mb->sector,
 			.count = (1 << 3),
 		};
-		dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+		RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));
 
-		kfree(buf);
+		mempool_free(buf, cache->buf_8_pool);
 	} else {
-		void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+		void *buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO);
 		size_t i;
 		for (i = 0; i < 8; i++) {
 			bool bit_on = dirty_bits & (1 << i);
@@ -411,7 +384,7 @@  static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 				.sector = src,
 				.count = 1,
 			};
-			dm_safe_io_retry(&io_req_r, 1, &region_r, thread);
+			RETRY(dm_safe_io(&io_req_r, 1, &region_r, NULL, thread));
 
 			io_req_w = (struct dm_io_request) {
 				.client = wb_io_client,
@@ -425,9 +398,9 @@  static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 				.sector = mb->sector + 1 * i,
 				.count = 1,
 			};
-			dm_safe_io_retry(&io_req_w, 1, &region_w, thread);
+			RETRY(dm_safe_io(&io_req_w, 1, &region_w, NULL, thread));
 		}
-		kfree(buf);
+		mempool_free(buf, cache->buf_1_pool);
 	}
 }
 
@@ -438,12 +411,17 @@  static void migrate_mb(struct wb_cache *cache, struct segment_header *seg,
 static void migrate_buffered_mb(struct wb_cache *cache,
 				struct metablock *mb, u8 dirty_bits)
 {
+	int r;
 	struct wb_device *wb = cache->wb;
+	u8 i;
+	sector_t offset;
+	void *buf;
 
-	u8 i, k = 1 + (mb->idx % cache->nr_caches_inseg);
-	sector_t offset = (k << 3);
+	u32 k;
+	div_u64_rem(mb->idx, cache->nr_caches_inseg, &k);
+	offset = ((k + 1) << 3);
 
-	void *buf = kmalloc_retry(1 << SECTOR_SHIFT, GFP_NOIO);
+	buf = mempool_alloc(cache->buf_1_pool, GFP_NOIO);
 	for (i = 0; i < 8; i++) {
 		struct dm_io_request io_req;
 		struct dm_io_region region;
@@ -473,9 +451,9 @@  static void migrate_buffered_mb(struct wb_cache *cache,
 			.count = 1,
 		};
 
-		dm_safe_io_retry(&io_req, 1, &region, true);
+		RETRY(dm_safe_io(&io_req, 1, &region, NULL, true));
 	}
-	kfree(buf);
+	mempool_free(buf, cache->buf_1_pool);
 }
 
 static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
@@ -487,7 +465,7 @@  static void bio_remap(struct bio *bio, struct dm_dev *dev, sector_t sector)
 static sector_t calc_cache_alignment(struct wb_cache *cache,
 				     sector_t bio_sector)
 {
-	return (bio_sector / (1 << 3)) * (1 << 3);
+	return div_u64(bio_sector, 1 << 3) * (1 << 3);
 }
 
 static int writeboost_map(struct dm_target *ti, struct bio *bio
@@ -502,13 +480,15 @@  static int writeboost_map(struct dm_target *ti, struct bio *bio
 #if LINUX_VERSION_CODE >= PER_BIO_VERSION
 	struct per_bio_data *map_context;
 #endif
-	sector_t bio_count, bio_offset, s;
+	sector_t bio_count, s;
+	u8 bio_offset;
+	u32 tmp32;
 	bool bio_fullsize, found, on_buffer,
 	     refresh_segment, b;
 	int rw;
 	struct lookup_key key;
 	struct ht_head *head;
-	cache_nr update_mb_idx, idx_inseg;
+	u32 update_mb_idx;
 	size_t start;
 	void *data;
 
@@ -516,6 +496,9 @@  static int writeboost_map(struct dm_target *ti, struct bio *bio
 	struct wb_cache *cache = wb->cache;
 	struct dm_dev *orig = wb->device;
 
+	if (ACCESS_ONCE(wb->blockup))
+		return -EIO;
+
 #if LINUX_VERSION_CODE >= PER_BIO_VERSION
 	map_context = dm_per_bio_data(bio, ti->per_bio_data_size);
 #endif
@@ -552,7 +535,8 @@  static int writeboost_map(struct dm_target *ti, struct bio *bio
 
 	bio_count = bio->bi_size >> SECTOR_SHIFT;
 	bio_fullsize = (bio_count == (1 << 3));
-	bio_offset = bio->bi_sector % (1 << 3);
+	div_u64_rem(bio->bi_sector, 1 << 3, &tmp32);
+	bio_offset = tmp32;
 
 	rw = bio_data_dir(bio);
 
@@ -580,8 +564,8 @@  static int writeboost_map(struct dm_target *ti, struct bio *bio
 	mutex_lock(&cache->io_lock);
 	mb = ht_lookup(cache, head, &key);
 	if (mb) {
-		seg = ((void *) mb) - (mb->idx % cache->nr_caches_inseg) *
-				      sizeof(struct metablock)
+		div_u64_rem(mb->idx, cache->nr_caches_inseg, &tmp32);
+		seg = ((void *) mb) - tmp32 * sizeof(struct metablock)
 				    - sizeof(struct segment_header);
 		atomic_inc(&seg->nr_inflight_ios);
 	}
@@ -723,12 +707,14 @@  write_not_found:
 	 * We must flush the current segment and
 	 * get the new one.
 	 */
-	refresh_segment = !((cache->cursor + 1) % cache->nr_caches_inseg);
+	div_u64_rem(cache->cursor + 1, cache->nr_caches_inseg, &tmp32);
+	refresh_segment = !tmp32;
 
 	if (refresh_segment)
 		queue_current_buffer(cache);
 
-	cache->cursor = (cache->cursor + 1) % cache->nr_caches;
+	div_u64_rem(cache->cursor + 1, cache->nr_caches, &tmp32);
+	cache->cursor = tmp32;
 
 	/*
 	 * update_mb_idx is the cache line index to update.
@@ -738,7 +724,8 @@  write_not_found:
 	seg = cache->current_seg;
 	atomic_inc(&seg->nr_inflight_ios);
 
-	new_mb = seg->mb_array + (update_mb_idx % cache->nr_caches_inseg);
+	div_u64_rem(update_mb_idx, cache->nr_caches_inseg, &tmp32);
+	new_mb = seg->mb_array + tmp32;
 	new_mb->dirty_bits = 0;
 	ht_register(cache, head, &key, new_mb);
 	mutex_unlock(&cache->io_lock);
@@ -747,13 +734,12 @@  write_not_found:
 
 write_on_buffer:
 	;
-	idx_inseg = update_mb_idx % cache->nr_caches_inseg;
-
 	/*
 	 * The first 4KB of the segment is
 	 * used for metadata.
 	 */
-	s = (idx_inseg + 1) << 3;
+	div_u64_rem(update_mb_idx, cache->nr_caches_inseg, &tmp32);
+	s = (tmp32 + 1) << 3;
 
 	b = false;
 	lockseg(seg, flags);
@@ -769,7 +755,7 @@  write_on_buffer:
 		u8 i;
 		u8 acc_bits = 0;
 		s += bio_offset;
-		for (i = bio_offset; i < (bio_offset+bio_count); i++)
+		for (i = bio_offset; i < (bio_offset + bio_count); i++)
 			acc_bits += (1 << i);
 
 		mb->dirty_bits |= acc_bits;
@@ -827,8 +813,15 @@  static int writeboost_end_io(struct dm_target *ti, struct bio *bio, int error
 	return 0;
 }
 
+#define ARG_EXIST(n)\
+	if (argc <= (n)) {\
+		goto exit_parse_arg;\
+	}
+
 /*
- * <backing dev> <cache dev> <segment size order>
+ * <backing dev> <cache dev>
+ * [segment size order]
+ * [rambuf pool amount]
  */
 static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
@@ -842,7 +835,7 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 6, 0)
 	r = dm_set_target_max_io_len(ti, (1 << 3));
 	if (r) {
-		WBERR();
+		WBERR("setting max io len failed");
 		return r;
 	}
 #else
@@ -862,6 +855,8 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	 */
 	wb->migrate_threshold = 70;
 
+	init_waitqueue_head(&wb->blockup_wait_queue);
+	wb->blockup = false;
 
 	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
 	if (!cache) {
@@ -888,6 +883,10 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad_get_device_cache;
 	}
 
+	/* Optional Parameters */
+
+	cache->segment_size_order = 7;
+	ARG_EXIST(2);
 	if (kstrtoul(argv[2], 10, &tmp)) {
 		r = -EINVAL;
 		goto bad_segment_size_order;
@@ -901,6 +900,16 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
 	cache->segment_size_order = tmp;
 
+	cache->rambuf_pool_amount = 2048;
+	ARG_EXIST(3);
+	if (kstrtoul(argv[3], 10, &tmp)) {
+		r = -EINVAL;
+		goto bad_rambuf_pool_amount;
+	}
+	cache->rambuf_pool_amount = tmp;
+
+exit_parse_arg:
+
 	r = audit_cache_device(cachedev, cache, &need_format, &allow_format);
 	if (r) {
 		WBERR("audit cache device fails err(%d)", r);
@@ -930,7 +939,7 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
 	r = resume_cache(cache, cachedev);
 	if (r) {
-		WBERR("%d", r);
+		WBERR("failed to resume cache err(%d)", r);
 		goto bad_resume_cache;
 	}
 	clear_stat(cache);
@@ -957,6 +966,7 @@  static int writeboost_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 bad_resume_cache:
 bad_format_cache:
 bad_audit_cache:
+bad_rambuf_pool_amount:
 bad_segment_size_order:
 	dm_put_device(ti, cachedev);
 bad_get_device_cache:
@@ -1000,6 +1010,14 @@  static int writeboost_message(struct dm_target *ti, unsigned argc, char **argv)
 	if (kstrtoul(argv[1], 10, &tmp))
 		return -EINVAL;
 
+	if (!strcasecmp(cmd, "blockup")) {
+		if (tmp > 1)
+			return -EINVAL;
+		wb->blockup = tmp;
+		wake_up(&wb->blockup_wait_queue);
+		return 0;
+	}
+
 	if (!strcasecmp(cmd, "allow_migrate")) {
 		if (tmp > 1)
 			return -EINVAL;
@@ -1101,31 +1119,39 @@  writeboost_status(
 		DMEMIT("%llu %llu %llu %llu %llu %u ",
 		       (long long unsigned int)
 		       atomic64_read(&wb->nr_dirty_caches),
-		       (long long unsigned int) cache->nr_segments,
-		       (long long unsigned int) cache->last_migrated_segment_id,
-		       (long long unsigned int) cache->last_flushed_segment_id,
-		       (long long unsigned int) cache->current_seg->global_id,
-		       (unsigned int) cache->cursor);
+		       (long long unsigned int)
+		       cache->nr_segments,
+		       (long long unsigned int)
+		       atomic64_read(&cache->last_migrated_segment_id),
+		       (long long unsigned int)
+		       atomic64_read(&cache->last_flushed_segment_id),
+		       (long long unsigned int)
+		       cache->current_seg->global_id,
+		       (unsigned int)
+		       cache->cursor);
 
 		for (i = 0; i < STATLEN; i++) {
 			atomic64_t *v = &cache->stat[i];
-			DMEMIT("%lu ", atomic64_read(v));
+			DMEMIT("%llu ", (unsigned long long) atomic64_read(v));
 		}
 
-		DMEMIT("%d ", 7);
+		DMEMIT("%d ", 8);
 		DMEMIT("barrier_deadline_ms %lu ",
 		       cache->barrier_deadline_ms);
 		DMEMIT("allow_migrate %d ",
 		       cache->allow_migrate ? 1 : 0);
 		DMEMIT("enable_migration_modulator %d ",
 		       cache->enable_migration_modulator ? 1 : 0);
-		DMEMIT("migrate_threshold %d ", wb->migrate_threshold);
-		DMEMIT("nr_cur_batched_migration %lu ",
+		DMEMIT("migrate_threshold %d ",
+		       wb->migrate_threshold);
+		DMEMIT("nr_cur_batched_migration %u ",
 		       cache->nr_cur_batched_migration);
 		DMEMIT("sync_interval %lu ",
 		       cache->sync_interval);
-		DMEMIT("update_record_interval %lu",
+		DMEMIT("update_record_interval %lu ",
 		       cache->update_record_interval);
+		DMEMIT("blockup %d",
+		       wb->blockup);
 		break;
 
 	case STATUSTYPE_TABLE:
@@ -1169,13 +1195,13 @@  static int __init writeboost_module_init(void)
 	safe_io_wq = alloc_workqueue("safeiowq",
 				     WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
 	if (!safe_io_wq) {
-		WBERR();
+		WBERR("failed to alloc safe_io_wq");
 		goto bad_wq;
 	}
 
 	wb_io_client = dm_io_client_create();
 	if (IS_ERR(wb_io_client)) {
-		WBERR();
+		WBERR("failed to alloc wb_io_client");
 		r = PTR_ERR(wb_io_client);
 		goto bad_io_client;
 	}
diff --git a/Driver/dm-writeboost.h b/Driver/dm-writeboost.h
index d394dfa..fdb41d0 100644
--- a/Driver/dm-writeboost.h
+++ b/Driver/dm-writeboost.h
@@ -22,22 +22,17 @@ 
 #include <linux/device-mapper.h>
 #include <linux/dm-io.h>
 
-#define wbdebug(f, args...) \
+#define wbdebug(f, args...)\
 	DMINFO("debug@%s() L.%d" f, __func__, __LINE__, ## args)
 
-#define WBERR(f, args...) \
+#define WBERR(f, args...)\
 	DMERR("err@%s() " f, __func__, ## args)
-#define WBWARN(f, args...) \
+#define WBWARN(f, args...)\
 	DMWARN("warn@%s() " f, __func__, ## args)
-#define WBINFO(f, args...) \
+#define WBINFO(f, args...)\
 	DMINFO("info@%s() " f, __func__, ## args)
 
 /*
- * The amount of RAM buffer pool to pre-allocated.
- */
-#define RAMBUF_POOL_ALLOCATED 64 /* MB */
-
-/*
  * The Detail of the Disk Format
  *
  * Whole:
@@ -81,15 +76,6 @@  struct superblock_record_device {
 } __packed;
 
 /*
- * Cache line index.
- *
- * dm-writeboost can supoort a cache device
- * with size less than 4KB * (1 << 32)
- * that is 16TB.
- */
-typedef u32 cache_nr;
-
-/*
  * Metadata of a 4KB cache line
  *
  * Dirtiness is defined for each sector
@@ -98,7 +84,7 @@  typedef u32 cache_nr;
 struct metablock {
 	sector_t sector; /* key */
 
-	cache_nr idx; /* Const */
+	u32 idx; /* Const */
 
 	struct hlist_node ht_list;
 
@@ -143,7 +129,7 @@  struct segment_header {
 	 */
 	u8 length;
 
-	cache_nr start_idx; /* Const */
+	u32 start_idx; /* Const */
 	sector_t start_sector; /* Const */
 
 	struct list_head migrate_list;
@@ -228,10 +214,14 @@  struct wb_device;
 struct wb_cache {
 	struct wb_device *wb;
 
+	mempool_t *buf_1_pool; /* 1 sector buffer pool */
+	mempool_t *buf_8_pool; /* 8 sector buffer pool */
+	mempool_t *flush_job_pool;
+
 	struct dm_dev *device;
 	struct mutex io_lock;
-	cache_nr nr_caches; /* Const */
-	u64 nr_segments; /* Const */
+	u32 nr_caches; /* Const */
+	u32 nr_segments; /* Const */
 	u8 segment_size_order; /* Const */
 	u8 nr_caches_inseg; /* Const */
 	struct bigarray *segment_header_array;
@@ -248,15 +238,16 @@  struct wb_cache {
 	size_t htsize;
 	struct ht_head *null_head;
 
-	cache_nr cursor; /* Index that has been written the most lately */
+	u32 cursor; /* Index that has been written the most lately */
 	struct segment_header *current_seg;
 
 	struct rambuffer *current_rambuf;
-	size_t nr_rambuf_pool; /* Const */
+	u32 rambuf_pool_amount; /* kB */
+	u32 nr_rambuf_pool; /* Const */
 	struct rambuffer *rambuf_pool;
 
-	u64 last_migrated_segment_id;
-	u64 last_flushed_segment_id;
+	atomic64_t last_migrated_segment_id;
+	atomic64_t last_flushed_segment_id;
 	int urge_migrate;
 
 	/*
@@ -269,7 +260,6 @@  struct wb_cache {
 	struct task_struct *flush_daemon;
 	spinlock_t flush_queue_lock;
 	struct list_head flush_queue;
-	wait_queue_head_t flush_wait_queue;
 
 	/*
 	 * Deferred ACK for barriers.
@@ -289,7 +279,7 @@  struct wb_cache {
 	 * if they are segments to migrate.
 	 */
 	struct task_struct *migrate_daemon;
-	bool allow_migrate; /* param */
+	int allow_migrate; /* param */
 
 	/*
 	 * Batched Migration
@@ -303,8 +293,8 @@  struct wb_cache {
 	struct list_head migrate_list;
 	u8 *dirtiness_snapshot;
 	void *migrate_buffer;
-	size_t nr_cur_batched_migration;
-	size_t nr_max_batched_migration; /* param */
+	u32 nr_cur_batched_migration;
+	u32 nr_max_batched_migration; /* param */
 
 	/*
 	 * Migration modulator
@@ -314,7 +304,7 @@  struct wb_cache {
 	 * according to the load of backing store.
 	 */
 	struct task_struct *modulator_daemon;
-	bool enable_migration_modulator; /* param */
+	int enable_migration_modulator; /* param */
 
 	/*
 	 * Superblock Recorder
@@ -347,6 +337,9 @@  struct wb_device {
 	u8 migrate_threshold;
 
 	atomic64_t nr_dirty_caches;
+
+	wait_queue_head_t blockup_wait_queue;
+	int blockup;
 };
 
 struct flush_job {
@@ -384,24 +377,50 @@  u8 atomic_read_mb_dirtiness(struct segment_header *, struct metablock *);
 extern struct workqueue_struct *safe_io_wq;
 extern struct dm_io_client *wb_io_client;
 
-void *do_kmalloc_retry(size_t size, gfp_t flags, const char *caller);
-#define kmalloc_retry(size, flags) \
-	do_kmalloc_retry((size), (flags), __func__)
+/*
+ * An I/O error on either the backing or the cache device
+ * should block up the whole system.
+ * A device must not be read from or written to
+ * once it returns -EIO.
+ * Such devices are untrustworthy, so we wait for
+ * the sysadmin to remove the cause of the failure.
+ */
+
+#define wait_on_blockup()\
+	do {\
+		BUG_ON(!wb);\
+		if (ACCESS_ONCE(wb->blockup)) {\
+			WBERR("system is blocked up on I/O error. set blockup to 0 after checkup.");\
+			wait_event_interruptible(wb->blockup_wait_queue,\
+						 !ACCESS_ONCE(wb->blockup));\
+			WBINFO("reactivated after blockup");\
+		}\
+	} while (0)
+
+#define RETRY(proc)\
+	do {\
+		BUG_ON(!wb);\
+		r = proc;\
+		if (r == -EOPNOTSUPP) {\
+			r = 0;\
+		} else if (r == -EIO) { /* I/O error is critical */\
+			wb->blockup = true;\
+			wait_on_blockup();\
+		} else if (r == -ENOMEM) {\
+			schedule_timeout_interruptible(msecs_to_jiffies(1000));\
+		} else if (r) {\
+			WBERR("I/O failed with unexpected error code %d; please report", r);\
+			r = 0;\
+		}\
+	} while (r)
 
 int dm_safe_io_internal(
+		struct wb_device*,
 		struct dm_io_request *,
 		unsigned num_regions, struct dm_io_region *,
 		unsigned long *err_bits, bool thread, const char *caller);
-#define dm_safe_io(io_req, num_regions, regions, err_bits, thread) \
-	dm_safe_io_internal((io_req), (num_regions), (regions), \
-			    (err_bits), (thread), __func__)
-
-void dm_safe_io_retry_internal(
-		struct dm_io_request *,
-		unsigned num_regions, struct dm_io_region *,
-		bool thread, const char *caller);
-#define dm_safe_io_retry(io_req, num_regions, regions, thread) \
-	dm_safe_io_retry_internal((io_req), (num_regions), (regions), \
-				  (thread), __func__)
+#define dm_safe_io(io_req, num_regions, regions, err_bits, thread)\
+		dm_safe_io_internal(wb, (io_req), (num_regions), (regions),\
+				    (err_bits), (thread), __func__)
 
 sector_t dm_devsize(struct dm_dev *);

diff --git a/dm-writeboost.txt b/dm-writeboost.txt
index 9acbd54..00ad6e0 100644
--- a/dm-writeboost.txt
+++ b/dm-writeboost.txt
@@ -66,12 +66,20 @@  All the operations are via dmsetup command.
 
 Constructor
 -----------
-writeboost <backing dev> <cache dev> <segment size order>
+writeboost <backing dev> <cache dev>
+           [segment size order]
+           [rambuf pool amount]
 
 backing dev        : slow device holding original data blocks.
 cache dev          : fast device holding cached data and its metadata.
 segment size order : the size of RAM buffer
                      1 << n (sectors), 4 <= n <= 11
+                     default 7
+rambuf pool amount : the amount of the RAM buffer pool (kB).
+                     Too small an amount may cause waiting for a new
+                     buffer to become available again, while too large
+                     an amount does not improve performance.
+                     default 2048
 
 Note that cache device is re-formatted
 if the first sector of the cache device is zeroed out.
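For reference, a constructor invocation using both optional parameters, and the new `blockup` message, might look like the sketch below. The device paths and target name are placeholders, and the `dmsetup` lines are commented out because they require root and real devices; only the segment-size arithmetic actually runs.

```shell
#!/bin/sh
# Hypothetical devices; adjust for your system.
BACKING=/dev/sdb
CACHE=/dev/sdc

# segment size order n gives segments of 1 << n sectors (4 <= n <= 11).
ORDER=7
SEG_SECTORS=$((1 << ORDER))   # 128 sectors = 64 KiB per segment
echo "segment: $SEG_SECTORS sectors"

# Table: <start> <len> writeboost <backing> <cache> [segment size order] [rambuf pool amount]
# SECTORS=$(blockdev --getsz "$BACKING")
# echo "0 $SECTORS writeboost $BACKING $CACHE $ORDER 2048" | dmsetup create wbdev

# After an -EIO blockup, once the sysadmin has removed the cause:
# dmsetup message wbdev 0 blockup 0
```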