Message ID | 1399652260-9360-1-git-send-email-fdmanana@gmail.com (mailing list archive)
State | Accepted
On Fri, May 09, 2014 at 05:17:40PM +0100, Filipe David Borba Manana wrote:
> When running low on available disk space and having several processes
> doing buffered file IO, I got the following trace in dmesg:
> (...)

I was building 3.15.1 and noticed this patch didn't make it in (your other
two btrfs send patches I was carrying did make it). Is that on purpose,
perhaps being delayed for 3.16, or is it an oversight?

Thanks,
Marc
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0c43896..4468afd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3284,6 +3284,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
 					 end - cur + 1, 1);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
+			ret = PTR_ERR_OR_ZERO(em);
 			break;
 		}
 
@@ -3370,13 +3371,17 @@ done:
 		set_page_writeback(page);
 		end_page_writeback(page);
 	}
+	if (PageError(page)) {
+		ret = ret < 0 ? ret : -EIO;
+		end_extent_writepage(page, ret, start, page_end);
+	}
 	unlock_page(page);
 
 done_unlocked:
 
 	/* drop our reference on any cached states */
 	free_extent_state(cached_state);
-	return 0;
+	return ret;
 }
 
 static int eb_wait(void *word)
@@ -3699,6 +3704,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree,
 	struct inode *inode = mapping->host;
 	int ret = 0;
 	int done = 0;
+	int err = 0;
 	int nr_to_write_done = 0;
 	struct pagevec pvec;
 	int nr_pages;
@@ -3785,8 +3791,8 @@ retry:
 				unlock_page(page);
 				ret = 0;
 			}
-			if (ret)
-				done = 1;
+			if (!err && ret < 0)
+				err = ret;
 
 			/*
 			 * the filesystem may choose to bump up nr_to_write.
@@ -3798,7 +3804,7 @@ retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!scanned && !done) {
+	if (!scanned && !done && !err) {
 		/*
 		 * We hit the last page and there is more work to be done: wrap
 		 * back to the start of the file
@@ -3808,7 +3814,7 @@ retry:
 		goto retry;
 	}
 	btrfs_add_delayed_iput(inode);
-	return ret;
+	return err;
 }
 
 static void flush_epd_write_bio(struct extent_page_data *epd)
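A note on the PTR_ERR_OR_ZERO() call added in the first hunk: the kernel
encodes small negative errno values directly in pointer values, and
PTR_ERR_OR_ZERO() recovers the errno for an error pointer and 0 otherwise.
Since em may also be NULL here (the branch tests IS_ERR_OR_NULL), ret can
still be 0 after this line; the PageError() check in the second hunk then
turns that into -EIO. Below is a minimal userspace sketch of the convention
(illustrative names, not the kernel's include/linux/err.h):

	#include <stdio.h>

	#define MAX_ERRNO 4095

	/* Errno values -1..-4095 occupy the top page of the address
	 * space, which no valid allocation can use, so a pointer can
	 * carry either a real result or an error code. */
	static int is_err(const void *ptr)
	{
		return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
	}

	static long ptr_err_or_zero(const void *ptr)
	{
		return is_err(ptr) ? (long)ptr : 0;	/* errno, or 0 */
	}

	int main(void)
	{
		void *em = (void *)-28L;	/* models get_extent failing with -ENOSPC */

		printf("%ld\n", ptr_err_or_zero(em));	/* prints -28 */
		printf("%ld\n", ptr_err_or_zero(NULL));	/* prints 0: NULL is not an error pointer */
		return 0;
	}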
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:

[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
[ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)

It turns out that extent_io.c:__extent_writepage(), which ends up being
called through filemap_fdatawrite_range() in btrfs_start_ordered_extent(),
was getting -ENOSPC when calling the fill_delalloc callback. In this
situation, it returned without the writepage_end_io_hook callback
(inode.c:btrfs_writepage_end_io_hook) ever being called for the respective
page, which prevents the ordered extent's bytes_left count from ever
reaching 0, and therefore a finish_ordered_fn work is never queued into
the endio_write_workers queue. This makes the task that called
btrfs_start_ordered_extent() hang forever on the wait queue of the
ordered extent.
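To make the hang mechanism concrete, here is a minimal userspace sketch of
the accounting pattern described above. The names are hypothetical analogies
(the real accounting lives in fs/btrfs/ordered-data.c); only bytes_left and
the hook/waiter roles correspond to the btrfs code:

	#include <pthread.h>
	#include <stdio.h>

	struct ordered_extent {
		pthread_mutex_t lock;
		pthread_cond_t done;
		unsigned long bytes_left;	/* bytes not yet completed */
	};

	/* Analogue of btrfs_writepage_end_io_hook: account one page. */
	static void end_io_hook(struct ordered_extent *oe, unsigned long bytes)
	{
		pthread_mutex_lock(&oe->lock);
		oe->bytes_left -= bytes;
		if (oe->bytes_left == 0)
			pthread_cond_broadcast(&oe->done);	/* wake the waiter */
		pthread_mutex_unlock(&oe->lock);
	}

	/* Analogue of writing one page; 'error' models fill_delalloc
	 * failing with -ENOSPC. */
	static void write_page(struct ordered_extent *oe, unsigned long bytes,
			       int error)
	{
		/*
		 * The bug was the moral equivalent of
		 *     if (error)
		 *         return;    <-- hook skipped, waiter hangs forever
		 * The fix runs the hook on the error path too.
		 */
		(void)error;
		end_io_hook(oe, bytes);
	}

	/* Analogue of btrfs_start_ordered_extent: wait for all pages. */
	static void wait_ordered(struct ordered_extent *oe)
	{
		pthread_mutex_lock(&oe->lock);
		while (oe->bytes_left > 0)
			pthread_cond_wait(&oe->done, &oe->lock);
		pthread_mutex_unlock(&oe->lock);
	}

	int main(void)
	{
		struct ordered_extent oe = {
			.lock = PTHREAD_MUTEX_INITIALIZER,
			.done = PTHREAD_COND_INITIALIZER,
			.bytes_left = 2 * 4096,
		};

		write_page(&oe, 4096, 0);	/* page 1: success */
		write_page(&oe, 4096, -28);	/* page 2: -ENOSPC, hook still runs */
		wait_ordered(&oe);		/* returns only because both hooks ran */
		printf("ordered extent completed\n");
		return 0;
	}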
This is fairly easy to reproduce using a small filesystem and fsstress
on a quad core vm:

    mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt

    fsstress -p 6 -d /mnt -n 100000 -x \
        "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
        -f allocsp=0 \
        -f bulkstat=0 \
        -f bulkstat1=0 \
        -f chown=0 \
        -f creat=1 \
        -f dread=0 \
        -f dwrite=0 \
        -f fallocate=1 \
        -f fdatasync=0 \
        -f fiemap=0 \
        -f freesp=0 \
        -f fsync=0 \
        -f getattr=0 \
        -f getdents=0 \
        -f link=0 \
        -f mkdir=0 \
        -f mknod=0 \
        -f punch=1 \
        -f read=0 \
        -f readlink=0 \
        -f rename=0 \
        -f resvsp=0 \
        -f rmdir=0 \
        -f setxattr=0 \
        -f stat=0 \
        -f symlink=0 \
        -f sync=0 \
        -f truncate=1 \
        -f unlink=0 \
        -f unresvsp=0 \
        -f write=4

So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
---

V2: Grab error from `em` pointer if available, do the error check and
    end_extent_writepage call before unlocking the page (just like
    end_bio_extent_writepage does).

V3: Make extent_write_cache_pages not stop processing remaining pages
    in the page vector if __extent_writepage returns an error for some
    page, so that ordered extents end up being processed and released.

 fs/btrfs/extent_io.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)
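The V3 change in extent_write_cache_pages() boils down to a common error
handling pattern: remember the first failure but finish the scan, so that
every page still gets submitted and its ordered extent released. A
standalone sketch of just that pattern (writeback_scan() and write_one()
are made-up stand-ins, not the kernel functions):

	#include <stdio.h>

	/* Stand-in for __extent_writepage(): returns 0 or a -errno. */
	static int write_one(int page_no)
	{
		return page_no == 1 ? -28 /* -ENOSPC */ : 0;
	}

	static int writeback_scan(int nr_pages)
	{
		int err = 0;

		for (int i = 0; i < nr_pages; i++) {
			int ret = write_one(i);

			/*
			 * The old code did "if (ret) done = 1;" and stopped
			 * early, so pages after the failing one were never
			 * written and their ordered extents never completed.
			 * Instead, record the first error and keep going.
			 */
			if (!err && ret < 0)
				err = ret;
		}
		return err;	/* 0, or the first error encountered */
	}

	int main(void)
	{
		printf("result: %d\n", writeback_scan(4));	/* prints -28 */
		return 0;
	}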