Message ID | 20190710192818.1069475-4-tj@kernel.org (mailing list archive)
State      | New, archived
Series     | [1/5] Btrfs: stop using btrfs_schedule_bio()
On 10.07.19 at 22:28, Tejun Heo wrote:
> From: Chris Mason <clm@fb.com>
>
> The btrfs writepages function collects a large range of pages flagged
> for delayed allocation, and then sends them down through the COW code
> for processing. When compression is on, we allocate one async_cow

nit: The code no longer uses async_cow to represent in-flight chunks
but the more aptly named async_chunk. Presumably this patchset predates
those changes.

> structure for every 512K, and then run those pages through the
> compression code for IO submission.
>
> writepages starts all of this off with a single page, locked by
> the original call to extent_write_cache_pages(), and it's important to
> keep track of this page because it has already been through
> clear_page_dirty_for_io().

IMO it would be beneficial to state the implications of
clear_page_dirty_for_io being called, i.e. what special handling this
particular page should receive for the rest of its lifetime.

> The btrfs async_cow struct has a pointer to the locked_page, and when
> we're redirtying the page because compression had to fall back to
> uncompressed IO, we use page->index to decide if a given async_cow
> struct really owns that page.
>
> But, this is racy. If a given delalloc range is broken up into two
> async_cows (cow_A and cow_B), we can end up with something like this:
>
>  compress_file_range(cowA)
>  submit_compress_extents(cowA)
>  submit compressed bios(cowA)
>  put_page(locked_page)
>
>                          compress_file_range(cowB)
>                          ...

This call trace is _really_ hand-wavy and the correct one is more
complex, hence it should be something like:

async_cow_submit
 submit_compressed_extents <--- falls back to buffered writeout
  cow_file_range
   extent_clear_unlock_delalloc
    __process_pages_contig
     put_page(locked_pages)

async_cow_submit

> The end result is that cowA is completed and cleaned up before cowB even
> starts processing. This means we can free locked_page() and reuse it
> elsewhere. If we get really lucky, it'll have the same page->index in
> its new home as it did before.
>
> While we're processing cowB, we might decide we need to fall back to
> uncompressed IO, and so compress_file_range() will call
> __set_page_dirty_nobuffers() on cowB->locked_page.
>
> Without cgroups in use, this creates a phantom dirty page, which
> isn't great but isn't the end of the world. With cgroups in use, we

Having a phantom dirty page is not great but not terrible without
cgroups, but apart from that, does it have any other implications?

> might crash in the accounting code because page->mapping->i_wb isn't
> set.
>
> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8308.594469] Call Trace:
> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
> [ 8308.619108]  evict+0xc1/0x190
> [ 8308.620023]  do_unlinkat+0x176/0x280
> [ 8308.621202]  do_syscall_64+0x63/0x1a0
> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> The fix here is to make async_cow->locked_page NULL everywhere but the
> one async_cow struct that's allowed to do things to the locked page.
>
> Signed-off-by: Chris Mason <clm@fb.com>
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent_io.c |  2 +-
>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>  2 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5106008f5e28..a31574df06aa 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>  			if (page_ops & PAGE_SET_PRIVATE2)
>  				SetPagePrivate2(pages[i]);
>
> -			if (pages[i] == locked_page) {
> +			if (locked_page && pages[i] == locked_page) {

Why not make the check just if (locked_page) and clean it up, since if
__process_pages_contig is called from the owner of the page then it's
guaranteed that the page will fall within its range.

>  				put_page(pages[i]);
>  				pages_locked++;
>  				continue;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6e6df0eab324..a81e9860ee1f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
>  	 * to our extent and set things up for the async work queue to run
>  	 * cow_file_range to do the normal delalloc dance.
>  	 */
> -	if (page_offset(async_chunk->locked_page) >= start &&
> -	    page_offset(async_chunk->locked_page) <= end)
> +	if (async_chunk->locked_page &&
> +	    (page_offset(async_chunk->locked_page) >= start &&
> +	     page_offset(async_chunk->locked_page) <= end)) {

DITTO: since locked_page is now only set on the chunk that has the
right to it, there is no need to check the offsets, and this will
simplify the code.

>  		__set_page_dirty_nobuffers(async_chunk->locked_page);
>  		/* unlocked later on in the async handlers */
> +	}
>
>  	if (redirty)
>  		extent_range_redirty_for_io(inode, start, end);
> @@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  					  async_extent->start +
>  					  async_extent->ram_size - 1,
>  					  WB_SYNC_ALL);
> -		else if (ret)
> +		else if (ret && async_chunk->locked_page)
>  			unlock_page(async_chunk->locked_page);
>  		kfree(async_extent);
>  		cond_resched();
> @@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
>  		async_chunk[i].inode = inode;
>  		async_chunk[i].start = start;
>  		async_chunk[i].end = cur_end;
> -		async_chunk[i].locked_page = locked_page;
>  		async_chunk[i].write_flags = write_flags;
>  		INIT_LIST_HEAD(&async_chunk[i].extents);
>
> +		/*
> +		 * The locked_page comes all the way from writepage and its
> +		 * the original page we were actually given.  As we spread
> +		 * this large delalloc region across multiple async_cow
> +		 * structs, only the first struct needs a pointer to locked_page
> +		 *
> +		 * This way we don't need racey decisions about who is supposed
> +		 * to unlock it.
> +		 */
> +		if (locked_page) {
> +			async_chunk[i].locked_page = locked_page;
> +			locked_page = NULL;
> +		} else {
> +			async_chunk[i].locked_page = NULL;
> +		}
> +
>  		btrfs_init_work(&async_chunk[i].work,
>  				btrfs_delalloc_helper,
>  				async_cow_start, async_cow_submit,
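To make the ownership handoff that the patch introduces in
cow_file_range_async() easier to see, here is a minimal, self-contained
userspace C model of the pattern (the struct and field names merely
mirror the patch; the types, sizes, and helpers are placeholders, not
the real btrfs definitions):

	/*
	 * Model of the handoff in cow_file_range_async(): the caller
	 * owns one special "locked page" and must hand it to exactly
	 * one of the N chunks it creates.  Every other chunk sees
	 * NULL, so only one worker can ever touch (redirty/unlock)
	 * that page.  "struct page" here is a stand-in.
	 */
	#include <stdio.h>

	struct page { unsigned long index; };

	struct async_chunk {
		unsigned long start;
		unsigned long end;
		struct page *locked_page;	/* non-NULL for exactly one chunk */
	};

	static void init_chunks(struct async_chunk *chunks, int nr,
				unsigned long start, unsigned long chunk_sz,
				struct page *locked_page)
	{
		for (int i = 0; i < nr; i++) {
			chunks[i].start = start + i * chunk_sz;
			chunks[i].end = chunks[i].start + chunk_sz - 1;
			/*
			 * Hand the locked page to the first chunk only,
			 * then forget it: later chunks get NULL, so no
			 * chunk ever has to guess whether it owns the page.
			 */
			chunks[i].locked_page = locked_page;
			locked_page = NULL;
		}
	}

	int main(void)
	{
		struct page lp = { .index = 0 };
		struct async_chunk chunks[4];

		init_chunks(chunks, 4, 0, 512 * 1024, &lp);
		for (int i = 0; i < 4; i++)
			printf("chunk %d [%lu-%lu] locked_page=%p\n", i,
			       chunks[i].start, chunks[i].end,
			       (void *)chunks[i].locked_page);
		return 0;
	}

Running it prints a non-NULL locked_page for chunk 0 only, which is the
invariant the rest of the discussion below relies on.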
On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:

> On 10.07.19 at 22:28, Tejun Heo wrote:
>> From: Chris Mason <clm@fb.com>
>>
>> The btrfs writepages function collects a large range of pages flagged
>> for delayed allocation, and then sends them down through the COW code
>> for processing. When compression is on, we allocate one async_cow
>
> nit: The code no longer uses async_cow to represent in-flight chunks
> but the more aptly named async_chunk. Presumably this patchset
> predates those changes.

Not by much, but yes.

>> The end result is that cowA is completed and cleaned up before cowB
>> even starts processing. This means we can free locked_page() and
>> reuse it elsewhere. If we get really lucky, it'll have the same
>> page->index in its new home as it did before.
>>
>> While we're processing cowB, we might decide we need to fall back to
>> uncompressed IO, and so compress_file_range() will call
>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>
>> Without cgroups in use, this creates a phantom dirty page, which
>> isn't great but isn't the end of the world. With cgroups in use, we
>
> Having a phantom dirty page is not great but not terrible without
> cgroups, but apart from that, does it have any other implications?

Best case, it'll go through the writepage fixup worker and go through
the whole cow machinery again.  Worst case we go to this code more than
once:

		/*
		 * if page_started, cow_file_range inserted an
		 * inline extent and took care of all the unlocking
		 * and IO for us.  Otherwise, we need to submit
		 * all those pages down to the drive.
		 */
		if (!page_started && !ret)
			extent_write_locked_range(inode,
						  async_extent->start,
						  async_extent->start +
						  async_extent->ram_size - 1,
						  WB_SYNC_ALL);
		else if (ret)
			unlock_page(async_chunk->locked_page);

That never happened in production as far as I can tell, but it seems
possible.

>> might crash in the accounting code because page->mapping->i_wb isn't
>> set.
>>
>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 8308.594469] Call Trace:
>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>> [ 8308.619108]  evict+0xc1/0x190
>> [ 8308.620023]  do_unlinkat+0x176/0x280
>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> The fix here is to make async_cow->locked_page NULL everywhere but the
>> one async_cow struct that's allowed to do things to the locked page.
>>
>> Signed-off-by: Chris Mason <clm@fb.com>
>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/extent_io.c |  2 +-
>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5106008f5e28..a31574df06aa 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>  				SetPagePrivate2(pages[i]);
>>
>> -			if (pages[i] == locked_page) {
>> +			if (locked_page && pages[i] == locked_page) {
>
> Why not make the check just if (locked_page) and clean it up, since if
> __process_pages_contig is called from the owner of the page then it's
> guaranteed that the page will fall within its range.

I'm not convinced that every single caller of __process_pages_contig is
making sure to only send locked_page for ranges that correspond to the
locked_page.  I'm not sure exactly what you're asking for though, it
looks like it would require some larger changes to the flow of that
loop.
>>  				put_page(pages[i]);
>>  				pages_locked++;
>>  				continue;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 6e6df0eab324..a81e9860ee1f 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
>>  	 * to our extent and set things up for the async work queue to run
>>  	 * cow_file_range to do the normal delalloc dance.
>>  	 */
>> -	if (page_offset(async_chunk->locked_page) >= start &&
>> -	    page_offset(async_chunk->locked_page) <= end)
>> +	if (async_chunk->locked_page &&
>> +	    (page_offset(async_chunk->locked_page) >= start &&
>> +	     page_offset(async_chunk->locked_page) <= end)) {
>
> DITTO: since locked_page is now only set on the chunk that has the
> right to it, there is no need to check the offsets, and this will
> simplify the code.

start is adjusted higher up in the loop:

	if (start + total_in < end) {
		start += total_in;
		pages = NULL;
		cond_resched();
		goto again;
	}

So we might get to the __set_page_dirty_nobuffers() test with a range
that no longer corresponds to the locked page.

-chris
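Chris' point can be seen in isolation with a small userspace C model:
assume a 512K chunk whose owner holds the page at offset 0, and a
compression loop that consumes 128K per pass (both numbers are made up
for illustration, not taken from the code). After the first
'start += total_in', the locked page no longer falls inside
[start, end], so the offset test has to stay even though ownership is
now unambiguous:

	/*
	 * Model of the 'goto again' loop in compress_file_range():
	 * even the chunk that owns locked_page can reach the fallback
	 * test with an advanced 'start', so the offset check must
	 * remain alongside the new NULL check.
	 */
	#include <stdio.h>
	#include <stdbool.h>

	static bool owns_range(unsigned long page_offset,
			       unsigned long start, unsigned long end)
	{
		return page_offset >= start && page_offset <= end;
	}

	int main(void)
	{
		unsigned long locked_page_offset = 0;	/* first page of the range */
		unsigned long start = 0, end = (512 * 1024) - 1;
		unsigned long total_in = 128 * 1024;	/* bytes consumed per pass */

		while (start <= end) {
			printf("pass over [%lu-%lu]: redirty locked page? %s\n",
			       start, end,
			       owns_range(locked_page_offset, start, end) ?
			       "yes" : "no");
			start += total_in;	/* mirrors 'start += total_in; goto again;' */
		}
		return 0;
	}

Only the first pass answers "yes"; every later pass correctly skips the
redirty even though the same chunk still owns the page.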
On 11.07.19 at 22:52, Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
>
>> On 10.07.19 at 22:28, Tejun Heo wrote:
>>> From: Chris Mason <clm@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing. When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks
>> but the more aptly named async_chunk. Presumably this patchset
>> predates those changes.
>
> Not by much, but yes.
>
>>> The end result is that cowA is completed and cleaned up before cowB
>>> even starts processing. This means we can free locked_page() and
>>> reuse it elsewhere. If we get really lucky, it'll have the same
>>> page->index in its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates a phantom dirty page, which
>>> isn't great but isn't the end of the world. With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups, but apart from that, does it have any other implications?
>
> Best case, it'll go through the writepage fixup worker and go through
> the whole cow machinery again.  Worst case we go to this code more
> than once:
>
> 		/*
> 		 * if page_started, cow_file_range inserted an
> 		 * inline extent and took care of all the unlocking
> 		 * and IO for us.  Otherwise, we need to submit
> 		 * all those pages down to the drive.
> 		 */
> 		if (!page_started && !ret)
> 			extent_write_locked_range(inode,
> 						  async_extent->start,
> 						  async_extent->start +
> 						  async_extent->ram_size - 1,
> 						  WB_SYNC_ALL);
> 		else if (ret)
> 			unlock_page(async_chunk->locked_page);
>
> That never happened in production as far as I can tell, but it seems
> possible.
>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make async_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>>  				SetPagePrivate2(pages[i]);
>>>
>>> -			if (pages[i] == locked_page) {
>>> +			if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) and clean it up, since if
>> __process_pages_contig is called from the owner of the page then it's
>> guaranteed that the page will fall within its range.
>
> I'm not convinced that every single caller of __process_pages_contig
> is making sure to only send locked_page for ranges that correspond to
> the locked_page.  I'm not sure exactly what you're asking for though,
> it looks like it would require some larger changes to the flow of that
> loop.

What I meant is to simply factor out the code dealing with the locked
page outside of the loop, while still placing it inside
__process_pages_contig.
Also, looking at the way locked_page is passed across different call
chains, I arrive at:

compress_file_range <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

submit_compressed_extents <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
 cow_file_range <-- [when called from btrfs_run_delalloc_range we are
 all fine and dandy because it always iterates a range which belongs to
 the page, so we can free the page and set it to NULL for subsequent
 passes of the loop.]

Looking at run_delalloc_nocow I see the page is used 5 times - 2 of
those, at the beginning and end of the function, are only used during
error cases. The other 2 times are if cow_start is different than -1,
which happens if !nocow is true. I've yet to wrap my head around
run_delalloc_nocow, but I think it should also be safe to pass the
locked page just once.

cow_file_range_async <-- always called with the correct locked page; in
this case the function is called before any async chunks are going to
be submitted.
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range <-- this one is called with locked_page
belonging to the passed delalloc range.
 run_delalloc_nocow
  extent_clear_unlock_delalloc
   __process_pages_contig

writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
called from the compress path, and the start range always belongs to
the page.
 find_lock_delalloc_range <-- if the range is not delalloc it will
 retry, but that function is also called with the correct page.
  lock_delalloc_pages <-- ignores range which belongs only to this page
  __unlock_for_delalloc <-- ignores range which belongs only to this page

<snip>
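For what it's worth, one possible shape of the factor-out Nikolay
suggests above, sketched as self-contained userspace C rather than
against the real __process_pages_contig() (the function and helper
names are stand-ins; as Chris notes, the real function would need
larger flow changes than this):

	/*
	 * Model of hoisting the locked-page special case out of the
	 * per-page loop: the bookkeeping for the owner's page lives in
	 * one place up front, and the loop merely skips that page.
	 */
	#include <stdio.h>
	#include <stddef.h>

	struct page { unsigned long index; };

	static void process_page(struct page *p)
	{
		printf("processed page %lu\n", p->index);
	}

	static int process_pages_contig(struct page **pages, int nr,
					struct page *locked_page)
	{
		int pages_locked = 0;

		/* Handle the owner's page once, up front... */
		if (locked_page)
			pages_locked++;	/* stand-in for put_page() accounting */

		/* ...so the loop only has to skip it, not account for it. */
		for (int i = 0; i < nr; i++) {
			if (pages[i] == locked_page)
				continue;
			process_page(pages[i]);
			pages_locked++;
		}
		return pages_locked;
	}

	int main(void)
	{
		struct page p0 = { 0 }, p1 = { 1 }, p2 = { 2 };
		struct page *pages[] = { &p0, &p1, &p2 };

		printf("locked: %d\n", process_pages_contig(pages, 3, &p1));
		return 0;
	}

Whether this is actually safe depends on the caller analysis above,
i.e. on every caller only passing locked_page for a range that really
contains it.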
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5106008f5e28..a31574df06aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (pages[i] == locked_page) {
+			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
 				pages_locked++;
 				continue;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6e6df0eab324..a81e9860ee1f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
 	 * to our extent and set things up for the async work queue to run
 	 * cow_file_range to do the normal delalloc dance.
 	 */
-	if (page_offset(async_chunk->locked_page) >= start &&
-	    page_offset(async_chunk->locked_page) <= end)
+	if (async_chunk->locked_page &&
+	    (page_offset(async_chunk->locked_page) >= start &&
+	     page_offset(async_chunk->locked_page) <= end)) {
 		__set_page_dirty_nobuffers(async_chunk->locked_page);
 		/* unlocked later on in the async handlers */
+	}
 
 	if (redirty)
 		extent_range_redirty_for_io(inode, start, end);
@@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 					  async_extent->start +
 					  async_extent->ram_size - 1,
 					  WB_SYNC_ALL);
-		else if (ret)
+		else if (ret && async_chunk->locked_page)
 			unlock_page(async_chunk->locked_page);
 		kfree(async_extent);
 		cond_resched();
@@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_chunk[i].inode = inode;
 		async_chunk[i].start = start;
 		async_chunk[i].end = cur_end;
-		async_chunk[i].locked_page = locked_page;
 		async_chunk[i].write_flags = write_flags;
 		INIT_LIST_HEAD(&async_chunk[i].extents);
 
+		/*
+		 * The locked_page comes all the way from writepage and its
+		 * the original page we were actually given.  As we spread
+		 * this large delalloc region across multiple async_cow
+		 * structs, only the first struct needs a pointer to locked_page
+		 *
+		 * This way we don't need racey decisions about who is supposed
+		 * to unlock it.
+		 */
+		if (locked_page) {
+			async_chunk[i].locked_page = locked_page;
+			locked_page = NULL;
+		} else {
+			async_chunk[i].locked_page = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,