Message ID | 1462802509-1974-1-git-send-email-fdmanana@kernel.org (mailing list archive)
---|---
State | Accepted
On 05/09/2016 07:01 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When we do a direct IO write against a preallocated extent (fallocate)
> that does not go beyond the i_size of the inode, we do the write operation
> without holding the inode's i_mutex (an optimization that landed in
> commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows
> for a very tiny time window where a race can happen with a concurrent
> fsync using the fast code path, as the direct IO write path first creates
> a new extent map (no longer flagged as a prealloc extent) and then
> creates the ordered extent, while the fast fsync path first collects
> ordered extents and then collects extent maps. This allows for the
> possibility of the fast fsync path collecting the new extent map without
> collecting the new ordered extent, and therefore logging an extent item
> based on the extent map without waiting for the ordered extent to be
> created and completed. This can result in a situation where, after a log
> replay, we end up with an extent that is no longer marked as prealloc but
> was only partially written (or not written at all), exposing random, stale
> or garbage data corresponding to the unwritten pages, without any
> checksums in the csum tree covering the extent's range.
>
> This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix
> race between fsync and lockless direct IO writes").
>
> So fix this by creating the ordered extent first and the extent map
> second, so that if the fast fsync path collects the new extent map it
> also collects the corresponding ordered extent.

So this seems to just be trading one problem for another bad behavior. Now instead we'll end up with an ordered extent but possibly not the extent map, which is fine, but seems ugly, and is a subtle and undocumented behavior that is going to bite us in the ass in the future.

I know everybody loves lockless writes, but we need to protect this particular case in an explicit way so I don't go change something in the future and forget about this dependency. What if we add a rwsem around adding the extent map and the ordered extent? In fact, we could pull this dance out into a common helper function, so everybody who wants to do this just calls a helper that is easy to audit and understand, and then we just put a

down_read(&BTRFS_I(inode)->lets_not_fuck_fsync);
/* Do our extent map + ordered extent dance here */
up_read(&BTRFS_I(inode)->lets_not_fuck_fsync);

around it, and where we gather all of this stuff in fsync we do a down_write/up_write. This won't affect the buffered write case currently, as we're still protected by the i_mutex, and it makes the direct IO case much cleaner and safer. Then, later on, when we want to remove the i_mutex from the buffered write case, we have one less gotcha to worry about. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 05/09/2016 07:01 AM, fdmanana@kernel.org wrote:
>> [...]
>
> [...] What if we add a rwsem around adding the extent map and the
> ordered extent? In fact, we could pull this dance out into a common
> helper function [...] and where we gather all of this stuff in fsync
> we do a down_write/up_write. [...]

Agreed. I thought about something similar initially but I didn't want to increase the size of struct btrfs_inode.
Would you mind if, instead of changing this patch, I do one on top that introduces that helper function, and makes this and the other place in the dio path that does the same logic use it too?

thanks
On 05/11/2016 08:56 AM, Filipe Manana wrote:
> On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <jbacik@fb.com> wrote:
>> [...]
>
> Agreed. I thought about something similar initially but I didn't want
> to increase the size of struct btrfs_inode.
> Would you mind if, instead of changing this patch, I do one on top
> that introduces that helper function, and makes this and the other
> place in the dio path that does the same logic use it too?

Yeah that's fine, however you want to do it. Thanks,

Josef
On Wed, May 11, 2016 at 04:56:56PM +0100, Filipe Manana wrote:
> On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <jbacik@fb.com> wrote:
>> [...]
>
> Agreed. I thought about something similar initially but I didn't want
> to increase the size of struct btrfs_inode.
> Would you mind if, instead of changing this patch, I do one on top
> that introduces that helper function, and makes this and the other
> place in the dio path that does the same logic use it too?

Once we go down the rwsem path, we might want to look at overall DIO locking and see where else it fits in. We can probably get rid of a few warts with it.

-chris
On 05/11/2016 09:05 AM, Chris Mason wrote:
> On Wed, May 11, 2016 at 04:56:56PM +0100, Filipe Manana wrote:
>> [...]
>
> Once we go down the rwsem path, we might want to look at overall DIO
> locking and see where else it fits in. We can probably get rid of a few
> warts with it.

I hesitate to suggest this, but long term I'd like to look at just using our range locking for all of this, and do something like (u64)-1 for the end when we are talking about the whole file. That would remove the need for a lot of this other locking stuff. But I imagine that'll cause more problems than it solves atm. Thanks,

Josef
On Wed, May 11, 2016 at 09:10:12AM -0700, Josef Bacik wrote:
> On 05/11/2016 09:05 AM, Chris Mason wrote:
>> [...]
>
> I hesitate to suggest this, but long term I'd like to look at just using our
> range locking for all of this, and do something like (u64)-1 for the end
> when we are talking about the whole file. That would remove the need for a
> lot of this other locking stuff. But I imagine that'll cause more problems
> than it solves atm. Thanks,

The last time I tried, I had regrets. But it's definitely worth picking up again.

-chris
On 05/11/2016 09:12 AM, Chris Mason wrote:
> On Wed, May 11, 2016 at 09:10:12AM -0700, Josef Bacik wrote:
>> [...]
>
> The last time I tried, I had regrets. But it's definitely worth picking
> up again.

Score, I have a new intern/Omar project.

Josef
On 05/09/2016 07:01 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> [...]
>
> So fix this by creating the ordered extent first and the extent map
> second, so that if the fast fsync path collects the new extent map it
> also collects the corresponding ordered extent.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c8d30ef..5372268 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7658,6 +7658,25 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 		if (can_nocow_extent(inode, start, &len, &orig_start,
 				     &orig_block_len, &ram_bytes) == 1) {
+
+			/*
+			 * Create the ordered extent before the extent map. This
+			 * is to avoid races with the fast fsync path because it
+			 * collects ordered extents into a local list and then
+			 * collects all the new extent maps, so we must create
+			 * the ordered extent first and make sure the fast fsync
+			 * path collects any new ordered extents after
+			 * collecting new extent maps as well. The fsync path
+			 * simply can not rely on inode_dio_wait() because it
+			 * causes deadlock with AIO.
+			 */
+			ret = btrfs_add_ordered_extent_dio(inode, start,
+					block_start, len, len, type);
+			if (ret) {
+				free_extent_map(em);
+				goto unlock_err;
+			}
+
 			if (type == BTRFS_ORDERED_PREALLOC) {
 				free_extent_map(em);
 				em = create_pinned_em(inode, start, len,
 						      orig_start,
@@ -7666,17 +7685,29 @@ static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 						      block_start, len,
 						      orig_block_len,
 						      ram_bytes, type);
 				if (IS_ERR(em)) {
+					struct btrfs_ordered_extent *oe;
+
 					ret = PTR_ERR(em);
+					oe = btrfs_lookup_ordered_extent(inode,
+									 start);
+					ASSERT(oe);
+					if (WARN_ON(!oe))
+						goto unlock_err;
+					set_bit(BTRFS_ORDERED_IOERR,
+						&oe->flags);
+					set_bit(BTRFS_ORDERED_IO_DONE,
+						&oe->flags);
+					btrfs_remove_ordered_extent(inode, oe);
+					/*
+					 * Once for our lookup and once for the
+					 * ordered extents tree.
+					 */
+					btrfs_put_ordered_extent(oe);
+					btrfs_put_ordered_extent(oe);
 					goto unlock_err;
 				}
 			}
-			ret = btrfs_add_ordered_extent_dio(inode, start,
-					block_start, len, len, type);
-			if (ret) {
-				free_extent_map(em);
-				goto unlock_err;
-			}
 			goto unlock;
 		}
 	}