[v4] ceph: invalidate pages when doing direct/sync writes

Message ID	20220407151521.7968-1-lhenriques@suse.de (mailing list archive)
State	New, archived
Headers	Return-Path: <ceph-devel-owner@kernel.org>
	From: Luís Henriques <lhenriques@suse.de>
	To: Jeff Layton <jlayton@kernel.org>, Xiubo Li <xiubli@redhat.com>, Ilya Dryomov <idryomov@gmail.com>
	Cc: ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org, Luís Henriques <lhenriques@suse.de>
	Subject: [PATCH v4] ceph: invalidate pages when doing direct/sync writes
	Date: Thu, 7 Apr 2022 16:15:21 +0100
	Message-Id: <20220407151521.7968-1-lhenriques@suse.de>
	MIME-Version: 1.0
	Content-Type: text/plain; charset=UTF-8
	Content-Transfer-Encoding: 8bit
	Precedence: bulk
Series	[v4] ceph: invalidate pages when doing direct/sync writes

Luis Henriques April 7, 2022, 3:15 p.m. UTC

When doing a direct/sync write, we need to invalidate the page cache in
the range being written to.  If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
---
 fs/ceph/file.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Changes since v3:
- Dropped initial call to invalidate_inode_pages2_range()
- Added extra comment to document invalidation

Changes since v2:
- Invalidation needs to be done after a write

Changes since v1:
- Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
- Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
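
For readers following the arithmetic in the patch: invalidate_inode_pages2_range() takes an inclusive range of page indices, so the byte range [pos, pos + len) has to be converted with a shift. Below is a minimal userspace sketch of that conversion. It is not kernel code; DEMO_PAGE_SHIFT, struct page_range and demo_page_range() are hypothetical names, and the 4 KiB page size is an assumption.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT == 12), as on most x86-64 kernels. */
#define DEMO_PAGE_SHIFT 12

struct page_range {
	uint64_t first;	/* index of the first page touched */
	uint64_t last;	/* index of the last page touched (inclusive) */
};

/*
 * Map the byte range [pos, pos + len) to inclusive page indices.
 * len must be greater than zero, otherwise "pos + len - 1" would
 * point at the byte before the range.
 */
static struct page_range demo_page_range(uint64_t pos, uint64_t len)
{
	struct page_range r;

	r.first = pos >> DEMO_PAGE_SHIFT;
	r.last = (pos + len - 1) >> DEMO_PAGE_SHIFT;
	return r;
}
```

With these assumptions, a 2-byte write at offset 4095 touches pages 0 and 1, while a 4096-byte write at offset 4096 touches only page 1.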

Xiubo Li April 7, 2022, 7:03 p.m. UTC | #1

On 4/7/22 11:15 PM, Luís Henriques wrote:
> When doing a direct/sync write, we need to invalidate the page cache in
> the range being written to.  If we don't do this, the cache will include
> invalid data as we just did a write that avoided the page cache.
>
> Signed-off-by: Luís Henriques <lhenriques@suse.de>
> ---
>   fs/ceph/file.c | 19 ++++++++++++++-----
>   1 file changed, 14 insertions(+), 5 deletions(-)
>
> Changes since v3:
> - Dropped initial call to invalidate_inode_pages2_range()
> - Added extra comment to document invalidation
>
> Changes since v2:
> - Invalidation needs to be done after a write
>
> Changes since v1:
> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5072570c2203..97f764b2fbdd 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   		return ret;
>   
>   	ceph_fscache_invalidate(inode, false);
> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> -					    pos >> PAGE_SHIFT,
> -					    (pos + count - 1) >> PAGE_SHIFT);
> -	if (ret < 0)
> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>   
>   	while ((len = iov_iter_count(from)) > 0) {
>   		size_t left;
> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   			break;
>   		}
>   		ceph_clear_error_write(ci);
> +
> +		/*
> +		 * we need to invalidate the page cache here, otherwise the
> +		 * cache will include invalid data in direct/sync writes.
> +		 */
> +		ret = invalidate_inode_pages2_range(

IMO we'd better use truncate_inode_pages_range() after the write. The call
above means it may write the dirty pagecache back, which would overwrite
and corrupt the data just written to disk.

Though it seems impossible that these pages will be marked dirty, isn't
this call misleading?

-- Xiubo

> +				inode->i_mapping,
> +				pos >> PAGE_SHIFT,
> +				(pos + len - 1) >> PAGE_SHIFT);
> +		if (ret < 0) {
> +			dout("invalidate_inode_pages2_range returned %d\n",
> +			     ret);
> +			ret = 0;
> +		}
>   		pos += len;
>   		written += len;
>   		dout("sync_write written %d\n", written);
>

Jeff Layton April 7, 2022, 7:16 p.m. UTC | #2

On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
> On 4/7/22 11:15 PM, Luís Henriques wrote:
> > When doing a direct/sync write, we need to invalidate the page cache in
> > the range being written to.  If we don't do this, the cache will include
> > invalid data as we just did a write that avoided the page cache.
> > 
> > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > ---
> >   fs/ceph/file.c | 19 ++++++++++++++-----
> >   1 file changed, 14 insertions(+), 5 deletions(-)
> > 
> > Changes since v3:
> > - Dropped initial call to invalidate_inode_pages2_range()
> > - Added extra comment to document invalidation
> > 
> > Changes since v2:
> > - Invalidation needs to be done after a write
> > 
> > Changes since v1:
> > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 5072570c2203..97f764b2fbdd 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   		return ret;
> >   
> >   	ceph_fscache_invalidate(inode, false);
> > -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > -					    pos >> PAGE_SHIFT,
> > -					    (pos + count - 1) >> PAGE_SHIFT);
> > -	if (ret < 0)
> > -		dout("invalidate_inode_pages2_range returned %d\n", ret);
> >   
> >   	while ((len = iov_iter_count(from)) > 0) {
> >   		size_t left;
> > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   			break;
> >   		}
> >   		ceph_clear_error_write(ci);
> > +
> > +		/*
> > +		 * we need to invalidate the page cache here, otherwise the
> > +		 * cache will include invalid data in direct/sync writes.
> > +		 */
> > +		ret = invalidate_inode_pages2_range(
> 
> IMO we'd better use truncate_inode_pages_range() after write. The above 
> means it's possibly will write the dirty pagecache back, which will 
> overwrite and corrupt the disk data just wrote.
> 

I disagree. We call filemap_write_and_wait_range at the start of this,
so any data that was dirty when we called write() will be written back
before the sync write.

If we truncate the range, then we'll potentially lose writes that came
in after write was issued but before truncate_inode_pages_range. I think
we'd rather let what we just wrote be clobbered in this situation than
lose a write altogether.

All of this is somewhat academic though. If you're mixing buffered and
direct writes like this without some sort of locking, then you're just
asking for trouble. The aim here is "sane behavior to the best of our
ability", but we can't expect it to always be sane when people do insane
things. ;)

> Though it seems impossible that these pagecaches will be marked dirty, 
> but this call is misleading ?
> 

Not impossible at all. You can open a file O_DIRECT and then mmap the fd
for PROT_WRITE (or just open the file a second time and do it).

We definitely recommend against mixing buffered and direct I/O, but
nothing really prevents someone from doing it. If the user is properly
using file locking, then there's really no reason it shouldn't work.
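
As an illustration of the "properly using file locking" case, here is a minimal userspace sketch in which a writer holds an advisory POSIX record lock on exactly the range it writes. The helper names demo_lock_range() and demo_locked_pwrite() are hypothetical. The sketch only helps if every cooperating writer (buffered, mmap or O_DIRECT) takes the same lock; nothing is enforced against processes that skip it.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Take or drop an advisory record lock on [off, off + len).
 * type is F_WRLCK, F_RDLCK or F_UNLCK. F_SETLKW blocks until the
 * lock is granted. len must be > 0 (0 would mean "to end of file").
 */
static int demo_lock_range(int fd, short type, off_t off, off_t len)
{
	struct flock fl = {
		.l_type = type,
		.l_whence = SEEK_SET,
		.l_start = off,
		.l_len = len,
	};

	return fcntl(fd, F_SETLKW, &fl);
}

/* Write buf at off while holding an exclusive lock on that range. */
static ssize_t demo_locked_pwrite(int fd, const void *buf, size_t len,
				  off_t off)
{
	ssize_t ret;

	if (demo_lock_range(fd, F_WRLCK, off, (off_t)len) < 0)
		return -1;
	ret = pwrite(fd, buf, len, off);
	/* Unlock even if the write failed, so other writers can proceed. */
	if (demo_lock_range(fd, F_UNLCK, off, (off_t)len) < 0)
		return -1;
	return ret;
}
```

A writer going through an mmap would presumably take the same lock, modify the mapping, and call msync() before unlocking; that part is left out of the sketch.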

> 
> > +				inode->i_mapping,
> > +				pos >> PAGE_SHIFT,
> > +				(pos + len - 1) >> PAGE_SHIFT);
> > +		if (ret < 0) {
> > +			dout("invalidate_inode_pages2_range returned %d\n",
> > +			     ret);
> > +			ret = 0;
> > +		}
> >   		pos += len;
> >   		written += len;
> >   		dout("sync_write written %d\n", written);
> > 
>

Xiubo Li April 7, 2022, 7:24 p.m. UTC | #3

On 4/8/22 3:16 AM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>> When doing a direct/sync write, we need to invalidate the page cache in
>>> the range being written to.  If we don't do this, the cache will include
>>> invalid data as we just did a write that avoided the page cache.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>> ---
>>>    fs/ceph/file.c | 19 ++++++++++++++-----
>>>    1 file changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Changes since v3:
>>> - Dropped initial call to invalidate_inode_pages2_range()
>>> - Added extra comment to document invalidation
>>>
>>> Changes since v2:
>>> - Invalidation needs to be done after a write
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..97f764b2fbdd 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    		return ret;
>>>    
>>>    	ceph_fscache_invalidate(inode, false);
>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> -					    pos >> PAGE_SHIFT,
>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>> -	if (ret < 0)
>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>    
>>>    	while ((len = iov_iter_count(from)) > 0) {
>>>    		size_t left;
>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    			break;
>>>    		}
>>>    		ceph_clear_error_write(ci);
>>> +
>>> +		/*
>>> +		 * we need to invalidate the page cache here, otherwise the
>>> +		 * cache will include invalid data in direct/sync writes.
>>> +		 */
>>> +		ret = invalidate_inode_pages2_range(
>> IMO we'd better use truncate_inode_pages_range() after write. The above
>> means it's possibly will write the dirty pagecache back, which will
>> overwrite and corrupt the disk data just wrote.
>>
> I disagree. We call filemap_write_and_wait_range at the start of this,
> so any data that was dirty when we called write() will be written back
> before the sync write.
>
> If we truncate the range, then we'll potentially lose writes that came
> in after write was issued but before truncate_inode_pages_range. I think
> we'd rather let what we just wrote be clobbered in this situation than
> lose a write altogether.
>
> All of this is somewhat academic though. If you're mixing buffered and
> direct writes like this without some sort of locking, then you're just
> asking for trouble. The aim here is "sane behavior to the best of our
> ability", but we can't expect it to always be sane when people do insane
> things. ;)

I meant just the case Luís hit: the mapping happens before the new data
is written, when the source buffer is read in copy_from_user(). So once
the write is done, the pagecache is caching stale contents.

-- Xiubo

>> Though it seems impossible that these pagecaches will be marked dirty,
>> but this call is misleading ?
>>
> Not impossible at all. You can open a file O_DIRECT and then mmap the fd
> for PROT_WRITE (or just open the file a second time and do it).
>
> We definitely recommend against mixing buffered and direct I/O, but
> nothing really prevents someone from doing it. If the user is properly
> using file locking, then there's really no reason it shouldn't work.
>
>>> +				inode->i_mapping,
>>> +				pos >> PAGE_SHIFT,
>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>> +		if (ret < 0) {
>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>> +			     ret);
>>> +			ret = 0;
>>> +		}
>>>    		pos += len;
>>>    		written += len;
>>>    		dout("sync_write written %d\n", written);
>>>

Jeff Layton April 7, 2022, 8:21 p.m. UTC | #4

On Fri, 2022-04-08 at 03:24 +0800, Xiubo Li wrote:
> On 4/8/22 3:16 AM, Jeff Layton wrote:
> > On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
> > > On 4/7/22 11:15 PM, Luís Henriques wrote:
> > > > When doing a direct/sync write, we need to invalidate the page cache in
> > > > the range being written to.  If we don't do this, the cache will include
> > > > invalid data as we just did a write that avoided the page cache.
> > > > 
> > > > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > > > ---
> > > >    fs/ceph/file.c | 19 ++++++++++++++-----
> > > >    1 file changed, 14 insertions(+), 5 deletions(-)
> > > > 
> > > > Changes since v3:
> > > > - Dropped initial call to invalidate_inode_pages2_range()
> > > > - Added extra comment to document invalidation
> > > > 
> > > > Changes since v2:
> > > > - Invalidation needs to be done after a write
> > > > 
> > > > Changes since v1:
> > > > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > > > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > > > 
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 5072570c2203..97f764b2fbdd 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > >    		return ret;
> > > >    
> > > >    	ceph_fscache_invalidate(inode, false);
> > > > -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > > -					    pos >> PAGE_SHIFT,
> > > > -					    (pos + count - 1) >> PAGE_SHIFT);
> > > > -	if (ret < 0)
> > > > -		dout("invalidate_inode_pages2_range returned %d\n", ret);
> > > >    
> > > >    	while ((len = iov_iter_count(from)) > 0) {
> > > >    		size_t left;
> > > > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> > > >    			break;
> > > >    		}
> > > >    		ceph_clear_error_write(ci);
> > > > +
> > > > +		/*
> > > > +		 * we need to invalidate the page cache here, otherwise the
> > > > +		 * cache will include invalid data in direct/sync writes.
> > > > +		 */
> > > > +		ret = invalidate_inode_pages2_range(
> > > IMO we'd better use truncate_inode_pages_range() after write. The above
> > > means it's possibly will write the dirty pagecache back, which will
> > > overwrite and corrupt the disk data just wrote.
> > > 
> > I disagree. We call filemap_write_and_wait_range at the start of this,
> > so any data that was dirty when we called write() will be written back
> > before the sync write.
> > 
> > If we truncate the range, then we'll potentially lose writes that came
> > in after write was issued but before truncate_inode_pages_range. I think
> > we'd rather let what we just wrote be clobbered in this situation than
> > lose a write altogether.
> > 
> > All of this is somewhat academic though. If you're mixing buffered and
> > direct writes like this without some sort of locking, then you're just
> > asking for trouble. The aim here is "sane behavior to the best of our
> > ability", but we can't expect it to always be sane when people do insane
> > things. ;)
> 
> Just in the case Luis hit. Before writing the new data the mapping 
> happen when reading the src in copy_from_usr(). So once the writing done 
> the pagecache is caching the stale contents.
> 

Not just in that case.

You could have 2 unrelated processes, one doing DIO writes and one doing
mmap writes. You're likely to end up with a mess unless you're very
careful with what you're doing, but there should be some expectation
that it will work if you serialize things correctly and/or have them
writing to their own areas of the file, etc.

In any case, we'll never get perfect cache coherency, and I figure that
until the write returns, what's in the pagecache ought to be considered
valid.

> > > Though it seems impossible that these pagecaches will be marked dirty,
> > > but this call is misleading ?
> > > 
> > Not impossible at all. You can open a file O_DIRECT and then mmap the fd
> > for PROT_WRITE (or just open the file a second time and do it).
> > 
> > We definitely recommend against mixing buffered and direct I/O, but
> > nothing really prevents someone from doing it. If the user is properly
> > using file locking, then there's really no reason it shouldn't work.
> > 
> > > > +				inode->i_mapping,
> > > > +				pos >> PAGE_SHIFT,
> > > > +				(pos + len - 1) >> PAGE_SHIFT);
> > > > +		if (ret < 0) {
> > > > +			dout("invalidate_inode_pages2_range returned %d\n",
> > > > +			     ret);
> > > > +			ret = 0;
> > > > +		}
> > > >    		pos += len;
> > > >    		written += len;
> > > >    		dout("sync_write written %d\n", written);
> > > > 
>

Xiubo Li April 7, 2022, 11:51 p.m. UTC | #5

On 4/8/22 4:21 AM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 03:24 +0800, Xiubo Li wrote:
>> On 4/8/22 3:16 AM, Jeff Layton wrote:
>>> On Fri, 2022-04-08 at 03:03 +0800, Xiubo Li wrote:
>>>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>>>> When doing a direct/sync write, we need to invalidate the page cache in
>>>>> the range being written to.  If we don't do this, the cache will include
>>>>> invalid data as we just did a write that avoided the page cache.
>>>>>
>>>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>>>> ---
>>>>>     fs/ceph/file.c | 19 ++++++++++++++-----
>>>>>     1 file changed, 14 insertions(+), 5 deletions(-)
>>>>>
>>>>> Changes since v3:
>>>>> - Dropped initial call to invalidate_inode_pages2_range()
>>>>> - Added extra comment to document invalidation
>>>>>
>>>>> Changes since v2:
>>>>> - Invalidation needs to be done after a write
>>>>>
>>>>> Changes since v1:
>>>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>>>
>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>> index 5072570c2203..97f764b2fbdd 100644
>>>>> --- a/fs/ceph/file.c
>>>>> +++ b/fs/ceph/file.c
>>>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>     		return ret;
>>>>>     
>>>>>     	ceph_fscache_invalidate(inode, false);
>>>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>>>> -					    pos >> PAGE_SHIFT,
>>>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>>>> -	if (ret < 0)
>>>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>>>     
>>>>>     	while ((len = iov_iter_count(from)) > 0) {
>>>>>     		size_t left;
>>>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>>>     			break;
>>>>>     		}
>>>>>     		ceph_clear_error_write(ci);
>>>>> +
>>>>> +		/*
>>>>> +		 * we need to invalidate the page cache here, otherwise the
>>>>> +		 * cache will include invalid data in direct/sync writes.
>>>>> +		 */
>>>>> +		ret = invalidate_inode_pages2_range(
>>>> IMO we'd better use truncate_inode_pages_range() after write. The above
>>>> means it's possibly will write the dirty pagecache back, which will
>>>> overwrite and corrupt the disk data just wrote.
>>>>
>>> I disagree. We call filemap_write_and_wait_range at the start of this,
>>> so any data that was dirty when we called write() will be written back
>>> before the sync write.
>>>
>>> If we truncate the range, then we'll potentially lose writes that came
>>> in after write was issued but before truncate_inode_pages_range. I think
>>> we'd rather let what we just wrote be clobbered in this situation than
>>> lose a write altogether.
>>>
>>> All of this is somewhat academic though. If you're mixing buffered and
>>> direct writes like this without some sort of locking, then you're just
>>> asking for trouble. The aim here is "sane behavior to the best of our
>>> ability", but we can't expect it to always be sane when people do insane
>>> things. ;)
>> Just in the case Luis hit. Before writing the new data the mapping
>> happen when reading the src in copy_from_usr(). So once the writing done
>> the pagecache is caching the stale contents.
>>
> Not just in that case.
>
> You could have 2 unrelated processes, one doing DIO writes and one doing
> mmap writes. You're likely to end up with a mess unless you're very
> careful with what you're doing, but there should be some expectation
> that it will work if you serialize things correctly and/or have them
> writing to their own areas of the file, etc.

For this case I checked the other users; they seem to do:

filemap_invalidate_lock(inode->i_mapping);

write pagecache back;

invalidate the mapping and drop the pages;

do the IOs;

filemap_invalidate_unlock(inode->i_mapping);

Holding filemap_invalidate_lock prevents a page fault from mapping the
pages again in the meantime.



> In any case, we'll never get perfect cache coherency, and I figure that
> until the write returns, what's in the pagecache ought to be considered
> valid.

Okay, I am okay with this.

My understanding was that we should make sure the pagecache is always
valid during the sync write, or, if the pagecache is going to be
revalidated, that other processes should be blocked from reading through
the mmap.

-- Xiubo
>>>> Though it seems impossible that these pagecaches will be marked dirty,
>>>> but this call is misleading ?
>>>>
>>> Not impossible at all. You can open a file O_DIRECT and then mmap the fd
>>> for PROT_WRITE (or just open the file a second time and do it).
>>>
>>> We definitely recommend against mixing buffered and direct I/O, but
>>> nothing really prevents someone from doing it. If the user is properly
>>> using file locking, then there's really no reason it shouldn't work.
>>>
>>>>> +				inode->i_mapping,
>>>>> +				pos >> PAGE_SHIFT,
>>>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>>>> +		if (ret < 0) {
>>>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>>>> +			     ret);
>>>>> +			ret = 0;
>>>>> +		}
>>>>>     		pos += len;
>>>>>     		written += len;
>>>>>     		dout("sync_write written %d\n", written);
>>>>>

Xiubo Li April 8, 2022, 2:47 a.m. UTC | #6

On 4/7/22 11:15 PM, Luís Henriques wrote:
> When doing a direct/sync write, we need to invalidate the page cache in
> the range being written to.  If we don't do this, the cache will include
> invalid data as we just did a write that avoided the page cache.
>
> Signed-off-by: Luís Henriques <lhenriques@suse.de>
> ---
>   fs/ceph/file.c | 19 ++++++++++++++-----
>   1 file changed, 14 insertions(+), 5 deletions(-)
>
> Changes since v3:
> - Dropped initial call to invalidate_inode_pages2_range()
> - Added extra comment to document invalidation
>
> Changes since v2:
> - Invalidation needs to be done after a write
>
> Changes since v1:
> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 5072570c2203..97f764b2fbdd 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   		return ret;
>   
>   	ceph_fscache_invalidate(inode, false);
> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> -					    pos >> PAGE_SHIFT,
> -					    (pos + count - 1) >> PAGE_SHIFT);
> -	if (ret < 0)
> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>   
>   	while ((len = iov_iter_count(from)) > 0) {
>   		size_t left;
> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>   			break;
>   		}
>   		ceph_clear_error_write(ci);
> +
> +		/*
> +		 * we need to invalidate the page cache here, otherwise the
> +		 * cache will include invalid data in direct/sync writes.
> +		 */
> +		ret = invalidate_inode_pages2_range(
> +				inode->i_mapping,
> +				pos >> PAGE_SHIFT,
> +				(pos + len - 1) >> PAGE_SHIFT);
> +		if (ret < 0) {
> +			dout("invalidate_inode_pages2_range returned %d\n",
> +			     ret);
> +			ret = 0;

IMO this is not safe. If we just ignore the error, the pagecache will
still hold invalid data.

I think what 'ceph_direct_read_write()' does is more correct: it first
uses 'invalidate_inode_pages2_range()', without blocking, to make sure all
the dirty pages are written back from the pagecache, and later, if that
call returned EBUSY because some pages could not be unmapped, it does a
blocking invalidation with 'truncate_inode_pages_range()'.

That way we can always be sure the pagecache has no invalid data after the
write finishes. I think it uses the truncate helper there because that is
safe, as there shouldn't be any buffered writes happening for DIO?

But from my understanding 'ceph_direct_read_write()' is still buggy.
What if a page fault happens just after 'truncate_inode_pages_range()'?
Can that happen? Should we leave it to user space to guarantee this with
file locks?

Thoughts?

-- Xiubo

> +		}
>   		pos += len;
>   		written += len;
>   		dout("sync_write written %d\n", written);
>

Jeff Layton April 8, 2022, 12:04 p.m. UTC | #7

On Fri, 2022-04-08 at 10:47 +0800, Xiubo Li wrote:
> On 4/7/22 11:15 PM, Luís Henriques wrote:
> > When doing a direct/sync write, we need to invalidate the page cache in
> > the range being written to.  If we don't do this, the cache will include
> > invalid data as we just did a write that avoided the page cache.
> > 
> > Signed-off-by: Luís Henriques <lhenriques@suse.de>
> > ---
> >   fs/ceph/file.c | 19 ++++++++++++++-----
> >   1 file changed, 14 insertions(+), 5 deletions(-)
> > 
> > Changes since v3:
> > - Dropped initial call to invalidate_inode_pages2_range()
> > - Added extra comment to document invalidation
> > 
> > Changes since v2:
> > - Invalidation needs to be done after a write
> > 
> > Changes since v1:
> > - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
> > - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 5072570c2203..97f764b2fbdd 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   		return ret;
> >   
> >   	ceph_fscache_invalidate(inode, false);
> > -	ret = invalidate_inode_pages2_range(inode->i_mapping,
> > -					    pos >> PAGE_SHIFT,
> > -					    (pos + count - 1) >> PAGE_SHIFT);
> > -	if (ret < 0)
> > -		dout("invalidate_inode_pages2_range returned %d\n", ret);
> >   
> >   	while ((len = iov_iter_count(from)) > 0) {
> >   		size_t left;
> > @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
> >   			break;
> >   		}
> >   		ceph_clear_error_write(ci);
> > +
> > +		/*
> > +		 * we need to invalidate the page cache here, otherwise the
> > +		 * cache will include invalid data in direct/sync writes.
> > +		 */
> > +		ret = invalidate_inode_pages2_range(
> > +				inode->i_mapping,
> > +				pos >> PAGE_SHIFT,
> > +				(pos + len - 1) >> PAGE_SHIFT);
> > +		if (ret < 0) {
> > +			dout("invalidate_inode_pages2_range returned %d\n",
> > +			     ret);
> > +			ret = 0;
> 
> For this, IMO it's not safe. If we just ignore it the pagecache will 
> still have invalid data.
> 

That data is not invalid. It's dirty data from a buffered write that
raced with the DIO/sync write we're handling here. i.e. another write
came in while we were already processing the DIO write, but after the
point where we called filemap_write_and_wait.

When two write() calls to the same data are racing like that, the
outcome is undefined. We can't be certain which one will prevail as the
kernel could handle them in either order.

The good news with Ceph/RADOS is that you shouldn't end up with a torn
write, unless the write happens to span multiple objects. Not much we
can do about that though.
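
To make the "spans multiple objects" condition concrete, here is a small userspace sketch that checks whether a write crosses an object boundary. It assumes a simple file layout with 4 MiB objects and no striping across objects; both are assumptions of the sketch, and a file with a different layout maps offsets to objects differently. DEMO_OBJECT_SIZE and demo_spans_objects() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 MiB objects and stripe_count == 1 (no striping). */
#define DEMO_OBJECT_SIZE (4ULL << 20)

/*
 * Return true if the byte range [pos, pos + len) touches more than one
 * object under the layout assumed above. len must be greater than zero.
 */
static bool demo_spans_objects(uint64_t pos, uint64_t len)
{
	uint64_t first = pos / DEMO_OBJECT_SIZE;
	uint64_t last = (pos + len - 1) / DEMO_OBJECT_SIZE;

	return first != last;
}
```

Under these assumptions a 2-byte write at offset 4194303 crosses from the first object into the second, so it is the kind of write that could end up torn, whereas any write that stays within one 4 MiB-aligned window would not.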

> I think what the 'ceph_direct_read_write()' does is more correct, it 
> will make sure all the dirty pages are writeback from the pagecaches by 
> using 'invalidate_inode_pages2_range()' without blocking and later will 
> do the invalidate blocked by using 'truncate_inode_pages_range()' if 
> some pages are not unmaped in 'invalidate_inode_pages2_range()' when EBUSY.
> 

I'm not convinced this is any better, and it's attempting to impose a
deterministic outcome on a situation that is non-deterministic by
nature.

> This can always be sure that the pagecache has no invalid data after 
> write finishes. I think why it use the truncate helper here is because 
> it's safe and there shouldn't have any buffer write happen for DIO ?
> 
> But from my understanding the 'ceph_direct_read_write()' is still buggy. 
> What if the page fault happen just after 'truncate_inode_pages_range()' 
> ? Will this happen ? Should we leave this to use the file lock to 
> guarantee it in user space ?
> 
> Thought ?

Again, we can't really predict the outcome of two racing writes to the
same area, so I don't see that there is a problem.

Xiubo Li April 10, 2022, 1:35 a.m. UTC | #8

On 4/8/22 8:04 PM, Jeff Layton wrote:
> On Fri, 2022-04-08 at 10:47 +0800, Xiubo Li wrote:
>> On 4/7/22 11:15 PM, Luís Henriques wrote:
>>> When doing a direct/sync write, we need to invalidate the page cache in
>>> the range being written to.  If we don't do this, the cache will include
>>> invalid data as we just did a write that avoided the page cache.
>>>
>>> Signed-off-by: Luís Henriques <lhenriques@suse.de>
>>> ---
>>>    fs/ceph/file.c | 19 ++++++++++++++-----
>>>    1 file changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Changes since v3:
>>> - Dropped initial call to invalidate_inode_pages2_range()
>>> - Added extra comment to document invalidation
>>>
>>> Changes since v2:
>>> - Invalidation needs to be done after a write
>>>
>>> Changes since v1:
>>> - Replaced truncate_inode_pages_range() by invalidate_inode_pages2_range
>>> - Call fscache_invalidate with FSCACHE_INVAL_DIO_WRITE if we're doing DIO
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 5072570c2203..97f764b2fbdd 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -1606,11 +1606,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    		return ret;
>>>    
>>>    	ceph_fscache_invalidate(inode, false);
>>> -	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> -					    pos >> PAGE_SHIFT,
>>> -					    (pos + count - 1) >> PAGE_SHIFT);
>>> -	if (ret < 0)
>>> -		dout("invalidate_inode_pages2_range returned %d\n", ret);
>>>    
>>>    	while ((len = iov_iter_count(from)) > 0) {
>>>    		size_t left;
>>> @@ -1938,6 +1933,20 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
>>>    			break;
>>>    		}
>>>    		ceph_clear_error_write(ci);
>>> +
>>> +		/*
>>> +		 * we need to invalidate the page cache here, otherwise the
>>> +		 * cache will include invalid data in direct/sync writes.
>>> +		 */
>>> +		ret = invalidate_inode_pages2_range(
>>> +				inode->i_mapping,
>>> +				pos >> PAGE_SHIFT,
>>> +				(pos + len - 1) >> PAGE_SHIFT);
>>> +		if (ret < 0) {
>>> +			dout("invalidate_inode_pages2_range returned %d\n",
>>> +			     ret);
>>> +			ret = 0;
>> For this, IMO it's not safe. If we just ignore it the pagecache will
>> still have invalid data.
>>
> That data is not invalid. It's dirty data from a buffered write that
> raced with the DIO/sync write we're handling here. i.e. another write
> came in while we were already processing the DIO write, but after the
> point where we called filemap_write_and_wait.
>
> When two write() calls to the same data are racing like that, the
> outcome is undefined. We can't be certain which one will prevail as the
> kernel could handle them in either order.

Okay, I think you are right.

-- Xiubo

>
> The good news with Ceph/RADOS is that you shouldn't end up with a torn
> write, unless the write happens to span multiple objects. Not much we
> can do about that though.
>
>> I think what the 'ceph_direct_read_write()' does is more correct, it
>> will make sure all the dirty pages are writeback from the pagecaches by
>> using 'invalidate_inode_pages2_range()' without blocking and later will
>> do the invalidate blocked by using 'truncate_inode_pages_range()' if
>> some pages are not unmaped in 'invalidate_inode_pages2_range()' when EBUSY.
>>
> I'm not convinced this is any better, and it's attempting to impose a
> deterministic outcome on a situation that is non-deterministic by
> nature.
>
>> This can always be sure that the pagecache has no invalid data after
>> write finishes. I think why it use the truncate helper here is because
>> it's safe and there shouldn't have any buffer write happen for DIO ?
>>
>> But from my understanding the 'ceph_direct_read_write()' is still buggy.
>> What if the page fault happen just after 'truncate_inode_pages_range()'
>> ? Will this happen ? Should we leave this to use the file lock to
>> guarantee it in user space ?
>>
>> Thought ?
> Again, we can't really predict what the outcome of two racing writes to
> the same area will do, so I don't see that there is a problem.
>
