[RFC,v2] ceph: ceph: fix out-of-bound array access when doing a file read

Message ID 20240905135700.16394-1-luis.henriques@linux.dev (mailing list archive)
State New, archived

Commit Message

Luis Henriques Sept. 5, 2024, 1:57 p.m. UTC
__ceph_sync_read() does not correctly handle reads when the inode size is
zero.  It is easy to hit a NULL pointer dereference by continuously reading
a file while, on another client, we keep truncating and writing new data
into it.

The NULL pointer dereference happens when the inode size is zero but the
read op returns some data (ceph_osdc_wait_request()).  This will lead to
'left' being set to a huge value due to the overflow in:

	left = i_size - off;

and, in the loop that follows, the pages[] array being accessed beyond
num_pages.

This patch fixes the issue simply by checking the inode size and returning
if it is zero, even if there was data from the read op.

Link: https://tracker.ceph.com/issues/67524
Fixes: 1065da21e5df ("ceph: stop copying to iter at EOF on sync reads")
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
---
 fs/ceph/file.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
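
The wraparound described in the commit message can be reproduced outside the kernel. The following is only a userspace sketch, not the kernel code: the function name is hypothetical, and the types (64-bit unsigned 'off' and 'i_size', signed 'ret', unsigned 'left') are assumptions chosen to mirror the description above.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the clamp that computes 'left'.
 * Assumed types: off and i_size are unsigned 64-bit, ret is signed,
 * and left is unsigned, as in the commit message.
 */
static uint64_t bytes_left_unsigned(uint64_t off, int64_t ret, uint64_t i_size)
{
	uint64_t left;

	if (ret <= 0)
		left = 0;
	else if (off + (uint64_t)ret > i_size)
		left = i_size - off;	/* wraps around when i_size < off */
	else
		left = (uint64_t)ret;

	return left;
}
```

With a non-zero 'off', a positive 'ret' and an 'i_size' of zero, the subtraction wraps to a value close to the maximum of the unsigned type, so a loop guarded by 'left > 0' would run far past the pages that were actually allocated.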

Comments

Xiubo Li Sept. 6, 2024, 11:17 a.m. UTC | #1
On 9/5/24 21:57, Luis Henriques (SUSE) wrote:
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4b8d59ebda00..41d4eac128bb 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>   	if (ceph_inode_is_shutdown(inode))
>   		return -EIO;
>   
> -	if (!len)
> +	if (!len || !i_size)
>   		return 0;
>   	/*
>   	 * flush any page cache pages in this range.  this
> @@ -1154,6 +1154,9 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>   		doutc(cl, "%llu~%llu got %zd i_size %llu%s\n", off, len,
>   		      ret, i_size, (more ? " MORE" : ""));
>   
> +		if (i_size == 0)
> +			ret = 0;
> +
>   		/* Fix it to go to end of extent map */
>   		if (sparse && ret >= 0)
>   			ret = ceph_sparse_ext_map_end(op);
>
Hi Luis,

BTW, so in the following code:

1202                 idx = 0;
1203                 if (ret <= 0)
1204                         left = 0;
1205                 else if (off + ret > i_size)
1206                         left = i_size - off;
1207                 else
1208                         left = ret;

The 'ret' should be larger than '0', right?

If so, why don't we check and fix it in the 'else if' branch instead?

Because currently the read path code won't exit directly; it keeps
retrying the read if it finds that the real content length is longer than
the local 'i_size'.

Again, I am afraid your current fix may break the MIX filelock semantics.

Thanks

- Xiubo
Luis Henriques Sept. 6, 2024, 11:30 a.m. UTC | #2
On Fri, Sep 06 2024, Xiubo Li wrote:

> Hi Luis,
>
> BTW, so in the following code:
>
> 1202                 idx = 0;
> 1203                 if (ret <= 0)
> 1204                         left = 0;
> 1205                 else if (off + ret > i_size)
> 1206                         left = i_size - off;
> 1207                 else
> 1208                         left = ret;
>
> The 'ret' should be larger than '0', right ?

Right.  (Which means we read something from the file.)

> If so, why don't we check and fix it in the 'else if' branch instead?

Yes, and then we'll have:

	left = i_size - off;

and, because 'i_size' is 0, 'left' will be set to 0xffffffffff...
And the loop that follows:

	while (left > 0) {
		...
	}

will keep looping until we hit a NULL pointer dereference.  Have you tried
the reproducer?

Cheers,
Xiubo Li Sept. 6, 2024, 12:48 p.m. UTC | #3
On 9/6/24 19:30, Luis Henriques wrote:
> And the loop that follows:
>
> 	while (left > 0) {
>          	...
>          }
>
> will keep looping until we get a NULL pointer.  Have you tried the
> reproducer?

Hi Luis,

Not yet. I haven't had a chance to do that recently, for the reason
you know.

Thanks

- Xiubo


Luis Henriques Sept. 30, 2024, 3:30 p.m. UTC | #4
On Fri, Sep 06 2024, Xiubo Li wrote:

> Hi Luis,
>
> Not yet. I haven't had a chance to do that recently, for the reason you
> know.

Hi Xiubo,

I know you've been busy, but I was wondering if you (or someone else) had
a chance to have a look at this.  It's pretty easy to reproduce, and it
has been seen in production.  Any chances of getting some more feedback on
this fix?

Cheers,
Luis Henriques Nov. 4, 2024, 2:34 p.m. UTC | #5
Hi Xiubo, Hi Ilya,

On Mon, Sep 30 2024, Luis Henriques wrote:
[...]
> Hi Xiubo,
>
> I know you've been busy, but I was wondering if you (or someone else) had
> a chance to have a look at this.  It's pretty easy to reproduce, and it
> has been seen in production.  Any chances of getting some more feedback on
> this fix?

It has been a while since I first reported this issue.  Taking the risk of
being "that annoying guy", I'd like to ping you again on this.  I've
managed to reproduce the issue very easily, and it's also being triggered
very frequently in production.  Any news?

Cheers,
Xiubo Li Nov. 5, 2024, 1:10 a.m. UTC | #6
CC Alex

Hi Luis,

Alex will take this over and help push it forward soon. I am a bit busy
with new things these days.

BTW, if possible, please join the #cephfs channel in the 'ceph' Slack
workspace; you could push it along faster there.

Thanks

- Xiubo


On 2024/11/4 22:34, Luis Henriques wrote:
> Hi Xiubo, Hi Ilya,
>
> On Mon, Sep 30 2024, Luis Henriques wrote:
> [...]
>> Hi Xiubo,
>>
>> I know you've been busy, but I was wondering if you (or someone else) had
>> a chance to have a look at this.  It's pretty easy to reproduce, and it
>> has been seen in production.  Any chances of getting some more feedback on
>> this fix?
> It has been a while since I first reported this issue.  Taking the risk of
> being "that annoying guy", I'd like to ping you again on this.  I've
> managed to reproduce the issue very easily, and it's also being triggered
> very frequently in production.  Any news?
>
> Cheers,
Luis Henriques Nov. 5, 2024, 9:21 a.m. UTC | #7
Hi Xiubo!

On Tue, Nov 05 2024, Xiubo Li wrote:

> CC Alex
>
> Hi Luis,
>
> Alex will take this over and help push it forward soon. I am a bit busy with
> new things these days.

Thanks a lot.  I think the difficult bit to understand (for me, at least!)
is whether there are any MDS side-effects, as you mentioned the file-locking
semantics earlier.  I'm not sure if/how this patch may cause trouble there.

> BTW, if possible, please join the #cephfs channel in the 'ceph' Slack
> workspace; you could push it along faster there.

I believe that channel is bridged to IRC (OFTC network), where I'm already
lurking (nick 'henrix').  And I see you have already ping'ed others there.
However, I'm currently on PTO, so my replies there may be asynchronous :-)

Cheers,
Goldwyn Rodrigues Nov. 6, 2024, 8:40 p.m. UTC | #8
Hi Xiubo,

> BTW, so in the following code:
> 
> 1202                 idx = 0;
> 1203                 if (ret <= 0)
> 1204                         left = 0;
> 1205                 else if (off + ret > i_size)
> 1206                         left = i_size - off;
> 1207                 else
> 1208                         left = ret;
> 
> The 'ret' should be larger than '0', right?
>
> If so, why don't we check and fix it in the 'else if' branch instead?
>
> Because currently the read path code won't exit directly; it keeps
> retrying the read if it finds that the real content length is longer than
> the local 'i_size'.
>
> Again, I am afraid your current fix may break the MIX filelock semantics.

Do you think changing left to ssize_t instead of size_t will
fix the problem?


diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4b8d59ebda00..f8955773bdd7 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	if (ceph_inode_is_shutdown(inode))
 		return -EIO;
 
-	if (!len)
+	if (!len || !i_size)
 		return 0;
 	/*
 	 * flush any page cache pages in this range.  this
@@ -1087,7 +1087,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 		size_t page_off;
 		bool more;
 		int idx;
-		size_t left;
+		ssize_t left;
 		struct ceph_osd_req_op *op;
 		u64 read_off = off;
 		u64 read_len = len;
Luis Henriques Nov. 7, 2024, 11:09 a.m. UTC | #9
(CC'ing Alex)

On Wed, Nov 06 2024, Goldwyn Rodrigues wrote:

> Hi Xiubo,
>
> Do you think changing left to ssize_t instead of size_t will
> fix the problem?
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4b8d59ebda00..f8955773bdd7 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  	if (ceph_inode_is_shutdown(inode))
>  		return -EIO;
>  
> -	if (!len)
> +	if (!len || !i_size)
>  		return 0;
>  	/*
>  	 * flush any page cache pages in this range.  this
> @@ -1087,7 +1087,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  		size_t page_off;
>  		bool more;
>  		int idx;
> -		size_t left;
> +		ssize_t left;
>  		struct ceph_osd_req_op *op;
>  		u64 read_off = off;
>  		u64 read_len = len;
>

I *think* (although I haven't tested it) that your patch should work as
well.  But I also think it's a bit more hacky: the overflow will still be
there:

		if (ret <= 0)
			left = 0;
		else if (off + ret > i_size)
			left = i_size - off;
		else
			left = ret;
		while (left > 0) {
			// ...
		}

If 'i_size' is '0', 'left' (which is now signed) will now have a negative
value in the 'else if' branch and the loop that follows will not be
executed.  My version will simply set 'ret' to '0' before this 'if'
construct.
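
To illustrate the point about the signed variant, here is a small userspace sketch. It is not the kernel code: the function names are hypothetical, the types are assumptions, and the negative result relies on the usual two's-complement conversion that common compilers implement.

```c
#include <stdint.h>

/*
 * Hypothetical model of the clamp with a signed 'left'.
 * Assumed types: off and i_size unsigned 64-bit, ret and left signed.
 */
static int64_t bytes_left_signed(uint64_t off, int64_t ret, uint64_t i_size)
{
	int64_t left;

	if (ret <= 0)
		left = 0;
	else if (off + (uint64_t)ret > i_size)
		/*
		 * The unsigned subtraction still wraps; converting the
		 * result to a signed type turns it into a negative value
		 * on two's-complement targets.
		 */
		left = (int64_t)(i_size - off);
	else
		left = ret;

	return left;
}

/* Mirrors the 'while (left > 0)' loop condition. */
static int copy_loop_would_run(int64_t left)
{
	return left > 0;
}
```

In this model the loop is skipped because 'left' ends up negative, not because the subtraction was avoided, which is why the wraparound is still there.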

So, in my opinion, what needs to be figured out is whether this will cause
problems on the MDS side or not.  Because on the kernel client, it should
be safe to ignore reads to an inode that has size set to '0', even if
there's already data available to be read.  Eventually, the inode metadata
will get updated and by then we can retry the read.

Unfortunately, the MDS continues to be a huge black box for me and the
locking code in particular is very tricky.  I'd rather defer this to
anyone who is familiar with the code.

Cheers,
Alex Markuze Nov. 27, 2024, 1:47 p.m. UTC | #10
Hi, Folks.
AFAIK there is no side effect that can affect MDS with this fix.
This crash started happening after commit
1065da21e5df9d843d2c5165d5d576be000142a6 ("ceph: stop copying to iter
at EOF on sync reads").

Luis, your fix seems to address only the case where i_size goes to
zero, but the problem can happen any time `i_size` drops below `off`.
I propose fixing it this way:

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4b8d59ebda00..19b084212fee 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
        if (ceph_inode_is_shutdown(inode))
                return -EIO;

-       if (!len)
+       if (!len || !i_size)
                return 0;
        /*
         * flush any page cache pages in this range.  this
@@ -1200,12 +1200,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                }

                idx = 0;
-               if (ret <= 0)
-                       left = 0;
-               else if (off + ret > i_size)
-                       left = i_size - off;
+               if (off + ret > i_size)
+                       left = (i_size > off) ? i_size - off : 0;
                else
-                       left = ret;
+                       left = (ret > 0) ? ret : 0;
+
                while (left > 0) {
                        size_t plen, copied;


Luis Henriques Nov. 28, 2024, 5:42 p.m. UTC | #11
Hi Alex,

[ Thank you for looking into this. ]

On Wed, Nov 27 2024, Alex Markuze wrote:

> Hi, Folks.
> AFAIK there is no side effect that can affect MDS with this fix.
> This crash started happening after commit
> 1065da21e5df9d843d2c5165d5d576be000142a6 ("ceph: stop copying to iter
> at EOF on sync reads").
>
> Luis, your fix seems to address only the case where i_size goes to
> zero, but the problem can happen any time `i_size` drops below `off`.
> I propose fixing it this way:

Hmm... you're probably right.  I didn't see this happening, but I guess it
could indeed happen.

> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4b8d59ebda00..19b084212fee 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>         if (ceph_inode_is_shutdown(inode))
>                 return -EIO;
>
> -       if (!len)
> +       if (!len || !i_size)
>                 return 0;
>         /*
>          * flush any page cache pages in this range.  this
> @@ -1200,12 +1200,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>                 }
>
>                 idx = 0;
> -               if (ret <= 0)
> -                       left = 0;

Right now I don't have any means for testing this patch.  However, I don't
think this is completely correct.  By removing the above condition you're
discarding cases where an error has occurred (i.e. where ret is negative).

Why not simply modify my patch and do:

		if (i_size < off)
			ret = 0;

instead of:
		if (i_size == 0)
			ret = 0;

?

(Again, totally untested!)
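
As a rough illustration of the idea (equally untested against a real cluster), a userspace model of that guard placed in front of the existing clamp could look like the sketch below. The function name is hypothetical and the types are assumptions; the model only tracks how 'left' is computed, not what else 'ret' is used for.

```c
#include <stdint.h>

/*
 * Hypothetical model: the suggested 'i_size < off' guard followed by the
 * unchanged clamp.  Assumed types: off and i_size unsigned 64-bit,
 * ret signed, left unsigned.
 */
static uint64_t bytes_left_guarded(uint64_t off, int64_t ret, uint64_t i_size)
{
	uint64_t left;

	/* Suggested guard: nothing to copy once the file ends before 'off'. */
	if (i_size < off)
		ret = 0;

	/* Existing clamp, unchanged. */
	if (ret <= 0)
		left = 0;
	else if (off + (uint64_t)ret > i_size)
		left = i_size - off;	/* i_size >= off here, so no wraparound */
	else
		left = (uint64_t)ret;

	return left;
}
```

In this model the subtraction is only reached when 'i_size' is at least 'off', so 'left' can no longer wrap.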

Cheers,
Alex Markuze Nov. 28, 2024, 6:19 p.m. UTC | #12
I didn't discard it though :).
I folded it into the `if` statement. I find the if else construct
overly verbose and cumbersome.

+                       left = (ret > 0) ? ret : 0;

Luis Henriques Nov. 28, 2024, 6:52 p.m. UTC | #13
Hi!

On Thu, Nov 28 2024, Alex Markuze wrote:
> On Thu, Nov 28, 2024 at 7:43 PM Luis Henriques <luis.henriques@linux.dev> wrote:
>>
>> Hi Alex,
>>
>> [ Thank you for looking into this. ]
>>
>> On Wed, Nov 27 2024, Alex Markuze wrote:
>>
>> > Hi, Folks.
>> > AFAIK there is no side effect that can affect MDS with this fix.
>> > This crash happens following this patch
>> > "1065da21e5df9d843d2c5165d5d576be000142a6" "ceph: stop copying to iter
>> > at EOF on sync reads".
>> >
>> > Luis, your fix seems to address only the case where i_size goes to zero,
>> > but the overflow can happen any time `i_size` drops below `off`.
>> > I propose fixing it this way:
>>
>> Hmm... you're probably right.  I didn't see this happening, but I guess it
>> could indeed happen.
>>
>> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> > index 4b8d59ebda00..19b084212fee 100644
>> > --- a/fs/ceph/file.c
>> > +++ b/fs/ceph/file.c
>> > @@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>> >         if (ceph_inode_is_shutdown(inode))
>> >                 return -EIO;
>> >
>> > -       if (!len)
>> > +       if (!len || !i_size)
>> >                 return 0;
>> >         /*
>> >          * flush any page cache pages in this range.  this
>> > @@ -1200,12 +1200,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>> >                 }
>> >
>> >                 idx = 0;
>> > -               if (ret <= 0)
>> > -                       left = 0;
>>
>> Right now I don't have any means for testing this patch.  However, I don't
>> think this is completely correct.  By removing the above condition you're
>> discarding cases where an error has occurred (i.e. where ret is negative).
>
> I didn't discard it though :).
> I folded it into the `if` statement. I find the if else construct
> overly verbose and cumbersome.
>
> +                       left = (ret > 0) ? ret : 0;
>

Right, but with your patch, if 'ret < 0', we could still hit the first
branch instead of that one:

		if (off + ret > i_size)
			left = (i_size > off) ? i_size - off : 0;
		else
			left = (ret > 0) ? ret : 0;

Cheers,
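
The objection above can be reproduced in userspace. The following is only a sketch: left_two_branch() is a hypothetical helper, and the types are assumed from the doutc() format string in the patch ('off' and 'i_size' unsigned 64-bit, 'ret' signed). Because 'off + ret' is evaluated as unsigned, a negative 'ret' whose magnitude exceeds 'off' wraps to a huge value and selects the first branch.

```c
#include <stdint.h>

/*
 * Model of the two-branch construct quoted above.
 * The (uint64_t) cast mirrors the implicit conversion that happens
 * when a signed 'ret' is added to an unsigned 'off'.
 */
static uint64_t left_two_branch(uint64_t off, int64_t ret, uint64_t i_size)
{
	if (off + (uint64_t)ret > i_size)
		return (i_size > off) ? i_size - off : 0;
	return (ret > 0) ? (uint64_t)ret : 0;
}
```

For example, off = 0, ret = -5 (-EIO) and i_size = 100 gives left = 100, so the copy loop would run over pages that were never filled by a successful read.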
Alex Markuze Nov. 28, 2024, 7:09 p.m. UTC | #14
Good catch. I'm reworking the ergonomics of this function: the ret error
code is carried through the loop and checked every other line.

Alex Markuze Nov. 28, 2024, 7:31 p.m. UTC | #15
This patch does three things:

1. The allocated pages are bound to the request, simplifying the
memory management especially on the bad path.
2. ret is checked at the earliest point instead of being carried
through the loop.
3. The overflow bug is fixed.

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4b8d59ebda00..9522d5218c04 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1066,7 +1066,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
        if (ceph_inode_is_shutdown(inode))
                return -EIO;

-       if (!len)
+       if (!len || !i_size)
                return 0;
        /*
         * flush any page cache pages in this range.  this
@@ -1086,7 +1086,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                int num_pages;
                size_t page_off;
                bool more;
-               int idx;
+               int idx = 0;
                size_t left;
                struct ceph_osd_req_op *op;
                u64 read_off = off;
@@ -1127,7 +1127,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,

                osd_req_op_extent_osd_data_pages(req, 0, pages, read_len,
                                                 offset_in_page(read_off),
-                                                false, false);
+                                                false, true);

                op = &req->r_ops[0];
                if (sparse) {
@@ -1160,7 +1160,15 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                else if (ret == -ENOENT)
                        ret = 0;

-               if (ret > 0 && IS_ENCRYPTED(inode)) {
+               if (ret < 0) {
+                       ceph_osdc_put_request(req);
+
+                       if (ret == -EBLOCKLISTED)
+                               fsc->blocklisted = true;
+                       break;
+               }
+
+               if (IS_ENCRYPTED(inode)) {
                        int fret;

                        fret = ceph_fscrypt_decrypt_extents(inode, pages,
@@ -1186,10 +1194,8 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                        ret = min_t(ssize_t, fret, len);
                }

-               ceph_osdc_put_request(req);
-
                /* Short read but not EOF? Zero out the remainder. */
-               if (ret >= 0 && ret < len && (off + ret < i_size)) {
+               if (ret < len && (off + ret < i_size)) {
                        int zlen = min(len - ret, i_size - off - ret);
                        int zoff = page_off + ret;

@@ -1199,13 +1205,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                        ret += zlen;
                }

-               idx = 0;
-               if (ret <= 0)
-                       left = 0;
-               else if (off + ret > i_size)
-                       left = i_size - off;
+               if (off + ret > i_size)
+                       left = (i_size > off) ? i_size - off : 0;
                else
                        left = ret;
+
                while (left > 0) {
                        size_t plen, copied;

@@ -1221,13 +1225,8 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
                                break;
                        }
                }
-               ceph_release_page_vector(pages, num_pages);

-               if (ret < 0) {
-                       if (ret == -EBLOCKLISTED)
-                               fsc->blocklisted = true;
-                       break;
-               }
+               ceph_osdc_put_request(req);

                if (off >= i_size || !more)
                        break;
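
The order of checks in the reworked loop (error first, then the EOF clamp) can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel code: bytes_to_copy() is a hypothetical helper, and the integer types are assumed from the doutc() format string in the patch.

```c
#include <stdint.h>

/*
 * Returns the number of bytes that may be copied to the caller,
 * or the negative error code unchanged.
 */
static int64_t bytes_to_copy(uint64_t off, int64_t ret, uint64_t i_size)
{
	/* Errors are handled before any size arithmetic takes place. */
	if (ret < 0)
		return ret;

	/* Read went past EOF: clamp, and never let i_size - off wrap. */
	if (off + (uint64_t)ret > i_size)
		return (i_size > off) ? (int64_t)(i_size - off) : 0;

	return ret;
}
```

Checking 'ret < 0' up front means the unsigned sum 'off + ret' is only ever computed with a non-negative 'ret', and the explicit 'i_size > off' test covers the case where the file shrank below the read offset.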

Patch

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4b8d59ebda00..41d4eac128bb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1066,7 +1066,7 @@  ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	if (ceph_inode_is_shutdown(inode))
 		return -EIO;
 
-	if (!len)
+	if (!len || !i_size)
 		return 0;
 	/*
 	 * flush any page cache pages in this range.  this
@@ -1154,6 +1154,9 @@  ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 		doutc(cl, "%llu~%llu got %zd i_size %llu%s\n", off, len,
 		      ret, i_size, (more ? " MORE" : ""));
 
+		if (i_size == 0)
+			ret = 0;
+
 		/* Fix it to go to end of extent map */
 		if (sparse && ret >= 0)
 			ret = ceph_sparse_ext_map_end(op);