
[1/3] block: fix blk_rq_get_max_sectors() to flow more carefully

Message ID 20200911215338.44805-2-snitzer@redhat.com (mailing list archive)
State New, archived
Series block: a few chunk_sectors fixes/improvements

Commit Message

Mike Snitzer Sept. 11, 2020, 9:53 p.m. UTC
blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
those operations.

Also, there is no need to avoid blk_max_size_offset() if
'chunk_sectors' isn't set because it falls back to 'max_sectors'.
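
For reference, the fallback mentioned above looks roughly like the following (a
paraphrase of the blk_max_size_offset() helper in include/linux/blkdev.h at the
time; shown only for context, not part of this patch):

static inline unsigned int blk_max_size_offset(struct request_queue *q,
					       sector_t offset)
{
	if (!q->limits.chunk_sectors)
		return q->limits.max_sectors;	/* fallback when chunk_sectors isn't set */

	return min(q->limits.max_sectors,
		   (unsigned int)(q->limits.chunk_sectors -
				  (offset & (q->limits.chunk_sectors - 1))));
}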

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Comments

Ming Lei Sept. 12, 2020, 1:52 p.m. UTC | #1
On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> those operations.

Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
chunk_sectors is set:

        return min(blk_max_size_offset(q, offset),
                        blk_queue_get_max_sectors(q, req_op(rq)));
 
> Also, there is no need to avoid blk_max_size_offset() if
> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  include/linux/blkdev.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bb5636cc17b9..453a3d735d66 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>  						  sector_t offset)
>  {
>  	struct request_queue *q = rq->q;
> +	int op;
> +	unsigned int max_sectors;
>  
>  	if (blk_rq_is_passthrough(rq))
>  		return q->limits.max_hw_sectors;
>  
> -	if (!q->limits.chunk_sectors ||
> -	    req_op(rq) == REQ_OP_DISCARD ||
> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> -		return blk_queue_get_max_sectors(q, req_op(rq));
> +	op = req_op(rq);
> +	max_sectors = blk_queue_get_max_sectors(q, op);
>  
> -	return min(blk_max_size_offset(q, offset),
> -			blk_queue_get_max_sectors(q, req_op(rq)));
> +	switch (op) {
> +	case REQ_OP_DISCARD:
> +	case REQ_OP_SECURE_ERASE:
> +	case REQ_OP_WRITE_SAME:
> +	case REQ_OP_WRITE_ZEROES:
> +		return max_sectors;
> +	}
> +
> +	return min(blk_max_size_offset(q, offset), max_sectors);
>  }

It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
needs to be considered.


Thanks,
Ming
Damien Le Moal Sept. 14, 2020, 12:43 a.m. UTC | #2
On 2020/09/12 22:53, Ming Lei wrote:
> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>> those operations.
> 
> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> chunk_sectors is set:
> 
>         return min(blk_max_size_offset(q, offset),
>                         blk_queue_get_max_sectors(q, req_op(rq)));
>  
>> Also, there is no need to avoid blk_max_size_offset() if
>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>
>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>> ---
>>  include/linux/blkdev.h | 19 +++++++++++++------
>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index bb5636cc17b9..453a3d735d66 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>  						  sector_t offset)
>>  {
>>  	struct request_queue *q = rq->q;
>> +	int op;
>> +	unsigned int max_sectors;
>>  
>>  	if (blk_rq_is_passthrough(rq))
>>  		return q->limits.max_hw_sectors;
>>  
>> -	if (!q->limits.chunk_sectors ||
>> -	    req_op(rq) == REQ_OP_DISCARD ||
>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>> +	op = req_op(rq);
>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>  
>> -	return min(blk_max_size_offset(q, offset),
>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>> +	switch (op) {
>> +	case REQ_OP_DISCARD:
>> +	case REQ_OP_SECURE_ERASE:
>> +	case REQ_OP_WRITE_SAME:
>> +	case REQ_OP_WRITE_ZEROES:
>> +		return max_sectors;
>> +	}
>> +
>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>  }
> 
> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> needs to be considered.

That limit is needed for zoned block devices to ensure that *any* write request,
no matter the command, does not cross zone boundaries. Otherwise, the write would
be immediately failed by the device.
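
To make the zone-boundary concern concrete, a small worked example (illustrative
values only, assuming chunk_sectors is set to the zone size):

	unsigned int chunk_sectors = 524288;	/* e.g. 256 MiB zone, 512 B sectors */
	sector_t offset = 523264;		/* 1024 sectors before the zone boundary */
	unsigned int remaining = chunk_sectors - (offset & (chunk_sectors - 1));
	/* remaining == 1024: blk_max_size_offset() caps the request at the boundary */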

> 
> 
> Thanks,
> Ming
> 
>
Damien Le Moal Sept. 14, 2020, 12:46 a.m. UTC | #3
On 2020/09/12 6:53, Mike Snitzer wrote:
> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> those operations.
> 
> Also, there is no need to avoid blk_max_size_offset() if
> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  include/linux/blkdev.h | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bb5636cc17b9..453a3d735d66 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>  						  sector_t offset)
>  {
>  	struct request_queue *q = rq->q;
> +	int op;
> +	unsigned int max_sectors;
>  
>  	if (blk_rq_is_passthrough(rq))
>  		return q->limits.max_hw_sectors;
>  
> -	if (!q->limits.chunk_sectors ||
> -	    req_op(rq) == REQ_OP_DISCARD ||
> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> -		return blk_queue_get_max_sectors(q, req_op(rq));
> +	op = req_op(rq);
> +	max_sectors = blk_queue_get_max_sectors(q, op);
>  
> -	return min(blk_max_size_offset(q, offset),
> -			blk_queue_get_max_sectors(q, req_op(rq)));
> +	switch (op) {
> +	case REQ_OP_DISCARD:
> +	case REQ_OP_SECURE_ERASE:
> +	case REQ_OP_WRITE_SAME:
> +	case REQ_OP_WRITE_ZEROES:
> +		return max_sectors;
> +	}

Doesn't this break md devices? (I think md does use chunk_sectors for the stride
size, no?)

As mentioned in my reply to Ming's email, this will allow these commands to
potentially cross over zone boundaries on zoned block devices, which would be an
immediate command failure.

> +
> +	return min(blk_max_size_offset(q, offset), max_sectors);
>  }
>  
>  static inline unsigned int blk_rq_count_bios(struct request *rq)
>
Mike Snitzer Sept. 14, 2020, 2:49 p.m. UTC | #4
On Sat, Sep 12 2020 at  9:52am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > those operations.
> 
> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> chunk_sectors is set:
> 
>         return min(blk_max_size_offset(q, offset),
>                         blk_queue_get_max_sectors(q, req_op(rq)));

Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally.  The
duality of imposing chunk_sectors and/or considering offset when
calculating the return value is very confusing.

> > Also, there is no need to avoid blk_max_size_offset() if
> > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  include/linux/blkdev.h | 19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bb5636cc17b9..453a3d735d66 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >  						  sector_t offset)
> >  {
> >  	struct request_queue *q = rq->q;
> > +	int op;
> > +	unsigned int max_sectors;
> >  
> >  	if (blk_rq_is_passthrough(rq))
> >  		return q->limits.max_hw_sectors;
> >  
> > -	if (!q->limits.chunk_sectors ||
> > -	    req_op(rq) == REQ_OP_DISCARD ||
> > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > +	op = req_op(rq);
> > +	max_sectors = blk_queue_get_max_sectors(q, op);
> >  
> > -	return min(blk_max_size_offset(q, offset),
> > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > +	switch (op) {
> > +	case REQ_OP_DISCARD:
> > +	case REQ_OP_SECURE_ERASE:
> > +	case REQ_OP_WRITE_SAME:
> > +	case REQ_OP_WRITE_ZEROES:
> > +		return max_sectors;
> > +	}
> > +
> > +	return min(blk_max_size_offset(q, offset), max_sectors);
> >  }
> 
> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> needs to be considered.

Yes, I see that now.  But why don't they need to be considered for
REQ_OP_DISCARD and REQ_OP_SECURE_ERASE?  Is it because the intent of the
block core is to offer late splitting of bios?  If so, then why impose
chunk_sectors so early?

Obviously this patch 1/3 should be dropped.  I didn't treat
chunk_sectors with proper priority.

But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset()
is not at all straightforward.  And the code looks prone to imposing
limits that shouldn't be imposed (or vice versa).

Also, when falling back to max_sectors, why not consider offset to treat
max_sectors like a granularity?  Would allow for much more consistent IO
patterns.
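
A purely illustrative sketch of that idea (not existing kernel code; it assumes a
power-of-2 max_sectors so the mask arithmetic works):

static inline unsigned int max_sectors_as_granularity(struct request_queue *q,
						      sector_t offset)
{
	unsigned int max_sectors = q->limits.max_sectors;

	/* end each request on a max_sectors-aligned boundary relative to offset */
	return max_sectors - (offset & (max_sectors - 1));
}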

Mike
Mike Snitzer Sept. 14, 2020, 2:52 p.m. UTC | #5
On Sun, Sep 13 2020 at  8:43pm -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/09/12 22:53, Ming Lei wrote:
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >> those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> >  
> >> Also, there is no need to avoid blk_max_size_offset() if
> >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> ---
> >>  include/linux/blkdev.h | 19 +++++++++++++------
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index bb5636cc17b9..453a3d735d66 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>  						  sector_t offset)
> >>  {
> >>  	struct request_queue *q = rq->q;
> >> +	int op;
> >> +	unsigned int max_sectors;
> >>  
> >>  	if (blk_rq_is_passthrough(rq))
> >>  		return q->limits.max_hw_sectors;
> >>  
> >> -	if (!q->limits.chunk_sectors ||
> >> -	    req_op(rq) == REQ_OP_DISCARD ||
> >> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >> +	op = req_op(rq);
> >> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>  
> >> -	return min(blk_max_size_offset(q, offset),
> >> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >> +	switch (op) {
> >> +	case REQ_OP_DISCARD:
> >> +	case REQ_OP_SECURE_ERASE:
> >> +	case REQ_OP_WRITE_SAME:
> >> +	case REQ_OP_WRITE_ZEROES:
> >> +		return max_sectors;
> >> +	}
> >> +
> >> +	return min(blk_max_size_offset(q, offset), max_sectors);
> >>  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> That limit is needed for zoned block devices to ensure that *any* write request,
> no matter the command, do not cross zone boundaries. Otherwise, the write would
> be immediately failed by the device.

Thanks for the additional context, sorry to make you so concerned! ;)

Mike
Mike Snitzer Sept. 14, 2020, 3:03 p.m. UTC | #6
On Sun, Sep 13 2020 at  8:46pm -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> On 2020/09/12 6:53, Mike Snitzer wrote:
> > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > those operations.
> > 
> > Also, there is no need to avoid blk_max_size_offset() if
> > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> >  include/linux/blkdev.h | 19 +++++++++++++------
> >  1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bb5636cc17b9..453a3d735d66 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >  						  sector_t offset)
> >  {
> >  	struct request_queue *q = rq->q;
> > +	int op;
> > +	unsigned int max_sectors;
> >  
> >  	if (blk_rq_is_passthrough(rq))
> >  		return q->limits.max_hw_sectors;
> >  
> > -	if (!q->limits.chunk_sectors ||
> > -	    req_op(rq) == REQ_OP_DISCARD ||
> > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > +	op = req_op(rq);
> > +	max_sectors = blk_queue_get_max_sectors(q, op);
> >  
> > -	return min(blk_max_size_offset(q, offset),
> > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > +	switch (op) {
> > +	case REQ_OP_DISCARD:
> > +	case REQ_OP_SECURE_ERASE:
> > +	case REQ_OP_WRITE_SAME:
> > +	case REQ_OP_WRITE_ZEROES:
> > +		return max_sectors;
> > +	}
> 
> Doesn't this break md devices ? (I think does use chunk_sectors for stride size,
> no ?)
> 
> As mentioned in my reply to Ming's email, this will allow these commands to
> potentially cross over zone boundaries on zoned block devices, which would be an
> immediate command failure.

Depending on the implementation, it is beneficial to get a large
discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
optimization for handling large discards and issuing N discards, one per
stripe).  The same could apply to other commands.

Like all devices, zoned devices should impose command-specific limits in
the queue_limits (and not lean on chunk_sectors to do a
one-size-fits-all).
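
As a purely illustrative example of that point (made-up values; 'zone_sectors' is
an assumed local), each of these commands already has a dedicated queue_limits
setter, separate from chunk_sectors:

	blk_queue_max_hw_sectors(q, 2048);
	blk_queue_max_discard_sectors(q, 8192);
	blk_queue_max_write_same_sectors(q, 2048);
	blk_queue_max_write_zeroes_sectors(q, 2048);
	blk_queue_chunk_sectors(q, zone_sectors);	/* boundary limit, kept separate */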

But that aside, yes I agree I didn't pay close enough attention to the
implications of deferring the splitting of these commands until they
were issued to the underlying storage.  This chunk_sectors early splitting
override is a bit of a mess... I'm not quite following the logic, given we
were supposed to be waiting to split bios as late as possible.

Mike
Damien Le Moal Sept. 14, 2020, 11:28 p.m. UTC | #7
On 2020/09/14 23:52, Mike Snitzer wrote:
> On Sun, Sep 13 2020 at  8:43pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> On 2020/09/12 22:53, Ming Lei wrote:
>>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>
>>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
>>> chunk_sectors is set:
>>>
>>>         return min(blk_max_size_offset(q, offset),
>>>                         blk_queue_get_max_sectors(q, req_op(rq)));
>>>  
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>> +
>>>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>>>  }
>>>
>>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
>>> needs to be considered.
>>
>> That limit is needed for zoned block devices to ensure that *any* write request,
>> no matter the command, do not cross zone boundaries. Otherwise, the write would
>> be immediately failed by the device.
> 
> Thanks for the additional context, sorry to make you so concerned! ;)

No worries :)
Damien Le Moal Sept. 15, 2020, 1:09 a.m. UTC | #8
On 2020/09/15 0:04, Mike Snitzer wrote:
> On Sun, Sep 13 2020 at  8:46pm -0400,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> 
>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>> those operations.
>>>
>>> Also, there is no need to avoid blk_max_size_offset() if
>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>
>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>> ---
>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>> index bb5636cc17b9..453a3d735d66 100644
>>> --- a/include/linux/blkdev.h
>>> +++ b/include/linux/blkdev.h
>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>  						  sector_t offset)
>>>  {
>>>  	struct request_queue *q = rq->q;
>>> +	int op;
>>> +	unsigned int max_sectors;
>>>  
>>>  	if (blk_rq_is_passthrough(rq))
>>>  		return q->limits.max_hw_sectors;
>>>  
>>> -	if (!q->limits.chunk_sectors ||
>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>> +	op = req_op(rq);
>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>  
>>> -	return min(blk_max_size_offset(q, offset),
>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>> +	switch (op) {
>>> +	case REQ_OP_DISCARD:
>>> +	case REQ_OP_SECURE_ERASE:
>>> +	case REQ_OP_WRITE_SAME:
>>> +	case REQ_OP_WRITE_ZEROES:
>>> +		return max_sectors;
>>> +	}
>>
>> Doesn't this break md devices ? (I think does use chunk_sectors for stride size,
>> no ?)
>>
>> As mentioned in my reply to Ming's email, this will allow these commands to
>> potentially cross over zone boundaries on zoned block devices, which would be an
>> immediate command failure.
> 
> Depending on the implementation it is beneficial to get a large
> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
> optimization for handling large discards and issuing N discards, one per
> stripe).  Same could apply for other commands.
> 
> Like all devices, zoned devices should impose command specific limits in
> the queue_limits (and not lean on chunk_sectors to do a
> one-size-fits-all).

Yes, understood. But I think that in the case of md, chunk_sectors is used to
indicate the boundary between drives of a raid volume. So it does indeed make
sense to limit the IO size at submission since otherwise the md driver itself
would have to split that bio again anyway.

> But that aside, yes I agree I didn't pay close enough attention to the
> implications of deferring the splitting of these commands until they
> were issued to underlying storage.  This chunk_sectors early splitting
> override is a bit of a mess... not quite following the logic given we
> were supposed to be waiting to split bios as late as possible.

My view is that multipage bvecs (BIOs almost as large as we want) and late
splitting are beneficial for getting larger effective BIOs sent to the device, as
having more pages on hand allows bigger segments in the bio instead of always
having at most PAGE_SIZE per segment. The effect of this is very visible with
blktrace: a lot of requests end up being much larger than the device's
max_segments * page_size.

However, if there is already a known limit on the BIO size when the BIO is being
built, it does not make much sense to try to grow a bio beyond that limit since
it will have to be split by the driver anyway. chunk_sectors is one such limit
used for md (I think) to indicate boundaries between drives of a raid volume.
And we reuse it (abuse it ?) for zoned block devices to ensure that any command
does not cross over zone boundaries since that triggers errors for writes within
sequential zones or read/write crossing over zones of different types
(conventional->sequential zone boundary).
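
For what it's worth, both uses come down to the same setter (illustrative lines;
'mddev' and 'zone_sectors' stand in for the driver's own state): md/raid0 sets
chunk_sectors to the raid chunk size, while zoned drivers set it to the zone size:

	blk_queue_chunk_sectors(mddev->queue, mddev->chunk_sectors);	/* md: chunk boundary */
	blk_queue_chunk_sectors(q, zone_sectors);			/* zoned: zone boundary */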

I may not have the entire picture correctly here, but so far, this is my
understanding.

Cheers.
Ming Lei Sept. 15, 2020, 1:50 a.m. UTC | #9
On Mon, Sep 14, 2020 at 10:49:28AM -0400, Mike Snitzer wrote:
> On Sat, Sep 12 2020 at  9:52am -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> > > blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> > > REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> > > those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> 
> Yes, but blk_rq_get_max_sectors() is a bit of a mess structurally.  he
> duality of imposing chunk_sectors and/or considering offset when
> calculating the return is very confused.
> 
> > > Also, there is no need to avoid blk_max_size_offset() if
> > > 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> > > 
> > > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > > ---
> > >  include/linux/blkdev.h | 19 +++++++++++++------
> > >  1 file changed, 13 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index bb5636cc17b9..453a3d735d66 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> > >  						  sector_t offset)
> > >  {
> > >  	struct request_queue *q = rq->q;
> > > +	int op;
> > > +	unsigned int max_sectors;
> > >  
> > >  	if (blk_rq_is_passthrough(rq))
> > >  		return q->limits.max_hw_sectors;
> > >  
> > > -	if (!q->limits.chunk_sectors ||
> > > -	    req_op(rq) == REQ_OP_DISCARD ||
> > > -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> > > -		return blk_queue_get_max_sectors(q, req_op(rq));
> > > +	op = req_op(rq);
> > > +	max_sectors = blk_queue_get_max_sectors(q, op);
> > >  
> > > -	return min(blk_max_size_offset(q, offset),
> > > -			blk_queue_get_max_sectors(q, req_op(rq)));
> > > +	switch (op) {
> > > +	case REQ_OP_DISCARD:
> > > +	case REQ_OP_SECURE_ERASE:
> > > +	case REQ_OP_WRITE_SAME:
> > > +	case REQ_OP_WRITE_ZEROES:
> > > +		return max_sectors;
> > > +	}
> > > +
> > > +	return min(blk_max_size_offset(q, offset), max_sectors);
> > >  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> Yes, I see that now.  But why don't they need to be considered for
> REQ_OP_DISCARD and REQ_OP_SECURE_ERASE?

This behavior was introduced in the following commit, and I guess it is
because we support multi-range discard requests; maybe Jens can explain more.

commit e548ca4ee4595f65b262661d166310ad8a149bec
Author: Jens Axboe <axboe@fb.com>
Date:   Fri May 29 13:11:32 2015 -0600

    block: don't honor chunk sizes for data-less IO

    We don't need to honor chunk sizes for IO that doesn't carry any
    data.

    Signed-off-by: Jens Axboe <axboe@fb.com>

> Is it because the intent of the
> block core is to offer late splitting of bios?

The block layer doesn't have late bio splitting, and a bio is only split
via __blk_queue_split() before allocating the request.

blk_rq_get_max_sectors() is only called by the rq merge code; actually it
should have been defined in block/blk.h instead of a public header.

> If so, then why impose
> chunk_sectors so early?

Not sure I understand your question. 'chunk_sectors' is first used during bio
splitting (get_max_io_size() called from blk_bio_segment_split()).
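
A simplified sketch of that split-time use (paraphrased from block/blk-merge.c of
this era; the physical/logical block size alignment fixups are omitted):

static inline unsigned get_max_io_size(struct request_queue *q,
				       struct bio *bio)
{
	unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);

	/* the real helper then aligns 'sectors' to the physical and logical
	 * block sizes before returning it */
	return sectors;
}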

> 
> Obviously this patch 1/3 should be dropped.  I didn't treat
> chunk_sectors with proper priority.
> 
> But like I said above, blk_rq_get_max_sectors() vs blk_max_size_offset()
> is not at all straight-forward.  And the code looks prone to imposing
> limits that shouldn't be (or vice-versa).
> 
> Also, when falling back to max_sectors, why not consider offset to treat
> max_sectors like a granularity?  Would allow for much more consistent IO
> patterns.

blk_rq_get_max_sectors() is called when checking whether a bio or rq can be
merged into the current request, and we have already considered all kinds of
queue limits when doing the bio splitting, so it isn't necessary to consider
them again here when merging requests.
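
For reference, the merge-time use looks roughly like this (paraphrased from the
back-merge path in block/blk-merge.c): a merge is rejected if it would push the
request past the limit.

	if (blk_rq_sectors(req) + bio_sectors(bio) >
	    blk_rq_get_max_sectors(req, blk_rq_pos(req))) {
		req_set_nomerge(req->q, req);
		return 0;
	}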


Thanks,
Ming
Ming Lei Sept. 15, 2020, 2:03 a.m. UTC | #10
On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote:
> On 2020/09/12 22:53, Ming Lei wrote:
> > On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
> >> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >> those operations.
> > 
> > Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
> > chunk_sectors is set:
> > 
> >         return min(blk_max_size_offset(q, offset),
> >                         blk_queue_get_max_sectors(q, req_op(rq)));
> >  
> >> Also, there is no need to avoid blk_max_size_offset() if
> >> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>
> >> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >> ---
> >>  include/linux/blkdev.h | 19 +++++++++++++------
> >>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >> index bb5636cc17b9..453a3d735d66 100644
> >> --- a/include/linux/blkdev.h
> >> +++ b/include/linux/blkdev.h
> >> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>  						  sector_t offset)
> >>  {
> >>  	struct request_queue *q = rq->q;
> >> +	int op;
> >> +	unsigned int max_sectors;
> >>  
> >>  	if (blk_rq_is_passthrough(rq))
> >>  		return q->limits.max_hw_sectors;
> >>  
> >> -	if (!q->limits.chunk_sectors ||
> >> -	    req_op(rq) == REQ_OP_DISCARD ||
> >> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >> +	op = req_op(rq);
> >> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>  
> >> -	return min(blk_max_size_offset(q, offset),
> >> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >> +	switch (op) {
> >> +	case REQ_OP_DISCARD:
> >> +	case REQ_OP_SECURE_ERASE:
> >> +	case REQ_OP_WRITE_SAME:
> >> +	case REQ_OP_WRITE_ZEROES:
> >> +		return max_sectors;
> >> +	}
> >> +
> >> +	return min(blk_max_size_offset(q, offset), max_sectors);
> >>  }
> > 
> > It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
> > needs to be considered.
> 
> That limit is needed for zoned block devices to ensure that *any* write request,
> no matter the command, do not cross zone boundaries. Otherwise, the write would
> be immediately failed by the device.

It looks like both blk_bio_write_zeroes_split() and blk_bio_write_same_split()
don't consider the chunk_sectors limit; is that an issue for zoned block devices?
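
For context, the write zeroes split at this point looks roughly like the following
(paraphrased from block/blk-merge.c; blk_bio_write_same_split() has the same
shape) -- it only honors max_write_zeroes_sectors:

static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
		struct bio *bio, unsigned *nsegs)
{
	*nsegs = 0;

	if (!q->limits.max_write_zeroes_sectors)
		return NULL;

	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
		return NULL;

	return bio_split(bio, q->limits.max_write_zeroes_sectors,
			 GFP_NOIO, &q->bio_split);
}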


thanks,
Ming
Damien Le Moal Sept. 15, 2020, 2:15 a.m. UTC | #11
On 2020/09/15 11:04, Ming Lei wrote:
> On Mon, Sep 14, 2020 at 12:43:06AM +0000, Damien Le Moal wrote:
>> On 2020/09/12 22:53, Ming Lei wrote:
>>> On Fri, Sep 11, 2020 at 05:53:36PM -0400, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>
>>> Actually WRITE_SAME & WRITE_ZEROS are handled by the following if
>>> chunk_sectors is set:
>>>
>>>         return min(blk_max_size_offset(q, offset),
>>>                         blk_queue_get_max_sectors(q, req_op(rq)));
>>>  
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>> +
>>>> +	return min(blk_max_size_offset(q, offset), max_sectors);
>>>>  }
>>>
>>> It depends if offset & chunk_sectors limit for WRITE_SAME & WRITE_ZEROS
>>> needs to be considered.
>>
>> That limit is needed for zoned block devices to ensure that *any* write request,
>> no matter the command, do not cross zone boundaries. Otherwise, the write would
>> be immediately failed by the device.
> 
> Looks both blk_bio_write_zeroes_split() and blk_bio_write_same_split()
> don't consider chunk_sectors limit, is that an issue for zone block?

Hu... I never looked at these. Yes, it will be a problem: write zeroes for NVMe
ZNS drives and write same for SCSI/ZBC drives. So yes, definitely something
that needs to be fixed. The users of these will be file systems, which in the
case of zoned block devices would be file systems with zone support. f2fs does
not use these commands, and btrfs (posted recently) needs to be checked. But
with the FS itself being zone aware, the requests will be zone aligned.

But definitely worth fixing.
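
A purely illustrative sketch of one possible shape of such a fix (not what was
actually merged upstream): have the split also honor the offset/chunk_sectors
clamp so the resulting bio cannot cross a zone boundary.

static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
		struct bio *bio, unsigned *nsegs)
{
	unsigned int max_sectors;

	*nsegs = 0;

	if (!q->limits.max_write_zeroes_sectors)
		return NULL;

	max_sectors = min(q->limits.max_write_zeroes_sectors,
			  blk_max_size_offset(q, bio->bi_iter.bi_sector));
	if (bio_sectors(bio) <= max_sectors)
		return NULL;

	return bio_split(bio, max_sectors, GFP_NOIO, &q->bio_split);
}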

Thanks !
Damien Le Moal Sept. 15, 2020, 4:21 a.m. UTC | #12
On 2020/09/15 10:10, Damien Le Moal wrote:
> On 2020/09/15 0:04, Mike Snitzer wrote:
>> On Sun, Sep 13 2020 at  8:46pm -0400,
>> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>>
>>> On 2020/09/12 6:53, Mike Snitzer wrote:
>>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
>>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
>>>> those operations.
>>>>
>>>> Also, there is no need to avoid blk_max_size_offset() if
>>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
>>>>
>>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
>>>> ---
>>>>  include/linux/blkdev.h | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>> index bb5636cc17b9..453a3d735d66 100644
>>>> --- a/include/linux/blkdev.h
>>>> +++ b/include/linux/blkdev.h
>>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
>>>>  						  sector_t offset)
>>>>  {
>>>>  	struct request_queue *q = rq->q;
>>>> +	int op;
>>>> +	unsigned int max_sectors;
>>>>  
>>>>  	if (blk_rq_is_passthrough(rq))
>>>>  		return q->limits.max_hw_sectors;
>>>>  
>>>> -	if (!q->limits.chunk_sectors ||
>>>> -	    req_op(rq) == REQ_OP_DISCARD ||
>>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
>>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
>>>> +	op = req_op(rq);
>>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
>>>>  
>>>> -	return min(blk_max_size_offset(q, offset),
>>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
>>>> +	switch (op) {
>>>> +	case REQ_OP_DISCARD:
>>>> +	case REQ_OP_SECURE_ERASE:
>>>> +	case REQ_OP_WRITE_SAME:
>>>> +	case REQ_OP_WRITE_ZEROES:
>>>> +		return max_sectors;
>>>> +	}
>>>
>>> Doesn't this break md devices ? (I think does use chunk_sectors for stride size,
>>> no ?)
>>>
>>> As mentioned in my reply to Ming's email, this will allow these commands to
>>> potentially cross over zone boundaries on zoned block devices, which would be an
>>> immediate command failure.
>>
>> Depending on the implementation it is beneficial to get a large
>> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
>> optimization for handling large discards and issuing N discards, one per
>> stripe).  Same could apply for other commands.
>>
>> Like all devices, zoned devices should impose command specific limits in
>> the queue_limits (and not lean on chunk_sectors to do a
>> one-size-fits-all).
> 
> Yes, understood. But I think that  in the case of md, chunk_sectors is used to
> indicate the boundary between drives for a raid volume. So it does indeed make
> sense to limit the IO size on submission since otherwise, the md driver itself
> would have to split that bio again anyway.
> 
>> But that aside, yes I agree I didn't pay close enough attention to the
>> implications of deferring the splitting of these commands until they
>> were issued to underlying storage.  This chunk_sectors early splitting
>> override is a bit of a mess... not quite following the logic given we
>> were supposed to be waiting to split bios as late as possible.
> 
> My view is that the multipage bvec (BIOs almost as large as we want) and late
> splitting is beneficial to get larger effective BIO sent to the device as having
> more pages on hand allows bigger segments in the bio instead of always having at
> most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A
> lot of requests end up being much larger than the device max_segments * page_size.
> 
> However, if there is already a known limit on the BIO size when the BIO is being
> built, it does not make much sense to try to grow a bio beyond that limit since
> it will have to be split by the driver anyway. chunk_sectors is one such limit
> used for md (I think) to indicate boundaries between drives of a raid volume.
> And we reuse it (abuse it ?) for zoned block devices to ensure that any command
> does not cross over zone boundaries since that triggers errors for writes within
> sequential zones or read/write crossing over zones of different types
> (conventional->sequential zone boundary).
> 
> I may not have the entire picture correctly here, but so far, this is my
> understanding.

And I was wrong :) In light of Ming's comment + a little code refresher reading,
indeed, chunk_sectors will split BIOs so that *requests* do not exceed that
limit, but the initial BIO submission may be much larger regardless of
chunk_sectors.

Ming, I think the point here is that building a large BIO first and splitting it
later (as opposed to limiting the bio size by stopping bio_add_page()) is more
efficient as there is only one bio submit instead of many, right ?
Ming Lei Sept. 15, 2020, 8:01 a.m. UTC | #13
On Tue, Sep 15, 2020 at 04:21:54AM +0000, Damien Le Moal wrote:
> On 2020/09/15 10:10, Damien Le Moal wrote:
> > On 2020/09/15 0:04, Mike Snitzer wrote:
> >> On Sun, Sep 13 2020 at  8:46pm -0400,
> >> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> >>
> >>> On 2020/09/12 6:53, Mike Snitzer wrote:
> >>>> blk_queue_get_max_sectors() has been trained for REQ_OP_WRITE_SAME and
> >>>> REQ_OP_WRITE_ZEROES yet blk_rq_get_max_sectors() didn't call it for
> >>>> those operations.
> >>>>
> >>>> Also, there is no need to avoid blk_max_size_offset() if
> >>>> 'chunk_sectors' isn't set because it falls back to 'max_sectors'.
> >>>>
> >>>> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> >>>> ---
> >>>>  include/linux/blkdev.h | 19 +++++++++++++------
> >>>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >>>> index bb5636cc17b9..453a3d735d66 100644
> >>>> --- a/include/linux/blkdev.h
> >>>> +++ b/include/linux/blkdev.h
> >>>> @@ -1070,17 +1070,24 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> >>>>  						  sector_t offset)
> >>>>  {
> >>>>  	struct request_queue *q = rq->q;
> >>>> +	int op;
> >>>> +	unsigned int max_sectors;
> >>>>  
> >>>>  	if (blk_rq_is_passthrough(rq))
> >>>>  		return q->limits.max_hw_sectors;
> >>>>  
> >>>> -	if (!q->limits.chunk_sectors ||
> >>>> -	    req_op(rq) == REQ_OP_DISCARD ||
> >>>> -	    req_op(rq) == REQ_OP_SECURE_ERASE)
> >>>> -		return blk_queue_get_max_sectors(q, req_op(rq));
> >>>> +	op = req_op(rq);
> >>>> +	max_sectors = blk_queue_get_max_sectors(q, op);
> >>>>  
> >>>> -	return min(blk_max_size_offset(q, offset),
> >>>> -			blk_queue_get_max_sectors(q, req_op(rq)));
> >>>> +	switch (op) {
> >>>> +	case REQ_OP_DISCARD:
> >>>> +	case REQ_OP_SECURE_ERASE:
> >>>> +	case REQ_OP_WRITE_SAME:
> >>>> +	case REQ_OP_WRITE_ZEROES:
> >>>> +		return max_sectors;
> >>>> +	}
> >>>
> >>> Doesn't this break md devices ? (I think does use chunk_sectors for stride size,
> >>> no ?)
> >>>
> >>> As mentioned in my reply to Ming's email, this will allow these commands to
> >>> potentially cross over zone boundaries on zoned block devices, which would be an
> >>> immediate command failure.
> >>
> >> Depending on the implementation it is beneficial to get a large
> >> discard (one not constrained by chunk_sectors, e.g. dm-stripe.c's
> >> optimization for handling large discards and issuing N discards, one per
> >> stripe).  Same could apply for other commands.
> >>
> >> Like all devices, zoned devices should impose command specific limits in
> >> the queue_limits (and not lean on chunk_sectors to do a
> >> one-size-fits-all).
> > 
> > Yes, understood. But I think that  in the case of md, chunk_sectors is used to
> > indicate the boundary between drives for a raid volume. So it does indeed make
> > sense to limit the IO size on submission since otherwise, the md driver itself
> > would have to split that bio again anyway.
> > 
> >> But that aside, yes I agree I didn't pay close enough attention to the
> >> implications of deferring the splitting of these commands until they
> >> were issued to underlying storage.  This chunk_sectors early splitting
> >> override is a bit of a mess... not quite following the logic given we
> >> were supposed to be waiting to split bios as late as possible.
> > 
> > My view is that the multipage bvec (BIOs almost as large as we want) and late
> > splitting is beneficial to get larger effective BIO sent to the device as having
> > more pages on hand allows bigger segments in the bio instead of always having at
> > most PAGE_SIZE per segment. The effect of this is very visible with blktrace. A
> > lot of requests end up being much larger than the device max_segments * page_size.
> > 
> > However, if there is already a known limit on the BIO size when the BIO is being
> > built, it does not make much sense to try to grow a bio beyond that limit since
> > it will have to be split by the driver anyway. chunk_sectors is one such limit
> > used for md (I think) to indicate boundaries between drives of a raid volume.
> > And we reuse it (abuse it ?) for zoned block devices to ensure that any command
> > does not cross over zone boundaries since that triggers errors for writes within
> > sequential zones or read/write crossing over zones of different types
> > (conventional->sequential zone boundary).
> > 
> > I may not have the entire picture correctly here, but so far, this is my
> > understanding.
> 
> And I was wrong :) In light of Ming's comment + a little code refresher reading,
> indeed, chunk_sectors will split BIOs so that *requests* do not exceed that
> limit, but the initial BIO submission may be much larger regardless of
> chunk_sectors.
> 
> Ming, I think the point here is that building a large BIO first and splitting it
> later (as opposed to limiting the bio size by stopping bio_add_page()) is more
> efficient as there is only one bio submit instead of many, right ?

Yeah, this way allows generic_make_request (submit_bio_noacct) to handle
arbitrarily sized bios, so bio_add_page() becomes more efficient and is simplified
a lot, and stacking drivers are simplified too; for example, the original
q->merge_bvec_fn() was killed.

On the other hand, the cost of bio splitting is added.

Especially for stacking drivers, bio splitting may happen twice: once in the
stacking driver and once in the underlying device driver.

Fortunately the underlying queue's limits are propagated to the stacking queue,
so in theory the bio splitting in the stacking driver's ->submit_bio is enough
most of the time.
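
Illustratively (loosely modeled on how dm builds its table limits; 'dd' and
'devices' are assumed placeholders), the propagation is just each underlying
device's limits being folded into the stacking device's limits:

	struct queue_limits limits;

	blk_set_stacking_limits(&limits);
	list_for_each_entry(dd, &devices, list)
		if (blk_stack_limits(&limits, &bdev_get_queue(dd->bdev)->limits,
				     get_start_sect(dd->bdev)))
			pr_warn("device limits are misaligned\n");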



Thanks,
Ming

Patch

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bb5636cc17b9..453a3d735d66 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1070,17 +1070,24 @@  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
+	int op;
+	unsigned int max_sectors;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	if (!q->limits.chunk_sectors ||
-	    req_op(rq) == REQ_OP_DISCARD ||
-	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
+	op = req_op(rq);
+	max_sectors = blk_queue_get_max_sectors(q, op);
 
-	return min(blk_max_size_offset(q, offset),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+	switch (op) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_SECURE_ERASE:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE_ZEROES:
+		return max_sectors;
+	}
+
+	return min(blk_max_size_offset(q, offset), max_sectors);
 }
 
 static inline unsigned int blk_rq_count_bios(struct request *rq)