block: elevator: avoid to load iosched module from this disk

Message ID 20240907014331.176152-1-ming.lei@redhat.com (mailing list archive)
State New, archived
Series block: elevator: avoid to load iosched module from this disk

Commit Message

Ming Lei Sept. 7, 2024, 1:43 a.m. UTC
When switching the I/O scheduler via sysfs, 'request_module' may be called
if the specified scheduler doesn't exist.

This has a deadlock risk because the module may be stored on a filesystem
behind our disk, and the request queue is frozen before its elevator is
switched.

Fix it by returning -EDEADLK if the disk is claimed, which can be taken
as a signal that the disk is mounted.

Some distributions (Fedora) simulate the original kernel command line
option 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler',
and a boot hang is triggered.

Cc: Richard Jones <rjones@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jiri Jaburek <jjaburek@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Richard W.M. Jones Sept. 7, 2024, 7:35 a.m. UTC | #1
On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> When switching io scheduler via sysfs, 'request_module' may be called
> if the specified scheduler doesn't exist.
> 
> This was has deadlock risk because the module may be stored on FS behind
> our disk since request queue is frozen before switching its elevator.
> 
> Fix it by returning -EDEADLK in case that the disk is claimed, which
> can be thought as one signal that the disk is mounted.
> 
> Some distributions(Fedora) simulates the original kernel command line of
> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> hang is triggered.
> 
> Cc: Richard Jones <rjones@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Jiri Jaburek <jjaburek@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

I'd suggest also:

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>

So I have tested this patch and it does fix the issue, at the possible
cost that now setting the scheduler can fail:

  + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
  + echo noop
  /init: line 109: echo: write error: Resource deadlock avoided

(I know I'm setting it to an impossible value here, but this could
also happen when setting it to a valid one.)

Since almost no one checks the result of 'echo foo > /sys/...'  that
would probably mean that sometimes a desired setting is silently not
set.
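
For illustration, a minimal userspace sketch of what checking that result
could look like from C instead of a shell 'echo'. The helper name
set_scheduler() is hypothetical; it relies only on the fact that sysfs
reports a rejected value as a failed write(2), so an error such as
EDEADLK or EINVAL comes back through errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a scheduler name to a sysfs-style attribute file.
 * Returns 0 on success or a negative errno value (for example -EDEADLK
 * or -EINVAL when the kernel rejects the name).
 * Hypothetical helper, not part of any patch in this thread.
 */
static int set_scheduler(const char *attr_path, const char *name)
{
	size_t len = strlen(name);
	ssize_t written;
	int fd, err = 0;

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* A rejected value shows up here as write() returning -1. */
	written = write(fd, name, len);
	if (written < 0)
		err = -errno;
	else if ((size_t)written != len)
		err = -EIO;	/* short write: treated as failure in this sketch */

	if (close(fd) < 0 && !err)
		err = -errno;
	return err;
}
```

A caller would pass something like /sys/block/$DISK/queue/scheduler as
attr_path and log or act on a non-zero return instead of ignoring it.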

Also I bisected this bug yesterday and found it was caused by (or,
more likely, exposed by):

  commit af2814149883e2c1851866ea2afcd8eadc040f79
  Author: Christoph Hellwig <hch@lst.de>
  Date:   Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store
    
    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight.  Freeze the queue
    in common code instead of adding it to almost every attribute.

Reverting this commit on top of git head also fixes the problem.

Why did this commit expose the problem?

Rich.

> ---
>  block/elevator.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index f13d552a32c8..2b0432f4ac33 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -676,6 +676,13 @@ void elevator_disable(struct request_queue *q)
>  	blk_mq_unfreeze_queue(q);
>  }
>  
> +static bool disk_is_claimed(struct gendisk *disk)
> +{
> +	if (disk->part0->bd_holder)
> +		return true;
> +	return false;
> +}
> +
>  /*
>   * Switch this queue to the given IO scheduler.
>   */
> @@ -699,6 +706,13 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
>  
>  	e = elevator_find_get(q, elevator_name);
>  	if (!e) {
> +		/*
> +		 * Try to avoid to load iosched module from FS behind our
> +		 * disk, otherwise deadlock may be triggered
> +		 */
> +		if (disk_is_claimed(q->disk))
> +			return -EDEADLK;
> +
>  		request_module("%s-iosched", elevator_name);
>  		e = elevator_find_get(q, elevator_name);
>  		if (!e)
> -- 
> 2.46.0
Ming Lei Sept. 7, 2024, 7:58 a.m. UTC | #2
On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> > When switching io scheduler via sysfs, 'request_module' may be called
> > if the specified scheduler doesn't exist.
> > 
> > This was has deadlock risk because the module may be stored on FS behind
> > our disk since request queue is frozen before switching its elevator.
> > 
> > Fix it by returning -EDEADLK in case that the disk is claimed, which
> > can be thought as one signal that the disk is mounted.
> > 
> > Some distributions(Fedora) simulates the original kernel command line of
> > 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> > hang is triggered.
> > 
> > Cc: Richard Jones <rjones@redhat.com>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Cc: Jiri Jaburek <jjaburek@redhat.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> I'd suggest also:
> 
> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> 
> So I have tested this patch and it does fix the issue, at the possible
> cost that now setting the scheduler can fail:
> 
>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
>   + echo noop
>   /init: line 109: echo: write error: Resource deadlock avoided
> 
> (I know I'm setting it to an impossible value here, but this could
> also happen when setting it to a valid one.)

Actually, in most distributions the I/O schedulers are built in, so
request_module is just a nop, but metadata I/O must still be started.

> 
> Since almost no one checks the result of 'echo foo > /sys/...'  that
> would probably mean that sometimes a desired setting is silently not
> set.

As I mentioned, the I/O schedulers are built in for most distributions,
so request_module isn't called for a valid I/O scheduler.

> 
> Also I bisected this bug yesterday and found it was caused by (or,
> more likely, exposed by):
> 
>   commit af2814149883e2c1851866ea2afcd8eadc040f79
>   Author: Christoph Hellwig <hch@lst.de>
>   Date:   Mon Jun 17 08:04:38 2024 +0200
> 
>     block: freeze the queue in queue_attr_store
>     
>     queue_attr_store updates attributes used to control generating I/O, and
>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
>     in common code instead of adding it to almost every attribute.
> 
> Reverting this commit on top of git head also fixes the problem.
> 
> Why did this commit expose the problem?

That really is the first bad commit: it moves queue freezing before the
call to request_module(). Originally we didn't freeze the queue until
we had to.

Another candidate fix is to revert it, or at least not to freeze the
queue when storing the elevator attribute.
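
To make the ordering concrete, here is a toy userspace model, not kernel
code. All toy_* names are made up for illustration. A real module load
on a frozen queue would block forever; the model returns -EDEADLK
instead so the two orderings can be compared without hanging.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a request queue: only tracks the state we care about. */
struct toy_queue {
	bool frozen;
	bool sched_loaded;
};

/*
 * Model of loading a scheduler module from a filesystem that sits on
 * this queue. Loading needs I/O, which cannot complete while frozen.
 */
static int toy_load_module(struct toy_queue *q)
{
	if (q->frozen)
		return -EDEADLK;	/* the real system would hang here */
	q->sched_loaded = true;
	return 0;
}

/* Ordering after the bisected commit: freeze first, then try to load. */
static int toy_switch_freeze_first(struct toy_queue *q)
{
	int ret;

	q->frozen = true;
	ret = toy_load_module(q);
	q->frozen = false;
	return ret;
}

/* Earlier ordering: load while I/O still flows, then freeze to switch. */
static int toy_switch_load_first(struct toy_queue *q)
{
	int ret = toy_load_module(q);

	if (ret)
		return ret;
	q->frozen = true;	/* the elevator switch would happen here */
	q->frozen = false;
	return 0;
}
```

In this model only toy_switch_load_first() ever gets the module loaded,
which mirrors why moving the freeze ahead of request_module() matters.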


Thanks,
Ming
Damien Le Moal Sept. 7, 2024, 9:04 a.m. UTC | #3
On 9/7/24 16:58, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
>> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
>>> When switching io scheduler via sysfs, 'request_module' may be called
>>> if the specified scheduler doesn't exist.
>>>
>>> This was has deadlock risk because the module may be stored on FS behind
>>> our disk since request queue is frozen before switching its elevator.
>>>
>>> Fix it by returning -EDEADLK in case that the disk is claimed, which
>>> can be thought as one signal that the disk is mounted.
>>>
>>> Some distributions(Fedora) simulates the original kernel command line of
>>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
>>> hang is triggered.
>>>
>>> Cc: Richard Jones <rjones@redhat.com>
>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>>> Cc: Jiri Jaburek <jjaburek@redhat.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> I'd suggest also:
>>
>> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
>> Reported-by: Richard W.M. Jones <rjones@redhat.com>
>> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
>> Tested-by: Richard W.M. Jones <rjones@redhat.com>
>>
>> So I have tested this patch and it does fix the issue, at the possible
>> cost that now setting the scheduler can fail:
>>
>>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
>>   + echo noop
>>   /init: line 109: echo: write error: Resource deadlock avoided
>>
>> (I know I'm setting it to an impossible value here, but this could
>> also happen when setting it to a valid one.)
> 
> Actually in most of dist, io-schedulers are built-in, so request_module
> is just a nop, but meta IO must be started.
> 
>>
>> Since almost no one checks the result of 'echo foo > /sys/...'  that
>> would probably mean that sometimes a desired setting is silently not
>> set.
> 
> As I mentioned, io-schedulers are built-in for most of dist, so
> request_module isn't called in case of one valid io-sched.
> 
>>
>> Also I bisected this bug yesterday and found it was caused by (or,
>> more likely, exposed by):
>>
>>   commit af2814149883e2c1851866ea2afcd8eadc040f79
>>   Author: Christoph Hellwig <hch@lst.de>
>>   Date:   Mon Jun 17 08:04:38 2024 +0200
>>
>>     block: freeze the queue in queue_attr_store
>>     
>>     queue_attr_store updates attributes used to control generating I/O, and
>>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
>>     in common code instead of adding it to almost every attribute.
>>
>> Reverting this commit on top of git head also fixes the problem.
>>
>> Why did this commit expose the problem?
> 
> That is really the 1st bad commit which moves queue freezing before
> calling request_module(), originally we won't freeze queue until
> we have to do it.
> 
> Another candidate fix is to revert it, or at least not do it
> for storing elevator attribute.

I do not think that reverting is acceptable. Rather, a proper fix would simply
be to do the request_module() before freezing the queue.
Something like below should work (totally untested and that may be overkill).

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 60116d13cb80..aef87f6b4a8a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -23,6 +23,7 @@
 struct queue_sysfs_entry {
        struct attribute attr;
        ssize_t (*show)(struct gendisk *disk, char *page);
+       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
        ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
 };

@@ -413,6 +414,14 @@ static struct queue_sysfs_entry _prefix##_entry = {        \
        .store  = _prefix##_store,                      \
 };

+#define QUEUE_RPW_ENTRY(_prefix, _name)                        \
+static struct queue_sysfs_entry _prefix##_entry = {    \
+       .attr   = { .name = _name, .mode = 0644 },      \
+       .show   = _prefix##_show,                       \
+       .pre_store = _prefix##_pre_store,               \
+       .store  = _prefix##_store,                      \
+};
+
 QUEUE_RW_ENTRY(queue_requests, "nr_requests");
 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
 QUEUE_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
@@ -420,7 +429,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
 QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
 QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
 QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
-QUEUE_RW_ENTRY(elv_iosched, "scheduler");
+QUEUE_RPW_ENTRY(elv_iosched, "scheduler");

 QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
 QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
@@ -670,6 +679,12 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
        if (!entry->store)
                return -EIO;

+       if (entry->pre_store) {
+               res = entry->pre_store(disk, page, length);
+               if (res)
+                       return res;
+       }
+
        blk_mq_freeze_queue(q);
        mutex_lock(&q->sysfs_lock);
        res = entry->store(disk, page, length);
diff --git a/block/elevator.c b/block/elevator.c
index f13d552a32c8..c338282d5148 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -698,17 +698,26 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
                return 0;

        e = elevator_find_get(q, elevator_name);
-       if (!e) {
-               request_module("%s-iosched", elevator_name);
-               e = elevator_find_get(q, elevator_name);
-               if (!e)
-                       return -EINVAL;
-       }
+       if (!e)
+               return -EINVAL;
        ret = elevator_switch(q, e);
        elevator_put(e);
        return ret;
 }

+int elv_iosched_pre_store(struct gendisk *disk, const char *buf,
+                          size_t count)
+{
+       char elevator_name[ELV_NAME_MAX];
+
+       if (!elv_support_iosched(disk->queue))
+               return -ENOTSUPP;
+
+       strscpy(elevator_name, buf, sizeof(elevator_name));
+
+       return request_module("%s-iosched", elevator_name);
+}
+
 ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
                          size_t count)
 {
diff --git a/block/elevator.h b/block/elevator.h
index 3fe18e1a8692..059172c0f93c 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -148,6 +148,7 @@ extern void elv_unregister(struct elevator_type *);
  * io scheduler sysfs switching
  */
 ssize_t elv_iosched_show(struct gendisk *disk, char *page);
+int elv_iosched_pre_store(struct gendisk *disk, const char *page, size_t count);
 ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);

 extern bool elv_bio_merge_ok(struct request *, struct bio *);
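
The hook ordering in the diff above can be sketched in userspace as
follows. This is a simplified model under stated assumptions, not the
kernel implementation: the toy_* types and the event log are invented
here purely to show that an optional pre-store step runs before the
freeze, and that attributes without one behave as before.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_LOG_MAX 64

/* Records the order of events so it can be inspected afterwards. */
struct toy_ctx {
	char log[TOY_LOG_MAX];
};

/* Append an event name to the comma-separated log, truncating if full. */
static void toy_log(struct toy_ctx *ctx, const char *event)
{
	size_t used = strlen(ctx->log);
	size_t room = sizeof(ctx->log) - used - 1;

	if (used && room) {
		strncat(ctx->log, ",", room);
		room--;
	}
	strncat(ctx->log, event, room);
}

/* Hypothetical attribute entry: pre_store is optional, store is required. */
struct toy_entry {
	int (*pre_store)(struct toy_ctx *ctx, const char *page);
	int (*store)(struct toy_ctx *ctx, const char *page);
};

static int toy_attr_store(struct toy_ctx *ctx, const struct toy_entry *entry,
			  const char *page)
{
	int res;

	if (!entry->store)
		return -EIO;

	/* Work that may wait on I/O (such as a module load) runs unfrozen. */
	if (entry->pre_store) {
		res = entry->pre_store(ctx, page);
		if (res)
			return res;
	}

	toy_log(ctx, "freeze");
	res = entry->store(ctx, page);
	toy_log(ctx, "unfreeze");
	return res;
}

/* Stand-in for the module load done ahead of the freeze. */
static int toy_sched_pre_store(struct toy_ctx *ctx, const char *page)
{
	(void)page;
	toy_log(ctx, "load");
	return 0;
}

/* Stand-in for the actual scheduler switch done under the freeze. */
static int toy_sched_store(struct toy_ctx *ctx, const char *page)
{
	(void)page;
	toy_log(ctx, "switch");
	return 0;
}
```

With both callbacks set, the recorded order is load, freeze, switch,
unfreeze; with pre_store left NULL the load step simply does not appear.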
Ming Lei Sept. 7, 2024, 9:48 a.m. UTC | #4
On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> On 9/7/24 16:58, Ming Lei wrote:
> > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> >>> When switching io scheduler via sysfs, 'request_module' may be called
> >>> if the specified scheduler doesn't exist.
> >>>
> >>> This was has deadlock risk because the module may be stored on FS behind
> >>> our disk since request queue is frozen before switching its elevator.
> >>>
> >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> >>> can be thought as one signal that the disk is mounted.
> >>>
> >>> Some distributions(Fedora) simulates the original kernel command line of
> >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> >>> hang is triggered.
> >>>
> >>> Cc: Richard Jones <rjones@redhat.com>
> >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>
> >> I'd suggest also:
> >>
> >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> >>
> >> So I have tested this patch and it does fix the issue, at the possible
> >> cost that now setting the scheduler can fail:
> >>
> >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> >>   + echo noop
> >>   /init: line 109: echo: write error: Resource deadlock avoided
> >>
> >> (I know I'm setting it to an impossible value here, but this could
> >> also happen when setting it to a valid one.)
> > 
> > Actually in most of dist, io-schedulers are built-in, so request_module
> > is just a nop, but meta IO must be started.
> > 
> >>
> >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> >> would probably mean that sometimes a desired setting is silently not
> >> set.
> > 
> > As I mentioned, io-schedulers are built-in for most of dist, so
> > request_module isn't called in case of one valid io-sched.
> > 
> >>
> >> Also I bisected this bug yesterday and found it was caused by (or,
> >> more likely, exposed by):
> >>
> >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> >>   Author: Christoph Hellwig <hch@lst.de>
> >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> >>
> >>     block: freeze the queue in queue_attr_store
> >>     
> >>     queue_attr_store updates attributes used to control generating I/O, and
> >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> >>     in common code instead of adding it to almost every attribute.
> >>
> >> Reverting this commit on top of git head also fixes the problem.
> >>
> >> Why did this commit expose the problem?
> > 
> > That is really the 1st bad commit which moves queue freezing before
> > calling request_module(), originally we won't freeze queue until
> > we have to do it.
> > 
> > Another candidate fix is to revert it, or at least not do it
> > for storing elevator attribute.
> 
> I do not think that reverting is acceptable. Rather, a proper fix would simply

Right, I remember that the freezing now also covers the update of
max_sectors_kb.

> be to do the request_module() before freezing the queue.
> Something like below should work (totally untested and that may be overkill).
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 60116d13cb80..aef87f6b4a8a 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -23,6 +23,7 @@
>  struct queue_sysfs_entry {
>         struct attribute attr;
>         ssize_t (*show)(struct gendisk *disk, char *page);
> +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);

Adding a new callback seems like overkill; another way is simply not to
freeze the queue when storing the elevator.

But if other attribute updates also need to avoid freezing the queue,
'pre_store' looks like a reasonable solution.

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 60116d13cb80..c418edf66f0c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -666,15 +666,24 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
 	struct request_queue *q = disk->queue;
 	ssize_t res;
+	bool need_freeze;
 
 	if (!entry->store)
 		return -EIO;
 
-	blk_mq_freeze_queue(q);
+	/*
+	 * storing scheduler freezes queue in its way, especially
+	 * loading scheduler module can't be done when queue is frozen
+	 */
+	need_freeze = (entry->store == elv_iosched_store);
+
+	if (need_freeze)
+		blk_mq_freeze_queue(q);
 	mutex_lock(&q->sysfs_lock);
 	res = entry->store(disk, page, length);
 	mutex_unlock(&q->sysfs_lock);
-	blk_mq_unfreeze_queue(q);
+	if (need_freeze)
+		blk_mq_unfreeze_queue(q);
 	return res;
 }
 

Thanks,
Ming
Richard W.M. Jones Sept. 7, 2024, 9:53 a.m. UTC | #5
On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> On 9/7/24 16:58, Ming Lei wrote:
> > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> >>> When switching io scheduler via sysfs, 'request_module' may be called
> >>> if the specified scheduler doesn't exist.
> >>>
> >>> This was has deadlock risk because the module may be stored on FS behind
> >>> our disk since request queue is frozen before switching its elevator.
> >>>
> >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> >>> can be thought as one signal that the disk is mounted.
> >>>
> >>> Some distributions(Fedora) simulates the original kernel command line of
> >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> >>> hang is triggered.
> >>>
> >>> Cc: Richard Jones <rjones@redhat.com>
> >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>
> >> I'd suggest also:
> >>
> >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> >>
> >> So I have tested this patch and it does fix the issue, at the possible
> >> cost that now setting the scheduler can fail:
> >>
> >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> >>   + echo noop
> >>   /init: line 109: echo: write error: Resource deadlock avoided
> >>
> >> (I know I'm setting it to an impossible value here, but this could
> >> also happen when setting it to a valid one.)
> > 
> > Actually in most of dist, io-schedulers are built-in, so request_module
> > is just a nop, but meta IO must be started.
> > 
> >>
> >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> >> would probably mean that sometimes a desired setting is silently not
> >> set.
> > 
> > As I mentioned, io-schedulers are built-in for most of dist, so
> > request_module isn't called in case of one valid io-sched.
> > 
> >>
> >> Also I bisected this bug yesterday and found it was caused by (or,
> >> more likely, exposed by):
> >>
> >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> >>   Author: Christoph Hellwig <hch@lst.de>
> >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> >>
> >>     block: freeze the queue in queue_attr_store
> >>     
> >>     queue_attr_store updates attributes used to control generating I/O, and
> >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> >>     in common code instead of adding it to almost every attribute.
> >>
> >> Reverting this commit on top of git head also fixes the problem.
> >>
> >> Why did this commit expose the problem?
> > 
> > That is really the 1st bad commit which moves queue freezing before
> > calling request_module(), originally we won't freeze queue until
> > we have to do it.
> > 
> > Another candidate fix is to revert it, or at least not do it
> > for storing elevator attribute.
> 
> I do not think that reverting is acceptable. Rather, a proper fix would simply
> be to do the request_module() before freezing the queue.
> Something like below should work (totally untested and that may be overkill).
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 60116d13cb80..aef87f6b4a8a 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -23,6 +23,7 @@
>  struct queue_sysfs_entry {
>         struct attribute attr;
>         ssize_t (*show)(struct gendisk *disk, char *page);
> +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
>         ssize_t (*store)(struct gendisk *disk, const char *page, size_t count);
>  };
> 
> @@ -413,6 +414,14 @@ static struct queue_sysfs_entry _prefix##_entry = {        \
>         .store  = _prefix##_store,                      \
>  };
> 
> +#define QUEUE_RPW_ENTRY(_prefix, _name)                        \
> +static struct queue_sysfs_entry _prefix##_entry = {    \
> +       .attr   = { .name = _name, .mode = 0644 },      \
> +       .show   = _prefix##_show,                       \
> +       .pre_store = _prefix##_pre_store,               \
> +       .store  = _prefix##_store,                      \
> +};
> +
>  QUEUE_RW_ENTRY(queue_requests, "nr_requests");
>  QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
>  QUEUE_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
> @@ -420,7 +429,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
>  QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
>  QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
>  QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
> -QUEUE_RW_ENTRY(elv_iosched, "scheduler");
> +QUEUE_RPW_ENTRY(elv_iosched, "scheduler");
> 
>  QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
>  QUEUE_RO_ENTRY(queue_physical_block_size, "physical_block_size");
> @@ -670,6 +679,12 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
>         if (!entry->store)
>                 return -EIO;
> 
> +       if (entry->pre_store) {
> +               res = entry->pre_store(disk, page, length);
> +               if (res)
> +                       return res;
> +       }
> +
>         blk_mq_freeze_queue(q);
>         mutex_lock(&q->sysfs_lock);
>         res = entry->store(disk, page, length);
> diff --git a/block/elevator.c b/block/elevator.c
> index f13d552a32c8..c338282d5148 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -698,17 +698,26 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
>                 return 0;
> 
>         e = elevator_find_get(q, elevator_name);
> -       if (!e) {
> -               request_module("%s-iosched", elevator_name);
> -               e = elevator_find_get(q, elevator_name);
> -               if (!e)
> -                       return -EINVAL;
> -       }
> +       if (!e)
> +               return -EINVAL;
>         ret = elevator_switch(q, e);
>         elevator_put(e);
>         return ret;
>  }
> 
> +int elv_iosched_pre_store(struct gendisk *disk, const char *buf,
> +                          size_t count)
> +{
> +       char elevator_name[ELV_NAME_MAX];
> +
> +       if (!elv_support_iosched(disk->queue))
> +               return -ENOTSUPP;
> +
> +       strscpy(elevator_name, buf, sizeof(elevator_name));
> +
> +       return request_module("%s-iosched", elevator_name);
> +}
> +
>  ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>                           size_t count)
>  {
> diff --git a/block/elevator.h b/block/elevator.h
> index 3fe18e1a8692..059172c0f93c 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -148,6 +148,7 @@ extern void elv_unregister(struct elevator_type *);
>   * io scheduler sysfs switching
>   */
>  ssize_t elv_iosched_show(struct gendisk *disk, char *page);
> +int elv_iosched_pre_store(struct gendisk *disk, const char *page, size_t count);
>  ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
> 
>  extern bool elv_bio_merge_ok(struct request *, struct bio *);

I tested this on top of current git head and it fixes the problem for me.

Rich.
Richard W.M. Jones Sept. 7, 2024, 10:02 a.m. UTC | #6
On Sat, Sep 07, 2024 at 05:48:44PM +0800, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> > On 9/7/24 16:58, Ming Lei wrote:
> > > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> > >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> > >>> When switching io scheduler via sysfs, 'request_module' may be called
> > >>> if the specified scheduler doesn't exist.
> > >>>
> > >>> This was has deadlock risk because the module may be stored on FS behind
> > >>> our disk since request queue is frozen before switching its elevator.
> > >>>
> > >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> > >>> can be thought as one signal that the disk is mounted.
> > >>>
> > >>> Some distributions(Fedora) simulates the original kernel command line of
> > >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> > >>> hang is triggered.
> > >>>
> > >>> Cc: Richard Jones <rjones@redhat.com>
> > >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> > >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > >>
> > >> I'd suggest also:
> > >>
> > >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> > >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> > >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> > >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> > >>
> > >> So I have tested this patch and it does fix the issue, at the possible
> > >> cost that now setting the scheduler can fail:
> > >>
> > >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> > >>   + echo noop
> > >>   /init: line 109: echo: write error: Resource deadlock avoided
> > >>
> > >> (I know I'm setting it to an impossible value here, but this could
> > >> also happen when setting it to a valid one.)
> > > 
> > > Actually in most of dist, io-schedulers are built-in, so request_module
> > > is just a nop, but meta IO must be started.
> > > 
> > >>
> > >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> > >> would probably mean that sometimes a desired setting is silently not
> > >> set.
> > > 
> > > As I mentioned, io-schedulers are built-in for most of dist, so
> > > request_module isn't called in case of one valid io-sched.
> > > 
> > >>
> > >> Also I bisected this bug yesterday and found it was caused by (or,
> > >> more likely, exposed by):
> > >>
> > >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> > >>   Author: Christoph Hellwig <hch@lst.de>
> > >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> > >>
> > >>     block: freeze the queue in queue_attr_store
> > >>     
> > >>     queue_attr_store updates attributes used to control generating I/O, and
> > >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> > >>     in common code instead of adding it to almost every attribute.
> > >>
> > >> Reverting this commit on top of git head also fixes the problem.
> > >>
> > >> Why did this commit expose the problem?
> > > 
> > > That is really the 1st bad commit which moves queue freezing before
> > > calling request_module(), originally we won't freeze queue until
> > > we have to do it.
> > > 
> > > Another candidate fix is to revert it, or at least not do it
> > > for storing elevator attribute.
> > 
> > I do not think that reverting is acceptable. Rather, a proper fix would simply
> 
> Right, I remember that the freezing starts to cover update of
> max_sectors_kb.
> 
> > be to do the request_module() before freezing the queue.
> > Something like below should work (totally untested and that may be overkill).
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 60116d13cb80..aef87f6b4a8a 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -23,6 +23,7 @@
> >  struct queue_sysfs_entry {
> >         struct attribute attr;
> >         ssize_t (*show)(struct gendisk *disk, char *page);
> > +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
> 
> It seems over-kill to add one new callback, and another way is just to
> not freeze queue for storing elevator.
> 
> But if other attribute update needs to not freeze queue, 'pre_store'
> looks one reasonable solution.
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 60116d13cb80..c418edf66f0c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -666,15 +666,24 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
>  	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
>  	struct request_queue *q = disk->queue;
>  	ssize_t res;
> +	bool need_freeze;
>  
>  	if (!entry->store)
>  		return -EIO;
>  
> -	blk_mq_freeze_queue(q);
> +	/*
> +	 * storing scheduler freezes queue in its way, especially
> +	 * loading scheduler module can't be done when queue is frozen
> +	 */
> +	need_freeze = (entry->store == elv_iosched_store);
> +
> +	if (need_freeze)
> +		blk_mq_freeze_queue(q);
>  	mutex_lock(&q->sysfs_lock);
>  	res = entry->store(disk, page, length);
>  	mutex_unlock(&q->sysfs_lock);
> -	blk_mq_unfreeze_queue(q);
> +	if (need_freeze)
> +		blk_mq_unfreeze_queue(q);
>  	return res;
>  }
>  

Unfortunately this doesn't fix the problem for me.  The test still
hangs occasionally in the same way as before.

Rich.
Ming Lei Sept. 7, 2024, 10:07 a.m. UTC | #7
On Sat, Sep 07, 2024 at 11:02:13AM +0100, Richard W.M. Jones wrote:
> On Sat, Sep 07, 2024 at 05:48:44PM +0800, Ming Lei wrote:
> > On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> > > On 9/7/24 16:58, Ming Lei wrote:
> > > > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> > > >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> > > >>> When switching io scheduler via sysfs, 'request_module' may be called
> > > >>> if the specified scheduler doesn't exist.
> > > >>>
> > > >>> This was has deadlock risk because the module may be stored on FS behind
> > > >>> our disk since request queue is frozen before switching its elevator.
> > > >>>
> > > >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> > > >>> can be thought as one signal that the disk is mounted.
> > > >>>
> > > >>> Some distributions(Fedora) simulates the original kernel command line of
> > > >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> > > >>> hang is triggered.
> > > >>>
> > > >>> Cc: Richard Jones <rjones@redhat.com>
> > > >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> > > >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> > > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > >>
> > > >> I'd suggest also:
> > > >>
> > > >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> > > >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> > > >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> > > >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> > > >>
> > > >> So I have tested this patch and it does fix the issue, at the possible
> > > >> cost that now setting the scheduler can fail:
> > > >>
> > > >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> > > >>   + echo noop
> > > >>   /init: line 109: echo: write error: Resource deadlock avoided
> > > >>
> > > >> (I know I'm setting it to an impossible value here, but this could
> > > >> also happen when setting it to a valid one.)
> > > > 
> > > > Actually in most of dist, io-schedulers are built-in, so request_module
> > > > is just a nop, but meta IO must be started.
> > > > 
> > > >>
> > > >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> > > >> would probably mean that sometimes a desired setting is silently not
> > > >> set.
> > > > 
> > > > As I mentioned, io-schedulers are built-in for most of dist, so
> > > > request_module isn't called in case of one valid io-sched.
> > > > 
> > > >>
> > > >> Also I bisected this bug yesterday and found it was caused by (or,
> > > >> more likely, exposed by):
> > > >>
> > > >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> > > >>   Author: Christoph Hellwig <hch@lst.de>
> > > >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> > > >>
> > > >>     block: freeze the queue in queue_attr_store
> > > >>     
> > > >>     queue_attr_store updates attributes used to control generating I/O, and
> > > >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> > > >>     in common code instead of adding it to almost every attribute.
> > > >>
> > > >> Reverting this commit on top of git head also fixes the problem.
> > > >>
> > > >> Why did this commit expose the problem?
> > > > 
> > > > That is really the 1st bad commit which moves queue freezing before
> > > > calling request_module(), originally we won't freeze queue until
> > > > we have to do it.
> > > > 
> > > > Another candidate fix is to revert it, or at least not do it
> > > > for storing elevator attribute.
> > > 
> > > I do not think that reverting is acceptable. Rather, a proper fix would simply
> > 
> > Right, I remember that the freezing starts to cover update of
> > max_sectors_kb.
> > 
> > > be to do the request_module() before freezing the queue.
> > > Something like below should work (totally untested and that may be overkill).
> > > 
> > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > index 60116d13cb80..aef87f6b4a8a 100644
> > > --- a/block/blk-sysfs.c
> > > +++ b/block/blk-sysfs.c
> > > @@ -23,6 +23,7 @@
> > >  struct queue_sysfs_entry {
> > >         struct attribute attr;
> > >         ssize_t (*show)(struct gendisk *disk, char *page);
> > > +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
> > 
> > It seems over-kill to add one new callback, and another way is just to
> > not freeze queue for storing elevator.
> > 
> > But if other attribute update needs to not freeze queue, 'pre_store'
> > looks one reasonable solution.
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 60116d13cb80..c418edf66f0c 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -666,15 +666,24 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
> >  	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
> >  	struct request_queue *q = disk->queue;
> >  	ssize_t res;
> > +	bool need_freeze;
> >  
> >  	if (!entry->store)
> >  		return -EIO;
> >  
> > -	blk_mq_freeze_queue(q);
> > +	/*
> > +	 * storing scheduler freezes queue in its way, especially
> > +	 * loading scheduler module can't be done when queue is frozen
> > +	 */
> > +	need_freeze = (entry->store == elv_iosched_store);
> > +
> > +	if (need_freeze)
> > +		blk_mq_freeze_queue(q);
> >  	mutex_lock(&q->sysfs_lock);
> >  	res = entry->store(disk, page, length);
> >  	mutex_unlock(&q->sysfs_lock);
> > -	blk_mq_unfreeze_queue(q);
> > +	if (need_freeze)
> > +		blk_mq_unfreeze_queue(q);
> >  	return res;
> >  }
> >  
> 
> Unfortunately this doesn't fix the problem for me.  The test still
> hangs occasionally in the same way as before.

The 'need_freeze' test is inverted; it needs to be:

	need_freeze = (entry->store != elv_iosched_store);
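
To make the effect of the inverted test concrete, here is a minimal
userspace sketch, not kernel code. The mock_queue struct, the stub store
handlers and model_queue_attr_store() are hypothetical stand-ins; only the
names elv_iosched_store and need_freeze are taken from the patch above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct request_queue. */
struct mock_queue {
	int freeze_depth;         /* > 0 while the queue is "frozen" */
	bool frozen_during_store; /* recorded by the stub store handlers */
};

typedef long (*store_fn)(struct mock_queue *q, const char *page, size_t len);

void mock_freeze(struct mock_queue *q)
{
	q->freeze_depth++;
}

void mock_unfreeze(struct mock_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/* Stub for the scheduler attribute; the real one may call request_module(). */
long elv_iosched_store(struct mock_queue *q, const char *page, size_t len)
{
	(void)page;
	q->frozen_during_store = q->freeze_depth > 0;
	return (long)len;
}

/* Stub for any other attribute (hypothetical). */
long other_attr_store(struct mock_queue *q, const char *page, size_t len)
{
	(void)page;
	q->frozen_during_store = q->freeze_depth > 0;
	return (long)len;
}

/* Model of the generic store path with the corrected test. */
long model_queue_attr_store(struct mock_queue *q, store_fn store,
			    const char *page, size_t len)
{
	bool need_freeze;
	long res;

	if (!store)
		return -EIO;

	/* Freeze for every attribute except the scheduler one. */
	need_freeze = (store != elv_iosched_store);

	if (need_freeze)
		mock_freeze(q);
	res = store(q, page, len);
	if (need_freeze)
		mock_unfreeze(q);
	return res;
}
```

With '==' instead of '!=', the scheduler store would be the only one that
runs with the queue frozen, which is the opposite of what is wanted.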


thanks,
Ming
Richard W.M. Jones Sept. 7, 2024, 10:36 a.m. UTC | #8
On Sat, Sep 07, 2024 at 06:07:21PM +0800, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 11:02:13AM +0100, Richard W.M. Jones wrote:
> > On Sat, Sep 07, 2024 at 05:48:44PM +0800, Ming Lei wrote:
> > > On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> > > > On 9/7/24 16:58, Ming Lei wrote:
> > > > > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> > > > >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> > > > >>> When switching io scheduler via sysfs, 'request_module' may be called
> > > > >>> if the specified scheduler doesn't exist.
> > > > >>>
> > > > >>> This was has deadlock risk because the module may be stored on FS behind
> > > > >>> our disk since request queue is frozen before switching its elevator.
> > > > >>>
> > > > >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> > > > >>> can be thought as one signal that the disk is mounted.
> > > > >>>
> > > > >>> Some distributions(Fedora) simulates the original kernel command line of
> > > > >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> > > > >>> hang is triggered.
> > > > >>>
> > > > >>> Cc: Richard Jones <rjones@redhat.com>
> > > > >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> > > > >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> > > > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > >>
> > > > >> I'd suggest also:
> > > > >>
> > > > >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> > > > >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> > > > >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> > > > >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> > > > >>
> > > > >> So I have tested this patch and it does fix the issue, at the possible
> > > > >> cost that now setting the scheduler can fail:
> > > > >>
> > > > >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> > > > >>   + echo noop
> > > > >>   /init: line 109: echo: write error: Resource deadlock avoided
> > > > >>
> > > > >> (I know I'm setting it to an impossible value here, but this could
> > > > >> also happen when setting it to a valid one.)
> > > > > 
> > > > > Actually in most of dist, io-schedulers are built-in, so request_module
> > > > > is just a nop, but meta IO must be started.
> > > > > 
> > > > >>
> > > > >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> > > > >> would probably mean that sometimes a desired setting is silently not
> > > > >> set.
> > > > > 
> > > > > As I mentioned, io-schedulers are built-in for most of dist, so
> > > > > request_module isn't called in case of one valid io-sched.
> > > > > 
> > > > >>
> > > > >> Also I bisected this bug yesterday and found it was caused by (or,
> > > > >> more likely, exposed by):
> > > > >>
> > > > >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> > > > >>   Author: Christoph Hellwig <hch@lst.de>
> > > > >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> > > > >>
> > > > >>     block: freeze the queue in queue_attr_store
> > > > >>     
> > > > >>     queue_attr_store updates attributes used to control generating I/O, and
> > > > >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> > > > >>     in common code instead of adding it to almost every attribute.
> > > > >>
> > > > >> Reverting this commit on top of git head also fixes the problem.
> > > > >>
> > > > >> Why did this commit expose the problem?
> > > > > 
> > > > > That is really the 1st bad commit which moves queue freezing before
> > > > > calling request_module(), originally we won't freeze queue until
> > > > > we have to do it.
> > > > > 
> > > > > Another candidate fix is to revert it, or at least not do it
> > > > > for storing elevator attribute.
> > > > 
> > > > I do not think that reverting is acceptable. Rather, a proper fix would simply
> > > 
> > > Right, I remember that the freezing starts to cover update of
> > > max_sectors_kb.
> > > 
> > > > be to do the request_module() before freezing the queue.
> > > > Something like below should work (totally untested and that may be overkill).
> > > > 
> > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > index 60116d13cb80..aef87f6b4a8a 100644
> > > > --- a/block/blk-sysfs.c
> > > > +++ b/block/blk-sysfs.c
> > > > @@ -23,6 +23,7 @@
> > > >  struct queue_sysfs_entry {
> > > >         struct attribute attr;
> > > >         ssize_t (*show)(struct gendisk *disk, char *page);
> > > > +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
> > > 
> > > It seems over-kill to add one new callback, and another way is just to
> > > not freeze queue for storing elevator.
> > > 
> > > But if other attribute update needs to not freeze queue, 'pre_store'
> > > looks one reasonable solution.
> > > 
> > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > index 60116d13cb80..c418edf66f0c 100644
> > > --- a/block/blk-sysfs.c
> > > +++ b/block/blk-sysfs.c
> > > @@ -666,15 +666,24 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
> > >  	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
> > >  	struct request_queue *q = disk->queue;
> > >  	ssize_t res;
> > > +	bool need_freeze;
> > >  
> > >  	if (!entry->store)
> > >  		return -EIO;
> > >  
> > > -	blk_mq_freeze_queue(q);
> > > +	/*
> > > +	 * storing scheduler freezes queue in its way, especially
> > > +	 * loading scheduler module can't be done when queue is frozen
> > > +	 */
> > > +	need_freeze = (entry->store == elv_iosched_store);
> > > +
> > > +	if (need_freeze)
> > > +		blk_mq_freeze_queue(q);
> > >  	mutex_lock(&q->sysfs_lock);
> > >  	res = entry->store(disk, page, length);
> > >  	mutex_unlock(&q->sysfs_lock);
> > > -	blk_mq_unfreeze_queue(q);
> > > +	if (need_freeze)
> > > +		blk_mq_unfreeze_queue(q);
> > >  	return res;
> > >  }
> > >  
> > 
> > Unfortunately this doesn't fix the problem for me.  The test still
> > hangs occasionally in the same way as before.
> 
> The 'need_freeze' test is inverted; it needs to be:
> 
> 	need_freeze = (entry->store != elv_iosched_store);

I'm still running the test (takes 5,000 boot iterations before I can
be "sure"), but so far it seems flipping this test fixes the bug.

This seems like the neatest (or shortest) fix so far, but doesn't it
"mix up layers" by checking elv_iosched_store?

Rich.
Richard W.M. Jones Sept. 7, 2024, 11:01 a.m. UTC | #9
On Sat, Sep 07, 2024 at 11:36:32AM +0100, Richard W.M. Jones wrote:
> I'm still running the test (takes 5,000 boot iterations before I can
> be "sure"), but so far it seems flipping this test fixes the bug.

This passed 5,000 iterations.

Rich.
Ming Lei Sept. 7, 2024, 11:02 a.m. UTC | #10
On Sat, Sep 07, 2024 at 11:36:32AM +0100, Richard W.M. Jones wrote:
> On Sat, Sep 07, 2024 at 06:07:21PM +0800, Ming Lei wrote:
> > On Sat, Sep 07, 2024 at 11:02:13AM +0100, Richard W.M. Jones wrote:
> > > On Sat, Sep 07, 2024 at 05:48:44PM +0800, Ming Lei wrote:
> > > > On Sat, Sep 07, 2024 at 06:04:59PM +0900, Damien Le Moal wrote:
> > > > > On 9/7/24 16:58, Ming Lei wrote:
> > > > > > On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> > > > > >> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> > > > > >>> When switching io scheduler via sysfs, 'request_module' may be called
> > > > > >>> if the specified scheduler doesn't exist.
> > > > > >>>
> > > > > >>> This was has deadlock risk because the module may be stored on FS behind
> > > > > >>> our disk since request queue is frozen before switching its elevator.
> > > > > >>>
> > > > > >>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> > > > > >>> can be thought as one signal that the disk is mounted.
> > > > > >>>
> > > > > >>> Some distributions(Fedora) simulates the original kernel command line of
> > > > > >>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> > > > > >>> hang is triggered.
> > > > > >>>
> > > > > >>> Cc: Richard Jones <rjones@redhat.com>
> > > > > >>> Cc: Jeff Moyer <jmoyer@redhat.com>
> > > > > >>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> > > > > >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > >>
> > > > > >> I'd suggest also:
> > > > > >>
> > > > > >> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> > > > > >> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> > > > > >> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> > > > > >> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> > > > > >>
> > > > > >> So I have tested this patch and it does fix the issue, at the possible
> > > > > >> cost that now setting the scheduler can fail:
> > > > > >>
> > > > > >>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> > > > > >>   + echo noop
> > > > > >>   /init: line 109: echo: write error: Resource deadlock avoided
> > > > > >>
> > > > > >> (I know I'm setting it to an impossible value here, but this could
> > > > > >> also happen when setting it to a valid one.)
> > > > > > 
> > > > > > Actually in most of dist, io-schedulers are built-in, so request_module
> > > > > > is just a nop, but meta IO must be started.
> > > > > > 
> > > > > >>
> > > > > >> Since almost no one checks the result of 'echo foo > /sys/...'  that
> > > > > >> would probably mean that sometimes a desired setting is silently not
> > > > > >> set.
> > > > > > 
> > > > > > As I mentioned, io-schedulers are built-in for most of dist, so
> > > > > > request_module isn't called in case of one valid io-sched.
> > > > > > 
> > > > > >>
> > > > > >> Also I bisected this bug yesterday and found it was caused by (or,
> > > > > >> more likely, exposed by):
> > > > > >>
> > > > > >>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> > > > > >>   Author: Christoph Hellwig <hch@lst.de>
> > > > > >>   Date:   Mon Jun 17 08:04:38 2024 +0200
> > > > > >>
> > > > > >>     block: freeze the queue in queue_attr_store
> > > > > >>     
> > > > > >>     queue_attr_store updates attributes used to control generating I/O, and
> > > > > >>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> > > > > >>     in common code instead of adding it to almost every attribute.
> > > > > >>
> > > > > >> Reverting this commit on top of git head also fixes the problem.
> > > > > >>
> > > > > >> Why did this commit expose the problem?
> > > > > > 
> > > > > > That is really the 1st bad commit which moves queue freezing before
> > > > > > calling request_module(), originally we won't freeze queue until
> > > > > > we have to do it.
> > > > > > 
> > > > > > Another candidate fix is to revert it, or at least not do it
> > > > > > for storing elevator attribute.
> > > > > 
> > > > > I do not think that reverting is acceptable. Rather, a proper fix would simply
> > > > 
> > > > Right, I remember that the freezing starts to cover update of
> > > > max_sectors_kb.
> > > > 
> > > > > be to do the request_module() before freezing the queue.
> > > > > Something like below should work (totally untested and that may be overkill).
> > > > > 
> > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > > index 60116d13cb80..aef87f6b4a8a 100644
> > > > > --- a/block/blk-sysfs.c
> > > > > +++ b/block/blk-sysfs.c
> > > > > @@ -23,6 +23,7 @@
> > > > >  struct queue_sysfs_entry {
> > > > >         struct attribute attr;
> > > > >         ssize_t (*show)(struct gendisk *disk, char *page);
> > > > > +       int (*pre_store)(struct gendisk *disk, const char *page, size_t count);
> > > > 
> > > > It seems over-kill to add one new callback, and another way is just to
> > > > not freeze queue for storing elevator.
> > > > 
> > > > But if other attribute update needs to not freeze queue, 'pre_store'
> > > > looks one reasonable solution.
> > > > 
> > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > index 60116d13cb80..c418edf66f0c 100644
> > > > --- a/block/blk-sysfs.c
> > > > +++ b/block/blk-sysfs.c
> > > > @@ -666,15 +666,24 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
> > > >  	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
> > > >  	struct request_queue *q = disk->queue;
> > > >  	ssize_t res;
> > > > +	bool need_freeze;
> > > >  
> > > >  	if (!entry->store)
> > > >  		return -EIO;
> > > >  
> > > > -	blk_mq_freeze_queue(q);
> > > > +	/*
> > > > +	 * storing scheduler freezes queue in its way, especially
> > > > +	 * loading scheduler module can't be done when queue is frozen
> > > > +	 */
> > > > +	need_freeze = (entry->store == elv_iosched_store);
> > > > +
> > > > +	if (need_freeze)
> > > > +		blk_mq_freeze_queue(q);
> > > >  	mutex_lock(&q->sysfs_lock);
> > > >  	res = entry->store(disk, page, length);
> > > >  	mutex_unlock(&q->sysfs_lock);
> > > > -	blk_mq_unfreeze_queue(q);
> > > > +	if (need_freeze)
> > > > +		blk_mq_unfreeze_queue(q);
> > > >  	return res;
> > > >  }
> > > >  
> > > 
> > > Unfortunately this doesn't fix the problem for me.  The test still
> > > hangs occasionally in the same way as before.
> > 
> > The 'need_freeze' test is inverted; it needs to be:
> > 
> > 	need_freeze = (entry->store != elv_iosched_store);
> 
> I'm still running the test (takes 5,000 boot iterations before I can
> be "sure"), but so far it seems flipping this test fixes the bug.

BTW, the issue can be reproduced 100% by:

echo "deadlock" > /sys/block/$ROOT_DISK/queue/scheduler

> 
> This seems like the neatest (or shortest) fix so far, but doesn't it
> "mix up layers" by checking elv_iosched_store?

It is just one exception for the 'scheduler' sysfs attribute wrt. freezing
the queue for storing, and the check can be done via the attribute
name ("scheduler") too.
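
A name-based check could look roughly like the following userspace sketch.
struct mock_attribute and attr_needs_freeze() are hypothetical names used
for illustration only; the one detail taken from the thread is the
attribute name "scheduler".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct attribute. */
struct mock_attribute {
	const char *name;
};

/*
 * Returns true when the generic store path should freeze the queue.
 * Only the "scheduler" attribute is exempt, because its store handler
 * may need to load a module and handles freezing in its own way.
 */
bool attr_needs_freeze(const struct mock_attribute *attr)
{
	return strcmp(attr->name, "scheduler") != 0;
}
```

This avoids referencing elv_iosched_store from the generic sysfs code, at
the cost of a string comparison on each store.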


Thanks, 
Ming
Richard W.M. Jones Sept. 7, 2024, 11:14 a.m. UTC | #11
On Sat, Sep 07, 2024 at 07:02:30PM +0800, Ming Lei wrote:
> BTW, the issue can be reproduced 100% by:
> 
> echo "deadlock" > /sys/block/$ROOT_DISK/queue/scheduler

That doesn't reproduce it for me (reliably).  Although I'm not
surprised as this bug has been _very_ tricky to reproduce!  Sometimes
I think I have a definite reproducer, only for it to go away when some
tiny detail changes.

> > This seems like the neatest (or shortest) fix so far, but doesn't it
> > "mix up layers" by checking elv_iosched_store?
> 
> It is just one exception for the 'scheduler' sysfs attribute wrt. freezing
> the queue for storing, and the check can be done via the attribute
> name ("scheduler") too.

Fair enough.

Rich.
Jens Axboe Sept. 7, 2024, 1:50 p.m. UTC | #12
On 9/7/24 3:04 AM, Damien Le Moal wrote:
> On 9/7/24 16:58, Ming Lei wrote:
>> On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
>>> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
>>>> When switching io scheduler via sysfs, 'request_module' may be called
>>>> if the specified scheduler doesn't exist.
>>>>
>>>> This was has deadlock risk because the module may be stored on FS behind
>>>> our disk since request queue is frozen before switching its elevator.
>>>>
>>>> Fix it by returning -EDEADLK in case that the disk is claimed, which
>>>> can be thought as one signal that the disk is mounted.
>>>>
>>>> Some distributions(Fedora) simulates the original kernel command line of
>>>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
>>>> hang is triggered.
>>>>
>>>> Cc: Richard Jones <rjones@redhat.com>
>>>> Cc: Jeff Moyer <jmoyer@redhat.com>
>>>> Cc: Jiri Jaburek <jjaburek@redhat.com>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>
>>> I'd suggest also:
>>>
>>> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
>>> Reported-by: Richard W.M. Jones <rjones@redhat.com>
>>> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
>>> Tested-by: Richard W.M. Jones <rjones@redhat.com>
>>>
>>> So I have tested this patch and it does fix the issue, at the possible
>>> cost that now setting the scheduler can fail:
>>>
>>>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
>>>   + echo noop
>>>   /init: line 109: echo: write error: Resource deadlock avoided
>>>
>>> (I know I'm setting it to an impossible value here, but this could
>>> also happen when setting it to a valid one.)
>>
>> Actually in most of dist, io-schedulers are built-in, so request_module
>> is just a nop, but meta IO must be started.
>>
>>>
>>> Since almost no one checks the result of 'echo foo > /sys/...'  that
>>> would probably mean that sometimes a desired setting is silently not
>>> set.
>>
>> As I mentioned, io-schedulers are built-in for most of dist, so
>> request_module isn't called in case of one valid io-sched.
>>
>>>
>>> Also I bisected this bug yesterday and found it was caused by (or,
>>> more likely, exposed by):
>>>
>>>   commit af2814149883e2c1851866ea2afcd8eadc040f79
>>>   Author: Christoph Hellwig <hch@lst.de>
>>>   Date:   Mon Jun 17 08:04:38 2024 +0200
>>>
>>>     block: freeze the queue in queue_attr_store
>>>     
>>>     queue_attr_store updates attributes used to control generating I/O, and
>>>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
>>>     in common code instead of adding it to almost every attribute.
>>>
>>> Reverting this commit on top of git head also fixes the problem.
>>>
>>> Why did this commit expose the problem?
>>
>> That is really the 1st bad commit which moves queue freezing before
>> calling request_module(), originally we won't freeze queue until
>> we have to do it.
>>
>> Another candidate fix is to revert it, or at least not do it
>> for storing elevator attribute.
> 
> I do not think that reverting is acceptable. Rather, a proper fix would simply
> be to do the request_module() before freezing the queue.
> Something like below should work (totally untested and that may be overkill).

I like this approach, but let's please call it something descriptive
like "load_module" or something like that.
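
As a rough illustration of that shape, here is a userspace sketch, not the
actual patch. It assumes the hook is simply invoked before the queue is
frozen; mock_queue, mock_sysfs_entry and the stub handlers are hypothetical,
and only the hook name and its argument list mirror what is proposed in
this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct request_queue. */
struct mock_queue {
	int freeze_depth;     /* > 0 while the queue is "frozen" */
	bool loaded_unfrozen; /* hook ran while the queue was not frozen */
	bool stored_frozen;   /* store ran while the queue was frozen */
};

/* Hypothetical model of struct queue_sysfs_entry with the extra hook. */
struct mock_sysfs_entry {
	/* Optional: runs before the freeze, e.g. to call request_module(). */
	int (*load_module)(struct mock_queue *q, const char *page, size_t count);
	long (*store)(struct mock_queue *q, const char *page, size_t count);
};

/* Stub hook: records whether it ran with the queue unfrozen. */
int sched_load_module(struct mock_queue *q, const char *page, size_t count)
{
	(void)page;
	(void)count;
	q->loaded_unfrozen = (q->freeze_depth == 0);
	return 0;
}

/* Stub store: records whether it ran with the queue frozen. */
long sched_store(struct mock_queue *q, const char *page, size_t count)
{
	(void)page;
	q->stored_frozen = (q->freeze_depth > 0);
	return (long)count;
}

/* Model of the generic store path: load first, then freeze and store. */
long model_queue_attr_store(struct mock_queue *q,
			    const struct mock_sysfs_entry *entry,
			    const char *page, size_t count)
{
	long res;

	if (!entry->store)
		return -EIO;

	/* Anything that may trigger I/O happens before the freeze. */
	if (entry->load_module) {
		int err = entry->load_module(q, page, count);

		if (err)
			return err;
	}

	q->freeze_depth++;
	res = entry->store(q, page, count);
	q->freeze_depth--;
	return res;
}
```

Attributes that leave the hook NULL keep the existing behaviour, so every
store still runs with the queue frozen.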
Damien Le Moal Sept. 8, 2024, 12:02 a.m. UTC | #13
On 9/7/24 20:14, Richard W.M. Jones wrote:
> On Sat, Sep 07, 2024 at 07:02:30PM +0800, Ming Lei wrote:
>> BTW, the issue can be reproduced 100% by:
>>
>> echo "deadlock" > /sys/block/$ROOT_DISK/queue/scheduler

This probably should be:

echo "mq-deadline" > /sys/block/$ROOT_DISK/queue/scheduler

and make sure that:
1) mq-deadline is compiled as a module
2) mq-deadline is not already used by a device (so not loaded already)
3) The mq-deadline module file is stored on the target device of the scheduler
change
4) The mq-deadline module file is not already cached in the page cache.

For (4), you may want to do a "echo 3 > /proc/sys/vm/drop_caches" before trying
to switch the scheduler.

> 
> That doesn't reproduce it for me (reliably).  Although I'm not
> surprised as this bug has been _very_ tricky to reproduce!  Sometimes
> I think I have a definite reproducer, only for it to go away when some
> tiny detail changes.
> 
>>> This seems like the neatest (or shortest) fix so far, but doesn't it
>>> "mix up layers" by checking elv_iosched_store?
>>
>> It is just one exception for the 'scheduler' sysfs attribute wrt. freezing
>> the queue for storing, and the check can be done via the attribute
>> name ("scheduler") too.
> 
> Fair enough.
> 
> Rich.
>
Ming Lei Sept. 9, 2024, 1 a.m. UTC | #14
On Sun, Sep 8, 2024 at 8:03 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 9/7/24 20:14, Richard W.M. Jones wrote:
> > On Sat, Sep 07, 2024 at 07:02:30PM +0800, Ming Lei wrote:
> >> BTW, the issue can be reproduced 100% by:
> >>
> >> echo "deadlock" > /sys/block/$ROOT_DISK/queue/scheduler
>
> This probably should be:
>
> echo "mq-deadline" > /sys/block/$ROOT_DISK/queue/scheduler
>
> and make sure that:
> 1) mq-deadline is compiled as a module
> 2) mq-deadline is not already used by a device (so not loaded already)
> 3) The mq-deadline module file is stored on the target device of the scheduler
> change
> 4) The mq-deadline module file is not already cached in the page cache.
>
> For (4), you may want to do a "echo 3 > /proc/sys/vm/drop_caches" before trying
> to switch the scheduler.
>
> >
> > That doesn't reproduce it for me (reliably).  Although I'm not
> > surprised as this bug has been _very_ tricky to reproduce!  Sometimes
> > I think I have a definite reproducer, only for it to go away when some
> > tiny detail changes.
> >
> >>> This seems like the neatest (or shortest) fix so far, but doesn't it
> >>> "mix up layers" by checking elv_iosched_store?
> >>
> >> It is just one exception for the 'scheduler' sysfs attribute wrt. freezing
> >> the queue for storing, and the check can be done via the attribute
> >> name ("scheduler") too.
> >
> > Fair enough.
> >
> > Rich.
> >
>
> --
> Damien Le Moal
> Western Digital Research
>
Ming Lei Sept. 9, 2024, 1:01 a.m. UTC | #15
On Sun, Sep 8, 2024 at 8:03 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 9/7/24 20:14, Richard W.M. Jones wrote:
> > On Sat, Sep 07, 2024 at 07:02:30PM +0800, Ming Lei wrote:
> >> BTW, the issue can be reproduced 100% by:
> >>
> >> echo "deadlock" > /sys/block/$ROOT_DISK/queue/scheduler
>
> This probably should be:
>
> echo "mq-deadline" > /sys/block/$ROOT_DISK/queue/scheduler

No, it is deliberately the name of a scheduler that doesn't exist.

If that doesn't reproduce it, dropping the cache can be added before the
switch.

Thanks,
Ming Lei Sept. 9, 2024, 1:24 a.m. UTC | #16
On Sat, Sep 07, 2024 at 07:50:32AM -0600, Jens Axboe wrote:
> On 9/7/24 3:04 AM, Damien Le Moal wrote:
> > On 9/7/24 16:58, Ming Lei wrote:
> >> On Sat, Sep 07, 2024 at 08:35:22AM +0100, Richard W.M. Jones wrote:
> >>> On Sat, Sep 07, 2024 at 09:43:31AM +0800, Ming Lei wrote:
> >>>> When switching io scheduler via sysfs, 'request_module' may be called
> >>>> if the specified scheduler doesn't exist.
> >>>>
> >>>> This was has deadlock risk because the module may be stored on FS behind
> >>>> our disk since request queue is frozen before switching its elevator.
> >>>>
> >>>> Fix it by returning -EDEADLK in case that the disk is claimed, which
> >>>> can be thought as one signal that the disk is mounted.
> >>>>
> >>>> Some distributions(Fedora) simulates the original kernel command line of
> >>>> 'elevator=foo' via 'echo foo > /sys/block/$DISK/queue/scheduler', and boot
> >>>> hang is triggered.
> >>>>
> >>>> Cc: Richard Jones <rjones@redhat.com>
> >>>> Cc: Jeff Moyer <jmoyer@redhat.com>
> >>>> Cc: Jiri Jaburek <jjaburek@redhat.com>
> >>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>
> >>> I'd suggest also:
> >>>
> >>> Bug: https://bugzilla.kernel.org/show_bug.cgi?id=219166
> >>> Reported-by: Richard W.M. Jones <rjones@redhat.com>
> >>> Reported-by: Jiri Jaburek <jjaburek@redhat.com>
> >>> Tested-by: Richard W.M. Jones <rjones@redhat.com>
> >>>
> >>> So I have tested this patch and it does fix the issue, at the possible
> >>> cost that now setting the scheduler can fail:
> >>>
> >>>   + for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
> >>>   + echo noop
> >>>   /init: line 109: echo: write error: Resource deadlock avoided
> >>>
> >>> (I know I'm setting it to an impossible value here, but this could
> >>> also happen when setting it to a valid one.)
> >>
> >> Actually in most of dist, io-schedulers are built-in, so request_module
> >> is just a nop, but meta IO must be started.
> >>
> >>>
> >>> Since almost no one checks the result of 'echo foo > /sys/...'  that
> >>> would probably mean that sometimes a desired setting is silently not
> >>> set.
> >>
> >> As I mentioned, io-schedulers are built-in for most of dist, so
> >> request_module isn't called in case of one valid io-sched.
> >>
> >>>
> >>> Also I bisected this bug yesterday and found it was caused by (or,
> >>> more likely, exposed by):
> >>>
> >>>   commit af2814149883e2c1851866ea2afcd8eadc040f79
> >>>   Author: Christoph Hellwig <hch@lst.de>
> >>>   Date:   Mon Jun 17 08:04:38 2024 +0200
> >>>
> >>>     block: freeze the queue in queue_attr_store
> >>>     
> >>>     queue_attr_store updates attributes used to control generating I/O, and
> >>>     can cause malformed bios if changed with I/O in flight.  Freeze the queue
> >>>     in common code instead of adding it to almost every attribute.
> >>>
> >>> Reverting this commit on top of git head also fixes the problem.
> >>>
> >>> Why did this commit expose the problem?
> >>
> >> That is really the 1st bad commit which moves queue freezing before
> >> calling request_module(), originally we won't freeze queue until
> >> we have to do it.
> >>
> >> Another candidate fix is to revert it, or at least not do it
> >> for storing elevator attribute.
> > 
> > I do not think that reverting is acceptable. Rather, a proper fix would simply
> > be to do the request_module() before freezing the queue.
> > Something like below should work (totally untested and that may be overkill).
> 
> I like this approach, but let's please call it something descriptive
> like "load_module" or something like that.

But 'load_module' is too specific as an interface, and we have only one
case which needs to load a module.

I guess there may be the same risk in queue_wb_lat_store(), which calls into
a GFP_KERNEL allocation, which implies direct reclaim & IO.

Thanks,
Ming
Damien Le Moal Sept. 9, 2024, 1:56 a.m. UTC | #17
On 9/9/24 10:24, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 07:50:32AM -0600, Jens Axboe wrote:
>> On 9/7/24 3:04 AM, Damien Le Moal wrote:
>>> I do not think that reverting is acceptable. Rather, a proper fix would simply
>>> be to do the request_module() before freezing the queue.
>>> Something like below should work (totally untested and that may be overkill).
>>
>> I like this approach, but let's please call it something descriptive
>> like "load_module" or something like that.
> 
> But 'load_module' is too specific as an interface, and we have only
> one case that actually needs to load a module.
> 
> I guess the same risk may exist in queue_wb_lat_store(), which does a
> GFP_KERNEL allocation, and that implies direct reclaim and I/O.

That needs to be changed to GFP_NOIO.

Damien Le Moal Sept. 9, 2024, 1:59 a.m. UTC | #18
On 9/9/24 10:24, Ming Lei wrote:
> On Sat, Sep 07, 2024 at 07:50:32AM -0600, Jens Axboe wrote:
>> On 9/7/24 3:04 AM, Damien Le Moal wrote:
>>> I do not think that reverting is acceptable. Rather, a proper fix would simply
>>> be to do the request_module() before freezing the queue.
>>> Something like below should work (totally untested and that may be overkill).
>>
>> I like this approach, but let's please call it something descriptive
>> like "load_module" or something like that.
> 
> But 'load_module' is too specific as an interface, and we have only
> one case that actually needs to load a module.

If another attr needs to do some prep work before freezing the queue and calling
attr->store(), we can rename the load_module attribute method to something like
"prepare_store" to be more generic.

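The ordering discussed here (do the preparation work first, then freeze the
queue, then call attr->store()) can be sketched as a small userspace model.
This is only an illustration under stated assumptions: struct toy_queue,
struct toy_attr, toy_load_module() and toy_attr_store() are hypothetical
stand-ins, not the kernel's types or functions, and "prepare_store" is just
the name floated in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct request_queue: tracks only what the model needs. */
struct toy_queue {
	bool frozen;
	bool module_loaded;
	int io_while_frozen;	/* I/O attempted while frozen: the deadlock case */
};

/* Toy queue attribute with an optional pre-freeze hook (hypothetical name). */
struct toy_attr {
	int (*prepare_store)(struct toy_queue *q, const char *buf);
	int (*store)(struct toy_queue *q, const char *buf);
};

/* Models request_module(): reading the module from disk issues I/O. */
void toy_load_module(struct toy_queue *q)
{
	if (q->frozen)
		q->io_while_frozen++;	/* the real kernel would hang here */
	q->module_loaded = true;
}

/* Pre-freeze hook: load the scheduler module while I/O can still flow. */
int elv_prepare(struct toy_queue *q, const char *buf)
{
	(void)buf;
	toy_load_module(q);
	return 0;
}

/* The store method proper; it only ever runs with the queue frozen. */
int elv_store(struct toy_queue *q, const char *buf)
{
	(void)buf;
	assert(q->frozen);
	return q->module_loaded ? 0 : -EINVAL;
}

/* Problematic ordering for contrast: the module load happens inside store(). */
int elv_store_loads_module(struct toy_queue *q, const char *buf)
{
	toy_load_module(q);
	return elv_store(q, buf);
}

/* Models the common store path: prepare, freeze, store, unfreeze. */
int toy_attr_store(struct toy_queue *q, const struct toy_attr *a, const char *buf)
{
	int ret;

	if (a->prepare_store) {
		ret = a->prepare_store(q, buf);
		if (ret)
			return ret;
	}
	q->frozen = true;
	ret = a->store(q, buf);
	q->frozen = false;
	return ret;
}
```

With the hook set to elv_prepare, toy_attr_store() never issues I/O while the
queue is frozen. With no hook and elv_store_loads_module as the store method,
io_while_frozen is incremented, which is the situation this thread is trying
to avoid.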
> 
> I guess the same risk may exist in queue_wb_lat_store(), which does a
> GFP_KERNEL allocation, and that implies direct reclaim and I/O.
Ming Lei Sept. 9, 2024, 2:16 a.m. UTC | #19
On Mon, Sep 09, 2024 at 10:59:00AM +0900, Damien Le Moal wrote:
> On 9/9/24 10:24, Ming Lei wrote:
> > On Sat, Sep 07, 2024 at 07:50:32AM -0600, Jens Axboe wrote:
> >> On 9/7/24 3:04 AM, Damien Le Moal wrote:
> >>> I do not think that reverting is acceptable. Rather, a proper fix would simply
> >>> be to do the request_module() before freezing the queue.
> >>> Something like below should work (totally untested and that may be overkill).
> >>
> >> I like this approach, but let's please call it something descriptive
> >> like "load_module" or something like that.
> > 
> > But 'load_module' is too specific as an interface, and we have only
> > one case that actually needs to load a module.
> 
> If another attr needs to do some prep work before freezing the queue and calling
> attr->store(), we can rename the load_module attribute method to something like
> "prepare_store" to be more generic.

An interface is supposed to be generic from the beginning, and I don't think
we will have another 'load_module' case here.


Thanks,
Ming
Patch

diff --git a/block/elevator.c b/block/elevator.c
index f13d552a32c8..2b0432f4ac33 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -676,6 +676,13 @@  void elevator_disable(struct request_queue *q)
 	blk_mq_unfreeze_queue(q);
 }
 
+static bool disk_is_claimed(struct gendisk *disk)
+{
+	if (disk->part0->bd_holder)
+		return true;
+	return false;
+}
+
 /*
  * Switch this queue to the given IO scheduler.
  */
@@ -699,6 +706,13 @@  static int elevator_change(struct request_queue *q, const char *elevator_name)
 
 	e = elevator_find_get(q, elevator_name);
 	if (!e) {
+		/*
+		 * Avoid loading the iosched module from a filesystem behind
+		 * this disk; otherwise a deadlock may be triggered.
+		 */
+		if (disk_is_claimed(q->disk))
+			return -EDEADLK;
+
 		request_module("%s-iosched", elevator_name);
 		e = elevator_find_get(q, elevator_name);
 		if (!e)
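
The behaviour this patch introduces can be modelled in userspace as follows.
This is a rough sketch, not the kernel code: struct toy_disk,
toy_sched_registered() and toy_elevator_change() are hypothetical stand-ins,
the set of registered schedulers is made up for the example, and the module
load is assumed to fail so that only the control flow is shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a gendisk; holder models disk->part0->bd_holder. */
struct toy_disk {
	const void *holder;	/* non-NULL when the disk is claimed */
};

/* Models elevator_find_get(): assume only these two are registered. */
bool toy_sched_registered(const char *name)
{
	return strcmp(name, "none") == 0 || strcmp(name, "mq-deadline") == 0;
}

/* Models the patched lookup path in elevator_change(). */
int toy_elevator_change(const struct toy_disk *disk, const char *name)
{
	assert(disk != NULL && name != NULL);

	if (toy_sched_registered(name))
		return 0;		/* already available, no module load */

	if (disk->holder != NULL)
		return -EDEADLK;	/* claimed disk: refuse to load a module */

	/* A module load would be attempted here; assume it finds nothing. */
	return -EINVAL;
}
```

In this model, an unregistered scheduler name yields -EDEADLK on a claimed
disk and -EINVAL on an unclaimed one, while a registered name succeeds in
both cases. Userspace sees -EDEADLK as "Resource deadlock avoided".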