ata,scsi: do not issue START STOP UNIT on resume

Message ID	20230731003956.572414-1-dlemoal@kernel.org (mailing list archive)
State	Not Applicable
Headers	show Return-Path: <linux-scsi-owner@vger.kernel.org> From: Damien Le Moal <dlemoal@kernel.org> To: linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, "Martin K . Petersen" <martin.petersen@oracle.com> Cc: Paul Ausbeck <paula@soe.ucsc.edu>, Thorsten Leemhuis <regressions@leemhuis.info>, TW <dalzot@gmail.com>, regressions@lists.linux.dev, Bart Van Assche <bvanassche@acm.org> Subject: [PATCH] ata,scsi: do not issue START STOP UNIT on resume Date: Mon, 31 Jul 2023 09:39:56 +0900 Message-ID: <20230731003956.572414-1-dlemoal@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	ata,scsi: do not issue START STOP UNIT on resume \| expand ata,scsi: do not issue START STOP UNIT on resume

Damien Le Moal July 31, 2023, 12:39 a.m. UTC

During system resume, ata_port_pm_resume() triggers ata EH to
1) Resume the controller
2) Reset and rescan the ports
3) Revalidate devices
This EH execution is started asynchronously from ata_port_pm_resume(),
which means that when sd_resume() is executed, none or only part of the
above processing may have been executed. However, sd_resume() issues a
START STOP UNIT to wake up the drive from sleep mode. This command is
translated to ATA with ata_scsi_start_stop_xlat() and issued to the
device. However, depending on the state of execution of the EH process
and revalidation triggerred by ata_port_pm_resume(), two things may
happen:
1) The START STOP UNIT fails if it is received before the controller has
   been reenabled at the beginning of the EH execution. This is visible
   with error messages like:

ata10.00: device reported invalid CHS sector 0
sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
sd 9:0:0:0: PM: failed to resume async: error -5

2) The START STOP UNIT command is received while the EH process is
   on-going, which mean that it is stopped and must wait for its
   completion, at which point the command is rather useless as the drive
   is already fully spun up already. This case results also in a
   significant delay in sd_resume() which is observable by users as
   the entire system resume completion is delayed.

Given that ATA devices will be woken up by libata activity on resume,
sd_resume() has no need to issue a START STOP UNIT command, which solves
the above mentioned problems. Do not issue this command by introducing
the new scsi_device flag no_start_on_resume and setting this flag to 1
in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
UNIT command only if this flag is not set.

Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/ata/libata-scsi.c  | 7 +++++++
 drivers/scsi/sd.c          | 9 ++++++---
 include/scsi/scsi_device.h | 1 +
 3 files changed, 14 insertions(+), 3 deletions(-)

Tanner Watkins July 31, 2023, 3:48 a.m. UTC | #1

Tested-by: Tanner Watkins

On 7/30/23 18:39, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>     been reenabled at the beginning of the EH execution. This is visible
>     with error messages like:
>
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
>
> 2) The START STOP UNIT command is received while the EH process is
>     on-going, which mean that it is stopped and must wait for its
>     completion, at which point the command is rather useless as the drive
>     is already fully spun up already. This case results also in a
>     significant delay in sd_resume() which is observable by users as
>     the entire system resume completion is delayed.
>
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.
>
> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-scsi.c  | 7 +++++++
>   drivers/scsi/sd.c          | 9 ++++++---
>   include/scsi/scsi_device.h | 1 +
>   3 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 370d18aca71e..c6ece32de8e3 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1100,7 +1100,14 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
>   		}
>   	} else {
>   		sdev->sector_size = ata_id_logical_sector_size(dev->id);
> +		/*
> +		 * Stop the drive on suspend but do not issue START STOP UNIT
> +		 * on resume as this is not necessary and may fail: the device
> +		 * will be woken up by ata_port_pm_resume() with a port reset
> +		 * and device revalidation.
> +		 */
>   		sdev->manage_start_stop = 1;
> +		sdev->no_start_on_resume = 1;
>   	}
>   
>   	/*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 68b12afa0721..3c668cfb146d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
>   static int sd_resume(struct device *dev)
>   {
>   	struct scsi_disk *sdkp = dev_get_drvdata(dev);
> -	int ret;
> +	int ret = 0;
>   
>   	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
>   		return 0;
> @@ -3884,8 +3884,11 @@ static int sd_resume(struct device *dev)
>   	if (!sdkp->device->manage_start_stop)
>   		return 0;
>   
> -	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> -	ret = sd_start_stop_device(sdkp, 1);
> +	if (!sdkp->device->no_start_on_resume) {
> +		sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> +		ret = sd_start_stop_device(sdkp, 1);
> +	}
> +
>   	if (!ret)
>   		opal_unlock_from_suspend(sdkp->opal_dev);
>   	return ret;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..b9230b6add04 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -194,6 +194,7 @@ struct scsi_device {
>   	unsigned no_start_on_add:1;	/* do not issue start on add */
>   	unsigned allow_restart:1; /* issue START_UNIT in error handler */
>   	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
> +	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
>   	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
>   	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
>   	unsigned select_no_atn:1;

Damien Le Moal July 31, 2023, 4:44 a.m. UTC | #2

On 7/31/23 12:48, TW wrote:
> Tested-by: Tanner Watkins

A tested-by tag must have the form:

Tested-by: Tanner Watkins <dalzot@gmail.com>

Also, please do not top-post. That is, reply with your text (this tag), below
the original message (which you can trim).

Thanks.

> 
> On 7/30/23 18:39, Damien Le Moal wrote:
>> During system resume, ata_port_pm_resume() triggers ata EH to
>> 1) Resume the controller
>> 2) Reset and rescan the ports
>> 3) Revalidate devices
>> This EH execution is started asynchronously from ata_port_pm_resume(),
>> which means that when sd_resume() is executed, none or only part of the
>> above processing may have been executed. However, sd_resume() issues a
>> START STOP UNIT to wake up the drive from sleep mode. This command is
>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>> device. However, depending on the state of execution of the EH process
>> and revalidation triggerred by ata_port_pm_resume(), two things may
>> happen:
>> 1) The START STOP UNIT fails if it is received before the controller has
>>     been reenabled at the beginning of the EH execution. This is visible
>>     with error messages like:
>>
>> ata10.00: device reported invalid CHS sector 0
>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> sd 9:0:0:0: PM: failed to resume async: error -5
>>
>> 2) The START STOP UNIT command is received while the EH process is
>>     on-going, which mean that it is stopped and must wait for its
>>     completion, at which point the command is rather useless as the drive
>>     is already fully spun up already. This case results also in a
>>     significant delay in sd_resume() which is observable by users as
>>     the entire system resume completion is delayed.
>>
>> Given that ATA devices will be woken up by libata activity on resume,
>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>> the above mentioned problems. Do not issue this command by introducing
>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>> UNIT command only if this flag is not set.
>>
>> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
>> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>   drivers/ata/libata-scsi.c  | 7 +++++++
>>   drivers/scsi/sd.c          | 9 ++++++---
>>   include/scsi/scsi_device.h | 1 +
>>   3 files changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index 370d18aca71e..c6ece32de8e3 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -1100,7 +1100,14 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
>>   		}
>>   	} else {
>>   		sdev->sector_size = ata_id_logical_sector_size(dev->id);
>> +		/*
>> +		 * Stop the drive on suspend but do not issue START STOP UNIT
>> +		 * on resume as this is not necessary and may fail: the device
>> +		 * will be woken up by ata_port_pm_resume() with a port reset
>> +		 * and device revalidation.
>> +		 */
>>   		sdev->manage_start_stop = 1;
>> +		sdev->no_start_on_resume = 1;
>>   	}
>>   
>>   	/*
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 68b12afa0721..3c668cfb146d 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
>>   static int sd_resume(struct device *dev)
>>   {
>>   	struct scsi_disk *sdkp = dev_get_drvdata(dev);
>> -	int ret;
>> +	int ret = 0;
>>   
>>   	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
>>   		return 0;
>> @@ -3884,8 +3884,11 @@ static int sd_resume(struct device *dev)
>>   	if (!sdkp->device->manage_start_stop)
>>   		return 0;
>>   
>> -	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
>> -	ret = sd_start_stop_device(sdkp, 1);
>> +	if (!sdkp->device->no_start_on_resume) {
>> +		sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
>> +		ret = sd_start_stop_device(sdkp, 1);
>> +	}
>> +
>>   	if (!ret)
>>   		opal_unlock_from_suspend(sdkp->opal_dev);
>>   	return ret;
>> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
>> index 75b2235b99e2..b9230b6add04 100644
>> --- a/include/scsi/scsi_device.h
>> +++ b/include/scsi/scsi_device.h
>> @@ -194,6 +194,7 @@ struct scsi_device {
>>   	unsigned no_start_on_add:1;	/* do not issue start on add */
>>   	unsigned allow_restart:1; /* issue START_UNIT in error handler */
>>   	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
>> +	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
>>   	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
>>   	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
>>   	unsigned select_no_atn:1;

Tanner Watkins July 31, 2023, 5:47 a.m. UTC | #3

On 7/30/23 18:39, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>     been reenabled at the beginning of the EH execution. This is visible
>     with error messages like:
>
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
>
> 2) The START STOP UNIT command is received while the EH process is
>     on-going, which mean that it is stopped and must wait for its
>     completion, at which point the command is rather useless as the drive
>     is already fully spun up already. This case results also in a
>     significant delay in sd_resume() which is observable by users as
>     the entire system resume completion is delayed.
>
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.
>
> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/ata/libata-scsi.c  | 7 +++++++
>   drivers/scsi/sd.c          | 9 ++++++---
>   include/scsi/scsi_device.h | 1 +
>   3 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 370d18aca71e..c6ece32de8e3 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1100,7 +1100,14 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
>   		}
>   	} else {
>   		sdev->sector_size = ata_id_logical_sector_size(dev->id);
> +		/*
> +		 * Stop the drive on suspend but do not issue START STOP UNIT
> +		 * on resume as this is not necessary and may fail: the device
> +		 * will be woken up by ata_port_pm_resume() with a port reset
> +		 * and device revalidation.
> +		 */
>   		sdev->manage_start_stop = 1;
> +		sdev->no_start_on_resume = 1;
>   	}
>   
>   	/*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 68b12afa0721..3c668cfb146d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
>   static int sd_resume(struct device *dev)
>   {
>   	struct scsi_disk *sdkp = dev_get_drvdata(dev);
> -	int ret;
> +	int ret = 0;
>   
>   	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
>   		return 0;
> @@ -3884,8 +3884,11 @@ static int sd_resume(struct device *dev)
>   	if (!sdkp->device->manage_start_stop)
>   		return 0;
>   
> -	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> -	ret = sd_start_stop_device(sdkp, 1);
> +	if (!sdkp->device->no_start_on_resume) {
> +		sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> +		ret = sd_start_stop_device(sdkp, 1);
> +	}
> +
>   	if (!ret)
>   		opal_unlock_from_suspend(sdkp->opal_dev);
>   	return ret;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..b9230b6add04 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -194,6 +194,7 @@ struct scsi_device {
>   	unsigned no_start_on_add:1;	/* do not issue start on add */
>   	unsigned allow_restart:1; /* issue START_UNIT in error handler */
>   	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
> +	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
>   	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
>   	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
>   	unsigned select_no_atn:1;

Tested-by: Tanner Watkins<dalzot@gmail.com>

Hannes Reinecke July 31, 2023, 4:13 p.m. UTC | #4

On 7/31/23 02:39, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>     been reenabled at the beginning of the EH execution. This is visible
>     with error messages like:
> 
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
> 
> 2) The START STOP UNIT command is received while the EH process is
>     on-going, which mean that it is stopped and must wait for its
>     completion, at which point the command is rather useless as the drive
>     is already fully spun up already. This case results also in a
>     significant delay in sd_resume() which is observable by users as
>     the entire system resume completion is delayed.
> 
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.
> 
Q: As libata starts up the drive internally via reset/revalidate, why do 
we have to sent START STOP UNIT for shutdown?
Wouldn't it be better to disable START STOP UNIT completely for libata, 
and let everything be handled via ATA command (IDLE IMMEDIATE, SLEEP) ?
Hmm?

Otherwise a really good idea. We've fallen into this trap several times 
ourselves.
Incidentally, USB could benefit from the same mechanism ...

Cheers,

Hannes

Paul Ausbeck July 31, 2023, 7:43 p.m. UTC | #5

On 7/30/23 17:39, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>     been reenabled at the beginning of the EH execution. This is visible
>     with error messages like:
> 
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
> 
> 2) The START STOP UNIT command is received while the EH process is
>     on-going, which mean that it is stopped and must wait for its
>     completion, at which point the command is rather useless as the drive
>     is already fully spun up already. This case results also in a
>     significant delay in sd_resume() which is observable by users as
>     the entire system resume completion is delayed.
> 
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.
> 
> Reported-by: Paul Ausbeck<paula@soe.ucsc.edu>
> Closes:https://bugzilla.kernel.org/show_bug.cgi?id=215880
> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> Signed-off-by: Damien Le Moal<dlemoal@kernel.org>

Tested-by: Paul Ausbeck <paula@soe.ucsc.edu>

Damien Le Moal Aug. 1, 2023, 3:44 a.m. UTC | #6

On 8/1/23 01:13, Hannes Reinecke wrote:
> On 7/31/23 02:39, Damien Le Moal wrote:
>> During system resume, ata_port_pm_resume() triggers ata EH to
>> 1) Resume the controller
>> 2) Reset and rescan the ports
>> 3) Revalidate devices
>> This EH execution is started asynchronously from ata_port_pm_resume(),
>> which means that when sd_resume() is executed, none or only part of the
>> above processing may have been executed. However, sd_resume() issues a
>> START STOP UNIT to wake up the drive from sleep mode. This command is
>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>> device. However, depending on the state of execution of the EH process
>> and revalidation triggerred by ata_port_pm_resume(), two things may
>> happen:
>> 1) The START STOP UNIT fails if it is received before the controller has
>>     been reenabled at the beginning of the EH execution. This is visible
>>     with error messages like:
>>
>> ata10.00: device reported invalid CHS sector 0
>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> sd 9:0:0:0: PM: failed to resume async: error -5
>>
>> 2) The START STOP UNIT command is received while the EH process is
>>     on-going, which mean that it is stopped and must wait for its
>>     completion, at which point the command is rather useless as the drive
>>     is already fully spun up already. This case results also in a
>>     significant delay in sd_resume() which is observable by users as
>>     the entire system resume completion is delayed.
>>
>> Given that ATA devices will be woken up by libata activity on resume,
>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>> the above mentioned problems. Do not issue this command by introducing
>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>> UNIT command only if this flag is not set.
>>
> Q: As libata starts up the drive internally via reset/revalidate, why do 
> we have to sent START STOP UNIT for shutdown?

To spin down the drive and put it to sleep.

> Wouldn't it be better to disable START STOP UNIT completely for libata, 
> and let everything be handled via ATA command (IDLE IMMEDIATE, SLEEP) ?
> Hmm?

I am not sure this buys us much. Could try though, but that is a little too much
changes for a bug fix for this cycle.

I think we need to think about that in the context of reworking ata
suspend/resume to be correct with regard to synchronizing ata device and scsi
device suspend/resume with a direct device link, which we do not have right now.

> Otherwise a really good idea. We've fallen into this trap several times 
> ourselves.
> Incidentally, USB could benefit from the same mechanism ...

Could look into it, but these bug reports, if any, do not come to me :)

> 
> Cheers,
> 
> Hannes

Hannes Reinecke Aug. 1, 2023, 6:16 a.m. UTC | #7

On 8/1/23 05:44, Damien Le Moal wrote:
> On 8/1/23 01:13, Hannes Reinecke wrote:
>> On 7/31/23 02:39, Damien Le Moal wrote:
>>> During system resume, ata_port_pm_resume() triggers ata EH to
>>> 1) Resume the controller
>>> 2) Reset and rescan the ports
>>> 3) Revalidate devices
>>> This EH execution is started asynchronously from ata_port_pm_resume(),
>>> which means that when sd_resume() is executed, none or only part of the
>>> above processing may have been executed. However, sd_resume() issues a
>>> START STOP UNIT to wake up the drive from sleep mode. This command is
>>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>>> device. However, depending on the state of execution of the EH process
>>> and revalidation triggerred by ata_port_pm_resume(), two things may
>>> happen:
>>> 1) The START STOP UNIT fails if it is received before the controller has
>>>      been reenabled at the beginning of the EH execution. This is visible
>>>      with error messages like:
>>>
>>> ata10.00: device reported invalid CHS sector 0
>>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>>> sd 9:0:0:0: PM: failed to resume async: error -5
>>>
>>> 2) The START STOP UNIT command is received while the EH process is
>>>      on-going, which mean that it is stopped and must wait for its
>>>      completion, at which point the command is rather useless as the drive
>>>      is already fully spun up already. This case results also in a
>>>      significant delay in sd_resume() which is observable by users as
>>>      the entire system resume completion is delayed.
>>>
>>> Given that ATA devices will be woken up by libata activity on resume,
>>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>>> the above mentioned problems. Do not issue this command by introducing
>>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>>> UNIT command only if this flag is not set.
>>>
>> Q: As libata starts up the drive internally via reset/revalidate, why do
>> we have to sent START STOP UNIT for shutdown?
> 
> To spin down the drive and put it to sleep.
> 
>> Wouldn't it be better to disable START STOP UNIT completely for libata,
>> and let everything be handled via ATA command (IDLE IMMEDIATE, SLEEP) ?
>> Hmm?
> 
> I am not sure this buys us much. Could try though, but that is a little too much
> changes for a bug fix for this cycle.
> 
Fair point.

> I think we need to think about that in the context of reworking ata
> suspend/resume to be correct with regard to synchronizing ata device and scsi
> device suspend/resume with a direct device link, which we do not have right now.
> 
Yeah, that's been on the agenda for quite some time.

So you can add:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes

Bart Van Assche Aug. 1, 2023, 6:36 p.m. UTC | #8

On 7/30/23 17:39, Damien Le Moal wrote:
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Damien Le Moal Aug. 2, 2023, 8:05 a.m. UTC | #9

On 7/31/23 09:39, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>    been reenabled at the beginning of the EH execution. This is visible
>    with error messages like:
> 
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
> 
> 2) The START STOP UNIT command is received while the EH process is
>    on-going, which mean that it is stopped and must wait for its
>    completion, at which point the command is rather useless as the drive
>    is already fully spun up already. This case results also in a
>    significant delay in sd_resume() which is observable by users as
>    the entire system resume completion is delayed.
> 
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.

Martin,

Are you OK with this patch ? Can I get your Acked-by please ?

Rodrigo Vivi Aug. 24, 2023, 6:28 p.m. UTC | #10

On Mon, Jul 31, 2023 at 09:39:56AM +0900, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>    been reenabled at the beginning of the EH execution. This is visible
>    with error messages like:
> 
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
> 
> 2) The START STOP UNIT command is received while the EH process is
>    on-going, which mean that it is stopped and must wait for its
>    completion, at which point the command is rather useless as the drive
>    is already fully spun up already. This case results also in a
>    significant delay in sd_resume() which is observable by users as
>    the entire system resume completion is delayed.
> 
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.

Hi Damien,

Last week I noticed that a basic test in our validation started failing,
then I noticed that it was subsequent quick suspend and autoresume using
rtcwake that was problematic.

I couldn't collect any specific log that was pointing to some useful direction.
After a painful bisect I got to this patch. After reverting in from the
top of our tree, the tests are back to life.

The issue was that the subsequent quick suspend-resume (sometimes the
second, sometimes third or even sixth) was simply hanging the machine
in different points at Suspend.

So, maybe we have some kind of disks/configuration out there where this
start upon resume is needed? Maybe it is just a matter of timming to
ensure some firmware underneath is up and back to life?

Well, please let me know the best way to report this issue to you and what
kind of logs I should get.

Meanwhile if this ends up blocking our CI we can keep a revert in a
topic branch for CI.

Thanks,
Rodrigo.

> 
> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  drivers/ata/libata-scsi.c  | 7 +++++++
>  drivers/scsi/sd.c          | 9 ++++++---
>  include/scsi/scsi_device.h | 1 +
>  3 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 370d18aca71e..c6ece32de8e3 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1100,7 +1100,14 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
>  		}
>  	} else {
>  		sdev->sector_size = ata_id_logical_sector_size(dev->id);
> +		/*
> +		 * Stop the drive on suspend but do not issue START STOP UNIT
> +		 * on resume as this is not necessary and may fail: the device
> +		 * will be woken up by ata_port_pm_resume() with a port reset
> +		 * and device revalidation.
> +		 */
>  		sdev->manage_start_stop = 1;
> +		sdev->no_start_on_resume = 1;
>  	}
>  
>  	/*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 68b12afa0721..3c668cfb146d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
>  static int sd_resume(struct device *dev)
>  {
>  	struct scsi_disk *sdkp = dev_get_drvdata(dev);
> -	int ret;
> +	int ret = 0;
>  
>  	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
>  		return 0;
> @@ -3884,8 +3884,11 @@ static int sd_resume(struct device *dev)
>  	if (!sdkp->device->manage_start_stop)
>  		return 0;
>  
> -	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> -	ret = sd_start_stop_device(sdkp, 1);
> +	if (!sdkp->device->no_start_on_resume) {
> +		sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> +		ret = sd_start_stop_device(sdkp, 1);
> +	}
> +
>  	if (!ret)
>  		opal_unlock_from_suspend(sdkp->opal_dev);
>  	return ret;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..b9230b6add04 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -194,6 +194,7 @@ struct scsi_device {
>  	unsigned no_start_on_add:1;	/* do not issue start on add */
>  	unsigned allow_restart:1; /* issue START_UNIT in error handler */
>  	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
> +	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
>  	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
>  	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
>  	unsigned select_no_atn:1;
> -- 
> 2.41.0
>

Damien Le Moal Aug. 24, 2023, 11:42 p.m. UTC | #11

On 8/25/23 03:28, Rodrigo Vivi wrote:
> On Mon, Jul 31, 2023 at 09:39:56AM +0900, Damien Le Moal wrote:
>> During system resume, ata_port_pm_resume() triggers ata EH to
>> 1) Resume the controller
>> 2) Reset and rescan the ports
>> 3) Revalidate devices
>> This EH execution is started asynchronously from ata_port_pm_resume(),
>> which means that when sd_resume() is executed, none or only part of the
>> above processing may have been executed. However, sd_resume() issues a
>> START STOP UNIT to wake up the drive from sleep mode. This command is
>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>> device. However, depending on the state of execution of the EH process
>> and revalidation triggerred by ata_port_pm_resume(), two things may
>> happen:
>> 1) The START STOP UNIT fails if it is received before the controller has
>>    been reenabled at the beginning of the EH execution. This is visible
>>    with error messages like:
>>
>> ata10.00: device reported invalid CHS sector 0
>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> sd 9:0:0:0: PM: failed to resume async: error -5
>>
>> 2) The START STOP UNIT command is received while the EH process is
>>    on-going, which mean that it is stopped and must wait for its
>>    completion, at which point the command is rather useless as the drive
>>    is already fully spun up already. This case results also in a
>>    significant delay in sd_resume() which is observable by users as
>>    the entire system resume completion is delayed.
>>
>> Given that ATA devices will be woken up by libata activity on resume,
>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>> the above mentioned problems. Do not issue this command by introducing
>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>> UNIT command only if this flag is not set.
> 
> Hi Damien,
> 
> Last week I noticed that a basic test in our validation started failing,
> then I noticed that it was subsequent quick suspend and autoresume using
> rtcwake that was problematic.

Arg... Again... Since the change of scsi layer to use async PM operations for
suspend/resume, ATA side has been a constant source of issues. When the change
was done, I did not notice any issue but several users reported problems.

> I couldn't collect any specific log that was pointing to some useful direction.
> After a painful bisect I got to this patch. After reverting in from the
> top of our tree, the tests are back to life.
> 
> The issue was that the subsequent quick suspend-resume (sometimes the
> second, sometimes third or even sixth) was simply hanging the machine
> in different points at Suspend.

I would have expected issues on the resume side. But it seems you are getting a
hang on suspend, which is new. How quick are your suspend/resume cycles ? I did
use rtcqake for my tests as well, but I was setting the wake timer at +5s or
more and suspending with "systemctl suspend".

> So, maybe we have some kind of disks/configuration out there where this
> start upon resume is needed? Maybe it is just a matter of timming to
> ensure some firmware underneath is up and back to life?

I do not think so. Suspend will issue a start stop unit command to put the drive
to sleep and resume will reset the port (which should wake up the drive) and
then issue an IDENTIFY command (which will also wake up the drive) and other
read logs etc to rescan the drive.
In both cases, if the commands do not complete, we would see errors/timeout and
likely port reset/drive gone events. So I think this is likely another subtle
race between scsi suspend and ata suspend that is causing a deadlock.

The main issue I think is that there is no direct ancestry between the ata port
(device) and scsi device, so the change to scsi async pm ops made a mess of the
suspend/resume operations ordering. For suspend, scsi device (child of ata port)
should be first, then ata port device (parent). For resume, the reverse order is
needed. PM normally ensures that parent/child ordering, but we lack that
parent/child relationship. I am working on fixing that but it is very slow
progress because I have been so far enable to recreate any of the issues that
have been reported. I am patching "blind"...

Any chance you could get a thread stack dump when the system hangs ?

echo t > /proc/sysrq-trigger

And:

echo d > /proc/sysrq-trigger

would be useful as well...

That is, unless you really have a hard lockup...

> Well, please let me know the best way to report this issue to you and what
> kind of logs I should get.

If you can get the above ? dmesg output as well with PM debug messages turned on.

Also, what is your setup ? What machine, adapter and drive are you using ?

Martin K. Petersen Aug. 25, 2023, 1:31 a.m. UTC | #12

Damien,

> The main issue I think is that there is no direct ancestry between the
> ata port (device) and scsi device,

I really think this should be fixed. It is a major deficiency as far as
I'm concerned and it affects other things than power management.

Damien Le Moal Aug. 25, 2023, 1:33 a.m. UTC | #13

On 8/25/23 10:31, Martin K. Petersen wrote:
> 
> Damien,
> 
>> The main issue I think is that there is no direct ancestry between the
>> ata port (device) and scsi device,
> 
> I really think this should be fixed. It is a major deficiency as far as
> I'm concerned and it affects other things than power management.

Working on it :)

>

Damien Le Moal Aug. 25, 2023, 12:19 p.m. UTC | #14

On 8/25/23 03:28, Rodrigo Vivi wrote:
> On Mon, Jul 31, 2023 at 09:39:56AM +0900, Damien Le Moal wrote:
>> During system resume, ata_port_pm_resume() triggers ata EH to
>> 1) Resume the controller
>> 2) Reset and rescan the ports
>> 3) Revalidate devices
>> This EH execution is started asynchronously from ata_port_pm_resume(),
>> which means that when sd_resume() is executed, none or only part of the
>> above processing may have been executed. However, sd_resume() issues a
>> START STOP UNIT to wake up the drive from sleep mode. This command is
>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>> device. However, depending on the state of execution of the EH process
>> and revalidation triggerred by ata_port_pm_resume(), two things may
>> happen:
>> 1) The START STOP UNIT fails if it is received before the controller has
>>    been reenabled at the beginning of the EH execution. This is visible
>>    with error messages like:
>>
>> ata10.00: device reported invalid CHS sector 0
>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> sd 9:0:0:0: PM: failed to resume async: error -5
>>
>> 2) The START STOP UNIT command is received while the EH process is
>>    on-going, which mean that it is stopped and must wait for its
>>    completion, at which point the command is rather useless as the drive
>>    is already fully spun up already. This case results also in a
>>    significant delay in sd_resume() which is observable by users as
>>    the entire system resume completion is delayed.
>>
>> Given that ATA devices will be woken up by libata activity on resume,
>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>> the above mentioned problems. Do not issue this command by introducing
>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>> UNIT command only if this flag is not set.
> 
> Hi Damien,
> 
> Last week I noticed that a basic test in our validation started failing,
> then I noticed that it was subsequent quick suspend and autoresume using
> rtcwake that was problematic.
> 
> I couldn't collect any specific log that was pointing to some useful direction.
> After a painful bisect I got to this patch. After reverting in from the
> top of our tree, the tests are back to life.
> 
> The issue was that the subsequent quick suspend-resume (sometimes the
> second, sometimes third or even sixth) was simply hanging the machine
> in different points at Suspend.
> 
> So, maybe we have some kind of disks/configuration out there where this
> start upon resume is needed? Maybe it is just a matter of timming to
> ensure some firmware underneath is up and back to life?
> 
> Well, please let me know the best way to report this issue to you and what
> kind of logs I should get.
> 
> Meanwhile if this ends up blocking our CI we can keep a revert in a
> topic branch for CI.

Can you try adding the patch attached to this email ?
Thanks.

Rodrigo Vivi Aug. 25, 2023, 5:09 p.m. UTC | #15

On Fri, Aug 25, 2023 at 08:42:13AM +0900, Damien Le Moal wrote:
> On 8/25/23 03:28, Rodrigo Vivi wrote:
> > On Mon, Jul 31, 2023 at 09:39:56AM +0900, Damien Le Moal wrote:
> >> During system resume, ata_port_pm_resume() triggers ata EH to
> >> 1) Resume the controller
> >> 2) Reset and rescan the ports
> >> 3) Revalidate devices
> >> This EH execution is started asynchronously from ata_port_pm_resume(),
> >> which means that when sd_resume() is executed, none or only part of the
> >> above processing may have been executed. However, sd_resume() issues a
> >> START STOP UNIT to wake up the drive from sleep mode. This command is
> >> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> >> device. However, depending on the state of execution of the EH process
> >> and revalidation triggerred by ata_port_pm_resume(), two things may
> >> happen:
> >> 1) The START STOP UNIT fails if it is received before the controller has
> >>    been reenabled at the beginning of the EH execution. This is visible
> >>    with error messages like:
> >>
> >> ata10.00: device reported invalid CHS sector 0
> >> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> >> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> >> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> >> sd 9:0:0:0: PM: failed to resume async: error -5
> >>
> >> 2) The START STOP UNIT command is received while the EH process is
> >>    on-going, which mean that it is stopped and must wait for its
> >>    completion, at which point the command is rather useless as the drive
> >>    is already fully spun up already. This case results also in a
> >>    significant delay in sd_resume() which is observable by users as
> >>    the entire system resume completion is delayed.
> >>
> >> Given that ATA devices will be woken up by libata activity on resume,
> >> sd_resume() has no need to issue a START STOP UNIT command, which solves
> >> the above mentioned problems. Do not issue this command by introducing
> >> the new scsi_device flag no_start_on_resume and setting this flag to 1
> >> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> >> UNIT command only if this flag is not set.
> > 
> > Hi Damien,
> > 
> > Last week I noticed that a basic test in our validation started failing,
> > then I noticed that it was subsequent quick suspend and autoresume using
> > rtcwake that was problematic.
> 
> Arg... Again... Since the change of scsi layer to use async PM operations for
> suspend/resume, ATA side has been a constant source of issues. When the change
> was done, I did not notice any issue but several users reported problems.
> 
> > I couldn't collect any specific log that was pointing to some useful direction.
> > After a painful bisect I got to this patch. After reverting in from the
> > top of our tree, the tests are back to life.
> > 
> > The issue was that the subsequent quick suspend-resume (sometimes the
> > second, sometimes third or even sixth) was simply hanging the machine
> > in different points at Suspend.
> 
> I would have expected issues on the resume side. But it seems you are getting a
> hang on suspend, which is new. How quick are your suspend/resume cycles ? I did
> use rtcqake for my tests as well, but I was setting the wake timer at +5s or
> more and suspending with "systemctl suspend".

I meant resume. It hangs during random parts at the resume sequence.
Brain thought resume, fingers typed 'Suspend', I'm sorry!

> 
> > So, maybe we have some kind of disks/configuration out there where this
> > start upon resume is needed? Maybe it is just a matter of timming to
> > ensure some firmware underneath is up and back to life?
> 
> I do not think so. Suspend will issue a start stop unit command to put the drive
> to sleep and resume will reset the port (which should wake up the drive) and
> then issue an IDENTIFY command (which will also wake up the drive) and other
> read logs etc to rescan the drive.
> In both cases, if the commands do not complete, we would see errors/timeout and
> likely port reset/drive gone events. So I think this is likely another subtle
> race between scsi suspend and ata suspend that is causing a deadlock.
> 
> The main issue I think is that there is no direct ancestry between the ata port
> (device) and scsi device, so the change to scsi async pm ops made a mess of the
> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
> should be first, then ata port device (parent). For resume, the reverse order is
> needed. PM normally ensures that parent/child ordering, but we lack that
> parent/child relationship. I am working on fixing that but it is very slow
> progress because I have been so far enable to recreate any of the issues that
> have been reported. I am patching "blind"...

I believe your suspicious makes sense. And on these lines, that patch you
attached earlier would fix that. However my initial tries of that didn't
help. I'm going to run more tests and get back to you.

> 
> Any chance you could get a thread stack dump when the system hangs ?
> 
> echo t > /proc/sysrq-trigger
> 
> And:
> 
> echo d > /proc/sysrq-trigger
> 
> would be useful as well...
> 
> That is, unless you really have a hard lockup...

no chance... it is a hard lockup.

> 
> > Well, please let me know the best way to report this issue to you and what
> > kind of logs I should get.
> 
> If you can get the above ? dmesg output as well with PM debug messages turned on.

I have the CONFIG_PM_DEBUG here with no_console_suspend.
If you remember any other config or parameter that would help, please let
me know that I collect it again:

[  104.571459] PM: suspend entry (deep)
[  104.585967] Filesystems sync: 0.010 seconds
[  104.618685] Freezing user space processes
[  104.625374] Freezing user space processes completed (elapsed 0.002 seconds)
[  104.632448] OOM killer disabled.
[  104.635712] Freezing remaining freezable tasks
[  104.641899] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  104.669767] wlp6s0: deauthenticating from 08:36:c9:85:df:ef by local choice (Reason: 3=DEAUTH_LEAVING)
[  104.679812] serial 00:01: disabled
[  104.683466] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
[  104.688902] sd 7:0:0:0: [sdc] Stopping disk
[  104.693176] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[  104.698419] sd 4:0:0:0: [sda] Synchronizing SCSI cache
[  104.703730] sd 4:0:0:0: [sda] Stopping disk
[  104.885912] sd 5:0:0:0: [sdb] Stopping disk
[  106.163215] PM: suspend devices took 1.514 seconds
[  107.003217] serial 00:01: activated
[  107.076779] nvme nvme0: 16/0/0 default/read/poll queues
[  107.123917] r8169 0000:07:00.0 enp7s0: Link is Down
[  107.208945] PM: resume devices took 0.241 seconds
[  107.214746] pcieport 0000:00:1c.0: PCI bridge to [bus 06]
[  107.220274] pcieport 0000:00:1c.0:   bridge window [mem 0x43700000-0x437fffff]
[  107.227538] OOM killer enabled.
[  107.230710] Restarting tasks ...
[  107.231803] pcieport 0000:00:1c.2: PCI bridge to [bus 07]
[  107.236474] done.
[  107.240599] pcieport 0000:00:1c.2:   bridge window [io  0x4000-0x4fff]
[  107.242574] random: crng reseeded on system resumption
[  107.249119] pcieport 0000:00:1c.2:   bridge window [mem 0x43600000-0x436fffff]
[  107.249405] pcieport 0000:00:1c.6: PCI bridge to [bus 08]
[  107.259714] PM: suspend exit
[  107.261623] pcieport 0000:00:1c.6:   bridge window [io  0x3000-0x3fff]
[  107.276554] pcieport 0000:00:1c.6:   bridge window [mem 0x43500000-0x435fffff]
[  107.283849] pcieport 0000:00:1c.6:   bridge window [mem 0x70900000-0x709fffff 64bit pref]
[  107.293567] ata7: SATA link down (SStatus 4 SControl 300)
[  107.304150] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  107.310975] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  107.319173] ata5.00: configured for UDMA/133
[  107.324620] ata5.00: Enabling discard_zeroes_data
[  107.398370] ata6.00: configured for UDMA/133
[  108.563229] PM: suspend entry (deep)
[  108.573610] Filesystems sync: 0.006 seconds
[  108.580617] Freezing user space processes
[  108.586774] Freezing user space processes completed (elapsed 0.002 seconds)
[  108.593793] OOM killer disabled.
[  108.597055] Freezing remaining freezable tasks
[  108.603246] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  108.621515] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
[  108.621522] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[  108.622018] serial 00:01: disabled
[  108.635420] sd 4:0:0:0: [sda] Synchronizing SCSI cache
[  108.640747] sd 4:0:0:0: [sda] Stopping disk
[  108.644148] sd 5:0:0:0: [sdb] Stopping disk
[  108.983487] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  108.991923] ata8.00: configured for UDMA/133
[  108.996423] sd 7:0:0:0: [sdc] Stopping disk
[  109.973722] PM: suspend devices took 1.363 seconds
[  110.721600] serial 00:01: activated
[  110.802094] nvme nvme0: 16/0/0 default/read/poll queues
[  110.873036] r8169 0000:07:00.0 enp7s0: Link is Down
[  111.032278] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  111.038583] ata7: SATA link down (SStatus 4 SControl 300)
[  111.044065] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  111.051326] ata5.00: configured for UDMA/133
[  111.056118] ata5.00: Enabling discard_zeroes_data
[  111.131795] ata6.00: configured for UDMA/133
[  112.713764] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  112.724250] ata8.00: configured for UDMA/133


> 
> Also, what is your setup ? What machine, adapter and drive are you using ?

$ sudo lshw -class disk
  *-namespace:0
       description: NVMe disk
       physical id: 0
       logical name: hwmon1
  *-namespace:1
       description: NVMe disk
       physical id: 2
       logical name: /dev/ng0n1
  *-namespace:2
       description: NVMe disk
       physical id: 1
       bus info: nvme@0:1
       logical name: /dev/nvme0n1
       size: 465GiB (500GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: guid=fd1cdad8-4a22-4fc9-8479-cce1748e78d5 logicalsectorsize=512 sectorsize=512 wwid=eui.e8238fa6bf530001001b448b45c5252c
  *-disk:0
       description: ATA Disk
       product: INTEL SSDSC2BB48
       physical id: 0
       bus info: scsi@4:0.0.0
       logical name: /dev/sda
       version: 0370
       serial: BTWL409509Q6480QGN
       size: 447GiB (480GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=000226cf
  *-disk:1
       description: ATA Disk
       product: INTEL SSDSC2BW48
       physical id: 1
       bus info: scsi@5:0.0.0
       logical name: /dev/sdb
       version: DC32
       serial: PHDA4166003M4805GN
       size: 447GiB (480GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=b6c93ada-cbb1-412e-92e9-d644a476cacc logicalsectorsize=512 sectorsize=512
  *-disk:2
       description: ATA Disk
       product: ST1000DM010-2EP1
       physical id: 0.0.0
       bus info: scsi@7:0.0.0
       logical name: /dev/sdc
       version: CC46
       serial: ZN1VGAFZ
       size: 931GiB (1TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=8bd323ef-07b9-49c6-99c9-6581274d2f65 logicalsectorsize=512 sectorsize=4096

$ lspci -vv
00:00.0 Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)
	DeviceName: Onboard - Other
	Subsystem: Gigabyte Technology Co., Ltd Device 5000
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0

00:01.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x16 Controller #1 (rev 02) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 5000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 120
	Bus: primary=00, secondary=01, subordinate=04, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 42000000-433fffff [size=20M] [32-bit]
	Prefetchable memory behind bridge: 60000000-707fffff [size=264M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:02.0 Display controller: Intel Corporation AlderLake-S GT1 (rev 0c)
	DeviceName: Onboard - Video
	Subsystem: Gigabyte Technology Co., Ltd Device d000
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at 41000000 (64-bit, non-prefetchable) [size=16M]
	Region 2: Memory at 50000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at 5000 [disabled] [size=64]
	Capabilities: <access denied>
	Kernel modules: i915, xe

00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 (rev 02) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin D routed to IRQ 121
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 43800000-438fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:0a.0 Signal processing controller: Intel Corporation Platform Monitoring Technology (rev 01)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at 43910000 (64-bit, non-prefetchable) [size=32K]
	Capabilities: <access denied>
	Kernel driver in use: intel_vsec
	Kernel modules: intel_vsec

00:14.0 USB controller: Intel Corporation Alder Lake-S PCH USB 3.2 Gen 2x2 XHCI Controller (rev 11) (prog-if 30 [XHCI])
	DeviceName: Onboard - Other
	Subsystem: Gigabyte Technology Co., Ltd Device 5007
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 126
	Region 0: Memory at 43900000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd

00:14.2 RAM memory: Intel Corporation Alder Lake-S PCH Shared SRAM (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at 4391c000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Region 2: Memory at 4392b000 (64-bit, non-prefetchable) [disabled] [size=4K]
	Capabilities: <access denied>

00:15.0 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #0 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 27
	Region 0: Memory at 40400000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:15.1 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #1 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 40
	Region 0: Memory at 40401000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:15.2 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #2 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin C routed to IRQ 29
	Region 0: Memory at 40402000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:15.3 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #3 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin D routed to IRQ 43
	Region 0: Memory at 40403000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:16.0 Communication controller: Intel Corporation Alder Lake-S PCH HECI Controller #1 (rev 11)
	DeviceName: Onboard - Other
	Subsystem: Gigabyte Technology Co., Ltd Device 1c3a
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 146
	Region 0: Memory at 43927000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: mei_me
	Kernel modules: mei_me

00:17.0 SATA controller: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] (rev 11) (prog-if 01 [AHCI 1.0])
	DeviceName: Onboard - SATA
	Subsystem: Gigabyte Technology Co., Ltd Device b005
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 125
	Region 0: Memory at 43920000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at 43926000 (32-bit, non-prefetchable) [size=256]
	Region 2: I/O ports at 5090 [size=8]
	Region 3: I/O ports at 5080 [size=4]
	Region 4: I/O ports at 5060 [size=32]
	Region 5: Memory at 43925000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: <access denied>
	Kernel driver in use: ahci

00:19.0 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #4 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 31
	Region 0: Memory at 40404000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:19.1 Serial bus controller: Intel Corporation Alder Lake-S PCH Serial IO I2C Controller #5 (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 32
	Region 0: Memory at 40405000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss

00:1c.0 PCI bridge: Intel Corporation Alder Lake-S PCH PCI Express Root Port #1 (rev 11) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 122
	Bus: primary=00, secondary=06, subordinate=06, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 43700000-437fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.2 PCI bridge: Intel Corporation Device 7aba (rev 11) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin C routed to IRQ 123
	Bus: primary=00, secondary=07, subordinate=07, sec-latency=0
	I/O behind bridge: 4000-4fff [size=4K] [16-bit]
	Memory behind bridge: 43600000-436fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.6 PCI bridge: Intel Corporation Device 7abe (rev 11) (prog-if 00 [Normal decode])
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin C routed to IRQ 124
	Bus: primary=00, secondary=08, subordinate=08, sec-latency=0
	I/O behind bridge: 3000-3fff [size=4K] [16-bit]
	Memory behind bridge: 43500000-435fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: 70900000-709fffff [size=1M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1f.0 ISA bridge: Intel Corporation Device 7a86 (rev 11)
	DeviceName: Onboard - Other
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0

00:1f.3 Audio device: Intel Corporation Alder Lake-S HD Audio Controller (rev 11)
	DeviceName: Onboard - Sound
	Subsystem: Gigabyte Technology Co., Ltd Device a194
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 148
	Region 0: Memory at 43918000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at 43400000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel, snd_sof_pci_intel_tgl

00:1f.4 SMBus: Intel Corporation Alder Lake-S PCH SMBus Controller (rev 11)
	DeviceName: Onboard - Other
	Subsystem: Gigabyte Technology Co., Ltd Device 5001
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin C routed to IRQ 18
	Region 0: Memory at 43922000 (64-bit, non-prefetchable) [size=256]
	Region 4: I/O ports at efa0 [size=32]
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801

00:1f.5 Serial bus controller: Intel Corporation Alder Lake-S PCH SPI Controller (rev 11)
	DeviceName: Onboard - Other
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at 40406000 (32-bit, non-prefetchable) [size=4K]

01:00.0 PCI bridge: Intel Corporation Device 4fa0 (rev 01) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at 70000000 (64-bit, prefetchable) [size=8M]
	Bus: primary=01, secondary=02, subordinate=04, sec-latency=0
	I/O behind bridge: [disabled] [32-bit]
	Memory behind bridge: 42000000-433fffff [size=20M] [32-bit]
	Prefetchable memory behind bridge: 60000000-6fffffff [size=256M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

02:01.0 PCI bridge: Intel Corporation Device 4fa4 (prog-if 00 [Normal decode])
	Subsystem: Intel Corporation Device 4fa4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Bus: primary=02, secondary=03, subordinate=03, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 42000000-431fffff [size=18M] [32-bit]
	Prefetchable memory behind bridge: 60000000-6fffffff [size=256M] [32-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA+ VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

02:04.0 PCI bridge: Intel Corporation Device 4fa4 (prog-if 00 [Normal decode])
	Subsystem: Intel Corporation Device 0000
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Bus: primary=02, secondary=04, subordinate=04, sec-latency=0
	I/O behind bridge: [disabled] [16-bit]
	Memory behind bridge: 43300000-433fffff [size=1M] [32-bit]
	Prefetchable memory behind bridge: [disabled] [64-bit]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A750] (rev 08) (prog-if 00 [VGA controller])
	Subsystem: ASRock Incorporation Device 6002
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin ? routed to IRQ 145
	Region 0: Memory at 42000000 (64-bit, non-prefetchable) [size=16M]
	Region 2: Memory at 60000000 (64-bit, prefetchable) [size=256M]
	Expansion ROM at 43000000 [disabled] [size=2M]
	Capabilities: <access denied>
	Kernel driver in use: xe
	Kernel modules: i915, xe

04:00.0 Audio device: Intel Corporation DG2 Audio Controller
	Subsystem: ASRock Incorporation Device 6002
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin ? routed to IRQ 149
	Region 0: Memory at 43300000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

05:00.0 Non-Volatile memory controller: Sandisk Corp Device 501e (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 501e
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	Region 0: Memory at 43800000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: nvme
	Kernel modules: nvme

06:00.0 Network controller: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] (rev 10)
	Subsystem: Intel Corporation Device 2110
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 147
	Region 0: Memory at 43700000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi

07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 16)
	Subsystem: Gigabyte Technology Co., Ltd Onboard Ethernet
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	Region 0: I/O ports at 4000 [size=256]
	Region 2: Memory at 43604000 (64-bit, non-prefetchable) [size=4K]
	Region 4: Memory at 43600000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169

08:00.0 Serial controller: Device 1c00:3253 (rev 10) (prog-if 05 [16850])
	Subsystem: Device 1c00:3253
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 18
	Region 0: I/O ports at 3000 [size=256]
	Region 1: Memory at 70900000 (32-bit, prefetchable) [size=32K]
	Region 2: I/O ports at 3100 [size=4]
	Expansion ROM at 43500000 [disabled] [size=32K]
	Capabilities: <access denied>
	Kernel driver in use: serial

anything else or more specific?

Thanks,
Rodrigo.

> 
> -- 
> Damien Le Moal
> Western Digital Research
>

Damien Le Moal Aug. 25, 2023, 10:06 p.m. UTC | #16

On 8/26/23 02:09, Rodrigo Vivi wrote:
>> I would have expected issues on the resume side. But it seems you are getting a
>> hang on suspend, which is new. How quick are your suspend/resume cycles ? I did
>> use rtcqake for my tests as well, but I was setting the wake timer at +5s or
>> more and suspending with "systemctl suspend".
> 
> I meant resume. It hangs during random parts at the resume sequence.
> Brain thought resume, fingers typed 'Suspend', I'm sorry!

That makes more sense :)

>>> So, maybe we have some kind of disks/configuration out there where this
>>> start upon resume is needed? Maybe it is just a matter of timming to
>>> ensure some firmware underneath is up and back to life?
>>
>> I do not think so. Suspend will issue a start stop unit command to put the drive
>> to sleep and resume will reset the port (which should wake up the drive) and
>> then issue an IDENTIFY command (which will also wake up the drive) and other
>> read logs etc to rescan the drive.
>> In both cases, if the commands do not complete, we would see errors/timeout and
>> likely port reset/drive gone events. So I think this is likely another subtle
>> race between scsi suspend and ata suspend that is causing a deadlock.
>>
>> The main issue I think is that there is no direct ancestry between the ata port
>> (device) and scsi device, so the change to scsi async pm ops made a mess of the
>> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
>> should be first, then ata port device (parent). For resume, the reverse order is
>> needed. PM normally ensures that parent/child ordering, but we lack that
>> parent/child relationship. I am working on fixing that but it is very slow
>> progress because I have been so far enable to recreate any of the issues that
>> have been reported. I am patching "blind"...
> 
> I believe your suspicious makes sense. And on these lines, that patch you
> attached earlier would fix that. However my initial tries of that didn't
> help. I'm going to run more tests and get back to you.

Yes, I now better understand what is going on, somewhat.
The patch I sent corrects the suspend side: the scsi device will be suspended
first, which triggers issuing an IDLE command to spin down the drive, then that
ATA port is suspended. All good. However, this patch does not address the resume
side.

The ata port is resumed first. This operation is rather quick and only trigger
ata EH asynchronously. The scsi device is resumed after the ata port. This
operation triggers issuing a VERIFY command to spinup the drive and put it back
in the active power state. Media access commands like VERIFY are the only
commands that restore a disk to active state. However, the "no start on resume"
patch I sent disables that, which is wrong. But reverting it is also not a full
solution as there is no way to know when the VERIFY command will be issued with
regard to ata EH triggered. That verify command should be issued after the port
is resumed and before the revalidation start !

I suspect that you get your lockup with the no start on resume patch because the
revalidation never proceeds correctly as the drive is still sleeping. And things
work with the patch reverted as you get lucky and the VERIFY command makes it
through at the right timing... Would need to add more messages to confirm all
this. I still cannot really explain the hard lockup though as in the end, you
should be seeing command timeouts, port reset, retries, and eventually drive
removal.

Anyway, what is clear is that using scsi manage_start_stop cannot possibly work
for resume. So more patches are needed to move the IDLE and VERIFY commands to
libata EH so that these commands are issued within the EH context instead of the
scsi device suspend/resume context.

Working on that. I should have patches ready for you Monday.

>> Any chance you could get a thread stack dump when the system hangs ?
>>
>> echo t > /proc/sysrq-trigger
>>
>> And:
>>
>> echo d > /proc/sysrq-trigger
>>
>> would be useful as well...
>>
>> That is, unless you really have a hard lockup...
> 
> no chance... it is a hard lockup.

Too bad... I really would like to understand where things get deadlocked.

> I have the CONFIG_PM_DEBUG here with no_console_suspend.
> If you remember any other config or parameter that would help, please let
> me know that I collect it again:
> 
> [  104.571459] PM: suspend entry (deep)
> [  104.585967] Filesystems sync: 0.010 seconds
> [  104.618685] Freezing user space processes
> [  104.625374] Freezing user space processes completed (elapsed 0.002 seconds)
> [  104.632448] OOM killer disabled.
> [  104.635712] Freezing remaining freezable tasks
> [  104.641899] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [  104.669767] wlp6s0: deauthenticating from 08:36:c9:85:df:ef by local choice (Reason: 3=DEAUTH_LEAVING)
> [  104.679812] serial 00:01: disabled
> [  104.683466] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> [  104.688902] sd 7:0:0:0: [sdc] Stopping disk
> [  104.693176] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [  104.698419] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [  104.703730] sd 4:0:0:0: [sda] Stopping disk
> [  104.885912] sd 5:0:0:0: [sdb] Stopping disk

So we have:
4:0:0:0 -> ata5 port
5:0:0:0 -> ata6 port
7:0:0:0 -> ata8 port

> [  106.163215] PM: suspend devices took 1.514 seconds
> [  107.003217] serial 00:01: activated
> [  107.076779] nvme nvme0: 16/0/0 default/read/poll queues
> [  107.123917] r8169 0000:07:00.0 enp7s0: Link is Down
> [  107.208945] PM: resume devices took 0.241 seconds
> [  107.214746] pcieport 0000:00:1c.0: PCI bridge to [bus 06]
> [  107.220274] pcieport 0000:00:1c.0:   bridge window [mem 0x43700000-0x437fffff]
> [  107.227538] OOM killer enabled.
> [  107.230710] Restarting tasks ...
> [  107.231803] pcieport 0000:00:1c.2: PCI bridge to [bus 07]
> [  107.236474] done.
> [  107.240599] pcieport 0000:00:1c.2:   bridge window [io  0x4000-0x4fff]
> [  107.242574] random: crng reseeded on system resumption
> [  107.249119] pcieport 0000:00:1c.2:   bridge window [mem 0x43600000-0x436fffff]
> [  107.249405] pcieport 0000:00:1c.6: PCI bridge to [bus 08]
> [  107.259714] PM: suspend exit
> [  107.261623] pcieport 0000:00:1c.6:   bridge window [io  0x3000-0x3fff]
> [  107.276554] pcieport 0000:00:1c.6:   bridge window [mem 0x43500000-0x435fffff]
> [  107.283849] pcieport 0000:00:1c.6:   bridge window [mem 0x70900000-0x709fffff 64bit pref]

The messages below come from libata EH which was triggered asynchronously and
end up running after PM exit, which is fine as PM does not need to wait for the
drive to spin up.

> [  107.293567] ata7: SATA link down (SStatus 4 SControl 300)
> [  107.304150] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  107.310975] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  107.319173] ata5.00: configured for UDMA/133
> [  107.324620] ata5.00: Enabling discard_zeroes_data
> [  107.398370] ata6.00: configured for UDMA/133

Looks like libata EH is finished here. And then comes the next suspend.

> [  108.563229] PM: suspend entry (deep)
> [  108.573610] Filesystems sync: 0.006 seconds
> [  108.580617] Freezing user space processes
> [  108.586774] Freezing user space processes completed (elapsed 0.002 seconds)
> [  108.593793] OOM killer disabled.
> [  108.597055] Freezing remaining freezable tasks
> [  108.603246] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [  108.621515] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> [  108.621522] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [  108.622018] serial 00:01: disabled
> [  108.635420] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [  108.640747] sd 4:0:0:0: [sda] Stopping disk
> [  108.644148] sd 5:0:0:0: [sdb] Stopping disk
> [  108.983487] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  108.991923] ata8.00: configured for UDMA/133

Looks like ata EH from the previous resume was actually not finished and
completes here.

> [  108.996423] sd 7:0:0:0: [sdc] Stopping disk
> [  109.973722] PM: suspend devices took 1.363 seconds
> [  110.721600] serial 00:01: activated
> [  110.802094] nvme nvme0: 16/0/0 default/read/poll queues
> [  110.873036] r8169 0000:07:00.0 enp7s0: Link is Down
> [  111.032278] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  111.038583] ata7: SATA link down (SStatus 4 SControl 300)
> [  111.044065] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  111.051326] ata5.00: configured for UDMA/133
> [  111.056118] ata5.00: Enabling discard_zeroes_data
> [  111.131795] ata6.00: configured for UDMA/133
> [  112.713764] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  112.724250] ata8.00: configured for UDMA/133

Looks like the drives have all come back but there is no "PM exit" to be seen,
so I suspect the scsi device resume side is stuck... Not sure why, especially
with the "no start on resume" patch.

Will send patches to fix all this as soon as possible.

Damien Le Moal Aug. 29, 2023, 6:17 a.m. UTC | #17

On 8/26/23 02:09, Rodrigo Vivi wrote:
>>> So, maybe we have some kind of disks/configuration out there where this
>>> start upon resume is needed? Maybe it is just a matter of timming to
>>> ensure some firmware underneath is up and back to life?
>>
>> I do not think so. Suspend will issue a start stop unit command to put the drive
>> to sleep and resume will reset the port (which should wake up the drive) and
>> then issue an IDENTIFY command (which will also wake up the drive) and other
>> read logs etc to rescan the drive.
>> In both cases, if the commands do not complete, we would see errors/timeout and
>> likely port reset/drive gone events. So I think this is likely another subtle
>> race between scsi suspend and ata suspend that is causing a deadlock.
>>
>> The main issue I think is that there is no direct ancestry between the ata port
>> (device) and scsi device, so the change to scsi async pm ops made a mess of the
>> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
>> should be first, then ata port device (parent). For resume, the reverse order is
>> needed. PM normally ensures that parent/child ordering, but we lack that
>> parent/child relationship. I am working on fixing that but it is very slow
>> progress because I have been so far enable to recreate any of the issues that
>> have been reported. I am patching "blind"...
> 
> I believe your suspicious makes sense. And on these lines, that patch you
> attached earlier would fix that. However my initial tries of that didn't
> help. I'm going to run more tests and get back to you.

Rodrigo,

I pushed the resume-v2 branch to libata tree:

git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
(or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git)

This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and
other device shutdown issues. The first 4 patches are the main ones to fix
suspend resume. I tested that on 2 different machines with different drives and
with qemu. All seems fine.

Could you try to run this through your CI ? I am very interested in seeing if it
survives your suspend/resume tests.

If you can confirm that all issues are fixed, I will rebase this on for-next and
post.

Thanks !

Rodrigo Vivi Aug. 30, 2023, 10:14 p.m. UTC | #18

On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
> On 8/26/23 02:09, Rodrigo Vivi wrote:
> >>> So, maybe we have some kind of disks/configuration out there where this
> >>> start upon resume is needed? Maybe it is just a matter of timming to
> >>> ensure some firmware underneath is up and back to life?
> >>
> >> I do not think so. Suspend will issue a start stop unit command to put the drive
> >> to sleep and resume will reset the port (which should wake up the drive) and
> >> then issue an IDENTIFY command (which will also wake up the drive) and other
> >> read logs etc to rescan the drive.
> >> In both cases, if the commands do not complete, we would see errors/timeout and
> >> likely port reset/drive gone events. So I think this is likely another subtle
> >> race between scsi suspend and ata suspend that is causing a deadlock.
> >>
> >> The main issue I think is that there is no direct ancestry between the ata port
> >> (device) and scsi device, so the change to scsi async pm ops made a mess of the
> >> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
> >> should be first, then ata port device (parent). For resume, the reverse order is
> >> needed. PM normally ensures that parent/child ordering, but we lack that
> >> parent/child relationship. I am working on fixing that but it is very slow
> >> progress because I have been so far enable to recreate any of the issues that
> >> have been reported. I am patching "blind"...
> > 
> > I believe your suspicious makes sense. And on these lines, that patch you
> > attached earlier would fix that. However my initial tries of that didn't
> > help. I'm going to run more tests and get back to you.
> 
> Rodrigo,
> 
> I pushed the resume-v2 branch to libata tree:
> 
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
> (or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git)
> 
> This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and
> other device shutdown issues. The first 4 patches are the main ones to fix
> suspend resume. I tested that on 2 different machines with different drives and
> with qemu. All seems fine.
> 
> Could you try to run this through your CI ? I am very interested in seeing if it
> survives your suspend/resume tests.

well, in the end this didn't affect the CI machinery as I was afraid.
it is only in my local DG2.

https://intel-gfx-ci.01.org/tree/intel-xe/bat-all.html?testfilter=suspend
(bat-dg2-oem2 one)

I just got these 13 patches and applied to my branch and tested it again
and it still *fails* for me.

[   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-DP-2, SUCCESS
[   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest pipe-B-DP-2
[   80.375042] PM: suspend entry (deep)
[   80.380799] Filesystems sync: 0.002 seconds
[   80.386476] Freezing user space processes
[   80.392286] Freezing user space processes completed (elapsed 0.001 seconds)
[   80.399294] OOM killer disabled.
[   80.402536] Freezing remaining freezable tasks
[   80.408335] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[   80.439716] serial 00:01: disabled
[   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
[   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
[   80.453600] ata6.00: Entering standby power mode
[   80.464217] ata5.00: Entering standby power mode
[   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   80.818520] ata8.00: Entering active power mode
[   80.842989] ata8.00: configured for UDMA/133
[   80.847660] ata8.00: Entering standby power mode
[   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
[   81.800508] PM: suspend of devices complete after 1367.829 msecs
[   81.806661] PM: start suspend of devices complete after 1390.859 msecs
[   81.813244] PM: suspend devices took 1.398 seconds
[   81.820101] PM: late suspend of devices complete after 2.036 msecs
�[   82.403857] serial 00:01: activated
[   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
[   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
[   82.581444] xe REG[0x223a8-0x223af]: allow read access
[   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
[   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
[   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
[   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
[   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
[   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
[   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
[   82.623634] xe REG[0x263a8-0x263af]: allow read access
[   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
[   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
[   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   82.739773] ata5.00: Entering active power mode
[   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   82.750618] ata6.00: Entering active power mode
[   82.755961] ata5.00: configured for UDMA/133
[   82.760479] ata5.00: Enabling discard_zeroes_data
[   82.836266] ata6.00: configured for UDMA/133
[   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   84.466354] ata8.00: Entering active power mode
[   84.497256] ata8.00: configured for UDMA/133
...

> 
> If you can confirm that all issues are fixed, I will rebase this on for-next and
> post.
> 
> Thanks !
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research
>

Damien Le Moal Aug. 31, 2023, 12:32 a.m. UTC | #19

On 8/31/23 07:14, Rodrigo Vivi wrote:
> On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
>> On 8/26/23 02:09, Rodrigo Vivi wrote:
>>>>> So, maybe we have some kind of disks/configuration out there where this
>>>>> start upon resume is needed? Maybe it is just a matter of timming to
>>>>> ensure some firmware underneath is up and back to life?
>>>>
>>>> I do not think so. Suspend will issue a start stop unit command to put the drive
>>>> to sleep and resume will reset the port (which should wake up the drive) and
>>>> then issue an IDENTIFY command (which will also wake up the drive) and other
>>>> read logs etc to rescan the drive.
>>>> In both cases, if the commands do not complete, we would see errors/timeout and
>>>> likely port reset/drive gone events. So I think this is likely another subtle
>>>> race between scsi suspend and ata suspend that is causing a deadlock.
>>>>
>>>> The main issue I think is that there is no direct ancestry between the ata port
>>>> (device) and scsi device, so the change to scsi async pm ops made a mess of the
>>>> suspend/resume operations ordering. For suspend, scsi device (child of ata port)
>>>> should be first, then ata port device (parent). For resume, the reverse order is
>>>> needed. PM normally ensures that parent/child ordering, but we lack that
>>>> parent/child relationship. I am working on fixing that but it is very slow
>>>> progress because I have been so far enable to recreate any of the issues that
>>>> have been reported. I am patching "blind"...
>>>
>>> I believe your suspicious makes sense. And on these lines, that patch you
>>> attached earlier would fix that. However my initial tries of that didn't
>>> help. I'm going to run more tests and get back to you.
>>
>> Rodrigo,
>>
>> I pushed the resume-v2 branch to libata tree:
>>
>> git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
>> (or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git)
>>
>> This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and
>> other device shutdown issues. The first 4 patches are the main ones to fix
>> suspend resume. I tested that on 2 different machines with different drives and
>> with qemu. All seems fine.
>>
>> Could you try to run this through your CI ? I am very interested in seeing if it
>> survives your suspend/resume tests.
> 
> well, in the end this didn't affect the CI machinery as I was afraid.
> it is only in my local DG2.
> 
> https://intel-gfx-ci.01.org/tree/intel-xe/bat-all.html?testfilter=suspend
> (bat-dg2-oem2 one)
> 
> I just got these 13 patches and applied to my branch and tested it again
> and it still *fails* for me.

That is annoying... But I think the messages give us a hint as to what is going
on. See below.

> 
> [   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-DP-2, SUCCESS
> [   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest pipe-B-DP-2
> [   80.375042] PM: suspend entry (deep)
> [   80.380799] Filesystems sync: 0.002 seconds
> [   80.386476] Freezing user space processes
> [   80.392286] Freezing user space processes completed (elapsed 0.001 seconds)
> [   80.399294] OOM killer disabled.
> [   80.402536] Freezing remaining freezable tasks
> [   80.408335] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [   80.439716] serial 00:01: disabled
> [   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> [   80.453600] ata6.00: Entering standby power mode

This is sd 5:0:0:0. All good, ordered properly with the "Synchronizing SCSI cache".

> [   80.464217] ata5.00: Entering standby power mode

Same here for sd 4:0:0:0.

> [   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [   80.818520] ata8.00: Entering active power mode
> [   80.842989] ata8.00: configured for UDMA/133

Arg ! sd 7:0:0:0 is resuming ! But the above "Synchronizing SCSI cache" tells
us that it was suspending and libata EH did not yet put that drive to standby...

> [   80.847660] ata8.00: Entering standby power mode

... which happens here. So it looks like libata EH had both the suspend and
resume requests at the same time, which is totally weird.

> [   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
> [   81.800508] PM: suspend of devices complete after 1367.829 msecs
> [   81.806661] PM: start suspend of devices complete after 1390.859 msecs
> [   81.813244] PM: suspend devices took 1.398 seconds
> [   81.820101] PM: late suspend of devices complete after 2.036 msecs

...and PM suspend completes here. Resume "starts" now (but clearly it started
earlier already given that sd 7:0:0:0 was reactivated.

> �[   82.403857] serial 00:01: activated
> [   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
> [   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
> [   82.581444] xe REG[0x223a8-0x223af]: allow read access
> [   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
> [   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
> [   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
> [   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
> [   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
> [   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
> [   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
> [   82.623634] xe REG[0x263a8-0x263af]: allow read access
> [   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
> [   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
> [   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [   82.739773] ata5.00: Entering active power mode
> [   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [   82.750618] ata6.00: Entering active power mode
> [   82.755961] ata5.00: configured for UDMA/133
> [   82.760479] ata5.00: Enabling discard_zeroes_data
> [   82.836266] ata6.00: configured for UDMA/133
> [   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [   84.466354] ata8.00: Entering active power mode
> [   84.497256] ata8.00: configured for UDMA/133
> ...

And this looks all normal, the drives have all been transitioned to active
power mode as expected. And yet, your system is stuck after this, right ?
Can you try to boot with "sysrq_always_enabled" and try to see if sending
"ctrl-sysrq-t" keys can give you a stack backtrace of the tasks to see where
they are stuck ?

I am going to try something like you do with very short resume rtc timer and
multiple disks to see if I can reproduce. But it is starting to look like PM is
starting resuming before suspend completes... Not sure that is correct. But the
end result may be that libata EH endup getting stuck. Let me dig further.

Rodrigo Vivi Aug. 31, 2023, 1:48 a.m. UTC | #20

On Thu, 2023-08-31 at 09:32 +0900, Damien Le Moal wrote:
> On 8/31/23 07:14, Rodrigo Vivi wrote:
> > On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
> > > On 8/26/23 02:09, Rodrigo Vivi wrote:
> > > > > > So, maybe we have some kind of disks/configuration out
> > > > > > there where this
> > > > > > start upon resume is needed? Maybe it is just a matter of
> > > > > > timming to
> > > > > > ensure some firmware underneath is up and back to life?
> > > > > 
> > > > > I do not think so. Suspend will issue a start stop unit
> > > > > command to put the drive
> > > > > to sleep and resume will reset the port (which should wake up
> > > > > the drive) and
> > > > > then issue an IDENTIFY command (which will also wake up the
> > > > > drive) and other
> > > > > read logs etc to rescan the drive.
> > > > > In both cases, if the commands do not complete, we would see
> > > > > errors/timeout and
> > > > > likely port reset/drive gone events. So I think this is
> > > > > likely another subtle
> > > > > race between scsi suspend and ata suspend that is causing a
> > > > > deadlock.
> > > > > 
> > > > > The main issue I think is that there is no direct ancestry
> > > > > between the ata port
> > > > > (device) and scsi device, so the change to scsi async pm ops
> > > > > made a mess of the
> > > > > suspend/resume operations ordering. For suspend, scsi device
> > > > > (child of ata port)
> > > > > should be first, then ata port device (parent). For resume,
> > > > > the reverse order is
> > > > > needed. PM normally ensures that parent/child ordering, but
> > > > > we lack that
> > > > > parent/child relationship. I am working on fixing that but it
> > > > > is very slow
> > > > > progress because I have been so far enable to recreate any of
> > > > > the issues that
> > > > > have been reported. I am patching "blind"...
> > > > 
> > > > I believe your suspicious makes sense. And on these lines, that
> > > > patch you
> > > > attached earlier would fix that. However my initial tries of
> > > > that didn't
> > > > help. I'm going to run more tests and get back to you.
> > > 
> > > Rodrigo,
> > > 
> > > I pushed the resume-v2 branch to libata tree:
> > > 
> > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
> > > (or
> > > https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.gi
> > > t)
> > > 
> > > This branch adds 13 patches on top of 6.5.0 to cleanup libata
> > > suspend/resume and
> > > other device shutdown issues. The first 4 patches are the main
> > > ones to fix
> > > suspend resume. I tested that on 2 different machines with
> > > different drives and
> > > with qemu. All seems fine.
> > > 
> > > Could you try to run this through your CI ? I am very interested
> > > in seeing if it
> > > survives your suspend/resume tests.
> > 
> > well, in the end this didn't affect the CI machinery as I was
> > afraid.
> > it is only in my local DG2.
> > 
> > https://intel-gfx-ci.01.org/tree/intel-xe/bat-
> > all.html?testfilter=suspend
> > (bat-dg2-oem2 one)
> > 
> > I just got these 13 patches and applied to my branch and tested it
> > again
> > and it still *fails* for me.
> 
> That is annoying... But I think the messages give us a hint as to
> what is going
> on. See below.
> 
> > 
> > [   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-
> > DP-2, SUCCESS
> > [   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest
> > pipe-B-DP-2
> > [   80.375042] PM: suspend entry (deep)
> > [   80.380799] Filesystems sync: 0.002 seconds
> > [   80.386476] Freezing user space processes
> > [   80.392286] Freezing user space processes completed (elapsed
> > 0.001 seconds)
> > [   80.399294] OOM killer disabled.
> > [   80.402536] Freezing remaining freezable tasks
> > [   80.408335] Freezing remaining freezable tasks completed
> > (elapsed 0.001 seconds)
> > [   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> > [   80.439716] serial 00:01: disabled
> > [   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> > [   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> > [   80.453600] ata6.00: Entering standby power mode
> 
> This is sd 5:0:0:0. All good, ordered properly with the
> "Synchronizing SCSI cache".
> 
> > [   80.464217] ata5.00: Entering standby power mode
> 
> Same here for sd 4:0:0:0.
> 
> > [   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
> > 300)
> > [   80.818520] ata8.00: Entering active power mode
> > [   80.842989] ata8.00: configured for UDMA/133
> 
> Arg ! sd 7:0:0:0 is resuming ! But the above "Synchronizing SCSI
> cache" tells
> us that it was suspending and libata EH did not yet put that drive to
> standby...
> 
> > [   80.847660] ata8.00: Entering standby power mode
> 
> ... which happens here. So it looks like libata EH had both the
> suspend and
> resume requests at the same time, which is totally weird.

although it looks weird, it totally matches the 'use case'.
I mean, if I suspend, resume, and wait a bit before suspend and resume
again, it will work 100% of the time.
The issue is really only when another suspend comes right after the
resume, in a loop without any wait.

> 
> > [   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
> > [   81.800508] PM: suspend of devices complete after 1367.829 msecs
> > [   81.806661] PM: start suspend of devices complete after 1390.859
> > msecs
> > [   81.813244] PM: suspend devices took 1.398 seconds
> > [   81.820101] PM: late suspend of devices complete after 2.036
> > msecs
> 
> ...and PM suspend completes here. Resume "starts" now (but clearly it
> started
> earlier already given that sd 7:0:0:0 was reactivated.

that is weird.

> 
> > �[   82.403857] serial 00:01: activated
> > [   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
> > [   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
> > [   82.581444] xe REG[0x223a8-0x223af]: allow read access
> > [   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
> > [   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
> > [   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
> > [   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
> > [   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
> > [   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
> > [   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
> > [   82.623634] xe REG[0x263a8-0x263af]: allow read access
> > [   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
> > [   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
> > [   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl
> > 300)
> > [   82.739773] ata5.00: Entering active power mode
> > [   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl
> > 300)
> > [   82.750618] ata6.00: Entering active power mode
> > [   82.755961] ata5.00: configured for UDMA/133
> > [   82.760479] ata5.00: Enabling discard_zeroes_data
> > [   82.836266] ata6.00: configured for UDMA/133
> > [   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
> > 300)
> > [   84.466354] ata8.00: Entering active power mode
> > [   84.497256] ata8.00: configured for UDMA/133
> > ...
> 
> And this looks all normal, the drives have all been transitioned to
> active
> power mode as expected. And yet, your system is stuck after this,
> right ?

yes

> Can you try to boot with "sysrq_always_enabled" and try to see if
> sending
> "ctrl-sysrq-t" keys can give you a stack backtrace of the tasks to
> see where
> they are stuck ?

I will try tomorrow.

> 
> I am going to try something like you do with very short resume rtc
> timer and
> multiple disks to see if I can reproduce. But it is starting to look
> like PM is
> starting resuming before suspend completes...

yes, this is what it looks like.

>  Not sure that is correct. But the
> end result may be that libata EH endup getting stuck. Let me dig
> further.

Thank you so much for the hard work on this!

Damien Le Moal Aug. 31, 2023, 3:06 a.m. UTC | #21

On 8/31/23 10:48, Vivi, Rodrigo wrote:
> On Thu, 2023-08-31 at 09:32 +0900, Damien Le Moal wrote:
>> On 8/31/23 07:14, Rodrigo Vivi wrote:
>>> On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
>>>> On 8/26/23 02:09, Rodrigo Vivi wrote:
>>>>>>> So, maybe we have some kind of disks/configuration out
>>>>>>> there where this
>>>>>>> start upon resume is needed? Maybe it is just a matter of
>>>>>>> timming to
>>>>>>> ensure some firmware underneath is up and back to life?
>>>>>>
>>>>>> I do not think so. Suspend will issue a start stop unit
>>>>>> command to put the drive
>>>>>> to sleep and resume will reset the port (which should wake up
>>>>>> the drive) and
>>>>>> then issue an IDENTIFY command (which will also wake up the
>>>>>> drive) and other
>>>>>> read logs etc to rescan the drive.
>>>>>> In both cases, if the commands do not complete, we would see
>>>>>> errors/timeout and
>>>>>> likely port reset/drive gone events. So I think this is
>>>>>> likely another subtle
>>>>>> race between scsi suspend and ata suspend that is causing a
>>>>>> deadlock.
>>>>>>
>>>>>> The main issue I think is that there is no direct ancestry
>>>>>> between the ata port
>>>>>> (device) and scsi device, so the change to scsi async pm ops
>>>>>> made a mess of the
>>>>>> suspend/resume operations ordering. For suspend, scsi device
>>>>>> (child of ata port)
>>>>>> should be first, then ata port device (parent). For resume,
>>>>>> the reverse order is
>>>>>> needed. PM normally ensures that parent/child ordering, but
>>>>>> we lack that
>>>>>> parent/child relationship. I am working on fixing that but it
>>>>>> is very slow
>>>>>> progress because I have been so far enable to recreate any of
>>>>>> the issues that
>>>>>> have been reported. I am patching "blind"...
>>>>>
>>>>> I believe your suspicious makes sense. And on these lines, that
>>>>> patch you
>>>>> attached earlier would fix that. However my initial tries of
>>>>> that didn't
>>>>> help. I'm going to run more tests and get back to you.
>>>>
>>>> Rodrigo,
>>>>
>>>> I pushed the resume-v2 branch to libata tree:
>>>>
>>>> git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
>>>> (or
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.gi
>>>> t)
>>>>
>>>> This branch adds 13 patches on top of 6.5.0 to cleanup libata
>>>> suspend/resume and
>>>> other device shutdown issues. The first 4 patches are the main
>>>> ones to fix
>>>> suspend resume. I tested that on 2 different machines with
>>>> different drives and
>>>> with qemu. All seems fine.
>>>>
>>>> Could you try to run this through your CI ? I am very interested
>>>> in seeing if it
>>>> survives your suspend/resume tests.
>>>
>>> well, in the end this didn't affect the CI machinery as I was
>>> afraid.
>>> it is only in my local DG2.
>>>
>>> https://intel-gfx-ci.01.org/tree/intel-xe/bat-
>>> all.html?testfilter=suspend
>>> (bat-dg2-oem2 one)
>>>
>>> I just got these 13 patches and applied to my branch and tested it
>>> again
>>> and it still *fails* for me.
>>
>> That is annoying... But I think the messages give us a hint as to
>> what is going
>> on. See below.
>>
>>>
>>> [   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-
>>> DP-2, SUCCESS
>>> [   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest
>>> pipe-B-DP-2
>>> [   80.375042] PM: suspend entry (deep)
>>> [   80.380799] Filesystems sync: 0.002 seconds
>>> [   80.386476] Freezing user space processes
>>> [   80.392286] Freezing user space processes completed (elapsed
>>> 0.001 seconds)
>>> [   80.399294] OOM killer disabled.
>>> [   80.402536] Freezing remaining freezable tasks
>>> [   80.408335] Freezing remaining freezable tasks completed
>>> (elapsed 0.001 seconds)
>>> [   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
>>> [   80.439716] serial 00:01: disabled
>>> [   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
>>> [   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
>>> [   80.453600] ata6.00: Entering standby power mode
>>
>> This is sd 5:0:0:0. All good, ordered properly with the
>> "Synchronizing SCSI cache".
>>
>>> [   80.464217] ata5.00: Entering standby power mode
>>
>> Same here for sd 4:0:0:0.
>>
>>> [   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   80.818520] ata8.00: Entering active power mode
>>> [   80.842989] ata8.00: configured for UDMA/133
>>
>> Arg ! sd 7:0:0:0 is resuming ! But the above "Synchronizing SCSI
>> cache" tells
>> us that it was suspending and libata EH did not yet put that drive to
>> standby...
>>
>>> [   80.847660] ata8.00: Entering standby power mode
>>
>> ... which happens here. So it looks like libata EH had both the
>> suspend and
>> resume requests at the same time, which is totally weird.
> 
> although it looks weird, it totally matches the 'use case'.
> I mean, if I suspend, resume, and wait a bit before suspend and resume
> again, it will work 100% of the time.
> The issue is really only when another suspend comes right after the
> resume, in a loop without any wait.
> 
>>
>>> [   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
>>> [   81.800508] PM: suspend of devices complete after 1367.829 msecs
>>> [   81.806661] PM: start suspend of devices complete after 1390.859
>>> msecs
>>> [   81.813244] PM: suspend devices took 1.398 seconds
>>> [   81.820101] PM: late suspend of devices complete after 2.036
>>> msecs
>>
>> ...and PM suspend completes here. Resume "starts" now (but clearly it
>> started
>> earlier already given that sd 7:0:0:0 was reactivated.
> 
> that is weird.
> 
>>
>>> �[   82.403857] serial 00:01: activated
>>> [   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
>>> [   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
>>> [   82.581444] xe REG[0x223a8-0x223af]: allow read access
>>> [   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
>>> [   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
>>> [   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
>>> [   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
>>> [   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
>>> [   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
>>> [   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
>>> [   82.623634] xe REG[0x263a8-0x263af]: allow read access
>>> [   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
>>> [   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
>>> [   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   82.739773] ata5.00: Entering active power mode
>>> [   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   82.750618] ata6.00: Entering active power mode
>>> [   82.755961] ata5.00: configured for UDMA/133
>>> [   82.760479] ata5.00: Enabling discard_zeroes_data
>>> [   82.836266] ata6.00: configured for UDMA/133
>>> [   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   84.466354] ata8.00: Entering active power mode
>>> [   84.497256] ata8.00: configured for UDMA/133
>>> ...
>>
>> And this looks all normal, the drives have all been transitioned to
>> active
>> power mode as expected. And yet, your system is stuck after this,
>> right ?
> 
> yes
> 
>> Can you try to boot with "sysrq_always_enabled" and try to see if
>> sending
>> "ctrl-sysrq-t" keys can give you a stack backtrace of the tasks to
>> see where
>> they are stuck ?
> 
> I will try tomorrow.

After fixing my qemu setup to get rtcwake to work, I succedded in recreating
the hang ! With multiple disks attached to the VM, suspending the VM with an
rtcwake timer set to +1 seconds triggers the resume in the middle of the
suspend and it hangs.

And I found the problem: it is a race in ata_port_request_pm() where the call
to ata_port_wait_eh() is completely innefective, causing suspend and resume to
be scheduled simultaneously, and so one being lost since the pm_msg field is
per port.

I have a fix. I keep doing suspend/resume in the VM and all looks good now.
Cleaning up the patches and I will push a resume-v3 branch soon for you to test.

> 
>>
>> I am going to try something like you do with very short resume rtc
>> timer and
>> multiple disks to see if I can reproduce. But it is starting to look
>> like PM is
>> starting resuming before suspend completes...
> 
> yes, this is what it looks like.

It seems that the PM code is fine with that. The issue really was with libata :)

Damien Le Moal Aug. 31, 2023, 6:55 a.m. UTC | #22

On 8/26/23 02:09, Rodrigo Vivi wrote:
> I have the CONFIG_PM_DEBUG here with no_console_suspend.
> If you remember any other config or parameter that would help, please let
> me know that I collect it again:

Coming back to this previous dmesg output as I am in fact seeing something very
similar now. See below.

> [  104.571459] PM: suspend entry (deep)
> [  104.585967] Filesystems sync: 0.010 seconds
> [  104.618685] Freezing user space processes
> [  104.625374] Freezing user space processes completed (elapsed 0.002 seconds)
> [  104.632448] OOM killer disabled.
> [  104.635712] Freezing remaining freezable tasks
> [  104.641899] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)

suspend start and freezes the user tasks, as expected.

> [  104.669767] wlp6s0: deauthenticating from 08:36:c9:85:df:ef by local choice (Reason: 3=DEAUTH_LEAVING)
> [  104.679812] serial 00:01: disabled
> [  104.683466] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> [  104.688902] sd 7:0:0:0: [sdc] Stopping disk
> [  104.693176] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [  104.698419] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [  104.703730] sd 4:0:0:0: [sda] Stopping disk
> [  104.885912] sd 5:0:0:0: [sdb] Stopping disk
> [  106.163215] PM: suspend devices took 1.514 seconds

All devices are suspended. No issues. Then resume starts...

> [  107.003217] serial 00:01: activated
> [  107.076779] nvme nvme0: 16/0/0 default/read/poll queues
> [  107.123917] r8169 0000:07:00.0 enp7s0: Link is Down
> [  107.208945] PM: resume devices took 0.241 seconds
> [  107.214746] pcieport 0000:00:1c.0: PCI bridge to [bus 06]
> [  107.220274] pcieport 0000:00:1c.0:   bridge window [mem 0x43700000-0x437fffff]
> [  107.227538] OOM killer enabled.
> [  107.230710] Restarting tasks ...

... remember this one ...

> [  107.231803] pcieport 0000:00:1c.2: PCI bridge to [bus 07]
> [  107.236474] done.
> [  107.240599] pcieport 0000:00:1c.2:   bridge window [io  0x4000-0x4fff]
> [  107.242574] random: crng reseeded on system resumption
> [  107.249119] pcieport 0000:00:1c.2:   bridge window [mem 0x43600000-0x436fffff]
> [  107.249405] pcieport 0000:00:1c.6: PCI bridge to [bus 08]
> [  107.259714] PM: suspend exit
> [  107.261623] pcieport 0000:00:1c.6:   bridge window [io  0x3000-0x3fff]
> [  107.276554] pcieport 0000:00:1c.6:   bridge window [mem 0x43500000-0x435fffff]
> [  107.283849] pcieport 0000:00:1c.6:   bridge window [mem 0x70900000-0x709fffff 64bit pref]
> [  107.293567] ata7: SATA link down (SStatus 4 SControl 300)
> [  107.304150] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  107.310975] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  107.319173] ata5.00: configured for UDMA/133
> [  107.324620] ata5.00: Enabling discard_zeroes_data
> [  107.398370] ata6.00: configured for UDMA/133

started bringing back up devices, but suspend comes back again.

> [  108.563229] PM: suspend entry (deep)
> [  108.573610] Filesystems sync: 0.006 seconds
> [  108.580617] Freezing user space processes
> [  108.586774] Freezing user space processes completed (elapsed 0.002 seconds)
> [  108.593793] OOM killer disabled.
> [  108.597055] Freezing remaining freezable tasks
> [  108.603246] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)

which freezes the tasks, again.

> [  108.621515] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> [  108.621522] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> [  108.622018] serial 00:01: disabled
> [  108.635420] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> [  108.640747] sd 4:0:0:0: [sda] Stopping disk
> [  108.644148] sd 5:0:0:0: [sdb] Stopping disk
> [  108.983487] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  108.991923] ata8.00: configured for UDMA/133
> [  108.996423] sd 7:0:0:0: [sdc] Stopping disk
> [  109.973722] PM: suspend devices took 1.363 seconds

And suspend finishes, followed by resume starting.

> [  110.721600] serial 00:01: activated
> [  110.802094] nvme nvme0: 16/0/0 default/read/poll queues
> [  110.873036] r8169 0000:07:00.0 enp7s0: Link is Down
> [  111.032278] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  111.038583] ata7: SATA link down (SStatus 4 SControl 300)
> [  111.044065] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  111.051326] ata5.00: configured for UDMA/133
> [  111.056118] ata5.00: Enabling discard_zeroes_data
> [  111.131795] ata6.00: configured for UDMA/133
> [  112.713764] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  112.724250] ata8.00: configured for UDMA/133

Resume restarted the devices, but there is no "Restarting tasks" ! So the user
tasks are frozen, hence the "hang". There are no user processes to run, so no
coming back to the shell.

So even without the libata patches fixing the resume mess, this issue does not
look like a libata issue. I think your bisect hitting the no_start_on_resume
patch is simply because that patch changes the timing of things and makes it
easier to hit this issue.

Adding PM folks because I am out of my depth on this one...

PM folks,

Rodrigo hit an issue doing suspend+resume cycles in a loop, with the resume
triggered a very short time (wakealarm) after starting resume. This ends up
with a hang. I could recreate the same in qemu. I cannot get wakealarm to work
for some reason, but I have a virtio device not supporting suspend which
triggers the resume in the middle of suspend. And repeating "systemctl suspend"
again and again, I end up with a similar hang:

[   98.455929] PM: suspend entry (s2idle)
[   98.466622] Filesystems sync: 0.008 seconds
[   98.473871] Freezing user space processes
[   98.476372] Freezing user space processes completed (elapsed 0.001 seconds)
[   98.478197] OOM killer disabled.
[   98.479199] Freezing remaining freezable tasks
[   98.481887] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)

suspend starts, freezineg the tasks.

[   98.493890] virtio-fs: suspend/resume not yet supported
[   98.493971] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
[   98.499658] virtio-pci 0000:00:05.0: PM: pci_pm_suspend():
virtio_pci_freeze+0x0/0x50 returns -95
[   98.504241] virtio-pci 0000:00:05.0: PM: dpm_run_callback():
pci_pm_suspend+0x0/0x230 returns -95
[   98.507762] virtio-pci 0000:00:05.0: PM: failed to suspend async: error -95

Failure of the virtio device suspend which will trigger pm_recover+resume.

[   98.509451] ata3.00: Entering standby power mode
[   98.511197] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[   98.512917] sd 0:0:0:0: [sda] Synchronizing SCSI cache

Here, the scsi devices already suspended (asynchronously) do their thing.

[   98.515562] PM: Some devices failed to suspend, or early wake event detected
[   98.521098] virtio_blk virtio3: 4/0/0 default/read/poll queues
[   98.548229] OOM killer enabled.
[   98.549338] Restarting tasks ... done.

The tasks are restarted due to the suspend error above.

[   98.551734] random: crng reseeded on system resumption
[   98.559697] PM: suspend exit
[   98.561038] PM: suspend entry (s2idle)
[   98.566692] Filesystems sync: 0.004 seconds
[   98.574461] Freezing user space processes
[   98.577265] Freezing user space processes completed (elapsed 0.001 seconds)
[   98.579351] OOM killer disabled.
[   98.580341] Freezing remaining freezable tasks
[   98.583109] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)

but here, for whatever reason, the tasks are suspended again despite the fact
that we should be resuming. After this, I only see the scsi&ata devices
finishing suspend and then being resumed all normally. But no coming back to
the shell. It is frozen...

Any idea ?

Damien Le Moal Sept. 5, 2023, 5:20 a.m. UTC | #23

On 8/31/23 10:48, Vivi, Rodrigo wrote:
> On Thu, 2023-08-31 at 09:32 +0900, Damien Le Moal wrote:
>> On 8/31/23 07:14, Rodrigo Vivi wrote:
>>> On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
>>>> On 8/26/23 02:09, Rodrigo Vivi wrote:
>>>>>>> So, maybe we have some kind of disks/configuration out
>>>>>>> there where this
>>>>>>> start upon resume is needed? Maybe it is just a matter of
>>>>>>> timming to
>>>>>>> ensure some firmware underneath is up and back to life?
>>>>>>
>>>>>> I do not think so. Suspend will issue a start stop unit
>>>>>> command to put the drive
>>>>>> to sleep and resume will reset the port (which should wake up
>>>>>> the drive) and
>>>>>> then issue an IDENTIFY command (which will also wake up the
>>>>>> drive) and other
>>>>>> read logs etc to rescan the drive.
>>>>>> In both cases, if the commands do not complete, we would see
>>>>>> errors/timeout and
>>>>>> likely port reset/drive gone events. So I think this is
>>>>>> likely another subtle
>>>>>> race between scsi suspend and ata suspend that is causing a
>>>>>> deadlock.
>>>>>>
>>>>>> The main issue I think is that there is no direct ancestry
>>>>>> between the ata port
>>>>>> (device) and scsi device, so the change to scsi async pm ops
>>>>>> made a mess of the
>>>>>> suspend/resume operations ordering. For suspend, scsi device
>>>>>> (child of ata port)
>>>>>> should be first, then ata port device (parent). For resume,
>>>>>> the reverse order is
>>>>>> needed. PM normally ensures that parent/child ordering, but
>>>>>> we lack that
>>>>>> parent/child relationship. I am working on fixing that but it
>>>>>> is very slow
>>>>>> progress because I have been so far enable to recreate any of
>>>>>> the issues that
>>>>>> have been reported. I am patching "blind"...
>>>>>
>>>>> I believe your suspicious makes sense. And on these lines, that
>>>>> patch you
>>>>> attached earlier would fix that. However my initial tries of
>>>>> that didn't
>>>>> help. I'm going to run more tests and get back to you.
>>>>
>>>> Rodrigo,
>>>>
>>>> I pushed the resume-v2 branch to libata tree:
>>>>
>>>> git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
>>>> (or
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.gi
>>>> t)
>>>>
>>>> This branch adds 13 patches on top of 6.5.0 to cleanup libata
>>>> suspend/resume and
>>>> other device shutdown issues. The first 4 patches are the main
>>>> ones to fix
>>>> suspend resume. I tested that on 2 different machines with
>>>> different drives and
>>>> with qemu. All seems fine.
>>>>
>>>> Could you try to run this through your CI ? I am very interested
>>>> in seeing if it
>>>> survives your suspend/resume tests.
>>>
>>> well, in the end this didn't affect the CI machinery as I was
>>> afraid.
>>> it is only in my local DG2.
>>>
>>> https://intel-gfx-ci.01.org/tree/intel-xe/bat-
>>> all.html?testfilter=suspend
>>> (bat-dg2-oem2 one)
>>>
>>> I just got these 13 patches and applied to my branch and tested it
>>> again
>>> and it still *fails* for me.
>>
>> That is annoying... But I think the messages give us a hint as to
>> what is going
>> on. See below.
>>
>>>
>>> [   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-
>>> DP-2, SUCCESS
>>> [   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest
>>> pipe-B-DP-2
>>> [   80.375042] PM: suspend entry (deep)
>>> [   80.380799] Filesystems sync: 0.002 seconds
>>> [   80.386476] Freezing user space processes
>>> [   80.392286] Freezing user space processes completed (elapsed
>>> 0.001 seconds)
>>> [   80.399294] OOM killer disabled.
>>> [   80.402536] Freezing remaining freezable tasks
>>> [   80.408335] Freezing remaining freezable tasks completed
>>> (elapsed 0.001 seconds)
>>> [   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
>>> [   80.439716] serial 00:01: disabled
>>> [   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
>>> [   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
>>> [   80.453600] ata6.00: Entering standby power mode
>>
>> This is sd 5:0:0:0. All good, ordered properly with the
>> "Synchronizing SCSI cache".
>>
>>> [   80.464217] ata5.00: Entering standby power mode
>>
>> Same here for sd 4:0:0:0.
>>
>>> [   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   80.818520] ata8.00: Entering active power mode
>>> [   80.842989] ata8.00: configured for UDMA/133
>>
>> Arg ! sd 7:0:0:0 is resuming ! But the above "Synchronizing SCSI
>> cache" tells
>> us that it was suspending and libata EH did not yet put that drive to
>> standby...
>>
>>> [   80.847660] ata8.00: Entering standby power mode
>>
>> ... which happens here. So it looks like libata EH had both the
>> suspend and
>> resume requests at the same time, which is totally weird.
> 
> although it looks weird, it totally matches the 'use case'.
> I mean, if I suspend, resume, and wait a bit before suspend and resume
> again, it will work 100% of the time.
> The issue is really only when another suspend comes right after the
> resume, in a loop without any wait.
> 
>>
>>> [   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
>>> [   81.800508] PM: suspend of devices complete after 1367.829 msecs
>>> [   81.806661] PM: start suspend of devices complete after 1390.859
>>> msecs
>>> [   81.813244] PM: suspend devices took 1.398 seconds
>>> [   81.820101] PM: late suspend of devices complete after 2.036
>>> msecs
>>
>> ...and PM suspend completes here. Resume "starts" now (but clearly it
>> started
>> earlier already given that sd 7:0:0:0 was reactivated.
> 
> that is weird.
> 
>>
>>> �[   82.403857] serial 00:01: activated
>>> [   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
>>> [   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
>>> [   82.581444] xe REG[0x223a8-0x223af]: allow read access
>>> [   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
>>> [   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
>>> [   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
>>> [   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
>>> [   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
>>> [   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
>>> [   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
>>> [   82.623634] xe REG[0x263a8-0x263af]: allow read access
>>> [   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
>>> [   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
>>> [   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   82.739773] ata5.00: Entering active power mode
>>> [   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   82.750618] ata6.00: Entering active power mode
>>> [   82.755961] ata5.00: configured for UDMA/133
>>> [   82.760479] ata5.00: Enabling discard_zeroes_data
>>> [   82.836266] ata6.00: configured for UDMA/133
>>> [   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
>>> 300)
>>> [   84.466354] ata8.00: Entering active power mode
>>> [   84.497256] ata8.00: configured for UDMA/133
>>> ...
>>
>> And this looks all normal, the drives have all been transitioned to
>> active
>> power mode as expected. And yet, your system is stuck after this,
>> right ?
> 
> yes

I think I have now figured it out, and fixed. I could reliably recreate the same
hang both with qemu using a failed suspend (using a device not supporting
suspend) and real hardware with a short rtc wake.

It turns out that the root cause of the hang is ata_scsi_dev_rescan(), which is
scheduled asynchronously from PM context on resume. With quick suspend after a
resume, suspend may win the race against that ata_scsi_dev_rescan() task
execution and we endup calling scsi_rescan_device() on a suspended device,
causing that function to wait with the device_lock() held, which causes PM to
deadlock when it needs to resume the scsi device. The recent commit 6aa0365a3c85
("ata: libata-scsi: Avoid deadlock on rescan after device resume") was intended
to fix that, but it did so less than ideally and the fix has a race on the scsi
power state check, thus not always preventing the resume hang.

I pushed a new patch series that goes on top of 6.5.0: resume-v3 branch in the
libata tree:

https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git

This works very well for me. Using this script on real hardware:

for (( i=0; i<20; i++ )); do
	echo "+2" > /sys/class/rtc/rtc0/wakealarm
	echo mem > /sys/power/state
done

The system repeatedly suspends and resumes and comes back OK. Of note is that if
I set the delay to +1 second, then I sometime do not see the system resume and
the script stops. But using wakeup-on-lan (wol command) from another machine to
wake it up, the machine resumes normally and continues executing the script. So
it seems that setting the rtc alarm unreasonably early result in it being lost
and the system suspending wating to be woken up.

I also tested this in qemu. As mentioned before, I cannot get rtc alarm to wake
up the VM guest though. However, using a virtio device that does not support
suspend, resume strats in the middle of the suspend operation due to the suspend
error reported by that device. And it turns out that systemd really insists on
suspending the system despite the error, so when running "systemctl suspend" I
see a retry for suspend right after the first failed one. That is enough to
trigger the issue without the patches.

Please test !

Thanks.

Rodrigo Vivi Sept. 5, 2023, 5:17 p.m. UTC | #24

On Tue, Sep 05, 2023 at 02:20:23PM +0900, Damien Le Moal wrote:
> On 8/31/23 10:48, Vivi, Rodrigo wrote:
> > On Thu, 2023-08-31 at 09:32 +0900, Damien Le Moal wrote:
> >> On 8/31/23 07:14, Rodrigo Vivi wrote:
> >>> On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote:
> >>>> On 8/26/23 02:09, Rodrigo Vivi wrote:
> >>>>>>> So, maybe we have some kind of disks/configuration out
> >>>>>>> there where this
> >>>>>>> start upon resume is needed? Maybe it is just a matter of
> >>>>>>> timming to
> >>>>>>> ensure some firmware underneath is up and back to life?
> >>>>>>
> >>>>>> I do not think so. Suspend will issue a start stop unit
> >>>>>> command to put the drive
> >>>>>> to sleep and resume will reset the port (which should wake up
> >>>>>> the drive) and
> >>>>>> then issue an IDENTIFY command (which will also wake up the
> >>>>>> drive) and other
> >>>>>> read logs etc to rescan the drive.
> >>>>>> In both cases, if the commands do not complete, we would see
> >>>>>> errors/timeout and
> >>>>>> likely port reset/drive gone events. So I think this is
> >>>>>> likely another subtle
> >>>>>> race between scsi suspend and ata suspend that is causing a
> >>>>>> deadlock.
> >>>>>>
> >>>>>> The main issue I think is that there is no direct ancestry
> >>>>>> between the ata port
> >>>>>> (device) and scsi device, so the change to scsi async pm ops
> >>>>>> made a mess of the
> >>>>>> suspend/resume operations ordering. For suspend, scsi device
> >>>>>> (child of ata port)
> >>>>>> should be first, then ata port device (parent). For resume,
> >>>>>> the reverse order is
> >>>>>> needed. PM normally ensures that parent/child ordering, but
> >>>>>> we lack that
> >>>>>> parent/child relationship. I am working on fixing that but it
> >>>>>> is very slow
> >>>>>> progress because I have been so far enable to recreate any of
> >>>>>> the issues that
> >>>>>> have been reported. I am patching "blind"...
> >>>>>
> >>>>> I believe your suspicious makes sense. And on these lines, that
> >>>>> patch you
> >>>>> attached earlier would fix that. However my initial tries of
> >>>>> that didn't
> >>>>> help. I'm going to run more tests and get back to you.
> >>>>
> >>>> Rodrigo,
> >>>>
> >>>> I pushed the resume-v2 branch to libata tree:
> >>>>
> >>>> git@gitolite.kernel.org:pub/scm/linux/kernel/git/dlemoal/libata
> >>>> (or
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.gi
> >>>> t)
> >>>>
> >>>> This branch adds 13 patches on top of 6.5.0 to cleanup libata
> >>>> suspend/resume and
> >>>> other device shutdown issues. The first 4 patches are the main
> >>>> ones to fix
> >>>> suspend resume. I tested that on 2 different machines with
> >>>> different drives and
> >>>> with qemu. All seems fine.
> >>>>
> >>>> Could you try to run this through your CI ? I am very interested
> >>>> in seeing if it
> >>>> survives your suspend/resume tests.
> >>>
> >>> well, in the end this didn't affect the CI machinery as I was
> >>> afraid.
> >>> it is only in my local DG2.
> >>>
> >>> https://intel-gfx-ci.01.org/tree/intel-xe/bat-
> >>> all.html?testfilter=suspend
> >>> (bat-dg2-oem2 one)
> >>>
> >>> I just got these 13 patches and applied to my branch and tested it
> >>> again
> >>> and it still *fails* for me.
> >>
> >> That is annoying... But I think the messages give us a hint as to
> >> what is going
> >> on. See below.
> >>
> >>>
> >>> [   79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-
> >>> DP-2, SUCCESS
> >>> [   79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest
> >>> pipe-B-DP-2
> >>> [   80.375042] PM: suspend entry (deep)
> >>> [   80.380799] Filesystems sync: 0.002 seconds
> >>> [   80.386476] Freezing user space processes
> >>> [   80.392286] Freezing user space processes completed (elapsed
> >>> 0.001 seconds)
> >>> [   80.399294] OOM killer disabled.
> >>> [   80.402536] Freezing remaining freezable tasks
> >>> [   80.408335] Freezing remaining freezable tasks completed
> >>> (elapsed 0.001 seconds)
> >>> [   80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
> >>> [   80.439716] serial 00:01: disabled
> >>> [   80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache
> >>> [   80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache
> >>> [   80.453600] ata6.00: Entering standby power mode
> >>
> >> This is sd 5:0:0:0. All good, ordered properly with the
> >> "Synchronizing SCSI cache".
> >>
> >>> [   80.464217] ata5.00: Entering standby power mode
> >>
> >> Same here for sd 4:0:0:0.
> >>
> >>> [   80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
> >>> 300)
> >>> [   80.818520] ata8.00: Entering active power mode
> >>> [   80.842989] ata8.00: configured for UDMA/133
> >>
> >> Arg ! sd 7:0:0:0 is resuming ! But the above "Synchronizing SCSI
> >> cache" tells
> >> us that it was suspending and libata EH did not yet put that drive to
> >> standby...
> >>
> >>> [   80.847660] ata8.00: Entering standby power mode
> >>
> >> ... which happens here. So it looks like libata EH had both the
> >> suspend and
> >> resume requests at the same time, which is totally weird.
> > 
> > although it looks weird, it totally matches the 'use case'.
> > I mean, if I suspend, resume, and wait a bit before suspend and resume
> > again, it will work 100% of the time.
> > The issue is really only when another suspend comes right after the
> > resume, in a loop without any wait.
> > 
> >>
> >>> [   81.119426] xe 0000:03:00.0: [drm] GT0: suspended
> >>> [   81.800508] PM: suspend of devices complete after 1367.829 msecs
> >>> [   81.806661] PM: start suspend of devices complete after 1390.859
> >>> msecs
> >>> [   81.813244] PM: suspend devices took 1.398 seconds
> >>> [   81.820101] PM: late suspend of devices complete after 2.036
> >>> msecs
> >>
> >> ...and PM suspend completes here. Resume "starts" now (but clearly it
> >> started
> >> earlier already given that sd 7:0:0:0 was reactivated.
> > 
> > that is weird.
> > 
> >>
> >>> �[   82.403857] serial 00:01: activated
> >>> [   82.489612] nvme nvme0: 16/0/0 default/read/poll queues
> >>> [   82.563318] r8169 0000:07:00.0 enp7s0: Link is Down
> >>> [   82.581444] xe REG[0x223a8-0x223af]: allow read access
> >>> [   82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access
> >>> [   82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access
> >>> [   82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access
> >>> [   82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access
> >>> [   82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access
> >>> [   82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access
> >>> [   82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access
> >>> [   82.623634] xe REG[0x263a8-0x263af]: allow read access
> >>> [   82.628816] xe 0000:03:00.0: [drm] GT0: resumed
> >>> [   82.728005] ata7: SATA link down (SStatus 4 SControl 300)
> >>> [   82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl
> >>> 300)
> >>> [   82.739773] ata5.00: Entering active power mode
> >>> [   82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl
> >>> 300)
> >>> [   82.750618] ata6.00: Entering active power mode
> >>> [   82.755961] ata5.00: configured for UDMA/133
> >>> [   82.760479] ata5.00: Enabling discard_zeroes_data
> >>> [   82.836266] ata6.00: configured for UDMA/133
> >>> [   84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl
> >>> 300)
> >>> [   84.466354] ata8.00: Entering active power mode
> >>> [   84.497256] ata8.00: configured for UDMA/133
> >>> ...
> >>
> >> And this looks all normal, the drives have all been transitioned to
> >> active
> >> power mode as expected. And yet, your system is stuck after this,
> >> right ?
> > 
> > yes
> 
> I think I have now figured it out, and fixed. I could reliably recreate the same
> hang both with qemu using a failed suspend (using a device not supporting
> suspend) and real hardware with a short rtc wake.
> 
> It turns out that the root cause of the hang is ata_scsi_dev_rescan(), which is
> scheduled asynchronously from PM context on resume. With quick suspend after a
> resume, suspend may win the race against that ata_scsi_dev_rescan() task
> execution and we endup calling scsi_rescan_device() on a suspended device,
> causing that function to wait with the device_lock() held, which causes PM to
> deadlock when it needs to resume the scsi device. The recent commit 6aa0365a3c85
> ("ata: libata-scsi: Avoid deadlock on rescan after device resume") was intended
> to fix that, but it did so less than ideally and the fix has a race on the scsi
> power state check, thus not always preventing the resume hang.
> 
> I pushed a new patch series that goes on top of 6.5.0: resume-v3 branch in the
> libata tree:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git
> 
> This works very well for me. Using this script on real hardware:
> 
> for (( i=0; i<20; i++ )); do
> 	echo "+2" > /sys/class/rtc/rtc0/wakealarm
> 	echo mem > /sys/power/state
> done
> 
> The system repeatedly suspends and resumes and comes back OK. Of note is that if
> I set the delay to +1 second, then I sometime do not see the system resume and
> the script stops. But using wakeup-on-lan (wol command) from another machine to
> wake it up, the machine resumes normally and continues executing the script. So
> it seems that setting the rtc alarm unreasonably early result in it being lost
> and the system suspending wating to be woken up.
> 
> I also tested this in qemu. As mentioned before, I cannot get rtc alarm to wake
> up the VM guest though. However, using a virtio device that does not support
> suspend, resume strats in the middle of the suspend operation due to the suspend
> error reported by that device. And it turns out that systemd really insists on
> suspending the system despite the error, so when running "systemctl suspend" I
> see a retry for suspend right after the first failed one. That is enough to
> trigger the issue without the patches.
> 
> Please test !

\o/ works for me!

Feel free to use:
Tested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Thank you so much,
Rodrigo.

> 
> Thanks.
> 
> -- 
> Damien Le Moal
> Western Digital Research
>

Damien Le Moal Sept. 6, 2023, 1:07 a.m. UTC | #25

On 9/6/23 02:17, Rodrigo Vivi wrote:
>> I think I have now figured it out, and fixed. I could reliably recreate the same
>> hang both with qemu using a failed suspend (using a device not supporting
>> suspend) and real hardware with a short rtc wake.
>>
>> It turns out that the root cause of the hang is ata_scsi_dev_rescan(), which is
>> scheduled asynchronously from PM context on resume. With quick suspend after a
>> resume, suspend may win the race against that ata_scsi_dev_rescan() task
>> execution and we endup calling scsi_rescan_device() on a suspended device,
>> causing that function to wait with the device_lock() held, which causes PM to
>> deadlock when it needs to resume the scsi device. The recent commit 6aa0365a3c85
>> ("ata: libata-scsi: Avoid deadlock on rescan after device resume") was intended
>> to fix that, but it did so less than ideally and the fix has a race on the scsi
>> power state check, thus not always preventing the resume hang.
>>
>> I pushed a new patch series that goes on top of 6.5.0: resume-v3 branch in the
>> libata tree:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git
>>
>> This works very well for me. Using this script on real hardware:
>>
>> for (( i=0; i<20; i++ )); do
>> 	echo "+2" > /sys/class/rtc/rtc0/wakealarm
>> 	echo mem > /sys/power/state
>> done
>>
>> The system repeatedly suspends and resumes and comes back OK. Of note is that if
>> I set the delay to +1 second, then I sometime do not see the system resume and
>> the script stops. But using wakeup-on-lan (wol command) from another machine to
>> wake it up, the machine resumes normally and continues executing the script. So
>> it seems that setting the rtc alarm unreasonably early result in it being lost
>> and the system suspending wating to be woken up.
>>
>> I also tested this in qemu. As mentioned before, I cannot get rtc alarm to wake
>> up the VM guest though. However, using a virtio device that does not support
>> suspend, resume strats in the middle of the suspend operation due to the suspend
>> error reported by that device. And it turns out that systemd really insists on
>> suspending the system despite the error, so when running "systemctl suspend" I
>> see a retry for suspend right after the first failed one. That is enough to
>> trigger the issue without the patches.
>>
>> Please test !
> 
> \o/ works for me!
> 
> Feel free to use:
> Tested-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Awesome ! Thank you for testing. I will rebase the patches and post the official
version for 6.6 fixes (and the other cleanup patches for 6.7), after retesting
again. Never know :)

Geert Uytterhoeven Sept. 12, 2023, 5:39 p.m. UTC | #26

Hi Damien,

On Mon, 31 Jul 2023, Damien Le Moal wrote:
> During system resume, ata_port_pm_resume() triggers ata EH to
> 1) Resume the controller
> 2) Reset and rescan the ports
> 3) Revalidate devices
> This EH execution is started asynchronously from ata_port_pm_resume(),
> which means that when sd_resume() is executed, none or only part of the
> above processing may have been executed. However, sd_resume() issues a
> START STOP UNIT to wake up the drive from sleep mode. This command is
> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> device. However, depending on the state of execution of the EH process
> and revalidation triggerred by ata_port_pm_resume(), two things may
> happen:
> 1) The START STOP UNIT fails if it is received before the controller has
>   been reenabled at the beginning of the EH execution. This is visible
>   with error messages like:
>
> ata10.00: device reported invalid CHS sector 0
> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> sd 9:0:0:0: PM: failed to resume async: error -5
>
> 2) The START STOP UNIT command is received while the EH process is
>   on-going, which mean that it is stopped and must wait for its
>   completion, at which point the command is rather useless as the drive
>   is already fully spun up already. This case results also in a
>   significant delay in sd_resume() which is observable by users as
>   the entire system resume completion is delayed.
>
> Given that ATA devices will be woken up by libata activity on resume,
> sd_resume() has no need to issue a START STOP UNIT command, which solves
> the above mentioned problems. Do not issue this command by introducing
> the new scsi_device flag no_start_on_resume and setting this flag to 1
> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> UNIT command only if this flag is not set.
>
> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

Thanks for your patch, which is now commit 0a8589055936d8fe
("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
Sorry for being late to the party, but this commit landed upstream
during my summer holidays, and apparently I wasn't that focussed on
noticing small behavioral changes after getting back to work...

I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
this commit: when accessing the drive after system resume, there is now
a delay of ca. 5s before data is returned, presumably due to starting
the drive, and having to wait for it to spin up to be able to read data.
But the good news is that the actual system resume takes less time than
before (reduced by even more than ca. 5s!), so this looks like a net
win...

I'm not sure this is to be called a regression, I just wanted you to be
aware of it.

Thanks!

Gr{oetje,eeting}s,

 						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
 							    -- Linus Torvalds

Damien Le Moal Sept. 12, 2023, 10:58 p.m. UTC | #27

On 9/13/23 02:39, Geert Uytterhoeven wrote:
>  	Hi Damien,
> 
> On Mon, 31 Jul 2023, Damien Le Moal wrote:
>> During system resume, ata_port_pm_resume() triggers ata EH to
>> 1) Resume the controller
>> 2) Reset and rescan the ports
>> 3) Revalidate devices
>> This EH execution is started asynchronously from ata_port_pm_resume(),
>> which means that when sd_resume() is executed, none or only part of the
>> above processing may have been executed. However, sd_resume() issues a
>> START STOP UNIT to wake up the drive from sleep mode. This command is
>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>> device. However, depending on the state of execution of the EH process
>> and revalidation triggerred by ata_port_pm_resume(), two things may
>> happen:
>> 1) The START STOP UNIT fails if it is received before the controller has
>>   been reenabled at the beginning of the EH execution. This is visible
>>   with error messages like:
>>
>> ata10.00: device reported invalid CHS sector 0
>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> sd 9:0:0:0: PM: failed to resume async: error -5
>>
>> 2) The START STOP UNIT command is received while the EH process is
>>   on-going, which mean that it is stopped and must wait for its
>>   completion, at which point the command is rather useless as the drive
>>   is already fully spun up already. This case results also in a
>>   significant delay in sd_resume() which is observable by users as
>>   the entire system resume completion is delayed.
>>
>> Given that ATA devices will be woken up by libata activity on resume,
>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>> the above mentioned problems. Do not issue this command by introducing
>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>> UNIT command only if this flag is not set.
>>
>> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
>> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> 
> Thanks for your patch, which is now commit 0a8589055936d8fe
> ("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
> Sorry for being late to the party, but this commit landed upstream
> during my summer holidays, and apparently I wasn't that focussed on
> noticing small behavioral changes after getting back to work...
> 
> I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
> H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
> this commit: when accessing the drive after system resume, there is now
> a delay of ca. 5s before data is returned, presumably due to starting
> the drive, and having to wait for it to spin up to be able to read data.
> But the good news is that the actual system resume takes less time than
> before (reduced by even more than ca. 5s!), so this looks like a net
> win...

Thanks for the report. The delay for the first data access from user space right
after resume is 100% expected, with or without this patch. The reason is that
waking up the drive (spinning it up) is done asynchronously from the PM resume
context, so when you get "PM suspend exit" message signaling that the system is
resumed, the drive may not yet be spinning. Any access will wait for that to
happen before proceeding. Depending on the drive that can take up to 10s or so.

I am not entirely sure where the net win you see come from. But the patch you
mention is in fact completely wrong and does not fix the underlying issues with
ata suspend/resume and potential deadlocks in PM due to ata ports relationship
with scsi devices. So I have been working on fixing this for the last 2 weeks,
after another user also reported issues with the patch you mention [1].

Could you try libata for-next branch on your system ? There are 7 fix patches in
there that I plan to send out for 6.6-rc2 to fix the patch in question and other
issues potentially causing deadlocks on resume. The patches were posted also [2].

https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t

> I'm not sure this is to be called a regression, I just wanted you to be
> aware of it.

The delay on access is not a regression. You likely where not seeing it before
because of the lack of synchronization between ata & scsi on resume was likely
making the resume processing take longer than it needs to be, which gave time to
the drive to spin up. The patches in libata for-next should fix all that. Please
test !

Thanks !

[1]
https://lore.kernel.org/linux-ide/8be9c370-2f1-5815-431-f68ab868669@linux-m68k.org/T/#m066ae988869d588c1b8c684cb52773e7708ade4c
[2]
https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t

Geert Uytterhoeven Sept. 13, 2023, 10:21 a.m. UTC | #28

Hi Damien,

On Wed, Sep 13, 2023 at 12:58 AM Damien Le Moal <dlemoal@kernel.org> wrote:
> On 9/13/23 02:39, Geert Uytterhoeven wrote:
> > On Mon, 31 Jul 2023, Damien Le Moal wrote:
> >> During system resume, ata_port_pm_resume() triggers ata EH to
> >> 1) Resume the controller
> >> 2) Reset and rescan the ports
> >> 3) Revalidate devices
> >> This EH execution is started asynchronously from ata_port_pm_resume(),
> >> which means that when sd_resume() is executed, none or only part of the
> >> above processing may have been executed. However, sd_resume() issues a
> >> START STOP UNIT to wake up the drive from sleep mode. This command is
> >> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> >> device. However, depending on the state of execution of the EH process
> >> and revalidation triggerred by ata_port_pm_resume(), two things may
> >> happen:
> >> 1) The START STOP UNIT fails if it is received before the controller has
> >>   been reenabled at the beginning of the EH execution. This is visible
> >>   with error messages like:
> >>
> >> ata10.00: device reported invalid CHS sector 0
> >> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> >> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> >> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> >> sd 9:0:0:0: PM: failed to resume async: error -5
> >>
> >> 2) The START STOP UNIT command is received while the EH process is
> >>   on-going, which mean that it is stopped and must wait for its
> >>   completion, at which point the command is rather useless as the drive
> >>   is already fully spun up already. This case results also in a
> >>   significant delay in sd_resume() which is observable by users as
> >>   the entire system resume completion is delayed.
> >>
> >> Given that ATA devices will be woken up by libata activity on resume,
> >> sd_resume() has no need to issue a START STOP UNIT command, which solves
> >> the above mentioned problems. Do not issue this command by introducing
> >> the new scsi_device flag no_start_on_resume and setting this flag to 1
> >> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> >> UNIT command only if this flag is not set.
> >>
> >> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> >> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> >> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> >
> > Thanks for your patch, which is now commit 0a8589055936d8fe
> > ("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
> > Sorry for being late to the party, but this commit landed upstream
> > during my summer holidays, and apparently I wasn't that focussed on
> > noticing small behavioral changes after getting back to work...
> >
> > I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
> > H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
> > this commit: when accessing the drive after system resume, there is now
> > a delay of ca. 5s before data is returned, presumably due to starting
> > the drive, and having to wait for it to spin up to be able to read data.
> > But the good news is that the actual system resume takes less time than
> > before (reduced by even more than ca. 5s!), so this looks like a net
> > win...
>
> Thanks for the report. The delay for the first data access from user space right
> after resume is 100% expected, with or without this patch. The reason is that
> waking up the drive (spinning it up) is done asynchronously from the PM resume
> context, so when you get "PM suspend exit" message signaling that the system is
> resumed, the drive may not yet be spinning. Any access will wait for that to
> happen before proceeding. Depending on the drive that can take up to 10s or so.

That does not match with what I am seeing: before this patch, there
was no delay on first data access from user space, as the drive is fully
spun up when system resume returns.
With this patch, system resume returns earlier, and the drive is only
spun up when user space starts accessing data.

Note that I do not have any file system mounted, and use
"hd /dev/sda | head -70" to access the disk.

> I am not entirely sure where the net win you see come from. But the patch you
> mention is in fact completely wrong and does not fix the underlying issues with
> ata suspend/resume and potential deadlocks in PM due to ata ports relationship
> with scsi devices. So I have been working on fixing this for the last 2 weeks,
> after another user also reported issues with the patch you mention [1].
>
> Could you try libata for-next branch on your system ? There are 7 fix patches in
> there that I plan to send out for 6.6-rc2 to fix the patch in question and other
> issues potentially causing deadlocks on resume. The patches were posted also [2].
>
> https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t

Unfortunately that didn't work, as /dev/sda no longer exists.
Will reply to the patch I bisected the issue to...

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

Geert Uytterhoeven Sept. 13, 2023, 10:34 a.m. UTC | #29

Hi Damien,

On Wed, Sep 13, 2023 at 12:21 PM Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Wed, Sep 13, 2023 at 12:58 AM Damien Le Moal <dlemoal@kernel.org> wrote:
> > On 9/13/23 02:39, Geert Uytterhoeven wrote:
> > > On Mon, 31 Jul 2023, Damien Le Moal wrote:
> > >> During system resume, ata_port_pm_resume() triggers ata EH to
> > >> 1) Resume the controller
> > >> 2) Reset and rescan the ports
> > >> 3) Revalidate devices
> > >> This EH execution is started asynchronously from ata_port_pm_resume(),
> > >> which means that when sd_resume() is executed, none or only part of the
> > >> above processing may have been executed. However, sd_resume() issues a
> > >> START STOP UNIT to wake up the drive from sleep mode. This command is
> > >> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> > >> device. However, depending on the state of execution of the EH process
> > >> and revalidation triggerred by ata_port_pm_resume(), two things may
> > >> happen:
> > >> 1) The START STOP UNIT fails if it is received before the controller has
> > >>   been reenabled at the beginning of the EH execution. This is visible
> > >>   with error messages like:
> > >>
> > >> ata10.00: device reported invalid CHS sector 0
> > >> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> > >> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> > >> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> > >> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> > >> sd 9:0:0:0: PM: failed to resume async: error -5
> > >>
> > >> 2) The START STOP UNIT command is received while the EH process is
> > >>   on-going, which mean that it is stopped and must wait for its
> > >>   completion, at which point the command is rather useless as the drive
> > >>   is already fully spun up already. This case results also in a
> > >>   significant delay in sd_resume() which is observable by users as
> > >>   the entire system resume completion is delayed.
> > >>
> > >> Given that ATA devices will be woken up by libata activity on resume,
> > >> sd_resume() has no need to issue a START STOP UNIT command, which solves
> > >> the above mentioned problems. Do not issue this command by introducing
> > >> the new scsi_device flag no_start_on_resume and setting this flag to 1
> > >> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> > >> UNIT command only if this flag is not set.
> > >>
> > >> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> > >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> > >> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> > >> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> > >
> > > Thanks for your patch, which is now commit 0a8589055936d8fe
> > > ("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
> > > Sorry for being late to the party, but this commit landed upstream
> > > during my summer holidays, and apparently I wasn't that focussed on
> > > noticing small behavioral changes after getting back to work...
> > >
> > > I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
> > > H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
> > > this commit: when accessing the drive after system resume, there is now
> > > a delay of ca. 5s before data is returned, presumably due to starting
> > > the drive, and having to wait for it to spin up to be able to read data.
> > > But the good news is that the actual system resume takes less time than
> > > before (reduced by even more than ca. 5s!), so this looks like a net
> > > win...
> >
> > Thanks for the report. The delay for the first data access from user space right
> > after resume is 100% expected, with or without this patch. The reason is that
> > waking up the drive (spinning it up) is done asynchronously from the PM resume
> > context, so when you get "PM suspend exit" message signaling that the system is
> > resumed, the drive may not yet be spinning. Any access will wait for that to
> > happen before proceeding. Depending on the drive that can take up to 10s or so.
>
> That does not match with what I am seeing: before this patch, there
> was no delay on first data access from user space, as the drive is fully
> spun up when system resume returns.
> With this patch, system resume returns earlier, and the drive is only
> spun up when user space starts accessing data.
>
> Note that I do not have any file system mounted, and use
> "hd /dev/sda | head -70" to access the disk.
>
> > I am not entirely sure where the net win you see come from. But the patch you
> > mention is in fact completely wrong and does not fix the underlying issues with
> > ata suspend/resume and potential deadlocks in PM due to ata ports relationship
> > with scsi devices. So I have been working on fixing this for the last 2 weeks,
> > after another user also reported issues with the patch you mention [1].
> >
> > Could you try libata for-next branch on your system ? There are 7 fix patches in
> > there that I plan to send out for 6.6-rc2 to fix the patch in question and other
> > issues potentially causing deadlocks on resume. The patches were posted also [2].
> >
> > https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t
>
> Unfortunately that didn't work, as /dev/sda no longer exists.
> Will reply to the patch I bisected the issue to...

With libata/for-next (fa2259a59966c005 ("ata: libata: Cleanup inline
DMA helper functions")) and commit 99626085d036ec32 ("ata: libata-scsi:
link ata port and scsi device") reverted, it behaves as before (disk
is spun up when system resume completes, no delay when accessing the
disk from userspace).

Gr{oetje,eeting}s,

                        Geert

Damien Le Moal Sept. 13, 2023, 10:03 p.m. UTC | #30

On 9/13/23 19:21, Geert Uytterhoeven wrote:
>> Thanks for the report. The delay for the first data access from user space right
>> after resume is 100% expected, with or without this patch. The reason is that
>> waking up the drive (spinning it up) is done asynchronously from the PM resume
>> context, so when you get "PM suspend exit" message signaling that the system is
>> resumed, the drive may not yet be spinning. Any access will wait for that to
>> happen before proceeding. Depending on the drive that can take up to 10s or so.
> 
> That does not match with what I am seeing: before this patch, there
> was no delay on first data access from user space, as the drive is fully
> spun up when system resume returns.

Yes, that is a possibility, but not by design. Some users have complained about
the long resume times which causes laptop screens to be "black" until disks spin
up. That did not happen before 5.16, when the change to scsi using the PM async
ops to do suspend/resume created all the issues with suspend/resume on ata side.
I am going to run 5.15 again to check.

The patch "do not issue START STOP UNIT on resume" was a botch attempt to remove
this delay. But it is a bad one because the ATA specs define that a drive can
get out of standby (spun down) power state only with a media access command (a
VERIFY command is used to spin up the disk). And furthermore, the specs also
says that even a reset shall not change the device power state. So issuing a
VERIFY command to spin up the drive is required per specs. Note that I do see
many of my drives (I have hundreds in the lab) spinning up on reset, which is
against the specs. But not all of them. So with the patch "do not issue START
STOP UNIT on resume", one risks not seeing the drive resuming correctly (timeout
errors on IDENTIFY command issued on resume will get the drive removed).

> With this patch, system resume returns earlier, and the drive is only
> spun up when user space starts accessing data.

Yes, but "when user space starts accessing data" -> "user space accesses are
processed only after the drive completes spinning up" would be more accurate.
That is by design and expected. This is the behavior one would see if the drive
is set to use standby timers (to go to standby on its own after some idle time)
or if runtime suspend is used. I do not see any issue with this behavior. Is
this causing any issue on your end ? Do you have concerns about this approach ?

Having the resume process wait for the drive to fully spin-up is easy to do. But
as I mentioned, many users are really not happy about how slow resuming become
with that. If I do that, you will get back the previous behavior you mention,
but I will be again getting emails about "resume is broken".

I made a decision: no waiting for spinup. That causes a delay for the user on
first access, but that makes resume overall far faster. I do not want to go back
on that, unless you can confirm that there is a real regression/error/data
corruption happening.

> Note that I do not have any file system mounted, and use
> "hd /dev/sda | head -70" to access the disk.

That is fine. dd would do as well. Any media access, from user or from an FS
will be delayed until the disk spins up. The default timeout for commands is
30s, which is plenty of time for the drive to complete spinup. So unless one
starts pounding the drive with some crazy workload right after user space
resumes, there should be no timeout errors or any issues with that.

>> I am not entirely sure where the net win you see come from. But the patch you
>> mention is in fact completely wrong and does not fix the underlying issues with
>> ata suspend/resume and potential deadlocks in PM due to ata ports relationship
>> with scsi devices. So I have been working on fixing this for the last 2 weeks,
>> after another user also reported issues with the patch you mention [1].
>>
>> Could you try libata for-next branch on your system ? There are 7 fix patches in
>> there that I plan to send out for 6.6-rc2 to fix the patch in question and other
>> issues potentially causing deadlocks on resume. The patches were posted also [2].
>>
>> https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t
> 
> Unfortunately that didn't work, as /dev/sda no longer exists.
> Will reply to the patch I bisected the issue to...

Now that is a problem...

Damien Le Moal Sept. 13, 2023, 10:07 p.m. UTC | #31

On 9/13/23 19:34, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Wed, Sep 13, 2023 at 12:21 PM Geert Uytterhoeven
> <geert@linux-m68k.org> wrote:
>> On Wed, Sep 13, 2023 at 12:58 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>>> On 9/13/23 02:39, Geert Uytterhoeven wrote:
>>>> On Mon, 31 Jul 2023, Damien Le Moal wrote:
>>>>> During system resume, ata_port_pm_resume() triggers ata EH to
>>>>> 1) Resume the controller
>>>>> 2) Reset and rescan the ports
>>>>> 3) Revalidate devices
>>>>> This EH execution is started asynchronously from ata_port_pm_resume(),
>>>>> which means that when sd_resume() is executed, none or only part of the
>>>>> above processing may have been executed. However, sd_resume() issues a
>>>>> START STOP UNIT to wake up the drive from sleep mode. This command is
>>>>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
>>>>> device. However, depending on the state of execution of the EH process
>>>>> and revalidation triggerred by ata_port_pm_resume(), two things may
>>>>> happen:
>>>>> 1) The START STOP UNIT fails if it is received before the controller has
>>>>>   been reenabled at the beginning of the EH execution. This is visible
>>>>>   with error messages like:
>>>>>
>>>>> ata10.00: device reported invalid CHS sector 0
>>>>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>>>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>>>>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
>>>>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>>>>> sd 9:0:0:0: PM: failed to resume async: error -5
>>>>>
>>>>> 2) The START STOP UNIT command is received while the EH process is
>>>>>   on-going, which mean that it is stopped and must wait for its
>>>>>   completion, at which point the command is rather useless as the drive
>>>>>   is already fully spun up already. This case results also in a
>>>>>   significant delay in sd_resume() which is observable by users as
>>>>>   the entire system resume completion is delayed.
>>>>>
>>>>> Given that ATA devices will be woken up by libata activity on resume,
>>>>> sd_resume() has no need to issue a START STOP UNIT command, which solves
>>>>> the above mentioned problems. Do not issue this command by introducing
>>>>> the new scsi_device flag no_start_on_resume and setting this flag to 1
>>>>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
>>>>> UNIT command only if this flag is not set.
>>>>>
>>>>> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
>>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>>>> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
>>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>>>
>>>> Thanks for your patch, which is now commit 0a8589055936d8fe
>>>> ("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
>>>> Sorry for being late to the party, but this commit landed upstream
>>>> during my summer holidays, and apparently I wasn't that focussed on
>>>> noticing small behavioral changes after getting back to work...
>>>>
>>>> I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
>>>> H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
>>>> this commit: when accessing the drive after system resume, there is now
>>>> a delay of ca. 5s before data is returned, presumably due to starting
>>>> the drive, and having to wait for it to spin up to be able to read data.
>>>> But the good news is that the actual system resume takes less time than
>>>> before (reduced by even more than ca. 5s!), so this looks like a net
>>>> win...
>>>
>>> Thanks for the report. The delay for the first data access from user space right
>>> after resume is 100% expected, with or without this patch. The reason is that
>>> waking up the drive (spinning it up) is done asynchronously from the PM resume
>>> context, so when you get "PM suspend exit" message signaling that the system is
>>> resumed, the drive may not yet be spinning. Any access will wait for that to
>>> happen before proceeding. Depending on the drive that can take up to 10s or so.
>>
>> That does not match with what I am seeing: before this patch, there
>> was no delay on first data access from user space, as the drive is fully
>> spun up when system resume returns.
>> With this patch, system resume returns earlier, and the drive is only
>> spun up when user space starts accessing data.
>>
>> Note that I do not have any file system mounted, and use
>> "hd /dev/sda | head -70" to access the disk.
>>
>>> I am not entirely sure where the net win you see come from. But the patch you
>>> mention is in fact completely wrong and does not fix the underlying issues with
>>> ata suspend/resume and potential deadlocks in PM due to ata ports relationship
>>> with scsi devices. So I have been working on fixing this for the last 2 weeks,
>>> after another user also reported issues with the patch you mention [1].
>>>
>>> Could you try libata for-next branch on your system ? There are 7 fix patches in
>>> there that I plan to send out for 6.6-rc2 to fix the patch in question and other
>>> issues potentially causing deadlocks on resume. The patches were posted also [2].
>>>
>>> https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t
>>
>> Unfortunately that didn't work, as /dev/sda no longer exists.
>> Will reply to the patch I bisected the issue to...
> 
> With libata/for-next (fa2259a59966c005 ("ata: libata: Cleanup inline
> DMA helper functions")) and commit 99626085d036ec32 ("ata: libata-scsi:
> link ata port and scsi device") reverted, it behaves as before (disk
> is spun up when system resume completes, no delay when accessing the
> disk from userspace).

I will check the ata platform driver for R-CAR. I may have overlooked something
in that area. I tested with AHCI and libsas adapters only as I do not have ATA
on the few ARM SBC boards I have. And I do not have an R-CAR board.

What surprises me is that you need to revert ata: libata: Cleanup inline DMA
helper functions". This patch has 0 functional changes and really is only a code
cleanup... Nothing should change with it. Can you confirm that you really need
to revert that patch to get things working again ?

Geert Uytterhoeven Sept. 14, 2023, 6:53 a.m. UTC | #32

Hi Damien,

On Thu, Sep 14, 2023 at 12:03 AM Damien Le Moal <dlemoal@kernel.org> wrote:
> On 9/13/23 19:21, Geert Uytterhoeven wrote:
> >> Thanks for the report. The delay for the first data access from user space right
> >> after resume is 100% expected, with or without this patch. The reason is that
> >> waking up the drive (spinning it up) is done asynchronously from the PM resume
> >> context, so when you get "PM suspend exit" message signaling that the system is
> >> resumed, the drive may not yet be spinning. Any access will wait for that to
> >> happen before proceeding. Depending on the drive that can take up to 10s or so.
> >
> > That does not match with what I am seeing: before this patch, there
> > was no delay on first data access from user space, as the drive is fully
> > spun up when system resume returns.
>
> Yes, that is a possibility, but not by design. Some users have complained about
> the long resume times which causes laptop screens to be "black" until disks spin
> up. That did not happen before 5.16, when the change to scsi using the PM async
> ops to do suspend/resume created all the issues with suspend/resume on ata side.
> I am going to run 5.15 again to check.
>
> The patch "do not issue START STOP UNIT on resume" was a botch attempt to remove
> this delay. But it is a bad one because the ATA specs define that a drive can
> get out of standby (spun down) power state only with a media access command (a
> VERIFY command is used to spin up the disk). And furthermore, the specs also
> says that even a reset shall not change the device power state. So issuing a
> VERIFY command to spin up the drive is required per specs. Note that I do see
> many of my drives (I have hundreds in the lab) spinning up on reset, which is
> against the specs. But not all of them. So with the patch "do not issue START
> STOP UNIT on resume", one risks not seeing the drive resuming correctly (timeout
> errors on IDENTIFY command issued on resume will get the drive removed).
>
> > With this patch, system resume returns earlier, and the drive is only
> > spun up when user space starts accessing data.
>
> Yes, but "when user space starts accessing data" -> "user space accesses are
> processed only after the drive completes spinning up" would be more accurate.

Sure, I wrote it down in terms of what the user is experiencing, which may
not be identical to what's happening under the hood.

> That is by design and expected. This is the behavior one would see if the drive
> is set to use standby timers (to go to standby on its own after some idle time)
> or if runtime suspend is used. I do not see any issue with this behavior. Is
> this causing any issue on your end ? Do you have concerns about this approach ?
>
> Having the resume process wait for the drive to fully spin-up is easy to do. But
> as I mentioned, many users are really not happy about how slow resuming become
> with that. If I do that, you will get back the previous behavior you mention,
> but I will be again getting emails about "resume is broken".
>
> I made a decision: no waiting for spinup. That causes a delay for the user on
> first access, but that makes resume overall far faster. I do not want to go back
> on that, unless you can confirm that there is a real regression/error/data
> corruption happening.

I agree that is fine.

Gr{oetje,eeting}s,

                        Geert

Damien Le Moal Sept. 14, 2023, 6:58 a.m. UTC | #33

On 9/14/23 15:53, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Thu, Sep 14, 2023 at 12:03 AM Damien Le Moal <dlemoal@kernel.org> wrote:
>> On 9/13/23 19:21, Geert Uytterhoeven wrote:
>>>> Thanks for the report. The delay for the first data access from user space right
>>>> after resume is 100% expected, with or without this patch. The reason is that
>>>> waking up the drive (spinning it up) is done asynchronously from the PM resume
>>>> context, so when you get "PM suspend exit" message signaling that the system is
>>>> resumed, the drive may not yet be spinning. Any access will wait for that to
>>>> happen before proceeding. Depending on the drive that can take up to 10s or so.
>>>
>>> That does not match with what I am seeing: before this patch, there
>>> was no delay on first data access from user space, as the drive is fully
>>> spun up when system resume returns.
>>
>> Yes, that is a possibility, but not by design. Some users have complained about
>> the long resume times which causes laptop screens to be "black" until disks spin
>> up. That did not happen before 5.16, when the change to scsi using the PM async
>> ops to do suspend/resume created all the issues with suspend/resume on ata side.
>> I am going to run 5.15 again to check.
>>
>> The patch "do not issue START STOP UNIT on resume" was a botch attempt to remove
>> this delay. But it is a bad one because the ATA specs define that a drive can
>> get out of standby (spun down) power state only with a media access command (a
>> VERIFY command is used to spin up the disk). And furthermore, the specs also
>> says that even a reset shall not change the device power state. So issuing a
>> VERIFY command to spin up the drive is required per specs. Note that I do see
>> many of my drives (I have hundreds in the lab) spinning up on reset, which is
>> against the specs. But not all of them. So with the patch "do not issue START
>> STOP UNIT on resume", one risks not seeing the drive resuming correctly (timeout
>> errors on IDENTIFY command issued on resume will get the drive removed).
>>
>>> With this patch, system resume returns earlier, and the drive is only
>>> spun up when user space starts accessing data.
>>
>> Yes, but "when user space starts accessing data" -> "user space accesses are
>> processed only after the drive completes spinning up" would be more accurate.
> 
> Sure, I wrote it down in terms of what the user is experiencing, which may
> not be identical to what's happening under the hood.
> 
>> That is by design and expected. This is the behavior one would see if the drive
>> is set to use standby timers (to go to standby on its own after some idle time)
>> or if runtime suspend is used. I do not see any issue with this behavior. Is
>> this causing any issue on your end ? Do you have concerns about this approach ?
>>
>> Having the resume process wait for the drive to fully spin-up is easy to do. But
>> as I mentioned, many users are really not happy about how slow resuming become
>> with that. If I do that, you will get back the previous behavior you mention,
>> but I will be again getting emails about "resume is broken".
>>
>> I made a decision: no waiting for spinup. That causes a delay for the user on
>> first access, but that makes resume overall far faster. I do not want to go back
>> on that, unless you can confirm that there is a real regression/error/data
>> corruption happening.
> 
> I agree that is fine.

Still, R-CAR is not working for you with the patch adding the link. Looking into
it now. Trying to verify things with qemu & regular AHCI on PCs too as I do not
have an R-CAR board. Hard to debug :)

Preparing some patches for you to get more debug info.

> 
> Gr{oetje,eeting}s,
> 
>                         Geert
>

Geert Uytterhoeven Sept. 14, 2023, 6:59 a.m. UTC | #34

Hi Damien,

On Thu, Sep 14, 2023 at 12:08 AM Damien Le Moal <dlemoal@kernel.org> wrote:
> On 9/13/23 19:34, Geert Uytterhoeven wrote:
> > On Wed, Sep 13, 2023 at 12:21 PM Geert Uytterhoeven
> > <geert@linux-m68k.org> wrote:
> >> On Wed, Sep 13, 2023 at 12:58 AM Damien Le Moal <dlemoal@kernel.org> wrote:
> >>> On 9/13/23 02:39, Geert Uytterhoeven wrote:
> >>>> On Mon, 31 Jul 2023, Damien Le Moal wrote:
> >>>>> During system resume, ata_port_pm_resume() triggers ata EH to
> >>>>> 1) Resume the controller
> >>>>> 2) Reset and rescan the ports
> >>>>> 3) Revalidate devices
> >>>>> This EH execution is started asynchronously from ata_port_pm_resume(),
> >>>>> which means that when sd_resume() is executed, none or only part of the
> >>>>> above processing may have been executed. However, sd_resume() issues a
> >>>>> START STOP UNIT to wake up the drive from sleep mode. This command is
> >>>>> translated to ATA with ata_scsi_start_stop_xlat() and issued to the
> >>>>> device. However, depending on the state of execution of the EH process
> >>>>> and revalidation triggerred by ata_port_pm_resume(), two things may
> >>>>> happen:
> >>>>> 1) The START STOP UNIT fails if it is received before the controller has
> >>>>>   been reenabled at the beginning of the EH execution. This is visible
> >>>>>   with error messages like:
> >>>>>
> >>>>> ata10.00: device reported invalid CHS sector 0
> >>>>> sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >>>>> sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> >>>>> sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> >>>>> sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> >>>>> sd 9:0:0:0: PM: failed to resume async: error -5
> >>>>>
> >>>>> 2) The START STOP UNIT command is received while the EH process is
> >>>>>   on-going, which mean that it is stopped and must wait for its
> >>>>>   completion, at which point the command is rather useless as the drive
> >>>>>   is already fully spun up already. This case results also in a
> >>>>>   significant delay in sd_resume() which is observable by users as
> >>>>>   the entire system resume completion is delayed.
> >>>>>
> >>>>> Given that ATA devices will be woken up by libata activity on resume,
> >>>>> sd_resume() has no need to issue a START STOP UNIT command, which solves
> >>>>> the above mentioned problems. Do not issue this command by introducing
> >>>>> the new scsi_device flag no_start_on_resume and setting this flag to 1
> >>>>> in ata_scsi_dev_config(). sd_resume() is modified to issue a START STOP
> >>>>> UNIT command only if this flag is not set.
> >>>>>
> >>>>> Reported-by: Paul Ausbeck <paula@soe.ucsc.edu>
> >>>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215880
> >>>>> Fixes: a19a93e4c6a9 ("scsi: core: pm: Rely on the device driver core for async power management")
> >>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> >>>>
> >>>> Thanks for your patch, which is now commit 0a8589055936d8fe
> >>>> ("ata,scsi: do not issue START STOP UNIT on resume") in v6.5-rc5.
> >>>> Sorry for being late to the party, but this commit landed upstream
> >>>> during my summer holidays, and apparently I wasn't that focussed on
> >>>> noticing small behavioral changes after getting back to work...
> >>>>
> >>>> I noticed an oddity after s2idle or s2ram on Renesas Salvator-XS (R-Car
> >>>> H3 ES2.0) with an old (spinning rust) SATA drive, and bisected it to
> >>>> this commit: when accessing the drive after system resume, there is now
> >>>> a delay of ca. 5s before data is returned, presumably due to starting
> >>>> the drive, and having to wait for it to spin up to be able to read data.
> >>>> But the good news is that the actual system resume takes less time than
> >>>> before (reduced by even more than ca. 5s!), so this looks like a net
> >>>> win...
> >>>
> >>> Thanks for the report. The delay for the first data access from user space right
> >>> after resume is 100% expected, with or without this patch. The reason is that
> >>> waking up the drive (spinning it up) is done asynchronously from the PM resume
> >>> context, so when you get "PM suspend exit" message signaling that the system is
> >>> resumed, the drive may not yet be spinning. Any access will wait for that to
> >>> happen before proceeding. Depending on the drive that can take up to 10s or so.
> >>
> >> That does not match with what I am seeing: before this patch, there
> >> was no delay on first data access from user space, as the drive is fully
> >> spun up when system resume returns.
> >> With this patch, system resume returns earlier, and the drive is only
> >> spun up when user space starts accessing data.
> >>
> >> Note that I do not have any file system mounted, and use
> >> "hd /dev/sda | head -70" to access the disk.
> >>
> >>> I am not entirely sure where the net win you see come from. But the patch you
> >>> mention is in fact completely wrong and does not fix the underlying issues with
> >>> ata suspend/resume and potential deadlocks in PM due to ata ports relationship
> >>> with scsi devices. So I have been working on fixing this for the last 2 weeks,
> >>> after another user also reported issues with the patch you mention [1].
> >>>
> >>> Could you try libata for-next branch on your system ? There are 7 fix patches in
> >>> there that I plan to send out for 6.6-rc2 to fix the patch in question and other
> >>> issues potentially causing deadlocks on resume. The patches were posted also [2].
> >>>
> >>> https://lore.kernel.org/linux-ide/20230912005655.368075-1-dlemoal@kernel.org/T/#t
> >>
> >> Unfortunately that didn't work, as /dev/sda no longer exists.
> >> Will reply to the patch I bisected the issue to...
> >
> > With libata/for-next (fa2259a59966c005 ("ata: libata: Cleanup inline
> > DMA helper functions")) and commit 99626085d036ec32 ("ata: libata-scsi:
> > link ata port and scsi device") reverted, it behaves as before (disk
> > is spun up when system resume completes, no delay when accessing the
> > disk from userspace).
>
> I will check the ata platform driver for R-CAR. I may have overlooked something
> in that area. I tested with AHCI and libsas adapters only as I do not have ATA
> on the few ARM SBC boards I have. And I do not have an R-CAR board.
>
> What surprises me is that you need to revert ata: libata: Cleanup inline DMA
> helper functions". This patch has 0 functional changes and really is only a code
> cleanup... Nothing should change with it. Can you confirm that you really need
> to revert that patch to get things working again ?

My apologies, I should have written:

    With libata/for-next (which is currently at fa2259a59966c005 ("ata:
    libata: Cleanup inline DMA helper functions")), and commit
    99626085d036ec32 ("ata: libata-scsi: link ata port and scsi device")
    reverted, ...

So 99626085d036ec32 was the only commit I had to revert.
Sorry for the confusion.

Gr{oetje,eeting}s,

                        Geert

Phillip Susi Sept. 14, 2023, 3:29 p.m. UTC | #35

Damien Le Moal <dlemoal@kernel.org> writes:

> VERIFY command is used to spin up the disk). And furthermore, the specs also
> says that even a reset shall not change the device power state. So issuing a
> VERIFY command to spin up the drive is required per specs. Note that I do see
> many of my drives (I have hundreds in the lab) spinning up on reset, which is
> against the specs. But not all of them. So with the patch "do not issue START
> STOP UNIT on resume", one risks not seeing the drive resuming correctly (timeout
> errors on IDENTIFY command issued on resume will get the drive removed).

I tried to get some patches merged a few years back along these lines to
be able to keep some ATA disks in a mostly permanent state of being
powered down.  I don't recall the specs saying that a reset would not
change power states.  What I do recall is that as a general rule, ATA
disks automatically spin up after power on, unless the Power Up in
Standby feature is activated, which at least for Western Digital drives,
requires a jumper to short some pins on the drive.

IIRC, I had a patch to skip the START UNIT command on resume based on a
sysfs setting.  I think it was rejected because SCSI disks can not be
accessed without the START UNIT command, and even though the scsi eh
path will notice when a drive needs this and issue it, there was
something not good about the eh path.  I think it was something along
the lines of all IO to all disks on the whole scsi host are suspended
during eh.

I also tried to have the runtime pm notice whether the drive auto
powered up or not and set the runtime pm state to correctly indicate
whether the drive is running or not, as well as to use the deeper ATA
SLEEP state to save more power over STANDBY and allow the ATA link to be
runtime pm suspended.

ata,scsi: do not issue START STOP UNIT on resume

Commit Message

Comments

Patch