Message ID: 20230927141828.90288-1-dlemoal@kernel.org (mailing list archive)
Series: Fix libata suspend/resume handling and code cleanup
I went to test these patches tonight and it looks like Linus already merged them (or mostly?). I enabled runtime PM on the SCSI target and the ATA port, and the disk spins down and the port does too.

I noticed, though, that when entering system suspend, a disk that has already been runtime suspended is resumed, only to be immediately suspended again before the system suspend. That shouldn't happen, should it?
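For anyone wanting to reproduce the setup described above: runtime PM for the disk and its ATA port is enabled through sysfs. This is only a sketch; the device paths below are examples and will differ from system to system.

```shell
# Allow the SCSI disk to runtime suspend after 10 s of idleness
# (sda and the PCI/ata1 paths below are examples; adjust for your system).
echo 10000 > /sys/block/sda/device/power/autosuspend_delay_ms
echo auto  > /sys/block/sda/device/power/control

# Allow the ATA port itself to runtime suspend as well.
echo auto > /sys/devices/pci0000:00/0000:00:17.0/ata1/power/control
```

With both set to "auto", the disk spins down first and the now-idle port can then suspend too, matching the behavior reported above.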
Phillip Susi <phill@thesusis.net> writes:

> I noticed though, that when entering system suspend, a disk that has
> already been runtime suspended is resumed only to immediately be
> suspended again before the system suspend. That shouldn't happen should
> it?

It seems that it is /sys/power/sync_on_suspend. The problem went away when I unmounted the disk, and I could make the disk wake up by running sync. I thought that it used to be that as long as you mounted an ext4 filesystem with -o relatime, it wouldn't keep dirtying the cache as long as you weren't actually writing to the filesystem. Either I'm misremembering something, or this is no longer the case. Also, if there are dirty pages in the cache, I thought the default was for them to be written out after 5 seconds, which would keep the disk awake, rather than waiting until system suspend to sync.

I guess I could mount the filesystem read-only. It's probably not a good idea to disable sync_on_suspend for the whole system.
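The two kernel knobs mentioned above can be inspected directly. As a point of reference, vm.dirty_writeback_centisecs defaults to 500 (i.e. the 5-second writeback interval referred to above), and /sys/power/sync_on_suspend exists on kernels built with suspend support (added in 5.6) and defaults to 1:

```shell
# Dirty-page writeback interval, in centiseconds (default 500 = 5 s).
sysctl vm.dirty_writeback_centisecs

# Whether filesystems are synced on system suspend (1 = yes, the default).
# Writing 0 here disables the sync system-wide, which, as noted above,
# is probably not a good idea.
cat /sys/power/sync_on_suspend
```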
On 10/3/23 08:39, Phillip Susi wrote:

> I went to test these patches tonight and it looks like Linus already
> merged them (or mostly?). I enabled runtime pm on the scsi target and
> the ata port, and the disk spins down and the port does too.
>
> I noticed though, that when entering system suspend, a disk that has
> already been runtime suspended is resumed only to immediately be
> suspended again before the system suspend. That shouldn't happen should
> it?

Indeed. Will look at this. Geert also reported seeing an issue with resume (one time only, so I suspect there is still a race with libata-eh). So it looks like something is still missing.
On 10/3/23 09:27, Phillip Susi wrote:

> Phillip Susi <phill@thesusis.net> writes:
>
>> I noticed though, that when entering system suspend, a disk that has
>> already been runtime suspended is resumed only to immediately be
>> suspended again before the system suspend. That shouldn't happen should
>> it?
>
> It seems that it is /sys/power/sync_on_suspend. The problem went away
> when I unmounted the disk, or I could make the disk wake up by running
> sync. I thought that it used to be that as long as you mounted an ext4
> filesystem with -o relatime, it wouldn't keep dirtying the cache as long
> as you weren't actually writing to the filesystem. Either I'm
> misremembering something, or this is no longer the case. Also if there
> are dirty pages in the cache, I thought the default was for them to be
> written out after 5 seconds, which would keep the disk awake, rather
> than waiting until system suspend to sync.
>
> I guess I could mount the filesystem readonly. It's probably not a good
> idea to disable sync_on_suspend for the whole system.

Hmmm... So this could be the fs suspend then, which issues a sync, but the device is already suspended and was synced already. In that case, we should turn that sync into a nop so as not to wake up the drive unnecessarily. The fix may be needed on the scsi sd side rather than in libata. Let me check.
On Tue, Oct 03, 2023 at 09:44:50AM +0900, Damien Le Moal wrote:

> Hmmm... So this could be the fs suspend then, which issues a sync but
> the device is already suspended and was synced already. In that case, we
> should turn that sync into a nop to not wakeup the drive unnecessarily.
> The fix may be needed on scsi sd side rather than libata.

I did some tracing today on a test ext4 fs I created on a loopback device, and it seems that the superblocks are written every time you sync, even if no files on the filesystem have ever been opened, even for read access.
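An aside on why several distinct offsets show up in a trace like the loop-device one described above: with ext4's sparse_super feature (on by default), backup superblocks are kept only in block groups 1 and powers of 3, 5, and 7, so a sync that touches the superblock(s) can hit a handful of widely spaced locations. A small illustrative helper (not part of any ext4 tooling) that lists those groups:

```python
# With ext4's sparse_super feature (the mkfs.ext4 default), backup
# superblocks live only in block groups 1 and powers of 3, 5, and 7.
# This helper lists those groups for a filesystem with `group_count`
# block groups. Illustrative only; real tooling is dumpe2fs(8).
def backup_sb_groups(group_count):
    groups = set()
    for base in (3, 5, 7):
        g = 1
        while g < group_count:
            groups.add(g)
            g *= base
    # group 0 holds the primary superblock; group 1 always has a backup
    return sorted(g for g in groups if 0 < g < group_count)

print(backup_sb_groups(64))  # [1, 3, 5, 7, 9, 25, 27, 49]
```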
On 10/4/23 06:22, Phillip Susi wrote:

> On Tue, Oct 03, 2023 at 09:44:50AM +0900, Damien Le Moal wrote:
>> Hmmm... So this could be the fs suspend then, which issues a sync but
>> the device is already suspended and was synced already. In that case,
>> we should turn that sync into a nop to not wakeup the drive
>> unnecessarily. The fix may be needed on scsi sd side rather than
>> libata.
>
> I did some tracing today on a test ext4 fs I created on a loopback
> device, and it seems that the superblocks are written every time you
> sync, even if no files on the filesystem have even been opened for read
> access.

OK. So a fix would need to be on the FS side then, if one wants to avoid that useless resume. However, this may clash with the FS's need to record state in its superblock, so we may not be able to avoid it.
Damien Le Moal <dlemoal@kernel.org> writes:

>> I did some tracing today on a test ext4 fs I created on a loopback
>> device, and it seems that the superblocks are written every time you
>> sync, even if no files on the filesystem have even been opened for read
>> access.
>
> OK. So a fix would need to be on the FS side then if one wants to avoid
> that useless resume. However, this may clash with the FS need to record
> stuff in its sb and so we may not be able to avoid that.

OK, this is very strange. I went back to my distro kernel, without runtime PM, mounted the filesystems rw again, used hdparm -y to suspend the disks, verified with hdparm -C that they were in standby, and suspended the system. In dmesg I see:

Filesystems sync: 0.007 seconds

Now, if it were writing the superblocks to the disk there, I would expect that to take more like 3 seconds while it woke the disks back up, like it did when I was testing the latest kernel with runtime PM.

Another odd thing I noticed with runtime PM was that sometimes the drives would randomly start up even though I was not accessing them. This never happens when I am normally using the Debian kernel with no runtime PM, just running hdparm -y to put the drives to sleep. I can check them hours later and they are still in standby.

I just tried running sync under blktrace, and it looks like it is writing the superblock to the drive, and yet hdparm -C still says the drive is in standby. This makes no sense. Here is what blktrace said when I ran sync:

8,0 0 1 0.000000000 34004 Q FWS [sync]
8,0 0 2 0.000001335 34004 G FWS [sync]
8,0 0 3 0.000004327 31088 D FN [kworker/0:2H]
8,0 0 4 0.000068945 0 C FN 0 [0]
8,0 0 5 0.000069466 0 C WS 0 [0]

I just noticed that this trace doesn't show the 0+8 that I saw when I was testing sync on a fresh, empty ext4 filesystem on a loop device. That trace showed 0+8, indicating the first 4k block of the disk, as well as 1023+8 and one or two more offsets that I thought were the backup superblocks.

What the heck is this sync actually writing, and why does it not cause the disk to take itself out of standby, while with runtime PM it does? Could this just be a FLUSH of some sort, which the disk ignores when it is in standby, but which kernel runtime PM decides it must wake the disk up for before dispatching, even though that is useless?
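For readers not fluent in blktrace output: the seventh column above is the RWBS field, so "FWS" is a synchronous write carrying a preceding cache flush (the empty flush that sync(1) issues), and "FN" is the flush itself, with no data transfer. A small decoder sketch (the flag letters follow what the kernel's trace code emits; the helper itself is invented for this note):

```python
# Decode blktrace's RWBS field, as seen in the trace above.
# Flag letters follow the kernel's blk_fill_rwbs(); this helper is
# illustrative and not part of blktrace/blkparse.
RWBS_FLAGS = {
    'D': 'discard',
    'W': 'write',
    'R': 'read',
    'N': 'none (no data transfer)',
    'S': 'sync',
    'M': 'metadata',
    'A': 'readahead',
}

def decode_rwbs(rwbs):
    """Expand an RWBS string like 'FWS' into human-readable flag names."""
    out = []
    for i, c in enumerate(rwbs):
        if c == 'F':
            # A leading F is a preflush; an F after the data direction is FUA.
            out.append('preflush' if i == 0 else 'fua')
        else:
            out.append(RWBS_FLAGS[c])
    return out

print(decode_rwbs('FWS'))  # ['preflush', 'write', 'sync']
print(decode_rwbs('FN'))   # ['preflush', 'none (no data transfer)']
```

So the trace above shows only a flush reaching the device, with no actual sector writes, which matches the question of whether this is "just a FLUSH of some sort".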
On 10/5/23 06:01, Phillip Susi wrote:

> Ok, this is very strange. I went back to my distro kernel, without
> runtime pm, mounted the filesystems rw again, used hdparm -y to suspend
> the disk, verified with hdparm -C that they were in suspend, and
> suspended the system. In dmesg I see:
>
> Filesystems sync: 0.007 seconds
>
> Now, if it were writing the superblocks to the disk there, I would
> expect that to take more like 3 seconds while it woke the disks back up,
> like it did when I was testing the latest kernel with runtime pm.

Hmm... Maybe there was nothing to sync: hdparm -y, which puts the drive in standby mode, should have synced the write cache already, so the FS-issued sync may have ended up not causing any media write, as long as the FS did not issue any new write (which would have spun up the drive). The specs are clear about this. From the STANDBY IMMEDIATE description:

Processing a STANDBY IMMEDIATE command shall cause the device to prepare for a power cycle (e.g., flush volatile write cache) prior to returning command completion.

So all of this does not seem that strange to me.

> Another odd thing I noticed with the runtime pm was that sometimes the
> drives would randomly start up even though I was not accessing them.

Some random access from user space, e.g. systemd doing its periodic "something" with passthrough commands?

> This never happens when I am normally using the debian kernel with no
> runtime pm and just running hdparm -y to put the drives to sleep. I can
> check them hours later and they are still in standby.

Same user space in that case?

> I just tried running sync and blktrace and it looks like it is writing
> the superblock to the drive, and yet, hdparm -C still says it is in
> standby. This makes no sense. Here is what blktrace said when I ran
> sync:
>
> 8,0 0 1 0.000000000 34004 Q FWS [sync]
> 8,0 0 2 0.000001335 34004 G FWS [sync]
> 8,0 0 3 0.000004327 31088 D FN [kworker/0:2H]
> 8,0 0 4 0.000068945 0 C FN 0 [0]
> 8,0 0 5 0.000069466 0 C WS 0 [0]
>
> I just noticed that this trace doesn't show the 0+8 that I saw when I
> was testing running sync with a fresh, empty ext4 filesystem on a loop
> device. That showed 0+8 indicating the first 4k block of the disk, as
> well as 1023+8, and one or two more offsets that I thought were the
> backup superblocks.

Then, as mentioned above, nothing may actually be written, which results in the drive not waking up, since the write cache was already clean (synced before spin down).

> What the heck is this sync actually writing, and why does it not cause
> the disk to take itself out of standby, but with runtime pm, it does?

Not sure, but it may be a write FUA vs. an actual sync. With runtime PM, any command issued to the disk while it is suspended causes a call to runtime resume, which issues a VERIFY command to spin up the drive, regardless of whether the command issued by the user needs the drive to spin up. So that is normal. With hdparm -y, the driver thinks the drive is running and so does not issue that VERIFY command. Whether the drive spins up then depends on the command being issued and the drive state (and also likely on the drive model and FW implementation... some may be more intelligent than others in this area).

> Could this just be a FLUSH of some sort, which when the disk is in
> standby, it ignores, but the kernel runtime pm decides it must wake the
> disk up before dispatching the command, even though it is useless?

Given your description, that is my thinking exactly. The problem here, for the second part (spinning up the disk for "useless" commands), is that determining whether a command needs the drive to spin up is not easy, and is potentially dangerous if mishandled. One possible micro-optimization would be to ignore flush commands to suspended disks. But I am not sure that is a high-win change, beyond *maybe* avoiding a spinup on system suspend with a drive that is already runtime suspended.
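The micro-optimization discussed above, completing a FLUSH without resuming an already-suspended disk whose cache was flushed at suspend time, can be sketched as a toy model. None of this is kernel code; all names are invented for illustration:

```python
# Toy model (NOT kernel code) of the policy discussed above: a FLUSH sent
# to a runtime-suspended disk is completed immediately, without a resume,
# because suspending already flushed the volatile write cache.
from dataclasses import dataclass

@dataclass
class Disk:
    suspended: bool = True   # start runtime suspended
    cache_dirty: bool = False
    spinups: int = 0         # count how often we had to wake the drive

    def resume(self):
        if self.suspended:
            self.suspended = False
            self.spinups += 1

    def submit(self, cmd):
        # The optimization: a flush to a suspended disk with a clean cache
        # is a no-op, so complete it without waking the drive.
        if cmd == 'FLUSH' and self.suspended and not self.cache_dirty:
            return 'completed-without-resume'
        self.resume()        # anything else needs the device awake
        if cmd == 'WRITE':
            self.cache_dirty = True
        if cmd == 'FLUSH':
            self.cache_dirty = False
        return 'completed'

d = Disk()
print(d.submit('FLUSH'), d.spinups)  # completed-without-resume 0
print(d.submit('WRITE'), d.spinups)  # completed 1
```

The hard part, as noted above, is that classifying real ATA/SCSI commands as "needs media" vs. "safe to complete while suspended" is much less clear-cut than this two-command model suggests.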
Damien Le Moal <dlemoal@kernel.org> writes:

>> This never happens when I am normally using the debian kernel with no
>> runtime pm and just running hdparm -y to put the drives to sleep. I can
>> check them hours later and they are still in standby.
>
> Same user space in that case ?

Yes. I'll try to leave a blktrace running to see what causes the spinup. I suppose it could be another flush or other command that doesn't require media access, but triggers runtime PM to spin up the disk.

> Given your description, that is my thinking exactly. The problem here
> for the second part (spinning up the disk for "useless" commands) is
> that determining if a command needs the drive to spinup or not is not an
> easy thing to do, and potentially dangerous if mishandled. One possible
> micro optimization would be to ignore flush commands to suspended disks.
> But not sure that is a high win change beside *may be* avoiding a spinup
> on system suspend with a drive already runtime suspended.

One of the things my patch series from a decade ago did was to use the SLEEP flag in libata to decide to complete certain commands without sending them to the drive, so it could remain asleep. I'm not sure it's even possible for the driver to evaluate the command before the PM core orders a resume, though. I wonder if libata could leave the EH pending and return success from the runtime resume, and then actually run the EH and wake up the drive later, when actual IO is done.

On another note, I've been looking over your patches, and I still do not understand why you added the VERIFY command. The only effect it seems to have is moving the delay, while the drive spins up, from the first real IO to the resume path. Why does that matter?