Message ID | 172983774826.3041899.15350842942789677656.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/1] xfs_scrub_all: wait for services to start activating | expand |
On Thu, Oct 24, 2024 at 11:38:27PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > It seems that the function call to start a systemd unit completes > asynchronously from any change in that unit's active state. On a > lightly loaded system, a Start() call followed by an ActiveState() > call actually sees the change in state from inactive to activating. > > Unfortunately, on a heavily loaded system, the state change may take a > few seconds. If this is the case, the wait() call can see that the unit > state is "inactive", decide that the service already finished, and exit > early, when in reality it hasn't even gotten to 'activating'. > > Fix this by adding a second method that watches either for the inactive > -> activating state transition or for the last exit from inactivation > timestamp to change before waiting for the unit to reach inactive state. > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > Cc: <linux-xfs@vger.kernel.org> # v6.10.0 What is this supposed to mean? The patch itself looks good: Reviewed-by: Christoph Hellwig <hch@lst.de>
On Mon, Oct 28, 2024 at 01:44:23AM -0700, Christoph Hellwig wrote: > On Thu, Oct 24, 2024 at 11:38:27PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > It seems that the function call to start a systemd unit completes > > asynchronously from any change in that unit's active state. On a > > lightly loaded system, a Start() call followed by an ActiveState() > > call actually sees the change in state from inactive to activating. > > > > Unfortunately, on a heavily loaded system, the state change may take a > > few seconds. If this is the case, the wait() call can see that the unit > > state is "inactive", decide that the service already finished, and exit > > early, when in reality it hasn't even gotten to 'activating'. > > > > Fix this by adding a second method that watches either for the inactive > > -> activating state transition or for the last exit from inactivation > > timestamp to change before waiting for the unit to reach inactive state. > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org> > > Cc: <linux-xfs@vger.kernel.org> # v6.10.0 > > What is this supposed to mean? It means that if anyone is supporting xfsprogs 6.10, this patch applies to it. > The patch itself looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thanks! --D
diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in index 5e2e0446a99f89..fe4bca4b2edb11 100644 --- a/scrub/xfs_scrub_all.in +++ b/scrub/xfs_scrub_all.in @@ -249,6 +249,54 @@ class scrub_service(scrub_control): print(e, file = sys.stderr) return 'failed' + def last_activation(self): + '''Retrieve the last activation time, in microseconds since + boot.''' + global debug + + l = lambda: self.prop.Get('org.freedesktop.systemd1.Unit', + 'InactiveExitTimestampMonotonic') + try: + return self.__dbusrun(l) + except Exception as e: + if debug: + print(e, file = sys.stderr) + return 0 + + def wait_for_startup(self, last_active, wait_for = 30, interval = 0.5): + '''Wait for the service to start up. This is defined as + exiting the inactive state.''' + + for i in range(0, int(wait_for / interval)): + s = self.state() + if debug: + print('waiting for activation %s %s' % (self.unitname, s)) + if s == 'failed': + return 1 + if s != 'inactive': + return 0 + # If the unit is inactive but the last activation time + # doesn't match, then the service ran so quickly that + # it's already gone. + if last_active != self.last_activation(): + return 0 + time.sleep(interval) + + s = self.state() + if debug: + print('waited for startup %s %s' % (self.unitname, s)) + if s == 'failed': + return 1 + if s != 'inactive': + return 0 + + # If the unit is inactive but the last activation time doesn't + # match, then the service ran so quickly that it's already + # gone. + if last_active != self.last_activation(): + return 0 + return 2 + def wait(self, interval = 1): '''Wait until the service finishes.''' global debug @@ -278,7 +326,11 @@ class scrub_service(scrub_control): print('starting %s' % self.unitname) try: + last_active = self.last_activation() self.__dbusrun(lambda: self.unit.Start('replace')) + ret = self.wait_for_startup(last_active) + if ret > 0: + return ret return self.wait() except dbus.exceptions.DBusException as e: # If the unit was masked, the sysadmin doesn't want us