diff mbox series

[1/1] xfs_scrub_all: wait for services to start activating

Message ID 172983774826.3041899.15350842942789677656.stgit@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series [1/1] xfs_scrub_all: wait for services to start activating | expand

Commit Message

Darrick J. Wong Oct. 25, 2024, 6:38 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

It seems that the function call to start a systemd unit completes
asynchronously from any change in that unit's active state.  On a
lightly loaded system, a Start() call followed by an ActiveState()
call actually sees the change in state from inactive to activating.

Unfortunately, on a heavily loaded system, the state change may take a
few seconds.  If this is the case, the wait() call can see that the unit
state is "inactive", decide that the service already finished, and exit
early, when in reality it hasn't even gotten to 'activating'.

Fix this by adding a second method that watches either for the inactive
-> activating state transition or for the last exit from inactivation
timestamp to change before waiting for the unit to reach inactive state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Cc: <linux-xfs@vger.kernel.org> # v6.10.0
Fixes: 6d831e770359ff ("xfs_scrub_all: convert systemctl calls to dbus")
---
 scrub/xfs_scrub_all.in |   52 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

Comments

Christoph Hellwig Oct. 28, 2024, 8:44 a.m. UTC | #1
On Thu, Oct 24, 2024 at 11:38:27PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> It seems that the function call to start a systemd unit completes
> asynchronously from any change in that unit's active state.  On a
> lightly loaded system, a Start() call followed by an ActiveState()
> call actually sees the change in state from inactive to activating.
> 
> Unfortunately, on a heavily loaded system, the state change may take a
> few seconds.  If this is the case, the wait() call can see that the unit
> state is "inactive", decide that the service already finished, and exit
> early, when in reality it hasn't even gotten to 'activating'.
> 
> Fix this by adding a second method that watches either for the inactive
> -> activating state transition or for the last exit from inactivation
> timestamp to change before waiting for the unit to reach inactive state.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Cc: <linux-xfs@vger.kernel.org> # v6.10.0

What is this supposed to mean? 

The patch itself looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong Oct. 28, 2024, 4:45 p.m. UTC | #2
On Mon, Oct 28, 2024 at 01:44:23AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 11:38:27PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > It seems that the function call to start a systemd unit completes
> > asynchronously from any change in that unit's active state.  On a
> > lightly loaded system, a Start() call followed by an ActiveState()
> > call actually sees the change in state from inactive to activating.
> > 
> > Unfortunately, on a heavily loaded system, the state change may take a
> > few seconds.  If this is the case, the wait() call can see that the unit
> > state is "inactive", decide that the service already finished, and exit
> > early, when in reality it hasn't even gotten to 'activating'.
> > 
> > Fix this by adding a second method that watches either for the inactive
> > -> activating state transition or for the last exit from inactivation
> > timestamp to change before waiting for the unit to reach inactive state.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Cc: <linux-xfs@vger.kernel.org> # v6.10.0
> 
> What is this supposed to mean? 

It means that if anyone is supporting xfsprogs 6.10, this patch applies
to it.

> The patch itself looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

--D
diff mbox series

Patch

diff --git a/scrub/xfs_scrub_all.in b/scrub/xfs_scrub_all.in
index 5e2e0446a99f89..fe4bca4b2edb11 100644
--- a/scrub/xfs_scrub_all.in
+++ b/scrub/xfs_scrub_all.in
@@ -249,6 +249,54 @@  class scrub_service(scrub_control):
 				print(e, file = sys.stderr)
 			return 'failed'
 
+	def last_activation(self):
+		'''Retrieve the last activation time, in microseconds since
+		boot.'''
+		global debug
+
+		l = lambda: self.prop.Get('org.freedesktop.systemd1.Unit',
+				'InactiveExitTimestampMonotonic')
+		try:
+			return self.__dbusrun(l)
+		except Exception as e:
+			if debug:
+				print(e, file = sys.stderr)
+			return 0
+
+	def wait_for_startup(self, last_active, wait_for = 30, interval = 0.5):
+		'''Wait for the service to start up.  This is defined as
+		exiting the inactive state.'''
+
+		for i in range(0, int(wait_for / interval)):
+			s = self.state()
+			if debug:
+				print('waiting for activation %s %s' % (self.unitname, s))
+			if s == 'failed':
+				return 1
+			if s != 'inactive':
+				return 0
+			# If the unit is inactive but the last activation time
+			# doesn't match, then the service ran so quickly that
+			# it's already gone.
+			if last_active != self.last_activation():
+				return 0
+			time.sleep(interval)
+
+		s = self.state()
+		if debug:
+			print('waited for startup %s %s' % (self.unitname, s))
+		if s == 'failed':
+			return 1
+		if s != 'inactive':
+			return 0
+
+		# If the unit is inactive but the last activation time doesn't
+		# match, then the service ran so quickly that it's already
+		# gone.
+		if last_active != self.last_activation():
+			return 0
+		return 2
+
 	def wait(self, interval = 1):
 		'''Wait until the service finishes.'''
 		global debug
@@ -278,7 +326,11 @@  class scrub_service(scrub_control):
 			print('starting %s' % self.unitname)
 
 		try:
+			last_active = self.last_activation()
 			self.__dbusrun(lambda: self.unit.Start('replace'))
+			ret = self.wait_for_startup(last_active)
+			if ret > 0:
+				return ret
 			return self.wait()
 		except dbus.exceptions.DBusException as e:
 			# If the unit was masked, the sysadmin doesn't want us