[RFC] ASoC: core: Optimise suspend/resume of DAPM widgets

Message ID 1533301025-19980-1-git-send-email-jonathanh@nvidia.com (mailing list archive)
State New, archived
Series [RFC] ASoC: core: Optimise suspend/resume of DAPM widgets

Commit Message

Jon Hunter Aug. 3, 2018, 12:57 p.m. UTC
For soundcards that have several DAI links and many DAPM widgets the
time taken for snd_soc_suspend to execute has been observed to be
several milliseconds. The time is largely spent executing
dapm_power_widgets() for each of the DAI links that need to be
suspended. Given that dapm_power_widgets() is called again after
suspending/resuming the DAI links, one way to optimise the
suspend/resume time is to avoid calling dapm_power_widgets() for
each DAI link, which reduces the suspend time significantly.

Please note that this has been observed on the Tegra210 Jetson TX1
platform which is not currently supported in the mainline for audio
but has been tested with out-of-tree patches to enable I2S audio.
For Tegra210 the audio drivers (for I2S, etc) are implemented as
component drivers to offer greater flexibility with dynamic audio
routing within the Tegra Audio Processing Engine (APE) from userspace.
The downside of this flexibility is that now there are a lot more DAI
links and so the time to suspend the soundcard can take several
milliseconds.

In the resume path, it is not clear if there could be any issues from
not sync'ing the DAPM power state until after unmuting and resuming
the CPU DAI drivers, because with this change the sync happens later.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
---

It is not 100% clear to me if there are any downsides to this approach
but I am looking to find a way to avoid calling dapm_power_widgets() so
many times in the suspend path when there are many DAI links.
Unfortunately, I don't think that I can simply mark some of the DAI
links with 'ignore_suspend' because it could be possible that they are
active.

 include/sound/soc-dapm.h |  2 ++
 sound/soc/soc-core.c     | 18 ++++--------------
 sound/soc/soc-dapm.c     | 43 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 46 insertions(+), 17 deletions(-)

Comments

Mark Brown Aug. 3, 2018, 4:36 p.m. UTC | #1
On Fri, Aug 03, 2018 at 01:57:05PM +0100, Jon Hunter wrote:

> For soundcards that have several DAI links and many DAPM widgets the
> time taken for snd_soc_suspend to execute has been observed to be
> several milliseconds. The time is largely spent executing
> dapm_power_widgets() for each of the DAI links that need to be
> suspended. Given that dapm_power_widgets() is called again after
> suspending/resuming the DAI links, one way to optimise the
> suspend/resume time is to avoid calling dapm_power_widgets() for
> each DAI link, which reduces the suspend time significantly.

It's a bit alarming that dapm_power_widgets() is taking substantial
enough time to be worth worrying about - it's *supposed* to be relatively
well optimized so it's effectively free.  It'll never be quite free, but
close enough.  The goal is that if nothing changes we end up testing a
few nodes at most before we figure out that nothing changed state and
stop.  Do you have any idea what it's spending its time on?  We do call
it a lot, so if there are any optimization opportunities there we can
probably get a lot of benefit out of them.

One thing that it occurs to me might help is to start the suspend
process by powering down all the input and output widgets that aren't
ignoring suspend in one operation, that should hopefully have the effect
of ensuring that most of the system that's going to get powered down
does so on the first pass through.
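
Something along these lines, perhaps (a completely untested sketch just
to illustrate the idea - it assumes the existing w->ignore_suspend flag
and the input/output widget ids, and would have to live in soc-dapm.c
since dapm_power_widgets() is static there):

static void dapm_suspend_endpoints(struct snd_soc_card *card)
{
	struct snd_soc_dapm_widget *w;

	mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_RUNTIME);

	/* Disable every input/output widget that isn't ignoring suspend */
	list_for_each_entry(w, &card->widgets, list) {
		if (w->ignore_suspend)
			continue;
		if (w->id == snd_soc_dapm_input ||
		    w->id == snd_soc_dapm_output)
			snd_soc_dapm_disable_pin_unlocked(w->dapm, w->name);
	}

	/* ... then one power pass for the whole card instead of one per
	 * DAI link (the pins would need re-enabling again on resume)
	 */
	dapm_power_widgets(card, SND_SOC_DAPM_STREAM_SUSPEND);

	mutex_unlock(&card->dapm_mutex);
}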

> Please note that this has been observed on the Tegra210 Jetson TX1
> platform which is not currently supported in the mainline for audio
> but has been tested with out-of-tree patches to enable I2S audio.

If someone could get the platform booting reliably in -next that'd be
good but that's a separate issue...

> In the resume path, it is not clear if there could be any issues from
> not sync'ing the DAPM power state until after unmuting and resuming
> the CPU DAI drivers, because with this change the sync happens later.

This is a definite problem, we want to have the audio path powered up
before we start playing audio otherwise we lose the start of the audio
which may be important.
Jon Hunter Aug. 13, 2018, 6:19 p.m. UTC | #2
On 03/08/18 17:36, Mark Brown wrote:
> On Fri, Aug 03, 2018 at 01:57:05PM +0100, Jon Hunter wrote:
> 
>> For soundcards that have several DAI links and many DAPM widgets the
>> time taken for snd_soc_suspend to execute has been observed to be
>> several milliseconds. The time is largely spent executing
>> dapm_power_widgets() for each of the DAI links that need to be
>> suspended. Given that dapm_power_widgets() is called again after
>> suspending/resuming the DAI links, one way to optimise the
>> suspend/resume time is to avoid calling dapm_power_widgets() for
>> each DAI link, which reduces the suspend time significantly.
> 
> It's a bit alarming that dapm_power_widgets() is taking substantial
> enough time to be worth worrying about - it's *supposed* to be relatively
> well optimized so it's effectively free.  It'll never be quite free, but
> close enough.  The goal is that if nothing changes we end up testing a
> few nodes at most before we figure out that nothing changed state and
> stop.  Do you have any idea what it's spending its time on?  We do call
> it a lot, so if there are any optimization opportunities there we can
> probably get a lot of benefit out of them.

Sorry for the delay, I was out last week.

I had taken some ftrace graphs but there was not one thing that really
stood out. Looking again it seems that each call to
async_schedule_domain() can take tens of us and so if there are a lot of
DAPM widgets (100+) this can take some time. I will dig into it a bit
further.

> One thing that it occurs to me might help is to start the suspend
> process by powering down all the input and output widgets that aren't
> ignoring suspend in one operation, that should hopefully have the effect
> of ensuring that most of the system that's going to get powered down
> does so on the first pass through.

If the async_schedule_domain() calls are the problem, then it may not
help as we call that for all dapms apart from the card->dapm.

>> Please note that this has been observed on the Tegra210 Jetson TX1
>> platform which is not currently supported in the mainline for audio
>> but has been tested with out-of-tree patches to enable I2S audio.
> 
> If someone could get the platform booting reliably in -next that'd be
> good but that's a separate issue...

Yes I plan to work on that in the next few months.

>> In the resume path, it is not clear if there could be any issues from
>> not sync'ing the DAPM power state until after unmuting and resuming
>> the CPU DAI drivers, because with this change the sync happens later.
> 
> This is a definite problem, we want to have the audio path powered up
> before we start playing audio otherwise we lose the start of the audio
> which may be important.

I was thinking I could add another call to snd_soc_dapm_sync() after
resuming the streams to address that. However, maybe I need to dig into
this a bit more to understand exactly why dapm_power_widgets() takes so
long.

Cheers
Jon
Mark Brown Aug. 14, 2018, 2:40 p.m. UTC | #3
On Mon, Aug 13, 2018 at 07:19:16PM +0100, Jon Hunter wrote:
> 
> I had taken some ftrace graphs but there was not one thing that really
> stood out. Looking again it seems that each call to
> async_schedule_domain() can take tens of us and so if there are a lot of
> DAPM widgets (100+) this can take some time. I will dig into it a bit
> further.

We don't call that per widget, we call that per context.  We might have
a lot of contexts but it's definitely not at the number-of-widgets level.
Perhaps what we need there is something which remembers if we actually
changed anything in each context and/or whether there are any relevant
operations, and then doesn't do this if not; we're working on the basis
that these operations are fairly cheap.

> > One thing that it occurs to me might help is to start the suspend
> > process by powering down all the input and output widgets that aren't
> > ignoring suspend in one operation, that should hopefully have the effect
> > of ensuring that most of the system that's going to get powered down
> > does so on the first pass through.

> If the async_schedule_domain() calls are the problem, then it may not
> help as we call that for all dapms apart from the card->dapm.

No, we actually need to run the operations on all contexts - they can
all do things if they want to.  We just want them all to run in parallel
so if we are doing things like waiting for capacitors to charge we don't
do it in series.
Jon Hunter Aug. 15, 2018, 6:48 p.m. UTC | #4
On 14/08/18 15:40, Mark Brown wrote:
> On Mon, Aug 13, 2018 at 07:19:16PM +0100, Jon Hunter wrote:
>>
>> I had taken some ftrace graphs but there was not one thing that really
>> stood out. Looking again it seems that each call to
>> async_schedule_domain() can take tens of us and so if there are a lot of
>> DAPM widgets (100+) this can take some time. I will dig into it a bit
>> further.
> 
> We don't call that per widget, we call that per context.  We might have
> a lot of contexts but it's definitely not at the number-of-widgets level.
> Perhaps what we need there is something which remembers if we actually
> changed anything in each context and/or whether there are any relevant
> operations, and then doesn't do this if not; we're working on the basis
> that these operations are fairly cheap.

Yes I see that now. 
 
>>> One thing that it occurs to me might help is to start the suspend
>>> process by powering down all the input and output widgets that aren't
>>> ignoring suspend in one operation, that should hopefully have the effect
>>> of ensuring that most of the system that's going to get powered down
>>> does so on the first pass through.
> 
>> If the async_schedule_domain() calls are the problem, then it may not
>> help as we call that for all dapms apart from the card->dapm.
> 
> No, we actually need to run the operations on all contexts - they can
> all do things if they want to.  We just want them all to run in parallel
> so if we are doing things like waiting for capacitors to charge we don't
> do it in series.

So I did a bit more digging, this time on Tegra124 Jetson-TK1 (which
is supported in the mainline). On Jetson-TK1 I see that
dapm_power_widgets() is called 4 times on entering suspend and each call
then calls the dapm_pre/post_sequence_async() functions 3 times.
Interestingly, at least for Jetson-TK1, it appears that the calls to
dapm_pre/post_sequence_async() do not actually change the bias state
because the dapm contexts are already in the desired state. Please
note that at this time audio is not active and so that is probably why.

Anyway, I was interested in the overhead added by calling
async_schedule_domain() to schedule the dapm_pre/post_sequence_async()
function calls if there is nothing to be done. This is what I see ...

Times for function dapm_power_widgets() are (us):
	Min 23, Ave 190, Max 434, Count 39

Where 'Count' is the number of times that dapm_power_widgets() has
been called (both during boot and through a few suspend iterations).

So then I made the following change to avoid scheduling the function
calls unnecessarily (which I think should be fine) ...

diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c
index 5ed90ea69a24..ff66a6834531 100644
--- a/sound/soc/soc-dapm.c
+++ b/sound/soc/soc-dapm.c
@@ -1955,7 +1955,7 @@ static int dapm_power_widgets(struct snd_soc_card *card, int event)
        dapm_pre_sequence_async(&card->dapm, 0);
        /* Run other bias changes in parallel */
        list_for_each_entry(d, &card->dapm_list, list) {
-               if (d != &card->dapm)
+               if (d != &card->dapm && d->bias_level != d->target_bias_level)
                        async_schedule_domain(dapm_pre_sequence_async, d,
                                                &async_domain);
        }
@@ -1979,7 +1979,7 @@ static int dapm_power_widgets(struct snd_soc_card *card, int event)
 
        /* Run all the bias changes in parallel */
        list_for_each_entry(d, &card->dapm_list, list) {
-               if (d != &card->dapm)
+               if (d != &card->dapm && d->bias_level != d->target_bias_level)
                        async_schedule_domain(dapm_post_sequence_async, d,
                                                &async_domain);
        }

And with the above change I now see ...

Times for function dapm_power_widgets() are (us):
	Min 4, Ave 16, Max 82, Count 39

Please note that I am just using ktime_get() to log the time on entry
and exit from dapm_power_widgets() and so yes the time may not be
purely the time taken to execute this function if we are preempted,
but I am measuring in the same way in both cases and so it does seem
to show some improvement.
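
For reference, the instrumentation is roughly equivalent to the sketch
below (simplified, not part of the patch; the real code just records the
two ktime_get() values):

/* Wrap the call in two ktime_get() readings and log the delta in us.
 * As noted above, any preemption between the two readings is included
 * in the measurement. Needs <linux/ktime.h>.
 */
static int dapm_power_widgets_timed(struct snd_soc_card *card, int event)
{
	ktime_t start = ktime_get();
	int ret;

	ret = dapm_power_widgets(card, event);

	pr_info("dapm_power_widgets: %lld us\n",
		ktime_to_us(ktime_sub(ktime_get(), start)));

	return ret;
}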

Let me know your thoughts.

Cheers
Jon
Mark Brown Aug. 16, 2018, 10:51 a.m. UTC | #5
On Wed, Aug 15, 2018 at 07:48:47PM +0100, Jon Hunter wrote:

> So then I made the following change to avoid scheduling the function
> calls unnecessarily (which I think should be fine) ...

> Please note that I am just using ktime_get() to log the time on entry
> and exit from dapm_power_widgets() and so yes the time may not be
> purely the time taken to execute this function if we are preempted,
> but I am measuring in the same way in both cases and so it does seem
> to show some improvement.

That seems like a simple and useful optimization, will you send a patch?
Jon Hunter Aug. 16, 2018, 3:25 p.m. UTC | #6
On 16/08/18 11:51, Mark Brown wrote:
> On Wed, Aug 15, 2018 at 07:48:47PM +0100, Jon Hunter wrote:
> 
>> So then I made the following change to avoid scheduling the function
>> calls unnecessarily (which I think should be fine) ...
> 
>> Please note that I am just using ktime_get() to log the time on entry
>> and exit from dapm_power_widgets() and so yes the time may not be
>> purely the time taken to execute this function if we are preempted,
>> but I am measuring in the same way in both cases and so it does seem
>> to show some improvement.
> 
> That seems like a simple and useful optimization, will you send a patch?

Yes I will send a patch for this.

Cheers
Jon

Patch

diff --git a/include/sound/soc-dapm.h b/include/sound/soc-dapm.h
index af9ef16cc34d..5dbd601154c4 100644
--- a/include/sound/soc-dapm.h
+++ b/include/sound/soc-dapm.h
@@ -427,6 +427,8 @@  void snd_soc_dapm_reset_cache(struct snd_soc_dapm_context *dapm);
 /* dapm events */
 void snd_soc_dapm_stream_event(struct snd_soc_pcm_runtime *rtd, int stream,
 	int event);
+void snd_soc_dapm_stream_suspend(struct snd_soc_pcm_runtime *rtd, int stream);
+void snd_soc_dapm_stream_resume(struct snd_soc_pcm_runtime *rtd, int stream);
 void snd_soc_dapm_shutdown(struct snd_soc_card *card);
 
 /* external DAPM widget events */
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 81b27923303d..03c5fa95afb8 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -519,13 +519,8 @@  int snd_soc_suspend(struct device *dev)
 		if (rtd->dai_link->ignore_suspend)
 			continue;
 
-		snd_soc_dapm_stream_event(rtd,
-					  SNDRV_PCM_STREAM_PLAYBACK,
-					  SND_SOC_DAPM_STREAM_SUSPEND);
-
-		snd_soc_dapm_stream_event(rtd,
-					  SNDRV_PCM_STREAM_CAPTURE,
-					  SND_SOC_DAPM_STREAM_SUSPEND);
+		snd_soc_dapm_stream_suspend(rtd, SNDRV_PCM_STREAM_PLAYBACK);
+		snd_soc_dapm_stream_suspend(rtd, SNDRV_PCM_STREAM_CAPTURE);
 	}
 
 	/* Recheck all endpoints too, their state is affected by suspend */
@@ -637,13 +632,8 @@  static void soc_resume_deferred(struct work_struct *work)
 		if (rtd->dai_link->ignore_suspend)
 			continue;
 
-		snd_soc_dapm_stream_event(rtd,
-					  SNDRV_PCM_STREAM_PLAYBACK,
-					  SND_SOC_DAPM_STREAM_RESUME);
-
-		snd_soc_dapm_stream_event(rtd,
-					  SNDRV_PCM_STREAM_CAPTURE,
-					  SND_SOC_DAPM_STREAM_RESUME);
+		snd_soc_dapm_stream_resume(rtd, SNDRV_PCM_STREAM_PLAYBACK);
+		snd_soc_dapm_stream_resume(rtd, SNDRV_PCM_STREAM_CAPTURE);
 	}
 
 	/* unmute any active DACs */
diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c
index 7e96793050c9..7d4f75d29d55 100644
--- a/sound/soc/soc-dapm.c
+++ b/sound/soc/soc-dapm.c
@@ -4201,7 +4201,7 @@  void snd_soc_dapm_connect_dai_link_widgets(struct snd_soc_card *card)
 }
 
 static void soc_dapm_stream_event(struct snd_soc_pcm_runtime *rtd, int stream,
-	int event)
+	int event, bool power_widgets)
 {
 	int i;
 
@@ -4209,7 +4209,8 @@  static void soc_dapm_stream_event(struct snd_soc_pcm_runtime *rtd, int stream,
 	for (i = 0; i < rtd->num_codecs; i++)
 		soc_dapm_dai_stream_event(rtd->codec_dais[i], stream, event);
 
-	dapm_power_widgets(rtd->card, event);
+	if (power_widgets)
+		dapm_power_widgets(rtd->card, event);
 }
 
 /**
@@ -4229,7 +4230,43 @@  void snd_soc_dapm_stream_event(struct snd_soc_pcm_runtime *rtd, int stream,
 	struct snd_soc_card *card = rtd->card;
 
 	mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_RUNTIME);
-	soc_dapm_stream_event(rtd, stream, event);
+	soc_dapm_stream_event(rtd, stream, event, true);
+	mutex_unlock(&card->dapm_mutex);
+}
+
+/**
+ * snd_soc_dapm_stream_suspend - send a suspend stream event to the dapm core
+ * @rtd: PCM runtime data
+ * @stream: stream name
+ *
+ * Sends a suspend stream event to the dapm core.
+ *
+ * Returns 0 for success else error.
+ */
+void snd_soc_dapm_stream_suspend(struct snd_soc_pcm_runtime *rtd, int stream)
+{
+	struct snd_soc_card *card = rtd->card;
+
+	mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_RUNTIME);
+	soc_dapm_stream_event(rtd, stream, SND_SOC_DAPM_STREAM_SUSPEND, false);
+	mutex_unlock(&card->dapm_mutex);
+}
+
+/**
+ * snd_soc_dapm_stream_resume - send a resume stream event to the dapm core
+ * @rtd: PCM runtime data
+ * @stream: stream name
+ *
+ * Sends a resume stream event to the dapm core.
+ *
+ * Returns 0 for success else error.
+ */
+void snd_soc_dapm_stream_resume(struct snd_soc_pcm_runtime *rtd, int stream)
+{
+	struct snd_soc_card *card = rtd->card;
+
+	mutex_lock_nested(&card->dapm_mutex, SND_SOC_DAPM_CLASS_RUNTIME);
+	soc_dapm_stream_event(rtd, stream, SND_SOC_DAPM_STREAM_RESUME, false);
 	mutex_unlock(&card->dapm_mutex);
 }