diff mbox series

ASoC: soc-core: Fix null pointer dereference in soc_find_component

Message ID 1547194442-1487-1-git-send-email-rohitkr@codeaurora.org (mailing list archive)
State New, archived
Headers show
Series ASoC: soc-core: Fix null pointer dereference in soc_find_component | expand

Commit Message

Rohit Kumar Jan. 11, 2019, 8:14 a.m. UTC
From: Ajit Pandey <ajitp@codeaurora.org>

soc_find_component() may lead to null pointer exception if both
arguments i.e of_node and name is NULL. Add NULL check before
calling soc_find_component(). Also fix some typos.

Fixes: 8780cf1142a5 ("ASoC: soc-core: defer card probe until all component is added to list")
Reported-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Ajit Pandey <ajitp@codeaurora.org>
Signed-off-by: Rohit kumar <rohitkr@codeaurora.org>
---
 sound/soc/soc-core.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Comments

Pierre-Louis Bossart Jan. 11, 2019, 3:44 p.m. UTC | #1
On 1/11/19 2:14 AM, Rohit kumar wrote:
> From: Ajit Pandey <ajitp@codeaurora.org>
>
> soc_find_component() may lead to null pointer exception if both
> arguments i.e of_node and name is NULL. Add NULL check before
> calling soc_find_component(). Also fix some typos.

Thanks for the overnight fix. This update fixes the issue on my Skylake 
XPS13 test device (blind testing since I don't understand what the code 
does).

Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

>
> Fixes: 8780cf1142a5 ("ASoC: soc-core: defer card probe until all component is added to list")
> Reported-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> Signed-off-by: Ajit Pandey <ajitp@codeaurora.org>
> Signed-off-by: Rohit kumar <rohitkr@codeaurora.org>
> ---
>   sound/soc/soc-core.c | 15 +++++++++------
>   1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index 0934b36..df05fb8 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -1131,11 +1131,13 @@ static int soc_init_dai_link(struct snd_soc_card *card,
>   	}
>   
>   	/*
> -	 * Defer card registartion if platform dai component is not added to
> +	 * Defer card registration if platform dai component is not added to
>   	 * component list.
>   	 */
> -	if (!soc_find_component(link->platform->of_node, link->platform->name))
> -		return -EPROBE_DEFER;
> +	if (link->platform->of_node || link->platform->name)
> +		if (!soc_find_component(link->platform->of_node,
> +					link->platform->name))
> +			return -EPROBE_DEFER;
>   
>   	/*
>   	 * CPU device may be specified by either name or OF node, but
> @@ -1150,11 +1152,12 @@ static int soc_init_dai_link(struct snd_soc_card *card,
>   	}
>   
>   	/*
> -	 * Defer card registartion if cpu dai component is not added to
> +	 * Defer card registration if cpu dai component is not added to
>   	 * component list.
>   	 */
> -	if (!soc_find_component(link->cpu_of_node, link->cpu_name))
> -		return -EPROBE_DEFER;
> +	if (link->cpu_of_node || link->cpu_name)
> +		if (!soc_find_component(link->cpu_of_node, link->cpu_name))
> +			return -EPROBE_DEFER;
>   
>   	/*
>   	 * At least one of CPU DAI name or CPU device name/node must be
Pierre-Louis Bossart Jan. 11, 2019, 9:49 p.m. UTC | #2
> Thanks for the overnight fix. This update fixes the issue on my 
> Skylake XPS13 test device (blind testing since I don't understand what 
> the code does).
>
> Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>

I need to take this back, this set of changes (initial+fix) causes an 
error with our HDMI support

[   17.437684] sof-audio sof-audio: created machine bxt-pcm512x
[   17.585279] bxt-pcm512x bxt-pcm512x: ASoC: failed to init link iDisp1
[   17.585639] bxt-pcm512x bxt-pcm512x: snd_soc_register_card failed -517

Removing your changes restores the functionality

Adding some traces I can see that the the platform name we use doesn't 
seem compatible with your logic. All the Intel boards used a constant 
platform name matching the PCI ID, see e.g. [1], which IIRC is used to 
bind components. Liam, do you recall in more details if this is really 
required?

[1] 
https://elixir.bootlin.com/linux/latest/source/sound/soc/intel/boards/bxt_da7219_max98357a.c#L475

[   18.205812] plb: platform name sof-audio
[   18.206059] plb: cpu_name (null)
[   18.206234] plb: platform name 0000:00:0e.0
[   18.206459] plb: returning -EPROBE_DEFER 1
[   18.206686] bxt-pcm512x bxt-pcm512x: ASoC: failed to init link iDisp1
[   18.207054] bxt-pcm512x bxt-pcm512x: snd_soc_register_card failed -517

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index cbafbdd02483..ae731212f82b 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -1133,11 +1133,15 @@ static int soc_init_dai_link(struct snd_soc_card 
*card,
          * Defer card registration if platform dai component is not 
added to
          * component list.
          */
+       pr_err("plb: platform name %s\n", link->platform->name);
         if (link->platform->of_node || link->platform->name)
                 if (!soc_find_component(link->platform->of_node,
link->platform->name))
-                       return -EPROBE_DEFER;
+                 {
+                         pr_err("plb: returning -EPROBE_DEFER 1\n");
+                         return -EPROBE_DEFER;

+                 }
         /*
          * CPU device may be specified by either name or OF node, but
          * can be left unspecified, and will be matched based on DAI
@@ -1154,9 +1158,14 @@ static int soc_init_dai_link(struct snd_soc_card 
*card,
          * Defer card registration if cpu dai component is not added to
          * component list.
          */
+       pr_err("plb: cpu_name %s\n", link->cpu_name);
         if (link->cpu_of_node || link->cpu_name)
                 if (!soc_find_component(link->cpu_of_node, link->cpu_name))
-                       return -EPROBE_DEFER;
+                 {
+                         pr_err("plb: returning -EPROBE_DEFER 2\n");
+                         return -EPROBE_DEFER;
+
+                 }

         /*
          * At least one of CPU DAI name or CPU device name/node must be
Rohit Kumar Jan. 12, 2019, 6:07 a.m. UTC | #3
On 1/12/2019 3:19 AM, Pierre-Louis Bossart wrote:
>
>> Thanks for the overnight fix. This update fixes the issue on my 
>> Skylake XPS13 test device (blind testing since I don't understand 
>> what the code does).
>>
>> Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
>
> I need to take this back, this set of changes (initial+fix) causes an 
> error with our HDMI support
>
> [   17.437684] sof-audio sof-audio: created machine bxt-pcm512x
> [   17.585279] bxt-pcm512x bxt-pcm512x: ASoC: failed to init link iDisp1
> [   17.585639] bxt-pcm512x bxt-pcm512x: snd_soc_register_card failed -517
>
> Removing your changes restores the functionality
>
Looks like we should revert generic implementation for defering probe 
and move to call from machine driver as done in v1.
https://lore.kernel.org/patchwork/patch/1027560/
https://lore.kernel.org/patchwork/patch/1027561/

@Mark, Do you have suggestion to refine current patch?
> Adding some traces I can see that the the platform name we use doesn't 
> seem compatible with your logic. All the Intel boards used a constant 
> platform name matching the PCI ID, see e.g. [1], which IIRC is used to 
> bind components. Liam, do you recall in more details if this is really 
> required?
>
> [1] 
> https://elixir.bootlin.com/linux/latest/source/sound/soc/intel/boards/bxt_da7219_max98357a.c#L475
I think if it does not set platform name as 0000:00:0e.0, it should take 
snd-soc-dummy as platform name
and that might fix the issue.
snd-soc-dummy does have component associated with it.
https://elixir.bootlin.com/linux/latest/source/sound/soc/soc-utils.c#L281
>
> [   18.205812] plb: platform name sof-audio
> [   18.206059] plb: cpu_name (null)
> [   18.206234] plb: platform name 0000:00:0e.0
> [   18.206459] plb: returning -EPROBE_DEFER 1
> [   18.206686] bxt-pcm512x bxt-pcm512x: ASoC: failed to init link iDisp1
> [   18.207054] bxt-pcm512x bxt-pcm512x: snd_soc_register_card failed -517
>
> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index cbafbdd02483..ae731212f82b 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -1133,11 +1133,15 @@ static int soc_init_dai_link(struct 
> snd_soc_card *card,
>          * Defer card registration if platform dai component is not 
> added to
>          * component list.
>          */
> +       pr_err("plb: platform name %s\n", link->platform->name);
>         if (link->platform->of_node || link->platform->name)
>                 if (!soc_find_component(link->platform->of_node,
> link->platform->name))
> -                       return -EPROBE_DEFER;
> +                 {
> +                         pr_err("plb: returning -EPROBE_DEFER 1\n");
> +                         return -EPROBE_DEFER;
>
> +                 }
>         /*
>          * CPU device may be specified by either name or OF node, but
>          * can be left unspecified, and will be matched based on DAI
> @@ -1154,9 +1158,14 @@ static int soc_init_dai_link(struct 
> snd_soc_card *card,
>          * Defer card registration if cpu dai component is not added to
>          * component list.
>          */
> +       pr_err("plb: cpu_name %s\n", link->cpu_name);
>         if (link->cpu_of_node || link->cpu_name)
>                 if (!soc_find_component(link->cpu_of_node, 
> link->cpu_name))
> -                       return -EPROBE_DEFER;
> +                 {
> +                         pr_err("plb: returning -EPROBE_DEFER 2\n");
> +                         return -EPROBE_DEFER;
> +
> +                 }
>
>         /*
>          * At least one of CPU DAI name or CPU device name/node must be
>

Thanks,
Rohit
Liam Girdwood Jan. 14, 2019, 3:40 p.m. UTC | #4
On Sat, 2019-01-12 at 11:37 +0530, Rohit kumar wrote:
> On 1/12/2019 3:19 AM, Pierre-Louis Bossart wrote:
> > > Thanks for the overnight fix. This update fixes the issue on my 
> > > Skylake XPS13 test device (blind testing since I don't understand 
> > > what the code does).
> > > 
> > > Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
> > 
> > I need to take this back, this set of changes (initial+fix) causes an 
> > error with our HDMI support
> > 
> > [   17.437684] sof-audio sof-audio: created machine bxt-pcm512x
> > [   17.585279] bxt-pcm512x bxt-pcm512x: ASoC: failed to init link iDisp1
> > [   17.585639] bxt-pcm512x bxt-pcm512x: snd_soc_register_card failed -517
> > 
> > Removing your changes restores the functionality
> > 
> Looks like we should revert generic implementation for defering probe 
> and move to call from machine driver as done in v1.
> https://lore.kernel.org/patchwork/patch/1027560/
> https://lore.kernel.org/patchwork/patch/1027561/
> 
> @Mark, Do you have suggestion to refine current patch?
> > Adding some traces I can see that the the platform name we use doesn't 
> > seem compatible with your logic. All the Intel boards used a constant 
> > platform name matching the PCI ID, see e.g. [1], which IIRC is used to 
> > bind components. Liam, do you recall in more details if this is really 
> > required?

Sorry, I cant quite remember why the PCI ID was used for the platform name, I
think it started with the SKL generation as previous generations used "baytrail-
pcm" and "haswell-pcm" as platform names IIRC. Perhaps Vinod will know.

The platform name is only used by SOF when over writing the "legacy" platform
name e.g. "baytrail-pcm" would become "sof-audio" and this is only used for
binding DAI links (so that all legacy machine drivers can be reused without
modification). 

Liam
Mark Brown Jan. 14, 2019, 11:26 p.m. UTC | #5
On Fri, Jan 11, 2019 at 01:44:02PM +0530, Rohit kumar wrote:

> -	if (!soc_find_component(link->platform->of_node, link->platform->name))
> -		return -EPROBE_DEFER;
> +	if (link->platform->of_node || link->platform->name)
> +		if (!soc_find_component(link->platform->of_node,
> +					link->platform->name))
> +			return -EPROBE_DEFER;

If we need to do this for every user (which we do pretty much it seems)
we should just be doing it inside find_component().
Mark Brown Jan. 15, 2019, 12:06 a.m. UTC | #6
On Fri, Jan 11, 2019 at 03:49:08PM -0600, Pierre-Louis Bossart wrote:

> Adding some traces I can see that the the platform name we use doesn't seem
> compatible with your logic. All the Intel boards used a constant platform
> name matching the PCI ID, see e.g. [1], which IIRC is used to bind
> components. Liam, do you recall in more details if this is really required?

That's telling me that either snd_soc_find_components() isn't finding
components in the same way that we do when we bind things which isn't
good or we're binding links without having fully matched everything on
the link which also isn't good.  

Without a system that shows the issue I can't 100% confirm but I think
it's the former - I'm fairly sure that those machines are relying on the
component name being initialized to fmt_single_name() in
snd_soc_component_initialize().  That is supposed to wind up using
dev_name() (which would be the PCI address for a PCI device) as the
basis of the name.  What I can't currently see is how exactly that gets
bound (or how any of the other links avoid trouble for that matter).  We
could revert and push this into cards but I would rather be confident
that we understand what's going on, I'm not comfortable that we aren't
just pushing the breakage around rather than fixing it.  Can someone
with an x86 system take a look and confirm exactly what's going on with
binding these cards please?
Pierre-Louis Bossart Jan. 15, 2019, 3:08 a.m. UTC | #7
On 1/14/19 6:06 PM, Mark Brown wrote:
> On Fri, Jan 11, 2019 at 03:49:08PM -0600, Pierre-Louis Bossart wrote:
>
>> Adding some traces I can see that the the platform name we use doesn't seem
>> compatible with your logic. All the Intel boards used a constant platform
>> name matching the PCI ID, see e.g. [1], which IIRC is used to bind
>> components. Liam, do you recall in more details if this is really required?
> That's telling me that either snd_soc_find_components() isn't finding
> components in the same way that we do when we bind things which isn't
> good or we're binding links without having fully matched everything on
> the link which also isn't good.
>
> Without a system that shows the issue I can't 100% confirm but I think
> it's the former - I'm fairly sure that those machines are relying on the
> component name being initialized to fmt_single_name() in
> snd_soc_component_initialize().  That is supposed to wind up using
> dev_name() (which would be the PCI address for a PCI device) as the
> basis of the name.  What I can't currently see is how exactly that gets
> bound (or how any of the other links avoid trouble for that matter).  We
> could revert and push this into cards but I would rather be confident
> that we understand what's going on, I'm not comfortable that we aren't
> just pushing the breakage around rather than fixing it.  Can someone
> with an x86 system take a look and confirm exactly what's going on with
> binding these cards please?

I am actually not sure at all why we need the platform_name to be set in 
Intel machine drivers.

I ran a 5mn experiment with SOF and we completely ignore what is set by 
machine drivers, it's overridden by the component name:

             dev_info(card->dev, "info: override FE DAI link %s\n",
                  card->dai_link[i].name);

             /* override platform component */
             if (snd_soc_init_platform(card, dai_link) < 0) {
                 dev_err(card->dev, "init platform error");
                 continue;
             }
             pr_err("plb: platform_name was %s, now %s\n",
                    dai_link->platform->name,
                    component->name);

             dai_link->platform->name = component->name;

existing machine driver (info is incorrect btw, should be BE DAI link)

[   36.628466] bxt-pcm512x bxt-pcm512x: info: override FE DAI link 
SSP5-Codec
[   36.628469] plb: platform_name was sof-audio, now sof-audio
[   36.628772] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp1
[   36.628773] plb: platform_name was 0000:00:0e.0, now sof-audio
[   36.629111] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp2
[   36.629112] plb: platform_name was 0000:00:0e.0, now sof-audio
[   36.629422] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp3
[   36.629423] plb: platform_name was 0000:00:0e.0, now sof-audio

machine driver with all platform_name commented out, no regression at all.

[   15.839652] sof-audio sof-audio: created machine bxt-pcm512x
[   15.846990] bxt-pcm512x bxt-pcm512x: info: override FE DAI link 
SSP5-Codec
[   15.846995] plb: platform_name was snd-soc-dummy, now sof-audio
[   15.847325] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp1
[   15.847326] plb: platform_name was snd-soc-dummy, now sof-audio
[   15.847657] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp2
[   15.847658] plb: platform_name was snd-soc-dummy, now sof-audio
[   15.847974] bxt-pcm512x bxt-pcm512x: info: override FE DAI link iDisp3
[   15.847974] plb: platform_name was snd-soc-dummy, now sof-audio

I checked for a bit longer with the Skylake driver, I also couldn't see 
a difference with or without the platform_name set.

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 0462b3ec977a..0fdf99ec17cd 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -918,7 +918,7 @@ static int soc_bind_dai_link(struct snd_soc_card *card,
                 if (!snd_soc_is_matching_component(dai_link->platform,
                                                    component))
                         continue;
-
+               pr_err("plb: binding with component_name %s \n", 
component->name);
                 snd_soc_rtdcom_add(rtd, component);
         }

@@ -1041,6 +1041,8 @@ static int snd_soc_init_platform(struct 
snd_soc_card *card,
                 if (!platform)
                         return -ENOMEM;

+               pr_err("plb: %s: plaform->name set to %s\n", __func__,
+                      dai_link->platform_name);
                 dai_link->platform      = platform;
                 platform->name          = dai_link->platform_name;
                 platform->of_node       = dai_link->platform_of_node;

[    1.345143] plb: snd_soc_init_platform: plaform->name set to 0000:00:1f.3
[    1.345148] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL Audio
[    1.345151] plb: binding with component_name 0000:00:1f.3
[    1.345153] plb: snd_soc_init_platform: plaform->name set to 0000:00:1f.3
[    1.345154] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI1
[    1.345155] plb: binding with component_name 0000:00:1f.3
[    1.345157] plb: snd_soc_init_platform: plaform->name set to 0000:00:1f.3
[    1.345158] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI2
[    1.345159] plb: binding with component_name 0000:00:1f.3
[    1.345161] plb: snd_soc_init_platform: plaform->name set to 0000:00:1f.3
[    1.345162] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI3
[    1.345163] plb: binding with component_name 0000:00:1f.3

[    1.349607] plb: snd_soc_init_platform: plaform->name set to (null)
[    1.349613] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL Audio
[    1.349617] plb: binding with component_name snd-soc-dummy
[    1.349619] plb: binding with component_name snd-soc-dummy
[    1.349621] plb: snd_soc_init_platform: plaform->name set to (null)
[    1.349623] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI1
[    1.349625] plb: binding with component_name snd-soc-dummy
[    1.349626] plb: binding with component_name snd-soc-dummy
[    1.349628] plb: snd_soc_init_platform: plaform->name set to (null)
[    1.349631] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI2
[    1.349632] plb: binding with component_name snd-soc-dummy
[    1.349633] plb: binding with component_name snd-soc-dummy
[    1.349635] plb: snd_soc_init_platform: plaform->name set to (null)
[    1.349638] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: binding 
CNL HDMI3
[    1.349639] plb: binding with component_name snd-soc-dummy
[    1.349641] plb: binding with component_name snd-soc-dummy

Audio playback works in both cases.

The funny thing is that the list of components is right in 
/sys/kernel/debug/asoc.

My guess is that the required PCI component is already added when the 
platform_name is used in soc_bind_dai_link() and snd_soc_rtdcom_add() 
does nothing for the back-end, so the dailink platform_name has actually 
zero influence on the outcome.

Another way to look at it is that we already provide the 
dai_link->cpu_dai_name which is enough for soc_bind_dai_link() to find 
the relevant component and as a result the dailink->platform_name seems 
redundant/unnecessary. I am using the conditional here since this part 
of the code (Intel driver included) seems to work by accident rather 
than by design, and we'd need a set of additional tests before removing 
these hard-coded initializations.

Does this help?
Pierre-Louis Bossart Jan. 15, 2019, 7:35 p.m. UTC | #8
On 1/14/19 6:06 PM, Mark Brown wrote:
> On Fri, Jan 11, 2019 at 03:49:08PM -0600, Pierre-Louis Bossart wrote:
>
>> Adding some traces I can see that the the platform name we use doesn't seem
>> compatible with your logic. All the Intel boards used a constant platform
>> name matching the PCI ID, see e.g. [1], which IIRC is used to bind
>> components. Liam, do you recall in more details if this is really required?
> That's telling me that either snd_soc_find_components() isn't finding
> components in the same way that we do when we bind things which isn't
> good or we're binding links without having fully matched everything on
> the link which also isn't good.
>
> Without a system that shows the issue I can't 100% confirm but I think
> it's the former - I'm fairly sure that those machines are relying on the
> component name being initialized to fmt_single_name() in
> snd_soc_component_initialize().  That is supposed to wind up using
> dev_name() (which would be the PCI address for a PCI device) as the
> basis of the name.  What I can't currently see is how exactly that gets
> bound (or how any of the other links avoid trouble for that matter).  We
> could revert and push this into cards but I would rather be confident
> that we understand what's going on, I'm not comfortable that we aren't
> just pushing the breakage around rather than fixing it.  Can someone
> with an x86 system take a look and confirm exactly what's going on with
> binding these cards please?

Beyond the fact that the platform_name seems to be totally useless, 
additional tests show that the patch ('ASoC: soc-core: defer card probe 
until all component is added to list') adds a new restriction which 
contradicts existing error checks.

None of the Intel machine drivers set the dailink "cpu_name" field but 
use the "cpu_dai_name" field instead. This was perfectly legit as 
documented by the code at the end of soc_init_dai_link()

     /*
      * At least one of CPU DAI name or CPU device name/node must be
      * specified
      */
     if (!link->cpu_dai_name &&
         !(link->cpu_name || link->cpu_of_node)) {
         dev_err(card->dev,
             "ASoC: Neither cpu_dai_name nor cpu_name/of_node are set 
for %s\n",
             link->name);
         return -EINVAL;
     }

The code contributed by Qualcomm only checks for cpu_name, which 
prevents the init from completing.

So if we want to be consistent, the new code should be something like:

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index b680c673c553..2791da9417f8 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -1154,7 +1154,7 @@ static int soc_init_dai_link(struct snd_soc_card 
*card,
          * Defer card registartion if cpu dai component is not added to
          * component list.
          */
-       if (!soc_find_component(link->cpu_of_node, link->cpu_name))
+       if (!link->cpu_dai_name && 
!soc_find_component(link->cpu_of_node, link->cpu_name))
                 return -EPROBE_DEFER;

         /*

or try to call soc_find_component with both cpu_name or cpu_dai_name, if 
this makes sense?
Mark Brown Jan. 15, 2019, 9:07 p.m. UTC | #9
On Tue, Jan 15, 2019 at 01:35:07PM -0600, Pierre-Louis Bossart wrote:
> On 1/14/19 6:06 PM, Mark Brown wrote:

> > just pushing the breakage around rather than fixing it.  Can someone
> > with an x86 system take a look and confirm exactly what's going on with
> > binding these cards please?

> Beyond the fact that the platform_name seems to be totally useless,
> additional tests show that the patch ('ASoC: soc-core: defer card probe
> until all component is added to list') adds a new restriction which
> contradicts existing error checks.

Yes...  I'd been coming to the conclusion that it was a huge red herring
here.  

> So if we want to be consistent, the new code should be something like:

> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index b680c673c553..2791da9417f8 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -1154,7 +1154,7 @@ static int soc_init_dai_link(struct snd_soc_card
> *card,
>          * Defer card registartion if cpu dai component is not added to
>          * component list.
>          */
> -       if (!soc_find_component(link->cpu_of_node, link->cpu_name))
> +       if (!link->cpu_dai_name && !soc_find_component(link->cpu_of_node,
> link->cpu_name))
>                 return -EPROBE_DEFER;
> 
>         /*

> or try to call soc_find_component with both cpu_name or cpu_dai_name, if
> this makes sense?

I think calling _find_component() makes more sense here as it will do
the check it's actually there thing no matter how the link is
identified.  Assuming that does resolve the issue do you want to make a
patch given that you got there first?
Matthias Reichl Jan. 15, 2019, 9:11 p.m. UTC | #10
On Tue, Jan 15, 2019 at 01:35:07PM -0600, Pierre-Louis Bossart wrote:
> 
> On 1/14/19 6:06 PM, Mark Brown wrote:
> > On Fri, Jan 11, 2019 at 03:49:08PM -0600, Pierre-Louis Bossart wrote:
> > 
> > > Adding some traces I can see that the the platform name we use doesn't seem
> > > compatible with your logic. All the Intel boards used a constant platform
> > > name matching the PCI ID, see e.g. [1], which IIRC is used to bind
> > > components. Liam, do you recall in more details if this is really required?
> > That's telling me that either snd_soc_find_components() isn't finding
> > components in the same way that we do when we bind things which isn't
> > good or we're binding links without having fully matched everything on
> > the link which also isn't good.
> > 
> > Without a system that shows the issue I can't 100% confirm but I think
> > it's the former - I'm fairly sure that those machines are relying on the
> > component name being initialized to fmt_single_name() in
> > snd_soc_component_initialize().  That is supposed to wind up using
> > dev_name() (which would be the PCI address for a PCI device) as the
> > basis of the name.  What I can't currently see is how exactly that gets
> > bound (or how any of the other links avoid trouble for that matter).  We
> > could revert and push this into cards but I would rather be confident
> > that we understand what's going on, I'm not comfortable that we aren't
> > just pushing the breakage around rather than fixing it.  Can someone
> > with an x86 system take a look and confirm exactly what's going on with
> > binding these cards please?
> 
> Beyond the fact that the platform_name seems to be totally useless,
> additional tests show that the patch ('ASoC: soc-core: defer card probe
> until all component is added to list') adds a new restriction which
> contradicts existing error checks.
> 
> None of the Intel machine drivers set the dailink "cpu_name" field but use
> the "cpu_dai_name" field instead. This was perfectly legit as documented by
> the code at the end of soc_init_dai_link()

This should be fixed by the patch
"ASoC: core: Don't defer probe on optional, NULL components" which Mark
already applied to his tree. See
http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html

Maybe the defer card probe logic needs to be extended to also check if
dai_link_name had already been registered (either cpu or cpu_dai_name
needs to be set), not 100% sure which problem the defer card probe patch
was trying to solve.

so long,

Hias

> 
>     /*
>      * At least one of CPU DAI name or CPU device name/node must be
>      * specified
>      */
>     if (!link->cpu_dai_name &&
>         !(link->cpu_name || link->cpu_of_node)) {
>         dev_err(card->dev,
>             "ASoC: Neither cpu_dai_name nor cpu_name/of_node are set for
> %s\n",
>             link->name);
>         return -EINVAL;
>     }
> 
> The code contributed by Qualcomm only checks for cpu_name, which prevents
> the init from completing.
> 
> So if we want to be consistent, the new code should be something like:
> 
> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index b680c673c553..2791da9417f8 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -1154,7 +1154,7 @@ static int soc_init_dai_link(struct snd_soc_card
> *card,
>          * Defer card registartion if cpu dai component is not added to
>          * component list.
>          */
> -       if (!soc_find_component(link->cpu_of_node, link->cpu_name))
> +       if (!link->cpu_dai_name && !soc_find_component(link->cpu_of_node,
> link->cpu_name))
>                 return -EPROBE_DEFER;
> 
>         /*
> 
> or try to call soc_find_component with both cpu_name or cpu_dai_name, if
> this makes sense?
> _______________________________________________
> Alsa-devel mailing list
> Alsa-devel@alsa-project.org
> http://mailman.alsa-project.org/mailman/listinfo/alsa-devel
Pierre-Louis Bossart Jan. 15, 2019, 9:16 p.m. UTC | #11
>> Beyond the fact that the platform_name seems to be totally useless,
>> additional tests show that the patch ('ASoC: soc-core: defer card probe
>> until all component is added to list') adds a new restriction which
>> contradicts existing error checks.
>>
>> None of the Intel machine drivers set the dailink "cpu_name" field but use
>> the "cpu_dai_name" field instead. This was perfectly legit as documented by
>> the code at the end of soc_init_dai_link()
> This should be fixed by the patch
> "ASoC: core: Don't defer probe on optional, NULL components" which Mark
> already applied to his tree. See
> http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html

Ah yes, I missed this patch while I was debugging. Indeed this fixes the 
problem and my devices work again with Mark's for-next branch. Thanks 
Matthias!

>
> Maybe the defer card probe logic needs to be extended to also check if
> dai_link_name had already been registered (either cpu or cpu_dai_name
> needs to be set), not 100% sure which problem the defer card probe patch
> was trying to solve.
same here, I don't get why the deferred probe stuff only deals with one 
of the two options.
Mark Brown Jan. 15, 2019, 9:41 p.m. UTC | #12
On Tue, Jan 15, 2019 at 03:16:57PM -0600, Pierre-Louis Bossart wrote:

> > Maybe the defer card probe logic needs to be extended to also check if
> > dai_link_name had already been registered (either cpu or cpu_dai_name
> > needs to be set), not 100% sure which problem the defer card probe patch
> > was trying to solve.

We were getting cards probing without the platforms being registered
(which in turn meant we were just skipping their init) and had patches
proposed to implement the deferral in the cards.  The deferral stuff is
supposed to making sure that everything is registered when we
instantiate.

> same here, I don't get why the deferred probe stuff only deals with one of
> the two options.

I think it's just an oversight - I think the change you were proposing
to check the cpu_dai_name is a good idea anyway as it makes things more
consistent and work more obviously by intention.  And more generally if
we can simplify the code by removing legacy options that'd be good but
that's a bigger bit of work...
Matthias Reichl Jan. 15, 2019, 9:48 p.m. UTC | #13
On Tue, Jan 15, 2019 at 09:41:38PM +0000, Mark Brown wrote:
> On Tue, Jan 15, 2019 at 03:16:57PM -0600, Pierre-Louis Bossart wrote:
> 
> > > Maybe the defer card probe logic needs to be extended to also check if
> > > dai_link_name had already been registered (either cpu or cpu_dai_name
> > > needs to be set), not 100% sure which problem the defer card probe patch
> > > was trying to solve.
> 
> We were getting cards probing without the platforms being registered
> (which in turn meant we were just skipping their init) and had patches
> proposed to implement the deferral in the cards.  The deferral stuff is
> supposed to making sure that everything is registered when we
> instantiate.

Ah, that makes sense. Thanks a lot for the info!

so long,

Hias

> 
> > same here, I don't get why the deferred probe stuff only deals with one of
> > the two options.
> 
> I think it's just an oversight - I think the change you were proposing
> to check the cpu_dai_name is a good idea anyway as it makes things more
> consistent and work more obviously by intention.  And more generally if
> we can simplify the code by removing legacy options that'd be good but
> that's a bigger bit of work...
Pierre-Louis Bossart Jan. 18, 2019, 11:02 p.m. UTC | #14
On 1/15/19 3:16 PM, Pierre-Louis Bossart wrote:
>
>>> Beyond the fact that the platform_name seems to be totally useless,
>>> additional tests show that the patch ('ASoC: soc-core: defer card probe
>>> until all component is added to list') adds a new restriction which
>>> contradicts existing error checks.
>>>
>>> None of the Intel machine drivers set the dailink "cpu_name" field 
>>> but use
>>> the "cpu_dai_name" field instead. This was perfectly legit as 
>>> documented by
>>> the code at the end of soc_init_dai_link()
>> This should be fixed by the patch
>> "ASoC: core: Don't defer probe on optional, NULL components" which Mark
>> already applied to his tree. See
>> http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html 
>>
>
> Ah yes, I missed this patch while I was debugging. Indeed this fixes 
> the problem and my devices work again with Mark's for-next branch. 
> Thanks Matthias!

This PROBE_DEFER support actually breaks the topology override that 
we've been relying on for SOF (and which has been in Mark's branch for 
some time now). This override helps us reuse machine drivers between 
legacy and SOF-based solutions.

With the current code, the tests in soc_register_card() complain that 
the platform_name can't be tied to a component and stop the card 
registration, but that's mainly because the tests are done before the 
topology overrides are done in soc_check_tplg_fes(). Moving 
soc_check_tplg_fes() from soc_instantiate_card() to an earlier time in 
soc_register_card() works-around the problem but looks quite invasive 
(mutex lock, etc).

There is also a second problem where we seem to have a memory management 
issue root caused to the change in snd_soc_init_platform() added by 
09ac6a817bd6 ('ASoC: soc-core: fix init platform memory handling')

The code does this

static int snd_soc_init_platform(struct snd_soc_card *card,
                  struct snd_soc_dai_link *dai_link)
{
     struct snd_soc_dai_link_component *platform = dai_link->platform;


     /* convert Legacy platform link */
     if (!platform || dai_link->legacy_platform) {
         platform = devm_kzalloc(card->dev,
                 sizeof(struct snd_soc_dai_link_component),
                 GFP_KERNEL);
         if (!platform)
             return -ENOMEM;

         dai_link->platform      = platform;
         dai_link->legacy_platform = 1;

This last assignment guarantees that memory will be allocated every time 
this function is called, and whatever overrides are done later will 
themselves be overridden by the new allocation. I am not sure what the 
intent was here, Curtis can you please double-check?

Details, test code and logs are available here: 
https://github.com/thesofproject/linux/issues/565

Have a nice week-end everyone, that's it for me until Tuesday.

-Pierre
Curtis Malainey Jan. 19, 2019, 1:12 a.m. UTC | #15
On Fri, Jan 18, 2019 at 3:02 PM Pierre-Louis Bossart <
pierre-louis.bossart@linux.intel.com> wrote:

>
> On 1/15/19 3:16 PM, Pierre-Louis Bossart wrote:
> >
> >>> Beyond the fact that the platform_name seems to be totally useless,
> >>> additional tests show that the patch ('ASoC: soc-core: defer card probe
> >>> until all component is added to list') adds a new restriction which
> >>> contradicts existing error checks.
> >>>
> >>> None of the Intel machine drivers set the dailink "cpu_name" field
> >>> but use
> >>> the "cpu_dai_name" field instead. This was perfectly legit as
> >>> documented by
> >>> the code at the end of soc_init_dai_link()
> >> This should be fixed by the patch
> >> "ASoC: core: Don't defer probe on optional, NULL components" which Mark
> >> already applied to his tree. See
> >>
> http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html
> >>
> >
> > Ah yes, I missed this patch while I was debugging. Indeed this fixes
> > the problem and my devices work again with Mark's for-next branch.
> > Thanks Matthias!
>
> This PROBE_DEFER support actually breaks the topology override that
> we've been relying on for SOF (and which has been in Mark's branch for
> some time now). This override helps us reuse machine drivers between
> legacy and SOF-based solutions.
>
> With the current code, the tests in soc_register_card() complain that
> the platform_name can't be tied to a component and stop the card
> registration, but that's mainly because the tests are done before the
> topology overrides are done in soc_check_tplg_fes(). Moving
> soc_check_tplg_fes() from soc_instantiate_card() to an earlier time in
> soc_register_card() works-around the problem but looks quite invasive
> (mutex lock, etc).
>
> There is also a second problem where we seem to have a memory management
> issue root caused to the change in snd_soc_init_platform() added by
> 09ac6a817bd6 ('ASoC: soc-core: fix init platform memory handling')
>
> The code does this
>
> static int snd_soc_init_platform(struct snd_soc_card *card,
>                   struct snd_soc_dai_link *dai_link)
> {
>      struct snd_soc_dai_link_component *platform = dai_link->platform;
>
>
>      /* convert Legacy platform link */
>      if (!platform || dai_link->legacy_platform) {
>          platform = devm_kzalloc(card->dev,
>                  sizeof(struct snd_soc_dai_link_component),
>                  GFP_KERNEL);
>          if (!platform)
>              return -ENOMEM;
>
>          dai_link->platform      = platform;
>          dai_link->legacy_platform = 1;
>
> This last assignment guarantees that memory will be allocated every time
> this function is called, and whatever overrides are done later will
> themselves be overridden by the new allocation. I am not sure what the
> intent was here, Curtis can you please double-check?
>
> The issue was that we were seeing a memory corruption bug on an AMD
chromebooks with that function already (not observed on Intel). I was
testing some SOF integrations and was seeing this in the kernel logs. I had
Dylan verify my logic before I sent the patch because it took so long to
identify the bug and it was traced to the patch that introduce
soc_init_platform.

[   10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
designware-i2s.1.auto not registered
[   10.922122] cz-da7219-max98357a AMD7219:00:
devm_snd_soc_register_card(acpd7219m98357) failed: -517
[   11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
name/of_node are set for amd-max98357-play
[   11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init link
amd-max98357-play
[   11.001431] cz-da7219-max98357a AMD7219:00:
devm_snd_soc_register_card(acpd7219m98357) failed: -22
[   11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error
-22

of_node was never getting set but the pointer was becoming populated
(outside of the probe call) which traced to soc_init_platform function
which was not reallocating memory on a EPROBE_DEFER even though it was
getting freed by devm. I am not very familiar with devm but my local
maintainers say that it should be freeing the memory even on a PROBE_DEFER.
The patch should mirror the memory behaviour in snd_soc_init_multicodec
which also reallocates its memory on every probe. I'm not sure how the
patch is causing you to defer, is your component list corrupt?

Details, test code and logs are available here:
> https://github.com/thesofproject/linux/issues/565
>
> Have a nice week-end everyone, that's it for me until Tuesday.
>
> -Pierre
>
>
>
>
Curtis Malainey Jan. 19, 2019, 1:15 a.m. UTC | #16
On Fri, Jan 18, 2019 at 5:12 PM Curtis Malainey <cujomalainey@google.com> wrote:
>
>
>
> On Fri, Jan 18, 2019 at 3:02 PM Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> wrote:
>>
>>
>> On 1/15/19 3:16 PM, Pierre-Louis Bossart wrote:
>> >
>> >>> Beyond the fact that the platform_name seems to be totally useless,
>> >>> additional tests show that the patch ('ASoC: soc-core: defer card probe
>> >>> until all component is added to list') adds a new restriction which
>> >>> contradicts existing error checks.
>> >>>
>> >>> None of the Intel machine drivers set the dailink "cpu_name" field
>> >>> but use
>> >>> the "cpu_dai_name" field instead. This was perfectly legit as
>> >>> documented by
>> >>> the code at the end of soc_init_dai_link()
>> >> This should be fixed by the patch
>> >> "ASoC: core: Don't defer probe on optional, NULL components" which Mark
>> >> already applied to his tree. See
>> >> http://mailman.alsa-project.org/pipermail/alsa-devel/2019-January/144323.html
>> >>
>> >
>> > Ah yes, I missed this patch while I was debugging. Indeed this fixes
>> > the problem and my devices work again with Mark's for-next branch.
>> > Thanks Matthias!
>>
>> This PROBE_DEFER support actually breaks the topology override that
>> we've been relying on for SOF (and which has been in Mark's branch for
>> some time now). This override helps us reuse machine drivers between
>> legacy and SOF-based solutions.
>>
>> With the current code, the tests in soc_register_card() complain that
>> the platform_name can't be tied to a component and stop the card
>> registration, but that's mainly because the tests are done before the
>> topology overrides are done in soc_check_tplg_fes(). Moving
>> soc_check_tplg_fes() from soc_instantiate_card() to an earlier time in
>> soc_register_card() works-around the problem but looks quite invasive
>> (mutex lock, etc).
>>
>> There is also a second problem where we seem to have a memory management
>> issue root caused to the change in snd_soc_init_platform() added by
>> 09ac6a817bd6 ('ASoC: soc-core: fix init platform memory handling')
>>
>> The code does this
>>
>> static int snd_soc_init_platform(struct snd_soc_card *card,
>>                   struct snd_soc_dai_link *dai_link)
>> {
>>      struct snd_soc_dai_link_component *platform = dai_link->platform;
>>
>>
>>      /* convert Legacy platform link */
>>      if (!platform || dai_link->legacy_platform) {
>>          platform = devm_kzalloc(card->dev,
>>                  sizeof(struct snd_soc_dai_link_component),
>>                  GFP_KERNEL);
>>          if (!platform)
>>              return -ENOMEM;
>>
>>          dai_link->platform      = platform;
>>          dai_link->legacy_platform = 1;
>>
>> This last assignment guarantees that memory will be allocated every time
>> this function is called, and whatever overrides are done later will
>> themselves be overridden by the new allocation. I am not sure what the
>> intent was here, Curtis can you please double-check?
>>
The issue was that we were seeing a memory corruption bug on an AMD
chromebooks with that function already (not observed on Intel). I was
testing some SOF integrations and was seeing this in the kernel logs.
I had Dylan verify my logic before I sent the patch because it took so
long to identify the bug and it was traced to the patch that introduce
soc_init_platform.

[   10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
designware-i2s.1.auto not registered
[   10.922122] cz-da7219-max98357a AMD7219:00:
devm_snd_soc_register_card(acpd7219m98357) failed: -517
[   11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
name/of_node are set for amd-max98357-play
[   11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init
link amd-max98357-play
[   11.001431] cz-da7219-max98357a AMD7219:00:
devm_snd_soc_register_card(acpd7219m98357) failed: -22
[   11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22

of_node was never getting set but the pointer was becoming populated
(outside of the probe call) which traced to soc_init_platform function
which was not reallocating memory on a EPROBE_DEFER even though it was
getting freed by devm. I am not very familiar with devm but my local
maintainers say that it should be freeing the memory even on a
PROBE_DEFER.
The patch should mirror the memory behaviour in
snd_soc_init_multicodec which also reallocates its memory on every
probe. I'm not sure how the patch is causing you to defer, is your
component list corrupt?

Sorry for the duplicate spam, forgot to send via plain text mode,
re-sending for the mailing list so it gets accepted.
>
>> Details, test code and logs are available here:
>> https://github.com/thesofproject/linux/issues/565
>>
>> Have a nice week-end everyone, that's it for me until Tuesday.
>>
>> -Pierre
>>
>>
>>
Mark Brown Jan. 21, 2019, 6:30 p.m. UTC | #17
On Fri, Jan 18, 2019 at 05:15:32PM -0800, Curtis Malainey wrote:

> of_node was never getting set but the pointer was becoming populated
> (outside of the probe call) which traced to soc_init_platform function
> which was not reallocating memory on a EPROBE_DEFER even though it was
> getting freed by devm. I am not very familiar with devm but my local
> maintainers say that it should be freeing the memory even on a
> PROBE_DEFER.

Probe deferral is just like any other error from probe, any managed
resources will be unwound.
Mark Brown Jan. 21, 2019, 7:17 p.m. UTC | #18
On Fri, Jan 18, 2019 at 05:02:08PM -0600, Pierre-Louis Bossart wrote:

> This PROBE_DEFER support actually breaks the topology override that we've
> been relying on for SOF (and which has been in Mark's branch for some time
> now). This override helps us reuse machine drivers between legacy and
> SOF-based solutions.

> With the current code, the tests in soc_register_card() complain that the
> platform_name can't be tied to a component and stop the card registration,
> but that's mainly because the tests are done before the topology overrides
> are done in soc_check_tplg_fes(). Moving soc_check_tplg_fes() from
> soc_instantiate_card() to an earlier time in soc_register_card()
> works-around the problem but looks quite invasive (mutex lock, etc).

Right, I see.  I don't think there's going to be any non-invasive
solution here, either we need to stop overriding bits of the machine
driver relevant to binding in the topology files like this or we need to
look at the topology files in the probe deferral stuff so it's either
going to be very invasive for the machine driver or the core.  I'm
tempted to say that the machine driver is a better option here.

> This last assignment guarantees that memory will be allocated every time
> this function is called, and whatever overrides are done later will
> themselves be overridden by the new allocation. I am not sure what the
> intent was here, Curtis can you please double-check?

The intent is to avoid modifying statically allocated data so we can
tell the difference between pointers set up statically as init data and
those that are filled in at runtime.
Pierre-Louis Bossart Jan. 22, 2019, 8:11 p.m. UTC | #19
> The issue was that we were seeing a memory corruption bug on an AMD
> chromebooks with that function already (not observed on Intel). I was
> testing some SOF integrations and was seeing this in the kernel logs.
> I had Dylan verify my logic before I sent the patch because it took so
> long to identify the bug and it was traced to the patch that introduce
> soc_init_platform.
>
> [   10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
> designware-i2s.1.auto not registered
> [   10.922122] cz-da7219-max98357a AMD7219:00:
> devm_snd_soc_register_card(acpd7219m98357) failed: -517
> [   11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
> name/of_node are set for amd-max98357-play
> [   11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init
> link amd-max98357-play
> [   11.001431] cz-da7219-max98357a AMD7219:00:
> devm_snd_soc_register_card(acpd7219m98357) failed: -22
> [   11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22
>
> of_node was never getting set but the pointer was becoming populated
> (outside of the probe call) which traced to soc_init_platform function
> which was not reallocating memory on a EPROBE_DEFER even though it was
> getting freed by devm. I am not very familiar with devm but my local
> maintainers say that it should be freeing the memory even on a
> PROBE_DEFER.
> The patch should mirror the memory behaviour in
> snd_soc_init_multicodec which also reallocates its memory on every
> probe. I'm not sure how the patch is causing you to defer, is your
> component list corrupt?
>
> Sorry for the duplicate spam, forgot to send via plain text mode,
> re-sending for the mailing list so it gets accepted.

There is no defer issue with the intel stuff, but we call this routine 
multiple times

snd_soc_register_card

--soc_init_dai_link

----snd_soc_init_platform

-- soc_soc_bind_card

----snd_soc_instantiate_card

------ soc_check_tplg_fes

-------- snd_soc_init_platform << ALLOC1

--------soc_init_dai_link

----------snd_soc_init_platform << ALLOC2

Initially dai_link->legacy_platform is 0, so gets set after the first 
first devm_kzalloc (ALLOC1) and after that we always allocate new memory 
(ALLOC2). The end result is that whatever we set in soc_check_tplg_fes 
is lost with the new/unnecessary alloc.

I would guess your solution is also a work-around, if devm_ effectively 
freed the memory then the pointer would become NULL. Or may that's the 
issue is that no one actually resets it.
Curtis Malainey Jan. 23, 2019, 1:36 a.m. UTC | #20
Curtis Malainey | Software Engineer | cujomalainey@google.com | 650-898-3849


On Wed, Jan 23, 2019 at 4:11 AM Pierre-Louis Bossart
<pierre-louis.bossart@linux.intel.com> wrote:
>
>
> > The issue was that we were seeing a memory corruption bug on an AMD
> > chromebooks with that function already (not observed on Intel). I was
> > testing some SOF integrations and was seeing this in the kernel logs.
> > I had Dylan verify my logic before I sent the patch because it took so
> > long to identify the bug and it was traced to the patch that introduce
> > soc_init_platform.
> >
> > [   10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
> > designware-i2s.1.auto not registered
> > [   10.922122] cz-da7219-max98357a AMD7219:00:
> > devm_snd_soc_register_card(acpd7219m98357) failed: -517
> > [   11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
> > name/of_node are set for amd-max98357-play
> > [   11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init
> > link amd-max98357-play
> > [   11.001431] cz-da7219-max98357a AMD7219:00:
> > devm_snd_soc_register_card(acpd7219m98357) failed: -22
> > [   11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22
> >
> > of_node was never getting set but the pointer was becoming populated
> > (outside of the probe call) which traced to soc_init_platform function
> > which was not reallocating memory on a EPROBE_DEFER even though it was
> > getting freed by devm. I am not very familiar with devm but my local
> > maintainers say that it should be freeing the memory even on a
> > PROBE_DEFER.
> > The patch should mirror the memory behaviour in
> > snd_soc_init_multicodec which also reallocates its memory on every
> > probe. I'm not sure how the patch is causing you to defer, is your
> > component list corrupt?
> >
> > Sorry for the duplicate spam, forgot to send via plain text mode,
> > re-sending for the mailing list so it gets accepted.
>
> There is no defer issue with the intel stuff, but we call this routine
> multiple times
>
> snd_soc_register_card
>
> --soc_init_dai_link
>
> ----snd_soc_init_platform
>
> -- soc_soc_bind_card
>
> ----snd_soc_instantiate_card
>
> ------ soc_check_tplg_fes
>
> -------- snd_soc_init_platform << ALLOC1
>
> --------soc_init_dai_link
>
> ----------snd_soc_init_platform << ALLOC2
>
Ah that explains it, in my testing I didn't have the patch that
brought in the call from within tplg_fes
>
> Initially dai_link->legacy_platform is 0, so gets set after the first
> first devm_kzalloc (ALLOC1) and after that we always allocate new memory
> (ALLOC2). The end result is that whatever we set in soc_check_tplg_fes
> is lost with the new/unnecessary alloc.
>
> I would guess your solution is also a work-around, if devm_ effectively
> freed the memory then the pointer would become NULL. Or may that's the
> issue is that no one actually resets it.
>
>
Yes, its a work around to fix the memory issue. If you set the
platform in the machine driver the code will ignore it and not reset
it. That being said that is not a full proof workaround and a better
solution is definitely needed. We could go and clean up the pointers
in soc_instantiate_card based on the flag being set. That way we only
relocate on a NULL pointer like we used to but still don't affect
statically allocated memory. I will draft a patch, test it on the AMD
device, reply to this thread later with it, Pierre can you test it as
well?

I am curious why soc_check_tplg_fes is calling snd_soc_init_platform.
It should have already been called earlier, in soc_init_dai_link at
the beginning of snd_soc_register_card so the memory should already be
initialized. Unless I am missing somewhere where links are getting
added between the calls.
Pierre-Louis Bossart Jan. 23, 2019, 2:01 a.m. UTC | #21
On 1/22/19 7:36 PM, Curtis Malainey wrote:
> Curtis Malainey | Software Engineer | cujomalainey@google.com | 650-898-3849
>
>
> On Wed, Jan 23, 2019 at 4:11 AM Pierre-Louis Bossart
> <pierre-louis.bossart@linux.intel.com> wrote:
>>
>>> The issue was that we were seeing a memory corruption bug on an AMD
>>> chromebooks with that function already (not observed on Intel). I was
>>> testing some SOF integrations and was seeing this in the kernel logs.
>>> I had Dylan verify my logic before I sent the patch because it took so
>>> long to identify the bug and it was traced to the patch that introduce
>>> soc_init_platform.
>>>
>>> [   10.922112] cz-da7219-max98357a AMD7219:00: ASoC: CPU DAI
>>> designware-i2s.1.auto not registered
>>> [   10.922122] cz-da7219-max98357a AMD7219:00:
>>> devm_snd_soc_register_card(acpd7219m98357) failed: -517
>>> [   11.001411] cz-da7219-max98357a AMD7219:00: ASoC: Both platform
>>> name/of_node are set for amd-max98357-play
>>> [   11.001423] cz-da7219-max98357a AMD7219:00: ASoC: failed to init
>>> link amd-max98357-play
>>> [   11.001431] cz-da7219-max98357a AMD7219:00:
>>> devm_snd_soc_register_card(acpd7219m98357) failed: -22
>>> [   11.001577] cz-da7219-max98357a: probe of AMD7219:00 failed with error -22
>>>
>>> of_node was never getting set but the pointer was becoming populated
>>> (outside of the probe call) which traced to soc_init_platform function
>>> which was not reallocating memory on a EPROBE_DEFER even though it was
>>> getting freed by devm. I am not very familiar with devm but my local
>>> maintainers say that it should be freeing the memory even on a
>>> PROBE_DEFER.
>>> The patch should mirror the memory behaviour in
>>> snd_soc_init_multicodec which also reallocates its memory on every
>>> probe. I'm not sure how the patch is causing you to defer, is your
>>> component list corrupt?
>>>
>>> Sorry for the duplicate spam, forgot to send via plain text mode,
>>> re-sending for the mailing list so it gets accepted.
>> There is no defer issue with the intel stuff, but we call this routine
>> multiple times
>>
>> snd_soc_register_card
>>
>> --soc_init_dai_link
>>
>> ----snd_soc_init_platform
>>
>> -- soc_soc_bind_card
>>
>> ----snd_soc_instantiate_card
>>
>> ------ soc_check_tplg_fes
>>
>> -------- snd_soc_init_platform << ALLOC1
>>
>> --------soc_init_dai_link
>>
>> ----------snd_soc_init_platform << ALLOC2
>>
> Ah that explains it, in my testing I didn't have the patch that
> brought in the call from within tplg_fes
>> Initially dai_link->legacy_platform is 0, so gets set after the first
>> first devm_kzalloc (ALLOC1) and after that we always allocate new memory
>> (ALLOC2). The end result is that whatever we set in soc_check_tplg_fes
>> is lost with the new/unnecessary alloc.
>>
>> I would guess your solution is also a work-around, if devm_ effectively
>> freed the memory then the pointer would become NULL. Or may that's the
>> issue is that no one actually resets it.
>>
>>
> Yes, its a work around to fix the memory issue. If you set the
> platform in the machine driver the code will ignore it and not reset
> it. That being said that is not a full proof workaround and a better
> solution is definitely needed. We could go and clean up the pointers
> in soc_instantiate_card based on the flag being set. That way we only
> relocate on a NULL pointer like we used to but still don't affect
> statically allocated memory. I will draft a patch, test it on the AMD
> device, reply to this thread later with it, Pierre can you test it as
> well?
>
> I am curious why soc_check_tplg_fes is calling snd_soc_init_platform.
> It should have already been called earlier, in soc_init_dai_link at
> the beginning of snd_soc_register_card so the memory should already be
> initialized. Unless I am missing somewhere where links are getting
> added between the calls.

This is actually a second order problem, the main issue i have is that 
the very first call to init_dai_link fails with the new DEFER_PROBE 
handling.

I don't quite understand what Linaro/AMD folks are doing but I trust 
their changes are legitimate. To move forward, maybe it's not worth 
spending too much time on a grand unification of string theory, there 
are simpler solutions: the Intel machine drivers already do get the 
platform driver name as an platform_data argument, so we could modify 
the dailinks platform names before even registering the card. I tested 
with the attached proof-of-concept patch, it adds 2 lines of code per 
machine driver if we use a common helper (after the transition to the 
"modern" dailink representation that's needed anyways) so maybe it's 
better in the end? the override we care about is really the automatic 
handling of all the hard-coded front-ends, the platform-name override 
isn't really a battle i want to pick or spend time on.
From 5680c64b09964b134e20bf96142d1ce5dcf0f77f Mon Sep 17 00:00:00 2001
From: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Date: Tue, 22 Jan 2019 18:53:43 -0600
Subject: [PATCH] ASoC: add helper to change platform name for all dailinks

To reuse the same machine drivers with Atom/SST, Skylake and SOF, we
need to change the default platform_name (or platforms->name in the
"modern" representation).

So far, this override was done with an automatic override, which was
broken by a set of changes for DT platforms related to deferred probe
handling.

This automatic override is actually not really needed, the machine
driver can already receive the platform name as a platform_data
parameter. This is used e.g. for HDaudio support where we have
different PCI aliases used for different platforms. We can reuse the
same mechanism and modify the machine drivers to override the dailinks
prior to registrating the card.

This will require additional work for SOF, but with this helper it'll
be just two lines of additional code per machine driver which is
reused, not the end of the world.

Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
---
 include/sound/soc.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/sound/soc.h b/include/sound/soc.h
index b71b05019c68..59873084f3bd 100644
--- a/include/sound/soc.h
+++ b/include/sound/soc.h
@@ -1583,6 +1583,29 @@ struct snd_soc_dai *snd_soc_card_get_codec_dai(struct snd_soc_card *card,
 	return NULL;
 }
 
+static inline
+int snd_soc_fixup_dai_links_platform_name(struct snd_soc_card *card,
+					  const char *platform_name)
+{
+	struct snd_soc_dai_link *dai_link;
+	int i;
+
+	if (!platform_name) /* nothing to do */
+		return 0;
+
+	/* set platform name for each dailink */
+	for_each_card_prelinks(card, i, dai_link) {
+		/* only single platform is supported for now */
+		dai_link->platforms->name = devm_kstrdup(card->dev,
+							 platform_name,
+							 GFP_KERNEL);
+		if (!dai_link->platforms->name)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *snd_soc_debugfs_root;
 #endif
Mark Brown Jan. 24, 2019, 6:44 p.m. UTC | #22
On Tue, Jan 22, 2019 at 08:01:15PM -0600, Pierre-Louis Bossart wrote:

> changes are legitimate. To move forward, maybe it's not worth spending too
> much time on a grand unification of string theory, there are simpler
> solutions: the Intel machine drivers already do get the platform driver name
> as an platform_data argument, so we could modify the dailinks platform names
> before even registering the card. I tested with the attached

Yes, that would be much better - it's vastly more idiomatic.  The
general idea is that a machine driver should know what it's expecting to
find before it starts probing.
Pierre-Louis Bossart Jan. 24, 2019, 7:07 p.m. UTC | #23
>> changes are legitimate. To move forward, maybe it's not worth spending too
>> much time on a grand unification of string theory, there are simpler
>> solutions: the Intel machine drivers already do get the platform driver name
>> as an platform_data argument, so we could modify the dailinks platform names
>> before even registering the card. I tested with the attached
> Yes, that would be much better - it's vastly more idiomatic.  The
> general idea is that a machine driver should know what it's expecting to
> find before it starts probing.

Thanks for the feedback, will send a formal patch with the helper and 
machine driver changes after I test more with the legacy drivers. Do you 
have a preference for one patch that deals with multiple machines 
drivers in one shot, or individual patches? The latter are nicer for 
backports (e.g. for Chrome), the former nicer for maintainers...

The goal of reusing machine drivers as is isn't really achievable 
anyways, it looks like we are going to have additional changes, e.g. if 
we want to avoid the calls to snd_pcm_suspend as suggested by Takashi, 
we'll have to add a reference to snd_soc_pm_ops that's only used for 
SOF, the Atom/SST driver does things in different ways mostly due to 
historical reasons.

-Pierre
Mark Brown Jan. 24, 2019, 7:26 p.m. UTC | #24
On Thu, Jan 24, 2019 at 01:07:17PM -0600, Pierre-Louis Bossart wrote:

> Thanks for the feedback, will send a formal patch with the helper and
> machine driver changes after I test more with the legacy drivers. Do you
> have a preference for one patch that deals with multiple machines drivers in
> one shot, or individual patches? The latter are nicer for backports (e.g.
> for Chrome), the former nicer for maintainers...

More patches is good, it doesn't make a huge difference if I get one big
patch or a series of repetitive patches - big serieses are more of an
issue if they're all different patches needing individual review.
Curtis Malainey Jan. 25, 2019, 1:32 a.m. UTC | #25
I have a patch to fix the memory leak but I haven't been able to test
it yet because I am remote right now and I accidentally bootlooped the
AMD device I am working on. I will have this tested early next week.
Here is the patch for anyone interested.

Curtis Malainey | Software Engineer | cujomalainey@google.com | 650-898-3849


On Fri, Jan 25, 2019 at 3:26 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Thu, Jan 24, 2019 at 01:07:17PM -0600, Pierre-Louis Bossart wrote:
>
> > Thanks for the feedback, will send a formal patch with the helper and
> > machine driver changes after I test more with the legacy drivers. Do you
> > have a preference for one patch that deals with multiple machines drivers in
> > one shot, or individual patches? The latter are nicer for backports (e.g.
> > for Chrome), the former nicer for maintainers...
>
> More patches is good, it doesn't make a huge difference if I get one big
> patch or a series of repetitive patches - big serieses are more of an
> issue if they're all different patches needing individual review.
diff mbox series

Patch

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 0934b36..df05fb8 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -1131,11 +1131,13 @@  static int soc_init_dai_link(struct snd_soc_card *card,
 	}
 
 	/*
-	 * Defer card registartion if platform dai component is not added to
+	 * Defer card registration if platform dai component is not added to
 	 * component list.
 	 */
-	if (!soc_find_component(link->platform->of_node, link->platform->name))
-		return -EPROBE_DEFER;
+	if (link->platform->of_node || link->platform->name)
+		if (!soc_find_component(link->platform->of_node,
+					link->platform->name))
+			return -EPROBE_DEFER;
 
 	/*
 	 * CPU device may be specified by either name or OF node, but
@@ -1150,11 +1152,12 @@  static int soc_init_dai_link(struct snd_soc_card *card,
 	}
 
 	/*
-	 * Defer card registartion if cpu dai component is not added to
+	 * Defer card registration if cpu dai component is not added to
 	 * component list.
 	 */
-	if (!soc_find_component(link->cpu_of_node, link->cpu_name))
-		return -EPROBE_DEFER;
+	if (link->cpu_of_node || link->cpu_name)
+		if (!soc_find_component(link->cpu_of_node, link->cpu_name))
+			return -EPROBE_DEFER;
 
 	/*
 	 * At least one of CPU DAI name or CPU device name/node must be