diff mbox series

[v9,14/27] sfc: create type2 cxl memdev

Message ID 20241230214445.27602-15-alejandro.lucero-palau@amd.com
State New
Headers show
Series cxl: add type2 device basic support | expand

Commit Message

Lucero Palau, Alejandro Dec. 30, 2024, 9:44 p.m. UTC
From: Alejandro Lucero <alucerop@amd.com>

Use cxl API for creating a cxl memory device using the type2
cxl_dev_state struct.

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/net/ethernet/sfc/efx_cxl.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Dan Williams Jan. 18, 2025, 2:41 a.m. UTC | #1
alejandro.lucero-palau@ wrote:
> From: Alejandro Lucero <alucerop@amd.com>
> 
> Use cxl API for creating a cxl memory device using the type2
> cxl_dev_state struct.
> 
> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/net/ethernet/sfc/efx_cxl.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
> index 911f29b91bd3..f4bf137fd878 100644
> --- a/drivers/net/ethernet/sfc/efx_cxl.c
> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
> @@ -96,10 +96,19 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>  	 */
>  	cxl_set_media_ready(cxl->cxlds);

It is unfortunate the media_ready is just being used as an
offline/online flag for memdevs. I would be open to just switching to
typical device online/offline semantics and drop the "media_ready" flag.

>  
> +	cxl->cxlmd = devm_cxl_add_memdev(&pci_dev->dev, cxl->cxlds);
> +	if (IS_ERR(cxl->cxlmd)) {
> +		pci_err(pci_dev, "CXL accel memdev creation failed");
> +		rc = PTR_ERR(cxl->cxlmd);
> +		goto err_memdev;
> +	}
> +
>  	probe_data->cxl = cxl;
>  
>  	return 0;
>  
> +err_memdev:
> +	cxl_release_resource(cxl->cxlds, CXL_RES_RAM);
>  err_resource_set:
>  	kfree(cxl->cxlds);

In general a function should not mix devm and goto as that is a recipe
for bugs.

The bug I see here is that devm_cxl_add_memdev() runs the teardown flow
*after* efx_pci_probe() returns an error code. That happens in
device_unbind_cleanup(), but when it goes to cleanup endpoint decoders
and anything else that might reference @cxlds it crashes because @cxlds
is long gone.

So if you use devm_cxl_add_memdev() then cxlds must be devm allocated as
well to make sure it gets freed in the proper reverse order.
Alejandro Lucero Palau Jan. 20, 2025, 5:27 p.m. UTC | #2
On 1/18/25 02:41, Dan Williams wrote:
> alejandro.lucero-palau@ wrote:
>> From: Alejandro Lucero <alucerop@amd.com>
>>
>> Use cxl API for creating a cxl memory device using the type2
>> cxl_dev_state struct.
>>
>> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
>> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
>> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> ---
>>   drivers/net/ethernet/sfc/efx_cxl.c | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
>> index 911f29b91bd3..f4bf137fd878 100644
>> --- a/drivers/net/ethernet/sfc/efx_cxl.c
>> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
>> @@ -96,10 +96,19 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
>>   	 */
>>   	cxl_set_media_ready(cxl->cxlds);
> It is unfortunate the media_ready is just being used as an
> offline/online flag for memdevs. I would be open to just switching to
> typical device online/offline semantics and drop the "media_ready" flag.


Not sure I understand the semantics here. Note our device, as the 
related register being optional, has no way for asking the HW about this 
(by CXL means), mainly because it is not necessary. I guess this 
hardware state is there because it can be needed, but again, not sure I 
understand media-ready vs memory online.


>
>>   
>> +	cxl->cxlmd = devm_cxl_add_memdev(&pci_dev->dev, cxl->cxlds);
>> +	if (IS_ERR(cxl->cxlmd)) {
>> +		pci_err(pci_dev, "CXL accel memdev creation failed");
>> +		rc = PTR_ERR(cxl->cxlmd);
>> +		goto err_memdev;
>> +	}
>> +
>>   	probe_data->cxl = cxl;
>>   
>>   	return 0;
>>   
>> +err_memdev:
>> +	cxl_release_resource(cxl->cxlds, CXL_RES_RAM);
>>   err_resource_set:
>>   	kfree(cxl->cxlds);
> In general a function should not mix devm and goto as that is a recipe
> for bugs.
>
> The bug I see here is that devm_cxl_add_memdev() runs the teardown flow
> *after* efx_pci_probe() returns an error code. That happens in
> device_unbind_cleanup(), but when it goes to cleanup endpoint decoders
> and anything else that might reference @cxlds it crashes because @cxlds
> is long gone.
>
> So if you use devm_cxl_add_memdev() then cxlds must be devm allocated as
> well to make sure it gets freed in the proper reverse order.


That is true and I think it is easy to fix, even with your changes for 
patch 1.

I'll show it in v10.

Thanks!
Dan Williams Jan. 21, 2025, 11:22 p.m. UTC | #3
Alejandro Lucero Palau wrote:
> 
> On 1/18/25 02:41, Dan Williams wrote:
> > alejandro.lucero-palau@ wrote:
> >> From: Alejandro Lucero <alucerop@amd.com>
> >>
> >> Use cxl API for creating a cxl memory device using the type2
> >> cxl_dev_state struct.
> >>
> >> Signed-off-by: Alejandro Lucero <alucerop@amd.com>
> >> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
> >> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> >> Acked-by: Edward Cree <ecree.xilinx@gmail.com>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> ---
> >>   drivers/net/ethernet/sfc/efx_cxl.c | 9 +++++++++
> >>   1 file changed, 9 insertions(+)
> >>
> >> diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
> >> index 911f29b91bd3..f4bf137fd878 100644
> >> --- a/drivers/net/ethernet/sfc/efx_cxl.c
> >> +++ b/drivers/net/ethernet/sfc/efx_cxl.c
> >> @@ -96,10 +96,19 @@ int efx_cxl_init(struct efx_probe_data *probe_data)
> >>   	 */
> >>   	cxl_set_media_ready(cxl->cxlds);
> > It is unfortunate the media_ready is just being used as an
> > offline/online flag for memdevs. I would be open to just switching to
> > typical device online/offline semantics and drop the "media_ready" flag.
> 
> 
> Not sure I understand the semantics here. Note our device, as the 
> related register being optional, has no way for asking the HW about this 
> (by CXL means), mainly because it is not necessary. I guess this 
> hardware state is there because it can be needed, but again, not sure I 
> understand media-ready vs memory online.

The ->media_ready flag gates capacity discovery for memory expanders
(no impact to accelerators) *and* it gates whether cxl_mem_probe() even
attempts to talk to the device.

The concern is that it should be removed from 'struct cxl_dev_state' if
it cannot have common meaning across all CXL.mem capable devices. One
way to remove it from 'struct cxl_dev_state' while still preserving the
memory expander use case is to notice that ->media_ready is effectively
identical to the 'struct device' @offline flag.

For now, just have the accelerator driver set ->media_ready and we can
work on cleaning up that direct 'struct cxl_dev_state' touch later.

> >> +	cxl->cxlmd = devm_cxl_add_memdev(&pci_dev->dev, cxl->cxlds);
> >> +	if (IS_ERR(cxl->cxlmd)) {
> >> +		pci_err(pci_dev, "CXL accel memdev creation failed");
> >> +		rc = PTR_ERR(cxl->cxlmd);
> >> +		goto err_memdev;
> >> +	}
> >> +
> >>   	probe_data->cxl = cxl;
> >>   
> >>   	return 0;
> >>   
> >> +err_memdev:
> >> +	cxl_release_resource(cxl->cxlds, CXL_RES_RAM);
> >>   err_resource_set:
> >>   	kfree(cxl->cxlds);
> > In general a function should not mix devm and goto as that is a recipe
> > for bugs.
> >
> > The bug I see here is that devm_cxl_add_memdev() runs the teardown flow
> > *after* efx_pci_probe() returns an error code. That happens in
> > device_unbind_cleanup(), but when it goes to cleanup endpoint decoders
> > and anything else that might reference @cxlds it crashes because @cxlds
> > is long gone.
> >
> > So if you use devm_cxl_add_memdev() then cxlds must be devm allocated as
> > well to make sure it gets freed in the proper reverse order.
> 
> That is true and I think it is easy to fix, even with your changes for 
> patch 1.

Yes, just add a devm_cxl_del_memdev() export for now that can be used
unwind CXL core objects without requiring the driver to switch to devm
more broadly.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/sfc/efx_cxl.c b/drivers/net/ethernet/sfc/efx_cxl.c
index 911f29b91bd3..f4bf137fd878 100644
--- a/drivers/net/ethernet/sfc/efx_cxl.c
+++ b/drivers/net/ethernet/sfc/efx_cxl.c
@@ -96,10 +96,19 @@  int efx_cxl_init(struct efx_probe_data *probe_data)
 	 */
 	cxl_set_media_ready(cxl->cxlds);
 
+	cxl->cxlmd = devm_cxl_add_memdev(&pci_dev->dev, cxl->cxlds);
+	if (IS_ERR(cxl->cxlmd)) {
+		pci_err(pci_dev, "CXL accel memdev creation failed");
+		rc = PTR_ERR(cxl->cxlmd);
+		goto err_memdev;
+	}
+
 	probe_data->cxl = cxl;
 
 	return 0;
 
+err_memdev:
+	cxl_release_resource(cxl->cxlds, CXL_RES_RAM);
 err_resource_set:
 	kfree(cxl->cxlds);
 err_state: