
[net-next,1/4] net: mvneta: Convert to be 64 bits compatible

Message ID 20161122164844.19566-2-gregory.clement@free-electrons.com (mailing list archive)
State New, archived

Commit Message

Gregory CLEMENT Nov. 22, 2016, 4:48 p.m. UTC
From: Marcin Wojtas <mw@semihalf.com>

Prepare the mvneta driver so that it is usable on 64-bit platforms
such as the Armada 3700.

[gregory.clement@free-electrons.com]: this patch was extracted from a larger
one to ease review and maintenance.

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 77 ++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 6 deletions(-)

Comments

Arnd Bergmann Nov. 22, 2016, 9:04 p.m. UTC | #1
On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
> +#ifdef CONFIG_64BIT
> +       void *data_tmp;
> +
> +       /* In Neta HW only 32 bits data is supported, so in order to
> +        * obtain whole 64 bits address from RX descriptor, we store
> +        * the upper 32 bits when allocating buffer, and put it back
> +        * when using buffer cookie for accessing packet in memory.
> +        * Frags should be allocated from single 'memory' region,
> +        * hence common upper address half should be sufficient.
> +        */
> +       data_tmp = mvneta_frag_alloc(pp->frag_size);
> +       if (data_tmp) {
> +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> +               mvneta_frag_free(pp->frag_size, data_tmp);
> +       }
> 

How does this work when the region spans a n*4GB address boundary?

	Arnd
Jisheng Zhang Nov. 23, 2016, 9:53 a.m. UTC | #2
On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:

> On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
> > +#ifdef CONFIG_64BIT
> > +       void *data_tmp;
> > +
> > +       /* In Neta HW only 32 bits data is supported, so in order to
> > +        * obtain whole 64 bits address from RX descriptor, we store
> > +        * the upper 32 bits when allocating buffer, and put it back
> > +        * when using buffer cookie for accessing packet in memory.
> > +        * Frags should be allocated from single 'memory' region,
> > +        * hence common upper address half should be sufficient.
> > +        */
> > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
> > +       if (data_tmp) {
> > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> > +               mvneta_frag_free(pp->frag_size, data_tmp);
> > +       }
> >   
> 
> How does this work when the region spans a n*4GB address boundary?

Indeed. We also use this driver on 64-bit platforms, and we use a
different solution to make the driver 64-bit safe.

solA: make use of a reserved field in mvneta_rx_desc, such as
reserved2. Yes, the field is marked as "for future use, PnC", but it is
not used at all today, so this is one possible solution.

solB: allocate a shadow buf cookie during init, e.g

rxq->descs_bufcookie = kmalloc(rxq->size * sizeof(void*), GFP_KERNEL);

then modify mvneta_rx_desc_fill a bit to save the 64bit pointer in
the shadow buf cookie, e.g
static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
                                u32 phys_addr, u32 cookie,
				struct mvneta_rx_queue *rxq)

{
	int i;

	rx_desc->buf_cookie = cookie;
	rx_desc->buf_phys_addr = phys_addr;
	i = rx_desc - rxq->descs;
	rxq->descs_bufcookie[i] = cookie;
}

then fetch the desc from the shadow buf cookie in all code path, such
as mvneta_rx() etc.

Both solutions should not have the problems pointed out by Arnd.

Thanks,
Jisheng
Arnd Bergmann Nov. 23, 2016, 10:15 a.m. UTC | #3
On Wednesday, November 23, 2016 5:53:41 PM CET Jisheng Zhang wrote:
> On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:
> 
> > On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
> > > +#ifdef CONFIG_64BIT
> > > +       void *data_tmp;
> > > +
> > > +       /* In Neta HW only 32 bits data is supported, so in order to
> > > +        * obtain whole 64 bits address from RX descriptor, we store
> > > +        * the upper 32 bits when allocating buffer, and put it back
> > > +        * when using buffer cookie for accessing packet in memory.
> > > +        * Frags should be allocated from single 'memory' region,
> > > +        * hence common upper address half should be sufficient.
> > > +        */
> > > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
> > > +       if (data_tmp) {
> > > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> > > +               mvneta_frag_free(pp->frag_size, data_tmp);
> > > +       }
> > >   
> > 
> > How does this work when the region spans a n*4GB address boundary?
> 
> indeed. We also make use of this driver on 64bit platforms. We use
> different solution to make the driver 64bit safe.
> 
> solA: make use of the reserved field in the mvneta_rx_desc, such
> as reserved2 etc. Yes, the field is marked as "for future use, PnC", but
> now it's not used at all. This is one possible solution however.

Right, this sounds like the most straightforward choice.

> solB: allocate a shadow buf cookie during init, e.g
> 
> rxq->descs_bufcookie = kmalloc(rxq->size * sizeof(void*), GFP_KERNEL);
> 
> then modify mvneta_rx_desc_fill a bit to save the 64bit pointer in
> the shadow buf cookie, e.g
> static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
>                                 u32 phys_addr, u32 cookie,
> 				struct mvneta_rx_queue *rxq)
> 
> {
> 	int i;
> 
> 	rx_desc->buf_cookie = cookie;
> 	rx_desc->buf_phys_addr = phys_addr;
> 	i = rx_desc - rxq->descs;
> 	rxq->descs_bufcookie[i] = cookie;
> }
> 
> then fetch the desc from the shadow buf cookie in all code path, such
> as mvneta_rx() etc.
> 
> Both solutions should not have the problems pointed out by Arnd.

Wait, since you compute an index 'i' here, can't you just store 'i'
directly in the descriptor instead of the pointer?

	Arnd
Jisheng Zhang Nov. 23, 2016, 11:03 a.m. UTC | #4
Hi Arnd,

On Wed, 23 Nov 2016 11:15:32 +0100 Arnd Bergmann wrote:

> On Wednesday, November 23, 2016 5:53:41 PM CET Jisheng Zhang wrote:
> > On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:
> >   
> > > On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:  
> > > > +#ifdef CONFIG_64BIT
> > > > +       void *data_tmp;
> > > > +
> > > > +       /* In Neta HW only 32 bits data is supported, so in order to
> > > > +        * obtain whole 64 bits address from RX descriptor, we store
> > > > +        * the upper 32 bits when allocating buffer, and put it back
> > > > +        * when using buffer cookie for accessing packet in memory.
> > > > +        * Frags should be allocated from single 'memory' region,
> > > > +        * hence common upper address half should be sufficient.
> > > > +        */
> > > > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
> > > > +       if (data_tmp) {
> > > > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> > > > +               mvneta_frag_free(pp->frag_size, data_tmp);
> > > > +       }
> > > >     
> > > 
> > > How does this work when the region spans a n*4GB address boundary?  
> > 
> > indeed. We also make use of this driver on 64bit platforms. We use
> > different solution to make the driver 64bit safe.
> > 
> > solA: make use of the reserved field in the mvneta_rx_desc, such
> > as reserved2 etc. Yes, the field is marked as "for future use, PnC", but
> > now it's not used at all. This is one possible solution however.  
> 
> Right, this sounds like the most straightforward choice.
> 
> > solB: allocate a shadow buf cookie during init, e.g
> > 
> > rxq->descs_bufcookie = kmalloc(rxq->size * sizeof(void*), GFP_KERNEL);
> > 
> > then modify mvneta_rx_desc_fill a bit to save the 64bit pointer in
> > the shadow buf cookie, e.g
> > static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
> >                                 u32 phys_addr, u32 cookie,

sorry, this line should be:
u32 phys_addr, void *cookie

> > 				struct mvneta_rx_queue *rxq)
> > 
> > {
> > 	int i;
> > 
> > 	rx_desc->buf_cookie = cookie;
> > 	rx_desc->buf_phys_addr = phys_addr;
> > 	i = rx_desc - rxq->descs;
> > 	rxq->descs_bufcookie[i] = cookie;
> > }
> > 
> > then fetch the desc from the shadow buf cookie in all code path, such
> > as mvneta_rx() etc.
> > 
> > Both solutions should not have the problems pointed out by Arnd.  
> 
> Wait, since you compute an index 'i' here, can't you just store 'i'
> directly in the descriptor instead of the pointer?
> 

We need to store the pointer: it is the address of the buffer allocated
by mvneta_frag_alloc().

Thanks,
Jisheng
Gregory CLEMENT Nov. 23, 2016, 1:07 p.m. UTC | #5
Hi Jisheng, Arnd,


Thanks for your feedback.


 On Wed., Nov. 23 2016, Arnd Bergmann <arnd@arndb.de> wrote:

> On Wednesday, November 23, 2016 5:53:41 PM CET Jisheng Zhang wrote:
>> On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:
>> 
>> > On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
>> > > +#ifdef CONFIG_64BIT
>> > > +       void *data_tmp;
>> > > +
>> > > +       /* In Neta HW only 32 bits data is supported, so in order to
>> > > +        * obtain whole 64 bits address from RX descriptor, we store
>> > > +        * the upper 32 bits when allocating buffer, and put it back
>> > > +        * when using buffer cookie for accessing packet in memory.
>> > > +        * Frags should be allocated from single 'memory' region,
>> > > +        * hence common upper address half should be sufficient.
>> > > +        */
>> > > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
>> > > +       if (data_tmp) {
>> > > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
>> > > +               mvneta_frag_free(pp->frag_size, data_tmp);
>> > > +       }
>> > >   
>> > 
>> > How does this work when the region spans a n*4GB address boundary?
>> 
>> indeed. We also make use of this driver on 64bit platforms. We use
>> different solution to make the driver 64bit safe.
>> 
>> solA: make use of the reserved field in the mvneta_rx_desc, such
>> as reserved2 etc. Yes, the field is marked as "for future use, PnC", but
>> now it's not used at all. This is one possible solution however.
>
> Right, this sounds like the most straightforward choice.

The PnC (which stands for Parsing and Classification) is indeed not
used yet, but this field will be needed when we enable it. It is
something we want to do, but it is not planned for the near future.
However, from the datasheets I have, it seems to be present only on the
Armada XP; it is not mentioned in the datasheets for the Armada 38x or
the Armada 3700.

That would mean it is safe to use one of these fields in 64-bit mode on
the Armada 3700.

So I am going to take this approach.

Thanks,

Gregory

>
>> solB: allocate a shadow buf cookie during init, e.g
>> 
>> rxq->descs_bufcookie = kmalloc(rxq->size * sizeof(void*), GFP_KERNEL);
>> 
>> then modify mvneta_rx_desc_fill a bit to save the 64bit pointer in
>> the shadow buf cookie, e.g
>> static void mvneta_rx_desc_fill(struct mvneta_rx_desc *rx_desc,
>>                                 u32 phys_addr, u32 cookie,
>> 				struct mvneta_rx_queue *rxq)
>> 
>> {
>> 	int i;
>> 
>> 	rx_desc->buf_cookie = cookie;
>> 	rx_desc->buf_phys_addr = phys_addr;
>> 	i = rx_desc - rxq->descs;
>> 	rxq->descs_bufcookie[i] = cookie;
>> }
>> 
>> then fetch the desc from the shadow buf cookie in all code path, such
>> as mvneta_rx() etc.
>> 
>> Both solutions should not have the problems pointed out by Arnd.
>
> Wait, since you compute an index 'i' here, can't you just store 'i'
> directly in the descriptor instead of the pointer?
>
> 	Arnd
Marcin Wojtas Nov. 23, 2016, 4:02 p.m. UTC | #6
Hi Gregory,

2016-11-23 14:07 GMT+01:00 Gregory CLEMENT <gregory.clement@free-electrons.com>:
> Hi Jisheng, Arnd,
>
>
> Thanks for your feedback.
>
>
>  On mer., nov. 23 2016, Arnd Bergmann <arnd@arndb.de> wrote:
>
>> On Wednesday, November 23, 2016 5:53:41 PM CET Jisheng Zhang wrote:
>>> On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:
>>>
>>> > On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:
>>> > > +#ifdef CONFIG_64BIT
>>> > > +       void *data_tmp;
>>> > > +
>>> > > +       /* In Neta HW only 32 bits data is supported, so in order to
>>> > > +        * obtain whole 64 bits address from RX descriptor, we store
>>> > > +        * the upper 32 bits when allocating buffer, and put it back
>>> > > +        * when using buffer cookie for accessing packet in memory.
>>> > > +        * Frags should be allocated from single 'memory' region,
>>> > > +        * hence common upper address half should be sufficient.
>>> > > +        */
>>> > > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
>>> > > +       if (data_tmp) {
>>> > > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
>>> > > +               mvneta_frag_free(pp->frag_size, data_tmp);
>>> > > +       }
>>> > >
>>> >
>>> > How does this work when the region spans a n*4GB address boundary?
>>>
>>> indeed. We also make use of this driver on 64bit platforms. We use
>>> different solution to make the driver 64bit safe.
>>>
>>> solA: make use of the reserved field in the mvneta_rx_desc, such
>>> as reserved2 etc. Yes, the field is marked as "for future use, PnC", but
>>> now it's not used at all. This is one possible solution however.
>>
>> Right, this sounds like the most straightforward choice.
>
> The PnC (which stands for Parsing and Classification) is not used yet
> indeed but this field will be needed when we will enable it. It is
> something we want to do but it is not planned in a near future. However
> from the datasheets I have it seems only present on the Armada XP. It is
> not mentioned on datasheets for the Armada 38x or the Armada 3700.
>

It is not mentioned in the A38x spec, but this SoC has exactly the same
PnC as the Armada XP (they differ only in the SRAM details). I wouldn't
be surprised if it were supported on the A3700 as well.

> That would mean it was safe to use on of this field in 64-bits mode on
> the Armada 3700.
>
> So I am going to take this approach.
>

I think for now it's safe, and much easier than handling an extra
software ring for virtual addresses.

Best regards,
Marcin
Jisheng Zhang Nov. 24, 2016, 8:37 a.m. UTC | #7
Hi Marcin, Gregory, Arnd,

On Wed, 23 Nov 2016 17:02:11 +0100 Marcin Wojtas  wrote:

> Hi Gregory,
> 
> 2016-11-23 14:07 GMT+01:00 Gregory CLEMENT:
> > Hi Jisheng, Arnd,
> >
> >
> > Thanks for your feedback.
> >
> >
> >  On mer., nov. 23 2016, Arnd Bergmann wrote:
> >  
> >> On Wednesday, November 23, 2016 5:53:41 PM CET Jisheng Zhang wrote:  
> >>> On Tue, 22 Nov 2016 22:04:12 +0100 Arnd Bergmann wrote:
> >>>  
> >>> > On Tuesday, November 22, 2016 5:48:41 PM CET Gregory CLEMENT wrote:  
> >>> > > +#ifdef CONFIG_64BIT
> >>> > > +       void *data_tmp;
> >>> > > +
> >>> > > +       /* In Neta HW only 32 bits data is supported, so in order to
> >>> > > +        * obtain whole 64 bits address from RX descriptor, we store
> >>> > > +        * the upper 32 bits when allocating buffer, and put it back
> >>> > > +        * when using buffer cookie for accessing packet in memory.
> >>> > > +        * Frags should be allocated from single 'memory' region,
> >>> > > +        * hence common upper address half should be sufficient.
> >>> > > +        */
> >>> > > +       data_tmp = mvneta_frag_alloc(pp->frag_size);
> >>> > > +       if (data_tmp) {
> >>> > > +               pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
> >>> > > +               mvneta_frag_free(pp->frag_size, data_tmp);
> >>> > > +       }
> >>> > >  
> >>> >
> >>> > How does this work when the region spans a n*4GB address boundary?  
> >>>
> >>> indeed. We also make use of this driver on 64bit platforms. We use
> >>> different solution to make the driver 64bit safe.
> >>>
> >>> solA: make use of the reserved field in the mvneta_rx_desc, such
> >>> as reserved2 etc. Yes, the field is marked as "for future use, PnC", but
> >>> now it's not used at all. This is one possible solution however.  
> >>
> >> Right, this sounds like the most straightforward choice.  
> >
> > The PnC (which stands for Parsing and Classification) is not used yet
> > indeed but this field will be needed when we will enable it. It is
> > something we want to do but it is not planned in a near future. However
> > from the datasheets I have it seems only present on the Armada XP. It is
> > not mentioned on datasheets for the Armada 38x or the Armada 3700.
> >  
> 
> It is not mentioned in A38x spec, but this SoC has exactly the same
> PnC as Armada XP (they differ only with used SRAM details). I wouldn't
> be surprised if it was supported on A3700 as well.
> 
> > That would mean it was safe to use on of this field in 64-bits mode on
> > the Armada 3700.
> >
> > So I am going to take this approach.
> >  
> 
> I think for now it's safe and is much easier than handling extra
> software ring for virtual addresses.
> 

solB (a SW shadow cookie) perhaps gives better performance: in hot
paths such as mvneta_rx(), the driver accesses buf_cookie and
buf_phys_addr of the rx_desc, which is allocated by
dma_alloc_coherent() and is therefore non-cacheable if the device isn't
cache-coherent. I didn't measure the performance difference, because in
fact we take solA internally as well. From your experience, does the
performance gain justify the more complex code?

Thanks,
Jisheng
Arnd Bergmann Nov. 24, 2016, 9 a.m. UTC | #8
On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
> solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
> such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
> rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
> device isn't cache-coherent. I didn't measure the performance difference,
> because in fact we take solA as well internally. From your experience,
> can the performance gain deserve the complex code?

Yes, a read from uncached memory is fairly slow, so if you have a chance
to avoid that it will probably help. When adding complexity to the code,
it probably makes sense to take a runtime profile anyway to quantify how
much it gains.

On machines that have cache-coherent DMA, accessing the descriptor
should be fine, as you already have to load the entire cache line
to read the status field.

Looking at this snippet:

                rx_status = rx_desc->status;
                rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
                data = (unsigned char *)rx_desc->buf_cookie;
                phys_addr = rx_desc->buf_phys_addr;
                pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
                bm_pool = &pp->bm_priv->bm_pools[pool_id];

                if (!mvneta_rxq_desc_is_first_last(rx_status) ||
                    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
err_drop_frame_ret_pool:
                        /* Return the buffer to the pool */
                        mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
                                              rx_desc->buf_phys_addr);
err_drop_frame:


I think there is more room for optimizing if you start: you read
the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
and you can cache the buf_phys_addr along with the virtual address
once you add that.

Generally speaking, I'd recommend using READ_ONCE()/WRITE_ONCE()
to access the descriptor fields, to ensure the compiler doesn't
add extra references as well as to annotate the expensive
operations.

	Arnd
Jisheng Zhang Nov. 24, 2016, 9:11 a.m. UTC | #9
On Thu, 24 Nov 2016 10:00:36 +0100
Arnd Bergmann <arnd@arndb.de> wrote:

> On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
> > solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
> > such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
> > rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
> > device isn't cache-coherent. I didn't measure the performance difference,
> > because in fact we take solA as well internally. From your experience,
> > can the performance gain deserve the complex code?  
> 
> Yes, a read from uncached memory is fairly slow, so if you have a chance
> to avoid that it will probably help. When adding complexity to the code,
> it probably makes sense to take a runtime profile anyway quantify how
> much it gains.
> 
> On machines that have cache-coherent DMA, accessing the descriptor
> should be fine, as you already have to load the entire cache line
> to read the status field.
> 
> Looking at this snippet:
> 
>                 rx_status = rx_desc->status;
>                 rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>                 data = (unsigned char *)rx_desc->buf_cookie;
>                 phys_addr = rx_desc->buf_phys_addr;
>                 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>                 bm_pool = &pp->bm_priv->bm_pools[pool_id];
> 
>                 if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>                     (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
> err_drop_frame_ret_pool:
>                         /* Return the buffer to the pool */
>                         mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
>                                               rx_desc->buf_phys_addr);
> err_drop_frame:
> 
> 
> I think there is more room for optimizing if you start: you read
> the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
> and you can cache the buf_phys_addr along with the virtual address
> once you add that.

oh, yeah! buf_phys_addr could be included too.

> 
> Generally speaking, I'd recommend using READ_ONCE()/WRITE_ONCE()
> to access the descriptor fields, to ensure the compiler doesn't
> add extra references as well as to annotate the expensive
> operations.
> 
> 	Arnd

Got it. Thanks so much for the detailed guide.
Gregory CLEMENT Nov. 24, 2016, 3:01 p.m. UTC | #10
Hi Arnd,
 
 On Thu., Nov. 24 2016, Arnd Bergmann <arnd@arndb.de> wrote:

> On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
>> solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
>> such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
>> rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
>> device isn't cache-coherent. I didn't measure the performance difference,
>> because in fact we take solA as well internally. From your experience,
>> can the performance gain deserve the complex code?
>
> Yes, a read from uncached memory is fairly slow, so if you have a chance
> to avoid that it will probably help. When adding complexity to the code,
> it probably makes sense to take a runtime profile anyway quantify how
> much it gains.
>
> On machines that have cache-coherent DMA, accessing the descriptor
> should be fine, as you already have to load the entire cache line
> to read the status field.
>
> Looking at this snippet:
>
>                 rx_status = rx_desc->status;
>                 rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>                 data = (unsigned char *)rx_desc->buf_cookie;
>                 phys_addr = rx_desc->buf_phys_addr;
>                 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>                 bm_pool = &pp->bm_priv->bm_pools[pool_id];
>
>                 if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>                     (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
> err_drop_frame_ret_pool:
>                         /* Return the buffer to the pool */
>                         mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
>                                               rx_desc->buf_phys_addr);
> err_drop_frame:
>
>
> I think there is more room for optimizing if you start: you read
> the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
> and you can cache the buf_phys_addr along with the virtual address
> once you add that.

I agree we can optimize this code, but it is not related to the 64-bit
conversion. Indeed, this part runs when we use the HW buffer
management, and currently it is not ready at all for 64 bits. The
virtual address is handled directly by the hardware, but there are only
32 bits to store it in the cookie. So if we want to use the HWBM in
64-bit mode we need to redesign the code (maybe by storing the virtual
address in an array and passing the index in the cookie).

Gregory

>
> Generally speaking, I'd recommend using READ_ONCE()/WRITE_ONCE()
> to access the descriptor fields, to ensure the compiler doesn't
> add extra references as well as to annotate the expensive
> operations.
>
> 	Arnd
Marcin Wojtas Nov. 24, 2016, 3:09 p.m. UTC | #11
Hi Gregory,

2016-11-24 16:01 GMT+01:00 Gregory CLEMENT <gregory.clement@free-electrons.com>:
> Hi Arnd,
>
>  On jeu., nov. 24 2016, Arnd Bergmann <arnd@arndb.de> wrote:
>
>> On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
>>> solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
>>> such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
>>> rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
>>> device isn't cache-coherent. I didn't measure the performance difference,
>>> because in fact we take solA as well internally. From your experience,
>>> can the performance gain deserve the complex code?
>>
>> Yes, a read from uncached memory is fairly slow, so if you have a chance
>> to avoid that it will probably help. When adding complexity to the code,
>> it probably makes sense to take a runtime profile anyway quantify how
>> much it gains.
>>
>> On machines that have cache-coherent DMA, accessing the descriptor
>> should be fine, as you already have to load the entire cache line
>> to read the status field.
>>
>> Looking at this snippet:
>>
>>                 rx_status = rx_desc->status;
>>                 rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>>                 data = (unsigned char *)rx_desc->buf_cookie;
>>                 phys_addr = rx_desc->buf_phys_addr;
>>                 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>>                 bm_pool = &pp->bm_priv->bm_pools[pool_id];
>>
>>                 if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>>                     (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
>> err_drop_frame_ret_pool:
>>                         /* Return the buffer to the pool */
>>                         mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
>>                                               rx_desc->buf_phys_addr);
>> err_drop_frame:
>>
>>
>> I think there is more room for optimizing if you start: you read
>> the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
>> and you can cache the buf_phys_addr along with the virtual address
>> once you add that.
>
> I agree we can optimize this code but it is not related to the 64 bits
> conversion. Indeed this part is running when we use the HW buffer
> management, however currently this part is not ready at all for 64
> bits. The virtual address is directly handled by the hardware but it has
> only 32 bits to store it in the cookie. So if we want to use the HWBM in
> 64 bits we need to redesign the code, (maybe by storing the virtual
> address in a array and pass the index in the cookie).
>

How about storing the data (virtual address and maybe other metadata)
as part of the data buffer itself and using rx_packet_offset? It has to
be used for the A3700 anyway. No need for additional rings whatsoever.

Best regards,
Marcin
Florian Fainelli Nov. 24, 2016, 7:04 p.m. UTC | #12
On 24/11/2016 07:01, Gregory CLEMENT wrote:
> Hi Arnd,
>  
>  On jeu., nov. 24 2016, Arnd Bergmann <arnd@arndb.de> wrote:
> 
>> On Thursday, November 24, 2016 4:37:36 PM CET Jisheng Zhang wrote:
>>> solB (a SW shadow cookie) perhaps gives a better performance: in hot path,
>>> such as mvneta_rx(), the driver accesses buf_cookie and buf_phys_addr of
>>> rx_desc which is allocated by dma_alloc_coherent, it's noncacheable if the
>>> device isn't cache-coherent. I didn't measure the performance difference,
>>> because in fact we take solA as well internally. From your experience,
>>> can the performance gain deserve the complex code?
>>
>> Yes, a read from uncached memory is fairly slow, so if you have a chance
>> to avoid that it will probably help. When adding complexity to the code,
>> it probably makes sense to take a runtime profile anyway quantify how
>> much it gains.
>>
>> On machines that have cache-coherent DMA, accessing the descriptor
>> should be fine, as you already have to load the entire cache line
>> to read the status field.
>>
>> Looking at this snippet:
>>
>>                 rx_status = rx_desc->status;
>>                 rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
>>                 data = (unsigned char *)rx_desc->buf_cookie;
>>                 phys_addr = rx_desc->buf_phys_addr;
>>                 pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>>                 bm_pool = &pp->bm_priv->bm_pools[pool_id];
>>
>>                 if (!mvneta_rxq_desc_is_first_last(rx_status) ||
>>                     (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
>> err_drop_frame_ret_pool:
>>                         /* Return the buffer to the pool */
>>                         mvneta_bm_pool_put_bp(pp->bm_priv, bm_pool,
>>                                               rx_desc->buf_phys_addr);
>> err_drop_frame:
>>
>>
>> I think there is more room for optimizing if you start: you read
>> the status field twice (the second one in MVNETA_RX_GET_BM_POOL_ID)
>> and you can cache the buf_phys_addr along with the virtual address
>> once you add that.
> 
> I agree we can optimize this code but it is not related to the 64 bits
> conversion. Indeed this part is running when we use the HW buffer
> management, however currently this part is not ready at all for 64
> bits. The virtual address is directly handled by the hardware but it has
> only 32 bits to store it in the cookie. So if we want to use the HWBM in
> 64 bits we need to redesign the code, (maybe by storing the virtual
> address in a array and pass the index in the cookie).

Can't you make sure that skb->data is aligned to a value big enough that
you can still cover the <N> bit physical address space of the adapter
within a 32-bit quantity if you drop the low bits that would be all zeroes?

That way, even though you only have 32-bits of storage/cookie, these
don't have to be the actual 32-bits of your original address, but could
be addr >> 8 for instance?

As you indicate using an index stored in the cookie might be a better
scheme though, since you could attach a lot more metadata to an index in
an local array (which could be in cached memory) as opposed to just an
address.

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 87274d4ab102..67f6465d96ba 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -296,6 +296,12 @@ 
 /* descriptor aligned size */
 #define MVNETA_DESC_ALIGNED_SIZE	32
 
+/* Number of bytes to be taken into account by HW when putting incoming data
+ * to the buffers. It is needed in case NET_SKB_PAD exceeds maximum packet
+ * offset supported in MVNETA_RXQ_CONFIG_REG(q) registers.
+ */
+#define MVNETA_RX_PKT_OFFSET_CORRECTION		64
+
 #define MVNETA_RX_PKT_SIZE(mtu) \
 	ALIGN((mtu) + MVNETA_MH_SIZE + MVNETA_VLAN_TAG_LEN + \
 	      ETH_HLEN + ETH_FCS_LEN,			     \
@@ -416,8 +422,11 @@  struct mvneta_port {
 	u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
 
 	u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
+#ifdef CONFIG_64BIT
+	u64 data_high;
+#endif
+	u16 rx_offset_correction;
 };
-
 /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
  * layout of the transmit and reception DMA descriptors, and their
  * layout is therefore defined by the hardware design
@@ -1791,6 +1800,10 @@  static int mvneta_rx_refill(struct mvneta_port *pp,
 	if (!data)
 		return -ENOMEM;
 
+#ifdef CONFIG_64BIT
+	if (unlikely(pp->data_high != (u64)upper_32_bits((u64)data) << 32))
+		return -ENOMEM;
+#endif
 	phys_addr = dma_map_single(pp->dev->dev.parent, data,
 				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
 				   DMA_FROM_DEVICE);
@@ -1799,7 +1812,8 @@  static int mvneta_rx_refill(struct mvneta_port *pp,
 		return -ENOMEM;
 	}
 
-	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
+	phys_addr += pp->rx_offset_correction;
+	mvneta_rx_desc_fill(rx_desc, phys_addr, (uintptr_t)data);
 	return 0;
 }
 
@@ -1861,8 +1875,16 @@  static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 
 	for (i = 0; i < rxq->size; i++) {
 		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
-		void *data = (void *)rx_desc->buf_cookie;
-
+		void *data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
+#ifdef CONFIG_64BIT
+		/* In Neta HW only 32 bits data is supported, so in
+		 * order to obtain whole 64 bits address from RX
+		 * descriptor, we store the upper 32 bits when
+		 * allocating buffer, and put it back when using
+		 * buffer cookie for accessing packet in memory.
+		 */
+		data = (u8 *)(pp->data_high | (u64)data);
+#endif
 		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
 		mvneta_frag_free(pp->frag_size, data);
@@ -1899,7 +1921,17 @@  static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
+#ifdef CONFIG_64BIT
+		/* In Neta HW only 32 bits data is supported, so in
+		 * order to obtain whole 64 bits address from RX
+		 * descriptor, we store the upper 32 bits when
+		 * allocating buffer, and put it back when using
+		 * buffer cookie for accessing packet in memory.
+		 */
+		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
+#else
 		data = (unsigned char *)rx_desc->buf_cookie;
+#endif
 		phys_addr = rx_desc->buf_phys_addr;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
@@ -2020,7 +2052,17 @@  static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_status = rx_desc->status;
 		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
-		data = (unsigned char *)rx_desc->buf_cookie;
+#ifdef CONFIG_64BIT
+		/* In Neta HW only 32 bits data is supported, so in
+		 * order to obtain whole 64 bits address from RX
+		 * descriptor, we store the upper 32 bits when
+		 * allocating buffer, and put it back when using
+		 * buffer cookie for accessing packet in memory.
+		 */
+		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
+#else
+		data = (u8 *)rx_desc->buf_cookie;
+#endif
 		phys_addr = rx_desc->buf_phys_addr;
 		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
 		bm_pool = &pp->bm_priv->bm_pools[pool_id];
@@ -2773,7 +2815,7 @@  static int mvneta_rxq_init(struct mvneta_port *pp,
 	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
 
 	/* Set Offset */
-	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD);
+	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
 
 	/* Set coalescing pkts and time */
 	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
@@ -2930,6 +2972,22 @@  static void mvneta_cleanup_rxqs(struct mvneta_port *pp)
 static int mvneta_setup_rxqs(struct mvneta_port *pp)
 {
 	int queue;
+#ifdef CONFIG_64BIT
+	void *data_tmp;
+
+	/* In Neta HW only 32 bits data is supported, so in order to
+	 * obtain whole 64 bits address from RX descriptor, we store
+	 * the upper 32 bits when allocating buffer, and put it back
+	 * when using buffer cookie for accessing packet in memory.
+	 * Frags should be allocated from single 'memory' region,
+	 * hence common upper address half should be sufficient.
+	 */
+	data_tmp = mvneta_frag_alloc(pp->frag_size);
+	if (data_tmp) {
+		pp->data_high = (u64)upper_32_bits((u64)data_tmp) << 32;
+		mvneta_frag_free(pp->frag_size, data_tmp);
+	}
+#endif
 
 	for (queue = 0; queue < rxq_number; queue++) {
 		int err = mvneta_rxq_init(pp, &pp->rxqs[queue]);
@@ -4019,6 +4077,13 @@  static int mvneta_probe(struct platform_device *pdev)
 
 	pp->rxq_def = rxq_def;
 
+	/* Set RX packet offset correction for platforms, whose
+	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
+	 * platforms and 0B for 32-bit ones.
+	 */
+	pp->rx_offset_correction =
+		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
+
 	pp->indir[0] = rxq_def;
 
 	pp->clk = devm_clk_get(&pdev->dev, "core");