diff mbox

[V5,2/6,net-next] netdevice.h: Add zero-copy flag in netdevice

Message ID 1305574128.3456.23.camel@localhost.localdomain (mailing list archive)
State New, archived
Headers show

Commit Message

Shirley Ma May 16, 2011, 7:28 p.m. UTC
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/netdevice.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ben Hutchings May 16, 2011, 7:35 p.m. UTC | #1
On Mon, 2011-05-16 at 12:28 -0700, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
> 
>  include/linux/netdevice.h |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index a134d80..2646251 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1066,6 +1066,16 @@ struct net_device {
>  #define NETIF_F_NOCACHE_COPY	(1 << 30) /* Use no-cache copyfromuser */
>  #define NETIF_F_LOOPBACK	(1 << 31) /* Enable loopback */
>  
> +/*
> + * Bit 31 is for device to map userspace buffers -- zerocopy
> + * Device can set this flag when it supports HIGHDMA.
> + * Device can't recycle this kind of skb buffers.
> + * There are 256 bytes copied, the rest of buffers are mapped.
> + * The userspace callback should only be called when last reference to this skb
> + * is gone.
> + */
> +#define NETIF_F_ZEROCOPY	(1 << 31)

Sorry, bit 31 is taken.  You get the job of turning features into a
wider bitmap.

Ben.
Shirley Ma May 16, 2011, 7:38 p.m. UTC | #2
On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote:
> Sorry, bit 31 is taken.  You get the job of turning features into a
> wider bitmap.

:) will do it.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Hutchings May 16, 2011, 7:47 p.m. UTC | #3
On Mon, 2011-05-16 at 12:38 -0700, Shirley Ma wrote:
> On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote:
> > Sorry, bit 31 is taken.  You get the job of turning features into a
> > wider bitmap.
> 
> :) will do it.

Bear in mind that feature masks are manipulated in many different
places.  This is not a simple task.

See previous discussion at:
http://thread.gmane.org/gmane.linux.network/193284
and especially:
http://thread.gmane.org/gmane.linux.network/193284/focus=193332

Ben.
Michael S. Tsirkin May 16, 2011, 9:14 p.m. UTC | #4
On Mon, May 16, 2011 at 08:47:33PM +0100, Ben Hutchings wrote:
> On Mon, 2011-05-16 at 12:38 -0700, Shirley Ma wrote:
> > On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote:
> > > Sorry, bit 31 is taken.  You get the job of turning features into a
> > > wider bitmap.
> > 
> > :) will do it.
> 
> Bear in mind that feature masks are manipulated in many different
> places.  This is not a simple task.
> 
> See previous discussion at:
> http://thread.gmane.org/gmane.linux.network/193284
> and especially:
> http://thread.gmane.org/gmane.linux.network/193284/focus=193332
> 
> Ben.


IIUC, what is suggested above is something like:

typedef struct net_features {
} net_features_t;

and then

   void netdev_set_feature(net_features_t *net_features, int feature);
   void netdev_clear_feature(net_features_t *net_features, int feature);
   bool netdev_test_feature(net_features_t *net_features, int feature);


I think this might be the easiest way as compiler will catch any direct uses.
It can then be split up nicely.

It looks a bit different from what Dave suggested but I think it's
close enough?

we could also have wrappers that set/clear/test many features to replace
uses of A|B|C that are pretty common.

   static inline void netdev_set_features(net_features_t *net_features, int nfeatures, int *features)
   {
	int i;
	for (i = 0; i < nfeatures; ++i)
		netdev_set_feature(net_features, features[i]);
   }
   void netdev_clear_features(net_features_t *net_features, int nfeatures, int *features)
   {
	int i;
	for (i = 0; i < nfeatures; ++i)
		netdev_clear_feature(net_features, features[i]);
   }
   bool netdev_test_features(net_features_t *net_features, int nfeatures, int *features)
   {
	int i;
	for (i = 0; i < nfeatures; ++i)
		if (netdev_test_feature(net_features, features[i]))
			return true;
	return false;
   }

and possibly macros that get arrays of constants:

#define NETDEV_SET_FEATURES(net_features, feature_array) do { \
	int __NETDEV_SET_FEATURES_F[] = feature_array;
	netdev_set_feature((net_features), \
		ARRAY_SIZE(__NETDEV_SET_FEATURES_F), __NETDEV_SET_FEATURES_F);
} while (0)

etc.

> -- 
> Ben Hutchings, Senior Software Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 16, 2011, 11:32 p.m. UTC | #5
Hello Michael,

Looks like to use a new flag requires more time/work. I am thinking
whether we can just use HIGHDMA flag to enable zero-copy in macvtap to
avoid the new flag for now since mavctap uses real NICs as lower device?

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 17, 2011, 6:21 a.m. UTC | #6
On Mon, May 16, 2011 at 04:32:18PM -0700, Shirley Ma wrote:
> Hello Michael,
> 
> Looks like to use a new flag requires more time/work.
> I am thinking
> whether we can just use HIGHDMA flag to enable zero-copy in macvtap to
> avoid the new flag for now since mavctap uses real NICs as lower device?
> 
> Thanks
> Shirley

Problem is, in your patch there are a set of restrictions on what the
device can do with the skb that we need to enforce somehow.
Also, how do we know it's a 'real NIC' and not a software device?
Shirley Ma May 17, 2011, 8:53 p.m. UTC | #7
On Tue, 2011-05-17 at 09:21 +0300, Michael S. Tsirkin wrote:
> Problem is, in your patch there are a set of restrictions on what the
> device can do with the skb that we need to enforce somehow.
> Also, how do we know it's a 'real NIC' and not a software device?

I checked macvtap newlink, it doesn't seem to block software device.

So we need to work on the wider feature bit first.

Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Mirosław May 17, 2011, 9:48 p.m. UTC | #8
2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> Hello Michael,
>
> Looks like to use a new flag requires more time/work. I am thinking
> whether we can just use HIGHDMA flag to enable zero-copy in macvtap to
> avoid the new flag for now since mavctap uses real NICs as lower device?

Is there any other restriction besides requiring driver to not recycle
the skb? Are there any drivers that recycle TX skbs?

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 17, 2011, 10:28 p.m. UTC | #9
On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> > Hello Michael,
> >
> > Looks like to use a new flag requires more time/work. I am thinking
> > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> to
> > avoid the new flag for now since mavctap uses real NICs as lower
> device?
> 
> Is there any other restriction besides requiring driver to not recycle
> the skb? Are there any drivers that recycle TX skbs? 

Not more other restrictions, skb clone is OK. pskb_expand_head() looks
OK to me from code review.

Currently there is no drivers recycle TX skbs.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Mirosław May 17, 2011, 10:58 p.m. UTC | #10
W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com> napisa?:
> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> > Looks like to use a new flag requires more time/work. I am thinking
>> > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> to
>> > avoid the new flag for now since mavctap uses real NICs as lower
>> device?
>>
>> Is there any other restriction besides requiring driver to not recycle
>> the skb? Are there any drivers that recycle TX skbs?
> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> OK to me from code review.

> Currently there is no drivers recycle TX skbs.

So why do you require the target device to have some flags at all?

Do I understand correctly, that this zero-copy feature is about
packets received from VMs?

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 17, 2011, 11:44 p.m. UTC | #11
On Wed, 2011-05-18 at 00:58 +0200, Micha? Miros?aw wrote:
> W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com>
> napisa?:
> > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> > Looks like to use a new flag requires more time/work. I am
> thinking
> >> > whether we can just use HIGHDMA flag to enable zero-copy in
> macvtap
> >> to
> >> > avoid the new flag for now since mavctap uses real NICs as lower
> >> device?
> >>
> >> Is there any other restriction besides requiring driver to not
> recycle
> >> the skb? Are there any drivers that recycle TX skbs?
> > Not more other restrictions, skb clone is OK. pskb_expand_head()
> looks
> > OK to me from code review.
> 
> > Currently there is no drivers recycle TX skbs.
> 
> So why do you require the target device to have some flags at all?

We could use macvtap to check lower device HIGHDMA to enable zero-copy,
but I am not sure whether it is sufficient. If it's sufficient then we
don't need to use a new flag here. To be safe, it's better to use a new
flag to enable each device who can pass zero-copy test.

> Do I understand correctly, that this zero-copy feature is about
> packets received from VMs?

Yes, packets sent from VMs, and received in local host for TX zero-copy
here.

Thanks
Shirley



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Mirosław May 18, 2011, 9:06 a.m. UTC | #12
W dniu 18 maja 2011 01:44 u?ytkownik Shirley Ma <mashirle@us.ibm.com> napisa?:
> On Wed, 2011-05-18 at 00:58 +0200, Micha? Miros?aw wrote:
>> W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com>
>> napisa?:
>> > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > Looks like to use a new flag requires more time/work. I am
>> thinking
>> >> > whether we can just use HIGHDMA flag to enable zero-copy in
>> macvtap
>> >> to
>> >> > avoid the new flag for now since mavctap uses real NICs as lower
>> >> device?
>> >>
>> >> Is there any other restriction besides requiring driver to not
>> recycle
>> >> the skb? Are there any drivers that recycle TX skbs?
>> > Not more other restrictions, skb clone is OK. pskb_expand_head()
>> looks
>> > OK to me from code review.
>>
>> > Currently there is no drivers recycle TX skbs.
>>
>> So why do you require the target device to have some flags at all?
> We could use macvtap to check lower device HIGHDMA to enable zero-copy,
> but I am not sure whether it is sufficient. If it's sufficient then we
> don't need to use a new flag here. To be safe, it's better to use a new
> flag to enable each device who can pass zero-copy test.

>> Do I understand correctly, that this zero-copy feature is about
>> packets received from VMs?
> Yes, packets sent from VMs, and received in local host for TX zero-copy
> here.

What is the zero-copy test? On some arches the HIGHDMA is not needed
at all so might be not enabled on anything. It looks like the correct
test would be per-packet check of !illegal_highdma() or maybe
NETIF_F_SG as returned from harmonize_features(). For virtual devices
or other software forwarding this might lead to skb_linearize() in
some cases, but is it that bad?

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 10:38 a.m. UTC | #13
On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> > > Hello Michael,
> > >
> > > Looks like to use a new flag requires more time/work. I am thinking
> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> > to
> > > avoid the new flag for now since mavctap uses real NICs as lower
> > device?
> > 
> > Is there any other restriction besides requiring driver to not recycle
> > the skb? Are there any drivers that recycle TX skbs? 

Not just recycling skbs, keeping reference to any of the pages in the
skb. Another requirement is to invoke the callback
in a timely fashion.  For example virtio-net doesn't limit the time until
that happens (skbs are only freed when some other packet is
transmitted), so we need to avoid zcopy for such (nested-virt)
scenarious, right?
> 
> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> OK to me from code review.

Hmm. pskb_expand_head calls skb_release_data while keeping
references to pages. How is that ok? What do I miss?

> Currently there is no drivers recycle TX skbs.
> 
> Thanks
> Shirley

Well, with e.g. bridge or veth the skb might enter
the host networking stack. Once that happens, we lose
track of the pages. Or is there anything that
prevents such setups?
Michał Mirosław May 18, 2011, 11:10 a.m. UTC | #14
2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> > > Hello Michael,
>> > >
>> > > Looks like to use a new flag requires more time/work. I am thinking
>> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> > to
>> > > avoid the new flag for now since mavctap uses real NICs as lower
>> > device?
>> >
>> > Is there any other restriction besides requiring driver to not recycle
>> > the skb? Are there any drivers that recycle TX skbs?
>
> Not just recycling skbs, keeping reference to any of the pages in the
> skb. Another requirement is to invoke the callback
> in a timely fashion.  For example virtio-net doesn't limit the time until
> that happens (skbs are only freed when some other packet is
> transmitted), so we need to avoid zcopy for such (nested-virt)
> scenarious, right?

Hmm. But every hardware driver supporting SG will keep reference to
the pages until the packet is sent (or DMA'd to the device). This can
take a long time if hardware queue happens to stall for some reason.

Is it that you mean keeping a reference after all skbs pointing to the
pages are released?

>> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
>> OK to me from code review.
> Hmm. pskb_expand_head calls skb_release_data while keeping
> references to pages. How is that ok? What do I miss?

It's making copy of the skb_shinfo earlier, so the pages refcount
stays the same.

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 11:17 a.m. UTC | #15
On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> > > Hello Michael,
> >> > >
> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> > to
> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> > device?
> >> >
> >> > Is there any other restriction besides requiring driver to not recycle
> >> > the skb? Are there any drivers that recycle TX skbs?
> >
> > Not just recycling skbs, keeping reference to any of the pages in the
> > skb. Another requirement is to invoke the callback
> > in a timely fashion.  For example virtio-net doesn't limit the time until
> > that happens (skbs are only freed when some other packet is
> > transmitted), so we need to avoid zcopy for such (nested-virt)
> > scenarious, right?
> 
> Hmm. But every hardware driver supporting SG will keep reference to
> the pages until the packet is sent (or DMA'd to the device). This can
> take a long time if hardware queue happens to stall for some reason.

That's a fundamental property of zero copy transmit.
You can't let the application/guest reuse the memory until
no one looks at it anymore.

> Is it that you mean keeping a reference after all skbs pointing to the
> pages are released?

No one should reference the pages after the callback is invoked, yes.

> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> OK to me from code review.
> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > references to pages. How is that ok? What do I miss?
> 
> It's making copy of the skb_shinfo earlier, so the pages refcount
> stays the same.
> 
> Best Regards,
> Micha? Miros?aw

Exactly. But the callback is invoked so the guest thinks it's ok to
change this memory. If it does a corrupted packet will be sent out.
Michał Mirosław May 18, 2011, 11:40 a.m. UTC | #16
W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
<mst@redhat.com> napisa?:
> On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
>> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > > Hello Michael,
>> >> > >
>> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> > to
>> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> > device?
>> >> >
>> >> > Is there any other restriction besides requiring driver to not recycle
>> >> > the skb? Are there any drivers that recycle TX skbs?
>> >
>> > Not just recycling skbs, keeping reference to any of the pages in the
>> > skb. Another requirement is to invoke the callback
>> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> > that happens (skbs are only freed when some other packet is
>> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> > scenarious, right?
>>
>> Hmm. But every hardware driver supporting SG will keep reference to
>> the pages until the packet is sent (or DMA'd to the device). This can
>> take a long time if hardware queue happens to stall for some reason.
>
> That's a fundamental property of zero copy transmit.
> You can't let the application/guest reuse the memory until
> no one looks at it anymore.
>
>> Is it that you mean keeping a reference after all skbs pointing to the
>> pages are released?
> No one should reference the pages after the callback is invoked, yes.

>> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
>> >> OK to me from code review.
>> > Hmm. pskb_expand_head calls skb_release_data while keeping
>> > references to pages. How is that ok? What do I miss?
>> It's making copy of the skb_shinfo earlier, so the pages refcount
>> stays the same.
> Exactly. But the callback is invoked so the guest thinks it's ok to
> change this memory. If it does a corrupted packet will be sent out.

Hmm. I tool a quick look at skb_clone(), and it looks like this
sequence will break this scheme:

skb2 = skb_clone(skb...);
kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
[use skb2, pages still referenced]
kfree_skb(skb); /* callback called again */

This sequence is common in bridge, might be in other places.

Maybe this ubuf thing should just track clones? This will make it work
on all devices then.

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Mirosław May 18, 2011, 11:47 a.m. UTC | #17
W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
<mst@redhat.com> napisa?:
> On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
>> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > > Hello Michael,
>> >> > >
>> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> > to
>> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> > device?
>> >> >
>> >> > Is there any other restriction besides requiring driver to not recycle
>> >> > the skb? Are there any drivers that recycle TX skbs?
>> > Not just recycling skbs, keeping reference to any of the pages in the
>> > skb. Another requirement is to invoke the callback
>> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> > that happens (skbs are only freed when some other packet is
>> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> > scenarious, right?
>> Hmm. But every hardware driver supporting SG will keep reference to
>> the pages until the packet is sent (or DMA'd to the device). This can
>> take a long time if hardware queue happens to stall for some reason.
> That's a fundamental property of zero copy transmit.
> You can't let the application/guest reuse the memory until
> no one looks at it anymore.

One more question: is userspace (or whatever is sending those packets)
denied from modifying passed pages? I assume it is, but just want to
be sure.

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 11:47 a.m. UTC | #18
On Wed, May 18, 2011 at 01:40:29PM +0200, Micha? Miros?aw wrote:
> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
> <mst@redhat.com> napisa?:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >>
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> >
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> >
> >> Is it that you mean keeping a reference after all skbs pointing to the
> >> pages are released?
> > No one should reference the pages after the callback is invoked, yes.
> 
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> 
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.
> 
> Best Regards,
> Micha? Miros?aw

Long term that's a good plan, but it's a lot of work.
pages can also get into weird places like VFS
or devices might hang on to them for a long time.

So I think as a first step, using a flag to white-list
simple devices that don't do any tricks like the
above makes sense.

Just be sure to list all of the restrictions in
the comment where the flag is described.

And hey, we get features extended to 64 bit as a bonus :)
Michael S. Tsirkin May 18, 2011, 11:56 a.m. UTC | #19
On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote:
> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
> <mst@redhat.com> napisa?:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> 
> One more question: is userspace (or whatever is sending those packets)
> denied from modifying passed pages? I assume it is, but just want to
> be sure.
> 
> Best Regards,
> Micha? Miros?aw

Good point.

It's not denied in the sense that it still can modify them if it's
buggy (the pages might not be read-only).
But well-behaved userspace won't modify them until the callback
is invoked.

That would be a problem if the underlying device is
a bridge where we might try to e.g. filter these packets -
data can get modified after the filter. We'd have to copy
whatever the filter accesses and use the copy - it's rarely
the data itself.

That's not normally a problem for macvtap connected to a physical NIC,
as that already bypasses any and all filtering.

But that's another limitation we should note in the comment,
and another reason to limit to specific devices.
Michał Mirosław May 18, 2011, 12:48 p.m. UTC | #20
W dniu 18 maja 2011 13:56 u?ytkownik Michael S. Tsirkin
<mst@redhat.com> napisa?:
> On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote:
>> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
>> <mst@redhat.com> napisa?:
>> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
>> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
>> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> >> > > Hello Michael,
>> >> >> > >
>> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> >> > to
>> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> >> > device?
>> >> >> >
>> >> >> > Is there any other restriction besides requiring driver to not recycle
>> >> >> > the skb? Are there any drivers that recycle TX skbs?
>> >> > Not just recycling skbs, keeping reference to any of the pages in the
>> >> > skb. Another requirement is to invoke the callback
>> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> >> > that happens (skbs are only freed when some other packet is
>> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> >> > scenarious, right?
>> >> Hmm. But every hardware driver supporting SG will keep reference to
>> >> the pages until the packet is sent (or DMA'd to the device). This can
>> >> take a long time if hardware queue happens to stall for some reason.
>> > That's a fundamental property of zero copy transmit.
>> > You can't let the application/guest reuse the memory until
>> > no one looks at it anymore.
>>
>> One more question: is userspace (or whatever is sending those packets)
>> denied from modifying passed pages? I assume it is, but just want to
>> be sure.
>>
>> Best Regards,
>> Micha? Miros?aw
>
> Good point.
>
> It's not denied in the sense that it still can modify them if it's
> buggy (the pages might not be read-only).
> But well-behaved userspace won't modify them until the callback
> is invoked.
>
> That would be a problem if the underlying device is
> a bridge where we might try to e.g. filter these packets -
> data can get modified after the filter. We'd have to copy
> whatever the filter accesses and use the copy - it's rarely
> the data itself.
>
> That's not normally a problem for macvtap connected to a physical NIC,
> as that already bypasses any and all filtering.
>
> But that's another limitation we should note in the comment,
> and another reason to limit to specific devices.

It looks like this feature can be used only in very strict circumstances.

What about tcpdump listening on the device or lowerdev? This path
might clone the skb for any device.

Best Regards,
Micha? Miros?aw
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 1:19 p.m. UTC | #21
On Wed, May 18, 2011 at 02:48:24PM +0200, Micha? Miros?aw wrote:
> W dniu 18 maja 2011 13:56 u?ytkownik Michael S. Tsirkin
> <mst@redhat.com> napisa?:
> > On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote:
> >> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
> >> <mst@redhat.com> napisa?:
> >> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
> >> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> >> > > Hello Michael,
> >> >> >> > >
> >> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> >> > to
> >> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> >> > device?
> >> >> >> >
> >> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> >> > skb. Another requirement is to invoke the callback
> >> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> >> > that happens (skbs are only freed when some other packet is
> >> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> >> > scenarious, right?
> >> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> >> take a long time if hardware queue happens to stall for some reason.
> >> > That's a fundamental property of zero copy transmit.
> >> > You can't let the application/guest reuse the memory until
> >> > no one looks at it anymore.
> >>
> >> One more question: is userspace (or whatever is sending those packets)
> >> denied from modifying passed pages? I assume it is, but just want to
> >> be sure.
> >>
> >> Best Regards,
> >> Micha? Miros?aw
> >
> > Good point.
> >
> > It's not denied in the sense that it still can modify them if it's
> > buggy (the pages might not be read-only).
> > But well-behaved userspace won't modify them until the callback
> > is invoked.
> >
> > That would be a problem if the underlying device is
> > a bridge where we might try to e.g. filter these packets -
> > data can get modified after the filter. We'd have to copy
> > whatever the filter accesses and use the copy - it's rarely
> > the data itself.
> >
> > That's not normally a problem for macvtap connected to a physical NIC,
> > as that already bypasses any and all filtering.
> >
> > But that's another limitation we should note in the comment,
> > and another reason to limit to specific devices.
> 
> It looks like this feature can be used only in very strict circumstances.

True. I think it's reasonable to try and start with something
restricted and then add features though - past attempts to solve the problem
generally right away did not bear fruit.

> What about tcpdump listening on the device or lowerdev? This path
> might clone the skb for any device.
> 
> Best Regards,
> Micha? Miros?aw

Thanks for bringing this up: taps do need to be fixed as they can hang
on to a page for unlimited time. Further, as a malicious guest can
change the packet at any time, data that taps get wouldn't be correct.
We can either linearize the problematic skbs or disable
zero copy if there are any taps for the given device.
Shirley Ma May 18, 2011, 2:38 p.m. UTC | #22
On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head()
> looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> 
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.

The callback was only invoked when last reference of skb was gone.
skb_clone does increase skb refcnt. I tested tcpdump on lower device, it
worked.

For the sequence of:

skb_clone  -> last refcnt + 1
kfree_skb() or pskb_expand_head -> callback not called
kfree_skb() -> callback called

I will check page refcount to see whether it's balanced.

Thanks
shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 3:47 p.m. UTC | #23
On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head()
> > looks
> > >> >> OK to me from code review.
> > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > >> > references to pages. How is that ok? What do I miss?
> > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > >> stays the same.
> > > Exactly. But the callback is invoked so the guest thinks it's ok to
> > > change this memory. If it does a corrupted packet will be sent out.
> > 
> > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > sequence will break this scheme:
> > 
> > skb2 = skb_clone(skb...);
> > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > [use skb2, pages still referenced]
> > kfree_skb(skb); /* callback called again */
> > 
> > This sequence is common in bridge, might be in other places.
> > 
> > Maybe this ubuf thing should just track clones? This will make it work
> > on all devices then.
> 
> The callback was only invoked when last reference of skb was gone.
> skb_clone does increase skb refcnt. I tested tcpdump on lower device, it
> worked.

Right, it will normally work, but two issues I think you miss:
1. malicious guest can change the memory between when it is sent out by
   device and consumed by tcpdump, so you will see different things
   (not sure how important this is).
2. if tcpdump stops consuming stuff from the packet socket (it's
   userspace, can't be trusted) then we won't get a callback for
   page potentially forever, guest networking will get blocked etc.

> For the sequence of:
> 
> skb_clone  -> last refcnt + 1
> kfree_skb() or pskb_expand_head -> callback not called
> kfree_skb() -> callback called
> 
> I will check page refcount to see whether it's balanced.
> 
> Thanks
> shirley


pskb_expand_head is a problem anyway I think as it
can hang on to pages after it calls release_data.
Then guest will modify these pages and you get trash there.
Shirley Ma May 18, 2011, 4:02 p.m. UTC | #24
On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > >> >> Not more other restrictions, skb clone is OK.
> pskb_expand_head()
> > looks
> > >> >> OK to me from code review.
> > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > >> > references to pages. How is that ok? What do I miss?
> > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > >> stays the same.
> > > Exactly. But the callback is invoked so the guest thinks it's ok
> to
> > > change this memory. If it does a corrupted packet will be sent
> out.
> > 
> > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > sequence will break this scheme:
> > 
> > skb2 = skb_clone(skb...);
> > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > [use skb2, pages still referenced]
> > kfree_skb(skb); /* callback called again */
> > 
> > This sequence is common in bridge, might be in other places.
> > 
> > Maybe this ubuf thing should just track clones? This will make it
> work
> > on all devices then.
> 
> The callback was only invoked when last reference of skb was gone.
> skb_clone does increase skb refcnt. I tested tcpdump on lower device,
> it
> worked.
> 
> For the sequence of:
> 
> skb_clone  -> last refcnt + 1
> kfree_skb() or pskb_expand_head -> callback not called
> kfree_skb() -> callback called
> 
> I will check page refcount to see whether it's balanced. 

The page refcounts are balanced too. 

In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in
pskb_expand_head, so I didn't hit any issue.

But rethinking about pskb_expand_head(), it calls skb_release_data() to
free old skb head when it's not in the fastpath (pskb_expand_head is not
the last reference of this skb); And it's impossible to track which skb
head (old one or new one) will be the last one to free. So better to
return error for zero-copy skbs when not using fastpath. Does it make
sense? 

Besides this, any other issue?


Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 18, 2011, 4:07 p.m. UTC | #25
On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > > >> >> Not more other restrictions, skb clone is OK.
> pskb_expand_head()
> > > looks
> > > >> >> OK to me from code review.
> > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > >> > references to pages. How is that ok? What do I miss?
> > > >> It's making copy of the skb_shinfo earlier, so the pages
> refcount
> > > >> stays the same.
> > > > Exactly. But the callback is invoked so the guest thinks it's ok
> to
> > > > change this memory. If it does a corrupted packet will be sent
> out.
> > > 
> > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > sequence will break this scheme:
> > > 
> > > skb2 = skb_clone(skb...);
> > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > [use skb2, pages still referenced]
> > > kfree_skb(skb); /* callback called again */
> > > 
> > > This sequence is common in bridge, might be in other places.
> > > 
> > > Maybe this ubuf thing should just track clones? This will make it
> work
> > > on all devices then.
> > 
> > The callback was only invoked when last reference of skb was gone.
> > skb_clone does increase skb refcnt. I tested tcpdump on lower
> device, it
> > worked.
> 
> Right, it will normally work, but two issues I think you miss:
> 1. malicious guest can change the memory between when it is sent out
> by
>    device and consumed by tcpdump, so you will see different things
>    (not sure how important this is).
> 2. if tcpdump stops consuming stuff from the packet socket (it's
>    userspace, can't be trusted) then we won't get a callback for
>    page potentially forever, guest networking will get blocked etc.
> > For the sequence of:
> > 
> > skb_clone  -> last refcnt + 1
> > kfree_skb() or pskb_expand_head -> callback not called
> > kfree_skb() -> callback called
> > 
> > I will check page refcount to see whether it's balanced.
> > 
> > Thanks
> > shirley
> 
> 
> pskb_expand_head is a problem anyway I think as it
> can hang on to pages after it calls release_data.
> Then guest will modify these pages and you get trash there. 

This can be avoid by allowing pskb_expand_head in fastpath only, I
think. But not sure whether tcpdump can still work with this.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 4:23 p.m. UTC | #26
On Wed, May 18, 2011 at 09:02:23AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > > >> >> Not more other restrictions, skb clone is OK.
> > pskb_expand_head()
> > > looks
> > > >> >> OK to me from code review.
> > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > >> > references to pages. How is that ok? What do I miss?
> > > >> It's making copy of the skb_shinfo earlier, so the pages refcount
> > > >> stays the same.
> > > > Exactly. But the callback is invoked so the guest thinks it's ok
> > to
> > > > change this memory. If it does a corrupted packet will be sent
> > out.
> > > 
> > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > sequence will break this scheme:
> > > 
> > > skb2 = skb_clone(skb...);
> > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > [use skb2, pages still referenced]
> > > kfree_skb(skb); /* callback called again */
> > > 
> > > This sequence is common in bridge, might be in other places.
> > > 
> > > Maybe this ubuf thing should just track clones? This will make it
> > work
> > > on all devices then.
> > 
> > The callback was only invoked when last reference of skb was gone.
> > skb_clone does increase skb refcnt. I tested tcpdump on lower device,
> > it
> > worked.
> > 
> > For the sequence of:
> > 
> > skb_clone  -> last refcnt + 1
> > kfree_skb() or pskb_expand_head -> callback not called
> > kfree_skb() -> callback called
> > 
> > I will check page refcount to see whether it's balanced. 
> 
> The page refcounts are balanced too. 
> 
> In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in
> pskb_expand_head, so I didn't hit any issue.
> 
> But rethinking about pskb_expand_head(), it calls skb_release_data() to
> free old skb head when it's not in the fastpath (pskb_expand_head is not
> the last reference of this skb); And it's impossible to track which skb
> head (old one or new one) will be the last one to free. So better to
> return error for zero-copy skbs when not using fastpath. Does it make
> sense? 

I'm not sure it does. Look e.g. at tg3 - if expand_head fails
packet gets dropped. No crash but unlikely to perform well :).

> Besides this, any other issue?
> 
> 
> Thanks
> Shirley
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 4:36 p.m. UTC | #27
On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > > > >> >> Not more other restrictions, skb clone is OK.
> > pskb_expand_head()
> > > > looks
> > > > >> >> OK to me from code review.
> > > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> > > > >> > references to pages. How is that ok? What do I miss?
> > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > refcount
> > > > >> stays the same.
> > > > > Exactly. But the callback is invoked so the guest thinks it's ok
> > to
> > > > > change this memory. If it does a corrupted packet will be sent
> > out.
> > > > 
> > > > Hmm. I tool a quick look at skb_clone(), and it looks like this
> > > > sequence will break this scheme:
> > > > 
> > > > skb2 = skb_clone(skb...);
> > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> > > > [use skb2, pages still referenced]
> > > > kfree_skb(skb); /* callback called again */
> > > > 
> > > > This sequence is common in bridge, might be in other places.
> > > > 
> > > > Maybe this ubuf thing should just track clones? This will make it
> > work
> > > > on all devices then.
> > > 
> > > The callback was only invoked when last reference of skb was gone.
> > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > device, it
> > > worked.
> > 
> > Right, it will normally work, but two issues I think you miss:
> > 1. malicious guest can change the memory between when it is sent out
> > by
> >    device and consumed by tcpdump, so you will see different things
> >    (not sure how important this is).
> > 2. if tcpdump stops consuming stuff from the packet socket (it's
> >    userspace, can't be trusted) then we won't get a callback for
> >    page potentially forever, guest networking will get blocked etc.
> > > For the sequence of:
> > > 
> > > skb_clone  -> last refcnt + 1
> > > kfree_skb() or pskb_expand_head -> callback not called
> > > kfree_skb() -> callback called
> > > 
> > > I will check page refcount to see whether it's balanced.
> > > 
> > > Thanks
> > > shirley
> > 
> > 
> > pskb_expand_head is a problem anyway I think as it
> > can hang on to pages after it calls release_data.
> > Then guest will modify these pages and you get trash there. 
> 
> This can be avoid by allowing pskb_expand_head in fastpath only, I
> think. But not sure whether tcpdump can still work with this.
> 
> Thanks
> Shirley

Yes, I agree.  I think for tcpdump, we really need to copy the data
anyway, to avoid guest changing it in between.  So we do that and then
use the copy everywhere, release the old one. Hmm?
Shirley Ma May 18, 2011, 4:45 p.m. UTC | #28
On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote:
> On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > > > > >> >> Not more other restrictions, skb clone is OK.
> > > pskb_expand_head()
> > > > > looks
> > > > > >> >> OK to me from code review.
> > > > > >> > Hmm. pskb_expand_head calls skb_release_data while
> keeping
> > > > > >> > references to pages. How is that ok? What do I miss?
> > > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > > refcount
> > > > > >> stays the same.
> > > > > > Exactly. But the callback is invoked so the guest thinks
> it's ok
> > > to
> > > > > > change this memory. If it does a corrupted packet will be
> sent
> > > out.
> > > > > 
> > > > > Hmm. I tool a quick look at skb_clone(), and it looks like
> this
> > > > > sequence will break this scheme:
> > > > > 
> > > > > skb2 = skb_clone(skb...);
> > > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called
> */
> > > > > [use skb2, pages still referenced]
> > > > > kfree_skb(skb); /* callback called again */
> > > > > 
> > > > > This sequence is common in bridge, might be in other places.
> > > > > 
> > > > > Maybe this ubuf thing should just track clones? This will make
> it
> > > work
> > > > > on all devices then.
> > > > 
> > > > The callback was only invoked when last reference of skb was
> gone.
> > > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > > device, it
> > > > worked.
> > > 
> > > Right, it will normally work, but two issues I think you miss:
> > > 1. malicious guest can change the memory between when it is sent
> out
> > > by
> > >    device and consumed by tcpdump, so you will see different
> things
> > >    (not sure how important this is).
> > > 2. if tcpdump stops consuming stuff from the packet socket (it's
> > >    userspace, can't be trusted) then we won't get a callback for
> > >    page potentially forever, guest networking will get blocked
> etc.
> > > > For the sequence of:
> > > > 
> > > > skb_clone  -> last refcnt + 1
> > > > kfree_skb() or pskb_expand_head -> callback not called
> > > > kfree_skb() -> callback called
> > > > 
> > > > I will check page refcount to see whether it's balanced.
> > > > 
> > > > Thanks
> > > > shirley
> > > 
> > > 
> > > pskb_expand_head is a problem anyway I think as it
> > > can hang on to pages after it calls release_data.
> > > Then guest will modify these pages and you get trash there. 
> > 
> > This can be avoid by allowing pskb_expand_head in fastpath only, I
> > think. But not sure whether tcpdump can still work with this.
> > 
> > Thanks
> > Shirley
> 
> Yes, I agree.  I think for tcpdump, we really need to copy the data
> anyway, to avoid guest changing it in between.  So we do that and then
> use the copy everywhere, release the old one. Hmm? 

Yes. Old one use zerocopy, new one use copy data.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 18, 2011, 4:50 p.m. UTC | #29
On Wed, May 18, 2011 at 01:40:29PM +0200, Micha? Miros?aw wrote:
> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin
> <mst@redhat.com> napisa?:
> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote:
> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote:
> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> > > Hello Michael,
> >> >> > >
> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> > to
> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> > device?
> >> >> >
> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >
> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> > skb. Another requirement is to invoke the callback
> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> > that happens (skbs are only freed when some other packet is
> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> > scenarious, right?
> >>
> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> take a long time if hardware queue happens to stall for some reason.
> >
> > That's a fundamental property of zero copy transmit.
> > You can't let the application/guest reuse the memory until
> > no one looks at it anymore.
> >
> >> Is it that you mean keeping a reference after all skbs pointing to the
> >> pages are released?
> > No one should reference the pages after the callback is invoked, yes.
> 
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.
> 
> Best Regards,
> Micha? Miros?aw

Well bridge has the problem that packet might get anywhere and it's
really hard to track. Same for tun - it can get queued forever.
veth, loopback are all a problem I think.

IOW we really want to limit this to real physical NICs
which mostly all DTRT. Whitelisting them with a new flag
is likely the most concervative approach, no?
Michael S. Tsirkin May 18, 2011, 4:51 p.m. UTC | #30
On Wed, May 18, 2011 at 09:45:40AM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote:
> > > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote:
> > > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote:
> > > > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote:
> > > > > > >> >> Not more other restrictions, skb clone is OK.
> > > > pskb_expand_head()
> > > > > > looks
> > > > > > >> >> OK to me from code review.
> > > > > > >> > Hmm. pskb_expand_head calls skb_release_data while
> > keeping
> > > > > > >> > references to pages. How is that ok? What do I miss?
> > > > > > >> It's making copy of the skb_shinfo earlier, so the pages
> > > > refcount
> > > > > > >> stays the same.
> > > > > > > Exactly. But the callback is invoked so the guest thinks
> > it's ok
> > > > to
> > > > > > > change this memory. If it does a corrupted packet will be
> > sent
> > > > out.
> > > > > > 
> > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like
> > this
> > > > > > sequence will break this scheme:
> > > > > > 
> > > > > > skb2 = skb_clone(skb...);
> > > > > > kfree_skb(skb) or pskb_expand_head(skb);  /* callback called
> > */
> > > > > > [use skb2, pages still referenced]
> > > > > > kfree_skb(skb); /* callback called again */
> > > > > > 
> > > > > > This sequence is common in bridge, might be in other places.
> > > > > > 
> > > > > > Maybe this ubuf thing should just track clones? This will make
> > it
> > > > work
> > > > > > on all devices then.
> > > > > 
> > > > > The callback was only invoked when last reference of skb was
> > gone.
> > > > > skb_clone does increase skb refcnt. I tested tcpdump on lower
> > > > device, it
> > > > > worked.
> > > > 
> > > > Right, it will normally work, but two issues I think you miss:
> > > > 1. malicious guest can change the memory between when it is sent
> > out
> > > > by
> > > >    device and consumed by tcpdump, so you will see different
> > things
> > > >    (not sure how important this is).
> > > > 2. if tcpdump stops consuming stuff from the packet socket (it's
> > > >    userspace, can't be trusted) then we won't get a callback for
> > > >    page potentially forever, guest networking will get blocked
> > etc.
> > > > > For the sequence of:
> > > > > 
> > > > > skb_clone  -> last refcnt + 1
> > > > > kfree_skb() or pskb_expand_head -> callback not called
> > > > > kfree_skb() -> callback called
> > > > > 
> > > > > I will check page refcount to see whether it's balanced.
> > > > > 
> > > > > Thanks
> > > > > shirley
> > > > 
> > > > 
> > > > pskb_expand_head is a problem anyway I think as it
> > > > can hang on to pages after it calls release_data.
> > > > Then guest will modify these pages and you get trash there. 
> > > 
> > > This can be avoid by allowing pskb_expand_head in fastpath only, I
> > > think. But not sure whether tcpdump can still work with this.
> > > 
> > > Thanks
> > > Shirley
> > 
> > Yes, I agree.  I think for tcpdump, we really need to copy the data
> > anyway, to avoid guest changing it in between.  So we do that and then
> > use the copy everywhere, release the old one. Hmm? 
> 
> Yes. Old one use zerocopy, new one use copy data.
> 
> Thanks
> Shirley

No, that's wrong, as they might become different with a
malicious guest. As long as we copied already, lets realease
the data and have everyone use the copy.
Shirley Ma May 18, 2011, 5 p.m. UTC | #31
On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > Yes, I agree.  I think for tcpdump, we really need to copy the
> data
> > > anyway, to avoid guest changing it in between.  So we do that and
> then
> > > use the copy everywhere, release the old one. Hmm? 
> > 
> > Yes. Old one use zerocopy, new one use copy data.
> > 
> > Thanks
> > Shirley
> 
> No, that's wrong, as they might become different with a
> malicious guest. As long as we copied already, lets realease
> the data and have everyone use the copy. 

Ok, I will patch pskb_expand_head to test it out.

Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 19, 2011, 7:42 p.m. UTC | #32
On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > > Yes, I agree.  I think for tcpdump, we really need to copy the
> > data
> > > > anyway, to avoid guest changing it in between.  So we do that
> and
> > then
> > > > use the copy everywhere, release the old one. Hmm? 
> > > 
> > > Yes. Old one use zerocopy, new one use copy data.
> > > 
> > > Thanks
> > > Shirley
> > 
> > No, that's wrong, as they might become different with a
> > malicious guest. As long as we copied already, lets realease
> > the data and have everyone use the copy. 
> 
> Ok, I will patch pskb_expand_head to test it out. 

I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to
convert a zero-copy skb to a copy skb to avoid this kind of issue.

This overhead won't impact macvtap/vhost TX zero-copy normally.

Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 19, 2011, 11:41 p.m. UTC | #33
On Thu, May 19, 2011 at 12:42:49PM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote:
> > On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote:
> > > > > Yes, I agree.  I think for tcpdump, we really need to copy the
> > > data
> > > > > anyway, to avoid guest changing it in between.  So we do that
> > and
> > > then
> > > > > use the copy everywhere, release the old one. Hmm? 
> > > > 
> > > > Yes. Old one use zerocopy, new one use copy data.
> > > > 
> > > > Thanks
> > > > Shirley
> > > 
> > > No, that's wrong, as they might become different with a
> > > malicious guest. As long as we copied already, lets realease
> > > the data and have everyone use the copy. 
> > 
> > Ok, I will patch pskb_expand_head to test it out. 
> 
> I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to
> convert a zero-copy skb to a copy skb to avoid this kind of issue.
> 
> This overhead won't impact macvtap/vhost TX zero-copy normally.
> 
> Shirley

OK, that will handle packet socket at least in that it won't crash :)

So the requirements are
- data must be released in a timely fashion (e.g. unlike virtio-net
  tun or bridge)
- no filtering based on data (data is mapped in guest)
- SG support
- HIGHDMA support (on arches where this makes sense)
- on fast path no calls to skb_copy, skb_clone, pskb_copy,
  pskb_expand_head as these are slow

First 2 requirements are a must, all other requirements
are just dependencies to make sure zero copy will be faster
than non zero copy.

Using a new feature bit is probably the simplest approach to
this. macvtap on top of most physical NICs most likely works
correctly so it seems a bit more work than it needs to be,
but it's also the safest one I think ...
Shirley Ma May 25, 2011, 10:49 p.m. UTC | #34
On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote:
> So the requirements are
> - data must be released in a timely fashion (e.g. unlike virtio-net
>   tun or bridge)
The current patch doesn't enable tun zero-copy. tun will copy data It's
not an issue now. We can disallow macvtap attach to bridge when
zero-copy is enabled.

> - SG support
> - HIGHDMA support (on arches where this makes sense)

This can be checked by device flags.

> - no filtering based on data (data is mapped in guest)

> - on fast path no calls to skb_copy, skb_clone, pskb_copy,
>   pskb_expand_head as these are slow

Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will do a
copy. The performance should be the same as none zero-copy case before.
I have done/tested the patch V6, will send it out for review tomorrow.

I am looking at where there are some cases, skb remains the same for
filtering.

> First 2 requirements are a must, all other requirements
> are just dependencies to make sure zero copy will be faster
> than non zero copy.
> Using a new feature bit is probably the simplest approach to
> this. macvtap on top of most physical NICs most likely works
> correctly so it seems a bit more work than it needs to be,
> but it's also the safest one I think ... 

For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it, it
looks safe to me once patching skb_copy, skb_clone, pskb_copy,
pskb_expand_head.

To extend zero-copy in other usages, we can have a new feature bit
later.

Is that reasonable?

Thanks
Shirley




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 26, 2011, 8:49 a.m. UTC | #35
On Wed, May 25, 2011 at 03:49:40PM -0700, Shirley Ma wrote:
> On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote:
> > So the requirements are
> > - data must be released in a timely fashion (e.g. unlike virtio-net
> >   tun or bridge)
> The current patch doesn't enable tun zero-copy. tun will copy data It's
> not an issue now.
> We can disallow macvtap attach to bridge when
> zero-copy is enabled.

Attach macvtap to a tun device though. Or e.g. veth device ...
So there should be so generic way to disable zerocopy.
It can either be a whitelist or a blacklist.

> 
> > - SG support
> > - HIGHDMA support (on arches where this makes sense)
> 
> This can be checked by device flags.

OK, but pls note that SG can get turned off dynamically.

> > - no filtering based on data (data is mapped in guest)
> 
> > - on fast path no calls to skb_copy, skb_clone, pskb_copy,
> >   pskb_expand_head as these are slow
> 
> Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will do a
> copy. The performance should be the same as none zero-copy case before.

I'm guessing a copy is cheaper than get_user_pages+copy+put_page.
But maybe not by much. Care checking that?

> I have done/tested the patch V6, will send it out for review tomorrow.
> 
> I am looking at where there are some cases, skb remains the same for
> filtering.

To reliably filter on data I think we'll need to copy it first, otherwise
guest can change it. Most filters only look at the header though.

> > First 2 requirements are a must, all other requirements
> > are just dependencies to make sure zero copy will be faster
> > than non zero copy.
> > Using a new feature bit is probably the simplest approach to
> > this. macvtap on top of most physical NICs most likely works
> > correctly so it seems a bit more work than it needs to be,
> > but it's also the safest one I think ... 
> 
> For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it, it
> looks safe to me once patching skb_copy, skb_clone, pskb_copy,
> pskb_expand_head.
> 
> To extend zero-copy in other usages, we can have a new feature bit
> later.
> 
> Is that reasonable?
> 
> Thanks
> Shirley

Is the problem is extra work needed to extend feature bits?
Shirley Ma May 26, 2011, 3:27 p.m. UTC | #36
On Thu, 2011-05-26 at 11:49 +0300, Michael S. Tsirkin wrote:
> On Wed, May 25, 2011 at 03:49:40PM -0700, Shirley Ma wrote:
> > On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote:
> > > So the requirements are
> > > - data must be released in a timely fashion (e.g. unlike
> virtio-net
> > >   tun or bridge)
> > The current patch doesn't enable tun zero-copy. tun will copy data
> It's
> > not an issue now.
> > We can disallow macvtap attach to bridge when
> > zero-copy is enabled.
> 
> Attach macvtap to a tun device though. Or e.g. veth device ...
> So there should be so generic way to disable zerocopy.
> It can either be a whitelist or a blacklist.
> > 
> > > - SG support
> > > - HIGHDMA support (on arches where this makes sense)
> > 
> > This can be checked by device flags.
> 
> OK, but pls note that SG can get turned off dynamically.
> 
> > > - no filtering based on data (data is mapped in guest)
> > 
> > > - on fast path no calls to skb_copy, skb_clone, pskb_copy,
> > >   pskb_expand_head as these are slow
> > 
> > Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will
> do a
> > copy. The performance should be the same as none zero-copy case
> before.
> 
> I'm guessing a copy is cheaper than get_user_pages+copy+put_page.
> But maybe not by much. Care checking that?

That's I have done already. Patch is going out for review.

> > I have done/tested the patch V6, will send it out for review
> tomorrow.
> > 
> > I am looking at where there are some cases, skb remains the same for
> > filtering.
> 
> To reliably filter on data I think we'll need to copy it first,
> otherwise
> guest can change it. Most filters only look at the header though.
> 
> > > First 2 requirements are a must, all other requirements
> > > are just dependencies to make sure zero copy will be faster
> > > than non zero copy.
> > > Using a new feature bit is probably the simplest approach to
> > > this. macvtap on top of most physical NICs most likely works
> > > correctly so it seems a bit more work than it needs to be,
> > > but it's also the safest one I think ... 
> > 
> > For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it,
> it
> > looks safe to me once patching skb_copy, skb_clone, pskb_copy,
> > pskb_expand_head.
> > 
> > To extend zero-copy in other usages, we can have a new feature bit
> > later.
> > 
> > Is that reasonable?
> > 
> > Thanks
> > Shirley
> 
> Is the problem is extra work needed to extend feature bits?

There is no problem to use it, Mahesh is working on this patch. I just
want to remove macvtap/vhost zero-copy patch dependency.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 26, 2011, 7:11 p.m. UTC | #37
On Thu, 2011-05-26 at 11:49 +0300, Michael S. Tsirkin wrote:
> > 
> > > - SG support
> > > - HIGHDMA support (on arches where this makes sense)
> > 
> > This can be checked by device flags.
> 
> OK, but pls note that SG can get turned off dynamically.

Tested the patch w/i SG dynmically on/off and tcpdump suspended.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a134d80..2646251 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1066,6 +1066,16 @@  struct net_device {
 #define NETIF_F_NOCACHE_COPY	(1 << 30) /* Use no-cache copyfromuser */
 #define NETIF_F_LOOPBACK	(1 << 31) /* Enable loopback */
 
+/*
+ * Bit 31 is for device to map userspace buffers -- zerocopy
+ * Device can set this flag when it supports HIGHDMA.
+ * Device can't recycle this kind of skb buffers.
+ * There are 256 bytes copied, the rest of buffers are mapped.
+ * The userspace callback should only be called when last reference to this skb
+ * is gone.
+ */
+#define NETIF_F_ZEROCOPY	(1 << 31)
+
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
 #define NETIF_F_GSO_MASK	0x00ff0000