Message ID | 1305574128.3456.23.camel@localhost.localdomain (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 2011-05-16 at 12:28 -0700, Shirley Ma wrote: > Signed-off-by: Shirley Ma <xma@us.ibm.com> > --- > > include/linux/netdevice.h | 10 ++++++++++ > 1 files changed, 10 insertions(+), 0 deletions(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index a134d80..2646251 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1066,6 +1066,16 @@ struct net_device { > #define NETIF_F_NOCACHE_COPY (1 << 30) /* Use no-cache copyfromuser */ > #define NETIF_F_LOOPBACK (1 << 31) /* Enable loopback */ > > +/* > + * Bit 31 is for device to map userspace buffers -- zerocopy > + * Device can set this flag when it supports HIGHDMA. > + * Device can't recycle this kind of skb buffers. > + * There are 256 bytes copied, the rest of buffers are mapped. > + * The userspace callback should only be called when last reference to this skb > + * is gone. > + */ > +#define NETIF_F_ZEROCOPY (1 << 31) Sorry, bit 31 is taken. You get the job of turning features into a wider bitmap. Ben.
On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote: > Sorry, bit 31 is taken. You get the job of turning features into a > wider bitmap. :) will do it. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2011-05-16 at 12:38 -0700, Shirley Ma wrote: > On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote: > > Sorry, bit 31 is taken. You get the job of turning features into a > > wider bitmap. > > :) will do it. Bear in mind that feature masks are manipulated in many different places. This is not a simple task. See previous discussion at: http://thread.gmane.org/gmane.linux.network/193284 and especially: http://thread.gmane.org/gmane.linux.network/193284/focus=193332 Ben.
On Mon, May 16, 2011 at 08:47:33PM +0100, Ben Hutchings wrote: > On Mon, 2011-05-16 at 12:38 -0700, Shirley Ma wrote: > > On Mon, 2011-05-16 at 20:35 +0100, Ben Hutchings wrote: > > > Sorry, bit 31 is taken. You get the job of turning features into a > > > wider bitmap. > > > > :) will do it. > > Bear in mind that feature masks are manipulated in many different > places. This is not a simple task. > > See previous discussion at: > http://thread.gmane.org/gmane.linux.network/193284 > and especially: > http://thread.gmane.org/gmane.linux.network/193284/focus=193332 > > Ben. IIUC, what is suggested above is something like: typedef struct net_features { } net_features_t; and then void netdev_set_feature(net_features_t *net_features, int feature); void netdev_clear_feature(net_features_t *net_features, int feature); bool netdev_test_feature(net_features_t *net_features, int feature); I think this might be the easiest way as compiler will catch any direct uses. It can then be split up nicely. It looks a bit different from what Dave suggested but I think it's close enough? we could also have wrappers that set/clear/test many features to replace uses of A|B|C that are pretty common. static inline void netdev_set_features(net_features_t *net_features, int nfeatures, int *features) { int i; for (i = 0; i < nfeatures; ++i) netdev_set_feature(net_features, features[i]); } void netdev_clear_features(net_features_t *net_features, int nfeatures, int *features) { int i; for (i = 0; i < nfeatures; ++i) netdev_clear_feature(net_features, features[i]); } bool netdev_test_features(net_features_t *net_features, int nfeatures, int *features) { int i; for (i = 0; i < nfeatures; ++i) if (netdev_test_feature(net_features, features[i])) return true; return false; } and possibly macros that get arrays of constants: #define NETDEV_SET_FEATURES(net_features, feature_array) do { \ int __NETDEV_SET_FEATURES_F[] = feature_array; netdev_set_feature((net_features), \ ARRAY_SIZE(__NETDEV_SET_FEATURES_F), __NETDEV_SET_FEATURES_F); } while (0) etc. > -- > Ben Hutchings, Senior Software Engineer, Solarflare > Not speaking for my employer; that's the marketing department's job. > They asked us to note that Solarflare product names are trademarked. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Michael, Looks like to use a new flag requires more time/work. I am thinking whether we can just use HIGHDMA flag to enable zero-copy in macvtap to avoid the new flag for now since mavctap uses real NICs as lower device? Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 16, 2011 at 04:32:18PM -0700, Shirley Ma wrote: > Hello Michael, > > Looks like to use a new flag requires more time/work. > I am thinking > whether we can just use HIGHDMA flag to enable zero-copy in macvtap to > avoid the new flag for now since mavctap uses real NICs as lower device? > > Thanks > Shirley Problem is, in your patch there are a set of restrictions on what the device can do with the skb that we need to enforce somehow. Also, how do we know it's a 'real NIC' and not a software device?
On Tue, 2011-05-17 at 09:21 +0300, Michael S. Tsirkin wrote: > Problem is, in your patch there are a set of restrictions on what the > device can do with the skb that we need to enforce somehow. > Also, how do we know it's a 'real NIC' and not a software device? I checked macvtap newlink, it doesn't seem to block software device. So we need to work on the wider feature bit first. Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > Hello Michael, > > Looks like to use a new flag requires more time/work. I am thinking > whether we can just use HIGHDMA flag to enable zero-copy in macvtap to > avoid the new flag for now since mavctap uses real NICs as lower device? Is there any other restriction besides requiring driver to not recycle the skb? Are there any drivers that recycle TX skbs? Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > > Hello Michael, > > > > Looks like to use a new flag requires more time/work. I am thinking > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > to > > avoid the new flag for now since mavctap uses real NICs as lower > device? > > Is there any other restriction besides requiring driver to not recycle > the skb? Are there any drivers that recycle TX skbs? Not more other restrictions, skb clone is OK. pskb_expand_head() looks OK to me from code review. Currently there is no drivers recycle TX skbs. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com> napisa?: > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> > Looks like to use a new flag requires more time/work. I am thinking >> > whether we can just use HIGHDMA flag to enable zero-copy in macvtap >> to >> > avoid the new flag for now since mavctap uses real NICs as lower >> device? >> >> Is there any other restriction besides requiring driver to not recycle >> the skb? Are there any drivers that recycle TX skbs? > Not more other restrictions, skb clone is OK. pskb_expand_head() looks > OK to me from code review. > Currently there is no drivers recycle TX skbs. So why do you require the target device to have some flags at all? Do I understand correctly, that this zero-copy feature is about packets received from VMs? Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-05-18 at 00:58 +0200, Micha? Miros?aw wrote: > W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com> > napisa?: > > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> > Looks like to use a new flag requires more time/work. I am > thinking > >> > whether we can just use HIGHDMA flag to enable zero-copy in > macvtap > >> to > >> > avoid the new flag for now since mavctap uses real NICs as lower > >> device? > >> > >> Is there any other restriction besides requiring driver to not > recycle > >> the skb? Are there any drivers that recycle TX skbs? > > Not more other restrictions, skb clone is OK. pskb_expand_head() > looks > > OK to me from code review. > > > Currently there is no drivers recycle TX skbs. > > So why do you require the target device to have some flags at all? We could use macvtap to check lower device HIGHDMA to enable zero-copy, but I am not sure whether it is sufficient. If it's sufficient then we don't need to use a new flag here. To be safe, it's better to use a new flag to enable each device who can pass zero-copy test. > Do I understand correctly, that this zero-copy feature is about > packets received from VMs? Yes, packets sent from VMs, and received in local host for TX zero-copy here. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
W dniu 18 maja 2011 01:44 u?ytkownik Shirley Ma <mashirle@us.ibm.com> napisa?: > On Wed, 2011-05-18 at 00:58 +0200, Micha? Miros?aw wrote: >> W dniu 18 maja 2011 00:28 u?ytkownik Shirley Ma <mashirle@us.ibm.com> >> napisa?: >> > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> >> > Looks like to use a new flag requires more time/work. I am >> thinking >> >> > whether we can just use HIGHDMA flag to enable zero-copy in >> macvtap >> >> to >> >> > avoid the new flag for now since mavctap uses real NICs as lower >> >> device? >> >> >> >> Is there any other restriction besides requiring driver to not >> recycle >> >> the skb? Are there any drivers that recycle TX skbs? >> > Not more other restrictions, skb clone is OK. pskb_expand_head() >> looks >> > OK to me from code review. >> >> > Currently there is no drivers recycle TX skbs. >> >> So why do you require the target device to have some flags at all? > We could use macvtap to check lower device HIGHDMA to enable zero-copy, > but I am not sure whether it is sufficient. If it's sufficient then we > don't need to use a new flag here. To be safe, it's better to use a new > flag to enable each device who can pass zero-copy test. >> Do I understand correctly, that this zero-copy feature is about >> packets received from VMs? > Yes, packets sent from VMs, and received in local host for TX zero-copy > here. What is the zero-copy test? On some arches the HIGHDMA is not needed at all so might be not enabled on anything. It looks like the correct test would be per-packet check of !illegal_highdma() or maybe NETIF_F_SG as returned from harmonize_features(). For virtual devices or other software forwarding this might lead to skb_linearize() in some cases, but is it that bad? Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > > > Hello Michael, > > > > > > Looks like to use a new flag requires more time/work. I am thinking > > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > > to > > > avoid the new flag for now since mavctap uses real NICs as lower > > device? > > > > Is there any other restriction besides requiring driver to not recycle > > the skb? Are there any drivers that recycle TX skbs? Not just recycling skbs, keeping reference to any of the pages in the skb. Another requirement is to invoke the callback in a timely fashion. For example virtio-net doesn't limit the time until that happens (skbs are only freed when some other packet is transmitted), so we need to avoid zcopy for such (nested-virt) scenarious, right? > > Not more other restrictions, skb clone is OK. pskb_expand_head() looks > OK to me from code review. Hmm. pskb_expand_head calls skb_release_data while keeping references to pages. How is that ok? What do I miss? > Currently there is no drivers recycle TX skbs. > > Thanks > Shirley Well, with e.g. bridge or veth the skb might enter the host networking stack. Once that happens, we lose track of the pages. Or is there anything that prevents such setups?
2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> > > Hello Michael, >> > > >> > > Looks like to use a new flag requires more time/work. I am thinking >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap >> > to >> > > avoid the new flag for now since mavctap uses real NICs as lower >> > device? >> > >> > Is there any other restriction besides requiring driver to not recycle >> > the skb? Are there any drivers that recycle TX skbs? > > Not just recycling skbs, keeping reference to any of the pages in the > skb. Another requirement is to invoke the callback > in a timely fashion. For example virtio-net doesn't limit the time until > that happens (skbs are only freed when some other packet is > transmitted), so we need to avoid zcopy for such (nested-virt) > scenarious, right? Hmm. But every hardware driver supporting SG will keep reference to the pages until the packet is sent (or DMA'd to the device). This can take a long time if hardware queue happens to stall for some reason. Is it that you mean keeping a reference after all skbs pointing to the pages are released? >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks >> OK to me from code review. > Hmm. pskb_expand_head calls skb_release_data while keeping > references to pages. How is that ok? What do I miss? It's making copy of the skb_shinfo earlier, so the pages refcount stays the same. Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: > 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> > > Hello Michael, > >> > > > >> > > Looks like to use a new flag requires more time/work. I am thinking > >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > >> > to > >> > > avoid the new flag for now since mavctap uses real NICs as lower > >> > device? > >> > > >> > Is there any other restriction besides requiring driver to not recycle > >> > the skb? Are there any drivers that recycle TX skbs? > > > > Not just recycling skbs, keeping reference to any of the pages in the > > skb. Another requirement is to invoke the callback > > in a timely fashion. For example virtio-net doesn't limit the time until > > that happens (skbs are only freed when some other packet is > > transmitted), so we need to avoid zcopy for such (nested-virt) > > scenarious, right? > > Hmm. But every hardware driver supporting SG will keep reference to > the pages until the packet is sent (or DMA'd to the device). This can > take a long time if hardware queue happens to stall for some reason. That's a fundamental property of zero copy transmit. You can't let the application/guest reuse the memory until no one looks at it anymore. > Is it that you mean keeping a reference after all skbs pointing to the > pages are released? No one should reference the pages after the callback is invoked, yes. > >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks > >> OK to me from code review. > > Hmm. pskb_expand_head calls skb_release_data while keeping > > references to pages. How is that ok? What do I miss? > > It's making copy of the skb_shinfo earlier, so the pages refcount > stays the same. > > Best Regards, > Micha? Miros?aw Exactly. But the callback is invoked so the guest thinks it's ok to change this memory. If it does a corrupted packet will be sent out.
W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin <mst@redhat.com> napisa?: > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> >> > > Hello Michael, >> >> > > >> >> > > Looks like to use a new flag requires more time/work. I am thinking >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap >> >> > to >> >> > > avoid the new flag for now since mavctap uses real NICs as lower >> >> > device? >> >> > >> >> > Is there any other restriction besides requiring driver to not recycle >> >> > the skb? Are there any drivers that recycle TX skbs? >> > >> > Not just recycling skbs, keeping reference to any of the pages in the >> > skb. Another requirement is to invoke the callback >> > in a timely fashion. For example virtio-net doesn't limit the time until >> > that happens (skbs are only freed when some other packet is >> > transmitted), so we need to avoid zcopy for such (nested-virt) >> > scenarious, right? >> >> Hmm. But every hardware driver supporting SG will keep reference to >> the pages until the packet is sent (or DMA'd to the device). This can >> take a long time if hardware queue happens to stall for some reason. > > That's a fundamental property of zero copy transmit. > You can't let the application/guest reuse the memory until > no one looks at it anymore. > >> Is it that you mean keeping a reference after all skbs pointing to the >> pages are released? > No one should reference the pages after the callback is invoked, yes. >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks >> >> OK to me from code review. >> > Hmm. pskb_expand_head calls skb_release_data while keeping >> > references to pages. How is that ok? What do I miss? >> It's making copy of the skb_shinfo earlier, so the pages refcount >> stays the same. > Exactly. But the callback is invoked so the guest thinks it's ok to > change this memory. If it does a corrupted packet will be sent out. Hmm. I tool a quick look at skb_clone(), and it looks like this sequence will break this scheme: skb2 = skb_clone(skb...); kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ [use skb2, pages still referenced] kfree_skb(skb); /* callback called again */ This sequence is common in bridge, might be in other places. Maybe this ubuf thing should just track clones? This will make it work on all devices then. Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin <mst@redhat.com> napisa?: > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> >> > > Hello Michael, >> >> > > >> >> > > Looks like to use a new flag requires more time/work. I am thinking >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap >> >> > to >> >> > > avoid the new flag for now since mavctap uses real NICs as lower >> >> > device? >> >> > >> >> > Is there any other restriction besides requiring driver to not recycle >> >> > the skb? Are there any drivers that recycle TX skbs? >> > Not just recycling skbs, keeping reference to any of the pages in the >> > skb. Another requirement is to invoke the callback >> > in a timely fashion. For example virtio-net doesn't limit the time until >> > that happens (skbs are only freed when some other packet is >> > transmitted), so we need to avoid zcopy for such (nested-virt) >> > scenarious, right? >> Hmm. But every hardware driver supporting SG will keep reference to >> the pages until the packet is sent (or DMA'd to the device). This can >> take a long time if hardware queue happens to stall for some reason. > That's a fundamental property of zero copy transmit. > You can't let the application/guest reuse the memory until > no one looks at it anymore. One more question: is userspace (or whatever is sending those packets) denied from modifying passed pages? I assume it is, but just want to be sure. Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 01:40:29PM +0200, Micha? Miros?aw wrote: > W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin > <mst@redhat.com> napisa?: > > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: > >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> >> > > Hello Michael, > >> >> > > > >> >> > > Looks like to use a new flag requires more time/work. I am thinking > >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > >> >> > to > >> >> > > avoid the new flag for now since mavctap uses real NICs as lower > >> >> > device? > >> >> > > >> >> > Is there any other restriction besides requiring driver to not recycle > >> >> > the skb? Are there any drivers that recycle TX skbs? > >> > > >> > Not just recycling skbs, keeping reference to any of the pages in the > >> > skb. Another requirement is to invoke the callback > >> > in a timely fashion. For example virtio-net doesn't limit the time until > >> > that happens (skbs are only freed when some other packet is > >> > transmitted), so we need to avoid zcopy for such (nested-virt) > >> > scenarious, right? > >> > >> Hmm. But every hardware driver supporting SG will keep reference to > >> the pages until the packet is sent (or DMA'd to the device). This can > >> take a long time if hardware queue happens to stall for some reason. > > > > That's a fundamental property of zero copy transmit. > > You can't let the application/guest reuse the memory until > > no one looks at it anymore. > > > >> Is it that you mean keeping a reference after all skbs pointing to the > >> pages are released? > > No one should reference the pages after the callback is invoked, yes. > > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks > >> >> OK to me from code review. > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > >> > references to pages. How is that ok? What do I miss? > >> It's making copy of the skb_shinfo earlier, so the pages refcount > >> stays the same. > > Exactly. But the callback is invoked so the guest thinks it's ok to > > change this memory. If it does a corrupted packet will be sent out. > > Hmm. I tool a quick look at skb_clone(), and it looks like this > sequence will break this scheme: > > skb2 = skb_clone(skb...); > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > [use skb2, pages still referenced] > kfree_skb(skb); /* callback called again */ > > This sequence is common in bridge, might be in other places. > > Maybe this ubuf thing should just track clones? This will make it work > on all devices then. > > Best Regards, > Micha? Miros?aw Long term that's a good plan, but it's a lot of work. pages can also get into weird places like VFS or devices might hang on to them for a long time. So I think as a first step, using a flag to white-list simple devices that don't do any tricks like the above makes sense. Just be sure to list all of the restrictions in the comment where the flag is described. And hey, we get features extended to 64 bit as a bonus :)
On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote: > W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin > <mst@redhat.com> napisa?: > > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: > >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> >> > > Hello Michael, > >> >> > > > >> >> > > Looks like to use a new flag requires more time/work. I am thinking > >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > >> >> > to > >> >> > > avoid the new flag for now since mavctap uses real NICs as lower > >> >> > device? > >> >> > > >> >> > Is there any other restriction besides requiring driver to not recycle > >> >> > the skb? Are there any drivers that recycle TX skbs? > >> > Not just recycling skbs, keeping reference to any of the pages in the > >> > skb. Another requirement is to invoke the callback > >> > in a timely fashion. For example virtio-net doesn't limit the time until > >> > that happens (skbs are only freed when some other packet is > >> > transmitted), so we need to avoid zcopy for such (nested-virt) > >> > scenarious, right? > >> Hmm. But every hardware driver supporting SG will keep reference to > >> the pages until the packet is sent (or DMA'd to the device). This can > >> take a long time if hardware queue happens to stall for some reason. > > That's a fundamental property of zero copy transmit. > > You can't let the application/guest reuse the memory until > > no one looks at it anymore. > > One more question: is userspace (or whatever is sending those packets) > denied from modifying passed pages? I assume it is, but just want to > be sure. > > Best Regards, > Micha? Miros?aw Good point. It's not denied in the sense that it still can modify them if it's buggy (the pages might not be read-only). But well-behaved userspace won't modify them until the callback is invoked. That would be a problem if the underlying device is a bridge where we might try to e.g. filter these packets - data can get modified after the filter. We'd have to copy whatever the filter accesses and use the copy - it's rarely the data itself. That's not normally a problem for macvtap connected to a physical NIC, as that already bypasses any and all filtering. But that's another limitation we should note in the comment, and another reason to limit to specific devices.
W dniu 18 maja 2011 13:56 u?ytkownik Michael S. Tsirkin <mst@redhat.com> napisa?: > On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote: >> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin >> <mst@redhat.com> napisa?: >> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: >> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: >> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: >> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: >> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: >> >> >> > > Hello Michael, >> >> >> > > >> >> >> > > Looks like to use a new flag requires more time/work. I am thinking >> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap >> >> >> > to >> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower >> >> >> > device? >> >> >> > >> >> >> > Is there any other restriction besides requiring driver to not recycle >> >> >> > the skb? Are there any drivers that recycle TX skbs? >> >> > Not just recycling skbs, keeping reference to any of the pages in the >> >> > skb. Another requirement is to invoke the callback >> >> > in a timely fashion. For example virtio-net doesn't limit the time until >> >> > that happens (skbs are only freed when some other packet is >> >> > transmitted), so we need to avoid zcopy for such (nested-virt) >> >> > scenarious, right? >> >> Hmm. But every hardware driver supporting SG will keep reference to >> >> the pages until the packet is sent (or DMA'd to the device). This can >> >> take a long time if hardware queue happens to stall for some reason. >> > That's a fundamental property of zero copy transmit. >> > You can't let the application/guest reuse the memory until >> > no one looks at it anymore. >> >> One more question: is userspace (or whatever is sending those packets) >> denied from modifying passed pages? I assume it is, but just want to >> be sure. >> >> Best Regards, >> Micha? Miros?aw > > Good point. > > It's not denied in the sense that it still can modify them if it's > buggy (the pages might not be read-only). > But well-behaved userspace won't modify them until the callback > is invoked. > > That would be a problem if the underlying device is > a bridge where we might try to e.g. filter these packets - > data can get modified after the filter. We'd have to copy > whatever the filter accesses and use the copy - it's rarely > the data itself. > > That's not normally a problem for macvtap connected to a physical NIC, > as that already bypasses any and all filtering. > > But that's another limitation we should note in the comment, > and another reason to limit to specific devices. It looks like this feature can be used only in very strict circumstances. What about tcpdump listening on the device or lowerdev? This path might clone the skb for any device. Best Regards, Micha? Miros?aw -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 02:48:24PM +0200, Micha? Miros?aw wrote: > W dniu 18 maja 2011 13:56 u?ytkownik Michael S. Tsirkin > <mst@redhat.com> napisa?: > > On Wed, May 18, 2011 at 01:47:33PM +0200, Micha? Miros?aw wrote: > >> W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin > >> <mst@redhat.com> napisa?: > >> > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: > >> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > >> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > >> >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> >> >> > > Hello Michael, > >> >> >> > > > >> >> >> > > Looks like to use a new flag requires more time/work. I am thinking > >> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > >> >> >> > to > >> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower > >> >> >> > device? > >> >> >> > > >> >> >> > Is there any other restriction besides requiring driver to not recycle > >> >> >> > the skb? Are there any drivers that recycle TX skbs? > >> >> > Not just recycling skbs, keeping reference to any of the pages in the > >> >> > skb. Another requirement is to invoke the callback > >> >> > in a timely fashion. For example virtio-net doesn't limit the time until > >> >> > that happens (skbs are only freed when some other packet is > >> >> > transmitted), so we need to avoid zcopy for such (nested-virt) > >> >> > scenarious, right? > >> >> Hmm. But every hardware driver supporting SG will keep reference to > >> >> the pages until the packet is sent (or DMA'd to the device). This can > >> >> take a long time if hardware queue happens to stall for some reason. > >> > That's a fundamental property of zero copy transmit. > >> > You can't let the application/guest reuse the memory until > >> > no one looks at it anymore. > >> > >> One more question: is userspace (or whatever is sending those packets) > >> denied from modifying passed pages? I assume it is, but just want to > >> be sure. > >> > >> Best Regards, > >> Micha? Miros?aw > > > > Good point. > > > > It's not denied in the sense that it still can modify them if it's > > buggy (the pages might not be read-only). > > But well-behaved userspace won't modify them until the callback > > is invoked. > > > > That would be a problem if the underlying device is > > a bridge where we might try to e.g. filter these packets - > > data can get modified after the filter. We'd have to copy > > whatever the filter accesses and use the copy - it's rarely > > the data itself. > > > > That's not normally a problem for macvtap connected to a physical NIC, > > as that already bypasses any and all filtering. > > > > But that's another limitation we should note in the comment, > > and another reason to limit to specific devices. > > It looks like this feature can be used only in very strict circumstances. True. I think it's reasonable to try and start with something restricted and then add features though - past attempts to solve the problem generally right away did not bear fruit. > What about tcpdump listening on the device or lowerdev? This path > might clone the skb for any device. > > Best Regards, > Micha? Miros?aw Thanks for bringing this up: taps do need to be fixed as they can hang on to a page for unlimited time. Further, as a malicious guest can change the packet at any time, data that taps get wouldn't be correct. We can either linearize the problematic skbs or disable zero copy if there are any taps for the given device.
On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() > looks > >> >> OK to me from code review. > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > >> > references to pages. How is that ok? What do I miss? > >> It's making copy of the skb_shinfo earlier, so the pages refcount > >> stays the same. > > Exactly. But the callback is invoked so the guest thinks it's ok to > > change this memory. If it does a corrupted packet will be sent out. > > Hmm. I tool a quick look at skb_clone(), and it looks like this > sequence will break this scheme: > > skb2 = skb_clone(skb...); > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > [use skb2, pages still referenced] > kfree_skb(skb); /* callback called again */ > > This sequence is common in bridge, might be in other places. > > Maybe this ubuf thing should just track clones? This will make it work > on all devices then. The callback was only invoked when last reference of skb was gone. skb_clone does increase skb refcnt. I tested tcpdump on lower device, it worked. For the sequence of: skb_clone -> last refcnt + 1 kfree_skb() or pskb_expand_head -> callback not called kfree_skb() -> callback called I will check page refcount to see whether it's balanced. Thanks shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() > > looks > > >> >> OK to me from code review. > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > > >> > references to pages. How is that ok? What do I miss? > > >> It's making copy of the skb_shinfo earlier, so the pages refcount > > >> stays the same. > > > Exactly. But the callback is invoked so the guest thinks it's ok to > > > change this memory. If it does a corrupted packet will be sent out. > > > > Hmm. I tool a quick look at skb_clone(), and it looks like this > > sequence will break this scheme: > > > > skb2 = skb_clone(skb...); > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > > [use skb2, pages still referenced] > > kfree_skb(skb); /* callback called again */ > > > > This sequence is common in bridge, might be in other places. > > > > Maybe this ubuf thing should just track clones? This will make it work > > on all devices then. > > The callback was only invoked when last reference of skb was gone. > skb_clone does increase skb refcnt. I tested tcpdump on lower device, it > worked. Right, it will normally work, but two issues I think you miss: 1. malicious guest can change the memory between when it is sent out by device and consumed by tcpdump, so you will see different things (not sure how important this is). 2. if tcpdump stops consuming stuff from the packet socket (it's userspace, can't be trusted) then we won't get a callback for page potentially forever, guest networking will get blocked etc. > For the sequence of: > > skb_clone -> last refcnt + 1 > kfree_skb() or pskb_expand_head -> callback not called > kfree_skb() -> callback called > > I will check page refcount to see whether it's balanced. > > Thanks > shirley pskb_expand_head is a problem anyway I think as it can hang on to pages after it calls release_data. Then guest will modify these pages and you get trash there.
On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > >> >> Not more other restrictions, skb clone is OK. > pskb_expand_head() > > looks > > >> >> OK to me from code review. > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > > >> > references to pages. How is that ok? What do I miss? > > >> It's making copy of the skb_shinfo earlier, so the pages refcount > > >> stays the same. > > > Exactly. But the callback is invoked so the guest thinks it's ok > to > > > change this memory. If it does a corrupted packet will be sent > out. > > > > Hmm. I tool a quick look at skb_clone(), and it looks like this > > sequence will break this scheme: > > > > skb2 = skb_clone(skb...); > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > > [use skb2, pages still referenced] > > kfree_skb(skb); /* callback called again */ > > > > This sequence is common in bridge, might be in other places. > > > > Maybe this ubuf thing should just track clones? This will make it > work > > on all devices then. > > The callback was only invoked when last reference of skb was gone. > skb_clone does increase skb refcnt. I tested tcpdump on lower device, > it > worked. > > For the sequence of: > > skb_clone -> last refcnt + 1 > kfree_skb() or pskb_expand_head -> callback not called > kfree_skb() -> callback called > > I will check page refcount to see whether it's balanced. The page refcounts are balanced too. In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in pskb_expand_head, so I didn't hit any issue. But rethinking about pskb_expand_head(), it calls skb_release_data() to free old skb head when it's not in the fastpath (pskb_expand_head is not the last reference of this skb); And it's impossible to track which skb head (old one or new one) will be the last one to free. So better to return error for zero-copy skbs when not using fastpath. Does it make sense? Besides this, any other issue? Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote: > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote: > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > > >> >> Not more other restrictions, skb clone is OK. > pskb_expand_head() > > > looks > > > >> >> OK to me from code review. > > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > > > >> > references to pages. How is that ok? What do I miss? > > > >> It's making copy of the skb_shinfo earlier, so the pages > refcount > > > >> stays the same. > > > > Exactly. But the callback is invoked so the guest thinks it's ok > to > > > > change this memory. If it does a corrupted packet will be sent > out. > > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like this > > > sequence will break this scheme: > > > > > > skb2 = skb_clone(skb...); > > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > > > [use skb2, pages still referenced] > > > kfree_skb(skb); /* callback called again */ > > > > > > This sequence is common in bridge, might be in other places. > > > > > > Maybe this ubuf thing should just track clones? This will make it > work > > > on all devices then. > > > > The callback was only invoked when last reference of skb was gone. > > skb_clone does increase skb refcnt. I tested tcpdump on lower > device, it > > worked. > > Right, it will normally work, but two issues I think you miss: > 1. malicious guest can change the memory between when it is sent out > by > device and consumed by tcpdump, so you will see different things > (not sure how important this is). > 2. if tcpdump stops consuming stuff from the packet socket (it's > userspace, can't be trusted) then we won't get a callback for > page potentially forever, guest networking will get blocked etc. > > For the sequence of: > > > > skb_clone -> last refcnt + 1 > > kfree_skb() or pskb_expand_head -> callback not called > > kfree_skb() -> callback called > > > > I will check page refcount to see whether it's balanced. > > > > Thanks > > shirley > > > pskb_expand_head is a problem anyway I think as it > can hang on to pages after it calls release_data. > Then guest will modify these pages and you get trash there. This can be avoid by allowing pskb_expand_head in fastpath only, I think. But not sure whether tcpdump can still work with this. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 09:02:23AM -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 07:38 -0700, Shirley Ma wrote: > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > > >> >> Not more other restrictions, skb clone is OK. > > pskb_expand_head() > > > looks > > > >> >> OK to me from code review. > > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > > > >> > references to pages. How is that ok? What do I miss? > > > >> It's making copy of the skb_shinfo earlier, so the pages refcount > > > >> stays the same. > > > > Exactly. But the callback is invoked so the guest thinks it's ok > > to > > > > change this memory. If it does a corrupted packet will be sent > > out. > > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like this > > > sequence will break this scheme: > > > > > > skb2 = skb_clone(skb...); > > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > > > [use skb2, pages still referenced] > > > kfree_skb(skb); /* callback called again */ > > > > > > This sequence is common in bridge, might be in other places. > > > > > > Maybe this ubuf thing should just track clones? This will make it > > work > > > on all devices then. > > > > The callback was only invoked when last reference of skb was gone. > > skb_clone does increase skb refcnt. I tested tcpdump on lower device, > > it > > worked. > > > > For the sequence of: > > > > skb_clone -> last refcnt + 1 > > kfree_skb() or pskb_expand_head -> callback not called > > kfree_skb() -> callback called > > > > I will check page refcount to see whether it's balanced. > > The page refcounts are balanced too. > > In macvtap/vhost Real NIC zerocopy case, it always goes to fastpath in > pskb_expand_head, so I didn't hit any issue. > > But rethinking about pskb_expand_head(), it calls skb_release_data() to > free old skb head when it's not in the fastpath (pskb_expand_head is not > the last reference of this skb); And it's impossible to track which skb > head (old one or new one) will be the last one to free. So better to > return error for zero-copy skbs when not using fastpath. Does it make > sense? I'm not sure it does. Look e.g. at tg3 - if expand_head fails packet gets dropped. No crash but unlikely to perform well :). > Besides this, any other issue? > > > Thanks > Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote: > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote: > > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > > > >> >> Not more other restrictions, skb clone is OK. > > pskb_expand_head() > > > > looks > > > > >> >> OK to me from code review. > > > > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > > > > >> > references to pages. How is that ok? What do I miss? > > > > >> It's making copy of the skb_shinfo earlier, so the pages > > refcount > > > > >> stays the same. > > > > > Exactly. But the callback is invoked so the guest thinks it's ok > > to > > > > > change this memory. If it does a corrupted packet will be sent > > out. > > > > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like this > > > > sequence will break this scheme: > > > > > > > > skb2 = skb_clone(skb...); > > > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > > > > [use skb2, pages still referenced] > > > > kfree_skb(skb); /* callback called again */ > > > > > > > > This sequence is common in bridge, might be in other places. > > > > > > > > Maybe this ubuf thing should just track clones? This will make it > > work > > > > on all devices then. > > > > > > The callback was only invoked when last reference of skb was gone. > > > skb_clone does increase skb refcnt. I tested tcpdump on lower > > device, it > > > worked. > > > > Right, it will normally work, but two issues I think you miss: > > 1. malicious guest can change the memory between when it is sent out > > by > > device and consumed by tcpdump, so you will see different things > > (not sure how important this is). > > 2. if tcpdump stops consuming stuff from the packet socket (it's > > userspace, can't be trusted) then we won't get a callback for > > page potentially forever, guest networking will get blocked etc. > > > For the sequence of: > > > > > > skb_clone -> last refcnt + 1 > > > kfree_skb() or pskb_expand_head -> callback not called > > > kfree_skb() -> callback called > > > > > > I will check page refcount to see whether it's balanced. > > > > > > Thanks > > > shirley > > > > > > pskb_expand_head is a problem anyway I think as it > > can hang on to pages after it calls release_data. > > Then guest will modify these pages and you get trash there. > > This can be avoid by allowing pskb_expand_head in fastpath only, I > think. But not sure whether tcpdump can still work with this. > > Thanks > Shirley Yes, I agree. I think for tcpdump, we really need to copy the data anyway, to avoid guest changing it in between. So we do that and then use the copy everywhere, release the old one. Hmm?
On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote: > On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote: > > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote: > > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote: > > > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > > > > >> >> Not more other restrictions, skb clone is OK. > > > pskb_expand_head() > > > > > looks > > > > > >> >> OK to me from code review. > > > > > >> > Hmm. pskb_expand_head calls skb_release_data while > keeping > > > > > >> > references to pages. How is that ok? What do I miss? > > > > > >> It's making copy of the skb_shinfo earlier, so the pages > > > refcount > > > > > >> stays the same. > > > > > > Exactly. But the callback is invoked so the guest thinks > it's ok > > > to > > > > > > change this memory. If it does a corrupted packet will be > sent > > > out. > > > > > > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like > this > > > > > sequence will break this scheme: > > > > > > > > > > skb2 = skb_clone(skb...); > > > > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called > */ > > > > > [use skb2, pages still referenced] > > > > > kfree_skb(skb); /* callback called again */ > > > > > > > > > > This sequence is common in bridge, might be in other places. > > > > > > > > > > Maybe this ubuf thing should just track clones? This will make > it > > > work > > > > > on all devices then. > > > > > > > > The callback was only invoked when last reference of skb was > gone. > > > > skb_clone does increase skb refcnt. I tested tcpdump on lower > > > device, it > > > > worked. > > > > > > Right, it will normally work, but two issues I think you miss: > > > 1. malicious guest can change the memory between when it is sent > out > > > by > > > device and consumed by tcpdump, so you will see different > things > > > (not sure how important this is). > > > 2. if tcpdump stops consuming stuff from the packet socket (it's > > > userspace, can't be trusted) then we won't get a callback for > > > page potentially forever, guest networking will get blocked > etc. > > > > For the sequence of: > > > > > > > > skb_clone -> last refcnt + 1 > > > > kfree_skb() or pskb_expand_head -> callback not called > > > > kfree_skb() -> callback called > > > > > > > > I will check page refcount to see whether it's balanced. > > > > > > > > Thanks > > > > shirley > > > > > > > > > pskb_expand_head is a problem anyway I think as it > > > can hang on to pages after it calls release_data. > > > Then guest will modify these pages and you get trash there. > > > > This can be avoid by allowing pskb_expand_head in fastpath only, I > > think. But not sure whether tcpdump can still work with this. > > > > Thanks > > Shirley > > Yes, I agree. I think for tcpdump, we really need to copy the data > anyway, to avoid guest changing it in between. So we do that and then > use the copy everywhere, release the old one. Hmm? Yes. Old one use zerocopy, new one use copy data. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 18, 2011 at 01:40:29PM +0200, Micha? Miros?aw wrote: > W dniu 18 maja 2011 13:17 u?ytkownik Michael S. Tsirkin > <mst@redhat.com> napisa?: > > On Wed, May 18, 2011 at 01:10:50PM +0200, Micha? Miros?aw wrote: > >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>: > >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote: > >> >> On Tue, 2011-05-17 at 23:48 +0200, Micha? Miros?aw wrote: > >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>: > >> >> > > Hello Michael, > >> >> > > > >> >> > > Looks like to use a new flag requires more time/work. I am thinking > >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap > >> >> > to > >> >> > > avoid the new flag for now since mavctap uses real NICs as lower > >> >> > device? > >> >> > > >> >> > Is there any other restriction besides requiring driver to not recycle > >> >> > the skb? Are there any drivers that recycle TX skbs? > >> > > >> > Not just recycling skbs, keeping reference to any of the pages in the > >> > skb. Another requirement is to invoke the callback > >> > in a timely fashion. For example virtio-net doesn't limit the time until > >> > that happens (skbs are only freed when some other packet is > >> > transmitted), so we need to avoid zcopy for such (nested-virt) > >> > scenarious, right? > >> > >> Hmm. But every hardware driver supporting SG will keep reference to > >> the pages until the packet is sent (or DMA'd to the device). This can > >> take a long time if hardware queue happens to stall for some reason. > > > > That's a fundamental property of zero copy transmit. > > You can't let the application/guest reuse the memory until > > no one looks at it anymore. > > > >> Is it that you mean keeping a reference after all skbs pointing to the > >> pages are released? > > No one should reference the pages after the callback is invoked, yes. > > >> >> Not more other restrictions, skb clone is OK. pskb_expand_head() looks > >> >> OK to me from code review. > >> > Hmm. pskb_expand_head calls skb_release_data while keeping > >> > references to pages. How is that ok? What do I miss? > >> It's making copy of the skb_shinfo earlier, so the pages refcount > >> stays the same. > > Exactly. But the callback is invoked so the guest thinks it's ok to > > change this memory. If it does a corrupted packet will be sent out. > > Hmm. I tool a quick look at skb_clone(), and it looks like this > sequence will break this scheme: > > skb2 = skb_clone(skb...); > kfree_skb(skb) or pskb_expand_head(skb); /* callback called */ > [use skb2, pages still referenced] > kfree_skb(skb); /* callback called again */ > This sequence is common in bridge, might be in other places. > > Maybe this ubuf thing should just track clones? This will make it work > on all devices then. > > Best Regards, > Micha? Miros?aw Well bridge has the problem that packet might get anywhere and it's really hard to track. Same for tun - it can get queued forever. veth, loopback are all a problem I think. IOW we really want to limit this to real physical NICs which mostly all DTRT. Whitelisting them with a new flag is likely the most concervative approach, no?
On Wed, May 18, 2011 at 09:45:40AM -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 19:36 +0300, Michael S. Tsirkin wrote: > > On Wed, May 18, 2011 at 09:07:37AM -0700, Shirley Ma wrote: > > > On Wed, 2011-05-18 at 18:47 +0300, Michael S. Tsirkin wrote: > > > > On Wed, May 18, 2011 at 07:38:27AM -0700, Shirley Ma wrote: > > > > > On Wed, 2011-05-18 at 13:40 +0200, Micha? Miros?aw wrote: > > > > > > >> >> Not more other restrictions, skb clone is OK. > > > > pskb_expand_head() > > > > > > looks > > > > > > >> >> OK to me from code review. > > > > > > >> > Hmm. pskb_expand_head calls skb_release_data while > > keeping > > > > > > >> > references to pages. How is that ok? What do I miss? > > > > > > >> It's making copy of the skb_shinfo earlier, so the pages > > > > refcount > > > > > > >> stays the same. > > > > > > > Exactly. But the callback is invoked so the guest thinks > > it's ok > > > > to > > > > > > > change this memory. If it does a corrupted packet will be > > sent > > > > out. > > > > > > > > > > > > Hmm. I tool a quick look at skb_clone(), and it looks like > > this > > > > > > sequence will break this scheme: > > > > > > > > > > > > skb2 = skb_clone(skb...); > > > > > > kfree_skb(skb) or pskb_expand_head(skb); /* callback called > > */ > > > > > > [use skb2, pages still referenced] > > > > > > kfree_skb(skb); /* callback called again */ > > > > > > > > > > > > This sequence is common in bridge, might be in other places. > > > > > > > > > > > > Maybe this ubuf thing should just track clones? This will make > > it > > > > work > > > > > > on all devices then. > > > > > > > > > > The callback was only invoked when last reference of skb was > > gone. > > > > > skb_clone does increase skb refcnt. I tested tcpdump on lower > > > > device, it > > > > > worked. > > > > > > > > Right, it will normally work, but two issues I think you miss: > > > > 1. malicious guest can change the memory between when it is sent > > out > > > > by > > > > device and consumed by tcpdump, so you will see different > > things > > > > (not sure how important this is). > > > > 2. if tcpdump stops consuming stuff from the packet socket (it's > > > > userspace, can't be trusted) then we won't get a callback for > > > > page potentially forever, guest networking will get blocked > > etc. > > > > > For the sequence of: > > > > > > > > > > skb_clone -> last refcnt + 1 > > > > > kfree_skb() or pskb_expand_head -> callback not called > > > > > kfree_skb() -> callback called > > > > > > > > > > I will check page refcount to see whether it's balanced. > > > > > > > > > > Thanks > > > > > shirley > > > > > > > > > > > > pskb_expand_head is a problem anyway I think as it > > > > can hang on to pages after it calls release_data. > > > > Then guest will modify these pages and you get trash there. > > > > > > This can be avoid by allowing pskb_expand_head in fastpath only, I > > > think. But not sure whether tcpdump can still work with this. > > > > > > Thanks > > > Shirley > > > > Yes, I agree. I think for tcpdump, we really need to copy the data > > anyway, to avoid guest changing it in between. So we do that and then > > use the copy everywhere, release the old one. Hmm? > > Yes. Old one use zerocopy, new one use copy data. > > Thanks > Shirley No, that's wrong, as they might become different with a malicious guest. As long as we copied already, lets realease the data and have everyone use the copy.
On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote: > > > Yes, I agree. I think for tcpdump, we really need to copy the > data > > > anyway, to avoid guest changing it in between. So we do that and > then > > > use the copy everywhere, release the old one. Hmm? > > > > Yes. Old one use zerocopy, new one use copy data. > > > > Thanks > > Shirley > > No, that's wrong, as they might become different with a > malicious guest. As long as we copied already, lets realease > the data and have everyone use the copy. Ok, I will patch pskb_expand_head to test it out. Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote: > > > > Yes, I agree. I think for tcpdump, we really need to copy the > > data > > > > anyway, to avoid guest changing it in between. So we do that > and > > then > > > > use the copy everywhere, release the old one. Hmm? > > > > > > Yes. Old one use zerocopy, new one use copy data. > > > > > > Thanks > > > Shirley > > > > No, that's wrong, as they might become different with a > > malicious guest. As long as we copied already, lets realease > > the data and have everyone use the copy. > > Ok, I will patch pskb_expand_head to test it out. I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to convert a zero-copy skb to a copy skb to avoid this kind of issue. This overhead won't impact macvtap/vhost TX zero-copy normally. Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 19, 2011 at 12:42:49PM -0700, Shirley Ma wrote: > On Wed, 2011-05-18 at 10:00 -0700, Shirley Ma wrote: > > On Wed, 2011-05-18 at 19:51 +0300, Michael S. Tsirkin wrote: > > > > > Yes, I agree. I think for tcpdump, we really need to copy the > > > data > > > > > anyway, to avoid guest changing it in between. So we do that > > and > > > then > > > > > use the copy everywhere, release the old one. Hmm? > > > > > > > > Yes. Old one use zerocopy, new one use copy data. > > > > > > > > Thanks > > > > Shirley > > > > > > No, that's wrong, as they might become different with a > > > malicious guest. As long as we copied already, lets realease > > > the data and have everyone use the copy. > > > > Ok, I will patch pskb_expand_head to test it out. > > I am patching skb_copy, skb_clone, pskb_copy, pskb_expand_head to > convert a zero-copy skb to a copy skb to avoid this kind of issue. > > This overhead won't impact macvtap/vhost TX zero-copy normally. > > Shirley OK, that will handle packet socket at least in that it won't crash :) So the requirements are - data must be released in a timely fashion (e.g. unlike virtio-net tun or bridge) - no filtering based on data (data is mapped in guest) - SG support - HIGHDMA support (on arches where this makes sense) - on fast path no calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head as these are slow First 2 requirements are a must, all other requirements are just dependencies to make sure zero copy will be faster than non zero copy. Using a new feature bit is probably the simplest approach to this. macvtap on top of most physical NICs most likely works correctly so it seems a bit more work than it needs to be, but it's also the safest one I think ...
On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote: > So the requirements are > - data must be released in a timely fashion (e.g. unlike virtio-net > tun or bridge) The current patch doesn't enable tun zero-copy. tun will copy data It's not an issue now. We can disallow macvtap attach to bridge when zero-copy is enabled. > - SG support > - HIGHDMA support (on arches where this makes sense) This can be checked by device flags. > - no filtering based on data (data is mapped in guest) > - on fast path no calls to skb_copy, skb_clone, pskb_copy, > pskb_expand_head as these are slow Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will do a copy. The performance should be the same as none zero-copy case before. I have done/tested the patch V6, will send it out for review tomorrow. I am looking at where there are some cases, skb remains the same for filtering. > First 2 requirements are a must, all other requirements > are just dependencies to make sure zero copy will be faster > than non zero copy. > Using a new feature bit is probably the simplest approach to > this. macvtap on top of most physical NICs most likely works > correctly so it seems a bit more work than it needs to be, > but it's also the safest one I think ... For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it, it looks safe to me once patching skb_copy, skb_clone, pskb_copy, pskb_expand_head. To extend zero-copy in other usages, we can have a new feature bit later. Is that reasonable? Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 25, 2011 at 03:49:40PM -0700, Shirley Ma wrote: > On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote: > > So the requirements are > > - data must be released in a timely fashion (e.g. unlike virtio-net > > tun or bridge) > The current patch doesn't enable tun zero-copy. tun will copy data It's > not an issue now. > We can disallow macvtap attach to bridge when > zero-copy is enabled. Attach macvtap to a tun device though. Or e.g. veth device ... So there should be so generic way to disable zerocopy. It can either be a whitelist or a blacklist. > > > - SG support > > - HIGHDMA support (on arches where this makes sense) > > This can be checked by device flags. OK, but pls note that SG can get turned off dynamically. > > - no filtering based on data (data is mapped in guest) > > > - on fast path no calls to skb_copy, skb_clone, pskb_copy, > > pskb_expand_head as these are slow > > Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will do a > copy. The performance should be the same as none zero-copy case before. I'm guessing a copy is cheaper than get_user_pages+copy+put_page. But maybe not by much. Care checking that? > I have done/tested the patch V6, will send it out for review tomorrow. > > I am looking at where there are some cases, skb remains the same for > filtering. To reliably filter on data I think we'll need to copy it first, otherwise guest can change it. Most filters only look at the header though. > > First 2 requirements are a must, all other requirements > > are just dependencies to make sure zero copy will be faster > > than non zero copy. > > Using a new feature bit is probably the simplest approach to > > this. macvtap on top of most physical NICs most likely works > > correctly so it seems a bit more work than it needs to be, > > but it's also the safest one I think ... > > For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it, it > looks safe to me once patching skb_copy, skb_clone, pskb_copy, > pskb_expand_head. > > To extend zero-copy in other usages, we can have a new feature bit > later. > > Is that reasonable? > > Thanks > Shirley Is the problem is extra work needed to extend feature bits?
On Thu, 2011-05-26 at 11:49 +0300, Michael S. Tsirkin wrote: > On Wed, May 25, 2011 at 03:49:40PM -0700, Shirley Ma wrote: > > On Fri, 2011-05-20 at 02:41 +0300, Michael S. Tsirkin wrote: > > > So the requirements are > > > - data must be released in a timely fashion (e.g. unlike > virtio-net > > > tun or bridge) > > The current patch doesn't enable tun zero-copy. tun will copy data > It's > > not an issue now. > > We can disallow macvtap attach to bridge when > > zero-copy is enabled. > > Attach macvtap to a tun device though. Or e.g. veth device ... > So there should be so generic way to disable zerocopy. > It can either be a whitelist or a blacklist. > > > > > - SG support > > > - HIGHDMA support (on arches where this makes sense) > > > > This can be checked by device flags. > > OK, but pls note that SG can get turned off dynamically. > > > > - no filtering based on data (data is mapped in guest) > > > > > - on fast path no calls to skb_copy, skb_clone, pskb_copy, > > > pskb_expand_head as these are slow > > > > Any calls to skb_copy, skb_clone, pskb_copy, pskb_expand_head will > do a > > copy. The performance should be the same as none zero-copy case > before. > > I'm guessing a copy is cheaper than get_user_pages+copy+put_page. > But maybe not by much. Care checking that? That's I have done already. Patch is going out for review. > > I have done/tested the patch V6, will send it out for review > tomorrow. > > > > I am looking at where there are some cases, skb remains the same for > > filtering. > > To reliably filter on data I think we'll need to copy it first, > otherwise > guest can change it. Most filters only look at the header though. > > > > First 2 requirements are a must, all other requirements > > > are just dependencies to make sure zero copy will be faster > > > than non zero copy. > > > Using a new feature bit is probably the simplest approach to > > > this. macvtap on top of most physical NICs most likely works > > > correctly so it seems a bit more work than it needs to be, > > > but it's also the safest one I think ... > > > > For "macvtap/vhost zero-copy" we can use SG & HIGHDMA to enable it, > it > > looks safe to me once patching skb_copy, skb_clone, pskb_copy, > > pskb_expand_head. > > > > To extend zero-copy in other usages, we can have a new feature bit > > later. > > > > Is that reasonable? > > > > Thanks > > Shirley > > Is the problem is extra work needed to extend feature bits? There is no problem to use it, Mahesh is working on this patch. I just want to remove macvtap/vhost zero-copy patch dependency. Thanks Shirley -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-05-26 at 11:49 +0300, Michael S. Tsirkin wrote: > > > > > - SG support > > > - HIGHDMA support (on arches where this makes sense) > > > > This can be checked by device flags. > > OK, but pls note that SG can get turned off dynamically. Tested the patch w/i SG dynmically on/off and tcpdump suspended. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a134d80..2646251 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1066,6 +1066,16 @@ struct net_device { #define NETIF_F_NOCACHE_COPY (1 << 30) /* Use no-cache copyfromuser */ #define NETIF_F_LOOPBACK (1 << 31) /* Enable loopback */ +/* + * Bit 31 is for device to map userspace buffers -- zerocopy + * Device can set this flag when it supports HIGHDMA. + * Device can't recycle this kind of skb buffers. + * There are 256 bytes copied, the rest of buffers are mapped. + * The userspace callback should only be called when last reference to this skb + * is gone. + */ +#define NETIF_F_ZEROCOPY (1 << 31) + /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 #define NETIF_F_GSO_MASK 0x00ff0000
Signed-off-by: Shirley Ma <xma@us.ibm.com> --- include/linux/netdevice.h | 10 ++++++++++ 1 files changed, 10 insertions(+), 0 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html