[RFC,0/9] sk_buff: optimize layout for GRO

Message ID cover.1626879395.git.pabeni@redhat.com

Message

Paolo Abeni July 21, 2021, 4:44 p.m. UTC
This is a very early draft - in a different world it would be
replaced by a hallway discussion at an in-person conference - aimed at
outlining some ideas and collecting feedback on the overall outlook.
There are still bugs to be fixed, more tests and benchmarks are
needed, etc.

There are 3 main goals:
- [try to] avoid the overhead for uncommon conditions at GRO time
  (patches 1-4)
- enable backpressure for the veth GRO path (patches 5-6)
- reduce the number of cachelines used by the sk_buff lifecycle
  from 4 to 3, at least in some common scenarios (patches 1,7-9).
  The idea here is to avoid the initialization of some fields and
  control their validity with a bitmask, as presented by at least
  Florian and Jesper in the past.

The above requires a bit of code churn in some places and, yes,
a few new bits in the sk_buff struct (using some existing holes).
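
The bitmask idea can be sketched in user space roughly as follows. This
is only an illustration under stated assumptions: struct mini_skb, the
MINI_STATE_HAS_SECMARK bit and the helper names are hypothetical
stand-ins, not the layout or API used by the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validity bit; the name is an assumption for this sketch. */
enum {
	MINI_STATE_HAS_SECMARK = 1u << 0,
};

/* Simplified stand-in for struct sk_buff, not the real layout. */
struct mini_skb {
	uint8_t  state;		/* validity bits, always initialized */
	uint32_t secmark;	/* deliberately left uninitialized */
};

/* Allocation-time init touches only the state byte. */
static void mini_skb_init(struct mini_skb *skb)
{
	assert(skb != NULL);
	skb->state = 0;
}

/* Writing the field also marks it as valid. */
static void mini_skb_set_secmark(struct mini_skb *skb, uint32_t mark)
{
	assert(skb != NULL);
	skb->secmark = mark;
	skb->state |= MINI_STATE_HAS_SECMARK;
}

/* Reading costs one extra conditional; an unset field reads as 0. */
static uint32_t mini_skb_get_secmark(const struct mini_skb *skb)
{
	assert(skb != NULL);
	return (skb->state & MINI_STATE_HAS_SECMARK) ? skb->secmark : 0;
}
```

The trade-off shown here is that the field no longer has to be zeroed
for every buffer, at the price of a test on each access.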

Paolo Abeni (9):
  sk_buff: track nfct status in newly added skb->_state
  sk_buff: track dst status in skb->_state
  sk_buff: move the active_extensions into the state bitfield
  net: optimize GRO for the common case.
  skbuff: introduce has_sk state bit.
  veth: use skb_prepare_for_gro()
  sk_buff: move inner header fields after tail
  sk_buff: move vlan field after tail.
  sk_buff: access secmark via getter/setter

 drivers/net/veth.c               |   2 +-
 include/linux/skbuff.h           | 117 ++++++++++++++++++++++---------
 include/net/dst.h                |   3 +
 include/net/sock.h               |   9 +++
 net/core/dev.c                   |  31 +++++---
 net/core/skbuff.c                |  40 +++++++----
 net/netfilter/nfnetlink_queue.c  |   6 +-
 net/netfilter/nft_meta.c         |   6 +-
 net/netfilter/xt_CONNSECMARK.c   |   8 +--
 net/netfilter/xt_SECMARK.c       |   2 +-
 security/apparmor/lsm.c          |  15 ++--
 security/selinux/hooks.c         |  10 +--
 security/smack/smack_lsm.c       |   4 +-
 security/smack/smack_netfilter.c |   4 +-
 14 files changed, 175 insertions(+), 82 deletions(-)

Comments

Casey Schaufler July 21, 2021, 6:15 p.m. UTC | #1
On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> This is a very early draft - in a different world would be
> replaced by hallway discussion at in-person conference - aimed at
> outlining some ideas and collect feedback on the overall outlook.
> There are still bugs to be fixed, more test and benchmark need, etc.
>
> There are 3 main goals:
> - [try to] avoid the overhead for uncommon conditions at GRO time
>   (patches 1-4)
> - enable backpressure for the veth GRO path (patches 5-6)
> - reduce the number of cacheline used by the sk_buff lifecycle
>   from 4 to 3, at least in some common scenarios (patches 1,7-9).
>   The idea here is avoid the initialization of some fields and
>   control their validity with a bitmask, as presented by at least
>   Florian and Jesper in the past.

If I understand correctly, you're creating an optimized case
which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
and if so, why those particular fields? What impact will this have
in the non-optimal (with any of the excluded fields) case?

Paolo Abeni July 22, 2021, 7:10 a.m. UTC | #2
Hello,

On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > This is a very early draft - in a different world would be
> > replaced by hallway discussion at in-person conference - aimed at
> > outlining some ideas and collect feedback on the overall outlook.
> > There are still bugs to be fixed, more test and benchmark need, etc.
> > 
> > There are 3 main goals:
> > - [try to] avoid the overhead for uncommon conditions at GRO time
> >   (patches 1-4)
> > - enable backpressure for the veth GRO path (patches 5-6)
> > - reduce the number of cacheline used by the sk_buff lifecycle
> >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> >   The idea here is avoid the initialization of some fields and
> >   control their validity with a bitmask, as presented by at least
> >   Florian and Jesper in the past.
> 
> If I understand correctly, you're creating an optimized case
> which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> and if so, why those particular fields? What impact will this have
> in the non-optimal (with any of the excluded fields) case?

Thank you for the feedback.

There are 2 different relevant points:

- the GRO stage.
  Packets carrying any of CT, dst, sk or skb_ext will do 2 additional
  conditionals per gro_receive WRT the current code. My understanding is
  that having any of these fields set at GRO receive time is quite
  exceptional for a real NIC. All other packets will do 4 or 5 fewer
  conditionals, and will traverse a little less code.

- sk_buff lifecycle
  * packets carrying vlan and UDP will not see any difference: the
    sk_buff lifecycle will still use 4 cachelines, as it currently
    does, and no additional conditional is introduced.
  * packets carrying nfct or secmark will see an additional conditional
    every time such a field is accessed. The number of cachelines used
    will still be 4, as in the current code. My understanding is that
    when such an access happens, there is already a relevant amount of
    "additional" code to be executed, so the conditional overhead
    should not be measurable.
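
The GRO-stage trade-off above can be sketched as a single mask test
guarding the uncommon work. This is a user-space illustration only: the
state bit names, struct mini_skb and mini_prepare_for_gro() are
assumptions made for the example, and the conditional counts quoted
above are not reproduced exactly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative state bits; the names are assumptions, not kernel symbols. */
enum {
	ST_HAS_NFCT = 1u << 0,
	ST_HAS_DST  = 1u << 1,
	ST_HAS_SK   = 1u << 2,
	ST_HAS_EXT  = 1u << 3,
};

#define ST_GRO_UNCOMMON (ST_HAS_NFCT | ST_HAS_DST | ST_HAS_SK | ST_HAS_EXT)

/* Simplified stand-in for struct sk_buff. */
struct mini_skb {
	uint8_t state;
};

/*
 * Returns true when the uncommon (slow) path was taken. In the common
 * case one mask test replaces a separate test per field.
 */
static bool mini_prepare_for_gro(struct mini_skb *skb)
{
	assert(skb != NULL);

	if (!(skb->state & ST_GRO_UNCOMMON))
		return false;

	/*
	 * Slow path: real code would release each attached object
	 * (conntrack, dst, socket, extensions) before clearing its bit.
	 */
	skb->state &= (uint8_t)~ST_GRO_UNCOMMON;
	return true;
}
```

Packets with none of the bits set pay for one branch; the others pay
for that branch plus whatever per-field handling is needed.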

Cheers,

Paolo
Casey Schaufler July 22, 2021, 4:04 p.m. UTC | #3
On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> Hello,
>
> On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
>> On 7/21/2021 9:44 AM, Paolo Abeni wrote:
>>> This is a very early draft - in a different world would be
>>> replaced by hallway discussion at in-person conference - aimed at
>>> outlining some ideas and collect feedback on the overall outlook.
>>> There are still bugs to be fixed, more test and benchmark need, etc.
>>>
>>> There are 3 main goals:
>>> - [try to] avoid the overhead for uncommon conditions at GRO time
>>>   (patches 1-4)
>>> - enable backpressure for the veth GRO path (patches 5-6)
>>> - reduce the number of cacheline used by the sk_buff lifecycle
>>>   from 4 to 3, at least in some common scenarios (patches 1,7-9).
>>>   The idea here is avoid the initialization of some fields and
>>>   control their validity with a bitmask, as presented by at least
>>>   Florian and Jesper in the past.
>> If I understand correctly, you're creating an optimized case
>> which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
>> and if so, why those particular fields? What impact will this have
>> in the non-optimal (with any of the excluded fields) case?
> Thank you for the feedback.

You're most welcome. You did request comments.

>
> There are 2 different relevant points:
>
> - the GRO stage.
>   packets carring any of CT, dst, sk or skb_ext will do 2 additional
> conditionals per gro_receive WRT the current code. My understanding is
> that having any of such field set at GRO receive time is quite
> exceptional for real nic. All others packet will do 4 or 5 less
> conditionals, and will traverse a little less code.
>
> - sk_buff lifecycle
>   * packets carrying vlan and UDP will not see any differences: sk_buff
> lifecycle will stil use 4 cachelines, as currently does, and no
> additional conditional is introduced.
>   * packets carring nfct or secmark will see an additional conditional
> every time such field is accessed. The number of cacheline used will
> still be 4, as in the current code. My understanding is that when such
> access happens, there is already a relevant amount of "additional" code
> to be executed, the conditional overhead should not be measurable.

I'm responsible for some of that "additional" code. If the secmark
is considered to be outside the performance critical data there are
changes I would like to make that will substantially improve the
performance of that "additional" code that would include a u64
secmark. If use of a secmark is considered indicative of a "slow"
path, the rationale for restricting it to u32, that it might impact
the "usual" case performance, seems specious. I can't say that I
understand all the nuances and implications involved. It does
appear that the changes you've suggested could negate the classic
argument that requires the u32 secmark.

Paolo Abeni July 22, 2021, 4:57 p.m. UTC | #4
On Thu, 2021-07-22 at 09:04 -0700, Casey Schaufler wrote:
> On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> > On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> > > On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > > > This is a very early draft - in a different world would be
> > > > replaced by hallway discussion at in-person conference - aimed at
> > > > outlining some ideas and collect feedback on the overall outlook.
> > > > There are still bugs to be fixed, more test and benchmark need, etc.
> > > > 
> > > > There are 3 main goals:
> > > > - [try to] avoid the overhead for uncommon conditions at GRO time
> > > >   (patches 1-4)
> > > > - enable backpressure for the veth GRO path (patches 5-6)
> > > > - reduce the number of cacheline used by the sk_buff lifecycle
> > > >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> > > >   The idea here is avoid the initialization of some fields and
> > > >   control their validity with a bitmask, as presented by at least
> > > >   Florian and Jesper in the past.
> > > If I understand correctly, you're creating an optimized case
> > > which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> > > and if so, why those particular fields? What impact will this have
> > > in the non-optimal (with any of the excluded fields) case?
> > Thank you for the feedback.
> 
> You're most welcome. You did request comments.
> 
> > There are 2 different relevant points:
> > 
> > - the GRO stage.
> >   packets carring any of CT, dst, sk or skb_ext will do 2 additional
> > conditionals per gro_receive WRT the current code. My understanding is
> > that having any of such field set at GRO receive time is quite
> > exceptional for real nic. All others packet will do 4 or 5 less
> > conditionals, and will traverse a little less code.
> > 
> > - sk_buff lifecycle
> >   * packets carrying vlan and UDP will not see any differences: sk_buff
> > lifecycle will stil use 4 cachelines, as currently does, and no
> > additional conditional is introduced.
> >   * packets carring nfct or secmark will see an additional conditional
> > every time such field is accessed. The number of cacheline used will
> > still be 4, as in the current code. My understanding is that when such
> > access happens, there is already a relevant amount of "additional" code
> > to be executed, the conditional overhead should not be measurable.
> 
> I'm responsible for some of that "additonal" code. If the secmark
> is considered to be outside the performance critical data there are
> changes I would like to make that will substantially improve the
> performance of that "additional" code that would include a u64
> secmark. If use of a secmark is considered indicative of a "slow"
> path, the rationale for restricting it to u32, that it might impact
> the "usual" case performance, seems specious. I can't say that I
> understand all the nuances and implications involved. It does
> appear that the changes you've suggested could negate the classic
> argument that requires the u32 secmark.

I see now I did not reply to one of your questions - why I picked
the vlan, tunnel and secmark fields to move to the sk_buff tail.

Two main drivers on my side:
- there are use cases/deployments that do not use them.
- moving them around was doable in terms of required changes.

There are no "slow-path" implications on my side. For example, the
vlan_* fields are very critical performance-wise, if the traffic is
tagged. But surely there are busy servers not using tagged traffic
which will enjoy the reduced cacheline footprint, and this changeset
will not negatively impact the first case.

Compared to the vlan example, secmark and nfct require an extra
conditional to fetch the data. My understanding is that such an
additional conditional is not measurable performance-wise when
benchmarking the security modules (or conntrack) because they have to
do much more interesting things after fetching a few bytes from an
already hot cacheline.

Not sure if the above somehow clarifies my statements.

As for expanding secmark to 64 bits, I guess that could be an
interesting follow-up discussion :)

Cheers,

Paolo
Paul Moore July 22, 2021, 6:41 p.m. UTC | #5
On Thu, Jul 22, 2021 at 12:59 PM Paolo Abeni <pabeni@redhat.com> wrote:
> On Thu, 2021-07-22 at 09:04 -0700, Casey Schaufler wrote:
> > On 7/22/2021 12:10 AM, Paolo Abeni wrote:
> > > On Wed, 2021-07-21 at 11:15 -0700, Casey Schaufler wrote:
> > > > On 7/21/2021 9:44 AM, Paolo Abeni wrote:
> > > > > This is a very early draft - in a different world would be
> > > > > replaced by hallway discussion at in-person conference - aimed at
> > > > > outlining some ideas and collect feedback on the overall outlook.
> > > > > There are still bugs to be fixed, more test and benchmark need, etc.
> > > > >
> > > > > There are 3 main goals:
> > > > > - [try to] avoid the overhead for uncommon conditions at GRO time
> > > > >   (patches 1-4)
> > > > > - enable backpressure for the veth GRO path (patches 5-6)
> > > > > - reduce the number of cacheline used by the sk_buff lifecycle
> > > > >   from 4 to 3, at least in some common scenarios (patches 1,7-9).
> > > > >   The idea here is avoid the initialization of some fields and
> > > > >   control their validity with a bitmask, as presented by at least
> > > > >   Florian and Jesper in the past.
> > > > If I understand correctly, you're creating an optimized case
> > > > which excludes ct, secmark, vlan and UDP tunnel. Is this correct,
> > > > and if so, why those particular fields? What impact will this have
> > > > in the non-optimal (with any of the excluded fields) case?
> > > Thank you for the feedback.
> >
> > You're most welcome. You did request comments.
> >
> > > There are 2 different relevant points:
> > >
> > > - the GRO stage.
> > >   packets carring any of CT, dst, sk or skb_ext will do 2 additional
> > > conditionals per gro_receive WRT the current code. My understanding is
> > > that having any of such field set at GRO receive time is quite
> > > exceptional for real nic. All others packet will do 4 or 5 less
> > > conditionals, and will traverse a little less code.
> > >
> > > - sk_buff lifecycle
> > >   * packets carrying vlan and UDP will not see any differences: sk_buff
> > > lifecycle will stil use 4 cachelines, as currently does, and no
> > > additional conditional is introduced.
> > >   * packets carring nfct or secmark will see an additional conditional
> > > every time such field is accessed. The number of cacheline used will
> > > still be 4, as in the current code. My understanding is that when such
> > > access happens, there is already a relevant amount of "additional" code
> > > to be executed, the conditional overhead should not be measurable.
> >
> > I'm responsible for some of that "additonal" code. If the secmark
> > is considered to be outside the performance critical data there are
> > changes I would like to make that will substantially improve the
> > performance of that "additional" code that would include a u64
> > secmark. If use of a secmark is considered indicative of a "slow"
> > path, the rationale for restricting it to u32, that it might impact
> > the "usual" case performance, seems specious. I can't say that I
> > understand all the nuances and implications involved. It does
> > appear that the changes you've suggested could negate the classic
> > argument that requires the u32 secmark.
>
> I see now I did not reply to one of you questions - why I picked-up
>  vlan, tunnel secmark fields to move them at sk_buff tail.
>
> Tow main drivers on my side:
> - there are use cases/deployments that do not use them.
> - moving them around was doable in term of required changes.
>
> There are no "slow-path" implications on my side. For example, vlan_*
> fields are very critical performance wise, if the traffic is tagged.
> But surely there are busy servers not using tagget traffic which will
> enjoy the reduced cachelines footprint, and this changeset will not
> impact negatively the first case.
>
> WRT to the vlan example, secmark and nfct require an extra conditional
> to fetch the data. My understanding is that such additional conditional
> is not measurable performance-wise when benchmarking the security
> modules (or conntrack) because they have to do much more intersting
> things after fetching a few bytes from an already hot cacheline.
>
> Not sure if the above somehow clarify my statements.
>
> As for expanding secmark to 64 bits, I guess that could be an
> interesting follow-up discussion :)

The intersection between netdev and the LSM has a long and somewhat
tortured past with each party making sacrifices along the way to get
where we are at today.  It is far from perfect, at least from a LSM
perspective, but it is what we've got and since performance is usually
used as a club to beat back any changes proposed by the LSM side, I
would like to object to these changes that negatively impact the LSM
performance without some concession in return.  It has been a while
since Casey and I have spoken about this, but I think the preferred
option would be to exchange the current __u32 "sk_buff.secmark" field
with a void* "sk_buff.security" field, like so many other kernel level
objects.  Previous objections have eventually boiled down to the
additional space in the sk_buff for the extra bits (there is some
additional editorializing that could be done here, but I'll refrain),
but based on the comments thus far in this thread it sounds like
perhaps we can now make a deal here: move the LSM field down to a
"colder" cacheline in exchange for converting the LSM field to a
proper pointer.

Thoughts?
Florian Westphal July 24, 2021, 6:51 p.m. UTC | #6
Paul Moore <paul@paul-moore.com> wrote:
> > Tow main drivers on my side:
> > - there are use cases/deployments that do not use them.
> > - moving them around was doable in term of required changes.
> >
> > There are no "slow-path" implications on my side. For example, vlan_*
> > fields are very critical performance wise, if the traffic is tagged.
> > But surely there are busy servers not using tagget traffic which will
> > enjoy the reduced cachelines footprint, and this changeset will not
> > impact negatively the first case.
> >
> > WRT to the vlan example, secmark and nfct require an extra conditional
> > to fetch the data. My understanding is that such additional conditional
> > is not measurable performance-wise when benchmarking the security
> > modules (or conntrack) because they have to do much more intersting
> > things after fetching a few bytes from an already hot cacheline.
> >
> > Not sure if the above somehow clarify my statements.
> >
> > As for expanding secmark to 64 bits, I guess that could be an
> > interesting follow-up discussion :)
> 
> The intersection between netdev and the LSM has a long and somewhat
> tortured past with each party making sacrifices along the way to get
> where we are at today.  It is far from perfect, at least from a LSM
> perspective, but it is what we've got and since performance is usually
> used as a club to beat back any changes proposed by the LSM side, I
> would like to object to these changes that negatively impact the LSM
> performance without some concession in return.  It has been a while
> since Casey and I have spoken about this, but I think the prefered
> option would be to exchange the current __u32 "sk_buff.secmark" field
> with a void* "sk_buff.security" field, like so many other kernel level
> objects.  Previous objections have eventually boiled down to the
> additional space in the sk_buff for the extra bits (there is some
> additional editorializing that could be done here, but I'll refrain),
> but based on the comments thus far in this thread it sounds like
> perhaps we can now make a deal here: move the LSM field down to a
> "colder" cacheline in exchange for converting the LSM field to a
> proper pointer.
> 
> Thoughts?

Is there a summary discussion somewhere wrt. what exactly LSMs need?

There is the skb extension infra, does that work for you?
Paul Moore July 25, 2021, 2:57 p.m. UTC | #7
On Sat, Jul 24, 2021 at 2:51 PM Florian Westphal <fw@strlen.de> wrote:
> Paul Moore <paul@paul-moore.com> wrote:
> > > Tow main drivers on my side:
> > > - there are use cases/deployments that do not use them.
> > > - moving them around was doable in term of required changes.
> > >
> > > There are no "slow-path" implications on my side. For example, vlan_*
> > > fields are very critical performance wise, if the traffic is tagged.
> > > But surely there are busy servers not using tagget traffic which will
> > > enjoy the reduced cachelines footprint, and this changeset will not
> > > impact negatively the first case.
> > >
> > > WRT to the vlan example, secmark and nfct require an extra conditional
> > > to fetch the data. My understanding is that such additional conditional
> > > is not measurable performance-wise when benchmarking the security
> > > modules (or conntrack) because they have to do much more intersting
> > > things after fetching a few bytes from an already hot cacheline.
> > >
> > > Not sure if the above somehow clarify my statements.
> > >
> > > As for expanding secmark to 64 bits, I guess that could be an
> > > interesting follow-up discussion :)
> >
> > The intersection between netdev and the LSM has a long and somewhat
> > tortured past with each party making sacrifices along the way to get
> > where we are at today.  It is far from perfect, at least from a LSM
> > perspective, but it is what we've got and since performance is usually
> > used as a club to beat back any changes proposed by the LSM side, I
> > would like to object to these changes that negatively impact the LSM
> > performance without some concession in return.  It has been a while
> > since Casey and I have spoken about this, but I think the prefered
> > option would be to exchange the current __u32 "sk_buff.secmark" field
> > with a void* "sk_buff.security" field, like so many other kernel level
> > objects.  Previous objections have eventually boiled down to the
> > additional space in the sk_buff for the extra bits (there is some
> > additional editorializing that could be done here, but I'll refrain),
> > but based on the comments thus far in this thread it sounds like
> > perhaps we can now make a deal here: move the LSM field down to a
> > "colder" cacheline in exchange for converting the LSM field to a
> > proper pointer.
> >
> > Thoughts?
>
> Is there a summary disucssion somewhere wrt. what exactly LSMs need?

My network access is limited for the next week so I don't have the
ability to dig through the list archives, but if you look through the
netdev/LSM/lists over the past decade (maybe go back ~15 years?) you
will see multiple instances where we/I've brought up different
solutions with the netdev folks only to hit a brick wall.  The LSM ask
for sk_buff is really the same as any other kernel object that we want
to control with LSM access controls, e.g. inodes; we basically want a
void* blob with the necessary hooks so that the opaque blob can be
managed through the skb's lifetime.
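
As a rough user-space sketch of that ask: an opaque pointer on the
buffer plus allocate/clone/free helpers that follow the buffer's
lifetime. All names here (struct mini_skb, the security member, the
mini_skb_*_security() helpers and struct example_blob) are hypothetical
and only illustrate the shape of the idea, not any existing kernel
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct sk_buff with an opaque LSM blob. */
struct mini_skb {
	void *security;
};

/* Example of what one security module might keep in its blob. */
struct example_blob {
	uint32_t secid;
	uint32_t peer_secid;
};

/* Hook called when the buffer is created. Returns 0 on success. */
static int mini_skb_alloc_security(struct mini_skb *skb, size_t size)
{
	assert(skb != NULL);
	skb->security = calloc(1, size);
	return skb->security ? 0 : -1;
}

/* Hook called when the buffer is cloned: the copy gets its own blob. */
static int mini_skb_clone_security(struct mini_skb *dst,
				   const struct mini_skb *src, size_t size)
{
	assert(dst != NULL && src != NULL && src->security != NULL);
	if (mini_skb_alloc_security(dst, size))
		return -1;
	memcpy(dst->security, src->security, size);
	return 0;
}

/* Hook called when the buffer is released. */
static void mini_skb_free_security(struct mini_skb *skb)
{
	assert(skb != NULL);
	free(skb->security);
	skb->security = NULL;
}
```

The network code never looks inside the blob; it only has to call the
hooks at the right points.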

> There is the skb extension infra, does that work for you?

I was hopeful that when the skb_ext capability was introduced we might
be able to use it for the LSM(s), but when I asked netdev if they
would be willing to accept patches to leverage the skb_ext
infrastructure I was told "no".
Florian Westphal July 25, 2021, 4:25 p.m. UTC | #8
Paul Moore <paul@paul-moore.com> wrote:
> > There is the skb extension infra, does that work for you?
> 
> I was hopeful that when the skb_ext capability was introduced we might
> be able to use it for the LSM(s), but when I asked netdev if they
> would be willing to accept patches to leverage the skb_ext
> infrastructure I was told "no".

I found

https://lore.kernel.org/netdev/CAHC9VhSz1_KA1tCJtNjwK26BOkGhKGbPT7v1O82mWPduvWwd4A@mail.gmail.com/#r

and from what I gather from your comments and that of Casey
I think skb extensions is the correct thing for this (i.e., needs
netlabel/secid config/enablement so typically won't be active on
a distro kernel by default).

It certainly makes more sense to me than doing lookups
in a hashtable based on an ID (I tried to do that to get rid of the
skb->nf_bridge pointer years ago and I could not figure out how to
invalidate an entry without adding a new skb destructor callback).
Casey Schaufler July 25, 2021, 9:53 p.m. UTC | #9
On 7/25/2021 9:25 AM, Florian Westphal wrote:
> Paul Moore <paul@paul-moore.com> wrote:
>>> There is the skb extension infra, does that work for you?
>> I was hopeful that when the skb_ext capability was introduced we might
>> be able to use it for the LSM(s), but when I asked netdev if they
>> would be willing to accept patches to leverage the skb_ext
>> infrastructure I was told "no".
> I found
>
> https://lore.kernel.org/netdev/CAHC9VhSz1_KA1tCJtNjwK26BOkGhKGbPT7v1O82mWPduvWwd4A@mail.gmail.com/#r
>
> and from what I gather from your comments and that of Casey
> I think skb extensions is the correct thing for this (i.e., needs
> netlabel/secid config/enablement so typically won't be active on
> a distro kernel by default).

Red Hat and Android use SELinux and will want this. Ubuntu doesn't
yet, but netfilter is on the AppArmor task list. Tizen definitely
uses it with Smack. The notion that security modules are only used
in fringe cases is antiquated.

> It certainly makes more sense to me than doing lookups
> in a hashtable based on a ID

Agreed. The data burden required to support a hash scheme
for the security module stacking case is staggering.

Florian Westphal July 25, 2021, 10:52 p.m. UTC | #10
Casey Schaufler <casey@schaufler-ca.com> wrote:
> RedHat and android use SELinux and will want this. Ubuntu doesn't
> yet, but netfilter in in the AppArmor task list. Tizen definitely
> uses it with Smack. The notion that security modules are only used
> in fringe cases is antiquated. 

I was not talking about LSM in general, I was referring to the
extended info that Paul mentioned.

If that's indeed going to be used on every distro then skb extensions
are not suitable for this, it would result in an extra kmalloc for
every skb.
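
A small user-space sketch of the cost in question, assuming extension
storage is allocated on first use: if every buffer needs the extension,
every buffer pays for one more allocation. struct mini_ext, struct
mini_skb, mini_ext_add() and the counter are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical extension payload. */
struct mini_ext {
	uint32_t secid;
};

/* Simplified stand-in for struct sk_buff with on-demand extension. */
struct mini_skb {
	struct mini_ext *ext;
};

/* Counts how many extension allocations were performed. */
static unsigned int mini_ext_allocs;

/* Allocate the extension on first use; later calls reuse it. */
static struct mini_ext *mini_ext_add(struct mini_skb *skb)
{
	assert(skb != NULL);
	if (!skb->ext) {
		skb->ext = calloc(1, sizeof(*skb->ext));
		if (skb->ext)
			mini_ext_allocs++;
	}
	return skb->ext;
}

static void mini_ext_free(struct mini_skb *skb)
{
	assert(skb != NULL);
	free(skb->ext);
	skb->ext = NULL;
}
```

On-demand allocation is cheap when few buffers need the extension, and
becomes a per-buffer cost when all of them do.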

> > It certainly makes more sense to me than doing lookups
> > in a hashtable based on a ID
> 
> Agreed. The data burden required to support a hash scheme
> for the security module stacking case is staggering.

It depends on the type of data (and its lifetime).

I suspect you have something that is more like skb->dev/dst,
i.e. a reference to an object that persists after the skb is freed.
Casey Schaufler July 26, 2021, 3:13 p.m. UTC | #11
On 7/25/2021 3:52 PM, Florian Westphal wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>> RedHat and android use SELinux and will want this. Ubuntu doesn't
>> yet, but netfilter in in the AppArmor task list. Tizen definitely
>> uses it with Smack. The notion that security modules are only used
>> in fringe cases is antiquated. 
> I was not talking about LSM in general, I was referring to the
> extended info that Paul mentioned.
>
> If thats indeed going to be used on every distro then skb extensions
> are not suitable for this, it would result in extr akmalloc for every
> skb.

I am explicitly talking about the use of secmarks. All my
references are uses of secmarks.

>>> It certainly makes more sense to me than doing lookups
>>> in a hashtable based on a ID
>> Agreed. The data burden required to support a hash scheme
>> for the security module stacking case is staggering.
> It depends on the type of data (and its lifetime).
>
> I suspect you have something that is more like skb->dev/dst,
> i.e. reference to object that persists after the skb is free'd.

Just so. Only to make it more complicated, SELinux and Smack,
the two LSMs currently using secmarks, use them differently.
SELinux uses u32 "secids" natively, but Smack suffers serious
performance degradation because it has to look up (efficiently,
but look up nonetheless) the real Smack value on every packet.
Please, I know about hash caches, cache hashes and all sorts
of clever tricks to reduce the impact. Nothing beats having the
end value up front.
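
The difference can be sketched in user space as follows. This does not
reflect Smack's real data structures: struct mini_label, the label
table, the linear lookup and both buffer variants are assumptions made
up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical security label, identified by a numeric secid. */
struct mini_label {
	uint32_t secid;
	const char *name;
};

static const struct mini_label mini_labels[] = {
	{ 1, "alpha" },
	{ 2, "beta" },
	{ 3, "gamma" },
};

/* Variant 1: the buffer carries only a 32-bit id. */
struct skb_with_secid {
	uint32_t secmark;
};

/* Variant 2: the buffer carries the resolved value directly. */
struct skb_with_ptr {
	const struct mini_label *security;
};

/* Every access has to map the id back to the label (NULL if unknown). */
static const struct mini_label *label_from_secid(const struct skb_with_secid *skb)
{
	size_t i;

	assert(skb != NULL);
	for (i = 0; i < sizeof(mini_labels) / sizeof(mini_labels[0]); i++)
		if (mini_labels[i].secid == skb->secmark)
			return &mini_labels[i];
	return NULL;
}

/* No lookup at all: the end value is already there. */
static const struct mini_label *label_from_ptr(const struct skb_with_ptr *skb)
{
	assert(skb != NULL);
	return skb->security;
}
```

A smarter lookup structure shrinks the first variant's cost, but the
second variant has no lookup to optimize.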
Paul Moore July 27, 2021, 2:51 a.m. UTC | #12
On Mon, Jul 26, 2021 at 11:13 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 7/25/2021 3:52 PM, Florian Westphal wrote:
> > Casey Schaufler <casey@schaufler-ca.com> wrote:
> >> RedHat and android use SELinux and will want this. Ubuntu doesn't
> >> yet, but netfilter in in the AppArmor task list. Tizen definitely
> >> uses it with Smack. The notion that security modules are only used
> >> in fringe cases is antiquated.
> > I was not talking about LSM in general, I was referring to the
> > extended info that Paul mentioned.
> >
> > If thats indeed going to be used on every distro then skb extensions
> > are not suitable for this, it would result in extr akmalloc for every
> > skb.
>
> I am explicitly talking about the use of secmarks. All my
> references are uses of secmarks.

I'm talking about a void* which would contain LSM specific data; as I
said earlier, think of inodes.  This LSM specific data would include
the existing secmark data as well as network peer security information
which would finally (!!!) allow us to handle forwarded traffic and
enable a number of other fixes and performance improvements.

(The details are a bit beyond this discussion but it basically
revolves around us not having to investigate and import the packet
headers every time we want to determine the network peer security
attributes, we could store the resolved LSM information in the
sk_buff.security blob.)
Paolo Abeni July 28, 2021, 4:21 p.m. UTC | #13
On Mon, 2021-07-26 at 22:51 -0400, Paul Moore wrote:
> On Mon, Jul 26, 2021 at 11:13 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > On 7/25/2021 3:52 PM, Florian Westphal wrote:
> > > Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > RedHat and android use SELinux and will want this. Ubuntu doesn't
> > > > yet, but netfilter in in the AppArmor task list. Tizen definitely
> > > > uses it with Smack. The notion that security modules are only used
> > > > in fringe cases is antiquated.
> > > I was not talking about LSM in general, I was referring to the
> > > extended info that Paul mentioned.
> > > 
> > > If thats indeed going to be used on every distro then skb extensions
> > > are not suitable for this, it would result in extr akmalloc for every
> > > skb.
> > 
> > I am explicitly talking about the use of secmarks. All my
> > references are uses of secmarks.
> 
> I'm talking about a void* which would contain LSM specific data; as I
> said earlier, think of inodes.  This LSM specific data would include
> the existing secmark data as well as network peer security information
> which would finally (!!!) allow us to handle forwarded traffic and
> enable a number of other fixes and performance improvements.
> 
> (The details are a bit beyond this discussion but it basically
> revolves around us not having to investigate the import the packet
> headers every time we want to determine the network peer security
> attributes, we could store the resolved LSM information in the
> sk_buff.security blob.)

I've investigated the feasibility of extending the secmark field to
long/void*. I think that performance-wise it should be doable on top of
this series: the amount of memory allocated for an sk_buff will not
change, nor the amount of memory memset at skb initialization time.
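
As a general C layout illustration (not a claim about where secmark
actually sits in sk_buff): a 32-bit member placed between
pointer-aligned members can be widened to a pointer without growing the
struct, because it absorbs the adjacent padding. The two structs and
widening_cost() below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A u32 between two pointers; on LP64 it is followed by a 4-byte hole. */
struct layout_u32 {
	void *head;
	uint32_t secmark;
	void *tail;
};

/* Same layout with the u32 widened to a pointer. */
struct layout_ptr {
	void *head;
	void *security;
	void *tail;
};

/* Extra bytes needed by the widened layout. */
static size_t widening_cost(void)
{
	assert(sizeof(struct layout_ptr) >= sizeof(struct layout_u32));
	return sizeof(struct layout_ptr) - sizeof(struct layout_u32);
}
```

On common LP64 and ILP32 ABIs the two structs have the same size, so
the cost is zero; other ABIs may differ.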

I stumbled upon some uAPI issues: CT/nft expose a secmark-related
field via uAPI, and changing its size without breaking existing
user-space looks hard to me.

Additionally, even patch 7/9 is problematic, as there are some
in-kernel users accessing and using the inner_* fields regardless of
skb->encapsulation. That works while the inner_* fields are always
initialized/zeroed, but will break with the mentioned patch. The fix
is doable, but large and complex.

To keep the scope of this series sane, I'll drop all the problematic
patches in the next iteration - that is: no sk_buff layout change at
all.

If there is interest in such a thing, it could still be added
incrementally.

Cheers,

Paolo