diff mbox series

[RFC] audit: Send netlink ACK before setting connection in auditd_set

Message ID 20230922152749.244197-1-chris.riches@nutanix.com (mailing list archive)
State Changes Requested
Delegated to: Paul Moore
Headers show
Series [RFC] audit: Send netlink ACK before setting connection in auditd_set | expand

Commit Message

Chris Riches Sept. 22, 2023, 3:27 p.m. UTC
When auditd_set sets the auditd_conn pointer, audit messages can
immediately be put on the socket by other kernel threads. If the backlog
is large or the rate is high, this can immediately fill the socket
buffer. If the audit daemon requested an ACK for this operation, a full
socket buffer causes the ACK to get dropped, also setting ENOBUFS on the
socket.

To avoid this race and ensure ACKs get through, fast-track the ACK in
this specific case to ensure it is sent before auditd_conn is set.

Signed-off-by: Chris Riches <chris.riches@nutanix.com>

---

This mail is more of an RFC than a patch, though the included patch is a
useful illustation and might even be suitable for inclusion.

We are experiencing strange failures where the audit daemon fails to
start, hitting an ENOBUFS error on its audit_set_pid() call. This can be
reproduced by repeatedly restarting the audit daemon (or just doing a
tight loop of audit_open(), audit_set_pid(pid), audit_set_pid(0),
audit_close()) while the system is under heavy audit load. This also
seems to be dependent on the number of CPUs - we can reproduce this with
2 CPUs but not with 48.

Tracing showed a race between the kernel setting auditd_conn and sending
the ACK, as described in the commit message. The included patch attempts
to fix this by ensuring the ACK is sent before any audit messages can be
put on the socket. This seems to fix the problem for auditd starting,
but strangely I still hit it when running my minimal repro (the tight
loop mentioned above), albeit less frequently. I'm not sure if the patch
or the repro is at fault.

We may also want to look at this from the userspace side in the audit
daemon itself (or more specifically, libaudit). The ACK handling is a
little odd - check_ack() will happily return success without seeing an
ACK if a non-ACK message is top of the socket queue, but will fail if no
message arrives within the timeout. It also of course fails if ENOBUFS
is set on the socket, but this failure only seems to matter when doing
audit_set_pid() - similar failures during main-loop message processing
are logged but otherwise ignored, as far as I can tell.

I'm not sure I quite understand the intentions of the code, but it seems
odd to let ENOBUFS be a fatal error here, given that it likely means the
socket buffer got flooded with audit messages, and thus audit_set_pid()
succeeded. Perhaps we should just ignore ENOBUFS (instead of or as well
as patching the kernel).

Finally, there is another oddity in audit_set_pid() that is tangential
to this discussion but worth highlighting: if the wmode parameter is
WAIT_YES (which it is for the audit daemon), then there is some
additional ACK-handling which waits for 100 milliseconds and eats the
top message of the socket queue if one arrives, without inspecting it.
This seems completely wrong as the ACK will have already been consumed
by check_ack() if there was one, and so the best this code can do is
nothing, and at worst it will swallow a genuine audit message without
ever recording it. If you agree with my assessment of this code, I'm
happy to submit a separate patch to fix this.

Thanks,
Chris

 kernel/audit.c | 30 ++++++++++++++++++++++++------
 kernel/audit.h |  5 +++++
 2 files changed, 29 insertions(+), 6 deletions(-)

Comments

Paul Moore Sept. 26, 2023, 9:27 p.m. UTC | #1
On Fri, Sep 22, 2023 at 11:28 AM Chris Riches <chris.riches@nutanix.com> wrote:
>
> When auditd_set sets the auditd_conn pointer, audit messages can
> immediately be put on the socket by other kernel threads. If the backlog
> is large or the rate is high, this can immediately fill the socket
> buffer. If the audit daemon requested an ACK for this operation, a full
> socket buffer causes the ACK to get dropped, also setting ENOBUFS on the
> socket.
>
> To avoid this race and ensure ACKs get through, fast-track the ACK in
> this specific case to ensure it is sent before auditd_conn is set.
>
> Signed-off-by: Chris Riches <chris.riches@nutanix.com>
>
> ---
>
> This mail is more of an RFC than a patch, though the included patch is a
> useful illustation and might even be suitable for inclusion ...

Hi Chris,

Thanks for the patch and the background information, that's always
helpful.  Unfortunately I have limited network access at the moment,
but I'll put this on my list to look at next week.
Paul Moore Oct. 11, 2023, 5:55 p.m. UTC | #2
On Fri, Sep 22, 2023 at 11:28 AM Chris Riches <chris.riches@nutanix.com> wrote:
>
> When auditd_set sets the auditd_conn pointer, audit messages can
> immediately be put on the socket by other kernel threads. If the backlog
> is large or the rate is high, this can immediately fill the socket
> buffer. If the audit daemon requested an ACK for this operation, a full
> socket buffer causes the ACK to get dropped, also setting ENOBUFS on the
> socket.
>
> To avoid this race and ensure ACKs get through, fast-track the ACK in
> this specific case to ensure it is sent before auditd_conn is set.
>
> Signed-off-by: Chris Riches <chris.riches@nutanix.com>

Thanks for looking into this and submitting a patch Chris!

I think the basic approach is good, but I think we can simply things
slightly by using a resettable boolean as opposed to an integer flag
for the ACK.  I've pasted in a quick, untested patch (below) to better
demonstrate the idea, let me know what you think.  Further thoughts
below ...

> This mail is more of an RFC than a patch, though the included patch is a
> useful illustation and might even be suitable for inclusion.

Regardless of any other issues, I think you brought up a good point
about there being socket buffer contention when the queues are
full(ish) and an audit daemon connects to the kernel and while I'm not
sure that this patch will fully resolve that issue, I do think it
would be good to have (I'm doubtful if it can be fully resolved
without some really awful hacks).

> We are experiencing strange failures where the audit daemon fails to
> start, hitting an ENOBUFS error on its audit_set_pid() call. This can be
> reproduced by repeatedly restarting the audit daemon (or just doing a
> tight loop of audit_open(), audit_set_pid(pid), audit_set_pid(0),
> audit_close()) while the system is under heavy audit load. This also
> seems to be dependent on the number of CPUs - we can reproduce this with
> 2 CPUs but not with 48.

The ENOBUFS errors are coming from the netlink layer and are likely a
sign of extreme load.  I'm not very familiar with the audit userspace
code, but it might be helpful to see if you can increase the socket
buffer size for the audit daemon.

> Tracing showed a race between the kernel setting auditd_conn and sending
> the ACK, as described in the commit message. The included patch attempts
> to fix this by ensuring the ACK is sent before any audit messages can be
> put on the socket. This seems to fix the problem for auditd starting,
> but strangely I still hit it when running my minimal repro (the tight
> loop mentioned above), albeit less frequently. I'm not sure if the patch
> or the repro is at fault.

Sending a zero PID value, e.g. audit_set_pid(0), is going to cause the
kernel to reset/drop the connection with the audit daemon; it's
essentially the same as an orderly shutdown of the audit daemon.
However, if you do it in a tight loop alternating between
audit_set_pid(X) and audit_set_pid(0) you are basically thrashing the
audit daemon connection state in the kernel which is in turn likely
causing odd things to happen both with the kernel's audit queues and
the auditd socket buffer.

While the userspace reproducer is valid, it isn't something I would
expect in normal use as simply starting and stopping the audit daemon
adds a significant delay between disconnecting and reconnecting.

I'm also not surprised to hear that as you increase the number of CPUs
the problem goes away.  With more CPUs the system is able to schedule
more threads simultaneously to maintain the kernel's audit queues and
execute the audit daemon to drain both the netlink socket buffer and
audit queues.

> We may also want to look at this from the userspace side in the audit
> daemon itself (or more specifically, libaudit). The ACK handling is a
> little odd - check_ack() will happily return success without seeing an
> ACK if a non-ACK message is top of the socket queue, but will fail if no
> message arrives within the timeout. It also of course fails if ENOBUFS
> is set on the socket, but this failure only seems to matter when doing
> audit_set_pid() - similar failures during main-loop message processing
> are logged but otherwise ignored, as far as I can tell.

I suggest bringing this up with the audit userspace developer if you
haven't already.  While we can, and should, improve things on the
kernel side (e.g. the patch we are discussing), it also sounds like
the userspace side has room for improvement as well.

Here is the patch I mentioned above:

diff --git a/kernel/audit.c b/kernel/audit.c
index 16205dd29843..49f18003c1ac 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -487,15 +487,19 @@ static void auditd_conn_free(struct rcu_head *rcu)
 * @pid: auditd PID
 * @portid: auditd netlink portid
 * @net: auditd network namespace pointer
+ * @skb: the netlink command from the audit daemon
+ * @ack: netlink ack flag, cleared if ack'd here
 *
 * Description:
 * This function will obtain and drop network namespace references as
 * necessary.  Returns zero on success, negative values on failure.
 */
-static int auditd_set(struct pid *pid, u32 portid, struct net *net)
+static int auditd_set(struct pid *pid, u32 portid,
+                     struct net *net, struct sk_buff *skb, bool *ack)
{
       unsigned long flags;
       struct auditd_connection *ac_old, *ac_new;
+       struct nlmsghdr *nlh;

       if (!pid || !net)
               return -EINVAL;
@@ -507,6 +511,13 @@ static int auditd_set(struct pid *pid, u32 portid, struct n
et *net)
       ac_new->portid = portid;
       ac_new->net = get_net(net);

+       /* send the ack now to avoid a race with the queue backlog */
+       if (*ack) {
+               nlh = nlmsg_hdr(skb);
+               netlink_ack(skb, nlh, 0, NULL);
+               *ack = false;
+       }
+
       spin_lock_irqsave(&auditd_conn_lock, flags);
       ac_old = rcu_dereference_protected(auditd_conn,
                                          lockdep_is_held(&auditd_conn_lock));
@@ -1200,7 +1211,8 @@ static int audit_replace(struct pid *pid)
       return auditd_send_unicast_skb(skb);
}

-static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
+static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
+                            bool *ack)
{
       u32                     seq;
       void                    *data;
@@ -1293,7 +1305,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct n
lmsghdr *nlh)
                               /* register a new auditd connection */
                               err = auditd_set(req_pid,
                                                NETLINK_CB(skb).portid,
-                                                sock_net(NETLINK_CB(skb).sk));
+                                                sock_net(NETLINK_CB(skb).sk),
+                                                skb, ack);
                               if (audit_enabled != AUDIT_OFF)
                                       audit_log_config_change("audit_pid",
                                                               new_pid,
@@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct
nlmsghdr *nlh)
 * Parse the provided skb and deal with any messages that may be present,
 * malformed skbs are discarded.
 */
-static void audit_receive(struct sk_buff  *skb)
+static void audit_receive(struct sk_buff *skb)
{
       struct nlmsghdr *nlh;
+       bool ack;
       /*
        * len MUST be signed for nlmsg_next to be able to dec it below 0
        * if the nlmsg_len was not aligned
@@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff  *skb)

       audit_ctl_lock();
       while (nlmsg_ok(nlh, len)) {
-               err = audit_receive_msg(skb, nlh);
-               /* if err or if this message says it wants a response */
-               if (err || (nlh->nlmsg_flags & NLM_F_ACK))
+               ack = nlh->nlmsg_flags & NLM_F_ACK;
+               err = audit_receive_msg(skb, nlh, &ack);
+
+               /* if audit_receive_msg() clears @ack then we never want to
+                * send an ack here, otherwise send an ack on error or if
+                * requested */
+               if (ack && (err || (nlh->nlmsg_flags & NLM_F_ACK)))
                       netlink_ack(skb, nlh, err, NULL);

               nlh = nlmsg_next(nlh, &len);
Chris Riches Oct. 16, 2023, 5:11 p.m. UTC | #3
Hi Paul, thanks for the reply.


 > I think the basic approach is good, but I think we can simply things
 > slightly by using a resettable boolean as opposed to an integer flag
 > for the ACK.  I've pasted in a quick, untested patch (below) to better
 > demonstrate the idea, let me know what you think.

The simplified patch doesn't look quite right. I had some trouble
getting it to apply (seems tabs were changed into spaces, even when I
downloaded the raw email). While typing it out manually, I noticed that
the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
begin with, then ack will be false to begin with, and so no ACK will be
sent even if there is an error. Handling this case is why I used the
three-state system to begin with. I think we could still use a boolean
with a condition of just (err || ack), with the caveat that we wouldn't
easily be able to send an early ACK on an error (not that the current
patch needs this - just thinking of reusability).


 > Regardless of any other issues, I think you brought up a good point
 > about there being socket buffer contention when the queues are
 > full(ish) and an audit daemon connects to the kernel and while I'm not
 > sure that this patch will fully resolve that issue, I do think it
 > would be good to have (I'm doubtful if it can be fully resolved
 > without some really awful hacks).

That roughly lines up with what I was seeing - could you elaborate any
further on what "awful hacks" would be needed to fully resolve this?
I'm intrigued as to where else the contention could be coming from that
wouldn't be covered by this patch.


 > The ENOBUFS errors are coming from the netlink layer and are likely a
 > sign of extreme load.  I'm not very familiar with the audit userspace
 > code, but it might be helpful to see if you can increase the socket
 > buffer size for the audit daemon.

I believe we tried increasing the buffer size in the toy repro and it
didn't make much difference - perhaps doing it in the real audit daemon
might help.


 > I'm also not surprised to hear that as you increase the number of CPUs
 > the problem goes away.  With more CPUs the system is able to schedule
 > more threads simultaneously to maintain the kernel's audit queues and
 > execute the audit daemon to drain both the netlink socket buffer and
 > audit queues.

My intuition told me the opposite - that more parallel activity would
increase the chance of the socket buffer being crammed full before the
one thread could return to the audit daemon and allow it to start
processing events. Does the size/number of audit queues perhaps scale
with the number of CPUs?


 > I suggest bringing this up with the audit userspace developer if you
 > haven't already.  While we can, and should, improve things on the
 > kernel side (e.g. the patch we are discussing), it also sounds like
 > the userspace side has room for improvement as well.

That sounds like a good idea - can you point me to who that is? I was
under the impression that this was the mailing list for both the kernel
and userspace sides.


Thanks,
Chris
Paul Moore Oct. 16, 2023, 8:16 p.m. UTC | #4
On Mon, Oct 16, 2023 at 1:12 PM Chris Riches <chris.riches@nutanix.com> wrote:
>

Thanks for trimming the email in your reply, however, it is helpful to
preserve those "On Mon, Oct ..." headers for those emails which you
include in your reply, it helps keep things straight when reading the
email.  Not a big deal, just something to keep in mind for next time.

>  > I think the basic approach is good, but I think we can simply things
>  > slightly by using a resettable boolean as opposed to an integer flag
>  > for the ACK.  I've pasted in a quick, untested patch (below) to better
>  > demonstrate the idea, let me know what you think.
>
> The simplified patch doesn't look quite right. I had some trouble
> getting it to apply (seems tabs were changed into spaces, even when I
> downloaded the raw email).

I should have been more clear, that's what just a quick hack that I
cut-n-pasted into the email body, whitespace damage was a given.
Typically if I include a patch with the qualification that it is
untested, you can expect problems :) but I'll try to make the pitfalls
more explicit in the future.

I usually include these hacky patches simply as a way to help clear up
any misunderstandings that might happen when trying to explain an
approach using English descriptions.  Much in the same way we say that
a picture is worth a thousand words, I believe a patch, even a
relatively crude one, is worth at least as many words as a picture :)

> While typing it out manually, I noticed that
> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
> begin with, then ack will be false to begin with, and so no ACK will be
> sent even if there is an error.

Good point.  I'll just casually remind you that I did say "untested" ;)

I believe the following should work as intended (untested, cut-n-paste, etc.):

@@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct
nlmsghdr *nlh)
 * Parse the provided skb and deal with any messages that may be present,
 * malformed skbs are discarded.
 */
-static void audit_receive(struct sk_buff  *skb)
+static void audit_receive(struct sk_buff *skb)
{
       struct nlmsghdr *nlh;
+       bool ack;
       /*
        * len MUST be signed for nlmsg_next to be able to dec it below 0
        * if the nlmsg_len was not aligned
@@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff  *skb)

       audit_ctl_lock();
       while (nlmsg_ok(nlh, len)) {
-               err = audit_receive_msg(skb, nlh);
-               /* if err or if this message says it wants a response */
-               if (err || (nlh->nlmsg_flags & NLM_F_ACK))
+               ack = true;
+               err = audit_receive_msg(skb, nlh, &ack);
+
+               /* if audit_receive_msg() clears @ack then we never want to
+                * send an ack here, otherwise send an ack on error or if
+                * requested */
+               if (ack && (err || (nlh->nlmsg_flags & NLM_F_ACK)))
                       netlink_ack(skb, nlh, err, NULL);

> Handling this case is why I used the
> three-state system to begin with. I think we could still use a boolean
> with a condition of just (err || ack), with the caveat that we wouldn't
> easily be able to send an early ACK on an error (not that the current
> patch needs this - just thinking of reusability).

While there may be three states (never ACK, ACK on error, always ACK),
the audit_receive_msg() function only needs to report back one of two
states: never ACK or ACK if desired.

>  > Regardless of any other issues, I think you brought up a good point
>  > about there being socket buffer contention when the queues are
>  > full(ish) and an audit daemon connects to the kernel and while I'm not
>  > sure that this patch will fully resolve that issue, I do think it
>  > would be good to have (I'm doubtful if it can be fully resolved
>  > without some really awful hacks).
>
> That roughly lines up with what I was seeing - could you elaborate any
> further on what "awful hacks" would be needed to fully resolve this?

I'm not sure I can recall everything from when I was thinking about
this previously (that was about a week ago), but my quick thoughts
right now are that you would need a lot more information and/or
handshakes between the kernel and the daemon.

Unfortunately, both the current audit design and implementation is
seriously flawed in a number of areas.  One of these areas is the fact
that data and control messages are sent using the same data flow.

>  > The ENOBUFS errors are coming from the netlink layer and are likely a
>  > sign of extreme load.  I'm not very familiar with the audit userspace
>  > code, but it might be helpful to see if you can increase the socket
>  > buffer size for the audit daemon.
>
> I believe we tried increasing the buffer size in the toy repro and it
> didn't make much difference - perhaps doing it in the real audit daemon
> might help.

The reproducer is an artificial worst case since you are disconnecting
and reconnecting without a process shutdown and startup in between.
The reproducer just hammers the connection status as fast as the CPU
can generate netlink messages; I'm not certain there is a
netlink/socket buffer size that would fully resolve this for the
reproducer.

>  > I'm also not surprised to hear that as you increase the number of CPUs
>  > the problem goes away.  With more CPUs the system is able to schedule
>  > more threads simultaneously to maintain the kernel's audit queues and
>  > execute the audit daemon to drain both the netlink socket buffer and
>  > audit queues.
>
> My intuition told me the opposite - that more parallel activity would
> increase the chance of the socket buffer being crammed full before the
> one thread could return to the audit daemon and allow it to start
> processing events.

The audit subsystem can handle a full buffer, that has been stressed a
lot over the years, and while it may slow the system down it will
continue to operate.  As a fun exercise, configure the audit subsystem
to audit *every* syscall and then try to shut the system down.  It
will complete, but it's ugly and you'll see the audit subsystem
complain a *lot* about dropped records due to queue/backlog overflows.

The issue isn't so much about the queues overflowing inside the
kernel, it's about being able to schedule the audit daemon and/or
kernel thread to service the flood of connection
disconnects/reconnects coming from the reproducer.

> Does the size/number of audit queues perhaps scale
> with the number of CPUs?

The number of queues is static, and their size is determined at
runtime by the configuration.  Making per-CPU queues would be a bit of
a mess due to our desire to ensure an accurate ordering of events
relative to a wall clock.

>  > I suggest bringing this up with the audit userspace developer if you
>  > haven't already.  While we can, and should, improve things on the
>  > kernel side (e.g. the patch we are discussing), it also sounds like
>  > the userspace side has room for improvement as well.
>
> That sounds like a good idea - can you point me to who that is? I was
> under the impression that this was the mailing list for both the kernel
> and userspace sides.

That used to be the case, but unfortunately the audit userspace
developer who manages the original audit mailing list was unwilling to
make the list configuration changes we needed to play nice with
existing Linux kernel mailing list conventions so we needed to move
the kernel development discussions to a separate list.

The old audit mailing list, where the userspace development is still
discussed, can be found here:

* http://www.redhat.com/mailman/listinfo/linux-audit
Chris Riches Oct. 17, 2023, 1:49 p.m. UTC | #5
On 16/10/2023 21:16, Paul Moore wrote:

 > Thanks for trimming the email in your reply, however, it is helpful to
 > preserve those "On Mon, Oct ..." headers for those emails which you
 > include in your reply, it helps keep things straight when reading the
 > email.  Not a big deal, just something to keep in mind for next time.

Thanks for the pointer - I'm new to these mailing lists so appreciate
the advice.


 > I should have been more clear, that's what just a quick hack that I
 > cut-n-pasted into the email body, whitespace damage was a given.
 > Typically if I include a patch with the qualification that it is
 > untested, you can expect problems :) but I'll try to make the pitfalls
 > more explicit in the future.

Gotcha.


 >> While typing it out manually, I noticed that
 >> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
 >> begin with, then ack will be false to begin with, and so no ACK will be
 >> sent even if there is an error.
 >
 > Good point.  I'll just casually remind you that I did say "untested" ;)
 >
 > I believe the following should work as intended (untested, 
cut-n-paste, etc.):
 > .....

I think ack must be set to NLM_F_ACK initially - otherwise auditd_set
will always send the fast-tracked ACK even if the caller did not
request one. The following is a concrete version of what I roughly
suggested in the last email - is there a specific problem you see with
the (ack || err) condition?

@@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, 
struct
nlmsghdr *nlh)
  * Parse the provided skb and deal with any messages that may be present,
  * malformed skbs are discarded.
  */
-static void audit_receive(struct sk_buff  *skb)
+static void audit_receive(struct sk_buff *skb)
  {
      struct nlmsghdr *nlh;
+    bool ack;
      /*
       * len MUST be signed for nlmsg_next to be able to dec it below 0
       * if the nlmsg_len was not aligned
@@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff *skb)

      audit_ctl_lock();
      while (nlmsg_ok(nlh, len)) {
-        err = audit_receive_msg(skb, nlh);
-        /* if err or if this message says it wants a response */
-        if (err || (nlh->nlmsg_flags & NLM_F_ACK))
+        ack = nlh->nlmsg_flags & NLM_F_ACK;
+        err = audit_receive_msg(skb, nlh, &ack);
+
+        /* Send an ack if @ack is still true after audit_receive_msg
+         * potentially cleared it, or if there was an error. */
+        if (ack || err)
              netlink_ack(skb, nlh, err, NULL);


 > I'm not sure I can recall everything from when I was thinking about
 > this previously (that was about a week ago), but my quick thoughts
 > right now are that you would need a lot more information and/or
 > handshakes between the kernel and the daemon.
 >
 > Unfortunately, both the current audit design and implementation is
 > seriously flawed in a number of areas.  One of these areas is the fact
 > that data and control messages are sent using the same data flow.

Makes sense. The question of why there isn't a separate control socket
was one of the first we asked while looking into this.


 > The issue isn't so much about the queues overflowing inside the
 > kernel, it's about being able to schedule the audit daemon and/or
 > kernel thread to service the flood of connection
 > disconnects/reconnects coming from the reproducer.

Right, makes sense.


 > The old audit mailing list, where the userspace development is still
 > discussed, can be found here:
 > ...

Thanks. I'll post there too.


- Chris
Paul Moore Oct. 17, 2023, 9:20 p.m. UTC | #6
On Tue, Oct 17, 2023 at 9:49 AM Chris Riches <chris.riches@nutanix.com> wrote:
> On 16/10/2023 21:16, Paul Moore wrote:
>  >> While typing it out manually, I noticed that
>  >> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
>  >> begin with, then ack will be false to begin with, and so no ACK will be
>  >> sent even if there is an error.
>  >
>  > Good point.  I'll just casually remind you that I did say "untested" ;)
>  >
>  > I believe the following should work as intended (untested,
> cut-n-paste, etc.):
>  > .....
>
> I think ack must be set to NLM_F_ACK initially - otherwise auditd_set
> will always send the fast-tracked ACK even if the caller did not
> request one.

Another good point.  I think I'm working on too many different things
and forgetting the details between our emails :)  Sorry about that.

> The following is a concrete version of what I roughly
> suggested in the last email - is there a specific problem you see with
> the (ack || err) condition?

I'm happier with the bool over the integer type for the @ack variable.
I'd suggest updating the patch and posting it again so we can review
it in full, but we weren't very far off last time so I think it should
be close, if not acceptable on the next revision.

>  > I'm not sure I can recall everything from when I was thinking about
>  > this previously (that was about a week ago), but my quick thoughts
>  > right now are that you would need a lot more information and/or
>  > handshakes between the kernel and the daemon.
>  >
>  > Unfortunately, both the current audit design and implementation is
>  > seriously flawed in a number of areas.  One of these areas is the fact
>  > that data and control messages are sent using the same data flow.
>
> Makes sense. The question of why there isn't a separate control socket
> was one of the first we asked while looking into this.

Yeah, there is a lot I would have done differently, but that's the way
it goes sometimes.  I have plans to find some of the larger problems,
I just need to find some time to work on it.
diff mbox series

Patch

diff --git a/kernel/audit.c b/kernel/audit.c
index 9bc0b0301198..c48c778b6f89 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -488,12 +488,16 @@  static void auditd_conn_free(struct rcu_head *rcu)
  * @pid: auditd PID
  * @portid: auditd netlink portid
  * @net: auditd network namespace pointer
+ * @ack_status: if AUDIT_ACK_ALWAYS, send ACK before setting connection and
+ *              set to AUDIT_ACK_DONE
+ * @skb: socket buffer for sending ACK
  *
  * Description:
  * This function will obtain and drop network namespace references as
  * necessary.  Returns zero on success, negative values on failure.
  */
-static int auditd_set(struct pid *pid, u32 portid, struct net *net)
+static int auditd_set(struct pid *pid, u32 portid, struct net *net,
+                      int *ack_status, struct sk_buff *skb)
 {
 	unsigned long flags;
 	struct auditd_connection *ac_old, *ac_new;
@@ -508,6 +512,13 @@  static int auditd_set(struct pid *pid, u32 portid, struct net *net)
 	ac_new->portid = portid;
 	ac_new->net = get_net(net);
 
+	if (*ack_status == AUDIT_ACK_ALWAYS) {
+		// Send ACK before we set auditd_conn. Otherwise, the socket buffer may
+		// get filled with backlogged audit messages causing the ACK to be dropped.
+		netlink_ack(skb, nlmsg_hdr(skb), 0, NULL);
+		*ack_status = AUDIT_ACK_DONE;
+	}
+
 	spin_lock_irqsave(&auditd_conn_lock, flags);
 	ac_old = rcu_dereference_protected(auditd_conn,
 					   lockdep_is_held(&auditd_conn_lock));
@@ -1201,7 +1212,7 @@  static int audit_replace(struct pid *pid)
 	return auditd_send_unicast_skb(skb);
 }
 
-static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
+static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh, int *ack_status)
 {
 	u32			seq;
 	void			*data;
@@ -1294,7 +1305,9 @@  static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 				/* register a new auditd connection */
 				err = auditd_set(req_pid,
 						 NETLINK_CB(skb).portid,
-						 sock_net(NETLINK_CB(skb).sk));
+						 sock_net(NETLINK_CB(skb).sk),
+						 ack_status,
+						 skb);
 				if (audit_enabled != AUDIT_OFF)
 					audit_log_config_change("audit_pid",
 								new_pid,
@@ -1548,15 +1561,20 @@  static void audit_receive(struct sk_buff  *skb)
 	 */
 	int len;
 	int err;
+	int ack_status = AUDIT_ACK_ON_ERR;
 
 	nlh = nlmsg_hdr(skb);
 	len = skb->len;
 
 	audit_ctl_lock();
 	while (nlmsg_ok(nlh, len)) {
-		err = audit_receive_msg(skb, nlh);
-		/* if err or if this message says it wants a response */
-		if (err || (nlh->nlmsg_flags & NLM_F_ACK))
+		if (nlh->nlmsg_flags & NLM_F_ACK)
+			ack_status = AUDIT_ACK_ALWAYS;
+
+		err = audit_receive_msg(skb, nlh, &ack_status);
+
+		/* send ACK if err or the message asked for one (and not already sent) */
+		if (ack_status == AUDIT_ACK_ALWAYS || (ack_status == AUDIT_ACK_ON_ERR && err))
 			netlink_ack(skb, nlh, err, NULL);
 
 		nlh = nlmsg_next(nlh, &len);
diff --git a/kernel/audit.h b/kernel/audit.h
index 94738bce40b2..bc692f60567a 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -344,4 +344,9 @@  extern int audit_filter(int msgtype, unsigned int listtype);
 extern void audit_ctl_lock(void);
 extern void audit_ctl_unlock(void);
 
+/* Control netlink ACK behaviour in audit_receive. */
+#define AUDIT_ACK_ON_ERR 1
+#define AUDIT_ACK_ALWAYS 2
+#define AUDIT_ACK_DONE 3
+
 #endif