Message ID | 20230922152749.244197-1-chris.riches@nutanix.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Paul Moore |
Headers | show |
Series | [RFC] audit: Send netlink ACK before setting connection in auditd_set | expand |
On Fri, Sep 22, 2023 at 11:28 AM Chris Riches <chris.riches@nutanix.com> wrote: > > When auditd_set sets the auditd_conn pointer, audit messages can > immediately be put on the socket by other kernel threads. If the backlog > is large or the rate is high, this can immediately fill the socket > buffer. If the audit daemon requested an ACK for this operation, a full > socket buffer causes the ACK to get dropped, also setting ENOBUFS on the > socket. > > To avoid this race and ensure ACKs get through, fast-track the ACK in > this specific case to ensure it is sent before auditd_conn is set. > > Signed-off-by: Chris Riches <chris.riches@nutanix.com> > > --- > > This mail is more of an RFC than a patch, though the included patch is a > useful illustation and might even be suitable for inclusion ... Hi Chris, Thanks for the patch and the background information, that's always helpful. Unfortunately I have limited network access at the moment, but I'll put this on my list to look at next week.
On Fri, Sep 22, 2023 at 11:28 AM Chris Riches <chris.riches@nutanix.com> wrote: > > When auditd_set sets the auditd_conn pointer, audit messages can > immediately be put on the socket by other kernel threads. If the backlog > is large or the rate is high, this can immediately fill the socket > buffer. If the audit daemon requested an ACK for this operation, a full > socket buffer causes the ACK to get dropped, also setting ENOBUFS on the > socket. > > To avoid this race and ensure ACKs get through, fast-track the ACK in > this specific case to ensure it is sent before auditd_conn is set. > > Signed-off-by: Chris Riches <chris.riches@nutanix.com> Thanks for looking into this and submitting a patch Chris! I think the basic approach is good, but I think we can simply things slightly by using a resettable boolean as opposed to an integer flag for the ACK. I've pasted in a quick, untested patch (below) to better demonstrate the idea, let me know what you think. Further thoughts below ... > This mail is more of an RFC than a patch, though the included patch is a > useful illustation and might even be suitable for inclusion. Regardless of any other issues, I think you brought up a good point about there being socket buffer contention when the queues are full(ish) and an audit daemon connects to the kernel and while I'm not sure that this patch will fully resolve that issue, I do think it would be good to have (I'm doubtful if it can be fully resolved without some really awful hacks). > We are experiencing strange failures where the audit daemon fails to > start, hitting an ENOBUFS error on its audit_set_pid() call. This can be > reproduced by repeatedly restarting the audit daemon (or just doing a > tight loop of audit_open(), audit_set_pid(pid), audit_set_pid(0), > audit_close()) while the system is under heavy audit load. This also > seems to be dependent on the number of CPUs - we can reproduce this with > 2 CPUs but not with 48. The ENOBUFS errors are coming from the netlink layer and are likely a sign of extreme load. I'm not very familiar with the audit userspace code, but it might be helpful to see if you can increase the socket buffer size for the audit daemon. > Tracing showed a race between the kernel setting auditd_conn and sending > the ACK, as described in the commit message. The included patch attempts > to fix this by ensuring the ACK is sent before any audit messages can be > put on the socket. This seems to fix the problem for auditd starting, > but strangely I still hit it when running my minimal repro (the tight > loop mentioned above), albeit less frequently. I'm not sure if the patch > or the repro is at fault. Sending a zero PID value, e.g. audit_set_pid(0), is going to cause the kernel to reset/drop the connection with the audit daemon; it's essentially the same as an orderly shutdown of the audit daemon. However, if you do it in a tight loop alternating between audit_set_pid(X) and audit_set_pid(0) you are basically thrashing the audit daemon connection state in the kernel which is in turn likely causing odd things to happen both with the kernel's audit queues and the auditd socket buffer. While the userspace reproducer is valid, it isn't something I would expect in normal use as simply starting and stopping the audit daemon adds a significant delay between disconnecting and reconnecting. I'm also not surprised to hear that as you increase the number of CPUs the problem goes away. With more CPUs the system is able to schedule more threads simultaneously to maintain the kernel's audit queues and execute the audit daemon to drain both the netlink socket buffer and audit queues. > We may also want to look at this from the userspace side in the audit > daemon itself (or more specifically, libaudit). The ACK handling is a > little odd - check_ack() will happily return success without seeing an > ACK if a non-ACK message is top of the socket queue, but will fail if no > message arrives within the timeout. It also of course fails if ENOBUFS > is set on the socket, but this failure only seems to matter when doing > audit_set_pid() - similar failures during main-loop message processing > are logged but otherwise ignored, as far as I can tell. I suggest bringing this up with the audit userspace developer if you haven't already. While we can, and should, improve things on the kernel side (e.g. the patch we are discussing), it also sounds like the userspace side has room for improvement as well. Here is the patch I mentioned above: diff --git a/kernel/audit.c b/kernel/audit.c index 16205dd29843..49f18003c1ac 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -487,15 +487,19 @@ static void auditd_conn_free(struct rcu_head *rcu) * @pid: auditd PID * @portid: auditd netlink portid * @net: auditd network namespace pointer + * @skb: the netlink command from the audit daemon + * @ack: netlink ack flag, cleared if ack'd here * * Description: * This function will obtain and drop network namespace references as * necessary. Returns zero on success, negative values on failure. */ -static int auditd_set(struct pid *pid, u32 portid, struct net *net) +static int auditd_set(struct pid *pid, u32 portid, + struct net *net, struct sk_buff *skb, bool *ack) { unsigned long flags; struct auditd_connection *ac_old, *ac_new; + struct nlmsghdr *nlh; if (!pid || !net) return -EINVAL; @@ -507,6 +511,13 @@ static int auditd_set(struct pid *pid, u32 portid, struct n et *net) ac_new->portid = portid; ac_new->net = get_net(net); + /* send the ack now to avoid a race with the queue backlog */ + if (*ack) { + nlh = nlmsg_hdr(skb); + netlink_ack(skb, nlh, 0, NULL); + *ack = false; + } + spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); @@ -1200,7 +1211,8 @@ static int audit_replace(struct pid *pid) return auditd_send_unicast_skb(skb); } -static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh, + bool *ack) { u32 seq; void *data; @@ -1293,7 +1305,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct n lmsghdr *nlh) /* register a new auditd connection */ err = auditd_set(req_pid, NETLINK_CB(skb).portid, - sock_net(NETLINK_CB(skb).sk)); + sock_net(NETLINK_CB(skb).sk), + skb, ack); if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, @@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ -static void audit_receive(struct sk_buff *skb) +static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; + bool ack; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned @@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff *skb) audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { - err = audit_receive_msg(skb, nlh); - /* if err or if this message says it wants a response */ - if (err || (nlh->nlmsg_flags & NLM_F_ACK)) + ack = nlh->nlmsg_flags & NLM_F_ACK; + err = audit_receive_msg(skb, nlh, &ack); + + /* if audit_receive_msg() clears @ack then we never want to + * send an ack here, otherwise send an ack on error or if + * requested */ + if (ack && (err || (nlh->nlmsg_flags & NLM_F_ACK))) netlink_ack(skb, nlh, err, NULL); nlh = nlmsg_next(nlh, &len);
Hi Paul, thanks for the reply. > I think the basic approach is good, but I think we can simply things > slightly by using a resettable boolean as opposed to an integer flag > for the ACK. I've pasted in a quick, untested patch (below) to better > demonstrate the idea, let me know what you think. The simplified patch doesn't look quite right. I had some trouble getting it to apply (seems tabs were changed into spaces, even when I downloaded the raw email). While typing it out manually, I noticed that the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to begin with, then ack will be false to begin with, and so no ACK will be sent even if there is an error. Handling this case is why I used the three-state system to begin with. I think we could still use a boolean with a condition of just (err || ack), with the caveat that we wouldn't easily be able to send an early ACK on an error (not that the current patch needs this - just thinking of reusability). > Regardless of any other issues, I think you brought up a good point > about there being socket buffer contention when the queues are > full(ish) and an audit daemon connects to the kernel and while I'm not > sure that this patch will fully resolve that issue, I do think it > would be good to have (I'm doubtful if it can be fully resolved > without some really awful hacks). That roughly lines up with what I was seeing - could you elaborate any further on what "awful hacks" would be needed to fully resolve this? I'm intrigued as to where else the contention could be coming from that wouldn't be covered by this patch. > The ENOBUFS errors are coming from the netlink layer and are likely a > sign of extreme load. I'm not very familiar with the audit userspace > code, but it might be helpful to see if you can increase the socket > buffer size for the audit daemon. I believe we tried increasing the buffer size in the toy repro and it didn't make much difference - perhaps doing it in the real audit daemon might help. > I'm also not surprised to hear that as you increase the number of CPUs > the problem goes away. With more CPUs the system is able to schedule > more threads simultaneously to maintain the kernel's audit queues and > execute the audit daemon to drain both the netlink socket buffer and > audit queues. My intuition told me the opposite - that more parallel activity would increase the chance of the socket buffer being crammed full before the one thread could return to the audit daemon and allow it to start processing events. Does the size/number of audit queues perhaps scale with the number of CPUs? > I suggest bringing this up with the audit userspace developer if you > haven't already. While we can, and should, improve things on the > kernel side (e.g. the patch we are discussing), it also sounds like > the userspace side has room for improvement as well. That sounds like a good idea - can you point me to who that is? I was under the impression that this was the mailing list for both the kernel and userspace sides. Thanks, Chris
On Mon, Oct 16, 2023 at 1:12 PM Chris Riches <chris.riches@nutanix.com> wrote: > Thanks for trimming the email in your reply, however, it is helpful to preserve those "On Mon, Oct ..." headers for those emails which you include in your reply, it helps keep things straight when reading the email. Not a big deal, just something to keep in mind for next time. > > I think the basic approach is good, but I think we can simply things > > slightly by using a resettable boolean as opposed to an integer flag > > for the ACK. I've pasted in a quick, untested patch (below) to better > > demonstrate the idea, let me know what you think. > > The simplified patch doesn't look quite right. I had some trouble > getting it to apply (seems tabs were changed into spaces, even when I > downloaded the raw email). I should have been more clear, that's what just a quick hack that I cut-n-pasted into the email body, whitespace damage was a given. Typically if I include a patch with the qualification that it is untested, you can expect problems :) but I'll try to make the pitfalls more explicit in the future. I usually include these hacky patches simply as a way to help clear up any misunderstandings that might happen when trying to explain an approach using English descriptions. Much in the same way we say that a picture is worth a thousand words, I believe a patch, even a relatively crude one, is worth at least as many words as a picture :) > While typing it out manually, I noticed that > the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to > begin with, then ack will be false to begin with, and so no ACK will be > sent even if there is an error. Good point. I'll just casually remind you that I did say "untested" ;) I believe the following should work as intended (untested, cut-n-paste, etc.): @@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ -static void audit_receive(struct sk_buff *skb) +static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; + bool ack; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned @@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff *skb) audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { - err = audit_receive_msg(skb, nlh); - /* if err or if this message says it wants a response */ - if (err || (nlh->nlmsg_flags & NLM_F_ACK)) + ack = true; + err = audit_receive_msg(skb, nlh, &ack); + + /* if audit_receive_msg() clears @ack then we never want to + * send an ack here, otherwise send an ack on error or if + * requested */ + if (ack && (err || (nlh->nlmsg_flags & NLM_F_ACK))) netlink_ack(skb, nlh, err, NULL); > Handling this case is why I used the > three-state system to begin with. I think we could still use a boolean > with a condition of just (err || ack), with the caveat that we wouldn't > easily be able to send an early ACK on an error (not that the current > patch needs this - just thinking of reusability). While there may be three states (never ACK, ACK on error, always ACK), the audit_receive_msg() function only needs to report back one of two states: never ACK or ACK if desired. > > Regardless of any other issues, I think you brought up a good point > > about there being socket buffer contention when the queues are > > full(ish) and an audit daemon connects to the kernel and while I'm not > > sure that this patch will fully resolve that issue, I do think it > > would be good to have (I'm doubtful if it can be fully resolved > > without some really awful hacks). > > That roughly lines up with what I was seeing - could you elaborate any > further on what "awful hacks" would be needed to fully resolve this? I'm not sure I can recall everything from when I was thinking about this previously (that was about a week ago), but my quick thoughts right now are that you would need a lot more information and/or handshakes between the kernel and the daemon. Unfortunately, both the current audit design and implementation is seriously flawed in a number of areas. One of these areas is the fact that data and control messages are sent using the same data flow. > > The ENOBUFS errors are coming from the netlink layer and are likely a > > sign of extreme load. I'm not very familiar with the audit userspace > > code, but it might be helpful to see if you can increase the socket > > buffer size for the audit daemon. > > I believe we tried increasing the buffer size in the toy repro and it > didn't make much difference - perhaps doing it in the real audit daemon > might help. The reproducer is an artificial worst case since you are disconnecting and reconnecting without a process shutdown and startup in between. The reproducer just hammers the connection status as fast as the CPU can generate netlink messages; I'm not certain there is a netlink/socket buffer size that would fully resolve this for the reproducer. > > I'm also not surprised to hear that as you increase the number of CPUs > > the problem goes away. With more CPUs the system is able to schedule > > more threads simultaneously to maintain the kernel's audit queues and > > execute the audit daemon to drain both the netlink socket buffer and > > audit queues. > > My intuition told me the opposite - that more parallel activity would > increase the chance of the socket buffer being crammed full before the > one thread could return to the audit daemon and allow it to start > processing events. The audit subsystem can handle a full buffer, that has been stressed a lot over the years, and while it may slow the system down it will continue to operate. As a fun exercise, configure the audit subsystem to audit *every* syscall and then try to shut the system down. It will complete, but it's ugly and you'll see the audit subsystem complain a *lot* about dropped records due to queue/backlog overflows. The issue isn't so much about the queues overflowing inside the kernel, it's about being able to schedule the audit daemon and/or kernel thread to service the flood of connection disconnects/reconnects coming from the reproducer. > Does the size/number of audit queues perhaps scale > with the number of CPUs? The number of queues is static, and their size is determined at runtime by the configuration. Making per-CPU queues would be a bit of a mess due to our desire to ensure an accurate ordering of events relative to a wall clock. > > I suggest bringing this up with the audit userspace developer if you > > haven't already. While we can, and should, improve things on the > > kernel side (e.g. the patch we are discussing), it also sounds like > > the userspace side has room for improvement as well. > > That sounds like a good idea - can you point me to who that is? I was > under the impression that this was the mailing list for both the kernel > and userspace sides. That used to be the case, but unfortunately the audit userspace developer who manages the original audit mailing list was unwilling to make the list configuration changes we needed to play nice with existing Linux kernel mailing list conventions so we needed to move the kernel development discussions to a separate list. The old audit mailing list, where the userspace development is still discussed, can be found here: * http://www.redhat.com/mailman/listinfo/linux-audit
On 16/10/2023 21:16, Paul Moore wrote: > Thanks for trimming the email in your reply, however, it is helpful to > preserve those "On Mon, Oct ..." headers for those emails which you > include in your reply, it helps keep things straight when reading the > email. Not a big deal, just something to keep in mind for next time. Thanks for the pointer - I'm new to these mailing lists so appreciate the advice. > I should have been more clear, that's what just a quick hack that I > cut-n-pasted into the email body, whitespace damage was a given. > Typically if I include a patch with the qualification that it is > untested, you can expect problems :) but I'll try to make the pitfalls > more explicit in the future. Gotcha. >> While typing it out manually, I noticed that >> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to >> begin with, then ack will be false to begin with, and so no ACK will be >> sent even if there is an error. > > Good point. I'll just casually remind you that I did say "untested" ;) > > I believe the following should work as intended (untested, cut-n-paste, etc.): > ..... I think ack must be set to NLM_F_ACK initially - otherwise auditd_set will always send the fast-tracked ACK even if the caller did not request one. The following is a concrete version of what I roughly suggested in the last email - is there a specific problem you see with the (ack || err) condition? @@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) * Parse the provided skb and deal with any messages that may be present, * malformed skbs are discarded. */ -static void audit_receive(struct sk_buff *skb) +static void audit_receive(struct sk_buff *skb) { struct nlmsghdr *nlh; + bool ack; /* * len MUST be signed for nlmsg_next to be able to dec it below 0 * if the nlmsg_len was not aligned @@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff *skb) audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { - err = audit_receive_msg(skb, nlh); - /* if err or if this message says it wants a response */ - if (err || (nlh->nlmsg_flags & NLM_F_ACK)) + ack = nlh->nlmsg_flags & NLM_F_ACK; + err = audit_receive_msg(skb, nlh, &ack); + + /* Send an ack if @ack is still true after audit_receive_msg + * potentially cleared it, or if there was an error. */ + if (ack || err) netlink_ack(skb, nlh, err, NULL); > I'm not sure I can recall everything from when I was thinking about > this previously (that was about a week ago), but my quick thoughts > right now are that you would need a lot more information and/or > handshakes between the kernel and the daemon. > > Unfortunately, both the current audit design and implementation is > seriously flawed in a number of areas. One of these areas is the fact > that data and control messages are sent using the same data flow. Makes sense. The question of why there isn't a separate control socket was one of the first we asked while looking into this. > The issue isn't so much about the queues overflowing inside the > kernel, it's about being able to schedule the audit daemon and/or > kernel thread to service the flood of connection > disconnects/reconnects coming from the reproducer. Right, makes sense. > The old audit mailing list, where the userspace development is still > discussed, can be found here: > ... Thanks. I'll post there too. - Chris
On Tue, Oct 17, 2023 at 9:49 AM Chris Riches <chris.riches@nutanix.com> wrote: > On 16/10/2023 21:16, Paul Moore wrote: > >> While typing it out manually, I noticed that > >> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to > >> begin with, then ack will be false to begin with, and so no ACK will be > >> sent even if there is an error. > > > > Good point. I'll just casually remind you that I did say "untested" ;) > > > > I believe the following should work as intended (untested, > cut-n-paste, etc.): > > ..... > > I think ack must be set to NLM_F_ACK initially - otherwise auditd_set > will always send the fast-tracked ACK even if the caller did not > request one. Another good point. I think I'm working on too many different things and forgetting the details between our emails :) Sorry about that. > The following is a concrete version of what I roughly > suggested in the last email - is there a specific problem you see with > the (ack || err) condition? I'm happier with the bool over the integer type for the @ack variable. I'd suggest updating the patch and posting it again so we can review it in full, but we weren't very far off last time so I think it should be close, if not acceptable on the next revision. > > I'm not sure I can recall everything from when I was thinking about > > this previously (that was about a week ago), but my quick thoughts > > right now are that you would need a lot more information and/or > > handshakes between the kernel and the daemon. > > > > Unfortunately, both the current audit design and implementation is > > seriously flawed in a number of areas. One of these areas is the fact > > that data and control messages are sent using the same data flow. > > Makes sense. The question of why there isn't a separate control socket > was one of the first we asked while looking into this. Yeah, there is a lot I would have done differently, but that's the way it goes sometimes. I have plans to find some of the larger problems, I just need to find some time to work on it.
diff --git a/kernel/audit.c b/kernel/audit.c index 9bc0b0301198..c48c778b6f89 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -488,12 +488,16 @@ static void auditd_conn_free(struct rcu_head *rcu) * @pid: auditd PID * @portid: auditd netlink portid * @net: auditd network namespace pointer + * @ack_status: if AUDIT_ACK_ALWAYS, send ACK before setting connection and + * set to AUDIT_ACK_DONE + * @skb: socket buffer for sending ACK * * Description: * This function will obtain and drop network namespace references as * necessary. Returns zero on success, negative values on failure. */ -static int auditd_set(struct pid *pid, u32 portid, struct net *net) +static int auditd_set(struct pid *pid, u32 portid, struct net *net, + int *ack_status, struct sk_buff *skb) { unsigned long flags; struct auditd_connection *ac_old, *ac_new; @@ -508,6 +512,13 @@ static int auditd_set(struct pid *pid, u32 portid, struct net *net) ac_new->portid = portid; ac_new->net = get_net(net); + if (*ack_status == AUDIT_ACK_ALWAYS) { + // Send ACK before we set auditd_conn. Otherwise, the socket buffer may + // get filled with backlogged audit messages causing the ACK to be dropped. + netlink_ack(skb, nlmsg_hdr(skb), 0, NULL); + *ack_status = AUDIT_ACK_DONE; + } + spin_lock_irqsave(&auditd_conn_lock, flags); ac_old = rcu_dereference_protected(auditd_conn, lockdep_is_held(&auditd_conn_lock)); @@ -1201,7 +1212,7 @@ static int audit_replace(struct pid *pid) return auditd_send_unicast_skb(skb); } -static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) +static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh, int *ack_status) { u32 seq; void *data; @@ -1294,7 +1305,9 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) /* register a new auditd connection */ err = auditd_set(req_pid, NETLINK_CB(skb).portid, - sock_net(NETLINK_CB(skb).sk)); + sock_net(NETLINK_CB(skb).sk), + ack_status, + skb); if (audit_enabled != AUDIT_OFF) audit_log_config_change("audit_pid", new_pid, @@ -1548,15 +1561,20 @@ static void audit_receive(struct sk_buff *skb) */ int len; int err; + int ack_status = AUDIT_ACK_ON_ERR; nlh = nlmsg_hdr(skb); len = skb->len; audit_ctl_lock(); while (nlmsg_ok(nlh, len)) { - err = audit_receive_msg(skb, nlh); - /* if err or if this message says it wants a response */ - if (err || (nlh->nlmsg_flags & NLM_F_ACK)) + if (nlh->nlmsg_flags & NLM_F_ACK) + ack_status = AUDIT_ACK_ALWAYS; + + err = audit_receive_msg(skb, nlh, &ack_status); + + /* send ACK if err or the message asked for one (and not already sent) */ + if (ack_status == AUDIT_ACK_ALWAYS || (ack_status == AUDIT_ACK_ON_ERR && err)) netlink_ack(skb, nlh, err, NULL); nlh = nlmsg_next(nlh, &len); diff --git a/kernel/audit.h b/kernel/audit.h index 94738bce40b2..bc692f60567a 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -344,4 +344,9 @@ extern int audit_filter(int msgtype, unsigned int listtype); extern void audit_ctl_lock(void); extern void audit_ctl_unlock(void); +/* Control netlink ACK behaviour in audit_receive. */ +#define AUDIT_ACK_ON_ERR 1 +#define AUDIT_ACK_ALWAYS 2 +#define AUDIT_ACK_DONE 3 + #endif
When auditd_set sets the auditd_conn pointer, audit messages can immediately be put on the socket by other kernel threads. If the backlog is large or the rate is high, this can immediately fill the socket buffer. If the audit daemon requested an ACK for this operation, a full socket buffer causes the ACK to get dropped, also setting ENOBUFS on the socket. To avoid this race and ensure ACKs get through, fast-track the ACK in this specific case to ensure it is sent before auditd_conn is set. Signed-off-by: Chris Riches <chris.riches@nutanix.com> --- This mail is more of an RFC than a patch, though the included patch is a useful illustation and might even be suitable for inclusion. We are experiencing strange failures where the audit daemon fails to start, hitting an ENOBUFS error on its audit_set_pid() call. This can be reproduced by repeatedly restarting the audit daemon (or just doing a tight loop of audit_open(), audit_set_pid(pid), audit_set_pid(0), audit_close()) while the system is under heavy audit load. This also seems to be dependent on the number of CPUs - we can reproduce this with 2 CPUs but not with 48. Tracing showed a race between the kernel setting auditd_conn and sending the ACK, as described in the commit message. The included patch attempts to fix this by ensuring the ACK is sent before any audit messages can be put on the socket. This seems to fix the problem for auditd starting, but strangely I still hit it when running my minimal repro (the tight loop mentioned above), albeit less frequently. I'm not sure if the patch or the repro is at fault. We may also want to look at this from the userspace side in the audit daemon itself (or more specifically, libaudit). The ACK handling is a little odd - check_ack() will happily return success without seeing an ACK if a non-ACK message is top of the socket queue, but will fail if no message arrives within the timeout. It also of course fails if ENOBUFS is set on the socket, but this failure only seems to matter when doing audit_set_pid() - similar failures during main-loop message processing are logged but otherwise ignored, as far as I can tell. I'm not sure I quite understand the intentions of the code, but it seems odd to let ENOBUFS be a fatal error here, given that it likely means the socket buffer got flooded with audit messages, and thus audit_set_pid() succeeded. Perhaps we should just ignore ENOBUFS (instead of or as well as patching the kernel). Finally, there is another oddity in audit_set_pid() that is tangential to this discussion but worth highlighting: if the wmode parameter is WAIT_YES (which it is for the audit daemon), then there is some additional ACK-handling which waits for 100 milliseconds and eats the top message of the socket queue if one arrives, without inspecting it. This seems completely wrong as the ACK will have already been consumed by check_ack() if there was one, and so the best this code can do is nothing, and at worst it will swallow a genuine audit message without ever recording it. If you agree with my assessment of this code, I'm happy to submit a separate patch to fix this. Thanks, Chris kernel/audit.c | 30 ++++++++++++++++++++++++------ kernel/audit.h | 5 +++++ 2 files changed, 29 insertions(+), 6 deletions(-)