Message ID | 7d7933d742fdf4a94c84b791906a450b16f2e81f.1577736799.git.rgb@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | audit: implement container identifier |
On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > Add audit container identifier support to the action of signalling the > audit daemon. > > Since this would need to add an element to the audit_sig_info struct, > a new record type AUDIT_SIGNAL_INFO2 was created with a new > audit_sig_info2 struct. Corresponding support is required in the > userspace code to reflect the new record request and reply type. > An older userspace won't break since it won't know to request this > record type. > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > --- > include/linux/audit.h | 7 +++++++ > include/uapi/linux/audit.h | 1 + > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > kernel/audit.h | 1 + > security/selinux/nlmsgtab.c | 1 + > 5 files changed, 45 insertions(+) ... > diff --git a/kernel/audit.c b/kernel/audit.c > index 0871c3e5d6df..51159c94041c 100644 > --- a/kernel/audit.c > +++ b/kernel/audit.c > @@ -126,6 +126,14 @@ struct auditd_connection { > kuid_t audit_sig_uid = INVALID_UID; > pid_t audit_sig_pid = -1; > u32 audit_sig_sid = 0; > +/* Since the signal information is stored in the record buffer at the > + * time of the signal, but not retrieved until later, there is a chance > + * that the last process in the container could terminate before the > + * signal record is delivered. In this circumstance, there is a chance > + * the orchestrator could reuse the audit container identifier, causing > + * an overlap of audit records that refer to the same audit container > + * identifier, but a different container instance. */ > +u64 audit_sig_cid = AUDIT_CID_UNSET; I believe we could prevent the case mentioned above by taking an additional reference to the audit container ID object when the signal information is collected, dropping it only after the signal information is collected by userspace or another process signals the audit daemon. Yes, it would block that audit container ID from being reused immediately, but since we are talking about one number out of 2^64 that seems like a reasonable tradeoff. -- paul moore www.paul-moore.com
On 2020-01-22 16:28, Paul Moore wrote: > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > Add audit container identifier support to the action of signalling the > > audit daemon. > > > > Since this would need to add an element to the audit_sig_info struct, > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > audit_sig_info2 struct. Corresponding support is required in the > > userspace code to reflect the new record request and reply type. > > An older userspace won't break since it won't know to request this > > record type. > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > --- > > include/linux/audit.h | 7 +++++++ > > include/uapi/linux/audit.h | 1 + > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > kernel/audit.h | 1 + > > security/selinux/nlmsgtab.c | 1 + > > 5 files changed, 45 insertions(+) > > ... > > > diff --git a/kernel/audit.c b/kernel/audit.c > > index 0871c3e5d6df..51159c94041c 100644 > > --- a/kernel/audit.c > > +++ b/kernel/audit.c > > @@ -126,6 +126,14 @@ struct auditd_connection { > > kuid_t audit_sig_uid = INVALID_UID; > > pid_t audit_sig_pid = -1; > > u32 audit_sig_sid = 0; > > +/* Since the signal information is stored in the record buffer at the > > + * time of the signal, but not retrieved until later, there is a chance > > + * that the last process in the container could terminate before the > > + * signal record is delivered. In this circumstance, there is a chance > > + * the orchestrator could reuse the audit container identifier, causing > > + * an overlap of audit records that refer to the same audit container > > + * identifier, but a different container instance. */ > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > I believe we could prevent the case mentioned above by taking an > additional reference to the audit container ID object when the signal > information is collected, dropping it only after the signal > information is collected by userspace or another process signals the > audit daemon. Yes, it would block that audit container ID from being > reused immediately, but since we are talking about one number out of > 2^64 that seems like a reasonable tradeoff. I had thought that through and should have been more explicit about that situation when I documented it. We could do that, but then the syscall records would be connected with the call from auditd on shutdown to request that signal information, rather than the exit of that last process that was using that container. This strikes me as misleading. Is that really what we want? > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-01-22 16:28, Paul Moore wrote: > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > Add audit container identifier support to the action of signalling the > > > audit daemon. > > > > > > Since this would need to add an element to the audit_sig_info struct, > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > > audit_sig_info2 struct. Corresponding support is required in the > > > userspace code to reflect the new record request and reply type. > > > An older userspace won't break since it won't know to request this > > > record type. > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > > --- > > > include/linux/audit.h | 7 +++++++ > > > include/uapi/linux/audit.h | 1 + > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > > kernel/audit.h | 1 + > > > security/selinux/nlmsgtab.c | 1 + > > > 5 files changed, 45 insertions(+) > > > > ... > > > > > diff --git a/kernel/audit.c b/kernel/audit.c > > > index 0871c3e5d6df..51159c94041c 100644 > > > --- a/kernel/audit.c > > > +++ b/kernel/audit.c > > > @@ -126,6 +126,14 @@ struct auditd_connection { > > > kuid_t audit_sig_uid = INVALID_UID; > > > pid_t audit_sig_pid = -1; > > > u32 audit_sig_sid = 0; > > > +/* Since the signal information is stored in the record buffer at the > > > + * time of the signal, but not retrieved until later, there is a chance > > > + * that the last process in the container could terminate before the > > > + * signal record is delivered. In this circumstance, there is a chance > > > + * the orchestrator could reuse the audit container identifier, causing > > > + * an overlap of audit records that refer to the same audit container > > > + * identifier, but a different container instance. */ > > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > > > I believe we could prevent the case mentioned above by taking an > > additional reference to the audit container ID object when the signal > > information is collected, dropping it only after the signal > > information is collected by userspace or another process signals the > > audit daemon. Yes, it would block that audit container ID from being > > reused immediately, but since we are talking about one number out of > > 2^64 that seems like a reasonable tradeoff. > > I had thought that through and should have been more explicit about that > situation when I documented it. We could do that, but then the syscall > records would be connected with the call from auditd on shutdown to > request that signal information, rather than the exit of that last > process that was using that container. This strikes me as misleading. > Is that really what we want? ??? I think one of us is not understanding the other; maybe it's me, maybe it's you, maybe it's both of us. Anyway, here is what I was trying to convey with my original comment ... When we record the audit container ID in audit_signal_info() we take an extra reference to the audit container ID object so that it will not disappear (and get reused) until after we respond with an AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in audit_signal_info(). Unless I'm missing some other change you made, this *shouldn't* affect the syscall records, all it does is preserve the audit container ID object in the kernel's ACID store so it doesn't get reused. 
(We do need to do some extra housekeeping in audit_signal_info() to deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 - basically if audit_sig_cid is not NULL we should drop a reference before assigning it a new object pointer, and of course we would need to set audit_sig_cid to NULL in audit_receive_msg() after sending it up to userspace and dropping the extra ref.)
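For concreteness, the scheme described above can be modelled with a small self-contained toy program; the names (contobj, contobj_get/contobj_put, sig_cid) are invented for illustration and are not the patchset's actual helpers, and real kernel code would of course use proper kref-style refcounting and locking. The same sketch also shows the timing consequence debated in the replies that follow: with the extra reference held, the last reference is only dropped at fetch time, not when the last task exits.

#include <stdio.h>
#include <stdlib.h>

/* Toy model of a refcounted audit container ID object.  All names here
 * (contobj, contobj_get, contobj_put, sig_cid) are made up for
 * illustration; they are not the helpers used by the patchset. */
struct contobj {
	unsigned long long id;
	int refcount;
};

static struct contobj *contobj_get(struct contobj *obj)
{
	if (obj)
		obj->refcount++;
	return obj;
}

static void contobj_put(struct contobj *obj)
{
	if (obj && --obj->refcount == 0) {
		printf("contid %llu: last reference dropped (death record would be emitted here)\n",
		       obj->id);
		free(obj);
	}
}

/* analogue of audit_sig_cid, held as an object pointer instead of a bare u64 */
static struct contobj *sig_cid;

/* audit_signal_info() analogue: remember who signalled auditd and pin the object */
static void signal_info_collect(struct contobj *sender)
{
	contobj_put(sig_cid);           /* housekeeping: drop a stale, never-fetched ref */
	sig_cid = contobj_get(sender);  /* hold until userspace fetches SIGNAL_INFO2 */
}

/* AUDIT_SIGNAL_INFO2 analogue: report the pinned contid, then release it */
static void signal_info_fetch(void)
{
	if (sig_cid)
		printf("SIGNAL_INFO2: contid=%llu\n", sig_cid->id);
	contobj_put(sig_cid);
	sig_cid = NULL;
}

int main(void)
{
	struct contobj *c = malloc(sizeof(*c));

	c->id = 123;
	c->refcount = 1;        /* reference held by the last task in the container */

	signal_info_collect(c); /* that task signals auditd: refcount -> 2 */
	contobj_put(c);         /* last task exits: refcount -> 1, no "death" yet */
	signal_info_fetch();    /* userspace fetches the info: refcount -> 0, "death" now */
	return 0;
}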
On 2020-01-23 12:09, Paul Moore wrote: > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-01-22 16:28, Paul Moore wrote: > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > Add audit container identifier support to the action of signalling the > > > > audit daemon. > > > > > > > > Since this would need to add an element to the audit_sig_info struct, > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > > > audit_sig_info2 struct. Corresponding support is required in the > > > > userspace code to reflect the new record request and reply type. > > > > An older userspace won't break since it won't know to request this > > > > record type. > > > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > > > --- > > > > include/linux/audit.h | 7 +++++++ > > > > include/uapi/linux/audit.h | 1 + > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > > > kernel/audit.h | 1 + > > > > security/selinux/nlmsgtab.c | 1 + > > > > 5 files changed, 45 insertions(+) > > > > > > ... > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c > > > > index 0871c3e5d6df..51159c94041c 100644 > > > > --- a/kernel/audit.c > > > > +++ b/kernel/audit.c > > > > @@ -126,6 +126,14 @@ struct auditd_connection { > > > > kuid_t audit_sig_uid = INVALID_UID; > > > > pid_t audit_sig_pid = -1; > > > > u32 audit_sig_sid = 0; > > > > +/* Since the signal information is stored in the record buffer at the > > > > + * time of the signal, but not retrieved until later, there is a chance > > > > + * that the last process in the container could terminate before the > > > > + * signal record is delivered. In this circumstance, there is a chance > > > > + * the orchestrator could reuse the audit container identifier, causing > > > > + * an overlap of audit records that refer to the same audit container > > > > + * identifier, but a different container instance. */ > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > > > > > I believe we could prevent the case mentioned above by taking an > > > additional reference to the audit container ID object when the signal > > > information is collected, dropping it only after the signal > > > information is collected by userspace or another process signals the > > > audit daemon. Yes, it would block that audit container ID from being > > > reused immediately, but since we are talking about one number out of > > > 2^64 that seems like a reasonable tradeoff. > > > > I had thought that through and should have been more explicit about that > > situation when I documented it. We could do that, but then the syscall > > records would be connected with the call from auditd on shutdown to > > request that signal information, rather than the exit of that last > > process that was using that container. This strikes me as misleading. > > Is that really what we want? > > ??? > > I think one of us is not understanding the other; maybe it's me, maybe > it's you, maybe it's both of us. > > Anyway, here is what I was trying to convey with my original comment > ... When we record the audit container ID in audit_signal_info() we > take an extra reference to the audit container ID object so that it > will not disappear (and get reused) until after we respond with an > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in > audit_signal_info(). 
Unless I'm missing some other change you made, > this *shouldn't* affect the syscall records, all it does is preserve > the audit container ID object in the kernel's ACID store so it doesn't > get reused. This is exactly what I had understood. I hadn't considered the extra details below in detail due to my original syscall concern, but they make sense. The syscall I refer to is the one connected with the drop of the audit container identifier by the last process that was in that container in patch 5/16. The production of this record is contingent on the last ref in a contobj being dropped. So if it is due to that ref being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 record is fetched, then it will appear that the fetch action closed the container rather than the exit of the last process in the container. Does this make sense? > (We do need to do some extra housekeeping in audit_signal_info() to > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 - > basically if audit_sig_cid is not NULL we should drop a reference > before assigning it a new object pointer, and of course we would need > to set audit_sig_cid to NULL in audit_receive_msg() after sending it > up to userspace and dropping the extra ref.) > > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-01-23 12:09, Paul Moore wrote: > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-01-22 16:28, Paul Moore wrote: > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > > > Add audit container identifier support to the action of signalling the > > > > > audit daemon. > > > > > > > > > > Since this would need to add an element to the audit_sig_info struct, > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > > > > audit_sig_info2 struct. Corresponding support is required in the > > > > > userspace code to reflect the new record request and reply type. > > > > > An older userspace won't break since it won't know to request this > > > > > record type. > > > > > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > > > > --- > > > > > include/linux/audit.h | 7 +++++++ > > > > > include/uapi/linux/audit.h | 1 + > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > > > > kernel/audit.h | 1 + > > > > > security/selinux/nlmsgtab.c | 1 + > > > > > 5 files changed, 45 insertions(+) > > > > > > > > ... > > > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c > > > > > index 0871c3e5d6df..51159c94041c 100644 > > > > > --- a/kernel/audit.c > > > > > +++ b/kernel/audit.c > > > > > @@ -126,6 +126,14 @@ struct auditd_connection { > > > > > kuid_t audit_sig_uid = INVALID_UID; > > > > > pid_t audit_sig_pid = -1; > > > > > u32 audit_sig_sid = 0; > > > > > +/* Since the signal information is stored in the record buffer at the > > > > > + * time of the signal, but not retrieved until later, there is a chance > > > > > + * that the last process in the container could terminate before the > > > > > + * signal record is delivered. In this circumstance, there is a chance > > > > > + * the orchestrator could reuse the audit container identifier, causing > > > > > + * an overlap of audit records that refer to the same audit container > > > > > + * identifier, but a different container instance. */ > > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > > > > > > > I believe we could prevent the case mentioned above by taking an > > > > additional reference to the audit container ID object when the signal > > > > information is collected, dropping it only after the signal > > > > information is collected by userspace or another process signals the > > > > audit daemon. Yes, it would block that audit container ID from being > > > > reused immediately, but since we are talking about one number out of > > > > 2^64 that seems like a reasonable tradeoff. > > > > > > I had thought that through and should have been more explicit about that > > > situation when I documented it. We could do that, but then the syscall > > > records would be connected with the call from auditd on shutdown to > > > request that signal information, rather than the exit of that last > > > process that was using that container. This strikes me as misleading. > > > Is that really what we want? > > > > ??? > > > > I think one of us is not understanding the other; maybe it's me, maybe > > it's you, maybe it's both of us. > > > > Anyway, here is what I was trying to convey with my original comment > > ... When we record the audit container ID in audit_signal_info() we > > take an extra reference to the audit container ID object so that it > > will not disappear (and get reused) until after we respond with an > > AUDIT_SIGNAL_INFO2. 
In audit_receive_msg() when we do the > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in > > audit_signal_info(). Unless I'm missing some other change you made, > > this *shouldn't* affect the syscall records, all it does is preserve > > the audit container ID object in the kernel's ACID store so it doesn't > > get reused. > > This is exactly what I had understood. I hadn't considered the extra > details below in detail due to my original syscall concern, but they > make sense. > > The syscall I refer to is the one connected with the drop of the > audit container identifier by the last process that was in that > container in patch 5/16. The production of this record is contingent on > the last ref in a contobj being dropped. So if it is due to that ref > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > record it fetched, then it will appear that the fetch action closed the > container rather than the last process in the container to exit. > > Does this make sense? More so than your original reply, at least to me anyway. It makes sense that the audit container ID wouldn't be marked as "dead" since it would still be very much alive and available for use by the orchestrator; the question is whether that is desirable or not. I think the answer to this comes down to preserving the correctness of the audit log. If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been reused then I think there is a legitimate concern that the audit log is not correct, and could be misleading. If we solve that by grabbing an extra reference, then there could also be some confusion as userspace considers a container to be "dead" while the audit container ID still exists in the kernel, and the kernel-generated audit container ID death record will not be generated until much later (and possibly be associated with a different event, but that could be solved by unassociating the container death record). Of the two approaches, I think the latter is safer in that it preserves the correctness of the audit log, even though it could result in a delay of the container death record. Neither way is perfect, so if you have any other ideas I'm all ears. > > (We do need to do some extra housekeeping in audit_signal_info() to > > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 - > > basically if audit_sig_cid is not NULL we should drop a reference > > before assigning it a new object pointer, and of course we would need > > to set audit_sig_cid to NULL in audit_receive_msg() after sending it > > up to userspace and dropping the extra ref.)
On 2020-01-23 16:35, Paul Moore wrote: > On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-01-23 12:09, Paul Moore wrote: > > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-01-22 16:28, Paul Moore wrote: > > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > > > > > Add audit container identifier support to the action of signalling the > > > > > > audit daemon. > > > > > > > > > > > > Since this would need to add an element to the audit_sig_info struct, > > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > > > > > audit_sig_info2 struct. Corresponding support is required in the > > > > > > userspace code to reflect the new record request and reply type. > > > > > > An older userspace won't break since it won't know to request this > > > > > > record type. > > > > > > > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > > > > > --- > > > > > > include/linux/audit.h | 7 +++++++ > > > > > > include/uapi/linux/audit.h | 1 + > > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > > > > > kernel/audit.h | 1 + > > > > > > security/selinux/nlmsgtab.c | 1 + > > > > > > 5 files changed, 45 insertions(+) > > > > > > > > > > ... > > > > > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c > > > > > > index 0871c3e5d6df..51159c94041c 100644 > > > > > > --- a/kernel/audit.c > > > > > > +++ b/kernel/audit.c > > > > > > @@ -126,6 +126,14 @@ struct auditd_connection { > > > > > > kuid_t audit_sig_uid = INVALID_UID; > > > > > > pid_t audit_sig_pid = -1; > > > > > > u32 audit_sig_sid = 0; > > > > > > +/* Since the signal information is stored in the record buffer at the > > > > > > + * time of the signal, but not retrieved until later, there is a chance > > > > > > + * that the last process in the container could terminate before the > > > > > > + * signal record is delivered. In this circumstance, there is a chance > > > > > > + * the orchestrator could reuse the audit container identifier, causing > > > > > > + * an overlap of audit records that refer to the same audit container > > > > > > + * identifier, but a different container instance. */ > > > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > > > > > > > > > I believe we could prevent the case mentioned above by taking an > > > > > additional reference to the audit container ID object when the signal > > > > > information is collected, dropping it only after the signal > > > > > information is collected by userspace or another process signals the > > > > > audit daemon. Yes, it would block that audit container ID from being > > > > > reused immediately, but since we are talking about one number out of > > > > > 2^64 that seems like a reasonable tradeoff. > > > > > > > > I had thought that through and should have been more explicit about that > > > > situation when I documented it. We could do that, but then the syscall > > > > records would be connected with the call from auditd on shutdown to > > > > request that signal information, rather than the exit of that last > > > > process that was using that container. This strikes me as misleading. > > > > Is that really what we want? > > > > > > ??? > > > > > > I think one of us is not understanding the other; maybe it's me, maybe > > > it's you, maybe it's both of us. > > > > > > Anyway, here is what I was trying to convey with my original comment > > > ... 
When we record the audit container ID in audit_signal_info() we > > > take an extra reference to the audit container ID object so that it > > > will not disappear (and get reused) until after we respond with an > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in > > > audit_signal_info(). Unless I'm missing some other change you made, > > > this *shouldn't* affect the syscall records, all it does is preserve > > > the audit container ID object in the kernel's ACID store so it doesn't > > > get reused. > > > > This is exactly what I had understood. I hadn't considered the extra > > details below in detail due to my original syscall concern, but they > > make sense. > > > > The syscall I refer to is the one connected with the drop of the > > audit container identifier by the last process that was in that > > container in patch 5/16. The production of this record is contingent on > > the last ref in a contobj being dropped. So if it is due to that ref > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > record it fetched, then it will appear that the fetch action closed the > > container rather than the last process in the container to exit. > > > > Does this make sense? > > More so than your original reply, at least to me anyway. > > It makes sense that the audit container ID wouldn't be marked as > "dead" since it would still be very much alive and available for use > by the orchestrator, the question is if that is desirable or not. I > think the answer to this comes down the preserving the correctness of > the audit log. > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > reused then I think there is a legitimate concern that the audit log > is not correct, and could be misleading. If we solve that by grabbing > an extra reference, then there could also be some confusion as > userspace considers a container to be "dead" while the audit container > ID still exists in the kernel, and the kernel generated audit > container ID death record will not be generated until much later (and > possibly be associated with a different event, but that could be > solved by unassociating the container death record). How does syscall association of the death record with AUDIT_SIGNAL_INFO2 possibly get associated with another event? Or is the syscall association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? Another idea might be to bump the refcount in audit_signal_info() but mark that contid as dead so it can't be reused, if we are concerned about the dead contid being reused. There is still the problem later that the reported contid is incomplete compared to the rest of the contid reporting cycle wrt nesting, since AUDIT_SIGNAL_INFO2 will need to be more complex, with two variable-length fields, to accommodate a nested contid list. > Of the two > approaches, I think the latter is safer in that it preserves the > correctness of the audit log, even though it could result in a delay > of the container death record. I prefer the former since it strongly indicates the last task in the container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject attributes and the contid to strongly link the responsible party. > Neither way is perfect, so if you have any other ideas I'm all ears.
> > > > (We do need to do some extra housekeeping in audit_signal_info() to > > > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 - > > > basically if audit_sig_cid is not NULL we should drop a reference > > > before assigning it a new object pointer, and of course we would need > > > to set audit_sig_cid to NULL in audit_receive_msg() after sending it > > > up to userspace and dropping the extra ref.) > > -- > paul moore > www.paul-moore.com > - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
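For reference, the existing audit_sig_info reply carries a uid, a pid and a single variable-length LSM context. The "two variable-length fields" complication Richard mentions above could look something like the sketch below; the layout is purely hypothetical (the struct actually added by the patch is not shown in the quoted diff, and no nested-contid variant was ever settled on).

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical wire layout only, to illustrate why a nested contid list
 * makes AUDIT_SIGNAL_INFO2 messier: with two variable-length trailers,
 * explicit lengths/counts are needed to parse the reply buffer. */
struct audit_sig_info2_nested {
	uid_t    uid;
	pid_t    pid;
	uint32_t ctx_len;    /* bytes of LSM context that follow */
	uint32_t cid_count;  /* number of 64-bit contids after the context */
	char     data[];     /* ctx_len bytes of context, then cid_count uint64_t contids */
};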
On Tue, Feb 4, 2020 at 6:15 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-01-23 16:35, Paul Moore wrote: > > On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-01-23 12:09, Paul Moore wrote: > > > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-01-22 16:28, Paul Moore wrote: > > > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > > > > > > > Add audit container identifier support to the action of signalling the > > > > > > > audit daemon. > > > > > > > > > > > > > > Since this would need to add an element to the audit_sig_info struct, > > > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new > > > > > > > audit_sig_info2 struct. Corresponding support is required in the > > > > > > > userspace code to reflect the new record request and reply type. > > > > > > > An older userspace won't break since it won't know to request this > > > > > > > record type. > > > > > > > > > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > > > > > > --- > > > > > > > include/linux/audit.h | 7 +++++++ > > > > > > > include/uapi/linux/audit.h | 1 + > > > > > > > kernel/audit.c | 35 +++++++++++++++++++++++++++++++++++ > > > > > > > kernel/audit.h | 1 + > > > > > > > security/selinux/nlmsgtab.c | 1 + > > > > > > > 5 files changed, 45 insertions(+) > > > > > > > > > > > > ... > > > > > > > > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c > > > > > > > index 0871c3e5d6df..51159c94041c 100644 > > > > > > > --- a/kernel/audit.c > > > > > > > +++ b/kernel/audit.c > > > > > > > @@ -126,6 +126,14 @@ struct auditd_connection { > > > > > > > kuid_t audit_sig_uid = INVALID_UID; > > > > > > > pid_t audit_sig_pid = -1; > > > > > > > u32 audit_sig_sid = 0; > > > > > > > +/* Since the signal information is stored in the record buffer at the > > > > > > > + * time of the signal, but not retrieved until later, there is a chance > > > > > > > + * that the last process in the container could terminate before the > > > > > > > + * signal record is delivered. In this circumstance, there is a chance > > > > > > > + * the orchestrator could reuse the audit container identifier, causing > > > > > > > + * an overlap of audit records that refer to the same audit container > > > > > > > + * identifier, but a different container instance. */ > > > > > > > +u64 audit_sig_cid = AUDIT_CID_UNSET; > > > > > > > > > > > > I believe we could prevent the case mentioned above by taking an > > > > > > additional reference to the audit container ID object when the signal > > > > > > information is collected, dropping it only after the signal > > > > > > information is collected by userspace or another process signals the > > > > > > audit daemon. Yes, it would block that audit container ID from being > > > > > > reused immediately, but since we are talking about one number out of > > > > > > 2^64 that seems like a reasonable tradeoff. > > > > > > > > > > I had thought that through and should have been more explicit about that > > > > > situation when I documented it. We could do that, but then the syscall > > > > > records would be connected with the call from auditd on shutdown to > > > > > request that signal information, rather than the exit of that last > > > > > process that was using that container. This strikes me as misleading. > > > > > Is that really what we want? > > > > > > > > ??? 
> > > > > > > > I think one of us is not understanding the other; maybe it's me, maybe > > > > it's you, maybe it's both of us. > > > > > > > > Anyway, here is what I was trying to convey with my original comment > > > > ... When we record the audit container ID in audit_signal_info() we > > > > take an extra reference to the audit container ID object so that it > > > > will not disappear (and get reused) until after we respond with an > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in > > > > audit_signal_info(). Unless I'm missing some other change you made, > > > > this *shouldn't* affect the syscall records, all it does is preserve > > > > the audit container ID object in the kernel's ACID store so it doesn't > > > > get reused. > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > details below in detail due to my original syscall concern, but they > > > make sense. > > > > > > The syscall I refer to is the one connected with the drop of the > > > audit container identifier by the last process that was in that > > > container in patch 5/16. The production of this record is contingent on > > > the last ref in a contobj being dropped. So if it is due to that ref > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > record it fetched, then it will appear that the fetch action closed the > > > container rather than the last process in the container to exit. > > > > > > Does this make sense? > > > > More so than your original reply, at least to me anyway. > > > > It makes sense that the audit container ID wouldn't be marked as > > "dead" since it would still be very much alive and available for use > > by the orchestrator, the question is if that is desirable or not. I > > think the answer to this comes down the preserving the correctness of > > the audit log. > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > reused then I think there is a legitimate concern that the audit log > > is not correct, and could be misleading. If we solve that by grabbing > > an extra reference, then there could also be some confusion as > > userspace considers a container to be "dead" while the audit container > > ID still exists in the kernel, and the kernel generated audit > > container ID death record will not be generated until much later (and > > possibly be associated with a different event, but that could be > > solved by unassociating the container death record). > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > possibly get associated with another event? Or is the syscall > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? The issue is when the audit container ID "dies". If it is when the last task in the container exits, then the death record will be associated with the task's exit. If the audit container ID lives on until the last reference of it in the audit logs, including the SIGNAL_INFO2 message, the death record will be associated with the related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on the details of the syscalls/netlink. > Another idea might be to bump the refcount in audit_signal_info() but > mark tht contid as dead so it can't be reused if we are concerned that > the dead contid be reused? Ooof. Yes, maybe, but that would be ugly.
> There is still the problem later that the reported contid is incomplete > compared to the rest of the contid reporting cycle wrt nesting since > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > fields to accommodate a nested contid list. Do we really care about the full nested audit container ID list in the SIGNAL_INFO2 record? > > Of the two > > approaches, I think the latter is safer in that it preserves the > > correctness of the audit log, even though it could result in a delay > > of the container death record. > > I prefer the former since it strongly indicates last task in the > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > attributes and the contid to strongly link the responsible party. Steve is the only one who really tracks the security certifications that are relevant to audit, see what the certification requirements have to say and we can revisit this.
On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote: > > > > > ... When we record the audit container ID in audit_signal_info() we > > > > > take an extra reference to the audit container ID object so that it > > > > > will not disappear (and get reused) until after we respond with an > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took > > > > > in > > > > > audit_signal_info(). Unless I'm missing some other change you > > > > > made, > > > > > this *shouldn't* affect the syscall records, all it does is > > > > > preserve > > > > > the audit container ID object in the kernel's ACID store so it > > > > > doesn't > > > > > get reused. > > > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > > details below in detail due to my original syscall concern, but they > > > > make sense. > > > > > > > > The syscall I refer to is the one connected with the drop of the > > > > audit container identifier by the last process that was in that > > > > container in patch 5/16. The production of this record is contingent > > > > on > > > > the last ref in a contobj being dropped. So if it is due to that ref > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > > record it fetched, then it will appear that the fetch action closed > > > > the > > > > container rather than the last process in the container to exit. > > > > > > > > Does this make sense? > > > > > > More so than your original reply, at least to me anyway. > > > > > > It makes sense that the audit container ID wouldn't be marked as > > > "dead" since it would still be very much alive and available for use > > > by the orchestrator, the question is if that is desirable or not. I > > > think the answer to this comes down the preserving the correctness of > > > the audit log. > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > > reused then I think there is a legitimate concern that the audit log > > > is not correct, and could be misleading. If we solve that by grabbing > > > an extra reference, then there could also be some confusion as > > > userspace considers a container to be "dead" while the audit container > > > ID still exists in the kernel, and the kernel generated audit > > > container ID death record will not be generated until much later (and > > > possibly be associated with a different event, but that could be > > > solved by unassociating the container death record). > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > > possibly get associated with another event? Or is the syscall > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? > > The issue is when does the audit container ID "die". If it is when > the last task in the container exits, then the death record will be > associated when the task's exit. If the audit container ID lives on > until the last reference of it in the audit logs, including the > SIGNAL_INFO2 message, the death record will be associated with the > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on > the details of the syscalls/netlink. > > > Another idea might be to bump the refcount in audit_signal_info() but > > mark tht contid as dead so it can't be reused if we are concerned that > > the dead contid be reused? > > Ooof. Yes, maybe, but that would be ugly. 
> > > There is still the problem later that the reported contid is incomplete > > compared to the rest of the contid reporting cycle wrt nesting since > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > > fields to accommodate a nested contid list. > > Do we really care about the full nested audit container ID list in the > SIGNAL_INFO2 record? > > > > Of the two > > > approaches, I think the latter is safer in that it preserves the > > > correctness of the audit log, even though it could result in a delay > > > of the container death record. > > > > I prefer the former since it strongly indicates last task in the > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > > attributes and the contid to strongly link the responsible party. > > Steve is the only one who really tracks the security certifications > that are relevant to audit, see what the certification requirements > have to say and we can revisit this. Server Virtualization Protection Profile is the closest applicable standard: https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408 It is silent on audit requirements for the lifecycle of a VM. I assume that all that is needed is what the orchestrator says it's doing at the high level. So, if an orchestrator wants to shut down a container, the orchestrator must log that intent and its results. In a similar fashion, systemd logs that it's killing a service and we don't actually hook the exit syscall of the service to record that. Now, if a container was being used as a VPS, and it had a fully functioning userspace, its own services, and its very own audit daemon, then in this case it would care who sent a signal to its auditd. The tenant of that container may have to comply with PCI-DSS or something else. It would log that the audit service is being terminated and systemd would record that it's tearing down the environment. The OS doesn't need to do anything. -Steve
On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote: > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote: > > > > > > ... When we record the audit container ID in audit_signal_info() we > > > > > > take an extra reference to the audit container ID object so that it > > > > > > will not disappear (and get reused) until after we respond with an > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took > > > > > > in > > > > > > audit_signal_info(). Unless I'm missing some other change you > > > > > > made, > > > > > > this *shouldn't* affect the syscall records, all it does is > > > > > > preserve > > > > > > the audit container ID object in the kernel's ACID store so it > > > > > > doesn't > > > > > > get reused. > > > > > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > > > details below in detail due to my original syscall concern, but they > > > > > make sense. > > > > > > > > > > The syscall I refer to is the one connected with the drop of the > > > > > audit container identifier by the last process that was in that > > > > > container in patch 5/16. The production of this record is contingent > > > > > on > > > > > the last ref in a contobj being dropped. So if it is due to that ref > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > > > record it fetched, then it will appear that the fetch action closed > > > > > the > > > > > container rather than the last process in the container to exit. > > > > > > > > > > Does this make sense? > > > > > > > > More so than your original reply, at least to me anyway. > > > > > > > > It makes sense that the audit container ID wouldn't be marked as > > > > "dead" since it would still be very much alive and available for use > > > > by the orchestrator, the question is if that is desirable or not. I > > > > think the answer to this comes down the preserving the correctness of > > > > the audit log. > > > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > > > reused then I think there is a legitimate concern that the audit log > > > > is not correct, and could be misleading. If we solve that by grabbing > > > > an extra reference, then there could also be some confusion as > > > > userspace considers a container to be "dead" while the audit container > > > > ID still exists in the kernel, and the kernel generated audit > > > > container ID death record will not be generated until much later (and > > > > possibly be associated with a different event, but that could be > > > > solved by unassociating the container death record). > > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > > > possibly get associated with another event? Or is the syscall > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? > > > > The issue is when does the audit container ID "die". If it is when > > the last task in the container exits, then the death record will be > > associated when the task's exit. If the audit container ID lives on > > until the last reference of it in the audit logs, including the > > SIGNAL_INFO2 message, the death record will be associated with the > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on > > the details of the syscalls/netlink. 
> > > > > Another idea might be to bump the refcount in audit_signal_info() but > > > mark tht contid as dead so it can't be reused if we are concerned that > > > the dead contid be reused? > > > > Ooof. Yes, maybe, but that would be ugly. > > > > > There is still the problem later that the reported contid is incomplete > > > compared to the rest of the contid reporting cycle wrt nesting since > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > > > fields to accommodate a nested contid list. > > > > Do we really care about the full nested audit container ID list in the > > SIGNAL_INFO2 record? > > > > > > Of the two > > > > approaches, I think the latter is safer in that it preserves the > > > > correctness of the audit log, even though it could result in a delay > > > > of the container death record. > > > > > > I prefer the former since it strongly indicates last task in the > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > > > attributes and the contid to strongly link the responsible party. > > > > Steve is the only one who really tracks the security certifications > > that are relevant to audit, see what the certification requirements > > have to say and we can revisit this. > > Sever Virtualization Protection Profile is the closest applicable standard > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408 > > It is silent on audit requirements for the lifecycle of a VM. I assume that > all that is needed is what the orchestrator says its doing at the high level. > So, if an orchestrator wants to shutdown a container, the orchestrator must > log that intent and its results. In a similar fashion, systemd logs that it's > killing a service and we don't actually hook the exit syscall of the service > to record that. > > Now, if a container was being used as a VPS, and it had a fully functioning > userspace, it's own services, and its very own audit daemon, then in this > case it would care who sent a signal to its auditd. The tenant of that > container may have to comply with PCI-DSS or something else. It would log the > audit service is being terminated and systemd would record that its tearing > down the environment. The OS doesn't need to do anything. This latter case is the case of interest here, since the host auditd should only be killed from a process on the host itself, not a process running in a container. If we work under the assumption (and this may be a break in our approach to not defining "container") that an auditd instance is only ever signaled by a process with the same audit container ID (ACID), is this really even an issue? Right now it isn't, as even with this patchset we will still really only support one auditd instance, presumably on the host, so this isn't a significant concern. Moving forward, once we add support for multiple auditd instances we will likely need to move the signal info into (potentially) a per-ACID struct, a struct whose lifetime would match that of the associated container by definition; as the auditd container died, the struct would die, the refcounts would be dropped, and any ACID held only by the signal info refcount would be dropped/killed. However, making this assumption would mean that we are expecting a "container" to provide some level of isolation such that processes with a different audit container ID do not signal each other. From a practical perspective I think that fits with most (all?) definitions of "container", but I can't say that for certain.
In those cases where the assumption is not correct and processes can signal each other across audit container ID boundaries, perhaps it is enough to explain that an audit container ID may not fully disappear until it has been fetched with a SIGNAL_INFO2 message.
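A rough sketch of the per-ACID signal-info idea floated above, under the assumption that each container running its own auditd gets a bookkeeping structure whose lifetime matches the container's; none of these names exist in the patchset, this is illustration only.

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical per-container signal bookkeeping: instead of the global
 * audit_sig_uid/audit_sig_pid/audit_sig_sid/audit_sig_cid variables, each
 * container that runs its own auditd would carry a struct like this, whose
 * lifetime matches the container's, so the reference-vs-death-record
 * question above goes away by construction.  Not code from the patchset. */
struct audit_contid_sig {
	uint64_t owner_cid;  /* audit container ID that owns this auditd instance */
	pid_t    sig_pid;    /* who last signalled that auditd */
	uid_t    sig_uid;
	uint32_t sig_sid;    /* LSM secid of the signaller */
	uint64_t sig_cid;    /* signaller's audit container ID */
};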
This is a bit of a thread-hijack, and for that I apologize, but another thought crossed my mind while thinking about this issue further ... Once we support multiple auditd instances, including the necessary record routing and duplication/multiple-sends (the host always sees *everything*), we will likely need to find a way to "trim" the audit container ID (ACID) lists we send in the records. The auditd instance running on the host/initns will always see everything, so it will want the full container ACID list; however an auditd instance running inside a container really should only see the ACIDs of any child containers. For example, imagine a system where the host has containers 1 and 2, each running an auditd instance. Inside container 1 there are containers A and B. Inside container 2 there are containers Y and Z. If an audit event is generated in container Z, I would expect the host's auditd to see a ACID list of "1,Z" but container 1's auditd should only see an ACID list of "Z". The auditd running in container 2 should not see the record at all (that will be relatively straightforward). Does that make sense? Do we have the record formats properly designed to handle this without too much problem (I'm not entirely sure we do)?
On 2020-02-13 16:44, Paul Moore wrote: > This is a bit of a thread-hijack, and for that I apologize, but > another thought crossed my mind while thinking about this issue > further ... Once we support multiple auditd instances, including the > necessary record routing and duplication/multiple-sends (the host > always sees *everything*), we will likely need to find a way to "trim" > the audit container ID (ACID) lists we send in the records. The > auditd instance running on the host/initns will always see everything, > so it will want the full container ACID list; however an auditd > instance running inside a container really should only see the ACIDs > of any child containers. Agreed. This should be easy to check and limit, preventing an auditd from seeing any contid that is a parent of its own contid. > For example, imagine a system where the host has containers 1 and 2, > each running an auditd instance. Inside container 1 there are > containers A and B. Inside container 2 there are containers Y and Z. > If an audit event is generated in container Z, I would expect the > host's auditd to see a ACID list of "1,Z" but container 1's auditd > should only see an ACID list of "Z". The auditd running in container > 2 should not see the record at all (that will be relatively > straightforward). Does that make sense? Do we have the record > formats properly designed to handle this without too much problem (I'm > not entirely sure we do)? I completely agree and I believe we have record formats that are able to handle this already. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
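As a concrete illustration of the trimming both sides agree on here, a small self-contained sketch that drops the receiving auditd's own contid and its ancestors from a nested list; it assumes the list is ordered outermost-first and that record routing has already decided this auditd should receive the record at all. None of this is code from the patchset.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Drop every contid up to and including the receiving auditd's own contid,
 * so a contained auditd only ever sees its descendants.  A receiver that is
 * not in the list at all (the host/initns auditd) sees the full list. */
static size_t trim_contid_list(const uint64_t *full, size_t n,
                               uint64_t receiver, uint64_t *out)
{
	size_t i = 0, m = 0;

	while (i < n && full[i] != receiver)
		i++;
	i = (i < n) ? i + 1 : 0;   /* found: skip ancestors and self; not found: keep all */

	for (; i < n; i++)
		out[m++] = full[i];
	return m;
}

int main(void)
{
	/* the "1,Z" example from the thread, with Z written as 26 */
	uint64_t full[] = { 1, 26 };
	uint64_t out[2];
	size_t m = trim_contid_list(full, 2, 1, out);  /* container 1's auditd */

	for (size_t i = 0; i < m; i++)
		printf("contid %llu\n", (unsigned long long)out[i]);  /* prints only 26 ("Z") */
	return 0;
}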
On 2020-02-12 19:09, Paul Moore wrote: > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote: > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote: > > > > > > > ... When we record the audit container ID in audit_signal_info() we > > > > > > > take an extra reference to the audit container ID object so that it > > > > > > > will not disappear (and get reused) until after we respond with an > > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took > > > > > > > in > > > > > > > audit_signal_info(). Unless I'm missing some other change you > > > > > > > made, > > > > > > > this *shouldn't* affect the syscall records, all it does is > > > > > > > preserve > > > > > > > the audit container ID object in the kernel's ACID store so it > > > > > > > doesn't > > > > > > > get reused. > > > > > > > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > > > > details below in detail due to my original syscall concern, but they > > > > > > make sense. > > > > > > > > > > > > The syscall I refer to is the one connected with the drop of the > > > > > > audit container identifier by the last process that was in that > > > > > > container in patch 5/16. The production of this record is contingent > > > > > > on > > > > > > the last ref in a contobj being dropped. So if it is due to that ref > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > > > > record it fetched, then it will appear that the fetch action closed > > > > > > the > > > > > > container rather than the last process in the container to exit. > > > > > > > > > > > > Does this make sense? > > > > > > > > > > More so than your original reply, at least to me anyway. > > > > > > > > > > It makes sense that the audit container ID wouldn't be marked as > > > > > "dead" since it would still be very much alive and available for use > > > > > by the orchestrator, the question is if that is desirable or not. I > > > > > think the answer to this comes down the preserving the correctness of > > > > > the audit log. > > > > > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > > > > reused then I think there is a legitimate concern that the audit log > > > > > is not correct, and could be misleading. If we solve that by grabbing > > > > > an extra reference, then there could also be some confusion as > > > > > userspace considers a container to be "dead" while the audit container > > > > > ID still exists in the kernel, and the kernel generated audit > > > > > container ID death record will not be generated until much later (and > > > > > possibly be associated with a different event, but that could be > > > > > solved by unassociating the container death record). > > > > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > > > > possibly get associated with another event? Or is the syscall > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? > > > > > > The issue is when does the audit container ID "die". If it is when > > > the last task in the container exits, then the death record will be > > > associated when the task's exit. 
If the audit container ID lives on > > > until the last reference of it in the audit logs, including the > > > SIGNAL_INFO2 message, the death record will be associated with the > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on > > > the details of the syscalls/netlink. > > > > > > > Another idea might be to bump the refcount in audit_signal_info() but > > > > mark tht contid as dead so it can't be reused if we are concerned that > > > > the dead contid be reused? > > > > > > Ooof. Yes, maybe, but that would be ugly. > > > > > > > There is still the problem later that the reported contid is incomplete > > > > compared to the rest of the contid reporting cycle wrt nesting since > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > > > > fields to accommodate a nested contid list. > > > > > > Do we really care about the full nested audit container ID list in the > > > SIGNAL_INFO2 record? I'm inclined to hand-wave it away as an inconvenience that can be looked up more carefully if it is really needed. Maybe the block above would be safer and more complete even though it is ugly. > > > > > Of the two > > > > > approaches, I think the latter is safer in that it preserves the > > > > > correctness of the audit log, even though it could result in a delay > > > > > of the container death record. > > > > I prefer the former since it strongly indicates last task in the > > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > > > > attributes and the contid to strongly link the responsible party. > > > > > > Steve is the only one who really tracks the security certifications > > > that are relevant to audit, see what the certification requirements > > > have to say and we can revisit this. > > > > Sever Virtualization Protection Profile is the closest applicable standard > > > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408 > > > > It is silent on audit requirements for the lifecycle of a VM. I assume that > > all that is needed is what the orchestrator says its doing at the high level. > > So, if an orchestrator wants to shutdown a container, the orchestrator must > > log that intent and its results. In a similar fashion, systemd logs that it's > > killing a service and we don't actually hook the exit syscall of the service > > to record that. > > > > Now, if a container was being used as a VPS, and it had a fully functioning > > userspace, it's own services, and its very own audit daemon, then in this > > case it would care who sent a signal to its auditd. The tenant of that > > container may have to comply with PCI-DSS or something else. It would log the > > audit service is being terminated and systemd would record that its tearing > > down the environment. The OS doesn't need to do anything. > > This latter case is the case of interest here, since the host auditd > should only be killed from a process on the host itself, not a process > running in a container. If we work under the assumption (and this may > be a break in our approach to not defining "container") that an auditd > instance is only ever signaled by a process with the same audit > container ID (ACID), is this really even an issue? Right now it isn't > as even with this patchset we will still really only support one > auditd instance, presumably on the host, so this isn't a significant > concern.
Moving forward, once we add support for multiple auditd > instances we will likely need to move the signal info into > (potentially) s per-ACID struct, a struct whose lifetime would match > that of the associated container by definition; as the auditd > container died, the struct would die, the refcounts dropped, and any > ACID held only the signal info refcount would be dropped/killed. Any process could signal auditd if it can see it based on namespace relationships, nevermind container placement. Some container architectures would not have a namespace configuration that would block this (combination of PID/user/IPC?). > However, making this assumption would mean that we are expecting a > "container" to provide some level of isolation such that processes > with a different audit container ID do not signal each other. From a > practical perspective I think that fits with the most (all?) > definitions of "container", but I can't say that for certain. In > those cases where the assumption is not correct and processes can > signal each other across audit container ID boundaries, perhaps it is > enough to explain that an audit container ID may not fully disappear > until it has been fetched with a SIGNAL_INFO2 message. I think more and more, that more complete isolation is being done, taking advantage of each type of namespace as they become available, but I know a nuber of them didn't find it important yet to use IPC, PID or user namespaces which would be the only namespaces I can think of that would provide that isolation. It isn't entirely clear to me which side you fall on this issue, Paul. Can you pronounce on your strong preference one way or the other if the death of a container coincide with the exit of the last process in that namespace, or the fetch of any signal info related to it? I have a bias to the former since the code already does that and I feel the exit of the last process is much more relevant supported by the syscall record, but could change it to the latter if you feel strongly enough about it to block upstream acceptance. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
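To make the approach being debated above concrete, a minimal sketch of the extra-reference idea follows. The audit_contobj structure and the put helper are assumptions modelled on the patchset under review, not upstream kernel interfaces; refcount_inc()/refcount_dec_and_test() are the standard kernel refcount primitives.

/* Sketch only: audit_contobj, audit_contobj_put() and audit_sig_cid_obj are
 * assumed stand-ins for the patchset's container ID object handling. */
struct audit_contobj {
	struct list_head	list;
	u64			id;		/* audit container identifier */
	refcount_t		refcount;
	struct rcu_head		rcu;
};

static struct audit_contobj *audit_sig_cid_obj;

/* From audit_signal_info(): pin the signaller's container object so the
 * orchestrator cannot hand the same contid to a new container before the
 * AUDIT_SIGNAL_INFO2 reply has been collected. */
static void audit_signal_save_contid(struct audit_contobj *cont)
{
	if (audit_sig_cid_obj)
		audit_contobj_put(audit_sig_cid_obj);	/* assumed helper */
	if (cont)
		refcount_inc(&cont->refcount);
	audit_sig_cid_obj = cont;
}

/* From the AUDIT_SIGNAL_INFO2 case in audit_receive_msg(): the reply has
 * been sent, so drop the extra reference and let the contid be reused. */
static void audit_signal_contid_fetched(void)
{
	if (audit_sig_cid_obj)
		audit_contobj_put(audit_sig_cid_obj);
	audit_sig_cid_obj = NULL;
}

As the message above points out, with only this in place the final put, and therefore the container death record, would be triggered by the AUDIT_SIGNAL_INFO2 fetch rather than by the exit of the last process in the container, which is the crux of the disagreement.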
On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-02-13 16:44, Paul Moore wrote: > > This is a bit of a thread-hijack, and for that I apologize, but > > another thought crossed my mind while thinking about this issue > > further ... Once we support multiple auditd instances, including the > > necessary record routing and duplication/multiple-sends (the host > > always sees *everything*), we will likely need to find a way to "trim" > > the audit container ID (ACID) lists we send in the records. The > > auditd instance running on the host/initns will always see everything, > > so it will want the full container ACID list; however an auditd > > instance running inside a container really should only see the ACIDs > > of any child containers. > > Agreed. This should be easy to check and limit, preventing an auditd > from seeing any contid that is a parent of its own contid. > > > For example, imagine a system where the host has containers 1 and 2, > > each running an auditd instance. Inside container 1 there are > > containers A and B. Inside container 2 there are containers Y and Z. > > If an audit event is generated in container Z, I would expect the > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > should only see an ACID list of "Z". The auditd running in container > > 2 should not see the record at all (that will be relatively > > straightforward). Does that make sense? Do we have the record > > formats properly designed to handle this without too much problem (I'm > > not entirely sure we do)? > > I completely agree and I believe we have record formats that are able to > handle this already. I'm not convinced we do. What about the cases where we have a field with a list of audit container IDs? How do we handle that?
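As a rough illustration of that trimming (an invented helper, not something in the patchset; the array ordering and the AUDIT_CID_UNSET "no container" marker are assumptions), suppose the originating chain is stored innermost first:

/* Return non-zero if the event is visible to a daemon whose own audit
 * container ID is @observer, and set @visible_len to the number of leading
 * (innermost) entries of @ancestry it may be shown. */
static int audit_contid_visible(const u64 *ancestry, int len, u64 observer,
				int *visible_len)
{
	int i;

	if (observer == AUDIT_CID_UNSET) {	/* host/initns auditd sees all */
		*visible_len = len;
		return 1;
	}
	for (i = 0; i < len; i++) {
		if (ancestry[i] == observer) {
			*visible_len = i;	/* only what is nested inside it */
			return i > 0;
		}
	}
	return 0;	/* event did not originate inside this daemon's container */
}

With the example above (allowing for the 1/2 crossing corrected later in the thread), an event in container Z carries the chain {Z, 2}: the host auditd sees both entries, container 2's auditd sees only Z, and container 1's auditd gets nothing.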
On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-02-12 19:09, Paul Moore wrote: > > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote: > > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote: > > > > > > > > ... When we record the audit container ID in audit_signal_info() we > > > > > > > > take an extra reference to the audit container ID object so that it > > > > > > > > will not disappear (and get reused) until after we respond with an > > > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took > > > > > > > > in > > > > > > > > audit_signal_info(). Unless I'm missing some other change you > > > > > > > > made, > > > > > > > > this *shouldn't* affect the syscall records, all it does is > > > > > > > > preserve > > > > > > > > the audit container ID object in the kernel's ACID store so it > > > > > > > > doesn't > > > > > > > > get reused. > > > > > > > > > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > > > > > details below in detail due to my original syscall concern, but they > > > > > > > make sense. > > > > > > > > > > > > > > The syscall I refer to is the one connected with the drop of the > > > > > > > audit container identifier by the last process that was in that > > > > > > > container in patch 5/16. The production of this record is contingent > > > > > > > on > > > > > > > the last ref in a contobj being dropped. So if it is due to that ref > > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > > > > > record it fetched, then it will appear that the fetch action closed > > > > > > > the > > > > > > > container rather than the last process in the container to exit. > > > > > > > > > > > > > > Does this make sense? > > > > > > > > > > > > More so than your original reply, at least to me anyway. > > > > > > > > > > > > It makes sense that the audit container ID wouldn't be marked as > > > > > > "dead" since it would still be very much alive and available for use > > > > > > by the orchestrator, the question is if that is desirable or not. I > > > > > > think the answer to this comes down the preserving the correctness of > > > > > > the audit log. > > > > > > > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > > > > > reused then I think there is a legitimate concern that the audit log > > > > > > is not correct, and could be misleading. If we solve that by grabbing > > > > > > an extra reference, then there could also be some confusion as > > > > > > userspace considers a container to be "dead" while the audit container > > > > > > ID still exists in the kernel, and the kernel generated audit > > > > > > container ID death record will not be generated until much later (and > > > > > > possibly be associated with a different event, but that could be > > > > > > solved by unassociating the container death record). > > > > > > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > > > > > possibly get associated with another event? Or is the syscall > > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? > > > > > > > > The issue is when does the audit container ID "die". If it is when > > > > the last task in the container exits, then the death record will be > > > > associated when the task's exit. 
If the audit container ID lives on > > > > until the last reference of it in the audit logs, including the > > > > SIGNAL_INFO2 message, the death record will be associated with the > > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on > > > > the details of the syscalls/netlink. > > > > > > > > > Another idea might be to bump the refcount in audit_signal_info() but > > > > > mark tht contid as dead so it can't be reused if we are concerned that > > > > > the dead contid be reused? > > > > > > > > Ooof. Yes, maybe, but that would be ugly. > > > > > > > > > There is still the problem later that the reported contid is incomplete > > > > > compared to the rest of the contid reporting cycle wrt nesting since > > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > > > > > fields to accommodate a nested contid list. > > > > > > > > Do we really care about the full nested audit container ID list in the > > > > SIGNAL_INFO2 record? > > I'm inclined to hand-wave it away as inconvenient that can be looked up > more carefully if it is really needed. Maybe the block above would be > safer and more complete even though it is ugly. > > > > > > > Of the two > > > > > > approaches, I think the latter is safer in that it preserves the > > > > > > correctness of the audit log, even though it could result in a delay > > > > > > of the container death record. > > > > > > > > > > I prefer the former since it strongly indicates last task in the > > > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > > > > > attributes and the contid to strongly link the responsible party. > > > > > > > > Steve is the only one who really tracks the security certifications > > > > that are relevant to audit, see what the certification requirements > > > > have to say and we can revisit this. > > > > > > Sever Virtualization Protection Profile is the closest applicable standard > > > > > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408 > > > > > > It is silent on audit requirements for the lifecycle of a VM. I assume that > > > all that is needed is what the orchestrator says its doing at the high level. > > > So, if an orchestrator wants to shutdown a container, the orchestrator must > > > log that intent and its results. In a similar fashion, systemd logs that it's > > > killing a service and we don't actually hook the exit syscall of the service > > > to record that. > > > > > > Now, if a container was being used as a VPS, and it had a fully functioning > > > userspace, it's own services, and its very own audit daemon, then in this > > > case it would care who sent a signal to its auditd. The tenant of that > > > container may have to comply with PCI-DSS or something else. It would log the > > > audit service is being terminated and systemd would record that its tearing > > > down the environment. The OS doesn't need to do anything. > > > > This latter case is the case of interest here, since the host auditd > > should only be killed from a process on the host itself, not a process > > running in a container. If we work under the assumption (and this may > > be a break in our approach to not defining "container") that an auditd > > instance is only ever signaled by a process with the same audit > > container ID (ACID), is this really even an issue? Right now it isn't > > as even with this patchset we will still really only support one > > auditd instance, presumably on the host, so this isn't a significant > > concern. 
Moving forward, once we add support for multiple auditd > > instances we will likely need to move the signal info into > > (potentially) s per-ACID struct, a struct whose lifetime would match > > that of the associated container by definition; as the auditd > > container died, the struct would die, the refcounts dropped, and any > > ACID held only the signal info refcount would be dropped/killed. > > Any process could signal auditd if it can see it based on namespace > relationships, nevermind container placement. Some container > architectures would not have a namespace configuration that would block > this (combination of PID/user/IPC?). > > > However, making this assumption would mean that we are expecting a > > "container" to provide some level of isolation such that processes > > with a different audit container ID do not signal each other. From a > > practical perspective I think that fits with the most (all?) > > definitions of "container", but I can't say that for certain. In > > those cases where the assumption is not correct and processes can > > signal each other across audit container ID boundaries, perhaps it is > > enough to explain that an audit container ID may not fully disappear > > until it has been fetched with a SIGNAL_INFO2 message. > > I think more and more, that more complete isolation is being done, > taking advantage of each type of namespace as they become available, but > I know a nuber of them didn't find it important yet to use IPC, PID or > user namespaces which would be the only namespaces I can think of that > would provide that isolation. > > It isn't entirely clear to me which side you fall on this issue, Paul. That's mostly because I was hoping for some clarification in the discussion, especially the relevant certification requirements, but it looks like there is still plenty of room for interpretation there (as usual). I'd much rather us arrive at decisions based on requirements and not gut feelings, which is where I think we are at right now. > Can you pronounce on your strong preference one way or the other if the > death of a container coincide with the exit of the last process in that > namespace, or the fetch of any signal info related to it? "pronounce on your strong preference"? I've seen you use "pronounce" a few times now, and suggest a different word in the future; the connotation is not well received on my end. > I have a bias > to the former since the code already does that and I feel the exit of > the last process is much more relevant supported by the syscall record, > but could change it to the latter if you feel strongly enough about it > to block upstream acceptance. At this point in time I believe the right thing to do is to preserve the audit container ID as "dead but still in existence" so that there is no confusion (due to reuse) if/when it finally reappears in the audit record stream. The thread has had a lot of starts/stops, so I may be repeating a previous suggestion, but one idea would be to still emit a "death record" when the final task in the audit container ID does die, but block the particular audit container ID from reuse until it the SIGNAL2 info has been reported. This gives us the timely ACID death notification while still preventing confusion and ambiguity caused by potentially reusing the ACID before the SIGNAL2 record has been sent; there is a small nit about the ACID being present in the SIGNAL2 *after* its death, but I think that can be easily explained and understood by admins.
On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote: > > I think more and more, that more complete isolation is being done, > > taking advantage of each type of namespace as they become available, but > > I know a nuber of them didn't find it important yet to use IPC, PID or > > user namespaces which would be the only namespaces I can think of that > > would provide that isolation. > > > > It isn't entirely clear to me which side you fall on this issue, Paul. > > That's mostly because I was hoping for some clarification in the > discussion, especially the relevant certification requirements, but it > looks like there is still plenty of room for interpretation there (as > usual). I'd much rather us arrive at decisions based on requirements > and not gut feelings, which is where I think we are at right now. Certification requirements are that we need the identity of anyone attempting to modify the audit configuration, including shutting it down. -Steve
On Fri, Mar 13, 2020 at 12:45 PM Steve Grubb <sgrubb@redhat.com> wrote: > On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote: > > > I think more and more, that more complete isolation is being done, > > > taking advantage of each type of namespace as they become available, but > > > I know a nuber of them didn't find it important yet to use IPC, PID or > > > user namespaces which would be the only namespaces I can think of that > > > would provide that isolation. > > > > > > It isn't entirely clear to me which side you fall on this issue, Paul. > > > > That's mostly because I was hoping for some clarification in the > > discussion, especially the relevant certification requirements, but it > > looks like there is still plenty of room for interpretation there (as > > usual). I'd much rather us arrive at decisions based on requirements > > and not gut feelings, which is where I think we are at right now. > > Certification rquirements are that we need the identity of anyone attempting > to modify the audit configuration including shutting it down. Yep, got it. Unfortunately that doesn't really help with what we are talking about. Although preventing the reuse of the ACID before the SIGNAL2 record does help preserve the sanity of the audit stream which I believe to be very important, regardless.
On 2020-03-13 12:29, Paul Moore wrote: > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-02-13 16:44, Paul Moore wrote: > > > This is a bit of a thread-hijack, and for that I apologize, but > > > another thought crossed my mind while thinking about this issue > > > further ... Once we support multiple auditd instances, including the > > > necessary record routing and duplication/multiple-sends (the host > > > always sees *everything*), we will likely need to find a way to "trim" > > > the audit container ID (ACID) lists we send in the records. The > > > auditd instance running on the host/initns will always see everything, > > > so it will want the full container ACID list; however an auditd > > > instance running inside a container really should only see the ACIDs > > > of any child containers. > > > > Agreed. This should be easy to check and limit, preventing an auditd > > from seeing any contid that is a parent of its own contid. > > > > > For example, imagine a system where the host has containers 1 and 2, > > > each running an auditd instance. Inside container 1 there are > > > containers A and B. Inside container 2 there are containers Y and Z. > > > If an audit event is generated in container Z, I would expect the > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > should only see an ACID list of "Z". The auditd running in container > > > 2 should not see the record at all (that will be relatively > > > straightforward). Does that make sense? Do we have the record > > > formats properly designed to handle this without too much problem (I'm > > > not entirely sure we do)? > > > > I completely agree and I believe we have record formats that are able to > > handle this already. > > I'm not convinced we do. What about the cases where we have a field > with a list of audit container IDs? How do we handle that? I don't understand the problem. (I think you crossed your 1/2 vs A/B/Y/Z in your example.) Clarifying the example above, if as you suggest an event happens in container Z, the hosts's auditd would report Z,^2 and the auditd in container 2 would report Z,^2 but if there were another auditd running in container Z it would report Z while the auditd in container 1 or A/B would see nothing. The format I had proposed already handles that: contid^contid,contid^contid but you'd like to see it changed to contid,^contid,contid,^contid and both formats handle it though I find the former much easier to read. For the example above we'd have: A,^1 B,^1 Y,^2 Z,^2 and for a shared network namespace potentially: A,^1,B,^1,Y,^2,Z,^2 and if there were an event reported by an auditd in container Z it would report only: Z Now, I could see an argument for restricting the visibility of the contid to the container containing an auditd so that an auditd cannot see its own contid, but that wasn't my design intent. This can still be addressed after the initial code is committed without breaking the API. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
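Purely to illustrate the two candidate spellings (the separator choice is exactly what is being debated; audit_log_format() is the usual kernel audit helper, everything else here is invented for illustration):

/* Render one nesting chain, innermost contid first, ancestors prefixed with
 * '^'.  comma_sep selects the "Z,^2" style over the "Z^2" style. */
static void audit_log_contid_chain(struct audit_buffer *ab, const u64 *chain,
				   int depth, bool comma_sep)
{
	int i;

	for (i = 0; i < depth; i++) {
		if (i)
			audit_log_format(ab, "%s^", comma_sep ? "," : "");
		audit_log_format(ab, "%llu", chain[i]);
	}
}

A caller emitting several chains for one event (the shared network namespace case above) would then join them with plain commas, giving either "A^1,B^1,Y^2,Z^2" or "A,^1,B,^1,Y,^2,Z,^2".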
On 2020-03-13 12:42, Paul Moore wrote: > On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-02-12 19:09, Paul Moore wrote: > > > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote: > > > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote: > > > > > > > > > ... When we record the audit container ID in audit_signal_info() we > > > > > > > > > take an extra reference to the audit container ID object so that it > > > > > > > > > will not disappear (and get reused) until after we respond with an > > > > > > > > > AUDIT_SIGNAL_INFO2. In audit_receive_msg() when we do the > > > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took > > > > > > > > > in > > > > > > > > > audit_signal_info(). Unless I'm missing some other change you > > > > > > > > > made, > > > > > > > > > this *shouldn't* affect the syscall records, all it does is > > > > > > > > > preserve > > > > > > > > > the audit container ID object in the kernel's ACID store so it > > > > > > > > > doesn't > > > > > > > > > get reused. > > > > > > > > > > > > > > > > This is exactly what I had understood. I hadn't considered the extra > > > > > > > > details below in detail due to my original syscall concern, but they > > > > > > > > make sense. > > > > > > > > > > > > > > > > The syscall I refer to is the one connected with the drop of the > > > > > > > > audit container identifier by the last process that was in that > > > > > > > > container in patch 5/16. The production of this record is contingent > > > > > > > > on > > > > > > > > the last ref in a contobj being dropped. So if it is due to that ref > > > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2 > > > > > > > > record it fetched, then it will appear that the fetch action closed > > > > > > > > the > > > > > > > > container rather than the last process in the container to exit. > > > > > > > > > > > > > > > > Does this make sense? > > > > > > > > > > > > > > More so than your original reply, at least to me anyway. > > > > > > > > > > > > > > It makes sense that the audit container ID wouldn't be marked as > > > > > > > "dead" since it would still be very much alive and available for use > > > > > > > by the orchestrator, the question is if that is desirable or not. I > > > > > > > think the answer to this comes down the preserving the correctness of > > > > > > > the audit log. > > > > > > > > > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been > > > > > > > reused then I think there is a legitimate concern that the audit log > > > > > > > is not correct, and could be misleading. If we solve that by grabbing > > > > > > > an extra reference, then there could also be some confusion as > > > > > > > userspace considers a container to be "dead" while the audit container > > > > > > > ID still exists in the kernel, and the kernel generated audit > > > > > > > container ID death record will not be generated until much later (and > > > > > > > possibly be associated with a different event, but that could be > > > > > > > solved by unassociating the container death record). > > > > > > > > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2 > > > > > > possibly get associated with another event? Or is the syscall > > > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event? > > > > > > > > > > The issue is when does the audit container ID "die". 
If it is when > > > > > the last task in the container exits, then the death record will be > > > > > associated when the task's exit. If the audit container ID lives on > > > > > until the last reference of it in the audit logs, including the > > > > > SIGNAL_INFO2 message, the death record will be associated with the > > > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on > > > > > the details of the syscalls/netlink. > > > > > > > > > > > Another idea might be to bump the refcount in audit_signal_info() but > > > > > > mark tht contid as dead so it can't be reused if we are concerned that > > > > > > the dead contid be reused? > > > > > > > > > > Ooof. Yes, maybe, but that would be ugly. > > > > > > > > > > > There is still the problem later that the reported contid is incomplete > > > > > > compared to the rest of the contid reporting cycle wrt nesting since > > > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length > > > > > > fields to accommodate a nested contid list. > > > > > > > > > > Do we really care about the full nested audit container ID list in the > > > > > SIGNAL_INFO2 record? > > > > I'm inclined to hand-wave it away as inconvenient that can be looked up > > more carefully if it is really needed. Maybe the block above would be > > safer and more complete even though it is ugly. > > > > > > > > > Of the two > > > > > > > approaches, I think the latter is safer in that it preserves the > > > > > > > correctness of the audit log, even though it could result in a delay > > > > > > > of the container death record. > > > > > > > > > > > > I prefer the former since it strongly indicates last task in the > > > > > > container. The AUDIT_SIGNAL_INFO2 msg has the pid and other subject > > > > > > attributes and the contid to strongly link the responsible party. > > > > > > > > > > Steve is the only one who really tracks the security certifications > > > > > that are relevant to audit, see what the certification requirements > > > > > have to say and we can revisit this. > > > > > > > > Sever Virtualization Protection Profile is the closest applicable standard > > > > > > > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408 > > > > > > > > It is silent on audit requirements for the lifecycle of a VM. I assume that > > > > all that is needed is what the orchestrator says its doing at the high level. > > > > So, if an orchestrator wants to shutdown a container, the orchestrator must > > > > log that intent and its results. In a similar fashion, systemd logs that it's > > > > killing a service and we don't actually hook the exit syscall of the service > > > > to record that. > > > > > > > > Now, if a container was being used as a VPS, and it had a fully functioning > > > > userspace, it's own services, and its very own audit daemon, then in this > > > > case it would care who sent a signal to its auditd. The tenant of that > > > > container may have to comply with PCI-DSS or something else. It would log the > > > > audit service is being terminated and systemd would record that its tearing > > > > down the environment. The OS doesn't need to do anything. > > > > > > This latter case is the case of interest here, since the host auditd > > > should only be killed from a process on the host itself, not a process > > > running in a container. 
If we work under the assumption (and this may > > > be a break in our approach to not defining "container") that an auditd > > > instance is only ever signaled by a process with the same audit > > > container ID (ACID), is this really even an issue? Right now it isn't > > > as even with this patchset we will still really only support one > > > auditd instance, presumably on the host, so this isn't a significant > > > concern. Moving forward, once we add support for multiple auditd > > > instances we will likely need to move the signal info into > > > (potentially) s per-ACID struct, a struct whose lifetime would match > > > that of the associated container by definition; as the auditd > > > container died, the struct would die, the refcounts dropped, and any > > > ACID held only the signal info refcount would be dropped/killed. > > > > Any process could signal auditd if it can see it based on namespace > > relationships, nevermind container placement. Some container > > architectures would not have a namespace configuration that would block > > this (combination of PID/user/IPC?). > > > > > However, making this assumption would mean that we are expecting a > > > "container" to provide some level of isolation such that processes > > > with a different audit container ID do not signal each other. From a > > > practical perspective I think that fits with the most (all?) > > > definitions of "container", but I can't say that for certain. In > > > those cases where the assumption is not correct and processes can > > > signal each other across audit container ID boundaries, perhaps it is > > > enough to explain that an audit container ID may not fully disappear > > > until it has been fetched with a SIGNAL_INFO2 message. > > > > I think more and more, that more complete isolation is being done, > > taking advantage of each type of namespace as they become available, but > > I know a nuber of them didn't find it important yet to use IPC, PID or > > user namespaces which would be the only namespaces I can think of that > > would provide that isolation. > > > > It isn't entirely clear to me which side you fall on this issue, Paul. > > That's mostly because I was hoping for some clarification in the > discussion, especially the relevant certification requirements, but it > looks like there is still plenty of room for interpretation there (as > usual). I'd much rather us arrive at decisions based on requirements > and not gut feelings, which is where I think we are at right now. I don't disagree. > > Can you pronounce on your strong preference one way or the other if the > > death of a container coincide with the exit of the last process in that > > namespace, or the fetch of any signal info related to it? > > "pronounce on your strong preference"? I've seen you use "pronounce" > a few times now, and suggest a different word in the future; the > connotation is not well received on my end. I'm sorry. I don't have any particular attachment to that word, but I'll try to be concious to avoid it since you've expressed your aversion to it. I don't mean to load it down with any negative connotations, I'm simply seeking clarity on your preferred technical style so I may follow it. > > I have a bias > > to the former since the code already does that and I feel the exit of > > the last process is much more relevant supported by the syscall record, > > but could change it to the latter if you feel strongly enough about it > > to block upstream acceptance. 
> > At this point in time I believe the right thing to do is to preserve > the audit container ID as "dead but still in existence" so that there > is no confusion (due to reuse) if/when it finally reappears in the > audit record stream. I agree this seems safest. > The thread has had a lot of starts/stops, so I may be repeating a > previous suggestion, but one idea would be to still emit a "death > record" when the final task in the audit container ID does die, but > block the particular audit container ID from reuse until it the > SIGNAL2 info has been reported. This gives us the timely ACID death > notification while still preventing confusion and ambiguity caused by > potentially reusing the ACID before the SIGNAL2 record has been sent; > there is a small nit about the ACID being present in the SIGNAL2 > *after* its death, but I think that can be easily explained and > understood by admins. Thinking quickly about possible technical solutions to this, maybe it makes sense to have two counters on a contobj so that we know when the last process in that container exits and can issue the death certificate, but we still block reuse of it until all further references to it have been resolved. This will likely also make it possible to report the full contid chain in SIGNAL2 records. This will eliminate some of the issues we are discussing with regards to passing a contobj vs a contid to the audit_log_contid function, but won't eliminate them all because there are still some contids that won't have an object associated with them to make it impossible to look them up in the contobj lists. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
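A quick sketch of the two-counter variant (all names invented for illustration; nothing here is from the patchset):

struct audit_contobj {
	struct list_head	list;
	u64			id;
	refcount_t		tasks;	/* live tasks carrying this contid */
	refcount_t		refs;	/* tasks plus pending users such as SIGNAL2 */
	struct rcu_head		rcu;
};

static void audit_contobj_task_exit(struct audit_contobj *cont)
{
	/* Last task gone: the death record can be emitted right away... */
	if (refcount_dec_and_test(&cont->tasks))
		audit_log_contid_death(cont->id);	/* assumed helper */
	/* ...but the contid only becomes reusable once every outstanding
	 * reference, including a pending AUDIT_SIGNAL_INFO2, is dropped. */
	if (refcount_dec_and_test(&cont->refs))
		audit_contobj_free(cont);		/* assumed helper */
}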
On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-13 12:29, Paul Moore wrote: > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > another thought crossed my mind while thinking about this issue > > > > further ... Once we support multiple auditd instances, including the > > > > necessary record routing and duplication/multiple-sends (the host > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > the audit container ID (ACID) lists we send in the records. The > > > > auditd instance running on the host/initns will always see everything, > > > > so it will want the full container ACID list; however an auditd > > > > instance running inside a container really should only see the ACIDs > > > > of any child containers. > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > from seeing any contid that is a parent of its own contid. > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > each running an auditd instance. Inside container 1 there are > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > If an audit event is generated in container Z, I would expect the > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > should only see an ACID list of "Z". The auditd running in container > > > > 2 should not see the record at all (that will be relatively > > > > straightforward). Does that make sense? Do we have the record > > > > formats properly designed to handle this without too much problem (I'm > > > > not entirely sure we do)? > > > > > > I completely agree and I believe we have record formats that are able to > > > handle this already. > > > > I'm not convinced we do. What about the cases where we have a field > > with a list of audit container IDs? How do we handle that? > > I don't understand the problem. (I think you crossed your 1/2 vs > A/B/Y/Z in your example.) ... It looks like I did, sorry about that. > ... Clarifying the example above, if as you > suggest an event happens in container Z, the hosts's auditd would report > Z,^2 > and the auditd in container 2 would report > Z,^2 > but if there were another auditd running in container Z it would report > Z > while the auditd in container 1 or A/B would see nothing. Yes. My concern is how do we handle this to minimize duplicating and rewriting the records? It isn't so much about the format, although the format is a side effect.
On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-13 12:42, Paul Moore wrote: ... > > The thread has had a lot of starts/stops, so I may be repeating a > > previous suggestion, but one idea would be to still emit a "death > > record" when the final task in the audit container ID does die, but > > block the particular audit container ID from reuse until it the > > SIGNAL2 info has been reported. This gives us the timely ACID death > > notification while still preventing confusion and ambiguity caused by > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > there is a small nit about the ACID being present in the SIGNAL2 > > *after* its death, but I think that can be easily explained and > > understood by admins. > > Thinking quickly about possible technical solutions to this, maybe it > makes sense to have two counters on a contobj so that we know when the > last process in that container exits and can issue the death > certificate, but we still block reuse of it until all further references > to it have been resolved. This will likely also make it possible to > report the full contid chain in SIGNAL2 records. This will eliminate > some of the issues we are discussing with regards to passing a contobj > vs a contid to the audit_log_contid function, but won't eliminate them > all because there are still some contids that won't have an object > associated with them to make it impossible to look them up in the > contobj lists. I'm not sure you need a full second counter; I imagine a simple flag would be okay. I think you just need something to indicate that this ACID object is marked as "dead" but is still being held for sanity reasons and should not be reused.
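The flag variant suggested here could be sketched (again with invented names) as the existing single refcount plus a "dead" marker that keeps the contid out of the reuse pool:

struct audit_contobj {
	struct list_head	list;
	u64			id;
	refcount_t		refcount;
	bool			dead;	/* death record emitted, id not yet reusable */
	struct rcu_head		rcu;
};

/* Last task in the container exits: announce the death but leave the object
 * pinned by whoever still holds a reference, e.g. audit_signal_info(). */
static void audit_contobj_last_exit(struct audit_contobj *cont)
{
	cont->dead = true;
	audit_log_contid_death(cont->id);	/* assumed helper */
	audit_contobj_put(cont);		/* assumed helper */
}

/* Assigning this contid to a new container would be refused while a
 * dead-but-pinned object for it still exists. */
static bool audit_contid_blocked(const struct audit_contobj *cont)
{
	return cont && cont->dead;
}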
On 2020-03-18 16:56, Paul Moore wrote: > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-13 12:29, Paul Moore wrote: > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > > another thought crossed my mind while thinking about this issue > > > > > further ... Once we support multiple auditd instances, including the > > > > > necessary record routing and duplication/multiple-sends (the host > > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > > the audit container ID (ACID) lists we send in the records. The > > > > > auditd instance running on the host/initns will always see everything, > > > > > so it will want the full container ACID list; however an auditd > > > > > instance running inside a container really should only see the ACIDs > > > > > of any child containers. > > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > > from seeing any contid that is a parent of its own contid. > > > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > > each running an auditd instance. Inside container 1 there are > > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > > If an audit event is generated in container Z, I would expect the > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > > should only see an ACID list of "Z". The auditd running in container > > > > > 2 should not see the record at all (that will be relatively > > > > > straightforward). Does that make sense? Do we have the record > > > > > formats properly designed to handle this without too much problem (I'm > > > > > not entirely sure we do)? > > > > > > > > I completely agree and I believe we have record formats that are able to > > > > handle this already. > > > > > > I'm not convinced we do. What about the cases where we have a field > > > with a list of audit container IDs? How do we handle that? > > > > I don't understand the problem. (I think you crossed your 1/2 vs > > A/B/Y/Z in your example.) ... > > It looks like I did, sorry about that. > > > ... Clarifying the example above, if as you > > suggest an event happens in container Z, the hosts's auditd would report > > Z,^2 > > and the auditd in container 2 would report > > Z,^2 > > but if there were another auditd running in container Z it would report > > Z > > while the auditd in container 1 or A/B would see nothing. > > Yes. My concern is how do we handle this to minimize duplicating and > rewriting the records? It isn't so much about the format, although > the format is a side effect. Are you talking about caching, or about divulging more information than necessary or even information leaks? Or even noticing that records that need to be generated to two audit daemons share the same contid field values and should be generated at the same time or information shared between them? I'd see any of these as optimizations that don't affect the api. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On 2020-03-18 17:01, Paul Moore wrote: > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-13 12:42, Paul Moore wrote: > > ... > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > previous suggestion, but one idea would be to still emit a "death > > > record" when the final task in the audit container ID does die, but > > > block the particular audit container ID from reuse until it the > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > notification while still preventing confusion and ambiguity caused by > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > there is a small nit about the ACID being present in the SIGNAL2 > > > *after* its death, but I think that can be easily explained and > > > understood by admins. > > > > Thinking quickly about possible technical solutions to this, maybe it > > makes sense to have two counters on a contobj so that we know when the > > last process in that container exits and can issue the death > > certificate, but we still block reuse of it until all further references > > to it have been resolved. This will likely also make it possible to > > report the full contid chain in SIGNAL2 records. This will eliminate > > some of the issues we are discussing with regards to passing a contobj > > vs a contid to the audit_log_contid function, but won't eliminate them > > all because there are still some contids that won't have an object > > associated with them to make it impossible to look them up in the > > contobj lists. > > I'm not sure you need a full second counter, I imagine a simple flag > would be okay. I think you just something to indicate that this ACID > object is marked as "dead" but it still being held for sanity reasons > and should not be reused. Ok, I see your point. This refcount can be changed to a flag easily enough without change to the api if we can be sure that more than one signal can't be delivered to the audit daemon *and* collected by sig2. I'll have a more careful look at the audit daemon code to see if I can determine this. Steve, can you have a look and tell us if it is possible for the audit daemon to make more than one signal_info (or signal_info2) record request from the kernel after receiving a signal? Another question occurs to me is that what if the audit daemon is sent a signal and it cannot or will not collect the sig2 information from the kernel (SIGKILL?)? Does that audit container identifier remain dead until reboot, or do we institute some other form of reaping, possibly time-based? > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-18 16:56, Paul Moore wrote: > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-13 12:29, Paul Moore wrote: > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > > > another thought crossed my mind while thinking about this issue > > > > > > further ... Once we support multiple auditd instances, including the > > > > > > necessary record routing and duplication/multiple-sends (the host > > > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > > > the audit container ID (ACID) lists we send in the records. The > > > > > > auditd instance running on the host/initns will always see everything, > > > > > > so it will want the full container ACID list; however an auditd > > > > > > instance running inside a container really should only see the ACIDs > > > > > > of any child containers. > > > > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > > > from seeing any contid that is a parent of its own contid. > > > > > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > > > each running an auditd instance. Inside container 1 there are > > > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > > > If an audit event is generated in container Z, I would expect the > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > > > should only see an ACID list of "Z". The auditd running in container > > > > > > 2 should not see the record at all (that will be relatively > > > > > > straightforward). Does that make sense? Do we have the record > > > > > > formats properly designed to handle this without too much problem (I'm > > > > > > not entirely sure we do)? > > > > > > > > > > I completely agree and I believe we have record formats that are able to > > > > > handle this already. > > > > > > > > I'm not convinced we do. What about the cases where we have a field > > > > with a list of audit container IDs? How do we handle that? > > > > > > I don't understand the problem. (I think you crossed your 1/2 vs > > > A/B/Y/Z in your example.) ... > > > > It looks like I did, sorry about that. > > > > > ... Clarifying the example above, if as you > > > suggest an event happens in container Z, the hosts's auditd would report > > > Z,^2 > > > and the auditd in container 2 would report > > > Z,^2 > > > but if there were another auditd running in container Z it would report > > > Z > > > while the auditd in container 1 or A/B would see nothing. > > > > Yes. My concern is how do we handle this to minimize duplicating and > > rewriting the records? It isn't so much about the format, although > > the format is a side effect. > > Are you talking about caching, or about divulging more information than > necessary or even information leaks? Or even noticing that records that > need to be generated to two audit daemons share the same contid field > values and should be generated at the same time or information shared > between them? I'd see any of these as optimizations that don't affect > the api. 
Imagine a record is generated in a container which has more than one auditd in its ancestry that should receive this record; how do we handle that without completely killing performance? That's my concern. If you've already thought up a plan for this - excellent, please share :)
On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-18 17:01, Paul Moore wrote: > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > ... > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > previous suggestion, but one idea would be to still emit a "death > > > > record" when the final task in the audit container ID does die, but > > > > block the particular audit container ID from reuse until it the > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > notification while still preventing confusion and ambiguity caused by > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > *after* its death, but I think that can be easily explained and > > > > understood by admins. > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > makes sense to have two counters on a contobj so that we know when the > > > last process in that container exits and can issue the death > > > certificate, but we still block reuse of it until all further references > > > to it have been resolved. This will likely also make it possible to > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > some of the issues we are discussing with regards to passing a contobj > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > all because there are still some contids that won't have an object > > > associated with them to make it impossible to look them up in the > > > contobj lists. > > > > I'm not sure you need a full second counter, I imagine a simple flag > > would be okay. I think you just something to indicate that this ACID > > object is marked as "dead" but it still being held for sanity reasons > > and should not be reused. > > Ok, I see your point. This refcount can be changed to a flag easily > enough without change to the api if we can be sure that more than one > signal can't be delivered to the audit daemon *and* collected by sig2. > I'll have a more careful look at the audit daemon code to see if I can > determine this. Maybe I'm not understanding your concern, but this isn't really different than any of the other things we track for the auditd signal sender, right? If we are worried about multiple signals being sent then it applies to everything, not just the audit container ID. > Another question occurs to me is that what if the audit daemon is sent a > signal and it cannot or will not collect the sig2 information from the > kernel (SIGKILL?)? Does that audit container identifier remain dead > until reboot, or do we institute some other form of reaping, possibly > time-based? In order to preserve the integrity of the audit log that ACID value would need to remain unavailable until the ACID which contains the associated auditd is "dead" (no one can request the signal sender's info if that container is dead).
On 2020-03-18 17:42, Paul Moore wrote: > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-18 16:56, Paul Moore wrote: > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-13 12:29, Paul Moore wrote: > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > > > > another thought crossed my mind while thinking about this issue > > > > > > > further ... Once we support multiple auditd instances, including the > > > > > > > necessary record routing and duplication/multiple-sends (the host > > > > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > > > > the audit container ID (ACID) lists we send in the records. The > > > > > > > auditd instance running on the host/initns will always see everything, > > > > > > > so it will want the full container ACID list; however an auditd > > > > > > > instance running inside a container really should only see the ACIDs > > > > > > > of any child containers. > > > > > > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > > > > from seeing any contid that is a parent of its own contid. > > > > > > > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > > > > each running an auditd instance. Inside container 1 there are > > > > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > > > > If an audit event is generated in container Z, I would expect the > > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > > > > should only see an ACID list of "Z". The auditd running in container > > > > > > > 2 should not see the record at all (that will be relatively > > > > > > > straightforward). Does that make sense? Do we have the record > > > > > > > formats properly designed to handle this without too much problem (I'm > > > > > > > not entirely sure we do)? > > > > > > > > > > > > I completely agree and I believe we have record formats that are able to > > > > > > handle this already. > > > > > > > > > > I'm not convinced we do. What about the cases where we have a field > > > > > with a list of audit container IDs? How do we handle that? > > > > > > > > I don't understand the problem. (I think you crossed your 1/2 vs > > > > A/B/Y/Z in your example.) ... > > > > > > It looks like I did, sorry about that. > > > > > > > ... Clarifying the example above, if as you > > > > suggest an event happens in container Z, the hosts's auditd would report > > > > Z,^2 > > > > and the auditd in container 2 would report > > > > Z,^2 > > > > but if there were another auditd running in container Z it would report > > > > Z > > > > while the auditd in container 1 or A/B would see nothing. > > > > > > Yes. My concern is how do we handle this to minimize duplicating and > > > rewriting the records? It isn't so much about the format, although > > > the format is a side effect. > > > > Are you talking about caching, or about divulging more information than > > necessary or even information leaks? Or even noticing that records that > > need to be generated to two audit daemons share the same contid field > > values and should be generated at the same time or information shared > > between them? I'd see any of these as optimizations that don't affect > > the api. 
> > Imagine a record is generated in a container which has more than one > auditd in it's ancestry that should receive this record, how do we > handle that without completely killing performance? That's my > concern. If you've already thought up a plan for this - excellent, > please share :) No, I haven't given that much thought other than the correctness and security issues of making sure that each audit daemon is sufficiently isolated to do its job but not jeopardize another audit domain. Audit already kills performance, according to some... We currently won't have that problem since there can only be one so far. Fixing and optimizing this is part of the next phase of the challenge of adding a second audit daemon. Let's work on correctness and reasonable efficiency for this phase and not focus on a problem we don't yet have. I wouldn't consider this incurring technical debt at this point. I could see caching a contid string from one starting point, but it may be more work to search that cached string to truncate it or add to it when another audit daemon requests a copy of a similar string. I suppose every full contid string could be generated the first time it is used and parts of it used (start/finish) as needed, but that search/indexing may not be worth it. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-18 17:42, Paul Moore wrote: > > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-18 16:56, Paul Moore wrote: > > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-13 12:29, Paul Moore wrote: > > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > > > > > another thought crossed my mind while thinking about this issue > > > > > > > > further ... Once we support multiple auditd instances, including the > > > > > > > > necessary record routing and duplication/multiple-sends (the host > > > > > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > > > > > the audit container ID (ACID) lists we send in the records. The > > > > > > > > auditd instance running on the host/initns will always see everything, > > > > > > > > so it will want the full container ACID list; however an auditd > > > > > > > > instance running inside a container really should only see the ACIDs > > > > > > > > of any child containers. > > > > > > > > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > > > > > from seeing any contid that is a parent of its own contid. > > > > > > > > > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > > > > > each running an auditd instance. Inside container 1 there are > > > > > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > > > > > If an audit event is generated in container Z, I would expect the > > > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > > > > > should only see an ACID list of "Z". The auditd running in container > > > > > > > > 2 should not see the record at all (that will be relatively > > > > > > > > straightforward). Does that make sense? Do we have the record > > > > > > > > formats properly designed to handle this without too much problem (I'm > > > > > > > > not entirely sure we do)? > > > > > > > > > > > > > > I completely agree and I believe we have record formats that are able to > > > > > > > handle this already. > > > > > > > > > > > > I'm not convinced we do. What about the cases where we have a field > > > > > > with a list of audit container IDs? How do we handle that? > > > > > > > > > > I don't understand the problem. (I think you crossed your 1/2 vs > > > > > A/B/Y/Z in your example.) ... > > > > > > > > It looks like I did, sorry about that. > > > > > > > > > ... Clarifying the example above, if as you > > > > > suggest an event happens in container Z, the hosts's auditd would report > > > > > Z,^2 > > > > > and the auditd in container 2 would report > > > > > Z,^2 > > > > > but if there were another auditd running in container Z it would report > > > > > Z > > > > > while the auditd in container 1 or A/B would see nothing. > > > > > > > > Yes. My concern is how do we handle this to minimize duplicating and > > > > rewriting the records? It isn't so much about the format, although > > > > the format is a side effect. > > > > > > Are you talking about caching, or about divulging more information than > > > necessary or even information leaks? 
Or even noticing that records that > > > need to be generated to two audit daemons share the same contid field > > > values and should be generated at the same time or information shared > > > between them? I'd see any of these as optimizations that don't affect > > > the api. > > > > Imagine a record is generated in a container which has more than one > > auditd in it's ancestry that should receive this record, how do we > > handle that without completely killing performance? That's my > > concern. If you've already thought up a plan for this - excellent, > > please share :) > > No, I haven't given that much thought other than the correctness and > security issues of making sure that each audit daemon is sufficiently > isolated to do its job but not jeopardize another audit domain. Audit > already kills performance, according to some... > > We currently won't have that problem since there can only be one so far. > Fixing and optimizing this is part of the next phase of the challenge of > adding a second audit daemon. > > Let's work on correctness and reasonable efficiency for this phase and > not focus on a problem we don't yet have. I wouldn't consider this > incurring technical debt at this point. I agree, one stage at a time, but the choice we make here is going to have a significant impact on what we can do later. We need to get this as "right" as possible; this isn't something we should dismiss with a hand-wave as a problem for the next stage. We don't need an implementation, but I would like to see a rough design of how we would address this problem. > I could see cacheing a contid string from one starting point, but it may > be more work to search that cached string to truncate it or add to it > when another audit daemon requests a copy of a similar string. I > suppose every full contid string could be generated the first time it is > used and parts of it used (start/finish) as needed but that > search/indexing may not be worth it. I hope we can do better than string manipulations in the kernel. I'd much rather defer generating the ACID list (if possible), than generating a list only to keep copying and editing it as the record is sent.
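Paul's preference for avoiding string manipulation points at doing the scope decision on raw contid values and deferring all formatting. A minimal sketch of such a check, assuming the nesting chain is available as an array of u64 values and that AUDIT_CID_UNSET (the sentinel used by the patchset, value assumed here) marks a daemon with no contid, i.e. the host:

    typedef unsigned long long u64;
    #define AUDIT_CID_UNSET ((u64)-1)       /* sentinel value assumed here */

    /*
     * chain[] is the event's contid nesting chain, innermost first.  A
     * daemon sees the record only if it is the host (no contid of its own)
     * or if its own contid appears somewhere in the chain, i.e. the event
     * happened in its own container or one of its descendants.
     */
    static int audit_contid_in_scope(const u64 *chain, int n, u64 daemon_cid)
    {
            int i;

            if (daemon_cid == AUDIT_CID_UNSET)
                    return 1;
            for (i = 0; i < n; i++)
                    if (chain[i] == daemon_cid)
                            return 1;
            return 0;
    }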
On 2020-03-18 17:47, Paul Moore wrote: > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-18 17:01, Paul Moore wrote: > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > > > ... > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > > previous suggestion, but one idea would be to still emit a "death > > > > > record" when the final task in the audit container ID does die, but > > > > > block the particular audit container ID from reuse until it the > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > > notification while still preventing confusion and ambiguity caused by > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > > *after* its death, but I think that can be easily explained and > > > > > understood by admins. > > > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > > makes sense to have two counters on a contobj so that we know when the > > > > last process in that container exits and can issue the death > > > > certificate, but we still block reuse of it until all further references > > > > to it have been resolved. This will likely also make it possible to > > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > > some of the issues we are discussing with regards to passing a contobj > > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > > all because there are still some contids that won't have an object > > > > associated with them to make it impossible to look them up in the > > > > contobj lists. > > > > > > I'm not sure you need a full second counter, I imagine a simple flag > > > would be okay. I think you just something to indicate that this ACID > > > object is marked as "dead" but it still being held for sanity reasons > > > and should not be reused. > > > > Ok, I see your point. This refcount can be changed to a flag easily > > enough without change to the api if we can be sure that more than one > > signal can't be delivered to the audit daemon *and* collected by sig2. > > I'll have a more careful look at the audit daemon code to see if I can > > determine this. > > Maybe I'm not understanding your concern, but this isn't really > different than any of the other things we track for the auditd signal > sender, right? If we are worried about multiple signals being sent > then it applies to everything, not just the audit container ID. Yes, you are right. In all other cases the information is simply overwritten. In the case of the audit container identifier any previous value is put before a new one is referenced, so only the last signal is kept. So, we only need a flag. Does a flag implemented with a rcu-protected refcount sound reasonable to you? > > Another question occurs to me is that what if the audit daemon is sent a > > signal and it cannot or will not collect the sig2 information from the > > kernel (SIGKILL?)? Does that audit container identifier remain dead > > until reboot, or do we institute some other form of reaping, possibly > > time-based? 
> > In order to preserve the integrity of the audit log that ACID value > would need to remain unavailable until the ACID which contains the > associated auditd is "dead" (no one can request the signal sender's > info if that container is dead). I don't understand why it would be associated with the contid of the audit daemon process rather than with the audit daemon process itself. How does the signal collection somehow get transferred or delegated to another member of that audit daemon's container? Thinking aloud here, the audit daemon's exit when it calls audit_free() needs to ..._put_sig and cancel that audit_sig_cid (which in the future will be allocated per auditd rather than the global it is now since there is only one audit daemon). > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
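For the flag-versus-second-counter question, a hedged kernel-style sketch (the struct layout, helper names, locking, and list management are all simplified or invented) of how a dead flag plus the existing reference count could block reuse of the identifier until the pending SIGNAL_INFO2 reference is dropped:

    #include <linux/list.h>
    #include <linux/refcount.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct audit_contobj {
            struct list_head        list;           /* on the global contobj list */
            u64                     id;
            refcount_t              refcount;       /* tasks + pending records/sig2 */
            bool                    dead;           /* last task gone, id blocked */
    };

    static void audit_contobj_put(struct audit_contobj *obj)
    {
            if (refcount_dec_and_test(&obj->refcount)) {
                    /* unlink from the contobj list here; only now may an
                     * orchestrator reuse obj->id */
                    kfree(obj);
            }
    }

    /* Called when the last task in the container exits: emit the death
     * record immediately, but the id stays blocked until every remaining
     * holder (e.g. audit_sig_cid) has done a _put(). */
    static void audit_contobj_exit(struct audit_contobj *obj)
    {
            obj->dead = true;
            /* audit_log_container_op(obj, "death");  -- hypothetical */
            audit_contobj_put(obj);
    }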
On 2020-03-18 18:06, Paul Moore wrote: > On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-18 17:42, Paul Moore wrote: > > > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-18 16:56, Paul Moore wrote: > > > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-03-13 12:29, Paul Moore wrote: > > > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > On 2020-02-13 16:44, Paul Moore wrote: > > > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but > > > > > > > > > another thought crossed my mind while thinking about this issue > > > > > > > > > further ... Once we support multiple auditd instances, including the > > > > > > > > > necessary record routing and duplication/multiple-sends (the host > > > > > > > > > always sees *everything*), we will likely need to find a way to "trim" > > > > > > > > > the audit container ID (ACID) lists we send in the records. The > > > > > > > > > auditd instance running on the host/initns will always see everything, > > > > > > > > > so it will want the full container ACID list; however an auditd > > > > > > > > > instance running inside a container really should only see the ACIDs > > > > > > > > > of any child containers. > > > > > > > > > > > > > > > > Agreed. This should be easy to check and limit, preventing an auditd > > > > > > > > from seeing any contid that is a parent of its own contid. > > > > > > > > > > > > > > > > > For example, imagine a system where the host has containers 1 and 2, > > > > > > > > > each running an auditd instance. Inside container 1 there are > > > > > > > > > containers A and B. Inside container 2 there are containers Y and Z. > > > > > > > > > If an audit event is generated in container Z, I would expect the > > > > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd > > > > > > > > > should only see an ACID list of "Z". The auditd running in container > > > > > > > > > 2 should not see the record at all (that will be relatively > > > > > > > > > straightforward). Does that make sense? Do we have the record > > > > > > > > > formats properly designed to handle this without too much problem (I'm > > > > > > > > > not entirely sure we do)? > > > > > > > > > > > > > > > > I completely agree and I believe we have record formats that are able to > > > > > > > > handle this already. > > > > > > > > > > > > > > I'm not convinced we do. What about the cases where we have a field > > > > > > > with a list of audit container IDs? How do we handle that? > > > > > > > > > > > > I don't understand the problem. (I think you crossed your 1/2 vs > > > > > > A/B/Y/Z in your example.) ... > > > > > > > > > > It looks like I did, sorry about that. > > > > > > > > > > > ... Clarifying the example above, if as you > > > > > > suggest an event happens in container Z, the hosts's auditd would report > > > > > > Z,^2 > > > > > > and the auditd in container 2 would report > > > > > > Z,^2 > > > > > > but if there were another auditd running in container Z it would report > > > > > > Z > > > > > > while the auditd in container 1 or A/B would see nothing. > > > > > > > > > > Yes. My concern is how do we handle this to minimize duplicating and > > > > > rewriting the records? It isn't so much about the format, although > > > > > the format is a side effect. 
> > > > > > > > Are you talking about caching, or about divulging more information than > > > > necessary or even information leaks? Or even noticing that records that > > > > need to be generated to two audit daemons share the same contid field > > > > values and should be generated at the same time or information shared > > > > between them? I'd see any of these as optimizations that don't affect > > > > the api. > > > > > > Imagine a record is generated in a container which has more than one > > > auditd in it's ancestry that should receive this record, how do we > > > handle that without completely killing performance? That's my > > > concern. If you've already thought up a plan for this - excellent, > > > please share :) > > > > No, I haven't given that much thought other than the correctness and > > security issues of making sure that each audit daemon is sufficiently > > isolated to do its job but not jeopardize another audit domain. Audit > > already kills performance, according to some... > > > > We currently won't have that problem since there can only be one so far. > > Fixing and optimizing this is part of the next phase of the challenge of > > adding a second audit daemon. > > > > Let's work on correctness and reasonable efficiency for this phase and > > not focus on a problem we don't yet have. I wouldn't consider this > > incurring technical debt at this point. > > I agree, one stage at a time, but the choice we make here is going to > have a significant impact on what we can do later. We need to get > this as "right" as possible; this isn't something we should dismiss > with a hand-wave as a problem for the next stage. We don't need an > implementation, but I would like to see a rough design of how we would > address this problem. > > > I could see cacheing a contid string from one starting point, but it may > > be more work to search that cached string to truncate it or add to it > > when another audit daemon requests a copy of a similar string. I > > suppose every full contid string could be generated the first time it is > > used and parts of it used (start/finish) as needed but that > > search/indexing may not be worth it. > > I hope we can do better than string manipulations in the kernel. I'd > much rather defer generating the ACID list (if possible), than > generating a list only to keep copying and editing it as the record is > sent. At the moment we are stuck with a string-only format. The contid list only exists in the kernel. When do you suggest generating the contid list? It sounds like you are hinting at userspace generating that list from multiple records over the span of audit logs since boot of the machine. Even if we had a binary format, the current design would require generating that list at the time of record generation since it could be any contiguous subset of a full nested contid list. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-18 17:47, Paul Moore wrote: > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-18 17:01, Paul Moore wrote: > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > > > > > ... > > > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > > > previous suggestion, but one idea would be to still emit a "death > > > > > > record" when the final task in the audit container ID does die, but > > > > > > block the particular audit container ID from reuse until it the > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > > > notification while still preventing confusion and ambiguity caused by > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > > > *after* its death, but I think that can be easily explained and > > > > > > understood by admins. > > > > > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > > > makes sense to have two counters on a contobj so that we know when the > > > > > last process in that container exits and can issue the death > > > > > certificate, but we still block reuse of it until all further references > > > > > to it have been resolved. This will likely also make it possible to > > > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > > > some of the issues we are discussing with regards to passing a contobj > > > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > > > all because there are still some contids that won't have an object > > > > > associated with them to make it impossible to look them up in the > > > > > contobj lists. > > > > > > > > I'm not sure you need a full second counter, I imagine a simple flag > > > > would be okay. I think you just something to indicate that this ACID > > > > object is marked as "dead" but it still being held for sanity reasons > > > > and should not be reused. > > > > > > Ok, I see your point. This refcount can be changed to a flag easily > > > enough without change to the api if we can be sure that more than one > > > signal can't be delivered to the audit daemon *and* collected by sig2. > > > I'll have a more careful look at the audit daemon code to see if I can > > > determine this. > > > > Maybe I'm not understanding your concern, but this isn't really > > different than any of the other things we track for the auditd signal > > sender, right? If we are worried about multiple signals being sent > > then it applies to everything, not just the audit container ID. > > Yes, you are right. In all other cases the information is simply > overwritten. In the case of the audit container identifier any > previous value is put before a new one is referenced, so only the last > signal is kept. So, we only need a flag. Does a flag implemented with > a rcu-protected refcount sound reasonable to you? Well, if I recall correctly you still need to fix the locking in this patchset so until we see what that looks like it is hard to say for certain. Just make sure that the flag is somehow protected from races; it is probably a lot like the "valid" flags you sometimes see with RCU protected lists. 
> > > Another question occurs to me is that what if the audit daemon is sent a > > > signal and it cannot or will not collect the sig2 information from the > > > kernel (SIGKILL?)? Does that audit container identifier remain dead > > > until reboot, or do we institute some other form of reaping, possibly > > > time-based? > > > > In order to preserve the integrity of the audit log that ACID value > > would need to remain unavailable until the ACID which contains the > > associated auditd is "dead" (no one can request the signal sender's > > info if that container is dead). > > I don't understand why it would be associated with the contid of the > audit daemon process rather than with the audit daemon process itself. > How does the signal collection somehow get transferred or delegated to > another member of that audit daemon's container? Presumably once we support multiple audit daemons we will need a struct to contain the associated connection state, with at most one struct (and one auditd) allowed for a given ACID. I would expect that the signal sender info would be part of that state included in that struct. If a task sent a signal to it's associated auditd, and no one ever queried the signal information stored in the per-ACID state struct, I would expect that the refcount/flag/whatever would remain held for the signal sender's ACID until the auditd state's ACID died (the struct would be reaped as part of the ACID death). In cases where the container orchestrator blocks sending signals across ACID boundaries this really isn't an issue as it will all be the same ACID, but since we don't want to impose any restrictions on what a container *could* be it is important to make sure we handle the case where the signal sender's ACID may be different from the associated auditd's ACID. > Thinking aloud here, the audit daemon's exit when it calls audit_free() > needs to ..._put_sig and cancel that audit_sig_cid (which in the future > will be allocated per auditd rather than the global it is now since > there is only one audit daemon). > > > paul moore > > - RGB > > -- > Richard Guy Briggs <rgb@redhat.com> > Sr. S/W Engineer, Kernel Security, Base Operating Systems > Remote, Ottawa, Red Hat Canada > IRC: rgb, SunRaycer > Voice: +1.647.777.2635, Internal: (81) 32635
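As an illustration of the per-ACID state described in words above, a hypothetical struct, not something in the posted patches, in which the signal-sender information, including a held reference on the sender's audit container ID object, lives alongside the rest of a daemon's connection state and is released only when queried or when that daemon's own ACID dies:

    #include <linux/types.h>
    #include <linux/uidgid.h>

    struct auditd_state {                   /* one per registered auditd */
            struct audit_contobj    *cont;  /* the ACID this auditd lives in */
            u32                     portid; /* netlink connection details elided */

            /* info about the last signal sent to this auditd */
            pid_t                   sig_pid;
            kuid_t                  sig_uid;
            u32                     sig_sid;
            struct audit_contobj    *sig_cont;      /* held ref on sender's ACID */
    };

    /* Drop the held reference after the SIGNAL_INFO2 reply is delivered,
     * or from the reaper when this auditd's own container dies. */
    static void auditd_put_sig(struct auditd_state *st)
    {
            if (st->sig_cont) {
                    audit_contobj_put(st->sig_cont);        /* see sketch above */
                    st->sig_cont = NULL;
            }
    }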
On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-18 18:06, Paul Moore wrote: ... > > I hope we can do better than string manipulations in the kernel. I'd > > much rather defer generating the ACID list (if possible), than > > generating a list only to keep copying and editing it as the record is > > sent. > > At the moment we are stuck with a string-only format. Yes, we are. That is another topic, and another set of changes I've been deferring so as to not disrupt the audit container ID work. I was thinking of what we do inside the kernel between when the record triggering event happens and when we actually emit the record to userspace. Perhaps we collect the ACID information while the event is occurring, but we defer generating the record until later when we have a better understanding of what should be included in the ACID list. It is somewhat similar (but obviously different) to what we do for PATH records (we collect the pathname info when the path is being resolved).
On 2020-03-23 20:16, Paul Moore wrote: > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-18 18:06, Paul Moore wrote: > > ... > > > > I hope we can do better than string manipulations in the kernel. I'd > > > much rather defer generating the ACID list (if possible), than > > > generating a list only to keep copying and editing it as the record is > > > sent. > > > > At the moment we are stuck with a string-only format. > > Yes, we are. That is another topic, and another set of changes I've > been deferring so as to not disrupt the audit container ID work. > > I was thinking of what we do inside the kernel between when the record > triggering event happens and when we actually emit the record to > userspace. Perhaps we collect the ACID information while the event is > occurring, but we defer generating the record until later when we have > a better understanding of what should be included in the ACID list. > It is somewhat similar (but obviously different) to what we do for > PATH records (we collect the pathname info when the path is being > resolved). Ok, now I understand your concern. In the case of NETFILTER_PKT records, the CONTAINER_ID record is the only other possible record and they are generated at the same time with a local context. In the case of any event involving a syscall, that CONTAINER_ID record is generated at the time of the rest of the event record generation at syscall exit. The others are only generated when needed, such as the sig2 reply. We generally just store the contobj pointer until we actually generate the CONTAINER_ID (or CONTAINER_OP) record. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
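Roughly how the "store the contobj pointer, render the record later" flow reads in code. The helpers audit_contobj_get/put/id and the ctx->contobj field are hypothetical stand-ins; audit_log_start/format/end are the existing kernel helpers, and AUDIT_CONTAINER_ID is assumed to be the record type name the patchset uses for the CONTAINER_ID record:

    #include <linux/audit.h>

    /* At event time: take a reference, do no formatting. */
    static void audit_collect_contid(struct audit_context *ctx)
    {
            ctx->contobj = audit_contobj_get(current);      /* hypothetical */
    }

    /* At syscall exit (or when building a sig2 reply): render the record,
     * at which point the receiving daemon, and so the scope of the contid
     * chain, is known. */
    static void audit_log_contid_record(struct audit_context *ctx)
    {
            struct audit_buffer *ab;

            ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_CONTAINER_ID);
            if (!ab)
                    return;
            audit_log_format(ab, "contid=%llu", audit_contobj_id(ctx->contobj));
            audit_log_end(ab);
            audit_contobj_put(ctx->contobj);                /* hypothetical */
            ctx->contobj = NULL;
    }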
On 2020-03-20 17:56, Paul Moore wrote: > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-18 17:47, Paul Moore wrote: > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-18 17:01, Paul Moore wrote: > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > > > > > > > ... > > > > > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > > > > previous suggestion, but one idea would be to still emit a "death > > > > > > > record" when the final task in the audit container ID does die, but > > > > > > > block the particular audit container ID from reuse until it the > > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > > > > notification while still preventing confusion and ambiguity caused by > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > > > > *after* its death, but I think that can be easily explained and > > > > > > > understood by admins. > > > > > > > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > > > > makes sense to have two counters on a contobj so that we know when the > > > > > > last process in that container exits and can issue the death > > > > > > certificate, but we still block reuse of it until all further references > > > > > > to it have been resolved. This will likely also make it possible to > > > > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > > > > some of the issues we are discussing with regards to passing a contobj > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > > > > all because there are still some contids that won't have an object > > > > > > associated with them to make it impossible to look them up in the > > > > > > contobj lists. > > > > > > > > > > I'm not sure you need a full second counter, I imagine a simple flag > > > > > would be okay. I think you just something to indicate that this ACID > > > > > object is marked as "dead" but it still being held for sanity reasons > > > > > and should not be reused. > > > > > > > > Ok, I see your point. This refcount can be changed to a flag easily > > > > enough without change to the api if we can be sure that more than one > > > > signal can't be delivered to the audit daemon *and* collected by sig2. > > > > I'll have a more careful look at the audit daemon code to see if I can > > > > determine this. > > > > > > Maybe I'm not understanding your concern, but this isn't really > > > different than any of the other things we track for the auditd signal > > > sender, right? If we are worried about multiple signals being sent > > > then it applies to everything, not just the audit container ID. > > > > Yes, you are right. In all other cases the information is simply > > overwritten. In the case of the audit container identifier any > > previous value is put before a new one is referenced, so only the last > > signal is kept. So, we only need a flag. Does a flag implemented with > > a rcu-protected refcount sound reasonable to you? > > Well, if I recall correctly you still need to fix the locking in this > patchset so until we see what that looks like it is hard to say for > certain. 
Just make sure that the flag is somehow protected from > races; it is probably a lot like the "valid" flags you sometimes see > with RCU protected lists. This is like looking for a needle in a haystack. Can you point me to some code that does "valid" flags with RCU protected lists. > > > > Another question occurs to me is that what if the audit daemon is sent a > > > > signal and it cannot or will not collect the sig2 information from the > > > > kernel (SIGKILL?)? Does that audit container identifier remain dead > > > > until reboot, or do we institute some other form of reaping, possibly > > > > time-based? > > > > > > In order to preserve the integrity of the audit log that ACID value > > > would need to remain unavailable until the ACID which contains the > > > associated auditd is "dead" (no one can request the signal sender's > > > info if that container is dead). > > > > I don't understand why it would be associated with the contid of the > > audit daemon process rather than with the audit daemon process itself. > > How does the signal collection somehow get transferred or delegated to > > another member of that audit daemon's container? > > Presumably once we support multiple audit daemons we will need a > struct to contain the associated connection state, with at most one > struct (and one auditd) allowed for a given ACID. I would expect that > the signal sender info would be part of that state included in that > struct. If a task sent a signal to it's associated auditd, and no one > ever queried the signal information stored in the per-ACID state > struct, I would expect that the refcount/flag/whatever would remain > held for the signal sender's ACID until the auditd state's ACID died > (the struct would be reaped as part of the ACID death). In cases > where the container orchestrator blocks sending signals across ACID > boundaries this really isn't an issue as it will all be the same ACID, > but since we don't want to impose any restrictions on what a container > *could* be it is important to make sure we handle the case where the > signal sender's ACID may be different from the associated auditd's > ACID. > > > Thinking aloud here, the audit daemon's exit when it calls audit_free() > > needs to ..._put_sig and cancel that audit_sig_cid (which in the future > > will be allocated per auditd rather than the global it is now since > > there is only one audit daemon). > > > > > paul moore > > > > - RGB > > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-23 20:16, Paul Moore wrote: > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > ... > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > much rather defer generating the ACID list (if possible), than > > > > generating a list only to keep copying and editing it as the record is > > > > sent. > > > > > > At the moment we are stuck with a string-only format. > > > > Yes, we are. That is another topic, and another set of changes I've > > been deferring so as to not disrupt the audit container ID work. > > > > I was thinking of what we do inside the kernel between when the record > > triggering event happens and when we actually emit the record to > > userspace. Perhaps we collect the ACID information while the event is > > occurring, but we defer generating the record until later when we have > > a better understanding of what should be included in the ACID list. > > It is somewhat similar (but obviously different) to what we do for > > PATH records (we collect the pathname info when the path is being > > resolved). > > Ok, now I understand your concern. > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > only other possible record and they are generated at the same time with > a local context. > > In the case of any event involving a syscall, that CONTAINER_ID record > is generated at the time of the rest of the event record generation at > syscall exit. > > The others are only generated when needed, such as the sig2 reply. > > We generally just store the contobj pointer until we actually generate > the CONTAINER_ID (or CONTAINER_OP) record. Perhaps I'm remembering your latest spin of these patches incorrectly, but there is still a big gap between when the record is generated and when it is sent up to the audit daemon. Most importantly in that gap is the whole big queue/multicast/unicast mess. You don't need to show me code, but I would like to see some sort of plan for dealing with multiple nested audit daemons. Basically I just want to make sure we aren't painting ourselves into a corner with this approach; and if for some horrible reason we are, I at least want us to be aware of what we are getting ourselves into.
On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-20 17:56, Paul Moore wrote: > > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-18 17:47, Paul Moore wrote: > > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-18 17:01, Paul Moore wrote: > > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > > > > > > > > > ... > > > > > > > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > > > > > previous suggestion, but one idea would be to still emit a "death > > > > > > > > record" when the final task in the audit container ID does die, but > > > > > > > > block the particular audit container ID from reuse until it the > > > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > > > > > notification while still preventing confusion and ambiguity caused by > > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > > > > > *after* its death, but I think that can be easily explained and > > > > > > > > understood by admins. > > > > > > > > > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > > > > > makes sense to have two counters on a contobj so that we know when the > > > > > > > last process in that container exits and can issue the death > > > > > > > certificate, but we still block reuse of it until all further references > > > > > > > to it have been resolved. This will likely also make it possible to > > > > > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > > > > > some of the issues we are discussing with regards to passing a contobj > > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > > > > > all because there are still some contids that won't have an object > > > > > > > associated with them to make it impossible to look them up in the > > > > > > > contobj lists. > > > > > > > > > > > > I'm not sure you need a full second counter, I imagine a simple flag > > > > > > would be okay. I think you just something to indicate that this ACID > > > > > > object is marked as "dead" but it still being held for sanity reasons > > > > > > and should not be reused. > > > > > > > > > > Ok, I see your point. This refcount can be changed to a flag easily > > > > > enough without change to the api if we can be sure that more than one > > > > > signal can't be delivered to the audit daemon *and* collected by sig2. > > > > > I'll have a more careful look at the audit daemon code to see if I can > > > > > determine this. > > > > > > > > Maybe I'm not understanding your concern, but this isn't really > > > > different than any of the other things we track for the auditd signal > > > > sender, right? If we are worried about multiple signals being sent > > > > then it applies to everything, not just the audit container ID. > > > > > > Yes, you are right. In all other cases the information is simply > > > overwritten. In the case of the audit container identifier any > > > previous value is put before a new one is referenced, so only the last > > > signal is kept. So, we only need a flag. Does a flag implemented with > > > a rcu-protected refcount sound reasonable to you? 
> > > > Well, if I recall correctly you still need to fix the locking in this > > patchset so until we see what that looks like it is hard to say for > > certain. Just make sure that the flag is somehow protected from > > races; it is probably a lot like the "valid" flags you sometimes see > > with RCU protected lists. > > This is like looking for a needle in a haystack. Can you point me to > some code that does "valid" flags with RCU protected lists. Sigh. Come on Richard, you've been playing in the kernel for some time now. I can't think of one off the top of my head as I write this, but there are several resources that deal with RCU protected lists in the kernel, Google is your friend and Documentation/RCU is your friend. Spending time to learn how RCU works and how to use it properly is not time wasted. It's a tricky thing to get right (I have to refresh my memory on some of the more subtle details each time I write/review RCU code), but it's very cool when done correctly.
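For readers following along, the pattern being referred to usually looks something like the generic sketch below (not audit code): entries sit on an RCU-protected list, the dead/valid flag is flipped under a spinlock, and lockless readers re-check the flag so they never act on an entry that has just been killed.

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct acid_entry {
            struct list_head        list;
            u64                     id;
            bool                    dead;   /* the "valid" flag */
            struct rcu_head         rcu;
    };

    static LIST_HEAD(acid_list);
    static DEFINE_SPINLOCK(acid_lock);

    /* Reader: lockless lookup; a dead entry is treated as absent. */
    static bool acid_live(u64 id)
    {
            struct acid_entry *e;
            bool live = false;

            rcu_read_lock();
            list_for_each_entry_rcu(e, &acid_list, list) {
                    if (e->id == id) {
                            live = !READ_ONCE(e->dead);
                            break;
                    }
            }
            rcu_read_unlock();
            return live;
    }

    /* Writer: mark dead under the lock; the entry stays on the list (and
     * the id stays blocked) until the final reference is dropped. */
    static void acid_mark_dead(struct acid_entry *e)
    {
            spin_lock(&acid_lock);
            WRITE_ONCE(e->dead, true);
            spin_unlock(&acid_lock);
    }

    /* Final teardown: unlink, then free after a grace period. */
    static void acid_release(struct acid_entry *e)
    {
            spin_lock(&acid_lock);
            list_del_rcu(&e->list);
            spin_unlock(&acid_lock);
            kfree_rcu(e, rcu);
    }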
On 2020-03-28 23:11, Paul Moore wrote: > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-23 20:16, Paul Moore wrote: > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > > > ... > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > > much rather defer generating the ACID list (if possible), than > > > > > generating a list only to keep copying and editing it as the record is > > > > > sent. > > > > > > > > At the moment we are stuck with a string-only format. > > > > > > Yes, we are. That is another topic, and another set of changes I've > > > been deferring so as to not disrupt the audit container ID work. > > > > > > I was thinking of what we do inside the kernel between when the record > > > triggering event happens and when we actually emit the record to > > > userspace. Perhaps we collect the ACID information while the event is > > > occurring, but we defer generating the record until later when we have > > > a better understanding of what should be included in the ACID list. > > > It is somewhat similar (but obviously different) to what we do for > > > PATH records (we collect the pathname info when the path is being > > > resolved). > > > > Ok, now I understand your concern. > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > > only other possible record and they are generated at the same time with > > a local context. > > > > In the case of any event involving a syscall, that CONTAINER_ID record > > is generated at the time of the rest of the event record generation at > > syscall exit. > > > > The others are only generated when needed, such as the sig2 reply. > > > > We generally just store the contobj pointer until we actually generate > > the CONTAINER_ID (or CONTAINER_OP) record. > > Perhaps I'm remembering your latest spin of these patches incorrectly, > but there is still a big gap between when the record is generated and > when it is sent up to the audit daemon. Most importantly in that gap > is the whole big queue/multicast/unicast mess. So you suggest generating that record on the fly once it reaches the end of the audit_queue just before being sent? That sounds... disruptive. Each audit daemon is going to have its own queues, so by the time it ends up in a particular queue, we'll already know its scope and would have the right list of contids to print in that record. I don't see the point in deferring the generation of the contid list beyond the point of submitting that record to the relevant audit_queue. > You don't need to show me code, but I would like to see some sort of > plan for dealing with multiple nested audit daemons. Basically I just > want to make sure we aren't painting ourselves into a corner with this > approach; and if for some horrible reason we are, I at least want us > to be aware of what we are getting ourselves into. It wouldn't be significantly different from what we have, but as would have to happen for *all* records generated to a particular auditd/queue it would have to take the scope of that auditd into account, getting references to PIDs right for that PID namespace, along with other similar scope views including contid list range. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-28 23:11, Paul Moore wrote: > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-23 20:16, Paul Moore wrote: > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > > > > > ... > > > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > > > much rather defer generating the ACID list (if possible), than > > > > > > generating a list only to keep copying and editing it as the record is > > > > > > sent. > > > > > > > > > > At the moment we are stuck with a string-only format. > > > > > > > > Yes, we are. That is another topic, and another set of changes I've > > > > been deferring so as to not disrupt the audit container ID work. > > > > > > > > I was thinking of what we do inside the kernel between when the record > > > > triggering event happens and when we actually emit the record to > > > > userspace. Perhaps we collect the ACID information while the event is > > > > occurring, but we defer generating the record until later when we have > > > > a better understanding of what should be included in the ACID list. > > > > It is somewhat similar (but obviously different) to what we do for > > > > PATH records (we collect the pathname info when the path is being > > > > resolved). > > > > > > Ok, now I understand your concern. > > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > > > only other possible record and they are generated at the same time with > > > a local context. > > > > > > In the case of any event involving a syscall, that CONTAINER_ID record > > > is generated at the time of the rest of the event record generation at > > > syscall exit. > > > > > > The others are only generated when needed, such as the sig2 reply. > > > > > > We generally just store the contobj pointer until we actually generate > > > the CONTAINER_ID (or CONTAINER_OP) record. > > > > Perhaps I'm remembering your latest spin of these patches incorrectly, > > but there is still a big gap between when the record is generated and > > when it is sent up to the audit daemon. Most importantly in that gap > > is the whole big queue/multicast/unicast mess. > > So you suggest generating that record on the fly once it reaches the end > of the audit_queue just before being sent? That sounds... disruptive. > Each audit daemon is going to have its own queues, so by the time it > ends up in a particular queue, we'll already know its scope and would > have the right list of contids to print in that record. I'm not suggesting any particular solution, I'm just pointing out a potential problem. It isn't clear to me that you've thought about how we generate a multiple records, each with the correct ACID list intended for a specific audit daemon, based on a single audit event. Explain to me how you intend that to work and we are good. Be specific because I'm not convinced we are talking on the same plane here.
On 2020-03-28 23:17, Paul Moore wrote: > On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-20 17:56, Paul Moore wrote: > > > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-18 17:47, Paul Moore wrote: > > > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-03-18 17:01, Paul Moore wrote: > > > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > On 2020-03-13 12:42, Paul Moore wrote: > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a > > > > > > > > > previous suggestion, but one idea would be to still emit a "death > > > > > > > > > record" when the final task in the audit container ID does die, but > > > > > > > > > block the particular audit container ID from reuse until it the > > > > > > > > > SIGNAL2 info has been reported. This gives us the timely ACID death > > > > > > > > > notification while still preventing confusion and ambiguity caused by > > > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent; > > > > > > > > > there is a small nit about the ACID being present in the SIGNAL2 > > > > > > > > > *after* its death, but I think that can be easily explained and > > > > > > > > > understood by admins. > > > > > > > > > > > > > > > > Thinking quickly about possible technical solutions to this, maybe it > > > > > > > > makes sense to have two counters on a contobj so that we know when the > > > > > > > > last process in that container exits and can issue the death > > > > > > > > certificate, but we still block reuse of it until all further references > > > > > > > > to it have been resolved. This will likely also make it possible to > > > > > > > > report the full contid chain in SIGNAL2 records. This will eliminate > > > > > > > > some of the issues we are discussing with regards to passing a contobj > > > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them > > > > > > > > all because there are still some contids that won't have an object > > > > > > > > associated with them to make it impossible to look them up in the > > > > > > > > contobj lists. > > > > > > > > > > > > > > I'm not sure you need a full second counter, I imagine a simple flag > > > > > > > would be okay. I think you just something to indicate that this ACID > > > > > > > object is marked as "dead" but it still being held for sanity reasons > > > > > > > and should not be reused. > > > > > > > > > > > > Ok, I see your point. This refcount can be changed to a flag easily > > > > > > enough without change to the api if we can be sure that more than one > > > > > > signal can't be delivered to the audit daemon *and* collected by sig2. > > > > > > I'll have a more careful look at the audit daemon code to see if I can > > > > > > determine this. > > > > > > > > > > Maybe I'm not understanding your concern, but this isn't really > > > > > different than any of the other things we track for the auditd signal > > > > > sender, right? If we are worried about multiple signals being sent > > > > > then it applies to everything, not just the audit container ID. > > > > > > > > Yes, you are right. In all other cases the information is simply > > > > overwritten. In the case of the audit container identifier any > > > > previous value is put before a new one is referenced, so only the last > > > > signal is kept. 
So, we only need a flag. Does a flag implemented with > > > > a rcu-protected refcount sound reasonable to you? > > > > > > Well, if I recall correctly you still need to fix the locking in this > > > patchset so until we see what that looks like it is hard to say for > > > certain. Just make sure that the flag is somehow protected from > > > races; it is probably a lot like the "valid" flags you sometimes see > > > with RCU protected lists. > > > > This is like looking for a needle in a haystack. Can you point me to > > some code that does "valid" flags with RCU protected lists. > > Sigh. Come on Richard, you've been playing in the kernel for some > time now. I can't think of one off the top of my head as I write > this, but there are several resources that deal with RCU protected > lists in the kernel, Google is your friend and Documentation/RCU is > your friend. Ok, I thought you were talking about a specific piece of code... > Spending time to learn how RCU works and how to use it properly is not > time wasted. It's a tricky thing to get right (I have to refresh my > memory on some of the more subtle details each time I write/review RCU > code), but it's very cool when done correctly. I review Documentation/RCU almost every time I work on RCU... > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On 2020-03-30 10:26, Paul Moore wrote: > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-28 23:11, Paul Moore wrote: > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-23 20:16, Paul Moore wrote: > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > > > > > > > ... > > > > > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > > > > much rather defer generating the ACID list (if possible), than > > > > > > > generating a list only to keep copying and editing it as the record is > > > > > > > sent. > > > > > > > > > > > > At the moment we are stuck with a string-only format. > > > > > > > > > > Yes, we are. That is another topic, and another set of changes I've > > > > > been deferring so as to not disrupt the audit container ID work. > > > > > > > > > > I was thinking of what we do inside the kernel between when the record > > > > > triggering event happens and when we actually emit the record to > > > > > userspace. Perhaps we collect the ACID information while the event is > > > > > occurring, but we defer generating the record until later when we have > > > > > a better understanding of what should be included in the ACID list. > > > > > It is somewhat similar (but obviously different) to what we do for > > > > > PATH records (we collect the pathname info when the path is being > > > > > resolved). > > > > > > > > Ok, now I understand your concern. > > > > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > > > > only other possible record and they are generated at the same time with > > > > a local context. > > > > > > > > In the case of any event involving a syscall, that CONTAINER_ID record > > > > is generated at the time of the rest of the event record generation at > > > > syscall exit. > > > > > > > > The others are only generated when needed, such as the sig2 reply. > > > > > > > > We generally just store the contobj pointer until we actually generate > > > > the CONTAINER_ID (or CONTAINER_OP) record. > > > > > > Perhaps I'm remembering your latest spin of these patches incorrectly, > > > but there is still a big gap between when the record is generated and > > > when it is sent up to the audit daemon. Most importantly in that gap > > > is the whole big queue/multicast/unicast mess. > > > > So you suggest generating that record on the fly once it reaches the end > > of the audit_queue just before being sent? That sounds... disruptive. > > Each audit daemon is going to have its own queues, so by the time it > > ends up in a particular queue, we'll already know its scope and would > > have the right list of contids to print in that record. > > I'm not suggesting any particular solution, I'm just pointing out a > potential problem. It isn't clear to me that you've thought about how > we generate a multiple records, each with the correct ACID list > intended for a specific audit daemon, based on a single audit event. > Explain to me how you intend that to work and we are good. Be > specific because I'm not convinced we are talking on the same plane > here. Well, every time a record gets generated, *any* record gets generated, we'll need to check for which audit daemons this record is in scope and generate a different one for each depending on the content and whether or not the content is influenced by the scope. 
Some events will be generated for some of the auditd/queues and not for others. Some fields in some of the records will need to be tailored for that specific auditd/queue for contid scope, PID namespace base reference, or other scope differences. Every auditd/queue will need its own serial number per event, and maybe even its own timestamp if that auditd is in a different time namespace; beyond that, PID and contid fields, and maybe others, will need to be customized per auditd/queue. So it may make sense to generate the contents of each field for a generic record once, reuse whatever content is unchanged, generate new content for any field that differs in a given auditd/queue scope, and then render the final record per auditd/queue and enqueue it. I see this as the primary work of ghak93 ("RFE: run multiple audit daemons on one machine"). I don't see how our proposed contid field value format changes with this path above. This is getting closer and closer to a netlink binary format too... This is also an argument for spreading fields out over more record types rather than cramming as much information as we can into one record type (subject attributes in particular). > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
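A sketch of that generate-once, render-per-daemon flow; every name here is hypothetical scaffolding (auditd_state, audit_contid_in_scope and AUDIT_CID_UNSET follow the earlier sketches in this thread), not code from the patchset:

    /* One event, collected once; immutable pieces are shared, per-daemon
     * pieces (serial, PID translation, trimmed contid chain) are produced
     * inside the loop. */
    struct audit_pending_event {
            struct audit_context    *ctx;           /* common event data */
            const u64               *chain;         /* contid chain, innermost first */
            int                     chain_len;
    };

    static void audit_dispatch_event(struct audit_pending_event *ev)
    {
            struct auditd_state *d;

            list_for_each_entry(d, &auditd_list, list) {    /* hypothetical list */
                    if (!audit_contid_in_scope(ev->chain, ev->chain_len,
                                               d->cont ? d->cont->id
                                                       : AUDIT_CID_UNSET))
                            continue;
                    /* Reuse the shared fields as-is; regenerate only the
                     * daemon-specific ones, then hand the finished record to
                     * this daemon's queue. */
                    audit_render_and_enqueue(d, ev);        /* hypothetical */
            }
    }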
On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-30 10:26, Paul Moore wrote: > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-28 23:11, Paul Moore wrote: > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-23 20:16, Paul Moore wrote: > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > > > > > > > > > ... > > > > > > > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > > > > > much rather defer generating the ACID list (if possible), than > > > > > > > > generating a list only to keep copying and editing it as the record is > > > > > > > > sent. > > > > > > > > > > > > > > At the moment we are stuck with a string-only format. > > > > > > > > > > > > Yes, we are. That is another topic, and another set of changes I've > > > > > > been deferring so as to not disrupt the audit container ID work. > > > > > > > > > > > > I was thinking of what we do inside the kernel between when the record > > > > > > triggering event happens and when we actually emit the record to > > > > > > userspace. Perhaps we collect the ACID information while the event is > > > > > > occurring, but we defer generating the record until later when we have > > > > > > a better understanding of what should be included in the ACID list. > > > > > > It is somewhat similar (but obviously different) to what we do for > > > > > > PATH records (we collect the pathname info when the path is being > > > > > > resolved). > > > > > > > > > > Ok, now I understand your concern. > > > > > > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > > > > > only other possible record and they are generated at the same time with > > > > > a local context. > > > > > > > > > > In the case of any event involving a syscall, that CONTAINER_ID record > > > > > is generated at the time of the rest of the event record generation at > > > > > syscall exit. > > > > > > > > > > The others are only generated when needed, such as the sig2 reply. > > > > > > > > > > We generally just store the contobj pointer until we actually generate > > > > > the CONTAINER_ID (or CONTAINER_OP) record. > > > > > > > > Perhaps I'm remembering your latest spin of these patches incorrectly, > > > > but there is still a big gap between when the record is generated and > > > > when it is sent up to the audit daemon. Most importantly in that gap > > > > is the whole big queue/multicast/unicast mess. > > > > > > So you suggest generating that record on the fly once it reaches the end > > > of the audit_queue just before being sent? That sounds... disruptive. > > > Each audit daemon is going to have its own queues, so by the time it > > > ends up in a particular queue, we'll already know its scope and would > > > have the right list of contids to print in that record. > > > > I'm not suggesting any particular solution, I'm just pointing out a > > potential problem. It isn't clear to me that you've thought about how > > we generate a multiple records, each with the correct ACID list > > intended for a specific audit daemon, based on a single audit event. > > Explain to me how you intend that to work and we are good. Be > > specific because I'm not convinced we are talking on the same plane > > here. 
> > Well, every time a record gets generated, *any* record gets generated, > we'll need to check for which audit daemons this record is in scope and > generate a different one for each depending on the content and whether > or not the content is influenced by the scope. That's the problem right there - we don't want to have to generate a unique record for *each* auditd on *every* record. That is a recipe for disaster. Solving this for all of the known audit records is not something we need to worry about in depth at the moment (although giving it some casual thought is not a bad thing), but solving this for the audit container ID information *is* something we need to worry about right now.
On 2020-03-30 13:34, Paul Moore wrote: > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2020-03-30 10:26, Paul Moore wrote: > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2020-03-28 23:11, Paul Moore wrote: > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > I hope we can do better than string manipulations in the kernel. I'd > > > > > > > > > much rather defer generating the ACID list (if possible), than > > > > > > > > > generating a list only to keep copying and editing it as the record is > > > > > > > > > sent. > > > > > > > > > > > > > > > > At the moment we are stuck with a string-only format. > > > > > > > > > > > > > > Yes, we are. That is another topic, and another set of changes I've > > > > > > > been deferring so as to not disrupt the audit container ID work. > > > > > > > > > > > > > > I was thinking of what we do inside the kernel between when the record > > > > > > > triggering event happens and when we actually emit the record to > > > > > > > userspace. Perhaps we collect the ACID information while the event is > > > > > > > occurring, but we defer generating the record until later when we have > > > > > > > a better understanding of what should be included in the ACID list. > > > > > > > It is somewhat similar (but obviously different) to what we do for > > > > > > > PATH records (we collect the pathname info when the path is being > > > > > > > resolved). > > > > > > > > > > > > Ok, now I understand your concern. > > > > > > > > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the > > > > > > only other possible record and they are generated at the same time with > > > > > > a local context. > > > > > > > > > > > > In the case of any event involving a syscall, that CONTAINER_ID record > > > > > > is generated at the time of the rest of the event record generation at > > > > > > syscall exit. > > > > > > > > > > > > The others are only generated when needed, such as the sig2 reply. > > > > > > > > > > > > We generally just store the contobj pointer until we actually generate > > > > > > the CONTAINER_ID (or CONTAINER_OP) record. > > > > > > > > > > Perhaps I'm remembering your latest spin of these patches incorrectly, > > > > > but there is still a big gap between when the record is generated and > > > > > when it is sent up to the audit daemon. Most importantly in that gap > > > > > is the whole big queue/multicast/unicast mess. > > > > > > > > So you suggest generating that record on the fly once it reaches the end > > > > of the audit_queue just before being sent? That sounds... disruptive. > > > > Each audit daemon is going to have its own queues, so by the time it > > > > ends up in a particular queue, we'll already know its scope and would > > > > have the right list of contids to print in that record. > > > > > > I'm not suggesting any particular solution, I'm just pointing out a > > > potential problem. It isn't clear to me that you've thought about how > > > we generate a multiple records, each with the correct ACID list > > > intended for a specific audit daemon, based on a single audit event. > > > Explain to me how you intend that to work and we are good. 
Be > > > specific because I'm not convinced we are talking on the same plane > > > here. > > > > Well, every time a record gets generated, *any* record gets generated, > > we'll need to check for which audit daemons this record is in scope and > > generate a different one for each depending on the content and whether > > or not the content is influenced by the scope. > > That's the problem right there - we don't want to have to generate a > unique record for *each* auditd on *every* record. That is a recipe > for disaster. I don't see how we can get around this. We will already have that problem for PIDs in different PID namespaces. We already need to use a different serial number in each auditd/queue, or else we serialize *all* audit events on the machine and either leak information to the nested daemons that there are other events happening on the machine, or confuse the host daemon, which would then think it is losing events because serial numbers are missing whenever a nested daemon consumes a global audit message sequence number for an event that is not relevant to the host daemon. > Solving this for all of the known audit records is not something we > need to worry about in depth at the moment (although giving it some > casual thought is not a bad thing), but solving this for the audit > container ID information *is* something we need to worry about right > now. If you think that a different nested contid value string per daemon is not acceptable, then we are back to issuing a record that has only *one* contid listed without any nesting information. This brings us back to the original problem of keeping *all* audit log history since the boot of the machine to be able to track the nesting of any particular contid. What am I missing? What do you suggest? > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
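The per-daemon serial number point can be made concrete with a small model: give each auditd instance its own sequence counter, so an event that is only in scope for one daemon never creates a gap in another daemon's numbering. This is a hypothetical userspace sketch, not kernel code, and none of these names exist in the audit subsystem.

/* Hypothetical model of per-daemon audit serial numbers: each auditd
 * instance draws from its own counter, so a record that is only in
 * scope for a nested daemon does not consume a sequence number that
 * the host daemon would notice as missing.  Not actual kernel code. */
#include <stdio.h>

struct auditd_instance {
	const char *name;
	unsigned long serial;	/* next sequence number for this daemon */
};

static unsigned long next_serial(struct auditd_instance *d)
{
	return ++d->serial;
}

int main(void)
{
	struct auditd_instance host = { "host", 0 };
	struct auditd_instance nested = { "nested", 0 };

	/* An event only in scope for the nested daemon... */
	printf("%s gets serial %lu\n", nested.name, next_serial(&nested));
	/* ...does not create a gap in the host daemon's sequence. */
	printf("%s gets serial %lu\n", host.name, next_serial(&host));
	printf("%s gets serial %lu\n", host.name, next_serial(&host));
	return 0;
}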
On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-03-30 13:34, Paul Moore wrote: > > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2020-03-30 10:26, Paul Moore wrote: > > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2020-03-28 23:11, Paul Moore wrote: > > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: ... > > > Well, every time a record gets generated, *any* record gets generated, > > > we'll need to check for which audit daemons this record is in scope and > > > generate a different one for each depending on the content and whether > > > or not the content is influenced by the scope. > > > > That's the problem right there - we don't want to have to generate a > > unique record for *each* auditd on *every* record. That is a recipe > > for disaster. > > I don't see how we can get around this. > > We will already have that problem for PIDs in different PID namespaces. As I said below, let's not worry about this for all of the known/current audit records, lets just think about how we solve this for the ACID related information. One of the bigger problems with translating namespace info (e.g. PIDs) across ACIDs is that an ACID - by definition - has no understanding of namespaces (both the concept as well as any given instance). > We already need to use a different serial number in each auditd/queue, > or else we serialize *all* audit events on the machine and either leak > information to the nested daemons that there are other events happenning > on the machine, or confuse the host daemon because it now thinks that we > are losing events due to serial numbers missing because some nested > daemon issued an event that was not relevant to the host daemon, > consuming a globally serial audit message sequence number. This isn't really relevant to the ACID lists, but sure. > > Solving this for all of the known audit records is not something we > > need to worry about in depth at the moment (although giving it some > > casual thought is not a bad thing), but solving this for the audit > > container ID information *is* something we need to worry about right > > now. > > If you think that a different nested contid value string per daemon is > not acceptable, then we are back to issuing a record that has only *one* > contid listed without any nesting information. This brings us back to > the original problem of keeping *all* audit log history since the boot > of the machine to be able to track the nesting of any particular contid. I'm not ruling anything out, except for the "let's just completely regenerate every record for each auditd instance". > What am I missing? What do you suggest? I'm missing a solution in this thread, since you are the person driving this effort I'm asking you to get creative and present us with some solutions. :)
Paul Moore <paul@paul-moore.com> writes: > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> On 2020-03-30 13:34, Paul Moore wrote: >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> > > On 2020-03-30 10:26, Paul Moore wrote: >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: >> > > > > On 2020-03-28 23:11, Paul Moore wrote: >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > ... > >> > > Well, every time a record gets generated, *any* record gets generated, >> > > we'll need to check for which audit daemons this record is in scope and >> > > generate a different one for each depending on the content and whether >> > > or not the content is influenced by the scope. >> > >> > That's the problem right there - we don't want to have to generate a >> > unique record for *each* auditd on *every* record. That is a recipe >> > for disaster. >> > >> > Solving this for all of the known audit records is not something we >> > need to worry about in depth at the moment (although giving it some >> > casual thought is not a bad thing), but solving this for the audit >> > container ID information *is* something we need to worry about right >> > now. >> >> If you think that a different nested contid value string per daemon is >> not acceptable, then we are back to issuing a record that has only *one* >> contid listed without any nesting information. This brings us back to >> the original problem of keeping *all* audit log history since the boot >> of the machine to be able to track the nesting of any particular contid. > > I'm not ruling anything out, except for the "let's just completely > regenerate every record for each auditd instance". Paul, I am a bit confused about what you are referring to when you say regenerate every record. Are you saying that you don't want to repeat the sequence: audit_log_start(...); audit_log_format(...); audit_log_end(...); for every nested audit daemon? Or are you saying that you would literally like to send the same skb to each of the nested audit daemons? Or are you thinking of something else? Eric
On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Paul Moore <paul@paul-moore.com> writes: > > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> On 2020-03-30 13:34, Paul Moore wrote: > >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> > > On 2020-03-30 10:26, Paul Moore wrote: > >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > >> > > > > On 2020-03-28 23:11, Paul Moore wrote: > >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > > ... > > > >> > > Well, every time a record gets generated, *any* record gets generated, > >> > > we'll need to check for which audit daemons this record is in scope and > >> > > generate a different one for each depending on the content and whether > >> > > or not the content is influenced by the scope. > >> > > >> > That's the problem right there - we don't want to have to generate a > >> > unique record for *each* auditd on *every* record. That is a recipe > >> > for disaster. > >> > > >> > Solving this for all of the known audit records is not something we > >> > need to worry about in depth at the moment (although giving it some > >> > casual thought is not a bad thing), but solving this for the audit > >> > container ID information *is* something we need to worry about right > >> > now. > >> > >> If you think that a different nested contid value string per daemon is > >> not acceptable, then we are back to issuing a record that has only *one* > >> contid listed without any nesting information. This brings us back to > >> the original problem of keeping *all* audit log history since the boot > >> of the machine to be able to track the nesting of any particular contid. > > > > I'm not ruling anything out, except for the "let's just completely > > regenerate every record for each auditd instance". > > Paul I am a bit confused about what you are referring to when you say > regenerate every record. > > Are you saying that you don't want to repeat the sequence: > audit_log_start(...); > audit_log_format(...); > audit_log_end(...); > for every nested audit daemon? If it can be avoided yes. Audit performance is already not-awesome, this would make it even worse. > Or are you saying that you would like to literraly want to send the same > skb to each of the nested audit daemons? Ideally we would reuse the generated audit messages as much as possible. Less work is better. That's really my main concern here, let's make sure we aren't going to totally tank performance when we have a bunch of nested audit daemons. > Or are you thinking of something else? As mentioned above, I'm not thinking of anything specific, other than let's please not have to regenerate *all* of the audit record strings for each instance of an audit daemon, that's going to be a killer. Maybe we have to regenerate some, if we do, what would that look like in code? How do we handle the regeneration aspect? I worry that is going to be really ugly. Maybe we finally burn down the audit_log_format(...) function and pass structs/TLVs to the audit subsystem and the audit subsystem generates the strings in the auditd connection thread. 
Some of the record strings could likely be shared, others would need to be ACID/auditd dependent. I'm open to any ideas people may have. We have a problem, let's solve it.
Paul Moore <paul@paul-moore.com> writes: > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: >> Paul Moore <paul@paul-moore.com> writes: >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> >> On 2020-03-30 13:34, Paul Moore wrote: >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> >> > > On 2020-03-30 10:26, Paul Moore wrote: >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote: >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: >> > >> > ... >> > >> >> > > Well, every time a record gets generated, *any* record gets generated, >> >> > > we'll need to check for which audit daemons this record is in scope and >> >> > > generate a different one for each depending on the content and whether >> >> > > or not the content is influenced by the scope. >> >> > >> >> > That's the problem right there - we don't want to have to generate a >> >> > unique record for *each* auditd on *every* record. That is a recipe >> >> > for disaster. >> >> > >> >> > Solving this for all of the known audit records is not something we >> >> > need to worry about in depth at the moment (although giving it some >> >> > casual thought is not a bad thing), but solving this for the audit >> >> > container ID information *is* something we need to worry about right >> >> > now. >> >> >> >> If you think that a different nested contid value string per daemon is >> >> not acceptable, then we are back to issuing a record that has only *one* >> >> contid listed without any nesting information. This brings us back to >> >> the original problem of keeping *all* audit log history since the boot >> >> of the machine to be able to track the nesting of any particular contid. >> > >> > I'm not ruling anything out, except for the "let's just completely >> > regenerate every record for each auditd instance". >> >> Paul I am a bit confused about what you are referring to when you say >> regenerate every record. >> >> Are you saying that you don't want to repeat the sequence: >> audit_log_start(...); >> audit_log_format(...); >> audit_log_end(...); >> for every nested audit daemon? > > If it can be avoided yes. Audit performance is already not-awesome, > this would make it even worse. As far as I can see not repeating sequences like that is fundamental for making this work at all. Just because only the audit subsystem should know about one or multiple audit daemons. Nothing else should care. >> Or are you saying that you would like to literraly want to send the same >> skb to each of the nested audit daemons? > > Ideally we would reuse the generated audit messages as much as > possible. Less work is better. That's really my main concern here, > let's make sure we aren't going to totally tank performance when we > have a bunch of nested audit daemons. So I think there are two parts of this answer. Assuming we are talking about nesting audit daemons in containers we will have different rulesets and I expect most of the events for a nested audit daemon won't be of interest to the outer audit daemon. 
Beyond that it should be very straightforward to keep a pointer and leave the buffer as a scatter-gather list until audit_log_end and translate pids, and rewrite ACID attributes in audit_log_end when we build the final packet, either through collaboration with audit_log_format or a special audit_log command that carefully sets up the handful of things that need that information. Hmm. I am seeing that we send skbs to kauditd and then kauditd sends those skbs to userspace. I presume that is primarily so that sending messages to userspace does not block the process being audited. Plus a little bit so that the retry logic will work. I think the naive implementation would be to simply have 1 kauditd per auditd (strictly, one per audit context/namespace). Although that can be optimized if that is a problem. Beyond that I think we would need to look at profiles to really understand where the bottlenecks are. >> Or are you thinking of something else? > > As mentioned above, I'm not thinking of anything specific, other than > let's please not have to regenerate *all* of the audit record strings > for each instance of an audit daemon, that's going to be a killer. > > Maybe we have to regenerate some, if we do, what would that look like > in code? How do we handle the regeneration aspect? I worry that is > going to be really ugly. > > Maybe we finally burn down the audit_log_format(...) function and pass > structs/TLVs to the audit subsystem and the audit subsystem generates > the strings in the auditd connection thread. Some of the record > strings could likely be shared, others would need to be ACID/auditd > dependent. I think we need just a very limited set of structs/TLVs for the cases that matter, and with one-to-one auditd and kauditd implementations we should still be able to do everything in audit_log_end. Plus doing as much work as possible in audit_log_end where things are still cache hot is desirable. > I'm open to any ideas people may have. We have a problem, let's solve > it. It definitely makes sense to look ahead to having audit daemons running in containers, but in the grand scheme of things that is a nice to have. Probably something we will and should get to, but we have lived a long time without auditd running in containers so I expect we can live a while longer. As I understand Richard's patchset, for the specific case of the ACID we are only talking about taking a subset of an existing string, and one string at that. Not hard at all. Especially when looking at the fundamental fact that we will need to send a different skb to userspace, for each audit daemon. Eric
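The idea of keeping the record as a list of segments and only rewriting the scope-dependent pieces when the final packet is built can be pictured with a small userspace model like the one below. struct audit_seg, render_for and the nesting notation are all made up for illustration; they are not part of the kernel or of this patchset.

/* Userspace model of a record kept as segments until send time: fixed
 * text is shared by every daemon, and only the deferred contid segment
 * is rendered per destination.  Illustrative only, not kernel code. */
#include <stdio.h>
#include <stdint.h>

enum seg_type { SEG_TEXT, SEG_CONTID };

struct audit_seg {
	enum seg_type type;
	const char *text;	/* valid for SEG_TEXT */
	uint64_t contid;	/* valid for SEG_CONTID */
};

/* Pretend per-daemon scoping: the host sees a nested identifier list
 * (made-up "^" notation), a nested daemon sees only its own scope. */
static void render_contid(char *out, size_t len, uint64_t contid, int nested)
{
	if (nested)
		snprintf(out, len, "contid=%llu",
			 (unsigned long long)contid);
	else
		snprintf(out, len, "contid=%llu^1234",
			 (unsigned long long)contid);
}

static void render_for(const struct audit_seg *segs, int n, int nested)
{
	char buf[64];
	int i;

	for (i = 0; i < n; i++) {
		if (segs[i].type == SEG_TEXT) {
			fputs(segs[i].text, stdout);
		} else {
			render_contid(buf, sizeof(buf), segs[i].contid, nested);
			fputs(buf, stdout);
		}
	}
	putchar('\n');
}

int main(void)
{
	/* One event, built once; only the contid segment differs. */
	struct audit_seg rec[] = {
		{ SEG_TEXT, "type=CONTAINER_ID msg=audit(...): ", 0 },
		{ SEG_CONTID, NULL, 5678 },
	};

	render_for(rec, 2, 0);	/* host daemon's view   */
	render_for(rec, 2, 1);	/* nested daemon's view */
	return 0;
}

The fixed text is generated once at event time; only the deferred segment is formatted per destination, which is the performance property being argued for in the thread.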
On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > Paul Moore <paul@paul-moore.com> writes: > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> On 2020-03-30 13:34, Paul Moore wrote: > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > On 2020-03-30 10:26, Paul Moore wrote: > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote: > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > >> > > >> > ... > >> > > >> >> > > Well, every time a record gets generated, *any* record gets generated, > >> >> > > we'll need to check for which audit daemons this record is in scope and > >> >> > > generate a different one for each depending on the content and whether > >> >> > > or not the content is influenced by the scope. > >> >> > > >> >> > That's the problem right there - we don't want to have to generate a > >> >> > unique record for *each* auditd on *every* record. That is a recipe > >> >> > for disaster. > >> >> > > >> >> > Solving this for all of the known audit records is not something we > >> >> > need to worry about in depth at the moment (although giving it some > >> >> > casual thought is not a bad thing), but solving this for the audit > >> >> > container ID information *is* something we need to worry about right > >> >> > now. > >> >> > >> >> If you think that a different nested contid value string per daemon is > >> >> not acceptable, then we are back to issuing a record that has only *one* > >> >> contid listed without any nesting information. This brings us back to > >> >> the original problem of keeping *all* audit log history since the boot > >> >> of the machine to be able to track the nesting of any particular contid. > >> > > >> > I'm not ruling anything out, except for the "let's just completely > >> > regenerate every record for each auditd instance". > >> > >> Paul I am a bit confused about what you are referring to when you say > >> regenerate every record. > >> > >> Are you saying that you don't want to repeat the sequence: > >> audit_log_start(...); > >> audit_log_format(...); > >> audit_log_end(...); > >> for every nested audit daemon? > > > > If it can be avoided yes. Audit performance is already not-awesome, > > this would make it even worse. > > As far as I can see not repeating sequences like that is fundamental > for making this work at all. Just because only the audit subsystem > should know about one or multiple audit daemons. Nothing else should > care. Yes, exactly, this has been mentioned in the past. Both the performance hit and the code complication in the caller are things we must avoid. > >> Or are you saying that you would like to literraly want to send the same > >> skb to each of the nested audit daemons? > > > > Ideally we would reuse the generated audit messages as much as > > possible. Less work is better. That's really my main concern here, > > let's make sure we aren't going to totally tank performance when we > > have a bunch of nested audit daemons. 
> > So I think there are two parts of this answer. Assuming we are talking > about nesting audit daemons in containers we will have different > rulesets and I expect most of the events for a nested audit daemon won't > be of interest to the outer audit daemon. Yes, this is another thing that Richard and I have discussed in the past. We will basically need to create per-daemon queues, rules, tracking state, etc.; that is easy enough. What will be slightly more tricky is the part where we apply the filters to the individual records and decide if that record is valid/desired for a given daemon. I think it can be done without too much pain, and without any changes to the callers, but it will require a bit of work to make sure it is done well and that records are not needlessly duplicated in the kernel. > Beyond that it should be very straight forward to keep a pointer and > leave the buffer as a scatter gather list until audit_log_end > and translate pids, and rewrite ACIDs attributes in audit_log_end > when we build the final packet. Either through collaboration with > audit_log_format or a special audit_log command that carefully sets > up the handful of things that need that information. In order to maximize record re-use I think we will want to hold off on assembling the final packet until it is sent to the daemons in the kauditd thread. We'll also likely need to create special audit_log_XXX functions to capture fields which we know will need translation, e.g. ACID information. (the reason for the new audit_log_XXX functions would be to mark the new sg element and ensure the buffer is handled correctly) Regardless of the details, I think the scatter-gather approach is the key here - that seems like the best design idea I've seen thus far. It enables us to replace portions of the record as needed ... and possibly use the existing skb cow stuff ... it has been a while, but do the skb cow functions handle scatter-gather skbs or do they need to be linear? > Hmm. I am seeing that we send skbs to kauditd and then kauditd > sends those skbs to userspace. I presume that is primary so that > sending messages to userspace does not block the process being audited. Long story short, it's a poor design. I'm not sure who came up with it, but I have about 1000 questions that are variations on "why did this seem like a good idea?". I expect the audit_buffer definition to change significantly during the nested auditd work. > I think the naive implementation would be to simply have 1 kauditd > per auditd (strictly and audit context/namespace). Although that can be > optimized if that is a problem. > > Beyond that I think we would need to look at profiles to really > understand where the bottlenecks are. Agreed. This is a hidden implementation detail that doesn't affect the userspace API or the in-kernel callers. The first approach can be simple and we can complicate it as needed in future versions. > > I'm open to any ideas people may have. We have a problem, let's solve > > it. > > It definitely makes sense to look ahead to having audit daemons running > in containers, but in the grand scheme of things that is a nice to have. > Probably something we will and should get to, but we have lived a long > time without auditd running in containers so I expect we can live a > while longer. It looks like you are confusing my concern.
I'm not pushing Richard to implement support for this in the current patchset, I'm pushing Richard to consider the design aspect of having multiple audit daemons so that we don't code ourselves into a corner with the audit record changes he is proposing. The audit record format is part of the kernel/userspace API and as a result requires great care when modifying/extending/etc.
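The per-daemon queue and filtering idea discussed in this message could be modelled roughly as follows: each registered daemon keeps its own rules and queue, and a shared record is only referenced from the queues whose filter matches. All of the names here (auditd_instance, record_in_scope, dispatch) are hypothetical and exist only for this sketch.

/* Userspace model of per-daemon dispatch: one record, several daemons,
 * each with its own filter and queue; the record text is built once
 * and only referenced from the queues that want it.  Not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define MAX_QUEUE 8

struct auditd_instance {
	const char *name;
	uint64_t scope_contid;		/* 0 = host, sees everything  */
	const char *queue[MAX_QUEUE];	/* stand-in for a real skb queue */
	int qlen;
};

struct audit_record {
	uint64_t contid;		/* container the event came from */
	const char *text;		/* shared, built once */
};

/* Toy filter: the host daemon sees everything, a nested daemon only
 * sees events from its own container scope. */
static int record_in_scope(const struct auditd_instance *d,
			   const struct audit_record *r)
{
	return d->scope_contid == 0 || d->scope_contid == r->contid;
}

static void dispatch(struct auditd_instance *daemons, int n,
		     const struct audit_record *r)
{
	int i;

	for (i = 0; i < n; i++)
		if (record_in_scope(&daemons[i], r) &&
		    daemons[i].qlen < MAX_QUEUE)
			daemons[i].queue[daemons[i].qlen++] = r->text;
}

int main(void)
{
	struct auditd_instance daemons[] = {
		{ "host",   0,    { 0 }, 0 },
		{ "nested", 5678, { 0 }, 0 },
	};
	struct audit_record r1 = { 5678, "type=SYSCALL ... contid=5678" };
	struct audit_record r2 = { 9999, "type=SYSCALL ... contid=9999" };
	int i;

	dispatch(daemons, 2, &r1);
	dispatch(daemons, 2, &r2);

	for (i = 0; i < 2; i++)
		printf("%s queued %d record(s)\n",
		       daemons[i].name, daemons[i].qlen);
	return 0;
}

The key property is that r->text is built once and merely referenced from each matching queue; only genuinely scope-dependent pieces would ever need per-daemon work.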
On 2020-04-22 13:24, Paul Moore wrote: > On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > Paul Moore <paul@paul-moore.com> writes: > > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > >> Paul Moore <paul@paul-moore.com> writes: > > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > >> >> On 2020-03-30 13:34, Paul Moore wrote: > > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > >> >> > > On 2020-03-30 10:26, Paul Moore wrote: > > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote: > > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > >> > > > >> > ... > > >> > > > >> >> > > Well, every time a record gets generated, *any* record gets generated, > > >> >> > > we'll need to check for which audit daemons this record is in scope and > > >> >> > > generate a different one for each depending on the content and whether > > >> >> > > or not the content is influenced by the scope. > > >> >> > > > >> >> > That's the problem right there - we don't want to have to generate a > > >> >> > unique record for *each* auditd on *every* record. That is a recipe > > >> >> > for disaster. > > >> >> > > > >> >> > Solving this for all of the known audit records is not something we > > >> >> > need to worry about in depth at the moment (although giving it some > > >> >> > casual thought is not a bad thing), but solving this for the audit > > >> >> > container ID information *is* something we need to worry about right > > >> >> > now. > > >> >> > > >> >> If you think that a different nested contid value string per daemon is > > >> >> not acceptable, then we are back to issuing a record that has only *one* > > >> >> contid listed without any nesting information. This brings us back to > > >> >> the original problem of keeping *all* audit log history since the boot > > >> >> of the machine to be able to track the nesting of any particular contid. > > >> > > > >> > I'm not ruling anything out, except for the "let's just completely > > >> > regenerate every record for each auditd instance". > > >> > > >> Paul I am a bit confused about what you are referring to when you say > > >> regenerate every record. > > >> > > >> Are you saying that you don't want to repeat the sequence: > > >> audit_log_start(...); > > >> audit_log_format(...); > > >> audit_log_end(...); > > >> for every nested audit daemon? > > > > > > If it can be avoided yes. Audit performance is already not-awesome, > > > this would make it even worse. > > > > As far as I can see not repeating sequences like that is fundamental > > for making this work at all. Just because only the audit subsystem > > should know about one or multiple audit daemons. Nothing else should > > care. > > Yes, exactly, this has been mentioned in the past. Both the > performance hit and the code complication in the caller are things we > must avoid. > > > >> Or are you saying that you would like to literraly want to send the same > > >> skb to each of the nested audit daemons? > > > > > > Ideally we would reuse the generated audit messages as much as > > > possible. 
Less work is better. That's really my main concern here, > > > let's make sure we aren't going to totally tank performance when we > > > have a bunch of nested audit daemons. > > > > So I think there are two parts of this answer. Assuming we are talking > > about nesting audit daemons in containers we will have different > > rulesets and I expect most of the events for a nested audit daemon won't > > be of interest to the outer audit daemon. > > Yes, this is another thing that Richard and I have discussed in the > past. We will basically need to create per-daemon queues, rules, > tracking state, etc.; that is easy enough. What will be slightly more > tricky is the part where we apply the filters to the individual > records and decide if that record is valid/desired for a given daemon. > I think it can be done without too much pain, and any changes to the > callers, but it will require a bit of work to make sure it is done > well and that records are needlessly duplicated in the kernel. > > > Beyond that it should be very straight forward to keep a pointer and > > leave the buffer as a scatter gather list until audit_log_end > > and translate pids, and rewrite ACIDs attributes in audit_log_end > > when we build the final packet. Either through collaboration with > > audit_log_format or a special audit_log command that carefully sets > > up the handful of things that need that information. > > In order to maximize record re-use I think we will want to hold off on > assembling the final packet until it is sent to the daemons in the > kauditd thread. We'll also likely need to create special > audit_log_XXX functions to capture fields which we know will need > translation, e.g. ACID information. (the reason for the new > audit_log_XXX functions would be to mark the new sg element and ensure > the buffer is handled correctly) > > Regardless of the details, I think the scatter gather approach is the > key here - that seems like the best design idea I've seen thus far. > It enables us to replace portions of the record as needed ... and > possibly use the existing skb cow stuff ... it has been a while, but > does the skb cow functions handle scatter gather skbs or do they need > to be linear? How does the selection of this data management technique affect our choice of field format? Does this lock the field value to a fixed length? Does the use of scatter/gather techniques or structures allow the use of different lengths of data for each destination (auditd)? I could see different target audit daemons triggering or switching to a different chunk of data and length. This does raise a concern related to the previous sig_info2 discussion that the struct contobj that exists at the time audit_log_exit is called could have been reaped by the time the buffer is pulled from the queue for transmission to auditd, but we could hold a reference to it as is done for sig_info2. Looking through the kernel scatter/gather possibilities, I see struct iovec which is used by the readv/writev/preadv/pwritev syscalls, but I'm understanding that this is a kernel implementation that will not be visible to user space. So would the struct scatterlist be the right choice? > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
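Holding a reference on the container object between the time the record is queued and the time it is finally sent, as suggested above for sig_info2, could be modelled along these lines. The refcounting here is a deliberately simplified, single-threaded stand-in (the kernel object would use refcount_t and proper locking), and the function names are invented for the sketch.

/* Simplified model of pinning an audit container ID object while a
 * queued record still points at it, so the identifier cannot be reaped
 * and reused before the record reaches auditd.  Not kernel code. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct audit_contobj {
	uint64_t id;
	int refcount;
};

static struct audit_contobj *contobj_get(struct audit_contobj *obj)
{
	obj->refcount++;
	return obj;
}

static void contobj_put(struct audit_contobj *obj)
{
	if (--obj->refcount == 0) {
		printf("contid %llu released, may be reused\n",
		       (unsigned long long)obj->id);
		free(obj);
	}
}

int main(void)
{
	struct audit_contobj *obj = malloc(sizeof(*obj));
	struct audit_contobj *queued;

	obj->id = 5678;
	obj->refcount = 1;		/* reference held by the container  */

	queued = contobj_get(obj);	/* record queued: pin the object    */
	contobj_put(obj);		/* last task in the container exits */
	/* ... later, the record is finally sent to auditd ... */
	printf("sending record for contid %llu\n",
	       (unsigned long long)queued->id);
	contobj_put(queued);		/* record delivered: drop the pin   */
	return 0;
}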
On Mon, Jun 8, 2020 at 2:04 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2020-04-22 13:24, Paul Moore wrote: > > On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > Paul Moore <paul@paul-moore.com> writes: > > > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > > >> Paul Moore <paul@paul-moore.com> writes: > > > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > >> >> On 2020-03-30 13:34, Paul Moore wrote: > > > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > >> >> > > On 2020-03-30 10:26, Paul Moore wrote: > > > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote: > > > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > > > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > > > >> > > > > >> > ... > > > >> > > > > >> >> > > Well, every time a record gets generated, *any* record gets generated, > > > >> >> > > we'll need to check for which audit daemons this record is in scope and > > > >> >> > > generate a different one for each depending on the content and whether > > > >> >> > > or not the content is influenced by the scope. > > > >> >> > > > > >> >> > That's the problem right there - we don't want to have to generate a > > > >> >> > unique record for *each* auditd on *every* record. That is a recipe > > > >> >> > for disaster. > > > >> >> > > > > >> >> > Solving this for all of the known audit records is not something we > > > >> >> > need to worry about in depth at the moment (although giving it some > > > >> >> > casual thought is not a bad thing), but solving this for the audit > > > >> >> > container ID information *is* something we need to worry about right > > > >> >> > now. > > > >> >> > > > >> >> If you think that a different nested contid value string per daemon is > > > >> >> not acceptable, then we are back to issuing a record that has only *one* > > > >> >> contid listed without any nesting information. This brings us back to > > > >> >> the original problem of keeping *all* audit log history since the boot > > > >> >> of the machine to be able to track the nesting of any particular contid. > > > >> > > > > >> > I'm not ruling anything out, except for the "let's just completely > > > >> > regenerate every record for each auditd instance". > > > >> > > > >> Paul I am a bit confused about what you are referring to when you say > > > >> regenerate every record. > > > >> > > > >> Are you saying that you don't want to repeat the sequence: > > > >> audit_log_start(...); > > > >> audit_log_format(...); > > > >> audit_log_end(...); > > > >> for every nested audit daemon? > > > > > > > > If it can be avoided yes. Audit performance is already not-awesome, > > > > this would make it even worse. > > > > > > As far as I can see not repeating sequences like that is fundamental > > > for making this work at all. Just because only the audit subsystem > > > should know about one or multiple audit daemons. Nothing else should > > > care. > > > > Yes, exactly, this has been mentioned in the past. Both the > > performance hit and the code complication in the caller are things we > > must avoid. 
> > > > > >> Or are you saying that you would like to literraly want to send the same > > > >> skb to each of the nested audit daemons? > > > > > > > > Ideally we would reuse the generated audit messages as much as > > > > possible. Less work is better. That's really my main concern here, > > > > let's make sure we aren't going to totally tank performance when we > > > > have a bunch of nested audit daemons. > > > > > > So I think there are two parts of this answer. Assuming we are talking > > > about nesting audit daemons in containers we will have different > > > rulesets and I expect most of the events for a nested audit daemon won't > > > be of interest to the outer audit daemon. > > > > Yes, this is another thing that Richard and I have discussed in the > > past. We will basically need to create per-daemon queues, rules, > > tracking state, etc.; that is easy enough. What will be slightly more > > tricky is the part where we apply the filters to the individual > > records and decide if that record is valid/desired for a given daemon. > > I think it can be done without too much pain, and any changes to the > > callers, but it will require a bit of work to make sure it is done > > well and that records are needlessly duplicated in the kernel. > > > > > Beyond that it should be very straight forward to keep a pointer and > > > leave the buffer as a scatter gather list until audit_log_end > > > and translate pids, and rewrite ACIDs attributes in audit_log_end > > > when we build the final packet. Either through collaboration with > > > audit_log_format or a special audit_log command that carefully sets > > > up the handful of things that need that information. > > > > In order to maximize record re-use I think we will want to hold off on > > assembling the final packet until it is sent to the daemons in the > > kauditd thread. We'll also likely need to create special > > audit_log_XXX functions to capture fields which we know will need > > translation, e.g. ACID information. (the reason for the new > > audit_log_XXX functions would be to mark the new sg element and ensure > > the buffer is handled correctly) > > > > Regardless of the details, I think the scatter gather approach is the > > key here - that seems like the best design idea I've seen thus far. > > It enables us to replace portions of the record as needed ... and > > possibly use the existing skb cow stuff ... it has been a while, but > > does the skb cow functions handle scatter gather skbs or do they need > > to be linear? > > How does the selection of this data management technique affect our > choice of field format? I'm not sure it affects the record string, but it might affect the in-kernel API as we would likely want to have a special function for logging the audit container ID that does the scatter-gather management for the record. There might also need to be some changes to how we allocate the records. However, since you're the one working on these patches I would expect you to be the one to look into how this would work and what the impacts might be to the code, record format, etc. > Does this lock the field value to a fixed length? I wouldn't think so. In fact if it did it wouldn't really be a good solution. Once again, this is something I would expect you to look into. > Does the use of scatter/gather techniques or structures allow > the use of different lengths of data for each destination (auditd)? This is related to the above ... 
but yes, the reason why Eric and I were discussing a scatter/gather approach is that it would presumably allow one to break the single record string into pieces which could be managed and manipulated much more easily than the monolithic record string. > I could see different target audit daemons triggering or switching to a > different chunk of data and length. This does raise a concern related > to the previous sig_info2 discussion that the struct contobj that exists > at the time of audit_log_exit called could have been reaped by the time > the buffer is pulled from the queue for transmission to auditd, but we > could hold a reference to it as is done for sig_info2. Yes. > Looking through the kernel scatter/gather possibilities, I see struct > iovec which is used by the readv/writev/preadv/pwritev syscalls, but I'm > understanding that this is a kernel implementation that will be not > visible to user space. So would the struct scatterlist be the right > choice? It has been so long since I've looked at the scatter-gather code that I can't really say with any confidence at this point. All I can say is that the scatter-gather code really should just be an implementation detail in the kernel and should not be visible to userspace; userspace should get the same awful, improperly generated netlink message it always has received from the kernel ;)
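The point that scatter-gather stays an in-kernel detail can be shown with a toy model: however the record is held internally, the pieces are flattened into one contiguous payload before the netlink message is built, so userspace keeps parsing a single record string. Everything below is illustrative only.

/* Model of flattening per-record segments into the single contiguous
 * string userspace already expects; the segment bookkeeping never
 * escapes the kernel.  Illustrative only. */
#include <stdio.h>
#include <string.h>

static size_t linearize(const char **segs, int n, char *out, size_t outlen)
{
	size_t used = 0;
	int i;

	out[0] = '\0';
	for (i = 0; i < n; i++) {
		size_t len = strlen(segs[i]);

		if (used + len + 1 > outlen)
			break;			/* truncate, keep it simple */
		memcpy(out + used, segs[i], len + 1);
		used += len;
	}
	return used;
}

int main(void)
{
	const char *segs[] = {
		"type=CONTAINER_ID msg=audit(...): ",
		"contid=5678",			/* per-daemon piece */
	};
	char payload[256];

	linearize(segs, 2, payload, sizeof(payload));
	printf("what userspace sees: %s\n", payload);
	return 0;
}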
On 2020-04-17 17:23, Eric W. Biederman wrote: > Paul Moore <paul@paul-moore.com> writes: > > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > >> Paul Moore <paul@paul-moore.com> writes: > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> On 2020-03-30 13:34, Paul Moore wrote: > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > On 2020-03-30 10:26, Paul Moore wrote: > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote: > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote: > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote: > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote: > >> > > >> > ... > >> > > >> >> > > Well, every time a record gets generated, *any* record gets generated, > >> >> > > we'll need to check for which audit daemons this record is in scope and > >> >> > > generate a different one for each depending on the content and whether > >> >> > > or not the content is influenced by the scope. > >> >> > > >> >> > That's the problem right there - we don't want to have to generate a > >> >> > unique record for *each* auditd on *every* record. That is a recipe > >> >> > for disaster. > >> >> > > >> >> > Solving this for all of the known audit records is not something we > >> >> > need to worry about in depth at the moment (although giving it some > >> >> > casual thought is not a bad thing), but solving this for the audit > >> >> > container ID information *is* something we need to worry about right > >> >> > now. > >> >> > >> >> If you think that a different nested contid value string per daemon is > >> >> not acceptable, then we are back to issuing a record that has only *one* > >> >> contid listed without any nesting information. This brings us back to > >> >> the original problem of keeping *all* audit log history since the boot > >> >> of the machine to be able to track the nesting of any particular contid. > >> > > >> > I'm not ruling anything out, except for the "let's just completely > >> > regenerate every record for each auditd instance". > >> > >> Paul I am a bit confused about what you are referring to when you say > >> regenerate every record. > >> > >> Are you saying that you don't want to repeat the sequence: > >> audit_log_start(...); > >> audit_log_format(...); > >> audit_log_end(...); > >> for every nested audit daemon? > > > > If it can be avoided yes. Audit performance is already not-awesome, > > this would make it even worse. > > As far as I can see not repeating sequences like that is fundamental > for making this work at all. Just because only the audit subsystem > should know about one or multiple audit daemons. Nothing else should > care. > > >> Or are you saying that you would like to literraly want to send the same > >> skb to each of the nested audit daemons? > > > > Ideally we would reuse the generated audit messages as much as > > possible. Less work is better. That's really my main concern here, > > let's make sure we aren't going to totally tank performance when we > > have a bunch of nested audit daemons. > > So I think there are two parts of this answer. 
Assuming we are talking > about nesting audit daemons in containers we will have different > rulesets and I expect most of the events for a nested audit daemon won't > be of interest to the outer audit daemon. > > Beyond that it should be very straight forward to keep a pointer and > leave the buffer as a scatter gather list until audit_log_end > and translate pids, and rewrite ACIDs attributes in audit_log_end > when we build the final packet. Either through collaboration with > audit_log_format or a special audit_log command that carefully sets > up the handful of things that need that information. > > Hmm. I am seeing that we send skbs to kauditd and then kauditd > sends those skbs to userspace. I presume that is primary so that > sending messages to userspace does not block the process being audited. > > Plus a little bit so that the retry logic will work. > > I think the naive implementation would be to simply have 1 kauditd > per auditd (strictly and audit context/namespace). Although that can be > optimized if that is a problem. > > Beyond that I think we would need to look at profiles to really > understand where the bottlenecks are. > > >> Or are you thinking of something else? > > > > As mentioned above, I'm not thinking of anything specific, other than > > let's please not have to regenerate *all* of the audit record strings > > for each instance of an audit daemon, that's going to be a killer. > > > > Maybe we have to regenerate some, if we do, what would that look like > > in code? How do we handle the regeneration aspect? I worry that is > > going to be really ugly. > > > > Maybe we finally burn down the audit_log_format(...) function and pass > > structs/TLVs to the audit subsystem and the audit subsystem generates > > the strings in the auditd connection thread. Some of the record > > strings could likely be shared, others would need to be ACID/auditd > > dependent. > > I think we just a very limited amount of structs/TLVs for the cases that > matter and one-one auditd and kauditd implementations we should still > be able to do everything in audit_log_end. Plus doing as much work as > possible in audit_log_end where things are still cache hot is desirable. So in the end, perf may show us that moving things around a bit and knowing to which queue(s) we send an skb will help maintain performance by writing out the field contents in audit_log_end() and sending to the correct queue rather than deferring writing out that field contents in the kauditd process due to cache issues. In any case, it makes sense to delay that formatting work until just after the daemon routing decision is made. > > I'm open to any ideas people may have. We have a problem, let's solve > > it. > > It definitely makes sense to look ahead to having audit daemons running > in containers, but in the grand scheme of things that is a nice to have. > Probably something we will and should get to, but we have lived a long > time without auditd running in containers so I expect we can live a > while longer. > > As I understand Richard patchset for the specific case of the ACID we > are only talking about taking a subset of an existing string, and one > string at that. Not hard at all. Especially when looking at the > fundamental fact that we will need to send a different skb to > userspace, for each audit daemon. > > Eric - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. 
S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 2636b0ad0011..6929a02080f7 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -22,6 +22,13 @@ struct audit_sig_info {
 	char ctx[0];
 };
 
+struct audit_sig_info2 {
+	uid_t uid;
+	pid_t pid;
+	u64 cid;
+	char ctx[0];
+};
+
 struct audit_buffer;
 struct audit_context;
 struct inode;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 93417a8af9d0..4f87b06f0acd 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -72,6 +72,7 @@
 #define AUDIT_SET_FEATURE 1018	/* Turn an audit feature on or off */
 #define AUDIT_GET_FEATURE 1019	/* Get which features are enabled */
 #define AUDIT_CONTAINER_OP 1020	/* Define the container id and info */
+#define AUDIT_SIGNAL_INFO2 1021	/* Get info auditd signal sender */
 
 #define AUDIT_FIRST_USER_MSG 1100	/* Userspace messages mostly uninteresting to kernel */
 #define AUDIT_USER_AVC 1107	/* We filter this differently */
diff --git a/kernel/audit.c b/kernel/audit.c
index 0871c3e5d6df..51159c94041c 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -126,6 +126,14 @@ struct auditd_connection {
 kuid_t audit_sig_uid = INVALID_UID;
 pid_t audit_sig_pid = -1;
 u32 audit_sig_sid = 0;
+/* Since the signal information is stored in the record buffer at the
+ * time of the signal, but not retrieved until later, there is a chance
+ * that the last process in the container could terminate before the
+ * signal record is delivered. In this circumstance, there is a chance
+ * the orchestrator could reuse the audit container identifier, causing
+ * an overlap of audit records that refer to the same audit container
+ * identifier, but a different container instance. */
+u64 audit_sig_cid = AUDIT_CID_UNSET;
 
 /* Records can be lost in several ways:
    0) [suppressed in audit_alloc]
@@ -1123,6 +1131,7 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
 	case AUDIT_ADD_RULE:
 	case AUDIT_DEL_RULE:
 	case AUDIT_SIGNAL_INFO:
+	case AUDIT_SIGNAL_INFO2:
 	case AUDIT_TTY_GET:
 	case AUDIT_TTY_SET:
 	case AUDIT_TRIM:
@@ -1286,6 +1295,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	struct audit_buffer *ab;
 	u16 msg_type = nlh->nlmsg_type;
 	struct audit_sig_info *sig_data;
+	struct audit_sig_info2 *sig_data2;
 	char *ctx = NULL;
 	u32 len;
 
@@ -1545,6 +1555,30 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 				 sig_data, sizeof(*sig_data) + len);
 		kfree(sig_data);
 		break;
+	case AUDIT_SIGNAL_INFO2:
+		len = 0;
+		if (audit_sig_sid) {
+			err = security_secid_to_secctx(audit_sig_sid, &ctx, &len);
+			if (err)
+				return err;
+		}
+		sig_data2 = kmalloc(sizeof(*sig_data2) + len, GFP_KERNEL);
+		if (!sig_data2) {
+			if (audit_sig_sid)
+				security_release_secctx(ctx, len);
+			return -ENOMEM;
+		}
+		sig_data2->uid = from_kuid(&init_user_ns, audit_sig_uid);
+		sig_data2->pid = audit_sig_pid;
+		if (audit_sig_sid) {
+			memcpy(sig_data2->ctx, ctx, len);
+			security_release_secctx(ctx, len);
+		}
+		sig_data2->cid = audit_sig_cid;
+		audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO2, 0, 0,
+				 sig_data2, sizeof(*sig_data2) + len);
+		kfree(sig_data2);
+		break;
 	case AUDIT_TTY_GET: {
 		struct audit_tty_status s;
 		unsigned int t;
@@ -2414,6 +2448,7 @@ int audit_signal_info(int sig, struct task_struct *t)
 		else
 			audit_sig_uid = uid;
 		security_task_getsecid(current, &audit_sig_sid);
+		audit_sig_cid = audit_get_contid(current);
 	}
 
 	return audit_signal_info_syscall(t);
diff --git a/kernel/audit.h b/kernel/audit.h
index 162de8366b32..de358ac61587 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -346,6 +346,7 @@ static inline int audit_signal_info_syscall(struct task_struct *t)
 extern pid_t audit_sig_pid;
 extern kuid_t audit_sig_uid;
 extern u32 audit_sig_sid;
+extern u64 audit_sig_cid;
 
 extern int audit_filter(int msgtype, unsigned int listtype);
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index c97fdae8f71b..f006d8b70b65 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -134,6 +134,7 @@ struct nlmsg_perm {
 	{ AUDIT_DEL_RULE, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
 	{ AUDIT_USER, NETLINK_AUDIT_SOCKET__NLMSG_RELAY },
 	{ AUDIT_SIGNAL_INFO, NETLINK_AUDIT_SOCKET__NLMSG_READ },
+	{ AUDIT_SIGNAL_INFO2, NETLINK_AUDIT_SOCKET__NLMSG_READ },
 	{ AUDIT_TRIM, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
 	{ AUDIT_MAKE_EQUIV, NETLINK_AUDIT_SOCKET__NLMSG_WRITE },
 	{ AUDIT_TTY_GET, NETLINK_AUDIT_SOCKET__NLMSG_READ },
Add audit container identifier support to the action of signalling the
audit daemon.

Since this would need to add an element to the audit_sig_info struct,
a new record type AUDIT_SIGNAL_INFO2 was created with a new
audit_sig_info2 struct.  Corresponding support is required in the
userspace code to reflect the new record request and reply type.
An older userspace won't break since it won't know to request this
record type.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 include/linux/audit.h       |  7 +++++++
 include/uapi/linux/audit.h  |  1 +
 kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
 kernel/audit.h              |  1 +
 security/selinux/nlmsgtab.c |  1 +
 5 files changed, 45 insertions(+)
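For the userspace side, a minimal sketch of requesting the new record type might look like the following. It assumes a kernel with this patch applied, mirrors the audit_sig_info2 layout from the patch (which is not in the uapi header), hard-codes AUDIT_SIGNAL_INFO2 as 1021 as defined above, and needs CAP_AUDIT_CONTROL to succeed; error handling is kept to a bare minimum.

/* Minimal sketch: ask the kernel for AUDIT_SIGNAL_INFO2 and print the
 * pid, uid and audit container identifier of the last process that
 * signalled auditd.  Assumes a kernel with this patch applied. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#define AUDIT_SIGNAL_INFO2 1021	/* value from this patch */

/* Userspace mirror of the kernel's audit_sig_info2 from this patch;
 * the (optional) security context string follows the fixed fields. */
struct audit_sig_info2 {
	uid_t uid;
	pid_t pid;
	uint64_t cid;
	char ctx[0];
};

int main(void)
{
	struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
	struct nlmsghdr req;
	union {
		struct nlmsghdr h;
		char raw[8192];
	} resp;
	struct audit_sig_info2 *sig;
	ssize_t len;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&req, 0, sizeof(req));
	req.nlmsg_len = NLMSG_LENGTH(0);	/* header only, no payload */
	req.nlmsg_type = AUDIT_SIGNAL_INFO2;
	req.nlmsg_flags = NLM_F_REQUEST;

	if (sendto(fd, &req, req.nlmsg_len, 0,
		   (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("sendto");
		return 1;
	}

	/* Read replies until the AUDIT_SIGNAL_INFO2 one shows up. */
	for (;;) {
		len = recv(fd, &resp, sizeof(resp), 0);
		if (len < 0) {
			perror("recv");
			return 1;
		}
		if (resp.h.nlmsg_type == NLMSG_ERROR) {
			fprintf(stderr, "kernel returned an error\n");
			return 1;
		}
		if (resp.h.nlmsg_type == AUDIT_SIGNAL_INFO2)
			break;
	}

	sig = NLMSG_DATA(&resp.h);
	printf("auditd last signalled by pid=%d uid=%u contid=%llu\n",
	       sig->pid, sig->uid, (unsigned long long)sig->cid);
	close(fd);
	return 0;
}

An older userspace simply never sends AUDIT_SIGNAL_INFO2 and keeps using AUDIT_SIGNAL_INFO, which is why the new record type does not break existing tools.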