Message ID | 20190710141548.132193-1-joel@joelfernandes.org (mailing list archive) |
---|---|
Headers | show |
Series | Add support to directly attach BPF program to ftrace | expand |
On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote: > Hi, why are you cc-ing the whole world for this patch set? I'll reply to all as well, but I suspect a bunch of folks consider it spam. Please read Documentation/bpf/bpf_devel_QA.rst Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, so I'm not sure this set reached public mailing lists. > These patches make it possible to attach BPF programs directly to tracepoints > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the > attach to be alive. This has the following benefits: > > 1. Simplified Security: In Android, we have finer-grained security controls to > specific ftrace trace events using SELinux labels. We control precisely who is > allowed to enable an ftrace event already. By adding a node to ftrace for > attaching BPF programs, we can use the same mechanism to further control who is > allowed to attach to a trace event. > > 2. Process lifetime: In Android we are adding usecases where a tracing program > needs to be attached all the time to a tracepoint, for the full life time of > the system. Such as to gather statistics where there no need for a detach for > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this > means keeping a process alive all the time. However, in Android our BPF loader > currently (for hardeneded security) involves just starting a process at boot > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We > don't keep this process alive all the time. It is more suitable to do a > one-shot attach of the program using ftrace and not need to have a process > alive all the time anymore for this. Such process also needs elevated > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN > anyway so by design Android's bpfloader runs once at init and exits. > > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf > The following commands can be written into it: > attach:<fd> Attaches BPF prog fd to tracepoint > detach:<fd> Detaches BPF prog fd to tracepoint Looks like, to detach a program the user needs to read a text file, parse bpf prog id from text into binary. Then call fd_from_id bpf syscall, get a binary FD, convert it back to text and write as a text back into this file. I think this is just a single example why text based apis are not accepted in bpf anymore. Through the patch set you call it ftrace. As far as I can see, this set has zero overlap with ftrace. There is no ftrace-bpf connection here at all that we discussed in the past Steven. It's all quite confusing. I suggest android to solve sticky raw_tracepoint problem with user space deamon. The reasons, you point out why user daemon cannot be used, sound weak to me. Another acceptable solution would be to introduce pinning of raw_tp objects. bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would be natural extension.
On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote: > On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote: > > Hi, > > why are you cc-ing the whole world for this patch set? Well, the whole world happens to be interested in BPF on Android. > I'll reply to all as well, but I suspect a bunch of folks consider it spam. > Please read Documentation/bpf/bpf_devel_QA.rst Ok, I'll read it. > Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, > so I'm not sure this set reached public mailing lists. Certainly the CC list here is not added to folks who consider it spam. All the folks added have been interested in BPF on Android at various points of time. Is this CC list really that large? It has around 24 email addresses or so. I can trim it a bit if needed. Also, you sound like as if people are screaming at me to stop emailing them, certainly that's not the case and no one has told me it is spam. And, it did reach the public archive btw: https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com/T/#m1460ba463b78312e38b68b8c118f673d2ead9446 > > These patches make it possible to attach BPF programs directly to tracepoints > > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the > > attach to be alive. This has the following benefits: > > > > 1. Simplified Security: In Android, we have finer-grained security controls to > > specific ftrace trace events using SELinux labels. We control precisely who is > > allowed to enable an ftrace event already. By adding a node to ftrace for > > attaching BPF programs, we can use the same mechanism to further control who is > > allowed to attach to a trace event. > > > > 2. Process lifetime: In Android we are adding usecases where a tracing program > > needs to be attached all the time to a tracepoint, for the full life time of > > the system. Such as to gather statistics where there no need for a detach for > > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this > > means keeping a process alive all the time. However, in Android our BPF loader > > currently (for hardeneded security) involves just starting a process at boot > > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We > > don't keep this process alive all the time. It is more suitable to do a > > one-shot attach of the program using ftrace and not need to have a process > > alive all the time anymore for this. Such process also needs elevated > > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN > > anyway so by design Android's bpfloader runs once at init and exits. > > > > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf > > The following commands can be written into it: > > attach:<fd> Attaches BPF prog fd to tracepoint > > detach:<fd> Detaches BPF prog fd to tracepoint > > Looks like, to detach a program the user needs to read a text file, > parse bpf prog id from text into binary. Then call fd_from_id bpf syscall, > get a binary FD, convert it back to text and write as a text back into this file. > I think this is just a single example why text based apis are not accepted > in bpf anymore. This can also be considered a tracefs API. And we can certainly change the detach to accept program ids as well if that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'. By the way, I can also list the set of cumbersome steps needed to attach a BPF program using perf and I bet it will be longer ;-) > Through the patch set you call it ftrace. As far as I can see, this set > has zero overlap with ftrace. There is no ftrace-bpf connection here at all > that we discussed in the past Steven. It's all quite confusing. It depends on what you mean by ftrace, may be I can call it 'trace events' or something if it is less ambiguious. All of this has been collectively called ftrace before. I am not sure if you you are making sense actually, trace_events mechanism is a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even the documentation file name has the word ftrace in it. I have also spoken to Steven before about this, I don't think he ever told me there is no connection so again I am a bit lost at your comments. > I suggest android to solve sticky raw_tracepoint problem with user space deamon. > The reasons, you point out why user daemon cannot be used, sound weak to me. I don't think it is weak. It seems overkill to have a daemon for a trace event that is say supposed to be attached to all the time for the lifetime of the system. Why should there be a daemon consuming resources if it is active all the time? In Android, we are very careful about spawning useless processes and leaving them alive for the lifetime of the system - for no good reason. Our security teams also don't like this, and they can comment more. > Another acceptable solution would be to introduce pinning of raw_tp objects. > bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would > be natural extension. I don't think the pinning solves the security problem, it just solves the process lifetime problem. Currently, attaching trace events through perf requires CAP_SYS_ADMIN. However, with ftrace events, we already control security of events by labeling the nodes in tracefs and granting access to the labeled context through the selinux policies. Having a 'bpf' node in tracefs for events, and granting access to the labels is a natural extension. I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well. I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well. thanks, - Joel
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > I also thought about the pinning idea before, but we also want to add support > for not just raw tracepoints, but also regular tracepoints (events if you > will). I am hesitant to add a new BPF API just for creating regular > tracepoints and then pinning those as well. and they should be done through the pinning as well. > I don't see why a new bpf node for a trace event is a bad idea, really. See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc. > tracefs is how we deal with trace events on Android. android has made mistakes in the past as well. > This is a natural extension to that and fits with the security model > well. I think it's the opposite. I'm absolutely against text based apis.
On Tue, 16 Jul 2019 17:30:50 -0400 Joel Fernandes <joel@joelfernandes.org> wrote: > I don't see why a new bpf node for a trace event is a bad idea, really. > tracefs is how we deal with trace events on Android. We do it in production > systems. This is a natural extension to that and fits with the security model > well. What I would like to see is a way to have BPF inject data into the ftrace ring buffer directly. There's a bpf_trace_printk() that I find a bit of a hack (especially since it hooks into trace_printk() which is only for debugging purposes). Have a dedicated bpf ftrace ring buffer event that can be triggered is what I am looking for. Then comes the issue of what ring buffer to place it in, as ftrace can have multiple ring buffer instances. But these instances are defined by the tracefs instances directory. Having a way to associate a bpf program to a specific event in a specific tracefs directory could allow for ways to trigger writing into the correct ftrace buffer. But looking over the patches, I see what Alexei means that there's no overlap with ftrace and these patches except for the tracefs directory itself (which is part of the ftrace infrastructure). And the trace events are technically part of the ftrace infrastructure too. I see the tracefs interface being used, but I don't see how the bpf programs being added affect the ftrace ring buffer or other parts of ftrace. And I'm guessing that's what is confusing Alexei. -- Steve
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > I also thought about the pinning idea before, but we also want to add support > > for not just raw tracepoints, but also regular tracepoints (events if you > > will). I am hesitant to add a new BPF API just for creating regular > > tracepoints and then pinning those as well. > > and they should be done through the pinning as well. Hmm ok, I will give it some more thought. > > I don't see why a new bpf node for a trace event is a bad idea, really. > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > tldr: text is racy, doesn't scale, poor security, etc. Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. Its fine grained per-event control. That's where I was going with the tracefs approach since we get that granularity using the file system. Thanks.
On Tue, 16 Jul 2019 15:26:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> I'm absolutely against text based apis.
I guess you don't use /proc ;-)
-- Steve
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote: > On Tue, 16 Jul 2019 17:30:50 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > tracefs is how we deal with trace events on Android. We do it in production > > systems. This is a natural extension to that and fits with the security model > > well. > > What I would like to see is a way to have BPF inject data into the > ftrace ring buffer directly. There's a bpf_trace_printk() that I find a > bit of a hack (especially since it hooks into trace_printk() which is > only for debugging purposes). Have a dedicated bpf ftrace ring > buffer event that can be triggered is what I am looking for. Then comes > the issue of what ring buffer to place it in, as ftrace can have > multiple ring buffer instances. But these instances are defined by the > tracefs instances directory. Having a way to associate a bpf program to > a specific event in a specific tracefs directory could allow for ways to > trigger writing into the correct ftrace buffer. But his problem is with doing the association of a BPF program with tracefs itself. How would you attach a BPF program with tracefs without doing a text based approach? His problem is with the text based approach per his last email. > But looking over the patches, I see what Alexei means that there's no > overlap with ftrace and these patches except for the tracefs directory > itself (which is part of the ftrace infrastructure). And the trace > events are technically part of the ftrace infrastructure too. I see the > tracefs interface being used, but I don't see how the bpf programs > being added affect the ftrace ring buffer or other parts of ftrace. And > I'm guessing that's what is confusing Alexei. In a follow-up patch which I am still writing, I am using the trace ring buffer as temporary storage since I am formatting the trace event into it. This patch you are replying to is just for raw tracepoint and yes, I agree this one does not use the ring buffer, but a future addition to it does. So I don't think the association of this patch series with ftrace is going to be an issue IMO. thanks, - Joel
On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote: > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > > > I also thought about the pinning idea before, but we also want to add support > > > for not just raw tracepoints, but also regular tracepoints (events if you > > > will). I am hesitant to add a new BPF API just for creating regular > > > tracepoints and then pinning those as well. > > > > and they should be done through the pinning as well. > > Hmm ok, I will give it some more thought. I think I can make the new BPF API + pinning approach work, I will try to work on something like this and post it soon. Also, I had a question below if you don't mind taking a look: thanks Alexei! > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > tldr: text is racy, doesn't scale, poor security, etc. > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > per-event level? We are selective about who can access which event, using > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > control. That's where I was going with the tracefs approach since we get that > granularity using the file system. > > Thanks. >
On Tue, Jul 16, 2019 at 07:55:00PM -0400, Joel Fernandes wrote: > On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote: > > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > > > > > I also thought about the pinning idea before, but we also want to add support > > > > for not just raw tracepoints, but also regular tracepoints (events if you > > > > will). I am hesitant to add a new BPF API just for creating regular > > > > tracepoints and then pinning those as well. > > > > > > and they should be done through the pinning as well. > > > > Hmm ok, I will give it some more thought. > > I think I can make the new BPF API + pinning approach work, I will try to > work on something like this and post it soon. > > Also, I had a question below if you don't mind taking a look: > > thanks Alexei! > > > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > > tldr: text is racy, doesn't scale, poor security, etc. > > > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > > per-event level? We are selective about who can access which event, using > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > > control. That's where I was going with the tracefs approach since we get that > > granularity using the file system. android's choice of selinux is not a factor in deciding kernel apis. It's completely separate discusion wether disallowing particular tracepoints for given user make sense at all. Just because you can hack it in via selinux blocking particular /sys/debug/tracing/ directory and convince yourself that it's somehow makes android more secure. It doesn't mean that all new api should fit into this model. I think allowing one tracepoint and disallowing another is pointless from security point of view. Tracing bpf program can do bpf_probe_read of anything.
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote: > On Tue, 16 Jul 2019 17:30:50 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > tracefs is how we deal with trace events on Android. We do it in production > > systems. This is a natural extension to that and fits with the security model > > well. > > What I would like to see is a way to have BPF inject data into the > ftrace ring buffer directly. There's a bpf_trace_printk() that I find a > bit of a hack (especially since it hooks into trace_printk() which is > only for debugging purposes). Have a dedicated bpf ftrace ring > buffer event that can be triggered is what I am looking for. Then comes > the issue of what ring buffer to place it in, as ftrace can have > multiple ring buffer instances. But these instances are defined by the > tracefs instances directory. Having a way to associate a bpf program to > a specific event in a specific tracefs directory could allow for ways to > trigger writing into the correct ftrace buffer. bpf can write anything (including full skb) into perf ring buffer. I don't think there is a use case yet to write into ftrace ring buffer. But I can be convinced otherwise :) > But looking over the patches, I see what Alexei means that there's no > overlap with ftrace and these patches except for the tracefs directory > itself (which is part of the ftrace infrastructure). And the trace > events are technically part of the ftrace infrastructure too. I see the > tracefs interface being used, but I don't see how the bpf programs > being added affect the ftrace ring buffer or other parts of ftrace. And > I'm guessing that's what is confusing Alexei. yep. What I really like to see some day is proper integration of real ftrace and bpf. So that bpf progs can be invoked from some of the ftrace machinery. And the other way around. Like I'd love to be able to start ftracing all functions from bpf and stop it from bpf. The use case is kernel debugging. I'd like to examine a bunch of condition and start ftracing the execution. Then later decide wether this collection of ip addresses is interesting to analyze and post process it quickly while still inside bpf program.
On Tue, Jul 16, 2019 at 06:24:07PM -0700, Alexei Starovoitov wrote: [snip] > > > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > > > tldr: text is racy, doesn't scale, poor security, etc. > > > > > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > > > per-event level? We are selective about who can access which event, using > > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > > > control. That's where I was going with the tracefs approach since we get that > > > granularity using the file system. > > android's choice of selinux is not a factor in deciding kernel apis. > It's completely separate discusion wether disallowing particular tracepoints > for given user make sense at all. > Just because you can hack it in via selinux blocking particular > /sys/debug/tracing/ directory and convince yourself that it's somehow > makes android more secure. It doesn't mean that all new api should fit > into this model. Its not like a hack, it is just control of which tracefs node can be accessed and which cannot be since the tracing can run on production systems out in the field and there are several concerns to address like security, privacy etc. It is not just for debugging usecases. We do collect traces out in the field where these issues are real and cannot be ignored. SELinux model is deny everything, and then selectively grant access to what is needed. The VFS and security LSM hooks provide this control quite well. I am not sure if such control is possible through perf hence I asked the question. > I think allowing one tracepoint and disallowing another is pointless > from security point of view. Tracing bpf program can do bpf_probe_read > of anything. I think the assumption here is the user controls the program instructions at runtime, but that's not the case. The BPF program we are loading is not dynamically generated, it is built at build time and it is loaded from a secure verified partition, so even though it can do bpf_probe_read, it is still not something that the user can change. And, we are planning to make it even more secure by making it kernel verify the program at load time as well (you were on some discussions about that a few months ago). thanks, - Joel