Message ID | 20241114084345.1564165-8-song@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Fanotify fastpath handler | expand |
On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: > > + > + if (bpf_is_subdir(dentry, v->dentry)) > + ret = FAN_FP_RET_SEND_TO_USERSPACE; > + else > + ret = FAN_FP_RET_SKIP_EVENT; It seems to me that all these patches and feature additions to fanotify, new kfuncs, etc are done just to do the above filtering by subdir ? If so, just hard code this logic as an extra flag to fanotify ? So it can filter all events by subdir. bpf programmability makes sense when it needs to express user space policy. Here it's just a filter by subdir. bpf hammer doesn't look like the right tool for this use case.
> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: >> >> + >> + if (bpf_is_subdir(dentry, v->dentry)) >> + ret = FAN_FP_RET_SEND_TO_USERSPACE; >> + else >> + ret = FAN_FP_RET_SKIP_EVENT; > > It seems to me that all these patches and feature additions > to fanotify, new kfuncs, etc are done just to do the above > filtering by subdir ? > > If so, just hard code this logic as an extra flag to fanotify ? > So it can filter all events by subdir. > bpf programmability makes sense when it needs to express > user space policy. Here it's just a filter by subdir. > bpf hammer doesn't look like the right tool for this use case. Current version is indeed tailored towards the subtree monitoring use case. This is mostly because feedback on v1 mostly focused on this use case. V1 itself actually had some other use cases. In practice, fanotify fastpath can benefit from bpf programmability. For example, with bpf programmability, we can combine fanotify and BPF LSM in some security use cases. If some security rules only applies to a few files, a directory, or a subtree, we can use fanotify to only monitor these files. LSM hooks, such as security_file_open(), are always global. The overhead is higher if we are only interested in a few files. Does this make sense? Thanks, Song
On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote: > > > > > On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: > >> > >> + > >> + if (bpf_is_subdir(dentry, v->dentry)) > >> + ret = FAN_FP_RET_SEND_TO_USERSPACE; > >> + else > >> + ret = FAN_FP_RET_SKIP_EVENT; > > > > It seems to me that all these patches and feature additions > > to fanotify, new kfuncs, etc are done just to do the above > > filtering by subdir ? > > > > If so, just hard code this logic as an extra flag to fanotify ? > > So it can filter all events by subdir. > > bpf programmability makes sense when it needs to express > > user space policy. Here it's just a filter by subdir. > > bpf hammer doesn't look like the right tool for this use case. > > Current version is indeed tailored towards the subtree > monitoring use case. This is mostly because feedback on v1 > mostly focused on this use case. V1 itself actually had some > other use cases. like? > In practice, fanotify fastpath can benefit from bpf > programmability. For example, with bpf programmability, we > can combine fanotify and BPF LSM in some security use cases. > If some security rules only applies to a few files, a > directory, or a subtree, we can use fanotify to only monitor > these files. LSM hooks, such as security_file_open(), are > always global. The overhead is higher if we are only > interested in a few files. > > Does this make sense? Not yet. This fanotify bpf filtering only reduces the number of events sent to user space. How is it supposed to interact with bpf-lsm? Say, security policy applies to /usr/bin/* so lsm suppose to act on all files and subdirs in there. How fanotify helps ?
> On Nov 14, 2024, at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote: >> >> >> >>> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: >>> >>> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: >>>> >>>> + >>>> + if (bpf_is_subdir(dentry, v->dentry)) >>>> + ret = FAN_FP_RET_SEND_TO_USERSPACE; >>>> + else >>>> + ret = FAN_FP_RET_SKIP_EVENT; >>> >>> It seems to me that all these patches and feature additions >>> to fanotify, new kfuncs, etc are done just to do the above >>> filtering by subdir ? >>> >>> If so, just hard code this logic as an extra flag to fanotify ? >>> So it can filter all events by subdir. >>> bpf programmability makes sense when it needs to express >>> user space policy. Here it's just a filter by subdir. >>> bpf hammer doesn't look like the right tool for this use case. >> >> Current version is indeed tailored towards the subtree >> monitoring use case. This is mostly because feedback on v1 >> mostly focused on this use case. V1 itself actually had some >> other use cases. > > like? samples/fanotify in v1 shows pattern that matches file prefix (no BPF). selftests/bpf in v1 shows a pattern where we propagate a flag in inode local storage from parent directory to newly created children directory. > >> In practice, fanotify fastpath can benefit from bpf >> programmability. For example, with bpf programmability, we >> can combine fanotify and BPF LSM in some security use cases. >> If some security rules only applies to a few files, a >> directory, or a subtree, we can use fanotify to only monitor >> these files. LSM hooks, such as security_file_open(), are >> always global. The overhead is higher if we are only >> interested in a few files. >> >> Does this make sense? > > Not yet. > This fanotify bpf filtering only reduces the number of events > sent to user space. > How is it supposed to interact with bpf-lsm? Ah, I didn't explain this part. fanotify+bpf fastpath can do more reducing number of events sent to user space. It can also be used in tracing use cases. For example, we can implement a filetop tool that only monitors a specific directory, a specific device, or a specific mount point. It can also reject some file access (fanotify permission mode). I should have showed these features in a sample and/or selftest. > Say, security policy applies to /usr/bin/* > so lsm suppose to act on all files and subdirs in there. > How fanotify helps ? LSM hooks are always global. It is up to the BPF program to filter out irrelevant events. This filtering is sometimes expensive (match d_patch) and inaccurate (maintain a map of target inodes, etc.). OTOH, fanotify has built-in filtering before the BPF program triggers. When multiple BPF programs are monitoring open() for different subdirectories, fanotify based solution will not trigger all these BPF programs for all the open() in the system. Does this answer the questions? Thanks, Song
On Thu, Nov 14, 2024 at 5:10 PM Song Liu <songliubraving@meta.com> wrote: > > > > > On Nov 14, 2024, at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote: > >> > >> > >> > >>> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > >>> > >>> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: > >>>> > >>>> + > >>>> + if (bpf_is_subdir(dentry, v->dentry)) > >>>> + ret = FAN_FP_RET_SEND_TO_USERSPACE; > >>>> + else > >>>> + ret = FAN_FP_RET_SKIP_EVENT; > >>> > >>> It seems to me that all these patches and feature additions > >>> to fanotify, new kfuncs, etc are done just to do the above > >>> filtering by subdir ? > >>> > >>> If so, just hard code this logic as an extra flag to fanotify ? > >>> So it can filter all events by subdir. > >>> bpf programmability makes sense when it needs to express > >>> user space policy. Here it's just a filter by subdir. > >>> bpf hammer doesn't look like the right tool for this use case. > >> > >> Current version is indeed tailored towards the subtree > >> monitoring use case. This is mostly because feedback on v1 > >> mostly focused on this use case. V1 itself actually had some > >> other use cases. > > > > like? > > samples/fanotify in v1 shows pattern that matches file prefix > (no BPF). selftests/bpf in v1 shows a pattern where we > propagate a flag in inode local storage from parent directory > to newly created children directory. > > > > >> In practice, fanotify fastpath can benefit from bpf > >> programmability. For example, with bpf programmability, we > >> can combine fanotify and BPF LSM in some security use cases. > >> If some security rules only applies to a few files, a > >> directory, or a subtree, we can use fanotify to only monitor > >> these files. LSM hooks, such as security_file_open(), are > >> always global. The overhead is higher if we are only > >> interested in a few files. > >> > >> Does this make sense? > > > > Not yet. > > This fanotify bpf filtering only reduces the number of events > > sent to user space. > > How is it supposed to interact with bpf-lsm? > > Ah, I didn't explain this part. fanotify+bpf fastpath can > do more reducing number of events sent to user space. It can > also be used in tracing use cases. For example, we can > implement a filetop tool that only monitors a specific > directory, a specific device, or a specific mount point. > It can also reject some file access (fanotify permission > mode). I should have showed these features in a sample > and/or selftest. I bet bpf tracing can filetop already. Not as efficient, but tracing isn't going to be running 24/7. > > > Say, security policy applies to /usr/bin/* > > so lsm suppose to act on all files and subdirs in there. > > How fanotify helps ? > > LSM hooks are always global. It is up to the BPF program > to filter out irrelevant events. This filtering is > sometimes expensive (match d_patch) and inaccurate > (maintain a map of target inodes, etc.). OTOH, fanotify > has built-in filtering before the BPF program triggers. > When multiple BPF programs are monitoring open() for > different subdirectories, fanotify based solution will > not trigger all these BPF programs for all the open() > in the system. > > Does this answer the questions? No. Above is too much hand waving. I think bpf-lsm hook fires before fanotify, so bpf-lsm prog implementing some security policy has to decide right at the moment what to do with, say, security_file_open(). fanotify with or without bpf fastpath is too late. In general fanotify is not for security. It's notifying user space of events that already happened, so I don't see how these two can be combined.
> On Nov 14, 2024, at 5:31 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: [...] >>> >>> Not yet. >>> This fanotify bpf filtering only reduces the number of events >>> sent to user space. >>> How is it supposed to interact with bpf-lsm? >> >> Ah, I didn't explain this part. fanotify+bpf fastpath can >> do more reducing number of events sent to user space. It can >> also be used in tracing use cases. For example, we can >> implement a filetop tool that only monitors a specific >> directory, a specific device, or a specific mount point. >> It can also reject some file access (fanotify permission >> mode). I should have showed these features in a sample >> and/or selftest. > > I bet bpf tracing can filetop already. > Not as efficient, but tracing isn't going to be running 24/7. > >> >>> Say, security policy applies to /usr/bin/* >>> so lsm suppose to act on all files and subdirs in there. >>> How fanotify helps ? >> >> LSM hooks are always global. It is up to the BPF program >> to filter out irrelevant events. This filtering is >> sometimes expensive (match d_patch) and inaccurate >> (maintain a map of target inodes, etc.). OTOH, fanotify >> has built-in filtering before the BPF program triggers. >> When multiple BPF programs are monitoring open() for >> different subdirectories, fanotify based solution will >> not trigger all these BPF programs for all the open() >> in the system. >> >> Does this answer the questions? > > No. Above is too much hand waving. > > I think bpf-lsm hook fires before fanotify, so bpf-lsm prog > implementing some security policy has to decide right > at the moment what to do with, say, security_file_open(). > fanotify with or without bpf fastpath is too late. Actually, fanotify in permission mode can stop a file open. In current upstream code, fsnotify hook fsnotify_open_perm is actually part of security_file_open(). It will be moved to do_dentry_open(), right after security_file_open(). This move is done by 1cda52f1b461 in linux-next. In practice, we are not likely to use BPF LSM and fanotify on the same hook at the same time. Instead, we can use BPF LSM hooks to gather information and use fanotify to make allow/deny decision, or vice versa. > In general fanotify is not for security. It's notifying > user space of events that already happened, so I don't see > how these two can be combined. fanotify is actually used by AntiVirus softwares. For example, CalmAV (https://www.clamav.net/) uses fanotify for its Linux version (it also supports Window and MacOS). I guess I didn't state the motivation clearly. So let me try it now. Tracing is a critical part of a security solution. With LSM, blocking an operation is straightforward. However, knowing which operation should be blocked is not always easy. Also, security hooks (LSM or fanotify) sit in the critical path of user requests. It is very important to optimize the latency of a security hook. Ideally, the tracing logic should gather all the information ahead of time, and make the actual hook fast. For example, if security_file_open() only needs to read a flag from inode local storage, the overhead is minimal and predictable. If security_file_open() has to walk the dentry tree, or call d_path(), the overhead will be much higher. fanotify_file_perm() provides another level of optimization over security_file_open(). If a file is not being monitored, fanotify will not generate the event. Security solutions hold higher bars for the tracing logic: - It needs to be accurate, as false positives and false negatives can be very annoying and/or harmful. - It needs to be efficient, as security daemons run 24/7. Given these requirements of security solutions, I believe it is important to optimize tracing logic as much as possible. And BPF based fanotify fastpath handler can bring non-trivials benefit to BPF based security solutions. fanotify also has a feature that LSM doesn't provide. When a file is accessed, user space daemon can get a fd on this file from fanotify. OTOH, LSM can only send an ino or a path to user space, which is not always reliable. I hope this starts to make sense... Thanks, Song
On Thu, Nov 14, 2024 at 9:14 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: > > > > + > > + if (bpf_is_subdir(dentry, v->dentry)) > > + ret = FAN_FP_RET_SEND_TO_USERSPACE; > > + else > > + ret = FAN_FP_RET_SKIP_EVENT; > > It seems to me that all these patches and feature additions > to fanotify, new kfuncs, etc are done just to do the above > filtering by subdir ? > > If so, just hard code this logic as an extra flag to fanotify ? > So it can filter all events by subdir. > bpf programmability makes sense when it needs to express > user space policy. Here it's just a filter by subdir. > bpf hammer doesn't look like the right tool for this use case. Good question. Speaking as someone who has made several attempts to design efficient subtree filtering in fanotify, it is not as easy as it sounds. I recently implemented a method that could be used for "practical" subdir filtering in userspace, not before Jan has questioned if we should go directly to subtree filtering with bpf [1]. This is not the only filter that was proposed for fanotify, where bpf filter came as an alternative proposal [2], but subtree filtering is by far the most wanted filter. The problem with implementing a naive is_subtree() filter in fanotify is the unbounded cost to be paid by every user for every fs access when M such filters are installed deep in the fs tree. Making this more efficient then becomes a matter of trading of memory (inode/path cache size) and performance and depends on the size and depth of the watched filesystem. This engineering decision *is* the userspace policy that can be expressed by a bpf program. As you may know, Linux is lagging behind Win and MacOS w.r.t subtree filtering for fs events. MacOS/FreeBSD took the userspace approach with fseventsd [3]. If you Google "fseventsd", you will get results with "High CPU and Memory Usage" for as far as the browser can scroll. I hope the bpf-aided early subtree filtering technology would be able to reduce some of this overhead to facilitate a better engineering solution, but that remains to be proven... Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20220228140556.ae5rhgqsyzm5djbp@quack3.lan/ [2] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ [3] https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/FSEvents_ProgGuide/TechnologyOverview/TechnologyOverview.html#//apple_ref/doc/uid/TP40005289-CH3-SW1
On Thu, Nov 14, 2024 at 11:01 PM Song Liu <songliubraving@meta.com> wrote: > > > > > I think bpf-lsm hook fires before fanotify, so bpf-lsm prog > > implementing some security policy has to decide right > > at the moment what to do with, say, security_file_open(). > > fanotify with or without bpf fastpath is too late. > > Actually, fanotify in permission mode can stop a file open. The proposed patch 1 did: +/* Return value of fp_handler */ +enum fanotify_fastpath_return { + /* The event should be sent to user space */ + FAN_FP_RET_SEND_TO_USERSPACE = 0, + /* The event should NOT be sent to user space */ + FAN_FP_RET_SKIP_EVENT = 1, +}; It looked like a read-only notification to user space where bpf prog is merely a filter. > In current upstream code, fsnotify hook fsnotify_open_perm > is actually part of security_file_open(). It will be moved > to do_dentry_open(), right after security_file_open(). This > move is done by 1cda52f1b461 in linux-next. Separating fsnotify from LSM makes sense. > In practice, we are not likely to use BPF LSM and fanotify > on the same hook at the same time. Instead, we can use > BPF LSM hooks to gather information and use fanotify to > make allow/deny decision, or vice versa. Pick one. If the proposal is changing to let fsnotify-bpf prog to deny file_open then it's a completely different discussion. In such a case make it clear upfront that fsnotify will rely on CONFIG_FANOTIFY_ACCESS_PERMISSIONS and bpf-lsm part of file access will not be used, since interaction of two callbacks at file_open makes little sense. > > In general fanotify is not for security. It's notifying > > user space of events that already happened, so I don't see > > how these two can be combined. > > fanotify is actually used by AntiVirus softwares. For > example, CalmAV (https://www.clamav.net/) uses fanotify > for its Linux version (it also supports Window and MacOS). It's relying on user space to send back FANOTIFY_PERM_EVENTS ? fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. is a pretty long path to call bpf prog and preparing a giant 'struct fanotify_fastpath_event' is not going to fast either. If we want to accelerate that with bpf it needs to be done sooner with negligible overhead. > > I guess I didn't state the motivation clearly. So let me > try it now. > > Tracing is a critical part of a security solution. With > LSM, blocking an operation is straightforward. However, > knowing which operation should be blocked is not always > easy. Also, security hooks (LSM or fanotify) sit in the > critical path of user requests. It is very important to > optimize the latency of a security hook. Ideally, the > tracing logic should gather all the information ahead > of time, and make the actual hook fast. > > For example, if security_file_open() only needs to read > a flag from inode local storage, the overhead is minimal > and predictable. If security_file_open() has to walk the > dentry tree, or call d_path(), the overhead will be > much higher. fanotify_file_perm() provides another > level of optimization over security_file_open(). If a > file is not being monitored, fanotify will not generate > the event. I agree with motivation, but don't see this in the patches. The overhead to call into bpf prog is big. Even if prog does nothing it's still going to be slower. > Security solutions hold higher bars for the tracing logic: > > - It needs to be accurate, as false positives and false > negatives can be very annoying and/or harmful. > - It needs to be efficient, as security daemons run 24/7. > > Given these requirements of security solutions, I believe > it is important to optimize tracing logic as much as > possible. And BPF based fanotify fastpath handler can > bring non-trivials benefit to BPF based security solutions. Doing everything in the kernel is certainly faster than going back and forth to user space, but bpf-lsm should be able to do the same already. Without patch 1 and only patches 4,5 that add few kfuncs, bpf-lsm prog will be able to remember subtree dentry and do the same is_subdir() to deny. The patch 7 stays pretty much as-is. All in bpf-lsm. Close to zero overhead without long chain of fsnotify callbacks. > fanotify also has a feature that LSM doesn't provide. > When a file is accessed, user space daemon can get a > fd on this file from fanotify. OTOH, LSM can only send > an ino or a path to user space, which is not always > reliable. That sounds useful, but we're mixing too many things. If user space cares about fd it will be using the existing mechanism with all accompanied overhead. fsnotify-bpf can barely accelerate anything, since user space makes ultimate decisions. If user space is not in the driving seat then existing bpf-lsm plus few kfuncs to remember dentry and call is_subdir() will do the job and no need for patch 1.
On Thu, Nov 14, 2024 at 11:26 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 9:14 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote: > > > > > > + > > > + if (bpf_is_subdir(dentry, v->dentry)) > > > + ret = FAN_FP_RET_SEND_TO_USERSPACE; > > > + else > > > + ret = FAN_FP_RET_SKIP_EVENT; > > > > It seems to me that all these patches and feature additions > > to fanotify, new kfuncs, etc are done just to do the above > > filtering by subdir ? > > > > If so, just hard code this logic as an extra flag to fanotify ? > > So it can filter all events by subdir. > > bpf programmability makes sense when it needs to express > > user space policy. Here it's just a filter by subdir. > > bpf hammer doesn't look like the right tool for this use case. > > Good question. > > Speaking as someone who has made several attempts to design > efficient subtree filtering in fanotify, it is not as easy as it sounds. > > I recently implemented a method that could be used for "practical" > subdir filtering in userspace, not before Jan has questioned if we > should go directly to subtree filtering with bpf [1]. > > This is not the only filter that was proposed for fanotify, where bpf > filter came as an alternative proposal [2], but subtree filtering is by far > the most wanted filter. Interesting :) Thanks for the links. Long threads back then. Looks like bpf filtering was proposed 2 and 4 years ago and Song's current attempt is the first real prototype that actually implements what was discussed for so long? Cool. I guess it has legs then. > The problem with implementing a naive is_subtree() filter in fanotify > is the unbounded cost to be paid by every user for every fs access > when M such filters are installed deep in the fs tree. Agree that the concern is real. I just don't see how filtering in bpf vs filtering in the kernel is going to be any different. is_subdir() is fast, so either kernel or bpf prog calling it on every file_open doesn't seem like a big deal. Are you saying that 'naive is_subdir' would call is_subdir M times for every such filter ? I guess it can be optimized. When these M filters are installed the kernel can check whether they're subdirs of each other and filter a subset of M filters with a single is_subdir() call. Same logic can be in either bpf prog or in the kernel. > Making this more efficient then becomes a matter of trading of > memory (inode/path cache size) and performance and depends > on the size and depth of the watched filesystem. > This engineering decision *is* the userspace policy that can be > expressed by a bpf program. > > As you may know, Linux is lagging behind Win and MacOS w.r.t > subtree filtering for fs events. > > MacOS/FreeBSD took the userspace approach with fseventsd [3]. > If you Google "fseventsd", you will get results with "High CPU and > Memory Usage" for as far as the browser can scroll. Yeah. Fun. Looking at your 2 year old attempt it seems the key part was to make sure inodes are not pinned for filtering purpose? How would you call is_subdir() then if parent dentry is not dget() ? Or meaning of 'pinning' is different? > I hope the bpf-aided early subtree filtering technology would be > able to reduce some of this overhead to facilitate a better engineering > solution, but that remains to be proven... Sure. Let's explore this further. > Thanks, > Amir. > > [1] https://lore.kernel.org/linux-fsdevel/20220228140556.ae5rhgqsyzm5djbp@quack3.lan/ > [2] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ > [3] https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/FSEvents_ProgGuide/TechnologyOverview/TechnologyOverview.html#//apple_ref/doc/uid/TP40005289-CH3-SW1
> On Nov 15, 2024, at 11:41 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 11:01 PM Song Liu <songliubraving@meta.com> wrote: >> >>> >>> I think bpf-lsm hook fires before fanotify, so bpf-lsm prog >>> implementing some security policy has to decide right >>> at the moment what to do with, say, security_file_open(). >>> fanotify with or without bpf fastpath is too late. >> >> Actually, fanotify in permission mode can stop a file open. > > The proposed patch 1 did: > > +/* Return value of fp_handler */ > +enum fanotify_fastpath_return { > + /* The event should be sent to user space */ > + FAN_FP_RET_SEND_TO_USERSPACE = 0, > + /* The event should NOT be sent to user space */ > + FAN_FP_RET_SKIP_EVENT = 1, > +}; > > It looked like a read-only notification to user space > where bpf prog is merely a filter. Yep. As Amir also pointed out, this part needs more work and clarifications. > >> In current upstream code, fsnotify hook fsnotify_open_perm >> is actually part of security_file_open(). It will be moved >> to do_dentry_open(), right after security_file_open(). This >> move is done by 1cda52f1b461 in linux-next. > > Separating fsnotify from LSM makes sense. > >> In practice, we are not likely to use BPF LSM and fanotify >> on the same hook at the same time. Instead, we can use >> BPF LSM hooks to gather information and use fanotify to >> make allow/deny decision, or vice versa. > > Pick one. > If the proposal is changing to let fsnotify-bpf prog to deny > file_open then it's a completely different discussion. > > In such a case make it clear upfront that fsnotify will > rely on CONFIG_FANOTIFY_ACCESS_PERMISSIONS and I am not sure whether we should limit fsnotify-bpf to only work with CONFIG_FANOTIFY_ACCESS_PERMISSIONS. I still think it can be useful just as a filter. But I guess we can start with this dependency. > bpf-lsm part of file access will not be used, > since interaction of two callbacks at file_open makes little sense. Agreed that having two hooks on file_open doesn't make sense. I was actually thinking about combining fsnotify-bpf with other LSM hooks. But I agree we should start as simple as possible. > >>> In general fanotify is not for security. It's notifying >>> user space of events that already happened, so I don't see >>> how these two can be combined. >> >> fanotify is actually used by AntiVirus softwares. For >> example, CalmAV (https://urldefense.com/v3/__https://www.clamav.net/__;!!Bt8RZUm9aw!4p7JB_E8RuNLldz_TYR07jnW5kHbr4H2FMm9vLbShKOBMznXwES7Dlt5_R6B_-HMzgV3Qk_9WKlhmjHSpYHRhTb2WM3hIg$ ) uses fanotify >> for its Linux version (it also supports Window and MacOS). > > It's relying on user space to send back FANOTIFY_PERM_EVENTS ? Yes, it goes all the way to another process, and comes back. > > fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. > > is a pretty long path to call bpf prog and > preparing a giant 'struct fanotify_fastpath_event' > is not going to fast either. > > If we want to accelerate that with bpf it needs to be done > sooner with negligible overhead. Agreed. This is actually something I have been thinking since the beginning of this work: Shall it be fanotify-bpf or fsnotify-bpf. Given we have more materials, this is a good time to have broader discussions on this. @all, please chime in whether we should redo this as fsnotify-bpf. AFAICT: Pros of fanotify-bpf: - There is existing user space that we can leverage/reuse. Pros of fsnotify-bpf: - Faster fast path. Another major pros/cons did I miss? > >> I guess I didn't state the motivation clearly. So let me >> try it now. >> >> Tracing is a critical part of a security solution. With >> LSM, blocking an operation is straightforward. However, >> knowing which operation should be blocked is not always >> easy. Also, security hooks (LSM or fanotify) sit in the >> critical path of user requests. It is very important to >> optimize the latency of a security hook. Ideally, the >> tracing logic should gather all the information ahead >> of time, and make the actual hook fast. >> >> For example, if security_file_open() only needs to read >> a flag from inode local storage, the overhead is minimal >> and predictable. If security_file_open() has to walk the >> dentry tree, or call d_path(), the overhead will be >> much higher. fanotify_file_perm() provides another >> level of optimization over security_file_open(). If a >> file is not being monitored, fanotify will not generate >> the event. > > I agree with motivation, but don't see this in the patches. Agreed. I should definitely do better job in this. > The overhead to call into bpf prog is big. > Even if prog does nothing it's still going to be slower. > >> Security solutions hold higher bars for the tracing logic: >> >> - It needs to be accurate, as false positives and false >> negatives can be very annoying and/or harmful. >> - It needs to be efficient, as security daemons run 24/7. >> >> Given these requirements of security solutions, I believe >> it is important to optimize tracing logic as much as >> possible. And BPF based fanotify fastpath handler can >> bring non-trivials benefit to BPF based security solutions. > > Doing everything in the kernel is certainly faster than > going back and forth to user space, > but bpf-lsm should be able to do the same already. > > Without patch 1 and only patches 4,5 that add few kfuncs, > bpf-lsm prog will be able to remember subtree dentry and > do the same is_subdir() to deny. > The patch 7 stays pretty much as-is. All in bpf-lsm. > Close to zero overhead without long chain of fsnotify callbacks. One of the advantages of fanotify-bpf or fsnotify-bpf is that it only calls the BPF program for being monitored files. OTOH, BPF LSM program is always global. > >> fanotify also has a feature that LSM doesn't provide. >> When a file is accessed, user space daemon can get a >> fd on this file from fanotify. OTOH, LSM can only send >> an ino or a path to user space, which is not always >> reliable. > > That sounds useful, but we're mixing too many things. > If user space cares about fd it will be using the existing > mechanism with all accompanied overhead. fsnotify-bpf can > barely accelerate anything, since user space makes > ultimate decisions. > If user space is not in the driving seat then existing bpf-lsm > plus few kfuncs to remember dentry and call is_subdir() > will do the job and no need for patch 1. In many cases, we only need the user space to look into the file when necessary. For example, when a binary is first written, the user space daemon will scan through it (for virus, etc.) and mark it as safe/dangerous in some BPF maps. Later, when the file is opened for execution, f[s|a]notify-bpf can make the allow/deny decision based on the information in BPF maps. Thanks again for your review and inputs! Song
> On Nov 15, 2024, at 1:05 PM, Song Liu <songliubraving@meta.com> wrote: [...] > >> >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. >> >> is a pretty long path to call bpf prog and >> preparing a giant 'struct fanotify_fastpath_event' >> is not going to fast either. >> >> If we want to accelerate that with bpf it needs to be done >> sooner with negligible overhead. > > Agreed. This is actually something I have been thinking > since the beginning of this work: Shall it be fanotify-bpf > or fsnotify-bpf. Given we have more materials, this is a > good time to have broader discussions on this. > > @all, please chime in whether we should redo this as > fsnotify-bpf. AFAICT: > > Pros of fanotify-bpf: > - There is existing user space that we can leverage/reuse. > > Pros of fsnotify-bpf: > - Faster fast path. > > Another major pros/cons did I miss? Adding more thoughts on this: I think it makes more sense to go with fanotify-bpf. This is because one of the benefits of fsnotify/fanotify over LSM solutions is the built-in event filtering of events. While this call chain is a bit long: fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. There are built-in filtering in fsnotify() and send_to_group(), so logics in the call chain are useful. struct fanotify_fastpath_event is indeed big. But I think we need to pass these information to the fastpath handler either way. Overall, I think current fastpath design makes sense, though there are things we need to fix (as Amir and Alexei pointed out). Please let me know comments and suggestions on this. Thanks, Song
On Mon, Nov 18, 2024 at 12:51 PM Song Liu <songliubraving@meta.com> wrote: > > > > > On Nov 15, 2024, at 1:05 PM, Song Liu <songliubraving@meta.com> wrote: > > [...] > > > >> > >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. > >> > >> is a pretty long path to call bpf prog and > >> preparing a giant 'struct fanotify_fastpath_event' > >> is not going to fast either. > >> > >> If we want to accelerate that with bpf it needs to be done > >> sooner with negligible overhead. > > > > Agreed. This is actually something I have been thinking > > since the beginning of this work: Shall it be fanotify-bpf > > or fsnotify-bpf. Given we have more materials, this is a > > good time to have broader discussions on this. > > > > @all, please chime in whether we should redo this as > > fsnotify-bpf. AFAICT: > > > > Pros of fanotify-bpf: > > - There is existing user space that we can leverage/reuse. > > > > Pros of fsnotify-bpf: > > - Faster fast path. > > > > Another major pros/cons did I miss? > > Adding more thoughts on this: I think it makes more sense to > go with fanotify-bpf. This is because one of the benefits of > fsnotify/fanotify over LSM solutions is the built-in event > filtering of events. While this call chain is a bit long: > > fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. > > There are built-in filtering in fsnotify() and > send_to_group(), so logics in the call chain are useful. fsnotify_marks based filtering happens in fsnotify. No need to do more indirect calls to get to fanotify. I would add the bpf struct_ops hook right before send_to_group or inside of it. Not sure whether fsnotify_group concept should be reused or avoided. Per inode mark/mask filter should stay. > struct fanotify_fastpath_event is indeed big. But I think > we need to pass these information to the fastpath handler > either way. Disagree. That was the old way of hooking bpf bits in. uapi/bpf.h is full of such "context" structs. xpd_md, bpf_tcp_sock, etc. They pack fields into one struct only because old style bpf has one input argument: ctx. struct_ops doesn't have this limitation. Pass things like path/dentry/inode/whatever pointers directly. No need to pack into fanotify_fastpath_event. > Overall, I think current fastpath design makes sense, > though there are things we need to fix (as Amir and Alexei > pointed out). Please let me know comments and suggestions > on this. On one side you're arguing that extra indirection for inode local storage due to inode->i_secruity is needed for performance, but on the other side you're not worried about the deep call stack of fsnotify->fanotify and argument packing which add way more overhead than i_security hop.
Hi Alexei, > On Nov 18, 2024, at 4:10 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: [...] >>> >>> Agreed. This is actually something I have been thinking >>> since the beginning of this work: Shall it be fanotify-bpf >>> or fsnotify-bpf. Given we have more materials, this is a >>> good time to have broader discussions on this. >>> >>> @all, please chime in whether we should redo this as >>> fsnotify-bpf. AFAICT: >>> >>> Pros of fanotify-bpf: >>> - There is existing user space that we can leverage/reuse. >>> >>> Pros of fsnotify-bpf: >>> - Faster fast path. >>> >>> Another major pros/cons did I miss? >> >> Adding more thoughts on this: I think it makes more sense to >> go with fanotify-bpf. This is because one of the benefits of >> fsnotify/fanotify over LSM solutions is the built-in event >> filtering of events. While this call chain is a bit long: >> >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. >> >> There are built-in filtering in fsnotify() and >> send_to_group(), so logics in the call chain are useful. > > fsnotify_marks based filtering happens in fsnotify. > No need to do more indirect calls to get to fanotify. > > I would add the bpf struct_ops hook right before send_to_group > or inside of it. > Not sure whether fsnotify_group concept should be reused > or avoided. > Per inode mark/mask filter should stay. We still need fsnotify_group. It matches each fanotify file descriptor. Moving struct_ops hook inside send_to_group does save us an indirect call. But this also means we need to introduce the fastpath concept to both fsnotify and fanotify. I personally don't really like duplications like this (see the big BUILD_BUG_ON array in fanotify_handle_event). OTOH, maybe the benefit of one fewer indirect call justifies the extra complexity. Let me think more about it. > >> struct fanotify_fastpath_event is indeed big. But I think >> we need to pass these information to the fastpath handler >> either way. > > Disagree. > That was the old way of hooking bpf bits in. > uapi/bpf.h is full of such "context" structs. > xpd_md, bpf_tcp_sock, etc. > They pack fields into one struct only because > old style bpf has one input argument: ctx. > struct_ops doesn't have this limitation. > Pass things like path/dentry/inode/whatever pointers directly. > No need to pack into fanotify_fastpath_event. OK. I am convinced on this one. I will adjust the code to remove fanotify_fastpath_event. > >> Overall, I think current fastpath design makes sense, >> though there are things we need to fix (as Amir and Alexei >> pointed out). Please let me know comments and suggestions >> on this. > > On one side you're arguing that extra indirection for > inode local storage due to inode->i_secruity is needed > for performance, The biggest issue with inode_local_storage in i_security is not the extra indirection and thus the extra latency. The bigger problem is, once we make inode local storage available to tracing programs, we will not disable it at boot time. Therefore, the extra indirection through i_security doesn't give us any memory savings. Instead, it will cause latency and memory fragmentations. IOW, we are paying the cost for no benefits at all. > but on the other side you're not worried about the deep > call stack of fsnotify->fanotify and argument packing > which add way more overhead than i_security hop. I think the difference is one indirect call. But that may worth the work. I will think more about it. Also, I would really appreciate other folks' input. Thanks again for your review! Song
On Tue, Nov 19, 2024 at 2:10 AM Song Liu <songliubraving@meta.com> wrote: > > Hi Alexei, > > > On Nov 18, 2024, at 4:10 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > [...] > >>> > >>> Agreed. This is actually something I have been thinking > >>> since the beginning of this work: Shall it be fanotify-bpf > >>> or fsnotify-bpf. Given we have more materials, this is a > >>> good time to have broader discussions on this. > >>> > >>> @all, please chime in whether we should redo this as > >>> fsnotify-bpf. AFAICT: > >>> > >>> Pros of fanotify-bpf: > >>> - There is existing user space that we can leverage/reuse. > >>> > >>> Pros of fsnotify-bpf: > >>> - Faster fast path. > >>> > >>> Another major pros/cons did I miss? Sorry, I forgot to address this earlier. First and foremost, I don't like the brand name "fast path" for this feature. The word "fast" implies that avoiding an indirect call is meaningful and I don't think this is the case here. I would rather rebrand this feature as "fanotify filter program". I am going to make a controversial argument: I anticipate that in the more common use of kernel filter for fanotify the filter will be *likely* to skip sending events to userspace, for example, when watching a small subtree in a large filesystem. In that case, the *likely* result is that a context switch to userspace is avoided, hence, the addition of one indirect call is insignificant in comparison. If we later decide that my assumption is wrong and it is important to optimize out the indirect call when skipping userspace, we could do that - it is not a problem, but for now, this discussion seems like premature optimization to me. > >> > >> Adding more thoughts on this: I think it makes more sense to > >> go with fanotify-bpf. This is because one of the benefits of > >> fsnotify/fanotify over LSM solutions is the built-in event > >> filtering of events. While this call chain is a bit long: > >> > >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event. > >> > >> There are built-in filtering in fsnotify() and > >> send_to_group(), so logics in the call chain are useful. > > > > fsnotify_marks based filtering happens in fsnotify. > > No need to do more indirect calls to get to fanotify. > > > > I would add the bpf struct_ops hook right before send_to_group > > or inside of it. > > Not sure whether fsnotify_group concept should be reused > > or avoided. > > Per inode mark/mask filter should stay. > > We still need fsnotify_group. It matches each fanotify > file descriptor. > We need the filter to be per fsnotify_group. Unlike LSM, fsnotify/fanotify is a pubsub service for many fs event consumers (a.k.a groups). The feature is a filter per event consumer, not a global filter for all event consumers. > Moving struct_ops hook inside send_to_group does save > us an indirect call. But this also means we need to > introduce the fastpath concept to both fsnotify and > fanotify. I personally don't really like duplications > like this (see the big BUILD_BUG_ON array in > fanotify_handle_event). > > OTOH, maybe the benefit of one fewer indirect call > justifies the extra complexity. Let me think more > about it. > I need to explain something about fsnotify vs. fanotify in order to argue why the feature should be "fanotify", but the bottom line is that is should not be too hard to avoid the indirect call even if the feature is introduced through fanotify API as I think that it should be. TLDR: The fsnotify_backend abstraction has become somewhat of a theater of abstraction over time, because the feature distance between fanotify backend and all the rest has grew quite large. The logic in send_to_group() is *seemingly* the generic fsnotify logic, but not really, because only fanotify has ignore masks and only fanotify has mark types (inode,mount,sb). This difference is encoded by the group->ops->handle_event() operation that only fanotify implements. All the rest of the backends implement the simpler ->handle_inode_event(). Similarly, the group->private union is always dominated by the size of group->fanotify_data, so there is no big difference if we place group->fp_hook (or ->filter_hook) outside of fanotify_data, so that we can query and call it from send_to_group() saving the indirect call to ->handle_event(). That still leaves the question if we need to call fanotify_group_event_mask() before the filter hook. fanotify_group_event_mask() does several things, but I think that the only thing relevant before the filter hook is this line: /* * Send the event depending on event flags in mark mask. */ if (!fsnotify_mask_applicable(mark->mask, ondir, type)) continue; This code is related to the two "built-in fanotify filters", namely FAN_ONDIR and FAN_EVENT_ON_CHILD. These built-in filters are so lame that they serve as a good example why a programmable filter is a better idea. For example, users need to opt-in for events on directories, but they cannot request events only on directories. Historically, the "generic" abstraction in send_to_group() has dealt with the non-generic fanotify ignore mask, but has not dealt with these non-generic fanotify built-in filters. However, since commit 31a371e419c8 ("fanotify: prepare for setting event flags in ignore mask"), send_to_group() is already aware of those fanotify built-in filters. So unless I am missing something, if we align the marks iteration loop in send_to_group() to look exactly like the marks iteration loop in fanotify_group_event_mask(), there should be no problem to call the filter hook directly, right before calling ->handle_event(). Hope this brain dump helps. Thanks, Amir.
Hi Amir, Thanks for sharing these thoughts. I will take a closer look later, as I am not awake enough to fully understand everything. > On Nov 18, 2024, at 11:59 PM, Amir Goldstein <amir73il@gmail.com> wrote: [...] >> Moving struct_ops hook inside send_to_group does save >> us an indirect call. But this also means we need to >> introduce the fastpath concept to both fsnotify and >> fanotify. I personally don't really like duplications >> like this (see the big BUILD_BUG_ON array in >> fanotify_handle_event). >> >> OTOH, maybe the benefit of one fewer indirect call >> justifies the extra complexity. Let me think more >> about it. >> > > I need to explain something about fsnotify vs. fanotify > in order to argue why the feature should be "fanotify", but the > bottom line is that is should not be too hard to avoid the indirect > call even if the feature is introduced through fanotify API as I think > that it should be. When I first looked into this, I thought about "whether there will be a use case that uses fsnotify but not fanotify". I didn't get any conclusion on this back then. But according to this thread, I think we are pretty confident that future use cases (such as FAN_PRE_ACCESS) will have a fanotify part. If this is the case, I think fanotify-bpf makes more sense. > TLDR: > The fsnotify_backend abstraction has become somewhat > of a theater of abstraction over time, because the feature > distance between fanotify backend and all the rest has grew > quite large. > > The logic in send_to_group() is *seemingly* the generic fsnotify > logic, but not really, because only fanotify has ignore masks > and only fanotify has mark types (inode,mount,sb). > > This difference is encoded by the group->ops->handle_event() > operation that only fanotify implements. > All the rest of the backends implement the simpler ->handle_inode_event(). > > Similarly, the group->private union is always dominated by the size > of group->fanotify_data, so there is no big difference if we place > group->fp_hook (or ->filter_hook) outside of fanotify_data, so that > we can query and call it from send_to_group() saving the indirect call > to ->handle_event(). > > That still leaves the question if we need to call fanotify_group_event_mask() > before the filter hook. I was trying Alexei's idea to move the API to fsnotify, and got stucked at fanotify_group_event_mask(). It appears we should always honor these built-in filters. > fanotify_group_event_mask() does several things, but I think that > the only thing relevant before the filter hook is this line: > /* > * Send the event depending on event flags in mark mask. > */ > if (!fsnotify_mask_applicable(mark->mask, ondir, type)) > continue; > > This code is related to the two "built-in fanotify filters", namely > FAN_ONDIR and FAN_EVENT_ON_CHILD. > These built-in filters are so lame that they serve as a good example > why a programmable filter is a better idea. > For example, users need to opt-in for events on directories, but they > cannot request events only on directories. > > Historically, the "generic" abstraction in send_to_group() has dealt > with the non-generic fanotify ignore mask, but has not dealt with > these non-generic fanotify built-in filters. > > However, since commit 31a371e419c8 ("fanotify: prepare for setting > event flags in ignore mask"), send_to_group() is already aware of those > fanotify built-in filters. I will continue on this tomorrow. It is time to get some sleep. :) > > So unless I am missing something, if we align the marks iteration > loop in send_to_group() to look exactly like the marks iteration loop in > fanotify_group_event_mask(), there should be no problem to call > the filter hook directly, right before calling ->handle_event(). > > Hope this brain dump helps. Thanks again for your input! Song > > Thanks, > Amir.
diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h index 2eb3483f2fb0..6ccfef9685e1 100644 --- a/tools/testing/selftests/bpf/bpf_kfuncs.h +++ b/tools/testing/selftests/bpf/bpf_kfuncs.h @@ -87,4 +87,9 @@ struct dentry; */ extern int bpf_get_dentry_xattr(struct dentry *dentry, const char *name, struct bpf_dynptr *value_ptr) __ksym __weak; + +struct fanotify_fastpath_event; +extern struct inode *bpf_fanotify_data_inode(struct fanotify_fastpath_event *event) __ksym __weak; +extern void bpf_iput(struct inode *inode) __ksym __weak; +extern bool bpf_is_subdir(struct dentry *new_dentry, struct dentry *old_dentry) __ksym __weak; #endif diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config index 4ca84c8d9116..505327f53f07 100644 --- a/tools/testing/selftests/bpf/config +++ b/tools/testing/selftests/bpf/config @@ -24,6 +24,8 @@ CONFIG_DEBUG_INFO_BTF=y CONFIG_DEBUG_INFO_DWARF4=y CONFIG_DUMMY=y CONFIG_DYNAMIC_FTRACE=y +CONFIG_FANOTIFY=y +CONFIG_FANOTIFY_FASTPATH=y CONFIG_FPROBE=y CONFIG_FTRACE_SYSCALLS=y CONFIG_FUNCTION_ERROR_INJECTION=y diff --git a/tools/testing/selftests/bpf/prog_tests/fan_fp.c b/tools/testing/selftests/bpf/prog_tests/fan_fp.c new file mode 100644 index 000000000000..92929b811282 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/fan_fp.c @@ -0,0 +1,264 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ + +#define _GNU_SOURCE +#include <err.h> +#include <fcntl.h> +#include <stdio.h> +#include <string.h> +#include <sys/fanotify.h> +#include <unistd.h> +#include <sys/ioctl.h> +#include <sys/stat.h> + +#include <test_progs.h> + +#include "fan_fp.skel.h" + +#define TEST_FS "/tmp/" +#define TEST_DIR "/tmp/fanotify_test/" + +static int create_test_subtree(void) +{ + int err; + + err = mkdir(TEST_DIR, 0777); + if (err && errno != EEXIST) + return err; + + return open(TEST_DIR, O_RDONLY); +} + +static int create_fanotify_fd(void) +{ + int fanotify_fd, err; + + fanotify_fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_NAME | FAN_REPORT_DIR_FID, + O_RDONLY); + + if (!ASSERT_OK_FD(fanotify_fd, "fanotify_init")) + return -1; + + err = fanotify_mark(fanotify_fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM, + FAN_CREATE | FAN_OPEN | FAN_ONDIR | FAN_EVENT_ON_CHILD, + AT_FDCWD, TEST_FS); + if (!ASSERT_OK(err, "fanotify_mark")) { + close(fanotify_fd); + return -1; + } + + return fanotify_fd; +} + +static int attach_global_fastpath(int fanotify_fd) +{ + struct fanotify_fastpath_args args = { + .name = "_tmp_test_sub_tree", + .version = 1, + .flags = 0, + }; + + if (ioctl(fanotify_fd, FAN_IOC_ADD_FP, &args)) + return -1; + + return 0; +} + +#define EVENT_BUFFER_SIZE 4096 +struct file_access_result { + char name_prefix[16]; + bool accessed; +} access_results[3] = { + {"aa", false}, + {"bb", false}, + {"cc", false}, +}; + +static void update_access_results(char *name) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(access_results); i++) { + if (strcmp(name, access_results[i].name_prefix) == 0) + access_results[i].accessed = true; + } +} + +static void parse_event(char *buffer, int len) +{ + struct fanotify_event_metadata *event = + (struct fanotify_event_metadata *) buffer; + struct fanotify_event_info_header *info; + struct fanotify_event_info_fid *fid; + struct file_handle *handle; + char *name; + int off; + + for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) { + for (off = sizeof(*event) ; off < event->event_len; + off += info->len) { + info = (struct fanotify_event_info_header *) + ((char *) event + off); + switch (info->info_type) { + case FAN_EVENT_INFO_TYPE_DFID_NAME: + fid = (struct fanotify_event_info_fid *) info; + handle = (struct file_handle *)&fid->handle; + name = (char *)handle + sizeof(*handle) + handle->handle_bytes; + update_access_results(name); + break; + default: + break; + } + } + } +} + +static void touch_file(const char *path) +{ + int fd; + + fd = open(path, O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666); + if (!ASSERT_OK_FD(fd, "open")) + goto cleanup; + close(fd); +cleanup: + unlink(path); +} + +static void generate_and_test_event(int fanotify_fd, struct fan_fp *skel) +{ + char buffer[EVENT_BUFFER_SIZE]; + int len, err, fd; + + /* Open the dir, so initialize_subdir_root can work */ + fd = open(TEST_DIR, O_RDONLY); + close(fd); + + if (!ASSERT_EQ(skel->bss->initialized, true, "initialized")) + goto cleanup; + + /* access /tmp/fanotify_test/aa, this will generate event */ + touch_file(TEST_DIR "aa"); + + /* create /tmp/fanotify_test/subdir, this will get tag from the + * parent directory (added in the bpf program on fsnotify_mkdir) + */ + err = mkdir(TEST_DIR "subdir", 0777); + ASSERT_OK(err, "mkdir"); + + /* access /tmp/fanotify_test/subdir/bb, this will generate event */ + touch_file(TEST_DIR "subdir/bb"); + + /* access /tmp/cc, this will NOT generate event, as the BPF + * fastpath filtered this event out. (Because /tmp doesn't have + * the tag.) + */ + touch_file(TEST_FS "cc"); + + /* read and parse the events */ + len = read(fanotify_fd, buffer, EVENT_BUFFER_SIZE); + if (!ASSERT_GE(len, 0, "read event")) + goto cleanup; + parse_event(buffer, len); + + /* verify we generated events for aa and bb, but filtered out the + * event for cc. + */ + ASSERT_TRUE(access_results[0].accessed, "access aa"); + ASSERT_TRUE(access_results[1].accessed, "access bb"); + ASSERT_FALSE(access_results[2].accessed, "access cc"); + + /* Each touch_file() generates two events: FAN_CREATE then + * FAN_OPEN. The second event will hit cache. + * open(TEST_DIR) also hit cache, as we updated it cache for + * TEST_DIR from userspace. + * Therefore, we expect 4 cache hits: aa, bb, cc, and TEST_DIR. + */ + ASSERT_EQ(skel->bss->cache_hit, 4, "cache_hit"); + +cleanup: + rmdir(TEST_DIR "subdir"); + rmdir(TEST_DIR); +} + +/* This test shows a simplified logic that monitors a subtree. This is + * simplified as it doesn't handle all the scenarios, such as: + * + * 1) moving a subsubtree into/outof the being monitoring subtree; + * 2) mount point inside the being monitored subtree + * + * Therefore, this is not to show a way to reliably monitor a subtree. + * Instead, this is to test the functionalities of bpf based fastpath. + * + * Overview of the logic: + * 1. fanotify is created for the whole file system (/tmp); + * 2. A bpf map (inode_storage_map) is used to tag directories to + * monitor (starting from /tmp/fanotify_test); + * 3. On fsnotify_mkdir, thee tag is propagated to newly created sub + * directories (/tmp/fanotify_test/subdir); + * 4. The bpf fastpath checks whether the event happens in a directory + * with the tag. If yes, the event is sent to user space; otherwise, + * the event is dropped. + */ +static void test_monitor_subtree(void) +{ + struct bpf_link *link; + struct fan_fp *skel; + int test_root_fd; + int zero = 0; + int err, fanotify_fd; + struct stat st; + + test_root_fd = create_test_subtree(); + + if (!ASSERT_OK_FD(test_root_fd, "create_test_subtree")) + return; + + err = fstat(test_root_fd, &st); + if (!ASSERT_OK(err, "fstat test_root_fd")) + goto close_test_root_fd; + + skel = fan_fp__open_and_load(); + + if (!ASSERT_OK_PTR(skel, "fan_fp__open_and_load")) + goto close_test_root_fd; + + skel->bss->root_ino = st.st_ino; + + /* Add tag to /tmp/fanotify_test/ */ + err = bpf_map_update_elem(bpf_map__fd(skel->maps.inode_storage_map), + &test_root_fd, &zero, BPF_ANY); + if (!ASSERT_OK(err, "bpf_map_update_elem")) + goto destroy_skel; + link = bpf_map__attach_struct_ops(skel->maps.bpf_fanotify_fastpath_ops); + if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops")) + goto destroy_skel; + + fanotify_fd = create_fanotify_fd(); + if (!ASSERT_OK_FD(fanotify_fd, "create_fanotify_fd")) + goto destroy_link; + + err = attach_global_fastpath(fanotify_fd); + if (!ASSERT_OK(err, "attach_global_fastpath")) + goto close_fanotify_fd; + + generate_and_test_event(fanotify_fd, skel); + +close_fanotify_fd: + close(fanotify_fd); + +destroy_link: + bpf_link__destroy(link); +destroy_skel: + fan_fp__destroy(skel); + +close_test_root_fd: + close(test_root_fd); + rmdir(TEST_DIR); +} + +void test_bpf_fanotify_fastpath(void) +{ + if (test__start_subtest("subtree")) + test_monitor_subtree(); +} diff --git a/tools/testing/selftests/bpf/progs/fan_fp.c b/tools/testing/selftests/bpf/progs/fan_fp.c new file mode 100644 index 000000000000..97e7d0b9e644 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/fan_fp.c @@ -0,0 +1,154 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */ + +#include "vmlinux.h" +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> +#include "bpf_kfuncs.h" + +struct __dentry_kptr_value { + struct dentry __kptr * dentry; +}; + +/* subdir_root map holds a single dentry pointer to the subtree root. + * This pointer is used to call bpf_is_subdir(). + */ +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, struct __dentry_kptr_value); + __uint(max_entries, 1); +} subdir_root SEC(".maps"); + +/* inode_storage_map serves as cache for bpf_is_subdir(). inode local + * storage has O(1) access time. So this is preferred over calling + * bpf_is_subdir(). + */ +struct { + __uint(type, BPF_MAP_TYPE_INODE_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, int); +} inode_storage_map SEC(".maps"); + +unsigned long root_ino; +bool initialized; + +/* This function initialize map subdir_root. The logic is a bit ungly. + * First, user space sets root_ino. Then a fanotify event is triggered. + * If the event dentry matches root_ino, we take a reference on the + * dentry and save it in subdir_root map. The reference will be freed on + * the termination of subdir_root map. + */ +static void initialize_subdir_root(struct fanotify_fastpath_event *fp_event) +{ + struct __dentry_kptr_value *v; + struct dentry *dentry, *old; + int zero = 0; + + if (initialized) + return; + + dentry = bpf_fanotify_data_dentry(fp_event); + if (!dentry) + return; + + if (dentry->d_inode->i_ino != root_ino) { + bpf_dput(dentry); + return; + } + + v = bpf_map_lookup_elem(&subdir_root, &zero); + if (!v) { + bpf_dput(dentry); + return; + } + + old = bpf_kptr_xchg(&v->dentry, dentry); + if (old) + bpf_dput(old); + initialized = true; +} + +int cache_hit; + +/* bpf_fp_handler is sleepable, as it calls bpf_dput() */ +SEC("struct_ops.s") +int BPF_PROG(bpf_fp_handler, + struct fsnotify_group *group, + struct fanotify_fastpath_hook *fp_hook, + struct fanotify_fastpath_event *fp_event) +{ + struct __dentry_kptr_value *v; + struct dentry *dentry; + int zero = 0; + int *value; + int ret; + + initialize_subdir_root(fp_event); + + /* Before the subdir_root map is initialized, send all events to + * user space. + */ + if (!initialized) + return FAN_FP_RET_SEND_TO_USERSPACE; + + dentry = bpf_fanotify_data_dentry(fp_event); + if (!dentry) + return FAN_FP_RET_SEND_TO_USERSPACE; + + /* If inode_storage_map has cached value, just return it */ + value = bpf_inode_storage_get(&inode_storage_map, dentry->d_inode, 0, 0); + if (value) { + bpf_dput(dentry); + cache_hit++; + return *value; + } + + /* Hold rcu read lock for bpf_is_subdir */ + bpf_rcu_read_lock(); + v = bpf_map_lookup_elem(&subdir_root, &zero); + if (!v || !v->dentry) { + /* This shouldn't happen, but we need this to pass + * the verifier. + */ + ret = FAN_FP_RET_SEND_TO_USERSPACE; + goto out; + } + + if (bpf_is_subdir(dentry, v->dentry)) + ret = FAN_FP_RET_SEND_TO_USERSPACE; + else + ret = FAN_FP_RET_SKIP_EVENT; +out: + bpf_rcu_read_unlock(); + + /* Save current result to the inode_storage_map */ + value = bpf_inode_storage_get(&inode_storage_map, dentry->d_inode, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (value) + *value = ret; + bpf_dput(dentry); + return ret; +} + +SEC("struct_ops") +int BPF_PROG(bpf_fp_init, struct fanotify_fastpath_hook *hook, const char *args) +{ + return 0; +} + +SEC("struct_ops") +void BPF_PROG(bpf_fp_free, struct fanotify_fastpath_hook *hook) +{ +} + +SEC(".struct_ops.link") +struct fanotify_fastpath_ops bpf_fanotify_fastpath_ops = { + .fp_handler = (void *)bpf_fp_handler, + .fp_init = (void *)bpf_fp_init, + .fp_free = (void *)bpf_fp_free, + .name = "_tmp_test_sub_tree", +}; + +char _license[] SEC("license") = "GPL";
This test shows a simplified logic that monitors a subtree. This is simplified as it doesn't handle all the scenarios, such as: 1) moving a subsubtree into/outof the being monitoring subtree; 2) mount point inside the being monitored subtree Therefore, this is not to show a way to reliably monitor a subtree. Instead, this is to test the functionalities of bpf based fastpath. To really monitor a subtree reliably, we will need more complex logic. Overview of the logic: 1. fanotify is created for the whole file system (/tmp). 2. dentry of the subtree root is saved in map subtree_root. 3. bpf_is_subdir() is used to check whether a fanotify event happens inside the subtree. Only events happened in the subtree are passed to userspace. 4. A bpf map (inode_storage_map) is used to cache result from bpf_is_subdir(). 5. subsubtree moving is not handled. This is because we don't yet have a good way to walk a subtree from BPF (something similar to d_walk). Signed-off-by: Song Liu <song@kernel.org> --- tools/testing/selftests/bpf/bpf_kfuncs.h | 5 + tools/testing/selftests/bpf/config | 2 + .../testing/selftests/bpf/prog_tests/fan_fp.c | 264 ++++++++++++++++++ tools/testing/selftests/bpf/progs/fan_fp.c | 154 ++++++++++ 4 files changed, 425 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/fan_fp.c create mode 100644 tools/testing/selftests/bpf/progs/fan_fp.c