diff mbox series

[RFC/PATCH,v2,bpf-next,fanotify,7/7] selftests/bpf: Add test for BPF based fanotify fastpath handler

Message ID 20241114084345.1564165-8-song@kernel.org (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series Fanotify fastpath handler | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/apply fail Patch does not apply to bpf-next-0
bpf/vmtest-bpf-net-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-net-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-net-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-net-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-net-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-net-VM_Test-4 fail Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-net-PR fail PR summary
bpf/vmtest-bpf-net-VM_Test-8 fail Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-net-VM_Test-7 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-net-VM_Test-9 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-net-VM_Test-6 success Logs for aarch64-gcc / test
bpf/vmtest-bpf-net-VM_Test-11 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-net-VM_Test-10 success Logs for s390x-gcc / test
bpf/vmtest-bpf-net-VM_Test-12 success Logs for set-matrix
bpf/vmtest-bpf-net-VM_Test-13 fail Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-net-VM_Test-14 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-net-VM_Test-15 success Logs for x86_64-gcc / test
bpf/vmtest-bpf-net-VM_Test-16 success Logs for x86_64-gcc / veristat
bpf/vmtest-bpf-net-VM_Test-17 fail Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-net-VM_Test-18 fail Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-net-VM_Test-19 success Logs for x86_64-llvm-17 / test
bpf/vmtest-bpf-net-VM_Test-20 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-net-VM_Test-21 fail Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-net-VM_Test-22 fail Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-net-VM_Test-23 success Logs for x86_64-llvm-18 / test
bpf/vmtest-bpf-net-VM_Test-24 success Logs for x86_64-llvm-18 / veristat

Commit Message

Song Liu Nov. 14, 2024, 8:43 a.m. UTC
This test shows a simplified logic that monitors a subtree. This is
simplified as it doesn't handle all the scenarios, such as:

  1) moving a subsubtree into/outof the being monitoring subtree;
  2) mount point inside the being monitored subtree

Therefore, this is not to show a way to reliably monitor a subtree.
Instead, this is to test the functionalities of bpf based fastpath.
To really monitor a subtree reliably, we will need more complex logic.

Overview of the logic:
  1. fanotify is created for the whole file system (/tmp).
  2. dentry of the subtree root is saved in map subtree_root.
  3. bpf_is_subdir() is used to check whether a fanotify event happens
     inside the subtree. Only events happened in the subtree are passed
     to userspace.
  4. A bpf map (inode_storage_map) is used to cache result from
     bpf_is_subdir().
  5. subsubtree moving is not handled. This is because we don't yet have
     a good way to walk a subtree from BPF (something similar to d_walk).

Signed-off-by: Song Liu <song@kernel.org>
---
 tools/testing/selftests/bpf/bpf_kfuncs.h      |   5 +
 tools/testing/selftests/bpf/config            |   2 +
 .../testing/selftests/bpf/prog_tests/fan_fp.c | 264 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/fan_fp.c    | 154 ++++++++++
 4 files changed, 425 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/fan_fp.c
 create mode 100644 tools/testing/selftests/bpf/progs/fan_fp.c

Comments

Alexei Starovoitov Nov. 14, 2024, 8:14 p.m. UTC | #1
On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
>
> +
> +       if (bpf_is_subdir(dentry, v->dentry))
> +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
> +       else
> +               ret = FAN_FP_RET_SKIP_EVENT;

It seems to me that all these patches and feature additions
to fanotify, new kfuncs, etc are done just to do the above
filtering by subdir ?

If so, just hard code this logic as an extra flag to fanotify ?
So it can filter all events by subdir.
bpf programmability makes sense when it needs to express
user space policy. Here it's just a filter by subdir.
bpf hammer doesn't look like the right tool for this use case.
Song Liu Nov. 14, 2024, 11:02 p.m. UTC | #2
> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
>> 
>> +
>> +       if (bpf_is_subdir(dentry, v->dentry))
>> +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
>> +       else
>> +               ret = FAN_FP_RET_SKIP_EVENT;
> 
> It seems to me that all these patches and feature additions
> to fanotify, new kfuncs, etc are done just to do the above
> filtering by subdir ?
> 
> If so, just hard code this logic as an extra flag to fanotify ?
> So it can filter all events by subdir.
> bpf programmability makes sense when it needs to express
> user space policy. Here it's just a filter by subdir.
> bpf hammer doesn't look like the right tool for this use case.

Current version is indeed tailored towards the subtree 
monitoring use case. This is mostly because feedback on v1 
mostly focused on this use case. V1 itself actually had some
other use cases. 

In practice, fanotify fastpath can benefit from bpf 
programmability. For example, with bpf programmability, we 
can combine fanotify and BPF LSM in some security use cases. 
If some security rules only applies to a few files, a 
directory, or a subtree, we can use fanotify to only monitor 
these files. LSM hooks, such as security_file_open(), are
always global. The overhead is higher if we are only 
interested in a few files. 

Does this make sense?

Thanks,
Song
Alexei Starovoitov Nov. 15, 2024, 12:41 a.m. UTC | #3
On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote:
>
>
>
> > On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
> >>
> >> +
> >> +       if (bpf_is_subdir(dentry, v->dentry))
> >> +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
> >> +       else
> >> +               ret = FAN_FP_RET_SKIP_EVENT;
> >
> > It seems to me that all these patches and feature additions
> > to fanotify, new kfuncs, etc are done just to do the above
> > filtering by subdir ?
> >
> > If so, just hard code this logic as an extra flag to fanotify ?
> > So it can filter all events by subdir.
> > bpf programmability makes sense when it needs to express
> > user space policy. Here it's just a filter by subdir.
> > bpf hammer doesn't look like the right tool for this use case.
>
> Current version is indeed tailored towards the subtree
> monitoring use case. This is mostly because feedback on v1
> mostly focused on this use case. V1 itself actually had some
> other use cases.

like?

> In practice, fanotify fastpath can benefit from bpf
> programmability. For example, with bpf programmability, we
> can combine fanotify and BPF LSM in some security use cases.
> If some security rules only applies to a few files, a
> directory, or a subtree, we can use fanotify to only monitor
> these files. LSM hooks, such as security_file_open(), are
> always global. The overhead is higher if we are only
> interested in a few files.
>
> Does this make sense?

Not yet.
This fanotify bpf filtering only reduces the number of events
sent to user space.
How is it supposed to interact with bpf-lsm?

Say, security policy applies to /usr/bin/*
so lsm suppose to act on all files and subdirs in there.
How fanotify helps ?
Song Liu Nov. 15, 2024, 1:10 a.m. UTC | #4
> On Nov 14, 2024, at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote:
>> 
>> 
>> 
>>> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> 
>>> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
>>>> 
>>>> +
>>>> +       if (bpf_is_subdir(dentry, v->dentry))
>>>> +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
>>>> +       else
>>>> +               ret = FAN_FP_RET_SKIP_EVENT;
>>> 
>>> It seems to me that all these patches and feature additions
>>> to fanotify, new kfuncs, etc are done just to do the above
>>> filtering by subdir ?
>>> 
>>> If so, just hard code this logic as an extra flag to fanotify ?
>>> So it can filter all events by subdir.
>>> bpf programmability makes sense when it needs to express
>>> user space policy. Here it's just a filter by subdir.
>>> bpf hammer doesn't look like the right tool for this use case.
>> 
>> Current version is indeed tailored towards the subtree
>> monitoring use case. This is mostly because feedback on v1
>> mostly focused on this use case. V1 itself actually had some
>> other use cases.
> 
> like?

samples/fanotify in v1 shows pattern that matches file prefix 
(no BPF). selftests/bpf in v1 shows a pattern where we 
propagate a flag in inode local storage from parent directory
to newly created children directory. 

> 
>> In practice, fanotify fastpath can benefit from bpf
>> programmability. For example, with bpf programmability, we
>> can combine fanotify and BPF LSM in some security use cases.
>> If some security rules only applies to a few files, a
>> directory, or a subtree, we can use fanotify to only monitor
>> these files. LSM hooks, such as security_file_open(), are
>> always global. The overhead is higher if we are only
>> interested in a few files.
>> 
>> Does this make sense?
> 
> Not yet.
> This fanotify bpf filtering only reduces the number of events
> sent to user space.
> How is it supposed to interact with bpf-lsm?

Ah, I didn't explain this part. fanotify+bpf fastpath can 
do more reducing number of events sent to user space. It can
also be used in tracing use cases. For example, we can 
implement a filetop tool that only monitors a specific 
directory, a specific device, or a specific mount point.
It can also reject some file access (fanotify permission 
mode). I should have showed these features in a sample 
and/or selftest.

> Say, security policy applies to /usr/bin/*
> so lsm suppose to act on all files and subdirs in there.
> How fanotify helps ?

LSM hooks are always global. It is up to the BPF program
to filter out irrelevant events. This filtering is 
sometimes expensive (match d_patch) and inaccurate 
(maintain a map of target inodes, etc.). OTOH, fanotify 
has built-in filtering before the BPF program triggers. 
When multiple BPF programs are monitoring open() for 
different subdirectories, fanotify based solution will 
not trigger all these BPF programs for all the open() 
in the system.

Does this answer the questions?

Thanks,
Song
Alexei Starovoitov Nov. 15, 2024, 1:31 a.m. UTC | #5
On Thu, Nov 14, 2024 at 5:10 PM Song Liu <songliubraving@meta.com> wrote:
>
>
>
> > On Nov 14, 2024, at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 3:02 PM Song Liu <songliubraving@meta.com> wrote:
> >>
> >>
> >>
> >>> On Nov 14, 2024, at 12:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
> >>>>
> >>>> +
> >>>> +       if (bpf_is_subdir(dentry, v->dentry))
> >>>> +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
> >>>> +       else
> >>>> +               ret = FAN_FP_RET_SKIP_EVENT;
> >>>
> >>> It seems to me that all these patches and feature additions
> >>> to fanotify, new kfuncs, etc are done just to do the above
> >>> filtering by subdir ?
> >>>
> >>> If so, just hard code this logic as an extra flag to fanotify ?
> >>> So it can filter all events by subdir.
> >>> bpf programmability makes sense when it needs to express
> >>> user space policy. Here it's just a filter by subdir.
> >>> bpf hammer doesn't look like the right tool for this use case.
> >>
> >> Current version is indeed tailored towards the subtree
> >> monitoring use case. This is mostly because feedback on v1
> >> mostly focused on this use case. V1 itself actually had some
> >> other use cases.
> >
> > like?
>
> samples/fanotify in v1 shows pattern that matches file prefix
> (no BPF). selftests/bpf in v1 shows a pattern where we
> propagate a flag in inode local storage from parent directory
> to newly created children directory.
>
> >
> >> In practice, fanotify fastpath can benefit from bpf
> >> programmability. For example, with bpf programmability, we
> >> can combine fanotify and BPF LSM in some security use cases.
> >> If some security rules only applies to a few files, a
> >> directory, or a subtree, we can use fanotify to only monitor
> >> these files. LSM hooks, such as security_file_open(), are
> >> always global. The overhead is higher if we are only
> >> interested in a few files.
> >>
> >> Does this make sense?
> >
> > Not yet.
> > This fanotify bpf filtering only reduces the number of events
> > sent to user space.
> > How is it supposed to interact with bpf-lsm?
>
> Ah, I didn't explain this part. fanotify+bpf fastpath can
> do more reducing number of events sent to user space. It can
> also be used in tracing use cases. For example, we can
> implement a filetop tool that only monitors a specific
> directory, a specific device, or a specific mount point.
> It can also reject some file access (fanotify permission
> mode). I should have showed these features in a sample
> and/or selftest.

I bet bpf tracing can filetop already.
Not as efficient, but tracing isn't going to be running 24/7.

>
> > Say, security policy applies to /usr/bin/*
> > so lsm suppose to act on all files and subdirs in there.
> > How fanotify helps ?
>
> LSM hooks are always global. It is up to the BPF program
> to filter out irrelevant events. This filtering is
> sometimes expensive (match d_patch) and inaccurate
> (maintain a map of target inodes, etc.). OTOH, fanotify
> has built-in filtering before the BPF program triggers.
> When multiple BPF programs are monitoring open() for
> different subdirectories, fanotify based solution will
> not trigger all these BPF programs for all the open()
> in the system.
>
> Does this answer the questions?

No. Above is too much hand waving.

I think bpf-lsm hook fires before fanotify, so bpf-lsm prog
implementing some security policy has to decide right
at the moment what to do with, say, security_file_open().
fanotify with or without bpf fastpath is too late.

In general fanotify is not for security. It's notifying
user space of events that already happened, so I don't see
how these two can be combined.
Song Liu Nov. 15, 2024, 7:01 a.m. UTC | #6
> On Nov 14, 2024, at 5:31 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

[...]

>>> 
>>> Not yet.
>>> This fanotify bpf filtering only reduces the number of events
>>> sent to user space.
>>> How is it supposed to interact with bpf-lsm?
>> 
>> Ah, I didn't explain this part. fanotify+bpf fastpath can
>> do more reducing number of events sent to user space. It can
>> also be used in tracing use cases. For example, we can
>> implement a filetop tool that only monitors a specific
>> directory, a specific device, or a specific mount point.
>> It can also reject some file access (fanotify permission
>> mode). I should have showed these features in a sample
>> and/or selftest.
> 
> I bet bpf tracing can filetop already.
> Not as efficient, but tracing isn't going to be running 24/7.
> 
>> 
>>> Say, security policy applies to /usr/bin/*
>>> so lsm suppose to act on all files and subdirs in there.
>>> How fanotify helps ?
>> 
>> LSM hooks are always global. It is up to the BPF program
>> to filter out irrelevant events. This filtering is
>> sometimes expensive (match d_patch) and inaccurate
>> (maintain a map of target inodes, etc.). OTOH, fanotify
>> has built-in filtering before the BPF program triggers.
>> When multiple BPF programs are monitoring open() for
>> different subdirectories, fanotify based solution will
>> not trigger all these BPF programs for all the open()
>> in the system.
>> 
>> Does this answer the questions?
> 
> No. Above is too much hand waving.
> 
> I think bpf-lsm hook fires before fanotify, so bpf-lsm prog
> implementing some security policy has to decide right
> at the moment what to do with, say, security_file_open().
> fanotify with or without bpf fastpath is too late.

Actually, fanotify in permission mode can stop a file open. 
In current upstream code, fsnotify hook fsnotify_open_perm
is actually part of security_file_open(). It will be moved
to do_dentry_open(), right after security_file_open(). This
move is done by 1cda52f1b461 in linux-next. 

In practice, we are not likely to use BPF LSM and fanotify 
on the same hook at the same time. Instead, we can use 
BPF LSM hooks to gather information and use fanotify to 
make allow/deny decision, or vice versa. 

> In general fanotify is not for security. It's notifying
> user space of events that already happened, so I don't see
> how these two can be combined.

fanotify is actually used by AntiVirus softwares. For 
example, CalmAV (https://www.clamav.net/) uses fanotify 
for its Linux version (it also supports Window and MacOS).


I guess I didn't state the motivation clearly. So let me
try it now. 

Tracing is a critical part of a security solution. With 
LSM, blocking an operation is straightforward. However,
knowing which operation should be blocked is not always 
easy. Also, security hooks (LSM or fanotify) sit in the 
critical path of user requests. It is very important to 
optimize the latency of a security hook. Ideally, the 
tracing logic should gather all the information ahead 
of time, and make the actual hook fast. 

For example, if security_file_open() only needs to read 
a flag from inode local storage, the overhead is minimal 
and predictable. If security_file_open() has to walk the 
dentry tree, or call d_path(), the overhead will be 
much higher. fanotify_file_perm() provides another 
level of optimization over security_file_open(). If a 
file is not being monitored, fanotify will not generate 
the event. 

Security solutions hold higher bars for the tracing logic:

- It needs to be accurate, as false positives and false
  negatives can be very annoying and/or harmful.
- It needs to be efficient, as security daemons run 24/7. 

Given these requirements of security solutions, I believe
it is important to optimize tracing logic as much as 
possible. And BPF based fanotify fastpath handler can
bring non-trivials benefit to BPF based security solutions.

fanotify also has a feature that LSM doesn't provide. 
When a file is accessed, user space daemon can get a 
fd on this file from fanotify. OTOH, LSM can only send
an ino or a path to user space, which is not always 
reliable. 

I hope this starts to make sense...

Thanks,
Song
Amir Goldstein Nov. 15, 2024, 7:26 a.m. UTC | #7
On Thu, Nov 14, 2024 at 9:14 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
> >
> > +
> > +       if (bpf_is_subdir(dentry, v->dentry))
> > +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
> > +       else
> > +               ret = FAN_FP_RET_SKIP_EVENT;
>
> It seems to me that all these patches and feature additions
> to fanotify, new kfuncs, etc are done just to do the above
> filtering by subdir ?
>
> If so, just hard code this logic as an extra flag to fanotify ?
> So it can filter all events by subdir.
> bpf programmability makes sense when it needs to express
> user space policy. Here it's just a filter by subdir.
> bpf hammer doesn't look like the right tool for this use case.

Good question.

Speaking as someone who has made several attempts to design
efficient subtree filtering in fanotify, it is not as easy as it sounds.

I recently implemented a method that could be used for "practical"
subdir filtering in userspace, not before Jan has questioned if we
should go directly to subtree filtering with bpf [1].

This is not the only filter that was proposed for fanotify, where bpf
filter came as an alternative proposal [2], but subtree filtering is by far
the most wanted filter.

The problem with implementing a naive is_subtree() filter in fanotify
is the unbounded cost to be paid by every user for every fs access
when M such filters are installed deep in the fs tree.

Making this more efficient then becomes a matter of trading of
memory (inode/path cache size) and performance and depends
on the size and depth of the watched filesystem.
This engineering decision *is* the userspace policy that can be
expressed by a bpf program.

As you may know, Linux is lagging behind Win and MacOS w.r.t
subtree filtering for fs events.

MacOS/FreeBSD took the userspace approach with fseventsd [3].
If you Google "fseventsd", you will get results with "High CPU and
Memory Usage" for as far as the browser can scroll.

I hope the bpf-aided early subtree filtering technology would be
able to reduce some of this overhead to facilitate a better engineering
solution, but that remains to be proven...

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20220228140556.ae5rhgqsyzm5djbp@quack3.lan/
[2] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/
[3] https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/FSEvents_ProgGuide/TechnologyOverview/TechnologyOverview.html#//apple_ref/doc/uid/TP40005289-CH3-SW1
Alexei Starovoitov Nov. 15, 2024, 7:41 p.m. UTC | #8
On Thu, Nov 14, 2024 at 11:01 PM Song Liu <songliubraving@meta.com> wrote:
>
> >
> > I think bpf-lsm hook fires before fanotify, so bpf-lsm prog
> > implementing some security policy has to decide right
> > at the moment what to do with, say, security_file_open().
> > fanotify with or without bpf fastpath is too late.
>
> Actually, fanotify in permission mode can stop a file open.

The proposed patch 1 did:

+/* Return value of fp_handler */
+enum fanotify_fastpath_return {
+ /* The event should be sent to user space */
+ FAN_FP_RET_SEND_TO_USERSPACE = 0,
+ /* The event should NOT be sent to user space */
+ FAN_FP_RET_SKIP_EVENT = 1,
+};

It looked like a read-only notification to user space
where bpf prog is merely a filter.

> In current upstream code, fsnotify hook fsnotify_open_perm
> is actually part of security_file_open(). It will be moved
> to do_dentry_open(), right after security_file_open(). This
> move is done by 1cda52f1b461 in linux-next.

Separating fsnotify from LSM makes sense.

> In practice, we are not likely to use BPF LSM and fanotify
> on the same hook at the same time. Instead, we can use
> BPF LSM hooks to gather information and use fanotify to
> make allow/deny decision, or vice versa.

Pick one.
If the proposal is changing to let fsnotify-bpf prog to deny
file_open then it's a completely different discussion.

In such a case make it clear upfront that fsnotify will
rely on CONFIG_FANOTIFY_ACCESS_PERMISSIONS and
bpf-lsm part of file access will not be used,
since interaction of two callbacks at file_open makes little sense.

> > In general fanotify is not for security. It's notifying
> > user space of events that already happened, so I don't see
> > how these two can be combined.
>
> fanotify is actually used by AntiVirus softwares. For
> example, CalmAV (https://www.clamav.net/) uses fanotify
> for its Linux version (it also supports Window and MacOS).

It's relying on user space to send back FANOTIFY_PERM_EVENTS ?

fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.

is a pretty long path to call bpf prog and
preparing a giant 'struct fanotify_fastpath_event'
is not going to fast either.

If we want to accelerate that with bpf it needs to be done
sooner with negligible overhead.

>
> I guess I didn't state the motivation clearly. So let me
> try it now.
>
> Tracing is a critical part of a security solution. With
> LSM, blocking an operation is straightforward. However,
> knowing which operation should be blocked is not always
> easy. Also, security hooks (LSM or fanotify) sit in the
> critical path of user requests. It is very important to
> optimize the latency of a security hook. Ideally, the
> tracing logic should gather all the information ahead
> of time, and make the actual hook fast.
>
> For example, if security_file_open() only needs to read
> a flag from inode local storage, the overhead is minimal
> and predictable. If security_file_open() has to walk the
> dentry tree, or call d_path(), the overhead will be
> much higher. fanotify_file_perm() provides another
> level of optimization over security_file_open(). If a
> file is not being monitored, fanotify will not generate
> the event.

I agree with motivation, but don't see this in the patches.
The overhead to call into bpf prog is big.
Even if prog does nothing it's still going to be slower.

> Security solutions hold higher bars for the tracing logic:
>
> - It needs to be accurate, as false positives and false
>   negatives can be very annoying and/or harmful.
> - It needs to be efficient, as security daemons run 24/7.
>
> Given these requirements of security solutions, I believe
> it is important to optimize tracing logic as much as
> possible. And BPF based fanotify fastpath handler can
> bring non-trivials benefit to BPF based security solutions.

Doing everything in the kernel is certainly faster than
going back and forth to user space,
but bpf-lsm should be able to do the same already.

Without patch 1 and only patches 4,5 that add few kfuncs,
bpf-lsm prog will be able to remember subtree dentry and
do the same is_subdir() to deny.
The patch 7 stays pretty much as-is. All in bpf-lsm.
Close to zero overhead without long chain of fsnotify callbacks.

> fanotify also has a feature that LSM doesn't provide.
> When a file is accessed, user space daemon can get a
> fd on this file from fanotify. OTOH, LSM can only send
> an ino or a path to user space, which is not always
> reliable.

That sounds useful, but we're mixing too many things.
If user space cares about fd it will be using the existing
mechanism with all accompanied overhead. fsnotify-bpf can
barely accelerate anything, since user space makes
ultimate decisions.
If user space is not in the driving seat then existing bpf-lsm
plus few kfuncs to remember dentry and call is_subdir()
will do the job and no need for patch 1.
Alexei Starovoitov Nov. 15, 2024, 8:04 p.m. UTC | #9
On Thu, Nov 14, 2024 at 11:26 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 9:14 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 12:44 AM Song Liu <song@kernel.org> wrote:
> > >
> > > +
> > > +       if (bpf_is_subdir(dentry, v->dentry))
> > > +               ret = FAN_FP_RET_SEND_TO_USERSPACE;
> > > +       else
> > > +               ret = FAN_FP_RET_SKIP_EVENT;
> >
> > It seems to me that all these patches and feature additions
> > to fanotify, new kfuncs, etc are done just to do the above
> > filtering by subdir ?
> >
> > If so, just hard code this logic as an extra flag to fanotify ?
> > So it can filter all events by subdir.
> > bpf programmability makes sense when it needs to express
> > user space policy. Here it's just a filter by subdir.
> > bpf hammer doesn't look like the right tool for this use case.
>
> Good question.
>
> Speaking as someone who has made several attempts to design
> efficient subtree filtering in fanotify, it is not as easy as it sounds.
>
> I recently implemented a method that could be used for "practical"
> subdir filtering in userspace, not before Jan has questioned if we
> should go directly to subtree filtering with bpf [1].
>
> This is not the only filter that was proposed for fanotify, where bpf
> filter came as an alternative proposal [2], but subtree filtering is by far
> the most wanted filter.

Interesting :)
Thanks for the links. Long threads back then.
Looks like bpf filtering was proposed 2 and 4 years ago and
Song's current attempt is the first real prototype that actually
implements what was discussed for so long?
Cool. I guess it has legs then.

> The problem with implementing a naive is_subtree() filter in fanotify
> is the unbounded cost to be paid by every user for every fs access
> when M such filters are installed deep in the fs tree.

Agree that the concern is real.
I just don't see how filtering in bpf vs filtering in the kernel
is going to be any different.
is_subdir() is fast, so either kernel or bpf prog calling it
on every file_open doesn't seem like a big deal.
Are you saying that 'naive is_subdir' would call is_subdir
M times for every such filter ?
I guess it can be optimized. When these M filters are installed
the kernel can check whether they're subdirs of each other
and filter a subset of M filters with a single is_subdir() call.
Same logic can be in either bpf prog or in the kernel.

> Making this more efficient then becomes a matter of trading of
> memory (inode/path cache size) and performance and depends
> on the size and depth of the watched filesystem.
> This engineering decision *is* the userspace policy that can be
> expressed by a bpf program.
>
> As you may know, Linux is lagging behind Win and MacOS w.r.t
> subtree filtering for fs events.
>
> MacOS/FreeBSD took the userspace approach with fseventsd [3].
> If you Google "fseventsd", you will get results with "High CPU and
> Memory Usage" for as far as the browser can scroll.

Yeah. Fun.

Looking at your 2 year old attempt it seems the key part was to
make sure inodes are not pinned for filtering purpose?
How would you call is_subdir() then if parent dentry is not dget() ?
Or meaning of 'pinning' is different?

> I hope the bpf-aided early subtree filtering technology would be
> able to reduce some of this overhead to facilitate a better engineering
> solution, but that remains to be proven...

Sure. Let's explore this further.

> Thanks,
> Amir.
>
> [1] https://lore.kernel.org/linux-fsdevel/20220228140556.ae5rhgqsyzm5djbp@quack3.lan/
> [2] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/
> [3] https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/FSEvents_ProgGuide/TechnologyOverview/TechnologyOverview.html#//apple_ref/doc/uid/TP40005289-CH3-SW1
Song Liu Nov. 15, 2024, 9:05 p.m. UTC | #10
> On Nov 15, 2024, at 11:41 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Thu, Nov 14, 2024 at 11:01 PM Song Liu <songliubraving@meta.com> wrote:
>> 
>>> 
>>> I think bpf-lsm hook fires before fanotify, so bpf-lsm prog
>>> implementing some security policy has to decide right
>>> at the moment what to do with, say, security_file_open().
>>> fanotify with or without bpf fastpath is too late.
>> 
>> Actually, fanotify in permission mode can stop a file open.
> 
> The proposed patch 1 did:
> 
> +/* Return value of fp_handler */
> +enum fanotify_fastpath_return {
> + /* The event should be sent to user space */
> + FAN_FP_RET_SEND_TO_USERSPACE = 0,
> + /* The event should NOT be sent to user space */
> + FAN_FP_RET_SKIP_EVENT = 1,
> +};
> 
> It looked like a read-only notification to user space
> where bpf prog is merely a filter.

Yep. As Amir also pointed out, this part needs more work and
clarifications. 

> 
>> In current upstream code, fsnotify hook fsnotify_open_perm
>> is actually part of security_file_open(). It will be moved
>> to do_dentry_open(), right after security_file_open(). This
>> move is done by 1cda52f1b461 in linux-next.
> 
> Separating fsnotify from LSM makes sense.
> 
>> In practice, we are not likely to use BPF LSM and fanotify
>> on the same hook at the same time. Instead, we can use
>> BPF LSM hooks to gather information and use fanotify to
>> make allow/deny decision, or vice versa.
> 
> Pick one.
> If the proposal is changing to let fsnotify-bpf prog to deny
> file_open then it's a completely different discussion.
> 
> In such a case make it clear upfront that fsnotify will
> rely on CONFIG_FANOTIFY_ACCESS_PERMISSIONS and

I am not sure whether we should limit fsnotify-bpf to 
only work with CONFIG_FANOTIFY_ACCESS_PERMISSIONS. I still 
think it can be useful just as a filter. But I guess we can
start with this dependency. 

> bpf-lsm part of file access will not be used,
> since interaction of two callbacks at file_open makes little sense.

Agreed that having two hooks on file_open doesn't make sense. 
I was actually thinking about combining fsnotify-bpf with 
other LSM hooks. But I agree we should start as simple as
possible. 

> 
>>> In general fanotify is not for security. It's notifying
>>> user space of events that already happened, so I don't see
>>> how these two can be combined.
>> 
>> fanotify is actually used by AntiVirus softwares. For
>> example, CalmAV (https://urldefense.com/v3/__https://www.clamav.net/__;!!Bt8RZUm9aw!4p7JB_E8RuNLldz_TYR07jnW5kHbr4H2FMm9vLbShKOBMznXwES7Dlt5_R6B_-HMzgV3Qk_9WKlhmjHSpYHRhTb2WM3hIg$ ) uses fanotify
>> for its Linux version (it also supports Window and MacOS).
> 
> It's relying on user space to send back FANOTIFY_PERM_EVENTS ?

Yes, it goes all the way to another process, and comes back. 

> 
> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
> 
> is a pretty long path to call bpf prog and
> preparing a giant 'struct fanotify_fastpath_event'
> is not going to fast either.
> 
> If we want to accelerate that with bpf it needs to be done
> sooner with negligible overhead.

Agreed. This is actually something I have been thinking 
since the beginning of this work: Shall it be fanotify-bpf 
or fsnotify-bpf. Given we have more materials, this is a 
good time to have broader discussions on this. 

@all, please chime in whether we should redo this as
fsnotify-bpf. AFAICT:

Pros of fanotify-bpf: 
- There is existing user space that we can leverage/reuse.

Pros of fsnotify-bpf: 
- Faster fast path. 

Another major pros/cons did I miss?

> 
>> I guess I didn't state the motivation clearly. So let me
>> try it now.
>> 
>> Tracing is a critical part of a security solution. With
>> LSM, blocking an operation is straightforward. However,
>> knowing which operation should be blocked is not always
>> easy. Also, security hooks (LSM or fanotify) sit in the
>> critical path of user requests. It is very important to
>> optimize the latency of a security hook. Ideally, the
>> tracing logic should gather all the information ahead
>> of time, and make the actual hook fast.
>> 
>> For example, if security_file_open() only needs to read
>> a flag from inode local storage, the overhead is minimal
>> and predictable. If security_file_open() has to walk the
>> dentry tree, or call d_path(), the overhead will be
>> much higher. fanotify_file_perm() provides another
>> level of optimization over security_file_open(). If a
>> file is not being monitored, fanotify will not generate
>> the event.
> 
> I agree with motivation, but don't see this in the patches.

Agreed. I should definitely do better job in this. 

> The overhead to call into bpf prog is big.
> Even if prog does nothing it's still going to be slower.
> 
>> Security solutions hold higher bars for the tracing logic:
>> 
>> - It needs to be accurate, as false positives and false
>>  negatives can be very annoying and/or harmful.
>> - It needs to be efficient, as security daemons run 24/7.
>> 
>> Given these requirements of security solutions, I believe
>> it is important to optimize tracing logic as much as
>> possible. And BPF based fanotify fastpath handler can
>> bring non-trivials benefit to BPF based security solutions.
> 
> Doing everything in the kernel is certainly faster than
> going back and forth to user space,
> but bpf-lsm should be able to do the same already.
> 
> Without patch 1 and only patches 4,5 that add few kfuncs,
> bpf-lsm prog will be able to remember subtree dentry and
> do the same is_subdir() to deny.
> The patch 7 stays pretty much as-is. All in bpf-lsm.
> Close to zero overhead without long chain of fsnotify callbacks.

One of the advantages of fanotify-bpf or fsnotify-bpf is 
that it only calls the BPF program for being monitored
files. OTOH, BPF LSM program is always global. 

> 
>> fanotify also has a feature that LSM doesn't provide.
>> When a file is accessed, user space daemon can get a
>> fd on this file from fanotify. OTOH, LSM can only send
>> an ino or a path to user space, which is not always
>> reliable.
> 
> That sounds useful, but we're mixing too many things.
> If user space cares about fd it will be using the existing
> mechanism with all accompanied overhead. fsnotify-bpf can
> barely accelerate anything, since user space makes
> ultimate decisions.
> If user space is not in the driving seat then existing bpf-lsm
> plus few kfuncs to remember dentry and call is_subdir()
> will do the job and no need for patch 1.

In many cases, we only need the user space to look into 
the file when necessary. For example, when a binary is 
first written, the user space daemon will scan through
it (for virus, etc.) and mark it as safe/dangerous in 
some BPF maps. Later, when the file is opened for 
execution, f[s|a]notify-bpf can make the allow/deny
decision based on the information in BPF maps. 

Thanks again for your review and inputs!

Song
Song Liu Nov. 18, 2024, 8:51 p.m. UTC | #11
> On Nov 15, 2024, at 1:05 PM, Song Liu <songliubraving@meta.com> wrote:

[...]
> 
>> 
>> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
>> 
>> is a pretty long path to call bpf prog and
>> preparing a giant 'struct fanotify_fastpath_event'
>> is not going to fast either.
>> 
>> If we want to accelerate that with bpf it needs to be done
>> sooner with negligible overhead.
> 
> Agreed. This is actually something I have been thinking 
> since the beginning of this work: Shall it be fanotify-bpf 
> or fsnotify-bpf. Given we have more materials, this is a 
> good time to have broader discussions on this. 
> 
> @all, please chime in whether we should redo this as
> fsnotify-bpf. AFAICT:
> 
> Pros of fanotify-bpf: 
> - There is existing user space that we can leverage/reuse.
> 
> Pros of fsnotify-bpf: 
> - Faster fast path. 
> 
> Another major pros/cons did I miss?

Adding more thoughts on this: I think it makes more sense to
go with fanotify-bpf. This is because one of the benefits of
fsnotify/fanotify over LSM solutions is the built-in event
filtering of events. While this call chain is a bit long:

fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.

There are built-in filtering in fsnotify() and 
send_to_group(), so logics in the call chain are useful. 

struct fanotify_fastpath_event is indeed big. But I think
we need to pass these information to the fastpath handler
either way. 


Overall, I think current fastpath design makes sense, 
though there are things we need to fix (as Amir and Alexei
pointed out). Please let me know comments and suggestions
on this. 

Thanks,
Song
Alexei Starovoitov Nov. 19, 2024, 12:10 a.m. UTC | #12
On Mon, Nov 18, 2024 at 12:51 PM Song Liu <songliubraving@meta.com> wrote:
>
>
>
> > On Nov 15, 2024, at 1:05 PM, Song Liu <songliubraving@meta.com> wrote:
>
> [...]
> >
> >>
> >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
> >>
> >> is a pretty long path to call bpf prog and
> >> preparing a giant 'struct fanotify_fastpath_event'
> >> is not going to fast either.
> >>
> >> If we want to accelerate that with bpf it needs to be done
> >> sooner with negligible overhead.
> >
> > Agreed. This is actually something I have been thinking
> > since the beginning of this work: Shall it be fanotify-bpf
> > or fsnotify-bpf. Given we have more materials, this is a
> > good time to have broader discussions on this.
> >
> > @all, please chime in whether we should redo this as
> > fsnotify-bpf. AFAICT:
> >
> > Pros of fanotify-bpf:
> > - There is existing user space that we can leverage/reuse.
> >
> > Pros of fsnotify-bpf:
> > - Faster fast path.
> >
> > Another major pros/cons did I miss?
>
> Adding more thoughts on this: I think it makes more sense to
> go with fanotify-bpf. This is because one of the benefits of
> fsnotify/fanotify over LSM solutions is the built-in event
> filtering of events. While this call chain is a bit long:
>
> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
>
> There are built-in filtering in fsnotify() and
> send_to_group(), so logics in the call chain are useful.

fsnotify_marks based filtering happens in fsnotify.
No need to do more indirect calls to get to fanotify.

I would add the bpf struct_ops hook right before send_to_group
or inside of it.
Not sure whether fsnotify_group concept should be reused
or avoided.
Per inode mark/mask filter should stay.

> struct fanotify_fastpath_event is indeed big. But I think
> we need to pass these information to the fastpath handler
> either way.

Disagree.
That was the old way of hooking bpf bits in.
uapi/bpf.h is full of such "context" structs.
xpd_md, bpf_tcp_sock, etc.
They pack fields into one struct only because
old style bpf has one input argument: ctx.
struct_ops doesn't have this limitation.
Pass things like path/dentry/inode/whatever pointers directly.
No need to pack into fanotify_fastpath_event.

> Overall, I think current fastpath design makes sense,
> though there are things we need to fix (as Amir and Alexei
> pointed out). Please let me know comments and suggestions
> on this.

On one side you're arguing that extra indirection for
inode local storage due to inode->i_secruity is needed
for performance,
but on the other side you're not worried about the deep
call stack of fsnotify->fanotify and argument packing
which add way more overhead than i_security hop.
Song Liu Nov. 19, 2024, 1:10 a.m. UTC | #13
Hi Alexei,

> On Nov 18, 2024, at 4:10 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
[...]
>>> 
>>> Agreed. This is actually something I have been thinking
>>> since the beginning of this work: Shall it be fanotify-bpf
>>> or fsnotify-bpf. Given we have more materials, this is a
>>> good time to have broader discussions on this.
>>> 
>>> @all, please chime in whether we should redo this as
>>> fsnotify-bpf. AFAICT:
>>> 
>>> Pros of fanotify-bpf:
>>> - There is existing user space that we can leverage/reuse.
>>> 
>>> Pros of fsnotify-bpf:
>>> - Faster fast path.
>>> 
>>> Another major pros/cons did I miss?
>> 
>> Adding more thoughts on this: I think it makes more sense to
>> go with fanotify-bpf. This is because one of the benefits of
>> fsnotify/fanotify over LSM solutions is the built-in event
>> filtering of events. While this call chain is a bit long:
>> 
>> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
>> 
>> There are built-in filtering in fsnotify() and
>> send_to_group(), so logics in the call chain are useful.
> 
> fsnotify_marks based filtering happens in fsnotify.
> No need to do more indirect calls to get to fanotify.
> 
> I would add the bpf struct_ops hook right before send_to_group
> or inside of it.
> Not sure whether fsnotify_group concept should be reused
> or avoided.
> Per inode mark/mask filter should stay.

We still need fsnotify_group. It matches each fanotify 
file descriptor. 

Moving struct_ops hook inside send_to_group does save
us an indirect call. But this also means we need to 
introduce the fastpath concept to both fsnotify and 
fanotify. I personally don't really like duplications
like this (see the big BUILD_BUG_ON array in 
fanotify_handle_event). 

OTOH, maybe the benefit of one fewer indirect call
justifies the extra complexity. Let me think more 
about it. 

> 
>> struct fanotify_fastpath_event is indeed big. But I think
>> we need to pass these information to the fastpath handler
>> either way.
> 
> Disagree.
> That was the old way of hooking bpf bits in.
> uapi/bpf.h is full of such "context" structs.
> xpd_md, bpf_tcp_sock, etc.
> They pack fields into one struct only because
> old style bpf has one input argument: ctx.
> struct_ops doesn't have this limitation.
> Pass things like path/dentry/inode/whatever pointers directly.
> No need to pack into fanotify_fastpath_event.

OK. I am convinced on this one. I will adjust the
code to remove fanotify_fastpath_event. 

> 
>> Overall, I think current fastpath design makes sense,
>> though there are things we need to fix (as Amir and Alexei
>> pointed out). Please let me know comments and suggestions
>> on this.
> 
> On one side you're arguing that extra indirection for
> inode local storage due to inode->i_secruity is needed
> for performance,

The biggest issue with inode_local_storage in i_security 
is not the extra indirection and thus the extra latency. 
The bigger problem is, once we make inode local storage 
available to tracing programs, we will not disable it
at boot time. Therefore, the extra indirection through
i_security doesn't give us any memory savings. Instead, 
it will cause latency and memory fragmentations. IOW, 
we are paying the cost for no benefits at all.

> but on the other side you're not worried about the deep
> call stack of fsnotify->fanotify and argument packing
> which add way more overhead than i_security hop.

I think the difference is one indirect call. But that 
may worth the work. I will think more about it. Also, 
I would really appreciate other folks' input. 

Thanks again for your review!
Song
Amir Goldstein Nov. 19, 2024, 7:59 a.m. UTC | #14
On Tue, Nov 19, 2024 at 2:10 AM Song Liu <songliubraving@meta.com> wrote:
>
> Hi Alexei,
>
> > On Nov 18, 2024, at 4:10 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> [...]
> >>>
> >>> Agreed. This is actually something I have been thinking
> >>> since the beginning of this work: Shall it be fanotify-bpf
> >>> or fsnotify-bpf. Given we have more materials, this is a
> >>> good time to have broader discussions on this.
> >>>
> >>> @all, please chime in whether we should redo this as
> >>> fsnotify-bpf. AFAICT:
> >>>
> >>> Pros of fanotify-bpf:
> >>> - There is existing user space that we can leverage/reuse.
> >>>
> >>> Pros of fsnotify-bpf:
> >>> - Faster fast path.
> >>>
> >>> Another major pros/cons did I miss?

Sorry, I forgot to address this earlier.

First and foremost, I don't like the brand name "fast path" for this
feature.  The word "fast" implies that avoiding an indirect call is
meaningful and I don't think this is the case here.

I would rather rebrand this feature as "fanotify filter program".
I am going to make a controversial argument:
I anticipate that in the more common use of kernel filter for fanotify
the filter will be *likely* to skip sending events to userspace,
for example, when watching a small subtree in a large filesystem.

In that case, the *likely* result is that a context switch to userspace
is avoided, hence, the addition of one indirect call is insignificant
in comparison.

If we later decide that my assumption is wrong and it is important
to optimize out the indirect call when skipping userspace, we could
do that - it is not a problem, but for now, this discussion seems
like premature optimization to me.

> >>
> >> Adding more thoughts on this: I think it makes more sense to
> >> go with fanotify-bpf. This is because one of the benefits of
> >> fsnotify/fanotify over LSM solutions is the built-in event
> >> filtering of events. While this call chain is a bit long:
> >>
> >> fsnotify_open_perm->fsnotify->send_to_group->fanotify_handle_event.
> >>
> >> There are built-in filtering in fsnotify() and
> >> send_to_group(), so logics in the call chain are useful.
> >
> > fsnotify_marks based filtering happens in fsnotify.
> > No need to do more indirect calls to get to fanotify.
> >
> > I would add the bpf struct_ops hook right before send_to_group
> > or inside of it.
> > Not sure whether fsnotify_group concept should be reused
> > or avoided.
> > Per inode mark/mask filter should stay.
>
> We still need fsnotify_group. It matches each fanotify
> file descriptor.
>

We need the filter to be per fsnotify_group.
Unlike LSM, fsnotify/fanotify is a pubsub service for many
fs event consumers (a.k.a groups).
The feature is a filter per event consumer, not a global filter
for all event consumers.

> Moving struct_ops hook inside send_to_group does save
> us an indirect call. But this also means we need to
> introduce the fastpath concept to both fsnotify and
> fanotify. I personally don't really like duplications
> like this (see the big BUILD_BUG_ON array in
> fanotify_handle_event).
>
> OTOH, maybe the benefit of one fewer indirect call
> justifies the extra complexity. Let me think more
> about it.
>

I need to explain something about fsnotify vs. fanotify
in order to argue why the feature should be "fanotify", but the
bottom line is that is should not be too hard to avoid the indirect
call even if the feature is introduced through fanotify API as I think
that it should be.

TLDR:
The fsnotify_backend abstraction has become somewhat
of a theater of abstraction over time, because the feature
distance between fanotify backend and all the rest has grew
quite large.

The logic in send_to_group() is *seemingly* the generic fsnotify
logic, but not really, because only fanotify has ignore masks
and only fanotify has mark types (inode,mount,sb).

This difference is encoded by the group->ops->handle_event()
operation that only fanotify implements.
All the rest of the backends implement the simpler ->handle_inode_event().

Similarly, the group->private union is always dominated by the size
of group->fanotify_data, so there is no big difference if we place
group->fp_hook (or ->filter_hook) outside of fanotify_data, so that
we can query and call it from send_to_group() saving the indirect call
to ->handle_event().

That still leaves the question if we need to call fanotify_group_event_mask()
before the filter hook.

fanotify_group_event_mask() does several things, but I think that
the only thing relevant before the filter hook is this line:
                /*
                 * Send the event depending on event flags in mark mask.
                 */
                if (!fsnotify_mask_applicable(mark->mask, ondir, type))
                        continue;

This code is related to the two "built-in fanotify filters", namely
FAN_ONDIR and FAN_EVENT_ON_CHILD.
These built-in filters are so lame that they serve as a good example
why a programmable filter is a better idea.
For example, users need to opt-in for events on directories, but they
cannot request events only on directories.

Historically, the "generic" abstraction in send_to_group() has dealt
with the non-generic fanotify ignore mask, but has not dealt with
these non-generic fanotify built-in filters.

However, since commit 31a371e419c8 ("fanotify: prepare for setting
event flags in ignore mask"), send_to_group() is already aware of those
fanotify built-in filters.

So unless I am missing something, if we align the marks iteration
loop in send_to_group() to look exactly like the marks iteration loop in
fanotify_group_event_mask(), there should be no problem to call
the filter hook directly, right before calling ->handle_event().

Hope this brain dump helps.

Thanks,
Amir.
Song Liu Nov. 19, 2024, 8:35 a.m. UTC | #15
Hi Amir, 

Thanks for sharing these thoughts. I will take a closer look 
later, as I am not awake enough to fully understand everything.

> On Nov 18, 2024, at 11:59 PM, Amir Goldstein <amir73il@gmail.com> wrote:

[...]

>> Moving struct_ops hook inside send_to_group does save
>> us an indirect call. But this also means we need to
>> introduce the fastpath concept to both fsnotify and
>> fanotify. I personally don't really like duplications
>> like this (see the big BUILD_BUG_ON array in
>> fanotify_handle_event).
>> 
>> OTOH, maybe the benefit of one fewer indirect call
>> justifies the extra complexity. Let me think more
>> about it.
>> 
> 
> I need to explain something about fsnotify vs. fanotify
> in order to argue why the feature should be "fanotify", but the
> bottom line is that is should not be too hard to avoid the indirect
> call even if the feature is introduced through fanotify API as I think
> that it should be.

When I first looked into this, I thought about "whether 
there will be a use case that uses fsnotify but not fanotify". 
I didn't get any conclusion on this back then. But according 
to this thread, I think we are pretty confident that future 
use cases (such as FAN_PRE_ACCESS) will have a fanotify part. 
If this is the case, I think fanotify-bpf makes more sense. 

> TLDR:
> The fsnotify_backend abstraction has become somewhat
> of a theater of abstraction over time, because the feature
> distance between fanotify backend and all the rest has grew
> quite large.
> 
> The logic in send_to_group() is *seemingly* the generic fsnotify
> logic, but not really, because only fanotify has ignore masks
> and only fanotify has mark types (inode,mount,sb).
> 
> This difference is encoded by the group->ops->handle_event()
> operation that only fanotify implements.
> All the rest of the backends implement the simpler ->handle_inode_event().
> 
> Similarly, the group->private union is always dominated by the size
> of group->fanotify_data, so there is no big difference if we place
> group->fp_hook (or ->filter_hook) outside of fanotify_data, so that
> we can query and call it from send_to_group() saving the indirect call
> to ->handle_event().
> 
> That still leaves the question if we need to call fanotify_group_event_mask()
> before the filter hook.

I was trying Alexei's idea to move the API to fsnotify, and got 
stucked at fanotify_group_event_mask(). It appears we should
always honor these built-in filters. 

> fanotify_group_event_mask() does several things, but I think that
> the only thing relevant before the filter hook is this line:
>                /*
>                 * Send the event depending on event flags in mark mask.
>                 */
>                if (!fsnotify_mask_applicable(mark->mask, ondir, type))
>                        continue;
> 
> This code is related to the two "built-in fanotify filters", namely
> FAN_ONDIR and FAN_EVENT_ON_CHILD.
> These built-in filters are so lame that they serve as a good example
> why a programmable filter is a better idea.
> For example, users need to opt-in for events on directories, but they
> cannot request events only on directories.
> 
> Historically, the "generic" abstraction in send_to_group() has dealt
> with the non-generic fanotify ignore mask, but has not dealt with
> these non-generic fanotify built-in filters.
> 
> However, since commit 31a371e419c8 ("fanotify: prepare for setting
> event flags in ignore mask"), send_to_group() is already aware of those
> fanotify built-in filters.

I will continue on this tomorrow. It is time to get some sleep. :)

> 
> So unless I am missing something, if we align the marks iteration
> loop in send_to_group() to look exactly like the marks iteration loop in
> fanotify_group_event_mask(), there should be no problem to call
> the filter hook directly, right before calling ->handle_event().
> 
> Hope this brain dump helps.

Thanks again for your input!

Song

> 
> Thanks,
> Amir.
diff mbox series

Patch

diff --git a/tools/testing/selftests/bpf/bpf_kfuncs.h b/tools/testing/selftests/bpf/bpf_kfuncs.h
index 2eb3483f2fb0..6ccfef9685e1 100644
--- a/tools/testing/selftests/bpf/bpf_kfuncs.h
+++ b/tools/testing/selftests/bpf/bpf_kfuncs.h
@@ -87,4 +87,9 @@  struct dentry;
  */
 extern int bpf_get_dentry_xattr(struct dentry *dentry, const char *name,
 			      struct bpf_dynptr *value_ptr) __ksym __weak;
+
+struct fanotify_fastpath_event;
+extern struct inode *bpf_fanotify_data_inode(struct fanotify_fastpath_event *event) __ksym __weak;
+extern void bpf_iput(struct inode *inode) __ksym __weak;
+extern bool bpf_is_subdir(struct dentry *new_dentry, struct dentry *old_dentry) __ksym __weak;
 #endif
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 4ca84c8d9116..505327f53f07 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -24,6 +24,8 @@  CONFIG_DEBUG_INFO_BTF=y
 CONFIG_DEBUG_INFO_DWARF4=y
 CONFIG_DUMMY=y
 CONFIG_DYNAMIC_FTRACE=y
+CONFIG_FANOTIFY=y
+CONFIG_FANOTIFY_FASTPATH=y
 CONFIG_FPROBE=y
 CONFIG_FTRACE_SYSCALLS=y
 CONFIG_FUNCTION_ERROR_INJECTION=y
diff --git a/tools/testing/selftests/bpf/prog_tests/fan_fp.c b/tools/testing/selftests/bpf/prog_tests/fan_fp.c
new file mode 100644
index 000000000000..92929b811282
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/fan_fp.c
@@ -0,0 +1,264 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/fanotify.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+
+#include <test_progs.h>
+
+#include "fan_fp.skel.h"
+
+#define TEST_FS "/tmp/"
+#define TEST_DIR "/tmp/fanotify_test/"
+
+static int create_test_subtree(void)
+{
+	int err;
+
+	err = mkdir(TEST_DIR, 0777);
+	if (err && errno != EEXIST)
+		return err;
+
+	return open(TEST_DIR, O_RDONLY);
+}
+
+static int create_fanotify_fd(void)
+{
+	int fanotify_fd, err;
+
+	fanotify_fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_NAME | FAN_REPORT_DIR_FID,
+				    O_RDONLY);
+
+	if (!ASSERT_OK_FD(fanotify_fd, "fanotify_init"))
+		return -1;
+
+	err = fanotify_mark(fanotify_fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
+			    FAN_CREATE | FAN_OPEN | FAN_ONDIR | FAN_EVENT_ON_CHILD,
+			    AT_FDCWD, TEST_FS);
+	if (!ASSERT_OK(err, "fanotify_mark")) {
+		close(fanotify_fd);
+		return -1;
+	}
+
+	return fanotify_fd;
+}
+
+static int attach_global_fastpath(int fanotify_fd)
+{
+	struct fanotify_fastpath_args args = {
+		.name = "_tmp_test_sub_tree",
+		.version = 1,
+		.flags = 0,
+	};
+
+	if (ioctl(fanotify_fd, FAN_IOC_ADD_FP, &args))
+		return -1;
+
+	return 0;
+}
+
+#define EVENT_BUFFER_SIZE 4096
+struct file_access_result {
+	char name_prefix[16];
+	bool accessed;
+} access_results[3] = {
+	{"aa", false},
+	{"bb", false},
+	{"cc", false},
+};
+
+static void update_access_results(char *name)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(access_results); i++) {
+		if (strcmp(name, access_results[i].name_prefix) == 0)
+			access_results[i].accessed = true;
+	}
+}
+
+static void parse_event(char *buffer, int len)
+{
+	struct fanotify_event_metadata *event =
+		(struct fanotify_event_metadata *) buffer;
+	struct fanotify_event_info_header *info;
+	struct fanotify_event_info_fid *fid;
+	struct file_handle *handle;
+	char *name;
+	int off;
+
+	for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
+		for (off = sizeof(*event) ; off < event->event_len;
+		     off += info->len) {
+			info = (struct fanotify_event_info_header *)
+				((char *) event + off);
+			switch (info->info_type) {
+			case FAN_EVENT_INFO_TYPE_DFID_NAME:
+				fid = (struct fanotify_event_info_fid *) info;
+				handle = (struct file_handle *)&fid->handle;
+				name = (char *)handle + sizeof(*handle) + handle->handle_bytes;
+				update_access_results(name);
+				break;
+			default:
+				break;
+			}
+		}
+	}
+}
+
+static void touch_file(const char *path)
+{
+	int fd;
+
+	fd = open(path, O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666);
+	if (!ASSERT_OK_FD(fd, "open"))
+		goto cleanup;
+	close(fd);
+cleanup:
+	unlink(path);
+}
+
+static void generate_and_test_event(int fanotify_fd, struct fan_fp *skel)
+{
+	char buffer[EVENT_BUFFER_SIZE];
+	int len, err, fd;
+
+	/* Open the dir, so initialize_subdir_root can work */
+	fd = open(TEST_DIR, O_RDONLY);
+	close(fd);
+
+	if (!ASSERT_EQ(skel->bss->initialized, true, "initialized"))
+		goto cleanup;
+
+	/* access /tmp/fanotify_test/aa, this will generate event */
+	touch_file(TEST_DIR "aa");
+
+	/* create /tmp/fanotify_test/subdir, this will get tag from the
+	 * parent directory (added in the bpf program on fsnotify_mkdir)
+	 */
+	err = mkdir(TEST_DIR "subdir", 0777);
+	ASSERT_OK(err, "mkdir");
+
+	/* access /tmp/fanotify_test/subdir/bb, this will generate event */
+	touch_file(TEST_DIR "subdir/bb");
+
+	/* access /tmp/cc, this will NOT generate event, as the BPF
+	 * fastpath filtered this event out. (Because /tmp doesn't have
+	 * the tag.)
+	 */
+	touch_file(TEST_FS "cc");
+
+	/* read and parse the events */
+	len = read(fanotify_fd, buffer, EVENT_BUFFER_SIZE);
+	if (!ASSERT_GE(len, 0, "read event"))
+		goto cleanup;
+	parse_event(buffer, len);
+
+	/* verify we generated events for aa and bb, but filtered out the
+	 * event for cc.
+	 */
+	ASSERT_TRUE(access_results[0].accessed, "access aa");
+	ASSERT_TRUE(access_results[1].accessed, "access bb");
+	ASSERT_FALSE(access_results[2].accessed, "access cc");
+
+	/* Each touch_file() generates two events: FAN_CREATE then
+	 * FAN_OPEN. The second event will hit cache.
+	 * open(TEST_DIR) also hit cache, as we updated it cache for
+	 * TEST_DIR from userspace.
+	 * Therefore, we expect 4 cache hits: aa, bb, cc, and TEST_DIR.
+	 */
+	ASSERT_EQ(skel->bss->cache_hit, 4, "cache_hit");
+
+cleanup:
+	rmdir(TEST_DIR "subdir");
+	rmdir(TEST_DIR);
+}
+
+/* This test shows a simplified logic that monitors a subtree. This is
+ * simplified as it doesn't handle all the scenarios, such as:
+ *
+ *  1) moving a subsubtree into/outof the being monitoring subtree;
+ *  2) mount point inside the being monitored subtree
+ *
+ * Therefore, this is not to show a way to reliably monitor a subtree.
+ * Instead, this is to test the functionalities of bpf based fastpath.
+ *
+ * Overview of the logic:
+ * 1. fanotify is created for the whole file system (/tmp);
+ * 2. A bpf map (inode_storage_map) is used to tag directories to
+ *    monitor (starting from /tmp/fanotify_test);
+ * 3. On fsnotify_mkdir, thee tag is propagated to newly created sub
+ *    directories (/tmp/fanotify_test/subdir);
+ * 4. The bpf fastpath checks whether the event happens in a directory
+ *    with the tag. If yes, the event is sent to user space; otherwise,
+ *    the event is dropped.
+ */
+static void test_monitor_subtree(void)
+{
+	struct bpf_link *link;
+	struct fan_fp *skel;
+	int test_root_fd;
+	int zero = 0;
+	int err, fanotify_fd;
+	struct stat st;
+
+	test_root_fd = create_test_subtree();
+
+	if (!ASSERT_OK_FD(test_root_fd, "create_test_subtree"))
+		return;
+
+	err = fstat(test_root_fd, &st);
+	if (!ASSERT_OK(err, "fstat test_root_fd"))
+		goto close_test_root_fd;
+
+	skel = fan_fp__open_and_load();
+
+	if (!ASSERT_OK_PTR(skel, "fan_fp__open_and_load"))
+		goto close_test_root_fd;
+
+	skel->bss->root_ino = st.st_ino;
+
+	/* Add tag to /tmp/fanotify_test/ */
+	err = bpf_map_update_elem(bpf_map__fd(skel->maps.inode_storage_map),
+				  &test_root_fd, &zero, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto destroy_skel;
+	link = bpf_map__attach_struct_ops(skel->maps.bpf_fanotify_fastpath_ops);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops"))
+		goto destroy_skel;
+
+	fanotify_fd = create_fanotify_fd();
+	if (!ASSERT_OK_FD(fanotify_fd, "create_fanotify_fd"))
+		goto destroy_link;
+
+	err = attach_global_fastpath(fanotify_fd);
+	if (!ASSERT_OK(err, "attach_global_fastpath"))
+		goto close_fanotify_fd;
+
+	generate_and_test_event(fanotify_fd, skel);
+
+close_fanotify_fd:
+	close(fanotify_fd);
+
+destroy_link:
+	bpf_link__destroy(link);
+destroy_skel:
+	fan_fp__destroy(skel);
+
+close_test_root_fd:
+	close(test_root_fd);
+	rmdir(TEST_DIR);
+}
+
+void test_bpf_fanotify_fastpath(void)
+{
+	if (test__start_subtest("subtree"))
+		test_monitor_subtree();
+}
diff --git a/tools/testing/selftests/bpf/progs/fan_fp.c b/tools/testing/selftests/bpf/progs/fan_fp.c
new file mode 100644
index 000000000000..97e7d0b9e644
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/fan_fp.c
@@ -0,0 +1,154 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_kfuncs.h"
+
+struct __dentry_kptr_value {
+	struct dentry __kptr * dentry;
+};
+
+/* subdir_root map holds a single dentry pointer to the subtree root.
+ * This pointer is used to call bpf_is_subdir().
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct __dentry_kptr_value);
+	__uint(max_entries, 1);
+} subdir_root SEC(".maps");
+
+/* inode_storage_map serves as cache for bpf_is_subdir(). inode local
+ * storage has O(1) access time. So this is preferred over calling
+ * bpf_is_subdir().
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_INODE_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, int);
+} inode_storage_map SEC(".maps");
+
+unsigned long root_ino;
+bool initialized;
+
+/* This function initialize map subdir_root. The logic is a bit ungly.
+ * First, user space sets root_ino. Then a fanotify event is triggered.
+ * If the event dentry matches root_ino, we take a reference on the
+ * dentry and save it in subdir_root map. The reference will be freed on
+ * the termination of subdir_root map.
+ */
+static void initialize_subdir_root(struct fanotify_fastpath_event *fp_event)
+{
+	struct __dentry_kptr_value *v;
+	struct dentry *dentry, *old;
+	int zero = 0;
+
+	if (initialized)
+		return;
+
+	dentry = bpf_fanotify_data_dentry(fp_event);
+	if (!dentry)
+		return;
+
+	if (dentry->d_inode->i_ino != root_ino) {
+		bpf_dput(dentry);
+		return;
+	}
+
+	v = bpf_map_lookup_elem(&subdir_root, &zero);
+	if (!v) {
+		bpf_dput(dentry);
+		return;
+	}
+
+	old = bpf_kptr_xchg(&v->dentry, dentry);
+	if (old)
+		bpf_dput(old);
+	initialized = true;
+}
+
+int cache_hit;
+
+/* bpf_fp_handler is sleepable, as it calls bpf_dput() */
+SEC("struct_ops.s")
+int BPF_PROG(bpf_fp_handler,
+	     struct fsnotify_group *group,
+	     struct fanotify_fastpath_hook *fp_hook,
+	     struct fanotify_fastpath_event *fp_event)
+{
+	struct __dentry_kptr_value *v;
+	struct dentry *dentry;
+	int zero = 0;
+	int *value;
+	int ret;
+
+	initialize_subdir_root(fp_event);
+
+	/* Before the subdir_root map is initialized, send all events to
+	 * user space.
+	 */
+	if (!initialized)
+		return FAN_FP_RET_SEND_TO_USERSPACE;
+
+	dentry = bpf_fanotify_data_dentry(fp_event);
+	if (!dentry)
+		return FAN_FP_RET_SEND_TO_USERSPACE;
+
+	/* If inode_storage_map has cached value, just return it */
+	value = bpf_inode_storage_get(&inode_storage_map, dentry->d_inode, 0, 0);
+	if (value) {
+		bpf_dput(dentry);
+		cache_hit++;
+		return *value;
+	}
+
+	/* Hold rcu read lock for bpf_is_subdir */
+	bpf_rcu_read_lock();
+	v = bpf_map_lookup_elem(&subdir_root, &zero);
+	if (!v || !v->dentry) {
+		/* This shouldn't happen, but we need this to pass
+		 * the verifier.
+		 */
+		ret = FAN_FP_RET_SEND_TO_USERSPACE;
+		goto out;
+	}
+
+	if (bpf_is_subdir(dentry, v->dentry))
+		ret = FAN_FP_RET_SEND_TO_USERSPACE;
+	else
+		ret = FAN_FP_RET_SKIP_EVENT;
+out:
+	bpf_rcu_read_unlock();
+
+	/* Save current result to the inode_storage_map */
+	value = bpf_inode_storage_get(&inode_storage_map, dentry->d_inode, 0,
+				      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (value)
+		*value = ret;
+	bpf_dput(dentry);
+	return ret;
+}
+
+SEC("struct_ops")
+int BPF_PROG(bpf_fp_init, struct fanotify_fastpath_hook *hook, const char *args)
+{
+	return 0;
+}
+
+SEC("struct_ops")
+void BPF_PROG(bpf_fp_free, struct fanotify_fastpath_hook *hook)
+{
+}
+
+SEC(".struct_ops.link")
+struct fanotify_fastpath_ops bpf_fanotify_fastpath_ops = {
+	.fp_handler = (void *)bpf_fp_handler,
+	.fp_init = (void *)bpf_fp_init,
+	.fp_free = (void *)bpf_fp_free,
+	.name = "_tmp_test_sub_tree",
+};
+
+char _license[] SEC("license") = "GPL";