Message ID | 20231208080135.4089880-1-amir73il@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] fanotify: allow to set errno in FAN_DENY permission response | expand |
On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > With FAN_DENY response, user trying to perform the filesystem operation > gets an error with errno set to EPERM. > > It is useful for hierarchical storage management (HSM) service to be able > to deny access for reasons more diverse than EPERM, for example EAGAIN, > if HSM could retry the operation later. > > Allow userspace to response to permission events with the response value > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > The change in fanotify_response is backward compatible, because errno is > written in the high 8 bits of the 32bit response field and old kernels > reject respose value with high bits set. > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> So a couple of comments that spring to my mind when I'm looking into this now (partly maybe due to my weak memory ;): 1) Do we still need the EAGAIN return? I think we have mostly dealt with freezing deadlocks in another way, didn't we? 2) If answer to 1) is yes, then there is a second question - do we expect the errors to propagate back to the unsuspecting application doing say read(2) syscall? Because I don't think that will fly well with a big majority of applications which basically treat *any* error from read(2) as fatal. This is also related to your question about standard permission events. Consumers of these error numbers are going to be random applications and I see a potential for rather big confusion arising there (like read(1) returning EINVAL or EBADF and now you wonder why the hell until you go debug the kernel and find out the error is coming out of fanotify handler). And the usecase is not quite clear to me for ordinary fanotify permission events (while I have no doubts about creativity of implementors of fanotify handlers ;)). 3) Given the potential for confusion, maybe we should stay conservative and only allow additional EAGAIN error instead of arbitrary errno if we need it? I'm leaving the API question aside for a moment until I have a clearer picture of what we actually want to implement :). Honza
On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > With FAN_DENY response, user trying to perform the filesystem operation > > gets an error with errno set to EPERM. > > > > It is useful for hierarchical storage management (HSM) service to be able > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > if HSM could retry the operation later. > > > > Allow userspace to response to permission events with the response value > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > The change in fanotify_response is backward compatible, because errno is > > written in the high 8 bits of the 32bit response field and old kernels > > reject respose value with high bits set. > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > So a couple of comments that spring to my mind when I'm looking into this > now (partly maybe due to my weak memory ;): > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > freezing deadlocks in another way, didn't we? I was thinking about EAGAIN on account of the HSM not being able to download the file ATM. There are a bunch of error codes that are typical for network filesystems, e.g. ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to HSM failures. > > 2) If answer to 1) is yes, then there is a second question - do we expect > the errors to propagate back to the unsuspecting application doing say > read(2) syscall? Because I don't think that will fly well with a big > majority of applications which basically treat *any* error from read(2) as > fatal. This is also related to your question about standard permission > events. Consumers of these error numbers are going to be random > applications and I see a potential for rather big confusion arising there > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > until you go debug the kernel and find out the error is coming out of > fanotify handler). And the usecase is not quite clear to me for ordinary > fanotify permission events (while I have no doubts about creativity of > implementors of fanotify handlers ;)). > That's a good question. I prefer to delegate your question to the prospect users of the feature. Josef, which errors did your use case need this feature for? > 3) Given the potential for confusion, maybe we should stay conservative and > only allow additional EAGAIN error instead of arbitrary errno if we need it? > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), but I certainly wouldn't mind restricting the set of custom errors. I think it makes sense. The hard part is to agree on this set of errors. > I'm leaving the API question aside for a moment until I have a clearer > picture of what we actually want to implement :). Fair enough ;) Thanks, Amir.
On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote: > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > > With FAN_DENY response, user trying to perform the filesystem operation > > > gets an error with errno set to EPERM. > > > > > > It is useful for hierarchical storage management (HSM) service to be able > > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > > if HSM could retry the operation later. > > > > > > Allow userspace to response to permission events with the response value > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > > > The change in fanotify_response is backward compatible, because errno is > > > written in the high 8 bits of the 32bit response field and old kernels > > > reject respose value with high bits set. > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > > > So a couple of comments that spring to my mind when I'm looking into this > > now (partly maybe due to my weak memory ;): > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > > freezing deadlocks in another way, didn't we? > > I was thinking about EAGAIN on account of the HSM not being able to > download the file ATM. > > There are a bunch of error codes that are typical for network filesystems, e.g. > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to > HSM failures. > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect > > the errors to propagate back to the unsuspecting application doing say > > read(2) syscall? Because I don't think that will fly well with a big > > majority of applications which basically treat *any* error from read(2) as > > fatal. This is also related to your question about standard permission > > events. Consumers of these error numbers are going to be random > > applications and I see a potential for rather big confusion arising there > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > > until you go debug the kernel and find out the error is coming out of > > fanotify handler). And the usecase is not quite clear to me for ordinary > > fanotify permission events (while I have no doubts about creativity of > > implementors of fanotify handlers ;)). > > > > That's a good question. > I prefer to delegate your question to the prospect users of the feature. > > Josef, which errors did your use case need this feature for? > > > 3) Given the potential for confusion, maybe we should stay conservative and > > only allow additional EAGAIN error instead of arbitrary errno if we need it? > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), > but I certainly wouldn't mind restricting the set of custom errors. > I think it makes sense. The hard part is to agree on this set of errors. > I'm all for flexibility here. We're going to have 2 classes of applications interacting with HSM backed storage, normal applications and applications that know they're backed by HSM. The normal applications are just going to crash if they get an error on read(2), it doesn't matter what errno it is. The second class would have different things they'd want to do in the face of different errors, and that's what this patchset is targeting. We can limit it to a few errno's if that makes you feel better, but having more than just one would be helpful. Thanks, Josef
On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote: > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote: > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > > > With FAN_DENY response, user trying to perform the filesystem operation > > > > gets an error with errno set to EPERM. > > > > > > > > It is useful for hierarchical storage management (HSM) service to be able > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > > > if HSM could retry the operation later. > > > > > > > > Allow userspace to response to permission events with the response value > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > > > > > The change in fanotify_response is backward compatible, because errno is > > > > written in the high 8 bits of the 32bit response field and old kernels > > > > reject respose value with high bits set. > > > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > > > > > So a couple of comments that spring to my mind when I'm looking into this > > > now (partly maybe due to my weak memory ;): > > > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > > > freezing deadlocks in another way, didn't we? > > > > I was thinking about EAGAIN on account of the HSM not being able to > > download the file ATM. > > > > There are a bunch of error codes that are typical for network filesystems, e.g. > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to > > HSM failures. > > > > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect > > > the errors to propagate back to the unsuspecting application doing say > > > read(2) syscall? Because I don't think that will fly well with a big > > > majority of applications which basically treat *any* error from read(2) as > > > fatal. This is also related to your question about standard permission > > > events. Consumers of these error numbers are going to be random > > > applications and I see a potential for rather big confusion arising there > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > > > until you go debug the kernel and find out the error is coming out of > > > fanotify handler). And the usecase is not quite clear to me for ordinary > > > fanotify permission events (while I have no doubts about creativity of > > > implementors of fanotify handlers ;)). > > > > > > > That's a good question. > > I prefer to delegate your question to the prospect users of the feature. > > > > Josef, which errors did your use case need this feature for? > > > > > 3) Given the potential for confusion, maybe we should stay conservative and > > > only allow additional EAGAIN error instead of arbitrary errno if we need it? > > > > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), > > but I certainly wouldn't mind restricting the set of custom errors. > > I think it makes sense. The hard part is to agree on this set of errors. > > > > I'm all for flexibility here. > > We're going to have 2 classes of applications interacting with HSM backed > storage, normal applications and applications that know they're backed by HSM. > The normal applications are just going to crash if they get an error on read(2), > it doesn't matter what errno it is. The second class would have different > things they'd want to do in the face of different errors, and that's what this > patchset is targeting. We can limit it to a few errno's if that makes you feel > better, but having more than just one would be helpful. > Ok. In another email I got from your colleagues, they listed: EIO, EAGAIN, ENOSPC as possible errors to return. I added EDQUOT for our in house use case. Those are all valid errors for write(2) and some are valid for read(2). ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could be surprising to applications not aware of HSM. I think it wouldn't be that bad if we let HSM decide which of those errors to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY. But given that we do want to limit the chance of abusing this feature, perhaps it would be best to limit the error codes to known error codes for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY} HSM events. IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you! Does that sound good to you? Furthermore, we can start with allowing a very limited set of errors and extend it in the future, on case by case basis. The way that this could be manageable is if we provide userspace a way to test for supported return codes. There is already a simple method that we used for testing FAN_INFO records type support - After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response to fan_fd with fanotify_response.fd=FAN_NOFD. When setting fanotify_response.fd=FAN_DENY, this would return ENOENT, but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would return EINVAL. This opens the possibility of allowing, say, EIO, EAGAIN in the first release and ENOSPC, EDQUOT in the following release. The advantage in this method is that it is very simple and already working correctly for old kernels. The downside is that this simple method does not allow checking for allowed errors per specific event type, so if we decide that we do want to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method could not be used by userspace to test for this finer grained support. In another thread, I mention the fact that FAN_OPEN_PERM still has a potential freeze deadlock when called from open(O_TRUNC|O_CREATE), so we can consider the fact that FAN_DENY_ERRNO is not allowed with FAN_OPEN_PERM as a negative incentive for people to consider using FAN_OPEN_PERM as a trigger for HSM. Thoughts? Thanks, Amir.
On Fri 15-12-23 18:50:39, Amir Goldstein wrote: > On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote: > > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > > > > With FAN_DENY response, user trying to perform the filesystem operation > > > > > gets an error with errno set to EPERM. > > > > > > > > > > It is useful for hierarchical storage management (HSM) service to be able > > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > > > > if HSM could retry the operation later. > > > > > > > > > > Allow userspace to response to permission events with the response value > > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > > > > > > > The change in fanotify_response is backward compatible, because errno is > > > > > written in the high 8 bits of the 32bit response field and old kernels > > > > > reject respose value with high bits set. > > > > > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > > > > > > > So a couple of comments that spring to my mind when I'm looking into this > > > > now (partly maybe due to my weak memory ;): > > > > > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > > > > freezing deadlocks in another way, didn't we? > > > > > > I was thinking about EAGAIN on account of the HSM not being able to > > > download the file ATM. > > > > > > There are a bunch of error codes that are typical for network filesystems, e.g. > > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to > > > HSM failures. > > > > > > > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect > > > > the errors to propagate back to the unsuspecting application doing say > > > > read(2) syscall? Because I don't think that will fly well with a big > > > > majority of applications which basically treat *any* error from read(2) as > > > > fatal. This is also related to your question about standard permission > > > > events. Consumers of these error numbers are going to be random > > > > applications and I see a potential for rather big confusion arising there > > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > > > > until you go debug the kernel and find out the error is coming out of > > > > fanotify handler). And the usecase is not quite clear to me for ordinary > > > > fanotify permission events (while I have no doubts about creativity of > > > > implementors of fanotify handlers ;)). > > > > > > > > > > That's a good question. > > > I prefer to delegate your question to the prospect users of the feature. > > > > > > Josef, which errors did your use case need this feature for? > > > > > > > 3) Given the potential for confusion, maybe we should stay conservative and > > > > only allow additional EAGAIN error instead of arbitrary errno if we need it? > > > > > > > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), > > > but I certainly wouldn't mind restricting the set of custom errors. > > > I think it makes sense. The hard part is to agree on this set of errors. > > > > > > > I'm all for flexibility here. > > > > We're going to have 2 classes of applications interacting with HSM backed > > storage, normal applications and applications that know they're backed by HSM. > > The normal applications are just going to crash if they get an error on read(2), > > it doesn't matter what errno it is. The second class would have different > > things they'd want to do in the face of different errors, and that's what this > > patchset is targeting. We can limit it to a few errno's if that makes you feel > > better, but having more than just one would be helpful. > > Ok. In another email I got from your colleagues, they listed: > EIO, EAGAIN, ENOSPC as possible errors to return. > I added EDQUOT for our in house use case. OK, so do I get it right that you also have applications that are aware that they are operation on top of HSM managed filesystem and thus they can do meaningful things with the reported errors? > Those are all valid errors for write(2) and some are valid for read(2). > ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could > be surprising to applications not aware of HSM. > I think it wouldn't be that bad if we let HSM decide which of those errors > to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY. Yeah, I don't think we need to be super-restrictive here, I'd just prefer to avoid the "whatever number you decide to return" kind of interface because I can see potential for confusion and abuse there. I think all four errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY if there are consumers that are able to use them. > But given that we do want to limit the chance of abusing this feature, > perhaps it would be best to limit the error codes to known error codes > for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning > FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY} > HSM events. > > IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you! > > Does that sound good to you? It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM if there's a usecase because at least conceptually it makes a good sense and chances for confusion are low there. People are used to dealing with errors on open(2). > Furthermore, we can start with allowing a very limited set of errors > and extend it in the future, on case by case basis. > > The way that this could be manageable is if we provide userspace > a way to test for supported return codes. > > There is already a simple method that we used for testing FAN_INFO > records type support - > After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response > to fan_fd with fanotify_response.fd=FAN_NOFD. > > When setting fanotify_response.fd=FAN_DENY, this would return ENOENT, > but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would > return EINVAL. > > This opens the possibility of allowing, say, EIO, EAGAIN in the first release > and ENOSPC, EDQUOT in the following release. If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable it from start and not complicate our lives more than necessary. > The advantage in this method is that it is very simple and already working > correctly for old kernels. > > The downside is that this simple method does not allow checking for > allowed errors per specific event type, so if we decide that we do want > to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method > could not be used by userspace to test for this finer grained support. True, in that case the HSM manager would have to try responding with FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding with FAN_DENY. Not too bad I'd say. > In another thread, I mention the fact that FAN_OPEN_PERM still has a > potential freeze deadlock when called from open(O_TRUNC|O_CREATE), > so we can consider the fact that FAN_DENY_ERRNO is not allowed with > FAN_OPEN_PERM as a negative incentive for people to consider using > FAN_OPEN_PERM as a trigger for HSM. AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for FAN_OPEN_PERM. Honza
On Mon, Dec 18, 2023 at 4:35 PM Jan Kara <jack@suse.cz> wrote: > > On Fri 15-12-23 18:50:39, Amir Goldstein wrote: > > On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote: > > > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > > > > > With FAN_DENY response, user trying to perform the filesystem operation > > > > > > gets an error with errno set to EPERM. > > > > > > > > > > > > It is useful for hierarchical storage management (HSM) service to be able > > > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > > > > > if HSM could retry the operation later. > > > > > > > > > > > > Allow userspace to response to permission events with the response value > > > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > > > > > > > > > The change in fanotify_response is backward compatible, because errno is > > > > > > written in the high 8 bits of the 32bit response field and old kernels > > > > > > reject respose value with high bits set. > > > > > > > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > > > > > > > > > So a couple of comments that spring to my mind when I'm looking into this > > > > > now (partly maybe due to my weak memory ;): > > > > > > > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > > > > > freezing deadlocks in another way, didn't we? > > > > > > > > I was thinking about EAGAIN on account of the HSM not being able to > > > > download the file ATM. > > > > > > > > There are a bunch of error codes that are typical for network filesystems, e.g. > > > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to > > > > HSM failures. > > > > > > > > > > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect > > > > > the errors to propagate back to the unsuspecting application doing say > > > > > read(2) syscall? Because I don't think that will fly well with a big > > > > > majority of applications which basically treat *any* error from read(2) as > > > > > fatal. This is also related to your question about standard permission > > > > > events. Consumers of these error numbers are going to be random > > > > > applications and I see a potential for rather big confusion arising there > > > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > > > > > until you go debug the kernel and find out the error is coming out of > > > > > fanotify handler). And the usecase is not quite clear to me for ordinary > > > > > fanotify permission events (while I have no doubts about creativity of > > > > > implementors of fanotify handlers ;)). > > > > > > > > > > > > > That's a good question. > > > > I prefer to delegate your question to the prospect users of the feature. > > > > > > > > Josef, which errors did your use case need this feature for? > > > > > > > > > 3) Given the potential for confusion, maybe we should stay conservative and > > > > > only allow additional EAGAIN error instead of arbitrary errno if we need it? > > > > > > > > > > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), > > > > but I certainly wouldn't mind restricting the set of custom errors. > > > > I think it makes sense. The hard part is to agree on this set of errors. > > > > > > > > > > I'm all for flexibility here. > > > > > > We're going to have 2 classes of applications interacting with HSM backed > > > storage, normal applications and applications that know they're backed by HSM. > > > The normal applications are just going to crash if they get an error on read(2), > > > it doesn't matter what errno it is. The second class would have different > > > things they'd want to do in the face of different errors, and that's what this > > > patchset is targeting. We can limit it to a few errno's if that makes you feel > > > better, but having more than just one would be helpful. > > > > Ok. In another email I got from your colleagues, they listed: > > EIO, EAGAIN, ENOSPC as possible errors to return. > > I added EDQUOT for our in house use case. > > OK, so do I get it right that you also have applications that are aware > that they are operation on top of HSM managed filesystem and thus they can > do meaningful things with the reported errors? > Some applications are HSM aware. Some just report the errors that they get which are meaningful to users. EIO is the standard response for HSM failure to fill content. EDQUOT/ENOSPC is a good example of special functionality. HSM "swaps out" file content to a slow remote tier, but the slow remote tier may have a space/quota limit that is known to HSM. By tracking the total of st_size under some HSM managed folder, including the st_size of files whose content is punched out, HSM can enforce this limit not in the conventional meaning of local disk blocks usage. This is when returning EDQUOT/ENOSPC for FAN_PRE_MODIFY makes sense to most users/applications, except for ones that try to create large sparse files... > > Those are all valid errors for write(2) and some are valid for read(2). > > ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could > > be surprising to applications not aware of HSM. > > I think it wouldn't be that bad if we let HSM decide which of those errors > > to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY. > > Yeah, I don't think we need to be super-restrictive here, I'd just prefer > to avoid the "whatever number you decide to return" kind of interface > because I can see potential for confusion and abuse there. I think all four > errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY > if there are consumers that are able to use them. > > > But given that we do want to limit the chance of abusing this feature, > > perhaps it would be best to limit the error codes to known error codes > > for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning > > FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY} > > HSM events. > > > > IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you! > > > > Does that sound good to you? > > It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM > if there's a usecase because at least conceptually it makes a good sense > and chances for confusion are low there. People are used to dealing with > errors on open(2). > I wrote about one case I have below. > > Furthermore, we can start with allowing a very limited set of errors > > and extend it in the future, on case by case basis. > > > > The way that this could be manageable is if we provide userspace > > a way to test for supported return codes. > > > > There is already a simple method that we used for testing FAN_INFO > > records type support - > > After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response > > to fan_fd with fanotify_response.fd=FAN_NOFD. > > > > When setting fanotify_response.fd=FAN_DENY, this would return ENOENT, > > but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would > > return EINVAL. > > > > This opens the possibility of allowing, say, EIO, EAGAIN in the first release > > and ENOSPC, EDQUOT in the following release. > > If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable > it from start and not complicate our lives more than necessary. > Sure, I was just giving an example how the list could be extended case by case in the future. > > The advantage in this method is that it is very simple and already working > > correctly for old kernels. > > > > The downside is that this simple method does not allow checking for > > allowed errors per specific event type, so if we decide that we do want > > to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method > > could not be used by userspace to test for this finer grained support. > > True, in that case the HSM manager would have to try responding with > FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding > with FAN_DENY. Not too bad I'd say. > Yeah that works too. > > In another thread, I mention the fact that FAN_OPEN_PERM still has a > > potential freeze deadlock when called from open(O_TRUNC|O_CREATE), > > so we can consider the fact that FAN_DENY_ERRNO is not allowed with > > FAN_OPEN_PERM as a negative incentive for people to consider using > > FAN_OPEN_PERM as a trigger for HSM. > > AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM > event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for > FAN_OPEN_PERM. In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark to deny open of file during the short time that it's content is being punched out [1]. It is quite complicated to explain, but I only used it for denying access, not to fill content and not to write anything to filesystem. It's worth noting that returning EBUSY in that case would be more meaningful to users. That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, but mainly I do not have a proof that people will not need it. OTOH, I am a bit concerned that this will encourage developer to use FAN_OPEN_PERM as a trigger to filling file content and then we are back to deadlock risk zone. Not sure which way to go. Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO before FAN_PRE_* events, so we can continue this discussion later when I post FAN_PRE_* patches - not for this cycle. Thanks, Amir. [1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#invalidating-local-cache
On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 4:35 PM Jan Kara <jack@suse.cz> wrote: > > > > On Fri 15-12-23 18:50:39, Amir Goldstein wrote: > > > On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > > > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote: > > > > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote: > > > > > > > With FAN_DENY response, user trying to perform the filesystem operation > > > > > > > gets an error with errno set to EPERM. > > > > > > > > > > > > > > It is useful for hierarchical storage management (HSM) service to be able > > > > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN, > > > > > > > if HSM could retry the operation later. > > > > > > > > > > > > > > Allow userspace to response to permission events with the response value > > > > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. > > > > > > > > > > > > > > The change in fanotify_response is backward compatible, because errno is > > > > > > > written in the high 8 bits of the 32bit response field and old kernels > > > > > > > reject respose value with high bits set. > > > > > > > > > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com> > > > > > > > > > > > > So a couple of comments that spring to my mind when I'm looking into this > > > > > > now (partly maybe due to my weak memory ;): > > > > > > > > > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with > > > > > > freezing deadlocks in another way, didn't we? > > > > > > > > > > I was thinking about EAGAIN on account of the HSM not being able to > > > > > download the file ATM. > > > > > > > > > > There are a bunch of error codes that are typical for network filesystems, e.g. > > > > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to > > > > > HSM failures. > > > > > > > > > > > > > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect > > > > > > the errors to propagate back to the unsuspecting application doing say > > > > > > read(2) syscall? Because I don't think that will fly well with a big > > > > > > majority of applications which basically treat *any* error from read(2) as > > > > > > fatal. This is also related to your question about standard permission > > > > > > events. Consumers of these error numbers are going to be random > > > > > > applications and I see a potential for rather big confusion arising there > > > > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell > > > > > > until you go debug the kernel and find out the error is coming out of > > > > > > fanotify handler). And the usecase is not quite clear to me for ordinary > > > > > > fanotify permission events (while I have no doubts about creativity of > > > > > > implementors of fanotify handlers ;)). > > > > > > > > > > > > > > > > That's a good question. > > > > > I prefer to delegate your question to the prospect users of the feature. > > > > > > > > > > Josef, which errors did your use case need this feature for? > > > > > > > > > > > 3) Given the potential for confusion, maybe we should stay conservative and > > > > > > only allow additional EAGAIN error instead of arbitrary errno if we need it? > > > > > > > > > > > > > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY), > > > > > but I certainly wouldn't mind restricting the set of custom errors. > > > > > I think it makes sense. The hard part is to agree on this set of errors. > > > > > > > > > > > > > I'm all for flexibility here. > > > > > > > > We're going to have 2 classes of applications interacting with HSM backed > > > > storage, normal applications and applications that know they're backed by HSM. > > > > The normal applications are just going to crash if they get an error on read(2), > > > > it doesn't matter what errno it is. The second class would have different > > > > things they'd want to do in the face of different errors, and that's what this > > > > patchset is targeting. We can limit it to a few errno's if that makes you feel > > > > better, but having more than just one would be helpful. > > > > > > Ok. In another email I got from your colleagues, they listed: > > > EIO, EAGAIN, ENOSPC as possible errors to return. > > > I added EDQUOT for our in house use case. > > > > OK, so do I get it right that you also have applications that are aware > > that they are operation on top of HSM managed filesystem and thus they can > > do meaningful things with the reported errors? > > > > Some applications are HSM aware. > Some just report the errors that they get which are meaningful to users. > EIO is the standard response for HSM failure to fill content. > > EDQUOT/ENOSPC is a good example of special functionality. > HSM "swaps out" file content to a slow remote tier, but the slow remote > tier may have a space/quota limit that is known to HSM. > > By tracking the total of st_size under some HSM managed folder, including > the st_size of files whose content is punched out, HSM can enforce this limit > not in the conventional meaning of local disk blocks usage. > > This is when returning EDQUOT/ENOSPC for FAN_PRE_MODIFY makes > sense to most users/applications, except for ones that try to create > large sparse > files... > > > > > Those are all valid errors for write(2) and some are valid for read(2). > > > ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could > > > be surprising to applications not aware of HSM. > > > I think it wouldn't be that bad if we let HSM decide which of those errors > > > to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY. > > > > Yeah, I don't think we need to be super-restrictive here, I'd just prefer > > to avoid the "whatever number you decide to return" kind of interface > > because I can see potential for confusion and abuse there. I think all four > > errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY > > if there are consumers that are able to use them. > > > > > But given that we do want to limit the chance of abusing this feature, > > > perhaps it would be best to limit the error codes to known error codes > > > for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning > > > FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY} > > > HSM events. > > > > > > IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you! > > > > > > Does that sound good to you? > > > > It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM > > if there's a usecase because at least conceptually it makes a good sense > > and chances for confusion are low there. People are used to dealing with > > errors on open(2). > > > > I wrote about one case I have below. > > > > Furthermore, we can start with allowing a very limited set of errors > > > and extend it in the future, on case by case basis. > > > > > > The way that this could be manageable is if we provide userspace > > > a way to test for supported return codes. > > > > > > There is already a simple method that we used for testing FAN_INFO > > > records type support - > > > After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response > > > to fan_fd with fanotify_response.fd=FAN_NOFD. > > > > > > When setting fanotify_response.fd=FAN_DENY, this would return ENOENT, > > > but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would > > > return EINVAL. > > > > > > This opens the possibility of allowing, say, EIO, EAGAIN in the first release > > > and ENOSPC, EDQUOT in the following release. > > > > If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable > > it from start and not complicate our lives more than necessary. > > > > Sure, I was just giving an example how the list could be extended case by case > in the future. > > > > The advantage in this method is that it is very simple and already working > > > correctly for old kernels. > > > > > > The downside is that this simple method does not allow checking for > > > allowed errors per specific event type, so if we decide that we do want > > > to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method > > > could not be used by userspace to test for this finer grained support. > > > > True, in that case the HSM manager would have to try responding with > > FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding > > with FAN_DENY. Not too bad I'd say. > > > > Yeah that works too. > > > > In another thread, I mention the fact that FAN_OPEN_PERM still has a > > > potential freeze deadlock when called from open(O_TRUNC|O_CREATE), > > > so we can consider the fact that FAN_DENY_ERRNO is not allowed with > > > FAN_OPEN_PERM as a negative incentive for people to consider using > > > FAN_OPEN_PERM as a trigger for HSM. > > > > AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM > > event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for > > FAN_OPEN_PERM. > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > to deny open of file during the short time that it's content is being > punched out [1]. > It is quite complicated to explain, but I only used it for denying access, > not to fill content and not to write anything to filesystem. > It's worth noting that returning EBUSY in that case would be more meaningful > to users. > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > but mainly I do not have a proof that people will not need it. > > OTOH, I am a bit concerned that this will encourage developer to use > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > deadlock risk zone. > > Not sure which way to go. > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > before FAN_PRE_* events, so we can continue this discussion later when > I post FAN_PRE_* patches - not for this cycle. > Hi Jan, I started to prepare the pre-content events patches for posting and got back to this one as well. Since we had this discussion I have learned of another use case that requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM to be exact. The reason is that unless an executable content is filled at execve() time, there is no other opportunity to fill its content without getting -ETXTBSY. So to keep things more flexible, I decided to add -ETXTBSY to the allowed errors with FAN_DENY_ERRNO() and to decided to allow FAN_DENY_ERRNO() with all permission events. To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have added a limitation that FAN_DENY_ERRNO() is allowed only for FAN_CLASS_PRE_CONTENT groups. Thanks, Amir.
I'm sorry for the delay. The last week was busy and this fell through the cracks. On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > to deny open of file during the short time that it's content is being > > punched out [1]. > > It is quite complicated to explain, but I only used it for denying access, > > not to fill content and not to write anything to filesystem. > > It's worth noting that returning EBUSY in that case would be more meaningful > > to users. > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > but mainly I do not have a proof that people will not need it. > > > > OTOH, I am a bit concerned that this will encourage developer to use > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > deadlock risk zone. > > > > Not sure which way to go. > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > before FAN_PRE_* events, so we can continue this discussion later when > > I post FAN_PRE_* patches - not for this cycle. > > I started to prepare the pre-content events patches for posting and got back > to this one as well. > > Since we had this discussion I have learned of another use case that > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > to be exact. > > The reason is that unless an executable content is filled at execve() time, > there is no other opportunity to fill its content without getting -ETXTBSY. Yes, I've been scratching my head over this usecase for a few days. I was thinking whether we could somehow fill in executable (and executed) files on access but it all seemed too hacky so I agree that we probably have to fill them in on open. > So to keep things more flexible, I decided to add -ETXTBSY to the > allowed errors with FAN_DENY_ERRNO() and to decided to allow > FAN_DENY_ERRNO() with all permission events. > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > added a limitation that FAN_DENY_ERRNO() is allowed only for > FAN_CLASS_PRE_CONTENT groups. I have no problem with adding -ETXTBSY to the set of allowed errors. That makes sense. Adding FAN_DENY_ERRNO() to all permission events in FAN_CLASS_PRE_CONTENT groups - OK, if we don't find anything better - I wanted to hash out another possibility here: Currently all permission events (and thus also the events we plan to use for HSM AFAIU) are using 'fd' to identify file where the event happened. This is used as identifier for response, can be used to fill in file contents for HSM but it also causes issues such as the problem with exec(2) occasionally failing if this fd happens to get closed only after exec(2) gets to checking deny_write_access(). So what if we implemented events needed for HSM as FID events (we'd have think how to match replies to events)? Then the app would open the file for filling in using FID as well as it would naturally close the handle before replying so problems with exec(2) would not arise. These would be essentially new events (so far we didn't allow permission events in FID groups) so allowing FAN_DENY_ERRNO() replies for them would be natural. Overall it would seem like a cleaner "clean room implementation" API? Honza
On Mon, Feb 5, 2024 at 8:27 PM Jan Kara <jack@suse.cz> wrote: > > I'm sorry for the delay. The last week was busy and this fell through the > cracks. > No worries, I was busy as well. I did already rebase fan_pre_content & fan_errno [1] (over v6.8-rc2) last week, made the small changes I mention here and ran some basic tests, but did not complete writing tests. Hoping to switch back to it this week. [1] https://github.com/amir73il/linux/commits/fan_errno > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > to deny open of file during the short time that it's content is being > > > punched out [1]. > > > It is quite complicated to explain, but I only used it for denying access, > > > not to fill content and not to write anything to filesystem. > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > to users. > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > but mainly I do not have a proof that people will not need it. > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > deadlock risk zone. > > > > > > Not sure which way to go. > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > before FAN_PRE_* events, so we can continue this discussion later when > > > I post FAN_PRE_* patches - not for this cycle. > > > > I started to prepare the pre-content events patches for posting and got back > > to this one as well. > > > > Since we had this discussion I have learned of another use case that > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > to be exact. > > > > The reason is that unless an executable content is filled at execve() time, > > there is no other opportunity to fill its content without getting -ETXTBSY. > > Yes, I've been scratching my head over this usecase for a few days. I was > thinking whether we could somehow fill in executable (and executed) files on > access but it all seemed too hacky so I agree that we probably have to fill > them in on open. > Normally, I think there will not be a really huge executable(?) If there were huge executables, they would have likely been broken down into smaller loadable libraries which should allow more granular content filling, but I guess there will always be worst case exceptions. > > So to keep things more flexible, I decided to add -ETXTBSY to the > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > FAN_DENY_ERRNO() with all permission events. > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > FAN_CLASS_PRE_CONTENT groups. > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > FAN_CLASS_PRE_CONTENT groups - OK, done that. I am still not very happy about FAN_OPEN_PERM being part of HSM event family when I know that O_TRUCT and O_CREAT call this hook with sb writers held. The irony, is that there is no chance that O_TRUNC will require filling content, same if the file is actually being created by O_CREAT, so the cases where sb writers is actually needed and the case where content filling is needed do not overlap, but I cannot figure out how to get those cases out of the HSM risk zone. Ideas? > if we don't find anything better - I > wanted to hash out another possibility here: Currently all permission > events (and thus also the events we plan to use for HSM AFAIU) are using > 'fd' to identify file where the event happened. This is used as identifier > for response, can be used to fill in file contents for HSM but it also > causes issues such as the problem with exec(2) occasionally failing if this > fd happens to get closed only after exec(2) gets to checking > deny_write_access(). So what if we implemented events needed for HSM as FID > events (we'd have think how to match replies to events)? Then the app would > open the file for filling in using FID as well as it would naturally close > the handle before replying so problems with exec(2) would not arise. These The two things are independent IMO. We can use an event->key instead of event->fd, which I like, but we may still leave event->fd as a way to get an FMODE_NONOTIFY fd as long as the user closes event->fd before responding or we can implement Sargun's suggestion of the FAN_CLOSE_FD response flag. If a user needs to open an FMODE_NONOTIFY fd from fid, we will need to provide a way to do that. My WIP pre-lookup event patches [2] implements inheritance of FMODE_NONOTIFY from dirfd used for openat(). Perhaps we can do the same for open_by_handle_at() and inherit FMODE_NONOTIFY from mount_fd to implement your suggestion? [2] https://github.com/amir73il/linux/commits/fan_lookup_perm > would be essentially new events (so far we didn't allow permission events > in FID groups) so allowing FAN_DENY_ERRNO() replies for them would be > natural. Overall it would seem like a cleaner "clean room implementation" > API? I like the idea of a clean slate. Looking a head, for the PRE_PATH events (e.g. lookup,create) I was planning to use FAN_EVENT_INFO_TYPE_DFID_NAME to carry the last component lookup name, but then also have event->fd as the dirfd of lookup/create. That's a bit ugly duplicity and also it does not cover rename(), because if we use FAN_EVENT_INFO_TYPE_{OLD,NEW}_DFID_NAME to report names, where would newdirfd, olddirfd be reported? Your suggestion solves both these questions elegantly and if you agree to adapting open_by_handle_at() to cater fanotify needs, then we have a plan to propose. The bad side of clean slate is that it reduces the chances of me being able to get pre-content events ready in time for 6.9, which is a shame, but we got to do what we got to do ;) Thanks, Amir.
> > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > to deny open of file during the short time that it's content is being > > > > punched out. > > > > It is quite complicated to explain, but I only used it for denying access, > > > > not to fill content and not to write anything to filesystem. > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > to users. > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > deadlock risk zone. > > > > > > > > Not sure which way to go. > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > I started to prepare the pre-content events patches for posting and got back > > > to this one as well. > > > > > > Since we had this discussion I have learned of another use case that > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > to be exact. > > > > > > The reason is that unless an executable content is filled at execve() time, > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > thinking whether we could somehow fill in executable (and executed) files on > > access but it all seemed too hacky so I agree that we probably have to fill > > them in on open. > > > > Normally, I think there will not be a really huge executable(?) > If there were huge executables, they would have likely been broken down > into smaller loadable libraries which should allow more granular > content filling, > but I guess there will always be worst case exceptions. > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > FAN_DENY_ERRNO() with all permission events. > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > FAN_CLASS_PRE_CONTENT groups. > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > FAN_CLASS_PRE_CONTENT groups - OK, > > done that. > > I am still not very happy about FAN_OPEN_PERM being part of HSM > event family when I know that O_TRUCT and O_CREAT call this hook > with sb writers held. > > The irony, is that there is no chance that O_TRUNC will require filling > content, same if the file is actually being created by O_CREAT, so the > cases where sb writers is actually needed and the case where content > filling is needed do not overlap, but I cannot figure out how to get those > cases out of the HSM risk zone. Ideas? > Jan, I wanted to run an idea by you. I like your idea to start a clean slate with FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID and it would be nice if we could restrict this HSM to use pre-content events, which is why I was not happy about allowing FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, especially with the known deadlocks. Since we already know that we need to generate FAN_PRE_ACCESS(offset,length) for read-only mmap() and FAN_PRE_MODIFY(offset,length) for writable mmap(), we could treat uselib() and execve() the same way and generate FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. My idea is to generate an event FAN_PRE_MODIFY(0,0) for an open for write *after* file was truncated and FAN_PRE_ACCESS(0,0) for open O_RDONLY. Possibly also FAN_PRE_*(offset,0) events for llseek(). I've already pushed a POC to fan_pre_content branch [1]. Only sanity tested that nothing else is broken. I still need to add the mmap hooks and test the new events. With this, HSM will have appropriate hooks to fill executable and library on first access and also fill arbitrary files on open including the knowledge if the file was opened for write. Thoughts? Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fan_pre_content
On Thu 08-02-24 16:04:29, Amir Goldstein wrote: > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > > to deny open of file during the short time that it's content is being > > > > > punched out. > > > > > It is quite complicated to explain, but I only used it for denying access, > > > > > not to fill content and not to write anything to filesystem. > > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > > to users. > > > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > > deadlock risk zone. > > > > > > > > > > Not sure which way to go. > > > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > > > I started to prepare the pre-content events patches for posting and got back > > > > to this one as well. > > > > > > > > Since we had this discussion I have learned of another use case that > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > > to be exact. > > > > > > > > The reason is that unless an executable content is filled at execve() time, > > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > > thinking whether we could somehow fill in executable (and executed) files on > > > access but it all seemed too hacky so I agree that we probably have to fill > > > them in on open. > > > > > > > Normally, I think there will not be a really huge executable(?) > > If there were huge executables, they would have likely been broken down > > into smaller loadable libraries which should allow more granular > > content filling, > > but I guess there will always be worst case exceptions. > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > > FAN_DENY_ERRNO() with all permission events. > > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > > FAN_CLASS_PRE_CONTENT groups. > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > > FAN_CLASS_PRE_CONTENT groups - OK, > > > > done that. > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM > > event family when I know that O_TRUCT and O_CREAT call this hook > > with sb writers held. > > > > The irony, is that there is no chance that O_TRUNC will require filling > > content, same if the file is actually being created by O_CREAT, so the > > cases where sb writers is actually needed and the case where content > > filling is needed do not overlap, but I cannot figure out how to get those > > cases out of the HSM risk zone. Ideas? > > > > Jan, > > I wanted to run an idea by you. > > I like your idea to start a clean slate with > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID > and it would be nice if we could restrict this HSM to use > pre-content events, which is why I was not happy about allowing > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, > especially with the known deadlocks. > > Since we already know that we need to generate > FAN_PRE_ACCESS(offset,length) for read-only mmap() and > FAN_PRE_MODIFY(offset,length) for writable mmap(), > we could treat uselib() and execve() the same way and generate > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. BTW uselib() is deprecated and there is a patch queued to not generate OPEN_EXEC events for it because it was causing problems (not the generation of events itself but the FMODE_EXEC bit being set in uselib). So I don't think we need to instrument uselib(). > My idea is to generate an event FAN_PRE_MODIFY(0,0) > for an open for write *after* file was truncated and > FAN_PRE_ACCESS(0,0) for open O_RDONLY. What I find somewhat strange about this is that if we return error from the fsnotify_file_perm() hook, open(2) will fail with error but the file is already truncated. But I guess it should be rare and it's bearable. > Possibly also FAN_PRE_*(offset,0) events for llseek(). That seem overdoing it a bit IMO :) > I've already pushed a POC to fan_pre_content branch [1]. > Only sanity tested that nothing else is broken. > I still need to add the mmap hooks and test the new events. > > With this, HSM will have appropriate hooks to fill executable > and library on first access and also fill arbitrary files on open > including the knowledge if the file was opened for write. > > Thoughts? Yeah, I guess this looks sensible. Honza
On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote: > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > > > to deny open of file during the short time that it's content is being > > > > > > punched out. > > > > > > It is quite complicated to explain, but I only used it for denying access, > > > > > > not to fill content and not to write anything to filesystem. > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > > > to users. > > > > > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > > > deadlock risk zone. > > > > > > > > > > > > Not sure which way to go. > > > > > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > > > > > I started to prepare the pre-content events patches for posting and got back > > > > > to this one as well. > > > > > > > > > > Since we had this discussion I have learned of another use case that > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > > > to be exact. > > > > > > > > > > The reason is that unless an executable content is filled at execve() time, > > > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > > > thinking whether we could somehow fill in executable (and executed) files on > > > > access but it all seemed too hacky so I agree that we probably have to fill > > > > them in on open. > > > > > > > > > > Normally, I think there will not be a really huge executable(?) > > > If there were huge executables, they would have likely been broken down > > > into smaller loadable libraries which should allow more granular > > > content filling, > > > but I guess there will always be worst case exceptions. > > > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > > > FAN_DENY_ERRNO() with all permission events. > > > > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > > > FAN_CLASS_PRE_CONTENT groups. > > > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > > > FAN_CLASS_PRE_CONTENT groups - OK, > > > > > > done that. > > > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM > > > event family when I know that O_TRUCT and O_CREAT call this hook > > > with sb writers held. > > > > > > The irony, is that there is no chance that O_TRUNC will require filling > > > content, same if the file is actually being created by O_CREAT, so the > > > cases where sb writers is actually needed and the case where content > > > filling is needed do not overlap, but I cannot figure out how to get those > > > cases out of the HSM risk zone. Ideas? > > > > > > > Jan, > > > > I wanted to run an idea by you. > > > > I like your idea to start a clean slate with > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID > > and it would be nice if we could restrict this HSM to use > > pre-content events, which is why I was not happy about allowing > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, > > especially with the known deadlocks. > > > > Since we already know that we need to generate > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and > > FAN_PRE_MODIFY(offset,length) for writable mmap(), > > we could treat uselib() and execve() the same way and generate > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. > > BTW uselib() is deprecated and there is a patch queued to not generate > OPEN_EXEC events for it because it was causing problems (not the generation > of events itself but the FMODE_EXEC bit being set in uselib). So I don't > think we need to instrument uselib(). > Great. The fewer the better :) BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to call sites of security_mmap_file(), but I see that: 1. shmat() has security_mmap_file() - is it relevant? 2. remap_file_pages() calls do_mmap() without security_mmap_file() - do we need to cover it? > > My idea is to generate an event FAN_PRE_MODIFY(0,0) > > for an open for write *after* file was truncated and > > FAN_PRE_ACCESS(0,0) for open O_RDONLY. > > What I find somewhat strange about this is that if we return error from the > fsnotify_file_perm() hook, open(2) will fail with error but the file is > already truncated. But I guess it should be rare and it's bearable. > > > Possibly also FAN_PRE_*(offset,0) events for llseek(). > > That seem overdoing it a bit IMO :) Heh! forget I said it ;) > > > I've already pushed a POC to fan_pre_content branch [1]. > > Only sanity tested that nothing else is broken. > > I still need to add the mmap hooks and test the new events. > > > > With this, HSM will have appropriate hooks to fill executable > > and library on first access and also fill arbitrary files on open > > including the knowledge if the file was opened for write. > > > > Thoughts? > > Yeah, I guess this looks sensible. > Cool, so let's see, what is left to do for the plan of FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID? 1. event->fd is O_PATH mount_fd for open_by_handle_at() 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or implement FAN_REPORT_EVENT_ID and new header format Anything else? Are you ok with 1 and 2? Do you have a preference for 3? Thanks, Amir.
On Thu 08-02-24 21:21:13, Amir Goldstein wrote: > On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote: > > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > > > > to deny open of file during the short time that it's content is being > > > > > > > punched out. > > > > > > > It is quite complicated to explain, but I only used it for denying access, > > > > > > > not to fill content and not to write anything to filesystem. > > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > > > > to users. > > > > > > > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > > > > deadlock risk zone. > > > > > > > > > > > > > > Not sure which way to go. > > > > > > > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > > > > > > > I started to prepare the pre-content events patches for posting and got back > > > > > > to this one as well. > > > > > > > > > > > > Since we had this discussion I have learned of another use case that > > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > > > > to be exact. > > > > > > > > > > > > The reason is that unless an executable content is filled at execve() time, > > > > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > > > > thinking whether we could somehow fill in executable (and executed) files on > > > > > access but it all seemed too hacky so I agree that we probably have to fill > > > > > them in on open. > > > > > > > > > > > > > Normally, I think there will not be a really huge executable(?) > > > > If there were huge executables, they would have likely been broken down > > > > into smaller loadable libraries which should allow more granular > > > > content filling, > > > > but I guess there will always be worst case exceptions. > > > > > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > > > > FAN_DENY_ERRNO() with all permission events. > > > > > > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > > > > FAN_CLASS_PRE_CONTENT groups. > > > > > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > > > > FAN_CLASS_PRE_CONTENT groups - OK, > > > > > > > > done that. > > > > > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM > > > > event family when I know that O_TRUCT and O_CREAT call this hook > > > > with sb writers held. > > > > > > > > The irony, is that there is no chance that O_TRUNC will require filling > > > > content, same if the file is actually being created by O_CREAT, so the > > > > cases where sb writers is actually needed and the case where content > > > > filling is needed do not overlap, but I cannot figure out how to get those > > > > cases out of the HSM risk zone. Ideas? > > > > > > > > > > Jan, > > > > > > I wanted to run an idea by you. > > > > > > I like your idea to start a clean slate with > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID > > > and it would be nice if we could restrict this HSM to use > > > pre-content events, which is why I was not happy about allowing > > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, > > > especially with the known deadlocks. > > > > > > Since we already know that we need to generate > > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and > > > FAN_PRE_MODIFY(offset,length) for writable mmap(), > > > we could treat uselib() and execve() the same way and generate > > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. > > > > BTW uselib() is deprecated and there is a patch queued to not generate > > OPEN_EXEC events for it because it was causing problems (not the generation > > of events itself but the FMODE_EXEC bit being set in uselib). So I don't > > think we need to instrument uselib(). > > > > Great. The fewer the better :) > > BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to > call sites of security_mmap_file(), but I see that: > 1. shmat() has security_mmap_file() - is it relevant? Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is particularly useful for HSM purposes but we should probably have it for consistency? > 2. remap_file_pages() calls do_mmap() without security_mmap_file() - > do we need to cover it? Hmm, AFAIU remap_file_pages() just allows you to mess with an existing mapping so it isn't very interesting from HSM POV? > > > I've already pushed a POC to fan_pre_content branch [1]. > > > Only sanity tested that nothing else is broken. > > > I still need to add the mmap hooks and test the new events. > > > > > > With this, HSM will have appropriate hooks to fill executable > > > and library on first access and also fill arbitrary files on open > > > including the knowledge if the file was opened for write. > > > > > > Thoughts? > > > > Yeah, I guess this looks sensible. > > Cool, so let's see, what is left to do for the plan of > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID? > > 1. event->fd is O_PATH mount_fd for open_by_handle_at() > 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd > 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or > implement FAN_REPORT_EVENT_ID and new header format > > Anything else? > Are you ok with 1 and 2? I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2) is the key question. But if you provide mount_fd with FMODE_NONOTIFY and have FMODE_NONOTIFY inheritance, then what's the difference to just allow opens with FMODE_NONOTIFY from the start? I don't think restricting FMODE_NONOTIFY to inheritance gives any additional strong security guarantees? I understand so far we didn't expose FMODE_NONOTIFY so that people cannot bypass fanotify permission events. But now we need a sensible way to fill in the filesystem without deadlocking on our own watches. Maybe exposing FMODE_NONOTIFY as an open flag is too attractive for misuse so could be somehow tie it to the HSM app? We were already discussing that we need to somehow register the HSM app with the filesystem to avoid issues when the app crashes or so. So maybe we could tie this registration to the ability of bypassing generation of notification? > Do you have a preference for 3? Yeah. FAN_CLOSE_FD seems as a hack to me (and you're likely to get it wrong until you've spent quite some time debugging with exec sometimes fails with ETXTBUSY). So I prefer FAN_REPORT_EVENT_ID and use event id as an identifier when replying to HSM events. Honza
On Mon, Feb 12, 2024 at 2:01 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 08-02-24 21:21:13, Amir Goldstein wrote: > > On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote: > > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote: > > > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > > > > > to deny open of file during the short time that it's content is being > > > > > > > > punched out. > > > > > > > > It is quite complicated to explain, but I only used it for denying access, > > > > > > > > not to fill content and not to write anything to filesystem. > > > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > > > > > to users. > > > > > > > > > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > > > > > deadlock risk zone. > > > > > > > > > > > > > > > > Not sure which way to go. > > > > > > > > > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > > > > > > > > > I started to prepare the pre-content events patches for posting and got back > > > > > > > to this one as well. > > > > > > > > > > > > > > Since we had this discussion I have learned of another use case that > > > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > > > > > to be exact. > > > > > > > > > > > > > > The reason is that unless an executable content is filled at execve() time, > > > > > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > > > > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > > > > > thinking whether we could somehow fill in executable (and executed) files on > > > > > > access but it all seemed too hacky so I agree that we probably have to fill > > > > > > them in on open. > > > > > > > > > > > > > > > > Normally, I think there will not be a really huge executable(?) > > > > > If there were huge executables, they would have likely been broken down > > > > > into smaller loadable libraries which should allow more granular > > > > > content filling, > > > > > but I guess there will always be worst case exceptions. > > > > > > > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > > > > > FAN_DENY_ERRNO() with all permission events. > > > > > > > > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > > > > > FAN_CLASS_PRE_CONTENT groups. > > > > > > > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > > > > > FAN_CLASS_PRE_CONTENT groups - OK, > > > > > > > > > > done that. > > > > > > > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM > > > > > event family when I know that O_TRUCT and O_CREAT call this hook > > > > > with sb writers held. > > > > > > > > > > The irony, is that there is no chance that O_TRUNC will require filling > > > > > content, same if the file is actually being created by O_CREAT, so the > > > > > cases where sb writers is actually needed and the case where content > > > > > filling is needed do not overlap, but I cannot figure out how to get those > > > > > cases out of the HSM risk zone. Ideas? > > > > > > > > > > > > > Jan, > > > > > > > > I wanted to run an idea by you. > > > > > > > > I like your idea to start a clean slate with > > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID > > > > and it would be nice if we could restrict this HSM to use > > > > pre-content events, which is why I was not happy about allowing > > > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, > > > > especially with the known deadlocks. > > > > > > > > Since we already know that we need to generate > > > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and > > > > FAN_PRE_MODIFY(offset,length) for writable mmap(), > > > > we could treat uselib() and execve() the same way and generate > > > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. > > > > > > BTW uselib() is deprecated and there is a patch queued to not generate > > > OPEN_EXEC events for it because it was causing problems (not the generation > > > of events itself but the FMODE_EXEC bit being set in uselib). So I don't > > > think we need to instrument uselib(). > > > > > > > Great. The fewer the better :) > > > > BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to > > call sites of security_mmap_file(), but I see that: > > 1. shmat() has security_mmap_file() - is it relevant? > > Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is > particularly useful for HSM purposes but we should probably have it for > consistency? OK. Actually, I think there is a use case for tmpfs + HSM. HSM is about lazy populating of a local filesystem from a remote slower tier. The fast local tier could be persistent, but could just as well be tmpfs, for a temporary lazy local mirror. This use case reminds overlayfs a bit, where the local tier is the upper fs, but in case of HSM, the "copy up" happens also on read. I do want to limit HSM to "fully featured local fs", so we do not end up with crazy fs doing HSM. The very minimum should be decodable file handle support and unique fsid, but we could also think of more strict requirements(?). Anyway, tmpfs should meet all the "fully featured local fs" requirement. > > > 2. remap_file_pages() calls do_mmap() without security_mmap_file() - > > do we need to cover it? > > Hmm, AFAIU remap_file_pages() just allows you to mess with an existing > mapping so it isn't very interesting from HSM POV? Nope. > > > > > I've already pushed a POC to fan_pre_content branch [1]. > > > > Only sanity tested that nothing else is broken. > > > > I still need to add the mmap hooks and test the new events. > > > > > > > > With this, HSM will have appropriate hooks to fill executable > > > > and library on first access and also fill arbitrary files on open > > > > including the knowledge if the file was opened for write. > > > > > > > > Thoughts? > > > > > > Yeah, I guess this looks sensible. > > > > Cool, so let's see, what is left to do for the plan of > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID? > > > > 1. event->fd is O_PATH mount_fd for open_by_handle_at() > > 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd > > 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or > > implement FAN_REPORT_EVENT_ID and new header format > > > > Anything else? > > Are you ok with 1 and 2? > > I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you > want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2) > is the key question. But if you provide mount_fd with FMODE_NONOTIFY and > have FMODE_NONOTIFY inheritance, then what's the difference to just allow > opens with FMODE_NONOTIFY from the start? I don't think restricting > FMODE_NONOTIFY to inheritance gives any additional strong security > guarantees? > You are right. Just need to think of the right API. It is just very convenient to be getting the FMODE_NONOTIFY from fanotify, but let's think about it. > I understand so far we didn't expose FMODE_NONOTIFY so that people cannot > bypass fanotify permission events. But now we need a sensible way to fill > in the filesystem without deadlocking on our own watches. Maybe exposing > FMODE_NONOTIFY as an open flag is too attractive for misuse so could be > somehow tie it to the HSM app? We were already discussing that we need to > somehow register the HSM app with the filesystem to avoid issues when the > app crashes or so. So maybe we could tie this registration to the ability > of bypassing generation of notification? > Last time we discussed this the conclusion was an API of a group-less default mask, for example: 1. fanotify_mark(FAN_GROUP_DEFAULT, FAN_MARK_ADD | FAN_MARK_MOUNT, FAN_PRE_ACCESS, AT_FDCWD, path); 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS 3. then HSM is started and subscribes to FAN_PRE_ACCESS 4. and then the mount is moved or bind mounted into a path exported to users It is a simple solution that should be easy to implement. But it does not involve "register the HSM app with the filesystem", unless you mean that a process that opens an HSM group (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically be given FMODE_NONOTIFY files? There is one more crazy idea that I was pondering - what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()? The framing is that it is not the filesystem, but fanotify which actually encoded the fsid+fid, so HSM could be asking fanotify to decode them. Technically, the group could keep a unique map from fsid -> sb, then fanotify group could decode an fanotify_event_info_fid buffer to a specific inode on a specific fs. Naturally, those decoded files would be FMODE_NONOTIFY. Too crazy? I am very much open to a simpler suggestion. > > Do you have a preference for 3? > > Yeah. FAN_CLOSE_FD seems as a hack to me (and you're likely to get it wrong > until you've spent quite some time debugging with exec sometimes fails with > ETXTBUSY). So I prefer FAN_REPORT_EVENT_ID and use event id as an > identifier when replying to HSM events. OK. At the end of this week I am going on a one week vacation, so it is unlikely that pre-content events will be ready for 6.9. Perhaps I will be able to get some prep work ready for the merge window, we shall see. Thanks, Amir.
On Mon 12-02-24 16:56:28, Amir Goldstein wrote: > On Mon, Feb 12, 2024 at 2:01 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 08-02-24 21:21:13, Amir Goldstein wrote: > > > On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote: > > > > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote: > > > > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark > > > > > > > > > to deny open of file during the short time that it's content is being > > > > > > > > > punched out. > > > > > > > > > It is quite complicated to explain, but I only used it for denying access, > > > > > > > > > not to fill content and not to write anything to filesystem. > > > > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful > > > > > > > > > to users. > > > > > > > > > > > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM, > > > > > > > > > but mainly I do not have a proof that people will not need it. > > > > > > > > > > > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use > > > > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to > > > > > > > > > deadlock risk zone. > > > > > > > > > > > > > > > > > > Not sure which way to go. > > > > > > > > > > > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO > > > > > > > > > before FAN_PRE_* events, so we can continue this discussion later when > > > > > > > > > I post FAN_PRE_* patches - not for this cycle. > > > > > > > > > > > > > > > > I started to prepare the pre-content events patches for posting and got back > > > > > > > > to this one as well. > > > > > > > > > > > > > > > > Since we had this discussion I have learned of another use case that > > > > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM > > > > > > > > to be exact. > > > > > > > > > > > > > > > > The reason is that unless an executable content is filled at execve() time, > > > > > > > > there is no other opportunity to fill its content without getting -ETXTBSY. > > > > > > > > > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was > > > > > > > thinking whether we could somehow fill in executable (and executed) files on > > > > > > > access but it all seemed too hacky so I agree that we probably have to fill > > > > > > > them in on open. > > > > > > > > > > > > > > > > > > > Normally, I think there will not be a really huge executable(?) > > > > > > If there were huge executables, they would have likely been broken down > > > > > > into smaller loadable libraries which should allow more granular > > > > > > content filling, > > > > > > but I guess there will always be worst case exceptions. > > > > > > > > > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the > > > > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow > > > > > > > > FAN_DENY_ERRNO() with all permission events. > > > > > > > > > > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have > > > > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for > > > > > > > > FAN_CLASS_PRE_CONTENT groups. > > > > > > > > > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That > > > > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in > > > > > > > FAN_CLASS_PRE_CONTENT groups - OK, > > > > > > > > > > > > done that. > > > > > > > > > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM > > > > > > event family when I know that O_TRUCT and O_CREAT call this hook > > > > > > with sb writers held. > > > > > > > > > > > > The irony, is that there is no chance that O_TRUNC will require filling > > > > > > content, same if the file is actually being created by O_CREAT, so the > > > > > > cases where sb writers is actually needed and the case where content > > > > > > filling is needed do not overlap, but I cannot figure out how to get those > > > > > > cases out of the HSM risk zone. Ideas? > > > > > > > > > > > > > > > > Jan, > > > > > > > > > > I wanted to run an idea by you. > > > > > > > > > > I like your idea to start a clean slate with > > > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID > > > > > and it would be nice if we could restrict this HSM to use > > > > > pre-content events, which is why I was not happy about allowing > > > > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events, > > > > > especially with the known deadlocks. > > > > > > > > > > Since we already know that we need to generate > > > > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and > > > > > FAN_PRE_MODIFY(offset,length) for writable mmap(), > > > > > we could treat uselib() and execve() the same way and generate > > > > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped. > > > > > > > > BTW uselib() is deprecated and there is a patch queued to not generate > > > > OPEN_EXEC events for it because it was causing problems (not the generation > > > > of events itself but the FMODE_EXEC bit being set in uselib). So I don't > > > > think we need to instrument uselib(). > > > > > > > > > > Great. The fewer the better :) > > > > > > BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to > > > call sites of security_mmap_file(), but I see that: > > > 1. shmat() has security_mmap_file() - is it relevant? > > > > Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is > > particularly useful for HSM purposes but we should probably have it for > > consistency? > > OK. > > Actually, I think there is a use case for tmpfs + HSM. HSM is about > lazy populating of a local filesystem from a remote slower tier. > The fast local tier could be persistent, but could just as well be tmpfs, > for a temporary lazy local mirror. > > This use case reminds overlayfs a bit, where the local tier is the upper fs, > but in case of HSM, the "copy up" happens also on read. > > I do want to limit HSM to "fully featured local fs", so we do not end up > with crazy fs doing HSM. The very minimum should be decodable > file handle support and unique fsid, but we could also think of more strict > requirements(?). Anyway, tmpfs should meet all the "fully featured local fs" > requirement. Ok, makes sense. > > > > > I've already pushed a POC to fan_pre_content branch [1]. > > > > > Only sanity tested that nothing else is broken. > > > > > I still need to add the mmap hooks and test the new events. > > > > > > > > > > With this, HSM will have appropriate hooks to fill executable > > > > > and library on first access and also fill arbitrary files on open > > > > > including the knowledge if the file was opened for write. > > > > > > > > > > Thoughts? > > > > > > > > Yeah, I guess this looks sensible. > > > > > > Cool, so let's see, what is left to do for the plan of > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID? > > > > > > 1. event->fd is O_PATH mount_fd for open_by_handle_at() > > > 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd > > > 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or > > > implement FAN_REPORT_EVENT_ID and new header format > > > > > > Anything else? > > > Are you ok with 1 and 2? > > > > I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you > > want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2) > > is the key question. But if you provide mount_fd with FMODE_NONOTIFY and > > have FMODE_NONOTIFY inheritance, then what's the difference to just allow > > opens with FMODE_NONOTIFY from the start? I don't think restricting > > FMODE_NONOTIFY to inheritance gives any additional strong security > > guarantees? > > > > You are right. > Just need to think of the right API. > It is just very convenient to be getting the FMODE_NONOTIFY from > fanotify, but let's think about it. > > > I understand so far we didn't expose FMODE_NONOTIFY so that people cannot > > bypass fanotify permission events. But now we need a sensible way to fill > > in the filesystem without deadlocking on our own watches. Maybe exposing > > FMODE_NONOTIFY as an open flag is too attractive for misuse so could be > > somehow tie it to the HSM app? We were already discussing that we need to > > somehow register the HSM app with the filesystem to avoid issues when the > > app crashes or so. So maybe we could tie this registration to the ability > > of bypassing generation of notification? > > > > Last time we discussed this the conclusion was an API of a group-less > default mask, for example: > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > FAN_MARK_ADD | FAN_MARK_MOUNT, > FAN_PRE_ACCESS, AT_FDCWD, path); > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > 4. and then the mount is moved or bind mounted into a path exported to users Yes, this was the process I was talking about. > It is a simple solution that should be easy to implement. > But it does not involve "register the HSM app with the filesystem", > unless you mean that a process that opens an HSM group > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > be given FMODE_NONOTIFY files? Two ideas: What you describe above seems like what the new mount API was intended for? What if we introduced something like an "hsm" mount option which would basically enable calling into pre-content event handlers (for sb without this flag handlers wouldn't be called and you cannot place pre-content marks on such sb). These handlers would return EACCESS unless there's somebody handling events and returning something else. You could then do: fan_fd = fanotify_init() ffd = fsopen() fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) <now you can move the superblock into the mount hierarchy> This would elegantly solve the "what if HSM handler dies" problem as well as cleanly handle the setup. And we don't have to come up with a concept of "default mask". Now we still have the problem how to fill in the filesystem on pre-content event without deadlocking. As I was thinking about it the most elegant solution would IMHO be if the HSM handler could have a private mount flagged so that HSM is excluded from there (or it could place ignore mark on this mount for HSM events). I think we've discarded similar ideas in the past because this is problematic with directory pre-content events because security hooks don't get the mountpoint. But what if we used security_path_* hooks for directory pre-content events? > There is one more crazy idea that I was pondering - > what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()? > The framing is that it is not the filesystem, but fanotify which actually > encoded the fsid+fid, so HSM could be asking fanotify to decode them. > Technically, the group could keep a unique map from fsid -> sb, then > fanotify group could decode an fanotify_event_info_fid buffer to a specific > inode on a specific fs. > Naturally, those decoded files would be FMODE_NONOTIFY. > > Too crazy? It sounds a bit complex and hooking into open_by_handle_at() code for this sounds a bit hacky. But I'm not completely rejecting this possibility if we don't find other options. Honza
> > Last time we discussed this the conclusion was an API of a group-less > > default mask, for example: > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > FAN_PRE_ACCESS, AT_FDCWD, path); > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > 4. and then the mount is moved or bind mounted into a path exported to users > > Yes, this was the process I was talking about. > > > It is a simple solution that should be easy to implement. > > But it does not involve "register the HSM app with the filesystem", > > unless you mean that a process that opens an HSM group > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > be given FMODE_NONOTIFY files? > > Two ideas: What you describe above seems like what the new mount API was > intended for? What if we introduced something like an "hsm" mount option > which would basically enable calling into pre-content event handlers I like that. I forgot that with my suggestion we'd need a path to setup the default mask. > (for sb without this flag handlers wouldn't be called and you cannot place > pre-content marks on such sb). IMO, that limitation (i.e. inside brackets) is too restrictive. In many cases, the user running HSM may not have control over the mount of the filesystem (inside containers?). It is true that HSM without anti-crash protection is less reliable, but I think that it is still useful enough that users will want the option to run it (?). Think of my HttpDirFS demo - it's just a simple lazy mirroring of a website. Even with low reliability I think it is useful (?). > These handlers would return EACCESS unless > there's somebody handling events and returning something else. > > You could then do: > > fan_fd = fanotify_init() > ffd = fsopen() > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > <now you can move the superblock into the mount hierarchy> > Not too bad. I think that "hsm_deny_mask=" mount options would give more flexibility, but I could be convinced otherwise. It's probably not a great idea to be running two different HSMs on the same fs anyway, but if user has an old HSM version installed that handles only pre-content events, I don't think that we want this old version if it happens to be run by mistake, to allow for unsupervised create,rename,delete if the admin wanted to atomically mount a fs that SHOULD be supervised by a v2 HSM that knows how to handle pre-path events. IOW, and "HSM bit" on sb is too broad IMO. > This would elegantly solve the "what if HSM handler dies" problem as well > as cleanly handle the setup. And we don't have to come up with a concept of > "default mask". We can still have a mask, it's just about the API to set it up. > Now we still have the problem how to fill in the filesystem > on pre-content event without deadlocking. As I was thinking about it the > most elegant solution would IMHO be if the HSM handler could have a private > mount flagged so that HSM is excluded from there (or it could place ignore > mark on this mount for HSM events). My HttpDirFS demo does it the other way around - HSM uses a mount without a mark mount - but ignore mark works too. > I think we've discarded similar ideas > in the past because this is problematic with directory pre-content events > because security hooks don't get the mountpoint. But what if we used > security_path_* hooks for directory pre-content events? > No need for security_path_ * hooks. In my POC, the pre-path hooks have the path argument. For people who are not familiar with the term, here is man page draft for "pre-path" events: https://github.com/amir73il/man-pages/commits/fan_pre_path/ This is an out of date branch from the time that I called them FAN_PRE_{CREATE,DELETE,MOVE_*} events: https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868 They are implemented by replacing the mnt_want_write() calls with mnt_want_write_{path,parent,parents}() calls. This was done to make sure that they take the sb write srcu and call the pre-path hook before taking sb writers freeze protection. > > There is one more crazy idea that I was pondering - > > what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()? > > The framing is that it is not the filesystem, but fanotify which actually > > encoded the fsid+fid, so HSM could be asking fanotify to decode them. > > Technically, the group could keep a unique map from fsid -> sb, then > > fanotify group could decode an fanotify_event_info_fid buffer to a specific > > inode on a specific fs. > > Naturally, those decoded files would be FMODE_NONOTIFY. > > > > Too crazy? > > It sounds a bit complex and hooking into open_by_handle_at() code for this > sounds a bit hacky. But I'm not completely rejecting this possibility if we > don't find other options. Your idea sounds better :) Thanks, Amir.
On Thu 15-02-24 17:40:07, Amir Goldstein wrote: > > > Last time we discussed this the conclusion was an API of a group-less > > > default mask, for example: > > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > > FAN_PRE_ACCESS, AT_FDCWD, path); > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > > 4. and then the mount is moved or bind mounted into a path exported to users > > > > Yes, this was the process I was talking about. > > > > > It is a simple solution that should be easy to implement. > > > But it does not involve "register the HSM app with the filesystem", > > > unless you mean that a process that opens an HSM group > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > > be given FMODE_NONOTIFY files? > > > > Two ideas: What you describe above seems like what the new mount API was > > intended for? What if we introduced something like an "hsm" mount option > > which would basically enable calling into pre-content event handlers > > I like that. > I forgot that with my suggestion we'd need a path to setup > the default mask. > > > (for sb without this flag handlers wouldn't be called and you cannot place > > pre-content marks on such sb). > > IMO, that limitation (i.e. inside brackets) is too restrictive. > In many cases, the user running HSM may not have control over the > mount of the filesystem (inside containers?). > It is true that HSM without anti-crash protection is less reliable, > but I think that it is still useful enough that users will want the > option to run it (?). > > Think of my HttpDirFS demo - it's just a simple lazy mirroring > of a website. Even with low reliability I think it is useful (?). Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have a deadlock-free way to fill in the file contents because that requires a special mountpoint? > > These handlers would return EACCESS unless > > there's somebody handling events and returning something else. > > > > You could then do: > > > > fan_fd = fanotify_init() > > ffd = fsopen() > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > <now you can move the superblock into the mount hierarchy> > > Not too bad. > I think that "hsm_deny_mask=" mount options would give more flexibility, > but I could be convinced otherwise. > > It's probably not a great idea to be running two different HSMs on the same > fs anyway, but if user has an old HSM version installed that handles only > pre-content events, I don't think that we want this old version if it happens > to be run by mistake, to allow for unsupervised create,rename,delete if the > admin wanted to atomically mount a fs that SHOULD be supervised by a > v2 HSM that knows how to handle pre-path events. > > IOW, and "HSM bit" on sb is too broad IMO. OK. So "hsm_deny_mask=" would esentially express events that we require HSM to handle, the rest would just be accepted by default. That makes sense. The only thing I kind of dislike is that this ties fanotify API with mount API. So perhaps hsm_deny_mask should be specified as a string? Like preaccess,premodify,prelookup,... and transformed into a bitmask only inside the kernel? It gives us more maneuvering space for the future. > > This would elegantly solve the "what if HSM handler dies" problem as well > > as cleanly handle the setup. And we don't have to come up with a concept of > > "default mask". > > We can still have a mask, it's just about the API to set it up. > > > Now we still have the problem how to fill in the filesystem > > on pre-content event without deadlocking. As I was thinking about it the > > most elegant solution would IMHO be if the HSM handler could have a private > > mount flagged so that HSM is excluded from there (or it could place ignore > > mark on this mount for HSM events). > > My HttpDirFS demo does it the other way around - HSM uses a mount > without a mark mount - but ignore mark works too. Yes, the HSM handler is free to setup whatever works for it. I was just thinking to make sure there is at least one sane way how to do it :) > > I think we've discarded similar ideas > > in the past because this is problematic with directory pre-content events > > because security hooks don't get the mountpoint. But what if we used > > security_path_* hooks for directory pre-content events? > > No need for security_path_ * hooks. > In my POC, the pre-path hooks have the path argument. > For people who are not familiar with the term, here is man page draft > for "pre-path" events: > https://github.com/amir73il/man-pages/commits/fan_pre_path/ > > This is an out of date branch from the time that I called them > FAN_PRE_{CREATE,DELETE,MOVE_*} events: > https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868 > > They are implemented by replacing the mnt_want_write() calls > with mnt_want_write_{path,parent,parents}() calls. > > This was done to make sure that they take the sb write srcu and call > the pre-path hook before taking sb writers freeze protection. Ok, so AFAIU you agree we don't need to rely on FMODE_NONOTIFY for HSM and can just use access through dedicated mount for filling in the filesystem? Honza
On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote: > > > > Last time we discussed this the conclusion was an API of a group-less > > > > default mask, for example: > > > > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > > > FAN_PRE_ACCESS, AT_FDCWD, path); > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > > > 4. and then the mount is moved or bind mounted into a path exported to users > > > > > > Yes, this was the process I was talking about. > > > > > > > It is a simple solution that should be easy to implement. > > > > But it does not involve "register the HSM app with the filesystem", > > > > unless you mean that a process that opens an HSM group > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > > > be given FMODE_NONOTIFY files? > > > > > > Two ideas: What you describe above seems like what the new mount API was > > > intended for? What if we introduced something like an "hsm" mount option > > > which would basically enable calling into pre-content event handlers > > > > I like that. > > I forgot that with my suggestion we'd need a path to setup > > the default mask. > > > > > (for sb without this flag handlers wouldn't be called and you cannot place > > > pre-content marks on such sb). > > > > IMO, that limitation (i.e. inside brackets) is too restrictive. > > In many cases, the user running HSM may not have control over the > > mount of the filesystem (inside containers?). > > It is true that HSM without anti-crash protection is less reliable, > > but I think that it is still useful enough that users will want the > > option to run it (?). > > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring > > of a website. Even with low reliability I think it is useful (?). > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have > a deadlock-free way to fill in the file contents because that requires a > special mountpoint? > True, unless we also keep the FMODE_NONOTIFY event->fd for the simple cases. I'll need to think about this some more. > > > These handlers would return EACCESS unless > > > there's somebody handling events and returning something else. > > > > > > You could then do: > > > > > > fan_fd = fanotify_init() > > > ffd = fsopen() > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > > <now you can move the superblock into the mount hierarchy> > > > > Not too bad. > > I think that "hsm_deny_mask=" mount options would give more flexibility, > > but I could be convinced otherwise. > > > > It's probably not a great idea to be running two different HSMs on the same > > fs anyway, but if user has an old HSM version installed that handles only > > pre-content events, I don't think that we want this old version if it happens > > to be run by mistake, to allow for unsupervised create,rename,delete if the > > admin wanted to atomically mount a fs that SHOULD be supervised by a > > v2 HSM that knows how to handle pre-path events. > > > > IOW, and "HSM bit" on sb is too broad IMO. > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM > to handle, the rest would just be accepted by default. That makes sense. Yes. > The only thing I kind of dislike is that this ties fanotify API with mount > API. So perhaps hsm_deny_mask should be specified as a string? Like > preaccess,premodify,prelookup,... and transformed into a bitmask only > inside the kernel? It gives us more maneuvering space for the future. > Urgh. I see what you are saying, but this seems so ugly to me. I have a strong feeling that we are trying to reinvent something and that we are going to reinvent it badly. I need to look for precedents, maybe in other OS. I believe that in Windows, there is an API to register as a Cloud Engine Provider, so there is probably a way to have multiple HSMs working on different sections of the filesystem in some reliable crash safe manner. > > > This would elegantly solve the "what if HSM handler dies" problem as well > > > as cleanly handle the setup. And we don't have to come up with a concept of > > > "default mask". > > > > We can still have a mask, it's just about the API to set it up. > > > > > Now we still have the problem how to fill in the filesystem > > > on pre-content event without deadlocking. As I was thinking about it the > > > most elegant solution would IMHO be if the HSM handler could have a private > > > mount flagged so that HSM is excluded from there (or it could place ignore > > > mark on this mount for HSM events). > > > > My HttpDirFS demo does it the other way around - HSM uses a mount > > without a mark mount - but ignore mark works too. > > Yes, the HSM handler is free to setup whatever works for it. I was just > thinking to make sure there is at least one sane way how to do it :) > Yeh, we need to write "best practice" guidelines. > > > I think we've discarded similar ideas > > > in the past because this is problematic with directory pre-content events > > > because security hooks don't get the mountpoint. But what if we used > > > security_path_* hooks for directory pre-content events? > > > > No need for security_path_ * hooks. > > In my POC, the pre-path hooks have the path argument. > > For people who are not familiar with the term, here is man page draft > > for "pre-path" events: > > https://github.com/amir73il/man-pages/commits/fan_pre_path/ > > > > This is an out of date branch from the time that I called them > > FAN_PRE_{CREATE,DELETE,MOVE_*} events: > > https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868 > > > > They are implemented by replacing the mnt_want_write() calls > > with mnt_want_write_{path,parent,parents}() calls. > > > > This was done to make sure that they take the sb write srcu and call > > the pre-path hook before taking sb writers freeze protection. > > Ok, so AFAIU you agree we don't need to rely on FMODE_NONOTIFY for HSM and > can just use access through dedicated mount for filling in the filesystem? > It seems like a decent and simple way to avoid difficult questions, so I will try to start with that... whenever I manage to context switch back ;) Thanks, Amir.
On Tue 27-02-24 21:42:37, Amir Goldstein wrote: > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote: > > > > > Last time we discussed this the conclusion was an API of a group-less > > > > > default mask, for example: > > > > > > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > > > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > > > > FAN_PRE_ACCESS, AT_FDCWD, path); > > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > > > > 4. and then the mount is moved or bind mounted into a path exported to users > > > > > > > > Yes, this was the process I was talking about. > > > > > > > > > It is a simple solution that should be easy to implement. > > > > > But it does not involve "register the HSM app with the filesystem", > > > > > unless you mean that a process that opens an HSM group > > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > > > > be given FMODE_NONOTIFY files? > > > > > > > > Two ideas: What you describe above seems like what the new mount API was > > > > intended for? What if we introduced something like an "hsm" mount option > > > > which would basically enable calling into pre-content event handlers > > > > > > I like that. > > > I forgot that with my suggestion we'd need a path to setup > > > the default mask. > > > > > > > (for sb without this flag handlers wouldn't be called and you cannot place > > > > pre-content marks on such sb). > > > > > > IMO, that limitation (i.e. inside brackets) is too restrictive. > > > In many cases, the user running HSM may not have control over the > > > mount of the filesystem (inside containers?). > > > It is true that HSM without anti-crash protection is less reliable, > > > but I think that it is still useful enough that users will want the > > > option to run it (?). > > > > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring > > > of a website. Even with low reliability I think it is useful (?). > > > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have > > a deadlock-free way to fill in the file contents because that requires a > > special mountpoint? > > True, unless we also keep the FMODE_NONOTIFY event->fd > for the simple cases. I'll need to think about this some more. Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify group with HSM events need to be somehow priviledged operation (currently it requires CAP_SYS_ADMIN). So the more I think about it the less obvious the "unpriviledged" usecase seems to be. > > > > These handlers would return EACCESS unless > > > > there's somebody handling events and returning something else. > > > > > > > > You could then do: > > > > > > > > fan_fd = fanotify_init() > > > > ffd = fsopen() > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > > > <now you can move the superblock into the mount hierarchy> > > > > > > Not too bad. > > > I think that "hsm_deny_mask=" mount options would give more flexibility, > > > but I could be convinced otherwise. > > > > > > It's probably not a great idea to be running two different HSMs on the same > > > fs anyway, but if user has an old HSM version installed that handles only > > > pre-content events, I don't think that we want this old version if it happens > > > to be run by mistake, to allow for unsupervised create,rename,delete if the > > > admin wanted to atomically mount a fs that SHOULD be supervised by a > > > v2 HSM that knows how to handle pre-path events. > > > > > > IOW, and "HSM bit" on sb is too broad IMO. > > > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM > > to handle, the rest would just be accepted by default. That makes sense. > > Yes. > > > The only thing I kind of dislike is that this ties fanotify API with mount > > API. So perhaps hsm_deny_mask should be specified as a string? Like > > preaccess,premodify,prelookup,... and transformed into a bitmask only > > inside the kernel? It gives us more maneuvering space for the future. > > > > Urgh. I see what you are saying, but this seems so ugly to me. > I have a strong feeling that we are trying to reinvent something > and that we are going to reinvent it badly. > I need to look for precedents, maybe in other OS. > I believe that in Windows, there is an API to register as a > Cloud Engine Provider, so there is probably a way to have multiple HSMs > working on different sections of the filesystem in some reliable > crash safe manner. OK, let's see what other's have came up with :) Honza
On Mon, Mar 04, 2024 at 11:33:37AM +0100, Jan Kara wrote: > On Tue 27-02-24 21:42:37, Amir Goldstein wrote: > > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote: > > > > > > Last time we discussed this the conclusion was an API of a group-less > > > > > > default mask, for example: > > > > > > > > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > > > > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > > > > > FAN_PRE_ACCESS, AT_FDCWD, path); > > > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > > > > > 4. and then the mount is moved or bind mounted into a path exported to users > > > > > > > > > > Yes, this was the process I was talking about. > > > > > > > > > > > It is a simple solution that should be easy to implement. > > > > > > But it does not involve "register the HSM app with the filesystem", > > > > > > unless you mean that a process that opens an HSM group > > > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > > > > > be given FMODE_NONOTIFY files? > > > > > > > > > > Two ideas: What you describe above seems like what the new mount API was > > > > > intended for? What if we introduced something like an "hsm" mount option > > > > > which would basically enable calling into pre-content event handlers > > > > > > > > I like that. > > > > I forgot that with my suggestion we'd need a path to setup > > > > the default mask. > > > > > > > > > (for sb without this flag handlers wouldn't be called and you cannot place > > > > > pre-content marks on such sb). > > > > > > > > IMO, that limitation (i.e. inside brackets) is too restrictive. > > > > In many cases, the user running HSM may not have control over the > > > > mount of the filesystem (inside containers?). > > > > It is true that HSM without anti-crash protection is less reliable, > > > > but I think that it is still useful enough that users will want the > > > > option to run it (?). > > > > > > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring > > > > of a website. Even with low reliability I think it is useful (?). > > > > > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have > > > a deadlock-free way to fill in the file contents because that requires a > > > special mountpoint? > > > > True, unless we also keep the FMODE_NONOTIFY event->fd > > for the simple cases. I'll need to think about this some more. > > Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify > group with HSM events need to be somehow priviledged operation (currently > it requires CAP_SYS_ADMIN). So the more I think about it the less obvious > the "unpriviledged" usecase seems to be. > > > > > > These handlers would return EACCESS unless > > > > > there's somebody handling events and returning something else. > > > > > > > > > > You could then do: > > > > > > > > > > fan_fd = fanotify_init() > > > > > ffd = fsopen() > > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) Do you absolutely need that superblock to exist? It would probably be better if you extended struct fs_context to carry the hsm information and then you later attach it to the superblock once it's created. > > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > > > > <now you can move the superblock into the mount hierarchy> > > > > > > > > Not too bad. > > > > I think that "hsm_deny_mask=" mount options would give more flexibility, > > > > but I could be convinced otherwise. > > > > > > > > It's probably not a great idea to be running two different HSMs on the same > > > > fs anyway, but if user has an old HSM version installed that handles only > > > > pre-content events, I don't think that we want this old version if it happens > > > > to be run by mistake, to allow for unsupervised create,rename,delete if the > > > > admin wanted to atomically mount a fs that SHOULD be supervised by a > > > > v2 HSM that knows how to handle pre-path events. > > > > > > > > IOW, and "HSM bit" on sb is too broad IMO. > > > > > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM > > > to handle, the rest would just be accepted by default. That makes sense. > > > > Yes. > > > > > The only thing I kind of dislike is that this ties fanotify API with mount > > > API. So perhaps hsm_deny_mask should be specified as a string? Like Yeah. I don't consider that in itself a dealbreaker though. The messier part is that the proposal above requires that the superblock exists before you can set up fanotify. If you register an HSM during mount then you should set everything as regular mount options before superblock creation. If the superblock is created the HSM should go into effect implicitly instead of requiring another fanotify call, imho. But you could also do it the other way around: fanotify_something_something(fan_fd, fs_context_fd, ...) { fdget(fs_context_fd); if (f.file->f_op != &fscontext_fops) return -EINVAL; ret = mutex_lock_interruptible(&fc->uapi_mutex); if (ret == 0) { if (fc->phase != FS_CONTEXT_CREATE_PARAMS && fc->phase != FS_CONTEXT_RECONF_PARAMS) return -EBUSY; /* do your thing */ mutex_unlock(&fc->uapi_mutex); } } then you don't need to even use the mount api system calls and can go through the regular fanotify system calls. You'd just need a sensible helper that is exposed internally to fanotify to register events during fscontext creation. I'm not sure what would be better. > > > preaccess,premodify,prelookup,... and transformed into a bitmask only > > > inside the kernel? It gives us more maneuvering space for the future. > > > > > > > Urgh. I see what you are saying, but this seems so ugly to me. Go look at what bpffs did for their delegation support... Exactly that iirc. :) > > I have a strong feeling that we are trying to reinvent something > > and that we are going to reinvent it badly. > > I need to look for precedents, maybe in other OS. > > I believe that in Windows, there is an API to register as a > > Cloud Engine Provider, so there is probably a way to have multiple HSMs > > working on different sections of the filesystem in some reliable > > crash safe manner. > > OK, let's see what other's have came up with :)
On Mon, Mar 4, 2024 at 12:33 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 27-02-24 21:42:37, Amir Goldstein wrote: > > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote: > > > > > > Last time we discussed this the conclusion was an API of a group-less > > > > > > default mask, for example: > > > > > > > > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT, > > > > > > FAN_MARK_ADD | FAN_MARK_MOUNT, > > > > > > FAN_PRE_ACCESS, AT_FDCWD, path); > > > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS > > > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS > > > > > > 4. and then the mount is moved or bind mounted into a path exported to users > > > > > > > > > > Yes, this was the process I was talking about. > > > > > > > > > > > It is a simple solution that should be easy to implement. > > > > > > But it does not involve "register the HSM app with the filesystem", > > > > > > unless you mean that a process that opens an HSM group > > > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically > > > > > > be given FMODE_NONOTIFY files? > > > > > > > > > > Two ideas: What you describe above seems like what the new mount API was > > > > > intended for? What if we introduced something like an "hsm" mount option > > > > > which would basically enable calling into pre-content event handlers > > > > > > > > I like that. > > > > I forgot that with my suggestion we'd need a path to setup > > > > the default mask. > > > > > > > > > (for sb without this flag handlers wouldn't be called and you cannot place > > > > > pre-content marks on such sb). > > > > > > > > IMO, that limitation (i.e. inside brackets) is too restrictive. > > > > In many cases, the user running HSM may not have control over the > > > > mount of the filesystem (inside containers?). > > > > It is true that HSM without anti-crash protection is less reliable, > > > > but I think that it is still useful enough that users will want the > > > > option to run it (?). > > > > > > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring > > > > of a website. Even with low reliability I think it is useful (?). > > > > > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have > > > a deadlock-free way to fill in the file contents because that requires a > > > special mountpoint? > > > > True, unless we also keep the FMODE_NONOTIFY event->fd > > for the simple cases. I'll need to think about this some more. > > Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify > group with HSM events need to be somehow priviledged operation (currently > it requires CAP_SYS_ADMIN). So the more I think about it the less obvious > the "unpriviledged" usecase seems to be. > ok. Let's put this one on ice for now. > > > > > These handlers would return EACCESS unless > > > > > there's somebody handling events and returning something else. > > > > > > > > > > You could then do: > > > > > > > > > > fan_fd = fanotify_init() > > > > > ffd = fsopen() > > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > > > > <now you can move the superblock into the mount hierarchy> > > > > > > > > Not too bad. > > > > I think that "hsm_deny_mask=" mount options would give more flexibility, > > > > but I could be convinced otherwise. > > > > > > > > It's probably not a great idea to be running two different HSMs on the same > > > > fs anyway, but if user has an old HSM version installed that handles only > > > > pre-content events, I don't think that we want this old version if it happens > > > > to be run by mistake, to allow for unsupervised create,rename,delete if the > > > > admin wanted to atomically mount a fs that SHOULD be supervised by a > > > > v2 HSM that knows how to handle pre-path events. > > > > > > > > IOW, and "HSM bit" on sb is too broad IMO. > > > > > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM > > > to handle, the rest would just be accepted by default. That makes sense. > > > > Yes. > > > > > The only thing I kind of dislike is that this ties fanotify API with mount > > > API. So perhaps hsm_deny_mask should be specified as a string? Like > > > preaccess,premodify,prelookup,... and transformed into a bitmask only > > > inside the kernel? It gives us more maneuvering space for the future. > > > > > > > Urgh. I see what you are saying, but this seems so ugly to me. > > I have a strong feeling that we are trying to reinvent something > > and that we are going to reinvent it badly. > > I need to look for precedents, maybe in other OS. > > I believe that in Windows, there is an API to register as a > > Cloud Engine Provider, so there is probably a way to have multiple HSMs > > working on different sections of the filesystem in some reliable > > crash safe manner. > > OK, let's see what other's have came up with :) From my very basic Google research (did not ask Chat GPT yet ;)) I think that MacOS FSEvents do not have blocking permission events at all, so there is no built-in HSM API. The Windows Cloud Sync Engine API: https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine Does allow registring different "Storage namespace providers". AFAICT, the persistence of "Place holder" files is based on NTFS "Reparse points", which are a long time native concept which allows registering a persistent hook on a file to be handled by a specific Windows driver. So for example, a Dropbox place holder file, is a file with "reparse point" that has some label to direct the read/write calls to the Windows Cloud Sync Engine driver and a sub-label to direct the handling of the upcall by the Dropbox CloudSync Engine service. I do not want to deal with "persistent fanotify marks" at this time, so maybe something like: fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0) fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) Add support ioctls in fanotify_ioctl(): - FANOTIFY_IOC_HSMID - FANOTIFY_IOC_HSMVER And require that a group with matching hsmid and recent hsmver has a live filesystem mark on the sb. If this is an acceptable API for a single crash-safe HSM provider, then the question becomes: How would we extend this to multiple crash-safe HSM providers in the future? Or maybe we do not need to support multiple HSM groups per sb? Maybe in the future a generic service could be implemented to delegate different HSM modules, e.g.: fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0) fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) And a generic "cloudsync" service could be in charge of registration of "cloudsync" engines and dispatching the pre-content event to the appropriate module based on path (i.e. under the dropbox folder). Once this gets passed NACKs from fs developers I'd like to pull in some distro people to the discussion and maybe bring this up as a topic discussion for LSFMM if we feel that there is something to discuss. Thoughts? Amir.
On Mon 15-04-24 17:23:40, Amir Goldstein wrote: > On Mon, Mar 4, 2024 at 12:33 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 27-02-24 21:42:37, Amir Goldstein wrote: > > > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote: > > > > > > These handlers would return EACCESS unless > > > > > > there's somebody handling events and returning something else. > > > > > > > > > > > > You could then do: > > > > > > > > > > > > fan_fd = fanotify_init() > > > > > > ffd = fsopen() > > > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0) > > > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0) > > > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) > > > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL) > > > > > > <now you can move the superblock into the mount hierarchy> > > > > > > > > > > Not too bad. > > > > > I think that "hsm_deny_mask=" mount options would give more flexibility, > > > > > but I could be convinced otherwise. > > > > > > > > > > It's probably not a great idea to be running two different HSMs on the same > > > > > fs anyway, but if user has an old HSM version installed that handles only > > > > > pre-content events, I don't think that we want this old version if it happens > > > > > to be run by mistake, to allow for unsupervised create,rename,delete if the > > > > > admin wanted to atomically mount a fs that SHOULD be supervised by a > > > > > v2 HSM that knows how to handle pre-path events. > > > > > > > > > > IOW, and "HSM bit" on sb is too broad IMO. > > > > > > > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM > > > > to handle, the rest would just be accepted by default. That makes sense. > > > > > > Yes. > > > > > > > The only thing I kind of dislike is that this ties fanotify API with mount > > > > API. So perhaps hsm_deny_mask should be specified as a string? Like > > > > preaccess,premodify,prelookup,... and transformed into a bitmask only > > > > inside the kernel? It gives us more maneuvering space for the future. > > > > > > > > > > Urgh. I see what you are saying, but this seems so ugly to me. > > > I have a strong feeling that we are trying to reinvent something > > > and that we are going to reinvent it badly. > > > I need to look for precedents, maybe in other OS. > > > I believe that in Windows, there is an API to register as a > > > Cloud Engine Provider, so there is probably a way to have multiple HSMs > > > working on different sections of the filesystem in some reliable > > > crash safe manner. > > > > OK, let's see what other's have came up with :) > > From my very basic Google research (did not ask Chat GPT yet ;)) > I think that MacOS FSEvents do not have blocking permission events at all, > so there is no built-in HSM API. > > The Windows Cloud Sync Engine API: > https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine > Does allow registring different "Storage namespace providers". > AFAICT, the persistence of "Place holder" files is based on NTFS > "Reparse points", > which are a long time native concept which allows registering a persistent > hook on a file to be handled by a specific Windows driver. > > So for example, a Dropbox place holder file, is a file with "reparse point" > that has some label to direct the read/write calls to the Windows > Cloud Sync Engine > driver and a sub-label to direct the handling of the upcall by the Dropbox > CloudSync Engine service. OK, so AFAIU they implement HSM directly in the filesystem which is somewhat different situation from what we are trying to do. > I do not want to deal with "persistent fanotify marks" at this time, so > maybe something like: > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0) > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) > > Add support ioctls in fanotify_ioctl(): > - FANOTIFY_IOC_HSMID > - FANOTIFY_IOC_HSMVER What would these do? Set HSMID & HSMVER for fsnotify_group identified by 'file'? BTW I'm not so convinced about the 'version' thing. I have hard time to remember an example where the versioning in the API actually ended up being useful. I also expect tight coupling between userspace mounting the filesystem and setting up HSM so it is hard to imagine some wrong version of HSM provider would be "accidentally" started for the filesystem. > And require that a group with matching hsmid and recent hsmver has a live > filesystem mark on the sb. I'm not quite following here. We'd require matching fsnotify group for what? For mounting the fs? For not returning EPERM from all pre-op handlers? Either way that doesn't make sense to me as its unclear how userspace would be able to place the mark... But there's a way around that - since the HSM app will have its private non-HSM mount for filling in contents, it can first create that mount, place filesystem marks though it and then remount the superblock with hsmid mount option and create the public mount. But I'm not sure if you meant this or something else... > If this is an acceptable API for a single crash-safe HSM provider, then the > question becomes: > How would we extend this to multiple crash-safe HSM providers in the future? Something like: fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox,cloudsync,httpfs", 0) means all of them are required to have a filesystem mark? > Or maybe we do not need to support multiple HSM groups per sb? > Maybe in the future a generic service could be implemented to > delegate different HSM modules, e.g.: > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0) > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) > > And a generic "cloudsync" service could be in charge of > registration of "cloudsync" engines and dispatching the pre-content > event to the appropriate module based on path (i.e. under the dropbox folder). > > Once this gets passed NACKs from fs developers I'd like to pull in > some distro people to the discussion and maybe bring this up as > a topic discussion for LSFMM if we feel that there is something to discuss. I guess a short talk (lighting talk?) what we are planning to do would be interesting so that people are aware. At this point I don't think we have particular problems to discuss that would be interesting for the whole fs crowd for a full slot... Honza
> > The Windows Cloud Sync Engine API: > > https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine > > Does allow registring different "Storage namespace providers". > > AFAICT, the persistence of "Place holder" files is based on NTFS > > "Reparse points", > > which are a long time native concept which allows registering a persistent > > hook on a file to be handled by a specific Windows driver. > > > > So for example, a Dropbox place holder file, is a file with "reparse point" > > that has some label to direct the read/write calls to the Windows > > Cloud Sync Engine > > driver and a sub-label to direct the handling of the upcall by the Dropbox > > CloudSync Engine service. > > OK, so AFAIU they implement HSM directly in the filesystem which is somewhat > different situation from what we are trying to do. > Technically, I think that Reparse points Win32 driver hooks are a generic WIN32 API which NTFS implements, but that doesn't matter. IIUC, it is equivalent to having support for xattr "security.hsm.dropbox" that fanotify would know how to intercept as a persistent mark to be handled by "dropbox" hsm group or return EPERM. > > I do not want to deal with "persistent fanotify marks" at this time, so > > maybe something like: > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0) > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) > > > > Add support ioctls in fanotify_ioctl(): > > - FANOTIFY_IOC_HSMID > > - FANOTIFY_IOC_HSMVER > > What would these do? Set HSMID & HSMVER for fsnotify_group identified by > 'file'? BTW I'm not so convinced about the 'version' thing. I have hard > time to remember an example where the versioning in the API actually ended > up being useful. I also expect tight coupling between userspace mounting > the filesystem and setting up HSM so it is hard to imagine some wrong > version of HSM provider would be "accidentally" started for the > filesystem. > ok. worse case, can alway switch to hsmid "dropboxv2" > > And require that a group with matching hsmid and recent hsmver has a live > > filesystem mark on the sb. > > I'm not quite following here. We'd require matching fsnotify group for > what? For mounting the fs? For not returning EPERM from all pre-op > handlers? Either way that doesn't make sense to me as its unclear how > userspace would be able to place the mark... But there's a way around that > - since the HSM app will have its private non-HSM mount for filling in > contents, it can first create that mount, place filesystem marks though > it and then remount the superblock with hsmid mount option and create the > public mount. But I'm not sure if you meant this or something else... > I haven't thought of the mechanics yet just the definition: - An sb with hsm="XXX" returns EPERM for pre-content events unless there is an sb mark from a group that is identified as hsm "XXX" I don't see a problem with mounting the fs first and only then setting up the sb mark on the root of the fs (which does not require a pre-lookup event). When the hsm service is restarted, it is going to need to re-set the sb mark on the hsm="XXX" sb anyway. > > If this is an acceptable API for a single crash-safe HSM provider, then the > > question becomes: > > How would we extend this to multiple crash-safe HSM providers in the future? > > Something like: > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox,cloudsync,httpfs", 0) > > means all of them are required to have a filesystem mark? > Yeh, it's an option. I have a trauma from comma separated values in overlayfs mount options, but maybe it's fine. The main API question would be, regardless of single or multi hsm, whether hsm="" value should be reconfigurable (probably yes). > > Or maybe we do not need to support multiple HSM groups per sb? > > Maybe in the future a generic service could be implemented to > > delegate different HSM modules, e.g.: > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0) > > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0) > > > > And a generic "cloudsync" service could be in charge of > > registration of "cloudsync" engines and dispatching the pre-content > > event to the appropriate module based on path (i.e. under the dropbox folder). > > > > Once this gets passed NACKs from fs developers I'd like to pull in > > some distro people to the discussion and maybe bring this up as > > a topic discussion for LSFMM if we feel that there is something to discuss. > > I guess a short talk (lighting talk?) what we are planning to do would be > interesting so that people are aware. At this point I don't think we have > particular problems to discuss that would be interesting for the whole fs > crowd for a full slot... Agree. Thanks, Amir.
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 1e4def21811e..87798fdc06e7 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -256,18 +256,21 @@ static int fanotify_get_response(struct fsnotify_group *group, } /* userspace responded, convert to something usable */ - switch (event->response & FANOTIFY_RESPONSE_ACCESS) { + switch (FAN_RESPONSE_ACCESS(event->response)) { case FAN_ALLOW: ret = 0; break; case FAN_DENY: + /* userspace can provide errno other than EPERM */ + ret = -(FAN_RESPONSE_ERRNO(event->response) ?: EPERM); + break; default: ret = -EPERM; } /* Check if the response should be audited */ if (event->response & FAN_AUDIT) - audit_fanotify(event->response & ~FAN_AUDIT, + audit_fanotify(FAN_RESPONSE_ACCESS(event->response), &event->audit_rule); pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__, diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c index f83e7cc5ccf2..6b17d33a06aa 100644 --- a/fs/notify/fanotify/fanotify_user.c +++ b/fs/notify/fanotify/fanotify_user.c @@ -336,11 +336,12 @@ static int process_access_response(struct fsnotify_group *group, struct fanotify_perm_event *event; int fd = response_struct->fd; u32 response = response_struct->response; + int errno = FAN_RESPONSE_ERRNO(response); int ret = info_len; struct fanotify_response_info_audit_rule friar; - pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__, - group, fd, response, info, info_len); + pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n", + __func__, group, fd, response, errno, info, info_len); /* * make sure the response is valid, if invalid we do nothing and either * userspace can send a valid response or we will clean it up after the @@ -349,8 +350,11 @@ static int process_access_response(struct fsnotify_group *group, if (response & ~FANOTIFY_RESPONSE_VALID_MASK) return -EINVAL; - switch (response & FANOTIFY_RESPONSE_ACCESS) { + switch (FAN_RESPONSE_ACCESS(response)) { case FAN_ALLOW: + if (errno) + return -EINVAL; + break; case FAN_DENY: break; default: diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h index 4f1c4f603118..2004a4d3b1dc 100644 --- a/include/linux/fanotify.h +++ b/include/linux/fanotify.h @@ -125,7 +125,14 @@ /* These masks check for invalid bits in permission responses. */ #define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY) #define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO) -#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS) +#define FANOTIFY_RESPONSE_ERRNO (FAN_ERRNO_MASK << FAN_ERRNO_SHIFT) +#define FANOTIFY_RESPONSE_VALID_MASK \ + (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \ + FANOTIFY_RESPONSE_ERRNO) + +/* errno other than EPERM can specified in upper byte of deny response */ +#define FAN_RESPONSE_ACCESS(res) ((res) & FANOTIFY_RESPONSE_ACCESS) +#define FAN_RESPONSE_ERRNO(res) ((int)((res) >> FAN_ERRNO_SHIFT)) /* Do not use these old uapi constants internally */ #undef FAN_ALL_CLASS_BITS diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h index cd14c94e9a1e..a8bb25c99dd1 100644 --- a/include/uapi/linux/fanotify.h +++ b/include/uapi/linux/fanotify.h @@ -223,6 +223,13 @@ struct fanotify_response_info_audit_rule { /* Legit userspace responses to a _PERM event */ #define FAN_ALLOW 0x01 #define FAN_DENY 0x02 +/* errno other than EPERM can specified in upper byte of deny response */ +#define FAN_ERRNO_BITS 8 +#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS) +#define FAN_ERRNO_MASK ((1 << FAN_ERRNO_BITS) - 1) +#define FAN_DENY_ERRNO(err) \ + (FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT)) + #define FAN_AUDIT 0x10 /* Bitmask to create audit record for result */ #define FAN_INFO 0x20 /* Bitmask to indicate additional information */
With FAN_DENY response, user trying to perform the filesystem operation gets an error with errno set to EPERM. It is useful for hierarchical storage management (HSM) service to be able to deny access for reasons more diverse than EPERM, for example EAGAIN, if HSM could retry the operation later. Allow userspace to response to permission events with the response value FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error. The change in fanotify_response is backward compatible, because errno is written in the high 8 bits of the 32bit response field and old kernels reject respose value with high bits set. Signed-off-by: Amir Goldstein <amir73il@gmail.com> --- Jan, I remember discussing this API change with you and I even remember that you preferred to send the errno as an extra info record with the response, but I could not find any private or public email about this. So let that be my first public proposal of the FAN_DENY_ERRNO(errno) API. I would like you to recondsider the inline errno proposal vs. an extra errno info record, seeing how simple and fundumental the inline errno response is. I should mention that I was specifically asked about providing this functionality by developers from Meta, because it is needed for their use case. What I am not sure about is whether we should allow FAN_DENY_ERRNO response to any permission event, also the legacy ones, or we should limit the FAN_DENY_ERRNO response to new HSM events or new HSM group class (FAN_CLASS_PRE_PATH). I was aiming at the latter option, but I can't realy think of a good enough reason to deny this functionality from legacy permission events as the way that the API is extended is fully backward compat. If you also think that this could be useful for legacy permission events and if the patch looks reasonable to you, then you may go ahead and consider it for 6.8 already as it is independent from the rest of the HSM event patches. Thoughts? Thanks, Amir. fs/notify/fanotify/fanotify.c | 7 +++++-- fs/notify/fanotify/fanotify_user.c | 10 +++++++--- include/linux/fanotify.h | 9 ++++++++- include/uapi/linux/fanotify.h | 7 +++++++ 4 files changed, 27 insertions(+), 6 deletions(-)