Message ID: 20211116220742.584975-1-krisman@collabora.com
Series: shmem: Notify user space when file system is full
On Wed, Nov 17, 2021 at 12:07 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> FS_ERROR is a fsnotify event type used by monitoring tools to detect
> conditions that might require intervention from a recovery tool or
> from sysadmins. This patch series enables tmpfs to report an event
> when an operation fails because the file system is full.
>
> It attempts to only report events when the filesystem is really full,
> instead of errors caused by memory pressure. The first patch prepares
> the terrain by detecting these two different conditions, and the
> second patch actually adds the event triggers.

Hi Gabriel,

Two things bother me about this proposal.

One is that it makes more sense IMO to report ENOSPC events from vfs
code. Why should the requirement to monitor ENOSPC conditions be
specific to tmpfs? Especially since, as I mentioned, there are already
wrappers in place to report writeback errors on an inode
(mapping_set_error), where the fsnotify hook can fit nicely. I
understand that you wanted to differentiate errors caused by memory
pressure, but I don't understand why it makes sense for a filesystem
monitor to get different feedback than the writing application.

The second thing that bothers me is that the ENOSPC condition should
not be reported on the same event mask as a filesystem corruption
condition, because it seems like a valid use case for a filesystem
monitor to want to be notified about corruption but not about ENOSPC.
Therefore, I suggested using the event flag FS_WB_ERROR for writeback
errors. Other than the event mask, everything else is the same; it
just provides the UAPI flexibility to subscribe to one without the
other.

Thanks,
Amir.
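As an illustration of the hook placement suggested above, a rough
sketch follows. The surrounding logic mirrors the real
mapping_set_error() from include/linux/pagemap.h; fsnotify_wb_error()
and the FS_WB_ERROR mask are hypothetical names from this discussion,
not current kernel API:

    /*
     * Sketch only: where a writeback-error fsnotify hook could sit.
     * __filemap_set_wb_err(), errseq_set() and the AS_* bits are the
     * existing mechanisms; the last call is the hypothetical addition.
     */
    static inline void mapping_set_error(struct address_space *mapping,
                                         int error)
    {
            if (likely(!error))
                    return;

            /* Record in wb_err for errseq_t-based checkers. */
            __filemap_set_wb_err(mapping, error);

            /* Record in the superblock err_seq for syncfs checkers. */
            errseq_set(&mapping->host->i_sb->s_wb_err, error);

            if (error == -ENOSPC)
                    set_bit(AS_ENOSPC, &mapping->flags);
            if (error == -EIO)
                    set_bit(AS_EIO, &mapping->flags);

            /* New: queue an FS_WB_ERROR event for fsnotify listeners. */
            fsnotify_wb_error(mapping->host, error);   /* hypothetical */
    }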
Amir Goldstein <amir73il@gmail.com> writes:

> Two things bother me about this proposal.
> One is that it makes more sense IMO to report ENOSPC events
> from vfs code.

Hi Amir,

I reimplemented this with FS_WB_ERROR in the branch below. It reports
writeback errors on mapping_set_error, as suggested.

https://gitlab.collabora.com/krisman/linux/-/tree/wb-error

It is a WIP, and I'm not proposing it yet, because I'm still thinking
about the ENOSPC case a bit more...

> Why should the requirement to monitor ENOSPC conditions be specific
> to tmpfs? Especially since, as I mentioned, there are already
> wrappers in place to report writeback errors on an inode
> (mapping_set_error), where the fsnotify hook can fit nicely.

mapping_set_error would trigger the ENOSPC event only when it happens
on an actual writeback error (i.e. BLK_STS_NOSPC), which is not the
main case I'm solving here. In fact, most of the time, -ENOSPC will
happen before any IO is submitted, for instance, if an inode could not
be allocated during .create() or a block can't be allocated in
.write_begin(). In this case, it isn't really a writeback error
(semantically), and it is not registered as such by any file system.

We can treat those under the same FAN_WB_ERROR mask, but that is a bit
weird. Maybe we should have a mask specifically for ENOSPC, or an
"IO error" mask?

The patchset is quite trivial, but it has to touch many places in the
VFS (say, catch ENOSPC on create, fallocate, write, writev, etc.).
Please see the branch above for what that would look like.

Is that a viable solution? Are you ok with reporting those cases under
the same writeback mask?

Btw, I'm thinking it could be tidy to transform many of those VFS
calls from

    if (!error)
            fsnotify_modify(file);
    else if (error == -ENOSPC)
            fsnotify_nospace(file);

into unconditionally calling the fsnotify_modify hook, which sends the
correct event depending on the operation result:

    void fsnotify_modify(int error, struct file *file)
    {
            if (likely(!error)) {
                    fsnotify_file(file, FS_MODIFY);
            } else if (error == -ENOSPC) {
                    fsnotify_wb_error(file);
            }
    }

(same for fsnotify_mkdir, fsnotify_open, etc.) Would you be ok with
that?

> I understand that you wanted to differentiate errors caused by memory
> pressure, but I don't understand why it makes sense for a filesystem
> monitor to get different feedback than the writing application.

Maybe the differentiation of those two cases could be done as part of
the file system specific payload that I wanted to write for
FAN_FS_ERROR? If so, it would be easier for the ENOSPC event trigger
to be implemented at the filesystem level.

> The second thing that bothers me is that the ENOSPC condition should
> not be reported on the same event mask as a filesystem corruption
> condition, because it seems like a valid use case for a filesystem
> monitor to want to be notified about corruption but not about ENOSPC.
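A sketch of what one call site might look like after that
transformation; the two-argument fsnotify_modify() signature is the
hypothetical one proposed above (today's hook takes only the file),
and the write helper is a placeholder:

    ssize_t vfs_write(struct file *file, const char __user *buf,
                      size_t count, loff_t *pos)
    {
            ssize_t ret;

            /* Permission checks and the actual write elided. */
            ret = do_the_write(file, buf, count, pos);  /* placeholder */

            /*
             * One unconditional hook instead of an if/else at every
             * call site; the hook picks FS_MODIFY or the error event
             * based on the result.
             */
            fsnotify_modify(ret < 0 ? ret : 0, file);
            return ret;
    }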
On Tue, Jan 11, 2022 at 3:57 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> I reimplemented this with FS_WB_ERROR in the branch below. It reports
> writeback errors on mapping_set_error, as suggested.
>
> https://gitlab.collabora.com/krisman/linux/-/tree/wb-error
>
[...]
>
> mapping_set_error would trigger the ENOSPC event only when it happens
> on an actual writeback error (i.e. BLK_STS_NOSPC), which is not the
> main case I'm solving here. In fact, most of the time, -ENOSPC will
> happen before any IO is submitted, for instance, if an inode could
> not be allocated during .create() or a block can't be allocated in
> .write_begin(). In this case, it isn't really a writeback error
> (semantically), and it is not registered as such by any file system.

I see. But the question remains: what is so special about shmem that
your use case requires fsnotify events to handle ENOSPC?

Many systems are deployed on thin provisioned storage these days, and
monitoring the state of the storage to alert the administrator before
storage gets full (be it filesystem inodes or blocks or thinp space)
is crucial to many systems.

Since the ENOSPC event that you are proposing is asynchronous anyway,
what is the problem with polling statfs() and meminfo?

I guess one difference is that it is harder to predict the page
allocation failure that causes ENOSPC in shmem, but IIUC, your patch
does not report an fsevent in that case, only on an inode/block
accounting error. Or maybe I did not understand it correctly?

In a sense, the shmem ENOSPC error not caused by an accounting error
is similar to an EIO error on legacy storage and on thin provisioned
storage, which the end user cannot monitor. Right?

> We can treat those under the same FAN_WB_ERROR mask, but that is a
> bit weird. Maybe we should have a mask specifically for ENOSPC, or an
> "IO error" mask?

If we go forward with this, I do like FAN_IO_ERROR for both writeback
errors and general vfs errors, because what matters is the action
required by the event. A listener that subscribes to FAN_FS_ERROR
would likely react with fsck or such. A listener that subscribes to
FAN_IO_ERROR would likely react by checking the storage usage and/or
application logs.

> The patchset is quite trivial, but it has to touch many places in the
> VFS (say, catch ENOSPC on create, fallocate, write, writev, etc.).
> Please see the branch above for what that would look like.
>
> Is that a viable solution? Are you ok with reporting those cases
> under the same writeback mask?

I am not making the calls in vfs ;) If I were, I would have asked you
to explain your use case better: why do you need this for shmem, and
why would anyone need this for any other fs? Why can't the same result
be achieved with polling existing APIs? I think you will need to
present a very strong case for the wide vfs change.

If your case is strong enough only for shmem, then perhaps we can
accept that the filesystem (and not vfs) is responsible for reporting
FAN_IO_ERROR, and that there are no clear semantics about when the
user can expect FAN_IO_ERROR from any given filesystem, but that seems
strange.

> Btw, I'm thinking it could be tidy to transform many of those VFS
> calls from
>
>     if (!error)
>             fsnotify_modify(file);
>     else if (error == -ENOSPC)
>             fsnotify_nospace(file);
>
> into unconditionally calling the fsnotify_modify hook, which sends
> the correct event depending on the operation result
[...]

If we were to go down that path, I would prefer the latter, because it
will force modifying all call sites and place the logic in one place.
And, apropos logic, if we do that we should not handle only ENOSPC -
IMO at least EIO should get the same treatment.

> Maybe the differentiation of those two cases could be done as part of
> the file system specific payload that I wanted to write for
> FAN_FS_ERROR? If so, it would be easier for the ENOSPC event trigger
> to be implemented at the filesystem level.

Certainly. If we restrict the scope of those events to shmem, we do
not need a new event type. It's enough to use FAN_FS_ERROR with a
specific payload, as you wanted. But I still need convincing why this
shmem ENOSPC is such a special case.

Thanks,
Amir.
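To make the proposed subscription split concrete, here is a minimal
userspace listener for FAN_FS_ERROR, which is real uapi since Linux
5.16. The comment notes where the hypothetical FAN_IO_ERROR mask from
this thread would go instead; that mask does not exist in the kernel:

    /* Minimal FAN_FS_ERROR listener sketch. */
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/fanotify.h>

    int main(void)
    {
            /* FAN_FS_ERROR requires an FID-reporting group. */
            int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID,
                                   O_RDONLY);
            if (fd < 0)
                    err(1, "fanotify_init");

            /*
             * Corruption monitor: FAN_FS_ERROR only. An ENOSPC/IO
             * monitor would pass FAN_IO_ERROR (hypothetical) here
             * instead, keeping the two subscriptions independent.
             */
            if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                              FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
                    err(1, "fanotify_mark");

            for (;;) {
                    char buf[4096];
                    ssize_t len = read(fd, buf, sizeof(buf));

                    if (len <= 0)
                            break;
                    /* Parse fanotify_event_metadata + info records here. */
                    printf("got %zd bytes of error events\n", len);
            }
            return 0;
    }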
Amir Goldstein <amir73il@gmail.com> writes:

[...]
> But the question remains: what is so special about shmem that your
> use case requires fsnotify events to handle ENOSPC?
>
> Many systems are deployed on thin provisioned storage these days, and
> monitoring the state of the storage to alert the administrator before
> storage gets full (be it filesystem inodes or blocks or thinp space)
> is crucial to many systems.
>
> Since the ENOSPC event that you are proposing is asynchronous anyway,
> what is the problem with polling statfs() and meminfo?

Amir,

I spoke a bit with Khazhy (in CC) about the problems with polling the
existing APIs, like statfs. He has been using a previous version of
this code in production to monitor machines for a while now. Khazhy,
feel free to pitch in with more details.

Firstly, I don't want to treat shmem as a special case. The original
patch implemented support only for tmpfs because it was a fs-specific
solution, but I think this would be useful for any other (non-pseudo)
file system in the kernel.

The use case is similar to the use case I brought up for FAN_FS_ERROR:
a sysadmin monitoring a fleet of machines wants to be notified when a
service failed because of lack of space, without having to trust the
failed application to properly report the error.

Polling statfs is prone to missing the ENOSPC occurrence if the error
is ephemeral from a monitoring tool's point of view. Say the
application is writing a large file, hits ENOSPC and, as a recovery
mechanism, removes the partial file. If that happens, a daemon might
miss the chance to observe the lack of space in statfs. Doing it
through fsnotify, on the other hand, always catches the condition and
allows a monitoring tool/sysadmin to take corrective action.

> I guess one difference is that it is harder to predict the page
> allocation failure that causes ENOSPC in shmem, but IIUC, your patch
> does not report an fsevent in that case, only on an inode/block
> accounting error. Or maybe I did not understand it correctly?

Correct. But we cannot predict the ENOSPC unless we know the
application. I'm looking for a way for a sysadmin to not have to rely
on the application caring about the file system size.
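The race described above is easy to see in a naive poller. A minimal
illustration follows; the mount point and poll interval are made up:

    /*
     * Naive statvfs() poller illustrating the window in which an
     * ephemeral ENOSPC can be missed.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void)
    {
            struct statvfs st;

            for (;;) {
                    if (statvfs("/mnt/tmpfs", &st) == 0 &&
                        st.f_bavail == 0)
                            fprintf(stderr, "filesystem full!\n");
                    /*
                     * If a writer hit ENOSPC, unlinked its partial
                     * file, and freed the blocks inside this sleep
                     * window, the poller never observes the condition;
                     * an fsnotify event would not be lost this way.
                     */
                    sleep(5);
            }
    }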
[...]
> Correct. But we cannot predict the ENOSPC unless we know the
> application. I'm looking for a way for a sysadmin to not have to rely
> on the application caring about the file system size.

In the real world, ENOSPC can often be anticipated way ahead of time,
and sysadmins are practically required to take action when storage
space is low. Getting near a 90% full filesystem is not healthy on
many traditional disk filesystems; it causes suboptimal performance
and in many cases (especially cow filesystems) may lead to filesystem
corruption.

All that said, yes, *sometimes* ENOSPC cannot be anticipated, but EIO
can never be anticipated, so why are we talking about ENOSPC? Focusing
on ENOSPC seems too specific for the purpose of adding fsnotify
monitoring for ephemeral filesystem errors.

The problem with fsnotify events for ephemeral filesystem errors is
that there can be a *lot* of them, compared to filesystem corruption
errors, which would usually put the filesystem in an "emergency" state
and stop the events from flooding. For that reason I still think that
a polling API for fs ephemeral errors is a better idea.

w.r.t. ephemeral errors on writeback, we already have syncfs() as a
means to provide a publish/subscribe API for monitoring applications
to check if there was any error since the last check, but we do not
have an API that provides this information without the added cost of
performing syncfs().

IMO, a proper solution would look something like this:

 	/* per-sb errseq_t for reporting writeback errors via syncfs */
 	errseq_t s_wb_err;
+	/* per-sb errseq_t for reporting vfs errors via fstatfs */
+	errseq_t s_vfs_err;

fstatfs() is just an example that may be a good fit for fs monitoring
applications; we could add the error state in the f_spare space, but
we could also create a dedicated API for polling for errors. The same
API could be used to poll for wb errors without issuing syncfs().

Thanks,
Amir.
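A sketch of how that could be wired up. The s_vfs_err field is the
hypothetical addition proposed above; errseq_set(), errseq_sample()
and errseq_check_and_advance() are the real <linux/errseq.h> helpers
that syncfs() already uses for s_wb_err:

    #include <linux/errseq.h>
    #include <linux/fs.h>

    /* Filesystem/vfs code records an ephemeral error: */
    static inline void sb_set_vfs_err(struct super_block *sb, int err)
    {
            if (unlikely(err))
                    errseq_set(&sb->s_vfs_err, err);   /* hypothetical */
    }

    /*
     * A monitor samples the sequence once at subscription time with
     * errseq_sample(&sb->s_vfs_err), then on each poll consumes
     * anything new since the last check:
     */
    static int sb_poll_vfs_err(struct super_block *sb, errseq_t *since)
    {
            return errseq_check_and_advance(&sb->s_vfs_err, since);
    }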
Amir Goldstein <amir73il@gmail.com> writes:

[...]
> The problem with fsnotify events for ephemeral filesystem errors is
> that there can be a *lot* of them, compared to filesystem corruption
> errors, which would usually put the filesystem in an "emergency"
> state and stop the events from flooding. For that reason I still
> think that a polling API for fs ephemeral errors is a better idea.
>
> IMO, a proper solution would look something like this:
>
>  	/* per-sb errseq_t for reporting writeback errors via syncfs */
>  	errseq_t s_wb_err;
> +	/* per-sb errseq_t for reporting vfs errors via fstatfs */
> +	errseq_t s_vfs_err;

I think making it a polling API wouldn't be a problem for our use
case; as long as it is kept as an always-increasing counter, we should
be able to detect changes and not miss events.

The problem with the proposal, in my opinion, is the lack of
differentiation of the errors. We want to be able to tell apart an EIO
from an ENOSPC, and it might not be clear from the other fields in
fstatfs what has happened.

Also, I suspect Google might care about what inode triggered the
error. If I understand their use case correctly, that would allow them
to trace back the origin of the issue. Either way, wouldn't it be
useful for applications in general to be able to know which specific
writeback failed?

> fstatfs() is just an example that may be a good fit for fs monitoring
> applications; we could add the error state in the f_spare space, but
> we could also create a dedicated API for polling for errors.

That would be an option, but f_spare is only 4 words long. That isn't
enough if we want to report the errors that occurred. I think a new
API should report the specific inodes that had a failed writeback.
On Fri, Jan 14, 2022 at 12:17 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
[...]
> I think making it a polling API wouldn't be a problem for our use
> case; as long as it is kept as an always-increasing counter, we
> should be able to detect changes and not miss events.
>
> The problem with the proposal, in my opinion, is the lack of
> differentiation of the errors. We want to be able to tell apart an
> EIO from an ENOSPC, and it might not be clear from the other fields
> in fstatfs what has happened.

These tmpfs are associated with a containerized task (and often aren't
very large), so it isn't possible to predict a full filesystem.

For our use case, what makes this "shmem specific" is that we want to
differentiate between a user getting ENOSPC due to an insufficiently
provisioned fs size vs. due to running out of memory in a container -
both of which return ENOSPC to the process. Mixing EIO into the same
signal ("there were errors, ever") hides this information. So what we
were looking for was some sort of way of signaling that a user saw
ENOSPC, and a way for tmpfs to signal "why". Our first approach was
just to keep a counter of how many times this situation (ENOSPC due to
max size) occurred, but this seemed too niche an interface. Using the
fanotify interface seems like a good candidate, and it has the
additional (potential) benefit that a similar notification can be used
for any error on any type of fs, though as you mentioned the volume of
potential errors is much higher, so perhaps sticking to polling is the
way to go.

To my understanding, errseq_t stores a counter for one type of error,
and if we see multiple types of errors, we'll overwrite the errno (so
EIO followed by ENOSPC, or vice versa, would result in missing info).

> Also, I suspect Google might care about what inode triggered the
> error. If I understand their use case correctly, that would allow
> them to trace back the origin of the issue. Either way, wouldn't it
> be useful for applications in general to be able to know which
> specific writeback failed?

For our usage, just knowing that an error occurred should be good
enough, and if that simplifies things let's omit the extra
functionality.
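The errseq_t behavior described here can be seen directly in the API.
A contrived kernel-side sketch, assuming nothing beyond the real
errseq_sample()/errseq_set()/errseq_check() helpers from
<linux/errseq.h>:

    /*
     * An errseq_t holds a single errno plus a change counter, so a
     * later error replaces an earlier one of a different type.
     */
    #include <linux/bug.h>
    #include <linux/errno.h>
    #include <linux/errseq.h>

    static void errseq_overwrite_demo(void)
    {
            errseq_t eseq = 0;
            errseq_t since = errseq_sample(&eseq);

            errseq_set(&eseq, -EIO);    /* first failure: EIO */
            errseq_set(&eseq, -ENOSPC); /* replaces the EIO */

            /* A poller now sees only -ENOSPC; the EIO is lost. */
            WARN_ON(errseq_check(&eseq, since) != -ENOSPC);
    }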
On Sat, Jan 15, 2022 at 12:16 AM Khazhy Kumykov <khazhy@google.com> wrote:
>
[...]
> For our use case, what makes this "shmem specific" is that we want to
> differentiate between a user getting ENOSPC due to an insufficiently
> provisioned fs size vs. due to running out of memory in a container -
> both of which return ENOSPC to the process.
>
[...]
> For our usage, just knowing that an error occurred should be good
> enough, and if that simplifies things let's omit the extra
> functionality.

Yes, it simplifies things a lot. The use case sounds very shmem
specific, and unless there are other users that need fsnotify of
ENOSPC events with inode information, I would stay away from this.

I suggest that you export a counter of shmem ENOSPC errors of both
kinds (page allocation and block accounting) via procfs/sysfs and poll
for the counter that you care about. If you want to follow precedents,
there is /sys/fs/ext4/sda3/errors_count and friends.

Thanks,
Amir.
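For reference, a tiny poller in the spirit of that suggestion. The
ext4 sysfs path is the real precedent named above; a shmem equivalent
would be a new, hypothetical file:

    /*
     * Poll a sysfs error counter, modeled on ext4's
     * /sys/fs/ext4/<dev>/errors_count.
     */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned long last = 0, cur;

            for (;;) {
                    FILE *f = fopen("/sys/fs/ext4/sda3/errors_count", "r");

                    if (f) {
                            if (fscanf(f, "%lu", &cur) == 1 && cur > last) {
                                    fprintf(stderr, "%lu new errors\n",
                                            cur - last);
                                    last = cur;
                            }
                            fclose(f);
                    }
                    sleep(5);
            }
    }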