Message ID: 20211116220742.584975-1-krisman@collabora.com
Series: shmem: Notify user space when file system is full
On Wed, Nov 17, 2021 at 12:07 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> FS_ERROR is a fsnotify event type used by monitoring tools to detect
> conditions that might require intervention from a recovery tool or
> from sysadmins. This patch series enables tmpfs to report an event
> when an operation fails because the file system is full.
>
> It attempts to only report events when the filesystem is really full,
> instead of errors caused by memory pressure. The first patch prepares
> the terrain by detecting these two different conditions, and the
> second patch actually adds the event triggers.

Hi Gabriel,

Two things bother me about this proposal.

One is that it makes more sense IMO to report ENOSPC events from vfs
code. Why should the requirement to monitor ENOSPC conditions be
specific to tmpfs? Especially since, as I mentioned, there are already
wrappers in place to report writeback errors on an inode
(mapping_set_error), where the fsnotify hook can fit nicely. I
understand that you wanted to differentiate errors caused by memory
pressure, but I don't understand why it makes sense for a filesystem
monitor to get different feedback than the writing application.

The second thing that bothers me is that the ENOSPC condition should
not be reported on the same event mask as a filesystem corruption
condition, because it seems like a valid use case for a filesystem
monitor to want to be notified about corruption but not about ENOSPC.
Therefore, I suggested using the event flag FS_WB_ERROR for writeback
errors. Other than the event mask, everything else is the same; it
just provides the UAPI flexibility to subscribe to one without the
other.

Thanks,
Amir.
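As an illustration of the hook placement suggested above, a rough
sketch follows. The surrounding logic mirrors the real
mapping_set_error() from include/linux/pagemap.h; fsnotify_wb_error()
and the FS_WB_ERROR mask are hypothetical names from this discussion,
not current kernel API:

    /*
     * Sketch only: where a writeback-error fsnotify hook could sit.
     * __filemap_set_wb_err(), errseq_set() and the AS_* bits are the
     * existing mechanisms; the last call is the hypothetical addition.
     */
    static inline void mapping_set_error(struct address_space *mapping,
                                         int error)
    {
            if (likely(!error))
                    return;

            /* Record in wb_err for errseq_t-based checkers. */
            __filemap_set_wb_err(mapping, error);

            /* Record in the superblock err_seq for syncfs checkers. */
            errseq_set(&mapping->host->i_sb->s_wb_err, error);

            if (error == -ENOSPC)
                    set_bit(AS_ENOSPC, &mapping->flags);
            if (error == -EIO)
                    set_bit(AS_EIO, &mapping->flags);

            /* New: queue an FS_WB_ERROR event for fsnotify listeners. */
            fsnotify_wb_error(mapping->host, error);   /* hypothetical */
    }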
Amir Goldstein <amir73il@gmail.com> writes:

> Two things bother me about this proposal.
> One is that it makes more sense IMO to report ENOSPC events
> from vfs code.

Hi Amir,

I reimplemented this with FS_WB_ERROR in the branch below. It reports
writeback errors on mapping_set_error, as suggested.

https://gitlab.collabora.com/krisman/linux/-/tree/wb-error

It is a WIP, and I'm not proposing it yet, because I'm still thinking
about the ENOSPC case a bit more...

> Why should the requirement to monitor ENOSPC conditions be specific
> to tmpfs? Especially since, as I mentioned, there are already
> wrappers in place to report writeback errors on an inode
> (mapping_set_error), where the fsnotify hook can fit nicely.

mapping_set_error would trigger the ENOSPC event only when it happens
on an actual writeback error (i.e. BLK_STS_NOSPC), which is not the
main case I'm solving here. In fact, most of the time, -ENOSPC will
happen before any IO is submitted, for instance, if an inode could not
be allocated during .create() or a block can't be allocated in
.write_begin(). In this case, it isn't really a writeback error
(semantically), and it is not registered as such by any file system.

We can treat those under the same FAN_WB_ERROR mask, but that is a bit
weird. Maybe we should have a mask specifically for ENOSPC, or an
"IO error" mask?

The patchset is quite trivial, but it has to touch many places in the
VFS (say, catch ENOSPC on create, fallocate, write, writev, etc.).
Please see the branch above for what that would look like.

Is that a viable solution? Are you ok with reporting those cases under
the same writeback mask?

Btw, I'm thinking it could be tidy to transform many of those VFS
calls from

    if (!error)
            fsnotify_modify(file);
    else if (error == -ENOSPC)
            fsnotify_nospace(file);

into unconditionally calling the fsnotify_modify hook, which sends the
correct event depending on the operation result:

    void fsnotify_modify(int error, struct file *file)
    {
            if (likely(!error)) {
                    fsnotify_file(file, FS_MODIFY);
            } else if (error == -ENOSPC) {
                    fsnotify_wb_error(file);
            }
    }

(same for fsnotify_mkdir, fsnotify_open, etc.) Would you be ok with
that?

> I understand that you wanted to differentiate errors caused by memory
> pressure, but I don't understand why it makes sense for a filesystem
> monitor to get different feedback than the writing application.

Maybe the differentiation of those two cases could be done as part of
the file system specific payload that I wanted to write for
FAN_FS_ERROR? If so, it would be easier for the ENOSPC event trigger
to be implemented at the filesystem level.

> The second thing that bothers me is that the ENOSPC condition should
> not be reported on the same event mask as a filesystem corruption
> condition, because it seems like a valid use case for a filesystem
> monitor to want to be notified about corruption but not about ENOSPC.
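A sketch of what one call site might look like after that
transformation; the two-argument fsnotify_modify() signature is the
hypothetical one proposed above (today's hook takes only the file),
and the write helper is a placeholder:

    ssize_t vfs_write(struct file *file, const char __user *buf,
                      size_t count, loff_t *pos)
    {
            ssize_t ret;

            /* Permission checks and the actual write elided. */
            ret = do_the_write(file, buf, count, pos);  /* placeholder */

            /*
             * One unconditional hook instead of an if/else at every
             * call site; the hook picks FS_MODIFY or the error event
             * based on the result.
             */
            fsnotify_modify(ret < 0 ? ret : 0, file);
            return ret;
    }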
On Tue, Jan 11, 2022 at 3:57 AM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> I reimplemented this with FS_WB_ERROR in the branch below. It reports
> writeback errors on mapping_set_error, as suggested.
>
> https://gitlab.collabora.com/krisman/linux/-/tree/wb-error
>
[...]
>
> mapping_set_error would trigger the ENOSPC event only when it happens
> on an actual writeback error (i.e. BLK_STS_NOSPC), which is not the
> main case I'm solving here. In fact, most of the time, -ENOSPC will
> happen before any IO is submitted, for instance, if an inode could
> not be allocated during .create() or a block can't be allocated in
> .write_begin(). In this case, it isn't really a writeback error
> (semantically), and it is not registered as such by any file system.

I see. But the question remains: what is so special about shmem that
your use case requires fsnotify events to handle ENOSPC?

Many systems are deployed on thin provisioned storage these days, and
monitoring the state of the storage to alert the administrator before
storage gets full (be it filesystem inodes or blocks or thinp space)
is crucial to many systems.

Since the ENOSPC event that you are proposing is asynchronous anyway,
what is the problem with polling statfs() and meminfo?

I guess one difference is that it is harder to predict the page
allocation failure that causes ENOSPC in shmem, but IIUC, your patch
does not report an fsevent in that case, only on an inode/block
accounting error. Or maybe I did not understand it correctly?

In a sense, the shmem ENOSPC error not caused by an accounting error
is similar to an EIO error on legacy storage and on thin provisioned
storage, which the end user cannot monitor. Right?

> We can treat those under the same FAN_WB_ERROR mask, but that is a
> bit weird. Maybe we should have a mask specifically for ENOSPC, or an
> "IO error" mask?

If we go forward with this, I do like FAN_IO_ERROR for both writeback
errors and general vfs errors, because what matters is the action
required by the event. A listener that subscribes to FAN_FS_ERROR
would likely react with fsck or such. A listener that subscribes to
FAN_IO_ERROR would likely react by checking the storage usage and/or
application logs.

> The patchset is quite trivial, but it has to touch many places in the
> VFS (say, catch ENOSPC on create, fallocate, write, writev, etc.).
> Please see the branch above for what that would look like.
>
> Is that a viable solution? Are you ok with reporting those cases
> under the same writeback mask?

I am not making the calls in vfs ;) If I were, I would have asked you
to explain your use case better: why do you need this for shmem, and
why would anyone need this for any other fs? Why can't the same result
be achieved with polling existing APIs? I think you will need to
present a very strong case for the wide vfs change.

If your case is strong enough only for shmem, then perhaps we can
accept that the filesystem (and not vfs) is responsible for reporting
FAN_IO_ERROR, and that there are no clear semantics about when the
user can expect FAN_IO_ERROR from any given filesystem, but that seems
strange.

> Btw, I'm thinking it could be tidy to transform many of those VFS
> calls from
>
>     if (!error)
>             fsnotify_modify(file);
>     else if (error == -ENOSPC)
>             fsnotify_nospace(file);
>
> into unconditionally calling the fsnotify_modify hook, which sends
> the correct event depending on the operation result
[...]

If we were to go down that path, I would prefer the latter, because it
will force modifying all call sites and place the logic in one place.
And, apropos logic, if we do that we should not handle only ENOSPC -
IMO at least EIO should get the same treatment.

> Maybe the differentiation of those two cases could be done as part of
> the file system specific payload that I wanted to write for
> FAN_FS_ERROR? If so, it would be easier for the ENOSPC event trigger
> to be implemented at the filesystem level.

Certainly. If we restrict the scope of those events to shmem, we do
not need a new event type. It's enough to use FAN_FS_ERROR with a
specific payload, as you wanted. But I still need convincing why this
shmem ENOSPC is such a special case.

Thanks,
Amir.
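To make the proposed subscription split concrete, here is a minimal
userspace listener for FAN_FS_ERROR, which is real uapi since Linux
5.16. The comment notes where the hypothetical FAN_IO_ERROR mask from
this thread would go instead; that mask does not exist in the kernel:

    /* Minimal FAN_FS_ERROR listener sketch. */
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/fanotify.h>

    int main(void)
    {
            /* FAN_FS_ERROR requires an FID-reporting group. */
            int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID,
                                   O_RDONLY);
            if (fd < 0)
                    err(1, "fanotify_init");

            /*
             * Corruption monitor: FAN_FS_ERROR only. An ENOSPC/IO
             * monitor would pass FAN_IO_ERROR (hypothetical) here
             * instead, keeping the two subscriptions independent.
             */
            if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                              FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
                    err(1, "fanotify_mark");

            for (;;) {
                    char buf[4096];
                    ssize_t len = read(fd, buf, sizeof(buf));

                    if (len <= 0)
                            break;
                    /* Parse fanotify_event_metadata + info records here. */
                    printf("got %zd bytes of error events\n", len);
            }
            return 0;
    }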
Amir Goldstein <amir73il@gmail.com> writes:

[...]
> But the question remains: what is so special about shmem that your
> use case requires fsnotify events to handle ENOSPC?
>
> Many systems are deployed on thin provisioned storage these days, and
> monitoring the state of the storage to alert the administrator before
> storage gets full (be it filesystem inodes or blocks or thinp space)
> is crucial to many systems.
>
> Since the ENOSPC event that you are proposing is asynchronous anyway,
> what is the problem with polling statfs() and meminfo?

Amir,

I spoke a bit with Khazhy (in CC) about the problems with polling the
existing APIs, like statfs. He has been using a previous version of
this code in production to monitor machines for a while now. Khazhy,
feel free to pitch in with more details.

Firstly, I don't want to treat shmem as a special case. The original
patch implemented support only for tmpfs because it was a fs-specific
solution, but I think this would be useful for any other (non-pseudo)
file system in the kernel.

The use case is similar to the use case I brought up for FAN_FS_ERROR:
a sysadmin monitoring a fleet of machines wants to be notified when a
service failed because of lack of space, without having to trust the
failed application to properly report the error.

Polling statfs is prone to missing the ENOSPC occurrence if the error
is ephemeral from a monitoring tool's point of view. Say the
application is writing a large file, hits ENOSPC and, as a recovery
mechanism, removes the partial file. If that happens, a daemon might
miss the chance to observe the lack of space in statfs. Doing it
through fsnotify, on the other hand, always catches the condition and
allows a monitoring tool/sysadmin to take corrective action.

> I guess one difference is that it is harder to predict the page
> allocation failure that causes ENOSPC in shmem, but IIUC, your patch
> does not report an fsevent in that case, only on an inode/block
> accounting error. Or maybe I did not understand it correctly?

Correct. But we cannot predict the ENOSPC unless we know the
application. I'm looking for a way for a sysadmin to not have to rely
on the application caring about the file system size.
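The race described above is easy to see in a naive poller. A minimal
illustration follows; the mount point and poll interval are made up:

    /*
     * Naive statvfs() poller illustrating the window in which an
     * ephemeral ENOSPC can be missed.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/statvfs.h>

    int main(void)
    {
            struct statvfs st;

            for (;;) {
                    if (statvfs("/mnt/tmpfs", &st) == 0 &&
                        st.f_bavail == 0)
                            fprintf(stderr, "filesystem full!\n");
                    /*
                     * If a writer hit ENOSPC, unlinked its partial
                     * file, and freed the blocks inside this sleep
                     * window, the poller never observes the condition;
                     * an fsnotify event would not be lost this way.
                     */
                    sleep(5);
            }
    }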
[...]
> Correct. But we cannot predict the ENOSPC unless we know the
> application. I'm looking for a way for a sysadmin to not have to rely
> on the application caring about the file system size.

In the real world, ENOSPC can often be anticipated way ahead of time,
and sysadmins are practically required to take action when storage
space is low. Getting near a 90% full filesystem is not healthy on
many traditional disk filesystems; it causes suboptimal performance
and in many cases (especially cow filesystems) may lead to filesystem
corruption.

All that said, yes, *sometimes* ENOSPC cannot be anticipated, but EIO
can never be anticipated, so why are we talking about ENOSPC? Focusing
on ENOSPC seems too specific for the purpose of adding fsnotify
monitoring for ephemeral filesystem errors.

The problem with fsnotify events for ephemeral filesystem errors is
that there can be a *lot* of them, compared to filesystem corruption
errors, which would usually put the filesystem in an "emergency" state
and stop the events from flooding. For that reason I still think that
a polling API for fs ephemeral errors is a better idea.

w.r.t. ephemeral errors on writeback, we already have syncfs() as a
means to provide a publish/subscribe API for monitoring applications
to check if there was any error since the last check, but we do not
have an API that provides this information without the added cost of
performing syncfs().

IMO, a proper solution would look something like this:

 	/* per-sb errseq_t for reporting writeback errors via syncfs */
 	errseq_t s_wb_err;
+	/* per-sb errseq_t for reporting vfs errors via fstatfs */
+	errseq_t s_vfs_err;

fstatfs() is just an example that may be a good fit for fs monitoring
applications; we could add the error state in the f_spare space, but
we could also create a dedicated API for polling for errors. The same
API could be used to poll for wb errors without issuing syncfs().

Thanks,
Amir.
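A sketch of how that could be wired up. The s_vfs_err field is the
hypothetical addition proposed above; errseq_set(), errseq_sample()
and errseq_check_and_advance() are the real <linux/errseq.h> helpers
that syncfs() already uses for s_wb_err:

    #include <linux/errseq.h>
    #include <linux/fs.h>

    /* Filesystem/vfs code records an ephemeral error: */
    static inline void sb_set_vfs_err(struct super_block *sb, int err)
    {
            if (unlikely(err))
                    errseq_set(&sb->s_vfs_err, err);   /* hypothetical */
    }

    /*
     * A monitor samples the sequence once at subscription time with
     * errseq_sample(&sb->s_vfs_err), then on each poll consumes
     * anything new since the last check:
     */
    static int sb_poll_vfs_err(struct super_block *sb, errseq_t *since)
    {
            return errseq_check_and_advance(&sb->s_vfs_err, since);
    }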
Amir Goldstein <amir73il@gmail.com> writes:

[...]
> The problem with fsnotify events for ephemeral filesystem errors is
> that there can be a *lot* of them, compared to filesystem corruption
> errors, which would usually put the filesystem in an "emergency"
> state and stop the events from flooding. For that reason I still
> think that a polling API for fs ephemeral errors is a better idea.
>
> IMO, a proper solution would look something like this:
>
>  	/* per-sb errseq_t for reporting writeback errors via syncfs */
>  	errseq_t s_wb_err;
> +	/* per-sb errseq_t for reporting vfs errors via fstatfs */
> +	errseq_t s_vfs_err;

I think making it a polling API wouldn't be a problem for our use
case; as long as it is kept as an always-increasing counter, we should
be able to detect changes and not miss events.

The problem with the proposal, in my opinion, is the lack of
differentiation of the errors. We want to be able to tell apart an EIO
from an ENOSPC, and it might not be clear from the other fields in
fstatfs what has happened.

Also, I suspect Google might care about what inode triggered the
error. If I understand their use case correctly, that would allow them
to trace back the origin of the issue. Either way, wouldn't it be
useful for applications in general to be able to know which specific
writeback failed?

> fstatfs() is just an example that may be a good fit for fs monitoring
> applications; we could add the error state in the f_spare space, but
> we could also create a dedicated API for polling for errors.

That would be an option, but f_spare is only 4 words long. That isn't
enough if we want to report the errors that occurred. I think a new
API should report the specific inodes that had a failed writeback.
On Fri, Jan 14, 2022 at 12:17 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
[...]
> I think making it a polling API wouldn't be a problem for our use
> case; as long as it is kept as an always-increasing counter, we
> should be able to detect changes and not miss events.
>
> The problem with the proposal, in my opinion, is the lack of
> differentiation of the errors. We want to be able to tell apart an
> EIO from an ENOSPC, and it might not be clear from the other fields
> in fstatfs what has happened.

These tmpfs are associated with a containerized task (and often aren't
very large), so it isn't possible to predict a full filesystem.

For our use case, what makes this "shmem specific" is that we want to
differentiate between a user getting ENOSPC due to an insufficiently
provisioned fs size vs. due to running out of memory in a container -
both of which return ENOSPC to the process. Mixing EIO into the same
signal ("there were errors, ever") hides this information. So what we
were looking for was some sort of way of signaling that a user saw
ENOSPC, and a way for tmpfs to signal "why". Our first approach was
just to keep a counter of how many times this situation (ENOSPC due to
max size) occurred, but this seemed too niche an interface. Using the
fanotify interface seems like a good candidate, and it has the
additional (potential) benefit that a similar notification can be used
for any error on any type of fs, though as you mentioned the volume of
potential errors is much higher, so perhaps sticking to polling is the
way to go.

To my understanding, errseq_t stores a counter for one type of error,
and if we see multiple types of errors, we'll overwrite the errno (so
EIO followed by ENOSPC, or vice versa, would result in missing info).

> Also, I suspect Google might care about what inode triggered the
> error. If I understand their use case correctly, that would allow
> them to trace back the origin of the issue. Either way, wouldn't it
> be useful for applications in general to be able to know which
> specific writeback failed?

For our usage, just knowing that an error occurred should be good
enough, and if that simplifies things let's omit the extra
functionality.
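The errseq_t behavior described here can be seen directly in the API.
A contrived kernel-side sketch, assuming nothing beyond the real
errseq_sample()/errseq_set()/errseq_check() helpers from
<linux/errseq.h>:

    /*
     * An errseq_t holds a single errno plus a change counter, so a
     * later error replaces an earlier one of a different type.
     */
    #include <linux/bug.h>
    #include <linux/errno.h>
    #include <linux/errseq.h>

    static void errseq_overwrite_demo(void)
    {
            errseq_t eseq = 0;
            errseq_t since = errseq_sample(&eseq);

            errseq_set(&eseq, -EIO);    /* first failure: EIO */
            errseq_set(&eseq, -ENOSPC); /* replaces the EIO */

            /* A poller now sees only -ENOSPC; the EIO is lost. */
            WARN_ON(errseq_check(&eseq, since) != -ENOSPC);
    }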
On Sat, Jan 15, 2022 at 12:16 AM Khazhy Kumykov <khazhy@google.com> wrote:
>
[...]
> For our use case, what makes this "shmem specific" is that we want to
> differentiate between a user getting ENOSPC due to an insufficiently
> provisioned fs size vs. due to running out of memory in a container -
> both of which return ENOSPC to the process.
>
[...]
> For our usage, just knowing that an error occurred should be good
> enough, and if that simplifies things let's omit the extra
> functionality.

Yes, it simplifies things a lot. The use case sounds very shmem
specific, and unless there are other users that need fsnotify of
ENOSPC events with inode information, I would stay away from this.

I suggest that you export a counter of shmem ENOSPC errors of both
kinds (page allocation and block accounting) via procfs/sysfs and poll
for the counter that you care about. If you want to follow precedents,
there is /sys/fs/ext4/sda3/errors_count and friends.

Thanks,
Amir.
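For reference, a tiny poller in the spirit of that suggestion. The
ext4 sysfs path is the real precedent named above; a shmem equivalent
would be a new, hypothetical file:

    /*
     * Poll a sysfs error counter, modeled on ext4's
     * /sys/fs/ext4/<dev>/errors_count.
     */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned long last = 0, cur;

            for (;;) {
                    FILE *f = fopen("/sys/fs/ext4/sda3/errors_count", "r");

                    if (f) {
                            if (fscanf(f, "%lu", &cur) == 1 && cur > last) {
                                    fprintf(stderr, "%lu new errors\n",
                                            cur - last);
                                    last = cur;
                            }
                            fclose(f);
                    }
                    sleep(5);
            }
    }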