
fsck.xfs: allow forced repairs using xfs_repair

Message ID 20180305150546.32191-1-jtulak@redhat.com (mailing list archive)
State Superseded, archived

Commit Message

Jan Tulak March 5, 2018, 3:05 p.m. UTC
The fsck.xfs script did nothing, because xfs doesn't need a fsck to be
run on every unclean shutdown. However, sometimes the root filesystem
really does require xfs_repair, and then it is a hassle. This patch
makes the situation a bit easier by detecting forced checks (/forcefsck
or fsck.mode=force), so the user can request the repair without it
being run all the time.

(Thanks Eric for suggesting this.)

Signed-off-by: Jan Tulak <jtulak@redhat.com>
---
 fsck/xfs_fsck.sh | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Dave Chinner March 5, 2018, 9:56 p.m. UTC | #1
On Mon, Mar 05, 2018 at 04:05:46PM +0100, Jan Tulak wrote:
> The fsck.xfs script did nothing, because xfs doesn't need a fsck to be
> run on every unclean shutdown. However, sometimes the root filesystem
> really does require xfs_repair, and then it is a hassle. This patch
> makes the situation a bit easier by detecting forced checks (/forcefsck
> or fsck.mode=force), so the user can request the repair without it
> being run all the time.
> 
> (Thanks Eric for suggesting this.)
> 
> Signed-off-by: Jan Tulak <jtulak@redhat.com>

Ok, so I can see why support for this is probably necessary, but I have
a few reservations about the implementation....

> ---
>  fsck/xfs_fsck.sh | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> index e52969e4..71bfa2e1 100755
> --- a/fsck/xfs_fsck.sh
> +++ b/fsck/xfs_fsck.sh
> @@ -4,10 +4,12 @@
>  #
>  
>  AUTO=false
> -while getopts ":aApy" c
> +FORCE=false
> +while getopts ":aApyf" c
>  do
>  	case $c in
>  	a|A|p|y)	AUTO=true;;
> +	f)      	FORCE=true;;
>  	esac
>  done
>  eval DEV=\${$#}
> @@ -15,10 +17,18 @@ if [ ! -e $DEV ]; then
>  	echo "$0: $DEV does not exist"
>  	exit 8
>  fi
> +
> +# The flag -f is added by systemd/init scripts when /forcefsck file is present
> +# or fsck.mode=force is used during boot; an unclean shutdown won't trigger
> +# this check, user has to explicitly require a forced fsck.
> +if $FORCE; then
> +	xfs_repair $DEV
> +	exit $?
> +fi

This needs to check that the xfs_repair binary is present in the
environment that is running fsck. If this is checking the root fs
from the initramfs, then distros are going to need to package
xfs_repair into their initramfs build scripts...

Also, if the log is dirty, xfs_repair won't run. If the filesystem
is already mounted read-only, xfs_repair won't run. So if we're
forcing a boot time check, we want it to run unconditionally and fix
any problems found automatically, right?
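
(For illustration, a rough sketch of such guard checks - the /proc/mounts
match and the use of xfs_repair -n's documented exit codes are assumptions
on my part, not something the patch does:)

# refuse to run if the device is mounted at all
if grep -q "^$DEV " /proc/mounts; then
        echo "$0: $DEV is mounted"
        exit 4
fi
# xfs_repair exits with status 2 when a dirty log prevents it running
xfs_repair -n "$DEV"
if [ $? -eq 2 ]; then
        echo "$0: $DEV has a dirty log, manual intervention needed"
        exit 4
fi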

Also, fsck exit values have specific meaning to the boot
infrastructure and xfs_repair does not follow them. Hence returning
the output of xfs_repair to the fsck caller is going to result in
unexpected/undesired behaviour. From the fsck man page:

      The exit code returned by fsck is the sum of the following conditions:

              0      No errors
              1      Filesystem errors corrected
              2      System should be rebooted
              4      Filesystem errors left uncorrected
              8      Operational error
              16     Usage or syntax error
              32     Checking canceled by user request
              128    Shared-library error

So there's error post processing that is needed here so that the
infrastructure is given the correct status indication so it will
do things like reboot the system if necessary after a repair...
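
(A minimal sketch of that post-processing, assuming only xfs_repair's
documented exit codes of 0, 1 and 2 - the mapping chosen here is
illustrative, not settled:)

xfs_repair "$DEV"
case $? in
0) exit 1;;        # repair completed: report "errors corrected"
2) exit 4;;        # dirty log: "errors left uncorrected"
*) exit 8;;        # runtime failure: "operational error"
esac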

I also wonder if we can limit this to just the boot infrastructure,
because I really don't like the idea of users using fsck.xfs -f to
repair damaged filesystems because "that's what I do to repair ext4
filesystems"....

Also missing is a fsck.xfs man page update to document the option.

>  if $AUTO; then
>  	echo "$0: XFS file system."
>  else
>  	echo "If you wish to check the consistency of an XFS filesystem or"
>  	echo "repair a damaged filesystem, see xfs_repair(8)."
>  fi
> -exit 0

I think we still need to exit with a zero status if we did nothing,
because that's what the caller is expecting....

Cheers,

Dave.
Eric Sandeen March 5, 2018, 10:06 p.m. UTC | #2
On 3/5/18 3:56 PM, Dave Chinner wrote:
> On Mon, Mar 05, 2018 at 04:05:46PM +0100, Jan Tulak wrote:
>> The fsck.xfs script did nothing, because xfs doesn't need a fsck to be
>> run on every unclean shutdown. However, sometimes the root filesystem
>> really does require xfs_repair, and then it is a hassle. This patch
>> makes the situation a bit easier by detecting forced checks (/forcefsck
>> or fsck.mode=force), so the user can request the repair without it
>> being run all the time.
>>
>> (Thanks Eric for suggesting this.)
>>
>> Signed-off-by: Jan Tulak <jtulak@redhat.com>
> 
> Ok, so I can see why support for this is probably necessary, but I have
> a few reservations about the implementation....
> 
>> ---
>>  fsck/xfs_fsck.sh | 14 ++++++++++++--
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
>> index e52969e4..71bfa2e1 100755
>> --- a/fsck/xfs_fsck.sh
>> +++ b/fsck/xfs_fsck.sh
>> @@ -4,10 +4,12 @@
>>  #
>>  
>>  AUTO=false
>> -while getopts ":aApy" c
>> +FORCE=false
>> +while getopts ":aApyf" c
>>  do
>>  	case $c in
>>  	a|A|p|y)	AUTO=true;;
>> +	f)      	FORCE=true;;
>>  	esac
>>  done
>>  eval DEV=\${$#}
>> @@ -15,10 +17,18 @@ if [ ! -e $DEV ]; then
>>  	echo "$0: $DEV does not exist"
>>  	exit 8
>>  fi
>> +
>> +# The flag -f is added by systemd/init scripts when /forcefsck file is present
>> +# or fsck.mode=force is used during boot; an unclean shutdown won't trigger
>> +# this check, user has to explicitly require a forced fsck.
>> +if $FORCE; then
>> +	xfs_repair $DEV
>> +	exit $?
>> +fi
> 
> This needs to check that the xfs_repair binary is present in the
> environment that is running fsck. If this is checking the root fs
> from the initramfs, then distros are going to need to package
> xfs_repair into their initramfs build scripts...

Fedora and RHEL do, FWIW.  Can others check?  What does Debian do?

> Also, if the log is dirty, xfs_repair won't run. If the filesystem
> is already mounted read-only, xfs_repair won't run. So if we're
> forcing a boot time check, we want it to run unconditionally and fix
> any problems found automatically, right?

Yep, I'm curious if this was tested - I played with something like this
a while ago but didn't take notes.  ;)

As for running automatically and fixing any problems, we may need to make
a decision.  If it won't mount due to a log problem, do we automatically
use -L or drop to a shell and punt to the admin?  (That's what we would
do w/o any fsck -f invocation today...)

> Also, fsck exit values have specific meaning to the boot
> infrastructure and xfs_repair does not follow them. Hence returning
> the output of xfs_repair to the fsck caller is going to result in
> unexpected/undesired behaviour. From the fsck man page:
> 
>       The exit code returned by fsck is the sum of the following conditions:
> 
>               0      No errors
>               1      Filesystem errors corrected
>               2      System should be rebooted
>               4      Filesystem errors left uncorrected
>               8      Operational error
>               16     Usage or syntax error
>               32     Checking canceled by user request
>               128    Shared-library error
> 
> So there's error post processing that is needed here so that the
> infrastructure is given the correct status indication so it will
> do things like reboot the system if necessary after a repair...

Good point, thanks.
 
> I also wonder if we can limit this to just the boot infrastructure,
> because I really don't like the idea of users using fsck.xfs -f to
> repair damaged filesystems because "that's what I do to repair ext4
> filesystems"....

Depending on how this gets fleshed out, fsck.xfs -f isn't any different
than bare xfs_repair...  (Unless all of the above suggestions about dirty
logs get added, then it certainly is!)  So, yeah...

How would you propose limiting it to the boot environment?  I wondered
about the script itself checking for /forcefsck or the boot parameters,
but at least the boot params probably last for the duration of the uptime.
And re-coding / re-implementing the systemd checks in our own script
probably is a bad idea, so forget I suggested it ...

> Also missing is a fsck.xfs man page update to document the option.

*nod*

 
>>  if $AUTO; then
>>  	echo "$0: XFS file system."
>>  else
>>  	echo "If you wish to check the consistency of an XFS filesystem or"
>>  	echo "repair a damaged filesystem, see xfs_repair(8)."
>>  fi
>> -exit 0
> 
> I think we still need to exit with a zero status if we did nothing,
> because that's what the caller is expecting....
> 
> Cheers,
> 
> Dave.
> 
Darrick J. Wong March 5, 2018, 10:20 p.m. UTC | #3
On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> On 3/5/18 3:56 PM, Dave Chinner wrote:
> > On Mon, Mar 05, 2018 at 04:05:46PM +0100, Jan Tulak wrote:
> >> The fsck.xfs script did nothing, because xfs doesn't need a fsck to be
> >> run on every unclean shutdown. However, sometimes the root filesystem
> >> really does require xfs_repair, and then it is a hassle. This patch
> >> makes the situation a bit easier by detecting forced checks (/forcefsck
> >> or fsck.mode=force), so the user can request the repair without it
> >> being run all the time.
> >>
> >> (Thanks Eric for suggesting this.)
> >>
> >> Signed-off-by: Jan Tulak <jtulak@redhat.com>
> > 
> > Ok, so I can see why support for this is probably necessary, but I have
> > a few reservations about the implementation....
> > 
> >> ---
> >>  fsck/xfs_fsck.sh | 14 ++++++++++++--
> >>  1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> >> index e52969e4..71bfa2e1 100755
> >> --- a/fsck/xfs_fsck.sh
> >> +++ b/fsck/xfs_fsck.sh
> >> @@ -4,10 +4,12 @@
> >>  #
> >>  
> >>  AUTO=false
> >> -while getopts ":aApy" c
> >> +FORCE=false
> >> +while getopts ":aApyf" c
> >>  do
> >>  	case $c in
> >>  	a|A|p|y)	AUTO=true;;
> >> +	f)      	FORCE=true;;
> >>  	esac
> >>  done
> >>  eval DEV=\${$#}
> >> @@ -15,10 +17,18 @@ if [ ! -e $DEV ]; then
> >>  	echo "$0: $DEV does not exist"
> >>  	exit 8
> >>  fi
> >> +
> >> +# The flag -f is added by systemd/init scripts when /forcefsck file is present
> >> +# or fsck.mode=force is used during boot; an unclean shutdown won't trigger
> >> +# this check, user has to explicitly require a forced fsck.
> >> +if $FORCE; then
> >> +	xfs_repair $DEV
> >> +	exit $?
> >> +fi
> > 
> > This needs to check that the xfs_repair binary is present in the
> > environment that is running fsck. If this is checking the root fs
> > from the initramfs, then distros are going to need to package
> > xfs_repair into their initramfs build scripts...
> 
> Fedora and RHEL do, FWIW.  Can others check?  What does Debian do?

Ubuntu 16.04's initramfs hooks copy /sbin/fsck and
/sbin/fsck.$detectedrootfstype into the initramfs.

(...and, because I hate my own distro's defaults, I have my own
initramfs hook to stuff xfs_repair and e2fsck into the initramfs. :P)

> > Also, if the log is dirty, xfs_repair won't run. If the filesystem
> > is already mounted read-only, xfs_repair won't run. So if we're
> > forcing a boot time check, we want it to run unconditionally and fix
> > any problems found automatically, right?
> 
> Yep, I'm curious if this was tested - I played with something like this
> a while ago but didn't take notes.  ;)
> 
> As for running automatically and fixing any problems, we may need to make
> a decision.  If it won't mount due to a log problem, do we automatically
> use -L or drop to a shell and punt to the admin?  (That's what we would
> do w/o any fsck -f invocation today...)

<shrug> I don't particularly like the idea of automatic -L.  That might
just be paranoia on my part, since the last time I had to run repair -L
it was because the rootfs wouldn't mount due to a bug in the log, and
in the end reinstalling the system was less troublesome than digging
through all the pieces of the now-destroyed rootfs. :/

--D

> > Also, fsck exit values have specific meaning to the boot
> > infrastructure and xfs_repair does not follow them. Hence returning
> > the output of xfs_repair to the fsck caller is going to result in
> > unexpected/undesired behaviour. From the fsck man page:
> > 
> >       The exit code returned by fsck is the sum of the following conditions:
> > 
> >               0      No errors
> >               1      Filesystem errors corrected
> >               2      System should be rebooted
> >               4      Filesystem errors left uncorrected
> >               8      Operational error
> >               16     Usage or syntax error
> >               32     Checking canceled by user request
> >               128    Shared-library error
> > 
> > So there's error post processing that is needed here so that the
> > infrastructure is given the correct status indication so it will
> > do things like reboot the system if necessary after a repair...
> 
> Good point, thanks.
>  
> > I also wonder if we can limit this to just the boot infrastructure,
> > because I really don't like the idea of users using fsck.xfs -f to
> > repair damaged filesystems because "that's what I do to repair ext4
> > filesystems"....
> 
> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
> than bare xfs_repair...  (Unless all of the above suggestions about dirty
> logs get added, then it certainly is!)  So, yeah...
> 
> How would you propose limiting it to the boot environment?  I wondered
> about the script itself checking for /forcefsck or the boot parameters,
> but at least the boot params probably last for the duration of the uptime.
> And re-coding / re-implementing the systemd checks in our own script
> probably is a bad idea, so forget I suggested it ...
> 
> > Also missing is a fsck.xfs man page update to document the option.
> 
> *nod*
> 
>  
> >>  if $AUTO; then
> >>  	echo "$0: XFS file system."
> >>  else
> >>  	echo "If you wish to check the consistency of an XFS filesystem or"
> >>  	echo "repair a damaged filesystem, see xfs_repair(8)."
> >>  fi
> >> -exit 0
> > 
> > I think we still need to exit with a zero status if we did nothing,
> > because that's what the caller is expecting....
> > 
> > Cheers,
> > 
> > Dave.
> > 
Dave Chinner March 5, 2018, 10:31 p.m. UTC | #4
On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> On 3/5/18 3:56 PM, Dave Chinner wrote:
> > On Mon, Mar 05, 2018 at 04:05:46PM +0100, Jan Tulak wrote:
> >> The fsck.xfs script did nothing, because xfs doesn't need a fsck to be
> >> run on every unclean shutdown. However, sometimes the root filesystem
> >> really does require xfs_repair, and then it is a hassle. This patch
> >> makes the situation a bit easier by detecting forced checks (/forcefsck
> >> or fsck.mode=force), so the user can request the repair without it
> >> being run all the time.
> >>
> >> (Thanks Eric for suggesting this.)
> >>
> >> Signed-off-by: Jan Tulak <jtulak@redhat.com>
> > 
> > Ok, so I can see why support for this is probably necessary, but I have
> > a few reservations about the implementation....
> > 
> >> ---
> >>  fsck/xfs_fsck.sh | 14 ++++++++++++--
> >>  1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
> >> index e52969e4..71bfa2e1 100755
> >> --- a/fsck/xfs_fsck.sh
> >> +++ b/fsck/xfs_fsck.sh
> >> @@ -4,10 +4,12 @@
> >>  #
> >>  
> >>  AUTO=false
> >> -while getopts ":aApy" c
> >> +FORCE=false
> >> +while getopts ":aApyf" c
> >>  do
> >>  	case $c in
> >>  	a|A|p|y)	AUTO=true;;
> >> +	f)      	FORCE=true;;
> >>  	esac
> >>  done
> >>  eval DEV=\${$#}
> >> @@ -15,10 +17,18 @@ if [ ! -e $DEV ]; then
> >>  	echo "$0: $DEV does not exist"
> >>  	exit 8
> >>  fi
> >> +
> >> +# The flag -f is added by systemd/init scripts when /forcefsck file is present
> >> +# or fsck.mode=force is used during boot; an unclean shutdown won't trigger
> >> +# this check, user has to explicitly require a forced fsck.
> >> +if $FORCE; then
> >> +	xfs_repair $DEV
> >> +	exit $?
> >> +fi
> > 
> > This needs to check that the xfs_repair binary is present in the
> > environment that is running fsck. If this is checking the root fs
> > from the initramfs, then distros are going to need to package
> > xfs_repair into their initramfs build scripts...
> 
> Fedora and RHEL do, FWIW.  Can others check?  What does Debian do?

No idea. I've never needed to peel an initramfs to check what is
inside.

Ah, debian has a tool for that (lsinitramfs)....

$ lsinitramfs /boot/initrd.img-4.15.0-1-amd64  |grep xfs
lib/modules/4.13.0-1-amd64/kernel/fs/xfs
lib/modules/4.13.0-1-amd64/kernel/fs/xfs/xfs.ko
sbin/fsck.xfs
$

Nope, xfs_repair is not packaged in the debian initramfs. And, well,
on a more recently installed machine:

$ lsinitramfs /boot/initrd.img-4.15.0-rc8-amd64  |grep xfs
lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs
lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs/xfs.ko
$

fsck.xfs isn't even in the built initramfs for a machine running
only XFS filesystems....

> > Also, if the log is dirty, xfs_repair won't run. If the filesystem
> > is already mounted read-only, xfs_repair won't run. So if we're
> > forcing a boot time check, we want it to run unconditionally and fix
> > any problems found automatically, right?
> 
> Yep, I'm curious if this was tested - I played with something like this
> a while ago but didn't take notes.  ;)
> 
> As for running automatically and fixing any problems, we may need to make
> a decision.  If it won't mount due to a log problem, do we automatically
> use -L or drop to a shell and punt to the admin?  (That's what we would
> do w/o any fsck -f invocation today...)

Define the expected "forcefsck" semantics, and that will tell us
what we need to do. Is it automatic system recovery? What if the
root fs can't be mounted due to log replay problems?

> > I also wonder if we can limit this to just the boot infrastructure,
> > because I really don't like the idea of users using fsck.xfs -f to
> > repair damaged filesystems because "that's what I do to repair ext4
> > filesystems"....
> 
> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
> than bare xfs_repair...  (Unless all of the above suggestions about dirty
> logs get added, then it certainly is!)  So, yeah...
> 
> How would you propose limiting it to the boot environment? 

I have no idea - this is all way outside my area of expertise...

> I wondered
> about the script itself checking for /forcefsck or the boot parameters,
> but at least the boot params probably last for the duration of the uptime.
> And re-coding / re-implementing the systemd checks in our own script
> probably is a bad idea, so forget I suggested it ...

But, yeah, we don't want to have to do that....

Cheers,

Dave.
Eric Sandeen March 5, 2018, 11:33 p.m. UTC | #5
On 3/5/18 4:31 PM, Dave Chinner wrote:
> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:

...

> Nope, xfs_repair is not packaged in the debian initramfs. And, well,
> on a more recently installed machine:
> 
> $ lsinitramfs /boot/initrd.img-4.15.0-rc8-amd64  |grep xfs
> lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs
> lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs/xfs.ko
> $
> 
> fsck.xfs isn't even in the built initramfs for a machine running
> only XFS filesystems....

Ok, well, that's a distro issue not an xfsprogs issue.  :)  Shows
that the script needs to test for presence of xfs_repair, though.
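
(That test is a one-liner; a sketch, using fsck's "operational error"
code for the missing-binary case:)

if ! command -v xfs_repair >/dev/null 2>&1; then
        echo "$0: xfs_repair not found"
        exit 8
fi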

>>> Also, if the log is dirty, xfs_repair won't run. If the filesystem
>>> is already mounted read-only, xfs_repair won't run. So if we're
>>> forcing a boot time check, we want it to run unconditionally and fix
>>> any problems found automatically, right?
>>
>> Yep, I'm curious if this was tested - I played with something like this
>> a while ago but didn't take notes.  ;)
>>
>> As for running automatically and fixing any problems, we may need to make
>> a decision.  If it won't mount due to a log problem, do we automatically
>> use -L or drop to a shell and punt to the admin?  (That's what we would
>> do w/o any fsck -f invocation today...)
> 
> Define the expected "forcefsck" semantics, and that will tell us
> what we need to do. Is it automatic system recovery? What if the
> root fs can't be mounted due to log replay problems?

You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
is copy what e2fsck does - it tries to replay the log before running
the actual fsck.  So ... what does e2fsck do if /it/ can't replay
the log?
 
>>> I also wonder if we can limit this to just the boot infrastructure,
>>> because I really don't like the idea of users using fsck.xfs -f to
>>> repair damage filesystems because "that's what I do to repair ext4
>>> filesystems"....
>>
>> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
>> than bare xfs_repair...  (Unless all of the above suggestions about dirty
>> logs get added, then it certainly is!)  So, yeah...
>>
>> How would you propose limiting it to the boot environment? 
> 
> I have no idea - this is all way outside my area of expertise...

A halfway measure would be to test whether the script is interactive, perhaps?

https://www.tldp.org/LDP/abs/html/intandnonint.html

case $- in
*i*)    # interactive shell
;;
*)      # non-interactive shell
;;
esac
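
(Since a script's own shell is never interactive, $- may not be usable
directly here; a tty test is one workable variant of the same idea.
A sketch, with an arbitrary choice of exit code:)

if [ -t 0 ]; then
        # stdin is a terminal: a human ran fsck.xfs -f by hand
        echo "to repair a damaged filesystem, see xfs_repair(8)"
        exit 16
else
        # no terminal: assume we were run by the boot infrastructure
        xfs_repair "$DEV"
fi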

-Eric
Jan Tulak March 6, 2018, 11:51 a.m. UTC | #6
On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On 3/5/18 4:31 PM, Dave Chinner wrote:
>> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
>
> ...
>
>> Nope, xfs_repair is not packaged in the debian initramfs. And, well,
>> on a more recently installed machine:
>>
>> $ lsinitramfs /boot/initrd.img-4.15.0-rc8-amd64  |grep xfs
>> lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs
>> lib/modules/4.15.0-rc8-amd64/kernel/fs/xfs/xfs.ko
>> $
>>
>> fsck.xfs isn't even in the built initramfs for a machine running
>> only XFS filesystems....
>
> Ok, well, that's a distro issue not an xfsprogs issue.  :)  Shows
> that the script needs to test for presence of xfs_repair, though.

Agreed.

>
>>>> Also, if the log is dirty, xfs_repair won't run. If the filesystem
>>>> is already mounted read-only, xfs_repair won't run. So if we're
>>>> forcing a boot time check, we want it to run unconditionally and fix
>>>> any problems found automatically, right?
>>>
>>> Yep, I'm curious if this was tested - I played with something like this
>>> a while ago but didn't take notes.  ;)

I tested that it doesn't run when it shouldn't, but forgot to test with
a damaged fs. I thought I had tested a dirty log, but apparently I
didn't check the result.
/me looks ashamed at his "grep xfs_repair" history.

>>>
>>> As for running automatically and fixing any problems, we may need to make
>>> a decision.  If it won't mount due to a log problem, do we automatically
>>> use -L or drop to a shell and punt to the admin?  (That's what we would
>>> do w/o any fsck -f invocation today...)
>>
>> Define the expected "forcefsck" semantics, and that will tell us
>> what we need to do. Is it automatic system recovery? What if the
>> root fs can't be mounted due to log replay problems?
>
> You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
> is copy what e2fsck does - it tries to replay the log before running
> the actual fsck.  So ... what does e2fsck do if /it/ can't replay
> the log?

As far as I can tell, in that case, e2fsck exit code indicates 4 -
File system errors left uncorrected, but I'm studying ext testing
tools and will try to verify it.
About the -L flag, I think it is a bad idea - we don't want anything
dangerous to happen here, so if it can't be fixed safely and in an
automated way, just bail out.
That being said, I added a log replay attempt in there (via mount/unmount).
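
(Presumably something along these lines - the temporary mountpoint is an
assumption, and see Dave's objection to the whole approach below:)

# attempt log replay by mounting and immediately unmounting
MNT=$(mktemp -d)
if mount "$DEV" "$MNT"; then
        umount "$MNT"
fi
rmdir "$MNT"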

>
>>>> I also wonder if we can limit this to just the boot infrastructure,
>>>> because I really don't like the idea of users using fsck.xfs -f to
>>>> repair damage filesystems because "that's what I do to repair ext4
>>>> filesystems"....
>>>
>>> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
>>> than bare xfs_repair...  (Unless all of the above suggestions about dirty
>>> logs get added, then it certainly is!)  So, yeah...
>>>
>>> How would you propose limiting it to the boot environment?
>>
>> I have no idea - this is all way outside my area of expertise...
>
> A halfway measure would be to test whether the script is interactive, perhaps?
>
> https://www.tldp.org/LDP/abs/html/intandnonint.html
>
> case $- in
> *i*)    # interactive shell
> ;;
> *)      # non-interactive shell
> ;;
> esac
>

IMO, any such test would make fsck.xfs behave unpredictably for the
user. If anyone wants to run fsck.xfs -f instead of xfs_repair, it is
their choice. We can print something like "next time use xfs_repair
directly" for an interactive session, but I don't like the idea of the
script doing different things based on variables that are hidden from
the user.

Jan
Dave Chinner March 6, 2018, 9:39 p.m. UTC | #7
On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
> > On 3/5/18 4:31 PM, Dave Chinner wrote:
> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> >>> As for running automatically and fixing any problems, we may need to make
> >>> a decision.  If it won't mount due to a log problem, do we automatically
> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
> >>> do w/o any fsck -f invocation today...)
> >>
> >> Define the expected "forcefsck" semantics, and that will tell us
> >> what we need to do. Is it automatic system recovery? What if the
> >> root fs can't be mounted due to log replay problems?
> >
> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
> > is copy what e2fsck does - it tries to replay the log before running
> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
> > the log?
> 
> As far as I can tell, in that case, e2fsck exit code indicates 4 -
> File system errors left uncorrected, but I'm studying ext testing
> tools and will try to verify it.
> About the -L flag, I think it is a bad idea - we don't want anything
> dangerous to happen here, so if it can't be fixed safely and in an
> automated way, just bail out.
> That being said, I added a log replay attempt in there (via mount/unmount).

I really don't advise doing that for a forced filesystem check. If
the log is corrupt, mounting it will trigger the problems we are
trying to avoid/fix by running a forced filesystem check. As it is,
we're probably being run in this mode because mounting has already
failed, causing the system not to boot.

What we need to do is list how the startup scripts work according to
what error is returned, and then match the behaviour we want in a
specific corruption case to the behaviour of a specific return
value.

i.e. if we have a dirty log, then really we need manual
intervention. That means we need to return an error that will cause
the startup script to stop and drop into an interactive shell for
the admin to fix manually.

This is what I mean by "define the expected forcefsck semantics" -
describe the behaviour of the system in response to the errors we can
return to it, and match them to the problem cases we need to resolve
with fsck.xfs.

> >>>> I also wonder if we can limit this to just the boot infrastructure,
> >>>> because I really don't like the idea of users using fsck.xfs -f to
> >>>> repair damaged filesystems because "that's what I do to repair ext4
> >>>> filesystems"....
> >>>
> >>> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
> >>> than bare xfs_repair...  (Unless all of the above suggestions about dirty
> >>> logs get added, then it certainly is!)  So, yeah...
> >>>
> >>> How would you propose limiting it to the boot environment?
> >>
> >> I have no idea - this is all way outside my area of expertise...
> >
> > A halfway measure would be to test whether the script is interactive, perhaps?
> >
> > https://www.tldp.org/LDP/abs/html/intandnonint.html
> >
> > case $- in
> > *i*)    # interactive shell
> > ;;
> > *)      # non-interactive shell
> > ;;
> > esac
> >
> 
> IMO, any such test would make fsck.xfs behave unpredictably for the
> user. If anyone wants to run fsck.xfs -f instead of xfs_repair, it is
> their choice.

We limit user choices all the time. Default values, config options,
tuning variables, etc, IOWs, it's our choice as developers to allow
users to do something or not.  And in this case, we made this choice
to limit what fsck.xfs could do a long time ago:

# man fsck.xfs
.....
	If you wish to check the consistency of an XFS filesystem,
	or repair a damaged or corrupt XFS filesystem, see
	xfs_repair(8).
.....
# fsck.xfs
If you wish to check the consistency of an XFS filesystem or
repair a damaged filesystem, see xfs_repair(8).
#


> We can print something like "next time use xfs_repair
> directly" for an interactive session, but I don't like the idea of the
> script doing different things based on variables that are hidden from
> the user.

What hidden variable are you talking about here? Having a script
determine behaviour based on whether it is in an interactive
session or not is a common thing to do. There's nothing tricky or
unusual about it....

Cheers,

Dave.
Jan Tulak March 8, 2018, 10:57 a.m. UTC | #8
On Tue, Mar 6, 2018 at 10:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
>> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> > On 3/5/18 4:31 PM, Dave Chinner wrote:
>> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
>> >>> As for running automatically and fixing any problems, we may need to make
>> >>> a decision.  If it won't mount due to a log problem, do we automatically
>> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
>> >>> do w/o any fsck -f invocation today...)
>> >>
>> >> Define the expected "forcefsck" semantics, and that will tell us
>> >> what we need to do. Is it automatic system recovery? What if the
>> >> root fs can't be mounted due to log replay problems?
>> >
>> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
>> > is copy what e2fsck does - it tries to replay the log before running
>> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
>> > the log?
>>
>> As far as I can tell, in that case, e2fsck exit code indicates 4 -
>> File system errors left uncorrected, but I'm studying ext testing
>> tools and will try to verify it.
>> About the -L flag, I think it is a bad idea - we don't want anything
>> dangerous to happen here, so if it can't be fixed safely and in an
>> automated way, just bail out.
>> That being said, I added a log replay attempt in there (via mount/unmount).
>
> I really don't advise doing that for a forced filesystem check. If
> the log is corrupt, mounting it will trigger the problems we are
> trying to avoid/fix by running a forced filesystem check. As it is,
> we're probably being run in this mode because mounting has already
> failed, causing the system not to boot.
>
> What we need to do is list how the startup scripts work according to
> what error is returned, and then match the behaviour we want in a
> specific corruption case to the behaviour of a specific return
> value.
>
> i.e. if we have a dirty log, then really we need manual
> intervention. That means we need to return an error that will cause
> the startup script to stop and drop into an interactive shell for
> the admin to fix manually.
>
> This is what I mean by "define the expected forcefsck semantics" -
> describe the behaviour of the system in response to the errors we can
> return to it, and match them to the problem cases we need to resolve
> with fsck.xfs.

I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
corrected, system should be rebooted" and "File system errors left
uncorrected") drop the user into the emergency shell. Anything else
and the boot continues.

This happens before root volume is mounted during the boot, so I
propose this behaviour for fsck.xfs:
- if the volume/device is mounted, exit with 16 - usage or syntax
error (just to be sure)
- if the volume/device has a dirty log, exit with 4 - errors left
uncorrected (drop to the shell)
- if we find no errors, exit with 0 - no errors
- if we find anything and xfs_repair ends successfully, exit with 1 -
errors corrected
- anything else and exit with 8 - operational error
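
(A sketch of that proposal in shell - the mounted-device test is an
assumption, and note that the "errors corrected" case (exit 1) cannot
be produced from xfs_repair's current exit codes, which is exactly the
question below:)

if grep -q "^$DEV " /proc/mounts; then
        exit 16        # mounted: usage error, just to be sure
fi
xfs_repair "$DEV"
case $? in
0) exit 0;;        # completed with no runtime problems
2) exit 4;;        # dirty log: errors left uncorrected
*) exit 8;;        # anything else: operational error
esac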

And is there any other way to get the "there were some errors, but
we corrected them" other than either 1) screenscrape xfs_repair or 2)
run xfs_repair twice, once with -n to detect and then without -n to
fix the found errors?

>
>> >>>> I also wonder if we can limit this to just the boot infrastructure,
>> >>>> because I really don't like the idea of users using fsck.xfs -f to
>> >>>> repair damaged filesystems because "that's what I do to repair ext4
>> >>>> filesystems"....
>> >>>
>> >>> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
>> >>> than bare xfs_repair...  (Unless all of the above suggestions about dirty
>> >>> logs get added, then it certainly is!)  So, yeah...
>> >>>
>> >>> How would you propose limiting it to the boot environment?
>> >>
>> >> I have no idea - this is all way outside my area of expertise...
>> >
>> > A halfway measure would be to test whether the script is interactive, perhaps?
>> >
>> > https://www.tldp.org/LDP/abs/html/intandnonint.html
>> >
>> > case $- in
>> > *i*)    # interactive shell
>> > ;;
>> > *)      # non-interactive shell
>> > ;;
>> > esac
>> >
>>
>> IMO, any such test would make fsck.xfs behave unpredictably for the
>> user. If anyone wants to run fsck.xfs -f instead of xfs_repair, it is
>> their choice.
>
> We limit user choices all the time. Default values, config options,
> tuning variables, etc, IOWs, it's our choice as developers to allow
> users to do something or not.  And in this case, we made this choice
> to limit what fsck.xfs could do a long time ago:
>
> # man fsck.xfs
> .....
>         If you wish to check the consistency of an XFS filesystem,
>         or repair a damaged or corrupt XFS filesystem, see
>         xfs_repair(8).
> .....
> # fsck.xfs
> If you wish to check the consistency of an XFS filesystem or
> repair a damaged filesystem, see xfs_repair(8).
> #
>

At that point, it was consistent behaviour: do nothing all the time,
no matter what.

>
>> We can print something like "next time use xfs_repair
>> directly" for an interactive session, but I don't like the idea of the
>> script doing different things based on variables that are hidden from
>> the user.
>
> What hidden variable are you talking about here? Having a script
> determine behaviour based on whether it is in an interactive
> session or not is a common thing to do. There's nothing tricky or
> unusual about it....
>

I'm not aware of any script or tool that would refuse to work except
when started in a specific environment and noninteractively (doesn't
mean they don't exist, but it is not common). And because it seems
that fsck.xfs -f will do only what bare xfs_repair would do, no log
replay, nothing... then I really think that changing what the script
does (not just altering its output) based on environment tests is
unnecessary. And for anyone without this specific knowledge, it would
be confusing - people expect that for the same input the application
or script does the same thing at the end.

Cheers,
Jan
Darrick J. Wong March 8, 2018, 4:28 p.m. UTC | #9
On Thu, Mar 08, 2018 at 11:57:40AM +0100, Jan Tulak wrote:
> On Tue, Mar 6, 2018 at 10:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
> >> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
> >> > On 3/5/18 4:31 PM, Dave Chinner wrote:
> >> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> >> >>> As for running automatically and fixing any problems, we may need to make
> >> >>> a decision.  If it won't mount due to a log problem, do we automatically
> >> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
> >> >>> do w/o any fsck -f invocation today...)
> >> >>
> >> >> Define the expected "forcefsck" semantics, and that will tell us
> >> >> what we need to do. Is it automatic system recovery? What if the
> >> >> root fs can't be mounted due to log replay problems?
> >> >
> >> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
> >> > is copy what e2fsck does - it tries to replay the log before running
> >> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
> >> > the log?
> >>
> >> As far as I can tell, in that case, e2fsck exit code indicates 4 -
> >> File system errors left uncorrected, but I'm studying ext testing
> >> tools and will try to verify it.
> >> About the -L flag, I think it is a bad idea - we don't want anything
> >> dangerous to happen here, so if it can't be fixed safely and in an
> >> automated way, just bail out.
> >> That being said, I added a log replay attempt in there (via mount/unmount).
> >
> > I really don't advise doing that for a forced filesystem check. If
> > the log is corrupt, mounting it will trigger the problems we are
> > trying to avoid/fix by running a forced filesystem check. As it is,
> > we're probably being run in this mode because mounting has already
> > failed, causing the system not to boot.
> >
> > What we need to do is list how the startup scripts work according to
> > what error is returned, and then match the behaviour we want in a
> > specific corruption case to the behaviour of a specific return
> > value.
> >
> > i.e. if we have a dirty log, then really we need manual
> > intervention. That means we need to return an error that will cause
> > the startup script to stop and drop into an interactive shell for
> > the admin to fix manually.
> >
> > This is what I mean by "define the expected forcefsck semantics" -
> > describe the behaviour of the system in response to the errors we can
> > return to it, and match them to the problem cases we need to resolve
> > with fsck.xfs.
> 
> I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
> corrected, system should be rebooted" and "File system errors left
> uncorrected") drop the user into the emergency shell. Anything else
> and the boot continues.

FWIW Debian seems to panic() if the exit code has (1 << 2) set, where
"panic()" either drops to a shell if panic= is not given or actually
reboots the machine if panic= is given.  All other cases proceed with
boot, including 2 (errors fixed, reboot now).

That said, the installer seems to set up root xfs as pass 0 in fstab so
fsck is not included in the initramfs at all.

> This happens before root volume is mounted during the boot, so I
> propose this behaviour for fsck.xfs:
> - if the volume/device is mounted, exit with 16 - usage or syntax
> error (just to be sure)
> - if the volume/device has a dirty log, exit with 4 - errors left
> uncorrected (drop to the shell)
> - if we find no errors, exit with 0 - no errors
> - if we find anything and xfs_repair ends successfully, exit with 1 -
> errors corrected
> - anything else and exit with 8 - operational error
> 
> > And is there any other way to get the "there were some errors, but
> we corrected them" other than either 1) screenscrape xfs_repair or 2)
> run xfs_repair twice, once with -n to detect and then without -n to
> fix the found errors?

I wouldn't run it twice, repair can take quite a while to run.

--D

> >
> >> >>>> I also wonder if we can limit this to just the boot infrastructure,
> >> >>>> because I really don't like the idea of users using fsck.xfs -f to
> >> >>>> repair damaged filesystems because "that's what I do to repair ext4
> >> >>>> filesystems"....
> >> >>>
> >> >>> Depending on how this gets fleshed out, fsck.xfs -f isn't any different
> >> >>> than bare xfs_repair...  (Unless all of the above suggestions about dirty
> >> >>> logs get added, then it certainly is!)  So, yeah...
> >> >>>
> >> >>> How would you propose limiting it to the boot environment?
> >> >>
> >> >> I have no idea - this is all way outside my area of expertise...
> >> >
> >> > A halfway measure would be to test whether the script is interactive, perhaps?
> >> >
> >> > https://www.tldp.org/LDP/abs/html/intandnonint.html
> >> >
> >> > case $- in
> >> > *i*)    # interactive shell
> >> > ;;
> >> > *)      # non-interactive shell
> >> > ;;
> >> > esac
> >> >
> >>
> >> IMO, any such test would make fsck.xfs behave unpredictably for the
> >> user. If anyone wants to run fsck.xfs -f instead of xfs_repair, it is
> >> their choice.
> >
> > We limit user choices all the time. Default values, config options,
> > tuning variables, etc, IOWs, it's our choice as developers to allow
> > users to do something or not.  And in this case, we made this choice
> > to limit what fsck.xfs could do a long time ago:
> >
> > # man fsck.xfs
> > .....
> >         If you wish to check the consistency of an XFS filesystem,
> >         or repair a damaged or corrupt XFS filesystem, see
> >         xfs_repair(8).
> > .....
> > # fsck.xfs
> > If you wish to check the consistency of an XFS filesystem or
> > repair a damaged filesystem, see xfs_repair(8).
> > #
> >
> 
> At that point, it was consistent behaviour: do nothing all the time,
> no matter what.
> 
> >
> >> We can print something like "next time use xfs_repair
> >> directly" for an interactive session, but I don't like the idea of the
> >> script doing different things based on variables that are hidden from
> >> the user.
> >
> > What hidden variable are you talking about here? Having a script
> > determine behaviour based on whether it is in an interactive
> > session or not is a common thing to do. There's nothing tricky or
> > unusual about it....
> >
> 
> I'm not aware of any script or tool that would refuse to work except
> when started in a specific environment and noninteractively (doesn't
> mean they don't exist, but it is not common). And because it seems
> that fsck.xfs -f will do only what bare xfs_repair would do, no log
> replay, nothing... then I really think that changing what the script
> does (not just altering its output) based on environment tests is
> unnecessary. And for anyone without this specific knowledge, it would
> be confusing - people expect that for the same input the application
> or script does the same thing at the end.
> 
> Cheers,
> Jan
Dave Chinner March 8, 2018, 10:36 p.m. UTC | #10
On Thu, Mar 08, 2018 at 08:28:38AM -0800, Darrick J. Wong wrote:
> On Thu, Mar 08, 2018 at 11:57:40AM +0100, Jan Tulak wrote:
> > On Tue, Mar 6, 2018 at 10:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
> > >> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
> > >> > On 3/5/18 4:31 PM, Dave Chinner wrote:
> > >> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> > >> >>> As for running automatically and fixing any problems, we may need to make
> > >> >>> a decision.  If it won't mount due to a log problem, do we automatically
> > >> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
> > >> >>> do w/o any fsck -f invocation today...)
> > >> >>
> > >> >> Define the expected "forcefsck" semantics, and that will tell us
> > >> >> what we need to do. Is it automatic system recovery? What if the
> > >> >> root fs can't be mounted due to log replay problems?
> > >> >
> > >> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
> > >> > is copy what e2fsck does - it tries to replay the log before running
> > >> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
> > >> > the log?
> > >>
> > >> As far as I can tell, in that case, e2fsck exit code indicates 4 -
> > >> File system errors left uncorrected, but I'm studying ext testing
> > >> tools and will try to verify it.
> > >> About the -L flag, I think it is a bad idea - we don't want anything
> > >> dangerous to happen here, so if it can't be fixed safely and in an
> > >> automated way, just bail out.
> > >> That being said, I added a log replay attempt in there (via mount/unmount).
> > >
> > > I really don't advise doing that for a forced filesystem check. If
> > > the log is corrupt, mounting it will trigger the problems we are
> > > trying to avoid/fix by running a forced filesystem check. As it is,
> > > we're probably being run in this mode because mounting has already
> > > failed, causing the system not to boot.
> > >
> > > What we need to do is list how the startup scripts work according to
> > > what error is returned, and then match the behaviour we want in a
> > > specific corruption case to the behaviour of a specific return
> > > value.
> > >
> > > i.e. if we have a dirty log, then really we need manual
> > > intervention. That means we need to return an error that will cause
> > > the startup script to stop and drop into an interactive shell for
> > > the admin to fix manually.
> > >
> > > This is what I mean by "define the expected forcefsck semantics" -
> > > describe the behaviour of the system in response to the errors we can
> > > return to it, and match them to the problem cases we need to resolve
> > > with fsck.xfs.
> > 
> > I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
> > corrected, system should be rebooted" and "File system errors left
> > uncorrected") drop the user into the emergency shell. Anything else
> > and the boot continues.
> 
> FWIW Debian seems to panic() if the exit code has (1 << 2) set, where
> "panic()" either drops to a shell if panic= is not given or actually
> reboots the machine if panic= is given.  All other cases proceed with
> boot, including 2 (errors fixed, reboot now).
> 
> That said, the installer seems to set up root xfs as pass 0 in fstab so
> fsck is not included in the initramfs at all.

Ok, so how do we deal with the "all distros are different, none
correctly follow documented fsck behaviour"?  It seems to me like we
need fsck to drop into an admin shell if anything goes wrong,
perhaps with the output saying what went wrong.

e.g. if we corrected errors and need a reboot, then we return 4 to
get the system to drop into a shell with this:

*** Errors corrected, You must reboot now! ***
#

If we had any other failure, then we drop to the shell with this:

*** Failed to repair filesystem. Manual correction required! ***
#

And only in the case that we have an unmounted, clean filesystem do
we continue to boot and mount the root filesystem?

> > This happens before root volume is mounted during the boot, so I
> > propose this behaviour for fsck.xfs:
> > - if the volume/device is mounted, exit with 16 - usage or syntax
> > error (just to be sure)

That's error 4 - requires manual intervention to repair.

> > - if the volume/device has a dirty log, exit with 4 - errors left
> > uncorrected (drop to the shell)

Yup.

> > - if we find no errors, exit with 0 - no errors

Yup, but only if the filesystem is not mounted, otherwise it's
"requires reboot" because repair with no errors still rewrites all
the per-ag metadata and so changes the on disk metadata layout.
Continuing at this point with a mounted filesystem is guaranteed to
corrupt the filesystem.

> > - if we find anything and xfs_repair ends successfully, exit with 1 -
> > errors corrected

Same as the above case - needs reboot.

> > - anything else and exit with 8 - operational error

I'd argue that's "errors uncorrected" (i.e. error 4) because it
requires manual intervention to determine what the error was and
resolve the situation.

> > And is there any other way to get the "there were some errors, but
> > we corrected them" other than either 1) screenscrape xfs_repair or 2)
> > run xfs_repair twice, once with -n to detect and then without -n to
> > fix the found errors?
> 
> I wouldn't run it twice, repair can take quite a while to run.

Besides, we have to treat both cases the same because even if there
were no errors xfs_repair still modifies the metadata on disk....

> > I'm not aware of any script or tool that would refuse to work except
> > when started in a specific environment and noninteractively (doesn't
> > mean they don't exist, but it is not common). And because it seems
> > that fsck.xfs -f will do only what bare xfs_repair would do, no log
> > replay, nothing... then I really think that changing what the script
> > does (not just altering its output) based on environment tests is
> > unnecessary.

This isn't a technical issue - this is a behaviour management issue.
We want people to always run xfs_repair when there's a problem that
needs fixing, not fsck.xfs. fsck.xfs is for startup scripts only and
has very limited scope in what we allow it to do. xfs_repair is the
tool users should be running to fix their filesystems, not trying to
do it indirectly through startup script infrastructure....

Cheers,

Dave.
Eric Sandeen March 8, 2018, 11:28 p.m. UTC | #11
On 3/8/18 4:57 AM, Jan Tulak wrote:
> I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
> corrected, system should be rebooted" and "File system errors left
> uncorrected") drop the user into the emergency shell. Anything else
> and the boot continues.
> 
> This happens before root volume is mounted during the boot, so I
> propose this behaviour for fsck.xfs:
> - if the volume/device is mounted, exit with 16 - usage or syntax
> error (just to be sure)
> - if the volume/device has a dirty log, exit with 4 - errors left
> uncorrected (drop to the shell)
> - if we find no errors, exit with 0 - no errors
> - if we find anything and xfs_repair ends successfully, exit with 1 -
> errors corrected
> - anything else and exit with 8 - operational error
> 
> And is there any other way to get the "there were some errors, but
> we corrected them" other than either 1) screenscrape xfs_repair or 2)
> run xfs_repair twice, once with -n to detect and then without -n to
> fix the found errors?

If it's critical to report whether errors were fixed, it would be trivial
to add a new option to xfs_repair which causes it to test fs_is_dirty for
runs without "-n", and exit with a different value.

The approach above looks ok in general.  The main motivation for this
is to be able to use some config management tool to push out a config
to a whole fleet of machines in a lab, and have them check/fix the
root fs without interaction.  If a filesystem is failing to mount due
to a dirty log, it's likely that the admin is in a more hands-on mode
by then anyway.

A dirty log could also be "operational error" if we decree that forcefsck
and friends should only be used on a clean reboot - but I'm less certain
about that.

Perhaps a guiding principle should be that we have no easy way to repair
the root fs today in /any/ situation, so any correct/safe improvement
to that situation is an improvement, even if not all cases are handled
at this point.

-Eric
Jan Tulak March 14, 2018, 1:30 p.m. UTC | #12
On Fri, Mar 9, 2018 at 12:28 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
>
> If it's critical to report whether errors were fixed, it would be trivial
> to add a new option to xfs_repair which causes it to test fs_is_dirty for
> runs without "-n", and exit with a different value.

I have toyed with it a bit and this seems to be the best option. A
flag that changes the exit code on a successful run. Is exit code 3
ok? According to the man page, only 1 and 2 are currently used, and the
"everything is ok now, but an issue was there" should not be mixed
with the existing ones. I also thought about a flag that would change
all exit codes to fsck ones, but that seems too complicated and
completely unnecessary.

       [...] always return a  status  code
       of  0  if it completes without problems.  If a runtime error is encoun-
       tered during operation, it will return a status of 1.   In  this  case,
       xfs_repair should be restarted.  If xfs_repair is unable to proceed due
       to a dirty log, it will return a status of 2.

As for the flag, how about -e? It is not used yet and it makes some
sense as "e-exit code."

>
> The approach above looks ok in general.  The main motivation for this
> is to be able to use some config management tool to push out a config
> to a whole fleet of machines in a lab, and have them check/fix the
> root fs without interaction.  If a filesystem is failing to mount due
> to a dirty log, it's likely that the admin is in a more hands-on mode
> by then anyway.
>
> A dirty log could also be "operational error" if we decree that forcefsck
> and friends should only be used on a clean reboot - but I'm less certain
> about that.

I thought about it, but "errors left uncorrected" makes more sense here to me.
>
> Perhaps a guiding principle should be that we have no easy way to repair
> the root fs today in /any/ situation, so any correct/safe improvement
> to that situation is an improvement, even if not all cases are handled
> at this point.
>

Agreed.

Jan
Jan Tulak March 14, 2018, 1:51 p.m. UTC | #13
On Thu, Mar 8, 2018 at 11:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Mar 08, 2018 at 08:28:38AM -0800, Darrick J. Wong wrote:
>> On Thu, Mar 08, 2018 at 11:57:40AM +0100, Jan Tulak wrote:
>> > On Tue, Mar 6, 2018 at 10:39 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
>> > >> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
>> > >> > On 3/5/18 4:31 PM, Dave Chinner wrote:
>> > >> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
>> > >> >>> As for running automatically and fixing any problems, we may need to make
>> > >> >>> a decision.  If it won't mount due to a log problem, do we automatically
>> > >> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
>> > >> >>> do w/o any fsck -f invocation today...)
>> > >> >>
>> > >> >> Define the expected "forcefsck" semantics, and that will tell us
>> > >> >> what we need to do. Is it automatic system recovery? What if the
>> > >> >> root fs can't be mounted due to log replay problems?
>> > >> >
>> > >> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
>> > >> > is copy what e2fsck does - it tries to replay the log before running
>> > >> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
>> > >> > the log?
>> > >>
>> > >> As far as I can tell, in that case, e2fsck exit code indicates 4 -
>> > >> File system errors left uncorrected, but I'm studying ext testing
>> > >> tools and will try to verify it.
>> > >> About the -L flag, I think it is a bad idea - we don't want anything
>> > >> dangerous to happen here, so if it can't be fixed safely and in an
>> > >> automated way, just bail out.
>> > >> That being said, I added a log replay attempt in there (via mount/unmount).
>> > >
>> > > I really don't advise doing that for a forced filesystem check. If
>> > > the log is corrupt, mounting it will trigger the problems we are
>> > > trying to avoid/fix by running a forced filesystem check. As it is,
>> > > we're probably being run in this mode because mounting has already
>> > > failed, causing the system not to boot.
>> > >
>> > > What we need to do is list how the startup scripts work according to
>> > > what error is returned, and then match the behaviour we want in a
>> > > specific corruption case to the behaviour of a specific return
>> > > value.
>> > >
>> > > i.e. if we have a dirty log, then really we need manual
>> > > intervention. That means we need to return an error that will cause
>> > > the startup script to stop and drop into an interactive shell for
>> > > the admin to fix manually.
>> > >
>> > > This is what I mean by "define the expected forcefsck semantics" -
>> > > describe the behaviour of the system in response to the errors we can
>> > > return to it, and match them to the problem cases we need to resolve
>> > > with fsck.xfs.
>> >
>> > I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
>> > corrected, system should be rebooted" and "File system errors left
>> > uncorrected") drop the user into the emergency shell. Anything else
>> > and the boot continues.
>>
>> FWIW Debian seems to panic() if the exit code has (1 << 2) set, where
>> "panic()" either drops to a shell if panic= is not given or actually
>> reboots the machine if panic= is given.  All other cases proceed with
>> boot, including 2 (errors fixed, reboot now).
>>
>> That said, the installer seems to set up root xfs as pass 0 in fstab so
>> fsck is not included in the initramfs at all.
>
> Ok, so how do we deal with the "all distros are different, none
> correctly follow documented fsck behaviour"?  It seems to me like we
> need fsck to drop into an admin shell if anything goes wrong,
> perhaps with the output saying what went wrong.
>
> e.g. if we corrected errors and need a reboot, then we return 4 to
> get the system to drop into a shell with this:
>
> *** Errors corrected, You must reboot now! ***
> #

That makes sense, but I have a question here. Can it happen with
xfs on an unmounted fs, and can I detect it somehow? I can parse
through -vvvv output, but that doesn't look like a sensible solution -
too many things could be missed if we have to watch out for
corrections that require a reboot.

And another thing is that after we drop the user into the shell, I
think they won't see any message directly, but will have to read the
logs to see the "Errors corrected, You must reboot now!"

>
> If we had any other failure, then we drop to the shell with this:
>
> *** Failed to repair filesystem. Manual correction required! ***
> #
>
> And only in the case that we have an unmounted, clean filesystem do
> we continue to boot and mount the root filesystem?
>
>> > This happens before root volume is mounted during the boot, so I
>> > propose this behaviour for fsck.xfs:
>> > - if the volume/device is mounted, exit with 16 - usage or syntax
>> > error (just to be sure)
>
> That's error 4 - requires manual intervention to repair.

Mmh... yes, probably better to throw the shell at the user. If this
happens, something has gone bad anyway, because the fsck should be run
before the systemd/init scripts attempt to mount the fs, so it should
never happen in the boot environment.

>
>> > - if the volume/device has a dirty log, exit with 4 - errors left
>> > uncorrected (drop to the shell)
>
> Yup.
>
>> > - if we find no errors, exit with 0 - no errors
>
> Yup, but only if the filesystem is not mounted, otherwise it's
> "requires reboot" because repair with no errors still rewrites all
> the per-ag metadata and so changes the on disk metadata layout.
> Continuing at this point with a mounted filesystem is guaranteed to
> corrupt the filesystem.

We refuse to start with a mounted fs, so this is no issue.

>
>> > - if we find anything and xfs_repair ends successfully, exit with 1 -
>> > errors corrected
>
> Same as the above case - needs reboot.

Do we? The fs wasn't mounted at this point yet. Maybe there is a
reason for a reboot, I just don't know about it. :-)

>
>> > - anything else and exit with 8 - operational error
>
> I'd argue that's "errors uncorrected" (i.e. error 4) because it
> requires manual intervention to determine what the error was and
> resolve the situation.
>
>> > And is there any other way to get the "there were some errors, but
>> > we corrected them" other than either 1) screenscrape xfs_repair or 2)
>> > run xfs_repair twice, once with -n to detect and then without -n to
>> > fix the found errors?
>>
>> I wouldn't run it twice, repair can take quite a while to run.
>
> Besides, we have to treat both cases the same because even if there
> were no errors xfs_repair still modifies the metadata on disk....

A small change in xfs_repair solves this, no issue for fsck anymore.
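
To make the mapping concrete, here is a rough sketch of what the
forced-check branch of fsck.xfs could do under the semantics discussed
above, again assuming the hypothetical "-e"/exit-3 change to
xfs_repair:

	# Sketch only, not a final patch: remap xfs_repair status to
	# the fsck(8) exit codes the boot scripts understand.
	if grep -q "^$DEV " /proc/mounts; then
		echo "$0: $DEV is mounted, not repairing"
		exit 4			# manual intervention required
	fi
	xfs_repair -e "$DEV"
	case $? in
	0) exit 0;;			# clean: continue booting
	3) exit 1;;			# errors found and corrected
	2) exit 4;;			# dirty log: needs an admin
	*) exit 4;;			# anything else: needs an admin
	esac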

>
>> > I'm not aware of any script or tool that would refuse to work except
>> > when started in a specific environment and noninteractively (doesn't
>> > mean they don't exist, but it is not common). And because it seems
>> > that fsck.xfs -f will do only what bare xfs_repair would do, no log
>> > replay, nothing... then I really think that changing what the script
>> > does (not just altering its output) based on environment tests is
>> > unnecessary.
>
> This isn't a technical issue - this is a behaviour management issue.
> We want people to always run xfs_repair when there's a problem that
> needs fixing, not fsck.xfs. fsck.xfs is for startup scripts only and
> has very limited scope in what we allow it to do. xfs_repair is the
> tool users should be running to fix their filesystems, not trying to
> do it indirectly through startup script infrastructure....
>

OK

Cheers,
Jan
Eric Sandeen March 14, 2018, 3:19 p.m. UTC | #14
On 3/14/18 9:30 AM, Jan Tulak wrote:
> On Fri, Mar 9, 2018 at 12:28 AM, Eric Sandeen <sandeen@sandeen.net>
> wrote:
>> 
>> If it's critical to report whether errors were fixed, it would be
>> trivial to add a new option to xfs_repair which causes it to test
>> fs_is_dirty for runs without "-n", and exit with a different
>> value.
> 
> I have toyed with it a bit and this seems to be the best option. A 
> flag that changes the exit code on a successful run. Is exit code 3 
> ok? According to the man page, only 1 and 2 are currently used and the
> "everything is ok now, but an issue was there" should not be mixed 
> with the existing ones. I also thought about a flag that would
> change all exit codes to fsck ones, but that seems too complicated
> and completely unnecessary.

Hm, I guess we'll have to.
 
> [...] always return a status code of 0 if it completes without
> problems.  If a runtime error is encountered during operation, it
> will return a status of 1.  In this case, xfs_repair should be
> restarted.  If xfs_repair is unable to proceed due to a dirty log, it
> will return a status of 2.
> 
> As for the flag, how about -e? It is not used yet and it makes some 
> sense as "e-exit code."

sure, sounds fine.

-Eric

Darrick J. Wong March 14, 2018, 3:25 p.m. UTC | #15
On Wed, Mar 14, 2018 at 02:51:51PM +0100, Jan Tulak wrote:
> On Thu, Mar 8, 2018 at 11:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Mar 08, 2018 at 08:28:38AM -0800, Darrick J. Wong wrote:
> >> On Thu, Mar 08, 2018 at 11:57:40AM +0100, Jan Tulak wrote:
> >> > On Tue, Mar 6, 2018 at 10:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > > On Tue, Mar 06, 2018 at 12:51:18PM +0100, Jan Tulak wrote:
> >> > >> On Tue, Mar 6, 2018 at 12:33 AM, Eric Sandeen <sandeen@sandeen.net> wrote:
> >> > >> > On 3/5/18 4:31 PM, Dave Chinner wrote:
> >> > >> >> On Mon, Mar 05, 2018 at 04:06:38PM -0600, Eric Sandeen wrote:
> >> > >> >>> As for running automatically and fixing any problems, we may need to make
> >> > >> >>> a decision.  If it won't mount due to a log problem, do we automatically
> >> > >> >>> use -L or drop to a shell and punt to the admin?  (That's what we would
> >> > >> >>> do w/o any fsck -f invocation today...)
> >> > >> >>
> >> > >> >> Define the expected "forcefsck" semantics, and that will tell us
> >> > >> >> what we need to do. Is it automatic system recovery? What if the
> >> > >> >> root fs can't be mounted due to log replay problems?
> >> > >> >
> >> > >> > You're asking too much.  ;)  Semantics?  ;)  Best we can probably do
> >> > >> > is copy what e2fsck does - it tries to replay the log before running
> >> > >> > the actual fsck.  So ... what does e2fsck do if /it/ can't replay
> >> > >> > the log?
> >> > >>
> >> > >> As far as I can tell, in that case, e2fsck exit code indicates 4 -
> >> > >> File system errors left uncorrected, but I'm studying ext testing
> >> > >> tools and will try to verify it.
> >> > >> About the -L flag, I think it is a bad idea - we don't want anything
> >> > >> dangerous to happen here, so if it can't be fixed safely and in an
> >> > >> automated way, just bail out.
> >> > >> That being said, I added a log replay attempt in there (via mount/unmount).
> >> > >
> >> > > I really don't advise doing that for a forced filesystem check. If
> >> > > the log is corrupt, mounting it will trigger the problems we are
> >> > > trying to avoid/fix by running a forced filesystem check. As it is,
> >> > > we're probably being run in this mode because mounting has already
> >> > > failed, causing the system not to boot.
> >> > >
> >> > > What we need to do is list how the startup scripts work according to
> >> > > what error is returned, and then match the behaviour we want in a
> >> > > specific corruption case to the behaviour of a specific return
> >> > > value.
> >> > >
> >> > > i.e. if we have a dirty log, then really we need manual
> >> > > intervention. That means we need to return an error that will cause
> >> > > the startup script to stop and drop into an interactive shell for
> >> > > the admin to fix manually.
> >> > >
> >> > > This is what I mean by "define the expected forcefsck semantics" -
> >> > > describe the behaviour of the system in response to the errors we can
> >> > > return to it, and match them to the problem cases we need to resolve
> >> > > with fsck.xfs.
> >> >
> >> > I tested it on Fedora 27. Exit codes 2 and 4 ("File system errors
> >> > corrected, system should be rebooted" and "File system errors left
> >> > uncorrected") drop the user into the emergency shell. Anything else
> >> > and the boot continues.
> >>
> >> FWIW Debian seems to panic() if the exit code has (1 << 2) set, where
> >> "panic()" either drops to a shell if panic= is not given or actually
> >> reboots the machine if panic= is given.  All other cases proceed with
> >> boot, including 2 (errors fixed, reboot now).
> >>
> >> That said, the installer seems to set up root xfs as pass 0 in fstab so
> >> fsck is not included in the initramfs at all.
> >
> > Ok, so how do we deal with the "all distros are different, none
> > correctly follow documented fsck behaviour"?  It seems to me like we
> > need fsck to drop into an admin shell if anything goes wrong,
> > perhaps with the output saying what went wrong.
> >
> > e.g. if we corrected errors and need a reboot, then we return 4 to
> > get the system to drop into a shell with this:
> >
> > *** Errors corrected, You must reboot now! ***
> > #
> 
> That makes sense, but I have a question here. Can it happen with
> xfs on an unmounted fs, and can I detect it somehow? I can parse

"errors fixed, you must reboot now" ought never happen with an unmounted
fs.  Eric pointed out that at least on Debian the initramfs scripts rely
on the initramfs hook scripts dumping everything the initramfs scripts
will need into the initramfs itself, so there is no more "load the root
fs to check the rootfs" like we used to have.

Perhaps that's why Debian ignores that case in
/usr/share/initramfs*/scripts/functions, and if we can make the same
argument for rhel/suse/anyotherdistrothatcares then maybe we can lay
this case to rest once and for all.

(Or not bother with -d I suppose, as Jan suggests further down.)

> through -vvvv output, but that doesn't look like a sensible solution -
> too many things could be missed if we have to watch out for
> corrections that require a reboot.
> 
> And another thing is that after we drop the user into the shell, I
> think they won't see any message directly, but will have to read the
> logs to see the "Errors corrected, You must reboot now!"

Hopefully the repair output was the last thing printed on the screen
before we started the shell?

> >
> > If we had any other failure, then we drop to the shell with this:
> >
> > *** Failed to repair filesystem. Manual correction required! ***
> > #
> >
> > And only in the case that we have an unmounted, clean filesystem do
> > we continue to boot and mount the root filesystem?
> >
> >> > This happens before root volume is mounted during the boot, so I
> >> > propose this behaviour for fsck.xfs:
> >> > - if the volume/device is mounted, exit with 16 - usage or syntax
> >> > error (just to be sure)
> >
> > That's error 4 - requires manual intervention to repair.
> 
> Mmh... yes, probably better to throw the shell at the user. If this
> happens, something has gone bad anyway, because the fsck should be run
> before the systemd/init scripts attempt to mount the fs, so it should
> never happen in the boot environment.

Agreed.

> >
> >> > - if the volume/device has a dirty log, exit with 4 - errors left
> >> > uncorrected (drop to the shell)
> >
> > Yup.
> >
> >> > - if we find no errors, exit with 0 - no errors
> >
> > Yup, but only if the filesystem is not mounted, otherwise it's
> > "requires reboot" because repair with no errors still rewrites all
> > the per-ag metadata and so changes the on disk metadata layout.
> > Continuing at this point with a mounted filesystem is guaranteed to
> > corrupt the filesystem.
> 
> We refuse to start with a mounted fs, so this is no issue.
> 
> >
> >> > - if we find anything and xfs_repair ends successfully, exit with 1 -
> >> > errors corrected
> >
> > Same as the above case - needs reboot.
> 
> Do we? The fs wasn't mounted at this point yet. Maybe there is a
> reason for a reboot, I just don't know about it. :-)

I agree that it's fine to move on to mounting the rootfs provided we
don't do -d.

--D

> >
> >> > - anything else and exit with 8 - operational error
> >
> > I'd argue that's "errors uncorrected" (i.e. error 4) because it
> > requires manual intervention to determine what the error was and
> > resolve the situation.
> >
> >> > And is there any other way to get the "there were some errors, but
> >> > we corrected them" other than either 1) screenscrape xfs_repair or 2)
> >> > run xfs_repair twice, once with -n to detect and then without -n to
> >> > fix the found errors?
> >>
> >> I wouldn't run it twice, repair can take quite a while to run.
> >
> > Besides, we have to treat both cases the same because even if there
> > were no errors xfs_repair still modifies the metadata on disk....
> 
> A small change in xfs_repair solves this, no issue for fsck anymore.
> 
> >
> >> > I'm not aware of any script or tool that would refuse to work except
> >> > when started in a specific environment and noninteractively (doesn't
> >> > mean they don't exist, but it is not common). And because it seems
> >> > that fsck.xfs -f will do only what bare xfs_repair would do, no log
> >> > replay, nothing... then I really think that changing what the script
> >> > does (not just altering its output) based on environment tests is
> >> > unnecessary.
> >
> > This isn't a technical issue - this is a behaviour management issue.
> > We want people to always run xfs_repair when there's a problem that
> > needs fixing, not fsck.xfs. fsck.xfs is for startup scripts only and
> > has very limited scope in what we allow it to do. xfs_repair is the
> > tool users should be running to fix their filesystems, not trying to
> > do it indirectly through startup script infrastructure....
> >
> 
> OK
> 
> Cheers,
> Jan
Dave Chinner March 14, 2018, 9:10 p.m. UTC | #16
On Wed, Mar 14, 2018 at 08:25:04AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 14, 2018 at 02:51:51PM +0100, Jan Tulak wrote:
> > On Thu, Mar 8, 2018 at 11:36 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > - if the volume/device has a dirty log, exit with 4 - errors left
> > >> > uncorrected (drop to the shell)
> > >
> > > Yup.
> > >
> > >> > - if we find no errors, exit with 0 - no errors
> > >
> > > Yup, but only if the filesystem is not mounted, otherwise it's
> > > "requires reboot" because repair with no errors still rewrites all
> > > the per-ag metadata and so changes the on disk metadata layout.
> > > Continuing at this point with a mounted filesystem is guaranteed to
> > > corrupt the filesystem.
> > 
> > We refuse to start with a mounted fs, so this is no issue.
> > 
> > >
> > >> > - if we find anything and xfs_repair ends successfully, exit with 1 -
> > >> > errors corrected
> > >
> > > Same as the above case - needs reboot.
> > 
> > Do we? The fs wasn't mounted at this point yet. Maybe there is a
> > reason for a reboot, I just don't know about it. :-)
> 
> I agree that it's fine to move on to mounting the rootfs provided we
> don't do -d.

Right, my point was "same as the above case", where I said:

| Yup, but only if the filesystem is not mounted, otherwise it's
| "requires reboot"

-Dave.
Jan Tulak March 15, 2018, 11:16 a.m. UTC | #17
On Wed, Mar 14, 2018 at 4:19 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> On 3/14/18 9:30 AM, Jan Tulak wrote:
>> On Fri, Mar 9, 2018 at 12:28 AM, Eric Sandeen <sandeen@sandeen.net>
>> wrote:
>>>
>>> If it's critical to report whether errors were fixed, it would be
>>> trivial to add a new option to xfs_repair which causes it to test
>>> fs_is_dirty for runs without "-n", and exit with a different
>>> value.
>>
>> I have toyed with it a bit and this seems to be the best option. A
>> flag that changes the exit code on a successful run. Is exit code 3
>> ok? According to man page, only 1 and 2 are currently used and the
>> "everything is ok now, but an issue was there" should not be mixed
>> with the existing ones. I also thought about a flag that would
>> change all exit codes to fsck ones, but that seems too complicated
>> and completely unnecessary.
>
> Hm, I guess we'll have to.

We can either map xfs_repair to fsck in the fsck.xfs script, or change
xfs_repair to use the fsck exit codes. I'm for the first variant (just
add one new exit code for xfs_repair and remap it to the fsck codes in
the script) rather than complicate xfs_repair with two sets of exit
codes.

Jan
Jan Tulak March 15, 2018, 5:01 p.m. UTC | #18
On Wed, Mar 14, 2018 at 4:25 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>> through -vvvv output, but that doesn't look like a sensible solution -
>> too many things that could be missed if it is an issue and we need to
>> watch out for corrections that requires a reboot.
>>
>> And another thing is that after we drop the user into the shell, they
>> won't see any message directly I think but will have to read the logs
>> to see the "Errors corrected, You must reboot now!"
>
> Hopefully the repair output was the last thing printed on the screen
> before we started the shell?
>

Sadly, not in all distros. What you see in Fedora is that a service
failed, plus "see 'systemctl status systemd-fsck-root.service' for
details", followed by status reports for other things. It at least
points you in the right direction, and in the log you can see the
fsck.xfs and xfs_repair output. I doubt there is any better option,
and at least it is in the log if not on the screen.
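
For reference, on a systemd-based distro the relevant output should
afterwards be retrievable with something like:

	systemctl status systemd-fsck-root.service
	journalctl -u systemd-fsck-root.service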

Jan
Dave Chinner March 15, 2018, 10:19 p.m. UTC | #19
On Thu, Mar 15, 2018 at 12:16:44PM +0100, Jan Tulak wrote:
> On Wed, Mar 14, 2018 at 4:19 PM, Eric Sandeen <sandeen@sandeen.net> wrote:
> > On 3/14/18 9:30 AM, Jan Tulak wrote:
> >> On Fri, Mar 9, 2018 at 12:28 AM, Eric Sandeen <sandeen@sandeen.net>
> >> wrote:
> >>>
> >>> If it's critical to report whether errors were fixed, it would be
> >>> trivial to add a new option to xfs_repair which causes it to test
> >>> fs_is_dirty for runs without "-n", and exit with a different
> >>> value.
> >>
> >> I have toyed with it a bit and this seems to be the best option. A
> >> flag that changes the exit code on a successful run. Is exit code 3
> >> ok? According to the man page, only 1 and 2 are currently used and the
> >> "everything is ok now, but an issue was there" should not be mixed
> >> with the existing ones. I also thought about a flag that would
> >> change all exit codes to fsck ones, but that seems too complicated
> >> and completely unnecessary.
> >
> > Hm, I guess we'll have to.
> 
> We can either map xfs_repair to fsck in the fsck.xfs script, or change
> xfs_repair to use the fsck exit codes. I'm for the first variant (just
> add one new exit code for xfs_repair and remap it to the fsck codes in
> the script) rather than complicate xfs_repair with two sets of exit
> codes.

Agreed, any fsck-specific exit code mapping should be in the fsck
script. Let's try to keep xfs_repair isolated from all the fsck
interface bogosities....

Cheers,

Dave.
diff mbox

Patch

diff --git a/fsck/xfs_fsck.sh b/fsck/xfs_fsck.sh
index e52969e4..71bfa2e1 100755
--- a/fsck/xfs_fsck.sh
+++ b/fsck/xfs_fsck.sh
@@ -4,10 +4,12 @@ 
 #
 
 AUTO=false
-while getopts ":aApy" c
+FORCE=false
+while getopts ":aApyf" c
 do
 	case $c in
 	a|A|p|y)	AUTO=true;;
+	f)      	FORCE=true;;
 	esac
 done
 eval DEV=\${$#}
@@ -15,10 +17,18 @@  if [ ! -e $DEV ]; then
 	echo "$0: $DEV does not exist"
 	exit 8
 fi
+
+# The flag -f is added by systemd/init scripts when /forcefsck file is present
+# or fsck.mode=force is used during boot; an unclean shutdown won't trigger
+# this check, user has to explicitly require a forced fsck.
+if $FORCE; then
+	xfs_repair $DEV
+	exit $?
+fi
+
 if $AUTO; then
 	echo "$0: XFS file system."
 else
 	echo "If you wish to check the consistency of an XFS filesystem or"
 	echo "repair a damaged filesystem, see xfs_repair(8)."
 fi
-exit 0