[3/4] re-enable "ocfs2: mount shared volume without ha stack"

Message ID 20220730011411.11214-4-heming.zhao@suse.com (mailing list archive)
State New, archived
Series re-enable non-clustered mount & add MMP support

Commit Message

heming.zhao@suse.com July 30, 2022, 1:14 a.m. UTC
The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
the conversion job without the HA stack, while non-clustered mount can
run entirely without the HA stack.

Commit 912f655d78c5 ("ocfs2: mount shared volume without ha stack") had
a bug, so commit c80af0c250c8 reverted it.

Let's explain the issue mentioned by commit c80af0c250c8.

In Junxiao's call trace, the 'if' check in __ocfs2_find_empty_slot() is
wrong: sl_node_num can be 0 in an o2cb environment.

With the current information, the trigger flow (based on 912f655d78c5) is:
1>
nodeA mounts with 'node_num = 0'; it succeeds.
At this time, the slotmap extent block contains es_valid:1 &
es_node_num:0 for nodeA,
then ocfs2_update_disk_slot() writes the slotmap info back to disk.

2>
Then nodeB mounts with 'node_num = 1'.
This time, osb->node_num is 1 (set by the config file) and osb->preferred_slot
is OCFS2_INVALID_SLOT (set by ocfs2_parse_options).

ocfs2_find_slot
 + ocfs2_update_slot_info //read slotmap info from disk
 |  + set si->si_slots[0].sl_valid = 1 & si->si_slots[0].sl_node_num = 0
 |
 + __ocfs2_node_num_to_slot //will return -ENOENT.
 + __ocfs2_find_empty_slot
    + if ((preferred >= 0) && (preferred < si->si_num_slots))
    |  doesn't enter this 'if', because preferred is OCFS2_INVALID_SLOT
    |
    + 'for(i = 0; i < si->si_num_slots; i++)' finds slot 0:
    |  'si->si_slots[0].sl_node_num' is 0 (false), which triggers the
    |  'break' condition.
    |
    + return slot 0.
       This causes nodeB to grab nodeA's journal dlm lock, which then
       triggers a hang.
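
To make the failure concrete, here is a minimal standalone sketch of the
slot-search logic as it looked after 912f655d78c5 (the stand-in types and
values are simplified assumptions, not verbatim kernel code); the
'|| !sl_node_num' test is what lets nodeB treat nodeA's slot 0 as empty:

#include <errno.h>

/* Simplified stand-ins for the real ocfs2 slot-map structures. */
struct ocfs2_slot {
	int sl_valid;             /* slot is in use */
	unsigned int sl_node_num; /* owning node; 0 is a legal node number */
};

struct ocfs2_slot_info {
	int si_num_slots;
	struct ocfs2_slot *si_slots;
};

#define OCFS2_INVALID_SLOT (-1)   /* illustrative value; the real one
				   * also fails the range check below */

/* Sketch of __ocfs2_find_empty_slot() as patched by 912f655d78c5. */
static int sketch_find_empty_slot(struct ocfs2_slot_info *si, int preferred)
{
	int i, ret = -ENOSPC;

	/* nodeB passes preferred == OCFS2_INVALID_SLOT, so the range
	 * check fails and this branch is skipped. */
	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
		if (!si->si_slots[preferred].sl_valid ||
		    !si->si_slots[preferred].sl_node_num) {
			ret = preferred;
			goto out;
		}
	}

	for (i = 0; i < si->si_num_slots; i++) {
		/* BUG: slot 0 has sl_valid == 1 (owned by nodeA), but
		 * sl_node_num == 0 makes the second test true, so an
		 * in-use slot is reported as empty. */
		if (!si->si_slots[i].sl_valid ||
		    !si->si_slots[i].sl_node_num) {
			ret = i;
			break;
		}
	}
out:
	return ret; /* returns 0 for nodeB: it steals nodeA's slot */
}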

How do we handle this bug?

This commit re-enables 912f655d78c5; the next commit (adding MMP support)
will fix the issue.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
 fs/ocfs2/ocfs2.h    |  4 +++-
 fs/ocfs2/slot_map.c | 46 ++++++++++++++++++++++++++-------------------
 fs/ocfs2/super.c    | 21 +++++++++++++++++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

Comments

Mark Fasheh July 31, 2022, 5:42 p.m. UTC | #1
Hi Heming,

On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
<ocfs2-devel@oss.oracle.com> wrote:
>
> the key different between local mount and non-clustered mount:
> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> convert job without ha stack. non-clustered mount feature can run
> totally without ha stack.

Can you please elaborate on this? Local mounts can run without a
cluster stack so I don't see the difference there. We have
tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
users' data - that's a feature, not a bug. So what I'm seeing here is
just opening us to potential corruptions. Is there a specific use case
here that you're trying to account for? Are you fixing a particular
bug?

Thanks,
  --Mark
David Sterba via Ocfs2-devel Aug. 1, 2022, 1:01 a.m. UTC | #2
Hello Mark,

On 8/1/22 01:42, Mark Fasheh wrote:
> Hi Heming,
> 
> On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> <ocfs2-devel@oss.oracle.com> wrote:
>>
>> the key different between local mount and non-clustered mount:
>> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
>> convert job without ha stack. non-clustered mount feature can run
>> totally without ha stack.
> 
> Can you please elaborate on this? Local mounts can run without a
> cluster stack so I don't see the difference there. We have

I am using the pacemaker cluster stack. In my env, the trouble with converting
between local and clustered mounts only happens with a cluster stack.

The non-clustered mount feature (Gang He's commit 912f655d78c5) gave ocfs2 the
ability to mount a volume in any env (with/without a cluster stack).
912f655d78c5 derived from a SUSE customer complaint: the user wanted to fsck a
backup ocfs2 volume in a non-clustered env. They wanted to access the volume
quickly and didn't want to spend the time/resources to set up an HA stack. (By
the way, the pcmk stack needs at least two nodes to set up a cluster.)

> tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> users data - that's a feature, not a bug. So what I'm seeing here is
> just opening us to potential corruptions. Is there a specific use case
> here that you're trying to account for? Are you fixing a particular
> bug?
> 

Tunefs.ocfs2 still needs the HA/dlm stack to protect the joining action.
Commit 912f655d78c5 works in a non-clustered env, which needs another
technique (eg. MMP) to protect against corruption.

From my viewpoint, the non-clustered mount code is based on the local mount
code, but is more flexible than local mount. Non-clustered mount uses a
unified mount style, aligned with clustered mount. I think users will prefer
non-clustered mount over using tunefs.ocfs2 to change the mount type.

Thanks,
Heming
David Sterba via Ocfs2-devel Aug. 1, 2022, 2:25 a.m. UTC | #3
I found I didn't include Joseph in the previous mail.

On 8/1/22 09:01, heming.zhao--- via Ocfs2-devel wrote:
> Hello Mark,
> 
> On 8/1/22 01:42, Mark Fasheh wrote:
>> Hi Heming,
>>
>> On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
>> <ocfs2-devel@oss.oracle.com> wrote:
>>>
>>> the key different between local mount and non-clustered mount:
>>> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
>>> convert job without ha stack. non-clustered mount feature can run
>>> totally without ha stack.
>>
>> Can you please elaborate on this? Local mounts can run without a
>> cluster stack so I don't see the difference there. We have
> 
> I am using pacemaker cluster stack. In my env, the trouble of the converting between
> local and clustered mounts are only happening on cluster stack.
> 
> the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> to mount volume at any env (with/without cluster stack).
> The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> two nodes to set up a cluster.)

Ooh, the "needs two nodes to set up a cluster" above is wrong: the user could
use a special corosync.conf to set up a single-node cluster. But anyhow, the
key question is why we can't bypass the HA stack setup step.

> 
>> tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
>> users data - that's a feature, not a bug. So what I'm seeing here is
>> just opening us to potential corruptions. Is there a specific use case
>> here that you're trying to account for? Are you fixing a particular
>> bug?
>>
> 
> Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.
> 
>  From my viewpoint, the non-clustered mount code is based on local mount code,
> which gives more flexible than local mount. non-clustered mount uses unify mount
> style align with clustered mount. I think users will like more to use non-clustered
> mount than using tunefs.ocfs2 to change mount type.
> 
> Thanks,
> Heming
Mark Fasheh Aug. 4, 2022, 11:53 p.m. UTC | #4
Hi Heming,

On Sun, Jul 31, 2022 at 6:02 PM heming.zhao@suse.com
<heming.zhao@suse.com> wrote:
>
> Hello Mark,
>
> On 8/1/22 01:42, Mark Fasheh wrote:
> > Hi Heming,
> >
> > On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> > <ocfs2-devel@oss.oracle.com> wrote:
> >>
> >> the key different between local mount and non-clustered mount:
> >> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> >> convert job without ha stack. non-clustered mount feature can run
> >> totally without ha stack.
> >
> > Can you please elaborate on this? Local mounts can run without a
> > cluster stack so I don't see the difference there. We have
>
> I am using pacemaker cluster stack. In my env, the trouble of the converting between
> local and clustered mounts are only happening on cluster stack.
>
> the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> to mount volume at any env (with/without cluster stack).
> The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> two nodes to set up a cluster.)

Ok. I had some time to look over the ext4 mmp patches. I feel like
there are two questions here:

1) Is MMP a useful feature for Ocfs2?

My answer is yes, absolutely; the user should have the option to enable
this on local mounts.


2) Should we allow the user to bypass our cluster checks?

On this question I'm still a 'no'. I simply haven't seen enough
evidence to warrant such a drastic change in policy. Allowing it via
mount option too just feels extremely error-prone. I think we need to
explore alternative avenues to helping
the user out here. As you noted in your followup, a single node
config is entirely possible in pacemaker (I've run that config
myself). Why not provide an easy way for the user to drop down to that
sort of a config? I know that's kind
of pushing responsibility for this to the cluster stack, but that's
where it belongs in the first place.

Another option might be an 'observer mode' mount, where the node
participates in the cluster (and the file system locking) but purely
in a read-only fashion.


> > tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> > users data - that's a feature, not a bug. So what I'm seeing here is
> > just opening us to potential corruptions. Is there a specific use case
> > here that you're trying to account for? Are you fixing a particular
> > bug?
> >
>
> Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.

FWIW, I'm glad that we pulled commit 912f655d78c5 and I do not think
we should go back to that paradigm.


>  From my viewpoint, the non-clustered mount code is based on local mount code,
> which gives more flexible than local mount. non-clustered mount uses unify mount
> style align with clustered mount. I think users will like more to use non-clustered
> mount than using tunefs.ocfs2 to change mount type.

Can we get rid of the mount option, and make mmp something that users
can turn on for Ocfs2 local mounts? I don't recall if we make a slot
map for local mounts or not but it wouldn't be difficult to add that.

Btw, thank you very much for all of these patches, and also thank you
for the very detailed test cases in your initial email. I'll try to
give the actual code a review as well.

Thanks,
  --Mark
Mark Fasheh Aug. 5, 2022, 4:11 a.m. UTC | #5
On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark@fasheh.com> wrote:
> 2) Should we allow the user to bypass our cluster checks?
>
> On this question I'm still a 'no'. I simply haven't seen enough
> evidence to warrant such a drastic change in policy. Allowing it via
> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to help
> ing the user out here. As you noted in your followup, a single node
> config is entirely possible in pacemaker (I've run that config
> myself). Why not provide an easy way for the user to drop down to that
> sort of a config? I know that's kind
> of pushing responsibility for this to the cluster stack, but that's
> where it belongs in the first place.
>
> Another option might be an 'observer mode' mount, where the node
> participates in the cluster (and the file system locking) but purely
> in a read-only fashion.

Thinking about this some more... The only way that this works without
potential corruptions is if we always write a periodic mmp sequence,
even in clustered mode (which might mean each node writes to its own
sector). That way tunefs can always check the disk for a mounted node,
even without a cluster stack up. If tunefs sees anyone writing
sequences to the disk, it can safely fail the operation. Tunefs also
would have to be writing an mmp sequence once it has determined that
the disk is not mounted. It could also write some flag alongside the
sequence that says 'tunefs is working on this disk'. If a cluster
mount comes up and sees a live sequence with that flag, it will know
to fail the mount request as the disk is being modified. Local mounts
can also use this to ensure that they are the only mounted node.
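
As an illustration of that check-then-claim flow, here is a minimal
userspace-style sketch (the helper name, block layout, and interval are
assumptions loosely modeled on ext4's MMP, not an existing ocfs2 or
tunefs.ocfs2 API):

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define MMP_UPDATE_INTERVAL_SEC 5	/* assumed live-writer interval */

struct mmp_block {
	uint32_t mmp_seq;	/* bumped periodically by a live mounter */
	uint32_t mmp_flags;	/* e.g. "tunefs is working on this disk" */
};

/* Stand-in for reading the sequence sector(s) from the device. */
extern int read_mmp_block(int fd, struct mmp_block *out);

static bool device_appears_in_use(int fd)
{
	struct mmp_block before, after;

	if (read_mmp_block(fd, &before) < 0)
		return true;	/* be conservative on I/O error */

	/* Wait longer than one update interval: any live writer must
	 * have bumped the sequence by the time we re-read it. */
	sleep(2 * MMP_UPDATE_INTERVAL_SEC + 1);

	if (read_mmp_block(fd, &after) < 0)
		return true;

	return after.mmp_seq != before.mmp_seq;
}

If the probe comes back idle, tunefs would then stamp its own sequence and
flag before modifying anything, which is exactly what a later mount would
check for.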

As it turns out, we already do pretty much all of this sequence writing
for the o2cb cluster stack - check out cluster/heartbeat.c.
If memory serves, tunefs.ocfs2 has code to write to this heartbeat
area as well. For o2cb, we use the disk heartbeat to detect node
liveness, and to kill our local node if we see disk timeouts. For
pcmk, we shouldn't take any of these actions as it is none of our
responsibility. Under pcmk, the heartbeating would be purely for mount
protection checks.

The downside to this is that all nodes would be heartbeating to the
disk on a regular interval, not just one. To be fair, this is exactly
how o2cb works and with the correct timeout choices, we were able to
avoid a measurable performance impact, though in any case this might
have to be a small price the user pays for cluster aware mount
protection.

Let me know what you think.

Thanks,
  --Mark
heming.zhao@suse.com Aug. 6, 2022, 4:15 p.m. UTC | #8
Hello Mark and Joseph,

(this resends my previous reply mail)
I may have hit a thunderbird bug: I clearly remember adding you (Joseph) to
CC for this mail & the previous mail, but I only see Mark and ocfs2-devel in
the receiver list, and the mail format was also a mess - thunderbird replaced
'\n' with '\r\n'.

To fix this problem, I switched to neomutt to resend these two mails.

On Thu, Aug 04, 2022 at 04:53:12PM -0700, Mark Fasheh wrote:
> Hi Heming,
> 
> On Sun, Jul 31, 2022 at 6:02 PM heming.zhao@suse.com
> <heming.zhao@suse.com> wrote:
> >
> > Hello Mark,
> >
> > On 8/1/22 01:42, Mark Fasheh wrote:
> > > Hi Heming,
> > >
> > > On Fri, Jul 29, 2022 at 6:15 PM Heming Zhao via Ocfs2-devel
> > > <ocfs2-devel@oss.oracle.com> wrote:
> > >>
> > >> the key different between local mount and non-clustered mount:
> > >> local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do
> > >> convert job without ha stack. non-clustered mount feature can run
> > >> totally without ha stack.
> > >
> > > Can you please elaborate on this? Local mounts can run without a
> > > cluster stack so I don't see the difference there. We have
> >
> > I am using pacemaker cluster stack. In my env, the trouble of the converting between
> > local and clustered mounts are only happening on cluster stack.
> >
> > the non-clustered mount feature (Gang He commit: 912f655d78c5) gave ocfs2 the ability
> > to mount volume at any env (with/without cluster stack).
> > The 912f655d78c5 derived from SUSE customer complain: User wanted to fsck the backup
> > ocfs2 volume in non-clustered env. They wanted to access the volume quickly and didn't
> > want to take time/resource to set up HA stack. (by the way, pcmk stack at least needs
> > two nodes to set up a cluster.)
> 
> Ok. I had some time to look over the ext4 mmp patches. I feel like
> there are two questions here:
> 
> 1) Is MMP a useful feature for Ocfs2
> 
> My answer is yes absolutely, the user should have the option to enable
> this on local mounts.
> 

me too.

> 
> 2) Should we allow the user to bypass our cluster checks?
> 
> On this question I'm still a 'no'. I simply haven't seen enough
> evidence to warrant such a drastic change in policy. Allowing it via
> mount option too just feels extremely error-prone. I think we need to
> explore alternative avenues to help
> ing the user out here. As you noted in your followup, a single node
> config is entirely possible in pacemaker (I've run that config
> myself). Why not provide an easy way for the user to drop down to that
> sort of a config? I know that's kind
> of pushing responsibility for this to the cluster stack, but that's
> where it belongs in the first place.
> 

The reason for creating commit 912f655d78c5d4a is that the user didn't want
to do any setup job for the cluster stack. So any HA solution (eg:
automatically configuring a single node, auto-installing the required
software, creating a VM with an already set up HA stack, etc.) does not
remove the pain point. Once cluster setup is bypassed, only non-clustered
mount is left. And from the end user's viewpoint, non-clustered mount is
also the easiest way to mount an ocfs2 volume.

> Another option might be an 'observer mode' mount, where the node
> participates in the cluster (and the file system locking) but purely
> in a read-only fashion.
> 

To my mind, the 'observer mode' is just a read-only mount; we don't need to
create it.
For 912f655d78c5d4a, the user takes a snapshot of the ocfs2 volume and does a
health check in a different env (not the production env); the two may belong
to two different networks.

> 
> > > tunefs.ocfs2 look for and join the cluster so as to avoid corrupting
> > > users data - that's a feature, not a bug. So what I'm seeing here is
> > > just opening us to potential corruptions. Is there a specific use case
> > > here that you're trying to account for? Are you fixing a particular
> > > bug?
> > >
> >
> > Tunefs.ocfs2 still needs HA/dlm stack to protect joining action. commit 912f655d78c5
> > works on non-clustered env, which needs other tech (eg. MMP) to protect corrupting.
> 
> FWIW, I'm glad that we pulled commit 912f655d78c5 and I do not think
> we should go back to that paradigm.

If you or other maintainers prefer to pull out non-clustered mount, I respect
the decision.

> 
> 
> >  From my viewpoint, the non-clustered mount code is based on local mount code,
> > which gives more flexible than local mount. non-clustered mount uses unify mount
> > style align with clustered mount. I think users will like more to use non-clustered
> > mount than using tunefs.ocfs2 to change mount type.
> 
> Can we get rid of the mount option, and make mmp something that users
> can turn on for Ocfs2 local mounts? I don't recall if we make a slot
> map for local mounts or not but it wouldn't be difficult to add that.
> 

From the tech side, it's possible to enable mmp in local mount.

Without 912f655d78c5d4a, local mount uses slot 0 (the first available slot).

> Btw, thank you very much for all of these patches, and also thank you
> for the very detailed test cases in your initial email. I'll try to
> give the actual code a review as well.
> 
> Thanks,
>   --Mark

Thanks,
Heming
heming.zhao@suse.com Aug. 6, 2022, 4:20 p.m. UTC | #9
Hello Mark and Joseph,

this is a clean resend of my previous mess-formatted reply mail

On Thu, Aug 04, 2022 at 09:11:53PM -0700, Mark Fasheh wrote:
> On Thu, Aug 4, 2022 at 4:53 PM Mark Fasheh <mark@fasheh.com> wrote:
> > 2) Should we allow the user to bypass our cluster checks?
> >
> > On this question I'm still a 'no'. I simply haven't seen enough
> > evidence to warrant such a drastic change in policy. Allowing it via
> > mount option too just feels extremely error-prone. I think we need to
> > explore alternative avenues to help
> > ing the user out here. As you noted in your followup, a single node
> > config is entirely possible in pacemaker (I've run that config
> > myself). Why not provide an easy way for the user to drop down to that
> > sort of a config? I know that's kind
> > of pushing responsibility for this to the cluster stack, but that's
> > where it belongs in the first place.
> >
> > Another option might be an 'observer mode' mount, where the node
> > participates in the cluster (and the file system locking) but purely
> > in a read-only fashion.
> 
> Thinking about this some more... The only way that this works without
> potential corruptions is if we always write a periodic mmp sequence,
> even in clustered mode (which might mean each node writes to its own
> sector). That way tunefs can always check the disk for a mounted node,
> even without a cluster stack up. If tunefs sees anyone writing
> sequences to the disk, it can safely fail the operation. Tunefs also
> would have to be writing an mmp sequence once it has determined that
> the disk is not mounted. It could also write some flag alongisde the
> sequence that says 'tunefs is working on this disk'. If a cluster
> mount comes up and sees a live sequence with that flag, it will know
> to fail the mount request as the disk is being modified. Local mounts
> can also use this to ensure that they are the only mounted node.
> 

The tunefs check & write-seq steps above are the mmp feature workflow.

The tunefs workflow you mention matches the idea of my patch [4/4], and that
patch does all the work in the kernel's ocfs2_find_slot().

Because the sequences are saved in the SB, updating them requires grabbing
->osb_lock, which affects performance.

And when storing the seqs, to save space, we won't allocate a different disk
block for each node. If multiple nodes share the same disk block (eg, by
keeping the seqs in the slot_map), the update job will cause an IO
performance issue.

To fix the performance issue, in my view, we should disable mmp sequence
updating when the mount mode is clustered. So I make a rule:
** If the last mount didn't unmount (eg: crash), the next mount MUST be the same mount type. **

Another way to state the rule:
a newly arriving node's mount type must be the same as the existing mount type.

And one piece of basic knowledge:
the current ocfs2 code under a cluster stack already has the ability to
prevent multiple mounts when the mount type is clustered.

(From patch 4/4) these are the mount labels:
+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT  0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+                                              under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+                                              under MMP disabled */

Whenever a mount succeeds, there should be three types of living labels
in the slotmap area:
- OCFS2_MMP_SEQ_CLEAN, OCFS2_VALID_NOCLUSTER - for local/non-clustered mount
- OCFS2_VALID_CLUSTER - for clustered mount

A newly arriving node will check whether any slot contains living labels.

Whenever an unmount succeeds, there should be two types of labels left in the
slotmap area:
- OCFS2_MMP_SEQ_CLEAN or 0 (zero)

When a node unmounts, it will, according to the mount type, either clear
(zero) the slot or write OCFS2_MMP_SEQ_CLEAN into it.
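
As a rough sketch of that mount-time scan (the label constants are from
patch 4/4 above; the traversal, field layout, and the exact liveness test
are my illustrative assumptions, not the patch's code):

#define OCFS2_MMP_SEQ_CLEAN   0xFF4D4D50U /* left behind by clean unmount */
#define OCFS2_MMP_SEQ_MAX     0xE24D4D4FU /* highest valid running seq */
#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* clustered mount, MMP disabled */
#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* non-clustered, MMP disabled */

/* Assumed interpretation: 0 and OCFS2_MMP_SEQ_CLEAN mean "no live
 * mounter"; a seq in the valid running range or a cluster/nocluster
 * label means some node is (or may be) mounted. */
static int slot_label_alive(unsigned int seq)
{
	return (seq != 0 && seq <= OCFS2_MMP_SEQ_MAX) ||
	       seq == OCFS2_VALID_CLUSTER ||
	       seq == OCFS2_VALID_NOCLUSTER;
}

/*
 * Enforce the "next mount MUST be the same mount type" rule: a
 * clustered mount may coexist only with other clustered mounts; a
 * local/non-clustered mount must find no living label at all.
 */
static int check_living_labels(const unsigned int *slot_seq, int nslots,
			       int mounting_clustered)
{
	int i;

	for (i = 0; i < nslots; i++) {
		if (!slot_label_alive(slot_seq[i]))
			continue;
		if (!mounting_clustered)
			return -1;	/* someone may still be mounted */
		if (slot_seq[i] != OCFS2_VALID_CLUSTER)
			return -1;	/* mixed mount types: fail mount */
	}
	return 0;	/* safe to proceed with this mount type */
}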

> As it turns out, we already do pretty much all of the sequence writing
> already for the o2cb cluster stack - check out cluseter/heartbeat.c.
> If memory serves, tunefs.ocfs2 has code to write to this heartbeat
> area as well. For o2cb, we use the disk heartbeat to detect node
> liveness, and to kill our local node if we see disk timeouts. For
> pcmk, we shouldn't take any of these actions as it is none of our
> responsibility. Under pcmk, the heartbeating would be purely for mount
> protection checks.

To my thinking, under a cluster stack there is already enough protection
logic.

For local/non-clustered mount, mmp gives the ability to detect node liveness.
So I only enable the kmmpd-<dev> kthread for local/non-clustered mount.

> 
> The downside to this is that all nodes would be heartbeating to the
> disk on a regular interval, not just one. To be fair, this is exactly
> how o2cb works and with the correct timeout choices, we were able to
> avoid a measurable performance impact, though in any case this might
> have to be a small price the user pays for cluster aware mount
> protection.
> 

I already considered this performance issue in patch [4/4].

> Let me know what you think.
> 
> Thanks,
>   --Mark

Thanks,
Heming

Patch

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 740b64238312..337527571461 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -277,6 +277,7 @@  enum ocfs2_mount_options
 	OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15,  /* Journal Async Commit */
 	OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
 	OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
+	OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
 };
 
 #define OCFS2_OSB_SOFT_RO	0x0001
@@ -672,7 +673,8 @@  static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb)
 
 static inline int ocfs2_mount_local(struct ocfs2_super *osb)
 {
-	return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
+	return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
+		|| (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
 }
 
 static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index da7718cef735..0b0ae3ebb0cf 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -252,14 +252,16 @@  static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid) {
+		if (!si->si_slots[preferred].sl_valid ||
+		    !si->si_slots[preferred].sl_node_num) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid) {
+		if (!si->si_slots[i].sl_valid ||
+		    !si->si_slots[i].sl_node_num) {
 			ret = i;
 			break;
 		}
@@ -454,24 +456,30 @@  int ocfs2_find_slot(struct ocfs2_super *osb)
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
 
-	/* search for ourselves first and take the slot if it already
-	 * exists. Perhaps we need to mark this in a variable for our
-	 * own journal recovery? Possibly not, though we certainly
-	 * need to warn to the user */
-	slot = __ocfs2_node_num_to_slot(si, osb->node_num);
-	if (slot < 0) {
-		/* if no slot yet, then just take 1st available
-		 * one. */
-		slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+	if (ocfs2_mount_local(osb))
+		/* use slot 0 directly in local mode */
+		slot = 0;
+	else {
+		/* search for ourselves first and take the slot if it already
+		 * exists. Perhaps we need to mark this in a variable for our
+		 * own journal recovery? Possibly not, though we certainly
+		 * need to warn to the user */
+		slot = __ocfs2_node_num_to_slot(si, osb->node_num);
 		if (slot < 0) {
-			spin_unlock(&osb->osb_lock);
-			mlog(ML_ERROR, "no free slots available!\n");
-			status = -EINVAL;
-			goto bail;
-		}
-	} else
-		printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
-		       "allocated to this node!\n", slot, osb->dev_str);
+			/* if no slot yet, then just take 1st available
+			 * one. */
+			slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+			if (slot < 0) {
+				spin_unlock(&osb->osb_lock);
+				mlog(ML_ERROR, "no free slots available!\n");
+				status = -EINVAL;
+				goto bail;
+			}
+		} else
+			printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
+			       "already allocated to this node!\n",
+			       slot, osb->dev_str);
+	}
 
 	ocfs2_set_slot(si, slot, osb->node_num);
 	osb->slot_num = slot;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 438be028935d..f7298816d8d9 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -172,6 +172,7 @@  enum {
 	Opt_dir_resv_level,
 	Opt_journal_async_commit,
 	Opt_err_cont,
+	Opt_nocluster,
 	Opt_err,
 };
 
@@ -205,6 +206,7 @@  static const match_table_t tokens = {
 	{Opt_dir_resv_level, "dir_resv_level=%u"},
 	{Opt_journal_async_commit, "journal_async_commit"},
 	{Opt_err_cont, "errors=continue"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL}
 };
 
@@ -616,6 +618,13 @@  static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		goto out;
 	}
 
+	tmp = OCFS2_MOUNT_NOCLUSTER;
+	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+		ret = -EINVAL;
+		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
+		goto out;
+	}
+
 	tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
 		OCFS2_MOUNT_HB_NONE;
 	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
@@ -856,6 +865,7 @@  static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb,
 	}
 
 	if (ocfs2_userspace_stack(osb) &&
+	    !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
 	    strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
 		    OCFS2_STACK_LABEL_LEN)) {
 		mlog(ML_ERROR,
@@ -1127,6 +1137,11 @@  static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 	       osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
 	       "ordered");
 
+	if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
+	   !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
+		printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
+		       "without cluster aware mode.\n", osb->dev_str);
+
 	atomic_set(&osb->vol_state, VOLUME_MOUNTED);
 	wake_up(&osb->osb_mount_event);
 
@@ -1437,6 +1452,9 @@  static int ocfs2_parse_options(struct super_block *sb,
 		case Opt_journal_async_commit:
 			mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
 			break;
+		case Opt_nocluster:
+			mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
+			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1548,6 +1566,9 @@  static int ocfs2_show_options(struct seq_file *s, struct dentry *root)
 	if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
 		seq_printf(s, ",journal_async_commit");
 
+	if (opts & OCFS2_MOUNT_NOCLUSTER)
+		seq_printf(s, ",nocluster");
+
 	return 0;
 }