diff mbox series

[02/23] t7527: test FS event reporing on macOS WRT case and Unicode

Message ID ad8cf6d9a47b61d9fe41a961466122be16e4f041.1644940773.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Builtin FSMonitor Part 3 | expand

Commit Message

Jeff Hostetler Feb. 15, 2022, 3:59 p.m. UTC
From: Jeff Hostetler <jeffhost@microsoft.com>

Confirm that macOS FS events are reported with a normalized spelling.

APFS (and/or HFS+) is case-insensitive.  This means that case-independent
lookups ( [ -d .git ] and [ -d .GIT ] ) should both succeed.  But that
doesn't tell us how FS events are reported if we try "rm -rf .git" versus
"rm -rf .GIT".  Are the events reported using the on-disk spelling of the
pathname or in the spelling used by the command.

NEEDSWORK: I was only able to test case.  It would be nice to add tests
that use different Unicode spellings/normalizations and understand the
differences between APFS and HFS+ in this area.  We should confirm that
the spelling of the workdir paths that the daemon sends to clients are
always properly normalized.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/t7527-builtin-fsmonitor.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

Comments

Derrick Stolee Feb. 24, 2022, 2:52 p.m. UTC | #1
On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Confirm that macOS FS events are reported with a normalized spelling.
> 
> APFS (and/or HFS+) is case-insensitive.  This means that case-independent
> lookups ( [ -d .git ] and [ -d .GIT ] ) should both succeed.  But that
> doesn't tell us how FS events are reported if we try "rm -rf .git" versus
> "rm -rf .GIT".  Are the events reported using the on-disk spelling of the
> pathname or in the spelling used by the command.

Was this last sentence supposed to be a question?
 
> NEEDSWORK: I was only able to test case.  It would be nice to add tests

"I was only able test the APFS case."?

> that use different Unicode spellings/normalizations and understand the
> differences between APFS and HFS+ in this area.  We should confirm that
> the spelling of the workdir paths that the daemon sends to clients are
> always properly normalized.

Are there any macOS experts out there who can help us find the answers
to these questions?

> +# Confirm that MacOS hides all of the Unicode normalization and/or
> +# case folding from the FS events.  That is, are the pathnames in the
> +# FS events reported using the spelling on the disk or in the spelling
> +# used by the other process.
> +#
> +# Note that we assume that the filesystem is set to case insensitive.
> +#
> +# NEEDSWORK: APFS handles Unicode and Unicode normalization
> +# differently than HFS+.  I only have an APFS partition, so
> +# more testing here would be helpful.
> +#
> +
> +# Rename .git using alternate spelling and confirm that the daemon
> +# sees the event using the correct spelling and shutdown.
> +test_expect_success UTF8_NFD_TO_NFC 'MacOS event spelling (rename .GIT)' '
> +	test_when_finished "stop_daemon_delete_repo test_apfs" &&
> +
> +	git init test_apfs &&
> +	start_daemon test_apfs &&
> +
> +	test_path_is_dir test_apfs/.git &&
> +	test_path_is_dir test_apfs/.GIT &&
> +
> +	mv test_apfs/.GIT test_apfs/.FOO &&
> +	sleep 1 &&

This sleep is unfortunate. Do we really need it? Or does this test
become flaky without it?

> +	mv test_apfs/.FOO test_apfs/.git &&
> +
> +	test_must_fail git -C test_apfs fsmonitor--daemon status
> +'
> +

This test is helpful in that it will help us discover if HFS+ or
any future filesystem would break these assumptions.

Thanks,
-Stolee
Torsten Bögershausen Feb. 24, 2022, 5:33 p.m. UTC | #2
On Thu, Feb 24, 2022 at 09:52:28AM -0500, Derrick Stolee wrote:
> On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> >
> > Confirm that macOS FS events are reported with a normalized spelling.
> >
> > APFS (and/or HFS+) is case-insensitive.  This means that case-independent

This is not true, strictly speaking.
You can format a disk with "case sensitive" or "case-insenstive, case preserving".
Both APFS and HFS+  can be formated that way.
The default, which is what you get when you get a new machine,
is "case-insenstive, case preserving".
And I assume, that more 99% of all disks are formated that way.
The "core.ignorecase" is used in the same way as it is used under NTFS,
FAT or all other case-insenstive file systems.
(and even ext4 can be formated case-insensitive these days.)

An interesting article can be found here:
https://lwn.net/Articles/784041/

And to be technically correct, I think that even NTFS can be
"configured to be case insensitive in an empty directory".

In that sense, I would like to avoid this statement, which
file system is case insensitive, and which is not.
Git assumes that after probing the FS in "git init" we have
a valid configuration in core.ignorecase.

> > lookups ( [ -d .git ] and [ -d .GIT ] ) should both succeed.  But that
> > doesn't tell us how FS events are reported if we try "rm -rf .git" versus
> > "rm -rf .GIT".  Are the events reported using the on-disk spelling of the
> > pathname or in the spelling used by the command.
>
> Was this last sentence supposed to be a question?
>
> > NEEDSWORK: I was only able to test case.  It would be nice to add tests
>
> "I was only able test the APFS case."?
>
> > that use different Unicode spellings/normalizations and understand the
> > differences between APFS and HFS+ in this area.  We should confirm that
> > the spelling of the workdir paths that the daemon sends to clients are
> > always properly normalized.
>
> Are there any macOS experts out there who can help us find the answers
> to these questions?

There is a difference between HFS+ and APFS.
HFS+  is "unicode decomposing" when you call readdir() - file names
are stored decomposed on disk once created.
However, opening  file in precompsed form succeds.
In that sense I would strongly suspect, that any monitors are "sending"
the decomposed form (on HFS+).

APFS does not manipulate file names in that way, it is
"unicode normalization preserving and ignoring".


>
> > +# Confirm that MacOS hides all of the Unicode normalization and/or
> > +# case folding from the FS events.  That is, are the pathnames in the
> > +# FS events reported using the spelling on the disk or in the spelling
> > +# used by the other process.
> > +#
> > +# Note that we assume that the filesystem is set to case insensitive.
> > +#
> > +# NEEDSWORK: APFS handles Unicode and Unicode normalization
> > +# differently than HFS+.  I only have an APFS partition, so
> > +# more testing here would be helpful.

I have an older system and may be able to help, (but am busy this week)
> > +#
> > +
> > +# Rename .git using alternate spelling and confirm that the daemon
> > +# sees the event using the correct spelling and shutdown.
> > +test_expect_success UTF8_NFD_TO_NFC 'MacOS event spelling (rename .GIT)' '

Isn't this precondition is the wrong one and CASE_INSENSITIVE_FS is a better one ?

[snip]

I diddn't follow you series, but if there is a need to test under MacOS with HFS+,
feel free to cc me in the next round
Jeff Hostetler March 4, 2022, 11:40 p.m. UTC | #3
On 2/24/22 12:33 PM, Torsten =?unknown-8bit?Q?B=C3=B6gershausen?= wrote:
> On Thu, Feb 24, 2022 at 09:52:28AM -0500, Derrick Stolee wrote:
>> On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> Confirm that macOS FS events are reported with a normalized spelling.
>>>
>>> APFS (and/or HFS+) is case-insensitive.  This means that case-independent
> 
> This is not true, strictly speaking.
> You can format a disk with "case sensitive" or "case-insenstive, case preserving".
> Both APFS and HFS+  can be formated that way.
> The default, which is what you get when you get a new machine,
> is "case-insenstive, case preserving".
> And I assume, that more 99% of all disks are formated that way.
> The "core.ignorecase" is used in the same way as it is used under NTFS,
> FAT or all other case-insenstive file systems.
> (and even ext4 can be formated case-insensitive these days.)
> 
> An interesting article can be found here:
> https://lwn.net/Articles/784041/
> 
> And to be technically correct, I think that even NTFS can be
> "configured to be case insensitive in an empty directory".
> 
> In that sense, I would like to avoid this statement, which
> file system is case insensitive, and which is not.
> Git assumes that after probing the FS in "git init" we have
> a valid configuration in core.ignorecase.

You're right. I was incorrectly glossing over the differences
between APFS and HFS+ -- and conflating case and nfc/nfd
issues.

[...]
>>
>>> NEEDSWORK: I was only able to test case.  It would be nice to add tests
>>
>> "I was only able test the APFS case."?

I'm going to completely redo this commit in the next version.
I now have both APFS and HFS+ partitions on my machine and
can compare the differences in behaviors and will have a new
set of tests to cover this.

>>
>>> that use different Unicode spellings/normalizations and understand the
>>> differences between APFS and HFS+ in this area.  We should confirm that
>>> the spelling of the workdir paths that the daemon sends to clients are
>>> always properly normalized.
>>
>> Are there any macOS experts out there who can help us find the answers
>> to these questions?
> 
> There is a difference between HFS+ and APFS.
> HFS+  is "unicode decomposing" when you call readdir() - file names
> are stored decomposed on disk once created.
> However, opening  file in precompsed form succeds.
> In that sense I would strongly suspect, that any monitors are "sending"
> the decomposed form (on HFS+).
> 
> APFS does not manipulate file names in that way, it is
> "unicode normalization preserving and ignoring".

It took a few hours of poking to figure out what Apple is doing,
but yes on HFS+ they convert to NFD and use that as the on-disk
format.  And they do collision detection as they always have in
NFD-space.

Whereas on APFS, they preserve the NFC/NFD as given when the file
is created, but always do the same collision detection in NFD-space.
The net result is similar, but subtlety different.

FS Events from MacOS are sent using the on-disk format (NFD on HFS+
and whichever on APFS) and my FSMonitor daemon is sending them to
the client as received.

I'm not sure whether or not the daemon should respect the
`core.precompseUnicode` setting and when watching an HFS+
volume do the NFD-->NFC conversion for the client.  I'm not
sure whether that would be any more or less correct than just
reporting the paths as received.  I'm going to leave this as a
question for the future.


Thanks for all of your background information on this topic.
Jeff
Jeff Hostetler March 4, 2022, 11:47 p.m. UTC | #4
On 2/24/22 12:33 PM, Torsten =?unknown-8bit?Q?B=C3=B6gershausen?= wrote:
> On Thu, Feb 24, 2022 at 09:52:28AM -0500, Derrick Stolee wrote:
>> On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> Confirm that macOS FS events are reported with a normalized spelling.
>>>
>>> APFS (and/or HFS+) is case-insensitive.  This means that case-independent
> 
[...]
> 
> An interesting article can be found here:
> https://lwn.net/Articles/784041/
> 
> And to be technically correct, I think that even NTFS can be
> "configured to be case insensitive in an empty directory".

Yes, it is now possible to have NTFS be case sensitive (on a
directory by directory basis).  I haven't had a chance to
experiment with this yet, but I'm hoping that if we can always
have the daemon report using the on-disk spelling, we can
avoid most of this insanity.

Thanks,
Jeff
Torsten Bögershausen March 5, 2022, 8:59 a.m. UTC | #5
On Fri, Mar 04, 2022 at 06:40:27PM -0500, Jeff Hostetler wrote:
>
>
> On 2/24/22 12:33 PM, Torsten =?unknown-8bit?Q?B=C3=B6gershausen?= wrote:
> > On Thu, Feb 24, 2022 at 09:52:28AM -0500, Derrick Stolee wrote:
> > > On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
> > > > From: Jeff Hostetler <jeffhost@microsoft.com>
> > > >
> > > > Confirm that macOS FS events are reported with a normalized spelling.
> > > >
> > > > APFS (and/or HFS+) is case-insensitive.  This means that case-independent
> >
> > This is not true, strictly speaking.
> > You can format a disk with "case sensitive" or "case-insenstive, case preserving".
> > Both APFS and HFS+  can be formated that way.
> > The default, which is what you get when you get a new machine,
> > is "case-insenstive, case preserving".
> > And I assume, that more 99% of all disks are formated that way.
> > The "core.ignorecase" is used in the same way as it is used under NTFS,
> > FAT or all other case-insenstive file systems.
> > (and even ext4 can be formated case-insensitive these days.)
> >
> > An interesting article can be found here:
> > https://lwn.net/Articles/784041/
> >
> > And to be technically correct, I think that even NTFS can be
> > "configured to be case insensitive in an empty directory".
> >
> > In that sense, I would like to avoid this statement, which
> > file system is case insensitive, and which is not.
> > Git assumes that after probing the FS in "git init" we have
> > a valid configuration in core.ignorecase.
>
> You're right. I was incorrectly glossing over the differences
> between APFS and HFS+ -- and conflating case and nfc/nfd
> issues.
>
> [...]
> > >
> > > > NEEDSWORK: I was only able to test case.  It would be nice to add tests
> > >
> > > "I was only able test the APFS case."?
>
> I'm going to completely redo this commit in the next version.
> I now have both APFS and HFS+ partitions on my machine and
> can compare the differences in behaviors and will have a new
> set of tests to cover this.
>
> > >
> > > > that use different Unicode spellings/normalizations and understand the
> > > > differences between APFS and HFS+ in this area.  We should confirm that
> > > > the spelling of the workdir paths that the daemon sends to clients are
> > > > always properly normalized.
> > >
> > > Are there any macOS experts out there who can help us find the answers
> > > to these questions?
> >
> > There is a difference between HFS+ and APFS.
> > HFS+  is "unicode decomposing" when you call readdir() - file names
> > are stored decomposed on disk once created.
> > However, opening  file in precompsed form succeds.
> > In that sense I would strongly suspect, that any monitors are "sending"
> > the decomposed form (on HFS+).
> >
> > APFS does not manipulate file names in that way, it is
> > "unicode normalization preserving and ignoring".
>
> It took a few hours of poking to figure out what Apple is doing,
> but yes on HFS+ they convert to NFD and use that as the on-disk
> format.  And they do collision detection as they always have in
> NFD-space.
>
> Whereas on APFS, they preserve the NFC/NFD as given when the file
> is created, but always do the same collision detection in NFD-space.
> The net result is similar, but subtlety different.

That depends what you mean with "net result".
What Git sees after calling precompose_utf8_readdir() with
core.precomposeunicode=true ?


>
> FS Events from MacOS are sent using the on-disk format (NFD on HFS+
> and whichever on APFS) and my FSMonitor daemon is sending them to
> the client as received.
>
> I'm not sure whether or not the daemon should respect the
> `core.precompseUnicode` setting and when watching an HFS+
> volume do the NFD-->NFC conversion for the client.  I'm not
> sure whether that would be any more or less correct than just
> reporting the paths as received.  I'm going to leave this as a
> question for the future.

I think that I have a suggestion for an answer:
We still have HFS+ systems around, and we still have an NFD feature
in MacOs for USB sticks with FAT or SAMBA mounted network volumes.
Both return NFD in readdir().
Even if NFC is on disk for FAT or going over the wire for SAMBA.
Having a precompose() function in the FSMonitor would help to make
things consistent.
And the answer is yes.

>
>
> Thanks for all of your background information on this topic.
> Jeff
>

The pleasure is on my side.
Please feel free to cc me for the next round, so that I don't miss
to review them.
Jeff Hostetler March 7, 2022, 8:45 p.m. UTC | #6
On 3/5/22 3:59 AM, Torsten =?unknown-8bit?Q?B=C3=B6gershausen?= wrote:
> On Fri, Mar 04, 2022 at 06:40:27PM -0500, Jeff Hostetler wrote:
>>
>>
>> On 2/24/22 12:33 PM, Torsten =?unknown-8bit?Q?B=C3=B6gershausen?= wrote:
>>> On Thu, Feb 24, 2022 at 09:52:28AM -0500, Derrick Stolee wrote:
>>>> On 2/15/2022 10:59 AM, Jeff Hostetler via GitGitGadget wrote:
>>>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>>>
>>>>> Confirm that macOS FS events are reported with a normalized spelling.
[...]
>>
>>>>
>>>>> that use different Unicode spellings/normalizations and understand the
>>>>> differences between APFS and HFS+ in this area.  We should confirm that
>>>>> the spelling of the workdir paths that the daemon sends to clients are
>>>>> always properly normalized.
>>>>
>>>> Are there any macOS experts out there who can help us find the answers
>>>> to these questions?
>>>
>>> There is a difference between HFS+ and APFS.
>>> HFS+  is "unicode decomposing" when you call readdir() - file names
>>> are stored decomposed on disk once created.
>>> However, opening  file in precompsed form succeds.
>>> In that sense I would strongly suspect, that any monitors are "sending"
>>> the decomposed form (on HFS+).
>>>
>>> APFS does not manipulate file names in that way, it is
>>> "unicode normalization preserving and ignoring".
>>
>> It took a few hours of poking to figure out what Apple is doing,
>> but yes on HFS+ they convert to NFD and use that as the on-disk
>> format.  And they do collision detection as they always have in
>> NFD-space.
>>
>> Whereas on APFS, they preserve the NFC/NFD as given when the file
>> is created, but always do the same collision detection in NFD-space.
>> The net result is similar, but subtlety different.
> 
> That depends what you mean with "net result".
> What Git sees after calling precompose_utf8_readdir() with
> core.precomposeunicode=true ?

I just meant that the OS always does the collision detection
regardless of whether the filesystem is APFS or HFS. (They even
do it on a FAT32 thumb drive).  So the OS layer is always
composition insensitive while the underlying filesystem may
or may not be composition preserving.


>>
>> FS Events from MacOS are sent using the on-disk format (NFD on HFS+
>> and whichever on APFS) and my FSMonitor daemon is sending them to
>> the client as received.
>>
>> I'm not sure whether or not the daemon should respect the
>> `core.precompseUnicode` setting and when watching an HFS+
>> volume do the NFD-->NFC conversion for the client.  I'm not
>> sure whether that would be any more or less correct than just
>> reporting the paths as received.  I'm going to leave this as a
>> question for the future.
> 
> I think that I have a suggestion for an answer:
> We still have HFS+ systems around, and we still have an NFD feature
> in MacOs for USB sticks with FAT or SAMBA mounted network volumes.
> Both return NFD in readdir().
> Even if NFC is on disk for FAT or going over the wire for SAMBA.
> Having a precompose() function in the FSMonitor would help to make
> things consistent.
> And the answer is yes.

I agree.  I'm going to have the mac version always send the NFC
version (like core.precomposeUnicode suggests), but also send NFD
for *if* we get an NFD spelling from the FS event.  So on an HFS
volume we'll get two events for each path that have different
UTF8 spellings.  In the case of APFS we should only get the NFC
spelling -- unless the user really wants to use NFD spellings --
in which case we will respect that and send both.

This maybe be overkill, but the daemon (at the time that it
receives the FS event) doesn't know how the client process will
be configured when it makes a query at some point in the future.
So by recording both and reporting them to the client, the client
can decide how to process them.

Whichever spelling the client sees, it is only using it to mark
the cache-entry or untracked-cache data as possibly dirty so that
the original scanning code can actually check it.  This is a little
different from the diff-tree code that needs the official spelling
from the tree entry to match up items with the working directory.

I've overhauled all of the MacOS Unicode handling in V2.

Jeff
diff mbox series

Patch

diff --git a/t/t7527-builtin-fsmonitor.sh b/t/t7527-builtin-fsmonitor.sh
index dbca7f835eb..1fdabfc4f1e 100755
--- a/t/t7527-builtin-fsmonitor.sh
+++ b/t/t7527-builtin-fsmonitor.sh
@@ -188,6 +188,36 @@  test_expect_success MINGW,SHORTNAMES 'implicit daemon stop (rename GIT~2)' '
 	test_must_fail git -C test_implicit_1s2 fsmonitor--daemon status
 '
 
+# Confirm that MacOS hides all of the Unicode normalization and/or
+# case folding from the FS events.  That is, are the pathnames in the
+# FS events reported using the spelling on the disk or in the spelling
+# used by the other process.
+#
+# Note that we assume that the filesystem is set to case insensitive.
+#
+# NEEDSWORK: APFS handles Unicode and Unicode normalization
+# differently than HFS+.  I only have an APFS partition, so
+# more testing here would be helpful.
+#
+
+# Rename .git using alternate spelling and confirm that the daemon
+# sees the event using the correct spelling and shutdown.
+test_expect_success UTF8_NFD_TO_NFC 'MacOS event spelling (rename .GIT)' '
+	test_when_finished "stop_daemon_delete_repo test_apfs" &&
+
+	git init test_apfs &&
+	start_daemon test_apfs &&
+
+	test_path_is_dir test_apfs/.git &&
+	test_path_is_dir test_apfs/.GIT &&
+
+	mv test_apfs/.GIT test_apfs/.FOO &&
+	sleep 1 &&
+	mv test_apfs/.FOO test_apfs/.git &&
+
+	test_must_fail git -C test_apfs fsmonitor--daemon status
+'
+
 test_expect_success 'cannot start multiple daemons' '
 	test_when_finished "stop_daemon_delete_repo test_multiple" &&