Message ID | 20250113223033.4054534-1-yang@os.amperecomputing.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | /dev/zero: make private mapping full anonymous mapping | expand |
+ Willy for the fs/weirdness elements of this. On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > When creating private mapping for /dev/zero, the driver makes it an > anonymous mapping by calling set_vma_anonymous(). But it just sets > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. Hm yikes. > > This is a special case and the VMA doesn't look like either anonymous VMA > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > It seems pointless to keep such special case. Making private /dev/zero > mapping a full anonymous mapping doesn't change the semantic of > /dev/zero either. My concern is that ostensibly there _is_ a file right? Are we certain that by not setting this we are not breaking something somewhere else? Are we not creating a sort of other type of 'non-such-beast' here? I mean already setting it anon and setting vm_file non-NULL is really strange. > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > and /proc/<PID>/maps. > > Before the change: > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > After the change: > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > Yeah this seems like it might break somebody to be honest, it's really really really strange to map a file then for it not to be mapped. But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a file but for it to be marked anonymous. God what a mess. > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ I kind of hate that we have to mitigate like this for a case that should never ever happen so I'm inclined towards your solution but a lot more inclined towards us totally rethinking this. Do we _have_ to make this anonymous?? Why can't we just reference the zero page as if it were in the page cache (Willy - feel free to correct naive misapprehension here). > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > --- > drivers/char/mem.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > index 169eed162a7f..dae113f7fc1b 100644 > --- a/drivers/char/mem.c > +++ b/drivers/char/mem.c > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > if (vma->vm_flags & VM_SHARED) > return shmem_zero_setup(vma); > vma_set_anonymous(vma); > + fput(vma->vm_file); > + vma->vm_file = NULL; > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; Hmm, this might have been mremap()'d _potentially_ though? And then now this will be wrong? But then we'd have no way of tracking it correctly... I've not checked the function but do we mark this as a special mapping of some kind? > + > return 0; > } > > -- > 2.47.0 >
On 13.01.25 23:30, Yang Shi wrote: > When creating private mapping for /dev/zero, the driver makes it an > anonymous mapping by calling set_vma_anonymous(). But it just sets > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > This is a special case and the VMA doesn't look like either anonymous VMA > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > It seems pointless to keep such special case. Making private /dev/zero> mapping a full anonymous mapping doesn't change the semantic of > /dev/zero either. > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > and /proc/<PID>/maps. > > Before the change: > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > After the change: > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > Hm, not sure about this. It's actually quite consistent to have that output in smaps the way it is. You mapped a file at an offset, and it behaves like an anonymous mapping apart from that. Not sure if the buggy khugepaged thing is a good indicator to warrant this change.
On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: > On 13.01.25 23:30, Yang Shi wrote: > > When creating private mapping for /dev/zero, the driver makes it an > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > It seems pointless to keep such special case. Making private /dev/zero> > mapping a full anonymous mapping doesn't change the semantic of > > /dev/zero either. > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > and /proc/<PID>/maps. > > > > Before the change: > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > After the change: > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > Hm, not sure about this. It's actually quite consistent to have that output > in smaps the way it is. You mapped a file at an offset, and it behaves like > an anonymous mapping apart from that. > > Not sure if the buggy khugepaged thing is a good indicator to warrant this > change. Yeah, this is a user-facing fundamental change that hides information and defies expectation so I mean - it's a no go really isn't it? I'd rather we _not_ make this anon though, because isn't life confusing enough David? I thought it was bad enough with 'anon, file and lol shmem' but 'lol lol also /dev/zero' is enough to make me want to frolick in the fields... > > -- > Cheers, > > David / dhildenb >
On 14.01.25 15:52, Lorenzo Stoakes wrote: > On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >> On 13.01.25 23:30, Yang Shi wrote: >>> When creating private mapping for /dev/zero, the driver makes it an >>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. >>> >>> This is a special case and the VMA doesn't look like either anonymous VMA >>> or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. >>> >>> It seems pointless to keep such special case. Making private /dev/zero> >> mapping a full anonymous mapping doesn't change the semantic of >>> /dev/zero either. >>> >>> The user visible effect is the mapping entry shown in /proc/<PID>/smaps >>> and /proc/<PID>/maps. >>> >>> Before the change: >>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero >>> >>> After the change: >>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>> >> >> Hm, not sure about this. It's actually quite consistent to have that output >> in smaps the way it is. You mapped a file at an offset, and it behaves like >> an anonymous mapping apart from that. >> >> Not sure if the buggy khugepaged thing is a good indicator to warrant this >> change. > > Yeah, this is a user-facing fundamental change that hides information and > defies expectation so I mean - it's a no go really isn't it? > > I'd rather we _not_ make this anon though, because isn't life confusing > enough David? I thought it was bad enough with 'anon, file and lol shmem' > but 'lol lol also /dev/zero' is enough to make me want to frolick in the > fields... I recall there are users that rely on this memory to get the shared zeropage on reads etc (in comparison to shmem!), so I better not ... mess with this *at all* :)
On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > + Willy for the fs/weirdness elements of this. > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: >> When creating private mapping for /dev/zero, the driver makes it an >> anonymous mapping by calling set_vma_anonymous(). But it just sets >> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > Hm yikes. > >> This is a special case and the VMA doesn't look like either anonymous VMA >> or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. >> >> It seems pointless to keep such special case. Making private /dev/zero >> mapping a full anonymous mapping doesn't change the semantic of >> /dev/zero either. > My concern is that ostensibly there _is_ a file right? Are we certain that by > not setting this we are not breaking something somewhere else? > > Are we not creating a sort of other type of 'non-such-beast' here? But the file is /dev/zero. I don't see this could break the semantic of /dev/zero. The shared mapping of /dev/zero is not affected by this change, kernel already treated private mapping of /dev/zero as anonymous mapping, but with some weird settings in VMA. When reading the mapping, it returns 0 with zero page, when writing the mapping, a new anonymous folio is allocated. > > I mean already setting it anon and setting vm_file non-NULL is really strange. > >> The user visible effect is the mapping entry shown in /proc/<PID>/smaps >> and /proc/<PID>/maps. >> >> Before the change: >> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero >> >> After the change: >> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >> > Yeah this seems like it might break somebody to be honest, it's really > really really strange to map a file then for it not to be mapped. Yes, it is possible if someone really care whether the anonymous-like mapping is mapped by /dev/zero or just created by malloc(). But I don't know who really do... > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > file but for it to be marked anonymous. > > God what a mess. > >> [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > I kind of hate that we have to mitigate like this for a case that should > never ever happen so I'm inclined towards your solution but a lot more > inclined towards us totally rethinking this. > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > page as if it were in the page cache (Willy - feel free to correct naive > misapprehension here). TBH, I don't see why page cache has to be involved. When reading, 0 is returned by zero page. When writing a CoW is triggered if page cache is involved, but the content of the page cache should be just 0, so we copy 0 to the new folio then write to it. It doesn't make too much sense. I think this is why private /dev/zero mapping is treated as anonymous mapping in the first place. > >> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >> --- >> drivers/char/mem.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/char/mem.c b/drivers/char/mem.c >> index 169eed162a7f..dae113f7fc1b 100644 >> --- a/drivers/char/mem.c >> +++ b/drivers/char/mem.c >> @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) >> if (vma->vm_flags & VM_SHARED) >> return shmem_zero_setup(vma); >> vma_set_anonymous(vma); >> + fput(vma->vm_file); >> + vma->vm_file = NULL; >> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > Hmm, this might have been mremap()'d _potentially_ though? And then now > this will be wrong? But then we'd have no way of tracking it correctly... I'm not quite familiar with the subtle details and corner cases of meremap(). But mmap_zero() should be called by mmap(), so the VMA has not been visible to user yet at this point IIUC. How come mremap() could move it? > > I've not checked the function but do we mark this as a special mapping of > some kind? > >> + >> return 0; >> } >> >> -- >> 2.47.0 >>
On 1/14/25 7:06 AM, David Hildenbrand wrote: > On 14.01.25 15:52, Lorenzo Stoakes wrote: >> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>> On 13.01.25 23:30, Yang Shi wrote: >>>> When creating private mapping for /dev/zero, the driver makes it an >>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>> offset. >>>> >>>> This is a special case and the VMA doesn't look like either >>>> anonymous VMA >>>> or file VMA. It confused other kernel subsystem, for example, >>>> khugepaged [1]. >>>> >>>> It seems pointless to keep such special case. Making private >>>> /dev/zero> >>> mapping a full anonymous mapping doesn't change the semantic of >>>> /dev/zero either. >>>> >>>> The user visible effect is the mapping entry shown in >>>> /proc/<PID>/smaps >>>> and /proc/<PID>/maps. >>>> >>>> Before the change: >>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>> 8 /dev/zero >>>> >>>> After the change: >>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>> >>> >>> Hm, not sure about this. It's actually quite consistent to have that >>> output >>> in smaps the way it is. You mapped a file at an offset, and it >>> behaves like >>> an anonymous mapping apart from that. >>> >>> Not sure if the buggy khugepaged thing is a good indicator to >>> warrant this >>> change. I admit this may be a concern, but I doubt who really care about it... >> >> Yeah, this is a user-facing fundamental change that hides information >> and >> defies expectation so I mean - it's a no go really isn't it? >> >> I'd rather we _not_ make this anon though, because isn't life confusing >> enough David? I thought it was bad enough with 'anon, file and lol >> shmem' >> but 'lol lol also /dev/zero' is enough to make me want to frolick in the >> fields... > > I recall there are users that rely on this memory to get the shared > zeropage on reads etc (in comparison to shmem!), so I better not ... > mess with this *at all* :) The behavior won't be changed.
On 14.01.25 16:06, David Hildenbrand wrote: > On 14.01.25 15:52, Lorenzo Stoakes wrote: >> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>> On 13.01.25 23:30, Yang Shi wrote: >>>> When creating private mapping for /dev/zero, the driver makes it an >>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. >>>> >>>> This is a special case and the VMA doesn't look like either anonymous VMA >>>> or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. >>>> >>>> It seems pointless to keep such special case. Making private /dev/zero> >>> mapping a full anonymous mapping doesn't change the semantic of >>>> /dev/zero either. >>>> >>>> The user visible effect is the mapping entry shown in /proc/<PID>/smaps >>>> and /proc/<PID>/maps. >>>> >>>> Before the change: >>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero >>>> >>>> After the change: >>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>> >>> >>> Hm, not sure about this. It's actually quite consistent to have that output >>> in smaps the way it is. You mapped a file at an offset, and it behaves like >>> an anonymous mapping apart from that. >>> >>> Not sure if the buggy khugepaged thing is a good indicator to warrant this >>> change. >> >> Yeah, this is a user-facing fundamental change that hides information and >> defies expectation so I mean - it's a no go really isn't it? >> >> I'd rather we _not_ make this anon though, because isn't life confusing >> enough David? I thought it was bad enough with 'anon, file and lol shmem' >> but 'lol lol also /dev/zero' is enough to make me want to frolick in the >> fields... > > I recall there are users that rely on this memory to get the shared > zeropage on reads etc (in comparison to shmem!), so I better not ... > mess with this *at all* :) Heh, and I recall reading something about odd behavior of /dev/zero and some interesting history [1]. " Unlike /dev/null, /dev/zero may be used as a source, not only as a sink for data. All write operations to /dev/zero succeed with no other effects. However, /dev/null is more commonly used for this purpose. When /dev/zero is memory-mapped, e.g., with mmap, to the virtual address space, it is equivalent to using anonymous memory; i.e. memory not connected to any file. " "equivalent to using anonymous memory" is interesting. Also, /dev/zero was there before MAP_ANONYMOUS was invented according to [1], which is quite interesting. ... so this is anonymous memory as "real" as it can get :) [1] https://en.wikipedia.org/wiki//dev/zero
On 1/14/25 9:02 AM, David Hildenbrand wrote: > On 14.01.25 16:06, David Hildenbrand wrote: >> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>> On 13.01.25 23:30, Yang Shi wrote: >>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>> offset. >>>>> >>>>> This is a special case and the VMA doesn't look like either >>>>> anonymous VMA >>>>> or file VMA. It confused other kernel subsystem, for example, >>>>> khugepaged [1]. >>>>> >>>>> It seems pointless to keep such special case. Making private >>>>> /dev/zero> >>>> mapping a full anonymous mapping doesn't change the semantic of >>>>> /dev/zero either. >>>>> >>>>> The user visible effect is the mapping entry shown in >>>>> /proc/<PID>/smaps >>>>> and /proc/<PID>/maps. >>>>> >>>>> Before the change: >>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>> 8 /dev/zero >>>>> >>>>> After the change: >>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>> >>>> >>>> Hm, not sure about this. It's actually quite consistent to have >>>> that output >>>> in smaps the way it is. You mapped a file at an offset, and it >>>> behaves like >>>> an anonymous mapping apart from that. >>>> >>>> Not sure if the buggy khugepaged thing is a good indicator to >>>> warrant this >>>> change. >>> >>> Yeah, this is a user-facing fundamental change that hides >>> information and >>> defies expectation so I mean - it's a no go really isn't it? >>> >>> I'd rather we _not_ make this anon though, because isn't life confusing >>> enough David? I thought it was bad enough with 'anon, file and lol >>> shmem' >>> but 'lol lol also /dev/zero' is enough to make me want to frolick in >>> the >>> fields... >> >> I recall there are users that rely on this memory to get the shared >> zeropage on reads etc (in comparison to shmem!), so I better not ... >> mess with this *at all* :) > > Heh, and I recall reading something about odd behavior of /dev/zero > and some interesting history [1]. > > " > Unlike /dev/null, /dev/zero may be used as a source, not only as a > sink for data. All write operations to /dev/zero succeed with no other > effects. However, /dev/null is more commonly used for this purpose. > > When /dev/zero is memory-mapped, e.g., with mmap, to the virtual > address space, it is equivalent to using anonymous memory; i.e. memory > not connected to any file. > " > > "equivalent to using anonymous memory" is interesting. For private mapping. Shared mapping is equivalent to shmem. > > > Also, /dev/zero was there before MAP_ANONYMOUS was invented according > to [1], which is quite interesting. Interesting... Didn't know this before. > > ... so this is anonymous memory as "real" as it can get :) Let's make /dev/zero as real as anonymous memory :) > > > [1] https://en.wikipedia.org/wiki//dev/zero >
On 14.01.25 18:01, Yang Shi wrote: > > > > On 1/14/25 7:06 AM, David Hildenbrand wrote: >> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>> On 13.01.25 23:30, Yang Shi wrote: >>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>> offset. >>>>> >>>>> This is a special case and the VMA doesn't look like either >>>>> anonymous VMA >>>>> or file VMA. It confused other kernel subsystem, for example, >>>>> khugepaged [1]. >>>>> >>>>> It seems pointless to keep such special case. Making private >>>>> /dev/zero> >>>> mapping a full anonymous mapping doesn't change the semantic of >>>>> /dev/zero either. >>>>> >>>>> The user visible effect is the mapping entry shown in >>>>> /proc/<PID>/smaps >>>>> and /proc/<PID>/maps. >>>>> >>>>> Before the change: >>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>> 8 /dev/zero >>>>> >>>>> After the change: >>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>> >>>> >>>> Hm, not sure about this. It's actually quite consistent to have that >>>> output >>>> in smaps the way it is. You mapped a file at an offset, and it >>>> behaves like >>>> an anonymous mapping apart from that. >>>> >>>> Not sure if the buggy khugepaged thing is a good indicator to >>>> warrant this >>>> change. > > I admit this may be a concern, but I doubt who really care about it... > There is an example in the man page [1] about /proc/self/map_files/. I assume that will also change here. It's always hard to tell who that could affect, but I'm not convinced this is worth it to find it out :) >>> >>> Yeah, this is a user-facing fundamental change that hides information >>> and >>> defies expectation so I mean - it's a no go really isn't it? >>> >>> I'd rather we _not_ make this anon though, because isn't life confusing >>> enough David? I thought it was bad enough with 'anon, file and lol >>> shmem' >>> but 'lol lol also /dev/zero' is enough to make me want to frolick in the >>> fields... >> >> I recall there are users that rely on this memory to get the shared >> zeropage on reads etc (in comparison to shmem!), so I better not ... >> mess with this *at all* :) > > The behavior won't be changed. Yes, I know. And that's good ;) [1] https://man7.org/linux/man-pages/man5/proc_pid_map_files.5.html
On 14.01.25 18:20, Yang Shi wrote: > > > > On 1/14/25 9:02 AM, David Hildenbrand wrote: >> On 14.01.25 16:06, David Hildenbrand wrote: >>> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>>> On 13.01.25 23:30, Yang Shi wrote: >>>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>>> offset. >>>>>> >>>>>> This is a special case and the VMA doesn't look like either >>>>>> anonymous VMA >>>>>> or file VMA. It confused other kernel subsystem, for example, >>>>>> khugepaged [1]. >>>>>> >>>>>> It seems pointless to keep such special case. Making private >>>>>> /dev/zero> >>>>> mapping a full anonymous mapping doesn't change the semantic of >>>>>> /dev/zero either. >>>>>> >>>>>> The user visible effect is the mapping entry shown in >>>>>> /proc/<PID>/smaps >>>>>> and /proc/<PID>/maps. >>>>>> >>>>>> Before the change: >>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>>> 8 /dev/zero >>>>>> >>>>>> After the change: >>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>>> >>>>> >>>>> Hm, not sure about this. It's actually quite consistent to have >>>>> that output >>>>> in smaps the way it is. You mapped a file at an offset, and it >>>>> behaves like >>>>> an anonymous mapping apart from that. >>>>> >>>>> Not sure if the buggy khugepaged thing is a good indicator to >>>>> warrant this >>>>> change. >>>> >>>> Yeah, this is a user-facing fundamental change that hides >>>> information and >>>> defies expectation so I mean - it's a no go really isn't it? >>>> >>>> I'd rather we _not_ make this anon though, because isn't life confusing >>>> enough David? I thought it was bad enough with 'anon, file and lol >>>> shmem' >>>> but 'lol lol also /dev/zero' is enough to make me want to frolick in >>>> the >>>> fields... >>> >>> I recall there are users that rely on this memory to get the shared >>> zeropage on reads etc (in comparison to shmem!), so I better not ... >>> mess with this *at all* :) >> >> Heh, and I recall reading something about odd behavior of /dev/zero >> and some interesting history [1]. >> >> " >> Unlike /dev/null, /dev/zero may be used as a source, not only as a >> sink for data. All write operations to /dev/zero succeed with no other >> effects. However, /dev/null is more commonly used for this purpose. >> >> When /dev/zero is memory-mapped, e.g., with mmap, to the virtual >> address space, it is equivalent to using anonymous memory; i.e. memory >> not connected to any file. >> " >> >> "equivalent to using anonymous memory" is interesting. > > For private mapping. Shared mapping is equivalent to shmem. "shared anonymous memory", yes.
On 1/14/25 9:23 AM, David Hildenbrand wrote: > On 14.01.25 18:01, Yang Shi wrote: >> >> >> >> On 1/14/25 7:06 AM, David Hildenbrand wrote: >>> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>>> On 13.01.25 23:30, Yang Shi wrote: >>>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>>> offset. >>>>>> >>>>>> This is a special case and the VMA doesn't look like either >>>>>> anonymous VMA >>>>>> or file VMA. It confused other kernel subsystem, for example, >>>>>> khugepaged [1]. >>>>>> >>>>>> It seems pointless to keep such special case. Making private >>>>>> /dev/zero> >>>>> mapping a full anonymous mapping doesn't change the semantic of >>>>>> /dev/zero either. >>>>>> >>>>>> The user visible effect is the mapping entry shown in >>>>>> /proc/<PID>/smaps >>>>>> and /proc/<PID>/maps. >>>>>> >>>>>> Before the change: >>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>>> 8 /dev/zero >>>>>> >>>>>> After the change: >>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>>> >>>>> >>>>> Hm, not sure about this. It's actually quite consistent to have that >>>>> output >>>>> in smaps the way it is. You mapped a file at an offset, and it >>>>> behaves like >>>>> an anonymous mapping apart from that. >>>>> >>>>> Not sure if the buggy khugepaged thing is a good indicator to >>>>> warrant this >>>>> change. >> >> I admit this may be a concern, but I doubt who really care about it... >> > > There is an example in the man page [1] about /proc/self/map_files/. > > I assume that will also change here. IIUC, that example is specific to "anonymous shared memory" created by shared mapping of /dev/zero. > > It's always hard to tell who that could affect, but I'm not convinced > this is worth it to find it out :) > >>>> >>>> Yeah, this is a user-facing fundamental change that hides information >>>> and >>>> defies expectation so I mean - it's a no go really isn't it? >>>> >>>> I'd rather we _not_ make this anon though, because isn't life >>>> confusing >>>> enough David? I thought it was bad enough with 'anon, file and lol >>>> shmem' >>>> but 'lol lol also /dev/zero' is enough to make me want to frolick >>>> in the >>>> fields... >>> >>> I recall there are users that rely on this memory to get the shared >>> zeropage on reads etc (in comparison to shmem!), so I better not ... >>> mess with this *at all* :) >> >> The behavior won't be changed. > > Yes, I know. And that's good ;) > > > [1] https://man7.org/linux/man-pages/man5/proc_pid_map_files.5.html >
On 14.01.25 18:38, Yang Shi wrote: > > > > On 1/14/25 9:23 AM, David Hildenbrand wrote: >> On 14.01.25 18:01, Yang Shi wrote: >>> >>> >>> >>> On 1/14/25 7:06 AM, David Hildenbrand wrote: >>>> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>>>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>>>> On 13.01.25 23:30, Yang Shi wrote: >>>>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>>>> offset. >>>>>>> >>>>>>> This is a special case and the VMA doesn't look like either >>>>>>> anonymous VMA >>>>>>> or file VMA. It confused other kernel subsystem, for example, >>>>>>> khugepaged [1]. >>>>>>> >>>>>>> It seems pointless to keep such special case. Making private >>>>>>> /dev/zero> >>>>>> mapping a full anonymous mapping doesn't change the semantic of >>>>>>> /dev/zero either. >>>>>>> >>>>>>> The user visible effect is the mapping entry shown in >>>>>>> /proc/<PID>/smaps >>>>>>> and /proc/<PID>/maps. >>>>>>> >>>>>>> Before the change: >>>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>>>> 8 /dev/zero >>>>>>> >>>>>>> After the change: >>>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>>>> >>>>>> >>>>>> Hm, not sure about this. It's actually quite consistent to have that >>>>>> output >>>>>> in smaps the way it is. You mapped a file at an offset, and it >>>>>> behaves like >>>>>> an anonymous mapping apart from that. >>>>>> >>>>>> Not sure if the buggy khugepaged thing is a good indicator to >>>>>> warrant this >>>>>> change. >>> >>> I admit this may be a concern, but I doubt who really care about it... >>> >> >> There is an example in the man page [1] about /proc/self/map_files/. >> >> I assume that will also change here. > > IIUC, that example is specific to "anonymous shared memory" created by > shared mapping of /dev/zero. Note that MAP_PRIVATE of /dev/zero will also make it appear in the same way right now (I just tried). The example is about MAP_FILE in general, not just MAP_SHARED IIUC.
On 1/14/25 9:46 AM, David Hildenbrand wrote: > On 14.01.25 18:38, Yang Shi wrote: >> >> >> >> On 1/14/25 9:23 AM, David Hildenbrand wrote: >>> On 14.01.25 18:01, Yang Shi wrote: >>>> >>>> >>>> >>>> On 1/14/25 7:06 AM, David Hildenbrand wrote: >>>>> On 14.01.25 15:52, Lorenzo Stoakes wrote: >>>>>> On Tue, Jan 14, 2025 at 02:01:32PM +0100, David Hildenbrand wrote: >>>>>>> On 13.01.25 23:30, Yang Shi wrote: >>>>>>>> When creating private mapping for /dev/zero, the driver makes >>>>>>>> it an >>>>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file >>>>>>>> offset. >>>>>>>> >>>>>>>> This is a special case and the VMA doesn't look like either >>>>>>>> anonymous VMA >>>>>>>> or file VMA. It confused other kernel subsystem, for example, >>>>>>>> khugepaged [1]. >>>>>>>> >>>>>>>> It seems pointless to keep such special case. Making private >>>>>>>> /dev/zero> >>>>>>> mapping a full anonymous mapping doesn't change the semantic of >>>>>>>> /dev/zero either. >>>>>>>> >>>>>>>> The user visible effect is the mapping entry shown in >>>>>>>> /proc/<PID>/smaps >>>>>>>> and /proc/<PID>/maps. >>>>>>>> >>>>>>>> Before the change: >>>>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 >>>>>>>> 8 /dev/zero >>>>>>>> >>>>>>>> After the change: >>>>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>>>>> >>>>>>> >>>>>>> Hm, not sure about this. It's actually quite consistent to have >>>>>>> that >>>>>>> output >>>>>>> in smaps the way it is. You mapped a file at an offset, and it >>>>>>> behaves like >>>>>>> an anonymous mapping apart from that. >>>>>>> >>>>>>> Not sure if the buggy khugepaged thing is a good indicator to >>>>>>> warrant this >>>>>>> change. >>>> >>>> I admit this may be a concern, but I doubt who really care about it... >>>> >>> >>> There is an example in the man page [1] about /proc/self/map_files/. >>> >>> I assume that will also change here. >> >> IIUC, that example is specific to "anonymous shared memory" created by >> shared mapping of /dev/zero. > > Note that MAP_PRIVATE of /dev/zero will also make it appear in the > same way right now (I just tried). Yes, I will add this in the commit log as another user visible change. > > The example is about MAP_FILE in general, not just MAP_SHARED IIUC. MAP_FILE is actually ignored on Linux per https://man7.org/linux/man-pages/man2/mmap.2.html. It also says "(regions created with the MAP_ANON | MAP_SHARED flags)". Anyway it looks like this man page may be a little bit outdated. We can clean it up later.
This is getting into realms of discussion so to risk sounding rude - to be clear - NACK. The user-visible change to /proc/$pid/[s]maps kills this patch dead. This is regardless of any other discussed issue. But more importantly, I hadn't realise mmap_zero() was on the .mmap() callback (sorry my mistake) - you're simply not permitted to change vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and it's broken. To me the alternative would be to have a custom fault handler that hands back the zero page, but I"m not sure that's workable, you'd have to install a special mapping etc. and huge pages are weird and... I do appreciate you raising this especially as I was blissfully unaware, but I don't see how this patch can possibly work, sorry :( On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > + Willy for the fs/weirdness elements of this. > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > When creating private mapping for /dev/zero, the driver makes it an > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > Hm yikes. > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > mapping a full anonymous mapping doesn't change the semantic of > > > /dev/zero either. > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > not setting this we are not breaking something somewhere else? > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > But the file is /dev/zero. I don't see this could break the semantic of > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > kernel already treated private mapping of /dev/zero as anonymous mapping, > but with some weird settings in VMA. When reading the mapping, it returns 0 > with zero page, when writing the mapping, a new anonymous folio is > allocated. You're creating a new concept of an anon but not anon but also now with anon vm_pgoff and missing vm_file even though it does reference a file and... yeah. This is not usual :) > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > and /proc/<PID>/maps. > > > > > > Before the change: > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > After the change: > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > Yeah this seems like it might break somebody to be honest, it's really > > really really strange to map a file then for it not to be mapped. > > Yes, it is possible if someone really care whether the anonymous-like > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > who really do... > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > file but for it to be marked anonymous. > > > > God what a mess. > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > > I kind of hate that we have to mitigate like this for a case that should > > never ever happen so I'm inclined towards your solution but a lot more > > inclined towards us totally rethinking this. > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > page as if it were in the page cache (Willy - feel free to correct naive > > misapprehension here). > > TBH, I don't see why page cache has to be involved. When reading, 0 is > returned by zero page. When writing a CoW is triggered if page cache is > involved, but the content of the page cache should be just 0, so we copy 0 > to the new folio then write to it. It doesn't make too much sense. I think > this is why private /dev/zero mapping is treated as anonymous mapping in the > first place. I'm obviously not suggesting allocating a bunch of extra folios, I was thinking there would be some means of handing back the actual zero page. But I am not sure this is workable. > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > --- > > > drivers/char/mem.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > index 169eed162a7f..dae113f7fc1b 100644 > > > --- a/drivers/char/mem.c > > > +++ b/drivers/char/mem.c > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > if (vma->vm_flags & VM_SHARED) > > > return shmem_zero_setup(vma); > > > vma_set_anonymous(vma); > > > + fput(vma->vm_file); > > > + vma->vm_file = NULL; > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; This is just not permitted. We maintain mmap state which contains the file and pgoff state which gets threaded through the mapping operation, and simply do not expect you to change these fields. In future we will assert on this or preferably, restrict users to only changing VMA flags, the private field and vm_ops. > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > this will be wrong? But then we'd have no way of tracking it correctly... > > I'm not quite familiar with the subtle details and corner cases of > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > been visible to user yet at this point IIUC. How come mremap() could move > it? Ah OK, in that case fine on that front. But you are not permitted to touch this field (we need to enforce this...) > > > > > I've not checked the function but do we mark this as a special mapping of > > some kind? > > > > > + > > > return 0; > > > } > > > > > > -- > > > 2.47.0 > > > >
On Tue, Jan 14, 2025 at 06:14:57PM +0000, Lorenzo Stoakes wrote: > This is getting into realms of discussion so to risk sounding rude - to be > clear - NACK. > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > is regardless of any other discussed issue. > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > callback (sorry my mistake) - you're simply not permitted to change > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > it's broken. I see shmem_zero_page() does change vma->vm_page, this is broken... ugh. I will audit this code (and have a look through _all_ mmap() callbacks I guess). Duly added to TODO. But definitely can't have _another_ case of doing this. > > To me the alternative would be to have a custom fault handler that hands > back the zero page, but I"m not sure that's workable, you'd have to install > a special mapping etc. and huge pages are weird and... > > I do appreciate you raising this especially as I was blissfully unaware, > but I don't see how this patch can possibly work, sorry :( > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > + Willy for the fs/weirdness elements of this. > > > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > When creating private mapping for /dev/zero, the driver makes it an > > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > Hm yikes. > > > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > > mapping a full anonymous mapping doesn't change the semantic of > > > > /dev/zero either. > > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > > not setting this we are not breaking something somewhere else? > > > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > > > But the file is /dev/zero. I don't see this could break the semantic of > > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > > kernel already treated private mapping of /dev/zero as anonymous mapping, > > but with some weird settings in VMA. When reading the mapping, it returns 0 > > with zero page, when writing the mapping, a new anonymous folio is > > allocated. > > You're creating a new concept of an anon but not anon but also now with > anon vm_pgoff and missing vm_file even though it does reference a file > and... yeah. > > This is not usual :) > > > > > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > > and /proc/<PID>/maps. > > > > > > > > Before the change: > > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > > > After the change: > > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > > > Yeah this seems like it might break somebody to be honest, it's really > > > really really strange to map a file then for it not to be mapped. > > > > Yes, it is possible if someone really care whether the anonymous-like > > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > > who really do... > > > > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > > file but for it to be marked anonymous. > > > > > > God what a mess. > > > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > > > I kind of hate that we have to mitigate like this for a case that should > > > never ever happen so I'm inclined towards your solution but a lot more > > > inclined towards us totally rethinking this. > > > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > > page as if it were in the page cache (Willy - feel free to correct naive > > > misapprehension here). > > > > TBH, I don't see why page cache has to be involved. When reading, 0 is > > returned by zero page. When writing a CoW is triggered if page cache is > > involved, but the content of the page cache should be just 0, so we copy 0 > > to the new folio then write to it. It doesn't make too much sense. I think > > this is why private /dev/zero mapping is treated as anonymous mapping in the > > first place. > > I'm obviously not suggesting allocating a bunch of extra folios, I was > thinking there would be some means of handing back the actual zero > page. But I am not sure this is workable. > > > > > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > --- > > > > drivers/char/mem.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > > index 169eed162a7f..dae113f7fc1b 100644 > > > > --- a/drivers/char/mem.c > > > > +++ b/drivers/char/mem.c > > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > > if (vma->vm_flags & VM_SHARED) > > > > return shmem_zero_setup(vma); > > > > vma_set_anonymous(vma); > > > > + fput(vma->vm_file); > > > > + vma->vm_file = NULL; > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > This is just not permitted. We maintain mmap state which contains the file > and pgoff state which gets threaded through the mapping operation, and > simply do not expect you to change these fields. > > In future we will assert on this or preferably, restrict users to only > changing VMA flags, the private field and vm_ops. > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > I'm not quite familiar with the subtle details and corner cases of > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > been visible to user yet at this point IIUC. How come mremap() could move > > it? > > Ah OK, in that case fine on that front. > > But you are not permitted to touch this field (we need to enforce this...) > > > > > > > > > I've not checked the function but do we mark this as a special mapping of > > > some kind? > > > > > > > + > > > > return 0; > > > > } > > > > > > > > -- > > > > 2.47.0 > > > > > >
On Tue, Jan 14, 2025 at 06:19:32PM +0000, Lorenzo Stoakes wrote: > On Tue, Jan 14, 2025 at 06:14:57PM +0000, Lorenzo Stoakes wrote: > > This is getting into realms of discussion so to risk sounding rude - to be > > clear - NACK. > > > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > > is regardless of any other discussed issue. > > > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > > callback (sorry my mistake) - you're simply not permitted to change > > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > > it's broken. > > I see shmem_zero_page() does change vma->vm_page, this is broken... ugh. I > will audit this code (and have a look through _all_ mmap() callbacks I > guess). Duly added to TODO. But definitely can't have _another_ case of > doing this. * vma->vm_file... it is late here :) > > > > > To me the alternative would be to have a custom fault handler that hands > > back the zero page, but I"m not sure that's workable, you'd have to install > > a special mapping etc. and huge pages are weird and... > > > > I do appreciate you raising this especially as I was blissfully unaware, > > but I don't see how this patch can possibly work, sorry :( > > > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > > > > > > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > > + Willy for the fs/weirdness elements of this. > > > > > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > > When creating private mapping for /dev/zero, the driver makes it an > > > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > > Hm yikes. > > > > > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > > > mapping a full anonymous mapping doesn't change the semantic of > > > > > /dev/zero either. > > > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > > > not setting this we are not breaking something somewhere else? > > > > > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > > > > > But the file is /dev/zero. I don't see this could break the semantic of > > > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > > > kernel already treated private mapping of /dev/zero as anonymous mapping, > > > but with some weird settings in VMA. When reading the mapping, it returns 0 > > > with zero page, when writing the mapping, a new anonymous folio is > > > allocated. > > > > You're creating a new concept of an anon but not anon but also now with > > anon vm_pgoff and missing vm_file even though it does reference a file > > and... yeah. > > > > This is not usual :) > > > > > > > > > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > > > and /proc/<PID>/maps. > > > > > > > > > > Before the change: > > > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > > > > > After the change: > > > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > > > > > Yeah this seems like it might break somebody to be honest, it's really > > > > really really strange to map a file then for it not to be mapped. > > > > > > Yes, it is possible if someone really care whether the anonymous-like > > > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > > > who really do... > > > > > > > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > > > file but for it to be marked anonymous. > > > > > > > > God what a mess. > > > > > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > > > > I kind of hate that we have to mitigate like this for a case that should > > > > never ever happen so I'm inclined towards your solution but a lot more > > > > inclined towards us totally rethinking this. > > > > > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > > > page as if it were in the page cache (Willy - feel free to correct naive > > > > misapprehension here). > > > > > > TBH, I don't see why page cache has to be involved. When reading, 0 is > > > returned by zero page. When writing a CoW is triggered if page cache is > > > involved, but the content of the page cache should be just 0, so we copy 0 > > > to the new folio then write to it. It doesn't make too much sense. I think > > > this is why private /dev/zero mapping is treated as anonymous mapping in the > > > first place. > > > > I'm obviously not suggesting allocating a bunch of extra folios, I was > > thinking there would be some means of handing back the actual zero > > page. But I am not sure this is workable. > > > > > > > > > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > > --- > > > > > drivers/char/mem.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > > > index 169eed162a7f..dae113f7fc1b 100644 > > > > > --- a/drivers/char/mem.c > > > > > +++ b/drivers/char/mem.c > > > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > > > if (vma->vm_flags & VM_SHARED) > > > > > return shmem_zero_setup(vma); > > > > > vma_set_anonymous(vma); > > > > > + fput(vma->vm_file); > > > > > + vma->vm_file = NULL; > > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > > > This is just not permitted. We maintain mmap state which contains the file > > and pgoff state which gets threaded through the mapping operation, and > > simply do not expect you to change these fields. > > > > In future we will assert on this or preferably, restrict users to only > > changing VMA flags, the private field and vm_ops. > > > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > > > I'm not quite familiar with the subtle details and corner cases of > > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > > been visible to user yet at this point IIUC. How come mremap() could move > > > it? > > > > Ah OK, in that case fine on that front. > > > > But you are not permitted to touch this field (we need to enforce this...) > > > > > > > > > > > > > I've not checked the function but do we mark this as a special mapping of > > > > some kind? > > > > > > > > > + > > > > > return 0; > > > > > } > > > > > > > > > > -- > > > > > 2.47.0 > > > > > > > >
On Tue, Jan 14, 2025 at 06:19:32PM +0000, Lorenzo Stoakes wrote:
> I see shmem_zero_page() does change vma->vm_page, this is broken... ugh. I
I think you mean shmem_zero_setup() and vma->vm_file, right?
On Tue, Jan 14, 2025 at 06:22:14PM +0000, Matthew Wilcox wrote: > On Tue, Jan 14, 2025 at 06:19:32PM +0000, Lorenzo Stoakes wrote: > > I see shmem_zero_page() does change vma->vm_page, this is broken... ugh. I > > I think you mean shmem_zero_setup() and vma->vm_file, right? Yes, correct. Sorry it's late here and it's showing haha! The reason I am concerned about this is because we thread mmap state through the operation which has a separate file pointer which this makes into a potential UAF. Will audit all this and for any other problematic .mmap() callback behaviour. My view is ideally this should be a callback with a const pointer to the VMA (or some other mechanism, perhaps) which accepts a change in _permitted_ fields only. The 'anything could happen and anybody could manipulate any field of the VMA' in this callback is highly problematic. But we definitely shouldn't be adding a _new_ case here.
On Tue, Jan 14, 2025 at 7:15 PM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > + fput(vma->vm_file); > > > > + vma->vm_file = NULL; > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > This is just not permitted. We maintain mmap state which contains the file > and pgoff state which gets threaded through the mapping operation, and > simply do not expect you to change these fields. > > In future we will assert on this or preferably, restrict users to only > changing VMA flags, the private field and vm_ops. > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > I'm not quite familiar with the subtle details and corner cases of > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > been visible to user yet at this point IIUC. How come mremap() could move > > it? > > Ah OK, in that case fine on that front. > > But you are not permitted to touch this field (we need to enforce this...) Sidenote: I think the GPU DRM subsystem relies on changing pgoff in some of their mmap handlers; maybe talk to them about this if you haven't already. See for example drm_gem_prime_mmap() and dma_buf_mmap().
On Tue, Jan 14, 2025 at 07:32:51PM +0100, Jann Horn wrote: > On Tue, Jan 14, 2025 at 7:15 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > > + fput(vma->vm_file); > > > > > + vma->vm_file = NULL; > > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > > > This is just not permitted. We maintain mmap state which contains the file > > and pgoff state which gets threaded through the mapping operation, and > > simply do not expect you to change these fields. > > > > In future we will assert on this or preferably, restrict users to only > > changing VMA flags, the private field and vm_ops. > > > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > > > I'm not quite familiar with the subtle details and corner cases of > > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > > been visible to user yet at this point IIUC. How come mremap() could move > > > it? > > > > Ah OK, in that case fine on that front. > > > > But you are not permitted to touch this field (we need to enforce this...) > > Sidenote: I think the GPU DRM subsystem relies on changing pgoff in > some of their mmap handlers; maybe talk to them about this if you > haven't already. See for example drm_gem_prime_mmap() and > dma_buf_mmap(). Thanks Jann , I feel like I've opened up a can of worms with this :) I will note these as things to prioritise in the audit. It might be worth both auditing and then actually doing the change to restrict what can be done here too. The problem is it requires changing a trillion callers, but hey I'm Mr. Churn after all... ;) Sorry Yang - I realise this is a pain and not at all obvious. Something we in mm need to sort out (by which I mean _me_ :) your contribution and ideas here are very valued!
On 1/14/25 10:14 AM, Lorenzo Stoakes wrote: > This is getting into realms of discussion so to risk sounding rude - to be > clear - NACK. > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > is regardless of any other discussed issue. I admit this is a concern, but I don't think this is really that bad to kill this patch. May this change result in userspace regression? Maybe, likely happens to some debugging and monitoring scripts, typically we don't worry them that much. Of course, I can't completely guarantee no regression for real life applications, it should just be unlikely IMHO. > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > callback (sorry my mistake) - you're simply not permitted to change > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > it's broken. > > To me the alternative would be to have a custom fault handler that hands > back the zero page, but I"m not sure that's workable, you'd have to install > a special mapping etc. and huge pages are weird and... TBH, I don't think we need to make fault handler more complicated, it is just handled as anonymous fault handler. I understand your concern about changing those vma filed outside core mm. An alternative is to move such change to vma.c. For example: diff --git a/mm/vma.c b/mm/vma.c index bb2119e5a0d0..2a7ea9901f57 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) else vma_set_anonymous(vma); + if (vma_is_anonymous(vma) && vma->vm_file) { + fput(vma->vm_file); + vma->vm_file = NULL; + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; + } + if (error) goto free_iter_vma; > > I do appreciate you raising this especially as I was blissfully unaware, > but I don't see how this patch can possibly work, sorry :( > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: >> >> >> On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: >>> + Willy for the fs/weirdness elements of this. >>> >>> On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: >>>> When creating private mapping for /dev/zero, the driver makes it an >>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. >>> Hm yikes. >>> >>>> This is a special case and the VMA doesn't look like either anonymous VMA >>>> or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. >>>> >>>> It seems pointless to keep such special case. Making private /dev/zero >>>> mapping a full anonymous mapping doesn't change the semantic of >>>> /dev/zero either. >>> My concern is that ostensibly there _is_ a file right? Are we certain that by >>> not setting this we are not breaking something somewhere else? >>> >>> Are we not creating a sort of other type of 'non-such-beast' here? >> But the file is /dev/zero. I don't see this could break the semantic of >> /dev/zero. The shared mapping of /dev/zero is not affected by this change, >> kernel already treated private mapping of /dev/zero as anonymous mapping, >> but with some weird settings in VMA. When reading the mapping, it returns 0 >> with zero page, when writing the mapping, a new anonymous folio is >> allocated. > You're creating a new concept of an anon but not anon but also now with > anon vm_pgoff and missing vm_file even though it does reference a file > and... yeah. > > This is not usual :) It does reference a file, but the file is /dev/zero... And if kernel already treated it as anonymous mapping, it sounds like the file may not matter that much, so why not make it as a real anonymous mapping? Then we end up having anonymous VMA and file VMA only instead of anonymous VMA, file VMA and hybrid special VMA. So we have less thing to worry about. If VMA is anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff is linear pgoff. But it is not true now. > >>> I mean already setting it anon and setting vm_file non-NULL is really strange. >>> >>>> The user visible effect is the mapping entry shown in /proc/<PID>/smaps >>>> and /proc/<PID>/maps. >>>> >>>> Before the change: >>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero >>>> >>>> After the change: >>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>> >>> Yeah this seems like it might break somebody to be honest, it's really >>> really really strange to map a file then for it not to be mapped. >> Yes, it is possible if someone really care whether the anonymous-like >> mapping is mapped by /dev/zero or just created by malloc(). But I don't know >> who really do... >> >>> But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a >>> file but for it to be marked anonymous. >>> >>> God what a mess. >>> >>>> [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ >>> I kind of hate that we have to mitigate like this for a case that should >>> never ever happen so I'm inclined towards your solution but a lot more >>> inclined towards us totally rethinking this. >>> >>> Do we _have_ to make this anonymous?? Why can't we just reference the zero >>> page as if it were in the page cache (Willy - feel free to correct naive >>> misapprehension here). >> TBH, I don't see why page cache has to be involved. When reading, 0 is >> returned by zero page. When writing a CoW is triggered if page cache is >> involved, but the content of the page cache should be just 0, so we copy 0 >> to the new folio then write to it. It doesn't make too much sense. I think >> this is why private /dev/zero mapping is treated as anonymous mapping in the >> first place. > I'm obviously not suggesting allocating a bunch of extra folios, I was > thinking there would be some means of handing back the actual zero > page. But I am not sure this is workable. As I mentioned above, even handing back zero page should be not needed. > >>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >>>> --- >>>> drivers/char/mem.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/drivers/char/mem.c b/drivers/char/mem.c >>>> index 169eed162a7f..dae113f7fc1b 100644 >>>> --- a/drivers/char/mem.c >>>> +++ b/drivers/char/mem.c >>>> @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) >>>> if (vma->vm_flags & VM_SHARED) >>>> return shmem_zero_setup(vma); >>>> vma_set_anonymous(vma); >>>> + fput(vma->vm_file); >>>> + vma->vm_file = NULL; >>>> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > This is just not permitted. We maintain mmap state which contains the file > and pgoff state which gets threaded through the mapping operation, and > simply do not expect you to change these fields. > > In future we will assert on this or preferably, restrict users to only > changing VMA flags, the private field and vm_ops. Sure, hardening the VMA initialization code and making less surprising corner case is definitely helpful. > >>> Hmm, this might have been mremap()'d _potentially_ though? And then now >>> this will be wrong? But then we'd have no way of tracking it correctly... >> I'm not quite familiar with the subtle details and corner cases of >> meremap(). But mmap_zero() should be called by mmap(), so the VMA has not >> been visible to user yet at this point IIUC. How come mremap() could move >> it? > Ah OK, in that case fine on that front. > > But you are not permitted to touch this field (we need to enforce this...) > >>> I've not checked the function but do we mark this as a special mapping of >>> some kind? >>> >>>> + >>>> return 0; >>>> } >>>> >>>> -- >>>> 2.47.0 >>>>
On Tue, Jan 14, 2025 at 11:03:48AM -0800, Yang Shi wrote: > > > > On 1/14/25 10:14 AM, Lorenzo Stoakes wrote: > > This is getting into realms of discussion so to risk sounding rude - to be > > clear - NACK. > > > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > > is regardless of any other discussed issue. > > I admit this is a concern, but I don't think this is really that bad to kill > this patch. May this change result in userspace regression? Maybe, likely > happens to some debugging and monitoring scripts, typically we don't worry > them that much. Of course, I can't completely guarantee no regression for > real life applications, it should just be unlikely IMHO. Yeah, I don't think we can accept this unfortunately. This patch is SUPER important though even if rejected, because you've made me realise we really need to audit all of these mmap handlers... so it's all super appreciated regardless :) > > > > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > > callback (sorry my mistake) - you're simply not permitted to change > > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > > it's broken. > > > > To me the alternative would be to have a custom fault handler that hands > > back the zero page, but I"m not sure that's workable, you'd have to install > > a special mapping etc. and huge pages are weird and... > > TBH, I don't think we need to make fault handler more complicated, it is > just handled as anonymous fault handler. > > I understand your concern about changing those vma filed outside core mm. An > alternative is to move such change to vma.c. For example: > > diff --git a/mm/vma.c b/mm/vma.c > index bb2119e5a0d0..2a7ea9901f57 100644 > --- a/mm/vma.c > +++ b/mm/vma.c > @@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map, > struct vm_area_struct **vmap) > else > vma_set_anonymous(vma); > > + if (vma_is_anonymous(vma) && vma->vm_file) { > + fput(vma->vm_file); > + vma->vm_file = NULL; > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > + } > + OK that's more interesting. Though the user-facing thing remains... It's possiible we could detect that the underlying thing is a zero page and manually print out /dev/zero, but can somebody create a zero page file elsewhere? In which case they might find this confusing. It's actually a nice idea to have this _explicitly_ covered off as we could then also add a comment explaining 'hey there's this weird type of VMA' and have it in a place where it's actually obvious to mm folk anyway. But this maps thing is just a killer. Somebody somewhere will be confused. And it is not for us to judge whether that's silly or not... > if (error) > goto free_iter_vma; > > > > > > I do appreciate you raising this especially as I was blissfully unaware, > > but I don't see how this patch can possibly work, sorry :( > > > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > > > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > > + Willy for the fs/weirdness elements of this. > > > > > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > > When creating private mapping for /dev/zero, the driver makes it an > > > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > > Hm yikes. > > > > > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > > > mapping a full anonymous mapping doesn't change the semantic of > > > > > /dev/zero either. > > > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > > > not setting this we are not breaking something somewhere else? > > > > > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > > But the file is /dev/zero. I don't see this could break the semantic of > > > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > > > kernel already treated private mapping of /dev/zero as anonymous mapping, > > > but with some weird settings in VMA. When reading the mapping, it returns 0 > > > with zero page, when writing the mapping, a new anonymous folio is > > > allocated. > > You're creating a new concept of an anon but not anon but also now with > > anon vm_pgoff and missing vm_file even though it does reference a file > > and... yeah. > > > > This is not usual :) > > It does reference a file, but the file is /dev/zero... And if kernel already > treated it as anonymous mapping, it sounds like the file may not matter that > much, so why not make it as a real anonymous mapping? Then we end up having > anonymous VMA and file VMA only instead of anonymous VMA, file VMA and > hybrid special VMA. So we have less thing to worry about. If VMA is > anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff > is linear pgoff. But it is not true now. It's about user confusion for me really. > > > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > > > and /proc/<PID>/maps. > > > > > > > > > > Before the change: > > > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > > > > > After the change: > > > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > > > > > Yeah this seems like it might break somebody to be honest, it's really > > > > really really strange to map a file then for it not to be mapped. > > > Yes, it is possible if someone really care whether the anonymous-like > > > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > > > who really do... > > > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > > > file but for it to be marked anonymous. > > > > > > > > God what a mess. > > > > > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > > > > I kind of hate that we have to mitigate like this for a case that should > > > > never ever happen so I'm inclined towards your solution but a lot more > > > > inclined towards us totally rethinking this. > > > > > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > > > page as if it were in the page cache (Willy - feel free to correct naive > > > > misapprehension here). > > > TBH, I don't see why page cache has to be involved. When reading, 0 is > > > returned by zero page. When writing a CoW is triggered if page cache is > > > involved, but the content of the page cache should be just 0, so we copy 0 > > > to the new folio then write to it. It doesn't make too much sense. I think > > > this is why private /dev/zero mapping is treated as anonymous mapping in the > > > first place. > > I'm obviously not suggesting allocating a bunch of extra folios, I was > > thinking there would be some means of handing back the actual zero > > page. But I am not sure this is workable. > > As I mentioned above, even handing back zero page should be not needed. Ack. > > > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > > --- > > > > > drivers/char/mem.c | 4 ++++ > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > > > index 169eed162a7f..dae113f7fc1b 100644 > > > > > --- a/drivers/char/mem.c > > > > > +++ b/drivers/char/mem.c > > > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > > > if (vma->vm_flags & VM_SHARED) > > > > > return shmem_zero_setup(vma); > > > > > vma_set_anonymous(vma); > > > > > + fput(vma->vm_file); > > > > > + vma->vm_file = NULL; > > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > This is just not permitted. We maintain mmap state which contains the file > > and pgoff state which gets threaded through the mapping operation, and > > simply do not expect you to change these fields. > > > > In future we will assert on this or preferably, restrict users to only > > changing VMA flags, the private field and vm_ops. > > Sure, hardening the VMA initialization code and making less surprising > corner case is definitely helpful. Yes and I've opened a can of worms and the worms have jumped out and on to my face and were not worms but in fact an alien facehugger :P In other words, I am going to be looking into this very seriously and auditing this whole thing... yay for making work for myself... :>) > > > > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > I'm not quite familiar with the subtle details and corner cases of > > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > > been visible to user yet at this point IIUC. How come mremap() could move > > > it? > > Ah OK, in that case fine on that front. > > > > But you are not permitted to touch this field (we need to enforce this...) > > > > > > I've not checked the function but do we mark this as a special mapping of > > > > some kind? > > > > > > > > > + > > > > > return 0; > > > > > } > > > > > > > > > > -- > > > > > 2.47.0 > > > > > >
On 1/14/25 11:13 AM, Lorenzo Stoakes wrote: > On Tue, Jan 14, 2025 at 11:03:48AM -0800, Yang Shi wrote: >> >> >> On 1/14/25 10:14 AM, Lorenzo Stoakes wrote: >>> This is getting into realms of discussion so to risk sounding rude - to be >>> clear - NACK. >>> >>> The user-visible change to /proc/$pid/[s]maps kills this patch dead. This >>> is regardless of any other discussed issue. >> I admit this is a concern, but I don't think this is really that bad to kill >> this patch. May this change result in userspace regression? Maybe, likely >> happens to some debugging and monitoring scripts, typically we don't worry >> them that much. Of course, I can't completely guarantee no regression for >> real life applications, it should just be unlikely IMHO. > Yeah, I don't think we can accept this unfortunately. > > This patch is SUPER important though even if rejected, because you've made > me realise we really need to audit all of these mmap handlers... so it's > all super appreciated regardless :) :-) > >>> But more importantly, I hadn't realise mmap_zero() was on the .mmap() >>> callback (sorry my mistake) - you're simply not permitted to change >>> vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and >>> it's broken. >>> >>> To me the alternative would be to have a custom fault handler that hands >>> back the zero page, but I"m not sure that's workable, you'd have to install >>> a special mapping etc. and huge pages are weird and... >> TBH, I don't think we need to make fault handler more complicated, it is >> just handled as anonymous fault handler. >> >> I understand your concern about changing those vma filed outside core mm. An >> alternative is to move such change to vma.c. For example: >> >> diff --git a/mm/vma.c b/mm/vma.c >> index bb2119e5a0d0..2a7ea9901f57 100644 >> --- a/mm/vma.c >> +++ b/mm/vma.c >> @@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map, >> struct vm_area_struct **vmap) >> else >> vma_set_anonymous(vma); >> >> + if (vma_is_anonymous(vma) && vma->vm_file) { >> + fput(vma->vm_file); >> + vma->vm_file = NULL; >> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; >> + } >> + > OK that's more interesting. Though the user-facing thing remains... > > It's possiible we could detect that the underlying thing is a zero page and > manually print out /dev/zero, but can somebody create a zero page file > elsewhere? In which case they might find this confusing. I'm not sure about file mapping. However reading an anonymous mapping will instantiate zero page. It should not be marked as /dev/zero mapping. > > It's actually a nice idea to have this _explicitly_ covered off as we could > then also add a comment explaining 'hey there's this weird type of VMA' and > have it in a place where it's actually obvious to mm folk anyway. > > But this maps thing is just a killer. Somebody somewhere will be > confused. And it is not for us to judge whether that's silly or not... I just thought of named anonymous VMA may help. We can give the private /dev/zero mapping a name, for example, just "/dev/zero". However, "[anon:/dev/zero]" will show up in smaps/maps. We can't keep the device numbers and inode number either, but it seems it can tell the user this mapping comes from /dev/zero, and it also explicitly tells us it is specially treated by kernel. Hopefully setting anon_name is permitted. > >> if (error) >> goto free_iter_vma; >> >> >>> I do appreciate you raising this especially as I was blissfully unaware, >>> but I don't see how this patch can possibly work, sorry :( >>> >>> On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: >>>> >>>> On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: >>>>> + Willy for the fs/weirdness elements of this. >>>>> >>>>> On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: >>>>>> When creating private mapping for /dev/zero, the driver makes it an >>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets >>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. >>>>> Hm yikes. >>>>> >>>>>> This is a special case and the VMA doesn't look like either anonymous VMA >>>>>> or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. >>>>>> >>>>>> It seems pointless to keep such special case. Making private /dev/zero >>>>>> mapping a full anonymous mapping doesn't change the semantic of >>>>>> /dev/zero either. >>>>> My concern is that ostensibly there _is_ a file right? Are we certain that by >>>>> not setting this we are not breaking something somewhere else? >>>>> >>>>> Are we not creating a sort of other type of 'non-such-beast' here? >>>> But the file is /dev/zero. I don't see this could break the semantic of >>>> /dev/zero. The shared mapping of /dev/zero is not affected by this change, >>>> kernel already treated private mapping of /dev/zero as anonymous mapping, >>>> but with some weird settings in VMA. When reading the mapping, it returns 0 >>>> with zero page, when writing the mapping, a new anonymous folio is >>>> allocated. >>> You're creating a new concept of an anon but not anon but also now with >>> anon vm_pgoff and missing vm_file even though it does reference a file >>> and... yeah. >>> >>> This is not usual :) >> It does reference a file, but the file is /dev/zero... And if kernel already >> treated it as anonymous mapping, it sounds like the file may not matter that >> much, so why not make it as a real anonymous mapping? Then we end up having >> anonymous VMA and file VMA only instead of anonymous VMA, file VMA and >> hybrid special VMA. So we have less thing to worry about. If VMA is >> anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff >> is linear pgoff. But it is not true now. > It's about user confusion for me really. > >>>>> I mean already setting it anon and setting vm_file non-NULL is really strange. >>>>> >>>>>> The user visible effect is the mapping entry shown in /proc/<PID>/smaps >>>>>> and /proc/<PID>/maps. >>>>>> >>>>>> Before the change: >>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero >>>>>> >>>>>> After the change: >>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 >>>>>> >>>>> Yeah this seems like it might break somebody to be honest, it's really >>>>> really really strange to map a file then for it not to be mapped. >>>> Yes, it is possible if someone really care whether the anonymous-like >>>> mapping is mapped by /dev/zero or just created by malloc(). But I don't know >>>> who really do... >>>> >>>>> But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a >>>>> file but for it to be marked anonymous. >>>>> >>>>> God what a mess. >>>>> >>>>>> [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ >>>>> I kind of hate that we have to mitigate like this for a case that should >>>>> never ever happen so I'm inclined towards your solution but a lot more >>>>> inclined towards us totally rethinking this. >>>>> >>>>> Do we _have_ to make this anonymous?? Why can't we just reference the zero >>>>> page as if it were in the page cache (Willy - feel free to correct naive >>>>> misapprehension here). >>>> TBH, I don't see why page cache has to be involved. When reading, 0 is >>>> returned by zero page. When writing a CoW is triggered if page cache is >>>> involved, but the content of the page cache should be just 0, so we copy 0 >>>> to the new folio then write to it. It doesn't make too much sense. I think >>>> this is why private /dev/zero mapping is treated as anonymous mapping in the >>>> first place. >>> I'm obviously not suggesting allocating a bunch of extra folios, I was >>> thinking there would be some means of handing back the actual zero >>> page. But I am not sure this is workable. >> As I mentioned above, even handing back zero page should be not needed. > Ack. > >>>>>> Signed-off-by: Yang Shi <yang@os.amperecomputing.com> >>>>>> --- >>>>>> drivers/char/mem.c | 4 ++++ >>>>>> 1 file changed, 4 insertions(+) >>>>>> >>>>>> diff --git a/drivers/char/mem.c b/drivers/char/mem.c >>>>>> index 169eed162a7f..dae113f7fc1b 100644 >>>>>> --- a/drivers/char/mem.c >>>>>> +++ b/drivers/char/mem.c >>>>>> @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) >>>>>> if (vma->vm_flags & VM_SHARED) >>>>>> return shmem_zero_setup(vma); >>>>>> vma_set_anonymous(vma); >>>>>> + fput(vma->vm_file); >>>>>> + vma->vm_file = NULL; >>>>>> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; >>> This is just not permitted. We maintain mmap state which contains the file >>> and pgoff state which gets threaded through the mapping operation, and >>> simply do not expect you to change these fields. >>> >>> In future we will assert on this or preferably, restrict users to only >>> changing VMA flags, the private field and vm_ops. >> Sure, hardening the VMA initialization code and making less surprising >> corner case is definitely helpful. > Yes and I've opened a can of worms and the worms have jumped out and on to > my face and were not worms but in fact an alien facehugger :P > > In other words, I am going to be looking into this very seriously and > auditing this whole thing... yay for making work for myself... :>) Thank you for taking the action to kill the alien facehugger :-) > >>>>> Hmm, this might have been mremap()'d _potentially_ though? And then now >>>>> this will be wrong? But then we'd have no way of tracking it correctly... >>>> I'm not quite familiar with the subtle details and corner cases of >>>> meremap(). But mmap_zero() should be called by mmap(), so the VMA has not >>>> been visible to user yet at this point IIUC. How come mremap() could move >>>> it? >>> Ah OK, in that case fine on that front. >>> >>> But you are not permitted to touch this field (we need to enforce this...) >>> >>>>> I've not checked the function but do we mark this as a special mapping of >>>>> some kind? >>>>> >>>>>> + >>>>>> return 0; >>>>>> } >>>>>> >>>>>> -- >>>>>> 2.47.0 >>>>>>
On Tue, Jan 14, 2025 at 01:24:25PM -0800, Yang Shi wrote: > > > > On 1/14/25 11:13 AM, Lorenzo Stoakes wrote: > > On Tue, Jan 14, 2025 at 11:03:48AM -0800, Yang Shi wrote: > > > > > > > > > On 1/14/25 10:14 AM, Lorenzo Stoakes wrote: > > > > This is getting into realms of discussion so to risk sounding rude - to be > > > > clear - NACK. > > > > > > > > The user-visible change to /proc/$pid/[s]maps kills this patch dead. This > > > > is regardless of any other discussed issue. > > > I admit this is a concern, but I don't think this is really that bad to kill > > > this patch. May this change result in userspace regression? Maybe, likely > > > happens to some debugging and monitoring scripts, typically we don't worry > > > them that much. Of course, I can't completely guarantee no regression for > > > real life applications, it should just be unlikely IMHO. > > Yeah, I don't think we can accept this unfortunately. > > > > This patch is SUPER important though even if rejected, because you've made > > me realise we really need to audit all of these mmap handlers... so it's > > all super appreciated regardless :) > > :-) > > > > > > > But more importantly, I hadn't realise mmap_zero() was on the .mmap() > > > > callback (sorry my mistake) - you're simply not permitted to change > > > > vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and > > > > it's broken. > > > > > > > > To me the alternative would be to have a custom fault handler that hands > > > > back the zero page, but I"m not sure that's workable, you'd have to install > > > > a special mapping etc. and huge pages are weird and... > > > TBH, I don't think we need to make fault handler more complicated, it is > > > just handled as anonymous fault handler. > > > > > > I understand your concern about changing those vma filed outside core mm. An > > > alternative is to move such change to vma.c. For example: > > > > > > diff --git a/mm/vma.c b/mm/vma.c > > > index bb2119e5a0d0..2a7ea9901f57 100644 > > > --- a/mm/vma.c > > > +++ b/mm/vma.c > > > @@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map, > > > struct vm_area_struct **vmap) > > > else > > > vma_set_anonymous(vma); > > > > > > + if (vma_is_anonymous(vma) && vma->vm_file) { > > > + fput(vma->vm_file); > > > + vma->vm_file = NULL; > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > > + } > > > + > > OK that's more interesting. Though the user-facing thing remains... > > > > It's possiible we could detect that the underlying thing is a zero page and > > manually print out /dev/zero, but can somebody create a zero page file > > elsewhere? In which case they might find this confusing. > > I'm not sure about file mapping. However reading an anonymous mapping will > instantiate zero page. It should not be marked as /dev/zero mapping. > > > > > It's actually a nice idea to have this _explicitly_ covered off as we could > > then also add a comment explaining 'hey there's this weird type of VMA' and > > have it in a place where it's actually obvious to mm folk anyway. > > > > But this maps thing is just a killer. Somebody somewhere will be > > confused. And it is not for us to judge whether that's silly or not... > > I just thought of named anonymous VMA may help. We can give the private > /dev/zero mapping a name, for example, just "/dev/zero". However, > "[anon:/dev/zero]" will show up in smaps/maps. We can't keep the device > numbers and inode number either, but it seems it can tell the user this > mapping comes from /dev/zero, and it also explicitly tells us it is > specially treated by kernel. Hopefully setting anon_name is permitted. But then that'd require CONFIG_ANON_VMA_NAME unfortunately :( I think this maps thing is the killer here really. It'd be nice to -specifically- have a means of expressing this kind of VMA, we have a means of setting a VMA anon, so maybe we can 'set a VMA to /dev/zero' and somehow explicitly know that we've done this and identify this special case. I'm not sure that the .mmap callback is the right place to do this and I"m not sure how exactly this would work but this could be workable. I agree the actual offset into the zero page is of no relevance and no _sane_ user will care, but this way we could put /dev/zero in [s]maps, treat this VMA as anon, but also add semantic information about the existence of this weird corner case. > > > > > > if (error) > > > goto free_iter_vma; > > > > > > > > > > I do appreciate you raising this especially as I was blissfully unaware, > > > > but I don't see how this patch can possibly work, sorry :( > > > > > > > > On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote: > > > > > > > > > > On 1/14/25 4:05 AM, Lorenzo Stoakes wrote: > > > > > > + Willy for the fs/weirdness elements of this. > > > > > > > > > > > > On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote: > > > > > > > When creating private mapping for /dev/zero, the driver makes it an > > > > > > > anonymous mapping by calling set_vma_anonymous(). But it just sets > > > > > > > vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. > > > > > > Hm yikes. > > > > > > > > > > > > > This is a special case and the VMA doesn't look like either anonymous VMA > > > > > > > or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. > > > > > > > > > > > > > > It seems pointless to keep such special case. Making private /dev/zero > > > > > > > mapping a full anonymous mapping doesn't change the semantic of > > > > > > > /dev/zero either. > > > > > > My concern is that ostensibly there _is_ a file right? Are we certain that by > > > > > > not setting this we are not breaking something somewhere else? > > > > > > > > > > > > Are we not creating a sort of other type of 'non-such-beast' here? > > > > > But the file is /dev/zero. I don't see this could break the semantic of > > > > > /dev/zero. The shared mapping of /dev/zero is not affected by this change, > > > > > kernel already treated private mapping of /dev/zero as anonymous mapping, > > > > > but with some weird settings in VMA. When reading the mapping, it returns 0 > > > > > with zero page, when writing the mapping, a new anonymous folio is > > > > > allocated. > > > > You're creating a new concept of an anon but not anon but also now with > > > > anon vm_pgoff and missing vm_file even though it does reference a file > > > > and... yeah. > > > > > > > > This is not usual :) > > > It does reference a file, but the file is /dev/zero... And if kernel already > > > treated it as anonymous mapping, it sounds like the file may not matter that > > > much, so why not make it as a real anonymous mapping? Then we end up having > > > anonymous VMA and file VMA only instead of anonymous VMA, file VMA and > > > hybrid special VMA. So we have less thing to worry about. If VMA is > > > anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff > > > is linear pgoff. But it is not true now. > > It's about user confusion for me really. > > > > > > > > I mean already setting it anon and setting vm_file non-NULL is really strange. > > > > > > > > > > > > > The user visible effect is the mapping entry shown in /proc/<PID>/smaps > > > > > > > and /proc/<PID>/maps. > > > > > > > > > > > > > > Before the change: > > > > > > > ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero > > > > > > > > > > > > > > After the change: > > > > > > > ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 > > > > > > > > > > > > > Yeah this seems like it might break somebody to be honest, it's really > > > > > > really really strange to map a file then for it not to be mapped. > > > > > Yes, it is possible if someone really care whether the anonymous-like > > > > > mapping is mapped by /dev/zero or just created by malloc(). But I don't know > > > > > who really do... > > > > > > > > > > > But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a > > > > > > file but for it to be marked anonymous. > > > > > > > > > > > > God what a mess. > > > > > > > > > > > > > [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ > > > > > > I kind of hate that we have to mitigate like this for a case that should > > > > > > never ever happen so I'm inclined towards your solution but a lot more > > > > > > inclined towards us totally rethinking this. > > > > > > > > > > > > Do we _have_ to make this anonymous?? Why can't we just reference the zero > > > > > > page as if it were in the page cache (Willy - feel free to correct naive > > > > > > misapprehension here). > > > > > TBH, I don't see why page cache has to be involved. When reading, 0 is > > > > > returned by zero page. When writing a CoW is triggered if page cache is > > > > > involved, but the content of the page cache should be just 0, so we copy 0 > > > > > to the new folio then write to it. It doesn't make too much sense. I think > > > > > this is why private /dev/zero mapping is treated as anonymous mapping in the > > > > > first place. > > > > I'm obviously not suggesting allocating a bunch of extra folios, I was > > > > thinking there would be some means of handing back the actual zero > > > > page. But I am not sure this is workable. > > > As I mentioned above, even handing back zero page should be not needed. > > Ack. > > > > > > > > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com> > > > > > > > --- > > > > > > > drivers/char/mem.c | 4 ++++ > > > > > > > 1 file changed, 4 insertions(+) > > > > > > > > > > > > > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > > > > > > > index 169eed162a7f..dae113f7fc1b 100644 > > > > > > > --- a/drivers/char/mem.c > > > > > > > +++ b/drivers/char/mem.c > > > > > > > @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) > > > > > > > if (vma->vm_flags & VM_SHARED) > > > > > > > return shmem_zero_setup(vma); > > > > > > > vma_set_anonymous(vma); > > > > > > > + fput(vma->vm_file); > > > > > > > + vma->vm_file = NULL; > > > > > > > + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; > > > > This is just not permitted. We maintain mmap state which contains the file > > > > and pgoff state which gets threaded through the mapping operation, and > > > > simply do not expect you to change these fields. > > > > > > > > In future we will assert on this or preferably, restrict users to only > > > > changing VMA flags, the private field and vm_ops. > > > Sure, hardening the VMA initialization code and making less surprising > > > corner case is definitely helpful. > > Yes and I've opened a can of worms and the worms have jumped out and on to > > my face and were not worms but in fact an alien facehugger :P > > > > In other words, I am going to be looking into this very seriously and > > auditing this whole thing... yay for making work for myself... :>) > > Thank you for taking the action to kill the alien facehugger :-) Haha thanks I'll do my best :)) > > > > > > > > > Hmm, this might have been mremap()'d _potentially_ though? And then now > > > > > > this will be wrong? But then we'd have no way of tracking it correctly... > > > > > I'm not quite familiar with the subtle details and corner cases of > > > > > meremap(). But mmap_zero() should be called by mmap(), so the VMA has not > > > > > been visible to user yet at this point IIUC. How come mremap() could move > > > > > it? > > > > Ah OK, in that case fine on that front. > > > > > > > > But you are not permitted to touch this field (we need to enforce this...) > > > > > > > > > > I've not checked the function but do we mark this as a special mapping of > > > > > > some kind? > > > > > > > > > > > > > + > > > > > > > return 0; > > > > > > > } > > > > > > > > > > > > > > -- > > > > > > > 2.47.0 > > > > > > > >
>> I just thought of named anonymous VMA may help. We can give the private >> /dev/zero mapping a name, for example, just "/dev/zero". However, >> "[anon:/dev/zero]" will show up in smaps/maps. We can't keep the device >> numbers and inode number either, but it seems it can tell the user this >> mapping comes from /dev/zero, and it also explicitly tells us it is >> specially treated by kernel. Hopefully setting anon_name is permitted. > But then that'd require CONFIG_ANON_VMA_NAME unfortunately :( Yes. > > I think this maps thing is the killer here really. > > It'd be nice to -specifically- have a means of expressing this kind of VMA, > we have a means of setting a VMA anon, so maybe we can 'set a VMA to > /dev/zero' and somehow explicitly know that we've done this and identify > this special case. > > I'm not sure that the .mmap callback is the right place to do this and I"m > not sure how exactly this would work but this could be workable. A couple of potential approaches off the top of my head: - A new vm flag - Use vm_private_data Both of them have pros and cons. The vm flag is simple enough, but it needs to consume one bit for just one usecase. The vm_private_data is a void pointer and a lot drivers use it to store driver specific data structures, so using the pointer in a generic path (for example, smaps) to tell us whether it is /dev/zero is not easy. We may be able to have a special encoding to it, for example, set the last bit (the trick is not unusual in core mm code). > > I agree the actual offset into the zero page is of no relevance and no > _sane_ user will care, but this way we could put /dev/zero in [s]maps, > treat this VMA as anon, but also add semantic information about the > existence of this weird corner case. >
On Wed, 15 Jan 2025, Yang Shi wrote: > > > > I just thought of named anonymous VMA may help. We can give the private > > > /dev/zero mapping a name, for example, just "/dev/zero". However, > > > "[anon:/dev/zero]" will show up in smaps/maps. We can't keep the device > > > numbers and inode number either, but it seems it can tell the user this > > > mapping comes from /dev/zero, and it also explicitly tells us it is > > > specially treated by kernel. Hopefully setting anon_name is permitted. > > But then that'd require CONFIG_ANON_VMA_NAME unfortunately :( > > Yes. Add a counter for NULL pages in smaps? I.e. Null: 4 kB Both anonymous and file mappings could have NULL page references right?
diff --git a/drivers/char/mem.c b/drivers/char/mem.c index 169eed162a7f..dae113f7fc1b 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma) if (vma->vm_flags & VM_SHARED) return shmem_zero_setup(vma); vma_set_anonymous(vma); + fput(vma->vm_file); + vma->vm_file = NULL; + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT; + return 0; }
When creating private mapping for /dev/zero, the driver makes it an anonymous mapping by calling set_vma_anonymous(). But it just sets vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset. This is a special case and the VMA doesn't look like either anonymous VMA or file VMA. It confused other kernel subsystem, for example, khugepaged [1]. It seems pointless to keep such special case. Making private /dev/zero mapping a full anonymous mapping doesn't change the semantic of /dev/zero either. The user visible effect is the mapping entry shown in /proc/<PID>/smaps and /proc/<PID>/maps. Before the change: ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero After the change: ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0 [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/ Signed-off-by: Yang Shi <yang@os.amperecomputing.com> --- drivers/char/mem.c | 4 ++++ 1 file changed, 4 insertions(+)