Message ID | 20240607145902.1137853-7-kernel@pankajraghav.com (mailing list archive) |
---|---|
State | Superseded, archived |
Series | enable bs > ps in XFS |
On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Usually the page cache does not extend beyond the size of the inode,
> therefore, no PTEs are created for folios that extend beyond the size.
>
> But with LBS support, we might extend page cache beyond the size of the
> inode as we need to guarantee folios of minimum order. Cap the PTE range
> to be created for the page cache up to the max allowed zero-fill file
> end, which is aligned to the PAGE_SIZE.

I think this is slightly misleading because we might well zero-fill
to the end of the folio. The issue is that we're supposed to SIGBUS
if userspace accesses pages which lie entirely beyond the end of this
file. Can you rephrase this?

(from mmap(2))
       SIGBUS Attempted access to a page of the buffer that lies beyond the end
              of the mapped file. For an explanation of the treatment of the
              bytes in the page that corresponds to the end of a mapped file
              that is not a multiple of the page size, see NOTES.


The code is good though.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

> An fstests test has been created to trigger this edge case [0].
>
> [0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bb0d2bc93c5..0e48491b3d10 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3610,7 +3610,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>          struct vm_area_struct *vma = vmf->vma;
>          struct file *file = vma->vm_file;
>          struct address_space *mapping = file->f_mapping;
> -        pgoff_t last_pgoff = start_pgoff;
> +        pgoff_t file_end, last_pgoff = start_pgoff;
>          unsigned long addr;
>          XA_STATE(xas, &mapping->i_pages, start_pgoff);
>          struct folio *folio;
> @@ -3636,6 +3636,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>                  goto out;
>          }
>
> +        file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +        if (end_pgoff > file_end)
> +                end_pgoff = file_end;
> +
>          folio_type = mm_counter_file(folio);
>          do {
>                  unsigned long end;
> --
> 2.44.1
>
On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > Usually the page cache does not extend beyond the size of the inode,
> > therefore, no PTEs are created for folios that extend beyond the size.
> >
> > But with LBS support, we might extend page cache beyond the size of the
> > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > to be created for the page cache up to the max allowed zero-fill file
> > end, which is aligned to the PAGE_SIZE.
>
> I think this is slightly misleading because we might well zero-fill
> to the end of the folio. The issue is that we're supposed to SIGBUS
> if userspace accesses pages which lie entirely beyond the end of this
> file. Can you rephrase this?
>
> (from mmap(2))
>        SIGBUS Attempted access to a page of the buffer that lies beyond the end
>               of the mapped file. For an explanation of the treatment of the
>               bytes in the page that corresponds to the end of a mapped file
>               that is not a multiple of the page size, see NOTES.
>
>
> The code is good though.
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Since I've been curating the respective fstests test to test for this
POSIX corner case [0] I wanted to enable the test for tmpfs instead of
skipping it as I originally had it, and that meant also realizing mmap(2)
specifically says this now:

Huge page (Huge TLB) mappings
...
       For mmap(), offset must be a multiple of the underlying huge page
       size. The system automatically aligns length to be a multiple of
       the underlying huge page size.

So do we need to adjust this patch with this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..9c8897ba90ff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
         vm_fault_t ret = 0;
         unsigned long rss = 0;
         unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+        unsigned int align = PAGE_SIZE;
 
         rcu_read_lock();
         folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
                 goto out;
         }
 
-        file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+        if (folio_test_pmd_mappable(folio))
+                align = 1 << folio_order(folio);
+
+        file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
         if (end_pgoff > file_end)
                 end_pgoff = file_end;
 
[0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/

  Luis
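The corner case under discussion is the mmap(2) SIGBUS rule quoted earlier in the thread: bytes past EOF inside the last file page read as zero, while a page lying entirely beyond EOF must fault. The following is a minimal userspace sketch of that rule, not the fstests test referenced as [0] and not part of the patch. The helper names `read_raises_sigbus` and `probe_offset` and the /tmp scratch path are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_env, 1);   /* unwind out of the faulting read */
}

/* Returns 1 if reading map[off] raised SIGBUS, 0 if the read succeeded. */
static int read_raises_sigbus(const volatile char *map, size_t off)
{
    struct sigaction sa, old;
    int faulted;

    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(sigbus_env, 1) == 0) {
        volatile char c = map[off];   /* may fault */
        (void)c;
        faulted = 0;
    } else {
        faulted = 1;
    }
    sigaction(SIGBUS, &old, NULL);
    return faulted;
}

/*
 * Hypothetical helper: create a scratch file of `size` bytes, map `npages`
 * pages of it, and probe byte `off` of the mapping.
 * Returns 1 for SIGBUS, 0 for a successful read, -1 on setup failure.
 */
static int probe_offset(size_t size, size_t npages, size_t off)
{
    char path[] = "/tmp/eof-sigbus-XXXXXX";   /* assumed scratch location */
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    char *map;
    int fd, result;

    assert(page > 0);
    len = npages * (size_t)page;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    result = read_raises_sigbus(map, off);
    munmap(map, len);
    return result;
}
```

On a filesystem that follows the rule, `probe_offset(100, 2, 200)` should return 0 (offset 200 is past EOF but still inside the page that holds the end of the file), and probing the first byte of the second page should return 1.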
On 13.06.24 09:57, Luis Chamberlain wrote:
> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>
>>> Usually the page cache does not extend beyond the size of the inode,
>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>
>>> But with LBS support, we might extend page cache beyond the size of the
>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>> to be created for the page cache up to the max allowed zero-fill file
>>> end, which is aligned to the PAGE_SIZE.
>>
>> I think this is slightly misleading because we might well zero-fill
>> to the end of the folio. The issue is that we're supposed to SIGBUS
>> if userspace accesses pages which lie entirely beyond the end of this
>> file. Can you rephrase this?
>>
>> (from mmap(2))
>>        SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>               of the mapped file. For an explanation of the treatment of the
>>               bytes in the page that corresponds to the end of a mapped file
>>               that is not a multiple of the page size, see NOTES.
>>
>>
>> The code is good though.
>>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> Since I've been curating the respective fstests test to test for this
> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> skipping it as I originally had it, and that meant also realizing mmap(2)
> specifically says this now:
>
> Huge page (Huge TLB) mappings

Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
friends.

So it might not be required for below changes.

> ...
>        For mmap(), offset must be a multiple of the underlying huge page
>        size. The system automatically aligns length to be a multiple of
>        the underlying huge page size.
>
> So do we need to adjust this patch with this:
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ea78963f0956..9c8897ba90ff 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>          vm_fault_t ret = 0;
>          unsigned long rss = 0;
>          unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> +        unsigned int align = PAGE_SIZE;
>
>          rcu_read_lock();
>          folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> @@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>                  goto out;
>          }
>
> -        file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +        if (folio_test_pmd_mappable(folio))
> +                align = 1 << folio_order(folio);
> +
> +        file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
>          if (end_pgoff > file_end)
>                  end_pgoff = file_end;
>
> [0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/
>
>   Luis
>
On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> On 13.06.24 09:57, Luis Chamberlain wrote:
> > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > >
> > > > Usually the page cache does not extend beyond the size of the inode,
> > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > >
> > > > But with LBS support, we might extend page cache beyond the size of the
> > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > to be created for the page cache up to the max allowed zero-fill file
> > > > end, which is aligned to the PAGE_SIZE.
> > >
> > > I think this is slightly misleading because we might well zero-fill
> > > to the end of the folio. The issue is that we're supposed to SIGBUS
> > > if userspace accesses pages which lie entirely beyond the end of this
> > > file. Can you rephrase this?
> > >
> > > (from mmap(2))
> > >        SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > >               of the mapped file. For an explanation of the treatment of the
> > >               bytes in the page that corresponds to the end of a mapped file
> > >               that is not a multiple of the page size, see NOTES.
> > >
> > >
> > > The code is good though.
> > >
> > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >
> > Since I've been curating the respective fstests test to test for this
> > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > skipping it as I originally had it, and that meant also realizing mmap(2)
> > specifically says this now:
> >
> > Huge page (Huge TLB) mappings
>
> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> friends.
>
> So it might not be required for below changes.

Thanks, I had to ask as we're dusting off this little obscure corner of
the universe. Reason I ask, is the test fails for tmpfs with huge pages,
and this patch fixes it, but it got me wondering the above applies also
to tmpfs with huge pages.

  Luis
On 13.06.24 10:13, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
>> On 13.06.24 09:57, Luis Chamberlain wrote:
>>> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>>>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>
>>>>> Usually the page cache does not extend beyond the size of the inode,
>>>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>>>
>>>>> But with LBS support, we might extend page cache beyond the size of the
>>>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>>>> to be created for the page cache up to the max allowed zero-fill file
>>>>> end, which is aligned to the PAGE_SIZE.
>>>>
>>>> I think this is slightly misleading because we might well zero-fill
>>>> to the end of the folio. The issue is that we're supposed to SIGBUS
>>>> if userspace accesses pages which lie entirely beyond the end of this
>>>> file. Can you rephrase this?
>>>>
>>>> (from mmap(2))
>>>>        SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>>>               of the mapped file. For an explanation of the treatment of the
>>>>               bytes in the page that corresponds to the end of a mapped file
>>>>               that is not a multiple of the page size, see NOTES.
>>>>
>>>>
>>>> The code is good though.
>>>>
>>>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>
>>> Since I've been curating the respective fstests test to test for this
>>> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
>>> skipping it as I originally had it, and that meant also realizing mmap(2)
>>> specifically says this now:
>>>
>>> Huge page (Huge TLB) mappings
>>
>> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
>> friends.
>>
>> So it might not be required for below changes.
>
> Thanks, I had to ask as we're dusting off this little obscure corner of
> the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> and this patch fixes it, but it got me wondering the above applies also
> to tmpfs with huge pages.

Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the tmpfs
with THP. There are not really mmap/munmap restrictions to THP and friends
(because it's supposed to be "transparent" :) ).
On Thu, Jun 13, 2024 at 10:16:10AM +0200, David Hildenbrand wrote:
> On 13.06.24 10:13, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> > > On 13.06.24 09:57, Luis Chamberlain wrote:
> > > > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > > >
> > > > > > Usually the page cache does not extend beyond the size of the inode,
> > > > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > > > >
> > > > > > But with LBS support, we might extend page cache beyond the size of the
> > > > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > > > to be created for the page cache up to the max allowed zero-fill file
> > > > > > end, which is aligned to the PAGE_SIZE.
> > > > >
> > > > > I think this is slightly misleading because we might well zero-fill
> > > > > to the end of the folio. The issue is that we're supposed to SIGBUS
> > > > > if userspace accesses pages which lie entirely beyond the end of this
> > > > > file. Can you rephrase this?
> > > > >
> > > > > (from mmap(2))
> > > > >        SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > > > >               of the mapped file. For an explanation of the treatment of the
> > > > >               bytes in the page that corresponds to the end of a mapped file
> > > > >               that is not a multiple of the page size, see NOTES.
> > > > >
> > > > >
> > > > > The code is good though.
> > > > >
> > > > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > >
> > > > Since I've been curating the respective fstests test to test for this
> > > > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > > > skipping it as I originally had it, and that meant also realizing mmap(2)
> > > > specifically says this now:
> > > >
> > > > Huge page (Huge TLB) mappings
> > >
> > > Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> > > friends.
> > >
> > > So it might not be required for below changes.
> >
> > Thanks, I had to ask as we're dusting off this little obscure corner of
> > the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> > and this patch fixes it, but it got me wondering the above applies also
> > to tmpfs with huge pages.
>
> Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the tmpfs
> with THP. There are not really mmap/munmap restrictions to THP and friends
> (because it's supposed to be "transparent" :) ).

The case I tested that failed the test was tmpfs with huge pages (not
large folios). So should we then have this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..649beb9bbc6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
         vm_fault_t ret = 0;
         unsigned long rss = 0;
         unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+        unsigned int align = PAGE_SIZE;
 
         rcu_read_lock();
         folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,16 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
                 goto out;
         }
 
-        file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+        /*
+         * As per the mmap(2) mmap(), the offset must be a multiple of the
+         * underlying huge page size. The system automatically aligns length to
+         * be a multiple of the underlying huge page size.
+         */
+        if (folio_test_pmd_mappable(folio) &&
+            (shmem_mapping(mapping) || folio_test_hugetlb(folio)))
+                align = 1 << folio_order(folio);
+
+        file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
         if (end_pgoff > file_end)
                 end_pgoff = file_end;
 
On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> The case I tested that failed the test was tmpfs with huge pages (not
> large folios). So should we then have this:

No.
On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > The case I tested that failed the test was tmpfs with huge pages (not
> > large folios). So should we then have this:
>
> No.

OK so this does have a change for tmpfs with huge pages enabled, do we
take the position then this is a fix for that?

  Luis
On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > The case I tested that failed the test was tmpfs with huge pages (not
> > > large folios). So should we then have this:
> >
> > No.
>
> OK so this does have a change for tmpfs with huge pages enabled, do we
> take the position then this is a fix for that?

You literally said it was a fix just a few messages up thread?

Besides, the behaviour changes (currently) depending on whether
you specify "within_size" or "always". This patch makes it consistent.
On Thu, Jun 13, 2024 at 04:40:28PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > > The case I tested that failed the test was tmpfs with huge pages (not
> > > > large folios). So should we then have this:
> > >
> > > No.
> >
> > OK so this does have a change for tmpfs with huge pages enabled, do we
> > take the position then this is a fix for that?
>
> You literally said it was a fix just a few messages up thread?
>
> Besides, the behaviour changes (currently) depending on whether
> you specify "within_size" or "always". This patch makes it consistent.

The quoted mmap(2) text made me doubt it, and I was looking for
clarification. It seems clear now based on feedback the text does not
apply to tmpfs with huge pages, and so we'll just annotate it as a fix
for tmpfs with huge pages.

It makes sense to not apply, I mean, why *would* you assume you will
have an extended range zeroed out range to muck around with beyond
PAGE_SIZE just because huge pages were used when the rest of all other
filesystem APIs count on the mmap(2) PAGE_SIZE boundary.

Thanks!

  Luis
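To put numbers on the conclusion reached here, the sketch below compares how many base pages would be mappable if the end of the file were rounded up to PAGE_SIZE versus to a huge page size. It is plain userspace arithmetic, not kernel code. The function name `mappable_pages` and the 4 KiB and 2 MiB sizes are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of base pages that fall inside the file once its size is rounded
 * up to a multiple of `granule` bytes. Hypothetical helper for illustration.
 */
static uint64_t mappable_pages(uint64_t i_size, uint64_t granule,
                               uint64_t page_size)
{
    uint64_t rounded;

    assert(granule > 0 && page_size > 0 && granule % page_size == 0);
    rounded = (i_size + granule - 1) / granule * granule;
    return rounded / page_size;
}
```

With these assumed sizes, a 5000-byte file yields `mappable_pages(5000, 4096, 4096) == 2`, while rounding to a 2 MiB granule yields `mappable_pages(5000, 2097152, 4096) == 512`. Under the second rounding, 510 pages lying entirely beyond EOF would be accessible without SIGBUS, which is what the PAGE_SIZE boundary rules out.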
diff --git a/mm/filemap.c b/mm/filemap.c
index 8bb0d2bc93c5..0e48491b3d10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3610,7 +3610,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
         struct vm_area_struct *vma = vmf->vma;
         struct file *file = vma->vm_file;
         struct address_space *mapping = file->f_mapping;
-        pgoff_t last_pgoff = start_pgoff;
+        pgoff_t file_end, last_pgoff = start_pgoff;
         unsigned long addr;
         XA_STATE(xas, &mapping->i_pages, start_pgoff);
         struct folio *folio;
@@ -3636,6 +3636,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
                 goto out;
         }
 
+        file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+        if (end_pgoff > file_end)
+                end_pgoff = file_end;
+
         folio_type = mm_counter_file(folio);
         do {
                 unsigned long end;
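The capping arithmetic added by the patch can be tried out in userspace. The following is a minimal sketch under stated assumptions, not the kernel implementation: `last_file_pgoff` and `cap_end_pgoff` are hypothetical names, `uint64_t` stands in for `pgoff_t` and `loff_t`, and the page size is passed in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the last page holding at least one byte of the file, mirroring
 * DIV_ROUND_UP(i_size, PAGE_SIZE) - 1. For i_size == 0 the subtraction wraps
 * around in unsigned arithmetic, so this sketch caps nothing in that case.
 */
static uint64_t last_file_pgoff(uint64_t i_size, uint64_t page_size)
{
    assert(page_size > 0);
    return (i_size + page_size - 1) / page_size - 1;
}

/* Clamp the last page offset to map so it never passes the end of the file. */
static uint64_t cap_end_pgoff(uint64_t end_pgoff, uint64_t i_size,
                              uint64_t page_size)
{
    uint64_t file_end = last_file_pgoff(i_size, page_size);

    return end_pgoff > file_end ? file_end : end_pgoff;
}
```

As an example, assume 4 KiB pages and a 5000-byte file backed by a 16 KiB folio covering page offsets 0 to 3. Then `cap_end_pgoff(3, 5000, 4096)` returns 1, so only pages 0 and 1 would be mapped and offsets 2 and 3, which lie entirely beyond EOF, are left to fault.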