Message ID | 20221207203158.651092-1-peterx@redhat.com (mailing list archive)
---|---
State | New
Series | [v2,01/10] mm/hugetlb: Let vma_offset_start() to return start
On 12/7/22 12:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/page_vma_mapped.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  		/* The only possible mapping was handled on last iteration */
>  		if (pvmw->pte)
>  			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */

Confused. Is this documentation actually intended to refer to hugetlb_walk()
itself, or just this call site? If the former, then let's move it over
to be right before hugetlb_walk().

>  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>  		if (!pvmw->pte)
>  			return false;

thanks,
On 07.12.22 21:31, Peter Xu wrote:
> Taking vma lock here is not needed for now because all potential hugetlb
> walkers here should have i_mmap_rwsem held. Document the fact.
>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/page_vma_mapped.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e97b2e23bd28..2e59a0419d22 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  		/* The only possible mapping was handled on last iteration */
>  		if (pvmw->pte)
>  			return not_found(pvmw);
> -
> -		/* when pud is not present, pte will be NULL */
> +		/*
> +		 * NOTE: we don't need explicit lock here to walk the
> +		 * hugetlb pgtable because either (1) potential callers of
> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> +		 * When one day this rule breaks, one will get a warning
> +		 * in hugetlb_walk(), and then we'll figure out what to do.
> +		 */
>  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>  		if (!pvmw->pte)
>  			return false;

Would it make sense to squash that into the previous commit?
On Wed, Dec 07, 2022 at 04:16:11PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/page_vma_mapped.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  		/* The only possible mapping was handled on last iteration */
> >  		if (pvmw->pte)
> >  			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
>
> Confused. Is this documentation actually intended to refer to hugetlb_walk()
> itself, or just this call site? If the former, then let's move it over
> to be right before hugetlb_walk().

It is for this specific code path not hugetlb_walk().

The "holds i_mmap_rwsem" here is a true statement (not requirement) because
PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
requirements already even without holding the vma lock.
On Thu, Dec 08, 2022 at 02:16:03PM +0100, David Hildenbrand wrote:
> On 07.12.22 21:31, Peter Xu wrote:
> > Taking vma lock here is not needed for now because all potential hugetlb
> > walkers here should have i_mmap_rwsem held. Document the fact.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/page_vma_mapped.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index e97b2e23bd28..2e59a0419d22 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >  		/* The only possible mapping was handled on last iteration */
> >  		if (pvmw->pte)
> >  			return not_found(pvmw);
> > -
> > -		/* when pud is not present, pte will be NULL */
> > +		/*
> > +		 * NOTE: we don't need explicit lock here to walk the
> > +		 * hugetlb pgtable because either (1) potential callers of
> > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > +		 * When one day this rule breaks, one will get a warning
> > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > +		 */
> >  		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
> >  		if (!pvmw->pte)
> >  			return false;
>
> Would it make sense to squash that into the previous commit?

Sure thing.
On 12/8/22 13:05, Peter Xu wrote:
>>> +		/*
>>> +		 * NOTE: we don't need explicit lock here to walk the
>>> +		 * hugetlb pgtable because either (1) potential callers of
>>> +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
>>> +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
>>> +		 * When one day this rule breaks, one will get a warning
>>> +		 * in hugetlb_walk(), and then we'll figure out what to do.
>>> +		 */
>>
>> Confused. Is this documentation actually intended to refer to hugetlb_walk()
>> itself, or just this call site? If the former, then let's move it over
>> to be right before hugetlb_walk().
>
> It is for this specific code path not hugetlb_walk().
>
> The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> requirements already even without holding the vma lock.
>

It's really hard to understand. Do you have a few extra words to explain it?
I can help with actual comment wording perhaps, but I am still a bit in
the dark as to the actual meaning. :)

thanks,
On Thu, Dec 08, 2022 at 01:54:27PM -0800, John Hubbard wrote:
> On 12/8/22 13:05, Peter Xu wrote:
> > > > +		/*
> > > > +		 * NOTE: we don't need explicit lock here to walk the
> > > > +		 * hugetlb pgtable because either (1) potential callers of
> > > > +		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
> > > > +		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
> > > > +		 * When one day this rule breaks, one will get a warning
> > > > +		 * in hugetlb_walk(), and then we'll figure out what to do.
> > > > +		 */
> > >
> > > Confused. Is this documentation actually intended to refer to hugetlb_walk()
> > > itself, or just this call site? If the former, then let's move it over
> > > to be right before hugetlb_walk().
> >
> > It is for this specific code path not hugetlb_walk().
> >
> > The "holds i_mmap_rwsem" here is a true statement (not requirement) because
> > PVMW rmap walkers always have that. That satisfies with hugetlb_walk()
> > requirements already even without holding the vma lock.
> >
>
> It's really hard to understand. Do you have a few extra words to explain it?
> I can help with actual comment wording perhaps, but I am still a bit in
> the dark as to the actual meaning. :)

Firstly, this patch (to be squashed into previous) is trying to document
page_vma_mapped_walk() on why it's not needed to further take any lock to
call hugetlb_walk().

To call hugetlb_walk() we need either of the locks listed below (in either
read or write mode), according to the rules we setup for it in patch 3:

  (1) hugetlb vma lock
  (2) i_mmap_rwsem lock

page_vma_mapped_walk() is called in below sites across the kernel:

__replace_page[179]                   if (!page_vma_mapped_walk(&pvmw))
__damon_pa_mkold[24]                  while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]                  while (page_vma_mapped_walk(&pvmw)) {
write_protect_page[1065]              if (!page_vma_mapped_walk(&pvmw))
remove_migration_pte[179]             while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56]      while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]               if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]             while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]             while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]                while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]              while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205]  while (page_vma_mapped_walk(&pvmw)) {

If we group them, we can see that most of them are during a rmap walk
(i.e., comes from a higher rmap_walk() stack), they are:

__damon_pa_mkold[24]                  while (page_vma_mapped_walk(&pvmw)) {
__damon_pa_young[97]                  while (page_vma_mapped_walk(&pvmw)) {
remove_migration_pte[179]             while (page_vma_mapped_walk(&pvmw)) {
page_idle_clear_pte_refs_one[56]      while (page_vma_mapped_walk(&pvmw)) {
page_mapped_in_vma[318]               if (!page_vma_mapped_walk(&pvmw))
folio_referenced_one[813]             while (page_vma_mapped_walk(&pvmw)) {
page_vma_mkclean_one[958]             while (page_vma_mapped_walk(pvmw)) {
try_to_unmap_one[1506]                while (page_vma_mapped_walk(&pvmw)) {
try_to_migrate_one[1881]              while (page_vma_mapped_walk(&pvmw)) {
page_make_device_exclusive_one[2205]  while (page_vma_mapped_walk(&pvmw)) {

Let's call it case (A).

We have another two special cases that are not during a rmap walk, they
are:

write_protect_page[1065]              if (!page_vma_mapped_walk(&pvmw))
__replace_page[179]                   if (!page_vma_mapped_walk(&pvmw))

Let's call it case (B).

Case (A) is always safe because it always take the i_mmap_rwsem lock in
read mode.  It's done in rmap_walk_file() where:

        if (!locked) {
                if (i_mmap_trylock_read(mapping))
                        goto lookup;

                if (rwc->try_lock) {
                        rwc->contended = true;
                        return;
                }

                i_mmap_lock_read(mapping);
        }

If locked==true it means the caller already holds the lock, so no need to
take it.  It justifies that all callers from rmap_walk() upon a hugetlb vma
is safe to call hugetlb_walk() already according to the rule of hugetlb_walk().

Case (B) contains two cases either in KSM path or uprobe path, and none of
the paths (afaict) can get a hugetlb vma involved.  IOW, the whole path of

        if (unlikely(is_vm_hugetlb_page(vma))) {

In page_vma_mapped_walk() just should never trigger.

To summarize above into a shorter paragraph, it'll become the comment.

Hope it explains. Thanks.
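For context, the "warning in hugetlb_walk()" that the comment and the explanation above refer to is a lockdep-based check introduced earlier in the series. The sketch below shows roughly what such a check can look like; the names used here (struct hugetlb_vma_lock, __vma_shareable_lock, huge_pte_offset) follow the hugetlb locking code of that period, but treat this as an illustration under those assumptions rather than the series' exact code.

```c
/*
 * Sketch only: hugetlb_walk() asserting the locking rule discussed in the
 * thread -- the walker must hold either the hugetlb vma lock or the
 * i_mmap_rwsem.  Note that lockdep_is_held() is only available with
 * CONFIG_LOCKDEP, so a real version would be guarded accordingly.
 */
static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
				  unsigned long addr, unsigned long sz)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
	struct address_space *mapping = vma->vm_file->f_mapping;

	/* Only vmas that can share pmds need the locking for a safe walk. */
	if (__vma_shareable_lock(vma))
		WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
			     !lockdep_is_held(&mapping->i_mmap_rwsem));

	return huge_pte_offset(vma->vm_mm, addr, sz);
}
```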
On 12/8/22 14:21, Peter Xu wrote:
>
> Firstly, this patch (to be squashed into previous) is trying to document
> page_vma_mapped_walk() on why it's not needed to further take any lock to
> call hugetlb_walk().
>
> To call hugetlb_walk() we need either of the locks listed below (in either
> read or write mode), according to the rules we setup for it in patch 3:
>
>   (1) hugetlb vma lock
>   (2) i_mmap_rwsem lock
>
> page_vma_mapped_walk() is called in below sites across the kernel:
>
> __replace_page[179]                   if (!page_vma_mapped_walk(&pvmw))
> __damon_pa_mkold[24]                  while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]                  while (page_vma_mapped_walk(&pvmw)) {
> write_protect_page[1065]              if (!page_vma_mapped_walk(&pvmw))
> remove_migration_pte[179]             while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56]      while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]               if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]             while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]             while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]                while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]              while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205]  while (page_vma_mapped_walk(&pvmw)) {
>
> If we group them, we can see that most of them are during a rmap walk
> (i.e., comes from a higher rmap_walk() stack), they are:
>
> __damon_pa_mkold[24]                  while (page_vma_mapped_walk(&pvmw)) {
> __damon_pa_young[97]                  while (page_vma_mapped_walk(&pvmw)) {
> remove_migration_pte[179]             while (page_vma_mapped_walk(&pvmw)) {
> page_idle_clear_pte_refs_one[56]      while (page_vma_mapped_walk(&pvmw)) {
> page_mapped_in_vma[318]               if (!page_vma_mapped_walk(&pvmw))
> folio_referenced_one[813]             while (page_vma_mapped_walk(&pvmw)) {
> page_vma_mkclean_one[958]             while (page_vma_mapped_walk(pvmw)) {
> try_to_unmap_one[1506]                while (page_vma_mapped_walk(&pvmw)) {
> try_to_migrate_one[1881]              while (page_vma_mapped_walk(&pvmw)) {
> page_make_device_exclusive_one[2205]  while (page_vma_mapped_walk(&pvmw)) {
>
> Let's call it case (A).
>
> We have another two special cases that are not during a rmap walk, they
> are:
>
> write_protect_page[1065]              if (!page_vma_mapped_walk(&pvmw))
> __replace_page[179]                   if (!page_vma_mapped_walk(&pvmw))
>
> Let's call it case (B).
>
> Case (A) is always safe because it always take the i_mmap_rwsem lock in
> read mode.  It's done in rmap_walk_file() where:
>
>         if (!locked) {
>                 if (i_mmap_trylock_read(mapping))
>                         goto lookup;
>
>                 if (rwc->try_lock) {
>                         rwc->contended = true;
>                         return;
>                 }
>
>                 i_mmap_lock_read(mapping);
>         }
>
> If locked==true it means the caller already holds the lock, so no need to
> take it.  It justifies that all callers from rmap_walk() upon a hugetlb vma
> is safe to call hugetlb_walk() already according to the rule of hugetlb_walk().
>
> Case (B) contains two cases either in KSM path or uprobe path, and none of
> the paths (afaict) can get a hugetlb vma involved.  IOW, the whole path of
>
>         if (unlikely(is_vm_hugetlb_page(vma))) {
>
> In page_vma_mapped_walk() just should never trigger.
>
> To summarize above into a shorter paragraph, it'll become the comment.
>
> Hope it explains. Thanks.
>

It does! And now for the comment, I think you'll find that this suffices:

        /*
         * All callers that get here will already hold the i_mmap_rwsem.
         * Therefore, no additional locks need to be taken before
         * calling hugetlb_walk().
         */

...which, considering all the data above, is probably the mother of
all summaries. :) But really, it's all that people need to know here, and
it's readily understandable without wondering what KSM has to do with this,
for example.

thanks,
On Thu, Dec 08, 2022 at 04:24:19PM -0800, John Hubbard wrote:
> It does! And now for the comment, I think you'll find that this suffices:
>
>         /*
>          * All callers that get here will already hold the i_mmap_rwsem.
>          * Therefore, no additional locks need to be taken before
>          * calling hugetlb_walk().
>          */
>
> ...which, considering all the data above, is probably the mother of
> all summaries. :) But really, it's all that people need to know here, and
> it's readily understandable without wondering what KSM has to do with this,
> for example.

I'm okay with the change. :)

I think what I'll do is I'll move part of the original one into commit
message, and take the new version in the code.

Thanks,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e97b2e23bd28..2e59a0419d22 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -168,8 +168,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		/* The only possible mapping was handled on last iteration */
 		if (pvmw->pte)
 			return not_found(pvmw);
-
-		/* when pud is not present, pte will be NULL */
+		/*
+		 * NOTE: we don't need explicit lock here to walk the
+		 * hugetlb pgtable because either (1) potential callers of
+		 * hugetlb pvmw currently holds i_mmap_rwsem, or (2) the
+		 * caller will not walk a hugetlb vma (e.g. ksm or uprobe).
+		 * When one day this rule breaks, one will get a warning
+		 * in hugetlb_walk(), and then we'll figure out what to do.
+		 */
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;
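As a closing illustration of case (A) from the discussion above, the sketch below shows a hypothetical rmap handler (the function names are invented for illustration) driving page_vma_mapped_walk() from under rmap_walk(). When the folio is file-backed (including hugetlb), rmap_walk_file() takes i_mmap_rwsem for read before calling the handler, which is why the hugetlb_walk() call in the hunk above needs no extra locking.

```c
#include <linux/mm.h>
#include <linux/rmap.h>

/* Hypothetical rmap handler, modelled loosely on folio_referenced_one(). */
static bool count_mappings_one(struct folio *folio, struct vm_area_struct *vma,
			       unsigned long address, void *arg)
{
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
	unsigned int *nr = arg;

	/*
	 * Each iteration finds one pte/pmd mapping the folio in this vma,
	 * with the relevant page table lock held; for a hugetlb vma the
	 * lookup goes through the hugetlb_walk() call shown above.
	 */
	while (page_vma_mapped_walk(&pvmw))
		(*nr)++;

	return true;	/* keep walking the remaining vmas */
}

/* Hypothetical caller; the folio must be locked before rmap_walk(). */
static unsigned int count_folio_mappings(struct folio *folio)
{
	unsigned int nr = 0;
	struct rmap_walk_control rwc = {
		.rmap_one = count_mappings_one,
		.arg = &nr,
	};

	/*
	 * For file-backed (including hugetlb) folios this ends up in
	 * rmap_walk_file(), which takes i_mmap_rwsem for read before
	 * invoking ->rmap_one -- case (A) in the thread above.
	 */
	rmap_walk(folio, &rwc);
	return nr;
}
```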