| Message ID | 1458566575-28063-8-git-send-email-jack@suse.cz (mailing list archive) |
|---|---|
| State | Not Applicable |
On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
>
> CPU0 (THP write fault)                     CPU1 (normal read fault)
>
> __dax_pmd_fault()                          __dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
>                                              get_block(inode, block, &bh, 0)
>                                                -> not mapped
>   if (!buffer_mapped(&bh) && write)
>     get_block(inode, block, &bh, 1) -> allocates blocks
>   truncate_pagecache_range(inode, lstart, lend);
>                                              dax_load_hole();
>
> This results in data corruption since process on CPU1 won't see changes
> into the file done by CPU0.
>
> The race can happen even if two normal faults race however with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So disable THP support in DAX code for now.

Yep, I agree that we should disable PMD faults until we get the multi-order
radix tree work finished and integrated with this locking.

I do agree with Dan though that it would be preferable to disable PMD faults
by having CONFIG_FS_DAX_PMD depend on BROKEN. That seems smaller and easier
to switch PMD faults back on for testing.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 2 +-
>  include/linux/dax.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 0329ec0bee2e..444e9dd079ca 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  }
>  EXPORT_SYMBOL_GPL(dax_fault);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  /*
>   * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
>   * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4b63923e1f8d..fd28d824254b 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
>  }
>  #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if 0
>  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
>  		unsigned int flags, get_block_t);
>  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> --
> 2.6.2
>
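The Kconfig-based alternative Ross mentions would look roughly like the sketch below; the exact symbol layout in fs/Kconfig may differ, and the surrounding dependencies here are assumptions for illustration:

```kconfig
# Hypothetical fs/Kconfig fragment: keep the PMD fault code compiling,
# but mark it unselectable until the multi-order radix tree locking lands.
config FS_DAX_PMD
	bool
	default FS_DAX
	depends on FS_DAX
	depends on TRANSPARENT_HUGEPAGE
	depends on BROKEN
```

Because BROKEN is never enabled in normal builds, the option stays off everywhere, yet a developer can flip it back on locally for testing by dropping the one `depends on BROKEN` line.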
On Wed 23-03-16 14:50:00, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:52PM +0100, Jan Kara wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> >
> > CPU0 (THP write fault)                     CPU1 (normal read fault)
> >
> > __dax_pmd_fault()                          __dax_fault()
> >   get_block(inode, block, &bh, 0) -> not mapped
> >                                              get_block(inode, block, &bh, 0)
> >                                                -> not mapped
> >   if (!buffer_mapped(&bh) && write)
> >     get_block(inode, block, &bh, 1) -> allocates blocks
> >   truncate_pagecache_range(inode, lstart, lend);
> >                                              dax_load_hole();
> >
> > This results in data corruption since process on CPU1 won't see changes
> > into the file done by CPU0.
> >
> > The race can happen even if two normal faults race however with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So disable THP support in DAX code for now.
>
> Yep, I agree that we should disable PMD faults until we get the multi-order
> radix tree work finished and integrated with this locking.
>
> I do agree with Dan though that it would be preferable to disable PMD faults
> by having CONFIG_FS_DAX_PMD depend on BROKEN. That seems smaller and easier
> to switch PMD faults back on for testing.

I did it this way because I wasn't sure PMD fault code even compiles and I
didn't want to break CONFIG_BROKEN builds. But OK, I'll make the code at
least compile and do what you say.

								Honza

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 2 +-
> >  include/linux/dax.h | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 0329ec0bee2e..444e9dd079ca 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_fault);
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  /*
> >   * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
> >   * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 4b63923e1f8d..fd28d824254b 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if 0
> >  int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> >  		unsigned int flags, get_block_t);
> >  int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > --
> > 2.6.2
> >
```diff
diff --git a/fs/dax.c b/fs/dax.c
index 0329ec0bee2e..444e9dd079ca 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -719,7 +719,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 }
 EXPORT_SYMBOL_GPL(dax_fault);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 /*
  * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
  * more often than one might expect in the below function.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4b63923e1f8d..fd28d824254b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -29,7 +29,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if 0
 int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
 		unsigned int flags, get_block_t);
 int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
```
Currently the handling of huge pages for DAX is racy. For example the
following can happen:

CPU0 (THP write fault)                     CPU1 (normal read fault)

__dax_pmd_fault()                          __dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
                                             get_block(inode, block, &bh, 0)
                                               -> not mapped
  if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
  truncate_pagecache_range(inode, lstart, lend);
                                             dax_load_hole();

This results in data corruption since process on CPU1 won't see changes
into the file done by CPU0.

The race can happen even if two normal faults race however with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So disable THP support in DAX code for now.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 2 +-
 include/linux/dax.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)