Message ID: 20220411135958.21385-1-xiaoguang.wang@linux.alibaba.com (mailing list archive)
State:      New, archived
Series:     [v2] scsi: target: tcmu: Fix possible data corruption
Hi Xiaoguang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on jejb-scsi/for-next]
[also build test WARNING on v5.18-rc2 next-20220411]
[cannot apply to mkp-scsi/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:      https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/scsi-target-tcmu-Fix-possible-data-corruption/20220411-220214
base:     https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
config:   x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20220412/202204121450.qYzuKGXT-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-19) 11.2.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/intel-lab-lkp/linux/commit/2bceb529129db286e111bc3bae0b52b62b1fba07
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/scsi-target-tcmu-Fix-possible-data-corruption/20220411-220214
        git checkout 2bceb529129db286e111bc3bae0b52b62b1fba07
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/target/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> drivers/target/target_core_user.c:1907:21: sparse: sparse: incorrect type in assignment (different base types) @@ expected int ret @@ got restricted vm_fault_t @@
   drivers/target/target_core_user.c:1907:21: sparse:    expected int ret
   drivers/target/target_core_user.c:1907:21: sparse:    got restricted vm_fault_t
>> drivers/target/target_core_user.c:1911:16: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted vm_fault_t @@ got int ret @@
   drivers/target/target_core_user.c:1911:16: sparse:    expected restricted vm_fault_t
   drivers/target/target_core_user.c:1911:16: sparse:    got int ret

vim +1907 drivers/target/target_core_user.c

  1874
  1875  static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
  1876  {
  1877          struct tcmu_dev *udev = vmf->vma->vm_private_data;
  1878          struct uio_info *info = &udev->uio_info;
  1879          struct page *page;
  1880          unsigned long offset;
  1881          void *addr;
  1882          int ret = 0;
  1883
  1884          int mi = tcmu_find_mem_index(vmf->vma);
  1885          if (mi < 0)
  1886                  return VM_FAULT_SIGBUS;
  1887
  1888          /*
  1889           * We need to subtract mi because userspace uses offset = N*PAGE_SIZE
  1890           * to use mem[N].
  1891           */
  1892          offset = (vmf->pgoff - mi) << PAGE_SHIFT;
  1893
  1894          if (offset < udev->data_off) {
  1895                  /* For the vmalloc()ed cmd area pages */
  1896                  addr = (void *)(unsigned long)info->mem[mi].addr + offset;
  1897                  page = vmalloc_to_page(addr);
  1898                  get_page(page);
  1899          } else {
  1900                  uint32_t dpi;
  1901
  1902                  /* For the dynamically growing data area pages */
  1903                  dpi = (offset - udev->data_off) / PAGE_SIZE;
  1904                  page = tcmu_try_get_data_page(udev, dpi);
  1905                  if (!page)
  1906                          return VM_FAULT_SIGBUS;
> 1907                  ret = VM_FAULT_LOCKED;
  1908          }
  1909
  1910          vmf->page = page;
> 1911          return ret;
  1912  }
  1913
hi,

Thanks for this report, I'll add the Reported-by tag in the V3 patch.
But before sending V3, I'd like to wait for the tcmu maintainer to
give it a brief review.

Regards,
Xiaoguang Wang

> Hi Xiaoguang,
>
> Thank you for the patch! Perhaps something to improve:
> [...]
Hi,

Thank you for the patch.

I'm wondering whether we need the new function
tcmu_wait_inflight_page_fault? Your previous patch just fixed
tcmu_vma_fault and tcmu_try_get_data_page to call get_page() while
holding cmdr_lock. So, I think we are safe to first call
tcmu_blocks_release and then do unmap_mapping_range.
If so, we could simply add lock_page() and unlock_page() to
tcmu_blocks_release, avoiding the need for a second walk through the
xarray.

Bodo

On 11.04.22 15:59, Xiaoguang Wang wrote:
> When tcmu_vma_fault() gets one page successfully, before the current
> context completes page fault procedure, find_free_blocks() may run in
> and call unmap_mapping_range() to unmap this page.
> [...]
hi,

> Hi,
>
> Thank you for the patch.
>
> I'm wondering whether we need the new function
> tcmu_wait_inflight_page_fault? Your previous patch just fixed
> tcmu_vma_fault and tcmu_try_get_data_page to call get_page() while
> holding cmdr_lock. So, I think we are safe to first call
> tcmu_blocks_release and then do unmap_mapping_range.
> If so, we could simply add lock_page() and unlock_page() to
> tcmu_blocks_release, avoiding the need for a second walk through the
> xarray.
Oh, you're right, thanks. I'll prepare V3 soon.

Regards,
Xiaoguang Wang

> Bodo
>
> On 11.04.22 15:59, Xiaoguang Wang wrote:
> [...]
diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index fd7267baa707..ed026f5bdb14 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -20,6 +20,7 @@
 #include <linux/configfs.h>
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
+#include <linux/pagemap.h>
 #include <net/genetlink.h>
 #include <scsi/scsi_common.h>
 #include <scsi/scsi_proto.h>
@@ -1657,6 +1658,20 @@ static int tcmu_check_and_free_pending_cmd(struct tcmu_cmd *cmd)
 	return -EINVAL;
 }
 
+static void tcmu_wait_inflight_page_fault(struct tcmu_dev *udev,
+				unsigned long first, unsigned long last)
+{
+	XA_STATE(xas, &udev->data_pages, first * udev->data_pages_per_blk);
+	struct page *page;
+
+	xas_lock(&xas);
+	xas_for_each(&xas, page, (last + 1) * udev->data_pages_per_blk - 1) {
+		lock_page(page);
+		unlock_page(page);
+	}
+	xas_unlock(&xas);
+}
+
 static u32 tcmu_blocks_release(struct tcmu_dev *udev, unsigned long first,
 				unsigned long last)
 {
@@ -1822,6 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
 	page = xa_load(&udev->data_pages, dpi);
 	if (likely(page)) {
 		get_page(page);
+		lock_page(page);
 		mutex_unlock(&udev->cmdr_lock);
 		return page;
 	}
@@ -1863,6 +1879,7 @@ static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
 	struct page *page;
 	unsigned long offset;
 	void *addr;
+	int ret = 0;
 
 	int mi = tcmu_find_mem_index(vmf->vma);
 	if (mi < 0)
@@ -1887,10 +1904,11 @@ static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
 		page = tcmu_try_get_data_page(udev, dpi);
 		if (!page)
 			return VM_FAULT_SIGBUS;
+		ret = VM_FAULT_LOCKED;
 	}
 
 	vmf->page = page;
-	return 0;
+	return ret;
 }
 
 static const struct vm_operations_struct tcmu_vm_ops = {
@@ -3205,6 +3223,25 @@ static void find_free_blocks(void)
 			udev->dbi_max = block;
 		}
 
+		/*
+		 * While reaching here, there maybe page faults occurring on
+		 * these to be released pages, and there maybe one race that
+		 * unmap_mapping_range() is called before page fault on these
+		 * pages are finished, then valid but stale map is created.
+		 *
+		 * If another command runs in later and needs to extends
+		 * dbi_thresh, it may reuse the corresponding slot to previous
+		 * page in data_bitmap, then though we'll allocate new page for
+		 * this slot in data_area, but no page fault will happen again,
+		 * because we have a valid map, command's data will lose.
+		 *
+		 * So here we lock and unlock pages that are to be released to
+		 * ensure all page faults to be completed, then following
+		 * unmap_mapping_range() can ensure stale maps to be removed
+		 * cleanly.
+		 */
+		tcmu_wait_inflight_page_fault(udev, start, end - 1);
+
 		/* Here will truncate the data area from off */
 		off = udev->data_off + (loff_t)start * udev->data_blk_size;
 		unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
When tcmu_vma_fault() gets one page successfully, find_free_blocks()
may run in and call unmap_mapping_range() to unmap this page before
the current context completes the page fault procedure. Assume
find_free_blocks() finishes its job first and the page fault procedure
then resumes and completes: a truncated page has now been mapped to
user space. Note that tcmu_vma_fault() has taken a refcount on this
page, so no other subsystem will use it unless the user space address
is later unmapped.

If another command runs in later and needs to extend dbi_thresh, it
may reuse the corresponding slot of the previous page in data_bitmap.
We will then allocate a new page for this slot in data_area, but no
page fault will happen again because the stale map is still valid, and
the real request's data will be lost.

Filesystem implementations face the same issue, but they usually lock
the page when vm_operations_struct->fault gets one, and unlock it
after finish_fault() completes. On the truncate side, they lock pages
in truncate_inode_pages() to protect against races with page faults.
We can add similar code to fix this issue.

To fix this possible data corruption, apply the same method as the
filesystems: for pages that are to be freed, find_free_blocks() locks
and unlocks these pages, and tcmu_vma_fault() also locks the found
page under cmdr_lock. With this, for the race above, find_free_blocks()
will wait for all page faults to complete before calling
unmap_mapping_range(), which will then remove any stale mappings
cleanly.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

---
V2:
  Wait for all possible inflight page faults to complete in
  find_free_blocks() to fix a possible stale map.
---
 drivers/target/target_core_user.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)