Message ID | 20220323134940.31463-3-xiaoguang.wang@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | three bug fixes about tcmu page fault handle | expand |
On 23.03.22 14:49, Xiaoguang Wang wrote: > When tcmu_vma_fault() gets one page successfully, before the current > context completes page fault procedure, find_free_blocks() may run in > and call unmap_mapping_range() to unmap this page. Assume when > find_free_blocks() completes its job firstly, previous page fault > procedure starts to run again and completes, then one truncated page has > beed mapped to use space, but note that tcmu_vma_fault() has gotten one > refcount for this page, so any other subsystem won't use this page, > unless later the use space addr is unmapped. > > If another command runs in later and needs to extends dbi_thresh, it may > reuse the corresponding slot to previous page in data_bitmap, then thouth > we'll allocate new page for this slot in data_area, but no page fault will > happen again, because we have a valid map, real request's data will lose. I don't think, this is a safe fix. It is possible that not only find_free_blocks runs before page fault procedure completes, but also allocation for next cmd happens. In that case the new call to unmap_mapping_range would also happen before page fault completes -> data corruption. AFAIK, no one ever has seen this this bug in real life, as find_free_blocks only runs seldomly and userspace would have to access a data page the very first time while the cmd that owned this page already has been completed by userspace. Therefore I think we should apply a perfect fix only. I'm wondering whether there really is such a race. If so, couldn't the same race happen in other drivers or even when truncating mapped files? > > To fix this issue, when extending dbi_thresh, we'll need to call > unmap_mapping_range() to unmap use space data area which may exist, > which I think it's a simple method. > > Filesystem implementations will also run into this issue, but they > ususally lock page when vm_operations_struct->fault gets one page, and > unlock page after finish_fault() completes. In truncate sides, they > lock pages in truncate_inode_pages() to protect race with page fault. > We can also have similar codes like filesystem to fix this issue. > > Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> > --- > drivers/target/target_core_user.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c > index 06a5c4086551..9196188504ec 100644 > --- a/drivers/target/target_core_user.c > +++ b/drivers/target/target_core_user.c > @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, > if (space < cmd->dbi_cnt) { > unsigned long blocks_left = > (udev->max_blocks - udev->dbi_thresh) + space; > + loff_t off, len; > > if (blocks_left < cmd->dbi_cnt) { > pr_debug("no data space: only %lu available, but ask for %u\n", > @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, > return -1; > } > > + off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size; > + len = cmd->dbi_cnt * udev->data_blk_size; > + unmap_mapping_range(udev->inode->i_mapping, off, len, 1); > + > udev->dbi_thresh += cmd->dbi_cnt; > if (udev->dbi_thresh > udev->max_blocks) > udev->dbi_thresh = udev->max_blocks;
hi, > On 23.03.22 14:49, Xiaoguang Wang wrote: >> When tcmu_vma_fault() gets one page successfully, before the current >> context completes page fault procedure, find_free_blocks() may run in >> and call unmap_mapping_range() to unmap this page. Assume when >> find_free_blocks() completes its job firstly, previous page fault >> procedure starts to run again and completes, then one truncated page has >> beed mapped to use space, but note that tcmu_vma_fault() has gotten one >> refcount for this page, so any other subsystem won't use this page, >> unless later the use space addr is unmapped. >> >> If another command runs in later and needs to extends dbi_thresh, it may >> reuse the corresponding slot to previous page in data_bitmap, then thouth >> we'll allocate new page for this slot in data_area, but no page fault will >> happen again, because we have a valid map, real request's data will lose. > > I don't think, this is a safe fix. It is possible that not only > find_free_blocks runs before page fault procedure completes, but also > allocation for next cmd happens. In that case the new call to > unmap_mapping_range would also happen before page fault completes -> > data corruption. You're right, thanks for pointing this issue. For your below questions, I need to take some time to read and understand codes, I'll try to reply in tomorrow, sorry. Regards, Xiaoguang Wang > > AFAIK, no one ever has seen this this bug in real life, as > find_free_blocks only runs seldomly and userspace would have to access > a data page the very first time while the cmd that owned this page > already has been completed by userspace. Therefore I think we should > apply a perfect fix only. > > I'm wondering whether there really is such a race. If so, couldn't the > same race happen in other drivers or even when truncating mapped files? > > >> >> To fix this issue, when extending dbi_thresh, we'll need to call >> unmap_mapping_range() to unmap use space data area which may exist, >> which I think it's a simple method. >> >> Filesystem implementations will also run into this issue, but they >> ususally lock page when vm_operations_struct->fault gets one page, and >> unlock page after finish_fault() completes. In truncate sides, they >> lock pages in truncate_inode_pages() to protect race with page fault. >> We can also have similar codes like filesystem to fix this issue. >> >> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> >> --- >> drivers/target/target_core_user.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c >> index 06a5c4086551..9196188504ec 100644 >> --- a/drivers/target/target_core_user.c >> +++ b/drivers/target/target_core_user.c >> @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, >> if (space < cmd->dbi_cnt) { >> unsigned long blocks_left = >> (udev->max_blocks - udev->dbi_thresh) + space; >> + loff_t off, len; >> if (blocks_left < cmd->dbi_cnt) { >> pr_debug("no data space: only %lu available, but ask for %u\n", >> @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, >> return -1; >> } >> + off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size; >> + len = cmd->dbi_cnt * udev->data_blk_size; >> + unmap_mapping_range(udev->inode->i_mapping, off, len, 1); >> + >> udev->dbi_thresh += cmd->dbi_cnt; >> if (udev->dbi_thresh > udev->max_blocks) >> udev->dbi_thresh = udev->max_blocks;
hi, > On 23.03.22 14:49, Xiaoguang Wang wrote: >> When tcmu_vma_fault() gets one page successfully, before the current >> context completes page fault procedure, find_free_blocks() may run in >> and call unmap_mapping_range() to unmap this page. Assume when >> find_free_blocks() completes its job firstly, previous page fault >> procedure starts to run again and completes, then one truncated page has >> beed mapped to use space, but note that tcmu_vma_fault() has gotten one >> refcount for this page, so any other subsystem won't use this page, >> unless later the use space addr is unmapped. >> >> If another command runs in later and needs to extends dbi_thresh, it may >> reuse the corresponding slot to previous page in data_bitmap, then thouth >> we'll allocate new page for this slot in data_area, but no page fault will >> happen again, because we have a valid map, real request's data will lose. > > I don't think, this is a safe fix. It is possible that not only > find_free_blocks runs before page fault procedure completes, but also > allocation for next cmd happens. In that case the new call to > unmap_mapping_range would also happen before page fault completes -> > data corruption. > > AFAIK, no one ever has seen this this bug in real life, as Yeah, I know, just find this maybe an issue by reading codes :) > find_free_blocks only runs seldomly and userspace would have to access > a data page the very first time while the cmd that owned this page > already has been completed by userspace. Therefore I think we should > apply a perfect fix only. > > I'm wondering whether there really is such a race. If so, couldn't the > same race happen in other drivers or even when truncating mapped files? Indeed, I have described how filesystem implementations avoid this issue in patch's commit message: Filesystem implementations will also run into this issue, but they usually lock page when vm_operations_struct->fault gets one page, and unlock page after finish_fault() completes. In truncate sides, they lock pages in truncate_inode_pages() to protect race with page fault. We can also have similar codes like filesystem to fix this issue. Take ext4 as example, a file in ext4 is mapped to user space, if then a truncate operation occurs, ext4 calls truncate_pagecache(): void truncate_pagecache(struct inode *inode, loff_t newsize) { struct address_space *mapping = inode->i_mapping; loff_t holebegin = round_up(newsize, PAGE_SIZE); /* * unmap_mapping_range is called twice, first simply for * efficiency so that truncate_inode_pages does fewer * single-page unmaps. However after this first call, and * before truncate_inode_pages finishes, it is possible for * private pages to be COWed, which remain after * truncate_inode_pages finishes, hence the second * unmap_mapping_range call must be made for correctness. */ unmap_mapping_range(mapping, holebegin, 0, 1); truncate_inode_pages(mapping, newsize); unmap_mapping_range(mapping, holebegin, 0, 1); } In truncate_inode_pages(), it'll lock page and set page->mapping to be NULL, and in ext4's filemap_fault(), it'll lock page and check whether page->mapping has been changed, if it's true, it'll just fail the page fault procedure. For tcmu, though the data area's pages don't have a valid mapping, but we can apply similar method. In tcmu_vma_fault(), we lock the page and set VM_FAULT_LOCKED flag, in find_free_blocks(), we firstly try to lock pages which are going to be released, if lock_page() returns, we can ensure that there are not inflight running page fault procedure, and following unmap_mapping_range() will also ensure that all user maps will be cleared. Seems that it'll resolve this possible issue, please have a check, thanks. Regards, Xiaoguang Wang > > >> >> To fix this issue, when extending dbi_thresh, we'll need to call >> unmap_mapping_range() to unmap use space data area which may exist, >> which I think it's a simple method. >> >> Filesystem implementations will also run into this issue, but they >> ususally lock page when vm_operations_struct->fault gets one page, and >> unlock page after finish_fault() completes. In truncate sides, they >> lock pages in truncate_inode_pages() to protect race with page fault. >> We can also have similar codes like filesystem to fix this issue. >> >> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> >> --- >> drivers/target/target_core_user.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c >> index 06a5c4086551..9196188504ec 100644 >> --- a/drivers/target/target_core_user.c >> +++ b/drivers/target/target_core_user.c >> @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, >> if (space < cmd->dbi_cnt) { >> unsigned long blocks_left = >> (udev->max_blocks - udev->dbi_thresh) + space; >> + loff_t off, len; >> if (blocks_left < cmd->dbi_cnt) { >> pr_debug("no data space: only %lu available, but ask for %u\n", >> @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, >> return -1; >> } >> + off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size; >> + len = cmd->dbi_cnt * udev->data_blk_size; >> + unmap_mapping_range(udev->inode->i_mapping, off, len, 1); >> + >> udev->dbi_thresh += cmd->dbi_cnt; >> if (udev->dbi_thresh > udev->max_blocks) >> udev->dbi_thresh = udev->max_blocks;
On 05.04.22 18:03, Xiaoguang Wang wrote: > hi, > >> On 23.03.22 14:49, Xiaoguang Wang wrote: >>> When tcmu_vma_fault() gets one page successfully, before the current >>> context completes page fault procedure, find_free_blocks() may run in >>> and call unmap_mapping_range() to unmap this page. Assume when >>> find_free_blocks() completes its job firstly, previous page fault >>> procedure starts to run again and completes, then one truncated page has >>> beed mapped to use space, but note that tcmu_vma_fault() has gotten one >>> refcount for this page, so any other subsystem won't use this page, >>> unless later the use space addr is unmapped. >>> >>> If another command runs in later and needs to extends dbi_thresh, it may >>> reuse the corresponding slot to previous page in data_bitmap, then thouth >>> we'll allocate new page for this slot in data_area, but no page fault will >>> happen again, because we have a valid map, real request's data will lose. >> >> I don't think, this is a safe fix. It is possible that not only >> find_free_blocks runs before page fault procedure completes, but also >> allocation for next cmd happens. In that case the new call to >> unmap_mapping_range would also happen before page fault completes -> >> data corruption. >> >> AFAIK, no one ever has seen this this bug in real life, as > Yeah, I know, just find this maybe an issue by reading codes :) > >> find_free_blocks only runs seldomly and userspace would have to access >> a data page the very first time while the cmd that owned this page >> already has been completed by userspace. Therefore I think we should >> apply a perfect fix only. >> >> I'm wondering whether there really is such a race. If so, couldn't the >> same race happen in other drivers or even when truncating mapped files? > Indeed, I have described how filesystem implementations avoid this issue > in patch's commit message: > > Filesystem implementations will also run into this issue, but they > usually lock page when vm_operations_struct->fault gets one page, and > unlock page after finish_fault() completes. In truncate sides, they > lock pages in truncate_inode_pages() to protect race with page fault. > We can also have similar codes like filesystem to fix this issue. > > > Take ext4 as example, a file in ext4 is mapped to user space, if then a truncate > operation occurs, ext4 calls truncate_pagecache(): > void truncate_pagecache(struct inode *inode, loff_t newsize) > { > struct address_space *mapping = inode->i_mapping; > loff_t holebegin = round_up(newsize, PAGE_SIZE); > > /* > * unmap_mapping_range is called twice, first simply for > * efficiency so that truncate_inode_pages does fewer > * single-page unmaps. However after this first call, and > * before truncate_inode_pages finishes, it is possible for > * private pages to be COWed, which remain after > * truncate_inode_pages finishes, hence the second > * unmap_mapping_range call must be made for correctness. > */ > unmap_mapping_range(mapping, holebegin, 0, 1); > truncate_inode_pages(mapping, newsize); > unmap_mapping_range(mapping, holebegin, 0, 1); > } > > In truncate_inode_pages(), it'll lock page and set page->mapping > to be NULL, and in ext4's filemap_fault(), it'll lock page and check whether > page->mapping has been changed, if it's true, it'll just fail the page > fault procedure. > > For tcmu, though the data area's pages don't have a valid mapping, > but we can apply similar method. > In tcmu_vma_fault(), we lock the page and set VM_FAULT_LOCKED > flag, Yeah, looking into page fault handling I'm wondering why tcmu didn't do that from the beginning! > in find_free_blocks(), we firstly try to lock pages which are going > to be released, if lock_page() returns, I assume, we immediately unlock the page again, right? > we can ensure that there are > not inflight running page fault procedure, and following unmap_mapping_range() > will also ensure that all user maps will be cleared. > Seems that it'll resolve this possible issue, please have a check, thanks. AFAICS, this is the clean solution we were searching for. Thank you Bodo
hi, > On 05.04.22 18:03, Xiaoguang Wang wrote: >> hi, >> >>> On 23.03.22 14:49, Xiaoguang Wang wrote: >>>> When tcmu_vma_fault() gets one page successfully, before the current >>>> context completes page fault procedure, find_free_blocks() may run in >>>> and call unmap_mapping_range() to unmap this page. Assume when >>>> find_free_blocks() completes its job firstly, previous page fault >>>> procedure starts to run again and completes, then one truncated page has >>>> beed mapped to use space, but note that tcmu_vma_fault() has gotten one >>>> refcount for this page, so any other subsystem won't use this page, >>>> unless later the use space addr is unmapped. >>>> >>>> If another command runs in later and needs to extends dbi_thresh, it may >>>> reuse the corresponding slot to previous page in data_bitmap, then thouth >>>> we'll allocate new page for this slot in data_area, but no page fault will >>>> happen again, because we have a valid map, real request's data will lose. >>> >>> I don't think, this is a safe fix. It is possible that not only >>> find_free_blocks runs before page fault procedure completes, but also >>> allocation for next cmd happens. In that case the new call to >>> unmap_mapping_range would also happen before page fault completes -> >>> data corruption. >>> >>> AFAIK, no one ever has seen this this bug in real life, as >> Yeah, I know, just find this maybe an issue by reading codes :) >> >>> find_free_blocks only runs seldomly and userspace would have to access >>> a data page the very first time while the cmd that owned this page >>> already has been completed by userspace. Therefore I think we should >>> apply a perfect fix only. >>> >>> I'm wondering whether there really is such a race. If so, couldn't the >>> same race happen in other drivers or even when truncating mapped files? >> Indeed, I have described how filesystem implementations avoid this issue >> in patch's commit message: >> >> Filesystem implementations will also run into this issue, but they >> usually lock page when vm_operations_struct->fault gets one page, and >> unlock page after finish_fault() completes. In truncate sides, they >> lock pages in truncate_inode_pages() to protect race with page fault. >> We can also have similar codes like filesystem to fix this issue. >> >> >> Take ext4 as example, a file in ext4 is mapped to user space, if then a truncate >> operation occurs, ext4 calls truncate_pagecache(): >> void truncate_pagecache(struct inode *inode, loff_t newsize) >> { >> struct address_space *mapping = inode->i_mapping; >> loff_t holebegin = round_up(newsize, PAGE_SIZE); >> >> /* >> * unmap_mapping_range is called twice, first simply for >> * efficiency so that truncate_inode_pages does fewer >> * single-page unmaps. However after this first call, and >> * before truncate_inode_pages finishes, it is possible for >> * private pages to be COWed, which remain after >> * truncate_inode_pages finishes, hence the second >> * unmap_mapping_range call must be made for correctness. >> */ >> unmap_mapping_range(mapping, holebegin, 0, 1); >> truncate_inode_pages(mapping, newsize); >> unmap_mapping_range(mapping, holebegin, 0, 1); >> } >> >> In truncate_inode_pages(), it'll lock page and set page->mapping >> to be NULL, and in ext4's filemap_fault(), it'll lock page and check whether >> page->mapping has been changed, if it's true, it'll just fail the page >> fault procedure. >> >> For tcmu, though the data area's pages don't have a valid mapping, >> but we can apply similar method. >> In tcmu_vma_fault(), we lock the page and set VM_FAULT_LOCKED >> flag, > > Yeah, looking into page fault handling I'm wondering why tcmu didn't do > that from the beginning! > >> in find_free_blocks(), we firstly try to lock pages which are going >> to be released, if lock_page() returns, > > I assume, we immediately unlock the page again, right? Yeah, and I'll add proper code comments in new version patch set. > >> we can ensure that there are >> not inflight running page fault procedure, and following unmap_mapping_range() >> will also ensure that all user maps will be cleared. >> Seems that it'll resolve this possible issue, please have a check, thanks. > > AFAICS, this is the clean solution we were searching for. Agree, I'll prepare new patch set soon. Regards, Xiaoguang Wang > > Thank you > Bodo
diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c index 06a5c4086551..9196188504ec 100644 --- a/drivers/target/target_core_user.c +++ b/drivers/target/target_core_user.c @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, if (space < cmd->dbi_cnt) { unsigned long blocks_left = (udev->max_blocks - udev->dbi_thresh) + space; + loff_t off, len; if (blocks_left < cmd->dbi_cnt) { pr_debug("no data space: only %lu available, but ask for %u\n", @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd, return -1; } + off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size; + len = cmd->dbi_cnt * udev->data_blk_size; + unmap_mapping_range(udev->inode->i_mapping, off, len, 1); + udev->dbi_thresh += cmd->dbi_cnt; if (udev->dbi_thresh > udev->max_blocks) udev->dbi_thresh = udev->max_blocks;
When tcmu_vma_fault() gets one page successfully, before the current context completes page fault procedure, find_free_blocks() may run in and call unmap_mapping_range() to unmap this page. Assume when find_free_blocks() completes its job firstly, previous page fault procedure starts to run again and completes, then one truncated page has beed mapped to use space, but note that tcmu_vma_fault() has gotten one refcount for this page, so any other subsystem won't use this page, unless later the use space addr is unmapped. If another command runs in later and needs to extends dbi_thresh, it may reuse the corresponding slot to previous page in data_bitmap, then thouth we'll allocate new page for this slot in data_area, but no page fault will happen again, because we have a valid map, real request's data will lose. To fix this issue, when extending dbi_thresh, we'll need to call unmap_mapping_range() to unmap use space data area which may exist, which I think it's a simple method. Filesystem implementations will also run into this issue, but they ususally lock page when vm_operations_struct->fault gets one page, and unlock page after finish_fault() completes. In truncate sides, they lock pages in truncate_inode_pages() to protect race with page fault. We can also have similar codes like filesystem to fix this issue. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> --- drivers/target/target_core_user.c | 5 +++++ 1 file changed, 5 insertions(+)