Message ID | 20210212124354.1.I7084a6235fbcc522b674a6b1db64e4aff8170485@changeid (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Add generated flag to filesystem struct to block copy_file_range | expand |
On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > Filesystems such as procfs and sysfs generate their content at > runtime. This implies the file sizes do not usually match the > amount of data that can be read from the file, and that seeking > may not work as intended. > > This will be useful to disallow copy_file_range with input files > from such filesystems. > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > --- > I first thought of adding a new field to struct file_operations, > but that doesn't quite scale as every single file creation > operation would need to be modified. Even so, you missed a load of filesystems in the kernel with this patch series, what makes the ones you did mark here different from the "internal" filesystems that you did not? This feels wrong, why is userspace suddenly breaking? What changed in the kernel that caused this? Procfs has been around for a _very_ long time :) thanks, greg k-h
On Fri, Feb 12, 2021 at 6:47 AM Nicolas Boichat <drinkcat@chromium.org> wrote: > > Filesystems such as procfs and sysfs generate their content at > runtime. This implies the file sizes do not usually match the > amount of data that can be read from the file, and that seeking > may not work as intended. > > This will be useful to disallow copy_file_range with input files > from such filesystems. > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > --- > I first thought of adding a new field to struct file_operations, > but that doesn't quite scale as every single file creation > operation would need to be modified. > > include/linux/fs.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 3482146b11b0..5bd58b928e94 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2335,6 +2335,7 @@ struct file_system_type { > #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */ > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ > +#define FS_GENERATED_CONTENT 65536 /* FS contains generated content */ Can you please make the flag name a little less arbitrary. Either something that conveys the facts as they are (e.g. "zero size but readable") or anything that you think describes best the special behavior that follows from observing this flag. The alternative is for the flag name to express what you want (e.g. "don't copy file range") like FS_DISALLOW_NOTIFY_PERM. Also, I wonder. A great deal of the files you target are opened with seq_open() (I didn't audit all of them). Maybe it's worth setting an FMODE flag in seq_open() and some of it's relatives to express the quality of the file instead of flagging the filesystem? Maybe we can do both to cover more cases. Thanks, Amir.
On Fri, Feb 12, 2021 at 3:46 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > > Filesystems such as procfs and sysfs generate their content at > > runtime. This implies the file sizes do not usually match the > > amount of data that can be read from the file, and that seeking > > may not work as intended. > > > > This will be useful to disallow copy_file_range with input files > > from such filesystems. > > > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > > --- > > I first thought of adding a new field to struct file_operations, > > but that doesn't quite scale as every single file creation > > operation would need to be modified. > > Even so, you missed a load of filesystems in the kernel with this patch > series, what makes the ones you did mark here different from the > "internal" filesystems that you did not? Which ones did I miss? (apart from configfs, see the conversation on patch 6/6). Anyway, hopefully auditing all filesystems is an order of magnitude easier task, and easier to maintain in the longer run ,-) > This feels wrong, why is userspace suddenly breaking? What changed in > the kernel that caused this? Procfs has been around for a _very_ long > time :) Some of this is covered in the cover letter 0/6. To expand a bit: copy_file_range has only supported cross-filesystem copies since 5.3 [1], before that the kernel would return EXDEV and userspace application would fall back to a read/write based copy. After 5.3, copy_file_range copies no data as it thinks the input file is empty. [1] I think it makes little sense to try to use copy_file_range between 2 files in /proc, but technically, I think that has been broken since copy_file_range fallback to do_splice_direct was introduced (eac70053a141 ("vfs: Add vfs_copy_file_range() support for pagecache copies", ~4.4). > > thanks, > > greg k-h
On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > > Filesystems such as procfs and sysfs generate their content at > > runtime. This implies the file sizes do not usually match the > > amount of data that can be read from the file, and that seeking > > may not work as intended. > > > > This will be useful to disallow copy_file_range with input files > > from such filesystems. > > > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > > --- > > I first thought of adding a new field to struct file_operations, > > but that doesn't quite scale as every single file creation > > operation would need to be modified. > > Even so, you missed a load of filesystems in the kernel with this patch > series, what makes the ones you did mark here different from the > "internal" filesystems that you did not? > > This feels wrong, why is userspace suddenly breaking? What changed in > the kernel that caused this? Procfs has been around for a _very_ long > time :) That would be because of (v5.3): 5dae222a5ff0 vfs: allow copy_file_range to copy across devices The intention of this change (series) was to allow server side copy for nfs and cifs via copy_file_range(). This is mostly work by Dave Chinner that I picked up following requests from the NFS folks. But the above change also includes this generic change: - /* this could be relaxed once a method supports cross-fs copies */ - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) - return -EXDEV; - The change of behavior was documented in the commit message. It was also documented in: 88e75e2c5 copy_file_range.2: Kernel v5.3 updates I think our rationale for the generic change was: "Why not? What could go wrong? (TM)" I am not sure if any workload really gained something from this kernel cross-fs CFR. In retrospect, I think it would have been safer to allow cross-fs CFR only to the filesystems that implement ->{copy,remap}_file_range()... Our option now are: - Restore the cross-fs restriction into generic_copy_file_range() - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does Preference anyone? Thanks, Amir.
On Fri, Feb 12, 2021 at 04:20:17PM +0800, Nicolas Boichat wrote: > On Fri, Feb 12, 2021 at 3:46 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > > > Filesystems such as procfs and sysfs generate their content at > > > runtime. This implies the file sizes do not usually match the > > > amount of data that can be read from the file, and that seeking > > > may not work as intended. > > > > > > This will be useful to disallow copy_file_range with input files > > > from such filesystems. > > > > > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > > > --- > > > I first thought of adding a new field to struct file_operations, > > > but that doesn't quite scale as every single file creation > > > operation would need to be modified. > > > > Even so, you missed a load of filesystems in the kernel with this patch > > series, what makes the ones you did mark here different from the > > "internal" filesystems that you did not? > > Which ones did I miss? (apart from configfs, see the conversation on > patch 6/6). Anyway, hopefully auditing all filesystems is an order of > magnitude easier task, and easier to maintain in the longer run ,-) Are you going to be the one to add this to each new filesystem that is added to the kernel? I see filesystems in drivers/char/mem.c and drivers/infiniband/hw/qib/qib_fs.c and drivers/misc/ibmasm/ibmasmfs.c and loads of other driver-specific filesystems. Do all of those "just work" somehow? > > This feels wrong, why is userspace suddenly breaking? What changed in > > the kernel that caused this? Procfs has been around for a _very_ long > > time :) > > Some of this is covered in the cover letter 0/6. To expand a bit: > copy_file_range has only supported cross-filesystem copies since 5.3 > [1], before that the kernel would return EXDEV and userspace > application would fall back to a read/write based copy. After 5.3, > copy_file_range copies no data as it thinks the input file is empty. So older kernels work fine with this system call on procfs, but newer ones do not? If so, that's a regression that should be fixed in the original area, but should not require a new filesystem flag for every individual one out there. That way lies madness and constant auditing that I do not see anyone signing up for for the next 20 years, right? > [1] I think it makes little sense to try to use copy_file_range > between 2 files in /proc, but technically, I think that has been > broken since copy_file_range fallback to do_splice_direct was > introduced (eac70053a141 ("vfs: Add vfs_copy_file_range() support for > pagecache copies", ~4.4). Why are people trying to use copy_file_range on simple /proc and /sys files in the first place? They can not seek (well most can not), so that feels like a "oh look, a new syscall, let's use it everywhere!" problem that userspace should not do. thanks, greg k-h
On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: > On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > > > Filesystems such as procfs and sysfs generate their content at > > > runtime. This implies the file sizes do not usually match the > > > amount of data that can be read from the file, and that seeking > > > may not work as intended. > > > > > > This will be useful to disallow copy_file_range with input files > > > from such filesystems. > > > > > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > > > --- > > > I first thought of adding a new field to struct file_operations, > > > but that doesn't quite scale as every single file creation > > > operation would need to be modified. > > > > Even so, you missed a load of filesystems in the kernel with this patch > > series, what makes the ones you did mark here different from the > > "internal" filesystems that you did not? > > > > This feels wrong, why is userspace suddenly breaking? What changed in > > the kernel that caused this? Procfs has been around for a _very_ long > > time :) > > That would be because of (v5.3): > > 5dae222a5ff0 vfs: allow copy_file_range to copy across devices > > The intention of this change (series) was to allow server side copy > for nfs and cifs via copy_file_range(). > This is mostly work by Dave Chinner that I picked up following requests > from the NFS folks. > > But the above change also includes this generic change: > > - /* this could be relaxed once a method supports cross-fs copies */ > - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > - return -EXDEV; > - > > The change of behavior was documented in the commit message. > It was also documented in: > > 88e75e2c5 copy_file_range.2: Kernel v5.3 updates > > I think our rationale for the generic change was: > "Why not? What could go wrong? (TM)" > I am not sure if any workload really gained something from this > kernel cross-fs CFR. Why not put that check back? > In retrospect, I think it would have been safer to allow cross-fs CFR > only to the filesystems that implement ->{copy,remap}_file_range()... Why not make this change? That seems easier and should fix this for everyone, right? > Our option now are: > - Restore the cross-fs restriction into generic_copy_file_range() Yes. > - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does No. That way lies constant auditing and someone being "vigilant" for the next 30+ years. Which will not happen. thanks, greg k-h
Greg KH <gregkh@linuxfoundation.org> writes: > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: >> > >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: >> > > Filesystems such as procfs and sysfs generate their content at >> > > runtime. This implies the file sizes do not usually match the >> > > amount of data that can be read from the file, and that seeking >> > > may not work as intended. >> > > >> > > This will be useful to disallow copy_file_range with input files >> > > from such filesystems. >> > > >> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> >> > > --- >> > > I first thought of adding a new field to struct file_operations, >> > > but that doesn't quite scale as every single file creation >> > > operation would need to be modified. >> > >> > Even so, you missed a load of filesystems in the kernel with this patch >> > series, what makes the ones you did mark here different from the >> > "internal" filesystems that you did not? >> > >> > This feels wrong, why is userspace suddenly breaking? What changed in >> > the kernel that caused this? Procfs has been around for a _very_ long >> > time :) >> >> That would be because of (v5.3): >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices >> >> The intention of this change (series) was to allow server side copy >> for nfs and cifs via copy_file_range(). >> This is mostly work by Dave Chinner that I picked up following requests >> from the NFS folks. >> >> But the above change also includes this generic change: >> >> - /* this could be relaxed once a method supports cross-fs copies */ >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> - return -EXDEV; >> - >> >> The change of behavior was documented in the commit message. >> It was also documented in: >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates >> >> I think our rationale for the generic change was: >> "Why not? What could go wrong? (TM)" >> I am not sure if any workload really gained something from this >> kernel cross-fs CFR. > > Why not put that check back? > >> In retrospect, I think it would have been safer to allow cross-fs CFR >> only to the filesystems that implement ->{copy,remap}_file_range()... > > Why not make this change? That seems easier and should fix this for > everyone, right? > >> Our option now are: >> - Restore the cross-fs restriction into generic_copy_file_range() > > Yes. > Restoring this restriction will actually change the current cephfs CFR behaviour. Since that commit we have allowed doing remote copies between different filesystems within the same ceph cluster. See commit 6fd4e6348352 ("ceph: allow object copies across different filesystems in the same cluster"). Although I'm not aware of any current users for this scenario, the performance impact can actually be huge as it's the difference between asking the OSDs for copying a file and doing a full read+write on the client side. Cheers,
On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote: > Greg KH <gregkh@linuxfoundation.org> writes: > > > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: > >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > >> > > >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > >> > > Filesystems such as procfs and sysfs generate their content at > >> > > runtime. This implies the file sizes do not usually match the > >> > > amount of data that can be read from the file, and that seeking > >> > > may not work as intended. > >> > > > >> > > This will be useful to disallow copy_file_range with input files > >> > > from such filesystems. > >> > > > >> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > >> > > --- > >> > > I first thought of adding a new field to struct file_operations, > >> > > but that doesn't quite scale as every single file creation > >> > > operation would need to be modified. > >> > > >> > Even so, you missed a load of filesystems in the kernel with this patch > >> > series, what makes the ones you did mark here different from the > >> > "internal" filesystems that you did not? > >> > > >> > This feels wrong, why is userspace suddenly breaking? What changed in > >> > the kernel that caused this? Procfs has been around for a _very_ long > >> > time :) > >> > >> That would be because of (v5.3): > >> > >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices > >> > >> The intention of this change (series) was to allow server side copy > >> for nfs and cifs via copy_file_range(). > >> This is mostly work by Dave Chinner that I picked up following requests > >> from the NFS folks. > >> > >> But the above change also includes this generic change: > >> > >> - /* this could be relaxed once a method supports cross-fs copies */ > >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > >> - return -EXDEV; > >> - > >> > >> The change of behavior was documented in the commit message. > >> It was also documented in: > >> > >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates > >> > >> I think our rationale for the generic change was: > >> "Why not? What could go wrong? (TM)" > >> I am not sure if any workload really gained something from this > >> kernel cross-fs CFR. > > > > Why not put that check back? > > > >> In retrospect, I think it would have been safer to allow cross-fs CFR > >> only to the filesystems that implement ->{copy,remap}_file_range()... > > > > Why not make this change? That seems easier and should fix this for > > everyone, right? > > > >> Our option now are: > >> - Restore the cross-fs restriction into generic_copy_file_range() > > > > Yes. > > > > Restoring this restriction will actually change the current cephfs CFR > behaviour. Since that commit we have allowed doing remote copies between > different filesystems within the same ceph cluster. See commit > 6fd4e6348352 ("ceph: allow object copies across different filesystems in > the same cluster"). > > Although I'm not aware of any current users for this scenario, the > performance impact can actually be huge as it's the difference between > asking the OSDs for copying a file and doing a full read+write on the > client side. Regression in performance is ok if it fixes a regression for things that used to work just fine in the past :) First rule, make it work. thanks, greg k-h
Greg KH <gregkh@linuxfoundation.org> writes: > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote: >> Greg KH <gregkh@linuxfoundation.org> writes: >> >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: >> >> > >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: >> >> > > Filesystems such as procfs and sysfs generate their content at >> >> > > runtime. This implies the file sizes do not usually match the >> >> > > amount of data that can be read from the file, and that seeking >> >> > > may not work as intended. >> >> > > >> >> > > This will be useful to disallow copy_file_range with input files >> >> > > from such filesystems. >> >> > > >> >> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> >> >> > > --- >> >> > > I first thought of adding a new field to struct file_operations, >> >> > > but that doesn't quite scale as every single file creation >> >> > > operation would need to be modified. >> >> > >> >> > Even so, you missed a load of filesystems in the kernel with this patch >> >> > series, what makes the ones you did mark here different from the >> >> > "internal" filesystems that you did not? >> >> > >> >> > This feels wrong, why is userspace suddenly breaking? What changed in >> >> > the kernel that caused this? Procfs has been around for a _very_ long >> >> > time :) >> >> >> >> That would be because of (v5.3): >> >> >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices >> >> >> >> The intention of this change (series) was to allow server side copy >> >> for nfs and cifs via copy_file_range(). >> >> This is mostly work by Dave Chinner that I picked up following requests >> >> from the NFS folks. >> >> >> >> But the above change also includes this generic change: >> >> >> >> - /* this could be relaxed once a method supports cross-fs copies */ >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> >> - return -EXDEV; >> >> - >> >> >> >> The change of behavior was documented in the commit message. >> >> It was also documented in: >> >> >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates >> >> >> >> I think our rationale for the generic change was: >> >> "Why not? What could go wrong? (TM)" >> >> I am not sure if any workload really gained something from this >> >> kernel cross-fs CFR. >> > >> > Why not put that check back? >> > >> >> In retrospect, I think it would have been safer to allow cross-fs CFR >> >> only to the filesystems that implement ->{copy,remap}_file_range()... >> > >> > Why not make this change? That seems easier and should fix this for >> > everyone, right? >> > >> >> Our option now are: >> >> - Restore the cross-fs restriction into generic_copy_file_range() >> > >> > Yes. >> > >> >> Restoring this restriction will actually change the current cephfs CFR >> behaviour. Since that commit we have allowed doing remote copies between >> different filesystems within the same ceph cluster. See commit >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in >> the same cluster"). >> >> Although I'm not aware of any current users for this scenario, the >> performance impact can actually be huge as it's the difference between >> asking the OSDs for copying a file and doing a full read+write on the >> client side. > > Regression in performance is ok if it fixes a regression for things that > used to work just fine in the past :) > > First rule, make it work. Sure, I just wanted to point out that *maybe* there are other options than simply reverting that commit :-) Something like the patch below (completely untested!) should revert to the old behaviour in filesystems that don't implement the CFR syscall. Cheers,
On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote: > Greg KH <gregkh@linuxfoundation.org> writes: > > > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote: > >> Greg KH <gregkh@linuxfoundation.org> writes: > >> > >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: > >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > >> >> > > >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > >> >> > > Filesystems such as procfs and sysfs generate their content at > >> >> > > runtime. This implies the file sizes do not usually match the > >> >> > > amount of data that can be read from the file, and that seeking > >> >> > > may not work as intended. > >> >> > > > >> >> > > This will be useful to disallow copy_file_range with input files > >> >> > > from such filesystems. > >> >> > > > >> >> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > >> >> > > --- > >> >> > > I first thought of adding a new field to struct file_operations, > >> >> > > but that doesn't quite scale as every single file creation > >> >> > > operation would need to be modified. > >> >> > > >> >> > Even so, you missed a load of filesystems in the kernel with this patch > >> >> > series, what makes the ones you did mark here different from the > >> >> > "internal" filesystems that you did not? > >> >> > > >> >> > This feels wrong, why is userspace suddenly breaking? What changed in > >> >> > the kernel that caused this? Procfs has been around for a _very_ long > >> >> > time :) > >> >> > >> >> That would be because of (v5.3): > >> >> > >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices > >> >> > >> >> The intention of this change (series) was to allow server side copy > >> >> for nfs and cifs via copy_file_range(). > >> >> This is mostly work by Dave Chinner that I picked up following requests > >> >> from the NFS folks. > >> >> > >> >> But the above change also includes this generic change: > >> >> > >> >> - /* this could be relaxed once a method supports cross-fs copies */ > >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > >> >> - return -EXDEV; > >> >> - > >> >> > >> >> The change of behavior was documented in the commit message. > >> >> It was also documented in: > >> >> > >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates > >> >> > >> >> I think our rationale for the generic change was: > >> >> "Why not? What could go wrong? (TM)" > >> >> I am not sure if any workload really gained something from this > >> >> kernel cross-fs CFR. > >> > > >> > Why not put that check back? > >> > > >> >> In retrospect, I think it would have been safer to allow cross-fs CFR > >> >> only to the filesystems that implement ->{copy,remap}_file_range()... > >> > > >> > Why not make this change? That seems easier and should fix this for > >> > everyone, right? > >> > > >> >> Our option now are: > >> >> - Restore the cross-fs restriction into generic_copy_file_range() > >> > > >> > Yes. > >> > > >> > >> Restoring this restriction will actually change the current cephfs CFR > >> behaviour. Since that commit we have allowed doing remote copies between > >> different filesystems within the same ceph cluster. See commit > >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in > >> the same cluster"). > >> > >> Although I'm not aware of any current users for this scenario, the > >> performance impact can actually be huge as it's the difference between > >> asking the OSDs for copying a file and doing a full read+write on the > >> client side. > > > > Regression in performance is ok if it fixes a regression for things that > > used to work just fine in the past :) > > > > First rule, make it work. > > Sure, I just wanted to point out that *maybe* there are other options than > simply reverting that commit :-) > > Something like the patch below (completely untested!) should revert to the > old behaviour in filesystems that don't implement the CFR syscall. > > Cheers, > -- > Luis > > diff --git a/fs/read_write.c b/fs/read_write.c > index 75f764b43418..bf5dccc43cc9 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, > file_out, pos_out, > len, flags); > > - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > - flags); > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > + return -EXDEV; > + else > + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > + flags); > } > > /* That would make much more sense to me.
Greg KH <gregkh@linuxfoundation.org> writes: > On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote: >> Greg KH <gregkh@linuxfoundation.org> writes: ... >> >> >> Our option now are: >> >> >> - Restore the cross-fs restriction into generic_copy_file_range() >> >> > >> >> > Yes. >> >> > >> >> >> >> Restoring this restriction will actually change the current cephfs CFR >> >> behaviour. Since that commit we have allowed doing remote copies between >> >> different filesystems within the same ceph cluster. See commit >> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in >> >> the same cluster"). >> >> >> >> Although I'm not aware of any current users for this scenario, the >> >> performance impact can actually be huge as it's the difference between >> >> asking the OSDs for copying a file and doing a full read+write on the >> >> client side. >> > >> > Regression in performance is ok if it fixes a regression for things that >> > used to work just fine in the past :) >> > >> > First rule, make it work. >> >> Sure, I just wanted to point out that *maybe* there are other options than >> simply reverting that commit :-) >> >> Something like the patch below (completely untested!) should revert to the >> old behaviour in filesystems that don't implement the CFR syscall. >> >> Cheers, >> -- >> Luis >> >> diff --git a/fs/read_write.c b/fs/read_write.c >> index 75f764b43418..bf5dccc43cc9 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, >> file_out, pos_out, >> len, flags); >> >> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> - flags); >> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> + return -EXDEV; >> + else >> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> + flags); >> } >> >> /* > > That would make much more sense to me. Great. I can send a proper patch with changelog, if this is the really what we want. But I would rather hear from others first. I guess that at least the NFS devs have something to say here. Cheers,
On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > Why are people trying to use copy_file_range on simple /proc and /sys > files in the first place? They can not seek (well most can not), so > that feels like a "oh look, a new syscall, let's use it everywhere!" > problem that userspace should not do. This may have been covered elsewhere, but it's not that people are saying "let's use copy_file_range on files in /proc." It's that the Go language standard library provides an interface to operating system files. When Go code uses the standard library function io.Copy to copy the contents of one open file to another open file, then on Linux kernels 5.3 and greater the Go standard library will use the copy_file_range system call. That seems to be exactly what copy_file_range is intended for. Unfortunately it appears that when people writing Go code open a file in /proc and use io.Copy the contents to another open file, copy_file_range does nothing and reports success. There isn't anything on the copy_file_range man page explaining this limitation, and there isn't any documented way to know that the Go standard library should not use copy_file_range on certain files. So ideally the kernel will report EOPNOTSUPP or EINVAL when using copy_file_range on a file in /proc or some other file system that fails (and, minor side note, the copy_file_range man page should document that it can return EOPNOTSUPP or EINVAL in some cases, which does already happen on at least some kernel versions using at least some file systems). Or, less ideally, there will be some documented way that the Go standard library can determine that copy_file_range will fail before trying to use it. If neither of those can be done, then I think the only option is for the Go standard library to never use copy_file_range, as even though it will almost always work and work well, in some unpredictable number of cases it will fail silently. Thanks. Ian
On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > files in the first place? They can not seek (well most can not), so > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > problem that userspace should not do. > > This may have been covered elsewhere, but it's not that people are > saying "let's use copy_file_range on files in /proc." It's that the > Go language standard library provides an interface to operating system > files. When Go code uses the standard library function io.Copy to > copy the contents of one open file to another open file, then on Linux > kernels 5.3 and greater the Go standard library will use the > copy_file_range system call. That seems to be exactly what > copy_file_range is intended for. Unfortunately it appears that when > people writing Go code open a file in /proc and use io.Copy the > contents to another open file, copy_file_range does nothing and > reports success. There isn't anything on the copy_file_range man page > explaining this limitation, and there isn't any documented way to know > that the Go standard library should not use copy_file_range on certain > files. But, is this a bug in the kernel in that the syscall being made is not working properly, or a bug in that Go decided to do this for all types of files not knowing that some types of files can not handle this? If the kernel has always worked this way, I would say that Go is doing the wrong thing here. If the kernel used to work properly, and then changed, then it's a regression on the kernel side. So which is it? > So ideally the kernel will report EOPNOTSUPP or EINVAL when using > copy_file_range on a file in /proc or some other file system that > fails (and, minor side note, the copy_file_range man page should > document that it can return EOPNOTSUPP or EINVAL in some cases, which > does already happen on at least some kernel versions using at least > some file systems). Documentation is good, but what the kernel does is the true "definition" of what is going right or wrong here. thanks, greg k-h
On Fri, Feb 12, 2021 at 7:45 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > files in the first place? They can not seek (well most can not), so > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > problem that userspace should not do. > > > > This may have been covered elsewhere, but it's not that people are > > saying "let's use copy_file_range on files in /proc." It's that the > > Go language standard library provides an interface to operating system > > files. When Go code uses the standard library function io.Copy to > > copy the contents of one open file to another open file, then on Linux > > kernels 5.3 and greater the Go standard library will use the > > copy_file_range system call. That seems to be exactly what > > copy_file_range is intended for. Unfortunately it appears that when > > people writing Go code open a file in /proc and use io.Copy the > > contents to another open file, copy_file_range does nothing and > > reports success. There isn't anything on the copy_file_range man page > > explaining this limitation, and there isn't any documented way to know > > that the Go standard library should not use copy_file_range on certain > > files. > > But, is this a bug in the kernel in that the syscall being made is not > working properly, or a bug in that Go decided to do this for all types > of files not knowing that some types of files can not handle this? > > If the kernel has always worked this way, I would say that Go is doing > the wrong thing here. If the kernel used to work properly, and then > changed, then it's a regression on the kernel side. > > So which is it? I don't work on the kernel, so I can't tell you which it is. You will have to decide. From my perspective, as a kernel user rather than a kernel developer, a system call that silently fails for certain files and that provides no way to determine either 1) ahead of time that the system call will fail, or 2) after the call that the system call did fail, is a useless system call. I can never use that system call, because I don't know whether or not it will work. So as a kernel user I would say that you should fix the system call to report failure, or document some way to know whether the system call will fail, or you should remove the system call. But I'm not a kernel developer, I don't have all the information, and it's obviously your call. I'll note that to the best of my knowledge this failure started happening with the 5.3 kernel, as before 5.3 the problematic calls would report a failure (EXDEV). Since 5.3 isn't all that old I personally wouldn't say that the kernel "has always worked this way." But I may be mistaken about this. > > So ideally the kernel will report EOPNOTSUPP or EINVAL when using > > copy_file_range on a file in /proc or some other file system that > > fails (and, minor side note, the copy_file_range man page should > > document that it can return EOPNOTSUPP or EINVAL in some cases, which > > does already happen on at least some kernel versions using at least > > some file systems). > > Documentation is good, but what the kernel does is the true "definition" > of what is going right or wrong here. Sure. The documentation comment was just a side note. I hope that we can all agree that accurate man pages are better than inaccurate ones. Ian
On Fri, Feb 12, 2021 at 07:59:04AM -0800, Ian Lance Taylor wrote: > On Fri, Feb 12, 2021 at 7:45 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > files in the first place? They can not seek (well most can not), so > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > problem that userspace should not do. > > > > > > This may have been covered elsewhere, but it's not that people are > > > saying "let's use copy_file_range on files in /proc." It's that the > > > Go language standard library provides an interface to operating system > > > files. When Go code uses the standard library function io.Copy to > > > copy the contents of one open file to another open file, then on Linux > > > kernels 5.3 and greater the Go standard library will use the > > > copy_file_range system call. That seems to be exactly what > > > copy_file_range is intended for. Unfortunately it appears that when > > > people writing Go code open a file in /proc and use io.Copy the > > > contents to another open file, copy_file_range does nothing and > > > reports success. There isn't anything on the copy_file_range man page > > > explaining this limitation, and there isn't any documented way to know > > > that the Go standard library should not use copy_file_range on certain > > > files. > > > > But, is this a bug in the kernel in that the syscall being made is not > > working properly, or a bug in that Go decided to do this for all types > > of files not knowing that some types of files can not handle this? > > > > If the kernel has always worked this way, I would say that Go is doing > > the wrong thing here. If the kernel used to work properly, and then > > changed, then it's a regression on the kernel side. > > > > So which is it? > > I don't work on the kernel, so I can't tell you which it is. You will > have to decide. As you have the userspace code, it should be easier for you to test this on an older kernel. I don't have your userspace code... > From my perspective, as a kernel user rather than a kernel developer, > a system call that silently fails for certain files and that provides > no way to determine either 1) ahead of time that the system call will > fail, or 2) after the call that the system call did fail, is a useless > system call. Great, then don't use copy_file_range() yet as it seems like it fits that category at the moment :) > I can never use that system call, because I don't know > whether or not it will work. So as a kernel user I would say that you > should fix the system call to report failure, or document some way to > know whether the system call will fail, or you should remove the > system call. But I'm not a kernel developer, I don't have all the > information, and it's obviously your call. That's up to the authors of that syscall, I don't know what they intended this to be used for. It feels like you are using this in a very generic "all file i/o works for this" way, while that is not what the authors of the syscall intended it for, as is evident by the failures that older kernels would report for you. > I'll note that to the best of my knowledge this failure started > happening with the 5.3 kernel, as before 5.3 the problematic calls > would report a failure (EXDEV). Since 5.3 isn't all that old I > personally wouldn't say that the kernel "has always worked this way." > But I may be mistaken about this. Testing would be good, as regressions are more serious than "it would be nice if it would work like this instead!" type of issues. thanks, greg k-h
On Fri, Feb 12, 2021 at 8:28 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Fri, Feb 12, 2021 at 07:59:04AM -0800, Ian Lance Taylor wrote: > > On Fri, Feb 12, 2021 at 7:45 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > > files in the first place? They can not seek (well most can not), so > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > > problem that userspace should not do. > > > > > > > > This may have been covered elsewhere, but it's not that people are > > > > saying "let's use copy_file_range on files in /proc." It's that the > > > > Go language standard library provides an interface to operating system > > > > files. When Go code uses the standard library function io.Copy to > > > > copy the contents of one open file to another open file, then on Linux > > > > kernels 5.3 and greater the Go standard library will use the > > > > copy_file_range system call. That seems to be exactly what > > > > copy_file_range is intended for. Unfortunately it appears that when > > > > people writing Go code open a file in /proc and use io.Copy the > > > > contents to another open file, copy_file_range does nothing and > > > > reports success. There isn't anything on the copy_file_range man page > > > > explaining this limitation, and there isn't any documented way to know > > > > that the Go standard library should not use copy_file_range on certain > > > > files. > > > > > > But, is this a bug in the kernel in that the syscall being made is not > > > working properly, or a bug in that Go decided to do this for all types > > > of files not knowing that some types of files can not handle this? > > > > > > If the kernel has always worked this way, I would say that Go is doing > > > the wrong thing here. If the kernel used to work properly, and then > > > changed, then it's a regression on the kernel side. > > > > > > So which is it? > > > > I don't work on the kernel, so I can't tell you which it is. You will > > have to decide. > > As you have the userspace code, it should be easier for you to test this > on an older kernel. I don't have your userspace code... Sorry, I'm not sure what you are asking. I've attached a sample Go program. On kernel version 2.6.32 this program exits 0. On kernel version 5.7.17 it prints got "" want "./foo\x00" and exits with status 1. This program hardcodes the string "/proc/self/cmdline" for convenience, but of course the same results would happen if this were a generic copy program that somebody invoked with /proc/self/cmdline as a command line option. I could write the same program in C easily enough, by explicitly calling copy_file_range. Would it help to see a sample C program? > > From my perspective, as a kernel user rather than a kernel developer, > > a system call that silently fails for certain files and that provides > > no way to determine either 1) ahead of time that the system call will > > fail, or 2) after the call that the system call did fail, is a useless > > system call. > > Great, then don't use copy_file_range() yet as it seems like it fits > that category at the moment :) That seems like an unfortunate result, but if that is the determining opinion then I guess that is what we will have to do in the Go standard library. Ian
On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote: > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > files in the first place? They can not seek (well most can not), so > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > problem that userspace should not do. > > > > This may have been covered elsewhere, but it's not that people are > > saying "let's use copy_file_range on files in /proc." It's that the > > Go language standard library provides an interface to operating system > > files. When Go code uses the standard library function io.Copy to > > copy the contents of one open file to another open file, then on Linux > > kernels 5.3 and greater the Go standard library will use the > > copy_file_range system call. That seems to be exactly what > > copy_file_range is intended for. Unfortunately it appears that when > > people writing Go code open a file in /proc and use io.Copy the > > contents to another open file, copy_file_range does nothing and > > reports success. There isn't anything on the copy_file_range man page > > explaining this limitation, and there isn't any documented way to know > > that the Go standard library should not use copy_file_range on certain > > files. > > But, is this a bug in the kernel in that the syscall being made is not > working properly, or a bug in that Go decided to do this for all types > of files not knowing that some types of files can not handle this? > > If the kernel has always worked this way, I would say that Go is doing > the wrong thing here. If the kernel used to work properly, and then > changed, then it's a regression on the kernel side. > > So which is it? Both Al Viro and myself have said "copy file range is not a generic method for copying data between two file descriptors". It is a targetted solution for *regular files only* on filesystems that store persistent data and can accelerate the data copy in some way (e.g. clone, server side offload, hardware offlead, etc). It is not intended as a copy mechanism for copying data from one random file descriptor to another. The use of it as a general file copy mechanism in the Go system library is incorrect and wrong. It is a userspace bug. Userspace has done the wrong thing, userspace needs to be fixed. -Dave.
On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote: > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > files in the first place? They can not seek (well most can not), so > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > problem that userspace should not do. > > > > > > This may have been covered elsewhere, but it's not that people are > > > saying "let's use copy_file_range on files in /proc." It's that the > > > Go language standard library provides an interface to operating system > > > files. When Go code uses the standard library function io.Copy to > > > copy the contents of one open file to another open file, then on Linux > > > kernels 5.3 and greater the Go standard library will use the > > > copy_file_range system call. That seems to be exactly what > > > copy_file_range is intended for. Unfortunately it appears that when > > > people writing Go code open a file in /proc and use io.Copy the > > > contents to another open file, copy_file_range does nothing and > > > reports success. There isn't anything on the copy_file_range man page > > > explaining this limitation, and there isn't any documented way to know > > > that the Go standard library should not use copy_file_range on certain > > > files. > > > > But, is this a bug in the kernel in that the syscall being made is not > > working properly, or a bug in that Go decided to do this for all types > > of files not knowing that some types of files can not handle this? > > > > If the kernel has always worked this way, I would say that Go is doing > > the wrong thing here. If the kernel used to work properly, and then > > changed, then it's a regression on the kernel side. > > > > So which is it? > > Both Al Viro and myself have said "copy file range is not a generic > method for copying data between two file descriptors". It is a > targetted solution for *regular files only* on filesystems that store > persistent data and can accelerate the data copy in some way (e.g. > clone, server side offload, hardware offlead, etc). It is not > intended as a copy mechanism for copying data from one random file > descriptor to another. > > The use of it as a general file copy mechanism in the Go system > library is incorrect and wrong. It is a userspace bug. Userspace > has done the wrong thing, userspace needs to be fixed. OK, we'll take it out. I'll just make one last plea that I think that copy_file_range could be much more useful if there were some way that a program could know whether it would work or not. It's pretty unfortunate that we can't use it in the Go standard library, or, indeed, in any general purpose code, in any language, that is intended to support arbitrary file names. To be pedantically clear, I'm not saying that copy_file_range should work on all file systems. I'm only saying that on file systems for which it doesn't work it should fail rather than silently returning success without doing anything. Ian
On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: > On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > > > Filesystems such as procfs and sysfs generate their content at > > > runtime. This implies the file sizes do not usually match the > > > amount of data that can be read from the file, and that seeking > > > may not work as intended. > > > > > > This will be useful to disallow copy_file_range with input files > > > from such filesystems. > > > > > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > > > --- > > > I first thought of adding a new field to struct file_operations, > > > but that doesn't quite scale as every single file creation > > > operation would need to be modified. > > > > Even so, you missed a load of filesystems in the kernel with this patch > > series, what makes the ones you did mark here different from the > > "internal" filesystems that you did not? > > > > This feels wrong, why is userspace suddenly breaking? What changed in > > the kernel that caused this? Procfs has been around for a _very_ long > > time :) > > That would be because of (v5.3): > > 5dae222a5ff0 vfs: allow copy_file_range to copy across devices > > The intention of this change (series) was to allow server side copy > for nfs and cifs via copy_file_range(). > This is mostly work by Dave Chinner that I picked up following requests > from the NFS folks. > > But the above change also includes this generic change: > > - /* this could be relaxed once a method supports cross-fs copies */ > - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > - return -EXDEV; > - This isn't the problem. The problem is that proc sets the file size to zero length, so if you attempt to CFR from one proc file to another it will still report "zero bytes copied" because the source file is zero length. The other problem is that if the write fails, the generated data from the /proc file gets thrown away - the splice code treats write failures like a short read and hence the data sent to the failed write is consumed and lost. This has nothing to do with cross-fs cfr support - it's just one mechanism that can be used to expose the problems that using CFR on pipes that masquerade as regular files causes. Userspace can't even tell that CFR failed incorrectly, because the files that it returns immediate EOF on are zero length. Nor can userspace know taht a short read tossed away data, because it might actually be a short read rather than a write failure... > Our option now are: > - Restore the cross-fs restriction into generic_copy_file_range() > - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does - Stop trying using cfr for things it was never intended for. Anyone using cfr has to be prepared for it to fail and do the copy manually themselves. If you can't tell from userspace if a file has data in it without actually calling read(), then you can't use copy_file_range() on it... Cheers, Dave.
On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote: > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote: > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > > files in the first place? They can not seek (well most can not), so > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > > problem that userspace should not do. > > > > > > > > This may have been covered elsewhere, but it's not that people are > > > > saying "let's use copy_file_range on files in /proc." It's that the > > > > Go language standard library provides an interface to operating system > > > > files. When Go code uses the standard library function io.Copy to > > > > copy the contents of one open file to another open file, then on Linux > > > > kernels 5.3 and greater the Go standard library will use the > > > > copy_file_range system call. That seems to be exactly what > > > > copy_file_range is intended for. Unfortunately it appears that when > > > > people writing Go code open a file in /proc and use io.Copy the > > > > contents to another open file, copy_file_range does nothing and > > > > reports success. There isn't anything on the copy_file_range man page > > > > explaining this limitation, and there isn't any documented way to know > > > > that the Go standard library should not use copy_file_range on certain > > > > files. > > > > > > But, is this a bug in the kernel in that the syscall being made is not > > > working properly, or a bug in that Go decided to do this for all types > > > of files not knowing that some types of files can not handle this? > > > > > > If the kernel has always worked this way, I would say that Go is doing > > > the wrong thing here. If the kernel used to work properly, and then > > > changed, then it's a regression on the kernel side. > > > > > > So which is it? > > > > Both Al Viro and myself have said "copy file range is not a generic > > method for copying data between two file descriptors". It is a > > targetted solution for *regular files only* on filesystems that store > > persistent data and can accelerate the data copy in some way (e.g. > > clone, server side offload, hardware offlead, etc). It is not > > intended as a copy mechanism for copying data from one random file > > descriptor to another. > > > > The use of it as a general file copy mechanism in the Go system > > library is incorrect and wrong. It is a userspace bug. Userspace > > has done the wrong thing, userspace needs to be fixed. > > OK, we'll take it out. > > I'll just make one last plea that I think that copy_file_range could > be much more useful if there were some way that a program could know > whether it would work or not. If you can't tell from userspace that a file has data in it other than by calling read() on it, then you can't use cfr on it. Cheers, Dave.
On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote: > > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote: > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > > > files in the first place? They can not seek (well most can not), so > > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > > > problem that userspace should not do. > > > > > > > > > > This may have been covered elsewhere, but it's not that people are > > > > > saying "let's use copy_file_range on files in /proc." It's that the > > > > > Go language standard library provides an interface to operating system > > > > > files. When Go code uses the standard library function io.Copy to > > > > > copy the contents of one open file to another open file, then on Linux > > > > > kernels 5.3 and greater the Go standard library will use the > > > > > copy_file_range system call. That seems to be exactly what > > > > > copy_file_range is intended for. Unfortunately it appears that when > > > > > people writing Go code open a file in /proc and use io.Copy the > > > > > contents to another open file, copy_file_range does nothing and > > > > > reports success. There isn't anything on the copy_file_range man page > > > > > explaining this limitation, and there isn't any documented way to know > > > > > that the Go standard library should not use copy_file_range on certain > > > > > files. > > > > > > > > But, is this a bug in the kernel in that the syscall being made is not > > > > working properly, or a bug in that Go decided to do this for all types > > > > of files not knowing that some types of files can not handle this? > > > > > > > > If the kernel has always worked this way, I would say that Go is doing > > > > the wrong thing here. If the kernel used to work properly, and then > > > > changed, then it's a regression on the kernel side. > > > > > > > > So which is it? > > > > > > Both Al Viro and myself have said "copy file range is not a generic > > > method for copying data between two file descriptors". It is a > > > targetted solution for *regular files only* on filesystems that store > > > persistent data and can accelerate the data copy in some way (e.g. > > > clone, server side offload, hardware offlead, etc). It is not > > > intended as a copy mechanism for copying data from one random file > > > descriptor to another. > > > > > > The use of it as a general file copy mechanism in the Go system > > > library is incorrect and wrong. It is a userspace bug. Userspace > > > has done the wrong thing, userspace needs to be fixed. > > > > OK, we'll take it out. > > > > I'll just make one last plea that I think that copy_file_range could > > be much more useful if there were some way that a program could know > > whether it would work or not. Well... we could always implement a CFR_DRYRUN flag that would run through all the parameter validation and return 0 just before actually starting any real copying logic. But that wouldn't itself solve the problem that there are very old virtual filesystems in Linux that have zero-length regular files that behave like a pipe. > If you can't tell from userspace that a file has data in it other > than by calling read() on it, then you can't use cfr on it. I don't know how to do that, Dave. :) Frankly I'm with the Go developers on this -- one should detect c_f_r by calling it and if it errors out then fall back to the usual userspace buffer copy strategy. That still means we need to fix the kernel WRT these weird old filesystems. One of... 1. Get rid of the generic fallback completely, since splice only copies 64k at a time and ... yay? I guess it at least passes generic/521 and generic/522 these days. 2. Keep it, but change c_f_r to require that both files have a ->copy_file_range implementation. If they're the same then we'll call the function pointer, if not, we call the generic fallback. This at least gets us back to the usual behavior which is that filesystems have to opt in to new functionality (== we assume they QA'd all the wunnerful combinations). 3. #2, but fix the generic fallback to not suck so badly. That sounds like someone (else's) 2yr project. :P --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote: > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > > On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote: > > > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote: > > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote: > > > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys > > > > > > > files in the first place? They can not seek (well most can not), so > > > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!" > > > > > > > problem that userspace should not do. > > > > > > > > > > > > This may have been covered elsewhere, but it's not that people are > > > > > > saying "let's use copy_file_range on files in /proc." It's that the > > > > > > Go language standard library provides an interface to operating system > > > > > > files. When Go code uses the standard library function io.Copy to > > > > > > copy the contents of one open file to another open file, then on Linux > > > > > > kernels 5.3 and greater the Go standard library will use the > > > > > > copy_file_range system call. That seems to be exactly what > > > > > > copy_file_range is intended for. Unfortunately it appears that when > > > > > > people writing Go code open a file in /proc and use io.Copy the > > > > > > contents to another open file, copy_file_range does nothing and > > > > > > reports success. There isn't anything on the copy_file_range man page > > > > > > explaining this limitation, and there isn't any documented way to know > > > > > > that the Go standard library should not use copy_file_range on certain > > > > > > files. > > > > > > > > > > But, is this a bug in the kernel in that the syscall being made is not > > > > > working properly, or a bug in that Go decided to do this for all types > > > > > of files not knowing that some types of files can not handle this? > > > > > > > > > > If the kernel has always worked this way, I would say that Go is doing > > > > > the wrong thing here. If the kernel used to work properly, and then > > > > > changed, then it's a regression on the kernel side. > > > > > > > > > > So which is it? > > > > > > > > Both Al Viro and myself have said "copy file range is not a generic > > > > method for copying data between two file descriptors". It is a > > > > targetted solution for *regular files only* on filesystems that store > > > > persistent data and can accelerate the data copy in some way (e.g. > > > > clone, server side offload, hardware offlead, etc). It is not > > > > intended as a copy mechanism for copying data from one random file > > > > descriptor to another. > > > > > > > > The use of it as a general file copy mechanism in the Go system > > > > library is incorrect and wrong. It is a userspace bug. Userspace > > > > has done the wrong thing, userspace needs to be fixed. > > > > > > OK, we'll take it out. > > > > > > I'll just make one last plea that I think that copy_file_range could > > > be much more useful if there were some way that a program could know > > > whether it would work or not. > > Well... we could always implement a CFR_DRYRUN flag that would run > through all the parameter validation and return 0 just before actually > starting any real copying logic. But that wouldn't itself solve the > problem that there are very old virtual filesystems in Linux that have > zero-length regular files that behave like a pipe. > > > If you can't tell from userspace that a file has data in it other > > than by calling read() on it, then you can't use cfr on it. > > I don't know how to do that, Dave. :) If stat returns a non-zero size, then userspace knows it has at least that much data in it, whether it be zeros or previously written data. cfr will copy that data. The special zero length regular pipe files fail this simple "how much data is there to copy in this file" check... > Frankly I'm with the Go developers on this -- one should detect c_f_r by > calling it and if it errors out then fall back to the usual userspace > buffer copy strategy. > > That still means we need to fix the kernel WRT these weird old > filesystems. One of... And that is the whole problem here, not that cfr is failing. cfr is behaving correctly and consistently as the filesystem is telling the kernel there is no data in the file (i.e. size = 0). > 1. Get rid of the generic fallback completely, since splice only copies > 64k at a time and ... yay? I guess it at least passes generic/521 and > generic/522 these days. I've had a few people ask me for cfr to not fall back to a manual copy because they only want it to do something if it can accelerate the copy to be faster than userspace can copy the data itself. If the filesystem can't optimise the copy in some way, they want to know so they can do something else of their own chosing. Hence this seems like the sane option to take here... > 2. Keep it, but change c_f_r to require that both files have a > ->copy_file_range implementation. If they're the same then we'll call > the function pointer, if not, we call the generic fallback. This at > least gets us back to the usual behavior which is that filesystems have > to opt in to new functionality (== we assume they QA'd all the wunnerful > combinations). That doesn't address the "write failure turns into short read" problem with the splice path... > 3. #2, but fix the generic fallback to not suck so badly. That sounds > like someone (else's) 2yr project. :P Not mine, either. Cheers, Dave.
On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote: > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > > > > > If you can't tell from userspace that a file has data in it other > > > than by calling read() on it, then you can't use cfr on it. > > > > I don't know how to do that, Dave. :) > > If stat returns a non-zero size, then userspace knows it has at > least that much data in it, whether it be zeros or previously > written data. cfr will copy that data. The special zero length > regular pipe files fail this simple "how much data is there to copy > in this file" check... This suggests that if the Go standard library sees that copy_file_range reads zero bytes, we should assume that it failed, and should use the fallback path as though copy_file_range returned EOPNOTSUPP or EINVAL. This will cause an extra system call for an empty file, but as long as copy_file_range does not return an error for cases that it does not support we are going to need an extra system call anyhow. Does that seem like a possible mitigation? That is, are there cases where copy_file_range will fail to do a correct copy, and will return success, and will not return zero? Ian
On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <iant@golang.org> wrote: > > On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote: > > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > > > > > > > If you can't tell from userspace that a file has data in it other > > > > than by calling read() on it, then you can't use cfr on it. > > > > > > I don't know how to do that, Dave. :) > > > > If stat returns a non-zero size, then userspace knows it has at > > least that much data in it, whether it be zeros or previously > > written data. cfr will copy that data. The special zero length > > regular pipe files fail this simple "how much data is there to copy > > in this file" check... > > This suggests that if the Go standard library sees that > copy_file_range reads zero bytes, we should assume that it failed, and > should use the fallback path as though copy_file_range returned > EOPNOTSUPP or EINVAL. This will cause an extra system call for an > empty file, but as long as copy_file_range does not return an error > for cases that it does not support we are going to need an extra > system call anyhow. > > Does that seem like a possible mitigation? That is, are there cases > where copy_file_range will fail to do a correct copy, and will return > success, and will not return zero? I'm a bit worried about the sysfs files that report a 4096 bytes file size, for 2 reasons: - I'm not sure if the content _can_ actually be longer (I couldn't find a counterexample) - If you're unlucky enough to have a partial write in the output filesystem, you'll get a non-zero return value and probably corrupted content. > > Ian
On Mon, Feb 15, 2021 at 3:27 AM Nicolas Boichat <drinkcat@chromium.org> wrote: > > On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <iant@golang.org> wrote: > > > > On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote: > > > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > > > > > > > > > If you can't tell from userspace that a file has data in it other > > > > > than by calling read() on it, then you can't use cfr on it. > > > > > > > > I don't know how to do that, Dave. :) > > > > > > If stat returns a non-zero size, then userspace knows it has at > > > least that much data in it, whether it be zeros or previously > > > written data. cfr will copy that data. The special zero length > > > regular pipe files fail this simple "how much data is there to copy > > > in this file" check... > > > > This suggests that if the Go standard library sees that > > copy_file_range reads zero bytes, we should assume that it failed, and > > should use the fallback path as though copy_file_range returned > > EOPNOTSUPP or EINVAL. This will cause an extra system call for an > > empty file, but as long as copy_file_range does not return an error > > for cases that it does not support we are going to need an extra > > system call anyhow. > > > > Does that seem like a possible mitigation? That is, are there cases > > where copy_file_range will fail to do a correct copy, and will return > > success, and will not return zero? > > I'm a bit worried about the sysfs files that report a 4096 bytes file > size, for 2 reasons: > - I'm not sure if the content _can_ actually be longer (I couldn't > find a counterexample) > - If you're unlucky enough to have a partial write in the output > filesystem, you'll get a non-zero return value and probably corrupted > content. > First of all, I am in favor of fixing the regression in the kernel caused by the change of behavior in v5.3. Having said that, I don't think it is a good idea to use ANY tool to copy a zero size pseudo file. rsync doesn't copy any data either if you try to use it to copy tracing into a temp file. I think it is perfectly sane for any tool to check file size before trying to read/copy. w.r.t. short write/short read, did you consider using the off_in/off_out arguments? I *think* current kernel CFR should be safe to use as long as the tool: 1. Checks file size before copy 2. Does not try to copy a range beyond EOF 3. Pass off_in/off_out args and verify that off_in/off_out advances as expected Thanks, Amir.
On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <lhenriques@suse.de> wrote: > > Greg KH <gregkh@linuxfoundation.org> writes: > > > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote: > >> Greg KH <gregkh@linuxfoundation.org> writes: > >> > >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote: > >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <gregkh@linuxfoundation.org> wrote: > >> >> > > >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote: > >> >> > > Filesystems such as procfs and sysfs generate their content at > >> >> > > runtime. This implies the file sizes do not usually match the > >> >> > > amount of data that can be read from the file, and that seeking > >> >> > > may not work as intended. > >> >> > > > >> >> > > This will be useful to disallow copy_file_range with input files > >> >> > > from such filesystems. > >> >> > > > >> >> > > Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> > >> >> > > --- > >> >> > > I first thought of adding a new field to struct file_operations, > >> >> > > but that doesn't quite scale as every single file creation > >> >> > > operation would need to be modified. > >> >> > > >> >> > Even so, you missed a load of filesystems in the kernel with this patch > >> >> > series, what makes the ones you did mark here different from the > >> >> > "internal" filesystems that you did not? > >> >> > > >> >> > This feels wrong, why is userspace suddenly breaking? What changed in > >> >> > the kernel that caused this? Procfs has been around for a _very_ long > >> >> > time :) > >> >> > >> >> That would be because of (v5.3): > >> >> > >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices > >> >> > >> >> The intention of this change (series) was to allow server side copy > >> >> for nfs and cifs via copy_file_range(). > >> >> This is mostly work by Dave Chinner that I picked up following requests > >> >> from the NFS folks. > >> >> > >> >> But the above change also includes this generic change: > >> >> > >> >> - /* this could be relaxed once a method supports cross-fs copies */ > >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > >> >> - return -EXDEV; > >> >> - > >> >> > >> >> The change of behavior was documented in the commit message. > >> >> It was also documented in: > >> >> > >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates > >> >> > >> >> I think our rationale for the generic change was: > >> >> "Why not? What could go wrong? (TM)" > >> >> I am not sure if any workload really gained something from this > >> >> kernel cross-fs CFR. > >> > > >> > Why not put that check back? > >> > > >> >> In retrospect, I think it would have been safer to allow cross-fs CFR > >> >> only to the filesystems that implement ->{copy,remap}_file_range()... > >> > > >> > Why not make this change? That seems easier and should fix this for > >> > everyone, right? > >> > > >> >> Our option now are: > >> >> - Restore the cross-fs restriction into generic_copy_file_range() > >> > > >> > Yes. > >> > > >> > >> Restoring this restriction will actually change the current cephfs CFR > >> behaviour. Since that commit we have allowed doing remote copies between > >> different filesystems within the same ceph cluster. See commit > >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in > >> the same cluster"). > >> > >> Although I'm not aware of any current users for this scenario, the > >> performance impact can actually be huge as it's the difference between > >> asking the OSDs for copying a file and doing a full read+write on the > >> client side. > > > > Regression in performance is ok if it fixes a regression for things that > > used to work just fine in the past :) > > > > First rule, make it work. > > Sure, I just wanted to point out that *maybe* there are other options than > simply reverting that commit :-) > > Something like the patch below (completely untested!) should revert to the > old behaviour in filesystems that don't implement the CFR syscall. > > Cheers, > -- > Luis > > diff --git a/fs/read_write.c b/fs/read_write.c > index 75f764b43418..bf5dccc43cc9 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, > file_out, pos_out, > len, flags); > > - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > - flags); > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > + return -EXDEV; > + else > + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > + flags); > } > Which kernel is this patch based on? At this point, I am with Dave and Darrick on not falling back to generic_copy_file_range() at all. We do not have proof of any workload that benefits from it and the above patch does not protect from a wierd use case of trying to copy a file from sysfs to sysfs. I am indecisive about what should be done with generic_copy_file_range() called as fallback from within filesystems. I think the wise choice is to not do the fallback in any case, but this is up to the specific filesystem maintainers to decide. Thanks, Amir.
On Mon, Feb 15, 2021 at 09:25:36AM +0800, Nicolas Boichat wrote: > On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <iant@golang.org> wrote: > > > > On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote: > > > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote: > > > > > > > > > If you can't tell from userspace that a file has data in it other > > > > > than by calling read() on it, then you can't use cfr on it. > > > > > > > > I don't know how to do that, Dave. :) > > > > > > If stat returns a non-zero size, then userspace knows it has at > > > least that much data in it, whether it be zeros or previously > > > written data. cfr will copy that data. The special zero length > > > regular pipe files fail this simple "how much data is there to copy > > > in this file" check... > > > > This suggests that if the Go standard library sees that > > copy_file_range reads zero bytes, we should assume that it failed, and > > should use the fallback path as though copy_file_range returned > > EOPNOTSUPP or EINVAL. This will cause an extra system call for an > > empty file, but as long as copy_file_range does not return an error > > for cases that it does not support we are going to need an extra > > system call anyhow. > > > > Does that seem like a possible mitigation? That is, are there cases > > where copy_file_range will fail to do a correct copy, and will return > > success, and will not return zero? > > I'm a bit worried about the sysfs files that report a 4096 bytes file > size, for 2 reasons: > - I'm not sure if the content _can_ actually be longer (I couldn't > find a counterexample) There are quite a few, look for binary sysfs files that are pipes from hardware/firmware resources to userspace. /sys/firmware/efi/efivars/ has a number of them if you want to play around with it. thanks, greg k-h
Amir Goldstein <amir73il@gmail.com> writes: > On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <lhenriques@suse.de> wrote: ... >> Sure, I just wanted to point out that *maybe* there are other options than >> simply reverting that commit :-) >> >> Something like the patch below (completely untested!) should revert to the >> old behaviour in filesystems that don't implement the CFR syscall. >> >> Cheers, >> -- >> Luis >> >> diff --git a/fs/read_write.c b/fs/read_write.c >> index 75f764b43418..bf5dccc43cc9 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, >> file_out, pos_out, >> len, flags); >> >> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> - flags); >> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> + return -EXDEV; >> + else >> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> + flags); >> } >> > > Which kernel is this patch based on? It was v5.11-rc7. > At this point, I am with Dave and Darrick on not falling back to > generic_copy_file_range() at all. > > We do not have proof of any workload that benefits from it and the > above patch does not protect from a wierd use case of trying to copy a file > from sysfs to sysfs. > Ok, cool. I can post a new patch doing just that. I guess that function do_copy_file_range() can be dropped in that case. > I am indecisive about what should be done with generic_copy_file_range() > called as fallback from within filesystems. > > I think the wise choice is to not do the fallback in any case, but this is up > to the specific filesystem maintainers to decide. I see what you mean. You're suggesting to have userspace handle all the -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also removes all the calls to generic_copy_file_range() function? And that function can also be deleted too, of course. Cheers,
Luis Henriques <lhenriques@suse.de> writes: > Amir Goldstein <amir73il@gmail.com> writes: > >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <lhenriques@suse.de> wrote: > ... >>> Sure, I just wanted to point out that *maybe* there are other options than >>> simply reverting that commit :-) >>> >>> Something like the patch below (completely untested!) should revert to the >>> old behaviour in filesystems that don't implement the CFR syscall. >>> >>> Cheers, >>> -- >>> Luis >>> >>> diff --git a/fs/read_write.c b/fs/read_write.c >>> index 75f764b43418..bf5dccc43cc9 100644 >>> --- a/fs/read_write.c >>> +++ b/fs/read_write.c >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, >>> file_out, pos_out, >>> len, flags); >>> >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >>> - flags); >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >>> + return -EXDEV; >>> + else >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >>> + flags); >>> } >>> >> >> Which kernel is this patch based on? > > It was v5.11-rc7. > >> At this point, I am with Dave and Darrick on not falling back to >> generic_copy_file_range() at all. >> >> We do not have proof of any workload that benefits from it and the >> above patch does not protect from a wierd use case of trying to copy a file >> from sysfs to sysfs. >> > > Ok, cool. I can post a new patch doing just that. I guess that function > do_copy_file_range() can be dropped in that case. > >> I am indecisive about what should be done with generic_copy_file_range() >> called as fallback from within filesystems. >> >> I think the wise choice is to not do the fallback in any case, but this is up >> to the specific filesystem maintainers to decide. > > I see what you mean. You're suggesting to have userspace handle all the > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also > removes all the calls to generic_copy_file_range() function? And that > function can also be deleted too, of course. Here's a first stab at this patch. Hopefully I didn't forgot anything here. Let me know if you prefer the more conservative approach, i.e. not touching any of the filesystems and let them use generic_copy_file_range. Once everyone agrees on the final solution, I can follow-up with the manpages update. Cheers,
On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques <lhenriques@suse.de> wrote: > > Luis Henriques <lhenriques@suse.de> writes: > > > Amir Goldstein <amir73il@gmail.com> writes: > > > >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <lhenriques@suse.de> wrote: > > ... > >>> Sure, I just wanted to point out that *maybe* there are other options than > >>> simply reverting that commit :-) > >>> > >>> Something like the patch below (completely untested!) should revert to the > >>> old behaviour in filesystems that don't implement the CFR syscall. > >>> > >>> Cheers, > >>> -- > >>> Luis > >>> > >>> diff --git a/fs/read_write.c b/fs/read_write.c > >>> index 75f764b43418..bf5dccc43cc9 100644 > >>> --- a/fs/read_write.c > >>> +++ b/fs/read_write.c > >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, > >>> file_out, pos_out, > >>> len, flags); > >>> > >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > >>> - flags); > >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) > >>> + return -EXDEV; > >>> + else > >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > >>> + flags); > >>> } > >>> > >> > >> Which kernel is this patch based on? > > > > It was v5.11-rc7. > > > >> At this point, I am with Dave and Darrick on not falling back to > >> generic_copy_file_range() at all. > >> > >> We do not have proof of any workload that benefits from it and the > >> above patch does not protect from a wierd use case of trying to copy a file > >> from sysfs to sysfs. > >> > > > > Ok, cool. I can post a new patch doing just that. I guess that function > > do_copy_file_range() can be dropped in that case. > > > >> I am indecisive about what should be done with generic_copy_file_range() > >> called as fallback from within filesystems. > >> > >> I think the wise choice is to not do the fallback in any case, but this is up > >> to the specific filesystem maintainers to decide. > > > > I see what you mean. You're suggesting to have userspace handle all the > > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also > > removes all the calls to generic_copy_file_range() function? And that > > function can also be deleted too, of course. > > Here's a first stab at this patch. Hopefully I didn't forgot anything > here. Let me know if you prefer the more conservative approach, i.e. not > touching any of the filesystems and let them use generic_copy_file_range. > I'm fine with this one (modulu my comment below). CC'ing fuse/cifs/nfs maintainers. Though I don't think the FS maintainers should mind removing the fallback. It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()") > Once everyone agrees on the final solution, I can follow-up with the > manpages update. > > Cheers, > -- > Luis > > From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001 > From: Luis Henriques <lhenriques@suse.de> > Date: Fri, 12 Feb 2021 18:03:23 +0000 > Subject: [PATCH] vfs: prevent copy_file_range to copy across devices > > Nicolas Boichat reported an issue when trying to use the copy_file_range > syscall on a tracefs file. It failed silently because the file content is > generated on-the-fly (reporting a size of zero) and copy_file_range needs > to know in advance how much data is present. > > This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to > copy across devices"). Now the copy is done only if the filesystems for source > and destination files are the same and they implement this syscall. This paragraph is not true and confusing. This is not a revert and it does not revert the important part of cross-fs copy for which the original commit was for. Either rephrase this paragraph or remove it. > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") Please add a Link: to this discussion in lore. > Cc: Nicolas Boichat <drinkcat@chromium.org> > Signed-off-by: Luis Henriques <lhenriques@suse.de> > --- > fs/ceph/file.c | 21 +++------------ > fs/cifs/cifsfs.c | 3 --- > fs/fuse/file.c | 21 +++------------ > fs/nfs/nfs4file.c | 20 +++----------- > fs/read_write.c | 65 ++++++++-------------------------------------- > include/linux/fs.h | 3 --- > 6 files changed, 20 insertions(+), 113 deletions(-) > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > index 209535d5b8d3..639bd7bfaea9 100644 > --- a/fs/ceph/file.c > +++ b/fs/ceph/file.c > @@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off > return bytes; > } > > -static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > - struct file *dst_file, loff_t dst_off, > - size_t len, unsigned int flags) > +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, > + struct file *dst_file, loff_t dst_off, > + size_t len, unsigned int flags) > { > struct inode *src_inode = file_inode(src_file); > struct inode *dst_inode = file_inode(dst_file); > @@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > return ret; > } > > -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off, > - struct file *dst_file, loff_t dst_off, > - size_t len, unsigned int flags) > -{ > - ssize_t ret; > - > - ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off, > - len, flags); > - > - if (ret == -EOPNOTSUPP || ret == -EXDEV) > - ret = generic_copy_file_range(src_file, src_off, dst_file, > - dst_off, len, flags); > - return ret; > -} > - > const struct file_operations ceph_file_fops = { > .open = ceph_open, > .release = ceph_release, > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c > index e46da536ed33..8b869cc67443 100644 > --- a/fs/cifs/cifsfs.c > +++ b/fs/cifs/cifsfs.c > @@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off, > len, flags); > free_xid(xid); > > - if (rc == -EOPNOTSUPP || rc == -EXDEV) > - rc = generic_copy_file_range(src_file, off, dst_file, > - destoff, len, flags); > return rc; > } > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index 8cccecb55fb8..0dd703278e49 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, > return err; > } > > -static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, > - struct file *file_out, loff_t pos_out, > - size_t len, unsigned int flags) > +static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in, > + struct file *file_out, loff_t pos_out, > + size_t len, unsigned int flags) > { > struct fuse_file *ff_in = file_in->private_data; > struct fuse_file *ff_out = file_out->private_data; > @@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, > return err; > } > > -static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off, > - struct file *dst_file, loff_t dst_off, > - size_t len, unsigned int flags) > -{ > - ssize_t ret; > - > - ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off, > - len, flags); > - > - if (ret == -EOPNOTSUPP || ret == -EXDEV) > - ret = generic_copy_file_range(src_file, src_off, dst_file, > - dst_off, len, flags); > - return ret; > -} > - > static const struct file_operations fuse_file_operations = { > .llseek = fuse_file_llseek, > .read_iter = fuse_file_read_iter, > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c > index 57b3821d975a..60998209e310 100644 > --- a/fs/nfs/nfs4file.c > +++ b/fs/nfs/nfs4file.c > @@ -133,9 +133,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id) > } > > #ifdef CONFIG_NFS_V4_2 > -static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in, > - struct file *file_out, loff_t pos_out, > - size_t count, unsigned int flags) > +static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in, > + struct file *file_out, loff_t pos_out, > + size_t count, unsigned int flags) > { > struct nfs42_copy_notify_res *cn_resp = NULL; > struct nl4_server *nss = NULL; > @@ -189,20 +189,6 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in, > return ret; > } > > -static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in, > - struct file *file_out, loff_t pos_out, > - size_t count, unsigned int flags) > -{ > - ssize_t ret; > - > - ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count, > - flags); > - if (ret == -EOPNOTSUPP || ret == -EXDEV) > - ret = generic_copy_file_range(file_in, pos_in, file_out, > - pos_out, count, flags); > - return ret; > -} > - > static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence) > { > loff_t ret; > diff --git a/fs/read_write.c b/fs/read_write.c > index 75f764b43418..87bf9efd7f71 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1358,58 +1358,6 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, > } > #endif > > -/** > - * generic_copy_file_range - copy data between two files > - * @file_in: file structure to read from > - * @pos_in: file offset to read from > - * @file_out: file structure to write data to > - * @pos_out: file offset to write data to > - * @len: amount of data to copy > - * @flags: copy flags > - * > - * This is a generic filesystem helper to copy data from one file to another. > - * It has no constraints on the source or destination file owners - the files > - * can belong to different superblocks and different filesystem types. Short > - * copies are allowed. > - * > - * This should be called from the @file_out filesystem, as per the > - * ->copy_file_range() method. > - * > - * Returns the number of bytes copied or a negative error indicating the > - * failure. > - */ > - > -ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in, > - struct file *file_out, loff_t pos_out, > - size_t len, unsigned int flags) > -{ > - return do_splice_direct(file_in, &pos_in, file_out, &pos_out, > - len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0); > -} > -EXPORT_SYMBOL(generic_copy_file_range); > - > -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, > - struct file *file_out, loff_t pos_out, > - size_t len, unsigned int flags) > -{ > - /* > - * Although we now allow filesystems to handle cross sb copy, passing > - * a file of the wrong filesystem type to filesystem driver can result > - * in an attempt to dereference the wrong type of ->private_data, so > - * avoid doing that until we really have a good reason. NFS defines > - * several different file_system_type structures, but they all end up > - * using the same ->copy_file_range() function pointer. > - */ > - if (file_out->f_op->copy_file_range && > - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) > - return file_out->f_op->copy_file_range(file_in, pos_in, > - file_out, pos_out, > - len, flags); > - > - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, > - flags); Please do not remove the big comment above - it is there for a reason. The case of !file_out->f_op->copy_file_range should return -EOPNOTSUPP because the outcome of -EXDEV for copy to/from the same fs is confusing. Either leave this helper and only remove generic_copy_file_range() or open code the same logic (including comment) in the caller. > -} > - > /* > * Performs necessary checks before doing a file copy > * > @@ -1474,6 +1422,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > { > ssize_t ret; > > + /* > + * Allow the copy only if the filesystems for file_in and file_out are > + * the same, and copy_file_range is implemented. > + */ This comment is wrong. See the big fat comment above to understand why. Also this check is in the wrong place because it misses the case of file_in->f_op->remap_file_range && !file_out->f_op->copy_file_range As I wrote above either leave the helper do_copy_file_range() or open code it in its current location. > + if (!file_out->f_op->copy_file_range || > + (file_out->f_op->copy_file_range != file_in->f_op->copy_file_range)) > + return -EXDEV; > + > if (flags != 0) > return -EINVAL; > > @@ -1513,8 +1469,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > } > } > > - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len, > - flags); > + ret = file_out->f_op->copy_file_range(file_in, pos_in, > + file_out, pos_out, > + len, flags); > WARN_ON_ONCE(ret == -EOPNOTSUPP); Please remove this line. filesystem ops can now return -EOPNOTSUPP. Thanks, Amir.
Amir Goldstein <amir73il@gmail.com> writes: > On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques <lhenriques@suse.de> wrote: >> >> Luis Henriques <lhenriques@suse.de> writes: >> >> > Amir Goldstein <amir73il@gmail.com> writes: >> > >> >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <lhenriques@suse.de> wrote: >> > ... >> >>> Sure, I just wanted to point out that *maybe* there are other options than >> >>> simply reverting that commit :-) >> >>> >> >>> Something like the patch below (completely untested!) should revert to the >> >>> old behaviour in filesystems that don't implement the CFR syscall. >> >>> >> >>> Cheers, >> >>> -- >> >>> Luis >> >>> >> >>> diff --git a/fs/read_write.c b/fs/read_write.c >> >>> index 75f764b43418..bf5dccc43cc9 100644 >> >>> --- a/fs/read_write.c >> >>> +++ b/fs/read_write.c >> >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in, >> >>> file_out, pos_out, >> >>> len, flags); >> >>> >> >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> >>> - flags); >> >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb) >> >>> + return -EXDEV; >> >>> + else >> >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len, >> >>> + flags); >> >>> } >> >>> >> >> >> >> Which kernel is this patch based on? >> > >> > It was v5.11-rc7. >> > >> >> At this point, I am with Dave and Darrick on not falling back to >> >> generic_copy_file_range() at all. >> >> >> >> We do not have proof of any workload that benefits from it and the >> >> above patch does not protect from a wierd use case of trying to copy a file >> >> from sysfs to sysfs. >> >> >> > >> > Ok, cool. I can post a new patch doing just that. I guess that function >> > do_copy_file_range() can be dropped in that case. >> > >> >> I am indecisive about what should be done with generic_copy_file_range() >> >> called as fallback from within filesystems. >> >> >> >> I think the wise choice is to not do the fallback in any case, but this is up >> >> to the specific filesystem maintainers to decide. >> > >> > I see what you mean. You're suggesting to have userspace handle all the >> > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also >> > removes all the calls to generic_copy_file_range() function? And that >> > function can also be deleted too, of course. >> >> Here's a first stab at this patch. Hopefully I didn't forgot anything >> here. Let me know if you prefer the more conservative approach, i.e. not >> touching any of the filesystems and let them use generic_copy_file_range. >> > > I'm fine with this one (modulu my comment below). > CC'ing fuse/cifs/nfs maintainers. > Though I don't think the FS maintainers should mind removing the fallback. > It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()") Thanks for your review, Amir. I'll be posting v2 shortly. Cheers,
diff --git a/include/linux/fs.h b/include/linux/fs.h index 3482146b11b0..5bd58b928e94 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2335,6 +2335,7 @@ struct file_system_type { #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ +#define FS_GENERATED_CONTENT 65536 /* FS contains generated content */ int (*init_fs_context)(struct fs_context *); const struct fs_parameter_spec *parameters; struct dentry *(*mount) (struct file_system_type *, int,
Filesystems such as procfs and sysfs generate their content at runtime. This implies the file sizes do not usually match the amount of data that can be read from the file, and that seeking may not work as intended. This will be useful to disallow copy_file_range with input files from such filesystems. Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> --- I first thought of adding a new field to struct file_operations, but that doesn't quite scale as every single file creation operation would need to be modified. include/linux/fs.h | 1 + 1 file changed, 1 insertion(+)