Message ID | 20180119005741.32058-1-rgoldwyn@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > In case direct I/O encounters an error midway, it returns the error. > Instead it should be returning the number of bytes transferred so far. > > Test case for filesystems (with ENOSPC): > 1. Create an almost full filesystem > 2. Create a file, say /mnt/lastfile, until the filesystem is full. > 3. Direct write() with count > sizeof /mnt/lastfile. > > Result: write() returns -ENOSPC. However, file content has data written > in step 3. > > This fixes fstest generic/472. OK... I can live with that. What about the XFS side? It should be a prereq, to avoid bisection hazard; I can throw both into vfs.git, if XFS folks are OK with that. Objections?
On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote: > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote: > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > In case direct I/O encounters an error midway, it returns the error. > > Instead it should be returning the number of bytes transferred so far. > > > > Test case for filesystems (with ENOSPC): > > 1. Create an almost full filesystem > > 2. Create a file, say /mnt/lastfile, until the filesystem is full. > > 3. Direct write() with count > sizeof /mnt/lastfile. > > > > Result: write() returns -ENOSPC. However, file content has data written > > in step 3. > > > > This fixes fstest generic/472. > > OK... I can live with that. What about the XFS side? It should be > a prereq, to avoid bisection hazard; I can throw both into vfs.git, > if XFS folks are OK with that. Objections? Going through the VFS tree seesm the best approach to me - it's a trivial change. I'm sure Darrick will shout if it's going to be a problem, though. Cheers, Dave.
On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote: > On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote: > > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote: > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > > > In case direct I/O encounters an error midway, it returns the error. > > > Instead it should be returning the number of bytes transferred so far. > > > > > > Test case for filesystems (with ENOSPC): > > > 1. Create an almost full filesystem > > > 2. Create a file, say /mnt/lastfile, until the filesystem is full. > > > 3. Direct write() with count > sizeof /mnt/lastfile. > > > > > > Result: write() returns -ENOSPC. However, file content has data written > > > in step 3. > > > > > > This fixes fstest generic/472. > > > > OK... I can live with that. What about the XFS side? It should be > > a prereq, to avoid bisection hazard; I can throw both into vfs.git, > > if XFS folks are OK with that. Objections? > > Going through the VFS tree seesm the best approach to me - it's a > trivial change. I'm sure Darrick will shout if it's going to be a > problem, though. vfs.git is fine, though the second patch to remove the xfs assert should go first, as Al points out. For both patches, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Jan 18, 2018 at 10:31:18PM -0800, Darrick J. Wong wrote: > On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote: > > On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote: > > > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote: > > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > > > > > > > In case direct I/O encounters an error midway, it returns the error. > > > > Instead it should be returning the number of bytes transferred so far. > > > > > > > > Test case for filesystems (with ENOSPC): > > > > 1. Create an almost full filesystem > > > > 2. Create a file, say /mnt/lastfile, until the filesystem is full. > > > > 3. Direct write() with count > sizeof /mnt/lastfile. > > > > > > > > Result: write() returns -ENOSPC. However, file content has data written > > > > in step 3. > > > > > > > > This fixes fstest generic/472. > > > > > > OK... I can live with that. What about the XFS side? It should be > > > a prereq, to avoid bisection hazard; I can throw both into vfs.git, > > > if XFS folks are OK with that. Objections? > > > > Going through the VFS tree seesm the best approach to me - it's a > > trivial change. I'm sure Darrick will shout if it's going to be a > > problem, though. > > vfs.git is fine, though the second patch to remove the xfs assert should > go first, as Al points out. > > For both patches, > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Applied; will be in -next tomorrow morning after the tree I'm putting together gets through local beating.
On Fri, Jan 19, 2018 at 06:33:56AM +0000, Al Viro wrote: > > For both patches, > > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> > > Applied; will be in -next tomorrow morning after the tree I'm putting together > gets through local beating. FWIW, it consistently blows generic/250 and generic/252.
Goldwyn Rodrigues <rgoldwyn@suse.de> writes: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > In case direct I/O encounters an error midway, it returns the error. > Instead it should be returning the number of bytes transferred so far. It's likely there's a lot of code in user space that does if (write(..., N) < 0) handle error With your change it would need to be if (write(..., N) != N) handle error How much code is actually doing that? I can understand it fixes your artifical test suite, but it seems to me your change has a high potential to break a lot of existing user code in subtle ways. So it seems to be a bad idea. -Andi
On 01/20/2018 08:11 PM, Andi Kleen wrote: > Goldwyn Rodrigues <rgoldwyn@suse.de> writes: > >> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >> >> In case direct I/O encounters an error midway, it returns the error. >> Instead it should be returning the number of bytes transferred so far. > > It's likely there's a lot of code in user space that does > > if (write(..., N) < 0) handle error > > With your change it would need to be > > if (write(..., N) != N) handle error > > How much code is actually doing that? > > I can understand it fixes your artifical test suite, but it seems to me your > change has a high potential to break a lot of existing user code > in subtle ways. So it seems to be a bad idea. > > -Andi > Quoting 'man 2 write': RETURN VALUE On success, the number of bytes written is returned (zero indicates nothing was written). It is not an error if this number is smaller than the number of bytes requested; this may happen for example because the disk device was filled. See also NOTES.
On 01/20/2018 01:47 PM, Al Viro wrote: > On Fri, Jan 19, 2018 at 06:33:56AM +0000, Al Viro wrote: >>> For both patches, >>> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> >> >> Applied; will be in -next tomorrow morning after the tree I'm putting together >> gets through local beating. > > FWIW, it consistently blows generic/250 and generic/252. > Thanks. I will look into it. It is performing direct writes with dm_error. Hopefully it should be just updating the md5sum in the output files.
On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: > > > On 01/20/2018 08:11 PM, Andi Kleen wrote: >> Goldwyn Rodrigues <rgoldwyn@suse.de> writes: >> >>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >>> >>> In case direct I/O encounters an error midway, it returns the error. >>> Instead it should be returning the number of bytes transferred so far. >> >> It's likely there's a lot of code in user space that does >> >> if (write(..., N) < 0) handle error >> >> With your change it would need to be >> >> if (write(..., N) != N) handle error >> >> How much code is actually doing that? >> >> I can understand it fixes your artifical test suite, but it seems to me your >> change has a high potential to break a lot of existing user code >> in subtle ways. So it seems to be a bad idea. >> >> -Andi >> > > > Quoting 'man 2 write': > > RETURN VALUE > On success, the number of bytes written is returned (zero indicates > nothing was written). It is not an error if this number is smaller > than the number of bytes requested; this may happen for example because > the disk device was filled. See also NOTES. You can quote as much man page as you want - Andi is well aware of how read/write system call works, as I'm sure all of us are, that is not the issue. The issue is that there are potentially LOTS of applications out there that do not check for short writes, they do exactly what Andi speculated above. If you break it with this change, it doesn't matter what's in the man page. What matters is previous behavior, and that you are breaking user space. At that point nobody cares what's in the man page.
On 01/20/2018 09:07 PM, Jens Axboe wrote: > On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: >> >> >> On 01/20/2018 08:11 PM, Andi Kleen wrote: >>> Goldwyn Rodrigues <rgoldwyn@suse.de> writes: >>> >>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >>>> >>>> In case direct I/O encounters an error midway, it returns the error. >>>> Instead it should be returning the number of bytes transferred so far. >>> >>> It's likely there's a lot of code in user space that does >>> >>> if (write(..., N) < 0) handle error >>> >>> With your change it would need to be >>> >>> if (write(..., N) != N) handle error >>> >>> How much code is actually doing that? >>> >>> I can understand it fixes your artifical test suite, but it seems to me your >>> change has a high potential to break a lot of existing user code >>> in subtle ways. So it seems to be a bad idea. >>> >>> -Andi >>> >> >> >> Quoting 'man 2 write': >> >> RETURN VALUE >> On success, the number of bytes written is returned (zero indicates >> nothing was written). It is not an error if this number is smaller >> than the number of bytes requested; this may happen for example because >> the disk device was filled. See also NOTES. > > You can quote as much man page as you want - Andi is well aware of how > read/write system call works, as I'm sure all of us are, that is not the > issue. The issue is that there are potentially LOTS of applications out > there that do not check for short writes, they do exactly what Andi > speculated above. If you break it with this change, it doesn't matter > what's in the man page. What matters is previous behavior, and that > you are breaking user space. At that point nobody cares what's in the > man page. > Agree. So how do you think we should fix this to accommodate userspace application who did not cater to the fact that write can return short write, and still be consistent? The only way I can think is that a DIO write should check early enough that the write(N) will complete with N bytes without an error. Is it possible to completely guarantee that? Leaving it as it is incorrect as quoted in the artificial test case. You should not be changing the file and yet conveying to the user an error for the same write() call. It should either be an error and the file contents are unchanged, or it should be change in contents and the write size returned.
> The only way I can think is that a DIO write should check early enough > that the write(N) will complete with N bytes without an error. Is it > possible to completely guarantee that? Probably not. > > Leaving it as it is incorrect as quoted in the artificial test case. You > should not be changing the file and yet conveying to the user an error > for the same write() call. It should either be an error and the file > contents are unchanged, or it should be change in contents and the write > size returned. There are already lots of syscall cases that don't completely undo changes on error handling. Fixing that would likely require a transaction system higher level in the kernel, or lots of complicated code everywhere, which is unlikely to happen. Also the complicated code would be difficult to test, and likely bit rot over time, because it would be only an error handling path that is infrequently exercised. So yes it's not nice, but the alternative would be worse. I think we're best of leaving it as it is now. Adding some comments/documentation to explain this would be good though. Perhaps you could submit a patch to the manpage? -Andi
> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote: > > On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: >> >> >> On 01/20/2018 08:11 PM, Andi Kleen wrote: >>> >>> It's likely there's a lot of code in user space that does >>> >>> if (write(..., N) < 0) handle error >>> >>> With your change it would need to be >>> >>> if (write(..., N) != N) handle error >>> >>> How much code is actually doing that? >>> >>> I can understand it fixes your artifical test suite, but it seems to me your >>> change has a high potential to break a lot of existing user code >>> in subtle ways. So it seems to be a bad idea. >> >> >> Quoting 'man 2 write': >> >> RETURN VALUE >> On success, the number of bytes written is returned (zero indicates >> nothing was written). It is not an error if this number is smaller >> than the number of bytes requested; this may happen for example because >> the disk device was filled. See also NOTES. > > You can quote as much man page as you want - Andi is well aware of how > read/write system call works, as I'm sure all of us are, that is not the > issue. The issue is that there are potentially LOTS of applications out > there that do not check for short writes, they do exactly what Andi > speculated above. If you break it with this change, it doesn't matter > what's in the man page. What matters is previous behavior, and that > you are breaking user space. At that point nobody cares what's in the > man page. Applications that don't handle partial writes are already broken with buffered I/O, this change doesn't really make them more broken. Cheers, Andreas
On 1/22/18 12:10 PM, Andreas Dilger wrote: > >> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote: >> >> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: >>> >>> >>> On 01/20/2018 08:11 PM, Andi Kleen wrote: >>>> >>>> It's likely there's a lot of code in user space that does >>>> >>>> if (write(..., N) < 0) handle error >>>> >>>> With your change it would need to be >>>> >>>> if (write(..., N) != N) handle error >>>> >>>> How much code is actually doing that? >>>> >>>> I can understand it fixes your artifical test suite, but it seems to me your >>>> change has a high potential to break a lot of existing user code >>>> in subtle ways. So it seems to be a bad idea. >>> >>> >>> Quoting 'man 2 write': >>> >>> RETURN VALUE >>> On success, the number of bytes written is returned (zero indicates >>> nothing was written). It is not an error if this number is smaller >>> than the number of bytes requested; this may happen for example because >>> the disk device was filled. See also NOTES. >> >> You can quote as much man page as you want - Andi is well aware of how >> read/write system call works, as I'm sure all of us are, that is not the >> issue. The issue is that there are potentially LOTS of applications out >> there that do not check for short writes, they do exactly what Andi >> speculated above. If you break it with this change, it doesn't matter >> what's in the man page. What matters is previous behavior, and that >> you are breaking user space. At that point nobody cares what's in the >> man page. > > Applications that don't handle partial writes are already broken with > buffered I/O, this change doesn't really make them more broken. Not disagreeing that they are broken, of course they are. But if you've been doing direct IO and not seeing short writes, then this could break your application. And if that's the case, it doesn't really help to say that their application was "already broken". I'd hate for a kernel upgrade to break them. I do wish we could make the change, and maybe we can. But it probably needs some safe guard proc entry to toggle the behavior, something we can drop in a few years when we're confident it won't break real applications.
> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: > > On 01/20/2018 08:11 PM, Andi Kleen wrote: > >> Goldwyn Rodrigues <rgoldwyn@suse.de> writes: > >>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> > >>> In case direct I/O encounters an error midway, it returns the error. > >>> Instead it should be returning the number of bytes transferred so far. > >> > >> It's likely there's a lot of code in user space that does > >> > >> if (write(..., N) < 0) handle error > >> > >> With your change it would need to be > >> > >> if (write(..., N) != N) handle error > >> > >> How much code is actually doing that? > >> > >> I can understand it fixes your artifical test suite, but it seems to me your > >> change has a high potential to break a lot of existing user code > >> in subtle ways. So it seems to be a bad idea. > >> > >> -Andi > > > > Quoting 'man 2 write': > > > > RETURN VALUE > > On success, the number of bytes written is returned (zero indicates > > nothing was written). It is not an error if this number is smaller > > than the number of bytes requested; this may happen for example because > > the disk device was filled. See also NOTES. > > You can quote as much man page as you want - Andi is well aware of how > read/write system call works, as I'm sure all of us are, that is not the > issue. The issue is that there are potentially LOTS of applications out > there that do not check for short writes, they do exactly what Andi > speculated above. If you break it with this change, it doesn't matter > what's in the man page. What matters is previous behavior, and that > you are breaking user space. At that point nobody cares what's in the > man page. fio engines/sg.c fio_sgio_rw_doio() has that pattern: ret = write(f->fd, hdr, sizeof(*hdr)); if (ret < 0) return ret; ... return FIO_Q_QUEUED; [which is 1] although there might be special circumstances for the sg interface making that safe. --- Robert Elliott, HPE Persistent Memory
On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote: > > >> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: >>> On 01/20/2018 08:11 PM, Andi Kleen wrote: >>>> Goldwyn Rodrigues <rgoldwyn@suse.de> writes: >>>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >>>>> In case direct I/O encounters an error midway, it returns the error. >>>>> Instead it should be returning the number of bytes transferred so far. >>>> >>>> It's likely there's a lot of code in user space that does >>>> >>>> if (write(..., N) < 0) handle error >>>> >>>> With your change it would need to be >>>> >>>> if (write(..., N) != N) handle error >>>> >>>> How much code is actually doing that? >>>> >>>> I can understand it fixes your artifical test suite, but it seems to me your >>>> change has a high potential to break a lot of existing user code >>>> in subtle ways. So it seems to be a bad idea. >>>> >>>> -Andi >>> >>> Quoting 'man 2 write': >>> >>> RETURN VALUE >>> On success, the number of bytes written is returned (zero indicates >>> nothing was written). It is not an error if this number is smaller >>> than the number of bytes requested; this may happen for example because >>> the disk device was filled. See also NOTES. >> >> You can quote as much man page as you want - Andi is well aware of how >> read/write system call works, as I'm sure all of us are, that is not the >> issue. The issue is that there are potentially LOTS of applications out >> there that do not check for short writes, they do exactly what Andi >> speculated above. If you break it with this change, it doesn't matter >> what's in the man page. What matters is previous behavior, and that >> you are breaking user space. At that point nobody cares what's in the >> man page. > > > fio engines/sg.c fio_sgio_rw_doio() has that pattern: > > ret = write(f->fd, hdr, sizeof(*hdr)); > if (ret < 0) > return ret; > ... > return FIO_Q_QUEUED; [which is 1] > > although there might be special circumstances for the sg interface > making that safe. That's for SG scsi direct IO, I don't think that supports partial IO since it's sending raw SCSI commands. For the regular libaio or sync IO system calls, fio of course checks and handles short IOs correctly. It even logs if it got any.
On Mon, 2018-01-22 at 16:14 -0700, Jens Axboe wrote: > On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote: > > fio engines/sg.c fio_sgio_rw_doio() has that pattern: > > > > ret = write(f->fd, hdr, sizeof(*hdr)); > > if (ret < 0) > > return ret; > > ... > > return FIO_Q_QUEUED; [which is 1] > > > > although there might be special circumstances for the sg interface > > making that safe. > > That's for SG scsi direct IO, I don't think that supports partial > IO since it's sending raw SCSI commands. > > For the regular libaio or sync IO system calls, fio of course checks > and handles short IOs correctly. It even logs if it got any. The entire fio_sgio_rw_doio() function is as follows: static int fio_sgio_rw_doio(struct fio_file *f, struct io_u *io_u, int do_sync) { struct sg_io_hdr *hdr = &io_u->hdr; int ret; ret = write(f->fd, hdr, sizeof(*hdr)); if (ret < 0) return ret; if (do_sync) { ret = read(f->fd, hdr, sizeof(*hdr)); if (ret < 0) return ret; /* record if an io error occurred */ if (hdr->info & SG_INFO_CHECK) io_u->error = EIO; return FIO_Q_COMPLETED; } return FIO_Q_QUEUED; } I think the 'resid' member of the struct sg_io_hdr that is provided by the sg_io kernel driver as a response represents the number of bytes that has not been written. So it should be possible to recognize and handle short I/Os in that function. From include/scsi/sg.h: int resid; /* [o] dxfer_len - actual_transferred */ Bart.
On 1/22/18 4:24 PM, Bart Van Assche wrote: > On Mon, 2018-01-22 at 16:14 -0700, Jens Axboe wrote: >> On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote: >>> fio engines/sg.c fio_sgio_rw_doio() has that pattern: >>> >>> ret = write(f->fd, hdr, sizeof(*hdr)); >>> if (ret < 0) >>> return ret; >>> ... >>> return FIO_Q_QUEUED; [which is 1] >>> >>> although there might be special circumstances for the sg interface >>> making that safe. >> >> That's for SG scsi direct IO, I don't think that supports partial >> IO since it's sending raw SCSI commands. >> >> For the regular libaio or sync IO system calls, fio of course checks >> and handles short IOs correctly. It even logs if it got any. > > The entire fio_sgio_rw_doio() function is as follows: > > static int fio_sgio_rw_doio(struct fio_file *f, struct io_u *io_u, int do_sync) > { > struct sg_io_hdr *hdr = &io_u->hdr; > int ret; > > ret = write(f->fd, hdr, sizeof(*hdr)); > if (ret < 0) > return ret; > > if (do_sync) { > ret = read(f->fd, hdr, sizeof(*hdr)); > if (ret < 0) > return ret; > > /* record if an io error occurred */ > if (hdr->info & SG_INFO_CHECK) > io_u->error = EIO; > > return FIO_Q_COMPLETED; > } > > return FIO_Q_QUEUED; > } > > I think the 'resid' member of the struct sg_io_hdr that is provided by the > sg_io kernel driver as a response represents the number of bytes that has not > been written. So it should be possible to recognize and handle short I/Os in > that function. From include/scsi/sg.h: > > int resid; /* [o] dxfer_len - actual_transferred */ Yeah that's true. But let's not side track the discussion here, fio+sg isn't relevant to the topic at hand. Fio and other engines would be (like libaio, or sync and friends), but those handle short/partial IOs just fine. That said, if someone wants to submit a patch for the sg engine, I would of course take it.
On 01/22/2018 01:13 PM, Jens Axboe wrote: > On 1/22/18 12:10 PM, Andreas Dilger wrote: >> >>> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote: >>> >>> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote: >>>> >>>> >>>> On 01/20/2018 08:11 PM, Andi Kleen wrote: >>>>> >>>>> It's likely there's a lot of code in user space that does >>>>> >>>>> if (write(..., N) < 0) handle error >>>>> >>>>> With your change it would need to be >>>>> >>>>> if (write(..., N) != N) handle error >>>>> >>>>> How much code is actually doing that? >>>>> >>>>> I can understand it fixes your artifical test suite, but it seems to me your >>>>> change has a high potential to break a lot of existing user code >>>>> in subtle ways. So it seems to be a bad idea. >>>> >>>> >>>> Quoting 'man 2 write': >>>> >>>> RETURN VALUE >>>> On success, the number of bytes written is returned (zero indicates >>>> nothing was written). It is not an error if this number is smaller >>>> than the number of bytes requested; this may happen for example because >>>> the disk device was filled. See also NOTES. >>> >>> You can quote as much man page as you want - Andi is well aware of how >>> read/write system call works, as I'm sure all of us are, that is not the >>> issue. The issue is that there are potentially LOTS of applications out >>> there that do not check for short writes, they do exactly what Andi >>> speculated above. If you break it with this change, it doesn't matter >>> what's in the man page. What matters is previous behavior, and that >>> you are breaking user space. At that point nobody cares what's in the >>> man page. >> >> Applications that don't handle partial writes are already broken with >> buffered I/O, this change doesn't really make them more broken. > > Not disagreeing that they are broken, of course they are. But if you've > been doing direct IO and not seeing short writes, then this could break > your application. And if that's the case, it doesn't really help to say I started exploring short writes and how direct I/O behaves on errors after your suggestion to incorporate short writes in a previous conversation [1]. I started looking into how midway errors with direct I/O's are handled now and I stumble upon this issue. > that their application was "already broken". I'd hate for a kernel > upgrade to break them. > > I do wish we could make the change, and maybe we can. But it probably > needs some safe guard proc entry to toggle the behavior, something we > can drop in a few years when we're confident it won't break real > applications. Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?), should it be enabled or disabled by default? [1] https://www.spinics.net/lists/linux-block/msg15910.html
On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote: >> that their application was "already broken". I'd hate for a kernel >> upgrade to break them. >> >> I do wish we could make the change, and maybe we can. But it probably >> needs some safe guard proc entry to toggle the behavior, something we >> can drop in a few years when we're confident it won't break real >> applications. > > Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?), > should it be enabled or disabled by default? I'd enable it by default, if not, you are never going to be able to remove it because you'll have no confidence that anyone actually flipped the switch and ran with it enabled. The point of having it there and on by default would be that if something does break, people have the option of turning it off and restoring the previous behavior, without having to change the kernel.
On Mon, Jan 22, 2018 at 08:28:54PM -0700, Jens Axboe wrote: > On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote: > >> that their application was "already broken". I'd hate for a kernel > >> upgrade to break them. > >> > >> I do wish we could make the change, and maybe we can. But it probably > >> needs some safe guard proc entry to toggle the behavior, something we > >> can drop in a few years when we're confident it won't break real > >> applications. > > > > Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?), > > should it be enabled or disabled by default? > > I'd enable it by default, if not, you are never going to be able to > remove it because you'll have no confidence that anyone actually flipped > the switch and ran with it enabled. The point of having it there and on > by default would be that if something does break, people have the option > of turning it off and restoring the previous behavior, without having to > change the kernel. I think it's an opt-in prctl that's something like PRCTL_SHORT_WRITES_ALLOWED.
On Jan 22, 2018, at 8:28 PM, Jens Axboe <axboe@kernel.dk> wrote: > > On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote: >>> that their application was "already broken". I'd hate for a kernel >>> upgrade to break them. >>> >>> I do wish we could make the change, and maybe we can. But it probably >>> needs some safe guard proc entry to toggle the behavior, something we >>> can drop in a few years when we're confident it won't break real >>> applications. >> >> Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?), >> should it be enabled or disabled by default? > > I'd enable it by default, if not, you are never going to be able to > remove it because you'll have no confidence that anyone actually flipped > the switch and ran with it enabled. The point of having it there and on > by default would be that if something does break, people have the option > of turning it off and restoring the previous behavior, without having to > change the kernel. ... or fixing their application. :-) But, yes, I agree that this should be on by default. Cheers, Andreas
On 01/23/2018 12:35 AM, Matthew Wilcox wrote: > On Mon, Jan 22, 2018 at 08:28:54PM -0700, Jens Axboe wrote: >> On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote: >>>> that their application was "already broken". I'd hate for a kernel >>>> upgrade to break them. >>>> >>>> I do wish we could make the change, and maybe we can. But it probably >>>> needs some safe guard proc entry to toggle the behavior, something we >>>> can drop in a few years when we're confident it won't break real >>>> applications. >>> >>> Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?), >>> should it be enabled or disabled by default? >> >> I'd enable it by default, if not, you are never going to be able to >> remove it because you'll have no confidence that anyone actually flipped >> the switch and ran with it enabled. The point of having it there and on >> by default would be that if something does break, people have the option >> of turning it off and restoring the previous behavior, without having to >> change the kernel. > > I think it's an opt-in prctl that's something like PRCTL_SHORT_WRITES_ALLOWED. > I cannot decide where to stick this bit in the task_struct. current->flags/PF_* does not seem right. Don't want to start a new fs_flags field. Suggestions?
diff --git a/fs/block_dev.c b/fs/block_dev.c index 4a181fcb5175..49d94360bb51 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) if (!ret) ret = blk_status_to_errno(dio->bio.bi_status); - if (likely(!ret)) + if (likely(dio->size)) ret = dio->size; bio_put(&dio->bio); diff --git a/fs/direct-io.c b/fs/direct-io.c index 3aafb3343a65..0c98b6e65d7c 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -262,8 +262,6 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags) ret = dio->page_errors; if (ret == 0) ret = dio->io_error; - if (ret == 0) - ret = transferred; if (dio->end_io) { // XXX: ki_pos?? @@ -310,7 +308,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags) } kmem_cache_free(dio_cache, dio); - return ret; + return transferred ? transferred : ret; } static void dio_aio_complete_work(struct work_struct *work) diff --git a/fs/iomap.c b/fs/iomap.c index 47d29ccffaef..cab57d85404e 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -716,23 +716,23 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio) struct kiocb *iocb = dio->iocb; struct inode *inode = file_inode(iocb->ki_filp); loff_t offset = iocb->ki_pos; - ssize_t ret; + ssize_t err; + ssize_t transferred = dio->size; if (dio->end_io) { - ret = dio->end_io(iocb, - dio->error ? dio->error : dio->size, + err = dio->end_io(iocb, + transferred ? transferred : dio->error, dio->flags); } else { - ret = dio->error; + err = dio->error; } - if (likely(!ret)) { - ret = dio->size; + if (likely(transferred)) { /* check for short read */ - if (offset + ret > dio->i_size && + if (offset + transferred > dio->i_size && !(dio->flags & IOMAP_DIO_WRITE)) - ret = dio->i_size - offset; - iocb->ki_pos += ret; + transferred = dio->i_size - offset; + iocb->ki_pos += transferred; } /* @@ -759,7 +759,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio) inode_dio_end(file_inode(iocb->ki_filp)); kfree(dio); - return ret; + return transferred ? transferred : err; } static void iomap_dio_complete_work(struct work_struct *work)