Message ID | 20230629155433.4170837-1-dhowells@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | splice: Fix corruption in data spliced to pipe | expand |
On Thu, 29 Jun 2023 at 08:55, David Howells <dhowells@redhat.com> wrote: > > Matt Whitlock, Matthew Wilcox and Dave Chinner are of the opinion that data > in the pipe must not be seen to change and that if it does, this is a bug. I'm not convinced. The whole *point* of vmsplice (and splicing from a file) is the zero-copy. If you don't want the zero-copy, then you should use just "write()". So I disagree violently. This is not a bug unless you can point to some other correctness issues. The "stableness" of the data is literally the *only* difference between vmsplice() and write(). > Whilst this does allow the code to be somewhat simplified, it also results > in a loss of performance: stolen pages have to be reloaded in accessed > again; more data has to be copied. No. It literally results in a loss of THE WHOLE POINT of vmsplice(). Linus
On Thursday, 29 June 2023 13:56:04 EDT, Linus Torvalds wrote: > On Thu, 29 Jun 2023 at 08:55, David Howells <dhowells@redhat.com> wrote: >> >> Matt Whitlock, Matthew Wilcox and Dave Chinner are of the >> opinion that data >> in the pipe must not be seen to change and that if it does, this is a bug. > > I'm not convinced. > > The whole *point* of vmsplice (and splicing from a file) is the zero-copy. > > If you don't want the zero-copy, then you should use just "write()". If you want zero copies, then call splice() *with* SPLICE_F_MOVE. If you want one copy (kernel-to-kernel), then call splice() *without* SPLICE_F_MOVE. If you want two copies (kernel-to-user + user-to-kernel), call read() and write(). I don't know why SPLICE_F_MOVE is being ignored in this thread. Sure, maybe the way it has historically been implemented was only relevant when the input FD is a pipe, but that's not what the man page implies. You have the opportunity to make it actually do what it says on the tin.
On Thursday, 29 June 2023 11:54:29 EDT, David Howells wrote: > Matt Whitlock, Matthew Wilcox and Dave Chinner are of the opinion that data > in the pipe must not be seen to change and that if it does, this is a bug. > Apart from in one specific instance (vmsplice() with SPLICE_F_GIFT), the > manual pages agree with them. I'm more inclined to adjust the > documentation since the behaviour we have has been that way since 2005, I > think. Anecdotally, my use case had been working fine for years until I upgraded from 5.15.x to 6.1.x in February of this year. That's when my backups started being corrupted. I only noticed when I was trying to restore a lost file from backup earlier this week.
On Thu, 29 Jun 2023 at 11:05, Matt Whitlock <kernel@mattwhitlock.name> wrote: > > I don't know why SPLICE_F_MOVE is being ignored in this thread. Sure, maybe > the way it has historically been implemented was only relevant when the > input FD is a pipe, but that's not what the man page implies. You have the > opportunity to make it actually do what it says on the tin. First off, when documentation and reality disagree, it's the documentation that is garbage. Secondly, your point is literally moot, from what I can tell: SPLICE_F_MOVE Unused for vmsplice(); see splice(2). that's the doc I see right now for "man vmsplice". There's no "implies" there. There's an actual big honking clear statement at the top of the man-page saying that what you claim is simply not even remotely true. Also, the reason SPLICE_F_MOVE is unused for vmsplice() is that actually trying to move pages would involve having to *remove* them from the VM source. And the TLB invalidation involved with that is literally more expensive than the memory copy would be. So no. SPLICE_F_MOVE isn't the answer. Now, we also have SPLICE_F_GIFT. That's actually a more extreme case of "not only should you taekm this page, you can actually try to re-use the end result for your own nefarious purposes". Now, I would actually not disagree with removing that part. It's scary. But I think we don't really have any users (ok, fuse and some random console driver?) Linus
On Thu, Jun 29, 2023 at 11:19:36AM -0700, Linus Torvalds wrote: > On Thu, 29 Jun 2023 at 11:05, Matt Whitlock <kernel@mattwhitlock.name> wrote: > > > > I don't know why SPLICE_F_MOVE is being ignored in this thread. Sure, maybe > > the way it has historically been implemented was only relevant when the > > input FD is a pipe, but that's not what the man page implies. You have the > > opportunity to make it actually do what it says on the tin. > > First off, when documentation and reality disagree, it's the > documentation that is garbage. > > Secondly, your point is literally moot, from what I can tell: > > SPLICE_F_MOVE > Unused for vmsplice(); see splice(2). > > that's the doc I see right now for "man vmsplice". > > There's no "implies" there. There's an actual big honking clear > statement at the top of the man-page saying that what you claim is > simply not even remotely true. > > Also, the reason SPLICE_F_MOVE is unused for vmsplice() is that > actually trying to move pages would involve having to *remove* them > from the VM source. And the TLB invalidation involved with that is > literally more expensive than the memory copy would be. I think David muddied the waters by talking about vmsplice(). The problem encountered is with splice() from the page cache. Reading the documentation, splice() moves data between two file descriptors without copying be‐ tween kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe. The bug reported is actually with using FALLOC_FL_PUNCH_HOLE, but a simpler problem is: #define _GNU_SOURCE #include <unistd.h> #include <fcntl.h> #include <stdio.h> #define PAGE_SIZE 4096 int main(int argc, char **argv) { int fd = open(argv[1], O_RDWR | O_CREAT, 0644); err = ftruncate(fd, PAGE_SIZE); pwrite(fd, "old", 3, 0); splice(fd, NULL, 1, NULL, PAGE_SIZE, 0); pwrite(fd, "new", 3, 0); return 0; } That outputs "new". Should it? If so, the manpage is really wrong. It says the point of splice() is to remove the kernel-user-kernel copy, and notes that zerocopy might be happening, but that's an optimisation the user shouldn't notice.
On Thu, 29 Jun 2023 at 11:19, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Now, we also have SPLICE_F_GIFT. [..] > > Now, I would actually not disagree with removing that part. It's > scary. But I think we don't really have any users (ok, fuse and some > random console driver?) Side note: maybe I should clarify. I have grown to pretty much hate splice() over the years, just because it's been a constant source of sorrow in so many ways. So I'd personally be perfectly ok with just making vmsplice() be exactly the same as write, and turn all of vmsplice() into just "it's a read() if the pipe is open for read, and a write if it's open for writing". IOW, effectively get rid of vmsplice() entirely, just leaving it as a legacy name for an interface. What I *absolutely* don't want to see is to make vmsplice() even more complicated, and actively slower in the process. Unmapping it from the source, removing it from the VM, is all just crazy talk. If you want to be really crazy, I can tell you how to make for some truly stupendously great benchmarks: make a plain "write()" system call look up the physical page, check if it's COW'able, and if so, mark it read-only in the source and steal the page. Now write() has taken a snapshot of the source, and can use that page for the pipe buffer as-is. It won't change, because if the user writes to it, the user will just take a page fault and force a COW. Then, to complete the thing, make 'read()' of a pipe able to just take the page, and insert it into the destination VM (it's ok to make it writable at that point). You can get *wonderful* performance numbers from benchmarks with that. I know, because I did exactly that long long ago. So long ago that I think I had a i486 that had memory throughput measured in megabytes. And my pipe throughput benchmark got gigabytes per second! Of course, that benchmark relied entirely on the source of the write() never actually writing to the page, and the reader never actually bothering to touch the page. So it was gigabytes on a pretty bad benchmark. But it was quite impressive. I don't think those patches ever got posted publicly, because while very impressive on benchmarks, it obviously was absolutely horrendous in real life, because in real life the source of the pipe data would (a) not usually be page-aligned anyway, and (b) even if it was and triggered this wonderful case, it would then re-use the buffer and take a COW fault, and now the overhead of faulting, allocating a new page, copying said page, was obviously higher than just doing all that in the pipe write() code without any faulting overhead. But splice() (and vmsplice()) does conceptually come from that kind of background. It's just that it was never as lovely and as useful as it promised to be. So I'd actually be more than happy to just say "let's decommission splice entirely, just keeping the interfaces alive for backwards compatibility" Linus
On Thu, 29 Jun 2023 at 11:34, Matthew Wilcox <willy@infradead.org> wrote: > > I think David muddied the waters by talking about vmsplice(). The > problem encountered is with splice() from the page cache. Reading > the documentation, > > splice() moves data between two file descriptors without copying be‐ > tween kernel address space and user address space. It transfers up to > len bytes of data from the file descriptor fd_in to the file descriptor > fd_out, where one of the file descriptors must refer to a pipe. Well, the original intent really always was that it's about zero-copy. So I do think that the answer to your test-program is that yes, it really traditionally *should* output "new". A splice from a file acts like a scatter-gather mmap() in the kernel. It's the original intent, and it's the whole reason it's noticeably faster than doing a write. Now, do I then agree that splice() has turned out to be a nasty morass of problems? Yes. And I even agree that while I actually *think* that your test program should output "new" (because that is the whole point of the exercise), it also means that people who use splice() need to *understand* that, and it's much too easy to get things wrong if you don't understand that the whole point of splice is to act as a kind of ad-hoc in-kernel mmap thing. And to make matters worse, for mmap() we actually do have some coherency helpers. For splice(), the page ref stays around. It's kind of like GUP and page pinning - another area where we have had lots of problems and lots of nasty semantics and complications with other VM operations over the years. So I really *really* don't want to complicate splice() even more to give it some new semantics that it has never ever really had, because people didn't understand it and used it wrong. Quite the reverse. I'd be willing to *simplify* splice() by just saying "it was all a mistake", and just turning it into wrappers around read/write. But those patches would have to be radical simplifications, not adding yet more crud on top of the pain that is splice(). Because it will hurt performance. And I'm ok with that as long as it comes with huge simplifications. What I'm *not* ok with is "I mis-used splice, now I want splice to act differently, so let's make it even more complicated". Linus
On Thu, 29 Jun 2023 16:54:29 +0100 David Howells wrote: > I'm more inclined to adjust the documentation since the behaviour we > have has been that way since 2005, I think. +1 FWIW I think that networking always operated under the assumption that the pages may change. In TLS we require explicit opt-in from users that the pages they send will not get changed, if it could cause crypto errors (TLS_TX_ZEROCOPY_RO).
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Quite the reverse. I'd be willing to *simplify* splice() by just > saying "it was all a mistake", and just turning it into wrappers > around read/write. But those patches would have to be radical > simplifications, not adding yet more crud on top of the pain that is > splice(). > > Because it will hurt performance. And I'm ok with that as long as it > comes with huge simplifications. What I'm *not* ok with is "I mis-used > splice, now I want splice to act differently, so let's make it even > more complicated". If we want to go down the simplification route, then the patches I posted might be a good start. The idea I tried to work towards is that the pipe only ever contains private pages in it that only the pipe has a ref on and that no one else can access until they come out the other end again. I got rid of the ->confirm() pipe buf op and would like to kill off all of the others too. I simplified splice() by: - Making sure any candidate pages are uptodate right up front. - Allowing automatic stealing of pages from the pagecache if no one else is using them. This should avoid losing a chunk of the performance that splice is supposed to gain - but if you're serving pages repeatedly in a webserver with this, it's going to be a problem. Possibly this should be contingent on SPLICE_F_MOVE - though the manpage says "*from* the pipe" implying it's only usable on the output side. - Copying in every other circumstance. I simplified vmsplice() by: - If SPLICE_F_GIFT is set, attempting to steal whole pages in the buffer up front if not in use by anyone else. - Copying in every other circumstance. That said, there are still sources that I didn't touch yet that attempt to insert pages into a pipe: relayfs (which does some accounting stuff based on the final consumption of the pages it inserted), sockets (which don't allow inserted pages to be stolen) and notifications (which don't want to allocate at notification time - but I can deal with that). And there's tee() (which would need to copy the data). And pipe-to-pipe splice (which could steal whole pages, but would otherwise have to copy). If you would prefer to go for utter simplification, we could make sendfile() from a buffered file just call sendmsg() directly with MSG_SPLICE_PAGES set and ignore the pipe entirely (I'm tempted to do this anyway) and then make splice() to a pipe just do copy_splice_read() and vmsplice() to a pipe do writev(). I wonder how much splice() is used compared to sendfile(). I would prefer to leave splice() and vmsplice() as they are now and adjust the documentation. As you say, they can be considered a type of zerocopy. David