Message ID | 20201203215859.2719888-1-palmer@dabbelt.com (mailing list archive) |
---|---|
Series | dm: dm-user: New target that proxies BIOs to userspace |
What is the advantage over simply using nbd?
On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?
There's a short bit about that in the cover letter (and in some talks), but
I'll expand on it here -- I suppose my most important question is "is this
interesting enough to take upstream?", so there should be at least a bit of a
description of what it actually enables:
I don't think there are any deep fundamental advantages to doing this as opposed
to nbd/iscsi over localhost/unix (or by just writing a kernel implementation,
for that matter), at least in terms of anything that was previously impossible
now becoming possible. There are a handful of things that are easier and/or
faster, though.
dm-user looks a lot like NBD without the networking. The major difference is
which side initiates messages: in NBD the kernel initiates messages, while in
dm-user userspace initiates messages (via a read that will block if there is no
message, but presumably we'd want to add support for non-blocking userspace
implementations eventually). The NBD approach certainly makes sense for a
networked system, as one generally wants to have a single storage server
handling multiple clients, but inverting that makes some things simpler in
dm-user.
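As a rough illustration of the pull model described above, here is a minimal sketch of a daemon loop in which userspace asks for work with a blocking read. It does not use the real dm-user ABI: the 20-byte header layout, the opcode values, the 512-byte sector size, and the use of a socketpair in place of the control device are all assumptions made for the example.

```python
import socket
import struct

# Hypothetical wire format (NOT the real dm-user ABI):
# sequence number (u64), opcode (u32), sector (u32), payload length (u32).
HEADER = struct.Struct("<QIII")
OP_READ, OP_WRITE = 0, 1
SECTOR = 512


def recv_exact(sock, n):
    """Block until exactly n bytes arrive; return b"" if the peer closed first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return b""
        buf += chunk
    return bytes(buf)


def serve(sock, backing):
    """Daemon side: pull one request at a time until the peer closes."""
    handled = 0
    while True:
        raw = recv_exact(sock, HEADER.size)  # blocks while there is no work
        if not raw:
            return handled
        seq, op, sector, length = HEADER.unpack(raw)
        start = sector * SECTOR
        if op == OP_WRITE:
            backing[start:start + length] = recv_exact(sock, length)
            sock.sendall(HEADER.pack(seq, op, sector, 0))  # bare acknowledgement
        else:
            data = bytes(backing[start:start + length])
            sock.sendall(HEADER.pack(seq, op, sector, length) + data)
        handled += 1


def demo():
    """Queue one write and one read, then let the daemon drain them."""
    kernel, daemon = socket.socketpair()
    backing = bytearray(4 * SECTOR)
    kernel.sendall(HEADER.pack(1, OP_WRITE, 2, SECTOR) + b"x" * SECTOR)
    kernel.sendall(HEADER.pack(2, OP_READ, 2, SECTOR))
    kernel.shutdown(socket.SHUT_WR)  # no further requests
    handled = serve(daemon, backing)
    ack = HEADER.unpack(recv_exact(kernel, HEADER.size))
    reply = HEADER.unpack(recv_exact(kernel, HEADER.size))
    data = recv_exact(kernel, reply[3])
    kernel.close()
    daemon.close()
    return handled, ack, reply, data
```

The point of the sketch is only the direction of control: nothing reaches the daemon until it asks, so a daemon that stops calling `recv_exact` simply stops receiving work.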
One specific advantage of this change is that a dm-user target can be
transitioned from one daemon to another without any IO errors: just spin up the
second daemon, signal the first to stop requesting new messages, and let it
exit. We're using that mechanism to replace the daemon launched by early init
(which runs before the security subsystem is up, as in our use case dm-user
provides the root filesystem) with one that's properly sandboxed (which can
only be launched after the root filesystem has come up). There are ways around
this (replacing the DM table, for example), but they don't fit as cleanly.
Unless I'm missing something, NBD servers aren't capable of that style of
transition: soft disconnects can only be initiated by the client (the kernel,
in this case), which leaves no way for the server to transition while
guaranteeing that no IOs error out. It's usually possible to shoehorn this
sort of direction reversing concept into network protocols, but it's also
usually ugly (I'm thinking of IDLE, for example). I didn't try to actually do
it, but my guess would be that adding a way for the server to ask the client to
stop sending messages until a new server shows up would be at least as much
work as doing this.
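The handoff described above can be modelled in a few lines. This is a deterministic toy, not the dm-user implementation: the `Target` and `Daemon` classes, the daemon names, and the `draining` flag are hypothetical names chosen for the sketch. It only shows why a pull model makes the transition safe: requests that nobody has pulled yet just wait in the target.

```python
from collections import deque


class Target:
    """Stand-in for a dm-user target: holds BIOs until some daemon pulls them."""

    def __init__(self):
        self.pending = deque()
        self.completed = {}  # bio id -> name of the daemon that served it

    def submit(self, bio_id):
        self.pending.append(bio_id)

    def pull(self):
        return self.pending.popleft() if self.pending else None


class Daemon:
    """Hypothetical daemon that serves at most one request per step()."""

    def __init__(self, name, target):
        self.name = name
        self.target = target
        self.draining = False  # set when the daemon is told to stop pulling

    def step(self):
        """Return True if a request was served, False if idle or draining."""
        if self.draining:
            return False
        bio_id = self.target.pull()
        if bio_id is None:
            return False
        self.target.completed[bio_id] = self.name
        return True


def handoff_demo():
    target = Target()
    old = Daemon("early-init", target)
    for bio_id in range(6):
        target.submit(bio_id)
    for _ in range(3):
        old.step()                  # the old daemon serves the first few BIOs

    new = Daemon("sandboxed", target)  # spin up the replacement first
    old.draining = True                # then tell the old one to stop pulling
    for bio_id in range(6, 10):
        target.submit(bio_id)          # IO keeps arriving during the handoff
    assert old.step() is False         # the old daemon takes nothing further
    while new.step():
        pass
    return target.completed
```

Every submitted request ends up in `completed` exactly once, with the first three attributed to the old daemon and the rest to the new one.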
There are also a handful of possible performance advantages, but I haven't gone
through the work to prove any of them out yet as performance isn't all that
important for our first use case. For example:
* Cutting out the network stack is unlikely to hurt performance. I'm not sure
if it will help performance, though. I think if we really had a workload where
the extra copy was likely to be an issue we'd want an explicit ring buffer,
but I have a theory that it would be possible to get very good performance out
of a stream-style API by using multiple channels and relying on io_uring to
plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert
itself into user_map(), likely by uploading a BPF fragment. There's a whole
class of interesting block devices that could be written in this fashion:
essentially you keep a cache on a regular block device that handles the common
cases by remapping BIOs and passing them along, relegating the more complicated
logic to fetch cache misses and watching some subset of the access stream where
necessary.
We have a use case like this in Android, where we opportunistically store
backups in a portion of the TRIM'd space on devices. It's currently
implemented entirely in kernel by the dm-bow target, but IIUC that was deemed
too Android-specific to merge. Assuming we could get good enough performance
we could move that logic to userspace, which lets us shrink our diff with
upstream. It feels like some other interesting block devices could be
written in a similar fashion.
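To make the split between a fast path and a userspace slow path concrete, here is a small model of the decision such a hook would make. Nothing here reflects actual code in the patch set (the message above only mentions a comment about a possible BPF fragment); the `RemapCache` class, its methods, and the return values are hypothetical.

```python
class RemapCache:
    """Toy model: hits are remapped to a cache device, misses go to userspace."""

    def __init__(self):
        self.table = {}         # logical sector -> sector on the cache device
        self.to_userspace = []  # misses waiting for the daemon

    def map_bio(self, sector):
        """Fast path: remap on a hit, otherwise queue the miss for the daemon."""
        phys = self.table.get(sector)
        if phys is not None:
            return ("remap", phys)
        self.to_userspace.append(sector)
        return ("user", None)

    def fill(self, sector, phys):
        """Slow path finished: the daemon records where the data now lives."""
        self.table[sector] = phys
        if sector in self.to_userspace:
            self.to_userspace.remove(sector)


cache = RemapCache()
first = cache.map_bio(100)    # miss: handed to the daemon
cache.fill(100, 7)            # the daemon fetches the block and records it
second = cache.map_bio(100)   # hit: remapped without involving userspace
```

In this model only the first access to a sector costs a round trip to the daemon; later accesses stay on the remap path.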
All in all, I've found it a bit hard to figure out what sort of interest people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit. I'm sending it out in this early state to try and figure
out if it's interesting enough to keep going.
On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> All in all, I've found it a bit hard to figure out what sort of interest people
> have in dm-user: when I bring this up I seem to run into people who've done
> similar things before and are vaguely interested, but certainly nobody is
> chomping at the bit. I'm sending it out in this early state to try and figure
> out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder whether this
new driver could be useful to their use cases?

Thanks,

Bart.
On 12/9/20 10:38 PM, Bart Van Assche wrote:
> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>> All in all, I've found it a bit hard to figure out what sort of interest people
>> have in dm-user: when I bring this up I seem to run into people who've done
>> similar things before and are vaguely interested, but certainly nobody is
>> chomping at the bit. I'm sending it out in this early state to try and figure
>> out if it's interesting enough to keep going.
>
> Cc-ing Josef and Mike since their nbd contributions make me wonder
> whether this new driver could be useful to their use cases?
>

Sorry gmail+imap sucks and I can't get my email client to get at the original
thread. However here is my take.

1) The advantages of using dm-user over NBD that you listed aren't actually
problems for NBD. We have NBD working in production where you can hand off the
sockets for the server without ending in timeouts, it was actually the main
reason we wrote our own server so we could use the FD transfer stuff to restart
the server without impacting any clients that had the device in use.

2) The extra copy is a big deal, in fact we already have too many copies in our
existing NBD setup and are actively looking for ways to avoid those.

Don't take this as I don't think dm-user is a good idea, but I think at the very
least it should start with the very best we have to offer, starting with as few
copies as possible.

If you are using it currently in production then cool, there's clearly a usecase
for it. Personally as I get older and grouchier I want less things in the
kernel, so if this enables us to eventually do everything NBD related in
userspace with no performance drop then I'd be down. I don't think you need to
make that your primary goal, but at least polishing this up so it could
potentially be abused in the future would make it more compelling for merging.

Thanks,

Josef
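The message above does not say how the socket handoff is implemented. One standard mechanism on Linux is passing the open descriptor over a Unix-domain socket (SCM_RIGHTS), and the sketch below assumes that approach. It uses Python's `socket.send_fds` and `socket.recv_fds` (available since Python 3.9); the control socket, the `nbd-conn` tag, and the function names are made up for the example, and plain socketpairs stand in for a real NBD connection.

```python
import socket


def hand_off(control, conn):
    """Old server: pass the open connection's descriptor to the new server."""
    socket.send_fds(control, [b"nbd-conn"], [conn.fileno()])
    conn.close()  # the copy travelling to the receiver keeps the connection open


def take_over(control):
    """New server: receive the descriptor and wrap it in a socket object."""
    tag, fds, _flags, _addr = socket.recv_fds(control, 64, 1)
    return tag, socket.socket(fileno=fds[0])


def demo():
    client, server = socket.socketpair()    # stands in for the NBD connection
    ctl_old, ctl_new = socket.socketpair()  # channel between the two servers

    client.sendall(b"req1")
    first = server.recv(4)                  # served by the old server

    client.sendall(b"req2")                 # arrives while the handoff happens
    hand_off(ctl_old, server)
    tag, new_server = take_over(ctl_new)
    second = new_server.recv(4)             # served by the new server

    for s in (client, new_server, ctl_old, ctl_new):
        s.close()
    return first, tag, second
```

The client never reconnects: the request sent during the handoff stays queued on the same connection and is read by whichever process holds the descriptor next.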
On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
> On 12/9/20 10:38 PM, Bart Van Assche wrote:
>> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>>> All in all, I've found it a bit hard to figure out what sort of interest people
>>> have in dm-user: when I bring this up I seem to run into people who've done
>>> similar things before and are vaguely interested, but certainly nobody is
>>> chomping at the bit. I'm sending it out in this early state to try and figure
>>> out if it's interesting enough to keep going.
>>
>> Cc-ing Josef and Mike since their nbd contributions make me wonder
>> whether this new driver could be useful to their use cases?
>>
>
> Sorry gmail+imap sucks and I can't get my email client to get at the original
> thread. However here is my take.

and I guess I then have to apologize for missing your email ;). Hopefully that
was the problem, but who knows.

> 1) The advantages of using dm-user over NBD that you listed aren't actually
> problems for NBD. We have NBD working in production where you can hand off the
> sockets for the server without ending in timeouts, it was actually the main
> reason we wrote our own server so we could use the FD transfer stuff to restart
> the server without impacting any clients that had the device in use.

OK. So you just send the FD around using one of the standard mechanisms to
orchestrate the handoff? I guess that might work for our use case, assuming
whatever the security side of things was doing was OK with the old FD. TBH I'm
not sure how all that works and while we thought about doing that sort of
transfer scheme we decided to just open it again -- not sure how far we were
down the dm-user rabbit hole at that point, though, as this sort of arose out of
some other ideas.

> 2) The extra copy is a big deal, in fact we already have too many copies in our
> existing NBD setup and are actively looking for ways to avoid those.
>
> Don't take this as I don't think dm-user is a good idea, but I think at the very
> least it should start with the very best we have to offer, starting with as few
> copies as possible.

I was really expecting someone to say that. It does seem kind of silly to build
out the new interface, but not go all the way to a ring buffer. We just didn't
really have any way to justify the extra complexity as our use cases aren't
that high performance. I kind of like to have benchmarks for this sort of
thing, though, and I didn't have anyone who had bothered avoiding the last copy
to compare against.

> If you are using it currently in production then cool, there's clearly a usecase
> for it. Personally as I get older and grouchier I want less things in the
> kernel, so if this enables us to eventually do everything NBD related in
> userspace with no performance drop then I'd be down. I don't think you need to
> make that your primary goal, but at least polishing this up so it could
> potentially be abused in the future would make it more compelling for merging.
> Thanks,

Ya, it's in Android already and we'll be shipping it as part of the new OTA
flow for the next release. The rules on deprecation are a bit different over
there, though, so it's not like we're wed to it. The whole point of bringing
this up here was to try and get something usable by everyone, and while I'd
eventually like to get whatever's in Android into the kernel proper we'd really
planned on supporting an extra Android-only ABI for a cycle at least.

I'm kind of inclined to take a crack at the extra copy, to at least see if
building something that eliminates it is viable. I'm not really sure if it is
(or at least, if it'll net us a meaningful amount of performance), but it'd at
least be interesting to try. It'd be nice to have some benchmark target, though,
as otherwise this stuff hangs on forever.
My workloads are in selftests later on in the patch set, but I'm essentially
using tmpfs as a baseline to compare against ext4+dm-user with some FIO
examples as workloads. Our early benchmark numbers indicated this was way
faster than we needed, so I didn't even bother putting together a proper system
to run on so I don't really have any meaningful numbers there. Is there an NBD
server that's fast that I should be comparing against?

I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
reasonable performance numbers before bothering with a v2.
On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> I was really expecting someone to say that. It does seem kind of silly to build
> out the new interface, but not go all the way to a ring buffer. We just didn't
> really have any way to justify the extra complexity as our use cases aren't
> that high performance. I kind of like to have benchmarks for this sort of
> thing, though, and I didn't have anyone who had bothered avoiding the last copy
> to compare against.

I worked on something very similar, though performance was one of the goals.
The implementation was floating around lockless ring buffers, shared memory for
zerocopy, multiqueue and error handling. It could be that every disk storage
vendor has to implement something like that in order to bridge Linux kernel to
their own proprietary datapath running in userspace.
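For readers unfamiliar with the ring-buffer designs mentioned above, the sketch below shows only the index arithmetic of a fixed-slot single-producer/single-consumer ring, with the consumer receiving a view into the buffer instead of a copy. It is not taken from any implementation discussed in this thread: the `Ring` class and its method names are hypothetical, a `bytearray` stands in for a region shared between kernel and userspace, and a real lockless version would additionally need atomic index updates and memory barriers.

```python
class Ring:
    """Fixed-slot single-producer/single-consumer ring over one buffer."""

    def __init__(self, slots, slot_size):
        self.slots = slots
        self.slot_size = slot_size
        self.buf = memoryview(bytearray(slots * slot_size))
        self.head = 0  # count of slots ever produced
        self.tail = 0  # count of slots ever released by the consumer

    def _offset(self, index):
        return (index % self.slots) * self.slot_size

    def produce(self, data):
        """Copy one slot's worth of data in; return False if the ring is full."""
        if len(data) != self.slot_size:
            raise ValueError("data must be exactly one slot long")
        if self.head - self.tail == self.slots:
            return False
        start = self._offset(self.head)
        self.buf[start:start + self.slot_size] = data
        self.head += 1
        return True

    def peek(self):
        """Return a view of the oldest slot without copying, or None if empty."""
        if self.head == self.tail:
            return None
        start = self._offset(self.tail)
        return self.buf[start:start + self.slot_size]

    def release(self):
        """Consumer is done with the oldest slot; the producer may reuse it."""
        if self.head == self.tail:
            raise IndexError("ring is empty")
        self.tail += 1
```

Keeping `head` and `tail` as ever-increasing counters, and reducing them modulo the slot count only when computing an offset, lets a full ring be told apart from an empty one without sacrificing a slot.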
On Wed, 16 Dec 2020 10:24:59 PST (-0800), v.mayatskih@gmail.com wrote:
> On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
>> I was really expecting someone to say that. It does seem kind of silly to build
>> out the new interface, but not go all the way to a ring buffer. We just didn't
>> really have any way to justify the extra complexity as our use cases aren't
>> that high performance. I kind of like to have benchmarks for this sort of
>> thing, though, and I didn't have anyone who had bothered avoiding the last copy
>> to compare against.
>
> I worked on something very similar, though performance was one of the
> goals. The implementation was floating around lockless ring buffers,
> shared memory for zerocopy, multiqueue and error handling. It could be
> that every disk storage vendor has to implement something like that in
> order to bridge Linux kernel to their own proprietary datapath running
> in userspace.

OK, good to know. That's kind of the feeling I'd gotten from having chatted to
a handful of people about this, but I don't remember people having actually
gotten all the way to zero-copy. That's how we managed to end up at this
middle-ground ABI style: I thought people were, in practice, punting on zero
copy because the complexity just wasn't worth the performance benefit. Maybe
I'd just been colored by how my projects ended up going, but I've ended up
designing complicated interfaces in the past that allow for zero-copy only to
never get around to actually making that work. I don't know if that's just
because I've had the good fortune to avoid working on anything that ended up
with users, though :).
For our use case I think we actually get better performance out of the
copy-based (and probably more importantly kalloc-based, but that's an
implementation thing not an ABI thing) approach: essentially we're very
sensitive to memory pressure and expect this first dm-user daemon to mostly be
idle, so we're really worried about avoiding excess memory usage while idle and
less worried about throughput when active. This stream-based interface means
that userspace doesn't need much memory allocated to service a request, which
helps with sleep/wake latencies and/or idle memory usage. That's also why we
have the simple locking scheme: no sense splitting locks if there's no
contention, and we only need a single thread to saturate the storage bandwidth
on these phones.

That said, it does sound like people really do care about the sort of
performance levels where zero copy is relevant in this space. I'll take a shot
at something along those lines, and while it will add a degree of userspace
complexity I'm not sure it'll add much in the way of kernel complexity -- at
least compared to a fast version of this, where we'd need most of that stuff
anyway (obviously the malloc+single lock design is simple, but probably
wouldn't stick around for long). At a bare minimum it'll be interesting to play
around with, but if people are doing it in practice then I'm more confident
that I can put something together that at least serves as a starting point for
further discussion.

I haven't gotten around to writing any code yet, but I had spent a bit of time
thinking about how to put this zero-copy version together and am leaning
towards it being a standalone block device (as opposed to a DM target). I'd
avoided that before as I didn't want to mess around with my own device control
scheme so I'll still try to do the DM thing, but I'm not sure it'll be viable.
That's all speculation now, but it does bring up one interesting question:
IIUC, this version of dm-user handles BIOs before they reach the block
scheduler while a standalone driver would likely handle them after blk-mq. I
don't have direct experience with this, but the last time I ran into people who
had these sorts of performance requirements for userspace drivers they weren't
actually trying to write userspace drivers but were instead trying to write a
userspace scheduler, with the userspace drivers just being the mechanism to
implement that scheduler. This was a decade ago and I'm not sure that's what
people are trying to do in the new blk-mq world, but if it is then it's going
to be a major design consideration. I'm also not entirely sure that we're
really solving the same problem at that point.
On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> reasonable performance numbers before bothering with a v2.

FYI, my other main worry besides duplicating nbd is that device mapper really
is a stacked interface that sits on top of other block devices. Turning this
into something else that just pipes data to userspace seems very strange.
On Tue, Dec 22 2020 at 8:32am -0500, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> > reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry besides duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree. Only way I'd be interested is if it somehow tackled enabling much more
efficient IO. Earlier discussion in this thread mentioned that zero-copy and
low overhead wasn't a priority (because it is hard, etc). But the hard work has
already been done with io_uring. If dm-user had a prereq of leaning heavily on
io_uring and also enabled IO polling for bio-based then there may be a win to
supporting it.

But unless lower latency (or some other more significant win) is made possible
I just don't care to prop up an unnatural DM bolt-on.

Mike
On Tue, 22 Dec 2020 05:32:46 PST (-0800), Christoph Hellwig wrote:
> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry besides duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

Agreed. It certainly doesn't fit the DM model. We'd considered doing a non-DM
version of this (maybe "ubd"), but decided to stick with dm-user because we
didn't want to duplicate all the device creation stuff that DM provides. A
simple version of that wouldn't be that hard to do, but the DM version has a
lot of features and we get that all for free.

We essentially decided to run with DM until it gets in the way, and the only
sticking point we ended up with was that REQUEUE stuff (though not sure how
that would show up with a bare block device) and that scheduler question. I'm
going to stick with DM for now, unless it gets in the way, to avoid coming up
with a device creation scheme myself. In the long term it's probably best to
have this be a standalone thing, but I don't want to dump a bunch of time into
putting that stuff together only to find that this isn't interesting enough
from a performance perspective to stick around.
On Tue, 22 Dec 2020 06:36:16 PST (-0800), snitzer@redhat.com wrote:
> On Tue, Dec 22 2020 at 8:32am -0500,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> > reasonable performance numbers before bothering with a v2.
>>
>> FYI, my other main worry besides duplicating nbd is that device mapper
>> really is a stacked interface that sits on top of other block devices.
>> Turning this into something else that just pipes data to userspace
>> seems very strange.
>
> I agree. Only way I'd be interested is if it somehow tackled enabling
> much more efficient IO. Earlier discussion in this thread mentioned
> that zero-copy and low overhead wasn't a priority (because it is hard,
> etc). But the hard work has already been done with io_uring. If
> dm-user had a prereq of leaning heavily on io_uring and also enabled IO
> polling for bio-based then there may be a win to supporting it.
>
> But unless lower latency (or some other more significant win) is made
> possible I just don't care to prop up an unnatural DM bolt-on.

I don't remember if I mentioned this in the thread (it was definitely in the
Plumbers talk), but I'd had the general idea bouncing around that it would be
possible to write a high-performance version of this using an interface similar
to the one provided here while relying on io_uring for the high-performance
userspace. That definitely won't work with exactly the current interface, but
my hope was to avoid writing my own high-performance ring buffer. My worry was
that it'll be too tricky to map this all to zero-copy, and I guess I forgot
about it. Now that you bring it up, it certainly seems worth taking a shot at.
We'd essentially have the best of both worlds: userspace implementations that
want to be simple could just use read()/write(), while those that want to be
higher performance could have their implicit ring buffer.

I'm currently trying to put together a benchmarking setup that is of sufficient
fidelity that I would believe the numbers, which is really why I don't have any
performance numbers yet (no sense posting numbers I would shoot down :)). I'll
try to remember to take a shot at an io_uring based userspace (probably with
some dm-user interface modifications) to see how it feels.
FYI, a few years ago I spent some time helping a customer to prepare their block device in userspace using fuse code for upstreaming, but at some point they abandoned the project. But if for some reason we don't want to use nbd I think a driver using the fuse infrastructure would be the next logical choice.
On 12/22/20 11:48 PM, Christoph Hellwig wrote:
> FYI, a few years ago I spent some time helping a customer to prepare
> their block device in userspace using fuse code for upstreaming, but
> at some point they abandoned the project. But if for some reason we
> don't want to use nbd I think a driver using the fuse infrastructure
> would be the next logical choice.

Hi Christoph,

Thanks for having shared this information. Since I'm not familiar with the FUSE
code: does this mean translating block device accesses into FUSE_READ and
FUSE_WRITE messages? Does the FUSE kernel code only support exchanging such
messages between kernel and user space via the read() and write() system calls?
I'm asking this since there is already an interface in the Linux kernel for
implementing block devices in user space that uses another approach, namely a
ring buffer for messages and data that is shared between kernel and user space
(documented in Documentation/target/tcmu-design.rst). Is one system call per
read and per write operation fast enough for all block-device-in-user-space
implementations?

Thanks,

Bart.