[v1,0/5] dm: dm-user: New target that proxies BIOs to userspace

Message ID 20201203215859.2719888-1-palmer@dabbelt.com (mailing list archive)

Message

Palmer Dabbelt Dec. 3, 2020, 9:58 p.m. UTC
This patch set contains dm-user, a device mapper target that proxies incoming
BIOs to userspace via a misc device.  Essentially it's FUSE, but for block
devices.  There's more information in the documentation patch and as a handful
of comments, so I'm just going to avoid duplicating that here.  I don't really
think there's any fundamental functionality that dm-user enables, as one could
use something along the lines of nbd/iscsi, but dm-user does result in
extremely simple userspace daemons -- so simple that when I tried to write a
helper userspace library for dm-user I just ended up with nothing.

I talked about this a bit at Plumbers and was hoping to send patches a bit
earlier on in the process, but got tied up with a few things.  As a result this
is actually quite far along: it's at the point where we're starting to run this
on real devices as part of an updated Android OTA update flow, where we're
using this to provide an Android-specific compressed backing store for
dm-snap-persistent.  The bulk of that project is scattered throughout the
various Android trees, so there are kselftests and a (somewhat bare for now)
Documentation entry with the intent of making this a self-contained
contribution.  There's a lot to the Android userspace daemon, but it doesn't
interact with dm-user in a very complex manner.

This is still in a somewhat early stage, but it's at the point where things
largely function.  I'm certainly not ready to commit to the user ABI
implemented here and there are a bunch of FIXMEs scattered throughout the code,
but I do think that it's far along enough to begin a more concrete discussion
of where folks would like to go with something like this.  While I'm intending
on sorting that stuff out, I'd like to at least get a feel for whether this is
a path worth pursuing before spending a bunch more time on it.

I haven't done much in the way of performance analysis for dm-user.  Earlier on
I did some simple throughput tests and found that dm-user/ext4 was faster than
half the speed of tmpfs, which is way out of the realm of being an issue for
our use case (decompressing blocks out of a phone's storage).  The design of
dm-user does preclude an extremely high performance implementation, where I
assume one would want an explicit ring buffer and zero copy, but I feel like
users who want that degree of performance are probably better served writing a
proper kernel driver.  I wouldn't be opposed to pushing on performance (ideally
without a major design change), but for now I feel like time is better spent
fortifying the user ABI and fixing the various issues with the implementation.

The patches follow as usual, but in case it's easier I've published a tree as
well:

    git://git.kernel.org/pub/scm/linux/kernel/git/palmer/dm-user.git -b dm-user-v1

Comments

Christoph Hellwig Dec. 4, 2020, 10:33 a.m. UTC | #1
What is the advantage over simply using nbd?
Palmer Dabbelt Dec. 7, 2020, 6:55 p.m. UTC | #2
On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?

There's a short bit about that in the cover letter (and in some talks), but
I'll expand on it here -- I suppose my most important question is "is this
interesting enough to take upstream?", so there should be at least a bit of a
description of what it actually enables:

I don't think there's any deep fundamental advantages to doing this as opposed
to nbd/iscsi over localhost/unix (or by just writing a kernel implementation,
for that matter), at least in terms of anything that was previously impossible
now becoming possible.  There are a handful of things that are easier and/or
faster, though.

dm-user looks a lot like NBD without the networking.  The major difference is
which side initiates messages: in NBD the kernel initiates messages, while in
dm-user userspace initiates messages (via a read that will block if there is no
message, but presumably we'd want to add support for non-blocking userspace
implementations eventually).  The NBD approach certainly makes sense for a
networked system, as one generally wants to have a single storage server
handling multiple clients, but inverting that makes some things simpler in
dm-user.  
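
To make the pull model concrete, here is a minimal Python sketch of a daemon
loop that asks for one message at a time through a blocking read.  A pipe
stands in for the misc device, and the three-field header (sequence number,
sector, length) is a hypothetical layout chosen for illustration; it is not
the dm-user ABI.

```python
import os
import struct

# Hypothetical fixed-size header: sequence number, start sector, length.
# This is NOT the dm-user ABI; it only illustrates the pull model.
HEADER = struct.Struct("<QQQ")


def read_exact(fd, n):
    """Block until exactly n bytes arrive; return b"" on EOF."""
    chunks = []
    while n > 0:
        chunk = os.read(fd, n)
        if not chunk:
            return b""
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def serve(fd):
    """Pull requests one at a time; each read blocks until one is pending."""
    handled = []
    while True:
        raw = read_exact(fd, HEADER.size)
        if not raw:
            break
        handled.append(HEADER.unpack(raw))
    return handled


def demo():
    # The pipe plays the role of the device the daemon reads from.
    r, w = os.pipe()
    for seq in range(3):
        os.write(w, HEADER.pack(seq, seq * 8, 4096))
    os.close(w)
    try:
        return serve(r)
    finally:
        os.close(r)
```

Because the daemon decides when to issue the next read, it can simply stop
asking for work, which is what the handoff described next relies on.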

One specific advantage of this change is that a dm-user target can be
transitioned from one daemon to another without any IO errors: just spin up the
second daemon, signal the first to stop requesting new messages, and let it
exit.  We're using that mechanism to replace the daemon launched by early init
(which runs before the security subsystem is up, as in our use case dm-user
provides the root filesystem) with one that's properly sandboxed (which can
only be launched after the root filesystem has come up).  There are ways around
this (replacing the DM table, for example), but they don't fit as cleanly.
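
The handoff can be simulated in a few lines of Python.  This is only a sketch
under stated assumptions: a queue.Queue stands in for the kernel-side list of
pending messages, and the daemon names and the None sentinel are made up for
the demo.  The point it shows is that requests nobody has pulled yet stay
queued, so the old daemon can stop asking for work without any request being
dropped.

```python
import queue
import threading


def daemon(name, pending, stop, done):
    """Pull requests until told to stop; unpulled requests stay queued."""
    while not stop.is_set():
        try:
            req = pending.get(timeout=0.01)
        except queue.Empty:
            continue
        if req is None:          # sentinel: no more requests in this demo
            break
        done.append((name, req))


def handoff_demo(total=100, switch_at=40):
    pending = queue.Queue()      # stands in for the kernel-side queue
    done = []                    # list.append is atomic in CPython
    stop_old, stop_new = threading.Event(), threading.Event()

    old = threading.Thread(target=daemon, args=("old", pending, stop_old, done))
    old.start()
    for req in range(switch_at):
        pending.put(req)

    new = threading.Thread(target=daemon, args=("new", pending, stop_new, done))
    new.start()                  # spin up the replacement first
    stop_old.set()               # then ask the old daemon to stop pulling
    old.join()                   # it finishes its current request and exits

    for req in range(switch_at, total):
        pending.put(req)
    pending.put(None)
    new.join()
    return sorted(req for _, req in done)
```

Which daemon services which request depends on timing, but every request is
handled exactly once.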

Unless I'm missing something, NBD servers aren't capable of that style of
transition: soft disconnects can only be initiated by the client (the kernel,
in this case), which leaves no way for the server to transition while
guaranteeing that no IOs error out.  It's usually possible to shoehorn this
sort of direction reversing concept into network protocols, but it's also
usually ugly (I'm thinking of IDLE, for example).  I didn't try to actually do
it, but my guess would be that adding a way for the server to ask the client to
stop sending messages until a new server shows up would be at least as much
work as doing this.

There are also a handful of possible performance advantages, but I haven't gone
through the work to prove any of them out yet as performance isn't all that
important for our first use case.  For example:

* Cutting out the network stack is unlikely to hurt performance.  I'm not sure
  if it will help performance, though.  I think if we really had a workload where
  the extra copy was likely to be an issue we'd want an explicit ring buffer,
  but I have a theory that it would be possible to get very good performance out
  of a stream-style API by using multiple channels and relying on io_uring to
  plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert
  itself into user_map(), likely by uploading a BPF fragment.  There's a whole
  class of interesting block devices that could be written in this fashion:
  essentially you keep a cache on a regular block device that handles the common
  cases by remapping BIOs and passing them along, relegating the more complicated
  logic to fetch cache misses and watching some subset of the access stream where
  necessary.

  We have a use case like this in Android, where we opportunistically store
  backups in a portion of the TRIM'd space on devices.  It's currently
  implemented entirely in kernel by the dm-bow target, but IIUC that was deemed
  too Android-specific to merge.  Assuming we could get good enough performance
  we could move that logic to userspace, which lets us shrink our diff with
  upstream.  It feels like some other interesting block devices could be
  written in a similar fashion.
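
The split between a fast remapping path and a slower miss path, as described
in the second bullet above, might look roughly like the following Python
sketch.  The class and its names are hypothetical; in the scheme described,
the table lookup would live in the kernel (possibly as a BPF fragment) and
the miss handler would be the round trip to the userspace daemon.

```python
class RemapCache:
    """Fast path: remap known sectors; slow path: ask a miss handler."""

    def __init__(self, miss_handler):
        self.table = {}                 # logical sector -> cache-device sector
        self.miss_handler = miss_handler
        self.misses = 0

    def map_sector(self, sector):
        target = self.table.get(sector)
        if target is None:
            # Slow path: stands in for the round trip to the daemon.
            self.misses += 1
            target = self.miss_handler(sector)
            self.table[sector] = target
        return target


def demo():
    # Hypothetical allocator: hand out cache slots starting at sector 1000.
    next_free = iter(range(1000, 2000))
    cache = RemapCache(lambda sector: next(next_free))
    mapped = [cache.map_sector(s) for s in (7, 7, 42, 7)]
    return mapped, cache.misses
```

Only the first access to each sector takes the slow path; repeated accesses
are remapped from the table.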

All in all, I've found it a bit hard to figure out what sort of interest people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit.  I'm sending it out in this early state to try and figure
out if it's interesting enough to keep going.
Bart Van Assche Dec. 10, 2020, 3:38 a.m. UTC | #3
On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
> All in all, I've found it a bit hard to figure out what sort of interest
> people
> have in dm-user: when I bring this up I seem to run into people who've done
> similar things before and are vaguely interested, but certainly nobody is
> chomping at the bit.  I'm sending it out in this early state to try and
> figure
> out if it's interesting enough to keep going.

Cc-ing Josef and Mike since their nbd contributions make me wonder
whether this new driver could be useful to their use cases?

Thanks,

Bart.
Josef Bacik Dec. 10, 2020, 5:03 p.m. UTC | #4
On 12/9/20 10:38 PM, Bart Van Assche wrote:
> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>> All in all, I've found it a bit hard to figure out what sort of interest
>> people
>> have in dm-user: when I bring this up I seem to run into people who've done
>> similar things before and are vaguely interested, but certainly nobody is
>> chomping at the bit.  I'm sending it out in this early state to try and
>> figure
>> out if it's interesting enough to keep going.
> 
> Cc-ing Josef and Mike since their nbd contributions make me wonder
> whether this new driver could be useful to their use cases?
> 

Sorry gmail+imap sucks and I can't get my email client to get at the original 
thread.  However here is my take.

1) The advantages of using dm-user over NBD that you listed aren't actually
problems for NBD.  We have NBD working in production where you can hand off the 
sockets for the server without ending in timeouts, it was actually the main 
reason we wrote our own server so we could use the FD transfer stuff to restart 
the server without impacting any clients that had the device in use.

2) The extra copy is a big deal, in fact we already have too many copies in our 
existing NBD setup and are actively looking for ways to avoid those.

Don't take this as I don't think dm-user is a good idea, but I think at the very 
least it should start with the very best we have to offer, starting with as few 
copies as possible.

If you are using it currently in production then cool, there's clearly a usecase 
for it.  Personally as I get older and grouchier I want less things in the 
kernel, so if this enables us to eventually do everything NBD related in 
userspace with no performance drop then I'd be down.  I don't think you need to 
make that your primary goal, but at least polishing this up so it could 
potentially be abused in the future would make it more compelling for merging. 
Thanks,

Josef
Palmer Dabbelt Dec. 15, 2020, 3 a.m. UTC | #5
On Thu, 10 Dec 2020 09:03:21 PST (-0800), josef@toxicpanda.com wrote:
> On 12/9/20 10:38 PM, Bart Van Assche wrote:
>> On 12/7/20 10:55 AM, Palmer Dabbelt wrote:
>>> All in all, I've found it a bit hard to figure out what sort of interest
>>> people
>>> have in dm-user: when I bring this up I seem to run into people who've done
>>> similar things before and are vaguely interested, but certainly nobody is
>>> chomping at the bit.  I'm sending it out in this early state to try and
>>> figure
>>> out if it's interesting enough to keep going.
>>
>> Cc-ing Josef and Mike since their nbd contributions make me wonder
>> whether this new driver could be useful to their use cases?
>>
>
> Sorry gmail+imap sucks and I can't get my email client to get at the original
> thread.  However here is my take.

and I guess I then have to apologize for missing your email ;).  Hopefully that
was the problem, but who knows.

> 1) The advantages of using dm-user over NBD that you listed aren't actually
> problems for NBD.  We have NBD working in production where you can hand off the
> sockets for the server without ending in timeouts, it was actually the main
> reason we wrote our own server so we could use the FD transfer stuff to restart
> the server without impacting any clients that had the device in use.

OK.  So you just send the FD around using one of the standard mechanisms to
orchestrate the handoff?  I guess that might work for our use case, assuming
whatever the security side of things was doing was OK with the old FD.  TBH I'm
not sure how all that works and while we thought about doing that sort of
transfer scheme we decided to just open it again -- not sure how far we were
down the dm-user rabbit hole at that point, though, as this sort of arose out
of some other ideas.
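
One of the standard mechanisms for sending an FD around is SCM_RIGHTS
ancillary data over a Unix domain socket.  The Python sketch below (Unix only,
Python 3.9 or later for socket.send_fds and socket.recv_fds) shows the idea.
The pipe standing in for the server's connection and the "handoff" payload
are assumptions made for the demo, not details of any NBD server.

```python
import os
import socket


def pass_fd_demo():
    # A pipe stands in for the connection the old server holds.
    r, w = os.pipe()
    old_srv, new_srv = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # Old server: send the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(old_srv, [b"handoff"], [r])
        os.close(r)                      # old server drops its own copy

        # New server: receives a fresh descriptor for the same open file.
        msg, fds, _flags, _addr = socket.recv_fds(new_srv, 1024, 1)
        os.write(w, b"request")          # traffic keeps flowing
        data = os.read(fds[0], 1024)
        os.close(fds[0])
        return msg, data
    finally:
        os.close(w)
        old_srv.close()
        new_srv.close()
```

The peer writing into the pipe never notices that the reader changed, which
is the property the handoff needs.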

> 2) The extra copy is a big deal, in fact we already have too many copies in our
> existing NBD setup and are actively looking for ways to avoid those.
>
> Don't take this as I don't think dm-user is a good idea, but I think at the very
> least it should start with the very best we have to offer, starting with as few
> copies as possible.

I was really expecting someone to say that.  It does seem kind of silly to build
out the new interface, but not go all the way to a ring buffer.  We just didn't
really have any way to justify the extra complexity as our use cases aren't
that high performance.   I kind of like to have benchmarks for this sort of
thing, though, and I didn't have anyone who had bothered avoiding the last copy
to compare against.

> If you are using it currently in production then cool, there's clearly a usecase
> for it.  Personally as I get older and grouchier I want less things in the
> kernel, so if this enables us to eventually do everything NBD related in
> userspace with no performance drop then I'd be down.  I don't think you need to
> make that your primary goal, but at least polishing this up so it could
> potentially be abused in the future would make it more compelling for merging.
> Thanks,

Ya, it's in Android already and we'll be shipping it as part of the new OTA
flow for the next release.  The rules on deprecation are a bit different over
there, though, so it's not like we're wed to it.  The whole point of bringing
this up here was to try and get something usable by everyone, and while I'd
eventually like to get whatever's in Android into the kernel proper we'd really
planned on supporting an extra Android-only ABI for a cycle at least.  

I'm kind of inclined to take a crack at the extra copy, to at least see if
building something that eliminates it is viable.  I'm not really sure if it is
(or at least, if it'll net us a meaningful amount of performance), but it'd at
least be interesting to try.

It'd be nice to have some benchmark target, though, as otherwise this stuff
hangs on forever.  My workloads are in selftests later on in the patch set, but
I'm essentially using tmpfs as a baseline to compare against ext4+dm-user with
some FIO examples as workloads.  Our early benchmark numbers indicated this was
way faster than we needed, so I didn't even bother putting together a proper
system to run on so I don't really have any meaningful numbers there.  Is there
an NBD server that's fast that I should be comparing against?

I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
reasonable performance numbers before bothering with a v2.
Vitaly Mayatskih Dec. 16, 2020, 6:24 p.m. UTC | #6
On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:

> I was really expecting someone to say that.  It does seem kind of silly to build
> out the new interface, but not go all the way to a ring buffer.  We just didn't
> really have any way to justify the extra complexity as our use cases aren't
> that high performance.   I kind of like to have benchmarks for this sort of
> thing, though, and I didn't have anyone who had bothered avoiding the last copy
> to compare against.

I worked on something very similar, though performance was one of the
goals. The implementation was floating around lockless ring buffers,
shared memory for zerocopy, multiqueue and error handling. It could be
that every disk storage vendor has to implement something like that in
order to bridge Linux kernel to their own proprietary datapath running
in userspace.
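
For readers unfamiliar with the ring buffers mentioned here, this is a
minimal Python sketch of the index arithmetic behind a single-producer,
single-consumer ring.  It is not the implementation described above and the
names are hypothetical.  A real shared-memory version would also need memory
barriers between filling a slot and publishing the index, which Python cannot
express.

```python
class SpscRing:
    """Single-producer/single-consumer ring with free-running indices."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.slots = [None] * capacity
        self.mask = capacity - 1
        self.head = 0   # advanced only by the consumer
        self.tail = 0   # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False                  # full: producer must back off
        self.slots[self.tail & self.mask] = item
        self.tail += 1                    # publish after the slot is filled
        return True

    def pop(self):
        # None signals "empty", so None itself cannot be stored as an item.
        if self.head == self.tail:
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1                    # release the slot to the producer
        return item
```

Each index has exactly one writer, which is what lets real implementations
avoid a lock between the two sides.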
Palmer Dabbelt Dec. 17, 2020, 6:55 a.m. UTC | #7
On Wed, 16 Dec 2020 10:24:59 PST (-0800), v.mayatskih@gmail.com wrote:
> On Mon, Dec 14, 2020 at 10:03 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
>> I was really expecting someone to say that.  It does seem kind of silly to build
>> out the new interface, but not go all the way to a ring buffer.  We just didn't
>> really have any way to justify the extra complexity as our use cases aren't
>> that high performance.   I kind of like to have benchmarks for this sort of
>> thing, though, and I didn't have anyone who had bothered avoiding the last copy
>> to compare against.
>
> I worked on something very similar, though performance was one of the
> goals. The implementation was floating around lockless ring buffers,
> shared memory for zerocopy, multiqueue and error handling. It could be
> that every disk storage vendor has to implement something like that in
> order to bridge Linux kernel to their own proprietary datapath running
> in userspace.

OK, good to know.  That's kind of the feeling I'd gotten from having chatted to
a handful of people about this, but I don't remember people having actually
gotten all the way to zero-copy.  That's how we managed to end up at this
middle-ground ABI style: I thought people were, in practice, punting on
zero copy because the complexity just wasn't worth the performance benefit.
Maybe I'd just been colored by how my projects ended up going, but I've ended
up designing complicated interfaces in the past that allow for zero-copy only
to never get around to actually making that work.  I don't know if that's just
because I've had the good fortune to avoid working on anything that ended up
with users, though :).

For our use case I think we actually get better performance out of the
copy-based (and probably more importantly kalloc-based, but that's an
implementation thing not an ABI thing) approach: essentially we're very
sensitive to memory pressure and expect this first dm-user daemon to mostly be
idle, so we're really worried about avoiding excess memory usage while idle and
less worried about throughput when active.  This stream-based interface means
that userspace doesn't need much memory allocated to service a request, which
helps with sleep/wake latencies and/or idle memory usage.  That's also why we
have the simple locking scheme: no sense splitting locks if there's no
contention, and we only need a single thread to saturate the storage bandwidth
on these phones.

That said, it does sound like people really do care about the sort of
performance levels where zero copy is relevant in this space.  I'll take a shot
at something along those lines, and while it will add a degree of userspace
complexity I'm not sure it'll add much in the way of kernel complexity -- at
least compared to a fast version of this, where we'd need most of that stuff
anyway (obviously the malloc+single lock design is simple, but probably
wouldn't stick around for long).  At a bare minimum it'll be interesting to
play around with, but if people are doing it in practice then I'm more
confident that I can put something together that at least serves as a starting
point for further discussion.

I haven't gotten around to writing any code yet, but I had spent a bit of time
thinking about how to put this zero-copy version together and am leaning
towards it being a standalone block device (as opposed to a DM target).  I'd
avoided that before as I didn't want to mess around with my own device control
scheme so I'll still try to do the DM thing, but I'm not sure it'll be viable.
That's all speculation now, but it does bring up one interesting question:

IIUC, this version of dm-user handles BIOs before they reach the block
scheduler while a standalone driver would likely handle them after blk-mq.  I
don't have direct experience with this, but the last time I ran into people who
had these sorts of performance requirements for userspace drivers they weren't
actually trying to write userspace drivers but were instead trying to write a
userspace scheduler, with the userspace drivers just being the mechanism to
implement that scheduler.  This was a decade ago and I'm not sure that's what
people are trying to do in the new blk-mq world, but if it is then it's going
to be a major design consideration.  I'm also not entirely sure that we're
really solving the same problem at that point.
Christoph Hellwig Dec. 22, 2020, 1:32 p.m. UTC | #8
On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> reasonable performance numbers before bothering with a v2.

FYI, my other main worry beside duplicating nbd is that device mapper
really is a stacked interface that sits on top of other block devices.
Turning this into something else that just pipes data to userspace
seems very strange.
Mike Snitzer Dec. 22, 2020, 2:36 p.m. UTC | #9
On Tue, Dec 22 2020 at  8:32am -0500,
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
> > reasonable performance numbers before bothering with a v2.
> 
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

I agree.  Only way I'd be interested is if it somehow tackled enabling
much more efficient IO.  Earlier discussion in this thread mentioned
that zero-copy and low overhead wasn't a priority (because it is hard,
etc).  But the hard work has already been done with io_uring.  If
dm-user had a prereq of leaning heavily on io_uring and also enabled IO
polling for bio-based then there may be a win to supporting it.

But unless lower latency (or some other more significant win) is made
possible I just don't care to prop up an unnatural DM bolt-on.

Mike
Palmer Dabbelt Dec. 22, 2020, 8:31 p.m. UTC | #10
On Tue, 22 Dec 2020 05:32:46 PST (-0800), Christoph Hellwig wrote:
> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> reasonable performance numbers before bothering with a v2.
>
> FYI, my other main worry beside duplicating nbd is that device mapper
> really is a stacked interface that sits on top of other block devices.
> Turning this into something else that just pipes data to userspace
> seems very strange.

Agreed.  It certainly doesn't fit the DM model.  We'd considered doing a non-DM
version of this (maybe "ubd"), but decided to stick with dm-user because we
didn't want to duplicate all the device creation stuff that DM provides.  A
simple version of that wouldn't be that hard to do, but the DM version has a
lot of features and we get that all for free.  We essentially decided to run
with DM until it gets in the way, and the only sticking point we ended up with
was that REQUEUE stuff (though not sure how that would show up with a bare
block device) and that scheduler question.

I'm going to stick with DM for now, unless it gets in the way, to avoid coming
up with a device creation scheme myself.  In the long term it's probably best
to have this be a standalone thing, but I don't want to dump a bunch of time
into putting that stuff together only to find that this isn't interesting
enough from a performance perspective to stick around.
Palmer Dabbelt Dec. 22, 2020, 8:38 p.m. UTC | #11
On Tue, 22 Dec 2020 06:36:16 PST (-0800), snitzer@redhat.com wrote:
> On Tue, Dec 22 2020 at  8:32am -0500,
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Mon, Dec 14, 2020 at 07:00:57PM -0800, Palmer Dabbelt wrote:
>> > I haven't gotten a whole lot of feedback, so I'm inclined to at least have some
>> > reasonable performance numbers before bothering with a v2.
>>
>> FYI, my other main worry beside duplicating nbd is that device mapper
>> really is a stacked interface that sits on top of other block devices.
>> Turning this into something else that just pipes data to userspace
>> seems very strange.
>
> I agree.  Only way I'd be interested is if it somehow tackled enabling
> much more efficient IO.  Earlier discussion in this thread mentioned
> that zero-copy and low overhead wasn't a priority (because it is hard,
> etc).  But the hard work has already been done with io_uring.  If
> dm-user had a prereq of leaning heavily on io_uring and also enabled IO
> polling for bio-based then there may be a win to supporting it.
>
> But unless lower latency (or some other more significant win) is made
> possible I just don't care to prop up an unnatural DM bolt-on.

I don't remember if I mentioned this in the thread, though it was definitely in
the Plumbers talk, but I'd had the general idea bouncing around that it would
be possible to write a high-performance version of this using an interface
similar to the one provided here while relying on io_uring for the
high-performance userspace.  That definitely won't work with exactly the
current interface, but my hope was to avoid writing my own high-performance
ring buffer.  My worry was that it'll be too tricky to map this all to
zero-copy, and I guess I forgot about it.

Now that you bring it up, it certainly seems worth taking a shot at.  We'd
essentially have the best of both worlds: userspace implementations that want
to be simple could just use read()/write(), while those that want to be higher
performance could have their implicit ring buffer.

I'm currently trying to put together a benchmarking setup that is of sufficient
fidelity that I would believe the numbers, which is really why I don't have any
performance numbers yet (no sense posting numbers I would shoot down :)).  I'll
try to remember to take a shot at an io_uring based userspace (probably with
some dm-user interface modifications) to see how it feels.
Christoph Hellwig Dec. 23, 2020, 7:48 a.m. UTC | #12
FYI, a few years ago I spent some time helping a customer to prepare
their block device in userspace using fuse code for upstreaming, but
at some point they abandoned the project.  But if for some reason we
don't want to use nbd I think a driver using the fuse infrastructure
would be the next logical choice.
Bart Van Assche Dec. 23, 2020, 4:59 p.m. UTC | #13
On 12/22/20 11:48 PM, Christoph Hellwig wrote:
> FYI, a few years ago I spent some time helping a customer to prepare
> their block device in userspace using fuse code for upstreaming, but
> at some point they abandoned the project.  But if for some reason we
> don't want to use nbd I think a driver using the fuse infrastructure
> would be the next logical choice.

Hi Christoph,

Thanks for having shared this information. Since I'm not familiar with the
FUSE code: does this mean translating block device accesses into FUSE_READ
and FUSE_WRITE messages? Does the FUSE kernel code only support exchanging
such messages between kernel and user space via the read() and write()
system calls? I'm asking this since there is already an interface in the
Linux kernel for implementing block devices in user space that uses another
approach, namely a ring buffer for messages and data that is shared between
kernel and user space (documented in Documentation/target/tcmu-design.rst).
Is one system call per read and per write operation fast enough for all
block-device-in-user-space implementations?

Thanks,

Bart.