[v3,1/6] block: add disk sequence number

Message ID	20210623105858.6978-2-mcroce@linux.microsoft.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Matteo Croce <mcroce@linux.microsoft.com> To: linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, Jens Axboe <axboe@kernel.dk> Cc: linux-kernel@vger.kernel.org, Lennart Poettering <lennart@poettering.net>, Luca Boccassi <bluca@debian.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Damien Le Moal <damien.lemoal@wdc.com>, Tejun Heo <tj@kernel.org>, =?utf-8?q?Javier_Gonz=C3=A1lez?= <javier@javigon.com>, Niklas Cassel <niklas.cassel@wdc.com>, Johannes Thumshirn <johannes.thumshirn@wdc.com>, Hannes Reinecke <hare@suse.de>, Matthew Wilcox <willy@infradead.org>, Christoph Hellwig <hch@infradead.org>, JeffleXu <jefflexu@linux.alibaba.com> Subject: [PATCH v3 1/6] block: add disk sequence number Date: Wed, 23 Jun 2021 12:58:53 +0200 Message-Id: <20210623105858.6978-2-mcroce@linux.microsoft.com> In-Reply-To: <20210623105858.6978-1-mcroce@linux.microsoft.com> References: <20210623105858.6978-1-mcroce@linux.microsoft.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	block: add a sequence number to disks \| expand [v3,0/6] block: add a sequence number to disks [v3,1/6] block: add disk sequence number [v3,2/6] block: add ioctl to read the disk sequence number [v3,3/6] block: refactor sysfs code [v3,4/6] block: export diskseq in sysfs [v3,5/6] block: increment sequence number [v3,6/6] loop: increment sequence number

Matteo Croce June 23, 2021, 10:58 a.m. UTC

From: Matteo Croce <mcroce@microsoft.com>

Add a monotonically increasing number to disk devices. This number is put
in the uevent so userspace can correlate events when a driver reuses a
device, like cdrom or loop.

    $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
    E: DEVNAME=/dev/loop0
    E: DISKSEQ=1
    E: DEVNAME=/dev/loop1
    E: DISKSEQ=2
    E: DEVNAME=/dev/loop2
    E: DISKSEQ=3
    E: DEVNAME=/dev/loop3
    E: DISKSEQ=4
    E: DEVNAME=/dev/loop4
    E: DISKSEQ=5
    E: DEVNAME=/dev/loop5
    E: DISKSEQ=6
    E: DEVNAME=/dev/loop6
    E: DISKSEQ=7
    E: DEVNAME=/dev/loop7
    E: DISKSEQ=8
    E: DEVNAME=/dev/nvme0n1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p2
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p3
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p4
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p5
    E: DISKSEQ=9
    E: DEVNAME=/dev/sda
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda1
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda2
    E: DISKSEQ=10

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
---
 block/genhd.c         | 19 +++++++++++++++++++
 include/linux/genhd.h |  2 ++
 2 files changed, 21 insertions(+)

Christoph Hellwig June 23, 2021, 11:48 a.m. UTC | #1

On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
> +void inc_diskseq(struct gendisk *disk)
> +{
> +	static atomic64_t diskseq;

Please don't hide file scope variables in functions.

Can you explain a little more why we need a global sequence count vs
a per-disk one here?

Matteo Croce June 23, 2021, 1:10 p.m. UTC | #2

On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
> > +void inc_diskseq(struct gendisk *disk)
> > +{
> > +     static atomic64_t diskseq;
>
> Please don't hide file scope variables in functions.
>

I just didn't want to clobber that file namespace, as that is the only
point where it's used.

> Can you explain a little more why we need a global sequence count vs
> a per-disk one here?

The point of the whole series is to have an unique sequence number for
all the disks.
Events can arrive to the userspace delayed or out-of-order, so this
helps to correlate events to the disk.
It might seem strange, but there isn't a way to do this yet, so I come
up with a global, monotonically incrementing number.

Lennart Poettering June 23, 2021, 1:51 p.m. UTC | #3

On Mi, 23.06.21 15:10, Matteo Croce (mcroce@linux.microsoft.com) wrote:

> On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
> > > +void inc_diskseq(struct gendisk *disk)
> > > +{
> > > +     static atomic64_t diskseq;
> >
> > Please don't hide file scope variables in functions.
> >
>
> I just didn't want to clobber that file namespace, as that is the only
> point where it's used.
>
> > Can you explain a little more why we need a global sequence count vs
> > a per-disk one here?
>
> The point of the whole series is to have an unique sequence number for
> all the disks.
> Events can arrive to the userspace delayed or out-of-order, so this
> helps to correlate events to the disk.
> It might seem strange, but there isn't a way to do this yet, so I come
> up with a global, monotonically incrementing number.

To extend on this and given an example why the *global* sequence number
matters:

Consider you plug in a USB storage key, and it gets named
/dev/sda. You unplug it, the kernel structures for that device all
disappear. Then you plug in a different USB storage key, and since
it's the only one it will too be called /dev/sda.

With the global sequence number we can still distinguish these two
devices even though otherwise they can look pretty much identical. If
we had per-device counters then this would fall flat because the
counter would be flushed out when the device disappears and when a device
reappears under the same generic name we couldn't assign it a
different sequence number than before.

Thus: a global instead of local sequence number counter is absolutely
*key* for the problem this is supposed to solve

Lennart

--
Lennart Poettering, Berlin

Hannes Reinecke June 23, 2021, 2:01 p.m. UTC | #4

On 6/23/21 3:51 PM, Lennart Poettering wrote:
> On Mi, 23.06.21 15:10, Matteo Croce (mcroce@linux.microsoft.com) wrote:
> 
>> On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
>>>> +void inc_diskseq(struct gendisk *disk)
>>>> +{
>>>> +     static atomic64_t diskseq;
>>>
>>> Please don't hide file scope variables in functions.
>>>
>>
>> I just didn't want to clobber that file namespace, as that is the only
>> point where it's used.
>>
>>> Can you explain a little more why we need a global sequence count vs
>>> a per-disk one here?
>>
>> The point of the whole series is to have an unique sequence number for
>> all the disks.
>> Events can arrive to the userspace delayed or out-of-order, so this
>> helps to correlate events to the disk.
>> It might seem strange, but there isn't a way to do this yet, so I come
>> up with a global, monotonically incrementing number.
> 
> To extend on this and given an example why the *global* sequence number
> matters:
> 
> Consider you plug in a USB storage key, and it gets named
> /dev/sda. You unplug it, the kernel structures for that device all
> disappear. Then you plug in a different USB storage key, and since
> it's the only one it will too be called /dev/sda.
> 
> With the global sequence number we can still distinguish these two
> devices even though otherwise they can look pretty much identical. If
> we had per-device counters then this would fall flat because the
> counter would be flushed out when the device disappears and when a device
> reappears under the same generic name we couldn't assign it a
> different sequence number than before.
> 
> Thus: a global instead of local sequence number counter is absolutely
> *key* for the problem this is supposed to solve
> 
Well ... except that you'll need to keep track of the numbers (otherwise 
you wouldn't know if the numbers changed, right?).
And if you keep track of the numbers you probably will have to implement 
an uevent listener to get the events in time.
But if you have an uevent listener you will also get the add/remove 
events for these devices.
And if you get add and remove events you can as well implement sequence 
numbers in your application, seeing that you have all information 
allowing you to do so.
So why burden the kernel with it?

Cheers,

Hannes

Luca Boccassi June 23, 2021, 2:07 p.m. UTC | #5

On Wed, 2021-06-23 at 16:01 +0200, Hannes Reinecke wrote:
> On 6/23/21 3:51 PM, Lennart Poettering wrote:
> > On Mi, 23.06.21 15:10, Matteo Croce (mcroce@linux.microsoft.com) wrote:
> > 
> > > On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
> > > > > +void inc_diskseq(struct gendisk *disk)
> > > > > +{
> > > > > +     static atomic64_t diskseq;
> > > > 
> > > > Please don't hide file scope variables in functions.
> > > > 
> > > 
> > > I just didn't want to clobber that file namespace, as that is the only
> > > point where it's used.
> > > 
> > > > Can you explain a little more why we need a global sequence count vs
> > > > a per-disk one here?
> > > 
> > > The point of the whole series is to have an unique sequence number for
> > > all the disks.
> > > Events can arrive to the userspace delayed or out-of-order, so this
> > > helps to correlate events to the disk.
> > > It might seem strange, but there isn't a way to do this yet, so I come
> > > up with a global, monotonically incrementing number.
> > 
> > To extend on this and given an example why the *global* sequence number
> > matters:
> > 
> > Consider you plug in a USB storage key, and it gets named
> > /dev/sda. You unplug it, the kernel structures for that device all
> > disappear. Then you plug in a different USB storage key, and since
> > it's the only one it will too be called /dev/sda.
> > 
> > With the global sequence number we can still distinguish these two
> > devices even though otherwise they can look pretty much identical. If
> > we had per-device counters then this would fall flat because the
> > counter would be flushed out when the device disappears and when a device
> > reappears under the same generic name we couldn't assign it a
> > different sequence number than before.
> > 
> > Thus: a global instead of local sequence number counter is absolutely
> > *key* for the problem this is supposed to solve
> > 
> Well ... except that you'll need to keep track of the numbers (otherwise 
> you wouldn't know if the numbers changed, right?).
> And if you keep track of the numbers you probably will have to implement 
> an uevent listener to get the events in time.
> But if you have an uevent listener you will also get the add/remove 
> events for these devices.
> And if you get add and remove events you can as well implement sequence 
> numbers in your application, seeing that you have all information 
> allowing you to do so.
> So why burden the kernel with it?
> 
> Cheers,
> 
> Hannes

Hi,

We need this so that we can reliably correlate events to instances of a
device. Events alone cannot solve this problem, because events _are_
the problem.

Lennart Poettering June 23, 2021, 2:12 p.m. UTC | #6

On Mi, 23.06.21 16:01, Hannes Reinecke (hare@suse.de) wrote:

> > Thus: a global instead of local sequence number counter is absolutely
> > *key* for the problem this is supposed to solve
> >
> Well ... except that you'll need to keep track of the numbers (otherwise you
> wouldn't know if the numbers changed, right?).
> And if you keep track of the numbers you probably will have to implement an
> uevent listener to get the events in time.

Hmm? This is backwards. The goal here is to be able to safely match up
uevents to specific uses of a block device, given that block device
names are agressively recycled.

you imply it was easy to know which device use a uevent belongs
to. But that's the problem: it is not possible to do so safely. if i
see a uevent for a block device "loop0" I cannot tell if it was from
my own use of the device or for some previous user of it.

And that's what we'd like to see fixed: i.e. we query the block device
for the seqeno now used and then we can use that to filter the uevents
and ignore the ones that do not carry the same sequence number as we
got assigned for our user.

Lennart

--
Lennart Poettering, Berlin

Hannes Reinecke June 23, 2021, 2:21 p.m. UTC | #7

On 6/23/21 4:07 PM, Luca Boccassi wrote:
> On Wed, 2021-06-23 at 16:01 +0200, Hannes Reinecke wrote:
>> On 6/23/21 3:51 PM, Lennart Poettering wrote:
>>> On Mi, 23.06.21 15:10, Matteo Croce (mcroce@linux.microsoft.com) wrote:
>>>
>>>> On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>> On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
>>>>>> +void inc_diskseq(struct gendisk *disk)
>>>>>> +{
>>>>>> +     static atomic64_t diskseq;
>>>>>
>>>>> Please don't hide file scope variables in functions.
>>>>>
>>>>
>>>> I just didn't want to clobber that file namespace, as that is the only
>>>> point where it's used.
>>>>
>>>>> Can you explain a little more why we need a global sequence count vs
>>>>> a per-disk one here?
>>>>
>>>> The point of the whole series is to have an unique sequence number for
>>>> all the disks.
>>>> Events can arrive to the userspace delayed or out-of-order, so this
>>>> helps to correlate events to the disk.
>>>> It might seem strange, but there isn't a way to do this yet, so I come
>>>> up with a global, monotonically incrementing number.
>>>
>>> To extend on this and given an example why the *global* sequence number
>>> matters:
>>>
>>> Consider you plug in a USB storage key, and it gets named
>>> /dev/sda. You unplug it, the kernel structures for that device all
>>> disappear. Then you plug in a different USB storage key, and since
>>> it's the only one it will too be called /dev/sda.
>>>
>>> With the global sequence number we can still distinguish these two
>>> devices even though otherwise they can look pretty much identical. If
>>> we had per-device counters then this would fall flat because the
>>> counter would be flushed out when the device disappears and when a device
>>> reappears under the same generic name we couldn't assign it a
>>> different sequence number than before.
>>>
>>> Thus: a global instead of local sequence number counter is absolutely
>>> *key* for the problem this is supposed to solve
>>>
>> Well ... except that you'll need to keep track of the numbers (otherwise
>> you wouldn't know if the numbers changed, right?).
>> And if you keep track of the numbers you probably will have to implement
>> an uevent listener to get the events in time.
>> But if you have an uevent listener you will also get the add/remove
>> events for these devices.
>> And if you get add and remove events you can as well implement sequence
>> numbers in your application, seeing that you have all information
>> allowing you to do so.
>> So why burden the kernel with it?
>>
>> Cheers,
>>
>> Hannes
> 
> Hi,
> 
> We need this so that we can reliably correlate events to instances of a
> device. Events alone cannot solve this problem, because events _are_
> the problem.
> 
In which sense?
Yes, events can be delayed (if you list to uevents), but if you listen 
to kernel events there shouldn't be a delay, right?

Cheers,

Hannes

Christoph Hellwig June 23, 2021, 2:28 p.m. UTC | #8

On Wed, Jun 23, 2021 at 03:10:21PM +0200, Matteo Croce wrote:
> I just didn't want to clobber that file namespace, as that is the only
> point where it's used.

It doesn't really clobber the file namespace.  Declaring it normally
at the top of the file makes it very clear we have global state.
Hiding it in a function just causes obsfucation.

> > Can you explain a little more why we need a global sequence count vs
> > a per-disk one here?
> 
> The point of the whole series is to have an unique sequence number for
> all the disks.
> Events can arrive to the userspace delayed or out-of-order, so this
> helps to correlate events to the disk.
> It might seem strange, but there isn't a way to do this yet, so I come
> up with a global, monotonically incrementing number.

Maybe add a comment to explain that?

Luca Boccassi June 23, 2021, 2:34 p.m. UTC | #9

On Wed, 2021-06-23 at 16:21 +0200, Hannes Reinecke wrote:
> On 6/23/21 4:07 PM, Luca Boccassi wrote:
> > On Wed, 2021-06-23 at 16:01 +0200, Hannes Reinecke wrote:
> > > On 6/23/21 3:51 PM, Lennart Poettering wrote:
> > > > On Mi, 23.06.21 15:10, Matteo Croce (mcroce@linux.microsoft.com) wrote:
> > > > 
> > > > > On Wed, Jun 23, 2021 at 1:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > > > On Wed, Jun 23, 2021 at 12:58:53PM +0200, Matteo Croce wrote:
> > > > > > > +void inc_diskseq(struct gendisk *disk)
> > > > > > > +{
> > > > > > > +     static atomic64_t diskseq;
> > > > > > 
> > > > > > Please don't hide file scope variables in functions.
> > > > > > 
> > > > > 
> > > > > I just didn't want to clobber that file namespace, as that is the only
> > > > > point where it's used.
> > > > > 
> > > > > > Can you explain a little more why we need a global sequence count vs
> > > > > > a per-disk one here?
> > > > > 
> > > > > The point of the whole series is to have an unique sequence number for
> > > > > all the disks.
> > > > > Events can arrive to the userspace delayed or out-of-order, so this
> > > > > helps to correlate events to the disk.
> > > > > It might seem strange, but there isn't a way to do this yet, so I come
> > > > > up with a global, monotonically incrementing number.
> > > > 
> > > > To extend on this and given an example why the *global* sequence number
> > > > matters:
> > > > 
> > > > Consider you plug in a USB storage key, and it gets named
> > > > /dev/sda. You unplug it, the kernel structures for that device all
> > > > disappear. Then you plug in a different USB storage key, and since
> > > > it's the only one it will too be called /dev/sda.
> > > > 
> > > > With the global sequence number we can still distinguish these two
> > > > devices even though otherwise they can look pretty much identical. If
> > > > we had per-device counters then this would fall flat because the
> > > > counter would be flushed out when the device disappears and when a device
> > > > reappears under the same generic name we couldn't assign it a
> > > > different sequence number than before.
> > > > 
> > > > Thus: a global instead of local sequence number counter is absolutely
> > > > *key* for the problem this is supposed to solve
> > > > 
> > > Well ... except that you'll need to keep track of the numbers (otherwise
> > > you wouldn't know if the numbers changed, right?).
> > > And if you keep track of the numbers you probably will have to implement
> > > an uevent listener to get the events in time.
> > > But if you have an uevent listener you will also get the add/remove
> > > events for these devices.
> > > And if you get add and remove events you can as well implement sequence
> > > numbers in your application, seeing that you have all information
> > > allowing you to do so.
> > > So why burden the kernel with it?
> > > 
> > > Cheers,
> > > 
> > > Hannes
> > 
> > Hi,
> > 
> > We need this so that we can reliably correlate events to instances of a
> > device. Events alone cannot solve this problem, because events _are_
> > the problem.
> > 
> In which sense?
> Yes, events can be delayed (if you list to uevents), but if you listen 
> to kernel events there shouldn't be a delay, right?
> 
> Cheers,
> 
> Hannes

Hi,

Userspace programs don't have exclusive usage rights on loopdev, so you
can't reliably know if an uevent correlates to the loop0 you just
added, or to the loop0 someone else added some time earlier. This
series lets us do just that, reliably, without races, and fix long-
standing bugs. Please see Lennart's reply, it has much more details.
Thanks!

Lennart Poettering June 23, 2021, 2:55 p.m. UTC | #10

On Mi, 23.06.21 16:21, Hannes Reinecke (hare@suse.de) wrote:

> > We need this so that we can reliably correlate events to instances of a
> > device. Events alone cannot solve this problem, because events _are_
> > the problem.
> >
> In which sense?
> Yes, events can be delayed (if you list to uevents), but if you listen to
> kernel events there shouldn't be a delay, right?

uevents are delivered to userpace via an AF_NETLINK socket. The
AF_NETLINK socket is basically an asynchronous buffer.

I mean, consider what you are saying: you establish the AF_NETLINK
uevent watching socket, then you allocate /dev/loop0. Since you cannot
do that atomically, you'll first have to do one, and then the
other. But if you do that in two steps, then in the middle some other
process might get scheduled that quickly allocates /dev/loop0 and
releases it again, before your code gets to run. So now you have in
your AF_NETLINK socket buffer the uevents for that other process' use
of the device, and you cannot sanely distinguish them from your own.

of course you could do it the other way: allocate the device first,
and only then allocate the AF_NETLINK uevent socket. But then you
might or might not lose the "add" event for the device you just
allocated. And you don't know if you should wait for it or not.

This isn't even a constructed issue, this is the common case if you
have multiple processes all simultaneously trying to acquire a loopback
block device, because they all will end up eying /dev/loop0 at the same time.

But it gets worse IRL because of various factors. For example,
partition probing is asynchronous, so if you use LO_FLAGS_PARTSCAN and
want to watch for some partition device associated to your loopback
block device to show up, this can take *really* long, so the race
window is large. Or you actually use udev (like most userspace
probably should) because you want the metainfo it collects about the
device, in which case it will take even longer for the uevent to reach
you, i.e. the time window where a previous user's uevents and your own
for the same loopback device "overlap" can be quite large and you
cannot determine if they are yours or the previous user's uevents —
unless we have these new sequence numbers.

Lennart

--
Lennart Poettering, Berlin

Hannes Reinecke June 23, 2021, 3:02 p.m. UTC | #11

On 6/23/21 4:12 PM, Lennart Poettering wrote:
> On Mi, 23.06.21 16:01, Hannes Reinecke (hare@suse.de) wrote:
> 
>>> Thus: a global instead of local sequence number counter is absolutely
>>> *key* for the problem this is supposed to solve
>>>
>> Well ... except that you'll need to keep track of the numbers (otherwise you
>> wouldn't know if the numbers changed, right?).
>> And if you keep track of the numbers you probably will have to implement an
>> uevent listener to get the events in time.
> 
> Hmm? This is backwards. The goal here is to be able to safely match up
> uevents to specific uses of a block device, given that block device
> names are agressively recycled.
> 
> you imply it was easy to know which device use a uevent belongs
> to. But that's the problem: it is not possible to do so safely. if i
> see a uevent for a block device "loop0" I cannot tell if it was from
> my own use of the device or for some previous user of it.
> 
> And that's what we'd like to see fixed: i.e. we query the block device
> for the seqeno now used and then we can use that to filter the uevents
> and ignore the ones that do not carry the same sequence number as we
> got assigned for our user.
> 

It is notoriously tricky to monitor the intended use-case for kernel 
devices, precisely because we do _not_ attach any additional information 
to it.
I have send a proposal for LSF to implement block-namespaces, the prime 
use-case of which is indeed attaching cgroup/namespace information to 
block devices such that we _can_ match (block) devices to specific contexts.

Which I rather prefer than adding sequence numbers to block devices; 
incidentally you could solve the same problem by _not_ reusing numbers 
aggressively but rather allocate the next free one after the most 
recently allocated one.
Will give you much the same thing without having to burden others with it.

The better alternative here would be to extend the loop ioctl to pass in 
an UUID when allocating the device.
That way you can easily figure out whether the loop device has been 
modified.

But in the end, it's the loop driver and I'm not particular bothered 
with it. I am, though, if you need to touch all drivers just to support 
one particular use-case in one particular device driver.

Incidentally, we don't have this problem in SCSI as we _can_ identify 
devices here. So in the end we couldn't care less on which /dev/sdX 
device it ends up.
And I guess that's what we should attempt for loop devices, too.

Cheers,

Hannes

Luca Boccassi June 23, 2021, 3:34 p.m. UTC | #12

On Wed, 2021-06-23 at 17:02 +0200, Hannes Reinecke wrote:
> On 6/23/21 4:12 PM, Lennart Poettering wrote:
> > On Mi, 23.06.21 16:01, Hannes Reinecke (hare@suse.de) wrote:
> > 
> > > > Thus: a global instead of local sequence number counter is absolutely
> > > > *key* for the problem this is supposed to solve
> > > > 
> > > Well ... except that you'll need to keep track of the numbers (otherwise you
> > > wouldn't know if the numbers changed, right?).
> > > And if you keep track of the numbers you probably will have to implement an
> > > uevent listener to get the events in time.
> > 
> > Hmm? This is backwards. The goal here is to be able to safely match up
> > uevents to specific uses of a block device, given that block device
> > names are agressively recycled.
> > 
> > you imply it was easy to know which device use a uevent belongs
> > to. But that's the problem: it is not possible to do so safely. if i
> > see a uevent for a block device "loop0" I cannot tell if it was from
> > my own use of the device or for some previous user of it.
> > 
> > And that's what we'd like to see fixed: i.e. we query the block device
> > for the seqeno now used and then we can use that to filter the uevents
> > and ignore the ones that do not carry the same sequence number as we
> > got assigned for our user.
> > 
> 
> It is notoriously tricky to monitor the intended use-case for kernel 
> devices, precisely because we do _not_ attach any additional information 
> to it.
> I have send a proposal for LSF to implement block-namespaces, the prime 
> use-case of which is indeed attaching cgroup/namespace information to 
> block devices such that we _can_ match (block) devices to specific contexts.

Having namespaces for block devices would be an awesome feature, very
much looking forward to have that, as it solves a number of other
issues we have.
And while it could maybe be used in some instances of this particular
problem, unfortunately I don't think it can solve all of them - in some
real cases, we have to work in the root namespace as we are setting
things up for it.

> Which I rather prefer than adding sequence numbers to block devices; 
> incidentally you could solve the same problem by _not_ reusing numbers 
> aggressively but rather allocate the next free one after the most 
> recently allocated one.
> Will give you much the same thing without having to burden others with it.

If I understood this right, you are proposing to move the
monothonically increasing sequence id to the device name, rather than
as internal metadata? So that, eg, loop0 gets used exactly once and
never again, and so on? Wouldn't that be a much more visible and
disruptive change, potentially backward incompatible and breaking
userspace left and right?

> The better alternative here would be to extend the loop ioctl to pass in 
> an UUID when allocating the device.
> That way you can easily figure out whether the loop device has been 
> modified.

A UUID solves the problem we are currently talking about. But a
monothonically increasing sequence number has additional great
properties compared to a UUID, that we very much want to make use of
(again these are all _real_ use cases), and were described  in detail
here:

https://lore.kernel.org/linux-fsdevel/20210315201331.GA2577561@casper.infradead.org/t/#m3b1ffdffcc70a7c3ef4d7f13d0c2d5b9e4dde35a

> But in the end, it's the loop driver and I'm not particular bothered 
> with it. I am, though, if you need to touch all drivers just to support 
> one particular use-case in one particular device driver.
> 
> Incidentally, we don't have this problem in SCSI as we _can_ identify 
> devices here. So in the end we couldn't care less on which /dev/sdX 
> device it ends up.
> And I guess that's what we should attempt for loop devices, too.

Sorry, I'm not sure what you mean here by "touch all drivers"? This
series changes only drivers/block/loop.c, no other device drivers code
is touched?

Lennart Poettering June 23, 2021, 3:48 p.m. UTC | #13

On Mi, 23.06.21 17:02, Hannes Reinecke (hare@suse.de) wrote:

> > you imply it was easy to know which device use a uevent belongs
> > to. But that's the problem: it is not possible to do so safely. if i
> > see a uevent for a block device "loop0" I cannot tell if it was from
> > my own use of the device or for some previous user of it.
> >
> > And that's what we'd like to see fixed: i.e. we query the block device
> > for the seqeno now used and then we can use that to filter the uevents
> > and ignore the ones that do not carry the same sequence number as we
> > got assigned for our user.
>
> It is notoriously tricky to monitor the intended use-case for kernel
> devices, precisely because we do _not_ attach any additional information to
> it.
> I have send a proposal for LSF to implement block-namespaces, the prime
> use-case of which is indeed attaching cgroup/namespace information to block
> devices such that we _can_ match (block) devices to specific
> contexts.

The goal of the patchset is to make loopback block devices (and
similar) safely and robustly concurrently allocatable from the main OS
namespace, without any cgroup/containerization logic.

In systemd we want to be able to allocate loopback block devices from
any context, and concurrently without having to set up a
cgroup/namespace first for each user for it. Any approach that binds
two distinct subsystems like this together (e.g. "you need to set up
cgroups to safely allocate loopback block devices") is really
problematic for us, since we manage both, but independently and always
with minimal privileges.

> Which I rather prefer than adding sequence numbers to block devices;
> incidentally you could solve the same problem by _not_ reusing numbers
> aggressively but rather allocate the next free one after the most recently
> allocated one.

You are suggesting that instead of allocating loopback block devices
always from the "bottom", i.e. always handing out from "loop0" on,
with the lowest preferred we'd just always hand out "loop1", "loop2",
… with strictly monotonically increasing numbres and never reuse
"loop0" anymore and other names we already handed out? That would
certainly work, but this would require quite some kernel rework, since
the loopbck allocation API is really not designed to work like that
right now.

Moreover, the proposed sequence number stuff also covers
floppies/cdroms and other stuff nicely, i.e. where drives stick around
but their media changes. Also, USB sticks are currently effectively
always called /dev/sda. It would be great to be able to distinguish
each plug/replug too. Of course, you could argue that there too
/dev/sda should never be reused, but strictly monotonically increasing
/dev/sdb, /dev/sdc, …  and so on, and I'd sympathize with that, but
that makes it a major kernel rework, because basically every block
subsystem would have to be reworked to never reuse block device names
anymore.

Also, i doubt people would be happy if they then regularly would have
to deal with device names such as /dev/loop84763874658743 or
/dev/sdzbghz just because their system has been running for a while.

> The better alternative here would be to extend the loop ioctl to pass in an
> UUID when allocating the device.
> That way you can easily figure out whether the loop device has been
> modified.

UUIDs instead of sequence numbers would mostly solve our probelms
too. i.e. chaotic, randomized assignment of identifiers instead of
linearly progressing assignment of idenitifers. However I prefer
sequence numbers as discussed in this thread before: they allow us to
derive ordering from things: thus if you see an uevent with a seqnum
smaller than the one you are interested in you know its worth waiting
for the ones you are looking for to appear. But if you see a uevent
with a seqnum greater than the one you are interested in then you know
it's pointless to wait, the device has already been acquired by
someone else. With randomized UUIDs you can't know that, since uses by
other participants are only recognizable as distinct from your own but
don't tell you if they are earlier or later than your own. After all
the AF_NETLINK/uevent socket is lossy, so you must be prepared for
dropped messages, hence it's reat if we can easily resync when your
own messages get dropped.

Lennart

--
Lennart Poettering, Berlin

[v3,1/6] block: add disk sequence number

Commit Message

Comments

Patch