[0/3] SG_IO command filtering via sysfs

Message ID	1541867733-7836-1-git-send-email-pbonzini@redhat.com (mailing list archive)
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Paolo Bonzini <pbonzini@redhat.com> To: linux-kernel@vger.kernel.org Cc: linux-scsi@vger.kernel.org, Hannes Reinecke <hare@suse.com>, "Martin K. Petersen" <martin.petersen@oracle.com>, James Bottomley <James.Bottomley@hansenpartnership.com> Subject: [PATCH 0/3] SG_IO command filtering via sysfs Date: Sat, 10 Nov 2018 17:35:30 +0100 Message-Id: <1541867733-7836-1-git-send-email-pbonzini@redhat.com> Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk
Series	SG_IO command filtering via sysfs \| expand [0/3] SG_IO command filtering via sysfs [1/3] block: add back queue-private command filter [2/3] scsi: create an all-one filter for scanners [3/3] block: add back command filter modification via sysfs

Paolo Bonzini Nov. 10, 2018, 4:35 p.m. UTC

Currently, SG_IO ioctls are implemented so that non-CAP_SYS_RAWIO users
can send commands from a predetermined whitelist.  The whitelist is very
simple-minded though, and basically corresponds to MMC commands---the idea
being that it would be nice for local users to read/copy/burn CDs.

This was probably sensible when the whitelist was first added (in the pre-git
era), but quite a few things have changed since then:

- there is a lot more focus on not running things as root unnecessarily;
it is generally much more common to have non-root processes accessing disks
and we would like that to happen more, not less.

- there is also a lot more focus on not giving capabilities unnecessarily.
Using CAP_SYS_RAWIO, which gives full access to all commands, allows
you to send a WRITE SCSI command to a file opened for reading, which is
a nice recipe for data corruption.  A more fine-grained whitelist allows
you to give the desired access to the application.

- we've discovered that some commands conflict between the various
SCSI standards.  UNMAP (a write command) in SBC has the same number as
the obscure MMC command READ SUBCHANNEL.  As such it's allowed if a
block device is opened for reading!

This series, which was last sent in 2012 before I lost interest in the
endless discussions that followed, adds the possibility to make the filter
mutable via sysfs, so that it can be set up per device.  This of course can
go both ways; interested applications can set a wider filter, but one can
also imagine setting much more restrictive filters by default (possibly
allowing little more than INQUIRY, TEST UNIT READY, READ CAPACITY and the
like).

Back then there was opposition to giving unfettered access to "dangerous"
or "too easily destructive" commands such as WRITE SAME or PERSISTENT
RESERVE OUT to unprivileged users.  Even then, I think this objection
is now moot thanks to the following things that have happened in 2012:

- WRITE SAME commands, which were considered too destructive, have
been added to the filter since commit 25cdb6451064 ("block: allow
WRITE_SAME commands with the SG_IO ioctl", 2016-12-15, Linux 4.10).
They are basically the only non-MMC commands included in the filter,
by the way.

- persistent reservations are also allowed now via PR ioctls (commit
924d55b06347, "sd: implement the Persistent Reservation API", 2015-10-21,
Linux 4.4).  These require CAP_SYS_ADMIN, which is the same capability
that is needed to *grant* access to PR commands via the SG_IO filter.

So, here is the 2018 version of these patches.  Please review! :)

Paolo

Paolo Bonzini (3):
  block: add back queue-private command filter
  scsi: create an all-one filter for scanners
  block: add back command filter modification via sysfs

 Documentation/block/queue-sysfs.txt |  19 +++++
 block/Kconfig                       |  10 +++
 block/blk-sysfs.c                   |  43 ++++++++++++
 block/bsg-lib.c                     |   4 +-
 block/bsg.c                         |   8 +--
 block/scsi_ioctl.c                  | 136 +++++++++++++++++++++++++++++++++---
 drivers/scsi/scsi_scan.c            |  13 ++++
 drivers/scsi/sg.c                   |   6 +-
 include/linux/blkdev.h              |  18 ++++-
 include/linux/bsg.h                 |   4 +-
 10 files changed, 238 insertions(+), 23 deletions(-)

Theodore Ts'o Nov. 10, 2018, 7:05 p.m. UTC | #1

I wonder if a better way of adding SG_IO command filtering is via
eBPF?  We are currently carrying a inside Google a patch which allows
a specific of SCSI commands to non-root processes --- if the process
belonged to a particular Unix group id.

It's pretty specific to our use case, in terms of the specific SCSI
commands we want to allow through.  I can imagine people wanting
different filters based on the type of the SCSI device, or a HDD's
WWID, not just a group id.  For example, this might be useful for
people wanting to do crazy things with containers --- maybe you'd
want to allow container root to send a SANITIZE ERASE command to one
of its exclusively assigned disks, but not to other HDD's.

So having something that's more general than a flat file in sysfs
might be preferable to resurrecting an interface which we would then
after to support forever, even if we come up with a more general
interface.

					- Ted

Christoph Hellwig Nov. 11, 2018, 1:14 p.m. UTC | #2

I think this goes in the wrong way.  There isn't really any point
in filtering at all if we have access to the whole device by the
file persmissions, and we generally should not allow any access for
partitions.

I think we need to simplify the selection, not add crazy amounts of
special case code.

Paolo Bonzini Nov. 11, 2018, 1:26 p.m. UTC | #3

On 10/11/2018 20:05, Theodore Y. Ts'o wrote:
> I wonder if a better way of adding SG_IO command filtering is via
> eBPF?  We are currently carrying a inside Google a patch which allows
> a specific of SCSI commands to non-root processes --- if the process
> belonged to a particular Unix group id.
> 
> It's pretty specific to our use case, in terms of the specific SCSI
> commands we want to allow through.  I can imagine people wanting
> different filters based on the type of the SCSI device, or a HDD's
> WWID, not just a group id.  For example, this might be useful for
> people wanting to do crazy things with containers --- maybe you'd
> want to allow container root to send a SANITIZE ERASE command to one
> of its exclusively assigned disks, but not to other HDD's.
> 
> So having something that's more general than a flat file in sysfs
> might be preferable to resurrecting an interface which we would then
> after to support forever, even if we come up with a more general
> interface.

Heh, this was exactly the answer I dreaded, because I can't deny it
makes sense. :)

My main argument against it is that while superseding an interface and
still having to support it forever sucks, having a super-complex
interface is also bad (back in 2012 I wrote
https://lwn.net/Articles/501742/ which I'm not particularly enthusiastic
about).  In many cases a combination of MAC policies, ACLs, etc. can be
just as effective.

I'm not very eBPF savvy, the question I have is: what kind of
information about the running process is available in an eBPF program?
For example, even considering only the examples you make, would it be
able to access the CDB, the capabilities and uid/gid of the task, the
SCSI device type, the WWN?  Of course you also need the mode of the file
descriptor in order to allow SANITIZE ERASE if the disk is opened for write.

Paolo

Theodore Ts'o Nov. 11, 2018, 1:42 p.m. UTC | #4

On Sun, Nov 11, 2018 at 05:14:45AM -0800, Christoph Hellwig wrote:
> I think this goes in the wrong way.  There isn't really any point
> in filtering at all if we have access to the whole device by the
> file persmissions, and we generally should not allow any access for
> partitions.

It really depends on the security model being used on a particular
system.  I can easily imagine scenarios where userspace is allowed
full access to the device with respect to read/write/open, but the
security model doesn't want to allow access to various SCSI commands
such as firmware upload commands, TCG commads, the
soon-to-be-standardized Zone Activation Commands (which allow dynamic
conversion of HDD recording modes between CMR and SMR), etc.

And this is before we get to crazy container / namespace scenarios.
And *no*, let's not have a SG_IO namespace!  :-)

> I think we need to simplify the selection, not add crazy amounts of
> special case code.

I have the opposite opinions in terms of wanting more complex
filtering rules, but I also agree that special case C code is not the
answer --- and why I suggested that eBPF filtering rules is the right
way to go.

					- Ted

Theodore Ts'o Nov. 11, 2018, 2:14 p.m. UTC | #5

On Sun, Nov 11, 2018 at 02:26:45PM +0100, Paolo Bonzini wrote:
> 
> I'm not very eBPF savvy, the question I have is: what kind of
> information about the running process is available in an eBPF program?
> For example, even considering only the examples you make, would it be
> able to access the CDB, the capabilities and uid/gid of the task, the
> SCSI device type, the WWN?  Of course you also need the mode of the file
> descriptor in order to allow SANITIZE ERASE if the disk is opened for write.

The basic uid/gid of the task is certainly available.  For storage
stack specific things, it's a matter of what we make available to the
eBPF function.  For example, there is an experimental netfilter
replacement which uses eBPF; obviously that requires making the packet
which is being inspecting so it can be given a thumbs up or thumbs
down result.  That's going to require letting the eBPF function to
have access to the network header, access to the connection tracking
tables, etc.

You can basically think of it as passing arguments to a function
except it's it's written in eBPF instead of C.  You can also define
eBPF functions implemented in C which can be called from eBPF code.

						- Ted

Christoph Hellwig Nov. 12, 2018, 8:20 a.m. UTC | #6

On Sun, Nov 11, 2018 at 08:42:42AM -0500, Theodore Y. Ts'o wrote:
> It really depends on the security model being used on a particular
> system.  I can easily imagine scenarios where userspace is allowed
> full access to the device with respect to read/write/open, but the
> security model doesn't want to allow access to various SCSI commands
> such as firmware upload commands, TCG commads, the
> soon-to-be-standardized Zone Activation Commands (which allow dynamic
> conversion of HDD recording modes between CMR and SMR), etc.

Well, that's what we have the security_file_ioctl() LSM hook for so that
your security model can arbitrate access to ioctls.

Paolo Bonzini Nov. 12, 2018, 10:17 a.m. UTC | #7

On 12/11/2018 09:20, Christoph Hellwig wrote:
> On Sun, Nov 11, 2018 at 08:42:42AM -0500, Theodore Y. Ts'o wrote:
>> It really depends on the security model being used on a particular
>> system.  I can easily imagine scenarios where userspace is allowed
>> full access to the device with respect to read/write/open, but the
>> security model doesn't want to allow access to various SCSI commands
>> such as firmware upload commands, TCG commads, the
>> soon-to-be-standardized Zone Activation Commands (which allow dynamic
>> conversion of HDD recording modes between CMR and SMR), etc.
> 
> Well, that's what we have the security_file_ioctl() LSM hook for so that
> your security model can arbitrate access to ioctls.

Doesn't that have TOC-TOU races by design?

Also, what about SG_IO giving write access to files that are only opened
read-only (and only have read permissions)?

Paolo

Paolo Bonzini Nov. 16, 2018, 12:26 a.m. UTC | #8

On 11/11/2018 15:14, Theodore Y. Ts'o wrote:
> On Sun, Nov 11, 2018 at 02:26:45PM +0100, Paolo Bonzini wrote:
>>
>> I'm not very eBPF savvy, the question I have is: what kind of
>> information about the running process is available in an eBPF program?
>> For example, even considering only the examples you make, would it be
>> able to access the CDB, the capabilities and uid/gid of the task, the
>> SCSI device type, the WWN?  Of course you also need the mode of the file
>> descriptor in order to allow SANITIZE ERASE if the disk is opened for write.
> 
> The basic uid/gid of the task is certainly available.  For storage
> stack specific things, it's a matter of what we make available to the
> eBPF function.  For example, there is an experimental netfilter
> replacement which uses eBPF; obviously that requires making the packet
> which is being inspecting so it can be given a thumbs up or thumbs
> down result.  That's going to require letting the eBPF function to
> have access to the network header, access to the connection tracking
> tables, etc.

Yeah, and there are even already helpers such as
bpf_get_current_uid_gid.  So that part can be done in a sort-of generic way.

I can try and do the work, but I'd like some agreement on the design
first...  For example a more important question is how would the BPF
filter be attached?  Two possibilities that come to mind are:

- add it to the /dev/sg* or /dev/sd* struct file(*) via a ioctl, and use
pass the file descriptor to the unprivileged QEMU after setting up the
BPF filter, via either fork() or SCM_RIGHTS.  This would be a very nice
model for privilege separation, but I'm afraid it would not work for
your use case

- add BPF programs to cgroups, in the form of a new
BPF_PROG_TYPE_CGROUP_CDB_FILTER or something like that.  That would also
work for my usecase, and it seems to be in line with how the network
guys are doing things.  So it would seem like the way to go.

Some other details...  Registering the first cgroup-based filter would
disable the default filter; if multiple filters are attached, the
outcomes of all filters would be AND-ed, also similarly to how socket
and sockaddr cgroup BPF works.  Finally, filters would be applied also
to processes with CAP_SYS_RAWIO, unlike the current filter.

Needless to say, this would not add special case code, but it would
still add a substantial amount of code, probably comparable to this series.

Christoph?

Paolo

(*) that's not immediate for /dev/sd*, because it would require using
the block device file's private_data; that's not possible yet via struct
block_device_operations, but as far as I can tell block devices
themselves do not need the private_data, so it is at least doable.

Bart Van Assche Nov. 16, 2018, 12:37 a.m. UTC | #9

On Fri, 2018-11-16 at 01:26 +0100, Paolo Bonzini wrote:
> Yeah, and there are even already helpers such as
> bpf_get_current_uid_gid.  So that part can be done in a sort-of generic way.
> 
> I can try and do the work, but I'd like some agreement on the design
> first...  For example a more important question is how would the BPF
> filter be attached?  Two possibilities that come to mind are:
> 
> - add it to the /dev/sg* or /dev/sd* struct file(*) via a ioctl, and use
> pass the file descriptor to the unprivileged QEMU after setting up the
> BPF filter, via either fork() or SCM_RIGHTS.  This would be a very nice
> model for privilege separation, but I'm afraid it would not work for
> your use case
> 
> - add BPF programs to cgroups, in the form of a new
> BPF_PROG_TYPE_CGROUP_CDB_FILTER or something like that.  That would also
> work for my usecase, and it seems to be in line with how the network
> guys are doing things.  So it would seem like the way to go.
> 
> Some other details...  Registering the first cgroup-based filter would
> disable the default filter; if multiple filters are attached, the
> outcomes of all filters would be AND-ed, also similarly to how socket
> and sockaddr cgroup BPF works.  Finally, filters would be applied also
> to processes with CAP_SYS_RAWIO, unlike the current filter.
> 
> Needless to say, this would not add special case code, but it would
> still add a substantial amount of code, probably comparable to this series.

All user space interfaces in the Linux kernel for storage that I'm familiar
with not only allow configuration of parameters but also make it easy to
query which parameters have been configured. The existing sysfs and configfs
interfaces demonstrate this. Using BPF to configure SG/IO access has a
significant disadvantage, namely that it is very hard to figure out what has
been configured. Figuring out what has been configured namely requires
disassembling BPF. I'm not sure anyone will be enthusiast about this.

Bart.

Paolo Bonzini Nov. 16, 2018, 7:01 a.m. UTC | #10

On 16/11/18 01:37, Bart Van Assche wrote:
> All user space interfaces in the Linux kernel for storage that I'm familiar
> with not only allow configuration of parameters but also make it easy to
> query which parameters have been configured. The existing sysfs and configfs
> interfaces demonstrate this. Using BPF to configure SG/IO access has a
> significant disadvantage, namely that it is very hard to figure out what has
> been configured. Figuring out what has been configured namely requires
> disassembling BPF. I'm not sure anyone will be enthusiast about this.

Well, that's a problem with BPF in general.  With great power comes
great obscurability.

Paolo

Christoph Hellwig Nov. 16, 2018, 9:32 a.m. UTC | #11

On Mon, Nov 12, 2018 at 11:17:29AM +0100, Paolo Bonzini wrote:
> > Well, that's what we have the security_file_ioctl() LSM hook for so that
> > your security model can arbitrate access to ioctls.
> 
> Doesn't that have TOC-TOU races by design?

If you want to look at the command - yes.  If you just want to filter
read vs write vs ioctl, no.

> Also, what about SG_IO giving write access to files that are only opened
> read-only (and only have read permissions)?

Allowing SG_IO on read-only permissions sounds like a reall bad idea,
filtering or not.

Paolo Bonzini Nov. 16, 2018, 9:45 a.m. UTC | #12

On 16/11/18 10:32, Christoph Hellwig wrote:
> On Mon, Nov 12, 2018 at 11:17:29AM +0100, Paolo Bonzini wrote:
>>> Well, that's what we have the security_file_ioctl() LSM hook for so that
>>> your security model can arbitrate access to ioctls.
>>
>> Doesn't that have TOC-TOU races by design?
> 
> If you want to look at the command - yes.  If you just want to filter
> read vs write vs ioctl, no.

Yeah, but looking at the command is what Ted wants.  The thing that we
did in RHEL was a single sysfs bool that allows unfiltered access,
because it was sort of enough and made the delta very small.  But for
upstream I want to do it right, even if that means learning all that
new-fangled BPF stuff. :)

>> Also, what about SG_IO giving write access to files that are only opened
>> read-only (and only have read permissions)?
> 
> Allowing SG_IO on read-only permissions sounds like a reall bad idea,
> filtering or not.

I would even agree, however it's allowed right now and I would be
surprised if no one was relying on it in good faith ("I'm just doing an
INQUIRY, why do I need to open O_RDWR").  And indeed:

$ sudo chmod a+r /dev/sda
$ strace -e openat sg_inq /dev/sda
openat(AT_FDCWD, "/dev/sda", O_RDONLY|O_NONBLOCK) = 3
                             ^^^^^^^^

So it would be a regression.

Paolo

Christoph Hellwig Nov. 16, 2018, 9:48 a.m. UTC | #13

On Fri, Nov 16, 2018 at 10:45:11AM +0100, Paolo Bonzini wrote:
> Yeah, but looking at the command is what Ted wants.  The thing that we
> did in RHEL was a single sysfs bool that allows unfiltered access,
> because it was sort of enough and made the delta very small.  But for
> upstream I want to do it right, even if that means learning all that
> new-fangled BPF stuff. :)

So what is this magic command?

> I would even agree, however it's allowed right now and I would be
> surprised if no one was relying on it in good faith ("I'm just doing an
> INQUIRY, why do I need to open O_RDWR").  And indeed:
> 
> $ sudo chmod a+r /dev/sda
> $ strace -e openat sg_inq /dev/sda
> openat(AT_FDCWD, "/dev/sda", O_RDONLY|O_NONBLOCK) = 3

Well, not if we only did that for unprivileged opens.

Theodore Ts'o Nov. 16, 2018, 5:35 p.m. UTC | #14

On Fri, Nov 16, 2018 at 08:01:29AM +0100, Paolo Bonzini wrote:
> On 16/11/18 01:37, Bart Van Assche wrote:
> > All user space interfaces in the Linux kernel for storage that I'm familiar
> > with not only allow configuration of parameters but also make it easy to
> > query which parameters have been configured. The existing sysfs and configfs
> > interfaces demonstrate this. Using BPF to configure SG/IO access has a
> > significant disadvantage, namely that it is very hard to figure out what has
> > been configured. Figuring out what has been configured namely requires
> > disassembling BPF. I'm not sure anyone will be enthusiast about this.
> 
> Well, that's a problem with BPF in general.  With great power comes
> great obscurability.

You can also make the same complaint about kernel modules; that it's
impossible to figure exactly what a kernel modules does without
disassembling them.  However, you can a one-line description of what
it does using modinfo:

% modinfo async_pq
filename:       /lib/modules/4.19.0-00022-g831156939ae8/kernel/crypto/async_tx/async_pq.ko
license:        GPL
description:    asynchronous raid6 syndrome generation/validation
srcversion:     529102C736C4FED181C15A8
depends:        raid6_pq,async_tx,async_xor
retpoline:      Y
intree:         Y
name:           async_pq
vermagic:       4.19.0-00022-g831156939ae8 SMP mod_unload modversions 

					       - Ted

Theodore Ts'o Nov. 16, 2018, 5:43 p.m. UTC | #15

On Fri, Nov 16, 2018 at 10:45:11AM +0100, Paolo Bonzini wrote:
> On 16/11/18 10:32, Christoph Hellwig wrote:
> > On Mon, Nov 12, 2018 at 11:17:29AM +0100, Paolo Bonzini wrote:
> >>> Well, that's what we have the security_file_ioctl() LSM hook for so that
> >>> your security model can arbitrate access to ioctls.
> >>
> >> Doesn't that have TOC-TOU races by design?
> > 
> > If you want to look at the command - yes.  If you just want to filter
> > read vs write vs ioctl, no.
> 
> Yeah, but looking at the command is what Ted wants.  The thing that we
> did in RHEL was a single sysfs bool that allows unfiltered access,
> because it was sort of enough and made the delta very small.  But for
> upstream I want to do it right, even if that means learning all that
> new-fangled BPF stuff. :)

I'd argue that a purpose-built eBPF access control facility is
superior to the security_file_ioctl() LSM hook because it can make
available to the authorization function access to the cached results
of the SCSI INQUIRY command, and it avoids needing to duplicate
knowledge of how to parse the parameters of the SG_IO ioctl in the LSM
module as well as in the SCSI stack.

Just because you *could* implement anything in terms of a turing
machine tape doesn't mean that it is good idea.  Similarly, just
because you *can* implement something as an LSM hook doesn't mean that
it's the best design.

> >> Also, what about SG_IO giving write access to files that are only opened
> >> read-only (and only have read permissions)?
> > 
> > Allowing SG_IO on read-only permissions sounds like a reall bad idea,
> > filtering or not.
> 
> I would even agree, however it's allowed right now and I would be
> surprised if no one was relying on it in good faith ("I'm just doing an
> INQUIRY, why do I need to open O_RDWR").  And indeed:
> 
> $ sudo chmod a+r /dev/sda
> $ strace -e openat sg_inq /dev/sda
> openat(AT_FDCWD, "/dev/sda", O_RDONLY|O_NONBLOCK) = 3
>                              ^^^^^^^^
> 
> So it would be a regression.

Ugh, that's... unfortunate.  I suppose we could try to figure out all
of the SCSI commands that would have to be white-listed to be allowed
using O_RDONLY from historical usage, but that would be a huge job,
and it's highly likely we would miss some anyway.  OTOH, this could be
called a security botch that should be fixed, and if we make a best
effort to white list all of the innocuous cases such as SCSI INQUIRY,
maybe that would be acceptible.

					- Ted

Bart Van Assche Nov. 16, 2018, 6:17 p.m. UTC | #16

On Fri, 2018-11-16 at 12:43 -0500, Theodore Y. Ts'o wrote:
> I'd argue that a purpose-built eBPF access control facility is
> superior to the security_file_ioctl() LSM hook because it can make
> available to the authorization function access to the cached results
> of the SCSI INQUIRY command, and it avoids needing to duplicate
> knowledge of how to parse the parameters of the SG_IO ioctl in the LSM
> module as well as in the SCSI stack.

If an eBPF program would decide which SG_IO commands will be executed
and which ones not, does that mean that a SCSI parser would have to be
implemented in eBPF? If so, does that mean that both the eBPF and the
LSM approach share the disadvantage of requiring to do SCSI CDB parsing
outside the SCSI core?

Bart.

Paolo Bonzini Nov. 16, 2018, 9:08 p.m. UTC | #17

On 16/11/18 19:17, Bart Van Assche wrote:
> On Fri, 2018-11-16 at 12:43 -0500, Theodore Y. Ts'o wrote:
>> I'd argue that a purpose-built eBPF access control facility is
>> superior to the security_file_ioctl() LSM hook because it can make
>> available to the authorization function access to the cached results
>> of the SCSI INQUIRY command, and it avoids needing to duplicate
>> knowledge of how to parse the parameters of the SG_IO ioctl in the LSM
>> module as well as in the SCSI stack.
> 
> If an eBPF program would decide which SG_IO commands will be executed
> and which ones not, does that mean that a SCSI parser would have to be
> implemented in eBPF? If so, does that mean that both the eBPF and the
> LSM approach share the disadvantage of requiring to do SCSI CDB parsing
> outside the SCSI core?

The LSM approach cannot do SCSI CDB parsing, unless you add a special
SCSI-specific hook called after parsing the SG_IO argument, due to race
conditions.  I'd rather not do that, however it would have that
disadvantage indeed.

The eBPF approach pushes the policy and the parsing entirely to
userspace, so I'm not sure you can say it's a disadvantage.  It's just a
different design.  If you use SG_IO you're already in for writing
userspace code that handles CDBs, sense data etc.

Paolo

[0/3] SG_IO command filtering via sysfs

Message

Comments