Message ID | 20181107143745.11845-1-keith.busch@intel.com (mailing list archive)
---|---
State | New, archived
Series | block: Clear kernel memory before copying to user
On Wed, 2018-11-07 at 07:37 -0700, Keith Busch wrote:
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>  		if (ret)
>  			goto cleanup;
>  	} else {
> +		zero_fill_bio(bio);
>  		iov_iter_advance(iter, bio->bi_iter.bi_size);
>  	}

Straightforward, looks good from my view.

Reviewed-by: Laurence Oberman <loberman@redhat.com>
On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
>
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>  		if (ret)
>  			goto cleanup;
>  	} else {
> +		zero_fill_bio(bio);
>  		iov_iter_advance(iter, bio->bi_iter.bi_size);
>  	}

This way looks inefficient because the zero fill should only be required
for a short READ.

Thanks,
Ming Lei
On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > If the kernel allocates a bounce buffer for user read data, this memory
> > needs to be cleared before copying it to the user, otherwise it may leak
> > kernel memory to user space.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  block/bio.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index d5368a445561..a50d59236b19 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> >  		if (ret)
> >  			goto cleanup;
> >  	} else {
> > +		zero_fill_bio(bio);
> >  		iov_iter_advance(iter, bio->bi_iter.bi_size);
> >  	}
>
> This way looks inefficient because the zero fill should only be required
> for a short READ.

Sure, but how do you know that happened before copying the bounce buffer
to user space?

We could zero the pages on allocation if that's better (and doesn't zero
twice if __GFP_ZERO was already provided):

---
diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a1b6383294f4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
 		nr_pages = 1 << map_data->page_order;
 		i = map_data->offset / PAGE_SIZE;
 	}
+
+	if (iov_iter_rw(iter) == READ)
+		gfp_mask |= __GFP_ZERO;

 	while (len) {
 		unsigned int bytes = PAGE_SIZE;
--
On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> blk_update_request() may tell us how much progress was made, :-)
Except when it doesn't, which is 100% of the time for many block
drivers, including nvme.
On Wed, Nov 7, 2018 at 11:19 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> > On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
> > >
> > > If the kernel allocates a bounce buffer for user read data, this memory
> > > needs to be cleared before copying it to the user, otherwise it may leak
> > > kernel memory to user space.
> > >
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > ---
> > >  block/bio.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index d5368a445561..a50d59236b19 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> > >  		if (ret)
> > >  			goto cleanup;
> > >  	} else {
> > > +		zero_fill_bio(bio);
> > >  		iov_iter_advance(iter, bio->bi_iter.bi_size);
> > >  	}
> >
> > This way looks inefficient because the zero fill should only be required
> > for a short READ.
>
> Sure, but how do you know that happened before copying the bounce buffer
> to user space?

blk_update_request() may tell us how much progress was made, :-)

So it should work by calling the following code after the req is completed
and before copying data to userspace:

	__rq_for_each_bio(bio, rq)
		zero_fill_bio(bio);

>
> We could zero the pages on allocation if that's better (and doesn't zero
> twice if __GFP_ZERO was already provided):
>
> ---
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a1b6383294f4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>  		nr_pages = 1 << map_data->page_order;
>  		i = map_data->offset / PAGE_SIZE;
>  	}
> +
> +	if (iov_iter_rw(iter) == READ)
> +		gfp_mask |= __GFP_ZERO;

No big difference compared with before, because most of the time the zero
fill shouldn't be needed, while this patch changes to do it every time.

Thanks,
Ming Lei
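As a rough illustration of the alternative Ming is describing here — assuming a driver could report how many bytes actually arrived, the way SCSI feeds residuals into blk_update_request() — the zeroing could be restricted to READ requests that completed short. The helper name and both byte-count parameters in the sketch below are assumptions for illustration, not code from this series:

```c
#include <linux/blkdev.h>
#include <linux/bio.h>

/*
 * Illustrative sketch only: zero the bounce bio(s) of a READ request that
 * completed short, before blk_rq_unmap_user() copies the bounce pages back
 * to user space.  bytes_done/bytes_requested are assumed to come from a
 * driver that can actually report partial transfers.
 */
static void zero_bounce_on_short_read(struct request *rq,
				      unsigned int bytes_done,
				      unsigned int bytes_requested)
{
	struct bio *bio;

	/* A full transfer overwrote the whole bounce buffer: nothing can leak. */
	if (rq_data_dir(rq) != READ || bytes_done >= bytes_requested)
		return;

	/*
	 * Conservative fallback: wipe the bounce bio(s) entirely.  A finer
	 * version would zero only the tail the device never filled.
	 */
	__rq_for_each_bio(bio, rq)
		zero_fill_bio(bio);
}
```

Keith's objection in the follow-ups is precisely that NVMe completions never supply such a byte count, which is why the applied patch zeroes the bounce bio up front in bio_copy_user_iov() instead.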
On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > blk_update_request() may tell us how much progress was made, :-)
>
> Except when it doesn't, which is 100% of the time for many block
> drivers, including nvme.

Please look at blk_mq_end_request() (<- nvme_complete_rq()), which
calls blk_update_request().

Thanks,
Ming Lei
On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > > blk_update_request() may tell us how much progress was made, :-)
> >
> > Except when it doesn't, which is 100% of the time for many block
> > drivers, including nvme.
>
> Please look at blk_mq_end_request() (<- nvme_complete_rq()), which
> calls blk_update_request().

Huh? That just says 100% of the request size was transferred, no matter
what was actually transferred.

The protocol doesn't have a way to express what transfer size occurred
with a command's completion, and even if it did, there's no way I'd trust
every firmware not to screw it up.
On 11/7/18 7:37 AM, Keith Busch wrote:
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.

Applied, thanks.
On Thu, Nov 8, 2018 at 12:12 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
> > On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
> > >
> > > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > > > blk_update_request() may tell us how much progress was made, :-)
> > >
> > > Except when it doesn't, which is 100% of the time for many block
> > > drivers, including nvme.
> >
> > Please look at blk_mq_end_request() (<- nvme_complete_rq()), which
> > calls blk_update_request().
>
> Huh? That just says 100% of the request size was transferred, no matter
> what was actually transferred.
>
> The protocol doesn't have a way to express what transfer size occurred
> with a command's completion, and even if it did, there's no way I'd trust
> every firmware not to screw it up.

That sounds something like below:

1) userspace asks to read 8 sectors starting from sector 0
2) nvme tells the hardware to do that, and the hardware transfers only 4
   sectors (0~3) of data to memory, but still tells the driver it has done
   8 sectors (0~7).

What if there is real data in sectors 4~7?

Is this an NVMe-specific issue or a common problem in other storage
hardware? SCSI does call blk_update_request() and handles partial
completion.

Thanks,
Ming Lei
On Thu, Nov 08, 2018 at 09:12:59AM +0800, Ming Lei wrote:
> Is this an NVMe-specific issue or a common problem in other storage
> hardware? SCSI does call blk_update_request() and handles partial
> completion.

Not specific to NVMe. An example using SG_IO dumping 2MB of unsanitized
kernel memory:

sg-test.c:
---
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <scsi/scsi.h>

/* ~2MB; the odd size is presumably there so the kernel bounces (copies)
 * the buffer rather than mapping the user pages directly. */
#define SIZE (2 * 1024 * 1024 + 8)

int main(int argc, char **argv)
{
	struct sg_io_hdr io_hdr;
	unsigned char *buffer, cmd[6] = { TEST_UNIT_READY };
	int sg, i;

	if (argc < 2)
		fprintf(stderr, "usage: %s <sgdev>\n", argv[0]), exit(0);

	sg = open(argv[1], O_RDONLY);
	if (sg < 0)
		perror("open"), exit(0);

	buffer = malloc(SIZE);
	if (!buffer)
		fprintf(stderr, "no memory\n"), exit(0);

	/*
	 * TEST_UNIT_READY transfers no data from the device, but dxfer_len
	 * still asks the kernel to copy SIZE bytes back to us.
	 */
	memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
	io_hdr.interface_id = 'S';
	io_hdr.cmd_len = 6;
	io_hdr.cmdp = cmd;
	io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	io_hdr.dxfer_len = SIZE;
	io_hdr.dxferp = buffer;

	/* Pre-zero the buffer so anything non-zero came from the kernel. */
	memset(buffer, 0, SIZE);
	ioctl(sg, SG_IO, &io_hdr);

	for (i = 0; i < SIZE; i++) {
		printf("%02x", buffer[i]);
		if ((i + 1) % 32 == 0)
			printf("\n");
	}
}
--

Test on qemu:
---
$ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
....
--
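For reproduction, the test program above should build with a plain "gcc -o sg-test sg-test.c" (no extra libraries are needed) and can be pointed at a disk node that accepts SG_IO, such as the /dev/sda used above. Because TEST_UNIT_READY moves no data while dxfer_len still requests roughly 2MB, every non-zero byte in the dump is whatever the kernel's bounce pages happened to contain.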
On 11/7/18 6:12 PM, Ming Lei wrote:
> On Thu, Nov 8, 2018 at 12:12 AM Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
>>> On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
>>>>
>>>> On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
>>>>> blk_update_request() may tell us how much progress was made, :-)
>>>>
>>>> Except when it doesn't, which is 100% of the time for many block
>>>> drivers, including nvme.
>>>
>>> Please look at blk_mq_end_request() (<- nvme_complete_rq()), which
>>> calls blk_update_request().
>>
>> Huh? That just says 100% of the request size was transferred, no matter
>> what was actually transferred.
>>
>> The protocol doesn't have a way to express what transfer size occurred
>> with a command's completion, and even if it did, there's no way I'd trust
>> every firmware not to screw it up.
>
> That sounds something like below:
>
> 1) userspace asks to read 8 sectors starting from sector 0
> 2) nvme tells the hardware to do that, and the hardware transfers only 4
>    sectors (0~3) of data to memory, but still tells the driver it has done
>    8 sectors (0~7).
>
> What if there is real data in sectors 4~7?
>
> Is this an NVMe-specific issue or a common problem in other storage
> hardware? SCSI does call blk_update_request() and handles partial
> completion.

I would never believe any of that; it's much better to err on the side of
caution for things like this. blk-mq doesn't even support partial
completions, exactly because most hw doesn't support it. But even for the
ones that do, I would not trust it a lot.
On 08/11/2018 02:22, Keith Busch wrote:
> $ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
> 40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
> c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
> 800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
> 40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
> 80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
> 007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
> ....
> --

Hi Keith,

Can you please add the above to blktests? This would be very useful for
downstream distributions.

Thanks a lot,
Johannes
On Thu, Nov 8, 2018 at 6:07 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 08/11/2018 02:22, Keith Busch wrote:
> > $ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
> > 40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
> > c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
> > 800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
> > 40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
> > 80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
> > 007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
> > ....
> > --
>
> Hi Keith,
>
> Can you please add the above to blktests? This would be very useful for
> downstream distributions.

I guess the issue may depend on the specific QEMU version. I just tried
the test over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27
and did not observe this problem.

Thanks,
Ming Lei
On Thu, Nov 08, 2018 at 07:10:58PM +0800, Ming Lei wrote:
> I guess the issue may depend on the specific QEMU version. I just tried
> the test over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27
> and did not observe this problem.

I actually didn't use virtio-scsi, but it really doesn't matter. FWIW,
this is what I ran:

  # qemu-system-x86_64 --version
  QEMU emulator version 2.10.2(qemu-2.10.2-1.fc27)

  # qemu-system-x86_64 -m 192 -smp 2 -enable-kvm -display none -snapshot \
        -hda /mnt/images/fedora-27.img -nographic \
        -append "console=tty0 console=ttyS0 root=/dev/sda rw" \
        -kernel /boot/vmlinuz-4.18.10-100.fc27.x86_64 \
        -initrd /boot/initramfs-4.18.10-100.fc27.x86_64.img

The file "fedora-27.img" is just a filesystem image of a minimal mock setup
from /var/lib/mock/fedora-27-x86_64/root/. A small memory size makes the
problem easier to observe; otherwise the probability of allocating unclean
pages is lower. That's really the only reason I used qemu, since all my
real machines have so much memory that I never come close to using it all,
making observations more random.
diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a50d59236b19 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
 		if (ret)
 			goto cleanup;
 	} else {
+		zero_fill_bio(bio);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 	}
If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/bio.c | 1 +
 1 file changed, 1 insertion(+)