Message ID | 1437519983-17306-1-git-send-email-vishal.l.verma@intel.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Dan Williams |
Headers | show |
On 07/22/2015 02:06 AM, Vishal Verma wrote: > 4K is more naturally aligned w.r.t pages, allows us to receive simpler > bios, and rejects creation of filesystems with sub-4K block sizes (DAX > requires -b 4096). > > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> > Cc: Christoph Hellwig <hch@lst.de> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/nvdimm/pmem.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index ade9eb9..7f85537 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -162,6 +162,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns, > return -ENOMEM; > > blk_queue_make_request(pmem->pmem_queue, pmem_make_request); > + blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K); Did you actually test this? I think logical_block_size means the addressing is 4k (and phys == 4k). ie address(1) means read from 4kB offset. As opposed to logical (addressing) 512B but physical=4k ie. atomic unit is 4k. In current code it does not matter match but it might be confusing to external tools. > blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); > blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); > queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue); > Thanks Boaz
On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote: > 4K is more naturally aligned w.r.t pages, allows us to receive simpler > bios, and rejects creation of filesystems with sub-4K block sizes (DAX > requires -b 4096). Should we say 'PAGE_SIZE' instead of 'SZ_4K'? > blk_queue_make_request(pmem->pmem_queue, pmem_make_request); > + blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K); > blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
On Wed, 2015-07-22 at 15:00 +0300, Boaz Harrosh wrote: > On 07/22/2015 02:06 AM, Vishal Verma wrote: > > 4K is more naturally aligned w.r.t pages, allows us to receive > > simpler > > bios, and rejects creation of filesystems with sub-4K block sizes > > (DAX > > requires -b 4096). > > > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> > > --- > > drivers/nvdimm/pmem.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index ade9eb9..7f85537 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pmem.c > > @@ -162,6 +162,7 @@ static int pmem_attach_disk(struct > > nd_namespace_common *ndns, > > return -ENOMEM; > > > > blk_queue_make_request(pmem->pmem_queue, > > pmem_make_request); > > + blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K); > > Did you actually test this? I think logical_block_size means the > addressing > is 4k (and phys == 4k). ie address(1) means read from 4kB offset. > > As opposed to logical (addressing) 512B but physical=4k ie. atomic > unit > is 4k. I ran xfstests on it, yes :) The only problem a 4k (or PAGE_SIZE, rather) logical block size might cause is some extra read-modify-writes in the page cache if something is sending down sub-4K IOs. In return we get less confusion with DAX mounts not failing. > > In current code it does not matter match but it might be confusing to > external tools. mkfs, mount, fdisk, fio, and everything xfstests does are fine with this - what other tools might find it confusing? > > > blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); > > blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); > > queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem > > ->pmem_queue); > > > > Thanks > Boaz >
On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote: > On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote: > > 4K is more naturally aligned w.r.t pages, allows us to receive > > simpler > > bios, and rejects creation of filesystems with sub-4K block sizes > > (DAX > > requires -b 4096). > > Should we say 'PAGE_SIZE' instead of 'SZ_4K'? > Good point, I'll send an update!
On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L <vishal.l.verma@intel.com> wrote: > On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote: >> On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote: >> > 4K is more naturally aligned w.r.t pages, allows us to receive >> > simpler >> > bios, and rejects creation of filesystems with sub-4K block sizes >> > (DAX >> > requires -b 4096). >> >> Should we say 'PAGE_SIZE' instead of 'SZ_4K'? >> > > > Good point, I'll send an update! Hmm, what about cases where the arch PAGE_SIZE > 4K?
On Wed, 2015-07-22 at 10:18 -0700, Dan Williams wrote: > On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L > <vishal.l.verma@intel.com> wrote: > > On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote: > > > On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote: > > > > 4K is more naturally aligned w.r.t pages, allows us to receive > > > > simpler > > > > bios, and rejects creation of filesystems with sub-4K block > > > > sizes > > > > (DAX > > > > requires -b 4096). > > > > > > Should we say 'PAGE_SIZE' instead of 'SZ_4K'? > > > > > > > > > Good point, I'll send an update! > > Hmm, what about cases where the arch PAGE_SIZE > 4K? DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't the rest of the block stack Just Work?
On Wed, Jul 22, 2015 at 10:20 AM, Verma, Vishal L <vishal.l.verma@intel.com> wrote: > On Wed, 2015-07-22 at 10:18 -0700, Dan Williams wrote: >> On Wed, Jul 22, 2015 at 10:16 AM, Verma, Vishal L >> <vishal.l.verma@intel.com> wrote: >> > On Wed, 2015-07-22 at 09:40 -0400, Matthew Wilcox wrote: >> > > On Tue, Jul 21, 2015 at 05:06:23PM -0600, Vishal Verma wrote: >> > > > 4K is more naturally aligned w.r.t pages, allows us to receive >> > > > simpler >> > > > bios, and rejects creation of filesystems with sub-4K block >> > > > sizes >> > > > (DAX >> > > > requires -b 4096). >> > > >> > > Should we say 'PAGE_SIZE' instead of 'SZ_4K'? >> > > >> > >> > >> > Good point, I'll send an update! >> >> Hmm, what about cases where the arch PAGE_SIZE > 4K? > > DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't > the rest of the block stack Just Work? I honestly don't know if the block layer will "just work" with a 64K sector device. Filesystems need special consideration for PAGE_SIZE > fs-block-size... I wouldn't bet on it blindly working, but I haven't looked closely.
On Wed, Jul 22, 2015 at 10:27:46AM -0700, Dan Williams wrote: > On Wed, Jul 22, 2015 at 10:20 AM, Verma, Vishal L > <vishal.l.verma@intel.com> wrote: > > DAX requires fs block size == PAGE_SIZE, and if it is > 4K, shouldn't > > the rest of the block stack Just Work? > > I honestly don't know if the block layer will "just work" with a 64K > sector device. Filesystems need special consideration for PAGE_SIZE > > fs-block-size... I wouldn't bet on it blindly working, but I haven't > looked closely. I'm pretty sure the PowerPC people have been using 64k block size ext3/4 filesytems for a while. I know some people used 16k block size ext3 filesystems when I was working on itanium. The problem with them was that you couldn't mount those filesystems on a kernel with a 4k PAGE_SIZE. For a filesystem on DIMMs, that's less of a problem ... but if you make PAGE_SIZE a config option (like itanium and powerpc do), then it can be a bit of a pain.
On 07/22/2015 08:15 PM, Verma, Vishal L wrote: > > I ran xfstests on it, yes :) > The only problem a 4k (or PAGE_SIZE, rather) logical block size might > cause is some extra read-modify-writes in the page cache if something > is sending down sub-4K IOs. Exactly this is the part I do not like. We are actually byte capable and now we'll do all kind of crazy read-modify-write slowness for no good reason. > In return we get less confusion with DAX mounts not failing. > So I've put a very similar patch for brd that sets the PHYS sector and it soles all these problems all the same, with out the added read-modify-write, and address translations at upper stack. I actually had the same patch for pmem, out-of-tree at the time, and Ross never took it. so it got lost. In anyway I think both brd and pmem should have the same setting since it is trying to solve the same problem. So what ever is decided it should be the same at both places. <> > > mkfs, mount, fdisk, fio, and everything xfstests does are fine with > this - what other tools might find it confusing? > Right, by setting: blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE); All your problems are solved and no un-needed copies are made please see this patch: [c8fa317] brd: Request from fdisk 4k alignment Thanks Boaz
On Wed, 2015-07-22 at 21:18 +0300, Boaz Harrosh wrote: > On 07/22/2015 08:15 PM, Verma, Vishal L wrote: > > > > I ran xfstests on it, yes :) > > The only problem a 4k (or PAGE_SIZE, rather) logical block size > > might > > cause is some extra read-modify-writes in the page cache if > > something > > is sending down sub-4K IOs. > > Exactly this is the part I do not like. We are actually byte capable > and now we'll do all kind of crazy read-modify-write slowness for > no good reason. > > > In return we get less confusion with DAX mounts not failing. > > > > So I've put a very similar patch for brd that sets the PHYS sector > and it soles all these problems all the same, with out the added > read-modify-write, and address translations at upper stack. > Note that with just the physical sector size set to 4K, mkfs only emits a warning, but with a logical sector size of 4K, it fails[1]. I suppose this should be OK as we don't want to force people to create DAX capable file systems only. -Vishal [1] Comparing logical & physical sector size at 4K to only physical at 4K: Logical sector size of 4K (implies physical): $ sudo fdisk -l /dev/pmem0 Disk /dev/pmem0: 8589 MB, 8589934592 bytes, 2097152 sectors Units = sectors of 1 * 4096 = 4096 bytes Sector size (logical/physical): 4096 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes $ sudo mkfs.ext4 -b 1024 /dev/pmem0 mke2fs 1.42.9 (28-Dec-2013) mkfs.ext4: Invalid argument while setting blocksize; too small for device Physical sector size of 4K (with logical = 512): $ sudo fdisk -l /dev/pmem0 Disk /dev/pmem0: 8589 MB, 8589934592 bytes, 16777216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes $ sudo mkfs.ext4 -b 1024 /dev/pmem0 mke2fs 1.42.9 (28-Dec-2013) Warning: specified blocksize 1024 is less than device physical sectorsize 4096 ... ... <succeeds>
> -----Original Message----- > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of > Verma, Vishal L > Sent: Wednesday, July 22, 2015 3:58 PM > To: linux-nvdimm@lists.01.org; boaz@plexistor.com > Cc: hch@lst.de; Wilcox, Matthew R <matthew.r.wilcox@intel.com> > Subject: Re: [PATCH] libnvdimm, pmem: Change default pmem sector size to > 4K > > On Wed, 2015-07-22 at 21:18 +0300, Boaz Harrosh wrote: > > On 07/22/2015 08:15 PM, Verma, Vishal L wrote: > > > > > > I ran xfstests on it, yes :) > > > The only problem a 4k (or PAGE_SIZE, rather) logical block size > > > might > > > cause is some extra read-modify-writes in the page cache if > > > something > > > is sending down sub-4K IOs. > > > > Exactly this is the part I do not like. We are actually byte capable > > and now we'll do all kind of crazy read-modify-write slowness for > > no good reason. > > > > > In return we get less confusion with DAX mounts not failing. > > > > > > > So I've put a very similar patch for brd that sets the PHYS sector > > and it soles all these problems all the same, with out the added > > read-modify-write, and address translations at upper stack. The physical block size is just advisory; software is not required to align to it. The device is still expected to support arbitrary logical block accesses, performing read-modify-writes as needed. If DAX and O_DIRECT capability requires page size (e.g., 4 KiB) alignment, then the only way to ensure compliance is to report that as the logical block size. The open(2) documentation for O_DIRECT says that alignment to the logical block size (BLKSSZGET ioctl) is all that has been required since kernel 2.6. However, using a non-512 byte logical block size runs into a variety of glaring and subtle software bugs that assume 512 byte sectors. It's the same issue "Advanced Format" drives faced with 512e vs. 4Kn. UEFI based systems have a bit more hope of working than legacy BIOS systems. Even if you sidestep all the bugs, you cannot copy an image that contains LBA and Size values between a 512-byte device and a non-512 byte device; they all need to be recalculated. For pmem, if you were using the device on a system with a 4 KiB page size and move it to a system with a larger page size, you'd silently render the data incorrect in its new home by assuming the system page size is the logical block size. Leaving the logical block size at 512 bytes avoids data compatibility issues (like not growing "byte" to mean more than 8 bits even as CPUs grew to have wider registers). We could add a new physical block size-like field that conveys any mandatory alignment for O_DIRECT usage, but that requires software changes too. > Note that with just the physical sector size set to 4K, mkfs only emits > a warning, but with a logical sector size of 4K, it fails[1]. I suppose > this should be OK as we don't want to force people to create DAX > capable file systems only. That's expected - filesystems use integer values for LBAs, and cannot point to a fraction of an LBA. If each LBA points to 4 KiB, then the filesystem block size (aka cluster size) needs to be an integer multiple of 4 KiB.
On 07/23/2015 12:41 AM, Elliott, Robert (Server Storage) wrote: > > We could add a new physical block size-like field that conveys > any usage, but that requires > software changes too. > >> Note that with just the physical sector size set to 4K, mkfs only emits >> a warning, but with a logical sector size of 4K, it fails[1]. I suppose >> this should be OK as we don't want to force people to create DAX >> capable file systems only. > > That's expected - filesystems use integer values for LBAs, and cannot > point to a fraction of an LBA. If each LBA points to 4 KiB, then > the filesystem block size (aka cluster size) needs to be an integer > multiple of 4 KiB. > Guys what are you talking about?? pmem has no such mandatory alignment for O_DIRECT and no restriction of any alignment "what so ever". It is memory and you can do *byte aligned* direct accesses. Please stop the nonsense. (In fact it works, do a dax mount take any pointer open O_DIRECT and do IO, it works. Actually any "buffered" IO is done O_DIRECT, with mmap you have byte access directly to storage) Nor does it have any kind of other restrictions. The only *small* problem is the partition alignment. It must be PAGE_SIZE aligned and that is not because of any hardware limitations, it is because of the ->direct_access() API and the way we chose to implement DAX. So the only thing that needs to do its thing is fdisk, which does so usually. fdisk default is 1M. (And who needs partitions on pmem anyway?) mkfs.ext4 and the "-o dax" mounting is a different story. You must not force anyone to enable support for "-o dax". The FS on pmem is perfectly usable for any block size only the "-o dax" will complain. Cheers Boaz
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index ade9eb9..7f85537 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -162,6 +162,7 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns, return -ENOMEM; blk_queue_make_request(pmem->pmem_queue, pmem_make_request); + blk_queue_logical_block_size(pmem->pmem_queue, SZ_4K); blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX); blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
4K is more naturally aligned w.r.t pages, allows us to receive simpler bios, and rejects creation of filesystems with sub-4K block sizes (DAX requires -b 4096). Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> --- drivers/nvdimm/pmem.c | 1 + 1 file changed, 1 insertion(+)