Message ID: 20201012162736.65241-1-nmeeramohide@micron.com (mailing list archive)
Series: add Object Storage Media Pool (mpool)
I don't think this belongs in the kernel. It is a classic case for infrastructure that should be built in userspace. If anything is missing to implement it in userspace with equivalent performance, we need to improve our interfaces, although io_uring should cover pretty much everything you need.
On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> I don't think this belongs in the kernel. It is a classic case for
> infrastructure that should be built in userspace. If anything is
> missing to implement it in userspace with equivalent performance, we
> need to improve our interfaces, although io_uring should cover pretty
> much everything you need.

Hi Christoph,

We previously considered moving the mpool object store code to user-space. However, by implementing mpool as a device driver, we get several benefits in terms of scalability, performance, and functionality. In doing so, we relied only on standard interfaces and did not make any changes to the kernel.

(1) mpool's "mcache map" facility allows us to memory-map (and later unmap) a collection of logically related objects with a single system call. The objects in such a collection are created at different times, are physically disparate, and may even reside on different media class volumes.

For our HSE storage engine application, there are commonly 10's to 100's of objects in a given mcache map, and 75,000 total objects mapped at a given time.

Compared to memory-mapping objects individually, the mcache map facility scales well because it requires only a single system call and a single vm_area_struct to memory-map a complete collection of objects.

(2) The mcache map reaper mechanism proactively evicts object data from the page cache based on object-level metrics. This provides a significant performance benefit for many workloads.

For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read) against our HSE storage engine using the mpool driver in a 5.9 kernel. For each workload, we ran with the reaper turned on and turned off.

For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail latency for reads by 39% and updates by 99%. For workload C, the reaper increased throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These improvements are even more dramatic with earlier kernels.

(3) The mcache map facility can memory-map objects on NVMe ZNS drives that were created using the Zone Append command. This patch set does not support ZNS, but that work is in progress and we will be demonstrating our HSE storage engine running on mpool with ZNS drives at FMS 2020.

(4) mpool's immutable object model allows the driver to support concurrent reading of object data directly and memory-mapped without a performance penalty to verify coherence. This allows background operations, such as LSM-tree compaction, to operate efficiently and without polluting the page cache.

(5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a convenient mechanism for controlling access to and managing the multiple storage volumes, and in the future pmem devices, that may comprise a logical mpool.

Thanks,
Nabeel
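[Editor's note] To make the usage model in (1) concrete, here is a minimal userspace sketch of a single-call collection mapping. The /dev/mpool path follows the naming in this thread, but the request structure and ioctl command are placeholders, not the actual mpool UAPI, and the hypothetical driver is assumed to fill in the mappable length.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder types and ioctl number; the real mpool UAPI differs. */
struct mcache_map_req {
	uint64_t objids[4];	/* logically related objects, possibly on different media classes */
	uint32_t objcnt;
	uint64_t maplen;	/* filled in by the (hypothetical) driver: total mappable length */
};
#define MPIOC_MCACHE_CREATE	_IOWR('o', 1, struct mcache_map_req)

int main(void)
{
	struct mcache_map_req req = { .objids = { 1, 2, 3, 4 }, .objcnt = 4 };
	int fd = open("/dev/mpool/mp1", O_RDWR);
	void *base;

	if (fd < 0)
		return 1;

	/* One ioctl registers the whole collection of objects ... */
	if (ioctl(fd, MPIOC_MCACHE_CREATE, &req) < 0)
		return 1;

	/* ... and one mmap yields a single vm_area_struct covering all of them. */
	base = mmap(NULL, req.maplen, PROT_READ, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	munmap(base, req.maplen);
	close(fd);
	return 0;
}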
On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > I don't think this belongs in the kernel. It is a classic case for
> > infrastructure that should be built in userspace. If anything is
> > missing to implement it in userspace with equivalent performance, we
> > need to improve our interfaces, although io_uring should cover pretty
> > much everything you need.
>
> Hi Christoph,
>
> We previously considered moving the mpool object store code to user-space.
> However, by implementing mpool as a device driver, we get several benefits
> in terms of scalability, performance, and functionality. In doing so, we relied
> only on standard interfaces and did not make any changes to the kernel.
>
> (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> a collection of logically related objects with a single system call. The objects in
> such a collection are created at different times, are physically disparate, and may
> even reside on different media class volumes.
>
> For our HSE storage engine application, there are commonly 10's to 100's of
> objects in a given mcache map, and 75,000 total objects mapped at a given time.
>
> Compared to memory-mapping objects individually, the mcache map facility
> scales well because it requires only a single system call and a single vm_area_struct
> to memory-map a complete collection of objects.

Why can't that be a batch of mmap calls on io_uring?

> (2) The mcache map reaper mechanism proactively evicts object data from the page
> cache based on object-level metrics. This provides a significant performance benefit
> for many workloads.
>
> For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> against our HSE storage engine using the mpool driver in a 5.9 kernel.
> For each workload, we ran with the reaper turned on and turned off.
>
> For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> improvements are even more dramatic with earlier kernels.

What metrics proved useful and can the vanilla page cache / page reclaim mechanism be augmented with those metrics?

> (3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
> created using the Zone Append command. This patch set does not support ZNS, but
> that work is in progress and we will be demonstrating our HSE storage engine
> running on mpool with ZNS drives at FMS 2020.
>
> (4) mpool's immutable object model allows the driver to support concurrent reading
> of object data directly and memory-mapped without a performance penalty to verify
> coherence. This allows background operations, such as LSM-tree compaction, to
> operate efficiently and without polluting the page cache.

How is this different than existing background operations / defrag that filesystems perform today? Where are the opportunities to improve those operations?

> (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> convenient mechanism for controlling access to and managing the multiple storage
> volumes, and in the future pmem devices, that may comprise a logical mpool.

Christoph and I have talked about replacing the pmem driver's dependence on device-mapper for pooling. What extensions would be needed for the existing driver arch?
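[Editor's note] For context on the io_uring suggestion: mmap was not an io_uring opcode at the time of this thread, so batching map creation would need new plumbing. The batched-submission model being referred to looks roughly like the liburing sketch below, which queues many per-object reads and submits them with one system call; the file name and per-object layout are assumptions for illustration only.

/* Build (liburing installed): cc batch_read.c -luring */
#include <fcntl.h>
#include <liburing.h>

#define NOBJ 64

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	static char bufs[NOBJ][4096];
	int fd = open("objects.dat", O_RDONLY);	/* hypothetical object file */

	if (fd < 0 || io_uring_queue_init(NOBJ, &ring, 0) < 0)
		return 1;

	/* Queue one read per object ... */
	for (int i = 0; i < NOBJ; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_read(sqe, fd, bufs[i], sizeof(bufs[i]),
				   (off_t)i * 4096);
	}

	/* ... then submit the whole batch with a single system call. */
	io_uring_submit(&ring);

	/* Reap one completion per queued read. */
	for (int i = 0; i < NOBJ; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
			io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}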
Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > I don't think this belongs in the kernel. It is a classic case for
> > > infrastructure that should be built in userspace. If anything is
> > > missing to implement it in userspace with equivalent performance, we
> > > need to improve our interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we relied
> > only on standard interfaces and did not make any changes to the kernel.
> >
> > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The objects in
> > such a collection are created at different times, are physically disparate, and may
> > even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and a single vm_area_struct
> > to memory-map a complete collection of objects.
>
> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the system call overhead of memory-mapping individual objects, versus our mcache map mechanism. However, there is still the scalability issue of having a vm_area_struct for each object (versus one for each mcache map).

We ran YCSB workload C in two different configurations:
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab: 24.8 MB (127,188 objects) for config 1, versus 7.3 MB (37,482 objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2; we are not sure if that is due to the reduced complexity of searching VMAs during page faults.

> > (2) The mcache map reaper mechanism proactively evicts object data from the page
> > cache based on object-level metrics. This provides a significant performance benefit
> > for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned on and turned off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> > latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> > throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> > improvements are even more dramatic with earlier kernels.
>
> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related immutable objects with similar lifetimes.
It is best suited for storage applications that run queries against organized collections of immutable objects, such as storage engines and DBs based on SSTables.

Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it is left to the application to tag it appropriately. For our HSE storage engine application, the SSTables in the root/intermediate levels act as a routing table that redirects queries to an appropriate leaf-level SSTable, so the mcache maps corresponding to the root/intermediate-level SSTables can be tagged as pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map. Under memory pressure, the access time is compared to a time-to-live (TTL) metric that is set based on the map's temperature, how close free memory is to the low and high watermarks, and so on. If the object was last accessed outside the TTL window, its pages are evicted from the page cache. We also apply a few other techniques, such as throttling readahead and adding a delay in the page fault handler, so as not to overwhelm the page cache during memory pressure.

In the workloads we run, we have noticed stalls when kswapd does the reclaim, which impacts throughput and tail latencies as described in our last email. The mcache reaper runs proactively and can make better reclaim decisions because it is designed to address a specific class of workloads. We doubt whether the same mechanisms could be employed in the vanilla page cache, which is designed to work for a wide variety of workloads.
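[Editor's note] A simplified model of the reaper's eviction decision described above, assuming illustrative type names, TTL values, and pressure scaling rather than anything taken from the mpool sources:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative names and values only; the real reaper logic differs. */
enum mcache_temp { MC_PINNED, MC_HOT, MC_WARM, MC_COLD };

struct mc_object {
	uint64_t last_access_ns;	/* updated on each access to the object */
};

/*
 * TTL shrinks as the map gets colder and as free memory approaches the low
 * watermark (pressure in [0, 100], higher means more memory pressure).
 */
static uint64_t reap_ttl_ns(enum mcache_temp temp, unsigned int pressure)
{
	static const uint64_t base_ns[] = {
		[MC_HOT]  = 10ULL * 1000000000,	/* 10 s */
		[MC_WARM] =  3ULL * 1000000000,
		[MC_COLD] =  1ULL * 1000000000,
	};

	if (temp == MC_PINNED)
		return UINT64_MAX;	/* pinned maps are never reaped */

	return base_ns[temp] / (1 + pressure / 25);
}

/* Evict an object's pages if it was last accessed outside the TTL window. */
bool should_evict(const struct mc_object *obj, enum mcache_temp temp,
		  unsigned int pressure, uint64_t now_ns)
{
	uint64_t ttl = reap_ttl_ns(temp, pressure);

	return ttl != UINT64_MAX && now_ns - obj->last_access_ns > ttl;
}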
> > (4) mpool's immutable object model allows the driver to support concurrent reading
> > of object data directly and memory-mapped without a performance penalty to verify
> > coherence. This allows background operations, such as LSM-tree compaction, to
> > operate efficiently and without polluting the page cache.
>
> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven't measured the benefit of eliminating the coherence check, which isn't needed in our case because objects are immutable. However, the open(2) documentation states that "applications should avoid mixing mmap(2) of files with direct I/O to the same files", which is what we are effectively doing when we directly read from an object that is also in an mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn't extend any existing driver architecture to manage multiple storage volumes. Instead, mpool implements the concept of media classes, where each media class corresponds to a different storage volume. Clients specify a media class when creating an object in an mpool. mpool currently supports only two media classes: "capacity", for storing the bulk of the objects, backed by, for instance, QLC SSDs; and "staging", for storing objects requiring lower latency/higher throughput, backed by, for instance, 3DXP SSDs.

An mpool is accessed via the /dev/mpool/<mpool-name> device file, and the mpool descriptor attached to this device file instance tracks all of its associated media class volumes. mpool relies on device mapper to provide physical device aggregation within a media class volume.
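[Editor's note] A sketch of how a client might express the media class choice described above when creating an object; the enum values and function bodies are hypothetical stand-ins, not the actual mpool client API:

#include <stdint.h>

/* Placeholder names; the real mpool client API differs. */
enum mp_media_class {
	MP_MC_STAGING,		/* lower latency / higher throughput, e.g. 3DXP SSDs */
	MP_MC_CAPACITY,		/* bulk object storage, e.g. QLC SSDs */
};

/* Stubbed object-create call that names a media class (hypothetical). */
static int mp_object_create(int mpool_fd, enum mp_media_class mclass,
			    uint64_t *objid_out)
{
	(void)mpool_fd;
	(void)mclass;
	*objid_out = 0;		/* a real driver would return the new object id */
	return 0;
}

/* Example policy: SSTables in the hot upper LSM-tree levels go to staging,
 * while the large leaf-level SSTables go to capacity. */
int create_sstable_object(int mpool_fd, int lsm_level, uint64_t *objid_out)
{
	enum mp_media_class mc =
		(lsm_level <= 1) ? MP_MC_STAGING : MP_MC_CAPACITY;

	return mp_object_create(mpool_fd, mc, objid_out);
}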
On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> Hi Dan,
>
> On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> > >
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > I don't think this belongs in the kernel. It is a classic case for
> > > > infrastructure that should be built in userspace. If anything is
> > > > missing to implement it in userspace with equivalent performance, we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > >
> > > Hi Christoph,
> > >
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > >
> > > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, are physically disparate, and may
> > > even reside on different media class volumes.
> > >
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> > >
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and a single vm_area_struct
> > > to memory-map a complete collection of objects.
> >
> > Why can't that be a batch of mmap calls on io_uring?
>
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
>
> We ran YCSB workload C in two different configurations:
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
>
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab:
> 24.8 MB (127,188 objects) for config 1, versus 7.3 MB (37,482 objects) for config 2.
>
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2;
> we are not sure if that is due to the reduced complexity of searching VMAs during page faults.

So this gets to the meta question that is giving me pause on this whole proposal: What does Linux get from merging mpool?

What you have above is a decent scalability bug report. That type of pressure to meet new workload needs is how Linux interfaces evolve. However, rather than evolving those interfaces, mpool is a revolutionary replacement that leaves the bugs intact for everyone who does not switch over to mpool.

Consider io_uring as an example where the kernel resisted trends towards userspace I/O engines and instead evolved a solution that maintained kernel control while also achieving similar performance levels. The exercise is useful to identify places where Linux has deficiencies, but wholesale replacing an entire I/O submission model is a direction that leaves the old APIs to rot.
Hey Dan,

On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling.

Was this discussion public or private? If public, please share a pointer to the thread.

I'd really like to understand the problem statement that is leading to pursuing a pmem-native alternative to existing DM.

Thanks,
Mike
On Wed, Oct 21, 2020 at 7:24 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> Hey Dan,
>
> On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > > convenient mechanism for controlling access to and managing the multiple storage
> > > volumes, and in the future pmem devices, that may comprise a logical mpool.
> >
> > Christoph and I have talked about replacing the pmem driver's
> > dependence on device-mapper for pooling.
>
> Was this discussion public or private? If public, please share
> a pointer to the thread.
>
> I'd really like to understand the problem statement that is leading to
> pursuing a pmem-native alternative to existing DM.

IIRC it was during the hallway track at a conference. Some of the concern is the flexibility to carve up physical address space but not attach a block device in front of it, and to allow pmem/dax-capable filesystems to mount on something other than a block device.

DM does fit the bill for block-device concatenation and striping, but there's some pressure to have a level of provisioning beneath that. The device-dax facility has already started to grow some physical address space partitioning capabilities this cycle, see commit 60e93dc097f7 ("device-dax: add dis-contiguous resource support"). The question becomes: when / if that support needs to extend across regions, is DM the right tool for that?
On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> What does Linux get from merging mpool?
>
What Linux gets from merging mpool is a generic object store target with some unique and beneficial features:

- the ability to allocate objects from multiple classes of media
- facilities to memory-map (and unmap) collections of related objects with similar lifetimes in a single call
- proactive eviction of object data from the page cache that takes these object relationships and lifetimes into account
- concurrent access to object data directly and memory-mapped, to eliminate page cache pollution from background operations
- a management model that is intentionally patterned after LVM so as to feel familiar to Linux users

The HSE storage engine, which is built on mpool, consistently demonstrates throughputs and latencies in real-world applications that are multiples better than common alternatives. We believe this is a concrete example of the benefits of the mpool object store.

That said, we are very open to ideas on how we can improve the mpool implementation to be better aligned with existing Linux I/O mechanisms.

Thanks,
Nabeel
On Wed, Oct 21, 2020 at 10:11 AM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > What does Linux get from merging mpool?
> >
> What Linux gets from merging mpool is a generic object store target with some
> unique and beneficial features:

I'll try to make the point a different way. Mpool points out places where the existing APIs fail to scale. Rather than attempt to fix that problem, it proposes to replace the old APIs. However, the old APIs are still there. So now upstream has two maintenance burdens when it could have had just one.

So when I ask "what does Linux get", it is in reference to the fact that Linux gets a compounded maintenance problem, and to whether the benefits of mpool outweigh that burden. Historically Linux has been able to evolve to meet the scaling requirements of new applications, so I am asking whether you have tried to solve the application problem by evolving rather than replacing existing infrastructure. The bar for replacing rather than evolving is high, because that's how core Linux stays relevant.