Message ID: 20201012162736.65241-1-nmeeramohide@micron.com (mailing list archive)
Series: add Object Storage Media Pool (mpool)
I don't think this belongs in the kernel. It is a classic case for infrastructure that should be built in userspace. If anything is missing to implement it in userspace with equivalent performance, we need to improve our interfaces, although io_uring should cover pretty much everything you need.
On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> I don't think this belongs in the kernel. It is a classic case for
> infrastructure that should be built in userspace. If anything is
> missing to implement it in userspace with equivalent performance, we
> need to improve our interfaces, although io_uring should cover pretty
> much everything you need.

Hi Christoph,

We previously considered moving the mpool object store code to user-space. However, by implementing mpool as a device driver, we get several benefits in terms of scalability, performance, and functionality. In doing so, we relied only on standard interfaces and did not make any changes to the kernel.

(1) mpool's "mcache map" facility allows us to memory-map (and later unmap) a collection of logically related objects with a single system call. The objects in such a collection are created at different times, are physically disparate, and may even reside on different media class volumes.

For our HSE storage engine application, there are commonly 10's to 100's of objects in a given mcache map, and 75,000 total objects mapped at a given time.

Compared to memory-mapping objects individually, the mcache map facility scales well because it requires only a single system call and a single vm_area_struct to memory-map a complete collection of objects.

(2) The mcache map reaper mechanism proactively evicts object data from the page cache based on object-level metrics. This provides a significant performance benefit for many workloads.

For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read) against our HSE storage engine using the mpool driver in a 5.9 kernel. For each workload, we ran with the reaper turned on and turned off.

For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail latency for reads by 39% and updates by 99%. For workload C, the reaper increased throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These improvements are even more dramatic with earlier kernels.

(3) The mcache map facility can memory-map objects on NVMe ZNS drives that were created using the Zone Append command. This patch set does not support ZNS, but that work is in progress and we will be demonstrating our HSE storage engine running on mpool with ZNS drives at FMS 2020.

(4) mpool's immutable object model allows the driver to support concurrent reading of object data directly and memory-mapped without a performance penalty to verify coherence. This allows background operations, such as LSM-tree compaction, to operate efficiently and without polluting the page cache.

(5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a convenient mechanism for controlling access to and managing the multiple storage volumes, and in the future pmem devices, that may comprise a logical mpool.

Thanks,
Nabeel
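[Editor's note] To make the usage model in (1) concrete, here is a minimal userspace sketch of a single-call collection mapping. The /dev/mpool path follows the naming in this thread, but the request structure and ioctl command are placeholders, not the actual mpool UAPI, and the hypothetical driver is assumed to fill in the mappable length.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder types and ioctl number; the real mpool UAPI differs. */
struct mcache_map_req {
	uint64_t objids[4];	/* logically related objects, possibly on different media classes */
	uint32_t objcnt;
	uint64_t maplen;	/* filled in by the (hypothetical) driver: total mappable length */
};
#define MPIOC_MCACHE_CREATE	_IOWR('o', 1, struct mcache_map_req)

int main(void)
{
	struct mcache_map_req req = { .objids = { 1, 2, 3, 4 }, .objcnt = 4 };
	int fd = open("/dev/mpool/mp1", O_RDWR);
	void *base;

	if (fd < 0)
		return 1;

	/* One ioctl registers the whole collection of objects ... */
	if (ioctl(fd, MPIOC_MCACHE_CREATE, &req) < 0)
		return 1;

	/* ... and one mmap yields a single vm_area_struct covering all of them. */
	base = mmap(NULL, req.maplen, PROT_READ, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return 1;

	munmap(base, req.maplen);
	close(fd);
	return 0;
}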
On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > I don't think this belongs in the kernel. It is a classic case for
> > infrastructure that should be built in userspace. If anything is
> > missing to implement it in userspace with equivalent performance, we
> > need to improve our interfaces, although io_uring should cover pretty
> > much everything you need.
>
> Hi Christoph,
>
> We previously considered moving the mpool object store code to user-space.
> However, by implementing mpool as a device driver, we get several benefits
> in terms of scalability, performance, and functionality. In doing so, we relied
> only on standard interfaces and did not make any changes to the kernel.
>
> (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> a collection of logically related objects with a single system call. The objects in
> such a collection are created at different times, are physically disparate, and may
> even reside on different media class volumes.
>
> For our HSE storage engine application, there are commonly 10's to 100's of
> objects in a given mcache map, and 75,000 total objects mapped at a given time.
>
> Compared to memory-mapping objects individually, the mcache map facility
> scales well because it requires only a single system call and a single vm_area_struct
> to memory-map a complete collection of objects.

Why can't that be a batch of mmap calls on io_uring?

> (2) The mcache map reaper mechanism proactively evicts object data from the page
> cache based on object-level metrics. This provides a significant performance benefit
> for many workloads.
>
> For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> against our HSE storage engine using the mpool driver in a 5.9 kernel.
> For each workload, we ran with the reaper turned on and turned off.
>
> For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> improvements are even more dramatic with earlier kernels.

What metrics proved useful and can the vanilla page cache / page reclaim mechanism be augmented with those metrics?

> (3) The mcache map facility can memory-map objects on NVMe ZNS drives that were
> created using the Zone Append command. This patch set does not support ZNS, but
> that work is in progress and we will be demonstrating our HSE storage engine
> running on mpool with ZNS drives at FMS 2020.
>
> (4) mpool's immutable object model allows the driver to support concurrent reading
> of object data directly and memory-mapped without a performance penalty to verify
> coherence. This allows background operations, such as LSM-tree compaction, to
> operate efficiently and without polluting the page cache.

How is this different than existing background operations / defrag that filesystems perform today? Where are the opportunities to improve those operations?

> (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> convenient mechanism for controlling access to and managing the multiple storage
> volumes, and in the future pmem devices, that may comprise a logical mpool.

Christoph and I have talked about replacing the pmem driver's dependence on device-mapper for pooling. What extensions would be needed for the existing driver arch?
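[Editor's note] For context on the io_uring suggestion: mmap was not an io_uring opcode at the time of this thread, so batching map creation would need new plumbing. The batched-submission model being referred to looks roughly like the liburing sketch below, which queues many per-object reads and submits them with one system call; the file name and per-object layout are assumptions for illustration only.

/* Build (liburing installed): cc batch_read.c -luring */
#include <fcntl.h>
#include <liburing.h>

#define NOBJ 64

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	static char bufs[NOBJ][4096];
	int fd = open("objects.dat", O_RDONLY);	/* hypothetical object file */

	if (fd < 0 || io_uring_queue_init(NOBJ, &ring, 0) < 0)
		return 1;

	/* Queue one read per object ... */
	for (int i = 0; i < NOBJ; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_read(sqe, fd, bufs[i], sizeof(bufs[i]),
				   (off_t)i * 4096);
	}

	/* ... then submit the whole batch with a single system call. */
	io_uring_submit(&ring);

	/* Reap one completion per queued read. */
	for (int i = 0; i < NOBJ; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
			io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}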
Hi Dan,

On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > I don't think this belongs in the kernel. It is a classic case for
> > > infrastructure that should be built in userspace. If anything is
> > > missing to implement it in userspace with equivalent performance, we
> > > need to improve our interfaces, although io_uring should cover pretty
> > > much everything you need.
> >
> > Hi Christoph,
> >
> > We previously considered moving the mpool object store code to user-space.
> > However, by implementing mpool as a device driver, we get several benefits
> > in terms of scalability, performance, and functionality. In doing so, we relied
> > only on standard interfaces and did not make any changes to the kernel.
> >
> > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > a collection of logically related objects with a single system call. The objects in
> > such a collection are created at different times, are physically disparate, and may
> > even reside on different media class volumes.
> >
> > For our HSE storage engine application, there are commonly 10's to 100's of
> > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> >
> > Compared to memory-mapping objects individually, the mcache map facility
> > scales well because it requires only a single system call and a single vm_area_struct
> > to memory-map a complete collection of objects.
>
> Why can't that be a batch of mmap calls on io_uring?

Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the system call overhead of memory-mapping individual objects, versus our mcache map mechanism. However, there is still the scalability issue of having a vm_area_struct for each object (versus one for each mcache map).

We ran YCSB workload C in two different configurations:
Config 1: memory-mapping each individual object
Config 2: memory-mapping a collection of related objects using mcache map

- Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab: 24.8 MB (127,188 objects) for config 1, versus 7.3 MB (37,482 objects) for config 2.

- Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2; we are not sure if that is due to the reduced complexity of searching VMAs during page faults.

> > (2) The mcache map reaper mechanism proactively evicts object data from the page
> > cache based on object-level metrics. This provides a significant performance benefit
> > for many workloads.
> >
> > For example, we ran YCSB workloads B (95/5 read/write mix) and C (100% read)
> > against our HSE storage engine using the mpool driver in a 5.9 kernel.
> > For each workload, we ran with the reaper turned on and turned off.
> >
> > For workload B, the reaper increased throughput 1.77x, while reducing 99.99% tail
> > latency for reads by 39% and updates by 99%. For workload C, the reaper increased
> > throughput by 1.84x, while reducing the 99.99% read tail latency by 63%. These
> > improvements are even more dramatic with earlier kernels.
>
> What metrics proved useful and can the vanilla page cache / page
> reclaim mechanism be augmented with those metrics?

The mcache map facility is designed to cache a collection of related immutable objects with similar lifetimes.
It is best suited for storage applications that run queries against organized collections of immutable objects, such as storage engines and DBs based on SSTables.

Each mcache map is associated with a temperature (pinned, hot, warm, cold), and it is left to the application to tag it appropriately. For our HSE storage engine application, the SSTables in the root/intermediate levels act as a routing table that redirects queries to an appropriate leaf-level SSTable, so the mcache maps corresponding to the root/intermediate-level SSTables can be tagged as pinned/hot.

The mcache reaper tracks the access time of each object in an mcache map. Under memory pressure, the access time is compared to a time-to-live (TTL) metric that is set based on the map's temperature, how close free memory is to the low and high watermarks, and so on. If the object was last accessed outside the TTL window, its pages are evicted from the page cache. We also apply a few other techniques, such as throttling readahead and adding a delay in the page fault handler, so as not to overwhelm the page cache during memory pressure.

In the workloads we run, we have noticed stalls when kswapd does the reclaim, which impacts throughput and tail latencies as described in our last email. The mcache reaper runs proactively and can make better reclaim decisions because it is designed to address a specific class of workloads. We doubt whether the same mechanisms could be employed in the vanilla page cache, which is designed to work for a wide variety of workloads.
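[Editor's note] A simplified model of the reaper's eviction decision described above, assuming illustrative type names, TTL values, and pressure scaling rather than anything taken from the mpool sources:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative names and values only; the real reaper logic differs. */
enum mcache_temp { MC_PINNED, MC_HOT, MC_WARM, MC_COLD };

struct mc_object {
	uint64_t last_access_ns;	/* updated on each access to the object */
};

/*
 * TTL shrinks as the map gets colder and as free memory approaches the low
 * watermark (pressure in [0, 100], higher means more memory pressure).
 */
static uint64_t reap_ttl_ns(enum mcache_temp temp, unsigned int pressure)
{
	static const uint64_t base_ns[] = {
		[MC_HOT]  = 10ULL * 1000000000,	/* 10 s */
		[MC_WARM] =  3ULL * 1000000000,
		[MC_COLD] =  1ULL * 1000000000,
	};

	if (temp == MC_PINNED)
		return UINT64_MAX;	/* pinned maps are never reaped */

	return base_ns[temp] / (1 + pressure / 25);
}

/* Evict an object's pages if it was last accessed outside the TTL window. */
bool should_evict(const struct mc_object *obj, enum mcache_temp temp,
		  unsigned int pressure, uint64_t now_ns)
{
	uint64_t ttl = reap_ttl_ns(temp, pressure);

	return ttl != UINT64_MAX && now_ns - obj->last_access_ns > ttl;
}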
> > (4) mpool's immutable object model allows the driver to support concurrent reading
> > of object data directly and memory-mapped without a performance penalty to verify
> > coherence. This allows background operations, such as LSM-tree compaction, to
> > operate efficiently and without polluting the page cache.
>
> How is this different than existing background operations / defrag
> that filesystems perform today? Where are the opportunities to improve
> those operations?

We haven't measured the benefit of eliminating the coherence check, which isn't needed in our case because objects are immutable. However, the open(2) documentation states that "applications should avoid mixing mmap(2) of files with direct I/O to the same files", which is what we are effectively doing when we directly read from an object that is also in an mcache map.

> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling. What extensions would be
> needed for the existing driver arch?

mpool doesn't extend any existing driver architecture to manage multiple storage volumes. Instead, mpool implements the concept of media classes, where each media class corresponds to a different storage volume. Clients specify a media class when creating an object in an mpool. mpool currently supports only two media classes: "capacity", for storing the bulk of the objects, backed by, for instance, QLC SSDs; and "staging", for storing objects requiring lower latency/higher throughput, backed by, for instance, 3DXP SSDs.

An mpool is accessed via the /dev/mpool/<mpool-name> device file, and the mpool descriptor attached to this device file instance tracks all of its associated media class volumes. mpool relies on device mapper to provide physical device aggregation within a media class volume.
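[Editor's note] A sketch of how a client might express the media class choice described above when creating an object; the enum values and function bodies are hypothetical stand-ins, not the actual mpool client API:

#include <stdint.h>

/* Placeholder names; the real mpool client API differs. */
enum mp_media_class {
	MP_MC_STAGING,		/* lower latency / higher throughput, e.g. 3DXP SSDs */
	MP_MC_CAPACITY,		/* bulk object storage, e.g. QLC SSDs */
};

/* Stubbed object-create call that names a media class (hypothetical). */
static int mp_object_create(int mpool_fd, enum mp_media_class mclass,
			    uint64_t *objid_out)
{
	(void)mpool_fd;
	(void)mclass;
	*objid_out = 0;		/* a real driver would return the new object id */
	return 0;
}

/* Example policy: SSTables in the hot upper LSM-tree levels go to staging,
 * while the large leaf-level SSTables go to capacity. */
int create_sstable_object(int mpool_fd, int lsm_level, uint64_t *objid_out)
{
	enum mp_media_class mc =
		(lsm_level <= 1) ? MP_MC_STAGING : MP_MC_CAPACITY;

	return mp_object_create(mpool_fd, mc, objid_out);
}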
On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> Hi Dan,
>
> On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> > >
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > > > I don't think this belongs in the kernel. It is a classic case for
> > > > infrastructure that should be built in userspace. If anything is
> > > > missing to implement it in userspace with equivalent performance, we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > >
> > > Hi Christoph,
> > >
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > >
> > > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, are physically disparate, and may
> > > even reside on different media class volumes.
> > >
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given time.
> > >
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and a single vm_area_struct
> > > to memory-map a complete collection of objects.
> >
> > Why can't that be a batch of mmap calls on io_uring?
>
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
>
> We ran YCSB workload C in two different configurations:
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
>
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab:
> 24.8 MB (127,188 objects) for config 1, versus 7.3 MB (37,482 objects) for config 2.
>
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2;
> we are not sure if that is due to the reduced complexity of searching VMAs during page faults.

So this gets to the meta question that is giving me pause on this whole proposal: What does Linux get from merging mpool?

What you have above is a decent scalability bug report. That type of pressure to meet new workload needs is how Linux interfaces evolve. However, rather than evolving those interfaces, mpool is a revolutionary replacement that leaves the bugs intact for everyone who does not switch over to mpool.

Consider io_uring as an example where the kernel resisted trends towards userspace I/O engines and instead evolved a solution that maintained kernel control while also achieving similar performance levels. The exercise is useful to identify places where Linux has deficiencies, but wholesale replacing an entire I/O submission model is a direction that leaves the old APIs to rot.
Hey Dan,

On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > convenient mechanism for controlling access to and managing the multiple storage
> > volumes, and in the future pmem devices, that may comprise a logical mpool.
>
> Christoph and I have talked about replacing the pmem driver's
> dependence on device-mapper for pooling.

Was this discussion public or private? If public, please share a pointer to the thread.

I'd really like to understand the problem statement that is leading to pursuing a pmem-native alternative to existing DM.

Thanks,
Mike
On Wed, Oct 21, 2020 at 7:24 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> Hey Dan,
>
> On Fri, Oct 16, 2020 at 6:38 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@micron.com> wrote:
> >
> > > (5) Representing an mpool as a /dev/mpool/<mpool-name> device file provides a
> > > convenient mechanism for controlling access to and managing the multiple storage
> > > volumes, and in the future pmem devices, that may comprise a logical mpool.
> >
> > Christoph and I have talked about replacing the pmem driver's
> > dependence on device-mapper for pooling.
>
> Was this discussion public or private? If public, please share
> a pointer to the thread.
>
> I'd really like to understand the problem statement that is leading to
> pursuing a pmem-native alternative to existing DM.

IIRC it was during the hallway track at a conference. Some of the concern is the flexibility to carve up physical address space but not attach a block device in front of it, and to allow pmem/dax-capable filesystems to mount on something other than a block device.

DM does fit the bill for block-device concatenation and striping, but there's some pressure to have a level of provisioning beneath that. The device-dax facility has already started to grow some physical address space partitioning capabilities this cycle, see commit 60e93dc097f7 ("device-dax: add dis-contiguous resource support"). The question becomes: when / if that support needs to extend across regions, is DM the right tool for that?
On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> What does Linux get from merging mpool?
>
What Linux gets from merging mpool is a generic object store target with some unique and beneficial features:

- the ability to allocate objects from multiple classes of media
- facilities to memory-map (and unmap) collections of related objects with similar lifetimes in a single call
- proactive eviction of object data from the page cache that takes these object relationships and lifetimes into account
- concurrent access to object data directly and memory-mapped, to eliminate page cache pollution from background operations
- a management model that is intentionally patterned after LVM so as to feel familiar to Linux users

The HSE storage engine, which is built on mpool, consistently demonstrates throughputs and latencies in real-world applications that are multiples better than common alternatives. We believe this is a concrete example of the benefits of the mpool object store.

That said, we are very open to ideas on how we can improve the mpool implementation to be better aligned with existing Linux I/O mechanisms.

Thanks,
Nabeel
On Wed, Oct 21, 2020 at 10:11 AM Nabeel Meeramohideen Mohamed (nmeeramohide) <nmeeramohide@micron.com> wrote:
>
> On Tuesday, October 20, 2020 3:36 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > What does Linux get from merging mpool?
> >
> What Linux gets from merging mpool is a generic object store target with some
> unique and beneficial features:

I'll try to make the point a different way. Mpool points out places where the existing APIs fail to scale. Rather than attempt to fix that problem, it proposes to replace the old APIs. However, the old APIs are still there. So now upstream has two maintenance burdens when it could have had just one.

So when I ask "what does Linux get", it is in reference to the fact that Linux gets a compounded maintenance problem, and to whether the benefits of mpool outweigh that burden. Historically Linux has been able to evolve to meet the scaling requirements of new applications, so I am asking whether you have tried to solve the application problem by evolving rather than replacing existing infrastructure. The bar for replacing rather than evolving is high, because that's how core Linux stays relevant.