Message ID: 156896493723.4334.13340481207144634918.stgit@buzz (mailing list archive)
State: New, archived
Series: [v2] mm: implement write-behind policy for sequential file writes
Script for trivial demo in attachment.

$ bash test_writebehind.sh
SIZE
3,2G	dummy

vm.dirty_write_behind = 0
COPY
real	0m3.629s
user	0m0.016s
sys	0m3.613s
Dirty:	3254552 kB
SYNC
real	0m31.953s
user	0m0.002s
sys	0m0.000s

vm.dirty_write_behind = 1
COPY
real	0m32.738s
user	0m0.008s
sys	0m4.047s
Dirty:	2900 kB
SYNC
real	0m0.427s
user	0m0.000s
sys	0m0.004s

vm.dirty_write_behind = 2
COPY
real	0m32.168s
user	0m0.000s
sys	0m4.066s
Dirty:	3088 kB
SYNC
real	0m0.421s
user	0m0.004s
sys	0m0.001s

With vm.dirty_write_behind 1 or 2, files are written even faster, and
during copying the amount of dirty memory always stays at around 16MiB.

On 20/09/2019 10.35, Konstantin Khlebnikov wrote:
> Traditional writeback tries to accumulate as much dirty data as possible.
> This is a worthwhile strategy for extremely short-lived files and for
> batching writes to save battery power. But for workloads where disk
> latency is important, this policy generates periodic disk load spikes
> which increase latency for concurrent operations.
>
> Also, dirty pages in the file cache cannot be reclaimed and reused
> immediately. This way massive I/O like file copying affects memory
> allocation latency.
>
> The present writeback engine allows tuning only dirty data size or
> expiration time. Such tuning cannot eliminate spikes - it just lowers
> and multiplies them. The other option is switching into sync mode, which
> flushes written data right after each write; obviously this has a
> significant performance impact. Such tuning is system-wide and also
> affects memory-mapped and randomly written files, which flusher threads
> handle much better.
>
> This patch implements a write-behind policy which tracks sequential
> writes and starts background writeback when a file has enough dirty
> pages.
>
> Global switch in sysctl vm.dirty_write_behind:
>  =0: disabled, default
>  =1: enabled for strictly sequential writes (append, copying)
>  =2: enabled for all sequential writes
>
> The only parameter is the window size: the maximum amount of dirty pages
> behind the current position and the maximum amount of pages in
> background writeback.
>
> Setup is per-disk in sysfs in file /sys/block/$DISK/bdi/write_behind_kb.
> Default: 16MiB, '0' disables write-behind for this disk.
>
> When the amount of unwritten pages exceeds the window size, write-behind
> starts background writeback for max(excess, max_sectors_kb) and then
> waits for the same amount of background writeback initiated previously.
>
> |<-wait-this->|             |<-send-this->|<---pending-write-behind--->|
> |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
>             current head-^        new head-^            file position-^
>
> Remaining tail pages are flushed at file close if async write-behind was
> started, or if this is a new file at least max_sectors_kb long.
>
> Overall behavior depending on total data size:
>  < max_sectors_kb   - no writes
>  > max_sectors_kb   - write new files in background after close
>  > write_behind_kb  - streaming write, write tail at close
>
> Special cases:
>
> * files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored
>
> * the writing cursor for O_APPEND is aligned to cover previous small
>   appends. Append might happen via multiple files or via a new file
>   each time.
>
> * mode vm.dirty_write_behind=1 ignores non-append writes.
>   This reacts only to completely sequential writes like copying files,
>   writing logs with O_APPEND, or rewriting files after O_TRUNC.
>
> Note: the ext4 feature "auto_da_alloc" also writes cache at file close
> after truncating it to 0 and after renaming one file over another.
>
> Changes since v1 (2017-10-02):
> * rework window management:
>   * change default window 1MiB -> 16MiB
>   * change default request 256KiB -> max_sectors_kb
> * drop always-async behavior for O_NONBLOCK
> * drop handling POSIX_FADV_NOREUSE (should be in separate patch)
> * ignore writes with O_DIRECT, O_SYNC, O_DSYNC
> * align head position for O_APPEND
> * add strictly sequential mode
> * write tail pages for new files
> * make void, keep errors at mapping
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
> ---
On Fri, Sep 20, 2019 at 12:35 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> This patch implements write-behind policy which tracks sequential writes
> and starts background writeback when file have enough dirty pages.

Apart from a spelling error ("contigious"), my only reaction is that I've
wanted this for multi-file writes, not just for single big files.

Yes, single big files may be a simpler case and perhaps the "10% effort
for 90% of the gain", and thus the right thing to do, but I do wonder if
you've looked at simply extending it to cover multiple files when people
copy a whole directory (or unpack a tar-file, or similar).

Now, I hear you say "those are so small these days that it doesn't
matter". And maybe you're right. But particularly for slow media,
triggering good streaming write behavior has been a problem in the past.

So I'm wondering whether the "writebehind" state should perhaps be
considered a process state rather than "struct file" state, and also
start triggering for writing smaller files.

Maybe this was already discussed and people decided that the big-file
case was so much easier that it wasn't worth worrying about writebehind
for multiple files.

            Linus
On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now, I hear you say "those are so small these days that it doesn't
> matter". And maybe you're right. But particularly for slow media,
> triggering good streaming write behavior has been a problem in the
> past.

Which reminds me: the writebehind trigger should likely be tied to the
estimate of the bdi write speed.

We _do_ have that avg_write_bandwidth thing in the bdi_writeback
structure; it sounds like a potentially good idea to try to use that to
estimate when to do writebehind.

No?

            Linus
Hi Konstantin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[cannot apply to v5.3 next-20190919]
[if your patch is applied to the wrong git tree, please drop us a note to
help improve the system. BTW, we also suggest to use '--base' option to
specify the base tree in git format-patch, please see
https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Konstantin-Khlebnikov/mm-implement-write-behind-policy-for-sequential-file-writes/20190920-155606
reproduce: make htmldocs

:::::: branch date: 8 hours ago
:::::: commit date: 8 hours ago

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

[... pre-existing kernel-doc warnings elided ...]

>> mm/filemap.c:3551: warning: Function parameter or member 'iocb' not described in 'generic_write_behind'
>> mm/filemap.c:3551: warning: Function parameter or member 'count' not described in 'generic_write_behind'

[... more pre-existing kernel-doc warnings elided ...]
# https://github.com/0day-ci/linux/commit/e0e7df8d5b71bf59ad93fe75e662c929b580d805
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout e0e7df8d5b71bf59ad93fe75e662c929b580d805
vim +3551 mm/filemap.c

  3534
  3535	/**
  3536	 * generic_write_behind() - writeback dirty pages behind current position.
  3537	 *
  3538	 * This function tracks writing position. If file has enough sequentially
  3539	 * written data it starts background writeback and then waits for previous
  3540	 * writeback initiated some iterations ago.
  3541	 *
  3542	 * Write-behind maintains per-file head cursor in file->f_write_behind and
  3543	 * two windows around: background writeback before and pending data after.
  3544	 *
  3545	 * |<-wait-this->|             |<-send-this->|<---pending-write-behind--->|
  3546	 * |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
  3547	 *             current head-^        new head-^            file position-^
  3548	 */
  3549	void generic_write_behind(struct kiocb *iocb, ssize_t count)
  3550	{
> 3551		struct file *file = iocb->ki_filp;
  3552		struct address_space *mapping = file->f_mapping;
  3553		struct inode *inode = mapping->host;
  3554		struct backing_dev_info *bdi = inode_to_bdi(inode);
  3555		unsigned long window = READ_ONCE(bdi->write_behind_pages);
  3556		pgoff_t head = file->f_write_behind;
  3557		pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
  3558		pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
  3559
  3560		/* Skip if write is random, direct, sync or disabled for disk */
  3561		if ((file->f_mode & FMODE_RANDOM) || !window ||
  3562		    (iocb->ki_flags & (IOCB_DIRECT | IOCB_DSYNC)))
  3563			return;
  3564
  3565		/* Skip non-sequential writes in strictly sequential mode. */
  3566		if (vm_dirty_write_behind < 2 &&
  3567		    iocb->ki_pos != i_size_read(inode) &&
  3568		    !(iocb->ki_flags & IOCB_APPEND))
  3569			return;
  3570
  3571		/* Contigious write and still within window. */
  3572		if (end - head < window)
  3573			return;
  3574
  3575		spin_lock(&file->f_lock);
  3576
  3577		/* Re-read under lock. */
  3578		head = file->f_write_behind;
  3579
  3580		/* Non-contiguous, move head position. */
  3581		if (head > end || begin - head > window) {
  3582			/*
  3583			 * Append might happen though multiple files or via new file
  3584			 * every time. Align head cursor to cover previous appends.
  3585			 */
  3586			if (iocb->ki_flags & IOCB_APPEND)
  3587				begin = roundup(begin - min(begin, window - 1),
  3588						bdi->io_pages);
  3589
  3590			file->f_write_behind = head = begin;
  3591		}
  3592
  3593		/* Still not big enough. */
  3594		if (end - head < window) {
  3595			spin_unlock(&file->f_lock);
  3596			return;
  3597		}
  3598
  3599		/* Write excess and try at least max_sectors_kb if possible */
  3600		end = head + max(end - head - window, min(end - head, bdi->io_pages));
  3601
  3602		/* Set head for next iteration, everything behind will be written. */
  3603		file->f_write_behind = end;
  3604
  3605		spin_unlock(&file->f_lock);
  3606
  3607		/* Start background writeback. */
  3608		__filemap_fdatawrite_range(mapping,
  3609					   (loff_t)head << PAGE_SHIFT,
  3610					   ((loff_t)end << PAGE_SHIFT) - 1,
  3611					   WB_SYNC_NONE);
  3612
  3613		if (head < window)
  3614			return;
  3615
  3616		/* Wait for pages falling behind writeback window. */
  3617		head -= window;
  3618		end -= window;
  3619		__filemap_fdatawait_range(mapping,
  3620					  (loff_t)head << PAGE_SHIFT,
  3621					  ((loff_t)end << PAGE_SHIFT) - 1);
  3622	}
  3623	EXPORT_SYMBOL(generic_write_behind);
  3624

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Hello, Konstantin.

On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> With vm.dirty_write_behind 1 or 2 files are written even faster and

Is the faster speed reproducible?  I don't quite understand why this
would be.

> during copying amount of dirty memory always stays around at 16MiB.

The following is the test part of a slightly modified version of your
test script which should run fine on any modern system.

  for mode in 0 1; do
          if [ $mode == 0 ]; then
                  prefix=''
          else
                  prefix='systemd-run --user --scope -p MemoryMax=64M'
          fi

          echo COPY
          time $prefix cp -r dummy copy

          grep Dirty /proc/meminfo

          echo SYNC
          time sync

          rm -fr copy
  done

and the result looks like the following.

  $ ./test-writebehind.sh
  SIZE
  3.3G    dummy
  COPY

  real    0m2.859s
  user    0m0.015s
  sys     0m2.843s
  Dirty:           3416780 kB
  SYNC

  real    0m34.008s
  user    0m0.000s
  sys     0m0.008s
  COPY
  Running scope as unit: run-r69dca5326a9a435d80e036435ff9e1da.scope

  real    0m32.267s
  user    0m0.032s
  sys     0m4.186s
  Dirty:             14304 kB
  SYNC

  real    0m1.783s
  user    0m0.000s
  sys     0m0.006s

This is how we are solving the massive dirtier problem.  It's easy,
works pretty well and can easily be tailored to specific requirements.

Generic write-behind would definitely have other benefits and also a
bunch of regression possibilities.  I'm not trying to say that
write-behind isn't a good idea, but it'd be useful to consider that a
good portion of the benefits can already be obtained fairly easily.

Thanks.
On 23/09/2019 17.52, Tejun Heo wrote:
> Hello, Konstantin.
>
> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>> With vm.dirty_write_behind 1 or 2 files are written even faster and
>
> Is the faster speed reproducible?  I don't quite understand why this
> would be.

Writing to disk simply starts earlier.

>> during copying amount of dirty memory always stays around at 16MiB.
>
> [... test script and results snipped ...]
>
> This is how we are solving the massive dirtier problem.  It's easy,
> works pretty well and can easily be tailored to specific requirements.
>
> Generic write-behind would definitely have other benefits and also a
> bunch of regression possibilities.  I'm not trying to say that
> write-behind isn't a good idea, but it'd be useful to consider that a
> good portion of the benefits can already be obtained fairly easily.

I'm afraid this could end badly if every simple task like copying a
file requires its own systemd job and container with manual tuning.
Hello,

On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> On 23/09/2019 17.52, Tejun Heo wrote:
>> Hello, Konstantin.
>>
>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>>> With vm.dirty_write_behind 1 or 2 files are written even faster and
>>
>> Is the faster speed reproducible?  I don't quite understand why this
>> would be.
>
> Writing to disk simply starts earlier.

I see.

>> Generic write-behind would definitely have other benefits and also a
>> bunch of regression possibilities.  I'm not trying to say that
>> write-behind isn't a good idea, but it'd be useful to consider that a
>> good portion of the benefits can already be obtained fairly easily.
>
> I'm afraid this could end badly if every simple task like copying a
> file requires its own systemd job and container with manual tuning.

At least the write window size part of it is pretty easy - the range
of acceptable values is fairly wide - and setting up a cgroup and
running a command in it isn't that expensive.  It's not like these
need full-on containers.

That said, yes, there sure are benefits to the kernel being able to
detect and handle these conditions automagically.

Thanks.
On 9/20/19 5:10 PM, Linus Torvalds wrote:
> On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Now, I hear you say "those are so small these days that it doesn't
>> matter".  And maybe you're right.  But particularly for slow media,
>> triggering good streaming write behavior has been a problem in the
>> past.
>
> Which reminds me: the writebehind trigger should likely be tied to the
> estimate of the bdi write speed.
>
> We _do_ have that avg_write_bandwidth thing in the bdi_writeback
> structure, it sounds like a potentially good idea to try to use that
> to estimate when to do writebehind.
>
> No?

I really like the feature, and agree it should be tied to the bdi write
speed.  How about making the tunable an acceptable time of write-behind
dirty?  E.g. if write_behind_msec is 1000, allow 1s of pending dirty
before starting writeback.
On 23/09/2019 18.36, Jens Axboe wrote:
> On 9/20/19 5:10 PM, Linus Torvalds wrote:
>> On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Now, I hear you say "those are so small these days that it doesn't
>>> matter".  And maybe you're right.  But particularly for slow media,
>>> triggering good streaming write behavior has been a problem in the
>>> past.
>>
>> Which reminds me: the writebehind trigger should likely be tied to the
>> estimate of the bdi write speed.
>>
>> We _do_ have that avg_write_bandwidth thing in the bdi_writeback
>> structure, it sounds like a potentially good idea to try to use that
>> to estimate when to do writebehind.
>>
>> No?
>
> I really like the feature, and agree it should be tied to the bdi write
> speed.  How about making the tunable an acceptable time of write-behind
> dirty?  E.g. if write_behind_msec is 1000, allow 1s of pending dirty
> before starting writeback.

I haven't dug into it yet, but IIRC writeback speed estimation has some
problems:

There is no "slow start" - the initial estimate is 100 MiB/s.  This is
especially bad for slow USB disks - right after plugging them in we'll
accumulate too much dirty cache before starting writeback.

And I've seen problems with cgroup writeback: each cgroup has its own
estimate, which doesn't work well for short-living cgroups.
On Mon, Sep 23, 2019 at 3:37 AM kernel test robot <rong.a.chen@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed a -7.3% regression of will-it-scale.per_process_ops due to commit:

Most likely this is caused by the change of struct file layout after
adding the new field.

>
> commit: e0e7df8d5b71bf59ad93fe75e662c929b580d805 ("[PATCH v2] mm: implement write-behind policy for sequential file writes")
> url: https://github.com/0day-ci/linux/commits/Konstantin-Khlebnikov/mm-implement-write-behind-policy-for-sequential-file-writes/20190920-155606
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with following parameters:
>
>         nr_task: 100%
>         mode: process
>         test: open1
>         cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <rong.a.chen@intel.com>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-2019-05-14.cgz/lkp-csl-2ap4/open1/will-it-scale
>
> commit:
>   574cc45397 (" drm main pull for 5.4-rc1")
>   e0e7df8d5b ("mm: implement write-behind policy for sequential file writes")
>
>          574cc4539762561d           e0e7df8d5b71bf59ad93fe75e66
>          ----------------           ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     370456            -7.3%     343238        will-it-scale.per_process_ops
>   71127653            -7.3%   65901758        will-it-scale.workload
>     163498 ± 5%      +26.4%     206691 ± 4%   slabinfo.filp.active_slabs
>     163498 ± 5%      +26.4%     206691 ± 4%   slabinfo.filp.num_slabs
>
> [... full cpuidle/meminfo/numa/sched_debug/perf-stat/perf-profile
>  comparison tables snipped ...]
perf-profile.self.cycles-pp.find_next_zero_bit > 0.12 ± 3% -0.0 0.11 ± 4% perf-profile.self.cycles-pp.nd_jump_root > 0.08 -0.0 0.07 perf-profile.self.cycles-pp.fd_install > 0.06 -0.0 0.05 perf-profile.self.cycles-pp.path_get > 0.12 +0.0 0.13 ± 3% perf-profile.self.cycles-pp.__list_del_entry_valid > 0.07 ± 5% +0.0 0.09 perf-profile.self.cycles-pp.get_partial_node > 0.12 ± 3% +0.0 0.14 ± 3% perf-profile.self.cycles-pp.discard_slab > 0.11 +0.0 0.13 ± 3% perf-profile.self.cycles-pp.blkcg_maybe_throttle_current > 0.06 +0.0 0.08 ± 5% perf-profile.self.cycles-pp.kick_process > 0.28 ± 2% +0.0 0.30 perf-profile.self.cycles-pp.__x64_sys_close > 0.04 ± 57% +0.0 0.07 ± 7% perf-profile.self.cycles-pp.native_irq_return_iret > 0.39 +0.0 0.42 perf-profile.self.cycles-pp.lockref_get_not_dead > 0.53 +0.0 0.57 perf-profile.self.cycles-pp.exit_to_usermode_loop > 0.06 ± 7% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.rcu_segcblist_pend_cbs > 0.01 ±173% +0.0 0.06 ± 11% perf-profile.self.cycles-pp.native_write_msr > 0.28 +0.0 0.33 perf-profile.self.cycles-pp.terminate_walk > 0.27 ± 5% +0.1 0.32 ± 11% perf-profile.self.cycles-pp.ktime_get > 0.00 +0.1 0.05 ± 9% perf-profile.self.cycles-pp._raw_spin_lock_irqsave > 0.43 ± 5% +0.1 0.49 ± 3% perf-profile.self.cycles-pp.security_inode_permission > 0.00 +0.1 0.06 ± 6% perf-profile.self.cycles-pp.____fput > 0.00 +0.1 0.08 perf-profile.self.cycles-pp.__alloc_pages_nodemask > 0.05 +0.1 0.15 ± 3% perf-profile.self.cycles-pp.__mod_zone_page_state > 0.25 ± 3% +0.1 0.35 ± 2% perf-profile.self.cycles-pp.locks_remove_posix > 0.12 ± 17% +0.1 0.23 ± 11% perf-profile.self.cycles-pp.ktime_get_update_offsets_now > 0.14 +0.1 0.25 perf-profile.self.cycles-pp.setup_object_debug > 0.93 +0.1 1.04 perf-profile.self.cycles-pp.__call_rcu > 0.46 +0.1 0.58 perf-profile.self.cycles-pp.__check_object_size > 0.13 ± 3% +0.1 0.26 ± 3% perf-profile.self.cycles-pp.get_page_from_freelist > 0.00 +0.1 0.13 ± 3% perf-profile.self.cycles-pp.legitimize_links > 0.68 +0.1 0.81 
perf-profile.self.cycles-pp.__virt_addr_valid > 0.40 ± 11% +0.2 0.58 perf-profile.self.cycles-pp.may_open > 0.12 +0.2 0.31 perf-profile.self.cycles-pp.check_stack_object > 0.90 +0.3 1.22 perf-profile.self.cycles-pp.task_work_run > 0.30 +0.4 0.73 perf-profile.self.cycles-pp.___slab_alloc > 2.88 +0.7 3.59 perf-profile.self.cycles-pp.__alloc_file > 2.27 +1.1 3.38 perf-profile.self.cycles-pp.new_slab > 7.59 ± 3% +1.3 8.89 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath > 11.04 +1.7 12.73 perf-profile.self.cycles-pp.rcu_cblist_dequeue > > > > will-it-scale.per_process_ops > > 385000 +-+----------------------------------------------------------------+ > | .+ .+. .+ | > 380000 +-+.++ : +.++.+ +.++.+ : | > 375000 +-+ : + : | > | +.+ .+.++.+ ++. .+.+.+ .+.+.+ .+.| > 370000 +-+ +.+ +.+.++ + + | > 365000 +-+ | > | | > 360000 +-+ | > 355000 +-+ | > | | > 350000 +-+ | > 345000 O-+ OO O O OO O O OO O O O | > | O O O O O OO O O | > 340000 +-+----------------------------------------------------------------+ > > > will-it-scale.workload > > 7.4e+07 +-+---------------------------------------------------------------+ > | .+ .+ .+ | > 7.3e+07 +-++.+ : ++.+.+ +.+.++ : | > 7.2e+07 +-+ : + : | > | ++. .++.+.+ +.+ +.+.+.++.+.+. +.| > 7.1e+07 +-+ +.+ +.+.+.+ + | > 7e+07 +-+ | > | | > 6.9e+07 +-+ | > 6.8e+07 +-+ | > | | > 6.7e+07 +-+ O O | > 6.6e+07 O-OO O O OO O OO O O O O O O O | > | O O O O | > 6.5e+07 +-+---------------------------------------------------------------+ > > > [*] bisect-good sample > [O] bisect-bad sample > > > > Disclaimer: > Results have been estimated based on internal Intel analysis and are provided > for informational purposes only. Any difference in system hardware or software > design or configuration may affect actual performance. > > > Thanks, > Rong Chen >
On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote: > On 23/09/2019 17.52, Tejun Heo wrote: > > Hello, Konstantin. > > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote: > > > With vm.dirty_write_behind 1 or 2 files are written even faster and > > > > Is the faster speed reproducible? I don't quite understand why this > > would be. > > Writing to disk simply starts earlier. Stupid question: how is this any different to simply winding down our dirty writeback and throttling thresholds like so: # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes to start background writeback when there's 100MB of dirty pages in memory, and then: # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes So that writers are directly throttled at 200MB of dirty pages in memory? This effectively gives us global writebehind behaviour with a 100-200MB cache write burst for initial writes. And, really, such strict writebehind behaviour is going to cause all sorts of unintended problems with filesystems because there will be adverse interactions with delayed allocation. We need a substantial amount of dirty data to be cached for writeback for fragmentation minimisation algorithms to be able to do their job.... Cheers, Dave.
On 24/09/2019 10.39, Dave Chinner wrote: > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote: >> On 23/09/2019 17.52, Tejun Heo wrote: >>> Hello, Konstantin. >>> >>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote: >>>> With vm.dirty_write_behind 1 or 2 files are written even faster and >>> >>> Is the faster speed reproducible? I don't quite understand why this >>> would be. >> >> Writing to disk simply starts earlier. > > Stupid question: how is this any different to simply winding down > our dirty writeback and throttling thresholds like so: > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes > > to start background writeback when there's 100MB of dirty pages in > memory, and then: > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes > > So that writers are directly throttled at 200MB of dirty pages in > memory? > > This effectively gives us global writebehind behaviour with a > 100-200MB cache write burst for initial writes. Global limits affect all dirty pages, including memory-mapped and randomly touched ones. Write-behind aims only at sequential streams. > > ANd, really such strict writebehind behaviour is going to cause all > sorts of unintended problesm with filesystems because there will be > adverse interactions with delayed allocation. We need a substantial > amount of dirty data to be cached for writeback for fragmentation > minimisation algorithms to be able to do their job.... I think most sequentially written files never change after close. Apart from knowing the final size of huge files (>16MB in my patch), there should be no difference for delayed allocation. Write-behind could probably provide a hint about the streaming pattern: pass something like "MSG_MORE" into the writeback call. > > Cheers, > > Dave. >
On 21/09/2019 02.05, Linus Torvalds wrote: > On Fri, Sep 20, 2019 at 12:35 AM Konstantin Khlebnikov > <khlebnikov@yandex-team.ru> wrote: >> >> This patch implements write-behind policy which tracks sequential writes >> and starts background writeback when file have enough dirty pages. > > Apart from a spelling error ("contigious"), my only reaction is that > I've wanted this for the multi-file writes, not just for single big > files. > > Yes, single big files may be a simpler and perhaps the "10% effort for > 90% of the gain", and thus the right thing to do, but I do wonder if > you've looked at simply extending it to cover multiple files when > people copy a whole directory (or unpack a tar-file, or similar). > > Now, I hear you say "those are so small these days that it doesn't > matter". And maybe you're right. But particularly for slow media, > triggering good streaming write behavior has been a problem in the > past. > > So I'm wondering whether the "writebehind" state should perhaps be > considered to be a process state, rather than "struct file" state, and > also start triggering for writing smaller files. It's simple to extend the existing state with a per-task counter of sequential writes to detect patterns like unpacking a tarball with small files. After reaching some threshold, write-behind could flush files at close. But in this case it's hard to wait for previous writes to limit the amount of requests and pages in writeback for each stream. Theoretically we could build a chain of inodes for delaying and batching. > > Maybe this was already discussed and people decided that the big-file > case was so much easier that it wasn't worth worrying about > writebehind for multiple files. > > Linus >
On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@fromorbit.com> wrote: > > Stupid question: how is this any different to simply winding down > our dirty writeback and throttling thresholds like so: > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes Our dirty_background stuff is very questionable, but it exists (and has those insane defaults) because of various legacy reasons. But it probably _shouldn't_ exist any more (except perhaps as a last-ditch hard limit), and I don't think it really ends up being the primary throttling any more in many cases. It used to make sense to make it a "percentage of memory" back when we were talking old machines with 8MB of RAM, and having an appreciable percentage of memory dirty was "normal". And we've kept that model and not touched it, because some benchmarks really want enormous amounts of dirty data (particularly various dirty shared mappings). But our default really is fairly crazy and questionable. 10% of memory being dirty may be ok when you have a small amount of memory, but it's rather less sane if you have gigs and gigs of RAM. Of course, SSD's made it work slightly better again, but our "dirty_background" stuff really is legacy and not very good. The whole dirty limit when seen as percentage of memory (which is our default) is particularly questionable, but even when seen as total bytes is bad. If you have slow filesystems (say, FAT on a USB stick), the limit should be very different from a fast one (eg XFS on a RAID of proper SSDs). So the limit really needs to be per-bdi, not some global ratio or bytes. As a result we've grown various _other_ heuristics over time, and the simplistic dirty_background stuff is only a very small part of the picture these days. To the point of almost being irrelevant in many situations, I suspect.
> to start background writeback when there's 100MB of dirty pages in > memory, and then: > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes The thing is, that also accounts for dirty shared mmap pages. And it really will kill some benchmarks that people take very very seriously. And 200MB is peanuts when you're doing a benchmark on some studly machine that has a million iops per second, and 200MB of dirty data is nothing. Yet it's probably much too big when you're on a workstation that still has rotational media. And the whole memcg code obviously makes this even more complicated. Anyway, the end result of all this is that we have that balance_dirty_pages() that is pretty darn complex and I suspect very few people understand everything that goes on in that function. So I think that the point of any write-behind logic would be to avoid triggering the global limits as much as humanly possible - not just getting the simple cases to write things out more quickly, but to remove the complex global limit questions from (one) common and fairly simple case. Now, whether write-behind really _does_ help that, or whether it's just yet another tweak and complication, I can't actually say. But I don't think 'dirty_background_bytes' is really an argument against write-behind, it's just one knob on the very complex dirty handling we have. Linus
On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote: > On 24/09/2019 10.39, Dave Chinner wrote: > > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote: > > > On 23/09/2019 17.52, Tejun Heo wrote: > > > > Hello, Konstantin. > > > > > > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote: > > > > > With vm.dirty_write_behind 1 or 2 files are written even faster and > > > > > > > > Is the faster speed reproducible? I don't quite understand why this > > > > would be. > > > > > > Writing to disk simply starts earlier. > > > > Stupid question: how is this any different to simply winding down > > our dirty writeback and throttling thresholds like so: > > > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes > > > > to start background writeback when there's 100MB of dirty pages in > > memory, and then: > > > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes > > > > So that writers are directly throttled at 200MB of dirty pages in > > memory? > > > > This effectively gives us global writebehind behaviour with a > > 100-200MB cache write burst for initial writes. > > Global limits affect all dirty pages including memory-mapped and > randomly touched. Write-behind aims only into sequential streams. There are apps that do sequential writes via mmap()d files. They should do writebehind too, yes? > > ANd, really such strict writebehind behaviour is going to cause all > > sorts of unintended problesm with filesystems because there will be > > adverse interactions with delayed allocation. We need a substantial > > amount of dirty data to be cached for writeback for fragmentation > > minimisation algorithms to be able to do their job.... > > I think most sequentially written files never change after close. There are lots of apps that write zeros to initialise and allocate space, then go write real data to them. Database WAL files are commonly initialised like this... 
> Except of knowing final size of huge files (>16Mb in my patch) > there should be no difference for delayed allocation. There is, because you throttle the writes down such that there is only 16MB of dirty data in memory. Hence filesystems will only typically allocate in 16MB chunks as that's all the delalloc range spans. I'm not so concerned for XFS here, because our speculative preallocation will handle this just fine, but for ext4 and btrfs it's going to interleave the allocation of concurrent streaming writes and fragment the crap out of the files. In general, the smaller you make the individual file writeback window, the worse the fragmentation problem gets.... > Probably write behind could provide hint about streaming pattern: > pass something like "MSG_MORE" into writeback call. How does that help when we've only got dirty data and block reservations up to EOF which is no more than 16MB away? Cheers, Dave.
On Tue, Sep 24, 2019 at 12:08:04PM -0700, Linus Torvalds wrote: > On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@fromorbit.com> wrote: > > > > Stupid question: how is this any different to simply winding down > > our dirty writeback and throttling thresholds like so: > > > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes > > Our dirty_background stuff is very questionable, but it exists (and > has those insane defaults) because of various legacy reasons. That's not what I was asking about. The context is in the previous lines you didn't quote: > > > > Is the faster speed reproducible? I don't quite understand why this > > > > would be. > > > > > > Writing to disk simply starts earlier. > > > > Stupid question: how is this any different to simply winding down > > our dirty writeback and throttling thresholds like so: i.e. I'm asking about the reasons for the performance differential not asking for an explanation of what writebehind is. If the performance differential really is caused by writeback starting sooner, then winding down dirty_background_bytes should produce exactly the same performance because it will start writeback -much faster-. If it doesn't, then the assertion that the difference is caused by earlier writeout is questionable and the code may not actually be doing what is claimed.... Basically, I'm asking for proof that the explanation is correct. > > to start background writeback when there's 100MB of dirty pages in > > memory, and then: > > > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes > > The thing is, that also accounts for dirty shared mmap pages. And it > really will kill some benchmarks that people take very very seriously. Yes, I know that. I'm not suggesting that we do this, [snip] > Anyway, the end result of all this is that we have that > balance_dirty_pages() that is pretty darn complex and I suspect very > few people understand everything that goes on in that function. 
I'd agree with you there - most of the ground work for the balance_dirty_pages IO throttling feedback loop was all based on concepts I developed to solve dirty page writeback thrashing problems on Irix back in 2003. The code we have in Linux was written by Fenguang Wu with help from a lot of people, but the underlying concepts of delegating IO to dedicated writeback threads that calculate and track page cleaning rates (BDI writeback rates) and then throttling incoming page dirtying rate to the page cleaning rate all came out of my head.... So, much as it may surprise you, I am one of the few people who do actually understand how that whole complex mass of accounting and feedback is supposed to work. :) > Now, whether write-behind really _does_ help that, or whether it's > just yet another tweak and complication, I can't actually say. Neither can I at this point - I lack the data and that's why I was asking if there was a perf difference with the existing limits wound right down. Knowing whether the performance difference is simply a result of starting writeback IO sooner tells me an awful lot about what other behaviour is happening as a result of the changes in this patch. > But I > don't think 'dirty_background_bytes' is really an argument against > write-behind, it's just one knob on the very complex dirty handling we > have. Never said it was - just trying to determine if a one-line explanation is true or not. Cheers, Dave.
On 25/09/2019 10.18, Dave Chinner wrote: > On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote: >> On 24/09/2019 10.39, Dave Chinner wrote: >>> On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote: >>>> On 23/09/2019 17.52, Tejun Heo wrote: >>>>> Hello, Konstantin. >>>>> >>>>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote: >>>>>> With vm.dirty_write_behind 1 or 2 files are written even faster and >>>>> >>>>> Is the faster speed reproducible? I don't quite understand why this >>>>> would be. >>>> >>>> Writing to disk simply starts earlier. >>> >>> Stupid question: how is this any different to simply winding down >>> our dirty writeback and throttling thresholds like so: >>> >>> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes >>> >>> to start background writeback when there's 100MB of dirty pages in >>> memory, and then: >>> >>> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes >>> >>> So that writers are directly throttled at 200MB of dirty pages in >>> memory? >>> >>> This effectively gives us global writebehind behaviour with a >>> 100-200MB cache write burst for initial writes. >> >> Global limits affect all dirty pages including memory-mapped and >> randomly touched. Write-behind aims only into sequential streams. > > There are apps that do sequential writes via mmap()d files. > They should do writebehind too, yes? I see no reason for that. This is a different scenario. Mmap has no clear signal about "end of write", only a page fault at the beginning. Theoretically we could implement a similar sliding window and start writeback on consecutive page faults. But applications that use memory-mapped files probably know better what to do with this data. I prefer to leave them alone for now. > >>> ANd, really such strict writebehind behaviour is going to cause all >>> sorts of unintended problesm with filesystems because there will be >>> adverse interactions with delayed allocation.
We need a substantial >>> amount of dirty data to be cached for writeback for fragmentation >>> minimisation algorithms to be able to do their job.... >> >> I think most sequentially written files never change after close. > > There are lots of apps that write zeros to initialise and allocate > space, then go write real data to them. Database WAL files are > commonly initialised like this... Those zeros are just a bunch of dirty pages which have to be written. Sync and memory pressure will do that, so why shouldn't write-behind? > >> Except of knowing final size of huge files (>16Mb in my patch) >> there should be no difference for delayed allocation. > > There is, because you throttle the writes down such that there is > only 16MB of dirty data in memory. Hence filesystems will only > typically allocate in 16MB chunks as that's all the delalloc range > spans. > > I'm not so concerned for XFS here, because our speculative > preallocation will handle this just fine, but for ext4 and btrfs > it's going to interleave the allocate of concurrent streaming writes > and fragment the crap out of the files. > > In general, the smaller you make the individual file writeback > window, the worse the fragmentation problems gets.... AFAIR ext4 already preallocates extents beyond EOF too. But this must be carefully tested for all modern filesystems, for sure. > >> Probably write behind could provide hint about streaming pattern: >> pass something like "MSG_MORE" into writeback call. > > How does that help when we've only got dirty data and block > reservations up to EOF which is no more than 16MB away? The block allocator should interpret this flag as "more data is expected" and preallocate an extent bigger than the data, beyond EOF. > > Cheers, > > Dave. >
On Wed, Sep 25, 2019 at 05:18:54PM +1000, Dave Chinner wrote: > > > ANd, really such strict writebehind behaviour is going to cause all > > > sorts of unintended problesm with filesystems because there will be > > > adverse interactions with delayed allocation. We need a substantial > > > amount of dirty data to be cached for writeback for fragmentation > > > minimisation algorithms to be able to do their job.... > > > > I think most sequentially written files never change after close. > > There are lots of apps that write zeros to initialise and allocate > space, then go write real data to them. Database WAL files are > commonly initialised like this... Fortunately, most of the time enterprise database files are initialized with an fd which is then kept open. And it's only a single file. So that's a heuristic that's not too bad to handle so long as it's only triggered when there are no open file descriptors on said inode. If something is still keeping the file open, then we do need to be very careful about writebehind. That being said, with databases, they are going to be calling fdatasync(2) and fsync(2) all the time, so it's unlikely writebehind is going to be that much of an issue, so long as the max writebehind knob isn't set too insanely low. It's been over ten years since I last looked at this, and so things may have very likely changed, but one enterprise database I looked at would fallocate 32M, and then write 32M of zeros to make sure blocks were marked as initialized, so that further random writes wouldn't cause metadata updates. Now, there *are* applications which log to files via append, and in the worst case, they don't actually keep an fd open. Examples of this would include scripts that call logger(1) very often. But in general, taking into account whether or not there is still an fd holding the inode open to influence how aggressively we do writeback does make sense. Finally, we should remember that this will impact battery life on laptops.
Perhaps not so much now that most laptops have SSD's instead of HDD's, but aggressive writebehind does certainly have tradeoffs, and what makes sense for a NVMe attached SSD is going to be very different for a $2 USB thumb drive picked up at the checkout aisle of Staples.... - Ted
On Wed, Sep 25, 2019 at 11:15:30AM +0300, Konstantin Khlebnikov wrote: > On 25/09/2019 10.18, Dave Chinner wrote: > > On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote: > > > On 24/09/2019 10.39, Dave Chinner wrote: > > > > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote: > > > > > On 23/09/2019 17.52, Tejun Heo wrote: > > > > > > Hello, Konstantin. > > > > > > > > > > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote: > > > > > > > With vm.dirty_write_behind 1 or 2 files are written even faster and > > > > > > > > > > > > Is the faster speed reproducible? I don't quite understand why this > > > > > > would be. > > > > > > > > > > Writing to disk simply starts earlier. > > > > > > > > Stupid question: how is this any different to simply winding down > > > > our dirty writeback and throttling thresholds like so: > > > > > > > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes > > > > > > > > to start background writeback when there's 100MB of dirty pages in > > > > memory, and then: > > > > > > > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes > > > > > > > > So that writers are directly throttled at 200MB of dirty pages in > > > > memory? > > > > > > > > This effectively gives us global writebehind behaviour with a > > > > 100-200MB cache write burst for initial writes. > > > > > > Global limits affect all dirty pages including memory-mapped and > > > randomly touched. Write-behind aims only into sequential streams. > > > > There are apps that do sequential writes via mmap()d files. > > They should do writebehind too, yes? > > I see no reason for that. This is different scenario. It is? > Mmap have no clear signal about "end of write", only page fault at > beginning. Theoretically we could implement similar sliding window and > start writeback on consequent page faults. Sequential IO doing pwrite() in a loop has no clear signal about "end of write", either.
It's exactly the same as doing a memset(0) on a mmap()d region to zero the file. i.e. the write doesn't stop until EOF is reached... > But applications who use memory mapped files probably knows better what > to do with this data. I prefer to leave them alone for now. By that argument, we shouldn't have readahead for mmap() access or even read-around for page faults. We can track read and write faults exactly for mmap(), so if you are tracking sequential page dirtying for writebehind we can do that just as easily for mmap (via ->page_mkwrite) as we can for write() IO. > > > > ANd, really such strict writebehind behaviour is going to cause all > > > > sorts of unintended problesm with filesystems because there will be > > > > adverse interactions with delayed allocation. We need a substantial > > > > amount of dirty data to be cached for writeback for fragmentation > > > > minimisation algorithms to be able to do their job.... > > > > > > I think most sequentially written files never change after close. > > > > There are lots of apps that write zeros to initialise and allocate > > space, then go write real data to them. Database WAL files are > > commonly initialised like this... > > Those zeros are just bunch of dirty pages which have to be written. > Sync and memory pressure will do that, why write-behind don't have to? Huh? IIUC, the writebehind flag is a global behaviour flag for the kernel - everything does writebehind or nothing does it, right? Hence if you turn on writebehind, the writebehind will write the zeros to disk before real data can be written. We no longer have zeroing as something that sits in the cache until it's overwritten with real data - that file now gets written twice and it delays the application from actually writing real data until the zeros are all on disk. Strict writebehind without the ability to burst temporary/short-term data/state into the cache is going to cause a lot of performance regressions in applications....
> > > Except for knowing the final size of huge files (>16Mb in my patch)
> > > there should be no difference for delayed allocation.
> >
> > There is, because you throttle the writes down such that there is
> > only 16MB of dirty data in memory. Hence filesystems will only
> > typically allocate in 16MB chunks as that's all the delalloc range
> > spans.
> >
> > I'm not so concerned for XFS here, because our speculative
> > preallocation will handle this just fine, but for ext4 and btrfs
> > it's going to interleave the allocation of concurrent streaming
> > writes and fragment the crap out of the files.
> >
> > In general, the smaller you make the individual file writeback
> > window, the worse the fragmentation problem gets....
>
> AFAIR ext4 already preallocates extents beyond EOF too.

Only via fallocate(), not for delayed allocation.

> > > Probably write-behind could provide a hint about the streaming
> > > pattern: pass something like "MSG_MORE" into the writeback call.
> >
> > How does that help when we've only got dirty data and block
> > reservations up to EOF, which is no more than 16MB away?
>
> The block allocator should interpret this flag as "more data are
> expected" and preallocate an extent bigger than the data and beyond EOF.

Can't do that: delayed allocation is a 2-phase operation that is not
separable from the context that is dirtying the pages. The space is
_accounted as used_ during the write() context, but the _physical
allocation_ of that space is done in the writeback context. We cannot
reserve more space in the writeback context, because we may already be
at ENOSPC by the time writeback comes along.

Hence writeback must already have all the space it needs to write back
the dirty pages in memory accounted as used space before it starts
running physical allocations. IOWs, we cannot magically allocate more
space than was reserved for the data being written because of some
special flag from the writeback code.
That way lies angry users, because we lost their data due to ENOSPC
issues in writeback.

Cheers,

Dave.
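Dave's two-phase description of delayed allocation can be sketched as a toy userspace model: space is accounted as used in the write() context, and the writeback context may only convert that reservation into physical blocks, never grow it. All names and numbers below are illustrative stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical filesystem state: free blocks and outstanding delalloc
 * reservations made at write() time. */
static long fs_free_blocks = 100;
static long reserved_blocks;

/* Phase 1: write() context - reserve space or fail with ENOSPC.
 * This is the only point where the writer can be told "no space". */
static bool write_reserve(long blocks)
{
	if (blocks > fs_free_blocks - reserved_blocks)
		return false;		/* -ENOSPC reported to the writer */
	reserved_blocks += blocks;
	return true;
}

/* Phase 2: writeback context - convert reservation to physical blocks.
 * It may only consume what phase 1 reserved; it cannot grow the
 * allocation, because free space may already be gone by now. */
static bool writeback_allocate(long blocks)
{
	if (blocks > reserved_blocks)
		return false;		/* would risk losing data at ENOSPC */
	reserved_blocks -= blocks;
	fs_free_blocks -= blocks;
	return true;
}
```

This is why a "MSG_MORE"-style hint cannot make writeback preallocate a bigger extent: by the time phase 2 runs, only the already-reserved blocks are guaranteed to exist.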
diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d5697cf5..f16be656cbd5 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -30,6 +30,11 @@ read_ahead_kb (read-write)

	Size of the read-ahead window in kilobytes

+write_behind_kb (read-write)
+
+	Size of the write-behind window in kilobytes.
+	0 -> disable write-behind for this disk.
+
 min_ratio (read-write)

	Under normal circumstances each device is given a part of the
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..a275fa42579f 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -35,6 +35,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirtytime_expire_seconds
 - dirty_writeback_centisecs
+- dirty_write_behind
 - drop_caches
 - extfrag_threshold
 - hugetlb_shm_group
@@ -210,6 +211,20 @@ out to disk.  This tunable expresses the interval between those wakeups, in

 Setting this to zero disables periodic writeback altogether.

+dirty_write_behind
+==================
+
+This controls the write-behind writeback policy - automatic background
+writeback for sequentially written data behind the current writing position.
+
+=0: disabled, default
+=1: enabled for strictly sequential writes (append, copying)
+=2: enabled for all sequential writes
+
+The write-behind window size is configured in sysfs for each block device:
+/sys/block/$DEV/bdi/write_behind_kb
+
+
 drop_caches
 ===========
diff --git a/fs/file_table.c b/fs/file_table.c
index b07b53f24ff5..bb40b45f27d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -276,6 +276,8 @@ static void __fput(struct file *file)
		if (file->f_op->fasync)
			file->f_op->fasync(-1, file, 0);
	}
+	if ((mode & FMODE_WRITE) && vm_dirty_write_behind)
+		generic_write_behind_close(file);
	if (file->f_op->release)
		file->f_op->release(inode, file);
	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 4fc87dee005a..4f1abd1d64a7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,7 @@ struct backing_dev_info {
	struct list_head bdi_list;
	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
	unsigned long io_pages;	/* max allowed IO size */
+	unsigned long write_behind_pages; /* write-behind window in pages */
	congested_fn *congested_fn; /* Function pointer if device is md/dm */
	void *congested_data;	/* Pointer to aux data for congested func */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..42cad18aaec7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -942,6 +942,7 @@ struct file {
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;
+	pgoff_t			f_write_behind;

	u64			f_version;
 #ifdef CONFIG_SECURITY
@@ -2788,6 +2789,10 @@ extern int vfs_fsync(struct file *file, int datasync);
 extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
			   unsigned int flags);

+extern int vm_dirty_write_behind;
+extern void generic_write_behind(struct kiocb *iocb, ssize_t count);
+extern void generic_write_behind_close(struct file *file);
+
 /*
  * Sync the bytes written if this was a synchronous write.  Expect ki_pos
  * to already be updated for the write, and will return either the amount
@@ -2801,7 +2806,8 @@ static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
		if (ret)
			return ret;
-	}
+	} else if (vm_dirty_write_behind)
+		generic_write_behind(iocb, count);

	return count;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..1b47a6e06ef2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2443,6 +2443,7 @@ void task_dirty_inc(struct task_struct *tsk);

 /* readahead.c */
 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
+#define VM_WRITE_BEHIND_PAGES	(SZ_16M / PAGE_SIZE)

 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
		pgoff_t offset, unsigned long nr_to_read);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..74b6b66ee8da 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1404,6 +1404,15 @@ static struct ctl_table vm_table[] = {
		.proc_handler	= dirtytime_interval_handler,
		.extra1		= SYSCTL_ZERO,
	},
+	{
+		.procname	= "dirty_write_behind",
+		.data		= &vm_dirty_write_behind,
+		.maxlen		= sizeof(vm_dirty_write_behind),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &two,
+	},
	{
		.procname	= "swappiness",
		.data		= &vm_swappiness,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d9daa3e422d0..7fee95c02862 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -131,25 +131,6 @@ static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
 }
 #endif

-static ssize_t read_ahead_kb_store(struct device *dev,
-				  struct device_attribute *attr,
-				  const char *buf, size_t count)
-{
-	struct backing_dev_info *bdi = dev_get_drvdata(dev);
-	unsigned long read_ahead_kb;
-	ssize_t ret;
-
-	ret = kstrtoul(buf, 10, &read_ahead_kb);
-	if (ret < 0)
-		return ret;
-
-	bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);
-
-	return count;
-}
-
-#define K(pages) ((pages) << (PAGE_SHIFT - 10))
-
 #define BDI_SHOW(name, expr)					\
 static ssize_t name##_show(struct device *dev,			\
			   struct device_attribute *attr, char *page)	\
@@ -160,7 +141,26 @@ static ssize_t name##_show(struct device *dev,			\
 }								\
 static DEVICE_ATTR_RW(name);

-BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
+#define BDI_ATTR_KB(name, field)				\
+static ssize_t name##_store(struct device *dev,			\
+			    struct device_attribute *attr,	\
+			    const char *buf, size_t count)	\
+{								\
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);	\
+	unsigned long kb;					\
+	ssize_t ret;						\
+								\
+	ret = kstrtoul(buf, 10, &kb);				\
+	if (ret < 0)						\
+		return ret;					\
+								\
+	bdi->field = kb >> (PAGE_SHIFT - 10);			\
+	return count;						\
+}								\
+BDI_SHOW(name, ((bdi->field) << (PAGE_SHIFT - 10)))
+
+BDI_ATTR_KB(read_ahead_kb, ra_pages)
+BDI_ATTR_KB(write_behind_kb, write_behind_pages)

 static ssize_t min_ratio_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t count)
@@ -213,6 +213,7 @@ static DEVICE_ATTR_RO(stable_pages_required);

 static struct attribute *bdi_dev_attrs[] = {
	&dev_attr_read_ahead_kb.attr,
+	&dev_attr_write_behind_kb.attr,
	&dev_attr_min_ratio.attr,
	&dev_attr_max_ratio.attr,
	&dev_attr_stable_pages_required.attr,
@@ -859,6 +860,8 @@ static int bdi_init(struct backing_dev_info *bdi)
	INIT_LIST_HEAD(&bdi->wb_list);
	init_waitqueue_head(&bdi->wb_waitq);

+	bdi->write_behind_pages = VM_WRITE_BEHIND_PAGES;
+
	ret = cgwb_bdi_init(bdi);

	return ret;
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..5398b1bea1bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3525,3 +3525,139 @@ int try_to_release_page(struct page *page, gfp_t gfp_mask)
 }

 EXPORT_SYMBOL(try_to_release_page);
+
+int vm_dirty_write_behind __read_mostly;
+EXPORT_SYMBOL(vm_dirty_write_behind);
+
+/**
+ * generic_write_behind() - writeback dirty pages behind current position.
+ *
+ * This function tracks the writing position. If the file has enough
+ * sequentially written data it starts background writeback and then waits
+ * for writeback initiated some iterations ago.
+ *
+ * Write-behind maintains a per-file head cursor in file->f_write_behind and
+ * two windows around it: background writeback before and pending data after.
+ *
+ * |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
+ * |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
+ *        current head-^         new head-^        file position-^
+ */
+void generic_write_behind(struct kiocb *iocb, ssize_t count)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	unsigned long window = READ_ONCE(bdi->write_behind_pages);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
+	pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
+
+	/* Skip if write is random, direct, sync or disabled for disk */
+	if ((file->f_mode & FMODE_RANDOM) || !window ||
+	    (iocb->ki_flags & (IOCB_DIRECT | IOCB_DSYNC)))
+		return;
+
+	/* Skip non-sequential writes in strictly sequential mode. */
+	if (vm_dirty_write_behind < 2 &&
+	    iocb->ki_pos != i_size_read(inode) &&
+	    !(iocb->ki_flags & IOCB_APPEND))
+		return;
+
+	/* Contiguous write and still within window. */
+	if (end - head < window)
+		return;
+
+	spin_lock(&file->f_lock);
+
+	/* Re-read under lock. */
+	head = file->f_write_behind;
+
+	/* Non-contiguous, move head position. */
+	if (head > end || begin - head > window) {
+		/*
+		 * Append might happen through multiple files or via a new
+		 * file every time. Align head cursor to cover previous
+		 * appends.
+		 */
+		if (iocb->ki_flags & IOCB_APPEND)
+			begin = roundup(begin - min(begin, window - 1),
+					bdi->io_pages);
+
+		file->f_write_behind = head = begin;
+	}
+
+	/* Still not big enough. */
+	if (end - head < window) {
+		spin_unlock(&file->f_lock);
+		return;
+	}
+
+	/* Write excess and try at least max_sectors_kb if possible */
+	end = head + max(end - head - window, min(end - head, bdi->io_pages));
+
+	/* Set head for next iteration, everything behind will be written. */
+	file->f_write_behind = end;
+
+	spin_unlock(&file->f_lock);
+
+	/* Start background writeback. */
+	__filemap_fdatawrite_range(mapping,
+				   (loff_t)head << PAGE_SHIFT,
+				   ((loff_t)end << PAGE_SHIFT) - 1,
+				   WB_SYNC_NONE);
+
+	if (head < window)
+		return;
+
+	/* Wait for pages falling behind writeback window. */
+	head -= window;
+	end -= window;
+	__filemap_fdatawait_range(mapping,
+				  (loff_t)head << PAGE_SHIFT,
+				  ((loff_t)end << PAGE_SHIFT) - 1);
+}
+EXPORT_SYMBOL(generic_write_behind);
+
+/**
+ * generic_write_behind_close() - write tail pages
+ *
+ * This function finishes a write-behind stream and writes remaining tail
+ * pages in the background. It starts writeback if a write-behind stream was
+ * started earlier (i.e. the total written size is bigger than the
+ * write-behind window) or if this is a new file at least max_sectors_kb
+ * long.
+ */
+void generic_write_behind_close(struct file *file)
+{
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	unsigned long window = READ_ONCE(bdi->write_behind_pages);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t end = (file->f_pos + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if ((file->f_mode & FMODE_RANDOM) ||
+	    (file->f_flags & (O_APPEND | O_DSYNC | O_DIRECT)) ||
+	    !bdi_cap_writeback_dirty(bdi) || !window)
+		return;
+
+	/* Skip non-sequential writes in strictly sequential mode. */
+	if (vm_dirty_write_behind < 2 &&
+	    file->f_pos != i_size_read(inode))
+		return;
+
+	/* Non-contiguous */
+	if (head > end || end - head > window)
+		return;
+
+	/* Start stream only for new files bigger than max_sectors_kb. */
+	if (end - head < (window - min(window, bdi->io_pages)) &&
+	    (!(file->f_mode & FMODE_CREATED) || end - head < bdi->io_pages))
+		return;
+
+	/* Write tail pages in background. */
+	__filemap_fdatawrite_range(mapping,
+				   (loff_t)head << PAGE_SHIFT,
+				   file->f_pos - 1,
+				   WB_SYNC_NONE);
+}
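The head-cursor arithmetic in generic_write_behind() can be modelled in userspace to see how the window slides. This is only a sketch of the strictly-contiguous path: it drops locking, the O_APPEND alignment and the real I/O calls; WINDOW and IO_PAGES are hypothetical stand-ins for bdi->write_behind_pages and bdi->io_pages, and the "writeback"/"wait" ranges are just recorded instead of issued.

```c
#include <assert.h>

#define WINDOW   4096UL		/* 16MiB window in 4K pages */
#define IO_PAGES 256UL		/* 1MiB max_sectors_kb in 4K pages */

static unsigned long head;	/* models file->f_write_behind */
static unsigned long sent_to;	/* end of last async writeback range */
static unsigned long waited_to;	/* end of last waited-for range */

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Called after each contiguous write ending at page "end". */
static void write_behind(unsigned long end)
{
	unsigned long new_head;

	/* Contiguous write and still within window: accumulate. */
	if (end - head < WINDOW)
		return;

	/* Write the excess, but at least io_pages worth if available. */
	new_head = head + max_ul(end - head - WINDOW,
				 min_ul(end - head, IO_PAGES));
	sent_to = new_head;	/* async writeback of [head, new_head) */

	/* Wait for the writeback lagging one full window behind. */
	if (head >= WINDOW)
		waited_to = new_head - WINDOW;

	head = new_head;	/* everything behind is now in flight */
}
```

Running this over a stream of 1MiB writes shows the key property from the demo above: the dirty span behind the file position never grows past the window, so a copy leaves only ~16MiB dirty at any moment.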