Message ID | 20181205122835.19290-1-rgoldwyn@suse.de (mailing list archive) |
---|---|
Series | btrfs: Support for DAX devices |
On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
>
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
>
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO), so that we get some control: we are told to read
the page and can do some hacks there.

Thanks,
Qu

>
> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
>
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
>
>
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
>
> fs/btrfs/Makefile | 1
> fs/btrfs/ctree.h | 17 ++
> fs/btrfs/dax.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/file.c | 29 ++++
> fs/btrfs/inode.c | 54 +++++----
> fs/btrfs/ioctl.c | 5
> fs/btrfs/super.c | 15 ++
> fs/dax.c | 35 ++++--
> include/linux/dax.h | 16 ++
> 9 files changed, 430 insertions(+), 45 deletions(-)
>
>
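Question 1 in the quoted cover letter can be illustrated from user space. The following is only a rough sketch, not btrfs code: it assumes a Unix system, uses an ordinary page-cache-backed temporary file rather than a DAX device, and uses `zlib.crc32` as a stand-in for the filesystem's data checksum. The helper names (`write_with_checksum`, `checksum_on_disk`, `demo`) are invented for this illustration. A checksum recorded on the `write()` path goes stale as soon as the data is changed through a shared mapping, because nothing on that path recomputes it.

```python
import mmap
import os
import tempfile
import zlib

PAGE = mmap.PAGESIZE


def write_with_checksum(fd, data):
    """Model of the write() path: the 'filesystem' sees the data and can
    record a checksum for it (crc32 is only a stand-in here)."""
    os.pwrite(fd, data, 0)
    return zlib.crc32(data)


def checksum_on_disk(fd, length):
    """Recompute the checksum from what is actually stored in the file."""
    return zlib.crc32(os.pread(fd, length, 0))


def demo():
    fd, path = tempfile.mkstemp()
    try:
        stored_crc = write_with_checksum(fd, b"A" * PAGE)
        matches_before = checksum_on_disk(fd, PAGE) == stored_crc
        # Store through a shared mapping: no write() call is made, so the
        # recorded checksum is never updated.
        with mmap.mmap(fd, PAGE) as m:
            m[0:4] = b"BBBB"
            m.flush()
        matches_after = checksum_on_disk(fd, PAGE) == stored_crc
        return matches_before, matches_after
    finally:
        os.close(fd)
        os.unlink(path)
```

With a page cache the kernel still gets a chance to checksum at writeback time; with DAX there is no writeback step, which is why the cover letter falls back to nodatasum.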
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it. However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel: you
mmap the file, then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults is fine -- pmem is closer to memory than disk, and
this means the kernel will ask the filesystem for an extent to place the
new page in, copy the contents and let the process play with it. But real
btrfs CoW would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

The only way I see would be to CoW once, then pretend the page is
nodatacow until the next commit, when we checksum it, add it to the
metadata trees, and mark it for CoWing on the next write. Lots of
complexity, and you still need to copy the whole thing every commit (so
no gain).

I.e., we're in nodatacow land. CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
>
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (i.e., "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.

"pmem" is named "ᴘersistent ᴍᴇᴍory" for a reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6

What would make sense is disabling dax for any files that are not marked
as nodatacow. This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise used by a
pmem-aware program would have dax writes (you can still give read-only
pages that CoW to DRAM). This way we can have write dax for only a subset
of files, and the full set of btrfs features for the rest. Write dax is
dangerous for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.

Meow!
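The torn-write hazard in the last paragraph can be made concrete with a toy model. This is a sketch under stated assumptions, not a description of real hardware behaviour: it assumes 4096-byte pages and 64-byte cachelines, and it assumes cachelines reach the media in address order, which real CPUs do not guarantee. The names `torn_write` and `is_torn` are made up for the example.

```python
CACHELINE = 64     # assumed cacheline size in bytes
PAGE_SIZE = 4096   # assumed page size in bytes


def torn_write(old_page, new_page, lines_persisted):
    """Return the page as it would look if power were lost after only
    `lines_persisted` cachelines of `new_page` had reached the media.
    Simplification: cachelines are assumed to persist in address order."""
    assert len(old_page) == len(new_page) == PAGE_SIZE
    cut = lines_persisted * CACHELINE
    return new_page[:cut] + old_page[cut:]


def is_torn(page, old_page, new_page):
    """A page is torn if it is neither the complete old nor the complete
    new version."""
    return page != old_page and page != new_page
```

A program that assumes a page is either entirely old or entirely new after a crash would treat the mixed result as valid data, which is the data-loss case described above.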
On 12/5/18 8:03 AM, Qu Wenruo wrote:
>
>
> On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
>> This is a support for DAX in btrfs. I understand there have been
>> previous attempts at it. However, I wanted to make sure copy-on-write
>> (COW) works on dax as well.
>>
>> Before I present this to the FS folks I wanted to run this through the
>> btrfs. Even though I wish, I cannot get it correct the first time
>> around :/.. Here are some questions for which I need suggestions:
>>
>> Questions:
>> 1. I have been unable to do checksumming for DAX devices. While
>> checksumming can be done for reads and writes, it is a problem when mmap
>> is involved because btrfs kernel module does not get back control after
>> an mmap() writes. Any ideas are appreciated, or we would have to set
>> nodatasum when dax is enabled.
>
> I'm not familiar with DAX, so it's completely possible I'm talking like
> an idiot.

The general idea is:

1) there is no page cache involved. read() and write() are like direct
i/o writes in concept. The user buffer is written directly (via what is
essentially a specialized memcpy) to the NVDIMM.

2) for mmap, once the mapping is established and mapped, the file system
is not involved. The application writes directly to the memory as it
would a normal mmap, except it's persistent. All that's required to
ensure persistence is a CPU cache flush. The only way the file system is
involved again is if some operation has occurred to reset the WP bit.

> If btrfs_page_mkwrite() can't provide enough control, then I have a
> crazy idea.

It can't, because it is only invoked on the page fault path and we want
to try to limit those as much as possible.

> Forcing page fault for every mmap() read/write (completely disable page
> cache like DIO).
> So that we could get some control since we're informed to read the page
> and do some hacks there.

There's no way to force a page fault for every mmap read/write. Even if
there was, we wouldn't want that. No user would turn that on when they
can just make similar guarantees in their app (which are typically apps
that do this already) and not pay any performance penalty. The idea with
DAX mmap is that the file system manages the namespace, space
allocation, and permissions. Otherwise we stay out of the way.

-Jeff
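Point 2 above, seen from the application side, looks roughly like the sketch below. It is an approximation only: it assumes a Unix system and an ordinary file, so the store still passes through the page cache, and `flush()` (which wraps `msync()`) stands in for the CPU cache flush a pmem-aware program would issue on a real DAX mapping. `store_via_mmap` and `demo` are hypothetical helper names.

```python
import mmap
import os
import tempfile


def store_via_mmap(path, offset, payload):
    """Write `payload` at `offset` through a shared mapping, then flush.
    No write() system call is made for the data itself."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0) as m:          # length 0 maps the whole file
            m[offset:offset + len(payload)] = payload
            m.flush()                        # msync(); stand-in for a cache flush
    finally:
        os.close(fd)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * mmap.PAGESIZE)  # the file needs a size before mapping
        os.close(fd)
        store_via_mmap(path, 16, b"persist me")
        with open(path, "rb") as f:
            f.seek(16)
            return f.read(10)
    finally:
        os.unlink(path)
```

Between establishing the mapping and the flush, the file system sees nothing, which is the property that makes checksumming and snapshot CoW hard in this thread.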
On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
>
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
>
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Yep. It has to be nodatasum, at least within the confines of datasum
today. DAX mmap writes are essentially in the same situation as with
direct i/o when another thread modifies the buffer being submitted.
Except rather than it being a race, it happens every time. An
alternative here could be to add the ability to mark a crc as unreliable
and then go back and update them once the last DAX mmap reference is
dropped on a range. There's no reason to make this a requirement of the
initial implementation, though.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

It's the second question that's the hard part. As Adam describes later,
setting each pfn read-only will ensure page faults cause the remapping.

The high level idea that Jan Kara and I came up with in our conversation
at Labs conf is pretty expensive. We'd need to set a flag that pauses
new page faults, set the WP bit on affected ranges, do the snapshot,
commit, clear the flag, and wake up the waiting threads. Neither of us
had any concrete idea of how well that would perform and it still
depends on finding a good way to resolve all open mmap ranges on a
subvolume. Perhaps using the address_space->private_list anchored on
each root would work.

-Jeff

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
>
>
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
>
> fs/btrfs/Makefile | 1
> fs/btrfs/ctree.h | 17 ++
> fs/btrfs/dax.c | 303 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> fs/btrfs/file.c | 29 ++++
> fs/btrfs/inode.c | 54 +++++----
> fs/btrfs/ioctl.c | 5
> fs/btrfs/super.c | 15 ++
> fs/dax.c | 35 ++++--
> include/linux/dax.h | 16 ++
> 9 files changed, 430 insertions(+), 45 deletions(-)
>
>
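The snapshot sequence Jeff outlines (pause faults, write-protect, snapshot, commit, clear the flag, wake waiters) can be modelled in a few lines. This is a toy user-space model and nothing like kernel code: the `Subvolume` class, its fields, and the returned strings are all invented for illustration, a `threading.Event` stands in for the "pause new page faults" flag, and the snapshot plus commit is reduced to bumping a generation counter.

```python
import threading


class Subvolume:
    """Toy model of mmapped pages that each carry a writable bit."""

    def __init__(self, npages):
        self.writable = [True] * npages
        self.page_gen = [0] * npages      # generation that owns each page's extent
        self.generation = 0
        self._faults_allowed = threading.Event()
        self._faults_allowed.set()
        self._lock = threading.Lock()

    def write(self, page):
        """A store to a mapped page; only a write-protected page faults."""
        if self.writable[page]:
            return "direct"
        return self._page_fault(page)

    def _page_fault(self, page):
        self._faults_allowed.wait()       # blocks while a snapshot is in progress
        with self._lock:
            # Remap the page to a new extent owned by the current generation.
            self.page_gen[page] = self.generation
            self.writable[page] = True
        return "fault"

    def snapshot(self):
        self._faults_allowed.clear()      # pause new page faults
        with self._lock:
            for p in range(len(self.writable)):
                self.writable[p] = False  # set the WP bit on affected ranges
            self.generation += 1          # stands in for snapshot + commit
        self._faults_allowed.set()        # clear the flag and wake waiters
```

In this model only the first store to each page after a snapshot pays for a fault; later stores are direct again. The model says nothing about the cost Jeff is worried about, which is finding every open mmap range on a subvolume.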
On 12/5/18 9:37 PM, Jeff Mahoney wrote:
> The high level idea that Jan Kara and I came up with in our conversation
> at Labs conf is pretty expensive. We'd need to set a flag that pauses
> new page faults, set the WP bit on affected ranges, do the snapshot,
> commit, clear the flag, and wake up the waiting threads. Neither of us
> had any concrete idea of how well that would perform and it still
> depends on finding a good way to resolve all open mmap ranges on a
> subvolume. Perhaps using the address_space->private_list anchored on
> each root would work.

This is a potentially wild idea, so "grain of salt" and all that. I may
misuse the exact wording.

So the essential problem of DAX is basically the opposite of
data-deduplication. Instead of merging two duplicate data regions, you
want to mark regions as at-risk while keeping the original content
intact if there are snapshots in conflict.

So suppose you _require_ data checksums and data mode of "dup" or mirror
or one of the other fault tolerant layouts.

By definition any block that gets written with content that it didn't
have before will now have a bad checksum.

If the inode is flagged for direct IO that's an indication that the
block has been updated.

At this point you really just need to do the opposite of deduplication,
as in find/recover the original contents and assign/leave assigned those
to the old/other snapshots, then compute the new checksum on the
"original block" and assign it to the active subvolume.

So when a region is mapped for direct IO, and its refcount is greater
than one, and you get to a sync or close event, you "recover" the old
contents into a new location and assign those to "all the other users".
Now that original storage region has only one user, so on sync or close
you fix its checksums on the cheap.

Instead of the new data being a small rock sitting over a large rug to
make a lump, the new data is like a rock being slid under the rug to
make a lump.

So the first write to an extent creates a burdensome copy to retain the
old contents, but second and subsequent writes to the same extent only
have the cost of an _eventual_ checksum of the original block list.

Maybe if the data isn't already duplicated then the write mapping or the
DAX open or the setting of the S_DUP flag could force the file into an
extent block that _is_ duplicated.

The mental leap required is that the new blocks don't need to belong to
the new state being created. The new blocks can be associated to the
snapshots since data copy is idempotent.

The side note is that it only ever matters if the usage count is greater
than one, so at worst taking a snapshot, which is already a _little_
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:

If S_DAX :
  If checksum invalid :
    copy data as-is and checksum, store in snapshot
  else : look for duplicate checksum
    if duplicate found :
      assign that extent to the snapshot
    else :
      If file opened for writing and has any mmaps for write :
        copy extent and assign to new snapshot.
      else :
        increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is
basically about amortizing the copy costs into the task of maintaining
the snapshots instead of leaving them in the immediate hands of the
time-critical updater.

The flush, munmap, or close by the user, or a system-wide sync event,
are also good points to expense the bookkeeping time.
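The decision tree in the message above can be written out as a small runnable function. This is one possible reading of the pseudocode, not btrfs behaviour: the `Extent` dataclass, the `known` checksum index, the action strings, and the plain "share" fallback for non-DAX extents are all assumptions added for the example, and `zlib.crc32` again stands in for the data checksum.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Extent:
    data: bytes
    checksum: int
    refs: int = 1


def snapshot_extent(extent, is_dax, writable_mmap, known):
    """Return (extent_for_snapshot, action) for one extent at snapshot time.

    `known` maps checksum -> Extent and plays the role of a hypothetical
    duplicate-lookup index.
    """
    if not is_dax:
        # Assumed ordinary case (not in the pseudocode): share the extent.
        extent.refs += 1
        return extent, "share"
    actual = zlib.crc32(extent.data)
    if actual != extent.checksum:
        # Checksum invalid: copy the data as-is with a fresh checksum.
        return Extent(extent.data, actual), "copy-and-checksum"
    duplicate = known.get(extent.checksum)
    if duplicate is not None and duplicate is not extent:
        # An identical extent already exists: point the snapshot at it.
        duplicate.refs += 1
        return duplicate, "reuse-duplicate"
    if writable_mmap:
        # Still mapped for write: give the snapshot its own copy.
        return Extent(extent.data, extent.checksum), "copy"
    # Nobody can modify it behind our back: share the current block.
    extent.refs += 1
    return extent, "share"
```

Only the "copy" and "copy-and-checksum" outcomes move data; the other two only adjust reference counts, which matches the "semi-lightweight" framing of the proposal.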
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
>
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
    "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
    "sector_size":512,
    "blockdev":"pmem1",
    "numa_node":1
  },
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "map":"dev",
    "size":792721358848,
    "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
    "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
    "sector_size":4096,
    "blockdev":"pmem0",
    "numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.

export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[ 178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[ 178.357044] BTRFS info (device pmem0): disk space caching is enabled
[ 178.357047] BTRFS info (device pmem0): has skinny extents
[ 178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[ 178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8 devid 1 transid 3 /dev/pmem1
[ 178.505717] BTRFS info (device pmem1): disk space caching is enabled
[ 178.513593] BTRFS info (device pmem1): has skinny extents
[ 178.520384] BTRFS info (device pmem1): flagging fs with big metadata feature
[ 178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[ 178.538331] BTRFS info (device pmem1): creating UUID tree
[ 178.587200] BTRFS critical (device pmem1): panic in ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
[ 178.603129] ------------[ cut here ]------------
[ 178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[ 178.614333] invalid opcode: 0000 [#1] SMP PTI
[ 178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G E 4.20.0-rc5-default-btrfs-dax #920
[ 178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[ 178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[ 178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[ 178.667019] RSP: 0018:ffffa3e3674c7ba8 EFLAGS: 00010096
[ 178.672684] RAX: 000000000000008f RBX: ffff9770c2ac5748 RCX: 0000000000000000
[ 178.680254] RDX: ffff97711f9dee80 RSI: ffff97711f9d6868 RDI: ffff97711f9d6868
[ 178.687831] RBP: ffff97711d523000 R08: 0000000000000000 R09: 000000000000065a
[ 178.695411] R10: 00000000000003ff R11: 0000000000000001 R12: ffff97710d66da70
[ 178.702993] R13: ffff9770c2ac5600 R14: 0000000000000000 R15: ffff97710d66d9c0
[ 178.710573] FS: 00007fe11ef90700(0000) GS:ffff97711f9c0000(0000) knlGS:0000000000000000
[ 178.719122] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 178.725380] CR2: 000000000156a000 CR3: 000000eb30dfc006 CR4: 00000000007606e0
[ 178.732999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 178.740574] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 178.748147] PKRU: 55555554
[ 178.751297] Call Trace:
[ 178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[ 178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[ 178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[ 178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[ 178.777508]  ? current_time+0x3f/0x70
[ 178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[ 178.787265]  ? pipe_read+0x243/0x2a0
[ 178.791298]  __vfs_write+0xee/0x170
[ 178.795241]  vfs_write+0xad/0x1a0
[ 178.799008]  ? vfs_read+0x111/0x130
[ 178.802949]  ksys_write+0x42/0x90
[ 178.806712]  do_syscall_64+0x5b/0x180
[ 178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 178.816334] RIP: 0033:0x7fe11eabb3d0
[ 178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[ 178.840052] RSP: 002b:00007ffec969d978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 178.848100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe11eabb3d0
[ 178.855715] RDX: 0000000000000400 RSI: 000000000156a000 RDI: 0000000000000001
[ 178.863326] RBP: 0000000000000400 R08: 0000000000000003 R09: 00007fe11ed7a698
[ 178.870928] R10: 0000000010a8b550 R11: 0000000000000246 R12: 000000000156a000
[ 178.878529] R13: 0000000000000000 R14: 000000000156a000 R15: 00007ffec969e9f1
[ 178.886177] Modules linked in: rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) devlink(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) intel_rapl(E) libiscsi(E) af_packet(E) scsi_transport_iscsi(E) skx_edac(E) configfs(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) iscsi_ibft(E) coretemp(E) iscsi_boot_sysfs(E) ipmi_ssif(E) kvm(E) msr(E) i40iw(E) ib_core(E) ext4(E) nls_iso8859_1(E) nls_cp437(E) crc16(E) mbcache(E) vfat(E) irqbypass(E) crc32_pclmul(E) ghash_clmulni_intel(E) jbd2(E) joydev(E) fat(E) i40e(E) aesni_intel(E) iTCO_wdt(E) ptp(E) aes_x86_64(E) iTCO_vendor_support(E) mei_me(E) crypto_simd(E) ipmi_si(E) pps_core(E) lpc_ich(E) ioatdma(E) dax_pmem(E) ipmi_devintf(E) nd_pmem(E) cryptd(E) glue_helper(E) pcspkr(E) mfd_core(E) i2c_i801(E) device_dax(E) ipmi_msghandler(E) mei(E) nd_btt(E) dca(E)
[ 178.886201] pcc_cpufreq(E) acpi_pad(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) hid_generic(E) usbhid(E) sd_mod(E) sr_mod(E) cdrom(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) ahci(E) sysfillrect(E) xhci_pci(E) sysimgblt(E) fb_sys_fops(E) libahci(E) xhci_hcd(E) ttm(E) crc32c_intel(E) drm(E) libata(E) usbcore(E) wmi(E) nfit(E) libnvdimm(E) button(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> >
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
>
> Hi Goldwyn,
>
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage)

>
> mayhem:~/:[0]# ndctl list
> [
>   {
>     "dev":"namespace1.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
>     "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
>     "sector_size":512,
>     "blockdev":"pmem1",
>     "numa_node":1
>   },
>   {
>     "dev":"namespace0.0",
>     "mode":"fsdax",
>     "map":"dev",
>     "size":792721358848,
>     "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
>     "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
>     "sector_size":4096,
>     "blockdev":"pmem0",
>     "numa_node":0
>   }
> ]
>
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]