Skip to content

Test for-next (regular, SELF kvm)#1624

Open
kdave wants to merge 10000 commits into
ci-kvmfrom
for-next
Open

Test for-next (regular, SELF kvm)#1624
kdave wants to merge 10000 commits into
ci-kvmfrom
for-next

Conversation

@kdave
Copy link
Copy Markdown
Member

@kdave kdave commented Mar 5, 2026

Keep this open, the build tests are on self-hosted workers.

@kdave
Copy link
Copy Markdown
Member Author

kdave commented Mar 5, 2026

Re #1623 .

@kdave kdave changed the title Test for-next (regular, GH kvm) 2 Test for-next (regular, GH kvm) Mar 5, 2026
@kdave kdave closed this Mar 5, 2026
@kdave kdave reopened this Mar 5, 2026
@kdave kdave force-pushed the ci-kvm branch 2 times, most recently from 69fc6c9 to 98bf7e7 Compare March 5, 2026 23:30
@kdave kdave force-pushed the for-next branch 3 times, most recently from 934d926 to 0cb5a8a Compare March 13, 2026 11:46
@kdave kdave force-pushed the ci-kvm branch 3 times, most recently from 2cd3911 to c09d7cb Compare March 13, 2026 19:11
@kdave kdave force-pushed the for-next branch 8 times, most recently from c61e262 to daed989 Compare March 17, 2026 16:00
@kdave kdave force-pushed the for-next branch 4 times, most recently from 9dc51d6 to d76ed94 Compare March 19, 2026 13:21
kdave and others added 30 commits May 16, 2026 02:56
Commit 0b2600f ("treewide: change inode->i_ino from unsigned long
to u64") sets the inode number type to u64 unconditionally, so we can
use it directly as there's no difference on 32bit and 64bit platform. We
used to have a copy of the number in our btrfs_inode.

The size of btrfs_inode on 32bit platform is about 688 bytes (after the
change).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
When bs < ps support was initially introduced, the compressed data
readahead was disabled as at that time the target page size was 64K.
This means a compressed data extent can span at most 3 64K pages (the
head and tail parts are not aligned to 64K), meaning the benefit is
pretty minimal.

[UNEXPECTED WORKING SITUATION]
But with the already merged large folio support, we're already enabling
readahead with subpage routine unintentionally, e.g.:

   0      4K      8K      12K      16K
   |   Folio 0    |    Folio 8K    |
   |<----- Compressed data ------->|

We have 2 8K sized folios, all backed by a single compressed data.

In that case add_ra_bio_pages() will continue to add folio 8K into the
read bio, as the condition to skip is only (bs < ps), not taking the
newer large folio support into consideration at all.

So for folio 8K, it is added to the read bio, but without subpage lock
bitmap populated.

Then at end_bbio_data_read(), folio 0 has proper locked bitmap set, but
folio 8K does not.
This inconsistency is handled by the extra safety net at
btrfs_subpage_end_and_test_lock() where if a folio has no @nr_locked, it
will just be unlocked without touching the locked bitmap.

[ENHANCEMENT]
Make add_ra_bio_pages() support bs < ps and large folio cases, by
removing the check and calling btrfs_folio_set_lock() unconditionally.

This won't make any difference on 4K page sized systems with large
folios, as the readahead is already working, although unexpectedly.

But this will enable true compressed data readahead for bs < ps cases
properly.

Please note that such readahead will only work if the compressed extent is
crossing folio boundaries, which is also the existing limitation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function add_ra_bio_folios() has been utilizing folio interfaces
since c808c1d ("btrfs: convert add_ra_bio_pages() to use only
folios"), but we are still referring to "pages" inside the function name
and all comments.

Furthermore, such folio/page mixing can even be confusing, e.g. the
variable @page_end is very confusing as we're not really referring to
the end of the page, but the end of the folio, especially when we
already have large folio support.

Enhance that function by:

- Rename "page" to "folio" to avoid confusion

- Skip to the folio end if there is already a folio in the page cache
  The existing skip is:

  	cur += folio_size(folio);

  This is incorrect if @Cur is not folio size aligned, and can be
  common with large folio support.

  Thankfully this is not going to cause any real bugs, but at most will
  skip some blocks that can be added to readahead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This feature was introduced in v6.17 under experimental, and we had
several small bugs related to or exposed by that:

  e9e3b22 ("btrfs: fix beyond-EOF write handling")
  18de34d ("btrfs: truncate ordered extent when skipping writeback past i_size")

Otherwise, the feature has been frequently tested by btrfs developers.

The latest fix only arrived in v6.19. After three releases, I think it's
time to move this feature out of experimental.

And since we're here, also remove the comment about the bitmap size
limit, which is no longer relevant in the context. It will soon be
outdated for the incoming huge folio support.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.

Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. This means that
btrfs_direct_write() calls will always get the inode lock exclusively,
meaning that DIO writes to the same file will be serialized.

On my machine, this one-line change results in a ~59% improvement in DIO
throughput:

Before patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
    write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
     bw (  KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
     iops        : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
    cpu          : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec

After patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
    write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
     bw (  MiB/s): min=  619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
     iops        : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
    cpu          : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
       issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec

The script to reproduce that:

  #!/bin/bash
  mkfs.btrfs -f /dev/nvme0n1
  mount /dev/nvme0n1 /mnt/test
  mkdir /mnt/test/nocow
  chattr +C /mnt/test/nocow
  fio /root/test.fio

  # cat /root/test.fio
  [global]
  rw=randwrite
  ioengine=io_uring
  iodepth=64
  size=1g
  direct=1
  startdelay=20
  force_async=4
  ramp_time=5
  runtime=60
  group_reporting=1
  numjobs=32
  time_based
  disk_util=0
  clat_percentiles=0
  disable_lat=1
  disable_clat=1
  disable_slat=1
  filename=/mnt/test/nocow/fiofile
  [test]
  name=test
  bs=4k
  stonewall

This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both about ~3GB/s.

Fixes: ad21f15 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_abort_transaction() is called at the location where we want to
report the abort. It must be a macro so we get the correct line and
stack trace. This inlines the necessary code and the rest is pushed to
__btrfs_abort_transaction().

There's a possibility to reduce the inlined code if we move the message
to the helper function as well, without loss of information. The
difference is only that the WARN will not print it inside the stack
report but after:

  --[ cut here ]--
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --
  BTRFS error (device dm-0 state A): Transaction aborted (error -28)

While previously there would be one more line like:

  --[ cut here ]--
  BTRFS: Transaction aborted (error -28)
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --

This removes about 20KiB of btrfs.ko on a release config.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If report_rb_range() is inlined into its single caller (check_eb_range()),
we end up with a larger module size, which is undesirable and does not
provide any advantage since this code is for a cold path which we don't
expect to ever hit.

Add the noinline attribute to report_rb_range() and while at it also make
it return void as it always returns true.

Before this change (with gcc 14.2.0-19 from Debian):

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2018267	 176232	  15592	2210091	 21b92b	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2017835	 176048	  15592	2209475	 21b6c3	fs/btrfs/btrfs.ko

Also, replacing the noinline with __cold, yields slighty worse results:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2017889	 176048	  15592	2209529	 21b6f9	fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we unmount the fs or during mount failures, btrfs will call
invalidate_inode_pages() to release all btree inode folios.

However that function can return -EBUSY if any folios can not be
invalidated.
This can be caused by:

- Some extent buffers are still held by btrfs
  This is a logic error, as we should release all tree root nodes
  during unmount and mount failure handling.

- Some extent buffers are under readahead and haven't yet finished
  These are much rarer but valid cases.
  In that case we should wait for those extent buffers.

Introduce a new helper invalidate_and_check_btree_folios() which will:

- Call invalidate_inode_pages2() and catch its return value
  If it returned 0 as expected, that's great and we can call it a day.

- Otherwise go through each extent buffer in buffer_tree
  Increase the ref by one first for the eb we're checking.
  This is to ensure the eb won't be freed after the readahead is
  finished.

  For ebs that still have EXTENT_BUFFER_READING flag, wait for them to
  finish first.

  After waiting for the readahead, check the refs of the eb and if it's
  still dirty.

  If the eb ref count is greater than 2 (one for the buffer tree, one
  held by us), it means we are still holding the extent buffer somewhere
  else, which is a code bug.

  If the eb is still dirty, it means a bug in transaction handling, e.g.
  the bug fixed by patch "btrfs: only release the dirty pages io tree
  after successful writes".

  For either case, show a warning message about the eb, including its
  bytenr, owner, refs and flags.
  And if it's a debug build, also trigger WARN_ON_ONCE() so that fstests
  can properly catch such situation.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
Reported-by: AHN SEOK-YOUNG <iamsyahn@gmail.com>
CC: Teng Liu <27rabbitlt@gmail.com>
Tested-by: Teng Liu <27rabbitlt@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…tems

fill_holes() currently merges a punched hole with either the previous
or the next file extent item, but never both in the same call.  When
holes are punched in a non-sequential order this leaves consecutive
hole items in the inode's subvolume tree that should have been collapsed
into a single one.

This is a minor metadata optimization that reduces the number of file
extent items when holes are punched in non-sequential order. While
having extra file extent items is harmless and has no functional
impact, reducing metadata overhead can benefit workloads with heavily
fragmented hole patterns.

For example:

  fallocate -p -o 4K  -l 4K ${FILE}
  fallocate -p -o 12K -l 4K ${FILE}
  fallocate -p -o 8K  -l 4K ${FILE}

After the third punch the [4K, 8K) and [12K, 16K) holes become
adjacent to the new [8K, 12K) hole, but fill_holes() merges only one
side and leaves two separate hole items ([4K, 12K) and [12K, 16K))
instead of the expected single [4K, 16K) hole item.

Fix this by checking both path->slots[0] - 1 and path->slots[0] in one
pass:

  - If only the previous slot is mergeable, extend it forward as
    before.
  - If only the next slot is mergeable, extend it backward and update
    its key offset as before.
  - If both are mergeable, extend the previous item to cover the new
    hole plus the next item, and remove the redundant next item with
    btrfs_del_items().

Because the merge path may now delete an item, switch the initial
btrfs_search_slot() call from a plain lookup (ins_len = 0) to a
search-for-deletion (ins_len = -1), so the leaf is prepared for a
possible item removal.

Note: This optimization only applies to filesystems without the
NO_HOLES feature enabled. Since NO_HOLES is now the default, this
primarily benefits older filesystems or those explicitly created with
NO_HOLES disabled.

Signed-off-by: Dave Chen <davechen@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function extent_invalidate_folio() has only a single caller inside
btree_invalidate_folio().

There is no need to export such a function just for a single caller inside
another file.

Unexport extent_invalidate_folio() and move it to disk-io.c.

And since we're moving the code, update the commit to match the current
style, and remove the seemingly stale comment on the extent state
removal, it's better explained by the comment just before
btrfs_unlock_extent().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btree inode is very different from regular data inodes, as the btree
inode is never exposed to user space operations.

All operations are either initiated by btrfs metadata operations, or MM
layer like memory pressure to release folios.

This means we never need to handle partial folio invalidation inside
btree_invalidate_folio().

With that said, we can slightly simplify the btree folio invalidation
by:

- Add ASSERT()s to make sure the range covers the whole folio

- Remove "if (start > end)" check
  As the range always covers the full folio, that check is always
  false and can be removed.

- Open code extent_invalidate_folio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info is available everywhere and we don't need to store it inside
a structure that is used within one function only, which is
build_backref_tree(). The size of btrfs_backref_iter is now 48 bytes.

Signed-off-by: David Sterba <dsterba@suse.com>
The iterator is used only once and within build_backref_tree() so we can
avoid one allocation and place it on stack.

Signed-off-by: David Sterba <dsterba@suse.com>
When writing into a preallocated extent, ordered extent completion calls
btrfs_mark_extent_written() to convert the file extent item from the
BTRFS_FILE_EXTENT_PREALLOC type to the BTRFS_FILE_EXTENT_REG type.

If the preallocated extent was created beyond i_size with fallocate
keep-size, and the inode is evicted and loaded again before the write,
the inode's file_extent_tree is initialized only up to i_size.

The beyond i_size prealloc extent is therefore not tracked there.

After a write into that extent extends i_size, btrfs_mark_extent_written()
updates the file extent item, but the corresponding range is not marked
dirty in the inode's file_extent_tree.

This can leave disk_i_size stale when the filesystem does not use the
no-holes feature, so after remount the file size can go back to the old
value.

The following reproducer triggers the problem:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f -O ^no-holes $DEV
  mount $DEV $MNT

  touch $MNT/file
  fallocate -n -l 2M $MNT/file

  umount $MNT
  mount $DEV $MNT

  dd if=/dev/zero of=$MNT/file bs=1M count=1 conv=notrunc
  ls -lh $MNT/file

  umount $MNT
  mount $DEV $MNT

  ls -lh $MNT/file
  umount $MNT

Running the reproducer gives the following result:

  $ ./test.sh
  (...)
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.000596024 s, 1.8 GB/s
  -rw-rw-r-- 1 root root 1.0M May  8 16:34 /mnt/sdi/file
  -rw-rw-r-- 1 root root 0 May  8 16:34 /mnt/sdi/file

Fix this by marking the written range dirty in the inode's
file_extent_tree after successfully converting the prealloc extent to a
regular extent.

Fixes: 9ddc959 ("btrfs: use the file extent tree infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
[ Minor change log updates ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_load_free_space_tree() reads FREE_SPACE_INFO once and then chooses
the bitmap or extent loader for all following free-space records until the
next FREE_SPACE_INFO item. Those loaders currently enforce the selected
record type only with ASSERT().

On production builds without CONFIG_BTRFS_ASSERT, a malformed free-space
tree can therefore be decoded in the wrong mode. An EXTENT item can reach
btrfs_free_space_test_bit() as bitmap data, while a BITMAP item can be
added as a full free extent. The latter corrupts the in-memory free-space
cache and the former can read beyond the item payload.

Sanitizer validation reported:
general protection fault
Call trace:
  assert_eb_folio_uptodate() (fs/btrfs/extent_io.c:4134)
  extent_buffer_test_bit() (?:?)
  btrfs_free_space_test_bit() (fs/btrfs/free-space-tree.c:518)
  srso_alias_return_thunk() (arch/x86/include/asm/nospec-branch.h:375)
  __entry_text_end() (?:?)
  __asan_memcpy() (mm/kasan/shadow.c:103)
  read_extent_buffer() (?:?)
  load_free_space_bitmaps() (fs/btrfs/free-space-tree.c:1548)
  btrfs_get_32() (fs/btrfs/free-space-tree.c:?)
  btrfs_set_16() (fs/btrfs/free-space-tree.c:?)
  kmem_cache_alloc_noprof() (?:?)
  btrfs_load_free_space_tree() (fs/btrfs/free-space-tree.c:1685)
  load_free_space_tree_for_test() (?:?)
  rcu_disable_urgency_upon_qs() (kernel/rcu/tree.c:721)
  vprintk_emit() (?:?)
  __up_write() (kernel/locking/rwsem.c:1401)
  clone_commit_root_for_test() (?:?)
  test_extent_as_bitmap_mode_mismatch() (?:?)
  kmem_cache_free() (?:?)
  btrfs_free_path() (fs/btrfs/free-space-tree.c:1449)
  __add_block_group_free_space() (fs/btrfs/free-space-tree.c:20)
  run_test() (?:?)
  do_raw_spin_unlock() (?:?)
  btrfs_test_free_space_tree() (fs/btrfs/tests/free-space-tree-tests.c:547)
  btrfs_test_qgroups() (fs/btrfs/tests/qgroup-tests.c:462)
  btrfs_run_sanity_tests() (fs/btrfs/free-space-tree.c:?)
  init_btrfs_fs() (fs/btrfs/super.c:2690)
  do_one_initcall() (init/main.c:1382)
  __kasan_kmalloc() (?:?)
  rcu_is_watching() (?:?)
  do_initcalls() (init/main.c:1457)
  kernel_init_freeable() (init/main.c:1674)
  kernel_init() (init/main.c:1584)
  ret_from_fork() (?:?)
  __switch_to() (?:?)
  ret_from_fork_asm() (?:?)

Validate every post-info key before decoding it. Reject keys whose type
does not match the mode selected by FREE_SPACE_INFO, and reject keys
whose range extends past the block group, returning -EUCLEAN instead of
feeding the wrong record type to the bitmap or extent decoder.

Also reject zero-length FREE_SPACE_EXTENT items in tree-checker, matching
the existing FREE_SPACE_BITMAP zero-length check. This keeps the loader
range check simple and prevents a zero-length extent item from being a
valid on-disk free-space record.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ROOT_REF and ROOT_BACKREF items contain a struct btrfs_root_ref followed
by the subvolume name. Several readers assume that this layout is already
valid and then use the on-disk name length directly. A corrupted item can
therefore make those readers address bytes outside the item, and
BTRFS_IOC_GET_SUBVOL_INFO can copy too many bytes into its fixed-size UAPI
name buffer.

Validate ROOT_REF and ROOT_BACKREF items in tree-checker before any reader
uses them. Reject records that do not contain a non-empty name, whose
name_len does not exactly describe the remaining item payload, or whose
name exceeds BTRFS_NAME_LEN.

For BTRFS_IOC_GET_SUBVOL_INFO, copy only the validated on-disk name_len
instead of deriving the copy length from the item size. The ioctl result is
zeroed when allocated. That leaves the existing trailing zero byte
untouched.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
find_first_inode() and find_first_inode_to_shrink() lock root->inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().

However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.

The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&root->inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.

To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.

I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root->inodes xarray), then producing memory
pressure by reading large files larger than ram, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.

Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there are two members inside btrfs_folio_state that are related
to locked bitmap:

- locked sub-bitmap inside btrfs_folio_state::bitmaps[]
  The enum btrfs_bitmap_nr_locked determines the sub-bitmap.

- btrfs_folio_state::nr_locked
  Which records how many blocks are locked inside the folio.

The locked sub-bitmap is a btrfs specific per-block tracking mechanism,
which is mostly for async-submission, utilized by compressed writes.

The sub-bitmap itself is a super set of nr_locked, as it can provide a
more reliable tracking.

But the sub-bitmap itself can be pretty large for the incoming huge
folio, 2M sized folio for 4K page size, meaning 512 bits for one
sub-bitmap.

Furthermore, in the long run compression will be reworked to get rid of
async-submission completely, there is not much need for a full
sub-bitmap to track the locked status.

This patch removes the locked sub-bitmap and only relies on @nr_locked
atomic to do the tracking.
This can also save 64 bytes from btrfs_folio_state::bitmaps[] for a huge
folio.

This will reduce some safety checks, as previously if a block is not
locked, btrfs_folio_end_lock()/btrfs_folio_end_lock_bitmap() will find
out that, and skip reducing @nr_locked for that block, and avoid
under-flow.

But this safety net itself shouldn't be necessary in the first place.
If we're unlocking a block that is not locked, it's a bug in the logic,
and we should catch it, not silently ignoring it.
Thus I believe the removal of the extra safety net should not be a
problem.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs detects dirty folio which doesn't have an ordered extent
at extent_writepage_io(), but that is not ideal:

- The check is not handling all dirty blocks
  We can have multiple blocks inside a large folio, but the whole folio
  is marked ordered as long as there is one ordered extent in the range.

  We can still hit cases where some dirty blocks do not have
  corresponding ordered extents.

Instead of checking the folio ordered flags, do the check at
alloc_new_bio(), where we're already searching for ordered extents for
writebacks.

If we didn't find an ordered extent, we should already give an error
message and notify the caller there is something wrong.

This allows us to check every block that goes through
submit_extent_folio().

With this new and more reliable check, we can remove the old check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently during folio writeback, we call folio_clear_dirty_for_io()
before extent_writepage(), which causes folio dirty flag to be cleared,
but without touching the subpage bitmaps.

This works fine for the bio submission path, as we always call
btrfs_folio_clear_dirty() to clear the subpage bitmap.

But this is far from consistent, thus this patch is going to unify the
behavior to always use btrfs_folio_clear_dirty() helper to clear both
folio flag and subpage bitmap.

This involves:

- Replace folio_clear_dirty_for_io() with folio_test_dirty()
  There is only one call site calling folio_clear_dirty_for_io() outside
  of subpage.c, that's inside extent_write_cache_pages() just before
  extent_writepage().

- Make btrfs_invalidate_folio() clear dirty range for the whole folio
  The function btrfs_invalidate_folio() is also called during
  extent_writepage().

  If we had a folio completely beyond isize, we call
  folio_invalidate() -> btrfs_invalidate_folio() to free the folio.

  Since we no longer have folio_clear_dirty_for_io() to clear the folio
  dirty flag, we must manually clear the folio dirty flag for the
  to-be-invalidated folio, and also clear the PAGECACHE_TAG_DIRTY tag.

  The tag clearing is done using a new helper,
  btrfs_clear_folio_dirty_tag(), which is almost the same as the old
  btree_clear_folio_dirty_tag(), but with minor improvements including:

  * Remove the folio_test_dirty() check
    We have already done an ASSERT().

  * Add an ASSERT() to make sure folio is mapped

- Add extra ASSERT()s before clearing folio private
  During development I hit dirty folios without the private flag set,
  and that caused a lot of ASSERT()s.
  The reason is that btrfs_invalidate_folio() is relying on the dirty
  flag being cleared when it's called from extent_writepage().

  Add extra ASSERT()s inside clear_folio_extent_mapped() to catch
  wild dirty/writeback flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…ated

Currently there are only two folio ordered flag users:

- extent_writepage_io()
  To ensure the folio range has an ordered extent covering it.
  This is from the legacy COW fixup mechanism, which is already removed
  and only a simple check is left.

- btrfs_invalidate_folio()
  This is to avoid race with end_bbio_data_write(), where
  btrfs_finish_ordered_extent() will be called to handle the OE
  finishing.

But for btrfs_invalidate_folio() we have already waited for the folio
writeback to finish, and locked the folio.
This means we can use the dirty flag to check if a range is already
submitted or not.

If the OE range is not dirty, it means the range has been submitted and
its dirty flag was cleared. And since we have already waited for
writeback, the endio function will handle the OE finishing.
Thus if the range is not dirty, we must skip the range.

If the OE range is dirty, it means we have allocated an ordered extent but
have not yet submitted the range. And that's exactly the case where we need
to truncate the ordered extent.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:

- The ASSERT() inside end_bbio_data_write()
  It's only an ASSERT() and it has never been triggered as far as I
  know.

- btrfs_migrate_folio()
  Since all folio_test_ordered() usage will be removed, there is no need to
  copy the folio ordered flag.

- The ASSERT() inside btrfs_invalidate_folio()
  This one has its usefulness as it indeed caught some bugs during
  development.
  But that's the last user and will not be worth the folio flag or the
  subpage bitmap.

This will allow btrfs to finally remove the ordered flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has an internal flag/subpage bitmap called ordered, which is to
indicate that a block has corresponding ordered extent covering it.

However this requires extra synchronization between the inode ordered
tree, and the folio flag/subpage bitmap, not to mention we need to
maintain the extra folio flag with subpage bitmap.

As a step to align btrfs_folio_state more closely to iomap_folio_state,
remove the btrfs specific ordered flag/bitmap.

This will also save us 64 bytes for the bitmap of a huge folio.

Since we're here, also update the ASCII graph of the bitmap, as there
are only 3 sub-bitmaps now, show all sub-bitmaps directly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The invariant that we want to maintain with subvolume qgroups is that
the qgroup can only be deleted if there is no root. With squotas, we
thought that it was sufficient to just check the usage, because we
assumed that deleting a subvolume will drive it's qgroups usage to 0,
and thus 0 usage implies no subvolume.

However, this is false, for two reasons:

- A subvol whose extents are all from before squotas was enabled.
- A subvol that was created in this transaction and for which we have
  not yet run any delayed refs.

In both cases, deleting the qgroup breaks the desired invariant and we
are left with a subvolume with no qgroup but squotas are enabled.

Fix this by unifying the deletion check logic between full qgroups and
squotas. Squotas do all the same checks *and* the additional usage == 0
check, which is the one extra rule peculiar to squotas.

Link: https://lore.kernel.org/linux-btrfs/adnBhWfJQ1n3hZC8@merlins.org/
Fixes: a8df356 ("btrfs: forbid deleting live subvol qgroup")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first transaction that enables squotas is special and a bit tricky.
We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a
deadlock, so any delayed refs that run before we set the bit are not
squota accounted. For data this is fine, we don't get an owner_ref, so
there is no real harm, it's as if the extent predated squotas. However
for metadata, the tree block will have gen == enable_gen so when we free
it later, we will decrement the squota accounting, which can result in
an underflow. Before it is freed, btrfs check shows errors, as we have
mismatched usage between the node generations/owners and the squota
values.

There are two angles to this fix:

1. For extents that come in delayed_refs that run during the
   enable_gen transaction, we must actually set enable_gen to the *next*
   transaction. That is the first transaction that we can really
   properly account in any way.
2. For extents that come in between the end of our transaction handle
   and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an
   additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording
   squota deltas, so we do pick up those extents. Otherwise, we would
   miss them, even for enable_gen + 1.

Fixes: bd7c1ea ("btrfs: qgroup: check generation when recording simple quota delta")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simple quota accounting can undercount metadata tree block allocations
in certain scenarios. When an undercounted subvolume is deleted and its
tree blocks freed, the free deltas decrement rfer/excl past zero,
wrapping the u64 to a value near U64_MAX.

Once wrapped, can_delete_squota_qgroup() sees non-zero rfer and refuses
to delete the qgroup. The qgroup becomes permanently orphaned in the
quota tree, since there is no subvolume left to generate frees that
would bring the counter back to zero.

While we ultimately want to fix any mis-accounting at the source, it is
also helpful and worthwhile to mitigate the damage by clamping rfer and
excl to zero on underflow rather than allowing the u64 to wrap. This at
least allows us to clean up the messed up qgroups on subvol deletion.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I thought that it was likely I could harden squota deletion to the point
that it was impossible to end up with an extent accounted to a qgroup
outliving its qgroup. Several recent bugs have made me re-consider that
position.

Ultimately, this is a tradeoff between short term stability and long
term strictness, but I think given that there could be another layer of
bugs behind the 2-3 I just fixed, I would feel much more confident in
people using squotas if the risk was "your values can get a bit out of
whack which you can fix by deleting stuff or
disabling/re-enabling/repairing" vs "it will abort your filesystem".

As the final nail in the coffin, the Meta production kernel was lacking
earlier fixes from me and Qu regarding subvol qgroup lifetime, so this
is what we have been testing at scale, so I think at least for now
upstream should have the same extra layer of protection.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently both check_free_space_extent() and check_free_space_bitmap()
share a very common validation on the keys.

Extract them into a helper, check_free_space_common_key(), and
change the output string ("extent" or "bitmap") depending on the key type.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an extra check to ensure the free space extent/bitmap and space info
keys won't overflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This introduces extra checks using the previous key.

If there is a previous key, we can do extra validations:

- The previous key is FREE_SPACE_INFO
  This means the current extent/bitmap should be inside the
  free space info key range.

  And matches the type of the free space info.

- The previous key is FREE_SPACE_EXTENT or FREE_SPACE_BITMAP
  In that case both the current and previous key should belong to the same
  block group.

  Thus the key type must match, and no overlap between the two keys.

These extra checks are inspired by the recently added type checks during
free space tree loading.

The new tree-checker checks will allow earlier detection, but the
loading time checks are still needed, as the tree-checker checks are
still inside the same leaf, not matching per-bg level checks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.