Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing
fixes", v2.
This patch series helps prevent load/store tearing in several cgroup knobs.
As kindly pointed out by Michal Hocko and Roman Gushchin, the changelog has
been rephrased. Besides, more knobs were checked, following kind suggestions
from Shakeel Butt and Muchun Song.
This patch (of 4):
The cgroup v2 memory controller knob memory.oom.group is not protected by
any locking, so it can be modified while it is being used. This is not an
actual problem because races are unlikely (the knob is usually configured
long before any workload hits an actual memcg OOM), but it is better to use
READ_ONCE/WRITE_ONCE to prevent the compiler from doing anything funky.
Access to memcg->oom_group is lockless, so it can be set concurrently with
a read.
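As a rough sketch, the change boils down to pairing the lockless writer and
reader like this (simplified from the memory.oom.group handlers, not the
literal patch; the reader-side variable name is illustrative):

  /* writer side: the memory.oom.group knob handler */
  WRITE_ONCE(memcg->oom_group, oom_group);

  /* reader side, e.g. when picking the OOM group during an OOM kill */
  if (READ_ONCE(memcg->oom_group))
          oom_group_memcg = memcg;  /* this memcg's tasks are killed together */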
Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com
Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com
Signed-off-by: Yue Zhao <findns94@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tang Yizhou <tangyeechou@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The zsmalloc compaction algorithm has the potential to waste some CPU
cycles, particularly when compacting pages within the same fullness group.
This is due to the way it selects the head page of the fullness list for
source and destination pages, and how it reinserts those pages during each
iteration. The algorithm may first use a page as a migration destination
and then as a migration source, leading to an unnecessary back-and-forth
movement of objects.
Consider the following fullness list:
PageA PageB PageC PageD PageE
During the first iteration, the compaction algorithm will select PageA as
the source and PageB as the destination. All of PageA's objects will be
moved to PageB, and then PageA will be released while PageB is reinserted
into the fullness list.
PageB PageC PageD PageE
During the next iteration, the compaction algorithm will again select the
head of the list as the source and destination, meaning that PageB will
now serve as the source and PageC as the destination. This will result in
the objects being moved away from PageB, the same objects that were just
moved to PageB in the previous iteration.
To prevent this avalanche effect, the compaction algorithm should not
reinsert the destination page between iterations. By doing so, the most
optimal page will continue to be used and its usage ratio will increase,
reducing internal fragmentation. The destination page should only be
reinserted into the fullness list if:
- It becomes full
- No source page is available
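In code terms the reworked loop looks roughly like the sketch below; the
helper names are illustrative, not necessarily the exact zsmalloc functions:

  /* keep the destination zspage across iterations */
  dst = NULL;
  while ((src = isolate_src_zspage(class)) != NULL) {
          if (!dst)
                  dst = isolate_dst_zspage(class);
          if (!dst) {
                  /* no destination available: put the source back and stop */
                  putback_zspage(class, src);
                  break;
          }

          migrate_zspage(pool, src, dst);         /* move objects src -> dst */

          if (get_zspage_inuse(src) == 0)
                  free_zspage(pool, class, src);  /* source fully drained */
          else
                  putback_zspage(class, src);

          if (zspage_full(class, dst)) {          /* reinsert dst only when full */
                  putback_zspage(class, dst);
                  dst = NULL;
          }
  }
  if (dst)
          putback_zspage(class, dst);             /* ... or when no source is left */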
TEST
====
It's very challenging to reliably test this series. I ended up developing
my own synthetic test that has 100% reproducibility. The test generates
significant fragmentation (for each size class) and then performs
compaction for each class individually and tracks the number of memcpy()
calls in zs_object_copy(), so that we can compare the amount of work
compaction does on a per-class basis.
Total amount of work (zram mm_stat objs_moved)
----------------------------------------------
Old fullness grouping, old compaction algorithm:
323977 memcpy() in zs_object_copy().
Old fullness grouping, new compaction algorithm:
262944 memcpy() in zs_object_copy().
New fullness grouping, new compaction algorithm:
213978 memcpy() in zs_object_copy().
Per-class compaction memcpy() comparison (T-test)
-------------------------------------------------
x Old fullness grouping, old compaction algorithm
+ Old fullness grouping, new compaction algorithm
N Min Max Median Avg Stddev
x 140 349 3513 2461 2314.1214 806.03271
+ 140 289 2778 2006 1878.1714 641.02073
Difference at 95.0% confidence
-435.95 +/- 170.595
-18.8387% +/- 7.37193%
(Student's t, pooled s = 728.216)
x Old fullness grouping, old compaction algorithm
+ New fullness grouping, new compaction algorithm
N Min Max Median Avg Stddev
x 140 349 3513 2461 2314.1214 806.03271
+ 140 226 2279 1644 1528.4143 524.85268
Difference at 95.0% confidence
-785.707 +/- 159.331
-33.9527% +/- 6.88516%
(Student's t, pooled s = 680.132)
Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Each zspage maintains ->inuse counter which keeps track of the number of
objects stored in the zspage. The ->inuse counter also determines the
zspage's "fullness group" which is calculated as the ratio of the "inuse"
objects to the total number of objects the zspage can hold
(objs_per_zspage). The closer the ->inuse counter is to objs_per_zspage,
the better.
Each size class maintains several fullness lists, that keep track of
zspages of particular "fullness". Pages within each fullness list are
stored in random order with regard to the ->inuse counter. This is
because sorting the zspages by ->inuse counter each time obj_malloc() or
obj_free() is called would be too expensive. However, the ->inuse counter
is still a crucial factor in many situations.
For the two major zsmalloc operations, zs_malloc() and zs_compact(), we
typically select the head zspage from the corresponding fullness list as
the best candidate zspage. However, this assumption is not always
accurate.
For the zs_malloc() operation, the optimal candidate zspage should have
the highest ->inuse counter. This is because the goal is to maximize the
number of ZS_FULL zspages and make full use of all allocated memory.
For the zs_compact() operation, the optimal source zspage should have the
lowest ->inuse counter. This is because compaction needs to move objects
in use to another page before it can release the zspage and return its
physical pages to the buddy allocator. The fewer objects in use, the
quicker compaction can release the zspage. Additionally, compaction is
measured by the number of pages it releases.
This patch reworks the fullness grouping mechanism. Instead of having two
groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage
ratio above 3/4) - that result in too many zspages being included in the
ALMOST_EMPTY group for specific classes, size classes maintain a larger
number of fullness lists that give strict guarantees on the minimum and
maximum ->inuse values within each group. Each group represents a 10%
change in the ->inuse ratio compared to neighboring groups. In essence,
there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up
to 100%.
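A hedged sketch of how a zspage could be mapped to one of these buckets
(helper and constant names are illustrative; the actual code is in
mm/zsmalloc.c):

  /* illustrative: map ->inuse to a roughly-10%-wide fullness bucket */
  static int get_fullness_group(struct size_class *class, struct zspage *zspage)
  {
          int inuse = get_zspage_inuse(zspage);
          int objs_per_zspage = class->objs_per_zspage;

          if (inuse == 0)
                  return ZS_INUSE_RATIO_0;
          if (inuse == objs_per_zspage)
                  return ZS_INUSE_RATIO_100;

          /* 10%, 20%, ... 90% buckets for partially used zspages */
          return 1 + 10 * inuse / objs_per_zspage;
  }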
This enhances the selection of candidate zspages for both zs_malloc() and
zs_compact(). A printout of the ->inuse counters of the first 7 zspages
per (random) class fullness group:
class-768 objs_per_zspage 16:
fullness 100%: empty
fullness 99%: empty
fullness 90%: empty
fullness 80%: empty
fullness 70%: empty
fullness 60%: 8 8 9 9 8 8 8
fullness 50%: empty
fullness 40%: 5 5 6 5 5 5 5
fullness 30%: 4 4 4 4 4 4 4
fullness 20%: 2 3 2 3 3 2 2
fullness 10%: 1 1 1 1 1 1 1
fullness 0%: empty
The zs_malloc() function searches through the groups of pages starting
with the one having the highest usage ratio. This means that it always
selects a zspage from the group with the least internal fragmentation
(highest usage ratio) and makes it even less fragmented by increasing its
usage ratio.
The zs_compact() function, on the other hand, begins by scanning the group
with the highest fragmentation (lowest usage ratio) to locate the source
page. The first available zspage is selected, and then the function moves
downward to find a destination zspage in the group with the lowest
internal fragmentation (highest usage ratio).
Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zsmalloc: fine-grained fullness and new compaction
algorithm", v4.
Existing zsmalloc page fullness grouping leads to suboptimal page
selection for both zs_malloc() and zs_compact(). This patchset reworks
zsmalloc fullness grouping/classification.
Additionally, it implements a new compaction algorithm that is expected
to use fewer CPU cycles (as it potentially does fewer memcpy() calls in
zs_object_copy()).
Test (synthetic) results can be seen in patch 0003.
This patch (of 4):
This optimization has no effect. It only ensures that when a zspage was
added to its corresponding fullness list, its "inuse" counter was higher
or lower than the "inuse" counter of the zspage at the head of the list.
The intention was to keep busy zspages at the head, so they could be
filled up and moved to the ZS_FULL fullness group more quickly. However,
this doesn't work as the "inuse" counter of a zspage can be modified by
obj_free() but the zspage may still belong to the same fullness list. So,
fix_fullness_group() won't change the zspage's position in relation to the
head's "inuse" counter, leading to a largely random order of zspages
within the fullness list.
For instance, consider a printout of the "inuse" counters of the first 10
zspages in a class that holds 93 objects per zspage:
ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52
As we can see, the zspage with the lowest "inuse" counter is actually the
head of the fullness list.
Remove this pointless "optimisation".
Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Using order-4 pages would be helpful for IOMMU mapping, but trying to get
order-4 pages can take quite a long time in the page allocator. From the
perspective of responsiveness, deterministic memory allocation speed is, I
think, quite important.
An order-4 allocation with __GFP_RECLAIM may spend much time in reclaim and
compaction logic, and __GFP_NORETRY may also have an effect. These cause
unpredictable delays.
To get reasonable allocation speed from the dma-buf system heap, use
HIGH_ORDER_GFP for order 4 to avoid reclaim, and remove the meaningless
__GFP_COMP for order 0.
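For reference, a sketch of the resulting flag table in the system heap (the
flag compositions are paraphrased from drivers/dma-buf/heaps/system_heap.c,
not quoted verbatim):

  #define LOW_ORDER_GFP   (GFP_HIGHUSER | __GFP_ZERO)  /* __GFP_COMP dropped for order 0 */
  #define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN | \
                             __GFP_NORETRY) & ~__GFP_RECLAIM) | __GFP_COMP)

  /* order 4 now fails fast instead of going into reclaim/compaction */
  static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
  static const unsigned int orders[] = {8, 4, 0};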
According to my tests, order 4 with MID_ORDER_GFP could get a larger number
of order-4 pages, but the elapsed times could be very slow.
time order 8 order 4 order 0
584 usec 0 160 0
28,428 usec 0 160 0
100,701 usec 0 160 0
76,645 usec 0 160 0
25,522 usec 0 160 0
38,798 usec 0 160 0
89,012 usec 0 160 0
23,015 usec 0 160 0
73,360 usec 0 160 0
76,953 usec 0 160 0
31,492 usec 0 160 0
75,889 usec 0 160 0
84,551 usec 0 160 0
84,352 usec 0 160 0
57,103 usec 0 160 0
93,452 usec 0 160 0
If HIGH_ORDER_GFP is used for order 4, the number of order-4 pages may
decrease, but the elapsed times were quite stable and fast enough.
time order 8 order 4 order 0
1,356 usec 0 155 80
1,901 usec 0 11 2384
1,912 usec 0 0 2560
1,911 usec 0 0 2560
1,884 usec 0 0 2560
1,577 usec 0 0 2560
1,366 usec 0 0 2560
1,711 usec 0 0 2560
1,635 usec 0 28 2112
544 usec 10 0 0
633 usec 2 128 0
848 usec 0 160 0
729 usec 0 160 0
1,000 usec 0 160 0
1,358 usec 0 160 0
2,638 usec 0 31 2064
Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Reviewed-by: John Stultz <jstultz@google.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 5478afc55a ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR()
to hide the uninitialized variable from compiler optimizations. However,
OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit, so the
memcpy tests did not actually check the behavior of memcpy(), because they
always contained a KMSAN report.
Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers
the memory with a barrier(), and add a test case for memcpy() that does
not expect an error report.
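A minimal sketch of such a file-local macro, assuming an illustrative name
(the real definition lives in the KMSAN KUnit test):

  /* illustrative: clobber the variable's memory without reading its value */
  #define DO_NOT_OPTIMIZE(var) barrier_data(&(var))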
Also reflow kmsan_test.c with clang-format.
Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
in set_tlb_ubc_flush_pending(). The x86 CMPXCHG instruction returns success
in the ZF flag, so this change saves a compare after cmpxchg (and the
related move instruction in front of cmpxchg).
Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when
cmpxchg fails.
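Schematically the pattern changes as follows (a generic sketch, not the
literal rmap code; 'v' is an atomic_t *, 'old', 'cur' and 'new' are ints):

  /* before: the return value of cmpxchg must be compared against 'old' */
  old = atomic_read(v);
  while ((cur = atomic_cmpxchg(v, old, new)) != old)
          old = cur;

  /* after: try_cmpxchg reports success directly; on failure it updates
   * 'old' with the current value of *v, so no extra compare is needed */
  old = atomic_read(v);
  while (!atomic_try_cmpxchg(v, &old, new))
          ;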
No functional change intended.
Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The %pGp format is used to display the 'flags' field of a struct page.
However, some page flags (e.g. PG_buddy, see page-flags.h for more details)
are stored in the page_type field. To display human-readable output of
page_type, introduce the %pGt format.
It is important to note that the meaning of the bits is different in
page_type: if page_type is 0xffffffff, no flags are set, while setting the
PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a
bit actually means setting a flag, so bits in page_type are inverted when
displaying type names.
Only values for which page_type_has_type() returns true are considered as
page_type, to avoid confusion with mapcount values. If it returns false,
only the raw value is displayed, not page type names.
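To make the inversion concrete, a hedged illustration (the PG_buddy value is
taken from the text above; the printk usage assumes %pGt takes a pointer to
the page_type word, analogous to how %pGp takes &page->flags):

  unsigned int page_type = 0xffffffff;            /* no type set */

  page_type &= ~0x00000080;                       /* "set" PG_buddy -> 0xffffff7f */

  if (page_type_has_type(page_type))
          pr_alert("page_type: %pGt\n", &page_type);   /* decoded type name */
  else
          pr_alert("page_type: %#x\n", page_type);     /* raw value only */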
Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part]
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications. The user<->idle switch
is one of the important cases. Abandoning lazy tlb entirely slows this
switching down quite a bit in the common uncontended case, so that is not
viable.
Implement a scheme where lazy tlb mm references do not contribute to the
refcount; instead, they get explicitly removed when the refcount reaches
zero.
The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they
switch away from this mm to init_mm if it was being used as the lazy tlb
mm. Enabling the shoot lazies option therefore requires that the arch
ensures that mm_cpumask contains all CPUs that could possibly be using mm.
A DEBUG_VM option IPIs every CPU in the system after this to ensure there
are no references remaining before the mm is freed.
Shootdown IPI cost could be an issue, but it has not been observed to be a
serious problem with this scheme, because short-lived processes tend not to
migrate between CPUs much and therefore don't get much chance to leave lazy
tlb mm references on remote CPUs. There are a number of options to reduce
the IPIs if necessary, described in comments.
The near-worst-case can be benchmarked with will-it-scale:
context_switch1_threads -t $(($(nproc) / 2))
This creates nproc threads (nproc / 2 switching pairs), all sharing the
same mm, spread over all CPUs so that each CPU does thread->idle->thread
switching.
[ Rik came up with basically the same idea a few years ago, so credit
to him for that. ]
Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/
Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/
Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The worst-case scenario when detecting same-element pages is that almost
all elements are the same at first glance but only the last few elements
differ.
Since same elements tend to be grouped from the beginning of a page, if we
check the first element against the last element before looping through all
elements, we have a good chance of quickly detecting non-same-element
pages.
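A sketch of the resulting check (simplified; it mirrors the zram change
referenced below, and the helper name follows the zswap function mentioned
later in this log):

  static bool page_same_filled(void *ptr, unsigned long *value)
  {
          unsigned long *page = (unsigned long *)ptr;
          unsigned long val = page[0];
          unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

          /* cheap early exit: same-filled data tends to be grouped from the
           * start of the page, so a differing last element quickly rules
           * out non-same-filled pages */
          if (val != page[last_pos])
                  return false;

          for (pos = 1; pos < last_pos; pos++) {
                  if (val != page[pos])
                          return false;
          }

          *value = val;
          return true;
  }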
1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with a simple test script which counts the iteration
number and measures the speed, offline
Under a 64-bit arch, the worst-case iteration count is PAGE_SIZE / 8 bytes = 512.
The speed is based on the time spent in the page_same_filled() function only.
The results, on average, are listed below:
Num of Iter Speed(MB/s)
Looping-Forward (Orig) 38 99265
Looping-Backward 36 102725
Last-element-check (This Patch) 33 125072
The result shows that the average iteration count decreases by 13% and the
speed increases by 25% with this patch. This patch does not increase the
overall time complexity, though.
I also ran a simpler version which uses a backward loop. Just looping
backward also yields some improvement, but less than this patch.
A similar change has already been made to zram in 90f82cbfe5 ("zram: try
to avoid worst-case scenario on same element pages").
Link: https://lkml.kernel.org/r/20230205190036.1730134-1-taejoon.song@lge.com
Signed-off-by: Taejoon Song <taejoon.song@lge.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Taejoon Song <taejoon.song@lge.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <yjay.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reports a warning in untrack_pfn(). Digging into the root cause, we
found that this is due to a memory allocation failure in pmd_alloc_one(),
and that failure is produced by failslab.
In copy_page_range(), the memory allocation for the pmd failed. During the
error handling process in copy_page_range(), mmput() is called to remove
all vmas. While untrack_pfn() runs on this empty pfn, the warning happens.
Here's a simplified flow:
dup_mm
dup_mmap
copy_page_range
copy_p4d_range
copy_pud_range
copy_pmd_range
pmd_alloc
__pmd_alloc
pmd_alloc_one
page = alloc_pages(gfp, 0);
if (!page)
return NULL;
mmput
exit_mmap
unmap_vmas
unmap_single_vma
untrack_pfn
follow_phys
WARN_ON_ONCE(1);
Since this vma was not set up successfully, we can clear the VM_PAT flag.
In this case, untrack_pfn() will not be called when cleaning up this vma.
The function untrack_pfn_moved() has also been renamed to fit the new
logic.
Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA.
We are facing a use case where multiple VMAs are present in one range of
interest. For example, the following pseudocode reproduces the error
which we are trying to fix:
- Allocate memory of size 16 pages with PROT_NONE with mmap
- Register userfaultfd
- Change protection of the first half (1 to 8 pages) of memory to
PROT_READ | PROT_WRITE. This splits the memory area into two VMAs.
- Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
out.
This is a simple use case where the user may or may not know whether the
memory area has been divided into multiple VMAs.
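A compact userspace sketch of that sequence (error handling omitted; uffd
feature negotiation abbreviated):

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          size_t page = sysconf(_SC_PAGESIZE), len = 16 * page;
          char *mem = mmap(NULL, len, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

          struct uffdio_api api = { .api = UFFD_API };
          ioctl(uffd, UFFDIO_API, &api);

          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)mem, .len = len },
                  .mode = UFFDIO_REGISTER_MODE_WP,
          };
          ioctl(uffd, UFFDIO_REGISTER, &reg);

          /* splits the 16-page area into two VMAs */
          mprotect(mem, len / 2, PROT_READ | PROT_WRITE);

          /* write-protect the whole range; this used to error out */
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)mem, .len = len },
                  .mode = UFFDIO_WRITEPROTECT_MODE_WP,
          };
          return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) ? 1 : 0;
  }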
We need an implementation which doesn't disrupt the already present users.
So, keeping things simple, stop going over all the VMAs only when one of
them hasn't been registered in WP mode. While at it, remove the unneeded
error check as well.
[akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build]
Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reported-by: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Historically, we have performed sanity checks on all struct pages being
allocated or freed, making sure they have no unexpected page flags or
certain field values. This can detect insufficient cleanup and some cases
of use-after-free, although on its own it can't always identify the
culprit. The result is a warning and the "bad page" being leaked.
The checks do need some cpu cycles, so in 4.7 with commits 479f854a20
("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
and 4db7548ccb ("mm, page_alloc: defer debugging checks of freed pages
until a PCP drain") they were no longer performed in the hot paths when
allocating and freeing from pcplists, but only when pcplists are bypassed,
refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled
the checks were instead still done in the hot paths and not when refilling
or draining pcplists.
With 4462b32c92 ("mm, page_alloc: more extensive free page checking with
debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks
back to the hot paths. When both debug_pagealloc and CONFIG_DEBUG_VM are
enabled, the checks are done both in the hot paths and on pcplist
refill/drain.
Even though the non-debug default today might seem to be a sensible
tradeoff between overhead and ability to detect bad pages, on closer look
it's arguably not. As most allocations go through the pcplists, catching
any bad pages when refilling or draining pcplists has only a small chance,
insufficient for debugging or serious hardening purposes. On the other
hand the cost of the checks is concentrated in the already expensive
drain/refill batching operations, and those are done under the often
contended zone lock. That was recently identified as an issue for page
allocation and the zone lock contention reduced by moving the checks
outside of the locked section with a patch "mm: reduce lock contention of
pcp buffer refill", but the cost of the checks is still visible compared
to their removal [1]. In the pcplist draining path free_pcppages_bulk()
the checks are still done under zone->lock.
Thus, remove the checks from the pcplist refill and drain paths
completely. Introduce a static key check_pages_enabled to control checks
during page allocation and freeing (whether the pcplist is used or
bypassed). The static key is enabled if any of the following is true:
- kernel is built with CONFIG_DEBUG_VM=y (debugging)
- debug_pagealloc or page poisoning is boot-time enabled (debugging)
- init_on_alloc or init_on_free is boot-time enabled (hardening)
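A hedged sketch of the mechanism (the key name matches the description
above; the enabling sites and the check helpers are simplified):

  /* checks are compiled in but patched out unless enabled at boot */
  static DEFINE_STATIC_KEY_FALSE(check_pages_enabled);

  /* enabled from the debugging/hardening setup paths, roughly: */
  if (IS_ENABLED(CONFIG_DEBUG_VM) || debug_pagealloc_enabled() ||
      page_poisoning_enabled() || want_init_on_alloc(GFP_KERNEL) ||
      want_init_on_free())
          static_branch_enable(&check_pages_enabled);

  /* in the allocation and free paths */
  if (static_branch_unlikely(&check_pages_enabled)) {
          if (check_new_pages(page, order))       /* or the free-side checks */
                  return NULL;
  }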
The resulting user visible changes:
- no checks when draining/refilling pcplists - less overhead, with
likely no practical reduction of ability to catch bad pages
- no checks when bypassing pcplists in default config (no
debugging/hardening) - less overhead etc. as above
- on typical hardened kernels [2], checks are now performed on each page
allocation/free (previously only when bypassing/draining/refilling
pcplists) - init_on_alloc/init_on_free being enabled should be a
sufficient indication for preferring more costly alloc/free operations
for hardening purposes, and we shouldn't need to introduce another toggle
- code (various wrappers) removal and simplification
[1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/
[2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/
[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: make check_pages_enabled static]
Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz
Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rmqueue_bulk() batches the allocation of multiple elements to refill the
per-CPU buffers into a single hold of the zone lock. Each element is
allocated and checked using check_pcp_refill(). The check touches every
related struct page which is especially expensive for higher order
allocations (huge pages).
This patch reduces the time the lock is held by moving the check out of
the critical section, similar to rmqueue_buddy(), which allocates a single
element.
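Schematically the refill path changes like this (a simplified sketch of
rmqueue_bulk(), not the literal diff):

  spin_lock_irqsave(&zone->lock, flags);
  for (i = 0; i < count; i++) {
          struct page *page = __rmqueue(zone, order, migratetype, alloc_flags);

          if (!page)
                  break;
          /* no per-page check inside the critical section anymore */
          list_add_tail(&page->pcp_list, list);
          allocated++;
  }
  spin_unlock_irqrestore(&zone->lock, flags);

  /* sanity checks now run after the zone lock has been dropped */
  list_for_each_entry_safe(page, tmp, list, pcp_list) {
          if (unlikely(check_pcp_refill(page, order))) {
                  list_del(&page->pcp_list);
                  allocated--;
          }
  }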
Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency of 50 percent for two cores and
nearly 90 percent for 24 cores.
Link: https://lkml.kernel.org/r/20230201162549.68384-1-halbuer@sra.uni-hannover.de
Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the memory charge fails, instead of returning the hpage along with an
error, allow the function to clean up the folio properly, which is normally
what a function should do in this case - either return successfully, or
return with an indicated error and no side effects from a partial run.
This also avoids the caller calling mem_cgroup_uncharge() unnecessarily in
either the anon or shmem path (even if it is safe to do so).
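A sketch of the resulting error path (simplified; the helper and return-code
names follow mm/khugepaged.c, but this is not the literal code):

  /* inside alloc_charge_hpage(struct page **hpage, ...), sketched: */
  if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp))) {
          put_page(*hpage);                       /* clean up right here */
          *hpage = NULL;                          /* report no page to the caller */
          return SCAN_CGROUP_CHARGE_FAIL;         /* caller need not uncharge */
  }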
Link: https://lkml.kernel.org/r/20230222195247.791227-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Stevens <stevensd@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>