For each node, memcgs are divided into two generations: the old and
the young. For each generation, memcgs are randomly sharded into
multiple bins to improve scalability. For each bin, an RCU hlist_nulls
is virtually divided into three segments: the head, the tail and the
default.
An onlining memcg is added to the tail of a random bin in the old
generation. The eviction starts at the head of a random bin in the old
generation. The per-node memcg generation counter, whose reminder (mod
2) indexes the old generation, is incremented when all its bins become
empty.
There are four operations:
1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
its current generation (old or young) and updates its "seg" to
"head";
2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
its current generation (old or young) and updates its "seg" to
"tail";
3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
the old generation, updates its "gen" to "old" and resets its "seg"
to "default";
4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
in the young generation, updates its "gen" to "young" and resets
its "seg" to "default".
The events that trigger the above operations are:
1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
2. The first attempt to reclaim an memcg below low, which triggers
MEMCG_LRU_TAIL;
3. The first attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_TAIL;
4. The second attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_YOUNG;
5. Attempting to reclaim an memcg below min, which triggers
MEMCG_LRU_YOUNG;
6. Finishing the aging on the eviction path, which triggers
MEMCG_LRU_YOUNG;
7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
Note that memcg LRU only applies to global reclaim, and the
round-robin incrementing of their max_seq counters ensures the
eventual fairness to all eligible memcgs. For memcg reclaim, it still
relies on mem_cgroup_iter().
Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial. As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.
Add two new boot parameters that allow to alleviate that slowdown:
- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
every Nth page_alloc allocation with the order configured by the second
added parameter (default: tag every such allocation).
- kasan.page_alloc.sample.order, which makes sampling enabled by the first
parameter only affect page_alloc allocations with the order equal or
greater than the specified value (default: 3, see below).
The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.
The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is
done for two reasons:
1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
costly to service", which corresponds to the idea that only large and
thus costly allocations are supposed to sampled.
2. One of the workloads targeted by this patch is a benchmark that sends
a large amount of data over a local loopback connection. Most multi-page
data allocations in the networking subsystem have the order of
SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).
When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sampling to a value
higher than 1 allows to lower the slowdown. The performance improvement
saturates around the sampling interval value of 10 with the default
sampling page order of 3. This lowers the slowdown to ~20%. The slowdown
in real scenarios involving the network will likely be better.
Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged. This lowers the value
of KASAN as a security mitigation.
However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations. The
rest ~93% of allocations are still checked deterministically.
Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Brand <markbrand@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired
using pcp_spin_trylock_irqsave() even though IRQs are disabled and
re-entrancy is impossible.
Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the
common case at the cost of some IRQ allocations taking a slower path. If
the PCP lists need to be refilled, the zone lock still needs to disable
IRQs but that will only happen on PCP refill and drain. If an IRQ is
raised when a PCP allocation is in progress, the trylock will fail and
fallback to using the buddy lists directly. Note that this may not be a
universal win if an interrupt-intensive workload also allocates heavily
from interrupt context and contends heavily on the zone->lock as a result.
[mgorman@techsingularity.net: migratetype might be wrong if a PCP was locked]
Link: https://lkml.kernel.org/r/20221122131229.5263-2-mgorman@techsingularity.net
[yuzhao@google.com: reported lockdep issue on IO completion from softirq]
[hughd@google.com: fix list corruption, lock improvements, micro-optimsations]
Link: https://lkml.kernel.org/r/20221118101714.19590-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now?
Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
but if we slightly abuse subpages_mapcount by additionally demanding that
one bit be set there when the compound page is PMD-mapped, then a cascade
of two atomic ops is able to maintain the stats without bit_spin_lock.
This is harder to reason about than when bit_spin_locked, but I believe
safe; and no drift in stats detected when testing. When there are racing
removes and adds, of course the sequence of operations is less well-
defined; but each operation on subpages_mapcount is atomically good. What
might be disastrous, is if subpages_mapcount could ever fleetingly appear
negative: but the pte lock (or pmd lock) these rmap functions are called
under, ensures that a last remove cannot race ahead of a first add.
Continue to make an exception for hugetlb (PageHuge) pages, though that
exception can be easily removed by a further commit if necessary: leave
subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
carry on checking compound_mapcount too in folio_mapped(), page_mapped().
Evidence is that this way goes slightly faster than the previous
implementation in all cases (pmds after ptes now taking around 103ms); and
relieves us of worrying about contention on the bit_spin_lock.
Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Compound page (folio) mapcount calculations have been different for anon
and file (or shmem) THPs, and involved the obscure PageDoubleMap flag.
And each huge mapping and unmapping of a file (or shmem) THP involved
atomically incrementing and decrementing the mapcount of every subpage of
that huge page, dirtying many struct page cachelines.
Add subpages_mapcount field to the struct folio and first tail page, so
that the total of subpage mapcounts is available in one place near the
head: then page_mapcount() and total_mapcount() and page_mapped(), and
their folio equivalents, are so quick that anon and file and hugetlb don't
need to be optimized differently. Delete the unloved PageDoubleMap.
page_add and page_remove rmap functions must now maintain the
subpages_mapcount as well as the subpage _mapcount, when dealing with pte
mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
NR_FILE_MAPPED statistics still needs reading through the subpages, using
nr_subpages_unmapped() - but only when first or last pmd mapping finds
subpages_mapcount raised (double-map case, not the common case).
But are those counts (used to decide when to split an anon THP, and in
vmscan's pagecache_reclaimable heuristic) correctly maintained? Not
quite: since page_remove_rmap() (and also split_huge_pmd()) is often
called without page lock, there can be races when a subpage pte mapcount
0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
previous implementation had prevented. The statistics might become
inaccurate, and even drift down until they underflow through 0. That is
not good enough, but is better dealt with in a followup patch.
Update a few comments on first and second tail page overlaid fields.
hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
subpages_mapcount and compound_pincount are already correctly at 0, so
delete its reinitialization of compound_pincount.
A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
18 seconds on small pages, and used to take 1 second on huge pages, but
now takes 119 milliseconds on huge pages. Mapping by pmds a second time
used to take 860ms and now takes 92ms; mapping by pmds after mapping by
ptes (when the scan is needed) used to take 870ms and now takes 495ms.
But there might be some benchmarks which would show a slowdown, because
tail struct pages now fall out of cache until final freeing checks them.
Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Although page allocation always clears page->private in the first page or
head page of an allocation, it has never made a point of clearing
page->private in the tails (though 0 is often what is already there).
But now commit 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t
during THP split") issues a warning when page_tail->private is found to be
non-0 (unless it's swapcache).
Change that warning to dump page_tail (which also dumps head), instead of
just the head: so far we have seen dead000000000122, dead000000000003,
dead000000000001 or 0000000000000002 in the raw output for tail private.
We could just delete the warning, but today's consensus appears to want
page->private to be 0, unless there's a good reason for it to be set: so
now clear it in prep_compound_tail() (more general than just for THP; but
not for high order allocation, which makes no pass down the tails).
Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
Fixes: 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t during THP split")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull more MM updates from Andrew Morton:
- fix a race which causes page refcounting errors in ZONE_DEVICE pages
(Alistair Popple)
- fix userfaultfd test harness instability (Peter Xu)
- various other patches in MM, mainly fixes
* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
highmem: fix kmap_to_page() for kmap_local_page() addresses
mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
mm/selftest: uffd: explain the write missing fault check
mm/hugetlb: use hugetlb_pte_stable in migration race check
mm/hugetlb: fix race condition of uffd missing/minor handling
zram: always expose rw_page
LoongArch: update local TLB if PTE entry exists
mm: use update_mmu_tlb() on the second thread
kasan: fix array-bounds warnings in tests
hmm-tests: add test for migrate_device_range()
nouveau/dmem: evict device private memory during release
nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
mm/migrate_device.c: add migrate_device_range()
mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
mm/memremap.c: take a pgmap reference on page allocation
mm: free device private pages have zero refcount
mm/memory.c: fix race when faulting a device private page
mm/damon: use damon_sz_region() in appropriate place
mm/damon: move sz_damon_region to damon_sz_region
lib/test_meminit: add checks for the allocation functions
...
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
Pull kvm updates from Paolo Bonzini:
"The first batch of KVM patches, mostly covering x86.
ARM:
- Account stage2 page table allocations in memory stats
x86:
- Account EPT/NPT arm64 page table allocations in memory stats
- Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR
accesses
- Drop eVMCS controls filtering for KVM on Hyper-V, all known
versions of Hyper-V now support eVMCS fields associated with
features that are enumerated to the guest
- Use KVM's sanitized VMCS config as the basis for the values of
nested VMX capabilities MSRs
- A myriad event/exception fixes and cleanups. Most notably, pending
exceptions morph into VM-Exits earlier, as soon as the exception is
queued, instead of waiting until the next vmentry. This fixed a
longstanding issue where the exceptions would incorrecly become
double-faults instead of triggering a vmexit; the common case of
page-fault vmexits had a special workaround, but now it's fixed for
good
- A handful of fixes for memory leaks in error paths
- Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow
- Never write to memory from non-sleepable kvm_vcpu_check_block()
- Selftests refinements and cleanups
- Misc typo cleanups
Generic:
- remove KVM_REQ_UNHALT"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
KVM: remove KVM_REQ_UNHALT
KVM: mips, x86: do not rely on KVM_REQ_UNHALT
KVM: x86: never write to memory from kvm_vcpu_check_block()
KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events
KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending
KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter
KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set
KVM: x86: lapic does not have to process INIT if it is blocked
KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
KVM: x86: make vendor code check for all nested events
mailmap: Update Oliver's email address
KVM: x86: Allow force_emulation_prefix to be written without a reload
KVM: selftests: Add an x86-only test to verify nested exception queueing
KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()
KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
...
In MIGRATE_ISOLATE case, zone freepage state shouldn't be modified as
caller will take care of it. Add missing is_migrate_isolate() here to
avoid possible unbalanced freepage state. This would happen if someone
isolates the block, and then we face an MCE failure/soft-offline on a page
within that block. __mod_zone_freepage_state() will be triggered via
below call trace which already had been triggered back when block was
isolated:
take_page_off_buddy
break_down_buddy_pages
set_page_guard
Link: https://lkml.kernel.org/r/20220916072257.9639-9-linmiaohe@huawei.com
Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "A few cleanup patches for mm", v2.
This series contains a few cleanup patches to remove the obsolete comments
and functions, use helper macro to improve readability and so on. More
details can be found in the respective changelogs.
This patch (of 16):
If ALLOC_KSWAPD is set, wake_all_kswapds() will be called to ensure kswapd
doesn't accidentally go to sleep. But when reserve_flags is set,
alloc_flags will be overwritten and ALLOC_KSWAPD is thus lost. Preserve
the ALLOC_KSWAPD flag in alloc_flags to ensure kswapd won't go to sleep
accidentally.
Link: https://lkml.kernel.org/r/20220916072257.9639-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220916072257.9639-2-linmiaohe@huawei.com
Fixes: 0a79cdad5e ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The name "check_free_page()" provides no information regarding its return
value when the page is indeed found to be bad.
Renaming it to "free_page_is_bad()" makes it clear that a `true' return
value means the page was bad.
And make it return a bool, not an int.
[akpm@linux-foundation.org: don't use bool as int]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zhaoyang Huang <huangzhaoyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A number of drivers call page_frag_alloc() with a fragment's size >
PAGE_SIZE.
In low memory conditions, __page_frag_cache_refill() may fail the order
3 cache allocation and fall back to order 0; In this case, the cache
will be smaller than the fragment, causing memory corruptions.
Prevent this from happening by checking if the newly allocated cache is
large enough for the fragment; if not, the allocation will fail and
page_frag_alloc() will return NULL.
Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca09 ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Cc: Chen Lin <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
kswapd_is_running() in kcompactd(),
kswapd_run/stop() kcompactd()
kswapd_is_running()
pgdat->kswapd // error or nomal ptr
verify pgdat->kswapd
// load non-NULL
pgdat->kswapd
pgdat->kswapd = NULL
task_is_running(pgdat->kswapd)
// Null pointer derefence
KASAN reports the null-ptr-deref shown below,
vmscan: Failed to start kswapd on node 0
...
BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
Read of size 8 at addr 0000000000000024 by task kcompactd0/37
CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
dump_backtrace+0x0/0x394
show_stack+0x34/0x4c
dump_stack+0x158/0x1e4
__kasan_report+0x138/0x140
kasan_report+0x44/0xdc
__asan_load8+0x94/0xd0
kcompactd+0x440/0x504
kthread+0x1a4/0x1f0
ret_from_fork+0x10/0x18
At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected
by mem_hotplug_begin/done(), but without kcompactd(). There is no need to
involve memory hotplug lock in kcompactd(), so let's add a new mutex to
protect pgdat->kswapd accesses.
Also, because the kcompactd task will check the state of kswapd task, it's
better to call kcompactd_stop() before kswapd_stop() to reduce lock
conflicts.
[akpm@linux-foundation.org: add comments]
Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In commit 2f1ee0913c ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid some panic problem. It seems that we cannot track early page
allocations in current kernel even if page structure has been initialized
early.
This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above. It can help us to catch early page allocations. This is
useful especially when we find that the free memory value is not the same
right after different kernel booting.
[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>