To support detection of read faults with Radix execute-only memory, the
vma_is_accessible() check in access_error() (which checks for PROT_NONE)
was replaced with a check to see if VM_READ was missing, and if so,
returns true to assert the fault was caused by a bad read.
This is incorrect, as it ignores that both VM_WRITE and VM_EXEC imply
read on powerpc, as defined in protection_map[]. This causes mappings
containing VM_WRITE or VM_EXEC without VM_READ to misreport the cause of
page faults, since the MMU is still allowing reads.
Correct this by restoring the original vma_is_accessible() check for
PROT_NONE mappings, and adding a separate check for Radix PROT_EXEC-only
mappings.
Fixes: 395cac7752 ("powerpc/mm: Support execute-only memory on the Radix MMU")
Reported-by: Michal Suchánek <msuchanek@suse.de>
Link: https://lore.kernel.org/r/20230308152702.GR19419@kitsune.suse.cz
Tested-by: Benjamin Gray <bgray@linux.ibm.com>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20230310050834.63105-1-ruscur@russell.cc
The data storage interrupt (DSI) error will be generated when the
paste operation is issued on the suspended Nest Accelerator (NX)
window due to NX state changes. The hypervisor expects the
partition to ignore this error during page fault handling.
To differentiate DSI caused by an actual HW configuration or by
the NX window, a new “ibm,pi-features” type value is defined.
Byte 0, bit 3 of pi-attribute-specifier-type is now defined to
indicate this DSI error. If this error is not ignored, the user
space can get SIGBUS when the NX request is issued.
This patch adds changes to read ibm,pi-features property and ignore
DSI error during page fault handling if MMU_FTR_NX_DSI is defined.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
[mpe: Mention PAPR version in comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b9cd844b85eb8f70459109ce1b14e44c4cc85fa7.camel@linux.ibm.com
Add support for execute-only memory (XOM) for the Radix MMU by using an
execute-only mapping, as opposed to the RX mapping used by powerpc's
other MMUs.
The Hash MMU already supports XOM through the execute-only pkey,
which is a separate mechanism shared with x86. A PROT_EXEC-only mapping
will map to RX, and then the pkey will be applied on top of it.
mmap() and mprotect() consumers in userspace should observe the same
behaviour on Hash and Radix despite the differences in implementation.
Replacing the vma_is_accessible() check in access_error() with a read
check should be functionally equivalent for non-Radix MMUs, since it
follows write and execute checks. For Radix, the change enables
detecting faults on execute-only mappings where vma_is_accessible() would
return true.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220817050640.406017-1-ruscur@russell.cc
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We originally added asm-prototypes.h in commit 42f5b4cacd ("powerpc:
Introduce asm-prototypes.h"). It's purpose was for prototypes of C
functions that are only called from asm, in order to fix sparse
warnings about missing prototypes.
A few months later Nick added a different use case in
commit 4efca4ed05 ("kbuild: modversions for EXPORT_SYMBOL() for asm")
for C prototypes for exported asm functions. This is basically the
inverse of our original usage.
Since then we've added various prototypes to asm-prototypes.h for both
reasons, meaning we now need to unstitch it all.
Dispatch prototypes of C functions into relevant headers and keep
only the prototypes for functions defined in assembly.
For the time being, leave prom_init() there because moving it
into asm/prom.h or asm/setup.h conflicts with
drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadowrom.o
This will be fixed later by untaggling asm/pci.h and asm/prom.h
or by renaming the function in shadowrom.c
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/62d46904eca74042097acf4cb12c175e3067f3d1.1646413435.git.christophe.leroy@csgroup.eu
Since commit 46ddcb3950 ("powerpc/mm: Show if a bad page fault on data
is read or write.") we use page_fault_is_write(regs->dsisr) in
__bad_page_fault() to determine if the fault is for a read or write, and
change the message printed accordingly.
But SLB faults, aka Data Segment Interrupts, don't set DSISR (Data
Storage Interrupt Status Register) to a useful value. All ISA versions
from v2.03 through v3.1 specify that the Data Segment Interrupt sets
DSISR "to an undefined value". As far as I can see there's no mention of
SLB faults setting DSISR in any BookIV content either.
This manifests as accesses that should be a read being incorrectly
reported as writes, for example, using the xmon "dump" command:
0:mon> d 0x5deadbeef0000000
5deadbeef0000000
[359526.415354][ C6] BUG: Unable to handle kernel data access on write at 0x5deadbeef0000000
[359526.415611][ C6] Faulting instruction address: 0xc00000000010a300
cpu 0x6: Vector: 380 (Data SLB Access) at [c00000000ffbf400]
pc: c00000000010a300: mread+0x90/0x190
If we disassemble the PC, we see a load instruction:
0:mon> di c00000000010a300
c00000000010a300 89490000 lbz r10,0(r9)
We can also see in exceptions-64s.S that the data_access_slb block
doesn't set IDSISR=1, which means it doesn't load DSISR into pt_regs. So
the value we're using to determine if the fault is a read/write is some
stale value in pt_regs from a previous page fault.
Rework the printing logic to separate the SLB fault case out, and only
print read/write in the cases where we can determine it.
The result looks like eg:
0:mon> d 0x5deadbeef0000000
5deadbeef0000000
[ 721.779525][ C6] BUG: Unable to handle kernel data access at 0x5deadbeef0000000
[ 721.779697][ C6] Faulting instruction address: 0xc00000000014cbe0
cpu 0x6: Vector: 380 (Data SLB Access) at [c00000000ffbf390]
0:mon> d 0
0000000000000000
[ 742.793242][ C6] BUG: Kernel NULL pointer dereference at 0x00000000
[ 742.793316][ C6] Faulting instruction address: 0xc00000000014cbe0
cpu 0x6: Vector: 380 (Data SLB Access) at [c00000000ffbf390]
Fixes: 46ddcb3950 ("powerpc/mm: Show if a bad page fault on data is read or write.")
Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20220222113449.319193-1-mpe@ellerman.id.au
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
The powerpc kernel is not prepared to handle exec faults from kernel.
Especially, the function is_exec_fault() will return 'false' when an
exec fault is taken by kernel, because the check is based on reading
current->thread.regs->trap which contains the trap from user.
For instance, when provoking a LKDTM EXEC_USERSPACE test,
current->thread.regs->trap is set to SYSCALL trap (0xc00), and
the fault taken by the kernel is not seen as an exec fault by
set_access_flags_filter().
Commit d7df2443cd ("powerpc/mm: Fix spurious segfaults on radix
with autonuma") made it clear and handled it properly. But later on
commit d3ca587404 ("powerpc/mm: Fix reporting of kernel execute
faults") removed that handling, introducing test based on error_code.
And here is the problem, because on the 603 all upper bits of SRR1
get cleared when the TLB instruction miss handler bails out to ISI.
Until commit cbd7e6ca02 ("powerpc/fault: Avoid heavy
search_exception_tables() verification"), an exec fault from kernel
at a userspace address was indirectly caught by the lack of entry for
that address in the exception tables. But after that commit the
kernel mainly relies on KUAP or on core mm handling to catch wrong
user accesses. Here the access is not wrong, so mm handles it.
It is a minor fault because PAGE_EXEC is not set,
set_access_flags_filter() should set PAGE_EXEC and voila.
But as is_exec_fault() returns false as explained in the beginning,
set_access_flags_filter() bails out without setting PAGE_EXEC flag,
which leads to a forever minor exec fault.
As the kernel is not prepared to handle such exec faults, the thing to
do is to fire in bad_kernel_fault() for any exec fault taken by the
kernel, as it was prior to commit d3ca587404.
Fixes: d3ca587404 ("powerpc/mm: Fix reporting of kernel execute faults")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/024bb05105050f704743a0083fe3548702be5706.1625138205.git.christophe.leroy@csgroup.eu
Define macros to list ppc interrupt types in interttupt.h, replace the
reference of the trap hex values with these macros.
Referred the hex numbers in arch/powerpc/kernel/exceptions-64e.S,
arch/powerpc/kernel/exceptions-64s.S, arch/powerpc/kernel/head_*.S,
arch/powerpc/kernel/head_booke.h and arch/powerpc/include/asm/kvm_asm.h.
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
[mpe: Resolve conflicts in nmi_disables_ftrace(), fix 40x build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1618398033-13025-1-git-send-email-sxwjean@me.com
The allyesconfig ppc64 kernel fails to link with relocations unable to
fit after commit 3a96570ffc ("powerpc: convert interrupt handlers to
use wrappers"), which is due to the interrupt handler functions being
put into the .noinstr.text section, which the linker script places on
the opposite side of the main .text section from the interrupt entry
asm code which calls the handlers.
This results in a lot of linker stubs that overwhelm the 252-byte sized
space we allow for them, or in the case of BE a .opd relocation link
error for some reason.
It's not required to put interrupt handlers in the .noinstr section,
previously they used NOKPROBE_SYMBOL, so take them out and replace
with a NOKPROBE_SYMBOL in the wrapper macro. Remove the explicit
NOKPROBE_SYMBOL macros in the interrupt handler functions. This makes
a number of interrupt handlers nokprobe that were not prior to the
interrupt wrappers commit, but since that commit they were made
nokprobe due to being in .noinstr.text, so this fix does not change
that.
The fixes tag is different to the commit that first exposes the problem
because it is where the wrapper macros were introduced.
Fixes: 8d41fc618a ("powerpc: interrupt handler wrapper functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Slightly fix up comment wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210211063636.236420-1-npiggin@gmail.com
This moves the 64s/hash context tracking from hash_page_mm() to
__do_hash_fault(), so it's no longer called by OCXL / SPU
accelerators, which was certainly the wrong thing to be doing,
because those callers are not low level interrupt handlers, so
should have entered a kernel context tracking already.
Then remain in kernel context for the duration of the fault,
rather than enter/exit for the hash fault then enter/exit for
the page fault, which is pointless.
Even still, calling exception_enter/exit in __do_hash_fault seems
questionable because that's touching per-cpu variables, tracing,
etc., which might have been interrupted by this hash fault or
themselves cause hash faults. But maybe I miss something because
hash_page_mm very deliberately calls trace_hash_fault too, for
example. So for now go with it, it's no worse than before, in this
regard.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-32-npiggin@gmail.com
This simplifies code, and it is also useful when introducing
interrupt handler wrappers when introducing wrapper functionality
that doesn't cope with asm entry code calling into more than one
handler function.
32-bit and 64e still have some such cases, which limits some ways
they can use interrupt wrappers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-15-npiggin@gmail.com
Make mm fault handlers all just take the pt_regs * argument and load
DAR/DSISR from that. Make those that return a value return long.
This is done to make the function signatures match other handlers, which
will help with a future patch to add wrappers. Explicit arguments could
be added for performance but that would require more wrapper macro
variants.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210130130852.2952424-7-npiggin@gmail.com
Exception fixup doesn't require the heady full regs saving,
do it from do_page_fault() directly.
For that, split bad_page_fault() in two parts.
As bad_page_fault() can also be called from other places than
handle_page_fault(), it will still perform exception fixup and
fallback on __bad_page_fault().
handle_page_fault() directly calls __bad_page_fault() as the
exception fixup will now be done by do_page_fault()
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bd07d6fef9237614cd6d318d8f19faeeadaa816b.1607491748.git.christophe.leroy@csgroup.eu
search_exception_tables() is an heavy operation, we have to avoid it.
When KUAP is selected, we'll know the fault has been blocked by KUAP.
When it is blocked by KUAP, check whether we are in an expected
userspace access place. If so, emit a warning to spot something is
going work. Otherwise, just remain silent, it will likely Oops soon.
When KUAP is not selected, it behaves just as if the address was
already in the TLBs and no fault was generated.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9870f01e293a5a76c4f4e4ddd4a6b0f63038c591.1607491748.git.christophe.leroy@csgroup.eu
This partially reverts commit eb232b1624 ("powerpc/book3s64/kuap: Improve
error reporting with KUAP") and update the fault handler to print
[ 55.022514] Kernel attempted to access user page (7e6725b70000) - exploit attempt? (uid: 0)
[ 55.022528] BUG: Unable to handle kernel data access on read at 0x7e6725b70000
[ 55.022533] Faulting instruction address: 0xc000000000e8b9bc
[ 55.022540] Oops: Kernel access of bad area, sig: 11 [#1]
....
when the kernel access userspace address without unlocking AMR.
bad_kuap_fault() is added as part of commit 5e5be3aed2 ("powerpc/mm: Detect
bad KUAP faults") to catch userspace access incorrectly blocked by AMR. Hence
retain the full stack dump there even with hash translation. Also, add a comment
explaining the difference between hash and radix.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201208031539.84878-1-aneesh.kumar@linux.ibm.com
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.
The logic aims to prevent userspace from doing bad accesses below the
stack pointer. However as long as the stack is < 1MB in size, we allow
all accesses without further checks. Adding some debug I see that I
can do a full kernel build and LTP run, and not a single process has
used more than 1MB of stack. So for the majority of processes the
logic never even fires.
We also recently found a nasty bug in this code which could cause
userspace programs to be killed during signal delivery. It went
unnoticed presumably because most processes use < 1MB of stack.
The generic mm code has also grown support for stack guard pages since
this code was originally written, so the most heinous case of the
stack expanding into other mappings is now handled for us.
Finally although some other arches have special logic in this path,
from what I can tell none of x86, arm64, arm and s390 impose any extra
checks other than those in expand_stack().
So drop our complicated logic and like other architectures just let
the stack expand as long as its within the rlimit.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Daniel Axtens <dja@axtens.net>
Link: https://lore.kernel.org/r/20200724092528.1578671-4-mpe@ellerman.id.au
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.
The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.
The 2048 byte allowance below the stack pointer is there to cover the
288 byte "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.
Unfortunately since then the signal frame has expanded, and is now
4224 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4224 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.
The total size of the signal frame is the size of struct rt_sigframe
(which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
64-bit).
The 2048 byte allowance was correct until 2008 as the signal frame
was:
struct rt_sigframe {
struct ucontext uc; /* 0 1440 */
/* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
long unsigned int _unused[2]; /* 1440 16 */
unsigned int tramp[6]; /* 1456 24 */
struct siginfo * pinfo; /* 1480 8 */
void * puc; /* 1488 8 */
struct siginfo info; /* 1496 128 */
/* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
char abigap[288]; /* 1624 288 */
/* size: 1920, cachelines: 15, members: 7 */
/* padding: 8 */
};
1920 + 128 = 2048
Then in commit ce48b21007 ("powerpc: Add VSX context save/restore,
ptrace and signal support") (Jul 2008) the signal frame expanded to
2304 bytes:
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */ <--
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
long unsigned int _unused[2]; /* 1696 16 */
unsigned int tramp[6]; /* 1712 24 */
struct siginfo * pinfo; /* 1736 8 */
void * puc; /* 1744 8 */
struct siginfo info; /* 1752 128 */
/* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
char abigap[288]; /* 1880 288 */
/* size: 2176, cachelines: 17, members: 7 */
/* padding: 8 */
};
2176 + 128 = 2304
At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.
Then in 2010 commit 320b2b8de1 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.
That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext uc_transact; /* 1696 1696 */ <--
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int _unused[2]; /* 3392 16 */
unsigned int tramp[6]; /* 3408 24 */
struct siginfo * pinfo; /* 3432 8 */
void * puc; /* 3440 8 */
struct siginfo info; /* 3448 128 */
/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
char abigap[288]; /* 3576 288 */
/* size: 3872, cachelines: 31, members: 8 */
/* padding: 8 */
/* last cacheline: 32 bytes */
};
3872 + 128 = 4000
And commit 573ebfa660 ("powerpc: Increase stack redzone for 64-bit
userspace to 512 bytes") (Feb 2014):
struct rt_sigframe {
struct ucontext uc; /* 0 1696 */
/* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
struct ucontext uc_transact; /* 1696 1696 */
/* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
long unsigned int _unused[2]; /* 3392 16 */
unsigned int tramp[6]; /* 3408 24 */
struct siginfo * pinfo; /* 3432 8 */
void * puc; /* 3440 8 */
struct siginfo info; /* 3448 128 */
/* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
char abigap[512]; /* 3576 512 */ <--
/* size: 4096, cachelines: 32, members: 8 */
/* padding: 8 */
};
4096 + 128 = 4224
Then finally in 2017, commit 1be7107fbe ("mm: larger stack guard
gap, between vmas") exposed us to the existing bug, because it changed
the stack VMA to be the correct/real size, meaning our stack expansion
code is now triggered.
Fix it by increasing the allowance to 4224 bytes.
Hard-coding 4224 is obviously unsafe against future expansions of the
signal frame in the same way as the existing code. We can't easily use
sizeof() because the signal frame structure is not in a header. We
will either fix that, or rip out all the custom stack expansion
checking logic entirely.
Fixes: ce48b21007 ("powerpc: Add VSX context save/restore, ptrace and signal support")
Cc: stable@vger.kernel.org # v2.6.27+
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
Currently unsigned ints are used to represent instructions on powerpc.
This has worked well as instructions have always been 4 byte words.
However, ISA v3.1 introduces some changes to instructions that mean
this scheme will no longer work as well. This change is Prefixed
Instructions. A prefixed instruction is made up of a word prefix
followed by a word suffix to make an 8 byte double word instruction.
No matter the endianness of the system the prefix always comes first.
Prefixed instructions are only planned for powerpc64.
Introduce a ppc_inst type to represent both prefixed and word
instructions on powerpc64 while keeping it possible to exclusively
have word instructions on powerpc32.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Fix compile error in emulate_spe()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200506034050.24806-12-jniethe5@gmail.com
If multiple threads in userspace keep changing the protection keys
mapping a range, there can be a scenario where kernel takes a key fault
but the pkey value found in the siginfo struct is a permissive one.
This can confuse the userspace as shown in the below test case.
/* use this to control the number of test iterations */
static void pkeyreg_set(int pkey, unsigned long rights)
{
unsigned long reg, shift;
shift = (NR_PKEYS - pkey - 1) * PKEY_BITS_PER_PKEY;
asm volatile("mfspr %0, 0xd" : "=r"(reg));
reg &= ~(((unsigned long) PKEY_BITS_MASK) << shift);
reg |= (rights & PKEY_BITS_MASK) << shift;
asm volatile("mtspr 0xd, %0" : : "r"(reg));
}
static unsigned long pkeyreg_get(void)
{
unsigned long reg;
asm volatile("mfspr %0, 0xd" : "=r"(reg));
return reg;
}
static int sys_pkey_mprotect(void *addr, size_t len, int prot, int pkey)
{
return syscall(SYS_pkey_mprotect, addr, len, prot, pkey);
}
static int sys_pkey_alloc(unsigned long flags, unsigned long access_rights)
{
return syscall(SYS_pkey_alloc, flags, access_rights);
}
static int sys_pkey_free(int pkey)
{
return syscall(SYS_pkey_free, pkey);
}
static int faulting_pkey;
static int permissive_pkey;
static pthread_barrier_t pkey_set_barrier;
static pthread_barrier_t mprotect_barrier;
static void pkey_handle_fault(int signum, siginfo_t *sinfo, void *ctx)
{
unsigned long pkeyreg;
/* FIXME: printf is not signal-safe but for the current purpose,
it gets the job done. */
printf("pkey: exp = %d, got = %d\n", faulting_pkey, sinfo->si_pkey);
fflush(stdout);
assert(sinfo->si_code == SEGV_PKUERR);
assert(sinfo->si_pkey == faulting_pkey);
/* clear pkey permissions to let the faulting instruction continue */
pkeyreg_set(faulting_pkey, 0x0);
}
static void *do_mprotect_fault(void *p)
{
unsigned long rights, pkeyreg, pgsize;
unsigned int i;
void *region;
int pkey;
srand(time(NULL));
pgsize = sysconf(_SC_PAGESIZE);
rights = PKEY_DISABLE_WRITE;
region = p;
/* allocate key, no permissions */
assert((pkey = sys_pkey_alloc(0, PKEY_DISABLE_ACCESS)) > 0);
pkeyreg_set(4, 0x0);
/* cache the pkey here as the faulting pkey for future reference
in the signal handler */
faulting_pkey = pkey;
printf("%s: faulting pkey = %d\n", __func__, faulting_pkey);
/* try to allocate, mprotect and free pkeys repeatedly */
for (i = 0; i < NUM_ITERATIONS; i++) {
/* sync up with the other thread here */
pthread_barrier_wait(&pkey_set_barrier);
/* make sure that the pkey used by the non-faulting thread
is made permissive for this thread's context too so that
no faults are triggered because it still might have been
set to a restrictive value */
// pkeyreg_set(permissive_pkey, 0x0);
/* sync up with the other thread here */
pthread_barrier_wait(&mprotect_barrier);
/* perform mprotect */
assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey));
/* choose a random byte from the protected region and
attempt to write to it, this will generate a fault */
*((char *) region + (rand() % pgsize)) = rand();
/* restore pkey permissions as the signal handler may have
cleared the bit out for the sake of continuing */
pkeyreg_set(pkey, PKEY_DISABLE_WRITE);
}
/* free pkey */
sys_pkey_free(pkey);
return NULL;
}
static void *do_mprotect_nofault(void *p)
{
unsigned long pgsize;
unsigned int i, j;
void *region;
int pkey;
pgsize = sysconf(_SC_PAGESIZE);
region = p;
/* try to allocate, mprotect and free pkeys repeatedly */
for (i = 0; i < NUM_ITERATIONS; i++) {
/* allocate pkey, all permissions */
assert((pkey = sys_pkey_alloc(0, 0)) > 0);
permissive_pkey = pkey;
/* sync up with the other thread here */
pthread_barrier_wait(&pkey_set_barrier);
pthread_barrier_wait(&mprotect_barrier);
/* perform mprotect on the common page, no faults will
be triggered as this is most permissive */
assert(!sys_pkey_mprotect(region, pgsize, PROT_READ | PROT_WRITE, pkey));
/* free pkey */
assert(!sys_pkey_free(pkey));
}
return NULL;
}
int main(int argc, char **argv)
{
pthread_t fault_thread, nofault_thread;
unsigned long pgsize;
struct sigaction act;
pthread_attr_t attr;
cpu_set_t fault_cpuset, nofault_cpuset;
unsigned int i;
void *region;
/* allocate memory region to protect */
pgsize = sysconf(_SC_PAGESIZE);
assert(region = memalign(pgsize, pgsize));
CPU_ZERO(&fault_cpuset);
CPU_SET(0, &fault_cpuset);
CPU_ZERO(&nofault_cpuset);
CPU_SET(8, &nofault_cpuset);
assert(!pthread_attr_init(&attr));
/* setup sigsegv signal handler */
act.sa_handler = 0;
act.sa_sigaction = pkey_handle_fault;
assert(!sigprocmask(SIG_SETMASK, 0, &act.sa_mask));
act.sa_flags = SA_SIGINFO;
act.sa_restorer = 0;
assert(!sigaction(SIGSEGV, &act, NULL));
/* setup barrier for the two threads */
pthread_barrier_init(&pkey_set_barrier, NULL, 2);
pthread_barrier_init(&mprotect_barrier, NULL, 2);
/* setup and start threads */
assert(!pthread_create(&fault_thread, &attr, &do_mprotect_fault, region));
assert(!pthread_setaffinity_np(fault_thread, sizeof(cpu_set_t), &fault_cpuset));
assert(!pthread_create(&nofault_thread, &attr, &do_mprotect_nofault, region));
assert(!pthread_setaffinity_np(nofault_thread, sizeof(cpu_set_t), &nofault_cpuset));
/* cleanup */
assert(!pthread_attr_destroy(&attr));
assert(!pthread_join(fault_thread, NULL));
assert(!pthread_join(nofault_thread, NULL));
assert(!pthread_barrier_destroy(&pkey_set_barrier));
assert(!pthread_barrier_destroy(&mprotect_barrier));
free(region);
puts("PASS");
return EXIT_SUCCESS;
}
The above test can result the below failure without this patch.
pkey: exp = 3, got = 3
pkey: exp = 3, got = 4
a.out: pkey-siginfo-race.c💯 pkey_handle_fault: Assertion `sinfo->si_pkey == faulting_pkey' failed.
Aborted
Check for vma access before considering this a key fault. If vma pkey allow
access retry the acess again.
Test case is written by Sandipan Das <sandipan@linux.ibm.com> hence added SOB
from him.
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200505071729.54912-3-aneesh.kumar@linux.ibm.com