As already mentioned in my merge message for the 'expand-stack' branch,
we have something like 24 different versions of the page fault path for
all our different architectures, all just _slightly_ different due to
various historical reasons (usually related to exactly when they
branched off the original i386 version, and the details of the other
architectures they had in their history).
And a few of them had some silly mistake in the conversion.
Most of the architectures call the faulting address 'address' in the
fault path. But not all. Some just call it 'addr'. And if you end up
doing a bit too much copy-and-paste, you end up with the wrong version
in the places that do it differently.
In this case it was csky.
Fixes: a050ba1e74 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This does the simple pattern conversion of alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa to the lock_mm_and_find_vma()
helper. They all have the regular fault handling pattern without odd
special cases.
The remaining architectures all have something that keeps us from a
straightforward conversion: ia64 and parisc have stacks that can grow
both up as well as down (and ia64 has special address region checks).
And m68k, microblaze, openrisc, sparc64, and um end up having extra
rules about only expanding the stack down a limited amount below the
user space stack pointer. That is something that x86 used to do too
(long long ago), and it probably could just be skipped, but it still
makes the conversion less than trivial.
Note that this conversion was done manually and with the exception of
alpha without any build testing, because I have a fairly limited cross-
building environment. The cases are all simple, and I went through the
changes several times, but...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Sync arch/riscv/mm/fault.c into arch/csky for easy maintenance.
Here are the patches related to the modification:
cac4d1d "riscv/mm/fault: Move no context handling to no_context()"
ac416a7 "riscv/mm/fault: Move vmalloc fault handling to vmalloc_fault()"
6c11ffb "riscv/mm/fault: Move fault error handling to mm_fault_error()"
afb8c6f "riscv/mm/fault: Move access error check to function"
bda281d "riscv/mm/fault: Simplify fault error handling"
a51271d "riscv/mm/fault: Move bad area handling to bad_area()"
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Similar to other architectures:
In addition to in_atomic, we also need pagefault_disabled() to
check.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
The past code only passes the FAULT_FLAG_WRITE into
handle_mm_fault and missing USER & DEFAULT & RETRY.
The patch references to arch/riscv/mm/fault.c, but there is no
FAULT_FLAG_INSTRUCTION in csky hw.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
There is a prologue on page fault handler which marking pages dirty
and/or accessed in page attributes, but all of these have been
handled in handle_pte_fault.
- Add flush_tlb_one in vmalloc page fault instead of prologue.
- Using cmxchg_fixup C codes in do_page_fault instead of ASM one.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
There are two ways for translating va to pa for csky:
- Use TLB(Translate Lookup Buffer) and PTW (Page Table Walk)
- Use SSEG0/1 (Simple Segment Mapping)
We use tlb mapping 0-2G and 3G-4G virtual address area and SSEG0/1
are for 2G-2.5G and 2.5G-3G translation. We could disable SSEG0
to use 2G-2.5G as TLB user mapping.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
This patch enable kprobes, kretprobes, ftrace interface. It utilized
software breakpoint and single step debug exceptions, instructions
simulation on csky.
We use USR_BKPT replace origin instruction, and the kprobe handler
prepares an excutable memory slot for out-of-line execution with a
copy of the original instruction being probed. Most of instructions
could be executed by single-step, but some instructions need origin
pc value to execute and we need software simulate these instructions.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
We should get psr value from regs->psr in stack, not directly get
it from phyiscal register then save the vector number in
tsk->trap_no.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.
The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.
The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:
force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function of __va() will return "void *", but the pgd_base is
unsigned long.
Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Cc: Arnd Bergmann <arnd@arndb.de>
This patch add support for page fault count, major fault count
and minorfault count. Without this patch page faults are not
sampled for perf event.
Performance counter stats for '/usr/lib/perf-test/callchain_test':
0 page-faults # 0.000 K/sec
Signed-off-by: Mao Han <han_mao@c-sky.com>
Signed-off-by: Guo Ren <ren_guo@c-sky.com>
The name of phys_offset is so common for global export and it may
conflict with some local name. So change phys_offset to va_pa_offset
which also used by riscv.
Also use __pa() and __va() instead of using phys_offset directly.
Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cleanup struct cpuinfo_csky and struct thread_struct, remove all esp0
related code. We could get pt_regs from sp and backtrace could use fp
in switch_stack.
Signed-off-by: Guo Ren <ren_guo@c-sky.com>