Pull RISC-V updates from Palmer Dabbelt:
- Support for runtime detection of the Svnapot extension
- Support for Zicboz when clearing pages
- We've moved to GENERIC_ENTRY
- Support for !MMU on rv32 systems
- The linear region is now mapped via huge pages
- Support for building relocatable kernels
- Support for the hwprobe interface
- Various fixes and cleanups throughout the tree
* tag 'riscv-for-linus-6.4-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (57 commits)
RISC-V: hwprobe: Explicity check for -1 in vdso init
RISC-V: hwprobe: There can only be one first
riscv: Allow to downgrade paging mode from the command line
dt-bindings: riscv: add sv57 mmu-type
RISC-V: hwprobe: Remove __init on probe_vendor_features()
riscv: Use --emit-relocs in order to move .rela.dyn in init
riscv: Check relocations at compile time
powerpc: Move script to check relocations at compile time in scripts/
riscv: Introduce CONFIG_RELOCATABLE
riscv: Move .rela.dyn outside of init to avoid empty relocations
riscv: Prepare EFI header for relocatable kernels
riscv: Unconditionnally select KASAN_VMALLOC if KASAN
riscv: Fix ptdump when KASAN is enabled
riscv: Fix EFI stub usage of KASAN instrumented strcmp function
riscv: Move DTB_EARLY_BASE_VA to the kernel address space
riscv: Rework kasan population functions
riscv: Split early and final KASAN population functions
riscv: Use PUD/P4D/PGD pages for the linear mapping
riscv: Move the linear mapping creation in its own function
riscv: Get rid of riscv_pfn_base variable
...
This patch converts riscv to use the generic entry infrastructure from
kernel/entry/*. The generic entry makes maintainers' work easier and
codes more elegant. Here are the changes:
- More clear entry.S with handle_exception and ret_from_exception
- Get rid of complex custom signal implementation
- Move syscall procedure from assembly to C, which is much more
readable.
- Connect ret_from_fork & ret_from_kernel_thread to generic entry.
- Wrap with irqentry_enter/exit and syscall_enter/exit_from_user_mode
- Use the standard preemption code instead of custom
Suggested-by: Huacai Chen <chenhuacai@kernel.org>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Tested-by: Yipeng Zou <zouyipeng@huawei.com>
Tested-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Link: https://lore.kernel.org/r/20230222033021.983168-5-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Pull VM_FAULT_RETRY fixes from Al Viro:
"Some of the page fault handlers do not deal with the following case
correctly:
- handle_mm_fault() has returned VM_FAULT_RETRY
- there is a pending fatal signal
- fault had happened in kernel mode
Correct action in such case is not "return unconditionally" - fatal
signals are handled only upon return to userland and something like
copy_to_user() would end up retrying the faulting instruction and
triggering the same fault again and again.
What we need to do in such case is to make the caller to treat that as
failed uaccess attempt - handle exception if there is an exception
handler for faulting instruction or oops if there isn't one.
Over the years some architectures had been fixed and now are handling
that case properly; some still do not. This series should fix the
remaining ones.
Status:
- m68k, riscv, hexagon, parisc: tested/acked by maintainers.
- alpha, sparc32, sparc64: tested locally - bug has been reproduced
on the unpatched kernel and verified to be fixed by this series.
- ia64, microblaze, nios2, openrisc: build, but otherwise completely
untested"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
openrisc: fix livelock in uaccess
nios2: fix livelock in uaccess
microblaze: fix livelock in uaccess
ia64: fix livelock in uaccess
sparc: fix livelock in uaccess
alpha: fix livelock in uaccess
parisc: fix livelock in uaccess
hexagon: fix livelock in uaccess
riscv: fix livelock in uaccess
m68k: fix livelock in uaccess
riscv equivalent of 26178ec11e "x86: mm: consolidate VM_FAULT_RETRY handling"
If e.g. get_user() triggers a page fault and a fatal signal is caught, we might
end up with handle_mm_fault() returning VM_FAULT_RETRY and not doing anything
to page tables. In such case we must *not* return to the faulting insn -
that would repeat the entire thing without making any progress; what we need
instead is to treat that as failed (user) memory access.
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 21855cac82 ("riscv/mm: Prevent kernel module to access user
memory without uaccess routines") added early exits/deaths for page
faults stemming from accesses to user-space without using proper
uaccess routines (where sstatus.SUM is set).
Unfortunatly, this is too strict for some BPF programs, which relies
on BPF exhandler fixups. These BPF programs loads "BTF pointers". A
BTF pointers could either be a valid kernel pointer or NULL, but not a
userspace address.
Resolve the problem by calling the fixup handler in the early exit
path.
Fixes: 21855cac82 ("riscv/mm: Prevent kernel module to access user memory without uaccess routines")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20230214162515.184827-1-bjorn@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since all processes share the kernel address space,
we only need to copy pgd in case of a vmalloc page
fault exception, the other levels of page tables are
shared, so the operation of copying pmd is unnecessary.
Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Pull RISC-V updates from Palmer Dabbelt:
- Support for the DA9063 as used on the HiFive Unmatched.
- Support for relative extables, which puts us in line with other
architectures and save some space in vmlinux.
- A handful of kexec fixes/improvements, including the ability to run
crash kernels from PCI-addressable memory on the HiFive Unmatched.
- Support for the SBI SRST extension, which allows systems that do not
have an explicit driver in Linux to reboot.
- A handful of fixes and cleanups, including to the defconfigs and
device trees.
* tag 'riscv-for-linus-5.17-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (52 commits)
RISC-V: Use SBI SRST extension when available
riscv: mm: fix wrong phys_ram_base value for RV64
RISC-V: Use common riscv_cpuid_to_hartid_mask() for both SMP=y and SMP=n
riscv: head: remove useless __PAGE_ALIGNED_BSS and .balign
riscv: errata: alternative: mark vendor_patch_func __initdata
riscv: head: make secondary_start_common() static
riscv: remove cpu_stop()
riscv: try to allocate crashkern region from 32bit addressible memory
riscv: use hart id instead of cpu id on machine_kexec
riscv: Don't use va_pa_offset on kdump
riscv: dts: sifive: fu540-c000: Fix PLIC node
riscv: dts: sifive: fu540-c000: Drop bogus soc node compatible values
riscv: dts: sifive: Group tuples in register properties
riscv: dts: sifive: Group tuples in interrupt properties
riscv: dts: microchip: mpfs: Group tuples in interrupt properties
riscv: dts: microchip: mpfs: Fix clock controller node
riscv: dts: microchip: mpfs: Fix reference clock node
riscv: dts: microchip: mpfs: Fix PLIC node
riscv: dts: microchip: mpfs: Drop empty chosen node
riscv: dts: canaan: Group tuples in interrupt properties
...
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org
https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org
https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
We used to define VMALLOC_END equal to the start of the next region
*minus one* which is inconsistent with the use of this define in the
core code (for example, see the definitions of VMALLOC_TOTAL and
is_vmalloc_addr).
And then make the definition of VMEMMAP_END consistent with VMALLOC_END
and all other regions actually.
Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add architecture specific implementation details for KFENCE and enable
KFENCE for the riscv64 architecture. In particular, this implements the
required interface in <asm/kfence.h>.
KFENCE requires that attributes for pages from its memory pool can
individually be set. Therefore, force the kfence pool to be mapped at
page granularity.
Testing this patch using the testcases in kfence_test.c and all passed.
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Pull RISC-V updates from Palmer Dabbelt:
- Support for the memtest= kernel command-line argument.
- Support for building the kernel with FORTIFY_SOURCE.
- Support for generic clockevent broadcasts.
- Support for the buildtar build target.
- Some build system cleanups to pass more LLVM-friendly arguments.
- Support for kprobes.
- A rearranged kernel memory map, the first part of supporting sv48
systems.
- Improvements to kexec, along with support for kdump and crash
kernels.
- An alternatives-based errata framework, along with support for
handling a pair of errata that manifest on some SiFive designs
(including the HiFive Unmatched).
- Support for XIP.
- A device tree for the Microchip PolarFire ICICLE SoC and associated
dev board.
... along with a bunch of cleanups. There are already a handful of fixes
on the list so there will likely be a part 2.
* tag 'riscv-for-linus-5.13-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (45 commits)
RISC-V: Always define XIP_FIXUP
riscv: Remove 32b kernel mapping from page table dump
riscv: Fix 32b kernel build with CONFIG_DEBUG_VIRTUAL=y
RISC-V: Fix error code returned by riscv_hartid_to_cpuid()
RISC-V: Enable Microchip PolarFire ICICLE SoC
RISC-V: Initial DTS for Microchip ICICLE board
dt-bindings: riscv: microchip: Add YAML documentation for the PolarFire SoC
RISC-V: Add Microchip PolarFire SoC kconfig option
RISC-V: enable XIP
RISC-V: Add crash kernel support
RISC-V: Add kdump support
RISC-V: Improve init_resources()
RISC-V: Add kexec support
RISC-V: Add EM_RISCV to kexec UAPI header
riscv: vdso: fix and clean-up Makefile
riscv/mm: Use BUG_ON instead of if condition followed by BUG.
riscv/kprobe: fix kernel panic when invoking sys_read traced by kprobe
riscv: Set ARCH_HAS_STRICT_MODULE_RWX if MMU
riscv: module: Create module allocations without exec permissions
riscv: bpf: Avoid breaking W^X
...
This is a preparatory patch for relocatable kernel and sv48 support.
The kernel used to be linked at PAGE_OFFSET address therefore we could use
the linear mapping for the kernel mapping. But the relocated kernel base
address will be different from PAGE_OFFSET and since in the linear mapping,
two different virtual addresses cannot point to the same physical address,
the kernel mapping needs to lie outside the linear mapping so that we don't
have to copy it at the same physical offset.
The kernel mapping is moved to the last 2GB of the address space, BPF
is now always after the kernel and modules use the 2GB memory range right
before the kernel, so BPF and modules regions do not overlap. KASLR
implementation will simply have to move the kernel in the last 2GB range
and just take care of leaving enough space for BPF.
In addition, by moving the kernel to the end of the address space, both
sv39 and sv48 kernels will be exactly the same without needing to be
relocated at runtime.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
[Palmer: Squash the STRICT_RWX fix, and a !MMU fix]
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch enables "kprobe & kretprobe" to work with ftrace
interface. It utilized software breakpoint as single-step
mechanism.
Some instructions which can't be single-step executed must be
simulated in kernel execution slot, such as: branch, jal, auipc,
la ...
Some instructions should be rejected for probing and we use a
blacklist to filter, such as: ecall, ebreak, ...
We use ebreak & c.ebreak to replace origin instruction and the
kprobe handler prepares an executable memory slot for out-of-line
execution with a copy of the original instruction being probed.
In execution slot we add ebreak behind original instruction to
simulate a single-setp mechanism.
The patch is based on packi's work [1] and csky's work [2].
- The kprobes_trampoline.S is all from packi's patch
- The single-step mechanism is new designed for riscv without hw
single-step trap
- The simulation codes are from csky
- Frankly, all codes refer to other archs' implementation
[1] https://lore.kernel.org/linux-riscv/20181113195804.22825-1-me@packi.ch/
[2] https://lore.kernel.org/linux-csky/20200403044150.20562-9-guoren@kernel.org/
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Co-developed-by: Patrick Stählin <me@packi.ch>
Signed-off-by: Patrick Stählin <me@packi.ch>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: Patrick Stählin <me@packi.ch>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
We found this issue in an legacy out-of-tree kernel module
which didn't properly access user space pointer by get/put_user().
Such an illegal access loops in the page fault handler.
To resolve this, let it die here.
Signed-off-by: Eric Lin <tesheng@andestech.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Like arm64, this patch adds a die_kernel_fault() helper
to ensure the same semantics for the different kernel faults.
Signed-off-by: Eric Lin <tesheng@andestech.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
If the page fault "cause" is EXC_INST_PAGE_FAULT, set the
FAULT_FLAG_INSTRUCTION flag to let handle_mm_fault() and friends know
about it. This has no functional changes because RISC-V uses the default
arch_vma_access_permitted() implementation, which always returns true.
However, dax_pmd_fault(), for example, has a tracepoint that uses
FAULT_FLAG_INSTRUCTION, so we might as well set it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
The "inline" keyword is in the wrong place in vmalloc_fault()
declaration:
>> arch/riscv/mm/fault.c:56:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
56 | static void inline vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr)
| ^~~~~~
Fix that up.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Move the access error check into a access_error() function to simplify
the control flow in do_page_fault().
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Let's handle the translation of EXC_STORE_PAGE_FAULT to FAULT_FLAG_WRITE
once before looking up the VMA. This makes it easier to extract access
error logic in the next patch.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Simplify the mm_fault_error() handling function by eliminating the
unnecessary gotos.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch moves the fault error handling to mm_fault_error() function
and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Move fault error handling after retry logic. This simplifies the code
flow and makes it easier to move fault error handling to its own
function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch moves the vmalloc fault handling in do_page_fault() to
vmalloc_fault() function and converts gotos to calls to the new
function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch moves the bad area handling in do_page_fault() to bad_area()
function and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
This patch moves the no context handling in do_page_fault() to
no_context() function and converts gotos to calls to the new function.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Let's combine the two retry logic if statements in do_page_fault() to
simplify the code.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Patch series "mm: cleanup usage of <asm/pgalloc.h>"
Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.
In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>
In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.
This patch (of 8):
In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.
As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.
The process was somewhat automated using
sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))
where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.
[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to retry once. We achieved
this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time. This was majorly used to avoid
unexpected starvation of the system by looping over forever to handle the
page fault on a single page. However that should hardly happen, and after
all for each code path to return a VM_FAULT_RETRY we'll first wait for a
condition (during which time we should possibly yield the cpu) to happen
before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
flag when we receive VM_FAULT_RETRY. It means that the page fault handler
now can retry the page fault for multiple times if necessary without the
need to generate another page fault event. Meanwhile we still keep the
FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
page fault is the first attempt or not.
Then we'll have these combinations of fault flags (only considering
ALLOW_RETRY flag and TRIED flag):
- ALLOW_RETRY and !TRIED: this means the page fault allows to
retry, and this is the first try
- ALLOW_RETRY and TRIED: this means the page fault allows to
retry, and this is not the first try
- !ALLOW_RETRY and !TRIED: this means the page fault does not allow
to retry at all
- !ALLOW_RETRY and TRIED: this is forbidden and should never be used
In existing code we have multiple places that has taken special care of
the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect
the first retry of a page fault by checking against both (fault_flags &
FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
even the 2nd try will have the ALLOW_RETRY set, then use that helper in
all existing special paths. One example is in __lock_page_or_retry(), now
we'll drop the mmap_sem only in the first attempt of page fault and we'll
keep it in follow up retries, so old locking behavior will be retained.
This will be a nice enhancement for current code [2] at the same time a
supporting material for the future userfaultfd-writeprotect work, since in
that work there will always be an explicit userfault writeprotect retry
for protected pages, and if that cannot resolve the page fault (e.g., when
userfaultfd-writeprotect is used in conjunction with swapped pages) then
we'll possibly need a 3rd retry of the page fault. It might also benefit
other potential users who will have similar requirement like userfault
write-protection.
GUP code is not touched yet and will be covered in follow up patch.
Please read the thread below for more information.
[1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many of the privileged CSRs exist in a supervisor and machine version
that are used very similarly. Provide versions of the CSR names and
fields that map to either the S-mode or M-mode variant depending on
a new CONFIG_RISCV_M_MODE kconfig symbol.
Contains contributions from Damien Le Moal <Damien.LeMoal@wdc.com>
and Paul Walmsley <paul.walmsley@sifive.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de> # for drivers/clocksource, drivers/irqchip
[paul.walmsley@sifive.com: updated to apply]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Add prototypes for assembly language functions defined in head.S,
and include these prototypes into C source files that call those
functions.
This patch resolves the following warnings from sparse:
arch/riscv/kernel/setup.c:39:10: warning: symbol 'hart_lottery' was not declared. Should it be static?
arch/riscv/kernel/setup.c:42:13: warning: symbol 'parse_dtb' was not declared. Should it be static?
arch/riscv/kernel/smpboot.c:33:6: warning: symbol '__cpu_up_stack_pointer' was not declared. Should it be static?
arch/riscv/kernel/smpboot.c:34:6: warning: symbol '__cpu_up_task_pointer' was not declared. Should it be static?
arch/riscv/mm/fault.c:25:17: warning: symbol 'do_page_fault' was not declared. Should it be static?
This change should have no functional impact.
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Pull force_sig() argument change from Eric Biederman:
"A source of error over the years has been that force_sig has taken a
task parameter when it is only safe to use force_sig with the current
task.
The force_sig function is built for delivering synchronous signals
such as SIGSEGV where the userspace application caused a synchronous
fault (such as a page fault) and the kernel responded with a signal.
Because the name force_sig does not make this clear, and because the
force_sig takes a task parameter the function force_sig has been
abused for sending other kinds of signals over the years. Slowly those
have been fixed when the oopses have been tracked down.
This set of changes fixes the remaining abusers of force_sig and
carefully rips out the task parameter from force_sig and friends
making this kind of error almost impossible in the future"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
signal: Remove the signal number and task parameters from force_sig_info
signal: Factor force_sig_info_to_task out of force_sig_info
signal: Generate the siginfo in force_sig
signal: Move the computation of force into send_signal and correct it.
signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
signal: Remove the task parameter from force_sig_fault
signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
signal: Explicitly call force_sig_fault on current
signal/unicore32: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from __do_user_fault
signal/arm: Remove tsk parameter from ptrace_break
signal/nds32: Remove tsk parameter from send_sigtrap
signal/riscv: Remove tsk parameter from do_trap
signal/sh: Remove tsk parameter from force_sig_info_fault
signal/um: Remove task parameter from send_sigtrap
signal/x86: Remove task parameter from send_sigtrap
signal: Remove task parameter from force_sig_mceerr
signal: Remove task parameter from force_sig
signal: Remove task parameter from force_sigsegv
...
Pull RISC-V fixes from Paul Walmsley:
"This contains fixes, defconfig, and DT data changes for the v5.2-rc
series.
The fixes are relatively straightforward:
- Addition of a TLB fence in the vmalloc_fault path, so the CPU
doesn't enter an infinite page fault loop
- Readdition of the pm_power_off export, so device drivers that
reassign it can now be built as modules
- A udelay() fix for RV32, fixing a miscomputation of the delay time
- Removal of deprecated smp_mb__*() barriers
This also adds initial DT data infrastructure for arch/riscv, along
with initial data for the SiFive FU540-C000 SoC and the corresponding
HiFive Unleashed board.
We also update the RV64 defconfig to include some core drivers for the
FU540 in the build"
* tag 'riscv-for-v5.2/fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: remove unused barrier defines
riscv: mm: synchronize MMU after pte change
riscv: dts: add initial board data for the SiFive HiFive Unleashed
riscv: dts: add initial support for the SiFive FU540-C000 SoC
dt-bindings: riscv: convert cpu binding to json-schema
dt-bindings: riscv: sifive: add YAML documentation for the SiFive FU540
arch: riscv: add support for building DTB files from DT source data
riscv: Fix udelay in RV32.
riscv: export pm_power_off again
RISC-V: defconfig: enable clocks, serial console
Because RISC-V compliant implementations can cache invalid entries
in TLB, an SFENCE.VMA is necessary after changes to the page table.
This patch adds an SFENCE.vma for the vmalloc_fault path.
Signed-off-by: ShihPo Hung <shihpo.hung@sifive.com>
[paul.walmsley@sifive.com: reversed tab->whitespace conversion,
wrapped comment lines]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-riscv@lists.infradead.org
Cc: stable@vger.kernel.org