Files
linux_media/include/linux/proc_fs.h
Christian Brauner 303cc571d1 nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to
namespaces. This patchset has existed for about a year already but we've
wanted to wait to see how the general api would be received and adopted.
Now that more and more programs in userspace have started using pidfds
for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces
of another process, i.e. they can be passed as the first argument to the
setns() syscall. When only a single namespace type is specified the
semantics are equivalent to passing an nsfd. That means
setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
when a pidfd is passed, multiple namespace flags can be specified in the
second setns() argument and setns() will attach the caller to all the
specified namespaces all at once or to none of them. Specifying 0 is not
valid together with a pidfd.

Here are just two obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various use-cases
where callers setns to a subset of namespaces to retain privilege, perform
an action and then re-attach another subset of namespaces.

If the need arises, as Eric suggested, we can extend this patchset to
assume even more context than just attaching all namespaces. His suggestion
specifically was about assuming the process' root directory when
setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
keep it flexible in terms of supporting subsets of namespaces but let's
wait until we have users asking for even more context to be assumed. At
that point we can add an extension.

The obvious example where this is useful is a standard container
manager interacting with a running container: pushing and pulling files
or directories, injecting mounts, attaching/execing any kind of process,
managing network devices all these operations require attaching to all
or at least multiple namespaces at the same time. Given that nowadays
most containers are spawned with all namespaces enabled we're currently
looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
nsfds, another 7 to actually perform the namespace switch. With time
namespaces we're looking at about 16 syscalls.
(We could amortize the first 7 or 8 syscalls for opening the nsfds by
 stashing them in each container's monitor process but that would mean
 we need to send around those file descriptors through unix sockets
 everytime we want to interact with the container or keep on-disk
 state. Even in scenarios where a caller wants to join a particular
 namespace in a particular order callers still profit from batching
 other namespaces. That mostly applies to the user namespace but
 all container runtimes I found join the user namespace first no matter
 if it privileges or deprivileges the container similar to how unshare
 behaves.)
With pidfds this becomes a single syscall no matter how many namespaces
are supposed to be attached to.

A decently designed, large-scale container manager usually isn't the
parent of any of the containers it spawns so the containers don't die
when it crashes or needs to update or reinitialize. This means that
for the manager to interact with containers through pids is inherently
racy especially on systems where the maximum pid number is not
significicantly bumped. This is even more problematic since we often spawn
and manage thousands or ten-thousands of containers. Interacting with a
container through a pid thus can become risky quite quickly. Especially
since we allow for an administrator to enable advanced features such as
syscall interception where we're performing syscalls in lieu of the
container. In all of those cases we use pidfds if they are available and
we pass them around as stable references. Using them to setns() to the
target process' namespaces is as reliable as using nsfds. Either the
target process is already dead and we get ESRCH or we manage to attach
to its namespaces but we can't accidently attach to another process'
namespaces. So pidfds lend themselves to be used with this api.
The other main advantage is that with this change the pidfd becomes the
only relevant token for most container interactions and it's the only
token we need to create and send around.

Apart from significiantly reducing the number of syscalls from double
digit to single digit which is a decent reason post-spectre/meltdown
this also allows to switch to a set of namespaces atomically, i.e.
either attaching to all the specified namespaces succeeds or we fail. If
we fail we haven't changed a single namespace. There are currently three
namespaces that can fail (other than for ENOMEM which really is not
very interesting since we then have other problems anyway) for
non-trivial reasons, user, mount, and pid namespaces. We can fail to
attach to a pid namespace if it is not our current active pid namespace
or a descendant of it. We can fail to attach to a user namespace because
we are multi-threaded or because our current mount namespace shares
filesystem state with other tasks, or because we're trying to setns()
to the same user namespace, i.e. the target task has the same user
namespace as we do. We can fail to attach to a mount namespace because
it shares filesystem state with other tasks or because we fail to lookup
the new root for the new mount namespace. In most non-pathological
scenarios these issues can be somewhat mitigated. But there are cases where
we're half-attached to some namespace and failing to attach to another one.
I've talked about some of these problem during the hallway track (something
only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
in 2018(?). Even if all these issues could be avoided with super careful
userspace coding it would be nicer to have this done in-kernel. Pidfds seem
to lend themselves nicely for this.

The other neat thing about this is that setns() becomes an actual
counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
2020-05-13 11:41:22 +02:00

185 lines
7.2 KiB
C

/* SPDX-License-Identifier: GPL-2.0 */
/*
* The proc filesystem constants/structures
*/
#ifndef _LINUX_PROC_FS_H
#define _LINUX_PROC_FS_H
#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/fs.h>
struct proc_dir_entry;
struct seq_file;
struct seq_operations;
enum {
/*
* All /proc entries using this ->proc_ops instance are never removed.
*
* If in doubt, ignore this flag.
*/
#ifdef MODULE
PROC_ENTRY_PERMANENT = 0U,
#else
PROC_ENTRY_PERMANENT = 1U << 0,
#endif
};
struct proc_ops {
unsigned int proc_flags;
int (*proc_open)(struct inode *, struct file *);
ssize_t (*proc_read)(struct file *, char __user *, size_t, loff_t *);
ssize_t (*proc_write)(struct file *, const char __user *, size_t, loff_t *);
loff_t (*proc_lseek)(struct file *, loff_t, int);
int (*proc_release)(struct inode *, struct file *);
__poll_t (*proc_poll)(struct file *, struct poll_table_struct *);
long (*proc_ioctl)(struct file *, unsigned int, unsigned long);
#ifdef CONFIG_COMPAT
long (*proc_compat_ioctl)(struct file *, unsigned int, unsigned long);
#endif
int (*proc_mmap)(struct file *, struct vm_area_struct *);
unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
} __randomize_layout;
#ifdef CONFIG_PROC_FS
typedef int (*proc_write_t)(struct file *, char *, size_t);
extern void proc_root_init(void);
extern void proc_flush_pid(struct pid *);
extern struct proc_dir_entry *proc_symlink(const char *,
struct proc_dir_entry *, const char *);
extern struct proc_dir_entry *proc_mkdir(const char *, struct proc_dir_entry *);
extern struct proc_dir_entry *proc_mkdir_data(const char *, umode_t,
struct proc_dir_entry *, void *);
extern struct proc_dir_entry *proc_mkdir_mode(const char *, umode_t,
struct proc_dir_entry *);
struct proc_dir_entry *proc_create_mount_point(const char *name);
struct proc_dir_entry *proc_create_seq_private(const char *name, umode_t mode,
struct proc_dir_entry *parent, const struct seq_operations *ops,
unsigned int state_size, void *data);
#define proc_create_seq_data(name, mode, parent, ops, data) \
proc_create_seq_private(name, mode, parent, ops, 0, data)
#define proc_create_seq(name, mode, parent, ops) \
proc_create_seq_private(name, mode, parent, ops, 0, NULL)
struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
struct proc_dir_entry *parent,
int (*show)(struct seq_file *, void *), void *data);
#define proc_create_single(name, mode, parent, show) \
proc_create_single_data(name, mode, parent, show, NULL)
extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
struct proc_dir_entry *,
const struct proc_ops *,
void *);
struct proc_dir_entry *proc_create(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct proc_ops *proc_ops);
extern void proc_set_size(struct proc_dir_entry *, loff_t);
extern void proc_set_user(struct proc_dir_entry *, kuid_t, kgid_t);
extern void *PDE_DATA(const struct inode *);
extern void *proc_get_parent_data(const struct inode *);
extern void proc_remove(struct proc_dir_entry *);
extern void remove_proc_entry(const char *, struct proc_dir_entry *);
extern int remove_proc_subtree(const char *, struct proc_dir_entry *);
struct proc_dir_entry *proc_create_net_data(const char *name, umode_t mode,
struct proc_dir_entry *parent, const struct seq_operations *ops,
unsigned int state_size, void *data);
#define proc_create_net(name, mode, parent, ops, state_size) \
proc_create_net_data(name, mode, parent, ops, state_size, NULL)
struct proc_dir_entry *proc_create_net_single(const char *name, umode_t mode,
struct proc_dir_entry *parent,
int (*show)(struct seq_file *, void *), void *data);
struct proc_dir_entry *proc_create_net_data_write(const char *name, umode_t mode,
struct proc_dir_entry *parent,
const struct seq_operations *ops,
proc_write_t write,
unsigned int state_size, void *data);
struct proc_dir_entry *proc_create_net_single_write(const char *name, umode_t mode,
struct proc_dir_entry *parent,
int (*show)(struct seq_file *, void *),
proc_write_t write,
void *data);
extern struct pid *tgid_pidfd_to_pid(const struct file *file);
#ifdef CONFIG_PROC_PID_ARCH_STATUS
/*
* The architecture which selects CONFIG_PROC_PID_ARCH_STATUS must
* provide proc_pid_arch_status() definition.
*/
int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task);
#endif /* CONFIG_PROC_PID_ARCH_STATUS */
#else /* CONFIG_PROC_FS */
static inline void proc_root_init(void)
{
}
static inline void proc_flush_pid(struct pid *pid)
{
}
static inline struct proc_dir_entry *proc_symlink(const char *name,
struct proc_dir_entry *parent,const char *dest) { return NULL;}
static inline struct proc_dir_entry *proc_mkdir(const char *name,
struct proc_dir_entry *parent) {return NULL;}
static inline struct proc_dir_entry *proc_create_mount_point(const char *name) { return NULL; }
static inline struct proc_dir_entry *proc_mkdir_data(const char *name,
umode_t mode, struct proc_dir_entry *parent, void *data) { return NULL; }
static inline struct proc_dir_entry *proc_mkdir_mode(const char *name,
umode_t mode, struct proc_dir_entry *parent) { return NULL; }
#define proc_create_seq_private(name, mode, parent, ops, size, data) ({NULL;})
#define proc_create_seq_data(name, mode, parent, ops, data) ({NULL;})
#define proc_create_seq(name, mode, parent, ops) ({NULL;})
#define proc_create_single(name, mode, parent, show) ({NULL;})
#define proc_create_single_data(name, mode, parent, show, data) ({NULL;})
#define proc_create(name, mode, parent, proc_ops) ({NULL;})
#define proc_create_data(name, mode, parent, proc_ops, data) ({NULL;})
static inline void proc_set_size(struct proc_dir_entry *de, loff_t size) {}
static inline void proc_set_user(struct proc_dir_entry *de, kuid_t uid, kgid_t gid) {}
static inline void *PDE_DATA(const struct inode *inode) {BUG(); return NULL;}
static inline void *proc_get_parent_data(const struct inode *inode) { BUG(); return NULL; }
static inline void proc_remove(struct proc_dir_entry *de) {}
#define remove_proc_entry(name, parent) do {} while (0)
static inline int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) { return 0; }
#define proc_create_net_data(name, mode, parent, ops, state_size, data) ({NULL;})
#define proc_create_net(name, mode, parent, state_size, ops) ({NULL;})
#define proc_create_net_single(name, mode, parent, show, data) ({NULL;})
static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
{
return ERR_PTR(-EBADF);
}
#endif /* CONFIG_PROC_FS */
struct net;
static inline struct proc_dir_entry *proc_net_mkdir(
struct net *net, const char *name, struct proc_dir_entry *parent)
{
return proc_mkdir_data(name, 0, parent, net);
}
struct ns_common;
int open_related_ns(struct ns_common *ns,
struct ns_common *(*get_ns)(struct ns_common *ns));
/* get the associated pid namespace for a file in procfs */
static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
{
return inode->i_sb->s_fs_info;
}
bool proc_ns_file(const struct file *file);
#endif /* _LINUX_PROC_FS_H */