All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered
@ 2024-04-29 21:36 Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map Kui-Feng Lee
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

The subsystems consuming struct_ops objects may need to detach or
unregister a struct_ops object due to errors or other reasons. It
would be useful to notify user space programs so that error recovery
or logging can be carried out.

We offer a function for the subsystems to remove an object that was
previously registered with them. The BPF struct_ops will detach the
corresponding link if one exists and send an event through
epoll. Additionally, we allow user space programs to disconnect a
struct_ops map from the attached link.

The user space programs could add an entry to an epoll file descriptor
for the link that a struct_ops object has been attached to, and
receive notifications when the object is detached from the link by
user space programs or the subsystem consuming the object.

However, bpf struct_ops maps without BPF_F_LINK are not supported
here. Although the subsystems can still unregister them through the
function provided here, user space programs don't get notified, since
such maps don't have links.

Kui-Feng Lee (6):
  bpf: add a pointer of the attached link to bpf_struct_ops_map.
  bpf: export bpf_link_inc_not_zero().
  bpf: provide a function to unregister struct_ops objects from
    consumers.
  bpf: detach a bpf_struct_ops_map from a link.
  bpf: support epoll from bpf struct_ops links.
  selftests/bpf: test detaching struct_ops links.

 include/linux/bpf.h                           |   9 +
 kernel/bpf/bpf_struct_ops.c                   | 186 +++++++++++++++++-
 kernel/bpf/syscall.c                          |  17 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  18 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   1 +
 .../bpf/prog_tests/test_struct_ops_module.c   | 104 ++++++++++
 .../selftests/bpf/progs/struct_ops_module.c   |   7 +
 7 files changed, 328 insertions(+), 14 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map.
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-05-01 17:01   ` Andrii Nakryiko
  2024-04-29 21:36 ` [PATCH bpf-next 2/6] bpf: export bpf_link_inc_not_zero() Kui-Feng Lee
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

To facilitate the upcoming unregistering of struct_ops objects from the
subsystems consuming them, a pointer to the attached link is added to
allow accessing the attached link of a bpf_struct_ops_map directly from
the map itself.

Previously, a st_map could be attached to multiple links. This patch now
enforces that at most one link can be attached to a st_map.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/bpf_struct_ops.c | 47 ++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 86c7884abaf8..072e3416c987 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -20,6 +20,8 @@ struct bpf_struct_ops_value {
 
 #define MAX_TRAMP_IMAGE_PAGES 8
 
+struct bpf_struct_ops_link;
+
 struct bpf_struct_ops_map {
 	struct bpf_map map;
 	struct rcu_head rcu;
@@ -39,6 +41,8 @@ struct bpf_struct_ops_map {
 	void *image_pages[MAX_TRAMP_IMAGE_PAGES];
 	/* The owner moduler's btf. */
 	struct btf *btf;
+	/* The link is attached by this map. */
+	struct bpf_struct_ops_link __rcu *attached;
 	/* uvalue->data stores the kernel struct
 	 * (e.g. tcp_congestion_ops) that is more useful
 	 * to userspace than the kvalue.  For example,
@@ -1048,6 +1052,22 @@ static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
 		smp_load_acquire(&st_map->kvalue.common.state) == BPF_STRUCT_OPS_STATE_READY;
 }
 
+/* Set the attached link of a map.
+ *
+ * Return the current value of the st_map->attached.
+ */
+static inline struct bpf_struct_ops_link *map_attached(struct bpf_struct_ops_map *st_map,
+						       struct bpf_struct_ops_link *st_link)
+{
+	return unrcu_pointer(cmpxchg(&st_map->attached, NULL, st_link));
+}
+
+/* Reset the attached link of a map */
+static inline void map_attached_null(struct bpf_struct_ops_map *st_map)
+{
+	rcu_assign_pointer(st_map->attached, NULL);
+}
+
 static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 {
 	struct bpf_struct_ops_link *st_link;
@@ -1061,6 +1081,7 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 		 * bpf_struct_ops_link_create() fails to register.
 		 */
 		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
+		map_attached_null(st_map);
 		bpf_map_put(&st_map->map);
 	}
 	kfree(st_link);
@@ -1125,9 +1146,21 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 		goto err_out;
 	}
 
+	if (likely(st_map != old_st_map) && map_attached(st_map, st_link)) {
+		/* The map is already in use */
+		err = -EBUSY;
+		goto err_out;
+	}
+
 	err = st_map->st_ops_desc->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
-	if (err)
+	if (err) {
+		if (st_map != old_st_map)
+			map_attached_null(st_map);
 		goto err_out;
+	}
+
+	if (likely(st_map != old_st_map))
+		map_attached_null(old_st_map);
 
 	bpf_map_inc(new_map);
 	rcu_assign_pointer(st_link->map, new_map);
@@ -1172,20 +1205,28 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	}
 	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
 
+	if (map_attached(st_map, link)) {
+		err = -EBUSY;
+		goto err_out;
+	}
+
 	err = bpf_link_prime(&link->link, &link_primer);
 	if (err)
-		goto err_out;
+		goto err_out_attached;
 
 	err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data);
 	if (err) {
 		bpf_link_cleanup(&link_primer);
+		/* The link has been free by bpf_link_cleanup() */
 		link = NULL;
-		goto err_out;
+		goto err_out_attached;
 	}
 	RCU_INIT_POINTER(link->map, map);
 
 	return bpf_link_settle(&link_primer);
 
+err_out_attached:
+	map_attached_null(st_map);
 err_out:
 	bpf_map_put(map);
 	kfree(link);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 2/6] bpf: export bpf_link_inc_not_zero().
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers Kui-Feng Lee
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Make bpf_link_inc_not_zero() available to be used by bpf_struct_ops.c
later.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h  | 1 +
 kernel/bpf/syscall.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5034c1b4ded7..8a1500764332 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2324,6 +2324,7 @@ int bpf_link_prime(struct bpf_link *link, struct bpf_link_primer *primer);
 int bpf_link_settle(struct bpf_link_primer *primer);
 void bpf_link_cleanup(struct bpf_link_primer *primer);
 void bpf_link_inc(struct bpf_link *link);
+struct bpf_link *bpf_link_inc_not_zero(struct bpf_link *link);
 void bpf_link_put(struct bpf_link *link);
 int bpf_link_new_fd(struct bpf_link *link);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d392ec83655..4a2f95c4b2ac 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5364,7 +5364,7 @@ static int link_detach(union bpf_attr *attr)
 	return ret;
 }
 
-static struct bpf_link *bpf_link_inc_not_zero(struct bpf_link *link)
+struct bpf_link *bpf_link_inc_not_zero(struct bpf_link *link)
 {
 	return atomic64_fetch_add_unless(&link->refcnt, 1, 0) ? link : ERR_PTR(-ENOENT);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 2/6] bpf: export bpf_link_inc_not_zero() Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-05-01 18:48   ` Martin KaFai Lau
  2024-04-29 21:36 ` [PATCH bpf-next 4/6] bpf: detach a bpf_struct_ops_map from a link Kui-Feng Lee
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

bpf_struct_ops_kvalue_unreg() unregisters the struct_ops map specified by
the pointer passed in. A subsystem could use this function to unregister a
struct_ops object that was previously registered to it.

In effect, bpf_struct_ops_kvalue_unreg() detaches the corresponding st_map
of an object from the link if there is one.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h         |  6 +++
 kernel/bpf/bpf_struct_ops.c | 97 ++++++++++++++++++++++++++++++++++---
 2 files changed, 97 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8a1500764332..eeeed4b1bd32 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1793,6 +1793,7 @@ static inline void bpf_module_put(const void *data, struct module *owner)
 		module_put(owner);
 }
 int bpf_struct_ops_link_create(union bpf_attr *attr);
+bool bpf_struct_ops_kvalue_unreg(void *data);
 
 #ifdef CONFIG_NET
 /* Define it here to avoid the use of forward declaration */
@@ -1843,6 +1844,11 @@ static inline void bpf_struct_ops_desc_release(struct bpf_struct_ops_desc *st_op
 {
 }
 
+static inline bool bpf_struct_ops_kvalue_unreg(void *data)
+{
+	return false;
+}
+
 #endif
 
 #if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM)
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 072e3416c987..8e79b02a1ccb 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1077,9 +1077,6 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 	st_map = (struct bpf_struct_ops_map *)
 		rcu_dereference_protected(st_link->map, true);
 	if (st_map) {
-		/* st_link->map can be NULL if
-		 * bpf_struct_ops_link_create() fails to register.
-		 */
 		st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
 		map_attached_null(st_map);
 		bpf_map_put(&st_map->map);
@@ -1087,6 +1084,83 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 	kfree(st_link);
 }
 
+/* Called from the subsystem that consume the struct_ops.
+ *
+ * The caller should protected this function by holding rcu_read_lock() to
+ * ensure "data" is valid. However, this function may unlock rcu
+ * temporarily. The caller should not rely on the preceding rcu_read_lock()
+ * after returning from this function.
+ *
+ * Return true if unreg() success. If a call fails, it means some other
+ * task has unrgistered or is unregistering the same object.
+ */
+bool bpf_struct_ops_kvalue_unreg(void *data)
+{
+	struct bpf_struct_ops_map *st_map =
+		container_of(data, struct bpf_struct_ops_map, kvalue.data);
+	enum bpf_struct_ops_state prev_state;
+	struct bpf_struct_ops_link *st_link;
+	bool ret = false;
+
+	/* The st_map and st_link should be protected by rcu_read_lock(),
+	 * or they may have been free when we try to increase their
+	 * refcount.
+	 */
+	if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
+		/* The map is already gone */
+		return false;
+
+	prev_state = cmpxchg(&st_map->kvalue.common.state,
+			     BPF_STRUCT_OPS_STATE_INUSE,
+			     BPF_STRUCT_OPS_STATE_TOBEFREE);
+	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
+		st_map->st_ops_desc->st_ops->unreg(data);
+		/* Pair with bpf_map_inc() for reg() */
+		bpf_map_put(&st_map->map);
+		/* Pair with bpf_map_inc_not_zero() above */
+		bpf_map_put(&st_map->map);
+		return true;
+	}
+	if (prev_state != BPF_STRUCT_OPS_STATE_READY)
+		goto fail;
+
+	/* With BPF_F_LINK */
+
+	st_link = rcu_dereference(st_map->attached);
+	if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
+		/* The map is on the way to unregister */
+		goto fail;
+
+	rcu_read_unlock();
+	mutex_lock(&update_mutex);
+
+	if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
+		/* The map should be unregistered already or on the way to
+		 * be unregistered.
+		 */
+		goto fail_unlock;
+
+	st_map->st_ops_desc->st_ops->unreg(data);
+
+	map_attached_null(st_map);
+	rcu_assign_pointer(st_link->map, NULL);
+	/* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
+	 * bpf_map_inc() in bpf_struct_ops_map_link_update().
+	 */
+	bpf_map_put(&st_map->map);
+
+	ret = true;
+
+fail_unlock:
+	mutex_unlock(&update_mutex);
+	rcu_read_lock();
+	bpf_link_put(&st_link->link);
+fail:
+	bpf_map_put(&st_map->map);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
+
 static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
 					    struct seq_file *seq)
 {
@@ -1096,7 +1170,8 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
 	st_link = container_of(link, struct bpf_struct_ops_link, link);
 	rcu_read_lock();
 	map = rcu_dereference(st_link->map);
-	seq_printf(seq, "map_id:\t%d\n", map->id);
+	if (map)
+		seq_printf(seq, "map_id:\t%d\n", map->id);
 	rcu_read_unlock();
 }
 
@@ -1109,7 +1184,8 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
 	st_link = container_of(link, struct bpf_struct_ops_link, link);
 	rcu_read_lock();
 	map = rcu_dereference(st_link->map);
-	info->struct_ops.map_id = map->id;
+	if (map)
+		info->struct_ops.map_id = map->id;
 	rcu_read_unlock();
 	return 0;
 }
@@ -1134,6 +1210,10 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 	mutex_lock(&update_mutex);
 
 	old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
+	if (!old_map) {
+		err = -EINVAL;
+		goto err_out;
+	}
 	if (expected_old_map && old_map != expected_old_map) {
 		err = -EPERM;
 		goto err_out;
@@ -1214,14 +1294,19 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
 	if (err)
 		goto err_out_attached;
 
+	/* Init link->map before calling reg() in case being unregistered
+	 * immediately.
+	 */
+	RCU_INIT_POINTER(link->map, map);
+
 	err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data);
 	if (err) {
+		rcu_assign_pointer(link->map, NULL);
 		bpf_link_cleanup(&link_primer);
 		/* The link has been free by bpf_link_cleanup() */
 		link = NULL;
 		goto err_out_attached;
 	}
-	RCU_INIT_POINTER(link->map, map);
 
 	return bpf_link_settle(&link_primer);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 4/6] bpf: detach a bpf_struct_ops_map from a link.
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
                   ` (2 preceding siblings ...)
  2024-04-29 21:36 ` [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links Kui-Feng Lee
  2024-04-29 21:36 ` [PATCH bpf-next 6/6] selftests/bpf: test detaching " Kui-Feng Lee
  5 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Allow user space programs to detach a struct_ops map from the attached
link.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 kernel/bpf/bpf_struct_ops.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 8e79b02a1ccb..4a8a7e5ffc56 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1252,8 +1252,36 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
 	return err;
 }
 
+static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_struct_ops_map *st_map;
+	struct bpf_map *map;
+
+	mutex_lock(&update_mutex);
+
+	st_link = container_of(link, struct bpf_struct_ops_link, link);
+	map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
+	if (!map) {
+		mutex_unlock(&update_mutex);
+		return -ENOENT;
+	}
+	st_map = container_of(map, struct bpf_struct_ops_map, map);
+
+	st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
+
+	map_attached_null(st_map);
+	rcu_assign_pointer(st_link->map, NULL);
+	bpf_map_put(map);
+
+	mutex_unlock(&update_mutex);
+
+	return 0;
+}
+
 static const struct bpf_link_ops bpf_struct_ops_map_lops = {
 	.dealloc = bpf_struct_ops_map_link_dealloc,
+	.detach = bpf_struct_ops_map_link_detach,
 	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
 	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
 	.update_map = bpf_struct_ops_map_link_update,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links.
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
                   ` (3 preceding siblings ...)
  2024-04-29 21:36 ` [PATCH bpf-next 4/6] bpf: detach a bpf_struct_ops_map from a link Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-05-01 17:03   ` Andrii Nakryiko
  2024-04-29 21:36 ` [PATCH bpf-next 6/6] selftests/bpf: test detaching " Kui-Feng Lee
  5 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
detachment.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h         |  2 ++
 kernel/bpf/bpf_struct_ops.c | 14 ++++++++++++++
 kernel/bpf/syscall.c        | 15 +++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eeeed4b1bd32..a4550b927352 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1574,6 +1574,7 @@ struct bpf_link {
 	const struct bpf_link_ops *ops;
 	struct bpf_prog *prog;
 	struct work_struct work;
+	wait_queue_head_t wait_hup;
 };
 
 struct bpf_link_ops {
@@ -1587,6 +1588,7 @@ struct bpf_link_ops {
 			      struct bpf_link_info *info);
 	int (*update_map)(struct bpf_link *link, struct bpf_map *new_map,
 			  struct bpf_map *old_map);
+	__poll_t (*poll)(struct file *file, struct poll_table_struct *pts);
 };
 
 struct bpf_tramp_link {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 4a8a7e5ffc56..f19b6a76591a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -12,6 +12,7 @@
 #include <linux/mutex.h>
 #include <linux/btf_ids.h>
 #include <linux/rcupdate_wait.h>
+#include <linux/poll.h>
 
 struct bpf_struct_ops_value {
 	struct bpf_struct_ops_common_value common;
@@ -1149,6 +1150,8 @@ bool bpf_struct_ops_kvalue_unreg(void *data)
 	 */
 	bpf_map_put(&st_map->map);
 
+	wake_up_interruptible_poll(&st_link->link.wait_hup, EPOLLHUP);
+
 	ret = true;
 
 fail_unlock:
@@ -1276,15 +1279,26 @@ static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
 
 	mutex_unlock(&update_mutex);
 
+	wake_up_interruptible_poll(&st_link->link.wait_hup, EPOLLHUP);
+
 	return 0;
 }
 
+static __poll_t bpf_struct_ops_map_link_poll(struct file *file,
+					     struct poll_table_struct *ptrs)
+{
+	struct bpf_struct_ops_link *st_link = file->private_data;
+
+	return (st_link->map) ? 0 : EPOLLHUP | EPOLLERR;
+}
+
 static const struct bpf_link_ops bpf_struct_ops_map_lops = {
 	.dealloc = bpf_struct_ops_map_link_dealloc,
 	.detach = bpf_struct_ops_map_link_detach,
 	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
 	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
 	.update_map = bpf_struct_ops_map_link_update,
+	.poll = bpf_struct_ops_map_link_poll,
 };
 
 int bpf_struct_ops_link_create(union bpf_attr *attr)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4a2f95c4b2ac..b4dbca04d4f5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2990,6 +2990,7 @@ void bpf_link_init(struct bpf_link *link, enum bpf_link_type type,
 	link->id = 0;
 	link->ops = ops;
 	link->prog = prog;
+	init_waitqueue_head(&link->wait_hup);
 }
 
 static void bpf_link_free_id(int id)
@@ -3108,6 +3109,19 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
 }
 #endif
 
+static __poll_t bpf_link_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct bpf_link *link = file->private_data;
+
+	if (link->ops->poll) {
+		poll_wait(file, &link->wait_hup, pts);
+
+		return link->ops->poll(file, pts);
+	}
+
+	return 0;
+}
+
 static const struct file_operations bpf_link_fops = {
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo	= bpf_link_show_fdinfo,
@@ -3115,6 +3129,7 @@ static const struct file_operations bpf_link_fops = {
 	.release	= bpf_link_release,
 	.read		= bpf_dummy_read,
 	.write		= bpf_dummy_write,
+	.poll		= bpf_link_poll,
 };
 
 static int bpf_link_alloc_id(struct bpf_link *link)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregistered Kui-Feng Lee
                   ` (4 preceding siblings ...)
  2024-04-29 21:36 ` [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links Kui-Feng Lee
@ 2024-04-29 21:36 ` Kui-Feng Lee
  2024-05-01 17:05   ` Andrii Nakryiko
  2024-05-02 18:15   ` Martin KaFai Lau
  5 siblings, 2 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-04-29 21:36 UTC (permalink / raw
  To: bpf, ast, martin.lau, song, kernel-team, andrii
  Cc: sinquersw, kuifeng, Kui-Feng Lee

Verify whether a user space program is informed through epoll with EPOLLHUP
when a struct_ops object is detached or unregistered using the function
bpf_struct_ops_kvalue_unreg() or BPF_LINK_DETACH.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  18 ++-
 .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   1 +
 .../bpf/prog_tests/test_struct_ops_module.c   | 104 ++++++++++++++++++
 .../selftests/bpf/progs/struct_ops_module.c   |   7 ++
 4 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 39ad96a18123..e526ccfad8bf 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -539,16 +539,20 @@ static int bpf_testmod_ops_init_member(const struct btf_type *t,
 				       const struct btf_member *member,
 				       void *kdata, const void *udata)
 {
+	struct bpf_testmod_ops *ops = kdata;
+
 	if (member->offset == offsetof(struct bpf_testmod_ops, data) * 8) {
 		/* For data fields, this function has to copy it and return
 		 * 1 to indicate that the data has been handled by the
 		 * struct_ops type, or the verifier will reject the map if
 		 * the value of the data field is not zero.
 		 */
-		((struct bpf_testmod_ops *)kdata)->data = ((struct bpf_testmod_ops *)udata)->data;
-		return 1;
-	}
-	return 0;
+		ops->data = ((struct bpf_testmod_ops *)udata)->data;
+	} else if (member->offset == offsetof(struct bpf_testmod_ops, do_unreg) * 8) {
+		ops->do_unreg = ((struct bpf_testmod_ops *)udata)->do_unreg;
+	} else
+		return 0;
+	return 1;
 }
 
 static const struct btf_kfunc_id_set bpf_testmod_kfunc_set = {
@@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
 	if (ops->test_2)
 		ops->test_2(4, ops->data);
 
+	if (ops->do_unreg) {
+		rcu_read_lock();
+		bpf_struct_ops_kvalue_unreg(kdata);
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
index 23fa1872ee67..ee8d4a2cd187 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.h
@@ -43,6 +43,7 @@ struct bpf_testmod_ops {
 		int b;
 	} unsupported;
 	int data;
+	bool do_unreg;
 
 	/* The following pointers are used to test the maps having multiple
 	 * pages of trampolines.
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
index 7cf2b9ddd3e1..b4d3b29114ca 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
@@ -3,6 +3,8 @@
 #include <test_progs.h>
 #include <time.h>
 
+#include <sys/epoll.h>
+
 #include "struct_ops_module.skel.h"
 
 static void check_map_info(struct bpf_map_info *info)
@@ -160,6 +162,104 @@ static void test_struct_ops_incompatible(void)
 	struct_ops_module__destroy(skel);
 }
 
+static void test_detach_link(void)
+{
+	struct epoll_event ev, events[2];
+	struct struct_ops_module *skel;
+	struct bpf_link *link = NULL;
+	int fd, epollfd = -1, nfds;
+	int err;
+
+	skel = struct_ops_module__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	link = bpf_map__attach_struct_ops(skel->maps.testmod_1);
+	if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+		goto cleanup;
+
+	fd = bpf_link__fd(link);
+	if (!ASSERT_GE(fd, 0, "link_fd"))
+		goto cleanup;
+
+	epollfd = epoll_create1(0);
+	if (!ASSERT_GE(epollfd, 0, "epoll_create1"))
+		goto cleanup;
+
+	ev.events = EPOLLHUP;
+	ev.data.fd = fd;
+	err = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
+	if (!ASSERT_OK(err, "epoll_ctl"))
+		goto cleanup;
+
+	err = bpf_link__detach(link);
+	if (!ASSERT_OK(err, "detach_link"))
+		goto cleanup;
+
+	nfds = epoll_wait(epollfd, events, 2, 500);
+	if (!ASSERT_EQ(nfds, 1, "epoll_wait"))
+		goto cleanup;
+
+	if (!ASSERT_EQ(events[0].data.fd, fd, "epoll_wait_fd"))
+		goto cleanup;
+	if (!ASSERT_TRUE(events[0].events & EPOLLHUP, "events[0].events"))
+		goto cleanup;
+
+cleanup:
+	close(epollfd);
+	bpf_link__destroy(link);
+	struct_ops_module__destroy(skel);
+}
+
+/* Test bpf_struct_ops_kvalue_unreg() */
+static void test_do_unreg(void)
+{
+	struct epoll_event ev, events[2];
+	struct struct_ops_module *skel;
+	struct bpf_link *link = NULL;
+	int fd, epollfd = -1, nfds;
+	int err;
+
+	skel = struct_ops_module__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	/* bpf_testmod will unregister this map immediately through the
+	 * function bpf_struct_ops_kvalue_unreg() since "do_unreg" is true.
+	 */
+	link = bpf_map__attach_struct_ops(skel->maps.testmod_do_unreg);
+	if (!ASSERT_OK_PTR(link, "attach_struct_ops"))
+		goto cleanup;
+
+	fd = bpf_link__fd(link);
+	if (!ASSERT_GE(fd, 0, "link_fd"))
+		goto cleanup;
+
+	epollfd = epoll_create1(0);
+	if (!ASSERT_GE(epollfd, 0, "epoll_create1"))
+		goto cleanup;
+
+	ev.events = EPOLLHUP;
+	ev.data.fd = fd;
+	err = epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
+	if (!ASSERT_OK(err, "epoll_ctl"))
+		goto cleanup;
+
+	nfds = epoll_wait(epollfd, events, 2, 500);
+	if (!ASSERT_EQ(nfds, 1, "epoll_wait"))
+		goto cleanup;
+
+	if (!ASSERT_EQ(events[0].data.fd, fd, "epoll_wait_fd"))
+		goto cleanup;
+	if (!ASSERT_TRUE(events[0].events & EPOLLHUP, "events[0].events"))
+		goto cleanup;
+
+cleanup:
+	close(epollfd);
+	bpf_link__destroy(link);
+	struct_ops_module__destroy(skel);
+}
+
 void serial_test_struct_ops_module(void)
 {
 	if (test__start_subtest("test_struct_ops_load"))
@@ -168,5 +268,9 @@ void serial_test_struct_ops_module(void)
 		test_struct_ops_not_zeroed();
 	if (test__start_subtest("test_struct_ops_incompatible"))
 		test_struct_ops_incompatible();
+	if (test__start_subtest("test_detach_link"))
+		test_detach_link();
+	if (test__start_subtest("test_do_unreg"))
+		test_do_unreg();
 }
 
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c
index 63b065dae002..7a697a7dd0ac 100644
--- a/tools/testing/selftests/bpf/progs/struct_ops_module.c
+++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c
@@ -81,3 +81,10 @@ struct bpf_testmod_ops___incompatible testmod_incompatible = {
 	.test_2 = (void *)test_2,
 	.data = 3,
 };
+
+SEC(".struct_ops.link")
+struct bpf_testmod_ops testmod_do_unreg = {
+	.test_1 = (void *)test_1,
+	.test_2 = (void *)test_2,
+	.do_unreg = true,
+};
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map.
  2024-04-29 21:36 ` [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map Kui-Feng Lee
@ 2024-05-01 17:01   ` Andrii Nakryiko
  2024-05-01 22:15     ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-05-01 17:01 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sinquersw,
	kuifeng

On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> To facilitate the upcoming unregistering of struct_ops objects from the systems
> consuming objects, a pointer of the attached link is added to allow for
> accessing the attached link of a bpf_struct_ops_map directly from the map
> itself.
>
> Previously, a st_map could be attached to multiple links. This patch now
> enforces only one link attached at most.

I'd like to avoid this restriction, in principle. We don't enforce
that BPF program should be attached through a single BPF link, so I
don't think we should allow that for maps. Worst case you can keep a
list of attached links.

>
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  kernel/bpf/bpf_struct_ops.c | 47 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 86c7884abaf8..072e3416c987 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -20,6 +20,8 @@ struct bpf_struct_ops_value {
>
>  #define MAX_TRAMP_IMAGE_PAGES 8
>
> +struct bpf_struct_ops_link;
> +
>  struct bpf_struct_ops_map {
>         struct bpf_map map;
>         struct rcu_head rcu;
> @@ -39,6 +41,8 @@ struct bpf_struct_ops_map {
>         void *image_pages[MAX_TRAMP_IMAGE_PAGES];
>         /* The owner module's btf. */
>         struct btf *btf;
> +       /* The link is attached by this map. */
> +       struct bpf_struct_ops_link __rcu *attached;
>         /* uvalue->data stores the kernel struct
>          * (e.g. tcp_congestion_ops) that is more useful
>          * to userspace than the kvalue.  For example,
> @@ -1048,6 +1052,22 @@ static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
>                 smp_load_acquire(&st_map->kvalue.common.state) == BPF_STRUCT_OPS_STATE_READY;
>  }
>
> +/* Set the attached link of a map.
> + *
> + * Return the current value of the st_map->attached.
> + */
> +static inline struct bpf_struct_ops_link *map_attached(struct bpf_struct_ops_map *st_map,
> +                                                      struct bpf_struct_ops_link *st_link)
> +{
> +       return unrcu_pointer(cmpxchg(&st_map->attached, NULL, st_link));
> +}
> +
> +/* Reset the attached link of a map */
> +static inline void map_attached_null(struct bpf_struct_ops_map *st_map)
> +{
> +       rcu_assign_pointer(st_map->attached, NULL);
> +}
> +
>  static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>  {
>         struct bpf_struct_ops_link *st_link;
> @@ -1061,6 +1081,7 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>                  * bpf_struct_ops_link_create() fails to register.
>                  */
>                 st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
> +               map_attached_null(st_map);
>                 bpf_map_put(&st_map->map);
>         }
>         kfree(st_link);
> @@ -1125,9 +1146,21 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
>                 goto err_out;
>         }
>
> +       if (likely(st_map != old_st_map) && map_attached(st_map, st_link)) {
> +               /* The map is already in use */
> +               err = -EBUSY;
> +               goto err_out;
> +       }
> +
>         err = st_map->st_ops_desc->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
> -       if (err)
> +       if (err) {
> +               if (st_map != old_st_map)
> +                       map_attached_null(st_map);
>                 goto err_out;
> +       }
> +
> +       if (likely(st_map != old_st_map))
> +               map_attached_null(old_st_map);
>
>         bpf_map_inc(new_map);
>         rcu_assign_pointer(st_link->map, new_map);
> @@ -1172,20 +1205,28 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>         }
>         bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
>
> +       if (map_attached(st_map, link)) {
> +               err = -EBUSY;
> +               goto err_out;
> +       }
> +
>         err = bpf_link_prime(&link->link, &link_primer);
>         if (err)
> -               goto err_out;
> +               goto err_out_attached;
>
>         err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data);
>         if (err) {
>                 bpf_link_cleanup(&link_primer);
> +               /* The link has been freed by bpf_link_cleanup() */
>                 link = NULL;
> -               goto err_out;
> +               goto err_out_attached;
>         }
>         RCU_INIT_POINTER(link->map, map);
>
>         return bpf_link_settle(&link_primer);
>
> +err_out_attached:
> +       map_attached_null(st_map);
>  err_out:
>         bpf_map_put(map);
>         kfree(link);
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links.
  2024-04-29 21:36 ` [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links Kui-Feng Lee
@ 2024-05-01 17:03   ` Andrii Nakryiko
  2024-05-01 22:16     ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-05-01 17:03 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sinquersw,
	kuifeng

On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
> detachment.
>
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  include/linux/bpf.h         |  2 ++
>  kernel/bpf/bpf_struct_ops.c | 14 ++++++++++++++
>  kernel/bpf/syscall.c        | 15 +++++++++++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index eeeed4b1bd32..a4550b927352 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1574,6 +1574,7 @@ struct bpf_link {
>         const struct bpf_link_ops *ops;
>         struct bpf_prog *prog;
>         struct work_struct work;
> +       wait_queue_head_t wait_hup;

let's keep it struct_ops-specific, there is no need to pay for this
for all existing BPF link types. We can always generalize later, if
necessary.

pw-bot: cr


>  };
>
>  struct bpf_link_ops {
> @@ -1587,6 +1588,7 @@ struct bpf_link_ops {
>                               struct bpf_link_info *info);
>         int (*update_map)(struct bpf_link *link, struct bpf_map *new_map,
>                           struct bpf_map *old_map);
> +       __poll_t (*poll)(struct file *file, struct poll_table_struct *pts);
>  };
>
>  struct bpf_tramp_link {

[...]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-04-29 21:36 ` [PATCH bpf-next 6/6] selftests/bpf: test detaching " Kui-Feng Lee
@ 2024-05-01 17:05   ` Andrii Nakryiko
  2024-05-01 22:17     ` Kui-Feng Lee
  2024-05-02 18:15   ` Martin KaFai Lau
  1 sibling, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2024-05-01 17:05 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sinquersw,
	kuifeng

On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>
> Verify whether a user space program is informed through epoll with EPOLLHUP
> when a struct_ops object is detached or unregistered using the function
> bpf_struct_ops_kvalue_unreg() or BPF_LINK_DETACH.
>
> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
> ---
>  .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  18 ++-
>  .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   1 +
>  .../bpf/prog_tests/test_struct_ops_module.c   | 104 ++++++++++++++++++
>  .../selftests/bpf/progs/struct_ops_module.c   |   7 ++
>  4 files changed, 126 insertions(+), 4 deletions(-)
>

[...]

> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c
> index 63b065dae002..7a697a7dd0ac 100644
> --- a/tools/testing/selftests/bpf/progs/struct_ops_module.c
> +++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c

this file is a bit overloaded with code for various subtests, it's
quite hard already to follow what's going on, I suggest adding a
separate .c file for this new subtest

> @@ -81,3 +81,10 @@ struct bpf_testmod_ops___incompatible testmod_incompatible = {
>         .test_2 = (void *)test_2,
>         .data = 3,
>  };
> +
> +SEC(".struct_ops.link")
> +struct bpf_testmod_ops testmod_do_unreg = {
> +       .test_1 = (void *)test_1,
> +       .test_2 = (void *)test_2,
> +       .do_unreg = true,
> +};
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-04-29 21:36 ` [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers Kui-Feng Lee
@ 2024-05-01 18:48   ` Martin KaFai Lau
  2024-05-01 22:15     ` Kui-Feng Lee
  2024-05-02 17:56     ` Martin KaFai Lau
  0 siblings, 2 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-01 18:48 UTC (permalink / raw
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sinquersw, kuifeng

On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
> +/* Called from the subsystem that consume the struct_ops.
> + *
> + * The caller should protect this function by holding rcu_read_lock() to
> + * ensure "data" is valid. However, this function may unlock rcu
> + * temporarily. The caller should not rely on the preceding rcu_read_lock()
> + * after returning from this function.

This temporarily losing rcu_read_lock protection is error prone. The caller 
should do the inc_not_zero() instead if it is needed.

I feel the approach in patch 1 and 3 is a little box-ed in by the earlier tcp-cc 
usage that tried to fit into the kernel module reg/unreg paradigm and hide as 
much bpf details as possible from tcp-cc. This is not necessarily true now for 
other subsystem which has bpf struct_ops from day one.

The epoll detach notification is link only. Can this kernel side specific unreg 
be limited to struct_ops link only? During reg, a rcu protected link could be 
passed to the subsystem. That subsystem becomes a kernel user of the bpf link 
and it can call link_detach(link) to detach. Pseudo code:

struct link __rcu *link;

rcu_read_lock();
ref_link = rcu_dereference(link)
if (ref_link)
	ref_link = bpf_link_inc_not_zero(ref_link);
rcu_read_unlock();

if (!IS_ERR_OR_NULL(ref_link)) {
	bpf_struct_ops_map_link_detach(ref_link);
	bpf_link_put(ref_link);
}

> + *
> + * Return true if unreg() succeeds. If a call fails, it means some other
> + * task has unregistered or is unregistering the same object.
> + */
> +bool bpf_struct_ops_kvalue_unreg(void *data)
> +{
> +	struct bpf_struct_ops_map *st_map =
> +		container_of(data, struct bpf_struct_ops_map, kvalue.data);
> +	enum bpf_struct_ops_state prev_state;
> +	struct bpf_struct_ops_link *st_link;
> +	bool ret = false;
> +
> +	/* The st_map and st_link should be protected by rcu_read_lock(),
> +	 * or they may have been freed when we try to increase their
> +	 * refcount.
> +	 */
> +	if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
> +		/* The map is already gone */
> +		return false;
> +
> +	prev_state = cmpxchg(&st_map->kvalue.common.state,
> +			     BPF_STRUCT_OPS_STATE_INUSE,
> +			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> +	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> +		st_map->st_ops_desc->st_ops->unreg(data);
> +		/* Pair with bpf_map_inc() for reg() */
> +		bpf_map_put(&st_map->map);
> +		/* Pair with bpf_map_inc_not_zero() above */
> +		bpf_map_put(&st_map->map);
> +		return true;
> +	}
> +	if (prev_state != BPF_STRUCT_OPS_STATE_READY)
> +		goto fail;
> +
> +	/* With BPF_F_LINK */
> +
> +	st_link = rcu_dereference(st_map->attached);
> +	if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
> +		/* The map is on the way to unregister */
> +		goto fail;
> +
> +	rcu_read_unlock();
> +	mutex_lock(&update_mutex);
> +
> +	if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
> +		/* The map should be unregistered already or on the way to
> +		 * be unregistered.
> +		 */
> +		goto fail_unlock;
> +
> +	st_map->st_ops_desc->st_ops->unreg(data);
> +
> +	map_attached_null(st_map);
> +	rcu_assign_pointer(st_link->map, NULL);
> +	/* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
> +	 * bpf_map_inc() in bpf_struct_ops_map_link_update().
> +	 */
> +	bpf_map_put(&st_map->map);
> +
> +	ret = true;
> +
> +fail_unlock:
> +	mutex_unlock(&update_mutex);
> +	rcu_read_lock();
> +	bpf_link_put(&st_link->link);
> +fail:
> +	bpf_map_put(&st_map->map);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-01 18:48   ` Martin KaFai Lau
@ 2024-05-01 22:15     ` Kui-Feng Lee
  2024-05-01 23:06       ` Martin KaFai Lau
  2024-05-02 17:56     ` Martin KaFai Lau
  1 sibling, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-01 22:15 UTC (permalink / raw
  To: Martin KaFai Lau, Kui-Feng Lee
  Cc: bpf, ast, song, kernel-team, andrii, kuifeng



On 5/1/24 11:48, Martin KaFai Lau wrote:
> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>> +/* Called from the subsystem that consume the struct_ops.
>> + *
>> + * The caller should protect this function by holding 
>> rcu_read_lock() to
>> + * ensure "data" is valid. However, this function may unlock rcu
>> + * temporarily. The caller should not rely on the preceding 
>> rcu_read_lock()
>> + * after returning from this function.
> 
> This temporarily losing rcu_read_lock protection is error prone. The 
> caller should do the inc_not_zero() instead if it is needed.
> 
> I feel the approach in patch 1 and 3 is a little box-ed in by the 
> earlier tcp-cc usage that tried to fit into the kernel module reg/unreg 
> paradigm and hide as much bpf details as possible from tcp-cc. This is 
> not necessarily true now for other subsystem which has bpf struct_ops 
> from day one.
> 
> The epoll detach notification is link only. Can this kernel side 
> specific unreg be limited to struct_ops link only? During reg, a rcu 
> protected link could be passed to the subsystem. That subsystem becomes 
> a kernel user of the bpf link and it can call link_detach(link) to 
> detach. Pseudo code:
> 
> struct link __rcu *link;
> 
> rcu_read_lock();
> ref_link = rcu_dereference(link)
> if (ref_link)
>      ref_link = bpf_link_inc_not_zero(ref_link);
> rcu_read_unlock();
> 
> if (!IS_ERR_OR_NULL(ref_link)) {
>      bpf_struct_ops_map_link_detach(ref_link);
>      bpf_link_put(ref_link);
> }


Since not every struct_ops map has a link, we need a callback in addition
to ops->reg to register links with subsystems. If the callback is
ops->reg_link, struct_ops will call ops->reg_link if a subsystem provide
it and the map is registered through a link, or it should call ops->reg.

> 
>> + *
>> + * Return true if unreg() succeeds. If a call fails, it means some other
>> + * task has unregistered or is unregistering the same object.
>> + */
>> +bool bpf_struct_ops_kvalue_unreg(void *data)
>> +{
>> +    struct bpf_struct_ops_map *st_map =
>> +        container_of(data, struct bpf_struct_ops_map, kvalue.data);
>> +    enum bpf_struct_ops_state prev_state;
>> +    struct bpf_struct_ops_link *st_link;
>> +    bool ret = false;
>> +
>> +    /* The st_map and st_link should be protected by rcu_read_lock(),
>> +     * or they may have been freed when we try to increase their
>> +     * refcount.
>> +     */
>> +    if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
>> +        /* The map is already gone */
>> +        return false;
>> +
>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>> +        st_map->st_ops_desc->st_ops->unreg(data);
>> +        /* Pair with bpf_map_inc() for reg() */
>> +        bpf_map_put(&st_map->map);
>> +        /* Pair with bpf_map_inc_not_zero() above */
>> +        bpf_map_put(&st_map->map);
>> +        return true;
>> +    }
>> +    if (prev_state != BPF_STRUCT_OPS_STATE_READY)
>> +        goto fail;
>> +
>> +    /* With BPF_F_LINK */
>> +
>> +    st_link = rcu_dereference(st_map->attached);
>> +    if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
>> +        /* The map is on the way to unregister */
>> +        goto fail;
>> +
>> +    rcu_read_unlock();
>> +    mutex_lock(&update_mutex);
>> +
>> +    if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
>> +        /* The map should be unregistered already or on the way to
>> +         * be unregistered.
>> +         */
>> +        goto fail_unlock;
>> +
>> +    st_map->st_ops_desc->st_ops->unreg(data);
>> +
>> +    map_attached_null(st_map);
>> +    rcu_assign_pointer(st_link->map, NULL);
>> +    /* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
>> +     * bpf_map_inc() in bpf_struct_ops_map_link_update().
>> +     */
>> +    bpf_map_put(&st_map->map);
>> +
>> +    ret = true;
>> +
>> +fail_unlock:
>> +    mutex_unlock(&update_mutex);
>> +    rcu_read_lock();
>> +    bpf_link_put(&st_link->link);
>> +fail:
>> +    bpf_map_put(&st_map->map);
>> +    return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map.
  2024-05-01 17:01   ` Andrii Nakryiko
@ 2024-05-01 22:15     ` Kui-Feng Lee
  0 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-01 22:15 UTC (permalink / raw
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, kuifeng



On 5/1/24 10:01, Andrii Nakryiko wrote:
> On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> To facilitate the upcoming unregistering of struct_ops objects from the systems
>> consuming objects, a pointer of the attached link is added to allow for
>> accessing the attached link of a bpf_struct_ops_map directly from the map
>> itself.
>>
>> Previously, a st_map could be attached to multiple links. This patch now
>> enforces only one link attached at most.
> 
> I'd like to avoid this restriction, in principle. We don't enforce
> that BPF program should be attached through a single BPF link, so I
> don't think we should allow that for maps. Worst case you can keep a
> list of attached links.

Agree!

> 
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   kernel/bpf/bpf_struct_ops.c | 47 ++++++++++++++++++++++++++++++++++---
>>   1 file changed, 44 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 86c7884abaf8..072e3416c987 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -20,6 +20,8 @@ struct bpf_struct_ops_value {
>>
>>   #define MAX_TRAMP_IMAGE_PAGES 8
>>
>> +struct bpf_struct_ops_link;
>> +
>>   struct bpf_struct_ops_map {
>>          struct bpf_map map;
>>          struct rcu_head rcu;
>> @@ -39,6 +41,8 @@ struct bpf_struct_ops_map {
>>          void *image_pages[MAX_TRAMP_IMAGE_PAGES];
>>          /* The owner module's btf. */
>>          struct btf *btf;
>> +       /* The link is attached by this map. */
>> +       struct bpf_struct_ops_link __rcu *attached;
>>          /* uvalue->data stores the kernel struct
>>           * (e.g. tcp_congestion_ops) that is more useful
>>           * to userspace than the kvalue.  For example,
>> @@ -1048,6 +1052,22 @@ static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
>>                  smp_load_acquire(&st_map->kvalue.common.state) == BPF_STRUCT_OPS_STATE_READY;
>>   }
>>
>> +/* Set the attached link of a map.
>> + *
>> + * Return the current value of the st_map->attached.
>> + */
>> +static inline struct bpf_struct_ops_link *map_attached(struct bpf_struct_ops_map *st_map,
>> +                                                      struct bpf_struct_ops_link *st_link)
>> +{
>> +       return unrcu_pointer(cmpxchg(&st_map->attached, NULL, st_link));
>> +}
>> +
>> +/* Reset the attached link of a map */
>> +static inline void map_attached_null(struct bpf_struct_ops_map *st_map)
>> +{
>> +       rcu_assign_pointer(st_map->attached, NULL);
>> +}
>> +
>>   static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>>   {
>>          struct bpf_struct_ops_link *st_link;
>> @@ -1061,6 +1081,7 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
>>                   * bpf_struct_ops_link_create() fails to register.
>>                   */
>>                  st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data);
>> +               map_attached_null(st_map);
>>                  bpf_map_put(&st_map->map);
>>          }
>>          kfree(st_link);
>> @@ -1125,9 +1146,21 @@ static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map
>>                  goto err_out;
>>          }
>>
>> +       if (likely(st_map != old_st_map) && map_attached(st_map, st_link)) {
>> +               /* The map is already in use */
>> +               err = -EBUSY;
>> +               goto err_out;
>> +       }
>> +
>>          err = st_map->st_ops_desc->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
>> -       if (err)
>> +       if (err) {
>> +               if (st_map != old_st_map)
>> +                       map_attached_null(st_map);
>>                  goto err_out;
>> +       }
>> +
>> +       if (likely(st_map != old_st_map))
>> +               map_attached_null(old_st_map);
>>
>>          bpf_map_inc(new_map);
>>          rcu_assign_pointer(st_link->map, new_map);
>> @@ -1172,20 +1205,28 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>>          }
>>          bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
>>
>> +       if (map_attached(st_map, link)) {
>> +               err = -EBUSY;
>> +               goto err_out;
>> +       }
>> +
>>          err = bpf_link_prime(&link->link, &link_primer);
>>          if (err)
>> -               goto err_out;
>> +               goto err_out_attached;
>>
>>          err = st_map->st_ops_desc->st_ops->reg(st_map->kvalue.data);
>>          if (err) {
>>                  bpf_link_cleanup(&link_primer);
>> +               /* The link has been freed by bpf_link_cleanup() */
>>                  link = NULL;
>> -               goto err_out;
>> +               goto err_out_attached;
>>          }
>>          RCU_INIT_POINTER(link->map, map);
>>
>>          return bpf_link_settle(&link_primer);
>>
>> +err_out_attached:
>> +       map_attached_null(st_map);
>>   err_out:
>>          bpf_map_put(map);
>>          kfree(link);
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links.
  2024-05-01 17:03   ` Andrii Nakryiko
@ 2024-05-01 22:16     ` Kui-Feng Lee
  0 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-01 22:16 UTC (permalink / raw
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, kuifeng



On 5/1/24 10:03, Andrii Nakryiko wrote:
> On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> Add epoll support to bpf struct_ops links to trigger EPOLLHUP event upon
>> detachment.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   include/linux/bpf.h         |  2 ++
>>   kernel/bpf/bpf_struct_ops.c | 14 ++++++++++++++
>>   kernel/bpf/syscall.c        | 15 +++++++++++++++
>>   3 files changed, 31 insertions(+)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index eeeed4b1bd32..a4550b927352 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1574,6 +1574,7 @@ struct bpf_link {
>>          const struct bpf_link_ops *ops;
>>          struct bpf_prog *prog;
>>          struct work_struct work;
>> +       wait_queue_head_t wait_hup;
> 
> let's keep it struct_ops-specific, there is no need to pay for this
> for all existing BPF link types. We can always generalize later, if
> necessary.
> 
sure!

> pw-bot: cr
> 
> 
>>   };
>>
>>   struct bpf_link_ops {
>> @@ -1587,6 +1588,7 @@ struct bpf_link_ops {
>>                                struct bpf_link_info *info);
>>          int (*update_map)(struct bpf_link *link, struct bpf_map *new_map,
>>                            struct bpf_map *old_map);
>> +       __poll_t (*poll)(struct file *file, struct poll_table_struct *pts);
>>   };
>>
>>   struct bpf_tramp_link {
> 
> [...]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-05-01 17:05   ` Andrii Nakryiko
@ 2024-05-01 22:17     ` Kui-Feng Lee
  0 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-01 22:17 UTC (permalink / raw
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, kuifeng



On 5/1/24 10:05, Andrii Nakryiko wrote:
> On Mon, Apr 29, 2024 at 2:36 PM Kui-Feng Lee <thinker.li@gmail.com> wrote:
>>
>> Verify whether a user space program is informed through epoll with EPOLLHUP
>> when a struct_ops object is detached or unregistered using the function
>> bpf_struct_ops_kvalue_unreg() or BPF_LINK_DETACH.
>>
>> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
>> ---
>>   .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  18 ++-
>>   .../selftests/bpf/bpf_testmod/bpf_testmod.h   |   1 +
>>   .../bpf/prog_tests/test_struct_ops_module.c   | 104 ++++++++++++++++++
>>   .../selftests/bpf/progs/struct_ops_module.c   |   7 ++
>>   4 files changed, 126 insertions(+), 4 deletions(-)
>>
> 
> [...]
> 
>> diff --git a/tools/testing/selftests/bpf/progs/struct_ops_module.c b/tools/testing/selftests/bpf/progs/struct_ops_module.c
>> index 63b065dae002..7a697a7dd0ac 100644
>> --- a/tools/testing/selftests/bpf/progs/struct_ops_module.c
>> +++ b/tools/testing/selftests/bpf/progs/struct_ops_module.c
> 
> this file is a bit overloaded with code for various subtests, it's
> quite hard already to follow what's going on, I suggest adding a
> separate .c file for this new subtest

Ok!

> 
>> @@ -81,3 +81,10 @@ struct bpf_testmod_ops___incompatible testmod_incompatible = {
>>          .test_2 = (void *)test_2,
>>          .data = 3,
>>   };
>> +
>> +SEC(".struct_ops.link")
>> +struct bpf_testmod_ops testmod_do_unreg = {
>> +       .test_1 = (void *)test_1,
>> +       .test_2 = (void *)test_2,
>> +       .do_unreg = true,
>> +};
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-01 22:15     ` Kui-Feng Lee
@ 2024-05-01 23:06       ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-01 23:06 UTC (permalink / raw
  To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, kuifeng

On 5/1/24 3:15 PM, Kui-Feng Lee wrote:
> 
> 
> On 5/1/24 11:48, Martin KaFai Lau wrote:
>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>> +/* Called from the subsystem that consume the struct_ops.
>>> + *
>>> + * The caller should protect this function by holding rcu_read_lock() to
>>> + * ensure "data" is valid. However, this function may unlock rcu
>>> + * temporarily. The caller should not rely on the preceding rcu_read_lock()
>>> + * after returning from this function.
>>
>> This temporarily losing rcu_read_lock protection is error prone. The caller 
>> should do the inc_not_zero() instead if it is needed.
>>
>> I feel the approach in patch 1 and 3 is a little box-ed in by the earlier 
>> tcp-cc usage that tried to fit into the kernel module reg/unreg paradigm and 
>> hide as much bpf details as possible from tcp-cc. This is not necessarily true 
>> now for other subsystem which has bpf struct_ops from day one.
>>
>> The epoll detach notification is link only. Can this kernel side specific 
>> unreg be limited to struct_ops link only? During reg, a rcu protected link 
>> could be passed to the subsystem. That subsystem becomes a kernel user of the 
>> bpf link and it can call link_detach(link) to detach. Pseudo code:
>>
>> struct link __rcu *link;
>>
>> rcu_read_lock();
>> ref_link = rcu_dereference(link)
>> if (ref_link)
>>      ref_link = bpf_link_inc_not_zero(ref_link);
>> rcu_read_unlock();
>>
>> if (!IS_ERR_OR_NULL(ref_link)) {
>>      bpf_struct_ops_map_link_detach(ref_link);
>>      bpf_link_put(ref_link);
>> }
> 
> 
> Since not every struct_ops map has a link, we need a callback in additional
> to ops->reg to register links with subsystems. If the callback is
> ops->reg_link, struct_ops will call ops->reg_link if a subsystem provide
> it and the map is registered through a link, or it should call ops->reg.

I would just add a link pointer arg to the existing reg(). The same probably 
needs to be done for unreg(). Pass a NULL as the link if it does not have one 
during reg(). If the subsystem chooses to enforce link only struct_ops, it can 
reject if link is not provided during reg().



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-01 18:48   ` Martin KaFai Lau
  2024-05-01 22:15     ` Kui-Feng Lee
@ 2024-05-02 17:56     ` Martin KaFai Lau
  2024-05-02 18:29       ` Martin KaFai Lau
  2024-05-03  0:41       ` Kui-Feng Lee
  1 sibling, 2 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-02 17:56 UTC (permalink / raw
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sinquersw, kuifeng

On 5/1/24 11:48 AM, Martin KaFai Lau wrote:
> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>> +/* Called from the subsystem that consume the struct_ops.
>> + *
>> + * The caller should protect this function by holding rcu_read_lock() to
>> + * ensure "data" is valid. However, this function may unlock rcu
>> + * temporarily. The caller should not rely on the preceding rcu_read_lock()
>> + * after returning from this function.
> 
> This temporarily losing rcu_read_lock protection is error prone. The caller 
> should do the inc_not_zero() instead if it is needed.
> 
> I feel the approach in patch 1 and 3 is a little box-ed in by the earlier tcp-cc 
> usage that tried to fit into the kernel module reg/unreg paradigm and hide as 
> much bpf details as possible from tcp-cc. This is not necessarily true now for 
> other subsystem which has bpf struct_ops from day one.
> 
> The epoll detach notification is link only. Can this kernel side specific unreg 
> be limited to struct_ops link only? During reg, a rcu protected link could be 
> passed to the subsystem. That subsystem becomes a kernel user of the bpf link 
> and it can call link_detach(link) to detach. Pseudo code:
> 
> struct link __rcu *link;
> 
> rcu_read_lock();
> ref_link = rcu_dereference(link)
> if (ref_link)
>      ref_link = bpf_link_inc_not_zero(ref_link);
> rcu_read_unlock();
> 
> if (!IS_ERR_OR_NULL(ref_link)) {
>      bpf_struct_ops_map_link_detach(ref_link);
>      bpf_link_put(ref_link);
> }

[ ... ]

> 
>> + *
>> + * Return true if unreg() succeeds. If a call fails, it means some other
>> + * task has unregistered or is unregistering the same object.
>> + */
>> +bool bpf_struct_ops_kvalue_unreg(void *data)
>> +{
>> +    struct bpf_struct_ops_map *st_map =
>> +        container_of(data, struct bpf_struct_ops_map, kvalue.data);
>> +    enum bpf_struct_ops_state prev_state;
>> +    struct bpf_struct_ops_link *st_link;
>> +    bool ret = false;
>> +
>> +    /* The st_map and st_link should be protected by rcu_read_lock(),
>> +     * or they may have been freed when we try to increase their
>> +     * refcount.
>> +     */
>> +    if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
>> +        /* The map is already gone */
>> +        return false;
>> +
>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>> +        st_map->st_ops_desc->st_ops->unreg(data);
>> +        /* Pair with bpf_map_inc() for reg() */
>> +        bpf_map_put(&st_map->map);
>> +        /* Pair with bpf_map_inc_not_zero() above */
>> +        bpf_map_put(&st_map->map);
>> +        return true;
>> +    }
>> +    if (prev_state != BPF_STRUCT_OPS_STATE_READY)
>> +        goto fail;
>> +
>> +    /* With BPF_F_LINK */
>> +
>> +    st_link = rcu_dereference(st_map->attached);

 From looking at the change in bpf_struct_ops_map_link_dealloc() in patch 1 
again, I am not sure st_link is rcu gp protected either. 
bpf_struct_ops_map_link_dealloc() is still just kfree(st_link).

I also don't think it needs to complicate it further by making st_link go 
through rcu only for this use case. The subsystem must have its own lock to 
protect parallel reg() and unreg(). tcp-cc has tcp_cong_list_lock. From looking 
at scx, scx has scx_ops_enable_mutex. When it tries to do unreg itself by 
calling bpf_struct_ops_map_link_detach(link), it needs to acquire its own lock 
to ensure a parallel unreg() has not happened. Pseudo code:

struct bpf_link *link;

static void scx_ops_detach_by_kernel(void)
{
	struct bpf_link *ref_link;

	mutex_lock(&scx_ops_enable_mutex);
	ref_link = link;
	if (ref_link)
		ref_link = bpf_link_inc_not_zero(ref_link);
	mutex_unlock(&scx_ops_enable_mutex);

	if (!IS_ERR_OR_NULL(ref_link)) {
		ref_link->ops->detach(ref_link);
		bpf_link_put(ref_link);
	}
}

>> +    if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
>> +        /* The map is on the way to unregister */
>> +        goto fail;
>> +
>> +    rcu_read_unlock();
>> +    mutex_lock(&update_mutex);
>> +
>> +    if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
>> +        /* The map should be unregistered already or on the way to
>> +         * be unregistered.
>> +         */
>> +        goto fail_unlock;
>> +
>> +    st_map->st_ops_desc->st_ops->unreg(data);
>> +
>> +    map_attached_null(st_map);
>> +    rcu_assign_pointer(st_link->map, NULL);
>> +    /* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
>> +     * bpf_map_inc() in bpf_struct_ops_map_link_update().
>> +     */
>> +    bpf_map_put(&st_map->map);
>> +
>> +    ret = true;
>> +
>> +fail_unlock:
>> +    mutex_unlock(&update_mutex);
>> +    rcu_read_lock();
>> +    bpf_link_put(&st_link->link);
>> +fail:
>> +    bpf_map_put(&st_map->map);
>> +    return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-04-29 21:36 ` [PATCH bpf-next 6/6] selftests/bpf: test detaching " Kui-Feng Lee
  2024-05-01 17:05   ` Andrii Nakryiko
@ 2024-05-02 18:15   ` Martin KaFai Lau
  2024-05-03 18:34     ` Kui-Feng Lee
  1 sibling, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-02 18:15 UTC (permalink / raw
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sinquersw, kuifeng

On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
> @@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
>   	if (ops->test_2)
>   		ops->test_2(4, ops->data);
>   
> +	if (ops->do_unreg) {
> +		rcu_read_lock();
> +		bpf_struct_ops_kvalue_unreg(kdata);

Instead of unreg() immediately before the reg() has returned, the test should 
reflect more on how the subsystem can use it in practice. The subsystem does not 
do unreg() during reg().

It also needs to test a case when the link is created and successfully 
registered to the subsystem. The user space does BPF_LINK_DETACH first and then 
the subsystem does link->ops->detach() by itself later.

It can create a kfunc in bpf_testmod.c to trigger the subsystem to do 
link->ops->detach(). The kfunc can be called by a SEC("syscall") bpf prog which 
is run by bpf_prog_test_run_opts(). The test_progs can then decide on the timing 
when to do link->ops->detach() to test different cases.

> +		rcu_read_unlock();
> +	}
> +
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-02 17:56     ` Martin KaFai Lau
@ 2024-05-02 18:29       ` Martin KaFai Lau
  2024-05-03  0:41       ` Kui-Feng Lee
  1 sibling, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-02 18:29 UTC (permalink / raw
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sinquersw, kuifeng

On 5/2/24 10:56 AM, Martin KaFai Lau wrote:
> On 5/1/24 11:48 AM, Martin KaFai Lau wrote:
>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>> +/* Called from the subsystem that consume the struct_ops.
>>> + *
>>> + * The caller should protected this function by holding rcu_read_lock() to
>>> + * ensure "data" is valid. However, this function may unlock rcu
>>> + * temporarily. The caller should not rely on the preceding rcu_read_lock()
>>> + * after returning from this function.
>>
>> This temporarily losing rcu_read_lock protection is error prone. The caller 
>> should do the inc_not_zero() instead if it is needed.
>>
>> I feel the approach in patch 1 and 3 is a little box-ed in by the earlier 
>> tcp-cc usage that tried to fit into the kernel module reg/unreg paradigm and 
>> hide as much bpf details as possible from tcp-cc. This is not necessarily true 
>> now for other subsystem which has bpf struct_ops from day one.
>>
>> The epoll detach notification is link only. Can this kernel side specific 
>> unreg be limited to struct_ops link only? During reg, a rcu protected link 
>> could be passed to the subsystem. That subsystem becomes a kernel user of the 
>> bpf link and it can call link_detach(link) to detach. Pseudo code:
>>
>> struct link __rcu *link;
>>
>> rcu_read_lock();
>> ref_link = rcu_dereference(link)
>> if (ref_link)
>>      ref_link = bpf_link_inc_not_zero(ref_link);
>> rcu_read_unlock();
>>
>> if (!IS_ERR_OR_NULL(ref_link)) {
>>      bpf_struct_ops_map_link_detach(ref_link);
>>      bpf_link_put(ref_link);
>> }
> 
> [ ... ]
> 
>>
>>> + *
>>> + * Return true if unreg() success. If a call fails, it means some other
>>> + * task has unrgistered or is unregistering the same object.
>>> + */
>>> +bool bpf_struct_ops_kvalue_unreg(void *data)
>>> +{
>>> +    struct bpf_struct_ops_map *st_map =
>>> +        container_of(data, struct bpf_struct_ops_map, kvalue.data);
>>> +    enum bpf_struct_ops_state prev_state;
>>> +    struct bpf_struct_ops_link *st_link;
>>> +    bool ret = false;
>>> +
>>> +    /* The st_map and st_link should be protected by rcu_read_lock(),
>>> +     * or they may have been free when we try to increase their
>>> +     * refcount.
>>> +     */
>>> +    if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
>>> +        /* The map is already gone */
>>> +        return false;
>>> +
>>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
>>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>>> +        st_map->st_ops_desc->st_ops->unreg(data);
>>> +        /* Pair with bpf_map_inc() for reg() */
>>> +        bpf_map_put(&st_map->map);
>>> +        /* Pair with bpf_map_inc_not_zero() above */
>>> +        bpf_map_put(&st_map->map);
>>> +        return true;
>>> +    }
>>> +    if (prev_state != BPF_STRUCT_OPS_STATE_READY)
>>> +        goto fail;
>>> +
>>> +    /* With BPF_F_LINK */
>>> +
>>> +    st_link = rcu_dereference(st_map->attached);
> 
>  From looking at the change in bpf_struct_ops_map_link_dealloc() in patch 1 
> again, I am not sure st_link is rcu gp protected either. 
> bpf_struct_ops_map_link_dealloc() is still just kfree(st_link).
> 
> I also don't think it needs to complicate it further by making st_link go 
> through rcu only for this use case. The subsystem must have its own lock to 
> protect parallel reg() and unreg(). tcp-cc has tcp_cong_list_lock. From looking 
> at scx, scx has scx_ops_enable_mutex. When it tries to do unreg itself by 
> calling bpf_struct_ops_map_link_detach(link), it needs to acquire its own lock 
> to ensure a parallel unreg() has not happened. Pseudo code:
> 
> struct bpf_link *link;
> 
> static void scx_ops_detach_by_kernel(void)
> {
>      struct bpf_link *ref_link;
> 
>      mutex_lock(&scx_ops_enable_mutex);
>      ref_link = link;
>      if (ref_link)
>          ref_link = bpf_link_inc_not_zero(ref_link);
>      mutex_unlock(&scx_ops_enable_mutex);
> 
>      if (!IS_ERR_OR_NULL(ref_link)) {
>          ref_link->ops->detach(ref_link);
>          bpf_link_put(ref_link);
>      }
> }

and patch 1 should no longer be needed.

> 
>>> +    if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
>>> +        /* The map is on the way to unregister */
>>> +        goto fail;
>>> +
>>> +    rcu_read_unlock();
>>> +    mutex_lock(&update_mutex);
>>> +
>>> +    if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
>>> +        /* The map should be unregistered already or on the way to
>>> +         * be unregistered.
>>> +         */
>>> +        goto fail_unlock;
>>> +
>>> +    st_map->st_ops_desc->st_ops->unreg(data);
>>> +
>>> +    map_attached_null(st_map);
>>> +    rcu_assign_pointer(st_link->map, NULL);
>>> +    /* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
>>> +     * bpf_map_inc() in bpf_struct_ops_map_link_update().
>>> +     */
>>> +    bpf_map_put(&st_map->map);
>>> +
>>> +    ret = true;
>>> +
>>> +fail_unlock:
>>> +    mutex_unlock(&update_mutex);
>>> +    rcu_read_lock();
>>> +    bpf_link_put(&st_link->link);
>>> +fail:
>>> +    bpf_map_put(&st_map->map);
>>> +    return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-02 17:56     ` Martin KaFai Lau
  2024-05-02 18:29       ` Martin KaFai Lau
@ 2024-05-03  0:41       ` Kui-Feng Lee
  2024-05-03 16:19         ` Alexei Starovoitov
  2024-05-03 17:17         ` Martin KaFai Lau
  1 sibling, 2 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-03  0:41 UTC (permalink / raw
  To: Martin KaFai Lau, Kui-Feng Lee
  Cc: bpf, ast, song, kernel-team, andrii, kuifeng



On 5/2/24 10:56, Martin KaFai Lau wrote:
> On 5/1/24 11:48 AM, Martin KaFai Lau wrote:
>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>> +/* Called from the subsystem that consume the struct_ops.
>>> + *
>>> + * The caller should protected this function by holding 
>>> rcu_read_lock() to
>>> + * ensure "data" is valid. However, this function may unlock rcu
>>> + * temporarily. The caller should not rely on the preceding 
>>> rcu_read_lock()
>>> + * after returning from this function.
>>
>> This temporarily losing rcu_read_lock protection is error prone. The 
>> caller should do the inc_not_zero() instead if it is needed.
>>
>> I feel the approach in patch 1 and 3 is a little box-ed in by the 
>> earlier tcp-cc usage that tried to fit into the kernel module 
>> reg/unreg paradigm and hide as much bpf details as possible from 
>> tcp-cc. This is not necessarily true now for other subsystem which has 
>> bpf struct_ops from day one.
>>
>> The epoll detach notification is link only. Can this kernel side 
>> specific unreg be limited to struct_ops link only? During reg, a rcu 
>> protected link could be passed to the subsystem. That subsystem 
>> becomes a kernel user of the bpf link and it can call 
>> link_detach(link) to detach. Pseudo code:
>>
>> struct link __rcu *link;
>>
>> rcu_read_lock();
>> ref_link = rcu_dereference(link)
>> if (ref_link)
>>      ref_link = bpf_link_inc_not_zero(ref_link);
>> rcu_read_unlock();
>>
>> if (!IS_ERR_OR_NULL(ref_link)) {
>>      bpf_struct_ops_map_link_detach(ref_link);
>>      bpf_link_put(ref_link);
>> }
> 
> [ ... ]
> 
>>
>>> + *
>>> + * Return true if unreg() success. If a call fails, it means some other
>>> + * task has unrgistered or is unregistering the same object.
>>> + */
>>> +bool bpf_struct_ops_kvalue_unreg(void *data)
>>> +{
>>> +    struct bpf_struct_ops_map *st_map =
>>> +        container_of(data, struct bpf_struct_ops_map, kvalue.data);
>>> +    enum bpf_struct_ops_state prev_state;
>>> +    struct bpf_struct_ops_link *st_link;
>>> +    bool ret = false;
>>> +
>>> +    /* The st_map and st_link should be protected by rcu_read_lock(),
>>> +     * or they may have been free when we try to increase their
>>> +     * refcount.
>>> +     */
>>> +    if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
>>> +        /* The map is already gone */
>>> +        return false;
>>> +
>>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
>>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>>> +        st_map->st_ops_desc->st_ops->unreg(data);
>>> +        /* Pair with bpf_map_inc() for reg() */
>>> +        bpf_map_put(&st_map->map);
>>> +        /* Pair with bpf_map_inc_not_zero() above */
>>> +        bpf_map_put(&st_map->map);
>>> +        return true;
>>> +    }
>>> +    if (prev_state != BPF_STRUCT_OPS_STATE_READY)
>>> +        goto fail;
>>> +
>>> +    /* With BPF_F_LINK */
>>> +
>>> +    st_link = rcu_dereference(st_map->attached);
> 
>  From looking at the change in bpf_struct_ops_map_link_dealloc() in 
> patch 1 again, I am not sure st_link is rcu gp protected either. 
> bpf_struct_ops_map_link_dealloc() is still just kfree(st_link).

I am not sure what you mean.
With the implementation of this version, st_link should be rcu
protected. The backward pointer, "attached", from st_map to st_link will
be reset before kfree(). So, if the caller holds rcu_read_lock(), the
st_link should be valid as long as it can be reached from a st_map.

> 
> I also don't think it needs to complicate it further by making st_link 
> go through rcu only for this use case. The subsystem must have its own 
> lock to protect parallel reg() and unreg(). tcp-cc has 
> tcp_cong_list_lock. From looking at scx, scx has scx_ops_enable_mutex. 
> When it tries to do unreg itself by calling 
> bpf_struct_ops_map_link_detach(link), it needs to acquire its own lock 
> to ensure a parallel unreg() has not happened. Pseudo code:
> 
> struct bpf_link *link;
> 
> static void scx_ops_detach_by_kernel(void)
> {
>      struct bpf_link *ref_link;
> 
>      mutex_lock(&scx_ops_enable_mutex);
>      ref_link = link;
>      if (ref_link)
>          ref_link = bpf_link_inc_not_zero(ref_link);
>      mutex_unlock(&scx_ops_enable_mutex);
> 
>      if (!IS_ERR_OR_NULL(ref_link)) {
>          ref_link->ops->detach(ref_link);
>          bpf_link_put(ref_link);
>      }
> }
> 
>>> +    if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
>>> +        /* The map is on the way to unregister */
>>> +        goto fail;
>>> +
>>> +    rcu_read_unlock();
>>> +    mutex_lock(&update_mutex);
>>> +
>>> +    if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
>>> +        /* The map should be unregistered already or on the way to
>>> +         * be unregistered.
>>> +         */
>>> +        goto fail_unlock;
>>> +
>>> +    st_map->st_ops_desc->st_ops->unreg(data);
>>> +
>>> +    map_attached_null(st_map);
>>> +    rcu_assign_pointer(st_link->map, NULL);
>>> +    /* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
>>> +     * bpf_map_inc() in bpf_struct_ops_map_link_update().
>>> +     */
>>> +    bpf_map_put(&st_map->map);
>>> +
>>> +    ret = true;
>>> +
>>> +fail_unlock:
>>> +    mutex_unlock(&update_mutex);
>>> +    rcu_read_lock();
>>> +    bpf_link_put(&st_link->link);
>>> +fail:
>>> +    bpf_map_put(&st_map->map);
>>> +    return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
>>
>>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-03  0:41       ` Kui-Feng Lee
@ 2024-05-03 16:19         ` Alexei Starovoitov
  2024-05-03 18:09           ` Kui-Feng Lee
  2024-05-03 17:17         ` Martin KaFai Lau
  1 sibling, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2024-05-03 16:19 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: Martin KaFai Lau, Kui-Feng Lee, bpf, Alexei Starovoitov, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee

On Thu, May 2, 2024 at 5:41 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
> >>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
> >>> +                 BPF_STRUCT_OPS_STATE_INUSE,
> >>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);

All the uses of cmpxchg mean that there are races.
I know there is already a case for it in the existing code
and maybe it's justified.
But I would very much prefer the whole code converted to
clean locks without cmpxchg tricks.


> >>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> >>> +        st_map->st_ops_desc->st_ops->unreg(data);
> >>> +        /* Pair with bpf_map_inc() for reg() */
> >>> +        bpf_map_put(&st_map->map);
> >>> +        /* Pair with bpf_map_inc_not_zero() above */
> >>> +        bpf_map_put(&st_map->map);
> >>> +        return true;

I haven't tried to follow the logic, but double bpf_map_put
on the same map screams that this is broken.
Do proper locks everywhere.
inc_not_zero and cmpxchg shouldn't be needed.
keep it simple.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-03  0:41       ` Kui-Feng Lee
  2024-05-03 16:19         ` Alexei Starovoitov
@ 2024-05-03 17:17         ` Martin KaFai Lau
  1 sibling, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-03 17:17 UTC (permalink / raw
  To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, kuifeng

On 5/2/24 5:41 PM, Kui-Feng Lee wrote:
> 
> 
> On 5/2/24 10:56, Martin KaFai Lau wrote:
>> On 5/1/24 11:48 AM, Martin KaFai Lau wrote:
>>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>>> +/* Called from the subsystem that consume the struct_ops.
>>>> + *
>>>> + * The caller should protected this function by holding rcu_read_lock() to
>>>> + * ensure "data" is valid. However, this function may unlock rcu
>>>> + * temporarily. The caller should not rely on the preceding rcu_read_lock()
>>>> + * after returning from this function.
>>>
>>> This temporarily losing rcu_read_lock protection is error prone. The caller 
>>> should do the inc_not_zero() instead if it is needed.
>>>
>>> I feel the approach in patch 1 and 3 is a little box-ed in by the earlier 
>>> tcp-cc usage that tried to fit into the kernel module reg/unreg paradigm and 
>>> hide as much bpf details as possible from tcp-cc. This is not necessarily 
>>> true now for other subsystem which has bpf struct_ops from day one.
>>>
>>> The epoll detach notification is link only. Can this kernel side specific 
>>> unreg be limited to struct_ops link only? During reg, a rcu protected link 
>>> could be passed to the subsystem. That subsystem becomes a kernel user of the 
>>> bpf link and it can call link_detach(link) to detach. Pseudo code:
>>>
>>> struct link __rcu *link;
>>>
>>> rcu_read_lock();
>>> ref_link = rcu_dereference(link)
>>> if (ref_link)
>>>      ref_link = bpf_link_inc_not_zero(ref_link);
>>> rcu_read_unlock();
>>>
>>> if (!IS_ERR_OR_NULL(ref_link)) {
>>>      bpf_struct_ops_map_link_detach(ref_link);
>>>      bpf_link_put(ref_link);
>>> }
>>
>> [ ... ]
>>
>>>
>>>> + *
>>>> + * Return true if unreg() success. If a call fails, it means some other
>>>> + * task has unrgistered or is unregistering the same object.
>>>> + */
>>>> +bool bpf_struct_ops_kvalue_unreg(void *data)
>>>> +{
>>>> +    struct bpf_struct_ops_map *st_map =
>>>> +        container_of(data, struct bpf_struct_ops_map, kvalue.data);
>>>> +    enum bpf_struct_ops_state prev_state;
>>>> +    struct bpf_struct_ops_link *st_link;
>>>> +    bool ret = false;
>>>> +
>>>> +    /* The st_map and st_link should be protected by rcu_read_lock(),
>>>> +     * or they may have been free when we try to increase their
>>>> +     * refcount.
>>>> +     */
>>>> +    if (IS_ERR(bpf_map_inc_not_zero(&st_map->map)))
>>>> +        /* The map is already gone */
>>>> +        return false;
>>>> +
>>>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>>>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>>>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
>>>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>>>> +        st_map->st_ops_desc->st_ops->unreg(data);
>>>> +        /* Pair with bpf_map_inc() for reg() */
>>>> +        bpf_map_put(&st_map->map);
>>>> +        /* Pair with bpf_map_inc_not_zero() above */
>>>> +        bpf_map_put(&st_map->map);
>>>> +        return true;
>>>> +    }
>>>> +    if (prev_state != BPF_STRUCT_OPS_STATE_READY)
>>>> +        goto fail;
>>>> +
>>>> +    /* With BPF_F_LINK */
>>>> +
>>>> +    st_link = rcu_dereference(st_map->attached);
>>
>>  From looking at the change in bpf_struct_ops_map_link_dealloc() in patch 1 
>> again, I am not sure st_link is rcu gp protected either. 
>> bpf_struct_ops_map_link_dealloc() is still just kfree(st_link).
> 
> I am not sure what you mean.

I meant this should be kfree_rcu(st_link, ...) instead of kfree(st_link).

The temporary loss of rcu_read_lock protection and the other complexities in 
this function are too subtle. They are hard to reason about and should be 
simplified even if a tradeoff is needed. Please consider the ideas in my earlier 
code snippet about using the subsystem's existing lock during reg/unreg/detach 
and limiting it to link only first.

> With the implementation of this version, st_link should be rcu
> protected. The backward pointer, "attached", from st_map to st_link will
> be reset before kfree(). So, if the caller hold rcu_read_lock(), a
> st_link should be valid as long as it can be reached from a st_map.
> 
>>
>> I also don't think it needs to complicate it further by making st_link go 
>> through rcu only for this use case. The subsystem must have its own lock to 
>> protect parallel reg() and unreg(). tcp-cc has tcp_cong_list_lock. From 
>> looking at scx, scx has scx_ops_enable_mutex. When it tries to do unreg itself 
>> by calling bpf_struct_ops_map_link_detach(link), it needs to acquire its own 
>> lock to ensure a parallel unreg() has not happened. Pseudo code:
>>
>> struct bpf_link *link;
>>
>> static void scx_ops_detach_by_kernel(void)
>> {
>>      struct bpf_link *ref_link;
>>
>>      mutex_lock(&scx_ops_enable_mutex);
>>      ref_link = link;
>>      if (ref_link)
>>          ref_link = bpf_link_inc_not_zero(ref_link);
>>      mutex_unlock(&scx_ops_enable_mutex);
>>
>>      if (!IS_ERR_OR_NULL(ref_link)) {
>>          ref_link->ops->detach(ref_link);
>>          bpf_link_put(ref_link);
>>      }
>> }
>>
>>>> +    if (!st_link || !bpf_link_inc_not_zero(&st_link->link))
>>>> +        /* The map is on the way to unregister */
>>>> +        goto fail;
>>>> +
>>>> +    rcu_read_unlock();
>>>> +    mutex_lock(&update_mutex);
>>>> +
>>>> +    if (rcu_dereference_protected(st_link->map, true) != &st_map->map)
>>>> +        /* The map should be unregistered already or on the way to
>>>> +         * be unregistered.
>>>> +         */
>>>> +        goto fail_unlock;
>>>> +
>>>> +    st_map->st_ops_desc->st_ops->unreg(data);
>>>> +
>>>> +    map_attached_null(st_map);
>>>> +    rcu_assign_pointer(st_link->map, NULL);
>>>> +    /* Pair with bpf_map_get() in bpf_struct_ops_link_create() or
>>>> +     * bpf_map_inc() in bpf_struct_ops_map_link_update().
>>>> +     */
>>>> +    bpf_map_put(&st_map->map);
>>>> +
>>>> +    ret = true;
>>>> +
>>>> +fail_unlock:
>>>> +    mutex_unlock(&update_mutex);
>>>> +    rcu_read_lock();
>>>> +    bpf_link_put(&st_link->link);
>>>> +fail:
>>>> +    bpf_map_put(&st_map->map);
>>>> +    return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bpf_struct_ops_kvalue_unreg);
>>>
>>>
>>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers.
  2024-05-03 16:19         ` Alexei Starovoitov
@ 2024-05-03 18:09           ` Kui-Feng Lee
  0 siblings, 0 replies; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-03 18:09 UTC (permalink / raw
  To: Alexei Starovoitov
  Cc: Martin KaFai Lau, Kui-Feng Lee, bpf, Alexei Starovoitov, Song Liu,
	Kernel Team, Andrii Nakryiko, Kui-Feng Lee



On 5/3/24 09:19, Alexei Starovoitov wrote:
> On Thu, May 2, 2024 at 5:41 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>>>> +    prev_state = cmpxchg(&st_map->kvalue.common.state,
>>>>> +                 BPF_STRUCT_OPS_STATE_INUSE,
>>>>> +                 BPF_STRUCT_OPS_STATE_TOBEFREE);
> 
> All the uses of cmpxchg mean that there are races.
> I know there is already a case for it in the existing code
> and maybe it's justified.
> But I would very much prefer the whole code converted to
> clean locks without cmpxchg tricks.
> 
> 
>>>>> +    if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
>>>>> +        st_map->st_ops_desc->st_ops->unreg(data);
>>>>> +        /* Pair with bpf_map_inc() for reg() */
>>>>> +        bpf_map_put(&st_map->map);
>>>>> +        /* Pair with bpf_map_inc_not_zero() above */
>>>>> +        bpf_map_put(&st_map->map);
>>>>> +        return true;
> 
> I haven't tried to follow the logic, but double bpf_map_put
> on the same map screams that this is broken.
> Do proper locks everywhere.
> inc_not_zero and cmpxchg shouldn't be needed.
> keep it simple.

Got it!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-05-02 18:15   ` Martin KaFai Lau
@ 2024-05-03 18:34     ` Kui-Feng Lee
  2024-05-03 19:15       ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-03 18:34 UTC (permalink / raw
  To: Martin KaFai Lau, Kui-Feng Lee
  Cc: bpf, ast, song, kernel-team, andrii, kuifeng



On 5/2/24 11:15, Martin KaFai Lau wrote:
> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>> @@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
>>       if (ops->test_2)
>>           ops->test_2(4, ops->data);
>> +    if (ops->do_unreg) {
>> +        rcu_read_lock();
>> +        bpf_struct_ops_kvalue_unreg(kdata);
> 
> Instead of unreg() immediately before the reg() has returned, the test 
> should reflect more on how the subsystem can use it in practice. The 
> subsystem does not do unreg() during reg().
> 
> It also needs to test a case when the link is created and successfully 
> registered to the subsystem. The user space does BPF_LINK_DETACH first 
> and then the subsystem does link->ops->detach() by itself later.

agree

> 
> It can create a kfunc in bpf_testmod.c to trigger the subsystem to do 
> link->ops->detach(). The kfunc can be called by a SEC("syscall") bpf 
> prog which is run by bpf_prog_test_run_opts(). The test_progs can then 
> decide on the timing when to do link->ops->detach() to test different 
> cases.

What is the purpose of this part?
If it goes through link->ops->detach(), it should work just like calling
bpf_link_detach() twice on the same link from the user space. Do you
want to make sure detaching a link twice works?

> 
>> +        rcu_read_unlock();
>> +    }
>> +
>>       return 0;
>>   }
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-05-03 18:34     ` Kui-Feng Lee
@ 2024-05-03 19:15       ` Martin KaFai Lau
  2024-05-03 21:34         ` Kui-Feng Lee
  0 siblings, 1 reply; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-03 19:15 UTC (permalink / raw
  To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, kuifeng

On 5/3/24 11:34 AM, Kui-Feng Lee wrote:
> 
> 
> On 5/2/24 11:15, Martin KaFai Lau wrote:
>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>> @@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
>>>       if (ops->test_2)
>>>           ops->test_2(4, ops->data);
>>> +    if (ops->do_unreg) {
>>> +        rcu_read_lock();
>>> +        bpf_struct_ops_kvalue_unreg(kdata);
>>
>> Instead of unreg() immediately before the reg() has returned, the test should 
>> reflect more on how the subsystem can use it in practice. The subsystem does 
>> not do unreg() during reg().
>>
>> It also needs to test a case when the link is created and successfully 
>> registered to the subsystem. The user space does BPF_LINK_DETACH first and  >> then the subsystem does link->ops->detach() by itself later.

> 
> agree
> 
>>
>> It can create a kfunc in bpf_testmod.c to trigger the subsystem to do 
>> link->ops->detach(). The kfunc can be called by a SEC("syscall") bpf prog 
>> which is run by bpf_prog_test_run_opts(). The test_progs can then decide on 
>> the timing when to do link->ops->detach() to test different cases.
> 
> What is the purpose of this part?
> If it goes through link->ops->detach(), it should work just like to call
> bpf_link_detach() twice on the same link from the user space. Do you
> want to make sure detaching a link twice work?

It is not quite what I meant and apparently link detach twice on the same valid 
(i.e. refcnt non zero) link won't work.

Anyhow, the idea is to show how the racing case may work in patch 3 (when 
userspace tries to detach and the subsystem tries to detach/unreg itself also). 
I was suggesting the kfunc idea such that the test_progs can have better control 
on the timing on when to ask the subsystem to unreg/detach itself instead of 
having to do the unreg() during the reg() as in patch 6 here. If kfunc does not 
make sense and there is a better way to do this, feel free to ignore.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-05-03 19:15       ` Martin KaFai Lau
@ 2024-05-03 21:34         ` Kui-Feng Lee
  2024-05-03 21:59           ` Martin KaFai Lau
  0 siblings, 1 reply; 27+ messages in thread
From: Kui-Feng Lee @ 2024-05-03 21:34 UTC (permalink / raw
  To: Martin KaFai Lau, Kui-Feng Lee
  Cc: bpf, ast, song, kernel-team, andrii, kuifeng



On 5/3/24 12:15, Martin KaFai Lau wrote:
> On 5/3/24 11:34 AM, Kui-Feng Lee wrote:
>>
>>
>> On 5/2/24 11:15, Martin KaFai Lau wrote:
>>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>>> @@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
>>>>       if (ops->test_2)
>>>>           ops->test_2(4, ops->data);
>>>> +    if (ops->do_unreg) {
>>>> +        rcu_read_lock();
>>>> +        bpf_struct_ops_kvalue_unreg(kdata);
>>>
>>> Instead of unreg() immediately before the reg() has returned, the 
>>> test should reflect more on how the subsystem can use it in practice. 
>>> The subsystem does not do unreg() during reg().
>>>
>>> It also needs to test a case when the link is created and 
>>> successfully registered to the subsystem. The user space does 
>>> BPF_LINK_DETACH first and  >> then the subsystem does 
>>> link->ops->detach() by itself later.
> 
>>
>> agree
>>
>>>
>>> It can create a kfunc in bpf_testmod.c to trigger the subsystem to do 
>>> link->ops->detach(). The kfunc can be called by a SEC("syscall") bpf 
>>> prog which is run by bpf_prog_test_run_opts(). The test_progs can 
>>> then decide on the timing when to do link->ops->detach() to test 
>>> different cases.
>>
>> What is the purpose of this part?
>> If it goes through link->ops->detach(), it should work just like to call
>> bpf_link_detach() twice on the same link from the user space. Do you
>> want to make sure detaching a link twice work?
> 
> It is not quite what I meant and apparently link detach twice on the 
> same valid (i.e. refcnt non zero) link won't work.
> 
> Anyhow, the idea is to show how the racing case may work in patch 3 
> (when userspace tries to detach and the subsystem tries to detach/unreg 
> itself also). I was suggesting the kfunc idea such that the test_progs 
> can have better control on the timing on when to ask the subsystem to 
> unreg/detach itself instead of having to do the unreg() during the reg() 
> as in patch 6 here. If kfunc does not make sense and there is a better 
> way to do this, feel free to ignore.
> 

Ok! I think the case you are talking about is more likely to happen when the
link is destroyed, but bpf_struct_ops_map_link_dealloc() has not finished
yet. Calling link->ops->detach() at this point may cause a race since
bpf_struct_ops_map_link_dealloc() doesn't acquire update_mutex.

Calling link->ops->detach() immediately after BPF_LINK_DETACH would not
cause any race since bpf_struct_ops_map_link_detach() always acquires
update_mutex. They will be executed sequentially, and call
st_map->ops->reg() sequentially as well.

I will add a test case that calls link->ops->detach() after closing the fd of
the link.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next 6/6] selftests/bpf: test detaching struct_ops links.
  2024-05-03 21:34         ` Kui-Feng Lee
@ 2024-05-03 21:59           ` Martin KaFai Lau
  0 siblings, 0 replies; 27+ messages in thread
From: Martin KaFai Lau @ 2024-05-03 21:59 UTC (permalink / raw
  To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, kuifeng

On 5/3/24 2:34 PM, Kui-Feng Lee wrote:
> 
> 
> On 5/3/24 12:15, Martin KaFai Lau wrote:
>> On 5/3/24 11:34 AM, Kui-Feng Lee wrote:
>>>
>>>
>>> On 5/2/24 11:15, Martin KaFai Lau wrote:
>>>> On 4/29/24 2:36 PM, Kui-Feng Lee wrote:
>>>>> @@ -572,6 +576,12 @@ static int bpf_dummy_reg(void *kdata)
>>>>>       if (ops->test_2)
>>>>>           ops->test_2(4, ops->data);
>>>>> +    if (ops->do_unreg) {
>>>>> +        rcu_read_lock();
>>>>> +        bpf_struct_ops_kvalue_unreg(kdata);
>>>>
>>>> Instead of unreg() immediately before the reg() has returned, the test 
>>>> should reflect more on how the subsystem can use it in practice. The 
>>>> subsystem does not do unreg() during reg().
>>>>
>>>> It also needs to test a case when the link is created and successfully 
>>>> registered to the subsystem. The user space does BPF_LINK_DETACH first and  
>>>> >> then the subsystem does link->ops->detach() by itself later.
>>
>>>
>>> agree
>>>
>>>>
>>>> It can create a kfunc in bpf_testmod.c to trigger the subsystem to do 
>>>> link->ops->detach(). The kfunc can be called by a SEC("syscall") bpf prog 
>>>> which is run by bpf_prog_test_run_opts(). The test_progs can then decide on 
>>>> the timing when to do link->ops->detach() to test different cases.
>>>
>>> What is the purpose of this part?
>>> If it goes through link->ops->detach(), it should work just like to call
>>> bpf_link_detach() twice on the same link from the user space. Do you
>>> want to make sure detaching a link twice work?
>>
>> It is not quite what I meant and apparently link detach twice on the same 
>> valid (i.e. refcnt non zero) link won't work.
>>
>> Anyhow, the idea is to show how the racing case may work in patch 3 (when 
>> userspace tries to detach and the subsystem tries to detach/unreg itself 
>> also). I was suggesting the kfunc idea such that the test_progs can have 
>> better control on the timing on when to ask the subsystem to unreg/detach 
>> itself instead of having to do the unreg() during the reg() as in patch 6 
>> here. If kfunc does not make sense and there is a better way to do this, feel 
>> free to ignore.
>>
> 
> Ok! I think the case you are talking about is more likely to happen when the link
> is destroyed, but bpf_struct_ops_map_link_dealloc() has not finished
> yet. Calling link->ops->detach() at this point may cause a race since
> bpf_struct_ops_map_link_dealloc() doesn't acquire update_mutex.

Yes, adding link_dealloc() (i.e. closing the link) in between will be a good test too.

With or without link_dealloc()/close(), the idea is to test this race (user 
space detach and/or dealloc vs subsystem detach/unreg) or at least show how the 
subsystem should do it. I was merely suggesting to use a kfunc (maybe there is a 
better way, and feel free to ignore). The details of the testing steps could be 
adjusted based on what patch 3 will look like.

> 
> Calling link->ops->detach() immediately after BPF_LINK_DETACH would not
> cause a race since bpf_struct_ops_map_link_detach() always acquires
> update_mutex. They will be executed sequentially, and will call
> st_map->ops->reg() sequentially as well.

I didn't mean that the detach() itself is racy. That part is fine. It is more 
about the link that the subsystem is holding. I feel that patch 3 may end up 
looking different from my current thinking. If this test does not make 
sense based on what patch 3 will look like, feel free to ignore also.

> 
> I will add a test case to call link->ops->detach() after closing the fd of
> the link.
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-05-03 21:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-29 21:36 [PATCH bpf-next 0/6] Notify user space when a struct_ops object is detached/unregisterd Kui-Feng Lee
2024-04-29 21:36 ` [PATCH bpf-next 1/6] bpf: add a pointer of the attached link to bpf_struct_ops_map Kui-Feng Lee
2024-05-01 17:01   ` Andrii Nakryiko
2024-05-01 22:15     ` Kui-Feng Lee
2024-04-29 21:36 ` [PATCH bpf-next 2/6] bpf: export bpf_link_inc_not_zero() Kui-Feng Lee
2024-04-29 21:36 ` [PATCH bpf-next 3/6] bpf: provide a function to unregister struct_ops objects from consumers Kui-Feng Lee
2024-05-01 18:48   ` Martin KaFai Lau
2024-05-01 22:15     ` Kui-Feng Lee
2024-05-01 23:06       ` Martin KaFai Lau
2024-05-02 17:56     ` Martin KaFai Lau
2024-05-02 18:29       ` Martin KaFai Lau
2024-05-03  0:41       ` Kui-Feng Lee
2024-05-03 16:19         ` Alexei Starovoitov
2024-05-03 18:09           ` Kui-Feng Lee
2024-05-03 17:17         ` Martin KaFai Lau
2024-04-29 21:36 ` [PATCH bpf-next 4/6] bpf: detach a bpf_struct_ops_map from a link Kui-Feng Lee
2024-04-29 21:36 ` [PATCH bpf-next 5/6] bpf: support epoll from bpf struct_ops links Kui-Feng Lee
2024-05-01 17:03   ` Andrii Nakryiko
2024-05-01 22:16     ` Kui-Feng Lee
2024-04-29 21:36 ` [PATCH bpf-next 6/6] selftests/bpf: test detaching " Kui-Feng Lee
2024-05-01 17:05   ` Andrii Nakryiko
2024-05-01 22:17     ` Kui-Feng Lee
2024-05-02 18:15   ` Martin KaFai Lau
2024-05-03 18:34     ` Kui-Feng Lee
2024-05-03 19:15       ` Martin KaFai Lau
2024-05-03 21:34         ` Kui-Feng Lee
2024-05-03 21:59           ` Martin KaFai Lau

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.