All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Borkmann <daniel@iogearbox.net>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>,
	bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Martin KaFai Lau <kafai@fb.com>,
	Hangbin Liu <liuhangbin@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Magnus Karlsson <magnus.karlsson@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to redirect map entries
Date: Tue, 22 Jun 2021 10:50:36 +0200	[thread overview]
Message-ID: <2f11d71d-0298-4177-6ac6-4483adf36ed9@iogearbox.net> (raw)
In-Reply-To: <87r1gurgmb.fsf@toke.dk>

On 6/22/21 12:35 AM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>> On 6/21/21 11:39 PM, Toke Høiland-Jørgensen wrote:
>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>> On 6/17/21 11:27 PM, Toke Høiland-Jørgensen wrote:
>>>>> XDP_REDIRECT works by a three-step process: the bpf_redirect() and
>>>>> bpf_redirect_map() helpers will lookup the target of the redirect and store
>>>>> it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
>>>>> Next, when the program returns the XDP_REDIRECT return code, the driver
>>>>> will call xdp_do_redirect() which will use the information thus stored to
>>>>> actually enqueue the frame into a bulk queue structure (that differs
>>>>> slightly by map type, but shares the same principle). Finally, before
>>>>> exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
>>>>> flush all the different bulk queues, thus completing the redirect.
>>>>>
>>>>> Pointers to the map entries will be kept around for this whole sequence of
>>>>> steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>>> the core code; instead drivers add their own rcu_read_lock() around the XDP
>>>>> portions of the code, but somewhat inconsistently as Martin discovered[0].
>>>>> However, things still work because everything happens inside a single NAPI
>>>>> poll sequence, which means it's between a pair of calls to
>>>>> local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
>>>>> document this intention by using rcu_dereference_check() with
>>>>> rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
>>>>> lockdep to verify that everything is done correctly.
>>>>>
>>>>> This patch does just that: we add an __rcu annotation to the map entry
>>>>> pointers and remove the various comments explaining the NAPI poll assurance
>>>>> strewn through devmap.c in favour of a longer explanation in filter.c. The
>>>>> goal is to have one coherent documentation of the entire flow, and rely on
>>>>> the RCU annotations as a "standard" way of communicating the flow in the
>>>>> map code (which can additionally be understood by sparse and lockdep).
>>>>>
>>>>> The RCU annotation replacements result in a fairly straight-forward
>>>>> replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE()
>>>>> becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the
>>>>> proper constructs to cast the pointer back and forth between __rcu and
>>>>> __kernel address space (for the benefit of sparse). The one complication is
>>>>> that xskmap has a few constructions where double-pointers are passed back
>>>>> and forth; these simply all gain __rcu annotations, and only the final
>>>>> reference/dereference to the inner-most pointer gets changed.
>>>>>
>>>>> With this, everything can be run through sparse without eliciting
>>>>> complaints, and lockdep can verify correctness even without the use of
>>>>> rcu_read_lock() in the drivers. Subsequent patches will clean these up from
>>>>> the drivers.
>>>>>
>>>>> [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
>>>>> [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
>>>>>
>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>> ---
>>>>>     include/net/xdp_sock.h |  2 +-
>>>>>     kernel/bpf/cpumap.c    | 13 +++++++----
>>>>>     kernel/bpf/devmap.c    | 49 ++++++++++++++++++------------------------
>>>>>     net/core/filter.c      | 28 ++++++++++++++++++++++++
>>>>>     net/xdp/xsk.c          |  4 ++--
>>>>>     net/xdp/xsk.h          |  4 ++--
>>>>>     net/xdp/xskmap.c       | 29 ++++++++++++++-----------
>>>>>     7 files changed, 80 insertions(+), 49 deletions(-)
>>>> [...]
>>>>>     						 __dev_map_entry_free);
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index caa88955562e..0b7db5c70385 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -3922,6 +3922,34 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>>>>     	.arg2_type	= ARG_ANYTHING,
>>>>>     };
>>>>>     
>>>>> +/* XDP_REDIRECT works by a three-step process, implemented in the functions
>>>>> + * below:
>>>>> + *
>>>>> + * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
>>>>> + *    of the redirect and store it (along with some other metadata) in a per-CPU
>>>>> + *    struct bpf_redirect_info.
>>>>> + *
>>>>> + * 2. When the program returns the XDP_REDIRECT return code, the driver will
>>>>> + *    call xdp_do_redirect() which will use the information in struct
>>>>> + *    bpf_redirect_info to actually enqueue the frame into a map type-specific
>>>>> + *    bulk queue structure.
>>>>> + *
>>>>> + * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
>>>>> + *    which will flush all the different bulk queues, thus completing the
>>>>> + *    redirect.
>>>>> + *
>>>>> + * Pointers to the map entries will be kept around for this whole sequence of
>>>>> + * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
>>>>> + * the core code; instead, the RCU protection relies on everything happening
>>>>> + * inside a single NAPI poll sequence, which means it's between a pair of calls
>>>>> + * to local_bh_disable()/local_bh_enable().
>>>>> + *
>>>>> + * The map entries are marked as __rcu and the map code makes sure to
>>>>> + * dereference those pointers with rcu_dereference_check() in a way that works
>>>>> + * for both sections that to hold an rcu_read_lock() and sections that are
>>>>> + * called from NAPI without a separate rcu_read_lock(). The code below does not
>>>>> + * use RCU annotations, but relies on those in the map code.
>>>>
>>>> One more follow-up question related to tc BPF: given we do use rcu_read_lock_bh()
>>>> in case of sch_handle_egress(), could we also remove the rcu_read_lock() pair
>>>> from cls_bpf_classify() then?
>>>
>>> I believe so, yeah. Patch 2 in this series should even make lockdep stop
>>> complaining about it :)
>>
>> Btw, I was wondering whether we should just get rid of all the WARN_ON_ONCE()s
>> from those map helpers given in most situations these are not triggered anyway
>> due to retpoline avoidance where verifier rewrites the calls to jump to the map
>> backend implementation directly. One alternative could be to have an extension
>> to the bpf prologue generation under CONFIG_DEBUG_LOCK_ALLOC and call the lockdep
>> checks from there, but it's probably not worth the effort. (In the trampoline
>> case we have those __bpf_prog_enter()/__bpf_prog_enter_sleepable() where the
>> latter in particular has asserts like might_fault(), fwiw.)
> 
> I agree that it's probably overkill to amend the prologue. No strong
> opinion on whether removing the checks entirely is a good idea; I guess
> they at least serve as documentation even if they're not actually called
> that often?

Ack, that's okay with me, and if we find a better solution, we can always change it
later on.

Thanks,
Daniel

  reply	other threads:[~2021-06-22  8:51 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-17 21:27 [PATCH bpf-next v3 00/16] Clean up and document RCU-based object protection for XDP_REDIRECT Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 01/16] rcu: Create an unrcu_pointer() to remove __rcu from a pointer Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 02/16] bpf: allow RCU-protected lookups to happen from bh context Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 03/16] xdp: add proper __rcu annotations to redirect map entries Toke Høiland-Jørgensen
2021-06-18  4:55   ` Martin KaFai Lau
2021-06-18 23:27   ` Daniel Borkmann
2021-06-21 21:39     ` Toke Høiland-Jørgensen
2021-06-21 22:15       ` Daniel Borkmann
2021-06-21 22:35         ` Toke Høiland-Jørgensen
2021-06-22  8:50           ` Daniel Borkmann [this message]
2021-06-22 13:55       ` Toke Høiland-Jørgensen
2021-06-22 20:26         ` Paul E. McKenney
2021-06-22 21:48           ` Toke Høiland-Jørgensen
2021-06-22 23:19             ` Paul E. McKenney
2021-06-23 10:55               ` Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 04/16] ena: remove rcu_read_lock() around XDP program invocation Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 05/16] bnxt: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 06/16] thunderx: " Toke Høiland-Jørgensen
2021-06-17 21:27   ` Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 07/16] freescale: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 08/16] net: intel: " Toke Høiland-Jørgensen
2021-06-17 21:27   ` [Intel-wired-lan] " Toke =?unknown-8bit?q?H=C3=B8iland-J=C3=B8rgensen?=
2021-06-17 21:27 ` [PATCH bpf-next v3 09/16] marvell: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 10/16] mlx4: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 11/16] nfp: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 12/16] qede: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 13/16] sfc: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 14/16] netsec: " Toke Høiland-Jørgensen
2021-06-17 21:27 ` [PATCH bpf-next v3 15/16] stmmac: " Toke Høiland-Jørgensen
2021-06-18  9:47   ` Wong Vee Khee
2021-06-17 21:27 ` [PATCH bpf-next v3 16/16] net: ti: " Toke Høiland-Jørgensen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2f11d71d-0298-4177-6ac6-4483adf36ed9@iogearbox.net \
    --to=daniel@iogearbox.net \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=kafai@fb.com \
    --cc=kuba@kernel.org \
    --cc=liuhangbin@gmail.com \
    --cc=magnus.karlsson@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=toke@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.