Hi,

many thanks for this elaborate reply!

On 11.05.21 at 00:19, Pablo Neira Ayuso wrote:
> Hi,
>
> On Sun, May 09, 2021 at 07:52:27PM +0200, Oliver Freyermuth wrote:
>> Dear netfilter experts,
>>
>> we are trying to set up an active/active firewall, making use of
>> "xt_cluster". We can configure the switch to act like a hub, i.e.
>> both machines can share the same MAC and IP and get the same packets
>> without additional arptables tricks.
>>
>> So we set rules like:
>>
>> iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>> iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
>
> I'm attaching an old script to set up active-active that I remember
> having used some time ago; I never found the time to upstream this.

This is really helpful indeed. While we use Shorewall (which simplifies
many things, but has no abstraction for xt_cluster as far as I am
aware), it helps to see all the rules written up together so I can
translate them for Shorewall, and the debugging rules are also very
helpful.

>> Ideally, we'd love to have the possibility to scale this to more
>> than two nodes, but let's stay with two for now.
>
> IIRC, up to two nodes should be easy with the existing codebase. To
> support more than 2 nodes, conntrackd needs to be extended, but it
> should be doable.
>
>> Basic tests show that this works as expected, but the details get
>> messy.
>>
>> 1. Certainly, conntrackd is needed to synchronize connection states.
>> But is it always "fast enough"? xt_cluster seems to match by the
>> src_ip of the original direction of the flow [0] (if I read the code
>> correctly), but what happens if the reply to an outgoing packet
>> arrives at both firewalls before state is synchronized?
>
> You can avoid this by setting DisableExternalCache to off. Then, in
> case one of your firewall nodes goes off, update the cluster rules and
> inject the entries (via keepalived, or your HA daemon of choice).
>
> The recommended configuration is DisableExternalCache off, with your
> HA daemon properly configured to assist conntrackd. Then, the
> conntrack entries in the "external cache" of conntrackd are added to
> the kernel when needed.

You caused a classic "facepalm" moment. Of course, that will solve (1)
completely. My initial reasoning for disabling the external cache
predates my understanding of how xt_cluster works and my discovery that
it matches on the direction of the flow, and then it just escaped my
mind. Thanks for clearing this up! :-)

>> We are currently using conntrackd in FTFW mode with a direct
>> link, set "DisableExternalCache", and additionally set "PollSecs
>> 15", since without that it seems only new and destroyed
>> connections are synced, but lifetime updates for existing
>> connections do not propagate without polling.
>
> No need to set PollSecs. Polling should be disabled. Did you enable
> event filtering? You should synchronize update events too. Could you
> post your configuration file?

Sure, it's attached. I'm doing event filtering, but only by address and
protocol, not by flow state, so I thought it to be harmless in this
regard. For my test, I sent a continuous ICMP stream through the node.
The flow itself was synced fine, but its lifetime was not updated on
the partner node unless polling was active, and eventually the flow was
removed on the partner machine (lifetime expired) while it was still
being kept alive by updates on the primary node.

This was with "DisableExternalCache on", on a CentOS 8.2 node, i.e.:
  kernel 4.18.0-193.19.1.el8_2.x86_64
  conntrackd v1.4.4
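Structurally, the configuration boils down to something like this (a
trimmed, hand-written sketch rather than the attached file itself; the
addresses, the sync interface name, and the paths are placeholders):

Sync {
    Mode FTFW {
        # As discussed above: currently on, will switch to off.
        DisableExternalCache on
    }
    UDP {
        # Direct sync link between the two nodes (placeholder addresses).
        IPv4_address 192.168.100.1
        IPv4_Destination_Address 192.168.100.2
        Port 3780
        Interface sync0
        Checksum on
    }
}

General {
    LockFile /var/lock/conntrack.lock
    UNIX {
        Path /var/run/conntrackd.ctl
    }
    # Event filtering only by protocol and address, not by flow state:
    Filter From Userspace {
        Protocol Accept {
            TCP
            UDP
            ICMP
        }
        Address Ignore {
            IPv4_address 127.0.0.1  # do not sync loopback traffic
        }
    }
}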
> [...]
>> 2. How to do failover in such cases?
>> For failover, we'd need to change these rules (if one node fails,
>> the total-nodes will change). As an alternative, I found [1],
>> which states multiple rules can be used and enabled/disabled, but
>> does somebody know of a cleaner (and easier to read) way that also
>> does not cost extra performance?
>
> If you use iptables, you'll have to update the rules on failure as you
> describe. What performance cost are you referring to?

This was based on your comment here:
https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/
But this is probably indeed premature thinking on my end: with two
firewalls, having two rules after failover should have even less impact
than what you measured there. I still think something like the /proc
interface you described there would be cleaner, but I also don't know
of a failover daemon which could make use of it. (See the P.S. below
for a sketch of what I mean by the two-rules approach.)

>> 3. We have several internal networks which need to talk to each
>> other (partially with firewall rules and NATting), so we'd also need
>> similar rules there, complicating things more. That's why a cleaner
>> way would be very welcome :-).
>
> Cleaner way: it should be possible to simplify this setup with
> nftables.

Since we currently use Shorewall as a simplification layer (which eases
many things through its abstraction, but still uses iptables behind the
scenes), it's probably best for sanity not to mix the two here. So the
less "clean" way is likely the easier one for now.

>> 4. Another point is how to actually perform the failover. Classical
>> cluster suites (corosync + pacemaker) are rather built to migrate
>> services, not to communicate node ids and the number of total active
>> nodes. They can probably be tricked into doing that somehow, but
>> they are not designed this way. TIPC may be something to use here,
>> but I found nothing "ready to use".
>
> I have used keepalived in the past with very simple configuration
> files, and used their shell script API to interact with conntrackd.
> I did not spend much time on corosync/pacemaker so far.

I was mostly thinking about the cluster rules here: I'd love to have a
daemon which could adjust cluster-total-nodes and cluster-local-node,
instead of having two rules on one firewall when the other fails. I
think I can make the latter work with pacemaker/corosync, and also have
it support conntrackd. It might be fiddly, but should be doable.

Many thanks for the elaborate answer,
Oliver

-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
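P.S.: To make the "two rules" approach from [1] concrete, this is
roughly the failover hook I have in mind (an untested sketch; it reuses
the xt_cluster parameters from the rules at the top of this mail, and
"takeover"/"release" are made-up names for whatever events the HA
daemon of choice delivers):

#!/bin/sh
# Untested sketch of a failover hook for the surviving node (node 1).
# On takeover it additionally claims the hash bucket of the failed
# node 2 by inserting a second xt_cluster rule with
# --cluster-local-node 2; on release it removes that rule again.

cluster_rule() {  # $1 is -I (insert) or -D (delete)
    iptables "$1" PREROUTING -t mangle -i external_interface \
        -m cluster --cluster-total-nodes 2 --cluster-local-node 2 \
        --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
}

case "$1" in
    takeover) cluster_rule -I ;;  # partner failed: handle its bucket too
    release)  cluster_rule -D ;;  # partner is back: hand its bucket back
    *)        echo "usage: $0 takeover|release" >&2; exit 1 ;;
esac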