commit bfa2a877c6ef4bf23aa0767d4fe06f22c347684b
Author: Victor Lowther <victor.lowther@gmail.com>
Date: Fri Feb 18 08:55:00 2022 -0600
feat(ci): Make Gitlab CI process more scalable.
This fixes up how the unit tests detect when they are running
under the Gitlab CI infrastructure, and allows the test
phases to run in parallel to the extent feasible.
M .gitlab-ci.yml
M test/plugins.go
A tools/multitest.sh
M tools/test.sh
A tools/test_prereqs.sh
commit 2b579b528ea8b78f6c560cd7772ba5d43256e0b3
Author: Victor Lowther <victor.lowther@gmail.com>
Date: Mon Feb 14 13:58:43 2022 -0600
fix(consensus): Fix deadlock and cert rotation error
Etags.Remove and Etags.RemoveAll have baked-in deadlocks that are too
easy to trigger with external uses of Etags.SendArtifactOp. Fix it by
splitting SendArtifactOp into two functions, one that takes the locks
and one that does not, and fix up all in-package callers of
SendArtifactOp to take whatever locks are needed and then call
sendArtifactOp.
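The locked/unlocked split described above can be sketched roughly as
follows. This is a minimal illustration, not the code in
datastack/etags.go: the Store type, its fields, and the SendOp/sendOp
names are hypothetical stand-ins for Etags, SendArtifactOp, and
sendArtifactOp.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the Etags type.
type Store struct {
	mu    sync.Mutex
	items map[string]string
}

// SendOp is the exported entry point for external callers.
// It takes the lock and then delegates to the unlocked variant.
func (s *Store) SendOp(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sendOp(key, val)
}

// sendOp assumes the caller already holds s.mu.
// An empty val is treated as a removal (an assumption of this sketch).
func (s *Store) sendOp(key, val string) {
	if val == "" {
		delete(s.items, key)
		return
	}
	s.items[key] = val
}

// RemoveAll takes the lock once and calls the unlocked variant.
// Calling SendOp here instead would self-deadlock, because
// sync.Mutex is not reentrant.
func (s *Store) RemoveAll() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k := range s.items {
		s.sendOp(k, "")
	}
}

func main() {
	s := &Store{items: map[string]string{}}
	s.SendOp("a", "1")
	s.SendOp("b", "2")
	s.RemoveAll()
	fmt.Println(len(s.items))
}
```

The point of the split is that in-package code decides which locks to
hold and for how long, while external callers only ever see the variant
that locks for them.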
The consensus HA TLS handshaking strategy could fail if a root cert
rotation happened while one of the nodes was out of the cluster. Once
that happened, the missing node would be unable to rejoin the cluster
when it came back up, because it lacked the most recent root
certificate for TLS peer cert checking. This has been resolved with
two fixes:
* On the client side (the one initiating the connection), we start by
picking a cert that is signed by the most recent cert in the Roots
array. If the TLS handshake fails, we will iterate through the Roots
array in order from most recent to oldest, retrying the connection
with each until either the handshake passes or we run out of root
certs. If we run out of root certs, we wrap around and try again.
* On the server side (the one receiving the connection), we ratchet
the same way, except we move on to the next oldest cert after
len(fsm.Roots) handshake failures instead of at every one. If we
run out of certs, we will emit an error message to the log and wrap
around.
These two behaviours ensure that all possible current common sets of
mutually valid root certs are tested when creating TLS connections
between consensus members.
Additionally, the root cert rotation code was incorrectly calculating
when to rotate the root certificates, leading to an insufficient
number of known trusted roots and the possibility of the nodes not
having enough certs in common to mutually authenticate their peers.
We now handle this like so:
1. On cluster create, we seed a TLS certificate that every node in the
cluster uses for mutual peer authentication. This cert is valid from
time.Now().Add(-consensus.CertGrace) (to handle servers that do not
have well-synchronized clocks) through time.Now().Add(RootCertExpire+CertGrace).
This gives the initial cert a 3-month lifespan.
2. Every RootCertExpire/(2*MaxTrustRoots+1), the leader of the
consensus cluster checks to see if it needs to generate a new root
certificate by checking to see if the first entry in the Roots
array of the finite state machine is more than
RootCertExpire/MaxTrustRoots old. If so, it generates a new
self-signed cert, adds it to the front of the Roots array in the
finite state machine, removes any certs in the Roots array that
have expired, and publishes the new FSM via Raft.
The failover tests have been updated to use greatly increased cert
rotation rates in order to test this code.
M .gitlab-ci.yml
M consensus/raft.go
M consensus/raftFSM.go
M consensus/serverAPI.go
M datastack/etags.go
M failover_test/consensus_failover_test.go
M go.mod
M go.sum
M server/args.go