8.3.2.132. v4.7.16

commit bfa2a877c6ef4bf23aa0767d4fe06f22c347684b
Author: Victor Lowther <victor.lowther@gmail.com>
Date:   Fri Feb 18 08:55:00 2022 -0600

    feat(ci): Make Gitlab CI process more scalable.

    This fixes how the unit tests detect when they are running
    under the Gitlab CI infrastructure, and allows the test
    phases to run in parallel to the extent feasible.

M   .gitlab-ci.yml
M   test/plugins.go
A   tools/multitest.sh
M   tools/test.sh
A   tools/test_prereqs.sh

commit 2b579b528ea8b78f6c560cd7772ba5d43256e0b3
Author: Victor Lowther <victor.lowther@gmail.com>
Date:   Mon Feb 14 13:58:43 2022 -0600

    fix(consensus): Fix deadlock and cert rotation error

    Etags.Remove and Etags.RemoveAll have baked-in deadlocks that are too
    easy to trigger with external uses of Etags.SendArtifactOp.  Fix it by
    splitting SendArtifactOp into two functions, one that takes the locks
    and one that does not, and fix up all in-package callers of
    SendArtifactOp to take whatever locks are needed and then call
    sendArtifactOp.
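
The locking split described above can be sketched roughly as follows. This is a minimal illustration, not the actual datastack code: `etagStore`, `SendOp`, `sendOp`, and the single mutex are hypothetical stand-ins, and the sketch assumes the deadlock comes from a method re-acquiring a non-reentrant lock it already holds, which the commit message does not spell out.

```go
package main

import (
	"fmt"
	"sync"
)

// etagStore is a hypothetical stand-in for the real Etags type.
type etagStore struct {
	mu    sync.Mutex
	items map[string]string
}

// SendOp is the exported entry point for external callers.
// It takes the lock and then delegates to the unlocked helper.
func (e *etagStore) SendOp(op, key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.sendOp(op, key)
}

// sendOp assumes the caller already holds e.mu and never locks by itself.
func (e *etagStore) sendOp(op, key string) {
	if op == "remove" {
		delete(e.items, key)
	}
}

// RemoveAll takes the lock once and calls the unlocked helper.
// Calling SendOp here instead would try to lock e.mu a second time,
// and sync.Mutex is not reentrant, so the goroutine would block forever.
func (e *etagStore) RemoveAll() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for key := range e.items {
		e.sendOp("remove", key)
	}
}

func main() {
	e := &etagStore{items: map[string]string{"a": "1", "b": "2"}}
	e.SendOp("remove", "a") // external-style call: locks internally
	e.RemoveAll()           // in-package call: locks once, uses sendOp
	fmt.Println("remaining items:", len(e.items))
}
```

The convention is that the lowercase helper documents "caller holds the lock", so in-package code paths that already hold it can share the logic without re-locking.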

    The consensus HA TLS handshaking strategy could fail if a root cert
    rotate happened while one of the nodes was out of the cluster. Once
    that happened, the missing node would be unable to rejoin the cluster
    when it came back up due to it not having the most recent root
    certificate for TLS peer cert checking.  This has been resolved with
    two fixes:

    * On the client side (the one initiating the connection), we start by
      picking a cert that is signed by the most recent cert in the Roots
      array.  If the TLS handshake fails, we will iterate through the Roots
      array in order from most recent to oldest, retrying the connection
      with each until either the handshake passes or we run out of root
      certs.  If we run out of root certs, we wrap around and try again.

    * On the server side (the one receiving the connection), we ratchet
      the same way, except we move on to the next oldest cert after
      len(fsm.Roots) handshake failures instead of at every one.  If we
      run out of certs, we will emit an error message to the log and wrap
      around.

    These two behaviours ensure that all possible current common sets of
    mutually valid root certs are tested when creating TLS connections
    between consensus members.
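
To see why the two ratchet speeds cover every pairing, here is a small simulation sketch. It is heavily simplified and makes several assumptions: roots are plain string labels with index 0 as the most recent, a "handshake" succeeds only when both sides currently select the same label, and the types `clientRatchet` and `serverRatchet` are hypothetical names that do not appear in the consensus package. Real TLS verification is more involved than a label comparison.

```go
package main

import "fmt"

// clientRatchet moves to the next older root after every failed handshake.
type clientRatchet struct {
	n   int // number of known roots; index 0 is the most recent
	idx int
}

func (c *clientRatchet) Current() int { return c.idx }

// Fail advances to the next root, wrapping around at the end.
func (c *clientRatchet) Fail() { c.idx = (c.idx + 1) % c.n }

// serverRatchet moves to the next older root only after n failures,
// where n is the length of its own roots list.
type serverRatchet struct {
	n, idx, failures int
}

func (s *serverRatchet) Current() int { return s.idx }

// Fail records a failure and reports whether the server wrapped around.
func (s *serverRatchet) Fail() bool {
	s.failures++
	if s.failures < s.n {
		return false
	}
	s.failures = 0
	s.idx++
	if s.idx == s.n {
		s.idx = 0
		return true
	}
	return false
}

// simulate retries until both sides select the same root label.
// It returns the agreed label and the attempt number, or "" on failure.
func simulate(clientRoots, serverRoots []string, maxAttempts int) (string, int) {
	c := &clientRatchet{n: len(clientRoots)}
	s := &serverRatchet{n: len(serverRoots)}
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if clientRoots[c.Current()] == serverRoots[s.Current()] {
			return clientRoots[c.Current()], attempt
		}
		c.Fail()
		if s.Fail() {
			fmt.Println("server: ran out of root certs, wrapping around")
		}
	}
	return "", maxAttempts
}

func main() {
	// A node that missed two rotations dials a node that is up to date.
	stale := []string{"r1", "r0"}
	current := []string{"r3", "r2", "r1"}
	root, attempts := simulate(stale, current, 50)
	fmt.Printf("agreed on %q after %d attempts\n", root, attempts)
}
```

Because the client cycles quickly while the server holds each choice for several failures, the pair eventually lands on any root the two sides have in common, at the cost of some failed attempts along the way.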

    Additionally, the root cert rotation code was incorrectly calculating
    when to rotate the root certificates, leading to an insufficient
    number of known trusted roots and the possibility of the nodes not
    having enough certs in common to mutually authenticate their peers.
    We now handle this like so:

    1. On cluster create, we seed a TLS certificate that every node in the
       cluster uses for mutual peer authentication.  This cert is valid from
       time.Now().Add(-consensus.CertGrace) (to handle servers that do not
       have well-synchronized clocks) through time.Now().Add(RootCertExpire+CertGrace).
       This gives the initial cert a 3 month lifespan.

    2. Every RootCertExpire/(2*MaxTrustRoots+1), the leader of the
       consensus cluster checks to see if it needs to generate a new root
       certificate by checking to see if the first entry in the Roots
       array of the finite state machine is more than
       RootCertExpire/MaxTrustRoots old. If so, it generates a new
       self-signed cert, adds it to the front of the Roots array in the
       finite state machine, removes any certs in the Roots array that
       have expired, and publishes the new FSM via Raft.

    The failover tests have been updated to have much increased cert
    rotation rates in order to test this code.

M   .gitlab-ci.yml
M   consensus/raft.go
M   consensus/raftFSM.go
M   consensus/serverAPI.go
M   datastack/etags.go
M   failover_test/consensus_failover_test.go
M   go.mod
M   go.sum
M   server/args.go

End of Note