8.3.2.75. v4.8.3

commit 0d70ee75e712d6ee726365c1731c13453ea15dbb
Author: Victor Lowther <victor@rackn.com>
Date:   Thu Mar 3 12:48:12 2022 -0600

    perf(etags): Parallelize etag bulk processing.

    This refactors etag bulk checking to operate in as parallel a fashion
    as possible while not causing the system to explode too much if it
    winds up needing to recalculate a bunch of checksums.
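
A minimal sketch of bounded parallelism, the general technique this commit describes: work runs concurrently, but a counting semaphore caps how many checksum calculations are in flight. The function name `checksumAll`, the in-memory inputs, and the SHA-256 checksum are illustrative assumptions, not the code in datastack/etags.go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"sync"
)

// checksumAll (hypothetical name) checksums every input concurrently while
// never running more than limit calculations at the same time.
func checksumAll(inputs map[string][]byte, limit int) map[string]string {
	if limit < 1 {
		limit = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(inputs))
		sem     = make(chan struct{}, limit) // counting semaphore
	)
	for name, data := range inputs {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are already busy
		go func(name string, data []byte) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next worker
			sum := sha256.Sum256(data)
			mu.Lock()
			results[name] = hex.EncodeToString(sum[:])
			mu.Unlock()
		}(name, data)
	}
	wg.Wait()
	return results
}

func main() {
	inputs := map[string][]byte{"a": []byte("alpha"), "b": []byte("beta")}
	for name, sum := range checksumAll(inputs, runtime.NumCPU()) {
		fmt.Println(name, sum)
	}
}
```

Tying the limit to the CPU count is one plausible choice; a smaller limit trades throughput for less pressure on disk and memory when many checksums must be recalculated.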

M   datastack/etags.go

commit 3486ea5747d87077695f1489b07fa62687ad6c69
Author: Victor Lowther <victor@rackn.com>
Date:   Mon Feb 28 11:40:46 2022 -0600

    perf(etags): Avoid opening files to calculate etags.

    There appears to be a huge performance penalty when running
    dr-provision on systems that have any sort of monitoring that hooks
    the open system call.  To work around this, refactor most places we
    check etags to do so without opening the file involved.
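
The commit does not say how an etag is validated without opening the file. One common approach, sketched here as an assumption, is to record the file's size and modification time next to the cached etag and compare them using `os.Lstat`, which reads metadata without issuing an open call. The type `fileStamp` and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// fileStamp (hypothetical) is the metadata stored alongside a cached etag.
type fileStamp struct {
	Size    int64
	ModTime time.Time
}

// stampOf reads size and modification time via os.Lstat, which does not
// open the file.
func stampOf(path string) (fileStamp, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		return fileStamp{}, err
	}
	return fileStamp{Size: fi.Size(), ModTime: fi.ModTime()}, nil
}

// unchanged reports whether the file still matches the cached stamp, in
// which case the cached etag could be reused without reading the file.
func unchanged(path string, cached fileStamp) bool {
	cur, err := stampOf(path)
	return err == nil && cur.Size == cached.Size && cur.ModTime.Equal(cached.ModTime)
}

// demo stamps a temporary file, then rewrites it with different content.
func demo() (bool, bool, error) {
	f, err := os.CreateTemp("", "etag-demo-*")
	if err != nil {
		return false, false, err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("hello"); err != nil {
		f.Close()
		return false, false, err
	}
	if err := f.Close(); err != nil {
		return false, false, err
	}
	cached, err := stampOf(f.Name())
	if err != nil {
		return false, false, err
	}
	before := unchanged(f.Name(), cached)
	if err := os.WriteFile(f.Name(), []byte("hello, world"), 0o600); err != nil {
		return false, false, err
	}
	after := unchanged(f.Name(), cached)
	return before, after, nil
}

func main() {
	before, after, err := demo()
	fmt.Println(before, after, err)
}
```

A metadata comparison can miss an edit that preserves both size and modification time, so a real implementation would still need a way to force a full checksum.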

    This patch also adds two utilities that can be used to benchmark and
    identify this issue.

    cmds/etagerator/etags.go will run just the etag BulkProcess function on
    all the directories passed in as command line args.  It can be used to
    get an idea of how long the etag process will take in various
    environments.

    cmds/start_io_trace/startTrace.go runs the complete dr-provision
    startup sequence up to the point that we would start joining a cluster
    or loading data from the database.  It runs with full tracing enabled,
    and emits a go trace log once it finishes the startup process.
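
For readers unfamiliar with Go execution traces, here is a minimal sketch of wrapping a startup routine with the standard `runtime/trace` package. The helper `traceStartup` and the empty startup function are hypothetical stand-ins; this is not the contents of startTrace.go.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
)

// traceStartup (hypothetical helper) runs startup with the Go execution
// tracer enabled and returns the raw trace data.
func traceStartup(startup func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	startup()
	trace.Stop() // flushes remaining trace data into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := traceStartup(func() {
		// Stand-in for the real startup sequence.
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Println("trace bytes:", len(data))
}
```

Trace data written to a file this way can be inspected with `go tool trace`.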

M   backend/dataTracker_test.go
A   cmds/etagerator/etags.go
A   cmds/start_io_trace/startTrace.go
M   datastack/etags.go
M   datastack/stack.go
M   midlayer/fake_midlayer_server_test.go
M   midlayer/static_test.go
M   midlayer/tftp_test.go
M   server/args.go

commit 6b8ffdddcbaa9ec5a6e58b5194ca9b730d0408f4
Author: Victor Lowther <victor@rackn.com>
Date:   Mon Feb 28 13:31:31 2022 -0600

    fix(panic): Fix race when removing a server that can lead to panic.

    The Raft FSM can get into an inconsistent state that allows a
    LastArtifactOp operation to succeed at the same time the node issuing
    the request is being removed from the cluster.  Depending on the exact
    timing, the command can be committed after the node removal command,
    which triggers a panic when replaying the log on the followers and on
    the server.

    Work around this for now by silently ignoring LastArtifactApply
    operations from nodes that we have removed from the cluster.  A longer
    term fix will require adding a dedicated API path for updating this
    that can check to see if the operation is allowed before committing it
    through Raft.
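
The workaround can be sketched as a membership guard in the state machine's apply step. The types and names below (`fsm`, `artifactOp`, `apply`) are hypothetical simplifications and do not mirror consensus/raftFSM.go.

```go
package main

import "fmt"

// artifactOp (hypothetical) is a replicated command issued by a node.
type artifactOp struct {
	NodeID   string
	Artifact string
}

// fsm (hypothetical) tracks cluster members and each member's last artifact.
type fsm struct {
	members map[string]bool
	last    map[string]string
}

func newFSM(ids ...string) *fsm {
	f := &fsm{members: map[string]bool{}, last: map[string]string{}}
	for _, id := range ids {
		f.members[id] = true
	}
	return f
}

// remove drops a node and its per-node state.
func (f *fsm) remove(id string) {
	delete(f.members, id)
	delete(f.last, id)
}

// apply records the op for a current member. An op from a removed node is
// silently ignored instead of being treated as a fatal inconsistency.
// The return value reports whether the op was applied.
func (f *fsm) apply(op artifactOp) bool {
	if !f.members[op.NodeID] {
		return false
	}
	f.last[op.NodeID] = op.Artifact
	return true
}

func main() {
	f := newFSM("n1", "n2")
	fmt.Println(f.apply(artifactOp{NodeID: "n2", Artifact: "a.iso"}))
	f.remove("n2")
	// This op was committed after the removal; it is dropped, not fatal.
	fmt.Println(f.apply(artifactOp{NodeID: "n2", Artifact: "b.iso"}))
}
```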

M   consensus/raftFSM.go
M   frontend/consensus.go
M   server/args.go

commit 1c38989eb86c166448e838d38f3f2ce2eb8d5b5e
Author: Victor Lowther <victor.lowther@gmail.com>
Date:   Fri Feb 18 08:55:00 2022 -0600

    feat(ci): Make Gitlab CI process more scalable.

    This fixes up how the unit tests detect when they are running
    under the Gitlab CI infrastructure, and allows for the test
    phases to run in parallel to the extent feasible.

M   .gitlab-ci.yml
M   test/plugins.go
A   tools/multitest.sh
M   tools/package.sh
M   tools/publish.sh
M   tools/test.sh
A   tools/test_prereqs.sh

commit 430fffbcacd8d0a0a399ba997b9e5d7e88da4271
Author: Victor Lowther <victor@rackn.com>
Date:   Wed Feb 23 10:03:08 2022 -0600

    fix(consensus): Fix deadlock and cert rotation error

    Etags.Remove and Etags.RemoveAll have baked-in deadlocks that are too
    easy to trigger with external uses of Etags.SendArtifactOp.  Fix it by
    splitting SendArtifactOp into two functions, one that takes the locks
    and one that does not, and fix up all in-package callers of
    SendArtifactOp to take whatever locks are needed and then call
    sendArtifactOp.
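
The locking split described above follows a common Go pattern, because `sync.Mutex` is not reentrant: an exported method takes the lock and delegates to an unexported method that assumes the lock is already held. The sketch below uses a hypothetical `store` type rather than the real Etags type.

```go
package main

import (
	"fmt"
	"sync"
)

// store (hypothetical) stands in for a type whose methods share one mutex.
type store struct {
	mu    sync.Mutex
	items map[string]bool
	ops   []string
}

func newStore(keys ...string) *store {
	s := &store{items: map[string]bool{}}
	for _, k := range keys {
		s.items[k] = true
	}
	return s
}

// SendOp is the entry point for external callers: it takes the lock and
// then delegates to the lock-free helper.
func (s *store) SendOp(op string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sendOp(op)
}

// sendOp assumes s.mu is already held by the caller.
func (s *store) sendOp(op string) {
	s.ops = append(s.ops, op)
}

// Remove takes the lock once and calls sendOp. Calling SendOp here would
// try to lock s.mu a second time and deadlock.
func (s *store) Remove(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.items, key)
	s.sendOp("remove " + key)
}

func main() {
	s := newStore("a")
	s.Remove("a")
	s.SendOp("ping")
	fmt.Println(s.ops)
}
```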

    The consensus HA TLS handshaking strategy could fail if a root cert
    rotation happened while one of the nodes was out of the cluster. Once
    that happened, the missing node would be unable to rejoin the cluster
    when it came back up due to it not having the most recent root
    certificate for TLS peer cert checking.  This has been resolved with
    two fixes:

    * On the client side (the one initiating the connection), we start by
      picking a cert that is signed by the most recent cert in the Roots
      array.  If the TLS handshake fails, we will iterate through the Roots
      array in order from most recent to oldest, retrying the connection
      with each until either the handshake passes or we run out of root
      certs.  If we run out of root certs, we wrap around and try again.

    * On the server side (the one receiving the connection), we ratchet
      the same way, except we move on to the next oldest cert after
      len(fsm.Roots) handshake failures instead of at every one.  If we
      run out of certs, we will emit an error message to the log and wrap
      around.

    These two behaviours ensure that all possible current common sets of
    mutually valid root certs are tested when creating TLS connections
    between consensus members.
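
The two ratchets can be modelled as index arithmetic over the Roots array, where index 0 is the most recent cert. The sketch below assumes, as a simplification, that both sides hold the same number of roots and count handshake failures in lockstep; the function names are hypothetical. Under those assumptions every client/server pairing is attempted within numRoots*numRoots failures.

```go
package main

import "fmt"

// clientRootIndex (hypothetical) picks the root used by the client after
// the given number of failed handshakes: one step older per failure,
// wrapping around when the roots run out.
func clientRootIndex(failures, numRoots int) int {
	if numRoots <= 0 {
		return -1
	}
	return failures % numRoots
}

// serverRootIndex (hypothetical) advances one step only after numRoots
// failures, so the client can cycle through all of its roots against each
// server choice.
func serverRootIndex(failures, numRoots int) int {
	if numRoots <= 0 {
		return -1
	}
	return (failures / numRoots) % numRoots
}

// pairsTried counts the distinct (client, server) root pairings attempted
// over the first numRoots*numRoots failures.
func pairsTried(numRoots int) int {
	seen := map[[2]int]bool{}
	for f := 0; f < numRoots*numRoots; f++ {
		seen[[2]int{clientRootIndex(f, numRoots), serverRootIndex(f, numRoots)}] = true
	}
	return len(seen)
}

func main() {
	const n = 3 // illustrative number of trusted roots
	for f := 0; f < n*n; f++ {
		fmt.Printf("failure %d: client root %d, server root %d\n",
			f, clientRootIndex(f, n), serverRootIndex(f, n))
	}
	fmt.Println("distinct pairings:", pairsTried(n))
}
```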

    Additionally, the root cert rotation code was incorrectly calculating
    when to rotate the root certificates, leading to an insufficient
    number of known trusted roots and the possibility of the nodes not
    having enough certs in common to mutually authenticate their peers.
    We now handle this like so:

    1. On cluster create, we seed a TLS certificate that every node in the
       cluster uses for mutual peer authentication.  This cert is valid from
       time.Now().Add(-consensus.CertGrace) (to handle servers that do not
       have well-synchronized clocks) through time.Now().Add(RootCertExpire+CertGrace).
       This gives the initial cert a 3 month lifespan.

    2. Every RootCertExpire/(2*MaxTrustRoots+1), the leader of the
       consensus cluster checks to see if it needs to generate a new root
       certificate by checking to see if the first entry in the Roots
       array of the finite state machine is more than
       RootCertExpire/MaxTrustRoots old. If so, it generates a new
       self-signed cert, adds it to the front of the Roots array in the
       finite state machine, removes any certs in the Roots array that
       have expired, and publishes the new FSM via Raft.
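
The timing rules in the two steps above reduce to a few lines of arithmetic. The constant values below are illustrative assumptions, not the values dr-provision uses, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; the real constants live in the consensus package.
const (
	rootCertExpire = 90 * 24 * time.Hour
	certGrace      = 24 * time.Hour
	maxTrustRoots  = 3
)

// validity returns the NotBefore/NotAfter window for a cert created at now:
// backdated by grace to tolerate clock skew, and extended by grace at the end.
func validity(now time.Time, expire, grace time.Duration) (time.Time, time.Time) {
	return now.Add(-grace), now.Add(expire + grace)
}

// checkInterval is how often the leader checks whether rotation is due.
func checkInterval(expire time.Duration, maxRoots int) time.Duration {
	return expire / time.Duration(2*maxRoots+1)
}

// needsNewRoot reports whether the newest root (the first entry in Roots)
// is old enough that a new self-signed root should be generated.
func needsNewRoot(newestAge, expire time.Duration, maxRoots int) bool {
	return newestAge > expire/time.Duration(maxRoots)
}

func main() {
	notBefore, notAfter := validity(time.Now(), rootCertExpire, certGrace)
	fmt.Println("valid for:", notAfter.Sub(notBefore))
	fmt.Println("check every:", checkInterval(rootCertExpire, maxTrustRoots))
	fmt.Println("rotate at 31 days:",
		needsNewRoot(31*24*time.Hour, rootCertExpire, maxTrustRoots))
}
```

Because the check interval is shorter than half the rotation threshold, the leader checks at least twice per rotation period, so a single missed check does not delay a rotation by a full period.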

    The failover tests have been updated to have much increased cert
    rotation rates in order to test this code.

M   .gitlab-ci.yml
M   consensus/raft.go
M   consensus/raftFSM.go
M   consensus/serverAPI.go
M   datastack/etags.go
M   failover_test/consensus_failover_test.go
M   go.mod
M   go.sum
M   server/args.go

End of Note