commit 0d70ee75e712d6ee726365c1731c13453ea15dbb
Author: Victor Lowther <victor@rackn.com>
Date: Thu Mar 3 12:48:12 2022 -0600
perf(etags): Parallelize etag bulk processing.
This refactors etag bulk checking to operate in as parallel a fashion
as possible while not causing the system to explode too much if it
winds up needing to recalculate a bunch of checksums.
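The bounded-parallelism shape described above can be sketched with a
counting-semaphore channel (illustrative code only; `bulkChecksum` and
its in-memory payloads are hypothetical stand-ins, not the actual
datastack/etags.go API):

```go
// Bounded-parallel bulk checksumming: a semaphore channel caps the
// number of goroutines recalculating checksums at once, so a large
// backlog cannot blow up memory or CPU usage.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"sync"
)

// bulkChecksum checksums every payload, running at most limit workers
// concurrently. Results come back in input order.
func bulkChecksum(payloads [][]byte, limit int) []string {
	if limit <= 0 {
		limit = runtime.NumCPU()
	}
	sums := make([]string, len(payloads))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, p := range payloads {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot before spawning
		go func(i int, p []byte) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			h := sha256.Sum256(p)
			sums[i] = hex.EncodeToString(h[:])
		}(i, p)
	}
	wg.Wait()
	return sums
}

func main() {
	sums := bulkChecksum([][]byte{[]byte("a"), []byte("b")}, 2)
	fmt.Println(len(sums))
}
```

Each goroutine writes only its own index of `sums`, so no extra locking
is needed around the results slice.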
M datastack/etags.go
commit 3486ea5747d87077695f1489b07fa62687ad6c69
Author: Victor Lowther <victor@rackn.com>
Date: Mon Feb 28 11:40:46 2022 -0600
perf(etags): Avoid opening files to calculate etags.
There appears to be a huge performance penalty when running
dr-provision on systems that have any sort of monitoring that hooks
the open system call. To work around this, refactor most places we
check etags to do so without opening the file involved.
This patch also adds two utilities that can be used to benchmark and
identify this issue.
cmds/etagerator/etags.go will run just the etag BulkProcess function on
all the directories passed in as command line args. It can be used to
get an idea of how long the etag process will take in various
environments.
cmds/start_io_trace/startTrace.go runs the complete dr-provision
startup sequence up to the point that we would start joining a cluster
or loading data from the database. It runs with full tracing enabled,
and emits a go trace log once it finishes the startup process.
M backend/dataTracker_test.go
A cmds/etagerator/etags.go
A cmds/start_io_trace/startTrace.go
M datastack/etags.go
M datastack/stack.go
M midlayer/fake_midlayer_server_test.go
M midlayer/static_test.go
M midlayer/tftp_test.go
M server/args.go
commit 6b8ffdddcbaa9ec5a6e58b5194ca9b730d0408f4
Author: Victor Lowther <victor@rackn.com>
Date: Mon Feb 28 13:31:31 2022 -0600
fix(panic): Fix race when removing a server that can lead to panic.
The Raft FSM can get into an inconsistent state that allows a
LastArtifactOp operation to succeed at the same time the node issuing
the request is being removed from the cluster. Depending on the exact
timing, the command can be committed after the node removal command,
triggering a panic when the log is replayed on the followers and on
the server.
Work around this for now by silently ignoring LastArtifactApply
operations from nodes that we have removed from the cluster. A longer
term fix will require adding a dedicated API path for updating this
that can check to see if the operation is allowed before committing it
through Raft.
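The workaround amounts to a guard in the FSM apply path. A minimal
sketch, with illustrative names rather than the real
consensus/raftFSM.go types:

```go
package main

import "fmt"

// fsm tracks which node IDs have been removed from the cluster so
// their late-arriving operations can be dropped during log replay.
type fsm struct {
	removed map[string]bool
}

// applyLastArtifact reports whether the op was applied; ops from
// removed nodes are silently ignored instead of panicking when the
// log is replayed on followers.
func (f *fsm) applyLastArtifact(nodeID string) bool {
	if f.removed[nodeID] {
		return false // node already left the cluster: drop the op
	}
	// ... apply the operation normally ...
	return true
}

func main() {
	f := &fsm{removed: map[string]bool{"node2": true}}
	fmt.Println(f.applyLastArtifact("node1"), f.applyLastArtifact("node2"))
}
```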
M consensus/raftFSM.go
M frontend/consensus.go
M server/args.go
commit 1c38989eb86c166448e838d38f3f2ce2eb8d5b5e
Author: Victor Lowther <victor.lowther@gmail.com>
Date: Fri Feb 18 08:55:00 2022 -0600
feat(ci): Make Gitlab CI process more scalable.
This fixes up how the unit tests detect when they are running
under the Gitlab CI infrastructure, and allows the test
phases to run in parallel to the extent feasible.
M .gitlab-ci.yml
M test/plugins.go
A tools/multitest.sh
M tools/package.sh
M tools/publish.sh
M tools/test.sh
A tools/test_prereqs.sh
commit 430fffbcacd8d0a0a399ba997b9e5d7e88da4271
Author: Victor Lowther <victor@rackn.com>
Date: Wed Feb 23 10:03:08 2022 -0600
fix(consensus): Fix deadlock and cert rotation error
Etags.Remove and Etags.RemoveAll have baked-in deadlocks that are too
easy to trigger with external uses of Etags.SendArtifactOp. Fix them
by splitting SendArtifactOp into two functions, one that takes the
locks and one that does not, and fix up all in-package callers of
SendArtifactOp to take whatever locks are needed and then call
sendArtifactOp.
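The lock-splitting pattern looks roughly like this (illustrative
types, not the real datastack API): the exported entry point takes the
lock, while the unexported variant assumes the caller already holds
it, so in-package callers that hold the lock cannot deadlock on a
re-entrant Lock():

```go
package main

import (
	"fmt"
	"sync"
)

type Etags struct {
	mu  sync.Mutex
	ops []string
}

// SendArtifactOp is the external entry point; it takes the lock.
func (e *Etags) SendArtifactOp(op string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.sendArtifactOp(op)
}

// sendArtifactOp requires e.mu to already be held by the caller.
func (e *Etags) sendArtifactOp(op string) {
	e.ops = append(e.ops, op)
}

// RemoveAll holds the lock across the whole batch, so it must call
// the unlocked variant; calling SendArtifactOp here would deadlock.
func (e *Etags) RemoveAll() {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, name := range []string{"a", "b"} {
		e.sendArtifactOp("remove " + name)
	}
}

func main() {
	e := &Etags{}
	e.SendArtifactOp("add x")
	e.RemoveAll()
	fmt.Println(len(e.ops))
}
```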
The consensus HA TLS handshaking strategy could fail if a root cert
rotate happened while one of the nodes was out of the cluster. Once
that happened, the missing node would be unable to rejoin the cluster
when it came back up due to it not having the most recent root
certificate for TLS peer cert checking. This has been resolved with
two fixes:
* On the client side (the one initiating the connection), we start by
picking a cert that is signed by the most recent cert in the Roots
array. If the TLS handshake fails, we will iterate through the Roots
array in order from most recent to oldest, retrying the connection
with each until either the handshake passes or we run out of root
certs. If we run out of root certs, we wrap around and try again.
* On the server side (the one receiving the connection), we ratchet
the same way, except we move on to the next oldest cert after
len(fsm.Roots) handshake failures instead of at every one. If we
run out of certs, we will emit an error message to the log and wrap
around.
These two behaviours ensure that all possible current common sets of
mutually valid root certs are tested when creating TLS connections
between consensus members.
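The client-side ratchet reduces to a retry loop over the Roots array,
newest first, wrapping around when exhausted. A minimal sketch where
dial stands in for the real TLS handshake plumbing in consensus/ (all
names here are illustrative):

```go
package main

import "fmt"

// connectWithRatchet tries each root cert from newest to oldest,
// retrying the handshake with the next one on failure and wrapping
// around after the oldest, up to maxAttempts total tries.
func connectWithRatchet(roots []string, dial func(root string) bool, maxAttempts int) (string, bool) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		root := roots[attempt%len(roots)] // wraps around past the oldest
		if dial(root) {
			return root, true
		}
	}
	return "", false
}

func main() {
	roots := []string{"root-2022", "root-2021", "root-2020"}
	// Simulated peer that only trusts the middle (older) root.
	dial := func(root string) bool { return root == "root-2021" }
	got, ok := connectWithRatchet(roots, dial, 10)
	fmt.Println(got, ok)
}
```

The server side ratchets with the same ordering, but only advances to
the next-oldest cert after len(fsm.Roots) failures rather than on
every one.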
Additionally, the root cert rotation code was incorrectly calculating
when to rotate the root certificates, leading to an insufficient
number of known trusted roots and the possibility of the nodes not
having enough certs in common to mutually authenticate their peers.
We now handle this like so:
1. On cluster create, we seed a TLS certificate that every node in the
cluster uses for mutual peer authentication. This cert is valid from
time.Now().Add(-consensus.CertGrace) (to handle servers that do not
have well-synchronized clocks) through time.Now().Add(RootCertExpire+CertGrace).
This gives the initial cert a 3 month lifespan.
2. Every RootCertExpire/(2*MaxTrustRoots+1), the leader of the
consensus cluster checks to see if it needs to generate a new root
certificate by checking to see if the first entry in the Roots
array of the finite state machine is more than
RootCertExpire/MaxTrustRoots old. If so, it generates a new
self-signed cert, adds it to the front of the Roots array in the
finite state machine, removes any certs in the Roots array that
have expired, and publishes the new FSM via Raft.
The failover tests have been updated to use greatly increased cert
rotation rates in order to exercise this code.
M .gitlab-ci.yml
M consensus/raft.go
M consensus/raftFSM.go
M consensus/serverAPI.go
M datastack/etags.go
M failover_test/consensus_failover_test.go
M go.mod
M go.sum
M server/args.go