8.3.2.135. v4.7.18

commit 6567c374ab5471615f5e7430ce6c1b37212ba731
Author: Victor Lowther <victor@rackn.com>
Date:   Thu Mar 3 12:48:12 2022 -0600

    perf(etags): Parallelize etag bulk processing.

    This refactors etag bulk checking to operate in as parallel a fashion
    as possible while not causing the system to explode too much if it
    winds up needing to recalculate a bunch of checksums.

M   datastack/etags.go
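
As an illustration of the bounded parallelism described in the commit above, here is a minimal Go sketch. It is not the code from datastack/etags.go: the names bulkProcess and checksum, the in-memory input map, and the use of SHA-256 are all assumptions made for the example. A buffered channel acts as a semaphore so that no more than limit checksum calculations run at once.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// checksum is a hypothetical stand-in for an expensive etag recalculation.
func checksum(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// bulkProcess (hypothetical name) computes a checksum for every entry in
// items, running at most limit computations at the same time.
func bulkProcess(items map[string][]byte, limit int) map[string]string {
	if limit < 1 {
		limit = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string, len(items))
		sem     = make(chan struct{}, limit) // counting semaphore
	)
	for name, data := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are already running
		go func(name string, data []byte) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next item
			sum := checksum(data)
			mu.Lock()
			results[name] = sum
			mu.Unlock()
		}(name, data)
	}
	wg.Wait()
	return results
}
```

Choosing limit close to the number of CPU cores keeps the work parallel without starting one goroutine per file when many checksums need recalculating.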

commit 737e985d9e03e71783d116af73db18611afa6bc9
Author: Victor Lowther <victor@rackn.com>
Date:   Mon Feb 28 11:40:46 2022 -0600

    perf(etags): Avoid opening files to calculate etags.

    There appears to be a huge performance penalty when running
    dr-provision on systems that have any sort of monitoring that hooks
    the open system call.  To work around this, refactor most places we
    check etags to do so without opening the file involved.

    This patch also adds two utilities that can be used to benchmark and
    identify this issue.

    cmds/etagerator/etags.go will run just the etag BulkProcess function on
    all the directories passed in as command line args.  It can be used to
    get an idea of how long the etag process will take in various
    environments.

    cmds/start_io_trace/startTrace.go runs the complete dr-provision
    startup sequence up to the point that we would start joining a cluster
    or loading data from the database.  It runs with full tracing enabled,
    and emits a go trace log once it finishes the startup process.

M   backend/dataTracker_test.go
A   cmds/etagerator/etags.go
A   cmds/start_io_trace/startTrace.go
M   datastack/etags.go
M   datastack/stack.go
M   midlayer/fake_midlayer_server_test.go
M   midlayer/static_test.go
M   midlayer/tftp_test.go
M   server/args.go

commit aaab33873d0573d89eb22a9b68384c61ce5cde7d
Author: Victor Lowther <victor@rackn.com>
Date:   Mon Feb 28 13:31:31 2022 -0600

    fix(panic): Fix race when removing a server that can lead to panic.

    The Raft FSM can get into an inconsistent state that allows a
    LastArtifactOp operation to succeed at the same time the node issuing
    the request is being removed from the cluster.  Depending on the exact
    timing, the command can be committed after the node removal command,
    leading to a panic when replaying the log on the followers and on the
    server.

    Work around this for now by silently ignoring LastArtifactApply
    operations from nodes that we have removed from the cluster.  A longer
    term fix will require adding a dedicated API path for updating this
    that can check to see if the operation is allowed before committing it
    through Raft.

M   consensus/raftFSM.go
M   frontend/consensus.go
M   server/args.go
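
The workaround in the commit above can be pictured with a small state machine that drops updates from nodes it no longer knows about. This Go sketch is a simplified assumption, not the code in consensus/raftFSM.go: the command and fsm types, the command kind strings, and the apply method are hypothetical, and no real Raft library is involved.

```go
package main

import "sync"

// command (hypothetical) is a simplified replicated log entry.
type command struct {
	Kind   string // "remove-node" or "last-artifact-op"
	NodeID string
	Value  uint64
}

// fsm (hypothetical) tracks cluster members and a per-node artifact marker.
type fsm struct {
	mu           sync.Mutex
	members      map[string]bool
	lastArtifact map[string]uint64
}

func newFSM(nodes ...string) *fsm {
	f := &fsm{members: map[string]bool{}, lastArtifact: map[string]uint64{}}
	for _, n := range nodes {
		f.members[n] = true
	}
	return f
}

// apply runs one committed command and reports whether it changed state.
// Updates from nodes that are no longer members are ignored instead of
// being treated as a fatal inconsistency.
func (f *fsm) apply(c command) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	switch c.Kind {
	case "remove-node":
		delete(f.members, c.NodeID)
		delete(f.lastArtifact, c.NodeID)
		return true
	case "last-artifact-op":
		if !f.members[c.NodeID] {
			return false // node already removed: silently ignore
		}
		f.lastArtifact[c.NodeID] = c.Value
		return true
	}
	return false
}
```

Because the decision depends only on state built from earlier log entries, every replica that replays the same log reaches the same result, whichever order the two commands were committed in.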

End of Note