DD crashed when storage servers exceed 1200

I built a FoundationDB cluster running version 7.1.31 in three_datacenter mode.

As I add storage processes to the cluster, the data distributor (DD) crashes once the number of storage servers exceeds 1200 and approaches 1400.

<Event Severity="10" Time="1694420800.972599" DateTime="2023-09-11T08:26:40Z" Type="GetMagazineSample" ID="0000000000000000" Size="256" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4328d58 0x426a12a 0x426a355 0x40c65a6 0x40c690b 0x2730b65 0x272de50 0x272e485 0x272ec44 0x188fd90 0x18ad0a5 0x18aeaae 0x1850651 0x185485b 0x18595ae 0x17295bc 0x173093d 0x42b2f48 0xdb5eff 0x7f73df71f495" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD,RK" />
<Event Severity="20" Time="1694420800.972599" DateTime="2023-09-11T08:26:40Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1694420801.036488" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f73dfad95d0 0x40c723d 0x24d760f 0x24d90c5 0x24cf36d 0x185045b 0x185485b 0x18595ae 0x17295bc 0x173093d 0x42b2f48 0xdb5eff 0x7f73df71f495 0xe18e42" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD,RK" />
<Event Severity="20" Time="1694420800.972599" DateTime="2023-09-11T08:26:40Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1694420801.161554" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f73dfad95d0 0x40d0b83 0x11f7349 0x24c7359 0x1850e5d 0x185485b 0x18595ae 0x17295bc 0x173093d 0x42b2f48 0xdb5eff 0x7f73df71f495 0xe18e42" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD,RK" />
<Event Severity="20" Time="1694420803.655281" DateTime="2023-09-11T08:26:43Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1694420803.664908" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f73dfad95d0 0x1ec9752 0x1ecf409 0x1efbaeb 0x42b2f48 0xdb5eff 0x7f73df71f495 0xe18e42" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1694420823.354485" DateTime="2023-09-11T08:27:03Z" Type="HugeArenaSample" ID="0000000000000000" Count="1" Size="70404" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x42630c7 0x42468b3 0x4246d66 0x40e4bc6 0x42b2f48 0xdb5eff 0x7f73df71f495" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1694420828.964950" DateTime="2023-09-11T08:27:08Z" Type="HugeArenaSample" ID="0000000000000000" Count="1" Size="15072" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x42630c7 0x42468b3 0x4246d66 0x28250a1 0x37e6041 0x37e6a97 0x3724388 0x37ca538 0x2ab22d8 0xe87a90 0x40e1738 0x40e1ab8 0x42b2f48 0xdb5eff 0x7f73df71f495" ThreadID="920255406301111743" Machine="100.71.8.128:30005" LogGroup="default" Roles="DD" />
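For anyone following along, the addr2line commands can be pulled out of a trace file mechanically instead of being copied by hand. This is a minimal Python sketch, assuming one XML event per line as in the excerpt above. The `backtrace_commands` helper name is made up for illustration, and the two sample events are abbreviated copies of the ones above.

```python
import xml.etree.ElementTree as ET


def backtrace_commands(lines, event_type=None):
    """Yield (Type, command) for each trace event that carries a backtrace."""
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip truncated events or events split across lines
        if event_type and event.get("Type") != event_type:
            continue
        # In the excerpt, the sample events use "Backtrace" and
        # Net2RunLoopTrace uses "Trace" for the same kind of command.
        command = event.get("Backtrace") or event.get("Trace")
        if command:
            yield event.get("Type"), command


# Abbreviated copies of two events from the excerpt (most attributes dropped).
sample = [
    '<Event Severity="10" Type="HugeArenaSample" Count="1" Size="70404" '
    'Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x42630c7 0x42468b3" Roles="DD" />',
    '<Event Severity="20" Type="Net2RunLoopTrace" '
    'Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f73dfad95d0 0x40c723d" Roles="DD,RK" />',
]

commands = list(backtrace_commands(sample))
for event_type, command in commands:
    print(event_type, "->", command)
```

Each printed command can then be run against the matching debug binary, as done below.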

Here’s the debug info:

# addr2line -e fdbserver.debug -p -C -f -i 0x4328d58 0x426a12a 0x426a355 0x40c65a6 0x40c690b 0x2730b65 0x272de50 0x272e485 0x272ec44 0x188fd90 0x18ad0a5 0x18aeaae 0x1850651 0x185485b 0x18595ae 0x17295bc
std::string::_M_rep() const at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/basic_string.h:3404
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/basic_string.h:3604
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/basic_string.h:6135
 (inlined by) BaseTraceEvent::backtrace(std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1216
std::string::_M_rep() const at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/basic_string.h:3404
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/basic_string.h:3768
 (inlined by) FastAllocator<256>::getMagazine() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/FastAlloc.cpp:518
FastAllocator<256>::allocate() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/FastAlloc.cpp:335
waitForContinuousFailure(IFailureMonitor* const&, Endpoint const&, double const&, double const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/FastAlloc.h:217
 (inlined by) waitForContinuousFailure(IFailureMonitor* const&, Endpoint const&, double const&, double const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/FailureMonitor.actor.cpp:33
IFailureMonitor::onFailedFor(Endpoint const&, double, double) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/FailureMonitor.actor.cpp:72
ReplyPromise<Void>::getEndpoint(TaskPriority) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/fdbrpc.h:148 (discriminator 8)
 (inlined by) void setReplyPriority<Void>(ReplyPromise<Void> const&, TaskPriority) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/fdbrpc.h:255 (discriminator 8)
 (inlined by) Future<decltype (((((getReplyPromise((declval<ReplyPromise<Void> >)())).getFuture)()).getValue)())> RequestStream<ReplyPromise<Void> >::getReply<ReplyPromise<Void> >(ReplyPromise<Void> const&, TaskPriority) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/fdbrpc.h:703 (discriminator 8)
 (inlined by) Future<ErrorOr<decltype (((((getReplyPromise((declval<ReplyPromise<Void> >)())).getFuture)()).getValue)())> > RequestStream<ReplyPromise<Void> >::getReplyUnlessFailedFor<ReplyPromise<Void> >(ReplyPromise<Void> const&, double, double, TaskPriority) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbrpc/fdbrpc.h:805 (discriminator 8)
Future<ErrorOr<Void> >::Future(Future<ErrorOr<Void> > const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:810
 (inlined by) StrictFuture<ErrorOr<Void> >::StrictFuture(Future<ErrorOr<Void> > const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:896
 (inlined by) a_body1loopBody1 at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/WaitFailure.actor.cpp:48
waitFailureClient(RequestStream<ReplyPromise<Void> > const&, double const&, double const&, bool const&, TaskPriority const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:315
 (inlined by) a_body1 at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:294
 (inlined by) WaitFailureClientActor at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:665
 (inlined by) waitFailureClient(RequestStream<ReplyPromise<Void> > const&, double const&, double const&, bool const&, TaskPriority const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/WaitFailure.actor.cpp:40
Future<Void>::Future(Future<Void> const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:810
 (inlined by) StrictFuture<Void>::StrictFuture(Future<Void> const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:896
 (inlined by) a_body1loopBody1 at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/WaitFailure.actor.cpp:74
 (inlined by) a_body1loopHead1 at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:745
 (inlined by) a_body1 at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:724
 (inlined by) WaitFailureClientStrictActor at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/WaitFailure.actor.g.cpp:991
 (inlined by) waitFailureClientStrict(RequestStream<ReplyPromise<Void> > const&, double const&, TaskPriority const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/WaitFailure.actor.cpp:70
DDTeamCollectionImpl::StorageServerFailureTrackerActorState<DDTeamCollectionImpl::StorageServerFailureTrackerActor>::a_body1loopBody1(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:1578 (discriminator 2)
DDTeamCollectionImpl::StorageServerFailureTrackerActorState<DDTeamCollectionImpl::StorageServerFailureTrackerActor>::a_body1loopHead1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:8548
 (inlined by) DDTeamCollectionImpl::StorageServerFailureTrackerActorState<DDTeamCollectionImpl::StorageServerFailureTrackerActor>::a_body1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:8515
 (inlined by) DDTeamCollectionImpl::StorageServerFailureTrackerActor::StorageServerFailureTrackerActor(DDTeamCollection* const&, TCServerInfo* const&, Database const&, ServerStatus* const&, long const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:9011
 (inlined by) DDTeamCollectionImpl::storageServerFailureTracker(DDTeamCollection* const&, TCServerInfo* const&, Database const&, ServerStatus* const&, long const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:1527
 (inlined by) DDTeamCollectionImpl::StorageServerTrackerActorState<DDTeamCollectionImpl::StorageServerTrackerActor>::a_body1loopBody1cont1(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:1185
DDTeamCollectionImpl::StorageServerTrackerActorState<DDTeamCollectionImpl::StorageServerTrackerActor>::a_body1loopBody1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:5141
DDTeamCollectionImpl::StorageServerTrackerActorState<DDTeamCollectionImpl::StorageServerTrackerActor>::a_body1loopHead1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:4898
 (inlined by) DDTeamCollectionImpl::StorageServerTrackerActorState<DDTeamCollectionImpl::StorageServerTrackerActor>::a_body1(int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:4833
 (inlined by) DDTeamCollectionImpl::StorageServerTrackerActor::StorageServerTrackerActor(DDTeamCollection* const&, Database const&, TCServerInfo* const&, Promise<Void> const&, long const&, DDEnabledState const* const&, bool const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:6591
 (inlined by) DDTeamCollectionImpl::storageServerTracker(DDTeamCollection* const&, Database const&, TCServerInfo* const&, Promise<Void> const&, long const&, DDEnabledState const* const&, bool const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:998
 (inlined by) DDTeamCollection::storageServerTracker(Database, TCServerInfo*, Promise<Void>, long, DDEnabledState const&, bool) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3426
 (inlined by) DDTeamCollection::addServer(StorageServerInterface, ProcessClass, Promise<Void>, long, DDEnabledState const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4779
DDTeamCollectionImpl::InitActorState<DDTeamCollectionImpl::InitActor>::a_body1(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:482
DDTeamCollectionImpl::InitActor::InitActor(DDTeamCollection* const&, Reference<InitialDataDistribution> const&, DDEnabledState const* const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3258
 (inlined by) DDTeamCollectionImpl::init(DDTeamCollection* const&, Reference<InitialDataDistribution> const&, DDEnabledState const* const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:463
 (inlined by) DDTeamCollection::init(Reference<InitialDataDistribution>, DDEnabledState const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3406
 (inlined by) DDTeamCollectionImpl::RunActorState<DDTeamCollectionImpl::RunActor>::a_body1(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2945
 (inlined by) DDTeamCollectionImpl::RunActor::RunActor(Reference<DDTeamCollection> const&, Reference<InitialDataDistribution> const&, TeamCollectionInterface const&, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > > const&, DDEnabledState const* const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:22169
 (inlined by) DDTeamCollectionImpl::run(Reference<DDTeamCollection> const&, Reference<InitialDataDistribution> const&, TeamCollectionInterface const&, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > > const&, DDEnabledState const* const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2934
 (inlined by) DDTeamCollection::run(Reference<DDTeamCollection>, Reference<InitialDataDistribution>, TeamCollectionInterface, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > >, DDEnabledState const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:5143
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2loopBody1(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DataDistribution.actor.cpp:816
 (inlined by) a_body1loopBody1cont2break1 at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DataDistribution.actor.g.cpp:5889
 (inlined by) a_body1loopBody1cont2loopBody1 at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DataDistribution.actor.g.cpp:5826

Is there a bug in DD?

We can easily reproduce the error using this script:

#!/bin/bash

# set -x

fdblogdir=/tmp/logs
public_address=100.71.8.128

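# Use ';' rather than '&&' so cleanup still runs when no fdbserver is running (pkill then exits non-zero).
pkill -9 fdbserver; sleep 5; rm -rf /fdb-log/data/* /fdb-storage/data/* /var/lib/foundationdb/data/* $fdblogdir/*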
mkdir -p /etc/foundationdb/ && echo "8cKn7vrK:83b6hqP1@$public_address:30000" > /etc/foundationdb/fdb.cluster

function scale_out_stateless()
{
    if [ $# -lt 4 ]; then echo "usage: scale_out_stateless machine_count start_port process_count datacenter" && exit 1; fi
    machine_count=$1
    start_port=$2
    stateless_count=$3
    datacenter=$4
    if [ -z $datacenter ]; then datacenter=one_dc ; fi

    stateless_id=0
    for machine_id in `seq 0 $((machine_count-1))`; do
        for i in `seq 0 $((stateless_count-1))`; do
            port=$((stateless_id+start_port))
            /usr/sbin/fdbserver \
                --datadir /var/lib/foundationdb/data/$port \
                --locality-diskid=diskstateless-$port \
                --machine-id machine-$machine_id-$datacenter \
                --datacenter-id $datacenter \
                --public-address $public_address:$port \
                --class stateless --listen-address public --cluster-file /etc/foundationdb/fdb.cluster --logdir $fdblogdir &
            ((stateless_id++))
        done
    done
}

function scale_out_log()
{
    if [ $# -lt 4 ]; then echo "usage: scale_out_log machine_count start_port process_count datacenter" && exit 1; fi
    machine_count=$1
    start_port=$2
    log_count=$3
    datacenter=$4
    if [ -z $datacenter ]; then datacenter=one_dc ; fi

    log_id=0
    for machine_id in `seq 0 $((machine_count-1))`; do
        for i in `seq 0 $((log_count-1))`; do
            port=$((log_id+start_port))
            /usr/sbin/fdbserver \
                --datadir /fdb-log/data/$port \
                --locality-diskid=disklog-$port \
                --machine-id machine-$machine_id-$datacenter \
                --datacenter-id $datacenter \
                --public-address $public_address:$port \
                --class log --listen-address public --cluster-file /etc/foundationdb/fdb.cluster --logdir $fdblogdir &
            ((log_id++))
            echo "fdbserver log $port started"
            sleep 0.09
        done
    done
}

function scale_out_storage()
{
    if [ $# -lt 4 ]; then echo "usage: scale_out_storage machine_count start_port process_count datacenter" && exit 1; fi
    machine_count=$1
    start_port=$2
    storage_count=$3
    datacenter=$4
    if [ -z $datacenter ]; then datacenter=one_dc ; fi

    storage_id=0
    for machine_id in `seq 0 $((machine_count-1))`; do
        for i in `seq 0 $((storage_count-1))`; do
            port=$((storage_id+start_port))
            /usr/sbin/fdbserver \
                --datadir /fdb-storage/data/$port \
                --locality-diskid=diskstorage-$port \
                --machine-id machine-$machine_id-$datacenter \
                --datacenter-id $datacenter \
                --public-address $public_address:$port \
                --class storage --listen-address public --cluster-file /etc/foundationdb/fdb.cluster --logdir $fdblogdir &
            ((storage_id++))
            echo "fdbserver storage $port started"
            sleep 0.1
        done
    done
}

scale_out_stateless 10    30000 1 dc1
scale_out_log       10     38000 1 dc1
scale_out_storage   100    39000 4 dc1

scale_out_stateless 10    40000 1 dc2
scale_out_log       10    48000 1 dc2
scale_out_storage   100    49000 4 dc2

scale_out_stateless 10    50000 1 dc3
scale_out_log       10    58000 1 dc3
scale_out_storage   100   59000 4 dc3

sleep 10 && fdbcli --exec 'configure new three_datacenter ssd; configure logs=5; configure proxies=4; coordinators auto'
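To make the scale of this reproduction explicit, here is a small Python tally of what the nine calls above start. It is only a sketch of the script's arithmetic: the `CALLS` table is copied from the calls above, and the variable names are illustrative.

```python
from collections import Counter

# (role, machine_count, start_port, processes_per_machine, datacenter),
# copied from the scale_out_* calls in the script.
CALLS = [
    ("stateless", 10, 30000, 1, "dc1"),
    ("log", 10, 38000, 1, "dc1"),
    ("storage", 100, 39000, 4, "dc1"),
    ("stateless", 10, 40000, 1, "dc2"),
    ("log", 10, 48000, 1, "dc2"),
    ("storage", 100, 49000, 4, "dc2"),
    ("stateless", 10, 50000, 1, "dc3"),
    ("log", 10, 58000, 1, "dc3"),
    ("storage", 100, 59000, 4, "dc3"),
]

totals = Counter()       # processes per role
used_ports = Counter()   # how many processes claim each port
machine_ids = set()      # distinct --machine-id values

for role, machines, start_port, per_machine, dc in CALLS:
    count = machines * per_machine
    totals[role] += count
    # The script assigns consecutive ports starting at start_port.
    used_ports.update(range(start_port, start_port + count))
    # The script builds the id as machine-$machine_id-$datacenter.
    machine_ids.update(f"machine-{m}-{dc}" for m in range(machines))

collisions = sorted(port for port, n in used_ports.items() if n > 1)

print("processes per role:", dict(totals))
print("total processes:", sum(totals.values()))
print("unique machine ids:", len(machine_ids))
print("port collisions:", collisions)
```

The script therefore starts 1200 storage processes (plus 30 log and 30 stateless), all on the single address 100.71.8.128, spread over 300 distinct machine ids. That matches the ServerCount="1200" and UniqueMachines="300" values in the BuildTeams event posted later in this thread.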

Looks like DD runs out of memory. Can you try adding more memory to DD? The default is 8GiB.

  -m SIZE, --memory SIZE
                 Resident memory limit. The default value is 8GiB. When
                 specified without a unit, MiB is assumed.

Thanks. I allocated 16GiB to the data distributor role, but it still crashed, and it was using little RSS when it did.

I built debug binaries with -DCMAKE_BUILD_TYPE=Debug and found more info:

[root@6b39606d7fd2 fdbbuild]# addr2line -e bin/fdbserver -p -C -f -i 0x4222af5 0x42599c7 0x4184bd0 0x41844bc 0x5cf98d 0x341547f 0x3410ff1 0x32a5b39 0x640e04 0x329cba1 0x36de9bb 0x36d9bca 0x36d5866 0x36cfd5e 0x36cac3f 0x36c61c8 0x36c1690 0x369fbde 0x139df63 0x138e21b 0x1379bc6 0x1364c1b 0x13523f1 0x1352515 0x130a2bb 0x135aaab 0x134d1b4 0x134d470 0x1309b9a 0x1314b7c 0x1359683 0x134c794 0x134c872 0x1309aba 0x13652d5 0x135287d 0x13529ca 0x1317ae0 0x122c9a3 0x122b7a7 0x122ab34 0x122a72a 0x122d089 0x122b86b 0x123a976 0x1231...
platform::get_backtrace() at /data/foundationdb/flow/Platform.actor.cpp:3339
BaseTraceEvent::backtrace(std::string const&) at /data/foundationdb/flow/Trace.cpp:1216
FastAllocator<256>::getMagazine() at /data/foundationdb/flow/FastAlloc.cpp:518 (discriminator 8)
FastAllocator<256>::allocate() at /data/foundationdb/flow/FastAlloc.cpp:334
FastAllocated<SAV<GetReadVersionReply> >::operator new(unsigned long) at /data/foundationdb/flow/FastAlloc.h:217
Promise<GetReadVersionReply>::Promise() at /data/foundationdb/flow/flow.h:925
DatabaseContext::VersionRequest::VersionRequest(UID, TagSet, Optional<UID>) at /data/foundationdb/fdbclient/DatabaseContext.h:429
Transaction::getReadVersion(unsigned int) at /data/foundationdb/fdbclient/NativeAPI.actor.cpp:7010
Transaction::getReadVersion() at /data/foundationdb/fdbclient/NativeAPI.actor.h:288
Transaction::get(Standalone<StringRef> const&, Snapshot) at /data/foundationdb/fdbclient/NativeAPI.actor.cpp:5206
RYWImpl::ReadActorState<RYWIterator, RYWImpl::ReadActor<RYWIterator> >::a_body1(int) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:113 (discriminator 1)
RYWImpl::ReadActor<RYWIterator>::ReadActor(ReadYourWritesTransaction* const&, RYWImpl::GetValueReq const&, RYWIterator* const&) at /data/foundationdb/fdbbuild/fdbclient/ReadYourWrites.actor.g.cpp:465
Future<Optional<Standalone<StringRef> > > RYWImpl::read<RYWIterator>(ReadYourWritesTransaction* const&, RYWImpl::GetValueReq const&, RYWIterator* const&) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:94 (discriminator 2)
RYWImpl::ReadWithConflictRangeRYWActorState<RYWImpl::GetValueReq, RYWImpl::ReadWithConflictRangeRYWActor<RYWImpl::GetValueReq> >::a_body1(int) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:380 (discriminator 1)
RYWImpl::ReadWithConflictRangeRYWActor<RYWImpl::GetValueReq>::ReadWithConflictRangeRYWActor(ReadYourWritesTransaction* const&, RYWImpl::GetValueReq const&, Snapshot const&) at /data/foundationdb/fdbbuild/fdbclient/ReadYourWrites.actor.g.cpp:2435
Future<RYWImpl::GetValueReq::Result> RYWImpl::readWithConflictRangeRYW<RYWImpl::GetValueReq>(ReadYourWritesTransaction* const&, RYWImpl::GetValueReq const&, Snapshot const&) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:374 (discriminator 2)
Future<RYWImpl::GetValueReq::Result> RYWImpl::readWithConflictRange<RYWImpl::GetValueReq>(ReadYourWritesTransaction*, RYWImpl::GetValueReq const&, Snapshot) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:402
ReadYourWritesTransaction::get(Standalone<StringRef> const&, Snapshot) at /data/foundationdb/fdbclient/ReadYourWrites.actor.cpp:1603 (discriminator 2)
KeyBackedObjectProperty<StorageMetadataType, _IncludeVersion>::get(Reference<ReadYourWritesTransaction>, Snapshot) const at /data/foundationdb/fdbclient/KeyBackedTypes.h:341
DDTeamCollectionImpl::ReadOrCreateStorageMetadataActorState<DDTeamCollectionImpl::ReadOrCreateStorageMetadataActor>::a_body1loopBody1(int) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2910 (discriminator 2)
DDTeamCollectionImpl::ReadOrCreateStorageMetadataActorState<DDTeamCollectionImpl::ReadOrCreateStorageMetadataActor>::a_body1loopHead1(int) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:20819 (discriminator 2)
DDTeamCollectionImpl::ReadOrCreateStorageMetadataActorState<DDTeamCollectionImpl::ReadOrCreateStorageMetadataActor>::a_body1(int) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:20774
DDTeamCollectionImpl::ReadOrCreateStorageMetadataActor::ReadOrCreateStorageMetadataActor(DDTeamCollection* const&, TCServerInfo* const&) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:21203
DDTeamCollectionImpl::readOrCreateStorageMetadata(DDTeamCollection* const&, TCServerInfo* const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2899 (discriminator 2)
DDTeamCollection::readOrCreateStorageMetadata(TCServerInfo*) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3548
DDTeamCollectionImpl::StorageServerTrackerActorState<DDTeamCollectionImpl::StorageServerTrackerActor>::StorageServerTrackerActorState(DDTeamCollection* const&, Database const&, TCServerInfo* const&, Promise<Void> const&, long const&, DDEnabledState const* const&, bool const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:1019 (discriminator 3)
DDTeamCollectionImpl::StorageServerTrackerActor::StorageServerTrackerActor(DDTeamCollection* const&, Database const&, TCServerInfo* const&, Promise<Void> const&, long const&, DDEnabledState const* const&, bool const&) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:6584
DDTeamCollectionImpl::storageServerTracker(DDTeamCollection* const&, Database const&, TCServerInfo* const&, Promise<Void> const&, long const&, DDEnabledState const* const&, bool const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:998
DDTeamCollection::storageServerTracker(Database, TCServerInfo*, Promise<Void>, long, DDEnabledState const&, bool) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3426
DDTeamCollection::addServer(StorageServerInterface, ProcessClass, Promise<Void>, long, DDEnabledState const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4779 (discriminator 2)
DDTeamCollectionImpl::InitActorState<DDTeamCollectionImpl::InitActor>::a_body1(int) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:482 (discriminator 5)
DDTeamCollectionImpl::InitActor::InitActor(DDTeamCollection* const&, Reference<InitialDataDistribution> const&, DDEnabledState const* const&) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:3258
DDTeamCollectionImpl::init(DDTeamCollection* const&, Reference<InitialDataDistribution> const&, DDEnabledState const* const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:463 (discriminator 2)
DDTeamCollection::init(Reference<InitialDataDistribution>, DDEnabledState const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3407
DDTeamCollectionImpl::RunActorState<DDTeamCollectionImpl::RunActor>::a_body1(int) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2945 (discriminator 2)
DDTeamCollectionImpl::RunActor::RunActor(Reference<DDTeamCollection> const&, Reference<InitialDataDistribution> const&, TeamCollectionInterface const&, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > > const&, DDEnabledState const* const&) at /data/foundationdb/fdbbuild/fdbserver/DDTeamCollection.actor.g.cpp:22170
DDTeamCollectionImpl::run(Reference<DDTeamCollection> const&, Reference<InitialDataDistribution> const&, TeamCollectionInterface const&, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > > const&, DDEnabledState const* const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:2934 (discriminator 2)
DDTeamCollection::run(Reference<DDTeamCollection>, Reference<InitialDataDistribution>, TeamCollectionInterface, Reference<IAsyncListener<RequestStream<RecruitStorageRequest> > >, DDEnabledState const&) at /data/foundationdb/fdbserver/DDTeamCollection.actor.cpp:5144
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont3(int) at /data/foundationdb/fdbserver/DataDistribution.actor.cpp:816 (discriminator 11)
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2break1(int) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5889
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2loopBody1(int) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5826 (discriminator 4)
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2loopHead1(int) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5816 (discriminator 2)
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2loopBody1cont1(Void const&, int) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5904 (discriminator 1)
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_body1loopBody1cont2loopBody1when1(Void const&, int) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5919
(anonymous namespace)::DataDistributionActorState<(anonymous namespace)::DataDistributionActor>::a_callback_fire(ActorCallback<(anonymous namespace)::DataDistributionActor, 7, Void>*, Void const&) at /data/foundationdb/fdbbuild/fdbserver/DataDistribution.actor.g.cpp:5947

@Rjerk
Which tool did you use to create this foundationdb-“top”?

I use the Python json and prettytable modules to format the fdbcli status json result. :)
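Something along these lines would produce a similar per-process table. This is a rough sketch and not the poster's actual tool: it uses plain string padding instead of prettytable so it runs with the standard library only, the field names (`cluster.processes`, `address`, `class_type`, `roles[].role`, `memory.used_bytes`) are my assumptions about the status json layout, and the sample document is hand-written rather than real cluster output.

```python
import json


def process_rows(status):
    """Return sorted (address, class, roles, memory in GiB) rows from a status dict."""
    rows = []
    for proc in status.get("cluster", {}).get("processes", {}).values():
        roles = ",".join(r.get("role", "?") for r in proc.get("roles", []))
        mem_gib = proc.get("memory", {}).get("used_bytes", 0) / 2**30
        rows.append((proc.get("address", "?"), proc.get("class_type", "?"),
                     roles, f"{mem_gib:.2f}"))
    return sorted(rows)


def render(rows, headers=("address", "class", "roles", "mem_gib")):
    """Pad every column to its widest cell and join rows into one string."""
    table = [headers, *rows]
    widths = [max(len(str(row[i])) for row in table) for i in range(len(headers))]
    return "\n".join(
        "  ".join(str(cell).ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


# Hand-written sample in the assumed layout, not real cluster output.
sample = json.loads("""
{"cluster": {"processes": {
  "p1": {"address": "100.71.8.128:30005", "class_type": "stateless",
         "roles": [{"role": "data_distributor"}, {"role": "ratekeeper"}],
         "memory": {"used_bytes": 1073741824}},
  "p2": {"address": "100.71.8.128:39000", "class_type": "storage",
         "roles": [{"role": "storage"}],
         "memory": {"used_bytes": 536870912}}
}}}
""")

rows = process_rows(sample)
print(render(rows))
```

In practice the input would come from saving the output of `fdbcli --exec 'status json'` to a file and loading it with `json.load`.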

@Rjerk
:slight_smile: Thanks for answering

This issue looks scary…

We encountered the same problem. DD recruitment repeatedly fails on different stateless processes. So far, memory does not seem to be the cause. Do you have any direction for investigation? Have you ever encountered this situation? Thanks!

From the backtrace above, the code failed while allocating memory in readOrCreateStorageMetadata(), an actor that is created once per storage server in a loop, so there will be 1.2k concurrent transactions. I think it’s probably better to add a random delay at the beginning of that function, i.e.,

		// printf("------ read metadata %s\n", server->getId().toString().c_str());
		// read storage metadata
		wait(delayJittered(1.0));
		loop {
...

This might solve this particular problem. The inherent problem might be that DD is not able to handle more than 1.2k storage servers. We run up to 900 without problems. I am not sure we have tested more than 1.2k storage servers.
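To get a feel for how much a jittered delay spreads out 1.2k transaction starts, here is a small Python simulation. It is not FoundationDB code: I am assuming that the delay is the base value times a uniformly random multiplier, and the 0.9 to 1.1 multiplier range, the 10 ms bucket size, and the function names are assumptions chosen for illustration.

```python
import random
from collections import Counter


def jittered_starts(n, seconds=1.0, offset=0.9, spread=0.2, seed=0):
    """Start times for n tasks, each delayed by seconds * U(offset, offset + spread)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    return [seconds * (offset + spread * rng.random()) for _ in range(n)]


def peak_per_bucket(starts, bucket_seconds=0.01):
    """Largest number of tasks whose start falls into the same time bucket."""
    buckets = Counter(int(t / bucket_seconds) for t in starts)
    return max(buckets.values())


starts = jittered_starts(1200)
print("earliest start: %.3f s, latest start: %.3f s" % (min(starts), max(starts)))
print("peak starts within one 10 ms bucket:", peak_per_bucket(starts))
print("peak without jitter:", peak_per_bucket([1.0] * len(starts)))
```

Under these assumptions the starts are spread over a window of about 200 ms instead of all landing at the same instant. Whether that is enough for DD at this scale is a separate question that the simulation cannot answer.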

We followed your suggestion, added the random delay, and recompiled the binaries. I built a new cluster with them. It was healthy for the first two hours, but then reported this error:

Unable to commit after 5 seconds.

Then the cluster becomes unavailable.
I saw the following error in the log:


Do you have any suggestions for this?

This error says the DB ran into a problem that caused too many repeated recoveries, so the DB stopped and is waiting for the operator to resolve the underlying problem, i.e., whatever is causing the repeated recoveries. After resolving it, the operator can override that knob and restart all processes to start a new recovery. Hopefully the DB can come back.

There are many possible reasons for a recovery, so you should look at the transaction system logs, i.e., those of the CC, MS, CP, GP, and RV roles, for clues. MasterRecoveryState is an important event that tells you the progress of recovery. If its status code is less than 12, the DB can’t accept new commits.
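A quick way to check this on a trace file is to scan for MasterRecoveryState events and look at the last one. This is a minimal Python sketch under stated assumptions: one XML event per line, the attribute names `StatusCode` and `Status` (treat these as my assumption about the event's fields), and the threshold of 12 taken from the explanation above. The two sample events are made up for illustration and are not from the poster's cluster.

```python
import xml.etree.ElementTree as ET

ACCEPTING_THRESHOLD = 12  # threshold quoted in the reply above


def latest_recovery_state(lines):
    """Return (time, status_code, status) of the last MasterRecoveryState event, or None."""
    latest = None
    for line in lines:
        line = line.strip()
        if 'Type="MasterRecoveryState"' not in line:
            continue
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip truncated events
        code = event.get("StatusCode")
        if code is None:
            continue
        latest = (float(event.get("Time", "0")), int(code), event.get("Status", ""))
    return latest


# Made-up sample events for illustration only.
sample = [
    '<Event Severity="10" Time="100.0" Type="MasterRecoveryState" StatusCode="3" '
    'Status="reading_transaction_system_state" Roles="MS" />',
    '<Event Severity="10" Time="105.0" Type="MasterRecoveryState" StatusCode="7" '
    'Status="recruiting_transaction_servers" Roles="MS" />',
]

latest = latest_recovery_state(sample)
if latest is None:
    print("no MasterRecoveryState events found")
else:
    when, code, status = latest
    verdict = "below" if code < ACCEPTING_THRESHOLD else "at or above"
    print(f"last recovery state at {when}: {code} ({status}), {verdict} the threshold")
```

If the last event stays below the threshold across repeated recoveries, the status at which it stalls is usually the place to start reading the surrounding log lines.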

BTW, fdbcli only shows one coordinator. Is that how you configured it? For such a large cluster, you can probably afford to place 5~9 coordinators.

Thanks. The disk utilization was 100% because I had started all processes on the same HDD. I have now rebuilt the cluster and placed the storage, log, and stateless processes on three different HDDs, respectively. In single mode the cluster is healthy, but after switching to 3DC the DD process cannot work.



The DD process is constantly being re-recruited.

@jzhou
From the log analysis, it appears that DD encountered an error while building the storage teams: after TraceEventThrottle_ServerTeamHealthChangeDetected is reported, the whole process enters DDRecruiting. Based on the following log information, do you have any thoughts on this? Thank you!

<Event Severity="10" Time="1697780573.425600" DateTime="2023-10-20T05:42:53Z" Type="TransactionMetrics" ID="d3defdcc53c7be11" Elapsed="5.00001" Cluster="" Internal="1" ReadVersions="0 -1 52" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 51" ReadVersionBatches="0 -1 50" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="0 -1 3" ImmediatePriorityReadVersions="0 -1 49" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 2" ImmediatePriorityReadVersionsCompleted="0 -1 49" LogicalUncachedReads="0 -1 148" PhysicalReadRequests="0 -1 699" PhysicalReadRequestsCompleted="0 -1 699" GetKeyRequests="0 -1 0" GetValueRequests="0 -1 5" GetRangeRequests="0 -1 143" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 11" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 44725982" KeysRead="0 -1 113129" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 1" CommittedMutationBytes="0 -1 72" SetMutations="0 -1 1" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 1" CommitCompleted="0 -1 1" KeyServerLocationRequests="0 -1 7" KeyServerLocationRequestsCompleted="0 -1 7" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="0 -1 50" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="1" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0" MedianGRVLatency="0" MaxGRVLatency="0" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="1" 
ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780573.660447" DateTime="2023-10-20T05:42:53Z" Type="TraceEventThrottle_TeamCollectionInfo" ID="0000000000000000" SuppressedEventCount="2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="BuildTeams" ID="bb6222cc9e722d32" ServerCount="1200" UniqueMachines="300" Primary="1" StorageTeamSize="6" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="BuildTeamsBegin" ID="bb6222cc9e722d32" TeamsToBuild="5999" DesiredTeams="6000" MaxTeams="30000" BadServerTeams="0" PerpetualWigglingTeams="0" UniqueMachines="300" TeamSize="6" Servers="1200" HealthyServers="1200" CurrentTrackedServerTeams="1" HealthyTeamCount="1" TotalTeamCount="1" MachineTeamCount="1" MachineCount="300" DesiredTeamsPerServer="5" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="TraceEventThrottle_ChosenMachine" ID="0000000000000000" SuppressedEventCount="5995" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="BuildMachineTeams" ID="0000000000000000" Primary="1" TotalMachines="300" TotalHealthyMachine="300" HealthyMachineTeamCount="1" DesiredMachineTeams="1500" MaxMachineTeams="7500" TotalMachineTeams="1" MachineTeamsToBuild="1499" MachineTeamsAdded="1499" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="TraceEventThrottle_ServerTeamTrackerStarting" ID="0000000000000000" SuppressedEventCount="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="TraceEventThrottle_ServerTeamHealthChangeDetected" ID="0000000000000000" SuppressedEventCount="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780633.850615" DateTime="2023-10-20T05:43:53Z" Type="RunLoopBlocked" ID="0000000000000000" Duration="60.0017" ThreadID="11855445266821634904" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="DataDistributionTeamQuality" ID="bb6222cc9e722d32" Servers="1200" Teams="6000" TeamsPerServer="30" Variance="0.00189074" ServerMinTeams="28" ServerMaxTeams="36" MachineMinTeams="114" MachineMaxTeams="130" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="SlowTask" ID="0000000000000000" TaskID="3502" MClocks="138684" Duration="66.1165" SampleRate="1" NumYields="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1697780573.848862" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x41abe20 0x41ac4ac 0x1844afb 0x184de7c 0x18b28d1 0x18b2ced 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1697780573.973938" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x1844749 0x184de7c 0x18b28d1 0x18b2ced 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />

xxx

<Event Severity="20" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1697780638.858251" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x181455e 0x182e385 0x1845833 0x184de7c 0x18b28d1 0x18b2ced 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="SomewhatSlowRunLoopBottom" ID="0000000000000000" Elapsed="66.1189" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="ConnectionFrom" ID="36f3d740b078f61c" SuppressedEventCount="2" FromAddress="100.71.8.128:55326" ListenAddress="100.71.8.128:30006" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="N2_ReadError" ID="9c03a14498a2ce64" SuppressedEventCount="1335" ErrorCode="2" Message="End of file" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780573.661996" DateTime="2023-10-20T05:42:53Z" Type="IncomingConnectionError" ID="9c03a14498a2ce64" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="123" FromAddress="100.71.8.128:56396" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1697780639.789389" DateTime="2023-10-20T05:43:59Z" Type="MovingData" ID="bb6222cc9e722d32" InFlight="0" InQueue="0" AverageShardSize="20098280" UnhealthyRelocations="0" HighestPriority="0" BytesWritten="0" PriorityRecoverMove="0" PriorityRebalanceUnderutilizedTeam="0" PriorityRebalanceOverutilizedTeam="0" PriorityStorageWiggle="0" PriorityTeamHealthy="0" PriorityTeamContainsUndesiredServer="0" PriorityTeamRedundant="0" PriorityMergeShard="0" PriorityPopulateRegion="0" PriorityTeamUnhealthy="0" PriorityTeam2Left="0" PriorityTeam1Left="0" PriorityTeam0Left="0" PrioritySplitShard="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1697780639.789884" DateTime="2023-10-20T05:43:59Z" Type="TotalDataInFlight" ID="bb6222cc9e722d32" Primary="1" TotalBytes="0" UnhealthyServers="0" ServerCount="1200" StorageTeamSize="6" HighestPriority="140" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1697780639.789884" DateTime="2023-10-20T05:43:59Z" Type="DDTrackerStats" ID="bb6222cc9e722d32" Shards="1" TotalSizeBytes="1193000" SystemSizeBytes="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD" TrackLatestType="Original" />
 
xxx
 
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="Role" ID="fb2cffdff01ae824" As="DataDistributor" Transition="Begin" Origination="Recruited" OnWorker="283f439d969bf101" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DumpToken" ID="fb2cffdff01ae824" Name="recruited.waitFailure" Token="8e3e756cdc41089b" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DatabaseContextCreated" ID="264705038bf3906c" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4397308 0x385cb35 0x38600fd 0x2793329 0x1770995 0x27bc968 0x27bccf4 0x102d4e6 0x414f9d8 0x414fd58 0x4321938 0xdc910f 0x7f598ad3b495" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DataDistributorRunning" ID="fb2cffdff01ae824" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DatabaseContextCreated" ID="1f36e870db83f763" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4397308 0x385cb35 0x38600fd 0x2793329 0x174b45d 0x174b8a1 0x177040d 0x1770b3d 0x27bc968 0x27bccf4 0x102d4e6 0x414f9d8 0x414fd58 0x4321938 0xdc910f 0x7f598ad3b495" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DDInitTakingMoveKeysLock" ID="fb2cffdff01ae824" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="TransactionAttachID" ID="e2d079d179bd3810" To="21346cd41805bde0" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="DataDistributorReceived" ID="fb2cffdff01ae824" DataDistributorId="fb2cffdff01ae824" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="20" Time="1697780676.197956" DateTime="2023-10-20T05:44:36Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1697780676.200572" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x2833f7f 0x2834268 0x28355e9 0x2835839 0x2836d7a 0x2837197 0x4146774 0x41561cc 0x2804d44 0x27b41a3 0x27b4cf9 0x27b4d8d 0x27b6042 0x275eaf8 0x27bca9a 0x27bccf4 0x102d4e6 0x414f9d8 0x414fd58 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.200915" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="21346cd41805bde0" Location="NativeAPI.getConsistentReadVersion.Before" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.278683" DateTime="2023-10-20T05:44:36Z" Type="GotServerDBInfoChange" ID="0000000000000000" ChangeID="284853c1d7414dfe" InfoGeneration="850" MasterID="05bac32bf184a5c7" RatekeeperID="e9eae2face0270ff" DataDistributorID="fb2cffdff01ae824" BlobManagerID="0000000000000000" EncryptKeyProxyID="0000000000000000" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.524477" DateTime="2023-10-20T05:44:36Z" Type="TransactionMetrics" ID="2380665c765c8f1b" Elapsed="5.0005" Cluster="" Internal="1" ReadVersions="0.9999 0.0946405 38" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0.79992 0.000738234 37" ReadVersionBatches="0.9999 0.0838431 36" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="0.19998 0 3" ImmediatePriorityReadVersions="0.79992 0.00928196 35" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 2" ImmediatePriorityReadVersionsCompleted="0.79992 0.000738234 35" LogicalUncachedReads="2.59974 1.84607 107" PhysicalReadRequests="11.9988 13.207 502" PhysicalReadRequestsCompleted="11.9988 13.3483 502" GetKeyRequests="0 -1 0" GetValueRequests="0 -1 4" GetRangeRequests="2.59974 1.84607 103" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 6" GetAddressesForKeyRequests="0 -1 0" BytesRead="777796 947993 32087964" KeysRead="1967 2396.42 81162" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 1" CommittedMutationBytes="0 -1 72" SetMutations="0 -1 1" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 1" CommitCompleted="0 -1 1" KeyServerLocationRequests="0 -1 7" KeyServerLocationRequestsCompleted="0 -1 7" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="0.9999 0.0838431 36" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="1" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.0287985" MedianGRVLatency="0.00970483" MaxGRVLatency="0.0772243" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" 
MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.561066" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="21346cd41805bde0" Location="NativeAPI.getConsistentReadVersion.After" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.561066" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="e2d079d179bd3810" Location="NativeAPI.getKeyLocation.Before" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.562014" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="e2d079d179bd3810" Location="NativeAPI.getKeyLocation.After" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.562014" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="e2d079d179bd3810" Location="NativeAPI.getLatestCommitVersions" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.562014" DateTime="2023-10-20T05:44:36Z" Type="GetValueAttachID" ID="e2d079d179bd3810" To="d924826bfd39d7eb" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.562014" DateTime="2023-10-20T05:44:36Z" Type="GetValueDebug" ID="d924826bfd39d7eb" Location="NativeAPI.getValue.Before" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.564526" DateTime="2023-10-20T05:44:36Z" Type="GetValueDebug" ID="d924826bfd39d7eb" Location="NativeAPI.getValue.After" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.564526" DateTime="2023-10-20T05:44:36Z" Type="TransactionDebug" ID="e2d079d179bd3810" Location="NativeAPI.getLatestCommitVersions" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.564526" DateTime="2023-10-20T05:44:36Z" Type="GetValueAttachID" ID="e2d079d179bd3810" To="d20cb63d7ecc414c" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.564526" DateTime="2023-10-20T05:44:36Z" Type="GetValueDebug" ID="d20cb63d7ecc414c" Location="NativeAPI.getValue.Before" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.565752" DateTime="2023-10-20T05:44:36Z" Type="GetValueDebug" ID="d20cb63d7ecc414c" Location="NativeAPI.getValue.After" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.565752" DateTime="2023-10-20T05:44:36Z" Type="TransactionCommit" ID="0000000000000000" BeginPair="46e93f41ae56fbae" Parent="e2d079d179bd3810" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.565752" DateTime="2023-10-20T05:44:36Z" Type="CommitAttachID" ID="e2d079d179bd3810" To="eeefeb563ddea91d" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.565752" DateTime="2023-10-20T05:44:36Z" Type="CommitDebug" ID="eeefeb563ddea91d" Location="NativeAPI.commit.Before" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.576309" DateTime="2023-10-20T05:44:36Z" Type="Role" ID="e9eae2face0270ff" Transition="Refresh" As="Ratekeeper" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.643943" DateTime="2023-10-20T05:44:36Z" Type="TransactionCommit" ID="0000000000000000" EndPair="46e93f41ae56fbae" CommittedVersion="12597860997" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.643943" DateTime="2023-10-20T05:44:36Z" Type="CommitDebug" ID="eeefeb563ddea91d" Location="NativeAPI.commit.After" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.643943" DateTime="2023-10-20T05:44:36Z" Type="TakeMoveKeysLockTransaction" ID="fb2cffdff01ae824" TransactionUID="e2d079d179bd3810" PrevOwner="3bded2d69ec4efa9ab93822b75ec94c7" PrevWrite="00000000000000000000000000000000" MyOwner="06121bc94f0c806b6c49acc44de6a1ca" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.643943" DateTime="2023-10-20T05:44:36Z" Type="DDInitTookMoveKeysLock" ID="fb2cffdff01ae824" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.660226" DateTime="2023-10-20T05:44:36Z" Type="DDInitGotConfiguration" ID="fb2cffdff01ae824" Conf="{"backup_worker_enabled":0,"blob_granules_enabled":0,"commit_proxies":3,"grv_proxies":1,"log_spill":2,"logs":5,"perpetual_storage_wiggle":0,"perpetual_storage_wiggle_engine":"none","perpetual_storage_wiggle_locality":"0","proxies":4,"redundancy_mode":"three_datacenter","storage_engine":"ssd-2","storage_migration_type":"disabled","tenant_mode":"disabled","usable_regions":1}" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.668229" DateTime="2023-10-20T05:44:36Z" Type="DDInitUpdatedReplicaKeys" ID="fb2cffdff01ae824" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.758188" DateTime="2023-10-20T05:44:36Z" Type="DDInitGotInitialDD" ID="fb2cffdff01ae824" B="" E="\xff\xff" Src="62708610f9291dbf,737e05c6616eb67c,88bc2472a7d7558f,88ea10b2e8b63301,904ba2009d781614,ad0a46dfdf8a789b" Dest="[no items]" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="TrackInitialShards" ID="fb2cffdff01ae824" InitialShardCount="2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="DDTrackerStarting" ID="fb2cffdff01ae824" State="Inactive" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="d60e53e48ed75e00" ProcessID="7e2ba3785104da3c3746c7a57627c23f" ProcessClass="storage" WaitFailureToken="84d217d70b562603" Address="100.71.8.128:31358" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="a2a528991d6d7700" ProcessID="e12c409edbcf975cf62de2b07d775486" ProcessClass="storage" WaitFailureToken="950427dd8c958597"
 
xxxx
 
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="7217e048c15883c0" ProcessID="f9c9b943b148af8d0d5a0eb6082aa19a" ProcessClass="storage" WaitFailureToken="4d1e806c1a082bcd" Address="100.71.8.128:30291" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="b7b2ad8a2778d6c0" ProcessID="00032875973f3e21f251becbbf42d52d" ProcessClass="storage" WaitFailureToken="4ccf0eca28e00dc5" Address="100.71.8.128:30657" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="GetMagazineSample" ID="0000000000000000" Size="256" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4397308 0x42d895a 0x42d8b85 0x24ef35c 0x18765ba 0x187a15b 0x18801ee 0x1747450 0x174e06d 0x4321938 0xdc910f 0x7f598ad3b495" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="76d2f2c90fe3f7c0" ProcessID="1920235a406157bba4d3a36fa869e036" ProcessClass="storage" WaitFailureToken="2859dc6a42cd3907" Address="100.71.8.128:30408" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780676.804848" DateTime="2023-10-20T05:44:36Z" Type="AddedStorageServer" ID="fb2cffdff01ae824" ServerID="482a0113f59779c1" ProcessID="b95589870f2d70294ec2ea7818d3f289" ProcessClass="storage" WaitFailureToken="828d21c342efff15" Address="100.71.8.128:31288" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
 
xxxx
 
 
<Event Severity="10" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="AddedStorageServer" ID="bb6222cc9e722d32" ServerID="6e0d464f06c1e3ff" ProcessID="64d15c9747ef51f0ed0528969151240f" ProcessClass="storage" WaitFailureToken="35da5330ad5164a3" Address="100.71.8.128:31125" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="AddedStorageServer" ID="bb6222cc9e722d32" ServerID="b5a092dedad4fdff" ProcessID="724dbaf619d1abf58d7df28f2d73dfb4" ProcessClass="storage" WaitFailureToken="8c3e2b5b70cdba91" Address="100.71.8.128:30429" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="30" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="TraceEventThrottle_ServerTeamTrackerStarting" ID="0000000000000000" SuppressedEventCount="5998" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="30" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="TraceEventThrottle_ServerTeamHealthChangeDetected" ID="0000000000000000" SuppressedEventCount="5998" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="30" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="TraceEventThrottle_TeamCollectionInfo" ID="0000000000000000" SuppressedEventCount="6000" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="SlowTask" ID="0000000000000000" TaskID="3502" MClocks="410.99" Duration="0.195936" SampleRate="1" NumYields="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="20" Time="1697780543.660421" DateTime="2023-10-20T05:42:23Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1697780543.770468" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x43a2038 0x43959fa 0x4395d41 0x1875877 0x187a15b 0x18801ee 0x1747450 0x174e06d 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780543.986114" DateTime="2023-10-20T05:42:23Z" Type="GetMagazineSample" ID="0000000000000000" Size="256" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4397308 0x42d895a 0x42d8b85 0xe75689 0x3b0d24f 0x3b0da8b 0x173af32 0x173b682 0xe6add8 0xe6add8 0x15ad5c8 0x387f398 0x3846389 0x38474c7 0x3784f08 0x382af68 0x2adaa58 0xe9c730 0x414f9d8 0x414fd58 0x4321938 0xdc910f 0x7f598ad3b495" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="DDTeamCollectionBegin" ID="bb6222cc9e722d32" Primary="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="DDTeamCollectionReadyToStart" ID="bb6222cc9e722d32" Primary="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="DDRecruiting" ID="0000000000000000" Primary="1" State="Sending request to CC" Exclusions="1200" Critical="0" IncludedDCsSize="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="StorageServerRecruitment" ID="bb6222cc9e722d32" State="Idle" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="DDMonitorHealthyTeamsStart" ID="0000000000000000" ZeroHealthyTeams="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.097449" DateTime="2023-10-20T05:42:24Z" Type="TotalDataInFlight" ID="bb6222cc9e722d32" Primary="1" TotalBytes="0" UnhealthyServers="0" ServerCount="1200" StorageTeamSize="6" HighestPriority="140" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780544.118590" DateTime="2023-10-20T05:42:24Z" Type="PerpetualStorageWiggleClose" ID="bb6222cc9e722d32" Primary="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780544.120832" DateTime="2023-10-20T05:42:24Z" Type="DDTrackerStats" ID="bb6222cc9e722d32" Shards="1" TotalSizeBytes="1193000" SystemSizeBytes="0" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780677.496675" DateTime="2023-10-20T05:44:37Z" Type="RkUpdate" ID="e9eae2face0270ff" TPSLimit="1.59209e+06" Reason="2" ReasonServerID="62708610f9291dbf" ReleasedTPS="12.5867" ReleasedBatchTPS="0" TPSBasis="12.5867" StorageServers="1200" GrvProxies="1" TLogs="5" WorstFreeSpaceStorageServer="276560583163" WorstFreeSpaceTLog="455296115917" WorstStorageServerQueue="2414" LimitingStorageServerQueue="2411" WorstTLogQueue="683" TotalDiskUsageBytes="126366309200" WorstStorageServerVersionLag="0" LimitingStorageServerVersionLag="0" WorstStorageServerDurabilityLag="5319465" LimitingStorageServerDurabilityLag="5316422" TagsAutoThrottled="0" TagsAutoThrottledBusyRead="0" TagsAutoThrottledBusyWrite="0" TagsManuallyThrottled="0" AutoThrottlingEnabled="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780677.673785" DateTime="2023-10-20T05:44:37Z" Type="PerpetualStorageWiggleClose" ID="fb2cffdff01ae824" Primary="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" />
<Event Severity="10" Time="1697780677.816726" DateTime="2023-10-20T05:44:37Z" Type="RkUpdateBatch" ID="e9eae2face0270ff" TPSLimit="5.36917e+06" Reason="2" ReasonServerID="62708610f9291dbf" ReleasedTPS="153.799" ReleasedBatchTPS="0" TPSBasis="153.799" StorageServers="1200" GrvProxies="1" TLogs="5" WorstFreeSpaceStorageServer="276559630352" WorstFreeSpaceTLog="455296115917" WorstStorageServerQueue="4402" LimitingStorageServerQueue="4400" WorstTLogQueue="1173" TotalDiskUsageBytes="126366309200" WorstStorageServerVersionLag="0" LimitingStorageServerVersionLag="0" WorstStorageServerDurabilityLag="5353523" LimitingStorageServerDurabilityLag="5319700" TagsAutoThrottled="0" TagsAutoThrottledBusyRead="0" TagsAutoThrottledBusyWrite="0" TagsManuallyThrottled="0" AutoThrottlingEnabled="1" ThreadID="9365136573193576722" Machine="100.71.8.128:30006" LogGroup="default" Roles="DD,RK" TrackLatestType="Original" />
<Event Severity="10" Time="1697780678.007100" DateTime="2023-10-20T05:44:38Z" Type="RkUpdateBatch" ID="e9eae2face0270ff" TPSLimit="4.39479e+07" Reason="2" ReasonServerID
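
To make the stall easier to spot in a trace file like the one above, here is a minimal sketch that filters for the run-loop stall events (SlowTask, RunLoopBlocked, SomewhatSlowRunLoopBottom) and prints their duration. It is only an illustration: the helper names `parse_event` and `find_stalls` and the 1-second threshold are my own choices, not part of FoundationDB, and it assumes each event of interest sits on a single line. It uses a regex instead of an XML parser because some attributes in the log (for example `Conf`) contain unescaped quotes. The sample lines are trimmed copies of events from the log above.

```python
import re

# Matches name="value" pairs inside a trace <Event ... /> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Event types that report how long the run loop was stalled.
STALL_TYPES = {"SlowTask", "RunLoopBlocked", "SomewhatSlowRunLoopBottom"}


def parse_event(line):
    """Return the attributes of one <Event .../> line as a dict."""
    return dict(ATTR_RE.findall(line))


def find_stalls(lines, min_seconds=1.0):
    """Collect (DateTime, Type, seconds) for stall events >= min_seconds."""
    stalls = []
    for line in lines:
        if not line.lstrip().startswith("<Event"):
            continue  # skip elided markers and continuation lines
        ev = parse_event(line)
        if ev.get("Type") not in STALL_TYPES:
            continue
        # SlowTask/RunLoopBlocked use Duration, SomewhatSlowRunLoopBottom uses Elapsed.
        raw = ev.get("Duration", ev.get("Elapsed"))
        try:
            seconds = float(raw)
        except (TypeError, ValueError):
            continue
        if seconds >= min_seconds:
            stalls.append((ev.get("DateTime"), ev["Type"], seconds))
    return stalls


# Trimmed copies of events from the log above.
SAMPLE = [
    '<Event Severity="30" DateTime="2023-10-20T05:43:53Z" Type="RunLoopBlocked" Duration="60.0017" Roles="DD" />',
    '<Event Severity="30" DateTime="2023-10-20T05:42:53Z" Type="SlowTask" TaskID="3502" Duration="66.1165" Roles="DD" />',
    '<Event Severity="10" DateTime="2023-10-20T05:42:53Z" Type="SomewhatSlowRunLoopBottom" Elapsed="66.1189" Roles="DD" />',
    '<Event Severity="10" DateTime="2023-10-20T05:42:23Z" Type="SlowTask" TaskID="3502" Duration="0.195936" Roles="DD,RK" />',
    '<Event Severity="10" DateTime="2023-10-20T05:42:53Z" Type="BuildTeams" ServerCount="1200" Roles="DD" />',
]

if __name__ == "__main__":
    for when, kind, seconds in find_stalls(SAMPLE):
        print(when, kind, seconds)
```

On a real trace file, the same function can be fed the file object directly, e.g. `find_stalls(open(path))`, where `path` is wherever the trace file lives.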

Here is the addr2line output for some of the backtraces:

[root@szjd-yfq-pm-os01-bconest-13 bin]# addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x41abe20 0x41ac4ac 0x1844afb 0x184de7c 0x18b28d1 0x18b2ced 0x4321938 0xdc910f 0x7f598ad3b495 0xe2d2b2
?? ??:0
AsyncFileCached::AsyncFileCached(Reference<IAsyncFile>, std::string const&, long, Reference<EvictablePageCache>) at /foundationdb/fdbrpc/AsyncFileCached.actor.h:304 (discriminator 16)
AsyncFileCached::AsyncFileCached(Reference<IAsyncFile>, std::string const&, long, Reference<EvictablePageCache>) at /foundationdb/fdbrpc/AsyncFileCached.actor.h:312 (discriminator 2)
(anonymous namespace)::ReportEndpointFailureActorState<TLogRejoinReply, (anonymous namespace)::ReportEndpointFailureActor<TLogRejoinReply> >::a_body1Catch1(Error, int) at /foundationdb/fdbbuild/fdbrpc/genericactors.actor.g.h:6444
oldTLog_6_0::LogData::TagData::eraseMessagesBefore(oldTLog_6_0::LogData::TagData* const&, long const&, oldTLog_6_0::TLogData* const&, Reference<oldTLog_6_0::LogData> const&, TaskPriority const&) at /foundationdb/fdbserver/OldTLogServer_6_0.actor.cpp:342
oldTLog_6_2::TLogQueue::forgetBefore(long, Reference<oldTLog_6_2::LogData>) at /foundationdb/fdbserver/OldTLogServer_6_2.actor.cpp:720
oldTLog_6_2::updatePersistentPopped(oldTLog_6_2::TLogData*, Reference<oldTLog_6_2::LogData>, Reference<oldTLog_6_2::LogData::TagData>) at /foundationdb/fdbserver/OldTLogServer_6_2.actor.cpp:788
detail::RelativeOffset detail::LoadSaveHelper<unit_tests::X<unit_tests::Y2>, unit_tests::TestContext>::save<unit_tests::X<unit_tests::Y2>, detail::PrecomputeSize<unit_tests::TestContext> >(unit_tests::X<unit_tests::Y2> const&, detail::PrecomputeSize<unit_tests::TestContext>&, detail::VTableSet const*, std::enable_if<expect_serialize_member<unit_tests::X<unit_tests::Y2> >, int>::type) at /foundationdb/flow/flat_buffers.h:1187
auto detail::LoadSaveHelper<unsigned int, SaveContext<ObjectWriter, void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}> >::save<unsigned int, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>, void>(unsigned int const&, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>&, detail::VTableSet const*) at /foundationdb/flow/flat_buffers.h:1143
?? ??:0
(anonymous namespace)::NetworkSenderActorState<StorageMetrics, (anonymous namespace)::NetworkSenderActor<StorageMetrics> >::a_body1Catch2(Error const&, int) at /foundationdb/fdbrpc/networksender.actor.h:45
 
 
 
[root@szjd-yfq-pm-os01-bconest-13 bin]# addr2line -e fdbserver.debug -p -C -f -i 0x7f598b0f55d0 0x7f598b0f29d1 0x4320a34 0x43226c5 0xdc910f 0x7f598ad3b495 0xe2d2b2
?? ??:0
?? ??:0
void detail::FakeRoot<std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > >::serialize_impl<detail::SaveVisitorLambda<detail::PrecomputeSize<SaveContext<ObjectWriter, void ObjectWriter::serialize<std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > >(unsigned int, std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > const&)::{lambda(unsigned long)#1}> >, void ObjectWriter::serialize<std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > >(unsigned int, std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > const&)::{lambda(unsigned long)#1}>, 0ul>(detail::SaveVisitorLambda<detail::PrecomputeSize<SaveContext<ObjectWriter, void ObjectWriter::serialize<std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > >(unsigned int, std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, 
unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > const&)::{lambda(unsigned long)#1}> >, void ObjectWriter::serialize<std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > >(unsigned int, std::vector<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> >, std::allocator<std::unordered_set<unit_tests::Y1, unit_tests::Y1Hasher, unit_tests::Y1Equal, std::allocator<unit_tests::Y1> > > > const&)::{lambda(unsigned long)#1}>&, std::integer_sequence<unsigned long, 0ul>) at /foundationdb/flow/flat_buffers.h:1286
_ZN6detail8for_eachIZNS_17SaveVisitorLambdaINS_13WriteToBufferI11SaveContextI12ObjectWriterZNS4_9serializeIJ5Arena9VectorRefI9StringRefL14VecSerStrategy0EEEEEvjDpRKT_EUlmE_EEESG_EclIJS6_SA_EEEvSE_EUlRKT_E_JRKS6_RKSA_EEEvOSK_DpOT0_ at /foundationdb/flow/flat_buffers.h:669
auto detail::LoadSaveHelper<unsigned int, SaveContext<ObjectWriter, void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}> >::save<unsigned int, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>, void>(unsigned int const&, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>&, detail::VTableSet const*) at /foundationdb/flow/flat_buffers.h:1143
?? ??:0
(anonymous namespace)::NetworkSenderActorState<StorageMetrics, (anonymous namespace)::NetworkSenderActor<StorageMetrics> >::a_body1Catch2(Error const&, int) at /foundationdb/fdbrpc/networksender.actor.h:45
 
 
 
 
[root@szjd-yfq-pm-os01-bconest-13 bin]# addr2line -e fdbserver.debug -p -C -f -i 0x4397308 0x42d895a 0x42d8b85 0x24ef35c 0x18765ba 0x187a15b 0x18801ee 0x1747450 0x174e06d 0x4321938 0xdc910f 0x7f598ad3b495
fmt::v8::detail::dragonbox::cache_accessor<float>::compute_mul_parity(unsigned int, unsigned long const&, int) at /foundationdb/contrib/fmt-8.1.1/include/fmt/format-inl.h:1100
std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<NetworkAddress const, TraceLog::RoleInfo> > > >::allocate(std::allocator<std::_Rb_tree_node<std::pair<NetworkAddress const, TraceLog::RoleInfo> > >&, unsigned long) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/alloc_traits.h:463
void std::_Function_base::_Base_manager<TraceLog::open(std::string const&, std::string const&, std::string, std::string const&, unsigned long, unsigned long, Optional<NetworkAddress>, std::string const&)::{lambda()#1}>::_M_create<{lambda()#1} const&>(std::_Any_data&, {lambda()#1} const&, std::integral_constant<bool, true>) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/std_function.h:153
ReplyPromise<ProtocolInfoReply>::ReplyPromise(Endpoint const&) at /foundationdb/fdbrpc/fdbrpc.h:146
SingleCallback<InitializeTLogRequest>::remove() at /foundationdb/flow/flow.h:427 (discriminator 7)
FutureStream<TLogEnablePopRequest>::getError() at /foundationdb/flow/flow.h:1132
oldTLog_6_0::PullAsyncDataActorState<oldTLog_6_0::PullAsyncDataActor>::a_body1cont1(int) at /foundationdb/fdbbuild/fdbserver/OldTLogServer_6_0.actor.g.cpp:13459
Serializer<BinaryWriter, FieldHeader<TimeAndValue<Standalone<StringRef> > >, void>::serialize(BinaryWriter&, FieldHeader<TimeAndValue<Standalone<StringRef> > >&) at /foundationdb/flow/serialize.h:114
FieldLevel<TimeAndValue<Standalone<StringRef> >, FieldHeader<TimeAndValue<Standalone<StringRef> > >, FieldValueBlockEncoding<TimeAndValue<Standalone<StringRef> > > >::UpdatePreviousHeaderActorState<FieldLevel<TimeAndValue<Standalone<StringRef> >, FieldHeader<TimeAndValue<Standalone<StringRef> > >, FieldValueBlockEncoding<TimeAndValue<Standalone<StringRef> > > >::UpdatePreviousHeaderActor>::a_callback_error(ActorCallback<FieldLevel<TimeAndValue<Standalone<StringRef> >, FieldHeader<TimeAndValue<Standalone<StringRef> > >, FieldValueBlockEncoding<TimeAndValue<Standalone<StringRef> > > >::UpdatePreviousHeaderActor, 0, Optional<Standalone<StringRef> > >*, Error) at /foundationdb/fdbbuild/flow/TDMetric.actor.g.h:784 (discriminator 1)
detail::RelativeOffset detail::LoadSaveHelper<unit_tests::X<unit_tests::Y2>, unit_tests::TestContext>::save<unit_tests::X<unit_tests::Y2>, detail::PrecomputeSize<unit_tests::TestContext> >(unit_tests::X<unit_tests::Y2> const&, detail::PrecomputeSize<unit_tests::TestContext>&, detail::VTableSet const*, std::enable_if<expect_serialize_member<unit_tests::X<unit_tests::Y2> >, int>::type) at /foundationdb/flow/flat_buffers.h:1187
auto detail::LoadSaveHelper<unsigned int, SaveContext<ObjectWriter, void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}> >::save<unsigned int, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>, void>(unsigned int const&, detail::WriteToBuffer<void ObjectWriter::serialize<ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > >(unsigned int, ErrorOr<EnsureTable<std::vector<WorkerDetails, std::allocator<WorkerDetails> > > > const&)::{lambda(unsigned long)#1}>&, detail::VTableSet const*) at /foundationdb/flow/flat_buffers.h:1143
?? ??:0

Hi @jzhou
I redeployed the cluster using version 7.1.43 and configured the DD process as follows.
--memory 32GiB --cache_memory 17GiB
According to DD’s logs, the problem showed up again while it was building teams.

<Event Severity="10" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="ServerTeamTrackerStarting" ID="cd9705747db723bf" Reason="Initial wait complete (sc)" ServerTeam="TeamID 5551e31f0b8196a3; Size 6; 100.71.8.121:31086 8d54e9d04e7e81b7, 100.71.8.121:30540 b2dcdd027aebaa50, 100.71.8.121:31111 b7d5e1849ab1f7c5, 100.71.8.121:30643 ccdb083e1691d3dd, 100.71.8.121:30380 d78761b2188d3db8, 100.71.8.121:30127 e3003e06f3648283" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="ServerTeamHealthChangeDetected" ID="cd9705747db723bf" ServerTeam="TeamID 5551e31f0b8196a3; Size 6; 100.71.8.121:31086 8d54e9d04e7e81b7, 100.71.8.121:30540 b2dcdd027aebaa50, 100.71.8.121:31111 b7d5e1849ab1f7c5, 100.71.8.121:30643 ccdb083e1691d3dd, 100.71.8.121:30380 d78761b2188d3db8, 100.71.8.121:30127 e3003e06f3648283" Primary="1" IsReady="0" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="10" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="TeamCollectionInfo" ID="cd9705747db723bf" Primary="1" AddedTeams="0" TeamsToBuild="0" CurrentServerTeams="5399" DesiredTeams="5400" MaxTeams="27000" StorageTeamSize="6" CurrentMachineTeams="1500" CurrentHealthyMachineTeams="1500" DesiredMachineTeams="1500" MaxMachineTeams="7500" TotalHealthyMachines="300" MinTeamsOnServer="27" MaxTeamsOnServer="62" MinMachineTeamsOnMachine="29" MaxMachineTeamsOnMachine="32" DoBuildTeams="0" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="TeamCollectionInfo" ID="cd9705747db723bf" Primary="1" AddedTeams="5399" TeamsToBuild="5399" CurrentServerTeams="5400" DesiredTeams="5400" MaxTeams="27000" StorageTeamSize="6" CurrentMachineTeams="1500" CurrentHealthyMachineTeams="1500" DesiredMachineTeams="1500" MaxMachineTeams="7500" TotalHealthyMachines="300" MinTeamsOnServer="27" MaxTeamsOnServer="62" MinMachineTeamsOnMachine="29" MaxMachineTeamsOnMachine="32" DoBuildTeams="0" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" TrackLatestType="Original" />
<Event Severity="10" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="DataDistributionTeamQuality" ID="cd9705747db723bf" Servers="1080" Teams="5400" TeamsPerServer="30" Variance="0.00901852" ServerMinTeams="27" ServerMaxTeams="62" MachineMinTeams="62" MachineMaxTeams="132" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="30" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="SlowTask" ID="0000000000000000" TaskID="3502" MClocks="85488.4" Duration="40.756" SampleRate="1" NumYields="1" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1699261466.554232" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fb59689c630 0x1898d38 0x182c409 0x1847005 0x184defc 0x18b2df1 0x18b320d 0x432d1f8 0xdc6d9f 0x7fb5964e1555 0xe2c262" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1699261466.679295" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fb59689c630 0x42c08d4 0x42c1078 0xe57dec 0x43b1e00 0x43a1ef7 0x147012a 0x43a27c2 0x184772c 0x184defc 0x18b2df1 0x18b320d 0x432d1f8 0xdc6d9f 0x7fb5964e1555 0xe2c262" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />
<Event Severity="20" Time="1699261466.428733" DateTime="2023-11-06T09:04:26Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1699261466.929458" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7fb59689c630 0x7fb596629dc4 0x1890235 0x182c3cc 0x1847005 0x184defc 0x18b2df1 0x18b320d 0x432d1f8 0xdc6d9f 0x7fb5964e1555 0xe2c262" ThreadID="11601064387280473675" Machine="100.71.8.124:50007" LogGroup="default" Roles="DD" />

backtrace:

addr2line -e fdbserver.debug -p -C -f -i 0x7fb59689c630 0x1898d38 0x182c409 0x1847005 0x184defc 0x18b2df1 0x18b320d 0x432d1f8 0xdc6d9f 0x7fb5964e1555 0xe2c262
?? ??:0
std::_Rb_tree<UID, std::pair<UID const, AsyncMap<UID, ServerStatus>::P>, std::_Select1st<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> >, std::less<UID>, std::allocator<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> > >::find(UID const&) const at ??:?
 (inlined by) std::less<UID>::operator()(UID const&, UID const&) const at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_function.h:400
 (inlined by) std::_Rb_tree<UID, std::pair<UID const, AsyncMap<UID, ServerStatus>::P>, std::_Select1st<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> >, std::less<UID>, std::allocator<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> > >::_M_lower_bound(std::_Rb_tree_node<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> > const*, std::_Rb_tree_node_base const*, UID const&) const at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_tree.h:1921
 (inlined by) std::_Rb_tree<UID, std::pair<UID const, AsyncMap<UID, ServerStatus>::P>, std::_Select1st<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> >, std::less<UID>, std::allocator<std::pair<UID const, AsyncMap<UID, ServerStatus>::P> > >::find(UID const&) const at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_tree.h:2536
DDTeamCollection::isMachineHealthy(Reference<TCMachineInfo> const&) const at ??:?
 (inlined by) DDTeamCollection::isMachineHealthy(Reference<TCMachineInfo> const&) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:3465
DDTeamCollection::addTeamsBestOf(int, int, int) [clone .constprop.0] at DDTeamCollection.actor.g.cpp:?
 (inlined by) DDTeamCollection::addTeamsBestOf(int, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4737
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont2(int) [clone .isra.0] at DDTeamCollection.actor.g.cpp:?
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1loopBody1(int) at ??:?
ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>::fire(Void const&) at ??:?
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3674
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1when1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3689
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_callback_fire(ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>*, Void const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3710
 (inlined by) ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>::fire(Void const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1321
N2::Net2::run() at ??:?
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:909
 (inlined by) N2::PromiseTask::operator()() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1220
 (inlined by) N2::Net2::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1567
main at ??:?
?? ??:0
_start at ??:?

Another backtrace:

addr2line -e fdbserver.debug -p -C -f -i 0x7fb59689c630 0x42c08d4 0x42c1078 0xe57dec 0x43b1e00 0x43a1ef7 0x147012a 0x43a27c2 0x184772c 0x184defc 0x18b2df1 0x18b320d 0x432d1f8 0xdc6d9f 0x7fb5964e1555 0xe2c262
?? ??:0
Reference<ArenaBlock>::setPtrUnsafe(ArenaBlock*) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/FastRef.h:163
 (inlined by) ArenaBlock::create(int, Reference<ArenaBlock>&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.cpp:399
Arena::Arena(unsigned long) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.cpp:88
StringRef::size() const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.h:466
 (inlined by) StringRef::StringRef(Arena&, StringRef const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.h:441
 (inlined by) Standalone<StringRef>::Standalone(StringRef const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.h:374
std::_Rb_tree<Standalone<StringRef>, Standalone<StringRef>, std::_Identity<Standalone<StringRef> >, std::less<Standalone<StringRef> >, std::allocator<Standalone<StringRef> > >::_M_mbegin() const at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/flow/TDMetric.actor.g.h:836
 (inlined by) std::_Rb_tree<Standalone<StringRef>, Standalone<StringRef>, std::_Identity<Standalone<StringRef> >, std::less<Standalone<StringRef> >, std::allocator<Standalone<StringRef> > >::_M_begin() at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_tree.h:739
 (inlined by) std::_Rb_tree<Standalone<StringRef>, Standalone<StringRef>, std::_Identity<Standalone<StringRef> >, std::less<Standalone<StringRef> >, std::allocator<Standalone<StringRef> > >::_M_get_insert_unique_pos(Standalone<StringRef> const&) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_tree.h:2065
 (inlined by) std::pair<std::_Rb_tree_iterator<Standalone<StringRef> >, bool> std::_Rb_tree<Standalone<StringRef>, Standalone<StringRef>, std::_Identity<Standalone<StringRef> >, std::less<Standalone<StringRef> >, std::allocator<Standalone<StringRef> > >::_M_insert_unique<Standalone<StringRef> const&>(Standalone<StringRef> const&) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_tree.h:2124
 (inlined by) std::set<Standalone<StringRef>, std::less<Standalone<StringRef> >, std::allocator<Standalone<StringRef> > >::insert(Standalone<StringRef> const&) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_set.h:512
 (inlined by) DynamicEventMetric::newFieldAdded(Standalone<StringRef> const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/TDMetric.actor.h:1047
 (inlined by) void DynamicEventMetric::setField<Standalone<StringRef> >(char const*, Standalone<StringRef> const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/TDMetric.actor.h:1091
Reference<ArenaBlock>::~Reference() at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/flow/TDMetric.actor.g.h:836
 (inlined by) Arena::~Arena() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.h:99
 (inlined by) Standalone<StringRef>::~Standalone() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Arena.h:405
 (inlined by) BaseTraceEvent::setField(char const*, std::string const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1054
std::enable_if<Traceable<char const*>::value, BaseTraceEvent&>::type BaseTraceEvent::detail<char const*>(char const*, char const* const&) at /opt/rh/devtoolset-11/root/usr/include/c++/11/bits/stl_vector.h:1759
BaseTraceEvent::init() at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/flow/TDMetric.actor.g.h:836
DDTeamCollection::addTeamsBestOf(int, int, int) [clone .constprop.0] at DDTeamCollection.actor.g.cpp:?
 (inlined by) DDTeamCollection::addBestMachineTeams(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4340
 (inlined by) DDTeamCollection::addTeamsBestOf(int, int, int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbserver/DDTeamCollection.actor.cpp:4737
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont2(int) [clone .isra.0] at DDTeamCollection.actor.g.cpp:?
DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1loopBody1(int) at ??:?
ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>::fire(Void const&) at ??:?
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1cont1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3674
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_body1when1(Void const&, int) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3689
 (inlined by) DDTeamCollectionImpl::BuildTeamsActorState<DDTeamCollectionImpl::BuildTeamsActor>::a_callback_fire(ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>*, Void const&) at /home/foundationdb_ci/foundationdb_build_output/dbdbdbdbdbdbdbdbdbdbdbdbdbdbdbdb/fdbserver/DDTeamCollection.actor.g.cpp:3710
 (inlined by) ActorCallback<DDTeamCollectionImpl::BuildTeamsActor, 0, Void>::fire(Void const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:1321
N2::Net2::run() at ??:?
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/flow.h:909
 (inlined by) N2::PromiseTask::operator()() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1220
 (inlined by) N2::Net2::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1567
main at ??:?
?? ??:0
_start at ??:?

Do you have any suggestions? Thanks!

The above stack traces don’t tell me much. The FDB releases don’t include the change I suggested above, i.e., adding a delay in readOrCreateStorageMetadata(). There could be other causes, but it will need an experienced developer to look into this.

We know that when DD starts up, it does a lot of work, and there can be heavy CPU usage for ~30s on a large cluster. After that, its CPU/memory usage is fairly constant. So my suggestion is to start the cluster at a smaller size, say 500 storage servers, and then gradually add more storage servers, e.g., 100 at a time, waiting for some time between batches.
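
To make the batching concrete, here is a minimal Python sketch that only computes the storage-server count to reach at each step. The function name `batch_plan` is hypothetical, the target of 1400 is taken from the cluster size mentioned at the top of the thread, and how long to wait between batches is not specified here, so the sketch leaves that to the operator (for example, waiting until DD’s CPU usage settles).

```python
def batch_plan(initial, target, batch_size):
    """Return the cumulative storage-server count after each expansion step.

    initial    -- servers the cluster starts with (e.g. 500)
    target     -- final number of storage servers wanted
    batch_size -- servers added per step (e.g. 100)
    """
    if initial <= 0 or batch_size <= 0 or target < initial:
        raise ValueError("need initial > 0, batch_size > 0 and target >= initial")
    counts = [initial]
    while counts[-1] < target:
        # Never overshoot the target on the last batch.
        counts.append(min(counts[-1] + batch_size, target))
    return counts


plan = batch_plan(initial=500, target=1400, batch_size=100)
for step, count in enumerate(plan):
    label = "initial deployment" if step == 0 else f"batch {step}"
    # Between steps, wait for DD to settle before adding the next batch.
    print(f"{label}: {count} storage servers")
```

With these numbers the plan has nine batches after the initial deployment; a smaller batch size gives DD less work per step at the cost of a longer rollout.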