Cluster becomes nonresponsive while ingesting data

I am trying to load roughly 5 GB of data into a three_data_hall cluster with 9 logs, 6 storage servers, 9 coordinators, and 1 instance of each other role.
I see that if I use more than 2 client processes to ingest the data, the cluster becomes unresponsive after ingesting only a few MB of data.

The status changes from this:

  Redundancy mode        - three_data_hall
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Proxies        - 1
  Desired Resolvers      - 1
  Desired Logs           - 9
  Usable Regions         - 1

  FoundationDB processes - 42
  Zones                  - 27
  Machines               - 24
  Memory availability    - 7.5 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 2 zones
  Server time            - 08/19/21 11:29:14

  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 1.715 GB

Operating space:
  Storage server         - 65.6 GB free on most full server
  Log server             - 130.2 GB free on most full server

  Read rate              - 51 Hz
  Write rate             - 63182 Hz
  Transactions started   - 3188 Hz
  Transactions committed - 3163 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server performance (storage queue).
  Most limiting process:

to this:
Unable to read database configuration.
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown
  Usable Regions         - unknown

  FoundationDB processes - 42
  Zones                  - 27
  Machines               - 24
  Memory availability    - 7.5 GB per process on machine with least available
  Retransmissions rate   - 3 Hz
  Server time            - 08/19/21 11:29:22

  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

What could I possibly look for to explain this behavior (and what additional info should I provide here to get better support)?
Thank you

The "Performance limited by process: Storage server performance (storage queue)" message in your first status suggests that your ingestion rate is saturating the storage servers, at least initially. Ordinarily I would expect you to still be able to read status in this saturated state, but possibly something else is also heavily saturated and preventing a timely response to some of the status queries.

My suggestion would be to decrease the rate of your ingestion to see if that helps.
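As an illustration of how to do that on the client side (this is not from the original posts; `commit_batch` is a hypothetical stand-in for your existing ingest transaction), a simple token-bucket throttle can cap each client's request rate:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter; `rate` is permits per second."""
    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

# Usage sketch: throttle each ingesting client to, e.g., 500 commits/sec.
# limiter = TokenBucket(rate=500, burst=50)
# for batch in batches:
#     limiter.acquire()
#     commit_batch(batch)   # hypothetical: your existing ingest transaction
```

Lowering `rate` per client (or reducing the client count) lets you find the ingestion rate the storage servers can actually sustain.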

Does this mean that {Sequencer, Data Distributor, Ratekeeper, Proxy, Resolver} are all recruited on one process?

Thank you guys for the feedback

As I said, with fewer clients the cluster is indeed responsive. What is not clear to me is why it becomes completely unresponsive. To elaborate on my previous post: it is not only status that stops showing up in the CLI; my ingesting clients also hang on transaction operations, i.e., they cannot process any new transactions and block at the GRV (get read version) request.
I would expect the ratekeeper to kick in and throttle the requests down, maybe even to a few per second, but it seems that everything stops altogether. The smallest number of clients that triggers this behavior is around 12 client processes with 4 threads each, which is not a concurrency level so high as to justify the behavior I am seeing.

No, the proxy and resolver are on the same host but in two different processes. The others may or may not share a host, but they are still separate fdb processes (I do not pin roles to processes; I have something like 12 generic stateless processes, and these roles are recruited onto them).

It is possible to saturate parts of the system that ratekeeper does not monitor. For example, if you overload the proxies with work, your system could become unresponsive without ratekeeper getting involved.

You can try starting system-immediate-priority transactions, which should still be able to work if it's just ratekeeper enforcing the throttling. If those take a while to get a read version, or fail to start, that would suggest your problem is elsewhere. If you can't get read versions, my best guess would be the proxies, though the transaction logs could also be involved. For something like this, I would check the CPU usage of the various processes to see if any are fully saturated.
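For concreteness, a probe along these lines could look like the following sketch using the Python bindings (the API version and the interpretation thresholds are assumptions, not prescriptions):

```python
import time

def classify_grv(latency_s, threshold_s=2.0):
    """Interpret the latency of a system-immediate GRV request."""
    if latency_s is None:
        return "failed: problem is likely not ratekeeper (check proxies/tlogs)"
    if latency_s > threshold_s:
        return "slow: problem is likely not ratekeeper (check proxies/tlogs)"
    return "ok: normal traffic is probably just being throttled by ratekeeper"

def probe(cluster_file=None, timeout_s=10.0):
    """Time a get-read-version request at system-immediate priority.

    Returns the latency in seconds, or None if the request failed/timed out.
    """
    import fdb                      # requires the foundationdb Python package
    fdb.api_version(630)            # assumption: adjust to your cluster's API version
    db = fdb.open(cluster_file)
    tr = db.create_transaction()
    tr.options.set_timeout(int(timeout_s * 1000))   # timeout in ms
    tr.options.set_priority_system_immediate()      # bypass ratekeeper throttling
    start = time.monotonic()
    try:
        tr.get_read_version().wait()
        return time.monotonic() - start
    except fdb.FDBError:
        return None

# Usage sketch (against a live cluster):
# print(classify_grv(probe()))
```

If this probe returns quickly while your normal clients are stuck at GRV, ratekeeper throttling is the likely explanation; if even this hangs, look further down the pipeline.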

In addition to getting this info from your OS, the trace logs also record it. The ProcessMetrics event has a CPUSeconds field giving the number of CPU seconds used by the process, measured over a window of Elapsed seconds. Dividing the two gives a CPU utilization percentage. There is also a Roles field telling you which roles the process was acting in (e.g. MP for proxy), so you can figure out what in the system is too busy.
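As a rough illustration, the ProcessMetrics events can be turned into per-process CPU utilization with a small script like this (a sketch; it assumes the default XML trace format with one `<Event .../>` element per line):

```python
import xml.etree.ElementTree as ET

def cpu_utilization(lines):
    """Yield (roles, machine, utilization) for each ProcessMetrics trace event.

    `lines` is an iterable of trace-log lines; each ProcessMetrics event carries
    CPUSeconds used over a window of Elapsed wall-clock seconds, so their ratio
    is that process's CPU utilization for the window.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event") or 'Type="ProcessMetrics"' not in line:
            continue
        ev = ET.fromstring(line)
        elapsed = float(ev.attrib["Elapsed"])
        if elapsed > 0:
            yield (ev.attrib.get("Roles", ""),
                   ev.attrib.get("Machine", ""),
                   float(ev.attrib["CPUSeconds"]) / elapsed)

# Usage sketch: flag near-saturated processes in a trace file.
# for roles, machine, util in cpu_utilization(open("trace.xml")):
#     if util > 0.9:
#         print(f"{machine} ({roles}): {util:.0%} CPU")
```

Running this across the trace files from all hosts should quickly show whether a proxy, resolver, or tlog process is pegged at 100% while the cluster appears hung.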