Hello,
I am trying to load roughly 5 GB of data into a three_data_hall cluster with 9 logs, 6 storage servers, 9 coordinators, and 1 instance of each of the other roles.
I see that if I use more than 2 client processes to ingest the data, the cluster becomes unresponsive after ingesting a few MB.
The status changes from this:
Configuration:
Redundancy mode - three_data_hall
Storage engine - ssd-2
Coordinators - 9
Desired Proxies - 1
Desired Resolvers - 1
Desired Logs - 9
Usable Regions - 1
Cluster:
FoundationDB processes - 42
Zones - 27
Machines - 24
Memory availability - 7.5 GB per process on machine with least available
Retransmissions rate - 2 Hz
Fault Tolerance - 2 zones
Server time - 08/19/21 11:29:14
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 1.715 GB
Operating space:
Storage server - 65.6 GB free on most full server
Log server - 130.2 GB free on most full server
Workload:
Read rate - 51 Hz
Write rate - 63182 Hz
Transactions started - 3188 Hz
Transactions committed - 3163 Hz
Conflict rate - 0 Hz
Performance limited by process: Storage server performance (storage queue).
Most limiting process: 10.240.0.15:4500
to
Unable to read database configuration.
Configuration:
Redundancy mode - unknown
Storage engine - unknown
Coordinators - unknown
Usable Regions - unknown
Cluster:
FoundationDB processes - 42
Zones - 27
Machines - 24
Memory availability - 7.5 GB per process on machine with least available
Retransmissions rate - 3 Hz
Server time - 08/19/21 11:29:22
Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - unknown
Operating space:
Unable to retrieve operating space status
Workload:
Read rate - unknown
Write rate - unknown
Transactions started - unknown
Transactions committed - unknown
Conflict rate - unknown
What could I possibly look for to explain this behavior (and what additional info should I provide here to get better support)?
Thank you
The "Performance limited by process: Storage server performance (storage queue)" message in your first status suggests that your ingestion rate is saturating the storage servers, at least initially. Ordinarily I would expect status to still be readable while the cluster is saturated this way, but possibly something else is also heavily saturated and preventing a timely response to some of the status queries.
My suggestion would be to decrease the rate of your ingestion to see if that helps.
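For illustration only, here is a minimal sketch in Python (the batch size, rate cap, and API version are made up, and your loader no doubt looks different) of capping the commit rate on the client side so you can dial the ingestion pressure up and down:

    import time
    import fdb

    fdb.api_version(630)
    db = fdb.open()

    BATCH_SIZE = 500       # keys per transaction (illustrative)
    COMMITS_PER_SEC = 50   # per-client commit cap (illustrative)

    @fdb.transactional
    def write_batch(tr, pairs):
        # pairs is an iterable of (bytes key, bytes value) tuples
        for key, value in pairs:
            tr[key] = value

    def ingest(pairs):
        interval = 1.0 / COMMITS_PER_SEC
        batch = []
        for kv in pairs:
            batch.append(kv)
            if len(batch) >= BATCH_SIZE:
                start = time.monotonic()
                write_batch(db, batch)
                batch = []
                # sleep off the rest of the interval so commits stay under the cap
                time.sleep(max(0.0, interval - (time.monotonic() - start)))
        if batch:
            write_batch(db, batch)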
As I said, with fewer clients the cluster is indeed responsive. What is not clear to me is why it becomes completely unresponsive. To elaborate on my previous post: it is not only that status does not show up in the CLI; my ingesting clients also hang on transaction operations, i.e., they cannot process any new transactions and block at the GRV (get read version) request.
I would expect the ratekeeper to kick in and throttle the requests, perhaps down to a few per second, but it seems that everything stops altogether. The smallest number of clients that triggers this behavior is something like 12 client processes with 4 threads each, which is not a concurrency level high enough to justify the behavior I am seeing.
No, the proxy and resolver are on the same host but in two different processes. The other roles may or may not share a host, but they are still separate fdb processes (I do not pin roles to specific processes; I have something like 12 generic stateless processes, and these other roles are recruited onto them).
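For reference, the stateless processes are simply declared with the stateless class in foundationdb.conf, roughly like this (the port is illustrative):

    [fdbserver.4510]
    class = stateless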
It is possible to saturate parts of the system that ratekeeper does not monitor. For example, if you overload the proxies with work, your system could become unresponsive without ratekeeper getting involved.
You can try starting transactions at system immediate priority, which should still work if it is just ratekeeper enforcing the throttling. If those take a while to get a read version or fail to start, that would suggest your problem is elsewhere. My best guess, if you cannot get read versions, would be the proxies, though the transaction logs could also be involved. For something like this, I would check the CPU usage of the various processes to see if any are fully saturated.
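For example, with the Python bindings (a rough sketch; the API version and timeout are arbitrary), a probe like this should still return a read version quickly if it is only ratekeeper throttling you:

    import fdb

    fdb.api_version(630)
    db = fdb.open()

    tr = db.create_transaction()
    tr.options.set_priority_system_immediate()  # bypass normal-priority throttling
    tr.options.set_timeout(5000)                # ms; fail instead of hanging forever
    try:
        rv = tr.get_read_version().wait()       # the same GRV step your clients block on
        print("got read version:", rv)
    except fdb.FDBError as e:
        print("GRV failed even at system immediate priority:", e)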
In addition to being able to get this info from your OS, the trace logs also have a record of this data. The ProcessMetrics event has a CPUSeconds field that will tell you the number of CPU seconds used by the process, measured over a window of Elapsed seconds. If you divide these two numbers, you would get a CPU utilization percentage. There will also be a Roles field to tell you what role the process was acting in (e.g. MP for proxy), so you could figure out what in the system is too busy.
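If it helps, here is a rough Python sketch of pulling that out of the XML trace files (the log directory is an assumption; adjust it for your setup):

    import glob
    import xml.etree.ElementTree as ET

    # Compute CPU utilization per process from ProcessMetrics events
    # as CPUSeconds / Elapsed, and print the Roles field alongside it.
    for path in glob.glob('/var/log/foundationdb/trace.*.xml'):
        try:
            for _, ev in ET.iterparse(path):
                if ev.tag == 'Event' and ev.get('Type') == 'ProcessMetrics':
                    elapsed = float(ev.get('Elapsed', '0'))
                    if elapsed > 0:
                        util = float(ev.get('CPUSeconds', '0')) / elapsed
                        print(ev.get('Machine'), ev.get('Roles'), '%.0f%%' % (100 * util))
                ev.clear()
        except ET.ParseError:
            pass  # the file currently being written may not be well-formed XML yet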