Why do many clientThreads cause high CPU usage on fdbserver (stateless & grv_proxy)?

For ideal performance you’ll need to do a bunch of tuning.

Writes

The FDB write path works like this: your client sends the transaction to a commit proxy. The commit proxy forwards the conflict ranges to the resolver, which either allows or rejects the transaction depending on whether there are any conflicts.
From there the transaction is sent to the transaction logs. (All of these processes make up the “transaction system”.)
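
To make the flow above concrete, here is a minimal toy model of it in Python. It is only a sketch of the general optimistic-concurrency idea, not FDB's actual implementation: the names `Txn`, `ToyResolver` and `commit` are made up for illustration, and keys are plain strings with half-open ranges.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    read_version: int   # version the transaction read at
    read_ranges: list   # [(begin, end), ...] half-open key ranges that were read
    write_ranges: list  # [(begin, end), ...] half-open key ranges being written


def overlaps(a, b):
    """True if half-open ranges [a0, a1) and [b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class ToyResolver:
    # (commit_version, write_range) pairs of recently accepted transactions
    history: list = field(default_factory=list)

    def resolve(self, txn, commit_version):
        # Reject if anything the transaction read was written after its read version.
        for version, written in self.history:
            if version > txn.read_version and any(
                overlaps(written, r) for r in txn.read_ranges
            ):
                return False
        # Accepted: remember its writes so later transactions are checked against them.
        for w in txn.write_ranges:
            self.history.append((commit_version, w))
        return True


def commit(resolver, tlog, txn, commit_version):
    """Toy commit proxy: ask the resolver first, then append to the log."""
    if not resolver.resolve(txn, commit_version):
        return False
    tlog.append((commit_version, txn.write_ranges))
    return True


resolver, tlog = ToyResolver(), []
t1 = Txn(read_version=10, read_ranges=[("a", "c")], write_ranges=[("b", "b\x00")])
t2 = Txn(read_version=10, read_ranges=[("b", "b\x00")], write_ranges=[("x", "y")])
first = commit(resolver, tlog, t1, 11)   # accepted, nothing committed before it
second = commit(resolver, tlog, t2, 12)  # rejected, t1 wrote "b" after version 10
```

In this model only `t1` reaches the log. `t2` read a key that changed after its read version, so the client would have to retry it.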

This is my advice on where to place the processes you run:

  • Do not put tlog processes on the same servers as commitproxy/resolver processes.
    • Co-locating them can cut your throughput significantly, because you are putting two bandwidth-intensive loads on the same host and not fully utilizing cluster capacity.
  • Each tlog process should have its own disk, not shared with another tlog or storage process. Sharing the host with stateless processes is fine.
    • Sharing a disk increases flush latencies, which slows down transaction processing.
  • Put your resolution and commitproxy processes on hosts with high CPU performance, and make sure those hosts have enough network bandwidth to handle the load.
    • You will always want a few more processes than you configure your cluster for, in case of failures. You should be able to tolerate at least 1 host failing without performance degradation.
    • Put resolution and commitproxy on separate hosts if you can.
  • If you cannot allocate dedicated capacity, you can put the resolution and commitproxy processes alongside storage servers.
  • You should put some stateless processes alongside every storage server and/or TLog. These handle misc. tasks such as the cluster controller, ratekeeper, and GRV processing.
    • You usually should not need a dedicated process for GRV unless you have an extreme workload.
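
The two hard rules above (no tlog next to commitproxy/resolver, and one disk per tlog) are easy to check mechanically. A minimal sketch, assuming you describe your layout as a dict of host name to `(process_class, disk)` pairs. The layout format and the `check_layout` helper are hypothetical, and the class names (`log`, `commit_proxy`, `resolution`, `storage`, `stateless`) are what I would expect in foundationdb.conf, so check them against your version:

```python
# Process classes that should not share a host with a tlog, per the advice above.
COMMIT_PATH = {"commit_proxy", "resolution"}


def check_layout(hosts):
    """Return a list of human-readable problems found in the layout.

    hosts: {host_name: [(process_class, disk_or_None), ...]}
    """
    problems = []
    for host, procs in hosts.items():
        classes = {cls for cls, _ in procs}
        if "log" in classes and classes & COMMIT_PATH:
            problems.append(f"{host}: log shares a host with commit_proxy/resolution")

        # Group disk-backed processes by disk; stateless processes are ignored.
        disks = {}
        for cls, disk in procs:
            if cls in ("log", "storage"):
                disks.setdefault(disk, []).append(cls)
        for disk, users in disks.items():
            if "log" in users and len(users) > 1:
                problems.append(
                    f"{host}: log shares disk {disk} with another log or storage process"
                )
    return problems


layout = {
    "h1": [("log", "/d1"), ("commit_proxy", None)],
    "h2": [("log", "/d1"), ("storage", "/d1"), ("stateless", None)],
    "h3": [("log", "/d1"), ("storage", "/d2"), ("storage", "/d2"), ("stateless", None)],
}
problems = check_layout(layout)
```

Here `h1` and `h2` are flagged, while `h3` passes: its tlog has a disk to itself, and two storage processes sharing a disk is fine.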

And how many to recruit:

  • 1 resolver, unless more are absolutely necessary.
    • Adding more than 1 resolver increases the risk of bogus conflict rejections, especially on transactions that read many key ranges.
    • Usually, the resolver is not the bottleneck. Even if it is, that can be a sign of a bad workload in general, such as a very small hot key range.
  • A few GRV proxies
    • I do not have much advice here myself, but 3 should work.
  • The number of transaction (tlog) servers depends on your load and on your storage servers. Watch the disk utilization %, main loop utilization % and network usage to decide.
  • Commit proxies also depend on your workload.
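
As a sketch of how the counts above turn into a cluster setting: the helper below builds the `configure` line you would run in fdbcli. It assumes an FDB 7.x fdbcli that accepts `resolvers=`, `grv_proxies=`, `commit_proxies=` and `logs=`. The helper name is made up, and the commit proxy and log counts in the example are placeholders, not recommendations.

```python
def configure_command(commit_proxies, logs, grv_proxies=3, resolvers=1):
    """Build an fdbcli `configure` line for the given role counts.

    Defaults follow the advice above: 1 resolver and 3 GRV proxies.
    commit_proxies and logs have no default because they depend on the workload.
    """
    counts = {
        "resolvers": resolvers,
        "grv_proxies": grv_proxies,
        "commit_proxies": commit_proxies,
        "logs": logs,
    }
    for name, n in counts.items():
        if not isinstance(n, int) or n < 1:
            raise ValueError(f"{name} must be a positive integer, got {n!r}")
    return "configure " + " ".join(f"{name}={n}" for name, n in counts.items())


# Placeholder counts, pick yours from the metrics mentioned above.
cmd = configure_command(commit_proxies=4, logs=6)
# cmd == "configure resolvers=1 grv_proxies=3 commit_proxies=4 logs=6"
```

Remember that these counts are only what the cluster tries to recruit. You still need enough processes of the right class running, plus a few spares.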

Coordinators

No requirements here.

Storage servers

You should use Redwood for the storage engine.

If your workload has semi-predictable access patterns, try increasing or decreasing the cache size and see how it affects performance.

If you have a hot read range, such as metadata about the DB, you should use the rangeconfig command to increase its replica count.

You can usually run 2-3 storage class processes on the same disk to saturate the disk's performance.

Side note: I have not seen anyone run a 500TB FDB cluster. The biggest I have heard of was about 200TB.
Could you provide some specifics about the workload/use case?