Transaction implement related doubt

(zycHacker) #1

Recently i read the source code to learn how fundationdb implement transaction with just a little performance loss.There are some doubt that i can’t solve by myself

  1. For the resolution phase, the transaction batch is split and sent to resolvers.Is the request need to be send to all resolvers even include the resolvers whose related ranges are not contained in transaction batch.

  2. Similar with 1, in logging phase, is the request need to be sent to all tlog servers?
    If 1 and 2 is true, won’t this design make performance scale pool?

  3. The write subsystem is an unit it mean even an process is down the whole system can’t work durning recovery. But as far as my know in most shared-nothing system, if an shard is down only transaction related this shard is stall a moment by recovering. Won’t this design make system have lower available if write subsystem have many process?Is there something i misunderstanding?

(Markus Pilman) #2

The resolvers need to serialize the transaction batches they receive. The proxy will send the following information to all resolvers:

  1. a version-pair consisting of previousVersion and version - where previousVersion is the version that immediately precedes the commit version of the batch.
  2. a read set (key-ranges of all read key-value pairs within the range the resolver is responsible for)
  3. a write set (keys of all key-value pairs written by transactions within the range the resolver is responsible for).

The read-set and the write-set might both be empty, but the resolver still needs to know about the version as it can only do conflict resolution if it knows about all previous transactions.

The TLogs also have to serialize all batches. They will receive all mutations from committed transactions. But they will block writing the batch until they know about the previous batches (so they will receive a previousVersion, a version, and a set of mutations where the set of mutations might be empty).

The reason for that is that the TLogs need to store the mutations in order. For that it needs to know the size of the previous mutation-batch before it can write it to disk.

Yes and no. The key concept here is batching: the proxy will batch together many transactions before sending those to the resolvers and the tlogs. If commit time goes up (which usually happens as you scale your cluster), the proxy will increase the batch size. In other words, FDB will start trading response time for throughput. This means that FDB scales probably linear (up to a certain point) when it comes to throughput but response time will come down.

One thing that will eventually become a bottleneck is the maintenance of all the network connections the system needs to maintain. However, you will first run into an issue with the coordinator before the tlogs+resolvers+proxies become a bottleneck.

FDB values consistency over availability. What you write is partly correct: if a storage node goes down (as this is the more usual case as an fdb system will typically have many more storage servers than proxies, resolvers, tlogs, master, and cluster controller combined), the cluster continues to operate without any loss of availability (the tlogs will keep the writes and the other copies will serve the read requests).

Every key-value store has to make a decision whether it wants to value availability or consistency more. FDB is designed to be always consistent and never corrupt any data and it tries to get away with that by making recovery as fast as possible.

The other thing this design makes possible: in normal operation (no network partitions and no failing processes), all messages are sent without an overhead (there’s no two-phase commit or paxos involved). This means that as long as nothing fails, the system operates close to optimal speed. This work well as long as you don’t have hundreds of these processes. But as even very large fdb clusters usually don’t have more than tens of these special processes, failures are usually relatively rare.

(zycHacker) #3

Thank you very much! Excellent answer! I agree most opinion with you. Send empty request to resolvers and tlog server add extra network communication but batch reduce it. The problem is that this overhead is increased when add tlog severs and resolvers server.If system want to scale probably linear the proxy must increase the batch size and increase delay. But for an single proxy the maximum batch is limited for single core proxy’s throughput is limited and proxy need to run several batch in pipeline to fill up cpu.So i wonder how much core the write subsystem(tlogs + resolvers + proxies) can scale and what’s the coordinator you said? The master that offer the commit version?

For the availability, the problem is the availability reduce when you increase the number of write system process if every single process fail probability is equal and independent. But in fact it maybe wrong for single process is single core not single machine and multi process in same machine fail or live together.

Compare to 2PC, the optimal 2PC implement as i know need to 3 RPC(fetch timestamp, send prewrite, send commit) and 2 log persistent(prewrite and commit). But the response delay can be 2 RPC and 1 log persistent(return after receive all prewrite response). The fundationdb implement is 3RPC + 1 log persistent, the delay is the same. Both of them can use batch to reduce network and log persistent overhead. Consider the commit log in 2PC is just an mark which cost little io when use batch i think they are nealy same fast. The advantage of Foundationdb is:

  1. when conflict happen it perform better than 2PC(Both two conflict will abort in most implement) for it serialize all transactions.
  2. It offer SSI isolation and most 2PC implement just offer SI
    And the determined design is so cool and play well with its transaction design.

The disadvantage is what i said above it can’t scale too many core.The another problem is fetch read version need to fetch version from all proxies. But in practice it maybe fast enough for most scene

(Alex Miller) #4

The coordinator is a piece of the system uninvolved in the read or write path that serves as the way that processes bootstrap themselves and discover the rest of an FDB cluster.

The issues that you point out are true:

  • As one increases the process count in the write pipeline, the chance of there being a failure in the write pipeline increases.
  • As one increases the process count in the write pipeline, more messages need to be exchanged, so there’s additional overhead.

There does exist some theoretical point where FDB’s performance will be capped. All the proxies, resolvers, or transaction logs are running at 100% utilization, but adding another process would cost more in overhead than the additional parallelism would pay back.

I believe Markus is making arguments from the practical side of running FDB. Scalability is currently far more limited by how many processes can connect to a coordinator (or cluster controller for health monitoring) than the write pipeline. Until 6.0, we never closed network connections, and always heartbeat them once per second. Practically speaking, FDB overall is rather far away from the theoretical limitations imposed by its design being the limiting factor in its scalability.

(David Scherer) #5

If and when anyone actually hits the write scalability limitation imposed by “empty requests” between proxies, resolvers, and logs, there are relatively straightforward ways of eliminating it.

For example, consider that there are N total machines that need to exchange (possibly empty) messages in a big global batch. You could arrange them conceptually in a sqrt(N) * sqrt(N) square. Each machine A sends to each machine B in its “row” a list of all the (nonempty) messages it wants to send to machines in B’s column, and the machines B collect such messages from all As in their row and then send messages to all the machines C in their respective columns. Now there are at most O( N^(1/2) ) empty messages to or from any machine, and it’s easy to extend this to be O( d * N^(1/d) ) for any d, which is pretty scalable (for optimal d, its O(N log N).

In the end, physics limits the asymptotic access latency to N bytes of data to be O( N^(1/2) ) anyway…