Questions about the recently accepted FDB paper in Sigmod'21

zydong · June 6, 2021, 8:50am

Hi, it’s really exciting that Apple has published a paper about FDB in Sigmod’21. I learned a lot from FDB’s design principles such as decoupling write/read path and decoupling logging from recovery.

During reading, I have some questions about this paper:

In order to avoid the gap in serial history defined by the commit version, the proxy must send LSN and previous LSN to the Resolver. Does it mean that the proxy must broadcast to all resolvers even if that some resolver is not responsible for the key range accessed by committing transaction (just like the broadcast to all log servers)?
Although the previous LSN can tell the log server there might be a gap in the received logs, it is possible that the network between the corresponding proxy and log server fails, and the proxy cannot send logs to the log server. How does the log server handle this situation? Will it gossip to other log servers for the missing logs or just wait for long enough time (etc., 5s) and then trigger a recovery?
In the evaluation of figure 8(a), how is the number of proxies and log servers chosen (2 to 22)? What’s the number of resolvers? Is there any contention? Why does the read throughput not scale to roughly 6x?
In the evaluation of figure 8(b), why does the throughput of operations not scale to roughly 6x? Do resolvers and proxies reach the bottleneck simultaneously?

Thanks,
Zhiyuan Dong

jzhou · June 8, 2021, 5:24am

Hi Zhiyuan, for you questions:

Yes. The Proxy broadcast commit version and previous commit version to all resolvers for the same reason so that resolvers can process all transactions in the commit order.
If proxy can’t send to log server, this will trigger a reconfiguration, i.e., transaction system recovery. There is no gossip among log servers.
Figure 8(a), as we scale the number of machines, we add more proxies and log servers for peek performance. The exact setting are 2, 6, 10, 14, 18, and 22 proxies/log servers. We probably used 2 resolvers, which worked fine with small setting, but can saturate in 24 machine setting. For read performance, point read scales fine, but range read does not scale as expected and I suspect there is some inefficiency in range read handling. The workload is all random keys, so there is some low rate of contention, to mimic production workload.
Figure 8(b), both resolvers and proxies are saturated so FDB doesn’t scale 6x. As you have noticed, there are broadcast from proxy to resolvers and to log servers, which has a higher overhead when using more log servers.

zydong · June 8, 2021, 8:08am

Hi Jingyu, thanks for your clear answers!

PierreZ · June 17, 2021, 10:13pm

Topic		Replies	Views
Transaction implement related doubt FoundationDB Core	4	1092	October 20, 2018
Quesitons about FDB paper in SIGMOD'21 Development performance	0	423	July 19, 2022
Question about the FDB paper in Sigmod’21 Development	7	377	July 14, 2022
Technical overview of the database Using FoundationDB	26	12840	January 11, 2019
Question about SIGMOD'21 paper FoundationDB Core	10	507	April 13, 2022

Questions about the recently accepted FDB paper in Sigmod'21

Related topics