Network bandwidth and disk write bandwidth not matching for SS

I am benchmarking FoundationDB with a write-only workload. I used pidstat -p PID -d and nethogs to monitor disk and network throughput.
I observed that the incoming network bandwidth is the same across the proxy, transaction, and storage servers.
The network bandwidth and disk write throughput of the transaction log servers also match.
But for the storage servers, the network bandwidth is about 1/10 of the disk write throughput. In other words, the disk writes about 10 times as much data as is received from the network.

Each write transaction sets 10 keys.

Why are the storage servers writing much more than receiving from the network?
Also, the input_bytes value in the status json output is quite different from what pidstat reports, so it seems that metric cannot be used.

Because writing even a small KV pair causes the SS to modify and flush to disk at least one 4 KB data block.

Thanks, great point. Each transaction sets 10 keys, and each key has a 1 KB value. If 4 KB is the smallest unit, then I should observe only 4x the write bandwidth, but what I see is 10x.

But you can decrease the overhead if you set several keys that are close to one another, so that they fit into the same 4 KB unit.
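
A rough illustration of that point in Python. This is a toy model, not FDB's actual page layout: it assumes keys map to pages purely by sorted position, with a fixed `RECORDS_PER_PAGE` of 4 (a hypothetical value for ~1 KB records in a 4 KB page) and a made-up `KEYSPACE` size. It only counts how many distinct pages a batch of 10 keys would touch.

```python
import random

RECORDS_PER_PAGE = 4   # assumption: about four ~1 KB records per 4 KB page
KEYSPACE = 1_000_000   # hypothetical number of key positions

def pages_touched(key_positions, records_per_page=RECORDS_PER_PAGE):
    """Count the distinct pages hit by keys at the given sorted positions."""
    return len({pos // records_per_page for pos in key_positions})

rng = random.Random(42)

# 10 keys scattered across the key space: almost always 10 different pages.
random_keys = [rng.randrange(KEYSPACE) for _ in range(10)]

# 10 adjacent keys: they share pages, so far fewer pages are modified.
start = rng.randrange(KEYSPACE - 10)
consecutive_keys = list(range(start, start + 10))

print("pages touched, random keys:     ", pages_touched(random_keys))
print("pages touched, consecutive keys:", pages_touched(consecutive_keys))
```

In this model the consecutive batch touches 3 or 4 pages, while the random batch typically touches one page per key.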

We tried switching to ssd-rocksdb-experimental, and the disk write bandwidth was just twice the network bandwidth on the storage server processes.

And the constants in code are below

We are writing 10 keys * 1 KB/key per transaction. I am now not sure whether it is the page size that causes the write amplification relative to the incoming network bandwidth.

@leonliao Are you writing 10 random keys or 10 consecutive keys in each transaction? And what is the key size?

There are several possible explanations for the metrics you are seeing, so if you can answer that question I can probably tell you what is going on.

These are not used for anything. FDB always initializes SQLite to use 4k pages, and this cannot be easily changed because many things depend on or assume it.

RocksDB is an LSM-based storage engine, which accepts higher read cost in exchange for lower write cost by writing mutations linearly to disk in different files over time and compacting them later, merging newer changes with older changes repeatedly. This means that at read time there are many files on disk (vs one for a BTree) that potentially need to be checked for relevant results, particularly for range reads. Compactions use additional writes later to reduce the number of files on disk.

Thanks! 10 random keys were written in each transaction. The key size was only 10 bytes.

Here is a description of most of the writes occurring for this workload:

  • For each KV pair, there is usually one BTree page modified (logically, not physically, see below). I say “usually” because this can actually be two sometimes, as SQLite enforces that at least 4 entries fit in each BTree page, so a ~1k + 10 byte KV pair + overhead could be large enough that one or more of the KV pairs store part of their data in a separate per-record “overflow page”. FDB attempts to avoid under-filled overflow pages by splitting KV pairs that could cause them, but then there is still the possibility that each of the two split parts of the KV pair are written to separate pages.

  • Logically modifying a page actually involves two physical page writes on disk as part of the durability model. StorageServer commits must be atomic as a whole, so the modified pages are first written to a write-ahead log of page data for the initial commit, and then written to their original locations, over the previous page data, as part of the next commit.

  • Assuming the keys being written are new so the data set is growing, you will also have page splits causing extra writes. When a page already has 4 items and another is added, it is split into two pages which causes additional writes. The original page is modified (so two writes as above) to remove some records which are written to a new page (one write, as it’s not an existing page modification). Then, the parent page must be modified (two writes again) to add a child entry for the new page that was created.

  • There are a few other metadata pages that may need to be modified, such as marking new child or overflow pages as in-use, but these should be inconsequential.

Putting the above together, the writes you are seeing for random 1 KB KV pairs should be approximately:

  • N WAL page writes
  • N page updates
  • ~0.2 * N new page creation writes (on average, every 5th random insert causes a split)
  • ~0.2 * N parent WAL page writes to add the new child from the split
  • ~0.2 * N parent page updates to add the new child from the split

Summing these gives 2.6 * N page writes (1 + 1 + 0.2 + 0.2 + 0.2), which results in a disk write to KV write ratio of

2.6 * (Page Size) / (KV Size)
= 2.6 * 4k / 1k
= 10.4

Which seems to be what you have observed.
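
The same estimate as a small Python sketch, for anyone who wants to plug in other page or value sizes. The function and parameter names are made up for illustration. The 0.2 split fraction is the rough figure from the list above, not something measured, and the model ignores overflow and metadata pages.

```python
def write_amplification(page_size, kv_size, split_fraction=0.2):
    """Estimate disk bytes written per KV byte for random inserts.

    Follows the breakdown above: each insert costs one WAL page write and
    one in-place page update; a fraction of inserts also split a page,
    adding one new-page write plus a WAL write and an in-place update
    for the parent page.
    """
    wal_writes = 1.0
    page_updates = 1.0
    new_page_writes = split_fraction
    parent_wal_writes = split_fraction
    parent_updates = split_fraction
    page_writes_per_kv = (wal_writes + page_updates + new_page_writes
                          + parent_wal_writes + parent_updates)
    return page_writes_per_kv * page_size / kv_size

ratio = write_amplification(page_size=4096, kv_size=1024)
print(round(ratio, 2))
```

With `split_fraction=0` (no page splits, for example when overwriting existing keys) the same model gives 8x for 1 KB values.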

Great explanation! Thanks.
I will try my best to understand it. :sweat_smile: