Network bandwidth and disk write bandwidth not matching for SS

I am benchmarking FoundationDB with a pure-write workload. I used pidstat -p PID -d and nethogs to monitor disk and network throughput.
I observed that the incoming network bandwidth of the proxy, transaction log, and storage server processes is the same.
The network bandwidth and disk write throughput of the transaction log servers also match each other.
But for the storage servers, the network bandwidth is about 1/10 of the disk write throughput. In other words, the disk writes about 10 times the data received from the network.

Each write transaction sets 10 keys.

Why are the storage servers writing much more than they receive from the network?
Also, the input_bytes from the status json output is quite different from what pidstat reports, so it seems that metric cannot be used.

Because writing even a small KV pair causes the SS to modify and flush at least one 4 KB data block to disk.

Thanks, great point. I set one transaction with 10 keys, and each key has a 1 KB value. If 4 KB is the smallest unit, then I should observe only 4x write bandwidth, but it is 10x.

But you can decrease the overhead if you set several keys close to one another so they fit into the same 4 KB unit.
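A quick back-of-envelope sketch of that effect (assuming each dirtied 4 KB block is flushed whole; the key/value sizes here are made up for illustration):

```python
import math

# Rough sketch: bytes flushed per transaction when 10 small KV pairs
# land in 10 different 4 KB blocks vs. packed into adjacent keys.
# The numbers are illustrative, not measured.
BLOCK = 4 * 1024
keys_per_txn = 10
kv_bytes = 100  # hypothetical ~100-byte KV pairs

scattered = keys_per_txn * BLOCK  # each key dirties its own block
clustered = math.ceil(keys_per_txn * kv_bytes / BLOCK) * BLOCK  # keys share blocks

print(scattered, clustered)  # 40960 4096
```

So for small values, clustering the keys can cut the flushed bytes by roughly the number of keys that fit in one block.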

We tried switching to ssd-rocksdb-experimental, and the disk write bandwidth was only about 2 times the network bandwidth on the storage server processes.

And the constants in the code are below:
SQLITE_DEFAULT_PAGE_SIZE 1024
SQLITE_MAX_DEFAULT_PAGE_SIZE 8192

We are writing 10 keys × 1 KB/key per transaction. I am now not sure whether it is the page size that causes the write amplification relative to the incoming network bandwidth.

@leonliao Are you writing 10 random keys or 10 consecutive keys in each transaction? And what is the key size?

There are several possible explanations for the metrics you are seeing, so if you can answer that question I can probably tell you what is going on.

These are not used for anything; FDB always initializes SQLite to use 4k pages, and this cannot easily be changed, as many things depend on or assume it.

RocksDB is an LSM based storage engine, which trades off read cost for write cost by writing mutations linearly to disk in different files over time and compacting them later, merging newer changes with older changes repeatedly. This means that at read time there are many files on disk (vs one for a BTree) that potentially need to be checked for relevant results, particularly for range reads. Compactions use additional writes later to reduce the number of files on disk.
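A toy model of that trade-off (the level count, fanout, and "rewritten roughly fanout/2 times per level" rule are rough modeling assumptions for leveled compaction in general, not RocksDB's actual configuration):

```python
# Toy model of the LSM read/write trade-off: writes land sequentially,
# but each record is rewritten as compaction merges it down the levels,
# while reads may have to consult several files.
def lsm_write_amp(levels, fanout):
    # 1 initial flush, plus roughly fanout/2 rewrites per level on
    # average as data is compacted downward (a common rough estimate).
    return 1 + levels * fanout / 2

def lsm_read_files(levels, l0_files):
    # A point read may check every L0 file plus one file per deeper level.
    return l0_files + levels

print(lsm_write_amp(levels=3, fanout=10))    # 16.0
print(lsm_read_files(levels=3, l0_files=4))  # 7
```

The point is only directional: extra background writes (compaction) buy fewer files to check at read time, which is the opposite trade from a BTree.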

Thanks! 10 random keys were written in each transaction. The key size was only 10 bytes.

Here is a description of most of the writes occurring for this workload:

  • For each KV pair, there is usually one BTree page modified (logically, not physically, see below). I say “usually” because this can actually be two sometimes, as SQLite enforces that at least 4 entries fit in each BTree page, so a ~1k + 10 byte KV pair + overhead could be large enough that one or more of the KV pairs store part of their data in a separate per-record “overflow page”. FDB attempts to avoid under-filled overflow pages by splitting KV pairs that could cause them, but then there is still the possibility that each of the two split parts of the KV pair are written to separate pages.

  • Logically modifying a page actually involves two physical page writes on disk as part of the durability model. StorageServer commits must be atomic as a whole, so the modified pages are first written to a write-ahead log of page data for the initial commit, and then written to their original locations, over top of the previous page data, as part of the next commit.

  • Assuming the keys being written are new so the data set is growing, you will also have page splits causing extra writes. When a page already has 4 items and another is added, it is split into two pages which causes additional writes. The original page is modified (so two writes as above) to remove some records which are written to a new page (one write, as it’s not an existing page modification). Then, the parent page must be modified (two writes again) to add a child entry for the new page that was created.

  • There are a few other metadata pages that may need to be modified, such as marking new child or overflow pages as in-use, but these should be inconsequential.

Putting the above together, the writes you are seeing for random 1kb KV pairs should be approximately

  • N wal page writes
  • N page updates
  • ~0.2 * N new page creation writes (on average, every 5th random insert causes a split)
  • ~0.2 * N parent wal page writes to add new child from split
  • ~0.2 * N parent page updates to add new child from split

Which results in a disk-write to KV-write ratio of

2.6 * (Page Size) / (KV Size)
= 2.6 * 4k / 1k
= 10.4

Which seems to be what you have observed.
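For anyone following along, the arithmetic above can be checked in a few lines (this is just a sketch of the model described in this post, not actual FDB accounting):

```python
# Back-of-envelope model of SQLite BTree write amplification in FDB,
# following the write breakdown above. The 0.2 split probability
# (every 5th random insert splits a page) is taken from the post.
PAGE_SIZE = 4 * 1024   # FDB initializes SQLite with 4 KB pages
KV_SIZE = 1 * 1024     # ~1 KB value plus a 10-byte key per pair
SPLIT_PROB = 0.2       # fraction of random inserts that cause a page split

def disk_to_kv_ratio(page_size, kv_size, split_prob):
    # Per KV pair: 1 WAL page write + 1 in-place page update, plus, on a
    # split: 1 new page write + 2 parent page writes (WAL + update).
    pages_per_kv = 1 + 1 + split_prob * (1 + 2)  # = 2.6
    return pages_per_kv * page_size / kv_size

print(round(disk_to_kv_ratio(PAGE_SIZE, KV_SIZE, SPLIT_PROB), 1))  # 10.4
```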

Great explanation! Thanks.
I will try my best to understand it. :sweat_smile: