Best practices for bulk load

Hello,

I want to test my working data set under FoundationDB and I’m stuck with very slow initial insert speed. I can’t figure out how to make importing faster or what the best practices are here.

I use several concurrent goroutines for inserting, with 5 MB batch sizes. But with a 3-node cluster and a double ssd configuration it is really slow, and from time to time I lose my cluster according to the fdbcli status command:

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with a quorum of coordination servers:
  192.168.1.1:4500:tls  (unreachable)
  192.168.1.2:4500:tls  (unreachable)
  192.168.1.3:4500:tls  (unreachable)

Also I can get:

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 265.770 GB

Or:

Data:
  Replication health     - Healthy (Rebalancing)

So my questions are:

  • If I import into a one-node cluster and then add the 2 other nodes, will it be faster? Any thoughts or best practices?
  • Maybe there is a way to make data durability less strict? Disable fsync() for the import stage?

Are you inserting keys in a sequential order?

Try randomizing the order to spread the load around the cluster rather than it all hitting one machine at the beginning or end of the key space you’re writing to.

I don’t think 5 MB is optimal; it’s probably too high. I have never written that much in a transaction during my testing over the last few weeks.

The client library is single-threaded regardless of how many goroutines you start (AFAIK), so splitting the load across multiple processes should also increase parallelism, which is needed to fully use the cluster. This will only work if you’re splitting the keyspace you’re writing to as well.
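
Not the poster’s code, just a rough sketch of what that combination might look like inside each loader process with the Go bindings (github.com/apple/foundationdb/bindings/go); the chunking, shuffling, batch sizes and worker count are my own illustrative choices:

```go
// Rough sketch: shuffle pre-built chunks of key/value pairs so writes land
// all over the keyspace, and have a few workers commit them in modest
// transactions. Chunk size and worker count are illustrative only.
package bulkload

import (
	"log"
	"math/rand"
	"sync"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

type kv struct {
	Key   fdb.Key
	Value []byte
}

// LoadChunks inserts every chunk in random order using `workers` goroutines,
// one transaction per chunk.
func LoadChunks(db fdb.Database, chunks [][]kv, workers int) {
	// Randomize the chunk order instead of inserting sequentially.
	rand.Shuffle(len(chunks), func(i, j int) { chunks[i], chunks[j] = chunks[j], chunks[i] })

	work := make(chan []kv)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range work {
				// Keep each chunk small (e.g. a few hundred KB) and well
				// under the 5-second transaction limit.
				_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
					for _, p := range chunk {
						tr.Set(p.Key, p.Value)
					}
					return nil, nil
				})
				if err != nil {
					// Real code would retry or collect errors.
					log.Println("chunk failed:", err)
				}
			}
		}()
	}
	for _, c := range chunks {
		work <- c
	}
	close(work)
	wg.Wait()
}
```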

Are you inserting keys in a sequential order?

Yes, I’m doing a full scan from another B-tree-like DB and inserting into FoundationDB.

I don’t think 5 MB is optimal; it’s probably too high. I have never written that much in a transaction during my testing over the last few weeks.

The client library is single-threaded regardless of how many goroutines you start (AFAIK), so splitting the load across multiple processes should also increase parallelism, which is needed to fully use the cluster. This will only work if you’re splitting the keyspace you’re writing to as well.

Thank you for sharing your experience and thoughts. Will try to use several processes and sort out how to randomize keys.

I got the best results in Go by starting a goroutine for each transaction, having each one do its encoding/packing and then perform the actual transaction, because they do some work up front and then block for a while.

With that, I can insert a rough average of 60k key-values per second, or around 10k transactions, on a single-node cluster.

Normally in Go, this practice will overload the scheduler and cause a deadlock, so it does seem counterintuitive, but in this case it is how I got the best performance.

I don’t need to set an upper bound on the number of goroutines, though you might want to just in case.
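
Purely as an illustration of that pattern (not the poster’s actual code), a goroutine-per-transaction loader might look roughly like this; the tuple keys, values and batch counts are placeholders:

```go
// Sketch of the goroutine-per-transaction pattern: each goroutine first does
// its CPU work (packing keys), then spends most of its life blocked on the
// commit. Keys, values and batch counts are made up.
package main

import (
	"log"
	"sync"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func main() {
	fdb.MustAPIVersion(620) // use the API version matching your installed client
	db := fdb.MustOpenDefault()

	var wg sync.WaitGroup
	for batch := 0; batch < 1000; batch++ {
		wg.Add(1)
		go func(batch int) {
			defer wg.Done()

			// Encoding/packing happens up front, before the transaction.
			keys := make([]fdb.Key, 0, 100)
			for i := 0; i < 100; i++ {
				keys = append(keys, fdb.Key(tuple.Tuple{"bulk", batch, i}.Pack()))
			}

			// Then the goroutine mostly blocks waiting for the commit.
			if _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
				for _, k := range keys {
					tr.Set(k, []byte("value"))
				}
				return nil, nil
			}); err != nil {
				log.Printf("batch %d failed: %v", batch, err)
			}
		}(batch)
	}
	wg.Wait()
}
```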

If you are bulk loading data that cannot possibly conflict with any other transactions, you may want to look at setting the SetReadYourWritesDisable option, and maybe also play with SetNextWriteNoWriteConflictRange.

WARNING: this may break some ACID guarantees, because these writes will not be seen by concurrent transactions. You may not care if you are restoring/importing data while nothing else is running, or if you have another flag somewhere to “publish” the data for readers.
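
For reference, here is a minimal sketch of how those two options can be set from the Go bindings on a bulk-load transaction, assuming nothing else touches these keys and nothing reads them back inside the same transaction; the keys and values are placeholders:

```go
// Sketch only: setting the two options mentioned above on a bulk-load
// transaction via the Go bindings. Assumes the load cannot conflict with
// anything else and never reads back what it writes.
package bulkload

import "github.com/apple/foundationdb/bindings/go/src/fdb"

func loadChunk(db fdb.Database, pairs map[string][]byte) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		// Don't track mutations client-side; we never read them back here.
		if err := tr.Options().SetReadYourWritesDisable(); err != nil {
			return nil, err
		}
		for k, v := range pairs {
			// Applies only to the next write, so it is re-set before every
			// Set() (see the discussion further down).
			if err := tr.Options().SetNextWriteNoWriteConflictRange(); err != nil {
				return nil, err
			}
			tr.Set(fdb.Key(k), v)
		}
		return nil, nil
	})
	return err
}
```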

Technical details

Internally, the fdb client creates an implicit write conflict range for every key that is inserted or mutated. The list of all write conflict ranges is sent to the cluster alongside the mutations when the transaction commits, and the conflict ranges are checked by the cluster (write conflict ranges are also stored so that they can make other transactions conflict!).

So if you do a set('Hello', 'World'), the client will send the byte literals 'Hello', 'World', 'Hello', 'Hello\0' over the network. The last two “Hello”s correspond to the write conflict range (for a single key, the range begins at the key and ends just before the key padded with an extra NUL byte).

This means that - by default - you can have a ~3x amplification of the size of your keys (in memory and on the network; I’m not sure if this is also stored in the logs?).

In practice, I did see a small gain of 20-25% when testing with localhost. I don’t have any data on the impact with remote processes.

By setting NextWriteNoWriteConflictRange before each call to set(..), the client will not create a write conflict range and will only send 'Hello' and 'World' to the proxy, saving some bandwidth.

If you enable ReadYourWritesDisable, then the client will not have to track your mutations internally, saving some CPU cycles. This only works if you don’t read back the mutated keys from the same transaction! Be careful with this; it can break things!

I think that you could also play with manually adding a giant write conflict range that spans the whole range of keys inserted: the client should merge all the single ranges into one larger range and only send that to the cluster. But you will still pay the cost of managing and merging this list.
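
If I understand that idea correctly, a sketch of it in the Go bindings could look like the following; the chunk boundaries are hypothetical, and whether the client really merges the per-key ranges into the big one is exactly the untested part:

```go
// Sketch of the "one giant write conflict range" idea: declare a single
// range covering every key this transaction will write, then write as usual;
// the implicit per-key ranges should then be merged into it on the client
// before commit. Begin/end are placeholders for the real span of the chunk.
package bulkload

import "github.com/apple/foundationdb/bindings/go/src/fdb"

func loadChunkWithOneRange(db fdb.Database, pairs map[string][]byte, begin, end fdb.Key) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		// One explicit conflict range spanning the whole chunk.
		if err := tr.AddWriteConflictRange(fdb.KeyRange{Begin: begin, End: end}); err != nil {
			return nil, err
		}
		for k, v := range pairs {
			tr.Set(fdb.Key(k), v)
		}
		return nil, nil
	})
	return err
}
```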

Implementation

You can see how the client formats the transaction commit request in Transaction::commitMutations():

You can see how set(...) handles the mutations in Transaction::set() (the addConflictRange argument is controlled by setting SetNextWriteNoWriteConflictRange before every write):

Thank you for the answer!

Yes, that’s my case. Going to try these options.

Now it’s clearer to me, thanks!

Before each tx.Set() call, or can I set it once for the whole transaction at the beginning?

Got it.

This optimisation is for saving network, memory and maybe logs, right?

Unfortunately, it must be before each call, because set(…) resets the flag every time. This means that you cannot call set(…) on the same transaction object concurrently from multiple threads, or they would stomp on each other.
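
A tiny wrapper (sketch only, names are my own) makes the “set the option, then write” pairing harder to forget; like the transaction itself, it must not be called concurrently from multiple goroutines:

```go
// Sketch: wrap the "re-arm the flag, then write" pair so it cannot be
// forgotten. Not safe for concurrent use on the same transaction, for the
// reason explained above.
package bulkload

import "github.com/apple/foundationdb/bindings/go/src/fdb"

func setNoConflict(tr fdb.Transaction, key fdb.Key, value []byte) error {
	// The option only applies to the very next write; Set() consumes it.
	if err := tr.Options().SetNextWriteNoWriteConflictRange(); err != nil {
		return err
	}
	tr.Set(key, value)
	return nil
}
```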

All of the above, but don’t expect a huge difference. I would only recommend this for bulk imports, not for regular transactions, or only for very specific optimizations after having studied all the implications.

Also, I think that a benchmark using this optimization would be somewhat cheating, because it does not test the database under normal conditions.

Interesting. Thank you.

Yeah, I understand this. But I need a fast import tool for future performance tests.

Thank you once again for helping.