FoundationDB Bulk Loader tool

(Adolfo) #1

It would be great to have a bulk loader tool. Something like Oracle’s SQL Loader.

(Brian Haslet) #2

This would be trivial to write; however, it depends on what you mean.

FoundationDB imposes no data model of its own, so you would need the tool to map your source data into an appropriate model during import.

You could use tuples of KV pairs as the source format for a core import tool, then condition your data at a higher level of abstraction to translate it into that format prior to import.

(Adolfo) #3

The problem isn’t the model but the performance.

One of our regular data loads (one of the smaller ones) is about 200,000 “rows”, each with 280 “columns”. The data size on disk is about 1.3 GB.

Since FoundationDB doesn’t support transactions longer than 5 seconds, I don’t think this data volume could be loaded within that limit.

The bulk tool should disable that transaction timeout, indexes, key generation rules, etc., load the whole dataset, and then re-enable the transaction rule and apply the data model “rules” again if necessary.

What I mean is that it might be a good idea to provide a tool, an API, or a set of instructions that allow you to…

  1. Open a transaction
  2. Load a huge amount of data through the network (SwiftNIO, Netty… maybe?) without time restrictions
  3. Close the transaction
  4. Apply all internal processes involved in creating the new records

It’s more or less like “injecting the data straight into a vein” :wink:


(Brian Haslet) #4

Aha, well you make a good argument relating to the 5 second transaction limit.

Yes, I can see your point.

Does it make a difference to your application if you batch the transactions?

For instance (and I don’t know your data set), you could easily import 10k records at a time and commit each batch, assuming it falls within the transaction timeout.

At the application level you could do this piecewise, and at the end of the entire import finally mark an “imported” flag as complete, if partial visibility were a problem.
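The batch-and-commit approach above can be sketched in Python; `batched`, the 10,000-row default, and the `fdb` calls in the trailing comment are illustrative assumptions, not something from the thread:

```python
# A minimal sketch of splitting a bulk import into fixed-size batches,
# each of which would be committed in its own transaction.

def batched(rows, batch_size=10_000):
    """Yield successive fixed-size batches from an iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch

# With the FoundationDB Python binding, each batch would then be loaded
# roughly like this (hypothetical, not executed here):
#
#   import fdb
#   fdb.api_version(630)
#   db = fdb.open()
#
#   @fdb.transactional
#   def load_batch(tr, batch):
#       for key, value in batch:
#           tr[key] = value
#
#   for batch in batched(rows):
#       load_batch(db, batch)   # one transaction per batch
```

The per-batch commit keeps each transaction comfortably inside the 5-second window, at the cost of the import not being atomic as a whole (hence the “imported” flag).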

With regard to #4, that’s app-specific. You could choose to import record by record and run your logic on each one, or batch by batch, or run it once on the entire data set after the whole thing is loaded. It really depends on your app’s priorities.

I don’t think this is a FoundationDB problem - you would have the same issue with almost all storage engines. FDB gives you the flexibility to do what makes sense for you. That’s a guiding light in the design, although sometimes you may have to think.

For instance, importing 30TB of data via a single SQL transaction wouldn’t be pretty either :slight_smile:

(Adolfo) #5

Does it make a difference to your application if you batch the transactions?

That’s the solution we thought of at first.

We’ll try a timer approach… open a transaction, insert for 4 seconds, close the transaction, repeat.
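That timer approach could be sketched like this; `timed_batches` and the 4-second budget are illustrative, and the clock is injectable so the grouping can be tested deterministically:

```python
import time

def timed_batches(rows, budget_seconds=4.0, clock=time.monotonic):
    """Group rows into batches, starting a new batch (i.e. a new
    transaction) once the time budget for the current one is spent,
    staying safely under FoundationDB's 5-second transaction limit."""
    batch, start = [], clock()
    for row in rows:
        if batch and clock() - start >= budget_seconds:
            yield batch  # the caller commits this batch in one transaction
            batch, start = [], clock()
        batch.append(row)
    if batch:
        yield batch
```

Note that the budget has to leave headroom for the commit itself; 4 seconds against a 5-second limit is a reasonable margin, but a slow commit could still push a transaction over.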


:wink: :man_technologist:t2:

(Christophe Chevalier) #6

Do you do any reads in the bulk import transactions? I think the 5-second timer only kicks in at the first read in each transaction (which needs to get the current read version from the cluster in order to read anything; that version can then be rejected at commit time if it is older than 5 seconds). If you only do writes and then commit, you should not be impacted by this, and would probably only be limited by the transaction size limit (10 MB?).

When I did bulk loading, the main issue was usually trying not to overshoot this 10 MB limit. If each row is fairly similar in size, you can guess in advance how many rows fill 10 MB, and maybe only load half of that per transaction. And if row sizes are very volatile, a transaction could get too large and fail to commit: if that happens, split the buffer in half and retry both halves. To me this looked a lot like TCP trying to guess the optimal window size.
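The halve-and-retry idea can be sketched as follows; the `TransactionTooLarge` class is a hypothetical stand-in for the real error (with the `fdb` binding it would surface as an `FDBError`, if I recall correctly with code 2101, `transaction_too_large`), and `try_commit` is whatever function commits one batch in one transaction:

```python
class TransactionTooLarge(Exception):
    """Stand-in for FDB's 'transaction_too_large' commit error."""

def load_with_split(batch, try_commit):
    """Try to commit a batch in one transaction; if it exceeds the
    transaction size limit, split it in half and retry each half
    recursively, TCP-window style."""
    try:
        try_commit(batch)
    except TransactionTooLarge:
        if len(batch) <= 1:
            raise  # a single row larger than the limit cannot be split
        mid = len(batch) // 2
        load_with_split(batch[:mid], try_commit)
        load_with_split(batch[mid:], try_commit)
```

Every row ends up committed exactly once, in order, and each successful transaction is at most as large as the limit allows.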

(Adolfo) #7

No reads during bulk operation, just inserts.

I have never run into the size problem. I usually work with Oracle SQL Loader or MySQL LOAD DATA INFILE, and size was never an issue.

But it’s true that FDB is not a relational database, and some concepts or workflows need to be adapted to work in a NoSQL / distributed / key-value way :wink: