Can subspaces/directories be used without limit (modelling), and a question about multi-client

(Pontus Lundin) #1


Testing FDB on my local machine with the Go binding, very cool =)
I have 2 questions:

  1. About data modelling. In Cassandra, for example, with time series data it can be good to use a partition key such as “userID::Activity::yy-mm-dd” rather than one wide row. In FDB, can I use subspaces without limit? I mean, can I create (“userID”, “Activity”) and (“userID”, “Friends”) for every user (similar to a “bucket”), or does that come with a cost (other than storage)? Does it make reads faster than storing everything in one big index with some kind of tuple/prefix? (Or is the subspace/directory just a “wrapper” for prefixing the key anyway?) In your docs about data modelling you only use one subspace and let the key tuple describe the “path” or relationship instead of using multiple subspaces per path.

Take an example: a calendar where events are stored per week (start/end date of the week) and the UI should populate these events onto the calendar. The request comes in with a start/end date. Does it seem OK to make subspaces of calendarID::weekDayStart::WeekDayEnd and store the individual events (each event with its date as key, as well as a counter for counting the events for this week) in this subspace?

  2. When you speak about multi-client/multi-process: is this the same as load balancing the requests (in the case of a web app) across multiple backends, where each app uses its own DB connection (albeit shared between the controllers in the app)?

(David Scherer) #2

FoundationDB maps ordered byte string keys to byte string values, period. The subspace, directory and tuple layers are all just mappings of different structures to byte strings that preserve relevant forms of ordering. Consequently you have a lot of freedom in data modeling, and no one is going to be able to explain all of the imaginable tradeoffs in one post. But if you keep in mind the basic principle that everything somehow maps to ordered keys, you will be less likely to be surprised.

Subspaces are just syntactic sugar for prefixing a key, and so have the same performance effects as doing it yourself. Directories represent a level of indirection: for example, they make the prefix applied to contents smaller, destroy the lexicographic ordering of directory names, and permit directories to be renamed efficiently. A directory with a composite path (“a”, “b”, “c”) takes a little longer to open, but its contents will still be stored under a single, short prefix. All the metadata for directories is of course also stored in the database as keys and values.
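
To make the "subspace is just a key prefix" point concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the FoundationDB binding: `pack`, `ToySubspace` and `ToyStore` are hypothetical names, and the string encoding is only loosely modelled on the tuple layer (strings only, no other types).

```python
import bisect


def pack(*parts: str) -> bytes:
    """Toy order-preserving encoding of a tuple of strings.

    Assumption: each string becomes 0x02 + UTF-8 bytes + 0x00, with embedded
    NUL bytes escaped, so that byte order follows tuple order.
    """
    out = b""
    for part in parts:
        out += b"\x02" + part.encode("utf-8").replace(b"\x00", b"\x00\xff") + b"\x00"
    return out


class ToySubspace:
    """A 'subspace' that does nothing except prepend a fixed prefix."""

    def __init__(self, *prefix: str):
        self.raw_prefix = pack(*prefix)

    def key(self, *parts: str) -> bytes:
        return self.raw_prefix + pack(*parts)


class ToyStore:
    """Sorted keys plus a dict, standing in for an ordered key-value store."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def set(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def prefix_range(self, prefix: bytes):
        """Return all (key, value) pairs whose key starts with prefix, in order."""
        start = bisect.bisect_left(self._keys, prefix)
        rows = []
        for k in self._keys[start:]:
            if not k.startswith(prefix):
                break
            rows.append((k, self._values[k]))
        return rows


store = ToyStore()
activity = ToySubspace("user42", "Activity")
friends = ToySubspace("user42", "Friends")

store.set(activity.key("2018-05-01"), b"ran 5 km")
store.set(friends.key("user7"), b"")
# A key built by hand with the full tuple lands in the same "subspace".
store.set(pack("user42", "Activity", "2018-05-02"), b"swam")

same_key = activity.key("2018-05-02") == pack("user42", "Activity", "2018-05-02")
activity_rows = store.prefix_range(activity.raw_prefix)
```

In this model a per-user (“userID”, “Activity”) subspace and one big index keyed by the full tuple produce byte-for-byte identical keys, so a range read over either one touches the same data.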

(Alec Grieser) #3

I’m not quite sure what you were referring to that we’ve said, but I will say that it would be reasonable to have multiple application servers sit in front of an FDB cluster and then load balance the requests between them (using your favorite load balancing algorithm). The FDB client also does some load balancing itself between different FDB server processes, but maybe that’s not relevant.

As @dave mentioned, every key in FDB is a byte string, and the database is an ordered map from byte string to byte string. As such, there isn’t a hard limit on the number of keys that can be in a subspace. (In particular, if too many keys are too close together, then a process in the cluster will rebalance keys between the nodes. This is why there doesn’t need to be a hashed “partition” key like there is in, say, Cassandra.) I’m not quite sure what WeekDayStart and WeekDayEnd are supposed to signify here, but it could be a reasonable data model depending on your data. Whether or not it is might depend on what kinds of queries you’ll want your application to serve.
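
As one possible illustration of the calendar case, here is a toy sketch that stores events under keys of the form (calendarID, date, eventID) and answers a "give me this week" request with a single range read. This is an assumption about the layout, not a recommendation, and it is plain Python rather than the FDB API: Python tuples in a sorted list stand in for tuple-encoded keys, and `add_event` and `events_in_week` are hypothetical helpers.

```python
import bisect
from datetime import date, timedelta

keys = []    # sorted (calendar_id, iso_date, event_id) tuples
values = {}  # key -> event title


def add_event(calendar_id: str, day: date, event_id: str, title: str) -> None:
    key = (calendar_id, day.isoformat(), event_id)
    if key not in values:
        bisect.insort(keys, key)
    values[key] = title


def events_in_week(calendar_id: str, week_start: date):
    """Range read over [week_start, week_start + 7 days) for one calendar.

    ISO dates (YYYY-MM-DD) sort lexicographically in chronological order,
    so the events of a week are contiguous in key order.
    """
    begin = (calendar_id, week_start.isoformat())
    end = (calendar_id, (week_start + timedelta(days=7)).isoformat())
    lo = bisect.bisect_left(keys, begin)
    hi = bisect.bisect_left(keys, end)
    return [(k, values[k]) for k in keys[lo:hi]]


add_event("cal1", date(2018, 5, 7), "e1", "Standup")
add_event("cal1", date(2018, 5, 13), "e2", "Brunch")
add_event("cal1", date(2018, 5, 14), "e3", "Planning")  # following week
add_event("cal2", date(2018, 5, 8), "e4", "Dentist")    # other calendar

week = events_in_week("cal1", date(2018, 5, 7))
week_titles = [title for _, title in week]
week_count = len(week)
```

With this layout the week boundaries live in the query rather than in the key, so the same data can also serve day or month views. Whether a separately stored per-week counter is worth having depends on how often only the count is needed.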

One perhaps non-obvious problem that you might run into with time-series data in particular is that, as time goes in one direction[citation needed], you can wind up performing all of your insertions to a single database shard (i.e., a range of keys that FDB keeps contiguous on a storage server, which is the basic unit of data distribution; these are somewhat like token ranges in Cassandra, but they are dynamic and based on the data themselves rather than being static and based on a hash). The data distribution algorithm will notice that the shard is taking a lot of traffic and is getting bigger, and perform a shard split… only to find that the new shard is taking all of the traffic. This is a performance pathology that we are aware of (and test), so if it’s necessary, you can probably get away with it, but your performance might be better if you can shard inserts across multiple queues to avoid hitting one of them too hard.
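
A rough sketch of what "shard inserts across multiple queues" could look like: prepend a small bucket number to each time-ordered key so that consecutive inserts land in different key ranges, and merge one range read per bucket when reading. This is a toy in-memory model in plain Python, not the FDB API. `NUM_BUCKETS`, `bucket_for`, `insert` and `read_range` are hypothetical, and choosing the bucket by a CRC of the event ID is just one assumption among several workable schemes.

```python
import bisect
import heapq
import zlib

NUM_BUCKETS = 4  # hypothetical; more buckets spread writes further but cost more reads

keys = []  # sorted (bucket, timestamp, event_id) tuples standing in for ordered keys


def bucket_for(event_id: str) -> int:
    """Deterministically assign an event to a bucket (assumed scheme)."""
    return zlib.crc32(event_id.encode("utf-8")) % NUM_BUCKETS


def insert(timestamp: int, event_id: str) -> None:
    # The bucket comes first in the key, so increasing timestamps no longer
    # all append to the same end of the keyspace.
    bisect.insort(keys, (bucket_for(event_id), timestamp, event_id))


def read_range(t_start: int, t_end: int):
    """Return (timestamp, event_id) with t_start <= timestamp < t_end, in time order.

    Issues one range read per bucket and merges the already-sorted results.
    """
    per_bucket = []
    for b in range(NUM_BUCKETS):
        lo = bisect.bisect_left(keys, (b, t_start))
        hi = bisect.bisect_left(keys, (b, t_end))
        per_bucket.append([(ts, eid) for _, ts, eid in keys[lo:hi]])
    return list(heapq.merge(*per_bucket))


for i in range(100):
    insert(i, f"event-{i}")

window = read_range(10, 20)
```

The tradeoff is visible in `read_range`: writes are spread out, but every time-range query becomes `NUM_BUCKETS` range reads plus a merge.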

(Pontus Lundin) #4

Many thanks Alec and David for your detailed answers and insights. I will try some different models and get back to this space =) Love the flexibility with FDB!

As for the reference to the multi-client approach: it tends to be mentioned as the solution when benchmarks or issues with a saturated thread have been posted. Running concurrent/multiple processes seems to be what happens anyway when running multiple apps behind a proxy/load balancer.