Aggregate Functions on Double | Distinct Count

Hi,

So I recently started exploring FoundationDB and the Record Layer, and I have a couple of questions about them.

As of now, Record Layer’s aggregate functions only support integer fields, the reason being that they are built on top of FoundationDB’s atomic mutations. Is my understanding correct? If that is indeed the case, is there a plan to extend support for atomic operations on doubles? If not, how can aggregate functions (e.g. SUM) on doubles be implemented in Record Layer?

Also, I’m a little lost here. I’m trying to implement a DISTINCT COUNT query in Record Layer. For example:
message Simple {
  string preference_tag = 1;
}
Let’s assume there are 100 records, each with either a “blue” or a “red” preference_tag. Is there a way I can query the distinct count of preference_tag values (which here would be 2)?
P.S. I do have an alternative approach: keep a count index on preference_tag, then backtrack and do a native lookup. But I’m assuming there is a better way that I’m missing.

Thanks.

is there a plan to extend support for Atomic operations on double?

Atomic operations need to be total functions from [byte] to [byte], so it’s not clear to me what the semantics should be for a double. What do you do if the existing value is not the expected width? How many bits should be interpreted as the significand?

One thing you could do (not sure if this works for your use case or not) is use fixed-point arithmetic (see Wikipedia’s “Fixed-point arithmetic” article) with the current atomic add.

Edit: I guess more accurately they’re functions from value → value, since values have an upper bound on size.
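
To illustrate the idea, here is a minimal sketch: pick a fixed scale factor up front, round each double onto that fixed-point grid, and accumulate it losslessly with FDB’s little-endian atomic ADD. The key layout, scale, and API version here are assumptions, not anything prescribed.

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.tuple.Tuple;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FixedPointAdd {
    // Hypothetical scale: two decimal digits of fractional precision.
    private static final long SCALE = 100;

    // MutationType.ADD interprets the parameter as a little-endian integer.
    static byte[] littleEndianLong(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(630); // use your cluster's API version
        try (Database db = fdb.open()) {
            byte[] sumKey = Tuple.from("sum", "my_double_field").pack(); // hypothetical key
            double delta = 12.34;
            // Round onto the fixed-point grid; precision beyond SCALE is lost here,
            // but the atomic add itself is exact integer arithmetic.
            long scaled = Math.round(delta * SCALE);
            db.run(tr -> {
                tr.mutate(MutationType.ADD, sumKey, littleEndianLong(scaled));
                return null;
            });
            // To read back: decode the little-endian long and divide by SCALE.
        }
    }
}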

Thanks Andrew. I might be able to spin something up via fixed-point arithmetic. Off the top of my head, for my use case, a somewhat shameless fixed upper bound on the expected significand might do the trick.

Also, any insights on the second question?

My apologies, I should have pinged the Record Layer folks sooner.

Paging @alloc. 🙂

One note on the first question: yeah, the atomic mutation indexes are all built on top of FDB atomic ops, as that allows us to maintain the indexes without having transactions contend on every update. I believe the Record Layer should support MAX_EVER and MIN_EVER operations on double (or float) fields using the MAX_EVER_TUPLE and MIN_EVER_TUPLE index types. (With some slight caveats about what happens to NaN values…) Adding (scalable) support for double addition is harder, as @andrew.noyes suggests, without either something like an ADD_IEEE_754 atomic operation or some kind of generalized operation pushdown. I think some kind of fixed-point thing is probably your best bet.
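
For what it’s worth, defining such an index might look something like the sketch below. The “Reading” record type, its double “value” field, the index name, and the pre-existing metaDataBuilder are all assumptions for illustration:

import com.apple.foundationdb.record.metadata.Index;
import com.apple.foundationdb.record.metadata.IndexTypes;
import static com.apple.foundationdb.record.metadata.Key.Expressions.field;

// Track the largest "value" ever written for the hypothetical "Reading"
// record type. MAX_EVER_TUPLE compares tuple-encoded values, so it also
// works for double fields (modulo the NaN caveats above).
metaDataBuilder.addIndex("Reading",
        new Index("maxValueEver", field("value").ungrouped(), IndexTypes.MAX_EVER_TUPLE));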

As to the second, there isn’t any out-of-the-box support for distinct count indexes, no. To be clear, if you wanted to answer the question “how many distinct records have a ‘blue’ preference tag”, then I think you just need a COUNT index grouped by preference_tag (as there can only be one copy of the record in the record store). If your use case is more like you have a record:

message Simple {
  string preference_tag = 1;
  string sample_id = 2;
  int64 rec_id = 3;
}

where rec_id is the primary key and you want to know, say, how many distinct sample IDs have a given preference_tag, then I don’t think we quite have the index you need. If you had an index on preference_tag or (preference_tag, sample_id), you could implement this by querying for results, removing duplicates, and then counting, though that grows linearly with the count, which isn’t great.
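
Defining those two indexes might look roughly like this; the index names and the metaDataBuilder are assumptions, and the GroupingKeyExpression with a grouped count of 0 is, to the best of my knowledge, how COUNT indexes express their grouping:

import com.apple.foundationdb.record.metadata.Index;
import com.apple.foundationdb.record.metadata.IndexTypes;
import com.apple.foundationdb.record.metadata.expressions.GroupingKeyExpression;
import static com.apple.foundationdb.record.metadata.Key.Expressions.concat;
import static com.apple.foundationdb.record.metadata.Key.Expressions.field;

// Count of records per preference_tag ("how many records are blue").
metaDataBuilder.addIndex("Simple",
        new Index("countByTag",
                new GroupingKeyExpression(field("preference_tag"), 0),
                IndexTypes.COUNT));

// Ordinary value index for the scan-and-dedupe approach: entries are
// sorted by (preference_tag, sample_id), so duplicates are adjacent.
metaDataBuilder.addIndex("Simple",
        new Index("tagAndSample",
                concat(field("preference_tag"), field("sample_id"))));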

I think a “distinct count” index maintainer could also feasibly be added, though maybe adding a new index maintainer as a first project is a little much. The way I think this would work is that it would keep track of (in the instance above) a regular value index on (preference_tag, sample_id) and a second subspace that looked a lot like a count index on preference_tag. The maintainer would update the value index, check whether it’s a “new” sample_id, and then conditionally update the count-index part based on whether the new record is the only one in its group (see the sketch below). The costs of this index are: (1) the storage space, which grows linearly with the count, so if storage space is at a premium, that might be a problem; (2) the extra read I/O on index update; and (3) some amount of extra contention when the first key in a group is changed. The first cost is somewhat ameliorated by the fact that the index can also be used to satisfy other queries on the indexed keys (just like a normal value index), so if you were going to need that value index anyway, the marginal cost of the distinct count index might actually be low.
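
The conditional update at the heart of such a maintainer might look something like this sketch against the raw FDB Java API. The subspace layout and the method itself are purely hypothetical; a real Record Layer IndexMaintainer has a lot more to it (deletes, rebuilds, etc.):

import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.subspace.Subspace;
import com.apple.foundationdb.tuple.Tuple;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class DistinctCountSketch {
    static byte[] littleEndianLong(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    // On insert: write the value-index entry, and bump the per-tag distinct
    // counter only if this (tag, sample_id) pair was not present before.
    static void indexInsert(Transaction tr, Subspace valueIndex, Subspace countIndex,
                            String tag, String sampleId, long recId) {
        // This range read is the extra I/O (cost 2) and, for the first key in
        // a group, the source of the extra contention (cost 3).
        Range pair = valueIndex.range(Tuple.from(tag, sampleId));
        boolean isNew = !tr.getRange(pair, 1).iterator().hasNext();
        tr.set(valueIndex.pack(Tuple.from(tag, sampleId, recId)), new byte[0]);
        if (isNew) {
            tr.mutate(MutationType.ADD, countIndex.pack(Tuple.from(tag)),
                    littleEndianLong(1L));
        }
    }
}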

There are possibly some strategies (like using bloom filters instead of full indexes, though supporting deletes on that data structure may be difficult) that theoretically could reduce space usage and/or contention at the cost of accuracy, but that’s probably not a road you want to go down unless you absolutely need to.


It’s similar if what you want is just the number of distinct preference_tag values. (Where the answer for “blue” and “red” is two.)

Add a COUNT index grouped by this field, and then get the DISTINCT count by scanning the entries in that index and counting the non-zero values. This is more expensive than a non-DISTINCT count, but introduces no more write conflicts.
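
Something like the following, reusing the hypothetical “countByTag” index from the earlier sketch. I’m not certain of the exact scan API in every version, so treat this as an approximation:

import com.apple.foundationdb.record.IndexScanType;
import com.apple.foundationdb.record.ScanProperties;
import com.apple.foundationdb.record.TupleRange;
import com.apple.foundationdb.record.metadata.Index;
import com.apple.foundationdb.record.provider.foundationdb.FDBRecordStore;
import java.util.concurrent.CompletableFuture;

class DistinctByScan {
    // Scan every group of the COUNT index and count groups whose counter is
    // non-zero; cost grows with the number of distinct values, not records.
    static CompletableFuture<Integer> distinctTagCount(FDBRecordStore recordStore) {
        Index index = recordStore.getRecordMetaData().getIndex("countByTag");
        return recordStore
                .scanIndex(index, IndexScanType.BY_GROUP, TupleRange.ALL, null,
                        ScanProperties.FORWARD_SCAN)
                .filter(entry -> entry.getValue().getLong(0) != 0L)
                .getCount();
    }
}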

If lots of write conflicts are okay and computing the count efficiently is more important, we could have a new index type that kept a counter for each value, incremented an overall counter whenever a new per-value counter was created (or went from zero to positive), and decremented it whenever a per-value counter went to zero. These checks would mean that writes using the same value conflict with each other. You could do the same thing today, in advance of such an index existing, with records.
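
A sketch of that do-it-yourself version against the raw FDB Java API, with a made-up key layout; note that the plain read of the per-value counter is exactly what creates the conflict:

import com.apple.foundationdb.Database;
import com.apple.foundationdb.MutationType;
import com.apple.foundationdb.tuple.Tuple;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ManualDistinctCount {
    static byte[] littleEndianLong(long v) {
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    static long decodeLittleEndianLong(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Record one occurrence of `tag`, bumping the overall distinct counter
    // the first time the tag is seen. The read of the per-value counter adds
    // a read conflict, so writers of the same tag conflict with each other.
    static void recordTag(Database db, String tag) {
        byte[] perValueKey = Tuple.from("count_by_tag", tag).pack();  // hypothetical
        byte[] distinctKey = Tuple.from("distinct_tag_count").pack(); // hypothetical
        db.run(tr -> {
            byte[] current = tr.get(perValueKey).join();
            if (current == null || decodeLittleEndianLong(current) == 0L) {
                tr.mutate(MutationType.ADD, distinctKey, littleEndianLong(1L));
            }
            tr.mutate(MutationType.ADD, perValueKey, littleEndianLong(1L));
            return null;
        });
    }
}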
