One note on the first question: yeah, the atomic mutation indexes are all built on top of FDB atomic ops, as that allows us to maintain the indexes without having transactions contend on every update. I believe the Record Layer should support MAX_EVER and MIN_EVER operations on double (or float) fields using the MAX_EVER_TUPLE and MIN_EVER_TUPLE index types. (With some slight caveats about what happens to NaN values…). Adding (scalable) support for double addition is harder, as @andrew.noyes suggests, without either something like an ADD_IEEE_754 atomic operation or some kind of generalized operation pushdown. I think some kind of fixed-point representation is probably your best bet.
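To illustrate both ideas, here's a minimal sketch using the Record Layer's Java meta-data API. The record type MyRecord and the fields score and score_micros are hypothetical, and the fixed-point variant assumes you scale the double to an int64 (e.g., micro-units) before saving the record:

import com.apple.foundationdb.record.metadata.Index;
import com.apple.foundationdb.record.metadata.IndexTypes;
import com.apple.foundationdb.record.metadata.RecordMetaDataBuilder;
import static com.apple.foundationdb.record.metadata.Key.Expressions.field;

void addAtomicIndexes(RecordMetaDataBuilder metaData) {
    // Largest value ever written to the hypothetical double field "score";
    // maintained with a byte-max atomic op, so concurrent writers don't conflict.
    metaData.addIndex("MyRecord", new Index("max_ever_score",
            field("score").ungrouped(), IndexTypes.MAX_EVER_TUPLE));

    // Fixed-point workaround for addition: store score_micros = round(score * 1e6)
    // as an int64 on the record and SUM it with atomic adds.
    metaData.addIndex("MyRecord", new Index("sum_score_micros",
            field("score_micros").ungrouped(), IndexTypes.SUM));
}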
As to the second, there isn’t any out-of-the-box support for distinct count indexes, no. I guess, to be clear, if you wanted to answer the question “how many distinct records have a ‘blue’ preference tag”, then I think you just need a COUNT index grouped by preference_tag anyway (as there can only be one copy of the record in the record store).
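A sketch of that index definition (again assuming the Java meta-data API and a hypothetical MyRecord record type):

import com.apple.foundationdb.record.metadata.Index;
import com.apple.foundationdb.record.metadata.IndexTypes;
import com.apple.foundationdb.record.metadata.RecordMetaDataBuilder;
import com.apple.foundationdb.record.metadata.expressions.GroupingKeyExpression;
import static com.apple.foundationdb.record.metadata.Key.Expressions.field;

void addCountByTagIndex(RecordMetaDataBuilder metaData) {
    // COUNT of records in each preference_tag group; maintained with atomic adds,
    // so concurrent inserts into the same group don't conflict.
    metaData.addIndex("MyRecord", new Index("count_by_preference_tag",
            new GroupingKeyExpression(field("preference_tag"), 0),
            IndexTypes.COUNT));
}

If your use case is more like you have a record: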
message Simple {
  string preference_tag = 1;
  string sample_id = 2;
  int64 rec_id = 3;
}
Here rec_id is the primary key. If you want to know, say, how many distinct sample IDs have a given preference_tag, then I don’t think we quite have the index you need. If you had an index on preference_tag or on (preference_tag, sample_id), then you could implement this by querying for results, removing duplicates, and then counting, though this grows linearly with the number of matching records, which isn’t great.
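For example, with a value index on preference_tag (so the planner can avoid a full scan), the client-side version might look like this minimal sketch, where Simple is the generated protobuf class from above (error handling and continuations omitted; for large groups you’d want to page through results with continuations rather than hold everything in one transaction):

import com.apple.foundationdb.record.RecordCursor;
import com.apple.foundationdb.record.provider.foundationdb.FDBQueriedRecord;
import com.apple.foundationdb.record.provider.foundationdb.FDBRecordStore;
import com.apple.foundationdb.record.query.RecordQuery;
import com.apple.foundationdb.record.query.expressions.Query;
import com.google.protobuf.Message;
import java.util.HashSet;
import java.util.Set;

// Count distinct sample_ids for one tag by scanning matches and
// de-duplicating client-side. Work is linear in the number of matches.
long distinctSampleIds(FDBRecordStore recordStore, String tag) {
    RecordQuery query = RecordQuery.newBuilder()
            .setRecordType("Simple")
            .setFilter(Query.field("preference_tag").equalsValue(tag))
            .build();
    Set<String> seen = new HashSet<>();
    try (RecordCursor<FDBQueriedRecord<Message>> cursor = recordStore.executeQuery(query)) {
        cursor.forEach(rec ->
                seen.add(Simple.newBuilder().mergeFrom(rec.getRecord()).build().getSampleId())
        ).join();
    }
    return seen.size();
}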
I think a “distinct count” index maintainer could also feasibly be added, though maybe adding a new index maintainer as a first project is a little much. I think the way this would work is that it would keep track of (in the instance above) a regular value index on (preference_tag, sample_id) and a second subspace that looked a lot like a count index on preference_tag. Then the maintainer would update the value index, check whether it’s seeing a “new” sample_id, and conditionally update the count-index part based on whether the new record is the only one in its group (see the sketch below). The costs of this index are: (1) the storage space, which grows linearly with the number of indexed records, so if storage space is at a premium, that might be a problem; (2) the extra read I/O on index update; and (3) some amount of extra contention when the first key in a group is changed. The first cost is somewhat ameliorated by the fact that the index can also be used to satisfy other queries on the indexed keys (just like a normal value index), so if you were going to need that value index anyway, the marginal cost of the distinct count index might actually be low.
There are possibly some strategies (like using Bloom filters instead of full indexes, though supporting deletes on that data structure may be difficult) that could theoretically reduce space usage and/or contention at the cost of accuracy, but that’s probably not a road you want to go down unless you absolutely need to.