Authorization in FDB

(Disclaimer: This is not a well-written post and stuff in here is probably described badly. My main intention is to make people aware of a new feature. Please ask questions if anything is unclear :slight_smile: ).

At Snowflake we’re currently working on a simple authorization feature. I wanted to announce this here in case others in the community find this useful and want to start preparing to use this (or want to start testing this as soon as we have it code-complete in the main branch).

Current State

Currently there is almost no security in FDB. The one existing security feature is that clients can be authenticated via mTLS; as of today, mTLS is the only way to use TLS in FDB. The rationale here is that clients are treated in a binary way: either they are trusted entities and therefore get full access to FDB, or they're not and therefore can't establish a connection.

Small Tangent: Multi-tenancy (new 7.1 feature)

FDB 7.1 will introduce the concept of tenants. This feature is still in development, but the basic functionality is working (other things, like workload isolation, automatic movement of tenants across clusters, and meta-cluster management, will be added in later versions of FDB).

A tenant provides its own transactional subspace. Running transactions that touch multiple tenants is not something we support; instead, tenants should be thought of as independent databases. So instead of running 10 clusters for 10 applications, we can now run one cluster (or more, up to 10, depending on the load requirements).
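
As a rough sketch of what this will look like from an application, here is how a tenant might be created and used via the Python bindings. The tenant API is still in development, so treat the names fdb.tenant_management.create_tenant and open_tenant (and the tenant name) as assumptions that may change before release:

```python
import fdb

fdb.api_version(710)  # tenants are a 7.1+ feature

db = fdb.open()

# Create a tenant (the name 'app-one' is illustrative). These helpers follow
# the in-development 7.1 Python bindings and may change before release.
fdb.tenant_management.create_tenant(db, b'app-one')
tenant = db.open_tenant(b'app-one')

@fdb.transactional
def set_value(tr, key, value):
    tr[key] = value

# Transactions run against a tenant only see that tenant's keyspace; the same
# key in another tenant is a completely independent value.
set_value(tenant, b'hello', b'world')
```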

Authorization Model

The feature we’re currently implementing is very simple: instead of all-or-nothing access, a client can be given access to only a limited number of tenants.

The model we’re using is the following:

Client Machine (untrusted) -> Authentication Service -> Application -> FDB
  1. We assume there will be some client (a machine not controlled by the organization that runs the FDB application – so this could be an iPhone App or some web browser). It will send requests to some service.
  2. Before anything happens, there will be some authentication service.
  3. Then the actual application (which runs the FDB client) will receive this request. This machine will then only read/write data of a specific tenant or a small set of tenants (a tenant could be a user or a specific service – ultimately the application will decide what the meaning of tenant is).

In this new world, the application will connect to FDB using TLS, but it won't provide a certificate (so mTLS won't be a requirement anymore; only FDB → FDB connections will have to use mTLS). FDB will accept the connection, but it won't allow the client to do anything useful. That means that, by default, the client won't be able to read or write any data.

If it wants to do anything, it has to send an access token to FDB. This token will basically just contain a list of tenants the client is allowed to access (raw key access will be denied) and this token has to be signed by some private key. FDB will then need to know about the public key (distributing public keys to the FDB nodes will, again, be the responsibility of the user).

One of the difficulties of using this feature is that the user will have to figure out how to generate the tokens and deliver them to the application. Technically, FDB doesn't grant authorization, it only enforces it: the token can be generated by some service and then passed to the application.
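
To make this concrete, below is a minimal sketch of what such a token-minting service could do, assuming a JWT-style token signed with RS256 via the PyJWT library. The claim names ("tenants", "exp") and the exact wire format FDB will accept are not finalized, so this only illustrates the flow: a signing service holds the private key, and the resulting token carries the allowed tenants plus an expiration.

```python
import time

import jwt  # PyJWT; the actual token format FDB accepts may differ

# Private key held by the token service; the matching public key is
# distributed to the FDB nodes out of band.
with open("token-signing-key.pem", "rb") as f:
    PRIVATE_KEY = f.read()

def mint_token(tenants, ttl_seconds=300):
    """Mint a short-lived token granting access to a list of tenants.

    The claim names here are illustrative; the point is that the token
    carries the allowed tenants plus an expiration and is signed, so that
    FDB can verify it with the corresponding public key.
    """
    claims = {
        "tenants": tenants,
        "exp": int(time.time()) + ttl_seconds,
    }
    return jwt.encode(claims, PRIVATE_KEY, algorithm="RS256")

# The application receives this token and attaches it to its FDB connection;
# FDB only verifies the signature and enforces the tenant list.
token = mint_token(["app-one"])
```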

Limitations

This first implementation will have some limitations:

  1. There's nothing that will prevent a replay attack. The only protection against replay attacks is a TTL in the token. So if a token is leaked, anyone who can make a connection to the FDB cluster will be able to use this token. This has to be addressed later, but for now it's the responsibility of the application to keep tokens safe.
  2. There's a relatively high operational overhead: public keys need to be distributed to all FDB hosts. Key rotation has to be solved outside of FDB. In order to create a token, the FDB client library will provide helper functions, but ultimately, most of the burden will be with the user. We don't plan to change this. We expect most FDB users will already have some infrastructure for these kinds of operations.
  3. Authorization is still very coarse – instead of being all-or-nothing at the cluster level, it will simply be at the tenant level. We don't yet have plans to change this.

Timeline

We’re planning to release this feature in FDB 7.2 which we hope to release in fall 2022. The code is currently being written and tested. As soon as the APIs are finalized, we will write some documentation so people can test this feature on a prerelease version.

Thanks @markus.pilman for starting this thread! :slight_smile:

I am assuming that by "application" here you mean the FDB client. Regarding "mTLS won't be a requirement anymore": assume we have a zero-trust network architecture (i.e., there are no private networks, and SPIFFE mTLS and FDB mTLS are used).

If Application (FDB Client) uses regular TLS (not mTLS), then what would prevent some random host on the Internet from opening a TLS connection to the cluster?

If mTLS is used between Application (FDB Client) and the cluster, then I think it would significantly limit the blast radius of a potential token leak.

The attacker will not only have to gain access to the token, they will also need the private key material that would identify the Application (FDB Client).

The impact of any token leak from “Client Machine” would be reduced.

While we are on the topic of identity/authn/authz, I was wondering if Snowflake considered using SPIFFE standard to replace FDB mTLS?

Nothing, but this feature is not intended to let you open up an FDB cluster to the world. The use-cases we're thinking of are a bit more narrow:

  1. You could have multiple applications within the same company and you want to ensure that they don’t interact with each other. This could, for example, help with compliance. It also reduces the blast radius of potential application bugs to a tenant.
  2. The machine that runs the application is operated by you, but it runs some untrusted code. Authorization will allow you to limit the blast radius.
  3. Similar to the point above, you could allow your customers to run some code on machines you own (usually you would sandbox this, but a sandbox might not be 100% secure – for example due to kernel exploits). If they try to do something malicious, they will only be able to read their own data, but not the data of other tenants.

No, what you're describing can be achieved much more easily with proper firewall rules. If you want to prevent a machine from connecting to a cluster, it shouldn't be able to even attempt a connection (this is also true if you use mTLS, as this will help you prevent DDoS attacks and the like). Again, opening an FDB cluster to the world is currently a terrible idea and this authorization feature won't change this (and honestly this might never be a goal; I don't see reasons why you would want to be able to connect directly to FDB from outside a controlled network).

This is also why the replay attack might not be an issue for everyone: in order to execute this attack, an attacker would need a copy of a token and a machine that can connect to FDB.

Thanks for the reply @markus.pilman. Just to confirm, we do not intend to open our FDB cluster to the world.

In a zero-trust environment, to a very large extent we would avoid network-segmentation, private VLANs, NAT and firewall rules. Instead workloads will have identity and use mTLS between them. There are still ways of initiating DDoS attacks but the mitigation approach is also different.

From a design point of view, if we can preserve the current ability to do mTLS between Application (FDB Client) and the cluster, I think that would be sufficient.

Being able to do mTLS will allow us to establish identity, authentication and encryption at the network layer between the FDB Client and the cluster. Then on top of that we can use token based authz.

I see. So before going into further details I just want to mention that mTLS comes at a cost. mTLS connections are significantly more CPU heavy than non-mutual ones. We don’t know yet how high this overhead is in FDB. But you could reasonably say that you’re willing to pay this cost of course.

So at a high level, what you need is a way for FDB to differentiate between privileged connections and non-privileged connections. One thing I didn't explain above is that with a non-mTLS connection, clients won't be able to call certain RPC endpoints (so that they, for example, can't pretend to be a storage server). So for FDB → FDB connections (and also management machine → FDB or provisioning service → FDB) we can't use the current token mechanism.

However, these are solvable problems. Today we already have a solution for this implemented (in the main branch, not in any release): you can pass a set of IP subnets to FDB, and FDB will then treat connections from those IPs as "trusted" (so they will have privileged access and can call any RPC endpoint and read and write all data). By default this allow list will be empty, and empty is interpreted as "every IP address is in the allow list". The main reason we implemented this is that it makes testing much easier (especially simulation testing). So if IP spoofing is impossible in your network, this might be a possibility (although it is not great, because if you ever need to resize your subnets you have to remember to potentially update the FDB configs on all FDB machines).
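
To illustrate the allow-list semantics described above (this is just the decision logic, not FDB code), a small sketch:

```python
import ipaddress

def ip_is_trusted(allow_list, client_ip):
    """Return True if client_ip should be treated as a trusted (privileged) peer.

    allow_list is a list of subnet strings, e.g. ["10.0.0.0/8"]. An empty
    allow list means every IP address is trusted, matching the default
    behavior described above.
    """
    if not allow_list:
        return True
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(subnet) for subnet in allow_list)

# With an empty list everyone is trusted; with a non-empty list only
# addresses inside the configured subnets are.
assert ip_is_trusted([], "203.0.113.7")
assert ip_is_trusted(["10.0.0.0/8"], "10.1.2.3")
assert not ip_is_trusted(["10.0.0.0/8"], "203.0.113.7")
```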

Another possible solution would be to enforce token-based access for everyone but also introduce a new token type for full privileged access. The main drawback here is that it makes things like key rotation harder. If an fdbserver process can't get a token, it won't be able to join the cluster. If many fdbserver processes don't get a token (or their tokens expire and the refresh mechanism doesn't work because some other service has an availability loss), the cluster could potentially become unavailable. However, this might be manageable and this solution probably wouldn't be too hard to implement.

To be clear: we don't plan to get rid of mTLS on clients. In the current design it will just mean that you need an IP allow list so that clients don't get privileged access merely because they have a valid certificate.

Below is a table that shows how FDB decides whether a connection gets privileged access ("privileged" means the combination will result in a privileged connection, "token" means fdb will enforce token-based authorization):

| Mechanisms used on the client | Allow List Empty | Client IP in Allow List | Client IP not in Allow List |
|---|---|---|---|
| mTLS | privileged | privileged | token |
| Non-mutual TLS | token | token | token |
| TLS disabled | privileged | privileged | token |

Please note that today we only support the combination mTLS & Allow List Empty (as the IP allow list doesn’t exist today). This ensures that if you don’t do anything, the behavior will only change minimally (clients in 7.2 will be able to connect to the FDB cluster without certificates, but they won’t be able to do anything useful).

Thanks again for the detailed reply @markus.pilman :slight_smile: I really appreciate it.

Would I be right in understanding that a privileged connection is a connection that has RAW_ACCESS?

Instead of "Client IP in Allow List", which ties the identity of a privileged connection to an IP address, would it be possible to extend the current peer verification mechanism so that the connection becomes privileged only if the FDB client presents a certificate with a "privileged" field?

From an operational point of view this would mean the entity (operator, provisioning service) would have to make a request to a secrets management tool for a short-lived "privileged" certificate. This request can be logged in the secrets management tool, and how the certificate is used can be logged in the application (for a provisioning service) or via terminal logging (for an operator).

“Client IP in Allow List” can still be used to provide additional access control, but not as an identity mechanism. For example if an FDB Client presents a valid “privileged” certificate, but is not in “Client IP in Allow List”, then the connection does not become privileged. In the event the short lived certificate leaks out of the operator machine or provisioning service machine, the blast radius of that leak would still be restricted.

Yes, but it’s more than that. The following things will require a privileged connection:

  • Joining the cluster as a server (so all fdbserver processes need a privileged connection, but we also want to prevent a client from "pretending" to be a server).
  • Management operations (like configuration changes, killing servers, running status json,…). So fdbcli will need a certificate.
  • Creating and deleting tenants.

So as you can see, using authorization will probably require some work for most users and it won’t be a feature you can just turn on.

Sure. It is not something we plan to do right now; however, we do accept PRs :wink:. Implementing this is probably not too difficult. I would expect that there are two things that require some time:

  1. Coming up with a good user interface for such a feature. You need to somehow pass the information about what to verify to fdbserver. That's probably not very hard but requires some thought.
  2. Such a feature would require testing.

(2) is probably the big one here. So if you want to do this yourself, you probably want to wait until we have more TLS specific testing written (this part is still in the pipeline).

I see that. Would it be possible to manage "Client IP in Allow List" transactionally using the management module? That way, we can bootstrap the cluster with a small number of privileged connections and then have a provisioning service adjust this list as cluster changes happen.

:slight_smile: Since you mentioned PRs, realistically I think we are about 3 years away from being able to meaningfully contribute to FDB core. We will eventually get there… slowly and steadily. :slight_smile:

My immediate focus is around layer development and infrastructure tooling and making sure we have the right building blocks for that. You’ll see most of my contributions to FDB in those areas.

I think we should have our first production workloads and applications live in about a year. FDB talent is very rare, so it will probably take us another two years to build up the required operational expertise in-house, after which we should be able to start contributing to core.

I do want to add, I am very grateful to both Snowflake and Apple for FoundationDB and for the emphasis you have in the project for code quality and testing.

This is a really nice feature, I’m already thinking on how we could leverage this :smiley:

From a layer engineer point-of-view, the tenant behaves a lot like the directory in the sense that it will generate an id and a prefix. Because of how the HCA works, moving a directory across clusters is not trivial. Do you have some ideas on how automatic movement of tenants across clusters will be implemented?

Also, I’m wondering if tenants should be automatically tagged? :thinking:

EDIT: Another question: will we be able to import a tenant with a prefix? :thinking:

That would introduce a potential security problem. You then also need to keep this API safe somehow (in security the sad reality is that convenient often means not secure :frowning: ).

This is planned. This is also one reason why cross-tenant transactions are not possible. The rough plan is to have a management cluster that will act as a discovery server and will automatically load balance tenants across clusters. The goal is that FDB's scaling limit will mostly limit the maximum size of a tenant, while a meta-cluster can scale almost infinitely.

Yes I think they are. There’s a whole workload isolation project at Snowflake that tries to address noisy neighbor problems. However, getting this right is a very difficult problem and therefore it will take a while until this is working well.

We currently don't have plans to do this. A cluster either has to be configured with tenants enabled or not; you won't be able to convert from one to the other. I realize that this is a huge limitation. It would be possible to solve this, but it's difficult: it requires a complicated dance between FDB and the application if you want to do it online. Therefore we leave it to the application to solve this.

Thanks for the details @markus.pilman :smiley:

I can see how expanding the scope can be a potential security problem.

Over the past few days, as a background process I spent some time thinking about the upcoming design.

Whenever I try to understand a security design, I try to build up a mental model that separates out the entity, identity, authn, and authz. (There is a very short video that explains the entity-identity distinction, which I use as a reference and go back to often).

If I am understanding the upcoming design correctly, it looks to me like we have one (possibly two) type of entity at the application level and three types of entities at the cluster level.

At the application level for the tenant entity, the plan seems to be to use a short-lived access token as its identity.

One subtle aspect that I realized around the design of tenant identity token is the following.

While it is possible for the token to contain just one tenant, thereby creating a one-to-one mapping between a tenant entity and its identity, the design also allows for creating an identity that maps to multiple tenant entities.

If the intention here is to provide some form of role entity, which can access multiple tenants, that is okay.

We will probably have to clarify this distinction in the documentation, because an application developer should not be asking for or giving out a role identity token when their use-case requires a tenant identity token.

When working with a role identity token, we lose information about the entity that assumed this role. So, this information will need to be captured at the token issuance service.

If possible, I also feel that making the token data structure a dictionary instead of a list would give us more flexibility for future changes.
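
Purely as an illustration of what I mean (the field names are hypothetical):

```python
# Token body as a bare list: hard to extend without breaking existing parsers.
token_body_v1 = ["tenant-a", "tenant-b"]

# Token body as a dictionary: new fields (issuer, role, version, ...) can be
# added later without changing the meaning of the existing ones.
token_body_v2 = {
    "tenants": ["tenant-a", "tenant-b"],
}
```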

At the cluster level it seems to me there are 3 types of entities.

  1. Unprivileged FDB client
  2. Privileged Server Process
  3. Privileged Operator/Orchestration Service

The identity attribute for the "Unprivileged FDB client" entity would be a TLS cert.

The identity attributes for the "Privileged Server Process" entity would be a TLS cert + being in "Client IP in Allow List".

The identity attributes for the "Privileged Operator/Orchestration Service" entity would also be a TLS cert + being in "Client IP in Allow List".

As you can see from the above, because both "Privileged Server Process" and "Privileged Operator/Orchestration Service" use the same identity attributes, there is no proper way to distinguish between them.

Could you please let me know your thoughts on introducing something like a "Privileged Operator/Orchestration Service IP List"? This would allow us to have distinct identity attributes for "Privileged Operator/Orchestration Service" entities.

Once we have this list (which hopefully can also be maintained transactionally), then only "Privileged Operator/Orchestration Service" entities would be allowed to make changes to the "Privileged Operator/Orchestration Service IP List" and the "Client IP in Allow List".

This should allow us to bootstrap a cluster with an FDB client in the "Privileged Operator/Orchestration Service IP List", and, for day-2 operations, have a very tightly controlled set of machines in the "Privileged Operator/Orchestration Service IP List".