Date: 2021-10-05

Tacking into the wind with Spinnaker performance profiling

We took copies of the Redis cache used by Spinnaker's authorisation service, and also attached JVM profilers to the running instances in production, to understand what was happening under the hood.

We discovered that EG's AD domain contains tens of thousands of security groups and distribution lists in total. Some users belong to hundreds of groups. Through SAML assertions, Spinnaker had learned of around ten thousand of them - so far.

We also learned that role sync times increase with the number of unique username-role-service account combinations, and that every pipeline causes a new, unique service account to be created. In our case, every user is a member of a dedicated group, and almost every pipeline uses the role in its pipeline trigger(s). This causes an increasingly large amount of data to be read from and written back to Redis during role sync.
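To make the scaling concrete, here is a minimal sketch in Python of how the number of user/service-account combinations grows. It assumes, purely for illustration, that a user may act as a service account when the user holds every role the account requires; the function name, the data, and that rule are assumptions rather than Spinnaker's actual implementation.

```python
def count_user_service_account_pairs(user_roles, service_account_roles):
    """Count (user, service account) pairs the sync would have to track.

    Assumed rule: a user is paired with a service account when the user
    holds every role that the service account requires.
    """
    pairs = 0
    for roles in user_roles.values():
        for required in service_account_roles.values():
            if required <= roles:  # subset test on frozensets
                pairs += 1
    return pairs


# Hypothetical data: three users who all belong to one shared group.
user_roles = {f"user{i}": frozenset({"everyone"}) for i in range(3)}

# One service account per pipeline, each requiring that shared group.
per_pipeline = {f"pipeline-{i}": frozenset({"everyone"}) for i in range(4)}

# A single account covering the same role set.
shared = {"shared-everyone": frozenset({"everyone"})}

per_pipeline_pairs = count_user_service_account_pairs(user_roles, per_pipeline)
shared_pairs = count_user_service_account_pairs(user_roles, shared)
```

With these toy numbers the per-pipeline layout yields users times pipelines pairs, while a single account for the same role set yields one pair per user.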

So far, Spinnaker had seen less than 15% of the groups in the AD domain, so we predicted that performance would not plateau at an acceptable level without intervention. The team went back and forth on options for reducing the amount of data in Redis and the time Spinnaker's authorisation service took to process that data.

We discovered that we were not the only organisation encountering problems, and that one organisation had contributed a PR upstream to allow us to configure more precisely when a role sync happened. Running this patch allowed us to successfully perform application migrations without incurring a full sync. It did not, however, improve the fundamental data-scale issue causing our problems.

Rebuilding the boat while it's sailing

We spent some time refactoring the Redis implementation with whatever performance improvements we could glean. We raised a total of three PRs (so far) with performance improvements (1, 2, 3), but none of them improved performance to the point where we were "out of the woods".

To reduce the amount of data the authorisation service has to deal with during role syncs and real-time authorisation calls, we developed a new Spinnaker feature. Ordinarily, when managed service accounts have been enabled, Spinnaker creates a unique service account for every single pipeline. Instead, we modified the system to create one service account per unique combination of roles in a pipeline trigger. We called the feature "shared managed service accounts", and rolled it out in mid-February 2021. It cut the number of managed service accounts from 4,000 down to 17. After the rollout, real-time authorisation call performance improved dramatically: calls that had taken more than one second dropped to a much healthier 20-30ms. This resolved the random, intermittent permissions errors users had been seeing, and the feature has been contributed upstream (1, 2).
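The idea of one service account per unique combination of roles can be sketched roughly as follows. This is an illustration under stated assumptions, not the upstream code: the hash-based naming scheme, the account suffix, and both function names are hypothetical.

```python
import hashlib


def shared_account_name(roles):
    """Derive a deterministic account name from a set of roles.

    Sorting first means the same roles in any order map to the same name.
    The naming scheme itself is hypothetical.
    """
    key = ",".join(sorted(set(roles)))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return f"shared-{digest}@managed-service-account"


def assign_shared_accounts(pipeline_trigger_roles):
    """Map each pipeline to the shared account for its trigger roles.

    Returns (assignment, accounts): pipeline -> account name, and
    account name -> frozenset of roles.
    """
    assignment = {}
    accounts = {}
    for pipeline, roles in pipeline_trigger_roles.items():
        name = shared_account_name(roles)
        accounts[name] = frozenset(roles)
        assignment[pipeline] = name
    return assignment, accounts


# Hypothetical pipelines: two share a role set, one differs.
pipelines = {
    "deploy-a": ["team-x", "dev"],
    "deploy-b": ["dev", "team-x"],
    "deploy-c": ["team-y"],
}
assignment, accounts = assign_shared_accounts(pipelines)
```

Here three pipelines end up with two accounts, because "deploy-a" and "deploy-b" list the same roles in a different order. The number of accounts is bounded by the number of distinct role sets rather than by the number of pipelines.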

We also split the authorisation service's Redis cache out of the shared Spinnaker ElastiCache cluster into its own dedicated cluster. This instantly and impressively improved role sync times, which dropped from more than five minutes to less than one second. However, role sync times quickly (and linearly) increased again as users logged back in and the new Redis cluster once more "learned" about their groups. Nevertheless, this did give us a "break glass in emergency" option. If necessary, we could simply flush or re-create the Redis cluster to bring things back under control.

Subsequent to these changes, we saw improved performance during migrations and during user interactions with the UI, and we recommenced migrations on 22 February, 2021.

Land ahoy!

Finally, and most significantly, we decided to move from Redis to an SQL backing store. A number of other Spinnaker microservices have SQL backends available for use in environments with high availability and/or high performance requirements, and we had already adopted those. The only problem with this option was that an SQL backend for the authorisation service hadn't been written yet!

So we wrote a backend and contributed it upstream. However, after a few days on the new backend, we encountered more problems, this time around the performance of inserts into the database. The team rolled back to the Redis backend and investigated. This initial version of the SQL backend effectively used the same document structure as the original Redis backend. Each user record contained large amounts of duplicated role and security account data, resulting in 7.5 million very big records.

To combat this, we employed the classic tool of any relational database developer: normalisation. After normalising the schema, we still have 7.5 million records, but they are 7.5 million very small records. This change (1) has finally closed this chapter on Spinnaker authorisation performance.
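As a rough illustration of the normalisation step, the sketch below uses Python's built-in sqlite3 module with a hypothetical three-table schema. The table names, column names, and helper functions are assumptions for the example, not the schema that was contributed upstream. Each user and each role is stored once, and membership is a small row that links the two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE user_roles (
    user_id INTEGER NOT NULL REFERENCES users(id),
    role_id INTEGER NOT NULL REFERENCES roles(id),
    PRIMARY KEY (user_id, role_id)
);
""")


def add_membership(conn, user, role):
    """Record that a user holds a role, storing each name only once."""
    conn.execute("INSERT OR IGNORE INTO users (name) VALUES (?)", (user,))
    conn.execute("INSERT OR IGNORE INTO roles (name) VALUES (?)", (role,))
    conn.execute(
        """INSERT OR IGNORE INTO user_roles (user_id, role_id)
           SELECT u.id, r.id FROM users u, roles r
           WHERE u.name = ? AND r.name = ?""",
        (user, role),
    )


def roles_for(conn, user):
    """Return the sorted role names held by a user."""
    rows = conn.execute(
        """SELECT r.name FROM roles r
           JOIN user_roles ur ON ur.role_id = r.id
           JOIN users u ON u.id = ur.user_id
           WHERE u.name = ? ORDER BY r.name""",
        (user,),
    ).fetchall()
    return [name for (name,) in rows]


# Hypothetical memberships: the "dev" role is shared but stored once.
for user, role in [("alice", "dev"), ("alice", "ops"), ("bob", "dev")]:
    add_membership(conn, user, role)

role_count = conn.execute("SELECT COUNT(*) FROM roles").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM user_roles").fetchone()[0]
```

The number of link rows still grows with the number of memberships, but each row is just a pair of integer keys instead of a document that repeats role data for every user.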