Date: 2021-04-23

Guest author Anton Malinovskiy, principal software engineer at Ocado, discusses the architecture of Ocado's metrics and monitoring system and the role that New Relic One and Prometheus play in conjunction with Micrometer.

Ocado Group is a British online grocery technology company. At Ocado Technology, the tech specialist wing, we design and build a huge amount of cutting-edge automation technology in-house. We use machine learning for demand forecasting, producing around 20 million forecasts a day, and use custom robots to pick items in our warehouses. If you haven't seen our robots in action, check out our YouTube channel. While technology is at the heart of what we do today, Ocado Group also owns 50% of Ocado.com, a United Kingdom retailer run as a joint venture with M&S that delivers over 330,000 orders per week.

The Ocado business model has no physical stores; instead, we use huge, dedicated warehouses that we believe to be the largest and most sophisticated of their kind in the world. Since 2018 we have made the Ocado Smart Platform available to eight other retailers around the world, including Morrisons in the United Kingdom, Groupe Casino in France, Sobeys in Canada, Kroger in the United States, Coles in Australia, and Aeon in Japan.

Most of our software is JVM-based, including the software we use to control our robot swarms and grid. To monitor this software, we use both Prometheus (especially for on-premises installations) and New Relic One.

In addition, since much of our infrastructure runs in the public cloud on AWS, we use Amazon CloudWatch both for metric storage and for monitoring anything AWS-specific, such as our Lambda functions and auto-scaling groups.

In this blog post, based on my Nerd Days presentation, I'll describe the architecture of Ocado's metrics and monitoring system, the role that New Relic One and Prometheus play, and why and how we use Micrometer.

We use Micrometer as part of our in-house metrics and monitoring system, which we refer to as Flux. Flux provides complex, business-oriented, aggregated metrics such as 'order abandonment rate' and 'the number of order calculations started but not completed within a given timeframe.'
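
To make the second of those metrics concrete, here is a minimal Java sketch of one way to count calculations that started but did not complete within a timeframe. It is not Flux code: the class name, the `countStalled` method, and the choice of maps keyed by order ID are all assumptions made for illustration, and it uses only the JDK rather than Micrometer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Flux code. All names here are illustrative.
final class StalledCalculations {

    /**
     * Counts calculations whose deadline (start + timeout) has passed at `now`
     * without a completion arriving on or before that deadline.
     */
    static long countStalled(Map<String, Instant> startedAt,
                             Map<String, Instant> completedAt,
                             Duration timeout,
                             Instant now) {
        long stalled = 0;
        for (Map.Entry<String, Instant> entry : startedAt.entrySet()) {
            Instant deadline = entry.getValue().plus(timeout);
            Instant completed = completedAt.get(entry.getKey());
            boolean finishedInTime = completed != null && !completed.isAfter(deadline);
            // Calculations still inside their window are not counted yet.
            if (!finishedInTime && now.isAfter(deadline)) {
                stalled++;
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2021-04-23T10:00:00Z");

        Map<String, Instant> startedAt = new HashMap<>();
        startedAt.put("order-A", t0);                  // completes in time
        startedAt.put("order-B", t0.plusSeconds(10));  // completes late
        startedAt.put("order-C", t0.plusSeconds(50));  // still inside its window
        startedAt.put("order-D", t0);                  // never completes

        Map<String, Instant> completedAt = new HashMap<>();
        completedAt.put("order-A", t0.plusSeconds(5));
        completedAt.put("order-B", t0.plusSeconds(45));

        long stalled = countStalled(
                startedAt, completedAt, Duration.ofSeconds(30), t0.plusSeconds(60));
        System.out.println("stalled calculations = " + stalled); // prints 2
    }
}
```

In this sketch a late completion still counts as stalled, because the metric is about completion within the timeframe rather than completion at all. Whether the real metric treats late completions that way is not stated here.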

Architecturally speaking, at a very high level, Flux works by having our apps send their business events through Amazon Kinesis. We ingest these events, filter them, and send the filtered events to a single Kinesis stream, where a custom event stream processor performs the necessary aggregation and calculates the metrics. These calculated metrics are then sent to CloudWatch for storage. Any alerts that are generated are sent both to PagerDuty and out via email.
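
The filter-then-aggregate shape of that pipeline can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the event types, the class and method names, and the definition of abandonment rate as abandoned orders divided by started orders are all hypothetical, and the Kinesis and CloudWatch steps are replaced by an in-memory list and a printed value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of the filter-then-aggregate shape; not Flux code.
final class AbandonmentRateSketch {

    // Illustrative event types; the real event schema is not described in this post.
    enum EventType { ORDER_STARTED, ORDER_ABANDONED, ORDER_COMPLETED, PAGE_VIEWED }

    /** Filter step: keep only the events this metric needs. */
    static List<EventType> filterRelevant(List<EventType> events) {
        return events.stream()
                .filter(e -> e == EventType.ORDER_STARTED || e == EventType.ORDER_ABANDONED)
                .collect(Collectors.toList());
    }

    /** Aggregation step: abandoned orders divided by started orders (assumed definition). */
    static double abandonmentRate(List<EventType> filtered) {
        Map<EventType, Long> counts = filtered.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        long started = counts.getOrDefault(EventType.ORDER_STARTED, 0L);
        long abandoned = counts.getOrDefault(EventType.ORDER_ABANDONED, 0L);
        // Avoid dividing by zero when no orders were started in the window.
        return started == 0 ? 0.0 : (double) abandoned / started;
    }

    public static void main(String[] args) {
        // Stand-in for events read from a stream.
        List<EventType> raw = Arrays.asList(
                EventType.ORDER_STARTED, EventType.PAGE_VIEWED, EventType.ORDER_STARTED,
                EventType.ORDER_ABANDONED, EventType.ORDER_STARTED, EventType.ORDER_COMPLETED,
                EventType.ORDER_STARTED);

        double rate = abandonmentRate(filterRelevant(raw));
        // Stand-in for publishing the calculated metric to storage.
        System.out.println("order abandonment rate = " + rate); // prints 0.25
    }
}
```

Keeping the filter separate from the aggregation mirrors the description above, where filtered events land on one stream before a single processor computes the metrics.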