Dell Technologies : Keeping Our Sites Up and Running with SRE

April 14, 2022

It is a high-tech reality: the more nimble, fast-changing and segmented into microservices IT operations become, the harder it is to keep up with what might go wrong and to fix it with the same agility and speed. Enter Site Reliability Engineering (SRE).

Dell Digital, Dell's IT organization, has developed an SRE initiative that provides IT organization owners with a bird's eye view of their IT ecosystems, constant feedback on anything that goes wrong, and self-healing capabilities to fix it when it does.

By definition, SRE is the practice of applying a software engineering approach to IT operations. SRE engineers create solutions and automation capabilities to make sure our IT platforms and services are reliable, scalable and available to customers when they need them.

Initially, SRE at Dell began with an effort to reduce downtime in our eCommerce environment. Over time, it's expanded to work with a growing number of IT organizations to improve product reliability and increase maintenance efficiency and analytic capabilities.

Creating a bird's eye view

We began our efforts in eCommerce in part because that organization was already cultivating some reliability practices that it called site health. When IT leaders called for a way to reduce downtime across IT environments, eCommerce emerged as a good place to start. We created a small team of SRE engineers to forge a more comprehensive reliability strategy.

One of the first steps towards engineering site reliability was building observability into our products. As part of an SRE pilot, we built a dashboard with monitoring, search and report capabilities that shows our priorities by red, yellow and green status. This end-to-end, bird's eye view of the eCommerce experience covers not just the customer-facing applications but the backend services as well.

Building this overview took time. We chose a third-party data platform tool that presents system health in terms of key performance indicators (KPIs). We worked with product owners to look at services and how they talk to different application components, and we built KPIs for each capability, viewable via a single pane of glass.
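To make the red, yellow and green idea concrete, here is a minimal sketch of how per-capability KPIs could roll up into one status. It is an illustration only, not Dell's implementation: the `Kpi` class, the KPI names and every threshold below are hypothetical, and the worst-status-wins rule is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    YELLOW = 1
    RED = 2


@dataclass(frozen=True)
class Kpi:
    name: str
    value: float        # observed value, e.g. an error rate in percent
    warn_at: float      # at or above this value the KPI turns yellow
    critical_at: float  # at or above this value the KPI turns red

    def status(self) -> Status:
        if self.value >= self.critical_at:
            return Status.RED
        if self.value >= self.warn_at:
            return Status.YELLOW
        return Status.GREEN


def rollup(kpis: list[Kpi]) -> Status:
    """Overall status is the worst status among the KPIs (green if none)."""
    return max((k.status() for k in kpis), default=Status.GREEN)


# Hypothetical KPIs for a checkout capability.
checkout = [
    Kpi("cart_error_rate_pct", 0.2, warn_at=1.0, critical_at=5.0),
    Kpi("pricing_latency_ms", 850.0, warn_at=500.0, critical_at=2000.0),
]
print(rollup(checkout).name)  # YELLOW
```

Taking the worst status keeps the top-level view conservative: one degraded KPI is enough to draw attention to the capability.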

This is a far cry from our old way of working. To understand what I mean, picture a customer shopping for a laptop. They search the site, find the products they need and configure them to see product details, pricing, tax calculation, shipping, etc. All this makes up the experience that we show to our customers on one side. And on the other side is the view of how that's all being done in the backend of the system.

Previously, we were operating in a box for each one of these services. So, Dell Digital product model owners could tell you, "Yes, my service is working fine," but they couldn't tell you that the experience was working fine end to end.

And when a problem arose, we had no way of spotting it in real time. In fact, most often, we would get a call or message from the business saying something was out of whack, say pricing or missing products. Our traditional engineering operations team would then have to check all the various places where the problem could be and would likely hand it off to a second team to fix it. By this time, the problem would probably have resolved itself, but our customers would have had a negative experience.

Using SRE, however, we are now constantly going through our environment and checking these things in real time. We make sure the pricing is right from the front end, to the configuration, to the cart, to the checkout. And if it's not, we notify the product model owner and automatically fix the issue if possible.
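As a rough illustration of the end-to-end check described above, the sketch below compares the price seen at each step of the journey against the front-end price and reports the steps that disagree. The stage names, the `find_price_mismatches` helper and the sample prices are hypothetical; a real check would pull these values from live services.

```python
from decimal import Decimal

# Stages of the purchase journey, in order (names are assumptions).
STAGES = ("front_end", "configuration", "cart", "checkout")


def find_price_mismatches(prices: dict[str, Decimal]) -> list[str]:
    """Return the stages whose price is missing or differs from the front end."""
    reference = prices[STAGES[0]]
    return [stage for stage in STAGES[1:] if prices.get(stage) != reference]


# Hypothetical prices observed for one laptop configuration.
observed = {
    "front_end": Decimal("999.00"),
    "configuration": Decimal("999.00"),
    "cart": Decimal("1049.00"),
    "checkout": Decimal("999.00"),
}
print(find_price_mismatches(observed))  # ['cart']
```

`Decimal` is used rather than `float` so that currency values compare exactly.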

After our successful pilot, Dell Digital decided to create a center of excellence around our core SRE products and practices in order to expand these capabilities to multiple IT organizations.

Making it happen with data and SRE input

Besides building an overview of product operations, SRE requires gathering accurate, actionable data on the product and then tapping the right expertise to create an orchestration process that automates monitoring of, and response to, operational issues across our products at scale.

Basically, SRE needs data plus insights from subject matter experts (SMEs) who know what the data is supposed to be and can help us determine how to respond to problems. Based on that, our SRE engineers write orchestration to put together alerts and automated fixes where possible. The cycle of success for SRE is observability to orchestration to automation/auto-heal.
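The observability to orchestration to auto-heal cycle can be sketched as a small loop: probe a service, attempt an automated fix if one exists, probe again, and notify the owner either way. This is a simplified sketch under stated assumptions, not the orchestration Dell runs; `Check`, `run_check`, the `notify` callback and the pricing-cache example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]                       # returns True when healthy
    remediate: Optional[Callable[[], None]] = None  # automated fix, if any
    owner: str = "unknown"


def run_check(check: Check, notify: Callable[[str, str], None]) -> str:
    """Observe, try to auto-heal, and tell the owner what happened."""
    if check.probe():
        return "healthy"
    if check.remediate is not None:
        check.remediate()
        if check.probe():  # re-probe to confirm the fix worked
            notify(check.owner, f"{check.name}: failed, auto-healed")
            return "auto-healed"
    notify(check.owner, f"{check.name}: failed, needs attention")
    return "escalated"


# Hypothetical service state standing in for a real system.
state = {"pricing_cache_fresh": False}


def probe() -> bool:
    return state["pricing_cache_fresh"]


def refresh_cache() -> None:
    state["pricing_cache_fresh"] = True


sent: list[tuple[str, str]] = []
check = Check("pricing-cache", probe, refresh_cache, owner="pricing-team")
result = run_check(check, lambda owner, msg: sent.append((owner, msg)))
print(result)  # auto-healed
```

Re-probing after the fix matters: the owner is told the issue was healed only once the check actually passes again.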

If something is not working, we need the SMEs to help us to understand the significance of logs and systems information and display the appropriate warning on our dashboard. For instance, a certain payment option might be down, but if it is one that is seldom used by our customers, we may just issue a notice to the product owner rather than showing a yellow or red status flagging a major incident. Trouble in a main payment system will immediately trigger an early warning at a critical level.
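One way to encode this kind of SME judgment is to weight an outage by how much customers actually use the affected option. The sketch below is only an illustration: the `classify_outage` function and its two thresholds are hypothetical, and in practice the cutoffs would come from the subject matter experts.

```python
def classify_outage(usage_share: float,
                    notice_below: float = 0.02,
                    critical_at: float = 0.25) -> str:
    """Map the share of customers using a failed option to an alert level.

    usage_share is a fraction between 0 and 1. Both thresholds are
    illustrative assumptions, not values from the source.
    """
    if not 0.0 <= usage_share <= 1.0:
        raise ValueError("usage_share must be between 0 and 1")
    if usage_share >= critical_at:
        return "red"     # main payment path: critical early warning
    if usage_share < notice_below:
        return "notice"  # seldom used: notify the product owner only
    return "yellow"


print(classify_outage(0.005))  # notice
print(classify_outage(0.60))   # red
```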

With SRE automation, we don't just notify product teams that something is wrong. We also give them the pertinent information, including which customers are impacted, along with a channel for communicating to resolve the issue.

As we collect comprehensive data to build SRE capabilities for a given IT organization, we strive to supply a working model early in the process. Our SRE designer first consults SMEs to build a storyboard. We fill in data where we have it and work to add more going forward. In the meantime, however, the dashboard can benefit the product teams right away and expand in scale as we get more data.

Providing product teams with real-time insights

In addition to observability and orchestration, our SRE products also include data analytics, mobile access to SRE capabilities, the ability to track specific customer interactions and a two-way chat feature that we recently launched called SRE Assistant.

For data analytics, we make SRE data available to product owners, business users and operations teams upon request. They no longer need to navigate multiple systems to obtain a specific metric.

For example, if I want to know what my CSAT is today, I can immediately obtain it. And I can get customer feedback in detail as well as access to a play-by-play view of what a particular customer did in a specific interaction session.

We have also built a maturity model to measure where teams should focus from an SRE perspective. The teams can build goals around specific success measurements and mature in both the product and practice of SRE.

We are currently working with three additional IT organizations to mature the capabilities of their SRE teams and are looking to bring in up to two more this year.

While it is too early in the process to tell how much SRE will increase reliability for these operations, we achieved significant improvement for eCommerce in our pilot program. In fact, for our most recent Black Friday sales event, eCommerce achieved the unprecedented milestone of 100 percent availability.

SRE isn't for every organization. Some teams are too small or specialized to use this methodology. But for many organizations, the ability to see their operations in a single pane of glass, detect and fix problems in real time and head off future problems holds tremendous promise in achieving a fundamental goal: keeping their products and services up and running for the customers who rely on them.

Keep up with our Dell Digital strategies and more at Dell Technologies: Our Digital Transformation.

Disclaimer

Dell Technologies Inc. published this content on 15 April 2022 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 18 April 2022 15:23:01 UTC.
